Extract URL components¶
Extract components from a URL.
Take the following URL as an example:
http://www.cwi.nl:80/%7Eguido/Python.html;a=2;b=3?c=4,2&d=e#anchor
Its components are:
scheme
: URL scheme specifier (http)
domain
: Network location part (www.cwi.nl:80)
path
: Hierarchical path (/%7Eguido/Python.html)
params
: Parameters for last path element (a=2;b=3)
query
: Query component (c=4,2&d=e)
fragment
: Fragment identifier (anchor)
For more information about these components, see the description of urlparse in the documentation of Python's urllib.parse module.
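The breakdown above can be reproduced with Python's standard library. The following is a minimal sketch, assuming the step's outputs correspond to the fields returned by `urllib.parse.urlparse`; the mapping of `domain` to `netloc` is inferred from the "Network location part" description, and whether the step itself calls `urlparse` internally is not confirmed here.

```python
from urllib.parse import urlparse

url = "http://www.cwi.nl:80/%7Eguido/Python.html;a=2;b=3?c=4,2&d=e#anchor"
parts = urlparse(url)

# Map urlparse's field names onto the output names used by this step.
# Assumption: "domain" corresponds to urlparse's "netloc" (host plus port).
components = {
    "scheme": parts.scheme,
    "domain": parts.netloc,
    "path": parts.path,
    "params": parts.params,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name}: {value}")
```

Note that the path is left percent-encoded (`%7E` is not decoded to `~`), which matches the values listed above.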
Usage¶
The following are the step's expected inputs and outputs and their specific types.
extract_url_components(urls: url, {"param": value}) -> (
scheme: category,
domain: category,
path: category,
params: category,
query: category,
fragment: category
)
where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.
Example¶
Use http as the default scheme.
extract_url_components(ds.urls, {
"default_scheme": "http",
}) -> (
ds.scheme,
ds.domain,
ds.path,
ds.params,
ds.query,
ds.fragment
)
Inputs¶
urls: column:url
The column of URLs you wish to decompose.
Outputs¶
scheme: column:category
URL scheme specifier (http).
domain: column:category
Network location part (www.cwi.nl:80).
path: column:category
Hierarchical path (/%7Eguido/Python.html).
params: column:category
Parameters for last path element (a=2;b=3).
query: column:category
Query component (c=4,2&d=e).
fragment: column:category
Fragment identifier, i.e. the part after the hash sign (anchor).
Parameters¶
default_scheme: string | null = "http"
Default URL scheme. Use this to add a scheme prefix (http, https, ...) to URLs that don't have one. If you don't want any scheme to be added, use null instead.
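To illustrate why a default scheme matters, here is a rough sketch in Python. The helper `with_default_scheme` is hypothetical and is not the step's actual implementation; it only assumes that "adding a scheme" means prefixing `<scheme>://` to URLs that lack one. Without a scheme, `urllib.parse.urlparse` treats the host as part of the path, so the domain would come out empty.

```python
import re
from urllib.parse import urlparse

# Matches a leading "<scheme>://", e.g. "http://" or "https://".
_HAS_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")

def with_default_scheme(url, default_scheme="http"):
    """Hypothetical helper: prefix default_scheme if url has no scheme.

    Passing None (the counterpart of null) leaves the URL untouched.
    """
    if default_scheme is None or _HAS_SCHEME.match(url):
        return url
    return f"{default_scheme}://{url}"

bare = "www.cwi.nl/index.html"

# Without a scheme, the host ends up in the path and netloc is empty.
unprefixed = urlparse(with_default_scheme(bare, None))

# With the default scheme, the host is recognised as the network location.
prefixed = urlparse(with_default_scheme(bare, "http"))

print(repr(unprefixed.netloc), repr(unprefixed.path))
print(repr(prefixed.netloc), repr(prefixed.path))
```

URLs that already carry a scheme, such as `https://example.com`, pass through this helper unchanged.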