
Extract URL components

Extract components from a URL.

Given the URL http://www.cwi.nl:80/%7Eguido/Python.html;a=2;b=3?c=4,2&d=e#anchor, the step extracts the following components:

  • scheme: URL scheme specifier (http)
  • domain: Network location part (www.cwi.nl:80)
  • path: Hierarchical path (/%7Eguido/Python.html)
  • params: Parameters for last path element (a=2;b=3)
  • query: Query component (c=4,2&d=e)
  • fragment: Fragment identifier (anchor)

For more information about these components, see the description in Python's urllib.parse documentation.
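As an illustration only, the same decomposition can be reproduced with Python's standard urllib.parse.urlparse. This is a sketch of how the six components map onto the example URL, not necessarily how the step is implemented internally; note that urlparse calls the network location netloc, which corresponds to the domain output here.

```python
from urllib.parse import urlparse

url = "http://www.cwi.nl:80/%7Eguido/Python.html;a=2;b=3?c=4,2&d=e#anchor"
parts = urlparse(url)

print(parts.scheme)    # http
print(parts.netloc)    # www.cwi.nl:80  (the "domain" output of this step)
print(parts.path)      # /%7Eguido/Python.html
print(parts.params)    # a=2;b=3
print(parts.query)     # c=4,2&d=e
print(parts.fragment)  # anchor
```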

Example

Use http as the default scheme for URLs that don't specify one.

extract_url_components(ds.urls, {
  "default_scheme": "http",
}) -> (
  ds.scheme,
  ds.domain,
  ds.path,
  ds.params,
  ds.query,
  ds.fragment
)

Usage

The following are the step's expected inputs and outputs and their specific types.

extract_url_components(urls: url, {"param": value}) -> (
    scheme: category,
    domain: category,
    path: category,
    params: category,
    query: category,
    fragment: category
)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.

Inputs


urls: column:url

The list of URLs you wish to decompose.

Outputs


scheme: column:category

URL scheme specifier (http).


domain: column:category

Network location part (www.cwi.nl:80).


path: column:category

Hierarchical path (/%7Eguido/Python.html).


params: column:category

Parameters for last path element (a=2;b=3).


query: column:category

Query component (c=4,2&d=e).


fragment: column:category

Fragment identifier, i.e. the part after the # character (anchor).

Parameters


default_scheme: string | null = "http"

Default URL scheme. The scheme (http, https, ...) to prepend to URLs that don't already have one. Use null if no scheme should be added.
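To see why a default scheme matters, the sketch below uses Python's urllib.parse.urlparse: without a scheme, the host ends up in the path rather than in the network location. The helper with_default_scheme is hypothetical and only approximates the behaviour described above, on the assumption that a URL containing "://" already has a scheme; the step's actual logic may differ.

```python
from urllib.parse import urlparse


def with_default_scheme(url, default_scheme="http"):
    """Hypothetical helper: prepend a scheme when the URL has none.

    Assumes a URL containing "://" already carries a scheme.
    Passing None mirrors default_scheme = null: nothing is added.
    """
    if default_scheme is None or "://" in url:
        return url
    return f"{default_scheme}://{url}"


bare = "www.cwi.nl/index.html"

# Without a scheme, urlparse treats the whole string as a path.
no_scheme = urlparse(with_default_scheme(bare, None))
print(repr(no_scheme.netloc), no_scheme.path)

# With the default scheme prepended, the host is recognised.
fixed = urlparse(with_default_scheme(bare, "http"))
print(fixed.scheme, fixed.netloc, fixed.path)
```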