
Fetch url content

NLP, text

Fetch the main text from a web URL, and return its title, author, content, excerpt and domain.

Usage


The following are the step's expected inputs and outputs and their specific types.

Step signature
fetch_url_content(urls: url) -> (
    title: text,
    author: category,
    content: text,
    excerpt: text,
    domain: url
)

The step takes a single column of URLs and writes five output columns. It accepts no {"param": value} configuration object.

Example

With nothing to configure, a call only names the input and output columns:

Example call (in recipe editor)
fetch_url_content(ds.article_url) -> (
  ds.article_title,
  ds.article_author,
  ds.article_content,
  ds.article_excerpt,
  ds.article_domain
)

Inputs


urls: column:url

A column of URLs linking to articles, blog posts or webpages.

Outputs


title: column:text

A text column containing the extracted article's title.


author: column:category

A categorical column containing the extracted article's author.


content: column:text

A text column containing the extracted article's main text.


excerpt: column:text

A text column containing a summary of the extracted article.


domain: column:url

A column containing only the domain of each original URL.
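
To make the output columns more concrete, here is a rough Python sketch of how title, author, excerpt and domain could be derived for one page. It is not the step's actual implementation. The names `MetaParser` and `describe_page` are hypothetical, and reading the author and excerpt from `<meta name="author">` and `<meta name="description">` is an assumption about where such values might come from. The sketch also assumes that "domain" means the URL's hostname; the step may normalize it differently. Extracting the main content requires heuristics that are left out here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class MetaParser(HTMLParser):
    """Collects the <title> text and the author/description <meta> values."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") in ("author", "description"):
            self.meta[attrs["name"]] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title>...</title> is kept.
        if self._in_title:
            self.title += data


def describe_page(url, html):
    """Return a dict shaped like one output row (hypothetical helper)."""
    parser = MetaParser()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "author": parser.meta.get("author", ""),
        "excerpt": parser.meta.get("description", ""),
        # Assumption: the domain is the hostname part of the URL.
        "domain": urlparse(url).hostname or "",
    }
```

Calling `describe_page` with a URL and that page's already downloaded HTML returns one dictionary per page, with an empty string for any value that cannot be found. Downloading the page is deliberately kept out of the sketch.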