Custom Extraction

Discover custom Extraction to collect custom data from html code.

Custom Data Extraction

Screaming Frog is a very versatile Seo Tool that not only returns predefined data and metrics but also allows for customizable advanced analytics. One of these is “Custom Extraction,” a very powerful feature that allows you to collect all data from the html of any web page (text-only mode) or data rendered with the “Javascript Rendering Mode” scan mode.

Seo Spider provides 3 modes of data extraction with “Custom Extraction.”

  1. XPath: XPath selectors;
  2. CSS Path;
  3. Regex: for more advanced data extraction (deepening: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html)

If you decide to take advantage of the first two solutions “XPath” and “CSS Path” you can choose which elements to extrapolate through:

  • Extract Html elements: allows collecting data from the selected elements and their inner content (“Inner Html Content”).
  • Extract Inner HTML: allows you to extract the inner content of a selected element. In case the Html element contains other Html elements, the sub-elements will also be available to you.
  • Extract Text: collects the textual content of the selected element and the textual content of its subelements.
  • Function Value: returns the result of the provided function. For example, if you use a function such as “count(//h1)” you get the number of <h1> on the page.

Once “Custom Extraction” is selected you simply click on “Add” and enter your data extraction instructions.

You can also decide to click on “Crawl Config” and choose from the tabs in the first column of the popup proposed by the Seo Spider.

The data obtained are available in the “Custom Extraction” tab and in the “Internal” tab in a dedicated column.

Now that you know how to set up “Custom Extraction” and understand the potential, let’s look at some application examples that I think you might find useful in your forays into technical seo.

Extraction with X-Path

Below are some examples of extrapolations using XPath.

Headings: by default the Seo Spider collects only the main headings of the page ( H1 and H2) but for a more specific and complete analysis you may need to find specific information about the other “headings” on the page as well.

Discover and collect the different types (“Types”) in structured data.

Please note: For structured data validation, you do not need to use a “Custom Extraction” but can view the data in the “Structured Data” tab.

Collecting Social Media Tags, Open Graph Tags and Twitter cards:

Extracting email addresses and/or phone numbers from a website

Extract particular frames from the website such as Google Tag Manager, Youtube video.

Extracting content from specific Divs or Span by providing the class (to be substituted in place of “example”)

This example collects the titles and number of blog post comments (you will need to adjust your website-specific classes to work).

Extraction with Regex

Regex rules are a very powerful tool for collecting data with Screaming Frog; let’s look at some application examples you can use immediately in your next Seo Audit:

Google Analytics and Google Tag Manager ID extraction:

Extraction of Structured Data.

To extract structured data we have seen that you can also use XPath mode but if these have formatting in JSON-LD it is advisable to use RegEX syntax:

Custom Video Extraction

Seo Spider Tab