Web Scraping

Learn how to use web scraping and custom extraction for advanced analysis.

Web Scraping & Custom Extraction

Let’s see how to use Screaming Frog for web scraping with the Custom Extraction (Advanced Search) feature.

This feature lets you retrieve any data from the HTML of a web page using CSS Path, XPath, or regex.

Extraction is performed on the static HTML of URLs crawled by the SEO Spider that respond with a 200 ‘OK’ status code.

To extract from the rendered HTML instead, you can enable the “Javascript Rendered” mode.

  • 1. Configuration of Custom Extraction

To set up a custom extraction, go to Configuration > Custom > Extraction.

Here you can configure up to 100 custom data extractions.

CSS, XPath and Regex Instructions

  • 2. Select the CSS, XPath or Regex path to be used for scraping

The SEO Spider offers three methods for scraping data from websites:

  • XPath: selects nodes from the document using XPath selectors, including attributes.
  • CSS Path: selects elements using CSS Path selectors. This is the fastest of the three methods.
  • Regex: queries the source with regular expressions and is recommended for advanced uses such as scraping HTML comments or inline JavaScript.
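To make the difference between these methods concrete, here is a minimal Python sketch. It is only an analogy and not how the SEO Spider works internally. An XPath-style path selects an element by its position and attributes, while a regex pulls a value out of an HTML comment, the kind of content the Regex option is recommended for. The sketch uses only the standard library, whose `xml.etree.ElementTree` supports a limited XPath subset and needs well-formed markup. The page fragment, the class name and the comment text are all made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """<html><body>
<!-- build: 2024-01-15 -->
<div class="author-details-social"><a href="/author/jane">Jane Doe</a></div>
</body></html>"""

# XPath-style selection: address the element by structure and attribute.
root = ET.fromstring(PAGE)
author_link = root.find(".//div[@class='author-details-social']/a")
author_name = author_link.text          # text content of the <a> element
author_url = author_link.get("href")    # value of its href attribute

# Regex selection: match raw source text, here the content of an HTML comment.
comment_match = re.search(r"<!--\s*build:\s*(.*?)\s*-->", PAGE)
build_date = comment_match.group(1) if comment_match else None

print(author_name, author_url, build_date)
```

The regex works on the raw source as plain text, so it can reach content that is not an element, but it is also more fragile than a selector when the markup changes.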

If you choose XPath or CSS Path to query the HTML, you can pick one of several SEO Spider extraction options:

  • Extract HTML Element: collects the selected element together with all of its inner HTML content.
  • Extract Inner HTML: collects only the inner HTML content of the selected element. If the selected element contains other HTML elements, these are included as well.
  • Extract Text: collects the text content of the selected element and of its sub-elements.
  • Function Value: returns the result of the supplied XPath function. For example, to find out how many h3 elements are on a page you can use “count(//h3)”.
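The four options above can be illustrated with a short Python sketch on a made-up page fragment. This is an approximation under stated assumptions, not the SEO Spider's implementation: the helper names (`outer_html`, `inner_html`, `text_content`) are hypothetical, and because the standard library's `xml.etree.ElementTree` does not support XPath functions, `len(root.findall(".//h3"))` stands in for `count(//h3)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """<html><body>
<h3>One</h3>
<div class="post"><p>Hello <b>world</b></p></div>
<h3>Two</h3>
</body></html>"""

root = ET.fromstring(PAGE)
element = root.find(".//div[@class='post']")


def outer_html(el):
    """Roughly 'Extract HTML Element': the element plus everything inside it."""
    return ET.tostring(el, encoding="unicode", method="html").strip()


def inner_html(el):
    """Roughly 'Extract Inner HTML': only what sits between the element's tags."""
    children = "".join(
        ET.tostring(child, encoding="unicode", method="html") for child in el
    )
    return ((el.text or "") + children).strip()


def text_content(el):
    """Roughly 'Extract Text': text of the element and all of its sub-elements."""
    return "".join(el.itertext()).strip()


# Stand-in for the XPath function count(//h3).
h3_count = len(root.findall(".//h3"))

print(outer_html(element))
print(inner_html(element))
print(text_content(element))
print(h3_count)
```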

Syntax entry

  • 3. Enter your syntax

Once you have chosen the scraping mode, all that remains is to define the extraction syntax. To find the relevant CSS Path or XPath, open the web page in Chrome, right-click the desired element and choose ‘Inspect’, then right-click the highlighted HTML line and copy the selector or XPath that Chrome provides.

Example:
Let us examine the Screaming Frog blog.

Open any blog post in Chrome, right-click the author’s name and choose “Inspect”.

Right-click the relevant HTML line (the one containing the author’s name), copy the CSS Path or XPath and paste it into the corresponding SEO Spider field.

If the syntax you entered is valid (for example, .author-details-social>a), a green check mark appears next to your input. Otherwise a red cross warns you that the syntax is not considered correct.

Once this is done, click the “OK” button and start the crawl.

To learn more about CSS selectors and XPath, I recommend the w3schools tutorials.

Scan the website

With the syntax entered and validated, all you have to do is crawl the website to start scraping.

View scraping data in the “Custom Extraction” tab

The scraped data is available in real time during the crawl, in the ‘Custom Extraction’ tab and in the “Internal” tab.

In our example, a full crawl of a website was started, but if you want to scrape a specific list of URLs you can use the “List” mode.

The fields of application are endless and depend on the type of analysis being performed. This feature can be very useful, for example, to collect Analytics or GTM IDs, social meta tags, hreflang attribute values, and product prices or discounted prices on an ecommerce site.
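As a rough illustration of some of these use cases, the Python sketch below extracts a GTM container ID with a regex and reads hreflang values and a social meta tag with XPath-style paths. Everything in it is an assumption made for the example: the `<head>` fragment, the container ID `GTM-ABC1234`, the URLs, and the regex pattern. The patterns you enter in the SEO Spider will depend on the markup of the site you crawl.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed <head> fragment used only for illustration.
HEAD = """<head>
<link rel="alternate" hreflang="en" href="https://example.com/en/" />
<link rel="alternate" hreflang="it" href="https://example.com/it/" />
<meta property="og:title" content="Example product" />
<script>(function(w,d,s,l,i){})(window,document,'script','dataLayer','GTM-ABC1234');</script>
</head>"""

# Regex: a quoted GTM container ID inside inline JavaScript (assumed ID format).
gtm_match = re.search(r"""["'](GTM-[A-Z0-9]+)["']""", HEAD)
gtm_id = gtm_match.group(1) if gtm_match else None

root = ET.fromstring(HEAD)

# XPath-style: hreflang attribute of every alternate link.
hreflang_values = [
    link.get("hreflang") for link in root.findall(".//link[@rel='alternate']")
]

# XPath-style: content attribute of one social meta tag.
og_title_tag = root.find(".//meta[@property='og:title']")
og_title = og_title_tag.get("content") if og_title_tag is not None else None

print(gtm_id, hreflang_values, og_title)
```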

Related Tab: Custom Extraction
