Scanning large sites

How to crawl large sites using 'Database Storage' mode, along with a few practical tricks.

Database Storage Mode

To crawl a site with millions of URLs, the “Memory Storage” mode, which holds crawl data in RAM, is not the recommended choice, so the first step I recommend is switching to “Database Storage.”

To get maximum performance from the SEO Spider in this mode, you should use an internal SSD, or an external SSD if your system supports UASP mode.

Memory Allocation

Having chosen the “Database Storage” mode, the second step is to increase the allocated memory.

How much memory you allocate determines the maximum number of URLs the SEO Spider can crawl.

Treat any published figures as guidelines only; as a rule, the more memory you allocate, the better the SEO Spider will perform.

Attributes to be scanned

It is not always necessary to crawl an entire site, so it is a good idea to tailor the crawl to the specific focus of the analysis at hand. Remember that the more data you collect, the greater the demand on RAM will be.

To choose which elements to crawl and which to leave out, you can edit the SEO Spider settings directly:

Configuration > Spider > Crawl

Depending on the type of analysis you can choose to disable scanning and storage of the following items:

  • Images*;
  • CSS*;
  • JavaScript*;
  • SWF*.

*In terms of memory it makes a lot of sense to disable these resources, but if you use JavaScript rendering mode you will need to keep them enabled so as not to penalize rendering.

  • External Links;
  • Canonicals;
  • Pagination;
  • Hreflang;
  • AMP.

Likewise, you can disable the extraction of non-essential elements such as “Meta Keywords” directly from the “Spider Configuration” screen, as well as other attributes that are not strictly necessary at the time of analysis, choosing from “Page Details” (e.g. page title, word count, headings), URL details, directives or structured data.

In addition to the elements mentioned above, I recommend disabling the function that records the position of links in the HTML:

Config > Custom > Link Position

Another best practice is to avoid enabling advanced features such as the following, which can be added in a second phase of analysis:

  • Custom Search;
  • Custom Extraction;
  • Google Analytics integration;
  • Google Search Console integration;
  • PageSpeed Insights integration;
  • Spelling & Grammar;
  • Link metrics integration (Majestic, Ahrefs and Moz).

Excluding URLs

To make crawling leaner and less memory-intensive, you can limit the crawl to specific areas of the website and use the “Exclude” function to leave out items such as sort orders generated by “faceted navigation,” particular URL parameters, or infinite URLs with repeated directories.

This function lets you exclude URLs you do not wish to crawl by means of regular expressions (regex). Be careful, however: excluded URLs are not crawled at all, so pages linked only from them will not be discovered either.

To understand the importance of the “Exclude” function, let us look at a real-world scenario: the John Lewis website.

With the SEO Spider’s default settings, the numerous sorting options would cause the crawl to include a virtually infinite set of URLs, as in the example below:

https://www.johnlewis.com/browse/men/mens-trousers/adidas/allsaints/gant-rugger/hymn/kin-by-john-lewis/selected-femme/homme/size=36r/_/N-ebiZ1z13ruZ1z0s0la1

This situation is very common on ecommerce sites that use “faceted navigation” for brand, color, fit and more, creating endless combinations and inexorably weighing down the crawl.

In this specific case, a sample crawl of the website is enough to detect this type of parameterized URL pattern and exclude it.
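
Before adding exclude rules in the SEO Spider, it can help to test them against a handful of known URLs. Below is a minimal Python sketch assuming, as the examples in this guide suggest, that each rule is a regular expression matched against the full URL; the pattern and the shortened sample URLs are purely illustrative.

    import re

    # Hypothetical exclude rule: drop faceted-navigation URLs that contain
    # the "/_/N-" parameter segment seen in the example above.
    exclude_patterns = [re.compile(r".*/_/N-.*")]

    urls = [
        "https://www.johnlewis.com/browse/men/mens-trousers/_/N-ebiZ1z13ruZ1z0s0la1",
        "https://www.johnlewis.com/browse/men/mens-trousers",
    ]

    for url in urls:
        excluded = any(p.fullmatch(url) for p in exclude_patterns)
        print(f"{url} -> {'excluded' if excluded else 'crawled'}")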

Subfolders and Subdomains

If your website is very large, you may consider crawling subdomains separately, or, even more granularly, crawling particular subfolders.

By default, the SEO Spider crawls only the subdomain you enter; all other subdomains it encounters are treated as external links and appear in the lower window of the ‘External’ tab.

You can choose to scan all subdomains, but obviously this will require more memory.

The SEO Spider can also be configured to crawl only a subfolder by entering the subfolder URL with its full path and disabling “check links outside of start folder” and “crawl outside of start folder.”

Configuration > Spider > Check links outside of start folder | Crawl outside of start folder

For example, let’s assume you only want to scan the blog area of the Screaming Frog site.

The first step is to identify the subfolder and run the crawler from it, which in our case means starting from: https://www.screamingfrog.co.uk/blog/.

Please note: to crawl a subfolder, remember to include the trailing slash (“/blog/” rather than “/blog”), otherwise the SEO Spider will not recognize it as a subfolder. You may run into the same problem if the version with the “/” redirects to the version without it.

In that case you can use the “Include” function with a regex rule (e.g. “.*blog.*”) to keep the crawl within the blog.
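
To make this concrete, here is a small Python sketch (with illustrative URLs; it sketches the matching idea, not the SEO Spider's implementation) showing that a strict subfolder prefix check misses the slash-less version, while the broader “.*blog.*” include pattern catches both:

    import re

    start_folder = "https://www.screamingfrog.co.uk/blog/"
    include = re.compile(r".*blog.*")

    urls = [
        "https://www.screamingfrog.co.uk/blog/",            # subfolder root
        "https://www.screamingfrog.co.uk/blog/some-post/",  # hypothetical post inside /blog/
        "https://www.screamingfrog.co.uk/blog",             # trailing slash missing
        "https://www.screamingfrog.co.uk/seo-spider/",      # outside the subfolder
    ]

    for url in urls:
        prefix_ok = url.startswith(start_folder)
        include_ok = bool(include.fullmatch(url))
        print(f"{url}\n  subfolder prefix: {prefix_ok}  .*blog.* include: {include_ok}")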

The Include Function

In addition to excluding areas and/or disabling elements from being crawled and stored, you can take advantage of the “Include” feature, which limits the URLs to be crawled using regex.

For example, if you want to crawl only pages that have “search” in the URL, simply add the regex “.*search.*” to the “Include” configuration.

Running this on the Screaming Frog site, only the URLs containing “search” would be crawled.
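
As a quick local check, the sketch below applies the same idea in Python; it assumes (as the “.*search.*” example suggests) that an include pattern has to match the entire URL, which is why the “.*” wrappers are needed. The URLs are hypothetical.

    import re

    # With full-URL matching, a bare "search" pattern would match nothing,
    # while ".*search.*" matches any URL containing "search".
    bare = re.compile(r"search")
    wrapped = re.compile(r".*search.*")

    urls = [
        "https://www.screamingfrog.co.uk/search/?q=log-files",  # hypothetical search URL
        "https://www.screamingfrog.co.uk/seo-spider/",          # no "search" in the URL
    ]

    for url in urls:
        print(f"{url} -> bare: {bool(bare.fullmatch(url))}  wrapped: {bool(wrapped.fullmatch(url))}")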

Sampling

Another way to save time and reduce the memory impact of a crawl is to sample the website using the following limits:

Limit Crawl Total: sets the maximum number of URLs to be crawled in total.

Limit Crawl Depth: restricts the crawl to pages closest, in terms of clicks, to the start page (or home page), giving a sample of every template.

Limit Max URI Length To Crawl: crawls only URLs up to a maximum number of characters, avoiding very deep URLs.

Limit Max Folder Depth: limits the crawl based on folder depth.

Limit Number of Query Strings: limits the crawl of parameterized URLs based on the number of query string parameters. If you set the limit to ‘1’, the SEO Spider will only crawl URLs that have at most a single parameter in the string (e.g. ?color=red), avoiding infinite combinations.
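
To illustrate what a query string limit of ‘1’ means in practice, here is a minimal Python sketch with hypothetical URLs; the parameter counting is an illustration of the idea, not the SEO Spider's own logic.

    from urllib.parse import urlparse, parse_qsl

    limit = 1  # equivalent in spirit to "Limit Number of Query Strings" = 1

    urls = [
        "https://example.com/shoes",                    # 0 parameters
        "https://example.com/shoes?color=red",          # 1 parameter
        "https://example.com/shoes?color=red&size=42",  # 2 parameters
    ]

    for url in urls:
        n_params = len(parse_qsl(urlparse(url).query))
        status = "crawled" if n_params <= limit else "skipped"
        print(f"{url} -> {n_params} parameter(s), {status}")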
