Crawling Sites in Staging

Find out how to perform an SEO audit of a staging website and be ready to go live.

Staging Sites Analysis

Staging websites are usually set up so that search engines and crawlers cannot access them. Because there are various methods of preventing crawling, there are also several approaches and Screaming Frog configurations you can use to circumvent these limitations and crawl the site anyway.

Robots.txt


The most common case is the robots.txt file blocking the crawl. In this scenario the SEO Spider collects only one URL, the record shows the message “Blocked by robots.txt” and the status is “Indexable”. To get past this barrier, simply enable “Ignore robots.txt” via “Config > Robots.txt”.

An alternative is to configure a custom robots.txt via “Config > robots.txt > Custom”, removing any “Disallow” directives and adding, if necessary, other rules for the SEO Spider to follow.
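
The exact rules depend on how the staging environment is set up, but as an illustration, a staging robots.txt often blocks everything, and the custom version used for the audit simply drops that rule:

    Typical staging robots.txt (blocks all crawling):
      User-agent: *
      Disallow: /

    Custom robots.txt configured for the audit (blanket Disallow removed):
      User-agent: *
      Disallow: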

Authentication


If the server requires authentication, you can provide a username and password to the SEO Spider so that it can crawl the site.

Basic authentication

Basic and digest authentication are detected automatically by the SEO Spider when it crawls the website.

When crawling the site under development, an authentication pop-up window appears, just as it would in a web browser, asking for a username and password.

Once the correct credentials are entered, the SEO Spider crawls the staging pages of the website as normal.
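
Under the hood this is the standard HTTP authentication challenge. As a simplified illustration (the realm, host and credentials below are placeholders), a basic authentication exchange looks like this:

    Server challenge:
      HTTP/1.1 401 Unauthorized
      WWW-Authenticate: Basic realm="Staging"

    Request with credentials:
      GET /page HTTP/1.1
      Host: staging.example.com
      Authorization: Basic dXNlcjpwYXNzd29yZA==   (base64 of "user:password")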

Web Form Authentication

If the site instead uses a web form or an area that requires a cookie-based login, the SEO Spider lets you log in to these forms using the tool’s built-in Chrome browser.

To log in using forms-based authentication, click “Configuration > Authentication > Forms-based”. Then click the ‘Add’ button, enter the URL of the site you wish to crawl and log in within the embedded browser.

IP Address

Some staging platforms may restrict access based on IP address.

Since the SEO Spider crawls from the machine on which it runs, you must provide that machine’s IP address so it can be added to the “whitelist”.
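
How the whitelist is managed depends on the hosting platform or web server. Purely as an illustrative sketch, on a server using nginx the rule might look like the following (the IP address is a placeholder for the machine running the SEO Spider):

    location / {
      allow 203.0.113.10;   # machine running the SEO Spider
      deny all;             # block everyone else
    }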

SEO Spider Configuration

Sites in development may respond to HTTP requests differently than they will in the live environment, and they often carry robots directives that require additional configuration in the SEO Spider.

Scanning speed

Staging websites are generally slower and cannot handle the same load as a production server. If you notice instability, timeout errors or server errors, reduce the Spider’s crawl speed via “Config > Speed”.

Alternatively, adjust the response timeout and, when crawling with JavaScript rendering, the AJAX timeout.

No Follow

To protect against unwanted bots, many sites in development apply a “nofollow” meta robots tag across the whole site, or an X-Robots-Tag in the HTTP header. A “nofollow” is a very different directive from a noindex: it tells a crawler not to follow any outlinks from a page.
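
For reference, a site-wide nofollow usually takes one of these two forms (the values below are examples):

    In the HTML <head> of every page:
      <meta name="robots" content="nofollow">

    Or as an HTTP response header:
      X-Robots-Tag: nofollow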

If you encounter the situation shown in the example above, go to “Config > Spider” and enable “Follow Internal Nofollow” so the outlinks from these pages are crawled and you get a full audit.

No Index URLs

Another tag you may encounter on a staging site is noindex, which allows a crawler to crawl a page but tells it not to index it. The use of noindex can be reviewed under the “Directives” tab with the “Noindex” filter. What I recommend is to check that this tag is removed as soon as the website goes live: far too often you see sites with the robots.txt block or the noindex tag still active even though they have already been published.
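
A quick way to spot-check this after launch (shown only as an example; the domain is a placeholder) is to inspect the live headers and robots.txt from the command line:

    # Check for a lingering X-Robots-Tag header on the live homepage
    curl -sI https://www.example.com/ | grep -i x-robots-tag

    # Check the live robots.txt for a leftover blanket Disallow
    curl -s https://www.example.com/robots.txt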

Beyond the go-live checks, bear in mind that the SEO Spider crawls pages with a noindex (like a normal bot) but treats them as ‘Non-Indexable’. This means that by default the Spider does not apply the relevant on-page filters to them and will not report issues such as duplicate content or missing page titles and meta descriptions, so the analysis may be heavily skewed.

The solution is to disable “Ignore Non-Indexable URLs for On-Page Filters” under “Config > Spider > Advanced” to get a complete scan of the staging website.

None Directive

Too often the “none” directive is underestimated or misinterpreted as meaning the absence of directives, when in fact it is equivalent to “noindex, nofollow”!
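
In practice, the two meta tags below mean exactly the same thing to a crawler:

    <meta name="robots" content="none">
    <meta name="robots" content="noindex, nofollow">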

If this is the case, you need to follow the same steps described above.

Live site vs Staging

With the latest releases of the Screaming Frog SEO Spider you can compare two crawls and see the differences through the ‘URL Mapping’ feature, which allows you to compare two different URL structures, such as a staging website against the production or live environment.

To compare staging with the live website, simply click “Mode > Compare” and select the two crawls.
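
URL Mapping is based on regular expressions. As an illustration (the hostnames are placeholders and the dialog may vary between versions), a mapping from staging to live could be set up as:

    Regex match:  https://staging\.example\.com/
    Replace:      https://www.example.com/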

Crawling Staging Sites: Video
