SEO Spider crawling issues

Find out how to handle potential SEO Spider crawling problems.

Intro

Sometimes Screaming Frog performs a one-page crawl or does not crawl a site as expected.

In that case, the first things to check are the ‘Status’ and ‘Status Code’ columns of the returned resources in order to identify and fix the problem.

Blocked by Robots.txt

This filter lists all URLs blocked by the site’s robots.txt: the Spider is not allowed to crawl those resources, and consequently they cannot be indexed by search engines.

A “0” in the Status Code column indicates the lack of an HTTP response from the server; the Status column identifies the reason. In the example above, the Spider’s user-agent was blocked by the robots.txt “disallow” directive, and because it was blocked it never sees an HTTP response.

What to do

For resources blocked by robots.txt, you can set the SEO Spider to ignore the file via Configuration > Robots.txt > Settings > Ignore Robots.txt, or customise its rules. By default Screaming Frog respects the “disallow” directives of the robots.txt file.
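
To see the same check the Spider performs, you can test a URL against the site’s robots.txt outside the tool. The sketch below is a minimal Python example using the standard library; the domain, path and user-agent string are placeholders, not values taken from the tool itself.

```python
# Minimal sketch: test whether a robots.txt disallow rule blocks a user-agent.
# The domain, path and user-agent string below are placeholder assumptions.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()  # fetch and parse the live robots.txt

user_agent = "Screaming Frog SEO Spider"  # assumed user-agent token
url = "https://www.example.com/some-page/"

if parser.can_fetch(user_agent, url):
    print("Crawling allowed for this user-agent")
else:
    print("Blocked by a disallow rule - the crawler never sees an HTTP response")
```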

DNS Lookup Failed

Indicates that the site was not found at all, either because the domain was entered incorrectly or because there is no network connection.

What to do

Check that the site address has been typed correctly and that it is visible in your browser. If you cannot view it in the browser, there may be a connectivity problem; if, on the other hand, the site renders correctly in the browser, check whether an antivirus or firewall is blocking the SEO Spider.
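
To confirm whether the failure really is a DNS problem before looking at antivirus or firewall settings, you can run a quick resolution test from the same machine. This is a minimal Python sketch; the hostname is a placeholder.

```python
# Minimal sketch: check that the hostname resolves at all.
# "www.example.com" is a placeholder for the site you are trying to crawl.
import socket

hostname = "www.example.com"

try:
    ip_address = socket.gethostbyname(hostname)
    print(f"Resolved {hostname} to {ip_address} - the domain entry looks correct")
except socket.gaierror as error:
    # Typical causes: a typo in the domain or no working network connection
    print(f"DNS lookup failed for {hostname}: {error}")
```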

Connection Timeout

Indicates that Screaming Frog requested a page but did not receive an HTTP response from the server within 20 seconds. To overcome the problem, you can increase the “Response Timeout” in the configuration or decrease the crawl rate (lowering Speed) so as to put less load on the server and give it more time to respond.

What to do

Check the site’s load times in your browser; if they are very long, increase the “Response Timeout” in the SEO Spider and decrease the crawl speed to lighten the load on the server.

Check whether you can crawl other sites with Screaming Frog. If the problem is widespread, there may be an antivirus or firewall issue: try creating an exception for Screaming Frog. If that does not solve the problem either, there is probably a network or local issue. For example, check whether a proxy is enabled (Configuration > System > Proxy) and, if it is, try disabling it and restarting the crawl.
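
To check whether the server itself is simply slow, you can time a single request from the same machine using the same 20-second limit. The sketch below is a minimal Python example; it assumes the requests library is installed and uses a placeholder URL.

```python
# Minimal sketch: request one page with a 20-second limit, mirroring the
# SEO Spider's default "Response Timeout". The URL is a placeholder.
import requests

url = "https://www.example.com/"

try:
    response = requests.get(url, timeout=20)  # seconds
    print(f"Answered in {response.elapsed.total_seconds():.1f}s "
          f"with status {response.status_code}")
except requests.exceptions.Timeout:
    print("No response within 20 seconds - raise the timeout or lower the crawl speed")
except requests.exceptions.ConnectionError as error:
    print(f"Connection problem before any response arrived: {error}")
```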

Connection Refused

Returned when the SEO Spider’s connection attempt was refused somewhere between the local machine and the website.

What to do

Can you crawl other websites? If so, check your antivirus and firewall and create an exception for the Spider; if the problem affects other sites as well, look for a network or local problem on your PC.

Can you view the page in the browser, or does it return a similar error? If the page can be seen, change the user-agent to “Chrome” (Configuration > User-Agent).

No Response

The SEO Spider has trouble making connections or receiving responses. A common cause is the proxy.

What to do

The first check concerns the proxy (Configuration > System > Proxy): try disabling it. If it is not set up correctly, the SEO Spider may not be sending or receiving requests properly.

Success (2xx)

The requested URL was successfully received, understood, accepted, and processed. The first check I suggest is for the presence of the ‘nofollow’ attribute.

What to do

  • Does the requested page have the meta robots ‘nofollow’ directive on the page or in the HTTP header, or do all links on the page have rel=’nofollow’ attributes? In this case, simply set the SEO Spider to follow internal/external ‘nofollow’ links (Configuration > Spider). This issue arises because Screaming Frog respects ‘nofollow’ directives by default.
  • Does the website rely on JavaScript links? Try disabling JavaScript and viewing the pages in the browser. If this is the case, you just need to enable JavaScript rendering (Configuration > Spider > Rendering > JavaScript). By default, the SEO Spider only looks for <a href=””>, <img src=””> and <link rel=”canonical”> links in the HTML source code and does not read the rendered DOM.
  • Check the “Limits” tab of “Configuration > Spider”, specifically the “Limit Search Depth” and “Limit Search Total” options. If they are set to 0 or 1 respectively, the SEO Spider is instructed to crawl only a single URL.
  • Does the site require cookies? Try viewing the page in your browser with cookies disabled. This condition occurs when the SEO Spider, which does not accept cookies, is served a different page without hyperlinks. To resolve this crawling issue, go to Configuration > Spider > Advanced tab > Allow Cookies.
  • What is specified in the ‘Content’ column? If it is empty, enable JavaScript rendering (Configuration > Spider > Rendering > JavaScript) and start the crawl again. This issue occurs when no content type is specified in the HTTP header, so the SEO Spider does not know whether the URL is an image, a PDF, an HTML page, etc. and cannot parse it for further links. Rendering mode works around this because, when enabled, the tool checks whether a <meta http-equiv> is specified in the <head> of the document. A rough way to reproduce the content-type and nofollow checks outside the tool is sketched after this list.
  • Are there any age restrictions? In this case, try changing the user-agent to Googlebot (Configuration > User-Agent). The site/server may be set up to serve HTML to search bots without requiring age verification.
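
The sketch below reproduces two of the checks above in Python: whether a content type is declared in the HTTP header and whether a page-level ‘nofollow’ directive is present. It assumes the requests library and a placeholder URL, and uses a deliberately rough pattern match; it is not how the tool itself parses pages.

```python
# Minimal sketch: check the Content-Type header and look for a page-level
# "nofollow" directive on a single URL. The URL is a placeholder.
import re
import requests

url = "https://www.example.com/"
response = requests.get(url, timeout=20)

# 1. Is a content type declared? Without it the crawler cannot tell
#    whether the URL is an HTML page, an image, a PDF, etc.
content_type = response.headers.get("Content-Type")
print(f"Content-Type header: {content_type or 'missing'}")

# 2. Does the page (or the X-Robots-Tag header) carry a nofollow directive?
#    Rough check only: attribute order can vary in real markup.
meta_nofollow = re.search(
    r'<meta[^>]+name=["\']robots["\'][^>]+content=["\'][^"\']*nofollow',
    response.text,
    re.IGNORECASE,
)
header_nofollow = "nofollow" in response.headers.get("X-Robots-Tag", "").lower()

if meta_nofollow or header_nofollow:
    print("nofollow directive found - allow the Spider to follow nofollow links")
else:
    print("No page-level nofollow directive detected")
```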

Redirection (3xx)

A redirection such as a 301 or 302 was encountered. Ideally, each internal link should point to a URL that responds with status code 200; links that force the Spider to take extra steps should be avoided in order to safeguard the crawl budget.

What to do

  • Check the destination of the redirect (look at the outlinks of the returned URL). If there is a redirect loop, the SEO Spider cannot complete the crawl; a way to walk the chain manually is sketched after this list.
  • External tab: the SEO Spider treats separate subdomains as external and does not crawl them by default. To resolve this, enable the option to crawl all subdomains (Configuration > Spider > Crawl all subdomains).
  • Does the site require cookies? Try disabling cookies in your browser; if you can reproduce the problem, enable the “Allow Cookies” option in the SEO Spider (Configuration > Spider > Advanced tab > Allow Cookies). This condition occurs because the SEO Spider is redirected to a URL that sets a cookie, but it does not accept cookies by default.
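
To see where a redirect actually lands, and whether it loops back on itself, you can walk the chain one hop at a time. The sketch below is a minimal Python example assuming the requests library and a placeholder start URL.

```python
# Minimal sketch: follow a redirect chain hop by hop and flag a loop.
# The start URL is a placeholder.
import requests

url = "https://www.example.com/old-page/"
seen = set()

while True:
    if url in seen:
        print(f"Redirect loop detected at {url} - the crawl cannot complete")
        break
    seen.add(url)

    response = requests.get(url, allow_redirects=False, timeout=20)
    print(f"{response.status_code}  {url}")

    if response.status_code in (301, 302, 303, 307, 308):
        # Resolve relative Location headers against the current URL
        url = requests.compat.urljoin(url, response.headers["Location"])
    else:
        break  # reached a final response (ideally a 200)
```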

Bad request | 403 Forbidden

The server cannot or will not process the request and denies the SEO Spider access to the requested URL.

What to do

  • If the page can be viewed in your browser, try setting Chrome as the user-agent (Configuration > User-Agent). The site is probably denying the page to the SEO Spider for security reasons.
  • If the page responds with a 404 error, the server is indicating that it no longer exists. If changing the user-agent to Chrome leaves you in the same condition, it is presumably a website problem; if with Chrome the page is returned (or you can view it in the browser) and you can start crawling, the server was probably blocking the default user-agent for security reasons.
  • If the page returns a 429 error, too many requests have been made to the server in a given period of time.

What to do

Can you view the site in the browser, or does it show a similar error message? If so, lower the crawl speed and/or test a different user-agent such as Chrome or Googlebot.
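
If you want to confirm that a different user-agent or a slower request rate gets past the 403/429 responses before changing the Spider configuration, a quick test along these lines can help. The sketch below is a minimal Python example; the URLs and the Chrome-style User-Agent string are placeholders.

```python
# Minimal sketch: retry a few URLs with a Chrome-like User-Agent and a pause
# between requests. URLs and the User-Agent string are placeholders.
import time
import requests

urls = ["https://www.example.com/", "https://www.example.com/category/"]
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
}

for url in urls:
    response = requests.get(url, headers=headers, timeout=20)
    print(f"{response.status_code}  {url}")
    if response.status_code == 429:
        print("Still rate-limited - slow down further or wait before retrying")
    time.sleep(2)  # crude rate limit: roughly one request every two seconds
```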

Server Error (5xx)

The server failed to fulfill an apparently valid request. This includes common responses such as 500 Internal Server Error, 502 Bad Gateway and 503 Service Unavailable.

What to do

  • Can you see the site in the browser, or is it down? If the page can be seen, change the user-agent to Chrome (Configuration > User-Agent).