Checking for duplicate content, similar pages, and low-value elements.

The Content Tab

The “Content” tab shows data on the content of URLs discovered during the crawl: word count, duplicate and near-duplicate content, and spelling and grammar errors.

  • Address: the address of the URL.
  • Word Count: the total number of “words” found within the body tag, excluding HTML markup.
    The count is based on the content area defined by the SEO Spider, which you can customize from “Config > Content > Area.”
    By default, the <nav> and <footer> elements are excluded.
    For finer-grained analysis, you can choose to include or exclude specific HTML elements, classes, and IDs. The values reported by Screaming Frog may differ from a manual count: the parser applies corrections when it encounters invalid HTML, and your rendering settings can also affect which HTML is considered. Screaming Frog counts words by taking the text and splitting it on spaces, with no regard for the visibility of the content (such as text within a div set to hidden).
  • Closest Similarity Match: shows the similarity percentage between pages, helping you spot and avoid duplicate situations. With the default thresholds, the SEO Spider flags content as “near duplicate” when it has a 90% or higher match. You can customize the threshold from “Config > Content > Duplicates > Enable Near Duplicates > Near Duplicate Similarity Threshold.” To populate this column you must run “Crawl Analysis.” Only URLs with content above the selected similarity threshold will contain data; the others will remain empty. In short, by default this column only contains data for URLs with 90% or more similarity.
  • No. Near Duplicates: identifies the number of near duplicate URLs discovered in a crawl that meet or exceed the ‘Near Duplicate Similarity Threshold’, which is a 90% match by default. You can adjust this setting from “Config > Content > Duplicates.” To populate this column, you must enable the “Enable Near-Duplicates” feature and run the “Crawl Analysis.”
  • Total Language Errors: the total number of spelling and grammar errors discovered for a URL. For this column to be populated, you must select “Enable Spell Check” and/or “Enable Grammar Check” under:

Config > Content > Spelling & Grammar
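The word-counting behaviour described above can be sketched with Python’s standard-library HTML parser. This is only an illustration of the general idea (collect body text, skip `<nav>` and `<footer>`, split on whitespace); the class and the tags skipped here are my assumptions, and Screaming Frog’s actual parser applies its own corrections, so counts can differ.

```python
from html.parser import HTMLParser

class WordCounter(HTMLParser):
    """Rough body word count: gather visible-ish text, skip <nav>/<footer>
    (plus <script>/<style>), then split on whitespace. Illustrative only."""
    SKIPPED = {"nav", "footer", "script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.words += len(data.split())

html = "<body><nav>Home About</nav><p>Hello wide world</p><footer>Imprint</footer></body>"
counter = WordCounter()
counter.feed(html)
print(counter.words)  # 3 — only the words inside <p> are counted
```

Note that, as the text above explains, splitting on spaces ignores visibility: text inside a hidden div would still be counted by this approach.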

  • Spelling Errors: the total number of spelling errors discovered for a URL. For this column to be populated, you must enable “Enable Spell Check.”
  • Grammar Errors: the total number of grammar errors discovered for a URL. For this column to be populated, “Enable Grammar Check” must be selected. In the settings you can choose the language to check, the grammar rules, and the reference dictionary.
    To inspect spelling and grammar errors in the lower tab of the SEO Spider, simply select the URLs in the dedicated column of the “Internal” tab (upper window).
    This is an excellent feature for copywriters looking to unearth typos or misapplied grammar rules. Alongside the list of errors, you get correction suggestions and a preview of the section of the website where each error was found. For Italian, the suggestions are still unreliable, but the dictionary makes it very useful for catching typos.
  • Language: shows the language chosen for spelling and grammar checks. This is based on the language attribute set in the HTML. The language can also be set manually under:

Config > Content > Spelling & Grammar

  • Hash: the MD5 hash value computed for the URL’s content. This is a check for exact duplicate content, unlike the “Closest Similarity Match” column, where a similarity threshold applies. Note that if two hash values match, the pages have exactly the same content. If even one character differs, the pages have unique hash values and will not be detected as duplicate content. Exact duplicates can be seen under “URL > Duplicate.”
  • Indexability: whether the URL is indexable or non-indexable.
  • Indexability Status: the reason why a URL is not indexable. It could be canonicalized to another URL or feature the “noindex” tag.
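The all-or-nothing nature of the hash check is easy to demonstrate with Python’s `hashlib` (the helper function and sample HTML below are illustrative, not Screaming Frog’s implementation): identical content produces identical digests, while a single changed character yields a completely different digest.

```python
import hashlib

def content_hash(html: str) -> str:
    """MD5 digest of the raw page content, as a hex string."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()

a = "<html><body><p>Welcome to our shop</p></body></html>"
b = "<html><body><p>Welcome to our shop</p></body></html>"
c = "<html><body><p>Welcome to our shop!</p></body></html>"  # one extra character

print(content_hash(a) == content_hash(b))  # True  -> flagged as exact duplicates
print(content_hash(a) == content_hash(c))  # False -> not flagged, despite near-identity
```

This is exactly why the near-duplicate check exists: pages like `a` and `c` escape the hash comparison entirely.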

Content Tab Filters

The following filters are available:

  1. Exact Duplicates: this filter shows all pages that share the same “hash” and could be considered duplicates by search engines, diluting ranking signals across the copies.
    This filter allows you to isolate duplicate pages and fix them by setting canonicals correctly, so that the “Canonicalized” versions all point to a single URL.
  2. Near Duplicates: this filter displays all similar (not identical!) pages based on a similarity threshold, which defaults to 90% and can be customized in the settings.

The “Closest Similarity Match” column shows the highest percentage of similarity to another page.

The “No. Near Duplicates” column shows the number of pages that are similar to the page based on the similarity threshold. Unlike the exact-duplicates check, which uses the entire HTML, this algorithm runs on the text of the page only.

The content used for this analysis can be configured under ‘Config > Content > Area’.

Remember that pages shown at 100% similarity may still only be “near duplicates” due to rounding: any value of 99.5% or more is displayed as 100%.

To populate this column you must enable “Enable Near Duplicates” and run the “Crawl Analysis.”

Config > Content > Duplicates
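The threshold logic can be illustrated with a small sketch. Here `difflib.SequenceMatcher` stands in for the SEO Spider’s internal similarity measure (an assumption; the tool uses its own algorithm, and the sample texts are invented), but the filtering behaviour is the same: only pairs at or above the threshold populate the column.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.90  # default "Near Duplicate Similarity Threshold"

def similarity(text_a: str, text_b: str) -> float:
    # Stand-in for the SEO Spider's internal similarity algorithm
    return SequenceMatcher(None, text_a, text_b).ratio()

page  = "Red running shoes with breathable mesh upper and rubber sole."
near  = "Blue running shoes with breathable mesh upper and rubber sole."
other = "Our returns policy: items may be sent back within 30 days."

for candidate in (near, other):
    score = similarity(page, candidate)
    if score >= THRESHOLD:
        # the displayed percentage is rounded, so 99.5%+ would show as 100%
        print(f"near duplicate ({score:.0%})")
    else:
        print("below threshold, column left empty")
```

The two product pages differ by a single word and land well above 90%, while the unrelated policy page falls below the threshold and would leave the column empty.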

  • Low Content Pages: this filter highlights all pages with fewer than 200 words. The word count is based on the content area settings used in the analysis, which can be configured via “Config > Content > Area.”
    Search engines have made no official statement about the minimum number of words for content to be considered valid, but as the saying “Content is King” suggests, descriptive content is one of the most important elements from an organic ranking perspective.
    A very useful analysis is to compare Search Console performance in terms of impressions against the “Low Content Pages” filter, to see whether pages with little content underperform and need to be expanded. Of course, these considerations depend on the website: an e-commerce site may not need very long text and can perform well with shorter content.

You can adjust the threshold for defining low content pages via “Low Content Word Count.”

Config > Spider > Preferences > Content Low Word Count
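The filter itself is simply a threshold over the per-URL word counts. A minimal sketch, using invented URLs and counts (the data structure is an assumption, not the tool’s internals):

```python
# Hypothetical crawl results: URL -> word count (illustrative values only)
word_counts = {
    "/": 540,
    "/blog/guide": 1250,
    "/category/shoes": 85,
    "/contact": 40,
}

LOW_CONTENT_THRESHOLD = 200  # default; adjustable in the preferences above

low_content = [url for url, n in word_counts.items() if n < LOW_CONTENT_THRESHOLD]
print(low_content)  # ['/category/shoes', '/contact']
```

Raising or lowering `LOW_CONTENT_THRESHOLD` is exactly what the preference setting does: it redraws the line that decides which pages appear in the filter.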

  • Spelling Errors: this filter contains all HTML pages with spelling errors.
  • Grammar Errors: this filter contains all HTML pages with grammar errors.

Grammatical Error Analysis

SEO Spider Tab