Duplicate Content Analysis

Analysis and best practices for avoiding duplicate content and preserving search engine rankings.

Duplicate Content

As Google has repeatedly stated, content is one of the most important ranking factors, and managing duplication is an activity to consider in any SEO optimization.

Duplicate content is one of the problems that most commonly holds a project back, and it must be minimized so as not to waste Crawl Budget.

The goal of an SEO specialist should not be limited to identifying and optimizing identical pages; it should also extend to “similar” content that can create “Query Cannibalization” and make Search Engine crawling deeply inefficient.

In this tutorial, we will see how Screaming Frog allows you to ferret out both exact duplicate content (pages with the same hash) and similar content that may turn out to have the same “Search Intent” in Google’s eyes.

  • 1. Enable ‘Near Duplicates’ via ‘Config > Content > Duplicates’.

By default, Screaming Frog automatically identifies exact duplicate pages (duplicate pages share the same “Hash”), while to identify similar content you must enable the “Enable Near Duplicates” option. Once activated, the SEO Spider considers any document with at least a 90% match to be a “similar” page.

You can also adjust this threshold to whatever level of refinement best suits your project.
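To make the “same hash” idea concrete, here is a minimal sketch of hash-based exact-duplicate detection, assuming the normalized body text is fingerprinted with MD5; Screaming Frog’s internal hashing is not documented here, so this is purely illustrative.

```python
import hashlib

def content_hash(body_text: str) -> str:
    """Fingerprint of the page body: identical text yields an identical hash."""
    normalized = " ".join(body_text.split()).lower()  # collapse whitespace
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Hypothetical pages: URLs whose hashes collide are exact duplicates.
pages = {
    "/about/": "We crawl websites for SEO audits.",
    "/about": "We crawl websites for SEO audits.",
    "/contact/": "Get in touch with our team.",
}

seen = {}
for url, text in pages.items():
    fingerprint = content_hash(text)
    if fingerprint in seen:
        print(f"Exact duplicate: {url} == {seen[fingerprint]}")
    else:
        seen[fingerprint] = url
```

Because the hash is computed over the extracted text, even a one-character difference produces a different fingerprint, which is exactly why “similar” pages need a separate, threshold-based check.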

By default the SEO Spider only checks for similar or duplicate pages among URLs that are ‘indexable’ and not ‘canonicalized’; my advice is to uncheck the ‘Only Check Indexable Pages For Duplicates’ option, as this may help you find areas of potential Crawl Budget waste.

Nota Bene: for new projects, where you manage all the keyword research from scratch, I normally recommend setting the threshold at 50-60%, while on projects with some history I recommend a minimum of 70%, to make sure that the Search Intent of each page is always unique and well defined.

Using Search Console, you can periodically check the query-to-landing-page relationship; in case of cannibalization, choose the landing page best suited to that specific Search Intent (also checking GA data) and rework the content of the secondary pages or, in some cases, remove them.
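As an illustration, here is a minimal sketch of flagging potential cannibalization in a Search Console performance export, assuming a CSV with ‘query’ and ‘page’ columns; the file and column names are hypothetical and depend on how you export the data.

```python
import pandas as pd

# Hypothetical export: one row per query/landing-page combination.
df = pd.read_csv("search_console_export.csv")  # columns: query, page, clicks

# A query served by more than one landing page is a cannibalization candidate.
pages_per_query = df.groupby("query")["page"].nunique()
candidates = pages_per_query[pages_per_query > 1].index

for query in candidates:
    landing_pages = df.loc[df["query"] == query, "page"].unique()
    print(query, "->", list(landing_pages))
```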

  • 2. Define the content area to be analyzed

Here too Screaming Frog is very flexible and lets you choose where to focus your attention when searching for duplicate or similar content.

By default, the SEO Spider automatically excludes content in the <nav> and <footer> elements to focus on the main body content. However, not all websites are built using these HTML5 elements, so the SEO Spider also lets you ‘include’ or ‘exclude’ specific HTML tags, classes, or IDs from the analysis.

For example, if you crawl the Screaming Frog website you will find that it has a mobile menu outside the “nav” element which, by default, is still included in the content analysis. In this specific case you could specify the menu class (‘mobile-menu__dropdown’) and exclude it from the analysis to focus on the main content, as in the sketch below.
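To show what such an exclusion amounts to, here is a minimal sketch that extracts the “content area” of a page with BeautifulSoup, dropping <nav>, <footer>, and a given class before any comparison; Screaming Frog’s own extraction is internal, so treat this only as an approximation.

```python
from bs4 import BeautifulSoup

def extract_content_area(html: str, exclude_classes: tuple = ()) -> str:
    """Return body text with nav/footer and unwanted classes stripped out."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove the structural elements excluded by default.
    for tag in soup.find_all(["nav", "footer"]):
        tag.decompose()
    # Remove any additional classes, e.g. a mobile menu outside <nav>.
    for cls in exclude_classes:
        for tag in soup.find_all(class_=cls):
            tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

html = """
<body>
  <nav>Home | Blog</nav>
  <div class="mobile-menu__dropdown">Home | Blog</div>
  <main>The actual article content.</main>
  <footer>Copyright</footer>
</body>
"""
print(extract_content_area(html, exclude_classes=("mobile-menu__dropdown",)))
# -> "The actual article content."
```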

  • 3. Scan the website.
  • 4. View duplicates in the “Content” tab

In the “Content” tab you have two filters available:

  • ‘Exact Duplicates’ (identical content)
  • ‘Near Duplicates’ (similar content)

This tab populates with data as the crawl progresses, but during the crawl it shows only the “Exact Duplicates.”

  • 5. Configure and run Crawl Analysis.

To populate the ‘Near Duplicates’ filter and the “Closest Similarity Match” and “No. Near Duplicates” columns, you must configure and run Crawl Analysis.

  • 6. View the ‘Content’ tab with the “Exact Duplicates” and “Near Duplicates” filters

Once the post-crawl analysis has finished, the “Near Duplicates” filter and the “Closest Similarity Match” and “No. Near Duplicates” columns will be populated.

Only URLs with content above the selected similarity threshold will contain data; the others will remain empty.

In the example below, the Screaming Frog site has only two resources above the threshold, at 92% similarity, that will need to be optimized.

For a more precise evaluation you can review the data through the two filters:

  • Exact Duplicates: identifies pages with an identical “hash”.
    Very useful for managing main pages and pages to be “canonicalized.”
  • Near Duplicates: identifies pages above the 90% “similarity” threshold, or above whatever threshold you have set via “Config > Content > Duplicates”.

In summary, the “Closest Similarity Match” column shows the percentage of similarity, while “No. Near Duplicates” shows the number of pages involved in that similarity.
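To give an idea of how such a similarity percentage can be computed, here is a minimal sketch using Jaccard similarity over word shingles; Screaming Frog’s actual near-duplicate algorithm is internal, so its numbers will not match this exactly.

```python
def shingles(text: str, size: int = 3) -> set:
    """Break text into overlapping word n-grams ('shingles')."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity between two pages' shingle sets, as a percentage."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return 100 * len(sa & sb) / len(sa | sb)

# Hypothetical page texts: one changed word pushes similarity below 100%.
page_a = "screaming frog helps you find duplicate content on your site quickly"
page_b = "screaming frog helps you find similar content on your site quickly"
print(f"{similarity(page_a, page_b):.0f}% similar")  # compare to your threshold
```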

  • 7. View duplicate URLs via the ‘Duplicate Details’ tab

For “exact duplicates,” simply move to the “hash” column and sort it by clicking on the header so that all matches line up.

In the example from the BBC site, each URL has an exact duplicate because it exists in two versions: one with the trailing “/” and one without.
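A quick way to confirm this pattern in an exported URL list is to normalize the trailing slash and group the results; a minimal sketch, with a hypothetical URL list:

```python
from collections import defaultdict

urls = [
    "https://www.bbc.co.uk/news/technology",
    "https://www.bbc.co.uk/news/technology/",
    "https://www.bbc.co.uk/sport",
]

# Group URLs that differ only by the trailing slash.
groups = defaultdict(list)
for url in urls:
    groups[url.rstrip("/")].append(url)

for normalized, variants in groups.items():
    if len(variants) > 1:
        print("Trailing-slash duplicate:", variants)
```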

To inspect “near duplicates” (similar content) instead, simply click on the “Duplicate Details” tab in the bottom window.

In the example above, clicking on a URL in the upper window reveals 4 pages that exceed the 90% similarity threshold.

The SEO Spider also provides a preview of near-duplicate content in the “Duplicate Details” tab, allowing a very intuitive view of the textual parts that differ between the pages under examination.

  • 8. Bulk export of duplicate URLs

Both exact duplicates and near-duplicates can be exported in bulk.

Bulk Export > Content > Exact Duplicates / Near Duplicates
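Once exported, the CSV can be post-processed to prioritize the work; a minimal sketch with pandas, assuming hypothetical file and column names (‘near_duplicates.csv’ with ‘Address’ and ‘Closest Similarity Match’), since the exact export format depends on your Screaming Frog version:

```python
import pandas as pd

# Hypothetical export file and column names; adjust to your actual CSV.
df = pd.read_csv("near_duplicates.csv")

# Surface the worst offenders first so they can be prioritized.
worst = df.sort_values("Closest Similarity Match", ascending=False)
print(worst[["Address", "Closest Similarity Match"]].head(10))
```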

