Analysis of JavaScript sites

How to analyse websites whose JavaScript content can hinder crawling by Search Engine bots.

Introduction to JavaScript sites

Until a few years ago, one of the most common problems in Seo was getting pages whose content is generated dynamically with JavaScript indexed. The problem lay with Search Engine bots, which could not crawl and index these resources and only considered the content present in the static HTML source code.

Alongside this issue, the evolution of frameworks such as AngularJS, React and Vue.js, together with single-page applications (SPAs) and progressive web apps (PWAs), has pushed Google to evolve significantly and reach full maturity in understanding web pages much like a modern browser.

However, although Google is now generally able to crawl and index most JavaScript content, server-side rendering, pre-rendering or dynamic rendering is still recommended over relying on client-side JavaScript, because it remains difficult for crawlers to process JavaScript successfully or promptly.

For these reasons, when analysing a site it is critical to understand the differences between the static HTML response and the DOM after JavaScript has run and built the web page.
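As a practical illustration, here is a minimal sketch (assuming Node.js 18+ for the global fetch and the Puppeteer package; the URL is hypothetical) that retrieves both versions of a page so you can compare them:

```typescript
import puppeteer from 'puppeteer';

async function compare(url: string): Promise<void> {
  // 1. Static HTML response: what a non-rendering crawler receives.
  const rawHtml = await (await fetch(url)).text();

  // 2. Rendered DOM: what headless Chromium builds after executing JavaScript.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  const renderedHtml = await page.content();
  await browser.close();

  console.log(`Raw HTML: ${rawHtml.length} characters`);
  console.log(`Rendered DOM: ${renderedHtml.length} characters`);
}

compare('https://www.example.com').catch(console.error); // hypothetical URL
```

A large gap between the two sizes, or content visible only in the rendered version, is a first hint that the site depends heavily on client-side JavaScript.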

JavaScript, the Seo Spider and a few caveats

To address this need, Screaming Frog comes to your aid with its “JavaScript Rendering” mode, which integrates headless Chromium to better emulate the behavior of Googlebot.

Insight: in 2019 Google upgraded its Web Rendering Service (WRS), previously based on Chrome 41, to the latest stable version of Chrome, enabling it to support over a thousand additional features.

Let us consider some aspects specific to JavaScript sites:

  • 1. Enable “JavaScript Rendering” carefully

Although understanding the differences between the static HTML and the DOM after JavaScript execution is very important, I advise against enabling “JavaScript Rendering” mode indiscriminately for every website.

Crawling with JavaScript is slower and more intensive for the server, since all resources (JavaScript, CSS, images, etc.) must be fetched to render each web page. This is irrelevant for small sites, but for large portals or e-commerce sites it can become a hindrance to your crawl in terms of time and RAM.

  • 2. Consider the principles and limitations of JavaScript crawling

All resources on a page (JS, CSS, images) must be available to be crawled, rendered, and indexed.

Google still requires clean, unique URLs for a page, and links must be in proper HTML anchor tags (an anchor can offer a static href and also call a JavaScript function).
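For illustration, a quick sketch of link markup (hypothetical URLs and function names), following Google's guidance on crawlable links:

```typescript
// Patterns Googlebot can follow: an <a> element with a resolvable href.
const followed = `<a href="/products/shoes">Shoes</a>`;
const alsoFollowed = `<a href="/products/shoes" onclick="load('shoes')">Shoes</a>`; // static href plus a JS call

// Patterns Googlebot cannot follow: no anchor, or an anchor without a usable href.
const notFollowed = `<span onclick="load('shoes')">Shoes</span>`;
const alsoNotFollowed = `<a href="#" onclick="load('shoes')">Shoes</a>`;
```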

Googlebot does not interact with pages the way users do: it does not trigger additional events after rendering (a click, hover or scroll, for example), so content loaded only by such interactions will not be seen.

The snapshot of the rendered page is taken once network activity is judged to have stopped, or after a time threshold. If a page takes too long to render, there is a risk that it is skipped and that some elements are never seen or indexed.

In general Google renders all pages; however, it does not queue a page for rendering if there is a ‘noindex’ in the initial HTTP response or in the static HTML.

Finally, rendering is separate from indexing: Google initially crawls the static HTML of a website and defers rendering until resources are available.

When starting the development of a website, knowing these constraints is vital to the success or failure of the project in terms of organic ranking.

Server-side and client-side JavaScript

Google advises against relying on client-side JavaScript and recommends developing with “progressive enhancement,” building the structure and navigation of the site using only HTML and then enhancing the look and feel of the site with AJAX.

For this reason, if you use a JavaScript framework, Google recommends setting up server-side rendering, pre-rendering, or hybrid rendering, which can improve performance for users and Search Engine crawlers.

With server-side rendering (SSR) and pre-rendering, JavaScript is executed before the page is served, providing an initial rendered, crawl-ready HTML version.

Hybrid rendering (sometimes referred to as “isomorphic” rendering) executes JavaScript server-side for the initial page load and its HTML, and client-side for non-critical elements and subsequent pages.

Many JavaScript frameworks such as React or Angular Universal allow server-side and hybrid rendering.
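As an example of what server-side rendering looks like in practice, here is a minimal Express + React sketch (the App component, routes and port are hypothetical; a real project would typically rely on a framework such as Next.js or Angular Universal):

```typescript
import express from 'express';
import React from 'react';
import { renderToString } from 'react-dom/server';
import { App } from './App'; // hypothetical root component

const server = express();

server.get('*', (_req, res) => {
  // Render the initial HTML on the server so crawlers receive ready-to-index markup.
  const markup = renderToString(React.createElement(App));
  res.send(`<!DOCTYPE html>
<html>
  <head><title>SSR example</title></head>
  <body>
    <div id="root">${markup}</div>
    <!-- The client bundle hydrates the same markup to make it interactive -->
    <script src="/client.js"></script>
  </body>
</html>`);
});

server.listen(3000);
```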

Alternatively, you can opt to use dynamic rendering.

This can be especially useful when changes cannot be made to the front-end codebase. Dynamic rendering serves the client-side rendered version to users and switches to pre-rendered content for Search Engine crawlers.
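A rough sketch of the idea, assuming an Express front server (the bot pattern is simplified and the prerender() helper is hypothetical; in practice this role is usually filled by a service such as Rendertron or prerender.io):

```typescript
import express from 'express';

// Simplified detection of Search Engine crawlers via the User-Agent header.
const BOT_PATTERN = /googlebot|bingbot|yandex|baiduspider|duckduckbot/i;

const app = express();

app.use(async (req, res, next) => {
  const userAgent = req.headers['user-agent'] ?? '';
  if (BOT_PATTERN.test(userAgent)) {
    // Crawlers receive a fully rendered snapshot of the requested URL.
    const html = await prerender(`https://www.example.com${req.originalUrl}`); // hypothetical origin
    res.send(html);
  } else {
    // Regular users receive the client-side rendered application.
    next();
  }
});

app.use(express.static('dist'));
app.listen(3000);

// Hypothetical stand-in for a prerendering service or a headless-browser call.
async function prerender(url: string): Promise<string> {
  return `<!-- pre-rendered HTML for ${url} -->`;
}
```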

Google crawling times

Although opting for one of the solutions listed above solves most crawling problems, there is one more aspect worth considering.

Google has a two-phase indexing process: it initially crawls and indexes the static HTML, then returns later, when resources are available, to render the page and to crawl and index the content and links in the rendered HTML.

The delay between crawling and rendering can be very long (up to 7 days), even though the average time stated at the 2019 Chrome Dev Summit is 5 seconds. This is a problem for sites whose content is by nature time-sensitive or simply expires (news sites, events, etc.).

How to identify JavaScript

Identifying a site built with a JavaScript framework can be fairly straightforward; however, diagnosing sections, pages or smaller elements that are dynamically modified with JavaScript can be much more challenging. When preparing a Seo Audit, it is a good idea to establish right away whether JavaScript is present and to understand whether the site may run into problems when crawled with different User Agents.

There are several ways to find out whether a site is built with a JavaScript framework; let’s look at some of them:

  • 1. Identifying JavaScript with Screaming Frog

By default, the Seo Spider crawls using the “old AJAX crawling scheme”, which means JavaScript is disabled and only the raw HTML is crawled.

If the site relies entirely on client-side JavaScript, the crawl will return only the homepage with a 200 response, plus a few JavaScript and CSS files.

The “Outlinks” tab will also show no hyperlinks, since they are not present in the raw HTML and cannot be seen by the Seo Spider.

While this is often a sign that the website is using a JavaScript framework with client-side rendering, it does not tell you whether other JavaScript dependencies also exist on the site.

For example, a website might use JavaScript only to load products on category pages or to update title elements.
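A hypothetical client-side snippet of that kind, which a raw-HTML crawl would never reveal (element IDs and data are invented for illustration):

```typescript
// The page title and the product links below exist only after this script runs in the browser.
const category = { name: 'Shoes' };
const products = [
  { slug: 'runner-one', name: 'Runner One' },
  { slug: 'trail-two', name: 'Trail Two' },
];

document.title = `${category.name} - ${products.length} products`; // title updated by JavaScript

const grid = document.querySelector('#product-grid'); // assumed container in the static HTML
for (const product of products) {
  grid?.insertAdjacentHTML(
    'beforeend',
    `<a href="/products/${product.slug}">${product.name}</a>`
  );
}
```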

So how can you find these dependencies easily?

To identify JavaScript more efficiently, switch to JavaScript rendering mode (“Config > Spider > Rendering”) and crawl the site, or a sample of templates from across the site.

The SEO Spider will scan both the original and rendered HTML to identify pages that have content or links available only client-side and flag other key dependencies.

To guide you through the common problems of sites with client-side JavaScript, there is a comprehensive list of filters that will help you define your Seo Audit.

The Seo Spider will alert you to content that exists only in the HTML rendered after JavaScript has run, giving you a complete view of the website’s behavior.

In this example, 100% of the content is only in the rendered HTML since the site relies entirely on JavaScript.

In this case the check focused on the number of words found in the page content, but you can also find, for example, pages with links present only in the rendered HTML and remedy them by following Google’s guidelines.

This Screaming Frog feature is also very useful for checking whether meta tags and other Seo elements are present in the static HTML, or are only served and/or modified once JavaScript has been executed.

  • 2. Disable JavaScript in the browser

An alternative way to understand the nature of a website is to disable JavaScript directly in the browser and reload the page. If the page comes up blank, there is no doubt that the site was built with JavaScript.

  • 3. Search for JavaScript directly in the source code

The third solution is to right-click on the web page and view the source code, examining the HTML for text, links, or traces of the libraries of the various JS frameworks.

By following this process you are looking at the code before the browser processes it, which matches what the Seo Spider crawls when it is not in JavaScript rendering mode.

If you search for an element but find no occurrences in the source code, it is presumably generated dynamically in the DOM and will be visible only in the rendered code.

In this case the <body> tag is completely empty, a clear signal that the site relies on JS.
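You can run the same check from the command line. A small sketch (Node.js 18+ for the global fetch; URL and phrase are hypothetical) that tells you whether a piece of text you can see in the browser actually exists in the static HTML:

```typescript
async function isInRawHtml(url: string, phrase: string): Promise<boolean> {
  // Fetch the static HTML exactly as a non-rendering crawler would receive it.
  const html = await (await fetch(url)).text();
  return html.includes(phrase);
}

// If this prints false, the text is generated client-side and lives only in the rendered DOM.
isInRawHtml('https://www.example.com/product', 'Add to cart').then(console.log);
```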

  • 4. Do an Audit of the rendered code

The first question to ask yourself should be: how different is the rendered code from the static HTML source? To find out, just right-click and use “Inspect” in Chrome to see the rendered HTML. You can often spot the name of the JS framework in the rendered code, such as ‘React’ in the example below.

You will find that the content and hyperlinks are in the rendered code but not in the original HTML source. This is what the Seo Spider sees when it is in JavaScript rendering mode.

By right-clicking the opening <html> element and choosing “Copy > Copy outerHTML”, you can compare the rendered source code with the original.

JavaScript Crawl with Screaming Frog

Having established the presence of JS and understood the pitfalls, you can begin your Seo Audit with Screaming Frog by configuring it in JavaScript rendering mode. This allows you to crawl dynamic, JavaScript-rich websites and frameworks such as Angular, React, and Vue.js.

  • 1. To crawl a JavaScript website, open the Seo Spider, click on ‘Configuration > Spider > Rendering’ and change ‘Rendering’ to ‘JavaScript’.
  • 2. Configure User-Agent & Window Size

Configure the user-agent

Configuration > HTTP Header > User-Agent

And the size of the window

Configuration > Spider > Rendering

By default, the viewport is set to “Google Mobile: Smartphone”, in line with the Search Engine’s Mobile-First indexing approach.

  • 3. Check resources and external links

Make sure resources such as images, CSS, and JS are ticked under “Configuration > Spider”. If the resources are on a different subdomain, or on a separate root domain, also enable “Check External Links”, otherwise they will not be crawled and therefore not rendered.

  • 4. Run the scan
  • 5. See the JavaScript tab

With the JavaScript tab you have 15 filters to help you consider JavaScript dependencies and isolate common problems.

  • Pages with Blocked Resources: identifies all pages with resources (such as images, JavaScript and CSS) that are blocked by robots.txt. This condition can be a problem because Search Engines may not be able to access critical resources in order to render pages accurately. I recommend that you update the robots.txt file to allow all critical resources to be scanned and used to render site content. Resources that are not critical (e.g., Google Maps embed) can be ignored.
  • Contains JavaScript Links: the filter displays pages with hyperlinks that are only discovered in the rendered HTML after JavaScript execution. These hyperlinks are not in the raw HTML. Consider including important links server-side in the raw HTML.
  • Contains JavaScript Content: shows all pages that contain body text only in the HTML rendered after JavaScript execution. Google is able to render pages and see such client-side-only content, but consider including important content server-side in the raw HTML.
  • Noindex Only in Original HTML: all pages that contain a noindex tag in the static HTML but not in the rendered HTML. When Googlebot encounters a noindex tag, it skips rendering and JavaScript execution, so using JavaScript to remove the ‘noindex’ in the rendered HTML will not work (see the sketch after this list).
  • Nofollow Only in Original HTML: identifies pages that contain a nofollow attribute in static HTML, and not in rendered HTML. This means that any hyperlinks in the raw HTML before JavaScript is executed will not be followed. Remove ‘nofollow’ if links should be followed, scanned, and indexed.
  • Canonical Only in Rendered HTML: returns all pages that contain a canonical tag only in the rendered HTML after JavaScript execution. I recommend including a canonical link in the static HTML (or HTTP header) to ensure that Google can see it, rather than relying only on the canonical in the rendered HTML.
  • Canonical Mismatch: identifies pages whose canonical link in the static HTML differs from the one in the rendered HTML after JavaScript execution. This can send mixed signals and lead to undesirable behavior by the Search Engine.
  • Page Title Only in Rendered HTML: returns pages that contain a page title only in the rendered HTML after JavaScript execution. This means a Search Engine must render the page to see it. Consider including important content, such as the page title, server-side in the raw HTML.
  • Page Title Updated by JavaScript: identifies pages that have page titles modified by JavaScript. This means that the page title in static HTML is different from the page title in rendered HTML. Consider including important server-side content in static HTML.
  • Meta Description Only in Rendered HTML: returns pages that contain a meta description only in the rendered HTML after JavaScript execution.
  • Meta Description Updated by JavaScript: all pages that have meta descriptions that are modified by JavaScript. This means that the meta description in static HTML is different from the meta description in rendered HTML.
  • H1 Only in Rendered HTML: pages that contain an h1 only in the rendered HTML after JavaScript execution. This means that a Search Engine must render the page to consider the heading. Consider including server-side headings in the static HTML.
  • H1 Updated by JavaScript: pages that have h1s modified by JavaScript. This means that the h1 in the raw HTML is different from the h1 in the rendered HTML.
  • Uses Old AJAX Crawling Scheme URLs: identifies URLs that are still using the old AJAX crawling scheme (a URL containing a hash fragment #!) that was officially deprecated as of October 2015. Consider server-side rendering or pre-rendering where possible, and dynamic rendering as an alternative solution.
  • Uses Old AJAX Crawling Scheme Meta Fragment Tag: identifies URLs with a meta fragment tag indicating that the page still uses the old AJAX crawling scheme, officially deprecated as of October 2015. The advice is to update these URLs to follow the evolution of JavaScript on today’s websites. Consider server-side rendering or pre-rendering where possible, and dynamic rendering as a workaround. If the meta fragment tag is simply left over in error, it should be removed.
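Regarding the “Noindex Only in Original HTML” filter mentioned above, this is the anti-pattern it catches, sketched as a client-side script (hypothetical markup): because Googlebot does not render pages whose static HTML carries a noindex, this code is never taken into account for indexing.

```typescript
// Anti-pattern: trying to lift a noindex after the fact with client-side JavaScript.
// Googlebot skips rendering when the static HTML already says noindex, so this never runs for it.
document.querySelector('meta[name="robots"][content*="noindex"]')?.remove();
document.head.insertAdjacentHTML(
  'beforeend',
  '<meta name="robots" content="index, follow">'
);
```

The only reliable fix is to remove the noindex from the initial HTTP response or the static HTML itself.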
  • 6. Monitor blocked resources

Keep an eye on anything that appears under the “Pages with Blocked Resources” filter in the “JavaScript” tab.

Blocked resources can also be viewed for each page within the ‘Rendered Page’ tab, adjacent to the rendered screenshot in the bottom panel of the window. In severe cases, if a JavaScript site blocks its JS resources entirely, the site simply will not be crawled.

You can also see any blocked resources in the “Response Codes” tab.

Response Codes > Blocked Resource

Blocked pages and individual resources can also be bulk exported via “Bulk Export > Response Codes > Blocked Resource Inlinks” so that they can be unblocked and optimized.

  • 7. View rendered pages

With Screaming Frog, it is possible to display the crawled rendered page in the ‘Rendered Page’ tab that dynamically appears at the bottom of the user interface when crawling in JavaScript rendering mode.

If the rendered page does not load, consider adjusting the AJAX timeout directly in the Seo Spider configuration.

Viewing the rendered page is vital when analyzing what a modern search bot is able to see, and it is especially useful when reviewing a staging environment, where Google’s own “Fetch & Render” functionality in Search Console cannot be used.

If you have set the user-agent and viewport to Googlebot Smartphone, for example, you can see exactly how each page renders on mobile.

  • 8. Compare raw and rendered HTML

When working with JavaScript, you may want to store and view both the raw HTML and the rendered HTML within the Seo Spider.

Configuration > Spider > Extraction > Store HTML / Store Rendered HTML

By ticking the “Store HTML” and “Store Rendered HTML” options you can easily compare the two versions and check the differences, to be sure that critical content and links are present in the DOM.

This is super useful for a variety of scenarios, such as debugging the differences between what is seen in a browser and in the SEO Spider, or simply when analyzing how the JavaScript was rendered, and whether certain elements are within the code.

If the “Contains JavaScript Content” filter is triggered for a page, you can switch the lower view from ‘HTML’ to ‘Visible Content’ to identify exactly which text is only in the rendered HTML.

  • 9. Identify JavaScript-only links

Another important check concerns links present only in the rendered HTML. Through the “Contains JavaScript Links” filter, you can identify which hyperlinks are only discovered after JavaScript execution.

Having collected the data, simply click a URL in the top window, then open the “Outlinks” tab at the bottom and select “Rendered HTML” as the filter.
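If you want to double-check the same thing outside the Seo Spider, a sketch of the underlying idea (assuming Node.js 18+, Puppeteer and the node-html-parser package; the URL is hypothetical) is to collect the hrefs from the static HTML and from the rendered DOM and diff them:

```typescript
import puppeteer from 'puppeteer';
import { parse } from 'node-html-parser';

async function javascriptOnlyLinks(url: string): Promise<string[]> {
  // Links present in the static HTML response.
  const rawHtml = await (await fetch(url)).text();
  const rawHrefs = new Set(
    parse(rawHtml).querySelectorAll('a[href]').map((a) => a.getAttribute('href'))
  );

  // Links present in the DOM after JavaScript execution.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  const renderedHrefs = await page.$$eval('a[href]', (anchors) =>
    anchors.map((a) => a.getAttribute('href') ?? '')
  );
  await browser.close();

  // Hyperlinks discovered only after JavaScript has run.
  return renderedHrefs.filter((href) => href && !rawHrefs.has(href));
}

javascriptOnlyLinks('https://www.example.com').then(console.log).catch(console.error);
```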

  • 10. Configure the AJAX Timeout

Based on how the crawl responds, you can adjust the Timeout setting, which defaults to 5 seconds. This feature is very useful if the Seo Spider has problems rendering pages.

Configuration > Spider > Rendering

The 5-second timeout is generally fine for most websites; Googlebot itself is more flexible, adjusting to how long a page takes to load its content, monitoring network activity, and relying heavily on caching.
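To get a feel for what such a time budget means, here is a sketch of a fixed rendering budget with a headless browser (Puppeteer assumed, values illustrative), loosely comparable to the Spider’s AJAX timeout rather than to Googlebot’s adaptive behavior:

```typescript
import puppeteer from 'puppeteer';

async function snapshot(url: string, ajaxTimeoutMs = 5_000): Promise<string> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Wait for the network to go quiet, then allow a fixed extra budget for late scripts.
  await page.goto(url, { waitUntil: 'networkidle0' });
  await new Promise((resolve) => setTimeout(resolve, ajaxTimeoutMs));

  const html = await page.content(); // the "snapshot" that gets analyzed
  await browser.close();
  return html;
}

snapshot('https://www.example.com').then((html) => console.log(html.length)).catch(console.error);
```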


Crawl JavaScript Sites: Video Tutorial

React vs Google

Seo Spider Tab