Evidence of the Surprising State of JavaScript Indexing

Author: Will Critchlow / Source: Moz Back when I started in this industry, it was standard advice to tell our clients that the search eng

Author: Will Critchlow / Source: Moz

Back when I started in this industry, it was standard advice to tell our clients that the search engines couldn’t execute JavaScript (JS), and anything that relied on JS would be effectively invisible and never appear in the index. Over the years, that has changed gradually, from early work-arounds (such as the horrible escaped fragment approach my colleague Rob wrote about back in 2010) to the actual execution of JS in the indexing pipeline that we see today, at least at Google.

In this article, I want to explore some things we’ve seen about JS indexing behavior in the wild and in controlled tests and share some tentative conclusions I’ve drawn about how it must be working.

A brief introduction to JS indexing

At its most basic, the idea behind JavaScript-enabled indexing is to get closer to the search engine seeing the page as the user sees it. Most users browse with JavaScript enabled, and many sites either fail without it or are severely limited. While traditional indexing considers just the raw HTML source received from the server, users typically see a page rendered based on the DOM (Document Object Model) which can be modified by JavaScript running in their web browser. JS-enabled indexing considers all content in the rendered DOM, not just that which appears in the raw HTML.

There are some complexities even in this basic definition (answers in brackets as I understand them):

What about JavaScript that requests additional content from the server? (This will generally be included, subject to timeout limits)
What about JavaScript that executes some time after the page loads? (This will generally only be indexed up to some time limit, possibly in the region of 5 seconds)
What about JavaScript that executes on some user interaction such as scrolling or clicking? (This will generally not be included)
What about JavaScript in external files rather than in-line? (This will generally be included, as long as those external files are not blocked from the robot — though see the caveat in experiments below)

For more on the technical details, I recommend my ex-colleague Justin’s writing on the subject.

A high-level overview of my view of JavaScript best practices

Despite the incredible work-arounds of the past (which always seemed like more effort than graceful degradation to me) the “right” answer has existed since at least 2012, with the introduction of PushState. Rob wrote about this one, too. Back then, however, it was pretty clunky and manual and it required a concerted effort to ensure both that the URL was updated in the user’s browser for each view that should be considered a “page,” that the server could return full HTML for those pages in response to new requests for each URL, and that the back button was handled correctly by your JavaScript.

Along the way, in my opinion, too many sites got distracted by a separate prerendering step. This is an approach that does the equivalent of running a headless browser to generate static HTML pages that include any changes made by JavaScript on page load, then serving those snapshots instead of the JS-reliant page in response to requests from bots. It typically treats bots differently, in a way that Google tolerates, as long as the snapshots do represent the user experience. In my opinion, this approach is a poor compromise that’s too susceptible to silent failures and falling out of date. We’ve seen a bunch of sites suffer traffic drops due to serving Googlebot broken experiences that were not immediately detected because no regular users saw the prerendered pages.

These days, if you need or want JS-enhanced functionality, more of the top frameworks have the ability to work the way Rob described in 2012, which is now called isomorphic (roughly meaning “the same”).

Isomorphic JavaScript serves HTML that corresponds to the rendered DOM for each URL, and updates the URL for each “view” that should exist as a separate page as the content is updated via JS. With this implementation, there is actually no need to render the page to index basic content, as it’s served in response to any fresh request.

I was fascinated by this piece of research published recently — you should go and read the whole study. In particular, you should watch this video (recommended in the post) in which the speaker — who is an Angular developer and evangelist — emphasizes the need for an isomorphic approach:

Resources for auditing JavaScript

If you work in SEO, you will increasingly find yourself called upon to figure out whether a particular implementation is correct (hopefully on a staging/development server before it’s deployed live, but who are we kidding? You’ll be doing this live, too).

To do that, here are some resources I’ve found useful:

The developer tools built into Chrome are excellent, and some of the documentation is actually really good:
- The console is where you can see errors and interact with the state of the page
- As soon as you get past debugging the most basic JavaScript, you will want to start setting breakpoints, which allow you to step through the code from specified points
Although it’s about a broader set of technical skills, anyone who hasn’t already read it should definitely check out Mike’s post on the technical SEO renaissance.

Some surprising/interesting results

There are likely to be timeouts on JavaScript execution

I already linked above to the ScreamingFrog post that mentions experiments they have done to measure the timeout Google uses to determine when to stop executing JavaScript (they found a limit of around 5…

Click here to read more