Duplicating optimised content across multiple pages will cause cannibalisation issues, wherein Google will not know which internal page to rank for the term. From there, right click again, but on the attribute, go to copy, then select copy selector: From here you need to go into Screaming Frog. The output will only work if you have a single content column, so if you have had to extract multiple content blocks on each URL simply combine them before you paste into the tool. Normally it is pretty good at this, but if our website has issues will crawl budget, or gets its content syndicated by a website with far greater authority, there is a chance that Google will show them over us. What you’re essentially doing here is performing a search for a block of your content in Google. To find content gaps it’s the same process as before. Without a significant amount of valuable content on a page Google will not be able to understand the topic of the page and so the page will struggle to rank for anything at all. Google needs content in order to rank a page, this is SEO 101! In order to diagnose how much above the fold content we have, we will need to rerun our crawls, but this time we only want to extract the content blocks which are visible on page load. From here, just running the above word count formula will be sufficient to diagnose content gaps.
In 2016, Google rolled Panda into its core algorithm. What this means for webmasters is that a website can be hit by (and recover from) a content penalty at any time.
But, more problematically, it also means it’s becoming impossible to diagnose why a website has dropped in rankings. Google ultimately does not want us to understand how its ranking algorithm works, because there will always be people who manipulate it. We now suspect that core signals are rolled out so slowly that SEOs won’t even realise when Penguin or Panda has refreshed.
For this reason, it makes it crucial that we understand how well our website is performing at all times. This blog post is intended to show you how to do a comprehensive content audit at scale, in order to find any gaps which may lead to rankings penalties.
Essentially, there are five types of content gaps a website may suffer from. I’ll explain each one, and show you how you can find every instance of it occurring on your website FAST.
1. Internally duplicated content
Internally duplicated content is the daddy of content gaps. Duplicating optimised content across multiple pages will cause cannibalisation issues, wherein Google will not know which internal page to rank for the term. The pages will compete for ranking signals with each other, reducing the rankings as a result.
Further to this, if you have enough duplicate content within a directory on your site, Google will treat that entire directory as low quality and penalise the rankings. Should the content be hosted on the root, your site’s entire rankings are under threat.
To find these at scale you have to use Screaming Frog’s custom extraction configuration to pull all your content from your site, then compare in Excel for duplicates. Using this method, I was able to find 6,000 duplicate pages on a site within a few hours.
To configure Screaming Frog you first need to copy the CSS selector for your content blocks on all your pages. This should be relatively simple should your pages follow a consistent template.
Go onto the page, right click on the content and go to inspect element. This will open up the right panel at the exact attribute which you right clicked on. From there, right click again, but on the attribute, go to copy, then select copy selector:
From here you need to go into Screaming Frog. Go into the configuration drop down and select custom, then extraction.
From here this will bring up the below box, which you will then need to select CSSPath as the mode, and then paste your selector into the highlighted field, and change the second drop down to extract text.
Now if I run a crawl on the ASOS website, Screaming Frog is going to extract the above the fold content from all the category pages.
You can specify up to 10 separate paths to extract, so when you have templates with multiple content blocks, for example above and below the products, or category and product & static page templates, you will need to specify them in the same way I have just shown.
Now, it is going to be very unlikely that an entire page’s content, or an entire block of content, will be duplicated. Usually spun content keeps the majority of the content the same but will replace specific keywords. So trying to find duplicate content based on entire blocks of content is pretty futile. Luckily I have created a duplicate content tool for this exact situation.
Fire up the tool and input just the URL and content into the specified columns on the input tab. Essentially what we have to do is split the content down into sentences and compare occurrences this way.
The output will only work if you have a single content column, so if you have had to extract multiple content blocks on each URL simply combine them before you paste into the tool. You can do this with the concatenate function which for example would look like this (should your content be in columns A2 & B2).
=concatenate(A2, “.”, “B2”)
The period in the middle is essential (if your content does not contain periods at the end) because we are going to split the content out by text to columns by full sentences, so will use the period as the delimiter.
Next highlight the column with the content in it, then select text to columns. Select delimited, then select other…