XML Sitemaps: The Most Misunderstood Tool in the SEO’s Toolbox

Author: Michael Cottam / Source: Moz

In all my years of SEO consulting, I’ve seen many clients with wild misconceptions about XML sitemaps. They’re a powerful tool, for sure — but like any power tool, a little training and background on how all the bits work goes a long way.

Indexation

Probably the most common misconception is that the XML sitemap helps get your pages indexed. The first thing we’ve got to get straight is this: Google does not index your pages just because you asked nicely. Google indexes pages because (a) they found them and crawled them, and (b) they consider them good enough quality to be worth indexing. Pointing Google at a page and asking them to index it doesn’t really factor into it.

Having said that, it is important to note that by submitting an XML sitemap to Google Search Console, you’re giving Google a clue that you consider the pages in the XML sitemap to be good-quality search landing pages, worthy of indexation. But, it’s just a clue that the pages are important… like linking to a page from your main menu is.
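
For reference, here’s roughly what a bare-bones XML sitemap looks like (the domain, URL, and date below are made-up placeholders, not anything from a real site):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <!-- One <url> entry per page you consider a worthy search landing page -->
    <url>
      <loc>https://www.example.com/widgets/blue-widget/</loc>
      <lastmod>2017-06-01</lastmod>
    </url>
  </urlset>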

Consistency

One of the most common mistakes I see clients make is to lack consistency in the messaging to Google about a given page. If you block a page in robots.txt and then include it in an XML sitemap, you’re being a tease. “Here, Google… a nice, juicy page you really ought to index,” your sitemap says. But then your robots.txt takes it away. Same thing with meta robots: Don’t include a page in an XML sitemap and then set meta robots “noindex,follow.”
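
Here’s the kind of mixed signal I mean, side by side (the /members/ path and example.com are hypothetical):

  # robots.txt says "stay out"...
  User-agent: *
  Disallow: /members/

  <!-- ...while sitemap.xml nominates a page in that very section for indexation -->
  <url>
    <loc>https://www.example.com/members/profile/</loc>
  </url>

Pick one message per page and stick to it.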

While I’m at it, let me rant briefly about meta robots: “noindex” means don’t index the page. “Nofollow” says nothing about indexing that page; it means “don’t follow the links outbound from that page,” i.e. go ahead and flush all that link juice down the toilet. There’s probably some obscure reason out there for setting meta robots “noindex,nofollow,” but what that might be is beyond me. If you don’t want Google to index a page, set meta robots to “noindex,follow.”
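
In tag form, the difference looks like this:

  <!-- Keeps the page out of the index, but still passes link equity through its outbound links -->
  <meta name="robots" content="noindex,follow">

  <!-- Keeps the page out of the index AND throws away the links on it; rarely what you want -->
  <meta name="robots" content="noindex,nofollow">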

OK, rant over…

In general, then, you want every page on your site to fall into one of two buckets:

  1. Utility pages (useful to users, but not anything you’d expect to be a search landing page)
  2. Yummy, high-quality search landing pages

Everything in bucket #1 should either be blocked by robots.txt or blocked via meta robots “noindex,follow” and should not be in an XML sitemap.

Everything in bucket #2 should not be blocked in robots.txt, should not have meta robots “noindex,” and probably should be in an XML sitemap.
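
Putting those two buckets into practice, a consistent setup might look something like this (the paths are hypothetical examples of utility pages):

  # robots.txt: keep crawlers out of pure utility sections
  User-agent: *
  Disallow: /login/
  Disallow: /lost-password/
  Disallow: /share/

Everything else stays crawlable, carries no “noindex,” and the best of it goes into the XML sitemap you submit in Google Search Console.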

Overall site quality

It would appear that Google is taking some measure of overall site quality, and using that site-wide metric to impact ranking — and I’m not talking about link juice here.

Think about this from Google’s perspective. Let’s say you’ve got one great page full of fabulous content that ticks all the boxes, from relevance to Panda to social media engagement. If Google sees your site as 1,000 pages of content, of which only 5–6 pages are like this one great page… well, if Google sends a user to one of those great pages, what’s the user experience going to be like if they click a link on that page and visit something else on your site? Chances are, they’re going to land on a page that sucks. It’s bad UX. Why would they want to send a user to a site like that?

Google engineers certainly understand that every site has a certain number of “utility” pages that are useful to users, but not necessarily content-type pages that should be landing pages from search: pages for sharing content with others, replying to comments, logging in, retrieving a lost password, etc.

If your XML sitemap includes all of these pages, what are you communicating to Google? More or less that you have no clue as to what constitutes good content on your site and what doesn’t.

Here’s the picture you want to paint for Google instead. Yes, we have a site here with 1,000 pages… and here are the 475 of those 1,000 that are our great content pages. You can ignore the others — they’re utility pages.

Now, let’s say Google crawls those 475 pages, and with their metrics, decides that 175 of those are “A” grade, 200 are “B+,” and 100 are “B” or “B-.” That’s a pretty good overall average, and probably indicates a pretty solid site to send users to.

Contrast that with a site that submits all 1,000 pages via the XML sitemap. Now, Google looks at the 1,000 pages you say are good content, and sees over 50% are “D” or “F” pages. On average, your site is pretty sucky; Google probably doesn’t want to send users to a site like that.

The hidden fluff

Remember, Google is going to use what you submit in your XML sitemap as a clue to what’s probably important on your site. But just because a page isn’t in your XML sitemap doesn’t necessarily mean that Google will ignore it. You could still have many thousands of…
