Tackling Tag Sprawl: Crawl Budget, Duplicate Content, and User-Generated Content

Author: Russ Jones / Source: Moz Alright, so here's the situation. You have a million-product website. Your competitors have a lot of the

Author: Russ Jones / Source: Moz

Alright, so here’s the situation. You have a million-product website. Your competitors have a lot of the same products. You need unique content. What do you do? The same thing everyone does — you turn to user-generated content. Problem solved, right?

User-generated content (UGC) can be an incredibly valuable source of content and organization, helping you build natural language descriptions and human-driven organization of site content. One common feature used by sites to take advantage of user-created content are tags, found everywhere from e-commerce sites to blogs. Webmasters can leverage tags to power site search, create taxonomies and categories of products for browsing, and to provide rich descriptions of site content.

This is a logical and practical approach, but can cause intractable SEO problems if left unchecked. For mega-sites, manually moderating millions of user-submitted tags can be cumbersome (if not wholly impossible). Leaving tags unchecked, though, can create massive problems with thin content, duplicate content, and general content sprawl. In our case study below, three technical SEOs from different companies joined forces to solve a massive tag sprawl problem. The project was led by Jacob Bohall, VP of Marketing at Hive Digital, while computational statistics services were provided by J.R. Oakes of Adapt Partners and Russ Jones of Moz. Let’s dive in.

What is tag sprawl?

We define tag sprawl as the unchecked growth of unique, user-contributed tags resulting in a large amount of near-duplicate pages and unnecessary crawl space. Tag sprawl generates URLs likely to be classified as doorway pages, pages appearing to exist only for the purpose of building an index across an exhaustive array of keywords. You’ve probably seen this in its most basic form in the tagging of posts across blogs, which is why most SEOs recommend a blanket “noindex, follow” across tag pages in WordPress sites. This simple approach can be an effective solution for small blog sites, but is not often the solution for major e-commerce sites that rely more heavily on tags for categorizing products.

The three following tag clouds represent a list of user-generated terms associated with different stock photos. Note: User behavior is generally to place as many tags as possible in an attempt to ensure maximum exposure for their products.

USS Yorktown, Yorktown, cv, cvs-10, bonhomme richard, revolutionary war-ships, war-ships, naval ship, military ship, attack carriers, patriots point, landmarks, historic boats, essex class aircraft carrier, water, ocean
ship, ships, Yorktown, war boats, Patriot pointe, old war ship, historic landmarks, aircraft carrier, war ship, naval ship, navy ship, see, ocean
Yorktown ship, Warships and aircraft carriers, historic military vessels, the USS Yorktown aircraft carrier

As you can see, each user has generated valuable information for the photos, which we would want to use as a basis for creating indexable taxonomies for related stock images. However, at any type of scale, we have immediate threats of:

Thin content: Only a handful of products share the user-generated tag when a user creates a more specific/defining tag, e.g. “cvs-10”
Duplicate and similar content: Many of these tags will overlap, e.g. “USS Yorktown” vs. “Yorktown,” “ship” vs. “ships,” “cv” vs. “cvs-10,” etc.
Bad content: Created by improper formatting, misspellings, verbose tags, hyphenation, and similar mistakes made by users.

Now that you understand what tag sprawl is and how it negatively effects your site, how can we address this issue at scale?

The proposed solution

In correcting tag sprawl, we have some basic (at the surface) problems to solve. We need to effectively review each tag in our database and place them in groups so further action can be taken. First, we determine the quality of a tag (how likely is someone to search for this tag, is it spelled correctly, is it commercial, is it used for many products) and second, we determine if there is another tag very similar to it that has a higher quality.

Identify good tags: We defined a good tag as term capable of contributing meaning, and easily justifiable as an indexed page in search results. This also entailed identifying a “master” tag to represent groups of similar terms.
Identify bad tags: We wanted to isolate tags that should not appear in our database due to misspellings, duplicates, poor format, high ambiguity, or likely to cause a low-quality page.
Relate bad tags to good tags: We assumed many of our initial “bad tags” could be a range of duplicates, i.e. plural/singular, technical/slang, hyphenated/non-hyphenated, conjugations, and other stems. There could also be two phrases which refer to the same thing, like “Yorktown ship” vs. “USS Yorktown.” We need to identify these relationships for every “bad” tag.

For the project inspiring this post, our sample tag database comprised over 2,000,000 “unique” tags, making this a nearly impossible feat to accomplish manually. While theoretically we could have leveraged Mechanical Turk or similar platform to get “manual” review, early tests of this method proved to be unsuccessful. We would need a programmatic method (several methods, in fact) that we could later reproduce when adding new tags.

The methods

Keeping the goal in mind of identifying good tags, labeling bad tags, and relating bad tags to good tags, we employed more than a dozen methods, including: spell correction, bid value, tag search volume, unique visitors, tag count, Porter stemming, lemmatization, Jaccard index, Jaro-Winkler distance, Keyword Planner grouping, Wikipedia disambiguation, and K-Means clustering with word vectors. Each method either helped us determine whether the tag was valuable and, if not, helped us identify an alternate tag that was valuable.

Spell correction

Method: One of the obvious issues with user-generated content is the occurrence of misspellings. We would regularly find misspellings where semicolons are transposed for the letter “L” or words have unintended characters at the beginning or end. Luckily, Linux has an excellent built-in spell checker called Aspell which we were able to use to fix a large volume of issues.
Benefits: This offered a quick, early win in that it was fairly easy to identify bad tags when they were composed of words that weren’t included in the dictionary or included characters that were simply inexplicable (like a semicolon in the middle of a word). Moreover, if the corrected word or phrase occurred in the tag list, we could trust the corrected phrase as a potentially good tag, and relate the misspelled term to the good tag. Thus, this method help us both filter bad tags (misspelled terms) and find good tags (the spell-corrected term)
Limitations: The biggest limitation with this methodology was that combinations of correctly spelled words or phrases aren’t necessarily useful for users or the search engine. For example, many of the tags in the database were concatenations of multiple tags where the user space-delimited rather than comma-delimited their submitted tags. Thus, a tag might consist of correctly spelled terms but still be useless in terms of search value. Moreover, there were substantial dictionary limitations, especially with domain names, brand names, and Internet slang. In order to accommodate this, we added a personal dictionary that included a list of the top 10,000 domains according to Quantcast, several thousand brands, and a slang dictionary. While this was helpful, there were still several false recommendations that needed to be handled. For example, we saw “purfect” correct to “perfect,” despite being a pop-culture reference for cat images. We also noticed some users reference this saying as “purrfect,” “purrrfect,” “purrrrfect,” “purrfeck,” etc. Ultimately, we had to rely on other metrics to determine whether we trusted the misspelling recommendations.

Bid value

Method: While a tag might be good in the sense that it is descriptive, we wanted tags that were commercially relevant. Using the estimated cost-per-click of the tag or tag phrase proved useful in making sure that the term could attract buyers, not just visitors.
Benefits: One of the great features of this methodology is that it tends to have a high signal-to-noise ratio. Most tags that have high CPCs tend to be commercially relevant and searched frequently enough to warrant inclusion as “good tags.” In many cases we could feel confident that a tag was good just on this metric alone.
Limitations: However, the bid value metric comes with some pretty big limitations, too. For starters, Google Keyword Planner’s disambiguation problem is readily apparent. Google combines related keywords together when reporting search volume and CPC data, which means a tag like “facbook” would return the same data as “facebook.” Obviously, we would prefer to map “facbook” to “facebook” rather than keep both tags, so in some cases the CPC metric wasn’t sufficient to identify good tags. A further limitation of the bid value was the difficulty of acquiring CPC data. Google now requires running active Adwords campaigns to get access to CPC value. It is no simple feat to look up 5,000,000 keywords in Google Keyword Planner, even if you have a sufficient account. Luckily, we felt comfortable that historical data would be trustworthy enough, so we didn’t need to acquire fresh data.

Tag search volume

Method: Similar to CPC, we could use search volume to determine the potential value of a tag. We had to be careful not to rely on the tag itself, though, since the tag could be so generic that it earns traffic unrelated to the product itself. For example, the tag “USS Yorktown” might get a few hundred searches a month, but “USS Yorktown T-shirt” gets 0. For all of the tags in our index, we tracked down the search volume for the tag plus the product name, in order to make sure we had good estimates of potential product traffic.
Benefits: Like CPC, this metric did a very good job of consolidating our tag data set to just keywords that were likely to deliver traffic. In the vast majority of cases, if “tag + product” had search volume, we could feel confident that it is a good term.
Limitations: Unfortunately, this method fell victim to the same disambiguation problem that CPC presents. Because Google groups terms together, it is possible that on some occasions two tags will be given the same metrics. For example: “pontoons boat,” “pontoonboat,” “pontoon boats,” “pontoon boat,” “pontoon boating,” and “pontoons boats” were in the same traffic volume group which also included tags like “yacht” and “yachts.” Moreover, there is no accounting for keyword difficulty in this metric. Some tags, when combined with product types, produce keywords that receive substantial traffic but will always be out of reach for a templated tag page.

Unique visitors

Method: This method was a no-brainer: protect the tags that already receive traffic from Google. We exported all of the tags from Google Analytics that had received search traffic from Google in the last 12 months. Generally speaking, this should be a fairly safe list of terms.
Benefits: When doing experimental work with a client, it is always nice to be able to give them a scenario that almost guarantees improvement. Because we were able to protect tags that already receive traffic by labeling them as good (in the vast majority of cases), we could ensure that the client had a high probability of profiting from the changes we made and minimal risk of any traffic loss.
Limitations: Unfortunately, even this method wasn’t perfect. If a product (or set of products) with high enough authority included a poor variation of a tag, then the bad variant would rank and receive traffic. We had to use other strategies to verify our selections from this method and devise a method to encourage a tag swap in the index for the correct version of a term.

Tag count

Description: The frequency with which a tag was used on the site was often a strong signal that we could trust the tag, especially when compared with other similar tags. By counting the number of times each tag was used on the site, we could bias our final set of trusted tags in favor of these more popular terms.
Benefits: This was a great tie-breaker metric when we had two tags that were very similar but needed to choose just one. For example, sometimes two variants of a phrase were completely acceptable (such as a version with and without a hyphen). We could simply defer to the one with a higher tag count.
Limitations: The clear limitation of tag frequency is that many of the most frequent tags were too generic to be useful. The tag “blue” isn’t particularly useful when it just helps people find “blue t-shirts.” The term is too generic and too competitive to warrant inclusion. Additionally, the inclusion of too broad of a tag would simply create a very large crawl vs. traffic-potential ratio. A common tag will…

Click here to read more