Google underscores filenames

by cameron

Google’s ranking algorithm takes into account much more than just the infamous PageRank. In fact, Google claims to use over 100 factors in determining the order of results returned for each query. The specific features and weightings that go into this calculation are the special sauce that makes Google so wonderful.

One of my jobs recently has been to explain to various branches in the Division of STD Prevention here at the CDC the meaning of their rankings, and how to structure their site to be most effectively indexed by engines like Google. For example, I was asked why searches for different STDs show the Division’s pages at different rankings:

Search for syphilis: #1
Search for herpes: #10

Good question. Why was the CDC winning with syphilis but outranked by herpes.com for herpes? Was it because of a commercial interest in herpes treatment? Or perhaps because herpes is so much more prevalent that there is more competition to provide information? And of course, there is always the possibility that it is related to the quirkiness of Google’s ranking algorithm.

At first I assumed that this effect was related to PageRank. I’ve explained my way out of many a hole with that line of reasoning before, but this time it didn’t help: the Google toolbar revealed that in both cases, the CDC site had higher PageRank than anyone above it (6/10 as compared to 5/10 for the others).

On the other documented ranking features, the CDC site appeared to be neck and neck with each of these sites. I checked these features using Google’s advanced search syntax:

Search for herpes in the title
Search for herpes in anchor tags
Search for herpes in the text
Search for herpes in the URL

If you look closely you’ll notice the CDC fact sheet for herpes in the top 10 for each query except the last one. Despite the fact that the URL ends with the filename facts_Genital_Herpes.htm, Google does not find the word “herpes.”
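For reference, checks like the four above can be written with Google’s intitle:, inanchor:, intext:, and inurl: search operators. The snippet below is only a sketch of how such queries might be built. The exact query strings behind the searches listed above are not reproduced here, so the ones generated below are an assumption.

```python
term = "herpes"

# Google search operators that restrict where the term must appear.
# Pairing each operator with the bare term is an assumption about the
# queries, not a record of what was actually typed.
operators = ["intitle", "inanchor", "intext", "inurl"]

queries = [f"{op}:{term}" for op in operators]

for query in queries:
    print(query)
```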

After a bit of prodding, I realized that Google is indexing this filename as a single word because of the underscores. It appears that Google does not split tokens on an underscore, while it does split on just about every other punctuation character. Instead of seeing “facts genital herpes” as a human does, Google indexes the filename as factsUNDERSCOREgenitalUNDERSCOREherpes.
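The same behavior is easy to reproduce with an ordinary regular expression. This is not Google’s tokenizer, which is not public. It is just a sketch of how any tokenizer that treats the underscore as a word character (as the common \w class does) would see the filename, compared with one that treats it as punctuation.

```python
import re

filename = "facts_Genital_Herpes.htm".lower()

# \w matches letters, digits, and the underscore, so an underscore
# does not end a token: the whole name stays glued together.
word_char_tokens = re.findall(r"\w+", filename)

# Treating the underscore like any other punctuation character
# splits the name into the words a human reader sees.
split_tokens = re.findall(r"[a-z0-9]+", filename)

print(word_char_tokens)  # ['facts_genital_herpes', 'htm']
print(split_tokens)      # ['facts', 'genital', 'herpes', 'htm']
```

With the first tokenizer, a search for “herpes” has nothing to match in the URL. With the second, it does.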

This has many ramifications for weblogs, especially those using individual archiving with filenames that include the title of the post. When encoding the title, those using Movable Type probably use the dirify tag attribute, which explicitly converts spaces into underscores. If the separator character could be specified, the average post would likely rank higher for words in its title.
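To make the idea concrete, here is a minimal sketch of a title-to-filename helper with a configurable separator. The function name title_to_filename and its separator parameter are hypothetical, and this is not how Movable Type’s dirify is implemented, which follows its own rules.

```python
import re


def title_to_filename(title: str, separator: str = "_") -> str:
    """Hypothetical helper: lowercase the title, drop punctuation,
    and join the remaining words with the chosen separator."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return separator.join(words)


# Default behavior, spaces become underscores.
print(title_to_filename("Google underscores filenames"))
# google_underscores_filenames

# Same title with a hyphen, which a punctuation-splitting indexer
# would break into separate words.
print(title_to_filename("Google underscores filenames", "-"))
# google-underscores-filenames
```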

Alternatively, Google could just break on underscores as it does on other punctuation. Of course, I wouldn’t expect them to come running to satisfy the desires of a single weblogger.