Google’s ranking algorithm takes into account much more than just the infamous PageRank. In fact, they claim to use over 100 factors in determining the order of results returned for each query made. The specific features and weightings that go into this calculation are the special sauce that makes Google so wonderful.
One of my jobs recently has been to explain to various branches in the Division of STD Prevention here at the CDC the meaning of their rankings, and how to structure their site to be most effectively indexed by engines like Google. For example, I was asked why searches for different STDs show the Division’s pages at different rankings:
Search for syphilis: #1
Search for herpes: #10
Good question. Why was the CDC winning with syphilis and outranked by herpes.com? Was it because of a commercial interest in herpes treatment? Or perhaps because herpes is so much more prevalent that there is more competition for providing information? And of course, there is always the possibility that it is related to the quirkiness of Google’s ranking algorithm.
At first I assumed that this effect was related to PageRank. I’ve explained my way out of many a hole with that line of reasoning before, but this time it didn’t help: the Google toolbar revealed that in both cases, the CDC site had higher PageRank than anyone above it (6/10 as compared to 5/10 for the others).
From the other documented ranking features, it appeared that the CDC site was neck and neck with each of these sites. I checked these features using Google’s advanced search syntax:
Search for herpes in the title
Search for herpes in anchor tags
Search for herpes in the text
Search for herpes in the URL
If you look closely you’ll notice the CDC fact sheet for herpes in the top 10 for each query except the last one. Despite the fact that the URL ends with the filename facts_Genital_Herpes.htm, Google does not find the word “herpes” in the URL.
After a bit of prodding, I realized that Google is indexing this last filename all as one word due to the underscores. It appears that an underscore is not a character that Google splits tokens on, while just about every other punctuation character is. Instead of appearing as “facts genital herpes” as a human sees it, Google is indexing the above filename as factsUNDERSCOREgenitalUNDERSCOREherpes.
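The difference is easy to reproduce with a toy tokenizer. This is only a sketch of the behavior observed above, not Google’s actual tokenizer, which is not public; both function names are made up for illustration.

```python
import re

def tokenize_keeping_underscores(text):
    """Split on every character except letters, digits, and underscores.

    Assumption: this mimics the behavior observed in the search results,
    where an underscore does not end a token.
    """
    return [t.lower() for t in re.findall(r"[A-Za-z0-9_]+", text)]

def tokenize_splitting_underscores(text):
    """Same idea, but underscores split tokens like other punctuation."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

filename = "facts_Genital_Herpes.htm"

# The whole name before the dot comes out as a single token,
# so a lookup for the word "herpes" finds nothing.
print(tokenize_keeping_underscores(filename))

# Here "facts", "genital", and "herpes" are separate tokens.
print(tokenize_splitting_underscores(filename))
```

Under the first tokenizer a query for “herpes” in the URL has nothing to match, which is consistent with the fact sheet missing from that one result list.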
This has many ramifications for weblogs, especially those using individual archiving with filenames that include the title of the post. When encoding the title, those using MovableType probably use the dirify tag attribute, which explicitly converts spaces into underscores. If the separator could be specified, and set to a character that Google does split on, the average post would get a higher ranking for words in the title.
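A dirify-style function with a configurable separator might look something like the following. This is a rough sketch and not MovableType’s own code; the function name and the sep parameter are assumptions for illustration, and the exact set of characters the real dirify strips may differ.

```python
import re

def dirify(title, sep="_"):
    """Turn a post title into a filename-safe string.

    Hypothetical sketch: lowercase the title, drop anything that is not
    a letter, digit, or whitespace, then join the words with `sep`.
    The default of "_" matches the behavior described in the post.
    """
    cleaned = re.sub(r"[^a-z0-9\s]", "", title.lower())
    return sep.join(cleaned.split())

print(dirify("Google underscores filenames"))       # google_underscores_filenames
print(dirify("Google underscores filenames", "-"))  # google-underscores-filenames
```

With a separator that the search engine treats as a word break, each word of the title would be indexed on its own.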
Alternatively, Google could just break on underscores like any other punctuation. Of course I wouldn’t expect them to come running to satisfy the desires of a single weblogger.
8 thoughts on “Google underscores filenames”
I’d seen a similar analysis of Google’s ignorance of underscores before, but I can’t think of a “smarter” way for us to dirify names. The only other obvious options are periods or plus signs, but they both seem much less readable, and other indexing software tends to be fussy about such things.
We’ve seen that TypePad sites tend to do *really* well in Google indexing, and they’re using underscore dirifying. I suppose the page titles are making up for the filenames. I’d love to find out more options for us to consider in MT 2.7 and in MT Pro and TypePad, as we keep working to de-cruftify our templates.
I’m totally talking out of my tushie here, but is it possible that that’s deliberate on Google’s part, as a bit of an anti-Googlebombing countermeasure? These “blogs” are dangerous things, you know, when it comes to Google’s precious search results.
I can’t believe I find this topic absolutely fascinating, but I do. Thank you for teaching me something new.
any idea if dashes — get tokenized?
I have also been studying this matter. One solution I have come up with is the DirifyPlus plugin for Movable Type.
I use it like so:
Which gives me: name-of-title.php, losing the underscores. This doesn’t help large sites that would like to change over though, due to broken links all over the place. I am fortunate in that I just changed domain names and servers, so I was in a position to make a fresh start. Google should see that a change is required due to the huge number of sites that now use the underscore in filenames.
I should have added that dashes are seen by googlebots as spaces.
Gosh, you guys are so smart, you should start your own search engine. I’m sure it would surpass Google in no time.
Five minutes in Google would answer the question of why Google does what it does with underscores.
i really don’t know what you’re talking about, but it sounds cool. So how do i save a file name so google can see the different words as one word or _ or – or