Google News, meet spam

I’ve been a long-time user of Google News and news alerts. For certain topics, it’s the only way for me to stay informed, and the quality of their index has generally kept these updates to high-quality, on-topic news that matched some keywords. Over the past six months I have noticed diminishing returns on the value of their search, especially in the case of alerts. While the amount of information has increased, the average quality has declined. This decrease in relevance can be attributed to certain kinds of publications in their corpus:

Small publications: as more college newspapers, trade publications, and otherwise non-authoritative sources become primarily web-distributed, they have also started to overwhelm the news index. It’s rare these days to come across a story from a mass media publication.

PR announcements: some readers may remember a few months back when a 15-year-old boy wrote a press release about how Google had hired him, and the entire affair turned out to be a hoax. Press releases seem to be a medium that is not well policed, probably because they mainly come from

Blogs: The boundary between mass media and blogs has certainly blurred over the past few years, but the selection criteria for news indexes do not seem to follow any rules. Presumably the site maintainers take submissions to the site and decide based on internal editorial guidelines what to let in. Some of the blogs I have seen in the index do not seem to make the cut, but maybe the inclusion of blog search in the interface suggests they are working on a better solution.

Syndication sites: a few news sources indexed by Google are actually sites that aggregate news from other sources. Try a search for any of your favorite spam keywords, such as “viagra,” and you will find some surprising results. Spam?! It seemed absurd to me that spam could get into the news index, where every source was hand-evaluated, but lo and behold, there are more than a few pages trying to sell viagra:

Google News vs. Viagra

What each of these examples points to is the need for a ranking mechanism that takes into account the reputation of the source. At last count, the US version of news is indexing over 10k sources, and as this bar gets lower, our collective trust in this site becomes more and more important. Unlike web search, which can be indexed and updated over the course of months, the news index has to be extremely fresh; for this reason, algorithms like PageRank cannot function properly. Attention indicators like del.icio.us, Digg or Newsvine might help, but each of these sources comes with an inherent bias that might not reflect the audience of Google News.

It seems much more likely that the sources of news will become the harbingers of trust. I am not advocating a return to old media, but the index could be built to reflect the current opinion of the web at large. If most sites trust the New York Times or the Washington Post as an authoritative host, so could a news search index. Andy Baio did an experiment around host ranking using MetaFilter as a source, and the results from 1999 to 2006 are quite interesting: many sites appear out of nowhere (YouTube, Wikipedia) while others maintain rank over the years (New York Times, BBC). My guess is that standard news results run through this filter would provide a substantially better experience, especially for ranking results within a given news cluster. We’ll see what the big G ends up doing to rectify the situation.
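
To make the host-ranking idea a little more concrete, here is a rough Python sketch. It is not Andy Baio's method or anything Google does; it only assumes you have a list of URLs that some community has linked to, counts links per host, and uses that count to order the stories in one news cluster. The function names (host_of, host_scores, rank_cluster) and the sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def host_scores(linked_urls):
    """Count how often each host appears in a list of linked URLs.

    Assumption: the link count is a crude stand-in for how much
    the linking community trusts that host.
    """
    return Counter(host_of(u) for u in linked_urls)


def rank_cluster(article_urls, scores):
    """Order the articles of one news cluster by their host's link count.

    Hosts that never appear in the link data score 0, and ties keep
    their original order because sorted() is stable.
    """
    return sorted(article_urls, key=lambda u: scores[host_of(u)], reverse=True)


# Hypothetical link data and a hypothetical cluster of three stories.
links = [
    "http://www.nytimes.com/a",
    "http://nytimes.com/b",
    "http://news.bbc.co.uk/c",
    "http://example-aggregator.test/d",
]
cluster = [
    "http://example-aggregator.test/x",
    "http://unknown-host.test/y",
    "http://www.nytimes.com/z",
]
ranked = rank_cluster(cluster, host_scores(links))
```

A raw link count is the simplest possible signal. A real index would presumably want to weight recent links more heavily and combine the host score with relevance and freshness rather than sort on it alone.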