Google has published slides and videos from a 2007 tutorial series for new interns covering distributed computing, MapReduce, GFS, and a few algorithms. This seems to be part of Google's effort to engage universities with its technology, probably to give future Googlers a head start. (via Geeking with Greg)
There are a lot of angry people in the world. These people typically have a number of gripes, and sometimes one of them stands above everything else. Those who have web savvy might even take it to the rest of the world through a passionate blog or unifying community website. I was interested in what Google thought the most hated things were, and this is the list:
From this logic, I present a highly unsuccessful personals ad:
Part-time clown seeks cilantro-loving emo kid. My house in Brooklyn, my cubicle in Manhattan (selling SBC Yahoo), but my heart is with Austen (die hagglers!). Let’s grab a Starbucks or just chat on our PowerBooks!
Surprisingly, I find myself being quite a big fan of most of them. Maybe people just hate the things I like, but more likely these things get extra attention because they are highly divisive topics.
I’ve been a long-time user of Google News and news alerts. For certain topics, it’s the only way for me to stay informed, and the quality of their index has generally kept these updates to high-quality, on-topic news that matched some keywords. Over the past six months I have noticed diminishing returns on the value of their search, especially in the case of alerts. While the amount of information has increased, the average quality has declined. This decrease in relevance can be attributed to certain publications in their corpus:
Small publications: as more college newspapers, trade publications, and otherwise non-authoritative sources become primarily web-distributed, they have also started to overwhelm the news index. It’s rare these days to come across a story from a mass media publication.
PR announcements: some readers may remember a few months back when a 15-year-old boy wrote a press release about how Google had hired him, and the entire affair turned out to be a hoax. Press releases seem to be a medium that is not well policed, probably because they mainly come from
Blogs: The boundary between mass media and blogs has certainly blurred over the past few years, but the selection criteria for news indexes do not seem to follow any rules. Presumably the site maintainers take submissions to the site and decide based on internal editorial guidelines what to let in. Some of the blogs I have seen do not seem to make the cut, but maybe the inclusion of blog search in the interface suggests they are working on a better solution.
Syndication sites: a few news sources indexed by Google are actually sites that aggregate news from other sources. Try a search for any of your favorite spam keywords, such as “viagra,” and you will find some surprising results. Spam?! It seemed absurd to me that spam could get into the news index, where every source was hand-evaluated, but lo and behold, there are more than a few pages trying to sell viagra:
What each of these examples points to is the need for a ranking mechanism that takes into account the reputation of the source. At last count, the US version of Google News indexes over 10,000 sources, and as this bar gets lower, our collective trust in the site becomes more and more important. Unlike web search, which can be indexed and updated over the course of months, the news index has to be extremely fresh; for this reason, algorithms like PageRank cannot function properly. Attention indicators like del.icio.us, Digg, or Newsvine might help, but each of these sources comes with an inherent bias that might not reflect the audience of Google News.
It seems much more likely that the sources of news will become the arbiters of trust. I am not advocating a return to old media, but the index could be built to reflect the current opinion of the web at large. If most sites trust the New York Times or the Washington Post as an authoritative host, so could a news search index. Andy Baio did an experiment around host ranking using Metafilter as a source, and the results from 1999 to 2006 are quite interesting: many sites appear out of nowhere (YouTube, Wikipedia) while others maintain rank over the years (New York Times, BBC). My guess is that standard news results run through this filter would provide a substantially better experience, especially for ranking results within a given news cluster. I guess we’ll see what the big G ends up doing to rectify the situation.
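The host-ranking idea above can be sketched in a few lines. This is a toy illustration, not Baio's actual method: it assumes you have a list of URLs linked from some trusted community (Metafilter, in his experiment), counts how often each host appears as a crude reputation score, and then reorders a news cluster by that score. All names and sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def host_reputation(link_urls):
    # Count how often each host is linked from a trusted community;
    # a host's link count serves as its reputation score.
    return Counter(urlparse(u).netloc for u in link_urls)

def rank_cluster(result_urls, reputation):
    # Order the results within one news cluster by host reputation,
    # most-trusted host first (unknown hosts score zero).
    return sorted(result_urls,
                  key=lambda u: reputation[urlparse(u).netloc],
                  reverse=True)

# Hypothetical outbound links scraped from a community site.
links = [
    "http://www.nytimes.com/2006/a", "http://www.nytimes.com/2006/b",
    "http://news.bbc.co.uk/x", "http://smallblog.example.com/post",
]
cluster = ["http://smallblog.example.com/story",
           "http://www.nytimes.com/story"]
print(rank_cluster(cluster, host_reputation(links)))
```

With two historical links, nytimes.com outranks the small blog, so its version of the story floats to the top of the cluster.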
Google is currently testing distributed computing as an option of its toolbar. Sergey Brin (Google co-founder) says that the initial use of the computation will be for the Folding@home project at Stanford, but also says that it might be aimed at internal search problems. It seems awkward for a company that has sold itself on doing one thing, and one thing well, to suddenly branch out just to “give something back to science.” I’d say that Google has either got something up its sleeve, or lost the plot.
Scott Andrew is using Google to create a new weblog widget: instant context. Each weblog post is accompanied by a link that directs users to whatever is most closely related on Google. Scott takes the work out of finding relevant information by providing the appropriate search query for his readers. (via web voice)
This approach has a similar goal to the Watson project (at Northwestern University), which attempts to provide people with “just-in-time information access” by giving them links to task-relevant resources. Like Scott, Watson constructs queries to search engines based on whatever you’re currently working on.
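The common trick behind both projects can be sketched simply. This is my own minimal approximation, not Scott's or Watson's actual code: pull the most frequent non-stopword terms from the current text and turn them into a ready-made search link. The stopword list and term cutoffs are arbitrary assumptions.

```python
import re
from collections import Counter
from urllib.parse import urlencode

# A tiny, hypothetical stopword list; a real widget would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "it", "that", "for", "on", "with", "into", "them", "they"}

def related_search_url(post_text, n_terms=4):
    # Extract lowercase words, drop stopwords and very short tokens,
    # keep the most frequent terms, and build a search-engine URL.
    words = re.findall(r"[a-z]+", post_text.lower())
    counts = Counter(w for w in words
                     if w not in STOPWORDS and len(w) > 3)
    terms = [w for w, _ in counts.most_common(n_terms)]
    return "http://www.google.com/search?" + urlencode({"q": " ".join(terms)})

print(related_search_url(
    "MapReduce splits a job into map tasks and reduce tasks; "
    "the map tasks emit key/value pairs and the reduce tasks merge them."))
```

A weblog template would drop the resulting URL next to each post as the “instant context” link, so the reader lands on a search pre-filled with the post’s dominant terms.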