YouTube Epidemiology Interface

YouTube recently launched some amazing new statistics, hidden under the collapsed “Statistics & Data” header on each video page. Instead of a random list of awards, it now shows a timeline of the video’s growth in popularity, along with references to each source. Take, for instance, the video “Chap-hop History” by Mr. B the Gentleman Player:

YouTube Statistics

In addition to the existing statistics page (including awards and demographics), this interface now shows a chronology of the video’s popularity. In the case of Mr. B, the link first appears on b3ta, then with @DJYodaUK and a few other Twitter users, followed by Facebook, more b3ta, and then Planet Gnome. For the first time I feel like I have a clear, concise view of how a piece of content went viral. Take as a counter-example a popular video from this week, “The Cat That Betrayed His Girlfriend”, whose popularity seems to have existed before the video was on the site (its first source of traffic was searches for the title). Browsing around a bit, it seems as though most big videos get their start from external sources, related videos, and searches.

This feels like a secret view into the inner workings of the internet, but all the pieces are still scattered around. I guess YouTube is the only one that can put them together.

Teaching data science

The New York Times had a piece over the weekend discussing how computer science curricula are limited in their capacity to teach distributed computation and data mining:

For the most part, university students have used rather modest computing systems to support their studies. They are learning to collect and manipulate information on personal computers or what are known as clusters, where computer servers are cabled together to form a larger computer. But even these machines fail to churn through enough data to really challenge and train a young mind meant to ponder the mega-scale problems of tomorrow.

Besides being an advertisement for Facebook and Google internships, it does raise the question of how schools can adopt these technologies quickly enough to teach them. There have been lots of industrial partnerships and government grants for research clusters, but these are a long way from a standard undergraduate class on the topic. I would love to see Cloudera or a similar company partner with a hardware provider to make clusters affordable and easy to configure, while data scientists make sure that they come pre-installed with some interesting data (Wikipedia, Twitter, etc.). With a consistent installation across institutions, professors could write and teach data science courses without the immense overhead of setting up a cluster and keeping it running.