Teaching data science

The New York Times had a piece over the weekend discussing how computer science curricula are limited in their capacity to teach distributed computation and data mining:

For the most part, university students have used rather modest computing systems to support their studies. They are learning to collect and manipulate information on personal computers or what are known as clusters, where computer servers are cabled together to form a larger computer. But even these machines fail to churn through enough data to really challenge and train a young mind meant to ponder the mega-scale problems of tomorrow.

Besides being an advertisement for Facebook and Google internships, the piece does raise the question of how schools can adopt these technologies quickly enough to teach them. There have been plenty of industrial partnerships and government grants for research clusters, but these are a long way from a standard undergraduate class on the topic. I would love to see Cloudera or a similar company partner with a hardware provider to make clusters affordable and easy to configure, while data scientists make sure that they come pre-installed with some interesting data (Wikipedia, Twitter, etc.). With a consistent installation across institutions, professors could write and teach data science courses without the immense operational overhead of setting up a cluster themselves.

