MIT Weblog Survey Update

There have been a few requests lately for the results of the MIT Weblog Survey that I conducted last summer, so I figure I should respond publicly.

I’m sorry for the delay, and I admit I was overly optimistic about how long it would take me to release the results. In the past three months I’ve moved, started a new job and a new life. Things have settled down a bit now, and I have some spare time to devote to writing up the results. I honestly expect to have them done by next week.

In the meantime, if you’d like a copy of my thesis, please email me and I’d be happy to send it to you. I’d just prefer to keep it semi-public until the survey results are posted. Sorry again for the delay.

comScore weblog report

I am obviously always on the lookout for weblog statistics, as they have become a core part of my thesis. Today a marketing company by the name of comScore released a report making a number of claims about the weblog community. I’d like to take a moment to remind people that this is a marketing survey, and as such should be carefully scrutinized before drawing any conclusions.

First, comScore’s methodology claims that they have 2 million active subjects, recruited through Random Digit Dial and an “online recruitment program,” for which they provide no details. They do, however, list the incentives that are provided to those individuals:

  • Server-based virus protection
  • Attractive sweepstakes prizes
  • Opportunity to impact and improve the Internet

Sans the third incentive, which is the blanket “feel-good” incentive for all surveys, I challenge you to think of someone who is attracted to the first two. Let’s just say they’re not your average person or internet user. They also note:

All demographic segments of the online population are represented in the comScore Global Network, with large samples of participants in each segment. For example, our network includes hundreds of thousands of high-income Internet users – one of the most desirable and influential groups to measure, yet also one of the most difficult to recruit.

Without diving into what “high-income Internet users” are, having hundreds of thousands of subjects from a presumably small portion of the population leads me to believe that they’re not really interested in representativeness, but rather, umm, marketing. Given that they neither justify their sample nor provide margins of error, the initial sampling frame should be considered bunk.

Second, if their sampling of weblogs seems strange at first, it is. They were interested in how the aforementioned sample visited weblogs, so they decided to look at visits to 400 blog-related domains, which they culled from “top blog lists.” These domains include hosting services among the other top blogs. Keep in mind that this sample of 400 domains incorporates community sites, professionally written sites and potential spam (at least one domain trips my spam alarm).

I’m assuming, based on their distribution of unique visitors shown below, that all of these sites are included in one sample, with the top sites being blog hosts (although note the missing blogspot, which supposedly saw 19 million unique visitors), and the second group being community sites and professional blogs. As far as many people might be concerned, the “real blogs” start around #30, for which they provide no description. How this is a sample of weblogs at all, I can’t say. But building categories around this strange set of sites seems a little unsound.

comScore statistics

What this report, in sum, seems to say to me is that some large number of people have visited either a professional weblog or some weblog on any number of the hosted services in the past year. This should not be surprising. I get a blog site response from Google just about once every five queries. Without any description of how many of these blog visitors saw only one blog in the entire period, I’d say an overwhelming majority could be from search engines (which they admit).

Given their sampling frame and blog selection methodology, it seems hard to extrapolate any meaningful statistics about true blog readership. Until they release the data, I would quote these numbers with extreme caution.

Thesis: defended

For those wondering whether or not I’ve died in my apartment in a vat of sweet-smelling liquid that masked the smell of my rotting body, the answer is NO! I’m alive and well, just in the wake of one of the more excruciatingly painful periods of work-induced anti-social behavior. And as a consolation, I never have to defend my thesis again.

Unlike most Ph.D. defenses, the Media Lab counterpart is quite public, held in an auditorium-sized room, and can occur before the thesis document is finished. Last Thursday at 9am I went through this process presenting my thesis titled The structural determinants of media contagion, and I came through fairly unscathed. It was fairly well attended once people woke up (around 9:30 I guess), and my committee decided I was ready to enter the cloistered halls of academia… after I finish writing the document.

It’s unfortunate, but true. I can’t put Dr. on my credit card just yet, nor can I pretend that I have any plans after that. In the meantime I’ll be writing in limbo until my April 5th deadline. I’ll hold off on the results until then, lest I contradict myself in two weeks’ time. But I just wanted to thank everyone who helped me get here, and there are so many. If you’re reading this, I’m sure you know that I mean you, because just about everyone who possibly could have lent a hand did in some way (even if it was just by taking the survey).

So thanks. I’ll be filling in the details in a few weeks, but you can take solace in the meantime that I’m taking showers again and interacting with people other than the three friends I’ve developed in my brain over the past two months.

Survey 2, Electric Boogaloo

For some reason I expected my survey to spread much further and wider than it actually did. At the current moment, the individuals who were emailed about taking the survey (the random sample) outnumber webloggers at large 3 to 1. I really expected things to go in the other direction.

To rectify this situation, I thought I would provide a little incentive. For those individuals who complete the survey, you can see how you compare to the rest of the survey respondents. You can get a taste of the results here:

Of course, anyone who has already taken the survey can see their results as well; just log in with your login key. Don’t worry if you threw it away; you can request it again.

The survey is up for another week, until Monday June 27, so if you get a chance… I’d appreciate it.

Cognitive dissonance

I haven’t been dumped in quite a while. Usually my relationships just fade away until a decision is made. Besides putting Gloria Gaynor and malt liquor into higher rotation, I’ve been doing a little bit of introspection about the topic.

Getting dumped is a classic case of cognitive dissonance, a theory first proposed by Leon Festinger in the 1950s. He observed that people make decisions and take actions to minimize the number of contradictory beliefs they hold in their heads. When a person is forced to believe two things that don’t match up, they experience extreme emotional discomfort until they can fix their belief system.

So basically I have this thought in my head that’s tied to all kinds of memories and beliefs: she is my girlfriend. Then I introduce this new idea, she is not my girlfriend, and the sum of these two obviously contradictory beliefs turns me into a raving lunatic. The more embedded the first belief is, the harder it is to accept the latter, and the longer you pour Olde English on your corn flakes instead of milk. F. Scott Fitzgerald put it nicely:

The test of a first-rate intelligence is the ability to hold two opposed ideas in the mind at the same time, and still retain the ability to function. One should, for example, be able to see that things are hopeless and yet be determined to make them otherwise.

Obviously I’m not operating at first-rate levels currently. But writing dry, bland weblog posts about something that is obviously extremely emotional certainly helps to bring it back.

Hedonic treadmill

I’m just about to return a book to the library, something I read a while back and have been meaning to post about for centuries. In their article “Hedonic Relativism and Planning the Good Society**,” Philip Brickman and Donald Campbell give a name to the ongoing state of happiness that we all experience. Despite the fact that external forces are constantly changing our life goals, happiness for most people is a relatively constant state. Regardless of how good things get, we’ll always be about the same level of happy; this they call the hedonic treadmill.

Psychology researchers have observed this phenomenon in a myriad of different situations: lottery winners, tenure achievers, recently handicapped, etc. In all of these situations, despite a massive shift in standard of living or achievement of major life goals, after a short period of time the life-satisfaction levels return to normal.

If this is what we can expect from our own psychology, how does hedonic relativism affect the way we choose to live our lives? Brickman and Campbell look at this question from a societal level, and suggest that there is an optimal setup for making every member of our culture as happy as possible. You have to give them credit: it was the ’70s and socialism was still a form of utopia. But as far as I can tell, the only way to keep yourself on an increasing scale of happiness is to achieve some small goals on a daily basis, not putting too much emphasis on achieving one over another.

So why am I writing this damned Ph.D.?!

** Brickman, Philip, & Campbell, Donald. (1971). “Hedonic relativism and planning the good society.” In M.H. Appley (Ed.), Adaptation-level theory: A symposium. New York: Academic Press.

Academic conference spam

About two years ago I started getting peculiar messages from unknown academics about conferences I’d never heard of. They all follow a standard form, with a subject like “inviting you to participate in BLAH-05.” Some address me as “potential speaker,” some “Dr. Cameron A. Marlow,” and some simply “Dr. Marlow.” This isn’t all that surprising, given that lots of legitimate emails I get from academic institutions refer to me as a Dr. (it’s much more offensive not to refer to a Ph.D. as Dr. than it is to inflate the ego of a mere student).

An increase in conference spam

The surprising thing about these emails is that they’ve been increasing in frequency pretty regularly. They have moved from the space of “oversized conference list” to legitimate spam. In some cases I’ve gotten emailed multiple times about the same conference, and for a subject that’s about as close to my research as I am to finishing my course in Scientology.

So who are these people? Given the regular structure of the emails, I assume that they’re being sent out from one master list. Some arrive from what appears to be a collection of loosely related conferences, and others from an ISP in Serbia.

How big is this network? Did I get randomly added to some master list, or are they spidering for academics’ email addresses? Has anyone actually gone to one of their conferences? As with most spam, lots of questions, few answers.

Presidential Debate Redux

I’ve rerun my presidential debate analysis (see analyses from the first presidential debate and the vice presidential debate) on the transcript of the second presidential debate. I’ve also updated the Debate Spotter to include the new text. But this time I’ve taken a slightly different approach to the analysis. Instead of some complicated weighting scheme, I’ve decided to use a very simple technique to sort the phrases for each candidate:

  • Count the number of phrases for each candidate
  • Score each phrase as the difference between the number of times each candidate used the phrase
  • Favor longer phrases in sorting
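The three steps above can be sketched roughly as follows. This is a minimal illustration and not the code behind the published lists: the function name `rank_phrases` is hypothetical, the phrase counts are invented toy data rather than anything taken from the transcripts, and treating "favor longer phrases" as a tie-breaker on word count is my assumption, since the exact rule isn't spelled out.

```python
from collections import Counter

def rank_phrases(own_phrases, other_phrases):
    """Rank one speaker's phrases by how much more often they used them."""
    own = Counter(own_phrases)      # step 1: count phrases for this candidate
    other = Counter(other_phrases)  # ...and for the opponent
    # Step 2: score = times this speaker used it minus times the opponent did.
    scores = {phrase: own[phrase] - other[phrase] for phrase in own}
    # Step 3 (assumed): sort by score, breaking ties in favor of longer phrases.
    return sorted(scores.items(),
                  key=lambda item: (item[1], len(item[0].split())),
                  reverse=True)

# Invented toy input, one list entry per time a phrase was spoken.
speaker_a = ["health care plan"] * 2 + ["plan"] * 2 + ["tax cut"]
speaker_b = ["tax cut"] * 2

ranking = rank_phrases(speaker_a, speaker_b)
for phrase, score in ranking:
    print(score, phrase)
```

A phrase both candidates lean on equally scores near zero and drops down the list, which is the point of using the difference instead of the raw count.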

The results follow, and I think you’ll find them much more revealing than the previous lists. I also fed both candidates’ transcripts into Microsoft Word’s AutoSummarize feature to produce a sub-100-word summary. The results are… umm… compelling. From my perspective, it seems as though Kerry is on the offensive, and Bush is backpedaling. But of course that’s just Microsoft’s take on the debate. Click on the following links to download the source Word documents. I’ll leave running the grammar checker as an exercise for the reader.

kerry041008.doc bush041008.doc

Vice Presidential Debate Analysis

As in my last entry, I’ve run the transcript of the Vice Presidential Debate through a part-of-speech tagger and identified the most popular noun phrases for each speaker (listed below). I’ve also updated the Debate Spotter to handle both transcripts. Simply change the debate field and the transcript and speakers will be changed accordingly.

Have fun, and of course let us know if you identify any interesting phrases.


Presidential Debate Analysis

Whenever I watch a televised debate, I always wonder what percentage of the speaker’s message is actually thinking on their feet and how much is canned material. With the advent of available transcripts, these sorts of questions can be addressed with various computational methods.

A simple way to identify repeated statements is to count the number of times a particular noun phrase is mentioned. Noun phrases act as a proxy both for the subject matter of a given piece of text and for the way in which things are worded.

For this simple experiment, we’ll need four tools:

The results are quite interesting. Looking only at noun phrases of at least 2 words occurring at least twice for a given speaker, we arrive at some spectacular catch phrases. For Bush my favorite is “hard work,” which he said repeatedly. Apparently Bush thinks that the world is a difficult place to be. For Kerry, a salient phrase was “war as a last resort.”
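For anyone who wants to try the filtering step, here is a rough sketch of counting repeated noun phrases from tagger output. It assumes the tagger emits (word, tag) pairs with Penn Treebank tags, and it uses a deliberately crude rule (any run of consecutive adjectives and nouns is one phrase), which is my simplification and not necessarily how the original lists were built. The names `noun_phrases` and `repeated_phrases` are hypothetical, and the tagged input is hand-made toy data.

```python
from collections import Counter

# Penn Treebank tags treated as part of a noun phrase (adjectives and nouns).
NOUN_PHRASE_TAGS = {"JJ", "NN", "NNS", "NNP", "NNPS"}

def noun_phrases(tagged_tokens):
    """Yield each run of consecutive adjective/noun tokens as one phrase."""
    run = []
    for word, tag in tagged_tokens:
        if tag in NOUN_PHRASE_TAGS:
            run.append(word.lower())
        else:
            if run:
                yield " ".join(run)
            run = []
    if run:  # flush a phrase that ends the input
        yield " ".join(run)

def repeated_phrases(tagged_tokens, min_words=2, min_count=2):
    """Keep phrases of at least min_words words said at least min_count times."""
    counts = Counter(noun_phrases(tagged_tokens))
    return {phrase: n for phrase, n in counts.items()
            if len(phrase.split()) >= min_words and n >= min_count}

# Hand-tagged toy sentences standing in for one speaker's transcript.
tagged = [("It", "PRP"), ("is", "VBZ"), ("hard", "JJ"), ("work", "NN"), (".", "."),
          ("We", "PRP"), ("do", "VBP"), ("hard", "JJ"), ("work", "NN"),
          ("every", "DT"), ("day", "NN"), (".", ".")]

result = repeated_phrases(tagged)
print(result)
```

A phrase like “war as a last resort” spans a preposition and a determiner, so catching it would take a richer chunking pattern than this one.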

The top 25 phrases for Bush and Kerry follow. The number following each phrase is a rank determined by the length of the phrase and the number of times it appeared.
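The exact weighting isn’t given here, so the following is only one plausible way to combine the two ingredients: multiply the word count of the phrase by the number of times it appeared. The `rank` function and the counts are hypothetical and invented for illustration.

```python
def rank(phrase, count):
    # Hypothetical weighting: phrase length in words times appearances.
    return len(phrase.split()) * count

# Invented counts, not taken from the debate transcripts.
counts = {"hard work": 5, "tax cut": 3, "weapons of mass destruction": 2}

# Sort phrases by rank, highest first, and keep at most the top 25.
top = sorted(counts, key=lambda phrase: rank(phrase, counts[phrase]),
             reverse=True)[:25]
for phrase in top:
    print(rank(phrase, counts[phrase]), phrase)
```

Under this weighting a four-word phrase said twice outranks a two-word phrase said three times, so longer catch phrases rise toward the top.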

There are so many other types of analysis that could be run on these data. If you find anything interesting, please let me know. Also, the Debate Spotter allows for any query, so post any interesting phrases that you find.

Update: I have also analyzed the Vice Presidential and the Second Presidential debates.