Whenever I watch a televised debate, I wonder what percentage of the speaker’s message is actually thinking on their feet and how much is canned material. Now that transcripts are readily available, these sorts of questions can be addressed with computational methods.
A simple way to identify repeated statements is to count the number of times a particular noun phrase is metioned. Noun phrases act as a proxy both for the subject matter of a given piece of text and for the way in which things are worded.
For this simple experiment, we’ll need four tools:
- The transcript (simplified from the original)
- Lingua::EN::Tagger, an English part-of-speech tagger written in Perl
- phrases.pl, a Perl script to parse the document and extract the noun phrases
- Debate Spotter, an interactive interface to visualize the results
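To make the idea concrete, here is a minimal Python sketch of noun phrase counting. It is not phrases.pl and does not use Lingua::EN::Tagger: it assumes the text has already been tagged with Penn Treebank-style tags, and the tag sets, function names and sample sentence are all made up for illustration. It only collects runs of adjectives, numbers and nouns, so it would miss phrases joined by “of” or “at”.

```python
from collections import Counter

# Penn Treebank-style tags, assumed for illustration.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
MODIFIER_TAGS = {"JJ", "JJR", "JJS", "CD"}


def noun_phrases(tagged):
    """Yield maximal runs of modifiers and nouns that end in a noun."""
    run = []
    # A sentinel token flushes the final run.
    for word, tag in list(tagged) + [("", "END")]:
        if tag in NOUN_TAGS or tag in MODIFIER_TAGS:
            run.append((word.lower(), tag))
            continue
        # The run is over: trim trailing modifiers so it ends on a noun.
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        if run:
            yield " ".join(w for w, _ in run)
        run = []


def repeated_phrases(tagged, min_words=2, min_count=2):
    """Keep phrases of at least min_words words seen at least min_count times."""
    counts = Counter(noun_phrases(tagged))
    return {p: n for p, n in counts.items()
            if len(p.split()) >= min_words and n >= min_count}


# Hand-tagged sample sentence (hypothetical, not from the transcript).
sample = [
    ("It", "PRP"), ("is", "VBZ"), ("hard", "JJ"), ("work", "NN"), (".", "."),
    ("Hard", "JJ"), ("work", "NN"), ("pays", "VBZ"), ("off", "RP"), (".", "."),
]
print(repeated_phrases(sample))
```

The `min_words` and `min_count` defaults mirror the filter used for the lists below: at least 2 words, occurring at least twice for a given speaker.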
The results are quite interesting. Looking only at noun phrases of at least 2 words occurring at least twice for a given speaker, we arrive at some spectacular catch phrases. For Bush my favorite is “hard work,” which he said repeatedly. Apparently Bush thinks that the world is a difficult place to be. For Kerry, a salient phrase was “war as a last resort.”
The top 25 phrases for Bush and Kerry follow. The number following each phrase is a rank described by the length of the phrase and the number of times it appeared.
There are so many other types of analysis that could be run on these data. If you find anything interesting, please let me know. Also, the Debate Spotter allows for any query, so post any interesting phrases that you find.
Update: I have also analyzed the Vice Presidential and the Second Presidential debates.
Bush:
free iraq (14),
hard work (13),
wrong war at the wrong place at the wrong time (13),
wrong war at the wrong time at the wrong place (12),
north korea (10),
kim jong il (10),
my opponent (9),
american people (8),
same intelligence (8),
prime minister allawi (8),
best way (7),
free afghanistan (7),
world a more peaceful place (7),
mixed messages (7),
iraqi citizens (6),
al qaida (6),
weapons of mass destruction (6),
dynamics on the ground (6),
breach on the agreement (6),
end of this year (6),
grave threat (6),
matter of fact (5),
cannot lead (5),
grand diversion (5),
wrong signals (5)
Kerry:
saddam hussein (14),
north korea (14),
nuclear weapons (10),
weapons of mass destruction (9),
osama bin (9),
united nations (9),
war as a last resort (9),
american people (8),
90 percent of the casualties (7),
nuclear proliferation (7),
remedies of the united nations (7),
90 percent of the costs (7),
united states of america (7),
homeland security (7),
mountains of tora bora (6),
10 active duty divisions (6),
different set of convictions (6),
four years (6),
president bush (6),
president of south korea (6),
strong alliances (6),
two years (5),
secretary of state (5),
tax cut (5),
bilateral talks (5)
80 thoughts on “Presidential Debate Analysis”
You forgot poland!
You forgot “Smoooooooooooke’n” … woops, wrong Kerry.
Actually, he forgot poland. I didn’t forget anything.
You may want to rework your numbers. I searched the transcripts and counted occurrences of some of those phrases. The numbers you give are higher than the number of occurrences in the transcript (for example, Kerry’s mention of Tora Bora occurred only twice).
A very cool idea, though.
Cameron, this is brilliant. Could it be used in conjunction with something that scrapes the pair’s campaign sites?
The numbers he gives are not the number of times the phrase occurred. “The number following each phrase is a rank described by the length of the phrase and the number of times it appeared.”
I needed to rank the phrases somehow, and just looking at the total occurrences favors phrases that are short but common. Instead I made up my own ranking algorithm:
score = length (in words) + occurrences
So a phrase with 8 words that appears twice will have a score of 10, the same as a phrase with 2 words that appears 8 times. I played around with this score for a while and this method seemed to pull up the most interesting results.
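As a rough Python sketch of that score (the function names are mine, not from phrases.pl): the counts for “free iraq” and “hard work” are the ones quoted elsewhere in this thread, and the count for “world a more peaceful place” is back-computed from its published score of 7, so treat it as an assumption.

```python
def additive_score(phrase, occurrences):
    """Length of the phrase in words plus the number of times it occurred."""
    return len(phrase.split()) + occurrences


def rank(counts):
    """Order phrases by additive score, highest first, ties alphabetically."""
    return sorted(counts, key=lambda p: (-additive_score(p, counts[p]), p))


counts = {
    "hard work": 11,
    "world a more peaceful place": 2,  # back-computed from the score of 7
    "free iraq": 12,
}

for phrase in rank(counts):
    print(phrase, additive_score(phrase, counts[phrase]))
```

With these counts the scores come out as 14, 13 and 7, which matches the numbers shown next to those phrases in the Bush list.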
As for applying this elsewhere, the script I’m using to visualize it is extremely simple. All I need is a plaintext file with the content and it’s trivial to set up.
Amy’s Robot did a similar, though less technically sophisticated, analysis that gives the number of occurrences of each phrase. Here’s the post.
What this is really great for is making sure, after the fact, that you took enough drinks during your presidential debate buzzword drinking game.
Although it is not only used in noun phrases, I think it is equally revealing that Bush used the word VOTE 7 times, while Kerry used it only 3 times.
This is Brillant! I found it after wondering how many times Bush said “it’s hard work” …my favorite also.
It reminds me of Will Ferrell on SNL playing Bush. He’s in the Oval Office and there are flames outside the windows as the whole world goes to hell. Will (as Bush) pops out from under the desk exclaiming “it’s hard work” and then he pops a beer can open! Funny stuff, if people weren’t dying.
Text analysis is pretty cool. With software like textpac and catpac (and a gazillion others) you can count the occurrence of individual words, plus the occurrence of words in relation to certain other words. Knowing that Bush said “hard work” 13 times is interesting, but knowing that he said it 13 times next to words describing his own job – as a potential rhetorical strategy to get us to feel bad for the poor little guy, he must be tuckered out! nation-building is rough stuff! – is pretty interesting too.
I missed your explanation of method before my previous post. But I’m still not sure how much we can learn from this method.
You say “this method seemed to pull up the most interesting results.”
If you chose the algorithm that produces the ‘most interesting results’, then you’re skewing your research. You chose the way that makes the data fit a pre-determined goal, rather than extrapolate from simple evidence. How many words a phrase contains doesn’t have much to do with whether it’s ‘canned material’. The Bush administration has gone far on two- and three-word phrases (“Society of ownership”; “faith-based initiatives”), which are about as simple as an English construction can get. So it doesn’t follow that the longer a phrase, the more likely it is to be campaign boilerplate. It might be more interesting to look at a straight ranking of most common noun phrases rather than filter them in this way.
I realize this isn’t your department, but the Oct. 3, 2000 Gore-Bush Debate on that CPD transcripts page is actually linked to the transcript of the Kerry-Bush debate on Sept. 30, 2004. So far, I haven’t found the webmaster’s contact info.
Two points on the validity of these methods:
1. The data, software and methods used to generate these results are freely available. Anyone disagreeing with the ranking can easily download the source and run the software themselves. I’d be more than happy to help with this process.
2. In saying “this method seemed to pull up the most interesting results,” I meant that I was changing the algorithm based on my knowledge of the types of phrases I was looking for and the parameters I could use (the length of the phrase and the number of occurrences). The noun phrase extractor that I used scores the phrases multiplicatively by default (length * occurrences) which places too much emphasis on long noun phrases. I chose to score them additively (length + occurrences) because I felt that repetitiveness is a more important feature, especially for such small amounts of data.
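To see how much the choice matters, here is a small Python comparison of the two scoring rules. It is only a sketch: `rank_by` is a made-up helper, “free iraq” uses the 12 occurrences mentioned in this thread, and the 3 occurrences for the ten-word phrase are back-computed from its published score of 13.

```python
from operator import add, mul


def rank_by(combine, phrases):
    """Order phrase names by combine(length, occurrences), highest first."""
    return sorted(phrases, key=lambda name: combine(*phrases[name]), reverse=True)


# phrase -> (length in words, occurrences)
phrases = {
    "free iraq": (2, 12),
    "wrong war at the wrong place at the wrong time": (10, 3),  # count assumed
}

print("length * occurrences:", rank_by(mul, phrases))
print("length + occurrences:", rank_by(add, phrases))
```

Multiplying gives 24 versus 30, so the long phrase wins; adding gives 14 versus 13, so the frequently repeated short phrase comes out on top, which is the ordering in the list above.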
This is really interesting. Language Log did a related analysis on both candidates’ sentence length, with Kerry’s (as expected) average sentence length higher, as well as his contribution in words to the debate overall (though Bush had slightly more sentences…but they were shorter).
Why not use a more information-retrieval-type score:
score = TF*IDF = term freq * inverse document freq
term freq = # occurrences of a phrase in the speeches
inverse document freq = # occurrences of the same phrase in some large body of English text.
This way, you don’t have to worry about the length of phrases, just whether they occur commonly or not in normal text.
I wondered how many “Hard Work” comments were made during Bush’s parts of the debate. Thanks for posting it.
Err, rather IDF = 1 / (# occurrences of the same phrase in some large body of English text)
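A quick Python sketch of that corrected score, TF times 1 over the background count. Everything here beyond the formula is an assumption: the function name, the add-one smoothing (so a phrase missing from the background text does not divide by zero), and above all the background counts, which are invented. The debate counts are back-computed from the published scores.

```python
def tf_idf(phrase, debate_counts, background_counts):
    """Term frequency in the debate divided by frequency in background text."""
    tf = debate_counts.get(phrase, 0)
    # Add-one smoothing is my own addition to avoid dividing by zero.
    idf = 1.0 / (1 + background_counts.get(phrase, 0))
    return tf * idf


# Occurrences in the debate, back-computed from the scores in the post.
debate_counts = {"matter of fact": 2, "prime minister allawi": 5}

# Occurrences in some large body of ordinary English text (invented numbers).
background_counts = {"matter of fact": 499, "prime minister allawi": 0}

for phrase in debate_counts:
    print(phrase, tf_idf(phrase, debate_counts, background_counts))
```

With these made-up background numbers, an everyday idiom like “matter of fact” drops to 0.004 while a debate-specific phrase keeps its full weight of 5.0, without phrase length entering the score at all.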
Quotient — Frequency is one of the two parameters I’m using. I would consider the frequency relative to the documents if the definition of document made more sense. I could look at the words in terms of their use within turns of the dialog, e.g. is the word common in one turn of Bush or the entire talk. Unlike TFIDF though I’d be looking for phrases with low IDF, or rather to maximize the DF instead of the IDF, i.e., the more spread out a phrase is, the more important the phrase is for the talk.
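A minimal sketch of that idea in Python, treating each turn of the dialog as a document and counting how many turns a phrase shows up in. The function names and the three sample turns are hypothetical, and matching is plain case-insensitive substring search, which is cruder than matching on extracted noun phrases.

```python
def turn_frequency(phrase, turns):
    """Number of turns in which the phrase appears at least once."""
    needle = phrase.lower()
    return sum(1 for turn in turns if needle in turn.lower())


def total_occurrences(phrase, turns):
    """Total number of times the phrase appears across all turns."""
    needle = phrase.lower()
    return sum(turn.lower().count(needle) for turn in turns)


# Hypothetical turns for one speaker, not taken from the transcript.
turns = [
    "It is hard work. Hard work, I know.",
    "We will win.",
    "That is hard work.",
]

print(turn_frequency("hard work", turns), "of", len(turns), "turns")
print(total_occurrences("hard work", turns), "occurrences in total")
```

A phrase said three times in one turn and a phrase said once in each of three turns have the same total, but the second is more spread out, which is the property being maximized here.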
The reason I’m including the phrase length in the calculation of the rank is that it’s more interesting to see a long phrase repeated than a short one. For instance, Bush says “free Iraq” 12 times and “Iraq” 52 times. Both are noun phrases, but the term “free Iraq” has more semantic meaning and importance than the single word Iraq. In describing the phrases, I think longer phrases are more meaningful, and thus more interesting, but I want to balance this feature with the frequency. Does that make sense?
don’t forget poor poor poland
Isn’t it important for any candidate to state their standpoint and DEBATE, rather than to reiterate their campaign and say what they have been saying over and over and over? By using noun phrases they are just creating “coined terms” that induce an individual to take a side on the basis of how they sound and on what the semantic meaning of these terms is, rather than on the “core belief” (another highly used term, by the way) of each candidate.
Cool tool. However, tools can miss semantic repeats. For example, I decided that there were at least 15 instances of “hard work” because two times Bush followed the phrase with “it’s hard.” I wrote a letter-to-the-editor at the Seattle Times (which I should post online).
This also misses the very interesting philosophical approach to “control” that was evidenced in the ad lib exchange about Bush’s daughters (he wants to put them on a “leash” and Kerry advises that it doesn’t work.)
For the cognitive scientists: I’m curious about the “wrong war, wrong place, wrong time” phrase … isn’t there a danger of the phrase standing alone (i.e., not as an indictment of Kerry)?
(PS, I’m blogging you but I don’t have trackback technology)
Are there programs out there like this one we can use for other speeches? Or can you provide yours via the Web to political junkies?
Hours of fun, but on a serious note, such analyses of elections will find their way into serious considerations of the styles and strategies of candidates for office.
Just noticed in my search for “world” that there are a couple of sentences repeated in the fourth Bush paragraph – where he’s talking about talking on the phone with world leaders…
I’m looking for that “I know the world we live in,” or something to that effect statement.
Odd that Kerry’s top 25 list includes “Osama bin” rather than “Osama bin Laden” — the former phrase was never spoken in the debate without “Laden” after it.
The “Osama bin” phrase is a byproduct of the noun phrase parser that I use. In 99% of the uses of the word “laden,” it’s a verb, which won’t be part of a noun phrase if it’s at the end. So while it correctly identifies “Osama bin,” it misidentifies his last name. All of these techniques are prone to exceptions, but it still seems to work pretty well.
How about we count emotional expressions as buzzwords? After being provoked, a certain candidate gave emotional expression, before taking a moment of recomposure and delivering a substantive answer. The debate in the media was not about capturing where each person stood, and allowing the voters to see which camp backs the way they feel. Rather than focus on the substance, the aftermath focused on a sort of buzzword: emotional facial expressions …
This is mildly interesting work in the perl sense (i use it, but really), it amazes me that so many people are keen on this random kind of analysis. Why don’t you just listen to the bastards. Their meaning will become quite clear, you don’t need perl. If you missed it (god knows why, you have so few chances to actually hear the candidates) then read the transcript in full. You are humans (in the loosest sense) so get involved, read and listen to them and decide. The world is waiting. Welcome to the 18th century.
I don’t understand why you think this is a “random kind of analysis,” given that it’s using techniques employed by computational linguists for decades. The motivation is simple: computers allow us to see patterns that we wouldn’t see otherwise.
There are really two parts to this post, the first being the linguistic analysis above, and the second being the Debate Spotter tool for visualizing the results. The former provides utility because it aggregates phrases at a level that connects with the semantics of the speaker, namely if there’s a phrase or jargon that they’re trying to repeat, it will typically be in the form of a noun phrase. The latter tool allows people to do their own investigative work without spending hours poring over the text. I’d like to know who was talking about a particular topic more, but I don’t really want to count for every word.
Is that not motivation enough to invest an hour of coding time? And personally I don’t think it’s interesting in the Perl sense at all, as “perl -MCPAN -e install Lingua::EN::Tagger” is pretty trivial.
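The kind of query described above, who talked about a topic more, can be sketched in a few lines of Python. This is not the Debate Spotter code. It assumes a plaintext transcript where every line starts with a speaker label and a colon, and both the function name and the sample lines are invented for illustration.

```python
import re
from collections import Counter


def count_by_speaker(transcript, phrase):
    """Count whole-phrase, case-insensitive matches for each speaker."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    totals = Counter()
    for line in transcript.splitlines():
        speaker, sep, text = line.partition(":")
        if not sep:
            continue  # skip lines without a speaker label
        totals[speaker.strip().upper()] += len(pattern.findall(text))
    return totals


# Invented lines in the assumed "SPEAKER: text" format.
transcript = """KERRY: I will not raise taxes. A tax cut for the middle class.
BUSH: The tax cut worked.
KERRY: That tax cut went to the wealthy."""

print(count_by_speaker(transcript, "tax cut"))
```

Because the query is just a phrase, this counts any wording you type in, not only the noun phrases the tagger happened to extract.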
It would be interesting to connect the word and phrase scores to transcripts from the spin doctors to see how far the politicians get from the point…
I think your counts would be more interesting if you collected together phrases that had essentially identical meanings, like the “wrong war etc.” of which you have two interchangeable versions, or by including “working hard” (twice) with “hard work” (eleven).
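One rough way to do that grouping, sketched in Python: reduce each phrase to a sorted bag of crudely stemmed words, so reordered and inflected variants share a key. The suffix stripping is a deliberately naive stand-in for a real stemmer, the helper names are mine, and the two “wrong war” counts are back-computed from their published scores (13 and 12 for ten-word phrases), so treat them as assumptions.

```python
from collections import defaultdict


def crude_stem(word):
    """Strip a trailing "ing" or "s" if at least three letters remain."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def variant_key(phrase):
    """Order-insensitive key: the sorted stems of the words in the phrase."""
    return tuple(sorted(crude_stem(w) for w in phrase.lower().split()))


def merge_variants(counts):
    """Sum counts of phrases sharing a key, labelled by the commonest wording."""
    groups = defaultdict(list)
    for phrase, n in counts.items():
        groups[variant_key(phrase)].append((phrase, n))
    merged = {}
    for members in groups.values():
        label = max(members, key=lambda m: m[1])[0]
        merged[label] = sum(n for _, n in members)
    return merged


counts = {
    "hard work": 11,
    "working hard": 2,
    "wrong war at the wrong place at the wrong time": 3,  # assumed count
    "wrong war at the wrong time at the wrong place": 2,  # assumed count
}

print(merge_variants(counts))
```

With these numbers “hard work” absorbs “working hard” for a total of 13, and the two “wrong war” orderings collapse into one entry with 5. A bag-of-words key is blunt, though: it would also merge phrases that use the same words to mean different things.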
This is pretty cool. What would be interesting would be to graph the phrases out of the stump speeches over time and see how the campaigns are adjusting the message as time goes by.
Interesting, Kerry mentions “Florida” twice, while Bush doesn’t mention the state by name at all, ever. Pretty remarkable, considering the debate was held in Coral Gables, FL. It’s as if Bush is afraid to jinx the upcoming election by drawing people’s attention to the state, i.e. his trojan horse. Or was Florida a gift horse in 2000? Or simply a stolen horse? At any rate, great tool, Cameron.
I hope you wear a bulletproof vest. I am sure they are out gunning for you. I want to go in the direction that you want to steer the country. May you sail to victory.
Did anyone keep score of how many questions each candidate actually answered directly, I mean actually answered the question that was asked? My favorite was the right to life issue which Kerry completely skirted by saying that he didn’t want to bring his personal views into it, WELL HELL SON, That’s how we vote for presidents, on their personal views (at least that’s how we hang ‘em). Well I’ll tell you, as far as directly answering the questions Bush wins hands down, so much so that we might even overlook this oil baron’s bullshit answer about spending a billion dollars on hydrogen cell research. As you can tell I have no love for either candidate, but as far as this debate goes, Bush’s advisors kicked the hell out of Kerry’s advisors hands down. Advisors took a huge part in this debate, as evidenced by how intelligent Bush sounded, as well as his complete lack of clichés and the usual catch phrases. P.S. I would love to see proof of the billion dollars spent on hydrogen cell research if anyone has it handy.
only a pussy would vote for kerry
Click on “Wrong war at the wrong place at the wrong time” under Bush and read that through.
“A simple way to identify repeated statements is to count the number of times a particular noun phrase is metioned.”
Just a note that you misspelled “metioned” … it should be “mentioned”. Thought I would point this out.
Another phrase of interest would be “no child left behind”
looks like one thing that the automatic thing didn’t catch was phrases that were plural versus singular. it already counted a decent score for kerry’s use of “90 percent of the casualties”, but the score could have been quite higher if it had caught “90 percent of the casualties and 90 percent of the cost(s)”. there are two phrases with that. and still another in which the words “in Iraq” are interposed.
it seems the computational method could perhaps take this type of thing into account for slightly longer phrases to really highlight those items that a debate participant is trying to emphasize.
tax cut – Kerry 17, Bush 4
Nice work, even with all the caveats.
I’m attempting to apply predicate calculus to the intersection of political speak with related fact. My goal is to drive a truth table that can devolve into a simple measure – perhaps a percentage – giving the overall truth value of a given statement, position paper or speech.
My first thought was to use an autoconstructing database (e.g., askSam) to create entries, but that seems a bit clunky. An algorithmic approach would be much better, since ideally it could be run in real-time.
so where can i sign up to vote for arnold?
is John Kerry republican
Yes, yes, of course he is. XP
Why will President Bush all of a sudden accept flu shots made in Canada? In the last debate he argued that drugs from Canada (made in the USA) were unsafe for Americans. Has he suddenly discovered that Canada is a pretty safe neighbor?
I’m in Dubai, UAE. I have special interest in working with texts.
Could you tutor me on this subject. I need to learn how to do the analysis of texts and what tools to use.
With the greatest respect, all I could think of after absorbing the data was the good old (pre-divorce) Czechoslovak saying:
He who thinks by the inch and talks by the yard deserves to be kicked by the foot.
I really hate the fact that both candidates say the same thing over and over. I was going towards Kerry but all he’s focused on is Iraq. The main thing. Bush spent 42% of his time on vacation during the 6 months that he was in office before 9/11. Was he ready? No way. They are both idiots. But Kerry might be able to save the economy.
Everyone should vote for John Kerry. He makes so much more sense than Bush. Bush just blabs away about anything that comes to mind. He has the IQ of a rock.
GO BUSH!!! KERRY SUCKS JUST LIKE THE YANKEES!!!
People should get a little more educated about voting. Instead of going out there and just doing it, they need to know the real issues. And I think that people who are lazy and like free stuff should vote for Kerry, because that’s all he is about.
…not in our nation’s best interest
Do you have any programs or ideas that you might want to try out with music? I’ve always wanted to mix the two.
/ Sebastian, Sweden.
Presidential Debate or Presidential Parody? Ahh. You lose either way.