How voters turned out on Facebook

We just posted this on the Data Team Page, and I thought I would post it here as well.

When Facebook users in the United States logged into Facebook on Election Day this year, they were greeted by a message alerting them of voting activity on Facebook. Users could click a button to announce to their friends that they had already voted and see which of their friends had done the same.

These data about who on Facebook voted offer a new lens into the demographics and behaviors underpinning election returns. There are a few caveats (e.g., selection bias toward people who are members of Facebook and who visit frequently, reporting bias, and no verification), but we believe that looking at these data across a number of dimensions offers insight into what types of people decided to vote, when they went to the polls, and which factors may have influenced the election.

Voter turnout has been a central issue during this election cycle. Would disillusioned voters stay home? Would there be an enthusiasm gap between Republicans and Democrats? By looking at those users who state their political affiliation on Facebook, we can see a significant discrepancy between the Democrats and Republicans: Dems were 3% less likely than Republicans to get out to the polls. In a number of House and Governor elections, this would have been enough to flip the vote.

We can also observe how people of different ages behaved. The figure above shows the proportion of users in each age bucket who said they voted as a fraction of the people who came to the site yesterday, broken down by political party. If you’re wondering if youth today are apathetic about voting, this graph is striking proof of this fact. Voter turnout peaks at 65 years of age, while the lowest turnout occurs at 18 years of age. In fact, a 65-year-old is almost 3 times as likely to vote as a younger counterpart. This tracks results collected from traditional exit polls, which also show a 30% turnout gap between younger voters and older voters. Furthermore, while Democrats were able to mobilize as many young voters as Republicans, Republicans were far more successful at mobilizing older voters.

Many of the seats in this year’s election were hotly contested.  But did voters respond by turning out to the polls? The map above shows voter turnout by state. There are a few general trends: lower rates of people in the South and the Northeast turned out. The two states with lowest voter turnout were New York and Utah, followed closely by Mississippi and Nebraska.

Another view into the state-level turnout is the relative percentage of voters who came out in each state. The map above shows the share of voters in each state: A blue state means Democrats were voting much more than Republicans, while red implies high Republican turnout relative to Democrats. States with an even number of Democrats and Republicans voting are grey.  Unsurprisingly, traditionally blue states on the Pacific coast and northeast are blue, while the South and mountain states are red.  Grey states partially reflect some of the most hotly disputed seats in battleground states, such as Nevada and Virginia.

On our election-day display we showed users which of their friends had voted; but how much effect could this have on voter turnout? Could people see their friends voting and go out to do the same? The above plot shows the probability that a person voted yesterday as a function of the fraction of their friends who had voted. As more and more of your friends vote, not surprisingly, you are more likely to vote. Unfortunately, we cannot tell whether this effect is because of social influence, or if voting practice is simply clustered at a local level, but the fact that voting behavior is shared between friends is quite clear.

Finally, we wanted to look into recent research which suggests that irrelevant events can have a large effect on voter turnout. It was expected that the winners of this year’s World Series would get a boost in voters, while the loser would see a decline. As we can see from the chart above, 6% fewer Rangers fans voted than Giants fans (go Giants!), but without any longitudinal data it is impossible to know if winning or playing in the World Series had a causal effect on voter turnout. The results, however, are suggestive. It is worth noting that both Giants fans and Rangers fans turned out at rates significantly lower than others in those states (California and Texas). Having the last game of the World Series the night before the election probably means some people weren’t in the right mindset to go out and vote the very next morning.

This post was made possible by Jonathan Chang who crunched the numbers, Jason Bonta, Feng Qian, Nathan Schrenk and Doug Li who developed the election day tool and Adam Conner who brought our election day efforts together.

Introducing Facebook Fellowships

Today I’m happy to announce that Facebook will be offering fellowships to support graduate students in the 2010-2011 school year. The program will provide tuition, stipend and other perks to lucky students whose applications are chosen. Lots more details can be found on the Facebook Fellowship page.

The areas are quite broad, and reflect the range of problems we believe are important in shaping the future of social media and web engineering:

  • Internet Economics: auction theory and algorithmic game theory relevant to online advertising auctions.
  • Cloud Computing: storage, databases, and optimization for computing in a massively distributed environment.
  • Social Computing: models, algorithms and systems around social networks, social media, social search and collaborative environments.
  • Data Mining and Machine Learning: learning algorithms, feature generation, and evaluation methods to produce effective online and offline models of behavioral signals.
  • Systems: Hardware, operating system, runtime, and language support for fast, scalable, efficient data centers.
  • Information Retrieval: search algorithms, information extraction, question answering, cross-lingual retrieval and multimedia retrieval.

If you or any Ph.D. students you know are interested in applying for the program, the deadlines are quite tight to make sure we can support students in the upcoming year. I’m really looking forward to seeing the applications. If you have any questions, please feel free to ask me or email the fellowship list at fellowships AT facebook.com.

How Diverse is Facebook?

In order to make Facebook as open and connected as possible for everyone, one of our goals is to understand how different populations of users join and use the service. With that objective in mind, the Facebook Data team recently sought to answer the question, “How diverse are the ethnic backgrounds of the people using Facebook?” This is a tough question to answer because, unlike information such as gender or age, Facebook does not ask users to share their ethnicity or race on their profiles. In order to answer it, we focused on a single country with a large and diverse population—the United States. Comparing people’s surnames on Facebook with data collected by the U.S. Census Bureau, we are able to estimate the racial breakdown of Facebook users over the history of the site.

facebook-minority-proportion

We discovered that Facebook has always been diverse and that the diversity has increased significantly over the past year to the point where U.S. Facebook users nearly mirror the diversity of the overall population of the country. The graph above shows the proportion of the three largest minorities on Facebook over time as predicted by our model, while the dashed lines show the proportion of the Internet population for the same ethnicities.

In this report, we’ll discuss how we are able to measure diversity without user-supplied race or ethnicity. We’ll also explain how race and ethnicity have varied over the course of Facebook’s history and explore future research for understanding friendship diversity on the site.

Methodology

The U.S. Census Bureau’s Genealogy Project publishes a data set containing the frequency of popular surnames along with a breakdown by race and ethnicity. These data are the key to our analysis, so we will describe them in some detail. An example of the raw data is shown below for the three most-frequent surnames in the census: Smith, Johnson and Williams. These data provide the rank in the population, the total count of people with the name, their proportion per 100,000 Americans, and the percent for various races: White, Black, Asian/Pacific Islander, American-Indian/Alaskan Native, two or more races and Hispanic respectively[1].

name rank count prop100k cum_100k white black api aian 2prace hisp
SMITH 1 2376206 880.85 880.85 73.35 22.22 0.4 0.85 1.63 1.56
JOHNSON 2 1857160 688.44 1569.3 61.55 33.8 0.42 0.91 1.82 1.5
WILLIAMS 3 1534042 568.66 2137.96 48.52 46.72 0.37 0.78 2.01 1.6

This data set allows us to predict what a person’s race is based solely on his or her surname. While these predictions will often be wrong, in aggregate they will be correct. For example, suppose you select 10,000 people with the name Smith from the U.S. population at random. The data above suggest that 7,335 of them will be White, 2,222 will be Black and so on. Certain names will be more predictive of a certain race, while others will predict a wide array of ethnic backgrounds. The table below shows the top three names within the top 1,000 ordered by the percent in a given group. It shows that some ethnicities have distinctive surnames while others do not. For instance, 98.1% of individuals with the name Yoder are White while the most predictive name for American Indian / Alaskan Native individuals only has 4.4% in that group. For this reason, we will only look at White, Black, Asian/Pacific Islander and Hispanic predictions in our analysis.

Name Rank Count % in group
Caucasian
Yoder 707 44245 98.1%
Krueger 863 36694 97.1%
Mueller 467 64305 97.0%
African American
Washington 138 163036 89.9%
Jefferson 594 51361 75.2%
Booker 902 35101 65.6%
Asian / Pacific Islander
Zhang 963 33202 98.2%
Huang 697 44715 96.8%
Choi 872 57786 96.4%
American Indian / Alaskan Native
Lowery 752 41670 4.4%
Hunt 157 151986 3.9%
Sampson 844 37234 3.8%
Two or more races
Ali 876 36079 13.4%
Khan 665 46713 15.6%
Singh 396 72642 15.3%
Hispanic
Barajas 989 32147 96.0%
Orozco 690 45289 95.1%
Zavala 938 34068 95.1%

A simple technique for finding the distribution of ethnicities on Facebook is as follows: given the users who are on the site at a given time, sum the total users with each name in the Census Genealogy data. For each of these names, we estimate the total number of each ethnicity by multiplying by the numbers above. As in the previous example, if we have 10,000 Smiths on the site at one time, then we assume we have 7,335 White users, 2,222 Black users, and so on.
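The simple technique above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function name `estimate_group_counts` and the dictionary layout are assumptions, and the percentages are copied from the three census rows shown earlier in this post.

```python
# Percent of each surname's bearers in each group, copied from the census
# rows for Smith, Johnson and Williams shown earlier in this post.
CENSUS_PERCENT = {
    "SMITH": {"white": 73.35, "black": 22.22, "api": 0.40, "hisp": 1.56},
    "JOHNSON": {"white": 61.55, "black": 33.80, "api": 0.42, "hisp": 1.50},
    "WILLIAMS": {"white": 48.52, "black": 46.72, "api": 0.37, "hisp": 1.60},
}


def estimate_group_counts(surname_counts, census_percent=CENSUS_PERCENT):
    """Estimate users per group by weighting surname counts with census percents.

    surname_counts maps a surname to the number of users with that surname.
    Surnames missing from the census table are skipped (an assumption of
    this sketch).
    """
    totals = {}
    for surname, n_users in surname_counts.items():
        row = census_percent.get(surname.upper())
        if row is None:
            continue
        for group, percent in row.items():
            totals[group] = totals.get(group, 0.0) + n_users * percent / 100.0
    return totals


# 10,000 Smiths: about 7,335 White users and 2,222 Black users.
estimates = estimate_group_counts({"Smith": 10_000})
```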

One potential source of error in this estimate comes from our assumption that users are selected at random from the U.S. population. What if Facebook is primarily White? Wouldn’t a majority of the Smiths be White then, breaking our assumption? In order to address this, we refine our estimates using a statistical technique known as mixture-modeling. We imagine that people come from a population with unknown racial/ethnic proportions. Individuals then get assigned names based on their race/ethnicity. Under this assumption, determining the ethnic makeup of Facebook becomes a problem of back-solving each individual’s ethnicity using only their revealed name. By allowing the Facebook population to be different from the Census population, and for each name to inform our interpretation of every other name, this technique allows us to more accurately estimate the expected number of Facebook users of a given race or ethnicity at any given time.
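One common way to fit a mixture model of this kind is expectation-maximization (EM). The sketch below is an assumption about how such a fit could look, not a description of the exact procedure used here: it treats the group proportions of the site as the unknown, derives the probability of a name given a group from census counts, and alternates between assigning each name's users to groups and re-estimating the proportions. All function and variable names are hypothetical.

```python
def name_given_group(census_counts, census_percent):
    """Return P(name | group) derived from census counts and percents.

    census_counts maps name -> number of people with that name.
    census_percent maps name -> {group: percent of that name in the group}.
    """
    groups = sorted({g for row in census_percent.values() for g in row})
    # Expected number of people with each (name, group) combination.
    joint = {
        name: {g: census_counts[name] * census_percent[name].get(g, 0.0) / 100.0
               for g in groups}
        for name in census_counts
    }
    group_totals = {g: sum(joint[name][g] for name in joint) for g in groups}
    table = {
        name: {g: (joint[name][g] / group_totals[g]) if group_totals[g] else 0.0
               for g in groups}
        for name in joint
    }
    return table, groups


def fit_group_proportions(site_counts, p_name_given_group, groups, n_iter=200):
    """Estimate the site's group proportions with EM.

    site_counts maps name -> number of site users with that name.
    """
    pi = {g: 1.0 / len(groups) for g in groups}  # start from uniform proportions
    for _ in range(n_iter):
        expected = {g: 0.0 for g in groups}
        used = 0.0
        for name, n_users in site_counts.items():
            row = p_name_given_group.get(name)
            if row is None:
                continue  # names outside the census table are ignored
            # E-step: posterior P(group | name) under the current proportions.
            weights = {g: pi[g] * row[g] for g in groups}
            norm = sum(weights.values())
            if norm == 0.0:
                continue
            for g in groups:
                expected[g] += n_users * weights[g] / norm
            used += n_users
        if used == 0.0:
            break
        # M-step: proportions are the expected share of users in each group.
        pi = {g: expected[g] / used for g in groups}
    return pi


# Toy example with made-up names and groups: the census population is split
# evenly, but name "A" is over-represented on the site.
toy_counts = {"A": 100, "B": 100}
toy_percent = {"A": {"x": 90.0, "y": 10.0}, "B": {"x": 10.0, "y": 90.0}}
toy_table, toy_groups = name_given_group(toy_counts, toy_percent)
toy_pi = fit_group_proportions({"A": 700, "B": 300}, toy_table, toy_groups)
# toy_pi converges toward x: 0.75, y: 0.25
```

In the toy example the census says 90% of people named "A" are in group x, but because the site skews toward x, the fitted posterior for "A" ends up higher than 90%. This is the kind of adjustment the paragraph above describes.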

Finally, we adjust the estimates in our analyses with Internet adoption rates based on values from the National Telecommunications and Information Administration report on the Networked Nation. We use the percent of households with Internet access as a proxy for the addressable Internet population of each race or ethnicity.

Results

Given the approach outlined in the methodology section, we obtain a picture of the relative makeup of Facebook’s racial subpopulations within the United States. Because the Facebook population is changing over time, as is the ethnic diversity of addressable Internet users, we compare these groups over time. At each time step we recalibrate our model to account for the set of people on Facebook.

To illustrate this, the following plot shows how the model’s estimate of the distribution of the surname Lee has changed over time, tracking the change in Facebook’s population along with the change in our predictions of ethnicity. The dashed lines show the ethnic breakdown of people named Lee given by the Census Bureau tables described above. The disparity between the solid and dashed lines shows the possible bias when estimating race/ethnicity without the adjustment we describe in the previous section. For instance, the Census numbers would underestimate the number of Asian/Pacific Islanders on Facebook and overestimate the number of Black users on Facebook.

facebook-lee

Looking at all users who have joined over the history of Facebook, we can examine the total population of each race on Facebook as predicted by our model at every point in time. These predictions are shown in the following chart. The chart conveys little about the diversity of Facebook since the growth of the site has affected all populations, and the U.S. population is predominantly White.

facebook-ethnicity-total

To look at the diversity of non-White users, the example shown at the top of this post shows our model prediction as a fraction of the Facebook population as well as the percent of the overall U.S. Internet population for each ethnicity. Here the solid lines show the Facebook percentage while the dashed lines show the U.S. population (in this case, we have chosen the U.S. population at the end of the time period). Because White users are a large majority, we have left them out of this plot.

Another approach to visualizing this data is to look at the relative saturation of each race. This is the fraction of users on Facebook compared to the fraction we would expect from the U.S. Internet population at that time. For instance, if Facebook had 100M users, and Asian Americans made up 4.4% of the U.S. Internet population, we would expect to find 4.4M Asian users on Facebook. If instead we observe 5M then the relative saturation would be roughly 114%.
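The arithmetic in this example can be written as a small helper. The function name `relative_saturation` is hypothetical, and the numbers are the illustrative ones from the paragraph above.

```python
def relative_saturation(observed_users, total_users, population_share):
    """Ratio of observed users in a group to the number expected from its
    share of the U.S. Internet population."""
    expected_users = total_users * population_share
    return observed_users / expected_users


# 100M users and a 4.4% population share give 4.4M expected users;
# observing 5M gives a saturation of roughly 114%.
ratio = relative_saturation(5_000_000, 100_000_000, 0.044)
print(f"{ratio:.0%}")  # prints 114%
```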

facebook-saturation

The plot above shows Facebook saturation by ethnic and racial groups. Since 2005, Asian/Pacific Islanders have been much more likely to be on Facebook than Whites, and that has remained so. While Hispanics were once 40 percent as likely as Whites to be on the site, this number has been steadily climbing since early 2007 and currently is at 80 percent. This graph also shows that Black users are now about as likely to be on the site as White users.

Conclusions

In this post we have outlined an approach to determine the racial and ethnic breakdown of a population based solely on people’s surnames and data provided by the U.S. Census Bureau. We have found that while Facebook has always been diverse, this diversity has increased over time leading to a population that today looks very similar to the U.S. population.

Since completing this initial work, we have started using the first names of users to increase the precision of our predictions. While in this post we have only looked at the diversity of the population as a whole, we hope to use predictions of race and ethnicity for individuals, along with their friend connections, to understand how these populations of users are connected to each other. We are working to understand how diversity of interpersonal relationships is changing over time as more users join the site and find their friends.

The work in this post was a collaborative effort between the data scientists Lars Backstrom, Jonathan Chang, Cameron Marlow and Itamar Rosenn. This is a cross-post of the note on the Data Team Facebook Page.

Footnotes

  1. While there are many preferences for describing people’s race and ethnicity, we have chosen to use the terms used in the U.S. Census to be consistent with our data.

Androgenization of Cameron

The baby name blog has a great post about how some names seem to become more female over time. It would appear from recent years that many names are becoming increasingly androgynous, and parents are afraid: what if my boy’s name becomes girlish? The author wonders whether this is it, the boypocalypse:

It’s one of the classic maxims of the baby name business: most parents who like “androgynous” names really like masculine-sounding names for both sexes. Parents of boys carefully avoid anything feminine. When a boy’s name starts to show up on the girl’s chart, the male version’s days are usually numbered. …Does that mean an entire generation of names is destined to turn feminine? Will boys eventually find themselves stranded on a tiny name island with nothing but kingly classics and absurdly macho inventions to choose from? Don’t panic yet, parents of boys. There are reasons to think that this crop may be different.

What’s fascinating though is that while the pronunciation of my name is extremely androgynous, the reality is that there are a number of variants which fall on either side of the gender spectrum: Kamren and Camren are mainly boys, Kamryn only for girls, and Camryn sported by both. I can only imagine the conversations that will ensue 10 years from now once these kids are in college: “c’mon man, it’s k-a-m-r-E-n, stop dissin’ me.”

Let’s just say I’m happy to be an ur-cameron.

Youtube Epidemiology Interface

Youtube launched the most amazing statistics recently, hidden under their collapsed “Statistics & Data” header. Instead of a random list of awards, it now shows a timeline of the growth of the video’s popularity along with references to each source. Take for instance the video “Chap-hop History” by Mr. B the Gentleman Player:

Youtube Statistics

In addition to the existing statistics page (including awards and demographics), this interface now shows a chronology of the video’s popularity. In the case of Mr. B, the link first appears on b3ta, then @DJYodaUK and a few other twitter users, followed by Facebook, more b3ta, and then Planet Gnome. For the first time I feel like I have a clear, concise view of how a piece of content went viral. Take as a counter-example a popular video from this week, The Cat That Betrayed His Girlfriend whose popularity seems to have existed before the video was on the site (its first source of traffic was searches for the title). Browsing around a bit, it seems as though most big videos get their start from external sources, related videos, and searches.

This feels like a secret view into the inner workings of the internet, but all the pieces are still scattered around. I guess Youtube is the only one that can put them together.

Teaching data science

The New York Times had a piece over the weekend discussing how computer science curricula are limited in their capacity to teach distributed computation and data mining:

For the most part, university students have used rather modest computing systems to support their studies. They are learning to collect and manipulate information on personal computers or what are known as clusters, where computer servers are cabled together to form a larger computer. But even these machines fail to churn through enough data to really challenge and train a young mind meant to ponder the mega-scale problems of tomorrow.

Besides being an advertisement for Facebook and Google internships, it does raise the question of how schools can adopt these technologies quickly enough to teach them. There have been lots of industrial partnerships and government grants for research clusters, but these are far from a standard undergraduate class on the topic. I would love to see Cloudera or a similar company partner with a hardware provider to make clusters affordable and easy to configure, while data scientists can make sure that they come pre-installed with some interesting data (Wikipedia, Twitter, etc.). With a consistent installation across institutions, professors can write and teach data science without the immense operational overhead of setting up a cluster and getting it operational.

Venturing to the tail

It’s now second nature to think that the top 1% of media account for an overwhelming percentage of overall sales. But how many people actually consume content from the more obscure parts of Netflix’s catalog? Sharad and Co. at Yahoo! Research just released the results of some research looking at how users fit into long-tail distributions of content.

Corpus Satisfaction

The results? “85% of Netflix users and 95% of Yahoo! Music users have ventured into the tail (i.e., consumed items not available in large, brick-and-mortar retailers), and 40% of Netflix users and 70% of Yahoo! Music users regularly consume tail items.” The distributions above show how many users in a given system will be satisfied when you only include the top items in a given catalog. People’s web browsing may be more obscure than their music tastes, but in both cases a media provider needs to maintain a significant catalog to afford the tastes of their audience.
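To make the idea of "satisfied" users concrete, here is a minimal sketch. It assumes a user counts as satisfied when at least a chosen share of the items they consumed falls inside the top-k most popular items; that definition, the function name and the toy data are assumptions for illustration, not the method from the Yahoo! Research paper.

```python
from collections import Counter


def satisfied_fraction(user_items, top_k, min_share=1.0):
    """Fraction of users whose consumption is covered by the top_k items.

    user_items maps a user to a non-empty list of items they consumed.
    Popularity is the number of distinct users who consumed an item.
    """
    popularity = Counter(
        item for items in user_items.values() for item in set(items)
    )
    catalog = {item for item, _ in popularity.most_common(top_k)}
    satisfied = 0
    for items in user_items.values():
        covered = sum(1 for item in items if item in catalog)
        if covered / len(items) >= min_share:
            satisfied += 1
    return satisfied / len(user_items)


# Made-up consumption data: "hit" is popular, "rare" sits in the tail.
toy_users = {
    "a": ["hit"],
    "b": ["hit", "rare"],
    "c": ["hit", "mid"],
    "d": ["mid"],
}
head_only = satisfied_fraction(toy_users, top_k=1)  # only user "a" is covered
```

Sweeping `top_k` from one item up to the full catalog traces out a curve of the same general kind as the distributions shown above.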

For social media practitioners, this is a great indicator of how much content you need to reach a mainstream audience. For music you’re going to need over 60% of the entire music catalog (or at least Yahoo!’s), and for search, well, I wouldn’t go there.

Bay Bridge Logistics

Fortunately I’m not affected at all by the added day of Bay Bridge Closure, but this quote about the repair amazes me:

The parts needed to make the fix were manufactured overnight by Stinger Welding Inc. in Coolidge, Ariz. Weighing about 18,000 pounds, they landed at Oakland International Airport aboard a chartered plane Sunday afternoon.

It reminds me of the MacArthur Maze Fire that was supposed to take half a year to repair, but ended up taking 25 days.

Cyborgs and offliners

On the train to work today I had the opportunity to read Aaron Swartz’s My Life Offline and danah boyd’s I want my cyborg back-to-back. The dichotomy between these two pieces, both from respected internet thinkers, is great. They aren’t necessarily contradictory, but they definitely show the range of emotions people have about being connected.

Maintained Relationships on Facebook

This past week the Economist published a piece entitled Primates On Facebook that described some research done by the Facebook Data Team. Since there have been a number of questions throughout the monkeysphere, we thought we would take the opportunity to describe our approach, the data, and our analysis.

network-comparison

We were asked a simple question: is Facebook increasing the size of people’s personal networks? This is a particularly difficult question to answer, so as a first attempt we looked into the types of relationships people do maintain, and the relative size of these groups. The image above presents a high-level overview of our findings: while the average Facebook user communicates with a small subset of their entire friend network, they maintain relationships with a group two times the size of this core. This not only affects each user, but also has systemic effects that may explain why things spread so quickly on Facebook.

Before discussing the data, let us first set the context.

People you know

Many people are asking questions about the number of friends they have on Facebook. Do I have enough? Do I have too many? What may be tripping people up here is the language: while the people you’re connected to on Facebook are called your “friends,” they’re more likely people you have met at some point in your life. Social network researchers have been trying to measure this number for decades, and have come up with a number of clever techniques.

If you’ve read the Tipping Point, you may remember a study Gladwell described where people were asked to identify whether or not they knew people with names from a long list culled from a phone book. Based on the probability of knowing someone with a given name and the number of people with this name that a person knows, we can estimate the number of people a given subject has met. Killworth, et al. found using this technique and others that the number of people a person will know in their lifetime ranges somewhere between 300 and 3000[1].
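The intuition behind this kind of estimate can be sketched as follows. This is a simplified version of the scale-up idea and not the estimator from Killworth et al.; the function name and all numbers are hypothetical.

```python
def estimate_network_size(known_counts, name_share):
    """Estimate how many people a subject knows from name recognition.

    known_counts maps a name to how many people with that name the subject
    knows. name_share maps a name to the fraction of the general population
    with that name. If a subject's acquaintances resemble the population,
    the number known with the listed names divided by the population share
    of those names approximates the total number of acquaintances.
    """
    total_known = sum(known_counts.values())
    total_share = sum(name_share[name] for name in known_counts)
    return total_known / total_share


# Hypothetical subject: knows 4 people whose names together cover 0.5% of
# the population, suggesting about 800 acquaintances.
size = estimate_network_size(
    {"name_a": 3, "name_b": 1},
    {"name_a": 0.004, "name_b": 0.001},
)
```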

On Facebook, the average number of friends that a person has is currently 120[2]. Given that Facebook has only been around for 5 years, that not everyone uses it, and that not all acquaintances have found each other, this number seems reasonable for an average user.

Communication network

As a subset of the people you know, there are some individuals with whom you communicate on an ongoing basis. The number of individuals that represent a person’s core support network has been found to be much, much smaller than their entire network. Peter Marsden found the number of people with whom individuals “can discuss important matters” numbers only 3 for Americans[3]. In a subsequent survey, researchers found that this number has dropped slightly over the past 10 years[4], causing some alarm in the press, but without sufficient explanation[5].

How many people an individual communicates with probably exists somewhere between their total network size and their support network. Some research by Gueorgi Kossinets and Duncan Watts observing all email communication at a university shows that the number of ongoing contacts hovers somewhere between 10 and 20 over a 30 day period[6].

Maintained Relationships

Facebook and other social media allow for a type of communication that is somewhat less taxing than direct communication. Technologies like News Feed and RSS readers allow people to consume content from their friends and stay in touch with the content that is being shared. This consumption is still a form of relationship management as it feeds back into other forms of communication in the future. For instance, a high school friend uploads a photo of her new puppy and this photo appears in your News Feed. You click on the photo, browse through a host of other photos and discover that she has also gotten engaged, which may lead you to reach out to her.

This type of communication is the core of the Facebook experience, and given the question posed by the Economist, we wondered what effect this sort of relationship maintenance had on the breadth of people’s networks.

Measuring Networks on Facebook

To try and answer questions about network size on Facebook, we looked at the communications of a random sample of users over the course of 30 days. We defined networks in 4 different ways:

  • All Friends: the largest representation of a person’s network is the set of all people they have verified as friends.
  • Reciprocal Communication: as a measure of a sort of core network, we counted the number of people with whom a person had had reciprocal communications, or an active exchange of information between two parties.
  • One-way Communication: the total set of people with whom a person has communicated.
  • Maintained Relationships: to measure engagement, we took the set of people for whom a user had clicked on a News Feed story or visited their profile more than twice.
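As a rough illustration of the four definitions above, the sketch below counts each network for one user from a log of interactions. The data layout is an assumption: `messages` and `views` are lists of hypothetical (actor, target) pairs, "one-way" is read as people the user wrote to, and "more than twice" is read as three or more view events of any kind.

```python
from collections import Counter


def network_sizes(user, friends, messages, views):
    """Count the four network types for a single user.

    friends: set of the user's confirmed friends.
    messages: (sender, recipient) pairs for comments, messages and wall posts.
    views: (viewer, viewed) pairs for News Feed clicks and profile visits.
    """
    sent_to = {dst for src, dst in messages if src == user and dst in friends}
    heard_from = {src for src, dst in messages if dst == user and src in friends}
    view_counts = Counter(
        dst for src, dst in views if src == user and dst in friends
    )
    maintained = {friend for friend, n in view_counts.items() if n > 2}
    return {
        "all_friends": len(friends),
        "reciprocal": len(sent_to & heard_from),  # communication in both directions
        "one_way": len(sent_to),
        "maintained": len(maintained),
    }


# Made-up 30-day log for user "a".
toy_sizes = network_sizes(
    "a",
    friends={"b", "c", "d"},
    messages=[("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")],
    views=[("a", "d")] * 3 + [("a", "c")] * 2,
)
```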

For each user we calculated the size of their reciprocal network, one-way network and network of maintained relationships, and plotted this as a function of the number of friends a user has. As Andreas mentions in his blog post about the article, the visualization (shown below) did not make it into the article, but presents a pretty clear picture of the relationship between these types of communication.

active-network-size

In the diagram, the red line shows the number of reciprocal relationships, the green line shows the one-way relationships, and the blue line shows the passive relationships as a function of your network size. This graph shows the same data as the first graph, only combined for both genders. What it shows is that, relative to the number of people a Facebook user actively communicates with, that user is passively engaging with between 2 and 2.5 times as many people in their network. I’m sure many people have had this feeling, but these data make this effect more transparent.

Systemic Effects

What does a 2x increase in connectivity mean for a network? The easiest way to observe this is to look at one person’s personal network. The image below shows the personal network for one of my coworkers. The first diagram shows his entire network, namely all of his friends, and all of the relationships between his friends. It is clear that the cluster on the top is the highly connected set of Facebook coworkers, and the cluster on the right is another group of friends.

asmith-connections

The cell on the bottom right shows only those relationships that have reciprocal communication. Many of the individuals in his network are completely disconnected or out of touch with each other. Moving to the bottom left cell, we see the slightly more connected network containing one-way communication. This includes every person who wrote a comment, sent a message or wrote a wall post to one of my coworker’s other friends. The cell on the top-right shows the passive network, including all those people who were keeping up with their friends. While some of his friends are still disconnected, a very large percentage are now reachable through some set of observations.

The stark contrast between reciprocal and passive networks shows the effect of technologies such as News Feed. If these people were required to talk on the phone to each other, we might see something like the reciprocal network, where everyone is connected to a small number of individuals. Moving to an environment where everyone is passively engaged with each other, some event, such as a new baby or engagement can propagate very quickly through this highly connected network.

While these data are not a controlled experiment, and do not directly relate to the theories described above, they do show a directional trend in the way people manage relationships on a social network today. We hope to continue this line of research with the eventual hope of making relationships that much easier to manage.

This post represents the work of data scientists Lee Byron, Tom Lento, Cameron Marlow and Itamar Rosenn. Special thanks to Alex Smith for letting us use him as an example. For more insights like this, make sure to become a fan of the Facebook Data Team.

Footnotes

  1. Killworth, P., Johnsen, E., Bernard, H. R., Shelley, G. A., and McCarty, C. Estimating the size of personal networks. Social Networks 12 (1990), 289–312.
  2. Facebook Statistics
  3. Marsden, P. Core discussion networks of Americans. American Sociological Review 52, 1 (1987), 122–131.
  4. McPherson, M., Smith-Lovin, L., and Brashears, M. E. Social isolation in America: Changes in core discussion networks over two decades. American Sociological Review 71, 3 (June 2006), 353–375.
  5. While this work is well cited, there is evidence that the methodology underestimates the core network, e.g. Bearman, P., and Parigi, P. Cloning Headless Frogs and Other Important Matters: Conversation Topics and Network Structure. Social Forces 83 (2004), 535.
  6. Kossinets, G., and Watts, D. J. Empirical analysis of an evolving social network. Science 311, 5757 (January 2006), 88–90.