In a fundamental sense, impact implies a change in the state of a system. In order to measure the impact of a media project, then, we might compare the state of a system before the project’s release with its state after, and note any differences. In practice, however, impact assessment is much more complicated; establishing a singular cause for social change is notoriously difficult. Fortunately, new sources of data and analytical techniques allow us to observe forces of social change at larger scales and higher resolutions than ever before.
Let’s walk through this process by way of (hypothetical) example: say, a documentary film about the American health care system, released in 2012.
The filmmakers consider their primary objective to be “shifting the national conversation” around health care from identifying problems to proposing solutions. Their potential audience is broad, but they’ve prioritized reaching those individuals directly contributing to and shaping the conversation, such as policy makers and columnists at national newspapers. While the filmmakers acknowledge that changes at the societal level take more time to manifest than those at smaller scales, they hope to see some evidence of impact within a year’s time.
The next step, then, is to identify measurable indicators of impact on the target audience and relate them back to the project’s goals. For example, we can download and analyze a broad sample of the national health care conversation itself, both pre- and post-release. The former acts as a baseline against which we can assess changes in the character of conversation post-release. At the very least, such a dataset should include discussion of the topic by political leaders as well as topic-specific articles and editorials in national newspapers. Given such data, analysis will rely mostly on natural language processing and network analysis techniques.
First, we have to identify data sources. For official political discussion at the national level, we’ll use the Sunlight Foundation’s Capitol Words API, which provides access to the full U.S. Congressional Record since 1996. As for news sources, the New York Times offers a variety of APIs (Application Programming Interfaces) that allow sophisticated, full-text searches of articles published since 1851. For both sources, querying by keyword(s) is probably our best bet.
But that approach brings up an interesting challenge: How do we derive a small set of key words or phrases that “cover” the conversation around an entire social issue? One possible strategy goes as follows:
- Aggregate a collection of issue-specific text documents.
- Parse the text and extract a list of “candidate” words or phrases, such as noun phrases (or some subset thereof).
- Score candidates by several criteria, including domain specificity and consensus, lexical cohesion, and other heuristics.
- Choose the highest-ranked candidates as query terms.
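The strategy above can be sketched in a few lines of Python. This is a deliberately minimal stand-in, not the actual pipeline: `extract_candidates` and `score_candidates` are illustrative names, candidate extraction here is simple uni-/bigram chunking rather than real noun-phrase parsing, and “domain specificity” is approximated as frequency in the issue-specific corpus relative to a background corpus.

```python
import re
from collections import Counter

def extract_candidates(text):
    """Naive candidate extraction: lowercased uni- and bigrams,
    filtered against a tiny stopword list (a stand-in for proper
    noun-phrase chunking)."""
    stopwords = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on"}
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

def score_candidates(domain_docs, background_docs, top_n=10):
    """Score candidates by domain specificity: how often a term appears
    in the issue-specific corpus relative to a background corpus."""
    domain = Counter(c for doc in domain_docs for c in extract_candidates(doc))
    background = Counter(c for doc in background_docs for c in extract_candidates(doc))
    scores = {term: count / (1 + background[term]) for term, count in domain.items()}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]
```

Feeding a handful of health-care documents and a generic background corpus into `score_candidates` pushes issue-specific terms like “medicare” to the top while down-weighting generic vocabulary.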
With this methodology, we get sensible top terms: medicare, medicaid, healthcare, health insurance, primary care, health reform, affordable care, and so on. These terms match our intuitions about healthcare conversations and allow us to move forward while retaining a cautious, skeptical attitude. The aforementioned APIs can then be queried such that any article or speech containing any of these terms is added to our snapshot of the “health care conversation.” This amounts to approximately 84k NYT articles and 94k Congressional speeches since 1996.
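As a local stand-in for those API queries (the real filtering happens server-side via the NYT and Capitol Words APIs), the inclusion rule is simply: a document joins the snapshot if it contains any of the query terms.

```python
def matches_conversation(text, query_terms):
    """Return True if a document mentions any query term -- a local
    approximation of the keyword queries sent to the APIs."""
    lowered = text.lower()
    return any(term in lowered for term in query_terms)
```

Applied across a corpus, this yields exactly the union-of-keywords snapshot described above.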
There are a number of ways to algorithmically identify the important terms or topics in a collection of text documents. Here’s how we might go about it in a given case:
- Clean the text, split it into words, tag each word’s part of speech, and filter out uninteresting words.
- Construct a network: each unique word is represented as a node; connections between nodes are made when the two words are used within a certain number of words of each other.
- Run a variant of Google’s PageRank algorithm over the network to determine which words are “most important.”
- Use top-ranked words as seeds to build phrases, then clean the list, remove duplicates, and re-rank them.
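The network-construction and ranking steps can be sketched with a toy PageRank implementation. This is a simplified illustration under stated assumptions: the graph is undirected, built from a fixed sliding window, and the power iteration is a bare-bones version of what a graph library would provide.

```python
from collections import defaultdict

def cooccurrence_graph(words, window=3):
    """Link each word to the words that follow it within the window;
    edges are stored symmetrically (undirected graph)."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for u in words[i + 1 : i + window]:
            if u != w:
                graph[w].add(u)
                graph[u].add(w)
    return graph

def pagerank(graph, damping=0.85, iters=50):
    """Bare-bones PageRank by power iteration: each node repeatedly
    distributes its rank evenly among its neighbors."""
    nodes = list(graph)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            incoming = sum(rank[m] / len(graph[m]) for m in nodes if n in graph[m])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new
    return rank
```

On a short word sequence, the highest-ranked nodes are the words with the densest co-occurrence neighborhoods, which is exactly the signal used to seed phrase-building in the last step above.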
Rather than delving deeply into the details, let’s examine some concrete results. Here are the top phrases derived from the U.S. Congressional conversation around healthcare in 2011 vs. 2013:
It’s wise not to read too much into these results, if only because they are (by construction) at the extreme end of a much larger distribution of terms; however, we can make some broad observations informed by general knowledge of the evolving narrative around health care in America. In both years, we see an emphasis on Medicare as well as health care reform, largely around the Affordable Care Act, aka Obamacare. In 2011, de-funding health care reform makes an appearance, while in 2013, Federal Government shutdown indicates the end-game of that strategy. Terms in 2013 seem to deal more with the realities of implementing Obamacare—e.g., State health insurance exchanges—while terms in 2011 seem more abstract, focusing on spending and the Reform Act or reform law in general. Lastly, in 2013 we can see a number of political talking points, including one-size-fits-all health care law, job-killing bill, quality affordable care, and health insurance for 50-plus-million Americans.
The most important aspect of this exercise is in relating it back to the goals of our hypothetical filmmakers: shifting the national conversation around healthcare towards solutions and away from problems. The brief results shown here are suggestive but insufficient to provide definitive answers about what change was caused by this particular film. The different ways in which members of Congress are talking about healthcare are interesting, albeit unsurprising to those who follow American politics. As a next step in our brief analysis of the text that we collected, it could be enlightening to see if/how any of those semantic frames have been echoed in the press.
That said, let’s try something a bit different with the collection of healthcare-related NYT articles. The Times has a sophisticated tagging scheme for its content, and fortunately its API returns such tagging data with every article. Using the approximately 5.5k articles published in 2011 and 3.1k published in 2013 on healthcare, we generated networks of tags indicating what, specifically, each collection of articles is about. Here’s the result for 2011:
Each node in the network is a major subject keyword tag; its size and opacity are directly related to the number of times it is applied to the articles in the corpus. Connections between nodes are created when two tags are applied to the same article (articles usually have more than one subject tag); the width of the connection is directly related to the number of times those two tags co-occur. You can see that the dominant tags are United States Politics And Government, Health Insurance And Managed Care, Federal Budget (US), and Medicine And Health. They are all located in the densely connected center of the network, which makes sense: the more a tag occurs, the more chances it has to connect to additional tags. For comparison, let’s look at the network for articles published in 2013:
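Building such a tag network from per-article tag lists is straightforward to sketch. The tag data below is made up for illustration, but the structure mirrors what the Times API returns (a list of subject tags per article): node weight counts tag applications, edge weight counts co-occurrences on the same article.

```python
from collections import Counter
from itertools import combinations

def tag_network(article_tags):
    """Node weight = number of times a tag is applied across articles;
    edge weight = number of articles on which two tags co-occur."""
    nodes, edges = Counter(), Counter()
    for tags in article_tags:
        nodes.update(tags)
        # frozenset makes the edge undirected: (A, B) == (B, A)
        edges.update(frozenset(pair) for pair in combinations(sorted(set(tags)), 2))
    return nodes, edges
```

The resulting counters map directly onto the visual encoding described above: node size/opacity from `nodes`, connection width from `edges`.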
In this network, the most important tags are Health Insurance And Managed Care, Patient Protection And Affordable Care Act (2010), Federal Budget (US), and United States Politics And Government. As we saw with the Congressional conversation, Obamacare is a dominant driver of conversation, even more so in 2013 than in 2011!
Admittedly, these networks are difficult to understand at a glance, owing to their many nodes, inevitable text overlap, and the density of connections between them. Still, if you look closely enough, you can identify patterns that do indeed make sense. For example, Elder Care is strongly linked to Medicare, Dementia, and Death And Dying. In the 2013 network, Birth Control And Family Planning is closely tied to Abortion, Patient Protection And Affordable Care Act (2010), and Editorials. And so on.
Identifying the relationships in such a complex dataset is better done by algorithm than by eye. As it turns out, the results from such a shallow dive into the data (to which we’ve restricted ourselves in this post) aren’t particularly illuminating: as expected, the networks both have clusters around U.S. politics, law and legislation, medical insurance, death and disease, a range of economic issues, discrimination, etc. One big difference, perhaps of interest to our hypothetical filmmakers, is the presence of a sizable cluster around Reform and Reorganization in the 2013 network that wasn’t present in 2011. This seems to be tied to the rise of conversation around Obamacare as it has continued to roll out. If we wanted to perform a deeper dive into the data, we could do a textual analysis of the articles in this cluster to reveal more nuanced insights, but we’ll leave that as an exercise for another time.
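The “by algorithm” step here usually means community detection. Below is a minimal sketch using naive label propagation over a toy graph; real analyses would typically reach for a library algorithm (e.g., modularity-based clustering), and the tie-breaking rule is chosen purely for determinism.

```python
from collections import Counter

def label_propagation(graph, rounds=10):
    """Minimal label-propagation community detection: every node starts
    in its own community, then repeatedly adopts the most common label
    among its neighbors (ties broken by smallest label)."""
    labels = {node: node for node in graph}
    for _ in range(rounds):
        for node in sorted(graph):
            counts = Counter(labels[nb] for nb in graph[node])
            top = counts.most_common(1)[0][1]
            labels[node] = min(l for l, c in counts.items() if c == top)
    return labels
```

Run on a tag co-occurrence network, nodes in densely connected regions converge to a shared label, surfacing clusters like the Reform and Reorganization grouping noted above without any manual inspection.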
This quick case study was meant to illustrate best practices when it comes to assessing impact: careful consideration of goals and audiences, the necessity of baseline data against which change can be assessed, and analysis of data explicitly related back to said goals and audiences. Combining social science and media research theory with data science methodology gives us a bird’s eye view of the story of media impact. For more examples, check out our past and current work.