[Originally posted May 17, 2011]
Researchers interested in large text collections and their itinerant coders tend to muddle through with limited collaborative, cross-disciplinary resources upon which to draw. The generic criteria for high-quality codebook construction and effective coding are underdeveloped, even as the tools and techniques for measuring the limits of manual or machine coding grow ever more sophisticated. In that paradox there may be the seed of a partial solution to some of these issues.
The ability to quickly and easily pre-test coding schemes and produce on-the-fly displays of coding inconsistencies is one way to train coders more uniformly to perform reliably (hence usefully) while ensuring a satisfactory level of valid observations. By the same token, the ability to permit an unlimited number of users to review or replicate all the coding and adjudication steps using a free, web-based platform would be a large and bold step onto our methodological and metaphorical bridge. The field needs more universal annotation metrics, a standard lexicon, and widely shared, semi-automated coding tools that make the work of humans more useful, fungible, and durable. Ideally, these tools would be interoperable, or combined in a single system. Such a system would allow human coders to create annotations and allow other experts to efficiently examine, influence, and validate their work.
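As a rough illustration of what an on-the-fly display of coding inconsistencies might compute, the sketch below lists the items on which two coders disagree and reports Cohen's kappa, a common chance-corrected agreement statistic. The function names, the document IDs, and the "form"/"unique" labels are hypothetical and chosen only for illustration; this is a minimal sketch, not a description of any particular coding platform.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labeled the same items in the same order."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(codes_a)
    # Observed agreement: share of items given the same code by both coders.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Expected agreement by chance, from each coder's marginal code frequencies.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used one identical code throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def disagreements(doc_ids, codes_a, codes_b):
    """Return (doc_id, code_a, code_b) for every item the coders coded differently."""
    return [(d, a, b) for d, a, b in zip(doc_ids, codes_a, codes_b) if a != b]


# Hypothetical pre-test: two coders label six public comments.
doc_ids = ["c1", "c2", "c3", "c4", "c5", "c6"]
coder_a = ["form", "form", "unique", "unique", "form", "unique"]
coder_b = ["form", "unique", "unique", "unique", "form", "unique"]

kappa = cohen_kappa(coder_a, coder_b)
conflicts = disagreements(doc_ids, coder_a, coder_b)
print(f"kappa = {kappa:.2f}")
for doc_id, a, b in conflicts:
    print(f"{doc_id}: coder A chose {a!r}, coder B chose {b!r}")
```

In a pre-test, the list of conflicts is often more useful than the summary statistic, because it points trainers to the specific items and code definitions that need adjudication.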
At a deeper level, this calls for much better and more transparently codified approaches to training and deploying coders, in effect an annotation science subfield, so that a more coherent and collaborative research community can form around this promising methodological domain.
Investigators in the social sciences use reliably coded texts to reach inferences about diverse phenomena. Many forms of public-sphere discourse and governmental records are readily amenable to coding; these include press content, policy documents, speeches, international treaties, and public comments submitted to government decision-makers, among many others. Systematic analysis of large quantities of these sorts of texts represents an appealing new avenue for both theory building and hypothesis testing. It also represents a bridge across the divide between qualitative and quantitative methodologies in the social sciences. These large text datasets are ripe for mixed-methods work that can provide a rich, data-driven approach to both the macro and micro views of large-scale political phenomena.
Traditionally, social scientists working with text use a variety of qualitative research methods for in-depth case studies. For many legitimate and pragmatic reasons, these studies generally consist of a small number of cases or even just a single case. As Steven Rothman and Ron Mitchell note, the reliability of data drawn from qualitative research comes under greater scrutiny, as increased dataset complexity requires increased interpretation and, subsequently, leads to increased opportunity for error. The case study method is plagued by concerns about limitations on its external validity and the ability to reach generalized inferences. With the proliferation of easily available, large-scale digitized text datasets, an array of new opportunities exists for large-n studies of text-based political phenomena that can yield both qualitative and quantitative findings. More to the point, high-quality manual annotation opens up the possibility for cross-disciplinary studies featuring collaboration between social and computational scientists. This second opportunity exists because researchers in the computational sciences, particularly those working in text classification, information retrieval (IR), opinion detection, and natural language processing (NLP), hunger for the elusive “gold standard” in manual annotation.
Accurate coding with high levels of inter-rater reliability and validity is possible. For example, work by the eRulemaking Research Group on near-duplicate detection in mass e-mail campaigns demonstrated that focusing on a small number of codes, each with a clear-cut rule set, can produce just such a gold standard. Reliably coded corpora of sufficient size and containing consistently valid observations are essential to the process of designing and training NLP algorithms. We are likely to see more political scientists using methodologies that combine manual annotation and machine learning. In short, there are exciting possibilities for applied and basic research as techniques and tools emerge for reliably coding across the disciplines. To unleash the potential of this interdisciplinary approach, a research community must now form around the nuts-and-bolts questions of what and how to annotate, as well as how to train and equip the coders who make this possible.