We did it! The free, open source, Web-based, university-hosted, FISMA-compliant Coding Analysis Toolkit (CAT) recorded its one millionth coding choice. Pretty much all the credit goes to Texifter CTO and chief CAT architect Mark Hoy, who has put in many paid (and unpaid) hours making sure CAT is reliable, usable, and scalable. Texifter Chief Security Officer Jim Lefcakis also played a key role, ensuring the hardware and server room were maintained at the highest level of reliability and security. In honor of this milestone, I have been digging through my unpublished papers looking for material that explains in more detail where CAT, PCAT, DiscoverText, QDAP, and Texifter come from. This post is the first in a series about the particular approach to coding text we have come to call the “QDAP method.”

Large political text data collections are coded and used for basic and applied research in the social and computational sciences. Yet the manual annotation of text—the coding of corpora—is often conducted in an ad hoc, inconsistent, non-replicable, invalid, and unreliable manner. Even the best intentions to create the possibility for replication can, in practice, confound the most ardent followers of the creed “Replicate, Replicate.” While mechanical, process, documentary, and other challenges exist for all approaches, practitioners of qualitative or text data analysis routinely profess greater, even insurmountable, barriers to re-using coded data or repeating significant analyses. There are diverse approaches to coding text, but they tend to be hidden away in small niche sub-fields, where knowledge of them is limited to a small research community, a project team, or even a single person. While researchers classify text for a variety of reasons, it remains very difficult, counter-intuitively for many, to share these annotations with other researchers, or to work on them with partners from other disciplines for whom the coding may serve an alternate purpose.
A change in the way researchers think about, conduct, and share coded political corpora is overdue. Coding is expensive, challenging, and too often idiosyncratic. Training and retaining student coders, or producing algorithms capable of tens of thousands of reliable and valid observations, requires patience, funding, and a framework for measuring and reporting effort and error. Given these factors, it is not surprising that a proprietary model of data acquisition and coding still dominates the social sciences. Despite the important role of the social in social science, researchers guard “their” privately coded text, even the raw data, fearing others will beat them to the publication finish line or challenge the validity of their inferences. This competitive approach to producing, and failing to share, annotations disables intriguing and highly scalable collaborative social research possibilities enabled by the Internet. Researchers should seek to enhance and modernize their architecture for large-scale collaborative research using advanced qualitative methods of data analysis. This will require working out and attaining widespread acceptance of Internet-enabled data-sharing protocols, as well as establishing free, open source platforms for coding text and for sharing and replicating results. We believe that, when used in combination, the Dataverse Network Project and the Coding Analysis Toolkit (CAT) represent two important steps advancing that effort. Large-scale annotation projects conducted on CAT can be archived in the Dataverse and, as a result, will be more easily available for replication, review, or re-adjudication of their original coding. In Part Two of the series “Coding Text the QDAP Way,” we’ll say more about the role of scholarly journals in advancing this practice of re-using datasets.
Texifter manages the Coding Analysis Toolkit (CAT), a free, open source, Web-based, FISMA-compliant system launched in the fall of 2007 and hosted by the University of Pittsburgh. CAT is the precursor to PCAT and DiscoverText. This is a big day for the CAT team, as we have just recorded the one millionth coding choice in the system. Why do people like to use this software? Certainly the price helps. Over the years, however, we have engineered CAT to make some of the most common coding and validation tasks easier. CAT uses a simple keystroke coding interface and predefined text spans to limit the pain caused by using a mouse. More important to regular CAT users are the on-board tools for easily calculating multi-coder reliability. CAT simplifies the process of assigning the same coding task to a group of coders, who can code asynchronously via the Web. When the coding is done, it is a simple matter to generate a table of inter-rater reliability statistics for better understanding how different coders use the various codes. When pre-testing a new coding scheme, this on-the-fly measure of reliability is a key learning and training tool we use at QDAP all the time. Probably the most important innovation introduced by CAT is the adjudication module. The 118,850 adjudication choices recorded to date by CAT users grew out of our old practice of comparing multi-coder experiments with pen and paper. Aside from using lots of paper in big experiments, we found ourselves with the time-consuming challenge of transferring our validation choices back into the software we were using at the time. Validation in CAT allows an expert or consensus team to review coding choices one at a time and score them as valid or invalid. The system reports validity as a percentage by code, coder, and project. This (often iterative) step is absolutely critical when training a coding team.
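CAT's own reliability implementation is not shown here, but the kind of chance-corrected agreement a table like this reports can be sketched with Cohen's kappa, a standard statistic for two coders labeling the same items. The codes and coder choices below are hypothetical, purely for illustration:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' labels on the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Proportion of items on which the two coders chose the same code.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's code frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codes applied by two coders to the same ten text spans.
a = ["pos", "neg", "pos", "neu", "pos", "neg", "neu", "pos", "neg", "pos"]
b = ["pos", "neg", "neu", "neu", "pos", "neg", "neu", "pos", "pos", "pos"]
print(round(cohens_kappa(a, b), 3))  # → 0.683
```

A kappa of 1.0 means perfect agreement and 0 means chance-level agreement, which is why a statistic like this is more informative than raw percent agreement when pre-testing a coding scheme.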