The evolution of the API opens the door for third-party developers to access information on social media networks. In the best case, this provides a healthy, democratic flow of information. Yesterday, DiscoverText had “rate limits” imposed in terms of its access to Twitter data. As documented, the Twitter API allows 150 unauthenticated calls per hour, per IP address. Authenticated calls (users logged in with their Twitter credentials via OAuth) allow up to 350 calls per hour, per person. In addition, the Twitter Search API has internal rate-limiting mechanisms, but Twitter does not publish those specific limitations for fear of abuse. Going over any of these limits results in the user being presented with “Error 420,” which simply means the user is being rate limited. This hampers the ability to harvest Twitter feeds within DiscoverText. We had never had rate limit problems prior to this, but according to timestamps on articles posted on Twitter’s developer website, Twitter may have become more cognizant of those harvesting large amounts of data (not just us) and, as a result, is cracking down on heavy users.

At Texifter, we fully respect the rules and regulations of the Twitter API, and in no way seek to disobey or bend those rules in our flagship software product, DiscoverText. On August 18, 2011, the same day we learned of the 420 errors, we performed emergency maintenance to better cope with Twitter rate limitations. We also wanted to handle rate limitation errors more gracefully and to ensure we abide by the Twitter Terms of Service.

With that said, in order to continue our ability to harvest information from Twitter and perform our cutting-edge research, we are currently exploring easier and more reliable ways to harvest data. The maintenance performed on DiscoverText still allows 1,500 items per fetch, as determined by Twitter’s architecture on the public API. In addition, no extraneous error messages should result when DiscoverText is being rate limited. Some searches might be silently delayed for five minutes; however, these fetches will catch up as soon as they can.

In the near future, look for new developments for DiscoverText. We’ve got big plans for our social media API fetching that will greatly enhance our users’ ability to receive timely and actionable social media feeds. We don’t want to reveal too much right this moment, but we’re sure you’ll like what we have in store, and in traditional Texifter style, we’ll plan a large announcement when the time is right.
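For readers curious what “graceful” rate limit handling looks like in practice, here is a minimal sketch, not DiscoverText’s actual implementation, of a fetch loop that quietly backs off when the public Search API answers with HTTP 420. The endpoint, parameters, and function name are illustrative assumptions:

```python
import time
import requests

SEARCH_URL = "http://search.twitter.com/search.json"  # public Search API endpoint circa 2011 (assumed)

def fetch_with_backoff(query, max_retries=3, delay_seconds=300):
    """Fetch search results, silently delaying and retrying when rate limited (HTTP 420)."""
    for _ in range(max_retries):
        response = requests.get(SEARCH_URL, params={"q": query, "rpp": 100})
        if response.status_code == 420:      # we are being rate limited
            time.sleep(delay_seconds)        # wait ~5 minutes, then let the fetch catch up
            continue
        response.raise_for_status()
        return response.json()
    return None  # give up quietly rather than surface an extraneous error message
```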
In a recent series of recommendations, the Administrative Conference of the United States (ACUS) announced findings under the auspices of “Legal Considerations in e-Rulemaking,” from the Committee on Rulemaking. Having spent more than a decade working on e-Rulemaking, I was curious to see what was at the top of their list. It was a relief to find that in the Final Recommendations, Item 1, Section A reads:
Consider whether, in light of their comment volume, they could save substantial time and effort by using reliable comment analysis software to organize and review public comments.
The ACUS report continues:
(1) While 5 U.S.C. § 553 requires agencies to consider all comments received, it does not require agencies to ensure that a person reads each one of multiple identical or nearly identical comments. (2) Agencies should also work together and with the eRulemaking program management office (PMO), to share experiences and best practices with regard to the use of such software. [emphasis added]
At Texifter, we know quite a bit about best practices for sorting duplicate and near-duplicate public comments. We have supported and trained Public Comment Analysis Toolkit (PCAT) and DiscoverText users at agencies including the USDA, NOAA, FCC, NLRB, SBA, USFWS, and Treasury. Our duplicate detection and near-duplicate clustering saves agencies the expense of manually sorting non-substantive modified form letters. DiscoverText is now used in Europe by aviation regulators.

How did we get here? More than 300 agency officials attended workshops, focus groups, and interviews over a 10-year period. Algorithms were developed and tested. Interfaces were designed, built, tested, and re-built. Agencies shared millions of public comments and guided us as we tailored a system to work with the bulk downloads from their email servers and the Federal Docket Management System, which gathers the nation’s public comments at Regulations.gov. If “reliable comment analysis software” is needed, Texifter’s flagship product DiscoverText has to be considered a guiding light for some of the key ACUS findings.
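To give a flavor of what near-duplicate clustering of form-letter comments involves, here is a toy sketch, not DiscoverText’s actual algorithm, that uses only Python’s standard library to group comments whose text similarity crosses a threshold:

```python
from difflib import SequenceMatcher

def cluster_near_duplicates(comments, threshold=0.9):
    """Greedily group comments whose similarity to a cluster exemplar exceeds the threshold."""
    clusters = []  # each cluster is a list of indices into `comments`
    for i, text in enumerate(comments):
        for cluster in clusters:
            exemplar = comments[cluster[0]]
            if SequenceMatcher(None, exemplar, text).ratio() >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no close match found; start a new cluster
    return clusters

comments = [
    "Please protect the wetlands near my home.",
    "Please protect the wetlands near my home!",   # near-duplicate form letter
    "I support the proposed rule as written.",
]
print(cluster_near_duplicates(comments))  # [[0, 1], [2]]
```

A production system would rely on more scalable techniques than pairwise comparison, but the basic idea of collapsing modified form letters into clusters so that each needs only one substantive review is the same.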
Researchers interested in large text collections and their itinerant coders tend to muddle through with limited collaborative, cross-disciplinary resources upon which to draw. The generic criteria for high-quality codebook construction and effective coding are underdeveloped, even as the tools and techniques for measuring the limits of manual or machine coding grow ever more sophisticated. In that paradox there may be the seed of a partial solution to some of these issues. The ability to quickly and easily pre-test coding schemes and produce on-the-fly displays of coding inconsistencies is one way to more uniformly train coders to perform reliably (hence usefully) while ensuring a satisfactory level of valid observations. By the same token, the ability to permit an unlimited number of users to review or replicate all the coding and adjudication steps using a free, web-based platform would be a large and bold step onto our methodological and metaphorical bridge.

What are needed are more universal annotation metrics, a standard lexicon, and widely shared, semi-automated coding tools that make the work of humans more useful, fungible, and durable. Ideally, these tools would be interoperable, or combined in a single system. The new system would allow human coders to create annotations and allow other experts to efficiently examine, influence, and validate their work. At a deeper level, this calls for much better and more transparently codified approaches to training and deploying coders—an annotation science subfield—so that a more coherent and collaborative research community can form around this promising methodological domain.

Investigators in the social sciences use reliably coded texts to reach inferences about diverse phenomena. Many forms of public-sphere discourse and governmental records are readily amenable to coding; these include press content, policy documents, speeches, international treaties, and public comments submitted to government decision-makers, among many others. Systematic analysis of large quantities of these sorts of texts represents an appealing new avenue for both theory building and hypothesis testing. It also represents a bridge across the divide between qualitative and quantitative methodologies in the social sciences. These large text datasets are ripe for mixed-methods work that can provide a rich, data-driven approach to both the macro and micro view of large-scale political phenomena.

Traditionally, social scientists working with text use a variety of qualitative research methods for in-depth case studies. For many legitimate and pragmatic reasons, these studies generally consist of a small number of cases or even just a single case. As Steven Rothman and Ron Mitchell note, the reliability of data drawn from qualitative research comes under greater scrutiny, as increased dataset complexity requires increased interpretation and, subsequently, leads to increased opportunity for error. The case study method is plagued by concerns about limitations on its external validity and the ability to reach generalized inferences. With the proliferation of easily available, large-scale digitized text datasets, an array of new opportunities exists for large-n studies of text-based political phenomena that can yield both qualitative and quantitative findings. More to the point, high-quality manual annotation opens up the possibility for cross-disciplinary studies featuring collaboration between social and computational scientists.
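As a concrete, if toy, illustration of the kind of on-the-fly reliability check described above (this is not how CAT or DiscoverText implements it), the sketch below computes observed agreement and Cohen’s kappa for two coders and lists their disagreements for adjudication or retraining; the units, labels, and function names are invented for the example:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same units."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(coder_a) | set(coder_b))
    return (observed - expected) / (1 - expected)

def disagreements(units, coder_a, coder_b):
    """List the units the two coders labeled differently."""
    return [(u, a, b) for u, a, b in zip(units, coder_a, coder_b) if a != b]

units   = ["comment-1", "comment-2", "comment-3", "comment-4"]
coder_a = ["substantive", "form-letter", "form-letter", "substantive"]
coder_b = ["substantive", "form-letter", "substantive", "substantive"]

print(round(cohens_kappa(coder_a, coder_b), 2))   # 0.5
print(disagreements(units, coder_a, coder_b))     # [('comment-3', 'form-letter', 'substantive')]
```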
This second opportunity exists because researchers in the computational sciences, particularly those working in text classification, information retrieval (IR), opinion detection, and natural language processing (NLP), hunger for the elusive “gold standard” in manual annotation. Accurate coding with high levels of inter-rater reliability and validity is possible. For example, work by the eRulemaking Research Group on near-duplicate detection in mass e-mail campaigns demonstrated that focusing on a small number of codes, each with a clear-cut rule set, can produce just such a gold standard. Reliably coded corpora of sufficient size and containing consistently valid observations are essential to the process of designing and training NLP algorithms. We are likely to see more political scientists using methodologies that combine manual annotation and machine learning. In short, there are exciting possibilities for applied and basic research as techniques and tools emerge for reliable coding across the disciplines. To unleash the potential for this interdisciplinary approach, a research community must now form around the nuts-and-bolts questions of what and how to annotate, as well as how to train and equip the coders who make this possible.
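To make the manual-annotation-plus-machine-learning combination concrete, here is a minimal sketch (assuming the scikit-learn library; the tiny labeled corpus is invented for illustration) of training a simple classifier on a reliably coded set of comments:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A reliably coded corpus: texts paired with adjudicated labels from human coders.
texts = [
    "Please withdraw this rule immediately.",
    "I strongly support the proposed regulation.",
    "This rule will destroy small businesses.",
    "The agency should finalize the rule as written.",
]
labels = ["oppose", "support", "oppose", "support"]

# The hand-coded gold standard becomes training data for an automated classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["I urge the agency to adopt this rule."]))
```

Whatever the machine learns is bounded by the reliability and validity of the human coding that produced the training labels, which is exactly why the gold standard matters.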
In Part One of the series “Coding Text the QDAP Way,” I wrote about the problem of idiosyncratic annotation and the lack of diverse, interesting, and re-usable annotated datasets. Providing data for replication (when possible) is a requisite step for a scientific approach. An important aspect of this effort is following up on the agreements, made starting in the 1990s among editors of major research journals, to require replication datasets and the sharing of the specifics of data coding and computer syntax. This work is now well advanced on the quantitative data-sharing frontier. Developing such an agreement for qualitative data research applications and implementing it consistently among a wide-reaching community of researchers is no simple task. Sharing raw and coded political corpora will lead to better manual and automated text mining and analysis in political science.

This is an epoch of highly accessible digital text collections. Blog posts, wikis, YouTube comments, and the like, as well as the full range of digitized traditional media, are vast sources of potentially important political data in text format. A new approach to coding and sharing annotations might help to dispel the prevailing perception of a zero-sum game in research, resulting in many new basic and applied research opportunities for political scientists. The manual annotation of text is a nexus for collaboration by political scientists with computer scientists, and with researchers in allied social sciences as well as in fields such as journalism, literary analysis, library science, and education, where the rigorous interrogation of text is a well-established tradition.

In particular, researchers in computer science possess the tools, repositories, and methods necessary for managing studies of millions of documents over time. Just as search engines did, these emergent human language tools can be expected, in a very short time, to become irreplaceable elements of the researcher’s electronic desktop. The next generation of language tools will be built with the “ground truth” support of high-quality coding and evaluation studies.

Many researchers from a variety of disciplines stand to benefit from reliably recorded, publicly available, transparent, large-scale annotations. These collections can be produced by properly equipped and trained coders, as well as by active and machine-learning algorithms developed by computer scientists. Yet very few researchers in any discipline can say with confidence that they know where to acquire or how to produce reusable annotated corpora with widespread, multi-disciplinary appeal. Even fewer could imagine freely sharing those hard-earned text annotations with other members of a research community or publishing them on the Web to attract more diverse and sustained scholarly attention. There is some evidence that making data available increases citation rates. Although a strong tradition is emerging among leading social science journals whereby scholars post their statistical data and models in repositories for those who would replicate their experiments and calculations, the same cannot currently be said about text annotations, other forms of qualitative work, or even raw text datasets. As a result, there is a dearth of well-coded contemporary and historical text datasets. This is only partly due to the fact that the manual annotation of text can be conceptually very difficult, if not a bit controversial, expensive, and too often unsuccessful.
It is often dreary work, a characteristic that further encourages the use of unsupervised machine annotation when possible. More fundamentally, however, only limited guidance exists in the scholarly literature about how best to recruit, train, equip, and supervise coders to get them to produce useful annotations that serve multiple research agendas in divergent disciplines. As Eduard Hovy (Computer Science, University of Southern California-Information Sciences Institute) regularly points out, researchers need a formal science of annotation focused on cross-disciplinary text mining activities. Carefully and transparently coded corpora are a viable bridge to collaboration with computer science and computational linguistics and can open up new possibilities for large-scale text analysis. In the third and final part of this series, we look at the quest for the elusive “gold standard” in human annotation.
We did it! The free, open source, Web-based, university-hosted, FISMA-compliant “Coding Analysis Toolkit” (CAT) recorded its one millionth coding choice. Pretty much all the credit goes to Texifter CTO and chief CAT architect Mark Hoy, who has put in many paid (and unpaid) hours making sure CAT is reliable, usable, and scalable. Texifter Chief Security Officer Jim Lefcakis also played a key role ensuring the hardware and server room were maintained at the highest level of reliability and security.

In honor of this milestone, I have been digging through my unpublished papers looking for material that explains in more detail where CAT, PCAT, DiscoverText, QDAP, and Texifter come from. This post is the first in a series about the particular approach to coding text we have come to call the “QDAP method.”

Large political text data collections are coded and used for basic and applied research in the social and computational sciences. Yet the manual annotation of text—the coding of corpora—is often conducted in an ad hoc, inconsistent, non-replicable, invalid, and unreliable manner. Even the best intentions to create the possibility for replication can, in practice, confound the most ardent followers of the creed “Replicate, Replicate.” While mechanical, process, documentary, and other challenges exist for all approaches, practitioners of qualitative or text data analysis routinely profess to greater, even insurmountable, barriers to re-using coded data or repeating significant analyses.

There are diverse approaches to coding text. They tend to be hidden away in small niche sub-fields where knowledge of them is limited to a small research community, a project team, or even a single person. While researchers classify text for a variety of reasons, it remains very difficult, and for many counter-intuitive, to share these annotations with other researchers, or to work on them with partners from other disciplines for whom the coding may serve an alternate purpose. A change in the way researchers think about, conduct, and share coded political corpora is overdue.

Coding is expensive, challenging, and too often idiosyncratic. Training and retaining student coders or producing algorithms capable of tens of thousands of reliable and valid observations requires patience, funding, and a framework for measuring and reporting effort and error. Given these factors, it is not surprising that a proprietary model of data acquisition and coding still dominates the social sciences. Despite the important role for the social in social science, researchers guard “their” privately coded text, even the raw data, fearing others will beat them to the publication finish line or challenge the validity of their inferences. The competitive approach to producing and failing to share annotations forecloses the intriguing and highly scalable collaborative social research possibilities enabled by the Internet.

Researchers should seek to enhance and modernize their architecture for large-scale collaborative research using advanced qualitative methods of data analysis. This will require working out and attaining widespread acceptance of Internet-enabled data sharing protocols, as well as the establishment of free, open source platforms for coding text and for sharing and replicating results. We believe that, when utilized in combination, “The Dataverse Network Project” and the “Coding Analysis Toolkit” (CAT) represent two important steps advancing that effort.
Large-scale annotation projects conducted on CAT can be archived in the Dataverse and, as a result, will be more easily available for replication, review, or re-adjudication of their original coding. In Part Two of the series “Coding Text the QDAP Way,” we’ll say more about the role of scholarly journals in advancing this practice of re-using datasets.