We are having the spookiest Halloween discount ever on historical Twitter data via Sifter. Any purchases completed before 11:55 PM on October 31, 2017 will suffer the terrifying horror of a grisly, dollar-dripping, cut-to-the-bone, end-of-month discount. If you have the guts to go through with it, we will double the pain and suffering by increasing the extensive free, but deadly, access to DiscoverText.com, a ghoulishly sinister platform for mercilessly dissecting data and making it sing. Let us know if you need any help putting one foot in the grave.
We are very excited to launch v1 of our NodeXL integration. DiscoverText users are invited to apply to join the beta test of the new functionality that creates GraphML files based on Twitter data. New users will be given a gratis license for 30 days, and existing customers will get 10,000 Gnip Twitter credits to experiment with. The new output allows users of NodeXL to generate network maps like this:
This is just step one. Going forward, we plan to make it possible to augment existing Twitter archives with new metadata created using NodeXL. This is an exciting new avenue for Twitter researchers interested in creating greater interoperability between software packages.
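The GraphML output described above can be sketched in a few lines. This is a hedged, stdlib-only illustration of the general idea (turning tweet mention pairs into a GraphML file that NodeXL can open), not the actual DiscoverText export code; the sample usernames and edge list are made up.

```python
# Illustrative sketch: write a tiny directed mention network as GraphML,
# the XML format NodeXL reads. Edge data here is invented for the example.
import xml.etree.ElementTree as ET

edges = [("alice", "bob"), ("alice", "carol"), ("bob", "alice")]

root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
graph = ET.SubElement(root, "graph", id="mentions", edgedefault="directed")

# one <node> element per distinct user, one <edge> per mention pair
for node in sorted({n for e in edges for n in e}):
    ET.SubElement(graph, "node", id=node)
for i, (src, dst) in enumerate(edges):
    ET.SubElement(graph, "edge", id=f"e{i}", source=src, target=dst)

ET.ElementTree(root).write("mentions.graphml", encoding="utf-8",
                           xml_declaration=True)
```

The resulting file can be opened directly in NodeXL (or any GraphML-aware tool such as Gephi) to lay out the network map.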
Ritme Brings Machine-Learning Text Analytics to European Research Players
July 4, 2017
Texifter, LLC is pleased to launch a 5-country European marketing agreement for DiscoverText with scientific software distributor Ritme. For more than 20 years, the Ritme philosophy has been to offer “sustainable support, independent and personalized advice, to make investments as relevant and sustainable as possible.” Texifter is proud to join the limited ranks of data science and research software vendors under the Ritme umbrella.
DiscoverText is a cloud-based multilingual collaborative text analytics software service rooted in more than a decade of basic research into the problem of human and machine-learning. It supports advanced search, filtering, de‐duplication, clustering, crowdsourcing, human coding, annotation measurement tools, and custom machine‐learning text classifiers. DiscoverText is considered a “Top 5” integrated strategic partner by SurveyMonkey and is a longtime vendor of Gnip PowerTrack access to Twitter data.
Ritme, scientific solutions, headquartered in France and founded in 1989, is a full solution provider to the science and technology industries, offering researchers the full spectrum of scientific software and expert assistance. With offices in Paris, Brussels, and Lausanne, Ritme serves more than 50,000 customers across Western Europe with services ranging from pre-sales to training, support, and consulting. Ritme will offer DiscoverText in Belgium, the Netherlands, France, Switzerland, and Italy.
On November 2, 2017, in Paris, I will present a free workshop at the Emlyon Business School (Paris Campus) on using DiscoverText for text mining. Participate in this workshop to learn how to build custom machine classifiers for sifting Twitter data. The topics covered include how to:
- construct precise social data fetch queries,
- use Boolean search on resulting archives,
- filter on metadata or other project attributes,
- tabulate, explore, and set aside exact duplicates and cluster near-duplicates,
- crowdsource human coding,
- measure inter-rater reliability,
- adjudicate coder disagreements, and
- build high quality word sense and topic disambiguation engines.
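One of the topics above, measuring inter-rater reliability, can be illustrated with a short sketch. This is a generic textbook computation of Cohen's kappa for two coders, not DiscoverText's internal implementation; the coders and labels are invented.

```python
# Cohen's kappa: agreement between two coders, corrected for chance.
# kappa = (observed agreement - expected agreement) / (1 - expected agreement)
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # chance agreement from each coder's marginal label distribution
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# two hypothetical coders labeling six "penguins" tweets
a = ["hockey", "bird", "hockey", "bird", "hockey", "bird"]
b = ["hockey", "bird", "hockey", "hockey", "hockey", "bird"]
print(round(cohens_kappa(a, b), 3))  # → 0.667
```

A kappa near 1 indicates strong agreement beyond chance; values well below that signal the kind of coder disagreement the adjudication step is designed to resolve.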
DiscoverText is designed specifically for collecting and cleaning up messy Twitter and other text data streams. Use basic research measurement tools to improve human and machine performance classifying data over time. The workshop covers how to reach and substantiate inferences using a theoretical and applied model informed by a decade of interdisciplinary, National Science Foundation-funded research into the text classification problem.
The key breakthrough led to a patent (US No. 9,275,291) being issued on March 1, 2016. We built tools for adjudicating the work of coders. For example, if I ask 10 students to look at 100 Tweets that mention “penguins” and code whether or not they are about the NHL’s Pittsburgh Penguins, there will be imperfect agreement. Some coders will have deeper knowledge of the subject and some Tweets will be inscrutably ambiguous. Adjudication allows an expert to review the way the group labeled the Tweets and decide who was right and wrong. This method of validation creates a “gold standard,” and it allows us to score, over time, the likelihood that an individual coder will create a valid observation. Participants will learn how to apply “CoderRank” in machine learning. The major idea of the workshop is that when training machines for text analysis, greater reliance should be placed on the input of those humans most likely to create a valid observation. Texifter proposed a unique way to recursively validate, measure, and rank humans on trust and knowledge vectors, and called it CoderRank.
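The core idea can be sketched in a simplified form: score each coder against expert-adjudicated "gold" labels, then weight their votes on new items by that score. The actual patented CoderRank method is recursive and considerably more involved; the coder names, labels, and trust formula below are illustrative only.

```python
# Simplified illustration of adjudication-based coder weighting (NOT the
# patented CoderRank algorithm): accuracy on gold-standard items becomes a
# trust score, which then weights each coder's vote on unadjudicated items.

gold = {"t1": "hockey", "t2": "bird", "t3": "hockey"}  # expert-adjudicated
coder_labels = {
    "ann": {"t1": "hockey", "t2": "bird", "t3": "hockey", "t4": "hockey"},
    "ben": {"t1": "hockey", "t2": "hockey", "t3": "bird", "t4": "bird"},
}

# each coder's accuracy on the adjudicated items
trust = {
    coder: sum(labels[t] == gold[t] for t in gold) / len(gold)
    for coder, labels in coder_labels.items()
}

# weighted vote on an item no expert has reviewed yet
votes = {}
for coder, labels in coder_labels.items():
    votes[labels["t4"]] = votes.get(labels["t4"], 0.0) + trust[coder]
winner = max(votes, key=votes.get)
```

Here the more reliable coder's label prevails even though the two coders split 1–1 on the new item, which is the intuition behind placing greater reliance on the humans most likely to produce a valid observation.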
Dr. Stuart Shulman
Founder & CEO, Texifter
Dr. Stuart W. Shulman is founder & CEO of Texifter. He was a Research Associate Professor of Political Science at the University of Massachusetts Amherst and the founding Director of the Qualitative Data Analysis Program (QDAP) at the University of Pittsburgh and at UMass Amherst. Dr. Shulman is Editor Emeritus of the Journal of Information Technology & Politics, the official journal of the Information Technology & Politics section of the American Political Science Association.
[Originally posted May 10, 2011]
We did it! The free, open source, Web-based, university-hosted, FISMA-compliant “Coding Analysis Toolkit” (CAT) recorded its one millionth coding choice. Pretty much all the credit goes to Texifter CTO and chief CAT architect Mark Hoy, who has put in many paid (and unpaid) hours making sure CAT is reliable, usable, & scalable. Texifter Chief Security Officer Jim Lefcakis also played a key role ensuring the hardware and server room were maintained at the highest level of reliability and security. In honor of this milestone, I have been digging through my unpublished papers looking for material that explains in more detail where CAT, PCAT, DiscoverText, QDAP & Texifter come from. This post is the first in a series about the particular approach to coding text we have come to call the “QDAP method.”
Large political text data collections are coded and used for basic and applied research in the social and computational sciences. Yet the manual annotation of the text—the coding of corpora—is often conducted in an ad hoc, inconsistent, non-replicable, invalid, and unreliable manner. Even the best intentions to create the possibility for replication can, in practice, confound the most ardent followers of the creed “Replicate, Replicate.” While mechanical, process, documentary, and other challenges exist for all approaches, practitioners of qualitative or text data analysis routinely profess greater, even insurmountable, barriers to re-using coded data or repeating significant analyses. There are diverse approaches to coding text. They tend to be hidden away in small niche sub-fields, where knowledge of them is limited to a small research community, a project team, or even a single person. While researchers classify text for a variety of reasons, it remains very difficult, counter-intuitive as that may seem, to share these annotations with other researchers, or to work on them with partners from other disciplines for whom the coding may serve an alternate purpose. A change in the way researchers think about, conduct, and share coded political corpora is overdue.
Coding is expensive, challenging, and too often idiosyncratic. Training and retaining student coders, or producing algorithms capable of tens of thousands of reliable and valid observations, requires patience, funding, and a framework for measuring and reporting effort and error. Given these factors, it is not surprising that a proprietary model of data acquisition and coding still dominates the social sciences. Despite the important role for the social in social science, researchers guard “their” privately coded text, even the raw data, fearing others will beat them to the publication finish line or challenge the validity of their inferences. This competitive approach of producing, but failing to share, annotations forecloses intriguing and highly scalable collaborative social research possibilities enabled by the Internet.
Researchers should seek to enhance and modernize their architecture for large-scale collaborative research using advanced qualitative methods of data analysis. This will require working out and attaining widespread acceptance of Internet-enabled data sharing protocols, as well as the establishment of free, open source platforms for coding text and for sharing and replicating results. We believe that when utilized in combination, “The Dataverse Network Project” and the “Coding Analysis Toolkit” (CAT) represent two important steps advancing that effort. Large-scale annotation projects conducted on CAT can be archived in the Dataverse and as a result will be more easily available for replication, review, or re-adjudication of their original coding. In Part Two of the Series “Coding Text the QDAP Way,” we’ll say more about the role of scholarly journals advancing this practice of re-using datasets.