Texifter was the first company to join as a paying customer in the alpha “Snapshot” offering from Gnip. You can still take part in that alpha by submitting a request for a free estimate of a snapshot from Twitter’s complete history. This is, however, a very fast-moving landscape for social #bigdata. We are quickly transitioning from the alpha “Snapshot” tests to the beta of a cradle-to-grave system for building cost estimates for text analytic projects that feature either the real-time day-forward, Gnip-enabled PowerTrack (the Twitter fire hose), or the new historical PowerTrack. So if you have ever wished you could go back in time and collect all the tweets from an epic moment in history, your wish just came true. Contact us if you have any questions and submit a request for a free estimate today.
[Originally posted May 14, 2011]
In Part One of the series “Coding Text the QDAP Way,” I wrote about the problem of idiosyncratic annotation and the lack of diverse, interesting and re-usable annotated data sets. Providing data for replication (when possible) is a requisite step in a scientific approach. An important aspect of this effort is a follow-up on the agreements that were made starting in the 1990s among editors of major research journals to require replication datasets and sharing of the specifics of data coding and computer syntax. This work is now well advanced on the quantitative data sharing frontiers. Developing such an agreement for qualitative data research applications and implementing it consistently among a wide-reaching community of researchers is no simple task.
Sharing raw and coded political corpora will lead to better manual and automated text mining and analysis in political science. This is an epoch of highly accessible digital text collections. Blog posts, wikis, YouTube comments, and the like, as well as the full range of digitized traditional media, are vast sources of potentially important political data in text format. A new approach to coding and sharing annotations might help to eviscerate the prevailing perception of a zero-sum game in research, resulting in many new basic and applied research opportunities for political scientists. The manual annotation of text is a nexus for collaboration by political scientists with computer scientists, and with researchers in allied social sciences as well as in fields such as journalism, literary analysis, library science, and education where the rigorous interrogation of text is a well-established tradition.
In particular, researchers in computer science possess the tools, repositories, and methods necessary for managing studies of millions of documents over time. Just like search engines, in a very short time we can expect these emergent human language tools to become irreplaceable elements of the researcher’s electronic desktop. The next generation of language tools will be built with the “ground truth” support of high-quality coding and evaluation studies. Many researchers from a variety of disciplines stand to benefit from reliably recorded, publicly available, transparent, large-scale annotations. These collections can be produced by properly equipped and trained coders, as well as by active and machine-learning algorithms developed by computer scientists. Yet very few researchers in any discipline can say with confidence that they know where to acquire or how to produce reusable annotated corpora with widespread, multi-disciplinary appeal. Even fewer could imagine freely sharing those hard-earned text annotations with other members of a research community or publishing them on the Web to attract more diverse and sustained scholarly attention.
There is some evidence that making data available increases citation. Although a strong tradition is emerging among leading social science journals whereby scholars post their statistical data and models in repositories for those who would replicate their experiments and calculations, the same cannot currently be said about text annotations, other forms of qualitative work, and even raw text datasets. As a result, there is a dearth of well-coded contemporary and historical text datasets. This is only partly due to the fact that the manual annotation of text can be conceptually very difficult, if not a bit controversial, expensive, and too often unsuccessful. It is often dreary work, a characteristic that further encourages the use of unsupervised machine annotation when possible. More fundamentally, however, only limited guidance exists in the scholarly literature about how best to recruit, train, equip, and supervise coders to get them to produce useful annotations that serve multiple research agendas in divergent disciplines.
As Eduard Hovy (Computer Science, University of Southern California-Information Sciences Institute) regularly points out, researchers need a formal science of annotation focused on cross-disciplinary text mining activities. Carefully and transparently coded corpora are a viable bridge to collaboration with computer science and computational linguistics and can open up new possibilities for large-scale text analysis. In the third and final part of this series, we look at the quest for the elusive “gold standard” in human annotation.
Texifter’s most recent historical Twitter prize winners include three from the United States, one from Great Britain, and one from France. Winners receive Enterprise access to DiscoverText for six months, and Sifter credit for up to three historical Twitter days and 200,000 tweets. The following is a snapshot of the most recent winners and their proposed research projects.
Diana Ascher, PhD student in the Department of Information Studies at UCLA (@dianaascher)
“Helping Companies Streamline Information”
Ascher proposes exploring cultural time orientation by analyzing the Twitter feeds from three news organizations to better understand how “information agents’ cultural backgrounds affect corporate information practice,” and specifically how organizations decide what information to share and when. Ascher hopes the research will help businesses streamline their information activity and routines, and help managers understand “how employees decide what’s important and what’s not.”
Stephen Barnard, Assistant Professor in the Sociology Department at St. Lawrence University (@socsavvy)
“Better Understanding Journalism via Boston Marathon Bombing Twitter Data”
Barnard plans to use Sifter to collect and analyze Twitter data about the 2013 Boston Marathon bombings. He will use Twitter’s PowerTrack filters to conduct a detailed search of Tweets that reported on the bombing, and compare the results to the responses from professional and citizen journalists. “I hope to gain a better understanding of the reporting processes and outcomes emerging from both groups,” Barnard writes, adding that he will use the findings to “highlight the structural relations of the emerging journalistic field.”
Oliver Haimson, PhD Student in the Informatics Department at University of California, Irvine (@oliverhaimson)
“Analyzing Hashtags”
Haimson plans to use the prize to analyze the hashtags #nymwars and #mynameis, which were used in 2011 and 2014 to critique Google’s and Facebook’s “real name” policies. He plans to evaluate the Twitter data from these two hashtags “using computational linguistics, qualitative coding, and social network analysis.”
Omar Jaafor, PhD Student in the Department of Operational Research, Applied Statistics and Simulation at University of Technology of Troyes (@lmhasher)
“Developing Algorithms for Social Networks”
Jaafor and fellow researchers will use the prize to continue to develop “clustering and anomaly detection algorithms for social networks in a big data environment.”
Wasim Ahmed, PhD Student in the Health Informatics Research Group at the University of Sheffield’s Information Department (@was3210)
“Responding to Infectious Disease Outbreaks”
Ahmed will use his prize to “study how users respond to outbreaks on infectious diseases on social media platforms, such as Twitter.” He plans to use his data toward his PhD, “Pandemics and epidemics: User reactions on social media and Web 2.0 platforms.”
For more information on Texifter’s social data offer and text analytics tools, please send us an email at email@example.com. Better yet, sign up for a free 30-day trial and start collecting your own social data today.
As a part of getting new users to test our Sifter beta, every month this summer we are awarding 12 #datagrants to academics. These prizes shave thousands of dollars off the cost of your research. The August social data and tools prize winners were:
Kelli S. Burns, Ph.D. University of South Florida School of Mass Communications
“I will look at the #icebucketchallenge during a particularly active time in the campaign (mid-August 2014) when several celebrities were creating a lot of attention for their videos. I plan to explore the celebrity impact on tweets as well as specific mentions of ALS in tweets about the campaign. I am also interested in conversation themes related to the campaign and how other organizations hijacked the hashtag for their own gain.” @KelliSBurns
Kathleen PJ Brennan PhD Candidate at the University of Hawai’i at Manoa Political Science
“I hope to use my data and software prize to study the influence of internet memes on political interest and awareness. This particular analysis will form part of a dissertation chapter on internet memes, which examines such memes as emergent agents in the overlaps of online and offline spaces. This will be my first opportunity to incorporate such data into my dissertation, and I can’t wait to get started!” @katiepbrennan
Aminu Bello PhD Research Student Marketing
“To analyse data from social media. To find out the role of social media in CRM. Data will be collected primarily from facebook and twitter pages.”
Ann Pegoraro Laurentian University School of Sports Administration; Director of the Institute for Sport Marketing, a research center at the university
“I plan on using the Texifter Data and software to further my research work in social media use in sport. In particular, the historical data will be used by my colleagues and I to investigate how the use of Twitter by athletes, teams/organizations and fans has evolved over time.” @SportMgmtProf
Susan Currie Sivek Linfield College Mass Communication
“I will use the prize to continue to study the relationship between journalism and social media. I am especially interested in how magazines use these media to connect to their audiences.” @profsivek
Dimitrinka Atanasova Research Associate (CascEff) and PhD student Media and Communication, University of Leicester
“I plan to study information sharing about obesity, specifically I hope to identify the sources behind the web links that are shared most. For my recently submitted PhD I analysed obesity-related news articles from selected online newspapers, and while it can be expected that content from these should be among the most shared, I would like to see what other information sources are read/shared.” @dbatanasova
Hassan Zamir University of South Carolina School of Library and Information Science
“The Texifter data prize will be primarily used as the data for writing my dissertation which focuses on how and what citizens and expatriates of Bangladesh reported about the Shahbag Movement during 2013 in Twitter. A content analysis of these tweets will be helpful to get an insight about the protest, it’s primary issues, protesters, and their concerns. The data will be useful for understanding how social media tools like Twitter increases democracy, civic engagement, and social empowerment. A potential outcome of this research will be designing a computer supported tool for better understanding worldwide social movements and mitigate the social crisis issues quickly.” @hassan_zamir
Jacob Groshek Boston University Emerging media
“I plan to look at how people use social media in a smoking cessation program. Or follow other emergent social situations, like Ferguson or Gaza.” @jgroshek
Yunkang Yang University of Washington Department of Communication
“I would use it to extract historical posts to study online discourse regarding a major public event in China in 2012, as well as the access to discover text to cleanse, code and visualize the data. I hope to group those posts into categories to show the levels of contention in discourse and to reflect the role social media play in facilitating public debate.” @yangyunkang
Will Frankenstein Carnegie Mellon University Dept. Engineering & Public Policy / Center for Computational Analysis of Social and Organizational Systems
“I will be using the data to explore how individuals communicate and discuss technological risk as expressed on social media. I will be focusing on discussions of nuclear proliferation. The prize is especially helpful for gauging and distinguishing the immediate social media response vs. the long-term response of major events related to nuclear materials, such as Fukushima and New START.”
Micah Altman MIT Libraries: Program on Information Science
“We will experiment with PowerTrack to pilot to integrate dynamic corrections to official statistics. We will experiment with DiscoverText to perform collaborative evaluation of transparency in government data and websites.” @drmaltman
As a part of getting new users to test our Sifter beta, every month this summer we are awarding 12 #datagrants to academics. All you need to do to be included in the August drawing is submit a valid historical Twitter estimate request using Sifter and then send us your CV. These prizes shave thousands of dollars off the cost of your research. The July social data and tools prize winners were:
Enrique Castro Sanchez Centre for Infection Prevention and Management at Imperial College London
“I am interested in exploring how antibiotics and antibiotic resistance are discussed in Twitter, focusing on opinion leaders driving particular perceptions. The data will allow me to explore collective Twitter responses to news and events related to antibiotics, in an effort to understand how best mobilise public opinion.” @castrocloud
Stephen Barnard Department of Sociology at St. Lawrence University
“I plan to use the Texifter #datagrant and DiscoverText software package to extend my research on the significance of Twitter in American journalism. This may include collecting both real-time and historical tweets relating to major events in the journalistic field. Additionally, I am also hoping to use the Texifter/DiscoverText package as a grading tool, given that I often incorporate social media projects and Twitter discussion in my classes and have been searching for an efficient way to collect and grade them. This prize provides an ideal opportunity for me to experiment with new grading protocols.” @socsavvy
Gonzalo Bacigalupe Counseling Psychology at the University of Massachusetts Boston
“Do ehealth, innovation in healthcare and technology, mhealth, and other forms of ehealth ideas, emerge associated to the question of health equity, social determinants of health, and overall with concerns about social justice” @bacigalupe
“As an economist I am interested in how economic agents interact with each other; in particular how networks (formally or informally – hence Twitter and other social networks) influence decision-making. I hope to use this data award to learn more about the ways in which decisions are impacted by the position somebody has within a network.” @jjreade
Zachary Steinert-Threlkeld Political Science at the University of California – San Diego
“I am researching how individuals use Twitter to organize contentious action in authoritarian regimes. Because I have too many tweets to hand code, creating topic models is a core part of my research. Access to an Enterprise level DiscoverText account will prove invaluably productive.” @ZacharyST
“I will be using the data for community detection and anomaly detection. I am building algorithms that allow for community and anomaly detection in networks using both the attributes of nodes (country, age, messages…) and relationships between nodes.” @lmhasher
“With Prof. Ed Lee in the IIT Chicago-Kent College of Law, I’m studying how to evaluate online protests and their achievements. We use the case study method to examine tweets related to protests of NSA surveillance. Our goal is to develop a set of metrics by which we can better evaluate the success of online protests and what they may achieve, particularly in protests whose objectives do not involve revolution or overthrow of the government. The results of the project will be useful for Internet activists, businesses, media, policymakers, and software programmers in designing, evaluating, or utilizing social media for political purposes.” @libbyh
“I’m exploring social media commentary about the use of conflict resolution programming in schools, with a special focus on peer mediation. I’ve been gathering tweets related to peer mediation and find some interesting back-channel conversations going on that school staff probably are not aware of.” @bwarters
Nigel L. Williams FestIM Research Project, School of Tourism at Bournemouth University
“My research examines Digital Engagement by stakeholders with Projects and Events. I’m especially interested in applying Social Network Analysis and Text Analysis to understand conversations on Social Media about Projects and Events. In the Project Domain, I will look at online narratives discussing Crossrail, a London transport project. For Events, I will apply the data and software to examine the impact of online narratives on a coastal destination.” @Org_PM
Meredith Clark Journalism & Mass Communication at UNC-CH
“I will use the prize to extend my research into digital media use and connectivity among minorities.” @meredithclark
Stephen K Tagg Marketing at Strathclyde Business School
“To produce academic articles on dynamic modelling of sentiments in the Scottish Independence Referendum debate. This is in cooperation with a colleague in the school of government (Dr Mark Shephard). Techniques for the analysis of unstructured data in the R software environment will be used: qdap, tm and Austin.” @stephenktagg
Bill Wilkerson Political Science at SUNY Oneonta
“I am interested in learning about how the US Supreme Court is discussed on Twitter. What cases draw interest? What network patterns exist in this discussion? I hope that there is sufficient geo-location data to use this as part of the research as well.” @bill_wilkerson
Remember: All you need to do to be included in the July drawing is submit a valid historical Twitter estimate request using Sifter and then send us your CV.