This is the latest DiscoverText filtering feature designed to speed up the creation of accurate custom machine classifiers. This video shows how we use an interactive display of classifier scores to isolate items in a dataset that require further human coding to improve the accuracy of the classifier. Click on the screenshot below to start the video.
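The idea of using classifier scores to find items that still need human coding can be sketched in a few lines. This is a minimal illustration, not DiscoverText's actual implementation: the function name `items_needing_review`, the 0.4 to 0.6 score band, and the toy scores are all assumptions made for the example. It treats each score as the classifier's confidence (between 0 and 1) that an item belongs to a category, and it queues the items closest to 0.5, where the classifier is least certain.

```python
def items_needing_review(scored_items, low=0.4, high=0.6):
    """Return (item_id, score) pairs whose score falls in [low, high].

    The band limits are illustrative assumptions; results are ordered so
    the most uncertain items (scores nearest 0.5) come first.
    """
    uncertain = [
        (item_id, score)
        for item_id, score in scored_items
        if low <= score <= high
    ]
    return sorted(uncertain, key=lambda pair: abs(pair[1] - 0.5))


# Made-up scores for five hypothetical items.
scores = [("t1", 0.97), ("t2", 0.52), ("t3", 0.08), ("t4", 0.41), ("t5", 0.66)]
review_queue = items_needing_review(scores)
print(review_queue)
```

Items with scores near 0 or 1 are left alone, since the classifier is already confident about them. Human coders spend their time on the borderline cases, which is where additional training data tends to help most.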
The use of social media has grown exponentially over the last several years. In fact, most television programs and televised advertising now have a social media component designed to expand reach and engagement with the audience. To date, the tobacco control community has relied on traditional media (paid television, radio, billboard, and print advertising) to promote its messages. On March 19, 2012, the Centers for Disease Control and Prevention (CDC) launched Tips from Former Smokers. This is the CDC’s largest anti-smoking campaign ever and its first national advertising effort. The campaign will last four months and consist of both traditional and social media.
The Health Media Collaboratory at the University of Illinois at Chicago, directed by Sherry Emery, PhD, will measure and evaluate a key social media component of the campaign: its Twitter reach and impact. Using DiscoverText with GNIP’s PowerTrack provides full access to Twitter’s Firehose, in contrast to Twitter’s publicly available API stream, which provides only a 1% sample of tweets. Because the volume of tweets for health social media campaigns is relatively low, every tweet matters. Access to GNIP’s premium Twitter feed allows us to capture all tweets and metadata for the campaign. Using DiscoverText to sift through tweets and code them for content gives us a tool for measuring online public engagement, audience sentiment, and campaign discourse.
The Collaboratory will report on the overall reach and audience engagement of the campaign through an analysis of unique users reached, number of retweets, and mentions. This information will track not only the engagement of individual users but also the engagement of state tobacco control programs in the campaign. A sentiment analysis will be conducted on tweets to gauge the emotional valence of the campaign and of individual television ads.
Finally, using root keywords for quitting and smoking uptake, the number of Twitter users who express interest in quitting or prevention will be reported. For more information about this project, visit the UIC Health Media Collaboratory website or follow @GLENszczypka for updates. Research funded by the National Cancer Institute (Grant No. 1U01CA154254).
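The reach measures named in this post (unique users, retweets, mentions, and users matching root keywords) can be sketched as follows. This is an illustration under stated assumptions, not the Collaboratory's actual method: the tweet dictionaries with `user` and `text` fields, the `RT @` prefix test for retweets, the single root keyword `quit`, and the sample tweets are all made up for the example.

```python
import re

MENTION_RE = re.compile(r"@(\w+)")
QUIT_ROOTS = ("quit",)  # hypothetical root keyword; also matches "quitting"


def reach_summary(tweets):
    """Summarize reach for a list of {'user': ..., 'text': ...} dicts."""
    def is_retweet(tweet):
        # Assumption: retweets are identified by a leading "RT @".
        return tweet["text"].startswith("RT @")

    unique_users = {t["user"] for t in tweets}
    retweets = sum(1 for t in tweets if is_retweet(t))
    # Count @handles only in original tweets, so retweets are not double counted.
    mentions = sum(
        len(MENTION_RE.findall(t["text"])) for t in tweets if not is_retweet(t)
    )
    quit_users = {
        t["user"]
        for t in tweets
        if any(root in t["text"].lower() for root in QUIT_ROOTS)
    }
    return {
        "unique_users": len(unique_users),
        "retweets": retweets,
        "mentions": mentions,
        "users_mentioning_quitting": len(quit_users),
    }


# Made-up tweets, for illustration only.
tweets = [
    {"user": "alice", "text": "RT @example_account: Tips from former smokers"},
    {"user": "bob", "text": "Thinking about quitting after seeing that ad"},
    {"user": "alice", "text": "@bob you can do it, quit today!"},
]
summary = reach_summary(tweets)
print(summary)
```

Plain substring matching on a root keyword is deliberately crude. It will also match unrelated words that happen to contain the root, which is one reason human coding of a sample remains useful alongside keyword counts.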
DiscoverText is rolling out an addition to its analytical toolkit: random sampling. The Web service already offers an array of tools for text analytics and rigorous, team-based qualitative data analysis. These functions include the ability to code and annotate text, measure inter-rater reliability, adjudicate coder validity, attach memos to text, cluster duplicate and near-duplicate documents, share documents, and classify text using an active-learning Naive Bayesian classifier. While still in beta, random sampling is a key new addition. After DiscoverText users amass extraordinary amounts of social media data (for example via the public Twitter API, the GNIP PowerTrack, or the Facebook Social Graph), they can now more easily extract a random sample for analysis. The user decides the size of the sample, which accommodates iteration, experimentation, and other scientific methods. The option is built into the dataset creation process: on the new dataset creation page, you see a sample size prompt. This additional method for data prep and analysis augments current information retrieval techniques, such as search with advanced filtering. It also builds up our framework for expanding the available NLP methods from straightforward Bayesian classification, which aims to analyze substantial quantities of data in their original bulk form, to a menu of computationally intensive methods that can iterate more quickly and effectively against random data samples. For example, the LDA topic model tool we are releasing will be faster and more effective against smaller random samples. This new feature provides both an additional analytical approach and the opportunity to easily compare results between competing (or complementary) analytic methods. We look forward to experimenting with this new tool and hearing about how random sampling will enhance the research of our current and future users.
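For readers who want to see what drawing a user-sized random sample looks like outside the tool, here is a minimal sketch using Python's standard library. It is not DiscoverText's implementation: the function name `draw_random_sample`, the `seed` parameter, and the placeholder archive of 1,000 item IDs are assumptions made for the example.

```python
import random


def draw_random_sample(items, sample_size, seed=None):
    """Draw sample_size items uniformly at random, without replacement.

    Passing a seed makes the draw repeatable, which helps when comparing
    analytic methods against the same sample.
    """
    if sample_size > len(items):
        raise ValueError("sample_size cannot exceed the number of items")
    rng = random.Random(seed)
    return rng.sample(items, sample_size)


# Placeholder archive standing in for a large collection of items.
archive = [f"tweet-{i}" for i in range(1000)]
sample = draw_random_sample(archive, 50, seed=42)
print(len(sample), sample[:3])
```

Fixing the seed is a design choice rather than a requirement. It lets two methods, such as a classifier and a topic model, run against exactly the same subset, so differences in results are not caused by the draw itself.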
Special Note to DT Users: We need to turn this feature on one account at a time while we are testing it. Drop us a line if you want to try the tool. We’ll keep you posted on the launch as more dataset modifications are pushed live. As always, if you have any questions, feel free to email us anytime at firstname.lastname@example.org. Your feedback is crucial. Sign up and try it out for yourself at discovertext.com.
We have been delighted with the response to our call for beta testers to try the GNIP-enabled PowerTrack for Twitter. You can still sign up. Round 1 of the beta test concludes on October 31, 2011. Even testing the system’s data filtering and collection capabilities for 1 or 2 days, or as little as 1-2 hours, may convert you into a devoted GNIP-via-DiscoverText user. As part of taking beta tester applications, we asked folks to tell us something about how they planned to use the beta test opportunity. Thanks to “Wordle,” we can visualize an answer to the question: “Why do people want to take part in the GNIP beta test via DiscoverText?”