Deduplication and automated clustering of near-duplicates give users a high-level sense of the data landscape. With Twitter data, these groupings are a roadmap to the digital footprint of viral Tweets. With public comment data, these groupings reveal form letters and modified form letters. In large-scale surveys, duplicates and near-duplicates are opinions that customers or employees frequently hold but express independently. Our interactive machine classifier histograms allow data science teams to identify the items in a collection that add the most value when coded by humans. These text analytics tools enable purposive sampling that further accelerates the process of training machine classifiers.
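To make the idea of near-duplicate clustering concrete, here is a minimal sketch in Python. It is an illustration only, not a description of how our platform works internally: the word-shingle approach, the Jaccard similarity measure, the 0.4 threshold, and the function names (shingles, jaccard, cluster_near_duplicates) are all assumptions chosen for the example.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles for a text, lowercased."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(texts, threshold=0.4):
    """Group texts whose shingle sets overlap at or above the threshold.

    Union-find merges similarity chains (A~B, B~C) into one cluster.
    Pairwise comparison is O(n^2): fine for a sketch, too slow for
    very large collections. The threshold is an illustrative assumption.
    """
    parent = list(range(len(texts)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [shingles(t) for t in texts]
    for i, j in combinations(range(len(texts)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first, as lists of indexes into the input.
    return sorted(clusters.values(), key=len, reverse=True)


# Hypothetical public comments: an exact duplicate, a modified form
# letter, and an unrelated comment.
comments = [
    "Please protect our national parks from new drilling leases",
    "Please protect our national parks from new drilling leases",
    "Please protect our beloved national parks from new drilling leases",
    "I support the proposed rule as written",
]
clusters = cluster_near_duplicates(comments)
```

In this made-up sample, the two identical comments and the lightly modified one fall into a single group, while the unrelated comment stands alone. That is the form letter pattern described above.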
Boolean defined search, n-grams, word clouds, and custom topic dictionaries are power tools for text analysis and machine learning
Discover central topics as well as elusive but valuable concepts that are unexpected or rare. Use this information to train machine-learning classifiers to recognize relevant text and social media data. Jump into data using an interactive word CloudExplorer or build a mini topic dictionary using “defined” search. Try our new listview to see the top 300 bigrams and trigrams in your data.
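For readers curious about what a ranked list of bigrams and trigrams involves, the following Python sketch counts the most frequent n-grams across a set of documents. It is a simplified illustration under stated assumptions: the tokenizer, the function names (tokenize, top_ngrams), and the sample documents are hypothetical and do not describe our implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def top_ngrams(texts, n=2, limit=300):
    """Return up to `limit` most frequent n-grams as (phrase, count) pairs.

    N-grams are counted within each document, so no phrase spans
    the boundary between two documents.
    """
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n staggered slices yields consecutive n-token windows.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return [(" ".join(gram), c) for gram, c in counts.most_common(limit)]


# Hypothetical sample documents.
docs = [
    "Net neutrality matters",
    "Net neutrality is important",
    "Protect net neutrality",
]
top_bigram = top_ngrams(docs, n=2, limit=1)
top_trigrams = top_ngrams(docs, n=3, limit=300)
```

Setting n=2 yields bigrams and n=3 yields trigrams; limit=300 mirrors the size of the list mentioned above.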
Create gold standard training sets by labeling your training data accurately and reliably using our state-of-the-art collaborative annotation system. Then use our trusted, multilingual machine learning web service (uClassify) to create and apply your own custom-trained text classifiers. Please take the time to check out the work being done by the large and growing uClassify community.