Stock Predictions Using DiscoverText

Wouldn’t it be great if Twitter data could tell us something about how stocks will move in the future? Given the massive expansion of Twitter activity in the finance sector by both individual and institutional investors, it seems reasonable that some signal could be found in the Twitter firehose. This blog post describes a small experiment to test this line of thinking.
For the first phase of the project, described in this post, the goal was restricted to predicting just the direction of stock price movements during the upcoming day of trading, leaving the modeling of the magnitude of price movements for a later project. Furthermore, the work centered on short-horizon predictions, partly because I wouldn’t have to wait as long to build the required dataset, but also because it seemed like more of a fit for the real-time, immediate nature of Twitter data.
My hypothesis was that Twitter’s predictive signal for next-day stock price direction could be significantly enhanced by using off-the-shelf text classification technology.
The overall approach was to create feature vectors based on different classifications of Twitter sentiment, and then apply simple empirical tests to compare the predictive value of those features.
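To make the idea of classification-based features concrete, here is a minimal sketch of how classified tweets might be rolled up into one number per trading day. The label names ("bullish", "bearish", "irrelevant"), the net-sentiment formula and the function name are all assumptions for illustration; the post does not specify which categories or features were actually used.

```python
from collections import defaultdict

def daily_sentiment_features(tweets):
    """Aggregate classified tweets into one net-sentiment value per day.

    tweets: iterable of (day, label) pairs. The labels used here are
    hypothetical; any tweet not labeled bullish or bearish is ignored.
    Returns {day: (bullish - bearish) / (bullish + bearish)}.
    """
    counts = defaultdict(lambda: {"bullish": 0, "bearish": 0})
    for day, label in tweets:
        if label in ("bullish", "bearish"):
            counts[day][label] += 1

    features = {}
    for day, c in sorted(counts.items()):
        total = c["bullish"] + c["bearish"]  # always >= 1 for stored days
        features[day] = (c["bullish"] - c["bearish"]) / total
    return features

# Made-up example data, not taken from the experiment.
sample = [
    ("2024-01-02", "bullish"),
    ("2024-01-02", "bullish"),
    ("2024-01-02", "bearish"),
    ("2024-01-02", "irrelevant"),
    ("2024-01-03", "bearish"),
]
features = daily_sentiment_features(sample)
```

An "unfiltered" feature could be built the same way from every tweet that mentions the ticker, which would give the comparison described above something to compare against.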
In order to derive predictive value from Twitter, I first needed to classify each tweet based on the kind of information or sentiment conveyed about a particular stock. As many of you know, a large portion of Twitter data, even when limited to a particular search term, is essentially noise.
Based on human-coded results, well over 60% of tweets that contain a stock’s ticker are not, in fact, related to the company at all, let alone to trading its stock. Most are advertisements, clickbait or just plain nonsense. While the remainder (under 40%) are relevant to the stock, most of those say nothing about its future prospects; they concern past news or past observations about the price. It turns out that fewer than 10% of tweets actually express forward-looking sentiment about a stock’s price. The trick, of course, is finding that 10%.
Enter DiscoverText. The tool was especially useful because, from a standing start, I had a lot of pieces to integrate. DiscoverText already has those pieces integrated: Twitter data, data coder functionality, classifiers and export tools. That let me jump-start the research, and I was up and running very quickly with real predictions based on real Twitter data.
Empirical Results
As mentioned above, this phase of the project was dedicated to predicting direction. I found that while the approach did not perform equally well for every stock, there was empirical evidence of predictive signal. Using simple linear regressions, I was able to demonstrate that features derived from tweets filtered for forward-looking sentiment significantly outperformed features derived from unfiltered tweets.
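One way such a regression test could be set up is sketched below: fit a one-variable least-squares line from a daily sentiment feature to the next day's return, then score how often the sign of the fitted value matches the actual direction on held-out days. The train/test split, the sign-based decision rule and all names are assumptions; the post does not describe the exact procedure, and the numbers are made up.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def directional_accuracy(feature, next_day_return, train_size):
    """Fit on the first train_size days, score direction on the rest."""
    slope, intercept = fit_line(feature[:train_size],
                                next_day_return[:train_size])
    test_days = list(zip(feature[train_size:], next_day_return[train_size:]))
    hits = 0
    for f, r in test_days:
        predicted_up = slope * f + intercept > 0
        hits += predicted_up == (r > 0)  # True counts as 1
    return hits / len(test_days)

# Illustrative inputs only: a daily sentiment feature and next-day returns.
feature = [-1.0, -0.5, 0.5, 1.0, -0.8, 0.9]
returns = [-0.02, -0.01, 0.01, 0.02, -0.015, 0.018]
accuracy = directional_accuracy(feature, returns, train_size=4)
```

Running the same function once with a filtered feature and once with an unfiltered one gives two hit rates that can be compared directly.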
In the chart above you can see that while unfiltered results were essentially equal to a coin toss, over the course of 60 trading days the filtered results were ~60% accurate on average. Is that a meaningful result? One way to assess that is to determine how likely one would be to achieve the same result with a coin toss. As you can see in the binomial distribution chart below, a coin toss would achieve this result only ~6% of the time.
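The coin-toss comparison is a binomial tail probability: the chance of getting at least k correct calls out of n when each call is right with probability 0.5. The sketch below computes it with the standard library. Reading "~60% over 60 trading days" as exactly 36 correct days is an assumption; the post gives only approximate figures, and the result shifts noticeably if the true count is a day or two higher or lower.

```python
from math import comb

def binomial_tail(n, k, p=0.5):
    """Probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

# Assumed reading of the post: 36 correct direction calls in 60 days.
chance_probability = binomial_tail(60, 36)
print(f"P(at least 36 of 60 by chance) = {chance_probability:.3f}")
```

The smaller this probability, the harder it is to attribute the filtered results to luck alone.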
Everyone always wants to know: would this make money? It’s a bit premature to say. Since magnitude and volatility were not specifically modeled, the traded value is somewhat left to chance: one can successfully predict several days of low volatility but lose all those gains on one highly volatile day where direction is predicted incorrectly. That said, the modeled return for the test period was definitely positive, and DiscoverText gave us a quick way to get started. Thank you, DiscoverText!
Posted by Roland Pan, advisor to startups and mid-sized companies on matters of strategy and analytics.
