For more than seven years the Library of Congress (LoC) collected a plain text version of the entire Twitter archive. In their December 2017 White Paper, they noted: “The Library only receives text. It does not receive images, videos or linked content. Tweets now are often more visual than textual, limiting the value of text-only collecting.”
In an interesting article, TechCrunch author Catherine Shu leads with the idea that the recent doubling of the “character limits” was the cause of the policy change. That is actually just sub-part three of a more general explanation that “Twitter is changing.” As the founder of a start-up that provides the only web-based, self-serve access to every undeleted Tweet in history, I have to offer my own take as an eyewitness to most of this history. The way I understand it:
- The LoC has never had a proper budget for this.
- There was no plan for making all the data, or even samples, accessible.
- Twitter never gave the LoC any distribution rights.
- A complete duplicate copy of Twitter, with no revenue stream, was a clear and present threat to Twitter’s business, not to mention our long-term efforts to support the academic market for Twitter data.
- The original announcement got huge press followed by seven years of silence.
- Some academics mistakenly believed the archive was meant for them to study Twitter data now.
- Librarians at the LoC saw the document collection as a time capsule for the scholars of the future long after Twitter leaves this earth.
- No tools were built to search, filter, display, manipulate, or export data.
- The archive may contain as many as 50 billion deleted Tweets; the cost of removing deleted Tweets is non-trivial.
- The lawyers at Twitter and many of the academic librarians who support scholars (not the LoC librarians) may not see deleted Tweets in precisely the same way.
- Updated information: LoC and Twitter had a policy in place to manage deleted Tweets. I am trying to get the documentation about how this was physically implemented over the last seven years.
So history marches on, and the LoC is off the hook for the weekly inquiries from data-hungry scholars who, at times and in some circles, will claim that all data should be free, no matter what it costs to physically store, sort, and access it. Imagine making a similar claim about field research. No one expects plane, train, or bus tickets to be free. Recording devices, laptops, and cell phones are not free. If you spend three days photocopying at the National Archives (which I did twice for my dissertation), the lunch and photocopies are not free. The notion that hundreds of trillions of cells of data in commercial cloud services can be accessed for free as a public good is not grounded in the economics of information access.
After eight years running Texifter, I have developed a different view. As a recovering academic and first-time founder, I chose to face, head on, the challenge (and risk) of making the undeleted history of Twitter fully accessible. This has required lawyers, loans, a supremely gifted engineer, and a never-yielding belief that the data has inherent value if only it can be properly tapped. To that end, we built Sifter and proceeded to lose money on it month after month for years. As we find ourselves approaching the end of the third full year, Sifter has just turned the corner, returning a quarterly profit for the first time ever.
Lots of academics use our service now because they can get the slice of Twitter they need along with a powerful research and measurement toolkit to search, filter, code, cluster, and machine-classify the data. We have also led the way on building an annotation system for labeling the data while viewing it live in the Twitter display. Unlike a plain text representation, we provide the Tweets in their native display so the pictures, videos, media previews, and other features of the data are visible.
Anyone who attends a DiscoverText workshop to learn text and Twitter analytics also learns the gospel of Stu: Twitter data must be viewed in the Twitter display if the goal is to interpret the meaning of the Tweet.