The Premium Twitter API (formerly the Gnip PowerTrack) enhances the standard capabilities of filtering real-time data received through the full Twitter firehose. Originally developed by Gnip, a Colorado-based startup built on the resale of social data, the API now falls under the various packages that Twitter offers after they purchased Gnip in 2014.
Platforms like DiscoverText integrate with the PowerTrack API to let users extract social data from Twitter based on the operators and filters (collectively the logical “rules”) specified by the client. The premium offering adds operators such as cashtag and emoji filtering, along with lower latency, all of which can enrich data mining capabilities.
The Twitter PowerTrack API
The PowerTrack rules endpoint is where the developer’s application manages the rules used to mine Twitter data. What is unique to this endpoint, and why it is heralded as being so powerful, is the ability to update filtering rules at a moment’s notice without disconnecting from the stream. This makes it possible to reduce noise in the dataset and to adjust the rules as the data is mined. Using a Boolean syntax and the operators available to premium users, complex streams can be set up to mine the data of interest to your organization while excluding data that is not of interest.
Understanding PowerTrack Filters/Rules
Platforms that leverage the PowerTrack API must adhere to a particular syntax when setting up filtering rules. These filters are composed of at least one “clause”, which consists of a keyword or phrase, optionally combined with PowerTrack operators, that details the parameters of the query. Clauses may include keyword matches, negated terms or hashtags, or even substring matches. For instance, if you wish to search for the term happy birthday, the AND operator (specified with a blank space in PowerTrack rules) could be used to find tweets with the words happy and birthday. However, this returns tweets where happy and birthday may be found anywhere in the tweet (i.e., not necessarily adjacent to each other). To locate the specific phrase happy birthday, place quotation marks around it, i.e., “happy birthday”.
Specific operators are used to create a filter with multiple clauses. When pulling together multiple clauses, certain logic must be adhered to in order for the PowerTrack API to successfully interpret your rule. For instance, if you wanted to identify tweets referencing a house party, but not a birthday or a dinner party, within a 10-mile radius of Los Angeles, you would group the contents of each clause with parentheses to make clear which terms belong together, whilst separating distinct clauses with a space. For example:
(house party) -(birthday OR dinner) point_radius:[-118.2437 34.0522 10.0mi]
This queries for the terms “house” and “party” in a tweet, although not necessarily next to each other, and excludes tweets that contain either “birthday” or “dinner”, while searching within a 10-mile radius of Los Angeles. The “-” prefix is the negation operator: it marks a term or clause that you want to exclude from the results. Note that point_radius takes the longitude first, followed by the latitude and the radius.
Note: The full list of operators may be found on Twitter’s Developer webpage and should be read before creating your own rules.
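A rule of this shape can also be assembled programmatically rather than typed by hand. The following is a minimal Python sketch and not part of DiscoverText or any Twitter library: the helper names group, negate and point_radius are hypothetical, and the sketch assumes that point_radius takes longitude before latitude, as described in Twitter's operator documentation.

```python
def group(*terms, op=" "):
    """Join terms with a space (implicit AND) or ' OR ', wrapped in parentheses."""
    return "(" + op.join(terms) + ")"


def negate(clause):
    """Prefix a term or parenthesized clause with '-' to exclude it."""
    return "-" + clause


def point_radius(lon, lat, radius, unit="mi"):
    """Format a point_radius clause: longitude, latitude, radius, no commas."""
    if unit not in ("mi", "km"):
        raise ValueError("unit must be 'mi' or 'km'")
    return f"point_radius:[{lon} {lat} {radius}{unit}]"


# Distinct clauses are separated by a single space (implicit AND).
rule = " ".join([
    group("house", "party"),
    negate(group("birthday", "dinner", op=" OR ")),
    point_radius(-118.2437, 34.0522, 10.0),  # Los Angeles, assumed lon/lat order
])
print(rule)
# (house party) -(birthday OR dinner) point_radius:[-118.2437 34.0522 10.0mi]
```

Building the string from small helpers keeps the separators in one place, which makes it harder to slip in a stray comma or a missing unit.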
Getting the Most Out of Your Searches – Addressing the Most Common Errors
New users are prone to making some common errors when formulating these PowerTrack filters, as we learned when researchers began using the DiscoverText platform. One of the most common mistakes is misunderstanding and misusing Boolean operators.
Every platform has its own syntax that must be observed and understood. We will illustrate how these operators should be used, with examples of searches that work as intended and searches that do not return the desired data.
Parentheses are used to enclose arguments and group particular terms together. For example, to search for tweets that contain any of the following hashtags and that are in the English language, use:
(#BigData OR #MachineLearning OR #DataMining) lang:en
to ensure that the language filter applies to all of the clauses. Without the parentheses, the language filter would only apply to the last hashtag, as in:
#BigData OR #MachineLearning OR #DataMining lang:en
because the implicit AND (the space) takes precedence over OR, so the rule is interpreted as:
#BigData OR #MachineLearning OR (#DataMining lang:en)
Parentheses can be used to override the order of precedence rules, or more appropriately, to make your intentions clear.
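The effect of precedence can be made visible with a few lines of Python. The sketch below assumes, following Twitter's operator documentation, that the implicit AND (a space) binds more tightly than OR. The function name show_precedence is hypothetical, and it only handles flat rules that contain no parentheses or quoted phrases.

```python
def show_precedence(rule):
    """Parenthesize each implicit-AND group of a flat rule (no parentheses or quotes)."""
    groups = []
    for part in rule.split(" OR "):
        terms = part.split()
        if len(terms) > 1:
            # Several space-separated terms are ANDed together before any OR applies.
            groups.append("(" + " ".join(terms) + ")")
        else:
            groups.append(part.strip())
    return " OR ".join(groups)


print(show_precedence("#nasa has:media OR has:links"))
# (#nasa has:media) OR has:links
```

If the printed grouping is not the one you had in mind, add your own parentheses to the rule before submitting it.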
The AND Operator
While some platforms explicitly use the keyword “AND”, for PowerTrack this term is replaced with a space character. Thus:
drunk driver
will search for tweets containing the two terms “drunk” and “driver”. These terms can appear anywhere in the tweet, in any order, and not necessarily as the exact phrase “drunk driver”.
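However, “and” or “AND” is not an operator in PowerTrack. Including it does not join the clauses; the word is treated as a keyword instead, and, as the error examples later in this article show, PowerTrack flags an unquoted and as ambiguous. Either way, it is probably not what you intended. Thus: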
drunk AND driver
is usually an incorrect query.
This applies to multiple clauses. For example:
(drunk driver) -uber -lyft
can be read as:
Return all tweets that contain both the term “drunk” and the term “driver”, that do not contain the term “uber”, and that do not contain the term “lyft”.
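To check your reading of a rule like this before running it, the same logic can be emulated locally on a handful of sample tweets. This is only a rough sketch: the function names are hypothetical, the sample tweets are made up, and the simple word splitting used here is an assumption that does not reproduce PowerTrack's own tokenization.

```python
import re


def tokens(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def matches(text, required, excluded=()):
    """True if every required term is present and no excluded term is."""
    words = tokens(text)
    has_all = all(term in words for term in required)
    has_excluded = any(term in words for term in excluded)
    return has_all and not has_excluded


# Emulates: (drunk driver) -uber -lyft
required, excluded = ["drunk", "driver"], ["uber", "lyft"]
print(matches("That drunk driver hit a parked car", required, excluded))  # True
print(matches("My Uber driver was drunk", required, excluded))            # False
print(matches("Drunk at the bar again", required, excluded))              # False
```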
The OR Operator
In the DiscoverText platform, when the operator OR is used in the rule:
drunk OR driver
the software will search for tweets that have the word “drunk” or the word “driver” anywhere in the tweet. “OR” is a helpful operator when searching for terms that do not need to co-occur. An example is:
democrat OR republican
which returns tweets containing either one or both of these terms in the tweet body.
The OR operator must be in uppercase letters.
OR can also be used inside parentheses to list alternatives, as in the two examples below:
(merry OR happy) Christmas
(drunk OR inebriated) driver
If an exact phrase of two or more terms is desired, then quotation marks are used to identify the specific phrase to search for. For instance, “drunk driver” will query for tweets containing this exact phrase with the words in this particular order (i.e., drunk appears before driver and the two words are separated by a space character).
While quotation marks are core to generating rules, they are frequently misused when that level of specificity is not actually needed. If mining for “drunk driver”, for example, tweets that contain phrases such as “this driver who is drunk” will not be returned. It is in this context that the AND operator (specified with a space character) becomes more powerful. For example,
“drunk uber driver”
will return tweets containing the exact phrase drunk uber driver, but will not return tweets containing uber driver drunk.
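The difference between an exact phrase and an implicit AND can be illustrated with a short Python comparison. This is a simplified sketch with hypothetical function names; it splits text on word characters, which is an assumption and not a description of how PowerTrack tokenizes tweets.

```python
import re


def words(text):
    """Lowercase the text and return its word tokens in order."""
    return re.findall(r"\w+", text.lower())


def contains_all(text, terms):
    """Implicit AND: every term appears somewhere, in any order."""
    present = set(words(text))
    return all(term.lower() in present for term in terms)


def contains_phrase(text, phrase):
    """Quoted phrase: the words appear next to each other, in the given order."""
    haystack, needle = words(text), words(phrase)
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


tweet = "this driver who is drunk"
print(contains_all(tweet, ["drunk", "driver"]))  # True
print(contains_phrase(tweet, "drunk driver"))    # False
```

The same tweet satisfies the space-separated rule but not the quoted one, which is why an unnecessary pair of quotation marks can silently shrink a result set.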
There is no need to add commas between operators or within the point_radius operator. While this practice is common in languages such as Python or R, PowerTrack does not use commas as separators. For example, the commas in the rule below should not be there:
“drunk driver”, point_radius:[-105.27346517, 40.01924738, 10.0mi]
Omitting the commas yields the correct search:
“drunk driver” point_radius:[-105.27346517 40.01924738 10.0mi]
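A point_radius clause can be checked locally before it is submitted. The sketch below is an illustration under stated assumptions rather than PowerTrack's actual validation: the pattern and the function name parse_point_radius are hypothetical, and the coordinate order (longitude, then latitude) is taken from Twitter's operator documentation.

```python
import re

# Assumed shape: point_radius:[lon lat radius<unit>] with single spaces and no commas.
POINT_RADIUS = re.compile(
    r"point_radius:\[(-?\d+(?:\.\d+)?) (-?\d+(?:\.\d+)?) (\d+(?:\.\d+)?)(mi|km)\]"
)


def parse_point_radius(clause):
    """Return (lon, lat, radius, unit) or raise ValueError for a malformed clause."""
    match = POINT_RADIUS.fullmatch(clause)
    if match is None:
        raise ValueError(f"malformed point_radius clause: {clause!r}")
    lon, lat, radius = (float(value) for value in match.group(1, 2, 3))
    if not -180 <= lon <= 180:
        raise ValueError("longitude must be between -180 and 180")
    if not -90 <= lat <= 90:
        raise ValueError("latitude must be between -90 and 90")
    return lon, lat, radius, match.group(4)


print(parse_point_radius("point_radius:[-105.27346517 40.01924738 10.0mi]"))
# (-105.27346517, 40.01924738, 10.0, 'mi')
```

A clause that contains commas, a space after the colon, or no unit fails the pattern and raises ValueError. The range check also catches some swapped coordinates, since a latitude cannot exceed 90 degrees.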
Uppercase versus Lowercase Letters
Uppercase and lowercase letters are treated identically in search terms, so you do not need to capitalize proper nouns. For example, the following two rules are equivalent and return the same result set:
(Indiana OR Iowa OR Wisconsin)
(indiana OR iowa OR wisconsin)
Note, however, that the OR operator must be in uppercase.
PowerTrack (Gnip) Errors Received in DiscoverText/Sifter
If the PowerTrack (Gnip) API detects an error then it will email an error message to the user. Error messages are in the form:
Gnip Error: User- <first name> <last name> <email address> | <error message>|<timestamp>|<the attempted rule that was incorrect>
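If you receive many of these emails, the message line can be split into its parts with a few lines of Python. This is a sketch under assumptions: the function name parse_gnip_error is hypothetical, the user in the sample line is made up, and the code assumes that neither the error message nor the rule contains a “|” character. Because the examples in this article show more than one field between the error message and the rule, everything in between is kept as a list.

```python
def parse_gnip_error(message):
    """Split a Gnip error line into user, error message, middle fields and rule."""
    parts = [part.strip() for part in message.split("|")]
    if len(parts) < 4:
        raise ValueError("expected at least four '|'-separated fields")
    prefix = "Gnip Error: User-"
    user = parts[0]
    if user.startswith(prefix):
        user = user[len(prefix):].strip()
    return {
        "user": user,
        "error": parts[1],
        "timestamp_fields": parts[2:-1],  # one or more fields, kept as-is
        "rule": parts[-1],                # the attempted rule is the last field
    }


# Made-up sample line that follows the format described above.
sample = (
    "Gnip Error: User- Jane Doe jane@example.com | "
    "Ambiguous use of or as a keyword (at position 10)"
    '|20180424184525-7097|Smith or "John Smith"'
)
parsed = parse_gnip_error(sample)
print(parsed["rule"])  # Smith or "John Smith"
```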
Examples of Errors
The following are some examples that did not return the desired results, followed by what is probably the correct rule that the user intended.
Ambiguous use of or as a keyword. Use OR to logically join two clauses, or “or” to find occurrences of or in text (at position 10)
|20180424184525-7097|9/12/2015 12:00:00 AM|9/14/2015 11:59:00 PM|Smith or “John Smith”
The “or” operator was entered in lowercase letters; it should be uppercase “OR”.
Smith OR “John Smith”
Ambiguous use of and as a keyword. Use a space to logically join two clauses, or “and” to find occurrences of and in text
|20180424162228-7295|3/24/2018 12:00:00 AM|3/24/2018 11:59:00 PM|(#MarchForOurLives AND @MileyCyrus OR #MileyCyrus OR “Miley Cyrus”)
The “AND” operator is not a valid keyword. Use a space character instead to logically join two clauses.
(#MarchForOurLives @MileyCyrus OR #MileyCyrus OR “Miley Cyrus”)
Note: It is also possible that the actual intent is:
#MarchForOurLives (@MileyCyrus OR #MileyCyrus OR “Miley Cyrus”)
from:southwest from:americanair from:united
Although this particular example did not generate an error message because syntactically it is a valid rule, it is probably not what the user intended. As written, this rule will only return a tweet if it is from southwest and from americanair and from united, which is an impossibility, because each tweet only has one “from:” field. Remember, a space character is the logical AND operator. The user probably wanted any tweets from southwest or from americanair or from united, which requires explicitly specifying the OR operator.
from:southwest OR from:americanair OR from:united
#nasa has:media OR has:links
Although this particular example did not generate an error message because syntactically it is a valid rule, it is probably not what the user intended because of the order of precedence rules. Using parentheses will make the intent clear, which is to obtain tweets that contain the NASA hashtag and that have either media or a link.
#nasa (has:media OR has:links)
Reference to invalid operator ‘From’. Operator is not available in current product or product packaging. Must be from the list: [from, to, source, …] (at position 1)
|20180429180232-7322|4/30/2013 12:00:00 AM|5/1/2014 11:59:00 PM|From:anyuser
PowerTrack did not recognize the operator “From:”. The “from:” operator must be in lowercase letters.
from:anyuser
missing EOF at ‘point_radius’ (at position 30)
|20180503123408-7338|7/28/2016 12:00:00 AM|9/15/2016 11:59:00 PM|(hinkley point c OR nuclear) point_radius: [51.207554, -3.127435 20.0mi] lang:en
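No space characters are allowed between “point_radius:” and the coordinates within the square brackets. Remove the space character between point_radius: and the opening bracket. The comma inside the brackets should also be removed, and the coordinates should be given as longitude followed by latitude.
(hinkley point c OR nuclear) point_radius:[-3.127435 51.207554 20.0mi] lang:en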
Invalid field value (at position 41)
|20180507034309-7361|1/1/2017 12:00:00 AM|12/31/2017 11:59:00 PM|place_country:au has:geo point_radius:[153.033333 -27.466667 25]
The unit of measure of the radius (mi or km) was mistakenly omitted from the query. Specify either “mi” or “km”.
place_country:au has:geo point_radius:[153.033333 -27.466667 25km]
Invalid field value (at position 27) Sample field operator cannot take a non-integer value, must be in the range 1..100: 2.84 (at position 20)
|20180508114053-7119|5/8/2014 12:00:00 AM|5/8/2014 11:59:00 PM|#bringbackourgirls sample:2.84
The “sample:” operator was incorrectly specified with a non-integer value (2.84). Sifter sample values must be whole numbers between 1 and 100.
mismatched input ‘<EOF>’ expecting ‘)’ (at position 143)
|(“corporate monopolies” OR “antitrust” OR “anti-trust” OR “anti trust” OR “corporate monopoly” OR “business monopolies” OR “business monopoly”
PowerTrack found the end of the query (an end of file character “<EOF>”) when it was expecting a closing parenthesis character. There needs to be a closing parenthesis at the end to match the opening parenthesis at the beginning of the query.
(“corporate monopolies” OR “antitrust” OR “anti-trust” OR “anti trust” OR “corporate monopoly” OR “business monopolies” OR “business monopoly”)
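Most of the mistakes shown in these examples can be caught with a simple check before a rule is submitted. The following Python sketch is a rough heuristic and not a reimplementation of PowerTrack's parser: the function name lint_rule and the wording of the warnings are hypothetical, and the checks cover only the error types discussed in this article.

```python
import re


def lint_rule(rule):
    """Return a list of warnings for common PowerTrack rule mistakes."""
    warnings = []
    # Ignore quoted phrases, where words such as "or" and "and" are literal text.
    bare = re.sub(r'"[^"]*"|“[^”]*”', " ", rule)

    # Only compares counts; it does not check nesting order.
    if bare.count("(") != bare.count(")"):
        warnings.append("unbalanced parentheses")

    tokens = re.sub(r"[()]", " ", bare).split()
    if "or" in tokens:
        warnings.append("lowercase 'or': use uppercase OR")
    if any(token.lower() == "and" for token in tokens):
        warnings.append("'and' is not an operator: use a space")

    # Operator names such as from: are written in lowercase.
    for token in tokens:
        name, sep, _ = token.lstrip("-").partition(":")
        if sep and name != name.lower():
            warnings.append(f"operator '{name}:' should be lowercase")

    if re.search(r"point_radius:\s+\[", bare):
        warnings.append("remove the space after 'point_radius:'")
    for inner in re.findall(r"point_radius:\s*\[([^\]]*)\]", bare):
        if "," in inner:
            warnings.append("remove commas inside point_radius")
        if not re.search(r"\d(mi|km)\s*$", inner):
            warnings.append("point_radius needs a unit: mi or km")

    for value in re.findall(r"sample:([^\s)]+)", bare):
        if not (value.isdigit() and 1 <= int(value) <= 100):
            warnings.append("sample: must be a whole number from 1 to 100")
    return warnings


print(lint_rule('Smith or "John Smith"'))
# ["lowercase 'or': use uppercase OR"]
print(lint_rule("#nasa (has:media OR has:links)"))
# []
```

An empty list only means that none of these particular patterns were found; the rule may still be rejected by PowerTrack or return something other than what you intended, as the from: example above shows.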