Twitter Analytics Experiments in Geography and Spatial Analysis at Ryerson

In my Master of Spatial Analysis (MSA) course “Cartography and Geographic Visualization” in the Fall 2014 semester, three MSA students experimented with geospatial analysis of tweets. This post provides a brief account of the three student projects and ends with a caution about mapping and spatially analyzing tweets.

Yishi Zhao wrote her “mini research paper” on “Exploring the Thematic Patterns of Twitter Feeds in Toronto: A Spatio-Temporal Approach”. Yishi’s goal was to identify the spatial and thematic patterns of geolocated tweets in Toronto at different times of day, and to explore the use of R for spatio-temporal analysis of the Twitter stream. Within R, Yishi used the streamR package to collect geolocated tweets for the City of Toronto and mapped them by ward using a combination of maptools, GISTools, and QGIS. She also used the tm package for text mining and to generate word clouds of the most frequent words tweeted at different times of the day.
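To make the text-mining step more concrete, here is a minimal sketch of counting frequent words per period of the day. It is written in Python rather than R and is not Yishi's code; the sample tweets, the stop-word list, and the period cut-offs in `time_bin` are all made up for illustration.

```python
from collections import Counter
from datetime import datetime
import re

# Hypothetical sample records: (ISO timestamp, tweet text). Not real data.
TWEETS = [
    ("2014-11-03T08:15:00", "Stuck on the subway again this morning"),
    ("2014-11-03T08:40:00", "Coffee before the morning meeting"),
    ("2014-11-03T19:05:00", "Leafs game tonight with friends"),
    ("2014-11-03T20:30:00", "Great game tonight, great crowd"),
]

# A tiny illustrative stop-word list; real text mining would use a fuller one.
STOP_WORDS = {"the", "on", "this", "with", "before", "again", "a"}


def time_bin(hour):
    """Assign an hour (0-23) to a coarse period of the day (assumed cut-offs)."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"


def word_counts_by_period(tweets):
    """Return a dict mapping each period to a Counter of its non-stop words."""
    counts = {}
    for stamp, text in tweets:
        period = time_bin(datetime.fromisoformat(stamp).hour)
        words = [w for w in re.findall(r"[a-z']+", text.lower())
                 if w not in STOP_WORDS]
        counts.setdefault(period, Counter()).update(words)
    return counts


counts = word_counts_by_period(TWEETS)
for period, counter in counts.items():
    # The top few words per period are what a word cloud would emphasize.
    print(period, counter.most_common(3))
```

The per-period counters are the kind of input a word-cloud tool would scale its font sizes by.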

Toronto tweets per population at different times of day - standard-deviation classification (Source: Yishi Zhao)
Frequent words in Toronto tweets at different times of day (Source: Yishi Zhao)

One general observation is that the spatial distribution of tweets (normalized by residential population) becomes increasingly concentrated in the downtown area as the day progresses, while the set of most frequent words expands (along with the actual volume of tweets, which peaked between 7 pm and 9 pm).
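As a rough illustration of the normalization and classification behind such a map, the following Python sketch computes tweets per 1,000 residents and assigns each ward to a standard-deviation class. The ward names, counts, and populations are invented, and the class breaks (at 0.5 and 1.5 standard deviations) are an assumption; the breaks used in Yishi's map are not stated here.

```python
from statistics import mean, pstdev

# Hypothetical ward figures: (tweet count, residential population). Not real data.
WARDS = {
    "Ward A": (120, 60000),
    "Ward B": (45, 50000),
    "Ward C": (300, 40000),
    "Ward D": (30, 55000),
}


def tweets_per_1000(wards):
    """Normalize raw tweet counts by residential population."""
    return {name: 1000 * tweets / pop for name, (tweets, pop) in wards.items()}


def std_dev_classes(rates):
    """Label each unit by how far its rate lies from the mean, in std. deviations."""
    mu = mean(rates.values())
    sigma = pstdev(rates.values())
    labels = {}
    for name, rate in rates.items():
        z = (rate - mu) / sigma if sigma else 0.0
        # Assumed class breaks at -0.5, +0.5 and +1.5 standard deviations.
        if z < -0.5:
            labels[name] = "below average"
        elif z <= 0.5:
            labels[name] = "near average"
        elif z <= 1.5:
            labels[name] = "above average"
        else:
            labels[name] = "well above average"
    return labels


rates = tweets_per_1000(WARDS)
labels = std_dev_classes(rates)
for name in WARDS:
    print(name, round(rates[name], 2), labels[name])
```

Because the classes are defined relative to the mean of each map, the same ward can change class between time periods even if its own rate stays the same.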

MSA student Alexa Hinves pursued a more focused objective, as indicated by her paper’s title, “Twitter Data Mining with R for Business Analysts”. Her project examined the potential of geolocated Twitter data for branding research, using the example of singer Taylor Swift’s new album “1989”. Alexa explored the use of both the streamR and twitteR packages in R. The ggplot2, maps, and wordcloud packages were used to present the results.

Distribution of geolocated tweets and word cloud referring to Taylor Swift (Source: Alexa Hinves)

Alexa’s map of 1,000 Taylor Swift-related tweets suffers from a problem that is common to many Twitter maps: they largely show the distribution of population rather than spatial patterns that are specific to tweet topics or general Twitter use. In this instance, we see the major cities in the United States lighting up. The corresponding word cloud (which I pasted onto the map) led Alexa to speculate that businesses can use location-specific sentiment analysis for targeted advertising, for example in the context of product releases.
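One common way to separate a topic-specific pattern from plain population density is to compare each place's share of topic tweets with the overall share, a ratio often called a location quotient. This was not part of Alexa's project; the Python sketch below only illustrates the idea, and the city names and counts are invented.

```python
# Hypothetical counts per city: (topic tweets, all geolocated tweets). Not real data.
CITIES = {
    "City A": (400, 40000),
    "City B": (100, 5000),
    "City C": (50, 5000),
}


def location_quotients(cities):
    """Ratio of each city's topic share to the topic share across all cities."""
    total_topic = sum(topic for topic, _ in cities.values())
    total_all = sum(all_tweets for _, all_tweets in cities.values())
    overall_share = total_topic / total_all
    return {
        name: (topic / all_tweets) / overall_share
        for name, (topic, all_tweets) in cities.items()
    }


lq = location_quotients(CITIES)
most_tweets = max(CITIES, key=lambda name: CITIES[name][0])
most_distinctive = max(lq, key=lq.get)
print("Most topic tweets:", most_tweets)
print("Highest location quotient:", most_distinctive)
```

In this made-up example the largest city has the most topic tweets, but a smaller city has the highest quotient, which is the kind of difference a raw dot map hides.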

The third project was an analysis and map poster on “#TOpoli – Geovisualization of Political Twitter Data in Toronto, Ontario”, completed by MSA candidate Richard Wen. With this project, we turn our interest back to the City of Toronto and to the topic of the October 2014 municipal election. Richard used techniques similar to those of the other two students to collect geolocated tweets. He mapped the number of tweets for each of the City’s 140 neighbourhoods, normalized by neighbourhood area (“bubble map” at top of poster). Richard then created separate word clouds for the six former municipalities in Toronto and mapped them within those boundaries (map at bottom of poster).

#TOpoli map poster - spatial pattern and contents of tweets in Toronto's mayoral election 2014 (Source: Richard Wen)

Despite the different approach to normalization (normalization by area compared to Yishi’s normalization by population), Richard also finds a concentration of Twitter activity in downtown Toronto. The word clouds contain similar terms, notably the names of the leading candidates, now-mayor John Tory and candidate Doug Ford. An interesting challenge arose in that we cannot tell from the word count alone whether tweets with a candidate’s name were written in support of or in opposition to that candidate.

The three MSA students used the open-ended cartography assignment to acquire expertise in a topic that is “trending” among neo-cartographers. They have already been asked for advice by a graduate student in an environmental studies program who is contemplating a Twitter sentiment analysis for her Master’s thesis. Richard’s project also led to an ongoing collaboration with journalism and communication researchers. However, the most valuable lesson for the students and me was an increased awareness of the pitfalls of analyzing and mapping tweets. These pitfalls stem from the selective use of Twitter among population subgroups (e.g., young professionals; globally, English-speaking countries), the small proportion of tweets that have a location attached (less than 1% of all tweets by some accounts), and the limitations imposed by Twitter on the collection of free samples from the Twitter stream.

I have previously discussed some of these data-related issues in a post on “Big Data – Déjà Vu in Geographic Information Science”. An additional discussion of the cartography-related pitfalls of mapping tweets will be the subject of another blog post.