Big Data – Déjà Vu in Geographic Information Science

A couple of years ago, one of my first blog posts here was a brief note on “Trends in GIScience: Big Data”. Although not at the core of my research interests, the discussions and developments around big data continue to influence my work. In an analysis of “The Pathologies of Big Data”, Adam Jacobs notes that “What makes most big data big is repeated observations over time and/or space”. Indeed, Geographic Information Systems (GIS) researchers and professionals have been working with large datasets for decades. During my PhD in the late 1990s, the proceedings of the “Very Large Data Bases” (VLDB) conference series were a relevant resource. I am not sure what distinguishes big data from large data, but I have neither the space nor the time to discuss this further.

Instead, I want to draw a first link between big data and my research on geovisual analytics. In an essay on “The End of Theory”, Chris Anderson famously argued that with sufficiently large data volumes, the “numbers [would] speak for themselves”. As researchers, we know that data are a rather passive species: the most difficult stage in many research projects is to determine the right questions to ask of the data, or to guide the collection of data to begin with. The more elaborate critiques of the big data religion include a recent article by Tim Harford, “Big data: are we making a big mistake?” Harford points to the flawed assumption that n = all in big data collection (not everybody tweets, has a smartphone, or even a credit card!) and argues that we are at risk of repeating old statistical mistakes, only at the larger scale of big data. Harford also characterizes some big data as “found data” from the “digital exhaust” of people’s activities, such as Web searches. This makes me worry about the polluted analyses that will be based on such data!
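To make Harford’s n ≠ all point concrete, here is a minimal simulation sketch (my own illustration, not from his article; the population and the smartphone-ownership rule are entirely made up): a huge “found data” sample remains biased when the behaviour that generates the data correlates with the attribute being estimated.

```python
import random

random.seed(42)

# Hypothetical population: each person has an income and may or may not
# produce "digital exhaust" (modelled here as smartphone ownership, which
# we assume is more likely at higher incomes -- the source of the bias).
population = []
for _ in range(200_000):
    income = random.lognormvariate(10.5, 0.6)            # skewed incomes
    owns_smartphone = random.random() < min(1.0, income / 80_000)
    population.append((income, owns_smartphone))

true_mean = sum(inc for inc, _ in population) / len(population)

# "Found data": a very large sample, but only of smartphone owners.
found = [inc for inc, owns in population if owns]
found_mean = sum(found) / len(found)

print(f"population mean income: {true_mean:,.0f}")
print(f"'found data' mean (n={len(found):,}): {found_mean:,.0f}")
```

However large n grows, the found-data estimate stays biased upward, because the sample systematically excludes part of the population: more data does not fix a selection effect.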

On a more positive note, cartographers have argued for using interactive visualization as a means to analyse complex spatial datasets. For example, Alan MacEachren’s 1994 map use cube defines geovisualization as the expert use of highly interactive maps to discover unknown spatial patterns. On this basis, I understand geovisual analytics as an efficient and effective approach to “making the data speak”. For example, in Rinner & Taranu (2006) we concluded that “an interactive mapping tool is worth a thousand numbers” (p. 647), which may actually underestimate the potential of map-based data exploration. Along similar lines, I noted in Rinner (2007) that data (read: small data) can quickly become complex (read: big data) when they are subject to analytical processing. For example, for a composite index created from a few indicators for the 140 social planning neighbourhoods in the Wellbeing Toronto tool, changes in the indicator set, in the weights assigned to the indicators, and in the normalization and standardization applied create an exponentially growing set of potential indices. The interactive, geovisual nature of the tool helps analysts draw reasonable conclusions for decision-makers.
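A rough back-of-the-envelope sketch of this combinatorial growth (my own illustration; the number of indicators, weight levels, and scaling options below are assumptions, not the actual Wellbeing Toronto settings):

```python
# Hypothetical configuration space for a composite index, loosely modelled
# on a tool like Wellbeing Toronto (all parameters here are assumptions).
n_indicators = 10                 # candidate indicators
weight_levels = [0, 1, 2, 3]      # 0 = excluded; 1-3 = relative weight
n_scalings = 4                    # e.g., {raw, per-capita} x {z-score, min-max}

# Each combination of per-indicator weight choices plus one scaling scheme
# yields a distinct potential index: len(weight_levels)**n_indicators * n_scalings.
n_configurations = len(weight_levels) ** n_indicators * n_scalings
print(f"potential index configurations: {n_configurations:,}")  # 4,194,304

# One such configuration: a weighted average of normalized indicator values
# for a single neighbourhood (values and weights below are made up).
indicator_values = [0.4, 0.7, 0.2]
weights = [2, 1, 3]
index = sum(w * v for w, v in zip(weights, indicator_values)) / sum(weights)
print(f"composite index for one neighbourhood: {index:.3f}")  # 0.350
```

With just ten indicators and four weight levels there are already over four million candidate indices; this is exactly the kind of option space that an interactive geovisual tool lets analysts explore selectively rather than enumerate.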

A second link exists between big data and my research on the participatory Geoweb. In this research, we examine how the Geoweb is changing interactions between government and citizens. On the one hand, government data are being released in open data catalogues for all to enjoy – i.e., to scrutinize public services, develop value-added products or services, or just play with cool map and app designs. On the other hand, governments are starting to rely on crowdsourcing to fill gaps where shrinking budgets limit authoritative data collection and maintenance. In this context of “volunteered geographic information” (VGI), we argue that we need to consider the entire VGI system, including the hardware and software, the user-generated data, and the applications and people involved, in order to fully understand the emerging phenomenon. We have also taken up the study of different types of VGI, such as facilitated VGI in contrast to ambient VGI. Of these two, ambient or “involuntary” VGI is connected with big data and the “digital exhaust” discussed above, as it consists of information collected from large numbers of users without their knowledge.

Again, geographers are in a strong position to examine big data resulting from ambient VGI, since location plays a major role in the VGI system. The 2014 annual meeting of the Association of American Geographers (AAG) included a high-profile panel on big data, their impact on real people, asymmetries in location privacy, and the role of “big money” in big data analytics. In contrast to previous discourse, in which geographers often limited themselves to deploring the disconnect between the social sciences and developments in computer science and information technology, a tendency toward more confident commentary and critique of big data and other uncritically adopted IT developments was tangible at AAG 2014. We need to understand the societal risks of global data collection and (geo)surveillance, and explain why, if you let the data speak for themselves, you may earn a Big Silence or make bad decisions.

My research on Wellbeing Toronto and place-specific policy-making, as well as the Geothink partnership studying the Geoweb and government-citizen interactions, is funded by the Social Sciences and Humanities Research Council of Canada (SSHRC). While SSHRC supports research into the opportunities provided by big data, I think it is best positioned among the granting councils to also fund critical research on the risks and side effects of big data.