Yet Another Review of the Terminology Used to Describe Techniques for Making Multiple Variables Comparable
Ok, here we go again. I wrote in this blog on 30 November 2013 about “Normalization vs. Standardization – Clarification (?) of Key Geospatial Data Processing Terminology using the Example of Toronto Neighbourhood Wellbeing Indicators”. Note the question mark in that title? Its length, the length of my title and subtitle today, and the choice of words used in them will tell you a lot about the challenge at hand: clarifying, reviewing, and settling – once and for all! – the meaning of terms like “normalization”, “standardization”, and “rescaling”. The challenge arises in the processing and combination of multiple variables in GIS-based multi-criteria decision analysis, for example in my ongoing professional elective GEO641 GIS and Decision Support, and extends to many situations in which we use multivariate statistical or analytical tools for geographic inquiry.
In two other blog posts, I discussed the need to normalize raw-count variables for choropleth mapping. On 26 March 2020, I wrote about “The Graduated Colour Map: A Minefield for Armchair Cartographers”. The armchair cartographer’s greatest gaffe: mapping raw-count variables as choropleth or graduated-colour maps. In a post dated 3 November 2020 on “How to Lie with COVID-19 Maps … or tell some truths through refined cartography”, I go into more detail about why to use “relative metrics” on choropleth maps. These metrics can take the form of a percentage, proportion, ratio, rate, or density. They are obtained by dividing a raw-count variable by a suitable reference variable. In class, I used the example of unemployment, where the City of Toronto provides the number of unemployed people in each of its 140 neighbourhoods.
The first scatterplot shows the number of people in the labour force against the number of unemployed. We see a strong linear relationship – the more people of working age, the more unemployed people there are! Does this mean that large neighbourhoods (those with more people in the labour force) need more employment services, or more support in general? Most likely not. The majority of social policies and programming will instead depend on the prevalence of an issue like unemployment, not the raw count of the people affected. To create the common unemployment rate, we divide the number of unemployed by the total labour force for each neighbourhood. This normalization makes the values of a variable independent of the size of the spatial units of analysis, thus enabling comparisons between the units (here: neighbourhoods). I use the term “normalization” for this purpose primarily because one of the leading geospatial software vendors, Esri, has used it consistently in their software for decades. Different versions of ArcView, generations of ArcMap, and now the latest ArcGIS Pro all offer(ed) to normalize a selected variable by another field for choropleth mapping.
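For readers who prefer code over prose, here is a minimal sketch of this normalization in Python/pandas. The column names and figures are invented for illustration, not taken from the actual Toronto dataset.

```python
import pandas as pd

# Hypothetical excerpt of a neighbourhood attribute table (invented values)
df = pd.DataFrame({
    "neighbourhood": ["A", "B", "C"],
    "labour_force":  [12000, 4500, 23000],  # reference variable
    "unemployed":    [900,   400,  1500],   # raw-count variable
})

# Normalization: divide the raw count by the reference variable, row by row
df["unemployment_rate"] = df["unemployed"] / df["labour_force"]

print(df)
```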
Normalization
I will use COVID data from https://ourworldindata.org/coronavirus to describe normalization as a “horizontal” operation that enables “vertical” comparisons. The COVID data come in the typical tabular configuration of a feature attribute table in GIS, with spatial units defining the rows and variables defining the columns. One variable in the downloaded table is the total number of COVID “cases” per spatial unit. I boiled the table down to include only continental summaries. But can we compare the COVID case numbers of North America to those of South America? The European Union to all of Europe? Asia to the World? No, of course not, since the cases are based on different total populations in each of the units, and some units are actually part of others. That’s where normalization comes into play. Row by row, we put each of the raw-count values (total cases) in relation to the corresponding value of a reference variable (population), as illustrated by the thick, horizontal arrow. The curved arrow represents the creation of the new, normalized variable (cases per million, which includes a multiplication by one million to obtain the conventional per-million rate).
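In code, this “horizontal” operation might look as follows. Note that the case and population figures below are invented for illustration; the real numbers come from the Our World in Data download.

```python
import pandas as pd

# Hypothetical continental summaries (invented figures, not the real OWID data)
covid = pd.DataFrame({
    "unit":        ["North America", "South America", "European Union", "Europe", "World"],
    "total_cases": [120_000_000, 64_500_000, 81_000_000, 120_000_000, 700_000_000],
    "population":  [600_000_000, 430_000_000, 450_000_000, 750_000_000, 8_000_000_000],
})

# The "horizontal" operation: relate each raw count to its reference value,
# row by row, multiplying by one million for the conventional per-million rate
covid["cases_per_million"] = covid["total_cases"] / covid["population"] * 1_000_000

# The new column is now "vertically" comparable, e.g. by sorting the units
print(covid.sort_values("cases_per_million", ascending=False))
```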
Hopefully, you can see why I propose to call normalization a horizontal operation. Importantly, the values of the new normalized variable are now vertically comparable, as indicated by the vertical arrow in the cases-per-million column! For example, we can sort the continents by cases-per-million and make assertions about where COVID has hit harder or less hard (depending on the questionable definition of a COVID “case”, of course, but that’s something I have discussed elsewhere). We can also contrast the EU’s performance with that of the entire continent, and even check each continent against the World’s overall rate, which has magically turned into a weighted average. You know this process well from comparing provincial or state data to federal/national averages, which include each of the sub-units. With normalized data, such a seemingly silly part-to-whole comparison becomes perfectly valid!
Rescaling
Although multiple normalized variables often have the same units, such as the unitless rates of COVID cases and deaths included in my figures, or percentage values in many socio-economic studies, the value ranges can still be very different. Naturally, the deaths-per-million are much smaller than the cases-per-million. However, in multi-criteria decision analysis (MCDA) or any composite index building endeavour, we need to combine the values of different variables. To do so, we need to bring these values not only to the same units but also to the same numeric range, usually 0.0 to 1.0 (also known as 0% to 100%). While I have used the term “standardization” to describe this process in the past, there is a risk of confusing it with z-score standardization in statistics. Since z-scores are not limited to a fixed, positive range, they do not serve the same purpose. Therefore, we’ll use the term “rescaling” instead.
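To see why z-scores do not serve this purpose, here is a quick sketch with invented numbers: the standardized values centre on zero and are not confined to any fixed, positive range.

```python
import pandas as pd

values = pd.Series([120, 450, 900, 1500, 5200])  # invented rates

# z-score standardization: mean 0 and standard deviation 1,
# but the results are unbounded and include negative values
z = (values - values.mean()) / values.std()
print(z.round(2))  # roughly -0.74 .. 1.73 for these numbers -- not a 0-1 scale
```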
Rescaling is a vertical operation in my new “system”, illustrated by the thick, vertical arrows. It operates within one variable (column) at a time. For the most common rescaling techniques (score-range transformation and maximum-score transformation), you will need the minimum and/or maximum value in the column. For the maximum-score transformation of a “benefit criterion” (a variable that is to be maximized), you divide each value by the maximum value in the column, as shown for the cases-per-million and separately for the deaths-per-million (they obviously are not a “benefit”, but for a COVID vulnerability index, they would have to be maximized). After this operation, the new rescaled, normalized values represent each unit’s performance in a new way. The largest rate has been translated into the value 1.0, and all other rates can be read as a percentage of that largest/worst value. For example, the cases-per-million in South America are 75% as “bad” as those in North America. The other rescaled variable(s) can be interpreted in the same way. The deaths-per-million are quite consistent with the cases rate, but South America is closer to North America with respect to the death rate. The values of the two rescaled variables can now be compared horizontally (indicated by the thinner arrows) and also combined into a composite index (although this would not make much sense in the example).
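The two transformations look like this in code; I continue with the invented per-million rates from above, so treat the outputs as illustrative only.

```python
import pandas as pd

# Invented normalized rates, continuing the hypothetical example
rates = pd.DataFrame({
    "unit": ["North America", "South America", "Europe"],
    "cases_per_million":  [200_000, 150_000, 160_000],
    "deaths_per_million": [3_400, 3_200, 3_000],
})

for col in ["cases_per_million", "deaths_per_million"]:
    # Maximum-score transformation: divide each value by the column maximum,
    # so the largest/worst unit scores exactly 1.0
    rates[col + "_max"] = rates[col] / rates[col].max()
    # Score-range transformation: stretch the column so that the
    # minimum maps to 0.0 and the maximum to 1.0
    rates[col + "_range"] = (rates[col] - rates[col].min()) / (
        rates[col].max() - rates[col].min()
    )

print(rates.round(2))
# e.g. South America's cases_per_million_max of 0.75 reads as
# "75% as 'bad' as North America", matching the discussion above
```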
In my course, students created composite indices of wellbeing or quality of life, or conversely indices of deprivation or vulnerability, for Toronto neighbourhoods based on 3-5 rescaled, normalized socio-economic and infrastructure indicators. It must be noted that many indicators are already normalized when we obtain them. Examples from the Wellbeing Toronto dataset include average household income per neighbourhood (i.e. total income divided by number of households), or the Gini coefficient and walk score (metrics that are independent of neighbourhood size). We used MCDA techniques in Excel and QGIS for this assignment. In MCDA, we use “criteria” for what I have called variables, attributes, or indicators above, and we may distinguish between “constraints” and “factors” among the criteria. But clarifying that terminology is a matter for another post…
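P.S. For completeness, here is a minimal sketch of the weighted-sum technique behind such composite indices. The neighbourhood values and weights are invented, and I assume each indicator has already been rescaled to 0.0–1.0 with “higher = better” (cost criteria like the Gini coefficient would first need to be inverted).

```python
import pandas as pd

# Invented, already-rescaled indicators (0.0 .. 1.0, higher = better)
criteria = pd.DataFrame({
    "neighbourhood":      ["A", "B", "C"],
    "income_rescaled":    [0.80, 0.35, 0.60],
    "walkscore_rescaled": [0.90, 0.50, 0.70],
    "equality_rescaled":  [0.40, 0.75, 0.55],  # e.g. an inverted Gini coefficient
}).set_index("neighbourhood")

# Invented criterion weights; in a weighted-sum model they must add up to 1.0
weights = {"income_rescaled": 0.5, "walkscore_rescaled": 0.3, "equality_rescaled": 0.2}

# Weighted linear combination: the standard weighted-sum MCDA model
criteria["wellbeing_index"] = sum(criteria[c] * w for c, w in weights.items())

print(criteria.sort_values("wellbeing_index", ascending=False))
```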