UKSmktComp: Retailer stores

Introduction

UK supermarket retailer emissions and energy consumption depend on 'size', which may be measured by turnover or an operational measure such as store floor area. However, emission and energy intensity reporting is sporadic, and absolute metrics are standardised using different denominators, limiting direct comparison. The Geolytix Supermarket Retail Points [downloaded 9/24] dataset has tracked UK supermarkets since 2014, recording their location, date of opening, and size class, among other attributes. In this section you will use R to process and explore the data, clean it, and then summarise it by retailer and year, ready for scaling absolute reported emissions and energy use in the final section. Analysis is limited to the UK's eleven largest supermarket retailers: Aldi, Asda, Coop, Iceland, Lidl, Marks and Spencer, Morrisons, Ocado, Sainsbury's, Tesco, and Waitrose.

Download and unzip UKSmktComp.zip.

Start RStudio.

R basics

In this section, the required libraries are installed and data frames explained.

Libraries

The tutorial uses eight libraries that must be installed. Libraries are packages of code that provide additional functionality. Installation only needs to be done once, so if you have previously installed any of the libraries you should not need to download them again.

Select Tools, then Install Packages…, then enter a comma-separated list of the packages you wish to install.

dplyr, readr, tidyr, ggplot2, paletteer, sf, ggrepel, ggpubr

Now open a new R script and add the code to load the required libraries. In R hash ‘#’ denotes a comment and all text after the hash on a line is ignored.

# Load Libraries
library(readr) # File reading/writing
library(dplyr) # Data processing
library(tidyr) # Data processing
library(ggplot2) # Graphing
library(paletteer) # Colour palettes
library(sf) # Spatial data
library(ggrepel) # Text positioning
library(ggpubr) # ggplot functions

Select the code lines and Run to execute the code.

The ‘working directory/folder’ is the default location where R looks for and saves files. You will need to edit this to the UKSmktComp folder. getwd() returns the current working directory path, while setwd() sets the path.

# Get and set the working directory
getwd()
# Edit the path to where you have stored the supplied data, e.g.
setwd("D:/UKSmktComp")

Now use the readr library read_csv() function to import the Retail Points CSV file.

# Import CSV file from Working Directory
store <- read_csv("geolytix_retailpoints_v34_202412.csv")

Data frame

To understand data you need to understand what it represents, how it was collected, how it is structured, and the distribution of values within it. The most common way to structure data is as a table in which columns are attributes and rows are records. The CSV format stores a table as plain text, and read_csv() loads the data into a different table data structure in R called a ‘data frame’, stored in the variable named store.

# Is the variable a data frame?
is.data.frame(store)
[1] TRUE
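To see the structure in miniature, a small data frame can be built by hand with data.frame(); this toy example is purely illustrative (the store counts are taken from the retailer counts later in this section).

```r
# A small illustrative data frame: each argument is a named column
# (attribute) and each row across the columns is a record
toy <- data.frame(retailer = c("Tesco", "Lidl"),
                  stores = c(2909, 1010))
is.data.frame(toy)
```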

To further explore the table structure, type at the command prompt.

# Number of columns in data frame
ncol(store)
[1] 19
# Data frame column names
colnames(store)
[1] "id" "retailer" "fascia" "store_name" "add_one" "add_two" "town" "suburb"
[9] "postcode" "long_wgs" "lat_wgs" "bng_e" "bng_n" "pqi" "open_date" "size_band"
[17] "county" "Year" "size_level"
# Number of rows/records
nrow(store)
[1] 18840
# Row names, by default a uniquely assigned ascending integer
rownames(store)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" ...

To obtain a summary report of table structure and values.

# A summary report of columns and their content
summary(store)
id retailer fascia store_name
Min. :1.01e+09 Length:18840 Length:18840 Length:18840
1st Qu.:1.01e+09 Class :character Class :character Class :character
Median :1.01e+09 Mode :character Mode :character Mode :character
Mean :1.01e+09
3rd Qu.:1.01e+09
Max. :1.01e+09
add_one add_two town suburb
Length:18840 Length:18840 Length:18840 Length:18840
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
postcode long_wgs lat_wgs bng_e
Length:18840 Min. :-8.0899 Min. :49.17 Min. : 5123
Class :character 1st Qu.:-2.8045 1st Qu.:51.50 1st Qu.:346908
Mode :character Median :-1.5874 Median :52.50 Median :427048
Mean :-1.7833 Mean :52.81 Mean :416710
3rd Qu.:-0.4016 3rd Qu.:53.71 3rd Qu.:509612
Max. : 1.7529 Max. :60.40 Max. :654949
bng_n pqi open_date size_band
Min. : -81178 Length:18840 Min. :18559999 Length:18840
1st Qu.: 179310 Class :character 1st Qu.:20179999 Class :character
Median : 290230 Mode :character Median :20210916 Mode :character
Mean : 324690 Mean :20206981
3rd Qu.: 424197 3rd Qu.:20231025
Max. :1168234 Max. :20249999
NA's :12763
county Year size_level
Length:18840 Min. : 1 Min. :1.000
Class :character 1st Qu.:2017 1st Qu.:1.000
Mode :character Median :2021 Median :1.000
Mean :2020 Mean :1.605
3rd Qu.:2023 3rd Qu.:2.000
Max. :2024 Max. :5.000

Check that the column data types are ‘correct’, as malformed data (e.g. a single character value in a field that ostensibly stores numeric data) may result in the ‘wrong’ column type in R. The column type would be ‘correct’ as far as R is concerned, but ‘wrong’ from our perspective. Look at the summary statistics for each column to get a feel for their ranges, distributions, and whether they contain any unknown ‘null’ values encoded as ‘NA’. 12,763 Year and 1 size_level values out of the 18,840 total are null.

Type the name of a variable at the prompt to view the first rows.

# Type the name of a variable at the prompt to view the first rows.
store

# A tibble: 18,840 × 19
id retailer fascia store_name add_one add_two town suburb postcode long_wgs lat_wgs bng_e bng_n pqi
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 1010008812 The Co-opera… The C… Co-op Hol… Holyhe… Chirk Chirk NA LL14 5NA -3.06 52.9 3.29e5 3.39e5 Roof…
2 1010021060 Spar Spar Spar Glyn… Cross … NA Glyn… NA LL20 7EH -3.19 52.9 3.20e5 3.38e5 Roof…
3 1010007877 The Co-opera… The C… Co-op Bla… High S… Blaena… Blae… Blaen… LL41 3AG -3.93 53.0 2.70e5 3.46e5 Roof…
4 1010016823 Spar Spar Spar - Cy… Cynwyd… Statio… Cynw… NA LL21 0LA -3.41 53.0 3.06e5 3.41e5 Roof…
5 1010010459 Tesco Tesco Tesco Wre… Plas K… Off We… Cefn… NA LL14 3AT -3.07 53.0 3.28e5 3.42e5 Roof…
6 1010007420 Lidl Lidl Lidl Newt… Llanid… NA Newt… NA SY16 1ET -3.32 52.5 3.11e5 2.91e5 Roof…
7 1010004399 Tesco Tesco Tesco Wel… Smithf… NA Wels… NA SY21 7BL -3.14 52.7 3.23e5 3.07e5 Roof…
8 1010022270 Morrisons Morri… Morrisons… Buttin… NA Butt… NA SY21 8SL -3.12 52.7 3.24e5 3.09e5 Roof…
9 1010017041 Spar Spar Spar - Ll… The Ma… Oswest… Llan… NA SY10 0JG -3.30 52.8 3.12e5 3.26e5 Roof…
10 1010016321 Spar Spar Spar - Mo… Broad … NA Mont… NA SY15 6PN -3.15 52.6 3.22e5 2.96e5 Roof…
# ℹ 18,830 more rows
# ℹ 5 more variables: open_date <dbl>, size_band <chr>, county <chr>, Year <dbl>, size_level <dbl>
# ℹ Use `print(n = ...)` to see more rows

To see the first 50 rows.

print(store, n = 50)

Columns are accessed using the ‘$’ character after the variable name.

store$retailer

unique() returns a vector of unique values.

# Unique values in the retailer column
unique(store$retailer)

[1] "The Co-operative Group" "Spar"
[3] "Tesco" "Lidl"
[5] "Morrisons" "Aldi"
[7] "Budgens" "Sainsburys"
[9] "Marks and Spencer" "Farmfoods"
[11] "Iceland" "Waitrose"
[13] "Asda" "Heron"
[15] "Makro" "Costco"
[17] "Cook" "The Southern Co-operative"
[19] "Booths" "Central England Co-operative"
[21] "Scottish Midland Co-operative" "Langdale Co-operative Society"
[23] "Clydebank Co-operative" "The Radstock Co-operative Society"
[25] "Midcounties Co-operative" "Channel Islands Co-operative Society"
[27] "Amazon" "Heart of England Co-operative"
[29] "Tamworth Co-operative Society" "Lincolnshire Co-operative"
[31] "Dunnes Stores" "East of England Co-operative"
[33] "Chelmsford Star Co-operative Society" "Whole Foods Market"
[35] "Planet Organic" "Coniston Co-operative Society"
[37] "Allendale Co-operative Society"

Filter

There are 37 retailers, but for simplicity only the eleven largest UK retailers will be analysed, one of which, Ocado, is an online supermarket with no stores. To simplify long retailer names, they are mapped to a shorter id code in a new Retailer_Id column. This is the code used to identify retailers in the emission dataset to which the data will later be joined. A named vector maps pairs of names and values.

# c(name = value) creates a named vector
retailer.list <- c("ALDI" = "Aldi",
"ASDA" = "Asda",
"COOP" = "The Co-operative Group",
"ICE" = "Iceland",
"LIDL" = "Lidl",
"M&S" = "Marks and Spencer",
"MORR" = "Morrisons",
"OCADO" = "Ocado",
"SAINS" = "Sainsburys",
"TESC" = "Tesco",
"JLP"= "Waitrose"
)
names(retailer.list)
[1] "ALDI"  "ASDA"  "COOP"  "ICE"   "LIDL"  "M&S"   "MORR"  "OCADO" "SAINS" "TESC"  "JLP"
Use [x], where x is the item number, to return individual names and values.
names(retailer.list)[3]
[1] "COOP"
retailer.list[3]
                    COOP
"The Co-operative Group"
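Elements of a named vector can also be retrieved by name rather than position; a brief illustrative aside:

```r
# The named vector from above
retailer.list <- c("ALDI" = "Aldi", "TESC" = "Tesco")
# Single brackets keep the name attached to the returned value
retailer.list["TESC"]
# Double brackets return the bare value
retailer.list[["TESC"]]  # "Tesco"
```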

Use the dplyr filter() function to subset the rows in store to those with a retailer that matches any of the values in retailer.list.

nrow(store)
[1] 18840
# Filter
store <- filter(store, retailer %in% retailer.list)
nrow(store)
[1] 14176
# Percent of retail points belonging to the 11 retailers
nrow(store)/18840 * 100
[1] 75.24416

75.24% of the retail points records are owned by the 11 retailers, considerably less than their share of the grocery market.

table() counts the records with matching values in one or more vector columns.

table(store$retailer)

Aldi Asda Iceland
1043 1106 972
Lidl Marks and Spencer Morrisons
1010 1056 1745
Sainsburys Tesco The Co-operative Group
1444 2909 2467
Waitrose
424

Tesco has the highest number of stores, followed by the Co-operative Group. Ocado is an online retailer without stores.

NAs

There were NAs in Year, size_band, and size_level columns but are they in the filtered data set? Count the number of NAs in the size_band column.

length(which(is.na(store$size_band)))
[1] 1

‘$’ signifies the name of a data frame column. is.na() returns whether each value in a vector (the column) is NA or not. which() returns the positions of values in a vector that match a criterion (is.na == TRUE), and length() returns the number of items in the vector. ‘Stringing’ multiple functions into a single statement is an efficient way of writing code, but the more stringing is done the harder the code is to understand. Do NAs remain in the other two columns?
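As an aside, there is a shorter equivalent idiom: is.na() returns a logical vector, and TRUE counts as 1 when summed, so sum() counts the NAs directly.

```r
# Equivalent to length(which(is.na(...))): sum the TRUEs
sum(is.na(store$size_band))
# The same idea on a small vector
sum(is.na(c(1, NA, 3, NA)))  # 2
```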

Code: Do NAs remain in the other two columns?
length(which(is.na(store$size_level)))
[1] 1

length(which(is.na(store$Year)))
[1] 9255

Let’s deal with the missing ‘size’ values first. Passing two vectors to table() returns a pairwise count table.

table(store$size_band, store$size_level)
1 2 3 4 5
< 3,013 ft2 (280m2) 11416 0 0 0 0
15,069 < 30,138 ft2 (1,400 < 2,800 m2) 0 0 1103 0 0
3,013 < 15,069 ft2 (280 < 1,400 m2) 0 4889 0 0 0
30,138 ft2 > (2,800 m2) 0 0 0 1428 0
Over 30,138 0 0 0 0 3

size_level is an ordinal integer version of the size_band field, which stores a text description of the band. The table also shows that there are only three stores in the largest size class, but over a thousand in each of the others.

To find the missing value.

filter(store, is.na(size_level)) %>%
select(retailer, store_name, add_one, suburb, town, postcode, Year, size_band)
A tibble: 1 × 8
retailer store_name add_one suburb town postcode Year size_band
1 The Co-operative Group Co-op Welling - Wrotham Road Wrotham Road Welling London DA16 1LS 2024 NA

is.na() within the dplyr filter() selects records where the value is NA. The selected records are then ‘passed’ via the pipe ‘%>%’ to select(), which returns the named columns. There is one store that does not have a ‘size’. It only opened in 2024, so perhaps the store opening was known but the store area had yet to be established? The floor area of this store was estimated from OS MasterMap, supplied to UK universities through the Digimap service, as 820 m2, so it falls within size_level 2.

To update the record.

store$size_level[which(is.na(store$size_level))] <- 2
store$size_band[which(is.na(store$size_band))] <- "3,013 < 15,069 ft2 (280 < 1,400 m2)"

These lines look complicated but are not too difficult to understand when broken down. is.na() finds NAs, which() returns the positions of the rows that have an NA size level, while store$size_level[…] accesses the record values to be updated to the value passed in with ‘<-’.
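The same update can be written in explicit steps, which may be easier to follow; this is an illustrative equivalent of the one-line version, not additional processing.

```r
# Step 1: positions of rows where size_level is NA
na.rows <- which(is.na(store$size_level))
# Step 2: assign the estimated size level to those rows
store$size_level[na.rows] <- 2
```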

We know Year contains NAs, but the non-NA Year values should also be evaluated to see whether they are ‘reasonable’ or anomalous. Use table() to obtain the number of stores for each Year.

# useNA = "ifany" includes NAs in the table.
table(store$Year, useNA = "ifany")
1 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 <NA>
1 3 3 10 6 4 20 352 343 491 294 401 414 372 466 759 982 9255
What pattern do you see in the data, what might it indicate, and what should be done about it (if anything)?

Outlier values and abrupt changes in values may be indicative of ‘something going on’ in the data.

One store is an outlier opening in 1AD! This is clearly an error and needs to be updated to NA.

store$Year[which(store$Year == 1)] <- NA

It is also noticeable that the number of stores recorded as opening from 2009 to 2014 is far fewer than in 2015 and following years – an abrupt change. It might be that the data reflects reality; however, the Retail Points dataset was first released in 2014 and has been added to since. It is therefore more likely that the NAs arise from stores that were included in the first release because they existed then, but whose opening year was unknown. All 2015 and later store openings may be assumed to have an opening year. Interestingly, applying table() to retailer and Year reveals that most of the pre-2015 records are for Waitrose stores.

table(store$retailer, store$Year, useNA = "ifany")

1 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
Aldi 1 0 0 0 0 0 2 70 42 94 66 60 51 57
Asda 0 0 0 0 0 1 0 31 9 8 1 0 1 0
Iceland 0 0 0 0 0 0 0 1 7 11 37 54 24 16
Lidl 0 0 0 0 1 0 1 24 29 51 55 61 59 65
Marks and Spencer 0 0 0 0 0 0 2 32 125 75 41 31 17 23
Morrisons 0 0 0 0 0 0 0 7 1 0 3 3 124 121
Sainsburys 0 0 0 0 0 1 8 104 18 74 8 18 14 22
Tesco 0 0 0 0 0 0 3 68 33 21 9 3 74 32
The Co-operative Group 0 0 0 0 0 0 2 2 68 150 71 171 50 36
Waitrose 0 3 3 10 5 2 2 13 11 7 3 0 0 0

2022 2023 2024 <NA>
Aldi 44 33 28 495
Asda 2 214 263 576
Iceland 19 9 13 781
Lidl 62 24 24 554
Marks and Spencer 17 17 27 649
Morrisons 163 364 490 469
Sainsburys 11 25 24 1117
Tesco 65 47 72 2482
The Co-operative Group 83 26 40 1768
Waitrose 0 0 1 364

While there are some (probably) opportunistic pre-2015 opening years, 2014 is the ‘true baseline’ of the dataset, from which year openings seem to have been systematically collected. Update the Year values that are less than or equal to 2014 OR (the ‘|’ operator) NA to 2014.

store$Year[which(store$Year <= 2014 | is.na(store$Year))] <- 2014
# Check the output
table(store$retailer, store$Year, useNA = "ifany")
2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
Aldi 498 70 42 94 66 60 51 57 44 33 28
Asda 577 31 9 8 1 0 1 0 2 214 263
Iceland 781 1 7 11 37 54 24 16 19 9 13
Lidl 556 24 29 51 55 61 59 65 62 24 24
Marks and Spencer 651 32 125 75 41 31 17 23 17 17 27
Morrisons 469 7 1 0 3 3 124 121 163 364 490
Sainsburys 1126 104 18 74 8 18 14 22 11 25 24
Tesco 2485 68 33 21 9 3 74 32 65 47 72
The Co-operative Group 1770 2 68 150 71 171 50 36 83 26 40
Waitrose 389 13 11 7 3 0 0 0 0 0 1

The store data frame has been ‘cleaned’.

  1. Records for 10 of the 11 retailers were filtered from other records.
  2. One set of store area values was updated using external information.
  3. All pre-2015 records were assigned to the 2014 baseline year.

Summary statistics

The final data preparation stage is to sum the number and estimated area of new stores for each retailer and year, and their cumulative values. Retailer store area is estimated from size classes by assigning the mid-class area for bounded classes and the lower boundary for the largest area class.

size_level  size_band                                size_area_mid (m2)
1           < 3,013 ft2 (280m2)                      140
2           3,013 < 15,069 ft2 (280 < 1,400 m2)      840
3           15,069 < 30,138 ft2 (1,400 < 2,800 m2)   2100
4           30,138 ft2 > (2,800 m2)                  2800
5           Over 30,138                              3000
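This lookup is supplied as geolytix_size_class.csv, but an equivalent table could be built by hand with data.frame(); the column names here are assumed to match the supplied file.

```r
# Hand-built equivalent of the size class lookup (column names assumed
# to match geolytix_size_class.csv)
store.size.class <- data.frame(
  size_level = 1:5,
  size_band = c("< 3,013 ft2 (280m2)",
                "3,013 < 15,069 ft2 (280 < 1,400 m2)",
                "15,069 < 30,138 ft2 (1,400 < 2,800 m2)",
                "30,138 ft2 > (2,800 m2)",
                "Over 30,138"),
  size_area_mid = c(140, 840, 2100, 2800, 3000)
)
```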

This is quite a complex operation, so is provided as commented code that can be copied and run.

Code: Summarise by retailer and year
# geolytix_size_class.csv contains the area values that will be used
# for each area class when estimating retailer store area
store.size.class <- read_csv("geolytix_size_class.csv")
store.size.class
# Add Retailer ID to store
store$Retailer_Id <- NA
i = 1
for (s in retailer.list) {
store$Retailer_Id[which(store$retailer == s)] = names(retailer.list[i])[1]
i = i + 1
}
# Summarise by retailer, year and size_level
store.yr <- store %>%
count(Retailer_Id, Year, size_level, sort=TRUE)
# Join to size class table
store.yr <- merge(store.yr, store.size.class, by = "size_level")
# Calculate estimated new store area
store.yr$Area_M2 <- store.yr$n * store.yr$size_area_mid
# Calculate Cumulative Sum of store number and area
store.yr.n <- store.yr %>%
group_by(Retailer_Id, Year) %>%
arrange(Retailer_Id, Year) %>%
summarise(Sum_N = sum(n)) %>%
mutate(Cum_Sum_N = cumsum(Sum_N))
store.yr.a <- store.yr %>%
group_by(Retailer_Id, Year) %>%
arrange(Retailer_Id, Year) %>%
summarise(Sum_Area_M2 = sum(Area_M2)) %>%
mutate(Cum_Sum_Area_M2 = cumsum(Sum_Area_M2))
# Create 'base table' of retailer and year 2014-2024
yr <- seq(2014, 2024)
base <- expand.grid(yr,names(retailer.list))
colnames(base)[1] <- "Year"
colnames(base)[2] <- "Retailer_Id"
base<-data.frame(base)
# Join cumulative sum tables to base to create store
store.yr <- merge(base, store.yr.a, by.x = c("Retailer_Id","Year"), by.y = c("Retailer_Id","Year"), all = TRUE)
store.yr <- merge(store.yr, store.yr.n, by.x = c("Retailer_Id","Year"), by.y = c("Retailer_Id","Year"), all = TRUE)
# Fill missing Cum Sum values
store.yr<- store.yr %>%
arrange(Retailer_Id, Year) %>%
fill(Cum_Sum_Area_M2, .direction = "up")
store.yr<- store.yr%>%
arrange(Retailer_Id, Year) %>%
fill(Cum_Sum_N, .direction = "up")
# Rename columns
colnames(store.yr)[3] <- "New_Area"
colnames(store.yr)[4] <- "Total_Area"
colnames(store.yr)[5] <- "New_Store"
colnames(store.yr)[6] <- "Total_Store"
# Set New Store and New Area which are NA to zero except for Ocado
store.yr$New_Store[which(is.na(store.yr$New_Store) &
store.yr$Retailer_Id != "OCADO")] <- 0
store.yr$New_Area[which(is.na(store.yr$New_Area) &
store.yr$Retailer_Id != "OCADO")] <- 0
# Set 2014 New Store and Area to NA (baseline year, so new openings unknown)
store.yr$New_Store[which(store.yr$Year == 2014)] <- NA
store.yr$New_Area[which(store.yr$Year == 2014)] <- NA
# Set Ocado Total Store and Area to NA
store.yr$Total_Store[which(store.yr$Retailer_Id == "OCADO")] <- NA
store.yr$Total_Area[which(store.yr$Retailer_Id == "OCADO")] <- NA
# Check
store.yr
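As an aside, the Retailer_Id for loop in the code above can be replaced by a single vectorised statement; this alternative sketch produces the same column.

```r
# match() returns each retailer's position in the named vector;
# names() then returns the corresponding id code
store$Retailer_Id <- names(retailer.list)[match(store$retailer, retailer.list)]
```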

The cleaned retail points have been summarised, so let’s have a look.

Graphing

Viewing values in tabular form can be useful, especially for small, appropriately ordered tables, but is of less value for larger tables that are better comprehended in graphical form.

Setting the palette

To ensure consistent assignment of colour to retailer in all plots, a named vector is created that maps Retailer_Id to a draw colour. We shall adapt the colours in the pals::glasbey palette to create a custom palette that assigns colours (or at least hues) associated with each retailer’s brand design.

# Colours in palette
pal <- paletteer_d("pals::glasbey")

# Named vector of retailer colours
retailer.pal <- c("ALDI" = pal[1],
                   "ASDA" = pal[12],
                   "COOP" = pal[8],
                   "ICE" = pal[26],
                   "LIDL" = pal[14],
                   "M&S" = "grey40", 
                   "MORR" = pal[21],
                   "OCADO" = pal[11],
                   "SAINS" = pal[16], 
                   "TESC" = pal[2],
                   "JLP"= pal[6])

All colours except M&S are hex codes from the glasbey palette. The M&S colour is set to 60% black (‘grey40’) using its colour name, as black was difficult to distinguish from the dark green of Waitrose, and the glasbey grey is not bold enough to be easily differentiated from the other colours.

Store number and area

Let’s start with a graph that plots the number of new stores that retailers opened each year.

# New stores line plot
ggplot(store.yr, aes(x = Year, y = New_Store, colour = Retailer_Id)) +
geom_line() +
geom_point() +
ylab("New stores") +
scale_colour_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2015,2024)) +
theme_classic()

Let’s go through this line by line, with each line connected by a plus ‘+’:

  1. ggplot() is the main graphing function, which is passed the data frame store.yr and a list of aes() (aesthetics) parameters. x specifies the variable assigned to the x-axis of the graph, and y to the y-axis. The colour parameter sets the colour attribute of the drawn graphics to be determined by Retailer_Id.
  2. geom_line() plots data values using the default line symbol, coloured by the Retailer_Id ‘inherited’ from the main ggplot() function. The specific colours to use are set later.
  3. geom_point() plots values using the default point symbol, coloured by the Retailer_Id ‘inherited’ from the main ggplot() function. The specific colours to use are set later.
  4. ylab() sets the y-axis label. The default x-axis label is not changed.
  5. scale_colour_manual() matches the ggplot colour value to the named vector of colours (retailer.pal).
  6. scale_x_continuous() sets manual ticks and labels (breaks) for each year, with seq() generating a sequence of integer years to display.
  7. theme_classic() applies a simple theme to the plot.

When you create a plot you should critically evaluate:

  1. the patterns observed within the data, and
  2. how the plot design influences what pattern is discernible and what is not.

These are not isolated undertakings but more of a ‘conversation’ in which they inform each other. Look at the plot and identify ‘interesting’ patterns. But what qualifies as interesting? As a guide, look for:

  • Outliers.
  • Similarity and difference (e.g. high/low, rising/falling).
  • Abrupt changes in values and trends.

Perhaps the most obvious pattern is the opening of large numbers of Morrisons stores beginning in 2020, followed by Asda in 2023. As the majority of values are less than 100, the large number of recent openings results in most of the data being displayed in just one fifth of the y-axis, reducing the visual distinction within and between lines, exacerbated by data density and occlusion. Despite the data density, other patterns can be picked out, including Marks and Spencer peaking in 2016, and the Co-op peaking in 2017 and 2019. Tesco’s decline in new openings until 2020, from when it has increased, is less easy to see.

Various approaches might be taken to illuminate other patterns in the data. New_Store could be log() transformed, retailers displayed individually as small multiples, or the data shown as a stacked-area plot. Let’s plot the same data but as stacked areas.

# New stores area plot
ggplot(store.yr, aes(x = Year, y = New_Store, fill = Retailer_Id)) +
geom_area() +
ylab("New stores") +
scale_fill_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2015,2024)) +
theme_classic()

In the ggplot() aes() function, the area fill (colour) is set to Retailer_Id, which is then passed to geom_area() which draws the chart. The change from colour to fill requires a scale_fill_manual() statement to map palette colours. Is any new pattern revealed, and is any less prominent?

Plotting as stacked areas both separates retailers and, in doing so, reveals overall and relative growth, improving pattern comprehension; however, direct comparison of retailer values is harder. It is now apparent that all retailers except Iceland slowed opening new stores after 2018.

Copy the two plot statements and paste them below the original versions. Edit the copies to plot the new area.

Code: Plot total number of stores by retailer and year
# New store area line plot
ggplot(store.yr, aes(x = Year, y = New_Area, colour = Retailer_Id)) +
geom_line() +
geom_point() +
ylab(expression("New store area m"^2)) +
scale_colour_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2015,2024)) +
theme_classic()
# New store area area plot
ggplot(store.yr, aes(x = Year, y = New_Area, fill = Retailer_Id)) +
geom_area() +
ylab(expression("New store area m"^2)) +
scale_fill_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2015,2024)) +
theme_classic()

The code is similar to the previous plots, but the y-variable is set to New_Area not New_Store, and the units of the y-axis label are m2. Superscript characters are defined using an expression() in which the caret ‘^’ superscripts the unquoted characters that follow it.

What pattern do you see?

The large numbers of new Morrisons and Asda stores do not translate into an equivalent area difference, indicating most new stores are relatively small compared to the average. Both Aldi and Lidl greatly increased their area from 2017 until 2021-22, with Iceland also expanding in store area 2018-22.

Make a new copy of the four plot statements and edit the copies to display total stores and total area.

Code: Total stores and total area.
# Total stores line plot
ggplot(store.yr, aes(x = Year, y = Total_Store, colour = Retailer_Id)) +
geom_line() +
geom_point() +
ylab("Stores") +
scale_colour_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2014,2024)) +
theme_classic()
# Total stores area plot
ggplot(store.yr, aes(x = Year, y = Total_Store, fill = Retailer_Id)) +
geom_area() +
ylab("Stores") +
scale_fill_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2014,2024)) +
theme_classic()
# Total store area line plot
ggplot(store.yr, aes(x = Year, y = Total_Area, colour = Retailer_Id)) +
geom_line() +
geom_point() +
ylab(expression("Total store area m"^2)) +
scale_colour_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2014,2024)) +
theme_classic()
# Total store area area plot
ggplot(store.yr, aes(x = Year, y = Total_Area/1000, fill = Retailer_Id)) +
geom_area() +
ylab(expression("Total store area 1000m"^2)) +
scale_fill_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2014,2024)) +
theme_classic()

Only the line plots are shown as they are the most informative.

The total store number and area plots separate retailers into relatively distinct classes, which will be visualised in a better way below. Growth in the smaller retailers has slowed since 2020, except for M&S which, with the four largest retailers, has had a small increase in growth from 2020-21, with Morrisons exhibiting the greatest growth of the four over the longest period of time (since 2019).

You have now looked at eight plots, each of which provides a different perspective on growth in UK supermarket stores, but is it possible to display all this information in a single graph?

Plot total store number against total store area, coloured by retailer, to see how the relationship between retailer, store number, and store area changes through time.

Code: Plot total store number against total store area
# Scatter plot
ggplot(store.yr, aes(x = Total_Store, y = Total_Area/1000, colour = Retailer_Id)) +
geom_line() +
geom_point() +
xlab("Stores") +
ylab(expression("Store area 1000m"^2)) +
scale_colour_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
theme_classic()

What pattern do you see and how do you interpret it?

This plot displays total store number and area by year and retailer. As stores only open, year increases ordinally within retailers away from the origin. Slope is the ratio of store area to store number: the greater the slope, the greater the average size of new stores. Most retailers lie along the diagonal with roughly equivalent ratios; however, the Coop is an exception with more, smaller stores than other retailers. Tesco stands out as the largest retailer by both store number and area. Asda, Sainsbury’s, and Morrisons cluster with slightly larger store area ratios than Lidl, Aldi, Iceland, Marks and Spencer, and Waitrose. The slopes of Asda, Sainsbury’s, and Morrisons noticeably change, with the number of dots each side of the inflection revealing when the change occurred, Asda being the most recent to move towards the diagonal, a shift that Morrisons had already made.

Store portfolios

We have discovered that the ratio of store number to area is similar for most retailers, with the exception of the Coop, which has a smaller average store size. Of the others, the three retailers most distant from the diagonal have recently ‘re-adjusted’ their strategy to ratios closer to the mean. A deeper understanding of retailer store portfolios can be obtained from the original, unsummarised retail points.

Let’s create a barchart of the number of stores in each class.

ggplot(store, aes(x = size_level)) +
geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1) +
xlab("Store Size") +
ylab("Number of Stores") +
theme_classic()

geom_bar() plots a bar chart of the count of each size_level, which corresponds to the size class order (1 = small store, 5 = hypermarket). geom_text() adds the count above each bar.

This is a graphic version of the counts returned by table().

table(store$size_level)
1 2 3 4 5
7111 4618 1072 1371 3

This is the combined store estate of all ten retailers in 2024 but how do their store size portfolios differ?

table(store$retailer, store$size_level)
1 2 3 4 5
Aldi 1 1037 5 0 0
Asda 496 161 106 343 0
Iceland 46 926 0 0 0
Lidl 0 996 14 0 0
Marks and Spencer 455 498 91 12 0
Morrisons 1249 44 213 239 0
Sainsburys 806 132 174 332 0
Tesco 1995 212 269 430 3
The Co-operative Group 1936 513 17 0 0
Waitrose 127 99 183 15 0

There are clear differences between retailer store portfolios. For example, the Coop has many small stores, Lidl and Aldi stores are nearly all in the second size class, and the ‘big four’ have more large and small stores than mid-sized. But it is difficult to discern patterns across retailers simultaneously. A plot can do this.

ggplot(store, aes(fill = as.character(size_level), x = retailer)) +
geom_bar(position="stack") +
xlab("") +
ylab("Stores") +
scale_fill_paletteer_d("lisa::JohannJacobUlrich",
name = "Store size",
labels = store.size.class$size_band) +
theme_classic()

as.character() converts the numeric size_level to the character type required by the fill parameter. geom_bar(position = "stack") draws a stacked bar chart of counts by retailer (the x-axis). scale_fill_paletteer_d() is used as bar sections are assigned colours by size_level value. The lisa::JohannJacobUlrich palette is a sequential palette from light (small stores) to dark (large stores).

As always, what pattern do you observe in the plot?

This plot is an effective way of showing how current retailer store portfolios differ. Aldi and Lidl stores are almost exclusively in the second size class. The Coop has mostly smaller stores, and Waitrose (JLP) and Marks and Spencer have few of the largest stores. The other retailers have a range of store sizes in their portfolios, which are mostly small or large, with Asda having a larger proportion of bigger stores. This is their current portfolio, but how have retailer portfolios changed through time? It would be challenging to create a plot that shows all retailer store size classes through time, but change within a single retailer may also be visualised as a stacked bar chart.

In processing store to create store.yr, NA Year values were updated to 2014 for summarising, so records are filtered by retailer and Year greater than 2014.

r <- "Tesco"
data <- filter(store, retailer == r & Year > 2014)

Now edit a copy of the stacked bar chart code to plot new Tesco store size by year.

Code: Retailer new store sizes through time
r <- "Tesco"
data <- filter(store, retailer == r & Year > 2014)
ggplot(data, aes(fill = as.character(size_level), x = Year)) +
geom_bar(position="stack") +
ggtitle(paste("New", r, "Stores", sep = " ")) +
xlab("") +
ylab("Number of Stores") +
scale_fill_paletteer_d("lisa::JohannJacobUlrich",
name = "Store size",
labels = store.size.class$size_band) +
scale_x_continuous(breaks=seq(min(data$Year, na.rm = TRUE),
max(data$Year, na.rm = TRUE))) +
theme_classic()

In the code, change r to the name of another retailer, select the code and run it to contrast how retailers are directing growth.

Morrisons and Tesco have different patterns of new store opening. Tesco shows decreasing expansion from 2015 until 2019, before expanding again from 2020, whereas Morrisons added few new stores until 2020, when rapid growth began. Most new stores are small, but Tesco has opened a higher proportion of larger stores, except from 2018 until 2021 when the majority of its new stores were small.

Store location

The spatial distribution of stores is of interest as store density, size, and location influence retailer carbon emissions (Cachon 2014).

# Load uk shapefile
uk <- read_sf("uk.shp")
# Filter data for a retailer
r <- "Tesco"
data <- filter(store, retailer == r)
# Map stores
ggplot() +
geom_sf(data = uk) +
geom_point(data = data, aes(x = long_wgs, y = lat_wgs), size = 0.7) +
xlim(-10, 3) +
ylim(50, 61) +
theme_classic()

Tesco stores are more widely distributed across the UK than Waitrose stores, which are concentrated in the Midlands and South of England with minimal presence in Wales and Scotland.

Summary

This section explored UK supermarket retailer store portfolios using the Geolytix retail point dataset. The retail points were cleaned and summarized by retailer and year. Scatter and line plots, bar charts, and maps of these data revealed temporal, store size, and geographic patterns between retailer portfolios, factors that may influence retailer emissions and energy use. In the final part of the tutorial, the summarized store area is used to standardize reported absolute emissions and energy reported by retailers to a common area-based intensity metric. Patterns in absolute reported values and estimated intensity are explored, and the fit between reported and estimated intensity is assessed.

References

Cachon, G. P. (2014). Retail Store Density and the Cost of Greenhouse Gas Emissions. Management Science, 60(8), 1907–1925. http://www.jstor.org/stable/42919643