UKSmktComp: Retailer stores

Introduction

UK supermarket retailer emissions and energy consumption depend on 'size', which may be measured by turnover or an operational measure such as store floor area. However, emission and energy intensity reporting is sporadic, and absolute metrics are standardised using different denominators, limiting direct comparison. The Geolytix Supermarket Retail Points [downloaded 9/24] dataset has tracked UK supermarkets since 2014, recording their location, date of opening, and size class, among other attributes. In this section you will use R to process and explore the data, clean it, and then summarise it by retailer and year, ready for scaling absolute reported emissions and energy use in the final section. Analysis is limited to the UK's eleven largest supermarket retailers: Aldi, Asda, Coop, Iceland, Lidl, Marks and Spencer, Morrisons, Ocado, Sainsbury's, Tesco, and Waitrose.

Download and unzip UKSmktComp.zip.

Start RStudio.

R basics

In this section, the required libraries are installed and data frames explained.

Libraries

The tutorial uses eight libraries that must be installed. Libraries are packages of code that provide additional functionality. Installation only needs to be done once, so if you have previously installed any of the libraries you should not need to download them again.

Select Tools, then Install Packages…, then enter a comma-separated list of the packages you wish to install.

dplyr, readr, tidyr, ggplot2, paletteer, sf, ggrepel, ggpubr

Now open a new R script and add the code to load the required libraries. In R hash ‘#’ denotes a comment and all text after the hash on a line is ignored.

# Load Libraries
library(readr) # File reading/writing
library(dplyr) # Data processing
library(tidyr) # Data processing
library(ggplot2) # Graphing
library(paletteer) # Colour palettes
library(sf) # Spatial data
library(ggrepel) # Text positioning
library(ggpubr) # ggplot functions

Select the code lines and Run to execute the code.

The ‘working directory/folder’ is the default location where R looks for and saves files. You will need to edit this to the UKSmktComp folder. getwd() returns the current working directory path, while setwd() sets the path.

# Get and set the working directory
getwd()
# Edit the path to where you have stored the supplied data, e.g.
setwd("D:/UKSmktComp")

Now use the readr library read_csv() function to import the Retail Points CSV file.

# Import CSV file from Working Directory
store <- read_csv("geolytix_retailpoints_v34_202412.csv")

Data frame

To understand data you need to understand what it represents, how it was collected, how it is structured, and the distribution of values within it. The most common way to structure data is as a table in which columns are attributes and rows are records. The CSV format stores a table as plain text, and read_csv() loads the data into a different table data structure in R called a ‘data frame’, stored in the variable named store.

# Is the variable a data frame?
is.data.frame(store)
[1] TRUE
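To see the structure in miniature, a small data frame can be built by hand with data.frame(); this toy example is purely illustrative (the store counts are taken from the retailer counts later in this section).

```r
# A small illustrative data frame: each argument is a named column
# (attribute) and each row across the columns is a record
toy <- data.frame(retailer = c("Tesco", "Lidl"),
                  stores = c(2909, 1010))
is.data.frame(toy)
```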

To further explore the table structure, type at the command prompt.

# Number of columns in data frame
ncol(store)
[1] 19
# Data frame column names
colnames(store)
[1] "id" "retailer" "fascia" "store_name" "add_one" "add_two" "town" "suburb"
[9] "postcode" "long_wgs" "lat_wgs" "bng_e" "bng_n" "pqi" "open_date" "size_band"
[17] "county" "Year" "size_level"
# Number of rows/records
nrow(store)
[1] 18840
# Row names, by default a uniquely assigned ascending integer
rownames(store)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" ...

To obtain a summary report of table structure and values.

# A summary report of columns and their content
summary(store)
id retailer fascia store_name
Min. :1.01e+09 Length:18840 Length:18840 Length:18840
1st Qu.:1.01e+09 Class :character Class :character Class :character
Median :1.01e+09 Mode :character Mode :character Mode :character
Mean :1.01e+09
3rd Qu.:1.01e+09
Max. :1.01e+09
add_one add_two town suburb
Length:18840 Length:18840 Length:18840 Length:18840
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
postcode long_wgs lat_wgs bng_e
Length:18840 Min. :-8.0899 Min. :49.17 Min. : 5123
Class :character 1st Qu.:-2.8045 1st Qu.:51.50 1st Qu.:346908
Mode :character Median :-1.5874 Median :52.50 Median :427048
Mean :-1.7833 Mean :52.81 Mean :416710
3rd Qu.:-0.4016 3rd Qu.:53.71 3rd Qu.:509612
Max. : 1.7529 Max. :60.40 Max. :654949
bng_n pqi open_date size_band
Min. : -81178 Length:18840 Min. :18559999 Length:18840
1st Qu.: 179310 Class :character 1st Qu.:20179999 Class :character
Median : 290230 Mode :character Median :20210916 Mode :character
Mean : 324690 Mean :20206981
3rd Qu.: 424197 3rd Qu.:20231025
Max. :1168234 Max. :20249999
NA's :12763
county Year size_level
Length:18840 Min. : 1 Min. :1.000
Class :character 1st Qu.:2017 1st Qu.:1.000
Mode :character Median :2021 Median :1.000
Mean :2020 Mean :1.605
3rd Qu.:2023 3rd Qu.:2.000
Max. :2024 Max. :5.000

Check that the column data types are ‘correct’, as malformed data (e.g. a single character value in a field that ostensibly stores numeric data) may result in the ‘wrong’ column type in R. The column type would be ‘correct’ as far as R is concerned, but ‘wrong’ from our perspective. Look at the summary statistics for each column to get a feel for their ranges, distributions, and whether they contain any unknown ‘null’ values encoded as ‘NA’. 12,763 Year and 1 size_level values out of the 18,840 total are null.

Type the name of a variable at the prompt to view the first rows.

# Type the name of a variable at the prompt to view the first rows.
store

# A tibble: 18,840 × 19
id retailer fascia store_name add_one add_two town suburb postcode long_wgs lat_wgs bng_e bng_n pqi
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 1010008812 The Co-opera… The C… Co-op Hol… Holyhe… Chirk Chirk NA LL14 5NA -3.06 52.9 3.29e5 3.39e5 Roof…
2 1010021060 Spar Spar Spar Glyn… Cross … NA Glyn… NA LL20 7EH -3.19 52.9 3.20e5 3.38e5 Roof…
3 1010007877 The Co-opera… The C… Co-op Bla… High S… Blaena… Blae… Blaen… LL41 3AG -3.93 53.0 2.70e5 3.46e5 Roof…
4 1010016823 Spar Spar Spar - Cy… Cynwyd… Statio… Cynw… NA LL21 0LA -3.41 53.0 3.06e5 3.41e5 Roof…
5 1010010459 Tesco Tesco Tesco Wre… Plas K… Off We… Cefn… NA LL14 3AT -3.07 53.0 3.28e5 3.42e5 Roof…
6 1010007420 Lidl Lidl Lidl Newt… Llanid… NA Newt… NA SY16 1ET -3.32 52.5 3.11e5 2.91e5 Roof…
7 1010004399 Tesco Tesco Tesco Wel… Smithf… NA Wels… NA SY21 7BL -3.14 52.7 3.23e5 3.07e5 Roof…
8 1010022270 Morrisons Morri… Morrisons… Buttin… NA Butt… NA SY21 8SL -3.12 52.7 3.24e5 3.09e5 Roof…
9 1010017041 Spar Spar Spar - Ll… The Ma… Oswest… Llan… NA SY10 0JG -3.30 52.8 3.12e5 3.26e5 Roof…
10 1010016321 Spar Spar Spar - Mo… Broad … NA Mont… NA SY15 6PN -3.15 52.6 3.22e5 2.96e5 Roof…
# ℹ 18,830 more rows
# ℹ 5 more variables: open_date <dbl>, size_band <chr>, county <chr>, Year <dbl>, size_level <dbl>
# ℹ Use `print(n = ...)` to see more rows

To see the first 50 rows.

print(store, n = 50)

Columns are accessed using the ‘$’ character after the variable name.

store$retailer

unique() returns a vector of unique values.

# Unique values in the retailer column
unique(store$retailer)

[1] "The Co-operative Group" "Spar"
[3] "Tesco" "Lidl"
[5] "Morrisons" "Aldi"
[7] "Budgens" "Sainsburys"
[9] "Marks and Spencer" "Farmfoods"
[11] "Iceland" "Waitrose"
[13] "Asda" "Heron"
[15] "Makro" "Costco"
[17] "Cook" "The Southern Co-operative"
[19] "Booths" "Central England Co-operative"
[21] "Scottish Midland Co-operative" "Langdale Co-operative Society"
[23] "Clydebank Co-operative" "The Radstock Co-operative Society"
[25] "Midcounties Co-operative" "Channel Islands Co-operative Society"
[27] "Amazon" "Heart of England Co-operative"
[29] "Tamworth Co-operative Society" "Lincolnshire Co-operative"
[31] "Dunnes Stores" "East of England Co-operative"
[33] "Chelmsford Star Co-operative Society" "Whole Foods Market"
[35] "Planet Organic" "Coniston Co-operative Society"
[37] "Allendale Co-operative Society"

Filter

There are 37 retailers, but for simplicity only the eleven largest UK retailers will be analysed, one of which, Ocado, is an online supermarket with no stores. To simplify long retailer names, they are mapped to a shorter id code in a new Retailer_Id column. This is the code used to identify retailers in the emission dataset to which the data will later be joined. A named vector maps pairs of names and values.

# c(name = value) creates a named vector
retailer.list <- c("ALDI" = "Aldi",
"ASDA" = "Asda",
"COOP" = "The Co-operative Group",
"ICE" = "Iceland",
"LIDL" = "Lidl",
"M&S" = "Marks and Spencer",
"MORR" = "Morrisons",
"OCADO" = "Ocado",
"SAINS" = "Sainsburys",
"TESC" = "Tesco",
"JLP"= "Waitrose"
)
names(retailer.list)
[1] "ALDI"  "ASDA"  "COOP"  "ICE"   "LIDL"  "M&S"   "MORR"  "OCADO" "SAINS" "TESC"  "JLP"
Use [x], where x is the item number, to return individual names and values.
names(retailer.list)[3]
[1] "COOP"
retailer.list[3]
                    COOP
"The Co-operative Group"
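Elements of a named vector can also be retrieved by name rather than position; a brief illustrative aside:

```r
# The named vector from above
retailer.list <- c("ALDI" = "Aldi", "TESC" = "Tesco")
# Single brackets keep the name attached to the returned value
retailer.list["TESC"]
# Double brackets return the bare value
retailer.list[["TESC"]]  # "Tesco"
```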

Use the dplyr filter() function to subset the rows in store to those with a retailer that matches any of the values in retailer.list.

nrow(store)
[1] 18840
# Filter
store <- filter(store, retailer %in% retailer.list)
nrow(store)
[1] 14176
# Percent of retail points belonging to the 11 retailers
nrow(store)/18840 * 100
[1] 75.24416

75.24% of the retail points records are owned by the 11 retailers, considerably less than their share of the grocery market.

table() counts the records with matching values in one or more vector columns.

table(store$retailer)

Aldi Asda Iceland
1043 1106 972
Lidl Marks and Spencer Morrisons
1010 1056 1745
Sainsburys Tesco The Co-operative Group
1444 2909 2467
Waitrose
424

Tesco has the highest number of stores, followed by the Co-operative Group. Ocado is an online retailer without stores.

NAs

There were NAs in Year, size_band, and size_level columns but are they in the filtered data set? Count the number of NAs in the size_band column.

length(which(is.na(store$size_band)))
[1] 1

‘$’ signifies the name of a data frame column. is.na() returns whether each value in a vector (the column) is NA or not. which() returns the positions of values in a vector that match a criterion (is.na == TRUE), and length() returns the number of items in the vector. ‘Stringing’ multiple functions into a single statement is an efficient way of writing code, but the more stringing is done the harder the code is to understand. Do NAs remain in the other two columns?
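As an aside, there is a shorter equivalent idiom: is.na() returns a logical vector, and TRUE counts as 1 when summed, so sum() counts the NAs directly.

```r
# Equivalent to length(which(is.na(...))): sum the TRUEs
sum(is.na(store$size_band))
# The same idea on a small vector
sum(is.na(c(1, NA, 3, NA)))  # 2
```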

Code: Do NAs remain in the other two columns?
length(which(is.na(store$size_level)))
[1] 1

length(which(is.na(store$Year)))
[1] 9255

Let’s deal with the missing ‘size’ values first. Passing two vectors to table() returns a pairwise count table.

table(store$size_band, store$size_level)
1 2 3 4 5
< 3,013 ft2 (280m2) 11416 0 0 0 0
15,069 < 30,138 ft2 (1,400 < 2,800 m2) 0 0 1103 0 0
3,013 < 15,069 ft2 (280 < 1,400 m2) 0 4889 0 0 0
30,138 ft2 > (2,800 m2) 0 0 0 1428 0
Over 30,138 0 0 0 0 3

size_level is an ordinal integer version of the size_band field, which stores a text description of the band. The table also shows that there are only three stores in the largest size class, but over a thousand in each of the others.

To find the missing value.

filter(store, is.na(size_level)) %>%
select(retailer, store_name, add_one, suburb, town, postcode, Year, size_band)
A tibble: 1 × 8
retailer store_name add_one suburb town postcode Year size_band
1 The Co-operative Group Co-op Welling - Wrotham Road Wrotham Road Welling London DA16 1LS 2024 NA

is.na() within the dplyr filter() selects records where the value is NA. The selected records are then ‘passed’ via the pipe ‘%>%’ to select(), which returns the named columns. There is one store that does not have a ‘size’. It only opened in 2024, so perhaps the store opening was known but the store area had yet to be established? The floor area of this store was estimated from OS MasterMap, supplied to UK universities through the Digimap service, as 820 m2, so it falls within size_level 2.

To update the record.

store$size_level[which(is.na(store$size_level))] <- 2
store$size_band[which(is.na(store$size_band))] <- "3,013 < 15,069 ft2 (280 < 1,400 m2)"

These lines look complicated but are not too difficult to understand when broken down. is.na() finds NAs, which() returns the positions of the rows that have an NA size level, while store$size_level[…] accesses the record values to be updated to the value passed in with ‘<-’.
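The same update can be written in explicit steps, which may be easier to follow; this is an illustrative equivalent of the one-line version, not additional processing.

```r
# Step 1: positions of rows where size_level is NA
na.rows <- which(is.na(store$size_level))
# Step 2: assign the estimated size level to those rows
store$size_level[na.rows] <- 2
```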

We know Year contains NAs, but the non-NA Year values should also be evaluated to see whether they are ‘reasonable’ or anomalous. Use table() to obtain the number of stores for each Year.

# useNA = "ifany" includes NAs in the table.
table(store$Year, useNA = "ifany")
1 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 <NA>
1 3 3 10 6 4 20 352 343 491 294 401 414 372 466 759 982 9255
What pattern do you see in the data, what might it indicate, and what should be done about it (if anything)?

Outlier values and abrupt changes in values may be indicative of ‘something going on’ in the data.

One store is an outlier opening in 1AD! This is clearly an error and needs to be updated to NA.

store$Year[which(store$Year == 1)] <- NA

It is also noticeable that the number of stores recorded as opening from 2009 to 2014 is far fewer than in 2015 and following years – an abrupt change. It might be that the data reflects reality; however, the Retail Points dataset was first released in 2014 and has been added to since. It is therefore more likely that the NAs arise from stores that were included in the first release because they existed then, but whose opening year was unknown. All 2015 and later store openings may be assumed to have an opening year. Interestingly, applying table() to retailer and Year reveals that most of the pre-2015 records are for Waitrose stores.

table(store$retailer, store$Year, useNA = "ifany")

1 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
Aldi 1 0 0 0 0 0 2 70 42 94 66 60 51 57
Asda 0 0 0 0 0 1 0 31 9 8 1 0 1 0
Iceland 0 0 0 0 0 0 0 1 7 11 37 54 24 16
Lidl 0 0 0 0 1 0 1 24 29 51 55 61 59 65
Marks and Spencer 0 0 0 0 0 0 2 32 125 75 41 31 17 23
Morrisons 0 0 0 0 0 0 0 7 1 0 3 3 124 121
Sainsburys 0 0 0 0 0 1 8 104 18 74 8 18 14 22
Tesco 0 0 0 0 0 0 3 68 33 21 9 3 74 32
The Co-operative Group 0 0 0 0 0 0 2 2 68 150 71 171 50 36
Waitrose 0 3 3 10 5 2 2 13 11 7 3 0 0 0

2022 2023 2024 <NA>
Aldi 44 33 28 495
Asda 2 214 263 576
Iceland 19 9 13 781
Lidl 62 24 24 554
Marks and Spencer 17 17 27 649
Morrisons 163 364 490 469
Sainsburys 11 25 24 1117
Tesco 65 47 72 2482
The Co-operative Group 83 26 40 1768
Waitrose 0 0 1 364

While there are some (probably) opportunistic pre-2015 opening years, 2014 is the ‘true baseline’ of the dataset, from which year openings seem to have been systematically collected. Update the Year values that are less than or equal to 2014 OR (the ‘|’ operator) NA to 2014.

store$Year[which(store$Year <= 2014 | is.na(store$Year))] <- 2014
# Check the output
table(store$retailer, store$Year, useNA = "ifany")
2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
Aldi 498 70 42 94 66 60 51 57 44 33 28
Asda 577 31 9 8 1 0 1 0 2 214 263
Iceland 781 1 7 11 37 54 24 16 19 9 13
Lidl 556 24 29 51 55 61 59 65 62 24 24
Marks and Spencer 651 32 125 75 41 31 17 23 17 17 27
Morrisons 469 7 1 0 3 3 124 121 163 364 490
Sainsburys 1126 104 18 74 8 18 14 22 11 25 24
Tesco 2485 68 33 21 9 3 74 32 65 47 72
The Co-operative Group 1770 2 68 150 71 171 50 36 83 26 40
Waitrose 389 13 11 7 3 0 0 0 0 0 1

The store data frame has been ‘cleaned’.

  1. Records for 10 of the 11 retailers were filtered from other records.
  2. One set of store area values was updated using external information.
  3. All pre-2015 records were assigned to the 2014 baseline year.

Summary statistics

The final data preparation stage is to sum the number and estimated area of new stores for each retailer and year, and their cumulative values. Retailer store area is estimated from size classes by assigning the mid-class area for bounded classes and the lower boundary for the largest area class.

size_level  size_band                                size_area_mid (m2)
1           < 3,013 ft2 (280m2)                      140
2           3,013 < 15,069 ft2 (280 < 1,400 m2)      840
3           15,069 < 30,138 ft2 (1,400 < 2,800 m2)   2100
4           30,138 ft2 > (2,800 m2)                  2800
5           Over 30,138                              3000
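This lookup is supplied as geolytix_size_class.csv, but an equivalent table could be built by hand with data.frame(); the column names here are assumed to match the supplied file.

```r
# Hand-built equivalent of the size class lookup (column names assumed
# to match geolytix_size_class.csv)
store.size.class <- data.frame(
  size_level = 1:5,
  size_band = c("< 3,013 ft2 (280m2)",
                "3,013 < 15,069 ft2 (280 < 1,400 m2)",
                "15,069 < 30,138 ft2 (1,400 < 2,800 m2)",
                "30,138 ft2 > (2,800 m2)",
                "Over 30,138"),
  size_area_mid = c(140, 840, 2100, 2800, 3000)
)
```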

This is quite a complex operation, so is provided as commented code that can be copied and run.

Code: Summarise by retailer and year
# geolytix_size_class.csv contains the area values that will be used
# for each area class when estimating retailer store area
store.size.class <- read_csv("geolytix_size_class.csv")
store.size.class
# Add Retailer ID to store
store$Retailer_Id <- NA
i = 1
for (s in retailer.list) {
store$Retailer_Id[which(store$retailer == s)] = names(retailer.list[i])[1]
i = i + 1
}
# Summarise by retailer, year and size_level
store.yr <- store %>%
count(Retailer_Id, Year, size_level, sort=TRUE)
# Join to size class table
store.yr <- merge(store.yr, store.size.class, by = "size_level")
# Calculate estimated new store area
store.yr$Area_M2 <- store.yr$n * store.yr$size_area_mid
# Calculate Cumulative Sum of store number and area
store.yr.n <- store.yr %>%
group_by(Retailer_Id, Year) %>%
arrange(Retailer_Id, Year) %>%
summarise(Sum_N = sum(n)) %>%
mutate(Cum_Sum_N = cumsum(Sum_N))
store.yr.a <- store.yr %>%
group_by(Retailer_Id, Year) %>%
arrange(Retailer_Id, Year) %>%
summarise(Sum_Area_M2 = sum(Area_M2)) %>%
mutate(Cum_Sum_Area_M2 = cumsum(Sum_Area_M2))
# Create 'base table' of retailer and year 2014-2024
yr <- seq(2014, 2024)
base <- expand.grid(yr,names(retailer.list))
colnames(base)[1] <- "Year"
colnames(base)[2] <- "Retailer_Id"
base<-data.frame(base)
# Join cumulative sum tables to base to create store
store.yr <- merge(base, store.yr.a, by.x = c("Retailer_Id","Year"), by.y = c("Retailer_Id","Year"), all = TRUE)
store.yr <- merge(store.yr, store.yr.n, by.x = c("Retailer_Id","Year"), by.y = c("Retailer_Id","Year"), all = TRUE)
# Fill missing Cum Sum values
store.yr<- store.yr %>%
arrange(Retailer_Id, Year) %>%
fill(Cum_Sum_Area_M2, .direction = "up")
store.yr<- store.yr%>%
arrange(Retailer_Id, Year) %>%
fill(Cum_Sum_N, .direction = "up")
# Rename columns
colnames(store.yr)[3] <- "New_Area"
colnames(store.yr)[4] <- "Total_Area"
colnames(store.yr)[5] <- "New_Store"
colnames(store.yr)[6] <- "Total_Store"
# Set New Store and New Area which are NA to zero except for Ocado
store.yr$New_Store[which(is.na(store.yr$New_Store) &
store.yr$Retailer_Id != "OCADO")] <- 0
store.yr$New_Area[which(is.na(store.yr$New_Area) &
store.yr$Retailer_Id != "OCADO")] <- 0
# Set 2014 New Store and Area to NA (baseline year, so new openings unknown)
store.yr$New_Store[which(store.yr$Year == 2014)] <- NA
store.yr$New_Area[which(store.yr$Year == 2014)] <- NA
# Set Ocado Total Store and Area to NA
store.yr$Total_Store[which(store.yr$Retailer_Id == "OCADO")] <- NA
store.yr$Total_Area[which(store.yr$Retailer_Id == "OCADO")] <- NA
# Check
store.yr
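As an aside, the Retailer_Id for loop in the code above can be replaced by a single vectorised statement; this alternative sketch produces the same column.

```r
# match() returns each retailer's position in the named vector;
# names() then returns the corresponding id code
store$Retailer_Id <- names(retailer.list)[match(store$retailer, retailer.list)]
```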

The cleaned retail points have been summarised, so let’s have a look.

Graphing

Viewing values in tabular form can be useful, especially for small, appropriately ordered tables, but is of less value for larger tables that are better comprehended in graphical form.

Setting the palette

To ensure consistent assignment of colour to retailer in all plots, a named vector is created that maps Retailer_Id to a draw colour. We shall adapt the colours in the pals::glasbey palette to create a custom palette that assigns colours (or at least hues) associated with each retailer’s brand design.

# Colours in palette
pal <- paletteer_d("pals::glasbey")

# Named vector of retailer colours
retailer.pal <- c("ALDI" = pal[1],
                   "ASDA" = pal[12],
                   "COOP" = pal[8],
                   "ICE" = pal[26],
                   "LIDL" = pal[14],
                   "M&S" = "grey40", 
                   "MORR" = pal[21],
                   "OCADO" = pal[11],
                   "SAINS" = pal[16], 
                   "TESC" = pal[2],
                   "JLP"= pal[6])

All colours except M&S are hex codes from the glasbey palette. The M&S colour is set to 60% black (‘grey40’) using its colour name, as black was difficult to distinguish from the dark green of Waitrose, and the glasbey grey is not bold enough to be easily differentiated from the other colours.

Store number and area

Let’s start with a graph that plots the number of new stores that retailers opened each year.

# New stores line plot
ggplot(store.yr, aes(x = Year, y = New_Store, colour = Retailer_Id)) +
geom_line() +
geom_point() +
ylab("New stores") +
scale_colour_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2015,2024)) +
theme_classic()

Let’s go through this line by line, with each line connected by a plus ‘+’:

  1. ggplot() is the main graphing function, which is passed the data frame store.yr and a list of aes() (aesthetics) parameters. x specifies the variable assigned to the x-axis of the graph, and y to the y-axis. The colour parameter sets the colour attribute of the drawn graphics to be determined by Retailer_Id.
  2. geom_line() plots data values using the default line symbol, coloured by the Retailer_Id ‘inherited’ from the main ggplot() function. The specific colours to use are set later.
  3. geom_point() plots values using the default point symbol, coloured by the Retailer_Id ‘inherited’ from the main ggplot() function. The specific colours to use are set later.
  4. ylab() sets the y-axis label. The default x-axis label is not changed.
  5. scale_colour_manual() matches the ggplot colour value to the named vector of colours (retailer.pal).
  6. scale_x_continuous() sets manual ticks and labels (breaks) for each year, with seq() generating a sequence of integer years to display.
  7. theme_classic() applies a simple theme to the plot.

When you create a plot you should critically evaluate:

  1. the patterns observed within the data, and
  2. how the plot design influences what pattern is discernible and what is not.

These are not isolated undertakings but more of a ‘conversation’ in which they inform each other. Look at the plot and identify ‘interesting’ patterns. But what qualifies as interesting? As a guide, look for:

  • Outliers.
  • Similarity and difference (e.g. high/low, rising/falling).
  • Abrupt changes in values and trends.

Perhaps the most obvious pattern is the opening of large numbers of Morrisons stores beginning in 2020, followed by Asda in 2023. As the majority of values are less than 100, the large number of recent openings results in most of the data being displayed in just one fifth of the y-axis, reducing the visual distinction within and between lines, exacerbated by data density and occlusion. Despite the data density, other patterns can be picked out, including Marks and Spencer peaking in 2016, and the Co-op peaking in 2017 and 2019. Tesco’s decline in new openings until 2020, from when it has increased, is less easy to see.

Various approaches might be taken to illuminate other patterns in the data. New_Store could be log() transformed, retailers displayed individually as small multiples, or the data shown as a stacked-area plot. Let’s plot the same data but as stacked areas.

# New stores area plot
ggplot(store.yr, aes(x = Year, y = New_Store, fill = Retailer_Id)) +
geom_area() +
ylab("New stores") +
scale_fill_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2015,2024)) +
theme_classic()

In the ggplot() aes() function, the area fill (colour) is set to Retailer_Id, which is then passed to geom_area() which draws the chart. The change from colour to fill requires a scale_fill_manual() statement to map palette colours. Is any new pattern revealed, and is any less prominent?

Plotting as stacked areas both separates retailers and, in doing so, reveals overall and relative growth, improving pattern comprehension; however, direct comparison of retailer values is harder. It is now apparent that all retailers except Iceland slowed opening new stores after 2018.

Copy the two plot statements and paste them below the original versions. Edit the copies to plot the new area.

Code: Plot total number of stores by retailer and year
# New store area line plot
ggplot(store.yr, aes(x = Year, y = New_Area, colour = Retailer_Id)) +
geom_line() +
geom_point() +
ylab(expression("New store area m"^2)) +
scale_colour_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2015,2024)) +
theme_classic()
# New store area area plot
ggplot(store.yr, aes(x = Year, y = New_Area, fill = Retailer_Id)) +
geom_area() +
ylab(expression("New store area m"^2)) +
scale_fill_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2015,2024)) +
theme_classic()

The code is similar to the previous plots, but the y-variable is set to New_Area not New_Store, and the units of the y-axis label are m2. Superscript characters are defined using an expression() in which the caret ‘^’ superscripts the unquoted characters that follow it.

What pattern do you see?

The large numbers of new Morrisons and Asda stores do not translate into an equivalent area difference, indicating most new stores are relatively small compared to the average. Both Aldi and Lidl greatly increased their area from 2017 until 2021-22, with Iceland also expanding in store area 2018-22.

Make a new copy of the four plot statements and edit the copies to display total stores and total area.

Code: Total stores and total area.
# Total stores line plot
ggplot(store.yr, aes(x = Year, y = Total_Store, colour = Retailer_Id)) +
geom_line() +
geom_point() +
ylab("Stores") +
scale_colour_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2014,2024)) +
theme_classic()
# Total stores area plot
ggplot(store.yr, aes(x = Year, y = Total_Store, fill = Retailer_Id)) +
geom_area() +
ylab("Stores") +
scale_fill_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2014,2024)) +
theme_classic()
# Total store area line plot
ggplot(store.yr, aes(x = Year, y = Total_Area, colour = Retailer_Id)) +
geom_line() +
geom_point() +
ylab(expression("Total store area m"^2)) +
scale_colour_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2014,2024)) +
theme_classic()
# Total store area area plot
ggplot(store.yr, aes(x = Year, y = Total_Area/1000, fill = Retailer_Id)) +
geom_area() +
ylab(expression("Total store area 1000m"^2)) +
scale_fill_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
scale_x_continuous(breaks=seq(2014,2024)) +
theme_classic()

Only the line plots are shown as they are the most informative.

The total store number and area plots separate retailers into relatively distinct classes, which will be visualised in a better way below. Growth in the smaller retailers has slowed since 2020, except for M&S which, with the four largest retailers, has had a small increase in growth from 2020-21, with Morrisons exhibiting the greatest growth of the four over the longest period of time (since 2019).

You have now looked at eight plots, each of which provides a different perspective on growth in UK supermarket stores, but is it possible to display all this information in a single graph?

Plot total store number against total store area, coloured by retailer, to see how the relationship between retailer, store number, and store area changes through time.

Code: Plot total store number against total store area
# Scatter plot
ggplot(store.yr, aes(x = Total_Store, y = Total_Area/1000, colour = Retailer_Id)) +
geom_line() +
geom_point() +
xlab("Stores") +
ylab(expression("Store area 1000m"^2)) +
scale_colour_manual(name = "Retailer", values = retailer.pal, labels = retailer.list) +
theme_classic()

What pattern do you see and how do you interpret it?

This plot displays total store number and area by year and retailer. As stores only open, year increases ordinally within retailers away from the origin. Slope is the ratio of store area to store number: the greater the slope, the greater the average size of new stores. Most retailers lie along the diagonal with roughly equivalent ratios; however, the Coop is an exception with more, smaller stores than other retailers. Tesco stands out as the largest retailer by both store number and area. Asda, Sainsbury’s, and Morrisons cluster with slightly larger store area ratios than Lidl, Aldi, Iceland, Marks and Spencer, and Waitrose. The slopes of Asda, Sainsbury’s, and Morrisons noticeably change, with the number of dots each side of the inflection revealing when the change occurred, Asda being the most recent to move towards the diagonal, a shift that Morrisons had already made.

Store portfolios

We have discovered that the ratio of store number to area is similar for most retailers, with the exception of the Coop, which has a smaller average store size. Of the others, the three retailers most distant from the diagonal have recently ‘re-adjusted’ their strategy to ratios closer to the mean. A deeper understanding of retailer store portfolios can be obtained from the original, unsummarised retail points.

Let’s create a barchart of the number of stores in each class.

ggplot(store, aes(x = size_level)) +
geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1) +
xlab("Store Size") +
ylab("Number of Stores") +
theme_classic()

geom_bar() plots a bar chart of the count of each size_level, which corresponds to the size class order (1 = small store, 5 = hypermarket). geom_text() adds the count above each bar.

This is a graphic version of the counts returned by table().

table(store$size_level)
1 2 3 4 5
7111 4618 1072 1371 3

This is the combined store estate of all ten retailers in 2024 but how do their store size portfolios differ?

table(store$retailer, store$size_level)
1 2 3 4 5
Aldi 1 1037 5 0 0
Asda 496 161 106 343 0
Iceland 46 926 0 0 0
Lidl 0 996 14 0 0
Marks and Spencer 455 498 91 12 0
Morrisons 1249 44 213 239 0
Sainsburys 806 132 174 332 0
Tesco 1995 212 269 430 3
The Co-operative Group 1936 513 17 0 0
Waitrose 127 99 183 15 0

There are clear differences between retailer store portfolios. For example, the Coop has many small stores, Lidl and Aldi stores are nearly all in the second size class, and the ‘big four’ have more large and small stores than mid-sized. But it is difficult to discern patterns across retailers simultaneously. A plot can do this.

ggplot(store, aes(fill = as.character(size_level), x = retailer)) +
geom_bar(position="stack") +
xlab("") +
ylab("Stores") +
scale_fill_paletteer_d("lisa::JohannJacobUlrich",
name = "Store size",
labels = store.size.class$size_band) +
theme_classic()

as.character() converts the numeric size_level to the character type required by the fill parameter. geom_bar(position = "stack") draws a stacked bar chart of counts by retailer (the x-axis). scale_fill_paletteer_d() is used as bar sections are assigned colours by size_level value. The lisa::JohannJacobUlrich palette is a sequential palette from light (small stores) to dark (large stores).

As always, what pattern do you observe in the plot?

This plot is an effective way of showing how current retailer store portfolios differ. Aldi and Lidl stores are almost exclusively in the second size class. The Coop has mostly smaller stores, and Waitrose (JLP) and Marks and Spencer have few of the largest stores. The other retailers have a range of store sizes in their portfolios, which are mostly small or large, with Asda having a larger proportion of bigger stores. This is their current portfolio, but how have retailer portfolios changed through time? It would be challenging to create a plot that shows all retailer store size classes through time, but change within a single retailer may also be visualised as a stacked bar chart.

In processing store to create store.yr, NA Year values were updated to 2014 for summarising, so records are filtered by retailer and Year greater than 2014.

r <- "Tesco"
data <- filter(store, retailer == r & Year > 2014)

Now edit a copy of the stacked bar chart code to plot new Tesco store size by year.

Code: Retailer new store sizes through time
r <- "Tesco"
data <- filter(store, retailer == r & Year > 2014)
ggplot(data, aes(fill = as.character(size_level), x = Year)) +
geom_bar(position="stack") +
ggtitle(paste("New", r, "Stores", sep = " ")) +
xlab("") +
ylab("Number of Stores") +
scale_fill_paletteer_d("lisa::JohannJacobUlrich",
name = "Store size",
labels = store.size.class$size_band) +
scale_x_continuous(breaks=seq(min(data$Year, na.rm = TRUE),
max(data$Year, na.rm = TRUE))) +
theme_classic()

In the code, change r to the name of another retailer, select the code and run it to contrast how retailers are directing growth.

Morrisons and Tesco have different patterns of new store opening. Tesco shows decreasing expansion from 2015 until 2019, before expanding again from 2020, whereas Morrisons added few new stores until 2020, when rapid growth began. Most new stores are small, but Tesco has opened a higher proportion of larger stores, except from 2018 until 2021 when the majority of its new stores were small.

Store location

The spatial distribution of stores is of interest as store density, size, and location influence retailer carbon emissions (Cachon 2014).

# Load uk shapefile
uk <- read_sf("uk.shp")
# Filter data for a retailer
r <- "Tesco"
data <- filter(store, retailer == r)
# Map stores
ggplot() +
geom_sf(data = uk) +
geom_point(data = data, aes(x = long_wgs, y = lat_wgs), size = 0.7) +
xlim(-10, 3) +
ylim(50, 61) +
theme_classic()

Tesco stores are more widely distributed across the UK than Waitrose stores, which are concentrated in the Midlands and South of England with minimal presence in Wales and Scotland.

Summary

This section explored UK supermarket retailer store portfolios using the Geolytix retail point dataset. The retail points were cleaned and summarized by retailer and year. Scatter and line plots, bar charts, and maps of these data revealed temporal, store size, and geographic patterns between retailer portfolios, factors that may influence retailer emissions and energy use. In the final part of the tutorial, the summarized store area is used to standardize reported absolute emissions and energy reported by retailers to a common area-based intensity metric. Patterns in absolute reported values and estimated intensity are explored, and the fit between reported and estimated intensity is assessed.

References

Cachon, G. P. (2014). Retail Store Density and the Cost of Greenhouse Gas Emissions. Management Science, 60(8), 1907–1925. http://www.jstor.org/stable/42919643