LONDON DATASTORE: Recommendations, workflows and challenges from a data analyst’s perspective

A deep-dive from an academic’s, an analyst’s and GIS researcher’s perspective demonstrating common workflows integrating the London Datastore. The proposed recommendations and modifications for the Future London Datastore are designed to enhance the Datastore as the premier choice for analysts seeking up-to-date and precise data on London
Author

Julian Hoffmann Anton | University College London (UCL)

Published

20-02-2024

Executive Summary

This report presents an extensive analysis of the current London Datastore (LDS - 20.02.2024), examining common use cases and improvements to consider for future development and expansion of the platform. It offers an in-depth exploration from three distinct viewpoints with common interests: academic research, data analysis, and Geographic Information System (GIS) research. This multifaceted approach demonstrates typical workflows for integrating data from the London Datastore into various fields of study and application. The proposed recommendations and modifications are designed to enhance the Datastore as the premier choice for analysts seeking up-to-date and precise data on London.

The first section revisits the previously set priorities for the future development of the London Data Catalogue (LDC). Following this, we introduce the variety of user tools for data analysis and mapping. We then present an analysis of the LDS’s current metadata, acknowledging its complex scope and identifying areas requiring enhancements for the user and overall potential of the platform. Then we weave in observations and recommendations from the workflows of contemporary data analysts, researchers, and academics. This journey, from the initial idea to the final output, leverages the LDS as a central resource, through various data types, access formats, and frameworks. Consequently, we have identified and summarized 20 key observations and recommendations, which are detailed in the following sections.

Access to the London Datastore: https://data.london.gov.uk

Key Recommendations

I. Standardization and Efficiency

  1. Encourage data producers to adopt a standardized approach for data sharing formats, ensuring that dataset format is tidy1, consistent, and efficient. [Platform policy & Data inventory]

  2. Emphasise key geographical datasets and mapping shapes that are most commonly used for GIS applications in London, greatly facilitating their use. [Design and Data inventory]

II. Metadata Quality and Utility

  1. Enrich the metadata tagging and labeling, while improving the metadata quality and consistency. [Platform Policy & Data inventory]

  2. Correct, expand, and refine metadata tags to enhance their scope and inter-connectivity, fostering the discovery of relevant related data on the platform. [Platform policy & Data inventory]

  3. Improve the visibility and accuracy of the data’s release dates, and ensure clear indicators are present for data updates and changes. [Product functionality & Design]

  4. Highlight the fact that the metadata of the catalogue is available to explore. [Design]

III. User Interface and Experience

  1. Highlight the platform’s API capabilities and facilitate its use by guiding the user through examples. [Design, One-off product & Product functionality]

  2. Implement versatile sorting features, allowing users to filter datasets by relevance, publication date, and other relevant criteria. [Product functionality]

  3. Always facilitate the access to the raw data behind dashboards and reports, for more analytics freedom and transparency. [Product functionality]

  4. Enhance the user experience for data exploration and navigation, allowing for more intuitive searches and the discovery of relevant new and past data. [Product functionality & Design]

  5. Simplify the process for users to find and utilize published analyses and reports, making them more accessible. [Product functionality & Design]

  6. Clearly differentiate among datasets, analytics reports, and tabular data to prevent confusion and streamline user interaction. [Product functionality & Design]

  7. Ensure that file format information is displayed in a clear, concise, and transparent manner, aiding in user comprehension. [Product functionality & Design]

  8. Provide a succinct description for each dataset to convey its contents and purpose at a glance. [Platform policy]

  9. Introduce a better search engine system or Large Language Model chatbot assistance to explore the datasets available. [Product functionality]

  10. Facilitate the discovery and navigation by suggesting recent similar datasets to the one selected. [Product functionality]

  11. Create a section with Popular datasets containing the most downloaded data over different time periods. [Product functionality]

IV. Future-proofing and Technology Adaptation

  1. Anticipate and adapt to future data-sharing needs by supporting advanced formats and technologies like point clouds, LiDAR, 3D, APIs, and real-time data streams. [Data inventory & Product functionality]

  2. Gather, analyse, and explore user metrics and their downloads to define KPIs and to stay ahead of data demand and learn from users. [Product management]

  3. Prepare a strategy for aging datasets and a growing catalog, with labels, folders, archives or sign-posting. [Product management & Product functionality]

The report breaks down these recommendations and has more detailed calls to action 🔷 and labels the suggestions by difficulty of implementation 🟩🟨 🟥.

Introduction

“What do data analysts want from the data they use and where they access it from? What would a bold and visionary approach to data provision (a new data store) look like?”

There is very often a disconnect between how data custodians or producers structure and provide access to their data and the needs of potential end users. In this report, we look into different classes of users, each with their own format priorities but also many common interests. The purpose of this study is to inform the London Datastore Team with practical examples of how the platform is used and how improvements can make a big difference when scaled to all users of the London Datastore. As data end-users, data analysts, scientists and researchers, how would the recommendations from the report on the London Datastore’s Future 2 affect and help users? We will work backwards from those recommendations and the outputs to the London Datastore and its sources. To that end, we also study the metadata of the current LDS data catalog, extracting descriptive statistics and insights. We will also elaborate on the effect of implementing the Data Services Standard identified in the GLA’s Data Services Investment Proposal (2021) 3 – more specifically, looking at users’ needs and problems, how to facilitate access to data and how some changes will incentivise the use of the platform.

This will lead us to mention the so-called ‘Tidy Data Principles’4 and similar frameworks for optimal dataset formats and efficient standardisation. We will extend the recommendations found at the end of ‘Discovering the Future of the London Data Store’ with tangible examples which show how they affect the end-user’s experience. This will also elaborate on the ‘Data Services Investment Proposition (2021)’ section on Data Services Standards from an academic’s, an analyst’s and a researcher’s perspective: different users of the London Datastore, but with similar workflows.

1 Background of the report

1.1 Previous research to design the future London Datastore

Our study expands on the findings of extensive research done by the GLA and the Open Data Institute. The improvements currently identified in the Future of the London Datastore report are the following:

  1. Up to date data
  2. Broader datasets and types of data
  3. Metadata and different views for insights
  4. Improving navigation and search function - including expanding categories and better description of them
  5. Offering signpost and updates indicators
  6. Offering different formats – and allowing interactive visualisation and charts, but also better integration between data and analytical outputs

We will do the exercise of diving deeper into the workflows affected by the points above and elaborate on their impact on analytics projects.

It will include the observations from the “Data Services Investment Proposal (2021) - Appendix B”5 and, at the same time, we will expand on the proposed “Data Service Standards” concerning the user:

  1. Understand users and their needs. Look at the full context to understand what the user is trying to achieve, not just the part where they interact with the GLA. Understanding as much of the context as possible gives you the best chance of meeting users’ needs in a simple and cost effective way. In a data services context, the core “user” may be a particular type of officer in a London Borough, who has their own particular work context to understand.
  2. Solve a whole problem for users. Work towards creating a service that solves one whole problem for users, collaborating across organisational boundaries where necessary. This may mean aligning with other parts of the GLA or central government to combine data or other supporting services, or simply ensuring that the data service can connect seamlessly with related tasks, for example by ensuring that it uses a common language and that data is exportable in a helpful format.
  3. Provide a joined up experience across all channels [less relevant to data services, as they are generally not consumed e.g. offline]
  4. Make the service simple to use: make sure the service helps the user to do the thing they need to do as simply as possible - so that people succeed first time, with the minimum of help; test for usability frequently with actual and potential users, using appropriate research techniques; design the service to work with a range of devices that reflects users’ behaviour - in a data services context, most services may be delivered to officers’ laptops and desktops, but should take account of security/firewall limitations, browsers, etc in their corporate environments.
  5. Make sure everyone can use the service: where users are officers, this will mainly mean meeting accessibility standards.

We will proceed with tangible examples from the user’s perspective.

2 A modern data analysis & mapping-project workflow

This is a hands-on journey from idea to publication, highlighting where the London Datastore (LDS) can support and facilitate most users’ workflows. Most research involving data requires a series of tasks, transformations and operations which can be extremely time consuming, especially when aggregated across the whole user base of the London Datastore. The benefits of providing data the right way will accelerate and multiply the use and impact of the LDS on Londoners.

2.1 Tools for data and maps

Modern data users will most likely combine multiple tools to extract insights out of data. Here is a non-exhaustive list of the most commonly used tools, their importance varies by industry, sector and purpose:

  • MS Excel: A ubiquitous tool for basic data analysis, manipulation, and visualization, especially in business contexts.
  • Python: A versatile programming language with extensive libraries for data analysis (e.g., Pandas, NumPy, SciPy, Matplotlib).
  • R: A programming language and environment focused on statistical computing and graphics, widely used in academia and research. This report and analysis were created purely with R and RStudio.
  • SQL: Essential for managing and querying structured data in relational databases.
  • Tableau: A powerful tool for data visualization, enabling the creation of interactive and shareable dashboards.
  • Power BI: Microsoft’s business analytics service, providing interactive visualizations and business intelligence capabilities.
  • SAS: An integrated software suite for advanced analytics, business intelligence, data management, and predictive analytics.
  • SPSS (IBM SPSS Statistics): Widely used for statistical analysis in social science, it offers a range of analytical tools.
  • Matlab: A high-level language and interactive environment used heavily in engineering and scientific computing.
  • Stata: A tool for data manipulation, visualization, statistics, and automated reporting.

2.2 Modern Geographic Information Science tools (GIS)

  • ArcGIS: A comprehensive GIS software for creating, analyzing, and managing geographic data, widely used in various industries.

  • QGIS (Quantum GIS): An open-source GIS software that supports viewing, editing, and analysis of geospatial data.

  • R (Programming language) and RStudio: Known for its strong statistical capabilities, R has packages like sf, sp, rgdal, and rgeos, which are used for spatial data analysis and visualization.

  • Python: A versatile programming language with GIS-focused libraries like Geopandas, Shapely, Fiona, and Pyproj, widely used for scripting and automating GIS processes.

  • CARTO, Mapbox Studio, Felt and other online GIS platforms are becoming more popular.

Now that we know the analyst’s toolkit, we have to find the right data in the right, easy-to-use format.

2.3 Exploring data and finding data as a user

First of all let’s establish the scenarios you as a user will encounter.

A. You know what you are looking for:

  1. 🟢🟢🟢🟢 Best case: you find what you want and more than you expected.

  2. 🟢🟢 Great case: you find what you are looking for and nothing else.

  3. 🟢 Good case: you don’t find what you want but find something useful.

  4. 🔴 Bad case: you don’t find what you want and nothing else is useful.

B. You don’t know what you are looking for:

  5. 🟢🟢🟢 Excellent case: you discover something to start a new project or to influence a project.
  6. 🔴 Bad case: the exploration process and data discovered doesn’t inspire you to start a project nor to include new data in your projects.

C. You don’t know enough:

  7. 🟠 Bad case: you see something you want to use but don’t have the capacity to use it because the data access is unclear, too complex or messy.

Any change to the platform should aim to increase the likelihood of the 🟢 scenarios 1, 2, 3 and 5 and decrease 🔴 4 and 6 and 🟠 7. We will be gathering recommendations for that purpose.

2.4 Empowering the catalog with better metadata

2.4.1 Metadata analysis of the current catalog

Our analysis will be done programmatically with R and Quarto as tools which are open-source and free. The code for the graphics and data wrangling will be displayed above the charts.

2.4.1.1 Loading all the metadata of the Datastore as of 08.02.2024

The metadata can be accessed as an Excel file, .CSV or .JSON by clicking on Export at the bottom right of the page, or in a more complex way through the API. This is wonderful in terms of transparency and could be highlighted more for users to explore 🟩.

We will begin with the temporal analysis of the metadata.

Code
library(readxl)  # read_excel() comes from readxl; the export file is stored locally
metadata <- read_excel("Data/All meta data - export -2024.02.08.xlsx")

When researchers and analysts first land on a data platform, these common questions arise 🔷:

  • Is this really the latest data available?

  • How much of the data is being updated?

  • Are old datasets being refreshed?

  • Are many new datasets appearing regularly?

  • Are these datasets a one-off publication?

  • When is the next data release going to happen?

The graphs in the next section provide answers and will give an overview of the update rates and status of datasets of the current data catalog.

2.5 The update frequency

What you see in the visualisation are the names of the datasets on the y-axis (every 10th for clarity and simplicity purposes) and on the x-axis, you see the initial upload date 🔴 and the last update date 🟢 of the dataset.

The red vertical line marks the year 2020: as of 2024.02.08, 49.7% of the datastore (586 out of 1177 datasets) were last updated before 2020.01.01 🔷. By definition, the older the datastore the more archival content it will have, so there has to be a rigorous strategy for dealing with historical data that is still relevant, as well as with the less important outdated data 🟥.

The rate of uploading datasets might increase or decrease; this could be a key performance indicator (KPI) to track in order to plan accordingly 🟨.
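
As a minimal sketch of such a KPI, using only the createdAt column already present in the charts below, the number of datasets added to the catalog per year can be counted as follows:

Code
# Count how many datasets were first added to the catalog in each year,
# using the createdAt column from the metadata export
uploads_per_year <- metadata %>%
  mutate(year_added = format(as.Date(createdAt), "%Y")) %>%
  count(year_added, name = "datasets_added") %>%
  arrange(year_added)

uploads_per_year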

Code
#Convert your string to a POSIXct object, making sure to specify the correct timezone if needed
vertical_line_datetime <- as.POSIXct("2020-01-01", tz = "UTC")

every_nth = function(n) {
  return(function(x) {x[c(TRUE, rep(FALSE, n - 1))]})
}


# Plot with the adjusted geom_hline for the vertical line
ggplot(metadata %>% 
          mutate(short = str_sub(title, 1, 20)) %>%
          head(1200) %>% 
          arrange(desc(updatedAt)), 
       aes(x = reorder(title, updatedAt))) +
  geom_segment( aes(y=createdAt, yend=updatedAt, xend=title), color="grey") +
  geom_point( aes(y=updatedAt), color=rgb(0.2,0.7,0.1,0.5), size=1) +
  geom_point( aes(y=createdAt), color=rgb(0.7,0.2,0.1,0.5), size=1 ) +
  coord_flip() +
  geom_hline( yintercept = vertical_line_datetime, col = "red") + # Use geom_hline with yintercept
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.y = element_text(size = rel(0.5))) +
  
  scale_x_discrete(breaks = every_nth(n = 10))+
  xlab("Title (every 10th of 1177)") +
  ylab("Added date and update date")+
  labs(title = "Dataset date and update",
       subtitle = "As of 2024.02.08 there are 1177 datasets")

Code
pre_count<-metadata %>% 
  filter(as.Date(updatedAt)<as.Date("2020-01-01")) %>% 
  summarise(pre2020 =n())

tot_count<-metadata %>% 
  summarise(count =n())

pre_count/tot_count
   pre2020
1 0.497876
Code
chart_df <- metadata %>% 
  select(update_frequency) %>% 
  group_by(update_frequency) %>%
  summarise(count =n()) %>% 
  mutate(total = sum(count))

ggplot(data = chart_df, aes(x =count, y = reorder(update_frequency, (count))))+
    geom_bar(stat="identity", fill ="Steelblue") +
    geom_text(aes(label = paste0(" ",count, " - ", round(count/total,2)*100,"%")), hjust =0) +
    theme_minimal()+
  labs(title ="The distribution of the update frequency available for 1177 datasets",
       x= "Number of datasets",
       y="")+
    scale_x_continuous(limits = c(0, max(chart_df$count) * 1.15))  # Expand the limit to 15% more than the max count

As you can see in the graph above, there are many different update rates, ranging from annually (24% of the catalog) to real time (3 datasets, 0%), with 27% simply labelled “ad hoc”. We can also note that the labelling of frequencies could be improved; in fact it has some redundancies and inconsistencies which could easily be corrected 🟩. Some data validation measures could avoid this type of error in the future 🟩.
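
As an illustration of how easily such redundancies could be harmonised, here is a minimal sketch; the specific label variants recoded below are hypothetical examples rather than an audit of the actual catalog:

Code
# Hypothetical recoding of redundant update_frequency labels into a canonical set;
# the variants listed here are illustrative, not a review of the real labels
metadata_std <- metadata %>%
  mutate(update_frequency_std = case_when(
    str_to_lower(update_frequency) %in% c("ad hoc", "ad-hoc", "adhoc")    ~ "Ad hoc",
    str_to_lower(update_frequency) %in% c("annual", "annually", "yearly") ~ "Annually",
    str_to_lower(update_frequency) %in% c("monthly", "every month")       ~ "Monthly",
    TRUE                                                                  ~ update_frequency
  ))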

Not only the upload rate is interesting: the LDS could also track download analytics and user preferences to stay ahead of demand 🟩.

This would also allow the creation of a section with Popular datasets containing the most downloaded data over different time periods with filters available, potentially inspiring new research projects.🟨

2.6 The data publishers

The Greater London Authority (GLA) label is the largest publisher with 478 datasets, equal to 41% of the total, followed by the ONS with 13%. 8% of datasets are from publishers which only published one dataset. This representation also shows that the GLA in fact has other subcategories of publishers, such as “GLA Opinion research” (4%), which combined represent more than three quarters of the datasets.

Code
chart_df <- metadata %>% 
  select(publisher) %>% 
  group_by(publisher) %>%
  summarise(count =n()) %>% 
  mutate(publisher_2 = ifelse(count == 1, "***Single publication only", as.character(publisher))) %>%
  group_by(publisher_2) %>%
  summarise(summ = sum(count)) %>%
    mutate(total = sum(summ))%>%
  mutate(GLA = (grepl("GLA|Greater London Authority", publisher_2)))

  ggplot(data = chart_df, aes(x =summ, y = reorder(publisher_2, summ)))+
    geom_bar(aes(fill = GLA),stat="identity") +
    geom_text(aes(label = paste0(" ",summ, " - ", round(summ/total,2)*100,"%")), hjust =0) +
    theme_minimal()+
  labs(title ="Publishers of the 1177 datasets",
       x= "Number of datasets",
       y="") +
  scale_fill_manual(values = c("Dark red", "Dark green"))+
  scale_x_continuous(limits = c(0, max(chart_df$summ) * 1.15))  # Expand the limit to 15% more than the max count

2.7 The data authors

Most authors only publish 1 or 2 datasets in the LDS; together these account for 22% of the catalog. The label NA appears 14% of the time, meaning the author information is missing.

We also find the Greater London Authority as the most prolific author of datasets; nevertheless, we also find inconsistent labeling and sub-GLA groups.

For example: Greater London Authority 13%, GLA 5%, Greater London Authority (GLA) 1%, GLA Economics 3%, etc. Other examples of inconsistent labels occur with Transport for London (TfL) or Census data. Correcting these is possible and would make the catalog clearer, more efficient and more transparent 🟩.
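
A minimal sketch of such a correction, collapsing the GLA label variants mentioned above under a single canonical name (the regular expression is illustrative and only targets the variants discussed here):

Code
# Collapse the various GLA author spellings into one canonical label;
# the pattern below is illustrative and only covers the variants named in the text
metadata_authors <- metadata %>%
  mutate(author_std = if_else(
    str_detect(author, regex("^GLA\\b|Greater London Authority", ignore_case = TRUE)),
    "Greater London Authority (GLA)",
    author
  ))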

Code
chart_df <-metadata %>% 
  select(author) %>% 
  group_by(author) %>%
  summarise(count =n()) %>% 
  mutate(publisher_2 = ifelse(count <= 2, "***1 or 2 publications", as.character(author))) %>%
  group_by(publisher_2) %>%
  summarise(summ = sum(count))%>%
    mutate(total = sum(summ))%>%
  mutate(GLA = (grepl("GLA|Greater London Authority", publisher_2)))


  ggplot(data =chart_df, aes(x =summ, y = reorder(publisher_2, summ)))+
    geom_bar(aes(fill = GLA),stat="identity") +
    geom_text(aes(label = paste0(" ",summ, " - ", round(summ/total,2)*100,"%")), hjust =0) +
    theme_minimal()+
  labs(title ="Authors of the 1177 datasets",
       x= "Number of datasets",
       y="")+
  scale_x_continuous(limits = c(0, max(chart_df$summ) * 1.15)) + # Expand the limit to 15% more than the max count
  scale_fill_manual(values = c("Steelblue", "Darkred")) 

2.7.0.1 Metadata interconnection

There are currently 18 main tag categories, with many sub-tags. The following figure shows the co-occurrence of labels across all 1177 datasets.

The diagonal of this table shows the total number of datasets in that category, the largest being “Demographics” with 217 datasets. The other tiles show the co-occurrence of that label with the horizontal topic label (or vertical if you prefer).

The matrix shows that most labels are linked with most other labels, but it might make us wonder whether some of the smaller labels, such as “covid-19”, shouldn’t be more common than they actually are.

Important questions come to mind:

  • Should there have been more datasets with the label “london-2012”? 🔷

  • Should the 18 categories be reevaluated and new labels added? 🔷

  • Should some categories be archived? 🔷

  • Are 18 categories enough?🔷
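
The co-occurrence matrix used in the code below can be derived directly from the topic indicator columns of the metadata export; a minimal sketch, assuming those topic-* columns hold TRUE/FALSE flags as they do in the file analysed here:

Code
# Build the 18x18 co-occurrence matrix from the logical topic-* columns:
# crossprod counts, for each pair of topics, how many datasets carry both labels,
# and the diagonal gives the total number of datasets per topic
topic_flags <- metadata %>%
  select(starts_with("topic")) %>%
  mutate(across(everything(), ~ .x %in% c(TRUE, "TRUE"))) %>%
  as.matrix()

co_occurrence_matrix_18 <- crossprod(topic_flags * 1L)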

Code
# Assuming 'your_matrix' is the 18x18 matrix you want to display
# your_matrix <- matrix(rnorm(18*18), nrow = 18, ncol = 18)

## Convert the matrix to a data frame in long format
matrix_long <- as.data.frame(co_occurrence_matrix_18) %>%
  rownames_to_column(var = "row") %>%
  pivot_longer(cols = -row, names_to = "column", values_to = "value") %>%
  mutate(row = str_replace_all(row, "topic-", ""))%>%
  mutate(column = str_replace_all(column, "topic-", ""))


order_vector <- metadata %>%
  select(id, starts_with("topic")) %>%
  pivot_longer(cols = starts_with("topic"),names_to ="topic", values_to ="is")%>%
  filter(is == "TRUE") %>%
  group_by(topic) %>% 
  summarise(n=n())%>%
  mutate(topic = str_replace_all(topic, "topic-", "")) %>% arrange(desc(n))

order_vector_1 <- as.character(order_vector$topic)



# Convert 'row' and 'column' to factors with levels specified by 'order_vector'
matrix_long$row <- factor(matrix_long$row, levels = order_vector_1)
matrix_long$column <- factor(matrix_long$column, levels = order_vector_1)

ggplot(matrix_long, aes(x = column, y = row, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), vjust = 1) +
  scale_fill_distiller(palette = "Spectral") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  coord_fixed() +
  labs(title ="Matrix of co-occurrence of topics")

2.7.0.2 A Network analysis approach

The labels and tags can be represented as a network, to visualise and calculate centrality measures and let patterns emerge. In the figure below, each major category is a node, and an edge links two categories whenever a dataset contains both labels, growing proportionally wider with each co-occurrence. The larger the node, the more datasets the label has; the wider the edge, the more connected the two nodes are.

You can explore the relations on this interactive network graph and here are some observations that can be highlighted:

  • Demographics, environment, employment-and-skills, and planning are the most important labels.

  • London-2012 stands out as a small category

  • Overall the network is very interconnected.

  • Are 18 labels comprehensive enough to describe the variety of data and the needs of users?🔷

  • Employment-and-skills is strongly linked with business-and-economy labels

  • Crime-and-community-safety is strongly linked with “transparency”.

  • Should some links have been stronger for some categories? 🔷

2.7.0.3 Granular Tag Analysis

There are many more sub-tags beyond the main 18 categories. Without correcting for typos and different spellings, there are 1573 different tags. For the future of the LDS it is important to re-emphasize and create a new, systematic tagging system 🟨 for a richer metadata ecosystem, better data exploration and an easier search of the catalogue. The graph below shows the frequency of each sub-tag, the most common being borough-economics, which appears 83 times, followed by population, which appears 80 times.

I have been told that the labels are more often set by the data creators than by the LDS. An initial review and correction of typos wouldn’t be too time consuming as a one-off measure 🟩, before putting in place a more systematic labelling system or guidance for data creators 🟨.

The vast majority of tags appear fewer than 5 times, as seen in the figure below, a histogram of the frequency of appearance. More tagging could help us link datasets and create clusters beyond the 18 main categories 🟩. Search engines and large language model (LLM) assistants would benefit from it and help users find useful datasets and define new projects for London.
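
As a sketch of what a one-off typo review could look like, and assuming the sub-tags are exported as a comma-separated tags column (the column name here is an assumption about the export format), near-duplicates can be surfaced by normalising case and whitespace before counting:

Code
# Hypothetical sketch: the 'tags' column name is an assumption about the metadata export.
# Normalise case and whitespace, then count how often each sub-tag appears,
# so near-duplicates (e.g. differing only in capitalisation) end up side by side.
tag_counts <- metadata %>%
  separate_rows(tags, sep = ",") %>%
  mutate(tag = str_to_lower(str_trim(tags))) %>%
  filter(tag != "") %>%
  count(tag, sort = TRUE)

head(tag_counts, 10)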

2.8 Data access and formats

Let’s dive in with a list of encountered issues:

  • There isn’t a clear distinction between raw tabular data and reports or publications in PDF.

  • There are hundreds of PDFs in one place, without a summary or short description.

  • Inconvenient formats: column names are in a separate file from the data.

  • Outdated shapes for geographic boundaries and files: for London we had to go directly to the ONS to find the latest geographic boundaries, something required for most GIS and mapping work in London. Link to the latest Census geography.

  • The flow of exploration of datasets is not optimal. Long scrolling is necessary 🔷, without sorting options 🔷 by date or other relevant dimensions. The filtering options 🟨 could be improved as well, with access to improved tags, keywords and format information.

  • Outdated data links. Examples of broken links:

    Broken link to the TfL API. 6

    Broken link to the TfL journey planner. 7

    Broken link to the TfL feeds. 8

    Crowd-source the update-control process by adding the possibility for users to signal to the London Datastore team when a link is broken or when a new link or data source is available 🟥 (a minimal link-checking sketch follows this list).

  • The store should display the date of the dataset more prominently, or warn that there is a risk it might be outdated and might currently have a broken link 🟨.

  • There aren’t any links to LiDAR datasets, which are public 3D datasets. The store could potentially link to governmental open-access LiDAR sources 🟩 9

  • Simple access to data: dashboards, analytics and reports are great, but leaving easy access to the raw data 🟩 behind the dashboard is key, not only for researchers who want to do more with the data but also for transparency purposes. An example of this is the Residential-completions-dashboard, labelled as a one-off update (it hasn’t been updated in a year), which is a shame for an interactive dashboard; as far as we could tell, the data behind it is not available.
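
Relating to the broken-link point above, here is a minimal sketch of an automated link check; the URLs are two of the examples from this report’s footnotes, and an HTTP status from a HEAD request is only a first-pass signal (some servers reject HEAD requests):

Code
# Illustrative link checker: send a HEAD request to each URL and record the HTTP status.
# A real check would loop over all external links stored in the catalog.
library(httr)
library(tidyverse)

urls_to_check <- c(
  "https://data.london.gov.uk/dataset/tfl-live-bus-arrivals-api",
  "https://tfl.gov.uk/info-for/open-data-users/our-feeds"
)

# Return NA instead of failing when a request errors (e.g. dead domain, timeout)
safe_status <- possibly(function(url) status_code(HEAD(url, timeout(10))),
                        otherwise = NA_integer_)

link_status <- tibble(url = urls_to_check) %>%
  mutate(status = map_int(url, safe_status))

link_status  # statuses other than 200, and NAs, deserve a manual look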

We can look at the frequency of formats on the platform but in this report we don’t have the information about how many files each dataset contains. In fact, it is often the case that there are multiple files in one dataset, which is a great thing showing the richness of the London data catalog.

The figure above shows that the most common file format is a spreadsheet, appearing 677 times. In second place is the .PDF format with 334, and in third place .CSV with 250.

There are also 111 ZIP files where we most likely don’t know what format is inside. This could also be clarified by describing the files available 🟩.

3 APIs and the future of real-time data

The Application Programming Interface (API) pipeline isn’t currently easily visible and is hidden behind the “Guidance” menu. There is also a “Developers” section we were only able to find through Google 🟩. Furthermore, the “Datastore API” page is not linked to the “Developers” page, and neither is clearly visible on the main page of the London Datastore. Combined, they could be put forward as a key asset of the platform for developers and advanced data science users 🟨.

The hidden Developers section simply has 4 links that the user can test to fetch with the API, without any further explanation. The API documentation is present, but there aren’t any detailed examples to facilitate its use. Users already have to be advanced analysts familiar with the platform, or spend a significant amount of time discovering the possibilities 🟨. [Causes user scenario 7 🟠 - user doesn’t know enough]

The use of the Application Programming Interface (API) could be highlighted more prominently 🟩; this would facilitate the creation of many apps, automated dashboards and other tools that programmatically extract the latest version of a dataset from the LDS. They could also download precisely a series of required and updated datasets always located at the same address, and the catalogue could be browsed with the API.

This opens the gate to many opportunities for the use of the London Datastore and reduces the burden of manually downloading files and going through many time-consuming and repetitive tasks to use the data.

All in all, more guidance and examples with a more prominent presence on the platform will attract more users to more ambitious use-cases of the data🔷.

3.1 Accessing the meta data through the API

Currently there is documentation for the use of the API, but it isn’t specific to the LDS and doesn’t explain the possibilities with dataset-based examples. So a blog post, tutorial 🟥 or article 🟨 presenting how to access the API with Python, R or any other tool would significantly facilitate the task for any new user.

The code below shows how I accessed the metadata as a .JSON file in R. Many more hours would be needed to truly display all the possibilities enabled by the API, which could be described in a simple yet clear tutorial 🟥.

Code
# Load necessary libraries
library(httr)
library(readr)
library(jsonlite)
library(tidyverse)

# Set API key and endpoint
api_key <- "your_api_key_here"

# The API endpoint from where you want to download the JSON file
api_url_metadata <- "https://data.london.gov.uk/api/datasets/export.json"

# Make the GET request with the API key in headers or query parameters as required
response_meta <- GET(api_url_metadata, add_headers(`Authorization` = paste('Bearer', api_key)))

# Check if the request was successful using the status_code() function
if (status_code(response_meta) == 200) {
  # Extract content from the response object
  content <- content(response_meta, "text", encoding = "UTF-8")
  
  # Parse the JSON content
  json_meta_data <- fromJSON(content)
  
  # Now json_data contains the data as an R object, which you can manipulate or analyze
} else {
  print("Failed to download JSON")
}

3.2 Developer tab: getting the list of datasets

The presence of a “Developers” page is great, but it could be expanded with more detailed help, and definitely be highlighted more prominently.

Code
# The API endpoint from where you want to download the JSON file
api_url <- "https://data.london.gov.uk/api/action/package_list"

# Alternatively, extract the address from the metadata file column "webpage"; you still need to insert "/api" before "/dataset" in the URL.

# Make the GET request with the API key in headers or query parameters as required
response_list<- GET(api_url, add_headers(`Authorization` = paste('Bearer', api_key)))

# Check if the request was successful using the status_code() function
if (status_code(response_list) == 200) {
  # Extract content from the response object
  content <- content(response_list, "text", encoding = "UTF-8")
  
  # Parse the JSON content
  json_data_list <- fromJSON(content) #%>% flatten()
  
  # Now json_data contains the data as an R object, which you can manipulate or analyze
} else {
  print("Failed to download JSON")
}

3.3 Get a dataset through the API

Ideally you should be able to easily access any dataset through the address found in the metadata file, used as input to the API. Otherwise you can also manually write the wanted URL; the only requirement is to insert “api/” before “dataset” within the URL.

As stated before, a walk-through document for R or Python could speed up and promote the use of the API and the LDS.🟥

Code
# The API endpoint from where you want to download the JSON file
api_url <- "https://data.london.gov.uk/api/dataset/electricity-consumption-borough"

# Alternatively, extract the address from the metadata file column "webpage"; you still need to insert "/api" before "/dataset" in the URL.

# Make the GET request with the API key in headers or query parameters as required
response <- GET(api_url, add_headers(`Authorization` = paste('Bearer', api_key)))

# Check if the request was successful using the status_code() function
if (status_code(response) == 200) {
  # Extract content from the response object
  content <- content(response, "text", encoding = "UTF-8")
  
  # Parse the JSON content
  json_data <- fromJSON(content)
  
  # Now json_data contains the data as an R object, which you can manipulate or analyze
} else {
  print("Failed to download JSON")
}

3.4 Real-time present and future

As the world continues to increase the frequency and amount of data creation, the London Datastore should prepare for this trend 🔷. As we showed in the data update frequency graph in the previous section, there are many different update rates, ranging from annually (24% of the catalog) to real time (3 datasets, 0%), with 27% simply labelled “ad hoc”.

For real-time and higher frequency updated data the API access is key. This could be transport data, tourist data, temperature data, pollution, noise, water levels, etc.

Other real-time data initiatives in London, such as the one around the Olympic Park area, run in collaboration with the London Docklands Development Corporation (LDDC) and SHIFT, are an innovative example of a real-time data platform for urban analytics. It would be open to the GLA and to companies in the IoT sector to consume and upload real-time data. The platform uses MQTT Explorer, and “At the moment [they] get approximately 20 million points a day - on building and environmental data. So occupancy of buildings, energy use, water etc. Also external so using it for real-time heat island data at the moment with the GLA, with sensors deployed in sites across London and feeding into our platform. This is for policy to be reactive in realtime” - Prof. Andrew Hudson-Smith, UCL. The granularity and real-time level extend to real-time bird song analysis, real-time bat tracking and more. The GLA is already part of the project and the LDS team could learn from this initiative 🔷.

APIs can be very sophisticated and complex, even allowing extremely rich access to variables within datasets. That type of data access is only possible with tidy, well-crafted datasets and data architecture 🔷.

4 Tidy data principles

A notoriously huge amount of effort is spent cleaning data to get it ready for analysis; this is a well-known burden in the data science world. As a solution, Hadley Wickham developed a framework that makes it easy to tidy messy datasets.

The definition is simple; the structure of the data has to be as follows: each variable is a column, each observation is a row, and each type of observational unit is a table 🔷. Tidy datasets are easy to manipulate, model and visualize. We will show the advantages of consistent data structures with case studies from the London Datastore, and how everyone could benefit by being freed from mundane and lengthy data manipulation chores.
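
As a small illustration of the principle, a hypothetical wide table of borough-level counts per year can be reshaped into tidy long format in one step (the values and column names are made up for the example):

Code
# Hypothetical wide table: one row per borough, one column per year
wide_example <- tibble(
  borough = c("Camden", "Hackney"),
  `2021`  = c(105, 98),
  `2022`  = c(110, 101)
)

# Tidy (long) format: each variable is a column, each observation is a row
tidy_example <- wide_example %>%
  pivot_longer(cols = -borough, names_to = "year", values_to = "count")

tidy_example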

Hadley Wickham synthesised and created this efficient framework for data science and data architecture in ‘Tidy Data’, a now acclaimed publication in the Journal of Statistical Software 10; it is best practice among data scientists and researchers today.

This is incredibly helpful for modern Data Science practitioners and for any type of data in spreadsheets or more complex GIS formats. We will get into more details in the next sections. Any policy of the LDS that widens the use of tidy dataset will be of major help.🔷

Experts in the topic at UCL have also developed workflows and efficient systems such as the data curation software Whyqd: “Do more research, for less wrangling” by “transform[ing] messy data into structured schemas using readable, auditable methods. Perform schema-to-schema crosswalks for interoperability and data reuse”.

5 Mapping London

With more time, this section could be extended considerably, as London is without a doubt among the top cities in the world in terms of open data and maps, in part thanks to the award-winning London Datastore. As mentioned before, the LDS could contribute even further to the mapping and spatial research of London by implementing some of the recommendations of this report. 🔷

5.1 Loading Geographic Data

Geographic data is a major pillar of the London Datastore and should be highlighted further 🟩, given that most of its datasets contain a spatial dimension in the form of coordinates, administrative boundaries or other geographic boundaries.

The LDS should contain, support and highlight common GIS file formats such as:

  1. Shapefile (.shp): A vector data format for storing the location, shape, and attributes of geographic features. It is important to note that, inconveniently, this is a composite format made up of multiple files, so it is often stored within a folder.

  2. GeoPackage (.gpkg): A compact, portable format for transferring geospatial information, supporting a wide range of data types without file size limitations.

  3. GeoJSON (.geojson): A lightweight format for encoding a variety of geographic data structures using JavaScript Object Notation.

  4. KML (Keyhole Markup Language) (.kml): An XML notation for expressing geographic annotation and visualization within two-dimensional maps and three-dimensional Earth browsers.

  5. GML (Geography Markup Language) (.gml): An XML grammar for expressing geographical features, serving as a modeling language and interchange format for geographic systems.

  6. GeoTIFF (.tif): A format that allows georeferencing information to be embedded within a TIFF file, widely supported by image-processing and GIS applications.

  7. CSV (Comma-Separated Values) (.csv): A simple format used to store tabular data, including geographic coordinates, easily imported into and exported from GIS software.

  8. XLSX (Microsoft Excel Open XML Spreadsheet) (.xlsx): Can store geographic coordinates and spatial data, often used for data collection and analysis before importing into GIS software.

  9. LAS (LiDAR Aerial Survey) (.las): The standard format for storing LiDAR point cloud data, accommodating a wide array of point data and associated attributes.

  10. LAZ (Compressed LAS) (.laz): A compressed version of the LAS format, using lossless compression to maintain data integrity while significantly reducing file size.

The hierarchical representation of UK statistical geographies is notoriously complex, as you can see in this schema. Any help provided to users interested in London could drastically facilitate the task. 🟥

Making access to the most commonly used geopackages/shapefiles/coordinates easy, simple and frictionless could significantly improve the efficiency and increase the use of the platform. 🟨

Another complementary option would be cataloguing or creating, just for London, a list of important links to the latest relevant GIS resources such as Ordnance Survey (OS)11, the Office for National Statistics (ONS)12 or Census geographies (ONS)13 🔷.

For example, the London Borough boundaries are an extremely common GIS file to download. Today, the boundaries are available in the LDS, but they were last updated 2 years ago (while the stated update frequency is monthly). Secondly, the external link to the supposedly latest version is there but unfortunately doesn’t work. Making it possible for LDS users to signal a broken link would be excellent 🟨. Otherwise, creating a specific, regularly checked section with common GIS files and links would be of great help 🟨.

A significant number of datasets in the store are not labelled with the granularity of their geographic boundaries, i.e. their spatial resolution. The figure above shows that if we exclude unlabelled datasets, the most common geography is Local Authority, with 250 datasets at that scale, followed by Greater London with 135 and Borough with 72. It would be interesting to retrospectively give the 398 unlabelled datasets the correct label and to make sure that this labelling system is enforced for future datasets. 🔷

Once the geographic data is downloaded, there are many common tasks and transformations which are needed for the analysis. For example, some kind of merging and linking data to geographic data or operations like geocoding are almost certain to be required. We will be describing such tasks for data analysis in general in the following sections.

Here we will simply combine 3 GIS files that are in the right format and easily accessible in the London Datastore.
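
A minimal sketch of how these three layers can be read with the sf package, assuming the GeoPackage files shown in the output below have been downloaded from the LDS into a local Data folder:

Code
library(sf)

# Read the three GeoPackage layers downloaded from the London Datastore
# (the local paths are assumptions about where the files were saved)
london_br_sf       <- st_read("Data/London_Boroughs.gpkg")
london_ULEZ2021_sf <- st_read("Data/InnerUltraLowEmissionZone.gpkg")
london_ULEZ_sf     <- st_read("Data/LondonWideUltraLowEmissionZone.gpkg")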

Reading layer `london_boroughs' from data source 
  `C:\Users\julia\Hoffmann Dropbox\Julian Hoffmann\0. Julian Studio\1. Julian Studio Projects\2023.08 - UCL-GLA London Data Store\London Datastore Online Report\Data\London_Boroughs.gpkg' 
  using driver `GPKG'
Simple feature collection with 33 features and 7 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 503568.2 ymin: 155850.8 xmax: 561957.5 ymax: 200933.9
Projected CRS: OSGB36 / British National Grid
Reading layer `InnerUltraLowEmissionZone' from data source 
  `C:\Users\julia\Hoffmann Dropbox\Julian Hoffmann\0. Julian Studio\1. Julian Studio Projects\2023.08 - UCL-GLA London Data Store\London Datastore Online Report\Data\InnerUltraLowEmissionZone.gpkg' 
  using driver `GPKG'
Simple feature collection with 1 feature and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 518493.4 ymin: 172850.4 xmax: 543936.5 ymax: 192597.7
Projected CRS: OSGB36 / British National Grid
Reading layer `LondonWideUltraLowEmissionZone' from data source 
  `C:\Users\julia\Hoffmann Dropbox\Julian Hoffmann\0. Julian Studio\1. Julian Studio Projects\2023.08 - UCL-GLA London Data Store\London Datastore Online Report\Data\LondonWideUltraLowEmissionZone.gpkg' 
  using driver `GPKG'
Simple feature collection with 22 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 503899.4 ymin: 156666 xmax: 559664.7 ymax: 200880.3
Projected CRS: OSGB36 / British National Grid
Code
ggplot(data=london_br_sf)+

  geom_sf(col ="black", fill = "steelblue")+
  geom_sf(data =london_ULEZ2021_sf, col = "red", fill ="transparent", inherit.aes = F)+
 geom_sf(data =london_ULEZ_sf, col = "#32CD32", fill ="transparent", inherit.aes = F)+

  theme_minimal()+
  labs(title = "London Boroughs and London ULEZ",
       subtitle = "Inner ULEZ 2021 and London wide ULEZ August 2023 update")

5.1.1 London Interactive Map

Nowadays, if the GIS data is easily accessible in the right format (a GeoPackage for example), it is straightforward to create interactive maps in R and publish insights from London’s wide-ranging data catalog.

Code
library(leaflet)
library(leaflet.providers)
# Create the leaflet map with the shapefile

leaflet_map <- leaflet() %>%
  #addProviderTiles(providers$CartoDB.DarkMatter) %>% # Adding a basemap
  #addProviderTiles(providers$ProviderName)
 addTiles(urlTemplate = 'http://{s}.basemaps.cartocdn.com/dark_all/{z}/{x}/{y}.png')%>%
  addPolygons(data = london_br_sf)%>% # Adding the shapefile
  addPolygons(data = london_ULEZ2021_sf,    
              color = "red", # Border color
    fillOpacity = 0.2)%>% # Adding the shapefile
  addPolygons(data = london_ULEZ_sf, color = "darkgreen", # Border color
    fillOpacity = 0.2) # Adding the shapefile

  leaflet_map

5.2 Exploratory analysis and data wrangling

⚠️ This section had to be shortened due to time constraints.⚠️

5.2.1 Geocoding

A recurring issue is having data and addresses without geographic coordinates. When coordinates are missing or geographic shapes are absent, GIS analysts and researchers need to apply “geocoding”, which in spatial data science refers to the process of converting addresses or other geographic descriptors into numerical coordinates on the Earth’s surface. Typically, this involves transforming a description like a street address, city name, or postal code into a precise latitude and longitude. These coordinates can then be used for various purposes such as mapping, spatial analysis, and geographic data visualization.

The LDS could promote the addition of coordinates as much as possible to facilitate the common and sometimes pricey task of geocoding. 🔷
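
Where only addresses are published, a minimal geocoding sketch with the tidygeocoder package could look like the following; the address is illustrative, and the free OSM/Nominatim service has strict usage limits:

Code
library(tidygeocoder)

# Illustrative example: geocode one address with the free OSM (Nominatim) service
addresses <- tibble(address = "City Hall, London, UK")

geocoded <- addresses %>%
  geocode(address = address, method = "osm")  # adds lat/long columns

geocoded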

5.2.2 Sub-setting, selecting and filtering the data.

For research and analysis around London it is often necessary to start from national- or regional-level datasets and then narrow them down to the 32 boroughs and the City, for example. Any help, or any national dataset already sub-set and filtered to the Greater London area and stored in the LDS, would be a great asset and support for analytics. 🔷
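
A minimal sketch of such a subset, where national_df and its region column are placeholder names for a hypothetical national dataset, with a spatial alternative using the borough boundaries loaded earlier:

Code
# Hypothetical attribute-based subset: keep only the Greater London rows
# ('national_df' and 'region' are placeholder names for the example)
london_only <- national_df %>%
  filter(region == "London")

# Spatial alternative: keep only the features that intersect the London boroughs
# london_subset_sf <- st_filter(national_sf, st_union(london_br_sf))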

5.2.3 Data Transformation (splitting, aggregating, summarising etc.)

Generally speaking, the more granular the data the better, because researchers can always aggregate upwards, summarising statistics or aggregating points for example. This is almost certain to happen in any analysis.
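
A typical aggregation step, sketched with placeholder names, rolling granular records up to borough level:

Code
# Hypothetical aggregation: summarise granular records to one row per borough
# ('granular_df', 'borough' and 'value' are placeholder names for the example)
borough_summary <- granular_df %>%
  group_by(borough) %>%
  summarise(n_records   = n(),
            total_value = sum(value, na.rm = TRUE))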

5.2.4 Data and merging

One of the major powers of data analysis comes from the combination of datasets. Analysts, like alchemists, will combine data that has perhaps never been combined before. This is a task that without a doubt happens in any project, and even more so in GIS projects where we connect data to shapes.

A common burden when merging is the presence of inconsistent names, boundaries, categories, addresses, spellings etc. Any policy by the LDS that standardises ways of labelling things will help researchers connect data. 🔷
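
A sketch of a typical merge, with placeholder table and key names, which also shows why consistent labels matter: the join silently produces missing values whenever the borough names are spelled differently in the two sources:

Code
# Hypothetical join of tabular indicators onto the borough boundaries;
# 'indicators_df', 'borough_name', 'name' and 'indicator_value' are placeholders
london_joined <- london_br_sf %>%
  left_join(indicators_df, by = c("name" = "borough_name"))

# Rows that failed to match because of inconsistent spelling end up as NA
london_joined %>% filter(is.na(indicator_value))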

5.2.5 Data filtering and sorting

A researcher will always create filters and sort the data during exploration and presentation of the data in charts. The more columns with data the more insights and filter options we will have for the analysis. 🔷

Even before the exploratory analysis, we can note here that, while looking for data, rich filtering and sorting options should be available on the platform. 🔷

Filtering can also be the last resort for incorrect or bad data that has to be removed.

5.2.6 Data correcting / cleaning

In most cases, some type of correction and cleaning is required, especially after combining datasets. The tidier and more standardised the data in the LDS, the easier it is for the researcher 🔷. A minimal cleaning sketch follows the list of common corrections below.

Types of corrections:

  • Misspelling

  • NA

  • Empty cells

  • Wrong data format

  • Date and temporal data mis-formatting

  • Typos

  • Outliers that are actually errors

  • Comma mistakes

  • etc.
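
As referenced above, a minimal cleaning sketch covering a few of these corrections, with placeholder column names:

Code
# Hypothetical cleaning pass: trim whitespace, standardise missing values,
# and coerce a numeric column that arrived as text with thousand separators
# ('raw_df', 'category' and 'amount' are placeholder names)
clean_df <- raw_df %>%
  mutate(category = str_trim(category),
         category = na_if(category, ""),
         amount   = as.numeric(str_replace_all(amount, ",", "")))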

5.2.7 Gap filling / modelling

A lot of research consists of forecasting or looking to the past, particularly in academia but equally importantly in business. These tasks often require filling gaps and combining data. The LDS can help us have consistent time series and complete datasets if data policies, datasets and the catalog remain consistent over time. 🔷
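
A sketch of a simple gap-filling step with tidyr, assuming a hypothetical yearly series with a missing year and a missing value; more serious gap filling would of course use proper modelling:

Code
# Hypothetical time series: 2020 is missing entirely and 2021 has no value
series_df <- tibble(
  year  = c(2018, 2019, 2021, 2022),
  value = c(10, 12, NA, 15)
)

filled_df <- series_df %>%
  complete(year = full_seq(year, period = 1)) %>%  # insert the missing 2020 row
  fill(value, .direction = "down")                 # carry the last observation forward

filled_df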

5.2.8 Data persistence and linking to historical data and updates etc.

As with the gap filling and merging of datasets, linking past data, combining it and making it consistent over time can require whole teams and months of research, resulting in datasets such as the House Price per Square Metre in England and Wales. Any help the LDS can provide on standardised formats over time for consistent time series will help many analysts. 🔷

5.2.9 Data formatting and pivoting (wide to long / long to wide)

One of the most common data operations at the start of any analysis in Excel, R or Python is a transformation like pivoting. Many researchers will want the data in a wide, table-like format, but for programmatic analysis many others will want the tidy, long data format. Both are possible in the same file.

The more the LDS can promote the use of the tidy data format, for example on the last tab of an Excel spreadsheet, the happier SQL, Python and R users will be 🔷. Some good spreadsheet practices:

  • Consistent Structure: Use a uniform layout and structure throughout the spreadsheet.
  • Descriptive Headings: Use clear and descriptive headings for columns and rows.
  • Avoid Merging Cells: It can complicate data sorting, filtering, and analysis.
  • Minimize Empty Rows/Columns: They can interfere with data processing and visualization.
  • Data Validation: Use data validation features to ensure data integrity (e.g., dropdown lists for specific entries).
  • Cell Formatting: Apply cell formatting (e.g., number, date) appropriately for clarity.
  • Use Formulas Judiciously: Avoid overly complex formulas; break them down if necessary.
  • Document Assumptions: Clearly document any assumptions or sources for data within the spreadsheet.
  • Use Tables: Convert ranges into tables to benefit from structured references and automatic headers.
  • Standardize Date Formats: Ensure consistency in date formats throughout the spreadsheet.
  • Keep Headers Visible: Freeze panes to keep row and column headers visible while scrolling.

This list is non-exhaustive.

5.2.10 Storing / retrieving / updating

The never-ending task of getting the latest available data, or, at the opposite end, looking for historical data, should be made easy 🔷. Ideally, it can then be merged, but that is often wishful thinking because, over the years, and especially if the data is not in a tidy format, the merge will require many spreadsheet-cleaning operations.

Conclusions

What the London Datastore has achieved is already remarkable and deserves praise. In this report we only enumerated potential improvements and solvable issues for the many future iterations of the platform. The evolution of the Future Datastore necessitates numerous modifications to the existing framework; these adjustments, detailed in this study, promise to significantly enhance the experience for researchers, analysts, and academics. Although some proposed changes involve intricate technical challenges, many suggested actions are straightforward to execute and will reinforce the position of the London Datastore as the leading source for accurate, reliable, and current data on London.

Footnotes

  1. Tidy data format definition: “Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.” https://vita.had.co.nz/papers/tidy-data.pdf↩︎

  2. (2019 - Open Data Institute: https://theodi.org/)↩︎

  3. (See 2021 Report Appendix B/*)↩︎

  4. Tidy Data Principles: https://r4ds.hadley.nz/data-tidy.html↩︎

  5. Data Services Investment Proposal (2021) - Appendix B↩︎

  6. Broken Links to tfl api: https://data.london.gov.uk/dataset/tfl-live-bus-arrivals-api↩︎

  7. https://data.london.gov.uk/dataset/journey-planner-api-beta↩︎

  8. https://tfl.gov.uk/info-for/open-data-users/our-feeds↩︎

  9. Public LiDAR data: https://www.data.gov.uk/dataset/f0db0249-f17b-4036-9e65-309148c97ce4/national-lidar-programme↩︎

  10. Tidy Data: https://www.jstatsoft.org/article/view/v059i10↩︎

  11. Ordnance Survey: https://www.ordnancesurvey.co.uk/products/boundary-line↩︎

  12. ONS Link https://geoportal.statistics.gov.uk/search?q=BDY_ADM&sort=Date%20Created%7Ccreated%7Cdesc↩︎

  13. Census geographies: https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeographies/census2021geographies↩︎