COVID-19, Data Pipelines, and the Limits of Excel
A deep dive into the loss of vital data and a possible solution
28th December 2024 | Richard Honour
On Monday 5th October 2020, UK newspapers reported a major data pipeline error concerning COVID-19 testing data connected to the NHS’s test-and-trace system.
The data pipeline fed various government dashboards used for monitoring testing capacity, infection rates, and other metrics.
The error reportedly led to the loss of tens of thousands of testing records, and left almost 16,000 positive COVID-19 cases missing from daily monitoring metrics.
Thousands exposed to infected individuals were not alerted to self-isolate.
Reports were quick to blame Public Health England (PHE), the government body responsible for the collection of testing data at the time.
However, much of the blame was targeted at PHE's use of one specific tool in its ETL process: Excel.
Total Data Loss
From 25th Sept to 2nd Oct:
15,841 unreported positive test cases
8 days of incomplete data
1,980 cases on average missed per day
48-hour ideal time limit for tracing contacts following a positive test was missed
Source: BBC/PHE/gov.uk
What happened?
Drawbacks of using older tools.
The error was reportedly caused by the use of Excel's legacy XLS file format, and took place during a data loading procedure.
Allegedly, the loading process involved opening incoming CSV files within Excel, from which the columnar data was then loaded into the central PHE database.
During this process, larger CSV files that exceeded the 65,536-row limit of XLS were truncated, meaning thousands of rows could have been lost with each load.
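As a rough illustration of the scale of the problem, a simple pre-flight check (sketched here in Python, with an illustrative file name) could have flagged any incoming file at risk of truncation before it was ever opened in Excel:

```python
# A minimal sketch: count the records in an incoming CSV and compare them
# against the 65,536-row ceiling of the legacy XLS format.
import csv

XLS_MAX_ROWS = 65_536  # hard row limit of the legacy .xls format

def rows_at_risk(csv_path: str) -> int:
    """Return how many records would be silently dropped if the file were opened as XLS."""
    with open(csv_path, newline="") as f:
        total = sum(1 for _ in csv.reader(f))
    return max(0, total - XLS_MAX_ROWS)

lost = rows_at_risk("incoming_test_results.csv")  # illustrative file name
if lost:
    print(f"WARNING: {lost} records would be truncated by the XLS format.")
```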
In response to the error, an expert at the University of Cambridge suggested:
"Even a high-school computing student would know that better alternatives to XLS exist. When you need to do something more serious, you build something bespoke that works - there's dozens of other things you could do. But you wouldn't use XLS."
While no one would dispute that using outdated tools can lead to modern problems, it would appear that many issues in fact contributed to the error.
As the investigation reveals, these issues extended beyond the decision to use Excel.
The cause of the error.
Blaming Excel does not paint the entire picture.
Before focusing on the technical issues which occurred in Excel, it is important to first contextualise the situation within which PHE found themselves.
Towards the beginning of the pandemic, partnerships with third-party labs were established (separate from PHE and NHS labs), in order to meet the rising demands for increased testing capacity across the UK.
These partnerships included universities, research institutes and companies such as Amazon and Boots.
This meant that existing PHE pipelines could not be relied on for complete data collection: either a solution that could be implemented alongside PHE's existing systems was required, or a completely new system would have to be developed.
Regardless of approach, any solution would likely have needed to respond to the following constraints:
⚠️ Time constraints
Any new pipelines or software would need to be developed then deployed within a very narrow time frame to start dealing with the inflow of testing data.
⚠️ Data constraints
Each lab would likely be using its own software to record test results, so a common data format for submission would need to be agreed (hence CSV was likely chosen).
⚠️ User constraints
Any tools or software that needed to be used by PHE users and commercial partners would need to require almost no new training - again due to time constraints.
"With all this considered, it is not difficult to understand why PHE decided to rely on existing methods for loading data through Excel, even if better tools were available."
Preventing the error.
Excel was the problem but not the cause.
With regard to the limitations of XLS, the following is true:
The 65,536 row limitation should have been acknowledged.
Newer versions of Excel (2007 onwards) could at least have been considered, given they are capable of handling up to 1,048,576 rows.
However, an investigation into the error concluded that, despite its limitations, XLS could have been used as part of a temporary solution while a more purpose-built system was developed.
XLS Solutions:
XLS is in fact capable of handling larger CSV files.
An appropriately written VBA extension could read one row at a time (rather than the entire file in one go), preventing rows from being excluded.
If a user needed to read the raw data, a sliding-window technique could be used to display around a hundred rows at a time.
Finally, a VBA extension achieving all of this could have been developed within a very short time frame (just a few hours).
Source: Dionysis Dimakopoulos
It would appear the root cause of the error had less to do with the choice to use XLS than with the manner in which it was used to load data.
Rather than implementing an automated process, it is understood that CSV files were being opened manually in Excel by a PHE user, before being loaded into the central database.
Based on a simulation of the likely process that occurred, it can be assumed that either:
The user would have been issued an on-screen warning of data truncation, yet clicked 'OK' regardless.
Or a developer who created a related VBA extension chose to suppress all messages from the VBA interpreter, in which case the user was never warned of possible data loss.
In both scenarios, this would be a case of human error.
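To make this concrete, the following is a minimal sketch (not PHE's actual process) of what an automated alternative could look like: rows are streamed one at a time, the row count is verified after loading, and any discrepancy raises an error that halts the pipeline rather than showing a dialog a user could dismiss. The load_row function, file handling and logging setup are placeholders.

```python
# A sketch of an automated, verifiable CSV load; load_row is a placeholder
# for the real insert into the central database.
import csv
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("csv_loader")

def load_row(row: dict) -> None:
    """Placeholder for the real database insert."""

def load_csv(csv_path: str) -> int:
    # Count the data rows up front so the load can be verified afterwards.
    with open(csv_path, newline="") as f:
        expected = sum(1 for _ in csv.reader(f)) - 1  # minus the header row
    loaded = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):  # streamed one row at a time
            load_row(row)
            loaded += 1
    if loaded != expected:
        # Fail loudly: an exception stops the pipeline and can trigger an alert,
        # rather than presenting a warning that could be clicked away.
        raise RuntimeError(f"{csv_path}: expected {expected} rows, loaded {loaded}")
    log.info("Loaded %d rows from %s", loaded, csv_path)
    return loaded
```

Because nothing in this process is interactive, there is no warning for anyone to click through.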
Data integrity is essential when dealing with medical data, especially when it could impact the spread of a disease. For this reason, I decided any solution going forward must attempt to eliminate any potential for human error.
Planning a Data Pipeline.
What does the data pipeline need to do?
After identifying the causes of the error, I began designing a data pipeline that would have prevented a similar situation from occurring.
The first step in this process involved conducting a requirements analysis.
i.e. Determining the list of specific requirements that need to be met by the data pipeline.
Requirements Analysis
⬜ Fast Deployment
Must be capable of deployment within a narrow time frame.
⬜ (Large) File Handling
Must be capable of processing a rapid inflow of large CSV files immediately.
⬜ Data Aggregation
Data from different sources must be aggregated into standardized formats before being stored in structured systems.
⬜ Automated
Processes must be fully automated and the potential for human error removed.
⬜ Testing
Automations and data validation procedures must be capable of being tested rigorously and regularly, to ensure the integrity of sensitive medical data (a sketch of one such testable check follows this list).
⬜ Data Transformations
Must be possible to apply a range of data transformations, for purposes including dashboards, monitoring, alert systems, and reporting.
⬜ Consistency & Speed
Must prioritise data consistency over availability - but must also be fast enough to meet the 48-hour target window for an alert system.
⬜ Modularity
Must be possible to add new data sources over time (as simply as possible).
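As an illustration of the Testing requirement above, the following is a minimal sketch of a data validation rule written so that it can be exercised by an automated (pytest-style) test suite; the required column names are assumptions:

```python
# A sketch of a testable validation rule: check that a CSV header contains the
# columns the pipeline expects. Column names are illustrative assumptions.
REQUIRED_COLUMNS = {"nhs_number", "test_date", "result"}

def missing_columns(header: list[str]) -> set[str]:
    """Return any required columns absent from a CSV header row."""
    return REQUIRED_COLUMNS - {name.strip().lower() for name in header}

def test_complete_header_passes():
    assert missing_columns(["NHS_Number", "Test_Date", "Result"]) == set()

def test_missing_column_is_reported():
    assert missing_columns(["nhs_number", "result"]) == {"test_date"}
```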
Designing a Data Pipeline
How will the pipeline address requirements?
The next step involved determining the logical design for the data pipeline.
i.e. Plotting the physical flow of data.
This was done by creating a system diagram outlining each of the various components.
This diagram would then serve as a point of reference when building out the pipeline.
System Diagram
Additionally, the diagram outlined a method for simulating the inflow of data through the use of a fake data generator.
Recommendation: Use a Secure Web Service
Commercial partners could have been instructed to upload CSV files to a secure web portal, rather than sending data to PHE through separate channels.
✅ Benefits of a secure web service
Simplify the process of setting up channels to commercial partners as the testing program expands.
Perform (automated) database-side validation checks across all incoming data from a singular entry point.
Create backup and logging procedures against a singular data source, so a complete historical data record can be maintained.
Provide a point of reference for commercial partners, with guidelines for data submission.
⚠️ Potential drawbacks
One concern with a tailor-made web service is the speed at which it could be set up.
There are ways for this type of service to be deployed relatively quickly, but difficulties could be encountered when needing to spin up a highly customised service within a narrow time frame.
It would therefore be advantageous to have a reproducible (or template) framework for such a service prepared in advance.
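As a rough sketch of what such a service might look like, the following is a minimal Flask endpoint that accepts a CSV upload from a partner, performs a basic header check at the single entry point, and stores the raw file under a timestamped path. Authentication, HTTPS and fuller validation are omitted, and the route, column names and landing directory are assumptions:

```python
# A sketch of a secure-upload entry point; everything here is illustrative.
from datetime import datetime, timezone
from pathlib import Path

from flask import Flask, abort, request

app = Flask(__name__)

LANDING_DIR = Path("landing")                                         # hypothetical staging location
EXPECTED_HEADER = ["nhs_number", "test_date", "result", "postcode"]   # assumed columns

@app.route("/upload/<partner_id>", methods=["POST"])
def upload_csv(partner_id: str):
    # NOTE: partner_id would need authenticating and sanitising in a real service.
    uploaded = request.files.get("file")
    if uploaded is None:
        abort(400, "No file supplied.")

    # Basic validation at the single entry point: check the header row.
    header = uploaded.stream.readline().decode("utf-8").strip().split(",")
    if header != EXPECTED_HEADER:
        abort(422, "Unexpected column layout.")
    uploaded.stream.seek(0)

    # Store the raw file with partner and timestamp encoded in its path,
    # giving downstream jobs a single, consistent place to collect from.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    destination = LANDING_DIR / partner_id / f"{stamp}.csv"
    destination.parent.mkdir(parents=True, exist_ok=True)
    uploaded.save(destination)
    return {"status": "received", "stored_as": str(destination)}, 201
```

Keeping validation and storage at this single entry point is what makes the backup, logging and aggregation procedures listed above straightforward to apply.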
Recommendation: Use a Data Lake
A data lake could have been used as a form of staging area for all incoming CSV files.
In this system, a data lake would aggregate incoming files across multiple sources and assign metadata tags such as date, time and commercial partner ID.
Batch scheduled transform and load procedures would then clean the data before loading to a structured data warehouse.
✅ Benefits of a Data Lake
Capacity to store large volumes of data in its native format, without risk of data loss.
Testing data could be submitted straight away, without the need for designing schemas, defining data structures and configuring transformations - which would otherwise take time.
Backup procedures can be implemented for preserving a full historical record of data - essential for guaranteeing accurate analysis and post-reporting capabilities.
The scalability of a data lake would allow for other data sources to be incorporated; providing scope for a more sophisticated system.
The ability to access all data in its native format would enable fast, flexible reuse of the data for a wide range of use cases, such as alert systems, dashboarding and reporting (i.e. an ELT approach).
⚠️ Potential drawbacks
High costs for cloud storage. This could be the case if the data lake cannot be self-hosted, or if appropriate storage management plans are not implemented.
Vendor lock-in. If the pipeline is developed to a high level of complexity with high availability required, this could potentially create data migration issues.
These potential drawbacks could be mitigated by implementing a routine compression and archiving procedure.
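As a sketch of the landing step described above, the following shows how an incoming CSV might be stored in a data-lake-style location in its native format, partitioned by date and partner, with the metadata tags written to a sidecar JSON file. All paths and field names are illustrative assumptions:

```python
# A sketch of landing a raw CSV in a data-lake-style store: the native file is
# kept untouched and metadata (date, time, partner ID) goes in a sidecar JSON.
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

LAKE_ROOT = Path("datalake/raw/test_results")  # hypothetical lake location

def land_file(source_csv: Path, partner_id: str) -> Path:
    received_at = datetime.now(timezone.utc)
    # Partition by date so batch transform/load jobs can pick up a day at a time.
    target_dir = LAKE_ROOT / received_at.strftime("%Y-%m-%d") / partner_id
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / source_csv.name
    shutil.copy2(source_csv, target)  # store in native format, unchanged

    metadata = {
        "partner_id": partner_id,
        "received_at": received_at.isoformat(),
        "original_name": source_csv.name,
        "size_bytes": target.stat().st_size,
    }
    target.with_suffix(".meta.json").write_text(json.dumps(metadata, indent=2))
    return target
```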
Creating placeholder data.
Testing a new system.
Once happy with the overall structure of the pipeline, I needed a way of testing that it was fit for purpose.
This required simulating the inflow of CSV files from the web service before loading them into the data lake.
Obviously I would not be able to get my hands on real COVID-19 testing data, so I had to write a Python script that could generate placeholder data in a similar format and volume to what was likely received by PHE.
This would involve understanding the formatting of specific data, for example:
NHS Number
10 numeric digits in length. The tenth digit is a check digit used to confirm its validity.
(UK) Postcode
Five to seven alphanumeric characters that can follow a number of different patterns.
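To make the check digit concrete, here is a small validation sketch assuming the standard Modulus 11 scheme used for NHS numbers (weights 10 down to 2 applied to the first nine digits):

```python
# A sketch of NHS number validation using the Modulus 11 check digit scheme.
def is_valid_nhs_number(value: str) -> bool:
    if len(value) != 10 or not value.isdigit():
        return False
    digits = [int(c) for c in value]
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    return check != 10 and check == digits[9]
```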
The following is an example of the type of data that would have likely existed, in tabular format:
Here, the same data is represented in CSV format:
A script was then written to generate the placeholder medical data, using the Python library Faker, which is useful for generating random addresses according to a specific locale.
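A generator along these lines might look roughly like the following sketch (not the original script): it uses Faker's en_GB locale for names and postcodes, and builds NHS-style numbers with the Modulus 11 check digit shown earlier. The column names, record count and output file name are illustrative assumptions:

```python
# A sketch of a placeholder-data generator (not the original script): Faker's
# en_GB locale supplies names and postcodes, and NHS-style numbers are built
# with a Modulus 11 check digit. All column names are illustrative.
import csv
import random

from faker import Faker

fake = Faker("en_GB")

def fake_nhs_number() -> str:
    """Generate a 10-digit number whose final digit is a Modulus 11 check digit."""
    while True:
        digits = [random.randint(0, 9) for _ in range(9)]
        total = sum(d * w for d, w in zip(digits, range(10, 1, -1)))
        check = 11 - (total % 11)
        if check == 11:
            check = 0
        if check != 10:  # a check value of 10 is invalid, so try again
            return "".join(map(str, digits)) + str(check)

def generate_csv(path: str, rows: int = 100_000) -> None:
    """Write placeholder test results; the default row count deliberately exceeds the old XLS limit."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["nhs_number", "name", "postcode", "test_date", "result"])
        for _ in range(rows):
            writer.writerow([
                fake_nhs_number(),
                fake.name(),
                fake.postcode(),
                fake.date_between(start_date="-14d", end_date="today").isoformat(),
                random.choice(["positive", "negative"]),
            ])

if __name__ == "__main__":
    generate_csv("placeholder_test_results.csv")
```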
If you would like to try running the original script and see a very simple recreation of the UK Covid Dashboard UI, you can have a go here.
Final proposal.
Closing thoughts.
Building a data pipeline, especially on the scale witnessed during the Covid-19 pandemic, presents significant difficulties and complexities.
Hindsight allows for a critical examination of mistakes, but addressing errors in real-time demands a high level of skill under pressure.
Despite encountering challenges, Public Health England (PHE) successfully assembled the most comprehensive and granular data platform across EU countries, providing valuable insights during the pandemic.
Despite the publicised errors, the platform served millions of end users efficiently, and that achievement stands as a testament to the expertise of the team responsible for developing the diverse pipelines behind the UK Covid dashboard.
Acknowledging the complexity and scale of the endeavor, the reported error is a minor setback considering the overall success of the project.
This case study was undertaken as a learning exercise; it underscores the challenges involved in building pipelines of such magnitude, and it does not come close to providing a realistic solution that accommodates the many facets and difficulties faced at the time.
For an analysis of some of the challenges faced, Pouria Hadjibagheri (MBE), the tech lead on the UK Covid-19 Dashboard, provides valuable insights in the following presentation, available here.