Excel: A data pipeline that lost thousands of Covid-19 test results

And trying to design a solution.

28th December 2023 | Richard Honour

On Monday 5th October 2020, numerous UK newspapers began reporting on a major error that had occurred within a vital Covid-19 data pipeline.

The pipeline in question was connected to various government dashboards used for monitoring testing capacity, infection rates and other metrics.

Importantly, it was also connected to the NHS’s test-and-trace system.

The error caused tens of thousands of testing records to be lost and more than 15,000 positive cases of COVID-19 infections to not be included in daily statistics.

As a result, thousands of people who had come in contact with infected individuals were not alerted or instructed to self-isolate, potentially furthering the spread of the virus.

Reports were quick to blame Public Health England (PHE), the government body responsible for the collection of testing data at that time.

But the blame was targeted specifically at PHE's use of one particular tool which had been used as part of an ETL process: Excel.

What happened?

The cause of the error.

Preventing the error.

Planning a Data Pipeline.

Designing a Data Pipeline

Creating fake medical data.

Testing a new system.

Final proposal.

Closing thoughts.

Total Data Loss

From 25th Sept to 2nd Oct:

15,841 unreported positive test cases

8 days of incomplete data

1,980 cases on average missed per day

48-hour ideal time limit for tracing contacts following a positive test was missed

Source: BBC/PHE/gov.uk

What happened?

The drawbacks of using older tools.

It was reported the error was caused by an older version of Microsoft Excel, XLS, used during a data loading procedure.

Allegedly, the loading process involved opening incoming CSV files within Excel, before loading the columnar data to the central PHE database.

During this process, larger CSV files (eclipsing the 65,536 row limit of XLS) appear to have been truncated, meaning thousands of rows could have potentially been lost during each load process.

In response to the error, an expert at the University of Cambridge suggested:

"Even a high-school computing student would know that better alternatives to XLS exist."

"When you need to do something more serious, you build something bespoke that works - there's dozens of other things you could do. But you wouldn't use XLS."

"As an aspiring data engineer, I thought this case provided a good basis for a project, so I set about designing a data pipeline that would have avoided the same mistake from happening. But I soon discovered a number of factors had in fact lead to the resulting error - and they encompassed more than just the decision to use Excel..."

The cause of the error.

Blaming Excel does not paint the entire picture.

Before focusing on the technical issues which occurred in Excel, it is important to first contextualise the situation within which PHE found themselves.

Towards the beginning of the pandemic, partnerships with third-party labs were established (separate from PHE and NHS labs), in order to meet the rising demands for increased testing capacity across the UK.

These partnerships included universities, research institutes and companies such as Amazon and Boots.

This meant that existing PHE pipelines would not be relied on for complete data collection and either a solution that could be implemented alongside PHE's existing systems was required, or a completely new system would have to be developed.

Regardless of approach, any solution would likely have needed to respond to the following constraints:

⚠️ Time constraints

Any new pipelines or software would need to be developed then deployed within a very narrow time frame to start dealing with the inflow of testing data.

⚠️ Data constraints

Each lab would likely be using its own software to record test results; so a common data format for submission would need to be agreed (hence why CSV was likely chosen).

⚠️ User constraints

Any tools or software that needed to be used by PHE users and commercial partners would need to require almost no new training - again due to time constraints.

"With all this considered, it is not difficult to understand why PHE decided to rely on existing methods for loading data through Excel, even if better tools were available."

Preventing the error.

Excel was the problem but not the cause.

In regards to the limitations of XLS, the following is true:

The 65,536 row limitation should have been acknowledged.
And even newer versions of Excel (2007- onwards) could have at least been considered, given they are are capable of handling up to 1,048,576 rows.

However, an investigation into the error concluded that despite it's limitations XLS could have been used as part of a temporary solution while a more purpose-built system was put into development.

XLS Solutions:

XLS is in fact capable of handling larger CSV files.

By implementing an appropriately written VBA extension to read one row at a time (rather than the entire file in one go) this would prevent rows from being excluded.

Also, if a user had a reason for needing to read the raw data, a sliding window technique could be used that would display around a hundred rows.

Finally, an appropriate VBA extension that achieves all of this could have been developed within a very small time frame (just a few hours).

Source: Dionysis Dimakopoulos

It would appear the root cause of the error had less to do with the choice to use XLS but rather the manner in which it had been used to load data.

Rather than implementing an automated process, it is understood that CSV files were being opened manually in Excel by a PHE user, before being loaded into the central database.

Based on a simulation of the likely process that occurred, it can be assumed that either:

The user would have been issued an on-screen warning of data truncation, yet still clicked 'ok' regardless.
Or a developer who created a related VBA extension chose to suppress all messages shown from the VBA interpreter and the user was then not warned of possible data loss.

In both scenarios, this would be a case of human error.

Data integrity is essential when dealing with medical data, especially when it could impact the spread of a disease. For this reason, I decided any solution going forward must attempt to eliminate any potential for human error.

Planning a Data Pipeline.

What does the data pipeline need to do?

After identifying the causes of the error, I began the process of designing a data pipeline that would have avoided a similar situation from occurring.

The first step in this process involved conducting a requirements analysis.

i.e. Determining the list of specific requirements that need to be met by the data pipeline.

Requirements Analysis

⬜ Fast Deployment

Must be capable of deployment within a narrow time frame.

⬜ (Large) File Handling

Must be capable of processing a rapid inflow of large CSV files immediately.

⬜ Data Aggregation

Data from different sources must be aggregated into standardized formats before being stored in structured systems.

⬜ Automated

Processes must be fully automated and the potential for human error removed.

⬜ Testing

Automations and data validation procedures must be capable of being tested vigorously and regularly; to ensure the integrity of secure medical data.

⬜ Data Transformations

Must be possible for a range of data transformations to be made, for purposes including: dashboards, monitoring, alert systems, and reporting.

⬜ Consistency & Speed

Must prioritise data consistency over availability - but must also be fast enough to meet the 48-hour target window for an alert system.

⬜ Modularity

Must be possible to add new data sources over time (as simply as possible).

Designing a Data Pipeline

How will the pipeline address requirements?

The next step involved determining the logical design for the data pipeline.

i.e. Plotting the physical flow of data.

This was done by creating a system diagram; outlining each of the various components.

This diagram would then serve as a point of reference when building out the pipeline.

System Diagram

Additionally, the diagram outlined a method for simulating the inflow of data through the use of a fake data generator.

*The tool used to build this diagram can be found here

Recommendation: Use a Secure Web Service

Commercial partners could have been instructed to upload CSV files to a secure web portal, rather than sending data to PHE through separate channels.

✅ Benefits of a secure web service

- Simplify the process of setting up channels to commercial partners as the testing program expands.

- - Perform (automated) database-side validation checks across all incoming data from a singular entry point.
  - Create backup and logging procedures against a singular data source, so a complete historical data record can be maintained.
  - Provide a point of reference for commercial partners, with guidelines for data submission.

⚠️ Potential drawbacks

One concern regarding the use of a tailor-made web service concerns the speed in which this service could be set up.

There are ways for this type of service to be deployed relatively quickly, but difficulties could be encountered when needing to spin-up a highly customised service within a narrow time frame.

It would therefore be advantageous to have a reproducible (or template) framework for such a service prepared in advance.

Recommendation: Use a Data Lake

A data lake could have been used as a form of staging area for all incoming CSV files.

In this system, a data lake would aggregate incoming files across multiple sources and assign metadata tags such as date, time and commercial partner ID.

Batch scheduled transform and load procedures would then clean the data before loading to a structured data warehouse.

✅ Benefits of a Data Lake

Capacity to store large volumes of data in its native format, without risk of data loss.
Testing data could be submitted straight away, without the need for designing schemas, defining data structures and configuring transformations - which would otherwise take time.
Backup procedures can be implemented for preserving a full historical record of data - essential for guaranteeing accurate analysis and post-reporting capabilities.
The scalability of a data lake would allow for other data sources to be incorporated; providing scope for a more sophisticated system.
The ability to access all data in its native format would enable fast, flexible reuse of the data for a wide range of use cases (e.g. alert systems, dashboarding and reporting) (i.e. ELT).

⚠️ Potential drawbacks

High costs for cloud storage. This could be the case if it is not possible for the data lake to be self hosted, or appropriate storage management plans are not implemented.

Vendor lock-in. If the pipeline is developed to a high level of complexity with high availability required, this could potentially create data migration issues.

These potential drawbacks could be mitigated by implementing a routine compression and archiving procedure.

Creating fake medical data.

Testing a new system.

Once happy with the overall structure of the pipeline, I needed a way of testing it was fit for purpose.

This required simulating the inflow of CSV files from the web service before loading them into the data lake.

Obviously I would not be able to get my hands on real Covid-19 testing data, so this meant having to write a Python script that could generate fake data in a similar format and volume as what was likely received by PHE.

This would involve understanding the formatting of specific data, for example:

NHS Number
10 numeric digits in length. The tenth digit is a check digit used to confirm its validity.
(UK) Postcode
Five to seven alphanumeric characters that can follow a number of different patterns.

The following is an example of the type of data that would have likely existed, in tabular format:

Note: All data has been randomly generated.

Here, the same data is represented in CSV format:

The following is the resulting script written for generating the placeholder medical data.

The Python library Faker was used, which is useful for generating random addresses accoding to a specific locational format.

If you would like to try running this code and see a very simple recreation of the UK Covid Dashboard UI, you can have a go here.

Final proposal.

Closing thoughts.

Building a data pipeline, especially on the scale witnessed during the Covid-19 pandemic, presents significant difficulties and complexities.

Hindsight allows for a critical examination of mistakes, but addressing errors in real-time demands a high level of skill under pressure.

Despite encountering challenges, Public Health England (PHE) successfully assembled the most comprehensive and granular data platform across EU countries, providing valuable insights during the pandemic.

The platform, despite facing publicized errors, served millions of end users efficiently and the achievement stands as a testament to the expertise of the team responsible for developing the diverse pipelines contributing to the UK Covid dashboard.

Acknowledging the complexity and scale of the endeavor, the reported error is a minor setback considering the overall success of the project.

This case study, utilized as a learning exercise, underscores the importance of evaluating challenges in building pipelines of such magnitude and is not likely to come close to providing a realistic solution that accommodates the various facets and challenges that were faced at the time.

For an analysis of some of the challenges faced, Pouria Hadjibagheri (MBE), the tech lead on the UK Covid-19 Dashboard, provides valuable insights in the following presentation, available here.

Sources

[City, University of London] What really caused the Excel error in NHS Test and Trace COVID-19 system? An in-depth technical analysis.[Gov.uk] Capacity for pillars 1 and 2.[Gov.uk] Coronavirus Dashboard.[BBC] Excel: Why using Microsoft's tool caused Covid-19 results to be lost.[Guardian] Covid: how Excel may have caused loss of 16,000 test results in England.[Mail] What COVID numbers SHOULD have said: Excel bungle masked daily cases hitting 11,000...[Mirror] 16,000 coronavirus tests went missing because of Excel spreadsheet blunder.[Mirror2] 50,000 Brits with possible coronavirus not contacted due to Microsoft Excel blunder.[Sky-News] Coronavirus: Data can save lives, data can cost lives - and this latest testing blunder will likely prove it.[Telegraph] The 'Excel error' that led to 16,000 missing coronavirus cases.

Looking for work.

Reach out to me anytime, I reply quickly.

Hopefully I'm a great fit for what you're looking for.

Illustration by Pablo Stanley