27/6/2020 by Theo Goutier

Does Big Data equal Good Info? Episode 1

An analogy for the Dobilo approach in businesses, using the worldwide available Covid-19 data

In businesses, we see that a lot of data is captured and stored nowadays, but how do you use such a large data-set wisely? As an analogy to show you what we would do, we have worked on the Covid-19 data. As a side effect, this also helps to make sense of the Corona media stories and reports we have seen flashing by day to day on deaths, tests, positive cases and R numbers, and of the measures that have, or have not, been put in place as a consequence to control the spread.

Today we learned from the WHO that the pandemic is out of control mainly in Latin America and the US (certain states only!), while in Europe and Australia/New Zealand the lock-down measures are being eased. People feel we could avoid a second wave, yet still fear one.

This is contradictory, to say the least. We at Dobilo are confronted with these kinds of ambiguities in the companies we serve as well: many snapshots of reports, mixed with false conclusions, that are also contradictory in themselves.

What we then do is look at the data and, together with everyone around, start asking questions based on observations before concluding anything, let alone acting upon early 'conclusions'. W. Edwards Deming said it right: "Without data, you're just another person with an opinion." In this quote, Deming may have missed out the words 'properly checked' before the word 'data'.

In this blog, we would like to show step by step, using the available Covid-19 data (source: https://ourworldindata.org/coronavirus), how we observe the data to find the best route to investigate and substantiate any hypothesis. This way, companies can better judge which factors are really relevant, which factors are not in their control, and thus what to do to control the outcomes of their processes.

Meaningful Covid-19 data?

First, we need to look at the full data-set as provided, to understand what is in it versus what we would like to (be able to) learn from it.

We also need to be wary: measurements need to be gauged to see if a particular system (in companies: ERP, CRM, BI, PLC, or other) really measures what we want to be measured, and in the case of Covid-19 data we know this is an issue. Different countries use different ways of measuring, for instance in the recording of cases and certainly in the reporting of Covid-19 deaths! Also, testing for Covid-19 will lead to false positives and false negatives. We know that the test result for people who have had Covid-19 can easily be a 'false' positive, as the test for the virus cannot distinguish between dead and live virus RNA. False negatives are a lot less likely but still possible if the test used cannot detect smaller amounts of the virus that could still make the person ill later or infect others (see https://www.modernhealthcare.com/technology/accuracy-covid-19-tests-still-largely-unknown).

Anyway, we will just take the data as it is available and come back to the reliability of the data later. Let's start!

The data we have on Covid-19

The table is a lot bigger (see OurWorldInData), but we have summarised and computed some ratios to make some first sense out of it. Below are the data from 2020 on the 20th of each month. In the pivot of the raw data below, we can also drill down by continent and country.

Something one can immediately note when drilling down is that a country called 'World' exists, which seems to be the sum of all 'real' countries. Also clear from the above table is that the number of total tests was not reported, or poorly reported, in the first months: a Case % of 439% cannot be accurate (the reported total tests were below the reported cases until March).
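A first sanity check like this can be sketched in a few lines of Python. This is a minimal sketch, not the tooling we used: the rows are made up for illustration (they are not OurWorldInData figures, although one is chosen so that Case % comes out at 439%), and the field names `location`, `total_cases` and `total_tests` are assumed.

```python
# Illustrative rows only: the numbers are made up, not OurWorldInData figures.
# The field names (location, date, total_cases, total_tests) are assumptions.
rows = [
    {"location": "World", "date": "2020-02-20", "total_cases": 900, "total_tests": None},
    {"location": "Exampleland", "date": "2020-02-20", "total_cases": 439, "total_tests": 100},
    {"location": "Exampleland", "date": "2020-04-20", "total_cases": 2000, "total_tests": 40000},
    {"location": "Samplestan", "date": "2020-04-20", "total_cases": 500, "total_tests": None},
]

AGGREGATES = {"World"}  # assumed label(s) of summary rows that are not real countries


def split_rows(rows):
    """Separate usable rows from rows that need a closer look."""
    usable, suspect = [], []
    for row in rows:
        if row["location"] in AGGREGATES:
            continue  # skip the aggregate so nothing is counted twice
        tests = row["total_tests"]
        if tests is None or tests == 0:
            suspect.append((row, "tests not reported"))
        elif row["total_cases"] > tests:
            suspect.append((row, "more cases than tests"))  # Case % above 100%
        else:
            usable.append(row)
    return usable, suspect


usable, suspect = split_rows(rows)
for row, reason in suspect:
    print(row["location"], row["date"], reason)
```

The point is not to delete the suspect rows, but to know which ones they are before any percentage is computed from them.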

What data do we have in the set? All figures are cumulative by date from the end of 2019:

  • Total Cases as reported
  • Total Deaths as reported
  • Deaths / cases reported (= Mortality rate %)
  • Cases / tests reported (= Case %)
  • Total Tests reported
  • Tests / population (per million or thousand)
  • Certain data on the country like population size and % of people older than 65 / 75 years.
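The ratios in this list are simple divisions of cumulative totals. As a minimal sketch (the field names and the example numbers are assumptions for illustration, not values from the data-set), they could be computed like this:

```python
def pct(numerator, denominator):
    """Return numerator / denominator as a percentage, or None if it cannot be computed."""
    if numerator is None or not denominator:
        return None  # missing value or zero denominator: no ratio at all
    return 100.0 * numerator / denominator


def kpis(row):
    """Compute the three ratios for one country on one date (cumulative totals)."""
    return {
        "mortality_rate_pct": pct(row["total_deaths"], row["total_cases"]),
        "case_pct": pct(row["total_cases"], row["total_tests"]),
        "tested_pct": pct(row["total_tests"], row["population"]),
    }


# Made-up example row, not real data.
example = {"total_cases": 5000, "total_deaths": 250,
           "total_tests": 100000, "population": 2000000}
result = kpis(example)
print(result)

# A country that does not report tests gets no Case %, rather than a misleading one.
no_tests = dict(example, total_tests=None)
result_no_tests = kpis(no_tests)
print(result_no_tests)
```

Returning `None` instead of 0 for a missing ratio is a deliberate choice in this sketch: a zero would silently end up in averages and plots, while a missing value stays visible as a gap.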

We look at the totals to see if the data is consistent or not:

The above chart (Minitab time series plots) shows the computed percentages (connected dots by day) by continent, cumulatively (confusingly, OurWorldInData calls the variable 'Country'). Each continent is a set of countries. 'World' is provided as a separate entry in the data-set, and OurWorldInData has not aggregated the number of tests for it, as not all countries have reported this (fully and/or daily), which we will see confirmed later as well.

Observations on the above:

  • Case % by continent is very different in both trends and values
  • Case % in Asia was well above 50% until early April
  • Case % in Europe was jumpy in the first months, then went up and started coming down
  • Mortality rate has been more constant and except for Europe higher than the world, rising to 7% and easing to 5% over the last month (the figures are cumulative, so fewer people seem to have died in recent weeks than earlier on)
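The last observation relies on how cumulative ratios behave: if the cumulative mortality rate falls, the rate over the most recent period must be lower than the cumulative rate before it. A small worked sketch, with made-up round numbers chosen only to mirror the 7% and 5% above:

```python
def period_rate(cases_start, deaths_start, cases_end, deaths_end):
    """Mortality rate (%) over one period, derived from two cumulative snapshots."""
    new_cases = cases_end - cases_start
    new_deaths = deaths_end - deaths_start
    if new_cases <= 0:
        return None  # no new cases in the period, so no rate can be computed
    return 100.0 * new_deaths / new_cases


# Made-up snapshots: cumulative rate 70 / 1000 = 7% first, 100 / 2000 = 5% a month later.
rate_last_month = period_rate(1000, 70, 2000, 100)
print(rate_last_month)  # 30 new deaths on 1000 new cases
```

In this illustration the cumulative rate only eases from 7% to 5%, while the rate for the last month alone is 3%. Cumulative figures therefore understate how much the recent period differs from the earlier one.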

We thus have to understand here that some countries report in better or worse ways (sometimes not at all). In businesses, we see this with different factories, sales units, countries, locations, departments, and even people within the same department too. So let's zoom in on Europe and then on specific countries before concluding anything.

To be clear, the above chart omits six countries, such as the Isle of Man, Guernsey, Jersey, and the Faroe Islands (which raises a question too!). We have done this just to keep the graph legible for a first scan. What strikes us most immediately is that not every country reports the total number of tests alongside the total number of cases (no red lines appear in the plots for Bosnia, France, and Kosovo). Further observations:

  • Mortality rates split into countries below and countries above the average of 7%
  • Case %, where recorded, went up in most countries from Feb until early April and then came down
  • Mortality rate seems to move without interruptions, while Case % goes up and down through the weeks

Let's have a closer look at Austria, Belgium, Italy, Greece, the Netherlands, and the UK!

Observations:

  • The Netherlands is the only country of these six not to report the total cases daily
  • The UK only started reporting total tests as of April 26th
  • Mortality rate in Greece and Austria stayed below 7%, while NL, UK, and IT are well above 10%, and BE goes up to 16%
  • Case % in Austria dropped abruptly on April 10th: the total tests jumped around that day from less than 5 to 30 per 1000 inhabitants. Larger testing capacity became available; hence Case % dropped (before that date, tests were probably mostly performed on health care workers and elderly people, thus finding more positive cases percentage-wise)
  • At best, 8% of the population has been tested in some of these countries (total tests vs. total population since the beginning!)

So we can't aggregate the data per continent, as some countries have not reported certain data or have reported it inconsistently. This is the first small but solid conclusion we can draw. It surely poses a danger when incomplete data is stacked and the overall picture shows trends that are partly or completely driven by such inconsistencies!
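To illustrate the danger, here is a minimal sketch with two made-up countries, one of which does not report tests. The names, numbers and field names are assumptions for illustration only.

```python
# Two made-up countries in one imaginary continent; B does not report its tests.
countries = [
    {"location": "A", "total_cases": 1000, "total_tests": 20000},
    {"location": "B", "total_cases": 3000, "total_tests": None},
]


def naive_case_pct(rows):
    """Stack everything: missing tests silently count as zero."""
    cases = sum(r["total_cases"] for r in rows)
    tests = sum(r["total_tests"] or 0 for r in rows)
    return 100.0 * cases / tests if tests else None


def consistent_case_pct(rows):
    """Use only countries that report both cases and tests, and say how many those are."""
    complete = [r for r in rows if r["total_tests"]]
    cases = sum(r["total_cases"] for r in complete)
    tests = sum(r["total_tests"] for r in complete)
    value = 100.0 * cases / tests if tests else None
    return value, len(complete), len(rows)


naive = naive_case_pct(countries)
consistent, used, total = consistent_case_pct(countries)
print(f"naive: {naive}%  consistent: {consistent}% (based on {used} of {total} countries)")
```

The naive stack gives a Case % of 20%, four times the 5% of the only country that actually reports its tests, purely because country B's cases are counted while its tests are not. Reporting how many countries the figure is based on makes the limitation visible instead of hiding it in the total.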

Observations IT/NL compared to Belgium:

  • Italy’s data trend looks the same as Belgium’s, and Italy is slightly ahead on testing (8% vs. 7% of the population tested so far)
  • The Netherlands: only 3% of the population has been tested
  • Mortality rates in Italy and the Netherlands are a little lower than in Belgium
  • Both also have some data missing for the last days in the data-set
  • Feel free to alert us on something you observe!

To conclude on what the data-set can do, two more observations (not in the graphs): Brazil does not report the number of tests performed, which skews the data, and the mortality rates of Brazil and the US pull the average down to 7%. Again, all this indicates different measurement systems (ways of recording) across countries.

We need to understand this before we can conclude anything substantial. Is the data-set worthless? No, not at all, but we have to understand these limitations:

  1. Countries report differently, and some omit certain data (tests mainly)
  2. The data itself does not say anything about the quality of the data. How could we gauge/calibrate this data? The only example we have is the reported deaths: comparing them with the weekly reported excess deaths, whose pattern and numbers are very comparable for both countries, we can clearly see that the Netherlands and Belgium report Covid-19 deaths differently. The Netherlands might be under-reporting Covid-19 deaths, while Belgium even reported suspected Covid-19 deaths without a test result. Here's a good article supporting this view: https://multimedia.tijd.be/oversterfte/
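Gauging reported deaths against excess deaths can be sketched as follows. This is only an illustration of the idea under stated assumptions: the weekly figures are made up, the two countries are hypothetical, and the expected (baseline) deaths are simply given rather than estimated from earlier years.

```python
# Made-up weekly figures for two hypothetical countries with identical excess deaths.
weekly = {
    "Country X": {"observed": [3500, 4200, 5000], "expected": [3000, 3000, 3000],
                  "reported_covid": [500, 1200, 2000]},
    "Country Y": {"observed": [3500, 4200, 5000], "expected": [3000, 3000, 3000],
                  "reported_covid": [250, 600, 1000]},
}


def coverage(data):
    """Share of excess deaths that is explained by the reported Covid-19 deaths."""
    excess = sum(o - e for o, e in zip(data["observed"], data["expected"]))
    reported = sum(data["reported_covid"])
    return reported / excess if excess > 0 else None


ratios = {name: coverage(data) for name, data in weekly.items()}
for name, ratio in ratios.items():
    print(name, ratio)
```

With the same excess deaths, Country X reports Covid-19 deaths that cover all of the excess, while Country Y covers only half. A gap like that points to a difference in the way of recording rather than in the epidemic itself, which is exactly the kind of calibration a raw data-set cannot provide on its own.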

Furthermore, in this data-set, we have bigger and smaller countries. For instance, for Brazil and the US, it would have been good to have the data by state or province and, to really be able to interpret it, maybe even to have every big city in the world in the data-set, but we don't. The granularity of the data also plays a role in being able to draw conclusions, i.e. to understand the significant factors behind the ‘Key Performance Indicators’ (KPIs) computed from the data. Usually, the more granular data is captured in more local systems, such as (Excel) sheets. No doubt we could find Covid-19 data on cities too, but collating all of this data and cleaning it up to make it comparable would be a significant exercise (to maybe explore in a much later episode?).

In the next episode, we will go into a lot more depth on this data-set, accelerate the observations, and maybe carefully draw some first real conclusions. The first question is whether the Case % has been going up again lately (let's look more closely next week!). We all fear the second wave and need to understand what is happening in the Americas. And are Germany, South Korea and/or Iran really seeing signs of a second wave already?!

Our next episode will be posted next week, so keep an eye on it!