4/7/2020 by Theo Goutier
Does Big Data equal Good Info? Episode 2
An analogy for the Dobilo approach in businesses, using publicly available worldwide Covid-19 data
We'll continue the exploration of our Big Data on Covid-19 using the same data-set (source https://ourworldindata.org/coronavirus) now that we have a better sense of the structure, granularity and reliability of this particular database (see Episode 1: https://www.dobilo.com/news/blog).
In business contexts, we need to take this step extremely carefully to be sure the data is understood, because we might not have all the range and detail we would need to draw solid conclusions.
For example, some years ago we analysed a customer's big data from a centralised cash system rolled out at their 300+ restaurants over a year. This system did not, of course, capture potential customers who decided not to buy anything or who walked by to go elsewhere. We could call that data-blindness, and it forces us to include other data sources or set up new measurements if that is what we would like to know. A well-known public case is this one from World War II, on deciding where to reinforce bombers by analysing the damage on the returned(!) bombers: https://www.trevorbragdon.com/blog/when-data-gives-the-wrong-solution
What Covid-19 information can we get out of this data-set?
The best way to explore the data is to state hypotheses and then test them against what we have. Hypotheses can be drawn from observations in our early explorations, but they can come from virtually anyone. Remember that there are certain hypotheses we might not be able to test with the data we have. We need to know what portion of reality the data lets us look at. We might just be looking through a straw at a 65-inch TV screen...
So let's explore whether we can test the recent hypothesis that signs of a second wave of Coronavirus infections are starting to show in certain countries. Below we've zoomed in on the last days of the COVID-19 data (now updated until July 1st).
The worldwide spike on July 1st most likely comes from countries not reporting test data (not at all, not frequently, or not swiftly enough). Again, we'll need to zoom in before concluding anything, so let's start with the elephant in the room: the USA. In the plots below we can see the case % is still dropping but flattening out.
The number of tests has increased enormously: from 3 per thousand inhabitants in early March to almost 1 in 10 today. Still, the people tested are not likely to be a random sample of the entire population. Since the people tested had, or suspected they had, symptoms, we can safely say the share of US citizens infected today is lower than 10%. How much lower, we cannot say! So let's look at the newly confirmed cases to spot whether infections are going up again.
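One way to spot whether new cases are going up again is to smooth the daily counts and compare the latest level with the lowest level seen so far. The sketch below is only an illustration: the numbers are made up (not real Covid-19 figures), and the 7-day window and 20% margin are assumptions, not values used for the plots in this blog.

```python
import pandas as pd

# Illustrative numbers only (NOT real Covid-19 figures): daily new confirmed cases.
daily = pd.Series(
    [300, 280, 310, 250, 240, 260, 230, 220, 210, 230, 200, 190, 210, 220,
     240, 260, 250, 290, 310, 330, 360],
    index=pd.date_range("2020-06-01", periods=21, freq="D"),
    name="new_cases",
)

# A 7-day rolling mean smooths out day-to-day reporting noise.
smoothed = daily.rolling(window=7).mean().dropna()

# Compare the latest smoothed value with the lowest smoothed value so far.
trough = smoothed.min()
latest = smoothed.iloc[-1]
rising_again = latest > 1.2 * trough  # the 20% margin is an arbitrary assumption

print(f"trough={trough:.1f} latest={latest:.1f} rising_again={rising_again}")
```

With real data, the series would come from the new-cases column of the data-set instead of the hand-typed list.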
So we do see new cases going up again! The question now is: where? Is New York suffering again, or is the pandemic spreading to other states or cities? The infographic map below shows that most states are either over the bump or seeing the acceleration for the first time. (source: https://coronavirus.jhu.edu/data/new-cases-50-states )
Let's zoom in on Texas and Florida to see exactly where it is happening. Below are the 14 counties, out of 300+ across Texas and Florida, that have already reached 5,000 confirmed cases (the first dot marks when that county hit the 5K mark). All other counties are below 5K; another 17 counties have reached the 2K mark but not yet the 5K mark.
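The selection described above (which counties passed a threshold, and on which date they first did so) can be sketched as follows. This is a minimal illustration under stated assumptions: the county names, dates and counts are hypothetical, and the column names `county`, `date` and `confirmed` are our own choice, not those of the source data.

```python
import pandas as pd

# Illustrative rows only (hypothetical counties and counts, NOT real data):
# one row per county per date with the cumulative number of confirmed cases.
records = [
    ("County A", "2020-06-01", 4200),
    ("County A", "2020-06-08", 5100),
    ("County A", "2020-06-15", 6400),
    ("County B", "2020-06-01", 1500),
    ("County B", "2020-06-08", 2100),
    ("County B", "2020-06-15", 2900),
    ("County C", "2020-06-01", 800),
    ("County C", "2020-06-08", 950),
    ("County C", "2020-06-15", 1200),
]
df = pd.DataFrame(records, columns=["county", "date", "confirmed"])
df["date"] = pd.to_datetime(df["date"])


def first_date_reaching(frame: pd.DataFrame, threshold: int) -> pd.Series:
    """Return, per county, the first date on which `confirmed` >= threshold."""
    reached = frame[frame["confirmed"] >= threshold]
    return reached.groupby("county")["date"].min()


hit_5k = first_date_reaching(df, 5000)
hit_2k = first_date_reaching(df, 2000)
# Counties that passed the 2K mark but have not (yet) passed the 5K mark.
between_2k_and_5k = hit_2k.index.difference(hit_5k.index)

print(hit_5k)
print(list(between_2k_and_5k))
```

In this toy data only County A has hit the 5K mark, County B sits between 2K and 5K, and County C is below both.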
Let's zoom in on Potter County (Amarillo): how did they keep the curve flat after the spike of new cases on May 16th? Amarillo keeps a nice dashboard where you can even drill down to ZIP code: https://covid-data-amarillo.hub.arcgis.com/
So the spike is explained, but why have they been able to keep this 'outbreak' under control since? You would expect that, once testing capacity has increased significantly, even more cases would be found soon after. Could it be that this spike contains a lot of 'false' positives? Remember from the first episode of this blog that the tests cannot distinguish between live and dead virus: they just detect viral genetic material (RNA, not DNA). So could these 734 people, found in 12,313 tests, have already recovered from the infection and no longer been contagious? Another, less likely, explanation could be that reporting was knowingly or unknowingly manipulated in these counties. Furthermore, we found nothing about Amarillo taking more extreme lock-down measures straight after May 16th.
So the more we zoom in, the better we can see what is happening and where, even down to postal code areas, and we can even link certain events (the testing capacity increase on May 16th) to what we have observed. That in turn raises other questions that are much harder to prove right or wrong. The same applies in companies: for instance, a monthly report may say the average sales per order increased from July to August, but it might be just one big order tipping it into August, depending on whether the invoicing date was July 31st or August 1st.
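The invoicing-date effect is easy to reproduce with a handful of orders. The sketch below uses hypothetical order values (not customer data) and a helper, `average_per_month`, that we introduce only for this illustration; the only thing that differs between the two runs is whether the big order is invoiced on July 31st or August 1st.

```python
from datetime import date
from statistics import mean

# Hypothetical order values; only the invoicing date of the big order differs.
regular_orders = [
    (date(2020, 7, 10), 100.0),
    (date(2020, 7, 20), 120.0),
    (date(2020, 8, 5), 110.0),
    (date(2020, 8, 18), 90.0),
]
BIG_ORDER_VALUE = 2000.0


def average_per_month(orders):
    """Average order value per calendar month, keyed by month number."""
    by_month = {}
    for day, value in orders:
        by_month.setdefault(day.month, []).append(value)
    return {month: mean(values) for month, values in by_month.items()}


invoiced_july_31 = average_per_month(
    regular_orders + [(date(2020, 7, 31), BIG_ORDER_VALUE)]
)
invoiced_august_1 = average_per_month(
    regular_orders + [(date(2020, 8, 1), BIG_ORDER_VALUE)]
)

print("Invoiced July 31st:", invoiced_july_31)
print("Invoiced August 1st:", invoiced_august_1)
```

With the big order on July 31st the monthly averages fall from July to August; with the same order on August 1st they rise sharply, although nothing in the underlying business has changed.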
Now back to our hypothesis. Do we see signs of a second wave in other countries like Germany, Iran or the UK?
Case closed!?! There seems to be no second wave starting anywhere yet, although from far above the US seemed to be getting one. As a country, yes, it is seeing cases rise, but these are all new local outbreaks!
Now what can we do to prevent a real 2nd wave?
We would first need to understand all the factors that play a role and how we can best control the ones we can influence. We know keeping distance is a factor, but what distance is safe? Even national measures vary from 1 to 2 meters! What about climate factors (humidity and temperature), and what about ventilation and personal hygiene?
Let's see if we can use the data provided, or other publicly available data, to determine the most significant factors. It is like in business: once we have a dashboard with KPIs, we want to get our foot on the 'throttle' to improve them and our hands on the 'steering wheel' to stay on the road in the meantime.
Look out for our next blog next week!