2/9/2020 by Theo Goutier
Data-blindness. Can’t we see the sea for the Covid-19 waves?
Do we run even more risk of misinterpretation in this era of Big Data and AI?
Dobilo has built up strong expertise and experience in successfully applying Big Data analysis, and has also started working with AI, in the context of running your business. This expertise is on offer to you as a business; see the end of this blog.
As a follow-on from our Covid-19 analogy (see our three earlier blog episodes), we pick up again today with the talk of second waves hitting us. Are we looking at the whole 'stormy sea'?
In business we sometimes say 'we can't see the forest for the trees'. Don't we then also risk drowning in these oceans of data, or swimming against the tide instead of along with it?
In the second episode, we already alluded to the risk of data-blindness with the famous WWII story of the returning bombers: the planes were reinforced where they least needed it, because the 'data' from the planes that did NOT return was missing. In business too, we've seen numerous examples of data-blindness. One of our clients collected ticket details from all 300 of their store cash points in a centralised system. That huge amount of data, however, could never tell our client how many potential customers were missed: people who, for various unknown reasons, never entered the stores at all. Increasing the spend per customer matters less if you can attract more customers, even if each of them spends less than the average. If additional sales are what you are after, you first need to complete the picture with additional data gathering.
It is very much like watching a tennis match on your television through a tube 10 cm wide. Good luck keeping your eye on the ball!
Random sampling is critical
Back to Covid-19. What is the data-blindness here? What one learns is that when the population is exceptionally large, or its size cannot be known, you need to do random sampling. As few as 30 truly randomly selected points will already give a fairly reliable outcome.
Imagine a forest of one million trees, and you are asked to measure the average thickness of the trees. Measuring all of them would take far too much time, but simply measuring the first 30 trees you encounter would give a biased outcome, as trees at the rim of the forest might be different. Let us show how it would, or would not, work: visualise the 1,000,000-tree forest as a large square. Then let's pretend we know the thickness of each of these trees, but the researchers do not. The real average thickness, known to us, is 79.36 cm; the minimum is 10 cm and the maximum 150 cm.
From all the tree thicknesses we took samples of 10 different sizes, ranging from 3 to 30,000. If you sample only 3 trees, the chance of picking only small or only big ones is quite large. With 30, the risk of being far off already shrinks. The graph below shows the ranges found, where the rectangle marks the most likely range. For a sample of just 3, the range was 68-89 cm, or 11-14% off the real average. For 30 it is 71-78 cm, or 2-10% off, while for 30,000 it is just 0.1-0.5% off: only 1.2-3.6 millimetres! You also see that the bigger the sample, the less it is offset from the average (we call this bias).
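You can rerun this thought experiment yourself. Here is a minimal sketch: since the blog only gives the minimum, maximum and average, we assume a uniform thickness distribution, so the exact numbers will differ from the graph, but the shrinking spread with growing sample size is the same effect.

```python
import random

random.seed(42)

# Hypothetical stand-in for the forest: one million tree thicknesses
# drawn uniformly between 10 cm and 150 cm (the blog's min and max).
forest = [random.uniform(10, 150) for _ in range(1_000_000)]
true_mean = sum(forest) / len(forest)

def sample_mean(population, n):
    """Average thickness of n truly randomly selected trees."""
    return sum(random.sample(population, n)) / n

for n in (3, 30, 30_000):
    # Repeat the sampling 10 times to see how far the estimates scatter.
    estimates = [sample_mean(forest, n) for _ in range(10)]
    spread = max(estimates) - min(estimates)
    print(f"n={n:>6}: estimates scatter over {spread:6.2f} cm "
          f"(true mean {true_mean:.2f} cm)")
```

With samples of 3 the estimates scatter over tens of centimetres; with 30,000 they land within a fraction of a centimetre of the true mean.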
In practice, we might not even have the resources to take 3,000 random samples. Imagine you send 8 people into the square forest: 4 enter at the corners and 4 at the exact middle of each side, and each measures the first 40 trees they encounter. That gives 320 data points, which seems rather okay?! Let us have a look. What they do not know, but we do, is that the forest was planted in stages and consists of 4 equal squares of youngest, young, old and oldest trees. So each person at a corner measures 40 trees of a single category, while each person at a midpoint measures the 2 categories on either side of their entry point. Let us also state that at the edges of the forest we only find the smallest trees of each category. They will then measure 80 trees each of 10, 51, 91 and 131 cm thickness. The average of that sample is 70.75 cm, or 11% short, whereas a truly random sample would in 80% of cases have been less than 5% off the real average, and less biased to one side (as the edge trees are smaller).
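The arithmetic behind that biased result is simple enough to check in a few lines, using the edge thicknesses from the thought experiment above:

```python
# The four tree categories found at the forest edge (cm), as in the
# thought experiment: 80 trees measured per category, 320 in total.
edge_thickness = {"youngest": 10, "young": 51, "old": 91, "oldest": 131}
trees_per_category = 80

total = sum(t * trees_per_category for t in edge_thickness.values())
count = trees_per_category * len(edge_thickness)
biased_mean = total / count  # 70.75 cm

true_mean = 79.36  # known to us, not to the researchers
shortfall = (true_mean - biased_mean) / true_mean
print(f"biased sample mean: {biased_mean:.2f} cm")
print(f"shortfall vs. true mean: {shortfall:.0%}")
```

Note that taking more of these edge measurements would not help: a systematically biased sample stays 11% short no matter how large it grows.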
Now, what and how we measure also determines the reliability of the information we get from the sampled data. In the case of the trees, the 8 researchers have ordinary dressmakers' measuring tapes and the instruction to measure thickness at chest height. So we have people of different heights and with different measuring habits, who could easily be a couple of centimetres apart when measuring the same tree. The purpose of the measurement, however, might not require a tight measurement error. If the owner of the forest just wants to know the growth rate in order to predict when the trees will be thick enough to harvest, it might be perfectly fine to be off by a couple of centimetres. The sampling method, if not truly random, would be a much bigger potential issue.
What does this all mean for Covid-19?
Here we have to break down the issue into 3 pieces. One, the size of the infected population is not known, and it certainly has varied, and will vary, over days and weeks. Two, the testing has not been and will not be random: in the beginning, only the (very) sick, nurses and doctors were tested, due to testing capacity limitations. Three, the reliability of the testing itself is an issue. A PCR test has roughly a 66% chance of being right, as it can give false negatives (the amount of virus RNA first needs to reach a detection threshold) and will also test positive for some time after the infection has already been tackled by the immune system, as it cannot distinguish between infectious and neutralised viruses (their RNA is the same). The test can stay positive for a week or longer after(!) the blue curve depicted below has returned to 0, until the amount of detectable RNA falls below the threshold.
Click on the image below to read this interesting article explaining it all.
“ PCR detection of viruses is helpful so long as its limitations are understood; while it detects RNA in minute quantities, caution needs to be applied to the results as it often does not detect infectious virus.”
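If you know (or can estimate) how often a test misses real infections and how often it flags healthy people, you can partly correct the raw positive rate. A standard way to do this is the Rogan-Gladen correction; the numbers below are purely illustrative, not from the article, with the 66% sensitivity taken from the PCR discussion above and a specificity we assume for the sake of the example.

```python
def true_prevalence(observed_positive_rate, sensitivity, specificity):
    """Rogan-Gladen correction: back out the real infection rate
    from the observed positive rate of an imperfect test."""
    return (observed_positive_rate + specificity - 1) / (sensitivity + specificity - 1)

# Illustrative numbers only: a test that catches 66% of real infections
# (as the blog suggests for PCR) and an assumed 99.5% specificity.
est = true_prevalence(observed_positive_rate=0.10, sensitivity=0.66, specificity=0.995)
print(f"estimated true prevalence: {est:.1%}")
```

With these assumptions, an observed 10% positive rate corresponds to roughly 14.5% truly infected: the imperfect test understates the epidemic.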
What does this say, in hindsight, about the waves of Covid-19? In the first wave, people were barely tested. In some countries, people were even counted as Covid-19 patients or deaths without any test; Belgium is a 'nice' example. The testing was far from random and far from accurate, even beyond the intrinsic accuracy of the test itself. A good way of estimating the real number of infected people should come from other sources, like the blood sampling done in the Netherlands. People giving blood (close to random?) were tested for antibodies, and back in April 2020 it was found that 5-6% had had the disease. The testing of (suspected) patients over the same period came out 15% positive. That indicates that most likely 5-15 times more people had been infected, but perhaps did not get sick or did not report themselves as such at that particular time. Other research also points strongly in that direction. Based on the (low) testing efficiency, Belgium is estimated to have had half a million people infected by May, or 10 times more than officially diagnosed!
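To get a feel for what that Dutch antibody figure implies in absolute numbers, one can scale it to the whole population. The population figure of roughly 17.4 million is our assumption for illustration, not a number from the blog, and blood donors are of course only close to a random sample:

```python
# Scaling the Dutch blood-donor antibody result to the whole country.
# The ~17.4 million population figure is an assumption for illustration.
nl_population = 17_400_000

for seroprevalence in (0.05, 0.06):
    infected = seroprevalence * nl_population
    print(f"{seroprevalence:.0%} with antibodies -> "
          f"~{infected / 1e6:.2f} million people infected")
```

So even the lower 5% estimate already implies close to a million infections, far above what the non-random patient testing of the period could reveal.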
Courtesy of Bisnode (article on their website)