In the current age of data abundance, an ever-present question is how to balance data quantity with data quality: more data does not necessarily mean more high-quality data. A closely related conundrum is how to generalise the results of data analyses: we might naively expect that more data should make results more generalisable. Disentangling these issues and approaching a more wholly “correct” answer is a challenging but nevertheless critical task.

[Image: a 3D cube made of smaller cubes]

Randomly distributed errors may be reasonably handled by accumulating more data or by applying noise-reduction approaches such as Gaussian filters, which smooth the data to average out small “aberrations” or errors. However, in medical data these small aberrations can themselves be substantive, so error-reduction techniques risk removing meaningful information. Furthermore, “real world” medical data collected daily in healthcare systems around the world often contain semi-systematic rather than random errors, such as missing, inconsistent, and incorrect documentation.
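To make the double-edged nature of smoothing concrete, here is a minimal sketch of 1-D Gaussian smoothing in plain Python (the kernel width and the example series are illustrative assumptions, not from any real dataset). Note how the isolated high reading is averaged down toward its neighbours: good if it was noise, harmful if it was a clinically meaningful outlier.

```python
import math

def gaussian_kernel(sigma, radius):
    """Discrete Gaussian weights over [-radius, radius], normalised to sum to 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def gaussian_smooth(values, sigma=1.0):
    """Smooth a 1-D series by convolving with a Gaussian kernel.

    Edges are handled by clamping indices (replicate padding)."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for offset, w in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + offset, 0), n - 1)  # clamp at the boundaries
            acc += w * values[j]
        smoothed.append(acc)
    return smoothed

# A single spike in an otherwise flat series: smoothing pulls it toward 5,
# which is exactly the behaviour that can erase a meaningful aberration.
series = [5.0, 5.0, 5.0, 9.0, 5.0, 5.0, 5.0]
print(gaussian_smooth(series, sigma=1.0))
```

The same operation that suppresses random noise suppresses genuine signal; the filter has no way of telling the two apart, which is why domain expertise must inform whether smoothing is appropriate at all.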

Therefore, a greater quantity of data does not alleviate inherent non-random documentation error but rather compounds the complexity. Similarly, an approach that fixes a specific error in data from one hospital could corrupt the data collected at another hospital, depending on individual clinical practices or cultural norms. Hence, devising appropriate error-reduction models for medical data necessitates the input of domain expertise from the outset.

While the correct result for the correct reason is paramount in medicine and the life sciences, sheer quantity of data and generalised error correction cannot substitute for strategically applied domain expertise in resolving data errors.

As machine learning approaches are increasingly employed to determine associations in data and make predictions in medicine, we must remain vigilant: these algorithms cannot themselves identify and correct missing or incorrect data; rather, they inherit and learn the biases and errors embedded in the data we give them.
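The point above can be sketched in a few lines. The records below are entirely hypothetical: one clinic systematically fails to document high readings (semi-systematic missingness), and a naive “model” that learns the average of the observed values simply inherits that gap rather than detecting it.

```python
# Hypothetical records: clinic "B" systematically omits high readings.
# The None entry stands for a measurement that was taken but never documented.
records = [
    {"clinic": "A", "value": 7.2},
    {"clinic": "A", "value": 8.1},
    {"clinic": "B", "value": 7.0},
    {"clinic": "B", "value": None},  # a high reading, lost to documentation
]

def learned_mean(rows):
    """A trivially simple 'model': the mean of whatever was documented.

    It cannot know that the missing value was high; it just averages
    the observed data, so the systematic omission biases it downward."""
    observed = [r["value"] for r in rows if r["value"] is not None]
    return sum(observed) / len(observed)

print(round(learned_mean(records), 2))
```

No amount of additional data from clinic "B" fixes this: every new record carries the same documentation gap, so the learned estimate stays biased. Only domain knowledge about how and why values go missing can reveal the problem.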