Data abuse and misadventure might have been a good subtitle for an illuminating and thought-provoking conference on data quality, organised jointly by Winton and Cancer Research UK at the Francis Crick Institute on June 13.
Part of the Big Data Analytics Conference Series, the full-day symposium featured insights from the financial services industry, electioneering, bioinformatics and medicine more generally. One sobering conclusion was that data quality is vital yet often lacking – sometimes with disastrous consequences. Getting incentive structures right, possibly including greater recognition of data authorship, is one way to improve this suboptimal state of affairs, many of the speakers argued.
Ed Humpherson, Director General for Regulation at the UK Statistics Authority, delivered the keynote address in which he expanded on Winton Chief Scientific Adviser David Hand’s categories of problems with Big Data, by adding terms such as “mathswashing” – the use of mathematical terms to confuse people.
Humpherson’s government work has included combining several huge datasets to arrive at a definition of ordinary working families, as well as uncovering police underreporting of crime statistics by comparing them against separate survey data.
Banks can also underreport their problems, as the conference’s second speaker, Oliver Wyman partner Barrie Wilkinson, revealed during an anecdote about the travails of a distressed German lender.
Head of Winton Data Geoff Cross, who spoke third, reassured the audience that humans have an important role in data quality. Likening Winton to a self-driving car, Cross explained that we take in data and make decisions about where to go; but also, that humans are vital in this process – and that we run thousands of checks on our data each day.
Even in advanced machine learning, humans can add substantial value by working with computers to improve efficiency, as Cross explained by citing the work of Jeremy Howard.
Data can play an important role in exposing fraud. University of Michigan Professor Walter Mebane explained how he uses statistics to identify electoral fraud, although strategic voting – that is, when people vote for different parties than those they actually support – often muddies the picture.
It can be just as hard to spot fraudulent data, a particularly dangerous problem in medicine since patients can receive ineffective or even harmful treatments as a result. Freelance Publications Consultant Elizabeth Wager pointed out that big data adds to science’s reproducibility crisis. There is no peer review of the data used in academic articles, and often no access to the raw data behind them. It takes data sleuths such as John Carlisle, a UK anaesthetist who sifted through 169 articles to expose fraud by a Japanese doctor, to uncover such deception.
One such detective is Keith Baggerly, a bioinformatics professor at the MD Anderson Cancer Center, who spoke next and is perhaps best known for showing that research papers purporting to demonstrate improved diagnoses and treatments for cancer were riddled with data handling errors that completely invalidated their findings. Baggerly pointed out that other basic data mistakes are ubiquitous: one he highlighted suggests we have probably underestimated the quantity of vitamin D supplements humans should be taking by a large factor. Better information about data – metadata, in other words – was one partial solution that Baggerly proposed.
Robust encryption and permissioning could also help to fix the problem of messy data, according to John Quackenbush, a professor of biostatistics at the Dana-Farber Cancer Institute and the final speaker of the day. But Quackenbush also spoke of the need for flexible data management systems: it is still typical practice for doctors to fax documents to each other or scan them into electronic medical record machines, for instance, despite the huge improvements in digital communications over the last two decades.
Quackenbush made a compelling case for what he called self-validating datasets. The notion, which will be familiar to Winton employees given the out-of-sample tests to which our investment strategies are subjected, is that analyses should be re-run as more data is collected. Medical data could thus be re-validated by running automated tests on, say, the next 5,000 patients to take a pharmaceutical drug that has passed regulatory trials.
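The re-validation idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method Quackenbush described: it assumes each patient outcome is recorded as a simple success or failure, that the trial reported a single response rate, and that a normal-approximation z-test is an acceptable consistency check. The function name `revalidate`, its parameters and the threshold of 3.0 are all hypothetical.

```python
import math


def revalidate(trial_rate, outcomes, z_threshold=3.0):
    """Compare a new batch of binary outcomes with the rate seen in the trial.

    trial_rate  -- response rate reported by the original trial (assumed input)
    outcomes    -- list of 0/1 results for patients treated after approval
    z_threshold -- hypothetical cut-off beyond which the batch is flagged
    """
    if not 0.0 < trial_rate < 1.0:
        raise ValueError("trial_rate must lie strictly between 0 and 1")
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")

    observed_rate = sum(outcomes) / n
    # Standard error of the observed rate if the trial rate were still true.
    standard_error = math.sqrt(trial_rate * (1.0 - trial_rate) / n)
    z = (observed_rate - trial_rate) / standard_error

    return {
        "n": n,
        "observed_rate": observed_rate,
        "z": z,
        "consistent": abs(z) <= z_threshold,
    }


if __name__ == "__main__":
    # Made-up batch of 5,000 patients, purely for illustration.
    next_batch = [1] * 2500 + [0] * 2500
    print(revalidate(trial_rate=0.6, outcomes=next_batch))
```

In practice such a check would be re-run automatically on each new batch, and a flagged batch would prompt human review rather than an automatic conclusion about the drug.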
Drawing a distinction between population-level data collected for medical research and individual-level data collected for medical care, Quackenbush contended that in 2017 medicine remains primarily anecdote-driven rather than data-driven. It was a counterpoint, if ever there was one, to the prevailing hype about massive strides in Artificial Intelligence and machine learning.
The conference closed with a panel discussion between IBM’s director of Watson Financial Services Europe, David Robson, and three academics – Sabina Leonelli, Erik Mayer and Ian Gilmore – about the anonymisation of data and the challenges of handling medical datasets.