The Overlooked Data Scientists in the Fight against Coronavirus: Biostatisticians
Last updated: Apr 15, 2020
(Reprinted from link.medium.com/zRGL4enTq5. For retweet: https://twitter.com/ericjdaza/status/1246721289970012160)
This is exactly the time to temper the sprinting agility of data science with the scientifically rigorous methodology of biostatistics.
Epidemiologists, infectious-disease specialists in particular, are the ultimate domain experts in guiding data science solutions that model or otherwise analyze population-level health-related aspects of SARS-CoV-2 (“coronavirus”) and its health impacts (i.e., COVID-19 characteristics and effects). As such, it is encouraging to see more and more data science hackathons and projects that correctly recognize the need to recruit epidemiologists to guide solution development. Not seeking their guidance is a huge mistake that will likely add to the noise around reported statistics, and thereby mislead regular folks, hospitals, policy experts, and governments on what to do next.
But these population-health data science projects continue to largely ignore a key group of natural teammates and collaborators: Biostatisticians.
Biostatisticians—the statisticians of public health—have been stewards of health data collection and analysis for decades. They are specifically trained to recognize and address issues and complications arising in the analysis of human health data. Their deep health-focused methodological training and experience spans not just statistical modeling, but also includes selection bias (e.g., observer/reporting/detection bias), missingness (i.e., missing data patterns), measurement error, demographics, survey sampling, study design, data management, and causal inference (e.g., dealing with confounders properly in order to truly estimate intervention effects). Importantly, biostatisticians are trained to integrate approaches that address these many complications with an eye to effectively consulting and communicating with epidemiologists and other health researchers.
Statistical modeling alone is neither statistics nor biostatistics, and is only part of the data-storytelling narrative around “reducing bias” or “balancing bias and variance”. A data scientist may be an expert in statistical modeling in ways that complement, enhance, or even improve upon biostatistical models. But it should be clear from the earlier paragraph why this skill alone does not make them a biostatistician.
In short, biostatisticians are the original “full-stack” data scientists of public health.
Their deliverables, products, and solutions have traditionally been study protocols, data collection procedures, and statistical analysis plans at the start of a research study (ideally), and data management and analysis reports throughout.
Their clients and customers have regularly included epidemiologists, other public health researchers (e.g., nutrition, health behavior, environmental/occupational health, maternal/child health), government agencies and institutes, clinical trialists, and pharma/biotech and life sciences companies.
None of the aforementioned biostatistical concepts (sans statistical modeling) are generally taught to data scientists on as deep a level—if at all. However, data scientists are experts at skillfully integrating statistical modeling, software engineering, and computer science to provide business or organizational solutions that are scalable (i.e., can be implemented over very large interconnected datasets and databases very quickly). The converse sentiment also holds: If a biostatistician excels at statistical modeling, that does not automatically make them a data scientist.
I am intimately aware of how monumental these data science tasks are. Having worked as a statistician embedded in a health data science team, I came to deeply appreciate the skill and sheer endurance with which my teammates optimized analytic and engineering processes and pipelines on a daily basis in order to meet very demanding deadlines. Through our constant interaction, I internalized how to manage the time-constrained balance of what is ideal and what can be done.
However, as pointed out by Hernán et al. (2019), data scientists often come from physical- or life-sciences backgrounds and domains. These largely involve systems of low-noise data and well-characterized mechanistic relationships. Even when such a data scientist knows that public health data are far noisier, it’s only natural to fall back on these simplifying assumptions when racing to complete a sprint, meet a deadline, or beat other coronavirus hackathon teams to the finish. Both noise in the data and uncertainty in the science (i.e., our understanding of the mechanisms that generated those data) are set aside as nuisance factors to deal with later. Building the analytic tool comes first, which makes sense.
Unfortunately, these nuisance factors are often front-and-center in modeling key population-level characteristics of the coronavirus epidemic, especially early on. For example, such characteristics might include the parameters of susceptible-exposed-infected-removed (SEIR) models. Hence, the usual data science process severely breaks down without guidance from epidemiologists and public health methodological experts like biostatisticians and health statisticians. Predictions and would-be counterfactuals (i.e., outcomes “under no intervention” or vice versa, as appropriate) are generally much noisier and more biased than is let on by compelling data visualizations and beautiful dashboards (e.g., interactive charts and tables) published on data science blogs and social media.
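To make the point about parameter uncertainty concrete, here is a minimal sketch of a textbook SEIR model integrated with simple forward-Euler steps. It is a toy illustration, not a forecasting tool and not anyone's published model. Every number in it (the transmission rates, the 5-day latent period, the 7-day infectious period, the population size) is a hypothetical value I picked for illustration, and the function name `simulate_seir` is my own.

```python
def simulate_seir(beta, sigma, gamma, population=1_000_000.0,
                  initial_infectious=10.0, days=180, dt=0.1):
    """Integrate a basic SEIR model with forward-Euler steps.

    beta  : transmission rate per day (hypothetical value)
    sigma : rate of leaving the exposed state, 1 / latent period
    gamma : rate of leaving the infectious state, 1 / infectious period
    """
    s = population - initial_infectious
    e, i, r = 0.0, initial_infectious, 0.0
    peak_infectious = i
    for _ in range(int(round(days / dt))):
        new_exposed = beta * s * i / population * dt
        new_infectious = sigma * e * dt
        new_removed = gamma * i * dt
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        peak_infectious = max(peak_infectious, i)
    return {"peak_infectious": peak_infectious, "final_state": (s, e, i, r)}


# Three hypothetical transmission rates standing in for an uncertain estimate.
HYPOTHETICAL_BETAS = {"low": 0.3, "mid": 0.4, "high": 0.5}

results = {
    label: simulate_seir(beta=beta, sigma=1 / 5, gamma=1 / 7)
    for label, beta in HYPOTHETICAL_BETAS.items()
}

for label, out in results.items():
    beta = HYPOTHETICAL_BETAS[label]
    print(f"beta={beta:.1f} ({label}): peak infectious ~ {out['peak_infectious']:,.0f}")
```

Running it prints a different peak for each assumed transmission rate. A dashboard built on only the "mid" value would show one clean curve and hide that spread, which is exactly the uncertainty a biostatistician or epidemiologist would insist on reporting.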
Recent advice from a Harvard T.H. Chan School of Public Health webinar (COVID-19 Data Science Zoomposium) to data scientists who want to help sounds eminently useful. Here’s my distillation and interpretation of this advice:
Stop trying to re-invent the SEIR wheel. At the very least, stop publicly posting your own SEIR-type models to your blogs or other social media without epidemiological qualification. Rather, try to improve data collection efforts to help better estimate the parameters of SEIR models that public health professionals need help with right now. Send your ideas in for review by these health experts. Recruit biostatistician or health statistician teammates who will help you do these things correctly. They will not only help you mitigate bias, but will also coach you in how to honestly communicate the noise and uncertainty in your solutions to your health-expert “clients”. See Wynants et al. (2020) for an excellent example.
Use your unique expertise to improve critical components of the public health infrastructure. For example, create robust solutions to link consumer databases, health records, geospatial data, and wearable/app/sensor data. These may help improve supply-chain, hospital, and clinical coordination, contact tracing, disease surveillance, patient care, and mental and physical health while practicing social distancing. Recruit biostatistician or health statistician teammates for help with surfacing and mitigating selection bias, missingness-induced bias, and measurement error; help with demographics, survey sampling, study design, and data management; and help with conducting causal inference. These colleagues will catch important analytic and modeling assumptions in the model-and-deploy frenzy you won’t realize you’re making, assumptions that directly impact your predictions—and the life-or-death decisions made based on them.
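As one small illustration of a tacit assumption that changes an estimate, consider detection bias. The sketch below uses plain expected-value arithmetic, with no real data, to compare a fatality ratio computed over all infections against the naive ratio computed over detected cases only. All inputs are hypothetical numbers chosen for illustration, the function name `observed_fatality_ratio` is my own, and the sketch assumes that only severe cases die and that every death is counted.

```python
def observed_fatality_ratio(infections, severe_fraction, fatality_given_severe,
                            detect_mild, detect_severe):
    """Return (true_ratio, naive_ratio) under severity-dependent detection.

    Assumptions (hypothetical, for illustration only):
      - only severe cases can die
      - every death is recorded, whether or not the case was detected earlier
    """
    severe = infections * severe_fraction
    mild = infections - severe
    deaths = severe * fatality_given_severe
    detected = mild * detect_mild + severe * detect_severe
    true_ratio = deaths / infections   # deaths over all infections
    naive_ratio = deaths / detected    # deaths over detected cases only
    return true_ratio, naive_ratio


# Hypothetical scenario: severe cases are far more likely to be tested.
true_ratio, naive_ratio = observed_fatality_ratio(
    infections=100_000,
    severe_fraction=0.05,
    fatality_given_severe=0.2,
    detect_mild=0.1,
    detect_severe=0.9,
)

print(f"ratio over all infections:  {true_ratio:.4f}")
print(f"ratio over detected cases:  {naive_ratio:.4f}")
```

In this scenario the naive ratio comes out several times larger than the ratio over all infections, purely because mild cases are under-detected. Treating detected cases as if they were all infections is the kind of unstated assumption a biostatistician teammate is trained to surface.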
So why has the data science community failed to invite biostatisticians to help in the fight against coronavirus?
My guess is that data scientists tend to see biostatisticians as randomized controlled trial (RCT) statisticians or specialists. This is probably due to the prominence of biostatisticians in the pharma, biotech, and life sciences industries—at least, to the attention of industry data scientists. (Many biostatisticians also work in academia.) This gives the false impression that data scientists only need input from biostatisticians when they work with RCTs and other clinical studies.
To all the data scientists out there who are already working with epidemiologists or other relevant public health experts: Well done! Now consider adding biostatisticians to your team. This is exactly the time to temper the sprinting agility of data science with the scientifically rigorous methodology of biostatistics. (Importantly, remember that while many biostatisticians work with epidemiologists, they are in general not substitutes for epidemiologists.)
To all the biostatisticians out there who want to help fight coronavirus by working with data scientists alongside public health experts:
Get ready to conduct study design and analysis at breakneck speed, often in retrospect.
Get cozy with having to quickly identify, report, and communicate the importance—and impact—of tacit analytic assumptions on downstream decisions and actions. You generally won’t have time to refine them in detail.
Prepare to do all this within an agile project management system (e.g., Scrum, Kanban, wikis). This will help you communicate and coordinate with your data scientist and data engineer teammates.
Taking these actions may help you better understand and strengthen the often commonplace (yet implicit) inferences of statistical prediction in the quickening world of coronavirus-focused data science.
For a related perspective, check out “The Role of Statistics in Fighting the Coronavirus” by Glenn Geher, PhD.
COVID-19 Data Science Zoomposium. https://www.hsph.harvard.edu/biostatistics/2020/03/covid-19-data-science-zoomposium-4-2/
Hernán MA, Hsu J, Healy B. A second chance to get causal inference right: a classification of data science tasks. Chance. 2019 Jan 2;32(1):42–9. https://amstat.tandfonline.com/doi/full/10.1080/09332480.2019.1579578
Wynants L, Van Calster B, Bonten MM, Collins GS, Debray TP, De Vos M, Haller MC, Heinze G, Moons KG, Riley RD, Schuit E. Systematic review and critical appraisal of prediction models for diagnosis and prognosis of COVID-19 infection. medRxiv. 2020 Jan 1. https://www.medrxiv.org/content/10.1101/2020.03.24.20041020v1
About the Author
Dr. Daza is a biostatistician and health data scientist—not an epidemiologist—who develops causal inference methods for personalized (n-of-1) digital health. | 🇵🇭🇺🇸 ericjdaza.com @ericjdaza linkedin.com/in/ericjdaza | statsof1.org @statsof1 @fsbiostats