# Your Coronavirus Telemedicine Health App Might Be Overrated: How to Tell

# Introduction

We would overstate our health app’s effectiveness by claiming it reduces the risk of new coronavirus infections by 16.9%—when in fact it will only reduce this risk by 3.1%.

But we can re-weight our real-world evidence results to provide more accurate risk-reduction estimates of either 2.3% or 2.2%.

To review from Part 1 of this two-part tutorial using synthetic data:

Our analysis goal will be to help public health authorities in our simulated world reduce SARS-CoV-2 (“coronavirus”) infections. We believe our

digital health or telemedicine appcan help prevent new infections; e.g., by promoting healthy lifestyle choices—particularly while social distancing and sheltering-in-place—that reduce the risk of coronavirus infection. But to do this, we’ll need an unbiased or statistically consistent (i.e., more unbiased with larger samples) estimate of the true effect of our would-be intervention.

Tutorial Goal: To understand how causal assumptions change our interpretation of predictions, moving them towards explanations. To understand why this is needed before we can quantify the impact of our tech solution.

What We’ll Learn: How to recognize the difference between a modeling association and a causal effect. How to conduct causal inference via the g-formula (Robins, 1986; Hernán and Robins, 2006) and propensity score weighting (Rosenbaum and Rubin, 1983; Hirano and Imbens, 2001; Lunceford and Davidian, 2004).We will learn how a confounder biases our estimate of the would-be effect of a hypothetical health app intervention. The biases that emerge from our synthetic data are shown below.

Part | Study Design | Confounder Adjustment | Empirical Bias | Estimated Risk Reduction |
---|---|---|---|---|

1 | RWE (observational study) | not done | -0.138 | overstated |

2 | RCT (experimental study) | not done | -0.003 | consistently estimated |

2 | RWE (observational study) | g-formula | 0.008 | consistently estimated |

2 | RWE (observational study) | propensity score weighting | 0.010 | consistently estimated |

In Part 1, we investigated a synthetic real-world evidence (RWE) dataset inspired by a recent healthcare example involving racial disparities in COVID-19 cases (Garg et al, 2020; Aubrey, 2020). (Synthetic data are simulated—not real—data commonly created to teach or learn about an analytical tool.)

We learned how to recognize the difference between a statistical contrast (e.g., risk difference) and a causal effect. We also saw why it wasn’t enough, in general, to state the causal mechanisms (e.g., directed acyclic graph) and control for all confounders (assuming we observed all of them in our dataset) in our analysis—as is done in an explanatory or prediction model (Shmueli, 2010).

We learned from Part 1 that we needed a better way to estimate the true overall or *average treatment effect* (ATE). To do this, we ran a smaller (n = 9600) randomized controlled trial (RCT) with the same statistical power and evidence requirements as our RWE explanatory model. We randomly assigned the app to 50% of all trial participants. All participants complied with their treatment assignments, and the app had the same ATE on infection risk as in the RWE dataset.

Here in Part 2, we’ll analyze this synthetic RCT dataset with the same three person-level variables as before: `infection`

status, `app`

usage, and `race`

(Black or White only for simplicity). We’ll also see that if race really is the only confounder of app usage on infection risk (i.e., if we’d observed all possible confounders), then we can estimate the ATE using just our RWE dataset. This requires understanding the Law of Total Expectation—a straightforward and intuitive concept we actually use all the time—that we’ll review below.

**Part 2 Objective**: To correctly estimate the ATE of app usage on infections using only observational RWE data, given that race is the only other factor that influences infection risk (i.e., the only confounder). As before, we’ll specify the ATE as the risk among users minus that among non-users, or*risk difference*(RD).**What We’ll Learn**: How to estimate the ATE by re-weighting our RWE predictions and outcomes using the g-formula (also called*standardization*) and propensity score weighting, respectively. The g-formula is closely related to the*back-door adjustment formula*of Pearl (2009).

We will conclude Part 2 with a high-level action plan for reporting ATE estimates using RWE data.

# Dataset Characteristics

The experimental (i.e., randomized) RCT data are in the tibble `experimental_rct`

. (Generate this with the R code in the Appendix.)

`glimpse(experimental_rct)`

```
## Observations: 9,600
## Variables: 3
## $ race <chr> "White", "White", "White", "White", "White", "White", "Whit…
## $ app <chr> "didn't use app", "didn't use app", "didn't use app", "didn…
## $ infection <chr> "0. uninfected", "1. infected", "0. uninfected", "0. uninfe…
```

Each observation (i.e., row) represents a unique individual who was originally susceptible and uninfected. The variables and their unique values are the same as in Part 1:

`knitr::kable(apply(experimental_rct, 2, unique))`

race | app | infection |
---|---|---|

White | didn’t use app | 0. uninfected |

Black | used app | 1. infected |

# Explanatory Modeling

Our RCT dataset has 9600 observations.

## Univariate Associations

### Correlation Matrix

```
dummy_rct <- experimental_rct %>%
dplyr::mutate(
race = (race == "White"),
app = (app == "used app"),
infection = (infection == "1. infected")
)
```

`knitr::kable(round(cor(dummy_rct), 4))`

race | app | infection | |
---|---|---|---|

race | 1.0000 | 0.0054 | -0.3156 |

app | 0.0054 | 1.0000 | -0.0497 |

infection | -0.3156 | -0.0497 | 1.0000 |

`corrplot::corrplot.mixed(cor(dummy_rct))`

Unlike the RWE correlation matrix in Part 1, `race`

is no longer correlated with `app`

. This is because we randomized app usage.

### Infections by App Usage (Marginal Model)

Let’s first examine our main relationship of interest, as we did with the training data.

```
experimental_rct %>%
ggplot2::ggplot(ggplot2::aes(x = app, fill = infection)) +
ggplot2::theme_classic() +
ggplot2::geom_bar(position = "dodge") +
ggplot2::ggtitle("Infections by App Usage")
```

```
df_rct <- with(
experimental_rct,
cbind(
table(app, infection),
prop.table(table(app, infection), margin = 1) # row proportions
)
)
rct_rd <- df_rct[2,4] - df_rct[1,4]
```

`rct_rd # empirical RD`

`## [1] -0.03392808`

`knitr::kable(df_rct) # row proportions`

0. uninfected | 1. infected | 0. uninfected | 1. infected | |
---|---|---|---|---|

didn’t use app | 4056 | 726 | 0.8481807 | 0.1518193 |

used app | 4250 | 568 | 0.8821088 | 0.1178912 |

As in Part 1, there was a lower infection rate among app users: Only 11.8% of users had infections, compared to 15.2% of non-users. However, the empirical RD is -0.034, much closer to the true ATE of -0.031. This is markedly smaller than the “false ATE estimate” of -0.169 (i.e., the RWE holdout empirical RD) that would have misled public health authorities on the true effectiveness of our app at reducing coronavirus infections.

```
out_fisher_rct <- with(
experimental_rct,
fisher.test(app, infection)
)
```

`out_fisher_rct`

```
##
## Fisher's Exact Test for Count Data
##
## data: app and infection
## p-value = 1.254e-06
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.6623291 0.8415144
## sample estimates:
## odds ratio
## 0.7466837
```

In addition, there is strong statistical evidence (i.e., statistical significance) that infections varied by app usage (p << 0.001). (Note that this corresponds to a two-sided hypothesis test; our RCT hypothesis is one-sided.) Here, the estimated odds of infection for app users were 0.747 (i.e., roughly three quarters) that of non-users, with a 95% confidence interval (CI) of (0.662, 0.842).

```
out_epi.2by2 <- epiR::epi.2by2(
with(
experimental_rct,
table(app == "didn't use app", infection == "0. uninfected")
)
)
```

`out_epi.2by2$res$ARisk.crude.wald / 100`

```
## est lower upper
## 1 -0.03392808 -0.04757939 -0.02027677
```

The corresponding estimated RD with 95% CI is -0.034 (95% CI: -0.048, -0.02). (Note that this corresponds to a two-sided hypothesis test; our RCT hypothesis is one-sided.) That is, we’re 95% confident that using the app lowered the risk of infection by somewhere between 0.02 and 0.048. Note that this is now a statement about a causal effect, not a statistical association.

We *could* stop here because we designed our RCT in order to properly estimate the ATE. To review from Part 1, the ATE is a *marginal* quantity in statistical terms because it does not account for (i.e., “is marginalized over”) any other variables. A model that does account for other variables (here, race) in addition to the would-be intervention (here, app usage) is called a *conditional* model.

But in the **Explanatory Model** section, we’ll go ahead and fit the Part 1 conditional model that included `race`

as a confounder. In the next section (**Causal Inference: It’s the Law**), we’ll see how to “roll up” the predictions from this model to equal our estimated RD. We’ll then learn how to use this same procedure to estimate the ATE in a statistically consistent way using our original RWE data.

### Infections by Race

As in Part 1, race seems to have been associated with infection.

```
experimental_rct %>%
ggplot2::ggplot(ggplot2::aes(x = race, fill = infection)) +
ggplot2::theme_classic() +
ggplot2::geom_bar(position = "dodge") +
ggplot2::ggtitle("Infections by Race")
```

```
df_rct_race_infection <- with(
experimental_rct,
cbind(
table(race, infection),
prop.table(table(race, infection), margin = 1) # row proportions
)
)
```

`knitr::kable(df_rct_race_infection) # row proportions`

0. uninfected | 1. infected | 0. uninfected | 1. infected | |
---|---|---|---|---|

Black | 794 | 537 | 0.5965440 | 0.4034560 |

White | 7512 | 757 | 0.9084533 | 0.0915467 |

```
out_fisher_rct_race_infection <- with(
experimental_rct,
fisher.test(race, infection)
)
```

`out_fisher_rct_race_infection`

```
##
## Fisher's Exact Test for Count Data
##
## data: race and infection
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.1302266 0.1705730
## sample estimates:
## odds ratio
## 0.1490586
```

As before, Blacks were more likely to have been infected than Whites (p << 0.001): 40.3% of Blacks were infected, compared to 9.2% of Whites. Put differently, the estimated odds of infection for Whites were 0.149 (95% CI: 0.13, 0.171) times that of Blacks.

### App Usage by Race

Unlike our RWE findings in Part 1, `race`

is no longer correlated with `app`

. Again, this is because we randomized app usage.

```
experimental_rct %>%
ggplot2::ggplot(ggplot2::aes(x = race, fill = app)) +
ggplot2::theme_classic() +
ggplot2::geom_bar(position = "dodge") +
ggplot2::ggtitle("App Usage by Race")
```

```
df_rct_race_app <- with(
experimental_rct,
cbind(
table(race, app),
prop.table(table(race, app), margin = 1) # row proportions
)
)
```

`knitr::kable(df_rct_race_app) # row proportions`

didn’t use app | used app | didn’t use app | used app | |
---|---|---|---|---|

Black | 672 | 659 | 0.5048835 | 0.4951165 |

White | 4110 | 4159 | 0.4970371 | 0.5029629 |

```
out_fisher_rct_race_app <- with(
experimental_rct,
fisher.test(race, app)
)
```

`out_fisher_rct_race_app`

```
##
## Fisher's Exact Test for Count Data
##
## data: race and app
## p-value = 0.6156
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.917487 1.160591
## sample estimates:
## odds ratio
## 1.03189
```

## Explanatory Model

### Causal Model

From Part 1, recall that an explanatory model consists of a causal model and a statistical model. The causal model is often specified as a directed acyclic graph (DAG) (Pearl, 2009). We had assumed that the true DAG was:

- App Usage \(\rightarrow\) Infection
- Race \(\rightarrow\) Infection
- Race \(\rightarrow\) App Usage

```
DiagrammeR::grViz("
digraph causal {
# Nodes
node [shape = plaintext]
Z [label = 'Race']
X [label = 'App \n Usage']
Y [label = 'Infection']
# Edges
edge [color = black,
arrowhead = vee]
rankdir = LR
X -> Y
Z -> X
Z -> Y
# Graph
graph [overlap = true]
}")
```

Here, race confounds the effect of app usage on infection.

### Check Sample Size

Do we have a large enough sample for each app usage and racial group combination to meet our Part 1 statistical power and evidence requirements?

```
experimental_rct %>%
ggplot2::ggplot(ggplot2::aes(x = race, fill = app)) +
ggplot2::theme_classic() +
ggplot2::geom_bar(position = "dodge") +
ggplot2::ggtitle("App Usage by Race")
```

```
df_rct_race_app <- with(
experimental_rct,
cbind(
table(race, app),
prop.table(table(race, app), margin = 1) # row proportions
)
)
```

`knitr::kable(df_rct_race_app) # row proportions`

didn’t use app | used app | didn’t use app | used app | |
---|---|---|---|---|

Black | 672 | 659 | 0.5048835 | 0.4951165 |

White | 4110 | 4159 | 0.4970371 | 0.5029629 |

Yes: We have at least 556 Black individuals in each app usage group, and at least 1617 White individuals likewise.

### Fit Statistical Model

We fit our RCT logistic model as follows.

```
glm_rct <- glm(
data = experimental_rct,
formula = as.factor(infection) ~ app + race,
family = "binomial"
)
```

`knitr::kable(summary(glm_rct)$coefficients)`

Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|

(Intercept) | -0.2383012 | 0.0637708 | -3.73684 | 0.0001863 |

appused app | -0.3133894 | 0.0634329 | -4.94049 | 0.0000008 |

raceWhite | -1.9089811 | 0.0678516 | -28.13467 | 0.0000000 |

After controlling for race, there is strong statistical evidence (p << 0.001) for the estimated effect of app usage on infection risk. There is also very strong statistical evidence (p <<< 0.001) for race’s association with infection risk. Specifically, the estimated odds of infection for Whites were \(\exp(\)-1.909\() =\) 0.148 (95% CI: 0.13, 0.169) times that of Blacks (regardless of app usage).

The estimated odds of infection for app users were \(\exp(\)-0.313\() =\) 0.731 (95% CI: 0.645, 0.828) times that of non-users (regardless of race). The corresponding estimated infection risks for app usage by race are:

```
risk_didnt_use_app_black_rct <- plogis(coef(glm_rct) %*% c(1, 0, 0))
risk_used_app_black_rct <- plogis(coef(glm_rct) %*% c(1, 1, 0))
risk_didnt_use_app_white_rct <- plogis(coef(glm_rct) %*% c(1, 0, 1))
risk_used_app_white_rct <- plogis(coef(glm_rct) %*% c(1, 1, 1))
```

- 0.441 for Blacks not using the app
- 0.365 for Blacks using the app
- 0.105 for Whites not using the app
- 0.079 for Whites using the app

The estimated RDs by race are:

```
rct_rd_black <- risk_used_app_black_rct - risk_didnt_use_app_black_rct
rct_rd_white <- risk_used_app_white_rct - risk_didnt_use_app_white_rct
confint_glm_rct <- confint(glm_rct) # 95% CIs: odds ratios of infection
rct_rd_ci_black <- c(
plogis(confint_glm_rct[, 1] %*% c(1, 1, 0)) - plogis(confint_glm_rct[, 1] %*% c(1, 0, 0)),
plogis(confint_glm_rct[, 2] %*% c(1, 1, 0)) - plogis(confint_glm_rct[, 2] %*% c(1, 0, 0))
)
rct_rd_ci_white <- c(
plogis(confint_glm_rct[, 1] %*% c(1, 1, 1)) - plogis(confint_glm_rct[, 1] %*% c(1, 0, 1)),
plogis(confint_glm_rct[, 2] %*% c(1, 1, 1)) - plogis(confint_glm_rct[, 2] %*% c(1, 0, 1))
)
```

- -0.075 (95% CI: -0.1, -0.047) for Blacks
- -0.026 (95% CI: -0.028, -0.02) for Whites

# Causal Inference: It’s the Law

### Rolling Up the Conditional Estimated Risks

#### Randomized Controlled Trial

We can use our explanatory model’s estimated coefficients to predict the infection risk for each person.

```
experimental_rct_preds <- experimental_rct %>%
dplyr::mutate(predicted_risk = predict(object = glm_rct, type = "response"))
tbl_estimated_risks_rct <- experimental_rct_preds %>%
dplyr::select(race, app, predicted_risk) %>%
dplyr::distinct() %>%
dplyr::arrange(race, app)
```

The predicted risks for each `race`

and `app`

combination are:

`knitr::kable(tbl_estimated_risks_rct)`

race | app | predicted_risk |
---|---|---|

Black | didn’t use app | 0.4407050 |

Black | used app | 0.3654723 |

White | didn’t use app | 0.1045855 |

White | used app | 0.0786616 |

These predicted risks are just the estimated infection risks for app usage by race we’d calculated earlier. (Note that while predicted probabilities such as these are used in classification, they aren’t “predictions” in the classification sense. The latter can only take on the values of “used app” or “didn’t use app”, not continuous probabilities. The predicted risks are “predictions” by the standard statistics—not machine learning—definition.)

We “roll up” these estimated risks to the level of app usage by taking their averages over both racial categories for app users and non-users.

```
tbl_aac_rct <- experimental_rct_preds %>%
dplyr::group_by(app) %>%
dplyr::summarize(mean_preds = mean(predicted_risk))
estimated_AAC_rct <- tbl_aac_rct$mean_preds[2] - tbl_aac_rct$mean_preds[1]
```

`knitr::kable(tbl_aac_rct)`

app | mean_preds |
---|---|

didn’t use app | 0.1518193 |

used app | 0.1178912 |

But the estimated RD we calculate from these average estimated risks is exactly the same as the empirical RD we’d calculated earlier!

`rct_rd # RCT: empirical RD`

`## [1] -0.03392808`

`estimated_AAC_rct # RCT: RD from average estimated risks`

`## [1] -0.03392808`

#### Real-world Evidence

So can’t we just do this with our RWE estimated risks to come up with a similar estimate?

```
# RWE: empirical RD
df_rwe_holdout <- with(
observational_rwe_holdout, # see Part 1 to generate this:
# https://towardsdatascience.com/coronavirus-telemedicine-and-race-part-1-simulated-real-world-evidence-9971f553194d?source=email-8430d9f1992d-1586971057342-layerCake.autoLayerCakeWriterNotification-------------------------90bb4612_8a36_4b93_81dd_1656d841e715&sk=32bfb03e1ca157e423ecf8cb69835ef3
cbind(
table(app, infection),
prop.table(table(app, infection), margin = 1) # row proportions
)
)
rwe_holdout_rd <- df_rwe_holdout[2,4] - df_rwe_holdout[1,4]
# RWE: RD from average estimated risks
observational_rwe_preds <- observational_rwe_holdout %>%
dplyr::mutate(predicted_risk = predict(object = glm_rwe_holdout, type = "response"))
tbl_estimated_risks_rwe <- observational_rwe_preds %>%
dplyr::select(race, app, predicted_risk) %>%
dplyr::distinct() %>%
dplyr::arrange(race, app)
tbl_aac_rwe <- observational_rwe_preds %>%
dplyr::group_by(app) %>%
dplyr::summarize(mean_preds = mean(predicted_risk))
estimated_AAC_rwe <- tbl_aac_rwe$mean_preds[2] - tbl_aac_rwe$mean_preds[1]
```

`knitr::kable(tbl_estimated_risks_rwe)`

race | app | predicted_risk |
---|---|---|

Black | didn’t use app | 0.4201799 |

Black | used app | 0.3650084 |

White | didn’t use app | 0.0947902 |

White | used app | 0.0766925 |

`knitr::kable(tbl_aac_rwe)`

app | mean_preds |
---|---|

didn’t use app | 0.2598114 |

used app | 0.0904586 |

`rwe_holdout_rd # RWE: empirical RD`

`## [1] -0.1693528`

`estimated_AAC_rwe # RWE: RD from average estimated risks`

`## [1] -0.1693528`

Apparently not! What happened?

### The Law of Total Expectation

Suppose we’ve measured the following heights in feet of two women and six men:

`knitr::kable(tbl_lte_example)`

id | gender | height |
---|---|---|

woman1 | 1 | 4.938729 |

woman2 | 1 | 5.217788 |

man1 | 0 | 5.222013 |

man2 | 0 | 5.820665 |

man3 | 0 | 5.257096 |

man4 | 0 | 5.186209 |

man5 | 0 | 5.788102 |

man6 | 0 | 5.482407 |

#### Overall Average

The overall average height is just 5.364. Let \(y_i\) represent the height for individual \(i = 1, ..., n\), where \(n = 8\). This overall average height is explicitly calculated as \(\frac{1}{n} \sum_i^n y_i\), or:

`(woman1 + woman2 + man1 + man2 + man3 + man4 + man5 + man6) / 8`

`## [1] 5.364126`

#### Overall Average as Weighted Average of Components

But the overall average height is also:

`((woman1 + woman2) / 2) * (2/8) + ((man1 + man2 + man3 + man4 + man5 + man6) / 6) * (6/8)`

`## [1] 5.364126`

Put differently, but using code similar to what we’d used earlier:

```
tbl_lte_example_components <- tbl_lte_example %>%
dplyr::group_by(gender) %>%
dplyr::summarize(mean_height = mean(height))
```

`knitr::kable(tbl_lte_example_components)`

gender | mean_height |
---|---|

0 | 5.459415 |

1 | 5.078258 |

```
tbl_lte_example_components[2, 2] * prop.table(table(tbl_lte_example$gender))[2] +
tbl_lte_example_components[1, 2] * prop.table(table(tbl_lte_example$gender))[1]
```

```
## mean_height
## 1 5.364126
```

#### Relationship of Overall Average to Component Weighted Average

Let \(z_i = 1\) for women, and \(z_i = 0\) for men. The average height is:

- \(\left( \sum_i^n y_i z_i \right) / \left( \sum_i^n z_i \right) =\) 5.078 for women
- \(\left\{ \sum_i^n y_i (1 - z_i) \right\} / \left\{ \sum_i^n (1 - z_i) \right\} =\) 5.459 for men

The proportion of each gender is:

- \(\frac{1}{n} \sum_i^n z_i =\) 0.25 for women
- \(\frac{1}{n} \sum_i^n (1 - z_i) =\) 0.75 for men

The calculation above adds these two component averages, but it weights each component average by its respective component proportions. Hence, this weighted average is equal to the overall average height:

\[ \left\{ \frac{\sum_i^n y_i z_i}{\sum_i^n z_i} \times \frac{\sum_i^n z_i}{n} \right\} + \left\{ \frac{\sum_i^n y_i (1 - z_i)}{\sum_i^n (1 - z_i)} \times \frac{\sum_i^n (1 - z_i)}{n} \right\} \] \[ = \frac{\sum_i^n y_i z_i}{n} + \frac{\sum_i^n y_i (1 - z_i)}{n} = \frac{1}{n} \left( \sum_i^n y_i z_i + \sum_i^n y_i - y_i z_i \right) \] \[ = \frac{1}{n} \sum_i^n \left( y_i z_i + y_i - y_i z_i \right) = \frac{1}{n} \sum_i^n y_i \]

`## [1] "(5.078 * 0.25) + (5.459 * 0.75) = 5.364"`

`## [1] "((woman1 + woman2) / 2) * (2/8) + ((man1 + man2 + man3 + man4 + man5 + man6) / 6) * (6/8) = 5.364"`

This statement of equivalence is an expression of the Law of Total Expectation (LTE):

The overall average is a sum of the component averages weighted by the proportion of each component.

See the Appendix for the definition of the LTE.

### Applying the LTE to Our Data

#### Real-world Evidence: App Users

Let’s now apply the LTE to our RWE holdout data. For simplicity, we’ll just look at app users; similar reasoning holds for app non-users.

- We have \(n =\) 9600 individuals.
- Let \(Y_i = 1\) if individual \(i\) was infected, and \(Y_i = 0\) if uninfected.
- Let \(X_i = 1\) if individual \(i\) used the app, and \(X_i = 0\) if they didn’t.
- Let \(Z_i = 1\) if individual \(i\) is White, and \(Z_i = 0\) if they are Black.
- Let \(\hat{R}_i = \widehat{\text{Pr}} ( Y_i=1 | X_i, Z_i )\) represent the estimated infection risk given \(X_i\) and \(Z_i\).

Among app users, the overall estimated risk was \(\left( \sum_i^n Y_i X_i \right) / \left( \sum_i^n X_i \right) =\) 0.09. The average estimated risk was:

- \(\left( \sum_i^n \hat{R}_i X_i Z_i \right) / \left( \sum_i^n X_i Z_i \right) = \widehat{\text{Pr}} ( Y=1 | X=1, Z=1 ) =\) 0.077 for Whites
- \(\left\{ \sum_i^n \hat{R}_i X_i (1 - Z_i) \right\} / \left\{ \sum_i^n X_i (1 - Z_i) \right\} = \widehat{\text{Pr}} ( Y=1 | X=1, Z=0 ) =\) 0.365 for Blacks

The proportion of each race was:

- \(\left( \sum_i^n X_i Z_i \right) / \left( \sum_i^n X_i \right) =\)
**0.952 White** - \(\left\{ \sum_i^n X_i (1 - Z_i) \right\} / \left\{ \sum_i^n X_i \right\} =\)
**0.048 Black**

The weighted average of average estimated risks taken over both race categories is equal to the overall estimated risk:

\[ \left\{ \frac{\sum_i^n \hat{R}_i X_i Z_i}{\sum_i^n X_i Z_i} \times \frac{\sum_i^n X_i Z_i}{\sum_i^n X_i} \right\} + \left\{ \frac{\sum_i^n \hat{R}_i X_i (1 - Z_i)}{\sum_i^n X_i (1 - Z_i)} \times \frac{\sum_i^n X_i (1 - Z_i)}{\sum_i^n X_i} \right\} \] \[ = \frac{\sum_i^n \hat{R}_i X_i}{\sum_i^n X_i} = \frac{\sum_i^n Y_i X_i}{\sum_i^n X_i} \]

`## [1] "(0.077 * 0.952) + (0.365 * 0.048) = 0.09 = 0.09"`

The last equivalence follows from the definition of a logistic regression. Here’s some code that does this explicitly:

```
observational_rwe_preds %>%
dplyr::filter(app == "used app") %>%
dplyr::mutate(
weight_Black = (race == "Black") * mean(observational_rwe_holdout$race[observational_rwe_holdout$app == "used app"] == "Black"),
weight_White = (race == "White") * mean(observational_rwe_holdout$race[observational_rwe_holdout$app == "used app"] == "White"),
predicted_risk_weighted = predicted_risk * (race == "Black") * weight_Black +
predicted_risk * (race == "White") * weight_White
) %>%
dplyr::group_by(race) %>%
dplyr::summarize(mean_preds_weighted_rwe = mean(predicted_risk_weighted)) %>%
dplyr::summarize(sum_mean_preds_weighted_rwe = sum(mean_preds_weighted_rwe))
```

```
## # A tibble: 1 x 1
## sum_mean_preds_weighted_rwe
## <dbl>
## 1 0.0905
```

Compare this to our earlier code we used to calculate the empirical RD, which did not explicitly average of race-specific average estimated risks:

```
observational_rwe_preds %>%
dplyr::filter(app == "used app") %>%
dplyr::summarize(mean_preds = mean(predicted_risk))
```

```
## # A tibble: 1 x 1
## mean_preds
## <dbl>
## 1 0.0905
```

#### Randomized Controlled Trial: App Users

Now let’s apply the LTE to our RCT data. Again, we’ll just look at app users. We’ll use the exact same formulas.

Among app users, the overall estimated risk was 0.118. The average estimated risk was:

- 0.079 for Whites
- 0.365 for Blacks

The proportion of each race was:

**0.863 White****0.137 Black**

The weighted average of average estimated risks taken over both race categories is equal to the overall estimated risk:

`## [1] "(0.079 * 0.863) + (0.365 * 0.137) = 0.118 = 0.118"`

Here’s some code that does this explicitly:

```
experimental_rct_preds %>%
dplyr::filter(app == "used app") %>%
dplyr::mutate(
weight_Black = (race == "Black") * mean(experimental_rct$race[experimental_rct$app == "used app"] == "Black"),
weight_White = (race == "White") * mean(experimental_rct$race[experimental_rct$app == "used app"] == "White"),
predicted_risk_weighted = predicted_risk * (race == "Black") * weight_Black +
predicted_risk * (race == "White") * weight_White
) %>%
dplyr::group_by(race) %>%
dplyr::summarize(mean_preds_weighted_rct = mean(predicted_risk_weighted)) %>%
dplyr::summarize(sum_mean_preds_weighted_rct = sum(mean_preds_weighted_rct))
```

```
## # A tibble: 1 x 1
## sum_mean_preds_weighted_rct
## <dbl>
## 1 0.118
```

Can you spot the main difference between the four corresponding RWE and RCT components? This difference unlocks the key insight of the g-formula.

# The G-Formula

## Outcome Model

We had assumed that race and app usage both impact the infection risk. The relevant part of our earlier DAG is:

- App Usage (\(X\)) \(\rightarrow\) Infection (\(Y\))
- Race (\(Z\)) \(\rightarrow\) Infection (\(Y\))

```
DiagrammeR::grViz("
digraph causal {
# Nodes
node [shape = plaintext]
X [label = 'App \n Usage \n (X)']
Z [label = 'Race \n (Z)']
Y [label = 'Infection \n (Y)']
# Edges
edge [color = black,
arrowhead = vee]
rankdir = LR
X -> Y
Z -> Y
# Graph
graph [overlap = true]
}")
```

Thus far, we’ve been investigating this DAG’s corresponding statistical model. This is often called an *outcome model* in the causal inference literature because it models how the outcome is statistically related to its input variables (inputs). Our outcome model is a logistic model of `infection`

risk as a function of `app`

usage and `race`

.

## G-Formula: Standardizing RWE Estimates

### RCT Substitution

In the RWE data, race influenced the propensity of using the app. As a result, app users were more likely to be White than non-users:

`knitr::kable(df_rwe_holdout_race_app_colprop) # column proportions`

didn’t use app | used app | didn’t use app | used app | |
---|---|---|---|---|

Black | 1667 | 607 | 0.5071494 | 0.0477464 |

White | 1620 | 12106 | 0.4928506 | 0.9522536 |

However, the proportion of each race among app users and non-users were very similar in the RCT data:

`knitr::kable(df_rct_race_app_colprop) # column proportions`

didn’t use app | used app | didn’t use app | used app | |
---|---|---|---|---|

Black | 672 | 659 | 0.140527 | 0.1367787 |

White | 4110 | 4159 | 0.859473 | 0.8632213 |

This is because randomizing app usage made it statistically independent of race. Hence, the app-usage-specific proportions approximated the overall race proportions, which were:

- \(\frac{1}{n} \sum_i^n (1 - Z_i) =\) 0.139 Black
- \(\frac{1}{n} \sum_i^n Z_i =\) 0.861 White

That is, app users were:

- \(\left( \sum_i^n X_i Z_i \right) / \left( \sum_i^n X_i \right) =\) 0.863 \(\approx\) 0.861 \(= \frac{1}{n} \sum_i^n Z_i\) White
- \(\left\{ \sum_i^n X_i (1 - Z_i) \right\} / \left\{ \sum_i^n X_i \right\} =\) 0.137 \(\approx\) 0.139 \(= \frac{1}{n} \sum_i^n (1 - Z_i)\) Black

So we can set \(X_i=1\) for all \(i\), such that the race proportions we use are:

- \(\left( \sum_i^n 1 Z_i \right) / \left( \sum_i^n 1 \right) = \frac{1}{n} \sum_i^n Z_i =\) 0.861 White
- \(\left\{ \sum_i^n 1 (1 - Z_i) \right\} / \left\{ \sum_i^n 1 \right\} = \frac{1}{n} \sum_i^n (1 - Z_i) =\) 0.139 Black

This is the same as removing `[experimental_rct$app == "used app"]`

from our earlier code like so:

```
(
experimental_rct_preds %>%
dplyr::filter(app == "used app") %>%
dplyr::mutate(
weight_Black = (race == "Black") * mean(experimental_rct$race == "Black"),
weight_White = (race == "White") * mean(experimental_rct$race == "White"),
predicted_risk_weighted = predicted_risk * (race == "Black") * weight_Black +
predicted_risk * (race == "White") * weight_White
) %>%
dplyr::group_by(race) %>%
dplyr::summarize(mean_preds_weighted_rct = mean(predicted_risk_weighted)) %>%
dplyr::summarize(sum_mean_preds_weighted_rct = sum(mean_preds_weighted_rct))
)$sum_mean_preds_weighted_rct
```

`## [1] 0.1184267`

Statistically, this substitution is acceptable: We are simply replacing one statistically consistent estimate of the true race proportions with another. But what happens if we do this with our RWE data?

### RWE Re-weighting

Using our RWE dataset, let’s re-weight the average estimated risks for app users using the overall race proportions instead of the app-usage-specific proportions by removing `[observational_rwe_holdout$app == "used app"]`

from our earlier code like so:

```
(
observational_rwe_preds %>%
dplyr::filter(app == "used app") %>%
dplyr::mutate(
weight_Black = (race == "Black") * mean(observational_rwe_holdout$race == "Black"),
weight_White = (race == "White") * mean(observational_rwe_holdout$race == "White"),
predicted_risk_weighted = predicted_risk * (race == "Black") * weight_Black +
predicted_risk * (race == "White") * weight_White
) %>%
dplyr::group_by(race) %>%
dplyr::summarize(mean_preds_weighted_rwe = mean(predicted_risk_weighted)) %>%
dplyr::summarize(sum_mean_preds_weighted_rwe = sum(mean_preds_weighted_rwe))
)$sum_mean_preds_weighted_rwe
```

`## [1] 0.1176694`

This quantity is much closer to our RCT estimate. We apply this same re-weighting to the app non-users using the following code:

```
tbl_ate_gf <- observational_rwe_preds %>%
dplyr::mutate(
predicted_risk = predict(object = glm_rwe_holdout, type = "response"),
predicted_risk_X1 = predict(
object = glm_rwe_holdout,
newdata = observational_rwe_holdout %>%
dplyr::mutate(app = "used app"),
type = "response"
),
predicted_risk_X0 = predict(
object = glm_rwe_holdout,
newdata = observational_rwe_holdout %>%
dplyr::mutate(app = "didn't use app"),
type = "response"
),
gformula_weight = (race == "Black") * mean(observational_rwe_holdout$race == "Black") +
(race == "White") * mean(observational_rwe_holdout$race == "White"),
predicted_risk_weighted = predicted_risk * gformula_weight
) %>%
dplyr::group_by(app, race) %>%
dplyr::summarize(mean_preds_weighted_rwe = mean(predicted_risk_weighted))
estimated_ATE_gf <- (tbl_ate_gf$mean_preds_weighted_rwe[3] + tbl_ate_gf$mean_preds_weighted_rwe[4]) -
(tbl_ate_gf$mean_preds_weighted_rwe[1] + tbl_ate_gf$mean_preds_weighted_rwe[2])
```

We calculate the estimated RD as -0.023. Compare this to our RCT estimated RD of -0.034. Both estimates are much closer to the true ATE of -0.031 than our original RWE estimated RD of -0.169.

The re-weighting procedure can be more generally stated as:

- Calculate the proportion of each combination of confounder values (i.e., the empirical joint distribution of confounders).
- Weight the estimated average outcomes (conditional on all confounders) by their corresponding confounder proportions from Step 1.
- For each would-be intervention group, sum these weighted estimates.

This general procedure is known as theThe contrast (e.g., difference, ratio) of these standardized sums is a statistically consistent estimate of the ATE.g-formula.

## Intuition: Standardization

Why does this re-weighting method work? The idea is that we can estimate the ATE in a statistically consistent or unbiased way using an RCT. But when we only have RWE data, we can still replicate the way an RCT breaks the influence—and thereby, statistical association—of the confounders on the would-be intervention.

For each would-be intervention group, the g-formula achieves this by *standardizing* (in epidemiological terms) the distribution of confounders in the RWE to the distribution of confounders in the corresponding RCT. The latter is the same as the overall distribution of confounders (i.e., regardless of intervention group) thanks to randomization.

# Propensity Score Weighting

Recall that in our RWE data, Blacks were less likely to have used the app than Whites: Only 27.2% of Blacks used the app, compared to 88.2% of Whites. That is, the *propensity* to use the app was 0.272 among Black individuals, but 0.882 among White individuals.

```
observational_rwe_training %>%
ggplot2::ggplot(ggplot2::aes(x = race, fill = app)) +
ggplot2::theme_classic() +
ggplot2::geom_bar(position = "dodge") +
ggplot2::ggtitle("App Usage by Race")
```

`knitr::kable(df_rwe_training_race_app) # row proportions`

didn’t use app | used app | didn’t use app | used app | |
---|---|---|---|---|

Black | 6698 | 2500 | 0.7282018 | 0.2717982 |

White | 6474 | 48328 | 0.1181344 | 0.8818656 |

## Propensity Model

### Causal and Statistical Models

We had assumed that race affects app usage. The relevant part of our earlier DAG is:

- Race (\(Z\)) \(\rightarrow\) App Usage (\(X\))

```
DiagrammeR::grViz("
digraph causal {
# Nodes
node [shape = plaintext]
Z [label = 'Race \n (Z)']
X [label = 'App \n Usage \n (X)']
# Edges
edge [color = black,
arrowhead = vee]
rankdir = LR
Z -> X
# Graph
graph [overlap = true]
}")
```

Race is said to influence the propensity to use the app. Hence, the statistical model of this relationship is often called a *propensity model* in the causal inference literature. That is, it models how the would-be intervention is statistically related to its inputs.

The true model for `app`

usage as a function of `race`

is the logistic model listed in the Appendix of Part 1. The propensity to use the app is measured as the probability of using the app. In the context of this causal model, this probability is therefore called a *propensity score* (Rosenbaum and Rubin, 1983).

In reality, we wouldn’t know this true model. But suppose we’d correctly modeled it and estimated its parameters as follows:

```
glm_rwe_holdout_ps <- glm(
data = observational_rwe_holdout,
formula = as.factor(app) ~ race,
family = "binomial"
)
```

`knitr::kable(summary(glm_rwe_holdout_ps)$coefficients)`

Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|

(Intercept) | -1.010252 | 0.0474059 | -21.31067 | 0 |

raceWhite | 3.021527 | 0.0542881 | 55.65728 | 0 |

Now what?

## Intuition: Survey Sampling (Horvitz-Thompson Weights)

The propensity to use our app is like a *selection process*: here, “being selected” to use the app depended on race.

In *survey sampling*, such *selection probabilities* are used in probability sampling to ensure that survey participants are representative of the target population from which they are sampled. These probabilities are specified in the survey design—and hence, fully known in advance.

These selection probabilities can then be used to weight the survey respondents’ outcomes by the reciprocal or inverse of their corresponding selection probabilities. This general technique, called *inverse probability weighting*, provides a statistically consistent estimate of the mean outcome (i.e., the average outcome taken over the entire target population). In survey sampling, the resuling estimator of the population mean is called a Horvitz-Thompson (HT) estimator (Horvitz and Thompson, 1952).

Intuitively, each respondent is weighted in order to represent individuals who are similar in the factors that determine their selection probability. To see how, let’s revisit our earlier example of height and gender:

`knitr::kable(tbl_lte_example)`

id | gender | height |
---|---|---|

woman1 | 1 | 4.938729 |

woman2 | 1 | 5.217788 |

man1 | 0 | 5.222013 |

man2 | 0 | 5.820665 |

man3 | 0 | 5.257096 |

man4 | 0 | 5.186209 |

man5 | 0 | 5.788102 |

man6 | 0 | 5.482407 |

Suppose these eight individuals were part of a small probability sample (n = 514) taken from a much larger target population (n = 10000). The code to generate the target population and sample is:

```
# generate target population
mean_height_women <- 5.21654 # from https://ourworldindata.org/human-height
mean_height_men <- 5.61024 # from https://ourworldindata.org/human-height
n_pop_htw_example <- 10000
sel_prob_women <- 0.01
sel_prob_men <- 0.09
set.seed(2004201426)
tbl_htw_example <- dplyr::tibble(
gender = rbinom(n = n_pop_htw_example, size = 1, prob = 0.5),
height = gender * rnorm(n = n_pop_htw_example, mean = mean_height_women, sd = 0.3) +
(1 - gender) * rnorm(n = n_pop_htw_example, mean = mean_height_men, sd = 0.3),
selected = gender * rbinom(n = n_pop_htw_example, size = 1, prob = sel_prob_women) +
(1 - gender) * rbinom(n = n_pop_htw_example, size = 1, prob = sel_prob_men)
)
# select sample
tbl_htw_example_sample <- tbl_htw_example %>%
dplyr::filter(selected == 1)
```

In the target population (with numbers derived from ourworldindata.org/human-height):

- The mean height among women is 5.217 feet.
- The mean height among men is 5.61 feet.
- Women and men are distributed fairly equally: 51% are women.
- The true population mean height is 5.406.

Women were selected into our sample with probability 0.01, and men with probability 0.09. That is, men were nine times more likely to be selected into the sample than women; 90% of our sample are men (n = 464). Hence, the sample overall average height of 5.555 is higher than the true population mean height of 5.406.

However, if we weight each individual by the inverse of their respective selection probability, sum these values, and divide by the total population size, we get the weighted average:

```
with(
tbl_htw_example_sample,
sum(
c(
height[gender == 1] / sel_prob_women,
height[gender == 0] / sel_prob_men
)
)/nrow(tbl_htw_example)
)
```

`## [1] 5.498455`

This is closer to the true population mean height of 5.406.

## Propensity Scores: Weighting RWE Outcomes

We can apply such an HT estimator to estimate the ATE. In our case, there are two types of “target populations” just within our RWE data:

- the population of
*potential outcomes*if every individual had used the app - the population of potential outcomes if every individual had not used the app

Each potential outcome (Neyman, 1923; Rubin, 1974; Holland, 1986) is observed according to an individual’s actual app usage. The other unobserved potential outcome is called a counterfactual outcome, or simply *counterfactual*.

We weight each individual by the inverse of their respective propensity score, sum these values, and divide by the total number of individuals in our dataset to get the weighted averages per app usage group:

```
tbl_ate_ps <- observational_rwe_holdout %>%
dplyr::mutate(
infection01 = (infection == "1. infected"),
app01 = (app == "used app"),
propensityscore_used_app = predict(object = glm_rwe_holdout_ps, type = "response"),
propensityscore_didnt_use_app = 1 - propensityscore_used_app
)
estimated_risk_used_app <- with(
tbl_ate_ps,
sum(infection01[app01 == 1] / propensityscore_used_app[app01 == 1]) / nrow(tbl_ate_ps)
)
estimated_risk_didnt_use_app <- with(
tbl_ate_ps,
sum(infection01[app01 == 0] / propensityscore_didnt_use_app[app01 == 0]) / nrow(tbl_ate_ps)
)
estimated_ATE_ps <- estimated_risk_used_app - estimated_risk_didnt_use_app
```

We calculate the estimated RD as -0.022. Compare this to our RCT estimated RD of -0.034. Both estimates are much closer to the true ATE of -0.031 than our original RWE estimated RD of -0.169.

# Conclusion and Public Health Implications

## No Free Lunch

We’ve successfully applied two causal inference methods to estimate the true ATE of app usage on infection risk. These methods are statistically consistent for the ATE, but they rely on some key assumptions being true:

*No Unmeasured Confounders*: We’ve observed all possible confounders.*Correct Model Specification*: We’ve correctly specified both the causal and statistical models. (See Part 1 for the difference and relationship between the two.) That is, they both correctly formalize what happens in reality. This assumption applies to the outcome model when using the g-formula, and the propensity model when using propensity score weighting.*Positivity*: There is at least one individual in each intervention group for every unique combination of all observed confounder values (or at least for every rough but reasonably sized unique cluster of said values).

You can learn more about these and other important assumptions in Hernán and Robins (2006) and Hernán and Robins (2020), and at Adam Kelleher’s Medium page.

What’s stopping us from treating as many variables as possible as potential confounders (perhaps after cross-validated model and variable selection)? Why shouldn’t we include them all in the statistical model? One interesting caution against this is the prospect of *“M-bias”* (Ding and Miratrix, 2015). This is a special type of *collider bias* (Pearl, 2009) arising from including a model input that is not a confounder. The argument is that including such non-confounders can bias the ATE estimate, due to the induced collider bias.

## Choices, Choices

We’ve decided on an RWE explanatory model. But which statistical model should we fit? The outcome model, or the propensity model? Our choice will depend, in part, on how confident we are in fitting either model. And this choice will determine whether we use the g-formula or propensity score weighting.

What if there is only equivocal evidence in favor of one over the other? Perhaps we’re otherwise agnostic to which model to fit. In such cases, there are what are called *“doubly robust” estimators* that make use of both models (Bang and Robins, 2005). They are called as much because only one of the models needs to be correctly specified in order for the estimator to yield statistically consistent estimates of the ATE.

## Other Methods

There are at least three other popular methods for enabling ATE estimation and inference using RWE data.

*Matching*: In this approach, each individual in one intervention group is matched with at least one other individual in the other intervention group. A match is specified as some level of similarity between both individuals’ observed values for confounders. The idea is that any variation left must be due to the would-be intervention—and is therefore a likely treatment effect.*Propensity Score Matching*: What if there are too many confounders over which to find good matches? One well-supported solution is to match individuals in opposite intervention groups using their propensity scores, instead (Hirano and Imbens, 2001; Lunceford and Davidian M, 2004).*Instrumental Variables*: The g-formula and propensity score weighting both require that we’ve measured all confounders, and have included them in our statistical model. What if this isn’t possible? Suppose we’ve nonetheless observed another RWE variable that effectively randomized the would-be intervention, was not affected by any confounders, and otherwise had no effect on our outcome. This variable is an*instrument*we can use to calculate a statistically consistent estimate of the ATE (Angrist and Imbens, 1995).

## Public Health Implications

Suppose we mistake [the real-world evidence] empirical RD as an estimate of the ATE. Unbeknownst to us, in our simulated world we would overstate our telemedicine app’s effectiveness by claiming it reduces the risk of new coronavirus infections by 16.9%—when in fact it will only reduce this risk by 3.1%.

But we can re-weight our real-world evidence results to provide more accurate risk-reduction estimates of either 2.3% or 2.2%.

We conclude Part 2 with the following high-level action plan:

- Report and interpret your DAG for your client, along with your unadjusted ATE estimate.
- Adjust for confounding (e.g., through re-weighting). Report and interpret your adjusted ATE estimate to your client.

# References

Angrist JD, Imbens GW. Identification and estimation of local average treatment effects. National Bureau of Economic Research; 1995 Feb 1. https://www.nber.org/papers/t0118

Aubrey A. CDC Hospital Data Point To Racial Disparity In COVID-19 Cases. NPR. April 8, 2020 2:43 PM ET. npr.org/sections/coronavirus-live-updates/2020/04/08/830030932/cdc-hospital-data-point-to-racial-disparity-in-covid-19-cases

Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005 Dec;61(4):962-73.https://onlinelibrary.wiley.com/doi/full/10.1111/j.1541-0420.2005.00377.x

Ding P, Miratrix LW. To adjust or not to adjust? Sensitivity analysis of M-bias and butterfly-bias. Journal of Causal Inference. 2015 Mar 1;3(1):41-57. https://www.degruyter.com/view/journals/jci/3/1/article-p41.xml

Garg S. Hospitalization Rates and Characteristics of Patients Hospitalized with Laboratory-Confirmed Coronavirus Disease 2019—COVID-NET, 14 States, March 1–30, 2020. MMWR. Morbidity and Mortality Weekly Report. 2020;69. cdc.gov/mmwr/volumes/69/wr/mm6915e3.htm

Hernán MA, Robins JM. Estimating causal effects from epidemiological data. Journal of Epidemiology & Community Health. 2006 Jul 1;60(7):578-86. jech.bmj.com/content/60/7/578.short

Hernán MA, Robins JM (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Hirano K, Imbens GW. Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization. Health Services and Outcomes research methodology. 2001 Dec 1;2(3-4):259-78. https://link.springer.com/article/10.1023/A:1020371312283

Holland PW. Statistics and causal inference. Journal of the American statistical Association. 1986 Dec 1;81(396):945-60. https://www.jstor.org/stable/pdf/2289064.pdf

Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association. 1952 Dec 1;47(260):663-85. https://www.jstor.org/stable/pdf/2280784.pdf

Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in medicine. 2004 Oct 15;23(19):2937-60. https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.1903

Neyman J. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science. 1923, tr 1990;5(4):465-480. Translated and edited by D.M. Dabrowska and T.P. Speed from the Polish original, which appeared in Roczniki Nauk Rolniczych Tom X (1923) 1-51 (Annals of Agricultural Sciences). https://www.jstor.org/stable/pdf/2245382.pdf

Pearl J. Causality. Cambridge university press; 2009 Sep 14. bayes.cs.ucla.edu/BOOK-2K/

Robins J. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical modelling. 1986 Jan 1;7(9-12):1393-512. sciencedirect.com/science/article/pii/0270025586900886

Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983 Apr 1;70(1):41-55. academic.oup.com/biomet/article/70/1/41/240879

Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology. 1974 Oct;66(5):688. https://psycnet.apa.org/fulltext/1975-06502-001.pdf

Shmueli G. To explain or to predict?. Statistical science. 2010;25(3):289-310. projecteuclid.org/euclid.ss/1294167961

# Appendix

## The Law of Total Expectation (Continued)

The more general (abstract) expression of the statement from our earlier height-by-gender example is

\[ E (Y|Z=1) \times \Pr (Z=1) + E (Y|Z=0) \times \Pr (Z=0) = E (Y), \]

where \(E ( \cdot )\) is the expectation operator. Note that the variables are now all upper-case, indicating that they are random variables (e.g., due to random sampling)—unlike the lower-case fixed variables in our height-by-gender example.

The expectation operator takes the *expectation* (i.e., *expected value* or *mean*) of the random variable inside the parentheses like so:

- It multiplies each possible value with its corresponding probability.
- It then sums over all of these.

Loosely speaking, the expectation operator calculates a weighted average—with probabilities as the weights.

The expression \(E(Y|Z)\) denotes a conditional expectation, where the expectation of \(Y\) is taken over all \(Y\) at a given value \(Z\). The general expression above can be compactly stated as

\[ E (Y|Z=1) \times \Pr (Z=1) + E (Y|Z=0) \times \Pr (Z=0) = E_Z \left\{ E ( Y | Z ) \right\} , \] where \(E_Z ( \cdot )\) specifies taking the expectation of the expression inside the parentheses over all values of \(Z\).

The Law of Total Expectation states:

\[ E(Y) = E_Z \left\{ E ( Y | Z ) \right\} \] That is, the expected value of a random variable \(Y\) is equal to expectation of the expected value of \(Y\) conditional on a different random variable \(Z\), taken over all values of \(Z\).

## RCT Simulation R Code

```
##### Set simulation parameters
### Preliminaries
random_seed <- 2004101447
sample_size_observational <- 80000
holdout_proportion <- 0.2
sample_size_experimental <- sample_size_observational * holdout_proportion * 0.6
### Feature distribution
piZ <- 0.755 / (0.755 + 0.127) # race (based on U.S. Census)
### Outcome model
# beta0 and betaZ are derived from:
# https://www.cdc.gov/mmwr/volumes/69/wr/mm6915e3.htm
# https://www.npr.org/sections/coronavirus-live-updates/2020/04/08/830030932/cdc-hospital-data-point-to-racial-disparity-in-covid-19-cases
# https://www.washingtonpost.com/news/powerpost/paloma/the-health-202/2020/04/09/the-health-202-los-angeles-is-racing-to-discover-the-true-coronavirus-infection-rate/5e8de70588e0fa101a75e13d/
prInfection <- 0.15
prBlack <- 1 - piZ
prWhite <- piZ
prBlackGivenInfection <- 33 / (33 + 45)
prWhiteGivenInfection <- 1 - prBlackGivenInfection
prInfectionGivenBlack <- prBlackGivenInfection * prInfection / prBlack
prInfectionGivenWhite <- prWhiteGivenInfection * prInfection / prWhite
beta0 <- log(prInfectionGivenBlack / (1 - prInfectionGivenBlack)) # baseline: infection risk for Blacks who don't use app
betaX <- -0.3
betaZ <- log(prInfectionGivenWhite / (1 - prInfectionGivenWhite)) - beta0 # average influence of being White on infection risk
### Propensity model: app usage
alpha0_experimental <- 0 # randomized controlled trial: 0.5 randomization probability
alphaZ_experimental <- 0 # randomized controlled trial: 0.5 randomization probability
##### Generate data.
set.seed(random_seed + 3)
experimental_rct <- dplyr::tibble(
race = rbinom(n = sample_size_experimental, size = 1, prob = piZ),
app = rbinom(n = sample_size_experimental, size = 1, prob = plogis(alpha0_experimental + alphaZ_experimental * race)),
infection = rbinom(n = sample_size_experimental, size = 1, prob = plogis(beta0 + betaX * app + betaZ * race))
) %>%
dplyr::mutate(
race = ifelse(race == 1, "White", ifelse(race == 0, "Black", NA)),
app = ifelse(app == 1, "used app", ifelse(app == 0, "didn't use app", NA)),
infection = ifelse(infection == 1, "1. infected", ifelse(infection == 0, "0. uninfected", NA))
)
```