Causal questions and counterfactuals in pregnancy research Intro to R
Everyone is coming with a different background, so some things will likely be a review. Hopefully even if you’ve seen these topics before you will be thinking about them in a new or deeper way!
| Time | Activity |
|---|---|
| 09:30–10:30 | Session Block 1 |
| 10:30–11:00 | Tea/Coffee break |
| 11:00–12:00 | Session Block 2 |
| 12:00–13:00 | Lunch |
| 13:00–14:00 | Session Block 3 |
| 14:00–14:30 | Tea/Coffee break |
| 14:30–15:30 | Q&A/Office Hours |
R can be overwhelming if you’re not used to it
I teach a couple of short courses on R (for you to peruse later)
These also link to more resources that may be helpful


Miguel Hernán’s two-step algorithm for causal inference: Ask a causal question. Answer the causal question.
“A second chance to get causal inference right: a classification of data science tasks” (Hernán, Hsu, and Healy, 2019)
What proportion of pregnant people received COVID-19 vaccination during pregnancy in 2023?
How many preterm births occurred among women with gestational diabetes in our hospital last year?
What is the probability of preterm birth for a 35-year-old nulliparous woman with gestational diabetes?
How can we best guess which pregnancies will result in low birth weight based on first trimester characteristics?
Does COVID-19 vaccination during pregnancy reduce the risk of severe maternal illness compared to no vaccination?
Would discontinuing antidepressants during pregnancy reduce birth defect risk compared to continuing them?
What would have happened under different treatment scenarios?
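If \(Y\) is our factual (observed) outcome of interest, the counterfactual (potential) outcomes are \(Y^{\text{vacc}}\), the outcome if vaccinated, and \(Y^{\text{unvacc}}\), the outcome if unvaccinated.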
More generally: \(Y^a\) = outcome under treatment scenario \(A = a\)
We can only observe one potential outcome per person:
| ID | Vaccinated | Y | \(Y^{\text{vacc}}\) | \(Y^{\text{unvacc}}\) |
|---|---|---|---|---|
| 1 | 0 | Preterm | ??? | Preterm |
| 2 | 1 | Preterm | Preterm | ??? |
| 3 | 1 | Term | Term | ??? |
| 4 | 0 | Term | ??? | Term |
Missing data problem: Causal inference is about estimating the missing potential outcomes using data from people with observed outcomes
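A minimal sketch of the table above in R may make the missing-data framing concrete. The column names `y_vacc` and `y_unvacc` are hypothetical, and the code assumes that the observed outcome equals the potential outcome for the treatment actually received (often called consistency).

```r
# Toy data matching the four pregnancies in the table above
po <- data.frame(
  id         = 1:4,
  vaccinated = c(0, 1, 1, 0),
  y          = c("Preterm", "Preterm", "Term", "Term"),
  stringsAsFactors = FALSE
)

# Assumption (consistency): the observed outcome is the potential outcome
# under the treatment actually received; the other one stays missing (NA).
po$y_vacc   <- ifelse(po$vaccinated == 1, po$y, NA_character_)
po$y_unvacc <- ifelse(po$vaccinated == 0, po$y, NA_character_)

po
```

Every row has exactly one `NA` among the two potential-outcome columns, which is the part causal inference has to estimate.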
But what does the variable “Vaccinated” mean? How can we be sure that we make a fair comparison?
“If I use this treatment during pregnancy, do I have an increased risk of some outcome?”
Problems with this question:
“If I use this treatment during pregnancy, do I have an increased risk relative to if I had not used it?”
This translates to: “What is the average risk in people like me who use the medication relative to the risk in an identical group who do not?”
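In the potential-outcome notation, with \(Y^{a}\) the outcome under treatment scenario \(A = a\), this comparison is commonly written as a causal risk difference (RD) or risk ratio (RR). The coding \(a = 1\) for use, \(a = 0\) for non-use, and \(Y = 1\) for the outcome occurring is an assumption made here for illustration:

\[
\text{RD} = \Pr[Y^{a=1} = 1] - \Pr[Y^{a=0} = 1], \qquad
\text{RR} = \frac{\Pr[Y^{a=1} = 1]}{\Pr[Y^{a=0} = 1]}
\]

Both probabilities refer to the same population, once under each scenario, which is what makes the comparison group "identical".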
Among users:
Among non-users (or former users):
All compared to an identical population who does something different.
The comparison population might do the same thing but at a different time (e.g., effect of earlier vs. later treatment), might do a different thing (e.g., treatment with a different drug), might do the same thing but more/less (different dose or duration), or might not do anything at all (but could have!)
We can be more specific:
Notice that we’re being more specific with the outcome as well, to allow us to study effects across gestational duration
| ID | Vaccination week | T | \(T^{x=12}\) | \(T^{x=22}\) | \(T^{x\neq12}\) |
|---|---|---|---|---|---|
| 5 | Never | 11 | 11 | 11 | 11 |
| 6 | Never | 14 | ??? | 14 | 14 |
| 7 | 12 | 18 | 18 | ??? | ??? |
| 8 | 34 | 40 | ??? | ??? | ??? |
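One possible reading of the first two potential-outcome columns in the table above can be sketched in R. The helper `potential_t()` and the column names are hypothetical. The rule it encodes is an assumption about how the table was filled in: \(T^{x}\) is known if the person was actually vaccinated in week \(x\), or if the pregnancy ended before week \(x\) while still unvaccinated.

```r
# Potential end-of-pregnancy week under the strategy "vaccinate at week x".
# vacc_week: observed vaccination week (NA if never vaccinated)
# t_obs:     observed gestational week at which the pregnancy ended
# Returns t_obs when the observed data are compatible with the strategy,
# and NA when the potential outcome is unobserved.
potential_t <- function(vacc_week, t_obs, x) {
  followed_strategy <- !is.na(vacc_week) & vacc_week == x
  # Pregnancy ended before week x while still unvaccinated, so the strategy
  # never had a chance to differ from what actually happened.
  ended_before_x <- t_obs < x & is.na(vacc_week)
  ifelse(followed_strategy | ended_before_x, t_obs, NA)
}

# The four pregnancies from the table above
tab <- data.frame(
  id        = 5:8,
  vacc_week = c(NA, NA, 12, 34),
  t_obs     = c(11, 14, 18, 40)
)

tab$t_x12 <- potential_t(tab$vacc_week, tab$t_obs, x = 12)
tab$t_x22 <- potential_t(tab$vacc_week, tab$t_obs, x = 22)

tab
```

The sketch deliberately leaves out the \(T^{x\neq12}\) column, since that strategy covers many possible vaccination weeks and needs a more careful definition.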
.qmd
These slides are made in Quarto too, so I can generate output using R code!
I asked you to install some packages with install.packages()
install.packages("packagename") installs the most recent version of a package, which helps ensure you get the same results as me
If you don’t have a package installed, you may see a prompt offering to install it (you can click Install!)
Every time you start a new R session (which I suggest you do often, including with every new set of exercises), you need to load the packages with library(packagename)
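As a rough sketch of the install-once, load-every-session pattern: the helper `missing_packages()` is hypothetical, and the package names are only examples, not necessarily the list used in the exercises.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Example package names; replace them with the ones the exercises use
needed <- c("ggplot2", "dplyr")
to_install <- missing_packages(needed)

# Install only what is missing (uncomment to run; needs an internet connection)
# if (length(to_install) > 0) install.packages(to_install)

# Load a package at the start of each new R session
# library(ggplot2)
```

Installing happens once per computer, while `library()` has to be repeated in every new session.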
I also set a ggplot theme at the top, so it applies to all figures
You can run the whole chunk at once with the green “play” button, or run line-by-line with Cmd+Enter (Mac) or Ctrl+Enter (Windows)
I generated more data than we need, so we are just going to sample n = 10,000 pregnancies, but we may change that as we go
Setting the seed ensures that we get the same random sample each time we run this code
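A small sketch of seeded sampling, using base R and a made-up stand-in for the full dataset. The object name `full_dat`, its size, and the seed value are all assumptions; the course code may sample differently.

```r
# Hypothetical stand-in for the full dataset: one row per pregnancy
full_dat <- data.frame(ID = 1:50000)

n <- 10000      # sample size; change this to draw a smaller or larger sample
set.seed(123)   # arbitrary seed; any fixed number makes the draw repeatable

# Sample n rows without replacement
dat <- full_dat[sample(nrow(full_dat), n), , drop = FALSE]

nrow(dat)
```

Re-running the block from `set.seed()` onward returns the same rows; changing or removing the seed gives a different random sample.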
The data is somewhat but not totally realistic
| Variable Name | Description |
|---|---|
| ID | Unique pregnancy identifier |
| maternal_age | Maternal age at conception (years) |
| BMI_b4preg | Pre-pregnancy body mass index (kg/m²) |
| Riskfactors | Pre-existing risk factors (low vs. moderate or high) |
| educ | Educational level (low vs. high) |
| nullparity | Nulliparity (1 if first pregnancy, 0 if multiparous) |
| country_birth | Country of birth (Scandinavia vs. Outside Scandinavia) |
| Variable Name | Description |
|---|---|
| bleeding_beforeWk13 | Bleeding before week 13 (0/1) |
| bleeding_week13_28 | Bleeding between weeks 13 and 28 (0/1) |
| bleeding_afterWk28 | Bleeding after week 28 (0/1) |
| gest_week | Gestational age at delivery/pregnancy end (weeks) |
| gest_days | Gestational age at delivery/pregnancy end (days) |
| sab | Spontaneous abortion (1 if pregnancy ended <20 weeks, 0 otherwise) |
| stillbirth | Stillbirth (1 if fetal death ≥20 weeks, 0 if live birth, NA if SAB) |
| preterm | Preterm birth (1 if <37 weeks, 0 if ≥37 weeks, NA if SAB) |
| birthweight | Birth weight in grams (NA if SAB) |
| end_preg_event | Did we observe the pregnancy ending? (1 if yes, 0 if censored) |
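To illustrate how the outcome indicators relate to gestational age, here is a sketch that derives `sab` and `preterm` from `gest_week` using the definitions in the table above. The values are made up, and the sketch ignores censoring (`end_preg_event`) and stillbirth, so it is not how the course data were generated.

```r
# Toy gestational ages at pregnancy end, in weeks (made-up values)
toy <- data.frame(gest_week = c(9, 19, 24, 36, 37, 41))

# Spontaneous abortion: pregnancy ended before 20 weeks
toy$sab <- as.integer(toy$gest_week < 20)

# Preterm: ended before 37 weeks; not defined (NA) when the pregnancy was a SAB
toy$preterm <- ifelse(toy$sab == 1, NA, as.integer(toy$gest_week < 37))

toy
```

The `NA` values matter in later analyses: functions such as `mean()` return `NA` unless `na.rm = TRUE` is set, which restricts the calculation to pregnancies that reached 20 weeks.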

# Look at pregnancy outcomes
ggplot(dat, aes(gest_week)) +
  geom_histogram(bins = 40, alpha = 0.7) +
  # Dashed reference lines at the SAB (20 weeks) and preterm (37 weeks) cutoffs
  geom_vline(xintercept = 20, linetype = "dashed", color = "red") +
  geom_vline(xintercept = 37, linetype = "dashed", color = "blue") +
  labs(
    x = "Gestational age at pregnancy end (weeks)",
    y = "Count",
    title = "Distribution of pregnancy lengths",
    subtitle = "Red line: 20 weeks (SAB cutoff); Blue line: 37 weeks (preterm cutoff)"
  )