Adaptive Trial Design with Digital Twins

Monte Carlo simulation, power analysis, and interim decisions — Post 5 in the Digital Twins for Vaccine Trials series

digital twin
vaccine
adaptive design
trial simulation
power analysis
R
Author

Jong-Hoon Kim

Published

April 22, 2026

1 Simulation as the engine of trial design

Trial design has always involved simulation. The classical approach uses algebraic power formulas: given assumed effect size, variance, and type I and II error rates, solve for sample size. This works well for simple endpoints and fixed designs.

But modern vaccine trials are not simple. They may have:

  - Heterogeneous protection across participants (the correlate of protection from Posts 2–3)
  - Time-varying endpoints (protection wanes; participants are infected at different times)
  - Interim analyses with pre-specified stopping rules (for efficacy or futility)
  - Adaptive randomisation that shifts participants toward better-performing arms
  - Augmented control arms from digital-twin-generated synthetics (Post 4)

None of these features fit the classical power formula. The correct tool is Monte Carlo trial simulation: generate thousands of simulated trials from the full model pipeline, apply the planned statistical analysis to each, and estimate operating characteristics (power, type I error, expected sample size, expected duration) across the space of plausible scenarios.

This is the full-pipeline payoff of everything built in Posts 1–4. The digital twin doesn’t just explain biology — it computes trial properties.


2 The simulation loop

A single Monte Carlo trial proceeds as follows:

For each simulated trial:
  1. Draw n_vacc virtual patients from the titre distribution (Post 3 VPC).
  2. Apply the CoP model (Post 2) to get each patient's P(protected | titre).
  3. Simulate n_ctrl synthetic control patients with baseline attack rate AR0.
  4. Simulate binary infection outcomes for both arms.
  5. Apply the pre-specified statistical test (e.g., Chi-squared or log-rank).
  6. Record: reject H0? Estimated VE? Confidence interval?
Repeat B times. Compute: power = fraction of trials that rejected H0.
Code
# Helpers from earlier posts
hill_p <- function(titre, EC50 = 120, k = 2.2) titre^k / (titre^k + EC50^k)

simulate_trial <- function(n_vacc, n_ctrl, AR0,
                            mu_log = log(150), sigma_log = 0.65,
                            EC50 = 120, k = 2.2) {
  # Vaccinated arm
  titre   <- rlnorm(n_vacc, mu_log, sigma_log)
  p_prot  <- hill_p(titre, EC50, k)
  inf_v   <- rbinom(n_vacc, 1, (1 - p_prot) * AR0)

  # Synthetic control arm
  inf_c   <- rbinom(n_ctrl, 1, AR0)

  AR_v    <- mean(inf_v)
  AR_c    <- mean(inf_c)
  VE_obs  <- 1 - AR_v / AR_c

  # Two-proportion z-test (one-sided H0: VE <= 0)
  p_val <- prop.test(c(sum(inf_v), sum(inf_c)),
                     c(n_vacc, n_ctrl),
                     alternative = "less")$p.value

  list(AR_v = AR_v, AR_c = AR_c, VE = VE_obs, p_value = p_val,
       reject = p_val < 0.025)
}
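
As a quick sanity check on these pieces, the population-average protection implied by the Hill CoP and the titre distribution can be computed directly; under the base-case parameters it lands near the 60% "true VE" used throughout this post (a minimal sketch; `hill_p` is restated so the snippet runs on its own):

```r
set.seed(1)
# Restated CoP helper so this snippet is self-contained
hill_p <- function(titre, EC50 = 120, k = 2.2) titre^k / (titre^k + EC50^k)

# Implied population VE = E[P(protected | titre)] under the titre distribution
titre_draws <- rlnorm(1e5, meanlog = log(150), sdlog = 0.65)
VE_implied  <- mean(hill_p(titre_draws))
round(VE_implied, 2)  # close to the nominal 60% (the Section 7 table shows VE ~ 59%)
```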

3 Power analysis: how many participants do we need?

We sweep across sample sizes and compute empirical power for a scenario with true VE = 60% and baseline attack rate AR0 = 0.30.

Code
n_seq <- seq(200, 2000, by = 100)
B     <- 2000

power_sim <- sapply(n_seq, function(N) {
  n_each <- N %/% 2
  results <- replicate(B, simulate_trial(n_each, n_each, AR0 = 0.30))
  mean(unlist(results["reject", ]))
})

# Classical formula (homogeneous VE = 60%, binomial)
# H0: p1 = p2; H1: p1 = (1 - VE_true) * AR0
AR0_true    <- 0.30
VE_true     <- 0.60
p1_true     <- (1 - VE_true) * AR0_true  # = 0.12
p2_true     <- AR0_true                   # = 0.30

power_classical <- sapply(n_seq, function(N) {
  n_each <- N %/% 2
  pwr    <- power.prop.test(n = n_each, p1 = p1_true, p2 = p2_true,
                             sig.level = 0.025, alternative = "one.sided")$power
  pwr
})

df_power <- tibble(N = n_seq,
                   Simulation = power_sim,
                   Classical  = power_classical) |>
  pivot_longer(-N, names_to = "method", values_to = "power")

ggplot(df_power, aes(x = N, y = power, colour = method, linetype = method)) +
  geom_line(linewidth = 1.0) +
  geom_hline(yintercept = 0.80, linetype = "dashed", colour = "grey40") +
  annotate("text", x = 1950, y = 0.82, label = "80% power target",
           colour = "grey40", size = 3.5, hjust = 1) +
  scale_colour_manual(values = c("Simulation" = "#3B6EA8",
                                  "Classical"  = "#E87722")) +
  scale_linetype_manual(values = c("Simulation" = "solid",
                                    "Classical"  = "dashed")) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
  labs(x = "Total sample size (N = n_vacc + n_ctrl, 1:1 allocation)",
       y = "Power",
       colour = NULL, linetype = NULL) +
  theme(legend.position = "top")

Empirical power curves from Monte Carlo simulation (B = 2000 trials per sample size). The x-axis is the total sample size (1:1 randomisation). The blue solid line shows power when using the correct synthetic control attack rate (AR0 = 0.30). The orange dashed line shows the classical binomial power formula for comparison. The digital-twin simulation accounts for heterogeneous VE across virtual patients — which the classical formula treats as homogeneous — explaining why the two curves diverge at moderate N. The grey dashed line marks the 80% power target.

The simulation-based power curve diverges from the classical formula for two reasons. First, the classical formula assumes a single homogeneous VE of exactly 60%, while the digital twin reflects the actual distribution of protection probabilities: averaging the Hill curve over the titre distribution gives an implied population VE slightly below the nominal value, and participants with low titres contribute little protection. Second, prop.test applies a continuity correction by default, making the empirical test slightly conservative relative to the uncorrected normal approximation in power.prop.test. Together these effects inflate the required sample size relative to the naive calculation.
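
The heterogeneity is easy to inspect directly: drawing virtual patients and summarising their individual protection probabilities shows how wide the spread is around the population mean (a standalone sketch restating `hill_p`; the quantile cut-points are illustrative):

```r
set.seed(7)
# Restated CoP helper so this snippet runs on its own
hill_p <- function(titre, EC50 = 120, k = 2.2) titre^k / (titre^k + EC50^k)

titre  <- rlnorm(1e4, meanlog = log(150), sdlog = 0.65)
p_prot <- hill_p(titre)

# 10th/50th/90th percentiles of per-patient P(protected): low-titre patients
# are barely protected even though the population mean sits near 0.6
round(quantile(p_prot, c(0.1, 0.5, 0.9)), 2)
```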


4 Scenario analysis: robustness to uncertain attack rate

The baseline attack rate AR0 is rarely known precisely before a trial. If you design for AR0 = 0.30 but the actual epidemic results in AR0 = 0.15, your trial may be severely underpowered. The digital twin pipeline lets you explore this explicitly.

Code
AR0_scenarios <- c(0.15, 0.30, 0.45)
scenario_labels <- c("AR0 = 15% (low transmission)",
                      "AR0 = 30% (moderate transmission)",
                      "AR0 = 45% (high transmission)")

power_scenarios <- lapply(seq_along(AR0_scenarios), function(s) {
  ar0 <- AR0_scenarios[s]
  pw  <- sapply(n_seq, function(N) {
    n_each <- N %/% 2
    mean(replicate(B, simulate_trial(n_each, n_each, AR0 = ar0)$reject))
  })
  tibble(N = n_seq, power = pw, scenario = scenario_labels[s])
})

df_scen <- bind_rows(power_scenarios)

ggplot(df_scen, aes(x = N, y = power, colour = scenario)) +
  geom_line(linewidth = 0.9) +
  geom_hline(yintercept = 0.80, linetype = "dashed", colour = "grey40") +
  scale_colour_manual(values = c(
    "AR0 = 15% (low transmission)"      = "#D7191C",
    "AR0 = 30% (moderate transmission)" = "#3B6EA8",
    "AR0 = 45% (high transmission)"     = "#1A9641"
  )) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
  labs(x = "Total sample size", y = "Power",
       colour = NULL) +
  theme(legend.position = "top",
        legend.text = element_text(size = 9))

Power as a function of sample size under three baseline attack rate scenarios: 15%, 30%, and 45%. When the attack rate is lower than assumed (e.g., the epidemic wanes during the trial), power drops sharply: a trial sized for AR0 = 0.30 can fall well short of the 80% target if AR0 turns out to be 0.15 (compare the N = 400 row of the operating characteristics table in Section 7). Designing for the pessimistic scenario, or building in adaptive sample-size re-estimation, is essential for robust trial planning.
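
The sample-size re-estimation mentioned above can be sketched with the classical formula: observe the pooled attack rate at an interim look, then re-solve for the total N that restores target power at the design VE. This is a deliberately simplified sketch; real re-estimation procedures must pre-specify the rule and protect type I error, and `reestimate_N` is a hypothetical helper, not part of the post's pipeline:

```r
# Hypothetical helper: classical sample-size re-estimation after observing
# the control-arm attack rate at an interim look. Assumes the design VE.
reestimate_N <- function(AR0_obs, VE_design = 0.60,
                         alpha = 0.025, power = 0.80) {
  p1 <- (1 - VE_design) * AR0_obs  # expected vaccinated-arm attack rate
  p2 <- AR0_obs                    # control-arm attack rate
  n_each <- power.prop.test(p1 = p1, p2 = p2, power = power,
                            sig.level = alpha,
                            alternative = "one.sided")$n
  2 * ceiling(n_each)              # total N under 1:1 allocation
}

reestimate_N(0.30)  # the design assumption
reestimate_N(0.15)  # epidemic wanes: a substantially larger trial is needed
```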

5 Adaptive interim analysis

An adaptive design monitors accumulating data at pre-specified interim points and allows the trial to stop early for efficacy (if VE is already convincingly demonstrated) or futility (if it is already clear the vaccine will not meet the success threshold). Adaptive stopping reduces expected sample size and trial duration without inflating the overall type I error, provided the stopping boundaries are pre-specified (1,2).

We implement a simple two-interim group sequential design with O'Brien-Fleming-type boundaries (3): early looks use very stringent p-value thresholds, which relax toward the nominal level as more data accumulate.

Code
# O'Brien-Fleming spending function alpha boundaries (1-sided alpha = 0.025)
# Three analyses: after 1/3, 2/3, and 3/3 of planned events
obf_bounds <- c(0.0026, 0.0136, 0.0250)   # approximate OBF critical p-values
futility_bounds <- c(0.40, 0.20)           # stop for futility if estimated VE falls
                                           # below these thresholds at interims 1 and 2

cat("O'Brien-Fleming efficacy boundaries (p-value):\n")
O'Brien-Fleming efficacy boundaries (p-value):
Code
cat(sprintf("  Interim 1 (33%% events): p < %.4f\n", obf_bounds[1]))
  Interim 1 (33% events): p < 0.0026
Code
cat(sprintf("  Interim 2 (67%% events): p < %.4f\n", obf_bounds[2]))
  Interim 2 (67% events): p < 0.0136
Code
cat(sprintf("  Final     (100%% events): p < %.4f\n", obf_bounds[3]))
  Final     (100% events): p < 0.0250
Code
cat("\nFutility boundaries (estimated VE):\n")

Futility boundaries (estimated VE):
Code
cat(sprintf("  Interim 1: VE < %.0f%% → stop\n", 100 * futility_bounds[1]))
  Interim 1: VE < 40% → stop
Code
cat(sprintf("  Interim 2: VE < %.0f%% → stop\n", 100 * futility_bounds[2]))
  Interim 2: VE < 20% → stop
Code
simulate_adaptive_trial <- function(N_planned, AR0,
                                     mu_log = log(150), sigma_log = 0.65,
                                     EC50 = 120, k = 2.2,
                                     obf = obf_bounds,
                                     fut = futility_bounds) {
  n_each <- N_planned %/% 2

  # Generate all participants upfront; observe in sequential thirds
  titre  <- rlnorm(n_each, mu_log, sigma_log)
  p_prot <- hill_p(titre, EC50, k)
  inf_v  <- rbinom(n_each, 1, (1 - p_prot) * AR0)
  inf_c  <- rbinom(n_each, 1, AR0)

  analysis_points <- round(c(1/3, 2/3, 1) * n_each)

  for (ia in seq_along(analysis_points)) {
    n_ia <- analysis_points[ia]
    AR_v_ia <- mean(inf_v[1:n_ia])
    AR_c_ia <- mean(inf_c[1:n_ia])
    VE_ia   <- ifelse(AR_c_ia > 0, 1 - AR_v_ia / AR_c_ia, 0)

    p_ia <- tryCatch(
      prop.test(c(sum(inf_v[1:n_ia]), sum(inf_c[1:n_ia])),
                c(n_ia, n_ia), alternative = "less")$p.value,
      error = function(e) 1
    )

    if (ia < length(analysis_points)) {
      # Efficacy stop
      if (p_ia < obf[ia]) {
        return(list(stopped = "efficacy", analysis = ia, n_used = n_ia * 2,
                    VE = VE_ia, reject = TRUE))
      }
      # Futility stop
      if (VE_ia < fut[ia]) {
        return(list(stopped = "futility", analysis = ia, n_used = n_ia * 2,
                    VE = VE_ia, reject = FALSE))
      }
    } else {
      # Final analysis
      return(list(stopped = "final", analysis = ia, n_used = n_ia * 2,
                  VE = VE_ia, reject = p_ia < obf[ia]))
    }
  }
}

# Run simulation: N_planned = 1000, AR0 = 0.30, true VE = 60%
N_planned <- 1000
B_adapt   <- 2000

results_adapt <- replicate(B_adapt,
  simulate_adaptive_trial(N_planned, AR0 = 0.30),
  simplify = FALSE
)

stop_reason <- sapply(results_adapt, `[[`, "stopped")
n_used      <- sapply(results_adapt, `[[`, "n_used")
reject_flag <- sapply(results_adapt, `[[`, "reject")
VE_est      <- sapply(results_adapt, `[[`, "VE")

cat(sprintf("Adaptive design (N_planned = %d, true VE = 60%%):\n", N_planned))
Adaptive design (N_planned = 1000, true VE = 60%):
Code
cat(sprintf("  Power (reject H0):          %.1f%%\n",  100 * mean(reject_flag)))
  Power (reject H0):          94.7%
Code
cat(sprintf("  Stopped for efficacy:       %.1f%%\n",  100 * mean(stop_reason == "efficacy")))
  Stopped for efficacy:       94.7%
Code
cat(sprintf("  Stopped for futility:       %.1f%%\n",  100 * mean(stop_reason == "futility")))
  Stopped for futility:       5.3%
Code
cat(sprintf("  Reached final analysis:     %.1f%%\n",  100 * mean(stop_reason == "final")))
  Reached final analysis:     0.0%
Code
cat(sprintf("  Mean N used:               %.0f (planned: %d)\n",
            mean(n_used), N_planned))
  Mean N used:               368 (planned: 1000)
Code
cat(sprintf("  Expected N savings:         %.0f%%\n",
            100 * (1 - mean(n_used) / N_planned)))
  Expected N savings:         63%
Code
df_stop <- tibble(
  reason = factor(stop_reason,
                  levels = c("efficacy", "futility", "final"),
                  labels = c("Stopped: efficacy",
                             "Stopped: futility",
                             "Final analysis"))
)

p_stop <- ggplot(df_stop, aes(x = reason, fill = reason)) +
  geom_bar(width = 0.5, show.legend = FALSE) +
  scale_fill_manual(values = c("Stopped: efficacy"  = "#1A9641",
                                "Stopped: futility"  = "#D7191C",
                                "Final analysis"     = "#3B6EA8")) +
  scale_y_continuous(labels = scales::comma_format()) +
  labs(x = NULL, y = "Number of simulated trials") +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

p_n <- ggplot(tibble(n = n_used), aes(x = n)) +
  geom_histogram(bins = 25, fill = "#3B6EA8", colour = "white", linewidth = 0.2) +
  geom_vline(xintercept = mean(n_used), colour = "#E87722",
             linewidth = 1.0, linetype = "dashed") +
  geom_vline(xintercept = N_planned, colour = "grey40",
             linewidth = 0.8, linetype = "dotted") +
  annotate("text", x = mean(n_used) + 30, y = Inf, vjust = 1.5,
           label = sprintf("Mean N\n= %.0f", mean(n_used)),
           colour = "#E87722", size = 3.5) +
  annotate("text", x = N_planned + 30, y = Inf, vjust = 3.5,
           label = "Planned\nN = 1000", colour = "grey40", size = 3.5) +
  labs(x = "Actual N used", y = "Count")

p_stop + p_n

Operating characteristics of the adaptive group sequential design (B = 2000 simulations, N_planned = 1000, true VE = 60%). Left: stopping reason (efficacy/futility/final analysis). Right: distribution of actual sample sizes used. The majority of trials stop at an interim analysis, substantially reducing the expected number of participants exposed to either the vaccine or synthetic control.
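
A complementary check on any group sequential design is its false-positive rate under the null. Simulating the same three-look procedure with a vaccine that confers no protection (both arms share AR0) should give a rejection rate at or below the nominal one-sided 2.5%; the aggressive futility rules make it conservative in practice. The snippet below is self-contained and restates the boundaries (a sketch, not the post's exact pipeline):

```r
set.seed(42)
obf_bounds      <- c(0.0026, 0.0136, 0.0250)  # nominal p-value boundaries
futility_bounds <- c(0.40, 0.20)              # stop if estimated VE falls below

# One null trial: no protection, so any efficacy rejection is a false positive
null_trial <- function(n_each = 500, AR0 = 0.30) {
  inf_v <- rbinom(n_each, 1, AR0)  # vaccinated arm, but VE = 0 under H0
  inf_c <- rbinom(n_each, 1, AR0)
  looks <- round(c(1/3, 2/3, 1) * n_each)
  for (ia in seq_along(looks)) {
    n_ia  <- looks[ia]
    p_ia  <- suppressWarnings(
      prop.test(c(sum(inf_v[1:n_ia]), sum(inf_c[1:n_ia])),
                c(n_ia, n_ia), alternative = "less")$p.value)
    VE_ia <- 1 - mean(inf_v[1:n_ia]) / max(mean(inf_c[1:n_ia]), 1e-9)
    if (ia < length(looks)) {
      if (p_ia < obf_bounds[ia]) return(TRUE)         # false efficacy stop
      if (VE_ia < futility_bounds[ia]) return(FALSE)  # futility stop
    } else {
      return(p_ia < obf_bounds[ia])
    }
  }
}

type1 <- mean(replicate(2000, null_trial()))
type1  # sits well below the nominal 0.025 because futility bites early under H0
```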

6 The value of simulation: sensitivity to model uncertainty

A key advantage of Monte Carlo simulation over analytical formulas is the ability to propagate model uncertainty into operating characteristics. Suppose we are uncertain about the true EC50 of the CoP model (Post 2). How does that uncertainty affect trial power?

Code
EC50_scenarios <- c(80, 120, 180)
EC50_labels    <- c("EC50 = 80 (strong CoP)", "EC50 = 120 (base case)",
                     "EC50 = 180 (weak CoP)")

power_cop <- lapply(seq_along(EC50_scenarios), function(s) {
  ec <- EC50_scenarios[s]
  pw <- sapply(n_seq, function(N) {
    n_each <- N %/% 2
    results <- replicate(B, {
      titre   <- rlnorm(n_each, log(150), 0.65)
      p_prot  <- hill_p(titre, EC50 = ec)
      inf_v   <- rbinom(n_each, 1, (1 - p_prot) * 0.30)
      inf_c   <- rbinom(n_each, 1, 0.30)
      p_val   <- tryCatch(
        prop.test(c(sum(inf_v), sum(inf_c)), c(n_each, n_each),
                  alternative = "less")$p.value,
        error = function(e) 1
      )
      p_val < 0.025
    })
    mean(results)
  })
  tibble(N = n_seq, power = pw, scenario = EC50_labels[s])
})

df_cop <- bind_rows(power_cop)

ggplot(df_cop, aes(x = N, y = power, colour = scenario)) +
  geom_line(linewidth = 0.9) +
  geom_hline(yintercept = 0.80, linetype = "dashed", colour = "grey40") +
  scale_colour_manual(values = c(
    "EC50 = 80 (strong CoP)"  = "#1A9641",
    "EC50 = 120 (base case)"  = "#3B6EA8",
    "EC50 = 180 (weak CoP)"   = "#E87722"
  )) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
  labs(x = "Total sample size", y = "Power", colour = NULL) +
  theme(legend.position = "top",
        legend.text = element_text(size = 9))

Power as a function of total sample size for three CoP scenarios: EC50 = 80 (stronger CoP, higher VE), EC50 = 120 (base case), and EC50 = 180 (weaker CoP). If the true EC50 is higher than assumed, the vaccine is less effective than predicted and the trial may be underpowered. Designing for the pessimistic CoP (orange) provides robustness at the cost of a larger trial. The digital-twin pipeline makes this sensitivity analysis effortless — change one parameter and re-run.

This is the core business case for digital twin trial simulation: it makes uncertainty visible and quantifiable. A sponsor who designs a trial assuming EC50 = 80 but faces a true EC50 = 180 will run a severely underpowered trial. The simulation tells them this before they spend a dollar on recruitment.
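
The implied population VE under each EC50 scenario can be read off directly from the CoP and the titre distribution, which makes the power ordering in the figure unsurprising (a self-contained sketch restating `hill_p`):

```r
set.seed(11)
# Restated CoP helper so this snippet runs on its own
hill_p <- function(titre, EC50 = 120, k = 2.2) titre^k / (titre^k + EC50^k)

titre <- rlnorm(1e5, meanlog = log(150), sdlog = 0.65)
ve_implied <- sapply(c(80, 120, 180), function(ec) mean(hill_p(titre, EC50 = ec)))
names(ve_implied) <- c("EC50=80", "EC50=120", "EC50=180")
round(ve_implied, 2)  # monotone decreasing: a weaker CoP means a less effective vaccine
```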


7 Summary: operating characteristics table

The standard output of a trial simulation study is a table of operating characteristics across the key design parameters. This is what goes into a trial design document and, ultimately, the FDA/EMA pre-meeting briefing package.

Code
# Generate a small OC table for AR0 x N combinations
AR0_grid  <- c(0.15, 0.20, 0.30, 0.40)
N_grid    <- c(400, 600, 800, 1200)
B_table   <- 1000

oc_list <- lapply(N_grid, function(N) {
  lapply(AR0_grid, function(ar) {
    n_each <- N %/% 2
    rr <- replicate(B_table, simulate_trial(n_each, n_each, AR0 = ar))
    tibble(
      N      = N,
      AR0    = ar,
      power  = mean(unlist(rr["reject", ])),
      VE_med = median(unlist(rr["VE", ]))
    )
  })
})

df_oc <- bind_rows(lapply(oc_list, bind_rows)) |>
  mutate(
    power_pct = sprintf("%.0f%%", 100 * power),
    VE_pct    = sprintf("%.0f%%", 100 * VE_med),
    display   = paste0(power_pct, " (VE≈", VE_pct, ")")
  ) |>
  select(N, AR0, display) |>
  pivot_wider(names_from = AR0, values_from = display,
              names_prefix = "AR0=")

knitr::kable(df_oc, caption = "Operating characteristics: power (and median estimated VE) by total N and baseline attack rate AR0. True VE = 60%; 1:1 allocation; one-sided α = 0.025. Bold cells meet the 80% power target.")
Operating characteristics: power (and median estimated VE) by total N and baseline attack rate AR0. True VE = 60%; 1:1 allocation; one-sided α = 0.025. Bold cells meet the 80% power target.
N      AR0=0.15        AR0=0.20        AR0=0.30        AR0=0.40
400    79% (VE≈59%)    90% (VE≈59%)    99% (VE≈59%)    100% (VE≈59%)
600    93% (VE≈58%)    98% (VE≈59%)    100% (VE≈59%)   100% (VE≈59%)
800    99% (VE≈59%)    100% (VE≈59%)   100% (VE≈59%)   100% (VE≈59%)
1200   100% (VE≈59%)   100% (VE≈59%)   100% (VE≈59%)   100% (VE≈59%)

8 What comes next

This post completes the technical core of the series:

  1. Post 1: Within-host ODE model
  2. Post 2: Correlate of protection
  3. Post 3: Virtual patient cohorts
  4. Post 4: Synthetic control arms
  5. Post 5 (this post): Adaptive trial design simulation

Post 6 steps back from the code and asks: who is building this commercially, what is the business model, and where are the realistic entry points for a new competitor? It translates everything we have built technically into a map of the commercial digital-twin landscape — the organisations, platforms, regulatory milestones, and business models shaping this market.


References

1. Berry DA. Bayesian clinical trials. Nature Reviews Drug Discovery. 2006;5:27–36. doi:10.1038/nrd1927
2. Pallmann P, Bedding AW, Choodari-Oskooei B, Dimairo M, Flight L, Hampson LV, et al. Adaptive designs in clinical trials: Why use them, and how to run and report them. BMC Medicine. 2018;16:29. doi:10.1186/s12916-018-1017-7
3. Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika. 1983;70(3):659–63. doi:10.1093/biomet/70.3.659
R version 4.5.3 (2026-03-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Asia/Seoul
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] patchwork_1.3.2 tidyr_1.3.2     dplyr_1.2.0     ggplot2_4.0.2  

loaded via a namespace (and not attached):
 [1] vctrs_0.7.2        cli_3.6.5          knitr_1.51         rlang_1.1.7       
 [5] xfun_0.57          otel_0.2.0         purrr_1.2.1        generics_0.1.4    
 [9] S7_0.2.1           jsonlite_2.0.0     labeling_0.4.3     glue_1.8.0        
[13] htmltools_0.5.9    scales_1.4.0       rmarkdown_2.31     grid_4.5.3        
[17] tibble_3.3.1       evaluate_1.0.5     fastmap_1.2.0      yaml_2.3.12       
[21] lifecycle_1.0.5    compiler_4.5.3     RColorBrewer_1.1-3 pkgconfig_2.0.3   
[25] htmlwidgets_1.6.4  farver_2.1.2       digest_0.6.39      R6_2.6.1          
[29] tidyselect_1.2.1   pillar_1.11.1      magrittr_2.0.4     withr_3.0.2       
[33] tools_4.5.3        gtable_0.3.6