Rcpp is smoking fast for agent-based models in data frames

Rcpp is smoking fast for agent-based models in data frames

In a previous post, I discussed different approaches to speeding up some loops in data frames. In particular, R data frames provide a simple framework for representing large cohorts of agents in stochastic epidemiological models, such as those representing disease transmission. This approach is much easier and likely faster than trying to implement cohorts of R objects. In this post we’ll explore a simple agent-based model, and then benchmark a few different approaches to iterating through the cohort. Rcpp outperforms all of them by a few orders of magnitude. Priceless.


Let’s say we are trying to predict the probability of someone choosing to receive a vaccination in a given year. The decision will be based on their age (age), gender (female), and whether or not they were infected with the virus last year (ily). Let’s make up some data:


# construct a cohort
N <- 1000 # cohort size
cohort <- data.frame(age=rnorm(N,mean=50,sd=10),

The probability of choosing to be vaccinated will be given by the following function:

vaccinate <- function( age, female, ily) {
        # this is based on some pretend regression equation
        p <- 0.25 + 0.3 * 1/(1-exp(0.04 * age)) +  0.1 *ily
        # use some logic
        p <- p * ifelse(female, 1.25 , 0.75)
        # boundary checking
        p <- max(0,p); p <- min(1,p)

Try some iterators: for loop, apply, ddply

Let’s create a testable function for each strategy. The objective is to take a cohort data frame as input, calculate the vaccination probability for each member of the cohort, then return a data frame with the cohort data plus a new column for the vaccination probability (p).

# The traditional for loop
# Some would say is always a no-no in R...
do_forloop <- function(df) {
     v_prob <- vector(length=nrow(df),mode="numeric")
     for (x in 1:nrow(df)) {
          v_prob[x] <- vaccinate(df$age[x],

# The apply approach
do_apply <- function(df) {
   v_prob <- apply(df, 1, function(x) vaccinate(x[1],x[2],x[3]))

# ddply approach
do_plyr <- function (df) {
     v_prob <- ddply (df, names(df), function(x)

Enter Rcpp

Now rewrite the test using a traditional for-loop in C++ including a helper function to calculate the vaccination probability. I use the inline library so I can embed the C++ directly in the R script, thus obviating additional .cpp or .h files. Self-contained code is nice.

# create an R function built on C++ code
# required for inline Rcpp calls
# write the C++ code
do_rcpp_src <- '
     // get data from the input data frame
     Rcpp::DataFrame cohort(the_cohort);
     // now extract columns by name from 
     // the data fame into C++ vectors
     std::vector<double> age_v = 
               Rcpp::as< std::vector<double> >(cohort["age"]);
     std::vector<int> female_v = 
               Rcpp::as< std::vector<int> >(cohort["female"]);
     std::vector<int> ily_v = 
               Rcpp::as< std::vector<int> >(cohort["ily"]);
     // create a new variable v_prob for export
     std::vector<double> v_prob (ily_v.size());
     // iterate over data frame to calculate v_prob
     for (int i = 0; i < v_prob.size() ; i++) {
          v_prob[i] = 
     // export the old with the new in a combined data frame
     return Rcpp::DataFrame::create( Named("age")= age_v, 
                                     Named("female") = female_v,
                                     Named("ily") = ily_v,
                                     Named("p") = v_prob);
# write the helper function also in C++
vaccinate_cxx_src <- '
double vaccinate_cxx (double age, int female, int ily){
        // this is based on some pretend regression equation
        double p = 0.25 + 0.3 * 1/(1-exp(0.004*age)) +  0.1 *ily;
        // use some logic
        p = p * (female ? 1.25 : 0.75);
        // boundary checking
        p = max(0,p); p = min(1,p);
# create an R function to call the C++ code
do_rcpp <- cxxfunction(signature(the_cohort="data.frame"),
                       do_rcpp_src, plugin="Rcpp", 
                       includes=c('#include <cmath>',

May the best function win

# benchmarking
bm_results <- benchmark(do_forloop(cohort),
strategy <- with(bm_results, reorder(test,relative))
barchart(relative ~ strategy, bm_results, ylab='Relative performance', 
        main='Performance of iteration strategies over data frames in R', 

              test replications elapsed  relative user.self sys.self
  do_apply(cohort)         1000  38.668  70.05072    38.642    0.004
do_forloop(cohort)         1000  36.521  66.16123    36.507    0.000
   do_plyr(cohort)         1000 215.332 390.09420   215.178    0.152
   do_rcpp(cohort)         1000   0.552   1.00000     0.536    0.016

Post Comment