
Speeding up agent-based simulations with data frames in R


In health economics it is common to use agent-based simulations to explore epidemiological models, prevention policies, and clinical interventions, among other things. In C++ I enjoy using object-oriented design to build these agent-based models; it feels so natural. In R, however, I have yet to delve into the S4 object model, and so I have instead resorted to using data frames as simple object data structures. Stochastic agent-based models often require large cohorts and multiple trials, so any improvement in speed is a great help. The examples below are inspired by comments made recently on the r-help list, to whose contributors I am very grateful.

 

Here I present three approaches to constructing a cohort of agents in a data frame for use in an agent-based simulation. The differences in code are small, but the ratio of time saved to effort required is very high, and I hope this will help the reader speed up any kind of work with data frames.

The first example shows how I would build the cohort if I simply coded the way I think.

# first pass at constructing a cohort
cohortA <- function(size) {
 
	# define the possible values (levels) for the categorical attributes
	levels.gender <- factor(c('M','F'))
	levels.flavor <- factor(c('chocolate','vanilla','strawberry','mint'))
 
	# set up the data frame
	df <- data.frame(id=numeric(0),
			age=numeric(0),
			gender=factor(levels=levels.gender),
			fav.flavor=factor(levels=levels.flavor))
 
	# loop through each person and create them				
	for (cc in 1:size) {
 
		# generate attribute for each person in cohort
		p.id <- cc
		p.age <- rnorm(mean=50,sd=10,n=1)
		p.gender <- as.vector(sample(levels.gender,size=1))
		p.fav.flavor <- as.vector(sample(levels.flavor,size=1))
 
		# add person to the data frame (a list keeps the numeric
		# and factor columns from being coerced to character)
		df[cc,] <- list(p.id,p.age,p.gender,p.fav.flavor)
	}
 
	# output results
	return (df)
}

I saw a few posts on the r-help list suggesting that allocating a vector to its full size up front (rather than growing it one element at a time) and then “filling in” the values improves speed.
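To see the idea on a plain vector first, here is a toy sketch of my own (the names grow and prealloc are purely illustrative, and this is not part of the benchmark below):

# toy sketch: growing a vector element by element vs. pre-allocating it
grow <- function(n) {
	x <- numeric(0)
	for (i in 1:n) x[i] <- i^2	# x is extended (and copied) as it grows
	return(x)
}

prealloc <- function(n) {
	x <- numeric(n)			# allocate the full length once
	for (i in 1:n) x[i] <- i^2	# then "fill in" the values
	return(x)
}

# e.g. compare system.time(grow(1e5)) with system.time(prealloc(1e5))

Would the same trick work for a data frame? Here is the same code again, but with the data frame allocated to its full size at initialization.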

# allocate vector size at initialization
cohortB <- function(size) {
 
	# define the possible values (levels) for the categorical attributes
	levels.gender <- factor(c('M','F'))
	levels.flavor <- factor(c('chocolate','vanilla','strawberry','mint'))
 
	# set up the data frame with specified size
	df <- data.frame(id=numeric(size),
			age=numeric(size),
			gender=factor(NA,levels=levels.gender),
			fav.flavor=factor(NA,levels=levels.flavor))
 
	# loop through each person and create them				
	for (cc in 1:size) {
 
		# generate attribute for each person in cohort
		p.id <- cc
		p.age <- rnorm(mean=50,sd=10,n=1)
		p.gender <- as.vector(sample(levels.gender,size=1))
		p.fav.flavor <- as.vector(sample(levels.flavor,size=1))
 
		# this time "fill in" the row (again as a list, to preserve column types)
		df[cc,] <- list(p.id,p.age,p.gender,p.fav.flavor)
	}
 
	# output results
	return (df)
}

Now the same approach, but creating each column separately as a vector, then combining into a data frame at the end. Also a suggestion from the r-help list.

# create separate vectors to build data frame at the end
cohortC <- function(size) {
 
	# define the possible values (levels) for the categorical attributes
	levels.gender <- factor(c('M','F'))
	levels.flavor <- factor(c('chocolate','vanilla','strawberry','mint'))
 
	# initialize vectors by size
	p.id <- numeric(size)
	p.age <- numeric(size)
	p.gender <- character(size)
	p.fav.flavor <- character(size)
 
	for (cc in 1:size) {
 
		# generate attribute for each person in cohort
		p.id[cc] <- cc
		p.age[cc] <- rnorm(mean=50,sd=10,n=1)
		p.gender[cc] <- as.vector(sample(levels.gender,size=1))
		p.fav.flavor[cc] <- as.vector(sample(levels.flavor,size=1))
	}
 
	# construct data frame from vectors
	df <- data.frame(id=p.id,
			age=p.age,
			gender=p.gender,
			fav.flavor=p.fav.flavor,
			# keep the categorical columns as factors, matching methods A and B
			stringsAsFactors=TRUE)
 
	return(df)	
}
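Before benchmarking, a quick sanity check (my own addition, not part of the timing runs) confirms that the constructors return the same kind of object:

# quick check: each constructor should return a data frame with numeric
# id/age columns and factor gender/fav.flavor columns
str(cohortA(5))
str(cohortC(5))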

Let’s see who wins…

 
# benchmarking each approach
library(rbenchmark)
library(ggplot2)
# requires R v2.13
library(compiler)
 
# compiler is so awesome: cmpfun() byte-compiles each constructor
cohortAc <- cmpfun(cohortA)
cohortBc <- cmpfun(cohortB)
cohortCc <- cmpfun(cohortC)
 
benchIt <- function(size) {
     data.frame(benchmark(cohortA(size),
                       cohortAc(size),
                       cohortB(size),
                       cohortBc(size),
                       cohortC(size),
                       cohortCc(size),
                       columns=c('test','elapsed','relative'),
                       replications=100),
                       size=size,
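			# note: these labels assume the rows come back in the order
			# cohortA, cohortAc, cohortB, cohortBc, cohortC, cohortCc;
			# check the 'test' column of the output if in doubt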
                       method=rep(c('A','B','C'),each=2),
                       compiled=rep(c('n','y'),3))
}
 
# try out some different cohort sizes
cohorts <- c(10,50,100,250,500,750,1000,2000,3000,4000,5000,7500,10000)
 
results <- data.frame()
 
# now run the test
for (cc in cohorts) {
     cat('\nBenchmarking cohort of ',cc)
     results <- rbind(results,benchIt(cc))
}
 
# plot results for elapsed time
ggplot(results,aes(size,elapsed,linetype=compiled))+
     geom_line(aes(color=method))+
     geom_point(aes(color=method))+
     xlab('cohort size')+
     ylab('elapsed time (sec)')+
     opts(title='elapsed time for cohort construction in a data frame')
# ggsave() width/height are in inches, not pixels
ggsave(file='cohort_elapsed.png',width=8,height=6)
 
# plot results for relative performance
ggplot(results,aes(size,relative,linetype=compiled))+
     geom_line(aes(color=method))+
     geom_point(aes(color=method))+
     xlab('cohort size')+
     ylab('relative performance')+
     opts(title='relative performance for cohort construction in a data frame')
ggsave(file='cohort_relative.png',width=8,height=6)
 
# see how the compiled vs non-compiled versions fared
# for just method C (the fastest)
justc <- subset(results,method=='C')
ggplot(justc,aes(size,elapsed,linetype=compiled))+
     geom_line(color='blue')+
     geom_point(color='blue')+
     xlab('cohort size')+
     ylab('elapsed time (sec)')+
     opts(title='elapsed time for cohort construction using method C')
ggsave(file='cohort_elapsed_method_C.png',width=8,height=6)

For very large cohort sizes the time savings are priceless, especially if you are paying $1.62/hr for your cluster time. Here is the elapsed time for each method by cohort size:

And looking at the relative performance:

And just looking at the fastest method (C) for compiled vs. non-compiled performance:

A 35-fold improvement for the largest cohort size?! Incredible. Would this have been even faster using a list of S4 objects?
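Out of curiosity, here is a rough sketch of what that could look like: an S4 Person class and a constructor that fills a pre-allocated list. The class and function names are my own hypothetical choices, and I have not benchmarked this against the methods above.

# hypothetical sketch: the same cohort as a list of S4 objects (not benchmarked here)
library(methods)

setClass('Person',
	representation(id='numeric',
			age='numeric',
			gender='character',
			fav.flavor='character'))

cohortS4 <- function(size) {

	# possible values for the categorical attributes
	levels.gender <- c('M','F')
	levels.flavor <- c('chocolate','vanilla','strawberry','mint')

	# pre-allocate the list, then fill it with Person objects
	cohort <- vector('list',size)
	for (cc in 1:size) {
		cohort[[cc]] <- new('Person',
				id=as.numeric(cc),
				age=rnorm(mean=50,sd=10,n=1),
				gender=sample(levels.gender,size=1),
				fav.flavor=sample(levels.flavor,size=1))
	}

	return(cohort)
}

Accessing an attribute then becomes cohort[[1]]@age rather than df$age[1].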

Thanks to all the gurus on r-help who save researchers CPU cycles by orders of magnitude.
