Speed up Tidyverse Calculation of Multiple Quantiles
I have this great little function summarise_posterior (given below) as part of my package driver (available here).
The function is very useful. The one problem is that I have been working with larger and larger data, and it can be very slow. In short, my question is: is there a tidyverse-esque way of speeding this up while still retaining the key flexibility of this function (see the examples in the documentation)?
At least one key speed-up could come from computing all of the quantiles in a single call rather than invoking quantile() over and over. The current implementation probably re-sorts the same vector repeatedly.
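For reference, stats::quantile() already accepts a vector of probabilities, so the five calls below could in principle collapse into one call that sorts the data only once; a minimal sketch of that fact:

x <- rnorm(1e5)
# one call computes all five quantiles from a single (partial) sort,
# instead of five separate calls that each re-sort x
quantile(x, probs = c(0.025, 0.25, 0.5, 0.75, 0.975), names = FALSE)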
#' Shortcut for summarising a variable with quantiles and mean
#'
#' @param data tidy data frame
#' @param var variable name (unquoted) to be summarised
#' @param ... other expressions to pass to summarise
#'
#' @return data.frame
#' @export
#' @details Notation: \code{pX} refers to the \code{X}\% quantile
#' @import dplyr
#' @importFrom stats quantile
#' @importFrom rlang quos quo UQ
#' @examples
#' d <- data.frame("a" = sample(1:10, 50, TRUE),
#'                 "b" = rnorm(50))
#'
#' # Summarise posterior for b over grouping of a and also calculate
#' # minimum of b (in addition to the normal statistics returned)
#' d <- dplyr::group_by(d, a)
#' summarise_posterior(d, b, mean.b = mean(b), min = min(b))
summarise_posterior <- function(data, var, ...){
  qvar <- enquo(var)
  qs <- quos(...)
  data %>%
    summarise(p2.5 = quantile(!!qvar, prob = 0.025),
              p25 = quantile(!!qvar, prob = 0.25),
              p50 = quantile(!!qvar, prob = 0.5),
              mean = mean(!!qvar),
              p75 = quantile(!!qvar, prob = 0.75),
              p97.5 = quantile(!!qvar, prob = 0.975),
              !!!qs)
}
Rcpp back-end options are also more than welcome.
Thanks!
r dplyr tidyverse rlang
asked Nov 11 at 3:03 by jds
2 Answers
Here's a solution that makes use of nesting to avoid calling quantile multiple times. Any time you need to store a vector of results inside summarize, simply wrap it inside list. Afterwards, you can unnest these results, pair them up against their names, and use spread to put them in separate columns:
library(dplyr); library(rlang)      # ensym()
library(tidyr); library(stringr)    # unnest()/spread(), str_c()

summarise_posterior2 <- function(data, var, ...){
  qvar <- ensym(var)
  vq <- c(0.025, 0.25, 0.5, 0.75, 0.975)
  summarise(data,
            .qq  = list(quantile(!!qvar, vq, names = FALSE)),
            .nms = list(str_c("p", vq * 100)),
            mean = mean(!!qvar), ...) %>%
    unnest %>%
    spread(.nms, .qq)
}
This doesn't give you nearly the same speed-up as @jay.sf's solution:
d <- data.frame("a" = sample(1:10, 5e5, TRUE), "b" = rnorm(5e5))

microbenchmark::microbenchmark(
  f1 = summarise_posterior(d, b, mean.b = mean(b), min = min(b)),
  f2 = summarise_posterior2(d, b, mean.b = mean(b), min = min(b)) )
# Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval
#    f1 49.06697 50.81422 60.75100 52.43030 54.17242 200.2961   100
#    f2 29.05209 29.66022 32.32508 30.84492 32.56364 138.9579   100
but it will work correctly with group_by and inside nested functions (whereas substitute-based solutions will break when nested).
r1 <- d %>% dplyr::group_by(a) %>% summarise_posterior(b, mean.b = mean(b), min=min(b))
r2 <- d %>% dplyr::group_by(a) %>% summarise_posterior2(b, mean.b = mean(b), min=min(b))
all_equal( r1, r2 ) # TRUE
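To illustrate the nesting point, a small sketch (hypothetical helper names, not from the package): substitute() captures whatever symbol appears one call level up, so the name the user actually typed is lost, while ensym()-based quoting can be forwarded explicitly with !!:

capture_sub <- function(data, x) deparse(substitute(x))
outer_sub   <- function(data, y) capture_sub(data, y)
outer_sub(d, b)    # "y" -- the symbol the user typed is lost

# the ensym()-based version forwards the symbol with !!
outer_tidy <- function(data, y) summarise_posterior2(data, !!rlang::ensym(y))
outer_tidy(d, b)   # works: quantile summary of d$b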
If you profile the code, you can see where the major hang-ups are:
Rprof()
for( i in 1:100 )
  d %>% dplyr::group_by(a) %>% summarise_posterior2(b, mean.b = mean(b), min = min(b))
Rprof(NULL)
summaryRprof()$by.self %>% head
#             self.time self.pct total.time total.pct
# ".Call"          1.84    49.73       3.18     85.95
# "sort.int"       0.94    25.41       1.12     30.27
# "eval"           0.08     2.16       3.64     98.38
# "tryCatch"       0.08     2.16       1.44     38.92
# "anyNA"          0.08     2.16       0.08      2.16
# "structure"      0.04     1.08       0.08      2.16
The .Call corresponds mainly to the C++ backend of dplyr, while sort.int is the worker behind quantile(). @jay.sf's solution gains a major speedup by decoupling from dplyr, but it also loses the associated flexibility (e.g., integration with group_by). Ultimately, it's up to you to decide which is more important.
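If you want the speed without giving up grouping entirely, a rough base-R middle ground (a sketch, not from either answer) is to split the column once per group and make a single quantile() call per group:

vq <- c(0.025, 0.25, 0.5, 0.75, 0.975)
# one quantile() call per group, so no per-quantile re-sorting
res <- do.call(rbind, lapply(split(d$b, d$a), function(x)
  c(quantile(x, probs = vq, names = FALSE), mean = mean(x))))
colnames(res) <- c(paste0("p", vq * 100), "mean")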
answered Nov 16 at 3:38 by Artem Sokolov (accepted answer)
Why not something like this?
summarise_posterior2 <- function(data, x, ...){
  x  <- deparse(substitute(x))      # column name as a string
  nm <- deparse(substitute(...))    # (captured but not used below)
  M  <- matrix(unlist(data[, x]), ncol = length(data[, x]))
  qs <- t(sapply(list(...), do.call, list(M)))   # apply each extra function to the data
  'rownames<-'(cbind(p2.5  = quantile(M, prob = 0.025),
                     p25   = quantile(M, prob = 0.25),
                     p50   = quantile(M, prob = 0.5),
                     mean  = mean(M),
                     p75   = quantile(M, prob = 0.75),
                     p97.5 = quantile(M, prob = 0.975),
                     qs), NULL)
}
> summarise_posterior2(df1, X4, mean=mean, mean=mean, min=min)
#      p2.5 p25 p50 mean p75 p97.5 mean mean min
# [1,] 28.2  30  32   32  34  35.8   32   32  28

# > summarise_posterior(df1, X4, mean.b = mean(X4), min=min(X4))
#   p2.5 p25 p50 mean p75 p97.5 mean.b min
# 1 28.2  30  32   32  34  35.8     32  28
Runs six times faster:
> microbenchmark::microbenchmark(
+   orig.fun = summarise_posterior(df1, X4, max(X4), min(X4)),
+   new.fun  = summarise_posterior2(df1, X4, max=max, min=min))
Unit: microseconds
     expr      min       lq      mean   median       uq      max neval
 orig.fun 4289.541 4324.490 4514.1634 4362.500 4411.225 8928.316   100
  new.fun  716.071  734.694  802.9949  755.867  778.317 4759.439   100
Data
df1 <- data.frame(matrix(1:144, 9, 16))
answered Nov 11 at 8:27 by jay.sf

Comments:
Do you think it would be faster to put all those quantile calls into one? It seems like your approach (while a nice improvement - don't get me wrong) is still doing about 5x extra work due to repeated sorting, right? – jds, Nov 11 at 19:55
Also, will it work with group_by? – jds, Nov 11 at 19:56
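On the first comment: quantile() does accept a vector of probabilities, so the six calls above could be collapsed into one; a sketch of that variant (same df1 as above, hypothetical function name):

single_pass <- function(data, x) {
  v  <- data[[deparse(substitute(x))]]
  # all five quantiles in one call, avoiding the repeated sorts
  qs <- quantile(v, probs = c(0.025, 0.25, 0.5, 0.75, 0.975), names = FALSE)
  c(setNames(qs, c("p2.5", "p25", "p50", "p75", "p97.5")), mean = mean(v))
}
single_pass(df1, X4)
#  p2.5   p25   p50   p75 p97.5  mean
#  28.2  30.0  32.0  34.0  35.8  32.0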