R and Simmer: Performance boost on large data frames
I've got a data frame of actual events/tasks, and I use the simmer R package to simulate how many tasks can be done if different resources were available. My simulation runs very fast with up to 120,000 rows in the data frame.



rm(list=ls())
library(dplyr)
library(simmer)
library(simmer.plot)

load("task_df.RDATA")

working_hours <- 7.8
productivity <- 0.7
no.employees <- 292

# 7.8 h * 0.7 * 60 min = 327.6 -> simulation horizon of 329 minutes
SIM_TIME <- round((working_hours*productivity*60), 0)+1

employees <- vector("character")

for (i in 1:no.employees) {
  employees[i] <- paste("employee", i, sep="_")
}

taskTraj <- trajectory(name = "task simulation") %>%
  simmer::select(resources = employees, policy = "shortest-queue") %>%
  seize_selected(amount = 1) %>%
  timeout_from_attribute("duration") %>%
  release_selected(amount = 1)

arrivals_gen <- simmer()

for (i in 1:no.employees) {
  arrivals_gen %>%
    add_resource(paste("employee", i, sep="_"), capacity = 1)
}

ptm <- proc.time()

arrivals_gen <- arrivals_gen %>%
  add_dataframe("Task_", taskTraj, task_df, mon = 2, col_time = "time",
                time = "absolute", col_priority = "priority") %>%
  run(SIM_TIME)

proc.time() - ptm


But my data frame task_df contains 350k rows, and that is the point where my simulation takes much more time.
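For anyone trying to reproduce the slowdown without the original RDATA file, a synthetic task_df with the same columns can be generated; the distributions below are guesses based on the sample printed further down, not the real data:

```r
# Hypothetical stand-in for task_df: same column names, made-up values
set.seed(42)
n <- 350000
task_df <- data.frame(
  workload_shift = 20180403,
  task_id  = sample(55e6:69e6, n, replace = TRUE),
  duration = round(rexp(n, rate = 1/5), 3),  # task durations in minutes
  priority = sample(2:12, n, replace = TRUE),
  time     = 0                               # all tasks available at t = 0
)
```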



head(task_df, n = 50)



workload_shift  task_id duration priority time
1 20180403 68347632 3 2.502 0
2 20180403 68151881 10 24.478 0
3 20180403 68069718 3 0.724 0
4 20180403 68345621 4 2.226 0
5 20180403 68508858 3 36.062 0
6 20180403 66148996 3 9.421 0
7 20180403 68565066 2 24.478 0
8 20180403 68005344 3 7.910 0
9 20180403 55979902 3 3.732 0
10 20180403 66452138 2 2.502 0
11 20180403 68051869 10 2.226 0
12 20180403 68561364 10 3.584 0
13 20180403 59292591 3 2.138 0
14 20180403 68415657 10 2.853 0
15 20180403 66848400 3 2.290 0
16 20180403 68454851 10 6.167 0
17 20180403 68361846 10 11.688 0
18 20180403 68572723 2 6.259 0
19 20180403 68520328 2 24.478 0
20 20180403 68500955 10 1.855 0
21 20180403 67000753 3 219.751 0
22 20180403 68487613 3 8.131 0
23 20180403 68333674 4 5.263 0
24 20180403 66423486 3 2.290 0
25 20180403 68241616 5 1.470 0
26 20180403 68415001 4 3.584 0
27 20180403 67487967 3 2.636 0
28 20180403 68494771 10 6.259 0
29 20180403 67673981 10 2.226 0
30 20180403 68355727 3 2.613 0
31 20180403 36942995 3 0.590 0
32 20180403 66633446 3 5.968 0
33 20180403 68461510 2 24.478 0
34 20180403 67126138 3 0.357 0
35 20180403 68485682 3 8.131 0
36 20180403 67852953 10 2.290 0
37 20180403 68150106 10 6.259 0
38 20180403 67833053 10 4.114 0
39 20180403 67816673 3 6.259 0
40 20180403 68041431 5 2.502 0
41 20180403 66283761 5 2.502 0
42 20180403 68543314 2 26.302 0
43 20180403 68492843 3 2.290 0
44 20180403 68556960 4 2.853 0
45 20180403 66885335 3 5.975 0
46 20180403 66249231 5 2.636 0
47 20180403 68242565 12 1.470 0
48 20180403 68530355 2 2.290 0
49 20180403 66683717 5 5.705 0
50 20180403 67802538 4 0.864 0


120k rows:

   user  system elapsed
 76.745   0.039  76.717

vs. 350k rows:

   user  system elapsed
608.443   0.270 608.186



My CPU: (screenshot not reproduced here)



Is there a way to speed up my simulation? I use simmer 4.1.0 and Rcpp 1.0.0. Memory doesn't seem to be an issue.
  • Based on your code above, I tried dataframes with 100k and 1M observations (with random data) and I see no performance issues (i.e., 1M takes x10 the time of 100k rows, as expected). Could you provide a reproducible example? – Iñaki Úcar, Nov 14 '18 at 9:29

  • @IñakiÚcar Thanks in advance for your fast reply. I have updated my code snippet above to give a reproducible example. – MCR90, Nov 14 '18 at 13:09
Tags: c++ r simulation
edited Nov 14 '18 at 13:08 by MCR90
asked Nov 13 '18 at 14:18 by MCR90
1 Answer
I took your table and simply replicated it to build 100k and 400k datasets, and I confirm the issue: the execution time is not linear.



Internally, attributes are always stored as double, so there are lots of type conversions, row by row, which apparently take most of the execution time (!). Try converting your table before feeding it into simmer. Using dplyr:



task_df <- mutate_all(task_df, as.double)


The simulation should be much faster, and the execution time for increasing numbers of rows should grow more or less linearly. It's evident why so many casts degrade performance, though I'm not sure why they make the execution time non-linear.
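A base R alternative to the dplyr call, plus a quick check that the conversion took effect, is sketched below; the small demo data frame stands in for the question's task_df:

```r
# Demo data frame with integer columns, standing in for task_df
task_df <- data.frame(task_id = 1:3, priority = c(3L, 10L, 2L),
                      duration = c(2.502, 24.478, 0.724), time = 0L)

# Base R equivalent of mutate_all(task_df, as.double)
task_df[] <- lapply(task_df, as.numeric)

# Every column is now stored as double, so simmer need not cast row by row
stopifnot(all(vapply(task_df, is.double, logical(1))))
```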



Anyway, in future releases we may want to apply this conversion automatically, so that users don't have to worry about these performance issues.
  • Thank you! It worked very well and it seems to be linear! – MCR90, Nov 15 '18 at 16:45
answered Nov 15 '18 at 13:32 by Iñaki Úcar