Drawing equally-sized samples from differently-sized substrata of a dataframe in R [duplicate]












0
















This question already has an answer here:




  • Sample n random rows per group in a dataframe

    5 answers



  • Stratified random sampling from data frame

    4 answers




I have a dataframe with multiple columns containing, inter alia, words and their position in sentences. For some positions, there are more rows than for others. Here's a mock example:



df <- data.frame(
  word = sample(LETTERS, 100, replace = TRUE),
  position = sample(1:5, 100, replace = TRUE)
)
head(df)
#   word position
# 1    K        1
# 2    R        5
# 3    J        2
# 4    Y        5
# 5    Z        5
# 6    U        4


Obviously, the tranches of 'position' are differently sized:



table(df$position)
#  1  2  3  4  5
# 15 15 17 28 25


To make the tranches easier to compare, I'd like to draw equally sized samples for each value of 'position' and collect them in one dataframe. In principle this can be done step by step:



df_pos1 <- df[df$position==1,]
df_pos1_sample <- df_pos1[sample(1:nrow(df_pos1), 3),]

df_pos2 <- df[df$position==2,]
df_pos2_sample <- df_pos2[sample(1:nrow(df_pos2), 3),]

df_pos3 <- df[df$position==3,]
df_pos3_sample <- df_pos3[sample(1:nrow(df_pos3), 3),]

df_pos4 <- df[df$position==4,]
df_pos4_sample <- df_pos4[sample(1:nrow(df_pos4), 3),]

df_pos5 <- df[df$position==5,]
df_pos5_sample <- df_pos5[sample(1:nrow(df_pos5), 3),]


and so on, to finally combine the individual samples in a single dataframe:



df_samples <- rbind(df_pos1_sample, df_pos2_sample, df_pos3_sample, df_pos4_sample, df_pos5_sample)


but this procedure is cumbersome and error-prone. A more economical solution might be a for loop. I've tried the code below, but instead of combining the individual samples for each position value, it returns a single sample drawn from all values of 'position':



df_samples <- c()
for (i in unique(df$position)) {
  df_samples <- rbind(df[sample(1:nrow(df[df$position == i, ]), 3), ])
}
df_samples
#    word position
# 13    D        2
# 2     R        5
# 12    G        3
# 4     Y        5
# 16    Z        3
# 11    S        3
# 6     U        4
# 14    J        3
# 9     O        5
# 1     K        1


What's wrong with this code and how can it be improved?























marked as duplicate by Henrik (r badge holder), Nov 16 '18 at 19:15


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.

































Tags: r, for-loop, sample
















asked Nov 16 '18 at 18:48 by Chris Ruehlemann






























          3 Answers


















          2














          Consider by() to split the data frame by position and draw the sample within each subset. Then rbind all the resulting data frames together in a single call with do.call().



          df_list <- by(df, df$position, function(sub) sub[sample(1:nrow(sub), 3),])

          final_df <- do.call(rbind, df_list)


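          Currently you index the entire (not subsetted) data frame in each iteration: sample(1:nrow(df[df$position==i,]), 3) returns row numbers between 1 and the size of the subset, and these are then applied to df itself, so the selected rows can have any position. Also, you are using rbind inside a for loop, which is memory-intensive and not advised.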



          Specifically,





          • by is the object-oriented wrapper for tapply applied to data frames: it splits a data frame into subsets by one or more factors and passes each subset to the function you supply. Here sub is just the name of the function argument that receives each subset (it can be named anything). The result here is a list-like object holding one sampled data frame per position.


          • do.call calls a function with its arguments supplied as a list, so do.call(rbind, list(df1, df2, df3)) is equivalent to rbind(df1, df2, df3). The key point is that rbind is called once, after all subsets have been sampled, rather than inside a loop (avoiding the cost of repeatedly growing a complex object like a data frame across iterations).
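
          The two steps described above can also be spelled out with split() and lapply(), which may make it easier to see what by() is doing. This is only a sketch: the seed, the name n_per_group, and the fixed group sizes (taken from the table() output in the question) are assumptions made so the example is reproducible.

```r
# Mock data with the group sizes from the question (15, 15, 17, 28, 25)
set.seed(42)  # arbitrary seed, only for reproducibility
df <- data.frame(
  word = sample(LETTERS, 100, replace = TRUE),
  position = rep(1:5, times = c(15, 15, 17, 28, 25))
)

n_per_group <- 3  # hypothetical name for the sample size per position

# Step 1: one data frame per position (roughly what by() does before
# calling the function on each piece)
subsets <- split(df, df$position)

# Step 2: sample row numbers within each subset, not within the full df
sampled <- lapply(subsets, function(sub) {
  sub[sample(seq_len(nrow(sub)), n_per_group), ]
})

# Step 3: a single rbind call over the whole list
final_df <- do.call(rbind, sampled)

# Each position should now contribute exactly n_per_group rows
table(final_df$position)
```

          Note that sample() stops with an error if a subset has fewer rows than n_per_group, so very small groups would need separate handling.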
































          • Could you maybe comment on the key elements of the code such as 'by', 'sub', and 'do-call'? Much appreciated!

            – Chris Ruehlemann
            Nov 16 '18 at 21:24



















          0














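          Each time the loop runs, you overwrite the result of the previous iteration, because rbind is called without the data frame accumulated so far. The row numbers also need to be sampled from the rows that belong to the current position, not from 1:nrow of the subset. Assuming every position has at least 3 rows, try:



          df_samples <- data.frame()
          for (i in unique(df$position)) {
            df_samples <- rbind(df_samples, df[sample(which(df$position == i), 3), ])
          }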
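
          If a for loop is preferred, one possible variant collects the per-position samples in a pre-allocated list and binds them once at the end, instead of growing the data frame on every pass. This is a sketch under assumptions: the seed, the mock group sizes, and the names positions, samples, rows and picked are illustrative and not part of the original answer.

```r
# Mock data with the group sizes from the question
set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(
  word = sample(LETTERS, 100, replace = TRUE),
  position = rep(1:5, times = c(15, 15, 17, 28, 25))
)

positions <- unique(df$position)
samples <- vector("list", length(positions))  # one slot per position

for (k in seq_along(positions)) {
  # Row numbers in the full data frame that belong to this position
  rows <- which(df$position == positions[k])
  # sample.int() picks positions within 'rows', which avoids the
  # sample(x, ...) surprise when x happens to be a single number
  picked <- rows[sample.int(length(rows), 3)]
  samples[[k]] <- df[picked, ]
}

# Bind once, after the loop
df_samples <- do.call(rbind, samples)
table(df_samples$position)
```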




































            0














            With data.table, we can sample the row index .I within each group and use the result to subset the dataset. This is typically very efficient.



            library(data.table)
            i1 <- setDT(df)[, sample(.I, 3), by = position]$V1
            df[i1]




            Or use sample_n from the tidyverse (since dplyr 1.0.0 it is superseded by slice_sample(n = 3), though it still works)



            library(tidyverse)
            df %>%
              group_by(position) %>%
              sample_n(3)




            Or as a function



            f1 <- function(data) {
              data <- as.data.table(data)
              i1 <- data[, sample(.I, 3), by = position]$V1
              data[i1]
            }
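
            The same idea of sampling row indices per group and then subsetting once can be sketched in base R, without extra packages. This is an assumed analogue of the .I approach, not the data.table implementation itself; the seed, the mock group sizes, and the names idx_by_group and idx are illustrative.

```r
# Mock data with the group sizes from the question
set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(
  word = sample(LETTERS, 100, replace = TRUE),
  position = rep(1:5, times = c(15, 15, 17, 28, 25))
)

# For each position, sample 3 of the row numbers that belong to it
idx_by_group <- tapply(seq_len(nrow(df)), df$position, function(rows) {
  rows[sample.int(length(rows), 3)]
})

# Flatten to one vector of row numbers and subset the data frame once
idx <- unlist(idx_by_group, use.names = FALSE)
df_samples <- df[idx, ]
table(df_samples$position)
```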













          The by/do.call answer: answered Nov 16 '18 at 18:56 by Parfait (edited Nov 16 '18 at 22:43)























          The for-loop answer: answered Nov 16 '18 at 18:57 by xsabatox































          The data.table/tidyverse answer: answered Nov 16 '18 at 18:52 by akrun (edited Nov 16 '18 at 19:09)














