R: How can I group rows in a dataframe, ID rows meeting a condition, then delete prior rows for the group?

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}

I have a dataframe of customers (identified by ID number), the number of units of two products they bought in each of four years, and a final column identifying the year in which new customers first purchased (the 'key' column). The problem: the dataframe includes rows from the years prior to new customers purchasing for the first time. I need to delete these rows. For example, this dataframe:

   customer year item.A item.B  key

1         1 2000     NA     NA         <NA>

2         1 2001     NA     NA         <NA>

3         1 2002      1      5 new.customer

4         1 2003      2      6         <NA>

5         2 2000     NA     NA         <NA>

6         2 2001     NA     NA         <NA>

7         2 2002     NA     NA         <NA>

8         2 2003      2      7 new.customer

9         3 2000      2      4         <NA>

10        3 2001      6      4         <NA>

11        3 2002      2      5         <NA>

12        3 2003      1      8         <NA>

needs to look like this:

  customer year item.A item.B key

1        1 2002      1      5 new.customer

2        1 2003      2      6         <NA>

3        2 2003      2      7 new.customer

4        3 2000      2      4         <NA>

5        3 2001      6      4         <NA>

6        3 2002      2      5         <NA>

7        3 2003      1      8         <NA>

I thought I could do this using dplyr/tidyr - a combination of group, lead/lag, and slice (or perhaps filter and drop_na) but I can't figure out how to delete backwards in the customer group once I've identified the rows meeting the condition "key"=="new.customer". Thanks for any suggestions (code for the full dataframe below).

a<-c(1,1,1,1,2,2,2,2,3,3,3,3)

b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)

c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)

d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)

e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA) 

df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)

df

asked Nov 17 '18 at 1:21

Emilio M. Bruna

8419

shouldn't first row of customer 3 be marked as new customer?

– Shree
Nov 17 '18 at 1:27

Do these lines necessarily correspond to NA for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entire data.frame. If it is, simply: df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ], or using subset which might be easier to read: subset(df, !(is.na(C) & is.na(D) & is.na(key)))

– Mathieu Basille
Nov 24 '18 at 0:15

add a comment |

   customer year item.A item.B  key

1         1 2000     NA     NA         <NA>

2         1 2001     NA     NA         <NA>

3         1 2002      1      5 new.customer

4         1 2003      2      6         <NA>

5         2 2000     NA     NA         <NA>

6         2 2001     NA     NA         <NA>

7         2 2002     NA     NA         <NA>

8         2 2003      2      7 new.customer

9         3 2000      2      4         <NA>

10        3 2001      6      4         <NA>

11        3 2002      2      5         <NA>

12        3 2003      1      8         <NA>

needs to look like this:

  customer year item.A item.B key

1        1 2002      1      5 new.customer

2        1 2003      2      6         <NA>

3        2 2003      2      7 new.customer

4        3 2000      2      4         <NA>

5        3 2001      6      4         <NA>

6        3 2002      2      5         <NA>

7        3 2003      1      8         <NA>

a<-c(1,1,1,1,2,2,2,2,3,3,3,3)

b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)

c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)

d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)

e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA) 

df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)

df

asked Nov 17 '18 at 1:21

Emilio M. Bruna

8419

shouldn't first row of customer 3 be marked as new customer?

– Shree
Nov 17 '18 at 1:27

Do these lines necessarily correspond to NA for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entire data.frame. If it is, simply: df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ], or using subset which might be easier to read: subset(df, !(is.na(C) & is.na(D) & is.na(key)))

– Mathieu Basille
Nov 24 '18 at 0:15

add a comment |

   customer year item.A item.B  key

1         1 2000     NA     NA         <NA>

2         1 2001     NA     NA         <NA>

3         1 2002      1      5 new.customer

4         1 2003      2      6         <NA>

5         2 2000     NA     NA         <NA>

6         2 2001     NA     NA         <NA>

7         2 2002     NA     NA         <NA>

8         2 2003      2      7 new.customer

9         3 2000      2      4         <NA>

10        3 2001      6      4         <NA>

11        3 2002      2      5         <NA>

12        3 2003      1      8         <NA>

needs to look like this:

  customer year item.A item.B key

1        1 2002      1      5 new.customer

2        1 2003      2      6         <NA>

3        2 2003      2      7 new.customer

4        3 2000      2      4         <NA>

5        3 2001      6      4         <NA>

6        3 2002      2      5         <NA>

7        3 2003      1      8         <NA>

a<-c(1,1,1,1,2,2,2,2,3,3,3,3)

b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)

c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)

d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)

e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA) 

df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)

df

asked Nov 17 '18 at 1:21

Emilio M. Bruna

8419

   customer year item.A item.B  key

1         1 2000     NA     NA         <NA>

2         1 2001     NA     NA         <NA>

3         1 2002      1      5 new.customer

4         1 2003      2      6         <NA>

5         2 2000     NA     NA         <NA>

6         2 2001     NA     NA         <NA>

7         2 2002     NA     NA         <NA>

8         2 2003      2      7 new.customer

9         3 2000      2      4         <NA>

10        3 2001      6      4         <NA>

11        3 2002      2      5         <NA>

12        3 2003      1      8         <NA>

needs to look like this:

  customer year item.A item.B key

1        1 2002      1      5 new.customer

2        1 2003      2      6         <NA>

3        2 2003      2      7 new.customer

4        3 2000      2      4         <NA>

5        3 2001      6      4         <NA>

6        3 2002      2      5         <NA>

7        3 2003      1      8         <NA>

a<-c(1,1,1,1,2,2,2,2,3,3,3,3)

b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)

c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)

d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)

e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA) 

df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)

df

r dataframe conditional delete-row

asked Nov 17 '18 at 1:21

Emilio M. Bruna

8419

asked Nov 17 '18 at 1:21

Emilio M. Bruna

8419

asked Nov 17 '18 at 1:21

Emilio M. Bruna

8419

asked Nov 17 '18 at 1:21

Emilio M. Bruna

8419

asked Nov 17 '18 at 1:21

Emilio M. Bruna

8419

shouldn't first row of customer 3 be marked as new customer?

– Shree
Nov 17 '18 at 1:27

Do these lines necessarily correspond to NA for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entire data.frame. If it is, simply: df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ], or using subset which might be easier to read: subset(df, !(is.na(C) & is.na(D) & is.na(key)))

– Mathieu Basille
Nov 24 '18 at 0:15

add a comment |

shouldn't first row of customer 3 be marked as new customer?

– Shree
Nov 17 '18 at 1:27

Do these lines necessarily correspond to NA for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entire data.frame. If it is, simply: df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ], or using subset which might be easier to read: subset(df, !(is.na(C) & is.na(D) & is.na(key)))

– Mathieu Basille
Nov 24 '18 at 0:15

shouldn't first row of customer 3 be marked as new customer?

– Shree
Nov 17 '18 at 1:27

Do these lines necessarily correspond to NA for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entire data.frame. If it is, simply: df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ], or using subset which might be easier to read: subset(df, !(is.na(C) & is.na(D) & is.na(key)))

– Mathieu Basille
Nov 24 '18 at 0:15

add a comment |

1 Answer
1

active

oldest

votes

As a first step I am marking existing customers (customer 3 in this case) in the key column -

df %>% 

  group_by(customer) %>% 

  mutate(

    key = as.character(key), # can be avoided if key is a character to begin with

    key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)

  ) %>% 

  filter(cumsum(!is.na(key)) > 0) %>% 

  ungroup()



# A tibble: 7 x 5

  customer  year     C     D key     

     <dbl> <dbl> <dbl> <dbl> <chr>   

1        1  2002     1     5 new     

2        1  2003     2     6 NA      

3        2  2003     2     7 new     

4        3  2000     2     4 existing

5        3  2001     6     4 NA      

6        3  2002     2     5 NA      

7        3  2003     1     8 NA

edited Nov 17 '18 at 13:17

answered Nov 17 '18 at 1:30

Shree

3,5561524

Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.

– Emilio M. Bruna
Nov 17 '18 at 1:41

@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not NA. You could populate it with say "existing" instead of NA. Does that work for you?

– Shree
Nov 17 '18 at 1:45

Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.

– Emilio M. Bruna
Nov 17 '18 at 2:01

1

I have updated my answer to include logic for existing customers. You don't need to do anything now.

– Shree
Nov 17 '18 at 2:18

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53347352%2fr-how-can-i-group-rows-in-a-dataframe-id-rows-meeting-a-condition-then-delete%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

As a first step I am marking existing customers (customer 3 in this case) in the key column -

df %>% 

  group_by(customer) %>% 

  mutate(

    key = as.character(key), # can be avoided if key is a character to begin with

    key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)

  ) %>% 

  filter(cumsum(!is.na(key)) > 0) %>% 

  ungroup()



# A tibble: 7 x 5

  customer  year     C     D key     

     <dbl> <dbl> <dbl> <dbl> <chr>   

1        1  2002     1     5 new     

2        1  2003     2     6 NA      

3        2  2003     2     7 new     

4        3  2000     2     4 existing

5        3  2001     6     4 NA      

6        3  2002     2     5 NA      

7        3  2003     1     8 NA

edited Nov 17 '18 at 13:17

answered Nov 17 '18 at 1:30

Shree

3,5561524

Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.

– Emilio M. Bruna
Nov 17 '18 at 1:41

@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not NA. You could populate it with say "existing" instead of NA. Does that work for you?

– Shree
Nov 17 '18 at 1:45

Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.

– Emilio M. Bruna
Nov 17 '18 at 2:01

1

I have updated my answer to include logic for existing customers. You don't need to do anything now.

– Shree
Nov 17 '18 at 2:18

add a comment |

As a first step I am marking existing customers (customer 3 in this case) in the key column -

df %>% 

  group_by(customer) %>% 

  mutate(

    key = as.character(key), # can be avoided if key is a character to begin with

    key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)

  ) %>% 

  filter(cumsum(!is.na(key)) > 0) %>% 

  ungroup()



# A tibble: 7 x 5

  customer  year     C     D key     

     <dbl> <dbl> <dbl> <dbl> <chr>   

1        1  2002     1     5 new     

2        1  2003     2     6 NA      

3        2  2003     2     7 new     

4        3  2000     2     4 existing

5        3  2001     6     4 NA      

6        3  2002     2     5 NA      

7        3  2003     1     8 NA

edited Nov 17 '18 at 13:17

answered Nov 17 '18 at 1:30

Shree

3,5561524

Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.

– Emilio M. Bruna
Nov 17 '18 at 1:41

@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not NA. You could populate it with say "existing" instead of NA. Does that work for you?

– Shree
Nov 17 '18 at 1:45

Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.

– Emilio M. Bruna
Nov 17 '18 at 2:01

1

I have updated my answer to include logic for existing customers. You don't need to do anything now.

– Shree
Nov 17 '18 at 2:18

add a comment |

As a first step I am marking existing customers (customer 3 in this case) in the key column -

df %>% 

  group_by(customer) %>% 

  mutate(

    key = as.character(key), # can be avoided if key is a character to begin with

    key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)

  ) %>% 

  filter(cumsum(!is.na(key)) > 0) %>% 

  ungroup()



# A tibble: 7 x 5

  customer  year     C     D key     

     <dbl> <dbl> <dbl> <dbl> <chr>   

1        1  2002     1     5 new     

2        1  2003     2     6 NA      

3        2  2003     2     7 new     

4        3  2000     2     4 existing

5        3  2001     6     4 NA      

6        3  2002     2     5 NA      

7        3  2003     1     8 NA

edited Nov 17 '18 at 13:17

answered Nov 17 '18 at 1:30

Shree

3,5561524

As a first step I am marking existing customers (customer 3 in this case) in the key column -

df %>% 

  group_by(customer) %>% 

  mutate(

    key = as.character(key), # can be avoided if key is a character to begin with

    key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)

  ) %>% 

  filter(cumsum(!is.na(key)) > 0) %>% 

  ungroup()



# A tibble: 7 x 5

  customer  year     C     D key     

     <dbl> <dbl> <dbl> <dbl> <chr>   

1        1  2002     1     5 new     

2        1  2003     2     6 NA      

3        2  2003     2     7 new     

4        3  2000     2     4 existing

5        3  2001     6     4 NA      

6        3  2002     2     5 NA      

7        3  2003     1     8 NA

edited Nov 17 '18 at 13:17

answered Nov 17 '18 at 1:30

Shree

3,5561524

edited Nov 17 '18 at 13:17

answered Nov 17 '18 at 1:30

Shree

3,5561524

answered Nov 17 '18 at 1:30

Shree

3,5561524

answered Nov 17 '18 at 1:30

Shree

3,5561524

Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.

– Emilio M. Bruna
Nov 17 '18 at 1:41

@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not NA. You could populate it with say "existing" instead of NA. Does that work for you?

– Shree
Nov 17 '18 at 1:45

Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.

– Emilio M. Bruna
Nov 17 '18 at 2:01

1

I have updated my answer to include logic for existing customers. You don't need to do anything now.

– Shree
Nov 17 '18 at 2:18

add a comment |

Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.

– Emilio M. Bruna
Nov 17 '18 at 1:41

@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not NA. You could populate it with say "existing" instead of NA. Does that work for you?

– Shree
Nov 17 '18 at 1:45

Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.

– Emilio M. Bruna
Nov 17 '18 at 2:01

1

I have updated my answer to include logic for existing customers. You don't need to do anything now.

– Shree
Nov 17 '18 at 2:18

Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.

– Emilio M. Bruna
Nov 17 '18 at 1:41

@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not NA. You could populate it with say "existing" instead of NA. Does that work for you?

– Shree
Nov 17 '18 at 1:45

Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.

– Emilio M. Bruna
Nov 17 '18 at 2:01

I have updated my answer to include logic for existing customers. You don't need to do anything now.

– Shree
Nov 17 '18 at 2:18

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

FW Yp75yVuLlCLE BCvqXsEkD,q661YF1w 6,TedlFV8rLBTYY My,fFM,c7mKxvIztUR0TEQo,vC0O5QNP

搜尋此網誌

Vfrdtyky