R: How can I group rows in a dataframe, ID rows meeting a condition, then delete prior rows for the group?





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}







0















I have a dataframe of customers (identified by ID number), the number of units of two products they bought in each of four years, and a final column identifying the year in which new customers first purchased (the 'key' column). The problem: the dataframe includes rows from the years prior to new customers purchasing for the first time. I need to delete these rows. For example, this dataframe:



   customer year item.A item.B  key
1 1 2000 NA NA <NA>
2 1 2001 NA NA <NA>
3 1 2002 1 5 new.customer
4 1 2003 2 6 <NA>
5 2 2000 NA NA <NA>
6 2 2001 NA NA <NA>
7 2 2002 NA NA <NA>
8 2 2003 2 7 new.customer
9 3 2000 2 4 <NA>
10 3 2001 6 4 <NA>
11 3 2002 2 5 <NA>
12 3 2003 1 8 <NA>


needs to look like this:



  customer year item.A item.B key
1 1 2002 1 5 new.customer
2 1 2003 2 6 <NA>
3 2 2003 2 7 new.customer
4 3 2000 2 4 <NA>
5 3 2001 6 4 <NA>
6 3 2002 2 5 <NA>
7 3 2003 1 8 <NA>


I thought I could do this using dplyr/tidyr - a combination of group, lead/lag, and slice (or perhaps filter and drop_na) but I can't figure out how to delete backwards in the customer group once I've identified the rows meeting the condition "key"=="new.customer". Thanks for any suggestions (code for the full dataframe below).



a<-c(1,1,1,1,2,2,2,2,3,3,3,3)
b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)
c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)
d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)
e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA)
df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)
df









share|improve this question























  • shouldn't first row of customer 3 be marked as new customer?

    – Shree
    Nov 17 '18 at 1:27











  • Do these lines necessarily correspond to NA for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entire data.frame. If it is, simply: df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ], or using subset which might be easier to read: subset(df, !(is.na(C) & is.na(D) & is.na(key)))

    – Mathieu Basille
    Nov 24 '18 at 0:15




















0















I have a dataframe of customers (identified by ID number), the number of units of two products they bought in each of four years, and a final column identifying the year in which new customers first purchased (the 'key' column). The problem: the dataframe includes rows from the years prior to new customers purchasing for the first time. I need to delete these rows. For example, this dataframe:



   customer year item.A item.B  key
1 1 2000 NA NA <NA>
2 1 2001 NA NA <NA>
3 1 2002 1 5 new.customer
4 1 2003 2 6 <NA>
5 2 2000 NA NA <NA>
6 2 2001 NA NA <NA>
7 2 2002 NA NA <NA>
8 2 2003 2 7 new.customer
9 3 2000 2 4 <NA>
10 3 2001 6 4 <NA>
11 3 2002 2 5 <NA>
12 3 2003 1 8 <NA>


needs to look like this:



  customer year item.A item.B key
1 1 2002 1 5 new.customer
2 1 2003 2 6 <NA>
3 2 2003 2 7 new.customer
4 3 2000 2 4 <NA>
5 3 2001 6 4 <NA>
6 3 2002 2 5 <NA>
7 3 2003 1 8 <NA>


I thought I could do this using dplyr/tidyr - a combination of group, lead/lag, and slice (or perhaps filter and drop_na) but I can't figure out how to delete backwards in the customer group once I've identified the rows meeting the condition "key"=="new.customer". Thanks for any suggestions (code for the full dataframe below).



a<-c(1,1,1,1,2,2,2,2,3,3,3,3)
b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)
c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)
d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)
e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA)
df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)
df









share|improve this question























  • shouldn't first row of customer 3 be marked as new customer?

    – Shree
    Nov 17 '18 at 1:27











  • Do these lines necessarily correspond to NA for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entire data.frame. If it is, simply: df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ], or using subset which might be easier to read: subset(df, !(is.na(C) & is.na(D) & is.na(key)))

    – Mathieu Basille
    Nov 24 '18 at 0:15
















0












0








0








I have a dataframe of customers (identified by ID number), the number of units of two products they bought in each of four years, and a final column identifying the year in which new customers first purchased (the 'key' column). The problem: the dataframe includes rows from the years prior to new customers purchasing for the first time. I need to delete these rows. For example, this dataframe:



   customer year item.A item.B  key
1 1 2000 NA NA <NA>
2 1 2001 NA NA <NA>
3 1 2002 1 5 new.customer
4 1 2003 2 6 <NA>
5 2 2000 NA NA <NA>
6 2 2001 NA NA <NA>
7 2 2002 NA NA <NA>
8 2 2003 2 7 new.customer
9 3 2000 2 4 <NA>
10 3 2001 6 4 <NA>
11 3 2002 2 5 <NA>
12 3 2003 1 8 <NA>


needs to look like this:



  customer year item.A item.B key
1 1 2002 1 5 new.customer
2 1 2003 2 6 <NA>
3 2 2003 2 7 new.customer
4 3 2000 2 4 <NA>
5 3 2001 6 4 <NA>
6 3 2002 2 5 <NA>
7 3 2003 1 8 <NA>


I thought I could do this using dplyr/tidyr - a combination of group, lead/lag, and slice (or perhaps filter and drop_na) but I can't figure out how to delete backwards in the customer group once I've identified the rows meeting the condition "key"=="new.customer". Thanks for any suggestions (code for the full dataframe below).



a<-c(1,1,1,1,2,2,2,2,3,3,3,3)
b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)
c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)
d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)
e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA)
df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)
df









share|improve this question














I have a dataframe of customers (identified by ID number), the number of units of two products they bought in each of four years, and a final column identifying the year in which new customers first purchased (the 'key' column). The problem: the dataframe includes rows from the years prior to new customers purchasing for the first time. I need to delete these rows. For example, this dataframe:



   customer year item.A item.B  key
1 1 2000 NA NA <NA>
2 1 2001 NA NA <NA>
3 1 2002 1 5 new.customer
4 1 2003 2 6 <NA>
5 2 2000 NA NA <NA>
6 2 2001 NA NA <NA>
7 2 2002 NA NA <NA>
8 2 2003 2 7 new.customer
9 3 2000 2 4 <NA>
10 3 2001 6 4 <NA>
11 3 2002 2 5 <NA>
12 3 2003 1 8 <NA>


needs to look like this:



  customer year item.A item.B key
1 1 2002 1 5 new.customer
2 1 2003 2 6 <NA>
3 2 2003 2 7 new.customer
4 3 2000 2 4 <NA>
5 3 2001 6 4 <NA>
6 3 2002 2 5 <NA>
7 3 2003 1 8 <NA>


I thought I could do this using dplyr/tidyr - a combination of group, lead/lag, and slice (or perhaps filter and drop_na) but I can't figure out how to delete backwards in the customer group once I've identified the rows meeting the condition "key"=="new.customer". Thanks for any suggestions (code for the full dataframe below).



a<-c(1,1,1,1,2,2,2,2,3,3,3,3)
b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)
c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)
d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)
e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA)
df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)
df






r dataframe conditional delete-row






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 17 '18 at 1:21









Emilio M. BrunaEmilio M. Bruna

8419




8419













  • shouldn't first row of customer 3 be marked as new customer?

    – Shree
    Nov 17 '18 at 1:27











  • Do these lines necessarily correspond to NA for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entire data.frame. If it is, simply: df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ], or using subset which might be easier to read: subset(df, !(is.na(C) & is.na(D) & is.na(key)))

    – Mathieu Basille
    Nov 24 '18 at 0:15





















  • shouldn't first row of customer 3 be marked as new customer?

    – Shree
    Nov 17 '18 at 1:27











  • Do these lines necessarily correspond to NA for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entire data.frame. If it is, simply: df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ], or using subset which might be easier to read: subset(df, !(is.na(C) & is.na(D) & is.na(key)))

    – Mathieu Basille
    Nov 24 '18 at 0:15



















shouldn't first row of customer 3 be marked as new customer?

– Shree
Nov 17 '18 at 1:27





shouldn't first row of customer 3 be marked as new customer?

– Shree
Nov 17 '18 at 1:27













Do these lines necessarily correspond to NA for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entire data.frame. If it is, simply: df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ], or using subset which might be easier to read: subset(df, !(is.na(C) & is.na(D) & is.na(key)))

– Mathieu Basille
Nov 24 '18 at 0:15







Do these lines necessarily correspond to NA for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entire data.frame. If it is, simply: df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ], or using subset which might be easier to read: subset(df, !(is.na(C) & is.na(D) & is.na(key)))

– Mathieu Basille
Nov 24 '18 at 0:15














1 Answer
1






active

oldest

votes


















1














As a first step I am marking existing customers (customer 3 in this case) in the key column -



df %>% 
group_by(customer) %>%
mutate(
key = as.character(key), # can be avoided if key is a character to begin with
key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
) %>%
filter(cumsum(!is.na(key)) > 0) %>%
ungroup()

# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA





share|improve this answer


























  • Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.

    – Emilio M. Bruna
    Nov 17 '18 at 1:41











  • @EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not NA. You could populate it with say "existing" instead of NA. Does that work for you?

    – Shree
    Nov 17 '18 at 1:45











  • Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.

    – Emilio M. Bruna
    Nov 17 '18 at 2:01






  • 1





    I have updated my answer to include logic for existing customers. You don't need to do anything now.

    – Shree
    Nov 17 '18 at 2:18














Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53347352%2fr-how-can-i-group-rows-in-a-dataframe-id-rows-meeting-a-condition-then-delete%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














As a first step I am marking existing customers (customer 3 in this case) in the key column -



df %>% 
group_by(customer) %>%
mutate(
key = as.character(key), # can be avoided if key is a character to begin with
key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
) %>%
filter(cumsum(!is.na(key)) > 0) %>%
ungroup()

# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA





share|improve this answer


























  • Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.

    – Emilio M. Bruna
    Nov 17 '18 at 1:41











  • @EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not NA. You could populate it with say "existing" instead of NA. Does that work for you?

    – Shree
    Nov 17 '18 at 1:45











  • Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.

    – Emilio M. Bruna
    Nov 17 '18 at 2:01






  • 1





    I have updated my answer to include logic for existing customers. You don't need to do anything now.

    – Shree
    Nov 17 '18 at 2:18


















1














As a first step I am marking existing customers (customer 3 in this case) in the key column -



df %>% 
group_by(customer) %>%
mutate(
key = as.character(key), # can be avoided if key is a character to begin with
key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
) %>%
filter(cumsum(!is.na(key)) > 0) %>%
ungroup()

# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA





share|improve this answer


























  • Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.

    – Emilio M. Bruna
    Nov 17 '18 at 1:41











  • @EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not NA. You could populate it with say "existing" instead of NA. Does that work for you?

    – Shree
    Nov 17 '18 at 1:45











  • Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.

    – Emilio M. Bruna
    Nov 17 '18 at 2:01






  • 1





    I have updated my answer to include logic for existing customers. You don't need to do anything now.

    – Shree
    Nov 17 '18 at 2:18
















1












1








1







As a first step I am marking existing customers (customer 3 in this case) in the key column -



df %>% 
group_by(customer) %>%
mutate(
key = as.character(key), # can be avoided if key is a character to begin with
key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
) %>%
filter(cumsum(!is.na(key)) > 0) %>%
ungroup()

# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA





share|improve this answer















As a first step I am marking existing customers (customer 3 in this case) in the key column -



df %>% 
group_by(customer) %>%
mutate(
key = as.character(key), # can be avoided if key is a character to begin with
key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
) %>%
filter(cumsum(!is.na(key)) > 0) %>%
ungroup()

# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 17 '18 at 13:17

























answered Nov 17 '18 at 1:30









ShreeShree

3,5561524




3,5561524













  • Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.

    – Emilio M. Bruna
    Nov 17 '18 at 1:41











  • @EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not NA. You could populate it with say "existing" instead of NA. Does that work for you?

    – Shree
    Nov 17 '18 at 1:45











  • Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.

    – Emilio M. Bruna
    Nov 17 '18 at 2:01






  • 1





    I have updated my answer to include logic for existing customers. You don't need to do anything now.

    – Shree
    Nov 17 '18 at 2:18





















  • Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.

    – Emilio M. Bruna
    Nov 17 '18 at 1:41











  • @EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not NA. You could populate it with say "existing" instead of NA. Does that work for you?

    – Shree
    Nov 17 '18 at 1:45











  • Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.

    – Emilio M. Bruna
    Nov 17 '18 at 2:01






  • 1





    I have updated my answer to include logic for existing customers. You don't need to do anything now.

    – Shree
    Nov 17 '18 at 2:18



















Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.

– Emilio M. Bruna
Nov 17 '18 at 1:41





Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.

– Emilio M. Bruna
Nov 17 '18 at 1:41













@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not NA. You could populate it with say "existing" instead of NA. Does that work for you?

– Shree
Nov 17 '18 at 1:45





@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not NA. You could populate it with say "existing" instead of NA. Does that work for you?

– Shree
Nov 17 '18 at 1:45













Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.

– Emilio M. Bruna
Nov 17 '18 at 2:01





Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.

– Emilio M. Bruna
Nov 17 '18 at 2:01




1




1





I have updated my answer to include logic for existing customers. You don't need to do anything now.

– Shree
Nov 17 '18 at 2:18







I have updated my answer to include logic for existing customers. You don't need to do anything now.

– Shree
Nov 17 '18 at 2:18






















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53347352%2fr-how-can-i-group-rows-in-a-dataframe-id-rows-meeting-a-condition-then-delete%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Xamarin.iOS Cant Deploy on Iphone

Glorious Revolution

Dulmage-Mendelsohn matrix decomposition in Python