R: How can I group rows in a dataframe, ID rows meeting a condition, then delete prior rows for the group?
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}
I have a dataframe of customers (identified by ID number), the number of units of two products they bought in each of four years, and a final column identifying the year in which new customers first purchased (the 'key' column). The problem: the dataframe includes rows from the years prior to new customers purchasing for the first time. I need to delete these rows. For example, this dataframe:
customer year item.A item.B key
1 1 2000 NA NA <NA>
2 1 2001 NA NA <NA>
3 1 2002 1 5 new.customer
4 1 2003 2 6 <NA>
5 2 2000 NA NA <NA>
6 2 2001 NA NA <NA>
7 2 2002 NA NA <NA>
8 2 2003 2 7 new.customer
9 3 2000 2 4 <NA>
10 3 2001 6 4 <NA>
11 3 2002 2 5 <NA>
12 3 2003 1 8 <NA>
needs to look like this:
customer year item.A item.B key
1 1 2002 1 5 new.customer
2 1 2003 2 6 <NA>
3 2 2003 2 7 new.customer
4 3 2000 2 4 <NA>
5 3 2001 6 4 <NA>
6 3 2002 2 5 <NA>
7 3 2003 1 8 <NA>
I thought I could do this using dplyr/tidyr - a combination of group, lead/lag, and slice (or perhaps filter and drop_na) but I can't figure out how to delete backwards in the customer group once I've identified the rows meeting the condition "key"=="new.customer". Thanks for any suggestions (code for the full dataframe below).
a<-c(1,1,1,1,2,2,2,2,3,3,3,3)
b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)
c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)
d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)
e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA)
df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)
df
r dataframe conditional delete-row
add a comment |
I have a dataframe of customers (identified by ID number), the number of units of two products they bought in each of four years, and a final column identifying the year in which new customers first purchased (the 'key' column). The problem: the dataframe includes rows from the years prior to new customers purchasing for the first time. I need to delete these rows. For example, this dataframe:
customer year item.A item.B key
1 1 2000 NA NA <NA>
2 1 2001 NA NA <NA>
3 1 2002 1 5 new.customer
4 1 2003 2 6 <NA>
5 2 2000 NA NA <NA>
6 2 2001 NA NA <NA>
7 2 2002 NA NA <NA>
8 2 2003 2 7 new.customer
9 3 2000 2 4 <NA>
10 3 2001 6 4 <NA>
11 3 2002 2 5 <NA>
12 3 2003 1 8 <NA>
needs to look like this:
customer year item.A item.B key
1 1 2002 1 5 new.customer
2 1 2003 2 6 <NA>
3 2 2003 2 7 new.customer
4 3 2000 2 4 <NA>
5 3 2001 6 4 <NA>
6 3 2002 2 5 <NA>
7 3 2003 1 8 <NA>
I thought I could do this using dplyr/tidyr - a combination of group, lead/lag, and slice (or perhaps filter and drop_na) but I can't figure out how to delete backwards in the customer group once I've identified the rows meeting the condition "key"=="new.customer". Thanks for any suggestions (code for the full dataframe below).
a<-c(1,1,1,1,2,2,2,2,3,3,3,3)
b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)
c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)
d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)
e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA)
df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)
df
r dataframe conditional delete-row
shouldn't first row of customer 3 be marked as new customer?
– Shree
Nov 17 '18 at 1:27
Do these lines necessarily correspond toNA
for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entiredata.frame
. If it is, simply:df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ]
, or usingsubset
which might be easier to read:subset(df, !(is.na(C) & is.na(D) & is.na(key)))
– Mathieu Basille
Nov 24 '18 at 0:15
add a comment |
I have a dataframe of customers (identified by ID number), the number of units of two products they bought in each of four years, and a final column identifying the year in which new customers first purchased (the 'key' column). The problem: the dataframe includes rows from the years prior to new customers purchasing for the first time. I need to delete these rows. For example, this dataframe:
customer year item.A item.B key
1 1 2000 NA NA <NA>
2 1 2001 NA NA <NA>
3 1 2002 1 5 new.customer
4 1 2003 2 6 <NA>
5 2 2000 NA NA <NA>
6 2 2001 NA NA <NA>
7 2 2002 NA NA <NA>
8 2 2003 2 7 new.customer
9 3 2000 2 4 <NA>
10 3 2001 6 4 <NA>
11 3 2002 2 5 <NA>
12 3 2003 1 8 <NA>
needs to look like this:
customer year item.A item.B key
1 1 2002 1 5 new.customer
2 1 2003 2 6 <NA>
3 2 2003 2 7 new.customer
4 3 2000 2 4 <NA>
5 3 2001 6 4 <NA>
6 3 2002 2 5 <NA>
7 3 2003 1 8 <NA>
I thought I could do this using dplyr/tidyr - a combination of group, lead/lag, and slice (or perhaps filter and drop_na) but I can't figure out how to delete backwards in the customer group once I've identified the rows meeting the condition "key"=="new.customer". Thanks for any suggestions (code for the full dataframe below).
a<-c(1,1,1,1,2,2,2,2,3,3,3,3)
b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)
c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)
d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)
e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA)
df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)
df
r dataframe conditional delete-row
I have a dataframe of customers (identified by ID number), the number of units of two products they bought in each of four years, and a final column identifying the year in which new customers first purchased (the 'key' column). The problem: the dataframe includes rows from the years prior to new customers purchasing for the first time. I need to delete these rows. For example, this dataframe:
customer year item.A item.B key
1 1 2000 NA NA <NA>
2 1 2001 NA NA <NA>
3 1 2002 1 5 new.customer
4 1 2003 2 6 <NA>
5 2 2000 NA NA <NA>
6 2 2001 NA NA <NA>
7 2 2002 NA NA <NA>
8 2 2003 2 7 new.customer
9 3 2000 2 4 <NA>
10 3 2001 6 4 <NA>
11 3 2002 2 5 <NA>
12 3 2003 1 8 <NA>
needs to look like this:
customer year item.A item.B key
1 1 2002 1 5 new.customer
2 1 2003 2 6 <NA>
3 2 2003 2 7 new.customer
4 3 2000 2 4 <NA>
5 3 2001 6 4 <NA>
6 3 2002 2 5 <NA>
7 3 2003 1 8 <NA>
I thought I could do this using dplyr/tidyr - a combination of group, lead/lag, and slice (or perhaps filter and drop_na) but I can't figure out how to delete backwards in the customer group once I've identified the rows meeting the condition "key"=="new.customer". Thanks for any suggestions (code for the full dataframe below).
a<-c(1,1,1,1,2,2,2,2,3,3,3,3)
b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)
c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)
d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)
e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA)
df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)
df
r dataframe conditional delete-row
r dataframe conditional delete-row
asked Nov 17 '18 at 1:21
Emilio M. BrunaEmilio M. Bruna
8419
8419
shouldn't first row of customer 3 be marked as new customer?
– Shree
Nov 17 '18 at 1:27
Do these lines necessarily correspond toNA
for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entiredata.frame
. If it is, simply:df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ]
, or usingsubset
which might be easier to read:subset(df, !(is.na(C) & is.na(D) & is.na(key)))
– Mathieu Basille
Nov 24 '18 at 0:15
add a comment |
shouldn't first row of customer 3 be marked as new customer?
– Shree
Nov 17 '18 at 1:27
Do these lines necessarily correspond toNA
for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entiredata.frame
. If it is, simply:df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ]
, or usingsubset
which might be easier to read:subset(df, !(is.na(C) & is.na(D) & is.na(key)))
– Mathieu Basille
Nov 24 '18 at 0:15
shouldn't first row of customer 3 be marked as new customer?
– Shree
Nov 17 '18 at 1:27
shouldn't first row of customer 3 be marked as new customer?
– Shree
Nov 17 '18 at 1:27
Do these lines necessarily correspond to
NA
for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entire data.frame
. If it is, simply: df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ]
, or using subset
which might be easier to read: subset(df, !(is.na(C) & is.na(D) & is.na(key)))
– Mathieu Basille
Nov 24 '18 at 0:15
Do these lines necessarily correspond to
NA
for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entire data.frame
. If it is, simply: df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ]
, or using subset
which might be easier to read: subset(df, !(is.na(C) & is.na(D) & is.na(key)))
– Mathieu Basille
Nov 24 '18 at 0:15
add a comment |
1 Answer
1
active
oldest
votes
As a first step I am marking existing customers (customer 3 in this case) in the key column -
df %>%
group_by(customer) %>%
mutate(
key = as.character(key), # can be avoided if key is a character to begin with
key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
) %>%
filter(cumsum(!is.na(key)) > 0) %>%
ungroup()
# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA
Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.
– Emilio M. Bruna
Nov 17 '18 at 1:41
@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is notNA
. You could populate it with say "existing" instead ofNA
. Does that work for you?
– Shree
Nov 17 '18 at 1:45
Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.
– Emilio M. Bruna
Nov 17 '18 at 2:01
1
I have updated my answer to include logic for existing customers. You don't need to do anything now.
– Shree
Nov 17 '18 at 2:18
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53347352%2fr-how-can-i-group-rows-in-a-dataframe-id-rows-meeting-a-condition-then-delete%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
As a first step I am marking existing customers (customer 3 in this case) in the key column -
df %>%
group_by(customer) %>%
mutate(
key = as.character(key), # can be avoided if key is a character to begin with
key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
) %>%
filter(cumsum(!is.na(key)) > 0) %>%
ungroup()
# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA
Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.
– Emilio M. Bruna
Nov 17 '18 at 1:41
@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is notNA
. You could populate it with say "existing" instead ofNA
. Does that work for you?
– Shree
Nov 17 '18 at 1:45
Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.
– Emilio M. Bruna
Nov 17 '18 at 2:01
1
I have updated my answer to include logic for existing customers. You don't need to do anything now.
– Shree
Nov 17 '18 at 2:18
add a comment |
As a first step I am marking existing customers (customer 3 in this case) in the key column -
df %>%
group_by(customer) %>%
mutate(
key = as.character(key), # can be avoided if key is a character to begin with
key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
) %>%
filter(cumsum(!is.na(key)) > 0) %>%
ungroup()
# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA
Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.
– Emilio M. Bruna
Nov 17 '18 at 1:41
@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is notNA
. You could populate it with say "existing" instead ofNA
. Does that work for you?
– Shree
Nov 17 '18 at 1:45
Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.
– Emilio M. Bruna
Nov 17 '18 at 2:01
1
I have updated my answer to include logic for existing customers. You don't need to do anything now.
– Shree
Nov 17 '18 at 2:18
add a comment |
As a first step I am marking existing customers (customer 3 in this case) in the key column -
df %>%
group_by(customer) %>%
mutate(
key = as.character(key), # can be avoided if key is a character to begin with
key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
) %>%
filter(cumsum(!is.na(key)) > 0) %>%
ungroup()
# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA
As a first step I am marking existing customers (customer 3 in this case) in the key column -
df %>%
group_by(customer) %>%
mutate(
key = as.character(key), # can be avoided if key is a character to begin with
key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
) %>%
filter(cumsum(!is.na(key)) > 0) %>%
ungroup()
# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA
edited Nov 17 '18 at 13:17
answered Nov 17 '18 at 1:30
ShreeShree
3,5561524
3,5561524
Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.
– Emilio M. Bruna
Nov 17 '18 at 1:41
@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is notNA
. You could populate it with say "existing" instead ofNA
. Does that work for you?
– Shree
Nov 17 '18 at 1:45
Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.
– Emilio M. Bruna
Nov 17 '18 at 2:01
1
I have updated my answer to include logic for existing customers. You don't need to do anything now.
– Shree
Nov 17 '18 at 2:18
add a comment |
Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.
– Emilio M. Bruna
Nov 17 '18 at 1:41
@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is notNA
. You could populate it with say "existing" instead ofNA
. Does that work for you?
– Shree
Nov 17 '18 at 1:45
Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.
– Emilio M. Bruna
Nov 17 '18 at 2:01
1
I have updated my answer to include logic for existing customers. You don't need to do anything now.
– Shree
Nov 17 '18 at 2:18
Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.
– Emilio M. Bruna
Nov 17 '18 at 1:41
Thanks for the quick reply. Unfortunately, existing customers have no notation in the ‘key’ column. Maybe the df could be split into two dfs - one for new customers and one for the original esting ones - and your solution appliedto the one for new customers? I could then put them back together.
– Emilio M. Bruna
Nov 17 '18 at 1:41
@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not
NA
. You could populate it with say "existing" instead of NA
. Does that work for you?– Shree
Nov 17 '18 at 1:45
@EmilioM.Bruna You don't need to do all that. Just make sure the first row for existing customers is not
NA
. You could populate it with say "existing" instead of NA
. Does that work for you?– Shree
Nov 17 '18 at 1:45
Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.
– Emilio M. Bruna
Nov 17 '18 at 2:01
Yes, though I’d have to backfill the first row for 7k customers. I’m not sure which is simpler.
– Emilio M. Bruna
Nov 17 '18 at 2:01
1
1
I have updated my answer to include logic for existing customers. You don't need to do anything now.
– Shree
Nov 17 '18 at 2:18
I have updated my answer to include logic for existing customers. You don't need to do anything now.
– Shree
Nov 17 '18 at 2:18
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53347352%2fr-how-can-i-group-rows-in-a-dataframe-id-rows-meeting-a-condition-then-delete%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
shouldn't first row of customer 3 be marked as new customer?
– Shree
Nov 17 '18 at 1:27
Do these lines necessarily correspond to
NA
for both items and the key? From your (simplified) example, it is the case, but I cannot assume it's true for the entiredata.frame
. If it is, simply:df[!(is.na(df$C) & is.na(df$D) & is.na(df$key)), ]
, or usingsubset
which might be easier to read:subset(df, !(is.na(C) & is.na(D) & is.na(key)))
– Mathieu Basille
Nov 24 '18 at 0:15