how to match values in columns by group or within category (groupwise) with other column(having multiple...
up vote
3
down vote
favorite
So, I have data table in R which looks like follows:
Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X
So the idea here is for every group (claim, failure and part) if 'any' of the code in 'code' matches with list of the codes in 'matchcode' column then than group should have Match column as True.
So expected output should be as following:
Claim failure Part Match
23 F1 P1 TRUE
23 F2 P2 FALSE
45 F1 P4 TRUE
45 F1 P1 FALSE
The size of this data.table is huge so will be needing a optimized solution.
please help :)
r data.table
add a comment |
up vote
3
down vote
favorite
So, I have data table in R which looks like follows:
Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X
So the idea here is for every group (claim, failure and part) if 'any' of the code in 'code' matches with list of the codes in 'matchcode' column then than group should have Match column as True.
So expected output should be as following:
Claim failure Part Match
23 F1 P1 TRUE
23 F2 P2 FALSE
45 F1 P4 TRUE
45 F1 P1 FALSE
The size of this data.table is huge so will be needing a optimized solution.
please help :)
r data.table
If you want to vectorize/optimize regex operation, always use the stringi package. That will be by far faster than any alternative (at Ieast I know of). For your purposes I would just dolibrary(stringi) ; DT[, .(Match = any(stri_detect_fixed(matchcode, code))), by = .(Claim, failure, Part)]
. And I would also do some searching as vectorizinggrepl
is something that was discussed many times already
– David Arenburg
Nov 11 at 11:20
add a comment |
up vote
3
down vote
favorite
up vote
3
down vote
favorite
So, I have data table in R which looks like follows:
Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X
So the idea here is for every group (claim, failure and part) if 'any' of the code in 'code' matches with list of the codes in 'matchcode' column then than group should have Match column as True.
So expected output should be as following:
Claim failure Part Match
23 F1 P1 TRUE
23 F2 P2 FALSE
45 F1 P4 TRUE
45 F1 P1 FALSE
The size of this data.table is huge so will be needing a optimized solution.
please help :)
r data.table
So, I have data table in R which looks like follows:
Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X
So the idea here is for every group (claim, failure and part) if 'any' of the code in 'code' matches with list of the codes in 'matchcode' column then than group should have Match column as True.
So expected output should be as following:
Claim failure Part Match
23 F1 P1 TRUE
23 F2 P2 FALSE
45 F1 P4 TRUE
45 F1 P1 FALSE
The size of this data.table is huge so will be needing a optimized solution.
please help :)
r data.table
r data.table
asked Nov 10 at 22:08
Rahul Rajaram
225
225
If you want to vectorize/optimize regex operation, always use the stringi package. That will be by far faster than any alternative (at Ieast I know of). For your purposes I would just dolibrary(stringi) ; DT[, .(Match = any(stri_detect_fixed(matchcode, code))), by = .(Claim, failure, Part)]
. And I would also do some searching as vectorizinggrepl
is something that was discussed many times already
– David Arenburg
Nov 11 at 11:20
add a comment |
If you want to vectorize/optimize regex operation, always use the stringi package. That will be by far faster than any alternative (at Ieast I know of). For your purposes I would just dolibrary(stringi) ; DT[, .(Match = any(stri_detect_fixed(matchcode, code))), by = .(Claim, failure, Part)]
. And I would also do some searching as vectorizinggrepl
is something that was discussed many times already
– David Arenburg
Nov 11 at 11:20
If you want to vectorize/optimize regex operation, always use the stringi package. That will be by far faster than any alternative (at Ieast I know of). For your purposes I would just do
library(stringi) ; DT[, .(Match = any(stri_detect_fixed(matchcode, code))), by = .(Claim, failure, Part)]
. And I would also do some searching as vectorizing grepl
is something that was discussed many times already– David Arenburg
Nov 11 at 11:20
If you want to vectorize/optimize regex operation, always use the stringi package. That will be by far faster than any alternative (at Ieast I know of). For your purposes I would just do
library(stringi) ; DT[, .(Match = any(stri_detect_fixed(matchcode, code))), by = .(Claim, failure, Part)]
. And I would also do some searching as vectorizing grepl
is something that was discussed many times already– David Arenburg
Nov 11 at 11:20
add a comment |
3 Answers
3
active
oldest
votes
up vote
3
down vote
accepted
... and a solution using data.table
.
library(data.table)
grepl_v <- Vectorize(grepl)
DT[, .(Match = any(grepl_v(code, matchcode))), by = .(Claim, failure, Part)]
# Claim failure Part Match
#1: 23 F1 P1 TRUE
#2: 23 F2 P2 FALSE
#3: 45 F1 P4 TRUE
#4: 45 F1 P1 FALSE
data
DT <- fread("Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X")
add a comment |
up vote
1
down vote
Here is a dplyr
solution.
library(dplyr)
dat %>%
rowwise() %>%
mutate(Match = grepl(code, matchcode)) %>%
group_by(Claim, failure, Part) %>%
mutate(Match = any(Match)) %>%
select(-code, -matchcode) %>%
unique
## A tibble: 4 x 4
## Groups: Claim, failure, Part [4]
# Claim failure Part Match
# <int> <fct> <fct> <lgl>
#1 23 F1 P1 TRUE
#2 23 F2 P2 FALSE
#3 45 F1 P4 TRUE
#4 45 F1 P1 FALSE
#Warning message:
#Grouping rowwise data frame strips rowwise nature
Don't worry about the warning, it simply tells you that after grep
'ing rowwise, the pipe groups by certain variables and therefore the processing is no longer row by row.
Data.
dat <- read.table(text = "
Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X
", header = TRUE)
add a comment |
up vote
1
down vote
base
solution:
d$match <- apply(d, 1, function(x) { return(grepl(x[['code']], x['matchcode']))})
# Claim failure Part code matchcode match
# 1 23 F1 P1 A B,A,C 1
# 2 23 F1 P1 D B,A,C 0
# 3 23 F2 P2 D B,A,C 0
# 4 23 F2 P2 E B,A,C 0
# 5 45 F1 P4 X Y,Z,X 1
# 6 45 F1 P4 Y Y,Z,X 1
# 7 45 F1 P4 A Y,Z,X 0
# 8 45 F1 P1 F Y,Z,X 0
# 9 45 F1 P1 H Y,Z,X 0
Original version of this answer used grep()
; thanks to markus for suggesting grepl()
add a comment |
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
3
down vote
accepted
... and a solution using data.table
.
library(data.table)
grepl_v <- Vectorize(grepl)
DT[, .(Match = any(grepl_v(code, matchcode))), by = .(Claim, failure, Part)]
# Claim failure Part Match
#1: 23 F1 P1 TRUE
#2: 23 F2 P2 FALSE
#3: 45 F1 P4 TRUE
#4: 45 F1 P1 FALSE
data
DT <- fread("Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X")
add a comment |
up vote
3
down vote
accepted
... and a solution using data.table
.
library(data.table)
grepl_v <- Vectorize(grepl)
DT[, .(Match = any(grepl_v(code, matchcode))), by = .(Claim, failure, Part)]
# Claim failure Part Match
#1: 23 F1 P1 TRUE
#2: 23 F2 P2 FALSE
#3: 45 F1 P4 TRUE
#4: 45 F1 P1 FALSE
data
DT <- fread("Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X")
add a comment |
up vote
3
down vote
accepted
up vote
3
down vote
accepted
... and a solution using data.table
.
library(data.table)
grepl_v <- Vectorize(grepl)
DT[, .(Match = any(grepl_v(code, matchcode))), by = .(Claim, failure, Part)]
# Claim failure Part Match
#1: 23 F1 P1 TRUE
#2: 23 F2 P2 FALSE
#3: 45 F1 P4 TRUE
#4: 45 F1 P1 FALSE
data
DT <- fread("Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X")
... and a solution using data.table
.
library(data.table)
grepl_v <- Vectorize(grepl)
DT[, .(Match = any(grepl_v(code, matchcode))), by = .(Claim, failure, Part)]
# Claim failure Part Match
#1: 23 F1 P1 TRUE
#2: 23 F2 P2 FALSE
#3: 45 F1 P4 TRUE
#4: 45 F1 P1 FALSE
data
DT <- fread("Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X")
edited Nov 10 at 22:43
answered Nov 10 at 22:30
markus
8,264928
8,264928
add a comment |
add a comment |
up vote
1
down vote
Here is a dplyr
solution.
library(dplyr)
dat %>%
rowwise() %>%
mutate(Match = grepl(code, matchcode)) %>%
group_by(Claim, failure, Part) %>%
mutate(Match = any(Match)) %>%
select(-code, -matchcode) %>%
unique
## A tibble: 4 x 4
## Groups: Claim, failure, Part [4]
# Claim failure Part Match
# <int> <fct> <fct> <lgl>
#1 23 F1 P1 TRUE
#2 23 F2 P2 FALSE
#3 45 F1 P4 TRUE
#4 45 F1 P1 FALSE
#Warning message:
#Grouping rowwise data frame strips rowwise nature
Don't worry about the warning, it simply tells you that after grep
'ing rowwise, the pipe groups by certain variables and therefore the processing is no longer row by row.
Data.
dat <- read.table(text = "
Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X
", header = TRUE)
add a comment |
up vote
1
down vote
Here is a dplyr
solution.
library(dplyr)
dat %>%
rowwise() %>%
mutate(Match = grepl(code, matchcode)) %>%
group_by(Claim, failure, Part) %>%
mutate(Match = any(Match)) %>%
select(-code, -matchcode) %>%
unique
## A tibble: 4 x 4
## Groups: Claim, failure, Part [4]
# Claim failure Part Match
# <int> <fct> <fct> <lgl>
#1 23 F1 P1 TRUE
#2 23 F2 P2 FALSE
#3 45 F1 P4 TRUE
#4 45 F1 P1 FALSE
#Warning message:
#Grouping rowwise data frame strips rowwise nature
Don't worry about the warning, it simply tells you that after grep
'ing rowwise, the pipe groups by certain variables and therefore the processing is no longer row by row.
Data.
dat <- read.table(text = "
Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X
", header = TRUE)
add a comment |
up vote
1
down vote
up vote
1
down vote
Here is a dplyr
solution.
library(dplyr)
dat %>%
rowwise() %>%
mutate(Match = grepl(code, matchcode)) %>%
group_by(Claim, failure, Part) %>%
mutate(Match = any(Match)) %>%
select(-code, -matchcode) %>%
unique
## A tibble: 4 x 4
## Groups: Claim, failure, Part [4]
# Claim failure Part Match
# <int> <fct> <fct> <lgl>
#1 23 F1 P1 TRUE
#2 23 F2 P2 FALSE
#3 45 F1 P4 TRUE
#4 45 F1 P1 FALSE
#Warning message:
#Grouping rowwise data frame strips rowwise nature
Don't worry about the warning, it simply tells you that after grep
'ing rowwise, the pipe groups by certain variables and therefore the processing is no longer row by row.
Data.
dat <- read.table(text = "
Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X
", header = TRUE)
Here is a dplyr
solution.
library(dplyr)
dat %>%
rowwise() %>%
mutate(Match = grepl(code, matchcode)) %>%
group_by(Claim, failure, Part) %>%
mutate(Match = any(Match)) %>%
select(-code, -matchcode) %>%
unique
## A tibble: 4 x 4
## Groups: Claim, failure, Part [4]
# Claim failure Part Match
# <int> <fct> <fct> <lgl>
#1 23 F1 P1 TRUE
#2 23 F2 P2 FALSE
#3 45 F1 P4 TRUE
#4 45 F1 P1 FALSE
#Warning message:
#Grouping rowwise data frame strips rowwise nature
Don't worry about the warning, it simply tells you that after grep
'ing rowwise, the pipe groups by certain variables and therefore the processing is no longer row by row.
Data.
dat <- read.table(text = "
Claim failure Part code matchcode
23 F1 P1 A B,A,C
23 F1 P1 D B,A,C
23 F2 P2 D B,A,C
23 F2 P2 E B,A,C
45 F1 P4 X Y,Z,X
45 F1 P4 Y Y,Z,X
45 F1 P4 A Y,Z,X
45 F1 P1 F Y,Z,X
45 F1 P1 H Y,Z,X
", header = TRUE)
answered Nov 10 at 22:26
Rui Barradas
14.5k31729
14.5k31729
add a comment |
add a comment |
up vote
1
down vote
base
solution:
d$match <- apply(d, 1, function(x) { return(grepl(x[['code']], x['matchcode']))})
# Claim failure Part code matchcode match
# 1 23 F1 P1 A B,A,C 1
# 2 23 F1 P1 D B,A,C 0
# 3 23 F2 P2 D B,A,C 0
# 4 23 F2 P2 E B,A,C 0
# 5 45 F1 P4 X Y,Z,X 1
# 6 45 F1 P4 Y Y,Z,X 1
# 7 45 F1 P4 A Y,Z,X 0
# 8 45 F1 P1 F Y,Z,X 0
# 9 45 F1 P1 H Y,Z,X 0
Original version of this answer used grep()
; thanks to markus for suggesting grepl()
add a comment |
up vote
1
down vote
base
solution:
d$match <- apply(d, 1, function(x) { return(grepl(x[['code']], x['matchcode']))})
# Claim failure Part code matchcode match
# 1 23 F1 P1 A B,A,C 1
# 2 23 F1 P1 D B,A,C 0
# 3 23 F2 P2 D B,A,C 0
# 4 23 F2 P2 E B,A,C 0
# 5 45 F1 P4 X Y,Z,X 1
# 6 45 F1 P4 Y Y,Z,X 1
# 7 45 F1 P4 A Y,Z,X 0
# 8 45 F1 P1 F Y,Z,X 0
# 9 45 F1 P1 H Y,Z,X 0
Original version of this answer used grep()
; thanks to markus for suggesting grepl()
add a comment |
up vote
1
down vote
up vote
1
down vote
base
solution:
d$match <- apply(d, 1, function(x) { return(grepl(x[['code']], x['matchcode']))})
# Claim failure Part code matchcode match
# 1 23 F1 P1 A B,A,C 1
# 2 23 F1 P1 D B,A,C 0
# 3 23 F2 P2 D B,A,C 0
# 4 23 F2 P2 E B,A,C 0
# 5 45 F1 P4 X Y,Z,X 1
# 6 45 F1 P4 Y Y,Z,X 1
# 7 45 F1 P4 A Y,Z,X 0
# 8 45 F1 P1 F Y,Z,X 0
# 9 45 F1 P1 H Y,Z,X 0
Original version of this answer used grep()
; thanks to markus for suggesting grepl()
base
solution:
d$match <- apply(d, 1, function(x) { return(grepl(x[['code']], x['matchcode']))})
# Claim failure Part code matchcode match
# 1 23 F1 P1 A B,A,C 1
# 2 23 F1 P1 D B,A,C 0
# 3 23 F2 P2 D B,A,C 0
# 4 23 F2 P2 E B,A,C 0
# 5 45 F1 P4 X Y,Z,X 1
# 6 45 F1 P4 Y Y,Z,X 1
# 7 45 F1 P4 A Y,Z,X 0
# 8 45 F1 P1 F Y,Z,X 0
# 9 45 F1 P1 H Y,Z,X 0
Original version of this answer used grep()
; thanks to markus for suggesting grepl()
edited Nov 10 at 22:32
answered Nov 10 at 22:27
12b345b6b78
553115
553115
add a comment |
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53243908%2fhow-to-match-values-in-columns-by-group-or-within-category-groupwise-with-othe%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
If you want to vectorize/optimize regex operation, always use the stringi package. That will be by far faster than any alternative (at Ieast I know of). For your purposes I would just do
library(stringi) ; DT[, .(Match = any(stri_detect_fixed(matchcode, code))), by = .(Claim, failure, Part)]
. And I would also do some searching as vectorizinggrepl
is something that was discussed many times already– David Arenburg
Nov 11 at 11:20