Pandas merge handling duplicates in join output
Is there a nice way to bring in only one row, preferably a random one, for each one-to-many match during a left join in pandas?
For example:
import numpy as np
import pandas as pd

left = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [9, 9, 9], [1, 3, 2]]
right = [[1, 2, 2], [1, 2, 3], [3, 2, 2], [3, 2, 9], [3, 2, 2]]
left = pd.DataFrame(np.asarray(left))
right = pd.DataFrame(np.asarray(right))
# Left join on column 0; keys 1 and 3 appear more than once in right.
joined_left = left.merge(right, how="left", left_on=[0], right_on=[0])
So this is what we get
left:
   0  1  2
0  1  1  1
1  2  2  2
2  3  3  3
3  9  9  9
4  1  3  2

right:
   0  1  2
0  1  2  2
1  1  2  3
2  3  2  2
3  3  2  9
4  3  2  2

joined_left:
   0  1_x  2_x  1_y  2_y
0  1    1    1  2.0  2.0
1  1    1    1  2.0  3.0
2  2    2    2  NaN  NaN
3  3    3    3  2.0  2.0
4  3    3    3  2.0  9.0
5  3    3    3  2.0  2.0
6  9    9    9  NaN  NaN
7  1    3    2  2.0  2.0
8  1    3    2  2.0  3.0
I want the output to have the same number of rows as my left DataFrame. When a key has more than one match in the right DataFrame, I want to bring in only a single, randomly chosen matching row.
Is there a concise way of doing this with pandas?
Thank you!
python pandas dataframe random merge
edited Nov 11 at 1:26
coldspeed
asked Nov 11 at 0:36
YohanRoth
1 Answer
accepted
You can shuffle right and call drop_duplicates(...[, keep='first']) before merging.
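# Shuffle the rows of right, then keep the first row seen for each key in column 0.
right2 = right.sample(frac=1).drop_duplicates(subset=[0])
# right2 has unique keys, so the left join returns one row per row of left.
left.merge(right2, how='left', left_on=[0], right_on=[0])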
   0  1_x  2_x  1_y  2_y
0  1    1    1  2.0  2.0
1  2    2    2  NaN  NaN
2  3    3    3  2.0  2.0
3  9    9    9  NaN  NaN
4  1    3    2  2.0  2.0
We shuffle right first, and then drop every duplicate except the first row (considering only column 0). Because the row order is random at that point, this is the same as randomly selecting one row of right for each key.
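As a minimal, self-contained sketch of the approach above: the random_state seed, the validate argument, and the name right_unique are assumptions added here for repeatability and checking, not part of the original answer.

```python
import pandas as pd

left = pd.DataFrame([[1, 1, 1], [2, 2, 2], [3, 3, 3], [9, 9, 9], [1, 3, 2]])
right = pd.DataFrame([[1, 2, 2], [1, 2, 3], [3, 2, 2], [3, 2, 9], [3, 2, 2]])

# Shuffle right (seeded only so the run is repeatable), then keep the
# first row encountered for each key in column 0.
right_unique = right.sample(frac=1, random_state=0).drop_duplicates(subset=[0])

# validate="many_to_one" asks pandas to raise an error if right_unique
# still contains duplicate keys, which would multiply rows in the output.
joined = left.merge(
    right_unique, how="left", left_on=[0], right_on=[0], validate="many_to_one"
)

print(joined)
print(len(joined) == len(left))
```

Dropping the seed gives a different pick on each run. Note that the pick is made once per key, so every row of left that shares a key receives the same row of right.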
I see, so you drop duplicates for a merge key column right. Ingenious! Thank you
– YohanRoth
Nov 11 at 0:43
@YohanRoth - in this case - if your first row of the output is 1 1 1 2.0 2.0, I think that guarantees the last row is also 1 3 2 2.0 2.0 since you've dropped 1 2 3. From your question asking for a random choice, I'm a bit concerned that this may not be the behavior you want. Perhaps it's fine, but worth making sure it's consistent with what you want.
– Joel
Nov 11 at 4:47
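If the behavior described in the comment above is a concern, one possible variation is to make the random choice per row of left rather than per key. The sketch below is not from the original answer: it merges first, then applies the same shuffle and drop_duplicates trick to a helper column, left_row (a hypothetical name), that records which row of left each merged row came from. The random_state seed is again only there for repeatability.

```python
import pandas as pd

left = pd.DataFrame([[1, 1, 1], [2, 2, 2], [3, 3, 3], [9, 9, 9], [1, 3, 2]])
right = pd.DataFrame([[1, 2, 2], [1, 2, 3], [3, 2, 2], [3, 2, 9], [3, 2, 2]])

# Carry each left row's position through the merge in a helper column.
expanded = (
    left.reset_index()
    .rename(columns={"index": "left_row"})
    .merge(right, how="left", left_on=[0], right_on=[0])
)

# Shuffle the expanded rows, keep one per original left row,
# then restore the original order of left.
picked = (
    expanded.sample(frac=1, random_state=0)
    .drop_duplicates(subset=["left_row"])
    .sort_values("left_row")
    .set_index("left_row")
)
picked.index.name = None

print(picked)
```

Here two rows of left with the same key draw their match independently, at the cost of materializing the full one-to-many join before reducing it.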
answered Nov 11 at 0:39
coldspeed