NumPy: how to left join arrays with duplicates

To use Cython, I need to convert df1.merge(df2, how='left') (using Pandas) to plain NumPy. However, I found that numpy.lib.recfunctions.join_by(key, r1, r2, jointype='leftouter') does not support duplicate values in the key. Is there a way to do a left join in NumPy when the key contains duplicates?

asked Nov 12 at 7:58 by Naive

edited Nov 12 at 8:03 by betontalpfa

The basic idea in most recfunctions is to define a new dtype, create the appropriate 'empty' array, and copy values by field name. It's all readable python; no hidden compiled code. If existing functions don't do the job (they aren't heavily used or tested), write your own.
– hpaulj
Nov 12 at 8:03
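
To make the comment above concrete, here is a minimal sketch of that three-step pattern (define a new dtype, allocate the array, copy values by field name). The helper name merge_fields and the sample arrays are made up for illustration; this is not a function from numpy.lib.recfunctions, and it only combines two equal-length arrays column-wise rather than performing a join.

```python
import numpy as np


def merge_fields(a, b):
    """Column-wise combine two equal-length structured arrays (hypothetical helper)."""
    if a.shape != b.shape:
        raise ValueError("a and b must have the same shape")

    # 1. define the new dtype: every field of a, then the fields of b not already in a
    extra = [name for name in b.dtype.names if name not in a.dtype.names]
    new_dtype = np.dtype(
        [(name, a.dtype[name]) for name in a.dtype.names]
        + [(name, b.dtype[name]) for name in extra]
    )

    # 2. create the 'empty' result array (zero-filled here so nothing is left uninitialized)
    out = np.zeros(a.shape, dtype=new_dtype)

    # 3. copy values by field name
    for name in a.dtype.names:
        out[name] = a[name]
    for name in extra:
        out[name] = b[name]
    return out


# Sample data, invented for this sketch
left = np.array([(1, 1.5), (2, 2.5)], dtype=[("id", "i8"), ("x", "f8")])
right = np.array([(1, b"a"), (2, b"b")], dtype=[("id", "i8"), ("tag", "S1")])
merged = merge_fields(left, right)
```

A join follows the same pattern; the only extra work is deciding which row of each input lands in which row of the output.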

python pandas numpy cython

1 Answer
accepted

Here's a stab at a pure numpy left join that can handle duplicate keys:

import numpy as np


def join_by_left(key, r1, r2, mask=True):
    # figure out the dtype of the result array
    descr1 = r1.dtype.descr
    descr2 = [d for d in r2.dtype.descr if d[0] not in r1.dtype.names]
    descrm = descr1 + descr2

    # figure out the fields we'll need from each array
    f1 = [d[0] for d in descr1]
    f2 = [d[0] for d in descr2]

    # cache the number of columns in f1
    ncol1 = len(f1)

    # get a dict of the rows of r2 grouped by key
    rows2 = {}
    for row2 in r2:
        # start an empty list the first time a key is seen, then append the row
        rows2.setdefault(row2[key], []).append(row2)

    # figure out how many rows will be in the result
    nrowm = 0
    for k1 in r1[key]:
        if k1 in rows2:
            nrowm += len(rows2[k1])
        else:
            nrowm += 1

    # allocate the return array
    _ret = np.recarray(nrowm, dtype=descrm)
    if mask:
        ret = np.ma.array(_ret, mask=True)
    else:
        ret = _ret

    # merge the data into the return array
    i = 0
    for row1 in r1:
        if row1[key] in rows2:
            # one output row per matching row of r2
            for row2 in rows2[row1[key]]:
                ret[i] = tuple(row1[f1]) + tuple(row2[f2])
                i += 1
        else:
            # no match: copy only the r1 fields
            for j in range(ncol1):
                ret[i][j] = row1[j]
            i += 1

    return ret

Basically, it uses a plain dict to do the actual join operation. Like numpy.lib.recfunctions.join_by, this function returns a masked array by default. When a key from the left array is missing from the right array, the right-hand fields of that row are masked out in the result. If you would prefer a plain record array instead, pass mask=False when calling join_by_left. Note that np.recarray does not initialize its memory, so in that case the missing fields hold arbitrary values rather than a guaranteed 0.
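
Since the goal is Cython, it may also help to separate "which rows pair up" from "copy the fields". The following is an alternative sketch, not the method used in the answer above: it computes the row-index pairs of a left join with a stable argsort plus np.searchsorted instead of a dict. The function name left_join_indices, the -1 sentinel for "no match", and the sample arrays are all assumptions made for illustration.

```python
import numpy as np


def left_join_indices(k1, k2):
    """Return (i1, i2) row-index pairs of a left join; i2 == -1 means no match."""
    k1 = np.asarray(k1)
    k2 = np.asarray(k2)

    # sort the right keys; a stable sort keeps duplicates in their original order
    order2 = np.argsort(k2, kind="stable")
    sorted2 = k2[order2]

    # for every left key, the half-open range [lo, hi) of matches in sorted2
    lo = np.searchsorted(sorted2, k1, side="left")
    hi = np.searchsorted(sorted2, k1, side="right")
    counts = hi - lo

    # an unmatched left row still produces exactly one output row
    out_counts = np.maximum(counts, 1)
    i1 = np.repeat(np.arange(len(k1)), out_counts)

    # offset of each output row within the group of its left row
    starts = np.cumsum(out_counts) - out_counts
    within = np.arange(out_counts.sum()) - np.repeat(starts, out_counts)
    pos = np.repeat(lo, out_counts) + within

    # map positions in sorted2 back to original right-row indices
    matched = np.repeat(counts > 0, out_counts)
    i2 = np.full(len(i1), -1, dtype=np.intp)
    i2[matched] = order2[pos[matched]]
    return i1, i2


# Sample data, invented for this sketch
k1 = np.array([1, 2, 3])
k2 = np.array([2, 1, 2])
v2 = np.array([20.0, 10.0, 21.0])

i1, i2 = left_join_indices(k1, k2)
joined_keys = k1[i1]
# NaN stands in for the missing right-hand value of unmatched rows
joined_v2 = np.where(i2 >= 0, v2[i2], np.nan)
```

With the index arrays in hand, each output field is a single fancy-indexing operation, and i2 >= 0 can serve as the basis for a mask. The output rows come out in left-row order with duplicates from the right in their original order, which is the same ordering the dict-based loop produces.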

answered Nov 12 at 12:07 by tel

edited Nov 12 at 12:52


