NumPy: how to left join arrays with duplicates























To use Cython, I need to convert df1.merge(df2, how='left') (using Pandas) to plain NumPy, but I found that numpy.lib.recfunctions.join_by(key, r1, r2, jointype='leftouter') doesn't support duplicates along key. Is there any way to solve this?






























  • 2




    The basic idea in most recfunctions is to define a new dtype, create the appropriate 'empty' array, and copy values by field name. It's all readable python; no hidden compiled code. If existing functions don't do the job (they aren't heavily used or tested), write your own.
    – hpaulj
    Nov 12 at 8:03
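The recipe in the comment above can be sketched roughly as follows. This is only an illustration of the three steps (new dtype, empty array, copy by field name); the arrays r1 and r2 and their field names are made-up example data, and rows are paired by position rather than by key.

```python
import numpy as np

# Hypothetical example data: two structured arrays sharing an "id" field.
r1 = np.array([(1, 10.0), (2, 20.0)], dtype=[("id", "i8"), ("x", "f8")])
r2 = np.array([(1, 7), (2, 8)], dtype=[("id", "i8"), ("y", "i4")])

# 1. Define the new dtype: every field of r1, plus the fields of r2
#    that r1 does not already have.
extra = [d for d in r2.dtype.descr if d[0] not in r1.dtype.names]
merged_dtype = np.dtype(r1.dtype.descr + extra)

# 2. Create the output array (zeros here, so no field holds garbage).
out = np.zeros(len(r1), dtype=merged_dtype)

# 3. Copy values by field name.
for name in r1.dtype.names:
    out[name] = r1[name]
for name in [d[0] for d in extra]:
    out[name] = r2[name]

print(out)
```

A real join would replace the positional copy in step 3 with index arrays that say which row of each input feeds each output row.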
























python pandas numpy cython














edited Nov 12 at 8:03









betontalpfa

8351023














asked Nov 12 at 7:58









Naive

80211


























1 Answer




















accepted










Here's a stab at a pure numpy left join that can handle duplicate keys:



    import numpy as np

    def join_by_left(key, r1, r2, mask=True):
        # figure out the dtype of the result array
        descr1 = r1.dtype.descr
        descr2 = [d for d in r2.dtype.descr if d[0] not in r1.dtype.names]
        descrm = descr1 + descr2

        # figure out the fields we'll need from each array
        f1 = [d[0] for d in descr1]
        f2 = [d[0] for d in descr2]

        # cache the number of columns in f1
        ncol1 = len(f1)

        # get a dict of the rows of r2 grouped by key
        rows2 = {}
        for row2 in r2:
            rows2.setdefault(row2[key], []).append(row2)

        # figure out how many rows will be in the result
        nrowm = 0
        for k1 in r1[key]:
            if k1 in rows2:
                nrowm += len(rows2[k1])
            else:
                nrowm += 1

        # allocate the return array
        _ret = np.recarray(nrowm, dtype=descrm)
        if mask:
            ret = np.ma.array(_ret, mask=True)
        else:
            ret = _ret

        # merge the data into the return array
        i = 0
        for row1 in r1:
            if row1[key] in rows2:
                # one output row per matching row of r2
                for row2 in rows2[row1[key]]:
                    ret[i] = tuple(row1[f1]) + tuple(row2[f2])
                    i += 1
            else:
                # no match: copy only the r1 fields
                for j in range(ncol1):
                    ret[i][j] = row1[j]
                i += 1

        return ret


Basically, it uses a plain dict to do the actual join operation. Like numpy.lib.recfunctions.join_by, this function returns a masked array by default. When a key from the left array is missing from the right array, the right-hand fields of that row are masked out in the result. If you would prefer a plain record array instead, pass mask=False when calling join_by_left; note that np.recarray does not zero-initialize its buffer, so the missing fields then hold arbitrary values unless you fill them yourself.
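To make the dict-based idea concrete on its own, here is a minimal sketch that works on plain 1-D key arrays instead of record arrays. It is not the function above: left_join_indices, left_keys, right_keys and right_vals are hypothetical names invented for this example, and -1 is used as an assumed marker for "no match".

```python
import numpy as np

def left_join_indices(left_keys, right_keys):
    """Return (left_idx, right_idx) row pairs for a left join.

    right_idx is -1 where a left key has no match on the right.
    """
    # Group right-hand row positions by key value.
    groups = {}
    for pos, k in enumerate(right_keys.tolist()):
        groups.setdefault(k, []).append(pos)

    left_idx, right_idx = [], []
    for pos, k in enumerate(left_keys.tolist()):
        # Every match produces one output row; no match produces one row too.
        matches = groups.get(k, [-1])
        left_idx.extend([pos] * len(matches))
        right_idx.extend(matches)
    return (np.array(left_idx, dtype=np.intp),
            np.array(right_idx, dtype=np.intp))

# Hypothetical example data with duplicate keys on both sides.
left_keys = np.array([1, 2, 2, 3])
right_keys = np.array([2, 2, 1])
right_vals = np.array([0.5, 0.6, 0.7])

li, ri = left_join_indices(left_keys, right_keys)

# Gather the right-hand values and mask the rows that had no match.
joined_vals = np.ma.array(right_vals[ri], mask=(ri == -1))
print(li, ri, joined_vals)
```

Returning index arrays rather than copied rows keeps the Python loop small; the actual data movement is then done by NumPy fancy indexing, and the same indices can be reused for every field.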





































        edited Nov 12 at 12:52

























        answered Nov 12 at 12:07









        tel

        4,62911429

































