More efficient vector comparison

I am trying to compare two different vectors to catch any duplicates. One vector is 5 million elements of 10 numbers each, and the other is 2.8 million elements of 10 numbers each.
My operating system is Ubuntu 18.04 and I am using Qt Creator. I am getting a hang-up when I try to compare these large vectors. Here is what I have tried:



vector<vector<int> >::iterator v1;
vector<vector<int> >::iterator v2;

for(v1 = vector1.begin(); v1 != vector1.end(); v1++)
{
    for(v2 = vector2.begin(); v2 != vector2.end(); v2++)
    {
        if(*v1 == *v2)
        {
            vector1.erase(v1);
        }
    }
}


When I try to run this and debug, Qt hangs. I am also wondering if I need to change the erase to look something like:



vector1.erase(v1.begin(), v1.end());


Any suggestions as to a "better" way of doing this would be helpful. I know these are some big vectors, having more than 2.5 million elements of 10 numbers each.



Thanks in advance



Idzireit



Edit: Still tackling the problem. Right now I am trying a derivative of Mark Ransom's solution. Here is what I have so far:



#include "includes.h"

bool vec_less(vector<int> &v1, vector<int> &v2)
{
    for(int i = 0; i < 10; i++)
    {
        if(v1[i] == v2[i])
        {
            i++;
        }
        if(v1[i] < v2[i])
            return true;
        else
            return false;
    }
    return v1.size() < v2.size();
}

void dupfilter(vector<vector<int> > &aaperms, vector<vector<int> > &perms)
{
    vector<vector<int> >::iterator v1 = aaperms.begin();
    vector<vector<int> >::iterator v2 = perms.begin();

    while(v1 != aaperms.end() && v2 != perms.end())
    {
        if(*v1 == *v2)
        {
            aaperms.erase(v1);
            ++v1;
            ++v2;
        }

        if(vec_less(*v1, *v2) == true)
            ++v1;
        else
            ++v2;
    }

    return;
}


I only needed to sort one of the vectors; the other was sorted as it was made.
The problem I am having with the appended code is that now it is not finding the duplicates. It does go through each of the vectors once, but for some reason it is not finding the duplicates. I know there are some, because a previous attempt at sorting them out found them, though I was running into a serious SIGSEGV fault.



I have been trying to wrap my head around auto and unique and just can't quite get the examples and my code (methods?) to coincide.
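
For reference, a minimal sketch of the sort-plus-unique idiom mentioned above, which removes duplicates within a single vector (std::unique only collapses adjacent equal rows, hence the sort first; vector1 is the name from the question):

std::sort(vector1.begin(), vector1.end());
vector1.erase(std::unique(vector1.begin(), vector1.end()), vector1.end());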



Idzireit

c++ vector

asked Nov 13 '18 at 22:17, edited Nov 16 '18 at 13:30 – idzireit

  • I am not familiar with the language, but it looks like you are fundamentally using a polynomial-time (quartic?) algorithm here (I'm guessing .erase is worst-case linear). Probably, if you want to remove duplicates, you should use something like a hash set or an equivalent data structure.

    – juanpa.arrivillaga
    Nov 13 '18 at 22:45






  • @MooingDuck Misleading text, but look at the type: it's a vector of vectors, so the "10 numbers" means a "vector of 10 integers."

    – jwimberley
    Nov 13 '18 at 22:50






  • Adding to @juanpa.arrivillaga's suggestion: the equivalent structure in C++ is std::unordered_set. If you made a std::unordered_set<std::vector<int>> out of v2, then checking whether each *v1 corresponds to an element of v2 could be reduced to O(1) time. This might require defining an appropriate hash function for std::vector<int>. You could also use std::set<std::vector<int>>, which does not require a hash function but looks up through binary search (using the lexicographical ordering on std::vector<int>); this is O(log(length of v2)) asymptotically, but can be better in some cases. (A sketch of the std::set variant follows below.)

    – jwimberley
    Nov 13 '18 at 22:55
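
    A minimal sketch of the std::set variant from the comment above, assuming the vector1/vector2 names from the question (no hash function is needed, because std::vector<int> already provides lexicographical operator<):

    #include <algorithm>
    #include <set>
    #include <vector>

    // Build a search structure over vector2: O(m log m) construction.
    std::set<std::vector<int>> lookup(vector2.begin(), vector2.end());

    // Drop every row of vector1 that also appears in vector2.
    vector1.erase(
        std::remove_if(vector1.begin(), vector1.end(),
            [&lookup](const std::vector<int>& row) {
                return lookup.count(row) != 0;   // O(log m) per row
            }),
        vector1.end());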








  • Do you need to keep the elements in the same order? If not, you could copy the last element to position v1 and erase the last element (sketched below).

    – Damien
    Nov 13 '18 at 23:02
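
    A minimal sketch of that trick, assuming the order of vector1 does not matter (names from the question):

    // Inside the inner loop, replace vector1.erase(v1) with an O(1) removal:
    *v1 = std::move(vector1.back());   // overwrite the match with the last row
    vector1.pop_back();                // drop the now-redundant last row
    // Do not advance v1 afterwards: the row moved into *v1 is still unchecked.
    // (Watch the edge case where v1 already points at the last element.)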






  • It would help if you were allowed to sort both vectors. Then you could use std::set_difference (sketched below).

    – Bob__
    Nov 13 '18 at 23:25
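
    A minimal sketch of that approach, assuming both vectors may be reordered (names from the question):

    #include <algorithm>
    #include <iterator>
    #include <vector>

    std::sort(vector1.begin(), vector1.end());   // lexicographical operator<
    std::sort(vector2.begin(), vector2.end());

    // Keep only the rows of vector1 that do not appear in vector2.
    std::vector<std::vector<int>> result;
    std::set_difference(vector1.begin(), vector1.end(),
                        vector2.begin(), vector2.end(),
                        std::back_inserter(result));
    vector1.swap(result);
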

2 Answers

The algorithm you've selected is O(n²), which means for large data sets it will take a very long time. It's easy to see why you thought it was hung.



If you don't care about ordering, you can sort both vectors to convert this from an O(n²) problem to O(n log n). Once they're sorted you walk through each vector simultaneously, incrementing an index depending on which one is less than the other.



If you can't fit the entire data set into memory at once, you can even use this method by reading from sorted files.



bool vec_less(const vector<int>& v1, const vector<int>& v2)
{
    for (size_t i = 0; i < v1.size() && i < v2.size(); i++)
    {
        if (v1[i] < v2[i])
            return true;
        if (v2[i] < v1[i])
            return false;
    }
    return v1.size() < v2.size();
}

std::sort(vector1.begin(), vector1.end(), vec_less);
std::sort(vector2.begin(), vector2.end(), vec_less);
vector<vector<int> >::iterator v1 = vector1.begin();
vector<vector<int> >::iterator v1out = v1;
vector<vector<int> >::iterator v2 = vector2.begin();

while (v1 != vector1.end())
{
    if (v2 == vector2.end() || vec_less(*v1, *v2))
    {
        if (v1out != v1)
            *v1out = *v1;
        ++v1;
        ++v1out;
    }
    else if (vec_less(*v2, *v1))
        ++v2;
    else // equal
    {
        ++v1;
        ++v2;
    }
}
vector1.resize(v1out - vector1.begin());

answered Nov 13 '18 at 23:43, edited Nov 16 '18 at 14:33 – Mark Ransom

  • In the bool function, should you check to see if the first elements between v1 and v2 are equal? Thinking that might return false because the next elements might be true. Example: v1 = 1 2 3 4 .. and v2 = 1 2 4 ......., which would make v1 less than v2, but because the first element is equal and not less than, it would return false.

    – idzireit
    Nov 16 '18 at 12:40








  • std::vector already has a < that does the same as vec_less. The second part of your algorithm is approximately std::set_difference.

    – Caleth
    Nov 16 '18 at 13:48











  • @Caleth well now that makes two things I didn't know C++ included.

    – Mark Ransom
    Nov 16 '18 at 14:19











  • @idzireit you're absolutely right, that was a big thing to overlook. I'll fix it.

    – Mark Ransom
    Nov 16 '18 at 14:32

There are three problems with your solution.


  1. Your code has undefined behavior: when you erase an item, the iterator becomes invalid. (The correct erase idiom is sketched after this list.)


  2. Your code has large complexity: O(n³).


  3. Removing an item from the middle of a vector has linear complexity, so for large vectors it should be avoided. (This is why I corrected point 2 from O(n²) to O(n³).)
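
A minimal sketch of the valid single-pass erase idiom referred to in point 1 (shouldRemove is a hypothetical predicate):

for (auto it = vec.begin(); it != vec.end(); )
{
    if (shouldRemove(*it))
        it = vec.erase(it);   // erase() returns the next valid iterator
    else
        ++it;                 // advance only when nothing was erased
}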



The code below has O(n) average time complexity, and using STL algorithms is usually the best choice:



#include <algorithm>
#include <unordered_set>
#include <vector>

using Vec = std::vector<std::vector<int>>;

// Note: std::hash has no specialization for std::vector<int>, so the
// unordered_set needs a user-supplied hasher to compile.
struct RowHash {
    std::size_t operator()(const std::vector<int> &v) const {
        std::size_t seed = v.size();
        for (int x : v)
            seed ^= std::hash<int>{}(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

void removeItems(Vec& from, const Vec& itemsToRemove)
{
    // O(m) construction, O(1) average lookup per row.
    const std::unordered_set<Vec::value_type, RowHash> items {
        itemsToRemove.begin(),
        itemsToRemove.end()
    };

    // Keep the non-matching rows, then trim the tail: O(n) overall.
    auto it = std::remove_if(from.begin(), from.end(),
        [&items](const auto &x){
            return items.count(x) != 0;
        });
    from.erase(it, from.end());
}


You can consider replacing the internal std::vector with std::array since, as you describe, it has constant size; this will reduce memory fragmentation (which should provide an extra boost).



using Vec = std::vector<std::array<int, 10>>;
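
A usage note: std::array<int, 10> stores its ten ints inline, so the outer vector becomes a single contiguous allocation instead of one heap block per row. std::array keeps == and the lexicographical <, so the sort- and set-based approaches above still apply; only the hasher's parameter type would need to change to match.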

answered Nov 14 '18 at 0:01, edited Nov 16 '18 at 13:32 – Marek R

  • What happens if the input vector from has duplicate elements? We have little information on the scenario targeted by the OP.

    – Damien
    Nov 14 '18 at 6:00











  • Is the order of elements maintained? Sorry for this naive question. You may indicate it in your answer. We still don't know if it is a constraint.

    – Damien
    Nov 14 '18 at 6:03











  • The code provided does exactly the same thing the original was supposed to do. Use of unordered_set makes it linear in time. When big data sets are processed, as described by the OP, it will be very fast.

    – Marek R
    Nov 14 '18 at 9:30











  • Thank you for your answer. Very interesting. I would appreciate it if the OP tests it and gives feedback on the performance.

    – Damien
    Nov 14 '18 at 9:36













  • Now I noticed another issue with the OP's code and mine, so the update should be even faster (if order is not important).

    – Marek R
    Nov 14 '18 at 9:50