More efficient vector comparison

I am trying to compare two different vectors to catch any duplicates. One vector is 5 million elements of 10 numbers each, and the other is 2.8 million elements of 10 numbers each.
My operating system is Ubuntu 18.04 and I am using Qt Creator. I am getting a hang-up when I try to compare these large vectors. Here is what I have tried:



vector<vector<int> >::iterator v1;
vector<vector<int> >::iterator v2;

for(v1 = vector1.begin(); v1 != vector1.end(); v1++)
{
    for(v2 = vector2.begin(); v2 != vector2.end(); v2++)
    {
        if(*v1 == *v2)
        {
            vector1.erase(v1);
        }
    }
}


When I try to run this and debug, Qt hangs. I am also wondering if I need to change the erase to look something like:



vector1.erase(v1.begin(), v1.end());


Any suggestions as to a "better" way of doing this would be helpful. I know these are some big vectors, having more than 2.5 million elements of 10 numbers each.



Thanks in advance



Idzireit



Edit: Still tackling the problem. Right now I am trying a derivative of Mark Ransom's solution. Here is what I have so far:



#include "includes.h"

bool vec_less(vector<int> &v1, vector<int> &v2)
{
    for(int i = 0; i < 10; i++)
    {
        if(v1[i] == v2[i])
        {
            i++;
        }
        if(v1[i] < v2[i])
            return true;
        else
            return false;
    }
    return v1.size() < v2.size();
}

void dupfilter(vector<vector<int> > &aaperms, vector<vector<int> > &perms)
{
    vector<vector<int> >::iterator v1 = aaperms.begin();
    vector<vector<int> >::iterator v2 = perms.begin();

    while(v1 != aaperms.end() && v2 != perms.end())
    {
        if(*v1 == *v2)
        {
            aaperms.erase(v1);
            ++v1;
            ++v2;
        }

        if(vec_less(*v1, *v2) == true)
            ++v1;
        else
            ++v2;
    }

    return;
}


I only needed to sort one of the vectors; the other was sorted as it was made.
The problem I am having with the appended code is that now it is not finding the duplicates. It does go through each of the vectors once, but for some reason it is not finding the duplicates. I know there are some, because a previous attempt at sorting them out found them, though I was running into a serious SIGSEGV fault.



I have been trying to wrap my head around auto and unique and just can't quite get the examples and my code (methods?) to coincide.
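
For reference, a minimal sketch of the sort-plus-unique idiom mentioned above, which removes duplicates within a single vector (std::unique only collapses adjacent equal rows, hence the sort first; vector1 is the name from the question):

std::sort(vector1.begin(), vector1.end());
vector1.erase(std::unique(vector1.begin(), vector1.end()), vector1.end());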



Idzireit

c++ vector

asked Nov 13 '18 at 22:17, edited Nov 16 '18 at 13:30 – idzireit

  • I am not familiar with the language, but it looks like you are fundamentally using a polynomial-time (quartic?) algorithm here (I'm guessing .erase is worst-case linear). Probably, if you want to remove duplicates, you should use something like a hash set or an equivalent data structure.

    – juanpa.arrivillaga
    Nov 13 '18 at 22:45






  • @MooingDuck Misleading text, but look at the type: it's a vector of vectors, so the "10 numbers" means a "vector of 10 integers."

    – jwimberley
    Nov 13 '18 at 22:50






  • Adding to @juanpa.arrivillaga's suggestion: the equivalent structure in C++ is std::unordered_set. If you made a std::unordered_set<std::vector<int>> out of v2, then checking whether each *v1 corresponds to an element of v2 could be reduced to O(1) time. This might require defining an appropriate hash function for std::vector<int>. You could also use std::set<std::vector<int>>, which does not require a hash function but looks up through binary search (using the lexicographical ordering on std::vector<int>); this is O(log(length of v2)) asymptotically, but can be better in some cases. (A sketch of the std::set variant follows below.)

    – jwimberley
    Nov 13 '18 at 22:55
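
    A minimal sketch of the std::set variant from the comment above, assuming the vector1/vector2 names from the question (no hash function is needed, because std::vector<int> already provides lexicographical operator<):

    #include <algorithm>
    #include <set>
    #include <vector>

    // Build a search structure over vector2: O(m log m) construction.
    std::set<std::vector<int>> lookup(vector2.begin(), vector2.end());

    // Drop every row of vector1 that also appears in vector2.
    vector1.erase(
        std::remove_if(vector1.begin(), vector1.end(),
            [&lookup](const std::vector<int>& row) {
                return lookup.count(row) != 0;   // O(log m) per row
            }),
        vector1.end());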








  • Do you need to keep the elements in the same order? If not, you could copy the last element to position v1 and erase the last element (sketched below).

    – Damien
    Nov 13 '18 at 23:02
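
    A minimal sketch of that trick, assuming the order of vector1 does not matter (names from the question):

    // Inside the inner loop, replace vector1.erase(v1) with an O(1) removal:
    *v1 = std::move(vector1.back());   // overwrite the match with the last row
    vector1.pop_back();                // drop the now-redundant last row
    // Do not advance v1 afterwards: the row moved into *v1 is still unchecked.
    // (Watch the edge case where v1 already points at the last element.)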






  • It would help if you were allowed to sort both vectors. Then you could use std::set_difference (sketched below).

    – Bob__
    Nov 13 '18 at 23:25
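
    A minimal sketch of that approach, assuming both vectors may be reordered (names from the question):

    #include <algorithm>
    #include <iterator>
    #include <vector>

    std::sort(vector1.begin(), vector1.end());   // lexicographical operator<
    std::sort(vector2.begin(), vector2.end());

    // Keep only the rows of vector1 that do not appear in vector2.
    std::vector<std::vector<int>> result;
    std::set_difference(vector1.begin(), vector1.end(),
                        vector2.begin(), vector2.end(),
                        std::back_inserter(result));
    vector1.swap(result);
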

2 Answers

The algorithm you've selected is O(n²), which means for large data sets it will take a very long time. It's easy to see why you thought it was hung.



If you don't care about ordering, you can sort both vectors to convert this from an O(n²) problem to O(n log n). Once they're sorted you walk through each vector simultaneously, incrementing an index depending on which one is less than the other.



If you can't fit the entire data set into memory at once, you can even use this method by reading from sorted files.



bool vec_less(const vector<int>& v1, const vector<int>& v2)
{
    for (size_t i = 0; i < v1.size() && i < v2.size(); i++)
    {
        if (v1[i] < v2[i])
            return true;
        if (v2[i] < v1[i])
            return false;
    }
    return v1.size() < v2.size();
}

std::sort(vector1.begin(), vector1.end(), vec_less);
std::sort(vector2.begin(), vector2.end(), vec_less);
vector<vector<int> >::iterator v1 = vector1.begin();
vector<vector<int> >::iterator v1out = v1;
vector<vector<int> >::iterator v2 = vector2.begin();

while (v1 != vector1.end())
{
    if (v2 == vector2.end() || vec_less(*v1, *v2))
    {
        if (v1out != v1)
            *v1out = *v1;
        ++v1;
        ++v1out;
    }
    else if (vec_less(*v2, *v1))
        ++v2;
    else // equal
    {
        ++v1;
        ++v2;
    }
}
vector1.resize(v1out - vector1.begin());

answered Nov 13 '18 at 23:43, edited Nov 16 '18 at 14:33 – Mark Ransom

  • In the bool function, should you check to see if the first elements between v1 and v2 are equal? Thinking that might return false because the next elements might be true. Example: v1 = 1 2 3 4 .. and v2 = 1 2 4 ......., which would make v1 less than v2, but because the first element is equal and not less than, it would return false.

    – idzireit
    Nov 16 '18 at 12:40








  • std::vector already has a < that does the same as vec_less. The second part of your algorithm is approximately std::set_difference.

    – Caleth
    Nov 16 '18 at 13:48











  • @Caleth well now that makes two things I didn't know C++ included.

    – Mark Ransom
    Nov 16 '18 at 14:19











  • @idzireit you're absolutely right, that was a big thing to overlook. I'll fix it.

    – Mark Ransom
    Nov 16 '18 at 14:32

There are three problems with your solution.


  1. Your code has undefined behavior: when you erase an item, the iterator becomes invalid. (The correct erase idiom is sketched after this list.)


  2. Your code has large complexity: O(n³).


  3. Removing an item from the middle of a vector has linear complexity, so for large vectors it should be avoided. (This is why I corrected point 2 from O(n²) to O(n³).)
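
A minimal sketch of the valid single-pass erase idiom referred to in point 1 (shouldRemove is a hypothetical predicate):

for (auto it = vec.begin(); it != vec.end(); )
{
    if (shouldRemove(*it))
        it = vec.erase(it);   // erase() returns the next valid iterator
    else
        ++it;                 // advance only when nothing was erased
}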



The code below has O(n) average time complexity, and using STL algorithms is usually the best choice:



#include <algorithm>
#include <unordered_set>
#include <vector>

using Vec = std::vector<std::vector<int>>;

// Note: std::hash has no specialization for std::vector<int>, so the
// unordered_set needs a user-supplied hasher to compile.
struct RowHash {
    std::size_t operator()(const std::vector<int> &v) const {
        std::size_t seed = v.size();
        for (int x : v)
            seed ^= std::hash<int>{}(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

void removeItems(Vec& from, const Vec& itemsToRemove)
{
    // O(m) construction, O(1) average lookup per row.
    const std::unordered_set<Vec::value_type, RowHash> items {
        itemsToRemove.begin(),
        itemsToRemove.end()
    };

    // Keep the non-matching rows, then trim the tail: O(n) overall.
    auto it = std::remove_if(from.begin(), from.end(),
        [&items](const auto &x){
            return items.count(x) != 0;
        });
    from.erase(it, from.end());
}


You can consider replacing the internal std::vector with std::array since, as you describe, it has constant size; this will reduce memory fragmentation (which should provide an extra boost).



using Vec = std::vector<std::array<int, 10>>;
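
A usage note: std::array<int, 10> stores its ten ints inline, so the outer vector becomes a single contiguous allocation instead of one heap block per row. std::array keeps == and the lexicographical <, so the sort- and set-based approaches above still apply; only the hasher's parameter type would need to change to match.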

answered Nov 14 '18 at 0:01, edited Nov 16 '18 at 13:32 – Marek R

  • What happens if the input vector from has duplicate elements? We have little information on the scenario targeted by the OP.

    – Damien
    Nov 14 '18 at 6:00











  • Is the order of elements maintained? Sorry for this naive question. You may indicate it in your answer. We still don't know if it is a constraint.

    – Damien
    Nov 14 '18 at 6:03











  • The code provided does exactly the same thing the original was supposed to do. Use of unordered_set makes it linear in time. When big data sets are processed, as described by the OP, it will be very fast.

    – Marek R
    Nov 14 '18 at 9:30











  • Thank you for your answer. Very interesting. I would appreciate it if the OP tests it and gives feedback on the performance.

    – Damien
    Nov 14 '18 at 9:36













  • Now I noticed another issue with the OP's code and mine, so the update should be even faster (if order is not important).

    – Marek R
    Nov 14 '18 at 9:50