More efficient vector comparison
I am trying to compare 2 different vectors to catch any duplicates. One vector has 5 million elements of 10 numbers each, and the other has 2.8 million.
My operating system is Ubuntu 18.04 and I am using Qt Creator. I am getting a hangup when I try to compare these large vectors. Here is what I have tried:
vector<vector<int> >::iterator v1;
vector<vector<int> >::iterator v2;
for(v1 = vector1.begin(); v1 != vector1.end(); v1++)
{
    for(v2 = vector2.begin(); v2 != vector2.end(); v2++)
    {
        if(*v1 == *v2)
        {
            vector1.erase(v1);
        }
    }
}
When I try to run and debug this, Qt hangs. I am also wondering if I need to change the erase to look something like:
vector1.erase(v1.begin(), v1.end());
Any suggestions as to a "better" way of doing this would be helpful. I know these are some big vectors, having more than two and a half million elements of 10 numbers each.
Thanks in advance,
Idzireit
Update: Still tackling the problem. Right now I am trying a derivative of Mark Ransom's solution. Here is what I have so far:
#include "includes.h"

bool vec_less(vector<int> &v1, vector<int> &v2)
{
    for(int i = 0; i < 10; i++)
    {
        if(v1[i] == v2[i])
        {
            i++;
        }
        if(v1[i] < v2[i])
            return true;
        else
            return false;
    }
    return v1.size() < v2.size();
}
void dupfilter(vector<vector<int> > &aaperms, vector<vector<int> > &perms)
{
    vector<vector<int> >::iterator v1 = aaperms.begin();
    vector<vector<int> >::iterator v2 = perms.begin();
    while(v1 != aaperms.end() && v2 != perms.end())
    {
        if(*v1 == *v2)
        {
            aaperms.erase(v1);
            ++v1;
            ++v2;
        }
        if(vec_less(*v1, *v2) == true)
            ++v1;
        else
            ++v2;
    }
    return;
}
I only needed to sort one of the vectors; the other was sorted as it was made.
The problem I am having with the appended code is that it is now not finding the duplicates. It does go through each of the vectors once, but for some reason it is not finding the duplicates. I know there are some, because a previous attempt at sorting them out found them, though I was running into a serious segfault (SIGSEGV).
I have been trying to wrap my head around auto and unique and just can't quite get the examples and my code to coincide.
Idzireit
c++ vector
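A note on std::unique, since the update above mentions it (added sketch, not from the original post): std::unique only collapses adjacent equal elements, so it deduplicates a single sorted vector rather than removing the overlap between two. A minimal sketch with made-up data:

#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
    std::vector<std::vector<int>> perms = { {1, 2, 3}, {0, 9, 9}, {1, 2, 3} };

    std::sort(perms.begin(), perms.end());               // lexicographic operator<
    perms.erase(std::unique(perms.begin(), perms.end()), // classic erase-unique idiom
                perms.end());

    std::cout << perms.size() << '\n';                   // prints 2
}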
Comments:

I am not familiar with the language, but this looks like you fundamentally are using a polynomial-time (quartic?) algorithm here (I'm guessing .erase is worst-case linear). Probably, if you want to remove duplicates, you should use something like a hash set or whatever the equivalent data structure is. – juanpa.arrivillaga, Nov 13 '18 at 22:45

@MooingDuck Misleading text, but look at the type: it's a vector of vectors, so the "10 numbers" means a "vector of 10 integers." – jwimberley, Nov 13 '18 at 22:50

Adding to @juanpa.arrivillaga's suggestion: the equivalent structure in C++ is std::unordered_set. If you made an std::unordered_set<std::vector<int>> out of v2, then checking whether each *v1 corresponded to an element of v2 could be reduced to O(1) time. This might require defining an appropriate hash function for std::vector<int>. You could also use std::set<std::vector<int>>, which does not require a hash function but looks up through binary search (using the lexicographical ordering on std::vector<int>). This is O(log(length of v2)) asymptotically, but can be better in some cases. – jwimberley, Nov 13 '18 at 22:55

Do you need to keep the elements in the same order? If not, you could copy the last element to position v1 and erase the last element. – Damien, Nov 13 '18 at 23:02

It would help if you were allowed to sort both vectors. Then you could use std::set_difference. – Bob__, Nov 13 '18 at 23:25 (sketch below)
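A minimal sketch of the approach Bob__ suggests (added for illustration, not from the original thread), assuming both vectors may be sorted and the final order does not matter; the function name is made up. std::vector's built-in operator< is already lexicographic, so no custom comparator is needed:

#include <algorithm>
#include <iterator>
#include <vector>

// Keep only the rows of vector1 that do not also appear in vector2.
std::vector<std::vector<int>> filter_duplicates(
    std::vector<std::vector<int>> vector1,  // taken by value: sorted locally
    std::vector<std::vector<int>> vector2)
{
    std::sort(vector1.begin(), vector1.end());
    std::sort(vector2.begin(), vector2.end());

    std::vector<std::vector<int>> result;
    result.reserve(vector1.size());
    std::set_difference(vector1.begin(), vector1.end(),
                        vector2.begin(), vector2.end(),
                        std::back_inserter(result));
    return result;  // O(n log n) overall, dominated by the two sorts
}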
asked Nov 13 '18 at 22:17, edited Nov 16 '18 at 13:30 – idzireit
2 Answers
The algorithm you've selected is O(n²), which means for large data sets it will take a very long time. It's easy to see why you thought it was hung.
If you don't care about ordering, you can sort both vectors to convert this from an O(n²) problem to O(n log n). Once they're sorted you walk through each vector simultaneously, incrementing an index depending on which one is less than the other.
If you can't fit the entire data set into memory at once, you can even use this method by reading from sorted files.
bool vec_less(const vector<int>& v1, const vector<int>& v2)
{
    for (int i = 0; i < v1.size() && i < v2.size(); i++)
    {
        if (v1[i] < v2[i])
            return true;
        if (v2[i] < v1[i])
            return false;
    }
    return v1.size() < v2.size();
}

std::sort(vector1.begin(), vector1.end(), vec_less);
std::sort(vector2.begin(), vector2.end(), vec_less);
vector<vector<int> >::iterator v1 = vector1.begin();
vector<vector<int> >::iterator v1out = v1;
vector<vector<int> >::iterator v2 = vector2.begin();
while (v1 != vector1.end())
{
    if (v2 == vector2.end() || vec_less(*v1, *v2))
    {
        if (v1out != v1)
            *v1out = *v1;
        ++v1;
        ++v1out;
    }
    else if (vec_less(*v2, *v1))
        ++v2;
    else // equal
    {
        ++v1;
        ++v2;
    }
}
vector1.resize(v1out - vector1.begin());
answered Nov 13 '18 at 23:43, edited Nov 16 '18 at 14:33 – Mark Ransom

Comments:

In the bool function, should you check to see if the first elements of v1 and v2 are equal? Thinking it might return false because the next elements might be true. Example: v1 = 1 2 3 4 ... and v2 = 1 2 4 ..., which would make v1 less than v2, but because the first element is equal and not less than, it would return false. – idzireit, Nov 16 '18 at 12:40

std::vector already has a < that does the same as vec_less. The second part of your algorithm is approximately std::set_difference. – Caleth, Nov 16 '18 at 13:48

@Caleth well now that makes two things I didn't know C++ included. – Mark Ransom, Nov 16 '18 at 14:19

@idzireit you're absolutely right, that was a big thing to overlook. I'll fix it. – Mark Ransom, Nov 16 '18 at 14:32
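A small self-contained check (added for illustration) of Caleth's point that std::vector's built-in operator< already performs this lexicographical comparison, including the case idzireit raised where the leading elements are equal:

#include <cassert>
#include <vector>

int main()
{
    std::vector<int> a{1, 2, 3, 4};
    std::vector<int> b{1, 2, 4, 0};

    assert(a < b);     // equal prefixes are skipped; 3 < 4 decides
    assert(!(b < a));

    std::vector<int> prefix{1, 2};
    assert(prefix < a); // a strict prefix compares less than the longer vector
}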
There are three problems with your solution:

1. Your code has undefined behavior: when you erase an item, the iterator becomes invalid.
2. Your code has large complexity, O(n³) (the nested loops are O(n²), and each erase adds another linear factor).
3. Removing an item from the middle of a vector has linear complexity, so for large vectors it should be avoided. This is why I've corrected point 2.

The code below has O(n) time complexity, and use of STL algorithms is usually the best choice:
#include <algorithm>
#include <unordered_set>
#include <vector>

using Vec = std::vector<std::vector<int>>;

// std::hash is not specialized for std::vector<int>, so a custom hash
// must be supplied for the unordered_set (added here to make this compile).
struct VecHash
{
    std::size_t operator()(const Vec::value_type& v) const
    {
        std::size_t seed = v.size();
        for (int x : v)  // boost-style hash combine
            seed ^= std::hash<int>{}(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

void removeItems(Vec& from, const Vec& itemsToRemove)
{
    // Build a hash set of the rows to remove: O(1) average lookup.
    const std::unordered_set<Vec::value_type, VecHash> items {
        itemsToRemove.begin(), itemsToRemove.end()
    };
    // Shift the surviving rows to the front, then chop off the tail.
    auto it = std::remove_if(from.begin(), from.end(),
        [&items](const auto& x) { return items.count(x) != 0; });
    from.erase(it, from.end());
}
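A minimal usage sketch of removeItems (added for illustration); it assumes the Vec alias and VecHash definitions above, and the row values are made up:

#include <iostream>

int main()
{
    Vec aaperms = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
    Vec perms   = { {4, 5, 6} };

    removeItems(aaperms, perms);          // drops the shared row {4, 5, 6}

    std::cout << aaperms.size() << '\n';  // prints 2
}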
You can consider replacing the internal std::vector with std::array, since as you describe it has a constant size, and this will reduce memory fragmentation (which should provide an extra boost):

using Vec = std::vector<std::array<int, 10>>;

(Note: std::hash is not specialized for std::array either, so the custom hash above is still needed, adapted to the array type.)
answered Nov 14 '18 at 0:01, edited Nov 16 '18 at 13:32 – Marek R

Comments:

What happens if the input vector from has duplicate elements? We have little information on the scenario targeted by the OP. – Damien, Nov 14 '18 at 6:00

Is the order of elements maintained? Sorry for this naive question. You may indicate it in your answer. We still don't know if it is a constraint. – Damien, Nov 14 '18 at 6:03

The code provided does exactly the same thing the original was supposed to do. Use of unordered_set makes it linear in time. When big data are processed, as the OP describes, it will be very fast. – Marek R, Nov 14 '18 at 9:30

Thank you for your answer. Very interesting. I would appreciate it if the OP tested it and gave feedback on the performance. – Damien, Nov 14 '18 at 9:36

Now I noticed another issue with the OP's code and mine, so the update should be even faster (if order is not important). – Marek R, Nov 14 '18 at 9:50