Find same values in two huge datasets
I have a list of roughly 2,000 rows of the form [UnixTimestamp, Value01, Value02] (it arrives as JSON), and another list of a few million rows in the same format (it arrives as a .csv). I want to find out whether each element of the smaller list has an element in the larger list with the same values.
Both lists are sorted by timestamp.
The simplest way is obviously something like this:
    def all_present(small_list, big_list):
        for x in small_list:
            if x not in big_list:  # linear scan of big_list for every x
                return False
        return True
But does that make sense, or is there a more efficient way?
Thanks
python-3.x algorithm list search bigdata
asked Nov 15 '18 at 15:02 by TimS
Is there any relation between UnixTimestamp, Value01, and Value02?
– vivek_23
Nov 15 '18 at 15:34
You could try giving a cut-down version of the two datasets to aid solution writing. In general, two lists of lists could become two sets of tuples, and then the set intersection could be computed.
– Paddy3118
Nov 15 '18 at 16:23
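A minimal sketch of the set-of-tuples idea from the comment above, assuming both datasets are already in memory as lists of [UnixTimestamp, Value01, Value02] rows whose values have matching types. The names `to_row_set` and `all_rows_present` are made up for illustration.

```python
def to_row_set(rows):
    """Turn a list of rows into a set of tuples (lists are not hashable)."""
    return {tuple(row) for row in rows}


def all_rows_present(small_list, big_list):
    """Return True if every row of small_list also occurs in big_list."""
    return to_row_set(small_list) <= to_row_set(big_list)


small_list = [[100, 1.0, 2.0], [200, 3.0, 4.0]]
big_list = [[50, 0.0, 0.0], [100, 1.0, 2.0], [150, 9.0, 9.0], [200, 3.0, 4.0]]

common = to_row_set(small_list) & to_row_set(big_list)  # rows present in both
print(all_rows_present(small_list, big_list))
```

The subset test `<=` answers the yes/no question directly, while `&` gives the matching rows themselves. Neither relies on the lists being sorted.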
2 Answers
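Both are already sorted by timestamp, so use that to your advantage. Rows are indexed as [UnixTimestamp, Value01, Value02], as in the question:
    def all_present(small_list, big_list):
        big_list_index = 0
        for x in small_list:
            # Skip rows of big_list whose timestamp is earlier than x's.
            while big_list_index < len(big_list) and big_list[big_list_index][0] < x[0]:
                big_list_index += 1
            # Scan the rows sharing x's timestamp without moving big_list_index,
            # so that later small_list rows with the same timestamp can rescan them.
            i = big_list_index
            while i < len(big_list) and big_list[i][0] == x[0] and big_list[i] != x:
                i += 1
            if i == len(big_list) or big_list[i][0] != x[0]:
                return False  # no row of big_list equals x
        return True
If timestamps are unique, complexity is O(len(big_list) + len(small_list)). With repeated timestamps, rows that share a timestamp may be scanned more than once.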
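An alternative to walking both lists in step is a binary search on the timestamp column, using `bisect_left` from the standard library. The sketch below is only an illustration, not part of the answer: the name `all_present_bisect` is hypothetical, and it assumes rows are sequences whose first element is the timestamp and that the big list is sorted on it.

```python
from bisect import bisect_left


def all_present_bisect(small_list, big_list):
    """Return True if every row of small_list has an equal row in big_list.

    Assumes big_list is sorted by its first column (the timestamp).
    """
    timestamps = [row[0] for row in big_list]  # one O(n) pass
    for x in small_list:
        # First position whose timestamp is >= x's timestamp.
        i = bisect_left(timestamps, x[0])
        # Walk over all rows that share this timestamp.
        while i < len(big_list) and timestamps[i] == x[0]:
            if list(big_list[i]) == list(x):
                break  # exact match found for x
            i += 1
        else:
            return False  # ran out of candidates without a match
    return True
```

Building `timestamps` is already linear in the size of the big list, so this does not beat the merge asymptotically. It mainly helps when the same big list is queried repeatedly, and it does not need the small list to be sorted.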
answered Nov 16 '18 at 16:28 by juvian
If they are just lists, you can try something like this:
    set(small_list) & set(big_list)
Converting to a set removes duplicate values, and the & operator returns the values that the two sets have in common.
answered Nov 15 '18 at 15:07 by Naveen
Great idea, thanks. The data doesn't come as a list yet, but I planned to format it that way for that reason.
– TimS
Nov 15 '18 at 15:14
The cast to a set doesn't work since, I guess, the list actually contains lists.
– TimS
Nov 15 '18 at 15:36
What is the time complexity of this?
– Surt
Nov 15 '18 at 22:56
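On the two comments above: `set()` fails on a list of lists because lists are unhashable, and converting each row to a tuple avoids that. Set insertion and lookup are O(1) on average, so the whole check is roughly O(len(small_list) + len(big_list)). The sketch below keeps only the small dataset in memory and streams the CSV. The function name `missing_rows` and the column types (integer timestamp, float values, no header line) are assumptions, not something stated in the question.

```python
import csv
import io
import json


def missing_rows(small_json_file, big_csv_file):
    """Return the rows of the small dataset that never occur in the CSV.

    Assumption: every CSV line holds exactly timestamp, value01, value02,
    with an integer timestamp and float values, and no header line.
    """
    # Only the ~2,000 small rows are kept in memory, as hashable tuples.
    wanted = {(int(ts), float(v1), float(v2))
              for ts, v1, v2 in json.load(small_json_file)}
    for ts, v1, v2 in csv.reader(big_csv_file):
        # csv.reader yields strings, so convert before comparing.
        wanted.discard((int(ts), float(v1), float(v2)))
        if not wanted:
            break  # every small row has been found; stop reading early
    return wanted


# Tiny in-memory stand-ins for the real files.
small = io.StringIO("[[100, 1.0, 2.0], [200, 3.0, 4.0]]")
big = io.StringIO("50,0.0,0.0\n100,1.0,2.0\n200,3.0,4.0\n300,5.0,6.0\n")
print(missing_rows(small, big))  # an empty set means every row was found
```

With real data the two arguments would be file objects from `open(...)`. The CSV file should be opened with `newline=""`, as the `csv` module documentation recommends.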