C - sorting big binary files with variable length elements
My problem is the following:
- I have a set of variable-length elements stored in a binary file that we'll call data.bin.
- Each element has a sorting key that we'll call key, a size size, and a file position in data.bin called pos. These three parameters are represented as a structure.
The problem is how to sort data.bin efficiently.
Right now I have defined a data structure:

typedef struct {
    int key;
    unsigned int size;
    long int pos;
} index;

and an array of these structures, indices, that contains one entry per element stored in data.bin. This array is then sorted by key with a quicksort, which is fast enough even for a very large number of entries (e.g. 10M); a minimal comparator sketch follows the copy loop below. I then walk the sorted indices to write the sorted data.bin file as sorted.bin. The core of my code is the following (error checking intentionally removed):
size_t mergeBuffIdx = 0;
char *mergeBuff = (char *) calloc(MERGE_BUFF_SIZE, sizeof(char));

for (unsigned int idx = 0; idx < numEntries; idx++) {
    unsigned int dataSize = indices[idx].size;
    // flush the merge buffer if the next element would not fit
    if ((mergeBuffIdx + dataSize) >= MERGE_BUFF_SIZE) {
        fwrite(mergeBuff, sizeof(unsigned char), mergeBuffIdx, sortedDataFile);
        mergeBuffIdx = 0;
    }
    // set the file pointer to this element's position in the data file
    fseek(dataFile, indices[idx].pos, SEEK_SET);
    // read the element from the file as an unsigned char array
    fread(&mergeBuff[mergeBuffIdx], sizeof(unsigned char), dataSize, dataFile);
    mergeBuffIdx += dataSize;
}

// write remaining data
if (mergeBuffIdx != 0) {
    fwrite(mergeBuff, sizeof(unsigned char), mergeBuffIdx, sortedDataFile);
}
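For reference, the key-based sort of indices mentioned above is just the standard library qsort with a comparator; here is a minimal, self-contained sketch (the comparator name and the tiny hand-made array are illustrative only, not my actual code, which builds indices by scanning data.bin):

#include <stdio.h>
#include <stdlib.h>

/* same structure as above */
typedef struct {
    int key;
    unsigned int size;
    long int pos;
} index;

/* qsort comparator: orders index entries by ascending key;
   uses comparisons instead of subtraction to avoid int overflow */
static int compareByKey(const void *a, const void *b)
{
    const index *ia = (const index *) a;
    const index *ib = (const index *) b;
    if (ia->key < ib->key) return -1;
    if (ia->key > ib->key) return 1;
    return 0;
}

int main(void)
{
    /* tiny illustrative array; the real indices array has ~10M entries */
    index indices[] = {
        { 42, 100, 0L },
        {  7, 250, 100L },
        { 19,  80, 350L },
    };
    size_t numEntries = sizeof(indices) / sizeof(indices[0]);

    qsort(indices, numEntries, sizeof(index), compareByKey);

    for (size_t i = 0; i < numEntries; i++)
        printf("key=%d size=%u pos=%ld\n",
               indices[i].key, indices[i].size, indices[i].pos);
    return 0;
}

The sort itself is not the bottleneck; the slow part is the copy loop above.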
This approach, while quite simple, quickly becomes very slow when data.bin is very large (my typical use case is 30 GB), the number of entries is around 10M, and two consecutive sorted entries can be very far apart in the original data.bin file. For a 1 GB data.bin this approach can take 30 minutes or more, even though I'm using an SSD.
Do you have any ideas, alternative solutions or approaches to this problem?
I have tried memory-mapped files, with similar or even worse performance. My guess is that the bottleneck is the fseek call, but I cannot figure out alternative, more performant approaches.
Thanks for reading,
S.
Tags: c, performance, sorting, io, binary-files
asked Nov 16 '18 at 22:59 by simalps
1 Answer
With your approach, you get only one element per read, so that's 10 million read requests to pull the file contents out in sorted-key order. At least the SSD eliminates the mechanical seek overhead of random access.

It should be faster to use an external merge sort to sort the data directly, rather than sorting the keys and then making a random-access pass. That lets you read and write perhaps a bit over 256 MB at a time. The initial pass reads 256 MB chunks, sorts the elements in each chunk, and writes the sorted chunks back out (128 of these would cover 32 GB). A 16-way merge (2048 reads, 16 MB each) followed by an 8-way merge (1024 reads, 32 MB per read) would then sort about 32 GB of data. For the 16- or 8-way merge you probably want some form of priority queue, such as a heap; a sketch follows below.

You don't mention whether the key/size info is in a separate file. If there's enough spare memory, you could just keep this information in memory during the external merge sort and write it back out when done.
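For illustration, here is a minimal sketch of the k-way merge step using a binary min-heap, assuming fixed-size records for simplicity (the real elements are variable length, so key and size would have to travel with each element). The record layout, helper names and file handling are illustrative only, and error checking is omitted:

#include <stdio.h>
#include <stdlib.h>

/* a fixed-size record for the sketch; real elements are variable length */
typedef struct {
    int  key;
    char payload[60];
} record;

/* one heap node: the current record of a run plus which run it came from */
typedef struct {
    record rec;
    int    run;                 /* index into the runs[] file array */
} heapNode;

/* sift the node at position i down a min-heap ordered by rec.key */
static void siftDown(heapNode *heap, size_t n, size_t i)
{
    for (;;) {
        size_t smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && heap[l].rec.key < heap[smallest].rec.key) smallest = l;
        if (r < n && heap[r].rec.key < heap[smallest].rec.key) smallest = r;
        if (smallest == i) return;
        heapNode tmp = heap[i]; heap[i] = heap[smallest]; heap[smallest] = tmp;
        i = smallest;
    }
}

/* k-way merge of pre-sorted run files into one sorted output file */
static void mergeRuns(FILE **runs, size_t k, FILE *out)
{
    heapNode *heap = malloc(k * sizeof(heapNode));
    size_t n = 0;

    /* prime the heap with the first record of every run */
    for (size_t i = 0; i < k; i++) {
        if (fread(&heap[n].rec, sizeof(record), 1, runs[i]) == 1) {
            heap[n].run = (int) i;
            n++;
        }
    }
    for (size_t i = n / 2; i-- > 0; )
        siftDown(heap, n, i);

    /* repeatedly emit the smallest record and refill from its run */
    while (n > 0) {
        fwrite(&heap[0].rec, sizeof(record), 1, out);
        int run = heap[0].run;
        if (fread(&heap[0].rec, sizeof(record), 1, runs[run]) != 1) {
            heap[0] = heap[n - 1];      /* run exhausted: shrink the heap */
            n--;
        }
        siftDown(heap, n, 0);
    }
    free(heap);
}

/* usage sketch: ./merge sorted.bin run0.bin run1.bin ... */
int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s output run1 [run2 ...]\n", argv[0]);
        return 1;
    }
    FILE *out = fopen(argv[1], "wb");
    size_t k = (size_t) (argc - 2);
    FILE **runs = malloc(k * sizeof(FILE *));
    for (size_t i = 0; i < k; i++)
        runs[i] = fopen(argv[i + 2], "rb");

    mergeRuns(runs, k, out);

    for (size_t i = 0; i < k; i++)
        fclose(runs[i]);
    fclose(out);
    free(runs);
    return 0;
}

The initial pass that produces the sorted run files is just: read a 256 MB chunk, sort it in memory (e.g. with the same quicksort you already use on the keys), and write it out sequentially.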
answered Nov 17 '18 at 3:35 by rcgldr, edited Nov 17 '18 at 4:04