C - sorting big binary files with variable length elements





My problem is the following:




  • I have a set of elements of variable length stored in a binary file that we'll call data.bin.

  • Each element has a sorting key that we'll call key, a size size, and a file position in data.bin, pos. These three parameters are represented by a structure.


The problem is how to sort data.bin efficiently.



Right now I have defined the following data structure:



typedef struct {
    int key;            // sorting key
    unsigned int size;  // element size in bytes
    long int pos;       // element offset in data.bin
} index;
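
For reference, the sort of this array described in the next paragraph can be done with the standard library qsort; a minimal sketch, assuming the index struct above, might look like this (compare_by_key and sort_indices are illustrative names, while indices and numEntries are the same names used in the code further below):

#include <stdlib.h>

// comparator on the sorting key; written this way to avoid the
// signed-overflow risk of returning "a->key - b->key"
static int compare_by_key(const void *a, const void *b) {
    const index *ia = a;
    const index *ib = b;
    return (ia->key > ib->key) - (ia->key < ib->key);
}

// sort the in-memory index array by key before the copy pass
static void sort_indices(index *indices, size_t numEntries) {
    qsort(indices, numEntries, sizeof *indices, compare_by_key);
}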


I also keep an array of these index structures, called indices, with one entry per element stored in data.bin. This array is sorted by key using a quicksort algorithm, which is fast enough even for a very large number of entries (e.g. 10M). I then use the sorted indices to write the reordered data out as sorted.bin. The core of that copy pass is the following (error checking intentionally removed):



size_t mergeBuffIdx = 0;
char *mergeBuff = (char *) calloc(MERGE_BUFF_SIZE, sizeof(char));

for (unsigned int idx = 0; idx < numEntries; idx++) {
    unsigned int dataSize = indices[idx].size;

    // flush the merge buffer to sorted.bin when the next element would not fit
    if ((mergeBuffIdx + dataSize) >= MERGE_BUFF_SIZE) {
        fwrite(mergeBuff, sizeof(unsigned char), mergeBuffIdx, sortedDataFile);
        mergeBuffIdx = 0;
    }

    // set the file pointer at the beginning of the element in data.bin
    fseek(dataFile, indices[idx].pos, SEEK_SET);

    // read the element from the file as an unsigned char array
    fread(&mergeBuff[mergeBuffIdx], sizeof(unsigned char), dataSize, dataFile);
    mergeBuffIdx += dataSize;
}

// write remaining data
if (mergeBuffIdx != 0) {
    fwrite(mergeBuff, sizeof(unsigned char), mergeBuffIdx, sortedDataFile);
}


This approach, though quite simple, quickly becomes very slow when data.bin is very large (my typical use case is 30 GB), the number of entries is around 10M, and two consecutive sorted entries can be very far apart in the original data.bin file. For a 1 GB data.bin file this approach can take around 30 minutes or more, even though I'm using an SSD.



Do you have any ideas, alternative solutions, or approaches to this problem?



I have tried using memory-mapped files, with similar or even worse performance. My guess is that the bottleneck is the fseek call, but I cannot figure out a faster alternative approach.
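
For completeness, a memory-mapped variant of the copy loop could look roughly like the POSIX sketch below; this is only an indicative reconstruction rather than the exact code I benchmarked, and copy_sorted_mmap is just an illustrative name. The access pattern is still one random read per element, only triggered by page faults instead of explicit fseek/fread calls, which matches the similar timings I observed.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// write the elements of dataPath to sortedDataFile in the order given by the
// already sorted indices array, reading the source through a read-only mapping
static void copy_sorted_mmap(const char *dataPath, FILE *sortedDataFile,
                             const index *indices, size_t numEntries) {
    int fd = open(dataPath, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // map the whole data.bin read-only; pages are faulted in on demand
    const char *base = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    for (size_t idx = 0; idx < numEntries; idx++) {
        fwrite(base + indices[idx].pos, 1, indices[idx].size, sortedDataFile);
    }

    munmap((void *) base, (size_t) st.st_size);
    close(fd);
}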



Thanks for reading,
S.

c performance sorting io binary-files






asked Nov 16 '18 at 22:59 by simalps

1 Answer

With your approach, you only get one element per read, so that's 10 million read commands to pull the data out in sorted-key order. At least the SSD eliminates the seek overhead of all that random access.



It should be faster to use an external merge sort to sort the data directly, rather than sorting keys and then making a random-access pass. This allows reads and writes of perhaps a bit over 256 MB at a time. The initial pass would read 256 MB chunks, sort the elements, then write the sorted 256 MB chunks (128 of these cover 32 GB). A 16-way merge (2048 reads of 16 MB each) followed by an 8-way merge (1024 reads of 32 MB each) would then sort about 32 GB of data. For the 16- or 8-way merge, you probably want to use some form of priority queue, such as a heap.
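
A minimal sketch of one such k-way merge pass with a binary min-heap might look like the code below. It is not the code from the question: it assumes each sorted run stores every element as a (key, size) header followed by size payload bytes (the question does not specify the on-disk layout), caps element size at 1 MiB, and omits error checking; heap_node, load_next and merge_runs are names chosen for illustration.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int key;            // sorting key of the element currently buffered for this run
    unsigned int size;  // payload size of that element
    FILE *run;          // run file this element came from
} heap_node;

// classic binary min-heap ordered on key
static void sift_down(heap_node *h, size_t n, size_t i) {
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < n && h[l].key < h[m].key) m = l;
        if (r < n && h[r].key < h[m].key) m = r;
        if (m == i) return;
        heap_node t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

// load the next (key, size) header from a run; returns 0 at end of run
static int load_next(FILE *run, heap_node *out) {
    if (fread(&out->key, sizeof out->key, 1, run) != 1) return 0;
    if (fread(&out->size, sizeof out->size, 1, run) != 1) return 0;
    out->run = run;
    return 1;
}

// merge k sorted runs into dst; error checking omitted as in the question
static void merge_runs(FILE **runs, size_t k, FILE *dst) {
    heap_node *heap = malloc(k * sizeof *heap);
    size_t n = 0;

    // prime the heap with one buffered element per non-empty run, then heapify
    for (size_t i = 0; i < k; i++)
        if (load_next(runs[i], &heap[n])) n++;
    for (size_t i = n; i-- > 0;)
        sift_down(heap, n, i);

    char *payload = malloc(1u << 20);   // assumes no element exceeds 1 MiB
    while (n > 0) {
        heap_node top = heap[0];

        // copy the smallest element (header + payload) to the output run
        fread(payload, 1, top.size, top.run);
        fwrite(&top.key, sizeof top.key, 1, dst);
        fwrite(&top.size, sizeof top.size, 1, dst);
        fwrite(payload, 1, top.size, dst);

        // refill the root from the same run, or shrink the heap when that run ends
        if (!load_next(top.run, &heap[0])) heap[0] = heap[--n];
        sift_down(heap, n, 0);
    }
    free(payload);
    free(heap);
}

Each run's FILE * would typically be given a large stdio buffer (for example via setvbuf) so that the merge actually issues the big sequential reads described above; the initial pass that produces the sorted 256 MB runs can reuse the quicksort already shown in the question.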



You don't mention whether the key/size info is kept in a separate file. If there's enough spare memory, you could keep that information in memory during the external merge sort and write it back out when done.






answered Nov 17 '18 at 3:35 by rcgldr, edited Nov 17 '18 at 4:04