C - sorting big binary files with variable length elements





My problem is the following:




  • I have a set of elements of variable length stored in a binary file that we'll call data.bin.

  • Each element has a sorting key that we'll call key, a size size, and a file position in data.bin, pos. These three parameters are represented by a structure.


The problem is how to sort data.bin efficiently.



Right now I have defined the following data structure:



typedef struct {
    int key;            // sorting key
    unsigned int size;  // element size in bytes
    long int pos;       // element offset in data.bin
} index;
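
For reference, the sort of this array described in the next paragraph can be done with the standard library qsort; a minimal sketch, assuming the index struct above, might look like this (compare_by_key and sort_indices are illustrative names, while indices and numEntries are the same names used in the code further below):

#include <stdlib.h>

// comparator on the sorting key; written this way to avoid the
// signed-overflow risk of returning "a->key - b->key"
static int compare_by_key(const void *a, const void *b) {
    const index *ia = a;
    const index *ib = b;
    return (ia->key > ib->key) - (ia->key < ib->key);
}

// sort the in-memory index array by key before the copy pass
static void sort_indices(index *indices, size_t numEntries) {
    qsort(indices, numEntries, sizeof *indices, compare_by_key);
}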


I also keep an array of these index structures, called indices, with one entry per element stored in data.bin. This array is sorted by key using a quicksort algorithm, which is fast enough even for a very large number of entries (e.g. 10M). I then use the sorted indices to write the reordered data out as sorted.bin. The core of that copy pass is the following (error checking intentionally removed):



size_t mergeBuffIdx = 0;
char *mergeBuff = (char *) calloc(MERGE_BUFF_SIZE, sizeof(char));

for (unsigned int idx = 0; idx < numEntries; idx++) {
    unsigned int dataSize = indices[idx].size;

    // flush the merge buffer to sorted.bin when the next element would not fit
    if ((mergeBuffIdx + dataSize) >= MERGE_BUFF_SIZE) {
        fwrite(mergeBuff, sizeof(unsigned char), mergeBuffIdx, sortedDataFile);
        mergeBuffIdx = 0;
    }

    // set the file pointer at the beginning of the element in data.bin
    fseek(dataFile, indices[idx].pos, SEEK_SET);

    // read the element from the file as an unsigned char array
    fread(&mergeBuff[mergeBuffIdx], sizeof(unsigned char), dataSize, dataFile);
    mergeBuffIdx += dataSize;
}

// write remaining data
if (mergeBuffIdx != 0) {
    fwrite(mergeBuff, sizeof(unsigned char), mergeBuffIdx, sortedDataFile);
}


This approach, though quite simple, quickly becomes very slow when data.bin is very large (my typical use case is 30 GB), the number of entries is around 10M, and two consecutive sorted entries can be very far apart in the original data.bin file. For a 1 GB data.bin file this approach can take around 30 minutes or more, even though I'm using an SSD.



Do you have any ideas, alternative solutions, or approaches to this problem?



I have tried using memory-mapped files, with similar or even worse performance. My guess is that the bottleneck is the fseek call, but I cannot figure out a faster alternative approach.
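
For completeness, a memory-mapped variant of the copy loop could look roughly like the POSIX sketch below; this is only an indicative reconstruction rather than the exact code I benchmarked, and copy_sorted_mmap is just an illustrative name. The access pattern is still one random read per element, only triggered by page faults instead of explicit fseek/fread calls, which matches the similar timings I observed.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// write the elements of dataPath to sortedDataFile in the order given by the
// already sorted indices array, reading the source through a read-only mapping
static void copy_sorted_mmap(const char *dataPath, FILE *sortedDataFile,
                             const index *indices, size_t numEntries) {
    int fd = open(dataPath, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // map the whole data.bin read-only; pages are faulted in on demand
    const char *base = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    for (size_t idx = 0; idx < numEntries; idx++) {
        fwrite(base + indices[idx].pos, 1, indices[idx].size, sortedDataFile);
    }

    munmap((void *) base, (size_t) st.st_size);
    close(fd);
}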



Thanks for reading,
S.

c performance sorting io binary-files






asked Nov 16 '18 at 22:59 by simalps

1 Answer

With your approach, you only get one element per read, so that's 10 million read commands to pull the data out in sorted-key order. At least the SSD eliminates the seek overhead of all that random access.



It should be faster to use an external merge sort to sort the data directly, rather than sorting keys and then making a random-access pass. This allows reads and writes of perhaps a bit over 256 MB at a time. The initial pass would read 256 MB chunks, sort the elements, then write the sorted 256 MB chunks (128 of these cover 32 GB). A 16-way merge (2048 reads of 16 MB each) followed by an 8-way merge (1024 reads of 32 MB each) would then sort about 32 GB of data. For the 16- or 8-way merge, you probably want to use some form of priority queue, such as a heap.
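
A minimal sketch of one such k-way merge pass with a binary min-heap might look like the code below. It is not the code from the question: it assumes each sorted run stores every element as a (key, size) header followed by size payload bytes (the question does not specify the on-disk layout), caps element size at 1 MiB, and omits error checking; heap_node, load_next and merge_runs are names chosen for illustration.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int key;            // sorting key of the element currently buffered for this run
    unsigned int size;  // payload size of that element
    FILE *run;          // run file this element came from
} heap_node;

// classic binary min-heap ordered on key
static void sift_down(heap_node *h, size_t n, size_t i) {
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < n && h[l].key < h[m].key) m = l;
        if (r < n && h[r].key < h[m].key) m = r;
        if (m == i) return;
        heap_node t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

// load the next (key, size) header from a run; returns 0 at end of run
static int load_next(FILE *run, heap_node *out) {
    if (fread(&out->key, sizeof out->key, 1, run) != 1) return 0;
    if (fread(&out->size, sizeof out->size, 1, run) != 1) return 0;
    out->run = run;
    return 1;
}

// merge k sorted runs into dst; error checking omitted as in the question
static void merge_runs(FILE **runs, size_t k, FILE *dst) {
    heap_node *heap = malloc(k * sizeof *heap);
    size_t n = 0;

    // prime the heap with one buffered element per non-empty run, then heapify
    for (size_t i = 0; i < k; i++)
        if (load_next(runs[i], &heap[n])) n++;
    for (size_t i = n; i-- > 0;)
        sift_down(heap, n, i);

    char *payload = malloc(1u << 20);   // assumes no element exceeds 1 MiB
    while (n > 0) {
        heap_node top = heap[0];

        // copy the smallest element (header + payload) to the output run
        fread(payload, 1, top.size, top.run);
        fwrite(&top.key, sizeof top.key, 1, dst);
        fwrite(&top.size, sizeof top.size, 1, dst);
        fwrite(payload, 1, top.size, dst);

        // refill the root from the same run, or shrink the heap when that run ends
        if (!load_next(top.run, &heap[0])) heap[0] = heap[--n];
        sift_down(heap, n, 0);
    }
    free(payload);
    free(heap);
}

Each run's FILE * would typically be given a large stdio buffer (for example via setvbuf) so that the merge actually issues the big sequential reads described above; the initial pass that produces the sorted 256 MB runs can reuse the quicksort already shown in the question.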



You don't mention whether the key/size info is kept in a separate file. If there's enough spare memory, you could keep that information in memory during the external merge sort and write it back out when done.






answered Nov 17 '18 at 3:35 by rcgldr, edited Nov 17 '18 at 4:04