What if Dictionary size in LZW algorithm is full?
I have been studying LZW compression, and there is one thing I cannot satisfy myself about: while building up the dictionary in LZW, its maximum size is usually set to 4096 entries. Why is that? Also, if the dictionary gets full it is reset, but what if the next few characters about to be read were present in the dictionary before it was reset? Is this a limitation, or is my understanding incorrect?
algorithm lzw
asked Oct 15 '16 at 2:03 by Syed Ahmed Jamil (edited Nov 11 at 7:52 by Kromster)

  • 2




    One prominent early LZW implementation was BSD compress(1). It started with 9-bit codes (dictionary size 512 = 2^9) - Huffman coding the LZW code stream would have been slow, and arithmetic coding wasn't common, let alone speedy encoders unencumbered by patents. It grew the code size as needed up to a limit of 16 bits, parametrisable down to 12. The compression ratio was monitored, and a reset occurred only when it got bad. Limiting the dictionary size allowed computers with little memory to decompress the stream - think 1/3 century ago, 1200/75 modems, … "Little input left" is a valid concern, not addressed as far as I recall.
    – greybeard
    Oct 15 '16 at 5:55
1 Answer
The dictionary size is limited by the output symbol size: 12 bits can encode 4096 distinct values, which is a common choice for lecture notes and simple/assignment implementations.



However, any symbol size larger than the source symbol size can be used: a 16-bit symbol would allow 65,536 dictionary entries. The more bits, the more entries the current dictionary can hold, which can increase the "maximum" compression. Conversely, because each output symbol is larger, it may decrease compression rates, especially for smaller inputs (insufficient time to build a dictionary) and data that is more random (reduced ability to re-use the symbols in the dictionary). In practice, 19-20 bits seems to be about the useful limit [2], while 16-bit symbols naturally align to bytes.
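
As a rough illustration (a minimal sketch in Python; the function name and parameters are mine, not from any particular library), the code below caps the dictionary at 2^max_bits entries and simply stops adding new entries once that limit is reached, which is one of the strategies discussed further down:

    # Minimal LZW encoder sketch (illustrative names, not a library API).
    # max_bits caps the dictionary at 2**max_bits entries; this variant
    # simply stops adding new entries once the dictionary is full.
    def lzw_encode(data: bytes, max_bits: int = 12) -> list:
        max_entries = 1 << max_bits                       # 12 bits -> 4096 codes
        dictionary = {bytes([i]): i for i in range(256)}  # seed with single bytes
        next_code = 256
        output, current = [], b""
        for byte in data:
            candidate = current + bytes([byte])
            if candidate in dictionary:
                current = candidate                       # keep extending the match
            else:
                output.append(dictionary[current])
                if next_code < max_entries:               # only grow while there is room
                    dictionary[candidate] = next_code
                    next_code += 1
                current = bytes([byte])
        if current:
            output.append(dictionary[current])
        return output                                     # each code fits in max_bits bits

    print(lzw_encode(b"TOBEORNOTTOBEORTOBEORNOT"))

With max_bits=12 no emitted code can exceed 4095, which is exactly why 4096 shows up as the textbook limit.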



It is also possible to use an adaptive symbol size based on the log2 of the current number of mapped symbols [1], but the benefit disappears as the data size increases, because the dictionary quickly fills. It is also largely superseded by Huffman coding.
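
For a sense of how adaptive sizing behaves, here is a small sketch (assuming a compress(1)-style scheme that starts at 9-bit codes and widens as codes are assigned):

    import math

    # Each emitted code uses only as many bits as the current dictionary needs,
    # growing from 9 bits toward the configured maximum (compress(1)-style).
    def code_width(num_codes_assigned: int, min_bits: int = 9) -> int:
        return max(min_bits, math.ceil(math.log2(num_codes_assigned)))

    for n in (257, 512, 513, 1024, 4096):
        print(n, "codes ->", code_width(n), "bits per code")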



When the dictionary is "reset", it is effectively the same as compressing multiple chunks of data and appending the compressed outputs: the dictionaries are separate. However, the data is "split" dynamically based on when the dictionary fills, as opposed to, say, every X bytes of input. Since the symbol size is fixed, it is more efficient to let the dictionary fill up before making the decision.



The primary purpose of resetting the dictionary is to avoid "fixating" the symbols on data characteristics of one part of the input that might not hold for later data. A compressor can use a single non-resetting dictionary, reset the dictionary as soon as it is full, reset the dictionary when it is full and a drop in compression is detected, etc.: the goal is to achieve the highest compression within the domain/parameters.
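
A sketch of the last policy, "reset only when the dictionary is full and compression drops" (the class, threshold, and bookkeeping are hypothetical, just to show the decision logic):

    # Hypothetical helper: monitor the running compression ratio and signal a
    # reset only when the dictionary is full and the ratio has slipped.
    class ResetPolicy:
        def __init__(self, tolerance: float = 0.05):
            self.best_ratio = 0.0
            self.tolerance = tolerance

        def should_reset(self, dict_full: bool, bytes_in: int, bytes_out: int) -> bool:
            if not dict_full or bytes_out == 0:
                return False
            ratio = bytes_in / bytes_out            # higher is better
            if ratio > self.best_ratio:
                self.best_ratio = ratio             # still improving: keep the dictionary
                return False
            # full dictionary and the ratio fell noticeably below the best seen
            return ratio < self.best_ratio * (1 - self.tolerance)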



Many LZ77/LZ78/LZW variations (and the optimizations they use) are briefly discussed in "On parsing optimality for dictionary-based text compression—the Zip case" by Alessio Langiu; these excerpts contain lots of juicy details for further research.



1"Improving LZW' by R. Nigel Horspool goes into some details on adaptive symbol sizes. Nigel's "The Effect of Non-Greedy Parsing
in Ziv-Lempel Compression Methods" paper also includes a summary on compress's handling of dictionary resets.



2"The Relative Efficiency of Data Compression by LZW and LZSS" by Yair Wiseman includes a sample graph of symbols sizes vs. compression efficiency. The graph is highly dependent on the data.






answered Nov 11 at 7:24 by user2864740 (edited Nov 12 at 5:23, accepted)

  • Your link is broken.
    – Michael Foukarakis
    Nov 11 at 7:47