Introduction to How File Compression Works
If you download many programs and files off the Internet, you've probably encountered ZIP files before. This compression system is a very handy invention, especially for Web users, because it lets you reduce the overall number of bits and bytes in a file so it can be transmitted faster over slower Internet connections, or take up less space on a disk. Once you download the file, your computer uses a program such as WinZip or Stuffit to expand the file back to its original size. If everything works correctly, the expanded file is identical to the original file before it was compressed.
At first glance, this seems very mysterious. How can you reduce the number of bits and bytes and then add those exact bits and bytes back later? As it turns out, the basic idea behind the process is fairly straightforward. In this article, we'll examine this simple method as we take a very small file through the basic process of compression.
Most types of computer files are fairly redundant -- they have the same information listed over and over again. File-compression programs simply get rid of the redundancy. Instead of listing a piece of information over and over again, a file-compression program lists that information once and then refers back to it whenever it appears in the original program.
As an example, let's look at a type of information we're all familiar with: words.
In John F. Kennedy's 1961 inaugural address, he delivered this famous line:
"Ask not what your country can do for you -- ask what you can do for your country."
The quote has 17 words, made up of 61 letters, 16 spaces, one dash and one period. If each letter, space or punctuation mark takes up one unit of memory, we get a total file size of 79 units. To get the file size down, we need to look for redundancies.
Immediately, we notice that:
"ask" appears two times
"what" appears two times
"your" appears two times
"country" appears two times
"can" appears two times
"do" appears two times
"for" appears two times
"you" appears two times
Ignoring the difference between capital and lower-case letters, roughly half of the phrase is redundant. Nine words -- ask, not, what, your, country, can, do, for, you -- give us almost everything we need for the entire quote. To construct the second half of the phrase, we just point to the words in the first half and fill in the spaces and punctuation.
Most compression programs use a variation of the LZ adaptive dictionary-based algorithm to shrink files. "LZ" refers to Lempel and Ziv, the algorithm's creators, and "dictionary" refers to the method of cataloging pieces of data.
The system for arranging dictionaries varies, but it could be as simple as a numbered list. When we go through Kennedy's famous words, we pick out the words that are repeated and put them into the numbered index. Then, we simply write the number instead of writing out the whole word.
So, if this is our dictionary:
ask
what
your
country
can
do
for
you
Our sentence now reads:
"1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4"
If you knew the system, you could easily reconstruct the original phrase using only this dictionary and number pattern. This is what the expansion program on your computer does when it expands a downloaded file. You might also have encountered compressed files that open themselves up. To create this sort of file, the programmer includes a simple expansion program with the compressed file. It automatically reconstructs the original file once it's downloaded.
But how much space have we actually saved with this system? "1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4" is certainly shorter than "Ask not what your country can do for you; ask what you can do for your country;" but keep in mind that we need to save the dictionary itself along with the file.
In an actual compression scheme, figuring out the various file requirements would be fairly complicated; but for our purposes, let's go back to the idea that every character and every space takes up one unit of memory. We already saw that the full phrase takes up 79 units. Our compressed sentence (including spaces) takes up 37 units, and the dictionary (words and numbers) also takes up 37 units. This gives us a file size of 74, so we haven't reduced the file size by very much.
But this is only one sentence! You can imagine that if the compression program worked through the rest of Kennedy's speech, it would find these words and others repeated many more times. And, as we'll see in the next section, it would also be rewriting the dictionary to get the most efficient organization possible.
At first glance, this seems very mysterious. How can you reduce the number of bits and bytes and then add those exact bits and bytes back later? As it turns out, the basic idea behind the process is fairly straightforward. In this article, we'll examine this simple method as we take a very small file through the basic process of compression.
Most types of computer files are fairly redundant -- they have the same information listed over and over again. File-compression programs simply get rid of the redundancy. Instead of listing a piece of information over and over again, a file-compression program lists that information once and then refers back to it whenever it appears in the original program.
As an example, let's look at a type of information we're all familiar with: words.
In John F. Kennedy's 1961 inaugural address, he delivered this famous line:
"Ask not what your country can do for you -- ask what you can do for your country."
The quote has 17 words, made up of 61 letters, 16 spaces, one dash and one period. If each letter, space or punctuation mark takes up one unit of memory, we get a total file size of 79 units. To get the file size down, we need to look for redundancies.
Immediately, we notice that:
"ask" appears two times
"what" appears two times
"your" appears two times
"country" appears two times
"can" appears two times
"do" appears two times
"for" appears two times
"you" appears two times
Ignoring the difference between capital and lower-case letters, roughly half of the phrase is redundant. Nine words -- ask, not, what, your, country, can, do, for, you -- give us almost everything we need for the entire quote. To construct the second half of the phrase, we just point to the words in the first half and fill in the spaces and punctuation.
Most compression programs use a variation of the LZ adaptive dictionary-based algorithm to shrink files. "LZ" refers to Lempel and Ziv, the algorithm's creators, and "dictionary" refers to the method of cataloging pieces of data.
The system for arranging dictionaries varies, but it could be as simple as a numbered list. When we go through Kennedy's famous words, we pick out the words that are repeated and put them into the numbered index. Then, we simply write the number instead of writing out the whole word.
So, if this is our dictionary:
ask
what
your
country
can
do
for
you
Our sentence now reads:
"1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4"
If you knew the system, you could easily reconstruct the original phrase using only this dictionary and number pattern. This is what the expansion program on your computer does when it expands a downloaded file. You might also have encountered compressed files that open themselves up. To create this sort of file, the programmer includes a simple expansion program with the compressed file. It automatically reconstructs the original file once it's downloaded.
But how much space have we actually saved with this system? "1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4" is certainly shorter than "Ask not what your country can do for you; ask what you can do for your country;" but keep in mind that we need to save the dictionary itself along with the file.
In an actual compression scheme, figuring out the various file requirements would be fairly complicated; but for our purposes, let's go back to the idea that every character and every space takes up one unit of memory. We already saw that the full phrase takes up 79 units. Our compressed sentence (including spaces) takes up 37 units, and the dictionary (words and numbers) also takes up 37 units. This gives us a file size of 74, so we haven't reduced the file size by very much.
But this is only one sentence! You can imagine that if the compression program worked through the rest of Kennedy's speech, it would find these words and others repeated many more times. And, as we'll see in the next section, it would also be rewriting the dictionary to get the most efficient organization possible.

0 Comments:
Post a Comment
<< Home