Sunday, April 24, 2011

[ Data Structures with Java ] Section 23.3 : Huffman Compression


Preface :
Data compression is a software technology that takes information and represents it in compact form. Compression algorithms create these compact representations by detecting patterns in the data and then representing them using less information. The data can consist of a simple text file or a binary word-processing file; more complex examples include audio and video files. Most computer users have applied a compression program to save disk space.
There are two types of data compression. With lossless compression, no information is lost: the original data can be recovered exactly from the compressed data. We normally apply this type of compression to "discrete" data, such as text, word-processing files, computer applications, and so forth. A lossy compression technique loses some information during compression, so the data cannot be recovered exactly. However, this type of compression tends to shrink the data further than lossless compression techniques. Sound files often use this type of compression.
Evaluation of a compression algorithm can include a mathematical complexity analysis that determines its computing requirements. Other criteria assess how much compression the algorithm produces. When evaluating a lossy compression technique, another criterion is how closely the decompressed image resembles the original. A mathematical analysis is often very difficult to perform, so we use the latter criteria. The compression ratio is the ratio of the number of bits in the original data to the number of bits in the compressed image. For instance, if a data file contains 500,000 bytes and the compressed data contains 100,000 bytes, the compression ratio is 5:1.
In this section, we will consider a compression technique, known as Huffman compression, that relies on counting the number of occurrences of each 8-bit byte in the data and generating a sequence of optimal binary codes called prefix codes. Algorithms that perform optimization normally execute a series of steps requiring choices. At each step, a greedy algorithm always looks at the data on hand and makes the choice that looks best on the basis of the local data. It hopes that the locally optimal choices will lead to an optimal solution to the whole problem. The Huffman algorithm is an example of a greedy algorithm.
Huffman compression is a popular and effective technique for data compression. The technique tends to be very effective for text files, in which compression ratios of at least 1.8 (at least 45% reduction) are common. The technique is successful for binary files, but the savings are generally not as good. The algorithm generates a table that contains the frequency of occurrence of each byte in the file. Using these frequencies, the algorithm assigns each byte a string of bits known as its bit code and writes the bit code to the compressed image in place of the original byte. It is hoped that the sum total of the bit codes is smaller than the original file.
Suppose that a file contains only the 8-bit ASCII characters 'a' through 'f' and that the characters have the frequencies specified in the table below; for example, the character 'c' occurs 8,000 times in the file. If we assign each character a fixed-length bit code, we will need 3 bits for each character ('a'=000, 'b'=001, ..., 'f'=101), as shown in the table:

Char    Frequency    Fixed-length code
 a       16,000           000
 b        4,000           001
 c        8,000           010
 d        6,000           011
 e       20,000           100
 f        3,000           101

If we replace each character by its bit code, the resulting compressed image uses a total of :
(16(3) + 4(3) + 8(3) + 6(3) + 20(3) + 3(3)) x 1000 = 171,000 bits

Computers store data in 8-bit bytes, so the image will have a size of 21,375 bytes. The original file has a size of 57,000 bytes, so the compression ratio is 2.67 (a saving of 62.5%).


The goal of the Huffman algorithm is to create an "optimal" binary tree that represents the bit codes. We will explain the meaning of optimal when we develop the algorithm. For now, let us explore a tree that illustrates the issues. In the example, each leaf node contains a byte and its frequency of occurrence, and each internal node contains the sum of the frequencies of its children. Starting at the root, a left child designates a bit of 0 and a right child designates a bit of 1. The tree grows until the branches create all bit codes as leaf nodes. The tree structure for the bit codes in the table above is as follows :


Note that, in the tree, no bit code is also a prefix for another bit code. This is guaranteed, because a byte occurs only in a leaf node. Such codes are called prefix codes. In the tree, each non-leaf node contains a frequency count; this becomes part of our strategy to create an optimal tree. Another issue of optimality involves the right subtree of the root. There are two codes beginning with 10 but no codes beginning with 11. The reason is that the tree is not a full binary tree. A full binary tree is one in which every interior node has two children. By converting the tree to a full tree, we can generate better bit codes for 'a'-'f'. If we replace the 23 on level 1 by its subtree, the following full tree results :


The bit codes for 'e' and 'f' are both 1 bit shorter, so the compressed file will now have :
(16(3) + 4(3) + 8(3) + 6(3) + 20(2) + 3(2)) x 1000 = 148,000 bits

which corresponds to a compression ratio of approximately 3:1 (456,000/148,000 ≈ 3.08).
Now that we have seen a tree of bit codes, we can begin to understand the design of a file-compression algorithm. Begin by reading the file and determining the frequencies of its bytes. Then use the frequencies to build a tree of prefix codes that determines a unique bit code for each byte. Write the tree to the compressed file and then reread the source file. For each byte in this second pass, write its corresponding bit code to the compressed file.
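The second pass can be sketched as follows. This is a minimal sketch: the class name, the encode() helper, and the toy code table are assumptions for illustration, not the codes from the sample file.

```java
import java.util.HashMap;
import java.util.Map;

public class EncodePass {
    // Second pass of the scheme above: replace each character by its bit code.
    // codeTable maps each character to the prefix code derived from the tree.
    static String encode(String text, Map<Character, String> codeTable) {
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < text.length(); i++)
            bits.append(codeTable.get(text.charAt(i)));
        return bits.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> table = new HashMap<Character, String>();
        table.put('a', "0");   // toy prefix codes, assumed for illustration
        table.put('b', "10");
        table.put('c', "11");
        System.out.println(encode("abca", table)); // prints 010110
    }
}
```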
To decompress a file, decode the stream of bits by using the prefix codes. To determine the first byte of the original file, start at the root of the tree and read a bit from the compressed file. If the bit is 0, move to the left child; if the bit is 1, move to the right child. Read the next bit and move again. Continue until you encounter a leaf node; because no code is a prefix of another, there is no ambiguity. The leaf's byte is the next byte of the original file; return to the root and continue in this fashion until reaching the end of the compressed file.
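That decoding walk can be sketched with a minimal node type (the HNode class later in this post plays the same role). The class name, the tree, and the bit string here are toy assumptions for illustration:

```java
public class HuffmanDecode {
    // Minimal tree node: a leaf holds a character; an internal node holds two children.
    static class Node {
        char data;
        Node left, right;
        Node(char d) { data = d; }
        Node(Node l, Node r) { left = l; right = r; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Walk the tree bit by bit: 0 goes left, 1 goes right, a leaf emits one character.
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (int i = 0; i < bits.length(); i++) {
            cur = (bits.charAt(i) == '0') ? cur.left : cur.right;
            if (cur.isLeaf()) {     // a leaf ends one code word; restart at the root
                out.append(cur.data);
                cur = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // toy tree for the prefix codes a=0, b=10, c=11
        Node root = new Node(new Node('a'), new Node(new Node('b'), new Node('c')));
        System.out.println(decode(root, "010110")); // prints abca
    }
}
```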
The overriding issue in using bit codes for compression is choosing an optimal tree. It can be shown that the optimal bit codes for a file are always represented by a full tree. Any full tree with n leaf nodes has exactly n-1 internal nodes, and so it has a total of n+(n-1) = 2n-1 nodes. Thus, if a file contains n unique bytes, and we compress it by replacing its bytes by bit codes, the corresponding tree will have 2n-1 nodes. There is a simple formula for the number of bits in the compressed image. For each byte b in the original file, let f(b) be the frequency of the byte and d(b) be the depth of the leaf node containing b. The depth of the node is also the number of bits in the bit code for b. We refer to the number of bits necessary to compress the file as the cost of the tree, which we specify by the following relation :

    cost(T) = sum over all unique bytes b of f(b) x d(b)

It can also be shown that a Huffman tree generates optimal prefix codes. That is, among all trees representing prefix codes, a Huffman tree gives the minimum cost. The Huffman algorithm constructs the tree so that the most frequently occurring bytes correspond to leaf nodes near the top of the tree, and the least frequently occurring bytes occur at the bottom. In this way, frequently occurring bytes have short bit codes and less frequently occurring bytes have long bit codes. As we describe the algorithm, we illustrate each step with the data in the previous table.
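As a quick check, the cost relation reproduces the bit counts computed earlier. This is a sketch (the class name and cost() helper are assumptions); frequencies are in thousands, and the depths come from the two code tables above:

```java
public class TreeCost {
    // cost(T) = sum over bytes b of f(b) * d(b)
    static long cost(int[] freq, int[] depth) {
        long sum = 0;
        for (int i = 0; i < freq.length; i++)
            sum += (long) freq[i] * depth[i];
        return sum;
    }

    public static void main(String[] args) {
        int[] freq  = {16, 4, 8, 6, 20, 3};  // 'a'..'f', in thousands
        int[] fixed = {3, 3, 3, 3, 3, 3};    // fixed-length 3-bit codes
        int[] full  = {3, 3, 3, 3, 2, 2};    // the full tree shown earlier
        System.out.println(cost(freq, fixed)); // prints 171 (thousand bits)
        System.out.println(cost(freq, full));  // prints 148 (thousand bits)
    }
}
```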

Building a Huffman Tree :
1. For each of the n distinct bytes in the file, assign the byte and its frequency to a tree node, and insert the node into a minimum priority queue ordered by frequency.
2. Remove two nodes, x and y, from the priority queue, and attach them as children of a new node whose frequency is the sum of the frequencies of its children. It does not matter which node becomes the left child and which the right child, and the byte value in an interior node is not used. Insert the resulting node into the priority queue.
3. In a loop, perform step 2 n-1 times. Each iteration creates one of the n-1 internal nodes of the tree. Because each internal node has two children, the resulting tree is full.
The figure below shows each step in the construction of a Huffman tree for the sample file :
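The three steps above can be sketched with java.util.PriorityQueue standing in for the book's HeapPQueue. This is a minimal sketch: the class name and the simplified node type (modeled on the HNode class shown later) are assumptions for illustration.

```java
import java.util.PriorityQueue;

public class BuildHuffman {
    // Simplified node: leaves hold a character; internal nodes hold only a count.
    static class Node implements Comparable<Node> {
        char data;
        int count;
        Node left, right;
        Node(char d, int c) { data = d; count = c; }
        Node(int c, Node l, Node r) { count = c; left = l; right = r; }
        public int compareTo(Node o) { return Integer.compare(count, o.count); }
    }

    // chars[i] occurs freq[i] times; returns the root of the Huffman tree.
    static Node build(char[] chars, int[] freq) {
        PriorityQueue<Node> pq = new PriorityQueue<Node>();
        for (int i = 0; i < chars.length; i++)
            pq.add(new Node(chars[i], freq[i]));       // step 1: one leaf per byte
        while (pq.size() > 1) {                        // step 3: n-1 merges in a loop
            Node x = pq.poll();                        // step 2: the two smallest nodes...
            Node y = pq.poll();
            pq.add(new Node(x.count + y.count, x, y)); // ...become children of a new node
        }
        return pq.poll();
    }

    public static void main(String[] args) {
        Node root = build(new char[]{'a', 'b', 'c', 'd', 'e', 'f'},
                          new int[]{16, 4, 8, 6, 20, 3}); // frequencies in thousands
        System.out.println(root.count); // prints 57: the root holds the total frequency
    }
}
```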

For the Huffman tree, the compressed file contains :
(16(2) + 4(4) + 8(2) + 6(3) + 20(2) + 3(4)) x 1000 = 134,000 bits

which corresponds to a compression ratio of 3.4.
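Reading the code lengths off the Huffman tree (a=2, b=4, c=2, d=3, e=2, f=4), the numbers above check out. This is a sketch; the class name and helper are assumptions for illustration:

```java
public class HuffmanCost {
    // Number of bits in the compressed image: sum of frequency x code length.
    static long compressedBits(int[] freq, int[] depth) {
        long bits = 0;
        for (int i = 0; i < freq.length; i++)
            bits += (long) freq[i] * depth[i];
        return bits;
    }

    public static void main(String[] args) {
        int[] freq  = {16000, 4000, 8000, 6000, 20000, 3000}; // 'a'..'f'
        int[] depth = {2, 4, 2, 3, 2, 4};                     // Huffman code lengths
        long bits = compressedBits(freq, depth);
        System.out.println(bits + " bits");                       // prints 134000 bits
        System.out.printf("ratio = %.1f%n", 57000.0 * 8 / bits);  // prints ratio = 3.4
    }
}
```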
Building a Huffman tree for file compression is simple in principle. However, its implementation involves a great many details. The implementation uses a variety of tools we have developed, including the HeapPQueue class based on the heap, the bit array, and inheritance. Because the compression algorithm writes binary data to disk and the decompression algorithm reads binary data from disk, we use binary files.

Implementing Huffman Compression :
The HCompress class is designed to compress a file whose name is passed as an argument. It opens the source file and creates a binary output file by adding the extension .huf to the name. The compression process outputs accompanying progress messages to the console. To reuse the same HCompress object for additional files, call setFile() with a new source file name. The public method compress() executes the compression steps. To expose some internal parameters of the compression process, the class provides the methods compressionRatio() and size(); the latter gives the number of nodes in the Huffman tree.
To simplify the whole process for demonstration, we provide a static method generateData() that generates fake test data, which is then consumed by the method compress(). We store the data in the HNode class, which is used to build the binary tree from which the bit codes are computed. The implementation is as below :
- HNode.java : Node in the Huffman binary tree
package DSwJ.S23.app;

public class HNode<T> implements Comparable<HNode<T>> {
    public T data;
    public int count = 0;
    public HNode<T> left;
    public HNode<T> right;

    public HNode(T d, int c, HNode<T> l, HNode<T> r) {
        data = d;
        count = c;
        left = l;
        right = r;
    }

    public HNode(int c, HNode<T> l, HNode<T> r) { count = c; left = l; right = r; }
    public HNode(T d, int c) { data = d; count = c; left = null; right = null; }

    // Order nodes by frequency count, so a minimum priority queue pops the rarest first
    @Override
    public int compareTo(HNode<T> node) {
        if (count == node.count) return 0;
        else if (count > node.count) return 1;
        else return -1;
    }
}

- HCompress.java : Demonstration of Huffman compression
package DSwJ.S23.app;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.Map.Entry;

import DSwJ.S22.HeapPQueue;

public class HCompress {
    public static HashMap<Character, String> bitCodesTable = new HashMap<Character, String>();

    // Generate fake test data with the frequency ratios 16:4:8:6:20:3 (out of 57)
    public static List<Character> generateData() {
        ArrayList<Character> data = new ArrayList<Character>();
        Random rnd = new Random();
        for (int i = 0; i < 100000; i++) {
            int index = rnd.nextInt(57);
            if (index < 16) data.add('a');
            else if (index < 20) data.add('b');
            else if (index < 28) data.add('c');
            else if (index < 34) data.add('d');
            else if (index < 54) data.add('e');
            else data.add('f');
        }
        return data;
    }

    // Walk the tree; the path from the root (0=left, 1=right) is the bit code of each leaf
    public static void inorderVisit(HNode<Character> node, String bc) {
        if (node != null) {
            if (node.left == null && node.right == null) {   // leaf node
                System.out.println(node.data + "(bit code)=" + bc);
                bitCodesTable.put(node.data, bc);
            } else {
                inorderVisit(node.left, bc + "0");
                inorderVisit(node.right, bc + "1");
            }
        }
    }

    public static void compress() {
        List<Character> data = generateData();
        // Count the frequency of each character
        HashMap<Character, Integer> cntTable = new HashMap<Character, Integer>();
        Iterator<Character> itr = data.iterator();
        while (itr.hasNext()) {
            char c = itr.next();
            if (cntTable.containsKey(c)) cntTable.put(c, cntTable.get(c) + 1);
            else cntTable.put(c, 1);
        }
        // Insert a leaf node for each character into a minimum priority queue
        HeapPQueue<HNode<Character>> pq = new HeapPQueue<HNode<Character>>();
        Set<Entry<Character, Integer>> set = cntTable.entrySet();
        Iterator<Entry<Character, Integer>> iter = set.iterator();
        System.out.println("Sym\tCount");
        System.out.println("==========================");
        while (iter.hasNext()) {
            Entry<Character, Integer> entry = iter.next();
            System.out.printf("%c=\t%d\n", entry.getKey(), entry.getValue());
            pq.push(new HNode<Character>(entry.getKey(), entry.getValue()));
        }

        // Repeatedly merge the two smallest nodes until one tree remains
        while (pq.size() > 1) {
            HNode<Character> lnode = pq.pop();
            HNode<Character> rnode = pq.pop();
            HNode<Character> inode = new HNode<Character>(lnode.count + rnode.count, lnode, rnode);
            pq.push(inode);
        }
        System.out.println("\nBit Code Table\n==========================");
        HNode<Character> root = pq.pop();
        inorderVisit(root, "");

        // Total bits = sum of (frequency x bit-code length) over all characters
        double totalBits = 0;
        iter = set.iterator();
        while (iter.hasNext()) {
            Entry<Character, Integer> entry = iter.next();
            char key = entry.getKey();
            totalBits += entry.getValue() * bitCodesTable.get(key).length();
        }
        System.out.println("\nCompression ratio is " + (((double) root.count * 8) / totalBits));
    }

    public static void main(String args[]) {
        compress();
    }
}

Output :
Sym Count
==========================
f= 5265
d= 10752
e= 35035
b= 7028
c= 13880
a= 28040

Bit Code Table
==========================
c(bit code)=00
d(bit code)=010
f(bit code)=0110
b(bit code)=0111
a(bit code)=10
e(bit code)=11

Compression ratio is 3.3993660182375987
