CSCI 2720 Data Structures
Spring 2004

Programming Project 3 - due *** Saturday March 20 *** by midnight.

See the General Project Instructions on the course web site for directions which apply to all projects, including this one.

The project has three parts, I, II and III.

I. Implement the basic Lempel-Ziv (abbreviated LZ from now on) encoding and decoding algorithms as described in Section 5.4 of Lewis and Denenberg.

II. Implement the modified LZ encoding and decoding algorithms as described in problem 34, Chapter 5 of Lewis and Denenberg.

III. Test your implementations from parts I and II on large files and compare them for compression and speed.

Part I. Basic LZ coding

Implement basic LZ coding to produce executable files "lzb_encode" and "lzb_decode" which perform as follows: "lzb_encode fname" creates a file "fname.lzb" which has been encoded, and "lzb_decode fname.lzb" creates a file "fname" which is identical to the file which was encoded to produce "fname.lzb".

The algorithm is completely determined by the dictionary size, the policy for a full dictionary, and the initial state of the dictionary. The dictionary size will be 2^16, and a full dictionary will remain as is for the coding. The initial configuration will be 256 codes, one for each byte.

Every code will be two bytes, H and L, considered to form an unsigned short int HL for purposes of developing the dictionary, where H is the high order byte (most significant) and L the low order byte. When written to the code file H must be written before L. The code assigned to a byte is the short int representing the same number as the byte considered as an unsigned char. That is, for 0 <= i < 256 the byte '\octal(i)' will have a code consisting of high byte '\0' and low byte '\octal(i)'.

You have no restrictions as to HOW to implement the basic LZ algorithm with the dictionary specified as above, but NO discretion as to what file is produced; every byte is determined, so that the instructor's decoder program will correctly decode the output of your encoder program, and vice versa.

** Well, almost no discretion; the basic LZ algorithm may produce duplicate
** entries in the code dictionary. You may output either code for such a
** string. The decoder will work correctly in any case.

Your source files should be named "lzb_encode.cc" and "lzb_decode.cc". An example of a very short file "bananas" and its encoded form "bananas.lzb" is provided on the course web site.

Part II. Modified LZ coding

This is the same as Part I except that the modified method of creating the dictionary must be implemented. The modified method is described and worked out in problem 34(a)(b) of Chapter 5, which was assigned for HW 3.

Your source files should be named "lzm_encode.cc" and "lzm_decode.cc". The corresponding executables should be named "lzm_encode" and "lzm_decode". When file "fname" is encoded, the resulting file should be named "fname.lzm".

Part III. Comparison of basic and modified LZ coding

Test your implementations from parts I and II on large files of different sizes and compare the basic and modified LZ codings for compression and speed. Use different types of files, such as text and executables. For timing tests try to use files large enough that the coding and decoding take at least 60 seconds, but that may not be possible. The results of your testing should be reported in a file named proj3_results, along with a discussion of their significance.

Timing data can be generated within your coding and decoding programs as in Project 1, and sent to "cout". Alternatively you can use the Unix utility "time". Here is an example: executing the command "(time lzb_encode fname) >& tfile &" will perform the encoding and write the timing data to the file "tfile".
Here "tfile" is created by the command, after deleting any existing file named "tfile". An example of data generated this way is "189.64u 0.04s 3:11.65 12.4%". Here "u" is the "user" time in seconds, and "s" is the system time in seconds; add the two to obtain the total time, 189.68 seconds in this case.

Testing and timing: do all testing and timing while logged onto atlas.

Grading: correctness will be the first priority, followed in importance by coding style/documentation and speed of execution.
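
For the in-program timing option mentioned in Part III, one possible approach is sketched below. It is an assumption, not necessarily the method used in Project 1: it relies on std::clock, which on Unix systems reports processor time used by the program and so is roughly comparable to the user plus system total from "time". The name cpu_seconds is hypothetical, and calling it with a lambda as shown in the note after the code requires a C++11 compiler.

```cpp
#include <ctime>

// Run work() once and return the processor time it used, in seconds.
// Hypothetical helper: F is any callable taking no arguments.
template <typename F>
double cpu_seconds(F work) {
    std::clock_t start = std::clock();
    work();
    std::clock_t stop = std::clock();
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}
```

A call might look like "double t = cpu_seconds([&] { encode(in, out); });" followed by "cout << t << endl;", where encode stands for whatever routine performs your coding.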