CSCI 2720			Data Structures			     Spring 2004


Programming Project 3 - due *** Saturday March 20 *** by midnight.


See the General Project Instructions on the course web site for directions
which apply to all projects, including this one.


The project has three parts, I, II and III.

I.   Implement the basic Lempel-Ziv (abbreviated LZ from now on) encoding
     and decoding algorithms as described in Section 5.4 of Lewis and Denenberg.

II.  Implement the modified LZ encoding and decoding algorithms as described in
     problem 34, Chapter 5 of Lewis and Denenberg.

III. Test your implementations from parts I and II on large files and compare
     them for compression and speed.

Part I.  Basic LZ coding

Implement basic LZ coding to produce executable files "lzb_encode" and 
"lzb_decode" which perform as follows: "lzb_encode fname" creates a file
"fname.lzb" which has been encoded, and "lzb_decode fname.lzb" creates a file
"fname" which is identical to the file which was encoded to produce "fname.lzb".

The algorithm is completely determined by the dictionary size, the policy for
a full dictionary, and the initial state of the dictionary.  The dictionary
size will be 2^16; once the dictionary is full, no further entries are added.
The initial configuration will be 256 codes, one for each byte.  Every code
will be two bytes, H and L, considered to form an unsigned short int HL for
purposes of developing the dictionary, where H is the high order byte (most
significant) and L the low order byte.  When written to the code file H must
be written before L.  The code assigned to a byte is the short int representing
the same number as the byte considered as an unsigned char.  That is, for
0 <= i < 256 the byte '\octal(i)' will have a code consisting of high byte '\0'
and low byte '\octal(i)'.

You have no restrictions as to HOW to implement the basic LZ algorithm with 
the dictionary specified as above, but NO discretion as to what file is 
produced; every byte is determined, so that the instructor's decoder program
will correctly decode the output of your encoder program, and vice versa.
** Well, almost no discretion; the basic LZ algorithm may produce duplicate
** entries in the code dictionary.  You may output either code for such a
** string.  The decoder will work correctly in any case.
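
One possible way to hold the dictionary parameters given above (2^16 entries,
256 initial single-byte codes, no changes once full) on the decoder side is
sketched below.  The class name CodeTable and its members are hypothetical, and
the sketch assumes that new entries receive the next unused code in order.  The
rule for WHICH strings get added comes from the textbook and is deliberately
not shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class CodeTable {
public:
    static constexpr std::size_t kMaxCodes = 1u << 16;  // dictionary size 2^16

    // Initial state: codes 0..255, one for each byte value.
    CodeTable() {
        entries_.reserve(kMaxCodes);
        for (int i = 0; i < 256; ++i)
            entries_.push_back(std::string(1, static_cast<char>(i)));
    }

    // Assign the next unused code to s (assumed sequential numbering).
    // Once the table is full it is left as is and false is returned.
    // No duplicate check is made, so duplicate entries are possible.
    bool add(const std::string& s) {
        if (entries_.size() >= kMaxCodes) return false;
        entries_.push_back(s);
        return true;
    }

    // String for a code; throws std::out_of_range for an unassigned code.
    const std::string& lookup(std::uint16_t code) const {
        return entries_.at(code);
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::vector<std::string> entries_;
};
```

A vector indexed by code is enough for decoding.  An encoder also has to search
by string, so it would likely want a different structure, such as a trie or a
hash table.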

Your source files should be named "lzb_encode.cc" and "lzb_decode.cc".
An example of a very short file "bananas" and its encoded form
"bananas.lzb" is provided on the course web site.


Part II.  Modified LZ coding

This is the same as Part I except that the modified method of creating the
dictionary must be implemented.  The modified method is described and worked out
in problem 34(a)(b) of Chapter 5, which was assigned for HW 3.

Your source files should be named "lzm_encode.cc" and "lzm_decode.cc".  The 
corresponding executables should be named "lzm_encode" and "lzm_decode".
When file "fname" is encoded, the resulting file should be named "fname.lzm".
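
The file naming rules for both parts (append ".lzb" or ".lzm" when encoding,
recover the original name when decoding) could be handled with two small
helpers like the ones below.  The names encoded_name and decoded_name are
hypothetical, and returning an empty string for a name without the expected
suffix is just one possible choice.

```cpp
#include <string>

// "fname" -> "fname.lzb" (or "fname.lzm" when ext is ".lzm").
std::string encoded_name(const std::string& fname,
                         const std::string& ext = ".lzb") {
    return fname + ext;
}

// "fname.lzb" -> "fname".  Returns an empty string if the argument
// does not end in ext or consists of nothing but ext.
std::string decoded_name(const std::string& coded,
                         const std::string& ext = ".lzb") {
    if (coded.size() <= ext.size() ||
        coded.compare(coded.size() - ext.size(), ext.size(), ext) != 0)
        return "";
    return coded.substr(0, coded.size() - ext.size());
}
```

For example, decoded_name("bananas.lzm", ".lzm") yields "bananas", while
decoded_name("bananas.txt") yields an empty string.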


Part III.  Comparison of basic and modified LZ coding

Test your implementations from parts I and II on large files of different
sizes and compare the basic and modified LZ codings for compression and speed.
Use different types of files, such as text and executables.  For timing tests
try to use files large enough that the coding and decoding take at least 60
seconds, but that may not be possible. 

The results of your testing should be reported in a file named proj3_results,
along with a discussion of their significance.  Timing data can be generated 
within your coding and decoding programs as in Project 1, and sent to "cout".
Alternatively you can use the Unix utility "time".  Here is an example:
executing the command "(time lzb_encode fname) >& tfile &" will perform the
encoding and write the timing data to the file "tfile".  Here "tfile" is created
by the command, after deleting any existing file named "tfile".  An example of
data generated this way is "189.64u 0.04s 3:11.65 12.4%".  Here "u" is the
"user" time in seconds, and "s" is the system time in seconds; add the two to 
obtain the total time, 189.68 seconds in this case.  


Testing and timing:  do all testing and timing while logged onto atlas.


Grading: correctness will be the first priority, followed in importance
	 by coding style/documentation and speed of execution.