Distributed Computing Systems (CSCI 4780/6780)

Programming Assignment 1

In this assignment, you are going to develop a simple HTTP client. This client reads URLs from a file one at a time and retrieves the web page corresponding to each URL. Each retrieved web page is stored locally in a separate file. The file name of a web page is the MD5 hash of its URL. The file should contain just the HTML content of the web page (i.e., it should not contain the HTTP headers).
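As a rough illustration of the file naming and header stripping described above, here is a minimal Python sketch. The function names local_name and strip_headers are hypothetical, and the sketch assumes the file name is the hexadecimal form of the MD5 digest and that the whole raw response is already in memory as bytes.

```python
import hashlib

def local_name(url: str) -> str:
    """Return the hex MD5 digest of the URL (assumed form of the local file name)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def strip_headers(raw: bytes) -> bytes:
    """Return only the body of a raw HTTP response.

    Headers end at the first blank line (CRLF CRLF); everything after it
    is the content that should be written to the file.
    """
    head, sep, body = raw.partition(b"\r\n\r\n")
    # If no blank line was found, the response has no body to store.
    return body if sep else b""
```

For example, strip_headers applied to a response consisting of a status line, one header, a blank line, and "<html></html>" yields only "<html></html>".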

The input file for your program will be in the following format: The first line gives the number of URLs in the file. Each subsequent line contains one URL. You can make the following assumptions:

1. The file contains exactly the number of URLs stated in the first line.
2. All the URLs are for static web pages.
3. Cookies are not employed.
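The input format can be read along these lines. This is only a sketch: parse_url_list is a hypothetical name, and it takes the file contents as a string so that it is easy to try out.

```python
def parse_url_list(text: str) -> list[str]:
    """Parse the input format: a count on the first line, then one URL per line."""
    lines = [line.strip() for line in text.splitlines()]
    count = int(lines[0])
    # Assumption 1 guarantees exactly `count` URLs follow the first line.
    return lines[1:1 + count]
```

In a real client the string would come from reading the input file, e.g. open(path).read(), where path is whatever file name your program accepts.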

Your client should parse the HTML content and retrieve any embedded images (there is no need to support other kinds of embedded content). Each image needs to be stored in a separate file as well.
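One possible way to find the embedded images is sketched below, using Python's standard html.parser module. The names ImageCollector and image_urls are hypothetical, and the sketch assumes that "embedded images" means the src attribute of img tags, with relative references resolved against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageCollector(HTMLParser):
    """Collect absolute URLs of all <img src="..."> tags in a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.images: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative references against the page's own URL.
                self.images.append(urljoin(self.base_url, src))

def image_urls(html: str, base_url: str) -> list[str]:
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.images
```

Each returned URL can then be fetched and stored in the same way as a page.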

You should also instrument the code to measure the time required for retrieving each web page. Be careful to measure only the time required to retrieve the page (i.e., the measured time should not include anything else). Your program should create a log file that contains the following information (separated by whitespace) for each URL in the input file:

a. URL
b. Success/failure in retrieving the web page.
c. The status code returned by the server, indicating success/failure.
d. In case of successful retrieval, the local file name where the page is stored.
e. In case of successful retrieval, the time taken for retrieval.
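The timing and the log entry could be handled roughly as follows. This is a sketch under stated assumptions: timed and log_line are hypothetical names, fetch stands for whatever function performs the retrieval, a 2xx status code is treated as success, and the time is logged in seconds. None of these choices is prescribed by the assignment.

```python
import time

def timed(fetch, url):
    """Call fetch(url) and return (result, elapsed seconds).

    Only the fetch call sits between the two clock readings, so parsing,
    hashing, and file writes are not included in the measurement.
    """
    start = time.perf_counter()
    result = fetch(url)
    elapsed = time.perf_counter() - start
    return result, elapsed

def log_line(url, status, filename=None, seconds=None):
    """Build one whitespace-separated log entry (fields a to e)."""
    ok = 200 <= status < 300  # assumption: any 2xx code counts as success
    fields = [url, "success" if ok else "failure", str(status)]
    if ok:
        # File name and time are logged only for successful retrievals.
        fields += [filename, f"{seconds:.6f}"]
    return " ".join(fields)
```

A failed retrieval therefore produces a three-field entry, and a successful one produces all five fields.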

The assignment is due on 02/21/2011. The assignment is to be done individually.

This web page from UMBC lists MD5 implementations in various languages. You will find the C implementation here.

Submission Instructions
To submit your assignment, you should use the submit command on odin.
The syntax is: submit <directory name> csx780

Here, <directory name> is the name of the directory containing your code. Your first project should be in a directory named
PA1_<your last name followed by your first initial>. When you run this command, you MUST be in the parent directory of the PA1 directory.