Introduction to Http
Http stands for HyperText Transfer Protocol. It is an application-level protocol that defines how html documents are communicated over the World Wide Web. Http sits on top of TCP. There are a few different versions of the protocol. This paper will discuss version 1.0.
The knowledge of http is embodied in web browsers and web servers. Http is asymmetric in that there is distinct client and server activity. The client initiates connections, makes requests, and terminates connections. The server listens for requests and fulfills them. These roles are true of any client/server system; they are not specific to http. Because of this asymmetry, clients (web browsers) are generally more complicated. This paper will describe the client side of things. Since a server just replies to client requests, most of what a server has to do should be apparent from looking at the client side.
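The division of labor between client and server can be sketched with plain TCP sockets in Python. This is not http itself, only the generic client/server pattern described above; the function names serve_once and fetch, the reply text, and the use of the local machine as the "server" are all assumptions made for illustration.

```python
import socket
import threading

def serve_once(listener):
    """Server role: wait for one client, read its request, answer it."""
    connection, _ = listener.accept()
    with connection:
        # Simplification: assume the whole (tiny) request arrives in one read.
        request = connection.recv(1024).decode("ascii")
        reply = "you asked for: " + request.strip()
        connection.sendall(reply.encode("ascii"))

def fetch(port, request):
    """Client role: open the connection, make a request, read the reply, close."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as connection:
        connection.sendall(request.encode("ascii"))
        connection.shutdown(socket.SHUT_WR)   # signal that the request is complete
        chunks = []
        while True:
            data = connection.recv(1024)
            if not data:                      # empty read: the server is finished
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii")

# The server only listens; port 0 lets the operating system pick a free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

server = threading.Thread(target=serve_once, args=(listener,))
server.start()
answer = fetch(port, "cove110.html\n")       # the client initiates everything
server.join()
listener.close()
print(answer)
```

Note that all of the decisions (when to connect, what to ask for) sit in the client half, while the server half only reacts.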
Http works by sending files from the server to the client. There are many file formats supported. The most fundamental format is html. Html stands for HyperText Markup Language. This paper will only describe as much html as is necessary to demonstrate http.
Http is an extensible protocol. It was designed to work with some prior protocols; one can initiate an ftp or gopher download from a web browser. Part of the design involves finding "resources". This is what URLs (uniform resource locators) are for. URLs were designed to work with multiple protocols, DNS, and html. URLs are text strings that define the protocol, host, & file. Here is an example page from my service provider:
http://www.cove.com/cove110.html
The first token, "http", defines the protocol. This is followed by a separator, "://". Next comes the DNS host name, "www.cove.com". DNS stands for Domain Name System, a standard way to look up IP addresses by logical, human-readable names. Lastly comes another separator, "/", and the file name, "cove110.html".
All three pieces are necessary to find a resource; however, there are some defaults. For instance, if "/cove110.html" were omitted, the server would send us the page home.html. The administrator at cove configures this. Likewise, if "http://" were omitted, the browser would still find the page because web browsers default to the http protocol.
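How a URL breaks into its three pieces, and how the defaults fill in, can be sketched in a few lines of Python. This is only an illustration: the function name split_url is hypothetical, and the default file name is taken from the cove example above. In practice the server, not the browser, substitutes its default page, so filling it in on the client side here is a simplification.

```python
from urllib.parse import urlsplit

DEFAULT_PROTOCOL = "http"      # browsers assume http when no protocol is given
DEFAULT_FILE = "/home.html"    # assumption: the default page in the cove example

def split_url(text):
    """Return (protocol, host, file) for a URL, filling in the defaults."""
    if "://" not in text:
        text = DEFAULT_PROTOCOL + "://" + text
    parts = urlsplit(text)
    # An empty path or a bare "/" means no file was named.
    path = parts.path if parts.path not in ("", "/") else DEFAULT_FILE
    return parts.scheme, parts.netloc, path

print(split_url("http://www.cove.com/cove110.html"))
# ('http', 'www.cove.com', '/cove110.html')
print(split_url("www.cove.com"))
# ('http', 'www.cove.com', '/home.html')
```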
So, now you know how a page is found. What does the browser do with it? It does whatever the html tells it to. Html plays two roles in the web: formatting text and laying out pages.
The less important role is text formatter. Html is probably the most common way of distributing text on the web. It is neither the only way nor the most sophisticated way to format text. (Indeed, by volume of text, PostScript may be the largest format on the web, as this is the format of most research papers. However, probably more html is being read than any other format, since it is the first thing a web surfer sees.) The more important role of html is as a "glue" for the web.
As the glue, an html file defines the layout of the viewing area of a browser. It arranges the screen by telling the web browser where (and when) to show the text in the html file and all other associated files. Html also formats the text. Html effectively says something like: put text X1 first, then picture Y, then button Z, then text X2. The difference between text and other formats is that the text is embedded in the html, whereas all other formats are referenced as separate files.
When a browser is pointed at a web page, the browser effectively asks the server, "send me this html file". Once the browser receives the file, it inspects the file for two things. First, it finds and displays the formatted text. Second, it looks for references to other types of files that the browser knows how to handle. These references to other files are what make http such a popular protocol. They can be references to any arbitrary file type, as http allows extensions to be added.
The two most popular references are to other html files and to pictures (typically in gif format). References to other html documents are by convention shown in bright blue and underlined. These are the links to other pages. This is what makes hypertext hyper. Again by convention, a link turns purple after one has used it.
After receiving the html document, graphical browsers typically request the picture files immediately. NCSA's Mosaic, Netscape's Navigator, and Microsoft's Internet Explorer are all graphical browsers. This is what makes hypertext graphical. Images are usually an order of magnitude larger than the html that references them. Performance is the reason browsers have a way to shut off the downloading of images.
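The inspection step a browser performs on a received page can be sketched with Python's standard html parser. This is a minimal illustration, not how any particular browser is written; the class name ReferenceFinder and the sample page with its file names are made up for the example.

```python
from html.parser import HTMLParser

class ReferenceFinder(HTMLParser):
    """Collect the two most common kinds of references in an html page."""

    def __init__(self):
        super().__init__()
        self.links = []    # other pages, which a browser shows as links
        self.images = []   # pictures a graphical browser requests right away

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

# A made-up page for illustration; the file names are hypothetical.
PAGE = """<html><body>
<h1>Welcome</h1>
<p>Some text, then a picture: <img src="logo.gif"></p>
<p>See the <a href="cove110.html">next page</a>.</p>
</body></html>"""

finder = ReferenceFinder()
finder.feed(PAGE)
print("links: ", finder.links)    # ['cove110.html']
print("images:", finder.images)   # ['logo.gif']
```

A graphical browser would go on to request every entry in images at once, while the entries in links are only requested if the user follows them. A browser with image loading shut off would simply skip the first step.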
There are many other types of references, too many to put in a short paper. Some examples are sounds, movies, other text formats, and Java applets. The important concept is that http is extensible. When a new format is invented, the user can get a program that deals with it. Then they can configure the browser to run the program with these types of references.
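This kind of extensibility amounts to a lookup table inside the browser, which the following Python sketch tries to show. Everything in it is hypothetical: the program names are placeholders, the set of built-in formats is invented, and the choice of handler is keyed on the file name extension purely for simplicity (a real browser can also use the content type reported by the server).

```python
import os

# Hypothetical configuration: file type -> program the browser should run.
# The program names are placeholders, not real recommendations.
HELPERS = {
    ".au": "soundplayer",
    ".mpg": "movieplayer",
    ".ps": "psviewer",
}

BUILT_IN = {".html", ".gif"}   # formats this imaginary browser shows itself

def handler_for(reference):
    """Decide how the browser deals with a referenced file."""
    extension = os.path.splitext(reference)[1].lower()
    if extension in BUILT_IN:
        return "display in browser"
    if extension in HELPERS:
        return "run " + HELPERS[extension]
    return "offer to save to disk"

def install_helper(extension, program):
    """Teaching the browser a new format is just adding a table entry."""
    HELPERS[extension.lower()] = program

print(handler_for("talk.au"))        # run soundplayer
print(handler_for("paper.xyz"))      # offer to save to disk
install_helper(".xyz", "xyzviewer")  # a new format is invented
print(handler_for("paper.xyz"))      # run xyzviewer
```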
So, we have discussed clients & servers, and requesting & receiving files. What does this have to do with http? Http defines a valid client request and a valid server response. It is surprisingly simple. It should be clear at this point why we said "The knowledge of http is embodied in web browsers and web servers." That knowledge is in how the requests & responses are formed.
Http actually relies on an earlier spec, MIME, the Internet Email format, to define the format of the requests and responses. So when one is surfing the net, one is in effect rapidly exchanging Email with a small number of servers. In the next section, we'll describe in a little more detail some simple requests and responses.
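As a small preview, the Email-like shape of the messages can be seen by forming a request and reading a response's headers with Python's ordinary Email parser. This is a sketch under stated assumptions: the function names build_request and parse_response are hypothetical, and the sample response is hand-written for illustration rather than captured from a real server.

```python
from email.parser import Parser

def build_request(path):
    """Form a minimal http 1.0 request for one file."""
    # A request line, then an empty line to mark the end of the request.
    return "GET " + path + " HTTP/1.0\r\n\r\n"

def parse_response(text):
    """Split a response into its status line, headers, and body."""
    head, _, body = text.partition("\r\n\r\n")
    status_line, _, header_text = head.partition("\r\n")
    # The header lines have the same "Name: value" form as Email headers,
    # so the standard Email parser can read them.
    headers = Parser().parsestr(header_text, headersonly=True)
    return status_line, headers, body

# A hand-written response for illustration, not output from a real server.
SAMPLE = (
    "HTTP/1.0 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 28\r\n"
    "\r\n"
    "<html>Hello from cove</html>"
)

print(repr(build_request("/cove110.html")))
status_line, headers, body = parse_response(SAMPLE)
print(status_line)               # HTTP/1.0 200 OK
print(headers["Content-Type"])   # text/html
print(body)                      # <html>Hello from cove</html>
```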