HTTP: Doing What Your Browser Does For You
Before we start:
This tutorial assumes you have knowledge and experience using telnet or something
that can do the same job - like netcat. Knowing a bit of HTML would come in handy.
This tutorial is provided free as a voluntary service of the author. No responsibility
is held, nor is any claim made, for the accuracy, safety or any other aspect
of this tutorial's content.
OK, I wanted to write a tutorial on HTTP and now you want to learn it for whatever
reason you do - to make a web browser or to just learn more about the Web. Seeing
as I don't want to write a big introduction, I'm going to get right into it.
HTTP is a service that usually runs on port 80 (however it often runs on 81,
8080, 8000 or 1080 or whatever). You can see this if you connect to microsoft.com
on port 80 and give it a few carriage returns:
HTTP/1.1 400 Bad Request
Date: Mon, 07 Jan 2002 02:17:22 GMT
parameter is incorrect.
As you can see, many web servers (or daemons - pronounced just like demon)
give an excessive amount of information out to anyone (some, like yahoo's daemon,
don't give any at all). Let's take a moment to examine this dump. The first
phrase is "HTTP/1.1" which you'd probably correctly guess means that
this server is using version 1.1 of HTTP. The second part is a status code with
a friendly translation afterwards. These status codes are standard three digit
numbers. Here's what the HTTP 1.1 RFC (no. 2616) says:
The Status-Code element is a 3-digit integer result code
of the attempt to understand and satisfy the request. These codes are fully
defined in section 10. The Reason-Phrase is intended to give a short textual
description of the Status-Code. The Status-Code is intended for use by automata
and the Reason-Phrase is intended for the human user. The client is not required
to examine or display the Reason-Phrase.
The first digit of the Status-Code defines the class of response.
The last two digits do not have any categorization role. There are 5 values
for the first digit:
- 1xx: Informational - Request received, continuing process
- 2xx: Success - The action was successfully received, understood, and accepted
- 3xx: Redirection - Further action must be taken in order to complete the request
- 4xx: Client Error - The request contains bad syntax or cannot be fulfilled
- 5xx: Server Error - The server failed to fulfil an apparently valid request
(There is a detailed list of these codes in the Appendix.) Looking at this
we can see that code 400 is a client error. This means that microsoft.com's
daemon is blaming us for the fact it can't make head or tail of our HTTP request.
We can find out the name of this blaming daemon in the next line, Server: Microsoft-IIS/5.0.
Well, you would expect microsoft.com to use the latest version of the Microsoft
web daemon IIS, now, wouldn't you? Putting this kind of information out in response
to a connection from anywhere is often seen as a security risk by security experts
and an excellent opportunity by lame "hackers" who'll just look up
these details on Bugtraq, download an exploit that looks nice and do lots of
nasty things to the server. Enough of that, let's skip the next line (current
date and time - GMT in this case) and go on with the last two lines of headers.
The Content-Type tells us that what we got in return was HTML - web page text
- and the Content-Length tells us that the HTML (the bit the user would see
if they used a common browser) takes up 87 bytes - and if you count the number
of characters, you'll find that Mr. I.I.S. Daemon is perfectly correct.
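If you'd rather let a program do the counting, here is a minimal Python sketch of the same dissection: split off the status line, classify the code by its first digit, collect the headers and check Content-Length against the body. The response it parses is made up for illustration (the ExampleServer name and the HTML body are assumptions, not the real microsoft.com output), and parse_response and status_class are just names I picked.

```python
# Classes of status code, keyed by the first digit (from the RFC 2616 list).
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code):
    """Return the class name for a 3-digit status code."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


def parse_response(raw):
    """Split raw response bytes into (version, code, reason, headers, body)."""
    # Headers end at the first blank line; everything after it is the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: HTTP-Version SP Status-Code SP Reason-Phrase
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


# A made-up response, NOT the real output of any server.
body = b"<html><body>Bad request</body></html>"
head_lines = [
    "HTTP/1.1 400 Bad Request",
    "Server: ExampleServer/1.0",
    "Content-Type: text/html",
    "Content-Length: %d" % len(body),
]
raw = "\r\n".join(head_lines).encode("ascii") + b"\r\n\r\n" + body

version, code, reason, headers, parsed_body = parse_response(raw)
print(version, code, status_class(code))
print(int(headers["content-length"]) == len(parsed_body))
```

Header names are lower-cased because they are case-insensitive in HTTP; repeated headers and folded lines are ignored to keep the sketch short.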
OK, so now you know how to get an error message from a web server. Very good,
but you probably want to get a full-blooded webpage. The magic word to do this
is in fact GET. GET with a slash (/) after it that is (you'll see why soon).
Hotmail - The World"s FREE Web-Based Email
Hotmail redirect in progress. Please wait...
(Hotmail.com hasn't given us a personal life story here - it's to do with the
GET /, you'll see why soon.) Using GET / on hotmail.com is getting the URL http://hotmail.com/.
The URL http://google.com/jobs/benefits.html would be requested by connecting
to google.com and typing GET /jobs/benefits.html. See:
Google Job Benefits
Cool Jobs at Google
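The step from a URL to "where do I connect and what do I type" can be sketched in a few lines of Python. This is only an illustration using the standard urllib.parse module; request_target is a made-up helper name, and defaulting to port 80 and carrying the query string along are my assumptions.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return (host, port, path) for a plain http:// URL."""
    parts = urlsplit(url)
    # An empty path (http://hotmail.com) is requested as "/".
    path = parts.path or "/"
    # Assumption: a query string, if present, stays part of what you type.
    if parts.query:
        path += "?" + parts.query
    # Assumption: no explicit port in the URL means port 80.
    return parts.hostname, parts.port or 80, path


host, port, path = request_target("http://google.com/jobs/benefits.html")
print("connect to", host, "on port", port)
print("then type: GET " + path)
```

For http://hotmail.com/ the same function gives the host hotmail.com and the path /, which is the GET / from earlier.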
You should see how it works by now: to see what should go after the GET, just
drop the http:// and the domain name. There are more methods (as they are called)
than GET - just remember that they are case-sensitive (i.e. use GET, not get):
GET - as already covered
HEAD - like GET but requests that the daemon only returns the headers for a
URL. This is little known (but useful) and I've seen many servers not implement
it properly (e.g. yahoo.com).
PUT - gives the server a file to store at the URL you name
POST - gives the server some information (it's up to the server what it does
with it, sometimes POST is just like PUT).
DELETE - a request that a server deletes a resource (now, now, a server's not
going to delete something without a reason so don't get any ideas). Even a
success status doesn't guarantee that the file has really been deleted.
TRACE - asks the server to echo your request back to you, which is handy for
seeing what arrived at the other end after any proxies along the way.
OPTIONS - asks which methods and options the server supports (I've never got
this to work).
CONNECT - this is reserved for use with proxies that can switch to being a tunnel.
Not all these work on all daemons unless you give the version of HTTP you are
using (as 1.0 or 1.1, depending on how you feel). To do this you'd add HTTP/1.x
to the end of your method (for instance, GET / HTTP/1.1) but now you have to
press Enter twice.
HEAD / HTTP/1.0
HTTP/1.1 200 OK
P3P: CP="ALL IND DSP COR ADM CONo CUR CUSo IVAo IVDo PSA PSD TAI TELo OUR SAMo
CNT COM INT NAV ONL PHY PRE PUR UNI"
Date: Sun, 06 Jan 2002 07:44:33 GMT
Last-Modified: Fri, 04 Jan 2002 01:52:58 GMT
(See how useful HEAD is? You can find out all about the server and file without
having 26597 bytes streaming past your client.)
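What you type into telnet for a request like that HEAD can be written down as bytes: a request line, optional extra lines, and then the blank line that "pressing Enter twice" produces. Below is a rough Python sketch, not a proper client. build_request and send_raw are names I made up, the 5 second timeout is arbitrary, and send_raw is only defined here, not run, since it needs a live server to talk to.

```python
import socket


def build_request(method, path, version="1.0", headers=None):
    """Return the bytes for a request, ending with the mandatory blank line."""
    lines = ["%s %s HTTP/%s" % (method, path, version)]
    for name, value in (headers or {}).items():
        lines.append("%s: %s" % (name, value))
    # Each line ends in CRLF; the extra CRLF is the second "Enter".
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def send_raw(host, payload, port=80, timeout=5.0):
    """Send payload to host:port and return whatever comes back (like telnet)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        try:
            while True:
                data = sock.recv(4096)
                if not data:  # server closed the connection
                    break
                chunks.append(data)
        except socket.timeout:
            pass  # server kept the connection open; keep what we have
    return b"".join(chunks)


print(build_request("HEAD", "/"))
```

Something like send_raw("example.com", build_request("HEAD", "/")) would then return the raw header block, assuming the host answers on port 80.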
(Notice that using HTTP 1.0 or more means that the server gives you header
stuff.) Why do you have to press Enter twice rather than once? Because by specifying
1.0 or more as your HTTP version number, you are now using HTTP v1.0 or more
(duh...). With HTTP 1.x, you can give your own headers on the lines after the
method. The server keeps accepting stuff from you until it gets a double carriage
return (a blank line in other words). With these headers you'd give extra information
like the language you'd prefer to have your webpage in. I set up my own dummy
web daemon (with netcat if you're interested) sometime and got three different
browsers (Netscape, Internet Explorer and Lynx) to connect to it and get the
main page. You can see what kind of information the browsers give gratis:
GET / HTTP/1.1
Accept: application/msword, image/gif, image/x-xbitmap, image/jpeg, image/pjpeg,
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; TBP_7.0_GS)
GET / HTTP/1.0
User-Agent: Mozilla/4.08 [en] (Win98; I ;Nav)
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
That's quite a bit of information as well - the Lynx output has a little more
because I used a copy on a remote network. (Why do Netscape and Internet Explorer
feel the need to broadcast the name of people's operating systems?) The important
headers here are Connection, User-Agent and Host. The value for Connection here
in all cases is keep-alive because browser-makers know that making a new connection
is a real rigmarole. Actually, with HTTP v1.1, the default connection is keep-alive
so unless a Connection: close is issued in response by the server, you should
assume that the connection will keep on going. Whenever a Connection: close
is in a message, by server or client, then the connection will close after the
message is complete. Now, here's what RFC 2616 has to say about User-Agent:
The User-Agent request-header field contains information about the user agent
originating the request. This is for statistical purposes, the tracing of protocol
violations, and automated recognition of user agents for the sake of tailoring
responses to avoid particular user agent limitations. User agents SHOULD include
this field with requests.
SHOULD is RFC-speak for saying that this is a recommendation that you should
only ignore with good reason and after examining all the details (defined in
RFC 2119...). User-Agent usually gives information in a big to small (most to
least significant) format and sometimes strange things are in the User-Agent
header. Anyway, now we have the Host header. This just specifies the hostname
so if you connected to yahoo.com, you'd have Host: yahoo.com amongst the rest.
RFC 2616 does not fail in its attempt at stressing the importance of Host:
A client MUST include a Host header field in all HTTP/1.1 request messages.
If the requested URI does not include an Internet host name for the service
being requested, then the Host header field MUST be given with an empty value.
An HTTP/1.1 proxy MUST ensure that any request message it forwards does contain
an appropriate Host header field that identifies the service being requested
by the proxy. All Internet-based
HTTP/1.1 servers MUST respond with a 400 (Bad Request) status code to any HTTP/1.1
request message which [sic] lacks a Host header field.
So don't be naughty (or a bad request) and forget your Host - unless of course
you are using HTTP v1.0. Even with HTTP v1.0, however, you sometimes come across
servers and situations where you're stuck with it. Consider a connection to
www.tripod.lycos.com made without a Host header.
Hang on, I am connecting to www.tripod.lycos.com so why the hell is this daemon
telling me to go to www.tripod.lycos.com??? This is what I was muttering to
myself when I tried this some time ago until I suddenly thought of using Host.
Worked like a charm but still a little weird (presumably the daemon serves several
sites from one address and needs Host to tell which one you mean).
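To tie the Host and Connection rules together, here is a small Python sketch of what a daemon might check when a request arrives: is Host missing from an HTTP/1.1 request (which earns a 400), and will the connection stay open afterwards? It is a simplification under my own assumptions. parse_request, missing_host and stays_open are made-up names, the request must include an HTTP version, and Connection is treated as a single token although the real header can carry a list.

```python
def parse_request(raw):
    """Split raw request bytes into (method, target, version, headers)."""
    # Everything up to the blank line is the request line plus headers.
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    # Assumes a full request line such as "GET / HTTP/1.1".
    method, target, version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers


def missing_host(version, headers):
    """True if this is an HTTP/1.1 request without the mandatory Host header."""
    return version == "HTTP/1.1" and "host" not in headers


def stays_open(version, headers):
    """Rough guess at whether the connection persists after this message."""
    connection = headers.get("connection", "").lower()
    if connection == "close":
        return False
    if version == "HTTP/1.1":
        return True  # keep-alive is the default in 1.1
    # Assumption: a 1.0 client has to ask for keep-alive explicitly.
    return connection == "keep-alive"


good = b"GET / HTTP/1.1\r\nHost: yahoo.com\r\nConnection: close\r\n\r\n"
bad = b"GET / HTTP/1.1\r\n\r\n"

for raw in (good, bad):
    method, target, version, headers = parse_request(raw)
    print(method, target, missing_host(version, headers), stays_open(version, headers))
```

The second request has no Host, so an HTTP/1.1 daemon following the RFC quote above would answer it with 400 Bad Request.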
This would have to be the end of this HTTP tutorial (hope you got something
from it). Just a few final points though:
You can GET any kind of media (pictures, videos, sounds, etc.), not just HTML.
Look at the HTTP 1.1 RFC (2616) that I quoted heavily throughout this document
if you still have any questions although I must warn you that an RFC makes terrible
bedtime reading (I quoted the more gripping, on-the-edge-of-your-seat things...).
If you really liked this tutorial or have something other to say about it then
email me at firstname.lastname@example.org.
Copying this document
This document released under the condition that it is only reproduced, partially
or complete, in its original form and together with the name Nekogaimasu and
Here's some useful information I promised regarding status codes as taken from
The individual values of the numeric status codes defined for HTTP/1.1, and
an example set of corresponding Reason-Phrase's, are presented below. The reason
phrases listed here are only recommendations -- they MAY be replaced by local
equivalents without affecting the protocol.
HTTP status codes are extensible. HTTP applications are not required to understand
the meaning of all registered status codes, though such understanding is obviously
desirable. However, applications MUST understand the class of any status code,
as indicated by the first digit, and treat any unrecognized response as being
equivalent to the x00 status code of that class, with the exception that an
unrecognized response MUST NOT be cached. For example, if an unrecognized status
code of 431 is received by the client, it can safely assume that there was something
wrong with its request and treat the response as if it had received a 400 status
code. In such cases, user agents SHOULD present to the user the entity returned
with the response, since that entity is likely to include human-readable information
which will explain the unusual status.
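The "treat it as the x00 code of its class" rule is simple enough to show in a few lines of Python. This is just a sketch: KNOWN_CODES is a small illustrative subset I picked, not the full list of registered codes, and effective_status is a made-up name.

```python
# A small, illustrative subset of status codes this client understands.
KNOWN_CODES = {200, 301, 302, 304, 400, 401, 403, 404, 500, 503}


def effective_status(code, known=KNOWN_CODES):
    """Map an unrecognized code to the x00 code of its class."""
    if code in known:
        return code
    # Keep only the first digit: 431 -> 400, 299 -> 200, 599 -> 500.
    # (The RFC adds that such a response must not be cached.)
    return (code // 100) * 100


print(effective_status(404))  # recognized, stays 404
print(effective_status(431))  # unrecognized, treated as 400
```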