Web working principles

Every time you open your browser, type a URL, and press enter, you will see a beautiful web page appear on your screen. But do you know what is happening behind this simple action?

Normally, your browser acts as a client. After you type a URL, it sends a query to a DNS server to get the IP address for the host name in the URL. It then finds the server at that IP address and asks to set up a TCP connection. Once the browser has finished sending its HTTP request, the server handles the request packages and returns HTTP response packages to the browser. Finally, the browser renders the body of the web page and disconnects from the server.

Figure 3.1 Process of a user visiting a website

A web server is also known as an HTTP server; it uses the HTTP protocol to communicate with clients. All web browsers can be seen as clients.

We can divide web working principles into the following steps:

The client uses the TCP/IP protocol to connect to the server.
The client sends HTTP request packages to the server.
The server returns HTTP response packages to the client. If the requested resource includes dynamic scripts, the server calls the script engine first.
The client disconnects from the server and starts rendering the HTML.

This is a simple workflow of an HTTP transaction. Notice that in this simple model the server closes the connection after it has sent its data to the client and then waits for the next request.
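The four steps above can be sketched by hand in Go. This is only an illustration, not how a browser is implemented: the helper names `fetchStatusLine` and `demo` are made up for this example, and the server is a throwaway local one from `net/http/httptest` so the sketch runs without internet access.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
)

// fetchStatusLine walks through the four steps against addr ("host:port")
// and returns the first line of the server's response.
func fetchStatusLine(addr string) (string, error) {
	// Step 1: the client opens a TCP connection to the server.
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	// Step 4: the client disconnects when this function returns.
	defer conn.Close()

	// Step 2: the client sends an HTTP request.
	// "Connection: close" asks the server to close after one response.
	_, err = fmt.Fprintf(conn, "GET / HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n", addr)
	if err != nil {
		return "", err
	}

	// Step 3: the server returns an HTTP response; read only its first line.
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimRight(line, "\r\n"), nil
}

// demo starts a local test server and fetches its status line.
func demo() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body>hello</body></html>")
	}))
	defer srv.Close()
	return fetchStatusLine(srv.Listener.Addr().String())
}

func main() {
	line, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(line) // HTTP/1.1 200 OK
}
```

In everyday Go code you would normally call `http.Get` instead; writing the request by hand here only makes the individual steps visible.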

URL and DNS resolution

We always use URLs to access web pages, but do you know how a URL works?

URL stands for Uniform Resource Locator, and it describes the location of a resource on the internet. Its basic form is as follows.

scheme://host[:port#]/path/.../[?query-string][#anchor]
scheme         the underlying protocol (such as HTTP, HTTPS, FTP)
host           IP address or domain name of the HTTP server
port#          the default port for HTTP is 80, and you can omit it in that case. To use another port, you must specify it. For example, http://www.cnblogs.com:8080/
path           path of the resource
query-string   data sent to the server
anchor         anchor (a position within the page)
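As a small sketch, Go's standard `net/url` package can split a URL into these parts. The helper name `urlParts` is made up for this example, and the path `/posts/list` and the anchor `top` are invented sample values.

```go
package main

import (
	"fmt"
	"net/url"
)

// urlParts splits raw into the components named in the table above.
func urlParts(raw string) (map[string]string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	return map[string]string{
		"scheme":       u.Scheme,
		"host":         u.Hostname(),
		"port":         u.Port(),
		"path":         u.Path,
		"query-string": u.RawQuery,
		"anchor":       u.Fragment,
	}, nil
}

func main() {
	// Sample URL; the path and anchor are hypothetical.
	raw := "http://www.cnblogs.com:8080/posts/list?name=test1&id=123456#top"
	parts, err := urlParts(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, k := range []string{"scheme", "host", "port", "path", "query-string", "anchor"} {
		fmt.Printf("%-13s %s\n", k, parts[k])
	}
}
```

When the port is omitted from the URL, `u.Port()` returns an empty string; the parser does not fill in the default of 80 for you.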

DNS is an abbreviation of Domain Name System. It is the naming system for computer network services, and it converts domain names to actual IP addresses, just like a translator.

Figure 3.2 DNS working principles

To understand more about how it works, let's look at the DNS resolution process in detail.

After you type the domain name www.qq.com in the browser, the operating system checks whether the hosts file contains a mapping for this domain name. If it does, domain name resolution is finished.
If the hosts file has no mapping, the operating system checks whether its DNS cache has an entry. If it does, domain name resolution is finished.
If neither the hosts file nor the DNS cache has a mapping, the operating system queries the first DNS server in your TCP/IP settings, which we call the local DNS server here. When the local DNS server receives the query, if the domain name is contained in its locally configured zone data, it returns the result to the client. This DNS resolution is authoritative.
If the local DNS server's zone data doesn't contain the domain name but its cache has a mapping, the local DNS server returns this result to the client. This DNS resolution is not authoritative.
If the local DNS server cannot resolve the domain name from either its zone data or its cache, the next step depends on the local DNS server's settings. If forward mode is not enabled, it sends the request to a root DNS server, which returns the IP address of a top-level DNS server that may know this domain name (.com in this case). If the first top-level DNS server doesn't know it, the local DNS server sends the request to the next top-level DNS server, until it reaches one that knows the domain name. The top-level DNS server then points to the next-level DNS server for qq.com, where the record for www.qq.com is found.
If the local DNS server has forward mode enabled, it sends the request to an upper-level DNS server. If the upper-level DNS server also doesn't know the domain name, it keeps forwarding the request upward. Whether or not the local DNS server uses forward mode, the IP address for the domain name is returned to the local DNS server, which sends it to the client.

Figure 3.3 DNS resolution work flow

In a recursive query, the enquirer changes during the process, because each server queries the next one on behalf of the previous enquirer. In an iterative query, the enquirer stays the same.

Now we know how clients get IP addresses in the end. Browsers communicate with servers through IP addresses.
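The lookup order described in the resolution steps (hosts file first, then the DNS cache, then the local DNS server) can be modelled with a toy resolver. This is a deliberately simplified sketch, not a real DNS client: `toyResolver` and its fields are made up for this example, the IP addresses are placeholders from the documentation range 192.0.2.0/24, and cache expiry is ignored.

```go
package main

import (
	"errors"
	"fmt"
)

// toyResolver models only the lookup order; it never touches the network.
type toyResolver struct {
	hosts    map[string]string                 // stands in for the hosts file
	cache    map[string]string                 // stands in for the DNS cache
	upstream func(name string) (string, error) // stands in for the local DNS server
}

// resolve returns the address for name and which source answered.
func (r *toyResolver) resolve(name string) (string, string, error) {
	if ip, ok := r.hosts[name]; ok {
		return ip, "hosts", nil
	}
	if ip, ok := r.cache[name]; ok {
		return ip, "cache", nil
	}
	if r.upstream == nil {
		return "", "", errors.New("no DNS server configured")
	}
	ip, err := r.upstream(name)
	if err != nil {
		return "", "", err
	}
	r.cache[name] = ip // remember the answer for the next lookup
	return ip, "dns server", nil
}

// newDemoResolver builds a resolver with placeholder data.
func newDemoResolver() *toyResolver {
	return &toyResolver{
		hosts: map[string]string{"intranet.example": "192.0.2.1"},
		cache: map[string]string{},
		upstream: func(name string) (string, error) {
			if name == "www.qq.com" {
				return "192.0.2.10", nil // placeholder, not the real address
			}
			return "", errors.New("unknown name: " + name)
		},
	}
}

func main() {
	r := newDemoResolver()
	for _, name := range []string{"intranet.example", "www.qq.com", "www.qq.com"} {
		ip, source, err := r.resolve(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s (answered by %s)\n", name, ip, source)
	}
}
```

The second lookup of www.qq.com is answered from the cache, so the upstream server is asked only once. For real lookups, Go programs would typically call `net.LookupHost` and let the operating system or Go's own resolver do this work.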

HTTP protocol

The HTTP protocol is the core part of web services. It's important to fully understand the HTTP protocol before you can understand how the web works.

HTTP is the protocol used for communication between browsers and web servers. It is based on the TCP protocol and usually uses port 80 on the web server side. It follows the request-response model: clients send requests and servers respond. According to the HTTP protocol, the client always initiates the connection and sends an HTTP request to the server in every transaction. The server cannot connect to the client proactively or through a callback connection. The connection between the client and the server can be closed by either side. For example, if you cancel a download, the browser closes the HTTP connection to the server before the download finishes.

The HTTP protocol is stateless, which means the server has no idea about the relationship between two connections even though they are both from the same client. To solve this problem, web applications use cookies to maintain state across connections.
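The following sketch shows how a cookie lets a server recognise a returning client. It is an illustration under assumptions: the cookie name `visits` and the helpers `visitHandler`, `fetchBody` and `cookieDemo` are invented for this example, and the server is a local test server from `net/http/httptest`.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
	"strconv"
)

// visitHandler counts visits per client using a cookie named "visits"
// (a hypothetical name chosen for this sketch).
func visitHandler(w http.ResponseWriter, r *http.Request) {
	n := 0
	if c, err := r.Cookie("visits"); err == nil {
		n, _ = strconv.Atoi(c.Value)
	}
	n++
	// Cookies are response headers, so set them before writing the body.
	http.SetCookie(w, &http.Cookie{Name: "visits", Value: strconv.Itoa(n), Path: "/"})
	fmt.Fprintf(w, "visit %d", n)
}

// fetchBody requests target with client c and returns the response body.
func fetchBody(c *http.Client, target string) (string, error) {
	resp, err := c.Get(target)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

// cookieDemo returns the replies seen by a client without a cookie jar
// and by a client with one, each making two requests.
func cookieDemo() ([]string, []string, error) {
	srv := httptest.NewServer(http.HandlerFunc(visitHandler))
	defer srv.Close()

	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, nil, err
	}
	plain := &http.Client{}               // forgets cookies between requests
	remembering := &http.Client{Jar: jar} // stores cookies and sends them back

	var without, with []string
	for i := 0; i < 2; i++ {
		a, err := fetchBody(plain, srv.URL)
		if err != nil {
			return nil, nil, err
		}
		b, err := fetchBody(remembering, srv.URL)
		if err != nil {
			return nil, nil, err
		}
		without = append(without, a)
		with = append(with, b)
	}
	return without, with, nil
}

func main() {
	without, with, err := cookieDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("without cookies:", without) // [visit 1 visit 1]
	fmt.Println("with cookies:   ", with)    // [visit 1 visit 2]
}
```

Without the cookie, every request looks like a first visit to the server; with it, the client itself carries the state from one request to the next.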

Because the HTTP protocol is based on the TCP protocol, attacks on TCP, such as SYN flood, DoS and DDoS, also affect HTTP communication with your server.

HTTP request package (browser information)

Request packages all have three parts: request line, request headers, and body. There is one blank line between the headers and the body.

GET /domains/example/ HTTP/1.1      // request line: request method, URL, protocol and its version
Host: www.iana.org                  // domain name
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4            // browser information
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8    // MIME types that the client can accept
Accept-Encoding: gzip,deflate,sdch  // compression encodings that the client supports
Accept-Charset: UTF-8,*;q=0.5       // character sets that the client accepts
// blank line
// body, request resource arguments (for example, arguments in POST)

We use Fiddler to capture the following request information.

Figure 3.4 Information of GET method caught by fiddler

Figure 3.5 Information of POST method caught by fiddler

We can see that the GET request doesn't have a request body but the POST request does.

There are many methods you can use to communicate with servers in HTTP; GET, POST, PUT and DELETE are the four basic ones. A URL represents a resource on the network, so these four methods correspond to query, change, add and delete operations on that resource. GET and POST are the most commonly used. GET appends data to the URL, using ? to separate the data from the path and & to separate arguments, as in EditPosts.aspx?name=test1&id=123456. POST puts data in the request body. Because browsers limit the length of a URL, POST can submit much more data than GET. Also, when we submit a user name and password, we don't want this information to appear in the URL, so we use POST to keep it out of the URL.

HTTP response package (server information)

Let's see what information is contained in the response packages.

HTTP/1.1 200 OK                     // status line
Server: nginx/1.0.8                 // web server software and its version in the server machine
Date: Tue, 30 Oct 2012 04:14:25 GMT // time of the response
Content-Type: text/html             // type of the response data
Transfer-Encoding: chunked          // the data is sent in chunks
Connection: keep-alive              // keep the connection open
Content-Length: 90                  // length of the body
// blank line
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"... // message body

The first line is called the status line. It contains the HTTP version, the status code and the status message.

The status code tells the HTTP client whether the HTTP server produced the expected response. HTTP/1.1 defines five classes of status code.

- 1xx Informational
- 2xx Success
- 3xx Redirection
- 4xx Client Error
- 5xx Server Error

Let's look at more examples of response packages: 200 means the server responded successfully, and 302 means redirection.
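As a short sketch, the status line can be parsed with `http.ReadResponse`, and the class of a status code follows from its first digit. The helpers `statusClass` and `parseStatus` are made up for this example, and the raw response text is a hand-written sample rather than a captured one.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// statusClass maps a status code to one of the five classes listed above.
func statusClass(code int) string {
	switch code / 100 {
	case 1:
		return "Informational"
	case 2:
		return "Success"
	case 3:
		return "Redirection"
	case 4:
		return "Client Error"
	case 5:
		return "Server Error"
	default:
		return "Unknown"
	}
}

// parseStatus reads a raw HTTP response and returns the three parts of its
// status line: protocol version, numeric code, and "code message" text.
func parseStatus(raw string) (string, int, string, error) {
	resp, err := http.ReadResponse(bufio.NewReader(strings.NewReader(raw)), nil)
	if err != nil {
		return "", 0, "", err
	}
	defer resp.Body.Close()
	return resp.Proto, resp.StatusCode, resp.Status, nil
}

func main() {
	// Hand-written sample response; "/new" is a hypothetical redirect target.
	raw := "HTTP/1.1 302 Found\r\nLocation: /new\r\nContent-Length: 0\r\n\r\n"
	proto, code, status, err := parseStatus(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(proto, "|", status, "|", statusClass(code)) // HTTP/1.1 | 302 Found | Redirection
}
```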

Figure 3.6 Full information for visiting a website

HTTP is stateless and Connection: keep-alive

Stateless doesn't mean that the server cannot keep a connection open. It means that the server doesn't know of any relationship between any two requests.

In HTTP/1.1, keep-alive is used by default. If a client has more requests to send, it uses the same connection for many different requests.

Notice that keep-alive cannot keep a connection open forever. The server software keeps a connection open only for a certain time, and you can change this setting.
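Connection reuse can be observed from the client side with the standard `net/http/httptrace` package. The sketch below is one possible way to check it, using a local test server; the helper name `reuseDemo` is made up for this example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// reuseDemo sends n sequential requests to a local test server and records,
// for each request, whether the client reused an existing connection.
func reuseDemo(n int) ([]bool, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()

	client := srv.Client()
	reused := make([]bool, 0, n)
	for i := 0; i < n; i++ {
		// GotConn fires once the client has a connection for this request.
		trace := &httptrace.ClientTrace{
			GotConn: func(info httptrace.GotConnInfo) {
				reused = append(reused, info.Reused)
			},
		}
		req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
		if err != nil {
			return nil, err
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		// Read the body to the end and close it; otherwise the client
		// cannot hand the connection to the next request.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return reused, nil
}

func main() {
	reused, err := reuseDemo(3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("connection reused per request:", reused)
}
```

The first request should report `false` because a new connection has to be opened, and the following requests should report `true` as long as the server keeps the connection alive. On the server side, Go's `http.Server` has an `IdleTimeout` field that limits how long an idle keep-alive connection stays open.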

Request instance

Figure 3.7 All packages for opening one web page

We can see the whole process of communication between the client and the server in the picture above. You may notice that there are many resource files in the list. They are called static files, and Go has specialized processing methods for them.

This is the most important function of a browser: it requests a URL, gets data from the web server, and then renders the HTML into a good user interface. If it finds references to other files in the DOM, such as CSS or JS files, the browser requests those resources from the server as well, until all the resources have been rendered on your screen.

Reducing the number of HTTP requests is one way to improve the loading speed of web pages. Reducing the size of CSS and JS files also reduces the pressure on web servers.
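The text above does not say which Go facility handles the static files mentioned earlier. One option in the standard library is `http.FileServer`, sketched below as an assumption rather than as the approach this book goes on to use. The helper name `staticDemo`, the file name `style.css` and its contents are invented for this example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
)

// staticDemo writes a small CSS file to a temporary directory, serves that
// directory with http.FileServer, and fetches the file back over HTTP.
// It returns the Content-Type header and the body of the response.
func staticDemo() (string, string, error) {
	dir, err := os.MkdirTemp("", "static")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	css := "body { margin: 0; }\n"
	if err := os.WriteFile(filepath.Join(dir, "style.css"), []byte(css), 0o644); err != nil {
		return "", "", err
	}

	// FileServer maps URL paths to files below dir.
	srv := httptest.NewServer(http.FileServer(http.Dir(dir)))
	defer srv.Close()

	resp, err := srv.Client().Get(srv.URL + "/style.css")
	if err != nil {
		return "", "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", "", err
	}
	return resp.Header.Get("Content-Type"), string(body), nil
}

func main() {
	contentType, body, err := staticDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("Content-Type:", contentType)
	fmt.Print(body)
}
```

The file server chooses the `Content-Type` from the file extension, so the browser knows that the response is a stylesheet without the handler setting any header by hand.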

Links

Directory
Previous section: Web foundation
Next section: Build a simple web server
