2002 January at Jython Journeys

HTTP Compression in python and jython

Index

Introduction
Can the browser handle GZIPped content?
GZIP vs ZLIB
Content-encoding vs Transfer-encoding
Content-length
Character translation on Windows and OS/2
OutputStream vs. Writer in jython servlets
Some links for related reading
Some sample CGI code
Some sample code for mod_python
Some sample code for jython servlets

Introduction

These are some notes on how to do HTTP compression, i.e. compress content before it is sent to the user. The benefits are

You transmit less data. If the amount of bandwidth you can serve is limited (e.g. you have a fixed allocation of X gig/month), then compressing your content before you send it can dramatically lower the amount you transmit.
Your client receives less data, meaning that pages load faster for them. This is particularly important for people using modems. (Although some modems already do compression. I have no idea how efficient modem compression is compared to gzip. See the Links section for a link to an article which discusses HTML compression vs. modem compression.)

Some points to remember

Although you save on bandwidth transmitted from your server, it comes at a cost of increased CPU usage on the server.
Compressing images probably isn’t worth it. Specialised image formats such as GIF, JPEG, and PNG are already compressed in a way that is optimised for the content they carry, and probably cannot be much improved on.
The best bandwidth saving will be achieved on textual files, i.e. text-based mime types such as text/html.

I’m not going to go into the details of the rationale or the mechanisms of compression here; I’m just going to stick to telling you how to do HTTP compression in python. If you want to know about the nuts and bolts of compression and HTTP, I suggest you read the material listed under "Some links for related reading".

Can the browser handle GZIPped content?

Not all browsers can handle compressed content. Old browsers that support HTTP version 0.9 are unlikely to support compressed content. Compression was optional in version 1.0 of the HTTP spec, so only some HTTP 1.0 clients support it, most notably versions 4.x of Netscape Navigator. Support for compression is compulsory in HTTP 1.1, so all HTTP 1.1 clients should support compression. HTTP 1.1 clients include Internet Explorer 5.0 and up. Although I don’t know if Internet Explorer 4.0 is HTTP 1.1 or 1.0, I’m sure IE 4.0 supports gzip compression.

Fortunately, you don’t have to guess or keep a list of which browsers do or do not support compression, since the browsers themselves will inform you of it in their HTTP request. Browsers that support compression will inform you via the HTTP header Accept-Encoding. Possible compression methods, as mentioned in RFC2616, include gzip, compress, deflate, and identity. Note that the last one, identity, means no compression. Note also that I’m only going to cover gzip in this document.

So if a browser supports compression, its HTTP request will include a header similar to one of the following

Accept-Encoding: gzip
Accept-Encoding: gzip, deflate

So you need to get the value of the Accept-Encoding header, and check if it contains gzip. If it does, then you can send gzip-compressed content to that browser. Here is some CGI code that checks the content of the header

import string
import os
 
acceptsGzip = 0
try:
    if string.find(os.environ["HTTP_ACCEPT_ENCODING"], "gzip") != -1:
        acceptsGzip = 1
except KeyError:
    # The browser sent no Accept-Encoding header at all.
    pass

And here is some mod_python code that checks the content of the header.

import string
 
def acceptsGzip(req):
    """
        Checks if a browser request indicates that the browser will accept
        gzipped content in reply.
    """
    if req.headers_in.has_key('accept-encoding'):
        encodings = req.headers_in['accept-encoding']
        return (string.find(encodings, "gzip") != -1)
    else:
        return 0

And here is some jython servlet code that checks the content of the header.

import string
 
def acceptsGzip(req):
    """
        Checks if a browser request indicates that the browser will accept
        gzipped content in reply.
    """
    encodings = req.getHeader('accept-encoding')
    if encodings != None:
        return (string.find(encodings, "gzip") != -1)
    else:
        return 0
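
A plain substring test treats a header such as "gzip;q=0" as acceptance, although RFC2616 lets a client refuse a coding by giving it a q-value of zero. The following is a minimal sketch of a stricter check, written for Python 3 rather than the older Python used elsewhere on this page. The function name accepts_gzip and its handling of malformed q-values are my own assumptions, not part of any of the APIs shown above.

```python
def accepts_gzip(header_value):
    """Return True if an Accept-Encoding header value permits gzip.

    Hypothetical helper: pass it the raw header string, or None when
    the header is absent.
    """
    if not header_value:
        return False
    wildcard_q = None
    for item in header_value.split(","):
        parts = item.strip().split(";")
        coding = parts[0].strip().lower()
        q = 1.0  # a coding listed without a q-value defaults to 1
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: treat a malformed q-value as a refusal
        if coding == "gzip":
            return q > 0
        if coding == "*":
            wildcard_q = q  # "*" covers codings that are not listed explicitly
    return wildcard_q is not None and wildcard_q > 0


print(accepts_gzip("gzip, deflate"))
print(accepts_gzip("gzip;q=0"))
```

The same function could be fed os.environ.get("HTTP_ACCEPT_ENCODING") in CGI, or the value returned by req.getHeader('accept-encoding') in a servlet.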

GZIP vs ZLIB

There are a few compression libraries that come with Python. The one which is used in HTTP compression is GZIP. ZLIB, although it is used by GZIP to do part of its job, should not be needed directly in your code.

When a file is compressed for transmission through http, it must be preceded by some special header bytes and followed by some special trailer bytes. Conveniently enough, the python GZIP module constructs exactly those headers and trailers. The details of what those headers and trailers are can be found in RFC1952.
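
To make the header and trailer bytes concrete, here is a small sketch (Python 3, using gzip.compress and zlib.compress from the standard library) that inspects them. It only checks the fields described in RFC1952: the two magic bytes, the method byte, and the trailer holding the CRC-32 and the uncompressed length. The sample data is just an illustration.

```python
import gzip
import struct
import zlib

data = b"<html><body><h1>hello compressed world!</h1></body></html>"

gz = gzip.compress(data)
zl = zlib.compress(data)

# RFC1952 header: the magic bytes 0x1f 0x8b, then the method byte
# (8 means deflate).
print(gz[:3] == b"\x1f\x8b\x08")

# RFC1952 trailer: CRC-32 of the uncompressed data, then its length
# modulo 2**32, both stored little-endian.
crc, size = struct.unpack("<II", gz[-8:])
print(crc == zlib.crc32(data), size == len(data))

# A bare zlib stream uses a different wrapper, so it does not start
# with the gzip magic bytes.
print(zl[:2] == gz[:2])
```

This is why the gzip module, and not zlib on its own, produces output that matches "Content-Encoding: gzip".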

In order to compress using the GZIP module, you’ll need to use code something like this.

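import cStringIO
import gzip
 
def compressBuf(buf):
    # Compress into an in-memory buffer rather than a file on disk.
    zbuf = cStringIO.StringIO()
    zfile = gzip.GzipFile(mode = 'wb',  fileobj = zbuf, compresslevel = 9)
    zfile.write(buf)
    # close() flushes the remaining data and writes the gzip trailer.
    zfile.close()
    return zbuf.getvalue()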

Note that this code compresses into a buffer held in memory, rather than a disk file. This is done through the use of cStringIO.StringIO().

You can vary the compression level by changing the value of the compresslevel parameter, with compresslevel = 9 giving the best compression but consuming the most CPU cycles, and compresslevel = 1 giving the least compression and also consuming the least CPU.
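
The trade-off is easiest to judge by measuring it on your own content. Below is a rough sketch in Python 3 that reports the gzipped size of a buffer at a few levels. The helper name gzip_sizes and the generated sample page are assumptions for illustration only, and the sizes you see will depend on your data and zlib version.

```python
import gzip


def gzip_sizes(buf, levels=(1, 6, 9)):
    """Map each compression level to the size of the gzipped buffer."""
    return {level: len(gzip.compress(buf, compresslevel=level))
            for level in levels}


# Sample content: a repetitive HTML table, generated here for the demo.
page = b"".join(
    b"<tr><td>%d</td><td>row number %d</td></tr>\n" % (i, i)
    for i in range(500)
)

for level, size in sorted(gzip_sizes(page).items()):
    print("level %d: %d bytes (from %d)" % (level, size, len(page)))
```

On typical HTML the gap between the middle and top levels tends to be small, which is one reason to try a lower level if CPU is the scarcer resource.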

Content-encoding vs Transfer-encoding

When you’re sending compressed content back to the browser, you have to inform the browser of the compression. This is done by the header Content-Encoding. So you should include a header in your response that looks like this.

Content-Encoding: gzip

You don’t need to read any further in this section. The note below about different methods of declaring encoding is just here for interest and completeness.

Note: There is another possible way to communicate the encoding: the Transfer-Encoding header. According to RFC2616, the difference between the two headers is as follows

Content-Encoding should be used when the encoding is a property of the content. So if you were serving a static file that is always compressed, then this is the header to use.
Transfer-Encoding should be used when the encoding is a property of the message used to transmit the content. So if you were sending a static file that is normally uncompressed, and were compressing it just so as to minimise bandwidth during the transmission, then this is the header you should use.

However, RFC2616 is unclear about which encoding to use if you are generating dynamic content, which may only be transient in memory and has no lifetime beyond the HTTP request-response pair it was constructed for. In such a situation, an argument could be made for using either Content-Encoding or Transfer-Encoding.

Rather than try to resolve that (potentially unresolvable) issue, I would just like to point out that the majority of software "out there" seems to have opted for using Content-Encoding, and that’s the choice that I’ve made as well. You are, of course, free to choose otherwise.

Content-Length

You also need to tell the client browser the length of the compressed content you are sending. This is done by sending a Content-Length header, like this.

Content-Length: xyz

where xyz is the COMPRESSED length of the content.

There seems to be some "folk wisdom" out there on the ‘net that you should send the uncompressed length of the content. This is wrong! RFC2616 is quite clear about this. If you are interested, read sections 7.2.2 Entity-Length and 4.4 Message Length.

As far as I can see, the definitive statement on this matter is at the end of section 4.4 (the relevant statement is the first sentence)

When a Content-Length is given in a message where a message-body
is allowed, its field value MUST exactly match the number of OCTETs
in the message-body. HTTP/1.1 user agents MUST notify the user when
an invalid length is received and detected.
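
In code, the rule boils down to taking the length after compression, not before. Here is a minimal sketch in Python 3. The function build_response and its (headers, payload) return shape are hypothetical, chosen only to keep the example independent of CGI, mod_python, or servlets.

```python
import gzip


def build_response(body, accepts_gzip):
    """Return (headers, payload) for an HTML body given as bytes."""
    headers = [("Content-Type", "text/html")]
    payload = body
    if accepts_gzip:
        payload = gzip.compress(body)
        headers.append(("Content-Encoding", "gzip"))
    # Content-Length counts the octets actually sent, so it is taken
    # from the payload after compression, never from the original body.
    headers.append(("Content-Length", str(len(payload))))
    return headers, payload


body = b"<html><body><h1>hello compressed world!</h1></body></html>" * 20
headers, payload = build_response(body, True)
for name, value in headers:
    print("%s: %s" % (name, value))
```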

Character translation on Windows and OS/2

If you’re working in CGI, and on Windows or on OS/2, you need to be careful about character translation.

As you may be aware, Windows and OS/2 are different from other platforms in the way that they represent an end of line. Whereas most platforms, including *nix, represent line ends as an ASCII linefeed (hex 0x0A, octal 012, python escape string ‘\n’), Windows and OS/2 represent end-of-line as a sequence of two characters, an ASCII carriage return followed by an ASCII linefeed (hex 0x0D 0x0A, octal 015 012, python escape string ‘\r\n’). Therefore, when you print anything in python, or write anything to sys.stdout, using code like this

import sys

print "Hello World!"
sys.stdout.write("Hello World!\n")

then both Windows and OS/2 filter the characters, and turn all linefeed characters into a sequence of carriage return followed by linefeed.

This is fine when you’re printing text. But when you’re trying to send binary information, particularly a compressed gzipped file, then this translation will corrupt the binary content, and your transmission of gzip compressed content will fail. Therefore, you have to disable this character translation in order for transmission of gzipped content to work.

When you’re working in CGI, the best way to do this is with the “-u” command line flag to python. This was kindly pointed out by Richie Hindle, who says

…… There is a much simpler way to switch off character translation of the standard
channels. The python interpreter accepts the -u switch to mean “make the
standard channels both unbuffered and binary.” This is tailor-made for
CGI – change your shebang line from “#!…python” to “#!…python -u” and
everything will work without changing your code (and without relying on
platform-specific modules like msvcrt). Responsiveness may even improve
due to the lack of buffering – and that’s also true on platforms like Unix
which don’t do character translation.

A less convenient method is to execute some platform specific code to disable character translation inside your script. On Windows (MSVC), you should execute some code like this

import msvcrt
import os
import sys
msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)

I don’t know what the equivalent code is on Windows (CygWin) or OS/2. If anyone does, email it to me, and I’ll include it in this page.
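
To see why the translation described in this section is fatal for gzip data, the following sketch (Python 3) mimics it in memory: it replaces every linefeed byte in a compressed buffer with carriage return plus linefeed, then tries to decompress the result. The helper names are hypothetical, and the loop simply searches for a sample whose compressed form happens to contain a 0x0A byte.

```python
import gzip
import zlib


def translate_newlines(raw):
    """Mimic text-mode output: every LF byte becomes CR LF."""
    return raw.replace(b"\n", b"\r\n")


def survives_translation(compressed):
    """True if the gzip data still decompresses correctly after translation."""
    try:
        damaged = gzip.decompress(translate_newlines(compressed))
    except (OSError, EOFError, zlib.error):
        return False
    return damaged == gzip.decompress(compressed)


# Find a sample whose compressed form contains at least one 0x0A byte.
for n in range(1, 500):
    text = "".join("line %d of the page\n" % i for i in range(n)).encode("ascii")
    compressed = gzip.compress(text, mtime=0)
    if b"\n" in compressed:
        break

print(b"\n" in compressed, survives_translation(compressed))
```

A single inserted byte is enough to break the stream, which is why the output channel has to be switched to binary before any compressed data is written.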

OutputStream vs. Writer in jython servlets.

Under the J2EE Servlet interface, you have a choice of two different ways to output generated content. They are

Through an OutputStream (obtained by calling ServletResponse.getOutputStream()). OutputStreams do not carry out character translation on their output.
Through a Writer (obtained by calling ServletResponse.getWriter()). Writers carry out character translation on their output, i.e. they will change the value of bytes in the content output by the servlet, to ensure that it meets the character encoding requirements of the client.

Since compressed gzip content is a (dense) binary format, none of the output bytes should be translated. If any bytes are translated, the output may be corrupted and the recipient may be unable to decode it. Therefore, you must output compressed content through a (non-translating) OutputStream object.
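
The same distinction can be illustrated outside of Java. The sketch below is only a rough Python 3 analogue, not servlet code: io.BytesIO stands in for an OutputStream, and io.TextIOWrapper stands in for a Writer with a UTF-8 character encoding. Pushing the gzip bytes through the text layer (here by pretending they are Latin-1 characters, which is an assumption made purely for the demonstration) rewrites every byte above 0x7F.

```python
import gzip
import io

compressed = gzip.compress(b"<html><body>hello</body></html>", mtime=0)

# Binary path: bytes go out untouched, like a servlet OutputStream.
binary_out = io.BytesIO()
binary_out.write(compressed)

# Text path: a character-encoding layer sits between the caller and the
# underlying bytes, like a servlet Writer.
raw = io.BytesIO()
text_out = io.TextIOWrapper(raw, encoding="utf-8", newline="")
text_out.write(compressed.decode("latin-1"))
text_out.flush()

print(binary_out.getvalue() == compressed)
print(raw.getvalue() == compressed)
```

The second comparison fails because the gzip magic byte 0x8b alone already becomes two bytes once it is encoded as UTF-8.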

Some links for related reading

W3C: Compression and Performance

Some sample CGI code

So, without further ado, here is some sample CGI code that will transmit compressed HTML to gzip capable browsers.

#! /path/to/python -u
 
import string
import os
import sys
import gzip
import cStringIO
 
def compressBuf(buf):
    zbuf = cStringIO.StringIO()
    zfile = gzip.GzipFile(mode = 'wb',  fileobj = zbuf, compresslevel = 6)
    zfile.write(buf)
    zfile.close()
    return zbuf.getvalue()
 
def testAcceptsGzip():
    acceptsGzip = 0
    try:
        if string.find(os.environ["HTTP_ACCEPT_ENCODING"], "gzip") != -1:
            acceptsGzip = 1
    except KeyError:
        # The browser sent no Accept-Encoding header at all.
        pass
    return acceptsGzip
 
def sendHtml(buf):
    sys.stdout.write("Content-type: text/html\r\n")
    if testAcceptsGzip():
        zbuf = compressBuf(buf)
        sys.stdout.write("Content-Encoding: gzip\r\n")
        sys.stdout.write("Content-Length: %d\r\n" % (len(zbuf)))
        sys.stdout.write("\r\n")
        sys.stdout.write(zbuf)
    else:
        sys.stdout.write("\r\n")
        sys.stdout.write(buf)
 
myHtml = """<html><body><h1>hello compressed world!</h1></body></html>"""
sendHtml(myHtml)

Some sample mod_python code

And here is some sample mod_python code that will transmit compressed HTML to gzip capable browsers.

import string
import os
import sys
import gzip
import cStringIO
from   mod_python import apache
 
def compressBuf(buf):
    zbuf = cStringIO.StringIO()
    zfile = gzip.GzipFile(mode = 'wb',  fileobj = zbuf, compresslevel = 6)
    zfile.write(buf)
    zfile.close()
    return zbuf.getvalue()
 
def testAcceptsGzip(req):
    if req.headers_in.has_key('accept-encoding'):
        encodings = req.headers_in['accept-encoding']
        return (string.find(encodings, "gzip") != -1)
    else:
        return 0
 
def handler(req):
    req.content_type = "text/html"
    myHtml = """<html><body><h1>hello compressed world!</h1></body></html>"""
    if testAcceptsGzip(req):
        zbuf = compressBuf(myHtml)
        req.headers_out['Content-Encoding'] = 'gzip'
        req.headers_out['Content-Length'] = '%d' % (len(zbuf))
        req.send_http_header()
        req.write(zbuf)
    else:
        req.send_http_header()
        req.write(myHtml)
    return apache.OK

Some sample jython servlet code

And here is some sample jython servlet code that will transmit compressed HTML to gzip capable browsers.

import  javax.servlet.http.HttpServlet
 
import  cStringIO
import  gzip
import  string
 
def compressBuf(buf):
    zbuf = cStringIO.StringIO()
    zfile = gzip.GzipFile(mode = 'wb',  fileobj = zbuf, compresslevel = 6)
    zfile.write(buf)
    zfile.close()
    return zbuf.getvalue()
 
def acceptsGzip(req):
    encodings = req.getHeader('accept-encoding')
    if encodings != None:
        return (string.find(encodings, "gzip") != -1)
    else:
        return 0
 
class compressor(javax.servlet.http.HttpServlet):
 
    def service(self, req, resp):
        resp.setContentType('text/html')
        myHtml = """<html><body><h1>hello compressed world!</h1></body></html>"""
        if acceptsGzip(req):
            binarychan = resp.getOutputStream()
            zbuf = compressBuf(myHtml)
            resp.setHeader('Content-Encoding', 'gzip')
            resp.setHeader('Content-Length', '%d' % len(zbuf))
            binarychan.write(zbuf)
        else:
            textchan = resp.getWriter()
            textchan.write(myHtml)

Written by alan.kennedy

January 13th, 2002 at 10:00 am

Posted in jython

Tagged with networking, performance, web technology
