Message Encoding in REST

The basic HTTP message framing described in my last post has some performance issues. You can think of advanced message framing techniques like chunking, compression, and multipart messages as performance optimizations. The HTTP specification calls them encodings because they alter (encode) the message body for transmission and reconstruct it at arrival. All HTTP 1.1 libraries and frameworks support chunkig, most support compression, and many can handle multipart messages too.

Chunking

You need to know the length of the message body to use the Content-Length header. While this is not an issue with static content like HTML files, it creates performance problems with dynamically generated REST messages. Not only that the entire message body has to be buffered in memory, but the network sits unused during message generation and the CPU is idle during transmission. The performance improves significantly if the two steps are combined and message generation or processing is done in parallel with transmission. To support this, a new message length prefixing model, called chunked transfer encoding, was introduced in HTTP 1.1. It is described in RFC 2616, section 3.6.1

Chunked transfer encoding is length prefixing applied not to the entire message body, but to smaller parts of it (chunks). A chunk starts with the length of data in hexadecimal format separated by a CRLF (carriage return and line feed) sequence from the actual chunk data. Another CRLF pair marks the end of the chunk. After a series of chunks a zero-length chunk signals the end of the message. The presence of the Transfer-Encoding header containing the value “chunked” and the absence of the Content-Length header tell the receiver to read the message body in chunks (Figure 1).

Figure 1: Compressed and chunked HTTP response

Figure 1: Compressed and chunked HTTP response

You don’t need to take any action to enable chunking. Basic message framing is used only for short messages. Longer messages are automatically chunked by HTTP 1.1 implementations. Just remember not to depend on the Content-Length header for message processing because it is not always present. Structure your code so that you can begin processing before receiving the full message body. Use streaming and event-based programming techniques, for example event-based JSON and XML parsers.

Compression

Compression is a data optimization technique used to reduce message sizes, thus network bandwidth usage. It also speeds up message delivery. Compression looks for repeating patterns in data streams and replaces their occurrences with shorter placeholders. How effective this is depends on the type and volume of data. XML and JSON compress quite well, by 60% or more for messages over two kilobytes in length. Experience shows that the generic HTTP compression algorithms are just as effective as specially developed JSON- and XML-aware algorithms.

HTTP compresses the message body before sending it and un-compresses it at arrival. Headers are never compressed. If present, the Content-Length header shows the number of the actual (compressed) bytes sent. Similarly, chunking is applied to the already compressed stream of bytes (Figure 1). Since compression is optional in HTTP 1.1, clients must list the compression algorithms they support in the Accept-Encoding header to receive compressed messages. Likewise, REST services should not send compressed content unless the client is prepared to receive it. Compressed messages are sent with a Content-Encoding header identifying the compression algorithm used.

Client support for HTTP compression is good since well-known public domain algorithms are used. On Android, you can receive gzip compressed messages with the Spring for Android RestTemplate Module. RestKit for Apple iOS is built on top of NSURLConnection, which provides transparent gzip/deflate support for response bodies. On Blackberry you can use HttpConnection with with GZipInputStream. When writing client-side Javascript the XmlHttpRequest implementation decompresses messages automatically.

Multipart messages

As the name suggests, multipart messages contain parts of different types within as a single message body. This helps reduce protocol chattiness, eliminating the need to retrieve each part with a separate HTTP request. In REST this technique is called batching.

Figure 2: Multipart HTTP response

Figure 2: Multipart HTTP response

Multipart encoding was originally developed for email attachments (Multipurpose Internet Mail Extensions or MIME) and later extended to HTTP and the Web. It is described in six related RFC documents: RFC 2045, RFC 2046, RFC 2047, RFC 4288, RFC 4289 and RFC 2049. Figure 2 shows a multipart message example. To receive multipart messages, clients must list the multipart formats they support in the Accept header. The Content-Type header identifies the type of multipart message sent (related, mixed, etc.) and also contains the byte pattern used to mark the boundary between message parts. The mime type of each message part is transmitted separately as shown in Figure 2.

Building and parsing multipart messages is difficult unless supported out-of-the box by libraries or frameworks. You cannot expect clients to write code for it from scratch so you must provide alternative ways of retrieving the data. Common alternatives are embedding links into messages to let clients retrieve parts separately or using JSON or XML collections for embedding multiple message parts of the same type.

REST message examples are always shown in documentation with basic message framing because it is easy to read. Nonetheless, many real-life REST protocols use a combination of chunking, compression, and multipart encoding for better performance. When you develop REST clients, remember that compression and multipart encoding are optional. REST services won’t use them unless the client sends them the proper Accept and Accept-Encoding headers. When you design REST services, consider compressing large messages to save bandwidth and using multipart messages to reduce protocol chattiness.

Creative Commons Licence
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 Canada License.

Advertisements

Message Framing in REST

Most REST designers take message framing for granted; something they get for free from HTTP and don’t need to worry about because it just works. You are probably wondering what motivated me to write about such an obvious and unimportant topic. I wanted to show that REST exposes message framing details to clients. This can cause some issues and may influence your REST design decisions.

The need for message framing

That you cannot send messages directly over TCP is the first difficulty in application protocol design. There is no “send message” function. You read from an input stream by calling a “receive” method, and you write to an output stream by calling a “send” method. However, you cannot assume that a single “send” will result in a single “receive”. Message boundaries are not preserved in TCP.

HTTP uses a combination of message framing techniques, delimiters and prefixing, to send messages over TCP (Figure 1).

Figure 1: HTTP message framing

Figure 1: HTTP message framing

Delimiters are predetermined markers placed inside and around messages to help the protocol separate messages and message parts from each other when reading from an input stream. The carriage return – new line (/r/n) pair of characters divide the ASCII character stream of the HTTP message header into lines. Another delimiter, white space, divides the request and status lines into sections. A third delimiter, the colon, separates header names from header values. An empty line marks the end of the header and the beginning of the (optional) message body.

Prefixing works by sending in the first part of messages information about the remaining, variable part. HTTP uses headers for prefixing, instructing the protocol implementation how to handle the message body. Length prefixing is the most important: the Content-Length header tells the protocol implementation how many bytes to read before it reaches the end of the message body.

The message framing details are clearly visible when you look at REST messages (Figure 1) and are also partially exposed in code that generates them (Listing 1).

Not a text-based protocol

That HTTP is a text-based protocol is a widespread misconception. Only the header section is sent as ASCII characters, the message body is sent as a sequence of bytes. This has the consequence that sending text in the message body is not quite as straightforward as you might expect. You need to ensure that both client and server use the same method when converting text to bytes and back.

It is much safer to be explicit about the character set used by setting and reading the character set from the Accept, Accept-Charset, and Content-Type headers than relying on defaults. Client libraries and server frameworks are partially to blame for the text-based protocol misconception because they attempt to convert text using a default character set. The Apache client library uses ISO-8859-1 by default as required by RFC2616 section 3.7.1, but this obviously can cause problems if the server is sending JSON using UTF-8.

Listing 1: Sending a text message to a URL in Java using the Apache HTTP client library

    /**
     * Sends a text message to the given URL
     * @param message the text to send
     * @param url where to send it to
     * @return true if the message was sent, false otherwise
     **/
    public static boolean sendTextMessage(String message, String url) {
        boolean success = false;
        HttpClient httpClient = new DefaultHttpClient();
        try {
            HttpPost request = new HttpPost(url);

            BasicHttpEntity entity = new BasicHttpEntity();
            byte[] content = message.getBytes("utf-8");
            entity.setContent(new ByteArrayInputStream(content));
            entity.setContentLength(content.length);
            request.setEntity(entity);
            request.setHeader("Content-Type", "text/plain; charset=utf-8");

            HttpResponse response = httpClient.execute(request);

            StatusLine statusLine = response.getStatusLine();
            int statusCode = statusLine.getStatusCode();
            success = (statusCode == 200);
        } catch (IOException e) {
            success = false;
        } finally {
           httpClient.getConnectionManager().shutdown();
        }
        return success;
    }

Restrictions on headers

The use of delimiters for message framing limits what data can be safely sent in HTTP headers. You will find these limitations in RCF 2616, section 2.2 and section 4.2, but here is a short summary:

  • All data need to be represented as ASCII characters
  • The delimiter characters used for message framing cannot appear in header names or values
  • Header names are further limited to lowercase and uppercase letters and the dash character
  • There is also a maximum limit on the length of each header, typically 4 or 8 KB
  • It is a convention to start all custom header names not defined in RFC2616 with “X-”

You might occasionally encounter message framing errors because some client library implementations expose the headers without enforcing these rules. If a HTTP framework or intermediary detects a framing error, it discards the request and returns the “400 Bad Request” status code. What may be even worse though, every so often a malformed message will get through, causing weird behavior or a “500 Internal Error” status code and some incomprehensible internal error message. To avoid such hard-to-trace errors do not attempt to send in HTTP headers any data which:

  • comes from user input
  • is shown to the user
  • is persisted
  • can grow in size uncapped
  • you have no full control over (it is generated by third-party libraries or services)

Keeping protocol data separate from application data

Notice that I did not say don’t use headers at all. Many REST protocols chose not to use them, but this may not be the wisest protocol design decision. Headers and body serve distinct roles in protocol design and both are important.

The message header carries information needed by the protocol implementation itself. The headers tells the protocol what to do, but do not necessarily show what the application is doing. If you are sniffing the headers you are not likely to capture any business information collected and stored by an application.

The message body is where the application data is sent, but it has no or very little influence on how the protocol itself works. Protocols typically don’t interpret the data sent in message bodies and treat it as opaque streams of bytes.

Sending protocol data in the message body creates strong couplings between the various parts of the application, making further evolution difficult. Once I asked someone to return the URI of a newly created resource in the Content-Location header of a POST response, a common HTTP practice. “There is no need”, he said, “the URI is already available as a link in the message body”. This was true, of course, but the generic protocol logic in which I needed this URI was up till then completely independent of any resource representations. Forcing it to parse the URI out of the representations meant that it will likely break the next time the representations changed.

Conclusion

I hope I managed to convince you that message framing in REST is not a mere implementation detail you can safely ignore.  Becoming familiar with how it works can help you avoid some common pitfalls and design more robust REST APIs. I discussed only basic HTTP message framing so far. In my next post I’ll talk about more advanced topics like chunking, compression, and multipart messages.

Creative Commons Licence
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 Canada License.