Challenges of remote interface design
February 21, 2012
If you listened to some middleware vendors, you’d believe that distributed applications can be easily built by making local APIs remotely accessible. Indeed, as this 10-minute video tutorial illustrates, you can make a local API remotely accessible by simply adding a few annotations to your code. Vendors have products to sell, and their focus on promoting ease of use is not entirely surprising.
The problem is that this approach does not work. This was convincingly demonstrated in classic papers such as “A Critique of the Remote Procedure Call Paradigm” by Professors Andrew S. Tanenbaum and Robert van Renesse (1988) and, more recently, “A Note on Distributed Computing” by Jim Waldo and his colleagues at Sun Microsystems Laboratories (1994).
A simple example will illustrate why. Consider a local FIFO task queue used in desktop applications to maintain the responsiveness of the user interface while processing computation-intensive tasks (Listing 1). The user interface thread creates and places tasks in the queue and a background thread processes them asynchronously (Listing 2).
Listing 1:
/**
 * A FIFO queue of tasks to be executed by a background thread
 * Note: This queue implementation is not thread safe
 */
public class TaskQueue {

    /**
     * @return True if the queue is empty, False otherwise
     */
    public boolean isEmpty() {...}

    /**
     * Places a task into the queue
     * @param task the task to execute
     */
    public void putTask(Task task) {...}

    /**
     * Retrieves the next task from the queue
     * @return a task to execute
     */
    public Task getTask() {...}
}
Listing 2:
/**
 * Simple client showing the use of the task queue
 */
public class Client {

    final TaskQueue queue = new TaskQueue();

    /**
     * Called when the user chooses to print from the GUI
     */
    public void onPrint() {
        Task printTask = new PrintTask();
        synchronized (queue) {
            queue.putTask(printTask);
            queue.notifyAll();
        }
    }

    /**
     * This method runs in its own thread
     * @throws InterruptedException signals the thread to exit
     */
    public void processTasks() throws InterruptedException {
        Task task;
        while (true) {
            synchronized (queue) {
                while (queue.isEmpty()) {
                    queue.wait();
                }
                task = queue.getTask();
            }
            task.run();
        }
    }
}
Let’s say that we want to move the processing of tasks to a different computer. As platform vendors would quickly point out, making the queue interface (Listing 1) remotely accessible is easy with a wide variety of technologies. The difficult problems are latency, the possibility of partial failures, the lack of shared memory, and synchronization.
Network latency degrades the performance of some API calls by orders of magnitude, rendering them practically unusable. The local queue works well because the overhead of a call to the local putTask() method is both small and predictable. If the queue is on a different computer, a remote call is neither quick nor predictable. Transient network events like packet loss or congestion may block a call for a significant time. The queue exists to keep the user interface from freezing up; once putTask() itself can block the user interface thread, moving the queue to a server defeats its purpose.
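The effect on the calling thread can be sketched with a simulated delay. This is only an illustration under stated assumptions: SimulatedRemoteQueue is a hypothetical class, Thread.sleep() stands in for the network, and the 50 ms round trip is an invented figure, not a measurement.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Measures how long a caller is blocked by a simulated remote putTask(). */
class LatencyDemo {

    /** Hypothetical stand-in for a remote queue stub; the delay is invented. */
    static class SimulatedRemoteQueue {
        private final Queue<Runnable> tasks = new ArrayDeque<>();
        private final long roundTripMillis;

        SimulatedRemoteQueue(long roundTripMillis) {
            this.roundTripMillis = roundTripMillis;
        }

        void putTask(Runnable task) {
            try {
                Thread.sleep(roundTripMillis); // pretend network round trip
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt
            }
            tasks.add(task);
        }
    }

    /** Returns the time the calling thread spent inside putTask(), in ms. */
    static long blockedMillis(long roundTripMillis) {
        SimulatedRemoteQueue queue = new SimulatedRemoteQueue(roundTripMillis);
        Runnable task = () -> { }; // created before timing starts
        long start = System.nanoTime();
        queue.putTask(task);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("local call blocked for ~" + blockedMillis(0) + " ms");
        System.out.println("simulated remote call blocked for ~" + blockedMillis(50) + " ms");
    }
}
```

If onPrint() made such a call on the user interface thread, the interface would be unresponsive for the whole round trip, and a real network offers no upper bound on that time.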
Partial failures lead to non-deterministic behavior and potential data loss. Assume that we kept the queue on the client to eliminate the latency issue discussed above. What happens if the network fails while the server calls the remote getTask() method? If the failure happens before the call is processed by the queue, the server can safely retry it. However, if the failure occurs after the task was removed from the queue, but before it was delivered to the server, the task is lost, and a retry will get the next task from the queue. Since we can no longer use the queue to reliably submit tasks for processing, keeping the queue on the client does not work either.
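The lost-task scenario can be reproduced in a few lines. The sketch below is a simulation, not real networking code: UnreliableRemoteQueue, ReplyLostException, and dropNextReply() are hypothetical names, and tasks are plain strings to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Simulates a reply that is lost after the queue has removed the task. */
class LostTaskDemo {

    /** Hypothetical signal that the reply never reached the caller. */
    static class ReplyLostException extends RuntimeException { }

    /** Hypothetical client-side queue reached over an unreliable link. */
    static class UnreliableRemoteQueue {
        private final Queue<String> tasks = new ArrayDeque<>();
        private boolean dropNextReply;

        void putTask(String task) { tasks.add(task); }

        boolean isEmpty() { return tasks.isEmpty(); }

        void dropNextReply() { dropNextReply = true; }

        /** Removes the head task first, then "sends" it; the send may fail. */
        String getTask() {
            String task = tasks.remove();
            if (dropNextReply) {
                dropNextReply = false;
                throw new ReplyLostException(); // task already removed
            }
            return task;
        }
    }

    /** Server-side loop that simply retries every failed call. */
    static List<String> drain(UnreliableRemoteQueue queue) {
        List<String> received = new ArrayList<>();
        while (!queue.isEmpty()) {
            try {
                received.add(queue.getTask());
            } catch (ReplyLostException e) {
                // The server cannot tell whether the task was removed; it retries.
            }
        }
        return received;
    }

    static List<String> run() {
        UnreliableRemoteQueue queue = new UnreliableRemoteQueue();
        queue.putTask("print-1");
        queue.putTask("print-2");
        queue.putTask("print-3");
        queue.dropNextReply();
        return drain(queue);
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [print-2, print-3]
    }
}
```

The retry succeeds from the server's point of view, yet "print-1" is never executed and neither side receives any indication that it was lost.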
Lack of shared memory means that object references, pointers, or global state cannot be handled transparently. When a task is transferred to a different computer, the question arises of what to do with the other in-memory objects it references. While sometimes it makes sense to move the referenced objects with the task, just as often we need to look up the corresponding objects on the server. No technology can automatically handle all situations correctly, meaning that unless we change our queue and task implementations, some tasks will fail to execute on a remote server.
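One way to see the problem is to pass a task through Java serialization, which is roughly what a by-value transfer to another machine does. In this sketch PrinterSettings, the copies field, and sendToServer() are hypothetical, and this PrintTask is a simplified stand-in for the one in Listing 2; the "server" is just a deserialized copy in the same process.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Shows that a transferred task no longer shares state with the client. */
class SharedStateDemo {

    /** Hypothetical object shared between the GUI and its tasks. */
    static class PrinterSettings implements Serializable {
        private static final long serialVersionUID = 1L;
        int copies = 1;
    }

    /** Simplified stand-in for a task that references shared settings. */
    static class PrintTask implements Serializable {
        private static final long serialVersionUID = 1L;
        final PrinterSettings settings;

        PrintTask(PrinterSettings settings) {
            this.settings = settings;
        }
    }

    /** Serializes and deserializes the task, as a by-value transfer would. */
    static PrintTask sendToServer(PrintTask task) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(task);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (PrintTask) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the number of copies the transferred task sees. */
    static int copiesSeenByServer() {
        PrinterSettings settings = new PrinterSettings();
        PrintTask serverSideTask = sendToServer(new PrintTask(settings));
        settings.copies = 5; // the user changes the setting after submitting
        return serverSideTask.settings.copies;
    }

    public static void main(String[] args) {
        System.out.println(copiesSeenByServer()); // prints 1
    }
}
```

With the local queue the task would observe the updated value of 5, because it holds a reference to the same object. After the transfer it holds a private copy, so the same code silently behaves differently.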
The absence of reliable remote synchronization primitives makes even relatively simple local behaviors difficult to replicate in distributed environments. The two local threads use the built-in Java synchronization primitives to communicate with each other (Listing 2). These primitives do not work remotely: the server cannot hold the monitor of an object that lives on the client, nor wait on it. Server code written as shown won’t work. We need to rewrite it, perhaps polling the queue at regular intervals. But now the calls to the queue are no longer synchronized, and unless we make the queue implementation fully thread-safe, fatal data corruption will likely occur.
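A polling rewrite might look roughly like the sketch below. It is an assumption-laden illustration rather than a recommended design: LockingTaskQueue is a hypothetical variant of the queue from Listing 1 that takes its own lock inside every method, since remote callers cannot share a monitor with the client, and the poll count and interval are arbitrary.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

/** Replaces wait()/notifyAll() with polling at a fixed interval. */
class PollingDemo {

    /** Hypothetical remotely exposed queue that locks inside every method. */
    static class LockingTaskQueue {
        private final Queue<Runnable> tasks = new ArrayDeque<>();

        synchronized boolean isEmpty() { return tasks.isEmpty(); }

        synchronized void putTask(Runnable task) { tasks.add(task); }

        synchronized Runnable getTask() { return tasks.remove(); }
    }

    /** Polls up to maxPolls times and returns how many tasks were run. */
    static int processByPolling(LockingTaskQueue queue, int maxPolls, long intervalMillis) {
        int processed = 0;
        for (int poll = 0; poll < maxPolls; poll++) {
            // Two separate calls: safe only while this is the sole consumer.
            if (!queue.isEmpty()) {
                queue.getTask().run();
                processed++;
            } else {
                try {
                    Thread.sleep(intervalMillis); // nothing to do, wait and retry
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return processed;
    }

    /** Queues three counting tasks and returns how many actually ran. */
    static int demo() {
        LockingTaskQueue queue = new LockingTaskQueue();
        AtomicInteger printed = new AtomicInteger();
        for (int i = 0; i < 3; i++) {
            queue.putTask(printed::incrementAndGet);
        }
        processByPolling(queue, 5, 10);
        return printed.get();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 3
    }
}
```

Even this small change alters the behavior of the API: a new task now waits up to one polling interval before it is picked up, and the isEmpty()/getTask() pair stops being safe as soon as a second consumer appears.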
So what does all of this boil down to in the end? It shows that all technologies used to make local APIs remotely accessible leak. By leaking I mean that they change the API behavior. And two APIs with different behaviors are not the same API, even if they look the same. Joel Spolsky calls this the Law of Leaky Abstractions, explaining that “one reason the law of leaky abstractions is problematic is that it means that abstractions do not really simplify our lives as much as they were meant to. Code generation tools which pretend to abstract out something, like all abstractions, leak, and the only way to deal with the leaks competently is to learn about how the abstractions work and what they are abstracting. So the abstractions save us time working, but they don’t save us time learning.”
The “don’t save us time learning” part is the reason why I don’t like calling the design of remote software interfaces “API design”. While technically correct, this choice of words encourages a false and somewhat dangerous sense of familiarity and comfort. I’ve found a better concept, one that does not sweep the important issues of latency, partial failures, lack of shared memory, and synchronization under the rug. I’ll talk about this concept in the next installment.
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 Canada License.