Hi,
I was wondering how best to use IOLib to write an efficient HTTP server that can handle perhaps 10,000+ simultaneous connections. It seems like iolib has all the right ingredients: a system-level sockets interface and an I/O multiplexer that uses epoll/kqueue for efficiently querying sockets. There is quite a bit of code already written, so I was hoping for some advice about how this would best be implemented.
Here is a possible architecture for a server that can handle tons of connections at once:
A Lisp thread initially sets up a passive socket (server socket) to listen for connections on some port. There are a few worker threads (on the order of the number of processors in the machine). When a connection is received, one of these worker threads dequeues an active socket with ACCEPT.
However, after the initial connection, all the HTTP headers and content must be read from the socket. Presumably not all the data will be ready as soon as a connection is received, and read operations will block if allowed to. While it waits for the full HTTP request to come across the wire, the worker thread could be accepting new connections or processing older ones where the full request is available. To send HTTP responses off quickly, writes to sockets should also never block: if we try to send more bytes than a socket can handle, we should handle that asynchronously so the worker can get on to the next thing to do.
So, a worker thread will either 1) be processing a request (arbitrary Lisp code to respond appropriately to an HTTP request). When a response is ready, it should be written to some non-blocking gray stream.
2) be waiting for the next of the following events:
a) socket writable. Some gray stream that was written to in (1) but blocked now has enough room in its buffers to allow more data to be sent immediately (i.e., a "would block" was returned by an earlier send on it).
b) socket readable. Some socket that we are listening to (gray stream) has more data available. When this data is sufficient to respond to the request, this connection becomes eligible for the processing described in (1).
c) socket accepted. Some socket is available in the queue of the passive socket. We can now begin listening to the socket's read events for processing as in (2.b) or process the socket as in (1).
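To make the model concrete, here is roughly what I imagine the event-loop side looking like with iolib. I'm only guessing at the multiplexer API from a quick look at the source, and HANDLE-READABLE is just a placeholder for application code, so please correct me where this is off:

  (iolib:with-event-base (base)
    (let ((server (iolib:make-socket :connect :passive
                                     :address-family :internet
                                     :type :stream
                                     :local-port 8080
                                     :reuse-address t)))
      ;; 2.c -- the passive socket should become readable when a
      ;; connection is waiting in its queue
      (iolib:set-io-handler
       base (iolib:socket-os-fd server) :read
       (lambda (fd event exception)
         (declare (ignore fd event exception))
         (let ((client (iolib:accept-connection server :wait nil)))
           (when client
             ;; 2.b -- watch the new connection for readability; a
             ;; :write handler would be registered the same way for
             ;; 2.a when a send could not complete
             (iolib:set-io-handler
              base (iolib:socket-os-fd client) :read
              (lambda (fd event exception)
                (declare (ignore fd event exception))
                (handle-readable client)))))))
      (iolib:event-dispatch base)))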
My question is how this sort of processing model could best be implemented on top of iolib. Do passive sockets generate epoll/kqueue events when a new connection is available to accept? If so, it seems like the multiplexer could be used to listen for events 2.a, 2.b, and 2.c all simultaneously.
I see there are some gray stream implementations in the code right now, though I have not figured out how to use them. How do I, for example, create a stream with an underlying socket? Could these sockets work with the multiplexer implementation to accomplish the processing model described?
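For instance, is something like the following supposed to work? (Again, I'm just guessing at WITH-OPEN-SOCKET's arguments here.)

  (iolib:with-open-socket (s :connect :active
                             :address-family :internet
                             :type :stream
                             :remote-host "www.example.com"
                             :remote-port 80)
    ;; if the socket really is a gray stream, the ordinary stream
    ;; functions should just work on it
    (format s "GET / HTTP/1.0~C~C~C~C"
            #\Return #\Linefeed #\Return #\Linefeed)
    (finish-output s)
    (read-line s nil nil))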
I think that sums it up. Thanks for the great library!
-Red
On Sat, 17 Oct 2009 20:05:52 -0700 Red Daly reddaly@gmail.com wrote:
> I was wondering how best to use IOLib to write an efficient HTTP server that can handle perhaps 10,000+ simultaneous connections. It seems like iolib has all the right ingredients: a system-level sockets interface and an I/O multiplexer that uses epoll/kqueue for efficiently querying sockets. There is quite a bit of code already written, so I was hoping for some advice about how this would best be implemented.
Note that the following is about using the kqueue backend with SBCL and an iolib from more than a year ago. It might not apply to the /dev/poll, epoll, or even select backends. Also, please forgive me if I'm stating the obvious, as I have no knowledge of your background :)
I tried using iolib on NetBSD (which supports kqueue), along with the multiplexer. I wrote a very simplistic I/O-bound server around it to measure performance (no worker threads, but non-blocking I/O in a single-threaded process, a model which I had previously used successfully for high performance with C+kqueue(2) (and JavaScript+libevent(3)) on the same OS).
The performance was unfortunately pretty bad compared to using C+kqueue (i.e., on the order of a few hundred served requests per second versus thousands, and nearly a thousand with JS), so I made sure the kqueue backend was being used (it was), and then looked at the code (after being warned that the multiplexer was the less tested part of iolib). What I noticed at the time was that timers were not dispatched to kqueue but to a custom scheduler, and that a kevent(2) syscall was issued per FD add/remove/state-change event.
kqueue allows a single kevent(2) syscall in the main loop to handle all additions/removals/state changes/notifications of descriptors, signals and timers, which is part of what makes it so performant; the other part is that the caller only has to iterate over new state changes rather than scan a full descriptor table.
I admit that I haven't looked at the iolib kevent backend code again lately (it may have improved), and didn't try to fix it myself: library portability is of limited value in my case, and since I use complete C+PHP and C+JavaScript solutions for work, my adventure into CL and iolib was experimental and a hobby. But I can confirm my growing love for CL. :)
Another potential performance issue I noticed is the interface itself, i.e. all the sanity checking which, to be as (Allegro?) compatible as possible, has to enforce a distinction between various socket types (bind/listen/accept vs read/write sockets for instance), adding overhead. Also, unlike BSD accept(2), which gives immediate access to the client's address because it is stored into a supplied sockaddr object, with iolib one has to perform a separate syscall to obtain the client address/port, as the interface did not cache that address. I honestly didn't check whether iolib makes this possible, but the BSD sockets API also allows asynchronous non-blocking accept(2)/connect(2), which is important for non-blocking I/O-bound proxies.
In the case of my test code there was also some added overhead, as I wrote a general purpose TCP server library which the minimal test HTTP server could use. CLOS was used (which itself has some overhead over struct/closure/lambda based code because of dynamic runtime dispatch, although SBCL is pretty good compared to other implementations at optimizing CLOS code). It also used a custom buffer so that file descriptors could be used directly instead of streams (especially since non-blocking I/O was used), although similar code using a libevent(3) stub class in non-JIT/interpreted JavaScript with SpiderMonkey was still faster (note that I have not tested iolib's own buffering against mine, however). libevent(3) is also able to use a single-syscall kevent(2) based loop, which greatly helps performance.
At the time I didn't look into this, but CFFI itself appears to incur some overhead compared to UFFI; only looking at the resulting assembly and microbenchmarks showed me this, and it probably was a non-issue compared to the numerous kevent(2) syscalls. Another overhead, probably insignificant since it is CPU-bound, could be iolib's use of CLOS (I noticed CLOS to be from 1.5 to 10 times slower in some struct+lambda vs class+method tests, depending on the task and CL implementation).
Another factor was that it was among my first Common Lisp tests, so the code was probably clumsy :) In case it can be useful, the test code can be found at:
http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/test/htt...
Which uses:
http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/lib/rw-q...
http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/lib/serv...
In case iolib's multiplexer can't suit your needs with your favorite backend, that still doesn't make iolib useless, especially in the case of application servers. For instance:
As I was playing with ECL more recently, which supports POSIX threads and has an SBCL-compatible simple BSD sockets API contrib library, I wrote a simple multithreaded test server where a pool of ready threads accept new connections themselves, serve the client, then go back to accept mode when done. This was actually to test ECL itself, and is very minimal (it isn't flexible and doesn't even implement input timeouts!), but it can serve to demonstrate the idea, which could also be implemented using SBCL and its native sockets, or iolib; the performance was very decent for an application-type server (also note that the bugs mentioned in the comments have since been fixed in ECL): http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/test/ecl...
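The idea in a nutshell, as a minimal sketch using iolib and bordeaux-threads rather than the linked ECL code (HANDLE-REQUEST is a hypothetical stand-in for whatever serves the client; defaults and names are made up for the example):

  (defun start-accept-pool (&key (port 8080) (workers 8))
    (let ((server (iolib:make-socket :connect :passive
                                     :address-family :internet
                                     :type :stream
                                     :local-port port
                                     :reuse-address t)))
      (dotimes (i workers server)
        (bordeaux-threads:make-thread
         (lambda ()
           (loop
             ;; each worker blocks in accept, serves one client
             ;; synchronously, then goes back to accepting
             (let ((client (iolib:accept-connection server :wait t)))
               (unwind-protect
                    (handle-request client) ; hypothetical application code
                 (close client)))))
         :name (format nil "worker-~D" i)))))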
The above does not require an efficient multiplexer. The method it uses is similar to mmlib/mmserver|js-appserv and Apache, where a manager thread/process generally uses heuristics to grow and shrink the pool of processes/threads as necessary. In the case of ECL, a libevent(3)/kqueue(2) C-based main loop could even invoke CL functions if optimal multiplexing were a must, as ECL compiles CL to C (SBCL's compiler is more efficient, however, especially where CLOS is involved).
In general, CPU-bound applications (HTTP application and database servers often are) use a pool of processes when optimal reliability and security are a must (this permits privilege separation, avoids resource leaks by occasionally recycling the process, a bug generally only affects one instance, and the need for reentrant and thread-safe libraries is a non-issue), or a pool of threads (generally with languages that minimize buffer overflows and support a GC, with a master process managing the threaded processes). I/O-bound applications are the ones needing optimal multiplexing with non-blocking asynchronous I/O, often in a single thread/process (i.e. frontend HTTP servers/proxies (lighttpd or nginx), IRCD, etc).
For very busy dynamic sites, as load grows, a farm of CPU-bound application servers can be set up, and a few frontend I/O-bound HTTP servers proxy dynamic requests to them (via FastCGI, or most commonly today HTTP, especially with Keep-Alive support) and perform load balancing (which is sometimes done at an upper layer). In this sense it is not necessary for a single all-purpose HTTP server to handle both very efficient multiplexing and CPU-bound worker threads simultaneously (the latter usually better kept separate for the purposes of redundancy and application-specific configuration)...
That said, if you want to implement an I/O-bound server, I hope the backends you'll need provide better performance than the kqueue one did for me back then. Working on improving it would be interesting, but I'm afraid I don't have the time or motivation to take up the task at the moment. As for the interface-specific improvements, I can (and did) suggest a few changes, but have no authority to change the API, which seems to have been thought out with valid compatibility concerns.
On Sun, 18 Oct 2009 05:04:36 -0400 Matthew Mondor mm_lists@pulsar-zone.net wrote:
> Another factor was that it was among my first Common Lisp tests, so the code was probably clumsy :) In case it can be useful, the test code can be found at:
> [...]
Oh, obviously, adding the webeconomybs page to the httpd was done afterwards for fun; the performance tests were done using a static result page at the time. :)
> Here is a possible architecture for a server that can handle tons of connections at once:
this is all fine, but the interesting question comes when you consider how to switch between the threads.
what you need is essentially green threads (much cheaper than OS threads): http://en.wikipedia.org/wiki/Green_threads
the naive solution is to implement the inversion of control by hand, with hand-written state machines etc. all around. this makes the code uglier, and through that makes it easier for bugs to creep in.
another solution is to implement green threads.
and yet another solution is to use a currently available call/cc implementation (cl-cont or hu.dwim.delico). delimited continuations provide more than green threads, and therefore mean more overhead, but the difference shouldn't be big. unfortunately for now delico only provides interpreted continuations (slow), and cl-cont had quite a few issues when i tried it.
but using cl-cont i've made a proof of concept implementation. the idea in short:
keep a connection data structure for each connection
set sockets to non-blocking
provide call/cc versions of the socket (gray-stream) reading primitives which, when they detect that the socket is dry or flooded, store the current continuation in the connection structure and mark what event we are waiting for
and then resume connections (using the call/cc primitives) based on what the event handler layer tells us. please note that call/cc is only needed in the code that reads the request. once it's parsed, the rest of the handler code can be plain CL, up to the point where we start writing the response. if you need some stream processing that does not buffer the response before sending, then you need to keep all the handler code inside the delimited continuation borders.
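a minimal sketch of the reading side of that idea (this is not the actual branch code; CONNECTION and TRY-READ-BYTE are made up for the example, the latter being a non-blocking read that returns :would-block instead of blocking when the socket is dry):

  (defstruct connection socket continuation wanted-event)

  (cl-cont:defun/cc read-byte/cc (conn)
    (let ((byte (try-read-byte (connection-socket conn))))
      (if (eq byte :would-block)
          ;; socket is dry: park the current continuation, note what
          ;; event we are waiting for, and let control fall back to
          ;; the event loop that started this handler
          (cl-cont:call/cc
           (lambda (k)
             (setf (connection-continuation conn) k
                   (connection-wanted-event conn) :read)))
          byte)))

  ;; called by the event loop when the multiplexer reports the
  ;; socket readable again; the parked handler continues as if
  ;; read-byte/cc had just returned
  (defun resume-connection (conn)
    (let ((k (connection-continuation conn)))
      (setf (connection-continuation conn) nil)
      (funcall k (try-read-byte (connection-socket conn)))))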
it's somewhere on my TODO to make hu.dwim.wui work based on that experiment.
currently the wui codebase can process some 3-4 k requests per sec on my 2.4 GHz core2 laptop, which is good enough for me, especially since i didn't pay too much attention to performance. but for now it can only cope with about 50 parallel requests, because the workers are not multiplexed. if there are many users with slow network connections then it can be an issue...
the proof of concept code is lying somewhere in an iolib branch here, but it's most probably badly bitrotted. it's about a year old now and iolib has moved ahead quite a bit since.