On Sat, 17 Oct 2009 20:05:52 -0700 Red Daly reddaly@gmail.com wrote:
> I was wondering how best to use IOLib to write an efficient HTTP server that can handle perhaps 10,000+ simultaneous connections. It seems like iolib has all the right ingredients: a system-level sockets interface and an I/O multiplexer that uses epoll/kqueue for efficiently querying sockets. There is quite a bit of code already written, so I was hoping for some advice about how this would best be implemented.
Note that the following is about using the kqueue backend with SBCL and a version of iolib from more than a year ago. It might not apply when using the /dev/poll, epoll, or even select backends. Also, please forgive me if I'm stating the obvious, as I have no knowledge of your background :)
I tried using iolib on NetBSD (which supports kqueue), along with the multiplexer. I wrote a very simplistic I/O-bound server around it to measure performance (no worker threads, but non-blocking I/O in a single-threaded process, a model I had previously used successfully for high-performance C+kqueue(2) (and JavaScript+libevent(3)) servers on the same OS).
The performance was unfortunately pretty bad compared to using C+kqueue (i.e. on the order of a few hundred served requests per second versus thousands, and nearly a thousand with JS), so I made sure the kqueue backend was being used (it was), and then looked at the code (after being warned that the multiplexer was the least tested part of iolib). What I noticed at the time was that timers were not dispatched to kqueue but to a custom scheduler, and that a kevent(2) syscall was issued per FD add/remove/state-change event.
kqueue allows a single kevent(2) syscall in the main loop to handle all additions/removals/state changes/notifications of descriptors, signals and timers, which is part of what makes it so performant, along with only requiring the caller to iterate over new state changes rather than over a full descriptor table.
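To illustrate what I mean, here is a rough C sketch of my own (not iolib code; MAX_CHANGES/MAX_EVENTS are arbitrary): interest changes are accumulated locally and then submitted as the changelist of the same kevent(2) call that fetches ready events, so the steady state is one syscall per loop iteration no matter how many descriptors changed.

/* Sketch only: batching kqueue interest changes with the event fetch. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <err.h>

#define MAX_CHANGES 256
#define MAX_EVENTS  256

static struct kevent changes[MAX_CHANGES];
static int nchanges = 0;

/* Queue an interest change locally; no syscall happens here. */
static void
queue_read_interest(int fd, int enable)
{
    EV_SET(&changes[nchanges++], fd, EVFILT_READ,
        enable ? EV_ADD : EV_DELETE, 0, 0, NULL);
}

int
main(void)
{
    struct kevent events[MAX_EVENTS];
    int kq, i, n;

    if ((kq = kqueue()) == -1)
        err(1, "kqueue");

    /* ... queue_read_interest(listen_fd, 1); etc. ... */

    for (;;) {
        /* Submit all pending changes and fetch new events at once. */
        n = kevent(kq, changes, nchanges, events, MAX_EVENTS, NULL);
        nchanges = 0;
        if (n == -1)
            err(1, "kevent");
        for (i = 0; i < n; i++) {
            /* Only descriptors whose state changed are reported, rather
             * than scanning a full table as with select(2)/poll(2). */
            int fd = (int)events[i].ident;
            (void)fd;   /* dispatch to the per-fd handler here */
        }
    }
}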
I admit that I haven't looked at the iolib kevent backend code again lately (it may have improved since), and I didn't try to fix it myself: library portability being of limited value in my case, and since I use complete C+PHP and C+JavaScript solutions for work, my adventure into CL and iolib was experimental and a hobby. I can however confirm my growing love for CL :)
Another potential performance issue I noticed is the interface itself, i.e. all the sanity checking which, to remain as (Allegro?) compatible as possible, has to force a distinction between various socket types (bind/listen/accept vs read/write sockets for instance), adding overhead. Also, unlike BSD accept(2), which gives immediate access to the client's address by storing it into a supplied sockaddr object, with iolib one has to perform a separate syscall to obtain the client address/port, as the interface did not cache that address. I honestly didn't check whether iolib makes this possible, but the BSD sockets API also allows asynchronous non-blocking accept(2)/connect(2), which is important for non-blocking I/O-bound proxies.
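To illustrate the accept(2) point, here is a minimal C sketch of my own (not iolib internals): the peer address is filled in by accept(2) itself, so no extra syscall is needed to learn the client address/port, and with a non-blocking listening socket the call simply fails with EAGAIN/EWOULDBLOCK when nothing is pending, which is what makes it usable from an event loop.

/* Sketch only: accept(2) fills in the supplied sockaddr. */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>

/* Accept one pending connection; returns -1 if none was pending. */
int
accept_one(int listen_fd)
{
    struct sockaddr_in peer;
    socklen_t peerlen = sizeof(peer);
    char addr[INET_ADDRSTRLEN];
    int fd;

    fd = accept(listen_fd, (struct sockaddr *)&peer, &peerlen);
    if (fd == -1) {
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            perror("accept");
        return -1;      /* nothing pending (or error); back to the loop */
    }

    /* The client address/port are already available, no extra syscall. */
    inet_ntop(AF_INET, &peer.sin_addr, addr, sizeof(addr));
    printf("connection from %s:%u\n", addr, ntohs(peer.sin_port));

    /* Make the new descriptor non-blocking before registering it. */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    return fd;
}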
In the case of my test code, there also was some added overhead because I wrote a general-purpose TCP server library which the minimal test HTTP server could use. CLOS was used (which itself has some overhead over struct/closure/lambda-based code because of dynamic runtime dispatch, although SBCL was pretty good compared to other implementations at optimizing CLOS code). It also used a custom buffer in order to use file descriptors directly instead of streams (especially since non-blocking I/O was used), although similar code using a libevent(3) stub class in non-JIT/interpreted JavaScript under SpiderMonkey was still faster (note that I've not tested iolib's own buffering against mine, however). libevent(3) is also able to use a single-syscall kevent(2)-based loop, which greatly helps performance.
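By "custom buffer" I mean roughly the following (a simplified C sketch of my own, not my library's or iolib's actual code; the buffer size is arbitrary): whatever read(2) returns on the non-blocking descriptor is appended to a per-connection buffer, and EAGAIN simply means waiting for the next readability notification from the multiplexer.

/* Sketch only: filling a per-connection buffer from a non-blocking fd.
 * Returns the bytes appended (> 0), 0 on EOF, or -1 when nothing could
 * be read (EAGAIN, error, or buffer already full). */
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>

struct conn_buf {
    char    data[16384];
    size_t  len;
};

ssize_t
buf_fill(int fd, struct conn_buf *b)
{
    ssize_t n;

    if (b->len == sizeof(b->data))
        return -1;              /* full; caller must parse/drain first */
    n = read(fd, b->data + b->len, sizeof(b->data) - b->len);
    if (n > 0)
        b->len += (size_t)n;    /* caller parses as much as it can */
    return n;                   /* -1 with errno == EAGAIN is expected */
}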
At the time I didn't look into this as I had no idea, but CFFI itself appears to incur some overhead compared to UFFI, though only looking at the resulting assembly and running microbenchmarks showed me this. It probably was a non-issue compared to the numerous kevent(2) syscalls. Another, probably insignificant since CPU-bound, overhead could be iolib's use of CLOS (I noticed CLOS to be from 1.5 to 10 times slower in some struct+lambda vs class+method tests, depending on the task and CL implementation).
Another factor was that this was among my first Common Lisp tests, so the code was probably clumsy :) In case it can be useful, the test code can be found at:
http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/test/htt...
which uses:
http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/lib/rw-q...
http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/lib/serv...
Even if iolib's multiplexer can't suit your needs with your favorite backend, that still doesn't make iolib useless, especially in the case of application servers. For instance:
As I was playing with ECL more recently, which supports POSIX threads and an SBCL-compatible simple BSD sockets API contrib library, I wrote a simple multithreaded test server where a pool of ready threads accept new connections themselves, serve the client, and then go back to accept mode when done. This was actually written to test ECL itself and is very minimal (it isn't flexible and doesn't even implement input timeouts!), but it can serve to demonstrate the idea, which could also be implemented using SBCL and its native sockets, or iolib; the performance was very decent for an application-type server (also note that the bugs mentioned in the comments have since been fixed in ECL): http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/test/ecl...
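Independently of ECL, the shape of that model is roughly the following pthread-based C sketch of my own (the port and pool size are arbitrary): every idle worker blocks in accept(2) on the shared listening socket, serves the client it was handed, then loops back to accepting.

/* Sketch only: a fixed pool of threads accepting on a shared socket. */
#include <pthread.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>
#include <err.h>

#define NWORKERS 16

static int listen_fd;

static void
serve_client(int fd)
{
    /* Application-specific request handling would go here. */
    const char resp[] = "HTTP/1.0 200 OK\r\nContent-Length: 3\r\n\r\nok\n";
    write(fd, resp, sizeof(resp) - 1);
}

static void *
worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* Each idle worker blocks here; the kernel hands a new
         * connection to one of the waiting threads. */
        int fd = accept(listen_fd, NULL, NULL);
        if (fd == -1)
            continue;
        serve_client(fd);
        close(fd);
    }
    return NULL;
}

int
main(void)
{
    struct sockaddr_in sin;
    pthread_t threads[NWORKERS];
    int i, one = 1;

    if ((listen_fd = socket(AF_INET, SOCK_STREAM, 0)) == -1)
        err(1, "socket");
    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(8080);
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(listen_fd, (struct sockaddr *)&sin, sizeof(sin)) == -1)
        err(1, "bind");
    if (listen(listen_fd, 128) == -1)
        err(1, "listen");

    for (i = 0; i < NWORKERS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0; i < NWORKERS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}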
This model does not require an efficient multiplexer. The method is similar to mmlib/mmserver|js-appserv and Apache, where generally a manager thread/process uses heuristics to grow and shrink the pool of processes/threads as necessary. In the case of ECL, a libevent(3)/kqueue(2) C-based main loop could even invoke CL functions if optimal multiplexing were a must, as ECL compiles CL to C (SBCL's compiler is more efficient, however, especially where CLOS is involved).
In general, CPU-bound applications (HTTP application and database servers often are) use a pool of processes if optimal reliability and security is a must (this permits privilege separation, avoids resource leaks by occasionally recycling processes, a bug generally only affects one instance, and the need for reentrant and thread-safe libraries is a non-issue), or a pool of threads (generally with languages that minimize buffer overflows and support a GC, with a master process managing the threaded processes). I/O-bound applications are the ones needing optimal multiplexing with non-blocking asynchronous I/O, often in a single thread/process (i.e. frontend HTTP servers/proxies such as lighttpd or nginx, IRCDs, etc).
For very busy dynamic sites, as load grows, a farm of CPU-bound application servers can be set up, with a few frontend I/O-bound HTTP servers proxying dynamic requests to them (via FastCGI, or most commonly today HTTP, especially with Keep-Alive support) and performing load balancing (which is sometimes done at an upper layer). In this sense it is not necessary for a single all-purpose HTTP server to handle both very efficient multiplexing and CPU-bound worker threads simultaneously (the latter usually being better kept separate for the purposes of redundancy and application-specific configuration)...
That said, if you want to implement an I/O-bound server, I hope the backends you'll need to use provide better performance than the kqueue one did for me back then. Working on improving it would be interesting, but I'm afraid I don't have the time or motivation to take up the task at the moment. As for the interface-specific improvements, I can (and did) suggest a few changes, but I have no authority to change the API, which seems to have been thought out with valid compatibility concerns.