Here is a possible architecture for a server that can handle tons of connections at once:
this is all fine, but the interesting question is how to switch between the threads.
what you need is essentially green threads (much cheaper than OS threads): http://en.wikipedia.org/wiki/Green_threads
the naive solution is to implement the inversion of control by hand, with hand-written state machines and the like all around. this makes the code uglier, and through that helps bugs creep in.
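to illustrate, a hand-rolled state machine for one connection could look something like this (just a sketch; all the helper names are hypothetical):

  (defstruct conn
    socket
    (state :reading-request-line)
    buffer)  ; partially read data

  (defun on-readable (conn)
    ;; called by the event loop whenever the socket has data; note how
    ;; the control flow of "read a request" is scattered across states.
    (ecase (conn-state conn)
      (:reading-request-line
       (when (buffer-request-line conn)   ; hypothetical non-blocking read
         (setf (conn-state conn) :reading-headers)))
      (:reading-headers
       (when (buffer-headers conn)        ; hypothetical non-blocking read
         (setf (conn-state conn) :handling)
         (handle-request conn)))
      (:handling
       ;; ...and so on for writing the response, etc.
       )))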
another solution is to implement green threads.
and yet another solution is to use a currently available call/cc implementation (cl-cont or hu.dwim.delico). delimited continuations provide more than green threads, and therefore mean more overhead, but the difference shouldn't be big. unfortunately, for now delico only provides interpreted continuations (which are slow), and cl-cont had quite a few issues when i tried it.
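for reference, the basic cl-cont operators behave roughly like this (unlike scheme's call/cc, the captured continuation is not invoked implicitly: if you don't call it, control simply escapes to the with-call/cc boundary):

  (defvar *k* nil)

  (cl-cont:with-call/cc
    (+ 1 (cl-cont:let/cc k
           ;; stash the continuation and escape to the boundary
           (setf *k* k)
           :parked)))
  ;; => :PARKED

  (funcall *k* 41)  ; resume the suspended (+ 1 ...) computation
  ;; => 42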
but using cl-cont i've made a proof-of-concept implementation. the idea in short:
- keep a connection data structure for each connection
- set the sockets to non-blocking
- provide call/cc versions of the socket (gray-stream) reading primitives which, when they detect that the socket is dry (or flooded, on the writing side), store the current continuation in the connection structure and record which event we are waiting for (see the sketch below)
- resume connections (using the call/cc primitives) based on what the event-handler layer tells us

please note that call/cc is only needed in the code that reads the request. once the request is parsed, the rest of the handler code can be plain CL up to the point where we start writing the response. if you need some stream processing that does not buffer the response before sending, then you need to keep all the handler code inside the delimited continuation borders.
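very roughly, the reading-primitive part looked something like the following (a from-memory sketch, not the actual code; try-read-byte, read-request and serve-request are hypothetical placeholders):

  (defstruct connection
    socket        ; the non-blocking socket
    continuation  ; where to resume when the awaited event arrives
    waiting-for)  ; :read or :write

  (cl-cont:defun/cc read-byte/cc (connection)
    ;; like READ-BYTE, but instead of blocking it parks the connection
    (let ((byte (try-read-byte (connection-socket connection))))
      (or byte
          (progn
            ;; the socket is dry: store the continuation, record the event
            ;; we are waiting for, and escape back to the event loop
            (cl-cont:let/cc k
              (setf (connection-continuation connection) k
                    (connection-waiting-for connection) :read))
            ;; the event loop resumed us, so try again
            (read-byte/cc connection)))))

  (defun resume-connection (connection)
    ;; called by the event-handler layer when the awaited event arrives
    (let ((k (connection-continuation connection)))
      (setf (connection-continuation connection) nil)
      (funcall k nil)))

  (defun handle-connection (connection)
    (cl-cont:with-call/cc
      ;; read-request must also be a defun/cc built on read-byte/cc
      (let ((request (read-request connection)))
        ;; from here on plain CL is fine, up to writing the response
        (serve-request connection request))))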
it's somewhere on my TODO list to make hu.dwim.wui work based on that experiment.
currently the wui codebase can process some 3-4k requests per second on my 2.4 GHz core2 laptop, which is good enough for me, especially since i didn't pay too much attention to performance. but for now it can only cope with about 50 requests in parallel, because the workers are not multiplexed. if there are many users with slow network connections then that can become an issue...
the proof-of-concept code is lying somewhere in an iolib branch here, but it's most probably badly bitrotted. it's about a year old now and iolib has moved ahead a lot since then.