Folks,
I've been thinking about efficient I/O models for some time now, and I'd
like to share some thoughts.
What any high-performance "C10K" I/O model ultimately tries to combat is
the overhead that is useless for the final goal - throughput.
Even if we eliminate all disk bottlenecks, optimise all code paths and omit
ACL and regex processing, we'll still be left with quite a lot of overhead
from network I/O under high loads. Most problems occur near the so-called
C10K levels, as Dan Kegel describes on his page - actually already at
C2K levels.
On that page I think most of the current models are covered, but none of
them seems perfect to me, although some are pretty close.
It seems that at the current state of the art the main effort goes into
reducing state or event notification overhead. The funny thing is that
almost none of the solutions actually handles the I/O itself - they only
handle either readiness or completion notification. The I/O seems to be
taken for granted.
But it's not so. Every I/O syscall burns more CPU than expected, due to
the context switch, the OS initialising its internal actions, and also
the process/thread scheduling checks that get run:
- threads are expensive if doing little work at a time, mostly because
  of synchronisation and context-switching overhead.
- frequent poll() is expensive if only a fraction of the FD array is
  active (see the sketch after this list).
- signals are expensive when most FDs are active.
- small data sizes are inefficient due to excessive syscalls.
- syscalls are expensive if doing little work, mostly for similar
  reasons as threads.
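To make the last two points concrete, here is a rough sketch of the
classic poll loop (an illustration, not actual Squid code): the kernel
has to scan the whole pollfd array on every call, and we still pay one
read() syscall per ready FD for a few KB at a time.

  #include <poll.h>
  #include <unistd.h>

  /* Illustration only: with 10000 FDs of which a handful are active,
   * every iteration makes the kernel walk the full array, and then we
   * issue one small read() per ready FD - many syscalls, each doing
   * very little work. */
  static void poll_loop(struct pollfd *fds, int nfds)
  {
      char buf[4096];
      int i, nready;

      for (;;) {
          nready = poll(fds, nfds, 10);   /* kernel scans all nfds entries */
          if (nready <= 0)
              continue;
          for (i = 0; i < nfds; i++) {
              if (fds[i].revents & POLLIN)
                  read(fds[i].fd, buf, sizeof(buf));  /* one syscall per FD,
                                                         a few KB at a time */
          }
      }
  }
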
The above points cut off several I/O models mentioned on Kegel's page, like
process-per-request (obviously), thread-per-request, RT signals, and
select/poll. Based on Henrik's comments it now seems that most AIO
implementations also fall back to the thread-per-request model.
Squid is using single-threaded non-blocking I/O with poll() notification.
What we are trying to do is get rid of the poll overhead by trying other
methods once eventio is mature: kqueues, devpoll, RT signals, etc.
This will save some CPU in the main Squid thread, but the win won't be as
high as we hope, I'm afraid. The main Squid thread is already loaded, and
the pure I/O syscalls together with the FD ioctl calls consume quite a lot
by themselves. Eventually we'll hit a syscall rate that wastes most CPU
cycles in context switches and cache misses.
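For illustration, this is roughly the Solaris /dev/poll pattern (sketched
from memory, details may be off): the readiness notification is nicely
batched, but the I/O that follows is still one syscall per FD.

  #include <sys/types.h>
  #include <sys/devpoll.h>
  #include <poll.h>
  #include <stropts.h>
  #include <unistd.h>

  /* dpfd comes from open("/dev/poll", O_RDWR); FDs of interest have
   * already been registered by write()ing struct pollfd entries to it. */
  static void devpoll_round(int dpfd, char *buf, size_t buflen)
  {
      struct pollfd ready[128];
      struct dvpoll dvp;
      int i, n;

      dvp.dp_fds = ready;
      dvp.dp_nfds = 128;
      dvp.dp_timeout = 10;                 /* ms */

      n = ioctl(dpfd, DP_POLL, &dvp);      /* one syscall, many ready FDs */
      for (i = 0; i < n; i++)
          read(ready[i].fd, buf, buflen);  /* but still one read() per FD */
  }
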
The more I think about it, the more I see the need to consolidate actions.
Poll is a wakeup function, and most I/O models revolve around the wakeup
overhead. There are a few very nice and effective solutions for that, but
handling the I/O itself is not covered, excluding perhaps in-kernel servers.
Under extremely high loads we inevitably have to account for the overhead of
context switches between the kernel and Squid, and the more work we can give
the kernel in one shot, the less noticeable this overhead is.
Unfortunately, there seems to be no method for doing this in current OSes...
Ideally, the kernel should be given a list of sockets to work on, not just
in terms of readiness detection, but the actual I/O itself. Just like in
devpoll, where the kernel updates the ready-FD list as events occur, it
should be made to actually do the I/O as events occur. Squid would provide
a list of FDs, commands, timeout values and buffer space per FD, and
enqueue the list. This is like kernel AIO, but not quite.
From the other end we'd sleep in a wakeup function which returns a list of
completed events, thus dequeueing events that have either completed or
errored. Then we handle all the data in a manner coded into Squid, and
enqueue another list of work. Again, the point is in returning a list of
completed events, not one event at a time - much like poll() can return
several ready FDs at once.
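To make the idea concrete, here is a purely hypothetical sketch of such an
interface. The struct and the io_enqueue/io_dequeue names are invented; no
current kernel offers anything like this as far as I know.

  #include <stddef.h>

  /* Hypothetical only - invented names, not an existing kernel API. */
  struct io_request {
      int    fd;         /* socket or file descriptor to work on        */
      int    cmd;        /* IO_READ, IO_WRITE, ... (invented constants) */
      void  *buf;        /* userspace buffer the kernel fills or drains */
      size_t len;
      int    timeout;    /* ms before the request errors out            */
      int    result;     /* bytes moved or -errno, set on completion    */
      void  *cookie;     /* opaque pointer handed back on completion    */
  };

  /* Hand the kernel a whole batch of work in one syscall... */
  int io_enqueue(struct io_request *reqs[], int nreqs);

  /* ...and sleep until work completes, dequeueing up to 'max' finished
   * or errored requests in one shot, much like poll() returns many FDs. */
  int io_dequeue(struct io_request *done[], int max, int timeout_ms);
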
To get rid of the wakeup (poll) overhead, we need to stop poll()ing like
crazy. devpoll does this for us nicely. But we solve this only to be able
to issue the actual I/O itself. We don't really care about efficient
polling, we care about efficient I/O. And as the load on Squid increases,
we start calling I/O syscalls like crazy: one FD at a time, as few bytes
at a time as we happen to have at the moment, but never more than a few
KB. We do this to keep latency low, although under high loads this is the
exact reason why latency goes up.
With AIO, we'd omit poll altogether, but start enqueueing I/O like crazy,
and correspondingly dequeueing completed I/O like crazy. More efficient,
but only to a point: the thread-per-AIO-request overhead brings the
savings back down.
I dunno, many I/O models come close to what I'm describing, but it seems
that they all stop halfway. You can either register interest in FD events
and then handle the I/O one FD at a time, or enqueue/dequeue actions one
at a time, but you can't do either in bulk.
All this reminds me of a tape drive that can only transfer small chunks of
data at a time, without the ability to stream.
I think what is needed is a combination of kernel queues (or devpoll) and
KAIO: schedule actions in bulk and dequeue in bulk. Together with an
appropriate number of worker threads, any MP scalability could then be
achieved. This needs kernel support.
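For comparison, kqueue already gets the bulk idea half right on the
notification side: a single kevent() call can both submit a batch of
filter changes and return a batch of ready events. A rough sketch
(illustrative, not Squid code) - note the I/O afterwards is still one
syscall per FD.

  #include <sys/types.h>
  #include <sys/event.h>
  #include <sys/time.h>
  #include <unistd.h>

  static void kq_round(int kq, int *socks, int nsock)
  {
      struct kevent chg[64], evs[64];
      struct timespec ts = { 0, 10 * 1000 * 1000 };   /* 10 ms */
      char buf[4096];
      int i, n, nchg = 0;

      for (i = 0; i < nsock && nchg < 64; i++)
          EV_SET(&chg[nchg++], socks[i], EVFILT_READ, EV_ADD, 0, 0, NULL);

      /* one syscall submits the changelist AND collects ready events */
      n = kevent(kq, chg, nchg, evs, 64, &ts);

      for (i = 0; i < n; i++)
          read((int)evs[i].ident, buf, sizeof(buf)); /* the I/O itself is
                                                        still per-FD */
  }
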
I believe this is useful, because only the kernel can really do the work in
an async manner - at packet arrival time. It could even skip buffering
packets in kernel space and append data directly into userspace buffers,
or put data on the wire straight from userspace. The same goes for disk I/O.
Today this can't be done (not sure about TUX). Maybe in the future - I
think so. I just wonder if anyone is already working in that direction.
So, the wakeup function is not about readiness, but directly about popping
completed work. And the scheduled work is gathered into bulks before being
passed to the horse - the kernel. A pipeline.
In regard to the eventio branch, the new network API, it seems it allows
implementing almost any I/O model behind the scenes. What sticks out is
the FD-centric and one-action-at-a-time approach. It also seems it could
be made more general and expandable, possibly covering disk I/O as well.
Also, some calls assume that they are fulfilled immediately - no async
nature, no callback (close, for example). This makes it awkward to issue
a close while a read/write is pending.
One thing that bothers me a bit is that you can't proceed before the FD
is known. For disk I/O, for example, it would help if you could schedule
open/read/close in one shot. For that some kind of abstract session ID
could be used, I guess. Then such triplets could be scheduled onto the
same worker thread, avoiding several context switches (see the sketch
below).
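Something along these lines, purely hypothetical - the session handle and
every ioc_* name below are invented, nothing in eventio looks like this
today:

  #include <sys/types.h>

  typedef int ioc_session_t;              /* abstract handle, not an FD */

  enum ioc_op { IOC_OPEN, IOC_READ, IOC_CLOSE };

  struct ioc_req {
      enum ioc_op op;
      const char *path;                   /* for IOC_OPEN */
      void       *buf;                    /* for IOC_READ */
      size_t      len;
      off_t       offset;
  };

  /* invented calls: create a session, schedule a chain of ops on it */
  ioc_session_t ioc_session_open(void);
  int ioc_schedule(ioc_session_t sid, struct ioc_req *chain, int n,
                   void (*callback)(void *cbdata, int result), void *cbdata);

  static void read_object(const char *path, void *buf, size_t len,
                          void (*cb)(void *, int), void *cbdata)
  {
      /* the whole open/read/close triplet is queued at once, so it can
       * be pinned to one worker thread - one round trip instead of three */
      struct ioc_req chain[3] = {
          { IOC_OPEN,  path, NULL, 0,   0 },
          { IOC_READ,  NULL, buf,  len, 0 },
          { IOC_CLOSE, NULL, NULL, 0,   0 },
      };
      ioc_schedule(ioc_session_open(), chain, 3, cb, cbdata);
  }
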
Also, how about some more general ioControlBlock struct that defines all
of the callback, cbdata, iotype, size, offset, etc., and is possibly
expandable in the future? Roughly like the sketch below.
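A rough guess at what such a struct could carry - the exact field set is
just my assumption:

  #include <sys/types.h>

  typedef struct ioControlBlock ioControlBlock;

  struct ioControlBlock {
      int      iocb_size;   /* sizeof(*this), leaves room for new fields */
      int      iotype;      /* read / write / open / close / ...         */
      int      fd;          /* or an abstract session ID, as above       */
      void    *buf;
      size_t   size;
      off_t    offset;
      int      timeout;     /* ms */
      int      result;      /* filled in on completion */
      void   (*callback)(ioControlBlock *iocb, void *cbdata);
      void    *cbdata;      /* usual cbdata-protected pointer */
  };
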
Hmm, it would probably even be possible to generalise the API to such an
extent that you could schedule ACL checks, DNS and redirector lookups all
via the same API. That might become useful if we want the main Squid
thread to do nothing but act as a broker between worker threads. Not sure
if that makes sense, though - just a wild thought.
Also, I think we should try to be more thread-safe. Having one compact
IOCB helps here. Maybe we should even allow passing a list of IOCBs.
ouch, I waste bandwidth again...
------------------------------------
Andres Kroonmaa <andre@online.ee>
CTO, Microlink Online
Tel: 6501 731, Fax: 6501 725
Pärnu mnt. 158, Tallinn,
11317 Estonia