On 2 Jan 2003, at 6:02, Henrik Nordstrom <hno@squid-cache.org> wrote:
> I have now managed to randomly reproduce the situation here on my poor
> old P133/64MB home development machine, giving an almost exact 200KB/s
> hit transfer rate for a single aufs cache hit, or sometimes a couple in
> parallel (each then receiving 200KB/s). The exact conditions triggering
> this are not yet known, but I would guess there is some kind of
> oscillation in the queue management due to the extremely even and
> repetitive I/O pattern.
>
> Hmm.. 200KB/s / 4KB per I/O request = 50 * 2 queue operations per
> request = 100 = number of clock ticks/s (HZ) on a Linux Intel X86..
>
> Wait a minute.. yes, naturally. This will quite likely happen on an aufs
> Squid which has nothing to do, as the I/O queue is only polled once per
> comm loop, and an "idle" comm loop runs at 100/s. The question is more
> why it does not always happen, and why it is 200KB/s and not 400KB/s.
> It can also be said that the likelihood of this happening decreases a
> lot on SMP machines, as the main thread then has no competition with
> the I/O threads for the CPU.
I wouldn't be so sure of that. Competition for the CPU is not really the
issue here; it's thread scheduling. Until the main thread's LWP reaches
the scheduler, no I/O thread rescheduling will happen, so I don't see
much difference.
> I have an old async-io patch somewhere which introduces a signal to
> force the main process to resume when a disk I/O operation has
> completed, but it may have other ill effects so it is not included (or
> maintained).
A signal is heavy. Couldn't we have one dummy FD in the fd_set that is
always polled for read, and have an I/O thread write a byte to it when
it finishes? A pipe, for example. That would cause poll() to unblock,
and it allows multiple threads to write into the same FD.
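Roughly what I have in mind, as a sketch only; the aio_notify_* names
below are invented for illustration and are not anything in the current
Squid source:

    #include <fcntl.h>
    #include <unistd.h>

    /* [0] = read end, watched by comm_poll; [1] = write end for I/O threads */
    static int aio_notify_pipe[2];

    static void
    aio_notify_init(void)
    {
        pipe(aio_notify_pipe);
        fcntl(aio_notify_pipe[0], F_SETFL,
              fcntl(aio_notify_pipe[0], F_GETFL) | O_NONBLOCK);
        fcntl(aio_notify_pipe[1], F_SETFL,
              fcntl(aio_notify_pipe[1], F_GETFL) | O_NONBLOCK);
        /* aio_notify_pipe[0] would also be added to the FDs comm_poll watches */
    }

    /* called by an I/O thread when its request completes */
    static void
    aio_notify_done(void)
    {
        char c = 0;
        /* a full pipe only means a wakeup is already pending, so EAGAIN is harmless */
        write(aio_notify_pipe[1], &c, 1);
    }

    /* called by the main thread when poll reports the read end readable */
    static void
    aio_notify_drain(void)
    {
        char buf[64];
        while (read(aio_notify_pipe[0], buf, sizeof(buf)) > 0)
            ;  /* just empty it; completions are still picked up from the done queue */
    }

The pipe only carries "something completed"; the actual results still
travel through the existing done queue, so a full pipe or a short read
never loses information.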
> Testing on my old UP P133 glibc 2.1.3 gives some surprising results in
> thread scheduling. It seems that under some conditions the signalled
> thread is not awakened until much later than expected. I would expect
> the signalled thread to be awakened when the signalling thread calls
> pthread_mutex_unlock, or at the latest at sched_yield, but when Squid is
> in "200K/s aufs cache hit mode" the I/O thread seems to stay suspended
> until the next system tick or something similar... will try to do
> additional investigation later on.
That's how threads work. Wishful thinking ;) I've been there too. A
thread switch happens only when it is unavoidable (SMP optimisations..).
For that reason, yield() is almost always a no-op, and a mutex unlock
after cond_signal is much the same. It comes from the asynchronous
nature of threads: the OS "assumes" that you are going to cond_signal
many threads before you block, so that it can then schedule the
signalled threads onto the CPU in a batch. Therefore mutex_unlock is not
steered through the LWP scheduler, and the signalled thread will not
wake up before the main LWP hits the scheduler (that is, makes any call
that blocks in the kernel). At the same time you can't rely on that
either, as there are cases where a mutex unlock may cause a thread
switch (quite probable if the threads use process contention scope). The
problem is that different OSes behave differently, and in fact no
assumption is reliable. The key point is that threads are asynchronous
and proceed in an unpredictable order; if you need synchronisation, use
mutexes. The only way to reliably kick a specific thread is through
solid mutex handshaking. Even blocking in poll does not guarantee that a
signalled thread has run by the time poll returns, especially on SMP
systems that try to keep threads from migrating between CPUs. If we are
blocked in poll long enough they all obviously have to get to run, but
here lies another problem: we can't slow down network I/O just to make
sure the aio threads get run. We may try to tweak priorities, but I
suppose that is not enough unless we run the I/O threads in the realtime
class.
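By "solid mutex handshaking" I mean the usual predicate wait; a generic
sketch with invented names, not Squid source:

    #include <pthread.h>

    struct kick {
        pthread_mutex_t mtx;
        pthread_cond_t  cv;
        int             pending;   /* the state the waiter actually tests */
    };

    /* waiter (I/O thread): trusts only the predicate, never the wakeup itself */
    static void
    kick_wait(struct kick *k)
    {
        pthread_mutex_lock(&k->mtx);
        while (k->pending == 0)        /* spurious or deferred wakeups don't matter */
            pthread_cond_wait(&k->cv, &k->mtx);
        k->pending--;
        pthread_mutex_unlock(&k->mtx);
    }

    /* signaller (main thread): records the event under the mutex, then signals */
    static void
    kick_post(struct kick *k)
    {
        pthread_mutex_lock(&k->mtx);
        k->pending++;
        pthread_cond_signal(&k->cv);   /* when the switch happens is up to the OS */
        pthread_mutex_unlock(&k->mtx);
    }

The point is that the information lives in pending, protected by the
mutex, so it does not matter when the OS actually decides to run the
signalled thread.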
If I recall right, a condition variable carries no count (it is
effectively boolean), so if many threads are blocked on the cond,
cond_signal does not cause a thread switch, and another cond_signal is
sent, only one thread may end up being unblocked. I assume the mutex is
put into a to-be-locked state on the first cond_signal (owner
unspecified, but one of the waiting threads), and a second attempt to
signal then forces a real thread switch to consume the first signal
(because the mutex is "locked"). That most probably happens when there
is a lot of I/O activity. But with only one client, the thread switch
would not happen until we block in poll. In that case yield is also a
no-op, because the I/O threads have not yet been scheduled onto a CPU,
so there is nothing to run at yield() time. Using yield is bad coding in
an SMP world; I'd suggest avoiding it.
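One way to make this a non-issue, whatever a given implementation does
when signals collapse, is to chain the wakeup inside the worker. Again a
sketch only, with invented names, not the actual aufs request queue:

    #include <pthread.h>
    #include <stddef.h>

    struct req {
        struct req *next;              /* plus whatever describes the I/O op */
    };

    static pthread_mutex_t queue_mtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  queue_cv  = PTHREAD_COND_INITIALIZER;
    static struct req *queue_head;

    /* main thread: enqueue a request and signal */
    static void
    queue_put_request(struct req *r)
    {
        pthread_mutex_lock(&queue_mtx);
        r->next = queue_head;          /* LIFO only to keep the sketch short */
        queue_head = r;
        pthread_cond_signal(&queue_cv);
        pthread_mutex_unlock(&queue_mtx);
    }

    /* I/O thread: take a request, and re-signal if more remain */
    static struct req *
    worker_get_request(void)
    {
        struct req *r;
        pthread_mutex_lock(&queue_mtx);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_cv, &queue_mtx);
        r = queue_head;
        queue_head = r->next;
        if (queue_head != NULL)
            pthread_cond_signal(&queue_cv);  /* chain the wakeup to the next worker */
        pthread_mutex_unlock(&queue_mtx);
        return r;
    }

With the chained signal it no longer matters whether several
cond_signals collapse into fewer wakeups: as long as one worker wakes
up, every queued request eventually gets claimed.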
Basically, we have two problems to solve:

1) We need a reliable, lowest-overhead kickstart of the aufs I/O threads
at the end of a comm_poll run. Poll can return immediately without
running the scheduler if there are FDs ready, and forcibly blocking in
poll would cost a lost system tick for network I/O. Therefore I think we
need some other way to get the I/O threads running before going into
poll. We only need to make sure the I/O threads have grabbed their job
and are on the CPU run queue. Maybe only the last cond_signal is an
issue, if my guess above is right.

2) We need semi-reliable, lowest-latency notification of aio completion
while poll is blocking. This one is probably the more important. Could
the pipe FD do the trick (a small stand-alone demo follows below)? A
signal would, but at high loads it would cause a lot of overhead.
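To show what I mean by the pipe getting us out of poll with low latency,
here is a tiny stand-alone demo; it is not Squid code, just the
mechanism in isolation:

    #include <poll.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static int notify_pipe[2];

    /* worker: pretend to do ~50 ms of disk I/O, then notify */
    static void *
    worker(void *arg)
    {
        struct timespec ts = { 0, 50 * 1000 * 1000 };
        char c = 0;
        (void)arg;
        nanosleep(&ts, NULL);
        write(notify_pipe[1], &c, 1);   /* completion notification */
        return NULL;
    }

    int
    main(void)
    {
        pthread_t tid;
        struct pollfd pfd;
        char buf[8];

        pipe(notify_pipe);
        pthread_create(&tid, NULL, worker, NULL);

        pfd.fd = notify_pipe[0];
        pfd.events = POLLIN;
        poll(&pfd, 1, 10 * 1000);       /* 10 s timeout, returns after ~50 ms */
        read(notify_pipe[0], buf, sizeof(buf));
        printf("woken by completion, not by timeout\n");

        pthread_join(tid, NULL);
        return 0;
    }

Since writes of a single byte to a pipe are atomic, any number of I/O
threads can notify through the same write end, and the main loop only
has to drain the pipe and then walk the done queue.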
------------------------------------
Andres Kroonmaa <andre@online.ee>
CTO, Microlink Data AS
Tel: 6501 731, Fax: 6501 725
Pärnu mnt. 158, Tallinn
11317 Estonia