Alex Rousskov wrote:
> Right, when multiple IOs are detected on a FD, the async-code
> panics and starts registering that FD with select, effectively
> "waiting" for a thread to process the IO that was queued first.
> Expensive and breaks the async/sync code separation, IMO.
When does this happen (multiple outstanding I/O operations on
one FD)? Writes are queued, there should never be more than one
read operation on one fd, and there should be no reads and writes
on the same fd.
As we all know, using select() on disk files is mostly useless.
On most (all?) platforms it returns ready on every call, and I
do not believe select is currently used on async-io disk
operations.
Yes, the code waits for the previous operation to finish, but
not by registering the fd with select (if it does then the code
is/has been broken).
> I think there exist a nice solution to this problem which achieves
> fairness and balance without creating extra threads or using
> select(2). If I have time, I will try to write it down and test it.
We do not need something perfect. Something reasonable is enough.
Suggestion for both a reasonable load balancing and bypass on high
load:
* A number of threads assigned to each cache_dir, set by the
  cache_dir directive, as different disks have different speeds.
  This tuning capability is especially needed when normal disks and
  RAID disks are combined.
* Once a file is opened, all I/O operations on this fd are handled by
  the same thread to minimise the strange effects that can be
  expected on various strange OSes.
* Multiple outstanding writes are joined into one (I think this is
  already partially done), or handled as one large writev().
* Read prefetching is possible when time permits, as the thread can
  maintain some state info about the files it maintains.
* In order for an object to be cached on disk there must be at least
  one available thread. If there is no thread available on the most
  wanted cache_dir then try the next. Repeat until all available
  cache_dirs have been tried.
* Select/poll should not be used for disk files (I don't think the
  current async-io code uses select either)
* I/O operations larger than one page should be allowed and swapout
  delayed accordingly. 32K is probably a reasonable figure. Disks
  are quick on sequential reads/writes, and larger I/O operations
  hint the OS to spend some extra effort keeping larger files less
  fragmented.
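
To illustrate the "joined writes" and "larger I/O" points in the list
above, here is a minimal sketch in Python (not Squid code). The names
flush_pending, _write_all, MAX_IO and MAX_BUFS are hypothetical, and
the 32K threshold is just the figure suggested above. It batches the
buffers queued for one fd and sends each batch with a single
os.writev() call.

```python
import os
import tempfile

MAX_IO = 32 * 1024   # assumed batch size: the 32K figure suggested above
MAX_BUFS = 64        # assumed cap on buffers per writev(), well below IOV_MAX


def _write_all(fd, bufs):
    """Write every buffer in bufs to fd, retrying after short writes."""
    bufs = [memoryview(b) for b in bufs]
    while bufs:
        n = os.writev(fd, bufs)
        # Drop buffers that were written completely.
        while bufs and n >= len(bufs[0]):
            n -= len(bufs[0])
            bufs.pop(0)
        # Trim a buffer that was only partially written.
        if bufs and n:
            bufs[0] = bufs[0][n:]


def flush_pending(fd, pending):
    """Flush the queued buffers for one fd in as few writev() calls as
    the batch limits allow. Empties `pending` and returns the number of
    batches written."""
    batches = 0
    batch, size = [], 0
    for buf in pending:
        batch.append(buf)
        size += len(buf)
        if size >= MAX_IO or len(batch) >= MAX_BUFS:
            _write_all(fd, batch)
            batches += 1
            batch, size = [], 0
    if batch:
        _write_all(fd, batch)
        batches += 1
    pending.clear()
    return batches


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    queued = [b"x" * 4096 for _ in range(10)]   # ten page-sized writes
    print(flush_pending(fd, queued), "batches written")
    os.close(fd)
    os.remove(path)
```

Whether batching at the thread or earlier in the swapout path is the
better place is a separate question; the sketch only shows the
coalescing step itself.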
Effects:
1. Swapouts are not initiated on saturated disks
2. The I/O queue for one disk is limited by the number of assigned
   threads, giving a somewhat guaranteed maximum service time.
3. If all disks are saturated then disk is completely bypassed.
4. Less work for select/poll.
5. Larger I/O sizes allow for more efficient use of the available
   disk spindles (fewer seeks -> fewer iops -> higher throughput).
6. The current "magic" (round-robin of some of the less filled disks)
   for disk load balancing is unneeded. Simply try to fill the first
   available disk with most available space.
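
A rough sketch of the selection rule behind effects 1, 3 and 6, in
Python rather than Squid's C. The CacheDir class, its fields and
select_swapout_dir are hypothetical names made up for illustration,
and ranking "most wanted" purely by free space is an assumption taken
from effect 6.

```python
from dataclasses import dataclass


@dataclass
class CacheDir:
    name: str
    threads: int         # threads assigned to this cache_dir
    busy: int = 0        # threads currently servicing an I/O
    free_space: int = 0  # available space, in any consistent unit

    def has_idle_thread(self):
        return self.busy < self.threads


def select_swapout_dir(dirs):
    """Return the cache_dir with the most free space that still has an
    idle thread, or None if every disk is saturated (bypass the disk)."""
    for d in sorted(dirs, key=lambda d: d.free_space, reverse=True):
        if d.has_idle_thread():
            return d
    return None


if __name__ == "__main__":
    dirs = [
        CacheDir("/cache1", threads=2, busy=2, free_space=100),
        CacheDir("/cache2", threads=4, busy=1, free_space=50),
    ]
    chosen = select_swapout_dir(dirs)
    print(chosen.name if chosen else "all disks saturated, not caching")
```

With these two directories the emptier one is skipped because both of
its threads are busy, so the swapout goes to the second. If that one
also fills its threads, the function returns None and the object is
simply not written to disk.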
Open issues:
* How should HITs be handled on saturated disks?
  1) Should only-if-cached be denied?
  2) Should other requests be handled as misses?
My vote is that only-if-cached is denied, and other requests are
handled normally (slightly delayed).
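
The policy I am voting for is small enough to write down as a
decision function. This is only a sketch in Python; handle_hit and
the returned labels are hypothetical and do not correspond to
existing Squid code.

```python
def handle_hit(disk_saturated, only_if_cached):
    """Decide what to do with a cache HIT.

    disk_saturated: no idle thread on the cache_dir holding the object.
    only_if_cached: the request carries Cache-Control: only-if-cached.
    """
    if not disk_saturated:
        return "serve"
    if only_if_cached:
        # Denied; HTTP answers an unsatisfiable only-if-cached
        # request with 504 Gateway Timeout.
        return "deny"
    # Queue behind the outstanding I/O and serve slightly delayed.
    return "serve_delayed"


if __name__ == "__main__":
    for saturated in (False, True):
        for oic in (False, True):
            print(saturated, oic, handle_hit(saturated, oic))
```

The other option from the list (treating other requests as misses)
would only change the last return value.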
Comments are welcome. We should think this through before any major
coding begins.
/Henrik
Received on Tue Jul 29 2003 - 13:15:53 MDT