Re: [PATCH] Move timeout handling from ConnOpener::connect to ConnOpener::timeout

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Fri, 18 Jan 2013 14:35:18 +1300

On 18/01/2013 10:23 a.m., Alex Rousskov wrote:
> On 01/17/2013 10:56 AM, Rainer Weikusat wrote:
>> Alex Rousskov writes:
>>> On 01/16/2013 09:58 PM, Amos Jeffries wrote:
>>>> * timeout() happens first and runs doneConnecting() while connect() is
>>>> queued.
>>> Yes (timeout wins and ends ConnOpener job).
>> It is possible to 'do better' here fairly easily, see more detailed
>> explanation below.
> No, it is not. It is possible to spend a lot of time on writing,
> reviewing, and fixing complicated code that will optimize the case that
> virtually never happens. Correctly detecting this race and properly
> handling it is not easy and is not useful.
>
>
>>>> - we have no handle to the connect() AsyncCall stored so we can't
>>>> cancel it directly (bug?).
>>> We do not need to cancel it (just like with the timeout call discussed
>>> above). The async call will not be delivered to the destroyed ConnOpener
>>> job).
>> I'm rather in favour of cancelling the calls, at least where easily
>> possible, because this makes the debugging/ diagnostic output more
>> easy to understand.
> We should, of course, clear the stored timeout (or I/O) callback after
> that callback has been fired. Without that, swanSong() would try to
> clear a fired callback, and that would be confusing.
>
> I do not have the guts to object to us calling the cancel() method for
> the non-fired callback that is no longer relevant (as long as it is done
> consistently). However, I do not see how that improves debugging in
> ConnOpener case because the callback will know that the job is gone
> without our explicit cancellation. And I am sure somebody will
> eventually think that calling cancel has some useful side-effects such
> as closing file descriptors or freeing some memory (it does not).
>
>
>>> Ending the job is enough to guarantee that it will not be called via an
>>> async job call:
>> Taking this into account and after having spent some more time
>> thinking on this, I think what the timeout handler should do is,
>>
>> 1. Clear calls_.timeout_.
> Yes. calls.timeout must reflect whether there is a timeout notification
> scheduled (from our job point of view).

Yes.

>
>> 2. Check if temporaryFd_ is still valid. At the moment, this is always
>> the case, but it would prudent wrt general 'defensive programming' and
>> the check will become necessary for change 3|.
> I do not know what valid is in this context, but temporaryFd_ and
> earlyAbort must reflect whether there is a Comm I/O scheduled (from our
> job point of view).

Valid is (temporaryFd_ >= 0). Unless that is true there is no connect
operation in-progress, the Job is either setting up still or closing
down already. The doneAll() indicates whether closing down is underway
or will happen next.

When the connection is done successfully temporaryFd_ is moved to conn,
on failure temporaryFd_ is cleaned up directly. This is all handled by
doneConnecting().

>
>> 2a. If this was the case, check if the write handler of the
>> corresponding fde is still set.
> If I/O is scheduled, we should stop it, close the descriptor, and free
> memory using Comm APIs. This condition should probably be detected only
> in swanSong(), to avoid code duplication. The job should exit after a
> timeout, of course.
>
> Accessing Comm write handler pointers in some low-level table should be
> avoided if at all possible. If Comm APIs need to be adjusted to handle
> this better, we can adjust them.
>
> Amos, do you remember what "partially open FD" in the following comment
> meant, exactly? Why cannot we just call comm_close() to cleanup?

It means we have issued socket() and connect() OS calls. fd_table[]
state probably exists. But we are still waiting on either TCP response,
or the In-Progress event to run and the FD to be confirmed as open.
  This is signalled by (temporaryFd_ >= 0).

>
>> // it never reached fully open, so cleanup the FD handlers
>> // Note that comm_close() sequence does not happen for partially open FD
>
>
> In your latest patch, please move fd cleanup from timeout() and from
> doneConnecting() into a dedicated method and call that method from
> swanSong() if temporaryFd_ is still set there. Do not call it from
> doneConnecting() or anywhere else. Remember that a job may end for
> reasons other than some timeout or I/O. swanSong() is the cleanup method.
>
> Please do not forget to set calls.earlyAbort to NULL after canceling it
> in the fd cleanup method.

It was the design intention of doneConnecting() to cleanup the FD state
and be the single exit point for the Job. swanSong() will call
doneConnecting() with error flag if it is reached and doneConnecting()
has not been run yet to do that cleanup.

The timeout() and connect() handlers should do whatever cleanup they
need to override what doneConnecting() cleanup would do. ie timeout()
clearing the calls_.timeout_ instead of letting it be canceled with a
bogus message.

>
>> + ptr = static_cast<Pointer *>(F->write_data);
>> + delete ptr;
> This is not going to work well. F->write_data now points to deleted
> memory. If you absolutely have to do this, please set it to NULL after
> delete.

This code is doing what Comm::SetSelect(temporaryFd_, COMM_SELECT_WRITE,
NULL, NULL, 0); is *supposed* to do. However the SetSelect is still old
C code and *not* expecting a temporary ref-counting Pointer to exist
there, just a old C-style raw function pointer. To fix this properly we
need to do a major conversion of read and write in the fd_table to make
them smart pointer fields. After which we can just use SetSelect() with
NULL and be happy.

For now this is all a workaround for the memory leak caused by using new
Pointer() in
   Comm::SetSelect(temporaryFd_, COMM_SELECT_WRITE,
Comm::ConnOpener::InProgressConnectRetry, new Pointer(this), 0);
which we do in order to use valid() in the wrapper handler and prevent
segmentation faults accessing deleted objects in the handler (something
else which would be solved by the above upgrade).
So the workaround should explicitly be:

   ptr = static_cast<Pointer *>(F->write_data);
   delete ptr;
   F->write_data = NULL;
   Comm::SetSelect(temporaryFd_, COMM_SELECT_WRITE, NULL, NULL, 0);

With the SetSelect being there to ensure all the state related to write_data now or in future is consistent with it being NULL/deleted.
Please also add an XXX comment with the workaround code explaining the above triplet.

For completeness sake, src/comm.cc does this in the regular write
handler cleanup:

     if (COMMIO_FD_WRITECB(fd)->active()) {
         Comm::SetSelect(fd, COMM_SELECT_WRITE, NULL, NULL, 0);
         COMMIO_FD_WRITECB(fd)->finish(COMM_ERR_CLOSING, errno);
     }

That will be done for us by comm_close(). If we do that here as part of
the workaround cleaning up on timeout() the bit calling finish() will
schedule the ConnOpener::connect() or InProgress handler.

Amos
Received on Fri Jan 18 2013 - 01:35:26 MST

This archive was generated by hypermail 2.2.0 : Fri Jan 18 2013 - 12:00:07 MST