Thread: Fwd: Re: Fwd: Problem with recv syscall on socket when other side closed connection

Hello all,

This is the result of my six-day "war" with the Linux kernel people...
Any useful comments on this?

----------  Forwarded Message  ----------
Subject: Re: Fwd: Problem with recv syscall on socket when other side closed connection
Date: Tue, 27 Jun 2000 16:21:55 +0400 (MSK DST)
From: kuznet@ms2.inr.ac.ru

Hello!

> Sorry... but it seems that you did not understand the problem.
> I am talking about recv, not write... write SHOULD give EPIPE on connection reset,
> but not recv/read.

I did understand. The error was for write(), but it became known
_after_ write() returned, so it is delivered to read().
This is the usual problem with all full-duplex pipes.

We could translate this EPIPE to ECONNRESET when it is delivered
to read(), but that would not change its meaning.

Solaris does not translate.


> The usual way of handling a connection reset when you only read is to give
> all the data available and then return 0, indicating EOF.

Sorry? Think a bit.

You wrote to a dead socket, right? That is the hardest error.
If the transport were local, you would get SIGPIPE and die a painful death.
An OS that ignores such events is simply impossible to use:
you would get silently truncated data all the time.


> Or some OSes (HPUX, if I'm not mistaken) give you all the data available and then
> ECONNRESET. But not the other way around...

This approach has its merits, and it is acceptable in principle.

But the Linux approach is evidently better, because errors are expedited.
Any protocol where out-of-band events are inlined into the data
is prone to deadlocks.

In the Linux scheme you know up front that the stream is aborted.
Depending on the protocol, you may choose to abort,
or to continue operating, parsing already-received messages.
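The pattern described here — treat the reset as an expedited signal, but keep whatever was already received so it can still be parsed — can be sketched roughly as follows. This is an illustration in Python's socket API, not code from the thread; the helper name `read_all` and the buffer size are invented:

```python
import errno
import socket

def read_all(sock):
    """Read from a blocking socket until EOF or a connection error.

    Returns (data, status): status is "eof" for a normal close, or
    "reset" if the peer aborted.  Data received before the error is
    still returned, so the caller can parse any complete messages
    it already holds instead of discarding them.
    """
    chunks = []
    while True:
        try:
            chunk = sock.recv(4096)
        except OSError as exc:
            if exc.errno in (errno.ECONNRESET, errno.EPIPE):
                return b"".join(chunks), "reset"
            raise
        if not chunk:          # peer sent FIN: clean end of stream
            return b"".join(chunks), "eof"
        chunks.append(chunk)
```

On a clean shutdown the same loop simply reports "eof"; the caller decides whether a "reset" status aborts the protocol or merely ends it early.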


>         But not the other way around...

You have just seen a new way around. The correct one. 8)

Alexey
-------------------------------------------------------

-- 
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------


Denis Perchine <dyp@perchine.com> writes:
> This is the result of my six-day "war" with the Linux kernel people...
> Any useful comments on this?

The Linux people seem to be assuming (erroneously) that application
protocols are strictly I-send-and-then-you-send.  That's too
restrictive, and in fact it falls down in exactly the case that
we're seeing in libpq: one side of the connection may send an
error message "out of turn" and then close the connection.  If the
other side of the connection was busy sending, the first thing it
will get is an EPIPE error on its send.  By closing the connection
*AND DISCARDING VALID USER DATA* at this point, the Linux kernel
makes it impossible to retrieve the error message --- which might
have contained essential information.

> In the Linux scheme you know up front that the stream is aborted.
> Depending on the protocol, you may choose to abort,
> or to continue operating, parsing already-received messages.

But what about the messages you didn't get yet, but the other end
sent in good faith?  There's nothing in the TCP specs that says
a program can't close its end of the connection as soon as it has
sent the last data it intends to send.
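Sending the last data and then closing is exactly what a TCP half-close expresses. A minimal sketch, assuming Python's socket API, with a local socketpair standing in for a real TCP connection (so no RST is possible here; the message text is invented):

```python
import socket

a, b = socket.socketpair()          # stand-in for a TCP connection
a.sendall(b"error: out of turn")    # the "last gasp" message
a.shutdown(socket.SHUT_WR)          # half-close: FIN in one direction only

msg = b.recv(64)                    # the peer still receives the data...
eof = b.recv(64)                    # ...and then sees a clean EOF (b"")

b.sendall(b"ack")                   # the reverse direction still works,
reply = a.recv(64)                  # since only a's send side was closed
```

The dispute in this thread is precisely whether a kernel may turn the tail end of this exchange into an RST that destroys `msg` before the peer reads it.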

>> But not the other way around...

> You have just seen a new way around. The correct one. 8)

No, just a new half-baked excuse for doing things wrong.  The kernel at
the other side of the connection accepted the data for delivery.  That
means that both sides of the connection are going to make their best
efforts to deliver it.  By willfully failing to deliver that data, the
Linux kernel is violating the fundamental premise of TCP (or any other
reliable byte-stream protocol).  This is not "correct", it is broken.
Do I need to quote RFC chapter and verse at you?
        regards, tom lane


Re: Fwd: Re: Fwd: Problem with recv syscall on socket when other side closed connection

From: kuznet@ms2.inr.ac.ru
Hello!

> > This is the result of my six-day "war" with the Linux kernel people...

Alas, I did not understand that he was in a state of war.
I believed he was asking for advice and/or reporting a bug. 8)


> *AND DISCARDING VALID USER DATA* at this point, the Linux kernel
> makes it impossible to retrieve the error message --- which might
> have contained essential information.

It was explained to Denis that Linux _never_ discards any data...
Ough, I beg your pardon: Denis noticed this _himself_.
Please do not misinform people.

BTW look at this. It is RFC1122, 4.2.2.13:

            If a TCP connection is closed by the remote site, the local
            application MUST be informed whether it closed normally or
            was aborted.

See? 8)


Also, David Miller explained (and Denis understood and accepted this)
that TCP does not guarantee data delivery once an RST is issued.
This is solely due to intrinsic network unreliability and segment reordering.
An application depending on these aspects is broken.


> But what about the messages you didn't get yet, but the other end
> sent in good faith?  There's nothing in the TCP specs that says
> a program can't close its end of the connection as soon as it has
> sent the last data it intends to send.

The TCP specs say directly and unambiguously that sending data
to a half-closed pipe is followed by an immediate abort (RFC1122, a bit
below the citation above). A more detailed explanation can be found
in the current draft-ietf-tcpimpl documents.

> Do I need to quote RFC chapter and verse at you?

Of course.

Alexey


[ Sorry for delay in response, I had other things to do over the
weekend. ]

kuznet@ms2.inr.ac.ru writes:
> BTW look at this. It is RFC1122, 4.2.2.13. 

>                                     If a TCP
>             connection is closed by the remote site, the local
>             application MUST be informed whether it closed normally or
>             was aborted.

So?  This is not relevant, because the connection was not aborted.
The sentence immediately preceding that one defines an abort as an event
in which RST segment(s) are sent, but closure of a connection is defined
to send FIN, not RST.  (More about that below.)

The more relevant quote is the next paragraph:

            The normal TCP close sequence delivers buffered data
            reliably in both directions.  Since the two directions of a
            TCP connection are closed independently, it is possible for
            a connection to be "half closed," i.e., closed in only one
            direction, and a host is permitted to continue sending data
            in the open direction on a half-closed connection.

I do not see how you can read the first sentence of that paragraph in
any way but to say that data once sent must be delivered if at all
possible.  Another example is from RFC-793 (STD-7), section 3.8,
definition of CLOSE:
        Closing connections is intended to be a graceful operation in
        the sense that outstanding SENDs will be transmitted (and
        retransmitted), as flow control permits, until all have been
        serviced.  Thus, it should be acceptable to make several SEND
        calls, followed by a CLOSE, and expect all the data to be sent
        to the destination.

In our situation, the server sends (queues) some data and then closes
its side of the connection.  The server-side TCP stack should send the
data along with FIN and then go to FIN-WAIT-1 state.  In this state
the server side may receive more data from the client side (since the
client isn't yet aware the server has quit).  RFC-793 is perfectly
clear that the server side must send a dummy ACK but *no* RST in this
case --- see section 3.4, almost the end of the section:
    3.  If the connection is in a synchronized state (ESTABLISHED,
    FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
    any unacceptable segment (out of window sequence number or
    unacceptible acknowledgment number) must elicit only an empty
    acknowledgment segment containing the current send-sequence number
    and an acknowledgment indicating the next sequence number expected
    to be received, and the connection remains in the same state.

Therefore, sending data to a no-longer-present receiver does not cause
a connection reset (at least not in a spec-conforming TCP stack), and
there is no justification for discarding data that is coming the other
way.

The Linux kernel's present behavior is contrary to the standard, unable
to support an essential user capability (ie, delivery of last-gasp error
messages), and contrary to the behavior of all other TCP implementations
that I have worked with.  There is a reason why you are in the minority
here...
        regards, tom lane


Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
>> 3.  If the connection is in a synchronized state (ESTABLISHED,
>> FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
>> any unacceptable segment (out of window sequence number or
>> unacceptible acknowledgment number) must elicit only an empty
>> acknowledgment segment containing the current send-sequence number
>> and an acknowledgment indicating the next sequence number expected
>> to be received, and the connection remains in the same state.

> Reread the 3. above. What it actually requires if you think about it is that
> the receive window is shrunk to zero and the connection hangs for all
> eternity the way you are arguing it.

No, it doesn't "hang for all eternity", it sits in the same state until
(a) the client side closes its sending side of the connection (ie, sends
FIN), or (b) the FIN-WAIT-1 state times out.  But given a normally
responsive client and no loss of physical connectivity or crash of
either TCP stack, there is no connection reset and no failure to deliver
sent data.

There would be no need for all the half-open-connection verbiage if the
spec were meant to be read the way you are reading it.
        regards, tom lane


Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
>> No, it doesn't "hang for all eternity", it sits in the same state until
>> (a) the client side closes its sending side of the connection (ie, sends
>> FIN), or (b) the FIN-WAIT-1 state times out.  But given a normally
>> responsive client and no loss of physical connectivity or crash of
>> either TCP stack, there is no connection reset and no failure to deliver
>> sent data.

> I cannot ack the data since it has not been read, so how can I ack the FIN?

ACK does not mean that you've delivered the data to the end user.
RFC 793, 2.6:
    An acknowledgment by TCP does not guarantee that the data has been
    delivered to the end user, but only that the receiving TCP has
    taken the responsibility to do so.

Bit-bucketing the data because the end user app is no longer present to
accept it (due to having already closed its input socket) is implicitly
within the receiving TCP's authority here.  I think this is the core of
our disagreement, but I can see no justification for your position that
ACK implies the data has been delivered to the end user.  Every TCP
implementation I've ever heard of sends ACK as soon as it's collected
data into kernel buffers, *not* after the application has executed
recv() to extract the data from the kernel.  (Who's to say that
completion of recv() represents final delivery of the data anyway?
Sending ACK cannot be considered a report of end-to-end delivery;
that has to be an application-level concept.)

Also observe that the discussion of segment-arrival processing in
section 3.9 explicitly says that the behavior in FIN-WAIT-1 and later
states is not different from the behavior in ESTABLISHED state.  In
particular, if you do not like the segment:
       If an incoming segment is not acceptable, an acknowledgment
       should be sent in reply (unless the RST bit is set, if so drop
       the segment and return):

         <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

       After sending the acknowledgment, drop the unacceptable segment
       and return.

There is no room here for the TCP to decide to send RST instead.
        regards, tom lane


Re: Fwd: Re: Fwd: Problem with recv syscall on socket when other side closed connection

From: kuznet@ms2.inr.ac.ru
Hello!

>           <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
> 
>         After sending the acknowledgment, drop the unacceptable segment
>         and return.
> 
> There is no room here for the TCP to decide to send RST instead.

I apologize, but RFC793 is somewhat incomplete. Please look at the
errata in RFC1122 and at the bug alerts described in documents published
by tcp-impl (draft-tcpimpl-*).


I cited the corresponding paragraph of the RFC to you in a previous mail.
In short:

1. When new data arrives after a half-duplex close, we must reset.
2. When close occurs on a connection which has unread data, we must reset.

This is required from the viewpoint of the TCP protocol. Any OS which
forgets to do this is buggy. By the way, I do not know of any OSes
which fail to do this.

From the viewpoint of the application, the behaviour is also correct.
Data arriving when nobody plans to read it unambiguously means
either a connection abort or a hard bug in the application protocol.

Alexey


kuznet@ms2.inr.ac.ru writes:
> I apologize, but RFC793 is somewhat incomplete. Please look at the
> errata in RFC1122 and at the bug alerts described in documents published
> by tcp-impl (draft-tcpimpl-*).

The errata in RFC 1122 do not recommend any changes in connection
closure behavior.  You appear to be hanging your hat on this paragraph
from 1122 4.2.2.13:
           A host MAY implement a "half-duplex" TCP close sequence, so
           that an application that has called CLOSE cannot continue to
           read data from the connection.  If such a host issues a
           CLOSE call while received data is still pending in TCP, or
           if new data is received after CLOSE is called, its TCP
           SHOULD send a RST to show that data was lost.

However I read this as a requirement pertaining only to half-duplex
close sequences.  There is nothing half-duplex about closing a socket
completely.  In any case, it can hardly be a good idea to abort the
flow of data in the outbound direction in order to report that data
is being dropped in the inbound direction.  If an application has done
a half-close to close its inbound side only, but wants to keep sending
outbound data, it presumably has a good reason for doing so.  Behaving
as you suggest would render this mode of operation useless.
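For concreteness, the mode of operation at issue — closing only the inbound side while continuing to send — looks like this in Python's socket API, with a local socketpair standing in for TCP (variable names and the message are illustrative):

```python
import socket

a, b = socket.socketpair()       # stand-in for a TCP connection
b.shutdown(socket.SHUT_RD)       # b half-closes its *inbound* side only
b.sendall(b"still sending")      # ...but keeps using its outbound side
got = a.recv(64)                 # the peer still receives b's data
```

If the kernel answered b's half-close with an RST, this perfectly legitimate outbound traffic would be destroyed along with the inbound side.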

As for the drafts, I assume you are referring to sections 2.16 and 2.17
of RFC 2525 --- I couldn't find anything about connection resets in
the other files ftp://ftp.isi.edu/internet-drafts/draft-ietf-tcpimpl-*.
May I remind you that 2525 is an informational RFC, not a
standards-track RFC, and accordingly it has not been reviewed to the
extent that a proposed standards change would be?

I shall be writing to the authors of 2525 to object to sections 2.16
and 2.17 on the grounds that an RST causes data loss in the other
direction.  We'll see what they have to say.

> From the viewpoint of the application, the behaviour is also correct.
> Data arriving when nobody plans to read it unambiguously means
> either a connection abort or a hard bug in the application protocol.

Sure, it's a connection abort.  My point is that RST is an unacceptably
blunt instrument for reporting it, because it causes loss of data going
in the other direction.
        regards, tom lane


Re: Fwd: Re: Fwd: Problem with recv syscall on socket when other side closed connection

From: kuznet@ms2.inr.ac.ru
Hello!

> blunt instrument for reporting it, because it causes loss of data going
> in the other direction.

First: data which has reached the host is not lost.

Second: TCP may indeed lose data which did not reach the host
before the reset arrived.

From the second point we arrive at the next: if you send to a dead pipe,
or do not read some remnant of the data before closing, it is a _HARD_ bug
in your application or in the protocol.

Do you understand what a hard bug is? It is when further behaviour
is unpredictable and the state cannot be recovered. Essentially,
it is the thing that exceptions and fatal signals were invented for. 8)

Alexey


kuznet@ms2.inr.ac.ru writes:
>> blunt instrument for reporting it, because it causes loss of data going
>> in the other direction.

> First: data which has reached the host is not lost.

As I recall, the original complaint was precisely that Linux discards
the server->client data instead of allowing the client to read it.  This
was on a single machine, so there's no issue of whether it got lost in
the network.
        regards, tom lane


>     3.  If the connection is in a synchronized state (ESTABLISHED,
>     FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
>     any unacceptable segment (out of window sequence number or
>     unacceptible acknowledgment number) must elicit only an empty
>     acknowledgment segment containing the current send-sequence number
>     and an acknowledgment indicating the next sequence number expected
>     to be received, and the connection remains in the same state.
> 
> Therefore, sending data to a no-longer-present receiver does not cause
> a connection reset (at least not in a spec-conforming TCP stack), and
> there is no justification for discarding data that is coming the other
> way.
> 
> The Linux kernel's present behavior is contrary to the standard, unable
> to support an essential user capability (ie, delivery of last-gasp error
> messages), and contrary to the behavior of all other TCP implementations
> that I have worked with.  There is a reason why you are in the minority
> here...

Reread the 3. above. What it actually requires if you think about it is that
the receive window is shrunk to zero and the connection hangs for all
eternity the way you are arguing it.

Alan



> No, it doesn't "hang for all eternity", it sits in the same state until
> (a) the client side closes its sending side of the connection (ie, sends
> FIN), or (b) the FIN-WAIT-1 state times out.  But given a normally
> responsive client and no loss of physical connectivity or crash of
> either TCP stack, there is no connection reset and no failure to deliver
> sent data.

I cannot ack the data since it has not been read, so how can I ack the FIN?


Re: Fwd: Re: Fwd: Problem with recv syscall on socket when other side closed connection

From: kuznet@ms2.inr.ac.ru
Hello!

> As I recall, the original complaint was precisely that Linux discards
> the server->client data instead of allowing the client to read it.  This
> was on a single machine, so there's no issue of whether it got lost in
> the network.

I am sorry, but as I have already said: that is not true.

The original reporter (Denis) pointed specifically to the fact
that Linux allows you to read all queued data until EOF.
Try it yourself if you do not believe me.

Unfortunately, I deleted his mail, but I think you can find it
in the mail archives; it was sent to netdev or to linux-kernel.

Alexey


Hello kuznet,

Wednesday, July 05, 2000, 7:06:06 PM, you wrote:

kmiar> Hello!

>> As I recall, the original complaint was precisely that Linux discards
>> the server->client data instead of allowing the client to read it.  This
>> was on a single machine, so there's no issue of whether it got lost in
>> the network.

kmiar> I am sorry, but as I have already said: that is not true.

kmiar> The original reporter (Denis) pointed specifically to the fact
kmiar> that Linux allows you to read all queued data until EOF.
kmiar> Try it yourself if you do not believe me.

kmiar> Unfortunately, I deleted his mail, but I think you can find it
kmiar> in the mail archives; it was sent to netdev or to linux-kernel.

What I complained about is this: Linux gives you EPIPE when you call recv
before all the available data has been retrieved. If you try to read AFTER
the error, you will get all the data. The problem is that this makes
handling very complicated: in the case of EPIPE you should try to read
again, but you should only ever retry once.
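The retry-once workaround described here could be sketched as follows (in Python's socket API; `recv_retry_once` is an invented name, and the error branch reflects the Linux behaviour reported in this thread, not portable semantics):

```python
import errno
import socket

def recv_retry_once(sock, n):
    """recv() wrapper for the Linux behaviour described above: if the
    reset error arrives before the queued data has been drained, retry
    the read exactly once to pick up that data.

    On a normal connection this behaves like a plain recv(); on other
    systems the retry simply raises the error a second time.
    """
    try:
        return sock.recv(n)
    except OSError as exc:
        if exc.errno in (errno.EPIPE, errno.ECONNRESET):
            return sock.recv(n)    # retry exactly once; may raise again
        raise
```

The awkwardness Denis describes is visible in the shape of the code: the caller must remember that only the first error is "soft" and every later one is fatal.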

-- 
Best regards,
Denis                            mailto:dyp@perchine.com




Denis Perchine <dyp@perchine.com> writes:
> What I complained about is this: Linux gives you EPIPE when you call recv
> before all the available data has been retrieved. If you try to read AFTER
> the error, you will get all the data. The problem is that this makes
> handling very complicated: in the case of EPIPE you should try to read
> again, but you should only ever retry once.

Ah, thanks for the correction.  Now, should we really program around
this behavior of the Linux kernel?  I cannot think of any other OS where
it is appropriate to retry on an EPIPE error condition, nor does it make
any sense to do so given the normal meaning of that error code.  "Retry,
but only once" is even sillier.

I still think this behavior is a bug, not a feature.  If you want to
issue EPIPE (or more likely ECONNRESET) *after* all available data has
been read, that's fine, and EPIPE for subsequent send attempts makes
sense too.  But EPIPE on read when there is more data available is just
plain bizarre.
        regards, tom lane


Hello!

> What I complained about is this: Linux gives you EPIPE when you call recv
> before all the available data has been retrieved. If you try to read AFTER
> the error, you will get all the data. The problem is that this makes
> handling very complicated: in the case of EPIPE you should try to read
> again, but you should only ever retry once.

Well, to me it does not look very essential when an asynchronous
error is returned. Remember EAGAIN and EINTR. You are not confused
by such errors, right? Why not? 8)

It seems this order of issuing errors etc. at read() is specified in POSIX;
I do not really know. I have said that reporting the error only when no data
is pending looks legal and has its merits. The main thing is not to forget
to report the error at all. 8)

[ Alan, it seems all the comments about the order of checks in read()
are yours. Can you comment? Maybe it is really worth changing. ]

Side note: TLI really does not _allow_ any operations on an endpoint
in any direction until the asynchronous error condition is cleared.
In fact, Linux does this on BSD sockets as well.
This is quite natural, but I agree it is still inconvenient.

Alexey