Thread: [HACKERS] walsender termination error messages worse in v10

[HACKERS] walsender termination error messages worse in v10

From

Andres Freund

Date:

03 June 2017, 00:51:26

Hi,

commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
Author: Peter Eisentraut <peter_e@gmx.net>
Date:   2017-03-23 08:36:36 -0400
   Logical replication support for initial data copy

made walreceiver emit worse messages in v10 than before when the master
gets shut down.  Before 10 you'll get something like:

2017-06-02 11:46:07.173 PDT [16143][] LOG:  replication terminated by primary server
2017-06-02 11:46:07.173 PDT [16143][] DETAIL:  End of WAL reached on timeline 1 at 0/1B7FB8C8.
2017-06-02 11:46:07.173 PDT [16143][] FATAL:  could not send end-of-streaming message to primary: no COPY in progress

the last bit is a bit superflous, but it still kinda makes sense.  Now
you get:
2017-06-02 11:47:09.362 PDT [17061][] FATAL:  unexpected result after CommandComplete: server closed the connection
unexpectedly              This probably means the server terminated abnormally               before or while processing
therequest.
 

the reason is that

static int
libpqrcv_receive(char **buffer, pgsocket *wait_fd)
{

previously did:   if (rawlen == -1)           /* end-of-streaming or error */   {       PGresult   *res;
       res = PQgetResult(streamConn);       if (PQresultStatus(res) == PGRES_COMMAND_OK ||
PQresultStatus(res)== PGRES_COPY_IN)       {           PQclear(res);           return -1;       }       else       {
      PQclear(res);           ereport(ERROR,                   (errmsg("could not receive data from WAL stream: %s",
                      PQerrorMessage(streamConn))));       }   }
 

and now does
   if (rawlen == -1)           /* end-of-streaming or error */   {       PGresult   *res;
       res = PQgetResult(conn->streamConn);       if (PQresultStatus(res) == PGRES_COMMAND_OK)       {
PQclear(res);
           /* Verify that there are no more results */           res = PQgetResult(conn->streamConn);           if (res
!=NULL)               ereport(ERROR,                       (errmsg("unexpected result after CommandComplete: %s",
                       PQerrorMessage(conn->streamConn))));           return -1;       }       else if
(PQresultStatus(res)== PGRES_COPY_IN)       {           PQclear(res);           return -1;       }       else       {
       PQclear(res);           ereport(ERROR,                   (errmsg("could not receive data from WAL stream: %s",
                       pchomp(PQerrorMessage(conn->streamConn)))));       }   }
 

note the new /* Verify that there are no more results */ bit.

I don't understand why the new block is there, nor does the commit
message explain it.


Greetings,

Andres Freund

Re: [HACKERS] walsender termination error messages worse in v10

From

Petr Jelinek

Date:

03 June 2017, 02:57:37

Hi Andres,

On 02/06/17 20:51, Andres Freund wrote:
> Hi,
> 
> commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
> Author: Peter Eisentraut <peter_e@gmx.net>
> Date:   2017-03-23 08:36:36 -0400
> 
>     Logical replication support for initial data copy
> 
> made walreceiver emit worse messages in v10 than before when the master
> gets shut down.  Before 10 you'll get something like:
> 
> 2017-06-02 11:46:07.173 PDT [16143][] LOG:  replication terminated by primary server
> 2017-06-02 11:46:07.173 PDT [16143][] DETAIL:  End of WAL reached on timeline 1 at 0/1B7FB8C8.
> 2017-06-02 11:46:07.173 PDT [16143][] FATAL:  could not send end-of-streaming message to primary: no COPY in
progress
> 
> the last bit is a bit superflous, but it still kinda makes sense.  Now
> you get:
> 2017-06-02 11:47:09.362 PDT [17061][] FATAL:  unexpected result after CommandComplete: server closed the connection
unexpectedly
>                 This probably means the server terminated abnormally
>                 before or while processing the request.
> 
> the reason is that
> 
> static int
> libpqrcv_receive(char **buffer, pgsocket *wait_fd)
> {
> 
> previously did:
>     if (rawlen == -1)           /* end-of-streaming or error */
>     {
>         PGresult   *res;
> 
>         res = PQgetResult(streamConn);
>         if (PQresultStatus(res) == PGRES_COMMAND_OK ||
>             PQresultStatus(res) == PGRES_COPY_IN)
>         {
>             PQclear(res);
>             return -1;
>         }
>         else
>         {
>             PQclear(res);
>             ereport(ERROR,
>                     (errmsg("could not receive data from WAL stream: %s",
>                             PQerrorMessage(streamConn))));
>         }
>     }
> 
> and now does
> 
>     if (rawlen == -1)           /* end-of-streaming or error */
>     {
>         PGresult   *res;
> 
>         res = PQgetResult(conn->streamConn);
>         if (PQresultStatus(res) == PGRES_COMMAND_OK)
>         {
>             PQclear(res);
> 
>             /* Verify that there are no more results */
>             res = PQgetResult(conn->streamConn);
>             if (res != NULL)
>                 ereport(ERROR,
>                         (errmsg("unexpected result after CommandComplete: %s",
>                                 PQerrorMessage(conn->streamConn))));
>             return -1;
>         }
>         else if (PQresultStatus(res) == PGRES_COPY_IN)
>         {
>             PQclear(res);
>             return -1;
>         }
>         else
>         {
>             PQclear(res);
>             ereport(ERROR,
>                     (errmsg("could not receive data from WAL stream: %s",
>                             pchomp(PQerrorMessage(conn->streamConn)))));
>         }
>     }
> 
> note the new /* Verify that there are no more results */ bit.
> 
> I don't understand why the new block is there, nor does the commit
> message explain it.
> 

Hmm, that particular change can actually be reverted. It was needed for
one those custom replication commands which were replaced by normal
query support. I have missed it during the rewrite.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services

Re: [HACKERS] walsender termination error messages worse in v10

From

Andres Freund

Date:

03 June 2017, 03:45:59

Hi Petr,

On 2017-06-02 22:57:37 +0200, Petr Jelinek wrote:
> On 02/06/17 20:51, Andres Freund wrote:
> > I don't understand why the new block is there, nor does the commit
> > message explain it.
> > 
> 
> Hmm, that particular change can actually be reverted. It was needed for
> one those custom replication commands which were replaced by normal
> query support. I have missed it during the rewrite.

Doesn't appear to be quite that simple, I get regression test failures
in that case.

Greetings,

Andres Freund

Re: [HACKERS] walsender termination error messages worse in v10

From

Petr Jelinek

Date:

03 June 2017, 04:55:22

On 02/06/17 23:45, Andres Freund wrote:
> Hi Petr,
> 
> On 2017-06-02 22:57:37 +0200, Petr Jelinek wrote:
>> On 02/06/17 20:51, Andres Freund wrote:
>>> I don't understand why the new block is there, nor does the commit
>>> message explain it.
>>>
>>
>> Hmm, that particular change can actually be reverted. It was needed for
>> one those custom replication commands which were replaced by normal
>> query support. I have missed it during the rewrite.
> 
> Doesn't appear to be quite that simple, I get regression test failures
> in that case.
> 

Hmm, looks like we still use it for normal COPY handling. So basically
the problem is that if we run COPY TO STDOUT and then consume it using
the libpqrcv_receive it will end with normal PGRES_COMMAND_OK but we
need to call PQgetResult() in that case otherwise libpq thinks the
command is still active and any following command will fail, but if we
call PQgetResult on dead connection we get that error you complained about.

I guess it would make sense to do conditional exit on
(PQstatus(streamConn) == CONNECTION_BAD) like libpqrcv_PQexec does. It's
quite ugly code-wise though.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

libpqwalsender-disconnect-fix.diff

Re: [HACKERS] walsender termination error messages worse in v10

From

Noah Misch

Date:

08 June 2017, 11:57:58

On Fri, Jun 02, 2017 at 11:51:26AM -0700, Andres Freund wrote:
> commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
> Author: Peter Eisentraut <peter_e@gmx.net>
> Date:   2017-03-23 08:36:36 -0400
> 
>     Logical replication support for initial data copy
> 
> made walreceiver emit worse messages in v10 than before when the master
> gets shut down.  Before 10 you'll get something like:
> 
> 2017-06-02 11:46:07.173 PDT [16143][] LOG:  replication terminated by primary server
> 2017-06-02 11:46:07.173 PDT [16143][] DETAIL:  End of WAL reached on timeline 1 at 0/1B7FB8C8.
> 2017-06-02 11:46:07.173 PDT [16143][] FATAL:  could not send end-of-streaming message to primary: no COPY in
progress
> 
> the last bit is a bit superflous, but it still kinda makes sense.  Now
> you get:
> 2017-06-02 11:47:09.362 PDT [17061][] FATAL:  unexpected result after CommandComplete: server closed the connection
unexpectedly
>                 This probably means the server terminated abnormally
>                 before or while processing the request.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

Re: [HACKERS] walsender termination error messages worse in v10

From

Andres Freund

Date:

09 June 2017, 03:57:19

On 2017-06-03 00:55:22 +0200, Petr Jelinek wrote:
> On 02/06/17 23:45, Andres Freund wrote:
> > Hi Petr,
> > 
> > On 2017-06-02 22:57:37 +0200, Petr Jelinek wrote:
> >> On 02/06/17 20:51, Andres Freund wrote:
> >>> I don't understand why the new block is there, nor does the commit
> >>> message explain it.
> >>>
> >>
> >> Hmm, that particular change can actually be reverted. It was needed for
> >> one those custom replication commands which were replaced by normal
> >> query support. I have missed it during the rewrite.
> > 
> > Doesn't appear to be quite that simple, I get regression test failures
> > in that case.
> > 
> 
> Hmm, looks like we still use it for normal COPY handling. So basically
> the problem is that if we run COPY TO STDOUT and then consume it using
> the libpqrcv_receive it will end with normal PGRES_COMMAND_OK but we
> need to call PQgetResult() in that case otherwise libpq thinks the
> command is still active and any following command will fail, but if we
> call PQgetResult on dead connection we get that error you complained about.

Should this possibly handled at the caller level?  This is a bit too
much magic for my taste.

Looking at the callers, the new code isn't super-obvious either:
       len = walrcv_receive(wrconn, &buf, &fd);
       if (len != 0)       {           /* Process the data */           for (;;)           {
CHECK_FOR_INTERRUPTS();
               if (len == 0)               {                   break;               }               else if (len < 0)
           {                   ereport(LOG,                           (errmsg("data stream from publisher has
ended")));                  endofstream = true;                   break;               }
 

The len < 0, hidden inside a len != 0, which in the loop again chcks if
len == 0 (because it's decremented in loop)?  And there's two different[5~    len = walrcv_receive(wrconn, &buf, &fd);
statements?

> I guess it would make sense to do conditional exit on
> (PQstatus(streamConn) == CONNECTION_BAD) like libpqrcv_PQexec does. It's
> quite ugly code-wise though.

I think that's fine for now.  It'd imo be a good idea to improve matters
here a bit, but for now I've just applied your patch.

- Andres

Re: [HACKERS] walsender termination error messages worse in v10

From

Petr Jelinek

Date:

09 June 2017, 07:16:59

On 08/06/17 23:57, Andres Freund wrote:
> On 2017-06-03 00:55:22 +0200, Petr Jelinek wrote:
>> On 02/06/17 23:45, Andres Freund wrote:
>>> Hi Petr,
>>>
>>> On 2017-06-02 22:57:37 +0200, Petr Jelinek wrote:
>>>> On 02/06/17 20:51, Andres Freund wrote:
>>>>> I don't understand why the new block is there, nor does the commit
>>>>> message explain it.
>>>>>
>>>>
>>>> Hmm, that particular change can actually be reverted. It was needed for
>>>> one those custom replication commands which were replaced by normal
>>>> query support. I have missed it during the rewrite.
>>>
>>> Doesn't appear to be quite that simple, I get regression test failures
>>> in that case.
>>>
>>
>> Hmm, looks like we still use it for normal COPY handling. So basically
>> the problem is that if we run COPY TO STDOUT and then consume it using
>> the libpqrcv_receive it will end with normal PGRES_COMMAND_OK but we
>> need to call PQgetResult() in that case otherwise libpq thinks the
>> command is still active and any following command will fail, but if we
>> call PQgetResult on dead connection we get that error you complained about.
> 
> Should this possibly handled at the caller level?  This is a bit too
> much magic for my taste.

It is, I agree with that. I guess the main issue with handling it in
callers is that callers can't access libpq api so there would need to be
some additional abstraction (we did some of the necessary parts of that
for libpqrcv_exec).

> 
> Looking at the callers, the new code isn't super-obvious either:
> 
>         len = walrcv_receive(wrconn, &buf, &fd);
> 
>         if (len != 0)
>         {
>             /* Process the data */
>             for (;;)
>             {
>                 CHECK_FOR_INTERRUPTS();
> 
>                 if (len == 0)
>                 {
>                     break;
>                 }
>                 else if (len < 0)
>                 {
>                     ereport(LOG,
>                             (errmsg("data stream from publisher has ended")));
>                     endofstream = true;
>                     break;
>                 }
> 
> The len < 0, hidden inside a len != 0, which in the loop again chcks if
> len == 0 (because it's decremented in loop)?  And there's two different[5~
>         len = walrcv_receive(wrconn, &buf, &fd);
> statements?
> 

The logic there is that you have one main loop and one busy loop. When
there is flow of data, things happen as the data are processed and we
don't need to do the bookkeeping of timeouts, latching, other
maintenance tasks that need to be done (like checking table
synchronization in logical replication) because it all happens naturally
and that's what the inner loop is for. But when the upstream gets idle
and does not send anything (or there is network issue, etc) we need to
do these things (otherwise we'd miss timeouts, table synchronization
worker would wait forever on apply, etc) and that's what the outer loop
is for.

We only enter the inner loop if there were some data that's why there is
the firs len != 0, but then the inner loop bases decisions on the return
code of the walrcv_receive() so the if is repeated again (it also will
call walrcv_receive() as long as there are data so it needs to check for
those calls too).

It took me a while to understand why walreceiver does this originally,
but it did make sense to me once I understood the split between the main
(idle) loop and busy loop, so I did it the same way.

>> I guess it would make sense to do conditional exit on
>> (PQstatus(streamConn) == CONNECTION_BAD) like libpqrcv_PQexec does. It's
>> quite ugly code-wise though.
> 
> I think that's fine for now.  It'd imo be a good idea to improve matters
> here a bit, but for now I've just applied your patch.
> 

Okay, thanks.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services

Re: [HACKERS] walsender termination error messages worse in v10

From

Alvaro Herrera

Date:

09 June 2017, 08:26:47

Petr Jelinek wrote:

> It took me a while to understand why walreceiver does this originally,
> but it did make sense to me once I understood the [...], so I did it
> the same way.

In other words, the coding pattern needs a comment in both places.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] walsender termination error messages worse in v10

From

Peter Eisentraut

Date:

09 June 2017, 18:31:17

On 6/8/17 01:57, Noah Misch wrote:
> On Fri, Jun 02, 2017 at 11:51:26AM -0700, Andres Freund wrote:
>> commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
>> Author: Peter Eisentraut <peter_e@gmx.net>
>> Date:   2017-03-23 08:36:36 -0400
>>
>>     Logical replication support for initial data copy
>>
>> made walreceiver emit worse messages in v10 than before when the master
>> gets shut down.  Before 10 you'll get something like:
>>
>> 2017-06-02 11:46:07.173 PDT [16143][] LOG:  replication terminated by primary server
>> 2017-06-02 11:46:07.173 PDT [16143][] DETAIL:  End of WAL reached on timeline 1 at 0/1B7FB8C8.
>> 2017-06-02 11:46:07.173 PDT [16143][] FATAL:  could not send end-of-streaming message to primary: no COPY in
progress
>>
>> the last bit is a bit superflous, but it still kinda makes sense.  Now
>> you get:
>> 2017-06-02 11:47:09.362 PDT [17061][] FATAL:  unexpected result after CommandComplete: server closed the connection
unexpectedly
>>                 This probably means the server terminated abnormally
>>                 before or while processing the request.
> 
> [Action required within three days.  This is a generic notification.]

This was fixed by Andres.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services