Thread: [HACKERS] walsender termination error messages worse in v10
Hi, commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920 Author: Peter Eisentraut <peter_e@gmx.net> Date: 2017-03-23 08:36:36 -0400 Logical replication support for initial data copy made walreceiver emit worse messages in v10 than before when the master gets shut down. Before 10 you'll get something like: 2017-06-02 11:46:07.173 PDT [16143][] LOG: replication terminated by primary server 2017-06-02 11:46:07.173 PDT [16143][] DETAIL: End of WAL reached on timeline 1 at 0/1B7FB8C8. 2017-06-02 11:46:07.173 PDT [16143][] FATAL: could not send end-of-streaming message to primary: no COPY in progress the last bit is a bit superflous, but it still kinda makes sense. Now you get: 2017-06-02 11:47:09.362 PDT [17061][] FATAL: unexpected result after CommandComplete: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing therequest. the reason is that static int libpqrcv_receive(char **buffer, pgsocket *wait_fd) { previously did: if (rawlen == -1) /* end-of-streaming or error */ { PGresult *res; res = PQgetResult(streamConn); if (PQresultStatus(res) == PGRES_COMMAND_OK || PQresultStatus(res)== PGRES_COPY_IN) { PQclear(res); return -1; } else { PQclear(res); ereport(ERROR, (errmsg("could not receive data from WAL stream: %s", PQerrorMessage(streamConn)))); } } and now does if (rawlen == -1) /* end-of-streaming or error */ { PGresult *res; res = PQgetResult(conn->streamConn); if (PQresultStatus(res) == PGRES_COMMAND_OK) { PQclear(res); /* Verify that there are no more results */ res = PQgetResult(conn->streamConn); if (res !=NULL) ereport(ERROR, (errmsg("unexpected result after CommandComplete: %s", PQerrorMessage(conn->streamConn)))); return -1; } else if (PQresultStatus(res)== PGRES_COPY_IN) { PQclear(res); return -1; } else { PQclear(res); ereport(ERROR, (errmsg("could not receive data from WAL stream: %s", pchomp(PQerrorMessage(conn->streamConn))))); } } note the new /* Verify that there are no more results */ bit. I don't understand why the new block is there, nor does the commit message explain it. Greetings, Andres Freund
Hi Andres, On 02/06/17 20:51, Andres Freund wrote: > Hi, > > commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920 > Author: Peter Eisentraut <peter_e@gmx.net> > Date: 2017-03-23 08:36:36 -0400 > > Logical replication support for initial data copy > > made walreceiver emit worse messages in v10 than before when the master > gets shut down. Before 10 you'll get something like: > > 2017-06-02 11:46:07.173 PDT [16143][] LOG: replication terminated by primary server > 2017-06-02 11:46:07.173 PDT [16143][] DETAIL: End of WAL reached on timeline 1 at 0/1B7FB8C8. > 2017-06-02 11:46:07.173 PDT [16143][] FATAL: could not send end-of-streaming message to primary: no COPY in progress > > the last bit is a bit superflous, but it still kinda makes sense. Now > you get: > 2017-06-02 11:47:09.362 PDT [17061][] FATAL: unexpected result after CommandComplete: server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > > the reason is that > > static int > libpqrcv_receive(char **buffer, pgsocket *wait_fd) > { > > previously did: > if (rawlen == -1) /* end-of-streaming or error */ > { > PGresult *res; > > res = PQgetResult(streamConn); > if (PQresultStatus(res) == PGRES_COMMAND_OK || > PQresultStatus(res) == PGRES_COPY_IN) > { > PQclear(res); > return -1; > } > else > { > PQclear(res); > ereport(ERROR, > (errmsg("could not receive data from WAL stream: %s", > PQerrorMessage(streamConn)))); > } > } > > and now does > > if (rawlen == -1) /* end-of-streaming or error */ > { > PGresult *res; > > res = PQgetResult(conn->streamConn); > if (PQresultStatus(res) == PGRES_COMMAND_OK) > { > PQclear(res); > > /* Verify that there are no more results */ > res = PQgetResult(conn->streamConn); > if (res != NULL) > ereport(ERROR, > (errmsg("unexpected result after CommandComplete: %s", > PQerrorMessage(conn->streamConn)))); > return -1; > } > else if (PQresultStatus(res) == PGRES_COPY_IN) > { > PQclear(res); > return -1; > } > else > { > PQclear(res); > ereport(ERROR, > (errmsg("could not receive data from WAL stream: %s", > pchomp(PQerrorMessage(conn->streamConn))))); > } > } > > note the new /* Verify that there are no more results */ bit. > > I don't understand why the new block is there, nor does the commit > message explain it. > Hmm, that particular change can actually be reverted. It was needed for one those custom replication commands which were replaced by normal query support. I have missed it during the rewrite. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi Petr, On 2017-06-02 22:57:37 +0200, Petr Jelinek wrote: > On 02/06/17 20:51, Andres Freund wrote: > > I don't understand why the new block is there, nor does the commit > > message explain it. > > > > Hmm, that particular change can actually be reverted. It was needed for > one those custom replication commands which were replaced by normal > query support. I have missed it during the rewrite. Doesn't appear to be quite that simple, I get regression test failures in that case. Greetings, Andres Freund
On 02/06/17 23:45, Andres Freund wrote: > Hi Petr, > > On 2017-06-02 22:57:37 +0200, Petr Jelinek wrote: >> On 02/06/17 20:51, Andres Freund wrote: >>> I don't understand why the new block is there, nor does the commit >>> message explain it. >>> >> >> Hmm, that particular change can actually be reverted. It was needed for >> one those custom replication commands which were replaced by normal >> query support. I have missed it during the rewrite. > > Doesn't appear to be quite that simple, I get regression test failures > in that case. > Hmm, looks like we still use it for normal COPY handling. So basically the problem is that if we run COPY TO STDOUT and then consume it using the libpqrcv_receive it will end with normal PGRES_COMMAND_OK but we need to call PQgetResult() in that case otherwise libpq thinks the command is still active and any following command will fail, but if we call PQgetResult on dead connection we get that error you complained about. I guess it would make sense to do conditional exit on (PQstatus(streamConn) == CONNECTION_BAD) like libpqrcv_PQexec does. It's quite ugly code-wise though. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Fri, Jun 02, 2017 at 11:51:26AM -0700, Andres Freund wrote: > commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920 > Author: Peter Eisentraut <peter_e@gmx.net> > Date: 2017-03-23 08:36:36 -0400 > > Logical replication support for initial data copy > > made walreceiver emit worse messages in v10 than before when the master > gets shut down. Before 10 you'll get something like: > > 2017-06-02 11:46:07.173 PDT [16143][] LOG: replication terminated by primary server > 2017-06-02 11:46:07.173 PDT [16143][] DETAIL: End of WAL reached on timeline 1 at 0/1B7FB8C8. > 2017-06-02 11:46:07.173 PDT [16143][] FATAL: could not send end-of-streaming message to primary: no COPY in progress > > the last bit is a bit superflous, but it still kinda makes sense. Now > you get: > 2017-06-02 11:47:09.362 PDT [17061][] FATAL: unexpected result after CommandComplete: server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. [Action required within three days. This is a generic notification.] The above-described topic is currently a PostgreSQL 10 open item. Peter, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a v10 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
On 2017-06-03 00:55:22 +0200, Petr Jelinek wrote: > On 02/06/17 23:45, Andres Freund wrote: > > Hi Petr, > > > > On 2017-06-02 22:57:37 +0200, Petr Jelinek wrote: > >> On 02/06/17 20:51, Andres Freund wrote: > >>> I don't understand why the new block is there, nor does the commit > >>> message explain it. > >>> > >> > >> Hmm, that particular change can actually be reverted. It was needed for > >> one those custom replication commands which were replaced by normal > >> query support. I have missed it during the rewrite. > > > > Doesn't appear to be quite that simple, I get regression test failures > > in that case. > > > > Hmm, looks like we still use it for normal COPY handling. So basically > the problem is that if we run COPY TO STDOUT and then consume it using > the libpqrcv_receive it will end with normal PGRES_COMMAND_OK but we > need to call PQgetResult() in that case otherwise libpq thinks the > command is still active and any following command will fail, but if we > call PQgetResult on dead connection we get that error you complained about. Should this possibly handled at the caller level? This is a bit too much magic for my taste. Looking at the callers, the new code isn't super-obvious either: len = walrcv_receive(wrconn, &buf, &fd); if (len != 0) { /* Process the data */ for (;;) { CHECK_FOR_INTERRUPTS(); if (len == 0) { break; } else if (len < 0) { ereport(LOG, (errmsg("data stream from publisher has ended"))); endofstream = true; break; } The len < 0, hidden inside a len != 0, which in the loop again chcks if len == 0 (because it's decremented in loop)? And there's two different[5~ len = walrcv_receive(wrconn, &buf, &fd); statements? > I guess it would make sense to do conditional exit on > (PQstatus(streamConn) == CONNECTION_BAD) like libpqrcv_PQexec does. It's > quite ugly code-wise though. I think that's fine for now. It'd imo be a good idea to improve matters here a bit, but for now I've just applied your patch. - Andres
On 08/06/17 23:57, Andres Freund wrote: > On 2017-06-03 00:55:22 +0200, Petr Jelinek wrote: >> On 02/06/17 23:45, Andres Freund wrote: >>> Hi Petr, >>> >>> On 2017-06-02 22:57:37 +0200, Petr Jelinek wrote: >>>> On 02/06/17 20:51, Andres Freund wrote: >>>>> I don't understand why the new block is there, nor does the commit >>>>> message explain it. >>>>> >>>> >>>> Hmm, that particular change can actually be reverted. It was needed for >>>> one those custom replication commands which were replaced by normal >>>> query support. I have missed it during the rewrite. >>> >>> Doesn't appear to be quite that simple, I get regression test failures >>> in that case. >>> >> >> Hmm, looks like we still use it for normal COPY handling. So basically >> the problem is that if we run COPY TO STDOUT and then consume it using >> the libpqrcv_receive it will end with normal PGRES_COMMAND_OK but we >> need to call PQgetResult() in that case otherwise libpq thinks the >> command is still active and any following command will fail, but if we >> call PQgetResult on dead connection we get that error you complained about. > > Should this possibly handled at the caller level? This is a bit too > much magic for my taste. It is, I agree with that. I guess the main issue with handling it in callers is that callers can't access libpq api so there would need to be some additional abstraction (we did some of the necessary parts of that for libpqrcv_exec). > > Looking at the callers, the new code isn't super-obvious either: > > len = walrcv_receive(wrconn, &buf, &fd); > > if (len != 0) > { > /* Process the data */ > for (;;) > { > CHECK_FOR_INTERRUPTS(); > > if (len == 0) > { > break; > } > else if (len < 0) > { > ereport(LOG, > (errmsg("data stream from publisher has ended"))); > endofstream = true; > break; > } > > The len < 0, hidden inside a len != 0, which in the loop again chcks if > len == 0 (because it's decremented in loop)? And there's two different[5~ > len = walrcv_receive(wrconn, &buf, &fd); > statements? > The logic there is that you have one main loop and one busy loop. When there is flow of data, things happen as the data are processed and we don't need to do the bookkeeping of timeouts, latching, other maintenance tasks that need to be done (like checking table synchronization in logical replication) because it all happens naturally and that's what the inner loop is for. But when the upstream gets idle and does not send anything (or there is network issue, etc) we need to do these things (otherwise we'd miss timeouts, table synchronization worker would wait forever on apply, etc) and that's what the outer loop is for. We only enter the inner loop if there were some data that's why there is the firs len != 0, but then the inner loop bases decisions on the return code of the walrcv_receive() so the if is repeated again (it also will call walrcv_receive() as long as there are data so it needs to check for those calls too). It took me a while to understand why walreceiver does this originally, but it did make sense to me once I understood the split between the main (idle) loop and busy loop, so I did it the same way. >> I guess it would make sense to do conditional exit on >> (PQstatus(streamConn) == CONNECTION_BAD) like libpqrcv_PQexec does. It's >> quite ugly code-wise though. > > I think that's fine for now. It'd imo be a good idea to improve matters > here a bit, but for now I've just applied your patch. > Okay, thanks. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Petr Jelinek wrote: > It took me a while to understand why walreceiver does this originally, > but it did make sense to me once I understood the [...], so I did it > the same way. In other words, the coding pattern needs a comment in both places. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 6/8/17 01:57, Noah Misch wrote: > On Fri, Jun 02, 2017 at 11:51:26AM -0700, Andres Freund wrote: >> commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920 >> Author: Peter Eisentraut <peter_e@gmx.net> >> Date: 2017-03-23 08:36:36 -0400 >> >> Logical replication support for initial data copy >> >> made walreceiver emit worse messages in v10 than before when the master >> gets shut down. Before 10 you'll get something like: >> >> 2017-06-02 11:46:07.173 PDT [16143][] LOG: replication terminated by primary server >> 2017-06-02 11:46:07.173 PDT [16143][] DETAIL: End of WAL reached on timeline 1 at 0/1B7FB8C8. >> 2017-06-02 11:46:07.173 PDT [16143][] FATAL: could not send end-of-streaming message to primary: no COPY in progress >> >> the last bit is a bit superflous, but it still kinda makes sense. Now >> you get: >> 2017-06-02 11:47:09.362 PDT [17061][] FATAL: unexpected result after CommandComplete: server closed the connection unexpectedly >> This probably means the server terminated abnormally >> before or while processing the request. > > [Action required within three days. This is a generic notification.] This was fixed by Andres. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services