Thread: Failed test 'pg_recvlogical acknowledged changes, nothing pending on slot'

Failed test 'pg_recvlogical acknowledged changes, nothing pending on slot'

From
Thomas Munro
Date:
Hello,

I saw a one-off test failure that might be trying to tell us something:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/334334417

#   Failed test 'pg_recvlogical acknowledged changes, nothing pending on slot'
#   at t/006_logical_decoding.pl line 91.
#          got: 'BEGIN
# table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
# table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
# table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
# table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
# COMMIT'
#     expected: ''
# Looks like you failed 1 test of 16.

This occurred when testing master (010123e) with the patch from
https://www.postgresql.org/message-id/1bd396a9-4573-55ad-7ce8-fe7adffa1bd9%40uni-muenster.de
applied, but I don't think the patch has anything to do with it
because it doesn't touch anything outside contrib/pg_stat_statements.

-- 
Thomas Munro
http://www.enterprisedb.com


On Mon, Jan 29, 2018 at 10:09:49AM +1300, Thomas Munro wrote:
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/334334417
> 
> #   Failed test 'pg_recvlogical acknowledged changes, nothing pending on slot'
> #   at t/006_logical_decoding.pl line 91.
> #          got: 'BEGIN
> # table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
> # table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
> # table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
> # table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
> # COMMIT'
> #     expected: ''
> # Looks like you failed 1 test of 16.

The problem is this StreamLogicalLog() code:

    if (PQresultStatus(res) == PGRES_COPY_OUT)
    {
        /*
         * We're doing a client-initiated clean exit and have sent CopyDone to
         * the server. We've already sent replay confirmation and fsync'd so
         * we can just clean up the connection now.
         */
        goto error;
    }

Once pg_recvlogical receives the XLogData containing the sought-after end
position, that code makes pg_recvlogical exit without draining the remainder
of the backend messages.  If pg_recvlogical exits quickly enough, the backend
send()s libpq messages after pg_recvlogical disconnects, which can cause
internal_flush() to fail with EPIPE ("LOG:  could not send data to client:
Broken pipe").  If that precedes LogicalConfirmReceivedLocation(), the test
fails as you experienced.  Such failure happened once on the buildfarm[1];
post-LogicalConfirmReceivedLocation() EPIPE also happens[2].  The first
attached patch causes this failure to happen almost every run.  The fix
(second attachment) is to call PQgetCopyData() until no records remain, then
issue PQresultStatus() again[3].  This closes an additional defect, described
in the log message.


I looked at the other instances of "Broken pipe" in a couple of check-world
runs.  Clients might prevent those with cleaner shutdown on error, but it's
cosmetic.  They appeared in cases where the client or the server had already
recognized some other failure, whereas $SUBJECT taints a successful run.


This led me to notice other pg_recvlogical bugs, which I left unchanged:

1) An "unexpected termination of replication stream" error doesn't preclude
exit(0).

2) sigint_handler() doesn't trigger a PQputCopyEnd().  The connection status
remains PGRES_COPY_BOTH, prompting this weird message:

$ pg_recvlogical --create-slot --start -S foo -d postgres -f- && echo success
^Cpg_recvlogical: error: unexpected termination of replication stream: 
success

Other aspects of signal handling surprised me, but they may not be bugs.  The
documentation says that --start continues "until terminated by a signal".  We
don't trap SIGTERM, just SIGINT (initiate clean exit) and SIGHUP (reopen
output file).  pg_recvlogical copied SIGINT behavior from pg_receivewal, and
the pg_receivewal documentation is specific about signals.  Both programs
react to SIGINT with exit(0), whether or not they reached --endpos.
sigint_handler() doesn't trigger a flushAndSendFeedback(), so the slot's next
pg_recvlogical will repeat messages that followed the last fsync/feedback.

3) sendFeedback() wants s/last_fsync_lsn !=/last_fsync_lsn ==/.  This just
changes the volume of feedback messages.


[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=frogfish&dt=2020-03-07%2018%3A49%3A34
[2]
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=florican&dt=2020-04-27%2017%3A25%3A08&stg=recovery-check
[3] https://www.postgresql.org/docs/devel/libpq-copy.html writes "After
    PQgetCopyData returns -1, call PQgetResult to obtain the final result
    status of the COPY command. One can wait for this result to be available
    in the usual way. Then return to normal operation."

Attachment