Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl
Msg-id 31663.1499032944@sss.pgh.pa.us
In response to [HACKERS] Race-like failure in recovery/t/009_twophase.pl  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl
List pgsql-hackers
I wrote:
> Anyway, having vented about that ... it's not very clear to me whether the
> test script is at fault for not being careful to let the slave catch up to
> the master before we promote it (and then deem the master to be usable as
> a slave without rebuilding it first), or whether we actually imagine that
> should work, in which case there's a replication logic bug here someplace.

OK, now that I can make some sense of what's going on in the 009 test
script ... it seems like that test script is presuming synchronous
replication behavior, but it's only actually set up sync rep in one
direction, namely the london->paris direction.  The failure occurs
when we lose data in the paris->london direction.  Specifically,
with the delay hack in place, I find this in the log before things
go south completely:

# Now london is master and paris is slave
ok 11 - Restore prepared transactions from files with master down
### Enabling streaming replication for node "paris"
### Starting node "paris"
# Running: pg_ctl -D /home/postgres/pgsql/src/test/recovery/tmp_check/data_paris_xSFF/pgdata -l /home/postgres/pgsql/src/test/recovery/tmp_check/log/009_twophase_paris.log start
waiting for server to start.... done
server started
# Postmaster PID for node "paris" is 30930
psql:<stdin>:1: ERROR:  prepared transaction with identifier "xact_009_11" does not exist

That ERROR is being reported by the london node, at line 267 of the
current script:

    $cur_master->psql('postgres', "COMMIT PREPARED 'xact_009_11'");

So london is missing a prepared transaction that was created while
paris was master, a few lines earlier.  (It's not real good that
the test script isn't bothering to check the results of any of
these queries, although the end-state test I just added should close
the loop on that.)  london has no idea that it's missing data, but
when we restart the paris node a little later, it notices that its
WAL is past where london is.
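
To make that failure mode concrete, here is a toy Python model of it. This
is only a sketch: the Node class and its methods are invented for
illustration and are not PostgreSQL code or part of the test framework.
It assumes, as described above, that only london names paris as a
synchronous standby, so a record written while paris is master may not
have reached london by the time london is promoted.

```python
# Toy model of the lost-PREPARE scenario. Everything here (Node, write,
# commit_prepared) is hypothetical and only mimics the behavior described
# in the message; it is not how PostgreSQL is implemented.

class Node:
    def __init__(self, name, sync_standby_names=()):
        self.name = name
        self.sync_standby_names = set(sync_standby_names)
        self.wal = []  # records this node has durably received

    def write(self, record, standby):
        """Write a record while acting as master for `standby`."""
        self.wal.append(record)
        if standby.name in self.sync_standby_names:
            # Sync rep: we do not return until the standby has the record.
            standby.wal.append(record)
        # Otherwise (async rep) the record is shipped "eventually"; in the
        # failing run the master is stopped before that happens.

    def commit_prepared(self, gid):
        if ("PREPARE", gid) not in self.wal:
            return f'ERROR:  prepared transaction with identifier "{gid}" does not exist'
        self.wal.append(("COMMIT PREPARED", gid))
        return "COMMIT PREPARED"


# Sync rep is configured in the london->paris direction only.
london = Node("london", sync_standby_names={"paris"})
paris = Node("paris")

# paris is master; london is its (asynchronous) standby.
paris.write(("PREPARE", "xact_009_11"), standby=london)

# paris is stopped and london is promoted before it received the record.
result = london.commit_prepared("xact_009_11")
print(result)  # the same ERROR text as in the log above

# When paris comes back as a standby, its WAL is ahead of london's.
print(len(paris.wal) > len(london.wal))  # True
```

Swapping the two sync_standby_names settings in the model makes the
COMMIT PREPARED succeed, which is the asymmetry the test script trips over.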

I'm now inclined to think that the correct fix is to ensure that we
run synchronous rep in both directions, rather than to insert delays
to substitute for that.  Just setting synchronous_standby_names for
node paris at the top of the script doesn't work, because there's
at least one place where the script intentionally issues commands
to paris while london is stopped.  But we could turn off sync rep
for that step, perhaps.
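
A rough sketch of that idea, again as a toy Python model rather than real
test code: both nodes name the other as a synchronous standby, and sync
rep is switched off only for the step that runs while the standby is
stopped. The Node class, the `up` flag, the sync_rep_disabled helper and
the transaction name "xact_while_london_down" are all hypothetical; how
the real script would turn sync rep off for that step is left open here.

```python
from contextlib import contextmanager

# Hypothetical toy model; not PostgreSQL code and not the TAP test API.

class Node:
    def __init__(self, name, sync_standby_names=()):
        self.name = name
        self.sync_standby_names = set(sync_standby_names)
        self.wal = []   # records this node has durably received
        self.up = True  # whether the node is currently running

    def write(self, record, standby):
        """Write a record while acting as master for `standby`."""
        if standby.name in self.sync_standby_names:
            if not standby.up:
                # A real server would simply wait here; the model raises
                # instead so that the problem is visible.
                raise RuntimeError(
                    f"{self.name} would wait forever for {standby.name}")
            standby.wal.append(record)
        self.wal.append(record)


@contextmanager
def sync_rep_disabled(node):
    """Temporarily clear the node's synchronous standby list."""
    saved = node.sync_standby_names
    node.sync_standby_names = set()
    try:
        yield
    finally:
        node.sync_standby_names = saved


# Sync rep in both directions.
london = Node("london", sync_standby_names={"paris"})
paris = Node("paris", sync_standby_names={"london"})

# Normal step: paris is master, london has the record when write() returns.
paris.write(("PREPARE", "xact_009_11"), standby=london)

# Step that intentionally issues commands to paris while london is stopped.
london.up = False
try:
    paris.write(("PREPARE", "xact_while_london_down"), standby=london)
    blocked = False
except RuntimeError:
    blocked = True

# Turning sync rep off just for this step lets it proceed.
with sync_rep_disabled(paris):
    paris.write(("PREPARE", "xact_while_london_down"), standby=london)
```

In the model, the first PREPARE is on london before the write returns, the
unguarded write with london stopped cannot complete, and the guarded one
goes through with paris's setting restored afterwards.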

Anyone have a different view of what to fix here?
        regards, tom lane


