Re: Potential G2-item cycles under serializable isolation - Mailing list pgsql-bugs

From Peter Geoghegan
Subject Re: Potential G2-item cycles under serializable isolation
Date
Msg-id CAH2-WznTb6-0fjW4WPzNQh4mFvBH86J7bqZpNqteVUzo8p=6Hg@mail.gmail.com
Whole thread Raw
In response to Re: Potential G2-item cycles under serializable isolation  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Potential G2-item cycles under serializable isolation
List pgsql-bugs
On Fri, Jun 5, 2020 at 12:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I suspect that I should try a high fidelity way of capturing the
> original execution order. I'm thinking of something like
> test_decoding

I had a better idea -- use rr debugging recorder to get a recording of
an entire execution:


https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Recording_Postgres_using_rr_Record_and_Replay_Framework

Attached results.edn + G2-item.txt files show the anomalies that were
reported during an execution I recorded a little earlier. We can now
use this as a point of reference for further debugging.

One less obvious benefit of doing this is that rr makes the execution
of the entire Postgres instance single threaded, so we can be sure
that the log faithfully represents the order of operations during
execution, within and across PostgreSQL backends -- no more
uncertainty about the exact order, when each xact starts and ends,
etc. (Note that I had to use rr's "chaos mode" feature to reproduce
the race condition problem in the first place -- it was easy once I
remembered about that option.)

This useful property is pretty much self-evident from the logs, since
rr includes an event number. For example, the log starts like this:

[rr 316370 892] 1591388133.699-316370--0--5edaa7e5.4d3d2- LOG:
starting PostgreSQL 13beta1-dc on x86_64-pc-linux-gnu, compiled by gcc
(Ubuntu 9.3.0-10ubuntu2) 9.3.0, 64-bit
[rr 316370 1056] 1591388133.706-316370--0--5edaa7e5.4d3d2- LOG:
listening on IPv6 address "::1", port 5432
[rr 316370 1068] 1591388133.706-316370--0--5edaa7e5.4d3d2- LOG:
listening on IPv4 address "127.0.0.1", port 5432
[rr 316370 1093] 1591388133.707-316370--0--5edaa7e5.4d3d2- LOG:
listening on Unix socket "/tmp/.s.PGSQL.5432"
[rr 316371 1303] 1591388133.716-316371--0--5edaa7e5.4d3d3- LOG:
database system was shut down at 2020-06-05 13:15:18 PDT
[rr 316370 1726] 1591388133.735-316370--0--5edaa7e5.4d3d2- LOG:
database system is ready to accept connections
[rr 316448 4190] 1591388153.592-316448--0--5edaa7f9.4d420-[unknown]
LOG:  connection received: host=127.0.0.1 port=41078

The second number in each log line represents an rr event number --
notice that these are strictly monotonically increasing. I don't even
need a PID to be able to start an interactive debugging session during
execution of a query of interest, though -- all I need is an event
number (such as 4190) to reliably get an interactive debugging
session. Since the entire recorded execution is actually single
threaded, and because of Postgres' multi-process architecture, that
information alone unambiguously identifies a particular
backend/connection and query execution.

I also make available the entire log that I showed the start of just now:

https://drive.google.com/file/d/1YFTVChfEtzZncdjakXKpp22HKDNFoWKo/view?usp=sharing

This hasn't been filtered by me this time around, so anyone playing
along at home can look through everything, including the queries
pertaining to the two G2-Item issues Jepsen complains about.

Hopefully this helps you, Kyle. Would you mind repeating the exercise
from earlier with this new log file/execution?

Thanks
-- 
Peter Geoghegan

Attachment

pgsql-bugs by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: ERROR: Attempting to Install the pgAdmin 4 Web Interface onCentOS 7
Next
From: Fabien COELHO
Date:
Subject: Re: pgbench bug / limitation