Re: snapshot too old issues, first around wraparound and then more. - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: snapshot too old issues, first around wraparound and then more.
Date
Msg-id CAH2-Wz=FQ9rbBKkt1nXvz27kmd4A8i1+7dcLTNqpCYibxX83VQ@mail.gmail.com
Whole thread Raw
In response to Re: snapshot too old issues, first around wraparound and then more.  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: snapshot too old issues, first around wraparound and then more.  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Wed, Apr 1, 2020 at 3:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I like that idea. I think that I've spotted what may be an independent
> bug, but I have to wait around for a minute or two to reproduce it
> each time. Makes it hard to get to a minimal test case.

I now have simple steps to reproduce a bug when I start Postgres
master with "--old_snapshot_threshold=1"  (1 minute).

This example shows wrong answers to queries in session 2:

Session 1:

pg@regression:5432 [1444078]=# create table snapshot_bug (col int4);
CREATE TABLE
pg@regression:5432 [1444078]=# create index on snapshot_bug (col );
CREATE INDEX
pg@regression:5432 [1444078]=# insert into snapshot_bug select i from
generate_series(1, 500) i;
INSERT 0 500

Session 2 starts, and views the data in a serializable transaction:

pg@regression:5432 [1444124]=# begin isolation level serializable ;
BEGIN
pg@regression:5432 [1444124]=*# select col from snapshot_bug where col
>= 0 order by col limit 14;
┌─────┐
│ col │
├─────┤
│   1 │
│   2 │
│   3 │
│   4 │
│   5 │
│   6 │
│   7 │
│   8 │
│   9 │
│  10 │
│  11 │
│  12 │
│  13 │
│  14 │
└─────┘
(14 rows)

So far so good. Now session 2 continues:

pg@regression:5432 [1444078]=# delete from snapshot_bug where col < 15;
DELETE 14

Session 1:

(repeats the same "select col from snapshot_bug where col >= 0 order
by col limit 14" query every 100 ms using psql's \watch 0.1)

Session 2:

pg@regression:5432 [1444078]=# vacuum snapshot_bug ;
VACUUM

Before too long, we see the following over in session 2 -- the answer
the query gives changes, even though this is a serializable
transaction:

Wed 01 Apr 2020 03:12:59 PM PDT (every 0.1s)

┌─────┐
│ col │
├─────┤
│   1 │
│   2 │
│   3 │
│   4 │
│   5 │
│   6 │
│   7 │
│   8 │
│   9 │
│  10 │
│  11 │
│  12 │
│  13 │
│  14 │
└─────┘
(14 rows)

Wed 01 Apr 2020 03:13:00 PM PDT (every 0.1s)

┌─────┐
│ col │
├─────┤
│  15 │
│  16 │
│  17 │
│  18 │
│  19 │
│  20 │
│  21 │
│  22 │
│  23 │
│  24 │
│  25 │
│  26 │
│  27 │
│  28 │
└─────┘
(14 rows)

Wed 01 Apr 2020 03:13:00 PM PDT (every 0.1s)

┌─────┐
│ col │
├─────┤
│  15 │
│  16 │
│  17 │
│  18 │
│  19 │
│  20 │
│  21 │
│  22 │
│  23 │
│  24 │
│  25 │
│  26 │
│  27 │
│  28 │
└─────┘
(14 rows)

We continue to get this wrong answer for almost another minute (at
least on this occasion). Eventually we get "snapshot too old". Note
that the answer changes when we cross the "minute threshold"

Andres didn't explain anything to me that contributed to finding the
bug (though it could be a known bug, I don't think that it is). It
took me a surprisingly short amount of time to stumble upon this bug
-- I didn't find it because I have good intuitions about how to break
the feature.

-- 
Peter Geoghegan

pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: snapshot too old issues, first around wraparound and then more.
Next
From: James Coleman
Date:
Subject: Re: Proposal: Expose oldest xmin as SQL function for monitoring