Thread: Re: Bug #613: Sequence values fall back to previously checkpointed

Re: Bug #613: Sequence values fall back to previously checkpointed

From: Bruce Momjian
bgrimm@zaeon.com wrote:
> On Tue, 12 Mar 2002, Tom Lane wrote:
>
> >
> > The "no commit record" part of the logic seems okay to me, but we need
> > an independent test to decide whether to write/flush XLog.  If we have
> > reported a nextval() value to the client then it seems to me we'd better
> > be certain that XLOG record is flushed to XLog before we report commit
> > to the client.
>
> I think the part I don't understand is why WAL is being used to update
> sequence values in the first place when sequences exist independently of
> transactions.  In previous releases a sequence basically just existed
> on disk in a specific location and updates to it updated the on disk
> copy directly since there are no concurrency issues.  I do realize that
> running everything through WAL gives other benefits, so it's not likely
> to revert back to the old way.  But it would seem that the only way to
> fix it is to flush the XLOG record immediately after the XLogInsert is
> called, just as if the operation took place within its own transaction.

Initially, I agreed with you, but after reading Tom's email, I
understand now.  The point is: if no data-modifying transactions have
committed to the database, do we care if a crash rolls back the
sequence counter?  I don't think we do, because the new sequence
values couldn't have been used for anything in the database, so rollback
is OK.  If there is a data modification, the commit does flush the
sequence change to disk, so I think we are OK.

So, actually, this now seems like a fairly minor bug, in the sense that
any data modification done would have preserved the sequence change. The
only reason we see the bug is that we are calling nextval() with no
intention of inserting the result into the database.

While clearly we should fsync during a transaction that just changes a
sequence counter, it is not clear there is a huge problem with not
preserving it _until_ a data-modifying transaction actually commits.


> > This is certainly fixable.  However, here's the kicker: essentially what
> > this means is that we are not treating *reporting a nextval() value to
> > the client* as a commit-worthy event.  I do not think this bug explains
> > the past reports that claim a nextval() value *inserted into the
> > database* has been rolled back.  Seems to me that a subsequent tuple
> > insertion would create a normal XLog record which we'd flush before
> > commit, and thereby also flush the sequence-update XLog record.
> >
> > Can anyone see a way that this mechanism explains the prior reports?
> >
>
> Actually, that doesn't appear to be the case either because in some of
> my tests I used a serial column type and I was just inserting data into
> a table.  It would appear that if the sequence is in the same tuple as
> the data you modified then it won't get logged.   What I did was create
> a table with a serial column and a varchar(255).  Inserted 100 rows
> filled with data, committed.  Ran a checkpoint.  Checked my sequence
> values, inserted 10 more rows of data, committed, checked the value of
> the sequence again.  Kill -9 the postmaster.  Tried to insert into the
> table, but to no avail... duplicate key.  currval of the sequence and
> it matched the value right after the checkpoint.  I've been able to
> duplicate that scenario several times.

Now, this seems like a serious report, because you are saying you have
done data modifications (INSERT) after nextval() and the counter is still
not being preserved.  What PostgreSQL version are you running?

I just tested it here by doing a similar test of several nextval()
calls, but then doing an INSERT and kill, and on restart, the sequence
counter did have the proper value.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Bug #613: Sequence values fall back to previously checkpointed

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> bgrimm@zaeon.com wrote:
>> the sequence again.  Kill -9 the postmaster.  Tried to insert into the
>> table, but to no avail... duplicate key.  currval of the sequence and
>> it matched the value right after the checkpoint.  I've been able to
>> duplicate that scenario several times.

> I just tested it here by doing a similar test of several nextval()
> calls, but then doing an INSERT and kill, and on restart, the sequence
> counter did have the proper value.

There have been prior reports of similar problems --- all quite
unrepeatable in my testing, and despite considerable study of the source
code I can't see how it could happen.  A reproducible test case would be
a tremendous help.

            regards, tom lane

Re: Bug #613: Sequence values fall back to previously checkpointed

From: Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > bgrimm@zaeon.com wrote:
> >> the sequence again.  Kill -9 the postmaster.  Tried to insert into the
> >> table, but to no avail... duplicate key.  currval of the sequence and
> >> it matched the value right after the checkpoint.  I've been able to
> >> duplicate that scenario several times.
>
> > I just tested it here by doing a similar test of several nextval()
> > calls, but then doing an INSERT and kill, and on restart, the sequence
> > counter did have the proper value.
>
> There have been prior reports of similar problems --- all quite
> unrepeatable in my testing, and despite considerable study of the source
> code I can't see how it could happen.  A reproducible test case would be
> a tremendous help.


I can confirm a repeatable case!

---------------------------------------------------------------------------

test=> create table test (x serial, y varchar(255));
NOTICE:  CREATE TABLE will create implicit sequence 'test_x_seq' for SERIAL column 'test.x'
NOTICE:  CREATE TABLE / UNIQUE will create implicit index 'test_x_key' for table 'test'
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32951 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32951 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32951 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32951 1
...

test=> select nextval('test_x_seq');
 nextval
---------
      41
(1 row)

test=> checkpoint;
CHECKPOINT
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32991 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32992 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32992 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32992 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32992 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');


[ kill -9 backend ]


$ sql test
Welcome to psql, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

test=> select nextval('test_x_seq');
 nextval
---------
      42
(1 row)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Bug #613: Sequence values fall back to previously checkpointed

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I can confirm repeatable case!

Ah-hah, now I understand the problem.  I think sequences are suffering
from premature optimization.

The crux of the matter: sequences try to avoid generating a WAL record
for every single nextval().  The idea is to generate a WAL record
every SEQ_LOG_VALS (32) nextval operations.  If you crash and replay
the WAL, then when you see the WAL record you don't know exactly how
many nextval operations got done, so for safety you set the sequence
value to where-it-was plus 32.  This may leave a gap in the sequence
number assignment, but it's no worse than aborting a few transactions.
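
To make the SEQ_LOG_VALS idea concrete, here is a toy Python model of the
scheme Tom describes.  Only the names SEQ_LOG_VALS and log_cnt echo the
PostgreSQL source; everything else is a simplified sketch, not the real C
implementation.

```python
# Toy model of the SEQ_LOG_VALS optimization: pre-log 32 values ahead,
# so most nextval() calls write no WAL record at all.

SEQ_LOG_VALS = 32

class ToySequence:
    def __init__(self):
        self.last_value = 0   # in-memory counter
        self.log_cnt = 0      # values still covered by the last WAL record
        self.wal = []         # pretend WAL: list of logged "safe" values

    def nextval(self):
        if self.log_cnt == 0:
            # Pre-log 32 values ahead: after a crash, replay jumps the
            # counter to this value, possibly leaving a gap of up to 32.
            self.wal.append(self.last_value + SEQ_LOG_VALS)
            self.log_cnt = SEQ_LOG_VALS
        self.last_value += 1
        self.log_cnt -= 1
        return self.last_value

    def crash_and_replay(self):
        # Replay: we don't know how many nextvals happened, so take
        # the last logged value as the new starting point.
        self.last_value = self.wal[-1] if self.wal else 0
        self.log_cnt = 0

seq = ToySequence()
for _ in range(5):
    seq.nextval()          # returns 1..5; the WAL holds 32
seq.crash_and_replay()
print(seq.nextval())       # 33: a gap, but never a repeated value
```

As long as the WAL record is replayed, the worst case is a gap of up to 32
values, which is the behavior the paragraph above describes.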

The problem in the scenario Bruce exhibits is that the CHECKPOINT
forces out both the latest sequence WAL record and the current state
of the sequence relation itself.  The subsequent nextval()'s advance
the sequence relation in-memory but generate no disk writes and no
WAL records.  On restart, you lose: the sequence relation is back
to where it was checkpointed, and the latest WAL record for the
sequence is before the checkpoint *so it won't get rescanned*.
Thus, the sequence doesn't get pushed forward like it's supposed to.
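
The failure scenario can be replayed with a small self-contained Python
sketch.  Only SEQ_LOG_VALS and log_cnt come from the PostgreSQL source;
the checkpoint and recovery logic here is a simplified model for
illustration, not the actual implementation.

```python
# Toy replay of the bug: a checkpoint flushes both the sequence page and
# the WAL, but post-checkpoint nextval()s that still have log_cnt > 0
# write nothing, so a crash falls back to the checkpointed value.

SEQ_LOG_VALS = 32

wal = []            # WAL records still eligible for replay
disk_value = 0      # sequence value in the on-disk relation
mem_value = 0       # sequence value in shared memory
log_cnt = 0         # nextvals still covered by the last WAL record

def nextval():
    global mem_value, log_cnt
    if log_cnt == 0:
        wal.append(mem_value + SEQ_LOG_VALS)   # pre-log 32 ahead
        log_cnt = SEQ_LOG_VALS
    mem_value += 1
    log_cnt -= 1
    return mem_value

def checkpoint():
    # Flush the relation to disk; pre-checkpoint WAL won't be rescanned.
    global disk_value
    disk_value = mem_value
    wal.clear()

def crash_and_recover():
    global mem_value, log_cnt
    mem_value = wal[-1] if wal else disk_value  # no WAL record -> disk copy
    log_cnt = 0

v1 = nextval()        # 1; logs "32" to the WAL
checkpoint()          # disk_value = 1; that WAL record is now pre-checkpoint
v2 = nextval()        # 2; log_cnt still > 0, so NO new WAL record
crash_and_recover()
print(nextval())      # 2 again -- the duplicate-key failure Bruce saw
```

The duplicate return value is exactly the "duplicate key" symptom in
Bruce's transcript: after recovery the sequence hands out values it
already gave to a client before the crash.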

AFAICS the only way that we could make the one-WAL-record-every-32-
nextvals idea really work would be if CHECKPOINT could nullify the
logged-in-advance state of each sequence (so that the first nextval
after a checkpoint would always generate a fresh WAL record, but
subsequent ones wouldn't have to).  But I don't see any practical
way for CHECKPOINT to do that, especially not for sequences whose
disk block isn't even in memory at the instant of the CHECKPOINT.

Accordingly, I'm thinking that we must remove the SEQ_LOG_VALS
functionality and force one-WAL-record-per-nextval operation.

Vadim, do you see another way?  This was a cool idea and I hate
to throw it away...

            regards, tom lane

Re: Bug #613: Sequence values fall back to previously checkpointed

From: Ben Grimm
On Wed, 13 Mar 2002, Tom Lane wrote:
>
> Accordingly, I'm thinking that we must remove the SEQ_LOG_VALS
> functionality and force one-WAL-record-per-nextval operation.
>
> Vadim, do you see another way?  This was a cool idea and I hate
> to throw it away...
>

It seems like you just need to ensure that when the sequence is loaded
from disk, log_cnt gets updated and written back to disk before the
sequence is used.  I'm not sure of the impact, but I can't reproduce the
bugs after making these changes.  I just added a flag to the SeqTableData
struct to say whether it's been logged yet - this seems like overkill,
but it works for me :-)  (at least in the scenarios I've tried)

-- Ben


*** src/backend/commands/sequence.c     Wed Mar 13 11:14:42 2002
--- src/backend/commands/sequence.c.orig        Tue Mar 12 18:58:55 2002
***************
*** 62,68 ****
        int64           cached;
        int64           last;
        int64           increment;
-       bool            islogged;
        struct SeqTableData *next;
  } SeqTableData;

--- 62,67 ----
***************
*** 315,321 ****
                PG_RETURN_INT64(elm->last);
        }

!       seq = read_info("nextval", elm, &buf);          /* lock page' buffer and read tuple */

        last = next = result = seq->last_value;
        incby = seq->increment_by;
--- 314,321 ----
                PG_RETURN_INT64(elm->last);
        }

!       seq = read_info("nextval", elm, &buf);          /* lock page' buffer and
!                                                                                                * read tuple */

        last = next = result = seq->last_value;
        incby = seq->increment_by;
***************
*** 331,337 ****
                log--;
        }

!       if (log < fetch || !elm->islogged)
        {
                fetch = log = fetch - log + SEQ_LOG_VALS;
                logit = true;
--- 331,337 ----
                log--;
        }

!       if (log < fetch)
        {
                fetch = log = fetch - log + SEQ_LOG_VALS;
                logit = true;
***************
*** 405,411 ****
                rdata[0].next = &(rdata[1]);

                seq->last_value = next;
!               elm->islogged = seq->is_called = true;
                seq->log_cnt = 0;
                rdata[1].buffer = InvalidBuffer;
                rdata[1].data = (char *) page + ((PageHeader) page)->pd_upper;
--- 405,411 ----
                rdata[0].next = &(rdata[1]);

                seq->last_value = next;
!               seq->is_called = true;
                seq->log_cnt = 0;
                rdata[1].buffer = InvalidBuffer;
                rdata[1].data = (char *) page + ((PageHeader) page)->pd_upper;
***************
*** 417,424 ****

                PageSetLSN(page, recptr);
                PageSetSUI(page, ThisStartUpID);
!               XLogFlush(recptr);
!
                if (fetch)                              /* not all numbers were fetched */
                        log -= fetch;
        }
--- 417,423 ----

                PageSetLSN(page, recptr);
                PageSetSUI(page, ThisStartUpID);
!
                if (fetch)                              /* not all numbers were fetched */
                        log -= fetch;
        }
***************
*** 729,735 ****
                        prev->next = elm;
        }

-       elm->islogged = false;
        return elm;
  }

--- 728,733 ----

Re: Bug #613: Sequence values fall back to previously checkpointed

From: Tom Lane
Ben Grimm <bgrimm@zaeon.com> writes:
> It seems like you just need to ensure that when the sequence is loaded
> from disk that log_cnt gets updated and written back to disk before the
> sequence is used.  I'm not sure of the impact, but I can't reproduce the
> bugs after making these changes.  I just added a flag to the SeqTableData
> struct to say whether it's been logged yet - this seems like overkill..
> but it works for me :-)  (at least in the scenarios I've tried)

I don't think that can work.  AFAICT what your patch does is to ensure
a WAL record is written by the first nextval() in any given backend
session.  But what we need is to ensure a WAL record from the first
nextval() after a checkpoint.  The failure cases for your patch would
involve backends that have been running for longer than one checkpoint
cycle ...
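
Tom's objection can be demonstrated by extending the toy model with the
proposed per-session flag.  Only SEQ_LOG_VALS, log_cnt, and islogged echo
names from the source; the rest is a simplified illustration.

```python
# Toy model of why a once-per-session flag is not enough: the flag
# forces a WAL record on the backend's *first* nextval(), but a later
# checkpoint clears nothing, so the first nextval() *after* the
# checkpoint still writes no WAL record and a crash falls back.

SEQ_LOG_VALS = 32

wal = []            # WAL records still eligible for replay
disk_value = 0      # on-disk sequence relation
mem_value = 0       # in-memory sequence value
log_cnt = 0
islogged = False    # the proposed per-session flag

def nextval():
    global mem_value, log_cnt, islogged
    if log_cnt == 0 or not islogged:
        wal.append(mem_value + SEQ_LOG_VALS)   # forced or scheduled log
        log_cnt = SEQ_LOG_VALS
        islogged = True
    mem_value += 1
    log_cnt -= 1
    return mem_value

def checkpoint():
    global disk_value
    disk_value = mem_value      # flush relation to disk
    wal.clear()                 # pre-checkpoint WAL won't be rescanned

def crash_and_recover():
    global mem_value, log_cnt
    mem_value = wal[-1] if wal else disk_value
    log_cnt = 0

first = nextval()       # 1: the flag forces a WAL record
checkpoint()            # a long-running backend survives the checkpoint
second = nextval()      # 2: islogged is still set, so no WAL record
crash_and_recover()
print(nextval())        # 2 again -- the same fallback as before
```

The per-session flag only guarantees one WAL record per backend lifetime;
what the bug requires is one per checkpoint cycle.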

            regards, tom lane

Re: Bug #613: Sequence values fall back to previously checkpointed

From: Ben Grimm
On Wed, 13 Mar 2002, Tom Lane wrote:
>
> I don't think that can work.  AFAICT what your patch does is to ensure
> a WAL record is written by the first nextval() in any given backend
> session.

That's exactly what it does, yes.  It forces the WAL record to be
written at least once per session.  I think the reason this works is
that the WAL record that's written seems to be one behind what should
be on disk, so doing it once gets it ahead of the game.  I'm sure it's
a very naive approach - before yesterday I had never looked at the
PostgreSQL source.  All I can say for my patch is that if it does not
actually fix the problem, it masks it well enough that I can't
reproduce it.

> But what we need is to ensure a WAL record from the first
> nextval() after a checkpoint.
>
> The problem in the scenario Bruce exhibits is that the CHECKPOINT
> forces out both the latest sequence WAL record and the current state
> of the sequence relation itself.  The subsequent nextval()'s advance
> the sequence relation in-memory but generate no disk writes and no
> WAL records.  On restart, you lose: the sequence relation is back
> to where it was checkpointed, and the latest WAL record for the
> sequence is before the checkpoint *so it won't get rescanned*.
> Thus, the sequence doesn't get pushed forward like it's supposed to.

This isn't quite true... because until you select enough values to
reach log < fetch, no WAL record has even been inserted for the
CHECKPOINT to flush, so recovery falls back to the unmodified state.
That means the 'last_value' on disk never moves forward.  In theory the
value on disk should *always* be equal to or greater than (by up to 32)
the value being returned to the client, but when you load it off disk
it isn't.

More (possibly redundant) examples:
select * from your_seq and note the values.  Then select nextval a few
times, kill -9 your backend, and reconnect.  select * from your_seq
again and you should see that it's identical to the previous values.

Now, try the same thing again, but force a checkpoint before killing
your backend, then select again... same values as initially.

Now, select nextval() the number of times needed to get log_cnt to
loop past 0 and back up to 32, then select * from your_seq again,
note the values, and checkpoint.  Crash the backend, reconnect, and
select again... the value was saved this time, because enough code was
reached to do the XLogInsert.

> The failure cases for your patch would
> involve backends that have been running for longer than one checkpoint
> cycle ...

I haven't been able to reproduce that, even checkpointing multiple
times with several open backends.  But I also found a couple of
mistakes in my patch that make it a little better.  I can forward the
new patch if you'd like to see it.

-- Ben