Thread: Re: Bug #613: Sequence values fall back to previously checkpointed values
bgrimm@zaeon.com wrote:
> On Tue, 12 Mar 2002, Tom Lane wrote:
>
> > The "no commit record" part of the logic seems okay to me, but we
> > need an independent test to decide whether to write/flush XLog.  If
> > we have reported a nextval() value to the client, then it seems to me
> > we'd better be certain that the XLOG record is flushed to disk before
> > we report commit to the client.
>
> I think the part I don't understand is why WAL is being used to update
> sequence values in the first place, when sequences exist independently
> of transactions.  In previous releases a sequence basically just
> existed on disk in a specific location, and updates to it modified the
> on-disk copy directly, since there are no concurrency issues.  I do
> realize that running everything through WAL gives other benefits, so
> it's not likely to revert back to the old way.  But it would seem that
> the only way to fix it is to flush the XLOG record immediately after
> XLogInsert is called, just as if the operation took place within its
> own transaction.

Initially I agreed with you, but after reading Tom's email I understand
now.  The point is that if no data-modifying transactions are committed
to the database, do we care if a crash rolls back the sequence counter?
I don't think we do, because the new sequence values couldn't have been
used for anything in the database, so rollback is OK.  If there is a
data modification, the commit does flush the sequence change to disk,
so I think we are OK.

So, actually, this now seems like a very low-level bug, in the sense
that any data modification would have preserved the sequence change.
The only reason we see the bug is that we are calling nextval() with no
intention of inserting the result into the database.  While clearly we
should fsync during a transaction that just changes a sequence counter,
it is not clear there is a huge problem with not preserving the change
_until_ a data-modifying transaction actually commits.

> > This is certainly fixable.  However, here's the kicker: essentially
> > what this means is that we are not treating *reporting a nextval()
> > value to the client* as a commit-worthy event.  I do not think this
> > bug explains the past reports that claim a nextval() value *inserted
> > into the database* has been rolled back.  Seems to me that a
> > subsequent tuple insertion would create a normal XLog record, which
> > we'd flush before commit, and thereby also flush the sequence-update
> > XLog record.
> >
> > Can anyone see a way that this mechanism explains the prior reports?
>
> Actually, that doesn't appear to be the case either, because in some
> of my tests I used a serial column type and was just inserting data
> into a table.  It would appear that if the sequence is in the same
> tuple as the data you modified, then it won't get logged.  What I did
> was create a table with a serial column and a varchar(255).  I
> inserted 100 rows filled with data and committed.  Ran a checkpoint.
> Checked my sequence values, inserted 10 more rows of data, committed,
> and checked the value of the sequence again.  Kill -9 the postmaster.
> Tried to insert into the table, but to no avail... duplicate key.  I
> checked currval of the sequence, and it matched the value right after
> the checkpoint.  I've been able to duplicate that scenario several
> times.

Now, this seems to be a major report, because you are saying you have
done data modifications (INSERT) after nextval() and the counter is not
being preserved.  What PostgreSQL version are you running?
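In code terms, the flush Ben suggests would land in nextval()'s logging
path, roughly like this.  This is a sketch only: the XLogFlush() call is
the proposed addition, the PageSetLSN/PageSetSUI lines follow the
7.2-era src/backend/commands/sequence.c as quoted in Ben's patch further
down, and the XLogInsert() flags are reconstructed from that era's code
and should be checked against the source:

		recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG | XLOG_NO_TRAN, rdata);

		PageSetLSN(page, recptr);
		PageSetSUI(page, ThisStartUpID);

		/* proposed addition: force the sequence record to disk now,
		 * as if nextval() had committed in its own transaction,
		 * instead of waiting for the next commit record's flush */
		XLogFlush(recptr);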
I just tested it here by doing a similar test of several nextval()
calls, but then doing an INSERT and kill, and on restart the sequence
counter did have the proper value.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> bgrimm@zaeon.com wrote:
>> the sequence again.  Kill -9 the postmaster.  Tried to insert into
>> the table, but to no avail... duplicate key.  I checked currval of
>> the sequence, and it matched the value right after the checkpoint.
>> I've been able to duplicate that scenario several times.

> I just tested it here by doing a similar test of several nextval()
> calls, but then doing an INSERT and kill, and on restart the sequence
> counter did have the proper value.

There have been prior reports of similar problems --- all quite
unrepeatable in my testing, and despite considerable study of the
source code I can't see how it could happen.  A reproducible test case
would be a tremendous help.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > bgrimm@zaeon.com wrote:
> >> the sequence again.  Kill -9 the postmaster.  Tried to insert into
> >> the table, but to no avail... duplicate key.  I checked currval of
> >> the sequence, and it matched the value right after the checkpoint.
> >> I've been able to duplicate that scenario several times.
>
> > I just tested it here by doing a similar test of several nextval()
> > calls, but then doing an INSERT and kill, and on restart the sequence
> > counter did have the proper value.
>
> There have been prior reports of similar problems --- all quite
> unrepeatable in my testing, and despite considerable study of the
> source code I can't see how it could happen.  A reproducible test case
> would be a tremendous help.

I can confirm a repeatable case!

---------------------------------------------------------------------------

test=> create table test (x serial, y varchar(255));
NOTICE:  CREATE TABLE will create implicit sequence 'test_x_seq' for SERIAL column 'test.x'
NOTICE:  CREATE TABLE / UNIQUE will create implicit index 'test_x_key' for table 'test'
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32951 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32951 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32951 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32951 1
...
test=> select nextval('test_x_seq');
 nextval
---------
      41
(1 row)

test=> checkpoint;
CHECKPOINT
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32991 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32992 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32992 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32992 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32992 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');

[ kill -9 backend ]

$ sql test
Welcome to psql, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

test=> select nextval('test_x_seq');
 nextval
---------
      42
(1 row)
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I can confirm a repeatable case!

Ah-hah, now I understand the problem.  I think sequences are suffering
from premature optimization.

The crux of the matter: sequences try to avoid generating a WAL record
for every single nextval().  The idea is to generate a WAL record every
SEQ_LOG_VALS (32) nextval operations.  If you crash and replay the WAL,
then when you see the WAL record you don't know exactly how many
nextval operations got done, so for safety you set the sequence value
to where-it-was plus 32.  This may leave a gap in the sequence number
assignment, but it's no worse than aborting a few transactions.

The problem in the scenario Bruce exhibits is that the CHECKPOINT
forces out both the latest sequence WAL record and the current state
of the sequence relation itself.  The subsequent nextval()'s advance
the sequence relation in-memory but generate no disk writes and no
WAL records.  On restart, you lose: the sequence relation is back
to where it was checkpointed, and the latest WAL record for the
sequence is before the checkpoint *so it won't get rescanned*.
Thus, the sequence doesn't get pushed forward like it's supposed to.

AFAICS the only way that we could make the
one-WAL-record-every-32-nextvals idea really work would be if
CHECKPOINT could nullify the logged-in-advance state of each sequence
(so that the first nextval after a checkpoint would always generate a
fresh WAL record, but subsequent ones wouldn't have to).  But I don't
see any practical way for CHECKPOINT to do that, especially not for
sequences whose disk block isn't even in memory at the instant of the
CHECKPOINT.

Accordingly, I'm thinking that we must remove the SEQ_LOG_VALS
functionality and force one-WAL-record-per-nextval operation.

Vadim, do you see another way?  This was a cool idea and I hate
to throw it away...

			regards, tom lane
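To make the arithmetic concrete, here is a small standalone C model of
the bookkeeping Tom describes.  It is a sketch only: every name in it
is invented for illustration, and it merely mimics the
last_value/log_cnt accounting; it is not the real sequence.c code.

/* Toy model of the bug: one WAL record per SEQ_LOG_VALS nextval()
 * calls, a CHECKPOINT that flushes the sequence page, then a crash. */
#include <stdio.h>

#define SEQ_LOG_VALS 32

typedef struct
{
	long		last_value;	/* current value of the sequence */
	int			log_cnt;	/* values still covered by last WAL record */
} Sequence;

static Sequence in_memory;		/* what the backend sees */
static Sequence on_disk;		/* what the checkpointed page holds */
static long wal_value;			/* value claimed by last sequence WAL record */
static int	wal_after_ckpt;		/* was that record written post-checkpoint? */

static long
nextval(void)
{
	if (in_memory.log_cnt == 0)
	{
		/* pre-log SEQ_LOG_VALS future values in a single WAL record */
		wal_value = in_memory.last_value + SEQ_LOG_VALS;
		wal_after_ckpt = 1;
		in_memory.log_cnt = SEQ_LOG_VALS;
	}
	in_memory.log_cnt--;
	return ++in_memory.last_value;
}

int
main(void)
{
	long		v = 0;
	int			i;

	for (i = 0; i < 10; i++)
		v = nextval();			/* first call logs values 1..32 in WAL */

	/* CHECKPOINT: page flushed; WAL up to here won't be rescanned */
	on_disk = in_memory;
	wal_after_ckpt = 0;

	for (i = 0; i < 10; i++)
		v = nextval();			/* log_cnt still > 0: no new WAL record */

	printf("last value handed to a client: %ld\n", v);	/* 20 */

	/* crash + recovery: restore the checkpointed page, then replay
	 * only sequence WAL records written after the checkpoint (none) */
	in_memory = on_disk;
	if (wal_after_ckpt)
		in_memory.last_value = wal_value;

	printf("value after recovery:          %ld\n",
		   in_memory.last_value);	/* 10 -- fell back */
	return 0;
}

Recovery restores the checkpointed page image (value 10) and finds no
sequence WAL record after the checkpoint to replay, so values 11
through 20, already handed to clients, get issued again.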
On Wed, 13 Mar 2002, Tom Lane wrote:

> Accordingly, I'm thinking that we must remove the SEQ_LOG_VALS
> functionality and force one-WAL-record-per-nextval operation.
>
> Vadim, do you see another way?  This was a cool idea and I hate
> to throw it away...

It seems like you just need to ensure that when the sequence is loaded
from disk, log_cnt gets updated and written back to disk before the
sequence is used.  I'm not sure of the impact, but I can't reproduce
the bugs after making these changes.  I just added a flag to the
SeqTableData struct to say whether it's been logged yet -- this seems
like overkill, but it works for me :-) (at least in the scenarios I've
tried)

-- Ben

*** src/backend/commands/sequence.c	Wed Mar 13 11:14:42 2002
--- src/backend/commands/sequence.c.orig	Tue Mar 12 18:58:55 2002
***************
*** 62,68 ****
  	int64		cached;
  	int64		last;
  	int64		increment;
- 	bool		islogged;
  	struct SeqTableData *next;
  } SeqTableData;
  
--- 62,67 ----
***************
*** 315,321 ****
  		PG_RETURN_INT64(elm->last);
  	}
  
! 	seq = read_info("nextval", elm, &buf);	/* lock page' buffer and read tuple */
  
  	last = next = result = seq->last_value;
  	incby = seq->increment_by;
--- 314,321 ----
  		PG_RETURN_INT64(elm->last);
  	}
  
! 	seq = read_info("nextval", elm, &buf);	/* lock page' buffer and
! 											 * read tuple */
  
  	last = next = result = seq->last_value;
  	incby = seq->increment_by;
***************
*** 331,337 ****
  			log--;
  		}
  
! 		if (log < fetch || !elm->islogged)
  		{
  			fetch = log = fetch - log + SEQ_LOG_VALS;
  			logit = true;
--- 331,337 ----
  			log--;
  		}
  
! 		if (log < fetch)
  		{
  			fetch = log = fetch - log + SEQ_LOG_VALS;
  			logit = true;
***************
*** 405,411 ****
  		rdata[0].next = &(rdata[1]);
  
  		seq->last_value = next;
! 		elm->islogged = seq->is_called = true;
  		seq->log_cnt = 0;
  		rdata[1].buffer = InvalidBuffer;
  		rdata[1].data = (char *) page + ((PageHeader) page)->pd_upper;
--- 405,411 ----
  		rdata[0].next = &(rdata[1]);
  
  		seq->last_value = next;
! 		seq->is_called = true;
  		seq->log_cnt = 0;
  		rdata[1].buffer = InvalidBuffer;
  		rdata[1].data = (char *) page + ((PageHeader) page)->pd_upper;
***************
*** 417,424 ****
  		PageSetLSN(page, recptr);
  		PageSetSUI(page, ThisStartUpID);
  
! 		XLogFlush(recptr);
! 		if (fetch)				/* not all numbers were fetched */
  			log -= fetch;
  	}
  
--- 417,423 ----
  		PageSetLSN(page, recptr);
  		PageSetSUI(page, ThisStartUpID);
  
! 		if (fetch)				/* not all numbers were fetched */
  			log -= fetch;
  	}
  
***************
*** 729,735 ****
  			prev->next = elm;
  	}
  
- 	elm->islogged = false;
  	return elm;
  }
  
--- 728,733 ----
Ben Grimm <bgrimm@zaeon.com> writes:
> It seems like you just need to ensure that when the sequence is loaded
> from disk, log_cnt gets updated and written back to disk before the
> sequence is used.  I'm not sure of the impact, but I can't reproduce
> the bugs after making these changes.  I just added a flag to the
> SeqTableData struct to say whether it's been logged yet -- this seems
> like overkill, but it works for me :-) (at least in the scenarios I've
> tried)

I don't think that can work.  AFAICT what your patch does is to ensure
a WAL record is written by the first nextval() in any given backend
session.  But what we need is to ensure a WAL record from the first
nextval() after a checkpoint.  The failure cases for your patch would
involve backends that have been running for longer than one checkpoint
cycle ...

			regards, tom lane
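Spelled out as a schedule, the failure Tom predicts would look like
this (a hypothetical psql session; your_seq is a stand-in name):

-- long-lived backend connects; the patch's per-session flag is false
select nextval('your_seq');  -- first call: WAL record written, flag set
-- ... backend stays connected; meanwhile ...
checkpoint;                  -- page flushed; earlier WAL won't be rescanned
select nextval('your_seq');  -- flag already set, log_cnt > 0: no WAL record
-- kill -9 here: recovery falls back to the checkpointed value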
On Wed, 13 Mar 2002, Tom Lane wrote:

> I don't think that can work.  AFAICT what your patch does is to ensure
> a WAL record is written by the first nextval() in any given backend
> session.

That's exactly what it does, yes.  It forces the WAL record to be
written at least once.  I think the reason this works is that the WAL
record that's written seems to be one behind what should be on disk, so
doing it once gets it ahead of the game.  I'm sure it's a very naive
approach -- before yesterday I had never looked at the source for
PostgreSQL.  All I can say for my patch is that if it does not indeed
fix the problem, it masks it well enough that I can't reproduce it.

> But what we need is to ensure a WAL record from the first
> nextval() after a checkpoint.
>
> The problem in the scenario Bruce exhibits is that the CHECKPOINT
> forces out both the latest sequence WAL record and the current state
> of the sequence relation itself.  The subsequent nextval()'s advance
> the sequence relation in-memory but generate no disk writes and no
> WAL records.  On restart, you lose: the sequence relation is back
> to where it was checkpointed, and the latest WAL record for the
> sequence is before the checkpoint *so it won't get rescanned*.
> Thus, the sequence doesn't get pushed forward like it's supposed to.

This isn't quite true, because until you select enough values to reach
log < fetch, it won't even have inserted a WAL record for the
CHECKPOINT to flush; it falls back to the unmodified state instead.
That means the last_value on disk never moves forward.  In theory the
value on disk should *always* be equal to or greater than (by up to 32)
the value being returned to the client, and when you load it off disk
it isn't.

More (possibly redundant) examples:

Select * from your_seq and note the values.  Then select nextval a few
times, kill -9 your backend, and reconnect.  Select * from your_seq
again and you should see that it's identical to the previous values.

Now try the same thing again, but force a checkpoint before killing
your backend, then select again... same values as initially.

Now select nextval the number of times needed to get log_cnt to loop
past 0 and back up to 32, then select * from your_seq again, note the
values, and checkpoint.  Crash the backend, reconnect, and select
again... it saved it this time, because it got through enough code to
do the XLogInsert.

> The failure cases for your patch would involve backends that have
> been running for longer than one checkpoint cycle ...

I haven't been able to reproduce that, even checkpointing multiple
times on several open backends.  But I also found and fixed a couple of
mistakes in my patch that make it a little better.  I can forward the
new patch if you'd like to see it.

-- Ben
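For reference, the third check Ben describes looks like this in psql (a
sketch; your_seq is a stand-in name, while last_value and log_cnt are
actual columns of the sequence relation in this era):

test=> select last_value, log_cnt from your_seq;   -- note the values
test=> select nextval('your_seq');
-- ... repeat until log_cnt wraps past 0 back up toward 32, the point
-- at which nextval() performs an XLogInsert ...
test=> checkpoint;
-- kill -9 the backend, reconnect, then:
test=> select last_value, log_cnt from your_seq;   -- the advance survived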