Thread: Possible bug in vacuum redo
In the vacuum tuple-chain moving logic, shouldn't the lines that update
the new tuple's t_ctid (vacuum.c lines 1882-1891 in current sources) be
moved up to before the log_heap_move call at line 1866?

It appears to me that as the code stands, log_heap_move will log the new
tuple containing the wrong t_ctid; therefore, if we crash and have to
redo the transaction from WAL, the wrong t_ctid will be restored.  No?

			regards, tom lane
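To make the ordering issue concrete, here is a self-contained toy model
(not the real vacuum.c / log_heap_move code: "logging" is reduced to
copying the tuple into a pretend WAL record, "redo" to copying it back,
and the ToyTuple layout is invented for illustration):

#include <stdio.h>
#include <string.h>

/* Invented stand-in for a heap tuple; only the chain link matters here. */
typedef struct
{
    int t_ctid;      /* forward link to the tuple's next version */
    int payload;
} ToyTuple;

static ToyTuple wal_record;     /* pretend WAL entry for the move */

/* Stand-in for log_heap_move: snapshot the tuple as it is right now. */
static void
log_move(const ToyTuple *tup)
{
    memcpy(&wal_record, tup, sizeof(ToyTuple));
}

/* Crash recovery: replay the logged image over the tuple. */
static void
redo_move(ToyTuple *tup)
{
    memcpy(tup, &wal_record, sizeof(ToyTuple));
}

int
main(void)
{
    ToyTuple moved = { .t_ctid = -1, .payload = 42 };  /* link not yet fixed */

    /* Order as reported: log first, fix t_ctid afterwards. */
    log_move(&moved);
    moved.t_ctid = 7;                   /* the correct chain link */
    redo_move(&moved);                  /* simulate crash + WAL redo */
    printf("after redo: t_ctid = %d (wanted 7)\n", moved.t_ctid);

    /* Proposed fix: set t_ctid before logging, so the record is right. */
    moved.t_ctid = 7;
    log_move(&moved);
    redo_move(&moved);
    printf("after redo: t_ctid = %d\n", moved.t_ctid);
    return 0;
}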
> -----Original Message-----
> From: Tom Lane
>
> In the vacuum tuple-chain moving logic, shouldn't the lines that update
> the new tuple's t_ctid (vacuum.c lines 1882-1891 in current sources)
> be moved up to before the log_heap_move call at line 1866?
>
> It appears to me that as the code stands, log_heap_move will log the new
> tuple containing the wrong t_ctid; therefore, if we crash and have to
> redo the transaction from WAL, the wrong t_ctid will be restored.  No?

AFAIR t_ctid isn't logged in WAL.

regards,
Hiroshi Inoue
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > AFAIR t_ctid isn't logged in WAL. After looking at the heap_update code I think you are right. Doesn't that render the field completely useless/unreliable? In the simple heap_update case I think that heap_xlog_update could easily set the old tuple's t_ctid field correctly. Not sure how it works when VACUUM is moving tuple chains around, however. Another thing I am currently looking at is that I do not believe VACUUM handles tuple chain moves correctly. It only enters the chain-moving logic if it finds a tuple that is in the *middle* of an update chain, ie, both the prior and next tuples still exist. In the case of a two-element update chain (only the latest and next-to-latest tuples of a row survive VACUUM), AFAICT vacuum will happily move the latest tuple without ever updating the previous tuple's t_ctid. In short t_ctid seems extremely unreliable. I have been trying to work out a way that a bad t_ctid link could lead to the duplicate-tuple reports we've been hearing lately, but so far I haven't seen one. I do think it can lead to missed UPDATEs in read-committed mode, however. regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > AFAIR t_ctid isn't logged in WAL.
>
> After looking at the heap_update code I think you are right.  Doesn't
> that render the field completely useless/unreliable?

Redo runs with no concurrent backends.  New backends invoked after a
redo operation don't need (see) the existent t_ctid values.  PostgreSQL
before MVCC didn't need the t_ctid.

> In the simple heap_update case I think that heap_xlog_update could
> easily set the old tuple's t_ctid field correctly.  Not sure how
> it works when VACUUM is moving tuple chains around, however.
>
> Another thing I am currently looking at is that I do not believe VACUUM
> handles tuple chain moves correctly.  It only enters the chain-moving
> logic if it finds a tuple that is in the *middle* of an update chain,
> ie, both the prior and next tuples still exist.
      ^^^^^
Isn't it *either*, not *both*?

Anyway I agree with you at the point that the tuple chain-moving is too
complex.  It's one of the main reasons why I prefer the new VACUUM.

regards,
Hiroshi Inoue
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: >> Another thing I am currently looking at is that I do not believe VACUUM >> handles tuple chain moves correctly. It only enters the chain-moving >> logic if it finds a tuple that is in the *middle* of an update chain, >> ie, both the prior and next tuples still exist. > ^^^^^ > Isn't it *either* not *both* ? [ reads it again ] Oh, you're right. Still, if WAL isn't taking care to maintain t_ctid then we have a problem. regards, tom lane
Tom Lane wrote:
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> >> Another thing I am currently looking at is that I do not believe VACUUM
> >> handles tuple chain moves correctly.  It only enters the chain-moving
> >> logic if it finds a tuple that is in the *middle* of an update chain,
> >> ie, both the prior and next tuples still exist.
> >       ^^^^^
> > Isn't it *either*, not *both*?
>
> [ reads it again ]  Oh, you're right.
>
> Still, if WAL isn't taking care to maintain t_ctid then we have a
> problem.

I don't think it's preferable either.  However, there's no problem
unless there's an application which handles tuples containing a t_ctid
link.  I know of few applications (vacuum, create index ??) which
handle tuples that were already updated to new ones and committed
before the transaction started.

Note that redo is executed alone.  Shutdown recovery is called after
killing all backends if necessary.  Of course there are no other
backends running when startup recovery is called.

regards,
Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> I don't think it's preferable either.  However, there's no problem
> unless there's an application which handles tuples containing a
> t_ctid link.

What about READ COMMITTED mode?  EvalPlanQual uses the t_ctid field to
find the updated version of the row.  If the t_ctid is wrong, you might
get an elog(), or you might miss the row you should have updated, or
possibly you might update a row that you shouldn't have.

			regards, tom lane
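The failure modes Tom lists follow from the shape of the chain-walking
logic.  A self-contained toy (not the actual EvalPlanQual code; the
array-indexed "heap" and FakeTuple fields are invented) shows how a
stale forward link silently delivers the wrong tuple:

#include <stdio.h>

/* Invented miniature heap: t_ctid is an array index, not a real TID. */
typedef struct
{
    int t_xmax;     /* nonzero => this version was updated */
    int t_ctid;     /* forward link to the newer version; self if none */
} FakeTuple;

static FakeTuple heap[4];

/* Chase t_ctid links to the newest version, as EvalPlanQual must. */
static int
find_newest_version(int tid)
{
    while (heap[tid].t_xmax != 0 && heap[tid].t_ctid != tid)
        tid = heap[tid].t_ctid;     /* trust the forward link */
    return tid;
}

int
main(void)
{
    /* A two-member update chain: slot 0 was updated, newest is slot 1. */
    heap[0] = (FakeTuple){ .t_xmax = 100, .t_ctid = 1 };
    heap[1] = (FakeTuple){ .t_xmax = 0,   .t_ctid = 1 };
    printf("newest version: slot %d\n", find_newest_version(0));

    /*
     * Now break the link the way the thread describes: the newest
     * version is moved to slot 2, but slot 0's t_ctid is never fixed,
     * and slot 1 gets reused for an unrelated live tuple.
     */
    heap[2] = heap[1];
    heap[1] = (FakeTuple){ .t_xmax = 0, .t_ctid = 1 };  /* unrelated row */
    printf("after bad move: slot %d (should be 2)\n",
           find_newest_version(0));
    return 0;
}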
Tom Lane wrote:
>
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > I don't think it's preferable either.  However, there's no problem
> > unless there's an application which handles tuples containing a
> > t_ctid link.
>
> What about READ COMMITTED mode?  EvalPlanQual uses the t_ctid field
> to find the updated version of the row.

In READ COMMITTED mode, an app first finds valid tuples using the
snapshot taken when the query started.  It never visits tuples that had
already been updated (to newer ones) and committed as of the point when
the query started.  Essentially, t_ctid is only needed by concurrently
running backends.

regards,
Hiroshi Inoue
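Hiroshi's argument can be illustrated with a toy visibility check (not
the real tqual.c code; it treats transaction-id order as commit order
and ignores in-progress states, which the real check cannot).  A
version that was replaced and committed before the snapshot is never
visited, so its t_ctid never matters; only a concurrent update leaves a
visible older version whose t_ctid must then be chased:

#include <stdio.h>

/* Invented, minimal tuple: just the transaction ids that matter here. */
typedef struct
{
    int xmin;   /* xact that created this version */
    int xmax;   /* xact that replaced it, 0 if none */
} ToyTuple;

/*
 * Toy snapshot check: visible if created before the snapshot and not
 * yet replaced as of it.  (Treats xid order as commit order.)
 */
static int
visible_to_snapshot(const ToyTuple *tup, int snapshot_xid)
{
    return tup->xmin < snapshot_xid &&
           (tup->xmax == 0 || tup->xmax >= snapshot_xid);
}

int
main(void)
{
    int snapshot = 200;     /* our READ COMMITTED query starts here */

    ToyTuple old_version = { .xmin = 50,  .xmax = 100 }; /* replaced long ago */
    ToyTuple new_version = { .xmin = 100, .xmax = 0   };
    ToyTuple concurrent  = { .xmin = 120, .xmax = 250 }; /* updated after start */

    /* Dead-and-committed version: never visited, t_ctid irrelevant. */
    printf("old version visible?  %d\n",
           visible_to_snapshot(&old_version, snapshot));
    printf("new version visible?  %d\n",
           visible_to_snapshot(&new_version, snapshot));
    /* Visible despite a later update: to re-check it, the executor
     * must follow this tuple's t_ctid -- the EvalPlanQual case. */
    printf("concurrently updated version visible? %d\n",
           visible_to_snapshot(&concurrent, snapshot));
    return 0;
}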
FWIW, I got the following deadlock-like situation by simply running
pgbench:

  968 ?      S    0:00 /usr/bin/postmaster -D /var/lib/pgsql/data
13245 pts/1  S    0:00 su - postsql
31671 pts/1  S    0:00 bin/postmaster
31672 pts/1  S    0:00 postgres: stats buffer process
31674 pts/1  S    0:00 postgres: stats collector process
32606 pts/1  S    0:02 pgbench -p 7204 -t 128 -c 32 postsql
32613 pts/1  S    0:01 postgres: postsql postsql [local] UPDATE
32615 pts/1  S    0:00 postgres: postsql postsql [local] UPDATE
32616 pts/1  S    0:01 postgres: postsql postsql [local] UPDATE
32617 pts/1  S    0:00 postgres: postsql postsql [local] UPDATE
32618 pts/1  S    0:00 postgres: postsql postsql [local] UPDATE
32623 pts/1  S    0:01 postgres: postsql postsql [local] UPDATE
32624 pts/1  S    0:01 postgres: postsql postsql [local] idle in transaction
32628 pts/1  S    0:01 postgres: postsql postsql [local] UPDATE
32633 pts/1  S    0:00 postgres: postsql postsql [local] idle
32634 pts/1  S    0:00 postgres: postsql postsql [local] UPDATE
32635 pts/1  S    0:00 postgres: postsql postsql [local] COMMIT
32636 pts/1  S    0:01 postgres: postsql postsql [local] UPDATE waiting
32637 pts/1  S    0:00 postgres: postsql postsql [local] UPDATE
32638 pts/1  S    0:00 postgres: postsql postsql [local] UPDATE
32639 pts/1  S    0:00 postgres: postsql postsql [local] UPDATE
32640 pts/1  S    0:00 postgres: checkpoint subprocess
32692 pts/2  S    0:00 su - postsql
32752 pts/2  S    0:00 grep post

It is probably some bug in pgbench, but could also be something more
serious.  I was running pgbench on a uniprocessor Athlon 850, using
Tatsuo's bench script modified to use vacuum full between runs:

[postsql@taru bench]$ cat bench.sh
#! /bin/sh
for i in 1 2 3 4 5 6 7 8
do
#	pg_ctl stop
#	pg_ctl -w -o '-c "wal_sync_method=fdatasync" -S' start
#	pg_ctl -w start
	sh mpgbench | tee bench.res.$i | sh extract.sh > bench.data.$i
done

[postsql@taru bench]$ cat mpgbench
#! /bin/sh
DB=postsql
PORT=7204
#pgbench -p $PORT -i -s 128 $DB
#pgbench -p $PORT -i -s 10 $DB
#pgbench -p $PORT -i -s 1 $DB
for i in 1 2 4 8 16 32 64 128
do
#	t=$(echo "scale=0; 512/$i" | bc -l)
	t=$(echo "scale=0; 4096/$i" | bc -l)
	echo $i concurrent users... 1>&2
	echo $t transactions each... 1>&2
	pgbench -p $PORT -t $t -c $i $DB
	psql -p $PORT -c 'vacuum full' $DB
	psql -p $PORT -c 'checkpoint' $DB
	echo "===== sync ======" 1>&2
	sync; sync; sync; sleep 10
	echo "===== sync done ======" 1>&2
done

[postsql@taru bench]$ cat extract.sh
#! /bin/sh
sed -n -e '/^number of clients.*/p' \
       -e '/.*excluding connections establishing.*/p' |
sed -e 's/number of clients: //' \
    -e 's/^tps = //' \
    -e 's/(excluding connections establishing)//' |
while read i
do
	echo -n "$i "
	read j
	echo $j
done

-------------------
Hannu
Hannu Krosing <hannu@tm.ee> writes:
> FWIW, I got the following deadlock-like situation by simply running
> pgbench:

Using what version?  That looks like the deadlock I fixed last week.

			regards, tom lane
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> In READ COMMITTED mode, an app first finds valid tuples using the
> snapshot taken when the query started.  It never visits tuples that
> had already been updated (to newer ones) and committed as of the
> point when the query started.  Essentially, t_ctid is only needed by
> concurrently running backends.

[ thinks for awhile ]  I see: you're saying that t_ctid is only used by
transactions that are concurrent with the deleting transaction, so if
there's a database crash there's no need to restore t_ctid.

Probably true, but still mighty ugly.

Meanwhile, I guess I gotta look elsewhere for a theory to explain those
reports of duplicate rows.  Oh well...

			regards, tom lane
Tom Lane wrote:
>
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > In READ COMMITTED mode, an app first finds valid tuples using the
> > snapshot taken when the query started.  It never visits tuples that
> > had already been updated (to newer ones) and committed as of the
> > point when the query started.  Essentially, t_ctid is only needed
> > by concurrently running backends.
>
> [ thinks for awhile ]  I see: you're saying that t_ctid is only
> used by transactions that are concurrent with the deleting transaction,
> so if there's a database crash there's no need to restore t_ctid.

Yes.

> Probably true, but still mighty ugly.

Yes.

> Meanwhile, I guess I gotta look elsewhere for a theory to explain
> those reports of duplicate rows.  Oh well...

Great.  Where is it?

regards,
Hiroshi Inoue
Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
>> FWIW, I got the following deadlock-like situation by simply running
>> pgbench:
>
> Using what version?  That looks like the deadlock I fixed last week.

Official 7.2b4.  Should I get a fresh one from CVS, or is beta5/RC1
imminent?

--------------------
Hannu
Hannu Krosing <hannu@tm.ee> writes:
>> Using what version?  That looks like the deadlock I fixed last week.

> Official 7.2b4.  Should I get a fresh one from CVS, or is beta5/RC1
> imminent?

Use CVS (or a nightly snapshot if that's easier).  I doubt we'll do
anything until after Christmas...

			regards, tom lane
Tom Lane wrote:
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > In READ COMMITTED mode, an app first finds valid tuples using the
> > snapshot taken when the query started.  It never visits tuples that
> > had already been updated (to newer ones) and committed as of the
> > point when the query started.  Essentially, t_ctid is only needed
> > by concurrently running backends.
>
> [ thinks for awhile ]  I see: you're saying that t_ctid is only
> used by transactions that are concurrent with the deleting transaction,
> so if there's a database crash there's no need to restore t_ctid.
>
> Probably true, but still mighty ugly.
>
> Meanwhile, I guess I gotta look elsewhere for a theory to explain
> those reports of duplicate rows.  Oh well...

Can someone document this in the sources somewhere?  I am not sure how
to do it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026