Thread: Possible bug in vacuum redo

Possible bug in vacuum redo

From
Tom Lane
Date:
In the vacuum tuple-chain moving logic, shouldn't the lines that update
the new tuple's t_ctid (vacuum.c lines 1882-1891 in current sources)
be moved up to before the log_heap_move call at line 1866?

It appears to me that as the code stands, log_heap_move will log the new
tuple containing the wrong t_ctid; therefore, if we crash and have to
redo the transaction from WAL, the wrong t_ctid will be restored.  No?
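
To make the ordering concrete, here is a tiny self-contained toy model (plain C, not the actual vacuum.c or WAL code; every name in it is invented) of what I mean: if the tuple image is logged before its t_ctid is fixed up, replaying that record restores the stale link.

#include <stdio.h>
#include <string.h>

/* Toy stand-ins for an item pointer and a heap tuple. */
typedef struct { int block; int offset; } ItemPointer;
typedef struct { ItemPointer t_ctid; char data[16]; } Tuple;

/* Fake "WAL": a single logged tuple image, copied byte-for-byte. */
static Tuple wal_image;

static void fake_log_heap_move(const Tuple *newtup)
{
    wal_image = *newtup;        /* WAL records the tuple as it is right now */
}

static void fake_redo(Tuple *target)
{
    *target = wal_image;        /* redo restores the logged image verbatim */
}

int main(void)
{
    ItemPointer new_location = {7, 3};       /* where the tuple was moved to      */
    Tuple newtup = { {1, 1}, "row v2" };     /* t_ctid still holds a stale value  */

    fake_log_heap_move(&newtup);             /* log first (current ordering) ...  */
    newtup.t_ctid = new_location;            /* ... fix t_ctid only afterwards    */

    Tuple replayed;
    fake_redo(&replayed);                    /* crash + redo from the fake WAL    */

    printf("live tuple t_ctid     = (%d,%d)\n",
           newtup.t_ctid.block, newtup.t_ctid.offset);
    printf("replayed tuple t_ctid = (%d,%d)   <-- stale\n",
           replayed.t_ctid.block, replayed.t_ctid.offset);
    return 0;
}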
        regards, tom lane


Re: Possible bug in vacuum redo

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Tom Lane
> 
> In the vacuum tuple-chain moving logic, shouldn't the lines that update
> the new tuple's t_ctid (vacuum.c lines 1882-1891 in current sources)
> be moved up to before the log_heap_move call at line 1866?
> 
> It appears to me that as the code stands, log_heap_move will log the new
> tuple containing the wrong t_ctid; therefore, if we crash and have to
> redo the transaction from WAL, the wrong t_ctid will be restored.  No?

AFAIR t_ctid isn't logged in WAL.

regards,
Hiroshi Inoue


Re: Possible bug in vacuum redo

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> AFAIR t_ctid isn't logged in WAL.

After looking at the heap_update code I think you are right.  Doesn't
that render the field completely useless/unreliable?

In the simple heap_update case I think that heap_xlog_update could
easily set the old tuple's t_ctid field correctly.  Not sure how
it works when VACUUM is moving tuple chains around, however.

Another thing I am currently looking at is that I do not believe VACUUM
handles tuple chain moves correctly.  It only enters the chain-moving
logic if it finds a tuple that is in the *middle* of an update chain,
ie, both the prior and next tuples still exist.  In the case of a
two-element update chain (only the latest and next-to-latest tuples of
a row survive VACUUM), AFAICT vacuum will happily move the latest tuple
without ever updating the previous tuple's t_ctid.
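
A toy picture of that two-element case (plain C with invented names, not the vacuum.c data structures): moving only the newest tuple leaves the older tuple's forward link pointing at a slot that no longer holds it.

#include <stdio.h>

/* Toy model of a two-element update chain; not PostgreSQL code. */
typedef struct { int block; int offset; } ItemPointer;
typedef struct {
    ItemPointer self;       /* where this version of the row lives               */
    ItemPointer t_ctid;     /* link to the newer version, or to itself if latest */
} Tuple;

int main(void)
{
    Tuple oldver = { {1, 1}, {1, 2} };   /* next-to-latest: links forward to (1,2) */
    Tuple newver = { {1, 2}, {1, 2} };   /* latest: t_ctid points at itself        */

    /* Move only the latest tuple to (5,1) without touching the prior tuple. */
    newver.self   = (ItemPointer){5, 1};
    newver.t_ctid = newver.self;

    /* The prior version still claims its successor is at (1,2),
     * which no longer holds the row: the chain is broken. */
    printf("old version links to (%d,%d), but the latest version now lives at (%d,%d)\n",
           oldver.t_ctid.block, oldver.t_ctid.offset,
           newver.self.block, newver.self.offset);
    return 0;
}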

In short t_ctid seems extremely unreliable.  I have been trying to work
out a way that a bad t_ctid link could lead to the duplicate-tuple
reports we've been hearing lately, but so far I haven't seen one.  I do
think it can lead to missed UPDATEs in read-committed mode, however.
        regards, tom lane


Re: Possible bug in vacuum redo

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> 
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > AFAIR t_ctid isn't logged in WAL.
> 
> After looking at the heap_update code I think you are right.  Doesn't
> that render the field completely useless/unreliable?

Redo runs with no concurrent backends. New backends started
after a redo operation don't need (or see) the existing t_ctid values.
PostgreSQL before MVCC didn't need t_ctid at all.

> 
> In the simple heap_update case I think that heap_xlog_update could
> easily set the old tuple's t_ctid field correctly.  Not sure how
> it works when VACUUM is moving tuple chains around, however.
> 
> Another thing I am currently looking at is that I do not believe VACUUM
> handles tuple chain moves correctly.  It only enters the chain-moving
> logic if it finds a tuple that is in the *middle* of an update chain,
> ie, both the prior and next tuples still exist.
       ^^^^^
Isn't it *either*, not *both*?
Anyway, I agree with you that the tuple chain-moving logic
is too complex. It's one of the main reasons why I prefer the new
VACUUM.

regards,
Hiroshi Inoue


Re: Possible bug in vacuum redo

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> Another thing I am currently looking at is that I do not believe VACUUM
>> handles tuple chain moves correctly.  It only enters the chain-moving
>> logic if it finds a tuple that is in the *middle* of an update chain,
>> ie, both the prior and next tuples still exist. 
>        ^^^^^
> Isn't it *either*, not *both*?

[ reads it again ]  Oh, you're right.

Still, if WAL isn't taking care to maintain t_ctid then we have a
problem.
        regards, tom lane


Re: Possible bug in vacuum redo

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> 
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> >> Another thing I am currently looking at is that I do not believe VACUUM
> >> handles tuple chain moves correctly.  It only enters the chain-moving
> >> logic if it finds a tuple that is in the *middle* of an update chain,
> >> ie, both the prior and next tuples still exist.
> >        ^^^^^
> > Isn't it *either*, not *both*?
> 
> [ reads it again ]  Oh, you're right.
> 
> Still, if WAL isn't taking care to maintain t_ctid then we have a
> problem.

I don't think it's preferable either. However, there's
no problem unless there's an application which handles
the tuples containing the t_ctid link. I know of few
applications (vacuum, create index ??) which handle
tuples already updated to newer ones and committed
before the transaction started.
Note that redo is executed alone. Shutdown recovery
is called after killing all backends if necessary.
Of course there are no other backends running when
startup recovery is called.

regards,
Hiroshi Inoue


Re: Possible bug in vacuum redo

From
Tom Lane
Date:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> I don't think it's preferable either. However, there's
> no problem unless there's an application which handles
> the tuples containing the t_ctid link.

What about READ COMMITTED mode?  EvalPlanQual uses the t_ctid field
to find the updated version of the row.  If the t_ctid is wrong,
you might get an elog(), or you might miss the row you should have
updated, or possibly you might update a row that you shouldn't have.
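
A rough sketch of that failure mode (a plain C toy model with invented names, not the real EvalPlanQual or heap access code): chase the t_ctid links from the version the snapshot saw, and a stale link means the current version is simply never found.

#include <stdio.h>
#include <stdbool.h>

typedef struct { int block; int offset; } ItemPointer;
typedef struct {
    ItemPointer self;
    ItemPointer t_ctid;   /* next version, or self if this is the latest */
    bool        exists;   /* false once the slot no longer holds the row */
} Tuple;

/* A tiny "heap": look a tuple up by its item pointer. */
static Tuple *heap_fetch(Tuple *heap, int n, ItemPointer tid)
{
    for (int i = 0; i < n; i++)
        if (heap[i].exists &&
            heap[i].self.block == tid.block &&
            heap[i].self.offset == tid.offset)
            return &heap[i];
    return NULL;                      /* dangling link: nothing there */
}

int main(void)
{
    /* (1,1) was updated to (1,2); (1,2) was then moved to (5,1) without
     * (1,1)'s forward link being fixed. */
    Tuple heap[] = {
        { {1, 1}, {1, 2}, true },     /* old version with a stale link */
        { {5, 1}, {5, 1}, true },     /* the real latest version       */
    };
    ItemPointer tid = {1, 1};

    for (;;)
    {
        Tuple *tup = heap_fetch(heap, 2, tid);
        if (tup == NULL)
        {
            printf("chain broken at (%d,%d): the row that should have been "
                   "re-checked is never found\n", tid.block, tid.offset);
            break;
        }
        if (tup->t_ctid.block == tup->self.block &&
            tup->t_ctid.offset == tup->self.offset)
        {
            printf("latest version found at (%d,%d)\n", tid.block, tid.offset);
            break;
        }
        tid = tup->t_ctid;            /* follow the forward link */
    }
    return 0;
}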
        regards, tom lane


Re: Possible bug in vacuum redo

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> 
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > I don't think it's preferable either. However there's
> > no problem unless there's an application which handle
> > the tuples containing the t_ctid link.
> 
> What about READ COMMITTED mode?  EvalPlanQual uses the t_ctid field
> to find the updated version of the row.

In READ COMMITTED mode, a query first finds the tuples
valid under the snapshot taken when the query started. It
never visits tuples that were already updated (to newer
ones) and committed before the query started. Essentially
t_ctid is only needed by concurrently running backends.

regards,
Hiroshi Inoue


Re: Possible bug in vacuum redo

From
Hannu Krosing
Date:
FWIW, I got the following deadlock-like situation by simply running 
pgbench :
 968 ?        S      0:00 /usr/bin/postmaster -D /var/lib/pgsql/data
13245 pts/1    S      0:00 su - postsql
31671 pts/1    S      0:00 bin/postmaster
31672 pts/1    S      0:00 postgres: stats buffer process
31674 pts/1    S      0:00 postgres: stats collector process
32606 pts/1    S      0:02 pgbench -p 7204 -t 128 -c 32 postsql
32613 pts/1    S      0:01 postgres: postsql postsql [local] UPDATE
32615 pts/1    S      0:00 postgres: postsql postsql [local] UPDATE
32616 pts/1    S      0:01 postgres: postsql postsql [local] UPDATE
32617 pts/1    S      0:00 postgres: postsql postsql [local] UPDATE
32618 pts/1    S      0:00 postgres: postsql postsql [local] UPDATE
32623 pts/1    S      0:01 postgres: postsql postsql [local] UPDATE
32624 pts/1    S      0:01 postgres: postsql postsql [local] idle in transaction
32628 pts/1    S      0:01 postgres: postsql postsql [local] UPDATE
32633 pts/1    S      0:00 postgres: postsql postsql [local] idle
32634 pts/1    S      0:00 postgres: postsql postsql [local] UPDATE
32635 pts/1    S      0:00 postgres: postsql postsql [local] COMMIT
32636 pts/1    S      0:01 postgres: postsql postsql [local] UPDATE waiting
32637 pts/1    S      0:00 postgres: postsql postsql [local] UPDATE
32638 pts/1    S      0:00 postgres: postsql postsql [local] UPDATE
32639 pts/1    S      0:00 postgres: postsql postsql [local] UPDATE
32640 pts/1    S      0:00 postgres: checkpoint subprocess
32692 pts/2    S      0:00 su - postsql
32752 pts/2    S      0:00 grep post

It is probably some bug in pgbench, but could also be something more 
serious.

I was running pgbench using a modified (to use vacuum full between runs)
version of Tatsuo's bench script on a uniprocessor Athlon 850:

[postsql@taru bench]$ cat bench.sh
#! /bin/sh
for i in 1 2 3 4 5 6 7 8
do
#    pg_ctl stop
#    pg_ctl -w -o '-c "wal_sync_method=fdatasync" -S' start
#    pg_ctl -w start
    sh mpgbench | tee bench.res.$i | sh extract.sh > bench.data.$i
done
[postsql@taru bench]$ cat mpgbench
#! /bin/sh
DB=postsql
PORT=7204

#pgbench -p $PORT -i -s 128 $DB
#pgbench -p $PORT -i -s 10 $DB
#pgbench -p $PORT -i -s 1 $DB

for i in 1 2 4 8 16 32 64 128
do
#       t=$(echo "scale=0; 512/$i" | bc -l)
        t=$(echo "scale=0; 4096/$i" | bc -l)
        echo $i concurrent users... 1>&2
        echo $t transactions each... 1>&2
        pgbench -p $PORT -t $t -c $i $DB
        psql -p $PORT -c 'vacuum full' $DB
        psql -p $PORT -c 'checkpoint' $DB
        echo "===== sync ======" 1>&2
        sync;sync;sync;sleep 10
        echo "===== sync done ======" 1>&2

done


[postsql@taru bench]$ cat extract.sh
#! /bin/sh
sed -n -e '/^number of clients.*/p' \
       -e '/.*excluding connections establishing.*/p' |
sed -e 's/number of clients: //' \
    -e 's/^tps = //' \
    -e 's/(excluding connections establishing)//' |
while read i
do
    echo -n "$i "
    read j
    echo $j
done

-------------------
Hannu




Re: Possible bug in vacuum redo

From
Tom Lane
Date:
Hannu Krosing <hannu@tm.ee> writes:
> FWIW, I got the following deadlock-like situation by simply running 
> pgbench :

Using what version?  That looks like the deadlock I fixed last week.
        regards, tom lane


Re: Possible bug in vacuum redo

From
Tom Lane
Date:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> In READ COMMITTED mode, a query first finds the tuples
> valid under the snapshot taken when the query started. It
> never visits tuples that were already updated (to newer
> ones) and committed before the query started. Essentially
> t_ctid is only needed by concurrently running backends.

[ thinks for awhile ]  I see: you're saying that t_ctid is only
used by transactions that are concurrent with the deleting transaction,
so if there's a database crash there's no need to restore t_ctid.

Probably true, but still mighty ugly.

Meanwhile, I guess I gotta look elsewhere for a theory to explain
those reports of duplicate rows.  Oh well...
        regards, tom lane


Re: Possible bug in vacuum redo

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> 
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > In READ COMMITTED mode, a query first finds the tuples
> > valid under the snapshot taken when the query started. It
> > never visits tuples that were already updated (to newer
> > ones) and committed before the query started. Essentially
> > t_ctid is only needed by concurrently running backends.
> 
> [ thinks for awhile ]  I see: you're saying that t_ctid is only
> used by transactions that are concurrent with the deleting transaction,
> so if there's a database crash there's no need to restore t_ctid.

Yes.
> Probably true, but still mighty ugly.

Yes.
> Meanwhile, I guess I gotta look elsewhere for a theory to explain
> those reports of duplicate rows.  Oh well...

Great. Where is it ?

regards,
Hiroshi Inoue


Re: Possible bug in vacuum redo

From
Hannu Krosing
Date:

Tom Lane wrote:

>Hannu Krosing <hannu@tm.ee> writes:
>
>>FWIW, I got the following deadlock-like situation by simply running 
>>pgbench :
>>
>
>Using what version?  That looks like the deadlock I fixed last week.
>
Official 7.2b4. Should I get a fresh one from CVS or is beta5/RC1 imminent ?

--------------------
Hannu





Re: Possible bug in vacuum redo

From
Tom Lane
Date:
Hannu Krosing <hannu@tm.ee> writes:
>> Using what version?  That looks like the deadlock I fixed last week.
>> 
> Official 7.2b4. Should I get a fresh one from CVS or is beta5/RC1 imminent ?

Use CVS (or a nightly snapshot if that's easier).  I doubt we'll do
anything until after Christmas...
        regards, tom lane


Re: Possible bug in vacuum redo

From
Bruce Momjian
Date:
Tom Lane wrote:
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > In READ COMMITTED mode, a query first finds the tuples
> > valid under the snapshot taken when the query started. It
> > never visits tuples that were already updated (to newer
> > ones) and committed before the query started. Essentially
> > t_ctid is only needed by concurrently running backends.
> 
> [ thinks for awhile ]  I see: you're saying that t_ctid is only
> used by transactions that are concurrent with the deleting transaction,
> so if there's a database crash there's no need to restore t_ctid.
> 
> Probably true, but still mighty ugly.
> 
> Meanwhile, I guess I gotta look elsewhere for a theory to explain
> those reports of duplicate rows.  Oh well...

Can someone document this in the sources somewhere?  I am not sure how
to do it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026