Thread: Questions about update, delete, ctid...

Questions about update, delete, ctid...

From
DANTE Alexandra
Date:
Hello List,

I try to understand perfectly the mecanisms used to update / delete a
tuple (and consequently those used in VACUUM) and I've got some questions.

I've created a small database with only one table called "nation".  This
table was created with this command :
CREATE TABLE nation(
    n_nationkey          bigint NOT NULL,
    n_name               char(25)
);
When I have inserted 25 rows concerning countries ans have updated on row.
Before doing an update, the values of xmin, xmax and ctid were :
testvacuum=# select xmin, xmax, ctid, * from nation;
  xmin  | xmax |  ctid  | n_nationkey |          n_name
--------+------+--------+-------------+---------------------------
 140049 |    0 |  (0,1) |           0 | ALGERIA
 140049 |    0 |  (0,2) |           1 | ARGENTINA
 140049 |    0 |  (0,3) |           2 | BRAZIL
 140049 |    0 |  (0,4) |           3 | CANADA
 140049 |    0 |  (0,5) |           4 | EGYPT
 140049 |    0 |  (0,6) |           5 | ETHIOPIA
 140049 |    0 |  (0,7) |           6 | FRANCE
 140049 |    0 |  (0,8) |           7 | GERMANY
 140049 |    0 |  (0,9) |           8 | INDIA
 140049 |    0 | (0,10) |           9 | INDONESIA
 140049 |    0 | (0,11) |          10 | IRAN
 140049 |    0 | (0,12) |          11 | IRAQ
 140049 |    0 | (0,13) |          12 | JAPAN
 140049 |    0 | (0,14) |          13 | JORDAN
 140049 |    0 | (0,15) |          14 | KENYA
 140049 |    0 | (0,16) |          15 | MOROCCO
 140049 |    0 | (0,17) |          16 | MOZAMBIQUE
 140049 |    0 | (0,18) |          17 | PERU
 140049 |    0 | (0,19) |          18 | CHINA
 140049 |    0 | (0,20) |          19 | ROMANIA
 140049 |    0 | (0,21) |          20 | SAUDI ARABIA
 140049 |    0 | (0,22) |          21 | VIETNAM
 140049 |    0 | (0,23) |          22 | RUSSIA
 140049 |    0 | (0,24) |          23 | UNITED KINGDOM
 140049 |    0 | (0,25) |          24 | UNITED STATES
(25 rows)

Then I updated the row where the "n_name" was "IRAQ", and replaced it by
"ITALY" :
testvacuum=# update nation set n_name='ITALY' where n_nationkey=11;
UPDATE 1
testvacuum=# select xmin, xmax, ctid, * from nation;
  xmin  | xmax |  ctid  | n_nationkey |          n_name
--------+------+--------+-------------+---------------------------
 140049 |    0 |  (0,1) |           0 | ALGERIA
 140049 |    0 |  (0,2) |           1 | ARGENTINA
 140049 |    0 |  (0,3) |           2 | BRAZIL
 140049 |    0 |  (0,4) |           3 | CANADA
 140049 |    0 |  (0,5) |           4 | EGYPT
 140049 |    0 |  (0,6) |           5 | ETHIOPIA
 140049 |    0 |  (0,7) |           6 | FRANCE
 140049 |    0 |  (0,8) |           7 | GERMANY
 140049 |    0 |  (0,9) |           8 | INDIA
 140049 |    0 | (0,10) |           9 | INDONESIA
 140049 |    0 | (0,11) |          10 | IRAN
 140049 |    0 | (0,13) |          12 | JAPAN
 140049 |    0 | (0,14) |          13 | JORDAN
 140049 |    0 | (0,15) |          14 | KENYA
 140049 |    0 | (0,16) |          15 | MOROCCO
 140049 |    0 | (0,17) |          16 | MOZAMBIQUE
 140049 |    0 | (0,18) |          17 | PERU
 140049 |    0 | (0,19) |          18 | CHINA
 140049 |    0 | (0,20) |          19 | ROMANIA
 140049 |    0 | (0,21) |          20 | SAUDI ARABIA
 140049 |    0 | (0,22) |          21 | VIETNAM
 140049 |    0 | (0,23) |          22 | RUSSIA
 140049 |    0 | (0,24) |          23 | UNITED KINGDOM
 140049 |    0 | (0,25) |          24 | UNITED STATES
 140061 |    0 | (0,26) |          11 | ITALY
(25 rows)

By doing this update, I see that a new xmin, xmax and ctid have been
computed and that the new tuple with the name "ITALY" appears at the end
of the table.
I have tried to found in the source code what has been done during the
update, exploring the "ExecUpdate" method in the
"backend/executor/execMain.c" file, the "heap_update" method in the
"backend/access/heap/heapam.c" file, the structure defined in the
"include/access/htup.h" file, ... but it is not very easy for someone
not familiar with the code...

I hope someone could answer these questions :
- what are the new values for xmin, xmax and ctid for an updated tuple ?
- what about the old tuple ? what is the value for xmax ?
- is it correct to think that the ctid of the old version of the tuple
is a link to newer version ? In my example, is it correct to think that
the tuple :
140049    |    0 | (0,12) |          11 | IRAQ
has become :
new value |    0 | (0,26) |          11 | ITALY

- what are the values set in the "infomask" structure for the old
version of the tuple ?


And then, after all these questions about tables, I've got questions
about index. Imagine that we have an index of the "n_name" column, after
the update :
- is it correct to think that a new index tuple has been created ?
- does the old index tuple link to the new index tuple ?
- if not, how the B-tree can be still balanced ? is it necessary to
rebuild the index ?

Thank you very much for your help.
Regards,
Alexandra DANTE


Re: Questions about update, delete, ctid...

From
Martijn van Oosterhout
Date:
On Fri, Jul 28, 2006 at 12:00:19PM +0200, DANTE Alexandra wrote:
> Hello List,
>
> I try to understand perfectly the mecanisms used to update / delete a
> tuple (and consequently those used in VACUUM) and I've got some questions.

<snip example>

> I hope someone could answer these questions :
> - what are the new values for xmin, xmax and ctid for an updated tuple ?

xmin is the transaction that created the tuple (ie your XID)
xmax is the transaction that deleted the tuple (ie zero, it's not dead)
ctid is wherever it ends up on disk

> - what about the old tuple ? what is the value for xmax ?

Your XID, given you deleted it.

> - is it correct to think that the ctid of the old version of the tuple
> is a link to newer version ? In my example, is it correct to think that
> the tuple :
> 140049    |    0 | (0,12) |          11 | IRAQ
> has become :
> new value |    0 | (0,26) |          11 | ITALY

The word "become" is not really right. The old version has become
invisible to you and the new version is visible. Some other
transactions will see the old one, some the new one. However, there is
a link between the old and the new version do detect conflicting
updates.

> - what are the values set in the "infomask" structure for the old
> version of the tuple ?

I don't think there are any changes. Until your transaction commits you
can't really know if the tuple is really deleted or not. The first
transaction to read the tuple after your transaction commits will
update the bits.

> And then, after all these questions about tables, I've got questions
> about index. Imagine that we have an index of the "n_name" column, after
> the update :
> - is it correct to think that a new index tuple has been created ?

Yes

> - does the old index tuple link to the new index tuple ?

No

> - if not, how the B-tree can be still balanced ? is it necessary to
> rebuild the index ?

The b-tree code attempts to keep itself balanced. But it does nothing
special for an UPDATE, it works the same as an INSERT.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

Re: Questions about update, delete, ctid...

From
"Florian G. Pflug"
Date:
Martijn van Oosterhout wrote:
> On Fri, Jul 28, 2006 at 12:00:19PM +0200, DANTE Alexandra wrote:
>> - is it correct to think that the ctid of the old version of the tuple
>> is a link to newer version ? In my example, is it correct to think that
>> the tuple :
>> 140049    |    0 | (0,12) |          11 | IRAQ
>> has become :
>> new value |    0 | (0,26) |          11 | ITALY
>
> The word "become" is not really right. The old version has become
> invisible to you and the new version is visible. Some other
> transactions will see the old one, some the new one. However, there is
> a link between the old and the new version do detect conflicting
> updates.

That sounds interesting... how is that link between old and new
tuple implemented? I've been playing with the idea to implement
"update table foo where current of <cursor>", and such a link
would help tremendously. I read the code, but couldn't find any
such link - which part of the code should I read closer?

greetings, Florian Pflug


Re: Questions about update, delete, ctid...

From
Martijn van Oosterhout
Date:
On Sun, Jul 30, 2006 at 04:37:26PM +0200, Florian G. Pflug wrote:
> Martijn van Oosterhout wrote:
> >The word "become" is not really right. The old version has become
> >invisible to you and the new version is visible. Some other
> >transactions will see the old one, some the new one. However, there is
> >a link between the old and the new version do detect conflicting
> >updates.
>
> That sounds interesting... how is that link between old and new
> tuple implemented? I've been playing with the idea to implement
> "update table foo where current of <cursor>", and such a link
> would help tremendously. I read the code, but couldn't find any
> such link - which part of the code should I read closer?

It's the t_ctid field of the tuple header. But I don't see what that
has to do with "WHERE CURRENT OF". That expression should return the
tuple visible to your transaction, not some updated version which you
won't be able to fetch. If you end up updating the old version, the
system will take care of finding the newer version if necessary.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

Re: Questions about update, delete, ctid...

From
"Florian G. Pflug"
Date:
Martijn van Oosterhout wrote:
> On Sun, Jul 30, 2006 at 04:37:26PM +0200, Florian G. Pflug wrote:
>> Martijn van Oosterhout wrote:
>>> The word "become" is not really right. The old version has become
>>> invisible to you and the new version is visible. Some other
>>> transactions will see the old one, some the new one. However, there is
>>> a link between the old and the new version do detect conflicting
>>> updates.
>> That sounds interesting... how is that link between old and new
>> tuple implemented? I've been playing with the idea to implement
>> "update table foo where current of <cursor>", and such a link
>> would help tremendously. I read the code, but couldn't find any
>> such link - which part of the code should I read closer?
>
> It's the t_ctid field of the tuple header. But I don't see what that
> has to do with "WHERE CURRENT OF". That expression should return the
> tuple visible to your transaction, not some updated version which you
> won't be able to fetch. If you end up updating the old version, the
> system will take care of finding the newer version if necessary.

The problem is that the ctid of the tuple that "fetch" returned
might not actually be the tuple that needs to be updated. Imagine:
create cursor c_foo as select * from foo;
fetch c_foo into v_foo ;
update foo set <somevars> where current of c_foo ;
update foo set <someothervars> where current of c_foo ;

The second update needs to find the correct tuple to update. In
case of a serializable transaction, or a cursor declared "for update",
you only need to that updates by that same transaction into account -
but for the general case, it might even haven been another transaction
that did the update.

My first conclusion was that a possible solution would be to
restrict "where current of" to "for update"-cursors, and then somehow
remember old->new ctid mappings inside a transaction - but that sounds
rather messy, and intrusive.

Being able to follow ctid chains should make that quite a bit easier

greetings, Florian Pflug


Re: Questions about update, delete, ctid...

From
Tom Lane
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:
> Martijn van Oosterhout wrote:
>> It's the t_ctid field of the tuple header. But I don't see what that
>> has to do with "WHERE CURRENT OF". That expression should return the
>> tuple visible to your transaction, not some updated version which you
>> won't be able to fetch. If you end up updating the old version, the
>> system will take care of finding the newer version if necessary.

> The problem is that the ctid of the tuple that "fetch" returned
> might not actually be the tuple that needs to be updated.

Martijn's got a point though: it's not clear that that needs any special
care on the part of WHERE CURRENT OF.  It may be that the existing
EvalPlanQual machinery does everything that's needful.  Is the example
you show really any different from updating a tuple that someone else
updated while your UPDATE was in progress?  In both cases the EPQ
machinery will take care of chaining forward to the right version to
update.

            regards, tom lane

Re: Questions about update, delete, ctid...

From
"Florian G. Pflug"
Date:
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> Martijn van Oosterhout wrote:
>>> It's the t_ctid field of the tuple header. But I don't see what that
>>> has to do with "WHERE CURRENT OF". That expression should return the
>>> tuple visible to your transaction, not some updated version which you
>>> won't be able to fetch. If you end up updating the old version, the
>>> system will take care of finding the newer version if necessary.
>
>> The problem is that the ctid of the tuple that "fetch" returned
>> might not actually be the tuple that needs to be updated.
>
> Martijn's got a point though: it's not clear that that needs any special
> care on the part of WHERE CURRENT OF.  It may be that the existing
> EvalPlanQual machinery does everything that's needful.  Is the example
> you show really any different from updating a tuple that someone else
> updated while your UPDATE was in progress?  In both cases the EPQ
> machinery will take care of chaining forward to the right version to
> update.

I agree, at least for "for-update"-cursors. If the cursor was not
declared "for update", then it is not even cleaer to me what the
correct behaviour would be. Imagine that you declared a cursor, and fetched
a row. After fetching, but before you call "update table foo where current of
mycursor" someone updates the row (and commits), and the new version would have
never been returned by your select statement in the first place. Should that row be
updates, or not? What if the other transaction deleted the row - should your
update raise an error? fail silently?

Because of those problems, I'll always assume that the cursor was declared
"for update" from now on.

After reading backend code for the last hours, I've created the following plan for
implementing "where current of"

1) Allow "declare ... cursor .... for update of <field1>, <field2>, <table1>, ...".
    Since all locking is done on row, not on particular columns, specifying
    <field1> is really equivalent to specifying the table that this field came from.
2) For each table in "for update of ..." remember the ctid of the returned tuple when
    doing fetch.

Now, there are two possibilities
A) Implement a "<table-alias> current of <cursor>" predicate for usage in where clauses.
    It would get the last ctid from the cursor (for the table aliased by <table-alias),
    follow ctid-chain to get the newest version, and compare this ctid to the one found
    in <table-alias>.
    "update <table> where current of <cursor>" would then be an abbreviation for
    "update <table> where <table> current of <cursor>", and e.g.
    "update <table> where <table> current of <cursor> and data like '%whatever%'"
    would be allowed to.
B) Implement "update <table> where current of <cursor>" as a special case. The plan
    generated would be similar to the one generated by "update <table> where ctid = ...",
    but the ctid stored would only be used to find an initial tuple version, and ignored
    when rechecking if a newer version still matches.

A) sound like a little bit more work, but it would avoid special-casing "where current of"
somewhere in either the planner or the executor. I've googled around a bit, and it
seems as if at least oracle and db2 only support "where current of <cursor>", and
not a general "<table> current of <cursor>" predicate.

Since I'm new to postgres-backend-hacking, this has no chance to be ready until feature
freeze, so it's 8.3 material at best, I guess ;-)
Still, I'd be thankfull for any comments, especially the "This can't work because ..."
type of comments ;-)

greetings, Florian Pflug

Re: Questions about update, delete, ctid...

From
Martijn van Oosterhout
Date:
On Mon, Jul 31, 2006 at 01:47:41AM +0200, Florian G. Pflug wrote:
> I agree, at least for "for-update"-cursors. If the cursor was not
> declared "for update", then it is not even cleaer to me what the
> correct behaviour would be. Imagine that you declared a cursor, and fetched
> a row. After fetching, but before you call "update table foo where current
> of
> mycursor" someone updates the row (and commits), and the new version would
> have
> never been returned by your select statement in the first place. Should
> that row be
> updates, or not? What if the other transaction deleted the row - should your
> update raise an error? fail silently?

That's what the EPQ machinery that Tom talked about does. It will trace
a newer version, raise a serialization failure or do nothing as
appropriate. I really don't think you need to do anything special in
this case.

Read the src/backend/executor/README file for details.

> Because of those problems, I'll always assume that the cursor was declared
> "for update" from now on.
>
> After reading backend code for the last hours, I've created the following
> plan for
> implementing "where current of"

<snip>

Looks like a good plan, but I think you've overlooked something: if
you've locked the tuple FOR UPDATE then by definition there cannot be a
newer version, right? (Incidently, I see message about FOR UPDATE not
being supported on cursors).

More specifically, if you have a tuple locked FOR UPDATE and there is a
newer version, that almost certaintly means this new version is dead,
part of an aborted transaction. Comments from more knowledgable people
would be good though. Locked tuples are usually easier.

I think plan A is the way to go. Good luck.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

Re: Questions about update, delete, ctid...

From
DANTE Alexandra
Date:
Hello Martijn, hello List,

Thank you very much.
I still have some questions, further to your answers :

Martijn van Oosterhout wrote:

>On Fri, Jul 28, 2006 at 12:00:19PM +0200, DANTE Alexandra wrote:
>
>
>>Hello List,
>>
>>I try to understand perfectly the mecanisms used to update / delete a
>>tuple (and consequently those used in VACUUM) and I've got some questions.
>>
>>
>
><snip example>
>
>
>
>>I hope someone could answer these questions :
>>- what are the new values for xmin, xmax and ctid for an updated tuple ?
>>
>>
>
>xmin is the transaction that created the tuple (ie your XID)
>xmax is the transaction that deleted the tuple (ie zero, it's not dead)
>ctid is wherever it ends up on disk
>
>
>
>>- what about the old tuple ? what is the value for xmax ?
>>
>>
>
>Your XID, given you deleted it.
>
>
>
>>- is it correct to think that the ctid of the old version of the tuple
>>is a link to newer version ? In my example, is it correct to think that
>>the tuple :
>>140049    |    0 | (0,12) |          11 | IRAQ
>>has become :
>>new value |    0 | (0,26) |          11 | ITALY
>>
>>
>
>The word "become" is not really right. The old version has become
>invisible to you and the new version is visible. Some other
>transactions will see the old one, some the new one. However, there is
>a link between the old and the new version do detect conflicting
>updates.
>
>
>
I've just seen that I've done a mistake in my example. My question was :
is it correct to think that the ctid of the old version of the tuple is
a link to newer version ? In my example, is it correct to think that the
tuple :
140049    |    0 | (0,12) |          11 | IRAQ
has become :
new value |    0 | (0,26) |          11 | *IRAQ*
Could you give me more details about the link between the old and the
new version, please ?
For me, the link is the c_tid, but maybe I'm wrong...

>>- what are the values set in the "infomask" structure for the old
>>version of the tuple ?
>>
>>
>
>I don't think there are any changes. Until your transaction commits you
>can't really know if the tuple is really deleted or not. The first
>transaction to read the tuple after your transaction commits will
>update the bits.
>
>
>
My question about the "infomask" strucutre was linked to the code of
VACUUM. I've seen in the "lazy_scan_heap method that the
"HeapTupleSatisfiesVacuum" method is called. In this method, according
to the value of "infomask", a tuple is defined as "dead" or not.
That's why I wonder if the "infomask" structure is changed after an
commited update or delete, and what are the values set ?

>>And then, after all these questions about tables, I've got questions
>>about index. Imagine that we have an index of the "n_name" column, after
>>the update :
>>- is it correct to think that a new index tuple has been created ?
>>
>>
>
>Yes
>
>
>
>>- does the old index tuple link to the new index tuple ?
>>
>>
>
>No
>
>
>
>>- if not, how the B-tree can be still balanced ? is it necessary to
>>rebuild the index ?
>>
>>
>
>The b-tree code attempts to keep itself balanced. But it does nothing
>special for an UPDATE, it works the same as an INSERT.
>
>
So, consequently, it is not necessary to rebuild the B-tree index after
an update or a delete.
Is it correct ?

>Hope this helps,
>
>

Regards,
Alexandra


Re: Questions about update, delete, ctid...

From
Martijn van Oosterhout
Date:
On Mon, Jul 31, 2006 at 11:04:58AM +0200, DANTE Alexandra wrote:
> I've just seen that I've done a mistake in my example. My question was :
> is it correct to think that the ctid of the old version of the tuple is
> a link to newer version ? In my example, is it correct to think that the
> tuple :
> 140049    |    0 | (0,12) |          11 | IRAQ
> has become :
> new value |    0 | (0,26) |          11 | *IRAQ*
> Could you give me more details about the link between the old and the
> new version, please ?
> For me, the link is the c_tid, but maybe I'm wrong...

Well, in your case where there are no other transactions running, yes.
In the general case there may have been other updates so all you know
is that the new tuple is a descendant of the old one. The chain of
t_ctid links can be arbitrarily long.

Note: with multiple psql sessions you can see some of this happening.

======= Session 1 =======
test=# begin;
BEGIN
test=# set TRANSACTION ISOLATION LEVEL SERIALIZABLE ;
SET
test=# select xmin, xmax, cmin, cmax, * from a;
  xmin  | xmax | cmin | cmax | country
--------+------+------+------+---------
 277264 |    0 |    0 |    0 | IRAQ
(1 row)

======= Session 2 =======
test=# select xmin, xmax, cmin, cmax, * from a;
  xmin  | xmax | cmin | cmax | country
--------+------+------+------+---------
 277264 |    0 |    0 |    0 | IRAQ
(1 row)

test=# update a set country = 'ITALY';
UPDATE 1
test=# select xmin, xmax, cmin, cmax, * from a;
  xmin  | xmax | cmin | cmax | country
--------+------+------+------+---------
 277269 |    0 |    0 |    0 | ITALY
(1 row)

======= Session 1 again =======
test=# select xmin, xmax, cmin, cmax, * from a;
  xmin  |  xmax  |  cmin  | cmax | country
--------+--------+--------+------+---------
 277264 | 277269 | 277269 |    0 | IRAQ
(1 row)

As you can see now, both sessions are seeing different views of the
same table. The old tuple has now updated xmax and cmin values. If
session two updated the tuple again it would get a cmax value also. You
can't see the t_ctid link here, but there is one from the old row to
the new one.

> My question about the "infomask" strucutre was linked to the code of
> VACUUM. I've seen in the "lazy_scan_heap method that the
> "HeapTupleSatisfiesVacuum" method is called. In this method, according
> to the value of "infomask", a tuple is defined as "dead" or not.
> That's why I wonder if the "infomask" structure is changed after an
> commited update or delete, and what are the values set ?

It was Tom who pointed this out to me first: the infomask is not the
important part. The infomask is just a cache of the results of tests.

The problem is that checking if a transaction has been committed or not
can be reasonably expensive since it might have to check on disk. If
you had to do that every time you looked up a tuple the performence
would be terrible. So what happens is that the first time someone looks
up the status of a transaction and finds it's committed or aborted, it
sets that bit so no-one else has to do the test.

The basic result is that you can read the code as if the infomask was
blank and the result should be the same. The only difference is that
various bits allow the code to skip certain tests because somebody has
already done them before. The end result should be the same.

> So, consequently, it is not necessary to rebuild the B-tree index after
> an update or a delete.
> Is it correct ?

You never have to rebuild the index. The whole system is designed so
many people can be reading and writing the index simultaneously without
getting in eachothers way.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

Re: Questions about update, delete, ctid...

From
"Florian G. Pflug"
Date:
Martijn van Oosterhout wrote:
> On Mon, Jul 31, 2006 at 01:47:41AM +0200, Florian G. Pflug wrote:
>> I agree, at least for "for-update"-cursors. If the cursor was not
>> declared "for update", then it is not even cleaer to me what the
>> correct behaviour would be. Imagine that you declared a cursor, and fetched
>> a row. After fetching, but before you call "update table foo where current
>> of
>> mycursor" someone updates the row (and commits), and the new version would
>> have
>> never been returned by your select statement in the first place. Should
>> that row be
>> updates, or not? What if the other transaction deleted the row - should your
>> update raise an error? fail silently?
>
> That's what the EPQ machinery that Tom talked about does. It will trace
> a newer version, raise a serialization failure or do nothing as
> appropriate. I really don't think you need to do anything special in
> this case.

I can see how the EPQ machinery can be used to chain forward to the
correct row to be updated, even if I originally found an older version
(e.g. by searching for a specific ctid). But for non-"for
update"-cursors, the newest version of the row returned by fetch could
be modified such that it would have never been returned by fetch in the
first place. Image two transactions A and B:
A: begin
B: begin
A: declare c_foo cursor for select * from foo where type='A';
A: fetch c_foo -- the returned row is *not* locked.
B: update foo set type='B' -- There are no rows with type='B' now
B: commit
A: update foo set type='A' where current of c_foo

The EPQ machinery can now guarantee that we actually try to update
the correct row version. But I'm unsure if the last update should
do anything at all. Because at the time of the update, the current
row of the cursor c_foo wouldn't haven been returned by fetch, because
it's type is now 'B' and not 'A'.

This is why I believe that the use of "current of" to cursors that take
rowlevel share locks in the right thing to do.

>
> Read the src/backend/executor/README file for details.
Done that ;-)

>> Because of those problems, I'll always assume that the cursor was declared
>> "for update" from now on.
>>
>> After reading backend code for the last hours, I've created the following
>> plan for
>> implementing "where current of"
>
> <snip>
>
> Looks like a good plan, but I think you've overlooked something: if
> you've locked the tuple FOR UPDATE then by definition there cannot be a
> newer version, right? (Incidently, I see message about FOR UPDATE not
> being supported on cursors).
There can be, because the transaction owning the lock could have updated
the row itself.

> More specifically, if you have a tuple locked FOR UPDATE and there is a
> newer version, that almost certaintly means this new version is dead,
> part of an aborted transaction. Comments from more knowledgable people
> would be good though. Locked tuples are usually easier.
>
> I think plan A is the way to go. Good luck.
Sounds cleaner to me too, even though it extends the spec a bit.

greetings, Florian Pflug

Re: Questions about update, delete, ctid...

From
Tom Lane
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:
> I can see how the EPQ machinery can be used to chain forward to the
> correct row to be updated, even if I originally found an older version
> (e.g. by searching for a specific ctid). But for non-"for
> update"-cursors, the newest version of the row returned by fetch could
> be modified such that it would have never been returned by fetch in the
> first place.

Yah, EPQ checks for that ... none of the situations you've mentioned are
any different from the case of an ordinary UPDATE that finds a row
that's been modified since its snapshot was taken.  A cursor would just
make the time window a bit larger.

            regards, tom lane

Re: Questions about update, delete, ctid...

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> Looks like a good plan, but I think you've overlooked something: if
> you've locked the tuple FOR UPDATE then by definition there cannot be a
> newer version, right?

Nope --- think self-update.

            regards, tom lane

Re: Questions about update, delete, ctid...

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Mon, Jul 31, 2006 at 11:04:58AM +0200, DANTE Alexandra wrote:
>> I've just seen that I've done a mistake in my example. My question was :
>> is it correct to think that the ctid of the old version of the tuple is
>> a link to newer version ?

> Well, in your case where there are no other transactions running, yes.
> In the general case there may have been other updates so all you know
> is that the new tuple is a descendant of the old one. The chain of
> t_ctid links can be arbitrarily long.

It's probably worth pointing out here that the "ctid" column exposed
at the SQL level is not the same as t_ctid --- it's what the C code calls
t_self.

            regards, tom lane

Re: Questions about update, delete, ctid...

From
DANTE Alexandra
Date:
Hello Martijn, hello List,

Last question on this subject, what's happened during a SELECT query on
a tuple just updated and commited ?
I followed in the source code the links between these methods :
CreateQueryDesc, ExecutorStart, ExecutorRun, ExecutePlan, ExecSelect but
I still have a question :
    - during a SELECT query on a tuple just updated and commited, does
the executor first detect the old tuple and then via the c_tid link go
to the new version of the tuple ? or go directly to the new version ?
    - is it the same for the index ?

Could you tell me where theses checks are done in the source code ?

Thank you very much !
Regards,
Alexandra DANTE


Martijn van Oosterhout wrote:

>On Mon, Jul 31, 2006 at 11:04:58AM +0200, DANTE Alexandra wrote:
>
>
>>I've just seen that I've done a mistake in my example. My question was :
>>is it correct to think that the ctid of the old version of the tuple is
>>a link to newer version ? In my example, is it correct to think that the
>>tuple :
>>140049    |    0 | (0,12) |          11 | IRAQ
>>has become :
>>new value |    0 | (0,26) |          11 | *IRAQ*
>>Could you give me more details about the link between the old and the
>>new version, please ?
>>For me, the link is the c_tid, but maybe I'm wrong...
>>
>>
>
>Well, in your case where there are no other transactions running, yes.
>In the general case there may have been other updates so all you know
>is that the new tuple is a descendant of the old one. The chain of
>t_ctid links can be arbitrarily long.
>
>Note: with multiple psql sessions you can see some of this happening.
>
>======= Session 1 =======
>test=# begin;
>BEGIN
>test=# set TRANSACTION ISOLATION LEVEL SERIALIZABLE ;
>SET
>test=# select xmin, xmax, cmin, cmax, * from a;
>  xmin  | xmax | cmin | cmax | country
>--------+------+------+------+---------
> 277264 |    0 |    0 |    0 | IRAQ
>(1 row)
>
>======= Session 2 =======
>test=# select xmin, xmax, cmin, cmax, * from a;
>  xmin  | xmax | cmin | cmax | country
>--------+------+------+------+---------
> 277264 |    0 |    0 |    0 | IRAQ
>(1 row)
>
>test=# update a set country = 'ITALY';
>UPDATE 1
>test=# select xmin, xmax, cmin, cmax, * from a;
>  xmin  | xmax | cmin | cmax | country
>--------+------+------+------+---------
> 277269 |    0 |    0 |    0 | ITALY
>(1 row)
>
>======= Session 1 again =======
>test=# select xmin, xmax, cmin, cmax, * from a;
>  xmin  |  xmax  |  cmin  | cmax | country
>--------+--------+--------+------+---------
> 277264 | 277269 | 277269 |    0 | IRAQ
>(1 row)
>
>As you can see now, both sessions are seeing different views of the
>same table. The old tuple has now updated xmax and cmin values. If
>session two updated the tuple again it would get a cmax value also. You
>can't see the t_ctid link here, but there is one from the old row to
>the new one.
>
>
>
>>My question about the "infomask" strucutre was linked to the code of
>>VACUUM. I've seen in the "lazy_scan_heap method that the
>>"HeapTupleSatisfiesVacuum" method is called. In this method, according
>>to the value of "infomask", a tuple is defined as "dead" or not.
>>That's why I wonder if the "infomask" structure is changed after an
>>commited update or delete, and what are the values set ?
>>
>>
>
>It was Tom who pointed this out to me first: the infomask is not the
>important part. The infomask is just a cache of the results of tests.
>
>The problem is that checking if a transaction has been committed or not
>can be reasonably expensive since it might have to check on disk. If
>you had to do that every time you looked up a tuple the performence
>would be terrible. So what happens is that the first time someone looks
>up the status of a transaction and finds it's committed or aborted, it
>sets that bit so no-one else has to do the test.
>
>The basic result is that you can read the code as if the infomask was
>blank and the result should be the same. The only difference is that
>various bits allow the code to skip certain tests because somebody has
>already done them before. The end result should be the same.
>
>
>
>>So, consequently, it is not necessary to rebuild the B-tree index after
>>an update or a delete.
>>Is it correct ?
>>
>>
>
>You never have to rebuild the index. The whole system is designed so
>many people can be reading and writing the index simultaneously without
>getting in eachothers way.
>
>Hope this helps,
>
>


Re: Questions about update, delete, ctid...

From
"Florian G. Pflug"
Date:
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> I can see how the EPQ machinery can be used to chain forward to the
>> correct row to be updated, even if I originally found an older version
>> (e.g. by searching for a specific ctid). But for non-"for
>> update"-cursors, the newest version of the row returned by fetch could
>> be modified such that it would have never been returned by fetch in the
>> first place.
>
> Yah, EPQ checks for that ... none of the situations you've mentioned are
> any different from the case of an ordinary UPDATE that finds a row
> that's been modified since its snapshot was taken.  A cursor would just
> make the time window a bit larger.

I don't get it. EPQ would need to reevaluate the plan of the _select_
belonging to the _cursor_, not the plan of the _update_, to prevent
the effect outlined above. Are you suggesting to use EPQ for that?

greetings, Florian Pflug

Re: Questions about update, delete, ctid...

From
Tom Lane
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:
> I don't get it. EPQ would need to reevaluate the plan of the _select_
> belonging to the _cursor_, not the plan of the _update_, to prevent
> the effect outlined above. Are you suggesting to use EPQ for that?

Yes, I don't think you have much choice --- note that SELECT FOR UPDATE
doesn't block self-updates, so assuming that the row is unmodified
doesn't work.

If WHERE CURRENT OF were easy, we'd have done it before now ;-)

            regards, tom lane

Re: Questions about update, delete, ctid...

From
Tom Lane
Date:
DANTE Alexandra <Alexandra.Dante@bull.net> writes:
>     - during a SELECT query on a tuple just updated and commited, does
> the executor first detect the old tuple and then via the c_tid link go
> to the new version of the tuple ? or go directly to the new version ?

Neither.  SELECT doesn't care about update relationships, it just
returns whatever row version is current according to its snapshot.
It's only UPDATE/DELETE that are constrained to act on the most
current version of each row.

            regards, tom lane

Re: Questions about update, delete, ctid...

From
DANTE Alexandra
Date:
Hello Tom, Martijn and List,

Thank you for yours answers !
This point is now closed for me and I can go further in the mecanisms of
VACUUM.

Regards,
Alexandra


Tom Lane wrote:

>DANTE Alexandra <Alexandra.Dante@bull.net> writes:
>
>
>>    - during a SELECT query on a tuple just updated and commited, does
>>the executor first detect the old tuple and then via the c_tid link go
>>to the new version of the tuple ? or go directly to the new version ?
>>
>>
>
>Neither.  SELECT doesn't care about update relationships, it just
>returns whatever row version is current according to its snapshot.
>It's only UPDATE/DELETE that are constrained to act on the most
>current version of each row.
>
>            regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 1: if posting/reading through Usenet, please send an appropriate
>       subscribe-nomail command to majordomo@postgresql.org so that your
>       message can get through to the mailing list cleanly
>
>
>