Thread: existing row not found by SELECT ... WHERE CTID = ?
We have a C-written program, written in ESQL/C, of our LMS where the logic crawls with FETCH through a hit list and does UPDATE on some rows which match certain condition. This works fine for thousands of rows every night, but magically sometimes it fails. I have here the part of ESQL/C logs when I was able to catch such a case: The hit list for the FETCHes is built with: [29858] [23.05.2022 23:11:06:419]: prepare_common on line 1825: name sid_d01buch; query: "SELECT ctid, * from d01buch WHEREd01status = 4 " [29858] [23.05.2022 23:11:06:419]: ecpg_execute on line 2026: query: declare d01buch_scr scroll cursor with hold for SELECTctid, * from d01buch WHERE d01status = 4 ; with 0 parameter(s) on connection sisis then the CURSOR d01buch_scr is FETCHed some 59537 times, some of the rows were updated and here is the failing situation with a good FETCH of the row: ... [29858] [23.05.2022 23:21:21:842]: ecpg_process_output on line 2655: correctly got 1 tuples with 79 fields [29858] [23.05.2022 23:21:21:842]: ecpg_get_data on line 2655: RESULT: (668486,20) offset: 19; array: no [29858] [23.05.2022 23:21:21:842]: ecpg_get_data on line 2655: RESULT: 03904016 offset: 1152; array:no ... but when a new CURSOR should be build to UPDATE the row based on its CTID = (668486,20), the row could not be found: [29858] [23.05.2022 23:21:21:843]: deallocate_one on line 2494: name hs_d01buch [29858] [23.05.2022 23:21:21:843]: prepare_common on line 2494: name hs_d01buch; query : "SELECT * FROM d01buch WHEREctid = $1 FOR UPDATE" [29858] [23.05.2022 23:21:21:843]: ecpg_execute on line 2526: query: declare hc_d01buch cursor for SELECT * FROM d01buchWHERE ctid = $1 FOR UPDATE; with 1 parameter(s) on connection sisis [29858] [23.05.2022 23:21:21:844]: ecpg_execute on line 2526: using PQexecParams [29858] [23.05.2022 23:21:21:844]: ecpg_free_params on line 2526: parameter 1 = (668486,20) ... and the first FETCH in the CURSOR hc_d01buch could not see any row: 104 [29858] [23.05.2022 23:21:21:844]: raising sqlcode 100 on line 2531: no data found on line 2531 Why is this? Thanks matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Matthias Apitz <guru@unixarea.de> writes: > We have a C-written program, written in ESQL/C, of our LMS where the logic > crawls with FETCH through a hit list and does UPDATE on some rows which > match certain condition. This works fine for thousands of rows every night, > but magically sometimes it fails. Maybe something else already updated the row since the cursor was opened? That would change its CTID. regards, tom lane
El día martes, mayo 24, 2022 a las 10:47:11 -0400, Tom Lane escribió: > Matthias Apitz <guru@unixarea.de> writes: > > We have a C-written program, written in ESQL/C, of our LMS where the logic > > crawls with FETCH through a hit list and does UPDATE on some rows which > > match certain condition. This works fine for thousands of rows every night, > > but magically sometimes it fails. > > Maybe something else already updated the row since the cursor was opened? > That would change its CTID. If you compare the time when the CTID was read: [29858] [23.05.2022 23:21:21:842]: ecpg_get_data on line 2655: RESULT: (668486,20) offset: 19; array: no with the time when it was not found: [29858] [23.05.2022 23:21:21:844]: raising sqlcode 100 on line 2531: no data found on line 2531 it's nearly imposible that our software could have done this. Maybe VACUUM hits exactly this moment in time and row, but this sounds also like 5 good numbers in the lotery jackpot :-) Thanks matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
On Tue, 2022-05-24 at 16:44 +0200, Matthias Apitz wrote: > We have a C-written program, written in ESQL/C, of our LMS where the logic > crawls with FETCH through a hit list and does UPDATE on some rows which > match certain condition. This works fine for thousands of rows every night, > but magically sometimes it fails. I have here the part of ESQL/C logs > when I was able to catch such a case: > > The hit list for the FETCHes is built with: > > [29858] [23.05.2022 23:11:06:419]: prepare_common on line 1825: name sid_d01buch; query: "SELECT ctid, * from d01buch WHEREd01status = 4 " > [29858] [23.05.2022 23:11:06:419]: ecpg_execute on line 2026: query: declare d01buch_scr scroll cursor with hold for SELECTctid, * from d01buch WHERE d01status = 4 ; with 0 parameter(s) on > connection sisis > > then the CURSOR d01buch_scr is FETCHed some 59537 times, some of the > rows were updated and here is the failing situation with a good FETCH of > the row: I cannot understand many of your log messages, but it seems quite clear that you are declaring a WITH HOLD cursor. Such cursors are materialized when the transactoin that creates the cursor commits, so the result set is "frozen" and does no longer reflect the current state of the table. It may well be that somebody deleted or updated a few rows between the time the cursor was materialized and the time the 50000th row was fetched. Yours, Laurenz Albe -- Cybertec | https://www.cybertec-postgresql.com
Laurenz Albe <laurenz.albe@cybertec.at> writes: > It may well be that somebody deleted or updated a few rows between the time > the cursor was materialized and the time the 50000th row was fetched. Even without HOLD, a cursor will return a view of the data as it stood when the cursor was opened, just as a plain SELECT does. There is *plenty* of time for another session to get in there if you've been groveling through 50K records one at a time. regards, tom lane
El día martes, mayo 24, 2022 a las 12:11:49 -0400, Tom Lane escribió: > Laurenz Albe <laurenz.albe@cybertec.at> writes: > > It may well be that somebody deleted or updated a few rows between the time > > the cursor was materialized and the time the 50000th row was fetched. > > Even without HOLD, a cursor will return a view of the data as it stood > when the cursor was opened, just as a plain SELECT does. There is > *plenty* of time for another session to get in there if you've been > groveling through 50K records one at a time. Tom, Thanks for pointing us in the right direction where to look for a solution. The CURSOR was opened around 23:11 pm and the CTID not found at 23:21 pm, i.e. ten minutes later. This piece of software does every night some housekeeping work in the circulation area of our LMS (Library Management System) and is meant to run as a standalone job (only one process after the other). We're trying to figure out with the customer if something else was started/running at this time between 23:11 and 23:21, to shut this off in the future. Is it possible that the PostgreSQL 13.1 server does something by its own to invalidate the rowid? matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
On Wed, 2022-05-25 at 11:21 +0200, Matthias Apitz wrote: > Is it possible that the PostgreSQL 13.1 server does something by its own to invalidate the rowid? No. PostgreSQL may remove a dead row, but a dead row is by definition no longer visible, so it wouldn't be found by a query. Yours, Laurenz Albe
El día Mittwoch, Mai 25, 2022 a las 12:51:02 +0200, Laurenz Albe escribió: > On Wed, 2022-05-25 at 11:21 +0200, Matthias Apitz wrote: > > Is it possible that the PostgreSQL 13.1 server does something by its own to invalidate the rowid? > > No. PostgreSQL may remove a dead row, but a dead row is by definition > no longer visible, so it wouldn't be found by a query. We will solve the problem now with setting the session after connect to SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL REPEATABLE READ; (with an appropriate ESQL/C call). Any comments? Thanks matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
## Matthias Apitz (guru@unixarea.de): > We will solve the problem now with setting the session after connect to > > SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL REPEATABLE READ; > > (with an appropriate ESQL/C call). Any comments? Maybe the real question is whether it is wise to use an implementation artifact (ctid) to identify rows? The textbook approach could be row locks (SELECT ... FOR SHARE/UPDATE and variants) to prevent concurrent changes or optimistic locking (and a primary key in any case) - but maybe you already investigated those options? Regards, Christoph -- Spare Space
On Wed, 2022-05-25 at 14:27 +0200, Christoph Moench-Tegeder wrote: > ## Matthias Apitz (guru@unixarea.de): > > > We will solve the problem now with setting the session after connect to > > > > SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL REPEATABLE READ; > > > > (with an appropriate ESQL/C call). Any comments? > > Maybe the real question is whether it is wise to use an implementation > artifact (ctid) to identify rows? > The textbook approach could be row locks (SELECT ... FOR SHARE/UPDATE and > variants) to prevent concurrent changes or optimistic locking (and a > primary key in any case) - but maybe you already investigated those options? Right. REPEATABLE READ won't help you there. True, you will see a stable snapshot of the database inside a single transaction, but if a concurrent session has modified the row, you will get a serialization error. So that is not a solution. Yours, Laurenz Albe
> No. PostgreSQL may remove a dead row, but a dead row is by definition > no longer visible, so it wouldn't be found by a query. I am wondering whether it is a good practice to use CTID in a where clause. years ago when I use to code in Informix, using ROWID as a generic substitute for primary key was discouraged precisely for the same reason as described here for CTID. Sometimes rowid can change under concurrent updates.
Ravi Krishna <srkrishna@vivaldi.net> writes: >> No. PostgreSQL may remove a dead row, but a dead row is by definition >> no longer visible, so it wouldn't be found by a query. > I am wondering whether it is a good practice to use CTID in a where > clause. It's fine if part of your business logic is that you don't want to allow concurrent updates. In this case, the OP seems to want to prevent rather than tolerate the concurrent update, so I don't think he needs to revisit the app's use of CTID. If you do need to support concurrent updates, then yeah relying on CTID is likely to be problematic. regards, tom lane
El día Wednesday, May 25, 2022 a las 11:21:44AM +0200, Matthias Apitz escribió: > El día martes, mayo 24, 2022 a las 12:11:49 -0400, Tom Lane escribió: > > > Laurenz Albe <laurenz.albe@cybertec.at> writes: > > > It may well be that somebody deleted or updated a few rows between the time > > > the cursor was materialized and the time the 50000th row was fetched. > > > > Even without HOLD, a cursor will return a view of the data as it stood > > when the cursor was opened, just as a plain SELECT does. There is > > *plenty* of time for another session to get in there if you've been > > groveling through 50K records one at a time. > > Tom, Thanks for pointing us in the right direction where to look for a > solution. The CURSOR was opened around 23:11 pm and the CTID not found > at 23:21 pm, i.e. ten minutes later. This piece of software does every > night some housekeeping work in the circulation area of our LMS (Library > Management System) and is meant to run as a standalone job (only one > process after the other). We're trying to figure out with the customer if something > else was started/running at this time between 23:11 and 23:21, to shut this > off in the future. ... Is there any way to get with the old CTID to the row, for example with the old CTID to the new one which the row now has after the update of the row? matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
On Thursday, May 26, 2022, Matthias Apitz <guru@unixarea.de> wrote:
Is there any way to get with the old CTID to the row, for example with
the old CTID to the new one which the row now has after the update of the row?
No, there is no link between old and new in the main table.
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes: > On Thursday, May 26, 2022, Matthias Apitz <guru@unixarea.de> wrote: >> Is there any way to get with the old CTID to the row, for example with >> the old CTID to the new one which the row now has after the update of the >> row? > No, there is no link between old and new in the main table. There is a forward link from the old CTID to the new, but we don't provide any user-accessible way to use it. I wonder though if the OP should be considering using SELECT FOR UPDATE in his cursor, so that it'd automatically chain up to the newest row version. regards, tom lane
El día jueves, mayo 26, 2022 a las 12:02:35 -0400, Tom Lane escribió: > I wonder though if the > OP should be considering using SELECT FOR UPDATE in his cursor, > so that it'd automatically chain up to the newest row version. I diged into this but the CURSOR are all declared WITH HOLD and this conflicts with FOR UPDATE. And this is not easy to change in our generic generated DB-layer for all the ~400 tables. matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
El día miércoles, mayo 25, 2022 a las 10:38:24a. m. -0400, Tom Lane escribió: > Ravi Krishna <srkrishna@vivaldi.net> writes: > >> No. PostgreSQL may remove a dead row, but a dead row is by definition > >> no longer visible, so it wouldn't be found by a query. > > > I am wondering whether it is a good practice to use CTID in a where > > clause. > > It's fine if part of your business logic is that you don't want to allow > concurrent updates. In this case, the OP seems to want to prevent rather > than tolerate the concurrent update, so I don't think he needs to revisit > the app's use of CTID. > > If you do need to support concurrent updates, then yeah relying on CTID > is likely to be problematic. Tom, we detected another issue of missing a row by its CTID in another table where fees are stored which one(!) process cumulates in the night. The time window between creating the CURSOR and missing the CTID is only 42 seconds and I can not imagine that any other concurrent process is updating such fee rows at midnight. Could exist any other reason why a row changes its CTID? Full VACUUM is not used either. Thanks matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Analizing our logs in more detail, we got to know that there are cases where, after updating a row with something like: EXEC SQL UPDATE d01buch SET d01gsi =:d01gsi, d01ex =:d01ex, ... d01vldate =:d01vldate WHERE CURRENT OF hc_d01buch; we look-up the same row later again with its old CTID in :rowid; one solution of it could be right after the UPDATE look-up the new CTID with: EXEC SQL SELECT ctid INTO :rowid FROM d01buch WHERE CURRENT OF hc_d01buch; i.e. using the same CURSOR which was used for the UPDATE again for a SELECT for the CTID. Can this work? matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub