Thread: XX000: tuple concurrently deleted during DROP STATISTICS

XX000: tuple concurrently deleted during DROP STATISTICS

From
Justin Pryzby
Date:
I found this in our logs, and reproduced it under v11-v16.

CREATE TABLE t(a int, b int);
INSERT INTO t SELECT generate_series(1,999);
CREATE STATISTICS t_stats ON a,b FROM t;

while :; do psql postgres -qtxc "ANALYZE t"; done &
while :; do psql postgres -qtxc "begin; DROP STATISTICS t_stats"; done &

It's known that concurrent DDL can hit elog().  But in this case,
there's only one DDL operation.

(gdb) bt
#0  0x00000000009442a0 in pg_re_throw ()
#1  0x0000000000943504 in errfinish ()
#2  0x00000000004fcafe in simple_heap_delete ()
#3  0x0000000000639d3f in RemoveStatisticsDataById ()
#4  0x0000000000639d79 in RemoveStatisticsById ()
#5  0x000000000057a428 in deleteObjectsInList ()
#6  0x000000000057a8f0 in performMultipleDeletions ()
#7  0x000000000060b5ed in RemoveObjects ()
#8  0x00000000008099ce in ProcessUtilitySlow.isra.1 ()
#9  0x0000000000808c71 in standard_ProcessUtility ()
#10 0x00007efbfed7a508 in pgss_ProcessUtility () from /usr/pgsql-16/lib/pg_stat_statements.so
#11 0x000000000080745a in PortalRunUtility ()
#12 0x0000000000807579 in PortalRunMulti ()
#13 0x00000000008079dc in PortalRun ()
#14 0x0000000000803927 in exec_simple_query ()
#15 0x0000000000803f28 in PostgresMain ()
#16 0x000000000077bae6 in ServerLoop ()
#17 0x000000000077cbaa in PostmasterMain ()
#18 0x00000000004ba788 in main ()

-- 
Justin



Re: XX000: tuple concurrently deleted during DROP STATISTICS

From
Tomas Vondra
Date:
On 11/8/23 16:10, Justin Pryzby wrote:
> I found this in our logs, and reproduced it under v11-v16.
> 
> CREATE TABLE t(a int, b int);
> INSERT INTO t SELECT generate_series(1,999);
> CREATE STATISTICS t_stats ON a,b FROM t;
> 
> while :; do psql postgres -qtxc "ANALYZE t"; done &
> while :; do psql postgres -qtxc "begin; DROP STATISTICS t_stats"; done &
> 
> It's known that concurrent DDL can hit elog().  But in this case,
> there's only one DDL operation.
> 
AFAICS this happens because store_statext (after ANALYZE builds the new
statistics) does this:

----------------------------
/*
 * Delete the old tuple if it exists, and insert a new one. It's easier
 * than trying to update or insert, based on various conditions.
 */
RemoveStatisticsDataById(statOid, inh);

/* form and insert a new tuple */
stup = heap_form_tuple(RelationGetDescr(pg_stextdata), values, nulls);
CatalogTupleInsert(pg_stextdata, stup);
----------------------------

So it deletes the tuple first (if there's one), and then inserts the new
statistics tuple.

We could update the tuple instead, but that would be more complex (as
the comment explains), and it doesn't actually fix anything because then
simple_heap_delete just fails with TM_Updated instead.

I think the only solution would be to lock the statistics tuple before
running ANALYZE, or something like that. Or maybe we should even lock
the statistics object itself, so that ANALYZE and DROP can't run
concurrently on it?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: XX000: tuple concurrently deleted during DROP STATISTICS

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> On 11/8/23 16:10, Justin Pryzby wrote:
>> I found this in our logs, and reproduced it under v11-v16.
>> 
>> CREATE TABLE t(a int, b int);
>> INSERT INTO t SELECT generate_series(1,999);
>> CREATE STATISTICS t_stats ON a,b FROM t;
>> 
>> while :; do psql postgres -qtxc "ANALYZE t"; done &
>> while :; do psql postgres -qtxc "begin; DROP STATISTICS t_stats"; done &
>> 
>> It's known that concurrent DDL can hit elog().  But in this case,
>> there's only one DDL operation.

> AFAICS this happens because store_statext (after ANALYZE builds the new
> statistics) does this:

Shouldn't DROP STATISTICS be taking a lock on the associated table
that is strong enough to lock out ANALYZE?

            regards, tom lane



Re: XX000: tuple concurrently deleted during DROP STATISTICS

From
Tomas Vondra
Date:
On 11/8/23 16:52, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> On 11/8/23 16:10, Justin Pryzby wrote:
>>> I found this in our logs, and reproduced it under v11-v16.
>>>
>>> CREATE TABLE t(a int, b int);
>>> INSERT INTO t SELECT generate_series(1,999);
>>> CREATE STATISTICS t_stats ON a,b FROM t;
>>>
>>> while :; do psql postgres -qtxc "ANALYZE t"; done &
>>> while :; do psql postgres -qtxc "begin; DROP STATISTICS t_stats"; done &
>>>
>>> It's known that concurrent DDL can hit elog().  But in this case,
>>> there's only one DDL operation.
> 
>> AFAICS this happens because store_statext (after ANALYZE builds the new
>> statistics) does this:
> 
> Shouldn't DROP STATISTICS be taking a lock on the associated table
> that is strong enough to lock out ANALYZE?
> 

Yes, I think that's the correct thing to do. I recall having a
discussion about this with someone while working on the patch, leading
to the current code. But I haven't managed to find that particular bit
in the archives :-(

Anyway, the attached patch should fix this by getting the lock, I think.

- RemoveStatisticsById is what gets called drop DROP STATISTICS (or for
dependencies), so that's where we get the AE lock

- RemoveStatisticsDataById gets called from ANALYZE, so that already
should have a lock (so no need to acquire another one)


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: XX000: tuple concurrently deleted during DROP STATISTICS

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> On 11/8/23 16:52, Tom Lane wrote:
>> Shouldn't DROP STATISTICS be taking a lock on the associated table
>> that is strong enough to lock out ANALYZE?

> Yes, I think that's the correct thing to do. I recall having a
> discussion about this with someone while working on the patch, leading
> to the current code. But I haven't managed to find that particular bit
> in the archives :-(
> Anyway, the attached patch should fix this by getting the lock, I think.

This looks generally correct, but surely we don't need it to be as
strong as AccessExclusiveLock?  There seems no reason to conflict with
ordinary readers/writers of the table.

ANALYZE takes ShareUpdateExclusiveLock, and offhand I think this
command should do the same.

            regards, tom lane



Re: XX000: tuple concurrently deleted during DROP STATISTICS

From
Tomas Vondra
Date:
On 11/8/23 20:58, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> On 11/8/23 16:52, Tom Lane wrote:
>>> Shouldn't DROP STATISTICS be taking a lock on the associated table
>>> that is strong enough to lock out ANALYZE?
> 
>> Yes, I think that's the correct thing to do. I recall having a
>> discussion about this with someone while working on the patch, leading
>> to the current code. But I haven't managed to find that particular bit
>> in the archives :-(
>> Anyway, the attached patch should fix this by getting the lock, I think.
> 
> This looks generally correct, but surely we don't need it to be as
> strong as AccessExclusiveLock?  There seems no reason to conflict with
> ordinary readers/writers of the table.
> 
> ANALYZE takes ShareUpdateExclusiveLock, and offhand I think this
> command should do the same.
> 

Right. I did copy that from DROP TRIGGER code somewhat mindlessly, but
you're right this does not need block readers/writers.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: XX000: tuple concurrently deleted during DROP STATISTICS

From
Tomas Vondra
Date:
I've pushed a cleaned up version of the fix.

I had to make some adjustments in the backbranches, because the way we
store the analyzed statistics evolved, and RemoveStatisticsById() used
to do everything. I ended up introducing RemoveStatisticsDataById() in
the backbranches too, but only as a static function - that makes the
code much cleaner.

regards


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company