'CLUSTER' in one database prevents running it in two others on the same database cluster (PG15.2) - Mailing list pgsql-bugs

From Marco Boeringa
Subject 'CLUSTER' in one database prevents running it in two others on the same database cluster (PG15.2)
Date
Msg-id ed211b60-2b76-4d98-4284-328916068036@boeringa.demon.nl
Whole thread Raw
Responses Fwd: 'CLUSTER' in one database prevents running it in two others on the same database cluster (PG15.2)
Re: 'CLUSTER' in one database prevents running it in two others on the same database cluster (PG15.2)
List pgsql-bugs
Hi all,

I am running into a weird issue I haven't encountered before.

This is PostgreSQL 15.2 (Ubuntu 15.2-1.pgdg22.04+1) running on Ubuntu 
22.04.2 LTS, with PostGIS (POSTGIS="3.3.2 4975da8" [EXTENSION] 
PGSQL="150" GEOS="3.10.2-CAPI-1.16.0" PROJ="8.2.1" LIBXML="2.9.13" 
LIBJSON="0.15" LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)").

Now the PostgreSQL database cluster contains multiple databases. In one 
of these, an OpenStreetMap 'Planet' size database, a CLUSTER operation 
is running against a very large table of buildings (> 400M records). 
This runs fine, and I can see the actual progress in pgAdmin by 
inspecting the 'pg_stat_progress_cluster' view. So far everything seems OK.

However, as part of two secondary Python scripts running that do all 
kinds of batch operations against spatial tables (generalization and 
such) in two other databases on the same database cluster, at the very 
end of the processing, these initialize 'CLUSTER' operations as well. 
Both of these now fail with a:

"missing chunk number 0 for toast value X in pg_toast_Y"

type error.

After these errors, I see the CLUSTER command listed as well in the 
PostgreSQL main log, but the ERROR line mentioned above is listed just 
above the actual CLUSTER statement in the log. No other anomaly / error 
is visible there. The CLUSTER command just fails though on the two 
secondary databases. The time displayed in the log is exactly the same 
for these two log lines one after another, also the number between the 
square brackets listed immediately after the "CET" of the time are the 
same for the same database. The pg_toast_Y table names mentioned in the 
errors differ by the way between the two databases.

If I inspect the table that is mentioned in the CLUSTER command in the 
log, the very first one that is going to be CLUSTERed in the secondary 
databases, the table can be displayed fine in DBeaver, even spatially. 
Checking the geometries with PostGIS's 'ST_IsValid' also shows them to 
be fine, no errors in the geometries.

Everything else seems OK, database cluster is still up&running, all 
tables in all databases can be viewed and accessed in DBeaver and pgAdmin.

In all other rare cases where I have seen the "missing chunk number 0 
for toast value X in pg_toast_Y" type error, the generalization 
processing already failed somewhere half-way, never at the end when 
CLUSTER is initiated. The error therefore seems weird to occur at that 
point.

I have now been able to reproduce this twice, with two tiny database's 
filled with just Liechtenstein data, while the Planet database is still 
in the stage of CLUSTERing the huge building table.

Obviously, it will be interesting to see what happens if the CLUSTER 
operation on the Planet database is finished, but this will still take 
many hours, especially since other large tables need to be CLUSTERed in 
the same database, and I do not want to break off this process right now 
after already gotten this far.

I will report back once it is finished, but does anyone have a clue why 
this might happen?

The system has plenty of free space by the way (4x2TB NVMe), and the 
database runs on a professional workstation with ECC RAM, and used RAM 
also doesn't seem an issue.

Marco




pgsql-bugs by date:

Previous
From: "Howard A. Chou"
Date:
Subject: found a possible bug, modulus of an integer on a partition table appears to be wrong
Next
From: "David G. Johnston"
Date:
Subject: Re: BUG #17822: Some scripts are blocked by Device Guard