Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4? - Mailing list pgsql-bugs

From Marco Boeringa
Subject Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
Date
Msg-id f6175d33-0e5d-480a-6e08-560ec4d25a0e@boeringa.demon.nl
Whole thread Raw
In response to Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?  (Marco Boeringa <marco@boeringa.demon.nl>)
Responses Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-bugs
To extend on this, two interesting questions that come to mind are:

- Does running SELECT COUNT(*) create WAL?

- Is it potentially conceivable that there is a kind of cross-database 
vulnerability *within one and the same PostgreSQL cluster*, where an 
issue in one database causes the WAL in another database to no longer 
successfully be written to disk during checkpoints? I have never seen 
processing errors where PostgreSQL emitted true PostgreSQL errors with 
error numbers cause issues like that and affect a second database in the 
same cluster, but since no error is generated here, and there might be 
some uncatched error, I wonder?

I am especially asking the second question since, although I wrote there 
is no edit activity going on potentially generating WAL in the affected 
small database, which is true, there *was* processing on Planet sized 
data going on in a second database in the same cluster. That certainly 
*is* capable of generating 890GB of WAL if nothing is cleaned up during 
checkpoints due to checkpoints failing.

Marco

Op 24-7-2022 om 09:55 schreef Marco Boeringa:
> Hi Tom,
>
> Thanks for the response, but please review the exact conditions I 
> already mentioned in my previous mails:
>
> - When this issue happens, there is absolutely no other activity going 
> on than the three active sessions I mentioned: the autovacuum worker 
> with no wait event, and the two "SELECT COUNT(*) FROM <table>" related 
> sessions with both a wait event, see the original post. There are no 
> other active sessions doing any kind of editing work, no INSERTS, 
> UPDATES, DELETES or whatever in this point of the processing that 
> could generate WAL.
>
> - Note that this is a custom written geoprocessing workflow with just 
> one user on the database, not a public database with hundreds of users 
> emitting whatever unknown queries against the database, so I know 
> exactly at what point in my processing flow it fails and what goes on 
> then.
>
> - The database affected itself is just a few dozen GBs. While I 
> appreciate, if I understand PostgreSQL and the concept of WAL good 
> enough (I don't consider myself a PostgreSQL expert), that WAL might 
> potentially exceed the size of the database when heavy editing is 
> going, 890 GB of WAL being written seems like an anomaly given in the 
> context of the first points.
>
> - This problem only first reared its head after the issues starting in 
> PG14.2 related to SELECT COUNT(*)
>
> So, does your suggested option (b) still make sense in this context?
>
> If not, and we assume this is a bug needing reporting, what exact 
> information will you guys need to pinpoint the issue besides the 
> information already given? What is the best course of action? I have 
> never before reported a bug for PostgreSQL, so I am slightly at loss 
> as to what exact information you will need. E.g., besides your 
> suggestion of activating 'log_checkpoints', what other suggestions for 
> specific logging?
>
> I fully appreciate the main answer will be to submit the typical 
> "smallest reproducible case", but that will be extremely hard in this 
> particular case, as the geoprocessing workflow processing 
> OpenStreetMap data goes through a whole chain of largely 
> auto-generated SQL statements (based on settings in the input tool), 
> that are nearly impossible to share. Although it is also again 
> questionable if it is actually relevant, as the point where it fails 
> only has the mentioned sessions and single SELECT COUNT(*) SQL 
> statement going on. The issues is intermittent as well, so there 
> wouldn't be guarantees it would reproduce on the first try, even if I 
> could share it.
>
> I also appreciate I might need to hire an expert for some remote 
> debugging, but before going that way, I appreciate some more insights.
>
> Marco
>
> Op 23-7-2022 om 17:33 schreef Tom Lane:
>> Marco Boeringa <marco@boeringa.demon.nl> writes:
>>> Ok, I found it out using 'sudo baobab'. It is the:
>>> 'var/lib/postgresql/14/main/pg_wal'
>>> folder that is filled up with 890 GB of data... causing the file system
>>> root to run out of space and Ubuntu opening the disk usage analyzer and
>>> a warning as a consequence.
>> The most likely explanations for this are
>> (a) misconfiguration of WAL archiving, so that the server thinks
>> it should keep WAL files till they've been archived, only that
>> never happens.
>> (b) inability to complete checkpoints for some reason, preventing
>> WAL files from being recycled.
>>
>> It doesn't look like you have wal_archiving on, so (a) *should*
>> be ruled out, but you never know.  If there are a ton of "nnn.ready"
>> files underneath pg_wal then trouble here would be indicated.
>>
>> As for (b), you might try enabling log_checkpoints and seeing if
>> the log messages give any clue.
>>
>>             regards, tom lane



pgsql-bugs by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: could not link file in wal restore lines