Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4? - Mailing list pgsql-bugs

From Marco Boeringa
Subject Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
Date
Msg-id 2dbccee7-9616-997d-cf19-fa23e29e0497@boeringa.demon.nl
Whole thread Raw
In response to Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?  (Marco Boeringa <marco@boeringa.demon.nl>)
List pgsql-bugs
Hi Tom,

Thanks for the response, but please review the exact conditions I 
already mentioned in my previous mails:

- When this issue happens, there is absolutely no other activity going 
on than the three active sessions I mentioned: the autovacuum worker 
with no wait event, and the two "SELECT COUNT(*) FROM <table>" related 
sessions with both a wait event, see the original post. There are no 
other active sessions doing any kind of editing work, no INSERTS, 
UPDATES, DELETES or whatever in this point of the processing that could 
generate WAL.

- Note that this is a custom written geoprocessing workflow with just 
one user on the database, not a public database with hundreds of users 
emitting whatever unknown queries against the database, so I know 
exactly at what point in my processing flow it fails and what goes on then.

- The database affected itself is just a few dozen GBs. While I 
appreciate, if I understand PostgreSQL and the concept of WAL good 
enough (I don't consider myself a PostgreSQL expert), that WAL might 
potentially exceed the size of the database when heavy editing is going, 
890 GB of WAL being written seems like an anomaly given in the context 
of the first points.

- This problem only first reared its head after the issues starting in 
PG14.2 related to SELECT COUNT(*)

So, does your suggested option (b) still make sense in this context?

If not, and we assume this is a bug needing reporting, what exact 
information will you guys need to pinpoint the issue besides the 
information already given? What is the best course of action? I have 
never before reported a bug for PostgreSQL, so I am slightly at loss as 
to what exact information you will need. E.g., besides your suggestion 
of activating 'log_checkpoints', what other suggestions for specific 
logging?

I fully appreciate the main answer will be to submit the typical 
"smallest reproducible case", but that will be extremely hard in this 
particular case, as the geoprocessing workflow processing OpenStreetMap 
data goes through a whole chain of largely auto-generated SQL statements 
(based on settings in the input tool), that are nearly impossible to 
share. Although it is also again questionable if it is actually 
relevant, as the point where it fails only has the mentioned sessions 
and single SELECT COUNT(*) SQL statement going on. The issues is 
intermittent as well, so there wouldn't be guarantees it would reproduce 
on the first try, even if I could share it.

I also appreciate I might need to hire an expert for some remote 
debugging, but before going that way, I appreciate some more insights.

Marco

Op 23-7-2022 om 17:33 schreef Tom Lane:
> Marco Boeringa <marco@boeringa.demon.nl> writes:
>> Ok, I found it out using 'sudo baobab'. It is the:
>> 'var/lib/postgresql/14/main/pg_wal'
>> folder that is filled up with 890 GB of data... causing the file system
>> root to run out of space and Ubuntu opening the disk usage analyzer and
>> a warning as a consequence.
> The most likely explanations for this are
> (a) misconfiguration of WAL archiving, so that the server thinks
> it should keep WAL files till they've been archived, only that
> never happens.
> (b) inability to complete checkpoints for some reason, preventing
> WAL files from being recycled.
>
> It doesn't look like you have wal_archiving on, so (a) *should*
> be ruled out, but you never know.  If there are a ton of "nnn.ready"
> files underneath pg_wal then trouble here would be indicated.
>
> As for (b), you might try enabling log_checkpoints and seeing if
> the log messages give any clue.
>
>             regards, tom lane



pgsql-bugs by date:

Previous
From: Zsolt Ero
Date:
Subject: Re: could not link file in wal restore lines
Next
From: Ajin Cherian
Date:
Subject: Re: Excessive number of replication slots for 12->14 logical replication