Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4? - Mailing list pgsql-bugs
From | Marco Boeringa |
---|---|
Subject | Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4? |
Date | |
Msg-id | f6175d33-0e5d-480a-6e08-560ec4d25a0e@boeringa.demon.nl Whole thread Raw |
In response to | Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4? (Marco Boeringa <marco@boeringa.demon.nl>) |
Responses |
Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
|
List | pgsql-bugs |
To extend on this, two interesting questions that come to mind are: - Does running SELECT COUNT(*) create WAL? - Is it potentially conceivable that there is a kind of cross-database vulnerability *within one and the same PostgreSQL cluster*, where an issue in one database causes the WAL in another database to no longer successfully be written to disk during checkpoints? I have never seen processing errors where PostgreSQL emitted true PostgreSQL errors with error numbers cause issues like that and affect a second database in the same cluster, but since no error is generated here, and there might be some uncatched error, I wonder? I am especially asking the second question since, although I wrote there is no edit activity going on potentially generating WAL in the affected small database, which is true, there *was* processing on Planet sized data going on in a second database in the same cluster. That certainly *is* capable of generating 890GB of WAL if nothing is cleaned up during checkpoints due to checkpoints failing. Marco Op 24-7-2022 om 09:55 schreef Marco Boeringa: > Hi Tom, > > Thanks for the response, but please review the exact conditions I > already mentioned in my previous mails: > > - When this issue happens, there is absolutely no other activity going > on than the three active sessions I mentioned: the autovacuum worker > with no wait event, and the two "SELECT COUNT(*) FROM <table>" related > sessions with both a wait event, see the original post. There are no > other active sessions doing any kind of editing work, no INSERTS, > UPDATES, DELETES or whatever in this point of the processing that > could generate WAL. > > - Note that this is a custom written geoprocessing workflow with just > one user on the database, not a public database with hundreds of users > emitting whatever unknown queries against the database, so I know > exactly at what point in my processing flow it fails and what goes on > then. > > - The database affected itself is just a few dozen GBs. While I > appreciate, if I understand PostgreSQL and the concept of WAL good > enough (I don't consider myself a PostgreSQL expert), that WAL might > potentially exceed the size of the database when heavy editing is > going, 890 GB of WAL being written seems like an anomaly given in the > context of the first points. > > - This problem only first reared its head after the issues starting in > PG14.2 related to SELECT COUNT(*) > > So, does your suggested option (b) still make sense in this context? > > If not, and we assume this is a bug needing reporting, what exact > information will you guys need to pinpoint the issue besides the > information already given? What is the best course of action? I have > never before reported a bug for PostgreSQL, so I am slightly at loss > as to what exact information you will need. E.g., besides your > suggestion of activating 'log_checkpoints', what other suggestions for > specific logging? > > I fully appreciate the main answer will be to submit the typical > "smallest reproducible case", but that will be extremely hard in this > particular case, as the geoprocessing workflow processing > OpenStreetMap data goes through a whole chain of largely > auto-generated SQL statements (based on settings in the input tool), > that are nearly impossible to share. Although it is also again > questionable if it is actually relevant, as the point where it fails > only has the mentioned sessions and single SELECT COUNT(*) SQL > statement going on. The issues is intermittent as well, so there > wouldn't be guarantees it would reproduce on the first try, even if I > could share it. > > I also appreciate I might need to hire an expert for some remote > debugging, but before going that way, I appreciate some more insights. > > Marco > > Op 23-7-2022 om 17:33 schreef Tom Lane: >> Marco Boeringa <marco@boeringa.demon.nl> writes: >>> Ok, I found it out using 'sudo baobab'. It is the: >>> 'var/lib/postgresql/14/main/pg_wal' >>> folder that is filled up with 890 GB of data... causing the file system >>> root to run out of space and Ubuntu opening the disk usage analyzer and >>> a warning as a consequence. >> The most likely explanations for this are >> (a) misconfiguration of WAL archiving, so that the server thinks >> it should keep WAL files till they've been archived, only that >> never happens. >> (b) inability to complete checkpoints for some reason, preventing >> WAL files from being recycled. >> >> It doesn't look like you have wal_archiving on, so (a) *should* >> be ruled out, but you never know. If there are a ton of "nnn.ready" >> files underneath pg_wal then trouble here would be indicated. >> >> As for (b), you might try enabling log_checkpoints and seeing if >> the log messages give any clue. >> >> regards, tom lane
pgsql-bugs by date: