Thread: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3?
"SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3?
Hi,

Anyone else encountering this?

I have had issues with SELECT COUNT(*) in PG 14.2 that I think were supposed to be fixed in PG 14.3. However, I have now twice seen PostgreSQL lock up with a SELECT COUNT(*) statement in PG 14.3. Note that in both cases, the code that issued this statement had successfully processed similar statements against other tables multiple times before locking up.

When I look in pgAdmin, I see three "active" sessions that never finish (or are extremely slow...), and I do see some minor "Tuple out" activity sporadically (a few thousand tuples at a time):

- autovacuum: "VACUUM <TABLE_NAME>" with no wait event
- client backend: "SELECT COUNT(*) FROM <TABLE_NAME>" with wait event "IPC:ExecuteGather"
- parallel worker: "SELECT COUNT(*) FROM <TABLE_NAME>" with wait event "LWLock:BufferContent"

When this happens, it is impossible to stop the sessions: when I choose "Cancel Query" or "Terminate Session", pgAdmin reports "success" for the operation, yet the sessions remain visible in the pgAdmin window even after clicking "Refresh", something that definitely does not happen in ordinary situations.

Under "Locks" in pgAdmin, I see an "AccessShareLock" for the "client backend" and "parallel worker", and a "ShareUpdateExclusiveLock" for the "autovacuum" on the table involved.

Additionally, when this happens, the "File system root" of Ubuntu slowly starts filling up until it is 100% full (growth from +/- 630 GB to 1 TB), and PostgreSQL shuts down:

"server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request."

Note that the "logging collector" is set to "off", so no logs are being kept (this is a test system, and I don't care much for the logs). In order to get out of this situation, I have needed to restore from backup.

My system:

- Windows Hyper-V virtualized Ubuntu 20.04.1 LTS
- PG 14.3 (Ubuntu 14.3-1.pgdg20.04+1)
- POSTGIS="3.2.1 5fae8e5" [EXTENSION] PGSQL="140" GEOS="3.8.0-CAPI-1.13.1" PROJ="6.3.1" LIBXML="2.9.10" LIBJSON="0.13.1" LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)"

All of this runs on an HP Z840 system with 256 GB ECC RAM, dual Xeon E5-2680v4, and Windows 10, with 5x 2 TB NVMe configured as a Windows "Storage Space" attached via a PCIe expansion card.

Data: OpenStreetMap data for the entire Planet, +/- 1.4 TB database if processing finishes.

Marco
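P.S. In case it is easier to look at than the pgAdmin dashboard, the session and lock information above corresponds roughly to the following catalog queries (just a sketch; <TABLE_NAME> is a placeholder for the affected table):

-- active sessions and their wait events
SELECT pid, backend_type, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state = 'active';

-- locks held or awaited on the affected table
SELECT l.pid, a.backend_type, l.mode, l.granted
FROM pg_locks l
JOIN pg_stat_activity a USING (pid)
WHERE l.relation = '<TABLE_NAME>'::regclass;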
Re: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3?
FWIW, I have not been able to reproduce this issue in the latest 14.4 release so far, despite quite significant testing. So it seems this issue is finally fixed, although only time will tell for sure, as the issue was intermittent.
Marco
Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
Unfortunately, after more testing, it turns out this issue still persists in PostgreSQL 14.4.
I have now encountered exactly the same problem as described in my original report: one autovacuum session with no wait event that never finishes (or is incredibly slow, still running after many hours when it should complete in minutes given the relatively small dataset and normal operation), and the "client backend" and "parallel worker" stuck on the same wait events listed there, with the same "SELECT COUNT(*)" SQL statement.
One thing to note as well, besides this being workstation-level hardware with ECC RAM, is that I have now also activated data checksums (via 'pg_checksums') on the PostgreSQL cluster and reloaded all data, so all data should now be checksummed. No PostgreSQL error at all is reported via the ODBC connection I use to access and update the database when this happens and PostgreSQL appears stuck on the autovacuum. So I think I can now pretty much exclude a hardware error; considering the checksums, this must be some software issue.
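For what it's worth, as far as I understand it, whether checksums are active on the cluster and whether any failures have been detected can be checked with something like:

SHOW data_checksums;                 -- 'on' after conversion with pg_checksums

SELECT datname, checksum_failures   -- per-database checksum failure counter
FROM pg_stat_database;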
Marco
Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
On Fri, Jul 22, 2022 at 09:56:06AM +0200, Marco Boeringa wrote:
> Unfortunately, after more testing, it turns out this issue still persists
> in PostgreSQL 14.4.

You might want to run these queries and show us the output, in case it suggests a cause:

SELECT version();

-- non-default server settings
SELECT name, current_setting(name), source
FROM pg_settings
WHERE source NOT IN ('default', 'override');

--
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB                               https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson
Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
Hi Bruce,

As requested. Note that this runs on very capable hardware in the form of an HP Z840 workstation with NVMe storage. I have successfully processed the entire OpenStreetMap "Planet" file with this hardware and configuration; it actually fails on a much smaller Geofabrik "Italy" extract:

version
----------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 14.4 (Ubuntu 14.4-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit

|name                            |current_setting                                          |source              |
|--------------------------------|----------------------------------------------------------|--------------------|
|application_name                |DBeaver 22.1.2 - SQLEditor <Script-2.sql>                 |session             |
|autovacuum_work_mem             |8000kB                                                    |configuration file  |
|checkpoint_completion_target    |0.9                                                       |configuration file  |
|checkpoint_timeout              |5min                                                      |configuration file  |
|client_encoding                 |UTF8                                                      |client              |
|cluster_name                    |14/main                                                   |configuration file  |
|cursor_tuple_fraction           |1                                                         |configuration file  |
|DateStyle                       |ISO, DMY                                                  |client              |
|default_statistics_target       |1000                                                      |configuration file  |
|default_text_search_config      |pg_catalog.english                                        |configuration file  |
|dynamic_shared_memory_type      |posix                                                     |configuration file  |
|effective_cache_size            |100GB                                                     |configuration file  |
|effective_io_concurrency        |500                                                       |configuration file  |
|extra_float_digits              |3                                                         |session             |
|jit                             |on                                                        |configuration file  |
|lc_messages                     |en_US.UTF-8                                               |configuration file  |
|lc_monetary                     |nl_NL.UTF-8                                               |configuration file  |
|lc_numeric                      |nl_NL.UTF-8                                               |configuration file  |
|lc_time                         |nl_NL.UTF-8                                               |configuration file  |
|listen_addresses                |localhost,                                                |configuration file  |
|log_destination                 |stderr                                                    |configuration file  |
|log_line_prefix                 |%m [%p] %q%u@%d                                           |configuration file  |
|log_rotation_age                |1h                                                        |configuration file  |
|log_rotation_size               |10000kB                                                   |configuration file  |
|log_statement                   |none                                                      |configuration file  |
|log_timezone                    |Europe/Amsterdam                                          |configuration file  |
|log_truncate_on_rotation        |on                                                        |configuration file  |
|logging_collector               |off                                                       |configuration file  |
|maintenance_io_concurrency      |500                                                       |configuration file  |
|maintenance_work_mem            |8000MB                                                    |configuration file  |
|max_connections                 |1000                                                      |configuration file  |
|max_parallel_maintenance_workers|28                                                        |configuration file  |
|max_parallel_workers            |128                                                       |configuration file  |
|max_parallel_workers_per_gather |56                                                        |configuration file  |
|max_stack_depth                 |2MB                                                       |environment variable|
|max_wal_senders                 |0                                                         |configuration file  |
|max_wal_size                    |25GB                                                      |configuration file  |
|max_worker_processes            |128                                                       |configuration file  |
|min_wal_size                    |1GB                                                       |configuration file  |
|parallel_leader_participation   |off                                                       |configuration file  |
|parallel_setup_cost             |100                                                       |configuration file  |
|parallel_tuple_cost             |0.025                                                     |configuration file  |
|password_encryption             |md5                                                       |configuration file  |
|port                            |5433                                                      |configuration file  |
|random_page_cost                |1                                                         |configuration file  |
|search_path                     |osm, public, "$user"                                      |session             |
|shared_buffers                  |75GB                                                      |configuration file  |
|ssl                             |off                                                       |configuration file  |
|ssl_cert_file                   |server.crt                                                |configuration file  |
|ssl_key_file                    |server.key                                                |configuration file  |
|synchronous_commit              |off                                                       |configuration file  |
|temp_buffers                    |8000MB                                                    |configuration file  |
|temp_tablespaces                |osm_i                                                     |configuration file  |
|TimeZone                        |Europe/Berlin                                             |client              |
|track_activity_query_size       |10000B                                                    |configuration file  |
|wal_compression                 |on                                                        |configuration file  |
|wal_level                       |minimal                                                   |configuration file  |
|work_mem                        |2000MB                                                    |configuration file  |

On 22-7-2022 at 21:07, Bruce Momjian wrote:
> You might want to run these queries and show us the output, in case it
> suggests a cause:
Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
Unfortunately, as I have seen before in the same situation, this has now again taken my PG 14.4 server down by filling up my entire Ubuntu file system root; the PostgreSQL server process shut down the connection automatically as a consequence.

It is not clear to me what is actually being written out that causes this. I am not really an Ubuntu / Linux expert. Any suggestions for commands to quickly find out where PostgreSQL may have dumped hundreds of GB of data in the file system root? I had over 700 GB of free space there.

On 22-7-2022 at 21:07, Bruce Momjian wrote:
> You might want to run these queries and show us the output, in case it
> suggests a cause:
Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
On 22-7-2022 at 23:08, Marco Boeringa wrote:
> Any suggestions for commands to quickly find out where PostgreSQL may have
> dumped hundreds of GB of data in the file system root?

Ok, I found it out using 'sudo baobab'. It is the '/var/lib/postgresql/14/main/pg_wal' folder that has filled up with 890 GB of data, causing the file system root to run out of space and, as a consequence, Ubuntu opening the disk usage analyzer with a warning.

I have never seen this happen under normal operation, when this issue does not pop up. So this is something to do with WAL.
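For reference, roughly the same numbers can apparently also be read from within PostgreSQL itself; a sketch of what I mean (needs superuser or pg_monitor rights, if I understand correctly):

-- total size of the WAL directory
SELECT count(*) AS wal_segments,
       pg_size_pretty(sum(size)) AS wal_total
FROM pg_ls_waldir();

-- size of each database, largest first
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS db_size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;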
Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
Marco Boeringa <marco@boeringa.demon.nl> writes:
> Ok, I found it out using 'sudo baobab'. It is the
> '/var/lib/postgresql/14/main/pg_wal' folder that has filled up with
> 890 GB of data, causing the file system root to run out of space.

The most likely explanations for this are

(a) misconfiguration of WAL archiving, so that the server thinks it should keep WAL files till they've been archived, only that never happens.

(b) inability to complete checkpoints for some reason, preventing WAL files from being recycled.

It doesn't look like you have WAL archiving on, so (a) *should* be ruled out, but you never know. If there are a ton of "nnn.ready" files underneath pg_wal then trouble here would be indicated.

As for (b), you might try enabling log_checkpoints and seeing if the log messages give any clue.

regards, tom lane
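P.S. Roughly, both checks in SQL (an untested sketch; the .ready files live under pg_wal/archive_status):

-- (a) archiver status and any pending .ready files
SELECT archived_count, failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;

SELECT count(*) AS ready_files
FROM pg_ls_archive_statusdir()
WHERE name LIKE '%.ready';

-- (b) turn on checkpoint logging; reloadable, no restart needed
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();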
Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
Hi Tom,

Thanks for the response, but please review the exact conditions I already mentioned in my previous mails:

- When this issue happens, there is absolutely no other activity going on than the three active sessions I mentioned: the autovacuum worker with no wait event, and the two "SELECT COUNT(*) FROM <table>" related sessions, both with a wait event (see the original post). There are no other active sessions doing any kind of editing work: no INSERTs, UPDATEs, DELETEs or whatever at this point in the processing that could generate WAL.

- Note that this is a custom-written geoprocessing workflow with just one user on the database, not a public database with hundreds of users emitting unknown queries against the database, so I know exactly at what point in my processing flow it fails and what goes on then.

- The affected database itself is just a few dozen GB. While I appreciate, if I understand PostgreSQL and the concept of WAL well enough (I don't consider myself a PostgreSQL expert), that WAL might potentially exceed the size of the database when heavy editing is going on, 890 GB of WAL being written seems like an anomaly in the context of the first points.

- This problem only first reared its head after the issues related to SELECT COUNT(*) that started in PG 14.2.

So, does your suggested option (b) still make sense in this context?

If not, and we assume this is a bug needing reporting, what exact information will you need to pinpoint the issue besides the information already given? What is the best course of action? I have never reported a bug for PostgreSQL before, so I am slightly at a loss as to what exact information you will need. E.g., besides your suggestion of activating 'log_checkpoints', what other suggestions for specific logging?

I fully appreciate that the main answer will be to submit the typical "smallest reproducible case", but that will be extremely hard in this particular case, as the geoprocessing workflow processing OpenStreetMap data goes through a whole chain of largely auto-generated SQL statements (based on settings in the input tool) that are nearly impossible to share. It is also questionable whether that is actually relevant, as the point where it fails only has the mentioned sessions and the single SELECT COUNT(*) SQL statement going on. The issue is intermittent as well, so there would be no guarantee it would reproduce on the first try, even if I could share it.

I also appreciate I might need to hire an expert for some remote debugging, but before going that way, I would appreciate some more insights.

Marco

On 23-7-2022 at 17:33, Tom Lane wrote:
> The most likely explanations for this are
> (a) misconfiguration of WAL archiving, so that the server thinks
> it should keep WAL files till they've been archived, only that
> never happens.
> (b) inability to complete checkpoints for some reason, preventing
> WAL files from being recycled.
Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
To extend on this, two interesting questions that come to mind are:

- Does running SELECT COUNT(*) create WAL?

- Is it conceivable that there is a kind of cross-database vulnerability *within one and the same PostgreSQL cluster*, where an issue in one database causes the WAL of another database to no longer be successfully written out during checkpoints? I have never seen processing errors that emitted true PostgreSQL errors (with error numbers) cause issues like that and affect a second database in the same cluster, but since no error is generated here, and there might be some uncaught error, I wonder.

I am especially asking the second question because, although I wrote there is no edit activity going on that could generate WAL in the affected small database, which is true, there *was* processing of Planet-sized data going on in a second database in the same cluster. That certainly *is* capable of generating 890 GB of WAL if nothing is cleaned up during checkpoints due to checkpoints failing.

Marco

On 24-7-2022 at 09:55, Marco Boeringa wrote:
> So, does your suggested option (b) still make sense in this context?
Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
On 7/25/22 08:04, Marco Boeringa wrote:
> To extend on this, two interesting questions that come to mind are:
>
> - Does running SELECT COUNT(*) create WAL?

Yes. An obvious example is updating the visibility map (which is always logged to WAL) or hint bits (which may be WAL logged). I'd also bet we may generate WAL for indexes, e.g. to kill deleted tuples.

> - Is it conceivable that there is a kind of cross-database vulnerability
> *within one and the same PostgreSQL cluster*, where an issue in one
> database causes the WAL of another database to no longer be successfully
> written out during checkpoints?

WAL is a resource shared by all the databases in the cluster, so if that gets broken it's broken for everyone.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
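P.S. A rough way to see that effect directly, on an otherwise quiet system, is to note the WAL insert position before and after the COUNT(*) and diff the two (a sketch; the table name is only a placeholder, and on a busy cluster the diff also includes WAL from concurrent sessions):

SELECT pg_current_wal_insert_lsn();     -- note the result, e.g. '2595/7FB578D8'

SELECT count(*) FROM some_table;        -- the read-only query under suspicion

-- substitute the LSN noted above
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_insert_lsn(), '2595/7FB578D8'::pg_lsn)
       ) AS wal_written_since;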
Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
Thanks Tomas for the explanation.

This all still points to a bug then, with WAL being affected, and possibly the processing of 'Planet' data in the secondary database (which did not itself show an issue) being responsible for the enormous amount of WAL being written due to checkpoint failures. The latter is still speculation, though; there may be another, as yet undetermined, cause.

To be absolutely clear about this: I have never seen so much WAL being generated by Planet processing in ordinary situations, when this issue doesn't rear its head. In fact, I am currently processing Planet again, and the total size of files on the "file system root" has not been over about 250 GB, with still 750 GB of free space. This is what I see under ordinary circumstances. So I am also pretty much convinced there is no misconfiguration of the WAL settings in my 'postgresql.conf' file.

Any suggestions for a further course of action, or do you have at least enough info for now to take a first stab at finding out what might be wrong?

Marco

On 25-7-2022 at 15:39, Tomas Vondra wrote:
> WAL is a resource shared by all the databases in the cluster, so if that
> gets broken it's broken for everyone.
Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
On 25-7-2022 at 15:39, Tomas Vondra wrote:
> Yes. An obvious example is updating the visibility map (which is always
> logged to WAL) or hint bits (which may be WAL logged). I'd also bet we
> may generate WAL for indexes, e.g. to kill deleted tuples.

Now it's getting interesting. I had another, different failure mode now. The server closed unexpectedly, but without my "file system root" being entirely clogged like last time, and PostgreSQL in fact recovered successfully after the failure.

I had another look at my system and opened 'postgresql-14-main.log', see the log below. Please note that I did not attempt to manually stop an autovacuum session, so that entry must be some automatically generated message whenever PostgreSQL is unable to start autovacuum?

Anyway, the interesting entry is the third one, the PANIC with a reference to the 'visibilitymap' that Tomas also referred to in relation to SELECT COUNT(*), so that at least potentially ties this second, different failure to the previous issue (although it may still be an unrelated one). To be clear, as can also be seen from the EXECUTE statements, there is no SELECT COUNT(*) going on here (the referenced 'arc_update' prepared statement doesn't use it).

Any remarks related to this log?

Marco

2022-07-27 02:59:56.892 CEST [281070] ERROR: canceling autovacuum task
2022-07-27 02:59:56.892 CEST [281070] CONTEXT: while scanning block 46845 of relation "pg_toast.pg_toast_359621860"
    automatic vacuum of table "gis.pg_toast.pg_toast_359621860"
2022-07-27 03:02:49.721 CEST [281124] osm@gis PANIC: wrong buffer passed to visibilitymap_clear
2022-07-27 03:02:49.721 CEST [281124] osm@gis STATEMENT: EXECUTE arc_update(11995223);EXECUTE arc_update(177815656);EXECUTE arc_update(443174623);EXECUTE arc_update(443213826);EXECUTE arc_update(12673263);EXECUTE arc_update(550352967);EXECUTE arc_update(11551029);EXECUTE arc_update(153847654);EXECUTE arc_update(329349);EXECUTE arc_update(237168019);EXECUTE arc_update(11552629);EXECUTE arc_update(381319471);EXECUTE arc_update(171879263);EXECUTE arc_update(11555372);EXECUTE arc_update(385793425);EXECUTE arc_update(362413550);EXECUTE arc_update(363999384);EXECUTE arc_update(12622220);EXECUTE arc_update(529712698);EXECUTE arc_update(357812660);EXECUTE arc_update(145882199);EXECUTE arc_update(499576651);EXECUTE arc_update(123173842);EXECUTE arc_update(345906810);EXECUTE arc_update(11558506);EXECUTE arc_update(602477052);EXECUTE arc_update(362390402);EXECUTE arc_update(488370921);EXECUTE arc_update(148592092);EXECUTE arc_update(590614967);EXECUTE arc_update(474316584);EXECUTE arc_update(598052676);EXECUTE arc_update(345447488);EXECUTE arc_update(488880839);EXECUTE arc_update(323713375);EXECUTE arc_update(362377910);EXECUTE arc_update(306066948);EXECUTE arc_update(688704);EXECUTE arc_update(600560187);EXECUTE arc_update(213569619);EXECUTE arc_update(180294245);EXECUTE arc_update(343433621);EXECUTE arc_update(199994366);EXECUTE arc_update(314610973);EXECUTE arc_update(353722823);EXECUTE arc_update(222912592);EXECUTE arc_update(305662591);EXECUTE arc_update(361771102);EXECUTE arc_update(472591085);EXECUTE arc_update(146245966);
2022-07-27 04:07:14.422 CEST [1563] LOG: server process (PID 281124) was terminated by signal 6: Aborted
2022-07-27 04:07:14.422 CEST [1563] DETAIL: Failed process was running: EXECUTE arc_update(11995223);EXECUTE arc_update(177815656);EXECUTE arc_update(443174623);EXECUTE arc_update(443213826);EXECUTE arc_update(12673263);EXECUTE arc_update(550352967);EXECUTE arc_update(11551029);EXECUTE arc_update(153847654);EXECUTE arc_update(329349);EXECUTE arc_update(237168019);EXECUTE arc_update(11552629);EXECUTE arc_update(381319471);EXECUTE arc_update(171879263);EXECUTE arc_update(11555372);EXECUTE arc_update(385793425);EXECUTE arc_update(362413550);EXECUTE arc_update(363999384);EXECUTE arc_update(12622220);EXECUTE arc_update(529712698);EXECUTE arc_update(357812660);EXECUTE arc_update(145882199);EXECUTE arc_update(499576651);EXECUTE arc_update(123173842);EXECUTE arc_update(345906810);EXECUTE arc_update(11558506);EXECUTE arc_update(602477052);EXECUTE arc_update(362390402);EXECUTE arc_update(488370921);EXECUTE arc_update(148592092);EXECUTE arc_update(590614967);EXECUTE arc_update(474316584);EXECUTE arc_update(598052676);EXECUTE arc_update(345447488);EXECUTE arc_update(488880839);EXECUTE arc_u
2022-07-27 04:07:14.422 CEST [1563] LOG: terminating any other active server processes
2022-07-27 04:07:18.978 CEST [1563] LOG: all server processes terminated; reinitializing
2022-07-27 04:07:34.563 CEST [281625] LOG: database system was interrupted; last known up at 2022-07-27 03:02:59 CEST
2022-07-27 04:07:39.646 CEST [281625] LOG: database system was not properly shut down; automatic recovery in progress
2022-07-27 04:07:39.656 CEST [281625] LOG: redo starts at 2595/7FB578D8
2022-07-27 04:09:43.329 CEST [281625] LOG: invalid record length at 259B/9375C7A0: wanted 24, got 0
2022-07-27 04:09:43.329 CEST [281625] LOG: redo done at 259B/9375C6E0 system usage: CPU: user: 71.84 s, system: 36.82 s, elapsed: 123.67 s
2022-07-27 04:10:23.425 CEST [1563] LOG: database system is ready to accept connections
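P.S. Would turning on something like the following be enough to capture useful detail the next time this happens? Just a sketch based on the suggestions in this thread; as far as I understand, the logging collector change needs a server restart:

ALTER SYSTEM SET logging_collector = on;            -- takes effect after a restart
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET log_autovacuum_min_duration = 0;   -- log every autovacuum run
SELECT pg_reload_conf();                            -- applies the reloadable settings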