Thread: temp_file_limit?
Hi list,
I have a misbehaving query which uses all available disk space and then terminates with a "cannot write block" error. To prevent other processes from running into trouble I've set the following:
temp_file_limit = 100GB
The query does parallelize and uses one parallel worker while executing, but it does not abort when the temp file limit is reached:
345G pgsql_tmp
It does abort way later, after using around 300+ GB:
[53400] ERROR: temporary file size exceeds temp_file_limit (104857600kB) Where: parallel worker
The comment in the file states that this is a per-session parameter, so what is going wrong here?
I am using Postgres 14 on Ubuntu.
Regards,
Frits
On Sun, Dec 18, 2022 at 12:48:03PM +0100, Frits Jalvingh wrote:
> Hi list,
> I have a misbehaving query which uses all available disk space and then terminates with a "cannot write block" error. To prevent other processes from running into trouble I've set the following:
> temp_file_limit = 100GB
> The comment in the file states that this is a per-session parameter, so what is going wrong here?

Do you mean the comment in postgresql.conf ? Commit d1f822e58 changed it to say that temp_file_limit is actually per-process and not per-session.

Could you send the query plan, preferably "explain analyze" (if the query finishes sometimes) ?

log_temp_files may be helpful here.

> The query does parallelize and uses one parallel worker while executing, but it does not abort when the temp file limit is reached:
> 345G pgsql_tmp
> It does abort way later, after using around 300+ GB:
> [53400] ERROR: temporary file size exceeds temp_file_limit (104857600kB) Where: parallel worker

Are you sure the 345G are from only one instance of the query? Or is it running multiple times, or along with other queries writing 100GB of tempfiles?

It seems possible that it sometimes runs with more than one parallel worker.

Also, are there old/stray tempfiles there which need to be cleaned up?

--
Justin
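For readers trying the suggestions above, the session-level setup looks roughly like this (a minimal sketch; both settings require superuser, and the values are only illustrative):

SET log_temp_files = 0;        -- log every temporary file the session creates, regardless of size
SET temp_file_limit = '100GB'; -- the cap under discussion (per process, as this thread works out)
SHOW temp_file_limit;          -- confirm what is in effect for this session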
Hi Justin, thanks for your help!
Simple things first:
- I am running a single query on a developer machine. Nothing else uses the database at that point.
- The database runs on a disk that has 473GB in use and 1.3T still free. I am watching the increase in size used (watch df -hl /d2).
- If I remove the temp_file_limit the query will run until it has used the 1.3TB that was free, then it dies.
- when it runs I see two PG processes active: a main and a worker process for that main.
I hope this answers some of the questions: yes, the query is the one using the tempspace; it is the only one running; it uses only one parallel worker.
Just to be clear: my real question is: why is temp_file_limit not working at the specified size? Because this is my real problem: when a query is dying like this it will also kill other queries because these are also running out of space. Even when the limit is per-process it should not have exceeded 200GB imo. BTW, if that limit is really per process instead of per session/query then that is a Very Bad Thing(tm), because this makes the limit effectively worthless - if a query can spawn 8 parallel processes then you can suddenly, without any form of control, again fill up that disk.
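A rough way to put a number on that concern, assuming the limit really is per process as discussed below, is to multiply the limit by the leader plus the allowed workers; a small sketch:

-- Worst case per query ~ temp_file_limit * (1 leader + workers per gather)
SELECT current_setting('temp_file_limit')                 AS per_process_limit,
       current_setting('max_parallel_workers_per_gather') AS workers_per_gather;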
I'm not really asking for a solution to the bad performance, but hints are always welcome so I'll include the requested info below:
With the failing plan the query never finishes; it just uses 1.3TB of space, then dies.
This also means I cannot explain analyze as this does not produce output when the query dies. This is a pretty terrible bug in my eyes, because you cannot get the info when it's most needed. If I ever have time left to work on Postgres' code this will be the first thing to fix 8-/
Anyway. The plan that fails badly is this one:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=37360.85..37360.86 rows=1 width=42)
-> Sort (cost=37360.85..37360.85 rows=1 width=42)
Sort Key: (COALESCE(tijd.tijdkey, 'Unknown'::character varying)), s_h_eenheid_ssm.identificatie
-> Hash Join (cost=34899.49..37360.84 rows=1 width=42)
Hash Cond: ((ve03678.calender_id)::text = (COALESCE(tijd.tijdkey, 'Unknown'::character varying))::text)
Join Filter: ((s_h_eenheid_ssm.dv_start_dts <= tijd.einddatum) AND (s_h_eenheid_ssm.dv_end_dts > tijd.einddatum) AND (l_eenheid_sturingslabel_ssm_pe.dv_start_dts <= tijd.einddatum) AND (l_eenheid_sturingslabel_ssm_pe.dv_end_dts > tijd.einddatum) AND (sturingslabel_pe.dv_start_dts <= tijd.einddatum) AND (sturingslabel_pe.dv_end_dts > tijd.einddatum))
-> Gather (cost=34897.66..37358.98 rows=1 width=65)
Workers Planned: 1
-> Parallel Hash Join (cost=33897.66..36358.88 rows=1 width=65)
Hash Cond: ((s_h_eenheid_ssm.id_h_eenheid = l_eenheid_sturingslabel_ssm_pe.id_h_eenheid) AND (COALESCE(s_h_eenheid_ssm.id_s, '-1'::integer) = ve03678.eenheid_id))
-> Parallel Seq Scan on s_h_eenheid_ssm (cost=0.00..2326.55 rows=17955 width=34)
-> Parallel Hash (cost=33896.02..33896.02 rows=109 width=47)
-> Parallel Hash Join (cost=18850.80..33896.02 rows=109 width=47)
Hash Cond: (ve03678.ve03678 = sturingslabel_pe.datum)
-> Parallel Seq Scan on ve0367801 ve03678 (cost=0.00..12584.92 rows=655792 width=15)
-> Parallel Hash (cost=18850.78..18850.78 rows=1 width=40)
-> Parallel Hash Join (cost=15458.27..18850.78 rows=1 width=40)
Hash Cond: (l_eenheid_sturingslabel_ssm_pe.id_h_sturingslabel = sturingslabel_pe.id_h_sturingslabel)
-> Parallel Seq Scan on l_eenheid_sturingslabel_ssm l_eenheid_sturingslabel_ssm_pe (cost=0.00..2963.36 rows=114436 width=24)
-> Parallel Hash (cost=15458.26..15458.26 rows=1 width=24)
-> Parallel Seq Scan on s_h_sturingslabel_ssm sturingslabel_pe (cost=0.00..15458.26 rows=1 width=24)
Filter: ((soort = 'MSL'::text) AND (code = 'DAE'::text))
-> Hash (cost=1.37..1.37 rows=37 width=11)
-> Seq Scan on tijd (cost=0.00..1.37 rows=37 width=11)
(24 rows)
By itself I'm used to bad query performance in Postgresql; our application only does bulk queries and Postgres quite often makes terrible plans for those, but with enable_nestloop=false always set most of them at least execute. The remaining failing queries are almost 100% caused by bad join sequences; I plan to work around those by forcing the join order from our application. For instance, the exact same query above can also generate the following plan (this one was created by manually setting join_collapse_limit = 1, but fast variants also occur quite often when disabling parallelism):
Unique (cost=70070.71..70070.72 rows=1 width=42) (actual time=4566.379..4766.112 rows=1058629 loops=1)
-> Sort (cost=70070.71..70070.71 rows=1 width=42) (actual time=4566.377..4618.021 rows=1058629 loops=1)
Sort Key: (COALESCE(tijd.tijdkey, 'Unknown'::character varying)), s_h_eenheid_ssm.identificatie
Sort Method: quicksort Memory: 115317kB
-> Gather (cost=50108.01..70070.70 rows=1 width=42) (actual time=1297.620..1651.003 rows=1058629 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Hash Join (cost=49108.01..69070.60 rows=1 width=42) (actual time=1294.655..1604.524 rows=529314 loops=2)
Hash Cond: (((ve03678.calender_id)::text = (COALESCE(tijd.tijdkey, 'Unknown'::character varying))::text) AND (ve03678.eenheid_id = COALESCE(s_h_eenheid_ssm.id_s, '-1'::integer)) AND (ve03678.ve03678 = sturingslabel_pe.datum))
-> Parallel Seq Scan on ve0367801 ve03678 (cost=0.00..12584.92 rows=655792 width=15) (actual time=0.004..27.971 rows=557423 loops=2)
-> Parallel Hash (cost=49107.99..49107.99 rows=1 width=25) (actual time=1294.347..1294.352 rows=542512 loops=2)
Buckets: 2097152 (originally 1024) Batches: 1 (originally 1) Memory Usage: 100728kB
-> Parallel Hash Join (cost=39244.15..49107.99 rows=1 width=25) (actual time=390.276..1089.770 rows=542512 loops=2)
Hash Cond: (l_eenheid_sturingslabel_ssm_pe.id_h_sturingslabel = sturingslabel_pe.id_h_sturingslabel)
Join Filter: ((sturingslabel_pe.dv_start_dts <= tijd.einddatum) AND (sturingslabel_pe.dv_end_dts > tijd.einddatum))
-> Hash Join (cost=23785.87..33486.40 rows=43548 width=29) (actual time=342.982..791.469 rows=3367092 loops=2)
Hash Cond: (l_eenheid_sturingslabel_ssm_pe.id_h_eenheid = s_h_eenheid_ssm.id_h_eenheid)
Join Filter: ((l_eenheid_sturingslabel_ssm_pe.dv_start_dts <= tijd.einddatum) AND (l_eenheid_sturingslabel_ssm_pe.dv_end_dts > tijd.einddatum))
-> Parallel Seq Scan on l_eenheid_sturingslabel_ssm l_eenheid_sturingslabel_ssm_pe (cost=0.00..2963.36 rows=114436 width=24) (actual time=0.002..5.818 rows=97271 loops=2)
-> Hash (cost=22217.33..22217.33 rows=125483 width=29) (actual time=342.703..342.705 rows=1129351 loops=2)
Buckets: 2097152 (originally 131072) Batches: 1 (originally 1) Memory Usage: 86969kB
-> Nested Loop (cost=0.00..22217.33 rows=125483 width=29) (actual time=0.039..175.471 rows=1129351 loops=2)
Join Filter: ((s_h_eenheid_ssm.dv_start_dts <= tijd.einddatum) AND (s_h_eenheid_ssm.dv_end_dts > tijd.einddatum))
-> Seq Scan on s_h_eenheid_ssm (cost=0.00..2452.23 rows=30523 width=34) (actual time=0.022..4.488 rows=30523 loops=2)
-> Materialize (cost=0.00..1.56 rows=37 width=11) (actual time=0.000..0.001 rows=37 loops=61046)
-> Seq Scan on tijd (cost=0.00..1.37 rows=37 width=11) (actual time=0.009..0.013 rows=37 loops=2)
-> Parallel Hash (cost=15458.26..15458.26 rows=1 width=24) (actual time=47.265..47.265 rows=69 loops=2)
Buckets: 1024 Batches: 1 Memory Usage: 72kB
-> Parallel Seq Scan on s_h_sturingslabel_ssm sturingslabel_pe (cost=0.00..15458.26 rows=1 width=24) (actual time=4.478..47.241 rows=69 loops=2)
Filter: ((soort = 'MSL'::text) AND (code = 'DAE'::text))
Rows Removed by Filter: 233072
Planning Time: 0.623 ms
Execution Time: 5144.937 ms
(33 rows)
https://explain.depesz.com/s/CKhC
Same query, now runs in 5 seconds.
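For reference, the fast variant above came from session settings along these lines (a sketch of what is described in this thread, not an exact recipe):

SET join_collapse_limit = 1;  -- take the join order as written in the query
SET enable_nestloop = on;     -- the fast plan still needs its single nested loop
-- then run EXPLAIN (ANALYZE) on the same query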
This query is behaving quite special in one of our customers' databases; it runs about 80% of the time in 8 to 16 seconds, about 15% of the time it takes about 2 hours, and the remaining 5% it dies with a disk space issue...
Regards,
Frits
Frits Jalvingh <jal@etc.to> writes:
> Just to be clear: my real question is: why is temp_file_limit not working at the specified size?

I've not looked at that code lately, but I strongly suspect that it's implemented in such a way that it's a per-process limit, not a per-session limit. So each parallel worker could use up that much space.

It's also possible that you've found an actual bug, but without a reproducer case nobody's going to take that possibility too seriously. We're unlikely to accept "the limit should work across multiple processes" as a valid bug though. That would require a vastly more complicated implementation.

regards, tom lane
On Sun, Dec 18, 2022 at 06:29:41PM +0100, Frits Jalvingh wrote:
> Just to be clear: my real question is: why is temp_file_limit not working at the specified size? Because this is my real problem: when a query is dying like this it will also kill other queries because these are also running out of space. Even when the limit is per-process it should not have exceeded 200GB imo.

What OS and filesystem are in use ?

Could you list the tmpdir when it's getting huge? The filenames include the PID, which would indicate if there's another process involved, or a bug allowed it to get huge.

sudo du --max=2 -mx ./pgsql_tmp |sort -nr

BTW, pg_ls_tmpdir() hides directories, so you shouldn't rely on it for listing temporary directories...

One possibility is that there are files in the tmpdir, which have been unlinked, but are still opened, so their space hasn't been reclaimed. You could check for that by running: lsof -nn |grep pgsql_tmp
Any deleted files would say things like 'DEL|deleted|inode|no such'

> BTW, if that limit is really per process instead of per session/query then that is a Very Bad Thing(tm), because this makes the limit effectively worthless - if a query can spawn 8 parallel processes then you can suddenly, without any form of control, again fill up that disk.

8 is the default value of max_worker_processes and max_parallel_workers, but 2 is the default value of max_parallel_workers_per_gather. You're free to change the default value to balance it with the temp_file_limit (as suggested by the earlier-mentioned commit).

--
Justin
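Roughly the same listing can also be pulled from SQL (minus directories, as noted above); a small sketch, which needs superuser or pg_monitor membership:

-- Largest files in the default tablespace's temp directory;
-- note that pg_ls_tmpdir() skips directories such as the parallel-hash filesets
SELECT name, pg_size_pretty(size) AS size, modification
FROM pg_ls_tmpdir()
ORDER BY size DESC
LIMIT 10;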
On Mon, Dec 19, 2022 at 9:11 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Sun, Dec 18, 2022 at 06:29:41PM +0100, Frits Jalvingh wrote:
> > Just to be clear: my real question is: why is temp_file_limit not working at the specified size? Because this is my real problem: when a query is dying like this it will also kill other queries because these are also running out of space. Even when the limit is per-process it should not have exceeded 200GB imo.

It's really the limit for a single file (or virtual file because we split them on 1GB boundaries, probably well past time we stopped doing that), but we create many temporary files for various reasons.

One possibility is that you've hit a case that needs several rounds of repartitioning (because of a failure to estimate the number of tuples well), but we can't see that because you didn't show EXPLAIN (ANALYZE) output (understandably if it runs out of disk space before completing...). The parallel hash code doesn't free up the previous generations' temporary files; it really only needs two generations' worth concurrently (the one it's reading from and the one it's writing to). In rare cases where more generations are needed it could unlink the older ones -- that hasn't been implemented yet.

If you set log_temp_files = 0 to log temporary file names, it should be clear if it's going through multiple rounds of repartitioning, from the names (...of32..., ...of64..., ...of128..., ...of256..., ...).
On Mon, Dec 19, 2022 at 1:51 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> It's really the limit for a single file

Oops, sorry I take that back. It should be per process.
Hi Tom and Thomas, thanks for your help.
@Tom:
If it really is per-process then I would have expected it to die after 200GB was used?
As far as "valid bug" is concerned: I had hoped this would be per session, as this at least delivers a reasonable and usable limit; it is easy to control the number of sessions/statements in execution.
If it really is per process then the limit is not really useful, just like work_mem: the execution plan of a query determines the number of processes (up to the max, at least that is way better than work_mem) and that can change whenever Postgres feels a new plan is in order. I can understand that solving this might be harder (although to me it looks like just a little bit of shared memory per session to keep a number). To me it does not sound like an invalid bug, just one you do not want to solve now ;) And the real problem, for me, is actually that both work_mem and temp_file_limit should be for the entire instance/cluster ;) I know that that is even harder.
For us it means we really cannot use Postgres parallelism: it is infinitely better to have a query that runs longer but which finishes than to have the database die and recover randomly with OOM or with disk space filling up killing random queries. Which is a bit of a pity, ofc.
The test is running on Ubuntu 22.04.1, x86_64, the disk is an NVMe 2TB WD850X with ext4 as a file system.
I will collect the other data around tmpfiles hopefully later today.
I have already set max_parallel_workers_per_gather to 1. I will probably disable all parallelism for the next runs to see whether that makes the size limit more workable..
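The per-session knobs for that test are simple enough; a minimal sketch:

SET max_parallel_workers_per_gather = 0;  -- disable parallel query for this session only
SHOW max_parallel_workers_per_gather;     -- confirm the setting took effect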
On Mon, Dec 19, 2022 at 06:47, Frits Jalvingh <jal@etc.to> wrote:
> The test is running on Ubuntu 22.04.1, x86_64, the disk is an NVMe 2TB WD850X with ext4 as a file system.

It's probably not a production environment.
Any chance of adding another 2TB NVMe, just for the temp files?
To see if Postgres can finish the queries and provide more information?

What exactly is the version of Postgres (14.???) you are using?
regards,
Ranier Vilela
Hi Ranier, thanks for your help.
I do not have more disks lying around, and I fear that if it does not complete with 1.3TB of disk space it might not be that likely that adding 750GB would work...
Postgres version: the original (prd) issue was on 10.x. I also tested it on 14.x with the same issue. I then upgraded my machine to 15.1 to make sure to report on the latest version, and all information mentioned in this thread is from that version.
btw, this query generates quite different plans when tweaking things like enable_nestloop=false/true, and the "fast" plan requires enable_nestloop=true and join_collapse_limit=1 (5 seconds response). An odd thing is that both plans contain only one nested loop (a cross join, I think it cannot do that without one), but the general plan changes a lot. I am trying to get output from that second plan because this one just loops using CPU, not disk... Perhaps that one will finish with some statistics...
On Mon, Dec 19, 2022 at 11:45, Frits Jalvingh <jal@etc.to> wrote:
> Hi Ranier, thanks for your help.
> I do not have more disks lying around, and I fear that if it does not complete with 1.3TB of disk space it might not be that likely that adding 750GB would work...
> Postgres version: the original (prd) issue was on 10.x. I also tested it on 14.x with the same issue. I then upgraded my machine to 15.1 to make sure to report on the latest version, and all information mentioned in this thread is from that version.
Can you run with a debug-compiled version of Postgres?
Maybe that would shed some light.
regards,
Ranier Vilela
Hehehe, that is not the worst plan ;) I did that once to debug a deadlock in the JDBC driver when talking with Postgres, but it's not an adventure I'd like to repeat right now ;)
@justin:
Ran the query again. Top shows the following processes:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
650830 postgres 20 0 7503,2m 2,6g 2,6g R 100,0 4,2 12:46.34 postgres: jal datavault_317_prd [local] EXPLAIN
666141 postgres 20 0 7486,3m 2,6g 2,6g R 100,0 4,1 2:10.24 postgres: parallel worker for PID 650830
Your command shows, during execution:
root@chatelet:/d2/var/lib/postgresql/15/main/base# du --max=2 -mx ./pgsql_tmp |sort -nr
68629 ./pgsql_tmp/pgsql_tmp650830.3.fileset
68629 ./pgsql_tmp
root@chatelet:/d2/var/lib/postgresql/15/main/base# du --max=2 -mx ./pgsql_tmp |sort -nr
194494 ./pgsql_tmp
194493 ./pgsql_tmp/pgsql_tmp650830.3.fileset
335289 ./pgsql_tmp/pgsql_tmp650830.3.fileset
335289 ./pgsql_tmp
root@chatelet:/d2/var/lib/postgresql/15/main/base# du --max=2 -mx ./pgsql_tmp |sort -nr
412021 ./pgsql_tmp/pgsql_tmp650830.3.fileset
412021 ./pgsql_tmp
^^^ a few seconds after this last try the query aborted:
ERROR: temporary file size exceeds temp_file_limit (104857600kB)
> One possibility is that there are files in the tmpdir, which have been
> unlinked, but are still opened, so their space hasn't been reclaimed.
> You could check for that by running lsof -nn |grep pgsql_tmp Any deleted
> files would say things like 'DEL|deleted|inode|no such'
I do not really understand what you would like me to do, and when. The disk space is growing, and it is actual files under pgsql_tmp?
Hope this tells you something, please let me know if you would like more info, and again - thanks!
On Mon, Dec 19, 2022 at 05:57:42PM +0100, Frits Jalvingh wrote:
> @justin:
> Ran the query again. Top shows the following processes:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+

Thanks

> root@chatelet:/d2/var/lib/postgresql/15/main/base# du --max=2 -mx ./pgsql_tmp |sort -nr
> 412021 ./pgsql_tmp/pgsql_tmp650830.3.fileset
> 412021 ./pgsql_tmp
> ^^^ a few seconds after this last try the query aborted:
> ERROR: temporary file size exceeds temp_file_limit (104857600kB)

> > One possibility is that there are files in the tmpdir, which have been unlinked, but are still opened, so their space hasn't been reclaimed. You could check for that by running lsof -nn |grep pgsql_tmp Any deleted files would say things like 'DEL|deleted|inode|no such'
>
> I do not really understand what you would like me to do, and when. The disk space is growing, and it is actual files under pgsql_tmp?

Run this during the query as either postgres or root:
| lsof -nn |grep pgsql_tmp |grep -E 'DEL|deleted|inode|no such'
Any files it lists would be interesting to know about.

> Hope this tells you something, please let me know if you would like more info, and again - thanks!

I think Thomas' idea is more likely. We'd want to know the names of files being written, either as logged by log_temp_files or from
| find pgsql_tmp -ls
during the query.

--
Justin
I have listed the files during that run, I will try to add those as an attachment to the mail, no idea if the list accepts that. There were 2024 files in that directory around the end...
Executing that lsof command only produces some warnings, no files:
root@chatelet:/d2/var/lib/postgresql/15/main/base# lsof -nn |grep pgsql_tmp |grep -E 'DEL|deleted|inode|no such'
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
Output information may be incomplete.
lsof: WARNING: can't stat() fuse.jetbrains-toolbox file system /tmp/.mount_jetbraOAyv5H
Output information may be incomplete.
lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
Output information may be incomplete.
On Mon, Dec 19, 2022 at 06:27:57PM +0100, Frits Jalvingh wrote:
> I have listed the files during that run,
> 213M -rw------- 1 postgres postgres 213M dec 19 17:46 i100of128.p0.0
> 207M -rw------- 1 postgres postgres 207M dec 19 17:46 i100of128.p1.0
> 210M -rw------- 1 postgres postgres 210M dec 19 17:49 i100of256.p0.0
> 211M -rw------- 1 postgres postgres 211M dec 19 17:49 i100of256.p1.0
> 188M -rw------- 1 postgres postgres 188M dec 19 17:53 i100of512.p0.0
[...]

I think that proves Thomas' theory. I'm not sure how that helps you, though...

On Mon, Dec 19, 2022 at 01:51:33PM +1300, Thomas Munro wrote:
> One possibility is that you've hit a case that needs several rounds of repartitioning (because of a failure to estimate the number of tuples well), but we can't see that because you didn't show EXPLAIN (ANALYZE) output (understandably if it runs out of disk space before completing...). The parallel hash code doesn't free up the previous generations' temporary files; it really only needs two generations' worth concurrently (the one it's reading from and the one it's writing to). In rare cases where more generations are needed it could unlink the older ones -- that hasn't been implemented yet. If you set log_temp_files = 0 to log temporary file names, it should be clear if it's going through multiple rounds of repartitioning, from the names (...of32..., ...of64..., ...of128..., ...of256..., ...).

--
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581
Ok, just to make sure that I understand correctly:
The parallel hash implementation needs to resize its table because of a mismatch in expected tuple count. I do expect this to be true: Postgres often grossly underestimates the expected row counts in our queries.
This is not fully implemented yet: removing the "old" files is not yet done, so every time the table resizes it creates a new set of files and the old ones remain.
I assume that the "used file size" only includes the "current" set of files, and that the old ones are not counted towards that amount? That would explain why it overallocates, of course.
By itself I now know what to do: I just need to disable all parallelism ( •̀ᴗ•́ )و ̑̑
I usually do that anyway because it makes queries die randomly. This is just another reason.
I restarted that query with max_parallel_workers_per_gather=0, and this does not seem to use tempspace at all. It was not exactly fast, it took 82 minutes of a single process running at 100% cpu. https://explain.depesz.com/s/HedE
Thanks a lot for your help Justin, and all others that chimed in too.
Next round is to try to get that query to run in the 5 seconds that we know it can do, reliably.
On Mon, Dec 19, 2022 at 16:29, Frits Jalvingh <jal@etc.to> wrote:
> Ok, just to make sure that I understand correctly:
> The parallel hash implementation needs to resize its table because of a mismatch in expected tuple count. I do expect this to be true: Postgres often grossly underestimates the expected row counts in our queries.
> This is not fully implemented yet: removing the "old" files is not yet done, so every time the table resizes it creates a new set of files and the old ones remain.
> I assume that the "used file size" only includes the "current" set of files, and that the old ones are not counted towards that amount? That would explain why it overallocates, of course.
That is not necessarily what is happening.
Could you try manually deleting (rm) these files, using the postgres user?
It's an ugly and dirty test, but it could indicate that files are really being left behind, without being deleted by Postgres.
Alternatively, you could compile a version with
CHECK_WRITE_VS_EXTEND set, and try to fetch as much information from the logs as possible,
as has been indicated by others here.
> By itself I now know what to do: I just need to disable all parallelism ( •̀ᴗ•́ )و ̑̑
> I usually do that anyway because it makes queries die randomly. This is just another reason.
> I restarted that query with max_parallel_workers_per_gather=0, and this does not seem to use tempspace at all. It was not exactly fast, it took 82 minutes of a single process running at 100% cpu. https://explain.depesz.com/s/HedE
Anyway, see the hint page (https://explain.depesz.com/s/HedE#hints),
maybe it will be useful.
regards,
Ranier Vilela
On Sun, Dec 18, 2022 at 06:29:41PM +0100, Frits Jalvingh wrote:
> By itself I'm used to bad query performance in Postgresql; our application only does bulk queries and Postgres quite often makes terrible plans for those, but with enable_nestloop=false always set most of them at least execute. The remaining failing queries are almost 100% caused by bad join sequences; I plan to work around those by forcing the join order from our application. For instance, the exact same query above can also generate the following plan (this one was created by manually setting join_collapse_limit = 1, but fast variants also occur quite often when disabling parallelism):

I, too, ended up setting enable_nestloop=false for our report queries, to avoid the worst-case plans. But you should also try to address the rowcount misestimates. This underestimates the rowcount by a factor of 69 (or 138 in the plan you sent today):

| (soort = 'MSL'::text) AND (code = 'DAE'::text)

If those conditions are correlated, you can improve the estimate by adding an extended stats object.

| CREATE STATISTICS s_h_sturingslabel_ssm_stats ON soort, code FROM s_h_sturingslabel_ssm;
| ANALYZE s_h_sturingslabel_ssm;

Unfortunately, stats objects currently only improve scans, and not joins, so that might *improve* some queries, but it won't resolve the worst problems:

| Hash Join (cost=22,832.23..44,190.21 rows=185 width=47) (actual time=159.725..2,645,634.918 rows=28,086,472,886 loops=1)

Maybe you can improve that by adjusting the stats target or ndistinct...

--
Justin
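After creating the statistics object and running ANALYZE, the learned functional dependencies can be inspected through the pg_stats_ext view; a small sketch using the table named above:

-- Check that the extended statistics exist and what dependencies were measured
SELECT statistics_name, attnames, dependencies
FROM pg_stats_ext
WHERE tablename = 's_h_sturingslabel_ssm';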
These files ONLY exist during the query. They get deleted as soon as the query terminates, by Postgres itself. Once the query terminates pgsql_tmp is completely empty. Considering what Thomas said (and the actual occurrence of the files he mentioned) this does seem to be the more likely cause to me.
On Tue, Dec 20, 2022 at 8:59 AM Frits Jalvingh <jal@etc.to> wrote:
> @ranier
> These files ONLY exist during the query. They get deleted as soon as the query terminates, by Postgres itself. Once the query terminates pgsql_tmp is completely empty. Considering what Thomas said (and the actual occurrence of the files he mentioned) this does seem to be the more likely cause to me.

I'm working on some bug fixes near this area at the moment, so I'll also see if I can figure out how to implement the missing eager cleanup of earlier generations. It's still a pretty bad scenario once you reach it (repartitioning repeatedly, that is) and the solution to that is probably much harder, but it's obviously not great to waste temporary disk space like that.

BTW you can disable just parallel hash with enable_parallel_hash=false.
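That last knob is just another session-level setting; a one-line sketch:

SET enable_parallel_hash = off;  -- keep parallel query, but avoid parallel hash joins and their shared temp filesets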
@justin
I tried the create statistics variant and that definitely improves the estimate, and with that one of the "bad" cases (the one with the 82 minute plan) now creates a good plan using only a few seconds.
That is a worthwhile path to follow. A bit hard to do, because those conditions can be anything, but I can probably calculate the ones used per customer and create those correlation statistics from that... It is definitely better than tweaking the "poor man's query hints" enable_xxxx 8-/ which is really not helping with plan stability either.
That will be a lot of work, but I'll let you know the results ;)
@Thomas
Thanks for helping identify the issue. I think it would be nice to clean up those obsoleted files during the run, because running out of disk is really not a good thing to have ;) Of course the bad estimates leading to the resize are the real issue, but this at least makes it less bad.
Thanks for the tip about disabling parallel_hash but I also found it in the source. As mentioned before I disable (on production systems) all parallelism, not just for this issue but to prevent the OOM killer from killing Postgres - which happens way more often with parallelism on...
On Mon, Dec 19, 2022 at 09:10:27PM +0100, Frits Jalvingh wrote:
> @justin
> I tried the create statistics variant and that definitely improves the estimate, and with that one of the "bad" cases (the one with the 82 minute plan) now creates a good plan using only a few seconds.
> That is a worthwhile path to follow. A bit hard to do, because those conditions can be anything, but I can probably calculate the ones used per customer and create those correlation statistics from that... It is definitely better than tweaking the "poor man's query hints" enable_xxxx 8-/ which is really not helping with plan stability either.
> That will be a lot of work, but I'll let you know the results ;)

Yeah, if the conditions are arbitrary, then it's going to be more difficult. Hopefully you don't have too many columns. :)

I suggest enabling autoexplain and monitoring for queries which were slow, and retroactively adding statistics to those columns which are most-commonly queried, and which have correlations (which the planner doesn't otherwise know about).

You won't want to have more than a handful of columns in a stats object (since it requires factorial(N) complexity), but you can have multiple stats objects with different combinations of columns (and, in v14, expressions). You can also set a lower stats target to make the cost a bit lower.

You could try to check which columns are correlated, either by running:
| SELECT COUNT(1),col1,col2 FROM tbl GROUP BY 2,3 ORDER BY 1;
for different combinations of columns.

Or by creating a tentative/experimental stats object on a handful of columns at a time for which you have an intuition about their correlation, and then checking the calculated dependencies from pg_stats_ext. You may need to do something clever to use that for arbitrary columns. Maybe this is a start.
| SELECT dep.value::float, tablename, attnames, dep.key, exprs FROM (SELECT (json_each_text(dependencies::text::json)).* AS dep, * FROM pg_stats_ext) dep WHERE dependencies IS NOT NULL ORDER BY 1 DESC; -- AND regexp_count(key, ',') < 2

--
Justin
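Putting those suggestions together, a hedged sketch that reuses the object and column names from the earlier example (ALTER STATISTICS ... SET STATISTICS needs PostgreSQL 13 or later):

-- Dependencies-only statistics object with a reduced statistics target
CREATE STATISTICS IF NOT EXISTS s_h_sturingslabel_ssm_stats (dependencies)
    ON soort, code FROM s_h_sturingslabel_ssm;
ALTER STATISTICS s_h_sturingslabel_ssm_stats SET STATISTICS 100;
ANALYZE s_h_sturingslabel_ssm;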
Hi Justin,
As our queries are generated I decided to create a peephole optimizer kind of thing to scan the generated SQL AST to find multiple conditions on the same table reference. I can then use our metadata to see if these references are expected to be correlated. This creates about 20 statistics sets, including the one you have indicated. This at least makes the problematic query have a stable and very fast plan (so far). I had hoped for some more improvement with other queries but that has not yet been evident ;)
Thanks a lot for the tips and your help!
Cordially,
Frits