Thread: SELECT creates millions of temp files in a single directory
In modern versions of postgres a simple SELECT writes a couple of million individual temp files into a single directory under pgsql_tmp. I know of no filesystem that would take such lightly, and even ZFS gets some problems with such extremely large directories. What is the rationale for this behaviour and how is it supposed to be handled?

The specific case is a database that has been in use for almost 20 years. It used to run on a Pentium II with 768 MB memory; there it was slow, but it did work reliably. Now it runs on a 2660v3 with 32 GB memory, and tends to exhaust that memory. Database size has not increased, postgres memory configuration has not been changed, only postgres versions were gradually upgraded from 8 to 12.

This is the memory configuration:

shared_buffers = 40MB
temp_buffers = 20MB
work_mem = 50MB
max_stack_depth = 40MB
max_files_per_process = 200

But the actual memory consumption is 30 GB (per query!), i.e. all of the installed memory:

UID   PID   PPID C PRI NI     VSZ     RSS MWCHAN   STAT TT     TIME COMMAND
770 53143  10252 16 20  0 9359944 7796128 zio->io_ DsJ   -  3:11.29 postgres: bareos bareos fd00::118(53471) SELECT (postgres)
770 54334  10252 17 20  0 9279780   24388 zio->io_ DsJ   -  2:58.19 postgres: parallel worker for PID 53143 (postgres)
770 54335  10252 17 20  0 9279780   22168 zfs      DLsJ  -  2:51.30 postgres: parallel worker for PID 53143 (postgres)

This is the situation on the filesystem:

$ data12/base # du -k 16387/
9093312 16387/

$ data12/base/pgsql_tmp # du -k *
19979644        pgsql_tmp53143.0.sharedfileset

$ data12/base/pgsql_tmp/pgsql_tmp53143.0.sharedfileset # ls -la | wc
 1264755 11382788 96271672

More than a million files in a single directory; this is unacceptable.

This is the query:

SELECT DISTINCT Path.PathId, File.PathId, Path.Path
  FROM Path
  LEFT JOIN File USING (PathId)
  LEFT JOIN PathHierarchy ON (Path.Pathid = PathHierarchy.PPathId)
 WHERE File.PathId IS NULL
   AND PathHierarchy.PPathId IS NULL
 LIMIT 300000

These are the stats:

Path:            464229 live tuples,                 42 MB,   49 MB index
File:          43779725 live tuples, 1 dead tuple, 7160 MB, 2971 MB index
PathHierarchy:   380879 live tuples,                 13 MB,   17 MB index

The PathHierarchy table is irrelevant to the behaviour and can be left out. VACUUM and ANALYZE were run right before the query.
This is the structure:

CREATE TABLE IF NOT EXISTS public.path
(
    pathid integer NOT NULL DEFAULT nextval('path_pathid_seq'::regclass),
    path text COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT path_pkey PRIMARY KEY (pathid)
)

CREATE TABLE IF NOT EXISTS public.file
(
    fileid bigint NOT NULL DEFAULT nextval('file_fileid_seq'::regclass),
    fileindex integer NOT NULL DEFAULT 0,
    jobid integer NOT NULL,
    pathid integer NOT NULL,
    deltaseq smallint NOT NULL DEFAULT 0,
    markid integer NOT NULL DEFAULT 0,
    fhinfo numeric(20,0) NOT NULL DEFAULT 0,
    fhnode numeric(20,0) NOT NULL DEFAULT 0,
    lstat text COLLATE pg_catalog."default" NOT NULL,
    md5 text COLLATE pg_catalog."default" NOT NULL,
    name text COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT file_pkey PRIMARY KEY (fileid)
)

CREATE INDEX IF NOT EXISTS file_jpfid_idx
    ON public.file USING btree
    (jobid ASC NULLS LAST, pathid ASC NULLS LAST, name COLLATE pg_catalog."default" ASC NULLS LAST)
    TABLESPACE pg_default;

And this does not give the explanation:

bareos=# explain SELECT DISTINCT Path.PathId, File.PathId, Path.Path FROM Path LEFT JOIN File USING (PathId) LEFT JOIN PathHierarchy ON (Path.Pathid = PathHierarchy.PPathId) WHERE File.PathId IS NULL AND PathHierarchy.PPathId IS NULL LIMIT 300000;
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1367204.20..1370204.20 rows=300000 width=67)
   ->  HashAggregate  (cost=1367204.20..1371346.45 rows=414225 width=67)
         Group Key: path.pathid, file.pathid, path.path
         ->  Gather  (cost=1225693.97..1364097.51 rows=414225 width=67)
               Workers Planned: 2
               ->  Parallel Hash Anti Join  (cost=1224693.97..1321675.01 rows=172594 width=67)
                     Hash Cond: (path.pathid = file.pathid)
                     ->  Parallel Hash Anti Join  (cost=6727.04..19953.72 rows=181359 width=63)
                           Hash Cond: (path.pathid = pathhierarchy.ppathid)
                           ->  Parallel Seq Scan on path  (cost=0.00..7361.29 rows=193429 width=63)
                           ->  Parallel Hash  (cost=3926.46..3926.46 rows=224046 width=4)
                                 ->  Parallel Seq Scan on pathhierarchy  (cost=0.00..3926.46 rows=224046 width=4)
                     ->  Parallel Hash  (cost=918690.59..918690.59 rows=18241547 width=4)
                           ->  Parallel Index Only Scan using file_jpfid_idx on file  (cost=0.56..918690.59 rows=18241547 width=4)
(14 rows)

Tracking this down: the first query below works well, the second one bloats memory and spills into millions of temp files:

bareos=# explain SELECT * from path LEFT JOIN file USING (pathid) WHERE File.PathId IS NULL LIMIT 300000;
                                           QUERY PLAN
----------------------------------------------------------------------------------------------
 Limit  (cost=9073765.86..9376924.49 rows=300000 width=193)
   ->  Gather  (cost=9073765.86..9520210.40 rows=441793 width=193)
         Workers Planned: 2
         ->  Merge Anti Join  (cost=9072765.86..9475031.10 rows=184080 width=193)
               Merge Cond: (path.pathid = file.pathid)
               ->  Sort  (cost=24345.75..24829.32 rows=193429 width=63)
                     Sort Key: path.pathid
                     ->  Parallel Seq Scan on path  (cost=0.00..7361.29 rows=193429 width=63)
               ->  Sort  (cost=9048403.94..9157853.22 rows=43779712 width=134)
                     Sort Key: file.pathid
                     ->  Seq Scan on file  (cost=0.00..1354253.12 rows=43779712 width=134)
(11 rows)

bareos=# explain SELECT Path.PathId, File.PathId, Path.Path from path LEFT JOIN file USING (pathid) WHERE File.PathId IS NULL LIMIT 300000;
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1218966.93..1311402.77 rows=300000 width=67)
   ->  Gather  (cost=1218966.93..1355091.95 rows=441793 width=67)
         Workers Planned: 2
         ->  Parallel Hash Anti Join  (cost=1217966.93..1309912.65 rows=184080 width=67)
               Hash Cond: (path.pathid = file.pathid)
               ->  Parallel Seq Scan on path  (cost=0.00..7361.29 rows=193429 width=63)
               ->  Parallel Hash  (cost=918690.59..918690.59 rows=18241547 width=4)
                     ->  Parallel Index Only Scan using file_jpfid_idx on file  (cost=0.56..918690.59 rows=18241547 width=4)
(8 rows)

Using these "parallel workers" was not my idea, they came creeping along unsolicited with some version upgrade. Switching them OFF deliberately makes the first query five times faster (5 minutes instead of 25 minutes), and makes the second query use only 25'000 temp files and successfully deliver 25'000 result rows (instead of getting stuck with a million temp files), apparently one temp file per result row now.

So,

1. it appears that these "parallel workers" are utterly counter-efficient whenever the working set does not fit into fast storage, and they need to be switched off.

2. something with this anti-hash-whatever must be wrong. 25'000 temp files does still not appear to be a good thing. But delivering code that, by default, allows more than a million files to be written in a single directory, that is just wrong.

Checking web resources:

* It seems now to be normal behaviour to write millions of files, and people seem to just accept this:
  https://stackoverflow.com/q/61696008

* Tables with some 50 million rows seem now to be considered a "high row count":
  https://www.postgresql.org/message-id/38E9456A-7841-4F13-B72B-FD3137591972%40gmail.com
  They were not considered a "high row count" back in 2007; they were just normal then, and did run fine on machines with 1/50 of the memory. :(

People seem to have been brainwashed by Web-Services and OLTP, and now think the working set must always fit in memory. But this is only one possible use case, it is not the exclusive only one.
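[For reference, the "switching them OFF" mentioned above can be done per session rather than cluster-wide; a minimal sketch (using the simplified two-table query from the post), not a recommendation for postgresql.conf:]

-- Disable parallel query for this session only, then re-check the plan.
SET max_parallel_workers_per_gather = 0;

EXPLAIN (ANALYZE, BUFFERS)
SELECT Path.PathId, File.PathId, Path.Path
  FROM Path LEFT JOIN File USING (PathId)
 WHERE File.PathId IS NULL
 LIMIT 300000;

RESET max_parallel_workers_per_gather;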
On Sat, Apr 23, 2022 at 1:00 PM Peter <pmc@citylink.dinoex.sub.org> wrote:
> In modern versions of postgres a simple SELECT writes a couple
> of million individual temp files into a single directory under
> pgsql_tmp.
> I know of no filesystem that would take such lightly, and even
> ZFS gets some problems with such extremely large directories.
Your running commentary detracts greatly from the problem you are trying to describe and demonstrate.
> What is the rationale for this behaviour and how is it supposed to
> be handled?
There are these things called bugs. This may be one of those, which, by definition, are not intentional.
> Database size has not increased, postgres memory configuration has
> not been changed, only postgres versions were gradually upgraded
> from 8 to 12.
v12 what?
Using these "parallel workers" was not my idea, they came creeping
along unsolicited with some version upgrade.
Well, in theory at least, parallelism should be a positive improvement. Why would we want to require our users to opt in to the feature?
> Switching them OFF deliberately makes the first query five times
> faster (5 minutes instead of 25 minutes), and makes the second
> query use only 25'000 temp files and successfully deliver 25'000
> result rows (instead of getting stuck with a million temp files),
> apparently one temp file per result row now.
In your situation maybe they do not help. Whether that is a fundamental limitation of parallelism in low-memory conditions or whether it's specific to this query, I have no clue. But aren't you glad we had the foresight to allow parallelism to be disabled in the case of bugs and/or situations where it proves to be harmful?
> So,
> 1. it appears that these "parallel workers" are utterly
> counter-efficient whenever the working set does not fit into
> fast storage, and they need to be switched off.
Not sure you've "proven" this but you can do so for your setup.
> 2. something with this anti-hash-whatever must be wrong. 25'000
> temp files does still not appear to be a good thing. But
> delivering code that, by default, allows more than a million
> files to be written in a single directory, that is just wrong.
Probably. It would help if you can provide a self-contained demonstration that others can then verify and debug (or explain).
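[A self-contained demonstration along those lines might look roughly like the following sketch; the simplified table definitions, row counts, and the heavily skewed pathid distribution are assumptions chosen to mimic the tables described in the post, not something taken from the thread:]

-- Hypothetical repro: a small "path" table and a large "file" table whose
-- pathid column is dominated by a single value, then the same anti-join
-- shape as in the problem query.
CREATE TABLE path (pathid integer PRIMARY KEY, path text NOT NULL);
CREATE TABLE file (fileid bigint PRIMARY KEY, pathid integer NOT NULL, name text NOT NULL);

INSERT INTO path
SELECT g, '/some/dir/' || g
FROM generate_series(1, 500000) AS g;

-- ~44M rows, 90% of them sharing pathid = 42, to provoke repeated
-- repartitioning of the hash join.
INSERT INTO file
SELECT g,
       CASE WHEN g % 10 = 0 THEN (g % 500000) + 1 ELSE 42 END,
       'f' || g
FROM generate_series(1, 44000000) AS g;

ANALYZE path;
ANALYZE file;

EXPLAIN (ANALYZE)
SELECT path.pathid, file.pathid, path.path
  FROM path LEFT JOIN file USING (pathid)
 WHERE file.pathid IS NULL
 LIMIT 300000;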
> Checking web resources:
> * It seems now to be normal behaviour to write millions of files,
> and people seem to just accept this:
> https://stackoverflow.com/q/61696008
If they aren't willing to post to our -bugs or other mailing lists there isn't much we can do for them.
> * Tables with some 50 million rows seem now to be considered a
> "high row count":
> https://www.postgresql.org/message-id/38E9456A-7841-4F13-B72B-FD3137591972%40gmail.com
> They were not considered a "high row count" back in 2007; they were
> just normal then, and did run fine on machines with 1/50 of the
> memory. :(
I don't see how this has any relevance.
Generalizing to "people" from three or four examples is pointless. And, regardless, it isn't like any of those people are committers for the project, whose opinions are really the only ones that matter because they control whether to fix something or not.
> People seem to have been brainwashed by Web-Services and OLTP,
> and now think the working set must always fit in memory. But this
> is only one possible use case, it is not the exclusive only one.
Again, your running commentary is providing zero, or negative, value here.
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes: > v12 what? That ... > It would help if you can provide a self-contained demonstration > that others can then verify and debug (or explain). ... and that. As this message stands, it's undocumented whining. Please see https://wiki.postgresql.org/wiki/Slow_Query_Questions for the sort of information we need to debug performance issues. There's also https://wiki.postgresql.org/wiki/Guide_to_reporting_problems although that's not oriented specifically to performance issues. (I recall that we have fixed some problems in the past that could lead to unreasonable numbers of temp files in hash joins. Whether there's more to do, or Peter is running a version that lacks those fixes, is impossible to tell with the given info.) regards, tom lane
On Sat, Apr 23, 2022 at 1:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> v12 what?
That ...
> It would help if you can provide a self-contained demonstration
> that others can then verify and debug (or explain).
... and that. As this message stands, it's undocumented whining.
Please see
https://wiki.postgresql.org/wiki/Slow_Query_Questions
for the sort of information we need to debug performance issues.
There's also
https://wiki.postgresql.org/wiki/Guide_to_reporting_problems
although that's not oriented specifically to performance issues.
(I recall that we have fixed some problems in the past that could
lead to unreasonable numbers of temp files in hash joins. Whether
there's more to do, or Peter is running a version that lacks those
fixes, is impossible to tell with the given info.)
Thanks.
I'll add that, given the nature of the problem, changing temp_file_limit away from its default of unlimited may be useful.
https://www.postgresql.org/docs/current/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-DISK
David J.
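[A minimal sketch of that suggestion; note that temp_file_limit caps the total temp-file space a process may use and cancels the query with an error when the cap is exceeded, it does not cap the number of files, and changing it normally requires superuser privilege:]

-- Cap temporary file space for this session; a query exceeding the cap is
-- cancelled instead of filling pgsql_tmp.
SET temp_file_limit = '20GB';
-- ... run the problematic query here ...
RESET temp_file_limit;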
On 4/23/22 12:50, Peter wrote:
> People seem to have been brainwashed by Web-Services and OLTP,
> and now think the working set must always fit in memory. But this
> is only one possible use case, it is not the exclusive only one.

This is a no-win situation, as most of the complaints in recent years have
been that Postgres was/is too conservative in its default settings and is
not taking advantage of newer, more powerful hardware.

--
Adrian Klaver
adrian.klaver@aklaver.com
"David G. Johnston" <david.g.johnston@gmail.com> writes: > I'll add that given the nature of the problem that changing temp_file_limit > away from its default of unlimited may be useful. > https://www.postgresql.org/docs/current/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-DISK Maybe ... that limits the total space involved, not the number of files it's separated into, so I'm not sure how much it can help. It might be worth playing around to see how varying values of work_mem affect this behavior, too. That should change the planner's initial estimate of the number of hash batches needed, which likely plays into this somehow. regards, tom lane
On Sat, Apr 23, 2022 at 02:11:00PM -0700, Adrian Klaver wrote:
! On 4/23/22 12:50, Peter wrote:
!
! > People seem to have been brainwashed by Web-Services and OLTP,
! > and now think the working set must always fit in memory. But this
! > is only one possible use case, it is not the exclusive only one.
!
! This is a no-win situation, as most of the complaints in recent years have
! been that Postgres was/is too conservative in its default settings and is
! not taking advantage of newer, more powerful hardware.

I know, and You got to the point; this is exactly what I am talking
about: people take the abundance of resources for granted.

In Rel. 8 postgres was a breathtaking beauty of engineering: the style
of old, driven to perfection. Now You're gradually sacrificing this, for
the speed junkies and to protect those from mistakes who are not engineers.

And no, I don't know how this could be solved: the more influential
you get, the more driven you are by public demand; the less freedom
you have to follow ideals.

David Johnston thinks we must just not speak that out, we must instead
behave like "the emperor's new clothes", and follow google's
understanding of "positive values".
Sorry, that doesn't work for me.
On 4/23/22 14:58, Peter wrote:
> On Sat, Apr 23, 2022 at 02:11:00PM -0700, Adrian Klaver wrote:
> ! On 4/23/22 12:50, Peter wrote:
> !
> ! > People seem to have been brainwashed by Web-Services and OLTP,
> ! > and now think the working set must always fit in memory. But this
> ! > is only one possible use case, it is not the exclusive only one.
> !
> ! This is a no-win situation, as most of the complaints in recent years
> ! have been that Postgres was/is too conservative in its default settings
> ! and is not taking advantage of newer, more powerful hardware.
>
> I know, and You got to the point; this is exactly what I am talking
> about: people take the abundance of resources for granted.

Probably because the resources are there. My phone has computing power I
could only dream of when I was using a desktop of old.

> In Rel. 8 postgres was a breathtaking beauty of engineering: the style
> of old, driven to perfection.

I would hardly call version 8.x perfection, especially the attempt at
running on Windows natively.

> Now You're gradually sacrificing this, for the speed junkies and to
> protect those from mistakes who are not engineers.
>
> And no, I don't know how this could be solved: the more influential
> you get, the more driven you are by public demand; the less freedom
> you have to follow ideals.

Whose ideals? That is the issue. Postgres covers a broad spectrum of uses
and as such you will have complaints from either end that their needs are
not met.

> David Johnston thinks we must just not speak that out, we must instead
> behave like "the emperor's new clothes", and follow google's
> understanding of "positive values".
> Sorry, that doesn't work for me.

No, the complaint was that your pontificating interfered with your problem
description and got in the way of coming up with a solution.

--
Adrian Klaver
adrian.klaver@aklaver.com
On Sat, Apr 23, 2022 at 05:13:41PM -0400, Tom Lane wrote:
! "David G. Johnston" <david.g.johnston@gmail.com> writes:
! > I'll add that, given the nature of the problem, changing temp_file_limit
! > away from its default of unlimited may be useful.
! > https://www.postgresql.org/docs/current/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-DISK
!
! Maybe ... that limits the total space involved, not the number of
! files it's separated into, so I'm not sure how much it can help.

That's what I understood from the docs, too. What I also read in the
docs is that it will just kill the query when it hits the limit, and
this is not really what I want.

And btw, I probably lost-in-translation the relevant info about the
running version:

Name           : postgresql12-server
Version        : 12.10
Installed on   : Mon Apr  4 04:13:18 2022 CEST
Origin         : databases/postgresql12-server
Architecture   : FreeBSD:13:amd64

! It might be worth playing around to see how varying values of work_mem
! affect this behavior, too.  That should change the planner's initial
! estimate of the number of hash batches needed, which likely plays into
! this somehow.

Agreed. But then, looking at the generated filenames, in the style of
"i44297of524288.p1.0" - this is an algorithm at work, so somebody must
have done this, and obviously wasn't bothered by creating half a million
files after having created another half million already. So I thought I
might just ask what the idea behind this is.

| > It would help if you can provide a self-contained demonstration
| > that others can then verify and debug (or explain).
|
| ... and that.  As this message stands, it's undocumented whining.
| Please see
|
| https://wiki.postgresql.org/wiki/Slow_Query_Questions
|
| for the sort of information we need to debug performance issues.

It is not a performance issue, it is a design question: You inflict
pain on my beloved ZFS, and as a lover I react. ;)

| (I recall that we have fixed some problems in the past that could
| lead to unreasonable numbers of temp files in hash joins.  Whether
| there's more to do, or Peter is running a version that lacks those
| fixes, is impossible to tell with the given info.)

Yes, I accidentally deleted that info too when I deleted the more
extensive rants from my original posting. See above.
On 4/23/22 15:40, Peter wrote:
> On Sat, Apr 23, 2022 at 05:13:41PM -0400, Tom Lane wrote:
>
> And btw, I probably lost-in-translation the relevant info about the
> running version:
>
> Name           : postgresql12-server
> Version        : 12.10
> Installed on   : Mon Apr  4 04:13:18 2022 CEST
> Origin         : databases/postgresql12-server
> Architecture   : FreeBSD:13:amd64
>
> | Please see
> |
> | https://wiki.postgresql.org/wiki/Slow_Query_Questions
> |
> | for the sort of information we need to debug performance issues.
>
> It is not a performance issue, it is a design question: You inflict
> pain on my beloved ZFS, and as a lover I react. ;)

Be that as it may, the requested information is still needed.

--
Adrian Klaver
adrian.klaver@aklaver.com
On Sun, Apr 24, 2022 at 8:00 AM Peter <pmc@citylink.dinoex.sub.org> wrote:
> More than a million files in a single directory; this is
> unacceptable.

You didn't show EXPLAIN (ANALYZE), but if [Parallel] Hash is making insane
numbers of temporary files then something is not working as intended... and
I can take a guess at what it is.

I tried to recreate this with randomly distributed file.pathid, same size
tables as you, and I got 32 batches and a nice manageable number of
temporary files. Adding millions of extra file rows with duplicate
pathid=42 gets me something like "Batches: 524288 (originally 32)" in
EXPLAIN (ANALYZE) output. I guess that's the sort of distribution you have
here?

Extensive discussion of the cause of that and potential (hard) solutions
over here:

https://www.postgresql.org/message-id/flat/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

To summarise, when the hash table doesn't fit in work_mem, we "partition"
(spill part of the data to disk) to make twice as many (hopefully) smaller
hash tables that do fit. Sometimes partitioning produces one or more hash
tables that are still too big because of uneven key distribution, so we go
around again, doubling the number of partitions (and thus temporary files)
every time. I would say that once we get past hundreds of partitions,
things are really turning to custard (file descriptors, directory entries,
memory overheads, ... it just doesn't work well anymore), but currently we
don't give up until it becomes very clear that repartitioning is not
helping.

This algorithmic problem existed before parallelism was added, but it's
possible that the parallel version of the meltdown hurts a lot more (it has
extra per-process files, and in multi-attempt scenarios the useless earlier
attempts hang around until the end of the query instead of being cleaned up
sooner, which doubles the number of files).

Hopefully that gives some clues about how one might rewrite the query to
avoid a massive unsplittable set of duplicate keys in hash tables, assuming
I'm somewhere close to the explanation here (maybe some subselect with
UNIQUE or GROUP BY in it, or some way to rewrite the query to avoid having
the problematic duplicates on the "inner" side, or a completely different
plan..., or just crank up work_mem massively for this query so that you
don't need a partition step at all). Obviously it would be nice for
PostgreSQL to be hardened against this risk eventually, though.

As for merge join planning, unfortunately merge joins aren't fully
parallelisable, and in the plan you showed a separate copy of the big sort
runs in every process, which isn't nice (we don't have a Parallel Sort that
could fix that yet).
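[One possible reading of that suggestion, as an untested sketch: keep duplicate pathids out of the hashed side, either by spelling the anti-join with NOT EXISTS or by de-duplicating the inner relation first. Whether the planner actually produces a better plan for this particular data would need to be verified:]

-- Variant 1: the same anti-join spelled with NOT EXISTS.
SELECT p.pathid, p.path
  FROM path p
 WHERE NOT EXISTS (SELECT 1 FROM file f WHERE f.pathid = p.pathid)
   AND NOT EXISTS (SELECT 1 FROM pathhierarchy h WHERE h.ppathid = p.pathid)
 LIMIT 300000;

-- Variant 2: keep the LEFT JOIN shape but hash a de-duplicated pathid set,
-- so the inner side has at most one row per key.
SELECT p.pathid, f.pathid, p.path
  FROM path p
  LEFT JOIN (SELECT DISTINCT pathid FROM file) f USING (pathid)
 WHERE f.pathid IS NULL
 LIMIT 300000;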