SELECT creates millions of temp files in a single directory - Mailing list pgsql-general
From: Peter
Subject: SELECT creates millions of temp files in a single directory
Date:
Msg-id: YmRYiVuzXa/SN8ZM@gate.intra.daemon.contact
Responses:
  Re: SELECT creates millions of temp files in a single directory
  Re: SELECT creates millions of temp files in a single directory
  Re: SELECT creates millions of temp files in a single directory
List: pgsql-general
In modern versions of postgres a simple SELECT writes a couple of million individual temp files into a single directory under pgsql_tmp. I know of no filesystem that takes such treatment lightly, and even ZFS runs into problems with such extremely large directories. What is the rationale behind this behaviour, and how is it supposed to be handled?

The specific case is a database that has been in use for almost 20 years. It used to run on a Pentium-2 with 768 MB memory; there it was slow, but it worked reliably. Now it runs on a 2660v3 with 32 GB memory, and tends to exhaust that memory. The database size has not increased and the postgres memory configuration has not been changed; only the postgres version was gradually upgraded from 8 to 12.

This is the memory configuration:

shared_buffers = 40MB
temp_buffers = 20MB
work_mem = 50MB
max_stack_depth = 40MB
max_files_per_process = 200

But the actual memory consumption is 30 GB (per query!), i.e. all of the installed memory:

UID   PID   PPID  C PRI NI     VSZ     RSS MWCHAN   STAT TT     TIME COMMAND
770 53143  10252 16  20  0 9359944 7796128 zio->io_ DsJ   -  3:11.29 postgres: bareos bareos fd00::118(53471) SELECT (postgres)
770 54334  10252 17  20  0 9279780   24388 zio->io_ DsJ   -  2:58.19 postgres: parallel worker for PID 53143 (postgres)
770 54335  10252 17  20  0 9279780   22168 zfs      DLsJ  -  2:51.30 postgres: parallel worker for PID 53143 (postgres)

This is the situation on the filesystem:

$ data12/base # du -k 16387/
9093312 16387/

$ data12/base/pgsql_tmp # du -k *
19979644 pgsql_tmp53143.0.sharedfileset

$ data12/base/pgsql_tmp/pgsql_tmp53143.0.sharedfileset # ls -la | wc
 1264755 11382788 96271672

More than a million files in a single directory; this is unacceptable.
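As a side note, the server itself keeps cumulative per-database counters for spilled temp files, so the churn can also be quantified from within a session; a sketch against the pg_stat_database statistics view (the database name is taken from the process listing above, and the counters cover everything since the last statistics reset):

```sql
-- Cumulative count of temp files and total bytes spilled to disk
-- for the bareos database since the last stats reset.
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_spilled
FROM pg_stat_database
WHERE datname = 'bareos';
```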
This is the query:

SELECT DISTINCT Path.PathId, File.PathId, Path.Path
  FROM Path
  LEFT JOIN File USING (PathId)
  LEFT JOIN PathHierarchy ON (Path.Pathid = PathHierarchy.PPathId)
  WHERE File.PathId IS NULL
    AND PathHierarchy.PPathId IS NULL
  LIMIT 300000

These are the stats:

Path:          464229 live tuples, 42 MB, 49 MB index
File:          43779725 live tuples, 1 dead tuple, 7160 MB, 2971 MB index
PathHierarchy: 380879 live tuples, 13 MB, 17 MB index

The PathHierarchy table is irrelevant to the behaviour and can be left out. VACUUM and ANALYZE were run right before the query.

This is the structure:

CREATE TABLE IF NOT EXISTS public.path
(
    pathid integer NOT NULL DEFAULT nextval('path_pathid_seq'::regclass),
    path text COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT path_pkey PRIMARY KEY (pathid)
)

CREATE TABLE IF NOT EXISTS public.file
(
    fileid bigint NOT NULL DEFAULT nextval('file_fileid_seq'::regclass),
    fileindex integer NOT NULL DEFAULT 0,
    jobid integer NOT NULL,
    pathid integer NOT NULL,
    deltaseq smallint NOT NULL DEFAULT 0,
    markid integer NOT NULL DEFAULT 0,
    fhinfo numeric(20,0) NOT NULL DEFAULT 0,
    fhnode numeric(20,0) NOT NULL DEFAULT 0,
    lstat text COLLATE pg_catalog."default" NOT NULL,
    md5 text COLLATE pg_catalog."default" NOT NULL,
    name text COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT file_pkey PRIMARY KEY (fileid)
)

CREATE INDEX IF NOT EXISTS file_jpfid_idx
    ON public.file USING btree
    (jobid ASC NULLS LAST, pathid ASC NULLS LAST, name COLLATE pg_catalog."default" ASC NULLS LAST)
    TABLESPACE pg_default;

And this does not give the explanation:

bareos=# explain SELECT DISTINCT Path.PathId, File.PathId, Path.Path
           FROM Path
           LEFT JOIN File USING (PathId)
           LEFT JOIN PathHierarchy ON (Path.Pathid = PathHierarchy.PPathId)
           WHERE File.PathId IS NULL
             AND PathHierarchy.PPathId IS NULL
           LIMIT 300000;
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1367204.20..1370204.20 rows=300000 width=67)
   ->  HashAggregate  (cost=1367204.20..1371346.45 rows=414225 width=67)
         Group Key: path.pathid, file.pathid, path.path
         ->  Gather  (cost=1225693.97..1364097.51 rows=414225 width=67)
               Workers Planned: 2
               ->  Parallel Hash Anti Join  (cost=1224693.97..1321675.01 rows=172594 width=67)
                     Hash Cond: (path.pathid = file.pathid)
                     ->  Parallel Hash Anti Join  (cost=6727.04..19953.72 rows=181359 width=63)
                           Hash Cond: (path.pathid = pathhierarchy.ppathid)
                           ->  Parallel Seq Scan on path  (cost=0.00..7361.29 rows=193429 width=63)
                           ->  Parallel Hash  (cost=3926.46..3926.46 rows=224046 width=4)
                                 ->  Parallel Seq Scan on pathhierarchy  (cost=0.00..3926.46 rows=224046 width=4)
                     ->  Parallel Hash  (cost=918690.59..918690.59 rows=18241547 width=4)
                           ->  Parallel Index Only Scan using file_jpfid_idx on file  (cost=0.56..918690.59 rows=18241547 width=4)
(14 rows)

Tracking this down: the first of the two queries below works well, the second one bloats memory and spills into millions of temp files:

bareos=# explain SELECT * from path LEFT JOIN file USING (pathid) WHERE File.PathId IS NULL LIMIT 300000;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Limit  (cost=9073765.86..9376924.49 rows=300000 width=193)
   ->  Gather  (cost=9073765.86..9520210.40 rows=441793 width=193)
         Workers Planned: 2
         ->  Merge Anti Join  (cost=9072765.86..9475031.10 rows=184080 width=193)
               Merge Cond: (path.pathid = file.pathid)
               ->  Sort  (cost=24345.75..24829.32 rows=193429 width=63)
                     Sort Key: path.pathid
                     ->  Parallel Seq Scan on path  (cost=0.00..7361.29 rows=193429 width=63)
               ->  Sort  (cost=9048403.94..9157853.22 rows=43779712 width=134)
                     Sort Key: file.pathid
                     ->  Seq Scan on file  (cost=0.00..1354253.12 rows=43779712 width=134)
(11 rows)

bareos=# explain SELECT Path.PathId, File.PathId, Path.Path from path LEFT JOIN file USING (pathid) WHERE File.PathId IS NULL LIMIT 300000;
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1218966.93..1311402.77 rows=300000 width=67)
   ->  Gather  (cost=1218966.93..1355091.95 rows=441793 width=67)
         Workers Planned: 2
         ->  Parallel Hash Anti Join  (cost=1217966.93..1309912.65 rows=184080 width=67)
               Hash Cond: (path.pathid = file.pathid)
               ->  Parallel Seq Scan on path  (cost=0.00..7361.29 rows=193429 width=63)
               ->  Parallel Hash  (cost=918690.59..918690.59 rows=18241547 width=4)
                     ->  Parallel Index Only Scan using file_jpfid_idx on file  (cost=0.56..918690.59 rows=18241547 width=4)
(8 rows)

Using these "parallel workers" was not my idea; they came creeping in unsolicited with some version upgrade. Switching them off deliberately makes the first query five times faster (5 minutes instead of 25 minutes), and makes the second query use only 25'000 temp files and successfully deliver 25'000 result rows (instead of getting stuck with a million temp files), apparently one temp file per result row now.

So:

1. It appears that these "parallel workers" are utterly counter-efficient whenever the working set does not fit into fast storage, and they need to be switched off.
2. Something with this hash anti join must be wrong; 25'000 temp files still does not look like a good thing.

But delivering code that, by default, allows in excess of a million files to be written into a single directory is just wrong.

Checking web resources:

* It now seems to be normal behaviour to write millions of files, and people seem to just accept this: https://stackoverflow.com/q/61696008
* Tables with some 50 million rows now seem to be considered a "high row count": https://www.postgresql.org/message-id/38E9456A-7841-4F13-B72B-FD3137591972%40gmail.com
  They were not considered a "high row count" back in 2007; they were just normal then, and ran fine on machines with 1/50 of the memory.
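For reference, the switch-off described above does not require a config-file change; a sketch of per-session settings, assuming stock PostgreSQL 12 (the temp_file_limit value is purely illustrative):

```sql
-- Disable parallel query for this session, as discussed above.
SET max_parallel_workers_per_gather = 0;

-- Optionally cap the session's total temp-file usage: the query then
-- fails with an error instead of flooding pgsql_tmp with spill files.
SET temp_file_limit = '20GB';
```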
:( People seem to have been brainwashed by web services and OLTP, and now think the working set must always fit in memory. But that is only one possible use case; it is not the only one.
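For completeness, the anti-join query above can also be spelled with NOT EXISTS; the planner usually generates the same anti-join plan for both spellings, but this form makes the DISTINCT unnecessary (pathid is Path's primary key and the anti-join returns each path row at most once, and the selected File.PathId is necessarily NULL in every result row). A sketch against the schema above:

```sql
SELECT Path.PathId, Path.Path
FROM Path
WHERE NOT EXISTS (SELECT 1 FROM File
                  WHERE File.PathId = Path.PathId)
  AND NOT EXISTS (SELECT 1 FROM PathHierarchy
                  WHERE PathHierarchy.PPathId = Path.PathId)
LIMIT 300000;
```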