Thread: SELECT creates millions of temp files in a single directory

SELECT creates millions of temp files in a single directory

From
Peter
Date:
In modern versions of postgres a simple SELECT writes a couple
of million individual temp files into a single directory under
pgsql_tmp.
I know of no filesystem that would take that lightly, and even
ZFS has problems with such extremely large directories.

What is the rationale behind this behaviour, and how is it supposed to
be handled?

The specific case is a database that has been in use for almost 20 years.
It used to run on a Pentium II with 768 MB memory; there it was
slow, but it worked reliably.
Now it runs on a 2660v3 with 32 GB memory, and tends to exhaust that
memory.

Database size has not increased, postgres memory configuration has
not been changed, only postgres versions were gradually upgraded
from 8 to 12.

This is the memory configuration:

shared_buffers = 40MB
temp_buffers = 20MB
work_mem = 50MB
max_stack_depth = 40MB
max_files_per_process = 200

But the actual memory consumption is 30 GB (per query!), i.e. all
of the installed memory:

 UID   PID  PPID  C PRI NI     VSZ     RSS MWCHAN   STAT  TT     TIME COMMAND
 770 53143 10252 16  20  0 9359944 7796128 zio->io_ DsJ    -  3:11.29 postgres: bareos bareos fd00::118(53471) SELECT
(postgres)
 770 54334 10252 17  20  0 9279780   24388 zio->io_ DsJ    -  2:58.19 postgres: parallel worker for PID 53143
(postgres)
 770 54335 10252 17  20  0 9279780   22168 zfs      DLsJ   -  2:51.30 postgres: parallel worker for PID 53143
(postgres)

This is the situation on the filesystem:
$ data12/base # du -k 16387/
9093312 16387/

$ data12/base/pgsql_tmp # du -k *
19979644        pgsql_tmp53143.0.sharedfileset

$ data12/base/pgsql_tmp/pgsql_tmp53143.0.sharedfileset # ls -la | wc
 1264755 11382788 96271672

More than a million files in a single directory; this is
unacceptable.


This is the query:
  SELECT DISTINCT Path.PathId, File.PathId, Path.Path
  FROM Path LEFT JOIN File USING (PathId)
     LEFT JOIN PathHierarchy ON (Path.Pathid = PathHierarchy.PPathId)
  WHERE File.PathId IS NULL AND PathHierarchy.PPathId IS NULL LIMIT 300000

These are the stats:
  Path: 464229 live tuples, 42 MB, 49 MB index
  File: 43779725 live tuples, 1 dead tuple, 7160 MB, 2971 MB index
  PathHierarchy: 380879 live tuples, 13 MB, 17 MB index

The pathhierarchy table is irrelevant to the behaviour and can be left
out.

Vacuum and Analyze have been run right before the query.

This is the structure:
CREATE TABLE IF NOT EXISTS public.path
(
    pathid integer NOT NULL DEFAULT nextval('path_pathid_seq'::regclass),
    path text COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT path_pkey PRIMARY KEY (pathid)
)
CREATE TABLE IF NOT EXISTS public.file
(
    fileid bigint NOT NULL DEFAULT nextval('file_fileid_seq'::regclass),
    fileindex integer NOT NULL DEFAULT 0,
    jobid integer NOT NULL,
    pathid integer NOT NULL,
    deltaseq smallint NOT NULL DEFAULT 0,
    markid integer NOT NULL DEFAULT 0,
    fhinfo numeric(20,0) NOT NULL DEFAULT 0,
    fhnode numeric(20,0) NOT NULL DEFAULT 0,
    lstat text COLLATE pg_catalog."default" NOT NULL,
    md5 text COLLATE pg_catalog."default" NOT NULL,
    name text COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT file_pkey PRIMARY KEY (fileid)
)
CREATE INDEX IF NOT EXISTS file_jpfid_idx
    ON public.file USING btree
    (jobid ASC NULLS LAST, pathid ASC NULLS LAST, name COLLATE pg_catalog."default" ASC NULLS LAST)
    TABLESPACE pg_default;

And the plan does not explain it:
bareos=# explain SELECT DISTINCT Path.PathId, File.PathId, Path.Path  FROM Path LEFT JOIN File USING (PathId)
     LEFT JOIN PathHierarchy ON (Path.Pathid = PathHierarchy.PPathId)
  WHERE File.PathId IS NULL AND PathHierarchy.PPathId IS NULL LIMIT 300000;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1367204.20..1370204.20 rows=300000 width=67)
   ->  HashAggregate  (cost=1367204.20..1371346.45 rows=414225 width=67)
         Group Key: path.pathid, file.pathid, path.path
         ->  Gather  (cost=1225693.97..1364097.51 rows=414225 width=67)
               Workers Planned: 2
               ->  Parallel Hash Anti Join  (cost=1224693.97..1321675.01 rows=172594 width=67)
                     Hash Cond: (path.pathid = file.pathid)
                     ->  Parallel Hash Anti Join  (cost=6727.04..19953.72 rows=181359 width=63)
                           Hash Cond: (path.pathid = pathhierarchy.ppathid)
                           ->  Parallel Seq Scan on path  (cost=0.00..7361.29 rows=193429 width=63)
                           ->  Parallel Hash  (cost=3926.46..3926.46 rows=224046 width=4)
                                 ->  Parallel Seq Scan on pathhierarchy  (cost=0.00..3926.46 rows=224046 width=4)
                     ->  Parallel Hash  (cost=918690.59..918690.59 rows=18241547 width=4)
                            ->  Parallel Index Only Scan using file_jpfid_idx on file  (cost=0.56..918690.59 rows=18241547 width=4)
(14 rows)


Tracking this down: of the following two queries, the first works well,
while the second bloats memory and explodes into millions of temp files:

bareos=# explain SELECT *  from path LEFT JOIN file USING (pathid) WHERE File.PathId IS NULL LIMIT 300000;
                                          QUERY PLAN                                          
----------------------------------------------------------------------------------------------
 Limit  (cost=9073765.86..9376924.49 rows=300000 width=193)
   ->  Gather  (cost=9073765.86..9520210.40 rows=441793 width=193)
         Workers Planned: 2
         ->  Merge Anti Join  (cost=9072765.86..9475031.10 rows=184080 width=193)
               Merge Cond: (path.pathid = file.pathid)
               ->  Sort  (cost=24345.75..24829.32 rows=193429 width=63)
                     Sort Key: path.pathid
                     ->  Parallel Seq Scan on path  (cost=0.00..7361.29 rows=193429 width=63)
               ->  Sort  (cost=9048403.94..9157853.22 rows=43779712 width=134)
                     Sort Key: file.pathid
                     ->  Seq Scan on file  (cost=0.00..1354253.12 rows=43779712 width=134)
(11 rows)

bareos=# explain SELECT Path.PathId, File.PathId, Path.Path  from path LEFT JOIN file USING (pathid) WHERE File.PathId IS NULL LIMIT 300000;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1218966.93..1311402.77 rows=300000 width=67)
   ->  Gather  (cost=1218966.93..1355091.95 rows=441793 width=67)
         Workers Planned: 2
         ->  Parallel Hash Anti Join  (cost=1217966.93..1309912.65 rows=184080 width=67)
               Hash Cond: (path.pathid = file.pathid)
               ->  Parallel Seq Scan on path  (cost=0.00..7361.29 rows=193429 width=63)
               ->  Parallel Hash  (cost=918690.59..918690.59 rows=18241547 width=4)
                      ->  Parallel Index Only Scan using file_jpfid_idx on file  (cost=0.56..918690.59 rows=18241547 width=4)
(8 rows)


Using these "parallel workers" was not my idea; they came creeping
along unsolicited with some version upgrade.
Switching them OFF deliberately makes the first query five times
faster (5 minutes instead of 25 minutes), and makes the second
query use only 25'000 temp files and successfully deliver 25'000
result rows (instead of getting stuck with a million temp files),
apparently one temp file per result row now.
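
For the record, "switching them OFF" means the usual GUC; a minimal
sketch, assuming the standard knob and the database name seen above:

  -- per session, before running the query:
  SET max_parallel_workers_per_gather = 0;

  -- or persistently for just this database:
  ALTER DATABASE bareos SET max_parallel_workers_per_gather = 0;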

So,
1. it appears that these "parallel workers" are utterly
   counter-efficient whenever the working set does not fit into
   fast storage, and they need to be switched off.
2. something with this anti-hash-whatever must be wrong. 25'000
   temp files still does not appear to be a good thing. But
   delivering code that, by default, allows in excess of a million
   files to be written in a single directory, that is just wrong.


Checking web resources:

 * It now seems to be normal behaviour to write millions of files,
   and people seem to just accept this:
   https://stackoverflow.com/q/61696008

 * Tables with some 50 million rows now seem to be considered a "high
   row count":
   https://www.postgresql.org/message-id/38E9456A-7841-4F13-B72B-FD3137591972%40gmail.com
   They were not considered a "high row count" back in 2007; they were
   just normal then, and did run fine on machines with 1/50 of the
   memory. :(
   People seem to have been brainwashed by Web-Services and OLTP,
   and now think the working set must always fit in memory. But this
   is only one possible use case, not the only one.



Re: SELECT creates millions of temp files in a single directory

From
"David G. Johnston"
Date:
On Sat, Apr 23, 2022 at 1:00 PM Peter <pmc@citylink.dinoex.sub.org> wrote:
In modern versions of postgres a simple SELECT writes a couple
of million individual temp files into a single directory under
pgsql_tmp.
I know of no filesystem that would take that lightly, and even
ZFS has problems with such extremely large directories.


Your running commentary detracts greatly from the problem you are trying to describe and demonstrate.
 
What is the rationale behind this behaviour, and how is it supposed to
be handled?

There are these things called bugs.  This may be one of those, which, by definition, are not intentional.
 

Database size has not increased, postgres memory configuration has
not been changed, only postgres versions were gradually upgraded
from 8 to 12.

v12 what?

Using these "parallel workers" was not my idea; they came creeping
along unsolicited with some version upgrade.

Well, in theory at least parallelism should be a positive improvement.  Why would we want to require our users to opt-in to the feature?

Switching them OFF deliberately makes the first query five times
faster (5 minutes instead of 25 minutes), and makes the second
query use only 25'000 temp files and successfully deliver 25'000
result rows (instead of getting stuck with a million temp files),
apparently one temp file per result row now.

In your situation maybe they do not help.  Whether that is a fundamental limitation of parallelism in low-memory conditions or whether it's specific to this query, I have no clue.  But aren't you glad we had the foresight to allow parallelism to be disabled in the case of bugs and/or situations where it did prove to be harmful?


So,
1. it appears that these "parallel workers" are utterly
   counter-efficient whenever the working set does not fit into
   fast storage, and they need to be switched off.

Not sure you've "proven" this but you can do so for your setup.
 
2. something with this anti-hash-whatever must be wrong. 25'000
   temp files still does not appear to be a good thing. But
   delivering code that, by default, allows in excess of a million
   files to be written in a single directory, that is just wrong.

Probably.  It would help if you can provide a self-contained demonstration that others can then verify and debug (or explain).


Checking web resources:

 * It now seems to be normal behaviour to write millions of files,
   and people seem to just accept this:
   https://stackoverflow.com/q/61696008

If they aren't willing to post to our -bugs or other mailing lists there isn't much we can do for them.



 * Tables with some 50 million rows now seem to be considered a "high
   row count":
   https://www.postgresql.org/message-id/38E9456A-7841-4F13-B72B-FD3137591972%40gmail.com
   They were not considered a "high row count" back in 2007; they were
   just normal then, and did run fine on machines with 1/50 of the
   memory. :(

I don't see how this has any relevance.

Generalizing to "people" from three or four examples is pointless.  And, regardless, it isn't like any of those people are committers for the project, whose opinions are really the only ones that matter because they control whether to fix something or not.

 
   People seem to have been brainwashed by Web-Services and OLTP,
   and now think the working set must always fit in memory. But this
   is only one possible use case, not the only one.


Again, your running commentary is providing zero, or negative, value here.

David J.

Re: SELECT creates millions of temp files in a single directory

From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> v12 what?

That ...

> It would help if you can provide a self-contained demonstration
> that others can then verify and debug (or explain).

... and that.  As this message stands, it's undocumented whining.
Please see

https://wiki.postgresql.org/wiki/Slow_Query_Questions

for the sort of information we need to debug performance issues.
There's also

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

although that's not oriented specifically to performance issues.

(I recall that we have fixed some problems in the past that could
lead to unreasonable numbers of temp files in hash joins.  Whether
there's more to do, or Peter is running a version that lacks those
fixes, is impossible to tell with the given info.)

            regards, tom lane



Re: SELECT creates millions of temp files in a single directory

From
"David G. Johnston"
Date:
On Sat, Apr 23, 2022 at 1:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> v12 what?

That ...

> It would help if you can provide a self-contained demonstration
> that others can then verify and debug (or explain).

... and that.  As this message stands, it's undocumented whining.
Please see

https://wiki.postgresql.org/wiki/Slow_Query_Questions

for the sort of information we need to debug performance issues.
There's also

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

although that's not oriented specifically to performance issues.

(I recall that we have fixed some problems in the past that could
lead to unreasonable numbers of temp files in hash joins.  Whether
there's more to do, or Peter is running a version that lacks those
fixes, is impossible to tell with the given info.)

Thanks.

I'll add that, given the nature of the problem, changing temp_file_limit away from its default of unlimited may be useful:
https://www.postgresql.org/docs/current/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-DISK
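
For example (the value is only illustrative; note that exceeding the
limit cancels the query rather than throttling it):

  -- instance-wide default (takes effect on reload):
  ALTER SYSTEM SET temp_file_limit = '20GB';
  SELECT pg_reload_conf();

  -- or only for the session running the problem query (requires superuser):
  SET temp_file_limit = '20GB';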


David J.

Re: SELECT creates millions of temp files in a single directory

From
Adrian Klaver
Date:
On 4/23/22 12:50, Peter wrote:


>     People seem to have been brainwashed by Web-Services and OLTP,
>     and now think the working set must always fit in memory. But this
>     is only one possible usecase, it is not the exclusive only one.

This is a no-win situation, as most of the complaints in recent years have
been that Postgres was/is too conservative in its default settings and is
not taking advantage of newer, more powerful hardware.

-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: SELECT creates millions of temp files in a single directory

From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> I'll add that, given the nature of the problem, changing temp_file_limit
> away from its default of unlimited may be useful:
> https://www.postgresql.org/docs/current/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-DISK

Maybe ... that limits the total space involved, not the number of
files it's separated into, so I'm not sure how much it can help.

It might be worth playing around to see how varying values of work_mem
affect this behavior, too.  That should change the planner's initial
estimate of the number of hash batches needed, which likely plays into
this somehow.
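
Concretely, something along these lines per session (values purely
illustrative):

  SET work_mem = '256MB';   -- try several sizes and compare
  EXPLAIN (ANALYZE, BUFFERS)
    SELECT Path.PathId, File.PathId, Path.Path
      FROM Path LEFT JOIN File USING (PathId)
     WHERE File.PathId IS NULL LIMIT 300000;
  RESET work_mem;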

            regards, tom lane



Re: SELECT creates millions of temp files in a single directory

From
Peter
Date:
On Sat, Apr 23, 2022 at 02:11:00PM -0700, Adrian Klaver wrote:
! On 4/23/22 12:50, Peter wrote:
! 
! 
! >     People seem to have been brainwashed by Web-Services and OLTP,
! >     and now think the working set must always fit in memory. But this
! >     is only one possible use case, not the only one.
! 
! This is a no-win situation, as most of the complaints in recent years have been
! that Postgres was/is too conservative in its default settings and is not
! taking advantage of newer, more powerful hardware.

I know, and You got to the point; this is exactly what I am talking
about: people take the abundance of resources for granted.

In Rel. 8 postgres was a breathtaking beauty of engineering: the style
of old, driven to perfection. 
Now You're gradually sacrificing this for the speed junkies, and to
protect those who are not engineers from their mistakes.

And no, I don't know how this could be solved: the more influential
you get, the more driven you are by public demand; the less freedom
you have to follow ideals.

David Johnston thinks we must just not say this out loud; we must instead
behave as in "the emperor's new clothes", and follow google's
understanding of "positive values".
Sorry, that doesn't work for me.



Re: SELECT creates millions of temp files in a single directory

From
Adrian Klaver
Date:
On 4/23/22 14:58, Peter wrote:
> On Sat, Apr 23, 2022 at 02:11:00PM -0700, Adrian Klaver wrote:
> ! On 4/23/22 12:50, Peter wrote:
> !
> !
> ! >     People seem to have been brainwashed by Web-Services and OLTP,
> ! >     and now think the working set must always fit in memory. But this
> ! >     is only one possible use case, not the only one.
> !
> ! This is a no-win situation, as most of the complaints in recent years have been
> ! that Postgres was/is too conservative in its default settings and is not
> ! taking advantage of newer, more powerful hardware.
> 
> I know, and You got to the point; this is exactly what I am talking
> about: people take the abundance of resources for granted.

Probably because the resources are there. My phone has computing power I 
could only dream of when I was using a desktop of old.

> 
> In Rel. 8 postgres was a breathtaking beauty of engineering: the style
> of old, driven to perfection.

I would hardly call version 8.x perfection, especially the attempt at 
running on Windows natively.

> Now You're gradually sacrificing this for the speed junkies, and to
> protect those who are not engineers from their mistakes.
> 
> And no, I don't know how this could be solved: the more influential
> you get, the more driven you are by public demand; the less freedom
> you have to follow ideals.

Whose ideals? That is the issue. Postgres covers a broad spectrum of 
uses and as such you will have complaints from either end that their 
needs are not met.

> 
> David Johnston thinks we must just not say this out loud; we must instead
> behave as in "the emperor's new clothes", and follow google's
> understanding of "positive values".
> Sorry, that doesn't work for me.


No, the complaint was that your pontificating interfered with your 
problem description and got in the way of coming up with a solution.

-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: SELECT creates millions of temp files in a single directory

From
Peter
Date:
On Sat, Apr 23, 2022 at 05:13:41PM -0400, Tom Lane wrote:
! "David G. Johnston" <david.g.johnston@gmail.com> writes:
! > I'll add that, given the nature of the problem, changing temp_file_limit
! > away from its default of unlimited may be useful:
! > https://www.postgresql.org/docs/current/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-DISK
! 
! Maybe ... that limits the total space involved, not the number of
! files it's separated into, so I'm not sure how much it can help.

That's what I understood from the docs, too. What I also read in the
docs is that it will just kill the query when it hits the limit, and
this is not really what I want.

And btw, I probably lost-in-translation the relevant info about the
running version:

Name           : postgresql12-server
Version        : 12.10
Installed on   : Mon Apr  4 04:13:18 2022 CEST
Origin         : databases/postgresql12-server
Architecture   : FreeBSD:13:amd64

! It might be worth playing around to see how varying values of work_mem
! affect this behavior, too.  That should change the planner's initial
! estimate of the number of hash batches needed, which likely plays into
! this somehow.

Agreed. But then, looking at the generated filenames, in the style of
"i44297of524288.p1.0" - this is an algorithm at work, so somebody must
have designed it, and obviously wasn't bothered by creating another
half million files after having already created half a million.

So I thought I might just ask what the idea behind this is.

| > It would help if you can provide a self-contained demonstration
| > that others can then verify and debug (or explain).
|
| ... and that.  As this message stands, it's undocumented whining.
| Please see
|
| https://wiki.postgresql.org/wiki/Slow_Query_Questions
|
| for the sort of information we need to debug performance issues.

It is not a performance issue, it is a design question: You inflict
pain on my beloved ZFS, and as a lover I react. ;)

| (I recall that we have fixed some problems in the past that could  
| lead to unreasonable numbers of temp files in hash joins.  Whether
| there's more to do, or Peter is running a version that lacks those
| fixes, is impossible to tell with the given info.)

Yes, I accidentally deleted that info too when I deleted the
more extensive rants from my original posting. See above.



Re: SELECT creates millions of temp files in a single directory

From
Adrian Klaver
Date:
On 4/23/22 15:40, Peter wrote:
> On Sat, Apr 23, 2022 at 05:13:41PM -0400, Tom Lane wrote:

> And btw, I probably lost-in-translation the relevant info about the
> running version:
> 
> Name           : postgresql12-server
> Version        : 12.10
> Installed on   : Mon Apr  4 04:13:18 2022 CEST
> Origin         : databases/postgresql12-server
> Architecture   : FreeBSD:13:amd64
> 

> | Please see
> |
> | https://wiki.postgresql.org/wiki/Slow_Query_Questions
> |
> | for the sort of information we need to debug performance issues.
> 
> It is not a performance issue, it is a design question: You inflict
> pain on my beloved ZFS, and as a lover I react. ;)

Be that as it may, the requested information is still needed.



-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: SELECT creates millions of temp files in a single directory

From
Thomas Munro
Date:
On Sun, Apr 24, 2022 at 8:00 AM Peter <pmc@citylink.dinoex.sub.org> wrote:
> More than a million files in a single directory; this is
> unacceptable.

You didn't show EXPLAIN (ANALYZE) but if [Parallel] Hash is making
insane numbers of temporary files then something is not working as
intended... and I can take a guess at what it is.  I tried to recreate
this with randomly distributed file.pathid, same size tables as you,
and I got 32 batches and a nice manageable number of temporary files.
Adding millions of extra file rows with duplicate pathid=42 gets me
something like "Batches: 524288 (originally 32)" in EXPLAIN (ANALYZE)
output.  I guess that's the sort of distribution you have here?
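
For what it's worth, a reproducer of that shape looks roughly like this
(only a sketch: table shapes simplified, row counts approximated from
the stats in the original post):

  CREATE TABLE path (pathid int PRIMARY KEY, path text NOT NULL);
  CREATE TABLE file (fileid bigserial PRIMARY KEY, pathid int NOT NULL);

  INSERT INTO path SELECT g, 'some/path/' || g FROM generate_series(1, 464229) g;
  -- roughly uniform pathids: the join stays at a sane batch count
  INSERT INTO file (pathid)
    SELECT 1 + (random() * 464228)::int FROM generate_series(1, 43000000);
  -- heavy skew on a single key: the batch count explodes
  INSERT INTO file (pathid) SELECT 42 FROM generate_series(1, 5000000);

  ANALYZE path;
  ANALYZE file;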

Extensive discussion of the cause of that and potential (hard)
solutions over here:

https://www.postgresql.org/message-id/flat/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

To summarise, when the hash table doesn't fit in work_mem, then we
"partition" (spill part of the data to disk) to make twice as many
(hopefully) smaller hash tables that do fit.  Sometimes partitioning
produces one or more hash tables that are still too big because of
uneven key distribution, so we go around again, doubling the number of
partitions (and thus temporary files) every time.  I would say that
once we get past hundreds of partitions, things are really turning to
custard (file descriptors, directory entries, memory overheads, ... it
just doesn't work well anymore), but currently we don't give up until
it becomes very clear that repartitioning is not helping.
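
A quick way to check whether such a heavily duplicated key exists in the
real data (just a diagnostic; table and column names as in the original
post):

  SELECT pathid, count(*) AS n
    FROM file
   GROUP BY pathid
   ORDER BY n DESC
   LIMIT 10;

If a single pathid accounts for a large fraction of the 43M rows, no
amount of repartitioning can shrink the batch that contains it.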

This algorithmic problem existed before parallelism was added, but
it's possible that the parallel version of the meltdown hurts a lot
more (it has extra per-process files, and in multi-attempt scenarios
the useless earlier attempts hang around until the end of the query
instead of being cleaned up sooner, which doubles the number of
files).

Hopefully that gives some clues about how one might rewrite the query
to avoid a massive unsplittable set of duplicate keys in the hash table,
assuming I'm somewhere close to the explanation here (maybe some
subselect with DISTINCT or GROUP BY in it, or some way to rewrite the
query to avoid having the problematic duplicates on the "inner" side,
or a completely different plan..., or just crank up work_mem massively
for this query so that you don't need a partition step at all).
Obviously it would be nice for PostgreSQL to be hardened against this
risk, eventually, though.
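
To make the DISTINCT/GROUP BY idea concrete, one possible sketch (untested
against your data, and the planner may or may not do something smarter):

  SELECT DISTINCT p.PathId, f.PathId, p.Path
    FROM Path p
    LEFT JOIN (SELECT DISTINCT PathId FROM File) f USING (PathId)
    LEFT JOIN PathHierarchy ph ON (p.PathId = ph.PPathId)
   WHERE f.PathId IS NULL AND ph.PPathId IS NULL
   LIMIT 300000;

The point is that each pathid then appears at most once on the inner side
of the anti join, so there is no unsplittable pile of duplicates to hash.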

As for merge join planning: unfortunately merge joins aren't fully
parallelisable, and in the plan you showed a separate copy of the big
sort runs in every process, which isn't nice (we don't have a Parallel
Sort that could fix that yet).