Re: tweaking NTUP_PER_BUCKET - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: tweaking NTUP_PER_BUCKET
Msg-id: 53BEEA9E.2080009@fuzzy.cz
In response to: Re: tweaking NTUP_PER_BUCKET (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: tweaking NTUP_PER_BUCKET (Tomas Vondra <tv@fuzzy.cz>)
List: pgsql-hackers
On 9.7.2014 16:07, Robert Haas wrote:
> On Tue, Jul 8, 2014 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Thinking about this a bit more, do we really need to build the hash
>> table on the first pass? Why not do this:
>>
>> (1) batching
>>     - read the tuples, stuff them into a simple list
>>     - don't build the hash table yet
>>
>> (2) building the hash table
>>     - we have all the tuples in a simple list, batching is done
>>     - we know exact row count, can size the table properly
>>     - build the table
>
> We could do this, and in fact we could save quite a bit of memory if
> we allocated say 1MB chunks and packed the tuples in tightly instead
> of palloc-ing each one separately.  But I worry that rescanning the
> data to build the hash table would slow things down too much.
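
For clarity, here is the two-pass scheme from the quote above as a
minimal standalone sketch (all names hypothetical, not the actual
executor code):

    #include <stdlib.h>

    typedef struct Tuple
    {
        struct Tuple *next;         /* pass 1: plain list link */
        struct Tuple *bucket_next;  /* pass 2: hash-chain link */
        unsigned int  hashvalue;
        /* ... tuple data would follow ... */
    } Tuple;

    /* round up to the next power of two */
    static size_t
    next_pow2(size_t n)
    {
        size_t p = 1;
        while (p < n)
            p <<= 1;
        return p;
    }

    /*
     * Pass 2: the exact tuple count is known, so size the table for
     * ~1 tuple per bucket and build the chains in one sweep.
     */
    static Tuple **
    build_table(Tuple *list, size_t ntuples, size_t *nbuckets_out)
    {
        size_t  nbuckets = next_pow2(ntuples);
        Tuple **buckets = calloc(nbuckets, sizeof(Tuple *));

        for (Tuple *t = list; t != NULL; t = t->next)
        {
            size_t  b = t->hashvalue & (nbuckets - 1);

            t->bucket_next = buckets[b];
            buckets[b] = t;
        }
        *nbuckets_out = nbuckets;
        return buckets;
    }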

I did a quick test of how much memory we could save by this. The
attached patch densely packs the tuples into 32kB chunks (1MB seems way
too much given the small work_mem values in common use, but I guess this
might be tuned based on the number of tuples / work_mem size ...).
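
Roughly, the dense packing works like this (a minimal sketch, not the
patch itself; names hypothetical, and a real version would MAXALIGN the
sizes and handle oversized tuples):

    #include <stddef.h>
    #include <stdlib.h>

    #define CHUNK_SIZE 32768        /* 32kB chunks, as in the patch */

    typedef struct Chunk
    {
        struct Chunk *next;         /* chunk list, freed wholesale */
        size_t        used;         /* bytes consumed so far */
        char          data[];       /* tuples packed back to back */
    } Chunk;

    static Chunk *chunks = NULL;

    /*
     * Bump allocator: no per-tuple palloc header and no power-of-2
     * rounding - just append to the current chunk, starting a new
     * one when it fills up.
     */
    static void *
    dense_alloc(size_t size)
    {
        void   *ptr;

        if (chunks == NULL || chunks->used + size > CHUNK_SIZE)
        {
            Chunk *c = malloc(offsetof(Chunk, data) + CHUNK_SIZE);

            c->next = chunks;
            c->used = 0;
            chunks = c;
        }
        ptr = chunks->data + chunks->used;
        chunks->used += size;
        return ptr;
    }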

Tested on a query like this (see the first message in this thread for
how to generate the tables):

                              QUERY PLAN
-----------------------------------------------------------------------
 Aggregate  (cost=2014697.64..2014697.65 rows=1 width=33) (actual time=63796.270..63796.271 rows=1 loops=1)
   ->  Hash Left Join  (cost=318458.14..1889697.60 rows=50000016 width=33) (actual time=2865.656..61778.592 rows=50000000 loops=1)
         Hash Cond: (o.id = i.id)
         ->  Seq Scan on outer_table o  (cost=0.00..721239.16 rows=50000016 width=4) (actual time=0.033..2676.234 rows=50000000 loops=1)
         ->  Hash  (cost=193458.06..193458.06 rows=10000006 width=37) (actual time=2855.408..2855.408 rows=10000000 loops=1)
               Buckets: 1048576  Batches: 1  Memory Usage: 703125kB
               ->  Seq Scan on inner_table i  (cost=0.00..193458.06 rows=10000006 width=37) (actual time=0.044..952.802 rows=10000000 loops=1)
 Planning time: 1.139 ms
 Execution time: 63889.056 ms
(9 rows)

I.e. it creates a single batch with ~700 MB of tuples. Without the
patch, top shows this:

     VIRT    RES    SHR S  %CPU %MEM   COMMAND
  2540356 1,356g   5936 R 100,0 17,6   postgres: EXPLAIN

and the MemoryContextStats added to MultiExecHash shows this:

HashBatchContext: 1451221040 total in 182 blocks; 2826592 free (11 chunks); 1448394448 used
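
(For the record, that's just a one-off call added at the end of
MultiExecHash, something like

    MemoryContextStats(hashtable->batchCxt);

which dumps the stats of the batch context to stderr.)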

So yeah, the overhead is pretty huge in this case - basically 100%
overhead, because the inner table row width is only ~40B. With wider
rows the overhead will be lower.
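
Just to check the arithmetic: 1451221040 B / 10M rows is ~145 B/row,
while the dense version below comes out at ~73 B/row. That's about what
you'd expect from separate pallocs, assuming the usual 64-bit aset.c
layout: a ~73B request gets rounded up to the next power of two (128B)
and pays a 16B chunk header, i.e. ~144B per tuple.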

Now, with the patch it looks like this:

     VIRT    RES    SHR S  %CPU %MEM   COMMAND
  1835332 720200   6096 R 100,0  8,9   postgres: EXPLAIN

HashBatchContext: 729651520 total in 21980 blocks; 0 free (0 chunks); 729651520 used

So, pretty much no overhead at all. It was slightly faster too (~5%),
but I haven't done much testing, so it might be a measurement error.

This patch is pretty independent of the other changes discussed here
(tweaking NTUP_PER_BUCKET / nbuckets) so I'll keep it separate.

regards
Tomas

