Re: Hash join on int takes 8..114 seconds - Mailing list pgsql-performance

From Tomas Vondra
Subject Re: Hash join on int takes 8..114 seconds
Msg-id 4929E9C4.1010706@fuzzy.cz
In response to Re: Hash join on int takes 8..114 seconds  ("Andrus" <kobruleht2@hot.ee>)
Responses Re: Hash join on int takes 8..114 seconds
List pgsql-performance
> Scott,
>
> thank you.
>
>>> > work_mem = 512
>> This is very easy to try.  You can change work_mem for just a single
>> session, and this can in some cases help performance quite a bit, and in
>> others not at all.
>> I would not recommend having it lower than at least 4MB on a server like
>> that unless you have a lot of concurrently active queries / connections.
>> To try it, simply use the SET command.  To try out 32MB, just do:
>> SET work_mem = '32MB';
>
> 8.1.4 requires an integer value.
> SET work_mem = 33554432;
> causes:
> ERROR: 33554432 is outside the valid range for parameter "work_mem" (64 ..
> 2097151)
>
> So the max allowed value seems to be 2 MB.
> I tested it while this server was idle, running both queries several times
> after VACUUM FULL had finished.
>
> SET work_mem = 2097151;


No, not really. The work_mem value is specified in kilobytes, so you've
set it to 2GB :-)
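Since 8.1 only accepts a plain integer here, the value has to be given in kilobytes. A sketch of trying 32MB for the current session only:

```sql
-- On 8.1, work_mem takes an integer number of kilobytes
-- (unit suffixes like '32MB' were only accepted in later releases)
SET work_mem = 32768;   -- 32 * 1024 kB = 32MB, for this session only
SHOW work_mem;          -- verify what the session is using
RESET work_mem;         -- go back to the postgresql.conf default
```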

> set search_path to firma2,public;
> explain analyze
> SELECT sum(1)
>   FROM dok JOIN rid USING (dokumnr)
>   JOIN toode USING (toode)
>   WHERE rid.toode='X05' AND dok.kuupaev>='2008-09-01'
> "Aggregate  (cost=177291.36..177291.37 rows=1 width=0) (actual
> time=5153.856..5153.859 rows=1 loops=1)"
> "  ->  Nested Loop  (cost=73607.45..177229.96 rows=24561 width=0) (actual
> time=1395.935..5071.247 rows=21541 loops=1)"
> "        ->  Index Scan using toode_pkey on toode  (cost=0.00..6.01 rows=1
> width=24) (actual time=0.078..0.087 rows=1 loops=1)"
> "              Index Cond: ('X05'::bpchar = toode)"
> "        ->  Hash Join  (cost=73607.45..176978.33 rows=24561 width=24)
> (actual time=1395.836..4911.425 rows=21541 loops=1)"
> "              Hash Cond: ("outer".dokumnr = "inner".dokumnr)"
> "              ->  Bitmap Heap Scan on rid  (cost=4083.10..101802.05
> rows=270316 width=28) (actual time=238.578..2367.189 rows=278247 loops=1)"
> "                    Recheck Cond: (toode = 'X05'::bpchar)"
> "                    ->  Bitmap Index Scan on rid_toode_idx
> (cost=0.00..4083.10 rows=270316 width=0) (actual time=150.868..150.868
> rows=278248 loops=1)"
> "                          Index Cond: (toode = 'X05'::bpchar)"
> "              ->  Hash  (cost=69242.72..69242.72 rows=112651 width=4)
> (actual time=1146.081..1146.081 rows=110170 loops=1)"
> "                    ->  Bitmap Heap Scan on dok  (cost=1492.28..69242.72
> rows=112651 width=4) (actual time=46.210..696.803 rows=110170 loops=1)"
> "                          Recheck Cond: (kuupaev >= '2008-09-01'::date)"
> "                          ->  Bitmap Index Scan on dok_kuupaev_idx
> (cost=0.00..1492.28 rows=112651 width=0) (actual time=33.938..33.938
> rows=110232 loops=1)"
> "                                Index Cond: (kuupaev >=
> '2008-09-01'::date)"
> "Total runtime: 5154.911 ms"
>
>
> SET work_mem to default;
> set search_path to firma2,public;
> explain analyze
> SELECT sum(1)
>   FROM dok JOIN rid USING (dokumnr)
>   JOIN toode USING (toode)
>   WHERE rid.toode='X05' AND dok.kuupaev>='2008-09-01'
> "Aggregate  (cost=181869.36..181869.37 rows=1 width=0) (actual
> time=7807.867..7807.871 rows=1 loops=1)"
> "  ->  Nested Loop  (cost=74048.45..181807.96 rows=24561 width=0) (actual
> time=2607.429..7728.138 rows=21541 loops=1)"
> "        ->  Index Scan using toode_pkey on toode  (cost=0.00..6.01 rows=1
> width=24) (actual time=0.079..0.091 rows=1 loops=1)"
> "              Index Cond: ('X05'::bpchar = toode)"
> "        ->  Hash Join  (cost=74048.45..181556.33 rows=24561 width=24)
> (actual time=2607.332..7569.612 rows=21541 loops=1)"
> "              Hash Cond: ("outer".dokumnr = "inner".dokumnr)"
> "              ->  Bitmap Heap Scan on rid  (cost=4083.10..101802.05
> rows=270316 width=28) (actual time=1147.071..4528.845 rows=278247 loops=1)"
> "                    Recheck Cond: (toode = 'X05'::bpchar)"
> "                    ->  Bitmap Index Scan on rid_toode_idx
> (cost=0.00..4083.10 rows=270316 width=0) (actual time=1140.337..1140.337
> rows=278248 loops=1)"
> "                          Index Cond: (toode = 'X05'::bpchar)"
> "              ->  Hash  (cost=69242.72..69242.72 rows=112651 width=4)
> (actual time=1240.988..1240.988 rows=110170 loops=1)"
> "                    ->  Bitmap Heap Scan on dok  (cost=1492.28..69242.72
> rows=112651 width=4) (actual time=68.053..769.448 rows=110170 loops=1)"
> "                          Recheck Cond: (kuupaev >= '2008-09-01'::date)"
> "                          ->  Bitmap Index Scan on dok_kuupaev_idx
> (cost=0.00..1492.28 rows=112651 width=0) (actual time=61.358..61.358
> rows=110232 loops=1)"
> "                                Index Cond: (kuupaev >=
> '2008-09-01'::date)"
> "Total runtime: 7808.174 ms"
>
> In both cases vmstat 2 shows only CPU activity while the queries are running:
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
> ...
>  0  0    232 123692      0 1891044    0    0     0     0  252    17  0  0 100  0
>  0  0    232 123692      0 1891044    0    0     0     0  252    17  0  0 100  0
>  0  0    232 123684      0 1891044    0    0     0   162  254    22  0  0 100  0
>  0  0    232 123684      0 1891044    0    0     0     0  252    18  0  0 100  0
>  1  0    232 123056      0 1891444    0    0     0    13  254    21 62  5 34  0  <---- start of slower query
>  1  0    232 102968      0 1911060    0    0     0    16  252    18 26 75  0  0
>  1  0    232  77424      0 1936996    0    0     0     0  252    18 37 63  0  0
>  1  0    232  71464      0 1941928    0    0     0    73  260    34 38 62  0  0
>  0  0    232 123420      0 1891044    0    0     0    32  257    31  8 15 77  0  <-------- end of slower query
>  0  0    232 123420      0 1891044    0    0     0    25  255    24  0  0 100  0
>  0  0    232 123420      0 1891044    0    0     0    28  255    27  0  0 100  0
>

Well, this parameter specifies how much memory may be used for an
in-memory sort or hash table. If more memory is needed, the sort / hash
join spills to disk.

The difference in speed is not huge, so I guess that in this case the
hash table (used for the join) probably fits even into the default
value of 1024kB.
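For illustration, the two phases of such a hash join can be sketched in Python (toy data, with column names borrowed from the plan above; the real executor's work_mem accounting and batching to disk are omitted):

```python
# Toy sketch of the hash join from the plan: build a hash table on the
# smaller input (dok, filtered by date), then probe it with the larger
# one (rid, filtered by toode).
def hash_join(build_rows, probe_rows, key):
    table = {}
    for row in build_rows:                      # build phase - must fit in work_mem
        table.setdefault(row[key], []).append(row)
    for row in probe_rows:                      # probe phase
        for match in table.get(row[key], []):
            yield {**match, **row}

dok = [{"dokumnr": 1}, {"dokumnr": 2}]          # made-up sample rows
rid = [{"dokumnr": 1, "toode": "X05"}, {"dokumnr": 3, "toode": "X05"}]
print(sum(1 for _ in hash_join(dok, rid, "dokumnr")))  # -> 1 matching row
```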

> Is it safe to set
>
> work_mem = 2097151
>
> in postgresql.conf file ?

Each sort or hash in a query uses its own memory area of up to work_mem.
So if you have 10 sessions each running a query with 2 sorts, that may
occupy

    10 * 2 * work_mem

of memory. Suppose you set a more reasonable value (say 8096kB) instead
of 2GB - that gives about 160MB in this scenario.
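As a quick sanity check of that arithmetic (the 10 sessions and 2 sorts per query are just the example numbers from this paragraph):

```python
# Worst-case memory consumed by sort/hash areas across all sessions.
def worst_case_sort_memory_kb(sessions, sorts_per_query, work_mem_kb):
    return sessions * sorts_per_query * work_mem_kb

total_kb = worst_case_sort_memory_kb(10, 2, 8096)   # work_mem = 8096 kB
print(total_kb, "kB ~", total_kb // 1024, "MB")     # 161920 kB ~ 158 MB
```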

Anyway, it depends: if you have a lot of slow queries caused by on-disk
sorts / hash tables, use a higher value. Otherwise leave it as it is.

>> First, demonstrate that it is all or mostly in memory -- use iostat or
>> other tools to ensure that there is not much disk activity during the
>> query.  If your system doesn't have iostat installed, it should be
>> installed.  It is a very useful tool.
>
> # iostat
> bash: iostat: command not found
> # locate iostat
> /usr/src/linux-2.6.16-gentoo-r9/Documentation/iostats.txt
>
> I have little experience with Linux. No idea how to install or compile
> iostat on this system.
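Judging from the kernel path above this is a Gentoo box, where iostat comes from the sysstat package. A possible way to get it (assuming portage is usable on this machine):

```shell
# Install the sysstat package, which provides iostat (Gentoo / portage)
emerge sysstat

# Then watch extended per-device stats, refreshing every 2 seconds
iostat -x 2
```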
>
>> If it is all cached in memory, you may want to ensure that your
>> shared_buffers is a reasonalbe size so that there is less shuffling of
>> data
>> from the kernel to postgres and back.  Generally, shared_buffers works
>> best
>> between 5% and 25% of system memory.
>
> currently shared_buffers = 15000

That's about 120MB, i.e. about 6% of the memory. It might go a little
higher, but it seems reasonable.
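The arithmetic behind that, assuming roughly 2GB of RAM in the box (an assumption based on the vmstat output above):

```python
# shared_buffers is counted in 8kB pages in 8.1-era postgresql.conf
PAGE_KB = 8
buffers_mb = 15000 * PAGE_KB / 1024     # ~117 MB, i.e. roughly "120MB"
ram_mb = 2048                           # assumed: ~2GB of RAM
pct = 100 * buffers_mb / ram_mb
print(round(buffers_mb), "MB =", round(pct), "% of RAM")
```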

>> If it is completely CPU bound then the work done for the query has to be
>> reduced by altering the plan to a more optimal one or making the work it
>> has to do at each step easier.  Most of the ideas in this thread revolve
>> around those things.
>
> When running on a loaded server, even after VACUUM FULL, the response
> time with the original work_mem is longer, probably because blocks must
> be fetched from disk.
>
>> Based on the time it took to do the vacuum, I suspect your disk subsystem
>> is a bit slow.  If it can be determined that there is much disk I/O in
>> your
>> use cases, there are generally several things that can be done to tune
>> Linux I/O.  The main ones in my experience are the 'readahead' value for
>> each disk which helps sequential reads significantly, and trying out the
>> linux 'deadline' scheduler and comparing it to the more commonly used
>> 'cfq'
>> scheduler.  If the system is configured with the anticipatory scheduler,
>> absolutely switch to cfq or deadline as the anticipatory scheduler will
>> perform horribly poorly for a database.
>
> This is a 3-year-old cheap server.
> No idea what to configure.

Well, how could we know that if you don't?

Anyway, the options mentioned by Scott Carey are Linux kernel settings,
so they shouldn't be a problem to change.
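For reference, the scheduler and readahead settings Scott mentioned can be inspected and changed from the shell at runtime (a sketch assuming the disk is /dev/sda - adjust the device name to the actual disk; requires root):

```shell
# Show the available I/O schedulers; the active one is in brackets
cat /sys/block/sda/queue/scheduler

# Switch this disk to the deadline scheduler
echo deadline > /sys/block/sda/queue/scheduler

# Readahead: show the current value (in 512-byte sectors), then raise it
blockdev --getra /dev/sda
blockdev --setra 4096 /dev/sda
```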

> There is also another similar server which has 1.2 GB more usable memory.
> No idea whether it is worth switching to it.
> After some years the sales data will exceed that extra memory anyway.

Given that the performance issues are caused by bloated tables and / or
a slow I/O subsystem, I guess moving to a similar system won't help.

regards
Tomas
