Thread: hash join vs nested loop join
Hello All
While investigating switching to Postgres, we came across a query plan that uses a hash join and is a lot slower than a nested loop join.
I don't understand why the optimiser chooses the hash join in favor of the nested loop. What can I do to get the optimiser to make a better decision (a nested loop, in this case)? I have run ANALYZE on both tables.
The query is,
/*
smalltable has about 48,000 records.
bigtable has about 168,000,000 records.
invtranref is char(10) and is the primary key for both tables
*/
SELECT
*
FROM IM_Match_Table smalltable
inner join invtran bigtable on
bigtable.invtranref = smalltable.invtranref

The hash join plan is,

"Hash Join (cost=1681.87..6414169.04 rows=48261 width=171)"
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref, bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, bigtable.createmode, bigtable.transtat, bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"
" Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
" -> Seq Scan on public.invtran bigtable (cost=0.00..4730787.28 rows=168121728 width=108)"
" Output: bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, bigtable.createmode, bigtable.transtat, bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"
" -> Hash (cost=1078.61..1078.61 rows=48261 width=63)"
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref"
" -> Seq Scan on public.im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63)"
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref"
The nested loop join plan is,

"Nested Loop (cost=0.00..12888684.07 rows=48261 width=171)"
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref, bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, bigtable.createmode, bigtable.transtat, bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"
" -> Seq Scan on public.im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63)"
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref"
" -> Index Scan using pk_invtran on public.invtran bigtable (cost=0.00..267.03 rows=1 width=108)"
" Output: bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, bigtable.createmode, bigtable.transtat, bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"
" Index Cond: (bigtable.invtranref = smalltable.invtranref)"

The version is PostgreSQL 9.2.0 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit. Server specs are:
- Centos, ext4
- 24GB memory
- 6 cores hyper-threaded (Intel(R) Xeon(R) CPU E5645).
- raid 10 on 4 sata disks
Config changes are
- shared_buffers = 6GB
- effective_cache_size = 18GB
- work_mem = 10MB
- maintenance_work_mem = 3GB
Many Thanks
Huan
On Dec 12, 2012, at 8:25 AM, Huan Ruan <leohuanruan@gmail.com> wrote:
> Hello All
>
> While investigating switching to Postgres, we come across a query plan that uses hash join and is a lot slower than a nested loop join.
>
> I don't understand why the optimiser chooses the hash join in favor of the nested loop. What can I do to get the optimiser to make a better decision (nested loop in this case)? I have run analyze on both tables.
The optimiser thinks that the nested loop is more expensive because of the point PK lookups, which are random I/O.
Can you set random_page_cost to 2 or 3 and try again?
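[Editor's note: a minimal sketch of this experiment, using the query from the original post. random_page_cost is a per-session GUC, so the change can be scoped to one connection while testing.]

```sql
-- Try a lower random I/O penalty for this session only, then re-plan.
SET random_page_cost = 2;   -- server default is 4

EXPLAIN
SELECT *
FROM IM_Match_Table smalltable
INNER JOIN invtran bigtable
        ON bigtable.invtranref = smalltable.invtranref;

RESET random_page_cost;     -- restore the server default afterwards
```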
On Dec 12, 2012, at 8:57 AM, Evgeny Shishkin <itparanoia@gmail.com> wrote:
On Dec 12, 2012, at 8:44 AM, Huan Ruan <huan.ruan.it@gmail.com> wrote:

> On 12 December 2012 15:33, Evgeny Shishkin <itparanoia@gmail.com> wrote:
>> Optimiser thinks that nested loop is more expensive, because of point PK lookups, which are random I/O.
>> Can you set random_page_cost to 2 or 3 and try again?
>
> Hi Evgeny
>
> Thanks for the quick reply. Setting random_page_cost to 3 doesn't make a difference, but 2 makes the optimiser choose the nested loop. However, with such a small penalty for random I/O, I worry that this setting will make other small queries incorrectly use an index where a sequential scan would be better. I understand random I/O is expensive, but in this case the optimiser already knows the big table is really big; shouldn't it consider that a sequential scan of 170 million records will be slower than an index lookup of 50,000 records? Any thoughts?
>
> Thanks
> Huan

Yes, this is the most common issue for me. Usually you just have to find the right combination of random and seq scan costs, shared_buffers and effective_cache_size.

If some of the queries work well with another value of, say, random_page_cost, then, since it is a per-session parameter, you can SET it in your session before the query. But over time your table may change in size and distribution and everything breaks, not to mention the general ugliness from an application standpoint. Maybe somebody more experienced can help.

Also you can set different costs per tablespace.
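[Editor's note: a sketch of the two mechanisms mentioned above. The tablespace name is hypothetical; `fast_disks` stands in for wherever the tables live.]

```sql
-- Per-session: scope the override to one transaction so it cannot leak
-- into later queries on the same connection.
BEGIN;
SET LOCAL random_page_cost = 2;
-- ... run the query that prefers the nested loop here ...
COMMIT;   -- SET LOCAL is undone when the transaction ends

-- Per-tablespace: page costs can be attached to a tablespace, e.g. for
-- data on faster storage (tablespace name is an assumption).
ALTER TABLESPACE fast_disks
    SET (random_page_cost = 2, seq_page_cost = 1);
```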
Added CC.
On Tue, Dec 11, 2012 at 8:25 PM, Huan Ruan <leohuanruan@gmail.com> wrote:

> While investigating switching to Postgres, we come across a query plan that
> uses hash join and is a lot slower than a nested loop join.
..
> " -> Index Scan using pk_invtran on public.invtran bigtable (cost=0.00..267.03 rows=1 width=108)"

This looks like the same large-index over-penalty as discussed in the recent thread "[PERFORM] Slow query: bitmap scan troubles". Back-patching the log(npages) change is starting to look like a good idea.

Cheers,

Jeff
Huan Ruan wrote:

> is a lot slower than a nested loop join.

Giving actual numbers is more useful than terms like "a lot". Even better is to provide the output of EXPLAIN ANALYZE rather than just EXPLAIN. This shows estimates against actual numbers, and gives timings. For more suggestions see this page:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

> I don't understand why the optimiser chooses the hash join in
> favor of the nested loop. What can I do to get the optimiser to
> make a better decision (nested loop in this case)? I have run
> analyze on both tables.
>
> Config changes are
>
> - shared_buffers = 6GB
> - effective_cache_size = 18GB
> - work_mem = 10MB
> - maintenance_work_mem = 3GB

As already suggested, there was a change made in 9.2 which may have over-penalized nested loops using index scans. This may be fixed in the next minor release.

Also, as already suggested, you may want to reduce random_page_cost, to bring it in line with the actual cost relative to seq_page_cost based on your cache hit ratio.

Additionally, I just routinely set cpu_tuple_cost higher than the default of 0.01. I find that 0.03 to 0.05 better models the actual relative cost of processing a tuple.

-Kevin
Huan Ruan wrote:

> Hash 1st run
> "Hash Join (cost=1681.87..6414169.04 rows=48261 width=171)
> (actual time=2182.450..88158.645 rows=48257 loops=1)"
> " -> Seq Scan on invtran bigtable (cost=0.00..4730787.28
> rows=168121728 width=108) (actual time=0.051..32581.052
> rows=168121657 loops=1)"

194 nanoseconds per row suggests 100% cache hits.

> NL 1st run
> "Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual
> time=0.056..551.438 rows=48257 loops=1)"
> " -> Index Scan using pk_invtran on invtran bigtable
> (cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010
> rows=1 loops=48261)"

10 microseconds per index scan (each index scan requiring multiple "random" accesses) also suggests 100% cache hits.

> I originally reduced random_page_cost to 2 to achieve the nested
> loop join. Now I set cpu_tuple_cost to 0.05 and reset
> random_page_cost back to 4, I can also achieve a nested loop
> join.
>
> I'm still new in Postgres, but I'm worried about random_page_cost
> being 2 is too low, so maybe increasing cpu_tuple_cost is a
> better choice.

If these are typical of what you would expect in production, then the fact that with default cost factors the costs are barely different (by 0.6%) for actual run times which differ by two orders of magnitude (the chosen plan is 160 times slower) means that the modeling of cost factors is off by a lot.

If you expect the active portion of your database to be fully cached like this, it makes sense to reduce random_page_cost to be equal to seq_page_cost. But that only adjusts the costs by at most a factor of four, and we've established that in the above query they're off by a factor of 160. To help make up the difference, it makes sense to de-emphasize page access compared to cpu-related costs by reducing both page costs to 0.1. Combined, these adjustments still can't compensate for how far off the estimate was.

In my experience default cpu_tuple_cost is understated compared to other cpu-related costs, so I would do the above *plus* a boost to cpu_tuple_cost. Personally, I have never seen a difference between plans chosen with that set to 0.03 and 0.05, so I can't say where in that range is the ideal value; you should feel free to experiment if there is a query which seems to be choosing a bad plan. If the above results really do represent cache hit levels you expect in production, the combination of the above changes should come reasonably close to modeling costs realistically, resulting in better plan choice.

If you don't expect such high cache hit ratios in production, you probably don't want to go so low with page costs.

>>> - shared_buffers = 6GB
>>> - effective_cache_size = 18GB
>>> - work_mem = 10MB
>>> - maintenance_work_mem = 3GB

> Can you see any obvious issues with the other memory settings I
> changed?

I might bump up work_mem to 20MB to 60MB, as long as you're not going crazy with max_connections. I would probably take maintenance_work_mem down to 1GB to 2GB -- you can have several of these allocations at one time, and you don't want to blow away your cache. (I think it might actually be adjusted down to 2GB internally anyway; but I would need to check.)

-Kevin
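[Editor's note: the advice above, collected as session-level statements for experimentation. The specific values are examples drawn from the ranges Kevin suggests, not settings he prescribes verbatim.]

```sql
-- Kevin's suggestions, applied per session while experimenting:
SET seq_page_cost = 0.1;       -- fully cached: de-emphasize page access
SET random_page_cost = 0.1;    -- no random-vs-sequential gap when cached
SET cpu_tuple_cost = 0.03;     -- anywhere in 0.03..0.05 per the advice
SET work_mem = '40MB';         -- somewhere in the suggested 20MB..60MB
SET maintenance_work_mem = '2GB';
```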
On 13 December 2012 03:28, Jeff Janes <jeff.janes@gmail.com> wrote:
> This looks like the same large-index over-penalty as discussed in the
> recent thread "[PERFORM] Slow query: bitmap scan troubles".
>
> Back-patching the log(npages) change is starting to look like a good idea.
>
> Cheers,
>
> Jeff
Thanks for the information Jeff. That does seem to be related.
Hi Kevin
On 13 December 2012 10:47, Kevin Grittner <kgrittn@mail.com> wrote:
Huan Ruan wrote:
> is a lot slower than a nested loop join.

Giving actual numbers is more useful than terms like "a lot". Even
better is to provide the output of EXPLAIN ANALYZE rather than
just EXPLAIN. This shows estimates against actual numbers, and gives
timings. For more suggestions see this page:
http://wiki.postgresql.org/wiki/SlowQueryQuestions
You are right, my information wasn't accurate. I was a bit slack and cancelled the slower run. The full outputs are:
Hash 1st run
"QUERY PLAN"
"Hash Join (cost=1681.87..6414169.04 rows=48261 width=171) (actual time=2182.450..88158.645 rows=48257 loops=1)"
" Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
" Buffers: shared hit=3950 read=3046219"
" -> Seq Scan on invtran bigtable (cost=0.00..4730787.28 rows=168121728 width=108) (actual time=0.051..32581.052 rows=168121657 loops=1)"
" Buffers: shared hit=3351 read=3046219"
" -> Hash (cost=1078.61..1078.61 rows=48261 width=63) (actual time=21.751..21.751 rows=48261 loops=1)"
" Buckets: 8192 Batches: 1 Memory Usage: 4808kB"
" Buffers: shared hit=596"
" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63) (actual time=0.007..8.299 rows=48261 loops=1)"
" Buffers: shared hit=596"
"Total runtime: 88162.417 ms"
Hash 2nd run (after disconnect and reconnect)
"QUERY PLAN"
"Hash Join (cost=1681.87..6414169.04 rows=48261 width=171) (actual time=2280.390..87934.540 rows=48257 loops=1)"
" Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
" Buffers: shared hit=3982 read=3046187"
" -> Seq Scan on invtran bigtable (cost=0.00..4730787.28 rows=168121728 width=108) (actual time=0.052..32747.805 rows=168121657 loops=1)"
" Buffers: shared hit=3383 read=3046187"
" -> Hash (cost=1078.61..1078.61 rows=48261 width=63) (actual time=62.161..62.161 rows=48261 loops=1)"
" Buckets: 8192 Batches: 1 Memory Usage: 4808kB"
" Buffers: shared hit=596"
" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63) (actual time=0.006..8.209 rows=48261 loops=1)"
" Buffers: shared hit=596"
"Total runtime: 87938.584 ms"
NL 1st run
"QUERY PLAN"
"Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual time=0.056..551.438 rows=48257 loops=1)"
" Buffers: shared hit=242267"
" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63) (actual time=0.009..7.353 rows=48261 loops=1)"
" Buffers: shared hit=596"
" -> Index Scan using pk_invtran on invtran bigtable (cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010 rows=1 loops=48261)"
" Index Cond: (invtranref = smalltable.invtranref)"
" Buffers: shared hit=241671"
"Total runtime: 555.336 ms"
NL 2nd run (after disconnect and reconnect)
"QUERY PLAN"
"Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual time=0.058..554.215 rows=48257 loops=1)"
" Buffers: shared hit=242267"
" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63) (actual time=0.009..7.416 rows=48261 loops=1)"
" Buffers: shared hit=596"
" -> Index Scan using pk_invtran on invtran bigtable (cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010 rows=1 loops=48261)"
" Index Cond: (invtranref = smalltable.invtranref)"
" Buffers: shared hit=241671"
"Total runtime: 558.095 ms"
> I don't understand why the optimiser chooses the hash join in
> favor of the nested loop. What can I do to get the optimiser to
> make a better decision (nested loop in this case)? I have run
> analyze on both tables.
>
> Config changes are
>
> - shared_buffers = 6GB
> - effective_cache_size = 18GB
> - work_mem = 10MB
> - maintenance_work_mem = 3GB
As already suggested, there was a change made in 9.2 which may have
over-penalized nested loops using index scans. This may be fixed in
the next minor release.
Will keep this in mind.
Also, as already suggested, you may want to reduce random_page
cost, to bring it in line with the actual cost relative to
seq_page_cost based on your cache hit ratio.
Additionally, I just routinely set cpu_tuple_cost higher than the
default of 0.01. I find that 0.03 to 0.05 better models the actual
relative cost of processing a tuple.
I originally reduced random_page_cost to 2 to achieve the nested loop join. Now that I've set cpu_tuple_cost to 0.05 and reset random_page_cost back to 4, I can also achieve a nested loop join.
I'm still new to Postgres, but I'm worried that a random_page_cost of 2 is too low, so maybe increasing cpu_tuple_cost is a better choice. All this tuning probably also depends on the possible fix mentioned above. Can you see any obvious issues with the other memory settings I changed?
Thanks for your help.
Cheers
Huan
-Kevin
Hi Kevin
Again, many thanks for your time and help.
On 14 December 2012 02:26, Kevin Grittner <kgrittn@mail.com> wrote:
Huan Ruan wrote:
> Hash 1st run
> "Hash Join (cost=1681.87..6414169.04 rows=48261 width=171)
> (actual time=2182.450..88158.645 rows=48257 loops=1)"
> " -> Seq Scan on invtran bigtable (cost=0.00..4730787.28
> rows=168121728 width=108) (actual time=0.051..32581.052
> rows=168121657 loops=1)"

194 nanoseconds per row suggests 100% cache hits.

> NL 1st run
> "Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual
> time=0.056..551.438 rows=48257 loops=1)"
> " -> Index Scan using pk_invtran on invtran bigtable
> (cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010
> rows=1 loops=48261)"

10 microseconds per index scan (each index scan requiring multiple
"random" accesses) also suggests 100% cache hits.
Interesting to see how you derived 100% cache hits. I assume by 'cache' you mean the pg shared buffer plus the OS cache? Because the table is 23GB but the shared buffer is only 6GB. Even then, I'm not completely convinced because the total RAM is just 24GB, part of which will have to be used for other data and indexes.
I read somewhere that a pg shared buffer that's too big can hurt performance, and that it's better to just leave it to the OS cache. I'm not sure why, but for now I've configured the shared buffer to be 1/4 of the total RAM.
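[Editor's note: one way to check part of this from PostgreSQL's own counters. This only sees the shared-buffer hit ratio; reads that miss shared_buffers may still be OS-cache hits, so it is a lower bound on total caching.]

```sql
-- Shared-buffer hit ratio for the two tables in question
SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 4) AS hit_ratio
FROM pg_statio_user_tables
WHERE relname IN ('invtran', 'im_match_table');
```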
> I originally reduced random_page_cost to 2 to achieve the nested
> loop join. Now I set cpu_tuple_cost to 0.05 and reset
> random_page_cost back to 4, I can also achieve a nested loop
> join.
>
> I'm still new in Postgres, but I'm worried about random_page_cost
> being 2 is too low, so maybe increasing cpu_tuple_cost is a
> better choice.

If these are typical of what you would expect in production, then
the fact that with default cost factors the costs are barely
different (by 0.6%) for actual run times which differ by two orders
of magnitude (the chosen plan is 160 times slower) means that the
modeling of cost factors is off by a lot.

If you expect the active portion of your database to be fully
cached like this, it makes sense to reduce random_page_cost to be
equal to seq_page_cost. But that only adjusts the costs by at most
a factor of four, and we've established that in the above query
they're off by a factor of 160. To help make up the difference, it
makes sense to de-emphasize page access compared to cpu-related
costs by reducing both page costs to 0.1. Combined, these
adjustments still can't compensate for how far off the estimate
was.
In my experience default cpu_tuple_cost is understated compared to
other cpu-related costs, so I would do the above *plus* a boost to
cpu_tuple_cost. Personally, I have never seen a difference between
plans chosen with that set to 0.03 and 0.05, so I can't say where
in that range is the ideal value; you should feel free to
experiment if there is a query which seems to be choosing a bad
plan. If the above results really do represent cache hit levels you
expect in production, the combination of the above changes should
come reasonably close to modeling costs realistically, resulting in
better plan choice.
In production, about 60% of the database would fit in RAM, but roughly all of the active data we need should fit 100%. On the test server I'm playing with now, RAM is only 8% of the database size. Nonetheless, I will play with these parameters as you suggested.
I was wondering, on our production server where effective_cache_size will be much bigger, will pg then guess that most data is probably cached anyway and lean towards a nested loop join rather than the scan for a hash join?
Even on a test server where the cache hit rate is much smaller, for a big table like this, under what circumstances will a hash join perform better than a nested loop join?
If you don't expect such high cache hit ratios in production, you
probably don't want to go so low with page costs.
>>> - shared_buffers = 6GB
>>> - effective_cache_size = 18GB
>>> - work_mem = 10MB
>>> - maintenance_work_mem = 3GB

> Can you see any obvious issues with the other memory settings I
> changed?

I might bump up work_mem to 20MB to 60MB, as long as you're not
going crazy with max_connections. I would probably take
maintenance_work_mem down to 1GB to 2GB -- you can have several of
these allocations at one time, and you don't want to blow away your
cache. (I think it might actually be adjusted down to 2GB
internally anyway; but I would need to check.)
Yes, I bumped up work_mem yesterday to 80MB to speed up another big GROUP BY query. I assumed this memory is only used if the query needs it, and is released as soon as the query finishes, so it won't be too much of an issue as long as I don't have too many concurrent sorting queries (which is true in our production). Is this correct?
I increased maintenance_work_mem initially to speed up the index creation when I first pumped in the data. In the production environment we don't create indexes at run time, so I think only vacuum and analyze will consume this memory?
Thanks
Huan
-Kevin
Huan Ruan wrote:

> Interesting to see how you derived 100% cache hits. I assume by 'cache' you
> mean the pg shared buffer plus the OS cache? Because the table is 23GB but
> the shared buffer is only 6GB. Even then, I'm not completely convinced
> because the total RAM is just 24GB, part of which will have to be used for
> other data and indexes.

Well, you can't have more than a few disk hits, which typically take something like 10 ms each, and still get an average less than 200 nanoseconds.

> I read somewhere that a pg shared buffer that's too big can hurt the
> performance and it's better just leave it to the OS cache. I'm not sure why
> but for now, I just configured the shared buffer to be 1/4 of the total RAM.

PostgreSQL goes through the OS and its filesystems, unlike some other products. The right balance of memory in the PostgreSQL shared buffers versus what is left for a combination of OS caching and other memory allocations can be hard to determine. 25% is a good starting point, but your best performance might be higher or lower. It's a good idea to try incremental adjustments using your actual workload. Just remember you need to allow enough for several maintenance_work_mem allocations, about one work_mem allocation per max_connections setting, plus a reasonable OS cache size.

> I was wondering on our production server where the effective_cache_size will
> be much bigger, will pg then guess that probably most data is cached anyway
> therefore leaning towards nested loop join rather than a scan for hash join?

Once effective_cache_size is larger than your largest index, its exact value doesn't matter all that much.

> Even on a test server where the cache hit rate is much smaller, for a big
> table like this, under what circumstances, will a hash join perform better
> than nested loop join though?

With a low cache hit rate, that would generally be when the number of lookups into the table exceeds about 10% of the table's rows.
> Yes, I had bumped up work_mem yesterday to speed up another big group by
> query. I used 80MB. I assumed this memory will only be used if the query
> needs it and will be released as soon as it's finished, so it won't be too
> much an issue as long as I don't have too many concurrently sorting queries
> running (which is true in our production). Is this correct?

Each connection running a query can allocate one work_mem allocation per plan node (depending on node type), which will be freed after the query completes. A common "rule of thumb" is to plan on peaks of max_connections allocations of work_mem.

> I increased maintenance_work_mem initially to speed up the index creation
> when I first pump in the data. In production environment, we don't do run
> time index creation, so I think only the vacuum and analyze will consume
> this memory?

You'll probably be creating indexes from time to time. Figure an occasional one of those plus up to one allocation per autovacuum worker (and you probably shouldn't go below three of those).

-Kevin
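[Editor's note: the rule of thumb above as a back-of-envelope query against pg_settings. It assumes nothing beyond the standard catalog view; work_mem's setting is reported in kB.]

```sql
-- Worst-case sort/hash memory per the rule of thumb in this thread:
-- roughly one work_mem allocation per connection at peak.
SELECT pg_size_pretty(
         (SELECT setting::bigint * 1024       -- work_mem, kB -> bytes
          FROM pg_settings WHERE name = 'work_mem')
       * (SELECT setting::bigint
          FROM pg_settings WHERE name = 'max_connections')
       ) AS worst_case_work_mem;
```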
> With a low cache hit rate, that would generally be when the number
> of lookups into the table exceeds about 10% of the table's rows.
So far, my main performance issue comes down to this pattern where Postgres chooses a hash join that's slower than a nested loop index join. By changing those cost parameters, this query works as expected now, but there are others that fall into the same category and appear to be harder to convince the optimiser about.
I'm still a bit worried about this query, as Postgres gets the record count right and knows the index is a primary key index, so it knows it's 0.05m out of 170m records (0.03%), but it still chooses the sequential scan. Hopefully this is just related to that big-index penalty bug introduced in 9.2.
Huan Ruan wrote:

> Kevin Grittner wrote:
>> With a low cache hit rate, that would generally be when the number
>> of lookups into the table exceeds about 10% of the table's rows.
>
> So far, my main performance issue comes down to this pattern where
> Postgres chooses hash join that's slower than a nest loop indexed join. By
> changing those cost parameters, this query works as expected now, but there
> are others fall into the same category and appear to be harder to convince
> the optimiser.
>
> I'm still a bit worried about this query as Postgres gets the record count
> right, and knows the index is a primary key index, therefore it knows it's
> 0.05m out of 170m records (0.03%) but still chooses the sequential scan.
> Hopefully this is just related to that big index penalty bug introduced in
> 9.2.

Quite possibly, but it could be any of a number of other things, like a type mismatch. It might be best to rule out other causes. If you post the new query and EXPLAIN ANALYZE output, along with the settings you have now adopted, someone may be able to spot something. It wouldn't hurt to repeat OS and hardware info with it so people have it handy for reference.

-Kevin
> Quite possibly, but it could be any of a number of other things,
> like a type mismatch. It might be best to rule out other causes. If
> you post the new query and EXPLAIN ANALYZE output, along with the
> settings you have now adopted, someone may be able to spot
> something. It wouldn't hurt to repeat OS and hardware info with it
> so people have it handy for reference.
Sorry for the late reply. To summarise,
- Centos, ext4
- 24GB memory
- 6 cores hyper-threaded (Intel(R) Xeon(R) CPU E5645).
- raid 10 on 4 sata disks
Config changes are
- shared_buffers = 6GB
- work_mem = 80MB
- maintenance_work_mem = 3GB
- effective_cache_size = 22GB
- seq_page_cost = 0.1
- random_page_cost = 0.1
- cpu_tuple_cost = 0.05
- geqo = off
The query is,

explain (analyze, buffers)
SELECT
*
FROM IM_Match_Table smalltable
inner join invtran bigtable on bigtable.invtranref = smalltable.invtranref

The result is,

"QUERY PLAN"
"Nested Loop (cost=0.00..341698.92 rows=48261 width=171) (actual time=0.042..567.980 rows=48257 loops=1)"
" Buffers: shared hit=242267"
" -> Seq Scan on im_match_table smalltable (cost=0.00..2472.65 rows=48261 width=63) (actual time=0.006..8.230 rows=48261 loops=1)"
" Buffers: shared hit=596"
" -> Index Scan using pk_invtran on invtran bigtable (cost=0.00..6.98 rows=1 width=108) (actual time=0.010..0.011 rows=1 loops=48261)"
" Index Cond: (invtranref = smalltable.invtranref)"
" Buffers: shared hit=241671"
"Total runtime: 571.662 ms"
"Huan Ruan" <huan.ruan.it@gmail.com> wrote:

> explain (analyze, buffers)
> SELECT
> *
> FROM IM_Match_Table smalltable
> inner join invtran bigtable on bigtable.invtranref = smalltable.invtranref

Well, one table or the other will need to be read in full, and you would normally want that one to be the small table. When there is no ORDER BY clause, the fastest way to do that will normally be a seqscan. So that part of the query is as it should be. The only question is whether the access to the big table is using the fastest technique.

If you want to see what the planner's second choice would have been, you could run:

SET enable_indexscan = off;

on a connection and try the explain again. If you don't like that one, you might be able to disable another node type and see what you get. If one of the other alternatives is faster, that would suggest that adjustments are needed to the costing factors; otherwise, it just takes that long to read hundreds of thousands of rows in one table and look for related data for each of them in another table.

> "Nested Loop (cost=0.00..341698.92 rows=48261 width=171) (actual
> time=0.042..567.980 rows=48257 loops=1)"

Frankly, at 12 microseconds per matched pair of rows, I think you're doing OK.

-Kevin
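[Editor's note: a sketch of the probe described above, wrapped in a transaction so the planner setting cannot leak to other queries on the connection.]

```sql
-- See the planner's second choice without disturbing other sessions.
BEGIN;
SET LOCAL enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM IM_Match_Table smalltable
INNER JOIN invtran bigtable
        ON bigtable.invtranref = smalltable.invtranref;
ROLLBACK;   -- SET LOCAL is undone here
```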
> Frankly, at 12 microseconds per matched pair of rows, I think
> you're doing OK.

This plan is the good one, I want the indexscan nested loop join and this is only achieved after making all these costing factor changes. Before that, it was hash join and was very slow.
However, I'm worried about the config changes being too 'extreme', i.e. sequential I/O and random I/O now have the same cost, and it is only 0.1. So, I was more wondering why I have to make such dramatic changes to convince the optimiser to use an NL join instead of a hash join. Also, I'm not sure what impact these changes will have on other queries yet, e.g. will a query that's fine with a hash join now choose an NL join and run slower?
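One way to see why the page-cost knobs swing the choice so hard: the nested loop pays roughly one index probe per outer row, and each probe is priced mostly in random page reads, so scaling `random_page_cost` scales the entire inner side of the plan. A heavily simplified sketch (this is not PostgreSQL's real cost model, and the pages-per-probe figure is an assumption for illustration):

```python
# Toy model: nested loop cost ~ outer_rows * per-probe cost,
# where each probe is dominated by random page fetches.
def nested_loop_cost(outer_rows, pages_per_probe, random_page_cost, cpu=0.01):
    return outer_rows * (pages_per_probe * random_page_cost + cpu)

outer_rows = 48261       # im_match_table row estimate from the plan
pages_per_probe = 4      # assumed: index descent + heap page per lookup

# With the default random_page_cost = 4.0, probes look expensive...
print(nested_loop_cost(outer_rows, pages_per_probe, 4.0))
# ...with 0.1 (the poster's setting), the same probes look nearly free,
# so the nested loop undercuts the hash join's big seqscan.
print(nested_loop_cost(outer_rows, pages_per_probe, 0.1))
```

The point is that the knob multiplies 48,261 probes at once, which is why a modest-looking change in the constant produces a dramatic swing in the plan comparison.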
Huan Ruan wrote:

> Kevin Grittner wrote:
>> Frankly, at 12 microseconds per matched pair of rows, I think
>> you're doing OK.
>
> This plan is the good one, I want the indexscan nested loop join and this
> is only achieved after making all these costing factor changes. Before
> that, it was hash join and was very slow.
>
> However, I'm worried about the config changes being too 'extreme', i.e.
> sequential I/O and random I/O now have the same cost, and it is only 0.1.
> So, I was more wondering why I have to make such dramatic changes to
> convince the optimiser to use an NL join instead of a hash join. Also,
> I'm not sure what impact these changes will have on other queries yet,
> e.g. will a query that's fine with a hash join now choose an NL join
> and run slower?

I understand the concern, but PostgreSQL doesn't yet have a knob to turn for "cache hit ratio". You essentially need to build that into the page costs. Since your cache hit ratio (between shared buffers and the OS) is so high, the cost of page access relative to CPU costs has declined and there isn't any effective difference between sequential and random access. As the level of caching changes, you may need to adjust. In one production environment where there was significant caching, but far enough from 100% to matter, we tested various configurations and found the fastest plans being chosen with seq_page_cost = 0.3 and random_page_cost = 0.5. Tune to your workload.

-Kevin
On 21 December 2012 01:06, Kevin Grittner <kgrittn@mail.com> wrote:
Huan Ruan wrote:
> Kevin Grittner wrote:
>> Frankly, at 12 microseconds per matched pair of rows, I think
>> you're doing OK.
>
> This plan is the good one, I want the indexscan nested loop join and this
> is only achieved after making all these costing factor changes. Before
> that, it was hash join and was very slow.
>
> However, I'm worried about the config changes being too 'extreme', i.e.
> sequential I/O and random I/O now have the same cost, and it is only 0.1.
> So, I was more wondering why I have to make such dramatic changes to
> convince the optimiser to use an NL join instead of a hash join. Also,
> I'm not sure what impact these changes will have on other queries yet,
> e.g. will a query that's fine with a hash join now choose an NL join
> and run slower?

I understand the concern, but PostgreSQL doesn't yet have a knob to
turn for "cache hit ratio". You essentially need to build that into
the page costs. Since your cache hit ratio (between shared buffers
and the OS) is so high, the cost of page access relative to CPU
costs has declined and there isn't any effective difference between
sequential and random access. As the level of caching changes, you
may need to adjust. In one production environment where there was
significant caching, but far enough from 100% to matter, we tested
various configurations and found the fastest plans being chosen
with seq_page_cost = 0.3 and random_page_cost = 0.5. Tune to your
workload.
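Kevin's numbers can be trialled without touching postgresql.conf. A hedged sketch (the database name is a placeholder):

```sql
-- Try the example values for the current session only.
SET seq_page_cost = 0.3;
SET random_page_cost = 0.5;

-- Re-run the EXPLAIN (ANALYZE, BUFFERS) from earlier and compare plans.

-- Once a good setting is found, it can be persisted per database;
-- "mydb" is a placeholder name:
ALTER DATABASE mydb SET seq_page_cost = 0.3;
ALTER DATABASE mydb SET random_page_cost = 0.5;
```

Session-level `SET` makes it cheap to test several combinations against a representative workload before committing to any of them.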
Thanks Kevin. I think I have some ideas now that I can try on the production server when we switch.