Marc Cousin <cousinmarc@gmail.com> wrote:
> Then we insert missing paths. This is one of the plans that fail
> insert into path (path)
> select path from batch
> where not exists
> (select 1 from path where path.path=batch.path)
> group by path;
I know you said you wanted to focus on a different query, but this
one can easily be optimized. Right now it is checking for an
existing row in path for each row in batch; and it only needs to
check once for each path. One way to write it would be:
insert into path (path)
select path from (select distinct path from batch) b
where not exists
(select 1 from path p where p.path = b.path);
> So now we insert into the file table.
>
> insert into file (pathid,filename)
> select pathid, filename from batch join path using (path);
> What I think is the cause of the problem is that the planner
> doesn't take into account that we are going to fetch the exact
> same values all the time in the path table, so we'll have a very
> good hit ratio.
It kinda takes that into account for the index part of things via
the effective_cache_size setting. That should normally be set to
50% to 75% of machine RAM.
> Maybe the n_distinct from batch.path could be used to refine the
> caching effect on the index scan ?
Interesting idea.
> For now, we work around this by using very low values for
> seq_page_cost and random_page_cost for these 2 queries.
If you are not already doing so, you might want to try setting
cpu_tuple_cost to something in the 0.03 to 0.05 range. I have
found that the default is too low compared to other cpu cost
factors, and raising it makes the exact settings for page costs
less sensitive -- that is, you get good plans over a wider range of
page cost settings. I have sometimes been unable to get a good
plan for a query without boosting this, regardless of what I do
with other settings.
Running with a development build on my 16GB development PC, I got
your fast plan with your "big data" test case by making only this
one adjustment from the postgresql.conf defaults:
set effective_cache_size = '2GB';
I also got the fast plan if I left effective_cache_size at the
default and only changed:
set cpu_tuple_cost = 0.03;
I know that there have been adjustments to cost calculations for
use of large indexes in both minor and major releases. If a little
sensible tuning of cost factors to better match reality doesn't do
it for you, you might want to consider an upgrade.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company