Re: query plan not optimal - Mailing list pgsql-performance

From Kevin Grittner
Subject Re: query plan not optimal
Msg-id 1387485379.74465.YahooMailNeo@web162902.mail.bf1.yahoo.com
In response to query plan not optimal  (Marc Cousin <cousinmarc@gmail.com>)
Responses Re: query plan not optimal
List pgsql-performance

Marc Cousin <cousinmarc@gmail.com> wrote:

> Then we insert missing paths. This is one of the plans that fail

> insert into path (path)
>   select path from batch
>     where not exists
>           (select 1 from path where path.path=batch.path)
>     group by path;

I know you said you wanted to focus on a different query, but this
one can easily be optimized.  Right now it is checking for an
existing row in path once for each row in batch, when it only
needs to check once for each distinct path.  One way to write it
would be:

insert into path (path)
  select path from (select distinct path from batch) b
    where not exists
          (select 1 from path p where p.path = b.path);
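
If you want to compare plans, an equivalent anti-join form of the
same statement (just a sketch against the same tables and columns
from your queries) is:

insert into path (path)
  select b.path
    from (select distinct path from batch) b
    left join path p on p.path = b.path
   where p.path is null;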

> So now we insert into the file table.
>
> insert into file (pathid,filename)
>   select pathid, filename from batch join path using (path);

> What I think is the cause of the problem is that the planner
> doesn't take into account that we are going to fetch the exact
> same values all the time in the path table, so we'll have a very
> good hit ratio.

It does take that into account to some degree for the index
portion of the cost, via the effective_cache_size setting.  That
should normally be set to 50% to 75% of machine RAM.
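
For example (the numbers here are only illustrative for a 64GB
machine, not taken from anything in your report), in
postgresql.conf:

  effective_cache_size = 48GB

or, to try it out for a single session first:

  set effective_cache_size = '48GB';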

> Maybe the n_distinct from batch.path could be used to refine the
> caching effect on the index scan ?

Interesting idea.

> For now, we work around this by using very low values for
> seq_page_cost and random_page_cost for these 2 queries.

If you are not already doing so, you might want to try setting
cpu_tuple_cost to something in the 0.03 to 0.05 range.  I have
found that the default is too low compared to the other cpu cost
factors, and raising it makes plan choice less sensitive to the
exact page cost settings -- that is, you get good plans over a
wider range of page cost settings.  I have sometimes been unable
to get a good plan for a query without boosting this, regardless
of what I do with other settings.
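
An easy way to experiment without touching postgresql.conf is to
override the setting for just one transaction and look at the plan
for the problem query.  Just a sketch, reusing your insert into
file (plain EXPLAIN does not execute the insert, and SET LOCAL
reverts when the transaction ends):

begin;
set local cpu_tuple_cost = 0.03;
explain
  insert into file (pathid, filename)
    select pathid, filename from batch join path using (path);
rollback;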

Running with a development build on my 16GB development PC, I got
your fast plan with your "big data" test case by making only this
one adjustment from the postgresql.conf defaults:

set effective_cache_size = '2GB';

I also got the fast plan if I left effective_cache_size at the
default and only changed:

set cpu_tuple_cost = 0.03;

I know that there have been adjustments to cost calculations for
use of large indexes in both minor and major releases.  If a little
sensible tuning of cost factors to better match reality doesn't do
it for you, you might want to consider an upgrade.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

