Thread: -HEAD planner issue wrt hash_joins on dbt3 ?

-HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
Hi!

I'm about to do some benchmarking on -HEAD on some hardware I have
available, and it seems I'm hitting a rather weird issue causing the osdl
dbt3 benchmark to run very slowly and eat CPU time for hours ...

it seems that the issue is caused by the following query:
(in case it gets linewrapped:
http://www.kaltenbrunner.cc/files/dbt3_with_hashjoin.txt)

select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue from
customer, orders, lineitem, supplier, nation, region where c_custkey =
o_custkey and l_orderkey = o_orderkey and l_suppkey = s_suppkey and
c_nationkey = s_nationkey and s_nationkey = n_nationkey and n_regionkey
= r_regionkey and r_name = 'AFRICA' and o_orderdate >= date '1993-01-01'
and o_orderdate < date '1993-01-01' + interval '1 year' group by n_name
order by revenue desc;

that results in the following plan on my box:
Sort  (cost=2543391.75..2543391.81 rows=25 width=37)
  Sort Key: sum((lineitem.l_extendedprice * (1::double precision - lineitem.l_discount)))
  ->  HashAggregate  (cost=2543390.73..2543391.17 rows=25 width=37)
        ->  Hash Join  (cost=440864.81..2543027.40 rows=72666 width=37)
              Hash Cond: ((orders.o_custkey = customer.c_custkey) AND (supplier.s_nationkey = customer.c_nationkey))
              ->  Hash Join  (cost=377714.59..2415568.01 rows=1816643 width=49)
                    Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
                    ->  Nested Loop  (cost=13.65..1719683.85 rows=12000672 width=49)
                          ->  Merge Join  (cost=0.00..10248.66 rows=20000 width=41)
                                Merge Cond: (nation.n_nationkey = supplier.s_nationkey)
                                ->  Nested Loop  (cost=0.00..19.19 rows=5 width=33)
                                      ->  Index Scan using pk_nation on nation  (cost=0.00..9.38 rows=25 width=37)
                                      ->  Index Scan using pk_region on region  (cost=0.00..0.38 rows=1 width=4)
                                            Index Cond: (nation.n_regionkey = region.r_regionkey)
                                            Filter: (r_name = 'AFRICA'::bpchar)
                                ->  Index Scan using i_s_nationkey on supplier  (cost=0.00..9779.46 rows=100000 width=8)
                          ->  Bitmap Heap Scan on lineitem  (cost=13.65..77.16 rows=665 width=16)
                                Recheck Cond: (lineitem.l_suppkey = supplier.s_suppkey)
                                ->  Bitmap Index Scan on i_l_suppkey  (cost=0.00..13.65 rows=665 width=0)
                                      Index Cond: (lineitem.l_suppkey = supplier.s_suppkey)
                    ->  Hash  (cost=372023.51..372023.51 rows=2270971 width=8)
                          ->  Bitmap Heap Scan on orders  (cost=41391.94..372023.51 rows=2270971 width=8)
                                Recheck Cond: ((o_orderdate >= '1993-01-01'::date) AND (o_orderdate < '1994-01-01 00:00:00'::timestamp without time zone))
                                ->  Bitmap Index Scan on i_o_orderdate  (cost=0.00..41391.94 rows=2270971 width=0)
                                      Index Cond: ((o_orderdate >= '1993-01-01'::date) AND (o_orderdate < '1994-01-01 00:00:00'::timestamp without time zone))
              ->  Hash  (cost=55647.15..55647.15 rows=1500615 width=8)
                    ->  Seq Scan on customer  (cost=0.00..55647.15 rows=1500615 width=8)
(27 rows)

so it really thinks that hashing gigantic amounts of data is generally a
good idea - this seems to be independent of work_mem - the plan looks the
same with 1MB vs. 126MB (which I had during the run).

the profile of the backend eating the CPU looks similar to:

26351    27.9047  ExecScanHashBucket
8239      8.7248  hash_seq_search
6984      7.3958  hash_search_with_hash_value


setting enable_hashjoin to off results in a runtime of about 2.5 minutes:

(http://www.kaltenbrunner.cc/files/dbt3_without_hashjoin.txt)
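[For reference, the plan below was obtained by disabling hash joins for the
session and re-running the query above under EXPLAIN ANALYZE - the exact
invocation isn't in the mail, so this is only a sketch:

set enable_hashjoin = off;
explain analyze
select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from customer, orders, lineitem, supplier, nation, region
where ... ;          -- rest of Q5 exactly as quoted above
]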

Sort  (cost=3700257.38..3700257.45 rows=25 width=37) (actual time=286820.962..286820.968 rows=5 loops=1)
  Sort Key: sum((lineitem.l_extendedprice * (1::double precision - lineitem.l_discount)))
  ->  HashAggregate  (cost=3700256.37..3700256.80 rows=25 width=37) (actual time=286820.932..286820.941 rows=5 loops=1)
        ->  Nested Loop  (cost=730956.43..3699893.04 rows=72666 width=37) (actual time=43551.767..286488.555 rows=72441 loops=1)
              Join Filter: (customer.c_nationkey = supplier.s_nationkey)
              ->  Merge Join  (cost=730956.43..3624153.73 rows=1816643 width=49) (actual time=43281.710..257082.739 rows=1822547 loops=1)
                    Merge Cond: (lineitem.l_orderkey = orders.o_orderkey)
                    ->  Index Scan using i_l_orderkey on lineitem  (cost=0.00..2715943.34 rows=60003360 width=16) (actual time=32.868..123668.380 rows=59991868 loops=1)
                    ->  Sort  (cost=730956.43..732091.92 rows=454194 width=41) (actual time=43248.797..45754.223 rows=1822547 loops=1)
                          Sort Key: orders.o_orderkey
                          ->  Merge Join  (cost=670885.68..688278.21 rows=454194 width=41) (actual time=34469.359..42050.059 rows=455262 loops=1)
                                Merge Cond: (customer.c_custkey = orders.o_custkey)
                                ->  Sort  (cost=59105.79..59856.10 rows=300123 width=41) (actual time=8113.826..8491.532 rows=299493 loops=1)
                                      Sort Key: customer.c_custkey
                                      ->  Nested Loop  (cost=781.13..31801.81 rows=300123 width=41) (actual time=107.537..7461.355 rows=299493 loops=1)
                                            ->  Nested Loop  (cost=1.06..11.00 rows=5 width=33) (actual time=0.030..0.296 rows=5 loops=1)
                                                  Join Filter: (nation.n_regionkey = region.r_regionkey)
                                                  ->  Index Scan using pk_nation on nation  (cost=0.00..9.38 rows=25 width=37) (actual time=0.007..0.063 rows=25 loops=1)
                                                  ->  Materialize  (cost=1.06..1.07 rows=1 width=4) (actual time=0.002..0.004 rows=1 loops=25)
                                                        ->  Seq Scan on region  (cost=0.00..1.06 rows=1 width=4) (actual time=0.009..0.018 rows=1 loops=1)
                                                              Filter: (r_name = 'AFRICA'::bpchar)
                                            ->  Bitmap Heap Scan on customer  (cost=780.07..5607.85 rows=60025 width=8) (actual time=61.150..1331.466 rows=59899 loops=5)
                                                  Recheck Cond: (nation.n_nationkey = customer.c_nationkey)
                                                  ->  Bitmap Index Scan on i_c_nationkey  (cost=0.00..780.07 rows=60025 width=0) (actual time=44.637..44.637 rows=59899 loops=5)
                                                        Index Cond: (nation.n_nationkey = customer.c_nationkey)
                                ->  Sort  (cost=611779.89..617457.31 rows=2270971 width=8) (actual time=26355.515..29471.963 rows=2276859 loops=1)
                                      Sort Key: orders.o_custkey
                                      ->  Bitmap Heap Scan on orders  (cost=41391.94..372023.51 rows=2270971 width=8) (actual time=1630.604..16266.102 rows=2276859 loops=1)
                                            Recheck Cond: ((o_orderdate >= '1993-01-01'::date) AND (o_orderdate < '1994-01-01 00:00:00'::timestamp without time zone))
                                            ->  Bitmap Index Scan on i_o_orderdate  (cost=0.00..41391.94 rows=2270971 width=0) (actual time=1352.037..1352.037 rows=2276859 loops=1)
                                                  Index Cond: ((o_orderdate >= '1993-01-01'::date) AND (o_orderdate < '1994-01-01 00:00:00'::timestamp without time zone))
              ->  Index Scan using pk_supplier on supplier  (cost=0.00..0.03 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1822547)
                    Index Cond: (lineitem.l_suppkey = supplier.s_suppkey)
Total runtime: 286984.386 ms
(34 rows)

(about 120 s of that seems to be EXPLAIN ANALYZE overhead)


fwiw the box in question is a dual 2.6GHz Opteron with 8GB of RAM - WAL
is on the onboard BBWC Smart Array (RAID 10 on 4 disks) and the data is
on a 14-disk Linux software RAID 10, running Debian Sarge/AMD64 with
kernel 2.6.17.7. The dbt3 database was initialized with a scale factor
of 10 (running with just 1 works fine).



Stefan


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> I'm about to do some benchmarking on -HEAD one some hardware I have
> available and it seems I'm hitting a rather weird issue causing the osdl
> dbt3 benchmark to run very slow and eating CPU time for hours ...

Could we see the actual EXPLAIN ANALYZE results for the slow plan?
I'm unconvinced by your "hash join is bad" analysis, especially in
the cases where you're giving it lots of work_mem.  I think it's got
something to do with the different join orders.  The rowcount estimates
in the fast plan all seem pretty good, but I'm betting something is
wrong with some of them in the slow case.
        regards, tom lane


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> I'm about to do some benchmarking on -HEAD one some hardware I have
>> available and it seems I'm hitting a rather weird issue causing the osdl
>> dbt3 benchmark to run very slow and eating CPU time for hours ...
> 
> Could we see the actual EXPLAIN ANALYZE results for the slow plan?
> I'm unconvinced by your "hash join is bad" analysis, especially in
> the cases where you're giving it lots of work_mem.  I think it's got
> something to do with the different join orders.  The rowcount estimates
> in the fast plan all seem pretty good, but I'm betting something is
> wrong with some of them in the slow case.

will do - but that will take a while - the slow one runs for 12h or so
even without explain analyze overhead ...


Stefan


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> btw - the "hashjoin is bad" was more or less based on the observation
> that nearly all of the cpu is burned in hash-related functions in the
> profile (when profiling over a longer period of time those accumulate
> even more % of the time than in the short profile I included in the
> original report)

[ shrug... ]  Two out of the three functions you mentioned are not used
by hash join, and anyway the other plan probably has a comparable
execution density in sort-related functions; does that make it bad?

It's possible that the large time for ExecScanHashBucket has something
to do with skewed usage of the hash buckets due to an unfortunate data
distribution, but that's theorizing far in advance of the data.
        regards, tom lane


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> btw - the "hashjoin is bad" was more or less based on the observation
>> that nearly all of the cpu is burned in hash-related functions in the
>> profile (when profiling over a longer period of time those accumulate
>> even more % of the time than in the short profile I included in the
>> original report)
> 
> [ shrug... ]  Two out of the three functions you mentioned are not used
> by hash join, and anyway the other plan probably has a comparable
> execution density in sort-related functions; does that make it bad?

hmm sorry for that - I should have checked the source before I made that 
assumption :-(

> 
> It's possible that the large time for ExecScanHashBucket has something
> to do with skewed usage of the hash buckets due to an unfortunate data
> distribution, but that's theorizing far in advance of the data.

http://www.kaltenbrunner.cc/files/4/

has preliminary data for the dbt3/scale-10 run I did, which seems to
imply we have at least 4 queries in there that take an excessive amount
of time (query 5 is the one I started the complaint with).
However those results have to be taken with a grain of salt, since there
is an apparent bug in the dbt3 code which seems to rely on
add_missing_from=on (as can be seen in some of the error logs of the
database), and towards the end of the throughput run I did some of the
EXPLAIN ANALYZE runs for the report (those are the small 100% spikes in
the graph, due to the box using the second CPU to run them).
I will redo those tests later this week though ...

Stefan


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> Could we see the actual EXPLAIN ANALYZE results for the slow plan?

> http://www.kaltenbrunner.cc/files/dbt3_explain_analyze.txt

Well, indeed it seems that the hash join is just an innocent bystander:
the bulk of the runtime (all but about 120 sec in fact) is spent here:
                    ->  Nested Loop  (cost=13.65..1719683.85 rows=12000672 width=49) (actual time=60.325..24923860.713 rows=11897899 loops=1)
                          ->  Merge Join  (cost=0.00..10248.66 rows=20000 width=41) (actual time=16.654..2578.060 rows=19837 loops=1)
                                ...
                          ->  Bitmap Heap Scan on lineitem  (cost=13.65..77.16 rows=665 width=16) (actual time=13.492..1254.535 rows=600 loops=19837)
                                Recheck Cond: (lineitem.l_suppkey = supplier.s_suppkey)
                                ->  Bitmap Index Scan on i_l_suppkey  (cost=0.00..13.65 rows=665 width=0) (actual time=10.662..10.662 rows=600 loops=19837)
                                      Index Cond: (lineitem.l_suppkey = supplier.s_suppkey)
 

I suppose that the profile result you showed was taken during the
startup transient where it was computing the hashtables that this loop's
results are joined to ... but that's not where the problem is.  The
problem is repeating that bitmap scan on lineitem for nearly 20000
different l_suppkeys.

Apparently we've made the planner a bit too optimistic about the savings
that can be expected from repeated indexscans occurring on the inside of
a join.  The other plan uses a different join order and doesn't try to
join lineitem until it's got orders.o_orderkey, whereupon it does a
mergejoin against an indexscan on lineitem:
                    ->  Index Scan using i_l_orderkey on lineitem  (cost=0.00..2715943.34 rows=60003360 width=16) (actual time=32.868..123668.380 rows=59991868 loops=1)
 

The runtimes for the remainders of the plans are roughly comparable, so
it's the cost of joining lineitem that is hurting here.

Is lineitem sorted (or nearly sorted) by l_orderkey?  Part of the
problem could be overestimating the cost of this indexscan.

What are the physical sizes of lineitem and its indexes, and how do
those compare to your RAM?  What are you using for planner settings
(particularly effective_cache_size)?
        regards, tom lane
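[The planner settings asked about here can be checked with a query along
these lines against the standard pg_settings view; the list of GUCs is
only a suggestion:

select name, setting, unit
from pg_settings
where name in ('effective_cache_size', 'work_mem', 'shared_buffers',
               'random_page_cost');
]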


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Tom Lane wrote:
>>> Could we see the actual EXPLAIN ANALYZE results for the slow plan?
> 
>> http://www.kaltenbrunner.cc/files/dbt3_explain_analyze.txt
> 
> Well, indeed it seems that the hash join is just an innocent bystander:
> the bulk of the runtime (all but about 120 sec in fact) is spent here:
> 
>                      ->  Nested Loop  (cost=13.65..1719683.85 rows=12000672 width=49) (actual time=60.325..24923860.713 rows=11897899 loops=1)
>                            ->  Merge Join  (cost=0.00..10248.66 rows=20000 width=41) (actual time=16.654..2578.060 rows=19837 loops=1)
>                                ...
>                            ->  Bitmap Heap Scan on lineitem  (cost=13.65..77.16 rows=665 width=16) (actual time=13.492..1254.535 rows=600 loops=19837)
>                                  Recheck Cond: (lineitem.l_suppkey = supplier.s_suppkey)
>                                  ->  Bitmap Index Scan on i_l_suppkey  (cost=0.00..13.65 rows=665 width=0) (actual time=10.662..10.662 rows=600 loops=19837)
>                                        Index Cond: (lineitem.l_suppkey = supplier.s_suppkey)
> 
> I suppose that the profile result you showed was taken during the
> startup transient where it was computing the hashtables that this loop's
> results are joined to ... but that's not where the problem is.  The
> problem is repeating that bitmap scan on lineitem for nearly 20000
> different l_suppkeys.

possible - I actually took them over a longer period of time

> 
> Apparently we've made the planner a bit too optimistic about the savings
> that can be expected from repeated indexscans occurring on the inside of
> a join.  The other plan uses a different join order and doesn't try to
> join lineitem until it's got orders.o_orderkey, whereupon it does a
> mergejoin against an indexscan on lineitem:
> 
>                      ->  Index Scan using i_l_orderkey on lineitem  (cost=0.00..2715943.34 rows=60003360 width=16) (actual time=32.868..123668.380 rows=59991868 loops=1)
> 
> The runtimes for the remainders of the plans are roughly comparable, so
> it's the cost of joining lineitem that is hurting here.
> 
> Is lineitem sorted (or nearly sorted) by l_orderkey?  Part of the
> problem could be overestimating the cost of this indexscan.
> 
> What are the physical sizes of lineitem and its indexes, and how do
> those compare to your RAM?  What are you using for planner settings
> (particularly effective_cache_size)?

ouch - you are right (as usual) here.
effective_cache_size was set to 10GB (my fault for copying over the conf
from a 16GB box) during the run - lowering it by just a few megabytes (!)
or to a more realistic 6GB results in the following MUCH better plan:

http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt

as for the relation sizes:

dbt3=# select pg_relation_size('lineitem');
 pg_relation_size
------------------
      10832764928
(1 row)

dbt3=# select pg_total_relation_size('lineitem');
 pg_total_relation_size
-------------------------
             22960259072
(1 row)

there are nine btree indexes on lineitem, all between 1.1GB and 1.4GB in
size.


Stefan
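[The per-index sizes mentioned above can be listed with a catalog query
along these lines (illustrative; joins pg_index to pg_class in the usual
way):

select c.relname, pg_size_pretty(pg_relation_size(c.oid))
from pg_index i
join pg_class c on c.oid = i.indexrelid
where i.indrelid = 'lineitem'::regclass;
]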


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> Apparently we've made the planner a bit too optimistic about the savings
>> that can be expected from repeated indexscans occurring on the inside of
>> a join.

> effective_cache_size was set to 10GB(my fault for copying over the conf
> from a 16GB box) during the run - lowering it just a few megabytes(!) or
> to a more realistic 6GB results in the following MUCH better plan:
> http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt

Interesting.  It used to be that effective_cache_size wasn't all that
critical... what I think this report is showing is that with the 8.2
changes to try to account for caching effects in repeated indexscans,
we've turned that into a pretty significant parameter.

It'd be nice not to have to depend on the DBA to give us a good number
for this setting.  But I don't know of any portable ways to find out
how much RAM is in the box, let alone what fraction of it we should
assume is available per-query.
        regards, tom lane


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Tom Lane wrote:
>>> Apparently we've made the planner a bit too optimistic about the savings
>>> that can be expected from repeated indexscans occurring on the inside of
>>> a join.
> 
>> effective_cache_size was set to 10GB(my fault for copying over the conf
>> from a 16GB box) during the run - lowering it just a few megabytes(!) or
>> to a more realistic 6GB results in the following MUCH better plan:
>> http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt
> 
> Interesting.  It used to be that effective_cache_size wasn't all that
> critical... what I think this report is showing is that with the 8.2
> changes to try to account for caching effects in repeated indexscans,
> we've turned that into a pretty significant parameter.

yes, I'm a bit worried about that too - it has been a bit of 
"conventional wisdom" that setting effective_cache_size optimistically 
will never hurt and that it encourages PostgreSQL to sometimes pick a 
better plan by favouring index scans.

> 
> It'd be nice not to have to depend on the DBA to give us a good number
> for this setting.  But I don't know of any portable ways to find out
> how much RAM is in the box, let alone what fraction of it we should
> assume is available per-query.

well, there are really a number of things the DBA had better give the 
database accurate information about - though in that case we might go 
from "too much won't hurt" to "too much will hurt" ...


Stefan


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
David Fetter
Date:
On Wed, Sep 13, 2006 at 10:47:09AM -0400, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> > Tom Lane wrote:
> >> Apparently we've made the planner a bit too optimistic about the savings
> >> that can be expected from repeated indexscans occurring on the inside of
> >> a join.
> 
> > effective_cache_size was set to 10GB(my fault for copying over the conf
> > from a 16GB box) during the run - lowering it just a few megabytes(!) or
> > to a more realistic 6GB results in the following MUCH better plan:
> > http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt
> 
> Interesting.  It used to be that effective_cache_size wasn't all
> that critical... what I think this report is showing is that with
> the 8.2 changes to try to account for caching effects in repeated
> indexscans, we've turned that into a pretty significant parameter.
> 
> It'd be nice not to have to depend on the DBA to give us a good
> number for this setting.  But I don't know of any portable ways to
> find out how much RAM is in the box, let alone what fraction of it
> we should assume is available per-query.

That's fairly straightforward, if a little crude.  We ask the DBA and
provide some tools for estimating and tuning same. :)

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666                             Skype: davidfetter

Remember to vote!


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
[already sent a variant of that yesterday but it doesn't look like it
made it to the list]

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Tom Lane wrote:
>>> Apparently we've made the planner a bit too optimistic about the savings
>>> that can be expected from repeated indexscans occurring on the inside of
>>> a join.
> 
>> effective_cache_size was set to 10GB(my fault for copying over the conf
>> from a 16GB box) during the run - lowering it just a few megabytes(!) or
>> to a more realistic 6GB results in the following MUCH better plan:
>> http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt
> 
> Interesting.  It used to be that effective_cache_size wasn't all that
> critical... what I think this report is showing is that with the 8.2
> changes to try to account for caching effects in repeated indexscans,
> we've turned that into a pretty significant parameter.

took me a while due to hardware issues on my testbox - but there are new
results (with 6GB for effective_cache_size) up at:

http://www.kaltenbrunner.cc/files/5/

there are still a few issues with the validity of the run, like the rf
tests not actually being done right - but lowering effective_cache_size
gave a dramatic speedup on Q5, Q7 and Q8.

that is the explain for the 4h+ Q9:

http://www.kaltenbrunner.cc/files/analyze_q9.txt

increasing the statistics target up to 1000 does not seem to change
the plan, btw.

disabling nested loops leads to the following (4 times faster) plan:

http://www.kaltenbrunner.cc/files/analyze_q9_no_nest.txt

since the hash joins in there look rather slow (inappropriate hashtable
setup due to the wrong estimates?) I disabled hash joins too:

http://www.kaltenbrunner.cc/files/analyze_q9_no_nest_no_hashjoin.txt

and amazingly this plan is by far the fastest one in runtime (15 min vs.
4.5 h ...) except that the planner thinks it is 20 times more expensive ...


Stefan


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
Stefan Kaltenbrunner wrote:
> [already sent a variant of that yesterday but it doesn't look like it
> made it to the list]
> 
> Tom Lane wrote:
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> Tom Lane wrote:
>>>> Apparently we've made the planner a bit too optimistic about the savings
>>>> that can be expected from repeated indexscans occurring on the inside of
>>>> a join.
>>> effective_cache_size was set to 10GB(my fault for copying over the conf
>>> from a 16GB box) during the run - lowering it just a few megabytes(!) or
>>> to a more realistic 6GB results in the following MUCH better plan:
>>> http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt
>> Interesting.  It used to be that effective_cache_size wasn't all that
>> critical... what I think this report is showing is that with the 8.2
>> changes to try to account for caching effects in repeated indexscans,
>> we've turned that into a pretty significant parameter.
> 
> took me a while due to hardware issues on my testbox - but there are new
> results(with 6GB for effective_cache_size) up at:
> 
> http://www.kaltenbrunner.cc/files/5/
> 
> there are still a few issues with the validity of the run like the rf
> tests not actually being done right - but lowering effective_cache_size
> gave a dramtic speedup on Q5,Q7 and Q8.
> 
> that is the explain for the 4h+ Q9:
> 
> http://www.kaltenbrunner.cc/files/analyze_q9.txt
> 
> increasing the the statistic_target up to 1000 does not seem to change
> the plan btw.
> 
> disabling nested loop leads to the following (4 times faster) plan:
> 
> http://www.kaltenbrunner.cc/files/analyze_q9_no_nest.txt
> 
> since the hash-joins in there look rather slow (inappropriate hashtable
> set up due to the wrong estimates?) I disabled hash_joins too:
> 
> http://www.kaltenbrunner.cc/files/analyze_q9_no_nest_no_hashjoin.txt
> 
> and amazingly this plan is by far the fastest one in runtime (15min vs
> 4,5h ...) except that the planner thinks it is 20 times more expensive ...

some additional numbers (the first one is with default settings, the
second with enable_nestloop = 'off', the third with enable_nestloop =
'off' and enable_hashjoin = 'off'):

http://www.kaltenbrunner.cc/files/analyze_q7.txt

here we get a 3x speedup from disabling nested loops, and a 2x speedup
(over the original plan) with both nested loops and hash joins disabled.


http://www.kaltenbrunner.cc/files/analyze_q20.txt

here we get a 180x(!) speedup with both planner options disabled ...

it is worth mentioning that for both queries the estimated costs relative
to each other look quite reasonable as soon as enable_nestloop = 'off'
(i.e. 5042928 vs. 10715247 with 344 sec vs. 514 sec for Q7, and 101441851
vs. 101445468 with 10 sec vs. 11 sec)


Stefan


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> that is the explain for the 4h+ Q9:
> http://www.kaltenbrunner.cc/files/analyze_q9.txt

The big problem there seems to be the drastic misestimation of the
number of rows matching the p_name ~~ '%ghost%' condition.  What does
pg_stats have for the p_name column?
        regards, tom lane
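[The pg_stats row in question can be pulled with, for example:

select null_frac, n_distinct, most_common_vals, most_common_freqs,
       histogram_bounds
from pg_stats
where tablename = 'part' and attname = 'p_name';
]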


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> that is the explain for the 4h+ Q9:
>> http://www.kaltenbrunner.cc/files/analyze_q9.txt
> 
> The big problem there seems to be the drastic misestimation of the
> number of rows matching the p_name ~~ '%ghost%' condition.  What does
> pg_stats have for the p_name column?

http://www.kaltenbrunner.cc/files/pg_stat_p_name.txt


Stefan


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> some additional numbers(first one is with default settings, second is
> with enable_nestloop = 'off', third one is with enable_nestloop = 'off'
> and enable_hashjoin='off'):

> http://www.kaltenbrunner.cc/files/analyze_q7.txt

I'm inclined to think you still have effective_cache_size set too high;
or at least that the planner is being too optimistic about how much
cache space is actually available to each indexscan.

With the code as it currently stands, effective_cache_size has some of
the same properties as work_mem: the planner effectively assumes that
that much space is available to *each* indexscan, and so you'd need to
de-rate the setting based on the complexity of queries and the number of
concurrent sessions.

I'm not sure what we could do about the concurrent-sessions issue, but
we could make some sort of attack on the query complexity issue by
pro-rating the effective_cache_size among all the tables used by a
query.


> http://www.kaltenbrunner.cc/files/analyze_q20.txt
> here we have a 180x(!) speedup with both disabled planner options ...

There's something awfully bogus about that one --- how is it that the
aggregate subplan, with the exact same plan and same number of
executions in all three cases, has an actual runtime 200x more in the
first case?
        regards, tom lane


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> The big problem there seems to be the drastic misestimation of the
>> number of rows matching the p_name ~~ '%ghost%' condition.  What does
>> pg_stats have for the p_name column?

> http://www.kaltenbrunner.cc/files/pg_stat_p_name.txt

Hmm ... pattern_sel already applies the operator directly to the
most_common_vals, but in this situation those aren't common enough
to help much.  With such an extensive histogram it is awfully tempting
to assume that the histogram members are a representative sample, and
take the selectivity as being the fraction of histogram entries that
match the pattern.  Maybe drop the first and last histogram entries
on the grounds they're probably outliers.  Thoughts?  What would be a
reasonable minimum histogram size to enable using this approach instead
of the guess-on-the-basis-of-the-pattern code?
        regards, tom lane
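[For intuition, the proposed estimate boils down to the fraction of
histogram entries that match the pattern, which can be mimicked with a
query like this (illustrative only; unnest is from a later release than
the one under discussion, and the first/last entries would additionally
be dropped as suggested above):

select avg((h like '%ghost%')::int) as est_selectivity
from (select unnest(histogram_bounds::text::text[]) as h
      from pg_stats
      where tablename = 'part' and attname = 'p_name') s;
]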


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> some additional numbers(first one is with default settings, second is
>> with enable_nestloop = 'off', third one is with enable_nestloop = 'off'
>> and enable_hashjoin='off'):
> 
>> http://www.kaltenbrunner.cc/files/analyze_q7.txt
> 
> I'm inclined to think you still have effective_cache_size set too high;
> or at least that the planner is being too optimistic about how much
> cache space is actually available to each indexscan.

I have long-term external monitoring on that server and it indeed shows
that there was never less than about 5.8GB of buffer cache in use (or
more than 2.2GB used by other means). So 6GB might still be a bit on the
optimistic side, but it is not actually that far off from reality.
I will redo with lower settings - do you have any suggestions for that ?

> 
> With the code as it currently stands, effective_cache_size has some of
> the same properties as work_mem: the planner effectively assumes that
> that much space is available to *each* indexscan, and so you'd need to
> de-rate the setting based on the complexity of queries and the number of
> concurrent sessions.

concurrency is 1 here - there is never more than a single query running
in parallel in those tests.

> 
> I'm not sure what we could do about the concurrent-sessions issue, but
> we could make some sort of attack on the query complexity issue by
> pro-rating the effective_cache_size among all the tables used by a
> query.

hmm not sure i understand what you mean here :-(

> 
> 
>> http://www.kaltenbrunner.cc/files/analyze_q20.txt
>> here we have a 180x(!) speedup with both disabled planner options ...
> 
> There's something awfully bogus about that one --- how is it that the
> aggregate subplan, with the exact same plan and same number of
> executions in all three cases, has an actual runtime 200x more in the
> first case?

hmm - good question. I will redo those in a bit ...


Stefan


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> I'm not sure what we could do about the concurrent-sessions issue, but
>> we could make some sort of attack on the query complexity issue by
>> pro-rating the effective_cache_size among all the tables used by a
>> query.

> hmm not sure i understand what you mean here :-(

Per the comment for index_pages_fetched:
 * We assume that effective_cache_size is the total number of buffer pages
 * available for both table and index, and pro-rate that space between the
 * table and index.  (Ideally other_pages should include all the other
 * tables and indexes used by the query too; but we don't have a good way
 * to get that number here.)
 

A first-order approximation to this would be to add up the total sizes
of all the other tables used in the query.  I am thinking of leaving out
other indexes, mainly because we can't tell at this level which other
indexes are actually gonna get used.  This would tend to underestimate
by leaving out indexes, but not by a lot if you assume indexes are much
smaller than their tables.  It would also be an overestimate because
tables that are not indexscanned concurrently with the one under
consideration probably shouldn't be counted anyway.  So one might hope
these effects would more or less cancel out.  Anyway it seems to be a
better idea than what we have now.
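[As a rough illustration of that first-order approximation for the Q5 join
discussed earlier (the table list is taken from that query; purely
illustrative):

select pg_relation_size('customer')
     + pg_relation_size('orders')
     + pg_relation_size('supplier')
     + pg_relation_size('nation')
     + pg_relation_size('region') as other_tables_bytes;
]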

> I will redo with lower settings - do you have any suggestions for that ?

Try reducing effective_cache_size to maybe a fourth of what it is now.
If that helps the thing pick better plans for these multi-table queries,
then we should try changing the other_pages calculation as above.
        regards, tom lane


relation cache statistics (was: -HEAD planner issue wrt hash_joins on dbt3 ?)

From
"Jim C. Nasby"
Date:
On Sun, Sep 17, 2006 at 04:18:36PM -0400, Tom Lane wrote:
>  * table and index.  (Ideally other_pages should include all the other
>  * tables and indexes used by the query too; but we don't have a good way
>  * to get that number here.)
> 
> A first-order approximation to this would be to add up the total sizes
> of all the other tables used in the query.  I am thinking of leaving out
> other indexes, mainly because we can't tell at this level which other
> indexes are actually gonna get used.  This would tend to underestimate
> by leaving out indexes, but not by a lot if you assume indexes are much
> smaller than their tables.  It would also be an overestimate because
> tables that are not indexscanned concurrently with the one under
> consideration probably shouldn't be counted anyway.  So one might hope
> these effects would more or less cancel out.  Anyway it seems to be a
> better idea than what we have now.

I think it'd be better to attack this problem from the "other side";
namely looking at what's actually cached. Sadly, I don't think there's
any way to actually query the OS for info about what it has buffered,
but we can look at what's in shared_buffers and assume that it's a
reasonable proxy for the OS's cache. Something like...

relBufPages / shared_buffers * effective_cache_size

should give us a decent idea of what percentage of a relation will be in
cache somewhere. (relBufPages is the number of pages the relation in
question has in the buffer).
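[For what it's worth, the relBufPages figure can already be observed today
via the contrib pg_buffercache view, along these lines (illustrative;
ignores database/tablespace matching for brevity):

select count(*) as rel_buf_pages
from pg_buffercache b
join pg_class c on c.relfilenode = b.relfilenode
where c.relname = 'lineitem';
]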

Of course, that raises the question of how to track how many pages are
in shared buffers for a relation. Given the criticality of locking
there, we probably don't want to update that info in real-time, but for
this application it's probably OK to just scan through the buffer every
X period of time (maybe after X number of pages read into the buffers).
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


"Jim C. Nasby" <jimn@enterprisedb.com> writes:
> I think it'd be better to attack this problem from the "other side";
> namely looking at what's actually cached.

You can kiss goodbye to plan stability if you go that route... and
in any case I doubt the assumption that what's in shared buffers is
representative of what's in kernel cache.
        regards, tom lane


Re: relation cache statistics (was: -HEAD planner issue wrt hash_joins on dbt3 ?)

From
"Jim C. Nasby"
Date:
On Mon, Sep 18, 2006 at 12:20:10AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jimn@enterprisedb.com> writes:
> > I think it'd be better to attack this problem from the "other side";
> > namely looking at what's actually cached.
> 
> You can kiss goodbye to plan stability if you go that route... and
> in any case I doubt the assumption that what's in shared buffers is
> representative of what's in kernel cache.

Well, there are two issues with caching:

1) Is something we're going to want actually in cache right now?
2) If we need to read something more than once (i.e. higher-level btree
pages), what are the odds it will still be in cache when we come around
to it the next time?

Once the caches are warmed up, looking at what's actually in them would
give you a very good probability for #1. I suspect that for large
relations, shared_buffers would also match the OS cache pretty well in
most cases, almost certainly better than whatever estimate we're using
now. But I'm not sure how useful that info is to the planner.

For #2 we'd have to know what kind of pressure the caches are under to
replace pages, and have some kind of idea how frequently the system is
hitting them. The pg_statio info might be useful there, though
unfortunately in that case I think it's much less likely that there's a
good correlation between the two.

If there was *some* way to track stats on page fetches that came out of
the OS cache, I suspect we could make great use of per-relation hit
rates to come up with better plans.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Matteo Beccati
Date:
Hi,

Tom Lane wrote:
> Hmm ... pattern_sel already applies the operator directly to the
> most_common_vals, but in this situation those aren't common enough
> to help much.  With such an extensive histogram it is awfully tempting
> to assume that the histogram members are a representative sample, and
> take the selectivity as being the fraction of histogram entries that
> match the pattern.  Maybe drop the first and last histogram entries
> on the grounds they're probably outliers.  Thoughts?  What would be a
> reasonable minimum histogram size to enable using this approach instead
> of the guess-on-the-basis-of-the-pattern code?

That's what I was suggesting here, for ltree operators and LIKE respectively:

http://archives.postgresql.org/pgsql-patches/2006-05/msg00178.php
http://archives.postgresql.org/pgsql-performance/2006-01/msg00083.php

My original ltree patch was stripped of the histogram matching code and 
I will need to re-patch 8.2 when deploying it to get decent performance 
with a couple of queries, but it would be very nice to avoid it ;)

I cannot see anything bad by using something like that:

if (histogram is large/representative enough)
{
  recalculate_selectivity_matching_histogram_values()

  if (new_selectivity > old_selectivity)
    return new_selectivity
  else
    return old_selectivity
}


Best regards
--
Matteo Beccati
http://phpadsnew.com
http://phppgads.com


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Tom Lane
Date:
Matteo Beccati <php@beccati.com> writes:
> I cannot see anything bad by using something like that:
> if (histogram is large/representative enough)

Well, the question is exactly what is "large enough"?  I feel a bit
uncomfortable about applying the idea to a histogram with only 10
entries (especially if we ignore two of 'em).  With 100 or more,
it sounds all right.  What's the breakpoint?
        regards, tom lane


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Matteo Beccati
Date:
Tom Lane ha scritto:
> Matteo Beccati <php@beccati.com> writes:
>> I cannot see anything bad by using something like that:
>> if (histogram is large/representative enough)
> 
> Well, the question is exactly what is "large enough"?  I feel a bit
> uncomfortable about applying the idea to a histogram with only 10
> entries (especially if we ignore two of 'em).  With 100 or more,
> it sounds all right.  What's the breakpoint?

Yes, I think 100-200 could be a good breakpoint. I don't actually know 
what the current usage of SET STATISTICS is; I usually set it to 1000 
for columns which need more precise selectivity.

The breakpoint could be set even higher (500?) so there is room to 
increase statistics without enabling the histogram check, but I don't 
feel very comfortable suggesting this kind of possibly undocumented 
side effect...


Best regards
--
Matteo Beccati
http://phpadsnew.com
http://phppgads.com
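[For reference, a per-column statistics target of the kind mentioned above
is set roughly like this; the table and column are only an example:

alter table part alter column p_name set statistics 1000;
analyze part;
]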


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Mark Cave-Ayland
Date:
On Mon, 2006-09-18 at 17:46 +0200, Matteo Beccati wrote:
> Tom Lane ha scritto:
> > Matteo Beccati <php@beccati.com> writes:
> >> I cannot see anything bad by using something like that:
> >> if (histogram is large/representative enough)
> > 
> > Well, the question is exactly what is "large enough"?  I feel a bit
> > uncomfortable about applying the idea to a histogram with only 10
> > entries (especially if we ignore two of 'em).  With 100 or more,
> > it sounds all right.  What's the breakpoint?
> 
> Yes, I think 100-200 could be a good breakpoint. I don't actually know 
> what is the current usage of SET STATISTICS, I usually set it to 1000 
> for columns which need more precise selectivity.
> 
> The breakpoint could be set even higher (500?) so there is space to 
> increase statistics without enabling the histogram check, but I don't 
> feel very comfortable though suggesting this kind of possibly 
> undocumented side effect...


Hi everyone,

You may be interested to have a look at the statistics collector for the
geometry type within PostGIS. In order to prevent very large or very
small geometries from ruining the statistics histogram and generating
incorrect query plans, we make the assumption that the column
distribution is likely to be close to normal, and then remove any
ANALYZE-collected geometries from the set that lie outside +/- 3.25
standard deviations from the mean before creating the final histogram
(removes just under 1% of the data from each end of an assumed normal
distribution). This works well and AFAIK we've only ever had one
reported case of an incorrect query plan being generated using this
method.
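[In SQL terms the trimming step amounts to something like the following
sketch, where "sample(v)" stands for the ANALYZE-collected values and is
purely hypothetical:

select v
from sample,
     (select avg(v) as m, stddev(v) as s from sample) stats
where v between m - 3.25 * s and m + 3.25 * s;
]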


Kind regards,

Mark.




Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Tom Lane wrote:
>>> I'm not sure what we could do about the concurrent-sessions issue, but
>>> we could make some sort of attack on the query complexity issue by
>>> pro-rating the effective_cache_size among all the tables used by a
>>> query.
> 
>> hmm not sure i understand what you mean here :-(
> 
> Per the comment for index_pages_fetched:
> 
>  * We assume that effective_cache_size is the total number of buffer pages
>  * available for both table and index, and pro-rate that space between the
>  * table and index.  (Ideally other_pages should include all the other
>  * tables and indexes used by the query too; but we don't have a good way
>  * to get that number here.)
> 
> A first-order approximation to this would be to add up the total sizes
> of all the other tables used in the query.  I am thinking of leaving out
> other indexes, mainly because we can't tell at this level which other
> indexes are actually gonna get used.  This would tend to underestimate
> by leaving out indexes, but not by a lot if you assume indexes are much
> smaller than their tables.  It would also be an overestimate because
> tables that are not indexscanned concurrently with the one under
> consideration probably shouldn't be counted anyway.  So one might hope
> these effects would more or less cancel out.  Anyway it seems to be a
> better idea than what we have now.


aah - I think I understand that logic now - thanks for the reference to
the source :-)

> 
>> I will redo with lower settings - do you have any suggestions for that ?
> 
> Try reducing effective_cache_size to maybe a fourth of what it is now.
> If that helps the thing pick better plans for these multi-table queries,
> then we should try changing the other_pages calculation as above.

ok - the planner switches to a different plan at about 2.5GB of
effective_cache_size resulting in the following plan:

http://www.kaltenbrunner.cc/files/analyze_q7_1GB.txt (3 consecutive runs
- starting with cold caches)

with 6GB I get:

http://www.kaltenbrunner.cc/files/analyze_q7_6GB.txt (single run -
immediately after the above ones)


Stefan


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> ok - the planner switches to a different plan at about 2.5GB of
> effective_cache_size resulting in the following [ much better ] plan:

OK, so it sounds like it'd be a good idea to try to pro-rate
effective_cache_size among all the tables in the query.  I'll see
if I can get that in without kluging the code up too much.
        regards, tom lane


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Tom Lane
Date:
Matteo Beccati <php@beccati.com> writes:
> Tom Lane ha scritto:
>> Matteo Beccati <php@beccati.com> writes:
>>> I cannot see anything bad by using something like that:
>>> if (histogram is large/representative enough)
>> 
>> Well, the question is exactly what is "large enough"?  I feel a bit
>> uncomfortable about applying the idea to a histogram with only 10
>> entries (especially if we ignore two of 'em).  With 100 or more,
>> it sounds all right.  What's the breakpoint?

> Yes, I think 100-200 could be a good breakpoint.

I've committed this change with (for now) 100 as the minimum histogram
size to use.  Stefan, are you interested in retrying your benchmark?
        regards, tom lane


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Matteo Beccati
Date:
Hi,

Tom Lane wrote:
> I've committed this change with (for now) 100 as the minimum histogram
> size to use.  Stefan, are you interested in retrying your benchmark?

A first try with ltree gave big improvements on my smaller data set: the 
estimated row count is correct or off by only 1 row. I'm now restoring a 
bigger database to get more reliable results.

I hope Stefan can confirm the improvement on dbt3 too.

Thanks Tom :)


Best regards
--
Matteo Beccati
http://phpadsnew.com
http://phppgads.com


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Matteo Beccati <php@beccati.com> writes:
>> Tom Lane ha scritto:
>>> Matteo Beccati <php@beccati.com> writes:
>>>> I cannot see anything bad by using something like that:
>>>> if (histogram is large/representative enough)
>>> Well, the question is exactly what is "large enough"?  I feel a bit
>>> uncomfortable about applying the idea to a histogram with only 10
>>> entries (especially if we ignore two of 'em).  With 100 or more,
>>> it sounds all right.  What's the breakpoint?
> 
>> Yes, I think 100-200 could be a good breakpoint.
> 
> I've committed this change with (for now) 100 as the minimum histogram
> size to use.  Stefan, are you interested in retrying your benchmark?

sure - but I'm having hardware (hard disk firmware) related issues on my
testbox, which will take a few more days to be resolved ...


Stefan


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
[RESENT in a modified version since the original reply seems to have been
lost by the list server, which seems to happen sometimes lately]

Tom Lane wrote:
> Matteo Beccati <php@beccati.com> writes:
>> Tom Lane ha scritto:
>>> Matteo Beccati <php@beccati.com> writes:
>>>> I cannot see anything bad by using something like that:
>>>> if (histogram is large/representative enough)
>>> Well, the question is exactly what is "large enough"?  I feel a bit
>>> uncomfortable about applying the idea to a histogram with only 10
>>> entries (especially if we ignore two of 'em).  With 100 or more,
>>> it sounds all right.  What's the breakpoint?
> 
>> Yes, I think 100-200 could be a good breakpoint.
> 
> I've committed this change with (for now) 100 as the minimum histogram
> size to use.  Stefan, are you interested in retrying your benchmark?

spent some time retesting that, and I got the following results (this is
the same box as before, but with a much slower disk setup and a newly
initdb'd cluster):

http://www.kaltenbrunner.cc/files/analyze_q9_beta1.txt

all that is with 2GB of effective_cache_size (the plan does not change
with much smaller settings, btw), a statistics target of 1000, and the
following parameters:

default planner settings for the EXPLAIN and the first EXPLAIN ANALYZE -
then the same query with hash joins disabled, the next one with nested
loops disabled, and the fastest one with both nested loops and hash
joins disabled (all run in that order - so there are possible caching
effects).

in comparison to:

http://www.kaltenbrunner.cc/files/analyze_q9.txt

we got nearly a seven-figure speedup due to the latest changes (much
better estimates at least) - however the mergejoin+sort-only plan is
still faster.



the other troubling query is the following:

query: http://www.kaltenbrunner.cc/files/7/power1/db/plans/power_query21.txt

plans: http://www.kaltenbrunner.cc/files/analyze_q21_beta1.txt

(default,default,enable_nestloop=off,enable_nestloop=off and
enable_hashjoin=off)

despite having not-too-bad estimates for most of the key points in the
plan, the actual runtime of the chosen plan is quite disappointing.




Stefan


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> http://www.kaltenbrunner.cc/files/analyze_q9_beta1.txt

The next problem seems to be the drastic misestimation of this join
size:
   ->  Nested Loop  (cost=0.00..6872092.36 rows=135 width=28) (actual time=94.762..14429291.129 rows=3554044 loops=1)
         ->  Merge Join  (cost=0.00..519542.74 rows=449804 width=16) (actual time=48.197..49636.006 rows=474008 loops=1)
               Merge Cond: (part.p_partkey = partsupp.ps_partkey)
               ->  Index Scan using pk_part on part  (cost=0.00..105830.22 rows=112447 width=4) (actual time=34.646..14381.644 rows=118502 loops=1)
                     Filter: ((p_name)::text ~~ '%ghost%'::text)
               ->  Index Scan using i_ps_partkey on partsupp  (cost=0.00..388943.05 rows=8000278 width=12) (actual time=13.511..22659.364 rows=7999685 loops=1)
         ->  Index Scan using i_l_suppkey_partkey on lineitem  (cost=0.00..14.11 rows=1 width=24) (actual time=4.415..30.310 rows=7 loops=474008)
               Index Cond: ((partsupp.ps_partkey = lineitem.l_partkey) AND (partsupp.ps_suppkey = lineitem.l_suppkey))

With a factor-of-25000 error in that rowcount estimate, it's amazing the
plans aren't worse than they are.

It evidently thinks that most of the rows in the join of part and
partsupp won't have any matching rows in lineitem, whereas on average
there are about 7 matching rows apiece.  So that's totally wacko, and
it's not immediately obvious why.  Could we see the pg_stats entries for
part.p_partkey, partsupp.ps_partkey, partsupp.ps_suppkey,
lineitem.l_partkey, lineitem.l_suppkey?
        regards, tom lane


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> http://www.kaltenbrunner.cc/files/analyze_q9_beta1.txt
> 
> The next problem seems to be the drastic misestimation of this join
> size:
> 
>     ->  Nested Loop  (cost=0.00..6872092.36 rows=135 width=28) (actual time=94.762..14429291.129 rows=3554044 loops=1)
>           ->  Merge Join  (cost=0.00..519542.74 rows=449804 width=16) (actual time=48.197..49636.006 rows=474008 loops=1)
>                 Merge Cond: (part.p_partkey = partsupp.ps_partkey)
>                 ->  Index Scan using pk_part on part  (cost=0.00..105830.22 rows=112447 width=4) (actual time=34.646..14381.644 rows=118502 loops=1)
>                       Filter: ((p_name)::text ~~ '%ghost%'::text)
>                 ->  Index Scan using i_ps_partkey on partsupp  (cost=0.00..388943.05 rows=8000278 width=12) (actual time=13.511..22659.364 rows=7999685 loops=1)
>           ->  Index Scan using i_l_suppkey_partkey on lineitem  (cost=0.00..14.11 rows=1 width=24) (actual time=4.415..30.310 rows=7 loops=474008)
>                 Index Cond: ((partsupp.ps_partkey = lineitem.l_partkey) AND (partsupp.ps_suppkey = lineitem.l_suppkey))
> 
> With a factor-of-25000 error in that rowcount estimate, it's amazing the
> plans aren't worse than they are.
> 
> It evidently thinks that most of the rows in the join of part and
> partsupp won't have any matching rows in lineitem, whereas on average
> there are about 7 matching rows apiece.  So that's totally wacko, and
> it's not immediately obvious why.  Could we see the pg_stats entries for
> part.p_partkey, partsupp.ps_partkey, partsupp.ps_suppkey,
> lineitem.l_partkey, lineitem.l_suppkey?


http://www.kaltenbrunner.cc/files/dbt3_beta1_statistics.txt


Stefan


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> It evidently thinks that most of the rows in the join of part and
>> partsupp won't have any matching rows in lineitem, whereas on average
>> there are about 7 matching rows apiece.  So that's totally wacko, and
>> it's not immediately obvious why.  Could we see the pg_stats entries for
>> part.p_partkey, partsupp.ps_partkey, partsupp.ps_suppkey,
>> lineitem.l_partkey, lineitem.l_suppkey?

> http://www.kaltenbrunner.cc/files/dbt3_beta1_statistics.txt

OK, so we have 2 million parts and 100000 suppliers, and ANALYZE doesn't
seem to have been too far off at estimating either of those numbers.
I think the problem is that there are not very many suppliers for any
particular part, and thus the condition "part match AND supplier match"
is really not much more selective than "part match" alone.  The planner
is supposing that their selectivities are independent, which they
aren't.

Offhand I don't see any good way to fix this without multi-column
statistics, which is something that's certainly not happening for 8.2 :-(
        regards, tom lane


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Tom Lane wrote:
>>> It evidently thinks that most of the rows in the join of part and
>>> partsupp won't have any matching rows in lineitem, whereas on average
>>> there are about 7 matching rows apiece.  So that's totally wacko, and
>>> it's not immediately obvious why.  Could we see the pg_stats entries for
>>> part.p_partkey, partsupp.ps_partkey, partsupp.ps_suppkey,
>>> lineitem.l_partkey, lineitem.l_suppkey?
> 
>> http://www.kaltenbrunner.cc/files/dbt3_beta1_statistics.txt
> 
> OK, so we have 2 million parts and 100000 suppliers, and ANALYZE doesn't
> seem to have been too far off at estimating either of those numbers.
> I think the problem is that there are not very many suppliers for any
> particular part, and thus the condition "part match AND supplier match"
> is really not much more selective than "part match" alone.  The planner
> is supposing that their selectivities are independent, which they
> aren't.

looks like there are exactly 4 suppliers for any given part, so that
does indeed seem to be the problem :-(

> 
> Offhand I don't see any good way to fix this without multi-column
> statistics, which is something that's certainly not happening for 8.2 :-(

too bad - however, any idea on one of the other troubling queries (Q21)
I mentioned in the mail I resent to the list (after the original one got
lost)?

http://archives.postgresql.org/pgsql-hackers/2006-09/msg02011.php


Stefan
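[The "exactly 4 suppliers for any given part" observation can be
double-checked with something like the following (illustrative, against
the dbt3 partsupp table):

select min(cnt) as min_suppliers, max(cnt) as max_suppliers
from (select ps_partkey, count(*) as cnt
      from partsupp
      group by ps_partkey) per_part;
]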


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Matteo Beccati
Date:
Stefan Kaltenbrunner wrote:
> too bad - however any idea on one of the other troubling querys (q21) I
> mentioned in the mail I resent to the list (after the original one got
> lost)?
> 
> http://archives.postgresql.org/pgsql-hackers/2006-09/msg02011.php

What happens if you increase statistics for l_orderkey?


Best regards
-- 
Matteo Beccati
http://phpadsnew.com
http://phppgads.com


Re: -HEAD planner issue wrt hash_joins on dbt3 ?

From
Stefan Kaltenbrunner
Date:
Matteo Beccati wrote:
> Stefan Kaltenbrunner wrote:
>> too bad - however any idea on one of the other troubling querys (q21) I
>> mentioned in the mail I resent to the list (after the original one got
>> lost)?
>>
>> http://archives.postgresql.org/pgsql-hackers/2006-09/msg02011.php
> 
> What happens if you increase statistics for l_orderkey?

the statistics target is already at 1000 ...


Stefan