Re: Should Oracle outperform PostgreSQL on a complex - Mailing list pgsql-performance

From Simon Riggs
Subject Re: Should Oracle outperform PostgreSQL on a complex
Date
Msg-id 1134913732.2964.164.camel@localhost.localdomain
Whole thread Raw
In response to Re: Should Oracle outperform PostgreSQL on a complex  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Should Oracle outperform PostgreSQL on a complex  (Mark Kirkwood <markir@paradise.net.nz>)
List pgsql-performance
On Sat, 2005-12-17 at 13:13 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Fri, 2005-12-16 at 23:28 -0500, Bruce Momjian wrote:
> >> How are star joins different from what we do now?
>
> > Methods:
> > 1. join all N small tables together in a cartesian product, then join to
> > main Large table once (rather than N times)
>
> Of course, the reason the current planner does not think of this is that
> it does not consider clauseless joins unless there is no alternative.

Understood

>  The plan you currently get for
> this sort of scenario is typically a nest of hash joins:
>
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Hash Join  (cost=2.25..4652.25 rows=102400 width=16)
>    Hash Cond: ("outer".f1 = "inner".f1)
>    ->  Hash Join  (cost=1.12..3115.12 rows=102400 width=12)
>          Hash Cond: ("outer".f2 = "inner".f1)
>          ->  Seq Scan on fact  (cost=0.00..1578.00 rows=102400 width=8)
>          ->  Hash  (cost=1.10..1.10 rows=10 width=4)
>                ->  Seq Scan on d2  (cost=0.00..1.10 rows=10 width=4)
>    ->  Hash  (cost=1.10..1.10 rows=10 width=4)
>          ->  Seq Scan on d1  (cost=0.00..1.10 rows=10 width=4)
> (9 rows)
>
> This involves only one scan of the fact table.  As each row is pulled up
> through the nest of hash joins, we hash one dimension key and join to
> one small table at each level.

Understood

> This is at worst the same amount of work
> as hashing all the keys at once and probing a single cartesian-product
> hashtable, probably less work (fewer wasted key-comparisons).  And
> definitely less memory.  You do have to keep your eye on the ball that
> you don't waste a lot of overhead propagating the row up through
> multiple join levels, but we got rid of most of the problem there in
> 8.1 via the introduction of "virtual tuple slots".  If this isn't fast
> enough yet, it'd make more sense to invest effort in further cutting the
> executor's join overhead (which of course benefits *all* plan types)
> than in trying to make the planner choose a star join.

That join type is used when an index-organised table is available, so
that a SeqScan of the larger table can be avoided.

I'd say the plan would make sense if the columns of the cartesian
product match a multi-column index on the larger table that would not
ever be used unless sufficient columns are restricted in each lookup.
That way you are able to avoid the SeqScan that occurs for the multiple
nested Hash Join case. (Clearly, normal selectivity rules apply on the
use of the index in this way).

So I think that plan type still can be effective in some circumstances.
Mind you: building an N-way index on a large table isn't such a good
idea, unless you can partition the tables and still use a join. Which is
why I've not centred on this case as being important before now.

My understanding: Teradata and DB2 use this.

This may be covered by patents.

> > 2. transform joins into subselects, then return subselect rows via an
> > index bitmap. Joins are performed via a bitmap addition process.
>
> This one might be interesting but it's not clear what you are talking
> about.  "Bitmap addition"?

Ref: "Re: [HACKERS] slow IN() clause for many cases"

Required Transforms: join -> IN (subselect) -> = ANY(ARRAY(subselect))
ending with the ability to use an bitmap index scan
(which clearly requires a run-time, not a plan-time evaluation - though
the distinction is minor if you go straight from plan->execute as is the
case with most Data Warehouse queries).

If you do this for all joins, you can then solve the problem with a
Bitmap And step, which is what I meant by "addition".

If you have need columns in the result set from the smaller tables you
can get them by joining the result set back to the smaller tables again.

My understanding: Oracle uses this.

This may be covered by patents.

Best Regards, Simon Riggs


pgsql-performance by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Should Oracle outperform PostgreSQL on a complex
Next
From: Juan Casero
Date:
Subject: PostgreSQL and Ultrasparc T1