Re: star schema and the optimizer - Mailing list pgsql-hackers

From Marc Cousin
Subject Re: star schema and the optimizer
Date
Msg-id 54F17668.902@gmail.com
Whole thread Raw
In response to Re: star schema and the optimizer  (Marc Cousin <cousinmarc@gmail.com>)
Responses Re: star schema and the optimizer
List pgsql-hackers
On 27/02/2015 20:01, Marc Cousin wrote:
> On 27/02/2015 19:45, Tom Lane wrote:
>>> I wrote:
>>>> I had actually thought that we'd fixed this type of problem in recent
>>>> versions, and that you should be able to get a plan that would look
>>>> like
>>
>>>> Nestloop
>>>>    -> scan dim1
>>>>    -> Nestloop
>>>>         -> scan dim2
>>>>         -> indexscan fact table using dim1.a and dim2.b
>>
>> After closer study, I think this is an oversight in commit
>> e2fa76d80ba571d4de8992de6386536867250474, which quoth
>>
>> +It can be useful for the parameter value to be passed down through
>> +intermediate layers of joins, for example:
>> +
>> +    NestLoop
>> +        -> Seq Scan on A
>> +        Hash Join
>> +            Join Condition: B.Y = C.W
>> +            -> Seq Scan on B
>> +            -> Index Scan using C_Z_IDX on C
>> +                Index Condition: C.Z = A.X
>> +
>> +If all joins are plain inner joins then this is unnecessary, because
>> +it's always possible to reorder the joins so that a parameter is used
>> +immediately below the nestloop node that provides it.  But in the
>> +presence of outer joins, join reordering may not be possible, and then
>> +this option can be critical.  Before version 9.2, Postgres used ad-hoc
>>
>> This reasoning overlooked the fact that if we need parameters from
>> more than one relation, and there's no way to join those relations
>> to each other directly, then we have to allow passing the dim1 parameter
>> down through the join to dim2.
>>
>> The attached patch seems to fix it (modulo the need for some updates
>> in the README, and maybe a regression test).  Could you see if this
>> produces satisfactory plans for you?
> 
> 
> From what I see, it's just perfect. I'll give it a more thorough look a
> bit later, but it seems to be exactly what I was waiting for.
> 
> Thanks a lot.
> 
> Regards

I gave it another look this morning. It works very well with the initial test schema. The situation is much improved
forme.
 

I still have one issue: I've extended the test to more than 2 dimensions.

marc=# explain analyze SELECT * FROM dim1,dim2,dim3,dim4,facts WHERE facts.dim1=dim1.a and facts.dim2=dim2.a and
facts.dim3=dim3.aand facts.dim4=dim4.a and dim1.b between 10 and 60 AND dim2.b between 30 and 50 and dim3.b=8526 and
dim4.bbetween 70 and 90;                                                                      QUERY PLAN
                                                     
 

------------------------------------------------------------------------------------------------------------------------------------------------------Nested
Loop (cost=15.00..957710.99 rows=1 width=200) (actual time=793.247..793.247 rows=0 loops=1)  ->  Nested Loop
(cost=14.71..957704.75rows=1 width=192) (actual time=793.245..793.245 rows=0 loops=1)        Join Filter: (facts.dim3 =
dim3.a)       Rows Removed by Join Filter: 942        ->  Index Scan using idxdim32 on dim3  (cost=0.29..3300.29
rows=10width=8) (actual time=49.498..67.923 rows=1 loops=1)              Filter: (b = 8526)              Rows Removed
byFilter: 99999        ->  Materialize  (cost=14.41..954262.56 rows=962 width=184) (actual time=0.552..724.988 rows=942
loops=1)             ->  Nested Loop  (cost=14.41..954257.75 rows=962 width=184) (actual time=0.542..723.380 rows=942
loops=1)                   ->  Index Scan using idxdim22 on dim2  (cost=0.29..3648.29 rows=192 width=16) (actual
time=0.074..47.445rows=229 loops=1)                          Filter: ((b >= 30) AND (b <= 50))
RowsRemoved by Filter: 99771                    ->  Nested Loop  (cost=14.12..4946.08 rows=501 width=168) (actual
time=2.575..2.946rows=4 loops=229)                          ->  Bitmap Heap Scan on dim1  (cost=13.43..573.60 rows=501
width=16)(actual time=0.170..0.647 rows=509 loops=229)                                Recheck Cond: ((b >= 10) AND (b
<=60))                                Heap Blocks: exact=78547                                ->  Bitmap Index Scan on
idxdim11 (cost=0.00..13.30 rows=501 width=0) (actual time=0.112..0.112 rows=509 loops=229)
       Index Cond: ((b >= 10) AND (b <= 60))                          ->  Index Scan using idxfacts on facts
(cost=0.69..8.72rows=1 width=152) (actual time=0.004..0.004 rows=0 loops=116561)                                Index
Cond:((dim1 = dim1.a) AND (dim2 = dim2.a))  ->  Index Scan using idxdim42 on dim4  (cost=0.29..6.24 rows=1 width=8)
(neverexecuted)        Index Cond: (a = facts.dim4)        Filter: ((b >= 70) AND (b <= 90))Planning time: 8.092
msExecutiontime: 793.621 ms
 
(25 lignes)

marc=# \d facts      Table « public.facts »Colonne |  Type  | Modificateurs 
---------+--------+---------------dim1    | bigint | dim2    | bigint | dim3    | bigint | dim4    | bigint | dim5    |
bigint| dim6    | bigint | dim7    | bigint | dim8    | bigint | dim9    | bigint | dim10   | bigint | data1   | text
|data2   | text   | data3   | text   | data4   | text   | data5   | text   | data6   | text   | 
 
Index :   "idxfacts" btree (dim1, dim2, dim3, dim4, dim5, dim6, dim7, dim8, dim9, dim10)

I hoped that the dim3=dim3.a condition could be used in the index scan on facts (I have chosen the 8526 value because
itonly has one matching row). Maybe that's not possible, or not efficient, but I thought I'd rather ask.
 


Regards.



pgsql-hackers by date:

Previous
From: Vadim Gribanov
Date:
Subject: Re: Docs about shared memory
Next
From: Michael Paquier
Date:
Subject: Re: Strange assertion using VACOPT_FREEZE in vacuum.c