Thread: performance regression in 9.2 CTE with SRF function

performance regression in 9.2 CTE with SRF function

From
Pavel Stehule
Date:
Hello

In Czech discussion group was reported performance regression of CTE
query. I wrote a test, when I can show it.

Execution on 9.2, 9.3 is about 9 sec, on 8.4, 9.1 is about 30ms with
default work_mem ='1MB'


optimal plan (8.4)
                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=45.54..69.32 rows=441 width=4) (actual
time=29.943..29.943 rows=0 loops=1)
   Output: x2.a
   Hash Cond: (x2.a = pl.a)
   CTE pl
     ->  HashAggregate  (cost=17.50..18.39 rows=89 width=4) (actual
time=6.723..7.016 rows=89 loops=1)
           Output: x.a
           ->  Seq Scan on x  (cost=0.00..15.00 rows=1000 width=4)
(actual time=0.026..3.390 rows=1000 loops=1)
                 Output: x.a
   ->  Seq Scan on x2  (cost=24.25..41.75 rows=500 width=4) (actual
time=21.178..21.757 rows=131 loops=1)
         Output: x2.a, x2.b
         Filter: (NOT (hashed SubPlan 2))
         SubPlan 2
           ->  CTE Scan on pl  (cost=0.00..24.03 rows=89 width=4)
(actual time=0.208..18.821 rows=1000 loops=1)
                 Output: foo(pl.a)
   ->  Hash  (cost=1.78..1.78 rows=89 width=4) (actual
time=7.895..7.895 rows=89 loops=1)
         Output: pl.a
         ->  CTE Scan on pl  (cost=0.00..1.78 rows=89 width=4) (actual
time=6.733..7.592 rows=89 loops=1)
               Output: pl.a
 Total runtime: 30.046 ms

suboptimal plan (9.3)
                                                          QUERY PLAN

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Nested Loop  (cost=18.39..346248.92 rows=441 width=4) (actual
time=9373.295..9373.295 rows=0 loops=1)
   Output: x2.a
   Join Filter: (x2.a = pl.a)
   Rows Removed by Join Filter: 11837
   CTE pl
     ->  HashAggregate  (cost=17.50..18.39 rows=89 width=4) (actual
time=4.363..4.539 rows=89 loops=1)
           Output: x.a
           ->  Seq Scan on public.x  (cost=0.00..15.00 rows=1000
width=4) (actual time=0.031..2.133 rows=1000 loops=1)
                 Output: x.a
   ->  CTE Scan on pl  (cost=0.00..1.78 rows=89 width=4) (actual
time=4.371..4.519 rows=89 loops=1)
         Output: pl.a
   ->  Materialize  (cost=0.00..345562.50 rows=500 width=4) (actual
time=0.203..105.051 rows=133 loops=89)
         Output: x2.a
         ->  Seq Scan on public.x2  (cost=0.00..345560.00 rows=500
width=4) (actual time=17.914..9330.645 rows=133 loops=1)
               Output: x2.a
               Filter: (NOT (SubPlan 2))
               Rows Removed by Filter: 867
               SubPlan 2
                 ->  CTE Scan on pl pl_1  (cost=0.00..468.59
rows=89000 width=4) (actual time=0.023..8.379 rows=566 loops=1000)
                       Output: foo(pl_1.a)
 Total runtime: 9373.444 ms

Higher work_mem 3MB fixes this issue
                                                       QUERY PLAN

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Hash Join  (cost=712.37..736.15 rows=441 width=4) (actual
time=38.024..38.024 rows=0 loops=1)
   Output: x2.a
   Hash Cond: (x2.a = pl.a)
   CTE pl
     ->  HashAggregate  (cost=17.50..18.39 rows=89 width=4) (actual
time=8.457..8.759 rows=89 loops=1)
           Output: x.a
           ->  Seq Scan on public.x  (cost=0.00..15.00 rows=1000
width=4) (actual time=0.029..5.073 rows=1000 loops=1)
                 Output: x.a
   ->  Seq Scan on public.x2  (cost=691.09..708.59 rows=500 width=4)
(actual time=27.629..28.152 rows=105 loops=1)
         Output: x2.a, x2.b
         Filter: (NOT (hashed SubPlan 2))
         Rows Removed by Filter: 895
         SubPlan 2
           ->  CTE Scan on pl pl_1  (cost=0.00..468.59 rows=89000
width=4) (actual time=0.328..24.224 rows=1000 loops=1)
                 Output: foo(pl_1.a)
   ->  Hash  (cost=1.78..1.78 rows=89 width=4) (actual
time=9.650..9.650 rows=89 loops=1)
         Output: pl.a
         Buckets: 1024  Batches: 1  Memory Usage: 3kB
         ->  CTE Scan on pl  (cost=0.00..1.78 rows=89 width=4) (actual
time=8.468..9.346 rows=89 loops=1)
               Output: pl.a
 Total runtime: 38.153 ms

There is terribly wrong estimation when work_mem is lower

               SubPlan 2
                 ->  CTE Scan on pl pl_1  (cost=0.00..468.59
rows=89000 width=4) (actual time=0.023..8.379 rows=566 loops=1000)
                       Output: foo(pl_1.a)

tables x and x2 are wery small - 64KB

link to original thread - sorry it is in Czech language (but attached
plans are in English :))
http://groups.google.com/group/postgresql-cz/browse_thread/thread/75e0e34e88950802?hl=cs

Regards
Pavel Stehule

Attachment

Re: performance regression in 9.2 CTE with SRF function

From
Tom Lane
Date:
Pavel Stehule <pavel.stehule@gmail.com> writes:
> In Czech discussion group was reported performance regression of CTE
> query. I wrote a test, when I can show it.

I don't see anything tremendously wrong here.  The older branches are
choosing the right plan for entirely wrong reasons: they don't notice
that "select foo(a) from pl" has a set-returning function in the
targetlist and so don't adjust the estimated number of result rows
for that.  In this particular example, foo() seems to return an average
of about 11 rows per call, versus the default estimate of 1000 rows per
call, so the size of the result is overestimated and that dissuades
the planner from using a hashed subplan.  But the error could easily
have gone the other way, causing the planner to use a hashed subplan
when it shouldn't, and the consequences of that are even worse.  So
I don't think that ignoring SRFs in the targetlist is better.

If you add "ROWS 10" or so to the declaration of the function, you
get a better row estimate and it goes back to the hashed subplan.
        regards, tom lane



Re: performance regression in 9.2 CTE with SRF function

From
Pavel Stehule
Date:
2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>:
> Pavel Stehule <pavel.stehule@gmail.com> writes:
>> In Czech discussion group was reported performance regression of CTE
>> query. I wrote a test, when I can show it.
>
> I don't see anything tremendously wrong here.  The older branches are
> choosing the right plan for entirely wrong reasons: they don't notice
> that "select foo(a) from pl" has a set-returning function in the
> targetlist and so don't adjust the estimated number of result rows
> for that.  In this particular example, foo() seems to return an average
> of about 11 rows per call, versus the default estimate of 1000 rows per
> call, so the size of the result is overestimated and that dissuades
> the planner from using a hashed subplan.  But the error could easily
> have gone the other way, causing the planner to use a hashed subplan
> when it shouldn't, and the consequences of that are even worse.  So
> I don't think that ignoring SRFs in the targetlist is better.

no, there is strange estimation
        ->  Seq Scan on public.x2  (cost=0.00..345560.00 rows=500
width=4) (actual time=17.914..9330.645 rows=133 loops=1)              Output: x2.a              Filter: (NOT (SubPlan
2))             Rows Removed by Filter: 867              SubPlan 2                ->  CTE Scan on pl pl_1
(cost=0.00..468.59
rows=89000 width=4) (actual time=0.023..8.379 rows=566 loops=1000)                      Output: foo(pl_1.a)

CTE Scan expect rows=89000

I don't know how is possible to take too high number

Regards

Pavel

>
> If you add "ROWS 10" or so to the declaration of the function, you
> get a better row estimate and it goes back to the hashed subplan.
>
>                         regards, tom lane



Re: performance regression in 9.2 CTE with SRF function

From
Pavel Stehule
Date:
>
> no, there is strange estimation
>
>          ->  Seq Scan on public.x2  (cost=0.00..345560.00 rows=500
> width=4) (actual time=17.914..9330.645 rows=133 loops=1)
>                Output: x2.a
>                Filter: (NOT (SubPlan 2))
>                Rows Removed by Filter: 867
>                SubPlan 2
>                  ->  CTE Scan on pl pl_1  (cost=0.00..468.59
> rows=89000 width=4) (actual time=0.023..8.379 rows=566 loops=1000)
>                        Output: foo(pl_1.a)
>
> CTE Scan expect rows=89000
>
> I don't know how is possible to take too high number
>

respective why estimation is unstrable

first (1MB work_mem)
                ->  CTE Scan on pl pl_1  (cost=0.00..468.59
rows=89000 width=4) (actual time=0.023..8.379 rows=566 loops=1000)                      Output: foo(pl_1.a)


second (3MB work_mem)
 ->  Hash  (cost=1.78..1.78 rows=89 width=4) (actual
time=9.650..9.650 rows=89 loops=1)        Output: pl.a        Buckets: 1024  Batches: 1  Memory Usage: 3kB        ->
CTEScan on pl  (cost=0.00..1.78 rows=89 width=4) (actual
 
time=8.468..9.346 rows=89 loops=1)              Output: pl.a

I expect so estimation not depends on work_mem

Best regards

Pavel

> Regards
>
> Pavel
>
>>
>> If you add "ROWS 10" or so to the declaration of the function, you
>> get a better row estimate and it goes back to the hashed subplan.
>>
>>                         regards, tom lane



Re: performance regression in 9.2 CTE with SRF function

From
Tom Lane
Date:
Pavel Stehule <pavel.stehule@gmail.com> writes:
>> no, there is strange estimation
>> 
>> SubPlan 2
>> ->  CTE Scan on pl pl_1  (cost=0.00..468.59
>> rows=89000 width=4) (actual time=0.023..8.379 rows=566 loops=1000)
>> Output: foo(pl_1.a)

Nothing strange about it.  89 rows out from the underlying CTE (which
appears to be dead accurate) times 1000 for the SRF's expansion.

> respective why estimation is unstrable

It isn't --- you are looking at two different subplans there, one with
a SRF in it and one without.
        regards, tom lane



Re: performance regression in 9.2 CTE with SRF function

From
Pavel Stehule
Date:
2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>:
> Pavel Stehule <pavel.stehule@gmail.com> writes:
>>> no, there is strange estimation
>>>
>>> SubPlan 2
>>> ->  CTE Scan on pl pl_1  (cost=0.00..468.59
>>> rows=89000 width=4) (actual time=0.023..8.379 rows=566 loops=1000)
>>> Output: foo(pl_1.a)
>
> Nothing strange about it.  89 rows out from the underlying CTE (which
> appears to be dead accurate) times 1000 for the SRF's expansion.
>
>> respective why estimation is unstrable
>
> It isn't --- you are looking at two different subplans there, one with
> a SRF in it and one without.

ok, I understand now

Thank you

Regards

Pavel

>
>                         regards, tom lane