Thread: Strange sort node/explain result
Hello all!

Last week at pgconf.eu, my good friend Corrado aka. Dino (in CC) showed me a plan he couldn't make sense of. I stood there, scratching my head, and couldn't make sense of it either. What comforted me somewhat is that nobody else we showed it to, including Peter Geoghegan (but to be fair, it was on day 3 of the conference ;-)), could explain what's going on there. So we follow Peter's advice and post this here.

Background: Dino played a bit in a test DB, wondering if putting all the extract() values into a CTE when querying the large timeseries table would save time compared to generating them on the fly or storing them in the timeseries table (as happens right now). Mind you, this inquiry is not about whether that's a good idea, the query's performance, or even the planner's choice of a merge join!

What puzzles us is the part where the CTE "oneyear" somehow explodes into a sort node of almost 10 million rows (but not the same number as the index scan emits!), taking ~0.4 seconds but only using 40-50 kB of memory:

  ->  Sort  (cost=69.83..72.33 rows=1000 width=4) (actual time=0.418..448.397 rows=9855001 loops=1)
        Output: y.d
        Sort Key: y.d
        Sort Method: quicksort  Memory: 42kB
        ->  CTE Scan on oneyear y  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.079..0.383 rows=366 loops=1)
              Output: y.d

Full plan is here: https://explain.depesz.com/s/B3mN

A test case to reproduce is attached. It uses roughly the same number of rows as the original data. I tried this on 9.6, 10, 11, 12, 14 & 15, and it always shows the same effect. We also tried to "broaden" everything (adding more columns to the CTE, fetching more or less from the main table, etc.) in Berlin, but the outcome is rock solid, so to speak. Even the memory footprint of the sort node doesn't change at all.
Dropping the index, though, leads to a completely different plan, and _that_ sort node is sane (but not fed by a CTE scan, obviously), using the same 42kB of memory:

  ->  Sort  (cost=59.84..62.34 rows=1000 width=8) (actual time=57.275..57.355 rows=366 loops=1)
        Output: y.d, ((y.d)::date)
        Sort Key: ((y.d)::date)
        Sort Method: quicksort  Memory: 42kB
        Buffers: shared hit=4
        ->  Function Scan on pg_catalog.generate_series y  (cost=0.01..10.01 rows=1000 width=8) (actual time=57.190..57.233 rows=366 loops=1)
              Output: y.d, (y.d)::date
              Function Call: generate_series((now() - '365 days'::interval), now(), '1 day'::interval)

Some guesses:
- Maybe the sort node gets attributed wrongly and it's actually the index scan result that gets sorted? But that wouldn't be necessary, after all...
- Maybe the sort really takes much more memory, but the executor doesn't report that correctly?
- Why does the CTE sort node sort exactly 26,999 rows fewer than the main table has, when that table sports exactly 27,000 rows per day?
- Maybe we have found a long-missing optimization opportunity? ;-)

We are seriously clueless; pls. enlighten us! ;-)

Best regards,
-- 
Gunnar "Nick" Bluth
Eimermacherweg 106
D-48159 Münster
Mobil +49 172 8853339
Email: gunnar.bluth@pro-open.de
__________________________________________________________________________
"Ceterum censeo SystemD esse delendam" - Cato
Attachment
On Tue, 1 Nov 2022 at 03:20, Gunnar "Nick" Bluth <gunnar.bluth@pro-open.de> wrote:
> What puzzles us is the part where the CTE "oneyear" somehow explodes
> into a sort node of almost 10 mio (but not the same amount as the index
> scan emits!) rows, taking ~ 0.4 seconds but only using 4x-5x kB of memory:
>
> ->  Sort  (cost=69.83..72.33 rows=1000 width=4) (actual
> time=0.418..448.397 rows=9855001 loops=1)

I think you're interpreting the EXPLAIN output wrongly. It's not that the Sort node emits 9855001 rows, it's that 9855001 rows are read from the Sort node.

The reason more rows are read from it than are produced is that Merge Join must perform mark and restore to "rewind" the inner side of the scan back for a subsequent outer tuple which has the same value. E.g. if you're joining:

outer: 1 1
inner: 1 1

we'll get 4 rows in total. After the first outer row has found all its join partners, the same must be done with the 2nd outer row; however, we've already read beyond the final 1 on the inner side, so we must rewind back to the position of the first inner 1 and then perform the join against the 2nd outer row, producing the 3rd and 4th output rows. The inner side will have been read 4 times despite there being only 2 rows in it.

There are only 366 rows for the Sort node to sort. 42kB seems like reasonable memory use for that.

David
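David's 1 1 / 1 1 example can be simulated in a few lines. Below is a minimal Python sketch of a sorted merge join with mark/restore — not PostgreSQL's actual executor code; `merge_join` and its counters are hypothetical names for illustration. It counts tuples consumed from the inner side, mirroring how the rows figure on the inner Sort node counts reads, not distinct tuples:

```python
def merge_join(outer, inner):
    """Toy sorted merge join with mark/restore (hypothetical helper,
    not PostgreSQL code). Counts tuples consumed from the inner side."""
    results = []
    inner_reads = 0   # total tuples fetched from the inner side
    i = 0             # current inner position
    mark = 0          # marked inner position (start of last-matched key group)
    for o in outer:
        # Advance past inner tuples smaller than the current outer key.
        while i < len(inner) and inner[i] < o:
            inner_reads += 1
            i += 1
        # Restore: if this outer tuple repeats the marked key, jump back
        # to the marked position instead of rescanning from the start.
        if mark < len(inner) and inner[mark] == o:
            i = mark
        else:
            mark = i
        # Emit a join result for every matching inner tuple.
        while i < len(inner) and inner[i] == o:
            inner_reads += 1
            results.append((o, inner[i]))
            i += 1
    return results, inner_reads

rows, reads = merge_join([1, 1], [1, 1])
print(len(rows), reads)  # 4 result rows; the 2-tuple inner side is read 4 times
```

The same mechanism, scaled up, is how a 366-row sorted CTE can be read 9855001 times: every outer tuple sharing a join key restores back to the marked position and re-reads the matching group.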
On 31.10.22 21:40, David Rowley wrote:
> On Tue, 1 Nov 2022 at 03:20, Gunnar "Nick" Bluth
> <gunnar.bluth@pro-open.de> wrote:
>> What puzzles us is the part where the CTE "oneyear" somehow explodes
>> into a sort node of almost 10 mio (but not the same amount as the index
>> scan emits!) rows, taking ~ 0.4 seconds but only using 4x-5x kB of memory:
>>
>> ->  Sort  (cost=69.83..72.33 rows=1000 width=4) (actual
>> time=0.418..448.397 rows=9855001 loops=1)
>
> I think you're interpreting the EXPLAIN output wrongly. It's not that
> the Sort node emits 9855001 rows, it's that 9855001 are read from the
> Sort node.

Thanks for looking into this, David!

> The reason more rows are read from it than are produced is because
> Merge Join must perform mark and restore to "rewind" the inner side of
> the scan back for the subsequent outer tuple which has the same value.

Ok, I get it. We kind of had that suspicion.

So the sort could also say "rows=366 loops=26926" instead of "rows=9855001 loops=1" (which I myself would find reasonable...)?

< snip >

> There are only 366 rows for the Sort node to sort. 42kB seems like
> reasonable memory use for that.

Absolutely!

Thanks again,
-- 
Gunnar "Nick" Bluth
Eimermacherweg 106
D-48159 Münster
Mobil +49 172 8853339
Email: gunnar.bluth@pro-open.de
__________________________________________________________________________
"Ceterum censeo SystemD esse delendam" - Cato
On Wed, 2 Nov 2022 at 02:13, Gunnar "Nick" Bluth <gunnar.bluth@pro-open.de> wrote:
> So the sort could also say
> "rows=366 loops=26926"
> instead of
> "rows=9855001 loops=1"
>
> (which I myself would find reasonable...)?

Maybe my example led you to believe that we rewind (rescan) the inner node each time we finish matching an outer tuple, but that's not what happens. We only restore back to the marked position after we've just fetched the next outer tuple and seen that the new outer matches the tuple we've marked on the inner side.

The maths just wouldn't make any sense if we incremented the loop counter each time we did a restore. When we call ExecRestrPos we could either jump back 0 rows, or jump back to the very start of the scan; it just depends on the marked position. I think you're under the impression that multiplying rows by loops would give a number that actually makes sense here. It wouldn't.

David
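One way to see why rows × loops can't reconstruct this count: each group of outer tuples sharing a key re-reads that key's matching inner tuples, so the inner-side reads come to roughly the sum of (outer group size × inner group size), plus any inner tuples skipped between groups. That only collapses into a clean product when every group has the same size — and indeed, 9855001 in the reported plan is not an exact multiple of 366. A small sketch with made-up group sizes (hypothetical numbers, not taken from the plan):

```python
# key -> number of outer tuples with that key (made-up distribution)
outer_counts = {1: 3, 2: 1, 3: 3}
# key -> number of matching inner tuples
inner_counts = {1: 2, 2: 2, 3: 1}

# Every outer tuple in a group re-reads that group's inner tuples.
reads = sum(n_outer * inner_counts.get(key, 0)
            for key, n_outer in outer_counts.items())
print(reads)  # 3*2 + 1*2 + 3*1 = 11 reads from only 5 inner tuples
```

Here no integer "loops" value satisfies 5 × loops == 11, so EXPLAIN has no sensible way to report the figure as rows-per-loop times loops.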