Thread: Strange sort node/explain result
Hello all!

Last week at pgconf.eu, my good friend Corrado aka. Dino (in CC) showed me a plan he couldn't make sense of. I stood there, scratching my head, and couldn't make sense of it either. What comforted me somewhat is that nobody else we showed it to, including Peter Geoghegan (but to be fair, it was on day 3 of the conference ;-)), could explain what's going on there. So we follow Peter's advice and post this here.

Background: Dino played a bit in a test DB, wondering if putting all the extract() values into a CTE when querying the large timeseries table would save time compared to generating them on the fly or storing them in the timeseries table (as happens right now). Mind you, this inquiry is not about whether that's a good idea, the query's performance, or even the planner's choice of a merge join!

What puzzles us is the part where the CTE "oneyear" somehow explodes into a sort node of almost 10 million rows (but not the same number as the index scan emits!), taking ~0.4 seconds but only using 40-50 kB of memory:

  ->  Sort  (cost=69.83..72.33 rows=1000 width=4) (actual time=0.418..448.397 rows=9855001 loops=1)
        Output: y.d
        Sort Key: y.d
        Sort Method: quicksort  Memory: 42kB
        ->  CTE Scan on oneyear y  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.079..0.383 rows=366 loops=1)
              Output: y.d

Full plan is here: https://explain.depesz.com/s/B3mN

A test case to reproduce is attached. It uses roughly the same number of rows as the original data. I tried this on 9.6, 10, 11, 12, 14 & 15, and it always shows the same effect. We also tried to "broaden" everything (adding more columns to the CTE, fetching more or less from the main table, etc.) in Berlin, but the outcome is rock solid, so to speak. Even the memory footprint of the sort node doesn't change at all.
Dropping the index, though, leads to a completely different plan, and _that_ sort node is sane (but not fed by a CTE scan, obviously), using the same 42kB of memory:

  ->  Sort  (cost=59.84..62.34 rows=1000 width=8) (actual time=57.275..57.355 rows=366 loops=1)
        Output: y.d, ((y.d)::date)
        Sort Key: ((y.d)::date)
        Sort Method: quicksort  Memory: 42kB
        Buffers: shared hit=4
        ->  Function Scan on pg_catalog.generate_series y  (cost=0.01..10.01 rows=1000 width=8) (actual time=57.190..57.233 rows=366 loops=1)
              Output: y.d, (y.d)::date
              Function Call: generate_series((now() - '365 days'::interval), now(), '1 day'::interval)

Some guesses:
- Maybe the sort node gets attributed wrongly and it's actually the index scan result that gets sorted? But that wouldn't be necessary, after all...
- Maybe the sort really takes much more memory, but the executor doesn't report that correctly?
- Why does the CTE sort node sort exactly 26,999 rows fewer than the main table has, when that table sports exactly 27,000 rows per day?
- Maybe we have found a long-missing optimization opportunity? ;-)

We are seriously clueless; pls. enlighten us! ;-)

Best regards,
-- 
Gunnar "Nick" Bluth
Eimermacherweg 106
D-48159 Münster
Mobil +49 172 8853339
Email: gunnar.bluth@pro-open.de
__________________________________________________________________________
"Ceterum censeo SystemD esse delendam" - Cato
Attachment
On Tue, 1 Nov 2022 at 03:20, Gunnar "Nick" Bluth <gunnar.bluth@pro-open.de> wrote:
> What puzzles us is the part where the CTE "oneyear" somehow explodes
> into a sort node of almost 10 mio (but not the same amount as the index
> scan emits!) rows, taking ~ 0.4 seconds but only using 4x-5x kB of memory:
>
> ->  Sort  (cost=69.83..72.33 rows=1000 width=4) (actual
> time=0.418..448.397 rows=9855001 loops=1)

I think you're interpreting the EXPLAIN output wrongly. It's not that the Sort node emits 9855001 rows, it's that 9855001 rows are read from the Sort node.

The reason more rows are read from it than are produced is that Merge Join must perform mark and restore to "rewind" the inner side of the scan back for a subsequent outer tuple which has the same value. E.g. if you're joining:

outer: 1 1
inner: 1 1

we'll get 4 rows in total. After the first outer row has found all its join partners, the same must be done with the 2nd outer row; however, we've already read beyond the final 1 on the inner side, so we must rewind back to the position of the first inner 1 and then perform the join against the 2nd outer row, producing the 3rd and 4th output rows. The inner side will have been read 4 times despite there being only 2 rows in it.

There are only 366 rows for the Sort node to sort. 42kB seems like reasonable memory use for that.

David
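David's 1 1 / 1 1 example can be simulated in a few lines. Below is a minimal Python sketch of a sorted merge join with mark/restore — not PostgreSQL's actual executor code; `merge_join` and its counters are hypothetical names for illustration. It counts tuples consumed from the inner side, mirroring how the rows figure on the inner Sort node counts reads, not distinct tuples:

```python
def merge_join(outer, inner):
    """Toy sorted merge join with mark/restore (hypothetical helper,
    not PostgreSQL code). Counts tuples consumed from the inner side."""
    results = []
    inner_reads = 0   # total tuples fetched from the inner side
    i = 0             # current inner position
    mark = 0          # marked inner position (start of last-matched key group)
    for o in outer:
        # Advance past inner tuples smaller than the current outer key.
        while i < len(inner) and inner[i] < o:
            inner_reads += 1
            i += 1
        # Restore: if this outer tuple repeats the marked key, jump back
        # to the marked position instead of rescanning from the start.
        if mark < len(inner) and inner[mark] == o:
            i = mark
        else:
            mark = i
        # Emit a join result for every matching inner tuple.
        while i < len(inner) and inner[i] == o:
            inner_reads += 1
            results.append((o, inner[i]))
            i += 1
    return results, inner_reads

rows, reads = merge_join([1, 1], [1, 1])
print(len(rows), reads)  # 4 result rows; the 2-tuple inner side is read 4 times
```

The same mechanism, scaled up, is how a 366-row sorted CTE can be read 9855001 times: every outer tuple sharing a join key restores back to the marked position and re-reads the matching group.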
On 31.10.22 21:40, David Rowley wrote:
> On Tue, 1 Nov 2022 at 03:20, Gunnar "Nick" Bluth
> <gunnar.bluth@pro-open.de> wrote:
>> What puzzles us is the part where the CTE "oneyear" somehow explodes
>> into a sort node of almost 10 mio (but not the same amount as the index
>> scan emits!) rows, taking ~ 0.4 seconds but only using 4x-5x kB of memory:
>>
>> ->  Sort  (cost=69.83..72.33 rows=1000 width=4) (actual
>> time=0.418..448.397 rows=9855001 loops=1)
>
> I think you're interpreting the EXPLAIN output wrongly. It's not that
> the Sort node emits 9855001 rows, it's that 9855001 are read from the
> Sort node.

Thanks for looking into this, David!

> The reason more rows are read from it than are produced is because
> Merge Join must perform mark and restore to "rewind" the inner side of
> the scan back for the subsequent outer tuple which has the same value.

Ok, I get it. We kind of had that suspicion.

So the sort could also say "rows=366 loops=26926" instead of "rows=9855001 loops=1" (which I myself would find reasonable...)?

< snip >

> There are only 366 rows for the Sort node to sort. 42kB seems like
> reasonable memory use for that.

Absolutely!

Thanks again,
-- 
Gunnar "Nick" Bluth
Eimermacherweg 106
D-48159 Münster
Mobil +49 172 8853339
Email: gunnar.bluth@pro-open.de
__________________________________________________________________________
"Ceterum censeo SystemD esse delendam" - Cato
On Wed, 2 Nov 2022 at 02:13, Gunnar "Nick" Bluth <gunnar.bluth@pro-open.de> wrote:
> So the sort could also say
> "rows=366 loops=26926"
> instead of
> "rows=9855001 loops=1"
>
> (which I myself would find reasonable...)?

Maybe my example led you to believe that we rewind (rescan) the inner node each time we finish matching an outer tuple, but that's not what happens. We only restore back to the marked position after we've just fetched the next outer tuple and seen that the new outer matches the tuple we've marked on the inner side.

The maths just wouldn't make any sense if we incremented the loop counter each time we did a restore. When we call ExecRestrPos we could either jump back 0 rows, or jump back to the very start of the scan; it just depends on the marked position. I think you're under the impression that multiplying rows by loops would give a number that actually makes sense here. It wouldn't.

David
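One way to see why rows × loops can't reconstruct this count: each group of outer tuples sharing a key re-reads that key's matching inner tuples, so the inner-side reads come to roughly the sum of (outer group size × inner group size), plus any inner tuples skipped between groups. That only collapses into a clean product when every group has the same size — and indeed, 9855001 in the reported plan is not an exact multiple of 366. A small sketch with made-up group sizes (hypothetical numbers, not taken from the plan):

```python
# key -> number of outer tuples with that key (made-up distribution)
outer_counts = {1: 3, 2: 1, 3: 3}
# key -> number of matching inner tuples
inner_counts = {1: 2, 2: 2, 3: 1}

# Every outer tuple in a group re-reads that group's inner tuples.
reads = sum(n_outer * inner_counts.get(key, 0)
            for key, n_outer in outer_counts.items())
print(reads)  # 3*2 + 1*2 + 3*1 = 11 reads from only 5 inner tuples
```

Here no integer "loops" value satisfies 5 × loops == 11, so EXPLAIN has no sensible way to report the figure as rows-per-loop times loops.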