Re: [PATCH] Incremental sort (was: PoC: Partial sort) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date
Msg-id 20200324230831.5zvjpgayh35bxedp@development
Whole thread Raw
In response to Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (James Coleman <jtc331@gmail.com>)
Responses Re: [PATCH] Incremental sort (was: PoC: Partial sort)
List pgsql-hackers
On Tue, Mar 24, 2020 at 06:26:11PM -0400, James Coleman wrote:
>On Mon, Mar 23, 2020 at 11:44 PM Alvaro Herrera
><alvherre@2ndquadrant.com> wrote:
>>
>> On 2020-Mar-23, James Coleman wrote:
>>
>> > 4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
>> > is suspect. I've mentioned previously I don't have a great mental
>> > model of how rescan works and its invariants (IIRC someone said it was
>> > about moving around a result set in a cursor). Regardless I'm pretty
>> > sure this code just doesn't work correctly.
>>
>> I don't think that's the whole of it.  My own vague understanding of
>> ReScan is that it's there to support running a node again, possibly with
>> different parameters.  For example if you have a join of an indexscan
>> on the outer side and an incremental sort on the inner side, and the
>> values from the index are used as parameters to the incremental sort,
>> then the incremental sort is going to receive ReScan calls for each of
>> the values that the index returns.  Sometimes the index could give you
>> the same values as before (because there's a dupe in the index), so you
>> can just return the same values from the incremental sort; but other
>> times it's going to return different values so you need to reset the
>> incremental sort to "start from scratch" using the new values as
>> parameters.
>>
>> Now, if you have a cursor reading from the incremental sort and fetch
>> all tuples, then rewind completely and fetch all again, then that's
>> going to be a rescan as well.
>>
>> I agree with you that the code doesn't seem to implement that.
>
>I grepped the codebase for rescan, and noted this relevant info in
>src/backend/executor/README:
>
>* Rescan command to reset a node and make it generate its output sequence
>over again.
>
>* Parameters that can alter a node's results. After adjusting a parameter,
>the rescan command must be applied to that node and all nodes above it.
>There is a moderately intelligent scheme to avoid rescanning nodes
>unnecessarily (for example, Sort does not rescan its input if no parameters
>of the input have changed, since it can just reread its stored sorted data).
>
>That jives pretty well with what you're saying.
>
>The interesting thing with incremental sort, as the comments in the
>patch already note, is that even if the params haven't changed, we
>can't regenerate the same values again *unless* we know that we're
>still in the same batch, or, have only processed a single full batch
>(and the tuples are still in the full sort state) or we've
>transitioned to prefix mode and have only transferred tuples from the
>full sort state for a single prefix key group.
>
>That's a pretty narrow range of applicability of not needing to
>re-execute the entire node, at least based on my assumptions about
>when rescanning will typically happen.
>
>So, two followup questions:
>
>1. Given the narrow applicability, might it make sense to just say
>"we're only going to do a total reset and rescan and not try to
>implement a smart 'don't rescan if we don't have to'"?
>

I think that's a sensible approach.

>2. What would be a typical or good way to test this? Should I
>basically repeat many of the existing implementation tests but with a
>cursor and verify that rescanning produces the same results? That's
>probably the path I'm going to take if there are no objections. Of
>course we would need even more testing if we wanted to have the "smart
>rescan" functionality.
>

I haven't checked, but how are we testing it for the other nodes?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Ltree syntax improvement
Next
From: Cary Huang
Date:
Subject: Include sequence relation support in logical replication