Re: Parallel Seq Scan - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Parallel Seq Scan
Date
Msg-id CA+TgmoazFeUhBNYYUE0K4rhtqQAQT4=RBVkKuvod+dOxHTgvDA@mail.gmail.com
Whole thread Raw
In response to Re: Parallel Seq Scan  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Parallel Seq Scan  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Fri, Oct 2, 2015 at 11:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Oct 1, 2015 at 7:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Does heap_parallelscan_nextpage really need a pscan_finished output
>> parameter, or can it just return InvalidBlockNumber to indicate end of
>> scan?  I think the latter can be done and would be cleaner.
>
> I think we can do that way as well, however we have to keep the check
> of page == InvalidBlockNumber at all the callers to indicate finish
> of scan which made me write the function as it exists in patch.  However,
> I don't mind changing it the way you have suggested if you feel that would
> be cleaner.

I think it would.  I mean, you just end up testing the other thing instead.

>>  I think that heap_parallelscan_initialize() should taken an
>> additional Boolean argument, allow_sync.  It should decide whether to
>> actually perform a syncscan using the logic from initscan(), and then
>> it should store a phs_syncscan flag into the ParallelHeapScanDesc.
>> heap_beginscan_internal should set rs_syncscan based on phs_syncscan,
>> regardless of anything else.
>
> I think this will ensure that any future change in this area won't break the
> guarantee for sync scans for parallel workers, is that the reason you
> prefer to implement this function in the way suggested by you?

Basically.  It seems pretty fragile the way you have it now.

>> I think heap_parallel_rescan() is an unworkable API.  When rescanning
>> a synchronized scan, the existing logic keeps the original start-block
>> so that the scan's results are reproducible,
>
> It seems from the code comments in initscan, the reason for keeping
> previous startblock is to allow rewinding the cursor which doesn't hold for
> parallel scan.  We might or might not want to support such cases with
> parallel query, but even if we want to there is a way we can do with
> current logic (as mentioned in one of my replies below).

You don't need to rewind a cursor; you just need to restart the scan.
So for example if we were on the inner side of a NestLoop, this would
be a real issue.

>> but no longer reports the
>> scan position since we're presumably out of step with other backends.
>
> Is it true for all form of rescans or are you talking about some
> case (like SampleScan) in particular?  As per my reading of code
> (and verified by debugging that code path), it doesn't seem to be true
> for rescan in case of seqscan.

I think it is:
   if (keep_startblock)   {       /*        * When rescanning, we want to keep the previous startblock setting,
*so that rewinding a cursor doesn't generate surprising results.        * Reset the active syncscan setting, though.
   */       scan->rs_syncscan = (allow_sync && synchronize_seqscans);   }
 

>> This isn't going to work at all with this API.  I don't think you can
>> swap out the ParallelHeapScanDesc for another one once you've
>> installed it;
>>
>
> Sure, but if we really need some such parameters like startblock position,
> then we can preserve those in ScanDesc.

Sure, we could transfer the information out of the
ParallelHeapScanDesc and then transfer it back into the new one, but I
have a hard time thinking that's a good design.

> PARAM_EXEC parameters will be used to support initPlan in parallel query (as
> it is done in the initial patch), so it seems to me this is the main
> downside of
> this approach.  I think rather than trying to come up with various possible
> solutions for this problem, lets prohibit sync scans with parallel query if
> you
> see some problem with the suggestions made by me above.  Do you see any
> main use case getting hit due to non support of sync scans with
> parallel seq. scan?

Yes.  Synchronized scans are particularly important with large tables,
and those are the kind you're likely to want to use a parallel
sequential scan on.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Refactoring speculative insertion with unique indexes a little
Next
From: Peter Geoghegan
Date:
Subject: Re: Less than ideal error reporting in pg_stat_statements