Re: Pluggable Storage - Andres's take - Mailing list pgsql-hackers

From Ashwin Agrawal
Subject Re: Pluggable Storage - Andres's take
Msg-id CALfoeiu+69FZWDCiH_qw8rBTNBKAuy3ttGkTotmWgx1-L0Kb6w@mail.gmail.com
In response to Re: Pluggable Storage - Andres's take  (Andres Freund <andres@anarazel.de>)
Responses Re: Pluggable Storage - Andres's take  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers

On Wed, May 15, 2019 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-04-25 15:43:15 -0700, Andres Freund wrote:

> > 3) nodeTidscan, skipping over too large tids
> >    I think this should just be moved into the AMs, there's no need to
> >    have this in nodeTidscan.c
>
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
>
>       /*
>        * We silently discard any TIDs that are out of range at the time of scan
>        * start.  (Since we hold at least AccessShareLock on the table, it won't
>        * be possible for someone to truncate away the blocks we intend to
>        * visit.)
>        */
>       nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
>
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.
>
>
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan.  And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).
>
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter.  But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.

Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.
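To make the shape of the proposed callback concrete, here is a minimal, self-contained sketch of what a heap implementation of table_tuple_tid_valid() could look like. The struct definitions below are simplified stand-ins (the real ones live in storage/itemptr.h and access/heapam.h with different layouts); only the core idea is faithful: a tid is in range iff its block number is below the relation size snapshotted at scan start, so RelationGetNumberOfBlocks() is paid once per scan rather than once per tid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified stand-ins for PostgreSQL internals -- illustrative only,
 * not the real definitions.
 */
typedef uint32_t BlockNumber;

typedef struct ItemPointerData
{
    BlockNumber ip_blkid;       /* block number (flattened for the sketch) */
    uint16_t    ip_posid;       /* offset within the block */
} ItemPointerData;
typedef ItemPointerData *ItemPointer;

typedef struct HeapScanDescData
{
    BlockNumber rs_nblocks;     /* relation size captured at scan start */
} HeapScanDescData;
typedef HeapScanDescData *HeapScanDesc;

/*
 * Sketch of a heap tid-validity callback: the tid is plausibly valid
 * if its block number falls inside the relation size recorded when
 * the scan began.  No relation-size syscall per tuple.
 */
static bool
heapam_tuple_tid_valid(HeapScanDesc scan, ItemPointer tid)
{
    return tid->ip_blkid < scan->rs_nblocks;
}
```

With a 100-block relation snapshotted in the scan descriptor, a tid pointing at block 42 passes while one pointing at block 100 is filtered out before any fetch is attempted.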

For heap that means we can just use HeapScanDesc's rs_nblocks to
filter out invalid tids, and we only need to call
RelationGetNumberOfBlocks() once, rather than on every
table_tuple_tid_valid() / table_get_latest_tid() call. Which is a good
improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
CURRENT OF) - which previously computed the relation size once per
tuple.

Needs a bit of polishing, but I think this is the right direction?

At a high level this looks good to me; I will look into the full details tomorrow. This aligns with the high-level thought I raised earlier, but is implemented in a much better way: consult the AM on whether to perform the optimization. So, with the new table_tuple_tid_valid() callback, an AM can either implement some way to validate the TID and optimize the scan, or, if it has no way to check based on the scan descriptor, simply always return true and let table_fetch_row_version() handle things.
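The fallback case described above can be sketched as a trivial callback; this is a hypothetical example (the type stand-ins and the function name are not from PostgreSQL) of an AM whose tids are not tied to physical block numbers and so cannot cheaply bound them from the scan descriptor:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Opaque stand-ins for the tableam plumbing -- the real declarations
 * in access/tableam.h differ; these exist only to make the sketch
 * compile on its own.
 */
typedef struct TableScanDescData TableScanDescData;
typedef TableScanDescData *TableScanDesc;

typedef struct ItemPointerData { int unused; } ItemPointerData;
typedef ItemPointerData *ItemPointer;

/*
 * An AM with no way to validate a tid against the scan descriptor can
 * make the callback a no-op: report every tid as potentially valid
 * and defer the real existence check to table_fetch_row_version().
 */
static bool
example_tuple_tid_valid(TableScanDesc scan, ItemPointer tid)
{
    (void) scan;
    (void) tid;
    return true;    /* the fetch path rejects tids that don't exist */
}
```

The design keeps the fast path (heap's per-scan block-count check) available to AMs that can use it, without forcing AMs with logical tids to pretend they have one.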
