Re: Pluggable Storage - Andres's take - Mailing list pgsql-hackers

From Ashwin Agrawal
Subject Re: Pluggable Storage - Andres's take
Msg-id CALfoeiu+69FZWDCiH_qw8rBTNBKAuy3ttGkTotmWgx1-L0Kb6w@mail.gmail.com
In response to Re: Pluggable Storage - Andres's take  (Andres Freund <andres@anarazel.de>)
Responses Re: Pluggable Storage - Andres's take  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers

On Wed, May 15, 2019 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-04-25 15:43:15 -0700, Andres Freund wrote:

> > 3) nodeTidscan, skipping over too large tids
> >    I think this should just be moved into the AMs, there's no need to
> >    have this in nodeTidscan.c
>
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
>
>       /*
>        * We silently discard any TIDs that are out of range at the time of scan
>        * start.  (Since we hold at least AccessShareLock on the table, it won't
>        * be possible for someone to truncate away the blocks we intend to
>        * visit.)
>        */
>       nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
>
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.
>
>
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan.  And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).
>
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter.  But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.

Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.
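To make the shape of the proposed callback concrete, here is a minimal, self-contained sketch of what a heap implementation of table_tuple_tid_valid() could look like. The struct definitions below are simplified stand-ins (the real ones live in storage/itemptr.h and access/heapam.h with different layouts); only the core idea is faithful: a tid is in range iff its block number is below the relation size snapshotted at scan start, so RelationGetNumberOfBlocks() is paid once per scan rather than once per tid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified stand-ins for PostgreSQL internals -- illustrative only,
 * not the real definitions.
 */
typedef uint32_t BlockNumber;

typedef struct ItemPointerData
{
    BlockNumber ip_blkid;       /* block number (flattened for the sketch) */
    uint16_t    ip_posid;       /* offset within the block */
} ItemPointerData;
typedef ItemPointerData *ItemPointer;

typedef struct HeapScanDescData
{
    BlockNumber rs_nblocks;     /* relation size captured at scan start */
} HeapScanDescData;
typedef HeapScanDescData *HeapScanDesc;

/*
 * Sketch of a heap tid-validity callback: the tid is plausibly valid
 * if its block number falls inside the relation size recorded when
 * the scan began.  No relation-size syscall per tuple.
 */
static bool
heapam_tuple_tid_valid(HeapScanDesc scan, ItemPointer tid)
{
    return tid->ip_blkid < scan->rs_nblocks;
}
```

With a 100-block relation snapshotted in the scan descriptor, a tid pointing at block 42 passes while one pointing at block 100 is filtered out before any fetch is attempted.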

For heap that means we can just use HeapScanDesc's rs_nblocks to
filter out invalid tids, and we only need to call
RelationGetNumberOfBlocks() once, rather than on every
table_tuple_tid_valid() / table_get_latest_tid() call. Which is a good
improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
CURRENT OF) - which previously computed the relation size once per
tuple.

Needs a bit of polishing, but I think this is the right direction?

At a high level this looks good to me; I will look into the full details tomorrow. This aligns with the high-level thought I raised earlier, but is implemented in a much better way: consult the AM on whether to perform the optimization. So, with the new table_tuple_tid_valid() callback, an AM can either implement some way to validate the TID and optimize the scan, or, if it has no way to check based on the scan descriptor, simply always return true and let table_fetch_row_version() handle things.
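The fallback case described above can be sketched as a trivial callback; this is a hypothetical example (the type stand-ins and the function name are not from PostgreSQL) of an AM whose tids are not tied to physical block numbers and so cannot cheaply bound them from the scan descriptor:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Opaque stand-ins for the tableam plumbing -- the real declarations
 * in access/tableam.h differ; these exist only to make the sketch
 * compile on its own.
 */
typedef struct TableScanDescData TableScanDescData;
typedef TableScanDescData *TableScanDesc;

typedef struct ItemPointerData { int unused; } ItemPointerData;
typedef ItemPointerData *ItemPointer;

/*
 * An AM with no way to validate a tid against the scan descriptor can
 * make the callback a no-op: report every tid as potentially valid
 * and defer the real existence check to table_fetch_row_version().
 */
static bool
example_tuple_tid_valid(TableScanDesc scan, ItemPointer tid)
{
    (void) scan;
    (void) tid;
    return true;    /* the fetch path rejects tids that don't exist */
}
```

The design keeps the fast path (heap's per-scan block-count check) available to AMs that can use it, without forcing AMs with logical tids to pretend they have one.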
