Re: Pluggable Storage - Andres's take - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Pluggable Storage - Andres's take
Date
Msg-id 20190515185447.gno2jtqxyktylyvs@alap3.anarazel.de
Whole thread Raw
In response to Re: Pluggable Storage - Andres's take  (Andres Freund <andres@anarazel.de>)
Responses Re: Pluggable Storage - Andres's take
Re: Pluggable Storage - Andres's take
List pgsql-hackers
Hi,

On 2019-04-25 15:43:15 -0700, Andres Freund wrote:
> Hm. I think some of those changes would be a bit bigger than I initially
> though. Attached is a more minimal fix that'd route
> RelationGetNumberOfBlocksForFork() through tableam if necessary.  I
> think it's definitely the right answer for 1), probably the pragmatic
> answer to 2), but certainly not for 3).

> I've for now made the AM return the size in bytes, and then convert that
> into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> are going to continue to want it internally as pages (otherwise there's
> going to be way too much churn, without a benefit I can see). So I
> thinkt that's OK.
> 
> There's also a somewhat weird bit of returning the total relation size
> for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
> postgres' current forks, but have equivalent concepts. And without that
> there'd be no way to get that size.  I'm not sure I like this, input
> welcome. But it seems good to offer the ability to get the entire size
> somehow.

I'm still reasonably happy with this.  I'll polish it a bit and push.


> > 3) nodeTidscan, skipping over too large tids
> >    I think this should just be moved into the AMs, there's no need to
> >    have this in nodeTidscan.c
> 
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
> 
>     /*
>      * We silently discard any TIDs that are out of range at the time of scan
>      * start.  (Since we hold at least AccessShareLock on the table, it won't
>      * be possible for someone to truncate away the blocks we intend to
>      * visit.)
>      */
>     nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> 
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.
> 
> 
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan.  And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).
> 
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter.  But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.

Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.

For heap that just means we can just use HeapScanDesc's rs_nblock to
filter out invalid tids, and we only need to call
RelationGetNumberOfBlocks() once, rather than every
table_tuple_tid_valid(0 / table_get_latest_tid() call. Which is a good
improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
CURRENT OF) - which previously computed the relation size once per
tuple.

Needs a bit of polishing, but I think this is the right direction?
Unless somebody protests, I'm going to push something along those lines
quite soon.

Greetings,

Andres Freund

Attachment

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: more message fixes
Next
From: Andres Freund
Date:
Subject: Are ctid chaining loops safe without relation size checks?