Re: Pluggable Storage - Andres's take

From: Andres Freund
Subject: Re: Pluggable Storage - Andres's take
Msg-id: 20190515185447.gno2jtqxyktylyvs@alap3.anarazel.de
In response to: Re: Pluggable Storage - Andres's take (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On 2019-04-25 15:43:15 -0700, Andres Freund wrote:
> Hm. I think some of those changes would be a bit bigger than I initially
> thought. Attached is a more minimal fix that'd route
> RelationGetNumberOfBlocksForFork() through tableam if necessary. I
> think it's definitely the right answer for 1), probably the pragmatic
> answer to 2), but certainly not for 3).
>
> I've for now made the AM return the size in bytes, and then convert that
> into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> are going to continue to want it internally as pages (otherwise there's
> going to be way too much churn, without a benefit I can see). So I
> think that's OK.
>
> There's also a somewhat weird bit of returning the total relation size
> for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
> postgres' current forks, but have equivalent concepts. And without that
> there'd be no way to get that size. I'm not sure I like this, input
> welcome. But it seems good to offer the ability to get the entire size
> somehow.

I'm still reasonably happy with this. I'll polish it a bit and push.

> > > 3) nodeTidscan, skipping over too large tids
> > > I think this should just be moved into the AMs, there's no need to
> > > have this in nodeTidscan.c
>
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
>
> /*
>  * We silently discard any TIDs that are out of range at the time of scan
>  * start. (Since we hold at least AccessShareLock on the table, it won't
>  * be possible for someone to truncate away the blocks we intend to
>  * visit.)
>  */
> nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
>
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan. And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).
>
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter. But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.

Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid) callback /
wrapper. Currently it just takes a "plain" scan, but we could add a
separate table_beginscan variant too.

For heap that just means we can use HeapScanDesc's rs_nblocks to filter
out invalid tids, and we only need to call RelationGetNumberOfBlocks()
once, rather than on every table_tuple_tid_valid() /
table_get_latest_tid() call. That's a good improvement for
nodeTidscan's table_get_latest_tid() call (for WHERE CURRENT OF) - which
previously computed the relation size once per tuple.

Needs a bit of polishing, but I think this is the right direction?
Unless somebody protests, I'm going to push something along those lines
quite soon.

Greetings,

Andres Freund