From: Bruce Momjian
Subject: Re: Hot Standby and VACUUM FULL
Msg-id: 201002010358.o113wmU10019@momjian.us
In response to: Re: Hot Standby and VACUUM FULL (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

FYI, getting rid of VACUUM FULL in a .0 release does make more sense
than doing it in a .1 release.

---------------------------------------------------------------------------

Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > I'll do a little work towards step (1) just so we can take a more
> > informed view once you've had a better look at just what (2) involves.
> 
> I spent a couple of hours reading code and believe that I've identified
> all the pain points for allowing relocation of system catalogs (ie,
> assigning them new relfilenodes during cluster-style VACUUM FULL,
> REINDEX, etc).  It's definitely not a trivial project but it's not out
> of reach either --- I estimate I could get it done in two or three
> days of full-time effort.  Given the potential benefits I think it's
> worth doing.  Rough notes are attached --- comments?
> 
>             regards, tom lane
> 
> 
> Change pg_class.relfilenode to be 0 for all rels for which a map file should
> be used instead.  Define RelationGetFilenode() to look to the physaddr info
> instead, and make all internal refs go to that instead of reading
> rd_rel->relfilenode directly.  Define pg_relation_filenode(regclass) and fix
> external refs (oid2name, pg_dump) to use that instead.
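> 
> For illustration, a minimal sketch of the accessor idea (names and
> signatures here are invented for the sketch, not existing code):
> 
>     typedef unsigned int Oid;
>     #define InvalidOid ((Oid) 0)
> 
>     /* hypothetical map lookup; consults the cached map contents */
>     extern Oid RelFileMapLookup(Oid relid, int shared);
> 
>     /*
>      * Mapped catalogs store 0 in pg_class.relfilenode, so internal
>      * code must fetch the physical filenode through this accessor
>      * rather than reading the field directly.
>      */
>     Oid
>     RelationGetFilenode(Oid relid, Oid relfilenode, int shared)
>     {
>         if (relfilenode != InvalidOid)
>             return relfilenode;              /* ordinary relation */
>         return RelFileMapLookup(relid, shared);  /* mapped catalog */
>     }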
> 
> As a zeroth cut, just have relcache.c substitute the OID in place of reading
> a map file.  Possibly it should always do that during bootstrap.
> 
> How should bootstrap decide which rels to insert zero for, anyway?
> Shared definitely, pg_class, ... maybe that's enough?  Or do we need
> it for all nailed local catalogs?
> 
> Local state contains (a rough struct sketch follows this list):
>     * shared map list
>     * this database's map list
>     * list of local overrides to each on-disk map list
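> 
> A sketch of what that state might look like, reusing the Oid typedef
> from the sketch above (struct and variable names are invented):
> 
>     #define MAX_MAP_ENTRIES 64
> 
>     typedef struct RelMapEntry
>     {
>         Oid     relid;        /* catalog OID */
>         Oid     filenode;     /* current physical filenode */
>     } RelMapEntry;
> 
>     typedef struct RelMapFile
>     {
>         int         num_entries;
>         RelMapEntry entries[MAX_MAP_ENTRIES];
>     } RelMapFile;
> 
>     static RelMapFile shared_map;     /* cluster-wide catalogs */
>     static RelMapFile local_map;      /* this database's catalogs */
> 
>     /*
>      * Pending (uncommitted) changes are kept apart from the on-disk
>      * images so we can re-read somebody else's committed mods without
>      * losing our own, per the locking notes below.
>      */
>     static RelMapFile pending_shared_mods;
>     static RelMapFile pending_local_mods;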
> 
> At commit: if modified, write out new version; do this as late as possible
> before xact commit
> At abort: just discard local-override list
> 
> "Write" must include generating a WAL entry as well as sending a shared-inval
> signal
>     * write temp file, fsync it
>     * emit WAL record containing copy of new file contents, fsync it
>     * atomically rename temp file into place
>     * send sinval
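> 
> The file dance as a standalone sketch; the WAL and sinval steps are
> only marked as comments (XLogInsert/XLogFlush calls omitted):
> 
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <unistd.h>
> 
>     /* Durably replace the map file; returns 0 on success. */
>     static int
>     write_map_atomically(const char *path, const void *buf, size_t len)
>     {
>         char tmppath[1024];
>         int  fd;
> 
>         snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
> 
>         fd = open(tmppath, O_CREAT | O_TRUNC | O_WRONLY, 0600);
>         if (fd < 0)
>             return -1;
>         if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
>         {
>             close(fd);
>             unlink(tmppath);
>             return -1;
>         }
>         close(fd);
> 
>         /* ...emit WAL record containing the new contents, fsync WAL... */
> 
>         /* atomic switch: readers see the old file or the new one,
>          * never a torn mix of the two */
>         if (rename(tmppath, path) != 0)
>         {
>             unlink(tmppath);
>             return -1;
>         }
> 
>         /* ...send sinval message... */
>         return 0;
>     }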
> 
> During relation open, check for sinval before relying on current cached value
> of map contents
> 
> Hm, what about concurrent updates of the map?  Probably instantiate a new
> LWLock, or maybe better a heavyweight (HW) lock.  So a write involves taking
> the lock, checking for updates, and finally writing --- which means that
> local modifications
> have to be stored in a way that allows re-reading somebody else's mods.
> Hence above data structure with separate storage of local modifications.
> We assume rel-level locking protects us from concurrent update attempts
> on the same map entry, but we don't want to forbid concurrent relocations
> of different catalogs.
> 
> LWLock would work if we do an unconditional read of the file contents after
> getting lock rather than relying on sinval signaling, which seems a small
> price considering map updates should be infrequent.  Without that, writers
> have to hold the lock till commit which rules out using LWLock.
> 
> Hm ... do we want an LWLock per map file, or is one lock to rule them all
> sufficient?  LWLock per database seems problematic.  With an HW lock there
> wouldn't be a problem with that.  HW lock would allow concurrent updates of
> the map files of different DBs, but is that worth the extra cycles?
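> 
> A possible shape for the writer under a single LWLock ("one lock to
> rule them all" variant).  Names are hypothetical; it reuses
> write_map_atomically from the sketch above:
> 
>     void
>     relmap_write_entry(Oid relid, Oid new_filenode, int shared)
>     {
>         RelMapFile *map = shared ? &shared_map : &local_map;
> 
>         LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
> 
>         /* unconditional re-read: pick up any committed concurrent
>          * update instead of trusting sinval signaling */
>         relmap_load_from_disk(map, shared);
> 
>         relmap_set_entry(map, relid, new_filenode);   /* merge ours */
>         write_map_atomically(relmap_path(shared), map, sizeof(*map));
> 
>         LWLockRelease(RelationMappingLock);
>     }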
> 
> In a case where a transaction relocates pg_class (or another mapped catalog)
> and then makes updates in that catalog, all is well in the sense that the
> updates will be preserved iff the xact commits.  HOWEVER, would we have
> problems during WAL replay?  No, because the WAL entries will command updates
> to the catalog's new relfilenode, which will be interesting to anyone else if
> and only if the xact gets to commit.  We would need to be careful about the
> case of relocating pg_class itself though --- make sure local relcache
> references the new pg_class before any such updates happen.  Probably a
> CCI (CommandCounterIncrement) is sufficient.
> 
> Another issue for clustering a catalog is that sinval catcache signalling
> could get confused, since that depends on catalog entry TIDs which would
> change --- we'd likely need to have relocation send an sinval signal saying
> "flush *everything* from that catalog".  (relcache inval on the catalog itself
> doesn't do that.)  This action could/should be transactional so no
> additional code is needed to propagate the notice to HS standbys.
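> 
> One way such a message could look (hypothetical; modeled on the
> existing fixed-size sinval message structs):
> 
>     typedef struct
>     {
>         signed char id;       /* message type, e.g. a new
>                                * SHAREDINVALCATALOG_ID */
>         Oid         dbId;     /* database, or 0 for a shared catalog */
>         Oid         catId;    /* pg_class OID of the catalog whose
>                                * catcache entries must all be flushed */
>     } SharedInvalCatalogMsg;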
> 
> Once the updated map file is moved into place, the relocation is effectively
> committed even if we subsequently abort the transaction.  We can make that
> window pretty narrow but not remove it completely.  Risk factors (an
> ordering sketch follows this list):
> * abort would try to remove relation files created by xact, in particular
> the new version of the catalog.  Ooops.  Can fix this by telling smgr to
> forget about removing those files before installing the new map file ---
> better to leak some disk space than destroy catalogs.
> * on abort, we'd not send sinval notices, leaving other backends at risk
> of using stale info (maybe our own too).  We could fix this by sending
> the sinval notice BEFORE renaming into place (if we send it and then fail
> to rename, no real harm done, just useless cache flushes).  This requires
> keeping a nontransactional-inval path in inval.c, but at least it's much
> more localized than before.  No problem for HS since we have the WAL record
> for map update to drive issuing sinvals on the slave side.  Note this
> means that readers need to take the mapfile lock in shared mode, since they
> could conceivably get the sinval notice before we complete the rename.
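> 
> Putting the two fixes in order (sketch only; function names invented):
> 
>     /* 1. never delete the new catalog file on abort: better to leak
>      *    disk space than to destroy a catalog */
>     smgr_forget_pending_unlink(new_filenode);
> 
>     /* 2. flush other backends BEFORE the rename; a spurious flush is
>      *    harmless if the rename then fails, a stale map is not */
>     send_nontransactional_sinval(catalog_oid);
> 
>     /* 3. the rename is the effective commit point of the relocation */
>     write_map_atomically(relmap_path(shared), map, sizeof(*map));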
> 
> For our own backend we need cache flushes at CCI and again at xact
> commit/abort.  This could be handled by regular transactional inval
> path but we end up with a lot of redundant flushing.  Maybe not worth
> worrying about though.
> 
> Disallow catalog relocation inside subtransactions, to avoid having
> to handle subxact abort effects on the local-map-changes state.
> This could be implemented if desired, but doesn't seem worth it
> at least in first pass.
> 

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

