Re: Hot Standby and VACUUM FULL - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: Hot Standby and VACUUM FULL
Msg-id: 28499.1264996196@sss.pgh.pa.us
In response to: Re: Hot Standby and VACUUM FULL (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers

Simon Riggs <simon@2ndQuadrant.com> writes:
> I'll do a little work towards step (1) just so we can take a more
> informed view once you've had a better look at just what (2) involves.

I spent a couple of hours reading code and believe that I've identified
all the pain points for allowing relocation of system catalogs (ie,
assigning them new relfilenodes during cluster-style VACUUM FULL,
REINDEX, etc).  It's definitely not a trivial project but it's not out
of reach either --- I estimate I could get it done in a couple or three
days of full-time effort.  Given the potential benefits I think it's
worth doing.  Rough notes are attached --- comments?
        regards, tom lane


Change pg_class.relfilenode to be 0 for all rels for which a map file should
be used instead.  Define RelationGetFilenode() to look to the physaddr info
instead, and make all internal refs go to that instead of reading
rd_rel->relfilenode directly.  Define pg_relation_filenode(regclass) and fix
external refs (oid2name, pg_dump) to use that instead.
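
As a sketch of the accessor idea (all names here are illustrative stand-ins,
not a real patch): a zero relfilenode means "consult the map", and internal
code goes through the accessor rather than reading rd_rel->relfilenode.

    /*
     * Illustrative sketch only; the stub types and the map-lookup routine
     * stand in for whatever the real patch would define.
     */
    typedef unsigned int Oid;
    #define InvalidOid ((Oid) 0)

    typedef struct
    {
        Oid relfilenode;            /* 0 for mapped relations */
    } FormData_pg_class_stub;

    typedef struct
    {
        Oid                     rd_id;      /* relation OID */
        FormData_pg_class_stub *rd_rel;     /* cached pg_class row */
    } RelationStub;

    extern Oid RelationMapOidToFilenodeStub(Oid relid);    /* assumed map lookup */

    static inline Oid
    RelationGetFilenodeStub(const RelationStub *rel)
    {
        if (rel->rd_rel->relfilenode != InvalidOid)
            return rel->rd_rel->relfilenode;                /* ordinary relation */
        return RelationMapOidToFilenodeStub(rel->rd_id);    /* mapped: ask the map */
    }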

In zeroth cut, just have relcache.c substitute the OID in place of reading
a map file.  Possibly it should always do that during bootstrap.

How should bootstrap decide which rels to insert zero for, anyway?
Shared definitely, pg_class, ... maybe that's enough?  Or do we need
it for all nailed local catalogs?

local state contains:
* shared map list
* this database's map list
* list of local overrides to each on-disk map list
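
One possible shape for that per-backend state, purely as a sketch (struct and
variable names are made up):

    /*
     * Sketch of the local map state described above; all names illustrative.
     * One cached copy of the shared map file, one of this database's map
     * file, and a list of uncommitted local overrides layered on top of each.
     */
    typedef unsigned int Oid;

    typedef struct
    {
        Oid relid;                  /* catalog's pg_class OID */
        Oid filenode;               /* its current relfilenode */
    } MapEntryStub;

    #define MAX_MAP_ENTRIES 64      /* assume a small, fixed-size map file */

    typedef struct
    {
        int          num_entries;
        MapEntryStub entries[MAX_MAP_ENTRIES];
    } RelMapStub;

    static RelMapStub shared_map;           /* cached shared map file */
    static RelMapStub local_map;            /* cached map file for this database */
    static RelMapStub pending_shared_mods;  /* our uncommitted overrides, shared */
    static RelMapStub pending_local_mods;   /* our uncommitted overrides, this DB */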

At commit: if modified, write out new version; do this as late as possible
before xact commit
At abort: just discard local-override list

"Write" must include generating a WAL entry as well as sending a shared-inval
signal* write temp file, fsync it* emit WAL record containing copy of new file contents, fsync it* atomically rename
tempfile into place* send sinval
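
A POSIX-level sketch of that sequence (error handling abbreviated; the WAL
and sinval steps are backend-specific, so they appear only as comments):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int
    write_map_file(const char *mapname, const void *contents, size_t len)
    {
        char tmpname[1024];
        int  fd;

        snprintf(tmpname, sizeof(tmpname), "%s.tmp", mapname);

        /* 1. write temp file and fsync it */
        fd = open(tmpname, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;
        if (write(fd, contents, len) != (ssize_t) len || fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        close(fd);

        /*
         * 2. emit WAL record containing a copy of the new contents and
         *    fsync WAL (XLogInsert/XLogFlush territory; omitted here)
         */

        /* 3. atomically rename tempfile into place */
        if (rename(tmpname, mapname) != 0)
            return -1;

        /* 4. send sinval so other backends reread the map (omitted here) */
        return 0;
    }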
 

During relation open, check for sinval before relying on current cached value
of map contents

Hm, what about concurrent updates of map?  Probably instantiate a new
LWLock or maybe better a HW lock.  So write involves taking lock, check
for updates, finally write --- which means that local modifications
have to be stored in a way that allows re-reading somebody else's mods.
Hence above data structure with separate storage of local modifications.
We assume rel-level locking protects us from concurrent update attempts
on the same map entry, but we don't want to forbid concurrent relocations
of different catalogs.

LWLock would work if we do an unconditional read of the file contents after
getting lock rather than relying on sinval signaling, which seems a small
price considering map updates should be infrequent.  Without that, writers
have to hold the lock till commit which rules out using LWLock.
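
In that scheme, the writer's protocol might look roughly like this (every
identifier is a placeholder for whatever the real patch would use):

    typedef int MapLockStub;
    #define RELMAP_LOCK ((MapLockStub) 0)       /* the assumed new lock */

    extern void map_lock_acquire_exclusive(MapLockStub lock);
    extern void map_lock_release(MapLockStub lock);
    extern void load_relmap_file(void);         /* re-read the on-disk map */
    extern void merge_local_overrides(void);    /* layer our pending changes on top */
    extern void write_map_file_durably(void);   /* temp file + fsync + WAL + rename */

    static void
    apply_pending_map_updates(void)
    {
        map_lock_acquire_exclusive(RELMAP_LOCK);

        /* Unconditional re-read: pick up any updates committed since we last
         * looked, rather than depending on having processed their sinvals. */
        load_relmap_file();
        merge_local_overrides();
        write_map_file_durably();

        /* The lock can be dropped immediately; we need not hold it to commit. */
        map_lock_release(RELMAP_LOCK);
    }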

Hm ... do we want an LWLock per map file, or is one lock to rule them all
sufficient?  LWLock per database seems problematic.  With an HW lock there
wouldn't be a problem with that.  HW lock would allow concurrent updates of
the map files of different DBs, but is that worth the extra cycles?

In a case where a transaction relocates pg_class (or another mapped catalog)
and then makes updates in that catalog, all is well in the sense that the
updates will be preserved iff the xact commits.  HOWEVER, would we have
problems during WAL replay?  No, because the WAL entries will command updates
to the catalog's new relfilenode, which will be interesting to anyone else if
and only if the xact gets to commit.  We would need to be careful about the
case of relocating pg_class itself though --- make sure local relcache
references the new pg_class before any such updates happen.  Probably a CCI
is sufficient.
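
A call-order sketch of that case (only CommandCounterIncrement is a real
PostgreSQL routine; the other names are placeholders):

    extern void relocate_catalog(unsigned int catalog_oid);
    extern void CommandCounterIncrement(void);
    extern void update_pg_class_rows(void);

    static void
    relocate_and_update_pg_class(void)
    {
        relocate_catalog(1259);        /* 1259 is pg_class's OID */
        CommandCounterIncrement();     /* make our relcache see the new pg_class
                                        * filenode before we touch rows in it */
        update_pg_class_rows();        /* these updates now go to the new file */
    }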

Another issue for clustering a catalog is that sinval catcache signalling
could get confused, since that depends on catalog entry TIDs which would
change --- we'd likely need to have relocation send an sinval signal saying
"flush *everything* from that catalog".  (relcache inval on the catalog itself
doesn't do that.)  This action could/should be transactional so no
additional code is needed to propagate the notice to HS standbys.
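
One conceivable shape for such a message, alongside the existing fixed-size
sinval message kinds (naming entirely hypothetical):

    /*
     * Hypothetical sinval message meaning "discard every catcache entry
     * built from catalog catId".  The layout mimics the existing fixed-size
     * sinval messages, but nothing here is from an actual patch.
     */
    typedef unsigned int Oid;

    typedef struct
    {
        signed char id;     /* a new negative message-kind ID */
        Oid         dbId;   /* database OID, or 0 for a shared catalog */
        Oid         catId;  /* pg_class OID of the catalog being clustered */
    } CatalogWideInvalMsgStub;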

Once the updated map file is moved into place, the relocation is effectively
committed even if we subsequently abort the transaction.  We can make that
window pretty narrow but not remove it completely.  Risk factors:
* abort would try to remove relation files created by xact, in particular
the new version of the catalog.  Ooops.  Can fix this by telling smgr to
forget about removing those files before installing the new map file ---
better to leak some disk space than destroy catalogs.
* on abort, we'd not send sinval notices, leaving other backends at risk
of using stale info (maybe our own too).  We could fix this by sending
the sinval notice BEFORE renaming into place (if we send it and then fail
to rename, no real harm done, just useless cache flushes).  This requires
keeping a nontransactional-inval path in inval.c, but at least it's much
more localized than before.  No problem for HS since we have the WAL record
for map update to drive issuing sinvals on the slave side.  Note this
means that readers need to take the mapfile lock in shared mode, since they
could conceivably get the sinval notice before we complete the rename.
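
Putting the two fixes above together, a sketch of the commit-time installation
sequence (all function names are placeholders):

    extern void smgr_forget_new_catalog_unlinks(void);  /* drop pending deletes
                                                         * for the new files */
    extern void send_relmap_sinval(void);               /* nontransactional */
    extern void rename_temp_map_into_place(void);

    static void
    install_new_map_at_commit(void)
    {
        /* Don't let a later abort unlink the files the new map points at:
         * leaking some disk space beats destroying a catalog. */
        smgr_forget_new_catalog_unlinks();

        /* Broadcast first: a spurious flush is harmless if the rename then
         * fails, but a missed flush is not. */
        send_relmap_sinval();

        /* Point of no return: once renamed, the relocation is effectively
         * committed even if the transaction later aborts. */
        rename_temp_map_into_place();
    }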

For our own backend we need cache flushes at CCI and again at xact
commit/abort.  This could be handled by the regular transactional inval
path, but we'd end up with a lot of redundant flushing.  Maybe not worth
worrying about though.

Disallow catalog relocation inside subtransactions, to avoid having
to handle subxact abort effects on the local-map-changes state.
This could be implemented if desired, but doesn't seem worth it
at least in first pass.

