Re: Hot Standby and VACUUM FULL - Mailing list pgsql-hackers
From:           Bruce Momjian
Subject:        Re: Hot Standby and VACUUM FULL
Msg-id:         201002010358.o113wmU10019@momjian.us
In response to: Re: Hot Standby and VACUUM FULL (Tom Lane <tgl@sss.pgh.pa.us>)
List:           pgsql-hackers
FYI, getting rid of VACUUM FULL in a .0 release does make more sense than
doing it in a .1 release.

---------------------------------------------------------------------------

Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > I'll do a little work towards step (1) just so we can take a more
> > informed view once you've had a better look at just what (2) involves.
>
> I spent a couple of hours reading code and believe that I've identified
> all the pain points for allowing relocation of system catalogs (ie,
> assigning them new relfilenodes during cluster-style VACUUM FULL,
> REINDEX, etc).  It's definitely not a trivial project, but it's not out
> of reach either --- I estimate I could get it done in two or three days
> of full-time effort.  Given the potential benefits, I think it's worth
> doing.  Rough notes are attached --- comments?
>
> 			regards, tom lane
>
>
> Change pg_class.relfilenode to be 0 for all rels for which a map file
> should be used instead.  Define RelationGetFilenode() to look to the
> physaddr info instead, and make all internal refs go to that instead of
> reading rd_rel->relfilenode directly.  Define
> pg_relation_filenode(regclass) and fix external refs (oid2name, pg_dump)
> to use that instead.
>
> In a zeroth cut, just have relcache.c substitute the OID in place of
> reading a map file.  Possibly it should always do that during bootstrap.
>
> How should bootstrap decide which rels to insert zero for, anyway?
> Shared catalogs definitely, pg_class, ... maybe that's enough?  Or do we
> need it for all nailed local catalogs?
>
> Local state contains:
> * the shared map list
> * this database's map list
> * a list of local overrides to each on-disk map list
>
> At commit: if modified, write out the new version; do this as late as
> possible before xact commit.
> At abort: just discard the local-override list.
>
> "Write" must include generating a WAL entry as well as sending a
> shared-inval signal:
> * write a temp file, fsync it
> * emit a WAL record containing a copy of the new file contents, fsync it
> * atomically rename the temp file into place
> * send sinval
>
> During relation open, check for sinval before relying on the current
> cached value of the map contents.
>
> Hm, what about concurrent updates of the map?  Probably instantiate a
> new LWLock, or maybe better a HW lock.  So a write involves taking the
> lock, checking for updates, and finally writing --- which means that
> local modifications have to be stored in a way that allows re-reading
> somebody else's mods.  Hence the above data structure with separate
> storage of local modifications.  We assume rel-level locking protects us
> from concurrent update attempts on the same map entry, but we don't want
> to forbid concurrent relocations of different catalogs.
>
> An LWLock would work if we do an unconditional read of the file contents
> after getting the lock rather than relying on sinval signaling, which
> seems a small price considering map updates should be infrequent.
> Without that, writers would have to hold the lock till commit, which
> rules out using an LWLock.
>
> Hm ... do we want an LWLock per map file, or is one lock to rule them
> all sufficient?  An LWLock per database seems problematic.  With a HW
> lock there wouldn't be a problem with that.  A HW lock would allow
> concurrent updates of the map files of different DBs, but is that worth
> the extra cycles?
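To make the commit-time write protocol above concrete, here is a minimal
POSIX sketch of the temp-file/fsync/rename sequence.  This is not
PostgreSQL source; every name is invented for illustration, and the WAL
and sinval steps are stubbed out as comments.

    /*
     * Illustrative sketch of the map-file write protocol: write a temp
     * file, fsync it, WAL-log the contents, then atomically rename into
     * place.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    typedef struct MapEntry
    {
        unsigned int relid;         /* OID of a mapped catalog */
        unsigned int relfilenode;   /* its current physical file number */
    } MapEntry;

    static int
    write_map_file(const char *path, const MapEntry *entries, int nentries)
    {
        char    tmppath[1024];
        int     fd;
        ssize_t len = (ssize_t) (nentries * sizeof(MapEntry));

        snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

        /* 1. Write the new contents to a temp file and fsync it. */
        fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;
        if (write(fd, entries, (size_t) len) != len || fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        close(fd);

        /* 2. A real implementation would here emit and fsync a WAL
         *    record containing a copy of the new file contents. */

        /* 3. Atomically rename the temp file into place; readers see
         *    either the whole old map or the whole new one, never a
         *    torn write. */
        if (rename(tmppath, path) != 0)
            return -1;

        /* 4. A real implementation would now send a shared-inval
         *    message so other backends re-read the map. */
        return 0;
    }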
> In a case where a transaction relocates pg_class (or another mapped
> catalog) and then makes updates in that catalog, all is well in the
> sense that the updates will be preserved iff the xact commits.  HOWEVER,
> would we have problems during WAL replay?  No, because the WAL entries
> will command updates to the catalog's new relfilenode, which will be
> interesting to anyone else if and only if the xact gets to commit.  We
> would need to be careful about the case of relocating pg_class itself,
> though --- make sure the local relcache references the new pg_class
> before any such updates happen.  Probably a CCI is sufficient.
>
> Another issue for clustering a catalog is that sinval catcache
> signalling could get confused, since that depends on catalog entry TIDs,
> which would change --- we'd likely need to have relocation send an
> sinval signal saying "flush *everything* from that catalog".  (Relcache
> inval on the catalog itself doesn't do that.)  This action could/should
> be transactional, so no additional code is needed to propagate the
> notice to HS standbys.
>
> Once the updated map file is moved into place, the relocation is
> effectively committed even if we subsequently abort the transaction.  We
> can make that window pretty narrow but not remove it completely.  Risk
> factors:
> * Abort would try to remove relation files created by the xact, in
>   particular the new version of the catalog.  Oops.  We can fix this by
>   telling smgr to forget about removing those files before installing
>   the new map file --- better to leak some disk space than to destroy
>   catalogs.
> * On abort, we'd not send sinval notices, leaving other backends at risk
>   of using stale info (maybe our own too).  We could fix this by sending
>   the sinval notice BEFORE renaming into place (if we send it and then
>   fail to rename, no real harm is done, just useless cache flushes).
>   This requires keeping a nontransactional-inval path in inval.c, but at
>   least it's much more localized than before.  No problem for HS, since
>   we have the WAL record for the map update to drive issuing sinvals on
>   the slave side.  Note this means that readers need to take the mapfile
>   lock in shared mode, since they could conceivably get the sinval
>   notice before we complete the rename.
>
> For our own backend we need cache flushes at CCI and again at xact
> commit/abort.  This could be handled by the regular transactional inval
> path, but we'd end up with a lot of redundant flushing.  Maybe not worth
> worrying about, though.
>
> Disallow catalog relocation inside subtransactions, to avoid having to
> handle subxact-abort effects on the local-map-changes state.  This could
> be implemented if desired, but doesn't seem worth it, at least in a
> first pass.
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
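The lookup rule from the top of the notes (a mapped catalog stores zero
in pg_class.relfilenode and gets its real file number from the map)
amounts to something like the sketch below.  Again, the names and the toy
in-memory map are invented for illustration and are not PostgreSQL code;
only the OID 1259 (pg_class) is real, and the filenode value is made up.

    /*
     * Illustrative sketch: rd_relfilenode == 0 means "consult the map";
     * any nonzero value is an ordinary relation's file number.
     */
    typedef struct ToyRelation
    {
        unsigned int relid;           /* the relation's OID */
        unsigned int rd_relfilenode;  /* 0 => mapped catalog */
    } ToyRelation;

    /* A toy in-memory stand-in for the on-disk map file. */
    static const struct { unsigned int relid, filenode; } toy_map[] = {
        {1259, 16384},   /* pg_class, with a hypothetical new filenode */
    };

    static unsigned int
    toy_map_lookup(unsigned int relid)
    {
        for (unsigned i = 0; i < sizeof(toy_map) / sizeof(toy_map[0]); i++)
            if (toy_map[i].relid == relid)
                return toy_map[i].filenode;
        return 0;                     /* not a mapped relation */
    }

    static unsigned int
    relation_get_filenode(const ToyRelation *rel)
    {
        if (rel->rd_relfilenode != 0)
            return rel->rd_relfilenode;    /* value stored in the catalog */
        return toy_map_lookup(rel->relid); /* value stored in the map */
    }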