Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: POC: Cleaning up orphaned files using undo logs |
Date | |
Msg-id | CA+hUKGKN42jB+ubCKru716HPtMbahdia39GwG5pLgWLMZ_y1ng@mail.gmail.com Whole thread Raw |
In response to | Re: POC: Cleaning up orphaned files using undo logs (Andres Freund <andres@anarazel.de>) |
Responses |
Re: POC: Cleaning up orphaned files using undo logs
Re: POC: Cleaning up orphaned files using undo logs |
List | pgsql-hackers |
On Sun, Feb 3, 2019 at 11:09 PM Andres Freund <andres@anarazel.de> wrote: > On 2018-12-03 18:43:04 +1300, Thomas Munro wrote: > > Sorry for my silence... I got stuck on a design problem with the lower > > level undo log management code that I'm now close to having figured > > out. I'll have a new patch soon. Hello all, Here's a new WIP version of this patch set. It builds on a fairly deep stack of patches being developed by several people. As mentioned before, it's a useful crash-test dummy for a whole stack of technology we're working on, but it's also aiming to solve a real problem. It currently fails in one regression test for a well understood reason, fix on the way (see end), and there are some other stability problems being worked on. Here's a quick tour of the observable behaviour, having installed the pg_buffercache and test_undorecord extensions: ================== postgres=# begin; BEGIN postgres=# create table foo (); CREATE TABLE Check if our transaction has generated undo data: postgres=# select logno, discard, insert, xid, pid from pg_stat_undo_logs ; logno | discard | insert | xid | pid -------+------------------+------------------+-----+------- 0 | 0000000000002CD9 | 0000000000002D1A | 476 | 39169 (1 row) Here, we see that undo log number 0 has some undo data because discard < insert. We can find out what it says: postgres=# call dump_undo_records(0); NOTICE: 0000000000002CD9: Storage: CREATE dbid=12916, tsid=1663, relfile=16386; xid=476, next xact=0 CALL The undo record shown there lives in shared buffers, and we can see that it's in there with pg_buffercache (the new column smgrid 1 means undo data; 0 is regular relation data): postgres=# select bufferid, smgrid, relfilenode, relblocknumber, isdirty, usagecount from pg_buffercache where smgrid = 1; bufferid | smgrid | relfilenode | relblocknumber | isdirty | usagecount ----------+--------+-------------+----------------+---------+------------ 3 | 1 | 0 | 1 | t | 5 (1 row) Even though that's just a dirty page in shared buffers, if we crash now and recover, it'll be recreated by a new WAL record that was flushed *before* creating the relation file. We can see that with pg_waldump: rmgr: Storage ... PRECREATE base/12916/16384, blkref #0: smgr 1 rel 1663/0/0 blk 1 FPW rmgr: Storage ... CREATE base/12916/16384 The PRECREATE record dirtied block 1 of undo log 0. In this case it happened to include a FPW of the undo log page too, following the usual rules. FPWs are rare for undo pages because of the REGBUF_WILL_INIT optimisation that applies to the zeroed out pages (which is most undo pages, due to the append-mostly access pattern). Finally, we if commit we see the undo data is discarded by a background worker, and if we roll back explicitly or crash and run recovery, the file is unlinked. Here's an example of the crash case: postgres=# begin; BEGIN postgres=# create table foo (); CREATE TABLE postgres=# select relfilenode from pg_class where relname = 'foo'; relfilenode ------------- 16395 (1 row) postgres=# select pg_backend_pid(); pg_backend_pid ---------------- 39169 (1 row) $ kill -9 39169 ... server restarts, recovers ... $ ls pgdata/base/12916/16395 pgdata/base/12916/16395 It's still there, though it's been truncated by an undo worker (see end of email). And finally, after the next checkpoint: $ ls pgdata/base/12916/16395 ls: pgdata/base/12916/16395: No such file or directory That's the end of the quick tour. Most of these patches should probably be discussed in other threads, but I'm posting a snapshot of the full stack here anyway. Here's a patch-by-patch summary: === 0001 "Refactor the fsync mechanism to support future SMGR implementations." === The 0001 patch has its own CF thread https://commitfest.postgresql.org/22/1829/ and is from Shawn Debnath (based on earlier work by me), but I'm including a copy here for convenience/cfbot. === 0002 "Add SmgrId to smgropen() and BufferTag." === This is new, and is based on the discussion from another recent thread[1] about how we should identify buffers belonging to different storage managers. In earlier versions of the patch-set I had used a special reserved DB OID for undo data. Tom Lane didn't like that idea much, and Anton Shyrabokau (via Shawn Debnath) suggested making ForkNumber narrower so we can add a new field to BufferTag, and Andres Freund +1'd my proposal to add the extra value as a parameter to smgropen(). So, here is a patch that tries those ideas. Another way to do this would be to widen RelFileNode instead, to avoid having to pass around the SMGR ID separately in various places. Looking at the number of places that have to chance, you can probably see why we wanted to use a magic DB OID instead, and I'm not entirely convinced that it wasn't better that way, or that I've found all the places that need to carry an smgrid alongside a RelFileNode. Archeological note: smgropen() was like that ~15 years ago before commit 87bd9563, but buffer tags didn't include the SMGR ID. I decided to call md.c's ID "SMGR_RELATION", describing what it really holds -- regular relations -- rather than perpetuating the doubly anachronistic "magnetic disk" name. While here, I resurrected the ancient notion of a per-SMGR 'open' routine, so that a small amount of md.c-specific stuff could be kicked out of smgr.c and future implementations can do their own thing here too. While doing that work I realised that at least pg_rewind needs to learn about how different storage managers map blocks to files, so that's a new TODO item requiring more thought. I wonder what other places know how to map { RelFileNode, ForkNumber, BlockNumber } to a path + offset, and I wonder what to think about the fact that some of them may be non-backend code... === 0003 "Add undo log manager." === This and the next couple of patches live in CF thread https://commitfest.postgresql.org/22/1649/ but here's a much newer snapshot that hasn't been posted there yet. Manages a set of undo logs in shared memory, manages undo segment files, tracks discard, insert, end pointers visible in pg_stat_undo_logs. With this patch you can allocate and discard space in undo logs using the UndoRecPtr type to refer to addresses, but there is no access to the data yet. Improvements since the last version are not requiring DSM segments, proper FPW support and reduced WAL traffic. Previously there were extra per-xact and per-checkpoint records requiring retry-loops in code that inserted undo data. === 0004 "Provide access to undo log data via the buffer manager." === Provide SMGR_UNDO. While the 0003 patch deals with allocating and discarding undo address space and makes sure that backing files exist, this patch lets you read and write buffered data in them. === 0005 "Allow WAL record data on first modification after a checkpoint." === Provide a way for data to be attached to a WAL-registered block that is only included if this turns out to be the first WAL record that touches the block after a checkpoint. This is a bit like FPW images, except that it's arbitrary extra data and happens even if FPW is off. This is used to capture a copy of the (tiny) undo log meta-data (primary the insertion pointer) to fix a consistency problem when recovering from an online checkpoint. === 0006 + 0007 "Provide interfaces to store and fetch undo records." === This is a snapshot of work by my colleagues Dilip, Rafia and others based on earlier prototyping by Robert. While the earlier patches give you buffered binary undo data, this patch introduces the concept of high level undo records that can be inserted, and read back given an UndoRecPtr. This is a version presented on another thread already; here it's lightly changed due to rebasing by me. Undo-aware modules should design a set of undo record types, and insert exactly the same ones at do and undo time. The 0007 patch is fixups from me to bring that code into line with changes to the lower level patches. Future versions will be squashed and tidied up; still working on that. === 0008 + 0009 "Undo worker and transaction rollback" === This has a CF thread at https://commitfest.postgresql.org/22/1828/ and again this is a snapshot of work from Dilip, Rafia and others, with a fixup from me. Still working on coordinating that for the next version. This provides a way for RMGR modules to register a callback function that will receive all the undo records they inserted during a given [sub]transaction if it rolls back. It also provides a system of background workers that can execute those undo records in case the rollback happens after crash recovery, or in case the work can be usefully pushed into the background during a regular online rollback. This is a complex topic and I'm not attempting to explain it here. There are a few known problems with this and Dilip is working on a more sophisticated worker management system, but I'll let him write about that, over in that other thread. I think it'd probably be a good idea to split this patch into two or three; the RMGR undo support, the xact.c integration and the worker machinery. But maybe that's just me. Archeological note: XXXX_undo() callback functions registered via rmgrlist.h a bit like this originally appeared in the work by Vadim Mikheev (author of WAL) in commit b58c0411bad4, but that was apparently never completed once people figured out that you can make a force, steal, redo, no-undo database work (curiously I saw a slide from a university lecture somewhere saying that would be impossible). The stub functions were removed from the tree in 4c8495a1. Our new work differs from Vadim's original vision by putting undo data in a separate place from the WAL, and accessing it via shared buffers. I guess that might be because Vadim planned to use undo for rollback only, not for MVCC (but I might be wrong about that). That difference might explains why eg Vadim's function heap_undo() took an XLogRecord, whereas our proposal takes a different type. Our proposal also passes more than one records at a time to the undo handler; in future this will allow us to collect up all undo records relating to a page of (eg) zheap, and process them together for mechanical sympathy. === 0010 "Add developer documentation for the undo log storage subsystem." === Updated based on Robert's review up-thread. No coverage of background workers yet -- that is under development. === 0011 "Add user-facing documentation for undo logs." === Updated based on Robert's review up-thread. === 0012 "Add test_undorecord test module." === Provides quick and dirty dump_undo_records() procedure for testing. === 0013 "Use undo-based rollback to clean up files on abort." === Finally, this is the actual feature that this CF item is about. The main improvement here is that the previous version unlinked files immediately when executing undo actions, which broke the protocol established by commit 6cc4451b, namely that you can't reuse a relfilenode until after the next checkpoint, and the existence of an (empty) first relation segment in the filesystem is the only thing preventing that. That is fixed in this version (but see problem 2 below). Known problems: 1. A couple of tests fail with "ERROR: buffer is pinned in InvalidateBuffer". That's because ROLLBACK TO SAVEPOINT is executing the undo actions that drop the buffers for a newly created table before the subtransaction has been cleaned up. Amit is working on a solution to that. More soon. 2. The are two levels of deferment of file unlinking in current PostgreSQL. First, when you create a new relation, it is pushed on pendingDeletes; this patch-set replaces that in-memory list with persistent undo records as discussed. There is a second level of deferment: we unlink all the segments of the file except the first one, which we truncate, and then finally the zero-length file is unlinked after the next checkpoint; this is an important part of PostgreSQL's protocol for not reusing relfilenodes too soon. That means that there is still a very narrow window after the checkpoint is logged but before we've unlinked that file where you could still crash and leak a zero-length file. I've thought about a couple of solutions to close that window, including a file renaming scheme where .zombie files get cleaned up on crash, but that seemed like something that could be improved later. There is something else that goes wrong under parallel make check, which I must have introduced recently but haven't tracked down yet. I wanted to post a snapshot version for discussion anyway. More soon. This code is available at https://github.com/EnterpriseDB/zheap/tree/undo. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BDE0mmiBZMtZyvwWtgv1sZCniSVhXYsXkvJ_Wo%2B83vvw%40mail.gmail.com -- Thomas Munro https://enterprisedb.com
Attachment
pgsql-hackers by date: