Thread: POC: Cleaning up orphaned files using undo logs
Hello hackers,

The following sequence creates an orphaned file:

BEGIN;
CREATE TABLE t ();
<kill -9 this backend>

Occasionally there are reports of systems that have managed to produce a lot of them, perhaps through ENOSPC-induced panics, OOM signals or buggy/crashing extensions etc. The most recent example I found in the archives involved 1.7TB of unexpected files and some careful cleanup work.

Relation files are created eagerly, and rollback is handled by pushing PendingRelDelete objects onto the pendingDeletes list, to be discarded on commit or processed on abort. That's effectively a kind of specialised undo log, but it's in memory only, so it's less persistent than the effects it is supposed to undo.

Here's a proof-of-concept patch that plugs the gap using the undo log technology we're developing as part of the zheap project. Note that zheap is not involved here: the SMGR module is acting as a direct client of the undo machinery.

Example:

postgres=# begin;
BEGIN
postgres=# create table t1 ();
CREATE TABLE
postgres=# create table t2 ();
CREATE TABLE

... now we can see that this transaction has some undo data (discard < insert):

postgres=# select logno, discard, insert, xid, pid from pg_stat_undo_logs;
 logno |     discard      |      insert      | xid |  pid
-------+------------------+------------------+-----+-------
     0 | 00000000000021EF | 0000000000002241 | 581 | 18454
(1 row)

... and, if the test_undorecord module is installed, we can inspect the records it holds:

postgres=# call dump_undo_records(0);
NOTICE:  0000000000002224: Storage: CREATE dbid=12655, tsid=1663, relfile=24594
NOTICE:  00000000000021EF: Storage: CREATE dbid=12655, tsid=1663, relfile=24591
CALL

If we COMMIT, the undo data is discarded by advancing the discard pointer (tail) to match the insert pointer (head). If we ROLLBACK, either explicitly or automatically by crashing and recovering, then the files will be unlinked and the insert pointer will be rewound; either way the undo log eventually finishes up "empty" again (discard == insert). This is done with a system of per-rmgr-ID record types and callbacks, similar to redo. The rollback actions are either executed immediately or offloaded to an undo worker process, depending on simple heuristics.

Of course this isn't free, and the current patch makes table creation slower. The goal is to make sure that there is no scenario (kill -9, power cut etc) in which there can be a new relation file on disk, but not a corresponding undo record that would unlink that file if the transaction later has to roll back. Currently, that means that we need to flush the WAL record that will create the undo record that will unlink the file *before* we create the relation file. I suspect that could be mitigated quite easily by deferring file creation in a backend-local queue until forced by access or commit. I didn't try to do that in this basic version.

There are probably other ways to solve the specific problem of orphaned files, but this approach is built on a general reusable facility and I think it is a nice way to show the undo concepts, and how they are separate from zheap. Specifically, it demonstrates the more traditional of the two uses for undo logs: a reliable way to track actions that must be performed on rollback. (The other use is: seeing past versions of data, for vacuumless MVCC; that's a topic for later.)
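To make the per-rmgr callback idea above a bit more concrete, here is a minimal sketch of what an SMGR undo handler could look like. To be clear, this is illustrative only and not code from the patch: the function name, signature, payload access and the unlink call are all assumptions; only UnpackedUndoRecord and the dbid/tsid/relfile fields shown by dump_undo_records() correspond to things that actually appear in the patch set.

/*
 * Illustrative sketch only: rollback hands the rmgr every undo record the
 * transaction wrote, and the handler redoes the pendingDeletes work, but
 * driven from persistent undo records instead of backend-local memory.
 */
typedef struct UndoRecStorageCreate
{
    Oid     dbid;       /* database */
    Oid     tsid;       /* tablespace */
    Oid     relfile;    /* relfilenode to unlink on rollback */
} UndoRecStorageCreate;

static void
smgr_undo(int nrecords, UnpackedUndoRecord *records)
{
    for (int i = 0; i < nrecords; i++)
    {
        UndoRecStorageCreate *create;
        RelFileNode rnode;

        /* Payload layout assumed from the dump_undo_records() output. */
        create = (UndoRecStorageCreate *) records[i].uur_payload.data;
        rnode.spcNode = create->tsid;
        rnode.dbNode = create->dbid;
        rnode.relNode = create->relfile;

        /* Unlink the file created eagerly by CREATE TABLE (call assumed). */
        smgrdounlink(smgropen(rnode, InvalidBackendId), false);
    }
}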
Patches 0001-0006 are development snapshots of material posted on other threads already[1][2], hacked around by me to make this possible (see those threads for further developments in those patches including some major strengthening work, coming soon). The subject of this thread is 0007, the core of which is just a couple of hundred lines written by me, based on an idea from Robert Haas.

Personally I think it'd be a good feature to get into PostgreSQL 12, and I will add it to the CF that is about to start to seek feedback. It passes make check on Unix and Windows, though currently it's failing some of the TAP tests for reasons I'm looking into (possibly due to bugs in the lower level patches, not sure).

Thanks for reading,

[1] https://www.postgresql.org/message-id/flat/CAEepm=2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAFiTN-sYQ8r8ANjWFYkXVfNxgXyLRfvbX9Ee4SxO9ns-OBBgVA@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
While I've been involved in the design discussions for this patch set, I haven't looked at any of the code personally in a very long time. I certainly don't claim to be an independent reviewer, and I encourage others to review this work also. That said, here are some review comments.

I decided to start with 0005, as that has the user-facing documentation for this feature.

There is a spurious whitespace-only hunk in monitoring.sgml.

+         <entry>Process ID of the backend currently attached to this undo log
+          for writing.</entry>

or NULL/0/something if none?

+   each undo log that exists.  Undo logs are extents within a contiguous
+   addressing space that have their own head and tail pointers.

This sentence seems to me to have so little detail that it's not going to help anyone, and it also seems somewhat out-of-place here. I think it would be better to link to the longer explanation in the new storage section instead.

+   Each backend that has written undo data is associated with one or more undo

extra space

+<para>
+Undo logs hold data that is used for rolling back and for implementing
+MVCC in access managers that are undo-aware (currently "zheap"). The storage
+format of undo logs is optimized for reusing existing files.
+</para>

I think the mention of zheap should be removed here since the hope is that the undo stuff can be committed independently of and prior to zheap. I think you mean access methods, not access managers. I suggest making that an xref. Maybe add a little more detail, e.g.

Undo logs provide a place for access methods to store data that can be used to perform necessary cleanup operations after a transaction abort. The data will be retained after a transaction abort until the access method successfully performs the required cleanup operations. After a transaction commit, undo data will be retained until the transaction is all-visible. This makes it possible for access methods to use undo data to implement MVCC. Since in most cases undo data is discarded very quickly, the undo system has been optimized to minimize writes to disk and to reuse existing files efficiently.

+<para>
+Undo data exists in a 64 bit address space broken up into numbered undo logs
+that represent 1TB extents, for efficient management. The space is further
+broken up into 1MB segment files, for physical storage. The name of each file
+is the address of of the first byte in the file, with a period inserted after
+the part that indicates the undo log number.
+</para>

I cannot read this section and know what an undo filename is going to look like. Also, the remarks about efficient management seem like they might be unclear to someone not already familiar with how this works. Maybe something like:

Undo data exists in a 64-bit address space divided into 2^34 undo logs, each with a theoretical capacity of 1TB. The first time a backend writes undo, it attaches to an existing undo log whose capacity is not yet exhausted and which is not currently being used by any other backend; or if no suitable undo log already exists, it creates a new one. To avoid wasting space, each undo log is further divided into 1MB segment files, so that segments which are no longer needed can be removed (possibly recycling the underlying file by renaming it) and segments which are not yet needed do not need to be physically created on disk. An undo segment file has a name like <example>, where <thing> is the undo log number and <thang> is the segment number.

I think it's good to spell out the part about attaching to undo logs here, because when people look at pg_undo, the number of files will be roughly proportional to the number of backends, and we should try to help them understand - at least in general terms - why that happens.
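As an aside, the naming scheme that wording describes can be sketched in a few lines of C. This is purely illustrative: the 40-bit offset split is inferred from the 1TB extents mentioned above, UndoRecPtrGetLogNo appears elsewhere in the patches, and everything else here (including the digit widths in the filename) is made up.

typedef uint64 UndoRecPtr;

#define UNDO_OFFSET_BITS 40              /* inferred from the 1TB extents */
#define UNDO_SEGMENT_SIZE (1024 * 1024)  /* 1MB segment files */

#define UndoRecPtrGetLogNo(urp) ((urp) >> UNDO_OFFSET_BITS)
#define UndoRecPtrGetOffset(urp) \
    ((urp) & ((UINT64CONST(1) << UNDO_OFFSET_BITS) - 1))

/* Hypothetical: build the name of the segment file containing 'urp'. */
static void
undo_segment_filename(UndoRecPtr urp, char *path)
{
    uint64  segstart = UndoRecPtrGetOffset(urp) -
        UndoRecPtrGetOffset(urp) % UNDO_SEGMENT_SIZE;

    /* The address of the first byte, with a period after the log number. */
    snprintf(path, MAXPGPATH, "%06X.%010llX",
             (unsigned int) UndoRecPtrGetLogNo(urp),
             (unsigned long long) segstart);
}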
+<para>
+Just as relations can have one of the three persistence levels permanent,
+unlogged or temporary, the undo data that is generated by modifying them must
+be stored in an undo log of the same persistence level.  This enables the
+undo data to be discarded at appropriate times along with the relations that
+reference it.
+</para>

This is not quite general, because we're not necessarily talking about modifications to the files. In fact, in this POC, we're explicitly talking about the cleanup of the files themselves. Also, it's not technically correct to say that the persistence level has to match. You could put everything in permanent undo logs. It would just suck.

Moving on to 0003, the developer documentation:

+The undo log subsystem provides a way to store data that is needed for
+a limited time.  Undo data is generated whenever zheap relations are
+modified, but it is only useful until (1) the generating transaction
+is committed or rolled back and (2) there is no snapshot that might
+need it for MVCC purposes.  See src/backend/access/zheap/README for
+more information on zheap.  The undo log subsystem is concerned with

Again, I think this should be rewritten to make it independent of zheap. We hope that this facility is not only usable by but will actually be used by other AMs.

+their location within a 64 bit address space.  Unlike redo data, the
+addressing space is internally divided up unto multiple numbered logs.

Except it's not totally unlike; cf. the log and seg arguments to XLogFileNameById. The xlog division is largely a historical accident of having to support systems with 32-bit arithmetic and has minimal consequences in practice, and it's a lot less noticeable now than it used to be, but it does still kinda exist. I would try to sharpen this wording a bit to de-emphasize the contrast over whether a log/seg distinction exists and instead just contrast multiple insertion points vs. a single one.

+level code (zheap) is largely oblivious to this internal structure and

Another zheap reference.

+eviction provoked by memory pressure, then no disk IO is generated.

I/O?

+Keeping the undo data physically separate from redo data and accessing
+it though the existing shared buffers mechanism allows it to be
+accessed efficiently for MVCC purposes.

And also non-MVCC purposes. I mean, it's not very feasible to do post-abort cleanup driven solely off the WAL, because the WAL segments might've been archived or recycled and there's no easy way to access the bits we want. Saying this is for MVCC purposes specifically seems misleading.

+shared memory and can be inspected in the pg_stat_undo_logs view.  For

Replace "in" with "via" or "through" or something?

+shared memory and can be inspected in the pg_stat_undo_logs view.  For
+each undo log, a set of properties called the undo log's meta-data are
+tracked:

"called the undo log's meta-data" seems a bit awkward.

+* the "discard" pointer; data before this point has been discarded
+* the "insert" pointer: new data will be written here
+* the "end" pointer: a new undo segment file will be needed at this point

why ; for the first and : for the others?

+The three pointers discard, insert and end move strictly forwards
+until the whole undo log has been exhausted.  At all times discard <=
+insert <= end.  When discard == insert, the undo log is empty

I think you should either remove "discard, insert and end" from this sentence, relying on people to remember the list they just read, or else punctuate it like this: The three pointers -- discard, insert, and end -- move...

+logs are held in a fixed-sized pool in shared memory.  The size of
+the array is a multiple of max_connections, and limits the total size of
+transactions.

I think you should elaborate on "limits the total size of transactions."

+The meta-data for all undo logs is written to disk at every
+checkpoint.  It is stored in files under PGDATA/pg_undo/, using the

Even unlogged and temporary undo logs?

+level of the relation being modified and the current value of the GUC

Suggest: the corresponding relation

+suitable undo log must be either found or created.  The system should
+stabilize on one undo log per active writing backend (or more if
+different tablespaces are persistence levels are used).

Won't edge effects drive the number up considerably?

+and they cannot be accessed by other backend including undo workers.

Grammar. Also, begs the question "so how does this work if the undo workers are frozen out?"

+Responsibility for WAL-logging the contents of the undo log lies with
+client code (ie zheap).  While undolog.c WAL-logs all meta-data

Another zheap reference.

+hard coded to use md.c unconditionally, PostgreSQL 12 routes IO for the undo

Suggest I/O rather than IO.

I'll see if I can find time to actually review some of this code at some point.

Regarding 0006, I can't help but notice that it is completely devoid of documentation and README updates, which will not do.

Regarding 0007, that's an impressively small patch.

...Robert
Hello Thomas,

On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> It passes make check on Unix and Windows, though currently it's
> failing some of the TAP tests for reasons I'm looking into (possibly
> due to bugs in the lower level patches, not sure).
>
I looked into the regression failures when the tap-tests are enabled. It seems that we're not estimating and allocating the shared memory for rollback-hash tables correctly. I've added a patch to fix the same.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 5, 2018 at 5:13 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> Hello Thomas,
>
> On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> >
> > It passes make check on Unix and Windows, though currently it's
> > failing some of the TAP tests for reasons I'm looking into (possibly
> > due to bugs in the lower level patches, not sure).
> >
> I looked into the regression failures when the tap-tests are enabled.
> It seems that we're not estimating and allocating the shared memory
> for rollback-hash tables correctly. I've added a patch to fix the
> same.
>
I have included your fix in the latest version of the undo-worker patch[1]

[1] https://www.postgresql.org/message-id/flat/CAFiTN-sYQ8r8ANjWFYkXVfNxgXyLRfvbX9Ee4SxO9ns-OBBgVA%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 6, 2018 at 12:42 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > It passes make check on Unix and Windows, though currently it's
> > failing some of the TAP tests for reasons I'm looking into (possibly
> > due to bugs in the lower level patches, not sure).
> >
> I looked into the regression failures when the tap-tests are enabled.
> It seems that we're not estimating and allocating the shared memory
> for rollback-hash tables correctly. I've added a patch to fix the
> same.

Thanks Kuntal.

--
Thomas Munro
http://www.enterprisedb.com
> On Thu, Nov 8, 2018 at 4:03 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > On Tue, Nov 6, 2018 at 12:42 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
> > <thomas.munro@enterprisedb.com> wrote:
> > > It passes make check on Unix and Windows, though currently it's
> > > failing some of the TAP tests for reasons I'm looking into (possibly
> > > due to bugs in the lower level patches, not sure).
> > >
> > I looked into the regression failures when the tap-tests are enabled.
> > It seems that we're not estimating and allocating the shared memory
> > for rollback-hash tables correctly. I've added a patch to fix the
> > same.
>
> Thanks Kuntal.

Thanks for the patch,

Unfortunately, cfbot complains about these patches and can't apply them for some reason, so I did this manually to check it out. All of them (including the fix from Kuntal) were applied without conflicts, but compilation stopped here

undoinsert.c: In function ‘UndoRecordAllocateMulti’:
undoinsert.c:547:18: error: ‘urec’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  urec->uur_info = 0; /* force recomputation of info bits */
  ~~~~~~~~~~~~~~~^~~

Could you please post a fixed version of the patch?
On Sat, Dec 1, 2018 at 5:12 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Thu, Nov 8, 2018 at 4:03 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > On Tue, Nov 6, 2018 at 12:42 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > > On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
> > > <thomas.munro@enterprisedb.com> wrote:
> > > > It passes make check on Unix and Windows, though currently it's
> > > > failing some of the TAP tests for reasons I'm looking into (possibly
> > > > due to bugs in the lower level patches, not sure).
> > > >
> > > I looked into the regression failures when the tap-tests are enabled.
> > > It seems that we're not estimating and allocating the shared memory
> > > for rollback-hash tables correctly. I've added a patch to fix the
> > > same.
> >
> > Thanks Kuntal.
>
> Thanks for the patch,
>
> Unfortunately, cfbot complains about these patches and can't apply them for
> some reason, so I did this manually to check it out. All of them (including the
> fix from Kuntal) were applied without conflicts, but compilation stopped here
>
> undoinsert.c: In function ‘UndoRecordAllocateMulti’:
> undoinsert.c:547:18: error: ‘urec’ may be used uninitialized in this
> function [-Werror=maybe-uninitialized]
>   urec->uur_info = 0; /* force recomputation of info bits */
>   ~~~~~~~~~~~~~~~^~~
>
> Could you please post a fixed version of the patch?

Sorry for my silence... I got stuck on a design problem with the lower level undo log management code that I'm now close to having figured out. I'll have a new patch soon.

--
Thomas Munro
http://www.enterprisedb.com
Hi,

On 2018-12-03 18:43:04 +1300, Thomas Munro wrote:
> On Sat, Dec 1, 2018 at 5:12 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > [...]
> > Could you please post a fixed version of the patch?
>
> Sorry for my silence... I got stuck on a design problem with the lower
> level undo log management code that I'm now close to having figured
> out. I'll have a new patch soon.

Given this patch has been in waiting for author for ~two months, I'm unfortunately going to have to mark it as returned with feedback. Please resubmit once refreshed.

Greetings,

Andres Freund
On Sun, Feb 3, 2019 at 11:09 PM Andres Freund <andres@anarazel.de> wrote:
> On 2018-12-03 18:43:04 +1300, Thomas Munro wrote:
> > Sorry for my silence... I got stuck on a design problem with the lower
> > level undo log management code that I'm now close to having figured
> > out. I'll have a new patch soon.

Hello all,

Here's a new WIP version of this patch set. It builds on a fairly deep stack of patches being developed by several people. As mentioned before, it's a useful crash-test dummy for a whole stack of technology we're working on, but it's also aiming to solve a real problem.

It currently fails in one regression test for a well understood reason, fix on the way (see end), and there are some other stability problems being worked on.

Here's a quick tour of the observable behaviour, having installed the pg_buffercache and test_undorecord extensions:

==================

postgres=# begin;
BEGIN
postgres=# create table foo ();
CREATE TABLE

Check if our transaction has generated undo data:

postgres=# select logno, discard, insert, xid, pid from pg_stat_undo_logs;
 logno |     discard      |      insert      | xid |  pid
-------+------------------+------------------+-----+-------
     0 | 0000000000002CD9 | 0000000000002D1A | 476 | 39169
(1 row)

Here, we see that undo log number 0 has some undo data because discard < insert. We can find out what it says:

postgres=# call dump_undo_records(0);
NOTICE:  0000000000002CD9: Storage: CREATE dbid=12916, tsid=1663, relfile=16386; xid=476, next xact=0
CALL

The undo record shown there lives in shared buffers, and we can see that it's in there with pg_buffercache (the new column smgrid 1 means undo data; 0 is regular relation data):

postgres=# select bufferid, smgrid, relfilenode, relblocknumber, isdirty, usagecount from pg_buffercache where smgrid = 1;
 bufferid | smgrid | relfilenode | relblocknumber | isdirty | usagecount
----------+--------+-------------+----------------+---------+------------
        3 |      1 |           0 |              1 | t       |          5
(1 row)

Even though that's just a dirty page in shared buffers, if we crash now and recover, it'll be recreated by a new WAL record that was flushed *before* creating the relation file. We can see that with pg_waldump:

rmgr: Storage ... PRECREATE base/12916/16384, blkref #0: smgr 1 rel 1663/0/0 blk 1 FPW
rmgr: Storage ... CREATE base/12916/16384

The PRECREATE record dirtied block 1 of undo log 0. In this case it happened to include a FPW of the undo log page too, following the usual rules. FPWs are rare for undo pages because of the REGBUF_WILL_INIT optimisation that applies to the zeroed out pages (which is most undo pages, due to the append-mostly access pattern).

Finally, if we commit we see the undo data is discarded by a background worker, and if we roll back explicitly or crash and run recovery, the file is unlinked. Here's an example of the crash case:

postgres=# begin;
BEGIN
postgres=# create table foo ();
CREATE TABLE
postgres=# select relfilenode from pg_class where relname = 'foo';
 relfilenode
-------------
       16395
(1 row)

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          39169
(1 row)

$ kill -9 39169

... server restarts, recovers ...

$ ls pgdata/base/12916/16395
pgdata/base/12916/16395

It's still there, though it's been truncated by an undo worker (see end of email). And finally, after the next checkpoint:

$ ls pgdata/base/12916/16395
ls: pgdata/base/12916/16395: No such file or directory

That's the end of the quick tour.
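To restate the ordering rule the tour demonstrates, here is a pseudo-C sketch. XLogBeginInsert/XLogInsert/XLogFlush/smgrcreate are the usual routines; the PRECREATE info flag and the undo calls named in the comments are assumptions based on the description above, not the patch's actual code.

/*
 * Pseudo-C sketch of the invariant: the WAL that recreates the undo
 * record must be durable before the relation file can exist on disk.
 */
static void
smgr_create_with_undo(SMgrRelation srel)
{
    XLogRecPtr  lsn;

    /* 1. Prepare an undo record describing the file we are about to
     *    create (PrepareUndoInsert() in the interface patches). */

    START_CRIT_SECTION();

    /* 2. Copy the prepared record into shared undo buffers
     *    (InsertPreparedUndo() in the interface patches). */

    /* 3. WAL-log the undo insertion; this is the PRECREATE record seen
     *    in the pg_waldump output above (info flag assumed). */
    XLogBeginInsert();
    /* ... XLogRegisterData()/XLogRegisterBuffer() calls elided ... */
    lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_PRECREATE);

    END_CRIT_SECTION();

    /* 4. Make the undo record durable before the file can exist... */
    XLogFlush(lsn);

    /* 5. ...so no crash can leave a file on disk without an undo record
     *    that would unlink it after a rollback. */
    smgrcreate(srel, MAIN_FORKNUM, false);
}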
Most of these patches should probably be discussed in other threads, but I'm posting a snapshot of the full stack here anyway. Here's a patch-by-patch summary:

=== 0001 "Refactor the fsync mechanism to support future SMGR implementations." ===

The 0001 patch has its own CF thread https://commitfest.postgresql.org/22/1829/ and is from Shawn Debnath (based on earlier work by me), but I'm including a copy here for convenience/cfbot.

=== 0002 "Add SmgrId to smgropen() and BufferTag." ===

This is new, and is based on the discussion from another recent thread[1] about how we should identify buffers belonging to different storage managers. In earlier versions of the patch-set I had used a special reserved DB OID for undo data. Tom Lane didn't like that idea much, and Anton Shyrabokau (via Shawn Debnath) suggested making ForkNumber narrower so we can add a new field to BufferTag, and Andres Freund +1'd my proposal to add the extra value as a parameter to smgropen(). So, here is a patch that tries those ideas.

Another way to do this would be to widen RelFileNode instead, to avoid having to pass around the SMGR ID separately in various places. Looking at the number of places that have to change, you can probably see why we wanted to use a magic DB OID instead, and I'm not entirely convinced that it wasn't better that way, or that I've found all the places that need to carry an smgrid alongside a RelFileNode.

Archeological note: smgropen() was like that ~15 years ago before commit 87bd9563, but buffer tags didn't include the SMGR ID.

I decided to call md.c's ID "SMGR_RELATION", describing what it really holds -- regular relations -- rather than perpetuating the doubly anachronistic "magnetic disk" name.

While here, I resurrected the ancient notion of a per-SMGR 'open' routine, so that a small amount of md.c-specific stuff could be kicked out of smgr.c and future implementations can do their own thing here too.

While doing that work I realised that at least pg_rewind needs to learn about how different storage managers map blocks to files, so that's a new TODO item requiring more thought. I wonder what other places know how to map { RelFileNode, ForkNumber, BlockNumber } to a path + offset, and I wonder what to think about the fact that some of them may be non-backend code...

=== 0003 "Add undo log manager." ===

This and the next couple of patches live in CF thread https://commitfest.postgresql.org/22/1649/ but here's a much newer snapshot that hasn't been posted there yet.

Manages a set of undo logs in shared memory, manages undo segment files, tracks discard, insert, end pointers visible in pg_stat_undo_logs. With this patch you can allocate and discard space in undo logs using the UndoRecPtr type to refer to addresses, but there is no access to the data yet. Improvements since the last version are not requiring DSM segments, proper FPW support and reduced WAL traffic. Previously there were extra per-xact and per-checkpoint records requiring retry-loops in code that inserted undo data.

=== 0004 "Provide access to undo log data via the buffer manager." ===

Provide SMGR_UNDO. While the 0003 patch deals with allocating and discarding undo address space and makes sure that backing files exist, this patch lets you read and write buffered data in them.

=== 0005 "Allow WAL record data on first modification after a checkpoint." ===

Provide a way for data to be attached to a WAL-registered block that is only included if this turns out to be the first WAL record that touches the block after a checkpoint.
This is a bit like FPW images, except that it's arbitrary extra data and happens even if FPW is off. This is used to capture a copy of the (tiny) undo log meta-data (primarily the insertion pointer) to fix a consistency problem when recovering from an online checkpoint.

=== 0006 + 0007 "Provide interfaces to store and fetch undo records." ===

This is a snapshot of work by my colleagues Dilip, Rafia and others based on earlier prototyping by Robert. While the earlier patches give you buffered binary undo data, this patch introduces the concept of high level undo records that can be inserted, and read back given an UndoRecPtr. This is a version presented on another thread already; here it's lightly changed due to rebasing by me.

Undo-aware modules should design a set of undo record types, and insert exactly the same ones at do and undo time.

The 0007 patch is fixups from me to bring that code into line with changes to the lower level patches. Future versions will be squashed and tidied up; still working on that.

=== 0008 + 0009 "Undo worker and transaction rollback" ===

This has a CF thread at https://commitfest.postgresql.org/22/1828/ and again this is a snapshot of work from Dilip, Rafia and others, with a fixup from me. Still working on coordinating that for the next version.

This provides a way for RMGR modules to register a callback function that will receive all the undo records they inserted during a given [sub]transaction if it rolls back. It also provides a system of background workers that can execute those undo records in case the rollback happens after crash recovery, or in case the work can be usefully pushed into the background during a regular online rollback. This is a complex topic and I'm not attempting to explain it here. There are a few known problems with this and Dilip is working on a more sophisticated worker management system, but I'll let him write about that, over in that other thread. I think it'd probably be a good idea to split this patch into two or three; the RMGR undo support, the xact.c integration and the worker machinery. But maybe that's just me.

Archeological note: XXXX_undo() callback functions registered via rmgrlist.h a bit like this originally appeared in the work by Vadim Mikheev (author of WAL) in commit b58c0411bad4, but that was apparently never completed once people figured out that you can make a force, steal, redo, no-undo database work (curiously I saw a slide from a university lecture somewhere saying that would be impossible). The stub functions were removed from the tree in 4c8495a1. Our new work differs from Vadim's original vision by putting undo data in a separate place from the WAL, and accessing it via shared buffers. I guess that might be because Vadim planned to use undo for rollback only, not for MVCC (but I might be wrong about that). That difference might explain why eg Vadim's function heap_undo() took an XLogRecord, whereas our proposal takes a different type. Our proposal also passes more than one record at a time to the undo handler; in future this will allow us to collect up all undo records relating to a page of (eg) zheap, and process them together for mechanical sympathy.

=== 0010 "Add developer documentation for the undo log storage subsystem." ===

Updated based on Robert's review up-thread. No coverage of background workers yet -- that is under development.

=== 0011 "Add user-facing documentation for undo logs." ===

Updated based on Robert's review up-thread.

=== 0012 "Add test_undorecord test module." ===
Provides quick and dirty dump_undo_records() procedure for testing.

=== 0013 "Use undo-based rollback to clean up files on abort." ===

Finally, this is the actual feature that this CF item is about.

The main improvement here is that the previous version unlinked files immediately when executing undo actions, which broke the protocol established by commit 6cc4451b, namely that you can't reuse a relfilenode until after the next checkpoint, and the existence of an (empty) first relation segment in the filesystem is the only thing preventing that. That is fixed in this version (but see problem 2 below).

Known problems:

1. A couple of tests fail with "ERROR: buffer is pinned in InvalidateBuffer". That's because ROLLBACK TO SAVEPOINT is executing the undo actions that drop the buffers for a newly created table before the subtransaction has been cleaned up. Amit is working on a solution to that. More soon.

2. There are two levels of deferment of file unlinking in current PostgreSQL. First, when you create a new relation, it is pushed on pendingDeletes; this patch-set replaces that in-memory list with persistent undo records as discussed. There is a second level of deferment: we unlink all the segments of the file except the first one, which we truncate, and then finally the zero-length file is unlinked after the next checkpoint; this is an important part of PostgreSQL's protocol for not reusing relfilenodes too soon. That means that there is still a very narrow window after the checkpoint is logged but before we've unlinked that file where you could still crash and leak a zero-length file. I've thought about a couple of solutions to close that window, including a file renaming scheme where .zombie files get cleaned up on crash, but that seemed like something that could be improved later.

There is something else that goes wrong under parallel make check, which I must have introduced recently but haven't tracked down yet. I wanted to post a snapshot version for discussion anyway. More soon.

This code is available at https://github.com/EnterpriseDB/zheap/tree/undo.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BDE0mmiBZMtZyvwWtgv1sZCniSVhXYsXkvJ_Wo%2B83vvw%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com
On Tue, Mar 12, 2019 at 6:51 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sun, Feb 3, 2019 at 11:09 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2018-12-03 18:43:04 +1300, Thomas Munro wrote:
> > > Sorry for my silence... I got stuck on a design problem with the lower
> > > level undo log management code that I'm now close to having figured
> > > out. I'll have a new patch soon.
>
> [... quick tour and patch-by-patch summary trimmed ...]
>
> === 0006 + 0007 "Provide interfaces to store and fetch undo records." ===
>
> This is a snapshot of work by my colleagues Dilip, Rafia and others
> based on earlier prototyping by Robert. While the earlier patches
> give you buffered binary undo data, this patch introduces the concept
> of high level undo records that can be inserted, and read back given
> an UndoRecPtr. This is a version presented on another thread already;
> here it's lightly changed due to rebasing by me.
>
> Undo-aware modules should design a set of undo record types, and
> insert exactly the same ones at do and undo time.
>
> The 0007 patch is fixups from me to bring that code into line with
> changes to the lower level patches. Future versions will be squashed
> and tidied up; still working on that.

Currently, undo branch[1] contains an older version of the (undo interface + some fixup). Now, I have merged the latest changes from the zheap branch[2] to the undo branch[1] which can be applied on top of the undo storage commit[3]. For merging those changes, I need to add some changes to the undo log storage patch as well for handling the multi log transaction. So I have attached two patches, 1) improvement to undo log storage 2) complete undo interface patch which includes 0006+0007 from undo branch[1] + new changes on the zheap branch.

[1] https://github.com/EnterpriseDB/zheap/tree/undo
[2] https://github.com/EnterpriseDB/zheap
[3] https://github.com/EnterpriseDB/zheap/tree/undo (b397d96176879ed5b09cf7322b8d6f2edd8043a5)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Apr 19, 2019 at 6:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Currently, undo branch[1] contains an older version of the (undo
> interface + some fixup). Now, I have merged the latest changes from
> the zheap branch[2] to the undo branch[1]
> which can be applied on top of the undo storage commit[3]. For
> merging those changes, I need to add some changes to the undo log
> storage patch as well for handling the multi log transaction. So I
> have attached two patches, 1) improvement to undo log storage 2)
> complete undo interface patch which includes 0006+0007 from undo
> branch[1] + new changes on the zheap branch.

Some review comments:

+#define AtAbort_ResetUndoBuffers() ResetUndoBuffers()

I don't think this really belongs in xact.c. Seems like we ought to declare it in the appropriate header file. Perhaps we also ought to consider using a static inline function rather than a macro, although I guess it doesn't really matter.

+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+       UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+       UndoPersistence upersistence = log->meta.persistence;

Could we arrange things so that the caller passes the persistence level, instead of having to recalculate it here? This doesn't seem to be a particularly cheap operation. UndoLogGet() will call get_undo_log() which I guess will normally just do a hash table lookup but which in the worst case will take an LWLock and perform a linear search of a shared memory array. But at PrepareUndoInsert time we knew the persistence level already, so I don't see why InsertPreparedUndo should have to force it to be looked up all over again -- and while in a critical section, no less.

Another thing that is strange about this patch is that it adds these start_urec_ptr and latest_urec_ptr arrays and then uses them for absolutely nothing. I think that's a pretty clear sign that the division of this work into multiple patches is not correct. We shouldn't have one patch that tracks some information that is used nowhere for anything and then another patch that adds a user of that information -- the two should go together.

Incidentally, wouldn't StartTransaction() need to reinitialize these fields?

+ * When the undorecord for a transaction gets inserted in the next log then we

undo record

+ * insert a transaction header for the first record in the new log and update
+ * the transaction header with this new logs location. We will also keep

This appears to be nonsensical. You're saying that you add a transaction header to the new log and then update it with its own location. That can't be what you mean.

+ * Incase, this is not the first record in new log (aka new log already

"Incase," -> "If"
"aka" -> "i.e."

Overall this whole paragraph is a bit hard to understand.

+ * same transaction spans across multiple logs depending on which log is

delete "across"

+ * processed first by the discard worker. If it processes the first log which
+ * contains the transactions first record, then it can get the last record
+ * of that transaction even if it is in different log and then processes all
+ * the undo records from last to first. OTOH, if the next log get processed

Not sure how that would work if the number of logs is >2. This whole paragraph is also hard to understand.
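For what it's worth, the SetCurrentUndoLocation suggestion above amounts to a hypothetical signature change along these lines (a sketch, not code from any of the patches):

void
SetCurrentUndoLocation(UndoRecPtr urec_ptr, UndoPersistence upersistence)
{
    /*
     * The caller already knew the persistence level at PrepareUndoInsert
     * time, so no UndoLogGet() lookup is needed here, inside the
     * critical section.
     */
    /* ... remember urec_ptr as the start/latest location as before ... */
}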
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int buffer_idx;

This is a file-global variable with a very generic name that is very similar to a local variable name used by multiple functions within the file (bufidx) and no comments. Ugh.

+UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)

The locking regime for this function is really confusing. It requires that the caller hold discard_lock on entry, and on exit the lock will still be held if the return value is true but will no longer be held if the return value is false. Yikes! Anybody who reads code that uses this function is not going to guess that it has such strange behavior. I'm not exactly sure how to redesign this, but I think it's not OK the way you have it. One option might be to inline the logic into each call site.

+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer. This will just insert the already prepared record by
+ * UndoRecordPrepareTransInfo.  This must be called under the critical section.
+ * This will just overwrite the undo header not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(int idx)

It bugs me that this function goes back in to reacquire the discard lock for the purpose of preventing a concurrent undo discard. Apparently, if the other transaction's undo has been discarded between the prepare phase and where we are now, we're OK with that and just exit without doing anything; otherwise, we update the previous transaction header. But that seems wrong. When we enter a critical section, I think we should aim to know exactly what modifications we are going to make within that critical section. I also wonder how the concurrent discard could really happen. We must surely be holding exclusive locks on the relevant buffers -- can undo discard really discard undo when the relevant buffers are x-locked?

It seems to me that remaining_bytes is a crock that should be ripped out entirely, both here and in InsertUndoRecord. It seems to me that UndoRecordUpdateTransInfo can just contrive to set remaining_bytes correctly. e.g.

do
{
    // stuff
    if (!BufferIsValid(buffer))
    {
        Assert(InRecovery);
        already_written += (BLCKSZ - starting_byte);
        done = (already_written >= undo_len);
    }
    else
    {
        page = BufferGetPage(buffer);
        done = InsertUndoRecord(...);
        MarkBufferDirty(buffer);
    }
} while (!done);

InsertPreparedUndo needs similar treatment. To make this work, I guess the long string of assignments in InsertUndoRecord will need to be made unconditional, but that's probably pretty cheap. As a fringe benefit, all of those work_blah variables that are currently referencing file-level globals can be replaced with something local to this function, which will surely please the coding style police.

+ * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise,
+ * it refers to the top transaction id because undo log only stores mapping
+ * for the top most transactions.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, FullTransactionId fxid,

xid vs fxid

+       urec->uur_xidepoch = EpochFromFullTransactionId(fxid);

We need to make some decisions about how we're going to handle 64-bit XIDs vs. 32-bit XIDs in undo. This doesn't look like a well-considered scheme. In general, PrepareUndoInsert expects the caller to have populated the UnpackedUndoRecord, but here, uur_xidepoch is getting overwritten with the high bits of the caller-specified XID.
The low-order bits aren't stored anywhere by this function, but the caller is presumably expected to have placed them inside urec->uur_xid. And it looks like the low-order bits (urec->uur_xid) get stored for every undo record, but the high-order bits (urec->xidepoch) only get stored when we emit a transaction header. This all seems very confusing. I would really like to see us replace uur_xid and uur_xidepoch with a FullTransactionId; now that we have that concept, it seems like bad practice to break what is really a FullTransactionId into two halves and store them separately.

However, it would be a bit unfortunate to store an additional 4 bytes of mostly-static data in every undo record. What if we went the other way? That is, remove urec_xid from UndoRecordHeader and from UnpackedUndoRecord. Make it the responsibility of whoever is looking up an undo record to know which transaction's undo they are searching. zheap, at least, generally does know this: if it's starting from a page, then it has the XID + epoch available from the transaction slot, and if it's starting from an undo record, you need to know the XID for which you are searching, I guess from uur_prevxid.

I also think that we need to remove uur_prevxid. That field does not seem to be properly a general-purpose part of the undo machinery, but a zheap-specific consideration. I think its job is to tell you which transaction last modified the current tuple, but zheap can put that data in the payload if it likes. It is a waste to store it in every undo record, because it's not needed if the older undo has already been discarded or if the operation is an insert.

+ * Insert a previously-prepared undo record.  This will write the actual undo

Looks like this now inserts all previously-prepared undo records (rather than just a single one).

+ * in page.  We start writting immediately after the block header.

Spelling.

+ * Helper function for UndoFetchRecord.  It will fetch the undo record pointed
+ * by urp and unpack the record into urec.  This function will not release the
+ * pin on the buffer if complete record is fetched from one buffer, so caller
+ * can reuse the same urec to fetch the another undo record which is on the
+ * same block.  Caller will be responsible to release the buffer inside urec
+ * and set it to invalid if it wishes to fetch the record from another block.
+ */
+UnpackedUndoRecord *
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+                 UndoPersistence persistence)

I don't really understand why uur_buffer is part of an UnpackedUndoRecord. It doesn't look like the other fields, which tell you about the contents of an undo record that you have created or that you have parsed. Instead, it's some kind of scratch space for keeping track of a buffer that we're using in the process of reading an undo record. It looks like it should be an argument to UndoGetOneRecord() and ResetUndoRecord(). I also wonder whether it's really a good design to make the caller responsible for invalidating the buffer before accessing another block. Maybe it would be simpler to have this function just check whether the buffer it's been given is the right one; if not, unpin it and pin the new one instead. But I'm not really sure...
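In other words, something like this hypothetical revision of the signature quoted above (only the original four parameters come from the patch; the extra argument is the suggestion, not existing code):

/*
 * Hypothetical revision: the scratch buffer travels as an explicit
 * in/out argument instead of hiding in uur_buffer, and the function
 * itself swaps the pin when the target block changes.
 */
static UnpackedUndoRecord *
UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
                 UndoPersistence persistence, Buffer *curbuf);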
+ /* If we already have a buffer pin then no need to allocate a new one. */
+ if (!BufferIsValid(buffer))
+ {
+     buffer = ReadBufferWithoutRelcache(SMGR_UNDO,
+                                        rnode, UndoLogForkNum, cur_blk,
+                                        RBM_NORMAL, NULL,
+                                        RelPersistenceForUndoPersistence(persistence));
+
+     urec->uur_buffer = buffer;
+ }

I think you should move this code inside the loop that follows. Then at the bottom of that loop, instead of making a similar call, just set buffer = InvalidBuffer. Then when you loop around it'll do the right thing and you'll need less code.

Notice that having both the local variable buffer and the structure member urec->uur_buffer is actually making this code more complex. You are already setting urec->uur_buffer = InvalidBuffer when you do UnlockReleaseBuffer(). If you didn't have a separate 'buffer' variable you wouldn't need to clear them both here. In fact I think what you should have is an argument Buffer *curbuf, or something like that, and no uur_buffer at all.

+ /*
+  * If we have copied the data then release the buffer, otherwise, just
+  * unlock it.
+  */
+ if (is_undo_rec_split)
+     UnlockReleaseBuffer(buffer);
+ else
+     LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

Ugh. I think what's going on here is: UnpackUndoRecord copies the data if it switches buffers, but not otherwise. So if the record is split, we can release the lock and pin, but otherwise we have to keep the pin to avoid having the data get clobbered. But having this code know about what UnpackUndoRecord does internally seems like an abstraction violation. It's also probably not right if we ever want to fetch undo records in bulk, as I see that the latest code in zheap master does. I think we should switch UnpackUndoRecord over to always copying the data and just avoid all this. (To make that cheaper, we may want to teach UnpackUndoRecord to store data into scratch space provided by the caller rather than using palloc to get its own space, but I'm not actually sure that's (a) worth it or (b) actually better.)

[ Skipping over some things ]

+bool
+UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+                 int *already_decoded, bool header_only)

I think we should split this function into three functions that use a context object, call it say UnpackUndoContext. The caller will do:

BeginUnpackUndo(&ucontext);  // just once
UnpackUndoData(&ucontext, page, starting_byte);  // any number of times
FinishUnpackUndo(&uur, &ucontext);  // just once

The undo context will store an enum value that tells us the "stage" of decoding:

- UNDO_DECODE_STAGE_HEADER: We have not yet decoded even the record header; we need to do that next.
- UNDO_DECODE_STAGE_RELATION_DETAILS: The next thing to be decoded is the relation details, if present.
- UNDO_DECODE_STAGE_BLOCK: The next thing to be decoded is the block details, if present.
- UNDO_DECODE_STAGE_TRANSACTION: The next thing to be decoded is the transaction details, if present.
- UNDO_DECODE_STAGE_PAYLOAD: The next thing to be decoded is the payload details, if present.
- UNDO_DECODE_STAGE_DONE: Decoding is complete.

It will also store the number of bytes that have already been copied as part of whichever stage is current. A caller who wants only part of the record can stop when ucontext.stage > desired_stage; e.g. the current header_only flag corresponds to stopping when ucontext.stage > UNDO_DECODE_STAGE_HEADER, and the potential optimization mentioned in UndoGetOneRecord could be done by stopping when ucontext.stage > UNDO_DECODE_STAGE_BLOCK (although I don't know if that's worth doing).
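To visualise that, the proposed context object could look roughly like this. It's only a sketch of the suggestion above: the stage values are the ones just listed and UndoRecordHeader is mentioned elsewhere in this review, but the remaining field and type names are placeholders for the current work_blah globals.

typedef enum UndoDecodeStage
{
    UNDO_DECODE_STAGE_HEADER,
    UNDO_DECODE_STAGE_RELATION_DETAILS,
    UNDO_DECODE_STAGE_BLOCK,
    UNDO_DECODE_STAGE_TRANSACTION,
    UNDO_DECODE_STAGE_PAYLOAD,
    UNDO_DECODE_STAGE_DONE
} UndoDecodeStage;

typedef struct UnpackUndoContext
{
    UndoDecodeStage stage;          /* which part of the record comes next */
    int             partial_bytes;  /* bytes already copied in this stage */

    /* staging areas replacing the file-level work_blah globals */
    UndoRecordHeader work_hdr;
    /* ... one member per optional record part: relation details, block,
     * transaction and payload structs ... */
} UnpackUndoContext;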
In this scheme, BeginUnpackUndo just needs to set the stage to UNDO_DECODE_STAGE_HEADER and the number of bytes copied to 0. The context object contains all the work_blah things (no more global variables!), but BeginUnpackUndo does not need to initialize them, since they will be overwritten before they are examined. And FinishUnpackUndo just needs to copy all of the fields from the work_blah things into the UnpackedUndoRecord. The tricky part is UnpackUndoData itself, which I propose should look like a big switch where all branches fall through. Roughly:

switch (ucontext->stage)
{
    case UNDO_DECODE_STAGE_HEADER:
        if (!ReadUndoBytes(...))
            return;
        stage = UNDO_DECODE_STAGE_RELATION_DETAILS;
        /* FALLTHROUGH */
    case UNDO_DECODE_STAGE_RELATION_DETAILS:
        if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
        {
            if (!ReadUndoBytes(...))
                return;
        }
        stage = UNDO_DECODE_STAGE_BLOCK;
        /* FALLTHROUGH */

etc.

ReadUndoBytes would need some adjustments in this scheme; it wouldn't need my_bytes_read any more since it would only get called for structures that are not yet completely read. (Regardless of whether we adopt this idea, the nocopy flag to ReadUndoBytes appears to be unused and can be removed.)

We could use a similar context object for InsertUndoRecord. BeginInsertUndoRecord(&ucontext, &uur) would initialize all of the work_blah structures within the context object. InsertUndoData will be a big switch. Maybe no need for a "finish" function here. There can also be a SkipInsertingUndoData function that can be called instead of InsertUndoData if the page is discarded.

I think this would be more elegant than what we've got now.

This is not a complete review, but I'm out of time and energy for the moment...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 13, 2019 at 02:20:29AM +1300, Thomas Munro wrote:
> === 0002 "Add SmgrId to smgropen() and BufferTag." ===
>
> This is new, and is based on the discussion from another recent thread[1] about how we should identify buffers belonging to different storage managers. In earlier versions of the patch-set I had used a special reserved DB OID for undo data. Tom Lane didn't like that idea much, and Anton Shyrabokau (via Shawn Debnath) suggested making ForkNumber narrower so we can add a new field to BufferTag, and Andres Freund +1'd my proposal to add the extra value as a parameter to smgropen(). So, here is a patch that tries those ideas.
>
> Another way to do this would be to widen RelFileNode instead, to avoid having to pass around the SMGR ID separately in various places. Looking at the number of places that have to change, you can probably see why we wanted to use a magic DB OID instead, and I'm not entirely convinced that it wasn't better that way, or that I've found all the places that need to carry an smgrid alongside a RelFileNode.
>
> Archeological note: smgropen() was like that ~15 years ago before commit 87bd9563, but buffer tags didn't include the SMGR ID.
>
> I decided to call md.c's ID "SMGR_RELATION", describing what it really holds -- regular relations -- rather than perpetuating the doubly anachronistic "magnetic disk" name.
>
> While here, I resurrected the ancient notion of a per-SMGR 'open' routine, so that a small amount of md.c-specific stuff could be kicked out of smgr.c and future implementations can do their own thing here too.
>
> While doing that work I realised that at least pg_rewind needs to learn about how different storage managers map blocks to files, so that's a new TODO item requiring more thought. I wonder what other places know how to map { RelFileNode, ForkNumber, BlockNumber } to a path + offset, and I wonder what to think about the fact that some of them may be non-backend code...

Given the scope of this patch, it might be prudent to start a separate thread for it. So far, this discussion has been buried within other discussions and I want to ensure folks don't miss this. Thanks.

--
Shawn Debnath
Amazon Web Services (AWS)
On Fri, Apr 19, 2019 at 10:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Apr 19, 2019 at 6:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Currently, undo branch[1] contain an older version of the (undo interface + some fixup). Now, I have merged the latest changes from the zheap branch[2] to the undo branch[1] which can be applied on top of the undo storage commit[3]. For merging those changes, I need to add some changes to the undo log storage patch as well for handling the multi log transaction. So I have attached two patches, 1) improvement to undo log storage 2) complete undo interface patch which include 0006+0007 from undo branch[1] + new changes on the zheap branch.

Thanks for the review, Robert. Please find my reply inline.

> Some review comments:
>
> +#define AtAbort_ResetUndoBuffers() ResetUndoBuffers()
>
> I don't think this really belongs in xact.c. Seems like we ought to declare it in the appropriate header file. Perhaps we also ought to consider using a static inline function rather than a macro, although I guess it doesn't really matter.

Moved to undoinsert.h.

> +void
> +SetCurrentUndoLocation(UndoRecPtr urec_ptr)
> +{
> + UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
> + UndoPersistence upersistence = log->meta.persistence;

Right. This should not be part of this patch, so removed.

> + * When the undorecord for a transaction gets inserted in the next log then we
>
> undo record

Changed.

> + * insert a transaction header for the first record in the new log and update
> + * the transaction header with this new logs location. We will also keep
>
> This appears to be nonsensical. You're saying that you add a transaction header to the new log and then update it with its own location. That can't be what you mean.

Actually, what I meant is that we update the transaction's header which is in the old log. I have changed the comments.

> + * Incase, this is not the first record in new log (aka new log already
>
> "Incase," -> "If"
> "aka" -> "i.e."

Done.

> Overall this whole paragraph is a bit hard to understand.

I tried to improve it in the newer version.

> + * same transaction spans across multiple logs depending on which log is
>
> delete "across"

Fixed.

> + * processed first by the discard worker. If it processes the first log which
> + * contains the transactions first record, then it can get the last record
> + * of that transaction even if it is in different log and then processes all
> + * the undo records from last to first. OTOH, if the next log get processed
>
> Not sure how that would work if the number of logs is >2.
> This whole paragraph is also hard to understand.

Actually, what I meant is that a transaction may be spread over multiple logs, for example three logs (1, 2, 3). If the discard worker checks log 1 first, then for an aborted transaction it will follow the chain of undo headers, register a complete rollback request, and apply all the undo actions in logs 1, 2 and 3 together. Whereas if it encounters log 2 first, it will register a request for the undo actions in logs 2 and 3, and similarly if it encounters log 3 first, it will only process that log. We have done it this way so that we can maintain the order in which the undo is applied. However, we could instead always collect all the undo and apply it together, but for that we would need to add one more pointer to the transaction header (to the transaction's undo header in the previous log).
Maybe we can keep the next log pointer in a separate header instead of in the transaction header, so that it only occupies space on a log switch. I think this comment doesn't belong here anyway; it's more related to undo discard processing, so I have removed it.

> +static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
> +static int buffer_idx;
>
> This is a file-global variable with a very generic name that is very similar to a local variable name used by multiple functions within the file (bufidx) and no comments. Ugh.

Comment added.

> +UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)
>
> The locking regime for this function is really confusing. It requires that the caller hold discard_lock on entry, and on exit the lock will still be held if the return value is true but will no longer be held if the return value is false. Yikes! Anybody who reads code that uses this function is not going to guess that it has such strange behavior. I'm not exactly sure how to redesign this, but I think it's not OK the way you have it. One option might be to inline the logic into each call site.

I think the simple solution is that inside the UndoRecordIsValid function we can directly check UndoLogIsDiscarded if oldest_data is not yet initialized; I don't think we need to release the discard lock for that. So I have changed it like that.

> +/*
> + * Overwrite the first undo record of the previous transaction to update its
> + * next pointer. This will just insert the already prepared record by
> + * UndoRecordPrepareTransInfo. This must be called under the critical section.
> + * This will just overwrite the undo header not the data.
> + */
> +static void
> +UndoRecordUpdateTransInfo(int idx)
>
> It bugs me that this function goes back in to reacquire the discard lock for the purpose of preventing a concurrent undo discard. Apparently, if the other transaction's undo has been discarded between the prepare phase and where we are now, we're OK with that and just exit without doing anything; otherwise, we update the previous transaction header. But that seems wrong. When we enter a critical section, I think we should aim to know exactly what modifications we are going to make within that critical section.
>
> I also wonder how the concurrent discard could really happen. We must surely be holding exclusive locks on the relevant buffers -- can undo discard really discard undo when the relevant buffers are x-locked?
>
> It seems to me that remaining_bytes is a crock that should be ripped out entirely, both here and in InsertUndoRecord. It seems to me that UndoRecordUpdateTransInfo can just contrive to set remaining_bytes correctly. e.g.
>
> do
> {
>     // stuff
>     if (!BufferIsValid(buffer))
>     {
>         Assert(InRecovery);
>         already_written += (BLCKSZ - starting_byte);
>         done = (already_written >= undo_len);
>     }
>     else
>     {
>         page = BufferGetPage(buffer);
>         done = InsertUndoRecord(...);
>         MarkBufferDirty(buffer);
>     }
> } while (!done);
>
> InsertPreparedUndo needs similar treatment.
>
> To make this work, I guess the long string of assignments in InsertUndoRecord will need to be made unconditional, but that's probably pretty cheap. As a fringe benefit, all of those work_blah variables that are currently referencing file-level globals can be replaced with something local to this function, which will surely please the coding style police.

That got fixed as part of the fix for the last comment, where we introduced SkipInsertingUndoData and moved the globals into the context.
> + * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise,
> + * it refers to the top transaction id because undo log only stores mapping
> + * for the top most transactions.
> + */
> +UndoRecPtr
> +PrepareUndoInsert(UnpackedUndoRecord *urec, FullTransactionId fxid,
>
> xid vs fxid
>
> + urec->uur_xidepoch = EpochFromFullTransactionId(fxid);
>
> We need to make some decisions about how we're going to handle 64-bit XIDs vs. 32-bit XIDs in undo. This doesn't look like a well-considered scheme. In general, PrepareUndoInsert expects the caller to have populated the UnpackedUndoRecord, but here, uur_xidepoch is getting overwritten with the high bits of the caller-specified XID. The low-order bits aren't stored anywhere by this function, but the caller is presumably expected to have placed them inside urec->uur_xid. And it looks like the low-order bits (urec->uur_xid) get stored for every undo record, but the high-order bits (urec->xidepoch) only get stored when we emit a transaction header. This all seems very confusing.

Yeah, it does seem a bit confusing. Actually, the discard worker processes the transaction chain from one transaction header to the next, so we need the epoch only for the first record of a transaction, and currently we set all the header information inside PrepareUndoInsert. The xid is stored by the caller, since the caller needs it for MVCC purposes. I think the caller can always set it, and if a transaction header gets added then it will be stored, otherwise not. So I think we can remove setting it here.

> I would really like to see us replace uur_xid and uur_xidepoch with a FullTransactionId; now that we have that concept, it seems like bad practice to break what is really a FullTransactionId into two halves and store them separately. However, it would be a bit unfortunate to store an additional 4 bytes of mostly-static data in every undo record. What if we went the other way? That is, remove urec_xid from UndoRecordHeader and from UnpackedUndoRecord. Make it the responsibility of whoever is looking up an undo record to know which transaction's undo they are searching. zheap, at least, generally does know this: if it's starting from a page, then it has the XID + epoch available from the transaction slot, and if it's starting from an undo record, you need to know the XID for which you are searching, I guess from uur_prevxid.

Right, from uur_prevxid we would know which xid's undo we are looking for, but without having uur_xid in the undo record itself, how would we know which undo record was inserted by the xid we are looking for? In zheap, while following the undo chain, if the slot got switched then there is a possibility (because of slot reuse) that we might get some other transaction's undo record for the same zheap tuple, but we want to keep traversing back to find the record inserted by uur_prevxid. So we need uur_xid as well, to tell us who inserted this undo record.

> I also think that we need to remove uur_prevxid. That field does not seem to be properly a general-purpose part of the undo machinery, but a zheap-specific consideration. I think its job is to tell you which transaction last modified the current tuple, but zheap can put that data in the payload if it likes. It is a waste to store it in every undo record, because it's not needed if the older undo has already been discarded or if the operation is an insert.

Done.

> + * Insert a previously-prepared undo record. This will write the actual undo
>
> Looks like this now inserts all previously-prepared undo records (rather than just a single one).

Fixed.

> + * in page. We start writting immediately after the block header.
>
> Spelling.

Done.

> + * Helper function for UndoFetchRecord. It will fetch the undo record pointed
> + * by urp and unpack the record into urec. This function will not release the
> + * pin on the buffer if complete record is fetched from one buffer, so caller
> + * can reuse the same urec to fetch the another undo record which is on the
> + * same block. Caller will be responsible to release the buffer inside urec
> + * and set it to invalid if it wishes to fetch the record from another block.
> + */
> +UnpackedUndoRecord *
> +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
> +                 UndoPersistence persistence)
>
> I don't really understand why uur_buffer is part of an UnpackedUndoRecord. It doesn't look like the other fields, which tell you about the contents of an undo record that you have created or that you have parsed. Instead, it's some kind of scratch space for keeping track of a buffer that we're using in the process of reading an undo record. It looks like it should be an argument to UndoGetOneRecord() and ResetUndoRecord().
>
> I also wonder whether it's really a good design to make the caller responsible for invalidating the buffer before accessing another block. Maybe it would be simpler to have this function just check whether the buffer it's been given is the right one; if not, unpin it and pin the new one instead. But I'm not really sure...

I am not sure which will be better here, but my thought was that since the caller has to release the last buffer anyway, why not make it responsible for keeping track of the first buffer of the undo record? The caller understands better that it needs to hold on to the first buffer of the undo record, because it hopes that the previous undo record in the chain might fall in the same buffer. Maybe we can make the caller completely responsible for reading the buffer for the first block of the undo record, so that it always passes a valid buffer, and UndoGetOneRecord only needs to read additional buffers if the undo record is split, releasing them right there. Then the caller would always keep track of the first buffer where the undo record starts, and whenever the undo record pointer changes, it would be responsible for changing the buffer.

> + /* If we already have a buffer pin then no need to allocate a new one. */
> + if (!BufferIsValid(buffer))
> + {
> +     buffer = ReadBufferWithoutRelcache(SMGR_UNDO,
> +                                        rnode, UndoLogForkNum, cur_blk,
> +                                        RBM_NORMAL, NULL,
> +                                        RelPersistenceForUndoPersistence(persistence));
> +
> +     urec->uur_buffer = buffer;
> + }
>
> I think you should move this code inside the loop that follows. Then at the bottom of that loop, instead of making a similar call, just set buffer = InvalidBuffer. Then when you loop around it'll do the right thing and you'll need less code.

Done.

> Notice that having both the local variable buffer and the structure member urec->uur_buffer is actually making this code more complex. You are already setting urec->uur_buffer = InvalidBuffer when you do UnlockReleaseBuffer(). If you didn't have a separate 'buffer' variable you wouldn't need to clear them both here. In fact I think what you should have is an argument Buffer *curbuf, or something like that, and no uur_buffer at all.
Done.

> + /*
> +  * If we have copied the data then release the buffer, otherwise, just
> +  * unlock it.
> +  */
> + if (is_undo_rec_split)
> +     UnlockReleaseBuffer(buffer);
> + else
> +     LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>
> Ugh. I think what's going on here is: UnpackUndoRecord copies the data if it switches buffers, but not otherwise. So if the record is split, we can release the lock and pin, but otherwise we have to keep the pin to avoid having the data get clobbered. But having this code know about what UnpackUndoRecord does internally seems like an abstraction violation. It's also probably not right if we ever want to fetch undo records in bulk, as I see that the latest code in zheap master does. I think we should switch UnpackUndoRecord over to always copying the data and just avoid all this.

Done.

> (To make that cheaper, we may want to teach UnpackUndoRecord to store data into scratch space provided by the caller rather than using palloc to get its own space, but I'm not actually sure that's (a) worth it or (b) actually better.)
>
> [ Skipping over some things ]
>
> +bool
> +UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
> +                 int *already_decoded, bool header_only)
>
> I think we should split this function into three functions that use a context object, call it say UnpackUndoContext. The caller will do:
>
> BeginUnpackUndo(&ucontext);  // just once
> UnpackUndoData(&ucontext, page, starting_byte);  // any number of times
> FinishUnpackUndo(&uur, &ucontext);  // just once
>
> The undo context will store an enum value that tells us the "stage" of decoding:
>
> - UNDO_DECODE_STAGE_HEADER: We have not yet decoded even the record header; we need to do that next.
> - UNDO_DECODE_STAGE_RELATION_DETAILS: The next thing to be decoded is the relation details, if present.
> - UNDO_DECODE_STAGE_BLOCK: The next thing to be decoded is the block details, if present.
> - UNDO_DECODE_STAGE_TRANSACTION: The next thing to be decoded is the transaction details, if present.
> - UNDO_DECODE_STAGE_PAYLOAD: The next thing to be decoded is the payload details, if present.
> - UNDO_DECODE_STAGE_DONE: Decoding is complete.
>
> It will also store the number of bytes that have already been copied as part of whichever stage is current. A caller who wants only part of the record can stop when ucontext.stage > desired_stage; e.g. the current header_only flag corresponds to stopping when ucontext.stage > UNDO_DECODE_STAGE_HEADER, and the potential optimization mentioned in UndoGetOneRecord could be done by stopping when ucontext.stage > UNDO_DECODE_STAGE_BLOCK (although I don't know if that's worth doing).
>
> In this scheme, BeginUnpackUndo just needs to set the stage to UNDO_DECODE_STAGE_HEADER and the number of bytes copied to 0. The context object contains all the work_blah things (no more global variables!), but BeginUnpackUndo does not need to initialize them, since they will be overwritten before they are examined. And FinishUnpackUndo just needs to copy all of the fields from the work_blah things into the UnpackedUndoRecord. The tricky part is UnpackUndoData itself, which I propose should look like a big switch where all branches fall through.
> Roughly:
>
> switch (ucontext->stage)
> {
>     case UNDO_DECODE_STAGE_HEADER:
>         if (!ReadUndoBytes(...))
>             return;
>         stage = UNDO_DECODE_STAGE_RELATION_DETAILS;
>         /* FALLTHROUGH */
>     case UNDO_DECODE_STAGE_RELATION_DETAILS:
>         if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
>         {
>             if (!ReadUndoBytes(...))
>                 return;
>         }
>         stage = UNDO_DECODE_STAGE_BLOCK;
>         /* FALLTHROUGH */
>     etc.
>
> ReadUndoBytes would need some adjustments in this scheme; it wouldn't need my_bytes_read any more since it would only get called for structures that are not yet completely read.

Yeah, so we can jump directly to the header that is not yet completely read, but if any header is partially read then we need to maintain some kind of partial-read variable; otherwise, from 'already read' we wouldn't be able to know how many bytes of the header were read in the last call, unless we either calculate that from uur_info or maintain a partial_read in the context, as I have done in the new version.

> (Regardless of whether we adopt this idea, the nocopy flag to ReadUndoBytes appears to be unused and can be removed.)

Yup.

> We could use a similar context object for InsertUndoRecord. BeginInsertUndoRecord(&ucontext, &uur) would initialize all of the work_blah structures within the context object. InsertUndoData will be a big switch. Maybe no need for a "finish" function here. There can also be a SkipInsertingUndoData function that can be called instead of InsertUndoData if the page is discarded. I think this would be more elegant than what we've got now.

Done. Notes:

- I think the ucontext->stage values are the same for insert and decode; can we just declare one enum and give it a generic name, e.g. UNDO_PROCESS_STAGE_HEADER?
- In SkipInsertingUndoData I also have to go through all the stages, so that if we find some valid block the stage is right for inserting the partial record. Do you think I could have avoided that?

Apart from these changes, I have also included UndoRecordBulkFetch in undoinsert.c. I have tested this patch with my local test modules, which basically insert, fetch and bulk-fetch multiple records and compare the contents. My test patch is still not in good shape, so I will post the test module later.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 25, 2019 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > +static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
> > +static int buffer_idx;
> >
> > This is a file-global variable with a very generic name that is very similar to a local variable name used by multiple functions within the file (bufidx) and no comments. Ugh.
>
> Comment added.

The variable name is still bad, and the comment isn't very helpful either. First, you can't tell by looking at the name that it has anything to do with the undo_buffers variable, because undo_buffers and buffer_idx are not obviously related names. Second, it's not an index; it's a count. A count tells you how many of something you have; an index tells you which one of those you are presently thinking about. Third, the undo_buffer array is itself poorly named, because it's not an array of all the undo buffers in the world or anything like that, but rather an array of undo buffers for some particular purpose. "static UndoBuffers *undo_buffer" is about as helpful as "int integer" and I hope I don't need to explain why that isn't usually a good thing to write. Maybe prepared_undo_buffer for the array and nprepared_undo_buffer for the count, or something like that.

> I think the simple solution is that inside the UndoRecordIsValid function we can directly check UndoLogIsDiscarded if oldest_data is not yet initialized; I don't think we need to release the discard lock for that. So I have changed it like that.

Can we instead eliminate the special case? It seems like the if (log->oldest_data == InvalidUndoRecPtr) case will be taken very rarely, so if it's buggy, we might not notice.

> Right, from uur_prevxid we would know which xid's undo we are looking for, but without having uur_xid in the undo record itself, how would we know which undo record was inserted by the xid we are looking for? In zheap, while following the undo chain, if the slot got switched then there is a possibility (because of slot reuse) that we might get some other transaction's undo record for the same zheap tuple, but we want to keep traversing back to find the record inserted by uur_prevxid. So we need uur_xid as well, to tell us who inserted this undo record.

It seems desirable to me to make this the caller's problem. When we are processing undo a transaction at a time, we'll know the XID because it will be available from the transaction header. If a system like zheap maintains a pointer to an undo record someplace in the middle of a transaction, it can also store the XID if it needs it.

The thing is, the zheap code almost works that way already. Transaction slots within a page store both the undo record pointer and the XID. The only case where zheap doesn't store the undo record pointer and the XID is when a slot switch occurs, but that could be changed.

If we moved urec_xid into UndoRecordTransaction, we'd save 4 bytes per undo record across the board. When zheap emits undo records for updates or deletes, they would need to store an UndoRecPtr (8 bytes) + FullTransactionId (8 bytes) in the payload unless the previous change to that TID is all-visible or the previous change to that TID was made by the same transaction. Also, zheap would no longer need to store the slot number in the payload in any case, because this would substitute for that (and permit more efficient lookups, to boot).
So the overall impact on zheap update and delete records would be somewhere between -4 bytes (when we save the space used by XID and incur no other cost) and +12 bytes (when we lose the XID but gain the UndoRecPtr + FullTransactionId).

That worst case could be further optimized. For example, instead of storing a FullTransactionId, zheap could store the difference between the XID to which the current record pertains (which in this model the caller is required to know) and the XID of whoever last modified the tuple. That difference certainly can't exceed 4 billion (or even 2 billion) so 4 bytes is enough. That reduces the worst-case difference to +8 bytes. Probably zheap could use payloads with some kind of variable-length encoding and squeeze out even more in common cases, but I'm not sure that's necessary or worth the complexity.

Let's also give uur_blkprev its own UREC_INFO_* constant and omit it when this is the first time we're touching this block in this transaction and thus the value is InvalidUndoRecPtr. In the pretty-common case where a transaction updates one tuple on the page and never comes back, this - together with the optimization in the previous paragraph - will cause zheap to come out even on undo, because it'll save 4 bytes by omitting urec_xid and 8 bytes by omitting uur_blkprev, and it'll lose 8 bytes storing an UndoRecPtr in the payload and 4 bytes storing an XID-difference.

Even with those changes, zheap's update and delete could still come out a little behind on undo volume if hitting many tuples on the same page, because for every tuple they hit after the first, we'll still need the UndoRecPtr for the previous change to that page (uur_blkprev) and we'll also have the UndoRecPtr extracted from the tuple's previous slot, stored in the payload. So we'll end up +8 bytes in this case. I think that's acceptable, because it often won't happen, it's hardly catastrophic if it does, and saving 4 bytes on every insert, and on every update or delete where the old undo is already discarded, is pretty sweet.

> Yeah, so we can jump directly to the header that is not yet completely read, but if any header is partially read then we need to maintain some kind of partial-read variable; otherwise, from 'already read' we wouldn't be able to know how many bytes of the header were read in the last call, unless we either calculate that from uur_info or maintain a partial_read in the context, as I have done in the new version.

Right, we need to know the bytes already read for the next header.

> Notes:
> - I think the ucontext->stage values are the same for insert and decode; can we just declare one enum and give it a generic name, e.g. UNDO_PROCESS_STAGE_HEADER?

I agree. Maybe UNDO_PACK_STAGE_WHATEVER or, more briefly, UNDO_PACK_WHATEVER.

> - In SkipInsertingUndoData I also have to go through all the stages, so that if we find some valid block the stage is right for inserting the partial record. Do you think I could have avoided that?

Hmm, I didn't foresee that, but no, I don't think you can avoid that. That problem wouldn't occur before we added the stage stuff, since we'd just go through all the stages every time and each one would know its own size and do nothing if that number of bytes had already been passed, but with this design there seems to be no way around it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
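[Editor's note: to make the arithmetic above concrete, here is a rough sketch of the XID-difference encoding being described, with full transaction IDs shown as plain uint64 values for brevity (the real code would presumably use FullTransactionId); the function names are illustrative assumptions.]

static inline uint32
encode_prev_xid_delta(uint64 record_fxid, uint64 prev_fxid)
{
    /* The prior modifier can't be more than ~2 billion XIDs behind, so 4 bytes suffice. */
    Assert(record_fxid >= prev_fxid);
    Assert(record_fxid - prev_fxid <= PG_UINT32_MAX);
    return (uint32) (record_fxid - prev_fxid);
}

static inline uint64
decode_prev_xid_delta(uint64 record_fxid, uint32 delta)
{
    /* The reader is required to know record_fxid, so the full value is recoverable. */
    return record_fxid - delta;
}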
Replying to myself to resend to the list, since my previous attempt seems to have been eaten by a grue.

On Tue, Apr 30, 2019 at 11:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Apr 30, 2019 at 2:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Like previous version these patch set also applies on:
> > https://github.com/EnterpriseDB/zheap/tree/undo
> > (b397d96176879ed5b09cf7322b8d6f2edd8043a5)
>
> Some more review of 0003:
>
> There is a whitespace-only hunk in xact.c.
>
> It would be nice if AtAbort_ResetUndoBuffers didn't need to exist at all. Then, this patch would make no changes whatsoever to xact.c. We'd still need such changes in other patches, because the whole idea of undo is tightly bound up with the concept of transactions. Yet, this particular patch wouldn't touch that file, and that would be nice. In general, the reason why we need AtCommit/AtAbort/AtEOXact callbacks is to adjust the values of global variables (or the data structures to which they point) at commit or abort time. And that is also the case here. The thing is, I'm not sure why we need these particular global variables. Is there some way that we can get by without them? The obvious thing to do seems to be to invent yet another context object, allocated via a new function, which can then be passed to PrepareUndoInsert, InsertPreparedUndo, UndoLogBuffersSetLSN, UnlockReleaseUndoBuffers, etc. This would obsolete UndoSetPrepareSize, since you'd instead pass the size to the context allocator function.
>
> UndoRecordUpdateTransInfo should declare a local variable XactUndoRecordInfo *something = &xact_urec_info[idx] and use that variable wherever possible.
>
> It should also probably use while (1) { ... } rather than do { ... } while (true).
>
> In UndoBufferGetSlot you could replace 'break;' with 'return i;' and then more than half the function would need one less level of indentation. This function should also declare PreparedUndoBuffer *something and set that variable equal to &prepared_undo_buffers[i] at the top of the loop and again after the loop, and that variable should then be used whenever possible.
>
> UndoRecordRelationDetails seems to need renaming now that it's down to a single member.
>
> The comment for UndoRecordBlock needs updating, as it still refers to blkprev.
>
> The comment for UndoRecordBlockPrev refers to "Identifying information" but it's not identifying anything. I think you could just delete "Identifying information for" from this sentence and not lose anything. And I think several of the nearby comments that refer to "Identifying information" could more simply and more correctly just refer to "Information".
>
> I don't think that SizeOfUrecNext needs to exist.
>
> xact.h should not add an include for undolog.h. There are no other changes to this header, so presumably the header does not depend on anything in undolog.h. If .c files that include xact.h need undolog.h, then the header should be included in those files, not in the header itself. That way, we avoid making partial recompiles more painful than necessary.
>
> UndoGetPrevUndoRecptr looks like a strange interface. Why not just arrange not to call the function in the first place if prevurp is valid?
>
> Every use of palloc0 in this code should be audited to check whether it is really necessary to zero the memory before use. If it will be initialized before it's used for anything anyway, it doesn't need to be pre-zeroed.
>
> FinishUnpackUndo looks nice! But it is missing a blank line in one place, and 'if it presents' should be 'if it is present' in a whole bunch of places.
>
> BeginInsertUndo also looks to be missing a few blank lines.
>
> Still need to do some cleanup of the UnpackedUndoRecord, e.g. unifying uur_xid and uur_xidepoch into uur_fxid.
>
> InsertUndoData ends in an unnecessary return, as does SkipInsertingUndoData.
>
> I think the argument to SkipInsertingUndoData should be the number of bytes to skip, rather than the starting byte within the block.
>
> Function header comment formatting is not consistent. Compare FinishUnpackUndo (function name recapitulated on first line of comment) to ReadUndoBytes (not recapitulated) to UnpackUndoData (entire header comment jammed into one paragraph with function name at start). I prefer the style where the function name is not restated, but YMMV. Anyway, it has to be consistent.
>
> UndoGetPrevRecordLen should not declare char *page instead of Page page, I think.
>
> UndoRecInfo looks a bit silly, I think. Isn't index just the index of this entry in the array? You can always figure that out by ptr - array_base. Instead of having UndoRecPtr urp in this structure, how about adding that to UnpackedUndoRecord? When inserting, caller leaves it unset and InsertPreparedUndo sets it; when retrieving, UndoFetchRecord or UndoRecordBulkFetch sets it. With those two changes, I think you can get rid of UndoRecInfo entirely and just return an array of UnpackedUndoRecords. This might also eliminate the need for an 'urp' member in PreparedUndoSpace.
>
> I'd probably use UREC_INFO_BLKPREV rather than UREC_INFO_BLOCKPREV for consistency.
>
> Similarly, UndoFetchRecord and UndoRecordBulkFetch seem a bit inconsistent. Perhaps UndoBulkFetchRecord.
>
> In general, I find the code for updating transaction headers to be really hard to understand. I'm not sure exactly what can be done about that. Like, why is UndoRecordPrepareTransInfo unpacking undo? Why does it take two undo record pointers as arguments and how are they different?
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 1, 2019 at 6:02 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Replying to myself to resend to the list, since my previous attempt seems to have been eaten by a grue.
>
> On Tue, Apr 30, 2019 at 11:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Apr 30, 2019 at 2:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > Like previous version these patch set also applies on:
> > > https://github.com/EnterpriseDB/zheap/tree/undo
> > > (b397d96176879ed5b09cf7322b8d6f2edd8043a5)
> >
> > Some more review of 0003:

Another suggestion:

+/*
+ * Insert a previously-prepared undo records. This will write the actual undo
+ * record into the buffers already pinned and locked in PreparedUndoInsert,
+ * and mark them dirty. This step should be performed after entering a
+ * criticalsection; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
..
..
+
+    /* Advance the insert pointer past this record. */
+    UndoLogAdvance(urp, size);
+ }
..
}

UndoLogAdvance internally takes an LWLock, and we don't recommend doing that inside a critical section; but that is exactly what will happen here, since this function is supposed to be invoked inside a critical section, as its comments say.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Thomas told me offlist that this email of mine didn't hit pgsql-hackers, so trying it again by resending.

On Mon, Apr 29, 2019 at 3:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Apr 19, 2019 at 3:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Mar 12, 2019 at 6:51 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > Currently, undo branch[1] contain an older version of the (undo interface + some fixup). Now, I have merged the latest changes from the zheap branch[2] to the undo branch[1] which can be applied on top of the undo storage commit[3]. For merging those changes, I need to add some changes to the undo log storage patch as well for handling the multi log transaction. So I have attached two patches, 1) improvement to undo log storage 2) complete undo interface patch which include 0006+0007 from undo branch[1] + new changes on the zheap branch.
> >
> > [1] https://github.com/EnterpriseDB/zheap/tree/undo
> > [2] https://github.com/EnterpriseDB/zheap
> > [3] https://github.com/EnterpriseDB/zheap/tree/undo (b397d96176879ed5b09cf7322b8d6f2edd8043a5)
>
> Dilip has posted the patch for the "undo record interface"; next in the series is a patch that handles transaction rollbacks (the machinery to perform undo actions) and the background workers that manage undo.
>
> Transaction Rollbacks
> ----------------------------------
> We always perform rollback actions after cleaning up the current (sub)transaction. This ensures that we perform the actions immediately after an error (and release the locks) rather than when the user issues a Rollback command at some later point in time. We release the locks only after the undo actions are applied. The reason to delay lock release is that if we release locks before applying undo actions, then a parallel session can acquire the lock before us, which can lead to deadlock.
>
> We promote the error to a FATAL error if it occurred while applying undo for a subtransaction. The reason we can't proceed without applying the subtransaction's undo is that the modifications made in that case must not be visible even if the main transaction commits. Normally, the backend that receives the request to perform Rollback (To Savepoint) applies the undo actions, but there are cases where it is preferable to push the request to a background worker. The main reasons to push requests to background workers are: (a) the request is for a very large rollback; offloading it allows us to return control to the user quickly (there is a GUC, rollback_overflow_size, which indicates that rollbacks greater than the configured size are performed lazily by background workers); and (b) if there is an error while applying the undo actions, we push such a request to a background worker.
>
> Undo Requests and Undo workers
> --------------------------------------------------
> To improve the efficiency of rollbacks, we create three queues and a hash table for the rollback requests. An XID-based priority queue allows us to process the requests of older transactions first and helps us move oldestXidHavingUndo (an xid-horizon below which all transactions are visible) forward. A size-based queue helps us perform the rollbacks of larger aborts in a timely fashion, so that we don't get stuck on them while discarding the logs. An error queue holds the requests of transactions that failed to apply their undo.
> The rollback hash table is used to avoid duplicate undo requests from backends and the discard worker.
>
> The undo launcher is responsible for launching workers iff there is some work available in one of the work queues and more workers are available. A worker is launched to handle requests for a particular database. Each undo worker then starts reading requests for that particular database from one of the queues. A worker will peek into each queue for requests from a particular database if it would otherwise need to switch databases in less than undo_worker_quantum ms (10s by default) after starting. Also, if there is no work, it lingers for UNDO_WORKER_LINGER_MS (10s by default). This avoids restarting the workers too frequently.
>
> The discard worker is responsible for discarding the undo logs of transactions that are committed and all-visible, or are rolled back. It also registers requests for aborted transactions in the work queues. It iterates through all the active logs one-by-one and tries to discard the transactions that are old enough to matter.
>
> The details of how all of this works are described in src/backend/access/undo/README.UndoProcessing. The main idea of keeping a README is to allow reviewers to understand this patch; later we can decide which parts of it to move into code comments and which into the main undo README.
>
> Question: Currently, TwoPhaseFileHeader stores just a TransactionId, so for systems (like zheap) that support FullTransactionId, two-phase transactions will be tricky to support, as we need the FullTransactionId during rollbacks. Is it a good idea to store a FullTransactionId in TwoPhaseFileHeader?
>
> Credits:
> --------------
> Designed by: Andres Freund, Amit Kapila, Robert Haas, and Thomas Munro
> Author: Amit Kapila, Dilip Kumar, Kuntal Ghosh, and Thomas Munro
>
> This patch is based on Dilip's latest patch for the undo record interface. The branch can be accessed at https://github.com/EnterpriseDB/zheap/tree/undoprocessing
>
> Inputs on design/code are welcome.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
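[Editor's note: the foreground-vs-background decision described above could look roughly like the following sketch; rollback_overflow_size is the GUC named in the mail, while the function name, its signature, and the GUC's unit are illustrative assumptions.]

static bool
PerformUndoActionsInForeground(uint64 rollback_size, bool prior_attempt_failed)
{
    /* A request that already failed once goes to the error queue instead. */
    if (prior_attempt_failed)
        return false;

    /*
     * Large rollbacks are offloaded to an undo worker so that control
     * returns to the user quickly.  (The GUC is assumed to be in MB here.)
     */
    if (rollback_size > (uint64) rollback_overflow_size * 1024 * 1024)
        return false;

    return true;
}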
On Tue, Apr 30, 2019 at 11:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

The attached patch provides a mechanism for masking the necessary bits in undo pages, to support consistency checking of undo pages. Ideally we could merge this patch with the main interface patch, but currently I have kept it separate, mainly because a) this is still a WIP patch and b) reviewing the changes will be easier this way.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Apr 30, 2019 at 8:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
> UndoRecInfo looks a bit silly, I think. Isn't index just the index of this entry in the array? You can always figure that out by ptr - array_base. Instead of having UndoRecPtr urp in this structure, how about adding that to UnpackedUndoRecord? When inserting, caller leaves it unset and InsertPreparedUndo sets it; when retrieving, UndoFetchRecord or UndoRecordBulkFetch sets it. With those two changes, I think you can get rid of UndoRecInfo entirely and just return an array of UnpackedUndoRecords. This might also eliminate the need for an 'urp' member in PreparedUndoSpace.

Yeah, at least in this patch it looks silly. Actually, I added that index to make the qsort stable when execute_undo_action sorts the records in block order. But as part of this patch we don't have that processing, so either we can remove this structure completely as you suggested (and the undo processing patch can add it back), or we can just add a comment explaining why we added this index field. I am OK with the other comments and will work on them.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, May 2, 2019 at 5:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Yeah, at least in this patch it looks silly. Actually, I added that index to make the qsort stable when execute_undo_action sorts the records in block order. But as part of this patch we don't have that processing, so either we can remove this structure completely as you suggested (and the undo processing patch can add it back), or we can just add a comment explaining why we added this index field.

Well, the qsort comparator could compute the index as ptr - array_base just like any other code, couldn't it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 2, 2019 at 7:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, May 2, 2019 at 5:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Yeah, at least in this patch it looks silly. Actually, I added that index to make the qsort stable when execute_undo_action sorts the records in block order. But as part of this patch we don't have that processing, so either we can remove this structure completely as you suggested (and the undo processing patch can add it back), or we can just add a comment explaining why we added this index field.
>
> Well, the qsort comparator could compute the index as ptr - array_base just like any other code, couldn't it?

I might be completely missing something, but (ptr - array_base) is only valid when you first get the array; qsort will swap the elements around, and after that you will never be able to make out which element was at the lower index and which one was at the higher index. Basically, our goal is to preserve the order of the undo records for the same block, but their order might change due to swaps while they are being compared with undo record pointers for another block, and once the order has been swapped we will never know their initial positions.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, May 3, 2019 at 12:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I might be completely missing something, but (ptr - array_base) is only valid when you first get the array; qsort will swap the elements around, and after that you will never be able to make out which element was at the lower index and which one was at the higher index. Basically, our goal is to preserve the order of the undo records for the same block, but their order might change due to swaps while they are being compared with undo record pointers for another block, and once the order has been swapped we will never know their initial positions.

*facepalm* Yeah, you're right.

Still, I think we should see if there's some way of getting rid of that structure, or at least making it an internal detail that is used by the code that's doing the sorting rather than something that is exposed as an external interface.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
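[Editor's note: for what it's worth, the stable-sort trick being discussed can be kept internal to the sorting code along these lines. This is only a sketch: the struct mirrors the UndoRecInfo described upthread, and the exact field names are assumptions.]

/* Wrapper built just before sorting; 'index' records the element's original position. */
typedef struct UndoRecInfo
{
    int                 index;  /* position when the array was built */
    UnpackedUndoRecord *uur;    /* the record to sort */
} UndoRecInfo;

static int
undo_record_comparator(const void *a, const void *b)
{
    const UndoRecInfo *left = (const UndoRecInfo *) a;
    const UndoRecInfo *right = (const UndoRecInfo *) b;

    if (left->uur->uur_block != right->uur->uur_block)
        return (left->uur->uur_block < right->uur->uur_block) ? -1 : 1;

    /* Same block: compare original positions, which makes the sort stable. */
    return (left->index < right->index) ? -1 : 1;
}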
On Wed, May 1, 2019 at 10:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Thomas told me offlist that this email of mine didn't hit pgsql-hackers, so trying it again by resending.

Attached is the next version of the patch, with minor improvements:
a. use FullTransactionId
b. improve comments
c. remove some functions

The branch can be accessed at https://github.com/EnterpriseDB/zheap/tree/undoprocessing. It is on top of Thomas and Dilip's patches related to undo logs and undo records, though not everything is synced up from both branches yet, as they are also actively working on their sets of patches.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Apr 30, 2019 at 8:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Apr 30, 2019 at 2:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Like previous version these patch set also applies on:
> > https://github.com/EnterpriseDB/zheap/tree/undo
> > (b397d96176879ed5b09cf7322b8d6f2edd8043a5)
>
> Some more review of 0003:
>
> There is a whitespace-only hunk in xact.c.

Fixed.

> It would be nice if AtAbort_ResetUndoBuffers didn't need to exist at all. Then, this patch would make no changes whatsoever to xact.c. We'd still need such changes in other patches, because the whole idea of undo is tightly bound up with the concept of transactions. Yet, this particular patch wouldn't touch that file, and that would be nice. In general, the reason why we need AtCommit/AtAbort/AtEOXact callbacks is to adjust the values of global variables (or the data structures to which they point) at commit or abort time. And that is also the case here. The thing is, I'm not sure why we need these particular global variables. Is there some way that we can get by without them? The obvious thing to do seems to be to invent yet another context object, allocated via a new function, which can then be passed to PrepareUndoInsert, InsertPreparedUndo, UndoLogBuffersSetLSN, UnlockReleaseUndoBuffers, etc. This would obsolete UndoSetPrepareSize, since you'd instead pass the size to the context allocator function.

I have moved all the global variables into a context. Now I think we don't need AtAbort_ResetUndoBuffers, as the memory will be freed with the transaction context.

> UndoRecordUpdateTransInfo should declare a local variable XactUndoRecordInfo *something = &xact_urec_info[idx] and use that variable wherever possible.

Done.

> It should also probably use while (1) { ... } rather than do { ... } while (true).

OK.

> In UndoBufferGetSlot you could replace 'break;' with 'return i;' and then more than half the function would need one less level of indentation. This function should also declare PreparedUndoBuffer *something and set that variable equal to &prepared_undo_buffers[i] at the top of the loop and again after the loop, and that variable should then be used whenever possible.

Done.

> UndoRecordRelationDetails seems to need renaming now that it's down to a single member.

I have moved that member directly into UndoPackContext.

> The comment for UndoRecordBlock needs updating, as it still refers to blkprev.

Done.

> The comment for UndoRecordBlockPrev refers to "Identifying information" but it's not identifying anything. I think you could just delete "Identifying information for" from this sentence and not lose anything. And I think several of the nearby comments that refer to "Identifying information" could more simply and more correctly just refer to "Information".

Done.

> I don't think that SizeOfUrecNext needs to exist.

Removed.

> xact.h should not add an include for undolog.h. There are no other changes to this header, so presumably the header does not depend on anything in undolog.h. If .c files that include xact.h need undolog.h, then the header should be included in those files, not in the header itself. That way, we avoid making partial recompiles more painful than necessary.

Right, fixed.

> UndoGetPrevUndoRecptr looks like a strange interface. Why not just arrange not to call the function in the first place if prevurp is valid?
Done.

> Every use of palloc0 in this code should be audited to check whether it is really necessary to zero the memory before use. If it will be initialized before it's used for anything anyway, it doesn't need to be pre-zeroed.

Yeah, I found a few places where it was not required, so fixed.

> FinishUnpackUndo looks nice! But it is missing a blank line in one place, and 'if it presents' should be 'if it is present' in a whole bunch of places.
>
> BeginInsertUndo also looks to be missing a few blank lines.

Fixed.

> Still need to do some cleanup of the UnpackedUndoRecord, e.g. unifying uur_xid and uur_xidepoch into uur_fxid.

I will work on this.

> InsertUndoData ends in an unnecessary return, as does SkipInsertingUndoData.

Done.

> I think the argument to SkipInsertingUndoData should be the number of bytes to skip, rather than the starting byte within the block.

Done.

> Function header comment formatting is not consistent. Compare FinishUnpackUndo (function name recapitulated on first line of comment) to ReadUndoBytes (not recapitulated) to UnpackUndoData (entire header comment jammed into one paragraph with function name at start). I prefer the style where the function name is not restated, but YMMV. Anyway, it has to be consistent.

Fixed.

> UndoGetPrevRecordLen should not declare char *page instead of Page page, I think.
>
> UndoRecInfo looks a bit silly, I think. Isn't index just the index of this entry in the array? You can always figure that out by ptr - array_base. Instead of having UndoRecPtr urp in this structure, how about adding that to UnpackedUndoRecord? When inserting, caller leaves it unset and InsertPreparedUndo sets it; when retrieving, UndoFetchRecord or UndoRecordBulkFetch sets it. With those two changes, I think you can get rid of UndoRecInfo entirely and just return an array of UnpackedUndoRecords. This might also eliminate the need for an 'urp' member in PreparedUndoSpace.

As discussed upthread, I will work on fixing this.

> I'd probably use UREC_INFO_BLKPREV rather than UREC_INFO_BLOCKPREV for consistency.
>
> Similarly, UndoFetchRecord and UndoRecordBulkFetch seem a bit inconsistent. Perhaps UndoBulkFetchRecord.

Done.

> In general, I find the code for updating transaction headers to be really hard to understand. I'm not sure exactly what can be done about that. Like, why is UndoRecordPrepareTransInfo unpacking undo?

It's only unpacking the header. But yeah, we can do better: instead of unpacking, we can just read the main header, calculate the exact offset of uur_next from uur_info, and then in UndoRecordUpdateTransInfo directly update only uur_next by writing at that offset, instead of overwriting the complete header.

> Why does it take two undo record pointers as arguments and how are they different?

One is the previous transaction's start header, which we want to update; the other is the current transaction's undo record pointer, which we want to set as uur_next in the previous transaction's start header.

Just for tracking, the open comments which still need to be worked on:

1. Avoid the special case in UndoRecordIsValid.
> Can we instead eliminate the special case? It seems like the if (log->oldest_data == InvalidUndoRecPtr) case will be taken very rarely, so if it's buggy, we might not notice.

2. While updating the previous transaction header, instead of unpacking the complete header and writing it back, we can just unpack the main header, calculate the offset of uur_next, and then update it directly.

3. Unifying uur_xid and uur_xidepoch into uur_fxid.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
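[Editor's note: the "calculate the exact offset of uur_next from uur_info" idea in item 2 could look something like the sketch below. The UREC_INFO_* flags and SizeOf* constants echo the patch's naming, but the exact set of optional structures and their order are assumptions.]

static Size
UndoRecordOffsetOfUurNext(uint16 uur_info)
{
    Size        offset = SizeOfUndoRecordHeader;   /* fixed header first */

    /* Skip whichever optional structures precede the transaction header. */
    if ((uur_info & UREC_INFO_RELATION_DETAILS) != 0)
        offset += SizeOfUndoRecordRelationDetails;
    if ((uur_info & UREC_INFO_BLOCK) != 0)
        offset += SizeOfUndoRecordBlock;

    /* uur_next is assumed to live at the start of the transaction header. */
    return offset + offsetof(UndoRecordTransaction, urec_next);
}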
On Mon, May 6, 2019 at 8:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > In general, I find the code for updating transaction headers to be really hard to understand. I'm not sure exactly what can be done about that. Like, why is UndoRecordPrepareTransInfo unpacking undo?
>
> It's only unpacking the header. But yeah, we can do better: instead of unpacking, we can just read the main header, calculate the exact offset of uur_next from uur_info, and then in UndoRecordUpdateTransInfo directly update only uur_next by writing at that offset, instead of overwriting the complete header.

Hmm. I think it's reasonable to use the unpack infrastructure to figure out where uur_next is. I don't know whether a bespoke method of figuring that out would be any better. At least the comments probably need some work.

> > Why does it take two undo record pointers as arguments and how are they different?
>
> One is the previous transaction's start header, which we want to update; the other is the current transaction's undo record pointer, which we want to set as uur_next in the previous transaction's start header.

So put some comments.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 1, 2019 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 1, 2019 at 6:02 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > Replying to myself to resend to the list, since my previous attempt seems to have been eaten by a grue.
> >
> > On Tue, Apr 30, 2019 at 11:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > Some more review of 0003:
>
> Another suggestion:
>
> +/*
> + * Insert a previously-prepared undo records. This will write the actual undo
> + * record into the buffers already pinned and locked in PreparedUndoInsert,
> + * and mark them dirty. This step should be performed after entering a
> + * criticalsection; it should never fail.
> + */
> +void
> +InsertPreparedUndo(void)
> +{
> ..
> ..
> +
> +    /* Advance the insert pointer past this record. */
> +    UndoLogAdvance(urp, size);
> + }
> ..
> }
>
> UndoLogAdvance internally takes an LWLock, and we don't recommend doing that inside a critical section; but that is exactly what will happen here, since this function is supposed to be invoked inside a critical section, as its comments say.

I think we can call UndoLogAdvanceFinal in FinishUndoRecordInsert instead, because that function is called outside the critical section. And now we already have the undo record size inside UndoRecordInsertContext.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
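[Editor's note: in other words, the insertion flow would defer the lock-taking pointer advance until after the critical section, roughly as sketched below. The function names follow the mails above, but the exact signatures are assumptions.]

void
ExampleInsertUndo(UndoRecordInsertContext *context, UnpackedUndoRecord *urec)
{
    /* Pins and locks buffers and may allocate: done outside the critical section. */
    PrepareUndoInsert(context, urec);

    START_CRIT_SECTION();
    InsertPreparedUndo(context);        /* pure buffer writes; must not fail */
    END_CRIT_SECTION();

    /*
     * Advancing the insert pointer takes the undo log's LWLock, so it is
     * deferred until after the critical section has ended.
     */
    FinishUndoRecordInsert(context);    /* would call UndoLogAdvanceFinal() */
}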
On Mon, May 6, 2019 at 5:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Just for tracking, the open comments which still need to be worked on:
>
> 1. Avoid the special case in UndoRecordIsValid.
> > Can we instead eliminate the special case? It seems like the if (log->oldest_data == InvalidUndoRecPtr) case will be taken very rarely, so if it's buggy, we might not notice.

I have worked on this comment and included the changes in the latest patch.

> 2. While updating the previous transaction header, instead of unpacking the complete header and writing it back, we can just unpack the main header, calculate the offset of uur_next, and then update it directly.

For this one, as you suggested, I am not changing the code; I have updated the comments instead.

> 3. Unifying uur_xid and uur_xidepoch into uur_fxid.

Still open.

I have also added the README.

Patches can be applied on top of undo branch [1] commit: (cb777466d008e656f03771cf16ec7ef9d6f2778b)

[1] https://github.com/EnterpriseDB/zheap/tree/undo

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, May 9, 2019 at 6:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Patches can be applied on top of undo branch [1] commit: > (cb777466d008e656f03771cf16ec7ef9d6f2778b) Hello all, Here is a new patch set which includes all of the patches discussed in this thread in one go, rebased on today's master. To summarise the main layers, from the top down we have: 0013: undo-based orphaned file clean-up ($SUBJECT, a demo of undo technology) 0009-0010: undo processing (execution of undo actions when rolling back) 0008: undo records 0001-0007: undo storage The main changes to the storage layer since the last time I posted the full patch stack: * pg_upgrade support: you can't have any live undo logs (much like 2PC transactions, we want to be free to change the format), but some work was required to make sure that all "discarded" undo record pointers from the old cluster still appear as discarded in the new cluster, as well as any from the new cluster * tweaks to various other src/bin tools that are aware of files under pgdata and were confused by undo segment files * the fsync of undo log segment files when they're created or recycled is now handed off to the checkpointer (this was identified as a performance problem for zheap) * code tidy-up, removing dead code (undo log rewind, prevlen, prevlog were no longer needed by patches higher up in the stack), removing global variables, noisy LOG messages about undo segment files now reduced to DEBUG1 * new extension contrib/undoinspect, for developer use, showing what will be undone if you abort: postgres=# begin; BEGIN postgres=# create table t(); CREATE TABLE postgres=# select * from undoinspect(); urecptr | rmgr | flags | xid | description ------------------+---------+-------+-----+--------------------------------------------- 00000000000032FA | Storage | P,T | 487 | CREATE dbid=12934, tsid=1663, relfile=16393 (1 row) One silly detail: I had to change the default max_worker_processes from 8 to 12, because otherwise a couple of tests run with fewer parallel workers than they expect, due to undo worker processes using up slots. There is probably a better solution to that problem. I put the patches in a tarball here, but they are also available from https://github.com/EnterpriseDB/zheap/tree/undo. -- Thomas Munro https://enterprisedb.com
Attachment
Hello Thomas,
In pg_buffercache contrib module, the file pg_buffercache--1.3--1.4.sql is missing. AFAICS, this file should be added as part of the following commit:
Add SmgrId to smgropen() and BufferTag
Otherwise, I'm not able to compile the contrib modules. I've also attached the patch to fix the same.
Attachment
On Fri, May 10, 2019 at 10:46 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > In pg_buffercache contrib module, the file pg_buffercache--1.3--1.4.sql is missing. AFAICS, this file should be added as part of the following commit: > Add SmgrId to smgropen() and BufferTag > > Otherwise, I'm not able to compile the contrib modules. I've also attached the patch to fix the same. Oops, thanks Kuntal. Fixed, along with some compiler warnings from MSVC and GCC. I added a quick tour of this to a README.md visible here: https://github.com/EnterpriseDB/zheap/tree/undo -- Thomas Munro https://enterprisedb.com
Attachment
On Thu, May 9, 2019 at 12:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, May 6, 2019 at 5:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Just for tracking, open comments which still need to be worked on. > > > > 1. Avoid special case in UndoRecordIsValid. > > > Can we instead eliminate the special case? It seems like the if > > > (log->oldest_data == InvalidUndoRecPtr) case will be taken very > > > rarely, so if it's buggy, we might not notice. > > I have worked on this comment and added the changes in the latest patch. > > > > 2. While updating the previous transaction header, instead of unpacking the > > complete header and writing it back, we can just unpack the main header, > > calculate the offset of uur_next, and then update it directly. > > For this, as you suggested, I am not changing the approach; I have updated the comments instead. > > > > 3. unifying uur_xid and uur_xidepoch into uur_fxid. > Still open. > > I have also added the README. > > Patches can be applied on top of undo branch [1] commit: > (cb777466d008e656f03771cf16ec7ef9d6f2778b) > > [1] https://github.com/EnterpriseDB/zheap/tree/undo > I have removed some of the globals and also improved some comments. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sun, May 12, 2019 at 2:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have removed some of the globals and also improved some comments. I don't like the discard_lock very much. Perhaps it's OK, but I hope that there are better alternatives. One problem, which Thomas Munro pointed out to me in off-list discussion, is that the discard_lock has to be held by anyone reading undo even if the undo they are reading and the undo that the discard worker wants to discard are in completely different parts of the undo log. Somebody could be trying to read an undo page written 1 second ago while the discard worker is trying to discard an undo page written to the same undo log 1 hour ago. Those things need not block each other, but with this design they will. Another problem is that we end up holding it across an I/O; there's precedent for that, but it's not particularly good precedent. Let's see if we can do better. My first idea was that we should just make this the caller's problem instead of handling it in this layer. Undo is retained for committed transactions until they are all-visible, and the reason for that is that we presume that nobody can be interested in the data for MVCC purposes unless there's a snapshot that can't see the results of the transaction in question. Once the committed transaction is all-visible, that's nobody, so it should be fine to just discard the undo any time we like. That won't work with the existing zheap code, which currently sometimes follows undo chains for transactions that are all-visible, but I think that's a problem we should fix rather than something we should force the undo layer to support. We'd still need something kinda like the discard_lock for aborted transactions, though, because as soon as you release the buffer lock on a table page, the undo workers could apply all the undo to that page and then discard it, and then you could afterwards try to look up the undo pointer which you had retrieved from that page and stored in backend-local memory. One thing we could probably do is make that a heavyweight lock on the XID itself, so if you observe that an XID is aborted, you have to go get this lock in ShareLock mode, then recheck the page, and only then consult the undo; discarding the undo for an aborted transaction would require AccessExclusiveLock on the XID. This solution gets rid of the LWLock for committed undo; for aborted undo, it avoids the false sharing and non-interruptibility that an LWLock imposes. But then I had what I think may be a better idea. Let's add a new ReadBufferMode that suppresses the actual I/O; if the buffer is not already present in shared_buffers, it allocates a buffer but returns it without doing any I/O, so the caller must be prepared for BM_VALID to be unset. I don't know what to call this, so I'll call it RBM_ALLOCATE (leaving room for possible future variants like RBM_ALLOCATE_AND_LOCK). Then, the protocol for reading an undo buffer would go like this: 1. Read the buffer with RBM_ALLOCATE, thus acquiring a pin on the relevant buffer. 2. Check whether the buffer precedes the discard horizon for that undo log stored in shared memory. 3. If so, use the ForgetBuffer() code we have in the zheap branch to deallocate the buffer and stop here. The undo is not available to be read, whether it's still physically present or not. 4. Otherwise, if the buffer is not valid, call ReadBufferExtended again, or some new function, to make it so. Remember to release all of our pins.
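In code form, the proposed reader protocol might look something like the sketch below. RBM_ALLOCATE and ForgetBuffer() are the proposals from this mail, not existing APIs, and UndoRecPtrIsDiscarded() and BufferIsValidated() are assumed helper names.

Buffer
ReadUndoBufferIfNotDiscarded(Relation rel, BlockNumber blkno, UndoRecPtr urp)
{
    /* 1. Allocate and pin a buffer without performing any I/O. */
    Buffer buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                    RBM_ALLOCATE, NULL);

    /* 2 & 3. If the undo precedes the discard horizon kept in shared
     * memory, deallocate the buffer and give up: the undo is not
     * available, whether it is still physically present or not. */
    if (UndoRecPtrIsDiscarded(urp))
    {
        ForgetBuffer(buf);
        return InvalidBuffer;
    }

    /* 4. If we were handed an invalid buffer (BM_VALID not set), now
     * perform the actual read. */
    if (!BufferIsValidated(buf))
        buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                 RBM_NORMAL, NULL);

    return buf;
}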
The protocol for discarding an undo buffer would go like this: 1. Advance the discard horizon in shared memory. 2. Take a cleanup lock on each buffer that ought to be discarded. Remember the dirty ones and forget the others. 3. WAL-log the discard operation. 4. Revisit the dirty buffers we remembered in step 2 and forget them. (A code sketch of this protocol appears after this message.) The idea is that, once we've advanced the discard horizon in shared memory, any readers that come along later are responsible for making sure that they never do I/O on any older undo. They may create some invalid buffers in shared memory, but they'll hopefully also get rid of them if they do, and if they error out for some reason before doing so, that buffer should age out naturally. So, the discard worker just needs to worry about buffers that already exist. Once it's taken a cleanup lock on each buffer, it knows that there are no I/O operations and in fact no buffer usage of any kind still in progress from before it moved the in-memory discard horizon. Anyone new that comes along will clean up after themselves. We postpone forgetting dirty buffers until after we've successfully WAL-logged the discard, in case we fail to do so. With this design, we don't add any new cases where a lock of any kind must be held across an I/O, and there's also no false sharing. Furthermore, unlike the previous proposal, this will work nicely with something like old_snapshot_threshold. The previous design relies on undo not getting discarded while anyone still cares about it, but old_snapshot_threshold, if applied to zheap, would have the express goal of discarding undo while somebody still cares about it. With this design, we could support old_snapshot_threshold by having undo readers error out in step #2 if the transaction is committed and not visible to our snapshot and yet the undo is discarded. Heck, we can do that anyway as a safety check, basically for free, and just tailor the error message depending on whether old_snapshot_threshold is such that the condition is expected to be possible. While I'm kvetching, I can't help noticing that undoinsert.c contains functions both for inserting undo and also for reading it, which seems like a loose end that needs to be tied up somehow. I'm mildly inclined to think that we should rename the file to something more generic (e.g. undoaccess.c) rather than splitting it into two files (e.g. undoinsert.c and undoread.c). Also, it looks to me like you need to go through what is currently undoinsert.h and look for stuff that can be made private to the .c file. I don't see why things like MAX_PREPARED_UNDO need to be exposed at all, and for things like PreparedUndoSpace it seems like it would suffice to just do 'struct PreparedUndoSpace; typedef struct PreparedUndoSpace PreparedUndoSpace;' in the header and put the actual 'struct PreparedUndoSpace { ... };' definition in the .c file. And UnlockReleaseUndoBuffers has a declaration but no longer has a definition, so I think that can go away too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
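And the discard side of the proposal, in the same sketch style. Every helper name here (UndoLogSetDiscardHorizon, next_buffer_below_horizon, BufferIsDirty, LogUndoLogDiscard, MAX_UNDO_DISCARD_BUFFERS) is an assumption for illustration, and locking details are elided.

void
DiscardUndoBuffers(UndoLogSlot *slot, UndoRecPtr new_horizon)
{
    Buffer  dirty[MAX_UNDO_DISCARD_BUFFERS];
    int     ndirty = 0;
    Buffer  buf;

    /* 1. Advance the discard horizon in shared memory first, so that
     * readers arriving after this point clean up after themselves. */
    UndoLogSetDiscardHorizon(slot, new_horizon);

    /* 2. Cleanup-lock each existing buffer below the horizon to wait
     * out in-progress users; forget the clean ones now, remember the
     * dirty ones for later. */
    while ((buf = next_buffer_below_horizon(slot, new_horizon)) != InvalidBuffer)
    {
        LockBufferForCleanup(buf);
        if (BufferIsDirty(buf))
            dirty[ndirty++] = buf;
        else
            ForgetBuffer(buf);
    }

    /* 3. WAL-log the discard operation. */
    LogUndoLogDiscard(slot, new_horizon);

    /* 4. Only now forget the dirty buffers, in case step 3 failed. */
    for (int i = 0; i < ndirty; i++)
        ForgetBuffer(dirty[i]);
}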
On Mon, May 13, 2019 at 11:36 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Sun, May 12, 2019 at 2:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have removed some of the globals and also improved some comments. > > I don't like the discard_lock very much. Perhaps it's OK, but I hope > that there are better alternatives. One problem with Thomas Munro > pointed out to me in off-list discussion is that the discard_lock has > to be held by anyone reading undo even if the undo they are reading > and the undo that the discard worker wants to discard are in > completely different parts of the undo log. Somebody could be trying > to read an undo page written 1 second ago while the discard worker is > trying to discard an undo page written to the same undo log 1 hour > ago. Those things need not block each other, but with this design > they will. > Yeah, this doesn't appear to be a good way to deal with the problem. > Another problem is that we end up holding it across an > I/O; there's precedent for that, but it's not particularly good > precedent. Let's see if we can do better. > > But then I had what I think may be a better idea. > +1. I also think the below idea is better than the previous one. > Let's add a new > ReadBufferMode that suppresses the actual I/O; if the buffer is not > already present in shared_buffers, it allocates a buffer but returns > it without doing any I/O, so the caller must be prepared for BM_VALID > to be unset. I don't know what to call this, so I'll call it > RBM_ALLOCATE (leaving room for possible future variants like > RBM_ALLOCATE_AND_LOCK). Then, the protocol for reading an undo buffer > would go like this: > > 1. Read the buffer with RBM_ALLOCATE, thus acquiring a pin on the > relevant buffer. > 2. Check whether the buffer precedes the discard horizon for that undo > log stored in shared memory. > 3. If so, use the ForgetBuffer() code we have in the zheap branch to > deallocate the buffer and stop here. The undo is not available to be > read, whether it's still physically present or not. > 4. Otherwise, if the buffer is not valid, call ReadBufferExtended > again, or some new function, to make it so. Remember to release all > of our pins. > > The protocol for discarding an undo buffer would go like this: > > 1. Advance the discard horizon in shared memory. > 2. Take a cleanup lock on each buffer that ought to be discarded. > Remember the dirty ones and forget the others. > 3. WAL-log the discard operation. > 4. Revisit the dirty buffers we remembered in step 2 and forget them. > > The idea is that, once we've advanced the discard horizon in shared > memory, any readers that come along later are responsible for making > sure that they never do I/O on any older undo. They may create some > invalid buffers in shared memory, but they'll hopefully also get rid > of them if they do, and if they error out for some reason before doing > so, that buffer should age out naturally. So, the discard worker just > needs to worry about buffers that already exist. Once it's taken a > cleanup lock on each buffer, it knows that there are no I/O operations > and in fact no buffer usage of any kind still in progress from before > it moved the in-memory discard horizon. Anyone new that comes along > will clean up after themselves. We postpone forgetting dirty buffers > until after we've successfully WAL-logged the discard, in case we fail > to do so. > I have spent some time thinking over this and couldn't see any problem with this. 
So, +1 for trying this out along the lines of what you have described above. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, May 13, 2019 at 11:36 PM Robert Haas <robertmhaas@gmail.com> wrote: > > While I'm kvetching, I can't help noticing that undoinsert.c contains > functions both for inserting undo and also for reading it, which seems > like a loose end that needs to be tied up somehow. I'm mildly > inclined to think that we should rename the file to something more > generic (e.g. undoaccess.c) rather than splitting it into two files > (e.g. undoinsert.c and undoread.c). Changed to undoaccess. > Also, it looks to me like you > need to go through what is currently undoinsert.h and look for stuff > that can be made private to the .c file. I don't see why things like > MAX_PREPARED_UNDO need to be exposed at all, Ideally, my previous patch should have got rid of MAX_PREPARED_UNDO, as we are now always allocating memory for the prepared space, but by mistake I left it in this file. Now, I have removed it. > and for things like > PreparedUndoSpace it seems like it would suffice to just do 'struct > PreparedUndoSpace; typedef struct PreparedUndoSpace > PreparedUndoSpace;' in the header and put the actual 'struct > PreparedUndoSpace { ... };' definition in the .c file. Changed; 'typedef struct PreparedUndoSpace PreparedUndoSpace;' in the header with the 'struct PreparedUndoSpace { ... };' definition in the .c file seems fine. > And > UnlockReleaseUndoBuffers has a declaration but no longer has a > definition, so I think that can go away too. Removed, and also cleaned up some other such declarations. Pending items to be worked upon: a) Get rid of UndoRecInfo b) Get rid of xid in generic undo code and unify epoch and xid to fxid c) Get rid of discard lock d) Move log switch related information from transaction header to new log switch header -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi, On 2019-05-05 10:28:21 +0530, Amit Kapila wrote: > From 5d9e179bd481b5ed574b6e7117bf3eb62b5dc003 Mon Sep 17 00:00:00 2001 > From: Amit Kapila <amit.kapila@enterprisedb.com> > Date: Sat, 4 May 2019 16:52:01 +0530 > Subject: [PATCH] Allow undo actions to be applied on rollbacks and discard > unwanted undo. I think this needs to be split into some constituent parts, to be reviewable. Discussing 270kb of patch at once is just too much. My first guess for a viable split would be: 1) undoaction related infrastructure 2) xact.c integration et al 3) binaryheap changes etc 4) undo worker infrastructure It probably should be split even further, by moving things like: - oldestXidHavingUndo infrastructure - discard infrastructure Some small remarks: > > + { > + {"disable_undo_launcher", PGC_POSTMASTER, DEVELOPER_OPTIONS, > + gettext_noop("Decides whether to launch an undo worker."), > + NULL, > + GUC_NOT_IN_SAMPLE > + }, > + &disable_undo_launcher, > + false, > + NULL, NULL, NULL > + }, > + We don't normally formulate GUCs in the negative like that. C.F. autovacuum etc. > +/* Extract xid from a value comprised of epoch and xid */ > +#define GetXidFromEpochXid(epochxid) \ > + ((uint32) (epochxid) & 0XFFFFFFFF) > + > +/* Extract epoch from a value comprised of epoch and xid */ > +#define GetEpochFromEpochXid(epochxid) \ > + ((uint32) ((epochxid) >> 32)) > + Why do these exist? This should all go through FullTransactionId. > /* End-of-list marker */ > { > {NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL > @@ -2923,6 +2935,16 @@ static struct config_int ConfigureNamesInt[] = > 5000, 1, INT_MAX, > NULL, NULL, NULL > }, > + { > + {"rollback_overflow_size", PGC_USERSET, RESOURCES_MEM, > + gettext_noop("Rollbacks greater than this size are done lazily"), > + NULL, > + GUC_UNIT_MB > + }, > + &rollback_overflow_size, > + 64, 0, MAX_KILOBYTES, > + NULL, NULL, NULL > + }, rollback_foreground_size? rollback_background_size? I don't think overflow is particularly clear. > @@ -1612,6 +1635,85 @@ FinishPreparedTransaction(const char *gid, bool isCommit) > > MyLockedGxact = NULL; > > + /* > + * Perform undo actions, if there are undologs for this transaction. We > + * need to perform undo actions while we are still in transaction. Never > + * push rollbacks of temp tables to undo worker. > + */ > + for (i = 0; i < UndoPersistenceLevels; i++) > + { This should be in a separate function. And it'd be good if more code between this and ApplyUndoActions() would be shared. > + /* > + * Here, we just detect whether there are any pending undo actions so that > + * we can skip releasing the locks during abort transaction. We don't > + * release the locks till we execute undo actions otherwise, there is a > + * risk of deadlock. > + */ > + SetUndoActionsInfo(); This function name is so generic that it gives the reader very little information about why it's called here (and in other similar places). Greetings, Andres Freund
On Tue, May 21, 2019 at 1:18 PM Andres Freund <andres@anarazel.de> wrote: > I think this needs to be split into some constituent parts, to be > reviewable. Discussing 270kb of patch at once is just too much. +1. > > + { > > + {"rollback_overflow_size", PGC_USERSET, RESOURCES_MEM, > > + gettext_noop("Rollbacks greater than this size are done lazily"), > > + NULL, > > + GUC_UNIT_MB > > + }, > > + &rollback_overflow_size, > > + 64, 0, MAX_KILOBYTES, > > + NULL, NULL, NULL > > + }, > > rollback_foreground_size? rollback_background_size? I don't think > overflow is particularly clear. The problem with calling it 'rollback' is that a rollback is a general PostgreSQL term that gives no hint the proposed undo facility is involved. I'm not exactly sure what to propose but I think it's got to have the word 'undo' in there someplace (or some new term we invent that is only used in connection with undo). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, May 21, 2019 at 10:47 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-05-05 10:28:21 +0530, Amit Kapila wrote: > > From 5d9e179bd481b5ed574b6e7117bf3eb62b5dc003 Mon Sep 17 00:00:00 2001 > > From: Amit Kapila <amit.kapila@enterprisedb.com> > > Date: Sat, 4 May 2019 16:52:01 +0530 > > Subject: [PATCH] Allow undo actions to be applied on rollbacks and discard > > unwanted undo. > > I think this needs to be split into some constituent parts, to be > reviewable. Okay. > Discussing 270kb of patch at once is just too much. My first > guess for a viable split would be: > > 1) undoaction related infrastructure > 2) xact.c integration et al > 3) binaryheap changes etc > 4) undo worker infrastructure > > It probably should be split even further, by moving things like: > - oldestXidHavingUndo infrastructure > - discard infrastructure > Okay, I will think about this and split the patch. > Some small remarks: > > > > > + { > > + {"disable_undo_launcher", PGC_POSTMASTER, DEVELOPER_OPTIONS, > > + gettext_noop("Decides whether to launch an undo worker."), > > + NULL, > > + GUC_NOT_IN_SAMPLE > > + }, > > + &disable_undo_launcher, > > + false, > > + NULL, NULL, NULL > > + }, > > + > > We don't normally formulate GUCs in the negative like that. C.F. > autovacuum etc. > Okay, will change. Actually, this is just for development purposes. It can help us in testing cases where we have pushed the undo, but it won't be applied, so whenever the foreground process encounters such a transaction, it will perform the page-wise undo. I am not 100% sure if we need this for the final version. Similarly, for testing purposes, we might need enable_discard_worker to test the cases where discard doesn't happen for a long time. > > > +/* Extract xid from a value comprised of epoch and xid */ > > +#define GetXidFromEpochXid(epochxid) \ > > + ((uint32) (epochxid) & 0XFFFFFFFF) > > + > > +/* Extract epoch from a value comprised of epoch and xid */ > > +#define GetEpochFromEpochXid(epochxid) \ > > + ((uint32) ((epochxid) >> 32)) > > + > > Why do these exist? > We don't need the second one (GetEpochFromEpochXid), but the first one is required. Basically, the oldestXidHavingUndo computation does consider oldestXmin (which is still a TransactionId) as we can't retain undo which is 2^31 transactions old due to other limitations, like clog/snapshots still having a limit of 4-byte transaction ids. Slightly unrelated, but we do want to improve the undo retention in a subsequent version such that we won't allow pending undo for transactions whose age is more than 2^31. > > > /* End-of-list marker */ > > { > > {NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL > > @@ -2923,6 +2935,16 @@ static struct config_int ConfigureNamesInt[] = > > 5000, 1, INT_MAX, > > NULL, NULL, NULL > > }, > > + { > > + {"rollback_overflow_size", PGC_USERSET, RESOURCES_MEM, > > + gettext_noop("Rollbacks greater than this size are done lazily"), > > + NULL, > > + GUC_UNIT_MB > > + }, > > + &rollback_overflow_size, > > + 64, 0, MAX_KILOBYTES, > > + NULL, NULL, NULL > > + }, > > rollback_foreground_size? rollback_background_size? I don't think > overflow is particularly clear. > How about rollback_undo_size or abort_undo_size or undo_foreground_size or pending_undo_size? > > > @@ -1612,6 +1635,85 @@ FinishPreparedTransaction(const char *gid, bool isCommit) > > > > MyLockedGxact = NULL; > > > > + /* > > + * Perform undo actions, if there are undologs for this transaction. We > > + * need to perform undo actions while we are still in transaction.
Never > > + * push rollbacks of temp tables to undo worker. > > + */ > > + for (i = 0; i < UndoPersistenceLevels; i++) > > + { > > This should be in a separate function. And it'd be good if more code > between this and ApplyUndoActions() would be shared. > makes sense, will try. > > > + /* > > + * Here, we just detect whether there are any pending undo actions so that > > + * we can skip releasing the locks during abort transaction. We don't > > + * release the locks till we execute undo actions otherwise, there is a > > + * risk of deadlock. > > + */ > > + SetUndoActionsInfo(); > > This function name is so generic that it gives the reader very little > information about why it's called here (and in other similar places). > NeedToPerformUndoActions()? UndoActionsRequired()? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, May 22, 2019 at 7:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > +/* Extract xid from a value comprised of epoch and xid */ > > > +#define GetXidFromEpochXid(epochxid) \ > > > + ((uint32) (epochxid) & 0XFFFFFFFF) > > > + > > > +/* Extract epoch from a value comprised of epoch and xid */ > > > +#define GetEpochFromEpochXid(epochxid) \ > > > + ((uint32) ((epochxid) >> 32)) > > > + > > > > Why do these exist? > > > > We don't need the second one (GetEpochFromEpochXid), but the first one > is required. Basically, the oldestXidHavingUndo computation does > consider oldestXmin (which is still a TransactionId) as we can't > retain undo which is 2^31 transactions old due to other limitations > like clog/snapshots still has a limit of 4-byte transaction ids. > Slightly unrelated, but we do want to improve the undo retention in a > subsequent version such that we won't allow pending undo for > transaction whose age is more than 2^31. The point is that we now have EpochFromFullTransactionId and XidFromFullTransactionId. You shouldn't be inventing your own version of that infrastructure. Use FullTransactionId, not a uint64, and then use the functions for dealing with full transaction IDs from transam.h. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
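For reference, the helpers Robert mentions already live in access/transam.h, so both directions are covered without any new macros; something like the following (the epoch and xid variables are placeholders for whatever values are at hand):

#include "access/transam.h"

/* Build a 64-bit xid from an (epoch, xid) pair... */
FullTransactionId fxid = FullTransactionIdFromEpochAndXid(epoch, xid);

/* ...and take it apart again, instead of hand-rolled shift/mask macros. */
uint32        xid_epoch = EpochFromFullTransactionId(fxid);
TransactionId plain_xid = XidFromFullTransactionId(fxid);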
On Wed, May 22, 2019 at 5:47 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, May 22, 2019 at 7:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > +/* Extract xid from a value comprised of epoch and xid */ > > > > +#define GetXidFromEpochXid(epochxid) \ > > > > + ((uint32) (epochxid) & 0XFFFFFFFF) > > > > + > > > > +/* Extract epoch from a value comprised of epoch and xid */ > > > > +#define GetEpochFromEpochXid(epochxid) \ > > > > + ((uint32) ((epochxid) >> 32)) > > > > + > > > > > > Why do these exist? > > > > > > > We don't need the second one (GetEpochFromEpochXid), but the first one > > is required. Basically, the oldestXidHavingUndo computation does > > consider oldestXmin (which is still a TransactionId) as we can't > > retain undo which is 2^31 transactions old due to other limitations > > like clog/snapshots still has a limit of 4-byte transaction ids. > > Slightly unrelated, but we do want to improve the undo retention in a > > subsequent version such that we won't allow pending undo for > > transaction whose age is more than 2^31. > > The point is that we now have EpochFromFullTransactionId and > XidFromFullTransactionId. You shouldn't be inventing your own version > of that infrastructure. Use FullTransactionId, not a uint64, and then > use the functions for dealing with full transaction IDs from > transam.h. > Okay, I misunderstood the comment. I'll change accordingly. Thanks for pointing out. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, May 22, 2019 at 4:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, May 21, 2019 at 10:47 PM Andres Freund <andres@anarazel.de> wrote: > > > Some small remarks: > > > > > > > > + { > > > + {"disable_undo_launcher", PGC_POSTMASTER, DEVELOPER_OPTIONS, > > > + gettext_noop("Decides whether to launch an undo worker."), > > > + NULL, > > > + GUC_NOT_IN_SAMPLE > > > + }, > > > + &disable_undo_launcher, > > > + false, > > > + NULL, NULL, NULL > > > + }, > > > + > > > > We don't normally formulate GUCs in the negative like that. C.F. > > autovacuum etc. > > > > Okay, will change. Actually, this is just for development purposes. > It can help us in testing cases where we have pushed the undo, but it > won't be applied, so whenever the foreground process encounters such a > transaction, it will perform the page-wise undo. I am not 100% sure > if we need this for the final version. Similarly, for testing > purposes, we might need enable_discard_worker to test the cases where > discard doesn't happen for a long time. > Changed. > > > > > +/* Extract xid from a value comprised of epoch and xid */ > > > +#define GetXidFromEpochXid(epochxid) \ > > > + ((uint32) (epochxid) & 0XFFFFFFFF) > > > + > > > +/* Extract epoch from a value comprised of epoch and xid */ > > > +#define GetEpochFromEpochXid(epochxid) \ > > > + ((uint32) ((epochxid) >> 32)) > > > + > > > > Why do these exist? > > > > We don't need the second one (GetEpochFromEpochXid), but the first one > > is required. Basically, the oldestXidHavingUndo computation does > > consider oldestXmin (which is still a TransactionId) as we can't > > retain undo which is 2^31 transactions old due to other limitations, > > like clog/snapshots still having a limit of 4-byte transaction ids. > > Slightly unrelated, but we do want to improve the undo retention in a > > subsequent version such that we won't allow pending undo for > > transactions whose age is more than 2^31. > > Removed both the above defines. > > > > > > /* End-of-list marker */ > > > { > > > {NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL > > > @@ -2923,6 +2935,16 @@ static struct config_int ConfigureNamesInt[] = > > > 5000, 1, INT_MAX, > > > NULL, NULL, NULL > > > }, > > > + { > > > + {"rollback_overflow_size", PGC_USERSET, RESOURCES_MEM, > > > + gettext_noop("Rollbacks greater than this size are done lazily"), > > > + NULL, > > > + GUC_UNIT_MB > > > + }, > > > + &rollback_overflow_size, > > > + 64, 0, MAX_KILOBYTES, > > > + NULL, NULL, NULL > > > + }, > > > > rollback_foreground_size? rollback_background_size? I don't think > > overflow is particularly clear. > > > > How about rollback_undo_size or abort_undo_size or > > undo_foreground_size or pending_undo_size? > I think we need some more discussion on this before we change it, as Robert seems to feel that we should have 'undo' someplace in the name. Please let me know your preference. > > > > > > @@ -1612,6 +1635,85 @@ FinishPreparedTransaction(const char *gid, bool isCommit) > > > > > > MyLockedGxact = NULL; > > > > > > + /* > > > + * Perform undo actions, if there are undologs for this transaction. We > > > + * need to perform undo actions while we are still in transaction. Never > > > + * push rollbacks of temp tables to undo worker. > > > + */ > > > + for (i = 0; i < UndoPersistenceLevels; i++) > > > + { > > > > This should be in a separate function. And it'd be good if more code > > between this and ApplyUndoActions() would be shared. > > > > makes sense, will try. > Done.
Now, there is a common function that is used in twophase.c and ApplyUndoActions. > > > > > + /* > > > + * Here, we just detect whether there are any pending undo actions so that > > > + * we can skip releasing the locks during abort transaction. We don't > > > + * release the locks till we execute undo actions otherwise, there is a > > > + * risk of deadlock. > > > + */ > > > + SetUndoActionsInfo(); > > > > This function name is so generic that it gives the reader very little > > information about why it's called here (and in other similar places). > > > > NeedToPerformUndoActions()? UndoActionsRequired()? > Changed to UndoActionsRequired and added comments atop the function to make it clear why and when this function needs to be used. Apart from fixing the above comments, the patch is rebased on the latest undo patchset. As of now, I have split the binaryheap.c changes into a separate patch. We are still enhancing the patch to compute oldestXidHavingUnappliedUndo, which touches various parts of the patch, so splitting further without completing that can make it a bit difficult to work on that. Pending work ------------------- 1. Enhance uur_progress so that it updates undo action apply progress at regular intervals. 2. Enhance to support oldestXidHavingUnappliedUndo, more on that later. 3. Split the patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
My understanding is that the smgr pendingDeletes infrastructure will be replaced by these patches. I still see CommitTransaction() calling smgrDoPendingDeletes() in the latest patch set. Am I missing something?
Asim
On Mon, Jun 10, 2019 at 5:35 AM Asim R P <apraveen@pivotal.io> wrote: > My understanding is that the smgr pendingDeletes infrastructure will be replaced by these patches. I still see CommitTransaction() calling smgrDoPendingDeletes() in the latest patch set. Am I missing something? Hi Asim, Thanks for looking at the patch. The pendingDeletes list is used both for files that should be deleted if we commit and files that should be deleted if we abort. This patch deals only with the abort case, using the undo log instead of pendingDeletes. That is the file leak scenario that has an arbitrarily wide window controlled by the user and is probably the source of almost all cases that you hear of disks filling up with orphaned junk AFAICS. There could in theory be a persistent stuff-to-do-if-we-commit system exactly unlike undo logs (records to be discarded on abort, executed on commit). I haven't thought much about how it'd work, but Andres did suggest something like that for another purpose just the other day, and although it's hard to think of a name for it, it doesn't seem crazy as long as it doesn't add overheads when you're not using it. Without such a mechanism, you can probably leak files belonging to tables that you have dropped in a committed transaction, if you die in CommitTransaction() after it has called RecordTransactionCommit() but before it reaches smgrDoPendingDeletes(), and even then probably only if there is a super well-timed checkpoint so that you recover without replaying the drop. I'm not trying to tackle that today. BTW, there is yet another kind of deferred unlinking going on. In SyncPostCheckpoint() (formerly known as mdpostckpt()) we defer the last bit of the job until after the next checkpoint. At that point we only expect the first segment to exist and we expect it to be empty. That's a mechanism introduced by commit 6cc4451b5c47 to make sure that we don't reuse relfilenode numbers too soon in some crash scenarios. That means there is another very narrow window there to leak a file (though these ones are empty): you die after the checkpoint is logged but before SyncPostCheckpoint() is run, or even after that but before the operating system has flushed the directory. -- Thomas Munro https://enterprisedb.com
On Mon, May 27, 2019 at 5:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > Apart from fixing the above comments, the patch is rebased on the latest > undo patchset. As of now, I have split the binaryheap.c changes into > a separate patch. We are still enhancing the patch to compute > oldestXidHavingUnappliedUndo which touches various parts of the patch, so > splitting further without completing that can make it a bit difficult > to work on that. Some review comments around execute_undo_actions: The 'nopartial' argument to execute_undo_actions is confusing. First, it would probably be worth spelling it out rather than abbreviating: not_partial_transaction rather than nopartial. Second, it is usually better to phrase parameter names in terms of what they are rather than in terms of what they are not: complete_transaction rather than not_partial_transaction. Third, it's unclear from these comments why we'd be undoing something other than a complete transaction. It looks as though the answer is that this flag will be false when we're undoing a subxact -- in which case, why not invert the sense of the flag and call it 'bool subxact'? I might be wrong, but it seems like that would be a whole lot clearer. Fourth, the block at the top of this function, guarded by nopartial, seems like it must be vulnerable to race conditions. If we're undoing the complete transaction, then it checks whether UndoFetchRecord() still returns anything. But that could change not just at the beginning of the function, but also any time in the middle, or so it seems to me. I doubt that this is the right level at which to deal with this sort of interlocking. I think there should be some higher-level mechanism that prevents two processes from trying to undo the same transaction at the same time, like a heavyweight lock or some kind of flag in the shared memory data structure that keeps track of pending undo, so that we never even reach this code unless we know that this XID needs undo work and no other process is already doing it. If you're the only one undoing XID 123456, then there shouldn't be any chance of the undo disappearing from underneath you. And we definitely want to guarantee that only one process is undoing any given XID at a time. The 'blk_chain_complete' variable which is set in this function and passed down to execute_undo_actions_page() and then to the rmgr's rm_undo callback also looks problematic. First, not every AM that uses undo may even have the notion of a 'block chain'; zedstore for example uses TIDs as a 48-bit integer, not a block + offset number, so it's really not going to have a 'block chain.' Second, even in zheap's usage, it seems to me that the block chain could be complete even when this gets set to false. It gets set to true when we're undoing a toplevel transaction (not a subxact) and we were able to fetch all of the undo for that toplevel transaction. But even if that's not true, the chain for an individual block could still be complete, because all the remaining undo for the block at issue might've been in the chunk of undo we already read; the remaining undo could be for other blocks. For that reason, I can't see how the zheap code that relies on this value can be correct; it uses this value to decide whether to stick zeroes in the transaction slot, but if the scenario described above happened, then I suppose the XID would not get cleared from the slot during undo.
Maybe zheap is just relying on that being harmless, since if all of the undo actions have been correctly executed for the page, the fact that the transaction slot is still bogusly used by an aborted xact won't matter; nothing will refer to it. However, it seems to me that it would be better for zheap to set things up so that the first undo record for a particular txn/page combination is flagged in some way (in the payload!) so that undo can zero the slot if the action being undone is the one that claimed the slot. That seems cleaner on principle, and it also avoids having supposedly AM-independent code pass down details that are driven by zheap's particular needs. While it's probably moot since I think this code should go away anyway, I find it poor style to write something like: + if (nopartial && !UndoRecPtrIsValid(urec_ptr)) + blk_chain_complete = true; + else + blk_chain_complete = false; "if (x) y = true; else y = false;" can be more compactly written as "y = x;", like this: blk_chain_complete = nopartial && !UndoRecPtrIsValid(urec_ptr); I think that the signature for rm_undo can be simplified considerably. I think blk_chain_complete should go away for the reasons discussed above. Also, based on our conversations with Heikki at PGCon, we decided that we should not presume that the AM wants the records grouped by block, so the blkno argument should go away. In addition, I don't see much reason to have a first_idx argument. Instead of passing a pointer to the caller's entire array and telling the callback where to start looking, couldn't we just pass a pointer to the first record the callback should examine, i.e. instead of passing urp_array, pass urp_array + first_idx. Then instead of having a last_idx argument, have an argument for the number of entries in the array, computed as last_idx - first_idx + 1. With those changes, rm_undo would look like this: bool (*rm_undo) (UndoRecInfo *urp_array, int count, Oid reloid, FullTransactionId full_xid); Now for the $10m question: why even pass reloid and full_xid? Aren't those values going to be present inside every UnpackedUndoRecord? Why not just let the callback get them from the first record (or however it wants to do things)? Perhaps there is some documentation value here in that it implies that the value will be the same for every record, but we could also handle that by just documenting in the appropriate places that undo is done by transaction and relation and therefore the callback is entitled to assume that the same value will be present in every record. Then again, I am not sure we really want the callback to assume that reloid doesn't change. I don't see a reason offhand not to just pass as many records as we have for a given transaction and let the callback do what it likes. So maybe that's another reason to get rid of the reloid argument, at least. And then we could document that all the records will have the same full_xid (unless we decide that we don't want to guarantee that either). Additionally, it strikes me that urp_array is not the greatest name. Generally, putting _array into the name of the variable to indicate that it's an array doesn't seem all that great from a coding-style perspective. I mean, sometimes it's the best you can do, but it's not amazing. And urp seems like it's using an abbreviation without any real reason.
For contrast, consider this existing precedent: extern SysScanDesc systable_beginscan_ordered(Relation heapRelation, Relation indexRelation, Snapshot snapshot, int nkeys, ScanKey key); Or this one: extern TupleDesc CreateTupleDesc(int natts, Form_pg_attribute *attrs); Notice that in each case the array parameter (which is the last one) is named based on what data it contains rather than on the fact that it is an array. Finally, I observe that rm_undo returns a Boolean, but it's not used for anything. The only call to rm_undo in the current patch set is in execute_undo_actions_page, which returns that value to the caller, but the callers just discard it. I suppose maybe this was intended to report success or failure, but I think the way that rm_undo will report failure is to ERROR. Or, if we want to allow a fail-soft behavior for some reason, then the callers all need to check the value. I'm not sure whether there's a use case for that or not. Putting all that together, I suggest a signature like this: void (*rm_undo) (int nrecords, UndoRecInfo *records); Or if we decide we need to have a fail-soft behavior, then like this: bool (*rm_undo) (int nrecords, UndoRecInfo *records); -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
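To make the proposed shape concrete, an AM's callback might end up looking like the sketch below; zheap_undo here and the UndoRecInfo field name are illustrative assumptions, not code from the patch.

static bool
zheap_undo(int nrecords, UndoRecInfo *records)
{
    for (int i = 0; i < nrecords; i++)
    {
        UnpackedUndoRecord *uur = records[i].uur;   /* field name assumed */

        /* Every record belongs to the same transaction; the callback
         * reads reloid/xid from the record itself rather than taking
         * them as separate arguments. */
        /* ... apply this record's undo action here ... */
    }

    return true;    /* all actions applied (or legitimately skipped) */
}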
On Mon, Jun 10, 2019 at 3:00 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Mon, Jun 10, 2019 at 5:35 AM Asim R P <apraveen@pivotal.io> wrote: > > My understanding is smgr pendingDeletes infrastructure will be replaced by these patches. I still see CommitTransaction()calling smgrDoPendingDeletes() in the latest patch set. Am I missing something? > Thanks for looking at the patch. Hello, Here is a new rebased version of the full patch set for orphaned file cleanup. The orphaned file cleanup code itself hasn't changed but there are some changes in lower down patches: * getting rid of more global variables, instead using eg CurrentSession->attached_undo_logs (the session.h infrastructure that is intended to avoid creating more multithreading-hostile code) * using undo log "slots" in various APIs to make it clearer that slots can be recycled, which has locking implications, plus several locking bug fixes that motivated that change * current versions of the record and worker code discussed upthread by Amit and others The code is also at https://github.com/EnterpriseDB/zheap/tree/undo and includes patches from https://github.com/EnterpriseDB/zheap/tree/undoprocessing and https://github.com/EnterpriseDB/zheap/tree/undo_interface_v1 where some parts of this stack (workers etc) are being developed. -- Thomas Munro https://enterprisedb.com
Attachment
On Fri, Jun 14, 2019 at 8:26 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > * current versions of the record and worker code discussed upthread by > Amit and others > Thanks for posting the complete patchset. Last time, I mentioned the remaining work in the undo-processing patchset, the status of which is as follows: 1. Enhance uur_progress so that it updates undo action apply progress at regular intervals. This has been done. The idea is that we update the transaction's undo apply progress at regular intervals so that after a crash we can skip already applied undo. The undo apply progress is updated in terms of the number of blocks processed. I think it is better to change the name of uur_progress to something like uur_apply_progress. Any suggestions? 2. Enhance to support oldestXidHavingUnappliedUndo, more on that later. This has been done. The idea here is that we register all the undo apply (transaction abort) requests in the hash table (referred to as the Rollback Hash Table in the patch) and we have a hard limit (after that we won't allow new transactions to write undo) on how many such requests can be pending. So scanning this table gives us the value of oldestXidHavingUnappliedUndo (actually the value for this will be the smallest of 'xid having pending undo' and 'oldestXmin'). As this rollback hash table is not persistent, after a restart, we need to take a pass over the undo logs to register all the pending abort requests in the rollback hash table. This value serves two main purposes: (a) any Xid below this is all-visible, so it can help in visibility checks; (b) it can help us implement the rule that "No aborted XID with an age >2^31 can have unapplied undo." This helps us decide when to truncate the clog, because we can't truncate the clog for transactions that still have undo. (A code sketch of this computation appears after the patch descriptions below.) 3. Split the patch. The patch is split into five patches. I will give a brief description of each patch, which to a good extent is mentioned in the commit message for each patch as well: 0010-Extend-binary-heap-functionality - This patch adds the routines to allocate a binary heap in shared memory and to remove the nth element from a binary heap. These routines will be used by a later patch that will allow an efficient way to process the pending rollback requests. 0011-Infrastructure-to-register-and-fetch-undo-action-req - This patch provides an infrastructure to register and fetch undo action requests. This infrastructure provides a way to allow execution of undo actions. One might think that we can always execute undo actions on error or explicit rollback by the user; however, there are cases when that is not possible. For example, (a) if the system crashes while doing the operation, then after startup, we need a way to perform undo actions; (b) if we get an error while performing undo actions. Apart from this, when there are large rollback requests, it is quite inefficient to perform all the undo actions and then return control to the user. 0012-Infrastructure-to-execute-pending-undo-actions - This provides an infrastructure to execute pending undo actions. To apply the undo actions, we collect the undo records in bulk and try to process them together. We ensure to update the transaction's progress at regular intervals so that after a crash we can skip already applied undo. This needs some more work to generalize the processing of undo records so that this infrastructure can be used by other AMs as well.
0013-Allow-foreground-transactions-to-perform-undo-action - This patch allows foreground transactions to perform undo actions on abort. We always perform rollback actions after cleaning up the current (sub)transaction. This will ensure that we perform the actions immediately after an error (and release the locks) rather than when the user issues a ROLLBACK command at some later point in time. We are releasing the locks after the undo actions are applied. The reason to delay lock release is that if we release locks before applying undo actions, then a parallel session could acquire the lock before us, which can lead to deadlock. 0014-Allow-execution-and-discard-of-undo-by-background-wo- - This patch allows execution and discard of undo by background workers. The undo launcher is responsible for launching the workers iff there is some work available in one of the work queues and there are more workers available. The worker is launched to handle requests for a particular database. The discard worker is responsible for discarding the undo log of transactions that are committed and all-visible or are rolled back. It also registers the requests for aborted transactions in the work queues. It iterates through all the active logs one-by-one and tries to discard the transactions that are old enough to matter. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
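As a sketch of the computation item 2 above describes: the value is the smaller of oldestXmin and the oldest xid with a pending rollback request. RollbackHashTable, RollbackHashEntry and the use of RollbackRequestLock here are assumptions based on that description, not the actual patch code.

FullTransactionId
OldestXidHavingUnappliedUndo(FullTransactionId oldest_xmin)
{
    FullTransactionId oldest = oldest_xmin;
    HASH_SEQ_STATUS status;
    RollbackHashEntry *ent;

    /* Scan the (shared, bounded) rollback hash table for the oldest
     * transaction that still has unapplied undo. */
    LWLockAcquire(RollbackRequestLock, LW_SHARED);
    hash_seq_init(&status, RollbackHashTable);
    while ((ent = hash_seq_search(&status)) != NULL)
    {
        if (FullTransactionIdPrecedes(ent->full_xid, oldest))
            oldest = ent->full_xid;
    }
    LWLockRelease(RollbackRequestLock);

    return oldest;
}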
On Thu, Jun 13, 2019 at 3:13 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, May 27, 2019 at 5:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Apart from fixing the above comments, the patch is rebased on the latest > > undo patchset. As of now, I have split the binaryheap.c changes into > > a separate patch. We are still enhancing the patch to compute > > oldestXidHavingUnappliedUndo which touches various parts of the patch, so > > splitting further without completing that can make it a bit difficult > > to work on that. > > Some review comments around execute_undo_actions: > > The 'nopartial' argument to execute_undo_actions is confusing. First, > it would probably be worth spelling it out rather than abbreviating: > not_partial_transaction rather than nopartial. Second, it is usually > better to phrase parameter names in terms of what they are rather than > in terms of what they are not: complete_transaction rather than > not_partial_transaction. Third, it's unclear from these comments why > we'd be undoing something other than a complete transaction. It looks > as though the answer is that this flag will be false when we're > undoing a subxact -- in which case, why not invert the sense of the > flag and call it 'bool subxact'? I might be wrong, but it seems like > that would be a whole lot clearer. > The idea was that it could be used for multiple purposes like 'rolling back complete xact', 'rolling back subxact', 'rollback at page-level' or any similar future need, even though not all code paths use that function. I am not wedded to any particular name here, but among your suggestions complete_transaction sounds better to me. Are you okay going with that? > Fourth, the block at the top of > this function, guarded by nopartial, seems like it must be vulnerable > to race conditions. If we're undoing the complete transaction, then > it checks whether UndoFetchRecord() still returns anything. But that > could change not just at the beginning of the function, but also any > time in the middle, or so it seems to me. > It won't change in between because we have ensured at the top level that no two processes can start executing pending undo at the same time. Basically, anyone who wants to execute the undo actions will have an entry in the rollback hash table, and that will be marked as in-progress. As mentioned in the comments, the race is only "after the discard worker fetches the record and finds that this transaction needs to be rolled back, a backend might concurrently execute the actions and remove the request from the rollback hash table." > I doubt that this is the > right level at which to deal with this sort of interlocking. I think > there should be some higher-level mechanism that prevents two > processes from trying to undo the same transaction at the same time, > like a heavyweight lock or some kind of flag in the shared memory data > structure that keeps track of pending undo, so that we never even > reach this code unless we know that this XID needs undo work > Introducing a heavyweight lock can create a different sort of problem because we need to hold it till all the actions are applied, to avoid what I have mentioned above. The problem will be that the discard worker will be blocked till the backend/undo worker applies the complete set of actions, unless we just take this lock conditionally in the discard worker.
Another way could be that we re-fetch the undo record when we are registering the undo request under RollbackRequestLock and check its status again, because in that case the backend or another undo worker won't be able to remove the request from the hash table concurrently. However, the advantage of checking it in execute_undo_actions is that we can optimize it in the future to avoid re-fetching this record when actually fetching the records to apply undo actions. > and no > other process is already doing it. > This part is already ensured in the current code. > > The 'blk_chain_complete' variable which is set in this function and > passed down to execute_undo_actions_page() and then to the rmgr's > rm_undo callback also looks problematic. > I agree this parameter should go away from the generic interface considering the requirements from zedstore. > First, not every AM that > uses undo may even have the notion of a 'block chain'; zedstore for > example uses TIDs as a 48-bit integer, not a block + offset number, so > it's really not going to have a 'block chain.' Second, even in > zheap's usage, it seems to me that the block chain could be complete > even when this gets set to false. It gets set to true when we're > undoing a toplevel transaction (not a subxact) and we were able to > fetch all of the undo for that toplevel transaction. But even if > that's not true, the chain for an individual block could still be > complete, because all the remaining undo for the block at issue > might've been in the chunk of undo we already read; the remaining undo > could be for other blocks. For that reason, I can't see how the zheap > code that relies on this value can be correct; it uses this value to > decide whether to stick zeroes in the transaction slot, but if the > scenario described above happened, then I suppose the XID would not > get cleared from the slot during undo. Maybe zheap is just relying on > that being harmless, since if all of the undo actions have been > correctly executed for the page, the fact that the transaction slot is > still bogusly used by an aborted xact won't matter; nothing will refer > to it. However, it seems to me that it would be better for zheap to > set things up so that the first undo record for a particular txn/page > combination is flagged in some way (in the payload!) so that undo can > zero the slot if the action being undone is the one that claimed the > slot. That seems cleaner on principle, and it also avoids having > supposedly AM-independent code pass down details that are driven by > zheap's particular needs. > Yeah, we can do what you are suggesting for zheap, or in many cases, we should be able to detect it via uur_blkprev of the last record of the page. The invalid value will indicate that the chain for the page is complete.
Then instead of having a > last_idx argument, have an argument for the number of entries in the > array, computed as last_idx - first_idx + 1. With those changes, > rm_undo would look like this: > > bool (*rm_undo) (UndoRecInfo *urp_array, int count, Oid reloid, > FullTransactionId full_xid); > I agree. > Now for the $10m question: why even pass reloid and full_xid? Aren't > those values going to be present inside every UnpackedUndoRecord? Why > not just let the callback get them from the first record (or however > it wants to do things)? Perhaps there is some documentation value > here in that it implies that the value will be the same for every > record, but we could also handle that by just documenting in the > appropriate places that undo is done by transaction and relation and > therefore the callback is entitled to assume that the same value will > be present in every record. Then again, I am not sure we really want > the callback to assume that reloid doesn't change. I don't see a > reason offhand not to just pass as many records as we have for a given > transaction and let the callback do what it likes. So maybe that's > another reason to get rid of the reloid argument, at least. And then > we could document that all the record will have the same full_xid > (unless we decide that we don't want to guarantee that either). > > Additionally, it strikes me that urp_array is not the greatest name. > Generally, putting _array into the name of the variable to indicate > that it's an array doesn't seem all that great from a coding-style > perspective. I mean, sometimes it's the best you can do, but it's not > amazing. And urp seems like it's using an abbreviation without any > real reason. For contrast, consider this existing precedent: > > extern SysScanDesc systable_beginscan_ordered(Relation heapRelation, > Relation indexRelation, > Snapshot snapshot, > int nkeys, ScanKey key); > > Or this one: > > extern TupleDesc CreateTupleDesc(int natts, Form_pg_attribute *attrs); > > Notice that in each case the array parameter (which is the last one) > is named based on what data it contains rather than on the fact that > it is an array. > Agreed, will change accordingly. > Finally, I observe that rm_undo returns a Boolean, but it's not used > for anything. The only call to rm_undo in the current patch set is in > execute_undo_actions_page, which returns that value to the caller, but > the callers just discard it. I suppose maybe this was intended to > report success or failure, but I think the way that rm_undo will > report failure is to ERROR. > For Error case, it is fine to report failure, but there can be cases where we don't need to apply undo actions like when the relation is dropped/truncated, undo actions are already applied. The original idea was to cover such cases by the return value. I agree that currently, caller ignores this value, but there is some value in keeping it. So, I am in favor of a signature with bool as the return value. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jun 17, 2019 at 6:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > The idea was that it could be use for multiple purposes like 'rolling > back complete xact', 'rolling back subxact', 'rollback at page-level' > or any similar future need even though not all code paths use that > function. I am not wedded to any particular name here, but among your > suggestions complete_transaction sounds better to me. Are you okay > going with that? Sure, let's try that for now and see how it looks. We can always change it again if it seems to be a good idea later. > It won't change in between because we have ensured at top-level that > no two processes can start executing pending undo at the same time. > Basically, anyone wants to execute the undo actions will have an entry > in rollback hash table and that will be marked as in-progress. As > mentioned in comments, the race is only "after discard worker > fetches the record and found that this transaction need to be rolled > back, backend might concurrently execute the actions and remove the > request from rollback hash table." > > [ discussion of alternatives ] I'm not precisely sure what the best thing to do here is, but I'm skeptical that the code in question belongs in this function. There are two separate things going on here: one is this revalidation that the undo hasn't been discarded, and the other is executing the undo actions. Those are clearly separate tasks, and they are not tasks that always get done together: sometimes we do only one, and sometimes we do both. Any function that looks like this is inherently suspicious: whatever(....., bool flag) { if (flag) { // lengthy block of code } // another lengthy block of code } There has to be a reason not to just split this into two functions and let the caller decide whether to call one or both. > For Error case, it is fine to report failure, but there can be cases > where we don't need to apply undo actions like when the relation is > dropped/truncated, undo actions are already applied. The original > idea was to cover such cases by the return value. I agree that > currently, caller ignores this value, but there is some value in > keeping it. So, I am in favor of a signature with bool as the return > value. OK. So then the callers can't keep ignoring it... and there should be some test framework that verifies the behavior when the return value is false. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 17, 2019 at 7:30 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jun 17, 2019 at 6:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I'm not precisely sure what the best thing to do here is, but I'm > skeptical that the code in question belongs in this function. There > are two separate things going on here: one is this revalidation that > the undo hasn't been discarded, and the other is executing the undo > actions. Those are clearly separate tasks, and they are not tasks that > always get done together: sometimes we do only one, and sometimes we > do both. Any function that looks like this is inherently suspicious: > > whatever(....., bool flag) > { > if (flag) > { > // lengthy block of code > } > > // another lengthy block of code > } > > There has to be a reason not to just split this into two functions and > let the caller decide whether to call one or both. > Yeah, because some of the information required to perform the necessary steps (in the code under the flag) is quite central to this function (see undo apply progress update part) and it is used at more than one place in this function. I have refactored the code in this function, see if it makes sense now. You need to check patch 0012-Infrastructure-to-execute-pending-undo-actions.patch for these changes. > > For Error case, it is fine to report failure, but there can be cases > > where we don't need to apply undo actions like when the relation is > > dropped/truncated, undo actions are already applied. The original > > idea was to cover such cases by the return value. I agree that > > currently, caller ignores this value, but there is some value in > > keeping it. So, I am in favor of a signature with bool as the return > > value. > > OK. So then the callers can't keep ignoring it... > I again thought about this but couldn't come up with anything meaningful. The idea is to ignore some undo records if they belong to the same relation which is already gone. I think we can do something about it in zheap specific code and make the generic code return void. I have fixed the other comments raised by you. See 0012-Infrastructure-to-execute-pending-undo-actions.patch Apart from the changes related to the undo apply, this patch series contains changes for making the transaction header at a location immediately after UndoRecordHeader which makes it easy to update the same. The changes are in patches 0007-Provide-interfaces-to-store-and-fetch-undo-records.patch and 0012-Infrastructure-to-execute-pending-undo-actions.patch. There are no changes in undo log module patches. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0005-Add-prefetch-support-for-the-undo-log.patch
- 0001-Add-SmgrId-to-smgropen-and-BufferTag.patch
- 0004-Allow-WAL-record-data-on-first-modification-after-a-.patch
- 0002-Move-tablespace-dir-creation-from-smgr.c-to-md.c.patch
- 0003-Add-undo-log-manager.patch
- 0006-Defect-and-enhancement-in-multi-log-support.patch
- 0008-Test-module-for-undo-api.patch
- 0007-Provide-interfaces-to-store-and-fetch-undo-records.patch
- 0010-Extend-binary-heap-functionality.patch
- 0009-undo-page-consistency-checker.patch
- 0012-Infrastructure-to-execute-pending-undo-actions.patch
- 0011-Infrastructure-to-register-and-fetch-undo-action-req.patch
- 0013-Allow-foreground-transactions-to-perform-undo-action.patch
- 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch
On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > [ new patches ] I tried writing some code that throws an error from an undo log handler and the results were not good. It appears that the code will retry in a tight loop: 2019-06-18 13:58:53.262 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.264 EDT [42803] ERROR: robert_undo It seems clear that the error-handling aspect of this patch has not been given enough thought. It's debatable what strategy should be used when undo fails, but retrying 40 times per millisecond isn't the right answer. I assume we want some kind of cool-down between retries. 10 seconds? A minute? Some kind of back-off algorithm that gradually increases the retry time up to some maximum? Should there be one or more GUCs? Another thing that is not very nice is that when I tried to shut down the server via 'pg_ctl stop' while the above was happening, it did not shut down. I had to use an immediate shutdown. That's clearly not OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > [ new patches ] > > I tried writing some code that throws an error from an undo log > handler and the results were not good. I discovered another bothersome thing here: if you have a single transaction that generates a bunch of undo records, the first one has uur_dbid set correctly and the remaining records have uur_dbid set to 0. That means if you try to write a sanity check like if (record->uur_dbid != MyDatabaseId) elog(ERROR, "the undo system messed up") it fails. The original idea of UnpackedUndoRecord was this: you would put a bunch of data into an UnpackedUndoRecord that you wanted written to undo, and the undo system would find ways to compress stuff out of the on-disk representation by e.g. omitting the fork number if it's MAIN_FORKNUM. Then, when you read an undo record, it would decompress so that you ended up with the same UnpackedUndoRecord that you had at the beginning. However, the inclusion of transaction headers has made this a bit confusing: that stuff isn't being added by the user but by the undo system itself. It's not very clear from the comments what the contract is around these things: do you need to set uur_dbid to MyDatabaseId when preparing to insert an undo record? Or can you just leave it unset and then it'll possibly be set at decoding time? The comments for the UnpackedUndoRecord structure don't explain this. I'd really like to see this draw a cleaner distinction between the stuff that the user is expected to set and the other stuff we deal with internally to the undo subsystem. For example, suppose that UnpackedUndoRecord didn't include any of the fields that are only present in the transaction header. Maybe there's another structure, like UndoTransactionHeader, that includes those fields. The client of the undo subsystem creates a bunch of UnpackedUndoRecords and inserts them. At undo time, the callback gets back an identical set of UnpackedUndoRecords. And maybe it also gets a pointer to the UndoTransactionHeader which contains all of the system-generated stuff. Under this scheme, uur_xid, uur_xidepoch (which still need to be combined into uur_fxid), uur_progress, uur_dbid, uur_next, uur_prevlogstart, and uur_prevurp would all move out of the UnpackedUndoRecord and into the UndoTransactionHeader. The user would supply none of those things when inserting undo records, but the rm_undo callback could examine those values if it wished. A weaker approach would be to at least clean up the structure definition so that the transaction-header fields set by the system are clearly segregated from the per-record fields set by the undo-inserter, with comments explaining that those fields don't need to be set but will (or may?) be set at undo time. That would be better than what we have right now - because it would hopefully make it much more clear which fields need to be set on insert and which fields can be expected to be set when decoding - but I think it's probably not going far enough. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 18, 2019 at 11:37 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > [ new patches ] > > I tried writing some code that throws an error from an undo log > handler and the results were not good. It appears that the code will > retry in a tight loop: > > 2019-06-18 13:58:53.262 EDT [42803] ERROR: robert_undo > 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo > 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo .. > > It seems clear that the error-handling aspect of this patch has not > been given enough thought. It's debatable what strategy should be > used when undo fails, but retrying 40 times per millisecond isn't the > right answer. > The reason for the same is that currently, the undo worker keep on executing the requests if there are any. I think this is good when there are different requests, but getting the same request from error queue and doing it, again and again, doesn't seem to be good and I think it will not help either. > I assume we want some kind of cool-down between retries. > 10 seconds? A minute? Some kind of back-off algorithm that gradually > increases the retry time up to some maximum? > Yeah, something on these lines would be good. How about if we add failure_count with each request in error queue? Now, it will get incremented on each retry and we can wait in proportion to that, say 10s after the first retry, 20s after second and so on and maximum up to 10 failure_count (100s) will be allowed after which worker will exit considering it has no more work to do. Actually, we also need to think about what we should with such requests because even if undo worker exits after retrying for some threshold number of times, undo launcher will again launch a new worker for this request unless we have some special handling for the same. We can issue some WARNING once any particular request reached the maximum number of retries but not sure if that is enough because the user might not notice the same or didn't take any action. Do we want to PANIC at some point of time, if so, when or the other alternative is we can try at regular intervals till we succeed? > Should there be one or > more GUCs? > Yeah, we can do that, something like undo_apply_error_retry_count, but I am not completely sure about this, maybe some pre-defined number say 10 or 20 should be enough. However, I am fine if you or others think that a guc can help users in this case. > Another thing that is not very nice is that when I tried to shut down > the server via 'pg_ctl stop' while the above was happening, it did not > shut down. I had to use an immediate shutdown. That's clearly not > OK. > CHECK_FOR_INTERRUPTS is missing at one place, will fix. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 19, 2019 at 2:40 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > [ new patches ] > > > > I tried writing some code that throws an error from an undo log > > handler and the results were not good. > > I discovered another bothersome thing here: if you have a single > transaction that generates a bunch of undo records, the first one has > uur_dbid set correctly and the remaining records have uur_dbid set to > 0. That means if you try to write a sanity check like if > (record->uur_dbid != MyDatabaseId) elog(ERROR, "the undo system messed > up") it fails. > > The original idea of UnpackedUndoRecord was this: you would put a > bunch of data into an UnpackedUndoRecord that you wanted written to > undo, and the undo system would find ways to compress stuff out of the > on-disk representation by e.g. omitting the fork number if it's > MAIN_FORKNUM. Then, when you read an undo record, it would decompress > so that you ended up with the same UnpackedUndoRecord that you had at > the beginning. However, the inclusion of transaction headers has made > this a bit confusing: that stuff isn't being added by the user but by > the undo system itself. It's not very clear from the comments what the > contract is around these things: do you need to set uur_dbid to > MyDatabaseId when preparing to insert an undo record? Or can you just > leave it unset and then it'll possibly be set at decoding time? The > comments for the UnpackedUndoRecord structure don't explain this. > > I'd really like to see this draw a cleaner distinction between the > stuff that the user is expected to set and the other stuff we deal > with internally to the undo subsystem. For example, suppose that > UnpackedUndoRecord didn't include any of the fields that are only > present in the transaction header. Maybe there's another structure, > like UndoTransactionHeader, that includes those fields. The client of > the undo subsystem creates a bunch of UnpackedUndoRecords and inserts > them. At undo time, the callback gets back an identical set of > UnpackedUndoRecords. And maybe it also gets a pointer to the > UndoTransactionHeader which contains all of the system-generated > stuff. Under this scheme, uur_xid, uur_xidepoch (which still need to > be combined into uur_fxid), uur_progress, uur_dbid, uur_next, > uur_prevlogstart, and uur_prevurp would all move out of the > UnpackedUndoRecord and into the UndoTransactionHeader. The user would > supply none of those things when inserting undo records, but the > rm_undo callback could examine those values if it wished. > > A weaker approach would be to at least clean up the structure > definition so that the transaction-header fields set by the system are > clearly segregated from the per-record fields set by the > undo-inserter, with comments explaining that those fields don't need > to be set but will (or may?) be set at undo time. That would be better > than what we have right now - because it would hopefully make it much > more clear which fields need to be set on insert and which fields can > be expected to be set when decoding - but I think it's probably not > going far enough. I think it's a fair point. We can keep pointer to UndoRecordTransaction(urec_progress, dbid, uur_next) and UndoRecordLogSwitch(urec_prevurp, urec_prevlogstart) in UnpackedUndoRecord and include them whenever undo record contain these headers. 
Transaction header in the first record of the transaction and log-switch header in the first record after undo-log switch during a transaction. IMHO uur_fxid, we can keep as part of the main UnpackedUndoRecord, because as part of the other work "Compression for undo records to consider rmgrid, xid,cid,reloid for each record", the FullTransactionId, will be present in every UnpackedUndoRecord (although it will not be stored in every undo record). -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 19, 2019 at 2:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > The reason for the same is that currently, the undo worker keep on > executing the requests if there are any. I think this is good when > there are different requests, but getting the same request from error > queue and doing it, again and again, doesn't seem to be good and I > think it will not help either. Even if there are multiple requests involved, you don't want a tight loop like this. > > I assume we want some kind of cool-down between retries. > > 10 seconds? A minute? Some kind of back-off algorithm that gradually > > increases the retry time up to some maximum? > > Yeah, something on these lines would be good. How about if we add > failure_count with each request in error queue? Now, it will get > incremented on each retry and we can wait in proportion to that, say > 10s after the first retry, 20s after second and so on and maximum up > to 10 failure_count (100s) will be allowed after which worker will > exit considering it has no more work to do. > > Actually, we also need to think about what we should with such > requests because even if undo worker exits after retrying for some > threshold number of times, undo launcher will again launch a new > worker for this request unless we have some special handling for the > same. > > We can issue some WARNING once any particular request reached the > maximum number of retries but not sure if that is enough because the > user might not notice the same or didn't take any action. Do we want > to PANIC at some point of time, if so, when or the other alternative > is we can try at regular intervals till we succeed? PANIC is a terrible idea. How would that fix anything? You'll very possibly still have the same problem after restarting, and so you'll just keep on hitting the PANIC. That will mean that in addition to whatever problem with undo you already had, you now have a system that you can't use for anything at all, because it keeps restarting. The design goal here should be that if undo for a transaction fails, we keep retrying periodically, but with minimal adverse impact on the rest of the system. That means you can't retry in a loop. It also means that the system needs to provide fairness: that is, it shouldn't be possible to create a system where one or more transactions for which undo keeps failing cause other transactions that could have been undone to get starved. It seems to me that thinking of this in terms of what the undo worker does and what the undo launcher does is probably not the right approach. We need to think of it more as an integrated system. Instead of storing a failure_count with each request in the error queue, how about storing a next retry time? I think the error queue needs to be ordered by database_id, then by next_retry_time, and then by order of insertion. (The last part is important because next_retry_time is going to be prone to having ties, and we need to break those ties in the right way.) So, when a per-database worker starts up, it's pulling from the queues in alternation, ignoring items that are not for the current database. When it pulls from the error queue, it looks at the item for the current database that has the lowest retry time - if that's still in the future, then it ignores the queue until something new (perhaps with a lower retry_time) is added, or until the first next_retry_time arrives. If the item that it pulls again fails, it gets inserted back into the error queue but with a higher next retry time. 
This might not be exactly right, but the point is that there should probably be NO logic that causes a worker to retry the same transaction immediately afterward, with or without a delay. It should be all be driven off what gets pulled out of the error queue. In the above sketch, if a worker gets to the point where there's nothing in the error queue for the current database with a timestamp that is <= the current time, then it can't pull anything else from that queue; if there's no other work to do, it exits. If there is other work to do, it does that and then maybe enough time will have passed to allow something to be pulled from the error queue, or maybe not. Meanwhile some other worker running in the same database might pull the item before the original worker gets back to it. Meanwhile if the worker exits because there's nothing more to do in that database, the launcher can also see the error queue. When enough time has passed, it can notice that there is an item (or items) that could be pulled from the error queue for that database and launch a worker for that database if necessary (or else let an existing worker take care of it). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > [ new patches ] > > I tried writing some code [ to use these patches ]. I spent some more time experimenting with this patch set today and I think that the UndoFetchRecord interface is far too zheap-centric. I expected that I would be able to do this: UnpackedUndoRecord *uur = UndoFetchRecord(urp); // do stuff with uur UndoRecordRelease(uur); But I can't, because the UndoFetchRecord API requires me to pass not only an undo record but also a block number, an offset number, an XID, and a callback. I think I could get the effect that I want by defining a callback that always returns true. Then I could do: UndoRecPtr junk; UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber, InvalidOffsetNumber, &junk, always_returns_true); // do stuff with uur UndoRecordRelease(uur); That seems ridiculously baroque. I think the most common thing that an AM will want to do with an UndoRecPtr is look up that exact record; that is, for example, what zedstore will want to do. However, even if some AMs, like zheap, want to search backward through a chain of records, there's no real reason to suppose that all of them will want to search by block number + offset. They might want to search by some bit of data buried in the payload, for example. I think the basic question here is whether we really need anything more complicated than: extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp); I mean, if you had that, the caller can implement looping easily enough, and insert any test they want: for (;;) { UnpackedUndoRecord *uur = UndoFetchRecord(urp); if (i like this one) break; urp = uur->uur_blkprev; // should be renamed, since zedstore + probably others will have tuple chains not block chains UndoRecordRelease(uur); } The question in my mind is whether there's some performance advantage of having the undo layer manage the looping rather than the caller do it. If there is, then there's a lot of zheap code that ought to be changed to use it, because it's just using the same satisfies-callback everywhere. If there's not, we should just simplify the undo record lookup along the lines mentioned above and put all the looping into the callers. zheap could provide a wrapper around UndoFetchRecord that does a search by block and offset, so that we don't have to repeat that logic in multiple places. BTW, an actually generic iterator interface would probably look more like this: typedef bool (*SatisfyUndoRecordCallback)(void *callback_data, UnpackedUndoRecord *uur); extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, UndoRecPtr *found, SatisfyUndoRecordCallback callback, void *callback_data); Now we're not assuming anything about what parts of the record the callback wants to examine. It can do whatever it likes. Typically with this sort of interface a caller will define a file-private struct that is known to both the callback and the caller of UndoFetchRecord, but not elsewhere. If we decide we need an iterator within the undo machinery itself, then I think it should look like the above, and I think it should accept NULL for found, callback, and callback_data, so that somebody who wants to just look up a record, full stop, can do just: UnpackedUndoRecord *uur = UndoFetchRecord(urp, NULL, NULL, NULL); which seems tolerable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 19, 2019 at 9:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I think it's a fair point. We can keep pointer to > UndoRecordTransaction(urec_progress, dbid, uur_next) and > UndoRecordLogSwitch(urec_prevurp, urec_prevlogstart) in > UnpackedUndoRecord and include them whenever undo record contain these > headers. Transaction header in the first record of the transaction > and log-switch header in the first record after undo-log switch during > a transaction. IMHO uur_fxid, we can keep as part of the main > UnpackedUndoRecord, because as part of the other work "Compression > for undo records to consider rmgrid, xid,cid,reloid for each record", > the FullTransactionId, will be present in every UnpackedUndoRecord > (although it will not be stored in every undo record). I agree that fxid needs to be set all the time. I'm not sure I'm entirely following the rest of what you are saying here, but let me say again that I don't think UnpackedUndoRecord should include a bunch of stuff that callers (1) don't need to set when inserting and (2) can't count on having set when fetching. Stuff of that type should be handled in some way that spares clients of the undo system from having to worry about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 19, 2019 at 8:25 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jun 19, 2019 at 2:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > The reason for the same is that currently, the undo worker keep on > > executing the requests if there are any. I think this is good when > > there are different requests, but getting the same request from error > > queue and doing it, again and again, doesn't seem to be good and I > > think it will not help either. > > Even if there are multiple requests involved, you don't want a tight > loop like this. > Okay, one reason that comes to mind is we don't want to choke the system as applying undo can consume CPU and generate a lot of I/O. Is that you have in mind or something else? I see an advantage in having some sort of throttling here, so we can have some wait time (say 100ms) between processing requests. Do we see any need of guc here? I think on one side it seems a good idea to have multiple guc's for tuning undo worker machinery because whatever default values we pick might not be good for some of the users. OTOH, giving too many guc's can also make the system difficult to understand and can confuse users or at the least, they won't know how exactly to use those. It seems to me that we should first complete the entire patch and then we can decide which all things need separate guc. > > > I assume we want some kind of cool-down between retries. > > > 10 seconds? A minute? Some kind of back-off algorithm that gradually > > > increases the retry time up to some maximum? > > > > Yeah, something on these lines would be good. How about if we add > > failure_count with each request in error queue? Now, it will get > > incremented on each retry and we can wait in proportion to that, say > > 10s after the first retry, 20s after second and so on and maximum up > > to 10 failure_count (100s) will be allowed after which worker will > > exit considering it has no more work to do. > > > > Actually, we also need to think about what we should with such > > requests because even if undo worker exits after retrying for some > > threshold number of times, undo launcher will again launch a new > > worker for this request unless we have some special handling for the > > same. > > > > We can issue some WARNING once any particular request reached the > > maximum number of retries but not sure if that is enough because the > > user might not notice the same or didn't take any action. Do we want > > to PANIC at some point of time, if so, when or the other alternative > > is we can try at regular intervals till we succeed? > > PANIC is a terrible idea. How would that fix anything? You'll very > possibly still have the same problem after restarting, and so you'll > just keep on hitting the PANIC. That will mean that in addition to > whatever problem with undo you already had, you now have a system that > you can't use for anything at all, because it keeps restarting. > > The design goal here should be that if undo for a transaction fails, > we keep retrying periodically, but with minimal adverse impact on the > rest of the system. That means you can't retry in a loop. It also > means that the system needs to provide fairness: that is, it shouldn't > be possible to create a system where one or more transactions for > which undo keeps failing cause other transactions that could have been > undone to get starved. > Agreed. 
> It seems to me that thinking of this in terms of what the undo worker > does and what the undo launcher does is probably not the right > approach. We need to think of it more as an integrated system. Instead > of storing a failure_count with each request in the error queue, how > about storing a next retry time? > I think both failure_count and next_retry_time can work in a similar way. I think incrementing next retry time in multiples will be a bit tricky. Say first-time error occurs at X hours. We can say that next_retry_time will X+10s=Y and error_occured_at will be X. The second time it again failed, how will we know that we need set next_retry_time as Y+20s, maybe we can do something like Y-X and then add 10s to it and add the result to the current time. Now whenever the worker or launcher finds this request, they can check if the current_time is greater than or equal to next_retry_time, if so they can pick that request, otherwise, they check request in next queue. The failure_count can also work in a somewhat similar fashion. Basically, we can use error_occurred at and failure_count to compute the required time. So, if error is occurred at say X hours and failure count is 3, then we can check if current_time is greater than X+(3 * 10s), then we will allow the entry to be processed, otherwise, it will check other queues for work. > I think the error queue needs to be > ordered by database_id, then by next_retry_time, and then by order of > insertion. (The last part is important because next_retry_time is > going to be prone to having ties, and we need to break those ties in > the right way.) > I think it makes sense to order requests by next_retry_time, error_occurred_at (this will ensure the order of insertion). However, I am not sure if there is a need to club the requests w.r.t database id. It can starve the error requests from other databases. Moreover, we already have a functionality wherein if the undo worker doesn't encounter the next request from the same database on which it is operating for a certain amount of time, then it will peek ahead (few entries) in each queue to get the request for the same database. We don't sort by db_id in other queues as well, so it will be consistent for this queue if we just sort by next_retry_time and error_occurred_at. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 19, 2019 at 11:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > [ new patches ] > > > > I tried writing some code [ to use these patches ]. > > I spent some more time experimenting with this patch set today and I > think that the UndoFetchRecord interface is far too zheap-centric. I > expected that I would be able to do this: > > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > // do stuff with uur > UndoRecordRelease(uur); > > But I can't, because the UndoFetchRecord API requires me to pass not > only an undo record but also a block number, an offset number, an XID, > and a callback. I think I could get the effect that I want by > defining a callback that always returns true. Then I could do: > > UndoRecPtr junk; > UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber, > InvalidOffsetNumber, &junk, always_returns_true); > // do stuff with uur > UndoRecordRelease(uur); > > That seems ridiculously baroque. I think the most common thing that > an AM will want to do with an UndoRecPtr is look up that exact record; > that is, for example, what zedstore will want to do. However, even if > some AMs, like zheap, want to search backward through a chain of > records, there's no real reason to suppose that all of them will want > to search by block number + offset. They might want to search by some > bit of data buried in the payload, for example. > > I think the basic question here is whether we really need anything > more complicated than: > > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp); > > I mean, if you had that, the caller can implement looping easily > enough, and insert any test they want: > > for (;;) > { > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > if (i like this one) > break; > urp = uur->uur_blkprev; // should be renamed, since zedstore + > probably others will have tuple chains not block chains > UndoRecordRelease(uur); > } The idea behind having the loop inside the undo machinery was that while traversing the blkprev chain, we can read all the undo records on the same undo page under one buffer lock. > > The question in my mind is whether there's some performance advantage > of having the undo layer manage the looping rather than the caller do > it. If there is, then there's a lot of zheap code that ought to be > changed to use it, because it's just using the same satisfies-callback > everywhere. If there's not, we should just simplify the undo record > lookup along the lines mentioned above and put all the looping into > the callers. zheap could provide a wrapper around UndoFetchRecord > that does a search by block and offset, so that we don't have to > repeat that logic in multiple places. > > BTW, an actually generic iterator interface would probably look more like this: > > typedef bool (*SatisfyUndoRecordCallback)(void *callback_data, > UnpackedUndoRecord *uur); Right, it should be this way. > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, UndoRecPtr > *found, SatisfyUndoRecordCallback callback, void *callback_data); > > Now we're not assuming anything about what parts of the record the > callback wants to examine. It can do whatever it likes. Typically > with this sort of interface a caller will define a file-private struct > that is known to both the callback and the caller of UndoFetchRecord, > but not elsewhere. 
> > If we decide we need an iterator within the undo machinery itself, > then I think it should look like the above, and I think it should > accept NULL for found, callback, and callback_data, so that somebody > who wants to just look up a record, full stop, can do just: > > UnpackedUndoRecord *uur = UndoFetchRecord(urp, NULL, NULL, NULL); > > which seems tolerable. > I agree with this. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 20, 2019 at 2:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Jun 19, 2019 at 11:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > [ new patches ] > > > > > > I tried writing some code [ to use these patches ]. > > > > I spent some more time experimenting with this patch set today and I > > think that the UndoFetchRecord interface is far too zheap-centric. I > > expected that I would be able to do this: > > > > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > > // do stuff with uur > > UndoRecordRelease(uur); > > > > But I can't, because the UndoFetchRecord API requires me to pass not > > only an undo record but also a block number, an offset number, an XID, > > and a callback. I think I could get the effect that I want by > > defining a callback that always returns true. Then I could do: > > > > UndoRecPtr junk; > > UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber, > > InvalidOffsetNumber, &junk, always_returns_true); > > // do stuff with uur > > UndoRecordRelease(uur); > > > > That seems ridiculously baroque. I think the most common thing that > > an AM will want to do with an UndoRecPtr is look up that exact record; > > that is, for example, what zedstore will want to do. However, even if > > some AMs, like zheap, want to search backward through a chain of > > records, there's no real reason to suppose that all of them will want > > to search by block number + offset. They might want to search by some > > bit of data buried in the payload, for example. > > > > I think the basic question here is whether we really need anything > > more complicated than: > > > > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp); > > > > I mean, if you had that, the caller can implement looping easily > > enough, and insert any test they want: > > > > for (;;) > > { > > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > > if (i like this one) > > break; > > urp = uur->uur_blkprev; // should be renamed, since zedstore + > > probably others will have tuple chains not block chains > > UndoRecordRelease(uur); > > } > > The idea behind having the loop inside the undo machinery was that > while traversing the blkprev chain, we can read all the undo records > on the same undo page under one buffer lock. > I think if we want we can hold this buffer and allow it to be released in UndoRecordRelease. However, this buffer needs to be stored in some common structure which can be then handed over to UndoRecordRelease. Another thing is that as of now the API allocates the memory just once for UnpackedUndoRecord whereas in the new scheme it needs to be allocated again and again. I think this is a relatively minor thing, but it might be better if we can avoid palloc again and again. BTW, while looking at the code of UndoFetchRecord, I see some problem. There is a coding pattern like if() { } else { LWLockAcquire() .. .. } LWLockRelease(). I think this is not correct. > > > > BTW, an actually generic iterator interface would probably look more like this: > > > > typedef bool (*SatisfyUndoRecordCallback)(void *callback_data, > > UnpackedUndoRecord *uur); > Right, it should be this way. 
> > > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, UndoRecPtr > > *found, SatisfyUndoRecordCallback callback, void *callback_data); > > > > Now we're not assuming anything about what parts of the record the > > callback wants to examine. It can do whatever it likes. Typically > > with this sort of interface a caller will define a file-private struct > > that is known to both the callback and the caller of UndoFetchRecord, > > but not elsewhere. > > > > If we decide we need an iterator within the undo machinery itself, > > then I think it should look like the above, and I think it should > > accept NULL for found, callback, and callback_data, so that somebody > > who wants to just look up a record, full stop, can do just: > > > > UnpackedUndoRecord *uur = UndoFetchRecord(urp, NULL, NULL, NULL); > > > > which seems tolerable. > > > I agree with this. > +1. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 19, 2019 at 11:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > [ new patches ] > > > > I tried writing some code [ to use these patches ]. > > > for (;;) > { > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > if (i like this one) > break; > urp = uur->uur_blkprev; // should be renamed, since zedstore + > probably others will have tuple chains not block chains .. +1 for renaming this variable. How about uur_prev_ver or uur_prevver or uur_verprev? Any other suggestions? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 20, 2019 at 2:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > Okay, one reason that comes to mind is we don't want to choke the > system as applying undo can consume CPU and generate a lot of I/O. Is > that you have in mind or something else? Yeah, mainly that, but also things like log spam, and even pressure on the lock table. If we are trying over and over again to take useless locks, it can affect other things on the system. The main thing, however, is the CPU and I/O consumption. > I see an advantage in having some sort of throttling here, so we can > have some wait time (say 100ms) between processing requests. Do we > see any need of guc here? I don't think that is the right approach. As I said in my previous reply, we need a way of holding off the retry of the same error for a certain amount of time, probably measured in seconds or tens of seconds. Introducing a delay before processing every request is an inferior alternative: if there are a lot of rollbacks, it can cause the system to lag; and in the case where there's just one rollback that's failing, it will still be way too much log spam (and probably CPU time too). Nobody wants 10 failure messages per second in the log. > > It seems to me that thinking of this in terms of what the undo worker > > does and what the undo launcher does is probably not the right > > approach. We need to think of it more as an integrated system. Instead > > of storing a failure_count with each request in the error queue, how > > about storing a next retry time? > > I think both failure_count and next_retry_time can work in a similar way. > > I think incrementing next retry time in multiples will be a bit > tricky. Say first-time error occurs at X hours. We can say that > next_retry_time will X+10s=Y and error_occured_at will be X. The > second time it again failed, how will we know that we need set > next_retry_time as Y+20s, maybe we can do something like Y-X and then > add 10s to it and add the result to the current time. Now whenever > the worker or launcher finds this request, they can check if the > current_time is greater than or equal to next_retry_time, if so they > can pick that request, otherwise, they check request in next queue. > > The failure_count can also work in a somewhat similar fashion. > Basically, we can use error_occurred at and failure_count to compute > the required time. So, if error is occurred at say X hours and > failure count is 3, then we can check if current_time is greater than > X+(3 * 10s), then we will allow the entry to be processed, otherwise, > it will check other queues for work. Meh. Don't get stuck on one particular method of calculating the next retry time. We want to be able to change that easily if whatever we try first doesn't work out well. I am not convinced that we need anything more complex than a fixed retry time, probably controlled by a GUC (undo_failure_retry_time = 10s?). An escalating time between retries would be important and advantageous if we expected the sizes of these queues to grow into the millions, but the current design seems to be contemplating something more in the tends-of-thousands range and I am not sure we're going to need it at that level. We should try simple things first and then see where we need to make it more complex. At some basic level, the queue needs to be ordered by increasing retry time. You can do that with your design, but you have to recompute the next retry time from the error_occurred_at and failure_count values every time you examine an entry. 
It's almost certainly better to store the next_retry_time explicitly. That way, if for example we change the logic for computing the next_retry_time to something really complicated, it doesn't have any effect on the code that keeps the queue in order -- it just looks at the computed value. If we end up with something very simple, like error_occurred_at + constant, it may end up seeming a little silly, but I think that's a price well worth paying for code maintainability. If we end up with error_occurred_at + Min(failure_count * 10, 100) or something of that sort, then we can also store failure_count in each record, but it will just be part of the payload, not the sort key, so adding it or removing it won't affect the code that maintains the queue ordering. > > I think the error queue needs to be > > ordered by database_id, then by next_retry_time, and then by order of > > insertion. (The last part is important because next_retry_time is > > going to be prone to having ties, and we need to break those ties in > > the right way.) > > I think it makes sense to order requests by next_retry_time, > error_occurred_at (this will ensure the order of insertion). However, > I am not sure if there is a need to club the requests w.r.t database > id. It can starve the error requests from other databases. Moreover, > we already have a functionality wherein if the undo worker doesn't > encounter the next request from the same database on which it is > operating for a certain amount of time, then it will peek ahead (few > entries) in each queue to get the request for the same database. We > don't sort by db_id in other queues as well, so it will be consistent > for this queue if we just sort by next_retry_time and > error_occurred_at. You're misunderstanding my point. We certainly do not wish to always pick the request from the database with the lowest OID, or anything like that. However, we do need a worker for a particular database to find the work pending for that database efficiently. Peeking ahead a few requests is a version of that, but I'm not sure it's going to be good enough. Suppose we look ahead 3 requests but there are 10 databases. Then, if all 10 databases have requests pending, it is likely that we won't find the next request for our particular database even though it exists -- the first 3 may easily be for all other databases. If you look ahead more requests, that doesn't really fix it - it just means you need more databases for the problem to become likely. And note that this problem happens even if every database contains a worker. Some of those workers will erroneously think that they should exit. I'm not sure exactly what to do about this. My first thought was that for all of the queues we might need to have a queue per database (or something equivalent) rather than just one big queue. But that has problems too: it will mean that a database worker will never exit as long as there is any work at all to be done in that database, even if some other database is getting starved. Somehow we need to balance the efficiency of having a worker for a particular database process many requests before exiting against the need to ensure fairness across databases, and it doesn't sound to me like we quite know what exactly we ought to do there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 20, 2019 at 6:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > for (;;) > > { > > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > > if (i like this one) > > break; > > urp = uur->uur_blkprev; // should be renamed, since zedstore + > > probably others will have tuple chains not block chains > .. > > +1 for renaming this variable. How about uur_prev_ver or uur_prevver > or uur_verprev? Any other suggestions? Maybe just uur_previous or uur_prevundo or something like that. We've already got a uur_prevurp, but that's really pretty misnamed and IMHO it doesn't belong in this structure anyway. (uur_next is also a bad name and also doesn't belong in this structure.) I don't think we want to use 'ver' because that supposes that undo is being used to track tuple versions, which is a likely use but perhaps not the only one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 20, 2019 at 6:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > BTW, while looking at the code of UndoFetchRecord, I see some problem. > There is a coding pattern like > if() > { > } > else > { > LWLockAcquire() > .. > .. > } > > LWLockRelease(). > > I think this is not correct. Independently of that problem, I think it's probably bad that we're not maintaining the same shared memory state on the master and the standby. Doing the same check in one way on the master and in a different way on the standby is a recipe for surprising and probably bad behavior differences between master and standby servers. Those could be simple things like lock acquire/release not matching, but they could also be things like performance or correctness differences that only materialize under certain scenarios. This is not the only place in the patch set where we have this kind of thing, and I hate them all. I don't exactly know what the solution is, either, but I suspect it will involve either having the recovery process do a more thorough job updating the shared memory state when it does undo-related stuff, or running some of the undo-specific processes on the standby just for the purpose of getting these updates done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 20, 2019 at 8:01 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jun 20, 2019 at 2:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Okay, one reason that comes to mind is we don't want to choke the > > system as applying undo can consume CPU and generate a lot of I/O. Is > > that you have in mind or something else? > > Yeah, mainly that, but also things like log spam, and even pressure on > the lock table. If we are trying over and over again to take useless > locks, it can affect other things on the system. The main thing, > however, is the CPU and I/O consumption. > > > I see an advantage in having some sort of throttling here, so we can > > have some wait time (say 100ms) between processing requests. Do we > > see any need of guc here? > > I don't think that is the right approach. As I said in my previous > reply, we need a way of holding off the retry of the same error for a > certain amount of time, probably measured in seconds or tens of > seconds. Introducing a delay before processing every request is an > inferior alternative: > This delay is for *not* choking the system by constantly performing undo requests that consume a lot of CPU and I/O as discussed in above point. For holding off the same error request to be re-tried, we need next_retry_time type of method as discussed below. if there are a lot of rollbacks, it can cause > the system to lag; and in the case where there's just one rollback > that's failing, it will still be way too much log spam (and probably > CPU time too). Nobody wants 10 failure messages per second in the > log. > > > > It seems to me that thinking of this in terms of what the undo worker > > > does and what the undo launcher does is probably not the right > > > approach. We need to think of it more as an integrated system. Instead > > > of storing a failure_count with each request in the error queue, how > > > about storing a next retry time? > > > > I think both failure_count and next_retry_time can work in a similar way. > > > > I think incrementing next retry time in multiples will be a bit > > tricky. Say first-time error occurs at X hours. We can say that > > next_retry_time will X+10s=Y and error_occured_at will be X. The > > second time it again failed, how will we know that we need set > > next_retry_time as Y+20s, maybe we can do something like Y-X and then > > add 10s to it and add the result to the current time. Now whenever > > the worker or launcher finds this request, they can check if the > > current_time is greater than or equal to next_retry_time, if so they > > can pick that request, otherwise, they check request in next queue. > > > > The failure_count can also work in a somewhat similar fashion. > > Basically, we can use error_occurred at and failure_count to compute > > the required time. So, if error is occurred at say X hours and > > failure count is 3, then we can check if current_time is greater than > > X+(3 * 10s), then we will allow the entry to be processed, otherwise, > > it will check other queues for work. > > Meh. Don't get stuck on one particular method of calculating the next > retry time. We want to be able to change that easily if whatever we > try first doesn't work out well. I am not convinced that we need > anything more complex than a fixed retry time, probably controlled by > a GUC (undo_failure_retry_time = 10s?). > IIRC, then you only seem to have suggested that we need a kind of back-off algorithm that gradually increases the retry time up to some maximum [1]. 
I think that is a good way to de-prioritize requests that are repeatedly failing. Say, there is a request that has already failed for 5 times and the worker queues it to get executed after 10s. Immediately after that, another new request has failed for the first time for the same database and it also got queued to get executed after 10s. In this scheme the request that has already failed for 5 times will get a chance before the request that has failed for the first time. [1] - https://www.postgresql.org/message-id/CA%2BTgmoYHBkm7M8tNk6Z9G_aEOiw3Bjdux7v9%2BUzmdNTdFmFzjA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 20, 2019 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > This delay is for *not* choking the system by constantly performing > undo requests that consume a lot of CPU and I/O as discussed in above > point. For holding off the same error request to be re-tried, we need > next_retry_time type of method as discussed below. Oh. That's not what I thought we were talking about. It's not unreasonable to think about trying to rate limit undo application just like we do for vacuum, but a fixed delay between requests would be a completely inadequate way of attacking that problem. If the individual requests are short, it will create too much delay, and if they are long, it will not create enough. We would need delays within a transaction, not just between transactions, similar to how the vacuum cost delay stuff works. I suggest that we leave that to one side for now. It seems like something that could be added later, maybe in a more general way, and not something that needs to be or should be closely connected to the logic for deciding the order in which we're going to process different transactions in undo. > > Meh. Don't get stuck on one particular method of calculating the next > > retry time. We want to be able to change that easily if whatever we > > try first doesn't work out well. I am not convinced that we need > > anything more complex than a fixed retry time, probably controlled by > > a GUC (undo_failure_retry_time = 10s?). > > IIRC, then you only seem to have suggested that we need a kind of > back-off algorithm that gradually increases the retry time up to some > maximum [1]. I think that is a good way to de-prioritize requests > that are repeatedly failing. Say, there is a request that has already > failed for 5 times and the worker queues it to get executed after 10s. > Immediately after that, another new request has failed for the first > time for the same database and it also got queued to get executed > after 10s. In this scheme the request that has already failed for 5 > times will get a chance before the request that has failed for the > first time. Sure, that's an advantage of increasing back-off times -- you can keep the stuff that looks hopeless from interfering too much with the stuff that is more likely to work out. However, I don't think we've actually done enough testing to know for sure what algorithm will work out best. Do we want linear back-off (10s, 20s, 30s, ...)? Exponential back-off (1s, 2s, 4s, 8s, ...)? No back-off (10s, 10s, 10s, 10s)? Some algorithm that depends on the size of the failed transaction, so that big things get retried less often? I think it's important to design the code in such a way that the algorithm can be changed easily later, because I don't think we can be confident that whatever we pick for the first attempt will prove to be best. I'm pretty sure that storing the failure count INSTEAD OF the next retry time is going to make it harder to experiment with different algorithms later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
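One way to keep the algorithm swappable, as suggested above, is to funnel every policy decision through a single function that yields the next retry time, so an experiment only touches one place. A hypothetical sketch (none of these names are from the patch set):

    /*
     * Fixed 10s delay today; a linear or exponential back-off, or a
     * size-dependent policy, would only change the body of this function.
     * The unused prior_failures parameter shows where such a policy would
     * get its input; the queue entry itself would store only the result.
     */
    static TimestampTz
    ComputeNextRetryTime(TimestampTz error_time, int prior_failures)
    {
        return TimestampTzPlusMilliseconds(error_time, 10 * 1000);
    }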
On Thu, Jun 20, 2019 at 4:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > The idea behind having the loop inside the undo machinery was that > while traversing the blkprev chain, we can read all the undo records > on the same undo page under one buffer lock. That's not a bad goal, although invoking a user-supplied callback while holding a buffer lock is a little scary. If we stick with that, it had better be clearly documented. Perhaps worth noting: ReadBuffer() is noticeably more expensive in terms of CPU consumption than LockBuffer(). So it may work out that we keep a pin to avoid redoing that and worry less about retaking the buffer locks. But I'm not sure: avoiding buffer locks is clearly good, too. I have a couple of observations after poking at this some more. One is that it's not necessarily adequate to have an interface that iterates backward through the undo chain until the callback returns true. Here's an example: suppose you want to walk backward through the undo chain until you find the first undo record that corresponds to a change you can see, and then return the undo record immediately prior to that. zheap doesn't really need this, because it (mostly?) stores the XID of the next record we're going to look up in the current record, and the XID of the first record we're going to look up in the chain -- so it can tell from the record it's found so far whether it should bother looking up the next record. That, however, would not necessarily be true for some other AM. Suppose you just store an undo pointer in the tuple header, as Heikki proposed to do for zedstore. Suppose further that each record has the undo pointer for the previous record that modified that TID but not necessarily the XID. Then imagine a TID where we do an insert and a bunch of in-place updates. Then, a scan comes along with an old snapshot. It seems to me that what you need to do is walk backward through the chain of undo records until you see one that has an XID that is visible to your snapshot, and then the version of the tuple that you want is in the payload of the next-newer undo record. So what you want to do is something like this: look up the undo pointer in the tuple. call that the current undo record. loop: - look up the undo pointer in the current undo record. call that the previous undo record. - if the XID from the previous undo record is visible, then stop; use the tuple version from the current undo record. - release the current undo record and let the new current undo record be the previous undo record. I'm not sure if this is actually a reasonable design from a performance point of view, but it doesn't sound totally crazy, and it's just to illustrate the point that there might be cases that are too complicated for a loop-until-true model. In this kind of loop, at any given time you are holding onto two undo records, working your way back through the undo log, and you just can't make this work with the UndoFetchRecord callback interface. Possibly you could have a context object that could hold onto one or a few buffer pins: BeginUndoFetch(&cxt); uur = UndoFetchRecord(&cxt, urp); // maybe do this a bunch of times FinishUndoFetch(&cxt); ...but I'm not sure if that's exactly what we want either. Still, if there is a significant savings from avoiding repinning and relocking the buffer, we want to make it easy for people to get that advantage as often as possible. Another point that has occurred to me is that it is probably impossible to avoid a fairly large number of duplicate undo fetches. 
For instance, suppose somebody runs an UPDATE on a tuple that has been recently updated. The tuple_update method just gets a TID + snapshot, so the AM basically has to go look up the tuple all over again, including checking whether the latest version of the tuple is the one visible to our snapshot. So that means repinning and relocking the same buffers and decoding the same undo record all over again. I'm not exactly sure what to do about this, but it seems like a potential performance problem. I wonder if it's feasible to cache undo lookups so that in common cases we can just reuse the result of a previous lookup instead of doing a new one, and I wonder whether it's possible to make that fast enough that it actually helps... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
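As a sketch of the two-record walk described above, using the hypothetical context API from this message (the visibility test, ReleaseUndoRecord, and the uur_* field names are stand-ins, not the patch's actual interface):

    UndoFetchContext cxt;
    UnpackedUndoRecord *cur, *prev;

    BeginUndoFetch(&cxt);
    cur = UndoFetchRecord(&cxt, tuple_undo_ptr);        /* newest record for this TID */
    for (;;)
    {
        prev = UndoFetchRecord(&cxt, cur->uur_prevptr); /* one step older */
        if (prev == NULL || XidIsVisibleToSnapshot(prev->uur_xid, snapshot))
            break;                  /* the version we want is in cur's payload */
        ReleaseUndoRecord(cur);     /* never hold more than two records */
        cur = prev;
    }
    /* ... decode the tuple version from cur ... */
    FinishUndoFetch(&cxt);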
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 20, 2019 at 4:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> The idea behind having the loop inside the undo machinery was that >> while traversing the blkprev chain, we can read all the undo records >> on the same undo page under one buffer lock. > That's not a bad goal, although invoking a user-supplied callback > while holding a buffer lock is a little scary. I nominate Robert for Understater of the Year. I think there's pretty much 0 chance of that working reliably. regards, tom lane
On Fri, Jun 21, 2019 at 6:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > That's not a bad goal, although invoking a user-supplied callback > > while holding a buffer lock is a little scary. > > I nominate Robert for Understater of the Year. I think there's pretty > much 0 chance of that working reliably. It's an honor to be nominated, although I am pretty sure this is not my best work in category, even for 2019. There are certainly useful things that could be done by such a callback without doing anything that touches shared memory and without doing anything that consumes more than a handful of CPU cycles, so it doesn't seem utterly crazy to think that such a design might survive. However, the constraints we'd have to impose might chafe. I am more inclined to ditch the callback model altogether in favor of putting any necessary looping logic on the caller side. That seems a lot more flexible, and the only trick is figuring out how to keep it cheap. Providing some kind of context object that can hold onto one or more pins seems like the most reasonable approach. Last week it seemed to me that we would need several, but at the moment I can't think of a reason why we would need more than one. I think we just want to optimize the case where several undo lookups in quick succession are actually reading from the same page, and we don't want to go to the expense of looking that page up multiple times. It doesn't seem at all likely that we would have a chain of undo records that leaves a certain page and then comes back to it later, because this is a log that grows forward, not some kind of random-access thing. So a cache of size >1 probably wouldn't help. Unless I'm still confused. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
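To picture the size-one cache: the context just remembers the last pinned buffer, so consecutive fetches that land on the same undo page skip the ReadBuffer() call. A hedged sketch with made-up names:

    typedef struct UndoFetchContext
    {
        Buffer      last_buffer;    /* pinned across calls, or InvalidBuffer */
        BlockNumber last_blkno;
    } UndoFetchContext;

    /* inside a hypothetical UndoFetchRecord(cxt, urp): */
    BlockNumber blkno = UndoRecPtrToBlockNumber(urp);   /* made-up helper */

    if (!BufferIsValid(cxt->last_buffer) || cxt->last_blkno != blkno)
    {
        if (BufferIsValid(cxt->last_buffer))
            ReleaseBuffer(cxt->last_buffer);        /* drop the old pin */
        cxt->last_buffer = ReadBuffer(rel, blkno);  /* the expensive part, now amortized */
        cxt->last_blkno = blkno;
    }
    LockBuffer(cxt->last_buffer, BUFFER_LOCK_SHARE);
    /* ... decode the record, then LockBuffer(..., BUFFER_LOCK_UNLOCK) ... */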
On Thu, Jun 20, 2019 at 9:56 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 20, 2019 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > IIRC, then you only seem to have suggested that we need a kind of > > back-off algorithm that gradually increases the retry time up to some > > maximum [1]. I think that is a good way to de-prioritize requests > > that are repeatedly failing. Say, there is a request that has already > > failed for 5 times and the worker queues it to get executed after 10s. > > Immediately after that, another new request has failed for the first > > time for the same database and it also got queued to get executed > > after 10s. In this scheme the request that has already failed for 5 > > times will get a chance before the request that has failed for the > > first time. > > Sure, that's an advantage of increasing back-off times -- you can keep > the stuff that looks hopeless from interfering too much with the stuff > that is more likely to work out. However, I don't think we've actually > done enough testing to know for sure what algorithm will work out > best. Do we want linear back-off (10s, 20s, 30s, ...)? Exponential > back-off (1s, 2s, 4s, 8s, ...)? No back-off (10s, 10s, 10s, 10s)? > Some algorithm that depends on the size of the failed transaction, so > that big things get retried less often? I think it's important to > design the code in such a way that the algorithm can be changed easily > later, because I don't think we can be confident that whatever we pick > for the first attempt will prove to be best. I'm pretty sure that > storing the failure count INSTEAD OF the next retry time is going to > make it harder to experiment with different algorithms later. > Fair enough. I have implemented it based on next_retry_at and used a constant 10s for the next retry. I have used a #define instead of a GUC, as all the other constants for similar things are defined that way as of now. One thing to note is that we want the linger time (defined as UNDO_WORKER_LINGER_MS) for an undo worker to be more than the failure retry time (defined as UNDO_FAILURE_RETRY_DELAY_MS); otherwise, the undo worker can exit before retrying the failed requests. The changes for this are in patches 0011-Infrastructure-to-register-and-fetch-undo-action-req.patch and 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch. Apart from these, there are a few other changes in the patch series: 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch: 1. Allowed the undo workers to respond to a cancel command from the user. CHECK_FOR_INTERRUPTS was missing while the worker was checking for the next undo request in a loop. 2. Changed the value of UNDO_WORKER_LINGER_MS to 20s, so that it is more than UNDO_FAILURE_RETRY_DELAY_MS. 3. Handled the SIGTERM signal for the undo launcher and workers. 4. Fixed a bug where CommitTransaction could be called when one of the workers failed to register, even though there was no StartTransaction to match it. This was left over from the previous approach. 0012-Infrastructure-to-execute-pending-undo-actions.patch: 1. Fixed a compiler warning. 0007-Provide-interfaces-to-store-and-fetch-undo-records.patch: 1. Fixed a bug so that the buffer is unlocked while resetting the unpacked undo record. 2. Fixed the spurious release of the lock in UndoFetchRecord. 3. Removed the pointer to the previous undo in a different log from the UndoRecordTransaction structure. A separate log_switch header now contains the same. 
0007-Provide-interfaces-to-store-and-fetch-undo-records.patch is Dilip's patch; he has modified it, but the changes were small, so there was not much sense in posting it separately. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Add-SmgrId-to-smgropen-and-BufferTag.patch
- 0002-Move-tablespace-dir-creation-from-smgr.c-to-md.c.patch
- 0004-Allow-WAL-record-data-on-first-modification-after-a-.patch
- 0005-Add-prefetch-support-for-the-undo-log.patch
- 0003-Add-undo-log-manager.patch
- 0006-Defect-and-enhancement-in-multi-log-support.patch
- 0008-Test-module-for-undo-api.patch
- 0009-undo-page-consistency-checker.patch
- 0007-Provide-interfaces-to-store-and-fetch-undo-records.patch
- 0010-Extend-binary-heap-functionality.patch
- 0013-Allow-foreground-transactions-to-perform-undo-action.patch
- 0011-Infrastructure-to-register-and-fetch-undo-action-req.patch
- 0012-Infrastructure-to-execute-pending-undo-actions.patch
- 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch
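As a footnote to the linger-versus-retry point above, the intended interaction might be pictured like this; a hypothetical sketch only, since none of these helper names exist in the patches:

    /*
     * Sketch: the worker lingers long enough that failed requests come up
     * for retry, which is why UNDO_WORKER_LINGER_MS must exceed
     * UNDO_FAILURE_RETRY_DELAY_MS.
     */
    while (!got_sigterm)
    {
        RollbackRequest *req;

        CHECK_FOR_INTERRUPTS();     /* lets the user cancel the worker */

        req = GetNextUndoRequest(MyDatabaseId); /* skips entries whose
                                                 * next_retry_at is in the future */
        if (req != NULL)
            ExecuteUndoRequest(req);
        else if (TimeSinceLastRequest() > UNDO_WORKER_LINGER_MS)
            break;                  /* give other databases a chance */
        else
        {
            (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT, 100L, 0);
            ResetLatch(MyLatch);
        }
    }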
On Tue, Jun 25, 2019 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > [ new patches ] I happened to open up 0001 from this series, which is from Thomas, and I do not think that the pg_buffercache changes are correct. The idea here is that the customer might install version 1.3 or any prior version on an old release, then upgrade to PostgreSQL 13. When they do, they will be running with the old SQL definitions and the new binaries. At that point, it sure looks to me like the code in pg_buffercache_pages.c is going to do the Wrong Thing. There's some existing code in there to omit the pinning_backends column if the SQL definitions don't know about it; instead of adding similar code for the newly-added smgrid column, this patch rips out the existing backward-compatibility code. I think that's a double fail. At some later point, the customer is going to run ALTER EXTENSION pg_buffercache UPDATE. At that point they are going to be expecting that their SQL definitions now match the state that would have been created had they installed pg_buffercache for the first time on PG13, starting with pg_buffercache v1.4. Since the view is now going to have a new column, that would seem to require dropping and recreating the view, but pg_buffercache--1.3--1.4.sql doesn't do that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
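For reference, the backward-compatibility code being ripped out keys off the number of attributes in the TupleDesc that the caller's SQL definitions expect, roughly like this (simplified from pg_buffercache_pages.c, not the exact code):

    TupleDesc   expected_tupledesc;

    if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "return type must be a row type");

    /*
     * Old SQL definitions from a pre-upgrade install have fewer columns,
     * so columns they don't know about must not be emitted.
     */
    if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_ELEM)
        /* ... build tuples without the newer column(s) ... */

The same natts test is what would let the new smgrid column coexist with v1.3 SQL definitions.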
On Sat, Jun 22, 2019 at 2:51 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 20, 2019 at 4:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > The idea behind having the loop inside the undo machinery was that > > while traversing the blkprev chain, we can read all the undo records > > on the same undo page under one buffer lock. > > [...] 
> Still, if there is a significant savings from avoiding repinning and relocking > the buffer, we want to make it easy for people to get that advantage > as often as possible. IIUC, you are proposing to retain the pin and then lock/unlock the buffer each time in this API. I think there is no harm in trying out something along these lines to see if there is any impact on some of the common scenarios. > Another point that has occurred to me is that it is probably > impossible to avoid a fairly large number of duplicate undo fetches. > For instance, suppose somebody runs an UPDATE on a tuple that has been > recently updated. The tuple_update method just gets a TID + snapshot, > so the AM basically has to go look up the tuple all over again, > including checking whether the latest version of the tuple is the one > visible to our snapshot. So that means repinning and relocking the > same buffers and decoding the same undo record all over again. I'm > not exactly sure what to do about this, but it seems like a potential > performance problem. I wonder if it's feasible to cache undo lookups > so that in common cases we can just reuse the result of a previous > lookup instead of doing a new one, and I wonder whether it's possible > to make that fast enough that it actually helps... I think it will be helpful if we can have such a cache, but OTOH, we can also try out such optimizations after the first version, by analyzing their benefit. For zheap, in many cases, the version in the heap itself is all-visible or is visible as per the current snapshot, and that can be detected by looking at the transaction slot; however, it might be tricky for zedstore. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jun 28, 2019 at 6:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > I happened to open up 0001 from this series, which is from Thomas, and > I do not think that the pg_buffercache changes are correct. The idea > here is that the customer might install version 1.3 or any prior > version on an old release, then upgrade to PostgreSQL 13. When they > do, they will be running with the old SQL definitions and the new > binaries. At that point, it sure looks to me like the code in > pg_buffercache_pages.c is going to do the Wrong Thing. [...] Yep, that was completely wrong. Here's a new version. I tested that I can install 1.3 in an older release, then pg_upgrade to master, then look at the view without the new column, then UPDATE the extension to 1.4, and then the new column appears. Other new stuff in this tarball (and also at https://github.com/EnterpriseDB/zheap/tree/undo): Based on hallway track discussions at PGCon, I have made a few modifications to the undo log storage and record layer to support "shared" record sets. They are groups of records that can be used as temporary storage space for anything that needs to outlive a whole set of transactions. The intended usage is extra transaction slots for updaters and lockers when there isn't enough space on a zheap (or other AM) page. The idea is to avoid the need to have in-heap overflow pages for transient transaction management data, and instead put that stuff on the conveyor belt of perfectly timed doom[1] along with old tuple versions. "Shared" undo records are never executed (that is, they don't really represent rollback actions); they are just used for storage space that is eventually discarded. (I experimented with a way to use these also to perform rollback actions to clean up stuff like the junk left behind by aborted CREATE INDEX CONCURRENTLY commands, which seemed promising, but it turned out to be quite tricky so I abandoned that for now). Details: 1. Renamed UndoPersistence to UndoLogCategory everywhere, and added a fourth category UNDO_SHARED where transactions can write 'out of band' data that relates to more than one transaction. 2. Introduced a new RMGR callback rm_undo_status. It is used to decide when record sets in the UNDO_SHARED category should be discarded (instead of the usual single xid-based rules). The possible answers are "discard me now!", "ask me again when a given XID is all visible", and "ask me again when a given XID is no longer running". 3. Recognise UNDO_SHARED record set boundaries differently. Whereas undolog.c recognises transaction boundaries automatically for the other categories (UNDO_PERMANENT, UNDO_UNLOGGED, UNDO_TEMP), for UNDO_SHARED the 4. Add some quick-and-dirty throw-away test stuff to demonstrate that. SELECT test_multixact([1234, 2345]) will create a new record set that will survive until the given array of transactions is no longer running, and then it'll be discarded. You can see that with SELECT * FROM undoinspect('shared'). Or look at SELECT * FROM pg_stat_undo_logs. This test simply writes all the xids into its payload, and then has an rm_undo_status function that returns the first xid it finds in the list that is still running, or if none are running returns UNDO_STATUS_DISCARD. Currently you can only return UNDO_STATUS_WAIT_XMIN, that is, wait for an xid to be older than the oldest xmin; presumably it'd be useful to be able to discard as soon as an xid is no longer active, which could be a bit sooner. 
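To make that concrete, an rm_undo_status callback for such a test module might look roughly like this; the signature and field names here are guesses based on the description, not the patch's actual API:

    /* Decide whether a shared record set can be discarded yet. */
    static UndoStatus
    test_undo_status(UnpackedUndoRecord *rec, TransactionId *wait_for)
    {
        TransactionId *xids = (TransactionId *) rec->uur_payload.data;
        int         nxids = rec->uur_payload.len / sizeof(TransactionId);

        for (int i = 0; i < nxids; i++)
        {
            if (TransactionIdIsInProgress(xids[i]))
            {
                *wait_for = xids[i];            /* ask again about this xid */
                return UNDO_STATUS_WAIT_XMIN;   /* wait for it to fall behind xmin */
            }
        }
        return UNDO_STATUS_DISCARD;             /* nothing still running */
    }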
Another small change: several people commented that UndoLogIsDiscarded(ptr) ought to have some kind of fast path that doesn't acquire locks since it'll surely be hammered. Here's an attempt at that: it provides an inlined function that uses a per-backend recent_discard value to avoid doing more work in the (hopefully) common case that you mostly encounter discarded undo pointers. I hope this change will show up in profiles of some zheap workloads, but that hasn't been tested yet. Another small change/review: the function UndoLogGetNextInsertPtr() previously took a transaction ID, but I'm not sure if that made sense, I need to think about it some more. I pulled in the latest patches from the "undoprocessing" branch as of late last week, and most of the above is implemented as fixup commits on top of that. Next I'm working on DBA facilities for forcing undo records to be discarded (which consists mostly of sorting out the interlocking to make that work safely). And also testing facilities for simulating undo log switching (when you fill up one log and move to another, which is a rarely run code path, so we need a good way to make it not rare). [1] https://speakerdeck.com/macdice/transactions-in-postgresql-and-other-animals?slide=23 -- Thomas Munro https://enterprisedb.com
Attachment
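The fast path mentioned above might have roughly this shape (a guess at the structure, not the committed interface):

    /* Per-backend cache of the highest discard pointer this backend has seen. */
    static UndoRecPtr recent_discard;

    static inline bool
    UndoLogIsDiscarded(UndoRecPtr ptr)
    {
        if (ptr < recent_discard)
            return true;                    /* common case: no locks at all */
        return UndoLogIsDiscardedSlow(ptr); /* hypothetical slow path: takes
                                             * locks and refreshes recent_discard */
    }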
On Mon, Jul 1, 2019 at 7:53 PM Thomas Munro <thomas.munro@gmail.com> wrote: > 3. Recognise UNDO_SHARED record set boundaries differently. Whereas > undolog.c recognises transaction boundaries automatically for the > other categories (UNDO_PERMANENT, UNDO_UNLOGGED, UNDO_TEMP), for > UNDO_SHARED the ... set of records inserted between BeginUndoRecordInsert() and FinishUndoRecordInsert() calls is eventually discarded as a unit, and the rm_undo_status() callback for the calling AM decides when that is allowed. In contrast, for the other categories there may be records from any number of undo-aware AMs that are entirely unaware of each other, and they must all be discarded together if the transaction commits and becomes all visible, so undolog.c automatically manages the boundaries to make that work when inserting. -- Thomas Munro https://enterprisedb.com
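In code form, the UNDO_SHARED contract described here might be used something like this (illustrative only; the argument lists are not the real ones):

    /*
     * Everything inserted between these two calls forms one record set,
     * discarded as a unit once the AM's rm_undo_status callback allows it.
     */
    BeginUndoRecordInsert(&context, UNDO_SHARED, nrecords, NULL);
    /* ... PrepareUndoInsert() / InsertPreparedUndo() for each record ... */
    FinishUndoRecordInsert(&context);

For the other categories the same calls are made, but undolog.c groups the records by transaction on its own.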
On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > Another small change/review: the function UndoLogGetNextInsertPtr() > previously took a transaction ID, but I'm not sure if that made sense, > I need to think about it some more. > The changes you have made related to UndoLogGetNextInsertPtr() don't seem correct to me. @@ -854,7 +854,9 @@ FindUndoEndLocationAndSize(UndoRecPtr start_urecptr, * has already started in this log then lets re-fetch the undo * record. */ - next_insert = UndoLogGetNextInsertPtr(slot->logno, uur->uur_xid); + next_insert = UndoLogGetNextInsertPtr(slot->logno); + + /* TODO this can't happen */ if (!UndoRecPtrIsValid(next_insert)) I think this is a possible case. Say, while the discard worker tries to register the rollback request from some log, after it fetches the undo record corresponding to the start location in this function, another backend adds new transaction undo. The same is mentioned in comments as well. Can you explain what makes you think that this can't happen? If we don't want to pass the xid to UndoLogGetNextInsertPtr, then I think we need to get the insert location before fetching the record. I will think more on it to see if there is any other problem with the same. 2. @@ -167,25 +205,14 @@ UndoDiscardOneLog(UndoLogSlot *slot, TransactionId xmin, bool *hibernate) + if (!TransactionIdIsValid(wait_xid) && !pending_abort) { UndoRecPtr next_insert = InvalidUndoRecPtr; - /* - * If more undo has been inserted since we checked last, then - * we can process that as well. - */ - next_insert = UndoLogGetNextInsertPtr(logno, undoxid); - if (!UndoRecPtrIsValid(next_insert)) - continue; + next_insert = UndoLogGetNextInsertPtr(logno); This change is also not safe. It can lead to discarding the undo of some random transaction, because new undo records from some other transaction could have been added since we last fetched the undo record. This can be fixed by just removing the call to UndoLogGetNextInsertPtr. I have done so in the undoprocessing branch and added a comment as well. I think the common problem with the above changes is that they assume that new undo can't be added to the undo logs while the discard worker is processing them. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 4, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > Another small change/review: the function UndoLogGetNextInsertPtr() > > > previously took a transaction ID, but I'm not sure if that made sense, > > > I need to think about it some more. > > > > > The changes you have made related to UndoLogGetNextInsertPtr() don't > > seem correct to me. > > > > @@ -854,7 +854,9 @@ FindUndoEndLocationAndSize(UndoRecPtr start_urecptr, > > * has already started in this log then lets re-fetch the undo > > * record. > > */ > > - next_insert = UndoLogGetNextInsertPtr(slot->logno, uur->uur_xid); > > + next_insert = UndoLogGetNextInsertPtr(slot->logno); > > + > > + /* TODO this can't happen */ > > if (!UndoRecPtrIsValid(next_insert)) > > > > I think this is a possible case. Say, while the discard worker tries > > to register the rollback request from some log, after it fetches > > the undo record corresponding to the start location in this function, > > another backend adds new transaction undo. The same is mentioned > > in comments as well. Can you explain what makes you think that this > > can't happen? If we don't want to pass the xid to > > UndoLogGetNextInsertPtr, then I think we need to get the insert > > location before fetching the record. I will think more on it to see > > if there is any other problem with the same. > Pushed a fix along the above lines in the undoprocessing branch. It will be available in the next set of patches we post. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 5, 2019 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jul 4, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > [...] > > Pushed a fix along the above lines in the undoprocessing branch. Just in case anyone wants to look at the undoprocessing branch, it is available at https://github.com/EnterpriseDB/zheap/tree/undoprocessing -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 25, 2019 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Fair enough. I have implemented it based on next_retry_at and used a > constant 10s for the next retry. I have used a #define instead of a > GUC, as all the other constants for similar things are defined that way as of > now. One thing to note is that we want the linger time (defined as > UNDO_WORKER_LINGER_MS) for an undo worker to be more than the failure retry > time (defined as UNDO_FAILURE_RETRY_DELAY_MS); otherwise, the undo > worker can exit before retrying the failed requests. Uh, I think we want exactly the opposite. We want the workers to exit before retrying, so that there's a chance for other databases to get processed, I think. Am I confused? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote: > [ new patches ] I took a look at 0012 today, Amit's patch for extending the binary heap machinery, and 0013, Amit's patch for "Infrastructure to register and fetch undo action requests." I don't think that binaryheap_allocate_shm() is a good design. It presupposes that we want to store the binary heap as its own chunk of shared memory allocated via ShmemInitStruct(), but we might want to do something else, like embed it in another structure, store it in a DSM or DSA, etc., and this function can't do any of that. I think we should have something more like: extern Size binaryheap_size(int capacity); extern void binaryheap_initialize(binaryheap *, int capacity, binaryheap_comparator compare, void *arg); Then the caller can do something like: sz = binaryheap_size(capacity); bh = ShmemInitStruct(name, sz, &found); if (!found) binaryheap_initialize(bh, capacity, comparator, whatever); If it wants to get the memory in some other way, it just needs to initialize bh differently; the rest is the same. Note that there is no need, in this design, for binaryheap_size/initialize to make use of "shared" memory. They could equally well be used on backend-local memory. They do not need to care. You just provide the memory, and they do their thing. I wasn't very happy about binaryheap_nth(), binaryheap_remove_nth(), and binaryheap_remove_nth_unordered() and started looking at how they are used to try to see if there might be a better way. That led me to look at 0013. Unfortunately, I find it really hard to understand what this code is actually doing. There's a lot of redundant and badly-written stuff in here. As a general principle, if you have two or three data structures of some particular type, you don't write a separate family of functions for manipulating each one. You write one function for each operation, and you pass the particular copy of the data structure with which you are working as an argument. In the lengthy section of macro definitions at the top of undorequest.c, we have macros InitXidQueue, XidQueueIsEmpty, GetXidQueueSize, GetXidQueueElem, GetXidQueueTopElem, GetXidQueueNthElem, and SetXidQueueElem. Several of these are used in only one place or are not used anywhere at all; those should be removed altogether and inlined into the single call site if there is one. Then, after this, there is a matching set of macros, InitSizeQueue, SizeQueueIsEmpty, GetSizeQueueSize, GetSizeQueueElem, GetSizeQueueTopElem, GetSizeQueueNthElem, and SetSizeQueueElem. Many of these macros are exactly the same as the previous set of macros except that they operate on a different queue, which, as I mentioned in the previous paragraph, is not a good design. It leads to extensive code duplication. Look, for example, at RemoveOldElemsFromSizeQueue and RemoveOldElemsFromXidQueue. They are basically identical except for s/Size/Xid/g and s/SIZE/XID/g, but you can't unify them easily because they are calling different functions. However, if you didn't have one function called GetSizeQueueSize and another called GetXidQueueSize, but just had a pointer to the relevant binary heap, then both functions could just call binaryheap_empty() on it, which would be better style, use fewer macros, generate less machine code, and be easier to read. Ideally, you'd get to the point where you could just have one function rather than two, and pass the queue upon which it should operate as an argument. 
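For illustration, once the heap to operate on is a parameter, the two Remove functions above might collapse into something like this (a sketch; UndoRequest and the dequeue test are placeholders, binaryheap_* is the real API):

    /* One function serves the XID queue, the size queue, and the error queue. */
    static void
    RemoveOldElemsFromQueue(binaryheap *queue)
    {
        while (!binaryheap_empty(queue))
        {
            UndoRequest *req =
                (UndoRequest *) DatumGetPointer(binaryheap_first(queue));

            if (!RequestCanBeDequeued(req))     /* placeholder for the real test */
                break;
            (void) binaryheap_remove_first(queue);
        }
    }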
There seems to be a good deal of this kind of duplication in this file and it really needs to be cleaned up. Now, one objection to the above line of attack is that the different queues actually contain different types of elements. Apparently, the XID queue contains elements of type UndoXidQueue and the size queue contains elements of type UndoSizeQueue. It is worth noting here that these are bad type names, because they sound like they are describing a type of queue, but it seems that they are actually describing an element in the queue. However, there are two larger problems: 1. I don't think we should have three different kinds of objects for each of the three different queues. It seems like it would be much simpler and easier to just have one kind of object that stores all the information we need (full_xid, start_urec_ptr, dbid, request_size, next_retry_at, error_occurred_at) and use that everywhere. You could object that this would increase the storage space requirement, but it wouldn't be enough to make any real difference and it probably would be well worth it for the avoidance of complexity. 2. However, I don't think we should have a separate request object for each queue anyway. We should insert pointers to the same objects in all the relevant queues (either size + XID, or else error). So instead of having three sets of objects, one for each queue, we'd just have one set of objects and point to them with as many as two pointers. We'd therefore need LESS memory than we're using today, because we wouldn't have separate arrays for XID, size, and error queue elements. In fact, it seems to me that we shouldn't have any such thing as "queue entries" at all. The queues should just be pointing to RollbackHashEntry *, and we should add all the fields there that are present in any of the "queue entry" structures. This would use less memory still. I also think we should be using simplehash rather than dynahash. I'm not sure that I would really say that simplehash is "simple," but it does have a nicer API and simpler memory management. There's just a big old block of memory, and there's no incremental allocation. That keeps things simple for the code that wants to go through the queues and remove dangling pointers. I think that the way this should work is that each RollbackHashEntry * should contain a field "bool active." Then: 1. When we pull an item out of one of the binary heaps, we check the active flag. If it's clear, we ignore the entry and pull the next item. If it's set, we clear the flag and process the item, so that if it's subsequently pulled from the other queue it will be ignored. 2. If a binary heap is full when we need to insert into it, we can iterate over all of the elements and throw away any that are !active. They've already been dequeued and processed from some other queue, so they're not "really" in this queue any more, even though we haven't gone to the trouble of actually kicking them out yet. On another note, UNDO_PEEK_DEPTH is bogus. It's used in UndoGetWork() and it passes the depth argument down to GetRollbackHashKeyFromQueue, which then does binaryheap_nth() on the relevant queue. Note that this function is another place that ends up duplicating code because of the questionable decision to have separate types of queue entries for each different queue; otherwise, it could probably just take the binary heap into which it's peeking as an argument instead of having three different cases. But that's not the main point here. 
The main point is that it calls a function for whichever type of queue we've got and gets some kind of queue entry using binaryheap_nth(). But binaryheap_nth(whatever, 2) does not give you the third-smallest element in the binary heap. It gives you the third entry in the array, which may or may not have the heap property, but even if it does, the third element could be huge. Consider this binary heap: 0 1 100000 2 3 100001 100002 4 5 6 7 100003 100004 100005 100006 This satisfies the binary heap property, because the element at position n is always smaller than the elements at positions 2n+1 and 2n+2 (assuming 0-based indexing). But if you want to look at the smallest three elements in the heap, you can't just look at indexes 0..2. The second-smallest element must be at index 1 or 2, but it could be either place. The third-smallest element could be the other of 1 and 2, or it could be either child of the smaller one, so there are three places it might be. In general, a binary heap is not a good data structure for finding the smallest N elements of a collection unless N is 1, and what's going to happen with what you've got here is that we'll sometimes prioritize an item that would not have been pulled from the queue for a long time over one that would have otherwise been processed much sooner. I'm not sure that's a show-stopper, but it doesn't seem good, and the current patch doesn't seem to have any comments justifying it, or at least not in the places nearby to where this is actually happening. I think there are more problems here, too. Let's suppose that we fixed the problem described in the previous paragraph somehow, or decided that it won't actually make a big difference and just ignored it. Suppose further that we have N active databases which are generating undo requests. Luckily, we happen to also have N undo workers available, and let's suppose that as of a certain moment in time there is exactly one worker in each database. Think about what will happen when one of those workers goes to look for the next undo request. It's likely that the first request in the queue will be for some other database, so it's probably going to have to peek ahead to find a request for the database to which it's connected -- let's just assume that there is one. How far will it have to peek ahead? Well, if the requests are uniformly distributed across databases, each request has a 1-in-N chance of being the right one. 
I wrote a little Perl program to estimate the probability that we won't find the next request for our databases within 10 requests as a function of the number of databases: 1 databases => failure chance with 10 lookahead is 0.00% 2 databases => failure chance with 10 lookahead is 0.10% 3 databases => failure chance with 10 lookahead is 1.74% 4 databases => failure chance with 10 lookahead is 5.66% 5 databases => failure chance with 10 lookahead is 10.74% 6 databases => failure chance with 10 lookahead is 16.18% 7 databases => failure chance with 10 lookahead is 21.45% 8 databases => failure chance with 10 lookahead is 26.31% 9 databases => failure chance with 10 lookahead is 30.79% 10 databases => failure chance with 10 lookahead is 34.91% 11 databases => failure chance with 10 lookahead is 38.58% 12 databases => failure chance with 10 lookahead is 41.85% 13 databases => failure chance with 10 lookahead is 44.91% 14 databases => failure chance with 10 lookahead is 47.69% 15 databases => failure chance with 10 lookahead is 50.12% 16 databases => failure chance with 10 lookahead is 52.34% 17 databases => failure chance with 10 lookahead is 54.53% 18 databases => failure chance with 10 lookahead is 56.39% 19 databases => failure chance with 10 lookahead is 58.18% 20 databases => failure chance with 10 lookahead is 59.86% Assuming my script (attached) doesn't have a bug, with only 8 databases, there's better than a 1-in-4 chance that we'll fail to find the next entry for the current database within the lookahead window. That's bad, because then the worker will be sitting around waiting when it should be doing stuff. Maybe it will even exit, even though there's work to be done, and even though all the other databases have their own workers already. You can construct way worse examples than this one, too: imagine that there are two databases, each with a worker, and one has 99% of the requests and the other one has 1% of the requests. It's really unlikely that there's going to be an entry for the second database within the lookahead window. And note that increasing the window doesn't really help either: you just need more databases than the size of the lookahead window, or even almost as many as the lookahead window, and things are going to stop working properly. On the other hand, suppose that you have 10 databases and one undo worker. One database is pretty active and generates a continuous stream of undo requests at exactly the same speed we can process them. The others all have 1 pending undo request. Now, what's going to happen is that you'll always find the undo request for the current database within the lookahead window. So, you'll never exit. But that means the undo requests in the other 9 databases will just sit there for all eternity, because there's no other worker to process them. On the other hand, if you had 11 databases, there's a good chance it would work fine, because the new request for the active database would likely be outside the lookahead window, and so you'd find no work to do and exit, allowing a worker to be started up in some other database. It would in turn exit and so on and you'd clear the backlog for the other databases at least for a while, until you picked the active database again. Actually, I haven't looked at the whole patch set, so perhaps there is some solution to this problem contemplated somewhere, but I consider this argument to be pretty good evidence that a fixed lookahead distance is probably the wrong thing. 
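(For what it's worth, those numbers agree with the closed form for a uniform distribution: the chance that none of the next 10 requests belongs to our database is (1 - 1/N)^10, e.g. (7/8)^10 ≈ 26.3% for 8 databases, matching the simulated 26.31%.)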
The right things to do about these problems probably need some discussion, but here's the best idea I have off-hand: instead of just have 3 binary heaps (size, XID, error), have N+1 "undo work trackers", each of which contains 3 binary heaps (size, XID, error). Undo work tracker #0 contains all requests that are not assigned to any other undo work tracker. Each of undo work trackers #1..N contains all the requests for one particular database, but they all start out unused. Before launching an undo worker for a particular database, the launcher must check whether it has an undo work tracker allocated to that database. If not, it allocates one and moves all the work for that database out of tracker #0 and into the newly-allocated tracker. If there are none free, it must first deallocate an undo work tracker, moving any remaining work for that tracker back into tracker #0. With this approach, there's no need for lookahead, because every worker is always pulling from a queue that is database-specific, so the next entry is always guaranteed to be relevant. And you choose N to be equal to the number of workers, so that even if every worker is in a separate database there will be enough trackers for all workers to have one, plus tracker #0 for whatever's left. There still remains the problem of figuring out when a worker should terminate to allow for new workers to be launched, which is a fairly complex problem that deserves its own discussion, but I think this design helps. At the very least, you can see whether tracker #0 is empty. If it is, you might still want to rebalance workers between databases, but you don't really need to worry about databases getting starved altogether, because you know that you can run a worker for every database that has any pending undo. If tracker #0 is non-empty but you have unused workers, you can just allocate trackers for the databases in tracker #0 and move stuff over there to be processed. If tracker #0 is non-empty and all workers are allocated, you are going to need to ask one of them to exit at some point, to avoid starvation. I don't know exactly what the algorithm for that should be; I do have some ideas. I'm not going to include them in this email though, because this email is already long and I don't have time to make it longer right now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
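Sketching the "bool active" lazy-deletion idea from the message above, with hypothetical names (the real RollbackHashEntry layout will differ):

    typedef struct RollbackHashEntry
    {
        FullTransactionId full_xid;
        UndoRecPtr  start_urec_ptr;
        Oid         dbid;
        Size        request_size;
        TimestampTz next_retry_at;
        bool        active;         /* cleared once dequeued via any queue */
    } RollbackHashEntry;

    /* Dequeue, skipping entries already consumed through another queue. */
    static RollbackHashEntry *
    DequeueRequest(binaryheap *queue)
    {
        while (!binaryheap_empty(queue))
        {
            RollbackHashEntry *entry =
                (RollbackHashEntry *) DatumGetPointer(binaryheap_remove_first(queue));

            if (entry->active)
            {
                entry->active = false;  /* other queues will now skip it */
                return entry;
            }
            /* stale pointer: this request was handled via another queue */
        }
        return NULL;
    }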
On Fri, Jul 5, 2019 at 7:39 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 25, 2019 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Fair enough. I have implemented it based on next_retry_at and used a > > constant 10s for the next retry. I have used a #define instead of a > > GUC, as all the other constants for similar things are defined that way as of > > now. One thing to note is that we want the linger time (defined as > > UNDO_WORKER_LINGER_MS) for an undo worker to be more than the failure retry > > time (defined as UNDO_FAILURE_RETRY_DELAY_MS); otherwise, the undo > > worker can exit before retrying the failed requests. > > Uh, I think we want exactly the opposite. We want the workers to exit > before retrying, so that there's a chance for other databases to get > processed, I think. > The workers will exit if there is any chance for other databases to get processed. Basically, we linger only when we find there is no work in other databases. Not only that: even if some new work is added to the queues for some other database, we stop the lingering worker if there is no worker available for the new request that has arrived. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 4, 2019 at 5:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > PFA, the latest version of the undo interface and undo processing patches. Summary of the changes in the patch set 1. Undo Interface - Rebased over the latest undo storage code - Implemented undo page compression (we don't store the common fields in all the records; instead we get them from the first complete record of the page). - As per Robert's comment, UnpackedUndoRecord is divided into two parts: a) all fields which are set by the caller, and b) pointers to structures which are set internally. - Epoch and transaction ID are unified as a full transaction ID - Fixed handling of dbid during recovery (TODO in PrepareUndoInsert) Pending: - Move the loop in UndoFetchRecord to the outside and test performance with keeping a pin vs pin+lock across undo records. This will be done after testing performance over the zheap code. - I need to investigate whether discard checking can be unified between master and HotStandby in the UndoFetchRecord function. 2. Undo Processing - Defect fix in multi-log rollback for subtransactions. - Assorted defect fixes. Others - Fixup for undo log code to handle the full transaction id in UndoLogSlot for discard, and other bug fixes in the undo log. - Fixup for orphan file cleanup to pass dbid in PrepareUndoInsert -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sat, Jul 6, 2019 at 1:47 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > [ new patches ] > > I took a look at 0012 today, Amit's patch for extending the binary > heap machinery, and 0013, Amit's patch for "Infrastructure to register > and fetch undo action requests." > Thanks for looking into the patches. > I don't think that binaryheap_allocate_shm() is a good design. It > presupposes that we want to store the binary heap as its own chunk of > shared memory allocated via ShmemInitStruct(), but we might want to do > something else, like embed it in another structure, store it in a DSM or > DSA, etc., and this function can't do any of that. I think we should > have something more like: > > extern Size binaryheap_size(int capacity); > extern void binaryheap_initialize(binaryheap *, int capacity, > binaryheap_comparator compare, void *arg); > > Then the caller can do something like: > > sz = binaryheap_size(capacity); > bh = ShmemInitStruct(name, sz, &found); > if (!found) > binaryheap_initialize(bh, capacity, comparator, whatever); > > If it wants to get the memory in some other way, it just needs to > initialize bh differently; the rest is the same. Note that there is > no need, in this design, for binaryheap_size/initialize to make use of > "shared" memory. They could equally well be used on backend-local > memory. They do not need to care. You just provide the memory, and > they do their thing. > I didn't have other use cases in mind, and I think to some extent this argument holds true for the existing binaryheap_allocate. If we want to make it more generic, then shouldn't we also change the existing binaryheap_allocate to use this new model, so that the binary heap allocation API is more generic? .. .. > > Now, one objection to the above line of attack is that the different queues > actually contain different types of elements. Apparently, the XID > queue contains elements of type UndoXidQueue and the size queue > contains elements of type UndoSizeQueue. It is worth noting here that > these are bad type names, because they sound like they are describing > a type of queue, but it seems that they are actually describing an > element in the queue. However, there are two larger problems: > > 1. I don't think we should have three different kinds of objects for > each of the three different queues. It seems like it would be much > simpler and easier to just have one kind of object that stores all the > information we need (full_xid, start_urec_ptr, dbid, request_size, > next_retry_at, error_occurred_at) and use that everywhere. You could > object that this would increase the storage space requirement, Yes, this was the reason to keep them separate, but I see your point. > but it > wouldn't be enough to make any real difference and it probably would > be well worth it for the avoidance of complexity. > Okay, will give it a try and see if it can avoid some code complexity. Along with this, I will investigate your other suggestions related to code improvements as well. > > On another note, UNDO_PEEK_DEPTH is bogus. It's used in UndoGetWork() > and it passes the depth argument down to GetRollbackHashKeyFromQueue, > which then does binaryheap_nth() on the relevant queue. 
> Note that this function is another place that ends up duplicating code > because of the questionable decision to have separate types of queue > entries for each different queue; otherwise, it could probably just > take the binary heap into which it's peeking as an argument instead of > having three different cases. But that's not the main point here. The > main point is that it calls a function for whichever type of queue > we've got and gets some kind of queue entry using binaryheap_nth(). > But binaryheap_nth(whatever, 2) does not give you the third-smallest > element in the binary heap. It gives you the third entry in the > array, which may or may not have the heap property, but even if it > does, the third element could be huge. Consider this binary heap: > > 0 1 100000 2 3 100001 100002 4 5 6 7 100003 100004 100005 100006 > > This satisfies the binary heap property, because the element at > position n is always smaller than the elements at positions 2n+1 and > 2n+2 (assuming 0-based indexing). But if you want to look at the > smallest three elements in the heap, you can't just look at indexes > 0..2. The second-smallest element must be at index 1 or 2, but it > could be either place. The third-smallest element could be the other > of 1 and 2, or it could be either child of the smaller one, so there > are three places it might be. In general, a binary heap is not a good > data structure for finding the smallest N elements of a collection > unless N is 1, and what's going to happen with what you've got here is > that we'll sometimes prioritize an item that would not have been > pulled from the queue for a long time over one that would have > otherwise been processed much sooner. > You are right that it won't be the nth smallest element from the queue, and we don't even care about that here. The peeking logic is not to find the next prioritized element but to check if we can find some element for the same database in the next few entries, to avoid frequent undo worker restarts. > I'm not sure that's a > show-stopper, but it doesn't seem good, and the current patch doesn't > seem to have any comments justifying it, or at least not in the places > nearby to where this is actually happening. > I agree that we should add more comments explaining this. > I think there are more problems here, too. Let's suppose that we > fixed the problem described in the previous paragraph somehow, or > decided that it won't actually make a big difference and just ignored > it. Suppose further that we have N active databases which are > generating undo requests. Luckily, we happen to also have N undo > workers available, and let's suppose that as of a certain moment in > time there is exactly one worker in each database. Think about what > will happen when one of those workers goes to look for the next undo > request. It's likely that the first request in the queue will be for > some other database, so it's probably going to have to peek ahead to > find a request for the database to which it's connected -- let's just > assume that there is one. How far will it have to peek ahead? Well, > if the requests are uniformly distributed across databases, each > request has a 1-in-N chance of being the right one. 
> I wrote a little Perl program to estimate the probability that we won't find the next request for our databases within 10 requests as a function of the number of databases:
>
> 1 databases => failure chance with 10 lookahead is 0.00%
> 2 databases => failure chance with 10 lookahead is 0.10%
> 3 databases => failure chance with 10 lookahead is 1.74%
> 4 databases => failure chance with 10 lookahead is 5.66%
> 5 databases => failure chance with 10 lookahead is 10.74%
> 6 databases => failure chance with 10 lookahead is 16.18%
> 7 databases => failure chance with 10 lookahead is 21.45%
> 8 databases => failure chance with 10 lookahead is 26.31%
> 9 databases => failure chance with 10 lookahead is 30.79%
> 10 databases => failure chance with 10 lookahead is 34.91%
> 11 databases => failure chance with 10 lookahead is 38.58%
> 12 databases => failure chance with 10 lookahead is 41.85%
> 13 databases => failure chance with 10 lookahead is 44.91%
> 14 databases => failure chance with 10 lookahead is 47.69%
> 15 databases => failure chance with 10 lookahead is 50.12%
> 16 databases => failure chance with 10 lookahead is 52.34%
> 17 databases => failure chance with 10 lookahead is 54.53%
> 18 databases => failure chance with 10 lookahead is 56.39%
> 19 databases => failure chance with 10 lookahead is 58.18%
> 20 databases => failure chance with 10 lookahead is 59.86%
>
> Assuming my script (attached) doesn't have a bug, with only 8 > databases, there's better than a 1-in-4 chance that we'll fail to find > the next entry for the current database within the lookahead window.
This is a good test scenario, but I think it does not take into account that there are multiple queues and we peek into each one. > That's bad, because then the worker will be sitting around waiting > when it should be doing stuff. Maybe it will even exit, even though > there's work to be done, and even though all the other databases have > their own workers already. > I think we should first try an actual program that can test such a scenario on the undo patches before reaching any conclusion. I or one of my colleagues will work on this and report back the results. > You can construct way worse examples than > this one, too: imagine that there are two databases, each with a > worker, and one has 99% of the requests and the other one has 1% of > the requests. It's really unlikely that there's going to be an entry > for the second database within the lookahead window. > I am not sure that is the case, because as soon as the request from the other database gets prioritized (say because its XID becomes older) and comes up as the first request in one of the queues, the undo worker will exit (provided it has worked for some threshold time (10s) in that database) and allow the request from another database to be processed. > And note that > increasing the window doesn't really help either: you just need more > databases than the size of the lookahead window, or even almost as > many as the lookahead window, and things are going to stop working > properly. > > On the other hand, suppose that you have 10 databases and one undo > worker. One database is pretty active and generates a continuous > stream of undo requests at exactly the same speed we can process them. > The others all have 1 pending undo request. Now, what's going to > happen is that you'll always find the undo request for the current > database within the lookahead window. So, you'll never exit.
> Following the logic given above, I think here also the worker will exit as soon as the request from the other database gets prioritized. > But > that means the undo requests in the other 9 databases will just sit > there for all eternity, because there's no other worker to process > them. On the other hand, if you had 11 databases, there's a good > chance it would work fine, because the new request for the active > database would likely be outside the lookahead window, and so you'd > find no work to do and exit, allowing a worker to be started up in > some other database. > As explained above, I think it will work the same way for both 10 and 11 databases. Note that we don't always try to look ahead. We look ahead when we have not worked on the current database for some threshold amount of time. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
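For reference, the failure probabilities in Robert's table match the closed form ((N-1)/N)^10, the chance that ten independent, uniformly distributed requests all belong to other databases; the small deviations from the table suggest the attached script estimates the probability by simulation rather than computing it directly. A minimal C sketch of the direct calculation (an illustration, not the attached Perl script):

#include <math.h>
#include <stdio.h>

/*
 * Chance that none of the next `depth` queue entries is for our
 * database, assuming requests are uniformly distributed across
 * `ndb` databases.
 */
int
main(void)
{
	const int	depth = 10;

	for (int ndb = 1; ndb <= 20; ndb++)
	{
		double		miss = pow((double) (ndb - 1) / ndb, depth);

		printf("%2d databases => failure chance with %d lookahead is %5.2f%%\n",
			   ndb, depth, miss * 100.0);
	}
	return 0;
}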
On Mon, Jul 8, 2019 at 6:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I didn't have other use cases in mind, and I think to some extent this > argument holds true for the existing binaryheap_allocate. If we > want to make it more generic, then shouldn't we try to change the > existing binaryheap_allocate to use this new model as well, so that the > binary heap allocation API becomes more generic? No. binaryheap_allocate is fine for simple cases and there's no reason that I can see to change it. > You are right that it won't be the nth-smallest element from the queue, and > we don't even care about that here. The peeking logic is not to find > the next prioritized element but to check if we can find some element > for the same database in the next few entries, to avoid frequent undo > worker restarts. You *should* care for that here. The whole purpose of a binary heap is to help us choose which task we should do first and which ones should be done later. I don't see why it's OK to decide that we only care about doing the tasks in priority order sometimes, and other times it's OK to just pick semi-randomly. > This is a good test scenario, but I think it does not take into account > that there are multiple queues and we peek into each one. I think that makes very little difference, so I don't see why it should be considered. It's true that this will sometimes mask the problem, but so what? An algorithm that works 90% of the time is not much better than one that works 80% of the time, and neither is up to the caliber of work we expect to see in PostgreSQL. > I think we should first try an actual program that can test such > a scenario on the undo patches before reaching any conclusion. I or > one of my colleagues will work on this and report back the results. There is certainly a place for empirical testing of a patch like this (perhaps even before posting it). It does not substitute for a good theoretical explanation of why the algorithm is correct, and I don't think it is. > > You can construct way worse examples than > > this one, too: imagine that there are two databases, each with a > > worker, and one has 99% of the requests and the other one has 1% of > > the requests. It's really unlikely that there's going to be an entry > > for the second database within the lookahead window. > > I am not sure that is the case, because as soon as the request from > the other database gets prioritized (say because its XID becomes older) and > comes up as the first request in one of the queues, the undo worker will > exit (provided it has worked for some threshold time (10s) in that > database) and allow the request from another database to be processed. I don't see how this responds to what I wrote. Neither worker needs to exit in this scenario, but the worker from the less-popular database is likely to exit anyway, which seems like it's probably not the right thing. > > And note that > > increasing the window doesn't really help either: you just need more > > databases than the size of the lookahead window, or even almost as > > many as the lookahead window, and things are going to stop working > > properly. > > > > On the other hand, suppose that you have 10 databases and one undo > > worker. One database is pretty active and generates a continuous > > stream of undo requests at exactly the same speed we can process them. > > The others all have 1 pending undo request. Now, what's going to > > happen is that you'll always find the undo request for the current > > database within the lookahead window.
So, you'll never exit. > > Following the logic given above, I think here also the worker will exit as > soon as the request from the other database gets prioritized. OK. > > But > > that means the undo requests in the other 9 databases will just sit > > there for all eternity, because there's no other worker to process > > them. On the other hand, if you had 11 databases, there's a good > > chance it would work fine, because the new request for the active > > database would likely be outside the lookahead window, and so you'd > > find no work to do and exit, allowing a worker to be started up in > > some other database. > > As explained above, I think it will work the same way for both 10 and > 11 databases. Note that we don't always try to look ahead. We look > ahead when we have not worked on the current database for some > threshold amount of time. That's interesting, and it means that some of the scenarios that I mentioned are not problems. However, I don't believe it means that your code is actually correct. It just means that it's wrong in different ways. The point is that, with the way you've implemented this, whenever you do lookahead, you will, basically randomly, sometimes find the next entry for the current database within the lookahead window, and sometimes you won't. And sometimes it will be the next-highest-priority request, and sometimes it won't. That just cannot possibly be the right thing to do. Would you propose to commit a patch that implemented the following pseudocode?

find-next-thing-to-do:
    see if the highest-priority task in any database is for our database.
    if it is, do it and stop here.
    if it is not, and if we haven't worked on the current database for at least 10 seconds, look for an item in the current database.
    ...but don't look very hard, so that we'll sometimes, semi-randomly, find nothing even when there is something we could do.
    ...and also, sometimes find a lower-priority item that we can do, possibly much lower-priority, instead of the highest-priority thing we can do.

Because that's what your patch is doing. In contrast, the algorithm that I proposed would work like this:

find-next-thing-to-do:
    find the highest-priority item for the current database.
    do it.

I venture to propose that the second one is the superior algorithm here. One problem with the second algorithm, which I pointed out in my previous email, is that sometimes we might want the worker to exit even though there is work to do in the current database. My algorithm makes no provision for that, and yours does. However, yours does that in a way that's totally unprincipled: it just sometimes fails to find any work that it could do even though there is work that it could do. No amount of testing or argumentation is going to convince me that this is a good approach. The decision about when a worker should exit to allow a new one to be launched needs to be based on clear, understandable rules, not on something that happens semi-randomly when a haphazard search for the next entry fails, as if by chance, to find it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jul 6, 2019 at 8:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jul 4, 2019 at 5:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > PFA, the latest version of the undo interface and undo processing patches.
>
> Summary of the changes in the patch set
>
> 1. Undo Interface
> - Rebased over latest undo storage code
> - Implemented undo page compression (don't store the common fields in all the records; instead, we get them from the first complete record of the page).
> - As per Robert's comment, UnpackedUndoRecord is divided into two parts: a) all fields which are set by the caller, b) pointers to structures which are set internally.
> - Epoch and the transaction id are unified as a full transaction id
> - Fixed handling of dbid during recovery (TODO in PrepareUndoInsert)
>
> Pending:
> - Move the loop in UndoFetchRecord to outside and test performance with keeping pin vs pin+lock across undo records. This will be done after testing performance over the zheap code.
> - I need to investigate whether Discard checking can be unified in master and HotStandby in the UndoFetchRecord function.
>
> 2. Undo Processing
> - Defect fix in multi-log rollback for subtransactions.
> - Assorted defect fixes.
>
> Others
> - Fixup for undo log code to handle full transaction id in UndoLogSlot for discard, and other bug fixes in undo log.
> - Fixup for Orphan file cleanup to pass dbid in PrepareUndoInsert

PFA, updated patch version which includes
- One defect fix in undo interface related to undo page compression for handling persistence level
- Implemented pending TODO optimization in undo page compression.
- One defect fix in undo processing related to the prepared transaction

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
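To illustrate the page-compression item in the summary above: records after the first complete record on an undo page omit header fields that are common across a transaction's records, and a reader fills them back in from that first record. A rough sketch of the read side; every identifier here is illustrative rather than taken from the patch:

/* Rough sketch of decompressing an undo record header; all names are
 * illustrative, not the patch's actual identifiers. */
UnpackedUndoRecord uur;

UndoUnpackRecord(page, offset, &uur);
if ((uur.uur_info & UREC_INFO_RMID) == 0)
{
	UnpackedUndoRecord first;

	/* Field was omitted: fetch it from the page's first complete record. */
	UndoUnpackRecord(page, UndoPageGetFirstRecordOffset(page), &first);
	uur.uur_rmid = first.uur_rmid;
}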
On Sat, Jul 6, 2019 at 1:47 AM Robert Haas <robertmhaas@gmail.com> wrote: > > In fact, it seems to me that we shouldn't have any such thing as > "queue entries" at all. The queues should just be pointing to > RollbackHashEntry *, and we should add all the fields there that are > present in any of the "queue entry" structures. This would use less > memory still. > As of now, after we finish executing the rollback actions, the entry from the hash table is removed. Now, at a later time (when queues are full and we want to insert a new entry) when we access the queue entry (to check whether we can remove it) corresponding to the removed hash table entry, will it be safe to access it? The hash table entry might have been freed or would have been reused as some other entry by the time we try to access it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 10, 2019 at 2:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > As of now, after we finish executing the rollback actions, the entry > from the hash table is removed. Now, at a later time (when queues are > full and we want to insert a new entry) when we access the queue entry > (to check whether we can remove it) corresponding to the removed hash > table entry, will it be safe to access it? The hash table entry might > have been freed or would have been reused as some other entry by the > time we try to access it. Hmm, yeah, that's a problem. I think we could possibly fix this by having the binary heaps just store a FullTransactionId rather than a pointer to the RollBackHashEntry. Then, if you get a FullTransactionId from the binary heap, you just do a hash table lookup to find the RollBackHashEntry instead of accessing it directly. If it doesn't exist, then you can just discard the entry: it's for some old transaction that's no longer relevant. However, there are a few problems with that idea. One is that I see that you've made the hash table keyed by full_xid + start_urec_ptr rather than just full_xid, so if the queues just point to an XID, it's not enough to find the hash table entry. The comment claims that this is necessary because "in the same transaction, there could be rollback requests for both logged and unlogged relations," but I don't understand why that means we need start_urec_ptr in the hash table key. It would seem more natural to me to have a single entry that covers both the logged and the unlogged undo for that transaction. (Incidentally, I don't think it's correct that RollbackHashEntry starts with FullTransactionId full_xid + UndoRecPtr start_urec_ptr declared separately; I think it should start with RollbackHashKey - although if we change the key back to just a FullTransactionId then we don't need to worry separately about fixing this issue.) Another problem is that on a 64-bit system, we can pass a FullTransactionId by value, but on a 32-bit system we can't. That's awkward, because if we can't pass the XID by value, then we're back to needing a separately-allocated structure for the queue entries, which I was really hoping to avoid. A second possible approach to this problem is to just reset all the binary heaps (using binaryheap_reset) whenever we insert a new entry into the hash table, and rebuild them the next time they're needed by reinserting all of the current entries in the hash table. That might be too inefficient. You can insert a bunch of things in a row without re-heaping, and you can dequeue a bunch of things in a row without re-heaping, but if they alternate you'll re-heap a lot. I don't know whether that costs enough to worry about; it might be fine. A third possible approach is to allocate a separate array whose entries are reused, and to maintain a freelist of entries from that array. All the real data is stored in this array, and the binary heaps and hash table entries just point to it. When the freelist is empty, the next allocate scans all the binary heaps and removes any pointers to inactive entries; it then puts all inactive entries back onto the freelist. This is more complex than the previous approach, and it doesn't totally avoid re-heaping, because removing pointers to inactive entries from the binary heaps will necessitate a re-heap on next access.
However, if the total capacity of the data structures is large compared to the number of entries actually in use, which will usually be true, we'll have to re-heap much less often, because we only have to do it when the number of allocations exhausts *everything* on the free-list, rather than after every allocation. A fourth possible approach is to enhance the simplehash mechanism to allow us to do cleanup when an item to which there might still be residual pointers is reused. We could allow some code supplied by the definer of an individual simplehash implementation to be executed inside SH_INSERT, just at the point where we're going to set an entry's status to SH_STATUS_IN_USE. What we'd do is add a flag to the structure indicating whether there might be deferred cleanup work for that entry. Maybe it would be called something like 'bool processed' and set when we process the undo work for that entry. If, when we're about to reuse an entry, that flag is set, then we go scan all the binary heaps and remove all entries for which that flag is set. And then we unset the flag for all of those entries. Like the previous approach, this is basically a refinement of the second approach in that it tries to avoid re-heaping too often. Here, instead of re-heaping once we've been through the entire free-list, we'll re-heap when we (more or less randomly) happen to reuse a hash table entry that's been reused, but we avoid it when we happen to snag a hash table entry that hasn't been reused recently. This is probably less efficient at avoiding re-heaping than the previous approach, but it avoids a separately-allocated data structure, which is nice. Broadly, you are correct to point out that you need to avoid chasing stale pointers, and there are a bunch of ways to accomplish that: approach #1 avoids using real pointers, and the rest just make sure that any stale pointers don't stick around long enough to cause any harm. There are probably also several other totally realistic alternatives, and I don't know for sure what is best, or how much it matters. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
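For concreteness, approach #1 would make the dequeue path look something like the sketch below. binaryheap_remove_first() and hash_search() are the existing APIs; the queue and entry names are illustrative, and this assumes a 64-bit build where a FullTransactionId fits in a Datum:

/*
 * Sketch of approach #1: the queues store bare FullTransactionIds, so a
 * dequeuing worker re-looks the request up in the hash table and simply
 * skips entries whose transactions have already been undone.
 */
static RollbackHashEntry *
GetNextRollbackRequest(binaryheap *queue)
{
	while (!binaryheap_empty(queue))
	{
		FullTransactionId fxid;
		RollbackHashEntry *rh;
		bool		found;

		fxid.value = DatumGetUInt64(binaryheap_remove_first(queue));
		rh = (RollbackHashEntry *) hash_search(RollbackRequestHash,
											   &fxid, HASH_FIND, &found);
		if (found)
			return rh;			/* live request */
		/* otherwise: stale entry for an already-processed xact; skip it */
	}
	return NULL;
}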
On Tue, Jul 9, 2019 at 6:28 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > PFA, updated patch version which includes > - One defect fix in undo interface related to undo page compression > for handling persistence level > - Implemented pending TODO optimization in undo page compression. > - One defect fix in undo processing related to the prepared transaction Looking at 0002 a bit, it seems to me that you really need to spend some energy getting things into a consistent order all across the patch. For example, UndoPackStage uses the ordering: HEADER, TRANSACTION, RMID, RELOID, XID, CID... But the declarations of the UREC_INFO constants go in a different order: TRANSACTION, FORK, BLOCK, BLKPREV... The comments defining those go in a different order and some of them are missing. The definitions of the UndoRecordBlah structures go in a different order still: Transaction, Block, LogSwitch, Payload. UndoRecordHeaderSize goes with FORK, BLOCK, BLPREV, TRANSACTION, LOGSWITCH, .... That really needs to be straightened out and made consistent. You (still) need to rename blkprev to something more generic, as mentioned in previous rounds of review. I think it would be a good idea to avoid complex macros in favor of functions where possible, e.g. UNDO_PAGE_PARTIAL_REC_SIZE. If performance is a concern, it could be declared static inline, which should be as good as a macro. I don't like the fact that undoaccess.c has a new global, undo_compression_info. I haven't read the code thoroughly, but do we really need that? I think it's never modified (so it could just be declared const), and I also think it's just all zeroes (so initializing it isn't really necessary), and I also think that it's just used for initializing other UndoCompressionInfos (so we could just initialize them directly, either by setting the members individually or just zeroing them). It seems like UndoRecordPrepareTransInfo ought to have an Assert(index < some_limit) in the loop. A comment in PrepareUndoInsert refers to "low switch" where it means "log switch." This is by no means a complete review, for which I unfortunately lack the time at present. Just some initial observations. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
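To illustrate the macro-versus-function point with a made-up example (the real UNDO_PAGE_PARTIAL_REC_SIZE computation lives in the patch and is not shown here; the field names below are invented):

/* A macro version evaluates its argument more than once and gets no
 * type checking: */
#define UNDO_PAGE_PARTIAL_REC_SIZE(phdr) \
	((phdr)->rec_len - (phdr)->bytes_left_on_prev_page)

/* The static inline form costs the same after inlining, but is
 * type-checked and evaluates its argument exactly once: */
static inline Size
UndoPagePartialRecSize(const UndoPageHeaderData *phdr)
{
	return phdr->rec_len - phdr->bytes_left_on_prev_page;
}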
On Wed, Jul 10, 2019 at 12:36 PM Robert Haas <robertmhaas@gmail.com> wrote: > Broadly, you are correct to point out that you need to avoid chasing > stale pointers, and there are a bunch of ways to accomplish that: > approach #1 avoids using real pointers, and the rest just make sure > that any stale pointers don't stick around long enough to cause any > harm. There are probably also several other totally realistic > alternatives, and I don't know for sure what is best, or how much it > matters. After some off-list discussion with Andres ... Another possible approach here, which I think I like better, is to switch from using a binary heap to using an rbtree. That wouldn't work well in DSM because of the way it uses pointers, but here we're putting data in the shared memory segment so it seems like it should work. The idea would be to allocate an array of entries with a freelist, and then have allocfunc and freefunc defined to push and pop the freelist. Unlike a binary heap, an rbtree lets us (a) do peek-ahead in sorted order and (b) delete elements from an arbitrary position without rebuilding anything. If we adopt this approach, then I think a bunch of the problems we've been talking about actually get a lot easier. If we pull an item from the ordered-by-XID rbtree or the ordered-by-undo-size rbtree, we can remove it from the other one cheaply, because we can store a pointer to the RBTNode in the main object. So then we never have any stale pointers in any data structure, which means we don't have to have a strategy to avoid accidentally following them. The fact that we can peek ahead correctly without any new code is also very nice. I'm still concerned that peeking ahead isn't the right approach in general, but if we're going to do it, peeking ahead to the actually-next-highest-priority item is a lot better than peeking ahead to some-item-that-may-be-fairly-high-priority. One problem which Andres spotted is that rbt_delete() can actually move content around, so if you just cache the RBTNode returned by rbt_insert(), it might not be the right one by the time you rbt_delete(), if other stuff has been deleted first. There are several possible approaches to that problem, but one that I'm wondering about is modifying rbt_delete_node() so that it doesn't rely on rbt_copy_data. The idea is that if y != z, instead of copying the data from y to z, copy the left/right/parent pointers from z into y, and make z's left, right, and parent nodes point to y instead. Then we always end up removing the correct node, which would make things much easier for us and might well be helpful to other code that uses rbtree as well. Another small problem, also spotted by Andres, is that rbt_create() uses palloc. That seems easy to work around: just provide an rbt_initialize() function that a caller can use instead if it wants to initialize an already-allocated block of memory. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
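To make the freelist idea concrete: the rbtree callbacks could hand out nodes from a preallocated shared-memory array, using the void *arg hook that rbt_create() already accepts. Everything below except the rule that the RBTNode must be the first member is illustrative:

typedef struct UndoRequestNode
{
	RBTNode		rbtnode;		/* rbtree header; must be first */
	FullTransactionId full_xid; /* request payload (abbreviated) */
	UndoRecPtr	start_urec_ptr;
	Size		request_size;
	struct UndoRequestNode *next_free;	/* threads the freelist */
} UndoRequestNode;

typedef struct UndoRequestNodePool
{
	UndoRequestNode *freelist;	/* head of the free node list */
} UndoRequestNodePool;

static RBTNode *
undo_rbt_alloc(void *arg)
{
	UndoRequestNodePool *pool = (UndoRequestNodePool *) arg;
	UndoRequestNode *node = pool->freelist;

	Assert(node != NULL);		/* pool is sized for the hard limit */
	pool->freelist = node->next_free;
	return &node->rbtnode;
}

static void
undo_rbt_free(RBTNode *x, void *arg)
{
	UndoRequestNodePool *pool = (UndoRequestNodePool *) arg;
	UndoRequestNode *node = (UndoRequestNode *) x;

	node->next_free = pool->freelist;
	pool->freelist = node;
}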
On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jul 9, 2019 at 6:28 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > PFA, updated patch version which includes > > - One defect fix in undo interface related to undo page compression > > for handling persistence level > > - Implemented pending TODO optimization in undo page compression. > > - One defect fix in undo processing related to the prepared transaction > > Looking at 0002 a bit, it seems to me that you really need to spend > some energy getting things into a consistent order all across the > patch. For example, UndoPackStage uses the ordering: HEADER, > TRANSACTION, RMID, RELOID, XID, CID... But the declarations of the > UREC_INFO constants go in a different order: TRANSACTION, FORK, BLOCK, > BLKPREV... The comments defining those go in a different order and > some of them are missing. The definitions of the UndoRecordBlah > structures go in a different order still: Transaction, Block, > LogSwitch, Payload. UndoRecordHeaderSize goes with FORK, BLOCK, > BLPREV, TRANSACTION, LOGSWITCH, .... That really needs to be > straightened out and made consistent. > Thanks for the review, I will work on this. > You (still) need to rename blkprev to something more generic, as > mentioned in previous rounds of review. I will change this. > > I think it would be a good idea to avoid complex macros in favor of > functions where possible, e.g. UNDO_PAGE_PARTIAL_REC_SIZE. If > performance is a concern, it could be declared static inline, which > should be as good as a macro. ok > > I don't like the fact that undoaccess.c has a new global, > undo_compression_info. I haven't read the code thoroughly, but do we > really need that? I think it's never modified (so it could just be > declared const), Actually, this will get modified; otherwise, across undo record insertions, how will we know the values of the common fields in the first record of the page? Another option could be that every time we insert a record, we read the values from the first complete undo record on the page, but that would be costly, because for every new insertion we would need to read the first undo record of the page. Currently, we are doing it like this:

a) BeginUndoRecordInsert - copy the global "undo_compression_info" into our local context, for handling multi-prepare; for multi-prepare we don't want to update the global value until we have successfully inserted the undo record.

b) PrepareUndoInsert - operate on the context and update context->undo_compression_info if required (page changed).

c) InsertPrepareUndo - after we have inserted successfully, copy context->undo_compression_info back to the global "undo_compression_info", so that the next undo insertion can get the right information.

and I also think it's just all zeroes (so > initializing it isn't really necessary), and I also think that it's > just used for initializing other UndoCompressionInfos (so we could > just initialize them directly, either by setting the members > individually or just zeroing them). Initially, I was doing that, but later I thought that since InvalidUndoRecPtr is a macro (although its value is 0), shouldn't we initialize all UndoRecPtr variables with InvalidUndoRecPtr instead of using 0 directly? So I changed it like this. > > It seems like UndoRecordPrepareTransInfo ought to have an Assert(index > < some_limit) in the loop. > > A comment in PrepareUndoInsert refers to "low switch" where it means > "log switch." I will fix.
> > This is by no means a complete review, for which I unfortunately lack > the time at present. Just some initial observations. > ok -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 10, 2019 at 10:06 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jul 10, 2019 at 2:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > As of now, after we finish executing the rollback actions, the entry > > from the hash table is removed. Now, at a later time (when queues are > > full and we want to insert a new entry) when we access the queue entry > > (to check whether we can remove it) corresponding to the removed hash > > table entry, will it be safe to access it? The hash table entry might > > have been freed or would have been reused as some other entry by the > > time we try to access it. > > Hmm, yeah, that's a problem. I think we could possibly fix this by > having the binary heaps just store a FullTransactionId rather than a > pointer to the RollBackHashEntry. Then, if you get a > FullTransactionId from the binary heap, you just do a hash table > lookup to find the RollBackHashEntry instead of accessing it directly. > If it doesn't exist, then you can just discard the entry: it's for > some old transaction that's no longer relevant. > > However, there are a few problems with that idea. One is that I see > that you've made the hash table keyed by full_xid + start_urec_ptr > rather than just full_xid, so if the queues just point to an XID, it's > not enough to find the hash table entry. The comment claims that this > is necessary because "in the same transaction, there could be rollback > requests for both logged and unlogged relations," but I don't > understand why that means we need start_urec_ptr in the hash table > key. It would seem more natural to me to have a single entry that > covers both the logged and the unlogged undo for that transaction. > The data for logged and unlogged undo are in separate logs, so the discard worker can encounter them at different times. It is quite possible that by the time it encounters the second request, some undo worker is already halfway through processing the first request. It might be feasible to combine them during foreground work, but after startup, or at other times when the discard worker has to register the request, it won't be feasible to have one entry; at the least, we would need more smarts to ensure that we can always edit the hash table entry at a later time to append the request. I have thought about keeping full_xid + persistence_level/undo_category as a key, but as we need start_ptr for the request anyway, it seems appealing to use the same. Also, even if we try to support one entry for logged and unlogged undo, it won't always be possible to have one request for it, as in the case explained for the discard worker. > (Incidentally, I don't think it's correct that RollbackHashEntry > starts with FullTransactionId full_xid + UndoRecPtr start_urec_ptr > declared separately; I think it should start with RollbackHashKey - > although if we change the key back to just a FullTransactionId then we > don't need to worry separately about fixing this issue.) > Agreed. It seems that before we analyze or discuss in detail the other solutions related to dangling entries, it is better to investigate the rbtree idea you and Andres came up with, as on a quick look it seems that might avoid creating the dangling entries in the first place. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 11, 2019 at 1:29 AM Robert Haas <robertmhaas@gmail.com> wrote: > > After some off-list discussion with Andres ... > > Another possible approach here, which I think I like better, is to > switch from using a binary heap to using an rbtree. That wouldn't > work well in DSM because of the way it uses pointers, but here we're > putting data in the shared memory segment so it seems like it should > work. The idea would be to allocate an array of entries with a > freelist, and then have allocfunc and freefunc defined to push and pop > the freelist. Unlike a binary heap, an rbtree lets us (a) do > peek-ahead in sorted order and (b) delete elements from an arbitrary > position without rebuilding anything. > > If we adopt this approach, then I think a bunch of the problems we've > been talking about actually get a lot easier. If we pull an item from > the ordered-by-XID rbtree or the ordered-by-undo-size rbtree, we can > remove it from the other one cheaply, because we can store a pointer > to the RBTNode in the main object. So then we never have any stale > pointers in any data structure, which means we don't have to have a > strategy to avoid accidentally following them. > > The fact that we can peek ahead correctly without any new code is also > very nice. I'm still concerned that peeking ahead isn't the right > approach in general, but if we're going to do it, peeking ahead to the > actually-next-highest-priority item is a lot better than peeking ahead > to some-item-that-may-be-fairly-high-priority. > > One problem which Andres spotted is that rbt_delete() can actually > move content around, so if you just cache the RBTNode returned by > rbt_insert(), it might not be the right one by the time you > rbt_delete(), if other stuff has been deleted first. There are > several possible approaches to that problem, but one that I'm > wondering about is modifying rbt_delete_node() so that it doesn't rely > on rbt_copy_data. The idea is that if y != z, instead of copying the > data from y to z, copy the left/right/parent pointers from z into y, > and make z's left, right, and parent nodes point to y instead. > I am not sure, but don't we need to retain the color of z as well? Apart from this, the handling of duplicate keys (e.g., for the size queue, the sizes of two requests can be the same) might need some work. Basically, either a special combiner function needs to be written (not sure yet what we should do there) or we always need to ensure that the key is unique, like (size + start_urec_ptr). If the size is the same, then we can decide based on start_urec_ptr. I think we can change the implementation to an rbtree with some enhancements instead of the binary heap, or alternatively, we can use one of the two ideas suggested by you in the email above [1] to simplify the code and keep using the binary heap for now. Especially, I like the below one. "2. However, I don't think we should have a separate request object for each queue anyway. We should insert pointers to the same objects in all the relevant queues (either size + XID, or else error). So instead of having three sets of objects, one for each queue, we'd just have one set of objects and point to them with as many as two pointers. We'd therefore need LESS memory than we're using today, because we wouldn't have separate arrays for XID, size, and error queue elements." I think even if we currently go with a binary heap, it will be possible to change it to an rbtree later, but I am fine either way.
[1] - https://www.postgresql.org/message-id/CA%2BTgmoZ5g7UzMvM_42YMG8nbhOYpH%2Bu5OMMnePJkYtT5HWotUw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
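For concreteness, idea 2 amounts to a single shared request object carrying the fields Robert listed earlier in the thread, enqueued by pointer in both orderings; the layout and names below are illustrative:

typedef struct UndoRequestData
{
	FullTransactionId full_xid;
	UndoRecPtr	start_urec_ptr;
	Oid			dbid;
	Size		request_size;
	TimestampTz next_retry_at;
	TimestampTz error_occurred_at;
} UndoRequestData;

UndoRequestData *req;			/* one shared object per rollback request */

/* The XID and size queues both point at the same object: */
binaryheap_add(undo_xid_queue, PointerGetDatum(req));
binaryheap_add(undo_size_queue, PointerGetDatum(req));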
On Fri, Jul 12, 2019 at 5:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I am not sure, but don't we need to retain the color of z as well? I believe that would be very wrong. If you recolor an internal node, you'll break the constant-black-height invariant. > Apart from this, the handling of duplicate keys (e.g., for the size queue, > the sizes of two requests can be the same) might need some work. Basically, > either a special combiner function needs to be written (not sure yet what > we should do there) or we always need to ensure that the key is unique, > like (size + start_urec_ptr). If the size is the same, then we can > decide based on start_urec_ptr. I think that this problem is somewhat independent of whether we use an rbtree or a binaryheap or some other data structure. I would be inclined to use XID as a tiebreak for the size queue, so that it's more likely to match the ordering of the XID queue, but if that's inconvenient, then some other arbitrary value like start_urec_ptr should be fine. > I think we can change the implementation to an rbtree with some > enhancements instead of the binary heap, or alternatively, we can > use one of the two ideas suggested by you in the email above [1] to > simplify the code and keep using the binary heap for now. Especially, > I like the below one. > "2. However, I don't think we should have a separate request object > for each queue anyway. We should insert pointers to the same objects > in all the relevant queues (either size + XID, or else error). So > instead of having three sets of objects, one for each queue, we'd just > have one set of objects and point to them with as many as two > pointers. > We'd therefore need LESS memory than we're using today, because we > wouldn't have separate arrays for XID, size, and error queue > elements." > > I think even if we currently go with a binary heap, it will be > possible to change it to an rbtree later, but I am fine either way. Well, I don't see much point in revising all of this logic twice. We should pick the way we want it to work and make it work that way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 12, 2019 at 7:08 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jul 12, 2019 at 5:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Apart from this, the handling of duplicate keys (e.g., for the size queue, > > the sizes of two requests can be the same) might need some work. Basically, > > either a special combiner function needs to be written (not sure yet what > > we should do there) or we always need to ensure that the key is unique, > > like (size + start_urec_ptr). If the size is the same, then we can > > decide based on start_urec_ptr. > > I think that this problem is somewhat independent of whether we use an > rbtree or a binaryheap or some other data structure. > I think then I am missing something, because what I am talking about is the below code in rbt_insert:

rbt_insert()
{
..
    cmp = rbt->comparator(data, current, rbt->arg);
    if (cmp == 0)
    {
        /*
         * Found node with given key.  Apply combiner.
         */
        rbt->combiner(current, data, rbt->arg);
        *isNew = false;
        return current;
    }
..
}

As you can see, here it doesn't add the duplicate key to the tree, which is not the case with binary_heap as far as I can tell. > I would be > inclined to use XID as a tiebreak for the size queue, so that it's > more likely to match the ordering of the XID queue, but if that's > inconvenient, then some other arbitrary value like start_urec_ptr > should be fine. > I think it would be better to use start_urec_ptr, because XID can be non-unique in our case. As I explained in one of the emails above [1], we register the requests for logged and unlogged relations separately, so XID can be non-unique. > > > > I think even if we currently go with a binary heap, it will be > > possible to change it to an rbtree later, but I am fine either way. > > Well, I don't see much point in revising all of this logic twice. We > should pick the way we want it to work and make it work that way. > Yeah, I agree. So, I am assuming here that since you have discussed this idea with Andres off-list, he is on board with changing it, as he had originally suggested using binary_heap. Andres, do let us know if you think differently here. It would be good if anyone else following the thread could also weigh in. [1] - https://www.postgresql.org/message-id/CAA4eK1LEKyPZD5Dy4j1u2smUUyMzxgC2YLj8E%2BaJpsvG7sVJYA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
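Concretely, a size-queue comparator along the lines Amit describes, written against the real rbt_comparator signature and reusing the illustrative UndoRequestNode from the freelist sketch above:

static int
undo_size_comparator(const RBTNode *a, const RBTNode *b, void *arg)
{
	const UndoRequestNode *ra = (const UndoRequestNode *) a;
	const UndoRequestNode *rb = (const UndoRequestNode *) b;

	if (ra->request_size != rb->request_size)
		return (ra->request_size < rb->request_size) ? -1 : 1;

	/* Tiebreak on start_urec_ptr, which is unique per request. */
	if (ra->start_urec_ptr != rb->start_urec_ptr)
		return (ra->start_urec_ptr < rb->start_urec_ptr) ? -1 : 1;

	return 0;
}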
On Sat, Jul 13, 2019 at 6:26 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I think then I am missing something, because what I am talking about is > the below code in rbt_insert: What you're saying here is that, with an rbtree, an exact match will result in a merging of requests which we don't want, so we have to make them always unique. That's fine, but even if you used a binary heap where it wouldn't be absolutely required that you break the ties, you'd still want to think at least a little bit about what behavior is best in case of a tie, just from the point of view of making the system efficient. > I think it would be better to use start_urec_ptr, because XID can be > non-unique in our case. As I explained in one of the emails above [1], > we register the requests for logged and unlogged relations > separately, so XID can be non-unique. Yeah. I didn't understand that explanation. It seems to me that one of the fundamental design questions for this system is whether we should allow there to be an unbounded number of transactions that are pending undo application, or whether it's OK to enforce a hard limit. Either way, there should certainly be pressure applied to try to keep the number low, like forcing undo application into the foreground when a backlog is accumulating, but the question is what to do when that's insufficient. My original idea was that we should not have a hard limit, in which case the shared memory data on what is pending might be incomplete, in which case we would need the discard workers to discover transactions needing undo and add them to the shared memory data structures, and if those structures are full, then we'd just skip adding those details and rediscover those transactions again at some future point. But, my understanding of the current design being implemented is that there is a hard limit on the number of transactions that can be pending undo and the in-memory data structures are sized accordingly. In such a system, we cannot rely on the discard worker(s) to (re)discover transactions that need undo, because if there can be transactions that need undo that we don't know about, then we can't enforce a hard limit correctly. The exception, I suppose, is that after a crash, we'll need to scan all the undo logs and figure out which transactions are pending, but that doesn't preclude using a single queue entry covering both the logged and the unlogged portion of a transaction that has written undo of both kinds. We've got to scan all of the undo logs before we allow any new undo-using transactions to start, and so we can create one fully-up-to-date entry that reflects the data for both persistence levels before any concurrent activity happens. I am wondering (and would love to hear other opinions on) the question of which kind of design we ought to be pursuing, but it's got to be one or the other, not something in the middle. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote: > 2. Introduced a new RMGR callback rm_undo_status. It is used to > decide when record sets in the UNDO_SHARED category should be > discarded (instead of the usual single xid-based rules). The possible > answers are "discard me now!", "ask me again when a given XID is all > visible", and "ask me again when a given XID is no longer running". From the minor nitpicking department, the patches from this stack that are updating rmgrlist.h are consistently failing to update the comment line preceding the list of PG_RMGR() lines. This looks to be patches 0014 and 0015 in this stack; 0015 seems to need to be squashed into 0014. Reviewing Amit's 0016: performUndoActions appears to be badly designed. For starters, it's sometimes wrong: the only place it gets set to true is in UndoActionsRequired (which is badly named, because from the name you expect it to return a Boolean and to not have side effects, but instead it doesn't return anything and does have side effects). UndoActionsRequired() only gets called from selected places, like AbortCurrentTransaction(), so the rest of the time it just returns a wrong answer. Now maybe it's never called at those times, but there's no guard to prevent a function like CanPerformUndoActions() (which is also badly named, because performUndoActions tells you whether you need to perform undo actions, not whether it's possible to perform undo actions) from being called before the flag is set. I think that this flag should be either (1) maintained eagerly - so that wherever we set start_urec_ptr we also set the flag right away or (2) removed - so when we need to know, we just loop over all of the undo categories on the spot, which is not that expensive because there aren't that many of them. It seems pointless to make PrepareTransaction() take undo pointers as arguments, because those pointers are just extracted from the transaction state, to which PrepareTransaction() has a pointer. Thomas has already objected to another proposal to add functions that turn 32-bit XIDs into 64-bit XIDs. Therefore, I feel confident in predicting that he will likewise object to GetEpochForXid. I think this needs to be changed somehow, maybe by doing what the XXX comment you added suggests. This patch has some problems with naming consistency. There's a function called PushUndoRequest() which calls a function called RegisterRollbackReq() to do the heart of the work. So, is it undo or rollback? Are we pushing or registering? Is it a request or a req? For bonus points, the flag that the function sets is called undo_req_pushed, which is halfway in between the two competing terminologies. Other gripes about PushUndoRequest: push is vague and doesn't really explain what's happening, "apllying" is a typo, per_level is a poor variable name and shouldn't be declared volatile. This function has problems with naming in other places, too; please go through all of the names carefully and make them consistent and adequately descriptive. I am not a fan of applying_subxact_undo. I think we should look for a better design there. A couple of things occur to me. One is that we don't necessarily need to go to FATAL; we could just force the current transaction and all of its subtransactions to fail all the way out to the top level, but then perhaps allow new transactions to be started afterwards. I'm not sure that's worth it, but it would work, and I think it has precedent in SxactIsDoomed.
Assuming we're going to stick with the current FATAL plan, I think we should do something like invent a new kind of critical section that forces ERROR to be promoted to FATAL and then use it here. We could call it a semi-critical or locally-critical section, and the undo machinery could use it, but then also so could other things. I've wanted that sort of concept before, so I think it's a good idea to try to have something general and independent of undo. The same concept could be used in PerformUndoActions() instead of having to invent pg_rethrow_as_fatal(), so we'd have two uses for this mechanism right away. FinishPreparedTransactions() tries to apply undo actions while interrupts are still held. Is that necessary? Can we avoid it? It seems highly likely that the logic added to the TBLOCK_SUBCOMMIT case inside CommitTransactionCommand and also into ReleaseCurrentSubTransaction should have been added to CommitSubTransaction instead. If that's not true, then we have to believe that the TBLOCK_SUBRELEASE call to CommitSubTransaction needs different treatment from the other two cases, which sounds unlikely; we also have to explain why undo is somehow different from all of these other releases that are already handled in that function, not in its callers. I also strongly suspect it is altogether wrong to do this before CommitSubTransaction sets s->state to TRANS_COMMIT; what if a subxact callback throws an error? For related reasons, I don't think that the changes to ReleaseSavepoint() are right either. Notice the header comment: "As above, we don't actually do anything here except change blockState." The "as above" part of the comment probably didn't originally refer to DefineSavepoint(), which definitely does do other stuff, but to something like EndImplicitTransactionBlock() or EndTransactionBlock(), and DefineSavepoint() got stuck in the middle later. Anyway, your patch makes the comment false by doing actual state changes in this function, rather than just marking the subtransactions for commit. But why should that be right? If none of the many other bits of state are manipulated here rather than in CommitSubTransaction(), why is undo the one thing that is different? I guess this is basically just compensation for the lack of any of this code in the TBLOCK_SUBRELEASE path which I noted in the previous paragraph, but I still think the right answer is to put it all in CommitSubTransaction() *after* we set TRANS_COMMIT. There are a number of things I either don't like or don't understand about PerformUndoActions. One is that undo_req_pushed gets passed to this function. That just looks really odd from an abstraction point of view. Basically, we have a function whose job is to "perform undo actions," and it gets a flag as an argument that tells it to not actually perform some of the undo actions: that's odd. I think the reason it's like that is because of the issue we've been discussing elsewhere that there's a separate undo request for each category. If you didn't have that, you wouldn't need to do this here. I'm not saying that proves that the one-request-per-persistence-level design is definitely wrong, but this is certainly not a point in its favor, at least IMHO.
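Returning to the semi-critical section idea above, a minimal sketch, modeled on the existing critical-section machinery that promotes ERROR to PANIC; all of the names here are hypothetical:

/* Hypothetical counter, analogous to CritSectionCount. */
extern volatile uint32 SemiCritSectionCount;

#define START_SEMI_CRIT_SECTION()	(SemiCritSectionCount++)
#define END_SEMI_CRIT_SECTION() \
	do { \
		Assert(SemiCritSectionCount > 0); \
		SemiCritSectionCount--; \
	} while (0)

/* ... and in errstart(), next to the existing CritSectionCount test
 * that promotes ERROR to PANIC: */
if (elevel == ERROR && SemiCritSectionCount > 0)
	elevel = FATAL;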
PerformUndoActions() also thinks that there is a possibility of failing to insert a failed request into the error queue, and makes reference to such requests being rediscovered by the discard worker, but I thought (as I said in my previous email) that we had abandoned that approach in favor of always having enough space in shared memory to record everything. Among other problems, if you want oldestXidHavingUndo to be calculated based on the information in shared memory, then you have to have all the records in shared memory, not lose some of them temporarily and have them get re-inserted into the error queue. It also feels to me like there may be a conflict between the everything-must-fit approach and the one-request-per-persistence-level thing you've got here. I believe Andres's idea was one-request-per-transaction, so the idea is something like:

- When your transaction first tries to attach to an undo log, you make a hash table entry.
- If that fails, you error out, but you have no undo, so it's OK.
- If it works, then you know that there's no chance of aborting without making a hash table entry, because you already did it.
- If you commit, you remove the entry, because your transaction does not need to be undone.
- If you abort, you process the entry in the foreground if it's small or if the number of hash table slots remaining is < max_connections. Otherwise you leave it for the background worker to handle.

If you have one request per persistence level, you could make an entry for the first persistence level, and then find that you are out of room when trying to make an entry for the second persistence level. I guess that doesn't break anything: the changes from the first persistence level would get undone, and the second persistence level wouldn't get any undo. Maybe that's OK, but again it doesn't seem all that nice, so maybe we need to think about it some more. I think there's more, but I am out of time for the moment. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
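A sketch of the attach-time registration in that scheme: reserve the rollback slot before writing any undo, so that slot exhaustion can only fail a transaction while it still has nothing to undo. The hash table name and error message are illustrative:

RollbackHashEntry *rh;
bool		found;

rh = (RollbackHashEntry *) hash_search(RollbackRequestHash, &fxid,
									   HASH_ENTER_NULL, &found);
if (rh == NULL)
	ereport(ERROR,
			(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
			 errmsg("too many transactions with pending undo actions")));
/* From here on, an abort is guaranteed to find its hash table entry. */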
On Tue, Jul 16, 2019 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > > PerformUndoActions() also thinks that there is a possibility of > failing to insert a failed request into the error queue, and makes > reference to such requests being rediscovered by the discard worker, > but I thought (as I said in my previous email) that we had abandoned > that approach in favor of always having enough space in shared memory > to record everything. Among other problems, if you want > oldestXidHavingUndo to be calculated based on the information in > shared memory, then you have to have all the records in shared memory, > not lose some of them temporarily and have them get re-inserted into > the error queue. > The idea is that the queues can get full, but not the rollback hash table. In the case where the error queue gets full, we mark the entry as Invalid in the hash table, and later, when the discard worker again encounters this request, it adds it to the queue if there is space available and marks the entry in the hash table as valid. This allows us to keep the information about all xacts having pending undo in shared memory. > It also feels to me like there may be a conflict > between the everything-must-fit approach and the > one-request-per-persistence-level thing you've got here. I believe > Andres's idea was one-request-per-transaction, so the idea is > something like: > > - When your transaction first tries to attach to an undo log, you make > a hash table entry. .. .. > - If you commit, you remove the entry, because your transaction does > not need to be undone. I think this can regress performance when there are many concurrent sessions, unless there is a way to add/remove requests without a lock. As of now, we don't enter any request or block any space in shared memory related to pending undo until there is an error or the user explicitly rolls back the transaction. We could surely do it some other way as well, but this way we won't have any overhead in the commit path for successful transactions. > > If you have one request per persistence level, you could make an entry > for the first persistence level, and then find that you are out of > room when trying to make an entry for the second persistence level. I > guess that doesn't break anything: the changes from the first > persistence level would get undone, and the second persistence level > wouldn't get any undo. Maybe that's OK, but again it doesn't seem all > that nice, so maybe we need to think about it some more. > Coming again to the question of whether we need single or multiple entries for one request per persistence level: the reason discussed so far is that the discard worker can register the requests for them while scanning undo logs at different times. However, there are a few more things. For example, what if, while applying the actions, the actions for logged relations succeed and those for unlogged relations fail? Keeping them separate allows better processing: if one fails, register its request in the error queue and try to process the request for the other persistence level. The undo for different persistence levels is kept in separate logs, which makes processing them separately easier. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 15, 2019 at 9:56 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Sat, Jul 13, 2019 at 6:26 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I think then I am missing something, because what I am talking about is > > the below code in rbt_insert: > > What you're saying here is that, with an rbtree, an exact match will > result in a merging of requests which we don't want, so we have to > make them always unique. That's fine, but even if you used a binary > heap where it wouldn't be absolutely required that you break the ties, > you'd still want to think at least a little bit about what behavior is > best in case of a tie, just from the point of view of making the > system efficient. > Okay. > > I think it would be better to use start_urec_ptr, because XID can be > > non-unique in our case. As I explained in one of the emails above [1], > > we register the requests for logged and unlogged relations > > separately, so XID can be non-unique. > > Yeah. I didn't understand that explanation. It seems to me that one > of the fundamental design questions for this system is whether we > should allow there to be an unbounded number of transactions that are > pending undo application, or whether it's OK to enforce a hard limit. > Either way, there should certainly be pressure applied to try to keep > the number low, like forcing undo application into the foreground when > a backlog is accumulating, but the question is what to do when that's > insufficient. My original idea was that we should not have a hard > limit, in which case the shared memory data on what is pending might > be incomplete, in which case we would need the discard workers to > discover transactions needing undo and add them to the shared memory > data structures, and if those structures are full, then we'd just skip > adding those details and rediscover those transactions again at some > future point. > > But, my understanding of the current design being implemented is that > there is a hard limit on the number of transactions that can be > pending undo and the in-memory data structures are sized accordingly. > Yes, that is correct. > In such a system, we cannot rely on the discard worker(s) to > (re)discover transactions that need undo, because if there can be > transactions that need undo that we don't know about, then we can't > enforce a hard limit correctly. > I have responded to this point in the email above. > The exception, I suppose, is that > after a crash, we'll need to scan all the undo logs and figure out > which transactions are pending, but that doesn't preclude using a > single queue entry covering both the logged and the unlogged portion > of a transaction that has written undo of both kinds. We've got to > scan all of the undo logs before we allow any new undo-using > transactions to start, and so we can create one fully-up-to-date entry > that reflects the data for both persistence levels before any > concurrent activity happens. > It is correct that no new undo-using transaction can start, but nothing prevents the undo launcher from starting undo workers to process the already-registered requests, which can lead to some concurrent activity. > I am wondering (and would love to hear other opinions on) the question > of which kind of design we ought to be pursuing, but it's got to be > one or the other, not something in the middle. > I agree that it should not be in the middle.
It is possible that I am missing or misunderstanding something here, but AFAIU, the current design, and implementation allows us to maintain the pending undo state in-memory. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Jul 9, 2019 at 6:28 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > PFA, updated patch version which includes
> > > - One defect fix in undo interface related to undo page compression for handling persistence level
> > > - Implemented pending TODO optimization in undo page compression.
> > > - One defect fix in undo processing related to the prepared transaction
> >
> > Looking at 0002 a bit, it seems to me that you really need to spend some energy getting things into a consistent order all across the patch. For example, UndoPackStage uses the ordering: HEADER, TRANSACTION, RMID, RELOID, XID, CID... But the declarations of the UREC_INFO constants go in a different order: TRANSACTION, FORK, BLOCK, BLKPREV... The comments defining those go in a different order and some of them are missing. The definition of the UndoRecordBlah structures go in a different order still: Transaction, Block, LogSwitch, Payload. UndoRecordHeaderSize goes with FORK, BLOCK, BLPREV, TRANSACTION, LOGSWITCH, .... That really needs to be straightened out and made consistent.

I have worked on this part; please check the latest patch. Some of the headers, i.e. RMID, RELOID, XID, CID, FORK, PREVUNDO, and BLOCK, have only one member, so there are no structures for them; apart from that, I think the others are now in a consistent order.

> > You (still) need to rename blkprev to something more generic, as mentioned in previous rounds of review.
> I will change this.

Changed to prevundo.

> > I think it would be a good idea to avoid complex macros in favor of functions where possible, e.g. UNDO_PAGE_PARTIAL_REC_SIZE. If performance is a concern, it could be declared static inline, which should be as good as a macro.
> ok

Done.

> > I don't like the fact that undoaccess.c has a new global, undo_compression_info. I haven't read the code thoroughly, but do we really need that? I think it's never modified (so it could just be declared const),
>
> Actually, this will get modified; otherwise, across undo record insertions, how will we know the values of the common fields in the first record of the page? Another option could be that every time we insert a record, we read the values from the first complete undo record on the page, but that would be costly because for every new insertion we would need to read the first undo record of the page.
>
> Currently, we are doing it like this:
>
> a) BeginUndoRecordInsert - Copy the global "undo_compression_info" to our local context for handling multi-prepare, because for multi-prepare we don't want to update the global value until we have successfully inserted the undo record.
>
> b) PrepareUndoInsert - Operate on the context and update context->undo_compression_info if required (page changed).
>
> c) InsertPreparedUndo - After we have inserted successfully, overwrite the global "undo_compression_info" with context->undo_compression_info, so that the next undo insertion can get the right information.
>
> > and I also think it's just all zeroes (so initializing it isn't really necessary), and I also think that it's just used for initializing other UndoCompressionInfos (so we could just initialize them directly, either by setting the members individually or just zeroing them).
>
> Initially, I was doing that, but later I thought that since InvalidUndoRecPtr is a macro (although the value is 0), shouldn't we initialize all UndoRecPtr variables with InvalidUndoRecPtr instead of directly using 0? So I changed it like this.

> > It seems like UndoRecordPrepareTransInfo ought to have an Assert(index < some_limit) in the loop.

Done.

> > A comment in PrepareUndoInsert refers to "low switch" where it means "log switch."
> I will fix.

Fixed.

> > This is by no means a complete review, for which I unfortunately lack the time at present. Just some initial observations.
> ok

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
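A minimal sketch of that copy/update/write-back flow, with assumed structure and function names (the real code lives in undoaccess.c and differs in detail):

#include "access/rmgr.h"
#include "access/transam.h"

/* common fields stored once in the first complete record of a page */
typedef struct UndoCompressionInfo
{
    RmgrId          rmid;
    Oid             reloid;
    FullTransactionId xid;
    CommandId       cid;
} UndoCompressionInfo;

typedef struct UndoRecordInsertContext
{
    UndoCompressionInfo compression_info;
    /* ... prepared records, buffers, etc. ... */
} UndoRecordInsertContext;

/* one per undo log category in the real patch; simplified to one here */
static UndoCompressionInfo undo_compression_info;

/* a) take a private copy, so a failed multi-prepare can't clobber the global */
static void
begin_undo_record_insert(UndoRecordInsertContext *context)
{
    context->compression_info = undo_compression_info;
}

/* b) PrepareUndoInsert then updates only context->compression_info ... */

/* c) ... and only a successful insertion publishes it back */
static void
insert_prepared_undo(UndoRecordInsertContext *context)
{
    /* ... write the prepared records into undo pages ... */
    undo_compression_info = context->compression_info;
}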
On Tue, Jul 16, 2019 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Reviewing Amit's 0016:
>
> performUndoActions appears to be badly-designed. For starters, it's sometimes wrong: the only place it gets set to true is in UndoActionsRequired (which is badly named, because from the name you expect it to return a Boolean and to not have side effects, but instead it doesn't return anything and does have side effects). UndoActionsRequired() only gets called from selected places, like AbortCurrentTransaction(), so the rest of the time it just returns a wrong answer. Now maybe it's never called at those times, but there's no guard to prevent a function like CanPerformUndoActions() (which is also badly named, because performUndoActions tells you whether you need to perform undo actions, not whether it's possible to perform undo actions) from being called before the flag is set. I think that this flag should be either (1) maintained eagerly - so that wherever we set start_urec_ptr we also set the flag right away or (2) removed - so when we need to know, we just loop over all of the undo categories on the spot, which is not that expensive because there aren't that many of them.

I would prefer to go with (2). So, I will change the function CanPerformUndoActions() to loop over the categories and return whether there is a need to perform undo actions. Also, I will rename CanPerformUndoActions to NeedToPerformUndoActions or UndoActionsRequired; any better suggestions?

> It seems pointless to make PrepareTransaction() take undo pointers as arguments, because those pointers are just extracted from the transaction state, to which PrepareTransaction() has a pointer.

Agreed, will remove.

> Thomas has already objected to another proposal to add functions that turn 32-bit XIDs into 64-bit XIDs. Therefore, I feel confident in predicting that he will likewise object to GetEpochForXid. I think this needs to be changed somehow, maybe by doing what the XXX comment you added suggests.

We can do what the comment says, but there is one more similar usage in undodiscard.c as well, so I am not sure if that is the right thing. I think Thomas is suggesting that we open-code its usage where it is safe to do so and required. I have responded to his email; let us see what he has to say, and based on that we can modify this patch.

> This patch has some problems with naming consistency. There's a function called PushUndoRequest() which calls a function called RegisterRollbackReq() to do the heart of the work. So, is it undo or rollback? Are we pushing or registering? Is it a request or a req?

I think we can rename PushUndoRequest as RegisterUndoRequest and RegisterRollbackReq as RegisterUndoRequestGuts.

> For bonus points, the flag that the function sets is called undo_req_pushed, which is halfway in between the two competing terminologies. Other gripes about PushUndoRequest: push is vague and doesn't really explain what's happening, "apllying" is a typo, per_level is a poor variable name and shouldn't be declared volatile. This function has problems with naming in other places, too; please go through all of the names carefully and make them consistent and adequately descriptive.

Okay, will change as per your suggestions.

> I am not a fan of applying_subxact_undo. I think we should look for a better design there. A couple of things occur to me. One is that we don't necessarily need to go to FATAL; we could just force the current transaction and all of its subtransactions to fail all the way out to the top level, but then perhaps allow new transactions to be started afterwards. I'm not sure that's worth it, but it would work, and I think it has precedent in SxactIsDoomed. Assuming we're going to stick with the current FATAL plan, I think we should do something like invent a new kind of critical section that forces ERROR to be promoted to FATAL and then use it here. We could call it a semi-critical or locally-critical section, and the undo machinery could use it, but then also so could other things. I've wanted that sort of concept before, so I think it's a good idea to try to have something general and independent of undo. The same concept could be used in PerformUndoActions() instead of having to invent pg_rethrow_as_fatal(), so we'd have two uses for this mechanism right away.

Okay, I will investigate along the lines of the semi-critical section.

> FinishPreparedTransactions() tries to apply undo actions while interrupts are still held. Is that necessary? Can we avoid it?

I don't think so. I'll think some more and report back if I see any problem; otherwise, I will do RESUME_INTERRUPTS before performing the actions.

> It seems highly likely that the logic added to the TBLOCK_SUBCOMMIT case inside CommitTransactionCommand and also into ReleaseCurrentSubTransaction should have been added to CommitSubTransaction instead. If that's not true, then we have to believe that the TBLOCK_SUBRELEASE call to CommitSubTransaction needs different treatment from the other two cases, which sounds unlikely; we also have to explain why undo is somehow different from all of these other releases that are already handled in that function, not in its callers.

Yeah, it is better to move that code from ReleaseSavepoint to here, or rather to move it to CommitSubTransaction as you suggested.

> I also strongly suspect it is altogether wrong to do this before CommitSubTransaction sets s->state to TRANS_COMMIT; what if a subxact callback throws an error?

Are you worried that it might lead to the execution of actions twice? If so, I think we prevent that during replay of the actions, and it can happen in other ways too. I am not saying that we should not move that code block to the location you are suggesting, but I think the current code is also not wrong.

> For related reasons, I don't think that the changes to ReleaseSavepoint() are right either. Notice the header comment: "As above, we don't actually do anything here except change blockState." The "as above" part of the comment probably didn't originally refer to DefineSavepoint(), which definitely does do other stuff, but to something like EndImplicitTransactionBlock() or EndTransactionBlock(), and DefineSavepoint() got stuck in the middle later. Anyway, your patch makes the comment false by doing actual state changes in this function, rather than just marking the subtransactions for commit. But why should that be right? If none of the many other bits of state are manipulated here rather than in CommitSubTransaction(), why is undo the one thing that is different? I guess this is basically just compensation for the lack of any of this code in the TBLOCK_SUBRELEASE path which I noted in the previous paragraph, but I still think the right answer is to put it all in CommitSubTransaction() *after* we set TRANS_COMMIT.

Agreed, will change accordingly.

> There are a number of things I either don't like or don't understand about PerformUndoActions. One is that undo_req_pushed gets passed to this function. That just looks really odd from an abstraction point of view. Basically, we have a function whose job is to "perform undo actions," and it gets a flag as an argument that tells it to not actually perform some of the undo actions: that's odd. I think the reason it's like that is because of the issue we've been discussing elsewhere that there's a separate undo request for each category.

The reason was that if we don't have that check here, then we need to do the same in both callers. As there are just two places, moving it to the caller should be okay. If we do that, then the loop over persistence levels can probably also be moved into the caller.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
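As a rough illustration of the semi-critical section idea floated above; nothing like this exists in core today, so the names and the elog hook are assumptions:

/* Sketch: promote ERROR to FATAL while inside a semi-critical section. */
static int  SemiCriticalSectionCount = 0;

#define START_SEMI_CRIT_SECTION()   (SemiCriticalSectionCount++)
#define END_SEMI_CRIT_SECTION() \
    do { \
        Assert(SemiCriticalSectionCount > 0); \
        SemiCriticalSectionCount--; \
    } while (0)

/*
 * errstart() would then promote the level, roughly:
 *
 *     if (elevel == ERROR && SemiCriticalSectionCount > 0)
 *         elevel = FATAL;
 *
 * mirroring the way a regular critical section (CritSectionCount)
 * already promotes ERROR to PANIC.
 */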
On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> 1. Renamed UndoPersistence to UndoLogCategory everywhere, and added a fourth category UNDO_SHARED where transactions can write 'out of band' data that relates to more than one transaction.
>
> 2. Introduced a new RMGR callback rm_undo_status. It is used to decide when record sets in the UNDO_SHARED category should be discarded (instead of the usual single xid-based rules). The possible answers are "discard me now!", "ask me again when a given XID is all visible", and "ask me again when a given XID is no longer running".
>
> 3. Recognise UNDO_SHARED record set boundaries differently. Whereas undolog.c recognises transaction boundaries automatically for the other categories (UNDO_PERMANENT, UNDO_UNLOGGED, UNDO_TEMP), for UNDO_SHARED the boundaries are marked explicitly by the client code.
>
> 4. Added some quick-and-dirty throw-away test stuff to demonstrate that. SELECT test_multixact([1234, 2345]) will create a new record set that will survive until the given array of transactions is no longer running, and then it'll be discarded. You can see that with SELECT * FROM undoinspect('shared'). Or look at SELECT pg_stat_undo_logs. This test simply writes all the xids into its payload, and then has an rm_undo_status function that returns the first xid it finds in the list that is still running, or if none are running returns UNDO_STATUS_DISCARD.
>
> Currently you can only return UNDO_STATUS_WAIT_XMIN, to wait for an xid to be older than the oldest xmin; presumably it'd be useful to be able to discard as soon as an xid is no longer active, which could be a bit sooner.
>
> Another small change: several people commented that UndoLogIsDiscarded(ptr) ought to have some kind of fast path that doesn't acquire locks since it'll surely be hammered. Here's an attempt at that, providing an inlined function that uses a per-backend recent_discard to avoid doing more work in the (hopefully) common case that you mostly encounter discarded undo pointers. I hope this change will show up in profilers in some zheap workloads, but this hasn't been tested yet.
>
> Another small change/review: the function UndoLogGetNextInsertPtr() previously took a transaction ID, but I'm not sure if that made sense; I need to think about it some more.
>
> I pulled in the latest patches from the "undoprocessing" branch as of late last week, and most of the above is implemented as fixup commits on top of that.
>
> Next I'm working on DBA facilities for forcing undo records to be discarded (which consists mostly of sorting out the interlocking to make that work safely). And also testing facilities for simulating undo log switching (when you fill up each log and move to another one; these are rarely-run code paths, so we need a good way to make them not rare).

In 0003-Add-undo-log-manager:

/* If we discarded everything, the slot can be given up. */
+ if (entirely_discarded)
+ free_undo_log_slot(slot);

I have noticed that when the undo log is detached and full, if we discard the complete log, we release its slot. But what is bothering me is: should we add that log to the free list? Or am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 16, 2019 at 4:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jul 16, 2019 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > This patch has some problems with naming consistency. There's a function called PushUndoRequest() which calls a function called RegisterRollbackReq() to do the heart of the work. So, is it undo or rollback? Are we pushing or registering? Is it a request or a req?
>
> I think we can rename PushUndoRequest as RegisterUndoRequest and RegisterRollbackReq as RegisterUndoRequestGuts.

One thing I am not sure about in the above suggestion is whether it is a good idea to expose a function whose name ends with 'Guts'. I have checked and found that there are a few similar precedents like ExecuteTruncateGuts. Another idea could be to rename RegisterRollbackReq as RegisterUndoRequestInternal. We have a few precedents for that as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 16, 2019 at 10:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jul 16, 2019 at 4:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 16, 2019 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > This patch has some problems with naming consistency. There's a > > > function called PushUndoRequest() which calls a function called > > > RegisterRollbackReq() to do the heart of the work. So, is it undo or > > > rollback? Are we pushing or registering? Is it a request or a req? > > > > I think we can rename PushUndoRequest as RegisterUndoRequest and > > RegisterRollbackReq as RegisterUndoRequestGuts. > > One thing I am not sure about the above suggestion is whether it is a > good idea to expose a function which ends with 'Guts'. I have checked > and found that there are a few similar precedents like > ExecuteTruncateGuts. Another idea could be to rename > RegisterRollbackReq as RegisterUndoRequestInternal. We have few > precedents for that as well. I don't personally like Guts, not only because bringing human (or animal) body parts into this seems unnecessary, but more importantly because it's not at all descriptive. Internal is no better. The point is that you need to give the functions names that make it clear how what one function does is different from what another function does, and neither Guts nor Internal is going to help with that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jul 16, 2019 at 12:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> The idea is that the queues can get full, but not the rollback hash table. In the case where the error queue gets full, we mark the entry as Invalid in the hash table and later, when the discard worker again encounters this request, it adds it to the queue if space is available and marks the entry in the hash table as valid. This allows us to keep the information of all xacts having pending undo in shared memory.

I don't understand. How is it OK to have entries in the hash table but not the queues? And why would that ever happen, anyway? If you make the queues as big as the hash table is, then they should never fill up (or, if using binary heaps with lazy removal rather than rbtrees, they might fill up, but if they do, you can always make space by cleaning out the stale entries).

> I think this can regress performance when there are many concurrent sessions unless there is a way to add/remove requests without a lock. As of now, we don't enter any request or block any space in shared memory related to pending undo until there is an error or the user explicitly rolls back the transaction. We can surely do it some other way as well, but this way we won't have any overhead in the commit or successful transaction's path.

Well, we're already incurring some overhead to attach to an undo log, and that probably involves some locking. I don't see why this would be any worse, and maybe it could piggyback on the existing work. Anyway, if you don't like this solution, propose something else. It's impossible to correctly implement a hard limit unless the number of aborted-but-not-yet-undone transactions is bounded by (HARD_LIMIT - ENTRIES_THAT_WOULD_BE_ADDED_AFTER_RECOVERY_IF_THE_SYSTEM_CRASHED_NOW). If there are 100 transactions each bound to 2 undo logs, and you crash, you will need to (as you have it designed now) add another 200 transactions to the hash table upon recovery, and that will make you exceed the hard limit unless you were at least 200 transactions below the limit before the crash. Have you handled that somehow? If so, how? It seems to me that you MUST - at a minimum - keep a count of undo logs attached to in-progress transactions, if not the actual hash table entries.

> Again coming to the question of whether we need single or multiple entries for one request per persistence level, the reason we have discussed so far is that the discard worker can register the requests for them while scanning undo logs at different times.

Yeah, but why do we need that in the first place? I wrote something about that in a previous email, but you haven't responded to it here.

> However, there are a few more things: for example, what if, while applying the actions, the actions for the logged part succeed and the unlogged part fails? Keeping them separate allows better processing. If one fails, register its request in the error queue and try to process the request for the other persistence level. The undo for different persistence levels is also kept in separate logs, which makes processing them separately easier.

I don't find this convincing. It's not really an argument, just a vague list of issues. If you want to convince me, you'll need to be much more precise. It seems to me that it is generally undesirable to undo the unlogged part of a transaction separately from the logged part of the transaction. But even if we want to support that, having one entry per XID rather than one entry per <XID, persistence level> doesn't preclude it. Even if you discover the entries at different times, you can still handle that by updating the existing entry rather than making a new one. There might be a good reason to do it the way you are describing, but I don't see that you've made the argument for it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
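For reference, a minimal sketch of the "lazy removal" variant Robert mentions above, using PostgreSQL's existing binaryheap; the QueueEntry type and the already_processed flag are hypothetical:

#include "postgres.h"
#include "access/transam.h"
#include "lib/binaryheap.h"

typedef struct QueueEntry
{
    FullTransactionId full_xid;
    bool        already_processed;  /* set when handled via another queue */
} QueueEntry;

/*
 * Pop the highest-priority live request, discarding entries that were
 * already processed via one of the other queues.
 */
static QueueEntry *
dequeue_skipping_stale(binaryheap *heap)
{
    while (!binaryheap_empty(heap))
    {
        QueueEntry *e = (QueueEntry *)
            DatumGetPointer(binaryheap_remove_first(heap));

        if (!e->already_processed)
            return e;
        /* stale: drop it on the floor, freeing a slot in the heap */
    }
    return NULL;
}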
On Tue, Jul 16, 2019 at 7:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I also strongly suspect it is altogether wrong to do > > this before CommitSubTransaction sets s->state to TRANS_COMMIT; what > > if a subxact callback throws an error? > > Are you worried that it might lead to the execution of actions twice? No, I'm worried that you are running code that is part of the commit path before the transaction has actually committed. CommitSubTransaction() is full of stuff which basically propagates whatever the subtransaction did out to the parent transaction, and all of that code runs after we've ruled out the possibility of an abort, but this very-similar-looking code runs while it's still possible for an abort to happen. That seems unlikely to be correct, and even if it is, it seems needlessly inconsistent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a new version.

Here's a relatively complete review of 0019 and 0020 and a remark or two on the beginning of 0003.

Regarding 0020:

The documentation claims that undo data exists in a 64-bit address space divided into 2^34 undo logs, each with a theoretical capacity of 1TB, but that would require 74 bits.

I am mildly suspicious that, on a busy system, the use of 1MB segment files could result in slowdowns due to frequent filesystem operations. We just recently made it more convenient to change the WAL segment size, mostly so that people on very busy systems could crank it up from 16MB to, say, 64MB or 256MB. It's true that the considerations are a bit different here, because undo logs don't have to be archived, and because we might be using many undo logs simultaneously rather than only 1 for the whole system, but it's still true that if you've got a bunch of backends blasting out undo at top speed, you're going to have to recycle files *extremely* quickly. How much performance testing have you done to assess the effect of segment size? Do you think there's an argument for making this 1MB size configurable at initdb-time? Or even variable at runtime, so that we use larger files if we're filling them up in < 100ms or whatever?

I don't think the last paragraph is entirely accurate. The access method gets to control what records are written, but the general format of the records is fixed by the undo system. Perhaps the undo log code isn't what cares about that, but whether it's the undo log code or the undo access code or the undo processing code isn't likely to seem relevant to developers.

Regarding 0019:

I think there's a substantial amount of duplication between 0019 and 0020, and I'm not sure that we ought to have both. They both talk about the purpose of undo, the way the address space is divided, etc. I understand that it would be a little weird to include all of the information from 0019 in the user-facing documentation, and I also understand that it won't work to have no user-facing documentation at all, but it still seems a little odd to me. Possibly 0019 could refer to the SGML documentation for preliminaries and then add only those details that are not covered there.

How could we avoid the limit on the total size of an active transaction mentioned here? And what would be the cost of such a scheme? If we've filled an undo log and moved on to another one, why can't we evict the one that's full and reuse the shared memory slot, bringing it back in later when required? I suspect the answer is that there is a locking rule involved. I think this README would be a good place to document things like locking rules, or at least to refer to where they are documented. I also think we should mull over whether we could relax the rule without too much pain. I expect that at least part of the problem is that somebody might have a pointer to an UndoLogSlot which could become stale if we recycle a slot, but that can already happen at least when the log is fully discarded, so maybe allowing it to happen in other cases wouldn't be too bad.

I know you're laughing at me on the inside, worrying about a transaction that touches so many TB of data that it manages to exhaust all the undo log slots, but I don't think that's a completely crazy scenario. There are PB-scale databases out there, and it would be nice to think that PostgreSQL could capture more of those workloads. They will probably become more common over time.
Reading the section on persistence levels and tablespaces makes me wonder what happens to address space that gets allocated to temporary and unlogged undo logs. It seems pretty important to make sure that we at least don't leak anything significant, and maybe that we actually recycle the address space or share it across backends. That is, if several backends are all writing temporary undo, there's no intrinsic reason why they can't all be using the same temporary undo logs, as long as the file naming works OK for that (e.g. if it follows the same pattern we use for relation names). Any undo logs that get allocated to unlogged undo can be recycled - either for unlogged undo or otherwise - after a crash, and any that are partially filled can be rewound. I don't know how much effort we're expending on any of that right now, but it seems like it would be worth discussing in this README, and possibly improving. When the undo log contents section mentions that "client code is responsible for stepping over the page headers and advancing to the next page," that's again a somewhat middle-of-the-patch stack perspective. I am not sure exactly how this should be phrased, but the point is that the client code we're talking about is not the AM but the next patch in the stack. I think developers will view the AM as the client and our wording probably ought to reflect that. "keepign" is not spelled correctly. A little later on, "checkpoin" is missing a letter. I think it would be worth mentioning how you solved the problem of inferring during recovery the position within the page where the record needs to be placed. The bit about checkpoint files written to pg_undo being potentially inconsistent is confusing. If the files are written before the checkpoint is completed, fsync'd, and not modified afterwards, how can they be inconsistent? Regarding 0003: UndoLogSharedData could use a more extensive comment. It's not very clear what low_logno and next_logno are, and it also seems like it would be worth mentioning how the free lists are linked. On a similar note, I think the file header comment ought to reference the undo README added by 0019 and perhaps also the documentation added by 0020, and I think 0019 and 0020 ought to be flattened into 0003. I meant to write more about 0003 before sending this, but I am out of time and it seems more useful to send what I have now than to wait until I have more... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2019-07-13 15:55:51 +0530, Amit Kapila wrote:
> On Fri, Jul 12, 2019 at 7:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > I think even if we currently go with a binary heap, it will be possible to change it to rbtree later, but I am fine either way.
> >
> > Well, I don't see much point in revising all of this logic twice. We should pick the way we want it to work and make it work that way.
>
> Yeah, I agree. So, I am assuming here that as you have discussed this idea with Andres offlist, he is on board with changing it as he has originally suggested using binary_heap. Andres, do let us know if you think differently here. It would be good if anyone else following the thread can also weigh in.

Yes, I think using an rbtree makes sense.

I'm not yet sure whether we'd want the rbtree nodes being pointed to directly by the hashtable, or whether we'd want one indirection.

e.g. either something like:

typedef struct UndoWorkerQueue
{
    /* priority ordered tree */
    RBTree     *tree;
    ....
}

typedef struct UndoWorkerQueueEntry
{
    RBTNode     tree_node;

    /*
     * Reference hashtable via key, not pointers, entries might be
     * moved.
     */
    RollbackHashKey rollback_key;
    ...
} UndoWorkerQueueEntry;

typedef struct RollbackHashEntry
{
    ...
    UndoWorkerQueueEntry *queue_memb_size;
    UndoWorkerQueueEntry *queue_memb_age;
    UndoWorkerQueueEntry *queue_memb_error;
}

and call rbt_delete() for any non-NULL queue_memb_* whenever an entry is dequeued via one of the queues (after setting the one already dequeued from to NULL, of course). Which requires - as Robert mentioned - that rbtree pointers remain stable after insertions.

Alternatively we can have a more complicated arrangement without the "stable pointer" requirement (which'd also similarly work for a binary heap):

typedef struct UndoWorkerQueue
{
    /* information about work needed, not meaningfully ordered */
    UndoWorkerQueueEntry *entries;

    /*
     * Priority ordered references into ->entries, using
     * UndoWorkerQueueTreeEntry as members.
     */
    RBTree     *tree;

    /* unused elements in ->entries, UndoWorkerQueueEntry members */
    slist_head  freelist;

    /*
     * Number of entries in ->entries and tree that can be pruned by
     * doing a scan of both.
     */
    int         num_prunable_entries;
}

typedef struct UndoWorkerQueueEntry
{
    /*
     * Reference hashtable via key, not pointers, entries might be
     * moved.
     */
    RollbackHashKey rollback_key;

    /*
     * As members of UndoWorkerQueue->tree can be moved in memory,
     * RollbackHashEntry cannot directly point to them. Instead the
     * hashtable points to this stable entry.
     */
    bool        already_processed;
    ...
    slist_node  freelist_node;
} UndoWorkerQueueEntry;

typedef struct UndoWorkerQueueTreeEntry
{
    RBTNode     tree_node;
    /* offset into UndoWorkerQueue->entries */
    int         off;
} UndoWorkerQueueTreeEntry;

and again

typedef struct RollbackHashEntry
{
    ...
    UndoWorkerQueueEntry *queue_memb_size;
    UndoWorkerQueueEntry *queue_memb_age;
    UndoWorkerQueueEntry *queue_memb_error;
}

Because the tree entries are not members of the tree itself, pointers to them would be stable, regardless of rbtree (or binary heap) moving them around. The cost of that would be more complicated datastructures, and insertion/deletion/dequeuing operations:

insertion:
    if (slist_is_empty(&queue->freelist))
        prune();
    if (slist_is_empty(&queue->freelist))
        elog(ERROR, "full");

    UndoWorkerQueueEntry *entry =
        slist_container(UndoWorkerQueueEntry, freelist_node,
                        slist_pop_head_node(&queue->freelist));
    UndoWorkerQueueTreeEntry tree_entry;

    entry->already_processed = false;
    entry->... = ...;
    tree_entry.off = entry - queue->entries;    /* calculate offset */
    rbt_insert(queue->tree, &tree_entry.tree_node, NULL);

prune:
    if (queue->num_prunable_entries > 0)
        RBTreeIterator iter;

        rbt_begin_iterate(queue->tree, LeftRightWalk, &iter);
        while ((tnode = rbt_iterate(&iter)) != NULL)
            node = (UndoWorkerQueueTreeEntry *) tnode;
            if (queue->entries[node->off].already_processed)
                rbt_delete(queue->tree, tnode);
                /* XXX: Have to stop here, the iterator is invalid -
                 * probably should add a rbt_delete_current(iterator); */
                break;

dequeue:
    while ((tnode = rbt_leftmost(queue->tree)) != NULL)
        node = (UndoWorkerQueueTreeEntry *) tnode;
        entry = &queue->entries[node->off];
        rbt_delete(queue->tree, tnode);

        /* check if the entry has already been processed via another queue */
        if (entry->already_processed)
            slist_push_head(&queue->freelist, &entry->freelist_node);
        else
            /* found it */
            return entry;
    return NULL;

delete (i.e. processed in another queue):
    /*
     * Queue entry will only be reusable when the corresponding tree
     * entry has been removed. That'll happen either when new entries
     * are needed (cf prune), or when the entry is dequeued (cf dequeue).
     */
    entry->already_processed = true;

I think the first approach is clearly preferable from a simplicity POV, but the second approach would be a bit more generic (applicable to a binary heap as well) and wouldn't require adjusting the rbtree code.

Greetings,

Andres Freund
On Tue, Jul 16, 2019 at 11:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> /* If we discarded everything, the slot can be given up. */
> + if (entirely_discarded)
> + free_undo_log_slot(slot);
>
> I have noticed that when the undo log is detached and full, if we discard the complete log, we release its slot. But what is bothering me is: should we add that log to the free list? Or am I missing something?

Stepping back a bit: The free lists are for undo logs that someone might want to attach to and insert into. If it's full, we probably can't insert anything into it again (well, technically someone else who wants to insert something a bit smaller might be able to, but that's not an interesting case to worry about). So it doesn't need to go back on a free list, but it still needs to exist (= occupy a slot) as long as there is undiscarded data in it, because that data is needed and we need to be able to test URPs against its discard pointer. But once its data is entirely discarded, it ceases to exist -- there is no reason to waste a slot on it, and any URP in this undo log will be considered to be discarded (because we can't find a slot, and we also cache that fact in recent_discard so lookups are fast and lock-free), and therefore it'll not be checkpointed or reloaded at next startup; then we couldn't put it on a free list even if we wanted to, because there is nothing left of it ("logs" don't really exist in memory, only "slots", currently holding the meta-data for a log, which is why I renamed UndoLog to UndoLogSlot to reduce confusion on that point). One of the goals here is to make a system that doesn't require an increasing amount of memory as time goes on -- hence the desire to completely remove state relating to entirely discarded undo logs. (You might point out that the recent_discard cache would get arbitrarily large after we chew through millions of undo logs, but there is another defence against that in the form of low_logno, which isn't used in that test yet but could be used to minimise that effect.) Does this make sense, and do you see a problem?

--
Thomas Munro
https://enterprisedb.com
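A rough sketch of the lock-free fast path mentioned above; the function names, the per-log caching scheme, and the bit split are all assumptions rather than the patch set's actual code:

#include "postgres.h"

typedef uint64 UndoRecPtr;
typedef int UndoLogNumber;

/* assumed 24-bit logno / 40-bit offset split; the patch defines the real one */
#define UndoRecPtrGetLogNo(urp) ((UndoLogNumber) ((urp) >> 40))

static bool undo_rec_ptr_is_discarded_slow(UndoRecPtr pointer);

/* per-backend cache: discard pointer most recently observed, and its log */
static UndoLogNumber recent_logno = -1;
static UndoRecPtr recent_discard = 0;

static inline bool
undo_rec_ptr_is_discarded(UndoRecPtr pointer)
{
    /*
     * Lock-free fast path: within one log, the discard pointer only ever
     * advances, so anything below a discard pointer we've already seen
     * must itself be discarded.
     */
    if (UndoRecPtrGetLogNo(pointer) == recent_logno &&
        pointer < recent_discard)
        return true;

    /*
     * Slow path (not shown): look up the slot under its lock, test the
     * real discard pointer, and refresh recent_logno/recent_discard.
     * If no slot exists at all, the entire log has been discarded.
     */
    return undo_rec_ptr_is_discarded_slow(pointer);
}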
Hi, On 2019-07-15 12:26:21 -0400, Robert Haas wrote: > Yeah. I didn't understand that explanation. It seems to me that one > of the fundamental design questions for this system is whether we > should allow there to be an unbounded number of transactions that are > pending undo application, or whether it's OK to enforce a hard limit. > Either way, there should certainly be pressure applied to try to keep > the number low, like forcing undo application into the foreground when > a backlog is accumulating, but the question is what to do when that's > insufficient. My original idea was that we should not have a hard > limit, in which case the shared memory data on what is pending might > be incomplete, in which case we would need the discard workers to > discover transactions needing undo and add them to the shared memory > data structures, and if those structures are full, then we'd just skip > adding those details and rediscover those transactions again at some > future point. > > But, my understanding of the current design being implemented is that > there is a hard limit on the number of transactions that can be > pending undo and the in-memory data structures are sized accordingly. My understanding is that that's really just an outcome of needing to maintain oldestXidHavingUndo accurately, right? I know I asked this before, but I didn't feel like the answer was that clear (probably due to my own haziness). To me it seems very important to understand whether / how much we can separate the queuing/worker logic from the question of how to maintain oldestXidHavingUndo. > In such a system, we cannot rely on the discard worker(s) to > (re)discover transactions that need undo, because if there can be > transactions that need undo that we don't know about, then we can't > enforce a hard limit correctly. The exception, I suppose, is that > after a crash, we'll need to scan all the undo logs and figure out > which transactions are pending, but that doesn't preclude using a > single queue entry covering both the logged and the unlogged portion > of a transaction that has written undo of both kinds. We've got to > scan all of the undo logs before we allow any new undo-using > transactions to start, and so we can create one fully-up-to-date entry > that reflects the data for both persistence levels before any > concurrent activity happens. Yea, that seems like a question independent of the "completeness" requirement. If desirable, it seems trivial to either have RollbackHashEntry have per-persistence level status (for one entry per xid), or not (for per-persistence entries). Greetings, Andres Freund
On Wed, Jul 17, 2019 at 3:53 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-07-15 12:26:21 -0400, Robert Haas wrote: > > Yeah. I didn't understand that explanation. It seems to me that one > > of the fundamental design questions for this system is whether we > > should allow there to be an unbounded number of transactions that are > > pending undo application, or whether it's OK to enforce a hard limit. > > Either way, there should certainly be pressure applied to try to keep > > the number low, like forcing undo application into the foreground when > > a backlog is accumulating, but the question is what to do when that's > > insufficient. My original idea was that we should not have a hard > > limit, in which case the shared memory data on what is pending might > > be incomplete, in which case we would need the discard workers to > > discover transactions needing undo and add them to the shared memory > > data structures, and if those structures are full, then we'd just skip > > adding those details and rediscover those transactions again at some > > future point. > > > > But, my understanding of the current design being implemented is that > > there is a hard limit on the number of transactions that can be > > pending undo and the in-memory data structures are sized accordingly. > > My understanding is that that's really just an outcome of needing to > maintain oldestXidHavingUndo accurately, right? > Yes. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 17, 2019 at 3:48 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Jul 16, 2019 at 11:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > /* If we discarded everything, the slot can be given up. */
> > + if (entirely_discarded)
> > + free_undo_log_slot(slot);
> >
> > I have noticed that when the undo log is detached and full, if we discard the complete log, we release its slot. But what is bothering me is: should we add that log to the free list? Or am I missing something?
>
> Stepping back a bit: The free lists are for undo logs that someone might want to attach to and insert into. If it's full, we probably can't insert anything into it again (well, technically someone else who wants to insert something a bit smaller might be able to, but that's not an interesting case to worry about). So it doesn't need to go back on a free list, but it still needs to exist (= occupy a slot) as long as there is undiscarded data in it, because that data is needed and we need to be able to test URPs against its discard pointer. But once its data is entirely discarded, it ceases to exist -- there is no reason to waste a slot on it,

Right, actually I got that point. But I was thinking that we are wasting one logno from the undo log addressing space, no? Instead, if we can keep it attached to the slot and somehow manage to add it to the free list, then the same logno can be used by someone else?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 17, 2019 at 3:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Right, actually I got that point. But I was thinking that we are wasting one logno from the undo log addressing space, no? Instead, if we can keep it attached to the slot and somehow manage to add it to the free list, then the same logno can be used by someone else?

We can never reuse log numbers. UndoRecPtr values containing that log number could exist in permanent storage anywhere (zheap, zedstore etc) and must appear to be discarded forever if anyone asks. Now, it so happens that the current coding in zheap has fxid + urp for each transaction slot and always checks the fxid first, so it probably wouldn't ask about discarded urps too much, but I don't think that policy is a requirement, and the undo layer can't count on it. I think I heard that zedstore is planning to check urp only.

--
Thomas Munro
https://enterprisedb.com
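The "discarded forever" property follows from the address layout. As a sketch, assuming a 64-bit UndoRecPtr with the log number in the high 24 bits and a 40-bit byte offset (1TB per log); the exact split is defined by the patch set and may differ:

typedef uint64 UndoRecPtr;

#define UndoRecPtrGetLogNo(urp)  ((urp) >> 40)
#define UndoRecPtrGetOffset(urp) ((urp) & (((UndoRecPtr) 1 << 40) - 1))

/*
 * A pointer like this may be stored persistently in a zheap or zedstore
 * page. If logno 42 were ever recycled, an old stored pointer into the
 * previous incarnation of log 42 would silently alias new undo data
 * instead of appearing discarded -- hence log numbers are never reused.
 */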
On Wed, Jul 17, 2019 at 9:27 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Jul 17, 2019 at 3:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Right, actually I got that point. But I was thinking that we are wasting one logno from the undo log addressing space, no? Instead, if we can keep it attached to the slot and somehow manage to add it to the free list, then the same logno can be used by someone else?
>
> We can never reuse log numbers. UndoRecPtr values containing that log number could exist in permanent storage anywhere (zheap, zedstore etc) and must appear to be discarded forever if anyone asks.

Yeah, right. I knew that we cannot reuse an UndoRecPtr but forgot that if we reuse a logno then it is the same as reusing an UndoRecPtr. Sorry for the noise.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 16, 2019 at 9:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jul 16, 2019 at 12:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The idea is that the queues can get full, but not the rollback hash table. In the case where the error queue gets full, we mark the entry as Invalid in the hash table and later, when the discard worker again encounters this request, it adds it to the queue if space is available and marks the entry in the hash table as valid. This allows us to keep the information of all xacts having pending undo in shared memory.
>
> I don't understand. How is it OK to have entries in the hash table but not the queues? And why would that ever happen, anyway?

We add entries to the queues only when we want them to be processed by background workers, whereas the hash table contains entries for all the pending undo requests, irrespective of whether they are executed by the foreground transaction or by background workers. Once a request is processed, we remove it from the hash table. The reasons for keeping all the pending abort requests in the hash table are, first, that it allows us to compute oldestXidHavingUnappliedUndo, and second, that it prevents duplicate undo requests from backends and the discard worker. In short, there is no reason to keep all the entries in the queues, but there are reasons to keep all the aborted xact entries in the hash table. There is some more explanation about the queues and the hash table in README.UndoProcessing, which again might not be sufficient to convey all the details, but it can still help.

> If you make the queues as big as the hash table is, then they should never fill up (or, if using binary heaps with lazy removal rather than rbtrees, they might fill up, but if they do, you can always make space by cleaning out the stale entries).
>
> > I think this can regress performance when there are many concurrent sessions unless there is a way to add/remove requests without a lock. As of now, we don't enter any request or block any space in shared memory related to pending undo until there is an error or the user explicitly rolls back the transaction. We can surely do it some other way as well, but this way we won't have any overhead in the commit or successful transaction's path.
>
> Well, we're already incurring some overhead to attach to an undo log, and that probably involves some locking. I don't see why this would be any worse, and maybe it could piggyback on the existing work.

We attach to the undo log only once per backend (unless the user changes the undo tablespace in between, or the space in the current log is exhausted) and then use it for all transactions via that backend. We don't take any global lock for undo for each transaction, so here we would need something different. Also, we would need it at commit time as well.

> Anyway, if you don't like this solution, propose something else. It's impossible to correctly implement a hard limit unless the number of aborted-but-not-yet-undone transactions is bounded by (HARD_LIMIT - ENTRIES_THAT_WOULD_BE_ADDED_AFTER_RECOVERY_IF_THE_SYSTEM_CRASHED_NOW). If there are 100 transactions each bound to 2 undo logs, and you crash, you will need to (as you have it designed now) add another 200 transactions to the hash table upon recovery, and that will make you exceed the hard limit unless you were at least 200 transactions below the limit before the crash. Have you handled that somehow? If so, how?

Yeah, we have handled it by reserving space for MaxBackends entries; the limit is UndoRollbackHashTableSize() - MaxBackends. There is a bug in the current patch in that it should reserve space for 2 * MaxBackends so that we are safe after recovery, but that can be fixed.

> It seems to me that you MUST - at a minimum - keep a count of undo logs attached to in-progress transactions, if not the actual hash table entries.
>
> > Again coming to the question of whether we need single or multiple entries for one request per persistence level, the reason we have discussed so far is that the discard worker can register the requests for them while scanning undo logs at different times.
>
> Yeah, but why do we need that in the first place? I wrote something about that in a previous email, but you haven't responded to it here.

I have responded to it in a separate email, but let's discuss it here. So, you are right that the only time we need to scan the undo logs to find all pending aborted xacts is immediately after startup. But we can't create a fully up-to-date entry from both the logs unless we also make the undo launcher wait to process anything until we are done. We are not doing this in the current patch, but we can do it if we want. This would be an additional restriction, one that is not required by the current approach. Another related thing is that to update an existing entry for the queues, we would need to delete and re-insert the entry after we find the request in a different log category. Again, if we point queue entries at the hash table, we might not have this additional work, but that has its own set of complexities.

> > However, there are a few more things: for example, what if, while applying the actions, the actions for the logged part succeed and the unlogged part fails? Keeping them separate allows better processing. If one fails, register its request in the error queue and try to process the request for the other persistence level. The undo for different persistence levels is also kept in separate logs, which makes processing them separately easier.
>
> I don't find this convincing. It's not really an argument, just a vague list of issues. If you want to convince me, you'll need to be much more precise.

I think it is implementation-wise simpler to have one entry per persistence level. It is not that we can't deal with all the problems being discussed.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
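A sketch of the sizing rule under discussion. ROLLBACK_REQUEST_QUEUE_SIZE and both function names are hypothetical, and the 2 * MaxBackends headroom reflects the fix Amit mentions (up to one request per persistence level per backend can appear after a crash):

#include "miscadmin.h"

#define ROLLBACK_REQUEST_QUEUE_SIZE 1024    /* hypothetical capacity knob */

/* Sketch: capacity of the rollback request hash table. */
static Size
undo_rollback_hash_table_size(void)
{
    /*
     * Room for the configured number of background-processable requests,
     * plus headroom so that requests re-discovered during recovery
     * (logged + unlogged, per backend) always fit.
     */
    return ROLLBACK_REQUEST_QUEUE_SIZE + 2 * MaxBackends;
}

/* Backends may register new requests only while below the soft limit. */
static bool
rollback_hash_table_has_room(long nentries)
{
    return nentries <
        (long) (undo_rollback_hash_table_size() - 2 * MaxBackends);
}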
On Wed, Jul 17, 2019 at 3:53 AM Andres Freund <andres@anarazel.de> wrote:
> On 2019-07-15 12:26:21 -0400, Robert Haas wrote:

Responding again with some more details.

> > But, my understanding of the current design being implemented is that there is a hard limit on the number of transactions that can be pending undo and the in-memory data structures are sized accordingly.
>
> My understanding is that that's really just an outcome of needing to maintain oldestXidHavingUndo accurately, right?

Yes.

> I know I asked this before, but I didn't feel like the answer was that clear (probably due to my own haziness). To me it seems very important to understand whether / how much we can separate the queuing/worker logic from the question of how to maintain oldestXidHavingUndo.

I am not sure there is any tight coupling between the queuing/worker logic and computing the oldestXid* value. The main requirement for computing the oldestXid* value is that we need to know the xids of all the pending abort transactions. We had already decided from the very beginning that the hash table will contain all the abort requests, irrespective of whether they are being processed by a foreground process or a background process; this helps us avoid duplicate entries from backends and background workers. Later, we decided that if we can have a hard limit on how many pending undo requests can be present in the system, then we can find the value of oldestXid* from the hash table. I don't know how much this helps, and you might already know all of it, but I thought it better to summarize to avoid any confusion.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
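Given that invariant (every transaction with pending undo has a hash table entry), computing the oldest such XID is a plain scan. A minimal sketch using dynahash, with a hypothetical entry type:

#include "postgres.h"
#include "access/transam.h"
#include "utils/hsearch.h"

typedef struct RollbackHashEntry
{
    FullTransactionId full_xid;
    /* ... start/end undo record pointers, dbid, etc. ... */
} RollbackHashEntry;

static FullTransactionId
compute_oldest_xid_having_undo(HTAB *rollback_hash)
{
    HASH_SEQ_STATUS status;
    RollbackHashEntry *entry;
    FullTransactionId oldest = InvalidFullTransactionId;

    hash_seq_init(&status, rollback_hash);
    while ((entry = (RollbackHashEntry *) hash_seq_search(&status)) != NULL)
    {
        if (!FullTransactionIdIsValid(oldest) ||
            FullTransactionIdPrecedes(entry->full_xid, oldest))
            oldest = entry->full_xid;
    }

    return oldest;
}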
On Tue, Jul 16, 2019 at 9:52 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jul 16, 2019 at 7:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I also strongly suspect it is altogether wrong to do > > > this before CommitSubTransaction sets s->state to TRANS_COMMIT; what > > > if a subxact callback throws an error? > > > > Are you worried that it might lead to the execution of actions twice? > > No, I'm worried that you are running code that is part of the commit > path before the transaction has actually committed. > CommitSubTransaction() is full of stuff which basically propagates > whatever the subtransaction did out to the parent transaction, and all > of that code runs after we've ruled out the possibility of an abort, > but this very-similar-looking code runs while it's still possible for > an abort to happen. That seems unlikely to be correct, and even if it > is, it seems needlessly inconsistent. > Fair point, will change as per your suggestion. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 17, 2019 at 3:37 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2019-07-13 15:55:51 +0530, Amit Kapila wrote:
> > On Fri, Jul 12, 2019 at 7:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > > I think even if we currently go with a binary heap, it will be possible to change it to rbtree later, but I am fine either way.
> > >
> > > Well, I don't see much point in revising all of this logic twice. We should pick the way we want it to work and make it work that way.
> >
> > Yeah, I agree. So, I am assuming here that as you have discussed this idea with Andres offlist, he is on board with changing it as he has originally suggested using binary_heap. Andres, do let us know if you think differently here. It would be good if anyone else following the thread can also weigh in.
>
> Yes, I think using an rbtree makes sense.

Okay.

> I'm not yet sure whether we'd want the rbtree nodes being pointed to directly by the hashtable, or whether we'd want one indirection.
>
> e.g. either something like:
>
> typedef struct UndoWorkerQueue
> {
>     /* priority ordered tree */
>     RBTree *tree;
>     ....
> }

I think we also need the size of the rbtree (i.e., how many nodes/undo requests it has) to know whether we can add more. This information is available in a binary heap, but here I think we need to track it in UndoWorkerQueue. Basically, at each enqueue/dequeue, we need to increment/decrement it. (See the sketch below.)

> typedef struct UndoWorkerQueueEntry
> {
>     RBTNode tree_node;
>
>     /*
>      * Reference hashtable via key, not pointers, entries might be
>      * moved.
>      */
>     RollbackHashKey rollback_key;
>     ...
> } UndoWorkerQueueEntry;

In UndoWorkerQueueEntry, we might also want to include some other info like dbid, request_size, next_retry_at, and err_occurred_at, so that while accessing a queue entry in comparator functions or at other times, we don't always need to perform a hash table search. OTOH, we could do hash_search as well, but maybe code-wise it will be better to keep the additional information. Another thing is that we need some freelist/array for UndoWorkerQueueEntries equivalent to the size of the three queues?

> typedef struct RollbackHashEntry
> {
>     ...
>     UndoWorkerQueueEntry *queue_memb_size;
>     UndoWorkerQueueEntry *queue_memb_age;
>     UndoWorkerQueueEntry *queue_memb_error;
> }
>
> and call rbt_delete() for any non-NULL queue_memb_* whenever an entry is dequeued via one of the queues (after setting the one already dequeued from to NULL, of course). Which requires - as Robert mentioned - that rbtree pointers remain stable after insertions.

Right. BTW, do you have any preference for using dynahash or simplehash for RollbackHashTable?

> Alternatively we can have a more complicated arrangement without the "stable pointer" requirement (which'd also similarly work for a binary heap):
>
> I think the first approach is clearly preferable from a simplicity POV, but the second approach would be a bit more generic (applicable to a binary heap as well) and wouldn't require adjusting the rbtree code.

+1 for the first approach; the second one appears quite complicated compared to the first.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
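A small sketch of the bookkeeping Amit describes, layered on the first arrangement above; the entry type is the hypothetical one from that arrangement, while rbt_insert is the real rbtree API:

#include "lib/rbtree.h"

typedef struct UndoWorkerQueueEntry
{
    RBTNode     tree_node;
    /* ... rollback_key and ordering fields, as sketched above ... */
} UndoWorkerQueueEntry;

typedef struct UndoWorkerQueue
{
    RBTree     *tree;           /* priority-ordered requests */
    int         nentries;       /* current number of requests */
    int         max_entries;    /* capacity of this queue */
} UndoWorkerQueue;

static bool
undo_queue_insert(UndoWorkerQueue *queue, UndoWorkerQueueEntry *entry)
{
    bool        isNew;

    if (queue->nentries >= queue->max_entries)
        return false;           /* caller marks the hash table entry invalid */

    rbt_insert(queue->tree, &entry->tree_node, &isNew);
    if (isNew)
        queue->nentries++;
    return true;
}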
Hi, On 2019-07-18 11:15:05 +0530, Amit Kapila wrote: > On Wed, Jul 17, 2019 at 3:37 AM Andres Freund <andres@anarazel.de> wrote: > > I'm not yet sure whether we'd want the rbtree nodes being pointed to > > directly by the hashtable, or whether we'd want one indirection. > > > > e.g. either something like: > > > > > > typedef struct UndoWorkerQueue > > { > > /* priority ordered tree */ > > RBTree *tree; > > .... > > } > > > > I think we also need the size of rbtree (aka how many nodes/undo > requests it has) to know whether we can add more. This information is > available in binary heap, but here I think we need to track it in > UndoWorkerQueue. Basically, at each enqueue/dequeue, we need to > increment/decrement the same. > > > typedef struct UndoWorkerQueueEntry > > { > > RBTNode tree_node; > > > > /* > > * Reference hashtable via key, not pointers, entries might be > > * moved. > > */ > > RollbackHashKey rollback_key > > ... > > } UndoWorkerQueueEntry; > > > > In UndoWorkerQueueEntry, we might also want to include some other info > like dbid, request_size, next_retry_at, err_occurred_at so that while > accessing queue entry in comparator functions or other times, we don't > always need to perform hash table search. OTOH, we can do hash_search > as well, but may be code-wise it will be better to keep additional > information. The dots signal that additional fields are needed in those places. > Another thing is we need some freelist/array for > UndoWorkerQueueEntries equivalent to size of three queues? I think using the slist as I proposed for the second alternative is better? > BTW, do you have any preference for using dynahash or simplehash for > RollbackHashTable? I find simplehash nicer to use in code, personally, and it's faster in most cases... Greetings, Andres Freund
On Tue, Jul 16, 2019 at 8:39 AM Robert Haas <robertmhaas@gmail.com> wrote: > Thomas has already objected to another proposal to add functions that > turn 32-bit XIDs into 64-bit XIDs. Therefore, I feel confident in > predicting that he will likewise object to GetEpochForXid. I think > this needs to be changed somehow, maybe by doing what the XXX comment > you added suggests. Perhaps we should figure out how to write GetOldestFullXmin() and friends. For FinishPreparedTransaction(), the XXX comment sounds about right (TwoPhaseFileHeader should hold an fxid). -- Thomas Munro https://enterprisedb.com
On Tue, Jul 16, 2019 at 2:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Few comments on the new patch: 1. Additionally, +there is a mechanism for multi-insert, wherein multiple records are prepared +and inserted at a time. Which mechanism are you talking about here? By any chance is this related to some old code? 2. +Fetching and undo record +------------------------ +To fetch an undo record, a caller must provide a valid undo record pointer. +Optionally, the caller can provide a callback function with the information of +the block and offset, which will help in faster retrieval of undo record, +otherwise, it has to traverse the undo-chain. I think this is outdated information. You seem to have forgotten to update the README after the latest changes in the API. 3. + * The cid/xid/reloid/rmid information will be added in the undo record header + * in the following cases: + * a) The first undo record of the transaction. + * b) First undo record of the page. + * c) All subsequent record for the transaction which is not the first + * transaction on the page. + * Except above cases, If the rmid/reloid/xid/cid is same in the subsequent + * records this information will not be stored in the record, these information + * will be retrieved from the first undo record of that page. + * If any of the member rmid/reloid/xid/cid has changed, the changed information + * will be stored in the undo record and the remaining information will be + * retrieved from the first complete undo record of the page + */ +UndoCompressionInfo undo_compression_info[UndoLogCategories]; a. Do we want to compress fork_number also? It is an optional field and is only included when the undo record is not for MAIN_FORKNUM. For zheap, this means it will never be included, but in the future, it could be included for some other AM or some other use case. So, not sure if there is any benefit in compressing the same. b. cid/xid/reloid/rmid - I think it is better to write it as rmid, reloid, xid, cid in the same order as you declare them in UndoPackStage. c. Some minor corrections. /Except above/Except for above/; /, If the/, if the/; /is same/is the same/; /record, these information/record rather this information/ d. I think there is no need to start the line "If any of the..." from a new line, it can be continued where the previous line ends. Also, at the end of that line, add a full stop. 4. /* + * Copy the compression global compression info to our context before + * starting prepare because this value might get updated multiple time in + * case of multi-prepare but the global value should be updated only after + * we have successfully inserted the undo record. + */ In the above comment, the first 'compression' is not required. /time/times/ 5. +/* + * The below common information will be stored in the first undo record of the page. + * Every subsequent undo record will not store this information, if required this information + * will be retrieved from the first undo record of the page. + */ +typedef struct UndoCompressionInfo The line length in the above comments exceeds the 80-char limit. You might want to run pgindent to avoid such problems. 6. +/* + * Exclude the common info in undo record flag and also set the compression + * info in the context. + * 'flag' seems to be a redundant word here? 7. 
+UndoSetCommonInfo(UndoCompressionInfo *compressioninfo, + UnpackedUndoRecord *urec, UndoRecPtr urp, + Buffer buffer) +{ + + /* + * If we have valid compression info and the for the same transaction and + * the current undo record is on the same block as the last undo record + * then exclude the common information which are same as first complete + * record on the page. + */ + if (compressioninfo->valid && + FullTransactionIdEquals(compressioninfo->fxid, urec->uur_fxid) && + UndoRecPtrGetBlockNum(urp) == UndoRecPtrGetBlockNum(lasturp)) Here the comment is just a verbal form of the if-check. How about writing it as: "Exclude the common information from the record which is the same as the first record on the page." 8. UndoSetCommonInfo() { .. if (compressioninfo->valid && + FullTransactionIdEquals(compressioninfo->fxid, urec->uur_fxid) && + UndoRecPtrGetBlockNum(urp) == UndoRecPtrGetBlockNum(lasturp)) + { + urec->uur_info &= ~UREC_INFO_XID; + + /* Don't include rmid if it's same. */ + if (urec->uur_rmid == compressioninfo->rmid) + urec->uur_info &= ~UREC_INFO_RMID; + + /* Don't include reloid if it's same. */ + if (urec->uur_reloid == compressioninfo->reloid) + urec->uur_info &= ~UREC_INFO_RELOID; In all the checks except the transaction id check, urec's info is on the left side. I think all the checks can be made consistent. These are some of the things I noticed while skimming through this patch. I will do some more detailed review later. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Jun 28, 2019 at 6:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I happened to open up 0001 from this series, which is from Thomas, and > > I do not think that the pg_buffercache changes are correct. The idea > > here is that the customer might install version 1.3 or any prior > > version on an old release, then upgrade to PostgreSQL 13. When they > > do, they will be running with the old SQL definitions and the new > > binaries. At that point, it sure looks to me like the code in > > pg_buffercache_pages.c is going to do the Wrong Thing. [...] > > Yep, that was completely wrong. Here's a new version. > One comment/question related to 0022-Use-undo-based-rollback-to-clean-up-files-on-abort.patch. +make_undo_smgr_create(RelFileNode *rnode, FullTransactionId fxid, + XLogReaderState *xlog_record) +{ + UnpackedUndoRecord undorecord = {0}; + UndoRecordInsertContext context; + + undorecord.uur_rmid = RM_SMGR_ID; + undorecord.uur_type = UNDO_SMGR_CREATE; + undorecord.uur_info = UREC_INFO_PAYLOAD; + undorecord.uur_dbid = rnode->dbNode; + undorecord.uur_xid = XidFromFullTransactionId(fxid); + undorecord.uur_cid = InvalidCommandId; + undorecord.uur_fork = InvalidForkNumber; While reviewing Dilip's patch (undo-record-interface), I noticed that we include Fork_Num in the undo record if it is not MAIN_FORKNUM. So, in this patch's case, we will always include it, as you are passing InvalidForkNumber. I also see that the patch doesn't use uur_fork in the undo record handler, so I think you don't care what its value is. I am not sure what the best thing to do here is, but it might be better if we can avoid adding fork_num to each undo record. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
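For illustration, the packing rule under discussion might look something like the following sketch (the UREC_INFO_FORK flag bit is an assumption; the real patch may name it differently):

/* Sketch: spend record space on the fork number only when it differs
 * from MAIN_FORKNUM; callers such as smgr undo that don't care about
 * the fork could then pass MAIN_FORKNUM and the field would be omitted. */
static void
undo_record_set_fork(UnpackedUndoRecord *urec, ForkNumber fork)
{
    urec->uur_fork = fork;
    if (fork != MAIN_FORKNUM)
        urec->uur_info |= UREC_INFO_FORK;   /* assumed flag bit */
}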
On Wed, Jul 17, 2019 at 2:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > We add entries in queues only when we want them to be processed by > background workers whereas hash table will contain the entries for all > the pending undo requests irrespective of whether they are executed by > foreground-transaction or by background workers. Once the request is > processed, we remove it from the hash table. The reasons for keeping > all the pending abort requests in hash table is that it allows us to > compute oldestXidHavingUnappliedUndo and second is it avoids us to > have duplicate undo requests by backends and discard worker. In > short, there is no reason to keep all the entries in queues, but there > are reasons to keep all the aborted xact entries in hash table. I think we're drifting off on a tangent here. That does make sense, but my original comment that led to this discussion was "PerformUndoActions() also thinks that there is a possibility of failing to insert a failed request into the error queue, and makes reference to such requests being rediscovered by the discard worker, ..." and none of what you've written explains why there is or should be a possibility of failing to insert a request into the error queue. I feel like we've discussed this point to death. You just make the maximum size of the queue equal to the maximum size of the hash table, and it can't ever fail to have room for a new entry. If you remove entries lazily, then it can, but any time it does, you can just go and clean out all of the dead entries and you're guaranteed to then have enough room. And if we switch to rbtree then we won't do lazy removal any more, and it won't matter anyway. > > Anyway, if you don't like this solution, propose something else. It's > > impossible to correctly implement a hard limit unless the number of > > aborted-but-not-yet-undone transaction is bounded to (HARD_LIMIT - > > ENTRIES_THAT_WOULD_BE_ADDED_AFTER_RECOVERY_IF_THE_SYSTEM_CRASHED_NOW). > > If there are 100 transactions each bound to 2 undo logs, and you > > crash, you will need to (as you have it designed now) add another 200 > > transactions to the hash table upon recovery, and that will make you > > exceed the hard limit unless you were at least 200 transactions below > > the limit before the crash. Have you handled that somehow? If so, > > how? > > Yeah, we have handled it by reserving the space of MaxBackends. It is > UndoRollbackHashTableSize() - MaxBackends. There is a bug in the > current patch which is that it should reserve space for 2 * > MaxBackends so that after recovery, we are safe, but that can be > fixed. One of us is REALLY confused here. Nothing you do in UndoRollbackHashTableSize() can possibly fix the problem that I'm talking about. Suppose the system gets to a point where all of the rollback hash table entries are in use - there are some entries that are used because work was pushed into the background, and then there are other entries that are present because those transactions are being rolled back in the foreground. Now at this point you crash. Now when you start up, all the hash table entries, including the reserved ones, are already in use before any running transactions start. Now if you allow transactions to start before some of the rollbacks complete, you have got big problems. The system might crash again, and if it does, when it restarts, the total amount of outstanding requests will no longer fit in the hash table, which was the whole premise of this design. 
Maybe that doesn't make sense, so think about it this way. Suppose the following happens repeatedly: the system starts, someone begins a transaction that writes an undo record, the rollback workers start up but don't make very much progress because the system is heavily loaded or whatever reason, the system crashes, rinse, repeat. Since no transactions got successfully rolled back and 1 new transaction that needs roll back got added, the number of transactions pending rollback has increased by one. Now, however big you made the hash table, just repeat this process that number of times plus one, and the hash table overflows. The only way you can prevent that is if you stop the transaction from writing undo when the hash table is already too full. > I have responded to it as a separate email, but let's discuss it here. > So, you are right that only time we need to scan the undo logs to find > all pending aborted xacts is immediately after startup. But, we can't > create a fully update-to-date entry from both the logs unless we make > undo launcher to also wait to process anything till we are done. We > are not doing this in the current patch but we can do it if we want. > This will be an additional restriction we have to put which is not > required for the current approach. I mean, that is just not true. There's no fundamental difference between having two possible entries each of which looks like this: struct entry { txn_details d; }; And having a single entry that looks like this: struct entry { txn_details permanent; txn_details unlogged; bool using_permanent; bool using_unlogged; }; I mean, I'm not saying you would actually want to do exactly the second thing, but arguing that something cannot be done with one design or the other is just not correct. > Another related thing is that to update the existing entry for queues, > we need to delete and re-insert the entry after we find the request in > a different log category. Again it depends if we point queue entries > to hash table, then we might not have this additional work but that > has its own set of complexities. I don't follow this. If you have a hash table where the key is XID, there is no need to delete and reinsert anything just because you discover that the XID has not only permanent undo but also unlogged undo, or something of that sort. > I think it is implementation wise simpler to have one entry per > persistence level. It is not that we can't deal with all the > problems being discussed. It's possible that it's simpler, but I'm not finding the arguments you're making very convincing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 19, 2019 at 12:28 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jul 17, 2019 at 2:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Anyway, if you don't like this solution, propose something else. It's > > > impossible to correctly implement a hard limit unless the number of > > > aborted-but-not-yet-undone transaction is bounded to (HARD_LIMIT - > > > ENTRIES_THAT_WOULD_BE_ADDED_AFTER_RECOVERY_IF_THE_SYSTEM_CRASHED_NOW). > > > If there are 100 transactions each bound to 2 undo logs, and you > > > crash, you will need to (as you have it designed now) add another 200 > > > transactions to the hash table upon recovery, and that will make you > > > exceed the hard limit unless you were at least 200 transactions below > > > the limit before the crash. Have you handled that somehow? If so, > > > how? > > > > Yeah, we have handled it by reserving the space of MaxBackends. It is > > UndoRollbackHashTableSize() - MaxBackends. There is a bug in the > > current patch which is that it should reserve space for 2 * > > MaxBackends so that after recovery, we are safe, but that can be > > fixed. > > One of us is REALLY confused here. Nothing you do in > UndoRollbackHashTableSize() can possibly fix the problem that I'm > talking about. Suppose the system gets to a point where all of the > rollback hash table entries are in use - there are some entries that > are used because work was pushed into the background, and then there > are other entries that are present because those transactions are > being rolled back in the foreground. > We are doing exactly what you have written in the last line of the next paragraph "stop the transaction from writing undo when the hash table is already too full.". So we will never face the problems related to repeated crash recovery. The definition of too full is that we stop allowing new transactions that can write undo once the hash table already has entries equivalent to (UndoRollbackHashTableSize() - MaxBackends). Does this make sense? > Now at this point you crash. Now > when you start up, all the hash table entries, including the reserved > ones, are already in use before any running transactions start. Now > if you allow transactions to start before some of the rollbacks > complete, you have got big problems. The system might crash again, > and if it does, when it restarts, the total amount of outstanding > requests will no longer fit in the hash table, which was the whole > premise of this design. > > Maybe that doesn't make sense, so think about it this way. > All you are saying makes sense and I think I can understand the problem you are trying to describe, but we have thought about the same thing and have the algorithm/code in place that won't allow such situations. > > > Another related thing is that to update the existing entry for queues, > > we need to delete and re-insert the entry after we find the request in > > a different log category. Again it depends if we point queue entries > > to hash table, then we might not have this additional work but that > > has its own set of complexities. > > I don't follow this. If you have a hash table where the key is XID, > there is no need to delete and reinsert anything just because you > discover that the XID has not only permanent undo but also unlogged > undo, or something of that sort. > The total size of the undo to be processed will change if we later find undo at another persistence level (permanent or unlogged). Based on the size, the entry's location in the size queue needs to change. 
-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
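For illustration, the admission test described above might look something like this sketch (the lock name and the PendingUndoRequestCount() helper are assumptions, and the locking here is deliberately simplified):

/* Called before a transaction attaches to an undo log for the first
 * time.  MaxBackends entries are held in reserve so that, after a
 * crash, each transaction aborted by the crash can still be entered
 * into the hash table during recovery. */
static bool
UndoRequestSlotAvailable(void)
{
    bool        ok;

    LWLockAcquire(RollbackRequestLock, LW_SHARED);      /* assumed lock */
    ok = PendingUndoRequestCount() <                    /* assumed helper */
        UndoRollbackHashTableSize() - MaxBackends;
    LWLockRelease(RollbackRequestLock);
    return ok;
}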
On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > I don't like the fact that undoaccess.c has a new global, > > undo_compression_info. I haven't read the code thoroughly, but do we > > really need that? I think it's never modified (so it could just be > > declared const), > > Actually, this will get modified otherwise across undo record > insertion how we will know what was the values of the common fields in > the first record of the page. Another option could be that every time > we insert the record, read the value from the first complete undo > record on the page but that will be costly because for every new > insertion we need to read the first undo record of the page. > This information won't be shared across transactions, so can't we keep it in the top transaction's state? It seems to me that would be better than maintaining it as global state. Few more comments on this patch: 1. PrepareUndoInsert() { .. + if (logswitched) + { .. + } + else + { .. + resize = true; .. + } + .. + + do + { + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, rbm); .. + rbm = RBM_ZERO; + cur_blk++; + } while (cur_size < size); + + /* + * Set/overwrite compression info if required and also exclude the common + * fields from the undo record if possible. + */ + if (UndoSetCommonInfo(compression_info, urec, urecptr, + context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf)) + resize = true; + + if (resize) + size = UndoRecordExpectedSize(urec); I see that some of the cases where resize is possible are checked before buffer allocation and some after. Isn't it better to do all these checks before buffer allocation? Also, isn't it better to compute the changed size before buffer allocation as well, as that might sometimes result in fewer buffer allocations? Can you find a better way to write context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf? It makes the line too long and difficult to understand. Check for similar instances in the patch and if possible, change them as well. 2. +InsertPreparedUndo(UndoRecordInsertContext *context) { .. /* + * Try to insert the record into the current page. If it + * doesn't succeed then recall the routine with the next page. + */ + InsertUndoData(&ucontext, page, starting_byte); + if (ucontext.stage == UNDO_PACK_STAGE_DONE) + { + MarkBufferDirty(buffer); + break; + } + MarkBufferDirty(buffer); .. } Can't we call MarkBufferDirty(buffer) just before the 'if' check? That will avoid calling it twice. 3. + * Later, during insert phase we will write actual records into thse buffers. + */ +struct PreparedUndoBuffer /thse/these 4. + /* + * If we are writing first undo record for the page the we can set the + * compression so that subsequent records from the same transaction can + * avoid including common information in the undo records. + */ + if (first_complete_undo) /page the we/page then we 5. PrepareUndoInsert() { .. After + * allocation We'll only advance by as many bytes as we turn out to need. + */ + UndoRecordSetInfo(urec); Change the beginning of the comment to: "After allocation, we'll .." 6. PrepareUndoInsert() { .. * TODO: instead of storing this in the transaction header we can + * have separate undo log switch header and store it there. 
+ */ + prevlogurp = + MakeUndoRecPtr(UndoRecPtrGetLogNo(prevlog_insert_urp), + (UndoRecPtrGetOffset(prevlog_insert_urp) - prevlen)); + I don't think this TODO is valid anymore because now the patch has a separate log-switch header. 7. /* + * If undo log is switched then set the logswitch flag and also reset the + * compression info because we can use same compression info for the new + * undo log. + */ + if (UndoRecPtrIsValid(prevlog_xact_start)) /can/can't -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
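On the readability complaint in point 1, one possibility is simply a well-named local variable, as in this sketch:

/* Name the first prepared buffer instead of repeating the long
 * subscript expression (PreparedUndoBuffer is the struct quoted in
 * point 3 above). */
PreparedUndoBuffer *first_buf;

first_buf = &context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]];
if (UndoSetCommonInfo(compression_info, urec, urecptr, first_buf->buf))
    resize = true;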
On Thu, 9 May 2019 at 12:04, Dilip Kumar <dilipbalaut@gmail.com> wrote: > Patches can be applied on top of undo branch [1] commit: > (cb777466d008e656f03771cf16ec7ef9d6f2778b) > > [1] https://github.com/EnterpriseDB/zheap/tree/undo Below are some review points for 0009-undo-page-consistency-checker.patch: + /* Calculate the size of the partial record. */ + partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) + + phdr->tuple_len + phdr->payload_len - + phdr->record_offset; There is already an UndoPagePartialRecSize() function which calculates the size of the partial record, which seems to do the same as above. If this is the same, you can omit the above code, and instead down below where you increment next_record, you can do "next_record += UndoPagePartialRecSize()". Also, I see an extra sizeof(uint16) added in UndoPagePartialRecSize(). Not sure which one is correct and which one is wrong, unless I am wrong in assuming that the above calculation and the function definition do the same thing. ------------------ + * We just want to mask the cid in the undo record header. So + * only if the partial record in the current page include the undo + * record header then we need to mask the cid bytes in this page. + * Otherwise, directly jump to the next record. Here, I think you mean: "So only if the partial record in the current page includes the *cid* bytes", rather than "includes the undo record header". Maybe we can say: We just want to mask the cid. So do the partial record masking only if the current page includes the cid bytes from the partial record header. ---------------- + if (phdr->record_offset < (cid_offset + sizeof(CommandId))) + { + char *cid_data; + Size mask_size; + + mask_size = Min(cid_offset - phdr->record_offset, + sizeof(CommandId)); + + cid_data = next_record + cid_offset - phdr->record_offset; + memset(&cid_data, MASK_MARKER, mask_size); + Here, if record_offset lies *between* cid start and cid end, then cid_offset - phdr->record_offset will be negative, and so will be mask_size. Probably abs() should do the work. Also, an Assert(cid_data + mask_size <= page_end) would be nice. I know the cid position of a partial record cannot go beyond the page boundary, but it's better to have this Assert as a sanity check. + * Process the undo record of the page and mask their cid filed. filed => field -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
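For what it's worth, one possible shape for that branch, clamping rather than using abs(), might be the following sketch (it reuses the patch's variable names, which should be treated as assumptions here; note it also passes cid_data rather than &cid_data to memset, since it is the pointed-to bytes that should be masked):

/* The cid occupies [cid_offset, cid_offset + sizeof(CommandId)) within
 * the record, and this page holds the record's bytes from record_offset
 * onwards, starting at next_record.  Mask only the cid bytes that
 * actually fall on this page. */
if (phdr->record_offset < cid_offset + sizeof(CommandId))
{
    Size        skipped = 0;
    Size        mask_size = sizeof(CommandId);
    char       *cid_data;

    if (phdr->record_offset > cid_offset)
    {
        /* The front part of the cid was on the previous page. */
        skipped = phdr->record_offset - cid_offset;
        mask_size -= skipped;
    }
    cid_data = next_record + (cid_offset + skipped - phdr->record_offset);
    Assert(cid_data + mask_size <= page_end);
    memset(cid_data, MASK_MARKER, mask_size);
}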
On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > We are doing exactly what you have written in the last line of the > next paragraph "stop the transaction from writing undo when the hash > table is already too full.". So we will never face the problems related > to repeated crash recovery. The definition of too full is that we stop > allowing new transactions that can write undo once the hash table > already has entries equivalent to (UndoRollbackHashTableSize() - > MaxBackends). Does this make sense? Oops, I was looking in the wrong place. Yes, that makes sense, but: 1. It looks like the test you added to PrepareUndoInsert has no locking, and I don't see how that can be right. 2. It seems like this would result in testing for each new undo insertion that gets prepared, whereas surely we would want to test only when first attaching to an undo log. If you've already attached to the undo log, there's no reason not to continue inserting into it, because doing so doesn't increase the number of transactions (or transaction-persistence level combinations) that need undo. 3. I don't think the test itself is correct. It can fire even when there's no problem. It is correct (or would be if it said 2 * MaxBackends) if every other backend in the system is already attached to an undo log (or two). But if they are not, it will block transactions from being started for no reason. For instance, suppose max_connections = 100 and there are another 100 slots for background rollbacks. Now suppose that the system crashes when 101 slots are in use -- 100 pushed into the background plus 1 that was aborted by the crash. On recovery, this test will refuse to let any new transaction start. Actually it is OK for up to 99 transactions to write undo, just not 100. Or, given that you have a slot per persistence level, it's OK to have up to 199 transaction-persistence-level combinations in flight, just not 200. And that is the difference between the system being unusable after the crash until a rollback succeeds and being almost fully usable immediately. > > I don't follow this. If you have a hash table where the key is XID, > > there is no need to delete and reinsert anything just because you > > discover that the XID has not only permanent undo but also unlogged > > undo, or something of that sort. > > The total size of the undo to be processed will change if we later find > undo at another persistence level (permanent or unlogged). Based on the > size, the entry's location in the size queue needs to change. OK, true. But that's not a significant cost, either in runtime or code complexity. I still don't really see any good reason for the hash table key to be anything other than XID, or really, FXID. I mean, sure, the data structure manipulations are a little different, but not in any way that really matters. And it seems to me that there are some benefits, the biggest of which is that the system becomes easier for users to understand. We can simply say that there is a limit on the number of transactions that either (1) are in progress and have written undo or (2) have aborted and not all of the undo has been processed. If the key is XID + persistence level, then it's a limit on the number of transaction-and-persistence-level combinations, which I feel is not so easy to understand. 
In most but not all scenarios, it means that the limit is about double what you think the limit is, and as the mistake in the current version of the patch makes clear, even the people writing the code can forget about that factor of two. It affects a few other things, too. If you made the key XID and fixed problems (2) and (3) from above, then you'd have a situation where a transaction could fail at only one point: either it bombs the first time it tries to write undo, or it works. As it is, there is a second failure scenario: you do a bunch of work on permanent (or unlogged) tables and then try to write to an unlogged (or permanent) table and it fails because there are not enough slots. Is that the end of the world? No, certainly not. The situation should be rare. But if we have to fail transactions, it's best to fail them before they've started doing any work, because that minimizes the amount of work we waste by having to retry. Of course, a transaction that fails midway through when it tries to write at a second persistence level is also consuming an undo slot in a situation where we're short of undo slots. Another thing which Andres pointed out to me off-list is that we might want to have a function that takes a transaction ID as an argument and tells you the status of that transaction from the point of view of the undo machinery: does it have any undo, and if so how much? As you have it now, such a function would require searching the whole hash table, because the user won't be able to provide an UndoRecPtr to go with the XID. If the hash table key were in fact <XID, undo persistence level> rather than <XID, UndoRecPtr>, then you could do it with two lookups; if it were XID alone, you could do it with one lookup. The difference between one lookup and two is not significant, but having to search the whole hash table is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
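For illustration, with the hash table keyed by FullTransactionId alone, the introspection function mentioned above is a single probe. A sketch (all struct and field names here are assumptions, not the patch's):

#include "access/transam.h"
#include "utils/hsearch.h"

typedef struct RollbackHashEntry
{
    FullTransactionId fxid;     /* hash key: the transaction needing undo */
    Size        undo_size[UndoLogCategories];   /* pending undo per level */
} RollbackHashEntry;

/* Report how much undo is still pending for a given transaction. */
static Size
PendingUndoSize(HTAB *rollback_ht, FullTransactionId fxid)
{
    RollbackHashEntry *ent;
    Size        total = 0;
    int         i;

    ent = (RollbackHashEntry *) hash_search(rollback_ht, &fxid,
                                            HASH_FIND, NULL);
    if (ent == NULL)
        return 0;               /* no pending undo for this xid */
    for (i = 0; i < UndoLogCategories; i++)
        total += ent->undo_size[i];
    return total;
}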
On Fri, Jul 19, 2019 at 7:54 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > + * We just want to mask the cid in the undo record header. So > + * only if the partial record in the current page include the undo > + * record header then we need to mask the cid bytes in this page. > + * Otherwise, directly jump to the next record. > Here, I think you mean : "So only if the partial record in the current > page includes the *cid* bytes", rather than "includes the undo record > header" > May be we can say : > We just want to mask the cid. So do the partial record masking only if > the current page includes the cid bytes from the partial record > header. Hmm, but why is it correct to mask the CID at all? Shouldn't that match? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jul 16, 2019 at 6:23 PM Andres Freund <andres@anarazel.de> wrote: > Yea, that seems like a question independent of the "completeness" > requirement. If desirable, it seems trivial to either have > RollbackHashEntry have per-persistence level status (for one entry per > xid), or not (for per-persistence entries). I want to talk more about the "completeness" issue, which is basically a question of whether we should (a) put a hard limit on the number of transactions that have unprocessed undo and that are either aborted or in progress (and thus capable of aborting) as proposed by Andres or (b) not have such a hard limit, as originally proposed by me. I think everyone who is not working on this code has installed an automatic filter rule to send the original thread to /dev/null, so I'm changing the subject line in the hopes of getting some of those people to pay attention. If it doesn't work, at least the concerns will be memorialized in case it comes up later. I originally proposed (b) because if undo application is failing for some reason, it seemed better not to bring the whole system to a screeching halt, but rather just cause incremental performance degradation or something. However, Andres has pointed out this may postpone remedial action and let bugs go undetected, so it might not actually be better. Also, some of the things we need to do, like computing the oldest XID whose undo has not been retired, are tricky/expensive if you don't have complete data in memory, and you can't be sure of having complete data in shared memory without a hard limit. No matter which way we go, failures to apply undo had better be really rare, or we're going to be in really serious trouble, so we're only concerned here with how to handle what is hopefully a very rare scenario, not the common case. I want to consider three specific scenarios that could cause undo application to fail, and then offer some observations about them. Scenario #1: 1. Sessions 1..N each begin a transaction and write a bunch of data to a table (at least enough that we'll try to perform undo in the background). 2. Session N+1 begins a transaction and tries to lock the same table. It blocks. 3. Sessions 1..N abort, successfully pushing the undo work into the background. 4. Session N+1 now acquires the lock and sits on it. 5. Optionally, repeat steps 1-4 K times, each time for a different table. Scenario #2: 1. Any number of sessions begin a transaction, write a bunch of data, and then abort. 2. They all try to perform undo in the foreground. 3. They get killed using pg_terminate_backend(). Scenario #3: 1. A transaction begins, does some work, and then aborts. 2. When undo processing occurs, 1% of such transactions fail during undo apply because of a bug in the table AM. 3. When undo processing retries after a failure, it fails again because the bug is triggered by something about the contents of the undo record, rather than by, say, concurrency. In scenario one, the problem is mostly self-correcting. When we decide that we've got too many things queued up for background processing, and start to force undo processing to happen in the foreground, it will start succeeding, because the foreground process will have retained the lock that it took before writing any data and can therefore undo those writes without having to wait for the lock. However, this will do nothing to help the requests that are already in the background, which will just sit there until they can get the lock. 
I think there is a good argument that they should not actually wait for the lock, or should wait only for a certain time period, and then give up and put the transaction on the error queue for reprocessing at a later time. Otherwise, we're pinning down undo workers, which could easily lead to starvation, just as it does for autovacuum. On the whole, this doesn't sound too bad. We shouldn't be able to fill up the queue with small transactions, because of the restriction that we only push undo work into the background when the transaction is big enough, and if we fill it up with big transactions, then (1) back-pressure will prevent the problem from crippling the system and (2) eventually the problem will be self-correcting, because when the transaction in session N+1 ends, the undo will all go through and everything will be fine. The only real user impact of this scenario is that unrelated work on the system might notice that large rollbacks are now happening in the foreground rather than the background, and if that causes a problem, the DBA can fix it by terminating session N+1. Even if she doesn't, you shouldn't ever hit the hard cap. However, if prepared transactions are in use, we could have a variant of scenario #1 in which each transaction is first prepared, and then the prepared transaction is rolled back. Unlike the ordinary case, this can lead to a nearly-unbounded growth in the number of transactions that are pending undo, because we don't have a way to transfer the locks held by the PGPROC used for the prepare to some running session that could perform the undo. It's not necessary to have a large value for max_prepared_transactions; it only has to be greater than 0, because we can keep reusing the same slots with different tables. That is, let N = max_prepared_xacts, and let K be anything at all; session N+1 can just stay in the same transaction and keep on taking new locks one at a time until the lock table fills up; not sure exactly how long that will take, but it's probably a five digit number of transactions, or maybe six. In this case, we can't force undo into the foreground, so we can exceed the number of transactions that are supposed to be backgrounded. We'll eventually have to just start refusing new transactions permission to attach to an undo log, and they'll error out. Although unpleasant, I don't think that this scenario is a death sentence for the idea of having a hard cap on the table size, because if the cap is 100k or so, you shouldn't really hit it unless you specifically make it your goal to do so. At least, not this way. But if you have a lower cap, like 1k, it doesn't seem crazy to think that you could hit this in a non-artificial scenario; you just need lots of rolled-back prepared transactions plus some long-running DDL. We could mitigate the prepared transaction scenario by providing a way to transfer locks from the prepared transaction to the backend doing the ROLLBACK PREPARED and then make it try to execute the undo actions. I think that would bring this scenario into parity with the non-prepared case. We could still try to background large rollbacks, but if the queue gets too full then ROLLBACK PREPARED would do the work instead, and, with the hypothetical lock transfer mechanism, that would dodge the locking issues. In scenario #2, the undo work is going to have to be retried in the background, and perforce that means reacquiring locks that have been released, and so there is a chance of long lock waits and/or deadlock that cannot really be avoided. 
I think there is basically no way at all to avoid an unbounded accumulation of transactions requiring undo in this case, just as in the similar case where the cluster is repeatedly shut down or repeatedly crashes. Eventually, if you have a hard cap on the number of transactions requiring undo, you're going to hit it, and have to start refusing new undo-using transactions. As Thomas pointed out, that might still be better than some other systems which use undo, where the system doesn't open for any transactions at all after a restart until all undo is retired, and/or where undo is never processed in the background. But it's a possible concern. On the other hand, if you don't have a hard cap, the system may just get further and further behind until it eventually melts, and that's also a possible concern. How plausible is this scenario? For most users, cluster restarts and crashes are uncommon, so that variant isn't likely to happen unless something else is going badly wrong. As to the scenario as written, it's not crazy to think that a DBA might try to kill off sessions that are sitting there stuck in undo processing for long periods of time, but that doesn't make it a good idea. Whatever problems it causes are analogous to the problems you get if you keep killing off autovacuum processes: the system is trying to make you do the right thing, and if you fight it, you will have some kind of trouble no matter what design decisions we make. In scenario #3, the hard limit is likely to bring things to a screeching halt pretty quickly; you'll just run out of space in the in-memory data structures. Otherwise, the problem will not be obvious unless you're keeping an eye on error messages in your logs; the first sign of trouble may be that the undo logs fill up the disk. It's not really clear which is better. There is value in knowing about the problem sooner (because then you can file a bug report right away and get a fix sooner) but there is also value in having the system limp along instead of grinding to a halt (because then you might not be totally down while you're waiting for that bug fix to become available). One other thing that seems worth noting is that we have to consider what happens after a restart. After a crash, and depending on exactly how we design it perhaps also after a non-crash restart, we won't immediately know how many outstanding transactions need undo; we'll have to grovel through the undo logs to find out. If we've got a hard cap, we can't allow new undo-using transactions to start until we finish that work. It's possible that, at the moment of the crash, the maximum number of items had already been pushed into the background, and every foreground session was busy trying to undo an abort as well. If so, we're already up against the limit. We'll have to scan through all of the undo logs and examine each transaction to get a count of how many transactions are already in a needs-undo-work state; only once we have that value do we know whether it's OK to admit new transactions to using the undo machinery, and how many we can admit. In typical cases, that won't take long at all, because there won't be any pending undo work, or not much, and we'll very quickly read the handful of transaction headers that we need to consult and away we go. However, if the hard limit is pretty big, and we're pretty close to it, counting might take a long time. It seems bothersome to have this interval between when we start accepting transactions and when we can accept transactions that use undo. 
Instead of throwing an ERROR, we can probably just teach the system to wait for the background process to finish doing the counting; that's what Amit's patch does currently. Or, we could not even open for connections until the counting has been completed. When I first thought about this, I was really concerned about the idea of a hard limit, but the more I think about it the less problematic it seems. I think in the end it boils down to a question of: when things break, what behavior would users prefer? You can either have a fairly quick, hard breakage which will definitely get your attention, or you can have a long, slow process of gradual degradation that doesn't actually stop the system until, say, the XIDs stuck in the undo processing queue become old enough to threaten wraparound, or the disk fills up. Which is less evil? Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-07-19 13:28:14 -0400, Robert Haas wrote: > I want to consider three specific scenarios that could cause undo > application to fail, and then offer some observations about them. > > Scenario #1: > > 1. Sessions 1..N each begin a transaction and write a bunch of data to > a table (at least enough that we'll try to perform undo in the > background). > 2. Session N+1 begins a transaction and tries to lock the same table. > It blocks. > 3. Sessions 1..N abort, successfully pushing the undo work into the background. > 4. Session N+1 now acquires the lock and sits on it. > 5. Optionally, repeat steps 1-4 K times, each time for a different table. > > Scenario #2: > > 1. Any number of sessions begin a transaction, write a bunch of data, > and then abort. > 2. They all try to perform undo in the foreground. > 3. They get killed using pg_terminate_backend(). > > Scenario #3: > > 1. A transaction begins, does some work, and then aborts. > 2. When undo processing occurs, 1% of such transactions fail during > undo apply because of a bug in the table AM. > 3. When undo processing retries after a failure, it fails again > because the bug is triggered by something about the contents of the > undo record, rather than by, say, concurrency. > However, if prepared transactions are in use, we could have a variant > of scenario #1 in which each transaction is first prepared, and then > the prepared transaction is rolled back. Unlike the ordinary case, > this can lead to a nearly-unbounded growth in the number of > transactions that are pending undo, because we don't have a way to > transfer the locks held by the PGPROC used for the prepare to some running > session that could perform the undo. It doesn't seem that hard - and kind of required for robustness independent of the decision around "completeness" - to find a way to use the locks already held by the prepared transaction. > It's not necessary to have a > large value for max_prepared_transactions; it only has to be greater > than 0, because we can keep reusing the same slots with different > tables. That is, let N = max_prepared_xacts, and let K be anything at > all; session N+1 can just stay in the same transaction and keep on > taking new locks one at a time until the lock table fills up; not sure > exactly how long that will take, but it's probably a five digit number > of transactions, or maybe six. In this case, we can't force undo into > the foreground, so we can exceed the number of transactions that are > supposed to be backgrounded. I'm not following, unfortunately. I don't understand exactly what scenario you are referring to. You say "session N+1 can just stay in the same transaction", but then you also reference something taking "probably a five digit number of transactions". Are those transactions the prepared ones? Also, if somebody fills up the entire lock table, then the system is effectively down - independent of UNDO, and no meaningful amount of UNDO is going to be written. Perhaps we need some better resource control, but that's really independent of UNDO. Perhaps you can just explain the scenario in a few more words? My comments regarding it probably make no sense, given how little I understand what the scenario is. > In scenario #2, the undo work is going to have to be retried in the > background, and perforce that means reacquiring locks that have been > released, and so there is a chance of long lock waits and/or deadlock > that cannot really be avoided. 
I think there is basically no way at > all to avoid an unbounded accumulation of transactions requiring undo > in this case, just as in the similar case where the cluster is > repeatedly shut down or repeatedly crashes. Eventually, if you have a > hard cap on the number of transactions requiring undo, you're going to > hit it, and have to start refusing new undo-using transactions. As > Thomas pointed out, that might still be better than some other systems > which use undo, where the system doesn't open for any transactions at > all after a restart until all undo is retired, and/or where undo is > never processed in the background. But it's a possible concern. On the > other hand, if you don't have a hard cap, the system may just get > further and further behind until it eventually melts, and that's also > a possible concern. You could force new connections to complete the rollback processing of the terminated connection, if there's too much pending UNDO. That'd be a way of providing back-pressure against such crazy scenarios. Seems again that it'd be good to have that pressure, independent of the decision on completeness. > One other thing that seems worth noting is that we have to consider > what happens after a restart. After a crash, and depending on exactly > how we design it perhaps also after a non-crash restart, we won't > immediately know how many outstanding transactions need undo; we'll > have to grovel through the undo logs to find out. If we've got a hard > cap, we can't allow new undo-using transactions to start until we > finish that work. Couldn't we record the outstanding transactions in the checkpoint, and then recompute the changes to that record during WAL replay? > When I first thought about this, I was really concerned about the idea > of a hard limit, but the more I think about it the less problematic it > seems. I think in the end it boils down to a question of: when things > break, what behavior would users prefer? You can either have a fairly > quick, hard breakage which will definitely get your attention, or you > can have a long, slow process of gradual degradation that doesn't > actually stop the system until, say, the XIDs stuck in the undo > processing queue become old enough to threaten wraparound, or the disk > fills up. Which is less evil? Yea, I think that's what it boils down to... Would be good to have a few more opinions on this. Greetings, Andres Freund
On Fri, Jul 19, 2019 at 2:04 PM Andres Freund <andres@anarazel.de> wrote: > It doesn't seem that hard - and kind of required for robustness > independent of the decision around "completeness" - to find a way to use > the locks already held by the prepared transaction. I'm not wild about finding more subtasks to put on the must-do list, but I agree it's doable. > I'm not following, unfortunately. > > I don't understand exactly what scenario you are referring to. You say > "session N+1 can just stay in the same transaction", but then you also > reference something taking "probably a five digit number of > transactions". Are those transactions the prepared ones? So you open a bunch of sessions. All but one of them begin a transaction, insert data into a table, and then prepare. The last one begins a transaction and locks the table. Now you roll back all the prepared transactions. Those sessions now begin new transactions, insert data into a second table, and prepare the second set of transactions. The last session, which still has the first table locked, now locks the second table in addition. Now you again roll back all the prepared transactions. At this point you have 2 * max_prepared_transactions that are waiting for undo, all blocked on that last session that holds locks on both tables. So now you go have all of those sessions begin a third transaction, and they all insert into a third table, and prepare. The last session now attempts AEL on that third table, and once it's waiting, you roll back all the prepared transactions, after which that last session successfully picks up its third table lock. You can keep repeating this, locking a new table each time, until you run out of lock table space, by which time you will have roughly max_prepared_transactions * size_of_lock_table transactions waiting for undo processing. > You could force new connections to complete the rollback processing of > the terminated connection, if there's too much pending UNDO. That'd be a > way of providing back-pressure against such crazy scenarios. Seems > again that it'd be good to have that pressure, independent of the > decision on completeness. That would definitely provide a whole lot of back-pressure, but it would also make the system unusable if the undo handler finds a way to FATAL, or just hangs for some stupid reason (stuck I/O?). It would be a shame if the administrative action needed to fix the problem were prevented by the back-pressure mechanism. One thing I've thought about, which I think would be helpful for a variety of scenarios, is to have a facility that forces a computed delay at the start of each write transaction (when it first writes WAL, or when an XID is assigned), or we could adapt that to this case and apply it at the beginning of each undo-using transaction. So for example if you are about to run out of space in pg_wal, you can slow things down to let the checkpoint complete, or if you are about to run out of XIDs, you can slow things down to let autovacuum complete, or if you are about to run out of undo slots, you can slow things down to let some undo complete. The trick is to make sure that you only wait when it's likely to do some good; if you wait because you're running out of XIDs and the reason you're running out of XIDs is because somebody left a replication slot or a prepared transaction around, the back-pressure is useless. > Couldn't we record the outstanding transactions in the checkpoint, and > then recompute the changes to that record during WAL replay? 
Hmm, that's not a bad idea. So the transactions would have to "count" the moment they insert their first undo record, which is exactly the right thing anyway. Hmm, but what about transactions that are only touching unlogged tables? > Yea, I think that's what it boils down to... Would be good to have a few > more opinions on this. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
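For illustration, the computed-delay idea from the previous message might look something like this sketch (PendingUndoRequestCount() and the ramp constants are assumptions, not anything posted):

/* Called before a transaction writes its first undo record: sleep in
 * proportion to how full the rollback hash table is, so that foreground
 * work slows down and undo workers get a chance to catch up. */
static void
UndoBackPressureDelay(void)
{
    double      fill = (double) PendingUndoRequestCount() /
                       (double) UndoRollbackHashTableSize();

    /* No delay below half full; then ramp linearly up to ~1s at full. */
    if (fill > 0.5)
        pg_usleep((long) ((fill - 0.5) * 2.0 * 1000000.0));
}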
On Fri, Jul 19, 2019 at 10:28 AM Robert Haas <robertmhaas@gmail.com> wrote: > In scenario #2, the undo work is going to have to be retried in the > background, and perforce that means reacquiring locks that have been > released, and so there is a chance of long lock waits and/or deadlock > that cannot really be avoided. I haven't studied the UNDO or zheap stuff in any detail, but I am concerned about rollbacks that deadlock. I'd feel a lot better about it if forward progress was guaranteed, somehow. That seems to imply that locks are retained, which is probably massively inconvenient to ensure. Not least because it probably requires cooperation from underlying access methods. -- Peter Geoghegan
Hi, On 2019-07-19 14:50:22 -0400, Robert Haas wrote: > On Fri, Jul 19, 2019 at 2:04 PM Andres Freund <andres@anarazel.de> wrote: > > It doesn't seem that hard - and kind of required for robustness > > independent of the decision around "completeness" - to find a way to use > > the locks already held by the prepared transaction. > > I'm not wild about finding more subtasks to put on the must-do list, > but I agree it's doable. Isn't that pretty inherently required? How are you otherwise ever going to be able to roll back a transaction that holds an AEL on a relation it also modifies? I might be standing on my own head here, though. > > You could force new connections to complete the rollback processing of > > the terminated connection, if there's too much pending UNDO. That'd be a > > way of providing back-pressure against such crazy scenarios. Seems > > again that it'd be good to have that pressure, independent of the > > decision on completeness. > > That would definitely provide a whole lot of back-pressure, but it > would also make the system unusable if the undo handler finds a way to > FATAL, or just hangs for some stupid reason (stuck I/O?). It would be > a shame if the administrative action needed to fix the problem were > prevented by the back-pressure mechanism. Well, then perhaps that admin ought not to constantly terminate connections... I was thinking that new connections wouldn't be forced to do that if there were still a lot of headroom regarding #transactions-to-be-rolled-back. And if undo workers kept up, you'd also not hit this. > > Couldn't we record the outstanding transactions in the checkpoint, and > > then recompute the changes to that record during WAL replay? > > Hmm, that's not a bad idea. So the transactions would have to "count" > the moment they insert their first undo record, which is exactly the > right thing anyway. > > Hmm, but what about transactions that are only touching unlogged tables? Wouldn't we throw all that UNDO away in a crash restart? There's no underlying table data anymore, after all. And for proper shutdown checkpoints they could just be included. Greetings, Andres Freund
On Fri, Jul 19, 2019 at 2:57 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jul 19, 2019 at 10:28 AM Robert Haas <robertmhaas@gmail.com> wrote: > > In scenario #2, the undo work is going to have to be retried in the > > background, and perforce that means reacquiring locks that have been > > released, and so there is a chance of long lock waits and/or deadlock > > that cannot really be avoided. > > I haven't studied the UNDO or zheap stuff in any detail, but I am > concerned about rollbacks that deadlock. I'd feel a lot better about > it if forward progress was guaranteed, somehow. That seems to imply > that locks are retained, which is probably massively inconvenient to > ensure. Not least because it probably requires cooperation from > underlying access methods. Right, that's definitely a big part of the concern here, but I don't really believe that retaining locks is absolutely required, or even necessarily desirable. For instance, suppose that I create a table, bulk-load a whole lotta data into it, and then abort. Further suppose that by the time we start trying to process the undo in the background, we can't get the lock. Well, that probably means somebody is performing DDL on the table. If they just did LOCK TABLE or ALTER TABLE SET STATISTICS, we are going to need to execute that same undo once the DDL is complete. However, if the DDL is DROP TABLE, we're going to find that once we can get the lock, the undo is obsolete, and we don't need to worry about it any more. Had we made it 100% certain that the DROP TABLE couldn't go through until the undo was performed, we could avoid having to worry about the undo having become obsolete ... but that's hardly a win. We're better off allowing the drop and then just chucking the undo. Likely, something like CLUSTER or VACUUM FULL would take care of removing any rows created by aborted transactions along the way, so the undo could be thrown away afterwards without processing it. Point being - there's at least some chance that the operations which block forward progress also represent progress of another sort. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 19, 2019 at 3:12 PM Andres Freund <andres@anarazel.de> wrote: > On 2019-07-19 14:50:22 -0400, Robert Haas wrote: > > On Fri, Jul 19, 2019 at 2:04 PM Andres Freund <andres@anarazel.de> wrote: > > > It doesn't seem that hard - and kind of required for robustness > > > independent of the decision around "completeness" - to find a way to use > > > the locks already held by the prepared transaction. > > > > I'm not wild about finding more subtasks to put on the must-do list, > > but I agree it's doable. > > Isn't that pretty inherently required? How are you otherwise ever going > to be able to roll back a transaction that holds an AEL on a relation it > also modifies? I might be standing on my own head here, though. I think you are. If a transaction holds an AEL on a relation it also modifies, we still only need something like RowExclusiveLock to roll it back. If we retain the transaction's locks until undo is complete, we will not deadlock, but we'll also hold AccessExclusiveLock for a long time. If we release the transaction's locks, we can perform the undo in the background with only RowExclusiveLock, which is full of win. Even if you insist that the undo task should acquire the same lock the transaction held, which seems entirely excessive to me, that hardly prevents undo from being applied. Once the original transaction has released its locks, the undo system can acquire them the next time the relation isn't busy (or when it gets to the head of the lock queue). As far as I can see, the only reason why you would care about this is to make the back-pressure system effective against prepared transactions. Different people may want that more or less, but I have a little trouble with the idea that it is a hard requirement. > Well, then perhaps that admin ought not to constantly terminate > connections... I was thinking that new connections wouldn't be forced > to do that if there were still a lot of headroom regarding > #transactions-to-be-rolled-back. And if undo workers kept up, you'd > also not hit this. Sure, but cascading failure scenarios suck. > > Hmm, that's not a bad idea. So the transactions would have to "count" > > the moment they insert their first undo record, which is exactly the > > right thing anyway. > > > > Hmm, but what about transactions that are only touching unlogged tables? > > Wouldn't we throw all that UNDO away in a crash restart? There's no > underlying table data anymore, after all. > > And for proper shutdown checkpoints they could just be included. On thirty seconds thought, that sounds like it would work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: should there be a hard-limit on the number of transactions pending undo?
From: Andres Freund
On 2019-07-19 15:57:45 -0400, Robert Haas wrote: > On Fri, Jul 19, 2019 at 3:12 PM Andres Freund <andres@anarazel.de> wrote: > > Isn't that pretty inherently required? How are otherwise ever going to > > be able to roll back a transaction that holds an AEL on a relation it > > also modifies? I might be standing on my own head here, though. > > I think you are. If a transaction holds an AEL on a relation it also > modifies, we still only need something like RowExclusiveLock to roll > it back. If we retain the transaction's locks until undo is complete, > we will not deadlock, but we'll also hold AccessExclusiveLock for a > long time. If we release the transaction's locks, we can perform the > undo in the background with only RowExclusiveLock, which is full of > win. Even if you insist that the undo task should acquire the same lock > the original transaction held on the relation, which seems entirely > excessive to me, that hardly prevents undo from being applied. Once > the original transaction has released its locks, the undo system can > acquire those locks the next time the relation isn't busy (or when it > gets to the head of the lock queue). Good morning, Mr Freund. Not sure what you were thinking there.
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Fri, Jul 19, 2019 at 12:52 PM Robert Haas <robertmhaas@gmail.com> wrote: > Right, that's definitely a big part of the concern here, but I don't > really believe that retaining locks is absolutely required, or even > necessarily desirable. For instance, suppose that I create a table, > bulk-load a whole lotta data into it, and then abort. Further suppose > that by the time we start trying to process the undo in the > background, we can't get the lock. Well, that probably means somebody > is performing DDL on the table. I believe that the primary reason why certain other database systems retain locks until rollback completes (or release their locks in reverse order, as UNDO processing progresses) is that application code will often repeat exactly the same actions on receiving a transient error, until the action finally completes successfully. Just like with serialization failures, or with manually implemented UPSERT loops that must sometimes retry. This is why UNDO is often (or always) processed synchronously, blocking progress of the client connection as its xact rolls back. Obviously these other systems could easily hand off the work of rolling back the transaction to an asynchronous worker process, and return success to the client that encounters an error (or asks to abort/roll back) almost immediately. I have to imagine that they haven't implemented this straightforward optimization because it makes sense that the cost of rolling back the transaction is primarily borne by the client that actually rolls back. And, as I said, because a lot of application code will immediately retry on failure, which needs to not deadlock with an asynchronous rollback process. > If they just did LOCK TABLE or ALTER > TABLE SET STATISTICS, we are going to need to execute that same undo > once the DDL is complete. However, if the DDL is DROP TABLE, we're > going to find that once we can get the lock, the undo is obsolete, and > we don't need to worry about it any more. Had we made it 100% certain > that the DROP TABLE couldn't go through until the undo was performed, > we could avoid having to worry about the undo having become obsolete > ... but that's hardly a win. We're better off allowing the drop and > then just chucking the undo. I'm sure that there are cases like that. And, I'm pretty sure that at least one of the other database systems that I'm thinking of isn't as naive as I suggest, without being sure of the specifics. The classic approach is to retain the locks, even though that sucks in some cases. That doesn't mean that you have to do it that way, but it's probably a good idea to present your design in a way that compares and contrasts with the classic approach. I'm pretty sure that this is related to the way in which other systems retain coarse-grained locks when bitmap indexes are used, even though that makes them totally unusable with OLTP apps. It seems like it would help users a lot if their bitmap indexes didn't come with that problem, but it's a price that they continue to have to pay. > Point being - there's at least some chance that the operations which > block forward progress also represent progress of another sort. That's good, provided that there isn't observable lock starvation. I don't think that you need to eliminate the theoretical risk of lock starvation. It deserves careful, ongoing consideration, though.
It's difficult to codify exactly what I have in mind, but I can give you an informal definition now: It's probably okay if there is the occasional implementation-level deadlock because the user got unlucky once. However, it's not okay for there to be *continual* deadlocks because the user got unlucky just once. Even if the user had *extraordinarily* bad luck that one time. In short, my sense is that it's never okay for the system as a whole to "get stuck" in a deadlock or livelock loop. Actually, it might even be okay if somebody had a test case that exhibits "getting stuck" behavior, provided the test case is very delicate, and looks truly adversarial (i.e. it goes beyond being extraordinarily unlucky). I know that this is all pretty hand-wavy, and I don't expect you to have a definitive response. These are some high level concerns that I have, that may or may not apply to what you're trying to do. -- Peter Geoghegan
On Fri, Jul 19, 2019 at 6:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > I believe that the primary reason why certain other database systems > retain locks until rollback completes (or release their locks in > reverse order, as UNDO processing progresses) is that application code > will often repeat exactly the same actions on receiving a transient > error, until the action finally completes successfully. Just like with > serialization failures, or with manually implemented UPSERT loops that > must sometimes retry. This is why UNDO is often (or always) processed > synchronously, blocking progress of the client connection as its xact > rolls back. I don't think this matters here at all. As long as there's only DML involved, there won't be any lock conflicts anyway - everybody's taking RowExclusiveLock or less, and it's all fine. If you update a row in zheap, abort, and then try to update again before the rollback happens, we'll do a page-at-a-time rollback in the foreground, and proceed with the update; when we get around to applying the undo, we'll notice that page has already been handled and skip the undo records that pertain to it. To get the kinds of problems I'm on about here, somebody's got to be taking some more serious locks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Fri, Jul 19, 2019 at 4:14 PM Robert Haas <robertmhaas@gmail.com> wrote: > I don't think this matters here at all. As long as there's only DML > involved, there won't be any lock conflicts anyway - everybody's > taking RowExclusiveLock or less, and it's all fine. If you update a > row in zheap, abort, and then try to update again before the rollback > happens, we'll do a page-at-a-time rollback in the foreground, and > proceed with the update; when we get around to applying the undo, > we'll notice that page has already been handled and skip the undo > records that pertain to it. To get the kinds of problems I'm on about > here, somebody's got to be taking some more serious locks. If I'm not mistaken, you're tacitly assuming that you'll always be using zheap, or something sufficiently similar to zheap. It'll probably never be possible to UNDO changes to something like a GIN index on a zheap table, because you can never do that with sensible concurrency/deadlock behavior. I don't necessarily have a problem with that. I don't pretend to understand how much of a problem it is. Obviously it partially depends on what your ambitions are for this infrastructure. Still, assuming that I have it right, ISTM that UNDO/zheap/whatever should explicitly own this restriction. -- Peter Geoghegan
On Fri, Jul 19, 2019 at 6:37 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jul 19, 2019 at 7:54 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > + * We just want to mask the cid in the undo record header. So > > + * only if the partial record in the current page include the undo > > + * record header then we need to mask the cid bytes in this page. > > + * Otherwise, directly jump to the next record. > > Here, I think you mean : "So only if the partial record in the current > > page includes the *cid* bytes", rather than "includes the undo record > > header" > > May be we can say : > > We just want to mask the cid. So do the partial record masking only if > > the current page includes the cid bytes from the partial record > > header. > > Hmm, but why is it correct to mask the CID at all? Shouldn't that match? We don't write the CID in the WAL, because in hot standby or after recovery we don't need the actual CID for visibility. So during REDO, while generating the undo record, we set the CID to 'FirstCommandId', which can differ from its value at DO time. That's the reason we mask it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
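For concreteness, a minimal sketch of that masking rule (not the patch's actual code): overwrite every cid on the undo page before the DO/REDO consistency comparison, so that both images carry FirstCommandId. The record-walking shape follows the fragment quoted above; the urec_cid field placement and the record-size helper are assumptions, and the partial-record-at-a-page-boundary case raised later in this thread is ignored here.

    /* Sketch: normalize every cid on the undo page so the DO image
     * (real cid) and the REDO image (FirstCommandId) compare equal. */
    char *next_record = page_start + first_record_offset;

    while (next_record < page_end)
    {
        UndoRecordHeader *header = (UndoRecordHeader *) next_record;

        /* if this record carries a cid, overwrite it */
        if ((header->urec_info & UREC_INFO_CID) != 0)
            header->urec_cid = FirstCommandId;

        next_record += UndoRecordSize(header);  /* assumed helper */
    }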
On Fri, Jul 19, 2019 at 6:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > We are doing exactly what you have written in the last line of the > > next paragraph "stop the transaction from writing undo when the hash > > table is already too full.". So we will > > never face the problems related to repeated crash recovery. The > > definition of too full is that we stop allowing the new transactions > > that can write undo when we have the hash table already have entries > > equivalent to (UndoRollbackHashTableSize() - MaxBackends). Does this > > make sense? > > Oops, I was looking in the wrong place. Yes, that makes sense, but: > > 1. It looks like the test you added to PrepareUndoInsert has no > locking, and I don't see how that can be right. > +if (ProcGlobal->xactsHavingPendingUndo > +(UndoRollbackHashTableSize() - MaxBackends)) The actual HARD_LIMIT is UndoRollbackHashTableSize(), but we only allow a new backend to prepare an undo record if there are MaxBackends empty slots in the hash table. This guarantees that we always have at least one slot in the hash table for our current prepare, even if every backend running a transaction has aborted and inserted an entry in the hash table. I think the problem with this check is that for any backend to prepare an undo record there must be MaxBackends empty slots in the hash table, so that every concurrent backend can still insert its request, and this seems too restrictive. Having said that, I think we must ensure MaxBackends * 2 empty slots in the hash table, as each backend can enter 2 requests in the hash table. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
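As a hedged sketch of the check being discussed (ProcGlobal->xactsHavingPendingUndo, UndoRollbackHashTableSize and MaxBackends are from the quoted patch; the errcode and message are illustrative), the 2 * MaxBackends variant would look something like:

    /* Sketch only: keep enough headroom for every backend to enqueue its
     * two possible requests (one per persistence level). */
    if (ProcGlobal->xactsHavingPendingUndo >
        UndoRollbackHashTableSize() - 2 * MaxBackends)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
                 errmsg("too many transactions have pending undo work")));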
On Fri, Jul 19, 2019 at 10:58 PM Robert Haas <robertmhaas@gmail.com> wrote: > > One other thing that seems worth noting is that we have to consider > what happens after a restart. After a crash, and depending on exactly > how we design it perhaps also after a non-crash restart, we won't > immediately know how many outstanding transactions need undo; we'll > have to grovel through the undo logs to find out. If we've got a hard > cap, we can't allow new undo-using transactions to start until we > finish that work. It's possible that, at the moment of the crash, the > maximum number of items had already been pushed into the background, > and every foreground session was busy trying to undo an abort as well. > If so, we're already up against the limit. We'll have to scan through > all of the undo logs and examine each transaction to get a count on > how many transactions are already in a needs-undo-work state; only > once we have that value do we know whether it's OK to admit new > transactions to using the undo machinery, and how many we can admit. > In typical cases, that won't take long at all, because there won't be > any pending undo work, or not much, and we'll very quickly read the > handful of transaction headers that we need to consult and away we go. > However, if the hard limit is pretty big, and we're pretty close to > it, counting might take a long time. It seems bothersome to have this > interval between when we start accepting transactions and when we can > accept transactions that use undo. Instead of throwing an ERROR, we > can probably just teach the system to wait for the background process > to finish doing the counting; that's what Amit's patch does currently. > Yeah; however, we wait for a certain threshold period of time (one minute) for the counting to finish and then error out. We could wait until the counting is finished, but I am not sure that is a good idea, because the user can anyway try again after some time. > Or, we could not even open for connections until the counting has been > completed. > > When I first thought about this, I was really concerned about the idea > of a hard limit, but the more I think about it the less problematic it > seems. I think in the end it boils down to a question of: when things > break, what behavior would users prefer? > One minor thing I would like to add here is that we are providing some knobs so that systems having a larger number of rollbacks can be configured with a much higher hard limit, such that it won't be hit on those systems. I know it is not always easy to find the right value, but I guess users can learn from the behavior and then change it to avoid the problem in the future. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
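For what it's worth, such a knob would presumably end up as an ordinary GUC; the following guc.c-style table entry is purely an illustration (the name, default, and category are invented, not from the patch):

    /* Sketch: a hypothetical knob for the hard limit; all names invented. */
    static int max_xacts_pending_undo = 1024;

    /* entry for the ConfigureNamesInt[] table in guc.c */
    {
        {"max_xacts_pending_undo", PGC_POSTMASTER, RESOURCES,
            gettext_noop("Maximum number of transactions that may have pending undo."),
            NULL
        },
        &max_xacts_pending_undo,
        1024, 1, INT_MAX,
        NULL, NULL, NULL
    },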
On Sat, Jul 20, 2019 at 4:17 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Jul 19, 2019 at 12:52 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Right, that's definitely a big part of the concern here, but I don't > > really believe that retaining locks is absolutely required, or even > > necessarily desirable. For instance, suppose that I create a table, > > bulk-load a whole lotta data into it, and then abort. Further suppose > > that by the time we start trying to process the undo in the > > background, we can't get the lock. Well, that probably means somebody > > is performing DDL on the table. > > I believe that the primary reason why certain other database systems > retain locks until rollback completes (or release their locks in > reverse order, as UNDO processing progresses) is that application code > will often repeat exactly the same actions on receiving a transient > error, until the action finally completes successfully. Just like with > serialization failures, or with manually implemented UPSERT loops that > must sometimes retry. This is why UNDO is often (or always) processed > synchronously, blocking progress of the client connection as its xact > rolls back. > > Obviously these other systems could easily hand off the work of > rolling back the transaction to an asynchronous worker process, and > return success to the client that encounters an error (or asks to > abort/roll back) almost immediately. I have to imagine that they > haven't implemented this straightforward optimization because it makes > sense that the cost of rolling back the transaction is primarily borne > by the client that actually rolls back. > It is also possible that there are other disadvantages or technical challenges in those other systems due to which they decided not to have such a mechanism. I think one such database prepares a consistent copy of pages during read operations based on an SCN or something like that. It might not be as easy for such a system to check whether there is some pending undo which needs to be consulted. I am not saying that there are no ways to overcome such things, but they might have incurred much more cost or had some other disadvantages. I am not sure it is straightforward to guess why some other system does things in some particular way unless there is explicit documentation about it. Having said that, I agree that there are a good number of advantages to performing the actions in the client that actually rolls back, and we should try to do that where it is not a good idea to transfer the work to background workers, such as for short transactions. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 19, 2019 at 6:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > We are doing exactly what you have written in the last line of the > > next paragraph "stop the transaction from writing undo when the hash > > table is already too full.". So we will > > never face the problems related to repeated crash recovery. The > > definition of too full is that we stop allowing the new transactions > > that can write undo when we have the hash table already have entries > > equivalent to (UndoRollbackHashTableSize() - MaxBackends). Does this > > make sense? > > Oops, I was looking in the wrong place. Yes, that makes sense, but: > > 1. It looks like the test you added to PrepareUndoInsert has no > locking, and I don't see how that can be right. > > 2. It seems like this is would result in testing for each new undo > insertion that gets prepared, whereas surely we would want to only > test when first attaching to an undo log. If you've already attached > to the undo log, there's no reason not to continue inserting into it, > because doing so doesn't increase the number of transactions (or > transaction-persistence level combinations) that need undo. > I agree that it should not be done for each undo insertion, but rather whenever a transaction attaches to an undo log. > 3. I don't think the test itself is correct. It can fire even when > there's no problem. It is correct (or would be if it said 2 * > MaxBackends) if every other backend in the system is already attached > to an undo log (or two). But if they are not, it will block > transactions from being started for no reason. > Right, we should find a way to know the exact number of transactions that are attached to undo logs at any point in time; then we can have a more precise check. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jul 20, 2019 at 12:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jul 19, 2019 at 6:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > We are doing exactly what you have written in the last line of the > > > next paragraph "stop the transaction from writing undo when the hash > > > table is already too full.". So we will > > > never face the problems related to repeated crash recovery. The > > > definition of too full is that we stop allowing the new transactions > > > that can write undo when we have the hash table already have entries > > > equivalent to (UndoRollbackHashTableSize() - MaxBackends). Does this > > > make sense? > > > > Oops, I was looking in the wrong place. Yes, that makes sense, but: > > > > 1. It looks like the test you added to PrepareUndoInsert has no > > locking, and I don't see how that can be right. > > > > 2. It seems like this is would result in testing for each new undo > > insertion that gets prepared, whereas surely we would want to only > > test when first attaching to an undo log. If you've already attached > > to the undo log, there's no reason not to continue inserting into it, > > because doing so doesn't increase the number of transactions (or > > transaction-persistence level combinations) that need undo. > > > > I agree that it should not be done for each undo insertion, but rather > > whenever a transaction attaches to an undo log. > > > 3. I don't think the test itself is correct. It can fire even when > > there's no problem. It is correct (or would be if it said 2 * > > MaxBackends) if every other backend in the system is already attached > > to an undo log (or two). But if they are not, it will block > > transactions from being started for no reason. > > > > Right, we should find a way to know the exact number of transactions > > that are attached to undo logs at any point in time; then we can have a > > more precise check.

Maybe we can make ProcGlobal->xactsHavingPendingUndo an atomic variable. We can increment its value atomically whenever:
a) a transaction writes its first undo record for each persistence level
b) an abort request is inserted by the 'StartupPass'

And we will decrement it when:
a) the transaction commits (decrement by 1 for each persistence level it has written undo for)
b) a rollback request is processed

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
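A minimal sketch of that counter (assuming a pg_atomic_uint32 field in PROC_HDR; xactsHavingPendingUndo is the patch's name, the call sites shown are illustrative):

    #include "port/atomics.h"

    /* in PROC_HDR (shared memory): pg_atomic_uint32 xactsHavingPendingUndo; */

    /* increment: first undo record for a persistence level, or an abort
     * request queued during startup */
    pg_atomic_fetch_add_u32(&ProcGlobal->xactsHavingPendingUndo, 1);

    /* decrement: commit (once per persistence level that wrote undo), or a
     * rollback request fully processed */
    pg_atomic_fetch_sub_u32(&ProcGlobal->xactsHavingPendingUndo, 1);

    /* the headroom check can read it without any lock */
    uint32 pending = pg_atomic_read_u32(&ProcGlobal->xactsHavingPendingUndo);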
On Thu, Jul 18, 2019 at 4:41 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Tue, Jul 16, 2019 at 8:39 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Thomas has already objected to another proposal to add functions that > > turn 32-bit XIDs into 64-bit XIDs. Therefore, I feel confident in > > predicting that he will likewise object to GetEpochForXid. I think > > this needs to be changed somehow, maybe by doing what the XXX comment > > you added suggests. > > Perhaps we should figure out how to write GetOldestFullXmin() and friends. > > For FinishPreparedTransaction(), the XXX comment sounds about right > (TwoPhaseFileHeader should hold an fxid). > I think we can do that, but what about subxids in TwoPhaseFileHeader? Shall we store them as fxids as well? If we don't, it will appear inconsistent; and if we want to store subxids as fxids, then we need to track them as fxids in TransactionStateData. It might not be a very big change, but it is certainly more work than just storing the top-level fxid or using GetEpochForXid as we currently do in the patch. Another thing is that changing subxids to fxids can increase the size of the two-phase file for an xact having many sub-transactions, which again might be okay, but I am not completely sure. Thoughts? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
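To illustrate the narrower of the two options (widen only the top-level xid, keep subxids 32-bit), a sketch might look like the following; this is not the actual TwoPhaseFileHeader layout, just the shape of the change:

    #include "access/transam.h"     /* FullTransactionId */

    /* Sketch only: illustrative two-phase file header fields. */
    typedef struct TwoPhaseFileHeaderSketch
    {
        FullTransactionId fxid;     /* top-level transaction, with epoch */
        int32       nsubxacts;      /* number of following subxact XIDs */
        /* TransactionId subxids[nsubxacts] follows, still 32 bits each */
    } TwoPhaseFileHeaderSketch;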
On Mon, Jul 22, 2019 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have reviewed 0012-Infrastructure-to-execute-pending-undo-actions; please find my comments so far.

1. + /* It shouldn't be discarded. */ + Assert(!UndoRecPtrIsDiscarded(xact_urp)); I think comments can be added to explain why it shouldn't be discarded.

2. + /* Compute the offset of the uur_next in the undo record. */ + offset = SizeOfUndoRecordHeader + + offsetof(UndoRecordTransaction, urec_progress); + In the comment: /uur_next/uur_progress

3. +/* + * undo_record_comparator + * + * qsort comparator to handle undo record for applying undo actions of the + * transaction. + */ Function header formatting is not in sync with other functions.

4. +void +undoaction_redo(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + switch (info) + { + case XLOG_UNDO_APPLY_PROGRESS: + undo_xlog_apply_progress(record); + break; For hot standby it doesn't make sense to apply this WAL, as this progress is only required when we try to apply the undo actions after a restart, but on a hot standby we never apply undo actions.

5. + Assert(from_urecptr != InvalidUndoRecPtr); + Assert(to_urecptr != InvalidUndoRecPtr); We can use the UndoRecPtrIsValid macro instead of checking like this.

6. + if ((slot == NULL) || (UndoRecPtrGetLogNo(urecptr) != slot->logno)) + slot = UndoLogGetSlot(UndoRecPtrGetLogNo(urecptr), false); + + Assert(slot != NULL); We are passing missing_ok as false to UndoLogGetSlot, but I am not sure why we expect that the undo log cannot be dropped. In a multi-log transaction, isn't it possible that the tablespace containing the next undo log has already been dropped?

7. + */ + do + { + BlockNumber progress_block_num = InvalidBlockNumber; + int i; + int nrecords; ..... + */ + if (!UndoRecPtrIsValid(urec_ptr)) + break; + } while (true); I think we can convert the above loop to while (true) instead of do..while, because there is no need for a do..while loop here.

8. + if (last_urecinfo->uur->uur_info & UREC_INFO_LOGSWITCH) + { + UndoRecordLogSwitch *logswitch = last_urecinfo->uur->uur_logswitch; IMHO, the caller of UndoFetchRecord should directly check uur->uur_logswitch instead of uur_info & UREC_INFO_LOGSWITCH. Actually, uur_info is set internally for inserting the record and is checked there to know what to insert and fetch, but I think the caller of UndoFetchRecord should rely directly on the field, because ideally all the fields in UnpackUndoRecord must be set, and uur_txt or uur_logswitch will be allocated when those headers are present. I think this needs to be improved in the undo interface patch as well (in UndoBulkFetchRecord).

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sat, Jul 20, 2019 at 11:28 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jul 19, 2019 at 4:14 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I don't think this matters here at all. As long as there's only DML > > involved, there won't be any lock conflicts anyway - everybody's > > taking RowExclusiveLock or less, and it's all fine. If you update a > > row in zheap, abort, and then try to update again before the rollback > > happens, we'll do a page-at-a-time rollback in the foreground, and > > proceed with the update; when we get around to applying the undo, > > we'll notice that page has already been handled and skip the undo > > records that pertain to it. To get the kinds of problems I'm on about > > here, somebody's got to be taking some more serious locks. > > If I'm not mistaken, you're tacitly assuming that you'll always be > using zheap, or something sufficiently similar to zheap. It'll > probably never be possible to UNDO changes to something like a GIN > index on a zheap table, because you can never do that with sensible > concurrency/deadlock behavior. > > I don't necessarily have a problem with that. I don't pretend to > understand how much of a problem it is. Obviously it partially depends > on what your ambitions are for this infrastructure. Still, assuming > that I have it right, ISTM that UNDO/zheap/whatever should explicitly > own this restriction. I had a similar thought: you might regret that choice if you were wanting to implement an AM with lock table-based concurrency control (meaning that there are lock ordering concerns for row and page locks, for DML statements, not just DDL). That seemed a bit too far-fetched to mention before, but are you saying the same sort of concerns might come up with indexes that support true undo (as opposed to indexes that still need VACUUM)? For comparison, ARIES[1] has no-deadlock rollbacks as a basic property and reacquires locks during restart before new transactions are allowed to execute. In its model, the locks in question can be on things like rows and pages. We don't even use our lock table for those (except for non-blocking SIREAD locks, irrelevant here). After crash recovery, if zheap encounters a row with pending rollback from an aborted transaction, as usual it either needs to read an older version from an undo log (for reads) or help execute the rollback before updating (for writes). That only requires page-at-a-time LWLocks ("latching"), so it's deadlock-free. The only deadlock risk comes from the need to acquire heavyweight locks on relations, which typically only conflict when you run DDL, so yeah, it's tempting to worry a lot less about those than about the fine-grained lock traffic from DML statements that DB2 and others have to deal with. So, to spell out the two options again: A. Rollback can't deadlock. You have to make sure you reliably hold locks until rollback is completed (including some tricky new lock transfer magic), and then reacquire them after recovery before new transactions are allowed. You could trivially achieve the restart part by simply waiting until all rollback is executed before you allow new transactions, but other systems including DB2 first acquire all the locks in an earlier scan through the log, then allow new connections, and then execute the rollback. Acquiring them before new transactions are allowed means that they must fit in the lock table, and there must be no conflicts among them if they were all granted as of the moment you crashed or shut down. B.
Rollback can deadlock or exhaust the lock table, because we release locks and reacquire them some arbitrary time later. There is then no choice but to keep retrying if anything goes wrong, so rollback is theoretically not guaranteed to complete, and you can contrive a workload that will never make progress. This amounts to betting that these problems will be rare enough not to matter, that rollback will eventually make progress, and that it should be fairly clear what's happening and why. I might as well put the quote marks on now: "Perhaps we could implement A later." [1] https://cs.stanford.edu/people/chrismre/cs345/rl/aries.pdf -- Thomas Munro https://enterprisedb.com
On Mon, 22 Jul 2019 at 14:21, Amit Kapila <amit.kapila16@gmail.com> wrote: I have started review of 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch. Below are some quick comments to start with: +++ b/src/backend/access/undo/undoworker.c +#include "access/xact.h" +#include "access/undorequest.h" Order is not alphabetical + * Each undo worker then start reading from one of the queue the requests for start=>starts queue=>queues ------------- + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + 10L, WAIT_EVENT_BGWORKER_STARTUP); + + /* emergency bailout if postmaster has died */ + if (rc & WL_POSTMASTER_DEATH) + proc_exit(1); I think now, thanks to commit cfdf4dc4fc9635a, you don't have to explicitly handle postmaster death; instead you can use WL_EXIT_ON_PM_DEATH. Please check at all such places where this is done in this patch. ------------- +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) +{ + /* Block concurrent access. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + + MyUndoWorker = &UndoApplyCtx->workers[slot]; Not sure why MyUndoWorker is used here. Can't we use a local variable ? Or do we intentionally attach to the slot as a side-operation ? ------------- + * Get the dbid where the wroker should connect to and get the worker wroker=>worker ------------- + BackgroundWorkerInitializeConnectionByOid(urinfo.dbid, 0, 0); 0, 0 => InvalidOid, 0 + * Set the undo worker request queue from which the undo worker start + * looking for a work. start => should start a work => work -------------- + if (!InsertRequestIntoErrorUndoQueue(urinfo)) I was thinking what happens if for some reason InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the entry will not be marked invalid, and so there will be no undo action carried out because I think the undo worker will exit. What happens next with this entry ?
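For reference, the WL_EXIT_ON_PM_DEATH suggestion above would reduce the quoted wait loop to something like this minimal sketch (the flag is the real one added by commit cfdf4dc4; the timeout and wait event are carried over from the quoted patch):

    /* Let the latch machinery handle postmaster death for us. */
    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   10L, WAIT_EVENT_BGWORKER_STARTUP);
    /* no explicit WL_POSTMASTER_DEATH check or proc_exit(1) needed */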
On Fri, 19 Jul 2019 at 17:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Thu, 9 May 2019 at 12:04, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > Patches can be applied on top of undo branch [1] commit: > > (cb777466d008e656f03771cf16ec7ef9d6f2778b) > > > > [1] https://github.com/EnterpriseDB/zheap/tree/undo > > Below are some review points for 0009-undo-page-consistency-checker.patch : Another point that I missed : + * Process the undo record of the page and mask their cid filed. + */ + while (next_record < page_end) + { + UndoRecordHeader *header = (UndoRecordHeader *) next_record; + + /* If this undo record has cid present, then mask it */ + if ((header->urec_info & UREC_INFO_CID) != 0) Here, even though next record starts in the current page, the urec_info itself may or may not lie on this page. I hope this possibility is also considered when populating the partial-record-specific details in the page header. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 2019-07-22 14:21:36 +0530, Amit Kapila wrote: > Another thing is changing subxids to fxids can increase the size of > two-phase file for a xact having many sub-transactions which again > might be okay, but not completely sure. I can't see that being a problem.
On Mon, Jul 22, 2019 at 8:39 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Mon, 22 Jul 2019 at 14:21, Amit Kapila <amit.kapila16@gmail.com> wrote: > > ------------- > > +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) > +{ > + /* Block concurrent access. */ > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > + > + MyUndoWorker = &UndoApplyCtx->workers[slot]; > Not sure why MyUndoWorker is used here. Can't we use a local variable > ? Or do we intentionally attach to the slot as a side-operation ? > > ------------- > I think we can use a local variable here as well. Do you see any problem with the current code, or do you think it is better to use a local variable here? > -------------- > > + if (!InsertRequestIntoErrorUndoQueue(urinfo)) > I was thinking what happens if for some reason > InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the > entry will not be marked invalid, and so there will be no undo action > carried out because I think the undo worker will exit. What happens > next with this entry ? The same entry is present in two queues, xid and size, so next time it will be executed from the second queue based on its priority in that queue. However, if it fails a second time in the same way, then we will be in trouble, because now the hash table has the entry but none of the queues does, so none of the workers will attempt to execute it again. Also, when the discard worker again tries to register it, we won't allow adding the entry to a queue, thinking either some backend is executing the same request or it must already be part of some queue. One possibility to deal with this could be to somehow allow the discard worker to register it again in the queue, or we could do this in a critical section so that it forces a system restart on error. However, the main question is: can InsertRequestIntoErrorUndoQueue fail at all, unless there is some bug in the code? If not, we might want to have an Assert for this rather than handling the condition. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, 23 Jul 2019 at 08:48, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 22, 2019 at 8:39 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > On Mon, 22 Jul 2019 at 14:21, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > ------------- > > > > +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) > > +{ > > + /* Block concurrent access. */ > > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > > + > > + MyUndoWorker = &UndoApplyCtx->workers[slot]; > > Not sure why MyUndoWorker is used here. Can't we use a local variable > > ? Or do we intentionally attach to the slot as a side-operation ? > > > > ------------- > > > > I think we can use a local variable here as well. Do you see any > problem with the current code, or do you think it is better to use a > local variable here? I think, even though there might not be a correctness issue with the current code as it stands, we should still use a local variable. Updating MyUndoWorker is a big side-effect, which the caller is not supposed to be aware of, because all that function should do is just get the slot info. (A minimal sketch follows at the end of these review comments.) > > > -------------- > > > > + if (!InsertRequestIntoErrorUndoQueue(urinfo)) > > I was thinking what happens if for some reason > > InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the > > entry will not be marked invalid, and so there will be no undo action > > carried out because I think the undo worker will exit. What happens > > next with this entry ? > > The same entry is present in two queues, xid and size, so next time it > will be executed from the second queue based on its priority in that > queue. However, if it fails a second time in the same way, then we > will be in trouble, because now the hash table has the entry but none > of the queues does, so none of the workers will attempt to execute it > again. Also, when the discard worker again tries to register it, > we won't allow adding the entry to a queue, thinking either some backend > is executing the same request or it must already be part of some queue. > > One possibility to deal with this could be to somehow allow the > discard worker to register it again in the queue, or we could do this in > a critical section so that it forces a system restart on error. However, > the main question is: can InsertRequestIntoErrorUndoQueue fail at all, > unless there is some bug in the code? If not, we might want > to have an Assert for this rather than handling the condition. Yes, I also think that the function would error out only because of can't-happen cases, like "too many locks taken" or "out of binary heap slots" or "out of memory" (this last one is not such a can't-happen case). These cases probably happen due to some bugs, I suppose. But I was wondering: generally, when the code errors out with such can't-happen elog() calls, the worst thing that happens is that the transaction gets aborted. Whereas in this case, the worst thing that could happen is that the undo action would never get executed, which means selects for this tuple will keep on accessing the undo log. That does not sound like a data consistency issue, so we should be fine after all? -------------------- Some further review comments for undoworker.c: +/* Sets the worker's lingering status. */ +static void +UndoWorkerIsLingering(bool sleep) The function name sounds like "is the worker lingering?". Can we rename it to something like "UndoWorkerSetLingering"?
------------- + errmsg("undo worker slot %d is empty, cannot attach", + slot))); + } + + if (MyUndoWorker->proc) + { + LWLockRelease(UndoWorkerLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("undo worker slot %d is already used by " + "another worker, cannot attach", slot))); These two error messages can have a common error message "could not attach to worker slot", with errdetail separate for each of them : slot %d is empty. slot %d is already used by another worker. -------------- +static int +IsUndoWorkerAvailable(void) +{ + int i; + int alive_workers = 0; + + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); Should have bool return value. Also, why is it keeping track of number of alive workers ? Sounds like earlier it used to return number of alive workers ? If it indeed needs to just return true/false, we can do away with alive_workers. Also, *exclusive* lock is unnecessary. -------------- +if (UndoGetWork(false, false, &urinfo, NULL) && + IsUndoWorkerAvailable()) + UndoWorkerLaunch(urinfo); There is no lock acquired between IsUndoWorkerAvailable() and UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable() returns true, there is a small window where UndoWorkerLaunch() does not find any worker slot with in_use false, causing assertion failure for (worker != NULL). -------------- +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) +{ + /* Block concurrent access. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); *Exclusive* lock is unnecessary. ------------- + LWLockRelease(UndoWorkerLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("undo worker slot %d is empty", + slot))); I believe there is no need to explicitly release an lwlock before raising an error, since the lwlocks get released during error recovery. Please check all other places where this is done. ------------- + * Start new undo apply background worker, if possible otherwise return false. worker, if possible otherwise => worker if possible, otherwise ------------- +static bool +UndoWorkerLaunch(UndoRequestInfo urinfo) We don't check UndoWorkerLaunch() return value. Can't we make it's return value type void ? Also, it would be better to have urinfo as pointer to UndoRequestInfo rather than UndoRequestInfo, so as to avoid structure copy. ------------- +{ + BackgroundWorker bgw; + BackgroundWorkerHandle *bgw_handle; + uint16 generation; + int i; + int slot = 0; We can remove variable i, and use slot variable in place of i. ----------- + snprintf(bgw.bgw_name, BGW_MAXLEN, "undo apply worker"); I think it would be trivial to also append the worker->generation in the bgw_name. ------------- + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle)) + { + /* Failed to start worker, so clean up the worker slot. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + UndoWorkerCleanup(worker); + LWLockRelease(UndoWorkerLock); + + return false; + } Is it intentional that there is no (warning?) message logged when we can't register a bg worker ? ------------- -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Please find my review comments for 0013-Allow-foreground-transactions-to-perform-undo-action + /* initialize undo record locations for the transaction */ + for (i = 0; i < UndoLogCategories; i++) + { + s->start_urec_ptr[i] = InvalidUndoRecPtr; + s->latest_urec_ptr[i] = InvalidUndoRecPtr; + s->undo_req_pushed[i] = false; + } Can't we just memset this memory? (A sketch follows at the end of this review.) + * We can't postpone applying undo actions for subtransactions as the + * modifications made by aborted subtransaction must not be visible even if + * the main transaction commits. + */ + if (IsSubTransaction()) + return; I am not completely sure, but is it possible for the outer functions CommitTransactionCommand/AbortCurrentTransaction to avoid calling this function in the switch case, based on the current state, so that under a subtransaction this is never called? + /* + * Prepare required undo request info so that it can be used in + * exception. + */ + ResetUndoRequestInfo(&urinfo); + urinfo.dbid = dbid; + urinfo.full_xid = fxid; + urinfo.start_urec_ptr = start_urec_ptr[per_level]; + I see that we are preparing urinfo before execute_undo_actions so that in case of an error in CATCH we can use it to insert into the queue, but can't we just initialize urinfo right there, before inserting into the queue? We have all the information at that point. Am I missing something? + + /* + * We need the locations of the start and end undo record pointers when + * rollbacks are to be performed for prepared transactions using undo-based + * relations. We need to store this information in the file as the user + * might rollback the prepared transaction after recovery and for that we + * need it's start and end undo locations. + */ + UndoRecPtr start_urec_ptr[UndoLogCategories]; + UndoRecPtr end_urec_ptr[UndoLogCategories]; it's -> its + bool undo_req_pushed[UndoLogCategories]; /* undo request pushed + * to worker? */ + bool performUndoActions; + struct TransactionStateData *parent; /* back link to parent */ We must have some comments to explain how performUndoActions is used and where it's set. If it's explained somewhere else, then we can give a reference to that code. + for (i = 0; i < UndoLogCategories; i++) + { + if (s->latest_urec_ptr[i]) + { + s->performUndoActions = true; + break; + } + } I think we should check UndoRecPtrIsValid(s->latest_urec_ptr[i]). + PG_TRY(); + { + /* + * Prepare required undo request info so that it can be used in + * exception. + */ + ResetUndoRequestInfo(&urinfo); + urinfo.dbid = dbid; + urinfo.full_xid = fxid; + urinfo.start_urec_ptr = start_urec_ptr[per_level]; + + /* for subtransactions, we do partial rollback. */ + execute_undo_actions(urinfo.full_xid, + end_urec_ptr[per_level], + start_urec_ptr[per_level], + !isSubTrans); + } + PG_CATCH(); Wouldn't it be good to explain in the comments why we are not rethrowing the error in PG_CATCH: we don't want the main transaction to get an error if there is an error while applying the undo actions, and the transaction will be aborted in the caller of this function anyway? +tables are only accessible in the backend that has created them. We can't +postpone applying undo actions for subtransactions as the modifications +made by aborted subtransaction must not be visible even if the main transaction +commits.
I think we need to give detailed reasoning for why subtransaction changes would be visible if we don't apply their undo and the main transaction commits, by mentioning that we don't use a separate transaction id for the subtransaction, so a commit would make all changes made under that transaction id visible at once. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
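As promised above, the memset form of that initialization; note that it is valid only on the assumption that InvalidUndoRecPtr is numerically zero (and false is zero), which would deserve a comment in the code if taken:

    /* Sketch: equivalent to the per-category loop, assuming
     * InvalidUndoRecPtr == 0 so zeroed bytes are a valid initializer. */
    memset(s->start_urec_ptr, 0, sizeof(s->start_urec_ptr));
    memset(s->latest_urec_ptr, 0, sizeof(s->latest_urec_ptr));
    memset(s->undo_req_pushed, 0, sizeof(s->undo_req_pushed));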
Hi,
I have started review of 0008-Provide-interfaces-to-store-and-fetch-undo-records.patch; here are a few
quick comments.
1) README.undointerface should provide more information like API details or
the sequence in which the APIs should be called.
2) Information about the APIs in the undoaccess.c file header block would be
good. For reference, please look at heapam.c.
3) typo
+ * Later, during insert phase we will write actual records into thse buffers.
+ */
%s/thse/these
4) UndoRecordUpdateTransInfo()'s comments say that it must be called inside
a critical section, but it seems that undo_xlog_apply_progress() calls it
outside of a critical section. Is there an exception? If so, we should add
comments; or am I missing something?
5) In function UndoBlockGetFirstUndoRecord() below code:
/* Calculate the size of the partial record. */
partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) +
phdr->tuple_len + phdr->payload_len -
phdr->record_offset;
can directly use UndoPagePartialRecSize().
6)
+static int
+UndoGetBufferSlot(UndoRecordInsertContext *context,
+ RelFileNode rnode,
+ BlockNumber blk,
+ ReadBufferMode rbm)
+{
+ int i;
In the above code variable "i" is mean "block index". It would be good
to give some valuable name to the variable, maybe "blockIndex" ?
7)
* We will also keep a previous undo record pointer to the first and last undo
* record of the transaction in the previous log. The last undo record
* location is used find the previous undo record pointer during rollback.
%s/used find/used to find
8)
/*
* Defines the number of times we try to wait for rollback hash table to get
* initialized. After these many attempts it will return error and the user
* can retry the operation.
*/
#define ROLLBACK_HT_INIT_WAIT_TRY 60
%s/error/an error
9)
* we can get the exact size of partial record in this page.
*/
%s/of partial/of the partial"
10)
* urecptr - current transaction's undo record pointer which need to be set in
* the previous transaction's header.
%s/need/needs
11)
/*
* If we are writing first undo record for the page the we can set the
* compression so that subsequent records from the same transaction can
* avoid including common information in the undo records.
*/
%s/the page the/the page then
12)
/*
* If the transaction's undo records are split across the undo logs. So
* we need to update our own transaction header in the previous log.
*/
double space between "to" and "update"
13)
* The undo record should be freed by the caller by calling ReleaseUndoRecord.
* This function will old the pin on the buffer where we read the previous undo
* record so that when this function is called repeatedly with the same context
%s/old/hold
I will continue further review of the same patch.
Regards,
Rushabh Lathia
On Wed, Jul 24, 2019 at 11:28 AM Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > > Hi, > > I have started review of > 0008-Provide-interfaces-to-store-and-fetch-undo-records.patch; here are a few > quick comments. Thanks for the review; I will work on them soon and post the updated patch along with other comments. I have noticed that some of the comments point to code which is not part of this patch, for example: > > 5) In function UndoBlockGetFirstUndoRecord() below code: > > /* Calculate the size of the partial record. */ > partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) + > phdr->tuple_len + phdr->payload_len - > phdr->record_offset; > > can directly use UndoPagePartialRecSize(). UndoBlockGetFirstUndoRecord is added under the 0014 patch; I think you got confused because this code is in the undoaccess.c file, but a later patch in the set adds some code under undoaccess.c. Basically, this comment needs to be addressed, but under another patch. I am pointing it out so that we don't miss it. > 8) > > /* > * Defines the number of times we try to wait for rollback hash table to get > * initialized. After these many attempts it will return error and the user > * can retry the operation. > */ > #define ROLLBACK_HT_INIT_WAIT_TRY 60 > > %s/error/an error > This macro is also added under 0014. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > On Fri, Jun 28, 2019 at 6:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > I happened to open up 0001 from this series, which is from Thomas, and > > > I do not think that the pg_buffercache changes are correct. The idea > > > here is that the customer might install version 1.3 or any prior > > > version on an old release, then upgrade to PostgreSQL 13. When they > > > do, they will be running with the old SQL definitions and the new > > > binaries. At that point, it sure looks to me like the code in > > > pg_buffercache_pages.c is going to do the Wrong Thing. [...] > > > > Yep, that was completely wrong. Here's a new version. > > > > One comment/question related to > 0022-Use-undo-based-rollback-to-clean-up-files-on-abort.patch. > I have done some more review of the undolog patch series and here are my comments: 0003-Add-undo-log-manager.patch 1. allocate_empty_undo_segment() { .. .. if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST) + { + char *parentdir; + + if (errno != ENOENT || !InRecovery) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", + undo_path))); + + /* + * In recovery, it's possible that the tablespace directory + * doesn't exist because a later WAL record removed the whole + * tablespace. In that case we create a regular directory to + * stand in for it. This is similar to the logic in + * TablespaceCreateDbspace(). + */ + + /* create two parents up if not exist */ + parentdir = pstrdup(undo_path); + get_parent_directory(parentdir); + get_parent_directory(parentdir); + /* Can't create parent and it doesn't already exist? */ + if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST) All of this code is almost the same as the code in TablespaceCreateDbspace; still, there are small differences, like here you are using mkdir instead of MakePGDirectory, which as far as I can see uses similar permissions for creating the directory. Also, that code checks whether the directory exists before trying to create it. Is there a reason why we need to do a few things differently here? If not, can't both places use one common function? 2. allocate_empty_undo_segment() { .. .. /* Flush the contents of the file to disk before the next checkpoint. */ + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace); .. } +void +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace) +{ + char path[MAXPGPATH]; + FileTag tag; + + INIT_UNDOFILETAG(tag, logno, tablespace, segno); + + /* Try to send to the checkpointer, but if out of space, do it here. */ + if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false)) The comment in allocate_empty_undo_segment indicates that the code wants to flush before the checkpoint, but the actual function tries to register the request with the checkpointer. Shouldn't this be similar to XLogFileInit, where we use pg_fsync to flush the contents immediately? I guess that would avoid what you have written in the comments in the same function (we just want to make sure that the filesystem has allocated physical blocks for it so that non-COW filesystems will report ENOSPC now rather than later when space is needed). OTOH, I think it is better performance-wise to postpone the work to the checkpointer.
If we want to push this work to the checkpointer, then we might need to change the comments; alternatively, we might want to use bigger segment sizes to mitigate the performance effect. If my above understanding is correct and the reason to fsync immediately is to reserve space now, then we also need to think about whether we are always safe in postponing the work. Basically, if this means that it can fail when we are actually trying to write undo, then it could be risky, because we could be in a critical section at that time. I am not sure about this point; it is rather just to discuss whether there are any impacts of postponing the fsync work. Another thing is that recently in commit 475861b261 (a commit by you), we introduced a mechanism to not fill the files with zeroes for certain filesystems like ZFS. Do we want similar behavior for undo files? 3. +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end) +{ + UndoLogSlot *slot; + size_t end; + + slot = find_undo_log_slot(logno, false); + + /* TODO review interlocking */ + + Assert(slot != NULL); + Assert(slot->meta.end % UndoLogSegmentSize == 0); + Assert(new_end % UndoLogSegmentSize == 0); + Assert(InRecovery || + CurrentSession->attached_undo_slots[slot->meta.category] == slot); Can you write some comments explaining the above Asserts? Also, can you explain what interlocking issues you are worried about here? 4. while (end < new_end) + { + allocate_empty_undo_segment(logno, slot->meta.tablespace, end); + end += UndoLogSegmentSize; + } + + /* Flush the directory entries before next checkpoint. */ + undofile_request_sync_dir(slot->meta.tablespace); I see that in two places, after allocating an empty undo segment, the patch performs undofile_request_sync_dir, whereas it doesn't do the same in UndoLogNewSegment. Is there a reason for that, or is it missing from one of the places? 5. +static void +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end) { .. /* + * We didn't need to acquire the mutex to read 'end' above because only + * we write to it. But we need the mutex to update it, because the + * checkpointer might read it concurrently. Is this assumption correct? It seems the patch also modifies slot->meta.end during discard, in the function UndoLogDiscard. I am referring to the below code: +UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid) { .. + /* Update shmem to show the new discard and end pointers. */ + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); + slot->meta.discard = discard; + slot->meta.end = end; + LWLockRelease(&slot->mutex); .. } 6. extend_undo_log() { .. .. if (!InRecovery) + { + xl_undolog_extend xlrec; + XLogRecPtr ptr; + + xlrec.logno = logno; + xlrec.end = end; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND); + XLogFlush(ptr); + } It is not obvious to me why we need to perform XLogFlush here; can you explain? 7. +attach_undo_log(UndoLogCategory category, Oid tablespace) { .. if (candidate->meta.tablespace == tablespace) + { + logno = *place; + slot = candidate; + *place = candidate->next_free; + break; + } Here, the code is breaking from the loop, so why do we need to set *place? Am I missing something obvious? 8. + /* WAL-log the creation of this new undo log.
*/
+ {
+ xl_undolog_create xlrec;
+
+ xlrec.logno = logno;
+ xlrec.tablespace = slot->meta.tablespace;
+ xlrec.category = slot->meta.category;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));

Here and in most other places in this patch you are using sizeof(xlrec) for
the xlog static data. However, as far as I know, elsewhere in the code we
define the size using the offset of the last member of the corresponding
structure, to avoid any inconsistency in WAL record size across different
platforms. Is there a reason to do it differently in this patch? See below
for an example:

typedef struct xl_hash_add_ovfl_page
{
uint16 bmsize;
bool bmpage_found;
} xl_hash_add_ovfl_page;

#define SizeOfHashAddOvflPage \
(offsetof(xl_hash_add_ovfl_page, bmpage_found) + sizeof(bool))

9.
+static void
+undolog_xlog_create(XLogReaderState *record)
+{
+ xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
+ UndoLogSlot *slot;
+
+ /* Create meta-data space in shared memory. */
+ LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+
+ /* TODO: assert that it doesn't exist already? */
+
+ slot = allocate_undo_log_slot();
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);

Why do we need to acquire locks during recovery?

10.
I think UndoLogAllocate can leak allocation of slots. It first allocates
the slot for a new log from the free pool if there is no existing slot/log,
writes a WAL record, and then at a later point of time it actually creates
the required physical space in the log via extend_undo_log, which also
writes a separate WAL record. Now, if there is an error between these two
operations, then we will have a redundant slot allocated. What if there are
repeated errors of this kind from multiple backends, after which the system
crashes? Then, after restart, we will allocate multiple slots for different
lognos which don't have any actual (physical) logs. This might not be a big
problem in practice because the chances of an error between the two
operations are small, but can't we delay the WAL logging for the allocation
of a slot for a new log?

11.
+UndoLogAllocate()
{
..
..
+ /*
+ * Maintain our tracking of the and the previous transaction start
+ * locations.
+ */
+ if (slot->meta.unlogged.this_xact_start != slot->meta.unlogged.insert)
+ {
+ slot->meta.unlogged.last_xact_start =
+ slot->meta.unlogged.this_xact_start;
+ slot->meta.unlogged.this_xact_start = slot->meta.unlogged.insert;
+ }

".. of the and the ..": after the first "the", something is missing.

12.
UndoLogAllocate()
{
..
..
+ /*
+ * We don't need to acquire log->mutex to read log->meta.insert and
+ * log->meta.end, because this backend is the only one that can
+ * modify them.
+ */
+ if (unlikely(new_insert > slot->meta.end))

I might be confused, but slot->meta.end is modified by the discard process
also, so how is this safe? If it is safe, maybe adding a comment to explain
why would be good. Also, I think in the comments "log" should be replaced
with "slot".

13.
UndoLogAllocate()
{
..
+ /* This undo log is entirely full. Get a new one. */
+ if (logxid == GetTopTransactionId())
+ {
+ /*
+ * If the same transaction is split over two undo logs then
+ * store the previous log number in new log. See detailed
+ * comments in undorecord.c file header.
+ */
..
}

The reference to undorecord.c should be changed to undoaccess.c.

14.
UndoLogAllocate()
{
..
+ if (logxid != GetTopTransactionId())
+ {
+ /*
+ * While we have the lock, check if we have been forcibly detached by
+ * DROP TABLESPACE. That can only happen between transactions (see
+ * DropUndoLogsInsTablespace()).
+ */

/DropUndoLogsInsTablespace/DropUndoLogsInTablespace

15.
UndoLogSegmentPath()
{
..
/*
+ * Build the path from log number and offset. The pathname is the
+ * UndoRecPtr of the first byte in the segment in hexadecimal, with a
+ * period inserted between the components.
+ */
+ snprintf(path, MAXPGPATH, "%s/%06X.%010zX", dir, logno,
+ segno * UndoLogSegmentSize);
..
}

a. It is not very clear from the above code why we are multiplying segno by
UndoLogSegmentSize. I see that many of the callers pass segno as
segno/UndoLogSegmentSize. Won't it be better if the caller takes care of
passing the correct value of segno?
b. In the comment above, instead of "offset", shouldn't it say "segment
number"?

16.
UndoLogGetLastXactStartPoint is not used anywhere. I think this was
required in a previous version of the patch set; now we can remove it.

17.
Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com

This discussion link seems to be from an old discussion/thread, not this
one.

0019-Add-developer-documentation-for-the-undo-log-storage
18.
+each undo log, a set of meta-data properties is tracked:
+tracked, including:
+
+* the tablespace that holds its segment files
+* the persistence level (permanent, unlogged or temporary)

Here, don't we want to refer to UndoLogCategory rather than the persistence
level? Also, "tracked, including:" seems a bit confusing.

0020-Add-user-facing-documentation-for-undo-logs
19.
<row>
+ <entry><structfield>persistence</structfield></entry>
+ <entry><type>text</type></entry>
+ <entry>Persistence level of data stored in this undo log; one of
+ <literal>permanent</literal>, <literal>unlogged</literal> or
+ <literal>temporary</literal>.</entry>
+ </row>

Don't we want to cover the new (shared) undolog category here?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
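For reference, applying the offsetof() convention from point 8 to the
xl_undolog_create record quoted above would look something like this (a
sketch only; the struct fields are taken from the quoted patch excerpt, and
the macro name is made up):

    typedef struct xl_undolog_create
    {
        UndoLogNumber   logno;
        Oid             tablespace;
        UndoLogCategory category;
    } xl_undolog_create;

    #define SizeOfUndologCreate \
        (offsetof(xl_undolog_create, category) + sizeof(UndoLogCategory))

    ...
    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, SizeOfUndologCreate);

That way any trailing padding in the struct never becomes part of the WAL
record.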
On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > Yep, that was completely wrong. Here's a new version.
>
> 10.
> I think UndoLogAllocate can leak allocation of slots. It first allocates
> the slot for a new log from the free pool if there is no existing slot/log,
> writes a WAL record, and then at a later point of time it actually creates
> the required physical space in the log via extend_undo_log, which also
> writes a separate WAL record. Now, if there is an error between these two
> operations, then we will have a redundant slot allocated. What if there are
> repeated errors of this kind from multiple backends, after which the system
> crashes? Then, after restart, we will allocate multiple slots for different
> lognos which don't have any actual (physical) logs. This might not be a big
> problem in practice because the chances of an error between the two
> operations are small, but can't we delay the WAL logging for the allocation
> of a slot for a new log?
>

After sending this email, I was browsing the previous comments I had raised
for this patch, and it seems this same point was raised previously [1] as
well, and there were a few additional questions related to it (see point 1
in email [1]).

[1] - https://www.postgresql.org/message-id/CAA4eK1LDctrYeZ8ev1N1v-8KwiigAmNMx%3Dt-UTs9qgEFt%2BP0XQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > 7. > +attach_undo_log(UndoLogCategory category, Oid tablespace) > { > .. > if (candidate->meta.tablespace == tablespace) > + { > + logno = *place; > + slot = candidate; > + *place = candidate->next_free; > + break; > + } > > Here, the code is breaking from the loop, so why do we need to set > *place? Am I missing something obvious? > I think I know what I was missing. It seems here you are removing an element from the freelist. One point related to detach_current_undo_log. + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); + slot->pid = InvalidPid; + slot->meta.unlogged.xid = InvalidTransactionId; + if (full) + slot->meta.status = UNDO_LOG_STATUS_FULL; + LWLockRelease(&slot->mutex); If I read the comments in structure UndoLogMetaData, it is mentioned that 'status' is changed by explicit WAL record whereas there is no WAL record in code to change the status. I see the problem as well if we don't WAL log this change. Suppose after changing the status of this log, we allocate a new log and insert some records in that log as well for the same transaction for which we have inserted records in the log which we just marked as FULL. Now, here we form the link between two logs as the same transaction has overflowed into a new log. Say, we crash after this. Now, after recovery the log won't be marked as FULL which means there is a chance that it can be used for some other transaction, if that happens, then our link for a transaction spanning to different log will break and we won't be able to access the data in another log. In short, I think it is important to WAL log this status change unless I am missing something. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
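To make that suggestion concrete, WAL-logging the status change could
follow the pattern of the other undo log records quoted in this thread,
roughly like this (a sketch; the record struct, its name, and the
XLOG_UNDOLOG_MARK_FULL symbol are hypothetical):

    typedef struct xl_undolog_mark_full
    {
        UndoLogNumber logno;    /* log transitioning to UNDO_LOG_STATUS_FULL */
    } xl_undolog_mark_full;

    /* in detach_current_undo_log(), before updating shared memory */
    if (full && !InRecovery)
    {
        xl_undolog_mark_full xlrec;

        xlrec.logno = slot->logno;
        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
        XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_MARK_FULL);
    }

with a corresponding redo routine that re-marks the slot as full, so that
after crash recovery the log can never be handed out to another
transaction.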
Hi, I have done some review of undolog patch series and here are my comments: 0003-Add-undo-log-manager.patch 1) As undo log is being created in tablespace, if the tablespace is dropped later, will it have any impact? +void +UndoLogDirectory(Oid tablespace, char *dir) +{ + if (tablespace == DEFAULTTABLESPACE_OID || + tablespace == InvalidOid) + snprintf(dir, MAXPGPATH, "base/undo"); + else + snprintf(dir, MAXPGPATH, "pg_tblspc/%u/%s/undo", + tablespace, TABLESPACE_VERSION_DIRECTORY); +} 2) Header file exclusion a) The following headers can be excluded in undolog.c +#include "access/transam.h" +#include "access/undolog.h" +#include "access/xlogreader.h" +#include "catalog/catalog.h" +#include "nodes/execnodes.h" +#include "storage/buf.h" +#include "storage/bufmgr.h" +#include "storage/fd.h" +#include "storage/lwlock.h" +#include "storage/shmem.h" +#include "storage/standby.h" +#include "storage/sync.h" +#include "utils/memutils.h" b) The following headers can be excluded from undofile.c +#include "access/undolog.h" +#include "catalog/database_internal.h" +#include "miscadmin.h" +#include "postmaster/bgwriter.h" +#include "storage/fd.h" +#include "storage/smgr.h" +#include "utils/memutils.h" 3) Some macro replacement. a)Session.h +++ b/src/include/access/session.h @@ -17,6 +17,9 @@ /* Avoid including typcache.h */ struct SharedRecordTypmodRegistry; +/* Avoid including undolog.h */ +struct UndoLogSlot; + /* * A struct encapsulating some elements of a user's session. For now this * manages state that applies to parallel query, but it principle it could @@ -27,6 +30,10 @@ typedef struct Session dsm_segment *segment; /* The session-scoped DSM segment. */ dsa_area *area; /* The session-scoped DSA area. */ + /* State managed by undolog.c. */ + struct UndoLogSlot *attached_undo_slots[4]; /* UndoLogCategories */ + bool need_to_choose_undo_tablespace; + Should we change 4 to UndoLogCategories or suitable macro? b) +static inline size_t +UndoLogNumSlots(void) +{ + return MaxBackends * 4; +} Should we change 4 to UndoLogCategories or suitable macro c) +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace, + UndoLogOffset end) +{ + struct stat stat_buffer; + off_t size; + char path[MAXPGPATH]; + void *zeroes; + size_t nzeroes = 8192; + int fd; should we use BLCKSZ instead of 8192? 4) Should we add a readme file for undolog as it does a fair amount of work and is core part of the undo system? Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com On Wed, Jul 24, 2019 at 5:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 7. > > +attach_undo_log(UndoLogCategory category, Oid tablespace) > > { > > .. > > if (candidate->meta.tablespace == tablespace) > > + { > > + logno = *place; > > + slot = candidate; > > + *place = candidate->next_free; > > + break; > > + } > > > > Here, the code is breaking from the loop, so why do we need to set > > *place? Am I missing something obvious? > > > > I think I know what I was missing. It seems here you are removing an > element from the freelist. > > One point related to detach_current_undo_log. 
> > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + slot->pid = InvalidPid; > + slot->meta.unlogged.xid = InvalidTransactionId; > + if (full) > + slot->meta.status = UNDO_LOG_STATUS_FULL; > + LWLockRelease(&slot->mutex); > > If I read the comments in structure UndoLogMetaData, it is mentioned > that 'status' is changed by explicit WAL record whereas there is no > WAL record in code to change the status. I see the problem as well if > we don't WAL log this change. Suppose after changing the status of > this log, we allocate a new log and insert some records in that log as > well for the same transaction for which we have inserted records in > the log which we just marked as FULL. Now, here we form the link > between two logs as the same transaction has overflowed into a new > log. Say, we crash after this. Now, after recovery the log won't be > marked as FULL which means there is a chance that it can be used for > some other transaction, if that happens, then our link for a > transaction spanning to different log will break and we won't be able > to access the data in another log. In short, I think it is important > to WAL log this status change unless I am missing something. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > > -- Regards, vignesh Have a nice day
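To make vignesh's point 3 concrete: assuming UndoLogCategories is the count
member of the category enum (the /* UndoLogCategories */ comment in the
quoted code suggests it is), the literal constants could be replaced along
these lines (a sketch):

    /* 3(b), in undolog.c */
    static inline size_t
    UndoLogNumSlots(void)
    {
        return MaxBackends * UndoLogCategories;
    }

    /* 3(c), in allocate_empty_undo_segment() */
    size_t      nzeroes = BLCKSZ;   /* instead of the literal 8192 */

For 3(a) it is less direct, because session.h deliberately avoids including
undolog.h; the array size there would have to come from a mirrored #define
rather than from the enum itself.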
On Wed, Jul 24, 2019 at 11:04 PM vignesh C <vignesh21@gmail.com> wrote: > > Hi, > > I have done some review of undolog patch series > and here are my comments: > 0003-Add-undo-log-manager.patch > > 1) As undo log is being created in tablespace, > if the tablespace is dropped later, will it have any impact? > Yes, it drops the undo logs present in tablespace being dropped. See DropUndoLogsInTablespace() in the same patch. > > 4) Should we add a readme file for undolog as it does a fair amount of work > and is core part of the undo system? > The Readme is already present in the patch series posted by Thomas. See 0019-Add-developer-documentation-for-the-undo-log-storage.patch in email [1]. [1] - https://www.postgresql.org/message-id/CA%2BhUKGKni7EEU4FT71vZCCwPeaGb2PQOeKOFjQJavKnD577UMQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 25, 2019 at 7:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 11:04 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > Hi,
> >
> > I have done some review of undolog patch series
> > and here are my comments:
> > 0003-Add-undo-log-manager.patch
> >
> > 1) As undo log is being created in tablespace,
> > if the tablespace is dropped later, will it have any impact?

Thanks Amit, that clarifies the problem I was thinking of. I have another
question regarding a drop tablespace failure, but I don't have a better
solution for that problem. Let me think more about it and then discuss.

> Yes, it drops the undo logs present in tablespace being dropped. See
> DropUndoLogsInTablespace() in the same patch.
>
> >
> > 4) Should we add a readme file for undolog as it does a fair amount of work
> > and is core part of the undo system?

Thanks Amit, I found the details in the readme.

> The Readme is already present in the patch series posted by Thomas.
> See 0019-Add-developer-documentation-for-the-undo-log-storage.patch in
> email [1].
>
> [1] - https://www.postgresql.org/message-id/CA%2BhUKGKni7EEU4FT71vZCCwPeaGb2PQOeKOFjQJavKnD577UMQ%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Hello Thomas,

Here are some review comments on 0003-Add-undo-log-manager.patch. I've
tried to avoid duplicate comments as much as possible.

1. In UndoLogAllocate,
+ * time this backend as needed to write to an undo log at all or because
s/as/has

+ * Maintain our tracking of the and the previous transaction start
Do you mean the current log's transaction start as well?

2. In UndoLogAllocateInRecovery, we try to find the current log from the
first undo buffer. So, after a log switch, we always have to register at
least one buffer from the current undo log first. If we're updating
something in the previous log, the respective buffer should be registered
after that. I think we should document this in the comments.

3. In UndoLogGetOldestRecord(UndoLogNumber logno, bool *full), it seems the
'full' parameter is not used anywhere. Do we still need this?

+ /* It's been recycled. SO it must have been entirely discarded. */
s/SO/So

4. In CleanUpUndoCheckPointFiles, we can emit a debug2 message with
something similar to: 'removed unreachable undo metadata files'

+ if (unlink(path) != 0)
+ elog(ERROR, "could not unlink file \"%s\": %m", path);
According to my observation, whenever we deal with a file operation, we
usually emit an ereport message with errcode_for_file_access(). Should we
change it to ereport? There are other file operations as well, including
read(), OpenTransientFile() etc.

5. In CheckPointUndoLogs,
+ /* Capture snapshot while holding each mutex. */
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+ serialized[num_logs++] = slot->meta;
+ LWLockRelease(&slot->mutex);
Why do we need an exclusive lock to read something from the slot? A share
lock seems to be sufficient.

pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC) is called after
pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE) without calling
pgstat_report_wait_end(). I think you've done that to avoid an extra
function call, but it differs from other places in the PG code. Perhaps we
should follow this approach everywhere.

6. In StartupUndoLogs,
+ if (fd < 0)
+ elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path);
Assuming you agree to change the above elog to ereport, the message should
be more user-friendly, maybe something like 'cannot open pg_undo file'.

+ if ((size = read(fd, &slot->meta, sizeof(slot->meta))) != sizeof(slot->meta))
The usage of sizeof doesn't look like a problem, but we can save some extra
padding bytes at the end if we use the (offsetof + sizeof) approach,
similar to other places in PG.

7. In free_undo_log_slot,
+ /*
+ * When removing an undo log from a slot in shared memory, we acquire
+ * UndoLogLock, log->mutex and log->discard_lock, so that other code can
+ * hold any one of those locks to prevent the slot from being recycled.
+ */
+ LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+ Assert(slot->logno != InvalidUndoLogNumber);
+ slot->logno = InvalidUndoLogNumber;
+ memset(&slot->meta, 0, sizeof(slot->meta));
+ LWLockRelease(&slot->mutex);
+ LWLockRelease(UndoLogLock);
You've not taken the discard_lock as mentioned in the comment.

8. In find_undo_log_slot,
+ * 1. If the calling code knows that it is attached to this lock or is the
s/lock/slot

+ * 2. All other code should acquire log->mutex before accessing any members,
+ * and after doing so, check that the logno hasn't moved. If it is not, the
+ * entire undo log must be assumed to be discarded (as if this function
+ * returned NULL) and the caller must behave accordingly.
Perhaps, you meant '..check that the logno remains same. If it is not..'. + /* + * If we didn't find it, then it must already have been entirely + * discarded. We create a negative cache entry so that we can answer + * this question quickly next time. + * + * TODO: We could track the lowest known undo log number, to reduce + * the negative cache entry bloat. + */ This is an interesting thought. But, I'm wondering how we are going to search the discarded logno in the simple hash. I guess that's why it's in the TODO list. 9. In attach_undo_log, + * For now we have a simple linked list of unattached undo logs for each + * persistence level. We'll grovel though it to find something for the + * tablespace you asked for. If you're not using multiple tablespaces s/though/through + if (slot == NULL) + { + if (UndoLogShared->next_logno > MaxUndoLogNumber) + { + /* + * You've used up all 16 exabytes of undo log addressing space. + * This is a difficult state to reach using only 16 exabytes of + * WAL. + */ + elog(ERROR, "undo log address space exhausted"); + } looks like a potential unlikely() condition. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
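For what it's worth, the condition from point 9 wrapped in unlikely() would
simply be:

    if (unlikely(UndoLogShared->next_logno > MaxUndoLogNumber))
        elog(ERROR, "undo log address space exhausted");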
On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have done some more review of the undolog patch series and here are my comments:

Hi Amit,

Thanks! There are a number of actionable changes in your review. I'll be
posting a new patch set soon that will address most of your complaints
individually. In this message I want to respond to one topic area, because
the answer is long enough already:

> 2.
> allocate_empty_undo_segment()
> {
> ..
> ..
> /* Flush the contents of the file to disk before the next checkpoint. */
> + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> ..
> }
>
> +void
> +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
> +{
> + char path[MAXPGPATH];
> + FileTag tag;
> +
> + INIT_UNDOFILETAG(tag, logno, tablespace, segno);
> +
> + /* Try to send to the checkpointer, but if out of space, do it here. */
> + if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
>
>
> The comment in allocate_empty_undo_segment indicates that the code wants to
> flush before the checkpoint, but the actual function tries to register the
> request with the checkpointer. Shouldn't this be similar to XLogFileInit,
> where we use pg_fsync to flush the contents immediately? I guess that would
> avoid what you have written in the comments in the same function (we just
> want to make sure that the filesystem has allocated physical blocks for it
> so that non-COW filesystems will report ENOSPC now rather than later when
> space is needed). OTOH, I think it is performance-wise better to postpone
> the work to the checkpointer. If we want to push this work to checkpointer,
> then we might need to change comments or alternatively, we might want to
> use bigger segment sizes to mitigate the performance effect.

In an early version I was doing the fsync() immediately. While testing
zheap, Mithun CY reported that whenever segments couldn't be recycled in
the background, such as during a somewhat long-running transaction, he
could measure ~6% of the time being spent waiting for fsync(), and
throughput increased with bigger segments (and thus fewer files to
fsync()). Passing the work off to the checkpointer is better not only
because it's done in the background but also because there is a chance that
the work can be consolidated with other sync requests, and perhaps even
avoided completely if the file is discarded and unlinked before the next
checkpoint.

I'll update the comment to make it clearer.

> If my above understanding is correct and the reason to fsync
> immediately is to reserve space now, then we also need to think
> whether we are always safe in postponing the work? Basically, if this
> means that it can fail when we are actually trying to write undo, then
> it could be risky because we could be in the critical section at that
> time. I am not sure about this point, rather it is just to discuss if
> there are any impacts of postponing the fsync work.

Here is my theory for why this arrangement is safe, and why it differs from
what we're doing with WAL segments and regular relation files. First, let's
review why those things work the way they do (as I understand it):

1. WAL's use of fdatasync(): The reason we fill and then fsync() newly
created WAL files up front is because we want to make sure the blocks are
definitely on disk.
The comment doesn't spell out exactly why the author considered later fdatasync() calls to be insufficient, but they were: it was many years after commit 33cc5d8a4d0d that Linux ext3/4 filesystems began flushing file size changes to disk in fdatasync()[1][2]. I don't know if its original behaviour was intentional or not. So, if you didn't use the bigger fsync() hammer on that OS, you might lose the end of a recently extended file in a power failure even though fdatasync() had returned success. By my reading of POSIX, that shouldn't be necessary on a conforming implementation of fdatasync(), and that was fixed years ago in Linux. I'm not proposing any changes there, and I'm not proposing to take advantage of that in the new code. I'm pointing out that that we don't have to worry about that for these undo segments, because they are already flushed with fsync(), not fdatasync(). (To understand POSIX's descriptions of fsync() and fdatasync() you have to find the meanings of "Synchronized I/O Data Integrity Completion" and "Synchronized I/O File Integrity Completion" elsewhere in the spec. TL;DR: fdatasync() is only allowed to skip flushing attributes like the modified time, it's not allowed to skip flushing a file size change since that would interfere with retrieving the data.) 2. Time of reservation: Although they don't call fsync(), regular relations and these new undo files still write zeroes up front (respectively, for a new block and for a new segment). One reason for that is that most popular filesystems reserve space at write time, so you'll get ENOSPC when trying to allocate undo space, and that's a non-fatal ERROR. If we deferred until writing back buffer contents, we might get file holes, and deferred ENOSPC is much harder to report to users and for users to deal with. You can still get a ENOSPC at checkpoint write-back time on COW systems like ZFS, and there is not much I can do about that. You can still get ENOSPC at checkpoint fsync() time on NFS, and there's not much we can do about that for now except panic (without direct IO, or other big changes). 3. Separate size tracking: Another reason that regular relations write out zeroes at relation-extension time is that that's the only place that the size of a relation is recorded. PostgreSQL doesn't track the number of blocks itself, so we can't defer file extension until write-back from our buffer pool. Undo doesn't rely on the filesystem to track the amount of undo data, it has its own crash-safe tracking of the discard and end pointers, which can be used to know which segment files exist and what ranges contain data. That allows us to work in whole files at a time, like WAL logs, even though we still have checkpoint-based flushing rules. To summarise, we write zeroes so we can report ENOSPC errors as early as possible, but we defer and consolidate fsync() calls because the files' contents and names don't actually have to survive power loss until a checkpoint says they existed at that point in the WAL stream. Does this make sense? BTW we could probably use posix_fallocate() instead of writing zeroes; I think Andres mentioned that recently. I see also that someone tried that for WAL and it got reverted back in 2013 (commit b1892aaeaaf34d8d1637221fc1cbda82ac3fcd71, I didn't try to hunt down the discussion). [1] https://lkml.org/lkml/2012/9/3/83 [2] https://github.com/torvalds/linux/commit/b71fc079b5d8f42b2a52743c8d2f1d35d655b1c5 -- Thomas Munro https://enterprisedb.com
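Putting that argument into code form: the allocation path sketched by the
quoted allocate_empty_undo_segment() boils down to something like the
following (a simplified sketch; error handling and recycling logic
trimmed):

    char    zeroes[8192] = {0};
    size_t  written_so_far = 0;
    int     fd;

    fd = OpenTransientFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
    if (fd < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not create file \"%s\": %m", path)));

    /*
     * Write real zeroes so non-COW filesystems reserve physical blocks and
     * report ENOSPC now, while it is still a non-fatal ERROR.
     */
    while (written_so_far < UndoLogSegmentSize)
    {
        ssize_t rc = write(fd, zeroes,
                           Min(sizeof(zeroes),
                               UndoLogSegmentSize - written_so_far));

        if (rc < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not write to file \"%s\": %m", path)));
        written_so_far += rc;
    }
    CloseTransientFile(fd);

    /*
     * Durability is deferred: the checkpointer will fsync() the file before
     * the next checkpoint completes, possibly consolidating the request
     * with others, or skipping it if the segment is unlinked first.
     */
    undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);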
Hi Thomas,

I have started reviewing 0003-Add-undo-log-manager. I haven't reviewed it
fully yet, but I noticed some places where, instead of UndoRecPtr, you are
directly using UndoLogOffset, which seem like bugs to me.

1.
+UndoRecPtr
+UndoLogAllocateInRecovery(UndoLogAllocContext *context,
+ TransactionId xid,
+ uint16 size,
+ bool *need_xact_header,
+ UndoRecPtr *last_xact_start,
....
+ *need_xact_header =
+ context->try_location == InvalidUndoRecPtr &&
+ slot->meta.unlogged.insert == slot->meta.unlogged.this_xact_start;
+ *last_xact_start = slot->meta.unlogged.last_xact_start;

The output parameter last_xact_start is of type UndoRecPtr whereas
slot->meta.unlogged.last_xact_start is of type UndoLogOffset; shouldn't we
use MakeUndoRecPtr(logno, offset) here?

2.
+ slot = find_undo_log_slot(logno, false);
+ if (UndoLogOffsetPlusUsableBytes(try_offset, size) <= slot->meta.end)
+ {
+ *need_xact_header = false;
+ return try_offset;
+ }

Here also you are directly returning try_offset instead of an UndoRecPtr.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
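If these are indeed bugs, the fixes would presumably be along the lines
Dilip suggests, using the MakeUndoRecPtr(logno, offset) conversion he names
(a sketch):

    /* 1: convert the offset to an UndoRecPtr before returning it */
    *last_xact_start = MakeUndoRecPtr(slot->logno,
                                      slot->meta.unlogged.last_xact_start);

    /* 2: likewise for the early-return path */
    if (UndoLogOffsetPlusUsableBytes(try_offset, size) <= slot->meta.end)
    {
        *need_xact_header = false;
        return MakeUndoRecPtr(logno, try_offset);
    }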
On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > Yep, that was completely wrong. Here's a new version.
> > >
> >
> > One comment/question related to
> > 0022-Use-undo-based-rollback-to-clean-up-files-on-abort.patch.
> >
>
> I have done some more review of the undolog patch series and here are my comments:
> 0003-Add-undo-log-manager.patch
>

Some more review of the same patch:

1.
+typedef struct UndoLogSharedData
+{
+ UndoLogNumber free_lists[UndoLogCategories];
+ UndoLogNumber low_logno;

What is the use of low_logno? I don't see it being assigned any value
anywhere in the code. Is it for some future use?

2.
+void
+CheckPointUndoLogs(XLogRecPtr checkPointRedo, XLogRecPtr priorCheckPointRedo)
{
..
+ /* Compute header checksum. */
+ INIT_CRC32C(crc);
+ COMP_CRC32C(crc, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno));
+ COMP_CRC32C(crc, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno));
+ COMP_CRC32C(crc, &num_logs, sizeof(num_logs));
+ FIN_CRC32C(crc);
+
+ /* Write out the number of active logs + crc. */
+ if ((write(fd, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno)) != sizeof(UndoLogShared->low_logno)) ||
+ (write(fd, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno)) != sizeof(UndoLogShared->next_logno)) ||

Is it safe to read UndoLogShared without UndoLogLock? All other places
accessing UndoLogShared use UndoLogLock, so if this usage is safe, maybe it
is better to add a comment.

3.
UndoLogAllocateInRecovery()
{
..
/*
+ * Otherwise we need to do our own transaction tracking
+ * whenever we see a new xid, to match the logic in
+ * UndoLogAllocate().
+ */
+ if (xid != slot->meta.unlogged.xid)
+ {
+ slot->meta.unlogged.xid = xid;
+ if (slot->meta.unlogged.this_xact_start != slot->meta.unlogged.insert)
+ slot->meta.unlogged.last_xact_start =
+ slot->meta.unlogged.this_xact_start;
+ slot->meta.unlogged.this_xact_start =
+ slot->meta.unlogged.insert;

The code doesn't follow the comment. In UndoLogAllocate, both
last_xact_start and this_xact_start are assigned in the if block, so the
same should be the case here.

4.
UndoLogAllocateInRecovery()
{
..
+ /*
+ * Just as in UndoLogAllocate(), the caller may be extending an existing
+ * allocation before committing with UndoLogAdvance().
+ */
+ if (context->try_location != InvalidUndoRecPtr)
+ {
..
}

I am not sure how this will work because, unlike UndoLogAllocate, this
function doesn't set try_location initially. It will be set later by
UndoLogAdvance, which can easily go wrong because that doesn't include
UndoLogBlockHeaderSize.

5.
+UndoLogAdvance(UndoLogAllocContext *context, size_t size)
+{
+ context->try_location = UndoLogOffsetPlusUsableBytes(context->try_location,
+ size);
+}

Here, you are using an UndoRecPtr whereas UndoLogOffsetPlusUsableBytes
expects an offset.

6.
UndoLogAllocateInRecovery()
{
..
+ /*
+ * At this stage we should have an undo log that can handle this
+ * allocation. If we don't, something is screwed up.
+ */
+ if (UndoLogOffsetPlusUsableBytes(slot->meta.unlogged.insert, size) > slot->meta.end)
+ elog(ERROR,
+ "cannot allocate %d bytes in undo log %d",
+ (int) size, slot->logno);
..
}

Similar to point 5, here you are using a pointer instead of an offset.

7.
UndoLogAllocateInRecovery()
{
..
+ /* We found a reference to a different (or first) undo log. */
+ slot = find_undo_log_slot(logno, false);
..
+ /* TODO: check locking against undo log slot recycling? */
..
}

I think it is better to have an Assert here that slot can't be NULL.
AFAICS, slot can't be NULL unless there is some bug. I also don't
understand this 'TODO' comment.

8.
+ {
+ {"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
+ gettext_noop("Sets the tablespace(s) to use for undo logs."),
+ NULL,
+ GUC_LIST_INPUT | GUC_LIST_QUOTE
+ },
+ &undo_tablespaces,
+ "",
+ check_undo_tablespaces, assign_undo_tablespaces, NULL
+ },

It seems you need to update variable_is_guc_list_quote for this variable.

9.
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
{
..
+ if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+ XLogRecPtr ptr;
+
+ xlrec.logno = logno;
+ xlrec.end = end;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+ XLogFlush(ptr);
+ }
..
}

Do we need this for the temporary/unlogged persistence levels? Similarly,
there is WAL logging in attach_undo_log; I can't understand why that would
be required for the temporary/unlogged persistence levels either.

10.
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
{
..
+ oid = get_tablespace_oid(name, true);
+ if (oid == InvalidOid)
..
}

Do we need to check permissions to see if the current user is allowed to
create in this tablespace?

11.
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+ char *rawname;
+ List *namelist;
+ bool need_to_unlock;
+ int length;
+ int i;
+
+ /* We need a modifiable copy of string. */
+ rawname = pstrdup(undo_tablespaces);

I don't see rawname being used outside this function; isn't it better to
free it? I understand that this function won't be called frequently enough
to matter, but still, there is some theoretical danger if the user
continuously changes undo_tablespaces.

12.
+find_undo_log_slot(UndoLogNumber logno, bool locked)
{
..
+ * TODO: We could track the lowest known undo log number, to reduce
+ * the negative cache entry bloat.
+ */
+ if (result == NULL)
+ {
..
}

Do we have any mechanism to clear this bloat, or will it stay till the end
of the session? If it is the latter, then I think it might be good to take
care of this TODO. This is not a blocker, but a good-to-have kind of thing.

13.
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+ UndoLogOffset end)
{
..
}

What will happen if the transaction creating the undo log segment rolls
back? Do we want to have pendingDeletes stuff as we have for normal
relation files? This might also help in clearing the shared memory state
(undo log slots), if any.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
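Regarding point 8: variable_is_guc_list_quote() is the hard-coded list of
list-quoted GUC names, which looks roughly like this today, so the patch
would presumably just add its new variable (a sketch):

    bool
    variable_is_guc_list_quote(const char *name)
    {
        if (pg_strcasecmp(name, "temp_tablespaces") == 0 ||
            pg_strcasecmp(name, "session_preload_libraries") == 0 ||
            pg_strcasecmp(name, "shared_preload_libraries") == 0 ||
            pg_strcasecmp(name, "local_preload_libraries") == 0 ||
            pg_strcasecmp(name, "search_path") == 0 ||
            pg_strcasecmp(name, "undo_tablespaces") == 0)   /* new entry */
            return true;
        else
            return false;
    }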
Hi Thomas,

A few review comments on 0003-Add-undo-log-manager.patch:

1) Upgrade may fail

+/*
+ * Compute the new redo, and move the pg_undo file to match if necessary.
+ * Rather than renaming it, we'll create a new copy, so that a failure that
+ * occurs before the controlfile is rewritten won't be fatal.
+ */
+static void
+AdjustRedoLocation(const char *DataDir)
+{
+ uint64 old_redo = ControlFile.checkPointCopy.redo;
+ char old_pg_undo_path[MAXPGPATH];
+ char new_pg_undo_path[MAXPGPATH];
+ int old_fd;
+ int new_fd;
+ ssize_t nread;
+ ssize_t nwritten;
+ char buffer[1024];
+
+ /*
+ * Adjust fields as needed to force an empty XLOG starting at
+ * newXlogSegNo.
+ */

During the upgrade we delete the undo files present in the new cluster and
copy the undo files from the old cluster to the new cluster. Then we try to
readjust the redo location using pg_resetwal. While trying to readjust, we
get the control file details from the current cluster and try to open the
undo file it references. As the undo files from the current cluster have
been removed and replaced with the old cluster's contents, the file open
will fail. Attached is a patch to solve this problem.

2) Drop tablespace failure in a corner case

+ else
+ {
+ /*
+ * There is data we need in this undo log. We can't force it to
+ * be detached.
+ */
+ ok = false;
+ }
+ LWLockRelease(&slot->mutex);
+ /* If we failed, then give up now and report failure. */
+ if (!ok)
+ return false;

One thought: can we discard the current tablespace's entries and try not to
fail?

3) There will be a problem if deletion succeeds for some files but fails
for others; the meta contents holding the end details also need to be
applied accordingly, and we need to handle the case where further undo is
created after the rollback.

+ while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+ {
+ char segment_path[MAXPGPATH];
+
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+ snprintf(segment_path, sizeof(segment_path), "%s/%s",
+ undo_path, de->d_name);
+ if (unlink(segment_path) < 0)
+ elog(LOG, "couldn't unlink file \"%s\": %m", segment_path);
+ }

4) In the error case, the "unlinked undo segment" message will still be
logged

+ while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+ {
+ char segment_path[MAXPGPATH];
+
+ if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0)
+ continue;
+ snprintf(segment_path, sizeof(segment_path), "%s/%s",
+ undo_path, de->d_name);
+ elog(DEBUG1, "unlinked undo segment \"%s\"", segment_path);
+ if (unlink(segment_path) < 0)
+ elog(LOG, "couldn't unlink file \"%s\": %m", segment_path);
+ }
+ FreeDir(dir);

The DEBUG1 message is emitted before the unlink, so in the error case the
success message will already have been logged.

5) UndoRecPtrIsValid can be used to check against InvalidUndoRecPtr

+ /*
+ * 'size' is expressed in usable non-header bytes. Figure out how far we
+ * have to move insert to create space for 'size' usable bytes, stepping
+ * over any intervening headers.
+ */ + Assert(slot->meta.unlogged.insert % BLCKSZ >= UndoLogBlockHeaderSize); + if (context->try_location != InvalidUndoRecPtr) Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com On Thu, Jul 25, 2019 at 9:30 AM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, Jul 25, 2019 at 7:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jul 24, 2019 at 11:04 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > Hi, > > > > > > I have done some review of undolog patch series > > > and here are my comments: > > > 0003-Add-undo-log-manager.patch > > > > > > 1) As undo log is being created in tablespace, > > > if the tablespace is dropped later, will it have any impact? > > Thanks Amit, that clarifies the problem I was thinking. > I have another question regarding drop table space failure, but I > don't have a better solution for that problem. > Let me think more about it and discuss. > > > > Yes, it drops the undo logs present in tablespace being dropped. See > > DropUndoLogsInTablespace() in the same patch. > > > > > > > > 4) Should we add a readme file for undolog as it does a fair amount of work > > > and is core part of the undo system? > > > > Thanks Amit, I could get the details of readme. > > > > The Readme is already present in the patch series posted by Thomas. > > See 0019-Add-developer-documentation-for-the-undo-log-storage.patch in > > email [1]. > > > > [1] - https://www.postgresql.org/message-id/CA%2BhUKGKni7EEU4FT71vZCCwPeaGb2PQOeKOFjQJavKnD577UMQ%40mail.gmail.com > > > > -- > > With Regards, > > Amit Kapila. > > EnterpriseDB: http://www.enterprisedb.com > > -- > Regards, > Vignesh > EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 25, 2019 at 11:22 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have done some more review of the undolog patch series and here are my comments:
>
> Hi Amit,
>
> Thanks! There are a number of actionable changes in your review. I'll be
> posting a new patch set soon that will address most of your complaints
> individually. In this message I want to respond to one topic area, because
> the answer is long enough already:
>
> > 2.
> > allocate_empty_undo_segment()
> > {
> > ..
> > ..
> > /* Flush the contents of the file to disk before the next checkpoint. */
> > + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> > ..
> > }
> >
> > +void
> > +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
> > +{
> > + char path[MAXPGPATH];
> > + FileTag tag;
> > +
> > + INIT_UNDOFILETAG(tag, logno, tablespace, segno);
> > +
> > + /* Try to send to the checkpointer, but if out of space, do it here. */
> > + if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
> >
> >
> > The comment in allocate_empty_undo_segment indicates that the code wants to
> > flush before the checkpoint, but the actual function tries to register the
> > request with the checkpointer. Shouldn't this be similar to XLogFileInit,
> > where we use pg_fsync to flush the contents immediately? I guess that would
> > avoid what you have written in the comments in the same function (we just
> > want to make sure that the filesystem has allocated physical blocks for it
> > so that non-COW filesystems will report ENOSPC now rather than later when
> > space is needed). OTOH, I think it is performance-wise better to postpone
> > the work to the checkpointer. If we want to push this work to checkpointer,
> > then we might need to change comments or alternatively, we might want to
> > use bigger segment sizes to mitigate the performance effect.
>
> In an early version I was doing the fsync() immediately. While testing
> zheap, Mithun CY reported that whenever segments couldn't be recycled in
> the background, such as during a somewhat long-running transaction, he
> could measure ~6% of the time being spent waiting for fsync(), and
> throughput increased with bigger segments (and thus fewer files to
> fsync()). Passing the work off to the checkpointer is better not only
> because it's done in the background but also because there is a chance that
> the work can be consolidated with other sync requests, and perhaps even
> avoided completely if the file is discarded and unlinked before the next
> checkpoint.
>
> I'll update the comment to make it clearer.
>

Okay, that makes sense.

> > If my above understanding is correct and the reason to fsync
> > immediately is to reserve space now, then we also need to think
> > whether we are always safe in postponing the work? Basically, if this
> > means that it can fail when we are actually trying to write undo, then
> > it could be risky because we could be in the critical section at that
> > time. I am not sure about this point, rather it is just to discuss if
> > there are any impacts of postponing the fsync work.
>
> Here is my theory for why this arrangement is safe, and why it differs
> from what we're doing with WAL segments and regular relation files.
> First, let's review why those things work the way they do (as I
> understand it):
>
> 1. WAL's use of fdatasync():
>

I was referring to the function XLogFileInit, which doesn't appear to be
directly using fdatasync.

>
> 3. Separate size tracking: Another reason that regular relations
> write out zeroes at relation-extension time is that that's the only
..
>
> To summarise, we write zeroes so we can report ENOSPC errors as early
> as possible, but we defer and consolidate fsync() calls because the
> files' contents and names don't actually have to survive power loss
> until a checkpoint says they existed at that point in the WAL stream.
>
> Does this make sense?
>

Yes, this makes sense. However, I wonder if we need some special handling
for ENOSPC while writing to the file in this function
(allocate_empty_undo_segment): basically, unlink/remove the file if we fail
to write it because the disk is full, something similar to what we do in
XLogFileInit.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
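The XLogFileInit behaviour Amit refers to saves errno, removes the
partly-written file, and only then reports the error, roughly like this (a
sketch adapted to the undo segment case):

    if (write(fd, zeroes, nzeroes) != (ssize_t) nzeroes)
    {
        int     save_errno = errno;

        /* Remove the partially-written segment so a retry can succeed. */
        CloseTransientFile(fd);
        unlink(path);

        /* If write didn't set errno, assume the problem is disk full. */
        errno = save_errno ? save_errno : ENOSPC;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write to file \"%s\": %m", path)));
    }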
On Thu, Jul 25, 2019 at 11:25 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Hi Thomas,
>
> I have started reviewing 0003-Add-undo-log-manager. I haven't reviewed it
> fully yet, but I noticed some places where, instead of UndoRecPtr, you are
> directly using UndoLogOffset, which seem like bugs to me.
>
> 1.
> +UndoRecPtr
> +UndoLogAllocateInRecovery(UndoLogAllocContext *context,
> + TransactionId xid,
> + uint16 size,
> + bool *need_xact_header,
> + UndoRecPtr *last_xact_start,
> ....
> + *need_xact_header =
> + context->try_location == InvalidUndoRecPtr &&
> + slot->meta.unlogged.insert == slot->meta.unlogged.this_xact_start;
> + *last_xact_start = slot->meta.unlogged.last_xact_start;
>
> The output parameter last_xact_start is of type UndoRecPtr whereas
> slot->meta.unlogged.last_xact_start is of type UndoLogOffset; shouldn't we
> use MakeUndoRecPtr(logno, offset) here?
>
> 2.
> + slot = find_undo_log_slot(logno, false);
> + if (UndoLogOffsetPlusUsableBytes(try_offset, size) <= slot->meta.end)
> + {
> + *need_xact_header = false;
> + return try_offset;
> + }
>
> Here also you are directly returning try_offset instead of an UndoRecPtr.

+UndoLogRegister(UndoLogAllocContext *context, uint8 block_id, UndoLogNumber logno)
+{
+ int i;
+
+ for (i = 0; i < context->num_meta_data_images; ++i)
+ {
+ if (context->meta_data_images[i].logno == logno)
+ {
+ XLogRegisterBufData(block_id,
+ (char *) &context->meta_data_images[i].data,
+ sizeof(context->meta_data_images[i].data));
+ return;
+ }
+ }
+}

I have observed one more thing: you are registering the "meta_data_images"
with each buffer of that log. Suppose one undo record is spread across two
undo blocks; then both blocks will include a duplicate copy of this
metadata image if it is the first change after a checkpoint. It will not
cause any issue, but IMHO we can avoid including two copies of the same
meta_data_image.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
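One way to avoid the duplicate copy might be to remember which images have
already been registered for the record being inserted, e.g. with a flag
that is reset once the record is complete (the 'registered' field here is
hypothetical):

    for (i = 0; i < context->num_meta_data_images; ++i)
    {
        if (context->meta_data_images[i].logno == logno &&
            !context->meta_data_images[i].registered)   /* hypothetical */
        {
            XLogRegisterBufData(block_id,
                                (char *) &context->meta_data_images[i].data,
                                sizeof(context->meta_data_images[i].data));
            context->meta_data_images[i].registered = true;
            return;
        }
    }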
On Tue, Jul 23, 2019 at 8:12 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Tue, 23 Jul 2019 at 08:48, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > --------------
> > >
> > > + if (!InsertRequestIntoErrorUndoQueue(urinfo))
> > > I was thinking what happens if for some reason
> > > InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the
> > > entry will not be marked invalid, and so there will be no undo action
> > > carried out because I think the undo worker will exit. What happens
> > > next with this entry ?
> >
> > The same entry is present in two queues xid and size, so next time it
> > will be executed from the second queue based on it's priority in that
> > queue. However, if it fails again a second time in the same way, then
> > we will be in trouble because now the hash table has entry, but none
> > of the queues has entry, so none of the workers will attempt to
> > execute again. Also, when discard worker again tries to register it,
> > we won't allow adding the entry to queue thinking either some backend
> > is executing the same or it must be part of some queue.
> >
> > The one possibility to deal with this could be that we somehow allow
> > discard worker to register it again in the queue or we can do this in
> > critical section so that it allows system restart on error. However,
> > the main thing is it possible that InsertRequestIntoErrorUndoQueue
> > will fail unless there is some bug in the code? If so, we might want
> > to have an Assert for this rather than handling this condition.
>
> Yes, I also think that the function would error out only because of
> can't-happen cases, like "too many locks taken" or "out of binary heap
> slots" or "out of memory" (this last one is not such a can't happen
> case). These cases happen probably due to some bugs, I suppose. But I
> was wondering : Generally when the code errors out with such
> can't-happen elog() calls, worst thing that happens is that the
> transaction gets aborted. Whereas, in this case, the worst thing that
> could happen is : the undo action would never get executed, which
> means selects for this tuple will keep on accessing the undo log ?
>

Yeah. Also, in zheap we have a page-wise rollback facility, which rolls
back the transaction for a particular page (this gets triggered whenever we
try to update/delete a tuple that was last updated by an aborted xact, or
when we try to reuse the slot of an aborted xact), and for that we don't
need to traverse the undo chain.

> This does not sound like any data consistency issue, so we should be
> fine after all ?
>

I will see if we can have an Assert in the code for this.

> > --------------
>
> +if (UndoGetWork(false, false, &urinfo, NULL) &&
> + IsUndoWorkerAvailable())
> + UndoWorkerLaunch(urinfo);
>
> There is no lock acquired between IsUndoWorkerAvailable() and
> UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable()
> returns true, there is a small window where UndoWorkerLaunch() does
> not find any worker slot with in_use false, causing assertion failure
> for (worker != NULL).
> --------------
>

Yeah, I think UndoWorkerLaunch should be able to return without launching a
worker in such a case.

> + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
> + {
> + /* Failed to start worker, so clean up the worker slot. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> + UndoWorkerCleanup(worker);
> + LWLockRelease(UndoWorkerLock);
> +
> + return false;
> + }
>
> Is it intentional that there is no (warning?)
message logged when we > can't register a bg worker ? > ------------- I don't think it was intentional. I think it will be good to have a warning here. I agree with all your other comments. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
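For the missing message, a warning like the one the logical replication
launcher emits in the same situation would fit (a sketch):

    if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
    {
        /* Failed to start worker, so warn and clean up the worker slot. */
        ereport(WARNING,
                (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
                 errmsg("out of background worker slots"),
                 errhint("You might need to increase max_worker_processes.")));

        LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
        UndoWorkerCleanup(worker);
        LWLockRelease(UndoWorkerLock);

        return false;
    }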
On Wed, Jul 24, 2019 at 10:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Please find my review comments for
> 0013-Allow-foreground-transactions-to-perform-undo-action
>
> + * We can't postpone applying undo actions for subtransactions as the
> + * modifications made by aborted subtransaction must not be visible even if
> + * the main transaction commits.
> + */
> + if (IsSubTransaction())
> + return;
>
> I am not completely sure but is it possible that the outer function
> CommitTransactionCommand/AbortCurrentTransaction can avoid
> calling this function in the switch case based on the current state,
> so that under subtransaction this will never be called?

We can do that, and we can also have an additional check similar to
"if (!s->performUndoActions)", but that check would have to be added in all
places from which this function is called. I feel that will make the code
less readable in many places.

> + bool undo_req_pushed[UndoLogCategories]; /* undo request pushed
> + * to worker? */
> + bool performUndoActions;
> +
> struct TransactionStateData *parent; /* back link to parent */
>
> We must have some comments to explain how performUndoActions is used,
> where it's set. If it's explained somewhere else then we can
> give reference to that code.

I am planning to remove this variable in the next version and have an
explicit check as we have in UndoActionsRequired.

I agree with your other comments and will address them in the next version
of the patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, 26 Jul 2019 at 12:25, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I agree with all your other comments.

Thanks for addressing the comments. Below is the continuation of my
comments on 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch:

+ * Perform rollback request. We need to connect to the database for first
+ * request and that is required because we access system tables while
for first request and that is required => for the first request. This is required
---------------

+UndoLauncherShmemSize(void)
+{
+ Size size;
+
+ /*
+ * Need the fixed struct and the array of LogicalRepWorker.
+ */
+ size = sizeof(UndoApplyCtxStruct);

The fixed structure size should be offsetof(UndoApplyCtxStruct, workers)
rather than sizeof(UndoApplyCtxStruct).
---------------

In UndoWorkerCleanup(), we set individual fields of the UndoApplyWorker
structure, whereas in UndoLauncherShmemInit(), for all the UndoApplyWorker
array elements, we just memset all the UndoApplyWorker structure elements
to 0. I think we should be consistent for the two cases. I guess we can
just memset to 0 as you do in UndoLauncherShmemInit(), but this will cause
worker->undo_worker_queue to be 0, i.e. XID_QUEUE, whereas in
UndoWorkerCleanup() it is set to -1. Is the -1 value essential, or can we
just set it to XID_QUEUE initially? Also, if we just use memset in
UndoWorkerCleanup(), we need to first save the generation field into a temp
variable and restore it after the memset(). That brought me to another
point: we already have a macro ResetUndoRequestInfo(), so
UndoWorkerCleanup() can just call ResetUndoRequestInfo().
------------

+ bool allow_peek;
+
+ CHECK_FOR_INTERRUPTS();
+
+ allow_peek = !TimestampDifferenceExceeds(started_at,

Some comments would be good about what allow_peek is used for. Something
like: "Arrange to prevent the worker from restarting quickly to switch
databases".
-----------------

+++ b/src/backend/access/undo/README.UndoProcessing
-----------------

+worker then start reading from one of the queues the requests for that
start=>starts
---------------

+work, it lingers for UNDO_WORKER_LINGER_MS (10s as default). This avoids
As per the latest definition, it is 20s. IMHO, there's no need to mention
the default value in the readme.
---------------

+++ b/src/backend/access/undo/discardworker.c
---------------

+ * portion of transaction that is overflowed into a separate log can be processed
This crosses the 80-column limit.

+#include "access/undodiscard.h"
+#include "access/discardworker.h"
Not in alphabetical order.

+++ b/src/backend/access/undo/undodiscard.c
---------------

+ next_insert = UndoLogGetNextInsertPtr(logno);

I checked the UndoLogGetNextInsertPtr() definition. It calls
find_undo_log_slot() to get back the slot from the logno. Why not make it
accept a slot instead of a logno? In all other places the slot->logno is
passed, so it would be convenient to just pass the slot there. And in
UndoDiscardOneLog(), first call find_undo_log_slot() just before the above
line (or call it at the end of the do-while loop). This way, during each of
the UndoLogGetNextInsertPtr() calls in undorequest.c, we will have one less
find_undo_log_slot() call. My suggestion is of course valid only under the
assumption that when you call UndoLogGetNextInsertPtr(fooslot->logno),
find_undo_log_slot() inside UndoLogGetNextInsertPtr() will return the same
fooslot.
-------------

In UndoDiscardOneLog(), there are at least two variable declarations that
can be moved inside the do-while loop: uur and next_insert.
I am not sure about the other variables viz : undofxid and latest_discardxid. Values of these variables in one iteration continue across to the second iteration. For latest_discardxid, it looks like we do want its value to be carried forward, but is it also true for undofxid ? + /* If we reach here, this means there is something to discard. */ + need_discard = true; + } while (true); Also, about need_discard; there is no place where need_discard is set to false. That means, from 2nd iteration onwards, it will never be false. So even if the code that explicitly sets need_discard to true does not get run, still the undolog will be discarded. Is this expected ? ------------- + if (request_rollback && dbid_exists(uur->uur_txn->urec_dbid)) + { + (void) RegisterRollbackReq(InvalidUndoRecPtr, + undo_recptr, + uur->uur_txn->urec_dbid, + uur->uur_fxid); + + pending_abort = true; + } We can get rid of request_rollback variable. Whatever the "if" block above is doing, do it in this upper condition : if (!IsXactApplyProgressCompleted(uur->uur_txn->urec_progress)) Something like this : if (!IsXactApplyProgressCompleted(uur->uur_txn->urec_progress)) { if (dbid_exists(uur->uur_txn->urec_dbid)) { (void) RegisterRollbackReq(InvalidUndoRecPtr, undo_recptr, uur->uur_txn->urec_dbid, uur->uur_fxid); pending_abort = true; } } ------------- + UndoRecordRelease(uur); + uur = NULL; + } ..... ..... + Assert(uur == NULL); + + /* If we reach here, this means there is something to discard. */ + need_discard = true; + } while (true); Looks like it is neither necessary to set uur to NULL, nor is it necessary to have the Assert(uur == NULL). At the start of each iteration uur is anyway assigned a fresh value, which may or may not be NULL. ------------- + * over undo logs is complete, new undo can is allowed to be written in the new undo can is allowed => new undo is allowed + * hash table size. So before start allowing any new transaction to write the before start allowing => before allowing any new transactions to start writing the ------------- + /* Get the smallest of 'xid having pending undo' and 'oldestXmin' */ + oldestXidHavingUndo = RollbackHTGetOldestFullXid(oldestXidHavingUndo); + .... + .... + if (FullTransactionIdIsValid(oldestXidHavingUndo)) + pg_atomic_write_u64(&ProcGlobal->oldestFullXidHavingUnappliedUndo, + U64FromFullTransactionId(oldestXidHavingUndo)); Is it possible that the FullTransactionId returned by RollbackHTGetOldestFullXid() could be invalid ? If not, then the if condition above can be changed to an Assert(). ------------- + * If the log is already discarded, then we are done. It is important + * to first check this to ensure that tablespace containing this log + * doesn't get dropped concurrently. + */ + LWLockAcquire(&slot->mutex, LW_SHARED); + /* + * We don't have to worry about slot recycling and check the logno + * here, since we don't care about the identity of this slot, we're + * visiting all of them. I guess, it's accidental that the LWLockAcquire() call is *between* the two comments ? ----------- + if (UndoRecPtrGetCategory(undo_recptr) == UNDO_SHARED) + { + /* + * For the "shared" category, we only discard when the + * rm_undo_status callback tells us we can. + */ + status = RmgrTable[uur->uur_rmid].rm_undo_status(uur, &wait_xid); status variable could be declared in this block itself. ------------- Some variable declaration alignments and comments spacing need changes as per pgindent. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
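On the UndoLauncherShmemSize() point: the logical replication launcher
computes its size as the fixed header plus the worker array, so the
equivalent here might be (a sketch; max_undo_workers stands for whatever
constant or GUC actually sizes the workers array):

    Size
    UndoLauncherShmemSize(void)
    {
        Size    size;

        /* Fixed part of UndoApplyCtxStruct, up to the flexible array. */
        size = offsetof(UndoApplyCtxStruct, workers);
        /* Plus one UndoApplyWorker slot per launchable worker. */
        size = add_size(size, mul_size(max_undo_workers,
                                       sizeof(UndoApplyWorker)));
        return size;
    }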
Hi,

On 2019-07-25 17:51:33 +1200, Thomas Munro wrote:
> 1. WAL's use of fdatasync(): The reason we fill and then fsync()
> newly created WAL files up front is because we want to make sure the
> blocks are definitely on disk. The comment doesn't spell out exactly
> why the author considered later fdatasync() calls to be insufficient,
> but they were: it was many years after commit 33cc5d8a4d0d that Linux
> ext3/4 filesystems began flushing file size changes to disk in
> fdatasync()[1][2]. I don't know if its original behaviour was
> intentional or not. So, if you didn't use the bigger fsync() hammer
> on that OS, you might lose the end of a recently extended file in a
> power failure even though fdatasync() had returned success.
>
> By my reading of POSIX, that shouldn't be necessary on a conforming
> implementation of fdatasync(), and that was fixed years ago in Linux.
> I'm not proposing any changes there, and I'm not proposing to take
> advantage of that in the new code. I'm pointing out that that we
> don't have to worry about that for these undo segments, because they
> are already flushed with fsync(), not fdatasync().
>
> (To understand POSIX's descriptions of fsync() and fdatasync() you
> have to find the meanings of "Synchronized I/O Data Integrity
> Completion" and "Synchronized I/O File Integrity Completion" elsewhere
> in the spec. TL;DR: fdatasync() is only allowed to skip flushing
> attributes like the modified time, it's not allowed to skip flushing a
> file size change since that would interfere with retrieving the data.)

Note that there are very good performance reasons for trying to avoid metadata changes at e.g. commit time. They're commonly journaled at the FS level, which can add a good chunk of IO and synchronization to operations that we commonly want to be as fast as possible. Basically you often at least double the amount of synchronous writes. And for the potential future where we use async direct IO, writes that change the file size take considerably slower codepaths, and add a lot of synchronization.

I suspect that's much more likely to be the reason for the preallocation in 33cc5d8a4d0d than avoiding an ext* bug (I doubt the bug you reference existed back then; IIUC it didn't apply to ext2, and ext3 was introduced after 33cc5d8a4d0d).

> 2. Time of reservation: Although they don't call fsync(), regular
> relations and these new undo files still write zeroes up front
> (respectively, for a new block and for a new segment). One reason for
> that is that most popular filesystems reserve space at write time, so
> you'll get ENOSPC when trying to allocate undo space, and that's a
> non-fatal ERROR. If we deferred until writing back buffer contents,
> we might get file holes, and deferred ENOSPC is much harder to report
> to users and for users to deal with.

FWIW, I don't quite buy the bit about holes - we could zero the hole at that time (and not be worse off than today, except that it might be done by somebody that didn't cause the extension), or even better just look up the buffers between the FS end of the relation and the block currently written, and write them out in order. The point that deferred ENOSPC is harder to report to users is obviously true regardless of that.

> BTW we could probably use posix_fallocate() instead of writing zeroes;
> I think Andres mentioned that recently. I see also that someone tried
> that for WAL and it got reverted back in 2013 (commit
> b1892aaeaaf34d8d1637221fc1cbda82ac3fcd71, I didn't try to hunt down
> the discussion).
IIRC the problem from back then was that while the space is reserved on the FS level, the actual blocks don't contain zeroes at that time. Which means that a) small writes need to write more, because the surrounding data also needs to be zeroed (annoying but not terrible), and b) writes into the fallocated but not yet written range IIRC effectively cause metadata writes, because while the "allocated file ending" doesn't change anymore, the new "non-zero written to" file ending does need to be journaled to disk before an f[data]sync - otherwise you could end up with the old value after a crash, and would read spurious zeroes. That's quite bad.

Those don't necessarily apply to e.g. extending relations, as we don't granularly fsync them. Although even there the performance picture is mixed - it helps a lot in certain workloads, but there are others where it mildly regresses performance on ext4. Not sure why yet; possibly it's due to more heavyweight locking needed when later changing the "non-zero size", or it's the additional metadata changes. I suspect those would be mostly gone if we didn't write back blocks in random order under memory pressure.

Note that neither of those means that it's not a good idea to posix_fallocate() and *then* write zeroes, when initializing. For several filesystems that's more likely to result in more optimally sized filesystem extents, reducing fragmentation. And without an intervening f[data]sync, there's not much additional metadata journalling. Although that's less of an issue on some newer filesystems, IIRC (due to delayed allocation).

Greetings,

Andres Freund
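In code, the initialization pattern described here might look like the sketch below (assumptions: a HAVE_POSIX_FALLOCATE-style configure guard, pg_pwrite() for positioned writes, and abbreviated error handling):

	/*
	 * Sketch: reserve the whole segment first so the filesystem can choose
	 * large extents, then overwrite it with real zeroes so no block is left
	 * merely reserved-but-unwritten.
	 */
	static void
	allocate_zeroed_segment(int fd, off_t segment_size)
	{
		char		zeroes[8192] = {0};
		off_t		offset;

	#ifdef HAVE_POSIX_FALLOCATE
		/* posix_fallocate() returns an error number rather than setting errno */
		if ((errno = posix_fallocate(fd, 0, segment_size)) != 0)
			elog(ERROR, "could not reserve segment space: %m");
	#endif

		/*
		 * Writing zeroes now avoids journaled "written-up-to" metadata
		 * updates at the first real write to each block later.
		 */
		for (offset = 0; offset < segment_size; offset += sizeof(zeroes))
		{
			if (pg_pwrite(fd, zeroes, sizeof(zeroes), offset) != (ssize_t) sizeof(zeroes))
				elog(ERROR, "could not zero segment: %m");
		}
	}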
On Sat, Jul 27, 2019 at 2:27 PM Andres Freund <andres@anarazel.de> wrote: > Note that neither of those mean that it's not a good idea to > posix_fallocate() and *then* write zeroes, when initializing. For > several filesystems that's more likely to result in more optimally sized > filesystem extents, reducing fragmentation. And without an intervening > f[data]sync, there's not much additional metadata journalling. Although > that's less of an issue on some newer filesystems, IIRC (due to delayed > allocation). Interesting. One way to bring back posix_fallocate() without upsetting people on some filesystem out there would be to turn the new wal_init_zero GUC into a choice: write (current default, and current behaviour for 'on'), pwrite_hole (write just the final byte, current behaviour for 'off'), posix_fallocate (like that 2013 patch that was reverted) and posix_fallocate_and_write (do both as you said, to try to solve that problem you mentioned that led to the revert). I suppose there'd be a parallel GUC undo_init_zero. Or some more general GUC for any fixed-sized preallocated files like that (for example if someone were to decide to do the same for SLRU files instead of growing them block-by-block), called something like file_init_zero. -- Thomas Munro https://enterprisedb.com
Hi,

On 2019-06-26 01:29:57 +0530, Amit Kapila wrote:
> From 67845a7afa675e973bd0ea9481072effa1eb219d Mon Sep 17 00:00:00 2001
> From: Dilip Kumar <dilipkumar@localhost.localdomain>
> Date: Wed, 24 Apr 2019 14:36:28 +0530
> Subject: [PATCH 05/14] Add prefetch support for the undo log
>
> Add prefetching function for undo smgr and also provide mechanism
> to prefetch without relcache.

> +#ifdef USE_PREFETCH
> /*
> - * PrefetchBuffer -- initiate asynchronous read of a block of a relation
> + * PrefetchBufferGuts -- Guts of prefetching a buffer.
>  * No-op if prefetching isn't compiled in.

This isn't true for this function, as you've defined it?

> diff --git a/src/backend/storage/smgr/undofile.c b/src/backend/storage/smgr/undofile.c
> index 2aa4952..14ccc52 100644
> --- a/src/backend/storage/smgr/undofile.c
> +++ b/src/backend/storage/smgr/undofile.c
> @@ -117,7 +117,18 @@ undofile_extend(SMgrRelation reln, ForkNumber forknum,
>  void
>  undofile_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
>  {
> -	elog(ERROR, "undofile_prefetch is not supported");
> +#ifdef USE_PREFETCH
> +	File		file;
> +	off_t		seekpos;
> +
> +	Assert(forknum == MAIN_FORKNUM);
> +	file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
> +	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
> +
> +	Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
> +
> +	(void) FilePrefetch(file, seekpos, BLCKSZ, WAIT_EVENT_UNDO_FILE_PREFETCH);
> +#endif							/* USE_PREFETCH */
>  }

This looks like it should be part of the commit that introduces undofile_prefetch(), rather than separately? Afaics there's no reason to have it in this commit.

> From 7206c40e4cee3391c537cdb22c854889bb417d0e Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Wed, 6 Mar 2019 16:46:04 +1300
> Subject: [PATCH 03/14] Add undo log manager.

> +/*
> + * If the caller doesn't know the the block_id, but does know the RelFileNode,
> + * forknum and block number, then we try to find it.
> + */
> +XLogRedoAction
> +XLogReadBufferForRedoBlock(XLogReaderState *record,
> +						   SmgrId smgrid,
> +						   RelFileNode rnode,
> +						   ForkNumber forknum,
> +						   BlockNumber blockno,
> +						   ReadBufferMode mode,
> +						   bool get_cleanup_lock,
> +						   Buffer *buf)

I find that a somewhat odd function comment. Nor does the function name tell me much. A buffer is always block sized. And you pass in a block number.

> @@ -347,7 +409,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
>  	 * Make sure that if the block is marked with WILL_INIT, the caller is
>  	 * going to initialize it.  And vice versa.
>  	 */
> -	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
> +	zeromode = (mode == RBM_ZERO || mode == RBM_ZERO_AND_LOCK ||
> +				mode == RBM_ZERO_AND_CLEANUP_LOCK);
>  	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
>  	if (willinit && !zeromode)
>  		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");

> @@ -463,7 +526,7 @@ XLogReadBufferExtended(SmgrId smgrid, RelFileNode rnode, ForkNumber forknum,
>  	{
>  		/* page exists in file */
>  		buffer = ReadBufferWithoutRelcache(smgrid, rnode, forknum, blkno,
> -										   mode, NULL);
> +										   mode, NULL, RELPERSISTENCE_PERMANENT);
>  	}
>  	else
>  	{
> @@ -488,7 +551,8 @@ XLogReadBufferExtended(SmgrId smgrid, RelFileNode rnode, ForkNumber forknum,
>  			ReleaseBuffer(buffer);
>  		}
>  		buffer = ReadBufferWithoutRelcache(smgrid, rnode, forknum,
> -										   P_NEW, mode, NULL);
> +										   P_NEW, mode, NULL,
> +										   RELPERSISTENCE_PERMANENT);
>  	}
>  	while (BufferGetBlockNumber(buffer) < blkno);
>  	/* Handle the corner case that P_NEW returns non-consecutive pages */
> @@ -498,7 +562,8 @@ XLogReadBufferExtended(SmgrId smgrid, RelFileNode rnode, ForkNumber forknum,
>  			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>  			ReleaseBuffer(buffer);
>  			buffer = ReadBufferWithoutRelcache(smgrid, rnode, forknum, blkno,
> -											   mode, NULL);
> +											   mode, NULL,
> +											   RELPERSISTENCE_PERMANENT);
>  		}
>  	}

Not this patch's fault, but it strikes me as a bad idea to just hardcode RELPERSISTENCE_PERMANENT. E.g. it can totally make sense to WAL log some records for an unlogged table, e.g. to create the init fork.

> +/*
> + * Main control structure for undo log management in shared memory.
> + * UndoLogSlot objects are arranged in a fixed-size array, with no particular
> + * ordering.
> + */
> +typedef struct UndoLogSharedData
> +{
> +	UndoLogNumber free_lists[UndoPersistenceLevels];
> +	UndoLogNumber low_logno;
> +	UndoLogNumber next_logno;
> +	UndoLogNumber nslots;
> +	UndoLogSlot slots[FLEXIBLE_ARRAY_MEMBER];
> +} UndoLogSharedData;

Would be good to document at least low_logno - at least to me it's not obvious what that means by name. Also, some higher-level comments about what the shared memory layout is wouldn't hurt.

> +/*
> + * How many undo logs can be active at a time?  This creates a theoretical
> + * maximum amount of undo data that can exist, but if we set it to a multiple
> + * of the maximum number of backends it will be a very high limit.
> + * Alternative designs involving demand paging or dynamic shared memory could
> + * remove this limit but would be complicated.
> + */
> +static inline size_t
> +UndoLogNumSlots(void)
> +{
> +	return MaxBackends * 4;
> +}

I'd put this factor in a macro (or a named integer constant). It's a) nice to have all such numbers defined in one place, and b) it makes it easier to understand where the four comes from.

> +/*
> + * Initialize the undo log subsystem.  Called in each backend.
> + */
> +void
> +UndoLogShmemInit(void)
> +{
> +	bool		found;
> +
> +	UndoLogShared = (UndoLogSharedData *)
> +		ShmemInitStruct("UndoLogShared", UndoLogShmemSize(), &found);
> +
> +	/* The postmaster initialized the shared memory state. */
> +	if (!IsUnderPostmaster)
> +	{
> +		int			i;
> +
> +		Assert(!found);

I don't quite understand putting this under IsUnderPostmaster, rather than found (and then potentially having an IsUnderPostmaster assert). I know that a few other places do it this way too.

> +/*
> + * Iterate through the set of currently active logs.  Pass in NULL to get the
> + * first undo log.

Not a fan of APIs like this.
Harder to understand at callsites.

> NULL indicates the end of the set of logs.

+ "A return value of"? Right now this sounds a bit like it's referencing the NULL argument.

>  The caller
> + * must lock the returned log before accessing its members, and must skip if
> + * logno is not valid.
> + */
> +UndoLogSlot *
> +UndoLogNextSlot(UndoLogSlot *slot)
> +{

> +/*
> + * Create a new empty segment file on disk for the byte starting at 'end'.
> + */
> +static void
> +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
> +							UndoLogOffset end)
> +{
> +	struct stat stat_buffer;
> +	off_t		size;
> +	char		path[MAXPGPATH];
> +	void	   *zeroes;
> +	size_t		nzeroes = 8192;
> +	int			fd;
> +
> +	UndoLogSegmentPath(logno, end / UndoLogSegmentSize, tablespace, path);
> +
> +	/*
> +	 * Create and fully allocate a new file.  If we crashed and recovered
> +	 * then the file might already exist, so use flags that tolerate that.
> +	 * It's also possible that it exists but is too short, in which case
> +	 * we'll write the rest.  We don't really care what's in the file, we
> +	 * just want to make sure that the filesystem has allocated physical
> +	 * blocks for it, so that non-COW filesystems will report ENOSPC now
> +	 * rather than later when the space is needed and we'll avoid creating
> +	 * files with holes.
> +	 */
> +	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);

As I said somewhere nearby, I think it might make sense to optionally first fallocate, and then zero. Is there a good reason to not just use O_TRUNC here, and enter the zeroing path without a stat? We could potentially end up with holes this way, I think (if the writes didn't make it to disk, but the metadata operation did). Also, it seems better to just start from a consistently zeroed-out block, rather than sometimes having old data in there.

> +	/*
> +	 * If we're not in recovery, we need to WAL-log the creation of the new
> +	 * file(s).  We do that after the above filesystem modifications, in
> +	 * violation of the data-before-WAL rule as exempted by
> +	 * src/backend/access/transam/README.  This means that it's possible for
> +	 * us to crash having made some or all of the filesystem changes but
> +	 * before WAL logging, but in that case we'll eventually try to create the
> +	 * same segment(s) again, which is tolerated.
> +	 */

Perhaps explain *why* the rule is violated here?

> +/*
> + * Advance the insertion pointer in this context by 'size' usable (non-header)
> + * bytes.  This is the next place we'll try to allocate a record, if it fits.
> + * This is not committed to shared memory until after we've WAL-logged the
> + * record and UndoLogAdvanceFinal() is called.
> + */
> +void
> +UndoLogAdvance(UndoLogAllocContext *context, size_t size)
> +{
> +	context->try_location = UndoLogOffsetPlusUsableBytes(context->try_location,
> +														 size);
> +}
> +
> +/*
> + * Advance the insertion pointer to 'size' usable (non-header) bytes past
> + * insertion_point.
> + */
> +void
> +UndoLogAdvanceFinal(UndoRecPtr insertion_point, size_t size)

I think this comment should explain how this differs from UndoLogAdvance().

> +	/*
> +	 * We acquire UndoLogLock to prevent any undo logs from being created or
> +	 * discarded while we build a snapshot of them.  This isn't expected to
> +	 * take long on a healthy system because the number of active logs should
> +	 * be around the number of backends.  Holding this lock won't prevent
> +	 * concurrent access to the undo log, except when segments need to be
> +	 * added or removed.
> +	 */
> +	LWLockAcquire(UndoLogLock, LW_SHARED);

s/the undo log/undo logs/?

> +	/* Dump into a file under pg_undo. */
> +	snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X",
> +			 checkPointRedo);
> +	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE);
> +	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
> +	if (fd < 0)
> +		ereport(ERROR,
> +				(errcode_for_file_access(),
> +				 errmsg("could not create file \"%s\": %m", path)));
> +
> +	/* Compute header checksum. */
> +	INIT_CRC32C(crc);
> +	COMP_CRC32C(crc, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno));
> +	COMP_CRC32C(crc, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno));
> +	COMP_CRC32C(crc, &num_logs, sizeof(num_logs));
> +	FIN_CRC32C(crc);
> +
> +	/* Write out the number of active logs + crc. */
> +	if ((write(fd, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno)) != sizeof(UndoLogShared->low_logno)) ||
> +		(write(fd, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno)) != sizeof(UndoLogShared->next_logno)) ||
> +		(write(fd, &num_logs, sizeof(num_logs)) != sizeof(num_logs)) ||
> +		(write(fd, &crc, sizeof(crc)) != sizeof(crc)))
> +		ereport(ERROR,
> +				(errcode_for_file_access(),
> +				 errmsg("could not write to file \"%s\": %m", path)));

I'd prefix it with some magic value. It provides a way to do version bumps if really necessary (or just provide an explicit version), makes it easier to distinguish proper checksum failures from zeroed-out files, and helps identify the files after FS corruption.

> +	/* Write out the meta data for all active undo logs. */
> +	data = (char *) serialized;
> +	INIT_CRC32C(crc);
> +	serialized_size = num_logs * sizeof(UndoLogMetaData);
> +	while (serialized_size > 0)
> +	{
> +		ssize_t		written;
> +
> +		written = write(fd, data, serialized_size);
> +		if (written < 0)
> +			ereport(ERROR,
> +					(errcode_for_file_access(),
> +					 errmsg("could not write to file \"%s\": %m", path)));
> +		COMP_CRC32C(crc, data, written);
> +		serialized_size -= written;
> +		data += written;
> +	}
> +	FIN_CRC32C(crc);
> +
> +	if (write(fd, &crc, sizeof(crc)) != sizeof(crc))
> +		ereport(ERROR,
> +				(errcode_for_file_access(),
> +				 errmsg("could not write to file \"%s\": %m", path)));

The number of small writes here makes me wonder if this shouldn't either use fopen()/fwrite() or a manual buffer.

> +	/* Flush file and directory entry. */
> +	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC);
> +	pg_fsync(fd);
> +	if (CloseTransientFile(fd) < 0)
> +		ereport(data_sync_elevel(ERROR),
> +				(errcode_for_file_access(),
> +				 errmsg("could not close file \"%s\": %m", path)));
> +	fsync_fname("pg_undo", true);
> +	pgstat_report_wait_end();

Is there a risk of crashing during this and leaving an incomplete file in place? Presumably not, because the checkpoint wouldn't exist?

> +/*
> + * Find the UndoLogSlot object for a given log number.
> + *
> + * The caller may or may not already hold UndoLogLock, and should indicate
> + * this by passing 'locked'.  We'll acquire it in the slow path if necessary.
> + * If it is not held by the caller, the caller must deal with the possibility
> + * that the returned UndoLogSlot no longer contains the requested logno by the
> + * time it is accessed.
> + *
> + * To do that, one of the following approaches must be taken by the calling
> + * code:
> + *
> + * 1. If the calling code knows that it is attached to this lock or is the

*this "log", not "lock", right?
> +static void
> +attach_undo_log(UndoPersistence persistence, Oid tablespace)
> +{
> +	UndoLogSlot *slot = NULL;
> +	UndoLogNumber logno;
> +	UndoLogNumber *place;
> +
> +	Assert(!InRecovery);
> +	Assert(CurrentSession->attached_undo_slots[persistence] == NULL);
> +
> +	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
> +
> +	/*
> +	 * For now we have a simple linked list of unattached undo logs for each
> +	 * persistence level.  We'll grovel though it to find something for the
> +	 * tablespace you asked for.  If you're not using multiple tablespaces
> +	 * it'll be able to pop one off the front.  We might need a hash table
> +	 * keyed by tablespace if this simple scheme turns out to be too slow when
> +	 * using many tablespaces and many undo logs, but that seems like an
> +	 * unusual use case not worth optimizing for.
> +	 */
> +	place = &UndoLogShared->free_lists[persistence];
> +	while (*place != InvalidUndoLogNumber)
> +	{
> +		UndoLogSlot *candidate = find_undo_log_slot(*place, true);
> +
> +		/*
> +		 * There should never be an undo log on the freelist that has been
> +		 * entirely discarded, or hasn't been created yet.  The persistence
> +		 * level should match the freelist.
> +		 */
> +		if (unlikely(candidate == NULL))
> +			elog(ERROR,
> +				 "corrupted undo log freelist, no such undo log %u", *place);
> +		if (unlikely(candidate->meta.persistence != persistence))
> +			elog(ERROR,
> +				 "corrupted undo log freelist, undo log %u with persistence %d found on freelist %d",
> +				 *place, candidate->meta.persistence, persistence);
> +
> +		if (candidate->meta.tablespace == tablespace)
> +		{
> +			logno = *place;
> +			slot = candidate;
> +			*place = candidate->next_free;
> +			break;
> +		}
> +		place = &candidate->next_free;
> +	}

I'd replace the linked list with ilist.h ones.

< more tomorrow >

Greetings,

Andres Freund
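Two quick sketches of the suggestions above. First, a magic value for the pg_undo checkpoint files; the names and the constant here are invented, and the header would be written first and included in the CRC:

	#define UNDO_CHECKPOINT_MAGIC	0x55304C47	/* arbitrary illustration value */
	#define UNDO_CHECKPOINT_VERSION	1

	typedef struct UndoCheckpointHeader
	{
		uint32		magic;		/* tells real files from zeroed/garbage ones */
		uint32		version;	/* explicit version makes format bumps cheap */
	} UndoCheckpointHeader;

Second, what the freelist scan might look like using ilist.h, assuming an slist_node member (here called freelist_node) is added to UndoLogSlot; a dlist would work equally well:

	#include "lib/ilist.h"

	/* In UndoLogSharedData: one freelist per persistence level. */
	slist_head	free_lists[UndoPersistenceLevels];

	/* Scanning for a slot in the requested tablespace. */
	slist_mutable_iter iter;

	slist_foreach_modify(iter, &UndoLogShared->free_lists[persistence])
	{
		UndoLogSlot *candidate = slist_container(UndoLogSlot, freelist_node,
												 iter.cur);

		if (candidate->meta.tablespace == tablespace)
		{
			slot = candidate;
			slist_delete_current(&iter);
			break;
		}
	}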
On Sun, Jul 28, 2019 at 9:38 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Interesting. One way to bring back posix_fallocate() without > upsetting people on some filesystem out there would be to turn the new > wal_init_zero GUC into a choice: write (current default, and current > behaviour for 'on'), pwrite_hole (write just the final byte, current > behaviour for 'off'), posix_fallocate (like that 2013 patch that was > reverted) and posix_fallocate_and_write (do both as you said, to try > to solve that problem you mentioned that led to the revert). > > I suppose there'd be a parallel GUC undo_init_zero. Or some more > general GUC for any fixed-sized preallocated files like that (for > example if someone were to decide to do the same for SLRU files > instead of growing them block-by-block), called something like > file_init_zero. I think it's pretty sane to have a GUC for how we extend files, but to me it seems like overkill to have one for every separate kind of file. It's not theoretically impossible that you could have the data and WAL on separate partitions on separate mount points with, consequently, separate needs, and the data (including undo) could be split among multiple tablespaces each of which uses a different filesystem. Probably, the right design would be a per-tablespace storage option plus an overall default that is always used for WAL. However, that strikes me as a lot of complexity for a pretty marginal use case: most people have a favorite filesystem and stick with it. And all of that seems like something a bit separate from coming up with a good undo framework. Why doesn't undo just do this like we do it elsewhere, and leave the question of changing the way we do extend-and-zero for another thread? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 19, 2019 at 7:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > If I'm not mistaken, you're tacitly assuming that you'll always be > using zheap, or something sufficiently similar to zheap. It'll > probably never be possible to UNDO changes to something like a GIN > index on a zheap table, because you can never do that with sensible > concurrency/deadlock behavior. I mean, essentially any well-designed framework intended for any sort of task whatsoever is going to have a design center where one can foresee that it will work well, and then as a result of working well for the thing for which it was designed, it will also work well for other things that are sufficiently similar. So, I think you're correct, but I also don't think that's really saying very much. The trick is to figure out whether and how the ideas you have could be generalized with reasonable effort to handle other cases, and that's easier with some projects than others. I think when it comes to UNDO, it's actually really hard. The system has some assumptions built into it which are probably required for good performance and reasonable complexity, and it's probably got other assumptions in it which are unnecessary and could be eliminated if we only realized that we were making those assumptions in the first place. The more involvement we get from people who aren't coming at this from the point of view of zheap, the more likely it is that we'll be able to find those assumptions and wipe them out before they get set in concrete. Unfortunately, we haven't had many takers so far -- thanks for chiming in. I don't really understand your comments about GIN. My simplistic understanding of GIN is that it's not very different from btree in this regard. Suppose we insert a row, and then the insert aborts; suppose also that the index wants to use UNDO. In the case of a btree index, we're going to go insert an index entry for the new row; upon abort, we should undo the index insertion by removing that index tuple or at least marking it dead. Unless a page split has happened, triggered either by the insertion itself or by subsequent activity, this puts the index in a state that is almost perfectly equivalent to where we were before the now-aborted transaction did any work. If a page split has occurred, trying to undo the index insertion is going to run into two problems. One, we probably can't undo the page split, so the index will be logically equivalent but not physically equivalent after we get rid of the new tuple. Two, if the page split happened after the insertion of the new tuple rather than at the same time, the index tuple may not be on the page where we left it. Possibly we can walk right (or left, or sideways, or diagonally at a 35 degree angle, my index-fu is not great here) and be sure of finding it, assuming the index is not corrupt. Now, my mental model of a GIN index is that you go find N>=0 index keys inside each value and do basically the same thing as you would for a btree index for each one of them. Therefore it seems to me, possibly stupidly, that you're going to have basically the same problems, except each problem will now potentially happen up to N times instead of up to 1 time. I assume here that in either case - GIN or btree - you would tentatively record where you left the tuple that now needs to be zapped and that you can jump to that place directly to try to zap it. 
Possibly those assumptions are bad and maybe that's where you're seeing a concurrency/deadlock problem; if so, a more detailed explanation would be very helpful. To me, based on my more limited knowledge of indexing, I'm not really seeing a concurrency/deadlock issue, but I do see that there's going to be a horrid efficiency problem if page splits are common.

Suppose for example that you bulk-load a bunch of rows into an indexed table in descending order according to the indexed column, with all the new values being larger than any existing values in that column. The insertion point basically doesn't change: you're always inserting after what was the original high value in the column, and that point is always on the same page, but that page is going to be repeatedly split, so that, at the end of the load, almost none of the newly-inserted rows are going to be on the page into which they were originally inserted. Now if you abort, you're going to either have to walk right a long way from the original insertion point to find each tuple, or re-find each tuple by traversing from the root of the tree instead of remembering where you left it. Doing the first for N tuples is O(N^2), and doing the second is O(N*H) where H is the height of the btree. The latter is almost like O(N) given the high fanout of a btree, but with a much higher constant factor than the remember-where-you-put-it strategy would be in cases where no splits have occurred. Neither seems very good. This seems to be a very general problem with making undo and indexes work nicely together: almost any index type has to sometimes move tuples around to different pages, which makes finding them a lot more expensive than re-finding a heap tuple.

I think that most of the above is a bit of a diversion from the original topic of the thread. I think I see the connection you're making between the two topics: the more likely undo application is to fail, the more worrying a hard limit is, and deadlocks are a way for undo application to fail, and if that's likely to be common when undo is applied to indexes, then undo failure will be common and a hard limit is bad. However, I think the solution to that problem is page-at-a-time undo: if a foreground process needs to modify a page with pending undo, and if the modification it wants to make can't be done sensibly unless the undo is applied first, it should be prepared to apply that undo itself - just for that page - rather than wait for somebody else to get it done. That's important not only for deadlock avoidance - though deadlock avoidance is certainly a legitimate concern - but also because the change might be part of some gigantic rollback that's going to take an hour, and waiting for the undo to hit all the other pages before it gets to this one will make users very sad.

Assuming page-at-a-time undo is possible for all undo-using AMs, which I believe to be more or less a requirement if you want to have something production-grade, I don't really see what common deadlock scenario could exist. Either we're talking about LWLocks -- in which case we've got a bug in the code -- or we're talking about heavyweight locks -- in which case we're dealing with a rare scenario where undo work is piling up behind strategically-acquired AELs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
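In rough pseudo-C, the page-at-a-time idea might look like the fragment below; every undo-related helper named here is hypothetical, used only to pin down the control flow:

	Buffer		buffer;
	Page		page;

	/* A foreground writer about to modify a page that has pending undo. */
	buffer = ReadBuffer(relation, blkno);
	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
	page = BufferGetPage(buffer);

	/*
	 * If an aborted transaction left unapplied undo on this page and our
	 * modification can't proceed sensibly until it's applied, apply it
	 * ourselves -- for this page only -- rather than waiting for a
	 * background worker to finish a possibly enormous rollback.
	 */
	while (PageHasUnappliedUndo(page))			/* hypothetical */
		ApplyUndoForPage(relation, buffer);		/* hypothetical */

	/* ... proceed with the original modification ... */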
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Mon, Jul 22, 2019 at 4:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I had a similar thought: you might regret that choice if you were
> wanting to implement an AM with lock table-based concurrency control
> (meaning that there are lock ordering concerns for row and page locks,
> for DML statements, not just DDL).  That seemed a bit too far fetched
> to mention before, but are you saying the same sort of concerns might
> come up with indexes that support true undo (as opposed to indexes
> that still need VACUUM)?

Yes. It doesn't really make any difference with B-Trees, because the locks there are very similar to row locks (you still need forwarding UNDO metadata in index pages, probably for checking the visibility of index tuples that have their ghost bit set). But when you need to undo changes to an index with coarse-grained index tuples (e.g. in a GIN index), the transaction needs to roll back the index tuple as a whole, necessitating that locks be held.

Heap TIDs need to be completely stable to avoid a VACUUM-like mechanism -- you cannot just create a new HOT chain. You even have to be willing to store a single heap row across two heap pages in extreme cases where an UPDATE makes it impossible to fit a new row on the same heap page as the original -- this is called row forwarding. Once heap TIDs are guaranteed to be associated with a logical row for the lifetime of that row, and once you lock index entries, you're always able to cleanly undo the changes in the index (i.e. remove new tuples on abort). Then you have indexes that don't need VACUUMing, and that have cheap index-only scans.

> For comparison, ARIES[1] has no-deadlock rollbacks as a basic property
> and reacquires locks during restart before new transactions are allow
> to execute.  In its model, the locks in question can be on things like
> rows and pages.  We don't even use our lock table for those (except
> for non-blocking SIREAD locks, irrelevant here).

Right. ARIES has plenty to say about concurrency control, even though we often think of it as something that is only concerned with crash recovery. The undo phase is tied to how concurrency control works in general in ARIES. There is something called ARIES/KVL, and something else called ARIES/IM [1].

> After crash
> recovery, if zheap encounters a row with pending rollback from an
> aborted transaction, as usual it either needs to read an older version
> from an undo log (for reads) or help execute the rollback before
> updating (for writes).  That only requires page-at-a-time LWLocks
> ("latching"), so it's deadlock-free.  The only deadlock risk comes
> from the need to acquire heavyweight locks on relations which
> typically only conflict when you run DDL, so yeah, it's tempting to
> worry a lot less about those than the fine grained lock traffic from
> DML statements that DB2 and others have to deal with.

I think that DB2 index deletes are synchronous, and immediately remove space from a leaf page. Rollbacks will re-insert the deleted tuple. Systems that use a limited form of MVCC based on 2PL [2] set a ghost bit instead of physically removing the tuple immediately. But I don't think that that's actually very different to the DB2 classic 2PL approach, since there is forwarding undo information that makes it possible to reclaim tuples with the ghost bit set at the earliest possible opportunity. And because you can immediately do an in-place update of an index tuple's heap TID in the case of unique indexes, which can be optimized as a special case.
Queries like "UPDATE tab set tab_pk = tab_pk + 1" work per the SQL standard (no duplicate violation), and don't even bloat the index, because the changes in the index can happen almost entirely in-place. > I might as well put the quote marks on now: "Perhaps we could > implement A later." I don't claim to have any real answers here. I don't claim to understand how much of a problem this is. [1] https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf [2] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf -- See "6.7 Standard Practice" -- Peter Geoghegan
On Tue, Jul 23, 2019 at 10:42 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I think, even though there might not be a correctness issue with the
> current code as it stands, we should still use a local variable.
> Updating MyUndoWorker is a big side-effect, which the caller is not
> supposed to be aware of, because all that function should do is just
> get the slot info.

Absolutely right. It's just routine good practice to avoid using global variables when there is no compelling reason to do otherwise. The reason you state here is one of several good ones.

> Yes, I also think that the function would error out only because of
> can't-happen cases, like "too many locks taken" or "out of binary heap
> slots" or "out of memory" (this last one is not such a can't happen
> case). These cases happen probably due to some bugs, I suppose. But I
> was wondering : Generally when the code errors out with such
> can't-happen elog() calls, worst thing that happens is that the
> transaction gets aborted. Whereas, in this case, the worst thing that
> could happen is : the undo action would never get executed, which
> means selects for this tuple will keep on accessing the undo log ?
> This does not sound like any data consistency issue, so we should be
> fine after all ?

I don't think so. Every XID present in undo has to be something we can look up in CLOG to figure out which transactions are aborted and which transactions are committed, so that we know which transactions need undo. If we forget to undo the transaction, we can't discard it, which means we can't advance the CLOG transaction horizon, which means we'll eventually start failing to assign XIDs, leading to a refusal of all write transactions. Oops.

More generally, it's not OK for the generic undo layer to make assumptions about whether the operations performed by the undo handlers are essential or not. We don't want to impose a design constraint that undo can only be used for things that are not actually critical, because that will make it hard to write AMs that use it. And there's no reason to live with such a design constraint anyway, because, as noted above, CLOG truncation requires it.

More generally still, some can't-happen situations should be checked via Assert() and others via elog(). For example, consider some code that looks up a syscache tuple and pulls data from the returned tuple. If the code that handles DDL is written in such a way that the tuple should always exist, then this is a can't-happen situation, but generally the code checks this via elog(), not Assert(), because it could also happen due to the catalog contents being corrupted. If Assert() were used, the checks would not run in production builds, and a corrupt catalog would lead to a seg fault. An elog() is much friendlier. As a general principle, when a certain thing ought to always be true, but it being true depends on a whole lot of assumptions elsewhere in the code, and especially if it also depends on assumptions like "the database is not corrupted," I think elog() is preferable. Assert() is better for things that are more localized and that really can't go wrong for any reason other than a bug. In this case, I think I would tend towards elog(PANIC), but it's arguable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
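For illustration, the syscache pattern described above is the standard PostgreSQL idiom:

	/*
	 * elog(), not Assert(): the tuple "can't" be missing, but that relies
	 * on non-local assumptions (how DDL behaves, catalogs not being
	 * corrupted), so a clean error beats a segfault in production builds.
	 */
	tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
	if (!HeapTupleIsValid(tuple))
		elog(ERROR, "cache lookup failed for relation %u", relid);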
On Mon, Jul 29, 2019 at 2:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > Yes. It doesn't really make any difference with B-Trees, because the > locks there are very similar to row locks (you still need forwarding > UNDO metadata in index pages, probably for checking the visibility of > index tuples that have their ghost bit set). But when you need to undo > changes to an indexes with coarse grained index tuples (e.g. in a GIN > index), the transaction needs to roll back the index tuple as a whole, > necessitating that locks be held. Heap TIDs need to be completely > stable to avoid a VACUUM-like mechanism -- you cannot just create a > new HOT chain. You even have to be willing to store a single heap row > across two heap pages in extreme cases where an UPDATE makes it > impossible to fit a new row on the same heap page as the original -- > this is called row forwarding. I find this hard to believe, because an UPDATE can always be broken up into a DELETE and an INSERT. If that were to be done, you would not have a stable heap TID and you would have a "new HOT chain," or your AM's equivalent of that concept. So if we can't handle an UPDATE that changes the TID, then we also can't handle a DELETE + INSERT. But surely handling that case is a hard requirement for any AM. Sorry if I'm being dense here, but I feel like you're making some assumptions that I'm not quite following. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Mon, Jul 29, 2019 at 9:35 AM Robert Haas <robertmhaas@gmail.com> wrote: > I mean, essentially any well-designed framework intended for any sort > of task whatsoever is going to have a design center where one can > foresee that it will work well, and then as a result of working well > for the thing for which it was designed, it will also work well for > other things that are sufficiently similar. So, I think you're > correct, but I also don't think that's really saying very much. I agree that it's quite unclear how important this is. I don't necessarily think it matters if zheap doesn't do that well with GIN indexes. I think it's probably going to be useful to imagine how GIN indexing might work for zheap because it clarifies the strengths and weaknesses of your design. It's perfectly fine for there to be weaknesses, provided that they are well understood. > The > trick is to figure out whether and how the ideas you have could be > generalized with reasonable effort to handle other cases, and that's > easier with some projects than others. I think when it comes to UNDO, > it's actually really hard. I agree. > Unfortunately, we haven't had many takers so far -- thanks for chiming > in. I don't have the ability to express my general concerns here in a very crisp way. This is complicated stuff. Thanks for tolerating the hand-wavy nature of my feedback about this. > I don't really understand your comments about GIN. My simplistic > understanding of GIN is that it's not very different from btree in > this regard. GIN is quite similar to btree from a Postgres point of view -- GIN is simply a btree that is good at storing duplicates (and has higher level infrastructure to make things like FTS work). So I'd say that your understanding is fairly complete, at least as far as traditional Postgres goes. But if we imagine a system in which we have to roll back in indexes, it's quite a different story. See my remarks to Thomas just now about that. > Suppose we insert a row, and then the insert aborts; > suppose also that the index wants to use UNDO. In the case of a btree > index, we're going to go insert an index entry for the new row; upon > abort, we should undo the index insertion by removing that index tuple > or at least marking it dead. Unless a page split has happened, > triggered either by the insertion itself or by subsequent activity, > this puts the index in a state that is almost perfectly equivalent to > where we were before the now-aborted transaction did any work. If a > page split has occurred, trying to undo the index insertion is going > to run into two problems. One, we probably can't undo the page split, > so the index will be logically equivalent but not physically > equivalent after we get rid of the new tuple. Two, if the page split > happened after the insertion of the new tuple rather than at the same > time, the index tuple may not be on the page where we left it. Actually, page splits are the archetypal case where undo cannot restore the original physical state. In general, we cannot expect the undo process to reverse page splits. Undo might be able to merge the pages together, but it also might not be able to. It won't be terribly different to the situation with deletes where the transaction commits, most likely. Some other systems have something called "system transactions" for things like page splits. They don't need to have their commit record flushed synchronously, and occur in the foreground of the xact that needs to split the page. 
That way, rollback doesn't have to concern itself with rolling back things that are pretty much impossible to roll back, like page splits.

> Now, my mental model of a GIN index is that you go find N>=0 index
> keys inside each value and do basically the same thing as you would
> for a btree index for each one of them. Therefore it seems to me,
> possibly stupidly, that you're going to have basically the same
> problems, except each problem will now potentially happen up to N
> times instead of up to 1 time. I assume here that in either case -
> GIN or btree - you would tentatively record where you left the tuple
> that now needs to be zapped and that you can jump to that place
> directly to try to zap it. Possibly those assumptions are bad and
> maybe that's where you're seeing a concurrency/deadlock problem; if
> so, a more detailed explanation would be very helpful.

Imagine a world in which zheap cannot just create a new TID (or HOT chain) for the same logical tuple, which is something that I believe should be an important goal for zheap (again, see my remarks to Thomas). Simplicity for rollbacks in access methods like GIN demands that you lock the entire index tuple, which may point to hundreds of logical rows (or TIDs, since they have a 1:1 correspondence with logical rows in this imaginary world). Rolling back with more granular locking seems very hard for the same reason that rolling back a page split would be very hard -- you cannot possibly have enough bookkeeping information to make that work in a sane way in the face of concurrent insertions that may also commit or abort unpredictably. It seems necessary to bake concurrency control into rollback at the index access method level in order to get significant benefits from a design like zheap.

Now, maybe zheap should be permitted to not work particularly well with GIN, while teaching btree to take advantage of the common case where we can roll everything back, even in indexes (so zheap behaves much more like heapam when you have a GIN index, which is hopefully not that common). That could be a perfectly reasonable restriction. But ISTM that you need to make heap TIDs completely stable for the case that zheap is expected to excel at. You also need to teach nbtree to take advantage of this by rolling back if and when it's safe to do so (when we know that heap TIDs are stable for the indexed table).

In general, the only way that rolling back changes to indexes can work is by making heap TIDs completely stable. Any design for rollback in nbtree that allows there to be multiple entries for the same logical row in the index seems like a disaster to me. Are you really going to put forwarding information in the index that mirrors what has happened in the table?

> To me, based on my more limited knowledge of indexing, I'm not really
> seeing a concurrency/deadlock issue, but I do see that there's going
> to be a horrid efficiency problem if page splits are common.

I'm not worried about rolling back page splits. That seems to present us with exactly the same issues as rolling back in GIN indexes reliably (i.e. problems that are practically impossible to solve, or at least don't seem worth solving).

> This seems to be a very
> general problem with making undo and indexes work nicely together:
> almost any index type has to sometimes move tuple around to different
> pages, which makes finding them a lot more expensive than re-finding a
> heap tuple.

Right. That's why undo is totally logical in indexes.
And it's why you cannot expect to roll back page splits. > I think that most of the above is a bit of a diversion from the > original topic of the thread. I think I see the connection you're > making between the two topics: the more likely undo application is to > fail, the more worrying a hard limit is, and deadlocks are a way for > undo application to fail, and if that's likely to be common when undo > is applied to indexes, then undo failure will be common and a hard > limit is bad. This is an awkward thing to discuss, because it involves so many interrelated moving parts. And because I know that I could easily miss quite a bit about the zheap design. Forgive me if I've hijacked the thread. > However, I think the solution to that problem is > page-at-a-time undo: if foreground process needs to modify a page with > pending undo, and if the modification it wants to make can't be done > sensibly unless the undo is applied first, it should be prepared to > apply that undo itself - just for that page - rather than wait for > somebody else to get it done. That's important not only for deadlock > avoidance - though deadlock avoidance is certainly a legitimate > concern - but also because the change might be part of some gigantic > rollback that's going to take an hour, and waiting for the undo to hit > all the other pages before it gets to this one will make users very > sad. It's something that users in certain other systems (though certainly not all other systems) have had to live with for some time. SQL Server 2019 has something called "instantaneous transaction rollback", which seems to make SQL Server optionally behave a lot more like Postgres [1], apparently with many of the same disadvantages as Postgres. I agree that there is probably a middle way that more or less has the advantages of both approaches. I don't really know what that should look like, though. [1] https://www.microsoft.com/en-us/research/uploads/prod/2019/06/p700-antonopoulos.pdf -- Peter Geoghegan
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Mon, Jul 29, 2019 at 12:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I find this hard to believe, because an UPDATE can always be broken up
> into a DELETE and an INSERT.  If that were to be done, you would not
> have a stable heap TID and you would have a "new HOT chain," or your
> AM's equivalent of that concept.  So if we can't handle an UPDATE that
> changes the TID, then we also can't handle a DELETE + INSERT.  But
> surely handling that case is a hard requirement for any AM.

I'm not saying you can't handle it. But that necessitates "write amplification", in the sense that you must now create new index tuples even for indexes where the indexed columns were not logically altered. Isn't zheap supposed to fix that problem, at least in version 2 or version 3? I also think that stable heap TIDs make index-only scans a lot easier and more effective.

I think that indexes (or at least B-Tree indexes) will ideally almost always have tuples that are the latest versions with zheap. The exception is tuples whose ghost bit is set, whose visibility varies based on the MVCC snapshot in use. But the instant that the deleting/updating xact commits it becomes legal to recycle the old heap TID. We don't need to go back to the index to permanently zap the tuple whose ghost bit we already set, because there is an undo pointer in the same leaf page, so nobody is in danger of getting confused and following the now-recycled heap TID.

This ghost bit design owes plenty to 2PL (which will fully remove the index tuple synchronously, rather than just setting a ghost bit). You could say that it's a 2PL/MVCC hybrid, while classic Postgres is "pure" MVCC because it uses explicit row versioning -- it doesn't need to impose restrictions on TID stability. Which seems to be why we offer such a large variety of index access methods -- it's relatively straightforward for Postgres to add niche index AMs, such as SP-GiST.

--
Peter Geoghegan
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Mon, Jul 29, 2019 at 12:39 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that indexes (or at least B-Tree indexes) will ideally almost > always have tuples that are the latest versions with zheap. The > exception is tuples whose ghost bit is set, whose visibility varies > based on the MVCC snapshot in use. But the instant that the > deleting/updating xact commits it becomes legal to recycle the old > heap TID. Sorry, I meant the instant the ghost bit index tuple cannot be visible to any possible MVCC snapshot. Which, in general, will be pretty soon after the deleting/updating xact commits. -- Peter Geoghegan
On Mon, Jul 29, 2019 at 3:39 PM Peter Geoghegan <pg@bowt.ie> wrote: > I'm not saying you can't handle it. But that necessitates "write > amplification", in the sense that you must now create new index tuples > even for indexes where the indexed columns were not logically altered. > Isn't zheap supposed to fix that problem, at least at in version 2 or > version 3? I also think that stable heap TIDs make index-only scans a > lot easier and more effective. I think there's a cost-benefit analysis here. You're completely correct that inserting new index tuples causes write amplification and, yeah, that's bad. On the other hand, row forwarding has its own costs. If a row ends up persistently moved to someplace else, then every subsequent access to that row has an extra level of indirection. If it ends up split between two places, every read of that row incurs two reads. The "someplace else" where moved rows or ends of split rows are stored has to be skipped by sequential scans, which is complex and possibly inefficient if it breaks up a sequential I/O pattern. Those things are bad, too. It's a little difficult to compare the kinds of badness. My thought is that in the short run, the redirect strategy probably wins, because there could be and likely are a bunch of indexes and it's cheaper to just insert one redirect. But in the long term, the redirect thing seems like a loser, because you have to keep following it. That (perhaps naive) analysis is why zheap doesn't try to maintain TID stability. Instead it wants to do in-place updates (no new TID) as often as possible, but the fallback strategy is simply to do a non-in-place update (new TID) rather than a redirect. > I think that indexes (or at least B-Tree indexes) will ideally almost > always have tuples that are the latest versions with zheap. The > exception is tuples whose ghost bit is set, whose visibility varies > based on the MVCC snapshot in use. But the instant that the > deleting/updating xact commits it becomes legal to recycle the old > heap TID. We don't need to go back to the index to permanently zap the > tuple whose ghost bit we already set, because there is an undo pointer > in the same leaf page, so nobody is in danger of getting confused and > following the now-recycled heap TID. I haven't run across the "ghost bit" terminology before. Is there a good place to read about the technique you're assuming here? A major question is how you handle inserted rows, that are new now and thus not yet visible to everyone, but which will later become all-visible. One idea is: if the undo pointer is new enough that a write transaction which modified the page could still be in-flight, check the undo log to ascertain visibility of index tuples. If not, then any potentially-deleted index tuples are in fact deleted, and any others are all-visible. With this design, you don't set the ghost bit on new tuples, but are still able to stop following the undo pointers for them after a while. To put that another way, there seems to be pretty clearly a need for a bit, but what does the bit mean? It could mean "please check the undo log," in which case it'd have to be set on insert, eventually cleared, and then reset on delete, but I think that's likely to suck. I think therefore that the bit should mean is-deleted-but-not-necessarily-all-visible-yet, which avoids that problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
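A sketch of the visibility rule Robert lands on here, with the bit meaning is-deleted-but-not-necessarily-all-visible-yet; every name in this fragment is hypothetical, used only to pin down the logic:

	/* Hypothetical index-tuple visibility test under this scheme. */
	if (IndexTupleIsGhost(itup))
	{
		/* Deleted, but the deleter may not be all-visible yet: ask undo. */
		return UndoCheckVisibility(page, itup, snapshot);
	}

	/*
	 * Not ghost, so the tuple is inserted.  If the page's undo pointer is
	 * recent enough that the inserting transaction could still be in
	 * flight, consult the undo log; otherwise the tuple is all-visible
	 * and the undo pointer need never be followed for it again.
	 */
	if (PageUndoPtrIsRecent(page))
		return UndoCheckVisibility(page, itup, snapshot);

	return true;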
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Mon, Jul 29, 2019 at 1:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think there's a cost-benefit analysis here. You're completely
> correct that inserting new index tuples causes write amplification
> and, yeah, that's bad. On the other hand, row forwarding has its own
> costs. If a row ends up persistently moved to someplace else, then
> every subsequent access to that row has an extra level of indirection.

The devil is in the details. It doesn't seem that optimistic to assume that a good implementation could practically always avoid it, by being clever about heap fillfactor. It can work a bit like external TOAST pointers. The oversized datums can go on the other heap page, which will presumably not be in the SELECT list of most queries. It won't be one of the indexed columns in typical cases, so index scans will generally only have to visit one heap page.

It occurs to me that the zheap design is still sensitive to heap fillfactor in much the same way as it would be with reliably-stable TIDs, combined with some amount of row forwarding. It's not essential for correctness that you avoid creating a new HOT chain (or whatever it's called in zheap) with new index tuples, but it is still quite preferable on performance grounds. It's still worth going to a lot of work to avoid having that happen, such as using external TOAST pointers with some of the larger datums on the existing heap page.

> If it ends up split between two places, every read of that row incurs
> two reads. The "someplace else" where moved rows or ends of split rows
> are stored has to be skipped by sequential scans, which is complex and
> possibly inefficient if it breaks up a sequential I/O pattern. Those
> things are bad, too.
>
> It's a little difficult to compare the kinds of badness.

I would say that it's extremely difficult. I'm not going to speculate about how the two approaches might compare today.

> I haven't run across the "ghost bit" terminology before. Is there a
> good place to read about the technique you're assuming here?

"5.2 Key Range Locking and Ghost Records" from "A Survey of B-Tree Locking Techniques" seems like a good place to start. As I said earlier, the paper is available from: https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf

This description won't define the term ghost record/bit in a precise way that you can just adopt, since the details will vary somewhat based on considerations like whether or not MVCC is used. But you'll get the general idea from the paper, I think.

> To put that another way, there seems to be pretty clearly a need for a
> bit, but what does the bit mean? It could mean "please check the undo
> log," in which case it'd have to be set on insert, eventually cleared,
> and then reset on delete, but I think that's likely to suck. I think
> therefore that the bit should mean
> is-deleted-but-not-necessarily-all-visible-yet, which avoids that
> problem.

That sounds about right to me.

--
Peter Geoghegan
On Tue, Jul 30, 2019 at 7:12 AM Peter Geoghegan <pg@bowt.ie> wrote: > SQL Server 2019 has something called "instantaneous transaction > rollback", which seems to make SQL Server optionally behave a lot more > like Postgres [1], apparently with many of the same disadvantages as > Postgres. I agree that there is probably a middle way that more or > less has the advantages of both approaches. I don't really know what > that should look like, though. > > [1] https://www.microsoft.com/en-us/research/uploads/prod/2019/06/p700-antonopoulos.pdf Thanks for sharing that. I see they're giving that paper at VLDB next month in LA... I hope the talk video will be published on the web. While we've been working on a hybrid vacuum/undo design, they've built a hybrid undo/vacuum system. I've only skimmed this, but one of their concerns that caught my eye is log volume in the presence of long running transactions ("3.6 Aggressive Log Truncation"). IIUC they have only a single log for both redo and undo, so a long running transaction requires them to keep all log data around as long as it might be needed for that transaction, in traditional SQL Server. That's basically the flip side of the problem we're trying to solve, in-heap bloat. I think we might have a different solution to that problem, with our finer grained undo logs. Our undo data is not mixed in with redo data (though redo can recreate it, it's not needed after that), and we have multiple undo logs with their own discard pointers, so a long running transaction prevents only one undo log from being truncated, while other undo logs holding other transactions can be truncated as soon as those transactions are committed/rolled back and are either all visible (admittedly tracked with a system-wide xmin approach for now, but could probably be made more granular) or a snapshot-too-old threshold is reached (not implemented yet). -- Thomas Munro https://enterprisedb.com
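PS: For anyone skimming, the property doing the work in that last paragraph is simply that each undo log has its own tail; schematically (a simplification of the real per-log meta-data, illustrative only):

    typedef struct UndoLogTailState
    {
        UndoRecPtr discard;    /* oldest record someone might still need */
        UndoRecPtr insert;     /* where the next record will be written */
    } UndoLogTailState;

A long running transaction holds back 'discard' only in the log it is writing to; every other log keeps advancing its own.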
Hi, I realize that this might not be the absolutely newest version of the undo storage part of this patchset - but I'm trying to understand the whole context, and that's hard without reading through the whole stack in a situation where the layers actually fit together. On 2019-07-29 01:48:30 -0700, Andres Freund wrote: > < more tomorrow > > + /* Move the high log number pointer past this one. */ > + ++UndoLogShared->next_logno; Fwiw, I find having "next" and "low" as variable names, and then describing "next" as high in comments somewhat confusing. > +/* check_hook: validate new undo_tablespaces */ > +bool > +check_undo_tablespaces(char **newval, void **extra, GucSource source) > +{ > + char *rawname; > + List *namelist; > + > + /* Need a modifiable copy of string */ > + rawname = pstrdup(*newval); > + > + /* > + * Parse string into list of identifiers, just to check for > + * well-formedness (unfortunateley we can't validate the names in the > + * catalog yet). > + */ > + if (!SplitIdentifierString(rawname, ',', &namelist)) > + { > + /* syntax error in name list */ > + GUC_check_errdetail("List syntax is invalid."); > + pfree(rawname); > + list_free(namelist); > + return false; > + } Why can't you validate the catalog here? In a lot of cases this will be called in a transaction, especially when changing it in a session. E.g. temp_tablespaces does so? > + /* > + * Make sure we aren't already in a transaction that has been assigned an > + * XID. This ensures we don't detach from an undo log that we might have > + * started writing undo data into for this transaction. > + */ > + if (GetTopTransactionIdIfAny() != InvalidTransactionId) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + (errmsg("undo_tablespaces cannot be changed while a transaction is in progress")))); Hm. Is this really a great proxy? Seems like it'll block changing the tablespace unnecessarily in a lot of situations, and like there could even be holes in the future - it doesn't seem crazy that we'd want to emit undo without assigning an xid in some situations (e.g. for deleting files in error cases, or for more aggressive cleanup of dead index entries during reads or such). It seems like it'd be pretty easy to just check CurrentSession->attached_undo_slots[i].slot->meta.unlogged.this_xact_start or such? > +static bool > +choose_undo_tablespace(bool force_detach, Oid *tablespace) > +{ > + else > + { > + /* > + * Choose an OID using our pid, so that if several backends have the > + * same multi-tablespace setting they'll spread out. We could easily > + * do better than this if more serious load balancing is judged > + * useful. > + */ We're not really choosing an oid, we're choosing a tablespace. Obviously one can understand it as is, but it confused me for a second. > + int index = MyProcPid % length; Hm. Is MyProcPid a good proxy here? Wouldn't it be better to use MyProc->pgprocno or such? That's much more guaranteed to space out somewhat evenly? > + int first_index = index; > + Oid oid = InvalidOid; > + > + /* > + * Take the tablespace create/drop lock while we look the name up. > + * This prevents the tablespace from being dropped while we're trying > + * to resolve the name, or while the called is trying to create an > + * undo log in it. The caller will have to release this lock. > + */ > + LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE); Why exclusive?
I think any function that acquires a lock it doesn't release (or the reverse) ought to have a big honking comment in its header warning of that. And an explanation as to why that is. > + for (;;) > + { > + const char *name = list_nth(namelist, index); > + > + oid = get_tablespace_oid(name, true); > + if (oid == InvalidOid) > + { > + /* Unknown tablespace, try the next one. */ > + index = (index + 1) % length; > + /* > + * But if we've tried them all, it's time to complain. We'll > + * arbitrarily complain about the last one we tried in the > + * error message. > + */ > + if (index == first_index) > + ereport(ERROR, > + (errcode(ERRCODE_UNDEFINED_OBJECT), > + errmsg("tablespace \"%s\" does not exist", name), > + errhint("Create the tablespace or set undo_tablespaces to a valid or empty list."))); > + continue; Wouldn't it be better to simply include undo_tablespaces in the error messages? Something roughly like 'none of the tablespaces in undo_tablespaces = \"%s\" exists"? > + /* > + * If we came here because the user changed undo_tablesaces, then detach > + * from any undo logs we happen to be attached to. > + */ > + if (force_detach) > + { > + for (i = 0; i < UndoPersistenceLevels; ++i) > + { > + UndoLogSlot *slot = CurrentSession->attached_undo_slots[i]; > + > + if (slot != NULL) > + { > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + slot->pid = InvalidPid; > + slot->meta.unlogged.xid = InvalidTransactionId; > + LWLockRelease(&slot->mutex); Would it make sense to re-assert here that the current transaction didn't write undo? > +bool > +DropUndoLogsInTablespace(Oid tablespace) > +{ > + DIR *dir; > + char undo_path[MAXPGPATH]; > + UndoLogSlot *slot = NULL; > + int i; > + > + Assert(LWLockHeldByMe(TablespaceCreateLock)); IMO this ought to be mentioned in a function header comment. > + /* First, try to kick everyone off any undo logs in this tablespace. */ > + while ((slot = UndoLogNextSlot(slot))) > + { > + bool ok; > + bool return_to_freelist = false; > + > + /* Skip undo logs in other tablespaces. */ > + if (slot->meta.tablespace != tablespace) > + continue; > + > + /* Check if this undo log can be forcibly detached. */ > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + if (slot->meta.discard == slot->meta.unlogged.insert && > + (slot->meta.unlogged.xid == InvalidTransactionId || > + !TransactionIdIsInProgress(slot->meta.unlogged.xid))) > + { Not everyone will agree, but this looks complicated enough that I'd put it just in a simple wrapper function. If this were if (CanDetachUndoForcibly(slot)) you'd not need a comment either... Also, isn't the slot->meta.discard == slot->meta.unlogged.insert a separate concern from detaching? My understanding is that it'll be perfectly normal to have undo logs with undiscarded data that nobody is attached to? In fact, I got confused below, because I initially didn't spot any place that implemented the check referenced in the caller: > + * Drop the undo logs in this tablespace. This will fail (without > + * dropping anything) if there are undo logs that we can't afford to drop > + * because they contain non-discarded data or a transaction is in > + * progress. Since we hold TablespaceCreateLock, no other session will be > + * able to attach to an undo log in this tablespace (or any tablespace > + * except default) concurrently. 
> + */ > + if (!DropUndoLogsInTablespace(tablespaceoid)) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("tablespace \"%s\" cannot be dropped because it contains non-empty undo logs", > + tablespacename))); > + > + /* > + else > + { > + /* > + * There is data we need in this undo log. We can't force it to > + * be detached. > + */ > + ok = false; > + } Seems like we ought to return more information here. An error message like: > /* > + * Drop the undo logs in this tablespace. This will fail (without > + * dropping anything) if there are undo logs that we can't afford to drop > + * because they contain non-discarded data or a transaction is in > + * progress. Since we hold TablespaceCreateLock, no other session will be > + * able to attach to an undo log in this tablespace (or any tablespace > + * except default) concurrently. > + */ > + if (!DropUndoLogsInTablespace(tablespaceoid)) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("tablespace \"%s\" cannot be dropped because it contains non-empty undo logs", > + tablespacename))); doesn't really allow a DBA to do anything about the issue. Seems we ought to at least include the pid in the error message? I'd perhaps just move the error message from DropTableSpace() into DropUndoLogsInTablespace(). I don't think that's worse from a layering perspective, and allows to raise a more precise error, and simplifies the API. > + /* > + * Put this undo log back on the appropriate free-list. No one can > + * attach to it while we hold TablespaceCreateLock, but if we return > + * earlier in a future go around this loop, we need the undo log to > + * remain usable. We'll remove all appropriate logs from the > + * free-lists in a separate step below. > + */ > + if (return_to_freelist) > + { > + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); > + slot->next_free = UndoLogShared->free_lists[slot->meta.persistence]; > + UndoLogShared->free_lists[slot->meta.persistence] = slot->logno; > + LWLockRelease(UndoLogLock); > + } There's multiple places that put logs onto the freelist. I'd put them into one small function. Not primarily because it'll be easier to read, but because it makes it easier to search for places that do so. > + /* > + * We detached all backends from undo logs in this tablespace, and no one > + * can attach to any non-default-tablespace undo logs while we hold > + * TablespaceCreateLock. We can now drop the undo logs. > + */ > + slot = NULL; > + while ((slot = UndoLogNextSlot(slot))) > + { > + /* Skip undo logs in other tablespaces. */ > + if (slot->meta.tablespace != tablespace) > + continue; > + > + /* > + * Make sure no buffers remain. When that is done by > + * UndoLogDiscard(), the final page is left in shared_buffers because > + * it may contain data, or at least be needed again very soon. Here > + * we need to drop even that page from the buffer pool. > + */ > + forget_undo_buffers(slot->logno, slot->meta.discard, slot->meta.discard, true); > + > + /* > + * TODO: For now we drop the undo log, meaning that it will never be > + * used again. That wastes the rest of its address space. Instead, > + * we should put it onto a special list of 'offline' undo logs, ready > + * to be reactivated in some other tablespace. Then we can keep the > + * unused portion of its address space. 
> + */ > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + slot->meta.status = UNDO_LOG_STATUS_DISCARDED; > + LWLockRelease(&slot->mutex); > + } Before I looked up forget_undo_buffers()'s implementation I wrote: Hm. Iterating through shared buffers several times, especially when there possibly could be a good sized number of undo logs, seems a bit superfluous. This probably isn't going to be that frequently used in practice, so it's perhaps ok. But it seems like this might be used when things are bad (i.e. there's a lot of UNDO). But I still wonder about that. Especially when there's a lot of UNDO (most of it not in shared buffers), this could end up doing a *crapton* of buffer lookups. I'm inclined to think that this case - probably in contrast to the discard case - would be better served using DropRelFileNodeBuffers(). > + /* Remove all dropped undo logs from the free-lists. */ > + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); > + for (i = 0; i < UndoPersistenceLevels; ++i) > + { > + UndoLogSlot *slot; > + UndoLogNumber *place; > + > + place = &UndoLogShared->free_lists[i]; > + while (*place != InvalidUndoLogNumber) > + { > + slot = find_undo_log_slot(*place, true); > + if (!slot) > + elog(ERROR, > + "corrupted undo log freelist, unknown log %u", *place); > + if (slot->meta.status == UNDO_LOG_STATUS_DISCARDED) > + *place = slot->next_free; > + else > + place = &slot->next_free; > + } > + } > + LWLockRelease(UndoLogLock); Hm, shouldn't this check that the log is actually in the being-dropped tablespace? > +void > +ResetUndoLogs(UndoPersistence persistence) > +{ This imo ought to explain why one would want/need to do that. As far as I can tell this implementation for example wouldn't be correct in all that many situations, because it e.g. doesn't drop the relevant buffers? Seems like this would need to assert that persistence isn't PERMANENT? This is made more "problematic" by the fact that there's no caller for this in this commit, only being used much later in the series. But I think the comment should be there anyway. Hard to review (and understand) otherwise. Why is it correct not to take any locks here? The caller in 0014 afaict is when we're already in hot standby, which means people will possibly read undo? > + UndoLogSlot *slot = NULL; > + > + while ((slot = UndoLogNextSlot(slot))) > + { > + DIR *dir; > + struct dirent *de; > + char undo_path[MAXPGPATH]; > + char segment_prefix[MAXPGPATH]; > + size_t segment_prefix_size; > + > + if (slot->meta.persistence != persistence) > + continue; > + > + /* Scan the directory for files belonging to this undo log. */ > + snprintf(segment_prefix, sizeof(segment_prefix), "%06X.", slot->logno); > + segment_prefix_size = strlen(segment_prefix); > + UndoLogDirectory(slot->meta.tablespace, undo_path); > + dir = AllocateDir(undo_path); > + if (dir == NULL) > + continue; > + while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL) > + { > + char segment_path[MAXPGPATH]; > + > + if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0) > + continue; I'm perfectly fine with using MAXPGPATH buffers. But I do find it confusing that in some places you're using dynamic allocations (in some cases quite repeatedly, like in allocate_empty_undo_segment()), but here you don't? Hm, isn't this kinda O(#slot*#total_size_of_undo) due to going over the whole tablespace for each log?
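E.g. something very roughly like this (untested, error handling elided) would visit each undo directory only once, by mapping segment file names back to their log instead of re-scanning the directory per slot:

    while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
    {
        unsigned int logno;
        UndoLogSlot *slot;

        /* segment files start with the log number in hex */
        if (sscanf(de->d_name, "%6X.", &logno) != 1)
            continue;
        slot = find_undo_log_slot(logno, false);
        if (slot == NULL || slot->meta.persistence != persistence)
            continue;
        /* ... unlink the segment as before ... */
    }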
> + snprintf(segment_path, sizeof(segment_path), "%s/%s", > + undo_path, de->d_name); > + elog(DEBUG1, "unlinked undo segment \"%s\"", segment_path); > + if (unlink(segment_path) < 0) > + elog(LOG, "couldn't unlink file \"%s\": %m", segment_path); > + } > + FreeDir(dir); I think the LOG should be done alternatively to the DEBUG1, otherwise it's going to be confusing. Should this really only be a LOG? Going to be hard to clean up for a DBA later. > +Datum > +pg_stat_get_undo_logs(PG_FUNCTION_ARGS) > +{ > +#define PG_STAT_GET_UNDO_LOGS_COLS 9 > + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; > + TupleDesc tupdesc; > + Tuplestorestate *tupstore; > + MemoryContext per_query_ctx; > + MemoryContext oldcontext; > + char *tablespace_name = NULL; > + Oid last_tablespace = InvalidOid; > + int i; > + > + /* check to see if caller supports us returning a tuplestore */ > + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("set-valued function called in context that cannot accept a set"))); > + if (!(rsinfo->allowedModes & SFRM_Materialize)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("materialize mode required, but it is not " \ > + "allowed in this context"))); > + > + /* Build a tuple descriptor for our result type */ > + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) > + elog(ERROR, "return type must be a row type"); I wish we'd encapsulate this in one place instead of copying it over and over. Imo it's bad style to break error messages over multiple lines - it makes them harder to grep for. > + /* Scan all undo logs to build the results. */ > + for (i = 0; i < UndoLogShared->nslots; ++i) > + { > + UndoLogSlot *slot = &UndoLogShared->slots[i]; > + char buffer[17]; > + Datum values[PG_STAT_GET_UNDO_LOGS_COLS]; > + bool nulls[PG_STAT_GET_UNDO_LOGS_COLS] = { false }; > + Oid tablespace; Uncommented numbers like '17' for buffer lengths make me nervous. > + values[0] = ObjectIdGetDatum((Oid) slot->logno); > + values[1] = CStringGetTextDatum( > + slot->meta.persistence == UNDO_PERMANENT ? "permanent" : > + slot->meta.persistence == UNDO_UNLOGGED ? "unlogged" : > + slot->meta.persistence == UNDO_TEMP ? "temporary" : "<uknown>"); s/uknown/unknown/ > + tablespace = slot->meta.tablespace; > + > + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, > + MakeUndoRecPtr(slot->logno, slot->meta.discard)); > + values[3] = CStringGetTextDatum(buffer); > + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, > + MakeUndoRecPtr(slot->logno, slot->meta.unlogged.insert)); > + values[4] = CStringGetTextDatum(buffer); > + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, > + MakeUndoRecPtr(slot->logno, slot->meta.end)); > + values[5] = CStringGetTextDatum(buffer); Makes me wonder if we shouldn't have a type for undo pointers. > + if (slot->meta.unlogged.xid == InvalidTransactionId) > + nulls[6] = true; > + else > + values[6] = TransactionIdGetDatum(slot->meta.unlogged.xid); > + if (slot->pid == InvalidPid) > + nulls[7] = true; > + else > + values[7] = Int32GetDatum((int32) slot->pid); > + switch (slot->meta.status) > + { > + case UNDO_LOG_STATUS_ACTIVE: > + values[8] = CStringGetTextDatum("ACTIVE"); break; > + case UNDO_LOG_STATUS_FULL: > + values[8] = CStringGetTextDatum("FULL"); break; > + default: > + nulls[8] = true; > + } Don't think this'll survive pgindent. > + /* > + * Deal with potentially slow tablespace name lookup without the lock.
> + * Avoid making multiple calls to that expensive function for the > + * common case of repeating tablespace. > + */ > + if (tablespace != last_tablespace) > + { > + if (tablespace_name) > + pfree(tablespace_name); > + tablespace_name = get_tablespace_name(tablespace); > + last_tablespace = tablespace; > + } If we need to do this repeatedly, I think we ought to add a syscache for tablespace names. > + if (tablespace_name) > + { > + values[2] = CStringGetTextDatum(tablespace_name); > + nulls[2] = false; > + } > + else > + nulls[2] = true; > + > + tuplestore_putvalues(tupstore, tupdesc, values, nulls); Seems like a CHECK_FOR_INTERRUPTS() in this loop wouldn't hurt. > + } > + > + if (tablespace_name) > + pfree(tablespace_name); That seems a bit superfluous, given we're leaking plenty of other memory (which is perfectly fine). > +/* > + * replay the creation of a new undo log > + */ > +static void > +undolog_xlog_create(XLogReaderState *record) > +{ > + xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record); > + UndoLogSlot *slot; > + > + /* Create meta-data space in shared memory. */ > + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); > + > + /* TODO: assert that it doesn't exist already? */ > + slot = allocate_undo_log_slot(); Doesn't this need some error checking? allocate_undo_log_slot() will return NULL if there are no slots left. E.g. restarting a server with a lower max_connections could have one run into this easily? > +/* > + * Drop all buffers for the given undo log, from the old_discard to up > + * new_discard. If drop_tail is true, also drop the buffer that holds > + * new_discard; this is used when discarding undo logs completely, for example > + * via DROP TABLESPACE. If it is false, then the final buffer is not dropped > + * because it may contain data. > + * > + */ > +static void > +forget_undo_buffers(int logno, UndoLogOffset old_discard, > + UndoLogOffset new_discard, bool drop_tail) > +{ > + BlockNumber old_blockno; > + BlockNumber new_blockno; > + RelFileNode rnode; > + > + UndoRecPtrAssignRelFileNode(rnode, MakeUndoRecPtr(logno, old_discard)); > + old_blockno = old_discard / BLCKSZ; > + new_blockno = new_discard / BLCKSZ; > + if (drop_tail) > + ++new_blockno; > + while (old_blockno < new_blockno) > + { Hm. This'll be quite bad if you have a lot more undo than shared_buffers. Taking the partition lwlocks this many times will hurt. OTOH, scanning all of shared buffers every time we truncate a few hundred bytes of undo away is obviously also not going to work. > + ForgetBuffer(SMGR_UNDO, rnode, UndoLogForkNum, old_blockno); > + ForgetLocalBuffer(SMGR_UNDO, rnode, UndoLogForkNum, old_blockno++); This seems odd to me - why do we need to scan both? We ought to know which one is needed, right? > + } > +} > +/* > + * replay an undo segment discard record > + */ Missing newline between functions. > +static void > +undolog_xlog_discard(XLogReaderState *record) > +{ > + /* > + * We're about to discard undologs. In Hot Standby mode, ensure that > + * there's no queries running which need to get tuple from discarded undo. nitpick: s/undologs/undo logs/? I think most other comments split it? > + * XXX we are passing empty rnode to the conflict function so that it can > + * check conflict in all the backend regardless of which database the > + * backend is connected. > + */ > + if (InHotStandby && TransactionIdIsValid(xlrec->latestxid)) > + ResolveRecoveryConflictWithSnapshot(xlrec->latestxid, rnode); Hm.
Perhaps it'd be better to change ResolveRecoveryConflictWithSnapshot's API to just accept the database (OTOH, we perhaps ought to be more granular in conflict processing). Or just mention that it's ok to pass in an invalid rnode? > + /* > + * See if we need to unlink or rename any files, but don't consider it an > + * error if we find that files are missing. Since UndoLogDiscard() > + * performs filesystem operations before WAL logging or updating shmem > + * which could be checkpointed, a crash could have left files already > + * deleted, but we could replay WAL that expects the files to be there. > + */ Or we could have crashed/restarted during WAL replay and be processing the same WAL again. Not sure if that's worth mentioning. > + /* Unlink or rename segments that are no longer in range. */ > + while (old_segment_begin < new_segment_begin) > + { > + char discard_path[MAXPGPATH]; > + > + /* Tell the checkpointer that the file is going away. */ > + undofile_forget_sync(slot->logno, > + old_segment_begin / UndoLogSegmentSize, > + slot->meta.tablespace); > + > + UndoLogSegmentPath(xlrec->logno, old_segment_begin / UndoLogSegmentSize, > + slot->meta.tablespace, discard_path); > + > + /* Can we recycle the oldest segment? */ > + if (end < xlrec->end) > + { > + char recycle_path[MAXPGPATH]; > + > + UndoLogSegmentPath(xlrec->logno, end / UndoLogSegmentSize, > + slot->meta.tablespace, recycle_path); > + if (rename(discard_path, recycle_path) == 0) > + { > + elog(DEBUG1, "recycled undo segment \"%s\" -> \"%s\"", > + discard_path, recycle_path); > + end += UndoLogSegmentSize; > + } > + else > + { > + elog(LOG, "could not rename \"%s\" to \"%s\": %m", > + discard_path, recycle_path); > + } > + } > + else > + { > + if (unlink(discard_path) == 0) > + elog(DEBUG1, "unlinked undo segment \"%s\"", discard_path); > + else > + elog(LOG, "could not unlink \"%s\": %m", discard_path); > + } > + old_segment_begin += UndoLogSegmentSize; > + } The code to recycle or delete one segment exists in multiple places (at least also in UndoLogDiscard()). Think it's long enough that it's easily worthwhile to share. > +/* > @@ -1418,12 +1418,18 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf > segmentpath = strstr(filename, "."); > if (segmentpath != NULL) > { > - segmentno = atoi(segmentpath + 1); > - if (segmentno == 0) > + char *end; > + if (strstr(readfilename, "undo")) > + first_blkno = strtol(segmentpath + 1, &end, 16) / BLCKSZ; > + else > + first_blkno = strtol(segmentpath + 1, &end, 10) * RELSEG_SIZE; > + if (*end != '\0') > ereport(ERROR, > - (errmsg("invalid segment number %d in file \"%s\"", > - segmentno, filename))); > + (errmsg("invalid segment number in file \"%s\"", > + filename))); > } > + else > + first_blkno = 0; > } > } Hm. Not a fan of just using strstr() here. Can't quite articulate why. Just somehow rubs me wrong. > /* > * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require > * a relcache entry for the relation. > - * > - * NB: At present, this function may only be used on permanent relations, which > - * is OK, because we only use it during XLOG replay. If in the future we > - * want to use it on temporary or unlogged relations, we could pass additional > - * parameters.
> */ > Buffer > ReadBufferWithoutRelcache(SmgrId smgrid, RelFileNode rnode, ForkNumber forkNum, > BlockNumber blockNum, ReadBufferMode mode, > - BufferAccessStrategy strategy) > + BufferAccessStrategy strategy, > + char relpersistence) > { > bool hit; > > - SMgrRelation smgr = smgropen(smgrid, rnode, InvalidBackendId); > - > - Assert(InRecovery); > + SMgrRelation smgr = smgropen(smgrid, rnode, > + relpersistence == RELPERSISTENCE_TEMP > + ? MyBackendId : InvalidBackendId); > > - return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum, > + return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum, > mode, strategy, &hit); > } Hm. Using this for undo access means that we don't do any buffer read/hit counting that can be associated with the relation causing undo to be read/written. That seems like a sizable monitoring deficiency. > /* > + * ForgetBuffer -- drop a buffer from shared buffers > + * > + * If the buffer isn't present in shared buffers, nothing happens. If it is > + * present, it is discarded without making any attempt to write it back out to > + * the operating system. The caller must therefore somehow be sure that the > + * data won't be needed for anything now or in the future. It assumes that > + * there is no concurrent access to the block, except that it might be being > + * concurrently written. > + */ > +void > +ForgetBuffer(SmgrId smgrid, RelFileNode rnode, ForkNumber forkNum, > + BlockNumber blockNum) > +{ > + SMgrRelation smgr = smgropen(smgrid, rnode, InvalidBackendId); > + BufferTag tag; /* identity of target block */ > + uint32 hash; /* hash value for tag */ > + LWLock *partitionLock; /* buffer partition lock for it */ > + int buf_id; > + BufferDesc *bufHdr; > + uint32 buf_state; > + > + /* create a tag so we can lookup the buffer */ > + INIT_BUFFERTAG(tag, smgrid, smgr->smgr_rnode.node, forkNum, blockNum); > + > + /* determine its hash code and partition lock ID */ > + hash = BufTableHashCode(&tag); > + partitionLock = BufMappingPartitionLock(hash); > + > + /* see if the block is in the buffer pool */ > + LWLockAcquire(partitionLock, LW_SHARED); > + buf_id = BufTableLookup(&tag, hash); > + LWLockRelease(partitionLock); > + > + /* didn't find it, so nothing to do */ > + if (buf_id < 0) > + return; > + > + /* take the buffer header lock */ > + bufHdr = GetBufferDescriptor(buf_id); > + buf_state = LockBufHdr(bufHdr); > + /* > + * The buffer might been evicted after we released the partition lock and > + * before we acquired the buffer header lock. If so, the buffer we've > + * locked might contain some other data which we shouldn't touch. If the > + * buffer hasn't been recycled, we proceed to invalidate it. > + */ > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && > + bufHdr->tag.blockNum == blockNum && > + bufHdr->tag.forkNum == forkNum) > + InvalidateBuffer(bufHdr); /* releases spinlock */ > + else > + UnlockBufHdr(bufHdr, buf_state); Phew, I don't like this code one bit. It imo is a bad idea / unnecessary to look up the buffer, unlock the partition lock, and then recheck identity. And do exactly the same thing again in InvalidateBuffer() (including making a copy of the tag while holding the buffer header lock).
Seems like this should be something roughly like ReservePrivateRefCountEntry(); LWLockAcquire(partitionLock, LW_SHARED); buf_id = BufTableLookup(&tag, hash); if (buf_id >= 0) { bufHdr = GetBufferDescriptor(buf_id); buf_state = LockBufHdr(bufHdr); /* * Temporarily acquire pin - that prevents the buffer * from being replaced with one that we did not intend * to target. * * XXX: */ ref = PinBuffer_Locked(bufHdr, strategy); /* release partition lock, acquire exclusively so we can drop */ LWLockRelease(partitionLock); /* loop until nobody else has the buffer pinned */ while (true) { LWLockAcquire(partitionLock, LW_EXCLUSIVE); buf_state = LockBufHdr(buf); /* * Check if somebody else is busy writing the buffer (we * have one pin). */ if (BUF_STATE_GET_REFCOUNT(buf_state) == 1) break; // XXX: Should we assert IO_IN_PROGRESS? Ought to be the // only way to get here. /* wait for IO to finish, without holding locks */ UnlockBufHdr(buf, buf_state); LWLockRelease(partitionLock); Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) == 1); WaitIO(buf); /* buffer identity can't change, we've a pin */ // XXX: Assert that the buffer isn't dirty anymore? There // ought to be no possibility for it to get dirty now. } Assert(!(buf_state & BM_PIN_COUNT_WAITER)); /* * Clear out the buffer's tag and flags. We must do this to ensure that * linear scans of the buffer array don't think the buffer is valid. */ oldFlags = buf_state & BUF_FLAG_MASK; CLEAR_BUFFERTAG(buf->tag); buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK); /* remove our refcount */ buf_state -= BUF_REFCOUNT_ONE; UnlockBufHdr(buf, buf_state); /* * Remove the buffer from the lookup hashtable, if it was in there. */ if (oldFlags & BM_TAG_VALID) BufTableDelete(&oldTag, oldHash); /* * Done with mapping lock. */ LWLockRelease(oldPartitionLock); Assert(ref->refcount == 1); ForgetPrivateRefCountEntry(ref); ResourceOwnerForgetBuffer(CurrentResourceOwner, BufferDescriptorGetBuffer(buf)); } or something in that vein. Now you can validly argue that this is more complicated - but I also think that this is going to be a much hotter path than normal relation drops. <more after some errands> Greetings, Andres Freund
Re: should there be a hard-limit on the number of transactions pending undo?
From
Peter Geoghegan
On Mon, Jul 29, 2019 at 2:52 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Thanks for sharing that. I see they're giving that paper at VLDB next > month in LA... I hope the talk video will be published on the web. > While we've been working on a hybrid vacuum/undo design, they've built > a hybrid undo/vacuum system. It seems that this will be in a stable release soon, so it's not pie-in-the-sky stuff. AFAICT, they have indexes that always point to the latest row version. Getting an old version always requires working backwards from the latest. Perhaps the constant time recovery stuff is somewhat like Postgres heapam when it comes to SELECTs, INSERTs, and DELETEs, but much less similar when it comes to UPDATEs. This seems like it might be an important distinction. As the MVCC survey paper out of CMU [1] from a couple of years back says: "The main idea of using logical pointers is that the DBMS uses a fixed identifier that does not change for each tuple in its index entry. Then, as shown in Fig. 5a, the DBMS uses an indirection layer that maps a tuple’s identifier to the HEAD of its version chain. This avoids the problem of having to update all of a table’s indexes to point to a new physical location whenever a tuple is modified. (even if the indexed attributes were not changed)." To me, this suggests that zheap ought to make heap TIDs "more logical" than they are with heapam today (heap TIDs are hybrid physical/logical identifiers today). "Row forwarding" across heap pages is the traditional way of ensuring that TIDs in indexes are stable even in the worst case, apparently, but other approaches also seem possible. [1] http://www.vldb.org/pvldb/vol10/p781-Wu.pdf -- Peter Geoghegan
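PS: To illustrate what the paper means by logical pointers (the names here are invented; nothing like this exists in Postgres or zheap today):

    /*
     * Index tuples store a stable logical identifier rather than a
     * physical TID; an indirection layer maps it to the current head of
     * the version chain.  An UPDATE only changes the mapping, so none of
     * the table's indexes need new entries.
     */
    typedef struct VersionChainMapping
    {
        uint64          logical_id;   /* what index tuples point at */
        ItemPointerData chain_head;   /* current physical location */
    } VersionChainMapping;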
On Tue, Jul 30, 2019 at 12:18 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jul 23, 2019 at 10:42 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > Yes, I also think that the function would error out only because of > > can't-happen cases, like "too many locks taken" or "out of binary heap > > slots" or "out of memory" (this last one is not such a can't happen > > case). These cases happen probably due to some bugs, I suppose. But I > > was wondering : Generally when the code errors out with such > > can't-happen elog() calls, worst thing that happens is that the > > transaction gets aborted. Whereas, in this case, the worst thing that > > could happen is : the undo action would never get executed, which > > means selects for this tuple will keep on accessing the undo log ? > > This does not sound like any data consistency issue, so we should be > > fine after all ? > > I don't think so. Every XID present in undo has to be something we > can look up in CLOG to figure out which transactions are aborted and > which transactions are committed, so that we know which transactions > need undo. If we forget to undo the transaction, we can't discard it, > which means we can't advance the CLOG transaction horizon, which means > we'll eventually start failing to assign XIDs, leading to a refusal of > all write transactions. Oops. > > More generally, it's not OK for the generic undo layer to make > assumptions about whether the operations performed by the undo > handlers are essential or not. We don't want to impose a design > constraint the undo can only be used for things that are not actually > critical, because that will make it hard to write AMs that use it. > And there's no reason to live with such a design constraint anyway, > because, as noted above, CLOG truncation requires it. > > More generally still, some can't-happen situations should be checked > via Assert() and others via elog(). For example, consider some code > that looks up a syscache tuple and pulls data from the returned tuple. > If the code that handles DDL is written in such a way that the tuple > should always exist, then this is a can't-happen situation, but > generally the code checks this via elog(), not Assert(), because it > could also happen due to the catalog contents being corrupted. If > Assert() were used, the checks would not run in production builds, and > a corrupt catalog would lead to a seg fault. An elog() is much > friendlier. As a general principle, when a certain thing ought to > always be true, but it being true depends on a whole lot of > assumptions elsewhere in the code, and especially if it also depends > on assumptions like "the database is not corrupted," I think elog() is > preferable. Assert() is better for things that are more localized and > that really can't go wrong for any reason other than a bug. In this > case, I think I would tend towards elog(PANIC), but it's arguable. > Agreed, elog(PANIC) seems like a better way for this as compared to Assert. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
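To illustrate the distinction with a made-up example from the undo apply path (the field name is invented):

    /* Localized invariant that only a bug in this code could break: */
    Assert(BufferIsValid(buffer));

    /*
     * Depends on distant assumptions (undo data not corrupted, CLOG in
     * sync), so keep the check in production builds:
     */
    if (!TransactionIdIsValid(uur->uur_xid))
        elog(PANIC, "undo record has invalid transaction id");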
On Fri, Jul 19, 2019 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > > I don't like the fact that undoaccess.c has a new global, > > > undo_compression_info. I haven't read the code thoroughly, but do we > > > really need that? I think it's never modified (so it could just be > > > declared const), > > > > Actually, this will get modified otherwise across undo record > > insertion how we will know what was the values of the common fields in > > the first record of the page. Another option could be that every time > > we insert the record, read the value from the first complete undo > > record on the page but that will be costly because for every new > > insertion we need to read the first undo record of the page. > > > > This information won't be shared across transactions, so can't we keep > it in top transaction's state? It seems to me that will be better > than to maintain it as a global state. I think this idea is good for the DO time but during REDO time it will not work as we will not have the transaction state. Having said that the current idea of keeping in the global variable will also not work during REDO time because the WAL from different transactions can be interleaved. There are a few ideas to handle this issue: 1. At DO time keep in TopTransactionState as you suggested and during recovery time read from the first complete record on the page. 2. Just to keep the code uniform always read from the first complete record of the page. After putting more thought into it, I am more inclined towards idea-2. Because we are anyway inserting our current record into that page, we have already read the buffer and also hold the exclusive lock on it. So reading a few extra bytes from the buffer will not hurt us IMHO. If someone has a better solution please suggest. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
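PS: To sketch idea-2 (the field and helper names here are invented, just to show the shape):

    /*
     * At both DO and REDO time we already hold the buffer exclusively
     * while inserting, so the common fields can simply be re-read from
     * the first complete record on the page:
     */
    page = BufferGetPage(buffer);
    phdr = (UndoPageHeader) page;
    first_rec = (char *) page + phdr->first_record_offset;   /* hypothetical */
    compression_info = UndoRecordGetCommonInfo(first_rec);   /* hypothetical */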
On Tue, Jul 30, 2019 at 5:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Fri, Jul 19, 2019 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > I don't like the fact that undoaccess.c has a new global, > > > > undo_compression_info. I haven't read the code thoroughly, but do we > > > > really need that? I think it's never modified (so it could just be > > > > declared const), > > > > > > Actually, this will get modified otherwise across undo record > > > insertion how we will know what was the values of the common fields in > > > the first record of the page. Another option could be that every time > > > we insert the record, read the value from the first complete undo > > > record on the page but that will be costly because for every new > > > insertion we need to read the first undo record of the page. > > > > > > > This information won't be shared across transactions, so can't we keep > > it in top transaction's state? It seems to me that will be better > > than to maintain it as a global state. > > I think this idea is good for the DO time but during REDO time it will > not work as we will not have the transaction state. Having said that > the current idea of keeping in the global variable will also not work > during REDO time because the WAL from different transactions can be > interleaved. There are a few ideas to handle this issue: > > 1. At DO time keep in TopTransactionState as you suggested and during > recovery time read from the first complete record on the page. > 2. Just to keep the code uniform always read from the first complete > record of the page. > > After putting more thought into it, I am more inclined towards idea-2. Because > we are anyway inserting our current record into that page, we have > already read the buffer and also hold the exclusive lock on it. > So reading a few extra bytes from the buffer will not hurt us IMHO. > > If someone has a better solution please suggest. Hi Dilip, Here's some initial review of the following patch (from your public undo_interface_v1 branch as of this morning). I haven't tested this version yet, because my means of testing this stuff involves waiting for undoprocessing to be rebased, so that I can test it with my orphaned files stuff and other test programs. It contains another suggestion for that problem you just mentioned (and also me pointing out what you just pointed out, since I wrote it earlier) though I'm not sure if it's better than your options above. > commit 2f3c127b9e8bc7d27cf7adebff0a355684dfb94e > Author: Dilip Kumar <dilipkumar@localhost.localdomain> > Date: Thu May 2 11:28:13 2019 +0530 > > Provide interfaces to store and fetch undo records. +#include "commands/tablecmds.h" +#include "storage/block.h" +#include "storage/buf.h" +#include "storage/buf_internals.h" +#include "storage/bufmgr.h" +#include "miscadmin.h" "miscadmin.h" comes before "storage...". +/* + * Compute the size of the partial record on the undo page. + * + * Compute the complete record size by uur_info and variable field length + * stored in the page header and then subtract the offset of the record so that + * we can get the exact size of partial record in this page.
+ */ +static inline Size +UndoPagePartialRecSize(UndoPageHeader phdr) +{ + Size size; We decided to use size_t everywhere in new code (except perhaps functions conforming to function pointer types that historically use Size in their type). + /* + * Compute the header size from undo record uur_info, stored in the page + * header. + */ + size = UndoRecordHeaderSize(phdr->uur_info); + + /* + * Add length of the variable part and undo length. Now, we know the + * complete length of the undo record. + */ + size += phdr->tuple_len + phdr->payload_len + sizeof(uint16); + + /* + * Subtract the size which is stored in the previous page to get the + * partial record size stored in this page. + */ + size -= phdr->record_offset; + + return size; This is probably a stupid question but why isn't it enough to just store the offset of the first record that begins on this page, or 0 for none yet? Why do we need to worry about the partial record's payload etc? +UndoRecPtr +PrepareUndoInsert(UndoRecordInsertContext *context, + UnpackedUndoRecord *urec, + Oid dbid) +{ ... + /* Fetch compression info for the transaction. */ + compression_info = GetTopTransactionUndoCompressionInfo(category); How can this work correctly in recovery? [Edit: it doesn't, as you just pointed out] I had started reviewing an older version of your patch (the version that had made it as far as the undoprocessing branch as of recently), before I had the bright idea to look for a newer version. I was going to object to the global variable you had there in the earlier version. It seems to me that you have to be able to reproduce the exact same compression in recovery that you produced at "do" time, no? How can TopTransactionStateData be the right place for this in recovery? One data structure that could perhaps hold this would be UndoLogTableEntry (the per-backend cache, indexed by undo log number, with pretty fast lookups; used for things like UndoLogNumberGetCategory()). As long as you never want to have inter-transaction compression, that should have the right scope to give recovery per-undo log tracking. If you ever wanted to do compression between transactions too, maybe UndoLogSlot could work, but that'd have more complications. +/* + * Read undo records of the transaction in bulk + * + * Read undo records between from_urecptr and to_urecptr until we exhaust the + * the memory size specified by undo_apply_size. If we could not read all the + * records till to_urecptr then the caller should consume current set of records + * and call this function again. + * + * from_urecptr - Where to start fetching the undo records. If we can not + * read all the records because of memory limit then this + * will be set to the previous undo record pointer from where + * we need to start fetching on next call. Otherwise it will + * be set to InvalidUndoRecPtr. + * to_urecptr - Last undo record pointer to be fetched. + * undo_apply_size - Memory segment limit to collect undo records. + * nrecords - Number of undo records read. + * one_page - Caller is applying undo only for one block not for + * complete transaction. If this is set true then instead + * of following transaction undo chain using prevlen we will + * follow the block prev chain of the block so that we can + * avoid reading many unnecessary undo records of the + * transaction.
+ */ +UndoRecInfo * +UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr, + int undo_apply_size, int *nrecords, bool one_page) Could you please make it clear in comments and assertions what the relation between from_urecptr and to_urecptr is and what they mean (they must be in the same undo log, one must be <= the other, both point to the *start* of a record, so it's not the same as the total range of undo)? undo_apply_size is not a good parameter name, because the function is useful for things other than applying records -- like the undoinspect() extension (or some better version of that), for example. Maybe max_result_size or something like that? +{ ... + /* Allocate memory for next undo record. */ + uur = palloc0(sizeof(UnpackedUndoRecord)); ... + + size = UnpackedUndoRecordSize(uur); + total_size += size; I see, so the unpacked records are still allocated one at a time. I guess that's OK for now. From some earlier discussion I had been expecting an arrangement where the actual records were laid out contiguously with their subcomponents (things they point to in palloc()'d memory) nearby. +static uint16 +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, + UndoLogCategory category) +{ ... + char prevlen[2]; ... + prev_rec_len = *(uint16 *) (prevlen); I don't think that's OK, and might crash on a non-Intel system. How about using a union of uint16 and char[2]? + /* Copy undo record transaction header if it is present. */ + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0) + memcpy(&ucontext->urec_txn, uur->uur_txn, SizeOfUndoRecordTransaction); I was wondering why you don't use D = S instead of memcpy(&D, &S, size) wherever you can, until I noticed you use these SizeOfXXX macros that don't include trailing padding from structs, and that's also how you allocate objects. Hmm. So if I were to complain about you not using plain old assignment whenever you can, I'd also have to complain about that. I think that that technique of defining a SizeOfXXX macro that excludes trailing bytes makes sense for writing into WAL or undo log buffers using memcpy(). I'm not sure it makes sense for palloc() and copying into typed variables like you're doing here and I think I'd prefer the notational simplicity of using the (very humble) type system facilities C gives us. (Some memory checker might not like it if you palloc(the shorter size) and then use = if the compiler chooses to implement it as memcpy sizeof().) +/* + * The below common information will be stored in the first undo record of the + * page. Every subsequent undo record will not store this information, if + * required this information will be retrieved from the first undo record of the + * page. + */ +typedef struct UndoCompressionInfo Shouldn't this say "Every subsequent record will not store this information *if it's the same as the relevant fields in the first record*"? +#define UREC_INFO_TRANSACTION 0x001 +#define UREC_INFO_RMID 0x002 +#define UREC_INFO_RELOID 0x004 +#define UREC_INFO_XID 0x008 Should we call this UREC_INFO_FXID, since it refers to a FullTransactionId? +/* + * Every undo record begins with an UndoRecordHeader structure, which is + * followed by the additional structures indicated by the contents of + * urec_info. All structures are packed into the alignment without padding + * bytes, and the undo record itself need not be aligned either, so care + * must be taken when reading the header.
+ */ I think you mean "All structures are packed into undo pages without considering alignment and without trailing padding bytes"? This comes from the definition of the SizeOfXXX macros IIUC. There might still be padding between members of some of those structs, no? Like this one, that has the second member at offset 2 on my system: +typedef struct UndoRecordHeader +{ + uint8 urec_type; /* record type code */ + uint16 urec_info; /* flag bits */ +} UndoRecordHeader; + +#define SizeOfUndoRecordHeader \ + (offsetof(UndoRecordHeader, urec_info) + sizeof(uint16)) +/* + * Information for a transaction to which this undo belongs. This + * also stores the dbid and the progress of the undo apply during rollback. + */ +typedef struct UndoRecordTransaction +{ + /* + * Undo block number where we need to start reading the undo for applying + * the undo action. InvalidBlockNumber means undo applying hasn't + * started for the transaction and MaxBlockNumber mean undo completely + * applied. And, any other block number means we have applied partial undo + * so next we can start from this block. + */ + BlockNumber urec_progress; + Oid urec_dbid; /* database id */ + UndoRecPtr urec_next; /* urec pointer of the next transaction */ +} UndoRecordTransaction; I propose that we rename this to UndoRecordGroupHeader (or something like that... maybe "Set", but we also use "set" as a verb in various relevant function names): 1. We'll also use these for the new "shared" records we recently invented that don't relate to a transaction. This is really about defining the unit of discarding; we throw away the whole set of records at once, which is why it's basically about providing a space for "urec_next". 2. Though it also holds rollback progress information, which is a transaction-specific concept, there can be more than one of these sets of records for a single transaction anyway. A single transaction can write undo stuff in more than one undo log (different categories perm/temp/unlogged/shared and also due to log switching when they are full). So really it's just a header for an arbitrary set of records, used to track when and how to discard them. If you agree with that idea, perhaps urec_next should become something like urec_next_group, too. "next" is a bit vague, especially for something as untyped as UndoRecPtr: someone might think it points to the next record. More soon. -- Thomas Munro https://enterprisedb.com
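PS: Concretely, the rename I'm proposing would look something like this (same layout as UndoRecordTransaction above, only the names change):

    typedef struct UndoRecordGroupHeader
    {
        BlockNumber urec_progress;    /* undo apply progress, as before */
        Oid         urec_dbid;        /* database id */
        UndoRecPtr  urec_next_group;  /* start of the next record group */
    } UndoRecordGroupHeader;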
Hi, Amit, short note: The patches aren't attached in patch order. Obviously a minuscule thing, but still nicer if that's not the case. Dilip, this also contains the start of a review for the undo record interface further down. On 2019-07-29 16:35:20 -0700, Andres Freund wrote: > <more after some errands> Here we go. I'm a bit worried about expanding the use of ReadBufferWithoutRelcache(). Not so much because of the relcache itself, but because it requires doing separate smgropen() calls. While not crazily expensive, it's also not free. Especially combined with closing all such relations at transaction end (c.f. AtEOXact_SMgr). I'm somewhat inclined to think that this requires a slightly bigger refactoring than done in this patch. Imo at the very least the smgr entries ought not to be unowned. But working towards not having to re-open the smgr entry for every single trivial request ought to be part of this too. > /* > + * ForgetLocalBuffer - drop a buffer from local buffers > + * > + * This is similar to bufmgr.c's ForgetBuffer, except that we do not need > + * to do any locking since this is all local. As with that function, this > + * must be used very carefully, since we'll cheerfully throw away dirty > + * buffers without any attempt to write them. > + */ > +void > +ForgetLocalBuffer(SmgrId smgrid, RelFileNode rnode, ForkNumber forkNum, > + BlockNumber blockNum) > +{ > + SMgrRelation smgr = smgropen(smgrid, rnode, BackendIdForTempRelations()); > + BufferTag tag; /* identity of target block */ > + LocalBufferLookupEnt *hresult; > + BufferDesc *bufHdr; > + uint32 buf_state; > + > + /* > + * If somehow this is the first request in the session, there's nothing to > + * do. (This probably shouldn't happen, though.) > + */ > + if (LocalBufHash == NULL) > + return; Given that the call to ForgetLocalBuffer() currently is unconditional, rather than checking the persistence of the undo log, I don't see why this wouldn't happen? > + /* mark buffer invalid */ > + bufHdr = GetLocalBufferDescriptor(hresult->id); > + CLEAR_BUFFERTAG(bufHdr->tag); > + buf_state = pg_atomic_read_u32(&bufHdr->state); > + buf_state &= ~(BM_VALID | BM_TAG_VALID | BM_DIRTY); > + pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state); Shouldn't this also clear out at least the usagecount? I'd probably just use buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK); like InvalidateBuffer() does. I'd probably also add an assert ensuring that the refcount is zero. > @@ -97,7 +116,6 @@ static dlist_head unowned_relns; > /* local function prototypes */ > static void smgrshutdown(int code, Datum arg); > > - > /* > * smgrinit(), smgrshutdown() -- Initialize or shut down storage > * managers. spurious change. > +/* > + * While md.c expects random access and has a small number of huge > + * segments, undofile.c manages a potentially very large number of smaller > + * segments and has a less random access pattern. Therefore, instead of > + * keeping a potentially huge array of vfds we'll just keep the most > + * recently accessed N. > + * > + * For now, N == 1, so we just need to hold onto one 'File' handle. > + */ > +typedef struct UndoFileState > +{ > + int mru_segno; > + File mru_file; > +} UndoFileState; IMO N==1 gotta change before this is committable. There are too many design issues that could creep in without fixing this (e.g. not being careful enough about closing cached file handles after certain operations etc), that will be harder to fix later.
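To be concrete, I'd expect at least something along these lines (the size is obviously debatable, this is just a rough sketch):

    #define UNDOFILE_MRU_SIZE 8     /* cached segment file handles */

    typedef struct UndoFileState
    {
        int     mru_segno[UNDOFILE_MRU_SIZE];
        File    mru_file[UNDOFILE_MRU_SIZE];
        int     mru_next_victim;    /* simple round-robin eviction */
    } UndoFileState;

so that a workload alternating between a handful of segments doesn't thrash open()/close().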
> +void > +undofile_open(SMgrRelation reln) > +{ > + UndoFileState *state; > + > + state = MemoryContextAllocZero(UndoFileCxt, sizeof(UndoFileState)); > + reln->private_data = state; > +} Hm, I don't quite like this 'private_data' design. Was that design discussed anywhere? Intuitively ISTM that it'd be better if SMgrRelation were embedded in a per-SMGR type struct. Obviously that'd not quite work as things are set up, because the size has to be constant due to SMgrRelationHash. But I think it might be good anyway if that hash just stored a pointer to the relevant SMgrRelation. > +void > +undofile_close(SMgrRelation reln, ForkNumber forknum) > +{ > +} Hm, aren't we leaking private_data right now? > +void > +undofile_create(SMgrRelation reln, ForkNumber forknum, bool isRedo) > +{ > + /* > + * File creation is managed by undolog.c, but xlogutils.c likes to call > + * this just in case. Ignore. > + */ > +} Phew, this is not pretty. > +bool > +undofile_exists(SMgrRelation reln, ForkNumber forknum) > +{ > + elog(ERROR, "undofile_exists is not supported"); > + > + return false; /* not reached */ > +} This one I actually find bad. It seems pretty reasonable for SMGR-kind agnostic code to be able to know whether a file exists or not. > +void > +undofile_extend(SMgrRelation reln, ForkNumber forknum, > + BlockNumber blocknum, char *buffer, > + bool skipFsync) > +{ > + elog(ERROR, "undofile_extend is not supported"); > +} This one I have much less problems with. > +void > +undofile_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, > + char *buffer) > +{ > + File file; > + off_t seekpos; > + int nbytes; > + > + Assert(forknum == MAIN_FORKNUM); > + file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE); > + seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE)); > + Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE); I'd not name this seekpos, given that we're not seeking. > +BlockNumber > +undofile_nblocks(SMgrRelation reln, ForkNumber forknum) > +{ > + /* > + * xlogutils.c likes to call this to decide whether to read or extend; for > + * now we lie and say the relation is big as possible. > + */ > + return UndoLogMaxSize / BLCKSZ; > +} That's imo not ok. > /* > + * check_for_live_undo_data() > + * > + * Make sure there are no live undo records (aborted transactions that have > + * not been rolled back, or committed transactions whose undo data has not > + * yet been discarded). > + */ > +static void > +check_for_undo_data(ClusterInfo *cluster) > +{ > + PGresult *res; > + PGconn *conn = connectToServer(cluster, "template1"); > + > + if (GET_MAJOR_VERSION(old_cluster.major_version) < 1200) > + return; Needs to be updated now. > --- a/src/bin/pg_upgrade/exec.c > +++ b/src/bin/pg_upgrade/exec.c > @@ -351,6 +351,10 @@ check_data_dir(ClusterInfo *cluster) > check_single_dir(pg_data, "pg_clog"); > else > check_single_dir(pg_data, "pg_xact"); > + > + /* pg_undo is new in v13 */ > + if (GET_MAJOR_VERSION(cluster->major_version) >= 1200) > + check_single_dir(pg_data, "pg_undo"); > } The comment talks about v13, but the code checks for v12? > +++ b/src/bin/pg_upgrade/undo.c > @@ -0,0 +1,292 @@ > +/* > + * undo.c > + * > + * Support for upgrading undo logs. > + * Copyright (c) 2019, PostgreSQL Global Development Group > + * src/bin/pg_upgrade/undo.c > + */ A small design note here seems like a good idea. > +/* Undo log statuses.
*/ > +typedef enum > +{ > + UNDO_LOG_STATUS_UNUSED = 0, > + UNDO_LOG_STATUS_ACTIVE, > + UNDO_LOG_STATUS_FULL, > + UNDO_LOG_STATUS_DISCARDED > +} UndoLogStatus; An explanation of what these mean would be good. > +/* > + * Convert from relpersistence ('p', 'u', 't') to an UndoPersistence > + * enumerator. > + */ > +#define UndoPersistenceForRelPersistence(rp) \ > + ((rp) == RELPERSISTENCE_PERMANENT ? UNDO_PERMANENT : \ > + (rp) == RELPERSISTENCE_UNLOGGED ? UNDO_UNLOGGED : UNDO_TEMP) > + > +/* > + * Convert from UndoPersistence to a relpersistence value. > + */ > +#define RelPersistenceForUndoPersistence(up) \ > + ((up) == UNDO_PERMANENT ? RELPERSISTENCE_PERMANENT : \ > + (up) == UNDO_UNLOGGED ? RELPERSISTENCE_UNLOGGED : \ > + RELPERSISTENCE_TEMP) We shouldn't add macros with multiple evaluation hazards without need. > +/* > + * Properties of an undo log that don't have explicit WAL records logging > + * their changes, to reduce WAL volume. Instead, they change incrementally > + * whenever data is inserted as a result of other WAL records. Since the > + * values recorded in an online checkpoint may be out of the sync (ie not the > + * correct values as at the redo LSN), these are backed up in buffer data on > + * first change after each checkpoint. > + */ s/on first/on the first/? > +/* > + * Instantiate fast inline hash table access functions. We use an identity > + * hash function for speed, since we already have integers and don't expect > + * many collisions. > + */ > +#define SH_PREFIX undologtable > +#define SH_ELEMENT_TYPE UndoLogTableEntry > +#define SH_KEY_TYPE UndoLogNumber > +#define SH_KEY number > +#define SH_HASH_KEY(tb, key) (key) > +#define SH_EQUAL(tb, a, b) ((a) == (b)) > +#define SH_SCOPE static inline > +#define SH_DECLARE > +#define SH_DEFINE > +#include "lib/simplehash.h" > + > +extern PGDLLIMPORT undologtable_hash *undologtable_cache; Why isn't this defined in a .c file? I've a bit of a hard time believing that making UndoLogGetTableEntry() an extern function would be a meaningful overhead compared to the operations this is used for. Not exposing those details seems nicer to me. > +/* Create a new undo log. */ > +typedef struct xl_undolog_create > +{ > + UndoLogNumber logno; > + Oid tablespace; > + UndoPersistence persistence; > +} xl_undolog_create; > + > +/* Extend an undo log by adding a new segment. */ > +typedef struct xl_undolog_extend > +{ > + UndoLogNumber logno; > + UndoLogOffset end; > +} xl_undolog_extend; > + > +/* Discard space, and possibly destroy or recycle undo log segments. */ > +typedef struct xl_undolog_discard > +{ > + UndoLogNumber logno; > + UndoLogOffset discard; > + UndoLogOffset end; > + TransactionId latestxid; /* latest xid whose undolog are discarded. */ > + bool entirely_discarded; > +} xl_undolog_discard; > + > +/* Switch undo log. */ > +typedef struct xl_undolog_switch > +{ > + UndoLogNumber logno; > + UndoRecPtr prevlog_xact_start; > + UndoRecPtr prevlog_last_urp; > +} xl_undolog_switch; I'd add flags to these. Perhaps I'm overly cautious, but I found that extremely valuable when having to fix bugs in already released versions. And these aren't so frequent that that'd hurt. Obviously entirely_discarded would then be a flag. 
> +extern void undofile_init(void); > +extern void undofile_shutdown(void); > +extern void undofile_open(SMgrRelation reln); > +extern void undofile_close(SMgrRelation reln, ForkNumber forknum); > +extern void undofile_create(SMgrRelation reln, ForkNumber forknum, > + bool isRedo); > +extern bool undofile_exists(SMgrRelation reln, ForkNumber forknum); > +extern void undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum, > + bool isRedo); > +extern void undofile_extend(SMgrRelation reln, ForkNumber forknum, > + BlockNumber blocknum, char *buffer, bool skipFsync); > +extern void undofile_prefetch(SMgrRelation reln, ForkNumber forknum, > + BlockNumber blocknum); > +extern void undofile_read(SMgrRelation reln, ForkNumber forknum, > + BlockNumber blocknum, char *buffer); > +extern void undofile_write(SMgrRelation reln, ForkNumber forknum, > + BlockNumber blocknum, char *buffer, bool skipFsync); > +extern void undofile_writeback(SMgrRelation reln, ForkNumber forknum, > + BlockNumber blocknum, BlockNumber nblocks); > +extern BlockNumber undofile_nblocks(SMgrRelation reln, ForkNumber forknum); > +extern void undofile_truncate(SMgrRelation reln, ForkNumber forknum, > + BlockNumber nblocks); > +extern void undofile_immedsync(SMgrRelation reln, ForkNumber forknum); > + > +/* Callbacks used by sync.c. */ > +extern int undofile_syncfiletag(const FileTag *tag, char *path); > +extern bool undofile_filetagmatches(const FileTag *tag, const FileTag *candidate); > + > +/* Management of checkpointer requests. */ > +extern void undofile_request_sync(UndoLogNumber logno, BlockNumber segno, > + Oid tablespace); > +extern void undofile_forget_sync(UndoLogNumber logno, BlockNumber segno, > + Oid tablespace); > +extern void undofile_forget_sync_tablespace(Oid tablespace); > +extern void undofile_request_sync_dir(Oid tablespace); Istm that it'd be better to only have mdopen/... referenced from smgrsw() and then have f_smgr be included (as a const f_smgr* const) as part of SMgrRelation. For one, that'll allow us to hide a lot more of this into md.c/undofile.c. It's also a pathway to being more extensible. I think performance ought to be at least as good (as we currently have to read SMgrRelation->smgr_which, then read the callback from smgrsw (which is probably combined into one op in many platforms), and then call it). Could then also just make the simple smgr* functions static inline wrappers, as that doesn't require exposing f_smgr anymore. > From 880f25a543783f8dc3784a51ab1c29b72f6b5b27 Mon Sep 17 00:00:00 2001 > From: Dilip Kumar <dilip.kumar@enterprisedb.com> > Date: Fri, 7 Jun 2019 15:03:37 +0530 > Subject: [PATCH 06/14] Defect and enhancement in multi-log support That's imo not a good thing to have in patch series intended to be reviewed, especially relatively early in the series. At least the commit message ought to include an explanation. > Subject: [PATCH 07/14] Provide interfaces to store and fetch undo records. > > Add the capability to form undo records and store them in undo logs. We > also provide the capability to fetch the undo records. This layer will use > undo-log-storage to reserve the space for the undo records and buffer > management routines to write and read the undo records. > > Undo records are stored in sequential order in the undo log. Maybe "In each und log undo records are stored in sequential order."? 
> +++ b/src/backend/access/undo/README.undointerface > @@ -0,0 +1,29 @@ > +Undo record interface layer > +--------------------------- > +This is the next layer which sits on top of the undo log storage, which will > +provide an interface for prepare, insert, or fetch the undo records. This > +layer will use undo-log-storage to reserve the space for the undo records > +and buffer management routine to write and read the undo records. The reference to "undo log storage" kinda seems like a reference into nothingness... > +Writing an undo record > +---------------------- > +To prepare an undo record, first, it will allocate required space using > +undo log storage module. Next, it will pin and lock the required buffers and > +return an undo record pointer where it will insert the record. Finally, it > +calls the Insert routine for final insertion of prepared record. Additionally, > +there is a mechanism for multi-insert, wherein multiple records are prepared > +and inserted at a time. I'm not sure whta this is telling me. Who is "it"? To me the filename ("interface"), and the title of this section, suggests this provides documentation on how to write code to insert undo records. But I don't think this does. > +Fetching and undo record > +------------------------ > +To fetch an undo record, a caller must provide a valid undo record pointer. > +Optionally, the caller can provide a callback function with the information of > +the block and offset, which will help in faster retrieval of undo record, > +otherwise, it has to traverse the undo-chain. > +There is also an interface to bulk fetch the undo records. Where the caller > +can provide a TO and FROM undo record pointer and the memory limit for storing > +the undo records. This API will return all the undo record between FROM and TO > +undo record pointers if they can fit into provided memory limit otherwise, it > +return whatever can fit into the memory limit. And, the caller can call it > +repeatedly until it fetches all the records. There's a lot of terminology in this file that's not been introduced. I think this needs to be greatly expanded and restructured to allow people unfamiliar with the code to benefit. > +/*------------------------------------------------------------------------- > + * > + * undoaccess.c > + * entry points for inserting/fetching undo records > + * NOTES: > + * Undo record layout: > + * > + * Undo records are stored in sequential order in the undo log. Each undo > + * record consists of a variable length header, tuple data, and payload > + * information. Is that actually true? There's records without tuples, no? > The first undo record of each transaction contains a > + * transaction header that points to the next transaction's start > header. Seems like this needs to reference different persistence levels, otherwise it seems misleading, given there can be multiple first records in multiple undo logs? > + * This allows us to discard the entire transaction's log at one-shot > rather s/at/in/ > + * than record-by-record. The callers are not aware of transaction header, s/of/of the/ > + * this is entirely maintained and used by undo record layer. See s/this/it/ > + * undorecord.h for detailed information about undo record header. 
s/undo record/the undo record/ I think at the very least there's explanations missing for: - what is the locking protocol for multiple buffers - what are the contexts for insertion - what phases an undo insertion happens in - updating previous records in general - what "packing" actually is > + > +/* Prototypes for static functions. */ Don't think we commonly include that... > +static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec, > + UndoRecPtr urp, RelFileNode rnode, > + UndoPersistence persistence, > + Buffer *prevbuf); > +static int UndoRecordPrepareTransInfo(UndoRecordInsertContext *context, > + UndoRecPtr xact_urp, int size, int offset); > +static void UndoRecordUpdateTransInfo(UndoRecordInsertContext *context, > + int idx); > +static void UndoRecordPrepareUpdateNext(UndoRecordInsertContext *context, > + UndoRecPtr urecptr, UndoRecPtr xact_urp); > +static int UndoGetBufferSlot(UndoRecordInsertContext *context, > + RelFileNode rnode, BlockNumber blk, > + ReadBufferMode rbm); > +static uint16 UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > + UndoPersistence upersistence); > + > +/* > + * Structure to hold the prepared undo information. > + */ > +struct PreparedUndoSpace > +{ > + UndoRecPtr urp; /* undo record pointer */ > + UnpackedUndoRecord *urec; /* undo record */ > + uint16 size; /* undo record size */ > + int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array > + * index */ > +}; > + > +/* > + * This holds undo buffers information required for PreparedUndoSpace during > + * prepare undo time. Basically, during prepare time which is called outside > + * the critical section we will acquire all necessary undo buffers pin and lock. > + * Later, during insert phase we will write actual records into thse buffers. > + */ > +struct PreparedUndoBuffer > +{ > + UndoLogNumber logno; /* Undo log number */ > + BlockNumber blk; /* block number */ > + Buffer buf; /* buffer allocated for the block */ > + bool zero; /* new block full of zeroes */ > +}; Most files define datatypes before function prototypes, because functions may reference the datatypes. > +/* > + * Prepare to update the transaction header > + * > + * It's a helper function for PrepareUpdateNext and > + * PrepareUpdateUndoActionProgress This doesn't really explain much. PrepareUpdateUndoActionProgress doesnt' exist. I assume it's UndoRecordPrepareApplyProgress from 0012? > + * xact_urp - undo record pointer to be updated. > + * size - number of bytes to be updated. > + * offset - offset in undo record where to start update. > + */ These comments seem redundant with the parameter names. > +static int > +UndoRecordPrepareTransInfo(UndoRecordInsertContext *context, > + UndoRecPtr xact_urp, int size, int offset) > +{ > + BlockNumber cur_blk; > + RelFileNode rnode; > + int starting_byte; > + int bufidx; > + int index = 0; > + int remaining_bytes; > + XactUndoRecordInfo *xact_info; > + > + xact_info = &context->xact_urec_info[context->nxact_urec_info]; > + > + UndoRecPtrAssignRelFileNode(rnode, xact_urp); > + cur_blk = UndoRecPtrGetBlockNum(xact_urp); > + starting_byte = UndoRecPtrGetPageOffset(xact_urp); > + > + /* Remaining bytes on the current block. */ > + remaining_bytes = BLCKSZ - starting_byte; > + > + /* > + * Is there some byte of the urec_next on the current block, if not then > + * start from the next block. > + */ This comment needs rephrasing. > + /* Loop until we have fetched all the buffers in which we need to write. 
*/ > + while (size > 0) > + { > + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, RBM_NORMAL); > + xact_info->idx_undo_buffers[index++] = bufidx; > + size -= (BLCKSZ - starting_byte); > + starting_byte = UndoLogBlockHeaderSize; > + cur_blk++; > + } So, this locks a very large number of undo buffers at the same time, do I see that correctly? What guarantees that there are no deadlocks due to multiple buffers locked at the same time (I guess the order inside the log)? What guarantees that this is a small enough number that we can even lock all of them at the same time? Why do we need to lock all of them at the same time? That's not clear to me. Also, why do we need code to lock an unbounded number here? It seems hard to imagine we'd ever want to update more than something around 8 bytes? Shouldn't that at the most require two buffers? > +/* > + * Prepare to update the previous transaction's next undo pointer. > + * > + * We want to update the next transaction pointer in the previous transaction's > + * header (first undo record of the transaction). In prepare phase we will > + * unpack that record and lock the necessary buffers which we are going to > + * overwrite and store the unpacked undo record in the context. Later, > + * UndoRecordUpdateTransInfo will overwrite the undo record. > + * > + * xact_urp - undo record pointer of the previous transaction's header > + * urecptr - current transaction's undo record pointer which need to be set in > + * the previous transaction's header. > + */ > +static void > +UndoRecordPrepareUpdateNext(UndoRecordInsertContext *context, > + UndoRecPtr urecptr, UndoRecPtr xact_urp) That name imo is confusing - it's not clear that it's not actually about the next record or such. > +{ > + UndoLogSlot *slot; > + int index = 0; > + int offset; > + > + /* > + * The absence of previous transaction's undo indicate that this backend *indicates > + /* > + * Acquire the discard lock before reading the undo record so that discard > + * worker doesn't remove the record while we are in process of reading it. > + */ *the discard worker > + LWLockAcquire(&slot->discard_update_lock, LW_SHARED); > + /* Check if it is already discarded. */ > + if (UndoLogIsDiscarded(xact_urp)) > + { > + /* Release lock and return. */ > + LWLockRelease(&slot->discard_update_lock); > + return; > + } Ho, hum. I don't quite remember what we decided in the discussion about not having to use the discard lock for this purpose. > + /* Compute the offset of the uur_next in the undo record. */ > + offset = SizeOfUndoRecordHeader + > + offsetof(UndoRecordTransaction, urec_next); > + > + index = UndoRecordPrepareTransInfo(context, xact_urp, > + sizeof(UndoRecPtr), offset); > + /* > + * Set the next pointer in xact_urec_info, this will be overwritten in > + * actual undo record during update phase. > + */ > + context->xact_urec_info[index].next = urecptr; What does "this will be overwritten mean"? It sounds like "context->xact_urec_info[index].next" would be overwritten, but that can't be true. > + /* We can now release the discard lock as we have read the undo record. */ > + LWLockRelease(&slot->discard_update_lock); > +} Hm. Because you expect it to be blocked behind the content lwlocks for the buffers? > +/* > + * Overwrite the first undo record of the previous transaction to update its > + * next pointer. > + * > + * This will insert the already prepared record by UndoRecordPrepareTransInfo. It doesn't actually appear to insert any records. 
At least not a record in the way the rest of the file uses that term? > + * This must be called under the critical section. s/under the/in a/ Think that should be asserted. > + /* > + * Start writing directly from the write offset calculated during prepare > + * phase. And, loop until we write required bytes. > + */ Why do we do offset calculations multiple times? Seems like all the offsets, and the split, should be computed in exactly one place. > +/* > + * Find the block number in undo buffer array > + * > + * If it is present then just return its index otherwise search the buffer and > + * insert an entry and lock the buffer in exclusive mode. > + * > + * Undo log insertions are append-only. If the caller is writing new data > + * that begins exactly at the beginning of a page, then there cannot be any > + * useful data after that point. In that case RBM_ZERO can be passed in as > + * rbm so that we can skip a useless read of a disk block. In all other > + * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't > + * happen to be already in the buffer pool. > + */ > +static int > +UndoGetBufferSlot(UndoRecordInsertContext *context, > + RelFileNode rnode, > + BlockNumber blk, > + ReadBufferMode rbm) > +{ > + int i; > + Buffer buffer; > + XLogRedoAction action = BLK_NEEDS_REDO; > + PreparedUndoBuffer *prepared_buffer; > + UndoPersistence persistence = context->alloc_context.persistence; > + > + /* Don't do anything, if we already have a buffer pinned for the block. */ As the code stands, it's locked, not just pinned. > + for (i = 0; i < context->nprepared_undo_buffer; i++) > + { How large do we expect this to get at most? > + /* > + * We did not find the block so allocate the buffer and insert into the > + * undo buffer array. > + */ > + if (InRecovery) > + action = XLogReadBufferForRedoBlock(context->alloc_context.xlog_record, > + SMGR_UNDO, > + rnode, > + UndoLogForkNum, > + blk, > + rbm, > + false, > + &buffer); Why is not locking the buffer correct here? Can't there be concurrent reads during hot standby? > +/* > + * This function must be called before all the undo records which are going to > + * get inserted under a single WAL record. How can a function be called "before all the undo records"? > + * nprepared - This defines the max number of undo records that can be > + * prepared before inserting them. > + */ > +void > +BeginUndoRecordInsert(UndoRecordInsertContext *context, > + UndoPersistence persistence, > + int nprepared, > + XLogReaderState *xlog_record) There definitely needs to be explanation about xlog_record. But also about memory management etc. Looks like one e.g. can't call this from a short lived memory context. > +/* > + * Call PrepareUndoInsert to tell the undo subsystem about the undo record you > + * intended to insert. Upon return, the necessary undo buffers are pinned and > + * locked. Again, how is deadlocking / max number of buffers handled, and why do they all need to be locked at the same time? > + /* > + * We don't yet know if this record needs a transaction header (ie is the > + * first undo record for a given transaction in a given undo log), because > + * you can only find out by allocating. We'll resolve this circularity by > + * allocating enough space for a transaction header. We'll only advance > + * by as many bytes as we turn out to need. > + */ Why can we only find this out by allocating? This seems like an API deficiency of the storage layer to me. The information is in the und log slot's metadata, no? 
> + urec->uur_next = InvalidUndoRecPtr; > + UndoRecordSetInfo(urec); > + urec->uur_info |= UREC_INFO_TRANSACTION; > + urec->uur_info |= UREC_INFO_LOGSWITCH; > + size = UndoRecordExpectedSize(urec); > + > + /* Allocate space for the record. */ > + if (InRecovery) > + { > + /* > + * We'll figure out where the space needs to be allocated by > + * inspecting the xlog_record. > + */ > + Assert(context->alloc_context.persistence == UNDO_PERMANENT); > + urecptr = UndoLogAllocateInRecovery(&context->alloc_context, > + XidFromFullTransactionId(txid), > + size, > + &need_xact_header, > + &last_xact_start, > + &prevlog_xact_start, > + &prevlogurp); > + } > + else > + { > + /* Allocate space for writing the undo record. */ That's basically the same comment as before the if. > + urecptr = UndoLogAllocate(&context->alloc_context, > + size, > + &need_xact_header, &last_xact_start, > + &prevlog_xact_start, &prevlog_insert_urp); > + > + /* > + * If prevlog_xact_start is a valid undo record pointer that means > + * this transaction's undo records are split across undo logs. > + */ > + if (UndoRecPtrIsValid(prevlog_xact_start)) > + { > + uint16 prevlen; > + > + /* > + * If undo log is switch during transaction then we must get a "is switch" is right. > +/* > + * Insert a previously-prepared undo records. s/a// More tomorrow. Greetings, Andres Freund
On Tue, Jul 30, 2019 at 12:21 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Tue, Jul 30, 2019 at 5:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I think this idea is good for the DO time but during REDO time it will > > not work as we will not have the transaction state. Having said that > > the current idea of keeping in the global variable will also not work > > during REDO time because the WAL from the different transaction can be > > interleaved. There are few ideas to handle this issue > > > > 1. At DO time keep in TopTransactionState as you suggested and during > > recovery time read from the first complete record on the page. > > 2. Just to keep the code uniform always read from the first complete > > record of the page. > > It contains another > suggestion for that problem you just mentioned (and also me pointing > out what you just pointed out, since I wrote it earlier) though I'm > not sure if it's better than your options above. Thanks, Thomas for your review, Currently, I am replying to the problem which both of us has identified and found a different set of solutions. I will go through other comments soon and work on those. > +UndoRecPtr > +PrepareUndoInsert(UndoRecordInsertContext *context, > + UnpackedUndoRecord *urec, > + Oid dbid) > +{ > ... > + /* Fetch compression info for the transaction. */ > + compression_info = GetTopTransactionUndoCompressionInfo(category); > > How can this work correctly in recovery? [Edit: it doesn't, as you > just pointed out] > > > One data structure that could perhaps hold this would be > UndoLogTableEntry (the per-backend cache, indexed by undo log number, > with pretty fast lookups; used for things like > UndoLogNumberGetCategory()). As long as you never want to have > inter-transaction compression, that should have the right scope to > give recovery per-undo log tracking. If you ever wanted to do > compression between transactions too, maybe UndoLogSlot could work, > but that'd have more complications. I think this could be a good idea. I had thought of keeping in the slot as my 3rd option but later I removed it thinking that we need to expose the compression field to the undo log layer. I think keeping in the UndoLogTableEntry is a better idea than keeping in the slot. But, I still have the same problem that we need to expose undo record-level fields to undo log layer to compute the cache entry size. OTOH, If we decide to get from the first record of the page (as I mentioned up thread) then I don't think there is any performance issue because we are inserting on the same page. But, for doing that we need to unpack the complete undo record (guaranteed to be on one page). And, UnpackUndoData will internally unpack the payload data as well which is not required in our case unless we change UnpackUndoData such that it unpacks only what the caller wants (one input parameter will do). I am not sure out of these two which idea is better? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi Amit I've been testing some undo worker workloads (more on that soon), but here's a small thing: I managed to reach an LWLock self-deadlock in the undo worker launcher: diff --git a/src/backend/access/undo/undorequest.c b/src/backend/access/undo/undorequest.c ... +bool +UndoGetWork(bool allow_peek, bool remove_from_queue, UndoRequestInfo *urinfo, ... + /* Search the queues under lock as they can be modified concurrently. */ + LWLockAcquire(RollbackRequestLock, LW_EXCLUSIVE); ... + RollbackHTRemoveEntry(rh->full_xid, rh->start_urec_ptr); ^ but that function acquires the same lock, leading to: (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP * frame #0: 0x00007fff5d110106 libsystem_kernel.dylib`semop + 10 frame #1: 0x0000000104bbf24c postgres`PGSemaphoreLock(sema=0x000000010e216a08) at pg_sema.c:428:15 frame #2: 0x0000000104c90186 postgres`LWLockAcquire(lock=0x000000010e218300, mode=LW_EXCLUSIVE) at lwlock.c:1246:4 frame #3: 0x000000010487463d postgres`RollbackHTRemoveEntry(full_xid=(value = 89144), start_urec_ptr=20890721090967) at undorequest.c:1717:2 frame #4: 0x0000000104873dbe postgres`UndoGetWork(allow_peek=false, remove_from_queue=false, urinfo=0x00007ffeeb4d3e30, in_other_db_out=0x0000000000000000) at undorequest.c:1388:5 frame #5: 0x0000000104876211 postgres`UndoLauncherMain(main_arg=0) at undoworker.c:607:7 ... (lldb) print held_lwlocks[0] (LWLockHandle) $0 = { lock = 0x000000010e218300 mode = LW_EXCLUSIVE } -- Thomas Munro https://enterprisedb.com
On Tue, Jul 30, 2019 at 1:32 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > Amit, short note: The patches aren't attached in patch order. Obviously > a miniscule thing, but still nicer if that's not the case. > Noted, I will try to ensure that patches are in order in future posts. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > Hi Amit > > I've been testing some undo worker workloads (more on that soon), > One small point, there is one small bug in the error queues which is that the element pushed into error queue doesn't have an updated value of to_urec_ptr which is important to construct the hash key. This will lead to undolauncher/worker think that the action for the same is already processed and it removes the same from the hash table. I have a fix for the same which I will share in next version of the patch (which I am going to share in the next day or two). > but > here's a small thing: I managed to reach an LWLock self-deadlock in > the undo worker launcher: > I could see the problem, will fix in next version. Thank you for reviewing and testing this. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 16, 2019 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > 2. Introduced a new RMGR callback rm_undo_status. It is used to > > decide when record sets in the UNDO_SHARED category should be > > discarded (instead of the usual single xid-based rules). The possible > > answers are "discard me now!", "ask me again when a given XID is all > > visible", and "ask me again when a given XID is no longer running". > > From the minor nitpicking department, the patches from this stack that > are updating rmgrlist.h are consistently failing to update the comment > line preceding the list of PG_RMGR() lines. This looks to be patches > 0014 and 0015 in this stack; 0015 seems to need to be squashed into > 0014. > Fixed. You can verify in patch 0011-Infrastructure-to-execute-pending-undo-actions. The 'mask' was missing in the list as well which I have added here, we might want to commit that separately. > Reviewing Amit's 0016: > > performUndoActions appears to be badly-designed. For starters, it's > sometimes wrong: the only place it gets set to true is in > UndoActionsRequired (which is badly named, because from the name you > expect it to return a Boolean and to not have side effects, but > instead it doesn't return anything and does have side effects). > UndoActionsRequired() only gets called from selected places, like > AbortCurrentTransaction(), so the rest of the time it just returns a > wrong answer. Now maybe it's never called at those times, but there's > no guard to prevent a function like CanPerformUndoActions() (which is > also badly named, because performUndoActions tells you whether you > need to perform undo actions, not whether it's possible to perform > undo actions) from being called before the flag is set. I think that > this flag should be either (1) maintained eagerly - so that wherever > we set start_urec_ptr we also set the flag right away or (2) removed - > so when we need to know, we just loop over all of the undo categories > on the spot, which is not that expensive because there aren't that > many of them. > I have taken approach-2 to fix this. > It seems pointless to make PrepareTransaction() take undo pointers as > arguments, because those pointers are just extracted from the > transaction state, to which PrepareTransaction() has a pointer. > Fixed. > Thomas has already objected to another proposal to add functions that > turn 32-bit XIDs into 64-bit XIDs. Therefore, I feel confident in > predicting that he will likewise object to GetEpochForXid. I think > this needs to be changed somehow, maybe by doing what the XXX comment > you added suggests. > I will fix this later. I think we can separately write a patch to extend Two-phase file to use fulltransactionid and then use it here. > This patch has some problems with naming consistency. There's a > function called PushUndoRequest() which calls a function called > RegisterRollbackReq() to do the heart of the work. So, is it undo or > rollback? Are we pushing or registering? Is it a request or a req? > For bonus points, the flag that the function sets is called > undo_req_pushed, which is halfway in between the two competing > terminologies. Other gripes about PushUndoRequest: push is vague and > doesn't really explain what's happening, "apllying" is a typo, > per_level is a poor variable name and shouldn't be declared volatile. 
> This function has problems with naming in other places, too; please go > through all of the names carefully and make them consistent and > adequately descriptive. > I have changed the namings to make them consistent. If you see anything else, then do let me know. > I am not a fan of applying_subxact_undo. I think we should look for a > better design there. A couple of things occur to me. One is that we > don't necessarily need to go to FATAL; we could just force the current > transaction and all of its subtransactions fail all the way out to the > top level, but then perhaps allow new transactions to be started > afterwards. I'm not sure that's worth it, but it would work, and I > think it has precedent in SxactIsDoomed. Assuming we're going to stick > with the current FATAL plan, I think we should do something like > invent a new kind of critical section that forces ERROR to be promoted > to FATAL and then use it here. We could call it a semi-critical or > locally-critical section, and the undo machinery could use it, but > then also so could other things. I've wanted that sort of concept > before, so I think it's a good idea to try to have something general > and independent of undo. The same concept could be used in > PerformUndoActions() instead of having to invent > pg_rethrow_as_fatal(), so we'd have two uses for this mechanism right > away. > Okay, I have developed the concept of semi-critical section and used it for sub-transactions and temp tables. Kindly check if this is something that you have in mind? > FinishPreparedTransactions() tries to apply undo actions while > interrupts are still held. Is that necessary? Can we avoid it? > Fixed. > It seems highly likely that the logic added to the TBLOCK_SUBCOMMIT > case inside CommitTransactionCommand and also into > ReleaseCurrentSubTransaction should have been added to > CommitSubTransaction instead. If that's not true, then we have to > believe that the TBLOCK_SUBRELEASE call to CommitSubTransaction needs > different treatment from the other two cases, which sounds unlikely; > we also have to explain why undo is somehow different from all of > these other releases that are already handled in that function, not in > its callers. I also strongly suspect it is altogether wrong to do > this before CommitSubTransaction sets s->state to TRANS_COMMIT; what > if a subxact callback throws an error? > > For related reasons, I don't think that the change ReleaseSavepoint() > are right either. Notice the header comment: "As above, we don't > actually do anything here except change blockState." The "as above" > part of the comment probably didn't originally refer to > DefineSavepoint(), which definitely does do other stuff, but to > something like EndImplicitTransactionBlock() or EndTransactionBlock(), > and DefineSavepoint() got stuck in the middle later. Anyway, your > patch makes the comment false by doing actual state changes in this > function, rather than just marking the subtransactions for commit. > But why should that be right? If none of the many other bits of state > are manipulated here rather than in CommitSubTransaction(), why is > undo the one thing that is different? I guess this is basically just > compensation for the lack of any of this code in the TBLOCK_SUBRELEASE > path which I noted in the previous paragraph, but I still think the > right answer is to put it all in CommitSubTransaction() *after* we set > TRANS_COMMIT. > Changed as per suggestion. 
> There are a number of things I either don't like or don't understand > about PerformUndoActions. One is that undo_req_pushed gets passed to > this function. That just looks really odd from an abstraction point > of view. Basically, we have a function whose job is to "perform undo > actions," and it gets a flag as an argument that tells it to not > actually perform some of the undo actions: that's odd. I think the > reason it's like that is because of the issue we've been discussing > elsewhere that there's a separate undo request for each category. If > you didn't have that, you wouldn't need to do this here. I'm not > saying that proves that the one-request-per-persistence-level design > is definitely wrong, but this is certainly not a point in its favor, > at least IMHO. > I think we have discussed in detail about one-request-per-persistence-level design and I will investigate it to see if we can make it one-request-per-transaction and if not what are the challenges and can we overcome them without significantly more work and complexity. So for now, I have not changed anything related to this point. Apart from these comments, I have changed a few more things: a. Changed TWOPHASE_MAGIC define as we are changing TwoPhaseFileHeader. b. Fixed comments by Dilip on same patch [1]. I will respond to them separately. c. Fixed the problem reported by Thomas [2] and one similar problem in an error queue noticed by me. I have still not addressed all the comments raised. This is mainly to unblock Thomas's test and share whatever is done until now. I am posting all the patches, but have not modified anything related to undo-log and undo-interface patches (aka from 0001 to 0008). [1] - https://www.postgresql.org/message-id/CAFiTN-tObs5BQZETqK12QuOz7nPSXb90PdG49AzK2ZJ4ts1c5g%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CA%2BhUKGLv016-1y%3DCwx%2Bmme%2BcFRD5Bn03%3D2JVFnRB7JMLsA35%3Dw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Move-some-md.c-specific-logic-from-smgr.c-to-md.c.patch
- 0002-Prepare-to-support-multiple-SMGR-implementations.patch
- 0003-Add-undo-log-manager.patch
- 0004-Allow-WAL-record-data-on-first-modification-after-a-.patch
- 0005-Add-prefetch-support-for-the-undo-log.patch
- 0006-Defect-and-enhancement-in-multi-log-support.patch
- 0007-Provide-interfaces-to-store-and-fetch-undo-records.patch
- 0008-undo-page-consistency-checker.patch
- 0008-undo-page-consistency-checker-1.patch
- 0009-Extend-binary-heap-functionality.patch
- 0010-Infrastructure-to-register-and-fetch-undo-action-req.patch
- 0011-Infrastructure-to-execute-pending-undo-actions.patch
- 0012-Allow-foreground-transactions-to-perform-undo-action.patch
- 0013-Allow-execution-and-discard-of-undo-by-background-wo.patch
On Wed, Jul 24, 2019 at 10:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > Please find my review comments for > 0013-Allow-foreground-transactions-to-perform-undo-action > > + /* initialize undo record locations for the transaction */ > + for (i = 0; i < UndoLogCategories; i++) > + { > + s->start_urec_ptr[i] = InvalidUndoRecPtr; > + s->latest_urec_ptr[i] = InvalidUndoRecPtr; > + s->undo_req_pushed[i] = false; > + } > > Can't we just memset this memory? > Yeah, that sounds better, so changed. > > + * We can't postpone applying undo actions for subtransactions as the > + * modifications made by aborted subtransaction must not be visible even if > + * the main transaction commits. > + */ > + if (IsSubTransaction()) > + return; > > I am not completely sure but is it possible that the outer function > CommitTransactionCommand/AbortCurrentTransaction can avoid > calling this function in the switch case based on the current state, > so that under subtransaction this will never be called? > I have already explained as a separate response to this email why I don't think this is a very good idea. > > + /* > + * Prepare required undo request info so that it can be used in > + * exception. > + */ > + ResetUndoRequestInfo(&urinfo); > + urinfo.dbid = dbid; > + urinfo.full_xid = fxid; > + urinfo.start_urec_ptr = start_urec_ptr[per_level]; > + > > I see that we are preparing urinfo before execute_undo_actions so that > in case of an error in CATCH we can use that to > insert into the queue, but can we just initialize urinfo right there > before inserting into the queue, we have all the information > Am I missing something? > IIRC, the only idea was that we can use the same variable (urinfo.full_xid) in execute_undo_actions call and in the catch block, but I think your suggestion sounds better as we can avoid declaring urinfo as volatile in that case. > + > + /* > + * We need the locations of the start and end undo record pointers when > + * rollbacks are to be performed for prepared transactions using undo-based > + * relations. We need to store this information in the file as the user > + * might rollback the prepared transaction after recovery and for that we > + * need it's start and end undo locations. > + */ > + UndoRecPtr start_urec_ptr[UndoLogCategories]; > + UndoRecPtr end_urec_ptr[UndoLogCategories]; > > it's -> its > > .. > > We must have some comments to explain how performUndoActions is used, > where it's set. If it's explained somewhere else then we can > give reference to that code. > > + for (i = 0; i < UndoLogCategories; i++) > + { > + if (s->latest_urec_ptr[i]) > + { > + s->performUndoActions = true; > + break; > + } > + } > > I think we should chek UndoRecPtrIsValid(s->latest_urec_ptr[i]) > Changed as per suggestion. > + PG_TRY(); > + { > + /* > + * Prepare required undo request info so that it can be used in > + * exception. > + */ > + ResetUndoRequestInfo(&urinfo); > + urinfo.dbid = dbid; > + urinfo.full_xid = fxid; > + urinfo.start_urec_ptr = start_urec_ptr[per_level]; > + > + /* for subtransactions, we do partial rollback. 
*/ > + execute_undo_actions(urinfo.full_xid, > + end_urec_ptr[per_level], > + start_urec_ptr[per_level], > + !isSubTrans); > + } > + PG_CATCH(); > > Wouldn't it be good to explain in comments that we are not rethrowing > the error in PG_CATCH but because we don't want the main > transaction to get an error if there is an error while applying to > undo action for the main transaction and we will abort the transaction > in the caller of this function? > I have added a comment atop of the function containing this code. > +tables are only accessible in the backend that has created them. We can't > +postpone applying undo actions for subtransactions as the modifications > +made by aborted subtransaction must not be visible even if the main transaction > +commits. > > I think we need to give detail reasoning why subtransaction changes > will be visible if we don't apply it's undo and the main > the transaction commits by mentioning that we don't use separate > transaction id for the subtransaction and that will make all the > changes of the transaction id visible when it commits. > I have added a detailed explanation in execute_undo_actions() and given a reference of same here. The changes are present in the patch series just posted by me [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 31, 2019 at 10:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > Hi Amit > > > > I've been testing some undo worker workloads (more on that soon), > > > > One small point, there is one small bug in the error queues which is > that the element pushed into error queue doesn't have an updated value > of to_urec_ptr which is important to construct the hash key. This > will lead to undolauncher/worker think that the action for the same is > already processed and it removes the same from the hash table. I have > a fix for the same which I will share in next version of the patch > (which I am going to share in the next day or two). > > > but > > here's a small thing: I managed to reach an LWLock self-deadlock in > > the undo worker launcher: > > > > I could see the problem, will fix in next version. > Fixed both of these problems in the patch just posted by me [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 22, 2019 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I have reviewed 0012-Infrastructure-to-execute-pending-undo-actions, > Please find my comment so far. .. > 4. > +void > +undoaction_redo(XLogReaderState *record) > +{ > + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; > + > + switch (info) > + { > + case XLOG_UNDO_APPLY_PROGRESS: > + undo_xlog_apply_progress(record); > + break; > > For HotStandby it doesn't make sense to apply this wal as this > progress is only required when we try to apply the undo action after > restart > but in HotStandby we never apply undo actions. > Hmm, I think it is required. Think what if Hotstandby is later promoted to master and a large part of undo is already applied? In such a case, we can skip the already applied undo. > 6. > + if ((slot == NULL) || (UndoRecPtrGetLogNo(urecptr) != slot->logno)) > + slot = UndoLogGetSlot(UndoRecPtrGetLogNo(urecptr), false); > + > + Assert(slot != NULL); > We are passing missing_ok as false in UndoLogGetSlot. But, not sure > why we are expecting that undo lot can not be dropped. In multi-log > transaction it's possible > that the tablespace in which next undolog is there is already dropped? > If the transaction spans multiple logs, then both the logs should be in the same tablespace. So, how is it possible to drop the tablespace when part of undo is still pending? AFAICS, the code in choose_undo_tablespace() doesn't seem to allow switching tablespace for the same transaction, but I agree if someone used a different algorithm, then it might be possible. I think the important question is whether we should allow the same transactions undo to span across tablespaces? If so, then what you are telling makes sense and we should handle that, if not, then I think we are fine here. One might argue that there should be some more strong checks to ensure that the same transaction will always get the undo logs from the same tablespace, but I think that is a different thing then what you are raising here. Thomas, others, do you have any opinion on this matter? In FindUndoEndLocationAndSize, there is a check if the next log is discarded (Case 4: If the transaction is overflowed to ...), won't this case (considering it is possible) get detected by that check? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
I had a look at the UNDO patches at https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com, and at the patch to use the UNDO logs to clean up orphaned files, from undo-2019-05-10.tgz earlier in this thread. Are these the latest ones to review? Thanks Thomas and Amit and others for working on this! Orphaned relfiles has been an ugly wart forever. It's a small thing, but really nice to fix that finally. This has been a long thread, and I haven't read it all, so please forgive me if I repeat stuff that's already been discussed. There are similar issues in CREATE/DROP DATABASE code. If you crash in the middle of CREATE DATABASE, you can be left with orphaned files in the data directory, or if you crash in the middle of DROP DATABASE, the data might be gone already but the pg_database entry is still there. We should plug those holes too. There's a lot of stuff in the patches that are not relevant for cleaning up orphaned files. I know this cleaning up orphaned files work is mainly a vehicle to get the UNDO log committed, so that's expected. If we only cared about orphaned files, I'm sure the patches wouldn't spend so much effort on concurrency, for example. Nevertheless, I think we should leave out some stuff that's clearly unused, for now. For example, a bunch of fields in the record format: uur_block, uur_offset, uur_tuple. You can add them later, as part of the patches that actually need them, but for now they just make the patch larger to review. Some more thoughts on the record format: I feel that the level of abstraction is not quite right. There are a bunch of fields, like uur_block, uur_offset, uur_tuple, that are probably useful for some UNDO resource managers (zheap I presume), but seem kind of arbitrary. How is uur_tuple different from uur_payload? Should they be named more generically as uur_payload1 and uur_payload2? And why two, why not three or four different payloads? In the WAL record format, there's a concept of "block id", which allows you to store N number of different payloads in the record, I think that would be a better approach. Or only have one payload, and let the resource manager code divide it as it sees fit. Many of the fields support a primitive type of compression, where a field can be omitted if it has the same value as on the first record on an UNDO page. That's handy. But again I don't like the fact that the fields have been hard-coded into the UNDO record format. I can see e.g. the relation oid to be useful for many AMs. But not all. And other AMs might well want to store and deduplicate other things, aside from the fields that are in the patch now. I'd like to move most of the fields to AM specific code, and somehow generalize the compression. One approach would be to let the AM store an arbitrary struct, and run it through a general-purpose compression algorithm, using the UNDO page's first record as the "dictionary". Or make the UNDO page's first record available in whole to the AM specific code, and let the AM do the deduplication. For cleaning up orphaned files, though, we don't really care about any of that, so I'd recommend just ripping it out for now. Compression/deduplication can be added later as a separate patch. The orphaned-file cleanup patch doesn't actually use the uur_reloid field. It stores the RelFileNode instead, in the paylod. I think that's further evidence that the hard-coded fields in the record format are not quite right. 
I don't like the way UndoFetchRecord returns a palloc'd UnpackedUndoRecord. I would prefer something similar to the xlogreader API, where a new call to UndoFetchRecord invalidates the previous result. On efficiency grounds, to avoid the palloc, but also to be consistent with xlogreader. In the UNDO page header, there are a bunch of fields like pd_lower/pd_upper/pd_special that are copied from the "standard" page header, that are unused. There's a FIXME comment about that too. Let's remove them, there's no need for UNDO pages to look like standard relation pages. The LSN needs to be at the beginning, to work with the buffer manager, but that's the only requirement. Could we leave out the UNDO and discard worker processes for now? Execute all UNDO actions immediately at rollback, and after crash recovery. That would be fine for cleaning up orphaned files, and it would cut down the size of the patch to review. Can this race condition happen: Transaction A creates a table and an UNDO record to remember it. The transaction is rolled back, and the file is removed. Another transaction, B, creates a different table, and chooses the same relfilenode. It loads the table with data, and commits. Then the system crashes. After crash recovery, the UNDO record for the first transaction is applied, and it removes the file that belongs to the second table, created by transaction B. - Heikki
On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I had a look at the UNDO patches at > https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com, > and at the patch to use the UNDO logs to clean up orphaned files, from > undo-2019-05-10.tgz earlier in this thread. Are these the latest ones to > review? > Yes, I am not sure of cleanup orphan file patch (Thomas can confirm the same), but others are latest. > Thanks Thomas and Amit and others for working on this! Orphaned relfiles > has been an ugly wart forever. It's a small thing, but really nice to > fix that finally. This has been a long thread, and I haven't read it > all, so please forgive me if I repeat stuff that's already been discussed. > > There are similar issues in CREATE/DROP DATABASE code. If you crash in > the middle of CREATE DATABASE, you can be left with orphaned files in > the data directory, or if you crash in the middle of DROP DATABASE, the > data might be gone already but the pg_database entry is still there. We > should plug those holes too. > +1. Interesting. > There's a lot of stuff in the patches that are not relevant for cleaning > up orphaned files. I know this cleaning up orphaned files work is mainly > a vehicle to get the UNDO log committed, so that's expected. If we only > cared about orphaned files, I'm sure the patches wouldn't spend so much > effort on concurrency, for example. Nevertheless, I think we should > leave out some stuff that's clearly unused, for now. For example, a > bunch of fields in the record format: uur_block, uur_offset, uur_tuple. > You can add them later, as part of the patches that actually need them, > but for now they just make the patch larger to review. > > Some more thoughts on the record format: > > I feel that the level of abstraction is not quite right. There are a > bunch of fields, like uur_block, uur_offset, uur_tuple, that are > probably useful for some UNDO resource managers (zheap I presume), but > seem kind of arbitrary. How is uur_tuple different from uur_payload? > The uur_tuple field can only store tuple whereas uur_payload can have miscellaneous information. For ex. in zheap, we store transaction information like CID, CTID, some information related to TPD, etc. in the payload. Basically, I think eventually payload will have some bitmap to indicate what all is stored in it. OTOH, I agree that if we want we can store tuple as well in the payload. > Should they be named more generically as uur_payload1 and uur_payload2? > And why two, why not three or four different payloads? In the WAL record > format, there's a concept of "block id", which allows you to store N > number of different payloads in the record, I think that would be a > better approach. Or only have one payload, and let the resource manager > code divide it as it sees fit. > For payload, something like what you describe here sounds like a good idea, but I feel we can have tuple as a separate field. It will help in accessing tuple quickly and easily during visibility or rollbacks for some AM's like zheap. > Many of the fields support a primitive type of compression, where a > field can be omitted if it has the same value as on the first record on > an UNDO page. That's handy. But again I don't like the fact that the > fields have been hard-coded into the UNDO record format. I can see e.g. > the relation oid to be useful for many AMs. But not all. 
And other AMs > might well want to store and deduplicate other things, aside from the > fields that are in the patch now. I'd like to move most of the fields to > AM specific code, and somehow generalize the compression. One approach > would be to let the AM store an arbitrary struct, and run it through a > general-purpose compression algorithm, using the UNDO page's first > record as the "dictionary". Or make the UNDO page's first record > available in whole to the AM specific code, and let the AM do the > deduplication. For cleaning up orphaned files, though, we don't really > care about any of that, so I'd recommend just ripping it out for now. > Compression/deduplication can be added later as a separate patch. > I think this will make the undorecord-interface patch a bit simpler as well. > The orphaned-file cleanup patch doesn't actually use the uur_reloid > field. It stores the RelFileNode instead, in the paylod. I think that's > further evidence that the hard-coded fields in the record format are not > quite right. > > > I don't like the way UndoFetchRecord returns a palloc'd > UnpackedUndoRecord. I would prefer something similar to the xlogreader > API, where a new call to UndoFetchRecord invalidates the previous > result. On efficiency grounds, to avoid the palloc, but also to be > consistent with xlogreader. > > In the UNDO page header, there are a bunch of fields like > pd_lower/pd_upper/pd_special that are copied from the "standard" page > header, that are unused. There's a FIXME comment about that too. Let's > remove them, there's no need for UNDO pages to look like standard > relation pages. The LSN needs to be at the beginning, to work with the > buffer manager, but that's the only requirement. > > Could we leave out the UNDO and discard worker processes for now? > Execute all UNDO actions immediately at rollback, and after crash > recovery. That would be fine for cleaning up orphaned files, > Even if we execute all the undo actions on rollback, we need discard worker to discard undo at regular intervals. Also, what if we get an error while applying undo actions during rollback? Right now, we have a mechanism to push such a request to background worker and allow the session to continue. Instead, we might want to Panic in such cases if we don't want to have background undo workers. > and it > would cut down the size of the patch to review. > If we can find some way to handle all cases and everyone agrees to it, that would be good. In fact, we can try to get the basic stuff committed first and then try to get the rest (undo-worker machinery) done. > Can this race condition happen: Transaction A creates a table and an > UNDO record to remember it. The transaction is rolled back, and the file > is removed. Another transaction, B, creates a different table, and > chooses the same relfilenode. It loads the table with data, and commits. > Then the system crashes. After crash recovery, the UNDO record for the > first transaction is applied, and it removes the file that belongs to > the second table, created by transaction B. > I don't think such a race exists, but we should verify it once. Basically, once the rollback is complete, we mark the transaction rollback as complete in the transaction header in undo and write a WAL for it. After crash-recovery, we will skip such a transaction. Isn't that sufficient to prevent such a race condition? Thank you for looking into this work. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Aug 5, 2019 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I had a look at the UNDO patches at > > https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com, > > and at the patch to use the UNDO logs to clean up orphaned files, from > > undo-2019-05-10.tgz earlier in this thread. Are these the latest ones to > > review? > > Yes, I am not sure of cleanup orphan file patch (Thomas can confirm > the same), but others are latest. I have a new patch set to post soon, handling all the feedback that arrived in the past couple of weeks from 5 different reviewers (thanks all!). > > There are similar issues in CREATE/DROP DATABASE code. If you crash in > > the middle of CREATE DATABASE, you can be left with orphaned files in > > the data directory, or if you crash in the middle of DROP DATABASE, the > > data might be gone already but the pg_database entry is still there. We > > should plug those holes too. > > > > +1. Interesting. Huh. Right. > > Could we leave out the UNDO and discard worker processes for now? > > Execute all UNDO actions immediately at rollback, and after crash > > recovery. That would be fine for cleaning up orphaned files, > > > > Even if we execute all the undo actions on rollback, we need discard > worker to discard undo at regular intervals. Also, what if we get an > error while applying undo actions during rollback? Right now, we have > a mechanism to push such a request to background worker and allow the > session to continue. Instead, we might want to Panic in such cases if > we don't want to have background undo workers. > > > and it > > would cut down the size of the patch to review. > > If we can find some way to handle all cases and everyone agrees to it, > that would be good. In fact, we can try to get the basic stuff > committed first and then try to get the rest (undo-worker machinery) > done. I think it's definitely worth exploring. > > Can this race condition happen: Transaction A creates a table and an > > UNDO record to remember it. The transaction is rolled back, and the file > > is removed. Another transaction, B, creates a different table, and > > chooses the same relfilenode. It loads the table with data, and commits. > > Then the system crashes. After crash recovery, the UNDO record for the > > first transaction is applied, and it removes the file that belongs to > > the second table, created by transaction B. > > I don't think such a race exists, but we should verify it once. > Basically, once the rollback is complete, we mark the transaction > rollback as complete in the transaction header in undo and write a WAL > for it. After crash-recovery, we will skip such a transaction. Isn't > that sufficient to prevent such a race condition? The usual protection against relfilenode recycling applies: we don't actually remove the files on disk until after the next checkpoint, following the successful rollback. That is, executing the rollback doesn't actually remove any files immediately, so you can't reuse the OID yet. There might be some problems like that if we tried to handle the CREATE DATABASE orphans you mentioned too naively though. Not sure. > Thank you for looking into this work. +1 -- Thomas Munro https://enterprisedb.com
On 05/08/2019 07:23, Thomas Munro wrote: > On Mon, Aug 5, 2019 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> Could we leave out the UNDO and discard worker processes for now? >>> Execute all UNDO actions immediately at rollback, and after crash >>> recovery. That would be fine for cleaning up orphaned files, >> >> Even if we execute all the undo actions on rollback, we need discard >> worker to discard undo at regular intervals. Also, what if we get an >> error while applying undo actions during rollback? Right now, we have >> a mechanism to push such a request to background worker and allow the >> session to continue. Instead, we might want to Panic in such cases if >> we don't want to have background undo workers. >> >>> and it >>> would cut down the size of the patch to review. >> >> If we can find some way to handle all cases and everyone agrees to it, >> that would be good. In fact, we can try to get the basic stuff >> committed first and then try to get the rest (undo-worker machinery) >> done. > > I think it's definitely worth exploring. Yeah. For cleaning up orphaned files, if unlink() fails, we can just log the error and move on. That's what we do in the main codepath, too. For any other error, PANIC seems ok. We're not expecting any errors during undo processing, so it doesn't seem safe to continue running. Hmm. Since applying the undo record is WAL-logged, you could run out of disk space while creating the WAL record. That seems unpleasant. >>> Can this race condition happen: Transaction A creates a table and an >>> UNDO record to remember it. The transaction is rolled back, and the file >>> is removed. Another transaction, B, creates a different table, and >>> chooses the same relfilenode. It loads the table with data, and commits. >>> Then the system crashes. After crash recovery, the UNDO record for the >>> first transaction is applied, and it removes the file that belongs to >>> the second table, created by transaction B. >> >> I don't think such a race exists, but we should verify it once. >> Basically, once the rollback is complete, we mark the transaction >> rollback as complete in the transaction header in undo and write a WAL >> for it. After crash-recovery, we will skip such a transaction. Isn't >> that sufficient to prevent such a race condition? Ok, I didn't realize there's a flag in the undo record to mark it as applied. Yeah, that fixes it. Seems a bit heavy-weight, but I guess it's fine. Do you do something different in zheap? I presume writing a WAL record for every applied undo record would be too heavy there. This needs some performance testing. We're creating one extra WAL record and one UNDO record for every file creation, and another WAL record on abort. It's probably cheap compared to all the other work done during table creation, but we should still get some numbers on it. Some regression tests would be nice too. - Heikki
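For illustration, Heikki's "log the error and move on" suggestion for the undo-apply path might look roughly like this (a sketch only; the surrounding apply function is imagined):

    /* While applying a Storage/CREATE undo record, try to remove the file. */
    if (unlink(path) < 0 && errno != ENOENT)
        ereport(LOG,
                (errcode_for_file_access(),
                 errmsg("could not remove file \"%s\": %m", path)));

Any other error raised while applying undo would then be promoted to PANIC, since without background undo workers there is no way to park the request and retry it later.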
On Mon, Aug 5, 2019 at 12:09 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 05/08/2019 07:23, Thomas Munro wrote: > > On Mon, Aug 5, 2019 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >>> Could we leave out the UNDO and discard worker processes for now? > >>> Execute all UNDO actions immediately at rollback, and after crash > >>> recovery. That would be fine for cleaning up orphaned files, > >> > >> Even if we execute all the undo actions on rollback, we need discard > >> worker to discard undo at regular intervals. Also, what if we get an > >> error while applying undo actions during rollback? Right now, we have > >> a mechanism to push such a request to background worker and allow the > >> session to continue. Instead, we might want to Panic in such cases if > >> we don't want to have background undo workers. > >> > >>> and it > >>> would cut down the size of the patch to review. > >> > >> If we can find some way to handle all cases and everyone agrees to it, > >> that would be good. In fact, we can try to get the basic stuff > >> committed first and then try to get the rest (undo-worker machinery) > >> done. > > > > I think it's definitely worth exploring. > > Yeah. For cleaning up orphaned files, if unlink() fails, we can just log > the error and move on. That's what we do in the main codepath, too. For > any other error, PANIC seems ok. We're not expecting any errors during > undo processing, so it doesn't seem safe to continue running. > > Hmm. Since applying the undo record is WAL-logged, you could run out of > disk space while creating the WAL record. That seems unpleasant. > We might get away with doing some minimal error handling for the orphaned-file cleanup patch, but this facility was supposed to be a generic one. Even assuming all of us agree on the error-handling stuff, I still think we won't be able to avoid the requirement for a discard worker to discard the logs. > >>> Can this race condition happen: Transaction A creates a table and an > >>> UNDO record to remember it. The transaction is rolled back, and the file > >>> is removed. Another transaction, B, creates a different table, and > >>> chooses the same relfilenode. It loads the table with data, and commits. > >>> Then the system crashes. After crash recovery, the UNDO record for the > >>> first transaction is applied, and it removes the file that belongs to > >>> the second table, created by transaction B. > >> > >> I don't think such a race exists, but we should verify it once. > >> Basically, once the rollback is complete, we mark the transaction > >> rollback as complete in the transaction header in undo and write a WAL > >> for it. After crash-recovery, we will skip such a transaction. Isn't > >> that sufficient to prevent such a race condition? > > Ok, I didn't realize there's a flag in the undo record to mark it as > applied. Yeah, that fixes it. Seems a bit heavy-weight, but I guess it's > fine. Do you do something different in zheap? I presume writing a WAL > record for every applied undo record would be too heavy there. > For zheap, we collect all the records of a page, apply them together and then write the entire page in WAL. The progress of transaction is updated at either transaction end (rollback complete) or after processing some threshold of undo records. So, generally, the WAL won't be for each undo record apply. > This needs some performance testing.
We're creating one extra WAL record > and one UNDO record for every file creation, and another WAL record on > abort. It's probably cheap compared to all the other work done during > table creation, but we should still get some numbers on it. > Makes sense. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
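For illustration, the per-page batching Amit describes might look roughly like this (a sketch with invented helper names, not the actual zheap code):

    /* Apply every collected undo record for one page, then log the page once. */
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    START_CRIT_SECTION();
    for (int i = 0; i < nrecords; i++)
        apply_one_undo_record(page, records[i]);    /* hypothetical */
    MarkBufferDirty(buffer);
    log_newpage_buffer(buffer, true);   /* one full-page WAL record for the page */
    END_CRIT_SECTION();
    UnlockReleaseBuffer(buffer);

That amortizes the WAL overhead over however many undo records touched the page.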
On Mon, Aug 5, 2019 at 6:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > For zheap, we collect all the records of a page, apply them together > and then write the entire page in WAL. The progress of transaction is > updated at either transaction end (rollback complete) or after > processing some threshold of undo records. So, generally, the WAL > won't be for each undo record apply. This explanation omits a crucial piece of the mechanism, because Heikki is asking what keeps the undo from being applied multiple times. When we apply the undo records to a page, we also adjust the undo pointers in the page. Since we have an undo pointer per transaction slot, and each transaction has its own slot, if we apply all the undo for a transaction to a page, we can just clear the slot; if we somehow end up back at the same point later, we'll know not to apply the undo a second time because we'll see that there's no transaction slot pointing to the undo we were thinking of applying. If we roll back to a savepoint, or for some other reason choose to apply only some of the undo to a page, we can set the undo record pointer for the transaction back to the value it had before we generated any newer undo. Then, we'll know that the newer undo doesn't need to be applied but the older undo can be applied. At least, I think that's how it's supposed to work. If you just update the progress field, it doesn't guarantee anything, because in the event of a crash, we could end up keeping the page changes but losing the update to the progress, as they are part of separate undo records. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
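Spelled out as code, the invariant might look like this (all names invented for illustration):

    /* Before applying one undo record for fxid to a page: */
    slot = page_get_transaction_slot(page, fxid);       /* hypothetical */
    if (slot == NULL || slot->urec_ptr < this_urec_ptr)
        return;     /* slot cleared or already rewound past us: already applied */

    apply_undo_to_page(page, record);                   /* hypothetical */
    slot->urec_ptr = record->prev_urec_ptr;             /* rewind; clear when fully undone */

Because the slot pointer and the page contents are updated under the same buffer lock and covered by the same WAL record, replaying the same undo record a second time falls out as a no-op.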
On Sun, Aug 4, 2019 at 5:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I feel that the level of abstraction is not quite right. There are a > bunch of fields, like uur_block, uur_offset, uur_tuple, that are > probably useful for some UNDO resource managers (zheap I presume), but > seem kind of arbitrary. How is uur_tuple different from uur_payload? > Should they be named more generically as uur_payload1 and uur_payload2? > And why two, why not three or four different payloads? In the WAL record > format, there's a concept of "block id", which allows you to store N > number of different payloads in the record, I think that would be a > better approach. Or only have one payload, and let the resource manager > code divide it as it sees fit. > > Many of the fields support a primitive type of compression, where a > field can be omitted if it has the same value as on the first record on > an UNDO page. That's handy. But again I don't like the fact that the > fields have been hard-coded into the UNDO record format. I can see e.g. > the relation oid to be useful for many AMs. But not all. And other AMs > might well want to store and deduplicate other things, aside from the > fields that are in the patch now. I'd like to move most of the fields to > AM specific code, and somehow generalize the compression. One approach > would be to let the AM store an arbitrary struct, and run it through a > general-purpose compression algorithm, using the UNDO page's first > record as the "dictionary". I thought about this, too. I agree that there's something a little unsatisfying about the current structure, but I haven't been able to come up with something that seems definitively better. I think something along the lines of what you are describing here might work well, but I am VERY doubtful about the idea of a fixed-size struct. I think AMs are going to want to store variable-length data: especially tuples, but maybe also other stuff. For instance, imagine some AM that wants to implement locking that's more fine-grained than the four levels of tuple locks we have today: instead of just having key locks and all-columns locks, you could want to store the exact columns to be locked. Or maybe your TIDs are variable-width. And the problem is that as soon as you move to something where you pack in a bunch of variable-sized fields, you lose the ability to refer to things using reasonable names. That's where I came up with the idea of an UnpackedUndoRecord: give the common fields that "everyone's going to need" human-readable names, and jam only the strange, AM-specific stuff into the payload. But if those needs are not actually universal but very much AM-specific, then I'm afraid we're going to end up with deeply inscrutable code for packing and unpacking records. I imagine it's possible to come up with a good structure for that, but I don't think we have one today. > I don't like the way UndoFetchRecord returns a palloc'd > UnpackedUndoRecord. I would prefer something similar to the xlogreader > API, where a new call to UndoFetchRecord invalidates the previous > result. On efficiency grounds, to avoid the palloc, but also to be > consistent with xlogreader. I don't think that's going to work very well, because we often need to deal with multiple records at a time.
There is (or was) a bulk-fetch interface, but I've also found while experimenting with this code that it can be useful to do things like:

current = undo_fetch(starting_record);
loop:
    next = undo_fetch(current->next_record_ptr);
    if some_test(next):
        break;
    undo_free(current);
    current = next;

I think we shouldn't view such cases as exceptions to the general paradigm of looking at undo records one at a time, but instead as the normal case for which everything is optimized. Cases like orphaned file cleanup where the number of undo records is probably small and they're all independent of each other will, I think, turn out to be the exception rather than the rule. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jul 30, 2019 at 4:02 AM Andres Freund <andres@anarazel.de> wrote: > I'm a bit worried about expanding the use of > ReadBufferWithoutRelcache(). Not so much because of the relcache itself, > but because it requires doing separate smgropen() calls. While not > crazily expensive, it's also not free. Especially combined with closing > all such relations at transaction end (c.f. AtEOXact_SMgr). > > I'm somewhat inclined to think that this requires a slightly bigger > refactoring than done in this patch. Imo at the very least the smgr > entries ought not to be unowned. But working towards not haven to > re-open the smgr entry for every single trival request ought to be part > of this too. I spent some time trying to analyze this today and I agree with you that there seems to be room for improvement here. When I first looked at your comments, I wasn't too convinced, because access patterns that skip around between undo logs seem like they may be fairly common. Admittedly, there are cases where we want to read from just one undo log over and over again, and it would be good to optimize those, but I was initially a bit unconvinced that there was a problem here worth being concerned about. Then I realized that you would also repeat the smgropen() if you read a single record that happens to be split across two pages, which seems a little silly. But then I realized that we're being a little silly even in the case where we're reading a single undo record that is stored entirely on a single page. We are certainly going to need to look up the undo log, but as things stand, we'll basically do it twice. For example, in the write path, we'll call UndoLogAllocate() and it will look up an UndoLogControl object for the undo log of interest, and then we'll call ReadBufferWithoutRelcache() which will call smgropen() which will do a hash table lookup to find the SMgrRelation associated with that undo log. That's not a large cost, as you say, but it does seem like it might be better to avoid having two different lookups in the same commonly-used code path, each of which peeks into a different backend-private data structure for information about the very same undo log. The obvious thing to do seems to be to have UndoLogControl objects own SmgrRelations. That would be something of a novelty, since it looks like currently only a Relation ever owns an SMgrRelation, but the smgr infrastructure seems to have been set up in a generic way so as to permit that sort of thing, so it seems like it should be workable. Perhaps the UndoLogAllocate() function could return a pointer to the UndoLogControl object as well as UndoRecPtr. Then, there could be a function UndoLogWrite(UndoLogControl *, UndoRecPtr, char *, Size). On the read side, instead of calling UndoRecPtrAssignRelFileNode, maybe the undo log storage layer should provide a function that again returns an UndoLogControl, and then we could have a matching function UndoLogRead(UndoLogControl *, UndoRecPtr, char *, Size). I think this kind of design would address your concerns about using the unowned list, too, since the UndoLogControl objects would be owning the SMgrRelations. It took me a while to understand why you were concerned about using the unowned list, so I'm going to repeat it in my own words to make sure I've got it right, and also to possibly help out anyone else who may also have had difficulty grokking your concern.
If we have a bunch of short transactions each of which accesses the same relation, the relcache entry will remain open and the file won't get closed in between, but if we have a bunch of short transactions each of which accesses the same undo log, the undo log will be closed and reopened at the operating system level for each individual transaction. That happens because when an SMgrRelation is "owned," the owner takes care of closing it, and so can keep it open across transactions, but when it's "unowned," it's automatically closed during transaction cleanup. And we should fix it, because closing and reopening the same file for every transaction unnecessarily might be expensive enough to matter, at least a little bit. How does all that sound? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
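As a sketch, that ownership change might look like this (struct and function names invented; smgropen() and smgrsetowner() are the existing smgr APIs):

    typedef struct UndoLogControl
    {
        UndoLogNumber   logno;
        RelFileNode     rnode;
        SMgrRelation    smgr;       /* owned, so AtEOXact_SMgr() leaves it open */
        /* ... other existing fields ... */
    } UndoLogControl;

    static void
    undo_log_attach_smgr(UndoLogControl *log)
    {
        if (log->smgr == NULL)
        {
            log->smgr = smgropen(log->rnode, InvalidBackendId);
            smgrsetowner(&log->smgr, log->smgr);    /* register ownership */
        }
    }

UndoLogWrite()/UndoLogRead() would then use log->smgr directly, giving one lookup per operation instead of two.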
Hi, On 2019-08-05 11:25:10 -0400, Robert Haas wrote: > The obvious thing to do seems to be to have UndoLogControl objects own > SmgrRelations. That would be something of a novelty, since it looks > like currently only a Relation ever owns an SMgrRelation, but the smgr > infrastructure seems to have been set up in a generic way so as to > permit that sort of thing, so it seems like it should be workable. Yea, I think that'd be a good step. I'm not 100% convinced it's quite enough, due to the way the undo smgr only ever has a single file descriptor open, and that undo log segments are fairly small, and that there'll often be multiple persistence levels active at the same time. But the undo fd handling is probably a separate concern from who owns the smgr relations. > I think this kind of design would address your concerns about using > the unowned list, too, since the UndoLogControl objects would be > owning the SMgrRelations. Yup. > How does all that sound? A good move in the right direction, imo. Greetings, Andres Freund
Hi, (As I was out of context due to dealing with bugs, I've switched to looking at the current zheap/undoprocessing branch.) On 2019-07-30 01:02:20 -0700, Andres Freund wrote: > +/* > + * Insert a previously-prepared undo records. > + * > + * This function will write the actual undo record into the buffers which are > + * already pinned and locked in PreparedUndoInsert, and mark them dirty. This > + * step should be performed inside a critical section. > + */ Again, I think it's not ok to just assume you can lock an essentially unbounded number of buffers. This seems almost guaranteed to result in deadlocks. And there's limits on how many lwlocks one can hold etc. As far as I can tell there's simply no deadlock avoidance scheme in use here *at all*? I must be missing something. > + /* Main loop for writing the undo record. */ > + do > + { I'd prefer this to not be a do{} while(true) loop - as written I need to read to the end to see what the condition is. I don't think we have any loops like that in the code. > + /* > + * During recovery, there might be some blocks which are already > + * deleted due to some discard command so we can just skip > + * inserting into those blocks. > + */ > + if (!BufferIsValid(buffer)) > + { > + Assert(InRecovery); > + > + /* > + * Skip actual writing just update the context so that we have > + * write offset for inserting into next blocks. > + */ > + SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte); > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + break; > + } How exactly can this happen? > + else > + { > + page = BufferGetPage(buffer); > + > + /* > + * Initialize the page whenever we try to write the first > + * record in page. We start writing immediately after the > + * block header. > + */ > + if (starting_byte == UndoLogBlockHeaderSize) > + UndoPageInit(page, BLCKSZ, prepared_undo->urec->uur_info, > + ucontext.already_processed, > + prepared_undo->urec->uur_tuple.len, > + prepared_undo->urec->uur_payload.len); > + > + /* > + * Try to insert the record into the current page. If it > + * doesn't succeed then recall the routine with the next page. > + */ > + InsertUndoData(&ucontext, page, starting_byte); > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + { > + MarkBufferDirty(buffer); > + break; At this point we're five indentation levels deep. I'd extract at least either the per prepared undo code or the code performing the writing across block boundaries into a separate function. Perhaps both. > +/* > + * Helper function for UndoGetOneRecord > + * > + * If any of rmid/reloid/xid/cid is not available in the undo record, then > + * it will get the information from the first complete undo record in the > + * page. > + */ > +static void > +GetCommonUndoRecInfo(UndoPackContext *ucontext, UndoRecPtr urp, > + RelFileNode rnode, UndoLogCategory category, Buffer buffer) > +{ > + /* > + * If any of the common header field is not available in the current undo > + * record then we must read it from the first complete record of the page. > + */ How is it guaranteed that the first record on the page is actually from the current transaction? Can't there be a situation where that's from another transaction? > +/* > + * Helper function for UndoFetchRecord and UndoBulkFetchRecord > + * > + * curbuf - If an input buffer is valid then this function will not release the > + * pin on that buffer.
If the buffer is not valid then it will assign curbuf > + * with the first buffer of the current undo record and also it will keep the > + * pin and lock on that buffer in a hope that while traversing the undo chain > + * the caller might want to read the previous undo record from the same block. > + */ Wait, so at exit *curbuf is pinned but not locked, if passed in, but is pinned *and* locked when not? That'd not be a sane API. I don't think the code works like that atm though. > +static UnpackedUndoRecord * > +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode, > + UndoLogCategory category, Buffer *curbuf) > +{ > + Page page; > + int starting_byte = UndoRecPtrGetPageOffset(urp); > + BlockNumber cur_blk; > + UndoPackContext ucontext = {{0}}; > + Buffer buffer = *curbuf; > + > + cur_blk = UndoRecPtrGetBlockNum(urp); > + > + /* Initiate unpacking one undo record. */ > + BeginUnpackUndo(&ucontext); > + > + while (true) > + { > + /* If we already have a buffer then no need to allocate a new one. */ > + if (!BufferIsValid(buffer)) > + { > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk, > + RBM_NORMAL, NULL, > + RelPersistenceForUndoLogCategory(category)); > + > + /* > + * Remember the first buffer where this undo started as next undo > + * record what we fetch might fall on the same buffer. > + */ > + if (!BufferIsValid(*curbuf)) > + *curbuf = buffer; > + } > + > + /* Acquire shared lock on the buffer before reading undo from it. */ > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > + > + page = BufferGetPage(buffer); > + > + UnpackUndoData(&ucontext, page, starting_byte); > + > + /* > + * We are done if we have reached to the done stage otherwise move to > + * next block and continue reading from there. > + */ > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + { > + if (buffer != *curbuf) > + UnlockReleaseBuffer(buffer); > + > + /* > + * Get any of the missing fields from the first record of the > + * page. > + */ > + GetCommonUndoRecInfo(&ucontext, urp, rnode, category, *curbuf); > + break; > + } > + > + /* > + * The record spans more than a page so we would have copied it (see > + * UnpackUndoRecord). In such cases, we can release the buffer. > + */ Where would it have been copied? Presumably in UnpackUndoData()? Imo the comment should say so. I'm a bit confused by the use of "would" in that comment. Either we have, or not? > + if (buffer != *curbuf) > + UnlockReleaseBuffer(buffer); Wait, so we *keep* the buffer locked if it the same as *curbuf? That can't be right. > + * Fetch the undo record for given undo record pointer. > + * > + * This will internally allocate the memory for the unpacked undo record which > + * intern will "intern" should probably be internally? But I'm not sure what the two "internally"s really add here. > +/* > + * Release the memory of the undo record allocated by UndoFetchRecord and > + * UndoBulkFetchRecord. > + */ > +void > +UndoRecordRelease(UnpackedUndoRecord *urec) > +{ > + /* Release the memory of payload data if we allocated it. */ > + if (urec->uur_payload.data) > + pfree(urec->uur_payload.data); > + > + /* Release memory of tuple data if we allocated it. */ > + if (urec->uur_tuple.data) > + pfree(urec->uur_tuple.data); > + > + /* Release memory of the transaction header if we allocated it. */ > + if (urec->uur_txn) > + pfree(urec->uur_txn); > + > + /* Release memory of the logswitch header if we allocated it. 
*/ > + if (urec->uur_logswitch) > + pfree(urec->uur_logswitch); > + > + /* Release the memory of the undo record. */ > + pfree(urec); > +} Those comments before each pfree are not useful. Also, isn't this both fairly slow and fairly failure prone? The next record is going to need all that memory again, no? It seems to me that there should be one record that's allocated once, and then reused over multiple fetches, increasing the size if necesssary. I'm very doubtful that all this freeing of individual allocations in the undo code makes sense. Shouldn't this just be done in short lived memory contexts, that then get reset as a whole? That's both far less failure prone, and faster. > + * one_page - Caller is applying undo only for one block not for > + * complete transaction. If this is set true then instead > + * of following transaction undo chain using prevlen we will > + * follow the block prev chain of the block so that we can > + * avoid reading many unnecessary undo records of the > + * transaction. > + */ > +UndoRecInfo * > +UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr, > + int undo_apply_size, int *nrecords, bool one_page) There's no caller for one_page mode in the series - I assume that's for later, during page-wise undo? It seems to behave in quite noticably different ways, is that really OK? Makes the code quite hard to understand. Also, it seems quite poorly named to me. It sounds like it's about fetching a single undo page (which makes no sense, obviously). But what it does is to switch to an entirely different way of traversing the undo chains. > + /* > + * In one_page mode we are fetching undo only for one page instead of > + * fetching all the undo of the transaction. Basically, we are fetching > + * interleaved undo records. So it does not make sense to do any prefetch > + * in that case. What does "interleaved" mean here? I assume that there will often be other UNDO records interspersed? But that's not guaranteed at all, right? In fact, for a lot of workloads it seems likely that there will be many consecutive undo records for a single page? In fact, won't that be the majority of cases? Thus it's not obvious to me that there's not often going to be consecutive pages for this case too. I'd even say that minimizing IO delay is *MORE* important during page-wise undo, as that happens in the context of client accesses, and it's not incurring cost on the party that performed DML, but on some random third party. I'm doubtful this is a sane interface. There's a lot of duplication between one_page and not one_page. It presupposes specific ways of constructing chains that are likely to depend on the AM. to_urecptr is only used in certain situations. E.g. I strongly suspect that for zheap's visibility determinations we'd want to concurrently follow all the necessary chains to determine visibility for all all tuples on the page, far enough to find the visible tuple - for seqscan's / bitmap heap scans / everything using page mode scans, that'll be way more efficient than doing this one-by-one and possibly even repeatedly. But what is exactly the right thing to do is going to be highly AM specific. I vaguely suspect what you'd want is an interface where the "bulk fetch" context basically has a FIFO queue of undo records to fetch, and a function to actually perform fetching. Whenever a record has been retrieved, a callback determines whether additional records are needed. 
In the case of fetching all the undo for a transaction, you'd just queue - probably in a more efficient representation - all the necessary undo. In case of page-wise undo, you'd queue the first record of the chain you'd want to undo, with a callback for queuing the next record. For visibility determinations in zheap, you'd queue all the different necessary chains, with a callback that queues the next necessary record if still needed for visibility determination. And then I suspect you'd have a separate callback whenever records have been fetched, with all the 'unconsumed' records. That then can, e.g. based on memory consumption, decide to process them or not. For visibility information you'd probably just want to condense the records to the minimum necessary (i.e. visibility information for the relevant tuples, and the visibile tuple when encountered) as soon as available. Obviously that's pretty handwavy. > Also, if we are fetching undo records from more than one > + * log, we don't know the boundaries for prefetching. Hence, we can't use > + * prefetching in this case. > + */ Hm. Why don't we know the boundaries (or cheaply infer them)? > + /* > + * If prefetch_pages are half of the prefetch_target then it's time to > + * prefetch again. > + */ > + if (prefetch_pages < prefetch_target / 2) > + PrefetchUndoPages(rnode, prefetch_target, &prefetch_pages, to_blkno, > + from_blkno, category); Hm. Why aren't we prefetching again as soon as possible? Given the current code there's not really benefit in fetching many adjacent pages at once. And this way it seems we're somewhat likely to cause fairly bursty IO? > + /* > + * In one_page mode it's possible that the undo of the transaction > + * might have been applied by worker and undo got discarded. Prevent > + * discard worker from discarding undo data while we are reading it. > + * See detail comment in UndoFetchRecord. In normal mode we are > + * holding transaction undo action lock so it can not be discarded. > + */ I don't really see a comment explaining this in UndoFetchRecord. Are you referring to InHotStandby? Because there's no comment about one_page mode as far as I can tell? The comment is clearly referring to that, rather than InHotStandby? > + if (one_page) > + { > + /* Refer comments in UndoFetchRecord. */ Missing "to". > + if (InHotStandby) > + { > + if (UndoRecPtrIsDiscarded(urecptr)) > + break; > + } > + else > + { > + LWLockAcquire(&slot->discard_lock, LW_SHARED); > + if (slot->logno != logno || urecptr < slot->oldest_data) > + { > + /* > + * The undo log slot has been recycled because it was > + * entirely discarded, or the data has been discarded > + * already. > + */ > + LWLockRelease(&slot->discard_lock); > + break; > + } > + } I find this deeply unsatisfying. It's repeated in a bunch of places. There's completely different behaviour between the hot-standby and !hot-standby case. There's UndoRecPtrIsDiscarded for the HS case, but we do a different test for !HS. There's no explanation as to why this is even reachable. > + /* Read the undo record. */ > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); > + > + /* Release the discard lock after fetching the record. */ > + if (!InHotStandby) > + LWLockRelease(&slot->discard_lock); > + } > + else > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); And then we do none of this in !one_page mode. > + /* > + * As soon as the transaction id is changed we can stop fetching the > + * undo record. 
Ideally, to_urecptr should control this but while > + * reading undo only for a page we don't know what is the end undo > + * record pointer for the transaction. > + */ > + if (one_page) > + { > + if (!FullTransactionIdIsValid(fxid)) > + fxid = uur->uur_fxid; > + else if (!FullTransactionIdEquals(fxid, uur->uur_fxid)) > + break; > + } > + > + /* Remember the previous undo record pointer. */ > + prev_urec_ptr = urecptr; > + > + /* > + * Calculate the previous undo record pointer of the transaction. If > + * we are reading undo only for a page then follow the blkprev chain > + * of the page. Otherwise, calculate the previous undo record pointer > + * using transaction's current undo record pointer and the prevlen. If > + * undo record has a valid uur_prevurp, this is the case of log switch > + * during the transaction so we can directly use uur_prevurp as our > + * previous undo record pointer of the transaction. > + */ > + if (one_page) > + urecptr = uur->uur_prevundo; > + else if (uur->uur_logswitch) > + urecptr = uur->uur_logswitch->urec_prevurp; > + else if (prev_urec_ptr == to_urecptr || > + uur->uur_info & UREC_INFO_TRANSACTION) > + urecptr = InvalidUndoRecPtr; > + else > + urecptr = UndoGetPrevUndoRecptr(prev_urec_ptr, buffer, category); > + FWIW, this is one of those concerns I was referring to above. What exactly needs to happen seems highly AM specific. > +/* > + * Read length of the previous undo record. > + * > + * This function will take an undo record pointer as an input and read the > + * length of the previous undo record which is stored at the end of the previous > + * undo record. If the undo record is split then this will add the undo block > + * header size in the total length. > + */ This should add some note as to when it's expected to be necessary. I was kind of concerned that this can be necessary, but it's only needed during log switches, which disarms that concern. > +static uint16 > +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > + UndoLogCategory category) > +{ > + UndoLogOffset page_offset = UndoRecPtrGetPageOffset(urp); > + BlockNumber cur_blk = UndoRecPtrGetBlockNum(urp); > + Buffer buffer = input_buffer; > + Page page = NULL; > + char *pagedata = NULL; > + char prevlen[2]; > + RelFileNode rnode; > + int byte_to_read = sizeof(uint16); Shouldn't it be byte_to_read? And the sizeof a type that's tied with the actual undo format? Imagine we'd ever want to change the length format for undo records - this would be hard to find. > + char persistence; > + uint16 prev_rec_len = 0; > + > + /* Get relfilenode. */ > + UndoRecPtrAssignRelFileNode(rnode, urp); > + persistence = RelPersistenceForUndoLogCategory(category); > + > + if (BufferIsValid(buffer)) > + { > + page = BufferGetPage(buffer); > + pagedata = (char *) page; > + } > + > + /* > + * Length if the previous undo record is store at the end of that record > + * so just fetch last 2 bytes. > + */ > + while (byte_to_read > 0) > + { Why does this need a loop around the number of bytes? Can there ever be a case where this is split across a record? If so, isn't that a bad idea anyway? > + /* Read buffer if the current buffer is not valid. 
*/ > + if (!BufferIsValid(buffer)) > + { > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, > + cur_blk, RBM_NORMAL, NULL, > + persistence); > + > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > + > + page = BufferGetPage(buffer); > + pagedata = (char *) page; > + } > + > + page_offset -= 1; > + > + /* > + * Read current prevlen byte from current block if page_offset hasn't > + * reach to undo block header. Otherwise, go to the previous block > + * and continue reading from there. > + */ > + if (page_offset >= UndoLogBlockHeaderSize) > + { > + prevlen[byte_to_read - 1] = pagedata[page_offset]; > + byte_to_read -= 1; > + } > + else > + { > + /* > + * Release the current buffer if it is not provide by the caller. > + */ > + if (input_buffer != buffer) > + UnlockReleaseBuffer(buffer); > + > + /* > + * Could not read complete prevlen from the current block so go to > + * the previous block and start reading from end of the block. > + */ > + cur_blk -= 1; > + page_offset = BLCKSZ; > + > + /* > + * Reset buffer so that we can read it again for the previous > + * block. > + */ > + buffer = InvalidBuffer; > + } > + } I can't help but think that this shouldn't be yet another copy of logic for how to read undo pages. Need to do something else for a bit. More later. Greetings, Andres Freund
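One possible shape for that shared logic, sketched with invented names (and glossing over details like buffer reuse that the real code would want):

    /*
     * Copy nbytes of undo data starting at offset 'off' in block 'blk',
     * crossing block boundaries as required.
     */
    static void
    undo_read_bytes(RelFileNode rnode, BlockNumber blk, int off,
                    char *dest, int nbytes, char persistence)
    {
        while (nbytes > 0)
        {
            int     avail = Min(nbytes, BLCKSZ - off);
            Buffer  buf = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, blk,
                                                    RBM_NORMAL, NULL, persistence);

            LockBuffer(buf, BUFFER_LOCK_SHARE);
            memcpy(dest, (char *) BufferGetPage(buf) + off, avail);
            UnlockReleaseBuffer(buf);

            dest += avail;
            nbytes -= avail;
            blk++;
            off = UndoLogBlockHeaderSize;   /* resume just past the next block header */
        }
    }

UndoGetOneRecord(), UndoGetPrevRecordLen() and the insertion path could then all sit on top of one routine like this instead of each hand-rolling the page-crossing loop.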
On Mon, Aug 5, 2019 at 12:42 PM Andres Freund <andres@anarazel.de> wrote: > A good move in the right direction, imo. I spent some more time thinking about this and talking to Thomas about it and I'd like to propose a somewhat more aggressive restructuring proposal, with the aim of getting a cleaner separation between layers of this patch set. Right now, the undo log storage stuff knows nothing about the contents of an undo log, whereas the undo interface storage knows everything about the contents of an undo log. In particular, it knows that it's a series of records, and those records are grouped into transactions, and it knows both the format of the individual records and also the details of how transaction headers work. Nothing can use the undo log storage system except for the undo interface layer, because the undo interface layer assumes that all the data in the undo storage system conforms to the record/recordset format which it defines. However, there are a few warts: while the undo log storage patch doesn't know anything about the contents of undo logs, it does know that that transaction boundaries matter, and it signals to the undo interface layer whether a transaction header should be inserted for a new record. That's a strange thing for the storage layer to be doing. Also, in addition to three persistence levels, it knows about a fourth undo log category for "special" data for multixact or TPD-like things. That's another wart. Suppose that we instead invent a new layer which sits on top of the undo log storage layer. This layer manages what I'm going to call GHOBs, growable hunks of bytes. (This is probably not the best name, but I thought of it in 5 seconds during a complex technical conversation, so bear with me.) The GHOB layer supports open/close/grow/write/overwrite operations. Conceptually, you open a GHOB with an initial size and a persistence level, and then you can subsequently grow it unless you fill up the undo log in which case you can't grow it any more; when you're done, you close it. Opening and closing a GHOB are operations that only make in-memory state changes. Opening a GHOB finds a place where you could write the initial amount of data you specify, but it doesn't actually write any data or change any persistent state yet, except for making sure that nobody else can grab that space as long as you have the GHOB open. Closing a GHOB tells the system that you're not going to grow the object any more, which means some other GHOB can be placed immediately after the last data you wrote. Growing a GHOB doesn't do anything persistent either; it just tests whether there would be room to write those bytes. So, the only operations that make actual persistent changes are write and overwrite. These operations just copy data into shared buffers and mark them dirty, but they are set up so that you can integrate this with whatever WAL-logging your doing for those operations, so that you can make the same writes happen at redo time. Then, on top of the GHOB layer, you have separate submodules for different kinds of GHOBs. Most importantly, you have a transaction-GHOB manager, which opens a GHOB per persistence level the first time somebody wants to write to it and closes those GHOBs at end-of-xact. AMs push records into the transaction-GHOB manager, and it pushes them into GHOBs on the other side. Then you can also have a multi-GHOB manager, which would replace what Thomas now has as a separate undo log category. 
The undo-log-storage layer wouldn't have any fixed limit on the number of GHOBs that could be open at the same time; it would just be the sum of whatever the individual GHOB type managers can open. It would be important to keep that number fairly small since there's not an unlimited supply of undo logs, but that doesn't seem like a problem for any of the uses we currently have in mind. Each GHOB would begin with a magic number identifying the GHOB type, and would have callbacks for everything else, like "how big is this GHOB?" and "is it discardable?". I'm not totally sure I've thought through all of the problems here, but it seems like this might help us fix some of the aforementioned layering inversions. The undo log storage system only knows about storage: it doesn't have to help with things like transaction boundaries any more, and it continues to be indifferent to the actual contents of the storage. At the GHOB layer, we know that we've got chunks of storage which are the unit of undo discard, and we know that they start with a magic number that identifies the type, but it doesn't know whether they are internally broken into records or, if so, how those records are organized. The individual GHOB managers do know that stuff; for example, the transaction-GHOB manager would know that AMs insert undo records and how those records are compressed and so forth. One thing that feels good about this system is that you could actually write something like the test_undo module that Thomas had in an older patch set. He threw it away because it doesn't play nice with the way the undorecord/undoaccess stuff works: that stuff thinks that all undo records have to be in the format that it knows about, and if they're not, it will barf. With this, test_undo could define its own kind of GHOB that keeps stuff until it's explicitly told to throw it away, and that'd be fine for 'make check' (but not 'make installcheck', probably). Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
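In header form, the interface might look something like this (purely a sketch of the idea; every name is provisional):

    typedef struct Ghob Ghob;   /* opaque, backend-local handle */

    /* Reserve space at a given persistence level; in-memory state change only. */
    extern Ghob *ghob_open(Size initial_size, char relpersistence, uint32 magic);

    /* Reserve room to append nbytes; fails if the undo log can't fit them. */
    extern bool ghob_grow(Ghob *ghob, Size nbytes);

    /* Copy data into shared buffers; to be paired with the caller's WAL logging. */
    extern void ghob_write(Ghob *ghob, Size offset, const char *data, Size nbytes);
    extern void ghob_overwrite(Ghob *ghob, Size offset, const char *data, Size nbytes);

    /* Stop growing; space after the end becomes available to the next GHOB. */
    extern void ghob_close(Ghob *ghob);

The transaction-GHOB manager would then be the only module that knows about undo records and transaction headers, and something like test_undo could define its own GHOB type with its own magic number and callbacks.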
On Mon, Aug 5, 2019 at 6:29 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Aug 5, 2019 at 6:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > For zheap, we collect all the records of a page, apply them together > > and then write the entire page in WAL. The progress of transaction is > > updated at either transaction end (rollback complete) or after > > processing some threshold of undo records. So, generally, the WAL > > won't be for each undo record apply. > > This explanation omits a crucial piece of the mechanism, because > Heikki is asking what keeps the undo from being applied multiple > times. > Okay, I didn't realize that. > When we apply the undo records to a page, we also adjust the > undo pointers in the page. Since we have an undo pointer per > transaction slot, and each transaction has its own slot, if we apply > all the undo for a transaction to a page, we can just clear the slot; > if we somehow end up back at the same point later, we'll know not to > apply the undo a second time because we'll see that there's no > transaction slot pointing to the undo we were thinking of applying. If > we roll back to a savepoint, or for some other reason choose to apply > only some of the undo to a page, we can set the undo record pointer > for the transaction back to the value it had before we generated any > newer undo. Then, we'll know that the newer undo doesn't need to be > applied but the older undo can be applied. > > At least, I think that's how it's supposed to work. > Right, this is how it works. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Hi, On 2019-08-05 11:29:34 -0700, Andres Freund wrote: > Need to do something else for a bit. More later. Here we go. > + /* > + * Compute the header size of the undo record. > + */ > +Size > +UndoRecordHeaderSize(uint16 uur_info) > +{ > + Size size; > + > + /* Add fixed header size. */ > + size = SizeOfUndoRecordHeader; > + > + /* Add size of transaction header if it presets. */ > + if ((uur_info & UREC_INFO_TRANSACTION) != 0) > + size += SizeOfUndoRecordTransaction; > + > + /* Add size of rmid if it presets. */ > + if ((uur_info & UREC_INFO_RMID) != 0) > + size += sizeof(RmgrId); > + > + /* Add size of reloid if it presets. */ > + if ((uur_info & UREC_INFO_RELOID) != 0) > + size += sizeof(Oid); > + > + /* Add size of fxid if it presets. */ > + if ((uur_info & UREC_INFO_XID) != 0) > + size += sizeof(FullTransactionId); > + > + /* Add size of cid if it presets. */ > + if ((uur_info & UREC_INFO_CID) != 0) > + size += sizeof(CommandId); > + > + /* Add size of forknum if it presets. */ > + if ((uur_info & UREC_INFO_FORK) != 0) > + size += sizeof(ForkNumber); > + > + /* Add size of prevundo if it presets. */ > + if ((uur_info & UREC_INFO_PREVUNDO) != 0) > + size += sizeof(UndoRecPtr); > + > + /* Add size of the block header if it presets. */ > + if ((uur_info & UREC_INFO_BLOCK) != 0) > + size += SizeOfUndoRecordBlock; > + > + /* Add size of the log switch header if it presets. */ > + if ((uur_info & UREC_INFO_LOGSWITCH) != 0) > + size += SizeOfUndoRecordLogSwitch; > + > + /* Add size of the payload header if it presets. */ > + if ((uur_info & UREC_INFO_PAYLOAD) != 0) > + size += SizeOfUndoRecordPayload; There's numerous blocks with one if for each type, and the body copied basically the same for each alternative. That doesn't seem like a reasonable approach to me. Means that many places need to be adjusted when we invariably add another type, and seems likely to lead to bugs over time. > + /* Add size of the payload header if it presets. */ FWIW, repeating the same comment, with or without minor differences, 10 times is a bad idea. Especially when the comment doesn't add *any* sort of information. Also, "if it presets" presumably is a typo? > +/* > + * Compute and return the expected size of an undo record. > + */ > +Size > +UndoRecordExpectedSize(UnpackedUndoRecord *uur) > +{ > + Size size; > + > + /* Header size. */ > + size = UndoRecordHeaderSize(uur->uur_info); > + > + /* Payload data size. */ > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) > + { > + size += uur->uur_payload.len; > + size += uur->uur_tuple.len; > + } > + > + /* Add undo record length size. */ > + size += sizeof(uint16); > + > + return size; > +} > + > +/* > + * Calculate the size of the undo record stored on the page. > + */ > +static inline Size > +UndoRecordSizeOnPage(char *page_ptr) > +{ > + uint16 uur_info = ((UndoRecordHeader *) page_ptr)->urec_info; > + Size size; > + > + /* Header size. */ > + size = UndoRecordHeaderSize(uur_info); > + > + /* Payload data size. */ > + if ((uur_info & UREC_INFO_PAYLOAD) != 0) > + { > + UndoRecordPayload *payload = (UndoRecordPayload *) (page_ptr + size); > + > + size += payload->urec_payload_len; > + size += payload->urec_tuple_len; > + } > + > + return size; > +} > + > +/* > + * Compute size of the Unpacked undo record in memory > + */ > +Size > +UnpackedUndoRecordSize(UnpackedUndoRecord *uur) > +{ > + Size size; > + > + size = sizeof(UnpackedUndoRecord); > + > + /* Add payload size if record contains payload data. 
*/ > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) > + { > + size += uur->uur_payload.len; > + size += uur->uur_tuple.len; > + } > + > + return size; > +} These functions are all basically the same. We shouldn't copy code over and over like this. > +/* > + * Initiate inserting an undo record. > + * > + * This function will initialize the context for inserting and undo record > + * which will be inserted by calling InsertUndoData. > + */ > +void > +BeginInsertUndo(UndoPackContext *ucontext, UnpackedUndoRecord *uur) > +{ > + ucontext->stage = UNDO_PACK_STAGE_HEADER; > + ucontext->already_processed = 0; > + ucontext->partial_bytes = 0; > + > + /* Copy undo record header. */ > + ucontext->urec_hd.urec_type = uur->uur_type; > + ucontext->urec_hd.urec_info = uur->uur_info; > + > + /* Copy undo record transaction header if it is present. */ > + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0) > + memcpy(&ucontext->urec_txn, uur->uur_txn, SizeOfUndoRecordTransaction); > + > + /* Copy rmid if present. */ > + if ((uur->uur_info & UREC_INFO_RMID) != 0) > + ucontext->urec_rmid = uur->uur_rmid; > + > + /* Copy reloid if present. */ > + if ((uur->uur_info & UREC_INFO_RELOID) != 0) > + ucontext->urec_reloid = uur->uur_reloid; > + > + /* Copy fxid if present. */ > + if ((uur->uur_info & UREC_INFO_XID) != 0) > + ucontext->urec_fxid = uur->uur_fxid; > + > + /* Copy cid if present. */ > + if ((uur->uur_info & UREC_INFO_CID) != 0) > + ucontext->urec_cid = uur->uur_cid; > + > + /* Copy undo record relation header if it is present. */ > + if ((uur->uur_info & UREC_INFO_FORK) != 0) > + ucontext->urec_fork = uur->uur_fork; > + > + /* Copy prev undo record pointer if it is present. */ > + if ((uur->uur_info & UREC_INFO_PREVUNDO) != 0) > + ucontext->urec_prevundo = uur->uur_prevundo; > + > + /* Copy undo record block header if it is present. */ > + if ((uur->uur_info & UREC_INFO_BLOCK) != 0) > + { > + ucontext->urec_blk.urec_block = uur->uur_block; > + ucontext->urec_blk.urec_offset = uur->uur_offset; > + } > + > + /* Copy undo record log switch header if it is present. */ > + if ((uur->uur_info & UREC_INFO_LOGSWITCH) != 0) > + memcpy(&ucontext->urec_logswitch, uur->uur_logswitch, > + SizeOfUndoRecordLogSwitch); > + > + /* Copy undo record payload header and data if it is present. */ > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) > + { > + ucontext->urec_payload.urec_payload_len = uur->uur_payload.len; > + ucontext->urec_payload.urec_tuple_len = uur->uur_tuple.len; > + ucontext->urec_payloaddata = uur->uur_payload.data; > + ucontext->urec_tupledata = uur->uur_tuple.data; > + } > + else > + { > + ucontext->urec_payload.urec_payload_len = 0; > + ucontext->urec_payload.urec_tuple_len = 0; > + } > + > + /* Compute undo record expected size and store in the context. */ > + ucontext->undo_len = UndoRecordExpectedSize(uur); > +} It really can't be right to have all these fields basically twice, in UnackedUndoRecord, and UndoPackContext. And then copy them one-by-one. I mean there's really just some random differences (ordering, some field names) between the structures, but otherwise they're the same? What on earth do we gain by this? This entire intermediate stage makes no sense at all to me. We copy data into an UndoRecord, then we copy into an UndoRecordContext, with essentially a field-by-field copy logic. Then we have another field-by-field logic that copies the data into the page. > +/* > + * Insert the undo record into the input page from the unpack undo context. 
> + * > + * Caller can call this function multiple times until desired stage is reached. > + * This will write the undo record into the page. > + */ > +void > +InsertUndoData(UndoPackContext *ucontext, Page page, int starting_byte) > +{ > + char *writeptr = (char *) page + starting_byte; > + char *endptr = (char *) page + BLCKSZ; > + > + switch (ucontext->stage) > + { > + case UNDO_PACK_STAGE_HEADER: > + /* Insert undo record header. */ > + if (!InsertUndoBytes((char *) &ucontext->urec_hd, > + SizeOfUndoRecordHeader, &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + ucontext->stage = UNDO_PACK_STAGE_TRANSACTION; > + /* fall through */ > + > + case UNDO_PACK_STAGE_TRANSACTION: > + if ((ucontext->urec_hd.urec_info & UREC_INFO_TRANSACTION) != 0) > + { > + /* Insert undo record transaction header. */ > + if (!InsertUndoBytes((char *) &ucontext->urec_txn, > + SizeOfUndoRecordTransaction, > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_RMID; > + /* fall through */ > + > + case UNDO_PACK_STAGE_RMID: > + /* Write rmid(if needed and not already done). */ > + if ((ucontext->urec_hd.urec_info & UREC_INFO_RMID) != 0) > + { > + if (!InsertUndoBytes((char *) &(ucontext->urec_rmid), sizeof(RmgrId), > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_RELOID; > + /* fall through */ > + > + case UNDO_PACK_STAGE_RELOID: > + /* Write reloid(if needed and not already done). */ > + if ((ucontext->urec_hd.urec_info & UREC_INFO_RELOID) != 0) > + { > + if (!InsertUndoBytes((char *) &(ucontext->urec_reloid), sizeof(Oid), > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_XID; > + /* fall through */ > + > + case UNDO_PACK_STAGE_XID: > + /* Write xid(if needed and not already done). */ > + if ((ucontext->urec_hd.urec_info & UREC_INFO_XID) != 0) > + { > + if (!InsertUndoBytes((char *) &(ucontext->urec_fxid), sizeof(FullTransactionId), > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_CID; > + /* fall through */ > + > + case UNDO_PACK_STAGE_CID: > + /* Write cid(if needed and not already done). */ > + if ((ucontext->urec_hd.urec_info & UREC_INFO_CID) != 0) > + { > + if (!InsertUndoBytes((char *) &(ucontext->urec_cid), sizeof(CommandId), > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_FORKNUM; > + /* fall through */ > + > + case UNDO_PACK_STAGE_FORKNUM: > + if ((ucontext->urec_hd.urec_info & UREC_INFO_FORK) != 0) > + { > + /* Insert undo record fork number. */ > + if (!InsertUndoBytes((char *) &ucontext->urec_fork, > + sizeof(ForkNumber), > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_PREVUNDO; > + /* fall through */ > + > + case UNDO_PACK_STAGE_PREVUNDO: > + if ((ucontext->urec_hd.urec_info & UREC_INFO_PREVUNDO) != 0) > + { > + /* Insert undo record blkprev. 
*/ > + if (!InsertUndoBytes((char *) &ucontext->urec_prevundo, > + sizeof(UndoRecPtr), > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_BLOCK; > + /* fall through */ > + > + case UNDO_PACK_STAGE_BLOCK: > + if ((ucontext->urec_hd.urec_info & UREC_INFO_BLOCK) != 0) > + { > + /* Insert undo record block header. */ > + if (!InsertUndoBytes((char *) &ucontext->urec_blk, > + SizeOfUndoRecordBlock, > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_LOGSWITCH; > + /* fall through */ > + > + case UNDO_PACK_STAGE_LOGSWITCH: > + if ((ucontext->urec_hd.urec_info & UREC_INFO_LOGSWITCH) != 0) > + { > + /* Insert undo record transaction header. */ > + if (!InsertUndoBytes((char *) &ucontext->urec_logswitch, > + SizeOfUndoRecordLogSwitch, > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_PAYLOAD; > + /* fall through */ > + > + case UNDO_PACK_STAGE_PAYLOAD: > + if ((ucontext->urec_hd.urec_info & UREC_INFO_PAYLOAD) != 0) > + { > + /* Insert undo record payload header. */ > + if (!InsertUndoBytes((char *) &ucontext->urec_payload, > + SizeOfUndoRecordPayload, > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_PAYLOAD_DATA; > + /* fall through */ > + > + case UNDO_PACK_STAGE_PAYLOAD_DATA: > + { > + int len = ucontext->urec_payload.urec_payload_len; > + > + if (len > 0) > + { > + /* Insert payload data. */ > + if (!InsertUndoBytes((char *) ucontext->urec_payloaddata, > + len, &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_TUPLE_DATA; > + } > + /* fall through */ > + > + case UNDO_PACK_STAGE_TUPLE_DATA: > + { > + int len = ucontext->urec_payload.urec_tuple_len; > + > + if (len > 0) > + { > + /* Insert tuple data. */ > + if (!InsertUndoBytes((char *) ucontext->urec_tupledata, > + len, &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_UNDO_LENGTH; > + } > + /* fall through */ > + > + case UNDO_PACK_STAGE_UNDO_LENGTH: > + /* Insert undo length. */ > + if (!InsertUndoBytes((char *) &ucontext->undo_len, > + sizeof(uint16), &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + > + ucontext->stage = UNDO_PACK_STAGE_DONE; > + /* fall through */ > + > + case UNDO_PACK_STAGE_DONE: > + /* Nothing to be done. */ > + break; > + > + default: > + Assert(0); /* Invalid stage */ > + } > +} I don't understand. The only purpose of this is that we can partially write a packed-but-not-actually-packed record onto a bunch of pages? And for that we have an endless chain of copy and pasted code calling InsertUndoBytes()? Copying data into shared buffers in tiny increments? If we need to do this, what is the whole packed record format good for? Except for adding a bunch of functions with 10++ ifs and nearly identical code? Copying data is expensive. Copying data in tiny increments is more expensive. Copying data in tiny increments, with a bunch of branches, is even more expensive. Copying data in tiny increments, with a bunch of branches, is even more expensive, especially when it's shared memory.
Copying data in tiny increments, with a bunch of branches, is even more expensive, especially when it's shared memory, especially when all that shared memory is locked at once. > +/* > + * Read the undo record from the input page to the unpack undo context. > + * > + * Caller can call this function multiple times until desired stage is reached. > + * This will read the undo record from the page and store the data into unpack > + * undo context, which can be later copied to unpacked undo record by calling > + * FinishUnpackUndo. > + */ > +void > +UnpackUndoData(UndoPackContext *ucontext, Page page, int starting_byte) > +{ > + char *readptr = (char *) page + starting_byte; > + char *endptr = (char *) page + BLCKSZ; > + > + switch (ucontext->stage) > + { > + case UNDO_PACK_STAGE_HEADER: You know roughly what I'm thinking. > commit 95d10fb308e3ec6ac8a7b4b5e7af78f6825f4dc8 > Author: Amit Kapila <amit.kapila@enterprisedb.com> > AuthorDate: 2019-06-13 15:10:06 +0530 > Commit: Amit Kapila <amit.kapila@enterprisedb.com> > CommitDate: 2019-07-31 16:36:52 +0530 > > Infrastructure to register and fetch undo action requests. I'm pretty sure I suggested that before, but this seems the wrong order. We should have very basic undo functionality in place, even if it can't actually guarantee that undo gets processed, before this. The design of this piece depends on understanding the later parts too much. > This infrasture provides a way to allow execution of undo actions. One > might think that we can always execute undo actions on error or explicit > rollabck by user, however there are cases when that is not posssible. s/rollabck by user/rollback by a user/ > For example, (a) if the system crash while doing operation, then after > startup, we need a way to perform undo actions; (b) If we get error while > performing undo actions. "doing operation" doesn't sound right. Maybe "performing an operation"? > Apart from this, when there are large rollback requests, then it is quite > inefficient to perform all the undo actions and then return control to > user. I don't think efficiency is the right word to describe that. I'd argue that it's probably often at least as efficient to let that rollback be processed in that context (higher cache locality, preventing that backend from creating further undo). It's just that doing so has a bad effect on latency. > To allow efficient execution of the undo actions, we create three queues > and a hash table for the rollback requests. Again I don't think efficient is the right descriptor. My understanding of the goals of having multiple queues is that it helps to achieve forward progress among separate goals, without losing too much efficiency. > A Xid based priority queue > which will allow us to process the requests of older transactions and help > us to move oldesdXidHavingUnappliedUndo (this is a xid-horizon below which > all the transactions are visible) forward. "This is an important concern, because ..." > +/* > + * Returns the undo record pointer corresponding to first record in the given > + * block.
> + */ > +UndoRecPtr > +UndoBlockGetFirstUndoRecord(BlockNumber blkno, UndoRecPtr urec_ptr, > + UndoLogCategory category) > +{ > + Buffer buffer; > + Page page; > + UndoPageHeader phdr; > + RelFileNode rnode; > + UndoLogOffset log_cur_off; > + Size partial_rec_size; > + int offset_cur_page; > + > + if (!BlockNumberIsValid(blkno)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("invalid undo block number"))); > + > + UndoRecPtrAssignRelFileNode(rnode, urec_ptr); > + > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, blkno, > + RBM_NORMAL, NULL, > + RelPersistenceForUndoLogCategory(category)); > + > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > + > + page = BufferGetPage(buffer); > + phdr = (UndoPageHeader)page; > + > + /* Calculate the size of the partial record. */ > + partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) + > + phdr->tuple_len + phdr->payload_len - > + phdr->record_offset; > + > + /* calculate the offset in current log. */ > + offset_cur_page = SizeOfUndoPageHeaderData + partial_rec_size; > + log_cur_off = (blkno * BLCKSZ) + offset_cur_page; > + > + UnlockReleaseBuffer(buffer); > + > + /* calculate the undo record pointer based on current offset in log. */ > + return MakeUndoRecPtr(UndoRecPtrGetLogNo(urec_ptr), log_cur_off); > +} Yet another function reading undo blocks. No. > The undo requests must appear in both xid and size > + * requests queues or neither. Why? > As of now we, process the requests from these > + * queues in a round-robin fashion to give equal priority to all three type > + * of requests. *types > + * The rollback requests exceeding a certain threshold are pushed into both > + * xid and size based queues. They are also registered in the hash table. Why aren't rollbacks below the threshold in the hashtable? > + * To ensure that backend and discard worker don't register the same request > + * in the hash table, we always register the request with full_xid and the > + * start pointer for the transaction in the hash table as key. Backends > + * always remember the value of start pointer, but discard worker doesn't know *the discard worker There's no explanation as to why we need more than the full_xid (presumably persistency levels). Nor why you chose not to include those. > + * the actual start value in case transaction's undo spans across multiple > + * logs. The reason for the same is that discard worker might encounter the > + * log which has overflowed undo records of the transaction first. "the log which has overflowed undo records of the transaction first" is confusing. Perhaps "the undo log into which the logically earlier undo overflowed before encountering the logically earlier undo"? > In such > + * cases, we need to compute the actual start position. The first record of a > + * transaction in each undo log contains a reference to the first record of > + * this transaction in the previous log. By following the previous log chain > + * of this transaction, we find the initial location which is used to register > + * the request. It seems wrong that the undo request layer needs to care about any of this. > +/* Each worker queue is a binary heap. */ > +typedef struct > +{ > + binaryheap *bh; > + union > + { > + UndoXidQueue *xid_elems; > + UndoSizeQueue *size_elems; > + UndoErrorQueue *error_elems; > + } q_choice; > +} UndoWorkerQueue; As we IIRC have decided to change this into an rbtree, I'll ignore related parts of the current code. What is the status of that work?
I've checked the git trees, without seeing anything? Your last mail with patches https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com doesn't seem to contain that either? > +/* Different operations for XID queue */ > +#define InitXidQueue(bh, elems) \ > +( \ > + UndoWorkerQueues[XID_QUEUE].bh = bh, \ > + UndoWorkerQueues[XID_QUEUE].q_choice.xid_elems = elems \ > +) > + > +#define XidQueueIsEmpty() \ > + (binaryheap_empty(UndoWorkerQueues[XID_QUEUE].bh)) > + > +#define GetXidQueueSize() \ > + (binaryheap_cur_size(UndoWorkerQueues[XID_QUEUE].bh)) > + > +#define GetXidQueueElem(elem) \ > + (UndoWorkerQueues[XID_QUEUE].q_choice.xid_elems[elem]) > + > +#define GetXidQueueTopElem() \ > +( \ > + AssertMacro(!binaryheap_empty(UndoWorkerQueues[XID_QUEUE].bh)), \ > + DatumGetPointer(binaryheap_first(UndoWorkerQueues[XID_QUEUE].bh)) \ > +) > + > +#define GetXidQueueNthElem(n) \ > +( \ > + AssertMacro(!XidQueueIsEmpty()), \ > + DatumGetPointer(binaryheap_nth(UndoWorkerQueues[XID_QUEUE].bh, n)) \ > +) > + > +#define SetXidQueueElem(elem, e_dbid, e_full_xid, e_start_urec_ptr) \ > +( \ > + GetXidQueueElem(elem).dbid = e_dbid, \ > + GetXidQueueElem(elem).full_xid = e_full_xid, \ > + GetXidQueueElem(elem).start_urec_ptr = e_start_urec_ptr \ > +) > + > +/* Different operations for SIZE queue */ > +#define InitSizeQueue(bh, elems) \ > +( \ > + UndoWorkerQueues[SIZE_QUEUE].bh = bh, \ > + UndoWorkerQueues[SIZE_QUEUE].q_choice.size_elems = elems \ > +) > + > +#define SizeQueueIsEmpty() \ > + (binaryheap_empty(UndoWorkerQueues[SIZE_QUEUE].bh)) > + > +#define GetSizeQueueSize() \ > + (binaryheap_cur_size(UndoWorkerQueues[SIZE_QUEUE].bh)) > + > +#define GetSizeQueueElem(elem) \ > + (UndoWorkerQueues[SIZE_QUEUE].q_choice.size_elems[elem]) > + > +#define GetSizeQueueTopElem() \ > +( \ > + AssertMacro(!SizeQueueIsEmpty()), \ > + DatumGetPointer(binaryheap_first(UndoWorkerQueues[SIZE_QUEUE].bh)) \ > +) > + > +#define GetSizeQueueNthElem(n) \ > +( \ > + AssertMacro(!SizeQueueIsEmpty()), \ > + DatumGetPointer(binaryheap_nth(UndoWorkerQueues[SIZE_QUEUE].bh, n)) \ > +) > + > +#define SetSizeQueueElem(elem, e_dbid, e_full_xid, e_size, e_start_urec_ptr) \ > +( \ > + GetSizeQueueElem(elem).dbid = e_dbid, \ > + GetSizeQueueElem(elem).full_xid = e_full_xid, \ > + GetSizeQueueElem(elem).request_size = e_size, \ > + GetSizeQueueElem(elem).start_urec_ptr = e_start_urec_ptr \ > +) > + > +/* Different operations for Error queue */ > +#define InitErrorQueue(bh, elems) \ > +( \ > + UndoWorkerQueues[ERROR_QUEUE].bh = bh, \ > + UndoWorkerQueues[ERROR_QUEUE].q_choice.error_elems = elems \ > +) > + > +#define ErrorQueueIsEmpty() \ > + (binaryheap_empty(UndoWorkerQueues[ERROR_QUEUE].bh)) > + > +#define GetErrorQueueSize() \ > + (binaryheap_cur_size(UndoWorkerQueues[ERROR_QUEUE].bh)) > + > +#define GetErrorQueueElem(elem) \ > + (UndoWorkerQueues[ERROR_QUEUE].q_choice.error_elems[elem]) > + > +#define GetErrorQueueTopElem() \ > +( \ > + AssertMacro(!binaryheap_empty(UndoWorkerQueues[ERROR_QUEUE].bh)), \ > + DatumGetPointer(binaryheap_first(UndoWorkerQueues[ERROR_QUEUE].bh)) \ > +) > + > +#define GetErrorQueueNthElem(n) \ > +( \ > + AssertMacro(!ErrorQueueIsEmpty()), \ > + DatumGetPointer(binaryheap_nth(UndoWorkerQueues[ERROR_QUEUE].bh, n)) \ > +) -ETOOMANYMACROS I think nearly all of these shouldn't exist. See further below. 
> +#define SetErrorQueueElem(elem, e_dbid, e_full_xid, e_start_urec_ptr, e_retry_at, e_occurred_at) \ > +( \ > + GetErrorQueueElem(elem).dbid = e_dbid, \ > + GetErrorQueueElem(elem).full_xid = e_full_xid, \ > + GetErrorQueueElem(elem).start_urec_ptr = e_start_urec_ptr, \ > + GetErrorQueueElem(elem).next_retry_at = e_retry_at, \ > + GetErrorQueueElem(elem).err_occurred_at = e_occurred_at \ > +) It's very very rarely a good idea to have macros that evaluate their arguments multiple times. It'll also never be a good idea to get the same element multiple times from a queue. If needed - I'm very doubtful of that, given that there's a single caller - it should be a static inline function that gets the element once, stores it in a local variable, and then updates all the fields (see the sketch below). > +/* > + * Binary heap comparison function to compare the size of transactions. > + */ > +static int > +undo_size_comparator(Datum a, Datum b, void *arg) > +{ > + UndoSizeQueue *sizeQueueElem1 = (UndoSizeQueue *) DatumGetPointer(a); > + UndoSizeQueue *sizeQueueElem2 = (UndoSizeQueue *) DatumGetPointer(b); > It's very odd that elements are named 'Queue' rather than a queue element. > +/* > + * Binary heap comparison function to compare the time at which an error > + * occurred for transactions. > + * > + * The error queue is sorted by next_retry_at and err_occurred_at. Currently, > + * the next_retry_at has some constant delay time (see PushErrorQueueElem), so > + * it doesn't make much sense to sort by both values. However, in future, if > + * we have some different algorithm for next_retry_at, then it will work > + * seamlessly. > + */ Why is it useful to have error_occurred_at be part of the comparison at all? If we need a tiebreaker, err_occurred_at isn't that (if we can get conflicts for next_retry_at, then we can also get conflicts in err_occurred_at). Seems better to use something actually guaranteed to be unique for a tiebreaker. > +int > +UndoRollbackHashTableSize() > +{ missing void, at least compared to our common style. > + /* > + * The rollback hash table is used to avoid duplicate undo requests by > + * backends and discard worker. The table must be able to accomodate all > + * active undo requests. The undo requests must appear in both xid and > + * size requests queues or neither. In same transaction, there can be two > + * requests one for logged relations and another for unlogged relations. > + * So, the rollback hash table size should be equal to two request queues, > + * an error queue (currently this is same as request queue) and max "the same"? I assume this is intended to mean the same size? > + * backends. This will ensure that it won't get filled. > + */ How does this ensure anything? > +static int > +RemoveOldElemsFromXidQueue() void. > +/* > + * Traverse the queue and remove dangling entries, if any. The queue > + * entry is considered dangling if the hash table doesn't contain the > + * corresponding entry. > + */ > +static int > +RemoveOldElemsFromSizeQueue() void. We shouldn't need this in this form anymore after the rbtree conversion - but because it again highlights one of my main complaints of all this work: Don't have multiple copies of essentially equivalent non-trivial functions. Especially not in the same file. This is a near verbatim copy of RemoveOldElemsFromXidQueue. Without any explanations why it's needed. Even if you intended it only as a short-term workaround (e.g.
for the queues not sharing enough of a common base-layout to be able to share one cleanup routine), at the very least you need to add a FIXME or such explaining that this needs to be fixed. > +/* > + * Traverse the queue and remove dangling entries, if any. The queue > + * entry is considered dangling if the hash table doesn't contain the > + * corresponding entry. > + */ > +static int > +RemoveOldElemsFromErrorQueue() > +{ Another copy. > +/* > + * Returns true, if there is some valid request in the given queue, false, > + * otherwise. > + * > + * It fills hkey with hash key corresponding to the nth element of the > + * specified queue. > + */ > +static bool > +GetRollbackHashKeyFromQueue(UndoWorkerQueueType cur_queue, int n, > + RollbackHashKey *hkey) > +{ > + if (cur_queue == XID_QUEUE) > + { > + UndoXidQueue *elem; > + > + /* check if there is a work in the next queue */ > + if (GetXidQueueSize() <= n) > + return false; > + > + elem = (UndoXidQueue *) GetXidQueueNthElem(n); > + hkey->full_xid = elem->full_xid; > + hkey->start_urec_ptr = elem->start_urec_ptr; > + } This is a slightly different form of copying code repeatedly. Instead of passing in the queue type, this should get a pointer to the queue passed in. Functions like Get*QueueSize(), GetErrorQueueNthElem() shouldn't exist once for each queue type, they should be agnostic as to what the queue type is, and accept a queue as the parameter. Yes, there'd still be one additional queue type specific check, for the time. But that's still a lot less copied code. I also don't think it's a good idea to use RollbackHashKey as the parameter/function name here. This function doesn't need to know that it's for a hash table lookup. > +/* > + * Fetch the end urec pointer for the transaction and the undo request size. > + * > + * end_urecptr_out - This is an INOUT parameter. If end undo pointer is > + * specified, we use the same to calculate the size. Else, we calculate > + * the end undo pointer and return the same. > + * > + * last_log_start_urec_ptr_out - This is an OUT parameter. If a transaction > + * writes undo records in multiple undo logs, this is set to the start undo > + * record pointer of this transaction in the last log. If the transaction > + * writes undo records only in single undo log, it is set to start_urec_ptr. > + * This value is used to update the rollback progress of the transaction in > + * the last log. Once, we have start location in last log, the start location > + * in all the previous logs can be computed. See execute_undo_actions for > + * more details. > + * > + * XXX: We don't calculate the exact undo size. We always skip the size of > + * the last undo record (if not already discarded) from the calculation. This > + * optimization allows us to skip fetching an undo record for the most > + * frequent cases where the end pointer and current start pointer belong to > + * the same log. A simple subtraction between them gives us the size. In > + * future this function can be modified if someone needs the exact undo size. > + * As of now, we use this function to calculate the undo size for inserting > + * in the pending undo actions in undo worker's size queue. > + */ > +uint64 > +FindUndoEndLocationAndSize(UndoRecPtr start_urecptr, > + UndoRecPtr *end_urecptr_out, > + UndoRecPtr *last_log_start_urecptr_out, > + FullTransactionId full_xid) > +{ This really can't be the right place for this function. 
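To be concrete about the direction I'd prefer for this queue code: accessors that take an UndoWorkerQueue pointer, plus static inline functions instead of the multiple-evaluation macros. An untested sketch, reusing the UndoWorkerQueue struct and the element fields quoted earlier (I'm guessing TimestampTz for the retry/occurred timestamps; the names are mine):

static inline int
UndoQueueSize(UndoWorkerQueue *queue)
{
	return binaryheap_cur_size(queue->bh);
}

static inline bool
UndoQueueIsEmpty(UndoWorkerQueue *queue)
{
	return binaryheap_empty(queue->bh);
}

static inline void *
UndoQueueNthElem(UndoWorkerQueue *queue, int n)
{
	Assert(!UndoQueueIsEmpty(queue));
	return DatumGetPointer(binaryheap_nth(queue->bh, n));
}

/* Fetch the element once, then assign the fields - no multiple evaluation. */
static inline void
UndoErrorQueueSetElem(UndoWorkerQueue *queue, int elem, Oid dbid,
					  FullTransactionId full_xid, UndoRecPtr start_urec_ptr,
					  TimestampTz next_retry_at, TimestampTz err_occurred_at)
{
	UndoErrorQueue *entry = &queue->q_choice.error_elems[elem];

	entry->dbid = dbid;
	entry->full_xid = full_xid;
	entry->start_urec_ptr = start_urec_ptr;
	entry->next_retry_at = next_retry_at;
	entry->err_occurred_at = err_occurred_at;
}

With something like that, GetRollbackHashKeyFromQueue() only needs one small queue-type-specific branch for extracting the key fields, rather than duplicating the whole lookup per queue.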
> +/* > + * Returns true, if we can push the rollback request to undo wrokers, false, *workers Also, it's not really queued to workers. Something like "can queue the rollback request to be executed in the background" would be more accurate afaict. > + * otherwise. > + */ > +static bool > +CanPushReqToUndoWorker(UndoRecPtr start_urec_ptr, UndoRecPtr end_urec_ptr, > + uint64 req_size) > +{ > + /* > + * This must be called after acquring RollbackRequestLock as we will check *acquiring > + * the binary heaps which can change. > + */ > + Assert(LWLockHeldByMeInMode(RollbackRequestLock, LW_EXCLUSIVE)); > + > + /* > + * We normally push the rollback request to undo workers if the size of > + * same is above a certain threshold. > + */ > + if (req_size >= rollback_overflow_size * 1024 * 1024) > + { Why is this being checked with the lock held? Seems like this should be handled in a pre-check? > +/* > + * Initialize the hash-table and priority heap based queues for rollback > + * requests in shared memory. > + */ > +void > +PendingUndoShmemInit(void) > +{ > + HASHCTL info; > + bool foundXidQueue = false; > + bool foundSizeQueue = false; > + bool foundErrorQueue = false; > + binaryheap *bh; > + UndoXidQueue *xid_elems; > + UndoSizeQueue *size_elems; > + UndoErrorQueue *error_elems; > + > + MemSet(&info, 0, sizeof(info)); > + > + info.keysize = sizeof(TransactionId) + sizeof(UndoRecPtr); > + info.entrysize = sizeof(RollbackHashEntry); > + info.hash = tag_hash; > + > + RollbackHT = ShmemInitHash("Undo Actions Lookup Table", > + UndoRollbackHashTableSize(), > + UndoRollbackHashTableSize(), &info, > + HASH_ELEM | HASH_FUNCTION | HASH_FIXED_SIZE); > + > + bh = binaryheap_allocate_shm("Undo Xid Binary Heap", > + pending_undo_queue_size, > + undo_age_comparator, > + NULL); > + > + xid_elems = (UndoXidQueue *) ShmemInitStruct("Undo Xid Queue Elements", > + UndoXidQueueElemsShmSize(), > + &foundXidQueue); > + > + Assert(foundXidQueue || !IsUnderPostmaster); > + > + if (!IsUnderPostmaster) > + memset(xid_elems, 0, sizeof(UndoXidQueue)); > + > + InitXidQueue(bh, xid_elems); > + > + bh = binaryheap_allocate_shm("Undo Size Binary Heap", > + pending_undo_queue_size, > + undo_size_comparator, > + NULL); > + size_elems = (UndoSizeQueue *) ShmemInitStruct("Undo Size Queue Elements", > + UndoSizeQueueElemsShmSize(), > + &foundSizeQueue); > + Assert(foundSizeQueue || !IsUnderPostmaster); > + > + if (!IsUnderPostmaster) > + memset(size_elems, 0, sizeof(UndoSizeQueue)); > + > + InitSizeQueue(bh, size_elems); > + > + bh = binaryheap_allocate_shm("Undo Error Binary Heap", > + pending_undo_queue_size, > + undo_err_time_comparator, > + NULL); > + > + error_elems = (UndoErrorQueue *) ShmemInitStruct("Undo Error Queue Elements", > + UndoErrorQueueElemsShmSize(), > + &foundErrorQueue); > + Assert(foundErrorQueue || !IsUnderPostmaster); > + > + if (!IsUnderPostmaster) > + memset(error_elems, 0, sizeof(UndoSizeQueue)); > + > + InitErrorQueue(bh, error_elems); Hm. Aren't you overwriting previously initialized data here with memset and Init*Queue, when using an EXEC_BACKEND build (e.g. Windows)? I think all the initialization should only be done once, e.g. if ShmemInitStruct() sets the *found to true. And then the other elements should be asserted to also exist/not exist. Also, what is the memset() here supposed to be doing? Aren't you just memsetting() the first element in the queue? Since the queue is dynamically sized, a static length (sizeof(UndoSizeQueue)) memset() obviously cannot initialize the members.
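For the initialization part, the usual pattern would be to initialize only when ShmemInitStruct() reports that the structure didn't exist yet, and to zero the whole array rather than a single element. A sketch for the xid queue (the other two queues would be analogous, binaryheap_allocate_shm() presumably needs the same treatment, and I'm assuming UndoWorkerQueues itself is backend-local):

	xid_elems = (UndoXidQueue *) ShmemInitStruct("Undo Xid Queue Elements",
												 UndoXidQueueElemsShmSize(),
												 &foundXidQueue);
	if (!foundXidQueue)
	{
		/* First creation: zero the whole array, not sizeof(UndoXidQueue). */
		memset(xid_elems, 0, UndoXidQueueElemsShmSize());
	}

	/* Setting up backend-local pointers is fine to redo on every attach. */
	InitXidQueue(bh, xid_elems);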
Also, this again is repeating code unnecessarily. > +/* Insert the request into an error queue. */ > +bool > +InsertRequestIntoErrorUndoQueue(volatile UndoRequestInfo * urinfo) > +{ > + RollbackHashEntry *rh; > + > + LWLockAcquire(RollbackRequestLock, LW_EXCLUSIVE); > + > + /* We can't insert into an error queue if it is already full. */ > + if (GetErrorQueueSize() >= pending_undo_queue_size) > + { > + int num_removed = 0; > + > + /* Try to remove few elements */ > + num_removed = RemoveOldElemsFromErrorQueue(); If we kept this, I'd rename these as Prune* and reword the comments to match. This makes the code look like we're actually removing valid entries. > +/* > + * Get the next set of pending rollback request for undo worker. "set"? We only remove one, no? > + * allow_peek - if true, peeks a few element from each queue to check whether > + * any request matches current dbid. > + * remove_from_queue - if true, picks an element from the queue whose dbid > + * matches current dbid and remove it from the queue before returning the same > + * to caller. > + * urinfo - this is an OUT parameter that returns the details of undo request > + * whose undo action is still pending. > + * in_other_db_out - this is an OUT parameter. If we've not found any work > + * for current database, but there is work for some other database, we set > + * this parameter as true. > + */ > +bool > +UndoGetWork(bool allow_peek, bool remove_from_queue, UndoRequestInfo *urinfo, > + bool *in_other_db_out) > +{ > + /* > + * If some undo worker is already processing the rollback request or > + * it is already processed, then we drop that request from the queue > + * and fetch the next entry from the queue. > + */ > + if (!rh || UndoRequestIsInProgress(rh)) > + { > + RemoveRequestFromQueue(cur_queue, 0); > + cur_undo_queue++; > + continue; > + } When is it possible to hit the in-progress case? > + /* > + * We've found a work for some database. If we don't want to remove > + * the request, we return from here and spawn a worker process to > + * apply the same. > + */ > + if (!remove_from_queue) > + { > + bool exists; > + > + StartTransactionCommand(); > + exists = dbid_exists(rh->dbid); > + CommitTransactionCommand(); > + > + /* > + * If the database doesn't exist, just remove the request since we > + * no longer need to apply the undo actions. > + */ > + if (!exists) > + { > + RemoveRequestFromQueue(cur_queue, 0); > + RollbackHTRemoveEntry(rh->full_xid, rh->start_urec_ptr, true); > + cur_undo_queue++; > + continue; > + } I still think there never should be a case in which this is possible. Dropping a database ought to remove all the associated undo. > + /* > + * The worker can perform this request if it is either not > + * connected to any database or the request belongs to the > + * same database to which it is connected. > + */ > + if ((MyDatabaseId == InvalidOid) || > + (MyDatabaseId != InvalidOid && MyDatabaseId == rh->dbid)) > + { > + /* found a work for current database */ > + if (in_other_db_out) > + *in_other_db_out = false; > + > + /* > + * Mark the undo request in hash table as in_progress so > + * that other undo worker doesn't pick the same entry for > + * rollback. > + */ > + rh->status = UNDO_REQUEST_INPROGRESS; > + > + /* set the undo request info to process */ > + SetUndoRequestInfoFromRHEntry(urinfo, rh, cur_queue); > + > + /* > + * Remove the request from queue so that other undo worker > + * doesn't process the same entry. 
*/ > + RemoveRequestFromQueue(cur_queue, depth); > + LWLockRelease(RollbackRequestLock); > + return true; Copy of code from above. > +/* > + * This function registers the rollback requests. > + * > + * Returns true, if the request is registered and will be processed by undo > + * worker at some later point of time, false, otherwise in which case caller > + * can process the undo request by itself. > + * > + * The caller may execute undo actions itself if the request is not already > + * present in rollback hash table and can't be pushed to pending undo request > + * queues. The two reasons why request can't be pushed are (a) the size of > + * request is smaller than a threshold and the request is not from discard > + * worker, (b) the undo request queues are full. > + * > + * It is not advisable to apply the undo actions of a very large transaction > + * in the foreground as that can lead to a delay in retruning the control back *returning > +/* different types of undo worker */ > +typedef enum > +{ > + XID_QUEUE = 0, > + SIZE_QUEUE = 1, > + ERROR_QUEUE > +} UndoWorkerQueueType; IMO odd to explicitly number two elements of an enum, but not the third. > +/* This is an entry for undo request queue that is sorted by xid. */ > +typedef struct UndoXidQueue > +{ > + FullTransactionId full_xid; > + UndoRecPtr start_urec_ptr; > + Oid dbid; > +} UndoXidQueue; As I said before, this isn't a queue, it's a queue entry. > +/* Reset the undo request info */ > +#define ResetUndoRequestInfo(urinfo) \ > +( \ > + (urinfo)->full_xid = InvalidFullTransactionId, \ > + (urinfo)->start_urec_ptr = InvalidUndoRecPtr, \ > + (urinfo)->end_urec_ptr = InvalidUndoRecPtr, \ > + (urinfo)->last_log_start_urec_ptr = InvalidUndoRecPtr, \ > + (urinfo)->dbid = InvalidOid, \ > + (urinfo)->request_size = 0, \ > + (urinfo)->undo_worker_queue = InvalidUndoWorkerQueue \ > +) > + > +/* set the undo request info from the rollback request */ > +#define SetUndoRequestInfoFromRHEntry(urinfo, rh, cur_queue) \ > +( \ > + urinfo->full_xid = rh->full_xid, \ > + urinfo->start_urec_ptr = rh->start_urec_ptr, \ > + urinfo->end_urec_ptr = rh->end_urec_ptr, \ > + urinfo->last_log_start_urec_ptr = rh->last_log_start_urec_ptr, \ > + urinfo->dbid = rh->dbid, \ > + urinfo->undo_worker_queue = cur_queue \ > +) See my other complaint about such macros. Multiple evaluation hazard etc. Also, the different formatting in two consecutively defined macros is odd. > +/*------------------------------------------------------------------------- > + * > + * undoaction.c > + * execute undo actions > + * > + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group > + * Portions Copyright (c) 1994, Regents of the University of California > + * > + * src/backend/access/undo/undoaction.c > + * > + * To apply the undo actions, we collect the undo records in bulk and try to s/the//g > + * process them together. We ensure to update the transaction's progress at > + * regular intervals so that after a crash we can skip already applied undo. > + * The undo apply progress is updated in terms of the number of blocks > + * processed. Undo apply progress value XACT_APPLY_PROGRESS_COMPLETED > + * indicates that all the undo is applied, XACT_APPLY_PROGRESS_NOT_STARTED > + * indicates that no undo action has been applied yet and any other value > + * indicates that we have applied undo partially and after crash recovery, we > + * need to start processing the undo from the same location.
> + *------------------------------------------------------------------------- > +/* > + * UpdateUndoApplyProgress - Updates how far undo actions from a particular > + * log have been applied while rolling back a transaction. This progress is > + * measured in terms of undo block number of the undo log till which the > + * undo actions have been applied. > + */ > +static void > +UpdateUndoApplyProgress(UndoRecPtr progress_urec_ptr, > + BlockNumber block_num) > +{ > + UndoLogCategory category; > + UndoRecordInsertContext context = {{0}}; > + > + category = > + UndoLogNumberGetCategory(UndoRecPtrGetLogNo(progress_urec_ptr)); > + > + /* > + * We don't need to update the progress for temp tables as they get > + * discraded after startup. > + */ > + if (category == UNDO_TEMP) > + return; > + > + BeginUndoRecordInsert(&context, category, 1, NULL); > + > + /* > + * Prepare and update the undo apply progress in the transaction header. > + */ > + UndoRecordPrepareApplyProgress(&context, progress_urec_ptr, block_num); > + > + START_CRIT_SECTION(); > + > + /* Update the progress in the transaction header. */ > + UndoRecordUpdateTransInfo(&context, 0); > + > + /* WAL log the undo apply progress. */ > + { > + XLogRecPtr lsn; > + xl_undoapply_progress xlrec; > + > + xlrec.urec_ptr = progress_urec_ptr; > + xlrec.progress = block_num; > + > + XLogBeginInsert(); > + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); > + > + RegisterUndoLogBuffers(&context, 1); > + lsn = XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_APPLY_PROGRESS); > + UndoLogBuffersSetLSN(&context, lsn); > + } > + > + END_CRIT_SECTION(); > + > + /* Release undo buffers. */ > + FinishUndoRecordInsert(&context); > +} This whole prepare/execute split for updating apply progress and next undo pointers makes no sense to me. > +/* > + * UndoAlreadyApplied - Retruns true, if the actions are already applied, *returns > + * false, otherwise. > + */ > +static bool > +UndoAlreadyApplied(FullTransactionId full_xid, UndoRecPtr to_urecptr) > +{ > + UnpackedUndoRecord *uur = NULL; > + UndoRecordFetchContext context; > + > + /* Fetch the undo record. */ > + BeginUndoFetch(&context); > + uur = UndoFetchRecord(&context, to_urecptr); > + FinishUndoFetch(&context); Literally all the places that fetch a record, fetch them with exactly this combination of calls. If that's the pattern, what do we gain by this split? Note that UndoBulkFetchRecord does *NOT* use an UndoRecordFetchContext, for reasons that are beyond me. (Sketch below.) > +static void > +ProcessAndApplyUndo(FullTransactionId full_xid, UndoRecPtr from_urecptr, > + UndoRecPtr to_urecptr, UndoRecPtr last_log_start_urec_ptr, > + bool complete_xact) > +{ > + UndoRecInfo *urecinfo; > + UndoRecPtr urec_ptr = from_urecptr; > + int undo_apply_size; > + > + /* > + * We choose maintenance_work_mem to collect the undo records for > + * rollbacks as most of the large rollback requests are done by > + * background worker which can be considered as maintainence operation. > + * However, we can introduce a new guc for this as well. > + */ > + undo_apply_size = maintenance_work_mem * 1024L; > + > + /* > + * Fetch the multiple undo records that can fit into undo_apply_size; sort > + * them and then rmgr specific callback to process them. Repeat this > + * until we process all the records for the transaction being rolled back. > + */ > + do > + { use for(;;) or while (true).
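To spell out the UndoFetchRecord point: if every caller does the same Begin/Fetch/Finish dance, the API may as well provide a one-shot wrapper - or lose the context argument entirely. A trivial sketch (the name is mine):

static UnpackedUndoRecord *
UndoFetchRecordSingle(UndoRecPtr urec_ptr)
{
	UndoRecordFetchContext context;
	UnpackedUndoRecord *uur;

	BeginUndoFetch(&context);
	uur = UndoFetchRecord(&context, urec_ptr);
	FinishUndoFetch(&context);

	return uur;
}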
> + BlockNumber progress_block_num = InvalidBlockNumber; > + int i; > + int nrecords; > + bool log_switched = false; > + bool rollback_completed = false; > + bool update_progress = false; > + UndoRecPtr progress_urec_ptr = InvalidUndoRecPtr; > + UndoRecInfo *first_urecinfo; > + UndoRecInfo *last_urecinfo; > + > + CHECK_FOR_INTERRUPTS(); > + > + /* > + * Fetch multiple undo records at once. > + * > + * At a time, we only fetch the undo records from a single undo log. > + * Once, we process all the undo records from one undo log, we update s/Once, we process/Once we have processed/ > + * the last_log_start_urec_ptr and proceed to the previous undo log. > + */ > + urecinfo = UndoBulkFetchRecord(&urec_ptr, last_log_start_urec_ptr, > + undo_apply_size, &nrecords, false); > + > + /* > + * Since the rollback of this transaction is in-progress, there will be > + * at least one undo record which is not yet discarded. > + */ > + Assert(nrecords > 0); > + > + /* > + * Get the required information from first and last undo record before > + * we sort all the records. > + */ > + first_urecinfo = &urecinfo[0]; > + last_urecinfo = &urecinfo[nrecords - 1]; > + if (last_urecinfo->uur->uur_info & UREC_INFO_LOGSWITCH) > + { > + UndoRecordLogSwitch *logswitch = last_urecinfo->uur->uur_logswitch; > + > + /* > + * We have crossed the log boundary. The rest of the undo for > + * this transaction is in some other log, the location of which > + * can be found from this record. See commets atop undoaccess.c. *comments > + /* > + * We need to save the undo record pointer of the last record from > + * previous undo log. We will use the same as from location in > + * next iteration of bulk fetch. > + */ > + Assert(UndoRecPtrIsValid(logswitch->urec_prevurp)); > + urec_ptr = logswitch->urec_prevurp; > + > + /* > + * The last fetched undo record corresponds to the first undo > + * record of the current log. Once, the undo actions are performed > + * from this log, we've to mark the progress as completed. > + */ > + progress_urec_ptr = last_urecinfo->urp; > + > + /* > + * We also need to save the start location of this transaction in > + * previous log. This will be used in the next iteration of bulk > + * fetch and updating progress location. > + */ > + if (complete_xact) > + { > + Assert(UndoRecPtrIsValid(logswitch->urec_prevlogstart)); > + last_log_start_urec_ptr = logswitch->urec_prevlogstart; > + } > + > + /* We've to update the progress for the current log as completed. */ > + update_progress = true; > + } > + else if (complete_xact) > + { > + if (UndoRecPtrIsValid(urec_ptr)) > + { > + /* > + * There are still some undo actions pending in this log. So, > + * just update the progress block number. > + */ > + progress_block_num = UndoRecPtrGetBlockNum(last_urecinfo->urp); > + > + /* > + * If we've not fetched undo records for more than one undo > + * block, we can't update the progress block number. Because, > + * there can still be undo records in this block that needs to > + * be applied for rolling back this transaction. > + */ > + if (UndoRecPtrGetBlockNum(first_urecinfo->urp) > progress_block_num) > + { > + update_progress = true; > + progress_urec_ptr = last_log_start_urec_ptr; > + } > + } > + else > + { > + /* > + * Invalid urec_ptr indicates that we have executed all the undo > + * actions for this transaction. So, mark current log header > + * as complete. 
*/ > + Assert(last_log_start_urec_ptr == to_urecptr); > + rollback_completed = true; > + update_progress = true; > + progress_urec_ptr = last_log_start_urec_ptr; > + } > + } This should be in a separate function. > + /* Free all undo records. */ > + for (i = 0; i < nrecords; i++) > + UndoRecordRelease(urecinfo[i].uur); > + > + /* Free urp array for the current batch of undo records. */ > + pfree(urecinfo); As noted elsewhere, I think that's the wrong memory management strategy. We should be using a memory context for undo processing, and then just reset it as a whole. For one, freeing granularly is inefficient. But more than that, it also means there's nothing to prevent memory leaks here. (Sketch below.) > +/* > + * execute_undo_actions - Execute the undo actions That's just a restatement of the function name. > + * full_xid - Transaction id that is getting rolled back. > + * from_urecptr - undo record pointer from where to start applying undo > + * actions. > + * to_urecptr - undo record pointer up to which the undo actions need to be > + * applied. > + * complete_xact - true if rollback is for complete transaction. > + */ > +void > +execute_undo_actions(FullTransactionId full_xid, UndoRecPtr from_urecptr, > + UndoRecPtr to_urecptr, bool complete_xact) > +{ Why is this lower case, but ApplyUndo() camel case? How is a reader supposed to know which one is used for what? > typedef struct TwoPhaseFileHeader > { > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader > uint16 gidlen; /* length of the GID - GID follows the header */ > XLogRecPtr origin_lsn; /* lsn of this record at origin node */ > TimestampTz origin_timestamp; /* time of prepare at origin node */ > + > + /* > + * We need the locations of the start and end undo record pointers when > + * rollbacks are to be performed for prepared transactions using undo-based > + * relations. We need to store this information in the file as the user > + * might rollback the prepared transaction after recovery and for that we > + * need its start and end undo locations. > + */ > + UndoRecPtr start_urec_ptr[UndoLogCategories]; > + UndoRecPtr end_urec_ptr[UndoLogCategories]; > } TwoPhaseFileHeader; Why do we not need that knowledge for undo processing of a non-prepared transaction? > @@ -191,6 +195,16 @@ typedef struct TransactionStateData > bool didLogXid; /* has xid been included in WAL record? */ > int parallelModeLevel; /* Enter/ExitParallelMode counter */ > bool chain; /* start a new block after this one */ > + > + /* start and end undo record location for each log category */ > + UndoRecPtr startUrecPtr[UndoLogCategories]; /* this is 'to' location */ > + UndoRecPtr latestUrecPtr[UndoLogCategories]; /* this is 'from' > + * location */ > + /* > + * whether the undo request is registered to be processed by worker later? > + */ > + bool undoRequestResgistered[UndoLogCategories]; > + s/Resgistered/Registered/ > @@ -2906,9 +2942,18 @@ CommitTransactionCommand(void) > * StartTransactionCommand didn't set the STARTED state > * appropriately, while TBLOCK_PARALLEL_INPROGRESS should be ended > * by EndParallelWorkerTransaction(), not this function. > + * > + * TBLOCK_(SUB)UNDO means the error has occurred while applying > + * undo for a (sub)transaction. We can't reach here as while s/We can't reach here as while/This can't be reached while/ > + * applying undo via top-level transaction, if we get an error, > + * then it is handled by ReleaseResourcesAndProcessUndo Where and how does it handle that? Maybe I misunderstand what you mean?
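Back to the memory management point above: what I'd expect is one context that each batch is allocated in, reset once per loop iteration. An untested sketch, reusing the names from the quoted loop:

	MemoryContext undo_apply_context;
	MemoryContext oldcontext;

	undo_apply_context = AllocSetContextCreate(CurrentMemoryContext,
											   "undo apply",
											   ALLOCSET_DEFAULT_SIZES);

	do
	{
		oldcontext = MemoryContextSwitchTo(undo_apply_context);

		/* Everything the batch allocates, leaks included, lands here. */
		urecinfo = UndoBulkFetchRecord(&urec_ptr, last_log_start_urec_ptr,
									   undo_apply_size, &nrecords, false);

		/* ... sort the records and call the rmgr undo callbacks ... */

		MemoryContextSwitchTo(oldcontext);

		/* One cheap reset replaces per-record UndoRecordRelease()/pfree(). */
		MemoryContextReset(undo_apply_context);
	} while (UndoRecPtrIsValid(urec_ptr));

	MemoryContextDelete(undo_apply_context);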
> + case TBLOCK_UNDO: > + /* > + * We reach here when we got error while applying undo > + * actions, so we don't want to again start applying it. Undo > + * workers can take care of it. > + * > + * AbortTransaction is already done, still need to release > + * locks and perform cleanup. > + */ > + ResetUndoActionsInfo(); > + ResourceOwnerRelease(s->curTransactionOwner, > + RESOURCE_RELEASE_LOCKS, > + false, > + true); > + s->state = TRANS_ABORT; > CleanupTransaction(); Hm. Why is it ok that we only perform that cleanup action? Either the rest of potentially held resources will get cleaned up somehow as well, in which case this ResourceOwnerRelease() ought to be redundant, or we're potentially leaking important resources like buffer pins, relcache references and whatnot here? > +/* > + * CheckAndRegisterUndoRequest - Register the request for applying undo > + * actions. > + * > + * It sets the transaction state to indicate whether the request is pushed to > + * the background worker which is used later to decide whether to apply the > + * actions. > + * > + * It is important to do this before marking the transaction as aborted in > + * clog otherwise, it is quite possible that discard worker miss this rollback > + * request from the computation of oldestXidHavingUnappliedUndo. This is > + * because it might do that computation before backend can register it in the > + * rollback hash table. So, neither oldestXmin computation will consider it > + * nor the hash table pass would have that value. > + */ > +static void > +CheckAndRegisterUndoRequest() (void) > +{ > + TransactionState s = CurrentTransactionState; > + bool result; > + int i; > + > + /* > + * We don't want to apply the undo actions when we are already cleaning up > + * for FATAL error. See ReleaseResourcesAndProcessUndo. > + */ > + if (SemiCritSectionCount > 0) > + { > + ResetUndoActionsInfo(); > + return; > + } Wait what? Semi critical sections? > + for (i = 0; i < UndoLogCategories; i++) > + { > + /* > + * We can't push the undo actions for temp table to background > + * workers as the the temp tables are only accessible in the > + * backend that has created them. > + */ > + if (i != UNDO_TEMP && UndoRecPtrIsValid(s->latestUrecPtr[i])) > + { > + result = RegisterUndoRequest(s->latestUrecPtr[i], > + s->startUrecPtr[i], > + MyDatabaseId, > + GetTopFullTransactionId()); > + s->undoRequestResgistered[i] = result; > + } > + } Given code like this, I have a hard time seeing what the point of having separate queue entries for the different persistency levels is. > +void > +ReleaseResourcesAndProcessUndo(void) > +{ > + TransactionState s = CurrentTransactionState; > + > + /* > + * We don't want to apply the undo actions when we are already cleaning up > + * for FATAL error. One of the main reasons is that we might be already > + * processing undo actions for a (sub)transaction when we reach here > + * (for ex. error happens while processing undo actions for a > + * subtransaction). > + */ > + if (SemiCritSectionCount > 0) > + { > + ResetUndoActionsInfo(); > + return; > + } > + > + if (!NeedToPerformUndoActions()) > + return; > + > + /* > + * State should still be TRANS_ABORT from AbortTransaction(). > + */ > + if (s->state != TRANS_ABORT) > + elog(FATAL, "ReleaseResourcesAndProcessUndo: unexpected state %s", > + TransStateAsString(s->state)); > + > + /* > + * Do abort cleanup processing before applying the undo actions. We must > + * do this before applying the undo actions to remove the effects of > + * failed transaction.
> + */ > + if (IsSubTransaction()) > + { > + AtSubCleanup_Portals(s->subTransactionId); > + s->blockState = TBLOCK_SUBUNDO; > + } > + else > + { > + AtCleanup_Portals(); /* now safe to release portal memory */ > + AtEOXact_Snapshot(false, true); /* and release the transaction's > + * snapshots */ Why do precisely these actions need to be performed here? > + s->fullTransactionId = InvalidFullTransactionId; > + s->subTransactionId = TopSubTransactionId; > + s->blockState = TBLOCK_UNDO; > + } > + > + s->state = TRANS_UNDO; This seems guaranteed to constantly be out of date with other modifications of the commit/abort sequence. > +bool > +ProcessUndoRequestForEachLogCat(FullTransactionId fxid, Oid dbid, > + UndoRecPtr *end_urec_ptr, UndoRecPtr *start_urec_ptr, > + bool *undoRequestResgistered, bool isSubTrans) > +{ > + UndoRequestInfo urinfo; > + int i; > + uint32 save_holdoff; > + bool success = true; > + > + for (i = 0; i < UndoLogCategories; i++) > + { > + if (end_urec_ptr[i] && !undoRequestResgistered[i]) > + { > + save_holdoff = InterruptHoldoffCount; > + > + PG_TRY(); > + { > + /* for subtransactions, we do partial rollback. */ > + execute_undo_actions(fxid, > + end_urec_ptr[i], > + start_urec_ptr[i], > + !isSubTrans); > + } > + PG_CATCH(); > + { > + /* > + * Add the request into an error queue so that it can be > + * processed in a timely fashion. > + * > + * If we fail to add the request in an error queue, then mark > + * the entry status as invalid and continue to process the > + * remaining undo requests if any. This request will be later > + * added back to the queue by discard worker. > + */ > + ResetUndoRequestInfo(&urinfo); > + urinfo.dbid = dbid; > + urinfo.full_xid = fxid; > + urinfo.start_urec_ptr = start_urec_ptr[i]; > + if (!InsertRequestIntoErrorUndoQueue(&urinfo)) > + RollbackHTMarkEntryInvalid(urinfo.full_xid, > + urinfo.start_urec_ptr); > + /* > + * Errors can reset holdoff count, so restore back. This is > + * required because this function can be called after holding > + * interrupts. > + */ > + InterruptHoldoffCount = save_holdoff; > + > + /* Send the error only to server log. */ > + err_out_to_client(false); > + EmitErrorReport(); > + > + success = false; > + > + /* We should never reach here when we are in a semi-critical-section. */ > + Assert(SemiCritSectionCount == 0); This seems entirely and completely broken. You can't just catch an exception and continue. What if somebody held an lwlock when the error was thrown? A buffer pin? As far as I can tell the semi crit section stuff doesn't protect you against anything here, because it's not used exclusively. > +to complete the requests by themselves. There is an exception to it where when > +error queue becomes full, we just mark the request as 'invalid' and continue to > +process other requests if any. The discard worker will find this errored > +transaction at later point of time and again add it to the request queues. You say it's an exception, but you do not explain why that exception is there. Nor why that's not a problem for: > +We have the hard limit (proportional to the size of the rollback hash table) > +for the number of transactions that can have pending undo. This can help us > +in computing the value of oldestXidHavingUnappliedUndo and allowing us not to > +accumulate pending undo for a long time which will eventually block the > +discard of undo. > + * The main responsibility of the discard worker is to discard the undo log > + * of transactions that are committed and all-visible or are rolledback. 
It *rolled back > + * also registers the request for aborted transactions in the work queues. > + * To know more about work queues, see undorequest.c. It iterates through all > + * the active logs one-by-one and try to discard the transactions that are old > + * enough to matter. > + * > + * For tranasctions that spans across multiple logs, the log for committed and *transactions > + * all-visible transactions are discarded seprately for each log. This is *separately > + * possible as the transactions that span across logs have separate transaction > + * header for each log. For aborted transactions, we try to process the actions *transaction headers > + * of entire transaction at one-shot as we need to perform the actions starting *an entire transaction in one shot > + * from end location to start location. However, it is possbile that the later *possible > + * portion of transaction that is overflowed into a separate log can be processed *a transaction > + * separately if we encounter the corresponding log first. If we want we can > + * combine the log for processing in that case as well, but there is no clear > + * advantage of the same. *of doing so > +void > +DiscardWorkerRegister(void) > +{ > + BackgroundWorker bgw; > + > + memset(&bgw, 0, sizeof(bgw)); > + bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | > + BGWORKER_BACKEND_DATABASE_CONNECTION; Why is a database needed? > + /* > + * Scan all the undo logs and intialize the rollback hash table with all > + * the pending rollback requests. This need to be done as a first step > + * because only after this the transactions will be allowed to write new > + * undo. See comments atop UndoLogProcess. > + */ > + UndoLogProcess(); Too generic name. > @@ -668,6 +676,50 @@ PrepareUndoInsert(UndoRecordInsertContext *context, > UndoCompressionInfo *compression_info = > &context->undo_compression_info[context->alloc_context.category]; > > + if (!InRecovery && IsUnderPostmaster) > + { > + int try_count = 0; > + > + /* > + * If we are not in a recovery and not in a single-user-mode, then undo s/in a single-user-mode/in single-user-mode/ (although I'd also remove the dashes) > + * generation should not be allowed until we have scanned all the undo > + * logs and initialized the hash table with all the aborted > + * transaction entries. See detailed comments in UndoLogProcess. > + */ > + while (!ProcGlobal->rollbackHTInitialized) > + { > + /* Error out after trying for one minute. */ > + if (try_count > ROLLBACK_HT_INIT_WAIT_TRY) > + ereport(ERROR, > + (errcode(ERRCODE_E_R_E_MODIFYING_SQL_DATA_NOT_PERMITTED), > + errmsg("rollback hash table is not yet initialized, wait for sometime and try again"))); > + > + /* > + * Rollback hash table is not yet intialized, sleep for 1 second > + * and try again. > + */ > + pg_usleep(1000000L); > + try_count++; > + } > + } I think it's wrong to do this here. We shouldn't open the database for writes before having performed sufficient initialization. If done like that, we shouldn't ever get here. Without such sequencing it's actually not possible to bring up a standby and allow writes in a normal way - the first few transactions will just fail. That's not ok. Nor are new retry loops with sleeps ok IMO. > + /* > + * If the rollback hash table is already full (excluding one additional > + * space for each backend) then don't allow to generate any new undo until > + * we apply some of the pending requests and create some space in the hash > + * table to accept new rollback requests.
Leave the enough slots in the > + * hash table so that there is space for all the backends to register at > + * least one request. This is to protect the situation where one backend > + * keep consuming slots reserve for the other backends and suddenly there > + * is concurrent undo request from all the backends. So we always keep > + * the space reserve for MaxBackends. > + */ > + if (ProcGlobal->xactsHavingPendingUndo > > + (UndoRollbackHashTableSize() - MaxBackends)) > + ereport(ERROR, > + (errcode(ERRCODE_INSUFFICIENT_RESOURCES), > + errmsg("max limit for pending rollback request has reached, wait for sometime and try again"))); > + Why do we need to do this work every time we're inserting undo? Shouldn't that just happen once, when first accessing an undo log in a transaction? > + /* There might not be any undo log and hibernation might be needed. */ > + *hibernate = true; > + > + StartTransactionCommand(); Why do we need this? I assume it's so we can have a resource owner? Out of energy. Greetings, Andres Freund
Hi, On 2019-08-06 00:56:26 -0700, Andres Freund wrote: > Out of energy. Here's the last section of my low-level review. Plan to write a higher level summary afterwards, now that I have a better picture of the code. > +static void > +UndoDiscardOneLog(UndoLogSlot *slot, TransactionId xmin, bool *hibernate) I think the naming here is pretty confusing. We have UndoDiscard(), UndoDiscardOneLog(), UndoLogDiscard(). I don't think anybody really can be expected to understand what is supposed to be what from these names. > + /* Loop until we run out of discardable transactions in the given log. */ > + do > + { for(;;) or while (true) > + TransactionId wait_xid = InvalidTransactionId; > + bool pending_abort = false; > + bool request_rollback = false; > + UndoStatus status; > + UndoRecordFetchContext context; > + > + next_insert = UndoLogGetNextInsertPtr(logno); > + > + /* There must be some undo data for a transaction. */ > + Assert(next_insert != undo_recptr); > + > + /* Fetch the undo record for the given undo_recptr. */ > + BeginUndoFetch(&context); > + uur = UndoFetchRecord(&context, undo_recptr); > + FinishUndoFetch(&context); > + > + if (uur != NULL) > + { > + if (UndoRecPtrGetCategory(undo_recptr) == UNDO_SHARED) FWIW, this is precisely my problem with exposing such small informational functions, which actually have to perform some work. As is, there are several places looking up the underlying undo slot, within just these lines of code. We do it once in UndoLogGetNextInsertPtr(). Then again in UndoFetchRecord(). And then again in UndoRecPtrGetCategory(). And then later again multiple times when actually discarding. That perhaps doesn't matter from a performance POV, but for me that indicates that the APIs aren't quite right. > + { > + /* > + * For the "shared" category, we only discard when the > + * rm_undo_status callback tells us we can. > + */ Is there a description as to what the rm_undo_status callback is intended to do? It currently is mandatory, is that intended? Why does this only apply to shared records? And why just for SHARED, not for any of the others? > + else > + { > + TransactionId xid = XidFromFullTransactionId(uur->uur_fxid); > + > + /* > + * Otherwise we use the CLOG and xmin to decide whether to > + * wait, discard or roll back. > + * > + * XXX: We've added the transaction-in-progress check to > + * avoid xids of in-progress autovacuum as those are not > + * computed for oldestxmin calculation. Hm. xids of autovacuum? The concern here is the xid that autovacuum might acquire when locking a relation for truncating a table at the end, with wal_level=replica? Because otherwise it shouldn't have any xids? > See > + * DiscardWorkerMain. Hm. This actually reminds me of a complaint I have about this. ISTM that the logic for discarding itself should be separate from the discard worker. I'd just add that, and a UDF to invoke it, in a separate commit. > + /* > + * Add the aborted transaction to the rollback request queues. > + * > + * We can ignore the abort for transactions whose corresponding > + * database doesn't exist. > + */ > + if (request_rollback && dbid_exists(uur->uur_txn->urec_dbid)) > + { > + (void) RegisterUndoRequest(InvalidUndoRecPtr, > + undo_recptr, > + uur->uur_txn->urec_dbid, > + uur->uur_fxid); > + > + pending_abort = true; > + } As I think I said before: This imo should not be necessary. > + > + /* > + * We can discard upto this point when one of following conditions is *up to > + * met: (a) we need to wait for a transaction first.
(b) there is no > + * more log to process. (c) the transaction undo in current log is > + * finished. (d) there is a pending abort. > + */ This comment is hard to understand. Perhaps you're missing some words? Because it's e.g. not clear what it means that "we can discard up to this point", when we "need to wait for a transaction first". Those seem strictly contradictory. I assume what this is trying to say is that we now have reached the end of the range of undo that can be discarded, so we should do so now? But it's really quite muddled, because we don't actually necessarily discard here, because we might have a wait_xid, for example? > + if (TransactionIdIsValid(wait_xid) || > + next_urecptr == InvalidUndoRecPtr || > + UndoRecPtrGetLogNo(next_urecptr) != logno || > + pending_abort) Hm. Is it guaranteed that wait_xid isn't actually old enough that we could discard further? I haven't figured out what precisely the purpose of rm_undo_status is, so I'm not sure. But the alternative seems to be that the callback would need to perform its own GetOldestXmin() computations etc, which seems to make no sense? It seems to me that the whole DidCommit/!InProgress/ block should not be part of the if (UndoRecPtrGetCategory(undo_recptr) == UNDO_SHARED) else if block, but follow it? I.e. the only thing inside the else should be XidFromFullTransactionId(uur->uur_fxid), and then we check afterwards whether it, or rm_undo_status()'s return value requires waiting? > + { > + /* Hey, I got some undo log to discard, can not hibernate now. */ > + *hibernate = false; I don't understand why this block sets *hibernate to false. I mean need_discard is not guaranteed to be true at this point, no? > + /* > + * If we don't need to wait for this transaction and this is not > + * an aborted transaction, then we can discard it as well. > + */ > + if (!TransactionIdIsValid(wait_xid) && !pending_abort) > + { > + /* > + * It is safe to use next_insert as the location till which we > + * want to discard in this case. If something new has been > + * added after we have fetched this transaction's record, it > + * won't be considered in this pass of discard. > + */ > + undo_recptr = next_insert; > + latest_discardxid = XidFromFullTransactionId(undofxid); > + need_discard = true; > + > + /* We don't have anything more to discard. */ > + undofxid = InvalidFullTransactionId; > + } > + /* Update the shared memory state. */ > + LWLockAcquire(&slot->discard_lock, LW_EXCLUSIVE); > + > + /* > + * If the slot has been recycling while we were thinking about it, *recycled > + * we have to abandon the operation. > + */ > + if (slot->logno != logno) > + { > + LWLockRelease(&slot->discard_lock); > + break; > + } > + > + /* Update the slot information for the next pass of discard. */ > + slot->wait_fxmin = undofxid; > + slot->oldest_data = undo_recptr; Perhaps 'next pass of UndoDiscard()' instead? I found it confusing that UndoDiscardOneLog() is a loop, meaning that the "next pass" could perhaps reference the next pass through UndoDiscardOneLog()'s loop. But it's for UndoDiscard(). > + LWLockRelease(&slot->discard_lock); > + > + if (need_discard) > + { > + LWLockAcquire(&slot->discard_update_lock, LW_EXCLUSIVE); > + UndoLogDiscard(undo_recptr, latest_discardxid); > + LWLockRelease(&slot->discard_update_lock); > + } > + > + break; > + } It seems to me that the entire block above just shouldn't be inside the loop. As far as I can tell the point of the loop is to figure out up to where we can discard.
Putting the actual discarding inside that loop is just confusing (and requires deeper indentation than necessary). > +/* > + * Scan all the undo logs and register the aborted transactions. This is > + * called as a first function from the discard worker and only after this pass "a first function"? There can only be one first function, no? Also, what does "first function" really mean? As I wrote earlier, I think this function name is too generic, it doesn't explain anything. And I think it's not OK for it to be called (the bgworker is started with BgWorkerStart_RecoveryFinished) after the system is supposed to be ready (i.e. StartupXlog() has finished, we signal that we're up to pg_ctl etc, and allow writing transactions), while at the same time being necessary before writes can be allowed (we throw errors in PrepareUndoInsert()). > + * over undo logs is complete, new undo can is allowed to be written in the "undo can"? > + * system. This is required because after crash recovery we don't know the > + * exact number of aborted transactions whose rollback request is pending and > + * we can not allow new undo request if we already have the request equal to > + * hash table size. So before start allowing any new transaction to write the > + * undo we need to make sure that we know exact number of pending requests. > + */ > +void > +UndoLogProcess() (void) > +{ > + UndoLogSlot *slot = NULL; > + > + /* > + * We need to perform this in a transaction because (a) we need resource > + * owner to scan the logs and (b) TransactionIdIsInProgress requires us to > + * be in transaction. > + */ > + StartTransactionCommand(); The need for resowners does not imply needing transactions. I think nearly all aux processes, for example, don't use transactions, but do have a resowner. > + /* > + * Loop through all the valid undo logs and scan them transaction by > + * transaction to find non-commited transactions if any and register them > + * in the rollback hash table. > + */ > + while ((slot = UndoLogNextSlot(slot))) > + { > + UndoRecPtr undo_recptr; > + UnpackedUndoRecord *uur = NULL; > + > + /* We do not execute shared (non-transactional) undo records. */ > + if (slot->meta.category == UNDO_SHARED) > + continue; > + > + /* Start scanning the log from the last discard point. */ > + undo_recptr = UndoLogGetOldestRecord(slot->logno, NULL); > + > + /* Loop until we scan complete log. */ > + while (1) > + { > + TransactionId xid; > + UndoRecordFetchContext context; > + > + /* Done with this log. */ > + if (!UndoRecPtrIsValid(undo_recptr)) > + break; Why isn't this loop while(UndoRecPtrIsValid(undo_recptr))? > + /* > + * Register the rollback request for all uncommitted and not in > + * progress transactions whose undo apply progress is still not > + * completed. Even though we don't allow any new transactions to > + * write undo until this first pass is completed, there might be > + * some prepared transactions which are still in progress, so we > + * don't include such transactions. > + */ > + if (!TransactionIdDidCommit(xid) && > + !TransactionIdIsInProgress(xid) && > + !IsXactApplyProgressCompleted(uur->uur_txn->urec_progress)) > + { > + (void) RegisterUndoRequest(InvalidUndoRecPtr, undo_recptr, > + uur->uur_txn->urec_dbid, > + uur->uur_fxid); > + } > + > + /* > + * Go to the next transaction in the same log. If uur_next is > + * point to the undo record pointer in the different log then we are "is point" > + * done with this log so just set undo_recptr to InvalidUndoRecPtr.
> + */ > + if (UndoRecPtrGetLogNo(undo_recptr) == > + UndoRecPtrGetLogNo(uur->uur_txn->urec_next)) > + undo_recptr = uur->uur_txn->urec_next; > + else > + undo_recptr = InvalidUndoRecPtr; > + > + /* Release memory for the current record. */ > + UndoRecordRelease(uur); > + } > + } > + * XXX Ideally we can arrange undo logs so that we can efficiently find > + * those with oldest_xid < oldestXmin, but for now we'll just scan all of > + * them. > + */ > + while ((slot = UndoLogNextSlot(slot))) > + { > + /* > + * If the log is already discarded, then we are done. It is important > + * to first check this to ensure that tablespace containing this log > + * doesn't get dropped concurrently. > + */ > + LWLockAcquire(&slot->mutex, LW_SHARED); > + /* > + * We don't have to worry about slot recycling and check the logno > + * here, since we don't care about the identity of this slot, we're > + * visiting all of them. > + */ > + if (slot->meta.discard == slot->meta.unlogged.insert) > + { > + LWLockRelease(&slot->mutex); > + continue; > + } > + LWLockRelease(&slot->mutex); I'm fairly sure that pgindent will add some newlines here... It's a good practice to re-pgindent patches. > + /* We can't process temporary undo logs. */ > + if (slot->meta.category == UNDO_TEMP) > + continue; > + > + /* > + * If the first xid of the undo log is smaller than the xmin then try > + * to discard the undo log. > + */ > + if (!FullTransactionIdIsValid(slot->wait_fxmin) || > + FullTransactionIdPrecedes(slot->wait_fxmin, oldestXidHavingUndo)) So the comment describes something different than what's happening, while otherwise not adding much over the code... That's imo confusing. > + { > + /* Process the undo log. */ > + UndoDiscardOneLog(slot, oldestXmin, hibernate); That comment seems unhelpful. > + * XXX: In future, if multiple workers can perform discard then we may > + * need to use compare and swap for updating the shared memory value. > + */ > + if (FullTransactionIdIsValid(oldestXidHavingUndo)) > + pg_atomic_write_u64(&ProcGlobal->oldestFullXidHavingUnappliedUndo, > + U64FromFullTransactionId(oldestXidHavingUndo)); Seems like a lock would be more appropriate if we ever needed that - only other discard workers would need it, so ... > +/* > + * Discard all the logs. This is particularly required in single user mode > + * where at the commit time we discard all the undo logs. > + */ > +void > +UndoLogDiscardAll(void) > +{ > + UndoLogSlot *slot = NULL; > + > + Assert(!IsUnderPostmaster); > + > + /* > + * No locks are required for discard, since this called only in single > + * user mode. > + */ > + while ((slot = UndoLogNextSlot(slot))) > + { > + /* If the log is already discarded, then we are done. */ > + if (slot->meta.discard == slot->meta.unlogged.insert) > + continue; > + > + /* > + * Process the undo log. > + */ > + UndoLogDiscard(MakeUndoRecPtr(slot->logno, slot->meta.unlogged.insert), > + InvalidTransactionId); > + } > + > +} Uh. So. What happens if we start up in single user mode while transactions that haven't been rolled back yet exist? Which seems like a pretty typical situation for single user mode, because usually something has gone wrong before, which means it's quite likely that there are transactions that effectively aborted and haven't processed undo? How is this not entirely broken? > +/* > + * Discard the undo logs for temp tables. > + */ > +void > +TempUndoDiscard(UndoLogNumber logno) > +{ The only callsite for this is: + case ONCOMMIT_TEMP_DISCARD: + /* Discard temp table undo logs for temp tables. 
*/ + TempUndoDiscard(oc->relid); + break; Which looks mightily odd, given that relid doesn't really sound like an undo log number. There's also no code actually registering an ONCOMMIT_TEMP_DISCARD callback. Nor is it clear to me why it, in general, would be correct to drop undo pre-commit, even for temp relations. It's fine for ON COMMIT DROP relations, but what about temporary relations that are longer lived than that? As the transaction can still fail at this stage - e.g. due to serialization failures - we'd just throw undo away that we'll need later? > @@ -943,9 +1077,24 @@ CanPushReqToUndoWorker(UndoRecPtr start_urec_ptr, UndoRecPtr end_urec_ptr, > > /* > * We normally push the rollback request to undo workers if the size of > - * same is above a certain threshold. > + * same is above a certain threshold. However, discard worker is allowed *the discard worker > * The request can't be pushed into the undo worker queue. The I don't think 'undo worker queue' is really correct. It's not one worker, and it's not one queue. And we're not queueing for a specific worker. > - * backends will try executing by itself. "Executing by itself" doesn't sound right. Execute the undo itself? > + * backends will try executing by itself. The discard worker will > + * keep the entry into the rollback hash table with "will keep the entry into" doesn't sound right. Insert? > + * UNDO_REQUEST_INVALID status. Such requests will be added in the > + * undo worker queues in the subsequent passes over undo logs by > + * discard worker. > */ > - else > + else if (!IsDiscardProcess()) > rh->status = UNDO_REQUEST_INPROGRESS; > + else > + rh->status = UNDO_REQUEST_INVALID; > } I don't understand what the point of this is. We add an entry into the hashtable, but mark it as invalid? How does this not allow to run out of memory? > + * To know more about work queues, see undorequest.c. The worker is launched > + * to handle requests for a particular database. I thought we had agreed that workers pick databases after they're started? There seems to be plenty code in here that does not implement that. > +/* SIGTERM: set flag to exit at next convenient time */ > +static void > +UndoworkerSigtermHandler(SIGNAL_ARGS) > +{ > + got_SIGTERM = true; > + > + /* Waken anything waiting on the process latch */ > + SetLatch(MyLatch); > +} > + > +/* SIGHUP: set flag to reload configuration at next convenient time */ > +static void > +UndoLauncherSighup(SIGNAL_ARGS) > +{ > + int save_errno = errno; > + > + got_SIGHUP = true; > + > + /* Waken anything waiting on the process latch */ > + SetLatch(MyLatch); > + > + errno = save_errno; > +} So one handler saves errno, the other doesn't... > +/* > + * Wait for a background worker to start up and attach to the shmem context. > + * > + * This is only needed for cleaning up the shared memory in case the worker > + * fails to attach. > + */ > +static void > +WaitForUndoWorkerAttach(UndoApplyWorker * worker, > + uint16 generation, > + BackgroundWorkerHandle *handle) Once we have undo workers pick their db, this should not be needed anymore. The launcher shouldn't even prepare anything in shared memory for it. > +/* > + * Returns whether an undo worker is available. > + */ > +static int > +IsUndoWorkerAvailable(void) > +{ > + int i; > + int alive_workers = 0; > + > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > + > + /* Search for attached workers. 
*/ > + for (i = 0; i < max_undo_workers; i++) > + { > + UndoApplyWorker *w = &UndoApplyCtx->workers[i]; > + > + if (w->in_use) > + alive_workers++; > + } > + > + LWLockRelease(UndoWorkerLock); > + > + return (alive_workers < max_undo_workers); > +} > + > +/* Sets the worker's lingering status. */ > +static void > +UndoWorkerIsLingering(bool sleep) > +{ > + /* Block concurrent access. */ > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > + > + MyUndoWorker->lingering = sleep; > + > + LWLockRelease(UndoWorkerLock); > +} > + > +/* Get the dbid and undo worker queue set by the undo launcher. */ > +static void > +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) > +{ > + /* Block concurrent access. */ > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > + > + MyUndoWorker = &UndoApplyCtx->workers[slot]; > + > + if (!MyUndoWorker->in_use) > + { > + LWLockRelease(UndoWorkerLock); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("undo worker slot %d is empty", > + slot))); > + } > + > + urinfo->dbid = MyUndoWorker->dbid; > + urinfo->undo_worker_queue = MyUndoWorker->undo_worker_queue; > + > + LWLockRelease(UndoWorkerLock); > +} Why do all these need an exclusive lock? > +/* > + * Perform rollback request. We need to connect to the database for first > + * request and that is required because we access system tables while > + * performing undo actions. > + */ > +static void > +UndoWorkerPerformRequest(UndoRequestInfo * urinfo) > +{ > + bool error = false; > + > + /* must be connected to the database. */ > + Assert(MyDatabaseId != InvalidOid); The comment above says "We need to connect to the database", yet we assert here that we "must be connected to the database". > +/* > + * UndoLauncherRegister > + * Register a background worker running the undo worker launcher. > + */ > +void > +UndoLauncherRegister(void) > +{ > + BackgroundWorker bgw; > + > + if (max_undo_workers == 0) > + return; > + > + memset(&bgw, 0, sizeof(bgw)); > + bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | > + BGWORKER_BACKEND_DATABASE_CONNECTION; > + bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; > + snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres"); > + snprintf(bgw.bgw_function_name, BGW_MAXLEN, "UndoLauncherMain"); > + snprintf(bgw.bgw_name, BGW_MAXLEN, > + "undo worker launcher"); > + snprintf(bgw.bgw_type, BGW_MAXLEN, > + "undo worker launcher"); > + bgw.bgw_restart_time = 5; > + bgw.bgw_notify_pid = 0; > + bgw.bgw_main_arg = (Datum)0; > + > + RegisterBackgroundWorker(&bgw); > +} > + > +/* > + * Main loop for the undo worker launcher process. > + */ > +void > +UndoLauncherMain(Datum main_arg) > +{ > + UndoRequestInfo urinfo; > + > + ereport(DEBUG1, > + (errmsg("undo launcher started"))); > + > + before_shmem_exit(UndoLauncherOnExit, (Datum) 0); > + > + Assert(UndoApplyCtx->launcher_pid == 0); > + UndoApplyCtx->launcher_pid = MyProcPid; > + > + /* Establish signal handlers. */ > + pqsignal(SIGHUP, UndoLauncherSighup); > + pqsignal(SIGTERM, UndoworkerSigtermHandler); > + BackgroundWorkerUnblockSignals(); > + > + /* Establish connection to nailed catalogs. */ > + BackgroundWorkerInitializeConnection(NULL, NULL, 0); Why do we need to be connected in the launcher? I assume that's because we still do checks on the database? Greetings, Andres Freund
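PS: To be explicit about the errno point above, both handlers should follow the usual pattern (sketch of the corrected SIGTERM handler):

/* SIGTERM: set flag to exit at next convenient time */
static void
UndoworkerSigtermHandler(SIGNAL_ARGS)
{
	int			save_errno = errno;

	got_SIGTERM = true;

	/* Waken anything waiting on the process latch */
	SetLatch(MyLatch);

	errno = save_errno;
}

SetLatch() can clobber errno (it may end up in kill() or write()), so a handler that doesn't save and restore it can corrupt errno in the interrupted code.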
Hi, I'll be responding to a bunch of long review emails in this thread point by point separately, but just picking out a couple of points here that jumped out at me: On Wed, Aug 7, 2019 at 9:18 AM Andres Freund <andres@anarazel.de> wrote: > > + { > > + /* > > + * For the "shared" category, we only discard when the > > + * rm_undo_status callback tells us we can. > > + */ > > Is there a description as to what the rm_status callback is intended to > do? It currently is mandatory, is that intended? Why does this only > apply to shared records? And why just for SHARED, not for any of the others? Yeah, I will respond to this. After recent discussions with Robert the whole UNDO_SHARED concept looks a bit shaky, and there's a better way trying to get out -- more on that soon. > > See > > + * DiscardWorkerMain. > > Hm. This actually reminds me of a complaint I have about this. ISTM that > the logic for discarding itself should be separate from the discard > worker. I'd just add that, and a UDF to invoke it, in a separate commit. That's not a bad idea -- I have a 'pg_force_discard()' SP, currently a bit raw, which I'll include in my next patchset and plan to make a bit smarter -- it might make sense to use the same code path for that. -- Thomas Munro https://enterprisedb.com
Hello Andres, On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote: > > +/* Each worker queue is a binary heap. */ > > +typedef struct > > +{ > > + binaryheap *bh; > > + union > > + { > > + UndoXidQueue *xid_elems; > > + UndoSizeQueue *size_elems; > > + UndoErrorQueue *error_elems; > > + } q_choice; > > +} UndoWorkerQueue; > > As we IIRC have decided to change this into a rbtree, I'll ignore > related parts of the current code. What is the status of that work? > I've checked the git trees, without seeing anything? Your last mail with > patches > https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com > doesn't seem to contain that either? > Yeah, we're changing this into a rbtree. This is still work-in-progress. > > > ......... > > +#define GetErrorQueueNthElem(n) \ > > +( \ > > + AssertMacro(!ErrorQueueIsEmpty()), \ > > + DatumGetPointer(binaryheap_nth(UndoWorkerQueues[ERROR_QUEUE].bh, n)) \ > > +) > > > -ETOOMANYMACROS > > I think nearly all of these shouldn't exist. See further below. > > > > +#define SetErrorQueueElem(elem, e_dbid, e_full_xid, e_start_urec_ptr, e_retry_at, e_occurred_at) \ > > +( \ > > + GetErrorQueueElem(elem).dbid = e_dbid, \ > > + GetErrorQueueElem(elem).full_xid = e_full_xid, \ > > + GetErrorQueueElem(elem).start_urec_ptr = e_start_urec_ptr, \ > > + GetErrorQueueElem(elem).next_retry_at = e_retry_at, \ > > + GetErrorQueueElem(elem).err_occurred_at = e_occurred_at \ > > +) > > It's very very rarely a good idea to have macros that evaluate their > arguments multiple times. It'll also never be a good idea to get the > same element multiple times from a queue. If needed - I'm very doubtful > of that, given that there's a single caller - it should be a static > inline function that gets the element once, stores it in a local > variable, and then updates all the fields. > Noted. Earlier, Robert also raised the point of using so many macros. He also suggested to use a single type of object that stores all the information we need. It'll make things simpler and easier to understand. In the upcoming patch set, we're removing all these changes. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
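PS: For the archives, the static-inline shape suggested above would look roughly like this (sketch; UndoErrorQueueElem is a stand-in name for the queue's element type, the rest follows the current patch):

static inline void
SetErrorQueueElem(int elem, Oid dbid, FullTransactionId full_xid,
				  UndoRecPtr start_urec_ptr, TimestampTz next_retry_at,
				  TimestampTz err_occurred_at)
{
	/* fetch the element exactly once */
	UndoErrorQueueElem *e = &GetErrorQueueElem(elem);

	e->dbid = dbid;
	e->full_xid = full_xid;
	e->start_urec_ptr = start_urec_ptr;
	e->next_retry_at = next_retry_at;
	e->err_occurred_at = err_occurred_at;
}

Each argument is evaluated exactly once, and the queue lookup happens only once.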
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-05 11:29:34 -0700, Andres Freund wrote: > > Need to do something else for a bit. More later. > > Here we go. > Thanks for the review. I will work on them. Currently, I need suggestions on some of the review comments. > > > + /* > > + * Compute the header size of the undo record. > > + */ > > +Size > > +UndoRecordHeaderSize(uint16 uur_info) > > +{ > > + Size size; > > + > > + /* Add fixed header size. */ > > + size = SizeOfUndoRecordHeader; > > + > > + /* Add size of transaction header if it presets. */ > > + if ((uur_info & UREC_INFO_TRANSACTION) != 0) > > + size += SizeOfUndoRecordTransaction; > > + > > + /* Add size of rmid if it presets. */ > > + if ((uur_info & UREC_INFO_RMID) != 0) > > + size += sizeof(RmgrId); > > + > > + /* Add size of reloid if it presets. */ > > + if ((uur_info & UREC_INFO_RELOID) != 0) > > + size += sizeof(Oid); > > + > There's numerous blocks with one if for each type, and the body copied basically the same for each alternative. That doesn't seem like a reasonable approach to me. Means that many places need to be adjusted when we invariably add another type, and seems likely to lead to bugs over time. > I agree with the point that we are repeating this in a couple of functions and doing different actions, e.g. in this function we are computing the size and in some other function we are copying the fields. I am not sure what would be the best way to handle it. One approach could be to just write one function which handles all these cases, with the caller specifying what action to take. Basically, it will look like this. Function (uur_info, action) { if ((uur_info & UREC_INFO_TRANSACTION) != 0) { // if action is compute header size size += SizeOfUndoRecordTransaction; //else if action is copy to dest dest = src ... } Repeat for other types } But, IMHO, it will be confusing for anyone to see what exactly that function is trying to achieve. If anyone has a better idea, please suggest it. > > +/* > > + * Insert the undo record into the input page from the unpack undo context. > > + * > > + * Caller can call this function multiple times until desired stage is reached. > > + * This will write the undo record into the page. > > + */ > > +void > > +InsertUndoData(UndoPackContext *ucontext, Page page, int starting_byte) > > +{ > > + char *writeptr = (char *) page + starting_byte; > > + char *endptr = (char *) page + BLCKSZ; > > + > > + switch (ucontext->stage) > > + { > > + case UNDO_PACK_STAGE_HEADER: > > + /* Insert undo record header. */ > > + if (!InsertUndoBytes((char *) &ucontext->urec_hd, > > + SizeOfUndoRecordHeader, &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + ucontext->stage = UNDO_PACK_STAGE_TRANSACTION; > > + /* fall through */ > > + > > I don't understand. The only purpose of this is that we can partially write a packed-but-not-actually-packed record onto a bunch of pages? And for that we have an endless chain of copy and pasted code calling InsertUndoBytes()? Copying data into shared buffers in tiny increments? > > If we need to this, what is the whole packed record format good for? Except for adding a bunch of functions with 10++ ifs and nearly identical code? > > Copying data is expensive. Copying data in tiny increments is more expensive. Copying data in tiny increments, with a bunch of branches, is even more expensive.
Copying data in tiny increments, with a bunch of > branches, is even more expensive, especially when it's shared > memory. Copying data in tiny increments, with a bunch of branches, is > even more expensive, especially when it's shared memory, especially when > all that shared memory is locked at once. My idea is, instead of keeping all these fields duplicated in the context, to just allocate a single memory segment equal to the expected record size (maybe the payload data can be kept separate). Now, based on uur_info, pack all the fields of UnpackedUndoRecord into that memory segment. After that, in InsertUndoData, we just need one call to InsertUndoBytes to copy the complete header in one shot and another call to copy the payload data. Does this sound reasonable to you? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
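PS: In code, roughly this (sketch; PackUndoRecordHeader, uur_hd, uur_txn_hd and uur_rmid are placeholder names, the real struct layout will differ):

/*
 * Pack all header fields selected by uur_info into one contiguous buffer,
 * so that InsertUndoData() needs only one InsertUndoBytes() call for the
 * header (plus one more for the payload).
 */
static char *
PackUndoRecordHeader(UnpackedUndoRecord *uur, Size *len)
{
	Size		size = UndoRecordHeaderSize(uur->uur_info);
	char	   *buf = palloc(size);
	char	   *p = buf;

	/* fixed header first */
	memcpy(p, &uur->uur_hd, SizeOfUndoRecordHeader);
	p += SizeOfUndoRecordHeader;

	/* optional chunks, in the same order as UndoRecordHeaderSize() */
	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
	{
		memcpy(p, &uur->uur_txn_hd, SizeOfUndoRecordTransaction);
		p += SizeOfUndoRecordTransaction;
	}
	if ((uur->uur_info & UREC_INFO_RMID) != 0)
	{
		memcpy(p, &uur->uur_rmid, sizeof(RmgrId));
		p += sizeof(RmgrId);
	}
	/* ... and so on for reloid, xid, cid, fork, prevundo, block, ... */

	Assert(p == buf + size);
	*len = size;
	return buf;
}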
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-05 11:29:34 -0700, Andres Freund wrote: > > Need to do something else for a bit. More later. > > > + * false, otherwise. > > + */ > > +static bool > > +UndoAlreadyApplied(FullTransactionId full_xid, UndoRecPtr to_urecptr) > > +{ > > + UnpackedUndoRecord *uur = NULL; > > + UndoRecordFetchContext context; > > + > > + /* Fetch the undo record. */ > > + BeginUndoFetch(&context); > > + uur = UndoFetchRecord(&context, to_urecptr); > > + FinishUndoFetch(&context); > > Literally all the places that fetch a record, fetch them with exactly > this combination of calls. If that's the pattern, what do we gain by > this split? Note that UndoBulkFetchRecord does *NOT* use an > UndoRecordFetchContext, for reasons that are beyond me. Actually, the split is for zheap or any other AM that needs to traverse a transaction's undo chain. For example, in zheap we will get the latest undo record pointer from the slot, but we need to traverse the undo record chain backward using the prevundo pointer stored in the undo record to find the undo record for a particular tuple. Earlier, there was a loop in UndoFetchRecord which traversed the undo chain until it found the matching record; the record was matched using a callback. There was also an optimization that if the current record doesn't satisfy the callback then we keep the pin held on the buffer and go to the previous record in the chain. Later, based on the review comments by Robert, we decided that finding the matching undo record should be the caller's responsibility, so we have moved the loop out of UndoFetchRecord and kept it in the zheap code. The reason for keeping the context is that we can keep the buffer pin held and remember that buffer in the context, so that the caller can call UndoFetchRecord in a loop and the pin will be held on the buffer from which we read the last undo record. I agree that in the undoprocessing patch set we always need to fetch one record, so instead of repeating this pattern everywhere we can write one function and move this sequence of calls into it (sketch below). -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
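PS: i.e. something like this tiny wrapper (only the wrapper's name is made up, the rest is the existing API):

/*
 * Fetch a single undo record when no chain traversal is needed, hiding
 * the fetch-context dance from the caller.  The result must still be
 * freed with UndoRecordRelease().
 */
static UnpackedUndoRecord *
UndoFetchOneRecord(UndoRecPtr urecptr)
{
	UndoRecordFetchContext context;
	UnpackedUndoRecord *uur;

	BeginUndoFetch(&context);
	uur = UndoFetchRecord(&context, urecptr);
	FinishUndoFetch(&context);

	return uur;
}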
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote: > On 2019-08-05 11:29:34 -0700, Andres Freund wrote: > I am responding to some of the points where I need more input or some discussion is required. Some of the things need more thought, and I will respond to those later; some are quite straightforward and don't need much discussion. > > > +/* > > + * Binary heap comparison function to compare the time at which an error > > + * occurred for transactions. > > + * > > + * The error queue is sorted by next_retry_at and err_occurred_at. Currently, > > + * the next_retry_at has some constant delay time (see PushErrorQueueElem), so > > + * it doesn't make much sense to sort by both values. However, in future, if > > + * we have some different algorithm for next_retry_at, then it will work > > + * seamlessly. > > + */ > > Why is it useful to have error_occurred_at be part of the comparison at > all? If we need a tiebraker, err_occurred_at isn't that (if we can get > conflicts for next_retry_at, then we can also get conflicts in > err_occurred_at). Seems better to use something actually guaranteed to > be unique for a tiebreaker. > This was to distinguish the case where the request has failed multiple times from the case where it failed for the first time. I agree that we need a better unique identifier like FullTransactionId though. Do let me know if you have any other suggestion? > > > > + /* > > + * The rollback hash table is used to avoid duplicate undo requests by > > + * backends and discard worker. The table must be able to accomodate all > > + * active undo requests. The undo requests must appear in both xid and > > + * size requests queues or neither. In same transaction, there can be two > > + * requests one for logged relations and another for unlogged relations. > > + * So, the rollback hash table size should be equal to two request queues, > > + * an error queue (currently this is same as request queue) and max > > "the same"? I assume this intended to mean the same size? > Yes. I will add the word size to be more clear. > > > + * backends. This will ensure that it won't get filled. > > + */ > > How does this ensure anything? > Because based on this we will have a hard limit on the number of undo requests after which we won't allow more requests. See some more detailed explanation for the same later in this email. I think the comment needs to be updated. > > > + * the binary heaps which can change. > > + */ > > + Assert(LWLockHeldByMeInMode(RollbackRequestLock, LW_EXCLUSIVE)); > > + > > + /* > > + * We normally push the rollback request to undo workers if the size of > > + * same is above a certain threshold. > > + */ > > + if (req_size >= rollback_overflow_size * 1024 * 1024) > > + { > > Why is this being checked with the lock held? Seems like this should be > handled in a pre-check? > Yeah, it can be a pre-check, but I thought it is better to encapsulate everything in the function as this is not an expensive check. I think we can move it outside the lock to avoid any such confusion. > > > + * allow_peek - if true, peeks a few element from each queue to check whether > > + * any request matches current dbid. > > + * remove_from_queue - if true, picks an element from the queue whose dbid > > + * matches current dbid and remove it from the queue before returning the same > > + * to caller. > > + * urinfo - this is an OUT parameter that returns the details of undo request > > + * whose undo action is still pending. > > + * in_other_db_out - this is an OUT parameter.
If we've not found any work > > + * for current database, but there is work for some other database, we set > > + * this parameter as true. > > + */ > > +bool > > +UndoGetWork(bool allow_peek, bool remove_from_queue, UndoRequestInfo *urinfo, > > + bool *in_other_db_out) > > +{ > > > > + /* > > + * If some undo worker is already processing the rollback request or > > + * it is already processed, then we drop that request from the queue > > + * and fetch the next entry from the queue. > > + */ > > + if (!rh || UndoRequestIsInProgress(rh)) > > + { > > + RemoveRequestFromQueue(cur_queue, 0); > > + cur_undo_queue++; > > + continue; > > + } > > When is it possible to hit the in-progress case? > The same request is in two queues. It is possible that when the request is being processed from xid queue by one of the workers, the request from another queue is picked by another worker. I think this case won't exist after making rbtree based queues. > > +/* > > + * UpdateUndoApplyProgress - Updates how far undo actions from a particular > > + * log have been applied while rolling back a transaction. This progress is > > + * measured in terms of undo block number of the undo log till which the > > + * undo actions have been applied. > > + */ > > +static void > > +UpdateUndoApplyProgress(UndoRecPtr progress_urec_ptr, > > + BlockNumber block_num) > > +{ > > + UndoLogCategory category; > > + UndoRecordInsertContext context = {{0}}; > > + > > + category = > > + UndoLogNumberGetCategory(UndoRecPtrGetLogNo(progress_urec_ptr)); > > + > > + /* > > + * We don't need to update the progress for temp tables as they get > > + * discraded after startup. > > + */ > > + if (category == UNDO_TEMP) > > + return; > > + > > + BeginUndoRecordInsert(&context, category, 1, NULL); > > + > > + /* > > + * Prepare and update the undo apply progress in the transaction header. > > + */ > > + UndoRecordPrepareApplyProgress(&context, progress_urec_ptr, block_num); > > + > > + START_CRIT_SECTION(); > > + > > + /* Update the progress in the transaction header. */ > > + UndoRecordUpdateTransInfo(&context, 0); > > + > > + /* WAL log the undo apply progress. */ > > + { > > + XLogRecPtr lsn; > > + xl_undoapply_progress xlrec; > > + > > + xlrec.urec_ptr = progress_urec_ptr; > > + xlrec.progress = block_num; > > + > > + XLogBeginInsert(); > > + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); > > + > > + RegisterUndoLogBuffers(&context, 1); > > + lsn = XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_APPLY_PROGRESS); > > + UndoLogBuffersSetLSN(&context, lsn); > > + } > > + > > + END_CRIT_SECTION(); > > + > > + /* Release undo buffers. */ > > + FinishUndoRecordInsert(&context); > > +} > > This whole prepare/execute split for updating apply pregress, and next > undo pointers makes no sense to me. > Can you explain what is your concern here? Basically, in the prepare phase, we do read and lock the buffer and in the actual update phase (which is under critical section), we update the contents in the shared buffer. This is the same idea as we use in many places in code. 
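For reference, the general idiom looks like this (a simplified sketch of the usual buffer-update pattern, not the undo-specific code; RM_FOO_ID and XLOG_FOO_OP are made-up placeholders):

static void
wal_logged_update(Relation rel, BlockNumber blkno)
{
	Buffer		buf;
	Page		page;
	XLogRecPtr	recptr;

	/* Prepare phase: pin and lock the buffer, outside any critical section. */
	buf = ReadBuffer(rel, blkno);
	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
	page = BufferGetPage(buf);

	/* Update phase: the modification and its WAL record, atomically. */
	START_CRIT_SECTION();
	/* ... modify the page contents here ... */
	MarkBufferDirty(buf);

	XLogBeginInsert();
	XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
	recptr = XLogInsert(RM_FOO_ID, XLOG_FOO_OP);
	PageSetLSN(page, recptr);
	END_CRIT_SECTION();

	UnlockReleaseBuffer(buf);
}

The prepare steps may fail or error out safely; once inside the critical section any failure is a PANIC, so only the actual modification happens there.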
> > > typedef struct TwoPhaseFileHeader > > { > > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader > > uint16 gidlen; /* length of the GID - GID follows the header */ > > XLogRecPtr origin_lsn; /* lsn of this record at origin node */ > > TimestampTz origin_timestamp; /* time of prepare at origin node */ > > + > > + /* > > + * We need the locations of the start and end undo record pointers when > > + * rollbacks are to be performed for prepared transactions using undo-based > > + * relations. We need to store this information in the file as the user > > + * might rollback the prepared transaction after recovery and for that we > > + * need its start and end undo locations. > > + */ > > + UndoRecPtr start_urec_ptr[UndoLogCategories]; > > + UndoRecPtr end_urec_ptr[UndoLogCategories]; > > } TwoPhaseFileHeader; > > Why do we not need that knowledge for undo processing of a non-prepared > transaction? > The non-prepared transaction also needs to be aware of that. It is stored in TransactionStateData. I am not sure if I understand your question here. > > > + * applying undo via top-level transaction, if we get an error, > > + * then it is handled by ReleaseResourcesAndProcessUndo > > Where and how does it handle that? Maybe I misunderstand what you mean? > It is handled in ProcessUndoRequestForEachLogCat which is called from ReleaseResourcesAndProcessUndo. Basically, the error is handled in catch and we insert the request in error queue. The function name should be changed in comments. > > > + case TBLOCK_UNDO: > > + /* > > + * We reach here when we got error while applying undo > > + * actions, so we don't want to again start applying it. Undo > > + * workers can take care of it. > > + * > > + * AbortTransaction is already done, still need to release > > + * locks and perform cleanup. > > + */ > > + ResetUndoActionsInfo(); > > + ResourceOwnerRelease(s->curTransactionOwner, > > + RESOURCE_RELEASE_LOCKS, > > + false, > > + true); > > + s->state = TRANS_ABORT; > > CleanupTransaction(); > > Hm. Why is it ok that we only perform that cleanup action? Either the > rest of potentially held resources will get cleaned up somehow as well, > in which case this ResourceOwnerRelease() ought to be redundant, or > we're potentially leaking important resources like buffer pins, relcache > references and whatnot here? > I had initially used AbortTransaction() here for such things, but I was not sure whether that is the right thing when we reach here in this state. Because AbortTransaction is already done once we reach here. The similar thing happens for the TBLOCK_SUBUNDO state few lines below where I had used AbortSubTransaction. Now, one problem I faced when AbortSubTransaction got invoked in this code path was it internally invokes RecordTransactionAbort->XidCacheRemoveRunningXids which result in the error "did not find subXID %u in MyProc". The reason is obvious which is that we had already removed it when AbortSubTransaction was invoked before applying undo actions. The releasing of locks was the thing which we have delayed to allow undo actions to be applied which is done here. The other idea here I had was to call AbortTransaction/AbortSubTransaction but somehow avoid calling RecordTransactionAbort when in this state. Do you have any suggestion to deal with this? > > > +{ > > + TransactionState s = CurrentTransactionState; > > + bool result; > > + int i; > > + > > + /* > > + * We don't want to apply the undo actions when we are already cleaning up > > + * for FATAL error. 
See ReleaseResourcesAndProcessUndo. > > + */ > > + if (SemiCritSectionCount > 0) > > + { > > + ResetUndoActionsInfo(); > > + return; > > + } > > Wait what? Semi critical sections? > Robert up thread suggested this idea [1] (See paragraph starting with "I am not a fan of applying_subxact_undo....") to deal with cases where we get an error while applying undo actions and we need to promote the error to FATAL. We have two such cases as of now in this patch, one is when we process temp log category log and other is when we are rolling back sub-transactions. The detailed reasons are mentioned in function execute_undo_actions. I think this can be used for other things as well in the future. > > > > + for (i = 0; i < UndoLogCategories; i++) > > + { > > + /* > > + * We can't push the undo actions for temp table to background > > + * workers as the the temp tables are only accessible in the > > + * backend that has created them. > > + */ > > + if (i != UNDO_TEMP && UndoRecPtrIsValid(s->latestUrecPtr[i])) > > + { > > + result = RegisterUndoRequest(s->latestUrecPtr[i], > > + s->startUrecPtr[i], > > + MyDatabaseId, > > + GetTopFullTransactionId()); > > + s->undoRequestResgistered[i] = result; > > + } > > + } > > Give code like this I have a hard time seing what the point of having > separate queue entries for the different persistency levels is. > It is not for this case, rather, it is for the case of discard worker (background worker) where we process the transactions at log level. The permanent and unlogged transactions will be in a separate log and can be encountered at different times, so this leads to having separate entries for them. I am planning to give a try to unify them based on some of the discussions in this email chain. > > > +void > > +ReleaseResourcesAndProcessUndo(void) > > +{ > > + TransactionState s = CurrentTransactionState; > > + > > + /* > > + * We don't want to apply the undo actions when we are already cleaning up > > + * for FATAL error. One of the main reasons is that we might be already > > + * processing undo actions for a (sub)transaction when we reach here > > + * (for ex. error happens while processing undo actions for a > > + * subtransaction). > > + */ > > + if (SemiCritSectionCount > 0) > > + { > > + ResetUndoActionsInfo(); > > + return; > > + } > > + > > + if (!NeedToPerformUndoActions()) > > + return; > > + > > + /* > > + * State should still be TRANS_ABORT from AbortTransaction(). > > + */ > > + if (s->state != TRANS_ABORT) > > + elog(FATAL, "ReleaseResourcesAndProcessUndo: unexpected state %s", > > + TransStateAsString(s->state)); > > + > > + /* > > + * Do abort cleanup processing before applying the undo actions. We must > > + * do this before applying the undo actions to remove the effects of > > + * failed transaction. > > + */ > > + if (IsSubTransaction()) > > + { > > + AtSubCleanup_Portals(s->subTransactionId); > > + s->blockState = TBLOCK_SUBUNDO; > > + } > > + else > > + { > > + AtCleanup_Portals(); /* now safe to release portal memory */ > > + AtEOXact_Snapshot(false, true); /* and release the transaction's > > + * snapshots */ > > Why do precisely these actions need to be performed here? > This is to get a transaction into a clean state. Before calling this function AbortTransaction has been performed and there were few more things we need to do for cleanup. 
> > > + s->fullTransactionId = InvalidFullTransactionId; > > + s->subTransactionId = TopSubTransactionId; > > + s->blockState = TBLOCK_UNDO; > > + } > > + > > + s->state = TRANS_UNDO; > > This seems guaranteed to constantly be out of date with other > modifications of the commit/abort sequence. > It is similar to how we change state in Abort(Sub)Transaction and we change the state back to TRANS_ABORT after applying undo in this function. So not sure, how it can be out-of-date. Do you have any better suggestion here? > > > > +bool > > +ProcessUndoRequestForEachLogCat(FullTransactionId fxid, Oid dbid, > > + UndoRecPtr *end_urec_ptr, UndoRecPtr *start_urec_ptr, > > + bool *undoRequestResgistered, bool isSubTrans) > > +{ > > + UndoRequestInfo urinfo; > > + int i; > > + uint32 save_holdoff; > > + bool success = true; > > + > > + for (i = 0; i < UndoLogCategories; i++) > > + { > > + if (end_urec_ptr[i] && !undoRequestResgistered[i]) > > + { > > + save_holdoff = InterruptHoldoffCount; > > + > > + PG_TRY(); > > + { > > + /* for subtransactions, we do partial rollback. */ > > + execute_undo_actions(fxid, > > + end_urec_ptr[i], > > + start_urec_ptr[i], > > + !isSubTrans); > > + } > > + PG_CATCH(); > > + { > > + /* > > + * Add the request into an error queue so that it can be > > + * processed in a timely fashion. > > + * > > + * If we fail to add the request in an error queue, then mark > > + * the entry status as invalid and continue to process the > > + * remaining undo requests if any. This request will be later > > + * added back to the queue by discard worker. > > + */ > > + ResetUndoRequestInfo(&urinfo); > > + urinfo.dbid = dbid; > > + urinfo.full_xid = fxid; > > + urinfo.start_urec_ptr = start_urec_ptr[i]; > > + if (!InsertRequestIntoErrorUndoQueue(&urinfo)) > > + RollbackHTMarkEntryInvalid(urinfo.full_xid, > > + urinfo.start_urec_ptr); > > + /* > > + * Errors can reset holdoff count, so restore back. This is > > + * required because this function can be called after holding > > + * interrupts. > > + */ > > + InterruptHoldoffCount = save_holdoff; > > + > > + /* Send the error only to server log. */ > > + err_out_to_client(false); > > + EmitErrorReport(); > > + > > + success = false; > > + > > + /* We should never reach here when we are in a semi-critical-section. */ > > + Assert(SemiCritSectionCount == 0); > > This seems entirely and completely broken. You can't just catch an > exception and continue. What if somebody held an lwlock when the error > was thrown? A buffer pin? > The caller deals with that. For example, when this is called from FinishPreparedTransaction, we do AbortOutOfAnyTransaction and when called from ReleaseResourcesAndProcessUndo, we just release locks. I think we might need to do something additional for ReleaseResourcesAndProcessUndo. Earlier here also, I had AbortTransaction but was not sure whether that is the right thing to do especially because it will lead to RecordTransactionAbort called twice, once when we do AbortTransaction before applying undo actions and once when we do it after catching the exception. Like as I said earlier maybe the right way is to just avoid calling RecordTransactionAbort again. > > > +to complete the requests by themselves. There is an exception to it where when > > +error queue becomes full, we just mark the request as 'invalid' and continue to > > +process other requests if any. The discard worker will find this errored > > +transaction at later point of time and again add it to the request queues. 
> > You say it's an exception, but you do not explain why that exception is > there. The exception is when the error queue becomes full. The idea is that individual queues can be full but not the hash table. > Nor why that's not a problem for: > > > +We have the hard limit (proportional to the size of the rollback hash table) > > +for the number of transactions that can have pending undo. This can help us > > +in computing the value of oldestXidHavingUnappliedUndo and allowing us not to > > +accumulate pending undo for a long time which will eventually block the > > +discard of undo. > The reason why it is not a problem is that we don't remove the entry from the hash table, rather we just mark it such that the discard worker can later add it back to the queues. I am not sure if I understood your question completely, but let me try to explain this idea in a bit more detail. The basic idea is that the rollback hash table has space equivalent to all the three queues plus (2 * MaxBackends). Now, we will stop allowing new transactions that want to write undo once the hash table has entries equivalent to all three queues and we have 2 * Max_Backends already attached to undo logs that are not committed. Assume we have each queue size as 5 and Max_Backends = 10; then ideally we can have 35 entries (3 * 5 + 2 * 10) in the hash table. The way all this is related to the error queue being full is like this: Say, we have a number of hash table entries equal to 15, which indicates all queues are full, and now 10 backends are each attached to two different logs (permanent and unlogged). Next, one of the transactions errors out and tries to roll back; at this stage, it will add an entry in the hash table and try to execute the actions. While executing the actions, it gets an error and can't add the request to the error queue because that is full, so at this stage it just marks the hash table entry as invalid and proceeds (consider this happens for both the logged and unlogged categories). So, at this stage, we will have 17 entries in the hash table and the other 9 backends attached to 18 logs, which adds up to exactly the 35-entry capacity if the system crashes at this stage. The backend which errored out again tries to perform an operation for which it needs to write undo. Now, we won't allow this backend to perform that action, because if it crashed after performing the operation and before committing, the hash table would overflow. Currently, there are some problems with the hash table overflow checks in the code that need to be fixed. > > > + /* There might not be any undo log and hibernation might be needed. */ > > + *hibernate = true; > > + > > + StartTransactionCommand(); > > Why do we need this? I assume it's so we can have a resource owner? > Yes, and another reason is we are using dbid_exists in this function. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
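PS: In code form, the sizing rule described above is essentially (sketch; pending_undo_queue_size is a stand-in name for the per-queue capacity):

/*
 * Entries for everything the three request queues (xid, size, error) can
 * hold, plus two potential entries (logged + unlogged) for each backend
 * that may be attached to undo logs but not yet present in any queue.
 */
#define RollbackHTSize(pending_undo_queue_size) \
	(3 * (pending_undo_queue_size) + 2 * MaxBackends)

With a queue size of 5 and MaxBackends = 10, that gives 3 * 5 + 2 * 10 = 35 entries, matching the example above.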
On Tue, Jul 30, 2019 at 1:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jul 30, 2019 at 12:21 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > One data structure that could perhaps hold this would be > > UndoLogTableEntry (the per-backend cache, indexed by undo log number, > > with pretty fast lookups; used for things like > > UndoLogNumberGetCategory()). As long as you never want to have > > inter-transaction compression, that should have the right scope to > > give recovery per-undo log tracking. If you ever wanted to do > > compression between transactions too, maybe UndoLogSlot could work, > > but that'd have more complications. > > I think this could be a good idea. I had thought of keeping in the > slot as my 3rd option but later I removed it thinking that we need to > expose the compression field to the undo log layer. I think keeping > in the UndoLogTableEntry is a better idea than keeping in the slot. > But, I still have the same problem that we need to expose undo > record-level fields to undo log layer to compute the cache entry size. > OTOH, If we decide to get from the first record of the page (as I > mentioned up thread) then I don't think there is any performance issue > because we are inserting on the same page. But, for doing that we > need to unpack the complete undo record (guaranteed to be on one > page). And, UnpackUndoData will internally unpack the payload data > as well which is not required in our case unless we change > UnpackUndoData such that it unpacks only what the caller wants (one > input parameter will do). > > I am not sure out of these two which idea is better? > I have one more problem related to compression of the command id field. Basically, the problem is that we don't set the command id in the WAL and we will always store FirstCommandId in the undo[1]. So suppose there were 2 operations under different CIDs; then during DO time both the undo records will store the CID field, but during REDO time all the commands will store the same CID (FirstCommandId), so as per the compression logic the subsequent record for the same transaction will not store the CID field. I am not sure what is the best way to handle this but I have a few ideas. 1) Don't compress the CID field ever. 2) Write CID in WAL, but just for compressing the CID field in undo (which may not necessarily go to disk) we don't want to add an extra 4 bytes to the WAL. Any better idea to handle this? [1] https://www.postgresql.org/message-id/CAFiTN-u2Ny2E-NgT8nmE65awJ7keOzePODZTEg98ceF%2BsNhRtw%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 05/08/2019 16:24, Robert Haas wrote: > On Sun, Aug 4, 2019 at 5:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> I feel that the level of abstraction is not quite right. There are a >> bunch of fields, like uur_block, uur_offset, uur_tuple, that are >> probably useful for some UNDO resource managers (zheap I presume), but >> seem kind of arbitrary. How is uur_tuple different from uur_payload? >> Should they be named more generically as uur_payload1 and uur_payload2? >> And why two, why not three or four different payloads? In the WAL record >> format, there's a concept of "block id", which allows you to store N >> number of different payloads in the record, I think that would be a >> better approach. Or only have one payload, and let the resource manager >> code divide it as it sees fit. >> >> Many of the fields support a primitive type of compression, where a >> field can be omitted if it has the same value as on the first record on >> an UNDO page. That's handy. But again I don't like the fact that the >> fields have been hard-coded into the UNDO record format. I can see e.g. >> the relation oid to be useful for many AMs. But not all. And other AMs >> might well want to store and deduplicate other things, aside from the >> fields that are in the patch now. I'd like to move most of the fields to >> AM specific code, and somehow generalize the compression. One approach >> would be to let the AM store an arbitrary struct, and run it through a >> general-purpose compression algorithm, using the UNDO page's first >> record as the "dictionary". > > I thought about this, too. I agree that there's something a little > unsatisfying about the current structure, but I haven't been able to > come up with something that seems definitively better. I think > something along the lines of what you are describing here might work > well, but I am VERY doubtful about the idea of a fixed-size struct. I > think AMs are going to want to store variable-length data: especially > tuples, but maybe also other stuff. For instance, imagine some AM that > wants to implement locking that's more fine-grained than the four > levels of tuple locks we have today: instead of just having key locks > and all-columns locks, you could want to store the exact columns to be > locked. Or maybe your TIDs are variable-width. Sure, a fixed-size struct is quite limiting. My point is that all that packing of data into UNDO records should be AM-specific. Maybe it would be handy to have a few common fields in the undo record header itself, but most data should be in the AM-specific payload, because it varies across AMs. > And the problem is that as soon as you move to something where you > pack in a bunch of variable-sized fields, you lose the ability to > refer to things using reasonable names. That's where I came up with > the idea of an UnpackedUndoRecord: give the common fields that > "everyone's going to need" human-readable names, and jam only the > strange, AM-specific stuff into the payload. But if those needs are > not actually universal but very much AM-specific, then I'm afraid > we're going to end up with deeply inscrutable code for packing and > unpacking records. I imagine it's possible to come up with a good > structure for that, but I don't think we have one today. Yeah, that's also a problem with complicated WAL record types. Hopefully the complex cases are an exception, not the norm. A complex case is unlikely to fit any pre-defined set of fields anyway. (We could look at how e.g.
protobuf works, if this is really a big problem. I'm not suggesting that we add a dependency just for this, but there might be some patterns or interfaces that we could mimic.) If you remember, we did a big WAL format refactoring in 9.5, which moved some information from AM-specific structs to the common headers. Namely, the information on the relation blocks that the WAL record applies to. That was a very handy refactoring, and allowed tools like pg_waldump to print more detailed information about all WAL record types. For WAL records, moving the block information was natural, because there was special handling for full-page images anyway. However, I don't think we have enough experience with UNDO log yet, to know which fields would be best to include in the common undo header, and which to leave as AM-specific payload. I think we should keep the common header slim, and delegate to the AM routines. For UNDO records, having an XID on every record probably makes sense; all the use cases for UNDO log we've discussed are transactional. The rules on which UNDO records to apply and what/when to discard, depend on whether a transaction committed or aborted and when, so you need the XID for that. Although, the rule also depends on the AM; for cleaning up orphaned files, an UNDO record for a committed transaction can be discarded immediately, while zheap and zedstore records need to be kept around longer. So the exact rules for that will need to be AM-specific, too. Or maybe there are only a few different cases and we can enumerate them, so that an AM can just set a flag on the UNDO record to indicate when it can be discarded, instead of having a callback or some other totally generic approach. In short, I think we should keep the common code that deals with UNDO records more dumb, and delegate to the AMs more. That's enough for cleaning up orphaned files, we don't really need the more complicated stuff for that. We probably need more smarts for zheap/zedstore, but we don't quite know what it should look like yet. Let's keep it simple for now, so that we can get something we can review and commit sooner, and we can build on top of that later. >> I don't like the way UndoFetchRecord returns a palloc'd >> UnpackedUndoRecord. I would prefer something similar to the xlogreader >> API, where a new call to UndoFetchRecord invalidates the previous >> result. On efficiency grounds, to avoid the palloc, but also to be >> consistent with xlogreader. > > I don't think that's going to work very well, because we often need to > deal with multiple records at a time. There is (or was) a bulk-fetch > interface, but I've also found while experimenting with this code that > it can be useful to do things like: > > current = undo_fetch(starting_record); > loop: > next = undo_fetch(current->next_record_ptr); > if some_test(next): > break; > undo_free(current); > current = next; > > I think we shouldn't view such cases as exceptions to the general > paradigm of looking at undo records one at a time, but instead as the > normal case for which everything is optimized. Cases like orphaned > file cleanup where the number of undo records is probably small and > they're all independent of each other will, I think, turn out to be > the exception rather than the rule. Hmm. If you're following an UNDO chain, from newest to oldest, I would assume that the newer record has enough information to decide whether you need to look at the previous record. 
If the previous record is no longer interesting, it might already have been discarded, after all. I tried to browse through the zheap code but couldn't see that pattern. I'm not too familiar with the code, so I might've looked in the wrong place, though. - Heikki
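PS: To make the "enumerate the cases" idea above concrete, the per-record flag could be as simple as this (a sketch, not from any patch; the enum and its values are made up):

/* When may this UNDO record be discarded?  Set by the AM at insert time. */
typedef enum UndoDiscardRule
{
	UNDO_DISCARD_ON_COMMIT,		/* e.g. orphaned-file cleanup records */
	UNDO_DISCARD_AT_XID_HORIZON	/* e.g. zheap/zedstore, needed by old snapshots */
} UndoDiscardRule;

The common discard code would then only need the transaction's commit/abort status and the xmin horizon, with no AM callback on that path.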
On 07/08/2019 13:52, Dilip Kumar wrote: > I have one more problem related to compression of the command id > field. Basically, the problem is that we don't set the command id in > the WAL and we will always store FirstCommandId in the undo[1]. So > suppose there were 2 operations under different CIDs; then during DO > time both the undo records will store the CID field, but during REDO > time all the commands will store the same CID (FirstCommandId), so as > per the compression logic the subsequent record for the same > transaction will not store the CID field. I am not sure what is the > best way to handle this but I have a few ideas. > > 1) Don't compress the CID field ever. > 2) Write CID in WAL, but just for compressing the CID field in undo > (which may not necessarily go to disk) we don't want to add an extra 4 > bytes to the WAL. Most transactions have only a few commands, so you could optimize for that. If you use some kind of a variable-byte encoding for it, it could be a single byte or even just a few bits, for the common cases. For the first version, I'd suggest keeping it simple, though, and optimizing later. - Heikki
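PS: For example, a basic 7-bits-per-byte encoding (self-contained sketch; encode_varint32 is made up, not from any patch):

#include <stdint.h>

/* Encode a 32-bit command id into 1-5 bytes; returns the number of bytes. */
static int
encode_varint32(uint32_t value, uint8_t *buf)
{
	int			n = 0;

	while (value >= 0x80)
	{
		buf[n++] = (uint8_t) (value & 0x7F) | 0x80;
		value >>= 7;
	}
	buf[n++] = (uint8_t) value;
	return n;
}

With this, any command id below 128 costs a single byte, which covers the vast majority of transactions.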
On Thu, Aug 1, 2019 at 1:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jul 31, 2019 at 10:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > but > > > here's a small thing: I managed to reach an LWLock self-deadlock in > > > the undo worker launcher: > > > > > > > I could see the problem, will fix in next version. > > Fixed both of these problems in the patch just posted by me [1]. I reran the script that found that problem, so I could play with the linger logic. It creates N databases, and then it creates tables in random databases (because I'm testing with the orphaned table cleanup patch) and commits or rolls back at (say) 100 tx/sec. While it's doing that, you can look at the pg_stat_undo_logs view to see the discard and insert pointers whizzing along nicely, but if you look at the process table with htop or similar you can see that it's forking undo apply workers at 100/sec (the pid keeps changing), whenever there is more than one database involved. With a single database it lingers as I was expecting (and then creates problems when you want to drop the database). What I was expecting to see is that if you configure the test to generate undo work in 2, 3 or 4 dbs, and you have max_undo_workers set to 4, then you should finish up with 4 undo apply workers hanging around to service the work calmly without any new forking happening. If you generate undo work in more than 4 databases, I was expecting to see the undo workers exiting and being forked so that a total of 4 workers (at any time) can work their way around the more-than-4 databases, but not switching as fast as they can, so that we don't waste all our energy on forking and setup (how fast exactly they should switch, I don't know, that's what I wanted to see). A more advanced thing to worry about, not yet tested, is how well they'll handle asymmetrical work distributions (not enough workers, but some databases producing a lot and some a little undo work). Script attached. -- Thomas Munro https://enterprisedb.com
Attachment
On Wed, Aug 7, 2019 at 5:06 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Thu, Aug 1, 2019 at 1:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 31, 2019 at 10:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > but > > > > here's a small thing: I managed to reach an LWLock self-deadlock in > > > > the undo worker launcher: > > > > > > > > > > I could see the problem, will fix in next version. > > > > Fixed both of these problems in the patch just posted by me [1]. > > I reran the script that found that problem, so I could play with the > linger logic. > Thanks for the test. I will look into it and get back to you. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Hi,

On 2019-08-07 14:50:17 +0530, Amit Kapila wrote:
> On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> > > +/*
> > > + * Binary heap comparison function to compare the time at which an error
> > > + * occurred for transactions.
> > > + *
> > > + * The error queue is sorted by next_retry_at and err_occurred_at. Currently,
> > > + * the next_retry_at has some constant delay time (see PushErrorQueueElem), so
> > > + * it doesn't make much sense to sort by both values. However, in future, if
> > > + * we have some different algorithm for next_retry_at, then it will work
> > > + * seamlessly.
> > > + */
> >
> > Why is it useful to have error_occurred_at be part of the comparison at
> > all? If we need a tiebreaker, err_occurred_at isn't that (if we can get
> > conflicts for next_retry_at, then we can also get conflicts in
> > err_occurred_at). Seems better to use something actually guaranteed to
> > be unique for a tiebreaker.
>
> This was to distinguish the case where the request is failing
> multiple times from the case where the request failed the first time. I
> agree that we need a better unique identifier like FullTransactionid
> though. Do let me know if you have any other suggestions.

Sure, I get why you have the field. Even if it were just for debugging or such. Was just commenting upon it being used as part of the comparison. I'd just go for (next_retry_at, fxid).

> > > + * backends. This will ensure that it won't get filled.
> > > + */
> >
> > How does this ensure anything?
>
> Because based on this we will have a hard limit on the number of undo
> requests, after which we won't allow more requests. See some more
> detailed explanation for the same later in this email. I think the
> comment needs to be updated.

Well, as your code stands, I don't think there is an actual hard limit on the number of transactions needing to be undone, due to the way errors are handled. There's no consideration of prepared transactions.

> > > + START_CRIT_SECTION();
> > > +
> > > + /* Update the progress in the transaction header. */
> > > + UndoRecordUpdateTransInfo(&context, 0);
> > > +
> > > + /* WAL log the undo apply progress. */
> > > + {
> > > + XLogRecPtr lsn;
> > > + xl_undoapply_progress xlrec;
> > > +
> > > + xlrec.urec_ptr = progress_urec_ptr;
> > > + xlrec.progress = block_num;
> > > +
> > > + XLogBeginInsert();
> > > + XLogRegisterData((char *) &xlrec, sizeof(xlrec));
> > > +
> > > + RegisterUndoLogBuffers(&context, 1);
> > > + lsn = XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_APPLY_PROGRESS);
> > > + UndoLogBuffersSetLSN(&context, lsn);
> > > + }
> > > +
> > > + END_CRIT_SECTION();
> > > +
> > > + /* Release undo buffers. */
> > > + FinishUndoRecordInsert(&context);
> > > +}
> >
> > This whole prepare/execute split for updating apply progress, and next
> > undo pointers, makes no sense to me.
>
> Can you explain what your concern is here? Basically, in the prepare
> phase, we read and lock the buffer, and in the actual update phase
> (which is under a critical section), we update the contents in the
> shared buffer. This is the same idea as we use in many places in
> the code.

I'll comment on the concerns with the whole API separately.
> > > typedef struct TwoPhaseFileHeader > > > { > > > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader > > > uint16 gidlen; /* length of the GID - GID follows the header */ > > > XLogRecPtr origin_lsn; /* lsn of this record at origin node */ > > > TimestampTz origin_timestamp; /* time of prepare at origin node */ > > > + > > > + /* > > > + * We need the locations of the start and end undo record pointers when > > > + * rollbacks are to be performed for prepared transactions using undo-based > > > + * relations. We need to store this information in the file as the user > > > + * might rollback the prepared transaction after recovery and for that we > > > + * need its start and end undo locations. > > > + */ > > > + UndoRecPtr start_urec_ptr[UndoLogCategories]; > > > + UndoRecPtr end_urec_ptr[UndoLogCategories]; > > > } TwoPhaseFileHeader; > > > > Why do we not need that knowledge for undo processing of a non-prepared > > transaction? > The non-prepared transaction also needs to be aware of that. It is > stored in TransactionStateData. I am not sure if I understand your > question here. My concern is that I think it's fairly ugly to store data like this in the 2pc state file. And it's not an insubstantial amount of additional data either, compared to the current size, even when no undo is in use. There's a difference between an unused feature increasing backend local memory and increasing the size of WAL logged data. Obviously it's not by a huge amount, but still. It also just feels wrong to me. We don't need the UndoRecPtr's when recovering from a crash/restart to process undo. Now we obviously don't want to unnecessarily search for data that is expensive to gather, which is a good reason for keeping track of this data. But I do wonder if this is the right approach. I know that Robert is working on a patch that revises the undo request layer somewhat, it's possible that this is best discussed afterwards. > > > + case TBLOCK_UNDO: > > > + /* > > > + * We reach here when we got error while applying undo > > > + * actions, so we don't want to again start applying it. Undo > > > + * workers can take care of it. > > > + * > > > + * AbortTransaction is already done, still need to release > > > + * locks and perform cleanup. > > > + */ > > > + ResetUndoActionsInfo(); > > > + ResourceOwnerRelease(s->curTransactionOwner, > > > + RESOURCE_RELEASE_LOCKS, > > > + false, > > > + true); > > > + s->state = TRANS_ABORT; > > > CleanupTransaction(); > > > > Hm. Why is it ok that we only perform that cleanup action? Either the > > rest of potentially held resources will get cleaned up somehow as well, > > in which case this ResourceOwnerRelease() ought to be redundant, or > > we're potentially leaking important resources like buffer pins, relcache > > references and whatnot here? > > > > I had initially used AbortTransaction() here for such things, but I > was not sure whether that is the right thing when we reach here in > this state. Because AbortTransaction is already done once we reach > here. The similar thing happens for the TBLOCK_SUBUNDO state few > lines below where I had used AbortSubTransaction. Now, one problem I > faced when AbortSubTransaction got invoked in this code path was it > internally invokes RecordTransactionAbort->XidCacheRemoveRunningXids > which result in the error "did not find subXID %u in MyProc". The > reason is obvious which is that we had already removed it when > AbortSubTransaction was invoked before applying undo actions. 
> The releasing of locks was the thing which we have delayed to allow undo
> actions to be applied, which is done here. The other idea I had here
> was to call AbortTransaction/AbortSubTransaction but somehow avoid
> calling RecordTransactionAbort when in this state. Do you have any
> suggestions to deal with this?

Well, what I'm asking is how this possibly could be correct. Perhaps I'm just missing something, in which case I don't yet want to make suggestions for how this should look.

My concern is that you seem to have added a state where we process quite a lot of code - the undo actions, which use buffer pins, lwlocks, sometimes heavyweight locks, potentially even relcache, much more - but we don't actually clean up any of those in case of error, *except* for *some* resowner-managed things. I just don't understand how that could possibly be correct. I'm also fairly certain that we had discussed that we can't actually execute undo outside of a somewhat valid transaction environment - and as far as I can tell, there's nothing of that here.

Even in the path without an error during UNDO, I see code like:

+ else
+ {
+ AtCleanup_Portals(); /* now safe to release portal memory */
+ AtEOXact_Snapshot(false, true); /* and release the transaction's
+ * snapshots */
+ s->fullTransactionId = InvalidFullTransactionId;
+ s->subTransactionId = TopSubTransactionId;
+ s->blockState = TBLOCK_UNDO;
+ }

without any comments on why exactly these two cleanup callbacks need to be called, and no others. See also below.

And then when UNDO errors out, I see:

+ for (i = 0; i < UndoLogCategories; i++)
+ {
+ PG_CATCH();
+ {
...
+ /* We should never reach here when we are in a semi-critical-section. */
+ Assert(SemiCritSectionCount == 0);
+ }
+ PG_END_TRY();

meaning that we'll just move on to undo the next persistency category after an error. But there's absolutely no resource cleanup here. Which, to me, means we'll very easily self-deadlock and things like that. Consider an error thrown during undo, while holding an lwlock. If the next persistence category acquires that lock again, we'll self-deadlock. There are a lot of other similar issues.

So I just don't understand the current model of the xact.c integration. That might be because I just don't understand the current design, or because the current design is pretty broken.

> > > +{
> > > + TransactionState s = CurrentTransactionState;
> > > + bool result;
> > > + int i;
> > > +
> > > + /*
> > > + * We don't want to apply the undo actions when we are already cleaning up
> > > + * for FATAL error. See ReleaseResourcesAndProcessUndo.
> > > + */
> > > + if (SemiCritSectionCount > 0)
> > > + {
> > > + ResetUndoActionsInfo();
> > > + return;
> > > + }
> >
> > Wait what? Semi critical sections?
>
> Robert upthread suggested this idea [1] (See paragraph starting with
> "I am not a fan of applying_subxact_undo....") to deal with cases
> where we get an error while applying undo actions and we need to
> promote the error to FATAL.

Well, my problem with this starts with the fact that I don't see a reason why we would want to promote subtransaction failures to FATAL. Or why that would be OK - losing reliability when using savepoints seems pretty dubious to me. And sometimes we can expect to get errors when savepoints are in use, e.g. out-of-memory errors. And they're often going to happen again during undo processing. So this isn't an "oh, it never realistically happens" scenario imo.
There are two comments about this:

+We promote the error to FATAL error if it occurred while applying undo for a
+subtransaction. The reason we can't proceed without applying subtransaction's
+undo is that the modifications made in that case must not be visible even if
+the main transaction commits.

+ * (a) Subtransactions. We can't proceed without applying
+ * subtransaction's undo as the modifications made in that case must not
+ * be visible even if the main transaction commits. The reason why that
+ * can happen is because for undo-based AM's we don't need to have a
+ * separate transaction id for subtransactions and once the main
+ * transaction commits the tuples modified by subtransactions will become
+ * visible.

But that only means we can't allow such errors to be caught - there should be much less harsh ways to achieve that than throwing a FATAL error. We could e.g. just walk up the transaction stack and mark the transaction levels as failed or something. So if somebody catches the error, any database access done will just cause a failure again.

There's also:

+ * (b) Temp tables. We don't expect background workers to process undo of
+ * temporary tables as the same won't be accessible.

But I fail to see why that requires FATALing either. Isn't the worst outcome here that we'll have some unnecessary undo around?

> > Given code like this I have a hard time seeing what the point of having
> > separate queue entries for the different persistency levels is.
>
> It is not for this case, rather, it is for the case of the discard worker
> (background worker) where we process the transactions at log level.
> The permanent and unlogged transactions will be in a separate log and
> can be encountered at different times, so this leads to having
> separate entries for them.

Given a hashtable over fxid, that doesn't seem like a counter-argument. We can just do an fxid lookup, and if there's already an entry, update it to reference the additional persistence level.

One question of understanding: Why do we ever want to register undo requests for transactions that did not start in the log the discard worker is currently looking at? It seems to me that there's some complexity involved due to wanting to do that? We might have already processed the portion of the transaction in the later log, but I don't see why that'd be a problem?
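To sketch the fxid-lookup approach I mean, roughly - every name here is invented, not from the patch:

    RollbackHashEntry *entry;   /* hypothetical entry type */
    bool        found;

    entry = hash_search(RollbackRequestHash, &fxid, HASH_ENTER, &found);
    if (!found)
        /* fresh entry; assumes InvalidUndoRecPtr is zero */
        memset(entry->start_urec_ptr, 0, sizeof(entry->start_urec_ptr));
    /* Record the additional persistence level in the same entry. */
    entry->start_urec_ptr[category] = start_urec_ptr;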
> > > + */ > > > + if (IsSubTransaction()) > > > + { > > > + AtSubCleanup_Portals(s->subTransactionId); > > > + s->blockState = TBLOCK_SUBUNDO; > > > + } > > > + else > > > + { > > > + AtCleanup_Portals(); /* now safe to release portal memory */ > > > + AtEOXact_Snapshot(false, true); /* and release the transaction's > > > + * snapshots */ > > > > Why do precisely these actions need to be performed here? > > > > This is to get a transaction into a clean state. Before calling this > function AbortTransaction has been performed and there were few more > things we need to do for cleanup. That doesn't answer my question. Why is it specifically these ones that need to be called "manually"? Why no others? Where is that explained? I assume you just copied them from CleanupTransaction() - but there's no reference to that fact on either side, which means nobody would know to keep them in sync. I'll also note that the way it's currently set up, we don't delete the transaction context before processing undo, at least as far as I can see. Which seems that some OOM cases won't be able to roll back, even if there'd be plenty memory except for the memory used by the transaction. The portal cleanup will allow for some, but not all of that, I think. > > > +bool > > > +ProcessUndoRequestForEachLogCat(FullTransactionId fxid, Oid dbid, > > > + UndoRecPtr *end_urec_ptr, UndoRecPtr *start_urec_ptr, > > > + bool *undoRequestResgistered, bool isSubTrans) > > > +{ > > > + UndoRequestInfo urinfo; > > > + int i; > > > + uint32 save_holdoff; > > > + bool success = true; > > > + > > > + for (i = 0; i < UndoLogCategories; i++) > > > + { > > > + if (end_urec_ptr[i] && !undoRequestResgistered[i]) > > > + { > > > + save_holdoff = InterruptHoldoffCount; > > > + > > > + PG_TRY(); > > > + { > > > + /* for subtransactions, we do partial rollback. */ > > > + execute_undo_actions(fxid, > > > + end_urec_ptr[i], > > > + start_urec_ptr[i], > > > + !isSubTrans); > > > + } > > > + PG_CATCH(); > > > + { > > > + /* > > > + * Add the request into an error queue so that it can be > > > + * processed in a timely fashion. > > > + * > > > + * If we fail to add the request in an error queue, then mark > > > + * the entry status as invalid and continue to process the > > > + * remaining undo requests if any. This request will be later > > > + * added back to the queue by discard worker. > > > + */ > > > + ResetUndoRequestInfo(&urinfo); > > > + urinfo.dbid = dbid; > > > + urinfo.full_xid = fxid; > > > + urinfo.start_urec_ptr = start_urec_ptr[i]; > > > + if (!InsertRequestIntoErrorUndoQueue(&urinfo)) > > > + RollbackHTMarkEntryInvalid(urinfo.full_xid, > > > + urinfo.start_urec_ptr); > > > + /* > > > + * Errors can reset holdoff count, so restore back. This is > > > + * required because this function can be called after holding > > > + * interrupts. > > > + */ > > > + InterruptHoldoffCount = save_holdoff; > > > + > > > + /* Send the error only to server log. */ > > > + err_out_to_client(false); > > > + EmitErrorReport(); > > > + > > > + success = false; > > > + > > > + /* We should never reach here when we are in a semi-critical-section. */ > > > + Assert(SemiCritSectionCount == 0); > > > > This seems entirely and completely broken. You can't just catch an > > exception and continue. What if somebody held an lwlock when the error > > was thrown? A buffer pin? > > > > The caller deals with that. 
For example, when this is called from > FinishPreparedTransaction, we do AbortOutOfAnyTransaction and when > called from ReleaseResourcesAndProcessUndo, we just release locks. I don't see the caller being able to do anything here - the danger is that a previous category of undo processing might have acquired resources, and they're not cleaned up on failure, as you've set things up. > Earlier here also, I had AbortTransaction but was not sure whether > that is the right thing to do especially because it will lead to > RecordTransactionAbort called twice, once when we do AbortTransaction > before applying undo actions and once when we do it after catching the > exception. Like as I said earlier maybe the right way is to just > avoid calling RecordTransactionAbort again. I think that "just" means that you've not divorced the state in which undo processing is happening well enough from the "original" transaction. I stand by my suggestion that what needs to happen is roughly 1) re-assign locks from failed (sub-)transaction to a special "undo" resource owner 2) completely abort (sub-)transaction 3) start a new (sub-)transaction 4) process undo 5) commit/abort that (sub-)transaction 6) release locks from "undo" resource owner > > Nor why that's not a problem for: > > > > > +We have the hard limit (proportional to the size of the rollback hash table) > > > +for the number of transactions that can have pending undo. This can help us > > > +in computing the value of oldestXidHavingUnappliedUndo and allowing us not to > > > +accumulate pending undo for a long time which will eventually block the > > > +discard of undo. > > > > The reason why it is not a problem is that we don't remove the entry > from the hash table rather just mark it such that later discard worker > can add it to the queues. I am not sure if I understood your question > completely, but let me try to explain this idea in a bit more detail. > > The basic idea is that Rollback Hash Table has space equivalent to all > the three queues plus (2 * MaxBackends). Now, we will stop allowing > the new transactions that want to write undo once the hash table has > entries equivalent to all three queues and we have 2 * Max_Backends > already attached to undo logs that are not committed. Assume we have > each queue size as 5 and Max_Backends =10, then ideally we can 35 > entries (3 * 5 + 2 * 10) in the hash table. The way all this is > related to the error queue being full is like this: > > Say, we have a number of hash table entries equal to 15 which > indicates all queues are full and now 10 backends connected to two > different logs (permanent and unlogged). Next one of the transaction > errors out and try to rollback, at this stage, it will add an entry in > the hash table and try to execute the actions. While executing > actions, it got an error and couldn't add to error queue because it > was full, so at this stage, it just marks the hash table entry as > invalid and proceeds (consider this happens for both logged and > unlogged categories). So, at this stage, we will have 17 entries in > the hash table and the other 9 backends attached to 18 logs which > makes space for 35 xacts if the system crashes at this stage. The > backend which errored out again tries to perform an operation for > which it needs to perform undo. Now, we won't allow this backend to > perform that action because if it crashed after performing the > operation and before committing, the hash table will overflow. 
What I don't understand is why there's any need for these "in hash table, but not in any queue, and not being processed" type entries. All that avoiding that seems to require is making the error queue a bit bigger?

> > > + /* There might not be any undo log and hibernation might be needed. */
> > > + *hibernate = true;
> > > +
> > > + StartTransactionCommand();
> >
> > Why do we need this? I assume it's so we can have a resource owner?
>
> Yes, and another reason is we are using dbid_exists in this function.

I think it'd be good to avoid needing any database access in both the discard worker and the undo launcher. They really shouldn't need catalog access architecturally, and in the case of the discard worker we'd add another process that'd potentially hold the xmin horizon down for a while in some situations. We could of course add exceptions like we have for vacuum, but I think we really shouldn't need that.
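To be concrete about the locks/undo sequence I suggested earlier (steps 1-6), I'm imagining something roughly like this untested sketch - the lock-transfer helpers and the undo-apply call are hypothetical, the rest are existing xact.c entry points:

    ResourceOwner undo_owner = ResourceOwnerCreate(NULL, "undo");

    TransferLocksToOwner(undo_owner);   /* 1) hypothetical: keep the failed
                                         *    xact's locks alive */
    AbortCurrentTransaction();          /* 2) complete normal abort cleanup */
    StartTransactionCommand();          /* 3) fresh transaction environment */
    PerformUndoActions(fxid);           /* 4) hypothetical: apply the undo */
    CommitTransactionCommand();         /* 5) or abort this xact on failure */
    ReleaseLocksFromOwner(undo_owner);  /* 6) finally drop the locks */

Greetings,

Andres Freund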
On Thu, Aug 8, 2019 at 9:31 AM Andres Freund <andres@anarazel.de> wrote:
> I know that Robert is working on a patch that revises the undo request
> layer somewhat, it's possible that this is best discussed afterwards.

Here's what I have at the moment. This is not by any means a complete replacement for Amit's undo worker machinery, but it is a significant redesign of (and, I believe, a significant improvement to) the queue management stuff from Amit's patch. I wrote this pretty quickly, so while it passes simple testing, it probably has a number of bugs, and to actually use it, it would need to be integrated with xact.c; right now it's just a standalone module that doesn't do anything except let itself be tested.

Some of the ways it is different from Amit's patches include:

* It uses RBTree rather than binaryheap, so when we look ahead, we look ahead in the right order.

* There's no limit to the lookahead distance; when looking ahead, it will search the entirety of all 3 RBTrees for an entry from the right database.

* It doesn't have a separate hash table keyed by XID. I didn't find that necessary.

* It's better-isolated, as you can see from the fact that I've included a test module that tests this code without actually ever putting an UndoRequestManager in shared memory. I would've liked to expand this test module, but I don't have time to do that today and felt it better to get this much sent out.

* It has a lot of comments explaining the design and how it's intended to integrate with the rest of the system.

Broadly, my vision for how this would get used is:

- Create an UndoRequestManager in shared memory.

- Before a transaction first attaches to a permanent or unlogged undo log, xact.c would call RegisterUndoRequest(); thereafter, xact.c would store a pointer to the UndoRequest for the lifetime of the toplevel transaction.

- Immediately after attaching to a permanent or unlogged undo log, xact.c would call UndoRequestSetLocation.

- xact.c would track the number of bytes of permanent and unlogged undo records the transaction generates. If the transaction goes on to abort, it reports these by calling FinalizeUndoRequest.

- If the transaction commits, it doesn't need that information, but does need to call UnregisterUndoRequest() as a post-commit step in CommitTransaction().

- In the case of an abort, after calling FinalizeUndoRequest, xact.c would call PerformUndoInBackground() to find out whether to do undo in the background or the foreground. If undo is to be done in the foreground, the backend must go on to call UnregisterUndoRequest() if undo succeeds, and RescheduleUndoRequest() if it fails.

- In the case of a prepared transaction, a pointer to the UndoRequest would get stored in the GlobalTransaction (but nothing extra would get stored in the twophase state file).

- COMMIT PREPARED calls UnregisterUndoRequest().

- ROLLBACK PREPARED calls PerformUndoInBackground; if told to do undo in the foreground, it must go on to call either UnregisterUndoRequest() on success or RescheduleUndoRequest() on failure, just like in the regular abort case.

- After a crash, once recovery is complete but before we open for connections, or at least before we allow any new undo activity, the discard worker scans all the logs and makes a bunch of calls to RecreateUndoRequest(). Then, for each prepared transaction that still exists, it calls SuspendPreparedUndoRequest() and uses the return value to reset the UndoRequest pointer in the GlobalTransaction.
Only once both of those steps are completed can undo workers be safely started.

- Undo workers call GetNextUndoRequest() to get the next task that they should perform, and once they do, they "own" the undo request. When undo succeeds or fails, they must call either UnregisterUndoRequest() or RescheduleUndoRequest(), as appropriate, just like for foreground undo.

Making sure this is water-tight will probably require some well-done integration with xact.c, so that an undo request that we "own" because we got it in a background undo apply process looks exactly the same as one we "own" because it's our transaction originally.
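In code form, the abort path above would look roughly like this - an untested sketch; the exact signatures are guesses, and ExecuteUndoActions stands in for whatever actually applies the undo records (it's not part of this module):

    /* Hypothetical sketch of the abort-path protocol. */
    FinalizeUndoRequest(req, perm_undo_bytes, unlogged_undo_bytes);

    if (!PerformUndoInBackground(req))
    {
        /* Foreground undo: this backend owns the request until done. */
        if (ExecuteUndoActions(req))        /* hypothetical */
            UnregisterUndoRequest(req);
        else
            RescheduleUndoRequest(req);
    }
    /* Otherwise, a background undo worker now owns the request. */

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company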
Attachment
On Fri, Aug 9, 2019 at 1:57 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Aug 8, 2019 at 9:31 AM Andres Freund <andres@anarazel.de> wrote:
> > I know that Robert is working on a patch that revises the undo request
> > layer somewhat, it's possible that this is best discussed afterwards.
>
> Here's what I have at the moment. This is not by any means a complete
> replacement for Amit's undo worker machinery, but it is a significant
> redesign of (and, I believe, a significant improvement to) the queue
> management stuff from Amit's patch.

Thanks for working on this. Neither Kuntal nor I have had time to look into this part in detail.

> I wrote this pretty quickly, so
> while it passes simple testing, it probably has a number of bugs, and
> to actually use it, it would need to be integrated with xact.c;

I can look into this and integrate it with the other parts of the patch next week, unless you are planning to do so. Right now, I am working on fixing up some other comments raised on the patches, which I will share today or early next week, after which I can start looking into this. I hope that is fine with you.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 22, 2019 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I have reviewed 0012-Infrastructure-to-execute-pending-undo-actions, > Please find my comment so far. > > 1. > + /* It shouldn't be discarded. */ > + Assert(!UndoRecPtrIsDiscarded(xact_urp)); > > I think comments can be added to explain why it shouldn't be discarded. > > 2. > + /* Compute the offset of the uur_next in the undo record. */ > + offset = SizeOfUndoRecordHeader + > + offsetof(UndoRecordTransaction, urec_progress); > + > in comment /uur_next/uur_progress > > 3. > +/* > + * undo_record_comparator > + * > + * qsort comparator to handle undo record for applying undo actions of the > + * transaction. > + */ > Function header formating is not in sync with other functions. > Fixed all the above comments in the attached patch. > 4. > +void > +undoaction_redo(XLogReaderState *record) > +{ > + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; > + > + switch (info) > + { > + case XLOG_UNDO_APPLY_PROGRESS: > + undo_xlog_apply_progress(record); > + break; > > For HotStandby it doesn't make sense to apply this wal as this > progress is only required when we try to apply the undo action after > restart > but in HotStandby we never apply undo actions. > I have already responded in my earlier email on why this is required [1]. > 5. > + Assert(from_urecptr != InvalidUndoRecPtr); > + Assert(to_urecptr != InvalidUndoRecPtr); > > we can use macros UndoRecPtrIsValid instead of checking like this. > Fixed. > 6. > + if ((slot == NULL) || (UndoRecPtrGetLogNo(urecptr) != slot->logno)) > + slot = UndoLogGetSlot(UndoRecPtrGetLogNo(urecptr), false); > + > + Assert(slot != NULL); > We are passing missing_ok as false in UndoLogGetSlot. But, not sure > why we are expecting that undo lot can not be dropped. In multi-log > transaction it's possible > that the tablespace in which next undolog is there is already dropped? > Already responded on this in my earlier reply [1]. > 7. > + */ > + do > + { > + BlockNumber progress_block_num = InvalidBlockNumber; > + int i; > + int nrecords; > ..... > + */ > + if (!UndoRecPtrIsValid(urec_ptr)) > + break; > + } while (true); > > I think we can convert above loop to while(true) instead of do..while, > because there is no need for do while loop. > > 8. > + if (last_urecinfo->uur->uur_info & UREC_INFO_LOGSWITCH) > + { > + UndoRecordLogSwitch *logswitch = last_urecinfo->uur->uur_logswitch; > > IMHO, the caller of UndoFetchRecord should directly check > uur->uur_logswitch instead of uur_info & UREC_INFO_LOGSWITCH. > Actually, uur_info is internally set > for inserting the tuple and check there to know what to insert and > fetch but I think caller of UndoFetchRecord should directly rely on > the field because ideally all > the fields in UnpackUndoRecord must be set and uur_txt or > uur_logswitch will be allocated when those headers present. I think > this needs to be improved in undo interface patch > as well (in UndoBulkFetchRecord). > Okay, fixed both of the above. I have exposed a new macro IsUndoLogSwitched from undorecord.h which you might also want to use in your patch. Apart from this, in the attached patches, I have fixed various comments raised in this thread from Amit Khandekar. I'll respond to them separately. I have yet to address various comments raised by Andres and Robert which also includes integration with the latest patch on queues posted by Robert. 
Note - The patches for undo-log and undo-interface has not been rebased as others are working actively on their branches. The branch where this code resides can be accessed at https://github.com/EnterpriseDB/zheap/tree/undoprocessing [1] - https://www.postgresql.org/message-id/CAA4eK1KoA0L%3DPNBc_uu2v8H0%3DLA_Cm%3Do9GyFm6i6DSD6mUMppg%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Move-some-md.c-specific-logic-from-smgr.c-to-md.c.patch
- 0002-Prepare-to-support-multiple-SMGR-implementations.patch
- 0003-Add-undo-log-manager.patch
- 0004-Allow-WAL-record-data-on-first-modification-after-a-.patch
- 0005-Add-prefetch-support-for-the-undo-log.patch
- 0006-Defect-and-enhancement-in-multi-log-support.patch
- 0007-Provide-interfaces-to-store-and-fetch-undo-records.patch
- 0008-undo-page-consistency-checker.patch
- 0009-Extend-binary-heap-functionality.patch
- 0010-Infrastructure-to-register-and-fetch-undo-action-req.patch
- 0011-Infrastructure-to-execute-pending-undo-actions.patch
- 0012-Allow-foreground-transactions-to-perform-undo-action.patch
- 0013-Allow-execution-and-discard-of-undo-by-background-wo.patch
On Mon, Jul 22, 2019 at 8:39 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Mon, 22 Jul 2019 at 14:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have started review of
> 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch. Below
> are some quick comments to start with:
>
> +++ b/src/backend/access/undo/undoworker.c
>
> +#include "access/xact.h"
> +#include "access/undorequest.h"
> Order is not alphabetical
>

Fixed this and a few others.

> + * Each undo worker then start reading from one of the queue the requests for
> start=>starts
> queue=>queues
>
> -------------
>
> + rc = WaitLatch(MyLatch,
> + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
> + 10L, WAIT_EVENT_BGWORKER_STARTUP);
> +
> + /* emergency bailout if postmaster has died */
> + if (rc & WL_POSTMASTER_DEATH)
> + proc_exit(1);
> I think now, thanks to commit cfdf4dc4fc9635a, you don't have to
> explicitly handle postmaster death; instead you can use
> WL_EXIT_ON_PM_DEATH. Please check at all such places where this is
> done in this patch.
>
> -------------
>

Fixed both of the above issues.

> +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
> +{
> + /* Block concurrent access. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> +
> + MyUndoWorker = &UndoApplyCtx->workers[slot];
> Not sure why MyUndoWorker is used here. Can't we use a local variable ?
> Or do we intentionally attach to the slot as a side-operation ?
>
> -------------
>

I have changed the code around this such that we first attach to the slot and then get the required info. Also, I don't see the need for an exclusive lock here, so I changed it to a shared lock.

> + * Get the dbid where the wroker should connect to and get the worker
> wroker=>worker
>
> -------------
>
> + BackgroundWorkerInitializeConnectionByOid(urinfo.dbid, 0, 0);
> 0, 0 => InvalidOid, 0
>
> + * Set the undo worker request queue from which the undo worker start
> + * looking for a work.
> start => should start
> a work => work
>
> --------------
>

Fixed both of these.

> + if (!InsertRequestIntoErrorUndoQueue(urinfo))
> I was thinking what happens if for some reason
> InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the
> entry will not be marked invalid, and so there will be no undo action
> carried out because I think the undo worker will exit. What happens
> next with this entry ?

I think this will change after integration with Robert's latest patch on queues, so I will address it along with that if required.
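With WL_EXIT_ON_PM_DEATH, the waits mentioned above now reduce to something like this, with no explicit postmaster-death check needed:

    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   10L, WAIT_EVENT_BGWORKER_STARTUP);

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com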
On Tue, Jul 23, 2019 at 8:12 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > -------------------- > > Some further review comments for undoworker.c : > > > +/* Sets the worker's lingering status. */ > +static void > +UndoWorkerIsLingering(bool sleep) > The function name sounds like "is the worker lingering ?". Can we > rename it to something like "UndoWorkerSetLingering" ? > makes sense, changed as per suggestion. > ------------- > > + errmsg("undo worker slot %d is empty, cannot attach", > + slot))); > > + } > + > + if (MyUndoWorker->proc) > + { > + LWLockRelease(UndoWorkerLock); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("undo worker slot %d is already used by " > + "another worker, cannot attach", slot))); > > These two error messages can have a common error message "could not > attach to worker slot", with errdetail separate for each of them : > slot %d is empty. > slot %d is already used by another worker. > > -------------- > Changed as per suggestion. > +static int > +IsUndoWorkerAvailable(void) > +{ > + int i; > + int alive_workers = 0; > + > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > > Should have bool return value. > > Also, why is it keeping track of number of alive workers ? Sounds like > earlier it used to return number of alive workers ? If it indeed needs > to just return true/false, we can do away with alive_workers. > > Also, *exclusive* lock is unnecessary. > > -------------- > Changed as per suggestion. Additionally, I changed the name of the function to UndoWorkerIsAvailable(), so that it is similar to other functions in the file. > +if (UndoGetWork(false, false, &urinfo, NULL) && > + IsUndoWorkerAvailable()) > + UndoWorkerLaunch(urinfo); > > There is no lock acquired between IsUndoWorkerAvailable() and > UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable() > returns true, there is a small window where UndoWorkerLaunch() does > not find any worker slot with in_use false, causing assertion failure > for (worker != NULL). > -------------- > I have removed the assert and instead added a warning. I have also added a comment from the place where we call UndoWorkerLaunch to mention the race condition. > +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) > +{ > + /* Block concurrent access. */ > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > *Exclusive* lock is unnecessary. > ------------- > Right, changed to share lock. > + LWLockRelease(UndoWorkerLock); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("undo worker slot %d is empty", > + slot))); > I believe there is no need to explicitly release an lwlock before > raising an error, since the lwlocks get released during error > recovery. Please check all other places where this is done. > ------------- > Fixed. > + * Start new undo apply background worker, if possible otherwise return false. > worker, if possible otherwise => worker if possible, otherwise > ------------- > > +static bool > +UndoWorkerLaunch(UndoRequestInfo urinfo) > We don't check UndoWorkerLaunch() return value. Can't we make it's > return value type void ? > Now, the function returns void and accordingly I have adjusted the comment which should address both the above comments. > Also, it would be better to have urinfo as pointer to UndoRequestInfo > rather than UndoRequestInfo, so as to avoid structure copy. > ------------- > Okay, changed as per suggestion. 
> +{
> + BackgroundWorker bgw;
> + BackgroundWorkerHandle *bgw_handle;
> + uint16 generation;
> + int i;
> + int slot = 0;
> We can remove variable i, and use slot variable in place of i.
> -----------
>
> + snprintf(bgw.bgw_name, BGW_MAXLEN, "undo apply worker");
> I think it would be trivial to also append the worker->generation in
> the bgw_name.
> -------------
>

I am not sure if adding 'generation' is of any use. It might be better to add the database id, as each worker can work on a particular database, so that could be useful information.

>
> + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
> + {
> + /* Failed to start worker, so clean up the worker slot. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> + UndoWorkerCleanup(worker);
> + LWLockRelease(UndoWorkerLock);
> +
> + return false;
> + }
>
> Is it intentional that there is no (warning?) message logged when we
> can't register a bg worker ?
> -------------
>

Added a warning in that code path.
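The new code is roughly along these lines (the exact errcode and message wording in the posted patch may differ; note the function now returns void):

    if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
    {
        /* Failed to start worker, so clean up the worker slot. */
        LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
        UndoWorkerCleanup(worker);
        LWLockRelease(UndoWorkerLock);

        ereport(WARNING,
                (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
                 errmsg("out of background worker slots"),
                 errhint("You might need to increase max_worker_processes.")));
        return;
    }

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com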
On Fri, Jul 26, 2019 at 9:57 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Fri, 26 Jul 2019 at 12:25, Amit Kapila <amit.kapila16@gmail.com> wrote: > > I agree with all your other comments. > > Thanks for addressing the comments. Below is the continuation of my > comments from 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch > : > > > + * Perform rollback request. We need to connect to the database for first > + * request and that is required because we access system tables while > for first request and that is required => for the first request. This > is required > Changed as per suggestion. > --------------- > > +UndoLauncherShmemSize(void) > +{ > + Size size; > + > + /* > + * Need the fixed struct and the array of LogicalRepWorker. > + */ > + size = sizeof(UndoApplyCtxStruct); > > The fixed structure size should be offsetof(UndoApplyCtxStruct, > workers) rather than sizeof(UndoApplyCtxStruct) > > --------------- > Why? I see the similar code in ApplyLauncherShmemSize. If there is any problem with this, then we might have similar problem in existing code as well. > In UndoWorkerCleanup(), we set individual fields of the > UndoApplyWorker structure, whereas in UndoLauncherShmemInit(), for all > the UndoApplyWorker array elements, we just memset all the > UndoApplyWorker structure elements to 0. I think we should be > consistent for the two cases. I guess we can just memset to 0 as you > do in UndoLauncherShmemInit(), but this will cause the > worker->undo_worker_queue to be 0 i.e. XID_QUEUE , whereas in > UndoWorkerCleanup(), it is set to -1. Is the -1 value essential, or we > can just set it to XID_QUEUE initially ? Either is fine, because before launching the worker we set the valid value. It is better to set it as InvalidUndoWorkerQueue though. > Also, if we just use memset in UndoWorkerCleanup(), we need to first > save generation into a temp variable, and then after memset(), restore > it back. > This sounds like an unnecessary trickery. > That brought me to another point : > We already have a macro ResetUndoRequestInfo(), so UndoWorkerCleanup() > can just call ResetUndoRequestInfo(). > ------------ > Hmm, both (UndoRequestInfo and UndoApplyWorker) are separate structures, so how can we reuse them? > + bool allow_peek; > + > + CHECK_FOR_INTERRUPTS(); > + > + allow_peek = !TimestampDifferenceExceeds(started_at, > Some comments would be good about what is allow_peek used for. Something like : > "Arrange to prevent the worker from restarting quickly to switch databases" > Added a slightly different comment. > ----------------- > +++ b/src/backend/access/undo/README.UndoProcessing > ----------------- > > +worker then start reading from one of the queues the requests for that > start=>starts > --------------- > > +work, it lingers for UNDO_WORKER_LINGER_MS (10s as default). This avoids > As per the latest definition, it is 20s. IMHO, there's no need to > mention the default value in the readme. > --------------- > > +++ b/src/backend/access/undo/discardworker.c > --------------- > > + * portion of transaction that is overflowed into a separate log can > be processed > 80-col crossed. > > +#include "access/undodiscard.h" > +#include "access/discardworker.h" > Not in alphabetical order > Fixed all of the above four points. > > +++ b/src/backend/access/undo/undodiscard.c > --------------- > > + next_insert = UndoLogGetNextInsertPtr(logno); > I checked UndoLogGetNextInsertPtr() definition. It calls > find_undo_log_slot() to get back the slot from logno. 
> Why not make it
> accept slot as against logno ? At all other places, the slot->logno is
> passed, so it is convenient to just pass the slot there. And in
> UndoDiscardOneLog(), first call find_undo_log_slot() just before the
> above line (or call it at the end of the do-while loop).
>

I am not sure if this is a good idea, because find_undo_log_slot is purely undolog module internals; exposing it outside doesn't seem like a good idea to me.

> This way,
> during each of the UndoLogGetNextInsertPtr() calls in undorequest.c,
> we will have one less find_undo_log_slot() call.
>

I am not sure there is any performance benefit either, because there is a cache for the slots and the lookup should be satisfied from there very quickly. I think we can avoid this repeated call, though, and I have done so in the attached patch.

> -------------
>
> In UndoDiscardOneLog(), there are at least 2 variable declarations
> that can be moved inside the do-while loop : uur and next_insert. I am
> not sure about the other variables viz : undofxid and
> latest_discardxid. Values of these variables in one iteration continue
> across to the second iteration. For latest_discardxid, it looks like
> we do want its value to be carried forward, but is it also true for
> undofxid ?
>

undofxid can be moved inside the loop; fixed that and the other variables pointed out by you.

> + /* If we reach here, this means there is something to discard. */
> + need_discard = true;
> + } while (true);
>
> Also, about need_discard; there is no place where need_discard is set
> to false. That means, from 2nd iteration onwards, it will never be
> false. So even if the code that explicitly sets need_discard to true
> does not get run, still the undolog will be discarded. Is this
> expected ?
> -------------

Yes. We will discard once we have even one transaction's data to discard. For example, say we decided that we can discard the data for transaction id 501, and then the next transaction 502 is aborted and its actions are not yet applied; in that case, we will still discard the data of transaction 501. I hope this answers your question.

>
> + if (request_rollback && dbid_exists(uur->uur_txn->urec_dbid))
> + {
> + (void) RegisterRollbackReq(InvalidUndoRecPtr,
> + undo_recptr,
> + uur->uur_txn->urec_dbid,
> + uur->uur_fxid);
> +
> + pending_abort = true;
> + }
> We can get rid of request_rollback variable. Whatever the "if" block
> above is doing, do it in this upper condition :
> if (!IsXactApplyProgressCompleted(uur->uur_txn->urec_progress))
>
> Something like this :
>
> if (!IsXactApplyProgressCompleted(uur->uur_txn->urec_progress))
> {
> if (dbid_exists(uur->uur_txn->urec_dbid))
> {
> (void) RegisterRollbackReq(InvalidUndoRecPtr,
> undo_recptr,
> uur->uur_txn->urec_dbid,
> uur->uur_fxid);
>
> pending_abort = true;
> }
> }

Hmm, you also need to check that the transaction is not in-progress along with it. I think there will be more movement of checks, and that will make the code look less readable than it is now.

> -------------
>
> + UndoRecordRelease(uur);
> + uur = NULL;
> + }
.....
.....
+ Assert(uur == NULL);
> +
> + /* If we reach here, this means there is something to discard. */
> + need_discard = true;
> + } while (true);
>
> Looks like it is neither necessary to set uur to NULL, nor is it
> necessary to have the Assert(uur == NULL). At the start of each
> iteration uur is anyway assigned a fresh value, which may or may not
> be NULL.
> -------------
>

I think there is no harm in doing what you are saying, but the idea here is to not miss releasing the undo record.
Basically, if we have fetched a valid undo record, then it must be released. I understand this is not a bullet-proof Assert, because one might set it to NULL without actually releasing the memory. For now, I have added a comment before the Assert; see if that makes sense.

> + * over undo logs is complete, new undo can is allowed to be written in the
> new undo can is allowed => new undo is allowed
>
> + * hash table size. So before start allowing any new transaction to write the
> before start allowing => before allowing any new transactions to start
> writing the
> -------------
>

Changed as per suggestion.

> + /* Get the smallest of 'xid having pending undo' and 'oldestXmin' */
> + oldestXidHavingUndo = RollbackHTGetOldestFullXid(oldestXidHavingUndo);
> + ....
> + ....
> + if (FullTransactionIdIsValid(oldestXidHavingUndo))
> + pg_atomic_write_u64(&ProcGlobal->oldestFullXidHavingUnappliedUndo,
> + U64FromFullTransactionId(oldestXidHavingUndo));
>
> Is it possible that the FullTransactionId returned by
> RollbackHTGetOldestFullXid() could be invalid ? If not, then the if
> condition above can be changed to an Assert().
> -------------
>

Yeah, it could be changed to an Assert.

>
> + * If the log is already discarded, then we are done. It is important
> + * to first check this to ensure that tablespace containing this log
> + * doesn't get dropped concurrently.
> + */
> + LWLockAcquire(&slot->mutex, LW_SHARED);
> + /*
> + * We don't have to worry about slot recycling and check the logno
> + * here, since we don't care about the identity of this slot, we're
> + * visiting all of them.
> I guess, it's accidental that the LWLockAcquire() call is *between*
> the two comments ?
> -----------
>

I think it is better to have them as a single comment before acquiring the lock, so I changed it that way.

> + if (UndoRecPtrGetCategory(undo_recptr) == UNDO_SHARED)
> + {
> + /*
> + * For the "shared" category, we only discard when the
> + * rm_undo_status callback tells us we can.
> + */
> + status = RmgrTable[uur->uur_rmid].rm_undo_status(uur, &wait_xid);
> status variable could be declared in this block itself.
> -------------
>

Thomas mentioned that he is planning to change the implementation of shared undo logs, so let's keep this as it is for now.

>
> Some variable declaration alignments and comments spacing need changes
> as per pgindent.
>

I have left this for now; I will take care of it in the next version.

Thanks, Amit Khandekar, for all your review comments. As far as I know, I have addressed all of your review comments raised so far related to the undo-processing patches. Do let me know if I have missed anything. Please find the latest patches in my email upthread [1].

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BMcY0qGaak0AHyzdgAn%2BF6dyxcpDwp9ifGg%3D1WVDadeQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 7, 2019 at 6:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Yeah, that's also a problem with complicated WAL record types. Hopefully
> the complex cases are an exception, not the norm. A complex case is
> unlikely to fit any pre-defined set of fields anyway. (We could look at
> how e.g. protobuf works, if this is really a big problem. I'm not
> suggesting that we add a dependency just for this, but there might be
> some patterns or interfaces that we could mimic.)

I think what you're calling the complex cases are going to be pretty normal cases, not something exotic, but I do agree with you that making the infrastructure more generic is worth considering. One idea I had is to use the facilities from pqformat.h: have the generic code read whatever the common fields are, and then pass the StringInfo to the AM, which can do whatever it wants with the rest of the record; these facilities would probably make it pretty easy to handle either a series of fixed-length fields or, alternatively, variable-length data. What do you think of that idea? (That would not preclude doing compression on top, although I think that feeding everything through pglz or even lz4/snappy may eat more CPU cycles than we can really afford. The option is there, though.)

> If you remember, we did a big WAL format refactoring in 9.5, which moved
> some information from AM-specific structs to the common headers. Namely,
> the information on the relation blocks that the WAL record applies to.
> That was a very handy refactoring, and allowed tools like pg_waldump to
> print more detailed information about all WAL record types. For WAL
> records, moving the block information was natural, because there was
> special handling for full-page images anyway. However, I don't think we
> have enough experience with UNDO log yet, to know which fields would be
> best to include in the common undo header, and which to leave as
> AM-specific payload. I think we should keep the common header slim, and
> delegate to the AM routines.

Yeah, I remember. I'm not really sure I totally buy your argument that we don't know what besides XID should go into an undo record: tuples are a pretty important concept, and although there might be some exceptions here and there, I have a hard time imagining that undo is going to be primarily about anything other than identifying a tuple and recording something you did to it. On the other hand, you might want to identify several tuples, or identify a tuple with a TID that's not 6 bytes, so that's a good reason for allowing more flexibility.

Another point in favor of being more flexible is that it's not clear that there's any use case for third-party tools that work using undo. WAL drives replication and logical decoding and could be used to drive incremental backup, but it's not really clear that similar applications exist for undo. If it's just private to the AM, the AM might as well be responsible for it. If that leads to code duplication, we can create a library of common routines and AM users can use them if they want.

> Hmm. If you're following an UNDO chain, from newest to oldest, I would
> assume that the newer record has enough information to decide whether
> you need to look at the previous record. If the previous record is no
> longer interesting, it might already be discarded away, after all.

I actually thought zedstore might need this pattern.
If you store an XID with each undo pointer, as the current zheap code mostly does, then you have enough information to decide whether you care about the previous undo record before you fetch it. But if a tuple stores only an undo pointer, and you determine that the undo isn't discarded, you have to fetch the record first and then possibly decide that you had the right version in the first place. Now, maybe that pattern doesn't repeat, because the undo records could be set up to contain both an XMIN and an XMAX, but not necessarily. I don't know exactly what you have in mind, but it doesn't seem totally crazy that an undo record might contain the XID that created that version but not the XID that created the prior version, and if so, you'll iterate backwards until you either hit the end of undo or go one undo record past the version you can see.
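Roughly the loop I'm imagining, as an untested sketch - the visibility test and the uur_blkprev field name are guesses, not the actual API:

    /* Walk the undo chain backwards until we find a visible version. */
    UnpackedUndoRecord *cur = UndoFetchRecord(tuple_undo_ptr);

    while (cur != NULL && !VersionVisibleToSnapshot(cur, snapshot))
    {
        UndoRecPtr  prev = cur->uur_blkprev;    /* assumed field name */

        UndoRecordRelease(cur);
        cur = NULL;
        if (!UndoRecPtrIsValid(prev) || UndoRecPtrIsDiscarded(prev))
            break;      /* hit the end of undo, one record past our version */
        cur = UndoFetchRecord(prev);
    }

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company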
Hi All,

Please find the updated patch for the undo interface layer. I have rebased the undoprocessing patches on top of that, and there are also some changes in the undo storage patch for handling multi-log transactions, for which I am attaching a separate patch [0006-Defect-and-enhancement-in-multi-log-support.patch].

Mainly, the new patch includes:

1. Improvement in log-switch handling during recovery: earlier we were detecting a log switch during recovery by adding a separate WAL record, but in this version we detect it from the registered buffer in the WAL. By doing this we avoid the extra WAL, and this method is more in sync with how the undo log is identified in UndoLogAllocateInRecovery.

2. Improved mechanism of undo compression: instead of keeping the compression info in a global variable, we read it from the page into which we are inserting the undo record.

3. Improved README file.

Apart from this, I have worked on the review comments posted in this thread. I will reply to all those emails separately.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Jul 18, 2019 at 4:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 16, 2019 at 2:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Few comments on the new patch:
>
> 1.
> Additionally,
> +there is a mechanism for multi-insert, wherein multiple records are prepared
> +and inserted at a time.
>
> Which mechanism are you talking about here? By any chance is this
> related to some old code?

In the current code also, we have the option to prepare multiple records and insert them at once. I have enhanced the comments to make this more clear.

>
> 2.
> +Fetching and undo record
> +------------------------
> +To fetch an undo record, a caller must provide a valid undo record pointer.
> +Optionally, the caller can provide a callback function with the information of
> +the block and offset, which will help in faster retrieval of undo record,
> +otherwise, it has to traverse the undo-chain.
>
> I think this is out-dated information. You seem to forget updating
> README after latest changes in API.

Right, fixed.

>
> 3.
> + * The cid/xid/reloid/rmid information will be added in the undo record header
> + * in the following cases:
> + * a) The first undo record of the transaction.
> + * b) First undo record of the page.
> + * c) All subsequent record for the transaction which is not the first
> + * transaction on the page.
> + * Except above cases, If the rmid/reloid/xid/cid is same in the subsequent
> + * records this information will not be stored in the record, these information
> + * will be retrieved from the first undo record of that page.
> + * If any of the member rmid/reloid/xid/cid has changed, the changed information
> + * will be stored in the undo record and the remaining information will be
> + * retrieved from the first complete undo record of the page
> + */
> +UndoCompressionInfo undo_compression_info[UndoLogCategories];
>
> a. Do we want to compress fork_number also? It is an optional field
> and is only include when undo record is for not MAIN_FORKNUM. For
> zheap, this means it will never be included, but in future, it could
> be included for some other AM or some other use case. So, not sure if
> there is any benefit in compressing the same.

Yeah, so as of now I haven't compressed the forkno.

>
> b. cid/xid/reloid/rmid - I think it is better to write it as rmid,
> reloid, xid, cid in the same order as you declare them in
> UndoPackStage.
>
> c. Some minor corrections. /Except above/Except for above/; /, If
> the/, if the/; /is same/is the same/; /record, these
> information/record rather this information/
>
> d. I think there is no need to start the line "If any of the..." from
> a new line, it can be continued where the previous line ends. Also,
> at the end of that line, add a full stop.

These comments are removed in the new patch.

>
> 4.
> /*
> + * Copy the compression global compression info to our context before
> + * starting prepare because this value might get updated multiple time in
> + * case of multi-prepare but the global value should be updated only after
> + * we have successfully inserted the undo record.
> + */
>
> In the above comment, the first 'compression' is not required. /time/times/

These comments are changed now, as the design is changed.

>
> 5.
> +/*
> + * The below common information will be stored in the first undo record of the page.
> + * Every subsequent undo record will not store this information, if required this information
> + * will be retrieved from the first undo record of the page.
> + */ > +typedef struct UndoCompressionInfo > > The line length in the above comments exceeds the 80-char limit. You > might want to run pgindent to avoid such problems. Fixed. > > 6. > +/* > + * Exclude the common info in undo record flag and also set the compression > + * info in the context. > + * > > 'flag' seems to be a redundant word here? This comment is obsolete as per the new changes. > > 7. > +UndoSetCommonInfo(UndoCompressionInfo *compressioninfo, > + UnpackedUndoRecord *urec, UndoRecPtr urp, > + Buffer buffer) > +{ > + > + /* > + * If we have valid compression info and the for the same transaction and > + * the current undo record is on the same block as the last undo record > + * then exclude the common information which are same as first complete > + * record on the page. > + */ > + if (compressioninfo->valid && > + FullTransactionIdEquals(compressioninfo->fxid, urec->uur_fxid) && > + UndoRecPtrGetBlockNum(urp) == UndoRecPtrGetBlockNum(lasturp)) > > Here the comment is just a verbal for of if-check. How about writing > it as: "Exclude the common information from the record which is same > as the first record on the page." Tried to improve this in the new code. > > 8. > UndoSetCommonInfo() > { > .. > if (compressioninfo->valid && > + FullTransactionIdEquals(compressioninfo->fxid, urec->uur_fxid) && > + UndoRecPtrGetBlockNum(urp) == UndoRecPtrGetBlockNum(lasturp)) > + { > + urec->uur_info &= ~UREC_INFO_XID; > + > + /* Don't include rmid if it's same. */ > + if (urec->uur_rmid == compressioninfo->rmid) > + urec->uur_info &= ~UREC_INFO_RMID; > + > + /* Don't include reloid if it's same. */ > + if (urec->uur_reloid == compressioninfo->reloid) > + urec->uur_info &= ~UREC_INFO_RELOID; > > In all the checks except for transaction id, urec's info is on the > left side. I think all the checks can be consistent. > > These are some of the things I noticed while skimming through this > patch. I will do some more detailed review later. > This code is changed now. Please see the latest patch at https://www.postgresql.org/message-id/CAFiTN-uf4Bh0FHwec%2BJGbiLq%2Bj00V92W162SLd_JVvwW-jwREg%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
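The reader side of the scheme discussed in items 7 and 8 is the mirror image: any field whose UREC_INFO_* bit is clear is filled in from the first complete record of the same page. A hedged sketch follows; UndoPageGetFirstCompleteRecord() and the CID flag/field names are assumptions, not the patch's actual API.

    /*
     * Sketch only: reconstruct fields that the writer omitted.  The helper
     * function and the CID names are illustrative.
     */
    static void
    UndoRecordFillCommonFields(UnpackedUndoRecord *urec, Page page)
    {
        UnpackedUndoRecord *first = UndoPageGetFirstCompleteRecord(page);

        if ((urec->uur_info & UREC_INFO_XID) == 0)
            urec->uur_fxid = first->uur_fxid;
        if ((urec->uur_info & UREC_INFO_RMID) == 0)
            urec->uur_rmid = first->uur_rmid;
        if ((urec->uur_info & UREC_INFO_RELOID) == 0)
            urec->uur_reloid = first->uur_reloid;
        if ((urec->uur_info & UREC_INFO_CID) == 0)     /* assumed flag name */
            urec->uur_cid = first->uur_cid;
    }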
On Fri, Jul 19, 2019 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > > I don't like the fact that undoaccess.c has a new global, > > > undo_compression_info. I haven't read the code thoroughly, but do we > > > really need that? I think it's never modified (so it could just be > > > declared const), > > > > Actually, this will get modified otherwise across undo record > > insertion how we will know what was the values of the common fields in > > the first record of the page. Another option could be that every time > > we insert the record, read the value from the first complete undo > > record on the page but that will be costly because for every new > > insertion we need to read the first undo record of the page. > > > > This information won't be shared across transactions, so can't we keep > it in top transaction's state? It seems to me that will be better > than to maintain it as a global state. As replied separately, during recovery we would not have the transaction state, so I have decided to read it from the first record on the page; please check the latest patch. > > Few more comments on this patch: > 1. > PrepareUndoInsert() > { > .. > + if (logswitched) > + { > .. > + } > + else > + { > .. > + resize = true; > .. > + } > + > .. > + > + do > + { > + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, rbm); > .. > + rbm = RBM_ZERO; > + cur_blk++; > + } while (cur_size < size); > + > + /* > + * Set/overwrite compression info if required and also exclude the common > + * fields from the undo record if possible. > + */ > + if (UndoSetCommonInfo(compression_info, urec, urecptr, > + context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf)) > + resize = true; > + > + if (resize) > + size = UndoRecordExpectedSize(urec) > > I see that in some cases where resize is possible are checked before > buffer allocation and some are after. Isn't it better to do all these > checks before buffer allocation? Also, isn't it better to even > compute changed size before buffer allocation as that might sometimes > help in lesser buffer allocations? Right, fixed. > > Can you find a better way to write > :context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf)? > It makes the line too long and difficult to understand. Check for > similar instances in the patch and if possible, change them as well. This code is gone. While replying I realised that I haven't scanned the complete code for such occurrences. I will work on that in the next version. > > 2. > +InsertPreparedUndo(UndoRecordInsertContext *context) > { > .. > /* > + * Try to insert the record into the current page. If it > + * doesn't succeed then recall the routine with the next page. > + */ > + InsertUndoData(&ucontext, page, starting_byte); > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + { > + MarkBufferDirty(buffer); > + break; > + } > + MarkBufferDirty(buffer); > .. > } > > Can't we call MarkBufferDirty(buffer) just before 'if' check? That > will avoid calling it twice. Done (the reshuffled loop is sketched after this message). > > 3. > + * Later, during insert phase we will write actual records into thse buffers. > + */ > +struct PreparedUndoBuffer > > /thse/these Done > > 4. > + /* > + * If we are writing first undo record for the page the we can set the > + * compression so that subsequent records from the same transaction can > + * avoid including common information in the undo records.
> + */ > + if (first_complete_undo) > > /page the we/page then we This code is gone > > 5. > PrepareUndoInsert() > { > .. > After > + * allocation We'll only advance by as many bytes as we turn out to need. > + */ > + UndoRecordSetInfo(urec); > > Change the beginning of comment as: "After allocation, we'll .." Done > > 6. > PrepareUndoInsert() > { > .. > * TODO: instead of storing this in the transaction header we can > + * have separate undo log switch header and store it there. > + */ > + prevlogurp = > + MakeUndoRecPtr(UndoRecPtrGetLogNo(prevlog_insert_urp), > + (UndoRecPtrGetOffset(prevlog_insert_urp) - prevlen)); > + > > I don't think this TODO is valid anymore because now the patch has a > separate log-switch header. Yup. Anyway now the log switch design is changed. > > 7. > /* > + * If undo log is switched then set the logswitch flag and also reset the > + * compression info because we can use same compression info for the new > + * undo log. > + */ > + if (UndoRecPtrIsValid(prevlog_xact_start)) > > /can/can't Right. But now compression code is changed so this comment does not exist. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
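As a footnote to review item 2 in the message above, the deduplicated loop body would look roughly like this, a fragment-level sketch based on the quoted code:

    /* Dirty the buffer exactly once, whether or not the record is done. */
    InsertUndoData(&ucontext, page, starting_byte);
    MarkBufferDirty(buffer);
    if (ucontext.stage == UNDO_PACK_STAGE_DONE)
        break;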
On Wed, Jul 24, 2019 at 11:28 AM Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > > Hi, > > I have stated review of > 0008-Provide-interfaces-to-store-and-fetch-undo-records.patch, here are few > quick comments. > > 1) README.undointerface should provide more information like API details or > the sequence in which API should get called. I have improved the readme where I am describing the more user specific details based on Robert's suggestions offlist. I think I need further improvement which can describe the order of api's to be called. Unfortunately that is not yet included in this patch set. > > 2) Information about the API's in the undoaccess.c file header block would > good. For reference please look at heapam.c. Done > > 3) typo > > + * Later, during insert phase we will write actual records into thse buffers. > + */ > > %s/thse/these Fixed > > 4) UndoRecordUpdateTransInfo() comments says that this must be called under > the critical section, but seems like undo_xlog_apply_progress() do call it > outside of critical section? Is there exception, then should add comments? > or Am I missing anything? During recovery, there is an exception but we can add comments for the same. I think I missed this in the latest patch, I will keep a note of it and will do this in the next version. > > > 5) In function UndoBlockGetFirstUndoRecord() below code: > > /* Calculate the size of the partial record. */ > partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) + > phdr->tuple_len + phdr->payload_len - > phdr->record_offset; > > can directly use UndoPagePartialRecSize(). This function is part of another patch in undoprocessing patch set > > 6) > > +static int > +UndoGetBufferSlot(UndoRecordInsertContext *context, > + RelFileNode rnode, > + BlockNumber blk, > + ReadBufferMode rbm) > +{ > + int i; > > In the above code variable "i" is mean "block index". It would be good > to give some valuable name to the variable, maybe "blockIndex" ? > Fixed > 7) > > * We will also keep a previous undo record pointer to the first and last undo > * record of the transaction in the previous log. The last undo record > * location is used find the previous undo record pointer during rollback. > > > %s/used fine/used to find Fixed > > 8) > > /* > * Defines the number of times we try to wait for rollback hash table to get > * initialized. After these many attempts it will return error and the user > * can retry the operation. > */ > #define ROLLBACK_HT_INIT_WAIT_TRY 60 > > %s/error/an error This is part of different patch in undoprocessing patch set > > 9) > > * we can get the exact size of partial record in this page. > */ > > %s/of partial/of the partial" This comment is removed in the latest code > > 10) > > * urecptr - current transaction's undo record pointer which need to be set in > * the previous transaction's header. > > %s/need/needs Done > > 11) > > /* > * If we are writing first undo record for the page the we can set the > * compression so that subsequent records from the same transaction can > * avoid including common information in the undo records. > */ > > > %s/the page the/the page then > > 12) > > /* > * If the transaction's undo records are split across the undo logs. So > * we need to update our own transaction header in the previous log. > */ > > double space between "to" and "update" Fixed > > 13) > > * The undo record should be freed by the caller by calling ReleaseUndoRecord. 
> * This function will old the pin on the buffer where we read the previous undo > * record so that when this function is called repeatedly with the same context > > %s/old/hold Fixed > > I will continue further review for the same patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
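Since the partial-record computation in item 5 above comes up again later in the thread, a worked example may help. The numbers below are invented; the formula follows UndoPagePartialRecSize() as quoted in the next message, where the extra sizeof(uint16) accounts for the trailing record-length word:

    /*
     * Example with made-up numbers: a record whose header is 24 bytes,
     * carrying a 100-byte payload and no tuple data, of which 60 bytes
     * were already written on the previous page.
     */
    Size header_size   = 24;    /* UndoRecordHeaderSize(phdr->uur_info) */
    Size tuple_len     = 0;     /* phdr->tuple_len */
    Size payload_len   = 100;   /* phdr->payload_len */
    Size record_offset = 60;    /* bytes stored on the previous page */

    Size partial = header_size + tuple_len + payload_len
                   + sizeof(uint16)         /* trailing record length */
                   - record_offset;         /* = 66 bytes on this page */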
On Tue, Jul 30, 2019 at 12:21 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > Hi Dilip, > > > commit 2f3c127b9e8bc7d27cf7adebff0a355684dfb94e > > Author: Dilip Kumar <dilipkumar@localhost.localdomain> > > Date: Thu May 2 11:28:13 2019 +0530 > > > > Provide interfaces to store and fetch undo records. > > +#include "commands/tablecmds.h" > +#include "storage/block.h" > +#include "storage/buf.h" > +#include "storage/buf_internals.h" > +#include "storage/bufmgr.h" > +#include "miscadmin.h" > > "miscadmin.h" comes before "storage...". Right, fixed. > > +/* > + * Compute the size of the partial record on the undo page. > + * > + * Compute the complete record size by uur_info and variable field length > + * stored in the page header and then subtract the offset of the record so that > + * we can get the exact size of partial record in this page. > + */ > +static inline Size > +UndoPagePartialRecSize(UndoPageHeader phdr) > +{ > + Size size; > > We decided to use size_t everywhere in new code (except perhaps > functions conforming to function pointer types that historically use > Size in their type). > > + /* > + * Compute the header size from undo record uur_info, stored in the page > + * header. > + */ > + size = UndoRecordHeaderSize(phdr->uur_info); > + > + /* > + * Add length of the variable part and undo length. Now, we know the > + * complete length of the undo record. > + */ > + size += phdr->tuple_len + phdr->payload_len + sizeof(uint16); > + > + /* > + * Subtract the size which is stored in the previous page to get the > + * partial record size stored in this page. > + */ > + size -= phdr->record_offset; > + > + return size; > > This is probably a stupid question but why isn't it enough to just > store the offset of the first record that begins on this page, or 0 > for none yet? Why do we need to worry about the partial record's > payload etc? Right, as this patch stand it would be enough to just store the offset where the first complete record start. But for undo page consistency checker we need to mask the CID field in the partial record as well. So we need to know how many bytes of the partial records are already written in the previous page (phdr->record_offset), what all fields are there in the partial record (uur_info) and the variable part to compute the next record offset. Currently, I have improved it by storing the complete record length instead of payload and tuple length but this we can further improve by storing the next record offset directly that will avoid some computation. I haven't worked on undo consistency patch much in this version so I will analyze this further in the next version. > > +UndoRecPtr > +PrepareUndoInsert(UndoRecordInsertContext *context, > + UnpackedUndoRecord *urec, > + Oid dbid) > +{ > ... > + /* Fetch compression info for the transaction. */ > + compression_info = GetTopTransactionUndoCompressionInfo(category); > > How can this work correctly in recovery? [Edit: it doesn't, as you > just pointed out] > > I had started reviewing an older version of your patch (the version > that had made it as far as the undoprocessing branch as of recently), > before I had the bright idea to look for a newer version. I was going > to object to the global variable you had there in the earlier version. > It seems to me that you have to be able to reproduce the exact same > compression in recovery that you produced as "do" time, no? How can > TopTranasctionStateData be the right place for this in recovery? 
> > One data structure that could perhaps hold this would be > UndoLogTableEntry (the per-backend cache, indexed by undo log number, > with pretty fast lookups; used for things like > UndoLogNumberGetCategory()). As long as you never want to have > inter-transaction compression, that should have the right scope to > give recovery per-undo log tracking. If you ever wanted to do > compression between transactions too, maybe UndoLogSlot could work, > but that'd have more complications. Currently, I have read it from the first record on the page. > > +/* > + * Read undo records of the transaction in bulk > + * > + * Read undo records between from_urecptr and to_urecptr until we exhaust the > + * the memory size specified by undo_apply_size. If we could not read all the > + * records till to_urecptr then the caller should consume current set > of records > + * and call this function again. > + * > + * from_urecptr - Where to start fetching the undo records. > If we can not > + * read all the records because of memory limit then this > + * will be set to the previous undo record > pointer from where > + * we need to start fetching on next call. > Otherwise it will > + * be set to InvalidUndoRecPtr. > + * to_urecptr - Last undo record pointer to be fetched. > + * undo_apply_size - Memory segment limit to collect undo records. > + * nrecords - Number of undo records read. > + * one_page - Caller is applying undo only for one block not for > + * complete transaction. If this is set true then instead > + * of following transaction undo chain using > prevlen we will > + * follow the block prev chain of the block so that we can > + * avoid reading many unnecessary undo records of the > + * transaction. > + */ > +UndoRecInfo * > +UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr, > + int undo_apply_size, int *nrecords, bool one_page) > > Could you please make it clear in comments and assertions what the > relation between from_urecptr and to_urecptr is and what they mean > (they must be in the same undo log, one must be <= the other, both > point to the *start* of a record, so it's not the same as the total > range of undo)? I have enhanced the comments for the same. > > undo_apply_size is not a good parameter name, because the function is > useful for things other than applying records -- like the > undoinspect() extension (or some better version of that), for example. > Maybe max_result_size or something like that? Changed. > > +{ > ... > + /* Allocate memory for next undo record. */ > + uur = palloc0(sizeof(UnpackedUndoRecord)); > ... > + > + size = UnpackedUndoRecordSize(uur); > + total_size += size; > > I see, so the unpacked records are still allocated one at a time. I > guess that's OK for now. From some earlier discussion I had been > expecting an arrangement where the actual records were laid out > contiguously with their subcomponents (things they point to in > palloc()'d memory) nearby. In an earlier version I was allocating one single chunk of memory and then packing the records into it. But there we needed to take care of the alignment of each unpacked undo record so that we could access them directly, so we have changed it this way. > > +static uint16 > +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > + UndoLogCategory category) > +{ > ... > + char prevlen[2]; > ... > + prev_rec_len = *(uint16 *) (prevlen); > > I don't think that's OK, and might crash on a non-Intel system. How > about using a union of uint16 and char[2]?
changed > > + /* Copy undo record transaction header if it is present. */ > + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0) > + memcpy(&ucontext->urec_txn, uur->uur_txn, SizeOfUndoRecordTransaction); > > I was wondering why you don't use D = S instead of mempcy(&D, &S, > size) wherever you can, until I noticed you use these SizeOfXXX macros > that don't include trailing padding from structs, and that's also how > you allocate objects. Hmm. So if I were to complain about you not > using plain old assignment whenever you can, I'd also have to complain > about that. Fixed > > I think that that technique of defining a SizeOfXXX macro that > excludes trailing bytes makes sense for writing into WAL or undo log > buffers using mempcy(). I'm not sure it makes sense for palloc() and > copying into typed variables like you're doing here and I think I'd > prefer the notational simplicity of using the (very humble) type > system facilities C gives us. (Some memory checker might not like it > you palloc(the shorter size) and then use = if the compiler chooses to > implement it as memcpy sizeof().) > > +/* > + * The below common information will be stored in the first undo record of the > + * page. Every subsequent undo record will not store this information, if > + * required this information will be retrieved from the first undo > record of the > + * page. > + */ > +typedef struct UndoCompressionInfo > > Shouldn't this say "Every subsequent record will not store this > information *if it's the same as the relevant fields in the first > record*"? > > +#define UREC_INFO_TRANSACTION 0x001 > +#define UREC_INFO_RMID 0x002 > +#define UREC_INFO_RELOID 0x004 > +#define UREC_INFO_XID 0x008 > > Should we call this UREC_INFO_FXID, since it refers to a FullTransactionId? Done > > +/* > + * Every undo record begins with an UndoRecordHeader structure, which is > + * followed by the additional structures indicated by the contents of > + * urec_info. All structures are packed into the alignment without padding > + * bytes, and the undo record itself need not be aligned either, so care > + * must be taken when reading the header. > + */ > > I think you mean "All structures are packed into undo pages without > considering alignment and without trailing padding bytes"? This comes > from the definition of the SizeOfXXX macros IIUC. There might still > be padding between members of some of those structs, no? Like this > one, that has the second member at offset 2 on my system: Done > > +typedef struct UndoRecordHeader > +{ > + uint8 urec_type; /* record type code */ > + uint16 urec_info; /* flag bits */ > +} UndoRecordHeader; > + > +#define SizeOfUndoRecordHeader \ > + (offsetof(UndoRecordHeader, urec_info) + sizeof(uint16)) > > +/* > + * Information for a transaction to which this undo belongs. This > + * also stores the dbid and the progress of the undo apply during rollback. > + */ > +typedef struct UndoRecordTransaction > +{ > + /* > + * Undo block number where we need to start reading the undo for applying > + * the undo action. InvalidBlockNumber means undo applying hasn't > + * started for the transaction and MaxBlockNumber mean undo completely > + * applied. And, any other block number means we have applied partial undo > + * so next we can start from this block. 
> + */ > + BlockNumber urec_progress; > + Oid urec_dbid; /* database id */ > + UndoRecPtr urec_next; /* urec pointer of the next transaction */ > +} UndoRecordTransaction; > > I propose that we rename this to UndoRecordGroupHeader (or something > like that... maybe "Set", but we also use "set" as a verb in various > relevant function names): I have changed this > > 1. We'll also use these for the new "shared" records we recently > invented that don't relate to a transaction. This is really about > defining the unit of discarding; we throw away the whole set of > records at once, which is why it's basically about proividing a space > for "urec_next". > > 2. Though it also holds rollback progress information, which is a > transaction-specific concept, there can be more than one of these sets > of records for a single transaction anyway. A single transaction can > write undo stuff in more than one undo log (different categories > perm/temp/unlogged/shared and also due to log switching when they are > full). > > So really it's just a header for an arbitrary set of records, used to > track when and how to discard them. > > If you agree with that idea, perhaps urec_next should become something > like urec_next_group, too. "next" is a bit vague, especially for > something as untyped as UndoRecPtr: someone might think it points to > the next record. Changed > > More soon. the latest patch at https://www.postgresql.org/message-id/CAFiTN-uf4Bh0FHwec%2BJGbiLq%2Bj00V92W162SLd_JVvwW-jwREg%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
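To show how the UndoBulkFetchRecord() contract quoted earlier in this message is meant to be consumed, here is a hedged caller-side sketch. start_urecptr, the 64MB cap, and process_undo_records() are placeholders; per the discussion above the one_page flag was later removed, so the false argument reflects the signature as quoted here.

    /*
     * Sketch only: fetch a transaction's undo in memory-bounded batches.
     * UndoBulkFetchRecord() rewinds from_urecptr to where the next call
     * should resume, or sets it to InvalidUndoRecPtr when done.
     */
    UndoRecPtr  from_urecptr = start_urecptr;   /* last record, working back */
    int         nrecords;

    while (UndoRecPtrIsValid(from_urecptr))
    {
        UndoRecInfo *records;

        records = UndoBulkFetchRecord(&from_urecptr, to_urecptr,
                                      64 * 1024 * 1024,  /* result size cap */
                                      &nrecords, false);
        process_undo_records(records, nrecords);  /* caller-defined */
    }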
On Tue, Jul 30, 2019 at 1:32 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > Amit, short note: The patches aren't attached in patch order. Obviously > a miniscule thing, but still nicer if that's not the case. > > Dilip, this also contains the start of a review for the undo record > interface further down. > > Subject: [PATCH 07/14] Provide interfaces to store and fetch undo records. > > > > Add the capability to form undo records and store them in undo logs. We > > also provide the capability to fetch the undo records. This layer will use > > undo-log-storage to reserve the space for the undo records and buffer > > management routines to write and read the undo records. > > > > > Undo records are stored in sequential order in the undo log. > > Maybe "In each und log undo records are stored in sequential order."? Done > > > > > +++ b/src/backend/access/undo/README.undointerface > > @@ -0,0 +1,29 @@ > > +Undo record interface layer > > +--------------------------- > > +This is the next layer which sits on top of the undo log storage, which will > > +provide an interface for prepare, insert, or fetch the undo records. This > > +layer will use undo-log-storage to reserve the space for the undo records > > +and buffer management routine to write and read the undo records. > > The reference to "undo log storage" kinda seems like a reference into > nothingness... Changed > > > > +Writing an undo record > > +---------------------- > > +To prepare an undo record, first, it will allocate required space using > > +undo log storage module. Next, it will pin and lock the required buffers and > > +return an undo record pointer where it will insert the record. Finally, it > > +calls the Insert routine for final insertion of prepared record. Additionally, > > +there is a mechanism for multi-insert, wherein multiple records are prepared > > +and inserted at a time. > > I'm not sure whta this is telling me. Who is "it"? > > To me the filename ("interface"), and the title of this section, > suggests this provides documentation on how to write code to insert undo > records. But I don't think this does. I have improved it > > > > +Fetching and undo record > > +------------------------ > > +To fetch an undo record, a caller must provide a valid undo record pointer. > > +Optionally, the caller can provide a callback function with the information of > > +the block and offset, which will help in faster retrieval of undo record, > > +otherwise, it has to traverse the undo-chain. > > > +There is also an interface to bulk fetch the undo records. Where the caller > > +can provide a TO and FROM undo record pointer and the memory limit for storing > > +the undo records. This API will return all the undo record between FROM and TO > > +undo record pointers if they can fit into provided memory limit otherwise, it > > +return whatever can fit into the memory limit. And, the caller can call it > > +repeatedly until it fetches all the records. > > There's a lot of terminology in this file that's not been introduced. I > think this needs to be greatly expanded and restructured to allow people > unfamiliar with the code to benefit. I have improved it, but I think still I need to work on it to introduce the terminology used. > > > > +/*------------------------------------------------------------------------- > > + * > > + * undoaccess.c > > + * entry points for inserting/fetching undo records > > > + * NOTES: > > + * Undo record layout: > > + * > > + * Undo records are stored in sequential order in the undo log. 
Each undo > > + * record consists of a variable length header, tuple data, and payload > > + * information. > > Is that actually true? There's records without tuples, no? Right, changed this > > > The first undo record of each transaction contains a > > + * transaction header that points to the next transaction's start > > header. > > Seems like this needs to reference different persistence levels, > otherwise it seems misleading, given there can be multiple first records > in multiple undo logs? I have changed it. > > > > + * This allows us to discard the entire transaction's log at one-shot > > rather > > s/at/in/ Fixed > > > + * than record-by-record. The callers are not aware of transaction header, > > s/of/of the/ Fixed > > > + * this is entirely maintained and used by undo record layer. See > > s/this/it/ Fixed > > > + * undorecord.h for detailed information about undo record header. > > s/undo record/the undo record/ Fixed > > > I think at the very least there's explanations missing for: > - what is the locking protocol for multiple buffers > - what are the contexts for insertion > - what phases an undo insertion happens in > - updating previous records in general > - what "packing" actually is > > > > + > > +/* Prototypes for static functions. */ > > > Don't think we commonly include that... Changed, removed all unwanted prototypes > > > +static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec, > > + UndoRecPtr urp, RelFileNode rnode, > > + UndoPersistence persistence, > > + Buffer *prevbuf); > > +static int UndoRecordPrepareTransInfo(UndoRecordInsertContext *context, > > + UndoRecPtr xact_urp, int size, int offset); > > +static void UndoRecordUpdateTransInfo(UndoRecordInsertContext *context, > > + int idx); > > +static void UndoRecordPrepareUpdateNext(UndoRecordInsertContext *context, > > + UndoRecPtr urecptr, UndoRecPtr xact_urp); > > +static int UndoGetBufferSlot(UndoRecordInsertContext *context, > > + RelFileNode rnode, BlockNumber blk, > > + ReadBufferMode rbm); > > +static uint16 UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > > + UndoPersistence upersistence); > > + > > +/* > > + * Structure to hold the prepared undo information. > > + */ > > +struct PreparedUndoSpace > > +{ > > + UndoRecPtr urp; /* undo record pointer */ > > + UnpackedUndoRecord *urec; /* undo record */ > > + uint16 size; /* undo record size */ > > + int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array > > + * index*/ > > +}; > > + > > +/* > > + * This holds undo buffers information required for PreparedUndoSpace during > > + * prepare undo time. Basically, during prepare time which is called outside > > + * the critical section we will acquire all necessary undo buffers pin and lock. > > + * Later, during insert phase we will write actual records into thse buffers. > > + */ > > +struct PreparedUndoBuffer > > +{ > > + UndoLogNumber logno; /* Undo log number */ > > + BlockNumber blk; /* block number */ > > + Buffer buf; /* buffer allocated for the block */ > > + bool zero; /* new block full of zeroes */ > > +}; > > Most files define datatypes before function prototypes, because > functions may reference the datatypes. done > > > > +/* > > + * Prepare to update the transaction header > > + * > > + * It's a helper function for PrepareUpdateNext and > > + * PrepareUpdateUndoActionProgress > > This doesn't really explain much. PrepareUpdateUndoActionProgress > doesnt' exist. I assume it's UndoRecordPrepareApplyProgress from 0012? 
Enhanced the comments. > > > > + * xact_urp - undo record pointer to be updated. > > + * size - number of bytes to be updated. > > + * offset - offset in undo record where to start update. > > + */ > > These comments seem redundant with the parameter names. Fixed. > > > > +static int > > +UndoRecordPrepareTransInfo(UndoRecordInsertContext *context, > > + UndoRecPtr xact_urp, int size, int offset) > > +{ > > + BlockNumber cur_blk; > > + RelFileNode rnode; > > + int starting_byte; > > + int bufidx; > > + int index = 0; > > + int remaining_bytes; > > + XactUndoRecordInfo *xact_info; > > + > > + xact_info = &context->xact_urec_info[context->nxact_urec_info]; > > + > > + UndoRecPtrAssignRelFileNode(rnode, xact_urp); > > + cur_blk = UndoRecPtrGetBlockNum(xact_urp); > > + starting_byte = UndoRecPtrGetPageOffset(xact_urp); > > + > > + /* Remaining bytes on the current block. */ > > + remaining_bytes = BLCKSZ - starting_byte; > > + > > + /* > > + * Is there some byte of the urec_next on the current block, if not then > > + * start from the next block. > > + */ > > This comment needs rephrasing. Done > > > > + /* Loop until we have fetched all the buffers in which we need to write. */ > > + while (size > 0) > > + { > > + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, RBM_NORMAL); > > + xact_info->idx_undo_buffers[index++] = bufidx; > > + size -= (BLCKSZ - starting_byte); > > + starting_byte = UndoLogBlockHeaderSize; > > + cur_blk++; > > + } > > So, this locks a very large number of undo buffers at the same time, do > I see that correctly? What guarantees that there are no deadlocks due > to multiple buffers locked at the same time (I guess the order inside > the log)? What guarantees that this is a small enough number that we can > even lock all of them at the same time? I think we are locking them in block order, and that should avoid deadlocks. I have explained this in the comments. > > Why do we need to lock all of them at the same time? That's not clear to > me. Because this is called outside the critical section, we keep locks on all the buffers that we want to update inside the critical section for a single WAL record. > > Also, why do we need code to lock an unbounded number here? It seems > hard to imagine we'd ever want to update more than something around 8 > bytes? Shouldn't that at the most require two buffers? Right, it should lock at most 2 buffers. I have now added an assert for that. Basically, it can lock either 1 or 2 buffers, so I am not sure what the best condition to break the loop is. I guess our target is to write 8 bytes, so the breaking condition must be the number of bytes. I agree that we should never go beyond two buffers, but for that we can add an assert. Do you have another opinion on this? > > > > +/* > > + * Prepare to update the previous transaction's next undo pointer. > > + * > > + * We want to update the next transaction pointer in the previous transaction's > > + * header (first undo record of the transaction). In prepare phase we will > > + * unpack that record and lock the necessary buffers which we are going to > > + * overwrite and store the unpacked undo record in the context. Later, > > + * UndoRecordUpdateTransInfo will overwrite the undo record. > > + * > > + * xact_urp - undo record pointer of the previous transaction's header > > + * urecptr - current transaction's undo record pointer which need to be set in > > + * the previous transaction's header.
> > + */ > > +static void > > +UndoRecordPrepareUpdateNext(UndoRecordInsertContext *context, > > + UndoRecPtr urecptr, UndoRecPtr xact_urp) > > That name imo is confusing - it's not clear that it's not actually about > the next record or such. I agree. I think I will think about what to name it. I am planning to unify 2 function UndoRecordPrepareUpdateNext and PrepareUpdateUndoActionProgress then we can directly name it PrepareUndoRecordUpdate. But for that, I need to get the progress update code in my patch. > > > > +{ > > + UndoLogSlot *slot; > > + int index = 0; > > + int offset; > > + > > + /* > > + * The absence of previous transaction's undo indicate that this backend > > *indicates > Done > > > + /* > > + * Acquire the discard lock before reading the undo record so that discard > > + * worker doesn't remove the record while we are in process of reading it. > > + */ > > *the discard worker Done > > > > + LWLockAcquire(&slot->discard_update_lock, LW_SHARED); > > + /* Check if it is already discarded. */ > > + if (UndoLogIsDiscarded(xact_urp)) > > + { > > + /* Release lock and return. */ > > + LWLockRelease(&slot->discard_update_lock); > > + return; > > + } > > Ho, hum. I don't quite remember what we decided in the discussion about > not having to use the discard lock for this purpose. I think we haven't concluded an alternative solution for this and planned to keep it as is for now. Please correct me if anyone else has a different opinion. > > > > + /* Compute the offset of the uur_next in the undo record. */ > > + offset = SizeOfUndoRecordHeader + > > + offsetof(UndoRecordTransaction, urec_next); > > + > > + index = UndoRecordPrepareTransInfo(context, xact_urp, > > + sizeof(UndoRecPtr), offset); > > + /* > > + * Set the next pointer in xact_urec_info, this will be overwritten in > > + * actual undo record during update phase. > > + */ > > + context->xact_urec_info[index].next = urecptr; > > What does "this will be overwritten mean"? It sounds like "context->xact_urec_info[index].next" > would be overwritten, but that can't be true. > > > > + /* We can now release the discard lock as we have read the undo record. */ > > + LWLockRelease(&slot->discard_update_lock); > > +} > > Hm. Because you expect it to be blocked behind the content lwlocks for > the buffers? Yes, I added comments. > > > > +/* > > + * Overwrite the first undo record of the previous transaction to update its > > + * next pointer. > > + * > > + * This will insert the already prepared record by UndoRecordPrepareTransInfo. > > It doesn't actually appear to insert any records. At least not a record > in the way the rest of the file uses that term? I think this was old comments. Fixed it. > > > > + * This must be called under the critical section. > > s/under the/in a/ I think I missed in my last patch. Will fix in next version. > > Think that should be asserted. Added the assert. > > > > + /* > > + * Start writing directly from the write offset calculated during prepare > > + * phase. And, loop until we write required bytes. > > + */ > > Why do we do offset calculations multiple times? Seems like all the > offsets, and the split, should be computed in exactly one place. Sorry, I did not understand this, we are calculating the offset in the prepare phase. Do you want to point out something else? > > > > +/* > > + * Find the block number in undo buffer array > > + * > > + * If it is present then just return its index otherwise search the buffer and > > + * insert an entry and lock the buffer in exclusive mode. 
> > + * > > + * Undo log insertions are append-only. If the caller is writing new data > > + * that begins exactly at the beginning of a page, then there cannot be any > > + * useful data after that point. In that case RBM_ZERO can be passed in as > > + * rbm so that we can skip a useless read of a disk block. In all other > > + * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't > > + * happen to be already in the buffer pool. > > + */ > > +static int > > +UndoGetBufferSlot(UndoRecordInsertContext *context, > > + RelFileNode rnode, > > + BlockNumber blk, > > + ReadBufferMode rbm) > > +{ > > + int i; > > + Buffer buffer; > > + XLogRedoAction action = BLK_NEEDS_REDO; > > + PreparedUndoBuffer *prepared_buffer; > > + UndoPersistence persistence = context->alloc_context.persistence; > > + > > + /* Don't do anything, if we already have a buffer pinned for the block. */ > > As the code stands, it's locked, not just pinned. Changed > > > > + for (i = 0; i < context->nprepared_undo_buffer; i++) > > + { > > How large do we expect this to get at most? > In BeginUndoRecordInsert we are computing it + /* Compute number of buffers. */ + nbuffers = (nprepared + MAX_UNDO_UPDATE_INFO) * MAX_BUFFER_PER_UNDO; > > > + /* > > + * We did not find the block so allocate the buffer and insert into the > > + * undo buffer array. > > + */ > > + if (InRecovery) > > + action = XLogReadBufferForRedoBlock(context->alloc_context.xlog_record, > > + SMGR_UNDO, > > + rnode, > > + UndoLogForkNum, > > + blk, > > + rbm, > > + false, > > + &buffer); > > Why is not locking the buffer correct here? Can't there be concurrent > reads during hot standby? because XLogReadBufferForRedoBlock is locking it internally. I have added this in coment in new patch. > > > > +/* > > + * This function must be called before all the undo records which are going to > > + * get inserted under a single WAL record. > > How can a function be called "before all the undo records"? "before all the undo records which are getting inserted under single WAL" because it will set the prepare limit and allocate appropriate memory for that. So I am not sure what is your point here? why can't we call it before all the undo record we are inserting? > > > + * nprepared - This defines the max number of undo records that can be > > + * prepared before inserting them. > > + */ > > +void > > +BeginUndoRecordInsert(UndoRecordInsertContext *context, > > + UndoPersistence persistence, > > + int nprepared, > > + XLogReaderState *xlog_record) > > There definitely needs to be explanation about xlog_record. But also > about memory management etc. Looks like one e.g. can't call this from a > short lived memory context. I have added coments for this. > > > > +/* > > + * Call PrepareUndoInsert to tell the undo subsystem about the undo record you > > + * intended to insert. Upon return, the necessary undo buffers are pinned and > > + * locked. > > Again, how is deadlocking / max number of buffers handled, and why do > they all need to be locked at the same time? > > > > + /* > > + * We don't yet know if this record needs a transaction header (ie is the > > + * first undo record for a given transaction in a given undo log), because > > + * you can only find out by allocating. We'll resolve this circularity by > > + * allocating enough space for a transaction header. We'll only advance > > + * by as many bytes as we turn out to need. > > + */ > > Why can we only find this out by allocating? This seems like an API > deficiency of the storage layer to me. 
The information is in the und log > slot's metadata, no? I agree with this. I think if Thomas agree we can provide an API in undo log which can provide us this information before we do the actual allocation. > > > > + urec->uur_next = InvalidUndoRecPtr; > > + UndoRecordSetInfo(urec); > > + urec->uur_info |= UREC_INFO_TRANSACTION; > > + urec->uur_info |= UREC_INFO_LOGSWITCH; > > + size = UndoRecordExpectedSize(urec); > > + > > + /* Allocate space for the record. */ > > + if (InRecovery) > > + { > > + /* > > + * We'll figure out where the space needs to be allocated by > > + * inspecting the xlog_record. > > + */ > > + Assert(context->alloc_context.persistence == UNDO_PERMANENT); > > + urecptr = UndoLogAllocateInRecovery(&context->alloc_context, > > + XidFromFullTransactionId(txid), > > + size, > > + &need_xact_header, > > + &last_xact_start, > > + &prevlog_xact_start, > > + &prevlogurp); > > + } > > + else > > + { > > + /* Allocate space for writing the undo record. */ > > That's basically the same comment as before the if. Removed > > > > + urecptr = UndoLogAllocate(&context->alloc_context, > > + size, > > + &need_xact_header, &last_xact_start, > > + &prevlog_xact_start, &prevlog_insert_urp); > > + > > + /* > > + * If prevlog_xact_start is a valid undo record pointer that means > > + * this transaction's undo records are split across undo logs. > > + */ > > + if (UndoRecPtrIsValid(prevlog_xact_start)) > > + { > > + uint16 prevlen; > > + > > + /* > > + * If undo log is switch during transaction then we must get a > > "is switch" is right. This code is removed now. > > > +/* > > + * Insert a previously-prepared undo records. > > s/a// Fixed > > > More tomorrow. > refer the latest patch at https://www.postgresql.org/message-id/CAFiTN-uf4Bh0FHwec%2BJGbiLq%2Bj00V92W162SLd_JVvwW-jwREg%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
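Pulling together the functions discussed in this review, the intended caller sequence appears to be roughly the following. This is a sketch under assumptions: BeginUndoRecordInsert(), PrepareUndoInsert() and InsertPreparedUndo() are the names quoted in this thread, while FinishUndoRecordInsert() is an invented name for the final release step, which the excerpts above do not show.

    UndoRecordInsertContext context;
    UnpackedUndoRecord urec = {0};
    UndoRecPtr  urecptr;

    /* ... fill in urec: uur_rmid, uur_reloid, payload, etc. ... */

    /* 1. Set up the context; we will prepare a single record. */
    BeginUndoRecordInsert(&context, UNDO_PERMANENT, 1, NULL);

    /*
     * 2. Reserve undo space and pin/lock the needed undo buffers,
     *    outside any critical section.
     */
    urecptr = PrepareUndoInsert(&context, &urec, MyDatabaseId);

    /* 3. Write the record and the covering WAL atomically. */
    START_CRIT_SECTION();
    InsertPreparedUndo(&context);
    /* ... XLogInsert() of the WAL record describing this change ... */
    END_CRIT_SECTION();

    /* 4. Release the buffers (hypothetical name for the cleanup step). */
    FinishUndoRecordInsert(&context);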
On Fri, Aug 9, 2019 at 1:57 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Aug 8, 2019 at 9:31 AM Andres Freund <andres@anarazel.de> wrote: > > I know that Robert is working on a patch that revises the undo request > > layer somewhat, it's possible that this is best discussed afterwards. > > Here's what I have at the moment. This is not by any means a complete > replacement for Amit's undo worker machinery, but it is a significant > redesign (and I believe a significant improvement to) the queue > management stuff from Amit's patch. I wrote this pretty quickly, so > while it passes simple testing, it probably has a number of bugs, and > to actually use it, it would need to be integrated with xact.c; right > now it's just a standalone module that doesn't do anything except let > itself be tested. > > Some of the ways it is different from Amit's patches include: > > * It uses RBTree rather than binaryheap, so when we look ahead, we > look ahead in the right order. > > * There's no limit to the lookahead distance; when looking ahead, it > will search the entirety of all 3 RBTrees for an entry from the right > database. > > * It doesn't have a separate hash table keyed by XID. I didn't find > that necessary. > > * It's better-isolated, as you can see from the fact that I've > included a test module that tests this code without actually ever > putting an UndoRequestManager in shared memory. I would've liked to > expand this test module, but I don't have time to do that today and > felt it better to get this much sent out. > > * It has a lot of comments explaining the design and how it's intended > to integrate with the rest of the system. > > Broadly, my vision for how this would get used is: > > - Create an UndoRecordManager in shared memory. > - Before a transaction first attaches to a permanent or unlogged undo > log, xact.c would call RegisterUndoRequest(); thereafter, xact.c would > store a pointer to the UndoRecord for the lifetime of the toplevel > transaction. So, for a top-level transaction's rollback, we can get the start and end locations directly from the UndoRequest *. But what should we do for sub-transactions (rollback to savepoint)? One related point is that we also need information about last_log_start_undo_location to update the undo apply progress (the basic idea is that if the transaction's undo is spread across multiple logs, we update the progress in each of the logs). We can remember that in the transaction state or in the UndoRequest *. Any suggestions? > - Immediately after attaching to a permanent or unlogged undo log, > xact.c would call UndoRequestSetLocation. > - xact.c would track the number of bytes of permanent and unlogged > undo records the transaction generates. If the transaction goes onto > abort, it reports these by calling FinalizeUndoRequest. > - If the transaction commits, it doesn't need that information, but > does need to call UnregisterUndoRequest() as a post-commit step in > CommitTransaction(). > IIUC, for each transaction, we have to take a lock the first time it attaches to a log and then the same lock at commit time. The work done under the lock is small, but still, can't this cause contention? It seems to me this is similar to what we saw in ProcArrayLock, where the work under the lock was a few instructions, but each backend acquiring and releasing the lock at commit time caused a bottleneck.
It might be that for some reason this won't matter in a similar way, in which case we can find out after integrating it with the other patches from the undo processing machinery and rebasing the zheap branch over it. How will the computation of oldestXidHavingUnappliedUndo work? We can probably check the fxid queue and the error queue to get that value. However, I am not sure that is sufficient, because if we perform the request in the foreground, it won't be present in the queues. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
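For readers following along, the UndoRequest life cycle Robert describes maps onto xact.c roughly as below. The function names come from his message; the arguments and surrounding variables are invented for illustration, since the signatures are not shown in this thread.

    /*
     * Sketch only.  Before the xact first attaches to a permanent or
     * unlogged undo log; xact.c keeps req for the lifetime of the
     * toplevel transaction.
     */
    UndoRequest *req = RegisterUndoRequest(undo_request_manager,
                                           MyDatabaseId);     /* assumed args */

    /* Immediately after attaching to an undo log. */
    UndoRequestSetLocation(req, start_urec_ptr);              /* assumed args */

    if (aborted)
    {
        /*
         * Report how much permanent/unlogged undo was written, so the
         * request can be executed now or queued for an undo worker.
         */
        FinalizeUndoRequest(req, permanent_undo_bytes, unlogged_undo_bytes);
    }
    else
    {
        /* Post-commit step in CommitTransaction(). */
        UnregisterUndoRequest(req);
    }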
On Thu, Aug 8, 2019 at 7:01 PM Andres Freund <andres@anarazel.de> wrote: > On 2019-08-07 14:50:17 +0530, Amit Kapila wrote: > > On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote: > > > On 2019-08-05 11:29:34 -0700, Andres Freund wrote: > > > > > typedef struct TwoPhaseFileHeader > > > > { > > > > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader > > > > uint16 gidlen; /* length of the GID - GID follows the header */ > > > > XLogRecPtr origin_lsn; /* lsn of this record at origin node */ > > > > TimestampTz origin_timestamp; /* time of prepare at origin node */ > > > > + > > > > + /* > > > > + * We need the locations of the start and end undo record pointers when > > > > + * rollbacks are to be performed for prepared transactions using undo-based > > > > + * relations. We need to store this information in the file as the user > > > > + * might rollback the prepared transaction after recovery and for that we > > > > + * need its start and end undo locations. > > > > + */ > > > > + UndoRecPtr start_urec_ptr[UndoLogCategories]; > > > > + UndoRecPtr end_urec_ptr[UndoLogCategories]; > > > > } TwoPhaseFileHeader; > > > > > > Why do we not need that knowledge for undo processing of a non-prepared > > > transaction? > > > The non-prepared transaction also needs to be aware of that. It is > > stored in TransactionStateData. I am not sure if I understand your > > question here. > > My concern is that I think it's fairly ugly to store data like this in > the 2pc state file. And it's not an insubstantial amount of additional > data either, compared to the current size, even when no undo is in > use. There's a difference between an unused feature increasing backend > local memory and increasing the size of WAL logged data. Obviously it's > not by a huge amount, but still. It also just feels wrong to me. > > We don't need the UndoRecPtr's when recovering from a crash/restart to > process undo. Now we obviously don't want to unnecessarily search for > data that is expensive to gather, which is a good reason for keeping > track of this data. But I do wonder if this is the right approach. > > I know that Robert is working on a patch that revises the undo request > layer somewhat, it's possible that this is best discussed afterwards. > Okay, we have started working on integrating with Robert's patch. I think not only this but many of the other things will also change. So, I will respond to other comments after integrating with Robert's patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Aug 5, 2019 at 11:59 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > (as I was out of context due to dealing with bugs, I've switched to > looking at the current zheap/undoprocessing branch.) > > On 2019-07-30 01:02:20 -0700, Andres Freund wrote: > > +/* > > + * Insert a previously-prepared undo records. > > + * > > + * This function will write the actual undo record into the buffers which are > > + * already pinned and locked in PreparedUndoInsert, and mark them dirty. This > > + * step should be performed inside a critical section. > > + */ > > Again, I think it's not ok to just assume you can lock an essentially > unbounded number of buffers. This seems almost guaranteed to result in > deadlocks. And there's limits on how many lwlocks one can hold etc. I think for controlling that we need to put a limit on the max prepared undo. I am not sure of any other way of limiting the number of buffers, because we must lock all the buffers in which we are going to insert undo records under one WAL-logged operation. > > As far as I can tell there's simply no deadlock avoidance scheme in use > here *at all*? I must be missing something. We are always locking buffers in block order, so I am not sure how it can deadlock. Am I missing something? > > > > + /* Main loop for writing the undo record. */ > > + do > > + { > > I'd prefer this to not be a do{} while(true) loop - as written I need to > read to the end to see what the condition is. I don't think we have any > loops like that in the code. Right, changed. > > > > + /* > > + * During recovery, there might be some blocks which are already > > + * deleted due to some discard command so we can just skip > > + * inserting into those blocks. > > + */ > > + if (!BufferIsValid(buffer)) > > + { > > + Assert(InRecovery); > > + > > + /* > > + * Skip actual writing just update the context so that we have > > + * write offset for inserting into next blocks. > > + */ > > + SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte); > > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > > + break; > > + } > > How exactly can this happen? Suppose you insert one record for the transaction, which is split across block 1 and block 2. Now, before these blocks actually go to disk, the transaction commits and becomes all-visible, and the undo logs are discarded. It's possible that block 1 is completely discarded but block 2 is not, because it might hold undo for the next transaction. Now, during recovery (with FPW off), if block 1 is missing but block 2 is there, we need to skip inserting undo into block 1, as it does not exist. > > > > + else > > + { > > + page = BufferGetPage(buffer); > > + > > + /* > > + * Initialize the page whenever we try to write the first > > + * record in page. We start writing immediately after the > > + * block header. > > + */ > > + if (starting_byte == UndoLogBlockHeaderSize) > > + UndoPageInit(page, BLCKSZ, prepared_undo->urec->uur_info, > > + ucontext.already_processed, > > + prepared_undo->urec->uur_tuple.len, > > + prepared_undo->urec->uur_payload.len); > > + > > + /* > > + * Try to insert the record into the current page. If it > > + * doesn't succeed then recall the routine with the next page. > > + */ > > + InsertUndoData(&ucontext, page, starting_byte); > > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > > + { > > + MarkBufferDirty(buffer); > > + break; > > At this point we're five indentation levels deep.
I'd extract at least > either the the per prepared undo code or the code performing the writing > across block boundaries into a separate function. Perhaps both. I have moved it to the separate function. > > > > > +/* > > + * Helper function for UndoGetOneRecord > > + * > > + * If any of rmid/reloid/xid/cid is not available in the undo record, then > > + * it will get the information from the first complete undo record in the > > + * page. > > + */ > > +static void > > +GetCommonUndoRecInfo(UndoPackContext *ucontext, UndoRecPtr urp, > > + RelFileNode rnode, UndoLogCategory category, Buffer buffer) > > +{ > > + /* > > + * If any of the common header field is not available in the current undo > > + * record then we must read it from the first complete record of the page. > > + */ > > How is it guaranteed that the first record on the page is actually from > the current transaction? Can't there be a situation where that's from > another transaction? If the first record is not from the same transaction then the record must have all those fields in it so it should never try to access the first record. I have updated the comments for the same. > > > > > +/* > > + * Helper function for UndoFetchRecord and UndoBulkFetchRecord > > + * > > + * curbuf - If an input buffer is valid then this function will not release the > > + * pin on that buffer. If the buffer is not valid then it will assign curbuf > > + * with the first buffer of the current undo record and also it will keep the > > + * pin and lock on that buffer in a hope that while traversing the undo chain > > + * the caller might want to read the previous undo record from the same block. > > + */ > > Wait, so at exit *curbuf is pinned but not locked, if passed in, but is > pinned *and* locked when not? That'd not be a sane API. I don't think > the code works like that atm though. Comments were wrong, I have fixed. > > > > +static UnpackedUndoRecord * > > +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode, > > + UndoLogCategory category, Buffer *curbuf) > > +{ > > + Page page; > > + int starting_byte = UndoRecPtrGetPageOffset(urp); > > + BlockNumber cur_blk; > > + UndoPackContext ucontext = {{0}}; > > + Buffer buffer = *curbuf; > > + > > + cur_blk = UndoRecPtrGetBlockNum(urp); > > + > > + /* Initiate unpacking one undo record. */ > > + BeginUnpackUndo(&ucontext); > > + > > + while (true) > > + { > > + /* If we already have a buffer then no need to allocate a new one. */ > > + if (!BufferIsValid(buffer)) > > + { > > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk, > > + RBM_NORMAL, NULL, > > + RelPersistenceForUndoLogCategory(category)); > > + > > + /* > > + * Remember the first buffer where this undo started as next undo > > + * record what we fetch might fall on the same buffer. > > + */ > > + if (!BufferIsValid(*curbuf)) > > + *curbuf = buffer; > > + } > > + > > + /* Acquire shared lock on the buffer before reading undo from it. */ > > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > > + > > + page = BufferGetPage(buffer); > > + > > + UnpackUndoData(&ucontext, page, starting_byte); > > + > > + /* > > + * We are done if we have reached to the done stage otherwise move to > > + * next block and continue reading from there. > > + */ > > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > > + { > > + if (buffer != *curbuf) > > + UnlockReleaseBuffer(buffer); > > + > > + /* > > + * Get any of the missing fields from the first record of the > > + * page. 
> > + */ > > + GetCommonUndoRecInfo(&ucontext, urp, rnode, category, *curbuf); > > + break; > > + } > > + > > + /* > > + * The record spans more than a page so we would have copied it (see > > + * UnpackUndoRecord). In such cases, we can release the buffer. > > + */ > > Where would it have been copied? Presumably in UnpackUndoData()? Imo the > comment should say so. > > I'm a bit confused by the use of "would" in that comment. Either we > have, or not? This comment is obsolete so removed. > > + if (buffer != *curbuf) > > + UnlockReleaseBuffer(buffer); > > Wait, so we *keep* the buffer locked if it the same as *curbuf? That > can't be right. At the end we are releasing the lock on the *curbuf. But now I have changed it so that it is more readable. > > > > + * Fetch the undo record for given undo record pointer. > > + * > > + * This will internally allocate the memory for the unpacked undo record which > > + * intern will > > "intern" should probably be internally? But I'm not sure what the two > "internally"s really add here. > > > > > +/* > > + * Release the memory of the undo record allocated by UndoFetchRecord and > > + * UndoBulkFetchRecord. > > + */ > > +void > > +UndoRecordRelease(UnpackedUndoRecord *urec) > > +{ > > + /* Release the memory of payload data if we allocated it. */ > > + if (urec->uur_payload.data) > > + pfree(urec->uur_payload.data); > > + > > + /* Release memory of tuple data if we allocated it. */ > > + if (urec->uur_tuple.data) > > + pfree(urec->uur_tuple.data); > > + > > + /* Release memory of the transaction header if we allocated it. */ > > + if (urec->uur_txn) > > + pfree(urec->uur_txn); > > + > > + /* Release memory of the logswitch header if we allocated it. */ > > + if (urec->uur_logswitch) > > + pfree(urec->uur_logswitch); > > + > > + /* Release the memory of the undo record. */ > > + pfree(urec); > > +} > > Those comments before each pfree are not useful. Removed > > > Also, isn't this both fairly slow and fairly failure prone? The next > record is going to need all that memory again, no? It seems to me that > there should be one record that's allocated once, and then reused over > multiple fetches, increasing the size if necesssary. > > I'm very doubtful that all this freeing of individual allocations in the > undo code makes sense. Shouldn't this just be done in short lived memory > contexts, that then get reset as a whole? That's both far less failure > prone, and faster. > > > > + * one_page - Caller is applying undo only for one block not for > > + * complete transaction. If this is set true then instead > > + * of following transaction undo chain using prevlen we will > > + * follow the block prev chain of the block so that we can > > + * avoid reading many unnecessary undo records of the > > + * transaction. > > + */ > > +UndoRecInfo * > > +UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr, > > + int undo_apply_size, int *nrecords, bool one_page) > > > There's no caller for one_page mode in the series - I assume that's for > later, during page-wise undo? It seems to behave in quite noticably > different ways, is that really OK? Makes the code quite hard to > understand. > Also, it seems quite poorly named to me. It sounds like it's about > fetching a single undo page (which makes no sense, obviously). But what > it does is to switch to an entirely different way of traversing the undo > chains. one_page was zheap specific so I have removed it. I think in zheap specific function we can implement it by UndoFetchRecord in a loop. 
> > > > > + /* > > + * In one_page mode we are fetching undo only for one page instead of > > + * fetching all the undo of the transaction. Basically, we are fetching > > + * interleaved undo records. So it does not make sense to do any prefetch > > + * in that case. > > What does "interleaved" mean here? I meant that for one page we are following blockprev chain instead of complete transaction undo chain so there is no guarantee that the undo records are together. Basically, the undo records for the different blocks can be interleaved so I am not sure should we prefetch or not. I assume that there will often be > other UNDO records interspersed? But that's not guaranteed at all, > right? In fact, for a lot of workloads it seems likely that there will > be many consecutive undo records for a single page? In fact, won't that > be the majority of cases? Ok, that point makes sense to me but I thought if we always assume this we will do unwanted prefetch where this is not the case and we will put unnecessary load on the I/O. Currently, I have moved that code out of the undo layer so we can take a call while designing zheap specific function. > > Thus it's not obvious to me that there's not often going to be > consecutive pages for this case too. I'd even say that minimizing IO > delay is *MORE* important during page-wise undo, as that happens in the > context of client accesses, and it's not incurring cost on the party > that performed DML, but on some random third party. > > > I'm doubtful this is a sane interface. There's a lot of duplication > between one_page and not one_page. It presupposes specific ways of > constructing chains that are likely to depend on the AM. to_urecptr is > only used in certain situations. E.g. I strongly suspect that for > zheap's visibility determinations we'd want to concurrently follow all > the necessary chains to determine visibility for all all tuples on the > page, far enough to find the visible tuple - for seqscan's / bitmap heap > scans / everything using page mode scans, that'll be way more efficient > than doing this one-by-one and possibly even repeatedly. But what is > exactly the right thing to do is going to be highly AM specific. > > I vaguely suspect what you'd want is an interface where the "bulk fetch" > context basically has a FIFO queue of undo records to fetch, and a > function to actually perform fetching. Whenever a record has been > retrieved, a callback determines whether additional records are needed. > In the case of fetching all the undo for a transaction, you'd just queue > - probably in a more efficient representation - all the necessary > undo. In case of page-wise undo, you'd queue the first record of the > chain you'd want to undo, with a callback for queuing the next > record. For visibility determinations in zheap, you'd queue all the > different necessary chains, with a callback that queues the next > necessary record if still needed for visibility determination. > > And then I suspect you'd have a separate callback whenever records have > been fetched, with all the 'unconsumed' records. That then can, > e.g. based on memory consumption, decide to process them or not. For > visibility information you'd probably just want to condense the records > to the minimum necessary (i.e. visibility information for the relevant > tuples, and the visibile tuple when encountered) as soon as available. I haven't think on this part yet. I will analyze part. > > Obviously that's pretty handwavy. 
> > > > > > Also, if we are fetching undo records from more than one > > + * log, we don't know the boundaries for prefetching. Hence, we can't use > > + * prefetching in this case. > > + */ > > Hm. Why don't we know the boundaries (or cheaply infer them)? I have added comments for that. Basically, when we get the undo records from the different log (from and to pointers are in the different log) we don't know in latest undo log till what point the undo are from this transaction. We may consider prefetching to the start of the current log but there is no guarantee that all the blocks of the current logs are valid and not yet discarded. Ideally, the better fix would be that the caller always pass the from and to pointer from the same undo log. > > > > + /* > > + * If prefetch_pages are half of the prefetch_target then it's time to > > + * prefetch again. > > + */ > > + if (prefetch_pages < prefetch_target / 2) > > + PrefetchUndoPages(rnode, prefetch_target, &prefetch_pages, to_blkno, > > + from_blkno, category); > > Hm. Why aren't we prefetching again as soon as possible? Given the > current code there's not really benefit in fetching many adjacent pages > at once. And this way it seems we're somewhat likely to cause fairly > bursty IO? Hmm right, we can always prefetch as soon as we are behind the prefetch target. Done that way. > > > > + /* > > + * In one_page mode it's possible that the undo of the transaction > > + * might have been applied by worker and undo got discarded. Prevent > > + * discard worker from discarding undo data while we are reading it. > > + * See detail comment in UndoFetchRecord. In normal mode we are > > + * holding transaction undo action lock so it can not be discarded. > > + */ > > I don't really see a comment explaining this in UndoFetchRecord. Are > you referring to InHotStandby? Because there's no comment about one_page > mode as far as I can tell? The comment is clearly referring to that, > rather than InHotStandby? I have removed one_page code. > > > > > + if (one_page) > > + { > > + /* Refer comments in UndoFetchRecord. */ > > Missing "to". > > > > + if (InHotStandby) > > + { > > + if (UndoRecPtrIsDiscarded(urecptr)) > > + break; > > + } > > + else > > + { > > + LWLockAcquire(&slot->discard_lock, LW_SHARED); > > + if (slot->logno != logno || urecptr < slot->oldest_data) > > + { > > + /* > > + * The undo log slot has been recycled because it was > > + * entirely discarded, or the data has been discarded > > + * already. > > + */ > > + LWLockRelease(&slot->discard_lock); > > + break; > > + } > > + } > > I find this deeply unsatisfying. It's repeated in a bunch of > places. There's completely different behaviour between the hot-standby > and !hot-standby case. There's UndoRecPtrIsDiscarded for the HS case, > but we do a different test for !HS. There's no explanation as to why > this is even reachable. I have added comments in UndoFetchRecord. > > > > + /* Read the undo record. */ > > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); > > + > > + /* Release the discard lock after fetching the record. */ > > + if (!InHotStandby) > > + LWLockRelease(&slot->discard_lock); > > + } > > + else > > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); > > > And then we do none of this in !one_page mode. UndoBulkFetchRecord is always called from the aborted transaction so its undo can never get discarded concurrently so ideally, we don't need to check for discard. 
But, during one_page mode, we follow the when it comes from zheap for the one page it is possible that the undo for the transaction are applied from the worker for the complete transaction and its undo logs are discarded. But, I think this is highly am specific so I have removed one_page mode from here. > > > > + /* > > + * As soon as the transaction id is changed we can stop fetching the > > + * undo record. Ideally, to_urecptr should control this but while > > + * reading undo only for a page we don't know what is the end undo > > + * record pointer for the transaction. > > + */ > > + if (one_page) > > + { > > + if (!FullTransactionIdIsValid(fxid)) > > + fxid = uur->uur_fxid; > > + else if (!FullTransactionIdEquals(fxid, uur->uur_fxid)) > > + break; > > + } > > + > > + /* Remember the previous undo record pointer. */ > > + prev_urec_ptr = urecptr; > > + > > + /* > > + * Calculate the previous undo record pointer of the transaction. If > > + * we are reading undo only for a page then follow the blkprev chain > > + * of the page. Otherwise, calculate the previous undo record pointer > > + * using transaction's current undo record pointer and the prevlen. If > > + * undo record has a valid uur_prevurp, this is the case of log switch > > + * during the transaction so we can directly use uur_prevurp as our > > + * previous undo record pointer of the transaction. > > + */ > > + if (one_page) > > + urecptr = uur->uur_prevundo; > > + else if (uur->uur_logswitch) > > + urecptr = uur->uur_logswitch->urec_prevurp; > > + else if (prev_urec_ptr == to_urecptr || > > + uur->uur_info & UREC_INFO_TRANSACTION) > > + urecptr = InvalidUndoRecPtr; > > + else > > + urecptr = UndoGetPrevUndoRecptr(prev_urec_ptr, buffer, category); > > + > > FWIW, this is one of those concerns I was referring to above. What > exactly needs to happen seems highly AM specific. 1. one_page check is gone 2. uur->uur_info & UREC_INFO_TRANSACTION is also related to one_page so removed this too. 3. else if (uur->uur_logswitch) -> I think this is also related to the incapability of the caller that it can not identify the log switch but expect the bulk fetch to detect it and break fetching so that we can update the progress in the transaction header of the current log. I think we can solve these issue by callback as well as you suggested above. > > > > +/* > > + * Read length of the previous undo record. > > + * > > + * This function will take an undo record pointer as an input and read the > > + * length of the previous undo record which is stored at the end of the previous > > + * undo record. If the undo record is split then this will add the undo block > > + * header size in the total length. > > + */ > > This should add some note as to when it's expected to be necessary. I > was kind of concerned that this can be necessary, but it's only needed > during log switches, which disarms that concern. I think this is a normal case because the undo_len store the actual length of the record. 
But, if the undo record split across 2 pages and if we are at the end of the undo record (start of the next record) then for computing the absolute start offset of the previous undo record we need the exact distance between these two records and that will be current_offset - (the actual length of the previous record + Undo record header if the previous log is split across 2 pages) > > > > +static uint16 > > +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > > + UndoLogCategory category) > > +{ > > + UndoLogOffset page_offset = UndoRecPtrGetPageOffset(urp); > > + BlockNumber cur_blk = UndoRecPtrGetBlockNum(urp); > > + Buffer buffer = input_buffer; > > + Page page = NULL; > > + char *pagedata = NULL; > > + char prevlen[2]; > > + RelFileNode rnode; > > + int byte_to_read = sizeof(uint16); > > Shouldn't it be byte_to_read? And the sizeof a type that's tied with the > actual undo format? Imagine we'd ever want to change the length format > for undo records - this would be hard to find. I did not get this comments. Do you mean that we should not rely on undo format i.e. we should not assume that undo length is stored at the end of the undo record? > > > > + char persistence; > > + uint16 prev_rec_len = 0; > > + > > + /* Get relfilenode. */ > > + UndoRecPtrAssignRelFileNode(rnode, urp); > > + persistence = RelPersistenceForUndoLogCategory(category); > > + > > + if (BufferIsValid(buffer)) > > + { > > + page = BufferGetPage(buffer); > > + pagedata = (char *) page; > > + } > > + > > + /* > > + * Length if the previous undo record is store at the end of that record > > + * so just fetch last 2 bytes. > > + */ > > + while (byte_to_read > 0) > > + { > > Why does this need a loop around the number of bytes? Can there ever be > a case where this is split across a record? If so, isn't that a bad idea > anyway? Yes, as of now, undo record can be splitted at any point even the undo length can be split acorss 2 pages. I think we can reduce complexity by making sure undo length doesn't get split acorss pages. But for handling that while allocating the undo we need to detect this whether the undo length can get splitted by checking the space in the current page and the undo record length and based on that we need to allocate 1 extra byte in the undo log. Seems that will add an extra complexity. > > > > + /* Read buffer if the current buffer is not valid. */ > > + if (!BufferIsValid(buffer)) > > + { > > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, > > + cur_blk, RBM_NORMAL, NULL, > > + persistence); > > + > > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > > + > > + page = BufferGetPage(buffer); > > + pagedata = (char *) page; > > + } > > + > > + page_offset -= 1; > > + > > + /* > > + * Read current prevlen byte from current block if page_offset hasn't > > + * reach to undo block header. Otherwise, go to the previous block > > + * and continue reading from there. > > + */ > > + if (page_offset >= UndoLogBlockHeaderSize) > > + { > > + prevlen[byte_to_read - 1] = pagedata[page_offset]; > > + byte_to_read -= 1; > > + } > > + else > > + { > > + /* > > + * Release the current buffer if it is not provide by the caller. > > + */ > > + if (input_buffer != buffer) > > + UnlockReleaseBuffer(buffer); > > + > > + /* > > + * Could not read complete prevlen from the current block so go to > > + * the previous block and start reading from end of the block. 
> > + */ > > + cur_blk -= 1; > > + page_offset = BLCKSZ; > > + > > + /* > > + * Reset buffer so that we can read it again for the previous > > + * block. > > + */ > > + buffer = InvalidBuffer; > > + } > > + } > > I can't help but think that this shouldn't be yet another copy of logic > for how to read undo pages. I haven't yet thought but I will try to unify this with ReadUndoBytes. Actually, I didn't do that already because ReadUndoByte needs a start pointer where we need to read the given number of bytes but here we have an end pointer. May be by this logic we can compute the start pointer but that will look equally complex. I will work on this and try to figure out something. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-05 11:29:34 -0700, Andres Freund wrote: > > Need to do something else for a bit. More later. > > > > + /* > > + * Compute the header size of the undo record. > > + */ > > +Size > > +UndoRecordHeaderSize(uint16 uur_info) > > +{ > > + Size size; > > + > > + /* Add fixed header size. */ > > + size = SizeOfUndoRecordHeader; > > + > > + /* Add size of transaction header if it presets. */ > > + if ((uur_info & UREC_INFO_TRANSACTION) != 0) > > + size += SizeOfUndoRecordTransaction; > > + > > + /* Add size of rmid if it presets. */ > > + if ((uur_info & UREC_INFO_RMID) != 0) > > + size += sizeof(RmgrId); > > + > > + /* Add size of reloid if it presets. */ > > + if ((uur_info & UREC_INFO_RELOID) != 0) > > + size += sizeof(Oid); > > + > > + /* Add size of fxid if it presets. */ > > + if ((uur_info & UREC_INFO_XID) != 0) > > + size += sizeof(FullTransactionId); > > + > > + /* Add size of cid if it presets. */ > > + if ((uur_info & UREC_INFO_CID) != 0) > > + size += sizeof(CommandId); > > + > > + /* Add size of forknum if it presets. */ > > + if ((uur_info & UREC_INFO_FORK) != 0) > > + size += sizeof(ForkNumber); > > + > > + /* Add size of prevundo if it presets. */ > > + if ((uur_info & UREC_INFO_PREVUNDO) != 0) > > + size += sizeof(UndoRecPtr); > > + > > + /* Add size of the block header if it presets. */ > > + if ((uur_info & UREC_INFO_BLOCK) != 0) > > + size += SizeOfUndoRecordBlock; > > + > > + /* Add size of the log switch header if it presets. */ > > + if ((uur_info & UREC_INFO_LOGSWITCH) != 0) > > + size += SizeOfUndoRecordLogSwitch; > > + > > + /* Add size of the payload header if it presets. */ > > + if ((uur_info & UREC_INFO_PAYLOAD) != 0) > > + size += SizeOfUndoRecordPayload; > > There's numerous blocks with one if for each type, and the body copied > basically the same for each alternative. That doesn't seem like a > reasonable approach to me. Means that many places need to be adjusted > when we invariably add another type, and seems likely to lead to bugs > over time. I think I have expressed my thought on this in another email [https://www.postgresql.org/message-id/CAFiTN-vDrXuL6tHK1f_V9PAXp2%2BEFRpPtxCG_DRx08PZXAPkyw%40mail.gmail.com] > > > + /* Add size of the payload header if it presets. */ > > FWIW, repeating the same comment, with or without minor differences, 10 > times is a bad idea. Especially when the comment doesn't add *any* sort > of information. Ok, fixed > > Also, "if it presets" presumably is a typo? Fixed > > > > +/* > > + * Compute and return the expected size of an undo record. > > + */ > > +Size > > +UndoRecordExpectedSize(UnpackedUndoRecord *uur) > > +{ > > + Size size; > > + > > + /* Header size. */ > > + size = UndoRecordHeaderSize(uur->uur_info); > > + > > + /* Payload data size. */ > > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) > > + { > > + size += uur->uur_payload.len; > > + size += uur->uur_tuple.len; > > + } > > + > > + /* Add undo record length size. */ > > + size += sizeof(uint16); > > + > > + return size; > > +} > > + > > +/* > > + * Calculate the size of the undo record stored on the page. > > + */ > > +static inline Size > > +UndoRecordSizeOnPage(char *page_ptr) > > +{ > > + uint16 uur_info = ((UndoRecordHeader *) page_ptr)->urec_info; > > + Size size; > > + > > + /* Header size. */ > > + size = UndoRecordHeaderSize(uur_info); > > + > > + /* Payload data size. 
*/ > > + if ((uur_info & UREC_INFO_PAYLOAD) != 0) > > + { > > + UndoRecordPayload *payload = (UndoRecordPayload *) (page_ptr + size); > > + > > + size += payload->urec_payload_len; > > + size += payload->urec_tuple_len; > > + } > > + > > + return size; > > +} > > + > > +/* > > + * Compute size of the Unpacked undo record in memory > > + */ > > +Size > > +UnpackedUndoRecordSize(UnpackedUndoRecord *uur) > > +{ > > + Size size; > > + > > + size = sizeof(UnpackedUndoRecord); > > + > > + /* Add payload size if record contains payload data. */ > > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) > > + { > > + size += uur->uur_payload.len; > > + size += uur->uur_tuple.len; > > + } > > + > > + return size; > > +} > > These functions are all basically the same. We shouldn't copy code over > and over like this. UnpackedUndoRecordSize -> computes the size of the unpacked undo record so it's different from above two, just payload part is common so moved payload size to common function. UndoRecordExpectedSize and UndoRecordSizeOnPage are two different functions except for the header size computation so I already had the common function for the header. UndoRecordExpectedSize computes the expected record size so it can access the payload directly from the unpack undo record whereas the UndoRecordSizeOnPage needs to calculate the record size by the record pointer which is already stored on the page so actually it doesn't have the unpacked undo record instead it first need to compute the header size and then it needs to reach to the payload data. Typecast that to payload header and compute the length. In unpack undo record payload is stored as StringInfoData whereas on the page it is packed as UndoRecordPayload header. So I am not sure how to unify them. Anyway, UndoRecordSizeOnPage is required only for undo page consistency checker patch so I have moved out of this patch. Later, I am planning to handle the comments of the undo page consistency checker patch so I will try to work on this function if I can improve it. > > > > +/* > > + * Initiate inserting an undo record. > > + * > > + * This function will initialize the context for inserting and undo record > > + * which will be inserted by calling InsertUndoData. > > + */ > > +void > > +BeginInsertUndo(UndoPackContext *ucontext, UnpackedUndoRecord *uur) > > +{ > > + ucontext->stage = UNDO_PACK_STAGE_HEADER; > > + ucontext->already_processed = 0; > > + ucontext->partial_bytes = 0; > > + > > + /* Copy undo record header. */ > > + ucontext->urec_hd.urec_type = uur->uur_type; > > + ucontext->urec_hd.urec_info = uur->uur_info; > > + > > + /* Copy undo record transaction header if it is present. */ > > + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0) > > + memcpy(&ucontext->urec_txn, uur->uur_txn, SizeOfUndoRecordTransaction); > > + > > + /* Copy rmid if present. */ > > + if ((uur->uur_info & UREC_INFO_RMID) != 0) > > + ucontext->urec_rmid = uur->uur_rmid; > > + > > + /* Copy reloid if present. */ > > + if ((uur->uur_info & UREC_INFO_RELOID) != 0) > > + ucontext->urec_reloid = uur->uur_reloid; > > + > > + /* Copy fxid if present. */ > > + if ((uur->uur_info & UREC_INFO_XID) != 0) > > + ucontext->urec_fxid = uur->uur_fxid; > > + > > + /* Copy cid if present. */ > > + if ((uur->uur_info & UREC_INFO_CID) != 0) > > + ucontext->urec_cid = uur->uur_cid; > > + > > + /* Copy undo record relation header if it is present. 
*/ > > + if ((uur->uur_info & UREC_INFO_FORK) != 0) > > + ucontext->urec_fork = uur->uur_fork; > > + > > + /* Copy prev undo record pointer if it is present. */ > > + if ((uur->uur_info & UREC_INFO_PREVUNDO) != 0) > > + ucontext->urec_prevundo = uur->uur_prevundo; > > + > > + /* Copy undo record block header if it is present. */ > > + if ((uur->uur_info & UREC_INFO_BLOCK) != 0) > > + { > > + ucontext->urec_blk.urec_block = uur->uur_block; > > + ucontext->urec_blk.urec_offset = uur->uur_offset; > > + } > > + > > + /* Copy undo record log switch header if it is present. */ > > + if ((uur->uur_info & UREC_INFO_LOGSWITCH) != 0) > > + memcpy(&ucontext->urec_logswitch, uur->uur_logswitch, > > + SizeOfUndoRecordLogSwitch); > > + > > + /* Copy undo record payload header and data if it is present. */ > > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) > > + { > > + ucontext->urec_payload.urec_payload_len = uur->uur_payload.len; > > + ucontext->urec_payload.urec_tuple_len = uur->uur_tuple.len; > > + ucontext->urec_payloaddata = uur->uur_payload.data; > > + ucontext->urec_tupledata = uur->uur_tuple.data; > > + } > > + else > > + { > > + ucontext->urec_payload.urec_payload_len = 0; > > + ucontext->urec_payload.urec_tuple_len = 0; > > + } > > + > > + /* Compute undo record expected size and store in the context. */ > > + ucontext->undo_len = UndoRecordExpectedSize(uur); > > +} > > It really can't be right to have all these fields basically twice, in > UnackedUndoRecord, and UndoPackContext. And then copy them one-by-one. > I mean there's really just some random differences (ordering, some field > names) between the structures, but otherwise they're the same? > > What on earth do we gain by this? This entire intermediate stage makes > no sense at all to me. We copy data into an UndoRecord, then we copy > into an UndoRecordContext, with essentially a field-by-field copy > logic. Then we have another field-by-field logic that copies the data > into the page. The idea was that in UnpackedUndoRecord we have all member as a field by field but in context, we can keep them in headers for example UndoRecordHeader, UndoRecordGroup, UndoRecordBlock. And, the idea behind this is that during InsertUndoData instead of calling InsertUndoByte field by field we call it once for each header because either we have to write all field of that header or none. But later we end up having a lot of optional headers and most of them have just one field in it so it appears that we are copying field by field. One alternative could be that we palloc a memory in context and then pack each field in that memory (except the payload and tuple data) then in one InsertUndoByte call we can insert complete header part and in we can have 2 more calls to InsertUndoBytes for writing payload and tuple data. What's your thought on this. > > > > > > +/* > > + * Insert the undo record into the input page from the unpack undo context. > > + * > > + * Caller can call this function multiple times until desired stage is reached. > > + * This will write the undo record into the page. > > + */ > > +void > > +InsertUndoData(UndoPackContext *ucontext, Page page, int starting_byte) > > +{ > > + char *writeptr = (char *) page + starting_byte; > > + char *endptr = (char *) page + BLCKSZ; > > + > > + switch (ucontext->stage) > > + { > > + case UNDO_PACK_STAGE_HEADER: > > + /* Insert undo record header. 
*/ > > + if (!InsertUndoBytes((char *) &ucontext->urec_hd, > > + SizeOfUndoRecordHeader, &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + ucontext->stage = UNDO_PACK_STAGE_TRANSACTION; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_TRANSACTION: > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_TRANSACTION) != 0) > > + { > > + /* Insert undo record transaction header. */ > > + if (!InsertUndoBytes((char *) &ucontext->urec_txn, > > + SizeOfUndoRecordTransaction, > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_RMID; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_RMID: > > + /* Write rmid(if needed and not already done). */ > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_RMID) != 0) > > + { > > + if (!InsertUndoBytes((char *) &(ucontext->urec_rmid), sizeof(RmgrId), > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_RELOID; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_RELOID: > > + /* Write reloid(if needed and not already done). */ > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_RELOID) != 0) > > + { > > + if (!InsertUndoBytes((char *) &(ucontext->urec_reloid), sizeof(Oid), > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_XID; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_XID: > > + /* Write xid(if needed and not already done). */ > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_XID) != 0) > > + { > > + if (!InsertUndoBytes((char *) &(ucontext->urec_fxid), sizeof(FullTransactionId), > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_CID; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_CID: > > + /* Write cid(if needed and not already done). */ > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_CID) != 0) > > + { > > + if (!InsertUndoBytes((char *) &(ucontext->urec_cid), sizeof(CommandId), > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_FORKNUM; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_FORKNUM: > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_FORK) != 0) > > + { > > + /* Insert undo record fork number. */ > > + if (!InsertUndoBytes((char *) &ucontext->urec_fork, > > + sizeof(ForkNumber), > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_PREVUNDO; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_PREVUNDO: > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_PREVUNDO) != 0) > > + { > > + /* Insert undo record blkprev. */ > > + if (!InsertUndoBytes((char *) &ucontext->urec_prevundo, > > + sizeof(UndoRecPtr), > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_BLOCK; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_BLOCK: > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_BLOCK) != 0) > > + { > > + /* Insert undo record block header. 
*/ > > + if (!InsertUndoBytes((char *) &ucontext->urec_blk, > > + SizeOfUndoRecordBlock, > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_LOGSWITCH; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_LOGSWITCH: > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_LOGSWITCH) != 0) > > + { > > + /* Insert undo record transaction header. */ > > + if (!InsertUndoBytes((char *) &ucontext->urec_logswitch, > > + SizeOfUndoRecordLogSwitch, > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_PAYLOAD; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_PAYLOAD: > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_PAYLOAD) != 0) > > + { > > + /* Insert undo record payload header. */ > > + if (!InsertUndoBytes((char *) &ucontext->urec_payload, > > + SizeOfUndoRecordPayload, > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_PAYLOAD_DATA; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_PAYLOAD_DATA: > > + { > > + int len = ucontext->urec_payload.urec_payload_len; > > + > > + if (len > 0) > > + { > > + /* Insert payload data. */ > > + if (!InsertUndoBytes((char *) ucontext->urec_payloaddata, > > + len, &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_TUPLE_DATA; > > + } > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_TUPLE_DATA: > > + { > > + int len = ucontext->urec_payload.urec_tuple_len; > > + > > + if (len > 0) > > + { > > + /* Insert tuple data. */ > > + if (!InsertUndoBytes((char *) ucontext->urec_tupledata, > > + len, &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_UNDO_LENGTH; > > + } > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_UNDO_LENGTH: > > + /* Insert undo length. */ > > + if (!InsertUndoBytes((char *) &ucontext->undo_len, > > + sizeof(uint16), &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + > > + ucontext->stage = UNDO_PACK_STAGE_DONE; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_DONE: > > + /* Nothing to be done. */ > > + break; > > + > > + default: > > + Assert(0); /* Invalid stage */ > > + } > > +} > > I don't understand. The only purpose of this is that we can partially > write a packed-but-not-actually-packed record onto a bunch of pages? And > for that we have an endless chain of copy and pasted code calling > InsertUndoBytes()? Copying data into shared buffers in tiny increments? > > If we need to this, what is the whole packed record format good for? > Except for adding a bunch of functions with 10++ ifs and nearly > identical code? > > Copying data is expensive. Copying data in tiny increments is more > expensive. Copying data in tiny increments, with a bunch of branches, is > even more expensive. Copying data in tiny increments, with a bunch of > branches, is even more expensive, especially when it's shared > memory. Copying data in tiny increments, with a bunch of branches, is > even more expensive, especially when it's shared memory, especially when > all that shared meory is locked at once. 
> > > > +/* > > + * Read the undo record from the input page to the unpack undo context. > > + * > > + * Caller can call this function multiple times until desired stage is reached. > > + * This will read the undo record from the page and store the data into unpack > > + * undo context, which can be later copied to unpacked undo record by calling > > + * FinishUnpackUndo. > > + */ > > +void > > +UnpackUndoData(UndoPackContext *ucontext, Page page, int starting_byte) > > +{ > > + char *readptr = (char *) page + starting_byte; > > + char *endptr = (char *) page + BLCKSZ; > > + > > + switch (ucontext->stage) > > + { > > + case UNDO_PACK_STAGE_HEADER: > > You know roughly what I'm thinking. I have expressed my thought on this in last comment. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 13, 2019 at 6:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > So, for top-level transactions rollback, we can directly refer from > UndoRequest *, the start and end locations. But, what should we do > for sub-transactions (rollback to savepoint)? One related point is > that we also need information about last_log_start_undo_location to > update the undo apply progress (The basic idea is if the transactions > undo is spanned across multiple logs, we update the progress in each > of the logs.). We can remember that in the transaction state or > undorequest *. Any suggestion? The UndoRequest is only for top-level rollback. Any state that you need in order to do subtransaction rollback needs to be maintained someplace else, probably in the transaction state state, or some subsidiary data structure. The point here is that the UndoRequest is going to be stored in shared memory, but there is no reason ever to store the information about a subtransaction in shared memory, because that undo always has to be completed by the backend that is responsible for that transaction. Those things should not get mixed together. > IIUC, for each transaction, we have to take a lock first time it > attaches to a log and then the same lock at commit time. It seems the > work under lock is less, but still, can't this cause a contention? It > seems to me this is similar to what we saw in ProcArrayLock where work > under lock was few instructions, but acquiring and releasing the lock > by each backend at commit time was causing a bottleneck. LWLocks are pretty fast these days and the critical section is pretty short, so I think there's a chance it'll be just fine, but maybe it'll cause enough cache line bouncing to be problematic. If so, I think there are several possible ways to redesign the locking to improve things, but it made sense to me to try the simple approach first. > How will computation of oldestXidHavingUnappliedUndo will work? > > We can probably check the fxid queue and error queue to get that > value. However, I am not sure if that is sufficient because incase we > perform the request in the foreground, it won't be present in queues. Oh, I forgot about that requirement. I think I can fix it so it does that fairly easily, but it will require a little bit of redesign which I won't have time to do this week. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-08-06 14:18:42 -0700, Andres Freund wrote: > Here's the last section of my low-leve review. Plan to write a higher > level summary afterwards, now that I have a better picture of the code. General comments: - For a feature of this complexity, there's very little architectural documentation. Some individual parts have a bit, but there's basically nothing over-arching. That makes it extremely hard for anybody that is not already involved to understand the design constraints, and even for people involved it's hard to understand. I think it's very important for this to have a document that first explains what the goals, and non-goals, of this feature are. And then secondly explains the chosen architecture referencing those constraints. Once that's there it's a lot easier to review this patchset, to discuss the overall architecture, etc. - There are too many different email threads and git branches. The same patches are discussed in different threads, layers exist in slightly diverging versions in different git trees. Again making it very hard for anybody not primarily focussing on undo to join the discussion. I think most of the older git branches should be renamed into something indicating their historic status. The remaining branches should be referenced from a wiki page (linked to in each submission of a new patch version), explaining what they're about. I don't think it's realistic to have people infer meaning from the current branch names (undo, proposal-undo-log, undo-log-storage, undo-log-storage-v2, undo_interface_v1, undoprocessing). Given the size of the the overall project it's quite possibly not realistic to manage the whole work in a single git branch. With separate git branches, as currently done, it's however hard to understand which version of a what layer is used. I think at the very least higher layers need to indicate the version of the underlying layers is being used. I suggest just adding a "v01: " style prefix to all commit subjects a branch is rebased onto. It's also currently hard to understand what version of a layer is being discussed. I think submissions all need to include a version number (c.f. git format-patch's -v option), and that version ought to be included in the subject line. Each new major version of a patch should be started as a reply to the first message of a thread, to keep the structure of a discussion in a managable shape. New versions should include explicit explanations about the major changes compared to the last version. - The code quality of pretty significant parts of the patchset is not even close to being good enough. There are areas with a lot of code duplication. There are very few higher level explanations for interfaces. There's a lot of "i++; /* increment i to increment it */" style comments, but not enough higher level comments. There are significant issues with parts of the code that aren't noted anywhere in comments, leading to reviewers having to repeatedly re-discover them (and wasting time on that). There's different naming styles in related code without a discernible pattern (e.g. UndoRecordSetInfo being followed by get_undo_rec_cid_offset). The word-ordering of different layers is confusing (e.g. BeginUndoRecordInsert vs UndoLogBeginInsert vs PrepareUndoInsert). Different important externally visible functions have names that don't allow to determine which is supposed to do what (PrepareUndoInsert vs BeginUndoRecordInsert). 
More specific comments: - The whole sequencing of undo record insertion in combination with WAL logging does not appear to be right. It's a bit hard to say, because there's very little documentation on what the intended default sequence of operations is. My understanding is that the currently expected pattern is to: 1) Collect information / perform work needed to perform the action that needs to be UNDO logged. E.g. perform visibility determinations, wait for lockers, compute new infomask, etc. This will likely end with the "content" page(s) (e.g. a table's page) being exclusively locked. 2) Estimate space needed for all UNDO logging (BeginUndoRecordInsert) 3) Prepare for each undo record, this includes building the content for each undo record. PrepareUndoInsert(). This acquires, pins and locks buffers for undo. 4) begin a critical section 5) modify content page, mark buffer dirty 6) write UNDO, using InsertPreparedUndo() 7) associate undo with WAL record (RegisterUndoLogBuffers) 8) write WAL 9) End critical section But despite reading through the code, including READMEs, I'm doubtful that's quite the intended pattern. It REALLY can't be right that one needs to parse many function headers to figure out how the most basic use of undo could possibly work. There needs to be very clear documentation about how to write undo records. Functions that sound like they're involved, need actually useful comments, rather than just restatements of their function names (cf RegisterUndoLogBuffers, UndoLogBuffersSetLSN, UndoLogRegister). - I think there's two fairly fundamental, and related, problems with the sequence outlined above: - We can't search for target buffers to store undo data, while holding the "table" page content locked. That can involve writing out multiple pages till we find a usable victim buffer. That can take a pretty long time. While that's happening the content page would currently be locked. Note how e.g. heapam.c is careful to not hold *any* content locks while potentially performing IO. I think the current interface makes that hard. The easy way to solve would be to require sites performing UNDO logging to acquire victim pages before even acquiring other content locks. Perhaps the better approach could be for the undo layer to hold onto a number of clean buffers, and to keep the last page in an already written to undo log pinned. - We can't search for victim buffers for further undo data while already holding other undo pages content locked. Doing so means that while we're e.g. doing IO to clean out the new page, old undo data on the previous page can't be read. This seems easier to fix. Instead of PrepareUndoInsert() acquiring, pinning and locking buffers, it'd need to be split into two operations. One that acquires buffers and pins them, and one that locks them. I think it's quite possible that the locking operation could just be delayed until InsertPreparedUndo(). But if we solve the above problem, most of this might already be solved. - To me the current split between the packed and unpacked UNDO record formats makes very little sense, the reasoning behind having them is poorly if at all documented, results in extremely verbose code, and isn't extensible. When preparing to insert an undo record the in-buffer size is computed with UndoRecordHeaderSize() (needs to know about all optional data) from within PrepareUndoInsert() (which has a bunch a bunch of additional knowledge about the record format). 
Then during insertion InsertPreparedUndo(), first copies the UnpackedUndoRecord into an UndoPackContext (again needing ...), and then, via InsertUndoData(), copies that in small increments into the respective buffers (again needing knowledge about the complete record format, two copies even). Beside the code duplication, that also means the memory copies are very inefficient, because they're all done in tiny increments, multiple times. When reading undo it's smilar: UnpackUndoData(), again in small chunks, reads the buffer data into an UndoPackContext (another full copy of the unpacked record format). But then FinishUnpackUndo() *again* copies all that data, into an actual UnpackedUndoRecord (again, with a copy of the record format, albeit slightly different looking). I'm not convinced by Heikki's argument that we shouldn't have structure within undo records. In my opinion that is a significant weakness of how WAL was initially designed, and even after Heikki's work, still is a problem. But this isn't the right design either. Even if were to stay with the current fixed record format, I think the current code needs a substantial redesign: - I think 'packing' during insertion needs to serialize into a char* allocation during PrepareUndoInsert computing the size in parallel (or perhaps in InsertPreparedUndo, but probably not). The size of the record should never be split across record boundaries (i.e. we'll leave that space unused if we otherwise would need to split the size). The actual insertion would be a maximally sized memcpy() (so we'd as many memcpys as the buffer fits in, rather than one for each sub-type of a record). That allows to remove most of the duplicated knowledge of the record format, and makes insertions faster (by doing only large memcpys while holding exclusive content locks). - When reading an undo record, the whole stage of UnpackUndoData() reading data into a the UndoPackContext is omitted, reading directly into the UnpackedUndoRecord. That removes one further copy of the record format. - To avoid having separate copies of the record format logic, I'd probably encode it into *one* array of metadata. If we had {offsetoff(UnpackedUndoRecord, member), membersize(UnpackedUndoRecord, member), flag} we could fairly trivially remove most knowledge from the places currently knowing about the record format. I have some vague ideas for how to specify the format in a way that is more extensible, but with more structure than just a blob of data. But I don't think they're there yet. - The interface to read undo also doesn't seem right to me. For one there's various different ways to read records, with associated code duplication (batch, batch in "one page" mode - but that's being removed now I think, single record mode). I think the batch mode is too restrictive. We might not need this during the first merged version, but I think before long we're going to want to be able to efficiently traverse all the undo chains we need to determine the visibility of all tuples on a page. Otherwise we'll cause a lot of additional synchronous read IO, and will repeatedly re-fetch information, especially during sequential scans for an older snapshot. I think I briefly outlined this in an earlier email - my current though is that the batch interface (which the non-batch interface should just be a tiny helper around), should basically be a queue of "to-be-fetched" undo records. When batching reading an entire transaction, all blocks get put onto that queue. 
When traversing multiple chains, the chains are processed in a breadth-first fashion (by just looking at the queue, and pushing additional work to the end). That allows to efficiently issue prefetch requests for blocks to be read in the near future. I think that batch reading should just copy the underlying data into a char* buffer. Only the records that currently are being used by higher layers should get exploded into an unpacked record. That will reduce memory usage quite noticably (and I suspect it also drastically reduce the overhead due to a large context with a lot of small allocations that then get individually freed). That will make the sorting of undo a bit more CPU inefficient, because individual records will need to be partially unpacked for comparison, but I think that's going to be a far smaller loss than the win. - My reading of the current xact.c integration is that it's not workable as is. Undo is executed outside of a valid transaction state, exceptions aren't properly undone, logic would need to be duplicated to a significant degree, new kind of critical section. Greetings, Andres Freund
On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > - I think there's two fairly fundamental, and related, problems with > the sequence outlined above: > > - We can't search for target buffers to store undo data, while holding > the "table" page content locked. > > The easy way to solve would be to require sites performing UNDO > logging to acquire victim pages before even acquiring other content > locks. Perhaps the better approach could be for the undo layer to > hold onto a number of clean buffers, and to keep the last page in an > already written to undo log pinned. > > - We can't search for victim buffers for further undo data while > already holding other undo pages content locked. Doing so means that > while we're e.g. doing IO to clean out the new page, old undo data > on the previous page can't be read. > > This seems easier to fix. Instead of PrepareUndoInsert() acquiring, > pinning and locking buffers, it'd need to be split into two > operations. One that acquires buffers and pins them, and one that > locks them. I think it's quite possible that the locking operation > could just be delayed until InsertPreparedUndo(). But if we solve > the above problem, most of this might already be solved. Basically, that means - the caller should call PreparedUndoInsert before acquiring table page content lock right? because the PreparedUndoInsert just compute the size, allocate the space and pin+lock the buffers and for pinning the buffers we must compute the size and allocate the space using undo storage layer. - So basically, if we delay the lock till InsertPreparedUndo and call PrepareUndoInsert before acquiring table page content lock this problem is solved? Although I haven't yet analyzed the AM specific part that whether it's always possible to call the PrepareUndoInsert(basically getting all the undo record ready) before the page content lock. But, I am sure that won't be much difficult part. > - To me the current split between the packed and unpacked UNDO record > formats makes very little sense, the reasoning behind having them is > poorly if at all documented, results in extremely verbose code, and > isn't extensible. > > When preparing to insert an undo record the in-buffer size is computed > with UndoRecordHeaderSize() (needs to know about all optional data) > from within PrepareUndoInsert() (which has a bunch a bunch of > additional knowledge about the record format). Then during insertion > InsertPreparedUndo(), first copies the UnpackedUndoRecord into an > UndoPackContext (again needing ...), and then, via InsertUndoData(), > copies that in small increments into the respective buffers (again > needing knowledge about the complete record format, two copies > even). Beside the code duplication, that also means the memory copies > are very inefficient, because they're all done in tiny increments, > multiple times. > > When reading undo it's smilar: UnpackUndoData(), again in small > chunks, reads the buffer data into an UndoPackContext (another full > copy of the unpacked record format). But then FinishUnpackUndo() > *again* copies all that data, into an actual UnpackedUndoRecord > (again, with a copy of the record format, albeit slightly different > looking). > > I'm not convinced by Heikki's argument that we shouldn't have > structure within undo records. In my opinion that is a significant > weakness of how WAL was initially designed, and even after Heikki's > work, still is a problem. But this isn't the right design either. 
> > Even if were to stay with the current fixed record format, I think > the current code needs a substantial redesign: > > - I think 'packing' during insertion needs to serialize into a char* > allocation during PrepareUndoInsert ok computing the size in parallel > (or perhaps in InsertPreparedUndo, but probably not). The size of > the record should never be split across record boundaries > (i.e. we'll leave that space unused if we otherwise would need to > split the size). I think before UndoRecordAllocate we need to detect this part that whether the size of the record will start from the last byte of the page and if so then allocate one extra byte for the undo record. Or always allocate one extra byte for the undo record for handling this case. And, in FinalizeUndoAdvance only pass the size how much we have actually consumed. The actual insertion would be a maximally sized > memcpy() (so we'd as many memcpys as the buffer fits in, rather than > one for each sub-type of a record). > > That allows to remove most of the duplicated knowledge of the record > format, and makes insertions faster (by doing only large memcpys > while holding exclusive content locks). Right. > > - When reading an undo record, the whole stage of UnpackUndoData() > reading data into a the UndoPackContext is omitted, reading directly > into the UnpackedUndoRecord. That removes one further copy of the > record format. So we will read member by member to UnpackedUndoRecord? because in context we have at least a few headers packed and we can memcpy one header at a time like UndoRecordHeader, UndoRecordBlock. But that just a few of them so if we copy field by field in the UnpackedUndoRecord then we can get rid of copying in context then copy it back to the UnpackedUndoRecord. Is this is what in your mind or you want to store these structures (UndoRecordHeader, UndoRecordBlock) directly into UnpackedUndoRecord? > > - To avoid having separate copies of the record format logic, I'd > probably encode it into *one* array of metadata. If we had > {offsetoff(UnpackedUndoRecord, member), > membersize(UnpackedUndoRecord, member), > flag} > we could fairly trivially remove most knowledge from the places > currently knowing about the record format. Seems interesting. I will work on this. > > > I have some vague ideas for how to specify the format in a way that is > more extensible, but with more structure than just a blob of data. But > I don't think they're there yet. > > > - The interface to read undo also doesn't seem right to me. For one > there's various different ways to read records, with associated code > duplication (batch, batch in "one page" mode - but that's being > removed now I think, single record mode). > > I think the batch mode is too restrictive. We might not need this > during the first merged version, but I think before long we're going > to want to be able to efficiently traverse all the undo chains we need > to determine the visibility of all tuples on a page. Otherwise we'll > cause a lot of additional synchronous read IO, and will repeatedly > re-fetch information, especially during sequential scans for an older > snapshot. I think I briefly outlined this in an earlier email - my > current though is that the batch interface (which the non-batch > interface should just be a tiny helper around), should basically be a > queue of "to-be-fetched" undo records. When batching reading an entire > transaction, all blocks get put onto that queue. 
When traversing > multiple chains, the chains are processed in a breadth-first fashion > (by just looking at the queue, and pushing additional work to the > end). That allows to efficiently issue prefetch requests for blocks to > be read in the near future. I need to analyze this part. > > I think that batch reading should just copy the underlying data into a > char* buffer. Only the records that currently are being used by > higher layers should get exploded into an unpacked record. That will > reduce memory usage quite noticably (and I suspect it also drastically > reduce the overhead due to a large context with a lot of small > allocations that then get individually freed). Ok, I got your idea. I will analyze it further and work on this if there is no problem. That will make the > sorting of undo a bit more CPU inefficient, because individual records > will need to be partially unpacked for comparison, but I think that's > going to be a far smaller loss than the win. Right. > > > - My reading of the current xact.c integration is that it's not workable > as is. Undo is executed outside of a valid transaction state, > exceptions aren't properly undone, logic would need to be duplicated > to a significant degree, new kind of critical section. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi, On 2019-08-13 13:53:59 +0530, Dilip Kumar wrote: > On Tue, Jul 30, 2019 at 1:32 PM Andres Freund <andres@anarazel.de> wrote: > > > + /* Loop until we have fetched all the buffers in which we need to write. */ > > > + while (size > 0) > > > + { > > > + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, RBM_NORMAL); > > > + xact_info->idx_undo_buffers[index++] = bufidx; > > > + size -= (BLCKSZ - starting_byte); > > > + starting_byte = UndoLogBlockHeaderSize; > > > + cur_blk++; > > > + } > > > > So, this locks a very large number of undo buffers at the same time, do > > I see that correctly? What guarantees that there are no deadlocks due > > to multiple buffers locked at the same time (I guess the order inside > > the log)? What guarantees that this is a small enough number that we can > > even lock all of them at the same time? > > I think we are locking them in the block order and that should avoid > the deadlock. I have explained in the comments. Sorry for harping on this so much: But please, please, *always* document things like this *immediately*. This is among the most crucial things to document. There shouldn't need to be a reviewer prodding you to do so many months after the code has been written. For one you've likely forgotten details by then, but more importantly dependencies on the locking scheme will have crept into further places - if it's not well thought through that can be hrad to undo. And it wastes reviewer / reader bandwidth. > > Why do we need to lock all of them at the same time? That's not clear to > > me. > > Because this is called outside the critical section so we keep all the > buffers locked what we want to update inside the critical section for > single wal record. I don't understand this explanation. What does keeping the buffers locked have to do with the critical section? As explained in a later email, I think the current approach is not acceptable - but even without those issues, I don't see why we couldn't just lock the buffers at a later stage? > > > + for (i = 0; i < context->nprepared_undo_buffer; i++) > > > + { > > > > How large do we expect this to get at most? > > > In BeginUndoRecordInsert we are computing it > > + /* Compute number of buffers. */ > + nbuffers = (nprepared + MAX_UNDO_UPDATE_INFO) * MAX_BUFFER_PER_UNDO; Since nprepared is variable, that doesn't really answer the question. Greetings, Andres Freund
Hi, On 2019-08-13 17:05:27 +0530, Dilip Kumar wrote: > On Mon, Aug 5, 2019 at 11:59 PM Andres Freund <andres@anarazel.de> wrote: > > (as I was out of context due to dealing with bugs, I've switched to > > looking at the current zheap/undoprocessing branch. > > > > On 2019-07-30 01:02:20 -0700, Andres Freund wrote: > > > +/* > > > + * Insert a previously-prepared undo records. > > > + * > > > + * This function will write the actual undo record into the buffers which are > > > + * already pinned and locked in PreparedUndoInsert, and mark them dirty. This > > > + * step should be performed inside a critical section. > > > + */ > > > > Again, I think it's not ok to just assume you can lock an essentially > > unbounded number of buffers. This seems almost guaranteed to result in > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > I think for controlling that we need to put a limit on max prepared > undo? I am not sure any other way of limiting the number of buffers > because we must lock all the buffer in which we are going to insert > the undo record under one WAL logged operation. I heard that a number of times. But I still don't know why that'd actually be true. Why would it not be sufficient to just lock the buffer currently being written to, rather than all buffers? It'd require a bit of care updating the official current "logical end" of a log, but otherwise ought to not be particularly hard? Only one backend can extend the log after all, and until the log is externally visibly extended, nobody can read or write those buffers, no? > > > > As far as I can tell there's simply no deadlock avoidance scheme in use > > here *at all*? I must be missing something. > > We are always locking buffer in block order so I am not sure how it > can deadlock? Am I missing something? Do we really in all circumstances? Note that we update the transinfo (and also progress) from earlier in the log. But my main point is more that there's no documented deadlock avoidance scheme. Which imo means there's none, because nobody will know to maintain it. > > > + /* > > > + * During recovery, there might be some blocks which are already > > > + * deleted due to some discard command so we can just skip > > > + * inserting into those blocks. > > > + */ > > > + if (!BufferIsValid(buffer)) > > > + { > > > + Assert(InRecovery); > > > + > > > + /* > > > + * Skip actual writing just update the context so that we have > > > + * write offset for inserting into next blocks. > > > + */ > > > + SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte); > > > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > > > + break; > > > + } > > > > How exactly can this happen? > > Suppose you insert one record for the transaction which split in > block1 and 2. Now, before this block is actually going to the disk > the transaction committed and become all visible the undo logs are > discarded. It's possible that block 1 is completely discarded but > block 2 is not because it might have undo for the next transaction. > Now, during recovery (FPW is off) if block 1 is missing but block 2 is > there so we need to skip inserting undo for block 1 as it does not > exist. Hm. I'm quite doubtful this is a good idea. How will this not force us to emit a lot more expensive durable operations while writing undo? And doesn't this reduce error detection quite remarkably? Thomas, Robert? > > > + /* Read the undo record.
*/ > > > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); > > > + > > > + /* Release the discard lock after fetching the record. */ > > > + if (!InHotStandby) > > > + LWLockRelease(&slot->discard_lock); > > > + } > > > + else > > > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); > > > > > > And then we do none of this in !one_page mode. > UndoBulkFetchRecord is always called from the aborted transaction so > its undo can never get discarded concurrently so ideally, we don't > need to check for discard. That's an undocumented assumption. Why would anybody reading the interface know that? > > > +static uint16 > > > +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > > > + UndoLogCategory category) > > > +{ > > > + UndoLogOffset page_offset = UndoRecPtrGetPageOffset(urp); > > > + BlockNumber cur_blk = UndoRecPtrGetBlockNum(urp); > > > + Buffer buffer = input_buffer; > > > + Page page = NULL; > > > + char *pagedata = NULL; > > > + char prevlen[2]; > > > + RelFileNode rnode; > > > + int byte_to_read = sizeof(uint16); > > > > Shouldn't it be byte_to_read? Err, *bytes*_to_read. > > And the sizeof a type that's tied with the actual undo format? > > Imagine we'd ever want to change the length format for undo records > > - this would be hard to find. > > Do you mean that we should not rely on undo format i.e. we should not > assume that undo length is stored at the end of the undo record? I was referencing the use of sizeof(uint16). I think this should either reference an UndoRecLen typedef or something like it, or use something roughly like #define member_size(type, member) (sizeof((type){0}.member)) and then have bytes_to_read be set to something like member_size(PackedUndoRecord, len) > > > + char persistence; > > > + uint16 prev_rec_len = 0; > > > + > > > + /* Get relfilenode. */ > > > + UndoRecPtrAssignRelFileNode(rnode, urp); > > > + persistence = RelPersistenceForUndoLogCategory(category); > > > + > > > + if (BufferIsValid(buffer)) > > > + { > > > + page = BufferGetPage(buffer); > > > + pagedata = (char *) page; > > > + } > > > + > > > + /* > > > + * Length if the previous undo record is store at the end of that record > > > + * so just fetch last 2 bytes. > > > + */ > > > + while (byte_to_read > 0) > > > + { > > > > Why does this need a loop around the number of bytes? Can there ever be > > a case where this is split across a record? If so, isn't that a bad idea > > anyway? > Yes, as of now, undo record can be split at any point even the undo > length can be split across 2 pages. I think we can reduce complexity > by making sure undo length doesn't get split across pages. I think we definitely should do that. I'd probably even include more than just the size in the header that's not allowed to be split across pages. > But for handling that while allocating the undo we need to detect this > whether the undo length can get split by checking the space in the > current page and the undo record length and based on that we need to > allocate 1 extra byte in the undo log. Seems that will add an extra > complexity. That seems fairly straightforward? Greetings, Andres Freund
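Spelled out, the member_size() suggestion is plain C99: a compound literal provides an object whose member sizeof can inspect, so no variable of the type is needed. A self-contained sketch, assuming a packed record whose trailing length field is a uint16 as in the code quoted above:

#include <stdint.h>
#include <stdio.h>

/* sizeof a struct member, via a compound literal */
#define member_size(type, member) (sizeof(((type){0}).member))

typedef struct PackedUndoRecord
{
	uint16_t	len;			/* stored at the end of each on-disk record */
	/* ... other packed fields would go here ... */
} PackedUndoRecord;

int
main(void)
{
	/* A later change of the length type needs updating only the struct. */
	size_t		bytes_to_read = member_size(PackedUndoRecord, len);

	printf("%zu\n", bytes_to_read);		/* prints 2 */
	return 0;
}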
Hi, On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote: > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > - I think there's two fairly fundamental, and related, problems with > > the sequence outlined above: > > > > - We can't search for target buffers to store undo data, while holding > > the "table" page content locked. > > > > The easy way to solve would be to require sites performing UNDO > > logging to acquire victim pages before even acquiring other content > > locks. Perhaps the better approach could be for the undo layer to > > hold onto a number of clean buffers, and to keep the last page in an > > already written to undo log pinned. > > > > - We can't search for victim buffers for further undo data while > > already holding other undo pages content locked. Doing so means that > > while we're e.g. doing IO to clean out the new page, old undo data > > on the previous page can't be read. > > > > This seems easier to fix. Instead of PrepareUndoInsert() acquiring, > > pinning and locking buffers, it'd need to be split into two > > operations. One that acquires buffers and pins them, and one that > > locks them. I think it's quite possible that the locking operation > > could just be delayed until InsertPreparedUndo(). But if we solve > > the above problem, most of this might already be solved. > > Basically, that means > - the caller should call PreparedUndoInsert before acquiring table > page content lock right? because the PreparedUndoInsert just compute > the size, allocate the space and pin+lock the buffers and for pinning > the buffers we must compute the size and allocate the space using undo > storage layer. I don't think we can normally pin the undo buffers properly at that stage. Without knowing the correct contents of the table page - which we can't know without holding some form of lock preventing modifications - we can't know how big our undo records are going to be. And we can't just have buffers that don't exist on disk in shared memory, and we don't want to allocate undo that we then don't need. So I think what we'd have to do at that stage, is to "pre-allocate" buffers for the maximum amount of UNDO needed, but mark the associated bufferdesc as not yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID would not be set. So at the start of a function that will need to insert undo we'd need to pre-reserve the maximum number of buffers we could potentially need. That reservation stage would a) pin the page with the current end of the undo b) if needed pin the page of older undo that we need to update (e.g. to update the next pointer) c) perform clock sweep etc to acquire (find or create) enough clean buffers to hold the maximum amount of undo needed. These buffers would be marked as !BM_TAG_VALID | BUF_REFCOUNT_ONE. I assume that we'd make a) cheap by keeping it pinned for undo logs that a backend is actively attached to. b) should only be needed once in a transaction, so it's not too bad. c) we'd probably need to amortize across multiple undo insertions, by keeping the unused buffers pinned until the end of the transaction. I assume that having the infrastructure for c) might also make some code already in postgres easier. There are obviously some issues around guaranteeing that the maximum number of such buffers isn't high. > - So basically, if we delay the lock till InsertPreparedUndo and call > PrepareUndoInsert before acquiring table page content lock this > problem is solved?
> > Although I haven't yet analyzed the AM specific part that whether it's > always possible to call the PrepareUndoInsert(basically getting all > the undo record ready) before the page content lock. But, I am sure > that won't be much difficult part. I think that is somewhere between not possible, and so expensive in a lot of cases that we'd not want to do it anyway. You'd at least have to first acquire a content lock on the page, mark the target tuple as locked, then unlock the page, reserve undo, lock the table page, actually update it. > > - When reading an undo record, the whole stage of UnpackUndoData() > > reading data into the UndoPackContext is omitted, reading directly > > into the UnpackedUndoRecord. That removes one further copy of the > > record format. > So we will read member by member to UnpackedUndoRecord? because in > context we have at least a few headers packed and we can memcpy one > header at a time like UndoRecordHeader, UndoRecordBlock. Well, right now you then copy them again later, so not much is gained by that (although that later copy can happen without the content lock held). As I think I suggested before, I suspect that the best way would be to just memcpy() the data from the page(s) into an appropriately sized buffer with the content lock held, and then perform unpacking directly into UnpackedUndoRecord. Especially with the bulk API that will avoid having to do much work with locks held, and reduce memory usage by only unpacking the record(s) in a batch that are currently being looked at. > But that just a few of them so if we copy field by field in the > UnpackedUndoRecord then we can get rid of copying in context then copy > it back to the UnpackedUndoRecord. Is this what you have in mind or > you want to store these structures (UndoRecordHeader, UndoRecordBlock) > directly into UnpackedUndoRecord? I at the moment see no reason not to? > > Greetings, Andres Freund
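To restate the reservation idea in code form, here is a very rough sketch of the interface shape it seems to imply; every name here (UndoBufferReservation, MAX_UNDO_BUFFERS_PER_OP and the three functions) is invented for illustration, and nothing like it exists in the patch set yet:

/* Hypothetical reservation API for the pre-pinning scheme described above. */
typedef struct UndoBufferReservation
{
	Buffer	bufs[MAX_UNDO_BUFFERS_PER_OP];	/* pinned; some may be !BM_TAG_VALID */
	int		nbufs;
} UndoBufferReservation;

/*
 * Call before taking any table page content lock: pin the current
 * end-of-undo page (a), the older undo page whose next-pointer may need
 * updating (b), and clock-sweep enough clean victim buffers for the
 * worst-case undo size (c).
 */
extern void UndoReserveBuffers(UndoBufferReservation *res, int max_needed);

/* Call once record sizes are known, with content locks held. */
extern Buffer UndoUseReservedBuffer(UndoBufferReservation *res, BlockNumber blk);

/* Call at end of transaction: drop pins on reserved-but-unused buffers. */
extern void UndoReleaseReservation(UndoBufferReservation *res);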
On Wed, Aug 14, 2019 at 10:35 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote: > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > > - I think there's two fairly fundamental, and related, problems with > > > the sequence outlined above: > > > > > > - We can't search for target buffers to store undo data, while holding > > > the "table" page content locked. > > > > > > The easy way to solve would be to require sites performing UNDO > > > logging to acquire victim pages before even acquiring other content > > > locks. Perhaps the better approach could be for the undo layer to > > > hold onto a number of clean buffers, and to keep the last page in an > > > already written to undo log pinned. > > > > > > - We can't search for victim buffers for further undo data while > > > already holding other undo pages content locked. Doing so means that > > > while we're e.g. doing IO to clean out the new page, old undo data > > > on the previous page can't be read. > > > > > > This seems easier to fix. Instead of PrepareUndoInsert() acquiring, > > > pinning and locking buffers, it'd need to be split into two > > > operations. One that acquires buffers and pins them, and one that > > > locks them. I think it's quite possible that the locking operation > > > could just be delayed until InsertPreparedUndo(). But if we solve > > > the above problem, most of this might already be solved. > > > > Basically, that means > > - the caller should call PreparedUndoInsert before acquiring table > > page content lock right? because the PreparedUndoInsert just compute > > the size, allocate the space and pin+lock the buffers and for pinning > > the buffers we must compute the size and allocate the space using undo > > storage layer. > > I don't think we can normally pin the undo buffers properly at that > stage. Without knowing the correct contents of the table page - which we > can't know without holding some form of lock preventing modifications - > we can't know how big our undo records are going to be. And we can't > just have buffers that don't exist on disk in shared memory, and we > don't want to allocate undo that we then don't need. So I think what > we'd have to do at that stage, is to "pre-allocate" buffers for the > maximum amount of UNDO needed, but mark the associated bufferdesc as not > yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID > would not be set. > > So at the start of a function that will need to insert undo we'd need to > pre-reserve the maximum number of buffers we could potentially > need. That reservation stage would Maybe we can provide an interface where the caller will input the max prepared undo (maybe in BeginUndoRecordInsert) and based on that we can compute the max number of buffers we could potentially need for this particular operation. Most operations (insert/update/delete) will need 1 or 2 undo records, so we can avoid pinning a very high number of buffers in most cases. Currently, only the multi-insert implementation of zheap might need multiple undo records (1 undo record per range of records). > a) pin the page with the current end of the undo > b) if needed pin the page of older undo that we need to update (e.g. to > update the next pointer) > c) perform clock sweep etc to acquire (find or create) enough clean buffers to > hold the maximum amount of undo needed. These buffers would be marked > as !BM_TAG_VALID | BUF_REFCOUNT_ONE.
> > I assume that we'd make a) cheap by keeping it pinned for undo logs that > a backend is actively attached to. b) should only be needed once in a > transaction, so it's not too bad. c) we'd probably need to amortize > across multiple undo insertions, by keeping the unused buffers pinned > until the end of the transaction. > > I assume that having the infrastructure for c) might also make some code > already in postgres easier. There are obviously some issues around > guaranteeing that the maximum number of such buffers isn't high. > > > > - So basically, if we delay the lock till InsertPreparedUndo and call > > PrepareUndoInsert before acquiring table page content lock this > > problem is solved? > > > > Although I haven't yet analyzed the AM specific part that whether it's > > always possible to call the PrepareUndoInsert(basically getting all > > the undo record ready) before the page content lock. But, I am sure > > that won't be much difficult part. > > I think that is somewhere between not possible, and so expensive in a > lot of cases that we'd not want to do it anyway. You'd at least have to > first acquire a content lock on the page, mark the target tuple as > locked, then unlock the page, reserve undo, lock the table page, > actually update it. > > > > > - When reading an undo record, the whole stage of UnpackUndoData() > > > reading data into the UndoPackContext is omitted, reading directly > > > into the UnpackedUndoRecord. That removes one further copy of the > > > record format. > > So we will read member by member to UnpackedUndoRecord? because in > > context we have at least a few headers packed and we can memcpy one > > header at a time like UndoRecordHeader, UndoRecordBlock. > > Well, right now you then copy them again later, so not much is gained by > that (although that later copy can happen without the content lock > held). As I think I suggested before, I suspect that the best way would > be to just memcpy() the data from the page(s) into an appropriately > sized buffer with the content lock held, and then perform unpacking > directly into UnpackedUndoRecord. Especially with the bulk API that will > avoid having to do much work with locks held, and reduce memory usage by > only unpacking the record(s) in a batch that are currently being looked > at. ok. > > > But that just a few of them so if we copy field by field in the > > UnpackedUndoRecord then we can get rid of copying in context then copy > > it back to the UnpackedUndoRecord. Is this what you have in mind or > > you want to store these structures (UndoRecordHeader, UndoRecordBlock) > > directly into UnpackedUndoRecord? > > I at the moment see no reason not to? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
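The memcpy-with-lock-held, unpack-later shape described in the quoted paragraph might look roughly like this; UnpackUndoRecordFromBuffer is a hypothetical helper, and the real unpack machinery in the patches is organized differently:

/* Sketch: cheap copy under the content lock, unpacking done without it. */
static void
copy_undo_span(Buffer buf, int start, int len, char *dst)
{
	LockBuffer(buf, BUFFER_LOCK_SHARE);
	memcpy(dst, (char *) BufferGetPage(buf) + start, len);
	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}

/* Caller side: */
UnpackedUndoRecord uur;
char	   *copy = palloc(len);

copy_undo_span(buffer, starting_byte, len, copy);	/* lock held only here */
UnpackUndoRecordFromBuffer(&uur, copy, len);		/* hypothetical; no lock */
pfree(copy);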
On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > I think that batch reading should just copy the underlying data into a > > char* buffer. Only the records that currently are being used by > > higher layers should get exploded into an unpacked record. That will > > reduce memory usage quite noticeably (and I suspect it also drastically > > reduce the overhead due to a large context with a lot of small > > allocations that then get individually freed). > > Ok, I got your idea. I will analyze it further and work on this if > there is no problem. I think there is one problem: currently while unpacking the undo record, if the record is compressed (i.e. some of the fields do not exist in the record) then we read those fields from the first record on the page. But, if we just memcpy the undo pages to the buffers and delay the unpacking until it's needed, it seems that we would need to know the page boundary, and also we need to know the offset of the first complete record on the page from where we can get that information (which is currently in the undo page header). Even if we leave this issue apart, as of now I am not very clear what benefit you are seeing in the way you are describing compared to the way I am doing it now: a) Is it the multiple pallocs? If so then we can allocate memory at once and flatten the undo records in that. Earlier, I was doing that, but we need to align each unpacked undo record so that we can access them directly, and based on Robert's suggestion I have modified it to multiple pallocs. b) Is it the memory size problem, that the unpacked undo record will take more memory compared to the packed record? c) Do you think that we will not need to unpack all the records? But, I think eventually, at the higher level we will have to unpack all the undo records (I understand that it will be one at a time). Or am I completely missing something here? > > That will make the > > sorting of undo a bit more CPU inefficient, because individual records > > will need to be partially unpacked for comparison, but I think that's > > going to be a far smaller loss than the win. > Right. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
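For the record, the flatten-into-one-allocation variant mentioned in (a) would look roughly like this; record_size() and unpack_record_into() are invented placeholders, while MAXALIGN is the usual PostgreSQL alignment macro that keeps each record directly addressable:

/* Sketch: one palloc for a whole batch of unpacked records. */
Size		total = 0;
char	   *base,
		   *cur;
UnpackedUndoRecord **recs;

for (int i = 0; i < nrecords; i++)
	total += MAXALIGN(record_size(i));		/* record_size is hypothetical */

base = palloc(total);
recs = palloc(nrecords * sizeof(UnpackedUndoRecord *));

cur = base;
for (int i = 0; i < nrecords; i++)
{
	recs[i] = (UnpackedUndoRecord *) cur;
	unpack_record_into(recs[i], i);			/* hypothetical */
	cur += MAXALIGN(record_size(i));		/* keep each record aligned */
}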
Hi Amit, I've combined three of your messages into one below, and responded inline. New patch set to follow shortly, with the fixes listed below (and others from other reviewers). On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > 0003-Add-undo-log-manager.patch > 1. > allocate_empty_undo_segment() ... > + /* create two parents up if not exist */ > + parentdir = pstrdup(undo_path); > + get_parent_directory(parentdir); > + get_parent_directory(parentdir); > + /* Can't create parent and it doesn't already exist? */ > + if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST) > > > All of this code is almost same as we have code in > TablespaceCreateDbspace still we have small differences like here you > are using mkdir instead of MakePGDirectory which as far as I can see > use similar permissions for creating directory. Also, it checks > whether the directory exists before trying to create it. Is there a > reason why we need to do a few things differently here if not, they > can both the places use one common function? Right, MakePGDirectory() arrived in commit da9b580d8990, and should probably be used everywhere we create directories under pgdata. Fixed. Yeah, I think we could just use TablespaceCreateDbspace() for this, if we are OK with teaching GetDatabasePath() and GetRelationPath() about how to make special undo paths, OR we are OK with just using "standard" paths, where undo files just live under database 9 (instead of the special "undo" directory). I stopped using a "9" directory in earlier versions because undo moved to a separate namespace when we agreed to use an extra discriminator in buffer tags and so forth; now that we're back to using database number 9, the question of whether to reflect that on the filesystem is back. I have had some trouble deciding which parts of the system should treat undo logs as some kind of 'relation' (and the SLRU project will have to tackle the same questions). I'll think about that some more before making the change. > 2. > allocate_empty_undo_segment() > { > .. > .. > /* Flush the contents of the file to disk before the next checkpoint. */ > + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace); > .. > } > The comment in allocate_empty_undo_segment indicates that the code > wants to flush before checkpoint, but the actual function tries to > register the request with checkpointer. Shouldn't this be similar to > XLogFileInit where we use pg_fsync to flush the contents immediately? I responded to the general question about when we sync files in an earlier email. I've updated the comments to make it clearer that it's handing the work off, not doing it now. > Another thing is that recently in commit 475861b261 (commit by you), > we have introduced a mechanism to not fill the files with zero's for > certain filesystems like ZFS. Do we want similar behavior for undo > files? Good point. I will create a separate thread to discuss how the creation of a central file allocation routine (possibly with a GUC), and see if we can come up with something reusable for this, but independently committable. > 3. 
> +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end) > +{ > + UndoLogSlot *slot; > + size_t end; > + > + slot = find_undo_log_slot(logno, false); > + > + /* TODO review interlocking */ > + > + Assert(slot != NULL); > + Assert(slot->meta.end % UndoLogSegmentSize == 0); > + Assert(new_end % UndoLogSegmentSize == 0); > + Assert(InRecovery || > + CurrentSession->attached_undo_slots[slot->meta.category] == slot); > > Can you write some comments explaining the above Asserts? Also, can > you explain what interlocking issues are you worried about here? I added comments about the assertions. I will come back to the interlocking in another message, which I've now addressed (alluded to below as well). > 4. > while (end < new_end) > + { > + allocate_empty_undo_segment(logno, slot->meta.tablespace, end); > + end += UndoLogSegmentSize; > + } > + > + /* Flush the directory entries before next checkpoint. */ > + undofile_request_sync_dir(slot->meta.tablespace); > > I see that at two places after allocating empty undo segment, the > patch performs undofile_request_sync_dir whereas it doesn't perform > the same in UndoLogNewSegment? Is there a reason for the same or is it > missed from one of the places? You're right. Done. > 5. > +static void > +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end) > { > .. > /* > + * We didn't need to acquire the mutex to read 'end' above because only > + * we write to it. But we need the mutex to update it, because the > + * checkpointer might read it concurrently. > > Is this assumption correct? It seems patch also modified > slot->meta.end during discard in function UndoLogDiscard. I am > referring below code: > > +UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid) > { > .. > + /* Update shmem to show the new discard and end pointers. */ > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + slot->meta.discard = discard; > + slot->meta.end = end; > + LWLockRelease(&slot->mutex); > .. > } Yeah, the assumption was wrong, and that's what that other TODO note about interlocking was referring to. I have redesigned this so that there is a separate per-undo log extend_lock that allows UndoLogDiscard() (in a background worker or superuser command) and UndoLogAllocate() to serialise extension of the undo log. Hopefully foreground processes don't often have to wait (a discard worker will recycle segments fast enough), but if it ever does have to wait, it's waiting for another backend to rename() a fully allocated file, which is hopefully still better than writing a load of zeroes into a new file. > 6. > extend_undo_log() > { > .. > .. > if (!InRecovery) > + { > + xl_undolog_extend xlrec; > + XLogRecPtr ptr; > + > + xlrec.logno = logno; > + xlrec.end = end; > + > + XLogBeginInsert(); > + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); > + ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND); > + XLogFlush(ptr); > + } > > It is not obvious to me why we need to perform XLogFlush here, can you explain? It's not needed, and I've removed it. The important thing here is that we insert the record after creating the files and telling the checkpointer to flush them; there's no benefit to flushing the WAL record. 
There are three crash recovery possibilities: (1) We recover from a checkpoint after this record, and the files are already durable, (2) we recover from a checkpoint before this record, replay this record, the file(s) may or may not be present but we'll tolerate them if they are and overwrite, (3) we recover from a checkpoint before this record was written, and this WAL record is never replayed because it wasn't flushed, and then there may or may not be some orphaned files but we'll eventually try to create files with the same names as we extend the undo log and tolerate their existence. > 7. > +attach_undo_log(UndoLogCategory category, Oid tablespace) > { > .. > if (candidate->meta.tablespace == tablespace) > + { > + logno = *place; > + slot = candidate; > + *place = candidate->next_free; > + break; > + } > > Here, the code is breaking from the loop, so why do we need to set > *place? Am I missing something obvious? (See further down). > 8. > + /* WAL-log the creation of this new undo log. */ > + { > + xl_undolog_create xlrec; > + > + xlrec.logno = logno; > + xlrec.tablespace = slot->meta.tablespace; > + xlrec.category = slot->meta.category; > + > + XLogBeginInsert(); > + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); > > Here and in most other places in this patch you are using > sizeof(xlrec) for xlog static data. However, as far as I know in > other places in the code we define the size using offset of the last > parameter of corresponding structure to avoid any inconsistency in WAL > record size across different platforms. Is there a reason to go > differently with this patch? See below one for example: > > typedef struct xl_hash_add_ovfl_page > { > uint16 bmsize; > bool bmpage_found; > } xl_hash_add_ovfl_page; > > #define SizeOfHashAddOvflPage > \ > (offsetof(xl_hash_add_ovfl_page, bmpage_found) + sizeof(bool)) I see. Apparently we don't always do that: tmunro@dogmatix $ git grep RegisterData | grep sizeof | wc -l 60 tmunro@dogmatix $ git grep RegisterData | grep Size | wc -l 63 I've now done it for all of these structs so that we trim the padding in some cases, even though in some cases it'll make no difference. > 9. > +static void > +undolog_xlog_create(XLogReaderState *record) > +{ > + xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record); > + UndoLogSlot *slot; > + > + /* Create meta-data space in shared memory. */ > + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); > + > + /* TODO: assert that it doesn't exist already? */ > + > + slot = allocate_undo_log_slot(); > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > > Why do we need to acquire locks during recovery? Mostly because allocate_undo_log_slot() asserts that the lock is held. We probably don't need this in recovery but it doesn't seem like a problem, it'll never be contended. > 10. > I think UndoLogAllocate can leak allocation of slots. It first > allocates the slot for a new log from the free pool in there is no > existing slot/log, writes a WAL record and then at a later point of > time it actually creates the required physical space in the log via > extend_undo_log which also writes a separate WAL. Now, if there is a > error between these two operations, then we will have a redundant slot > allocated. What if there are repeated errors for similar thing from > multiple backends after which system crashes. Now, after restart, we > will allocate multiple slots for different lognos which don't have any > actual (physical) logs. 
This might not be a big problem in practice > because the chances of error between two operations are less, but > can't we delay the WAL logging for allocation of a slot for a new log. I don't think it leaks anything, and the undo log is not redundant, it's free/available. An undo log is allowed to have no space allocated (discard == end). In fact that can happen a few different ways: for example after crash recovery, we unlink all files belonging to unlogged undo logs by scanning the filesystem, and set discard == end, on a segment boundary. That's also the way new undo logs are born, and I think that's OK. If you crash and recover up to the point the undo log creation was WAL-logged, you'll now have a log with no space, and then the first person to try to allocate something in it will extend it (= create the file, move the end pointer) in the process of allocating space. > 11. > +UndoLogAllocate() > { > .. > .. > + /* > + * Maintain our tracking of the and the previous transaction start > + * locations. > + */ > + if (slot->meta.unlogged.this_xact_start != slot->meta.unlogged.insert) > + { > + slot->meta.unlogged.last_xact_start = > + slot->meta.unlogged.this_xact_start; > + slot->meta.unlogged.this_xact_start = slot->meta.unlogged.insert; > + } > > ".. of the and the ..", after first the, something is missing. Fixed. > 12. > UndoLogAllocate() > { > .. > .. > + /* > + * We don't need to acquire log->mutex to read log->meta.insert and > + * log->meta.end, because this backend is the only one that can > + * modify them. > + */ > + if (unlikely(new_insert > slot->meta.end)) > > I might be confused but slot->meta.end is modified by discard process > also, so how is it safe? If so, may be adding a comment to explain > the same would be good. Also, I think in the comments log should be > replaced with the slot. Right, now fixed. I fixed s/log->/slot->/ here and elsewhere in comments. > 13. > UndoLogAllocate() > { > .. > + /* This undo log is entirely full. Get a new one. */ > + if (logxid == GetTopTransactionId()) > + { > + /* > + * If the same transaction is split over two undo logs then > + * store the previous log number in new log. See detailed > + * comments in undorecord.c file header. > + */ > .. > } > > The undorecord.c should be renamed to undoaccess.c Fixed. > 14. > UndoLogAllocate() > { > .. > + if (logxid != GetTopTransactionId()) > + { > + /* > + * While we have the lock, check if we have been forcibly detached by > + * DROP TABLESPACE. That can only happen between transactions (see > + * DropUndoLogsInsTablespace()). > + */ > > /DropUndoLogsInsTablespace/DropUndoLogsInTablespace Fixed. > 15. > UndoLogSegmentPath() > { > .. > /* > + * Build the path from log number and offset. The pathname is the > + * UndoRecPtr of the first byte in the segment in hexadecimal, with a > + * period inserted between the components. > + */ > + snprintf(path, MAXPGPATH, "%s/%06X.%010zX", dir, logno, > + segno * UndoLogSegmentSize); > .. > } > > a. It is not very clear from the above code why are we multiplying > segno with UndoLogSegmentSize? I see that many of the callers pass > segno as segno/UndoLogSegmentSize. Won't it be better if the caller > take care of passing correct value of segno? We want "the UndoRecPtr of the first byte in the segment [...] with a period inserted between the components". Seems clear? 
So undo log 7, segno 0 will be 000007.0000000000 and undo log 7, segno 1 will be 000007.0000100000, and UndoRecPtr of its first byte is at 0000070000100000 (so when you're looking at pg_stat_undo_logs or undoinspect() or any other representation of undo record pointers, you can easily see which files they are referring to). It's true that we could pass in the offset of the first byte, instead of the segment number, but some other callers have a segment number (see undofile.c). > b. In the comment above, instead of offset, shouldn't there be segment number. No, segno * segment size == offset (the offset part of an UndoRecPtr is the lower 48 bits; the upper 24 bits are the undo log number). > 16. UndoLogGetLastXactStartPoint is not used any where. I think this > was required in previous version of patchset, now, we can remove it. Done, thanks. > 17. > Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com > > This discussion link seems to be from old discussion/thread, not this one. Will reference this one. > 0019-Add-developer-documentation-for-the-undo-log-storage > 18. > +each undo log, a set of meta-data properties is tracked: > +tracked, including: > + > +* the tablespace that holds its segment files > +* the persistence level (permanent, unlogged or temporary) > > Here, don't we want to refer to UndoLogCategory rather than > persistence level? "tracked, including:" seems bit confusing. Fixed here and elsewhere. > 0020-Add-user-facing-documentation-for-undo-logs > 19. > <row> > + <entry><structfield>persistence</structfield></entry> > + <entry><type>text</type></entry> > + <entry>Persistence level of data stored in this undo log; one of > + <literal>permanent</literal>, <literal>unlogged</literal> or > + <literal>temporary</literal>.</entry> > + </row> > > Don't we want to cover the new (shared) undolog category here? Done (though I have mixed feelings about this shared category; more on that soon). On Thu, Jul 25, 2019 at 12:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 7. > > +attach_undo_log(UndoLogCategory category, Oid tablespace) > > { > > .. > > if (candidate->meta.tablespace == tablespace) > > + { > > + logno = *place; > > + slot = candidate; > > + *place = candidate->next_free; > > + break; > > + } > > > > Here, the code is breaking from the loop, so why do we need to set > > *place? Am I missing something obvious? > > > > I think I know what I was missing. It seems here you are removing an > element from the freelist. Right. > One point related to detach_current_undo_log. > > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + slot->pid = InvalidPid; > + slot->meta.unlogged.xid = InvalidTransactionId; > + if (full) > + slot->meta.status = UNDO_LOG_STATUS_FULL; > + LWLockRelease(&slot->mutex); > > If I read the comments in structure UndoLogMetaData, it is mentioned > that 'status' is changed by explicit WAL record whereas there is no > WAL record in code to change the status. I see the problem as well if > we don't WAL log this change. Suppose after changing the status of > this log, we allocate a new log and insert some records in that log as > well for the same transaction for which we have inserted records in > the log which we just marked as FULL. Now, here we form the link > between two logs as the same transaction has overflowed into a new > log. 
Say, we crash after this. Now, after recovery the log won't be > marked as FULL which means there is a chance that it can be used for > some other transaction, if that happens, then our link for a > transaction spanning to different log will break and we won't be able > to access the data in another log. In short, I think it is important > to WAL log this status change unless I am missing something. I thought it was OK to relax that and was going to just fix the comment, but the case you describe seems important. It seems we could fix that either by WAL-logging the status changes as you said, or by making sure the links have enough information to handle that. I'll think about that some more. On Thu, Jul 25, 2019 at 10:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Some more review of the same patch: > 1. > +typedef struct UndoLogSharedData > +{ > + UndoLogNumber free_lists[UndoLogCategories]; > + UndoLogNumber low_logno; > > What is the use of low_logno? I don't see anywhere in the code this > being assigned any value. Is it for some future use? Yeah, fixed, and now used. It reduces the need for 'negative cache entries' after a backend has been running for a very long time. > 2. > +void > +CheckPointUndoLogs(XLogRecPtr checkPointRedo, XLogRecPtr priorCheckPointRedo) > { > .. > + /* Compute header checksum. */ > + INIT_CRC32C(crc); > + COMP_CRC32C(crc, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno)); > + COMP_CRC32C(crc, &UndoLogShared->next_logno, > sizeof(UndoLogShared->next_logno)); > + COMP_CRC32C(crc, &num_logs, sizeof(num_logs)); > + FIN_CRC32C(crc); > + > + /* Write out the number of active logs + crc. */ > + if ((write(fd, &UndoLogShared->low_logno, > sizeof(UndoLogShared->low_logno)) != sizeof(UndoLogShared->low_logno)) > || > + (write(fd, &UndoLogShared->next_logno, > sizeof(UndoLogShared->next_logno)) != > sizeof(UndoLogShared->next_logno)) || > > Is it safe to read UndoLogShared without UndoLogLock? All other > places accessing UndoLogShared uses UndoLogLock, so if this usage is > safe, maybe it is better to add a comment. Fixed for next_logno. And the other one, low_logno is no longer written to disk (it can be computed). > 3. > UndoLogAllocateInRecovery() > { > .. > /* > + * Otherwise we need to do our own transaction tracking > + * whenever we see a new xid, to match the logic in > + * UndoLogAllocate(). > + */ > + if (xid != slot->meta.unlogged.xid) > + { > + slot->meta.unlogged.xid = xid; > + if (slot->meta.unlogged.this_xact_start != slot->meta.unlogged.insert) > + slot->meta.unlogged.last_xact_start = > + slot->meta.unlogged.this_xact_start; > + slot->meta.unlogged.this_xact_start = > + slot->meta.unlogged.insert; > > The code doesn't follow the comment. In UndoLogAllocate, both > last_xact_start and this_xact_start are assigned in if block, so the > should be the case here. True, in "do" I only did the assignment if the values were different, and in "redo" I did the assignment even if they were the same, which has the same effect, but is indeed distracting. I've made them the same. > 4. > UndoLogAllocateInRecovery() > { > .. > + /* > + * Just as in UndoLogAllocate(), the caller may be extending an existing > + * allocation before committing with UndoLogAdvance(). > + */ > + if (context->try_location != InvalidUndoRecPtr) > + { > .. > } > > I am not sure how will this work because unlike UndoLogAllocate, this > function doesn't set try_location initially.
It will be set later by > UndoLogAdvance which can easily go wrong because that doesn't include > UndoLogBlockHeaderSize. Hmm, yeah, I need to look into that some more. The 'advance' function does consider header bytes, though. I do admit that this code is very hard to follow. It got that way by being developed before the 'context' existed. It needs to be rewritten in a much clearer way; I'm going to do that. > 5. > +UndoLogAdvance(UndoLogAllocContext *context, size_t size) > +{ > + context->try_location = UndoLogOffsetPlusUsableBytes(context->try_location, > + size); > +} > > Here, you are using UndoRecPtr whereas UndoLogOffsetPlusUsableBytes > expects offset. Yeah that is ugly. I created UndoRecPtrPlusUsableBytes(). > 6. > UndoLogAllocateInRecovery() > { > .. > + /* > + * At this stage we should have an undo log that can handle this > + * allocation. If we don't, something is screwed up. > + */ > + if (UndoLogOffsetPlusUsableBytes(slot->meta.unlogged.insert, size) > > slot->meta.end) > + elog(ERROR, > + "cannot allocate %d bytes in undo log %d", > + (int) size, slot->logno); > .. > } > > Similar to point-5, here you are using a pointer instead of offset. Fixed. > 7. > UndoLogAllocateInRecovery() > { > .. > + /* We found a reference to a different (or first) undo log. */ > + slot = find_undo_log_slot(logno, false); > .. > + /* TODO: check locking against undo log slot recycling? */ > .. > } > > I think it is better to have an Assert here that slot can't be NULL. > AFAICS, slot can't be NULL unless there is some bug. I don't > understand this 'TODO' comment. Yeah. I just removed it. "slot" was already dereferenced above so an assertion that it's not NULL is too late to have any useful effect. > 8. > + { > + {"undo_tablespaces", PGC_USERSET, > CLIENT_CONN_STATEMENT, > + gettext_noop("Sets the > tablespace(s) to use for undo logs."), > + NULL, > + > GUC_LIST_INPUT | GUC_LIST_QUOTE > + }, > + > &undo_tablespaces, > + "", > + check_undo_tablespaces, > assign_undo_tablespaces, NULL > + }, > > It seems you need to update variable_is_guc_list_quote for this variable. Huh. Right. Done. > 9. > +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end) > { > .. > + if (!InRecovery) > + { > + xl_undolog_extend xlrec; > + XLogRecPtr ptr; > + > + xlrec.logno = logno; > + xlrec.end = end; > + > + XLogBeginInsert(); > + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); > + ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND); > + XLogFlush(ptr); > + } > .. > } > > Do we need it for temporary/unlogged persistence level? Similarly, > there is a WAL logging in attach_undo_log which I can't understand why > it would be required for temporary/unlogged persistence levels. You're right, if we crash we don't care about any data in temporary/unlogged undo logs, since that data belongs to temporary/unlogged zheap (etc) tables. We destroy all of their files at startup in ResetUndoLogs(). So therefore we might as well not bother to log the extension stuff. Done. > 10. > +choose_undo_tablespace(bool force_detach, Oid *tablespace) > { > .. > + oid = get_tablespace_oid(name, true); > + if (oid == InvalidOid) > .. > } > > Do we need to check permissions to see if the current user is allowed > to create in this tablespace? Yeah, right. I added a pg_tablespace_aclcheck() check postgres=> set undo_tablespaces = ts1; SET postgres=> create table t (); ERROR: permission denied for tablespace ts1 > 11.
> +static bool > +choose_undo_tablespace(bool force_detach, Oid *tablespace) > +{ > + char *rawname; > + List *namelist; > + bool > need_to_unlock; > + int length; > + int > i; > + > + /* We need a modifiable copy of string. */ > + rawname = > pstrdup(undo_tablespaces); > > I don't see the usage of rawname outside this function, isn't it > better to free it? I understand that this function won't be called > frequently enough to matter, but still, there is some theoretical > danger if the user continuously changes undo_tablespaces. Fixed by freeing both rawname and namelist. > 12. > +find_undo_log_slot(UndoLogNumber logno, bool locked) > { > .. > + * TODO: We could track the lowest known undo log > number, to reduce > + * the negative cache entry bloat. > + */ > + if (result == NULL) > + { > .. > } > > Do we have any mechanism to clear this bloat or will it stay till the > end of the session? If it is later, then I think it might be good to > take care of this TODO. I think this is not a blocker, but good to > have kind of stuff. I did the TODO, so now we can drop negative cache entries below low_logno from the cache. There are probably more things we could do here to be more aggressive but that's a start. > 13. > +static void > +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace, > + UndoLogOffset end) > { > .. > } > > What will happen if the transaction creating undolog segment rolls > back? Do we want to have pendingDeletes stuff as we have for normal > relation files? This might also help in clearing the shared memory > state (undo log slots) if any. No, that's non-transactional. The undo log segment remains created, just like various other things stay permanently even if the transaction that created them aborts (relation extension, btree splits, ...). -- Thomas Munro https://enterprisedb.com
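As a concrete illustration of the offsetof-based sizing convention from item 8 above, applied to the xl_undolog_create record quoted earlier (SizeOfUndologCreate is an assumed macro name, following the existing SizeOfHashAddOvflPage pattern):

typedef struct xl_undolog_create
{
	UndoLogNumber logno;
	Oid			tablespace;
	UndoLogCategory category;	/* last field used by redo */
} xl_undolog_create;

/* Size up to and including the last field, excluding trailing padding. */
#define SizeOfUndologCreate \
	(offsetof(xl_undolog_create, category) + sizeof(UndoLogCategory))

/* ... */
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfUndologCreate);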
Hi Kuntal, On Thu, Jul 25, 2019 at 5:40 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > Here are some review comments on 0003-Add-undo-log-manager.patch. I've > tried to avoid duplicate comments as much as possible. Thanks! Replies inline. I'll be posting a new patch set shortly with these and other fixes. > 1. In UndoLogAllocate, > + * time this backend as needed to write to an undo log at all or because > s/as/has Fixed. > + * Maintain our tracking of the and the previous transaction start > Do you mean current log's transaction start as well? Right, fixed. > 2. In UndoLogAllocateInRecovery, > we try to find the current log from the first undo buffer. So, after a > log switch, we always have to register at least one buffer from the > current undo log first. If we're updating something in the previous > log, the respective buffer should be registered after that. I think we > should document this in the comments. I'm not sure I understand. Is this working correctly today? > 3. In UndoLogGetOldestRecord(UndoLogNumber logno, bool *full), > it seems the 'full' parameter is not used anywhere. Do we still need this? > > + /* It's been recycled. SO it must have been entirely discarded. */ > s/SO/So Fixed. > 4. In CleanUpUndoCheckPointFiles, > we can emit a debug2 message with something similar to : 'removed > unreachable undo metadata files' Done. > + if (unlink(path) != 0) > + elog(ERROR, "could not unlink file \"%s\": %m", path); > according to my observation, whenever we deal with a file operation, > we usually emit a ereport message with errcode_for_file_access(). > Should we change it to ereport? There are other file operations as > well including read(), OpenTransientFile() etc. Done. > 5. In CheckPointUndoLogs, > + /* Capture snapshot while holding each mutex. */ > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + serialized[num_logs++] = slot->meta; > + LWLockRelease(&slot->mutex); > why do we need an exclusive lock to read something from the slot? A > share lock seems to be sufficient. OK. > pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC) is called > after pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE) > without calling pgstat_report_wait_end(). I think you've done the > same to avoid an extra function call. But, it differs from other > places in the PG code. Perhaps, we should follow this approach > everywhere. Ok, changed. > 6. In StartupUndoLogs, > + if (fd < 0) > + elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path); > assuming your agreement upon changing above elog to ereport, the > message should be more user friendly. May be something like 'cannot > open pg_undo file'. Done. > + if ((size = read(fd, &slot->meta, sizeof(slot->meta))) != sizeof(slot->meta)) > The usage of sizeof doesn't look like a problem. But, we can save > some extra padding bytes at the end if we use (offsetof + sizeof) > approach similar to other places in PG. It currently ends in a 64-bit value, so there is no padding here. > 7. In free_undo_log_slot, > + /* > + * When removing an undo log from a slot in shared memory, we acquire > + * UndoLogLock, log->mutex and log->discard_lock, so that other code can > + * hold any one of those locks to prevent the slot from being recycled.
> + */ > + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + Assert(slot->logno != InvalidUndoLogNumber); > + slot->logno = InvalidUndoLogNumber; > + memset(&slot->meta, 0, sizeof(slot->meta)); > + LWLockRelease(&slot->mutex); > + LWLockRelease(UndoLogLock); > you've not taken the discard_lock as mentioned in the comment. Right, I was half-way between two different ideas about how that interlocking should work, but I have straightened this out now, and will write about the overall locking model separately. > 8. In find_undo_log_slot, > + * 1. If the calling code knows that it is attached to this lock or is the s/lock/slot Fixed. BTW I am experimenting with macros that would actually make assertions about those programming rules. > + * 2. All other code should acquire log->mutex before accessing any members, > + * and after doing so, check that the logno hasn't moved. If it is not, the > + * entire undo log must be assumed to be discarded (as if this function > + * returned NULL) and the caller must behave accordingly. > Perhaps, you meant '..check that the logno remains same. If it is not..'. Fixed. > + /* > + * If we didn't find it, then it must already have been entirely > + * discarded. We create a negative cache entry so that we can answer > + * this question quickly next time. > + * > + * TODO: We could track the lowest known undo log number, to reduce > + * the negative cache entry bloat. > + */ > This is an interesting thought. But, I'm wondering how we are going to > search the discarded logno in the simple hash. I guess that's why it's > in the TODO list. Done. Each backend tracks its idea of the lowest undo log that exists. There is a shared low_logno that is recomputed whenever a slot is freed (ie a log is entirely discarded). Whenever a backend sees that its own value is too low, it walks forward dropping cache entries. Perhaps this could be made more proactive later by using sinval, but I didn't look into that. > 9. In attach_undo_log, > + * For now we have a simple linked list of unattached undo logs for each > + * persistence level. We'll grovel though it to find something for the > + * tablespace you asked for. If you're not using multiple tablespaces s/though/through Fixed. > + if (slot == NULL) > + { > + if (UndoLogShared->next_logno > MaxUndoLogNumber) > + { > + /* > + * You've used up all 16 exabytes of undo log addressing space. > + * This is a difficult state to reach using only 16 exabytes of > + * WAL. > + */ > + elog(ERROR, "undo log address space exhausted"); > + } > looks like a potential unlikely() condition. Done. Yeah, actually every branch containing an unconditional elog() at ERROR or higher (or maybe even lower) must surely be considered unlikely, and it'd be nice to tell the leading compilers about that, but the last thread about that hasn't made it as far as a useful patch for some technical reason that didn't seem fatal to the concept, IIRC. I'd be curious to know what sort of effect that sort of rule would have on the whole tree, in terms of code locality, even if you have to hack the compiler to find out... -- Thomas Munro https://enterprisedb.com
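The unlikely() change mentioned for attach_undo_log is tiny; PostgreSQL's c.h already provides the macro (expanding to __builtin_expect on supported compilers), so the branch just becomes, roughly:

if (unlikely(UndoLogShared->next_logno > MaxUndoLogNumber))
{
	/*
	 * You've used up all 16 exabytes of undo log addressing space.
	 * This is a difficult state to reach using only 16 exabytes of WAL.
	 */
	elog(ERROR, "undo log address space exhausted");
}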
Hi, On 2019-08-16 09:44:25 +0530, Dilip Kumar wrote: > On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > > > I think that batch reading should just copy the underlying data into a > > > char* buffer. Only the records that currently are being used by > > > higher layers should get exploded into an unpacked record. That will > > > reduce memory usage quite noticeably (and I suspect it also drastically > > > reduce the overhead due to a large context with a lot of small > > > allocations that then get individually freed). > > > > Ok, I got your idea. I will analyze it further and work on this if > > there is no problem. > > I think there is one problem that currently while unpacking the undo > record if the record is compressed (i.e. some of the fields does not > exist in the record) then we read those fields from the first record > on the page. But, if we just memcpy the undo pages to the buffers and > delay the unpacking whenever it's needed seems that we would need to > know the page boundary and also we need to know the offset of the > first complete record on the page from where we can get that > information (which is currently in undo page header). I don't understand why that's a problem? > As of now even if we leave this issue apart I am not very clear what > benefit you are seeing in the way you are describing compared to the > way I am doing it now? > > a) Is it the multiple palloc? If so then we can allocate memory at > once and flatten the undo records in that. Earlier, I was doing that > but we need to align each unpacked undo record so that we can access > them directly and based on Robert's suggestion I have modified it to > multiple palloc. Part of it. > b) Is it the memory size problem that the unpack undo record will take > more memory compared to the packed record? Part of it. > c) Do you think that we will not need to unpack all the records? But, > I think eventually, at the higher level we will have to unpack all the > undo records ( I understand that it will be one at a time) Part of it. There's a *huge* difference between having a few hundred to a thousand unpacked records, each consisting of several independent allocations, in memory and having one large block containing all packed records in a batch, and a few allocations for the few unpacked records that need to exist. There's also d) we don't need separate tiny memory copies while holding buffer locks etc. Greetings, Andres Freund
On Fri, Aug 16, 2019 at 10:56 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-16 09:44:25 +0530, Dilip Kumar wrote: > > On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > I think that batch reading should just copy the underlying data into a > > > > char* buffer. Only the records that currently are being used by > > > > higher layers should get exploded into an unpacked record. That will > > > > reduce memory usage quite noticeably (and I suspect it also drastically > > > > reduce the overhead due to a large context with a lot of small > > > > allocations that then get individually freed). > > > > > > Ok, I got your idea. I will analyze it further and work on this if > > > there is no problem. > > > > I think there is one problem that currently while unpacking the undo > > record if the record is compressed (i.e. some of the fields does not > > exist in the record) then we read those fields from the first record > > on the page. But, if we just memcpy the undo pages to the buffers and > > delay the unpacking whenever it's needed seems that we would need to > > know the page boundary and also we need to know the offset of the > > first complete record on the page from where we can get that > > information (which is currently in undo page header). > > I don't understand why that's a problem? Okay, I was assuming that we would only be copying the data part, not the complete page including the page header. If we copy the page header as well, we might be able to unpack the compressed records too. > > > > As of now even if we leave this issue apart I am not very clear what > > benefit you are seeing in the way you are describing compared to the > > way I am doing it now? > > > > a) Is it the multiple palloc? If so then we can allocate memory at > > once and flatten the undo records in that. Earlier, I was doing that > > but we need to align each unpacked undo record so that we can access > > them directly and based on Robert's suggestion I have modified it to > > multiple palloc. > > Part of it. > > > b) Is it the memory size problem that the unpack undo record will take > > more memory compared to the packed record? > > Part of it. > > > c) Do you think that we will not need to unpack all the records? But, > > I think eventually, at the higher level we will have to unpack all the > > undo records ( I understand that it will be one at a time) > > Part of it. There's a *huge* difference between having a few hundred to > thousand unpacked records, each consisting of several independent > allocations, in memory and having one large block containing all > packed records in a batch, and a few allocations for the few unpacked > records that need to exist. > > There's also d) we don't need separate tiny memory copies while holding > buffer locks etc. Yeah, that too. Yet another problem could be: how are we going to process those records? For that we need to know all the undo record pointers between start_urecptr and end_urecptr, right? We just have the big memory chunk, and we have no idea how many undo records are there and what their undo record pointers are. Without knowing that information, I am unable to imagine how we are going to sort them based on block number. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
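One possible answer to the how-do-we-find-the-record-pointers question, assuming the copied chunk retains whole pages including their headers as discussed above: walk each page forward from the first-complete-record offset recorded in its page header, collecting record starts as you go. Both the header field and the helper below are invented names for the sketch:

/* Sketch: enumerate record start offsets within one copied undo page. */
static int
collect_record_offsets(char *pagecopy, int *offsets, int max)
{
	UndoPageHeader hdr = (UndoPageHeader) pagecopy;
	int			off = hdr->first_record_offset;		/* assumed header field */
	int			n = 0;

	while (off < BLCKSZ && n < max)
	{
		offsets[n++] = off;
		off += packed_record_length(pagecopy + off);	/* hypothetical */
	}
	return n;
}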
On Wed, Aug 14, 2019 at 12:39 PM Andres Freund <andres@anarazel.de> wrote: > > > Again, I think it's not ok to just assume you can lock an essentially > > > unbounded number of buffers. This seems almost guaranteed to result in > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > I think for controlling that we need to put a limit on max prepared > > undo? I am not sure any other way of limiting the number of buffers > > because we must lock all the buffer in which we are going to insert > > the undo record under one WAL logged operation. > > I heard that a number of times. But I still don't know why that'd > actually be true. Why would it not be sufficient to just lock the buffer > currently being written to, rather than all buffers? It'd require a bit > of care updating the official current "logical end" of a log, but > otherwise ought to not be particularly hard? Only one backend can extend > the log after all, and until the log is externally visibily extended, > nobody can read or write those buffers, no? Well, I don't understand why you're on about this. We've discussed it a number of times but I'm still confused. I'll repeat my previous arguments on-list: 1. It's absolutely fine to just put a limit on this, because the higher-level facilities that use this shouldn't be doing a single WAL-logged operation that touches a zillion buffers. We have been careful to avoid having WAL-logged operations touch an unbounded number of buffers in plenty of other places, like the btree code, and we are going to have to be similarly careful here for multiple reasons, deadlock avoidance being one. So, saying, "hey, you're going to lock an unlimited number of buffers" is a straw man. We aren't. We can't. 2. The write-ahead logging protocol says that you're supposed to lock all the buffers at once. See src/backend/access/transam/README. If you want to go patch that file, then this patch can follow whatever the locking rules in the patched version are. But until then, the patch should follow *the actual rules* not some other protocol based on a hand-wavy explanation in an email someplace. Otherwise, you've got the same sort of undocumented disaster-waiting-to-happen that you keep complaining about in other parts of this patch. We need fewer of those, not more! 3. There is no reason to care about all of the buffers being locked at once, because they are not unlimited in number (see point #1) and nobody else is looking at them anyway (see the final sentence of what I quoted above). I think we are, or ought to be, talking about locking 2 (or maybe in rare cases 3 or 4) undo buffers in connection with a single WAL record. If we're talking about more than that, then I think the higher-level code needs to be changed. If we're talking about that many, then we don't need to be clever. We can just do the standard thing that the rest of the system does, and it will be fine just like it is everywhere else. > > Suppose you insert one record for the transaction which split in > > block1 and 2. Now, before this block is actually going to the disk > > the transaction committed and become all visible the undo logs are > > discarded. It's possible that block 1 is completely discarded but > > block 2 is not because it might have undo for the next transaction. > > Now, during recovery (FPW is off) if block 1 is missing but block 2 is > > their so we need to skip inserting undo for block 1 as it does not > > exist. > > Hm. I'm quite doubtful this is a good idea. 
How will this not force us > to a emit a lot more expensive durable operations while writing undo? > And doesn't this reduce error detection quite remarkably? > > Thomas, Robert? I think you're going to need to spell out your assumptions in order for me to be able to comment intelligently. This is another thing that seems pretty normal to me. Generally, WAL replay might need to recreate objects whose creation is not separately WAL-logged, and it might need to skip operations on objects that have been dropped later in the WAL stream and thus don't exist any more. This seems like an instance of the latter pattern. There's no reason to try to put valid data into pages that we know have been discarded, and both inserting and discarding undo data need to be logged anyway. As a general point, I think the hope is that undo generated by short-running transactions that commit and become all-visible quickly will be cheap. We should be able to dirty shared buffers but then discard the data without ever writing it out to disk if we've logged a discard of that data. Obviously, if you've got long-running transactions that are either generating undo or holding old snapshots, you're going to have to really flush the data, but we want to avoid that when we can. And the same is true on the standby: even if we write the dirty data into shared buffers instead of skipping the write altogether, we hope to be able to forget about those buffers when we encounter a discard record before the next checkpoint. One idea we could consider, if it makes the code sufficiently simpler and doesn't cost too much performance, is to remove the facility for skipping over bytes to be written and instead write any bytes that we don't really want to write to an entirely-fake buffer (e.g. a backend-private page in a static variable). That seems a little silly to me; I suspect there's a better way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
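For readers following along, the recipe Robert is invoking (from src/backend/access/transam/README) looks roughly like this for an operation touching a small, bounded number of buffers -- a generic sketch, with the rmgr ID, info flag, and surrounding variables left as placeholders:

    Buffer      bufs[NBUFS];
    XLogRecPtr  recptr;
    int         i;

    /* Pin everything first; no locks held yet, so I/O here is safe. */
    for (i = 0; i < NBUFS; i++)
        bufs[i] = ReadBuffer(rel, blknos[i]);

    /* Lock all buffers, in a deterministic order, for deadlock avoidance. */
    for (i = 0; i < NBUFS; i++)
        LockBuffer(bufs[i], BUFFER_LOCK_EXCLUSIVE);

    START_CRIT_SECTION();

    /* ... apply the changes to the pages ... */

    for (i = 0; i < NBUFS; i++)
        MarkBufferDirty(bufs[i]);

    XLogBeginInsert();
    for (i = 0; i < NBUFS; i++)
        XLogRegisterBuffer(i, bufs[i], REGBUF_STANDARD);
    recptr = XLogInsert(RM_FOO_ID, XLOG_FOO_OPERATION);

    /*
     * Stamp every page before unlocking, so that no page can reach disk
     * ahead of the WAL record that describes the change.
     */
    for (i = 0; i < NBUFS; i++)
        PageSetLSN(BufferGetPage(bufs[i]), recptr);

    END_CRIT_SECTION();

    for (i = 0; i < NBUFS; i++)
        UnlockReleaseBuffer(bufs[i]);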
Hi, On 2019-08-17 12:05:21 -0400, Robert Haas wrote: > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund <andres@anarazel.de> wrote: > > > > Again, I think it's not ok to just assume you can lock an essentially > > > > unbounded number of buffers. This seems almost guaranteed to result in > > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > > > I think for controlling that we need to put a limit on max prepared > > > undo? I am not sure any other way of limiting the number of buffers > > > because we must lock all the buffer in which we are going to insert > > > the undo record under one WAL logged operation. > > > > I heard that a number of times. But I still don't know why that'd > > actually be true. Why would it not be sufficient to just lock the buffer > > currently being written to, rather than all buffers? It'd require a bit > > of care updating the official current "logical end" of a log, but > > otherwise ought to not be particularly hard? Only one backend can extend > > the log after all, and until the log is externally visibily extended, > > nobody can read or write those buffers, no? > > Well, I don't understand why you're on about this. We've discussed it > a number of times but I'm still confused. There are two reasons here: The primary one in the context here is that if we do *not* have to lock the buffers all ahead of time, we can simplify the interface. We certainly can't lock the buffers over IO (due to buffer reclaim) as we're doing right now, so we'd need another phase, called by the "user" during undo insertion. But if we do not need to lock the buffers before the insertion as a whole starts, the inserting location doesn't have to care. Secondarily, all the reasoning for needing to lock all buffers ahead of time was imo fairly unconvincing. Following the "recipe" for WAL insertions is a good idea when writing a new run-of-the-mill WAL inserting location - but when writing a new fundamental facility, that already needs to modify how WAL works, then I find that much less convincing. > 1. It's absolutely fine to just put a limit on this, because the > higher-level facilities that use this shouldn't be doing a single > WAL-logged operation that touches a zillion buffers. We have been > careful to avoid having WAL-logged operations touch an unbounded > number of buffers in plenty of other places, like the btree code, and > we are going to have to be similarly careful here for multiple > reasons, deadlock avoidance being one. So, saying, "hey, you're going > to lock an unlimited number of buffers" is a straw man. We aren't. > We can't. Well, in the version of code that I was reviewing here, I don't think there is such a limit (there is a limit for buffers per undo record, but no limit on the number of records inserted together). I think Dilip added a limit since. And we have the issue of a lot of IO happening while holding content locks on several pages. So I don't think it's a straw man at all. > 2. The write-ahead logging protocol says that you're supposed to lock > all the buffers at once. See src/backend/access/transam/README. If > you want to go patch that file, then this patch can follow whatever > the locking rules in the patched version are. But until then, the > patch should follow *the actual rules* not some other protocol based > on a hand-wavy explanation in an email someplace. Otherwise, you've > got the same sort of undocumented disaster-waiting-to-happen that you > keep complaining about in other parts of this patch.
We need fewer of > those, not more! But that's not what I'm asking for? I don't even know where you got the idea that I don't want this to be documented. I'm mainly asking for a comment explaining why the current behaviour is what it is. Because I don't think an *implicit* "normal WAL logging rules" is sufficient explanation, because all the locking here happens one or two layers away from the WAL logging site - so it's absolutely *NOT* obvious that that's the explanation. And I don't think any of the locking sites actually has comments explaining why the locks are acquired at that time (in fact, IIRC until the review some even only mentioned pinning, not locking). > > > Suppose you insert one record for the transaction which split in > > > block1 and 2. Now, before this block is actually going to the disk > > > the transaction committed and become all visible the undo logs are > > > discarded. It's possible that block 1 is completely discarded but > > > block 2 is not because it might have undo for the next transaction. > > > Now, during recovery (FPW is off) if block 1 is missing but block 2 is > > > their so we need to skip inserting undo for block 1 as it does not > > > exist. > > > > Hm. I'm quite doubtful this is a good idea. How will this not force us > > to a emit a lot more expensive durable operations while writing undo? > > And doesn't this reduce error detection quite remarkably? > > > > Thomas, Robert? > > I think you're going to need to spell out your assumptions in order > for me to be able to comment intelligently. This is another thing > that seems pretty normal to me. Generally, WAL replay might need to > recreate objects whose creation is not separately WAL-logged, and it > might need to skip operations on objects that have been dropped later > in the WAL stream and thus don't exist any more. This seems like an > instance of the latter pattern. There's no reason to try to put valid > data into pages that we know have been discarded, and both inserting > and discarding undo data need to be logged anyway. Yea, I was "intentionally" vague here. I didn't have a concrete scenario that I was concerned about, but it somehow didn't quite seem right, and I didn't encounter an explanation of why it's guaranteed to be safe. So more eyes seemed like a good idea. I'm not at all sure that there is an actual problem here - I'm mostly trying to understand this code, from the perspective of somebody reading it for the first time. I think what primarily makes me concerned is that it's not clear to me what guarantees that discard is the only reason for the block to potentially be missing. In contrast to most other similar cases, where WAL replay simply re-creates the objects when trying to replay an action affecting such an object, here we simply skip over the WAL-logged operation. So if e.g. the entire underlying UNDO file got lost, we neither re-create it with valid content, nor error out. Which means we have got to be absolutely sure that all undo files are created in a persistent manner, at their full size. And that there's no way that data could get lost, without forcing us to perform REDO up to at least the relevant point again. While it appears that we always WAL log the undo extension, I am not convinced the recovery interlock is strong enough.
For one, UndoLogDiscard() unlinks segments before WAL-logging their removal - which means if we crash after unlink() and before the XLogInsert(XLOG_UNDOLOG_DISCARD) we'd theoretically be in trouble (in practice we might be fine, because there ought to be nobody still referencing that UNDO - but I don't think that's actually guaranteed as is). Nor do I see where we're updating minRecoveryLocation when replaying an XLOG_UNDOLOG_DISCARD, which means that a restart during recovery could be stopped before the discard has been replayed, leaving us with wrong UNDO, but allowing write access. Seems we'd at least need a few more XLogFlush() calls. > One idea we could consider, if it makes the code sufficiently simpler > and doesn't cost too much performance, is to remove the facility for > skipping over bytes to be written and instead write any bytes that we > don't really want to write to an entirely-fake buffer (e.g. a > backend-private page in a static variable). That seems a little silly > to me; I suspect there's a better way. I suspect so too. Greetings, Andres Freund
On Sat, Aug 17, 2019 at 9:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund <andres@anarazel.de> wrote: > > > > Again, I think it's not ok to just assume you can lock an essentially > > > > unbounded number of buffers. This seems almost guaranteed to result in > > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > > > I think for controlling that we need to put a limit on max prepared > > > undo? I am not sure any other way of limiting the number of buffers > > > because we must lock all the buffer in which we are going to insert > > > the undo record under one WAL logged operation. > > > > I heard that a number of times. But I still don't know why that'd > > actually be true. Why would it not be sufficient to just lock the buffer > > currently being written to, rather than all buffers? It'd require a bit > > of care updating the official current "logical end" of a log, but > > otherwise ought to not be particularly hard? Only one backend can extend > > the log after all, and until the log is externally visibily extended, > > nobody can read or write those buffers, no? > > Well, I don't understand why you're on about this. We've discussed it > a number of times but I'm still confused. I'll repeat my previous > arguments on-list: > > 1. It's absolutely fine to just put a limit on this, because the > higher-level facilities that use this shouldn't be doing a single > WAL-logged operation that touches a zillion buffers. We have been > careful to avoid having WAL-logged operations touch an unbounded > number of buffers in plenty of other places, like the btree code, and > we are going to have to be similarly careful here for multiple > reasons, deadlock avoidance being one. So, saying, "hey, you're going > to lock an unlimited number of buffers" is a straw man. We aren't. > We can't. Right. So basically, we need to put a limit on how many undo records can be prepared under a single WAL-logged operation, and that will internally put a limit on the number of undo buffers. Suppose we limit max_prepared_undo to 2; then we need to lock at most 5 undo buffers. We also need to deal with the multi-insert code in zheap, because there, when inserting into a single page, we write one undo record per range if the tuples we are inserting on that page are interleaved. But maybe we can handle that by just inserting one undo record which can have multiple ranges. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
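A cap of that kind can be fairly mechanical; a sketch (the constant, the field name, and the error text are illustrative only, not taken from the patch):

    #define MAX_PREPARED_UNDO   2   /* undo records per WAL-logged operation */

    if (context->nprepared_undo >= MAX_PREPARED_UNDO)
        elog(ERROR, "too many undo records prepared for a single WAL record");

The per-record page-span limit times this constant then bounds the total number of undo buffers that can be locked at once.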
On Wed, Aug 14, 2019 at 10:35 PM Andres Freund <andres@anarazel.de> wrote: > > > > > - When reading an undo record, the whole stage of UnpackUndoData() > > > reading data into a the UndoPackContext is omitted, reading directly > > > into the UnpackedUndoRecord. That removes one further copy of the > > > record format. > > So we will read member by member to UnpackedUndoRecord? because in > > context we have at least a few headers packed and we can memcpy one > > header at a time like UndoRecordHeader, UndoRecordBlock. > > Well, right now you then copy them again later, so not much is gained by > that (although that later copy can happen without the content lock > held). As I think I suggested before, I suspect that the best way would > be to just memcpy() the data from the page(s) into an appropriately > sized buffer with the content lock held, and then perform unpacking > directly into UnpackedUndoRecord. Especially with the bulk API that will > avoid having to do much work with locks held, and reduce memory usage by > only unpacking the record(s) in a batch that are currently being looked > at. > > > > But that just a few of them so if we copy field by field in the > > UnpackedUndoRecord then we can get rid of copying in context then copy > > it back to the UnpackedUndoRecord. Is this is what in your mind or > > you want to store these structures (UndoRecordHeader, UndoRecordBlock) > > directly into UnpackedUndoRecord? > > I at the moment see no reason not to? Currently, in UnpackedUndoRecord we store directly all the members that are set by the caller. We store pointers to some headers that are allocated internally by the undo layer, so the caller need not worry about setting them. So now you are suggesting putting the other headers into UnpackedUndoRecord as structures as well. I don't have much of a problem with doing that, but Robert originally designed the UnpackedUndoRecord structure this way, so it would be good to have his opinion on it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 17, 2019 at 10:58 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-17 12:05:21 -0400, Robert Haas wrote: > > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund <andres@anarazel.de> wrote: > > > > > Again, I think it's not ok to just assume you can lock an essentially > > > > > unbounded number of buffers. This seems almost guaranteed to result in > > > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > > > > > I think for controlling that we need to put a limit on max prepared > > > > undo? I am not sure any other way of limiting the number of buffers > > > > because we must lock all the buffer in which we are going to insert > > > > the undo record under one WAL logged operation. > > > > > > I heard that a number of times. But I still don't know why that'd > > > actually be true. Why would it not be sufficient to just lock the buffer > > > currently being written to, rather than all buffers? It'd require a bit > > > of care updating the official current "logical end" of a log, but > > > otherwise ought to not be particularly hard? Only one backend can extend > > > the log after all, and until the log is externally visibily extended, > > > nobody can read or write those buffers, no? > > > > Well, I don't understand why you're on about this. We've discussed it > > a number of times but I'm still confused. > > There's two reasons here: > > The primary one in the context here is that if we do *not* have to lock > the buffers all ahead of time, we can simplify the interface. We > certainly can't lock the buffers over IO (due to buffer reclaim) as > we're doing right now, so we'd need another phase, called by the "user" > during undo insertion. But if we do not need to lock the buffers before > the insertion over all starts, the inserting location doesn't have to > care. > > Secondarily, all the reasoning for needing to lock all buffers ahead of > time was imo fairly unconvincing. Following the "recipe" for WAL > insertions is a good idea when writing a new run-of-the-mill WAL > inserting location - but when writing a new fundamental facility, that > already needs to modify how WAL works, then I find that much less > convincing. > One point to remember in this regard is that we do need to modify the LSN in undo pages after writing WAL, so all the undo pages need to be locked by that time or we again need to take the lock on them. > > > 1. It's absolutely fine to just put a limit on this, because the > > higher-level facilities that use this shouldn't be doing a single > > WAL-logged operation that touches a zillion buffers. Right, by default a WAL log can only cover 4 buffers. If we need to touch more buffers, then the caller needs to call XLogEnsureRecordSpace. So, I agree with the point that generally, it should be few buffers (2 or 3) of undo that need to be touched in a single operation and if there are more, either callers need to change or at the very least they need to be careful about the same. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
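For callers that legitimately need to register more buffers than the default allows, the existing escape hatch is XLogEnsureRecordSpace(), which must be called outside any critical section, before constructing the record; a usage sketch (nbuffers is a placeholder):

    /*
     * Block IDs are zero-based, so registering nbuffers buffers requires
     * space for block IDs up to nbuffers - 1.  The second argument grows
     * the main-data chunk limit, which we leave at the default here.
     */
    XLogEnsureRecordSpace(nbuffers - 1, 0);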
On 2019-08-19 17:52:24 +0530, Amit Kapila wrote: > On Sat, Aug 17, 2019 at 10:58 PM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2019-08-17 12:05:21 -0400, Robert Haas wrote: > > > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > Again, I think it's not ok to just assume you can lock an essentially > > > > > > unbounded number of buffers. This seems almost guaranteed to result in > > > > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > > > > > > > I think for controlling that we need to put a limit on max prepared > > > > > undo? I am not sure any other way of limiting the number of buffers > > > > > because we must lock all the buffer in which we are going to insert > > > > > the undo record under one WAL logged operation. > > > > > > > > I heard that a number of times. But I still don't know why that'd > > > > actually be true. Why would it not be sufficient to just lock the buffer > > > > currently being written to, rather than all buffers? It'd require a bit > > > > of care updating the official current "logical end" of a log, but > > > > otherwise ought to not be particularly hard? Only one backend can extend > > > > the log after all, and until the log is externally visibily extended, > > > > nobody can read or write those buffers, no? > > > > > > Well, I don't understand why you're on about this. We've discussed it > > > a number of times but I'm still confused. > > > > There's two reasons here: > > > > The primary one in the context here is that if we do *not* have to lock > > the buffers all ahead of time, we can simplify the interface. We > > certainly can't lock the buffers over IO (due to buffer reclaim) as > > we're doing right now, so we'd need another phase, called by the "user" > > during undo insertion. But if we do not need to lock the buffers before > > the insertion over all starts, the inserting location doesn't have to > > care. > > > > Secondarily, all the reasoning for needing to lock all buffers ahead of > > time was imo fairly unconvincing. Following the "recipe" for WAL > > insertions is a good idea when writing a new run-of-the-mill WAL > > inserting location - but when writing a new fundamental facility, that > > already needs to modify how WAL works, then I find that much less > > convincing. > > One point to remember in this regard is that we do need to modify the > LSN in undo pages after writing WAL, so all the undo pages need to be > locked by that time or we again need to take the lock on them. Well, my main point, which so far has largely been ignored, was that we may not acquire page locks when we still need to search for victim buffers later. If we don't need to lock the pages up-front, but only do so once we're actually copying the records into the undo pages, then we don't need a separate phase to acquire the locks. We can still hold all of the page locks at the same time, as long as we just acquire them at the later stage. My secondary point was that *none* of this actually is documented, even though it's entirely unobvious to the reader that the relevant code can only run during WAL insertion, due to being pretty far removed from that. Greetings, Andres Freund
On Tue, Aug 20, 2019 at 2:46 AM Andres Freund <andres@anarazel.de> wrote: > > On 2019-08-19 17:52:24 +0530, Amit Kapila wrote: > > On Sat, Aug 17, 2019 at 10:58 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > > Well, I don't understand why you're on about this. We've discussed it > > > > a number of times but I'm still confused. > > > > > > There's two reasons here: > > > > > > The primary one in the context here is that if we do *not* have to lock > > > the buffers all ahead of time, we can simplify the interface. We > > > certainly can't lock the buffers over IO (due to buffer reclaim) as > > > we're doing right now, so we'd need another phase, called by the "user" > > > during undo insertion. But if we do not need to lock the buffers before > > > the insertion over all starts, the inserting location doesn't have to > > > care. > > > > > > Secondarily, all the reasoning for needing to lock all buffers ahead of > > > time was imo fairly unconvincing. Following the "recipe" for WAL > > > insertions is a good idea when writing a new run-of-the-mill WAL > > > inserting location - but when writing a new fundamental facility, that > > > already needs to modify how WAL works, then I find that much less > > > convincing. > > > > > > > One point to remember in this regard is that we do need to modify the > > LSN in undo pages after writing WAL, so all the undo pages need to be > > locked by that time or we again need to take the lock on them. > > Well, my main point, which so far has largely been ignored, was that we > may not acquire page locks when we still need to search for victim > buffers later. If we don't need to lock the pages up-front, but only do > so once we're actually copying the records into the undo pages, then we > don't a separate phase to acquire the locks. We can still hold all of > the page locks at the same time, as long as we just acquire them at the > later stage. > Okay, IIUC, this means that we should have a separate phase where we call LockUndoBuffers (or something like that) before InsertPreparedUndo and after PrepareUndoInsert. The LockUndoBuffers will lock all the buffers pinned during PrepareUndoInsert. We can probably call LockUndoBuffers before entering the critical section to avoid any kind of failure in critical section. If so, that sounds reasonable to me. > My secondary point was that *none* of this actually is > documented, even if it's entirely unobvious to the reader that the > relevant code can only run during WAL insertion, due to being pretty far > removed from that. > I think this can be clearly mentioned in README or someplace else. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
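In outline, the sequence Amit proposes would look like the sketch below; LockUndoBuffers() is hypothetical, while PrepareUndoInsert() and InsertPreparedUndo() are from the patch set (their argument lists are elided here). Note that Robert argues downthread for taking the locks inside the critical section instead, which would fold the locking into InsertPreparedUndo():

    /* Pin buffers and reserve undo space; may perform I/O, so no locks yet. */
    urecptr = PrepareUndoInsert(...);

    /*
     * Hypothetical separate phase: lock all the buffers pinned above.
     * Done before the critical section so a failure here cannot PANIC.
     */
    LockUndoBuffers(...);

    START_CRIT_SECTION();
    InsertPreparedUndo(...);        /* copy records into the locked pages */
    recptr = XLogInsert(...);       /* then stamp pages with the LSN, unlock */
    END_CRIT_SECTION();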
Hello, Aside from code changes based on review (and I have more to come of those), the attached experimental patchset (also at https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol that, I hope, allows for better concurrency, reliability and readability, and removes a bunch of TODO notes about questionable interlocking. However, I'm not quite done figuring out if the bufmgr interaction is right and will be manageable on the undoaccess side, so I'm hoping to get some feedback, not asking for anyone to rebase on top of it yet. Previously, there were two LWLocks used to make sure that all discarding was prevented while anyone was reading or writing data in any part of an undo log, and (probably more problematically) vice versa. Here's a new approach that removes that blocking: 1. Anyone is allowed to try to read or write data at any UndoRecPtr that has been allocated, through the buffer pool (though you'd usually want to check it with UndoRecPtrIsDiscarded() first, and only rely on the system I'm describing to deal with races). 2. ReadBuffer() might return InvalidBuffer. This can happen for a cache miss, if the smgrread implementation wants to indicate that the buffer has been discarded/truncated and that is expected (md.c won't ever do that, but undofile.c can). 3. UndoLogDiscard() uses DiscardBuffer() to invalidate any currently unpinned buffers, and marks as BM_DISCARDED any that happen to be pinned right now, so they can't be immediately invalidated. Such buffers are never written back and are eligible for reuse on the next clock sweep, even if they're written into by a backend that managed to do that when we were trying to discard. 4. In order to make this work, I needed to track an extra offset 'begin' that says what physical storage exists. So [begin, end) give you the range of physical undo space (that is, files that exist on disk) and [discard, insert) give you the range of active data within it. There are now four offsets per log in shm and in the pg_stat_undo_logs view. 5. Separating begin from discard allows the WAL logging for UndoLogDiscard() to do filesystem actions before logging, and other effects after logging, which have several nice properties if you work through the various crash scenarios. This allowed a lot of direct UndoLogSlot access and locking code to be removed from undodiscard.c and undoaccess.c, because now they can just proceed as normal, as long as they are prepared to give up whenever the buffer manager tells them the buffer they're asking for has evaporated. Once they've pinned a buffer, they don't need to care if it becomes (or already was) BM_DISCARDED; attempts to dirty it will be silently ignored and eventually it'll be reclaimed. It also gets rid of 'oldest_data', which was another scheme tagging along behind the discard pointer. So now I'd like to get feedback on the sanity of this scheme. I'm not saying it doesn't have bugs right now -- I've been trying to figure out good ways to test it and I'm not quite there yet -- but it's the concept I'd like feedback on. One observation I have is that there were already code paths in undoaccess.c that can tolerate InvalidBuffer in recovery, due to the potentially different discard timing for DO vs REDO. I think that's a point in favour of this scheme, but I can see that it's inconvenient to have to deal with InvalidBuffer whenever you read. Some other changes in this patch set: 1. There is a superuser-only procedure pg_force_discard_undo(logno) that can discard on command.
This can be used to get a system unwedged if rollback actions are failing. For example, if you insert an elog(ERROR, "boo!") into smgr_undo() and then roll back a table creation, you'll see a discard worker repeatedly reporting the error, and pg_stat_undo_logs will show that the undo log space never gets freed. This can be fixed with CALL pg_force_discard_undo(<logno>). 2. There is a superuser-only, testing-only procedure pg_force_switch_undo(logno) that can be used to force a transaction that is currently writing to that log number to switch to a new one, as if it had hit the end of the undo log (the 1TB address space within each undo log). This is good for exercising code that e.g. rolls back stuff spread over two undo logs. 3. When I was removing oldest_data from UndoLogSlot, I started wondering why wait_fxmin was in there, as it was almost the last reason why discard worker code needed to know about slots. Since we currently have only a single discard worker, and no facility for coordinating more than one discard worker, I think its bookkeeping might as well be backend local. Here I made the stupidest change that would work: a hash table to hold per-logno wait_fxmin. I'm not entirely sure what data structure we really want for this -- it's all a bit brute force right now. Thoughts? I pulled in the latest code from undoprocessing as of today, and I might be a bit confused about "Defect and enhancement in multi-log support", some of which I have squashed into the make undolog patch. BTW undoprocessing builds with uninitialized variable warnings in xact.c on clang today. -- Thomas Munro https://enterprisedb.com
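From a caller's perspective, reading under this protocol might look like the following sketch; UndoRecPtrIsDiscarded() is from the patch set, while undo_read_buffer() stands in for whatever the undoaccess layer actually uses to go through the buffer pool:

    /* Advisory early-out; a concurrent discard can still beat us. */
    if (UndoRecPtrIsDiscarded(urecptr))
        return false;

    buffer = undo_read_buffer(urecptr);     /* may return InvalidBuffer */
    if (!BufferIsValid(buffer))
    {
        /*
         * undofile.c signalled that this page was discarded or truncated
         * underneath us.  That is expected under the new protocol, so
         * treat the data as gone rather than erroring out.
         */
        return false;
    }

The invariant behind the four offsets is begin <= discard <= insert <= end: [begin, end) is the physical storage on disk, [discard, insert) the live data within it.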
On Wed, Aug 14, 2019 at 10:35 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote: > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > I don't think we can normally pin the undo buffers properly at that > stage. Without knowing the correct contents of the table page - which we > can't know without holding some form of lock preventing modifications - > we can't know how big our undo records are going to be. And we can't > just have buffers that don't exist on disk in shared memory, and we > don't want to allocate undo that we then don't need. So I think what > we'd have to do at that stage, is to "pre-allocate" buffers for the > maximum amount of UNDO needed, but mark the associated bufferdesc as not > yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID > would not be set. > > So at the start of a function that will need to insert undo we'd need to > pre-reserve the maximum number of buffers we could potentially > need. That reservation stage would > > a) pin the page with the current end of the undo > b) if needed pin the page of older undo that we need to update (e.g. to > update the next pointer) > c) perform clock sweep etc to acquire (find or create) enough clean to > hold the maximum amount of undo needed. These buffers would be marked > as !BM_TAG_VALID | BUF_REFCOUNT_ONE. > > I assume that we'd make a) cheap by keeping it pinned for undo logs that > a backend is actively attached to. b) should only be needed once in a > transaction, so it's not too bad. c) we'd probably need to amortize > across multiple undo insertions, by keeping the unused buffers pinned > until the end of the transaction. > > I assume that having the infrastructure c) might also make some code > for already in postgres easier. There's obviously some issues around > guaranteeing that the maximum number of such buffers isn't high. I have analyzed this further, and I think there is a problem if the record(s) will not fit into the current undo log and we have to switch logs. Before knowing the actual record length we are not sure whether the undo log will switch or not, or which undo log we will get. And without knowing the logno (rnode), how are we going to pin the buffers? Am I missing something? Thomas, do you think we can get around this problem? Apart from this, while analyzing other code I have noticed that the current PG code has a few places where we try to read a buffer while already holding a buffer lock.

1. In gistplacetopage:

    {
        ...
        for (; ptr; ptr = ptr->next)
        {
            /* Allocate new page */
            ptr->buffer = gistNewBuffer(rel);
            GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
            ptr->page = BufferGetPage(ptr->buffer);
            ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
        }
        ...
    }

2. During a page split we find a new buffer while holding the lock on the current buffer. That doesn't mean we can't do better, but I am just pointing at existing code that already has such issues. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
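Restating the quoted reservation scheme as pseudo-code may help; every function name here is hypothetical, and Dilip's log-switch objection (the target logno cannot be known before the record size is known) is exactly the part this sketch glosses over:

    /* (a) page at the current insertion point: kept pinned while the
     *     backend stays attached to this undo log, so usually free */
    end_buf = PinCurrentUndoEndPage(log);

    /* (b) older undo page whose next-pointer must be updated: needed at
     *     most once per transaction */
    prev_buf = PinPreviousUndoPage(log);

    /*
     * (c) clock-sweep enough clean victim buffers for the worst case;
     * each is pinned (refcount 1) but has no valid tag (!BM_TAG_VALID),
     * so it maps to no disk block yet and no other backend can find it.
     * Unused ones stay pinned and are reused by later undo insertions.
     */
    for (i = 0; i < max_pages_needed; i++)
        reserved[i] = ReserveUntaggedVictimBuffer();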
On Sat, Aug 17, 2019 at 1:28 PM Andres Freund <andres@anarazel.de> wrote: > The primary one in the context here is that if we do *not* have to lock > the buffers all ahead of time, we can simplify the interface. We > certainly can't lock the buffers over IO (due to buffer reclaim) as > we're doing right now, so we'd need another phase, called by the "user" > during undo insertion. But if we do not need to lock the buffers before > the insertion over all starts, the inserting location doesn't have to > care. > > Secondarily, all the reasoning for needing to lock all buffers ahead of > time was imo fairly unconvincing. Following the "recipe" for WAL > insertions is a good idea when writing a new run-of-the-mill WAL > inserting location - but when writing a new fundamental facility, that > already needs to modify how WAL works, then I find that much less > convincing. So, you seem to be talking about something here which is different than what I thought we were talking about. One question is whether we need to lock all of the buffers "ahead of time," and I think the answer to that question is probably "no." Since nobody else can be writing to those buffers, and probably also nobody can be reading them except maybe for some debugging tool, it should be fine if we enter the critical section and then lock them at the point when we write the bytes. I mean, there shouldn't be any contention, and I don't see any other problems. The other question is whether need to hold all of the buffer locks at the same time, and that seems a lot more problematic to me. It's hard to say exactly whether this unsafe, because it depends on exactly what you think we're doing here, and I don't see that you've really spelled that out. The normal thing to do is call PageSetLSN() on every page before releasing the buffer lock, and that means holding all the buffer locks until after we've called XLogInsert(). Now, you could argue that we should skip setting the page LSN because the data ahead of the insertion pointer is effectively invisible anyway, but I bet that causes problems with checksums, at least, since they rely on the page LSN being accurate to know whether to emit WAL when a buffer is written. You could argue that we could do the XLogInsert() first and only after that lock and dirty the pages one by one, but I think that might break checkpoint interlocking, since it would then be possible for the checkpoint scan to pass over a buffer that does not appear to need writing for the current checkpoint but later gets dirtied and stamped with an LSN that would have caused it to be written had it been there at the time the checkpoint scan reached it. I really can't rule out the possibility that there's some way to make something in this area work, but I don't know what it is, and I think it's a fairly risky area to go tinkering. > Well, in the version of code that I was reviewing here, I don't there is > such a limit (there is a limit for buffers per undo record, but no limit > on the number of records inserted together). I think Dilip added a limit > since. And we have the issue of a lot of IO happening while holding > content locks on several pages. So I don't think it's a straw man at > all. Hmm, what do you mean by "a lot of IO happening while holding content locks on several pages"? We might XLogInsert() but there shouldn't be any buffer I/O going on at that point. If there is, I think that should be redesigned. We should collect buffer pins first, without locking. Then lock. Then write. 
Or maybe lock-and-write, but only after everything's pinned. The fact of calling XLogInsert() while holding buffer locks is not great, but I don't think it's any worse here than in any other part of the system, because the undo buffers aren't going to be suffering concurrent access from any other backend, and because there shouldn't be more than a few of them. > > 2. The write-ahead logging protocol says that you're supposed to lock > > all the buffers at once. See src/backend/access/transam/README. If > > you want to go patch that file, then this patch can follow whatever > > the locking rules in the patched version are. But until then, the > > patch should follow *the actual rules* not some other protocol based > > on a hand-wavy explanation in an email someplace. Otherwise, you've > > got the same sort of undocumented disaster-waiting-to-happen that you > > keep complaining about in other parts of this patch. We need fewer of > > those, not more! > > But that's not what I'm asking for? I don't even know where you got the > idea that I don't want this to be documented. I'm mainly asking for a > comment explaining why the current behaviour is what it is. Because I > don't think an *implicit* "normal WAL logging rules" is sufficient > explanation, because all the locking here happens one or two layers away > from the WAL logging site - so it's absolutely *NOT* obvious that that's > the explanation. And I don't think any of the locking sites actually has > comments explaining why the locks are acquired at that time (in fact, > IIRC until the review some even only mentioned pinning, not locking). I didn't intend to suggest that you don't want this to be documented. What I intended to suggest was that you seem to want to deviate from the documented rules, and it seems to me that we shouldn't do that unless we change the rules first, and I don't know what you think the rules should be or why those rules are safe. I think I basically agree with you about the rest of this: the API needs to be non-confusing and adequately documented, and it should avoid acquiring buffer locks until we have all the relevant pins. > I think what primarily makes me concerned is that it's not clear to me > what guarantees that discard is the only reason for the block to > potentially be missing. In contrast to most other similar cases, where WAL > replay simply re-creates the objects when trying to replay an action > affecting such an object, here we simply skip over the WAL-logged > operation. So if e.g. the entire underlying UNDO file got lost, we > neither re-create it with valid content, nor error out. Which means we > have got to be absolutely sure that all undo files are created in a > persistent manner, at their full size. And that there's no way that data > could get lost, without forcing us to perform REDO up to at least the > relevant point again. I think the crucial question for me here is the extent to which we're cross-checking against the discard pointer. If we're like, "oh, this undo data isn't on disk any more, it must've already been discarded, let's ignore the write," that doesn't sound particularly great, because files sometimes go missing. But, if we're like, "oh, we dirtied this undo buffer but now that undo has been discarded so we don't need to write the data back to the backing file," that seems fine. The discard pointer is a fully durable, WAL-logged thing; if it's somehow wrong, we have got huge problems anyway.
> While it appears that we always WAL log the undo extension, I am not > convinced the recovery interlock is strong enough. For one > UndoLogDiscard() unlinks segments before WAL logging their removal - > which means if we crash after unlink() and before the > XLogInsert(XLOG_UNDOLOG_DISCARD) we'd theoretically be in trouble (in > practice we might be fine, because there ought to be nobody still > referencing that UNDO - but I don't think that's actually guaranteed as > is). Hmm, that sounds a little worrying. I think there are two options here: unlike what we do with buffers, where we can use buffer locking etc. to make the insertion of the WAL record effectively simultaneous with the changes to the data pages, the removal of old undo files has to happen either before or after XLogInsert(). I think "after" would be better. If we do it before, then upon starting up, we have to accept that there might be undo which is not officially discarded which nevertheless no longer exists on disk; but that might also cause us to ignore real corruption. If we do it after, then we can just treat it as a non-critical cleanup that can be performed lazily and at leisure: at any time, without warning, the system may choose to remove any or all undo backing files all of whose address space is discarded. If we fail to remove files, we can just emit a WARNING and maybe retry later at some convenient point in time, or perhaps even just accept that we'll leak the file in that case. > Nor do I see where we're updating minRecoveryLocation when > replaying a XLOG_UNDOLOG_DISCARD, which means that a restart during > recovery could be stopped before the discard has been replayed, leaving > us with wrong UNDO, but allowing write acess. Seems we'd at least need a > few more XLogFlush() calls. That sounds like a problem, but it seems like it might be better to make sure that minRecoveryLocation gets bumped, rather than adding XLogFlush() calls. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
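The "after" ordering Robert favours could look like the following sketch; XLOG_UNDOLOG_DISCARD is the patch's record type, while the record struct and the segment-path helper are illustrative assumptions:

    /* Make the discard durable first. */
    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, sizeof(xlrec));
    recptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_DISCARD);
    XLogFlush(recptr);

    /*
     * Only then remove the backing files.  A failure here merely leaks
     * disk space and can be retried at leisure, so a WARNING suffices.
     */
    for (segno = first_old_segno; segno < new_discard_segno; segno++)
    {
        UndoLogSegmentPath(logno, segno, path);     /* hypothetical helper */
        if (unlink(path) != 0)
            ereport(WARNING,
                    (errcode_for_file_access(),
                     errmsg("could not remove undo segment \"%s\": %m",
                            path)));
    }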
On Mon, Aug 19, 2019 at 2:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Currently, In UnpackedUndoRecord we store all members directly which > are set by the caller. We store pointers to some header which are > allocated internally by the undo layer and the caller need not worry > about setting them. So now you are suggesting to put other headers > also as structures in UnpackedUndoRecord. I as such don't have much > problem in doing that but I think initially Robert designed > UnpackedUndoRecord structure this way so it will be good if Robert > provides his opinion on this. I don't believe that's what is being suggested. It seems to me that the thing Andres is complaining about here has roots in the original sketch that I did for this code. The oldest version I can find is here: https://github.com/EnterpriseDB/zheap/commit/7d194824a18f0c5e85c92451beab4bc6f044254c In this version, and I think still in the current version, there is a two-stage marshaling strategy. First, the individual fields from the UnpackedUndoRecord get copied into global variables (yes, that was my fault, too, at least in part!) which are structures. Then, the structures get copied into the target buffer. The idea of that design was to keep the code simple, but it didn't really work out, because things got a lot more complicated between the time I wrote those 3244 lines of code and the >3000 lines of code that live in this patch today. One thing that changed was that we moved more and more in the direction of considering individual fields as separate objects to be separately included or excluded, whereas when I wrote that code I thought we were going to have groups of related fields that stood or fell together. That idea turned out to be wrong. (There is the even-larger question here of whether we ought to take Heikki's suggestion and make this whole thing a lot more generic, but let's start by discussing how the design that we have today could be better implemented.) If I understand Andres correctly, he's arguing that we ought to get rid of the two-stage marshaling strategy. During decoding, he wants data to go directly from the buffer that contains it to the UnpackedUndoRecord without ever being stored in the UnpackUndoContext. During insertion, he wants data to go directly from the UnpackedUndoRecord to the buffer that contains it. Or, at least, if there has to be an intermediate format, he wants it to be just a chunk of raw bytes, rather than a bunch of individual fields like we have in UndoPackContext currently. I think that's a reasonable goal. I'm not as concerned about it as he is from a performance point of view, but I think it would make the code look nicer, and that would be good. If we save CPU cycles along the way, that is also good. In broad outline, what this means is: 1. Any field in the UndoPackContext that starts with urec_ goes away. 2. Instead of something like InsertUndoBytes((char *) &(ucontext->urec_fxid), ...) we'd write InsertUndoBytes((char *) &uur->uur_fxid, ...). 3. Similarly instead of ReadUndoBytes((char *) &ucontext->urec_fxid, ...) we'd write ReadUndoBytes((char *) &uur->uur_fxid, ...). 4. It seems slightly trickier to handle the cases where we've got a structure instead of individual fields, like urec_hd. But those could be broken down into field-by-field reads and writes, e.g. in this case one call for urec_type and a second for urec_info. 5. For uur_group and uur_logswitch, the code would need to allocate those subsidiary structures before copying into them.
To me, that seems like it ought to be a pretty straightforward change that just makes things simpler. We'd probably need to pass the UnpackedUndoRecord to BeginUnpackUndo instead of FinishUnpackUndo, and keep a pointer to it in the UnpackUndoContext, but that seems fine. FinishUnpackUndo would end up just about empty, maybe entirely empty. Is that a reasonable idea? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
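Sketched out, the decoding side of that change could look like this; the uur_* and UREC_INFO_* names follow the patch's conventions, but the exact fields and the ReadUndoBytes() signature are assumed here, not copied from it:

    /* BeginUnpackUndo remembers the caller's record as the target. */
    BeginUnpackUndo(&context, uur);

    /* Header fields go straight into the record, one field at a time. */
    ReadUndoBytes((char *) &uur->uur_type, sizeof(uur->uur_type), &context);
    ReadUndoBytes((char *) &uur->uur_info, sizeof(uur->uur_info), &context);
    ReadUndoBytes((char *) &uur->uur_fxid, sizeof(uur->uur_fxid), &context);

    /* Optional chunks allocate their subsidiary struct, then fill it in. */
    if (uur->uur_info & UREC_INFO_GROUP)
    {
        uur->uur_group = palloc(sizeof(UndoRecordGroup));
        ReadUndoBytes((char *) uur->uur_group, sizeof(UndoRecordGroup),
                      &context);
    }

    /* FinishUnpackUndo(&context) is now (nearly) a no-op. */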
On Mon, Aug 19, 2019 at 8:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > One point to remember in this regard is that we do need to modify the > LSN in undo pages after writing WAL, so all the undo pages need to be > locked by that time or we again need to take the lock on them. Uh, but a big part of the point of setting the LSN on the pages is to keep them from being written out before the corresponding WAL is flushed to disk. If you released and reacquired the lock, the page could be written out during the window in the middle. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 19, 2019 at 5:16 PM Andres Freund <andres@anarazel.de> wrote: > Well, my main point, which so far has largely been ignored, was that we > may not acquire page locks when we still need to search for victim > buffers later. If we don't need to lock the pages up-front, but only do > so once we're actually copying the records into the undo pages, then we > don't a separate phase to acquire the locks. We can still hold all of > the page locks at the same time, as long as we just acquire them at the > later stage. +1 for that approach. I am in complete agreement. > My secondary point was that *none* of this actually is > documented, even if it's entirely unobvious to the reader that the > relevant code can only run during WAL insertion, due to being pretty far > removed from that. +1 also for properly documenting stuff. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Aug 20, 2019 at 2:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Well, my main point, which so far has largely been ignored, was that we > > may not acquire page locks when we still need to search for victim > > buffers later. If we don't need to lock the pages up-front, but only do > > so once we're actually copying the records into the undo pages, then we > > don't a separate phase to acquire the locks. We can still hold all of > > the page locks at the same time, as long as we just acquire them at the > > later stage. > > Okay, IIUC, this means that we should have a separate phase where we > call LockUndoBuffers (or something like that) before > InsertPreparedUndo and after PrepareUndoInsert. The LockUndoBuffers > will lock all the buffers pinned during PrepareUndoInsert. We can > probably call LockUndoBuffers before entering the critical section to > avoid any kind of failure in critical section. If so, that sounds > reasonable to me. I'm kind of scratching my head here, because this is clearly different than what Andres said in the quoted text to which you were replying. He clearly implied that we should acquire the buffer locks within the critical section during InsertPreparedUndo, and you responded by proposing to do it outside the critical section in a separate step. Regardless of which way is actually better, when somebody says "hey, let's do A!" and you respond by saying "sounds good, I'll go implement B!" that's not really helping us to get toward a solution. FWIW, although I also thought of doing what you are describing here, I think Andres's proposal is probably preferable, because it's simpler. There's not really any reason why we can't take the buffer locks from within the critical section, and that way callers don't have to deal with the extra step. > > My secondary point was that *none* of this actually is > > documented, even if it's entirely unobvious to the reader that the > > relevant code can only run during WAL insertion, due to being pretty far > > removed from that. > > I think this can be clearly mentioned in README or someplace else. It also needs to be adequately commented in the files and functions involved. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-08-20 09:08:29 -0400, Robert Haas wrote: > On Sat, Aug 17, 2019 at 1:28 PM Andres Freund <andres@anarazel.de> wrote: > > The primary one in the context here is that if we do *not* have to lock > > the buffers all ahead of time, we can simplify the interface. We > > certainly can't lock the buffers over IO (due to buffer reclaim) as > > we're doing right now, so we'd need another phase, called by the "user" > > during undo insertion. But if we do not need to lock the buffers before > > the insertion over all starts, the inserting location doesn't have to > > care. > > > > Secondarily, all the reasoning for needing to lock all buffers ahead of > > time was imo fairly unconvincing. Following the "recipe" for WAL > > insertions is a good idea when writing a new run-of-the-mill WAL > > inserting location - but when writing a new fundamental facility, that > > already needs to modify how WAL works, then I find that much less > > convincing. > > So, you seem to be talking about something here which is different > than what I thought we were talking about. One question is whether we > need to lock all of the buffers "ahead of time," and I think the > answer to that question is probably "no." Since nobody else can be > writing to those buffers, and probably also nobody can be reading them > except maybe for some debugging tool, it should be fine if we enter > the critical section and then lock them at the point when we write the > bytes. I mean, there shouldn't be any contention, and I don't see any > other problems. Right. As long as we are as restrictive about the number of buffers per undo record, and the number of records per WAL insertion, I don't see any need to go further than that. > > Well, in the version of code that I was reviewing here, I don't think there is > > such a limit (there is a limit for buffers per undo record, but no limit > > on the number of records inserted together). I think Dilip added a limit > > since. And we have the issue of a lot of IO happening while holding > > content locks on several pages. So I don't think it's a straw man at > > all. > > Hmm, what do you mean by "a lot of IO happening while holding content > locks on several pages"? We might XLogInsert() but there shouldn't be > any buffer I/O going on at that point. That's my primary complaint with how the code is structured right now. Right now we potentially perform IO while holding exclusive content locks, often multiple ones even. When acquiring target pages for undo, we currently already hold the table page exclusively locked, and if there's more than one buffer for undo, we'll also hold the previous buffers locked. And acquiring a buffer will often have to write out a dirty buffer to the OS, and a lot of times that will then also require the kernel to flush data out. That's imo an absolute no-go for the general case. > If there is, I think that should be redesigned. We should collect > buffer pins first, without locking. Then lock. Then write. Right. It's easy enough to do that for the locks on undo pages themselves. The harder part is the content lock on the "table page" - we don't accurately know how many undo buffers we will need, without holding the table lock (or preventing modifications in some other manner).
I tried to outline the problem and potential solutions in more detail in: https://www.postgresql.org/message-id/20190814065745.2faw3hirvfhbrdwe%40alap3.anarazel.de > The fact of calling XLogInsert() while holding buffer locks is not > great, but I don't think it's any worse here than in any other part of > the system, because the undo buffers aren't going to be suffering > concurrent access from any other backend, and because there shouldn't > be more than a few of them. Yea. That's obviously undesirable, but also fundamentally required at least in the general case. And it's not at all specific to undo. > [ WAL logging protocol ] > I didn't intend to suggest that you don't want this to be documented. > What I intended to suggest was that you seem to want to deviate from > the documented rules, and it seems to me that we shouldn't do that > unless we change the rules first, and I don't know what you think the > rules should be or why those rules are safe. IDK. We have at least five different places that at the very least bend the rules - but with a comment explaining why it's safe in the specific case. Personally I don't really think the generic guideline needs to list every potential edge-case. > > I think what primarily makes me concerned is that it's not clear to me > > what guarantees that discard is the only reason for the block to > > potentially be missing. In contrast to most other similar cases, where WAL > > replay simply re-creates the objects when trying to replay an action > > affecting such an object, here we simply skip over the WAL-logged > > operation. So if e.g. the entire underlying UNDO file got lost, we > > neither re-create it with valid content, nor error out. Which means we > > have got to be absolutely sure that all undo files are created in a > > persistent manner, at their full size. And that there's no way that data > > could get lost, without forcing us to perform REDO up to at least the > > relevant point again. > > I think the crucial question for me here is the extent to which we're > cross-checking against the discard pointer. If we're like, "oh, this > undo data isn't on disk any more, it must've already been discarded, > let's ignore the write," that doesn't sound particularly great, > because files sometimes go missing. Right. > But, if we're like, "oh, we dirtied this undo buffer but now that undo > has been discarded so we don't need to write the data back to the > backing file," that seems fine. The discard pointer is a fully > durable, WAL-logged thing; if it's somehow wrong, we have got huge > problems anyway. There is some cross-checking against the discard pointer while reading, but it's not obvious to me that there is in all places. In particular for insertions. UndoGetBufferSlot() itself doesn't have a crosscheck afaict, and I don't see anything in InsertPreparedUndo() either. It's possible that somehow it's indirectly guaranteed, but if so, it'd be far from obvious. > > While it appears that we always WAL log the undo extension, I am not > > convinced the recovery interlock is strong enough. For one > > UndoLogDiscard() unlinks segments before WAL logging their removal - > > which means if we crash after unlink() and before the > > XLogInsert(XLOG_UNDOLOG_DISCARD) we'd theoretically be in trouble (in > > practice we might be fine, because there ought to be nobody still > > referencing that UNDO - but I don't think that's actually guaranteed as > > is). > > Hmm, that sounds a little worrying.
I think there are two options > here: unlike what we do with buffers, where we can use buffer locking > etc. to make the insertion of the WAL record effectively simultaneous > with the changes to the data pages, the removal of old undo files has > to happen either before or after XLogInsert(). I think "after" would > be better. Right. > > Nor do I see where we're updating minRecoveryLocation when > > replaying an XLOG_UNDOLOG_DISCARD, which means that a restart during > > recovery could be stopped before the discard has been replayed, leaving > > us with wrong UNDO, but allowing write access. Seems we'd at least need a > > few more XLogFlush() calls. > > That sounds like a problem, but it seems like it might be better to > make sure that minRecoveryLocation gets bumped, rather than adding > XLogFlush() calls. XLogFlush() so far is the way to update minRecoveryLocation:

    /*
     * During REDO, we are reading not writing WAL.  Therefore, instead of
     * trying to flush the WAL, we should update minRecoveryPoint instead. We
     * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
     * to act this way too, and because when it tries to write the
     * end-of-recovery checkpoint, it should indeed flush.
     */
    if (!XLogInsertAllowed())
    {
        UpdateMinRecoveryPoint(record, false);
        return;
    }

I don't think there's currently any other interface available to redo functions to update minRecoveryLocation. And we already use XLogFlush() for that purpose in numerous redo routines. Greetings, Andres Freund
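For concreteness, a discard redo routine that bumps minRecoveryPoint through XLogFlush() might look roughly like this; the record type, its fields and UndoLogAdvanceDiscard() are placeholders rather than the actual patch code:

    void
    undolog_redo_discard(XLogReaderState *record)
    {
        xl_undolog_discard *xlrec = (xl_undolog_discard *) XLogRecGetData(record);

        /*
         * During REDO, XLogFlush() updates minRecoveryPoint rather than
         * flushing WAL, so a crash-restart of recovery cannot stop before
         * this discard has been replayed.
         */
        XLogFlush(record->EndRecPtr);

        UndoLogAdvanceDiscard(xlrec->logno, xlrec->discard);    /* placeholder */
    }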
Hi, On 2019-08-20 21:02:18 +1200, Thomas Munro wrote: > Aside from code changes based on review (and I have more to come of > those), the attached experimental patchset (also at > https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol > that, I hope, allows for better concurrency, reliability and > readability, and removes a bunch of TODO notes about questionable > interlocking. However, I'm not quite done figuring out if the bufmgr > interaction is right and will be manageable on the undoaccess side, so > I'm hoping to get some feedback, not asking for anyone to rebase on > top of it yet. > > Previously, there were two LWLocks used to make sure that all > discarding was prevented while anyone was reading or writing data in > any part of an undo log, and (probably more problematically) vice > versa. Here's a new approach that removes that blocking: > > 1. Anyone is allowed to try to read or write data at any UndoRecPtr > that has been allocated, through the buffer pool (though you'd usually > want to check it with UndoRecPtrIsDiscarded() first, and only rely on > the system I'm describing to deal with races). > > 2. ReadBuffer() might return InvalidBuffer. This can happen for a > cache miss, if the smgrread implementation wants to indicate that the > buffer has been discarded/truncated and that is expected (md.c won't > ever do that, but undofile.c can). Hm. This gives me a bit of a stomach ache. It somehow feels like a weird form of signalling. Can't quite put my finger on why it makes me feel queasy. > 3. UndoLogDiscard() uses DiscardBuffer() to invalidate any currently > unpinned buffers, and marks as BM_DISCARDED any that happen to be > pinned right now, so they can't be immediately invalidated. Such > buffers are never written back and are eligible for reuse on the next > clock sweep, even if they're written into by a backend that managed to > do that when we were trying to discard. Hm. When is it legitimate for a backend to write into such a buffer? I guess that's about updating the previous transaction's next pointer? Or progress info? > 5. Separating begin from discard allows the WAL logging for > UndoLogDiscard() to do filesystem actions before logging, and other > effects after logging, which have several nice properties if you work > through the various crash scenarios. Hm. ISTM we always need to log before doing some filesystem operation (see also my recent complaint Robert and I are discussing at the bottom of [1]). It's just that we can have a separate stage afterwards? [1] https://www.postgresql.org/message-id/CA%2BTgmoZc5JVYORsGYs8YnkSxUC%3DcLQF1Z%2BfcpH2TTKvqkS7MFg%40mail.gmail.com > So now I'd like to get feedback on the sanity of this scheme. I'm not > saying it doesn't have bugs right now -- I've been trying to figure > out good ways to test it and I'm not quite there yet -- but the > concept. One observation I have is that there were already code paths > in undoaccess.c that can tolerate InvalidBuffer in recovery, due to > the potentially different discard timing for DO vs REDO. I think > that's a point in favour of this scheme, but I can see that it's > inconvenient to have to deal with InvalidBuffer whenever you read. FWIW, I'm far from convinced that those are currently quite right. See discussion pointed to above. > I pulled in the latest code from undoprocessing as of today, and I > might be a bit confused about "Defect and enhancement in multi-log > support" some of which I have squashed into the make undolog patch. 
> BTW undoprocessing builds with uninitialized variable warnings in xact.c > on clang today. I've complained about the existence of that commit multiple times now. So far without any comments. Greetings, Andres Freund
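To make the cross-checking concern from earlier concrete, a read-path sketch that validates a missing page against the durable discard pointer could look like this; UndoReadBufferMaybeDiscarded() is a hypothetical wrapper, while UndoRecPtrIsDiscarded() is the check named in Thomas's protocol:

    Buffer      buf = UndoReadBufferMaybeDiscarded(urecptr);    /* hypothetical */

    if (!BufferIsValid(buf))
    {
        /*
         * The undo smgr signalled a discarded/truncated page.  Trust the
         * durable, WAL-logged discard pointer, not a merely-missing file.
         */
        if (!UndoRecPtrIsDiscarded(urecptr))
            elog(ERROR, "undo data unexpectedly missing at " UINT64_FORMAT, urecptr);
        return false;           /* legitimately discarded; caller skips */
    }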
Hi, On 2019-08-20 17:11:38 +0530, Dilip Kumar wrote: > On Wed, Aug 14, 2019 at 10:35 PM Andres Freund <andres@anarazel.de> wrote: > > On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote: > > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > > I don't think we can normally pin the undo buffers properly at that > > stage. Without knowing the correct contents of the table page - which we > > can't know without holding some form of lock preventing modifications - > > we can't know how big our undo records are going to be. And we can't > > just have buffers that don't exist on disk in shared memory, and we > > don't want to allocate undo that we then don't need. So I think what > > we'd have to do at that stage, is to "pre-allocate" buffers for the > > maximum amount of UNDO needed, but mark the associated bufferdesc as not > > yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID > > would not be set. > > > > So at the start of a function that will need to insert undo we'd need to > > pre-reserve the maximum number of buffers we could potentially > > need. That reservation stage would > > > > a) pin the page with the current end of the undo > > b) if needed pin the page of older undo that we need to update (e.g. to > > update the next pointer) > > c) perform clock sweep etc to acquire (find or create) enough clean buffers > > to hold the maximum amount of undo needed. These buffers would be marked > > as !BM_TAG_VALID | BUF_REFCOUNT_ONE. > > > > I assume that we'd make a) cheap by keeping it pinned for undo logs that > > a backend is actively attached to. b) should only be needed once in a > > transaction, so it's not too bad. c) we'd probably need to amortize > > across multiple undo insertions, by keeping the unused buffers pinned > > until the end of the transaction. > > > > I assume that having the infrastructure for c) might also make some code > > already in postgres easier. There are obviously some issues around > > guaranteeing that the maximum number of such buffers isn't high. > > I have analyzed this further, and I think there is a problem if the > record(s) will not fit into the current undo log and we will have to > switch the log. Because before knowing the actual record length, we > are not sure whether the undo log will switch or not and which undo > log we will get. And, without knowing the logno (rnode), how are we > going to pin the buffers? Am I missing something? That's precisely why I was suggesting (at the start of the quoted block above) to not associate the buffers with pages at that point. Instead just have clean, pinned, *unassociated* buffers. Which can be re-associated without any IO. > Apart from this, while analyzing the other code I have noticed that in > the current PG code we have a few occurrences where we try to read a > buffer while already holding a buffer lock. > 1. In gistplacetopage:
>
>     ...
>     for (; ptr; ptr = ptr->next)
>     {
>         /* Allocate new page */
>         ptr->buffer = gistNewBuffer(rel);
>         GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
>         ptr->page = BufferGetPage(ptr->buffer);
>         ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
>     }
>
> 2. During page split we find a new buffer while holding the lock on the > current buffer. > > That doesn't mean that we can't do better but I am just referring to > the existing code where we already have such issues. Those are pretty clearly edge-cases, whereas the undo case at hand is a very common path. Note again that heapam.c goes to considerable trouble to never do this for common cases.
Greetings, Andres Freund
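In outline, the reservation scheme sketched above might look like this; ReserveUnassociatedBuffer() and UndoAssignBufferTag() are invented names for this sketch, not the actual bufmgr API:

    /* Up front: collect enough clean, pinned, tag-less victim buffers. */
    for (int i = 0; i < max_undo_pages; i++)
        reserved[i] = ReserveUnassociatedBuffer();  /* may write out a dirty page */

    /*
     * Later, with the table page lock held and the record size known:
     * associate the buffers with concrete undo blocks.  No I/O can be
     * needed here, because the buffers are already clean and pinned.
     */
    for (int i = 0; i < npages_needed; i++)
        UndoAssignBufferTag(reserved[i], logno, firstblock + i);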
On 2019-08-20 09:44:23 -0700, Andres Freund wrote: > On 2019-08-20 21:02:18 +1200, Thomas Munro wrote: > > Aside from code changes based on review (and I have more to come of > > those), the attached experimental patchset (also at > > https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol > > that, I hope, allows for better concurrency, reliability and > > readability, and removes a bunch of TODO notes about questionable > > interlocking. However, I'm not quite done figuring out if the bufmgr > > interaction is right and will be manageable on the undoaccess side, so > > I'm hoping to get some feedback, not asking for anyone to rebase on > > top of it yet. > > > > Previously, there were two LWLocks used to make sure that all > > discarding was prevented while anyone was reading or writing data in > > any part of an undo log, and (probably more problematically) vice > > versa. Here's a new approach that removes that blocking: Oh, one more point I forgot to add: Cool!
On Tue, Aug 20, 2019 at 5:02 AM Thomas Munro <thomas.munro@gmail.com> wrote: > 3. UndoLogDiscard() uses DiscardBuffer() to invalidate any currently > unpinned buffers, and marks as BM_DISCARDED any that happen to be > pinned right now, so they can't be immediately invalidated. Such > buffers are never written back and are eligible for reuse on the next > clock sweep, even if they're written into by a backend that managed to > do that when we were trying to discard. This is definitely more concurrent, but it might be *too much* concurrency. Suppose that backend #1 is inserting a row and updating the transaction header for the previous transaction; meanwhile, backend #2 is discarding the previous transaction. It could happen that backend #1 locks the transaction header for the previous transaction and is all set to log the insertion ... but then gets context-switched out. Now backend #2 swoops in and logs the discard. Backend #1 now wakes up and finishes logging a change to a page that, according to the logic of the WAL stream, no longer exists. It's probably possible to make this work by ignoring WAL references to discarded pages during replay, but that seems a bit dangerous. At least, it loses some sanity checking that you might like to have. It seems to me that you can avoid this if you require that a backend that wants to set BM_DISCARDED acquire at least a shared content lock before doing so. If you do that, then once a backend acquires content lock(s) on the page(s) containing the transaction header for the purposes of updating it, it can notice that the BM_DISCARDED flag is set and choose not to update those pages after all. I think that would be a smart design choice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
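A sketch of the interlock Robert proposes, using real bufmgr primitives but otherwise illustrative (BM_DISCARDED is the patchset's flag; the discarder/writer split is for this example only):

    BufferDesc *bufHdr;
    uint32      buf_state;

    /* Discarder: take at least a share content lock before flagging. */
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    bufHdr = GetBufferDescriptor(buf - 1);
    buf_state = LockBufHdr(bufHdr);
    UnlockBufHdr(bufHdr, buf_state | BM_DISCARDED);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

    /* Writer: with the content lock held, the flag can be trusted. */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    if (pg_atomic_read_u32(&bufHdr->state) & BM_DISCARDED)
    {
        /* Transaction header already discarded; skip the update. */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        return;
    }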
On Tue, Aug 13, 2019 at 8:11 AM Robert Haas <robertmhaas@gmail.com> wrote: > > We can probably check the fxid queue and error queue to get that > > value. However, I am not sure if that is sufficient because in case we > > perform the request in the foreground, it won't be present in queues. > > Oh, I forgot about that requirement. I think I can fix it so it does > that fairly easily, but it will require a little bit of redesign which > I won't have time to do this week. Here's a version with a quick (possibly buggy) prototype of the oldest-FXID support. It also includes a bunch of comment changes, pgindent, and a few other tweaks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Aug 20, 2019 at 7:57 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Aug 19, 2019 at 2:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Currently, In UnpackedUndoRecord we store all members directly which > > are set by the caller. We store pointers to some header which are > > allocated internally by the undo layer and the caller need not worry > > about setting them. So now you are suggesting to put other headers > > also as structures in UnpackedUndoRecord. I as such don't have much > > problem in doing that but I think initially Robert designed > > UnpackedUndoRecord structure this way so it will be good if Robert > > provides his opinion on this. > > I don't believe that's what is being suggested. It seems to me that > the thing Andres is complaining about here has roots in the original > sketch that I did for this code. The oldest version I can find is > here: > > https://github.com/EnterpriseDB/zheap/commit/7d194824a18f0c5e85c92451beab4bc6f044254c > > In this version, and I think still in the current version, there is a > two-stage marshaling strategy. First, the individual fields from the > UnpackedUndoRecord get copied into global variables (yes, that was my > fault, too, at least in part!) which are structures. Then, the > structures get copied into the target buffer. The idea of that design > was to keep the code simple, but it didn't really work out, because > things got a lot more complicated between the time I wrote those 3244 > lines of code and the >3000 lines of code that live in this patch > today. One thing that changed was that we moved more and more in the > direction of considering individual fields as separate objects to be > separately included or excluded, whereas when I wrote that code I > thought we were going to have groups of related fields that stood or > fell together. That idea turned out to be wrong. (There is the > even-larger question here of whether we ought to take Heikki's > suggestion and make this whole thing a lot more generic, but let's > start by discussing how the design that we have today could be > better-implemented.) > > If I understand Andres correctly, he's arguing that we ought to get > rid of the two-stage marshaling strategy. During decoding, he wants > data to go directly from the buffer that contains it to the > UnpackedUndoRecord without ever being stored in the UnpackUndoContext. > During insertion, he wants data to go directly from the > UnpackedUndoRecord to the buffer that contains it. Or, at least, if > there has to be an intermediate format, he wants it to be just a chunk > of raw bytes, rather than a bunch of individual fields like we have in > UndoPackContext currently. I think that's a reasonable goal. I'm not > as concerned about it as he is from a performance point of view, but I > think it would make the code look nicer, and that would be good. If > we save CPU cycles along the way, that is also good. > > In broad outline, what this means is: > > 1. Any field in the UndoPackContext that starts with urec_ goes away. > 2. Instead of something like InsertUndoBytes((char *) > &(ucontext->urec_fxid), ...) we'd write InsertUndobytes((char *) > &uur->uur_fxid, ...). > 3. Similarly instead of ReadUndoBytes((char *) &ucontext->urec_fxid, > ...) we'd write ReadUndoBytes((char *) &uur->uur_fxid, ...). > 4. It seems slightly trickier to handle the cases where we've got a > structure instead of individual fields, like urec_hd. But those could > be broken down into field-by-field reads and writes, e.g.
in this case > one call for urec_type and a second for urec_info. > 5. For uur_group and uur_logswitch, the code would need to allocate > those subsidiary structures before copying into them. > > To me, that seems like it ought to be a pretty straightforward change > that just makes things simpler. We'd probably need to pass the > UnpackedUndoRecord to BeginUnpackUndo instead of FinishUnpackUndo, and > keep a pointer to it in the UnpackUndoContext, but that seems fine. > FinishUnpackUndo would end up just about empty, maybe entirely empty. > > Is that a reasonable idea? > I have already attempted that part and I feel it is not making code any simpler than what we have today. For packing, it's fine because I can process all the member once and directly pack it into one memory chunk and I can insert that to the buffer by one call of InsertUndoBytes and that will make the code simpler. But, while unpacking if I directly unpack to the UnpackUndoRecord then there are few complexities. I am not saying those are difficult to implement but code may not look better. a) First, we need to add extra stages for unpacking as we need to do field by field. b) Some of the members like uur_payload and uur_tuple are not the same type in the UnpackUndoRecord compared to how it is stored in the page. In UnpackUndoRecord those are StringInfoData whereas on the page we store it as UndoRecordPayload header followed by the actual data. I am not saying we can not unpack this directly we can do it like, first read the payload length from the page in uur_payload.len then read tuple length in uur_tuple.len then read both the data. And, for that, we will have to add extra stages. c) Currently, in UnpackUndoContext the members are stored in the same order in which we are storing them to the page whereas in UnpackUndoRecord they are stored in the order such that they are more convenient for them to understand, like all the fields which are set by the caller are separate from the fields which are allocated internally by the undo layer (transaction header and the log switch header). Now, for directly unpacking to the UnpackUndoRecord, we need to read them out of order which will make code more unreadable. Another option could be that we unpack some part directly into the UnapackUndoRecord (individual fields) and other parts to UnpackUndoContext (structures, payload) and in Finalise only copy those parts from UnpackUndoContext to UnapackUndoRecord. The code might look bit confusing though. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 20, 2019 at 8:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Aug 20, 2019 at 2:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Well, my main point, which so far has largely been ignored, was that we > > > may not acquire page locks when we still need to search for victim > > > buffers later. If we don't need to lock the pages up-front, but only do > > > so once we're actually copying the records into the undo pages, then we > > > don't need a separate phase to acquire the locks. We can still hold all of > > > the page locks at the same time, as long as we just acquire them at the > > > later stage. > > > > Okay, IIUC, this means that we should have a separate phase where we > > call LockUndoBuffers (or something like that) before > > InsertPreparedUndo and after PrepareUndoInsert. The LockUndoBuffers > > will lock all the buffers pinned during PrepareUndoInsert. We can > > probably call LockUndoBuffers before entering the critical section to > > avoid any kind of failure in the critical section. If so, that sounds > > reasonable to me. > > I'm kind of scratching my head here, because this is clearly different > than what Andres said in the quoted text to which you were replying. > He clearly implied that we should acquire the buffer locks within the > critical section during InsertPreparedUndo, and you responded by > proposing to do it outside the critical section in a separate step. > Regardless of which way is actually better, when somebody says "hey, > let's do A!" and you respond by saying "sounds good, I'll go implement > B!" that's not really helping us to get toward a solution. > I got confused by the statement "We can still hold all of the page locks at the same time, as long as we just acquire them at the later stage." > FWIW, although I also thought of doing what you are describing here, I > think Andres's proposal is probably preferable, because it's simpler. > There's not really any reason why we can't take the buffer locks from > within the critical section, and that way callers don't have to deal > with the extra step. > IIRC, the reason this was done before starting the critical section was because of the coding convention mentioned in src/access/transam/README (Section: Write-Ahead Log Coding). It says to first pin and exclusive-lock the shared buffers and then start the critical section. It might be that we can bypass that convention here, but I guess it is mainly to avoid any error in the critical section. I have checked the LWLockAcquire path and there doesn't seem to be any reason that it will throw an error except when the caller has acquired too many locks at the same time, which is not the case here. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 21, 2019 at 3:55 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have already attempted that part and I feel it is not making the code > any simpler than what we have today. For packing, it's fine because I > can process all the members at once and directly pack them into one memory > chunk, and I can insert that into the buffer with one call of > InsertUndoBytes, and that will make the code simpler. OK... > But while unpacking, if I directly unpack into the UnpackedUndoRecord then > there are a few complexities. I am not saying those are difficult to > implement, but the code may not look better. > > a) First, we need to add extra stages for unpacking, as we need to do > it field by field. > > b) Some of the members, like uur_payload and uur_tuple, are not the same > type in the UnpackedUndoRecord as how they are stored on the page. > In the UnpackedUndoRecord those are StringInfoData, whereas on the page we > store them as an UndoRecordPayload header followed by the actual data. I > am not saying we cannot unpack this directly; we can do it like this: > first read the payload length from the page into uur_payload.len, then > read the tuple length into uur_tuple.len, then read both sets of data. And, for > that, we will have to add extra stages. I don't think that this is true; or at least, I think it's avoidable. The idea of an unpacking stage is that you refuse to advance to the next stage until you've got a certain number of bytes of data; and then you unpack everything that pertains to that stage. If you have 2 4-byte fields that you want to unpack together, you can just wait until you've got 8 bytes of data and then unpack both. You don't really need 2 separate stages. (Similarly, your concern about fields being in a different order seems like it should be resolved by agreeing on one ordering and having everything use it; I don't know why there should be one order that is better in memory and another order that is better on disk.) The bigger issue here is that we don't seem to be making very much progress toward improving the overall design. Several significant improvements have been suggested: 1. Thomas suggested quite some time ago that we should make sure that the transaction header is the first optional header. If we do that, then I'm not clear on why we even need this incremental unpacking stuff any more. The only reason for all of this machinery was so that we could find the transaction header at an unknown offset inside a complex record format; there is, if I understand correctly, no other case in which we want to incrementally decode a record. But if the transaction header is at a fixed offset, then there seems to be no need to even have incremental decoding at all. Or it can be very simple, with three stages: (1) we don't yet have enough bytes to figure out how big the record is; (2) we have enough bytes to figure out how big the record is and we have figured that out but we don't yet have all of those bytes; and (3) we have the whole record, we can decode the whole thing and we're done. 2. Based on a discussion with Thomas, I suggested the GHOB stuff, which gets rid of the idea of transaction headers inside individual records altogether; instead, there is one undo blob per transaction (or maybe several if we overflow to another undo log) which begins with a sentinel byte that identifies it as owned by a transaction, and then the transaction header immediately follows that without being part of any record, and the records follow that data.
As with the previous idea, this gets rid of the need for incremental decoding because it gets rid of the need to find the transaction header inside of a bigger record. As Thomas put it to me off-list, it puts the records inside of a larger chunk of space owned by the transaction instead of putting the transaction header inside of some particular record; that seems more correct than the way we have it now. 3. Heikki suggested that the format should be much more generic and that more should be left up to the AM. While neither Andres nor I are convinced that we should go as far in that direction as Heikki is proposing, the idea clearly has some merit, and we should probably be moving that way to some degree. For instance, the idea that we should store a block number and TID is a bit sketchy when you consider that a multi-insert operation really wants to store a TID list. The zheap tree has a ridiculous kludge to work around that problem; clearly we need something better. We've also mentioned that, in the future, we might want to support TIDs that are not 6 bytes, and that even just looking at what's currently under development, zedstore wants to treat TIDs as 48-bit opaque quantities, not a 4-byte block number and a 2-byte item pointer offset. So, there is clearly a need to go through the whole way we're doing this and rethink which parts are generic and which parts are AM-specific. 4. A related problem, which has been mentioned or at least alluded to by both Heikki and by me, is that we need a better way of handling the AM-specific data. Right now, the zheap code packs fixed-size things into the payload data and then finds them by knowing the offset where any particular data is within that field, but that's an unmaintainable mess. The zheap code could be improved by at least defining those offsets as constants someplace and adding some comments explaining the payload formats of various undo records, but even if we do that, it's not going to generalize very well to anything more complicated than a few fixed-size bits of data. I suggested using the pqformat stuff to try to structure that -- a suggestion to which Heikki has unfortunately not responded, because I'd really like to get his thoughts on it -- but whether we do that particular thing or not, I think we need to do something. In addition to wanting a better way of handling packing and unpacking for payload data, there's also a desire to have it participate in record compression, for which we don't seem to have any kind of plan. 5. Andres suggested multiple ideas for cleaning up and improving this code in https://www.postgresql.org/message-id/20190814065745.2faw3hirvfhbrdwe%40alap3.anarazel.de - which include the idea currently under discussion, several of the same ideas that I mentioned above, and a number of other things, such as making packing serialize to a char * rather than some ad-hoc intermediate format and having a metadata array over which we can loop rather than having multiple places where there's a separate bit of code for every field type. I don't think those suggestions are entirely unproblematic; for instance, the metadata array would probably work a lot better if we moved the transaction and log-switch headers outside of individual records as suggested in (2) above. Otherwise, the metadata would have to include not only data-structure offsets but some kind of a flag indicating which of several data structures ought to contain the relevant information, which would make the whole thing a lot messier.
And depending on what we do about (4), this might become moot or the details might change quite a bit, because if we no longer have a long list of "generic" fields, then we also won't have a bunch of places in the code that deal with that long list of generic fields, which means the metadata array might not be necessary, or might be simpler or smaller or designed differently. All of which is to make the point that responding to Andres's feedback will require a bunch of decisions about which parts of the feedback to take (because some of them are mutually exclusive, as he acknowledges himself) and what to do about them (because some of them are vague); yet, taken together, they seem to amount to the need for significant design changes, as do (1)-(4). Now, just to be clear, the code we're talking about here is mostly based on an original design by me, and whatever defects were present in that original design are nobody's fault but mine. And that list of defects includes pretty much everything in the above list. But, what we need to figure out at this point is how we're going to get those things fixed, and it seems to me that we're going to need a pretty substantial redesign, but this discussion is kind of down in the weeds. I mean, what are we gaining by arguing about how many stages we need for incremental unpacking if the real thing we need to do is get rid of that concept altogether? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
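For reference, the metadata-array idea would amount to something like the following sketch; all names are invented for illustration, and as noted above its usefulness depends on the other design decisions:

    typedef struct UndoFieldDesc
    {
        uint16      info_flag;  /* uur_info bit that enables this field */
        Size        offset;     /* offsetof() within UnpackedUndoRecord */
        Size        size;       /* number of bytes to copy */
    } UndoFieldDesc;

    static const UndoFieldDesc undo_fields[] =
    {
        {UREC_INFO_RMID, offsetof(UnpackedUndoRecord, uur_rmid), sizeof(RmgrId)},
        {UREC_INFO_RELOID, offsetof(UnpackedUndoRecord, uur_reloid), sizeof(Oid)},
        {UREC_INFO_BLOCK, offsetof(UnpackedUndoRecord, uur_block), sizeof(BlockNumber)},
        /* ... one entry per optional field ... */
    };

    /* Packing and unpacking then become a single loop over undo_fields[]. */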
On Wed, Aug 21, 2019 at 6:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > FWIW, although I also thought of doing what you are describing here, I > > think Andres's proposal is probably preferable, because it's simpler. > > There's not really any reason why we can't take the buffer locks from > > within the critical section, and that way callers don't have to deal > > with the extra step. > > IIRC, the reason this was done before starting critical section was > because of coding convention mentioned in src/access/transam/README > (Section: Write-Ahead Log Coding). It says first pin and exclusive > lock the shared buffers and then start critical section. It might be > that we can bypass that convention here, but I guess it is mainly to > avoid any error in the critical section. I have checked the > LWLockAcquire path and there doesn't seem to be any reason that it > will throw error except when the caller has acquired many locks at the > same time which is not the case here. Yeah, I think it's fine to deviate from that convention in this respect. We treat LWLockAcquire() as a no-fail operation in many places; in my opinion, that elog(ERROR) that we have for too many LWLocks should be changed to elog(PANIC) precisely because we do treat LWLockAcquire() as no-fail in lots of places in the code, but I think I suggested that once and got slapped down, and I haven't had the energy to fight about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On August 21, 2019 8:36:34 AM PDT, Robert Haas <robertmhaas@gmail.com> wrote: > We treat LWLockAcquire() as a no-fail operation in many >places; in my opinion, that elog(ERROR) that we have for too many >LWLocks should be changed to elog(PANIC) precisely because we do treat >LWLockAcquire() as no-fail in lots of places in the code, but I think >I suggested that once and got slapped down, and I haven't had the >energy to fight about it. Fwiw, that proposal has my vote. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Aug 14, 2019 at 2:57 AM Andres Freund <andres@anarazel.de> wrote: > - My reading of the current xact.c integration is that it's not workable > as is. Undo is executed outside of a valid transaction state, > exceptions aren't properly undone, logic would need to be duplicated > to a significant degree, new kind of critical section. Regarding this particular point: ReleaseResourcesAndProcessUndo() is only supposed to be called after AbortTransaction(), and the additional steps it performs -- AtCleanup_Portals() and AtEOXact_Snapshot() or alternatively AtSubCleanup_Portals -- are taken from Cleanup(Sub)Transaction. That's not crazy; the other steps in Cleanup(Sub)Transaction() look like stuff that's intended to be performed when we're totally done with this TransactionState stack entry, whereas these things are slightly higher-level cleanups that might even block undo (e.g. undropped portal prevents orphaned file cleanup). Granted, there are no comments explaining why those particular cleanup steps are performed here, and it's possible some other approach is better, but I think perhaps it's not quite as flagrantly broken as you think. I am also not convinced that semi-critical sections are a bad idea, although the if (SemiCritSectionCount > 0) test at the start of ReleaseResourcesAndProcessUndo() looks wrong. To roll back a subtransaction, we must perform undo in the foreground, and if that fails, the toplevel transaction can't be allowed to commit, full stop. Since we expect this to be a (very) rare scenario, I don't know why escalating to FATAL is a catastrophe. The only other option is to do something along the lines of SxactIsDoomed(), where we force all subsequent commits (and sub-commits?) within the toplevel xact to fail. You can argue that the latter is a better user experience, and for SSI I certainly agree, but this case isn't quite the same: there's a good chance we're dealing with a corrupted page or system administrator intervention to try to kill off a long-running undo task, and continuing in such cases seems a lot more dubious than after a serializability failure, where retrying is the expected recovery mode. The other case is where toplevel undo for a temporary table fails. It is unclear to me what, other than FATAL, could suffice there. I guess you could just let the session continue and leave the transaction undone, letting whatever MVCC machinery the table AM may have look through it, but that sounds inferior to me. Rip the bandaid off. Some general complaints from my side about the xact.c changes: 1. The code structure doesn't seem quite right. For example: 1a. ProcessUndoRequestForEachLogCat has a try/catch block, but it seems to me that the job of a try/catch block is to provide structured error-handling for resources for which there's no direct handling in xact.c or resowner.c. Here, we're inside of xact.c, so why are we adding a try/catch block? 1b. ReleaseResourcesAndProcessUndo does part of the work of cleaning up a failed transaction but not all of it, the rest being done by AbortTransaction, which is called before entering it, plus it also kicks off the actual undo work. I would expect a cleaner division of responsibility. 1c. Having an undo request per UndoLogCategory rather than one per transaction doesn't seem right to me; hopefully that will get cleaned up when the undorequest.c stuff I sent before is integrated. 1d.
The code at the end of FinishPreparedTransaction() seems to expect that the called code will catch any error, but that clearing the error state might need to happen here, and also that we should fire up a new transaction; I suspect, but am not entirely sure, that that is not the way it should work. The code added earlier in that function also looks suspicious, because it's filling up what is basically a high-level control function with a bunch of low-level undo-specific details. In both places, the undo-specific concerns probably need to be better-isolated. 2. Signaling is done using some odd-looking mechanisms. For instance: 2a. The SemiCritSectionCount > 0 test at the top of ReleaseResourcesAndProcessUndo that I complained about earlier looks like a guard against reentrancy, but that must be the wrong way to get there; it makes it impossible to reuse what is ostensibly a general-purpose facility for any non-undo related purpose without maybe breaking something. 2b. ResetUndoActionsInfo() is called from a bunch of places, but only 2 of those places have a comment explaining why, and the function comment is pretty unilluminating. This looks like some kind of signaling machinery, but it's not very clear to me what it's actually trying to do. 2c. ResourceOwnerReleaseInternal() is directly calling NeedToPerformUndoActions(), which feels like a layering violation. 2d. I'm not really sure that TRANS_UNDO is serving any useful purpose; I think we need TBLOCK_UNDO and TBLOCK_SUBUNDO, but I'm not really sure TRANS_UNDO is doing anything useful; the change to SubTransactionIsActive() looks wrong to me, and the other changes I think would mostly go away if we just used TRANS_INPROGRESS. 2e. I'm skeptical that the err_out_to_client() stuff is the right way to suppress undo failure messages from being sent to the client. That needs to be done, but this doesn't seem like the right way. This is related to my complaint above about using a try/catch block inside xact.c. 3. I noticed a few other mistakes when reading through this again which I include here for the sake of completeness: 3a. memset(..., InvalidUndoRecPtr, ...) will only happen to work if every byte of InvalidUndoRecPtr happens to have the same value. That happens to be true, because it's defined as 8 bytes of zeroes, but it's not OK to code it like this. 3b. "undoRequestResgistered" is a typo. 3c. GetEpochForXid definitely shouldn't exist any more... as has been reported in the past. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Aug 21, 2019 at 9:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Aug 21, 2019 at 3:55 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have already attempted that part and I feel it is not making the code > > any simpler than what we have today. For packing, it's fine because I > > can process all the members at once and directly pack them into one memory > > chunk, and I can insert that into the buffer with one call of > > InsertUndoBytes, and that will make the code simpler. > > OK... > > The bigger issue here is that we don't seem to be making very much > progress toward improving the overall design. Several significant > improvements have been suggested: > > 1. Thomas suggested quite some time ago that we should make sure that > the transaction header is the first optional header. I think this was already done at least 2-3 versions ago. So now we are updating the transaction header directly by writing at that offset, and we don't need staging for this. > If we do that, > then I'm not clear on why we even need this incremental unpacking > stuff any more. The only reason for all of this machinery was so that > we could find the transaction header at an unknown offset inside a > complex record format; there is, if I understand correctly, no other > case in which we want to incrementally decode a record. > But if the > transaction header is at a fixed offset, then there seems to be no > need to even have incremental decoding at all. Or it can be very > simple, with three stages: (1) we don't yet have enough bytes to > figure out how big the record is; (2) we have enough bytes to figure > out how big the record is and we have figured that out but we don't > yet have all of those bytes; and (3) we have the whole record, we can > decode the whole thing and we're done. We cannot know the complete size of the record even by reading the header, because we have a payload that is the variable part, and the payload length is stored in the payload header, which again can be at a random offset. But maybe we can still follow this idea, which will make unpacking far simpler. I have a few ideas: a) Store the payload header right after the transaction header so that we can easily know the complete record size. b) Once we decode the first header by uur_info, we can compute the exact offset of the payload header and from there we can know the record size. > > 2. Based on a discussion with Thomas, I suggested the GHOB stuff, > which gets rid of the idea of transaction headers inside individual > records altogether; instead, there is one undo blob per transaction > (or maybe several if we overflow to another undo log) which begins > with a sentinel byte that identifies it as owned by a transaction, and > then the transaction header immediately follows that without being > part of any record, and the records follow that data. As with the > previous idea, this gets rid of the need for incremental decoding > because it gets rid of the need to find the transaction header inside > of a bigger record. As Thomas put it to me off-list, it puts the > records inside of a larger chunk of space owned by the transaction > instead of putting the transaction header inside of some particular > record; that seems more correct than the way we have it now. > > 3. Heikki suggested that the format should be much more generic and > that more should be left up to the AM.
While neither Andres nor I are > convinced that we should go as far in that direction as Heikki is > proposing, the idea clearly has some merit, and we should probably be > moving that way to some degree. For instance, the idea that we should > store a block number and TID is a bit sketchy when you consider that a > multi-insert operation really wants to store a TID list. The zheap > tree has a ridiculous kludge to work around that problem; clearly we > need something better. We've also mentioned that, in the future, we > might want to support TIDs that are not 6 bytes, and that even just > looking at what's currently under development, zedstore wants to treat > TIDs as 48-bit opaque quantities, not a 4-byte block number and a > 2-byte item pointer offset. So, there is clearly a need to go through > the whole way we're doing this and rethink which parts are generic and > which parts are AM-specific. > > 4. A related problem, which has been mentioned or at least alluded to > by both Heikki and by me, is that we need a better way of handling the > AM-specific data. Right now, the zheap code packs fixed-size things > into the payload data and then finds them by knowing the offset where > any particular data is within that field, but that's an unmaintainable > mess. The zheap code could be improved by at least defining those > offsets as constants someplace and adding some comments explaining the > payload formats of various undo records, but even if we do that, it's > not going to generalize very well to anything more complicated than a > few fixed-size bits of data. I suggested using the pqformat stuff to > try to structure that -- a suggestion to which Heikki has > unfortunately not responded, because I'd really like to get his > thoughts on it -- but whether we do that particular thing or not, I > think we need to do something. In addition to wanting a better way of > handling packing and unpacking for payload data, there's also a desire > to have it participate in record compression, for which we don't seem > to have any kind of plan. > > 5. Andres suggested multiple ideas for cleaning up and improving this > code in https://www.postgresql.org/message-id/20190814065745.2faw3hirvfhbrdwe%40alap3.anarazel.de > - which include the idea currently under discussion, several of the > same ideas that I mentioned above, and a number of other things, such > as making packing serialize to a char * rather than some ad-hoc > intermediate format I have implemented this patch. I will post this along with other changes. and having a metadata array over which we can loop > rather than having multiple places where there's a separate bit of > code for every field type. I don't think those suggestions are > entirely unproblematic; for instance, the metadata array would > probably work a lot better if we moved the transaction and log-switch > headers outside of individual records as suggested in (2) above. > Otherwise, the metadata would have to include not only data-structure > offsets but some kind of a flag indicating which of several data > structures ought to contain the relevant information, which would make > the whole thing a lot messier.
And depending on what we do about (4), > this might become moot or the details might change quite a bit, > because if we no longer have a long list of "generic" fields, then we > also won't have a bunch of places in the code that deal with that long > list of generic fields, which means the metadata array might not be > necessary, or might be simpler or smaller or designed differently. > All of which is to make the point that responding to Andres's feedback > will require a bunch of decisions about which parts of the feedback to > take (because some of them are mutually exclusive, as he acknowledges > himself) and what to do about them (because some of them are vague); > yet, taken together, they seem to amount to the need for significant > design changes, as do (1)-(4). > > Now, just to be clear, the code we're talking about here is mostly > based on an original design by me, and whatever defects were present > in that original design are nobody's fault but mine. And that list of > defects includes pretty much everything in the above list. But, what > we need to figure out at this point is how we're going to get those > things fixed, and it seems to me that we're going to need a pretty > substantial redesign, but this discussion is kind of down in the > weeds. I mean, what are we gaining by arguing about how many stages > we need for incremental unpacking if the real thing we need to do is get > rid of that concept altogether? Actually, in my local changes, I have already got rid of the multiple stages because I am packing all fields in one char * as suggested in the first part of 4). I had a problem while unpacking because we don't know the complete size of the record beforehand, especially because of the payload data. I have suggested a couple of points above as part of 1) for handling the payload size. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi, On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote: > We cannot know the complete size of the record even by reading the > header, because we have a payload that is the variable part, and the payload > length is stored in the payload header, which again can be at a random > offset. Wait, but that's just purely self-inflicted damage, no? The initial length just needs to include the payload. And all this is not an issue anymore? Greetings, Andres Freund
On Thu, Aug 22, 2019 at 9:58 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote: > > > We cannot know the complete size of the record even by reading the > > > header, because we have a payload that is the variable part, and the payload > > > length is stored in the payload header, which again can be at a random > > > offset. > > > > Wait, but that's just purely self-inflicted damage, no? The initial > > length just needs to include the payload. And all this is not an issue > > anymore? > Actually, we store the undo length only at the end of the record and that is for traversing the transaction's undo record chain during bulk fetch. As such, in the beginning of the record we don't have the undo length. We do have uur_info, but that just tells us which optional headers are included in the record. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi, On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote: > On Thu, Aug 22, 2019 at 9:58 AM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote: > > > > We cannot know the complete size of the record even by reading the > > > > header, because we have a payload that is the variable part, and the payload > > > > length is stored in the payload header, which again can be at a random > > > > offset. > > > > > > Wait, but that's just purely self-inflicted damage, no? The initial > > > length just needs to include the payload. And all this is not an issue > > > anymore? > > > Actually, we store the undo length only at the end of the record and > > that is for traversing the transaction's undo record chain during bulk > > fetch. As such, in the beginning of the record we don't have the undo > > length. We do have uur_info, but that just tells us which optional > > headers are included in the record. But why? It makes a *lot* more sense to have it in the beginning. I don't think bulk-fetch really requires it to be in the end - we can still process records forward on a page-by-page basis. Greetings, Andres Freund
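With a leading length word, the page-at-a-time forward scan could be as simple as the following sketch; SizeOfUndoPageHeader and process_undo_record() are illustrative names, and records crossing page boundaries are ignored for brevity:

    Size        off = SizeOfUndoPageHeader;     /* first record on this page */

    while (off + sizeof(uint16) <= BLCKSZ)
    {
        uint16      len;

        memcpy(&len, page + off, sizeof(uint16));
        if (len == 0)
            break;              /* remainder of the page is unused */
        process_undo_record(page + off, len);   /* caller-supplied */
        off += len;
    }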
On Thu, Aug 22, 2019 at 10:24 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote: > > > On Thu, Aug 22, 2019 at 9:58 AM Andres Freund <andres@anarazel.de> wrote: > > > > > > > > Hi, > > > > > > > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote: > > > > > We cannot know the complete size of the record even by reading the > > > > > header, because we have a payload that is the variable part, and the payload > > > > > length is stored in the payload header, which again can be at a random > > > > > offset. > > > > > > > > Wait, but that's just purely self-inflicted damage, no? The initial > > > > length just needs to include the payload. And all this is not an issue > > > > anymore? > > > > > > > Actually, we store the undo length only at the end of the record and > > > that is for traversing the transaction's undo record chain during bulk > > > fetch. As such, in the beginning of the record we don't have the undo > > > length. We do have uur_info, but that just tells us which optional > > > headers are included in the record. > > But why? It makes a *lot* more sense to have it in the beginning. I > don't think bulk-fetch really requires it to be in the end - we can > still process records forward on a page-by-page basis. Yeah, we can handle the bulk fetch as you suggested and it will make it a lot easier. But, currently while registering the undo request (especially during the first pass) we need to compute the from_urecptr and the to_urecptr. And, for computing the from_urecptr, we have the end location of the transaction because we have the uur_next in the transaction header and that will tell us the end of our transaction but we still don't know the undo record pointer of the last record of the transaction. As of now, we read the previous 2 bytes from the end of the transaction to know the length of the last record and from there we can compute the undo record pointer of the last record and that is our from_urecptr. Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
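In code form, the computation Dilip describes is roughly the following; UndoGetPrevRecordLen() is a stand-in for reading those two trailing bytes, and UndoRecPtr arithmetic across page and segment boundaries is glossed over:

    uint16      last_record_len;
    UndoRecPtr  from_urecptr;

    /* uur_next in our transaction header gives the next transaction's start. */
    last_record_len = UndoGetPrevRecordLen(next_xact_start);   /* hypothetical */
    from_urecptr = next_xact_start - last_record_len;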
On Thu, Aug 22, 2019 at 11:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Aug 22, 2019 at 10:24 AM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote: > > > On Thu, Aug 22, 2019 at 9:58 AM Andres Freund <andres@anarazel.de> wrote: > > > > > > > > Hi, > > > > > > > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote: > > > > > We cannot know the complete size of the record even by reading the > > > > > header, because we have a payload that is the variable part, and the payload > > > > > length is stored in the payload header, which again can be at a random > > > > > offset. > > > > > > > > Wait, but that's just purely self-inflicted damage, no? The initial > > > > length just needs to include the payload. And all this is not an issue > > > > anymore? > > > > > > > Actually, we store the undo length only at the end of the record and > > > that is for traversing the transaction's undo record chain during bulk > > > fetch. As such, in the beginning of the record we don't have the undo > > > length. We do have uur_info, but that just tells us which optional > > > headers are included in the record. > > > > But why? It makes a *lot* more sense to have it in the beginning. I > > don't think bulk-fetch really requires it to be in the end - we can > > still process records forward on a page-by-page basis. > > Yeah, we can handle the bulk fetch as you suggested and it will make > it a lot easier. But, currently while registering the undo request > (especially during the first pass) we need to compute the from_urecptr > and the to_urecptr. And, for computing the from_urecptr, we have the > end location of the transaction because we have the uur_next in the > transaction header and that will tell us the end of our transaction > but we still don't know the undo record pointer of the last record of > the transaction. As of now, we read the previous 2 bytes from the end of > the transaction to know the length of the last record and from there > we can compute the undo record pointer of the last record and that is > our from_urecptr. > How about if we store the location of the last record of the transaction instead of the location of the next transaction in the transaction header? I think if we do that then the discard worker might need to do some additional work in some cases, as it needs to tell the location up to which discard is possible; however, many other cases might get simplified. With this also, when the log is switched while writing records for the same transaction, the transaction header in the first log will store the start location of the same transaction's records in the next log. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 22, 2019 at 12:54 AM Andres Freund <andres@anarazel.de> wrote: > But why? It makes a *lot* more sense to have it in the beginning. I > don't think bulk-fetch really requires it to be in the end - we can > still process records forward on a page-by-page basis. There are two separate needs here: to be able to go forward, and to be able to go backward. We have the length at the end of each record not because we're stupid, but so that we can back up. If we have another way of backing up, then the thing to do is not to move that to beginning of the record but to remove it entirely as unnecessary wastage. We can also think about how to improve forward traversal. Considering each problem separately: For forward traversal, we could simplify things somewhat by having only 3 decoding stages instead of N decoding stages. We really only need (1) a stage for accumulating bytes until we have uur_info, and then (2) a stage for accumulating bytes until we know the payload and tuple lengths, and then (3) a stage for accumulating bytes until we have the whole record. We have a lot more stages than that right now but I don't think we really need them for anything. Originally we had them so that we could do incremental decoding to find the transaction header in the record, but now that the transaction header is at a fixed offset, I think the multiplicity of stages is just baggage. We could simplify things more by deciding that the first two bytes of the record are going to contain the record size. That would increase the size of the record by 2 bytes, but we could (mostly) claw those bytes back by not storing the size of both uur_payload and uur_tuple. The size of the other one would be computed by subtraction: take the total record size, subtract the size of whichever of those two things we store, subtract the mandatory and optional headers that are present, and the rest must be the other value. That would still add 2 bytes for records that contain neither a payload nor a tuple, but that would probably be OK given that (a) a lot of records wouldn't be affected, (b) the code would get simpler, and (c) something like this seems necessary anyway given that we want to make the record format more generic. With this approach instead of 3 stages we only need 2: (1) accumulating bytes until we have the 2-byte length word, and (2) accumulating bytes until we have the whole record. For backward traversal, as I see it, there are basically two options. One is to do what we're doing right now, and store the record length at the end of the record. (That might mean that a record both begins and ends with its own length, which is not a crazy design.) The other is to do what I think you are proposing here: locate the beginning of the first record on the page, presumably based on some information stored in the page header, and then work forward through the page to figure out where all the records start. Then process them in reverse order. That saves 2 bytes per record. It's a little more expensive in terms of CPU cycles, especially if you only need some of the records on the page but not all of them, but that's probably not too bad. I think I'm basically agreeing with what you are proposing but I think it's important to spell out the underlying concerns, because otherwise I'm afraid we might think we have a meeting of the minds when we don't really. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
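Concretely, the subtraction Robert describes might look like this; UndoRecordHeaderSize() is a hypothetical helper that sums the mandatory header plus whichever optional headers uur_info says are present:

    Size        header_size = UndoRecordHeaderSize(uur->uur_info);
    Size        payload_len;

    /*
     * Only one of the two variable-length sizes (here, uur_tuple.len) is
     * stored on disk; the payload length is whatever remains once the
     * headers and the tuple data are accounted for.
     */
    payload_len = total_record_size - header_size - uur->uur_tuple.len;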
On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Yeah, we can handle the bulk fetch as you suggested and it will make > it a lot easier. But, currently while registering the undo request > (especially during the first pass) we need to compute the from_urecptr > and the to_urecptr. And, for computing the from_urecptr, we have the > end location of the transaction because we have the uur_next in the > transaction header and that will tell us the end of our transaction > but we still don't know the undo record pointer of the last record of > the transaction. As of know, we read previous 2 bytes from the end of > the transaction to know the length of the last record and from there > we can compute the undo record pointer of the last record and that is > our from_urecptr.= I don't understand this. If we're registering an undo request at "do" time, we don't need to compute the starting location; we can just remember the UndoRecPtr of the first record we inserted. If we're reregistering an undo request after a restart, we can (and, I think, should) work forward from the discard location rather than backward from the insert location. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Aug 22, 2019 at 7:34 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Yeah, we can handle the bulk fetch as you suggested and it will make > > it a lot easier. But, currently while registering the undo request > > (especially during the first pass) we need to compute the from_urecptr > > and the to_urecptr. And, for computing the from_urecptr, we have the > > end location of the transaction because we have the uur_next in the > > transaction header and that will tell us the end of our transaction > > but we still don't know the undo record pointer of the last record of > > the transaction. As of know, we read previous 2 bytes from the end of > > the transaction to know the length of the last record and from there > > we can compute the undo record pointer of the last record and that is > > our from_urecptr.= > > I don't understand this. If we're registering an undo request at "do" > time, we don't need to compute the starting location; we can just > remember the UndoRecPtr of the first record we inserted. If we're > reregistering an undo request after a restart, we can (and, I think, > should) work forward from the discard location rather than backward > from the insert location. Right, we work froward from the discard location. So after the discard location, while traversing the undo log when we encounter an aborted transaction we need to register its rollback request. And, for doing that we need 1) start location of the first undo record . 2) start location of the last undo record (last undo record pointer). We already have 1). But we have to compute 2). For doing that if we unpack the first undo record we will know the start of the next transaction. From there if we read the last two bytes then that will have the length of the last undo record of our transaction. So we can compute 2) with below formula start of the last undo record = start of the next transaction - length of our transaction's last record. Am I making sense here? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 22, 2019 at 9:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Aug 22, 2019 at 7:34 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > Yeah, we can handle the bulk fetch as you suggested and it will make > > > it a lot easier. But, currently while registering the undo request > > > (especially during the first pass) we need to compute the from_urecptr > > > and the to_urecptr. And, for computing the from_urecptr, we have the > > > end location of the transaction because we have the uur_next in the > > > transaction header and that will tell us the end of our transaction > > > but we still don't know the undo record pointer of the last record of > > > the transaction. As of know, we read previous 2 bytes from the end of > > > the transaction to know the length of the last record and from there > > > we can compute the undo record pointer of the last record and that is > > > our from_urecptr.= > > > > I don't understand this. If we're registering an undo request at "do" > > time, we don't need to compute the starting location; we can just > > remember the UndoRecPtr of the first record we inserted. If we're > > reregistering an undo request after a restart, we can (and, I think, > > should) work forward from the discard location rather than backward > > from the insert location. > > Right, we work froward from the discard location. So after the > discard location, while traversing the undo log when we encounter an > aborted transaction we need to register its rollback request. And, > for doing that we need 1) start location of the first undo record . 2) > start location of the last undo record (last undo record pointer). > > We already have 1). But we have to compute 2). For doing that if we > unpack the first undo record we will know the start of the next > transaction. From there if we read the last two bytes then that will > have the length of the last undo record of our transaction. So we can > compute 2) with below formula > > start of the last undo record = start of the next transaction - length > of our transaction's last record. Maybe I am saying that because I am just thinking how the requests are registered as per the current code. But, those requests will ultimately be used for collecting the record by the bulk fetch. So if we are planning to change the bulk fetch to read forward then maybe we don't need the valid last undo record pointer because that we will anyway get while processing forward. So just knowing the end of the transaction is sufficient for us to know where to stop. I am not sure if this solution has any problem. Probably I should think again in the morning when my mind is well-rested. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi On August 22, 2019 9:14:10 AM PDT, Dilip Kumar <dilipbalaut@gmail.com> wrote: > But, those requests will >ultimately be used for collecting the record by the bulk fetch. So if >we are planning to change the bulk fetch to read forward then maybe we >don't need the valid last undo record pointer because that we will >anyway get while processing forward. So just knowing the end of the >transaction is sufficient for us to know where to stop. I am not sure >if this solution has any problem. Probably I should think again in >the morning when my mind is well-rested. I don't think we can easily do so for bulk apply without incurring significant overhead. It's pretty cheap to read in forwardorder and then process backwards on a page level - but for an entire transactions undo the story is different. Wecan't necessarily keep all of it in memory, so we'd have to read the undo twice to find the end. Right? Andres Andres Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Aug 22, 2019 at 9:55 PM Andres Freund <andres@anarazel.de> wrote: > > Hi > > On August 22, 2019 9:14:10 AM PDT, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > But, those requests will > >ultimately be used for collecting the record by the bulk fetch. So if > >we are planning to change the bulk fetch to read forward then maybe we > >don't need the valid last undo record pointer because that we will > >anyway get while processing forward. So just knowing the end of the > >transaction is sufficient for us to know where to stop. I am not sure > >if this solution has any problem. Probably I should think again in > >the morning when my mind is well-rested. > > I don't think we can easily do so for bulk apply without incurring significant overhead. It's pretty cheap to read in forwardorder and then process backwards on a page level - but for an entire transactions undo the story is different. Wecan't necessarily keep all of it in memory, so we'd have to read the undo twice to find the end. Right? > I was not talking about the entire transaction, I was also telling about the page level as you suggested. I was just saying that we may not need the start position of the last undo record of the transaction for registering the rollback request (which we currently do). However, we need to know the end of the transaction to know the last page from which we need to start reading forward. Let me explain with an example Transaction1 first, undo start at 10 first, undo end at 100 second, undo start at 101 second, undo end at 200 ...... last, undo start at 1000 last, undo end at 1100 Transaction2 first, undo start at 1101 first, undo end at 1200 second, undo start at 1201 second, undo end at 1300 Suppose we want to register the request for Transaction1. Then currently we need to know the start undo record pointer (10 as per above example) and the last undo record pointer (1000). But, we only know the start undo record pointer(10) and the start of the next transaction(1101). So for calculating the start of the last record, we use 1101 - 101 (length of the last record store 2 bytes before 1101). So, now I am saying that maybe we don't need to compute the start of last undo record (1000) because it's enough to know the end of the last undo record(1100). Because on whichever page the last undo record ends, we can start from that page and read forward on that page. * All numbers I used in the above example can be considered as undo record pointers. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 21, 2019 at 4:44 AM Andres Freund <andres@anarazel.de> wrote: > On 2019-08-20 21:02:18 +1200, Thomas Munro wrote: > > 1. Anyone is allowed to try to read or write data at any UndoRecPtr > > that has been allocated, through the buffer pool (though you'd usually > > want to check it with UndoRecPtrIsDiscarded() first, and only rely on > > the system I'm describing to deal with races). > > > > 2. ReadBuffer() might return InvalidBuffer. This can happen for a > > cache miss, if the smgrread implementation wants to indicate that the > > buffer has been discarded/truncated and that is expected (md.c won't > > ever do that, but undofile.c can). > > Hm. This gives me a bit of a stomach ache. It somehow feels like a weird > form of signalling. Can't quite put my finger on why it makes me feel > queasy. Well, if we're going to allow concurrent access and discarding, there has to be some part of the system where you can discover that the thing you wanted is gone. What's wrong with here? Stepping back a bit... why do we have a new concept here? The reason we don't have anything like this for relations currently is that we don't have live references to blocks that are expected to be concurrently truncated, due to heavyweight locking. But the whole purpose of the undo system is to allow cheap truncation/discarding of data that you *do* have live references to, and furthermore the discarding is expected to be frequent. The present experiment is about trying to do so without throwing our hands up and using a pessimistic lock. At one point Robert and I discussed some kind of scheme where you'd register your interest in a range of the log before you begin reading (some kind of range locks or hazard pointers), so that you would block discarding in that range, but the system would still allow someone to read in the middle of the log while the discard worker concurrently discards non-overlapping data at the end. But I kept returning to the idea that the buffer pool already has block-level range locking of various kinds. You can register your interest in a range by pinning the buffers. That's when you'll find out if the range is already gone. We could add an extra layer of range locking around that, but it wouldn't be any better, it'd just thrash your bus a bit more, and require more complexity in the discard worker (it has to defer discarding a bit, and either block or go away and come back later). > > 3. UndoLogDiscard() uses DiscardBuffer() to invalidate any currently > > unpinned buffers, and marks as BM_DISCARDED any that happen to be > > pinned right now, so they can't be immediately invalidated. Such > > buffers are never written back and are eligible for reuse on the next > > clock sweep, even if they're written into by a backend that managed to > > do that when we were trying to discard. > > Hm. When is it legitimate for a backend to write into such a buffer? I > guess that's about updating the previous transaction's next pointer? Or > progress info? Yes, previous transaction header's next pointer, and progress counter during rollback. We're mostly interested in the next pointer here, because the progress counter update would normally not be updated at a time when the page might be concurrently discarded. The exception to that is a superuser running CALL pg_force_discard_undo() (a data-eating operation designed to escape a system that can't successfully roll back and gets stuck, blowing away not-yet-rolled-back undo records). 
Here are some other ideas about how to avoid conflicts between discarding and transaction header update: 1. Lossy self-update-only: You could say that transactions are only allowed to write to their own transaction header, and then have them periodically update their own length in their own transaction header, and then teach the discard worker that the length information is only a starting point for a linear search for the next transaction based on page header information. That removes complexity for writers, but adds complexity and IO and CPU to the discard worker. Bleugh. 2. Strict self-update-only: We could update it as part of transaction cleanup. That is, when you commit or abort, probably at some time when your transaction is still advertised as running, you go and update your own transaction header with your the size. If you never reach that stage, I think you can fix it during crash recovery, during the startup scan that feeds the rollback request queues. That is, if you encounter a transaction header with length == 0, it must be the final one and its length is therefore known and can be updated, before you allow new transactions to begin. There are some complications like backends that exit without crashing, which I haven't thought about. As Amit just pointed out to me, that means that the update is not folded into the same buffer access as the next transaction, but perhaps you can mitigate that by not updating your header if the next header will be on the same page -- the next transaction can do it safely then (this page with the insert pointer on it can't be discarded). As Dilip just pointed out to me, it means that you always do an update that you might not never need to do if the transaction is discarded, to which I have no answer. Bleugh. 3. Perhaps there is a useful middle ground between ideas 1 and 2: if it's 0, the discard worker will perform a scan of page headers to compute the correct value, but if it's non-zero it will consider it to be correct and trust that value. The extra work would only happen after crash recovery or things like elog(FATAL, ...). 4. You could keep one extra transaction around all the time. That is, because we know we only ever want to stamp the transaction header of the previous transaction, don't let a transaction that hasn't been stamped with a length yet be discarded. But now we waste more space! 5. You could figure out a scheme to advertise the block number of the start of the previous transaction. You could have an LWLock that you have to take to stamp the transaction header of the previous transaction, and UndoLogDiscard() only has to take the lock if it wants to discard a range that overlaps with that block. This avoids contention for some workloads, but not others, so it seems like a half measure, and again you still have to deal with InvalidBuffer when reading. It's basically range locking; the buffer pool is already a kind of range locking scheme! These schemes are all about avoiding conflicts between discarding and writing, but you'd still have to tolerate InvalidBuffer for reads (ie reading zheap records) with this scheme, so I suppose you might as well just treat updates the same and not worry about any of the above. > > 5. Separating begin from discard allows the WAL logging for > > UndoLogDiscard() to do filesystem actions before logging, and other > > effects after logging, which have several nice properties if you work > > through the various crash scenarios. > > Hm. 
ISTM we always need to log before doing some filesystem operation > (see also my recent complaint Robert and I are discussing at the bottom > of [1]). It's just that we can have a separate stage afterwards? > > [1] https://www.postgresql.org/message-id/CA%2BTgmoZc5JVYORsGYs8YnkSxUC%3DcLQF1Z%2BfcpH2TTKvqkS7MFg%40mail.gmail.com I talked about this a bit with Robert and he pointed out that it's probably not actually necessary to WAL-log these operations at all, now that 'begin' and 'end' (= physical storage range) have been separated from 'discard' and 'insert' (active undo data range). Instead you could do it like this: 1. Maintain begin and end pointers in shared memory only, no WAL, no checkpoint. 2. Compute their initial values by scanning the filesystem at startup time. 3. Treat (logical) discard and insert pointers as today; WAL before shm, checkpoint. 4. begin must be <= discard, and end must be >= insert, or else PANIC. I'm looking into that. > > So now I'd like to get feedback on the sanity of this scheme. I'm not > > saying it doesn't have bugs right now -- I've been trying to figure > > out good ways to test it and I'm not quite there yet -- but the > > concept. One observation I have is that there were already code paths > > in undoaccess.c that can tolerate InvalidBuffer in recovery, due to > > the potentially different discard timing for DO vs REDO. I think > > that's a point in favour of this scheme, but I can see that it's > > inconvenient to have to deal with InvalidBuffer whenever you read. > > FWIW, I'm far from convinced that those are currently quite right. See > discussion pointed to above. Yeah. It seems highly desirable to make it so that all decisions about whether a write to an undo block is required or should be skipped are made on the primary, so that WAL reply just does what it's told. I am working on that. -- Thomas Munro https://enterprisedb.com
On Fri, Aug 23, 2019 at 2:04 AM Thomas Munro <thomas.munro@gmail.com> wrote: > 2. Strict self-update-only: We could update it as part of > transaction cleanup. That is, when you commit or abort, probably at > some time when your transaction is still advertised as running, you go > and update your own transaction header with your the size. If you > never reach that stage, I think you can fix it during crash recovery, > during the startup scan that feeds the rollback request queues. That > is, if you encounter a transaction header with length == 0, it must be > the final one and its length is therefore known and can be updated, > before you allow new transactions to begin. There are some > complications like backends that exit without crashing, which I > haven't thought about. As Amit just pointed out to me, that means > that the update is not folded into the same buffer access as the next > transaction, but perhaps you can mitigate that by not updating your > header if the next header will be on the same page -- the next > transaction can do it safely then (this page with the insert pointer > on it can't be discarded). As Dilip just pointed out to me, it means > that you always do an update that you might not never need to do if > the transaction is discarded, to which I have no answer. Bleugh. Andres and I have spent a lot of time on the phone over the last couple of days and I think we both kind of like this option. I don't think that the costs are likely to be very significant: you're talking about pinning, locking, dirtying, unlocking, and unpinning one buffer at commit time, or maybe two if your transaction touched both logged and unlogged tables. If the transaction is short enough for that overhead to matter, that buffer is probably already in shared_buffers, is probably already dirty, and is probably already in your CPU's cache. So I think the overhead will turn out to be low. Moreover, I doubt that we want to separately discard every transaction anyway. If you have very light-weight transactions, you don't want to add an extra WAL record per transaction anyway. Increasing the number of separate WAL records per transaction from say 5 to 6 would be a significant additional cost. You probably want to perform a discard, say, every 5 seconds or sooner if you can discard at least 64kB of undo, or something of that sort. So we're not going to save the overhead of updating the previous transaction header often enough to make much difference unless we're discarding so aggressively that we incur a much larger overhead elsewhere. I think. I am a little concerned about the backends that exit without crashing. Andres seems to want to treat that case as a bug to be fixed, but I doubt whether that's going to be practical. We're really only talking about extreme corner cases here, because before_shmem_exit(ShutdownPostgres, 0) means we'll AbortOutOfAnyTransaction() which should RecordTransactionAbort(). Only if we fail in the AbortTransaction() prior to reaching RecordTransactionAbort() will we manage to reach the later cleanup stages without having written an abort record. I haven't scrutinized that code lately to see exactly how things can go wrong there, but there shouldn't be a whole lot. However, there's probably a few things, like maybe a really poorly-timed malloc() failure. A zero-order solution would be to install a deadman switch. At on_shmem_exit time, you must detach from any undo log to which you are connected, so that somebody else can attach to it later. 
We can stick in a cross-check there that you haven't written any undo bytes to that log and PANIC if you have. Then the system must be water-tight. Perhaps it's possible to do better: if we could identify the cases in which such logic gets reached, we could try to guarantee that WAL is written and the undo log safely detached before we get there. But at the various least we can promote ERROR/FATAL to PANIC in the relevant case. A better solution would be to detect the problem and make sure we recover from it before reusing the undo log. Suppose each undo log has three states: (1) nobody's attached, (2) somebody's attached, and (3) nobody's attached but the last record might need a fixup. When we start up, all undo logs are in state 3, and the discard worker runs around and puts them into state 1. Subsequently, they alternate between states 1 and 2 for as long as the system remains up. But if as an exceptional case we reach on_shmem_exit without having detached the undo log, because of cascading failures, then we put the undo log in state 3. The discard worker already knows how to move undo logs from state 3 to state 1, and it can do the same thing here. Until it does nobody else can reuse that undo log. I might be missing something, but I think that would nail this down pretty tightly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Aug 21, 2019 at 4:34 PM Robert Haas <robertmhaas@gmail.com> wrote: > ReleaseResourcesAndProcessUndo() is only supposed to be called after > AbortTransaction(), and the additional steps it performs -- > AtCleanup_Portals() and AtEOXact_Snapshot() or alternatively > AtSubCleanup_Portals -- are taken from Cleanup(Sub)Transaction. > That's not crazy; the other steps in Cleanup(Sub)Transaction() look > like stuff that's intended to be performed when we're totally done > with this TransactionState stack entry, whereas these things are > slightly higher-level cleanups that might even block undo (e.g. > undropped portal prevents orphaned file cleanup). Granted, there are > no comments explaining why those particular cleanup steps are > performed here, and it's possible some other approach is better, but I > think perhaps it's not quite as flagrantly broken as you think. Andres smacked me with the clue-bat off-list and now I understand why this is broken: there's no guarantee that running the various AtEOXact/AtCleanup functions actually puts the transaction back into a good state. They *might* return the system to the state that it was in immediately following StartTransaction(), but they also might not. Moreover, ProcessUndoRequestForEachLogCat uses PG_TRY/PG_CATCH and then discards the error without performing *any cleanup at all* but then goes on and tries to do undo for other undo log categories anyway. That is totally unsafe. I think that there should only be one chance to perform undo actions, and as I said or at least alluded to before, if that throws an error, it shouldn't be caught by a TRY/CATCH block but should be handled by the state machine in xact.c. If we're not going to make the state machine handle these conditions, the addition of TRANS_UNDO/TBLOCK_UNDO/TBLOCK_SUBUNDO is really missing the boat. I'm still not quite sure of the exact sequence of steps: we clearly need AtCleanup_Portals() and a bunch of the other stuff that happens during CleanupTransaction(), ideally including the freeing of memory, to happen before we try undo. But I don't have a great feeling for how to make that happen, and it seems more desirable for undo to begin as soon as the transaction fails rather than waiting until Cleanup(Sub)Transaction() time. I think some more research is needed here. > I am also not convinced that semi-critical sections are a bad idea, Regarding this, after further thought and discussion with Andres, there are two cases here that call for somewhat different handling: temporary undo, and subtransaction abort. In the case of a subtransaction abort, we can't proceed with the toplevel transaction unless we succeed in applying the subtransaction's undo, but that does not require killing off the backend. It might be a better idea to just fail the containing subtransaction with the error that occurred during undo apply; if there are multiple levels of subtransactions present then we might fail in the same way several times, but eventually we'll fail at the top level, forcibly kick the undo into the background, and the session can continue. The background workers will, hopefully, eventually recover the situation. Even if they can't, because, say, the failure is due to a bug or whatever, killing off the session doesn't really help. In the case of temporary undo, killing the session is a much more appealing approach. If we don't, how will that undo ever get processed? We could retry at some point (like every time we return to the toplevel command prompt?) 
or just ignore the fact that we didn't manage to perform those undo actions and leave that undo there like an albatross, accumulating more and more undo behind it until the session exits or the disk fills up. The latter strikes me as a terrible user experience, especially because for wire protocol reasons we'd have to swallow the errors or at best convert them to warnings, but YMMV. Anyway, probably these cases should not be handled exactly the same way, but exactly what to do probably depends on the previous question: how exactly does the integration into xact.c's state machine work, anyway? Meanwhile, I've been working up a prototype of how the undorequest.c stuff I sent previously could be integrated with xact.c. In working on that, I've realized that there seem to be two different tasks. One is tracking the information that we'll need to have available to perform undo actions. The other is the actual transaction state manipulation: when and how do we abort transactions, cleanup transactions, start new transactions specifically for undo? How are transactions performing undo specially marked, if at all? The attached patch includes a new module, undostate.c/h, which tries to handle the first of those things; this is just a prototype, and is missing some pieces marked with XXX, but I think it's probably the right general direction. It will still need to be plugged into a framework for launching undo apply background workers (which might require some API adjustments) and it needs xact.c to handle the core transactional stuff. But hopefully it will help to illustrate how the undorequest.c stuff that I sent before can actually be put to use. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Hello Thomas, I was doing some testing for the scenario where the undo written by a transaction overflows to multiple undo logs. For that I've modified the following macro: #define UndoLogMaxSize (1024 * 1024) /* 1MB undo log size */ (I should have used the provided pg_force_switch_undo though..) I'm getting the following assert failure while performing the recovery with the same. "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL", File: "undolog.c", Line: 997)" I found that we don't emit an WAL record when we update the slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after crash recovery, some new transaction may use that undo log which is wrong, IMHO. Am I missing something? -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > I'm getting the following assert failure while performing the recovery > with the same. > "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL", > File: "undolog.c", Line: 997)" > > I found that we don't emit an WAL record when we update the > slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after > crash recovery, some new transaction may use that undo log which is > wrong, IMHO. Am I missing something? Thanks, right, that status logging is wrong, will fix in next version. -- Thomas Munro https://enterprisedb.com
Hi Thomas, While testing one of the recovery scenarios I found one issue: FailedAssertion("!(logno == context->recovery_logno) The details of the same is mentioned below: The context's try_location was not updated in UndoLogAllocateInRecovery, in PrepareUndoInsert the try_location was updated with the undo record size. In the subsequent UndoLogAllocateInRecovery as the value for try_location was not initialized but only updated with the size the logno will always not match if the recovery_logno is non zero and the assert fails. Fixed by setting the try_location in UndoLogAllocateInRecovery, similar to try_location setting in UndoLogAllocate. Patch for the same is attached. Please have a look and add the changes in one of the upcoming version. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com On Mon, Sep 2, 2019 at 9:53 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > I'm getting the following assert failure while performing the recovery > > with the same. > > "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL", > > File: "undolog.c", Line: 997)" > > > > I found that we don't emit an WAL record when we update the > > slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after > > crash recovery, some new transaction may use that undo log which is > > wrong, IMHO. Am I missing something? > > Thanks, right, that status logging is wrong, will fix in next version. > > -- > Thomas Munro > https://enterprisedb.com > >
Attachment
On 2019-Sep-06, vignesh C wrote: > Hi Thomas, > > While testing one of the recovery scenarios I found one issue: > FailedAssertion("!(logno == context->recovery_logno) I marked this patch Waiting on Author. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Thomas, While testing zheap over undo apis, we've found the following issues/scenarios that might need some fixes/discussions: 1. In UndoLogAllocateInRecovery, when we find the current log number from the list of registered blocks, we don't check whether the block->in_use flag is true or not. In XLogResetInsertion, we just reset in_use flag without reseting the blocks[]->rnode information. So, if we don't check the in_use flag, it's possible that we'll consult some block information from the previous WAL record. IMHO, just adding an in_use check in UndoLogAllocateInRecovery will solve the problem. 2. A transaction, inserts one undo record and generated a WAL record for the same, say at WAL location 0/2000A000. Next, the undo record gets discarded and WAL is generated to update the meta.discard pointer at location 0/2000B000 At the same time, an ongoing checkpoint with checkpoint.redo at 0/20000000 flushes the latest meta.discard pointer. Now, the system crashes. Now, the recovery starts from the location 0/20000000. When the recovery of 0/2000A000 happens, it sees the undo record that it's about to insert, is already discarded as per meta.discard (flushed by checkpoint). In this case, should we just skip inserting the undo record? 3. Currently, we create a backup image of the unlogged part of the undo log's metadata only when some backend allocates some space from the undo log (in UndoLogAllocate). This helps us restore the unlogged meta part after a checkpoint. When we perform an undo action, we also update the undo action progress and emit an WAL record. The same operation can performed by the undo worker which doesn't allocate any space from the undo log. So, if an undo worker emits an WAL record to update undo action progress after a checkpoint, it'll not be able to WAL log the backup image of the meta unlogged part. IMHO, this breaks the recovery logic of unlogged part of undo meta. Thoughts? On Mon, Sep 2, 2019 at 9:47 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > I'm getting the following assert failure while performing the recovery > > with the same. > > "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL", > > File: "undolog.c", Line: 997)" > > > > I found that we don't emit an WAL record when we update the > > slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after > > crash recovery, some new transaction may use that undo log which is > > wrong, IMHO. Am I missing something? > > Thanks, right, that status logging is wrong, will fix in next version. > > -- > Thomas Munro > https://enterprisedb.com -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 16, 2019 at 5:27 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > While testing zheap over undo apis, we've found the following > issues/scenarios that might need some fixes/discussions: Thanks! > 1. In UndoLogAllocateInRecovery, when we find the current log number > from the list of registered blocks, we don't check whether the > block->in_use flag is true or not. In XLogResetInsertion, we just > reset in_use flag without reseting the blocks[]->rnode information. > So, if we don't check the in_use flag, it's possible that we'll > consult some block information from the previous WAL record. IMHO, > just adding an in_use check in UndoLogAllocateInRecovery will solve > the problem. Agreed. I added a line to break out of that loop if !block->in_use. BTW I am planning to simplify that code considerably, based on a plan to introduce a new rule: there can be only one undo record and therefore only one undo allocation per WAL record. > 2. A transaction, inserts one undo record and generated a WAL record > for the same, say at WAL location 0/2000A000. Next, the undo record > gets discarded and WAL is generated to update the meta.discard pointer > at location 0/2000B000 At the same time, an ongoing checkpoint with > checkpoint.redo at 0/20000000 flushes the latest meta.discard pointer. > Now, the system crashes. > Now, the recovery starts from the location 0/20000000. When the > recovery of 0/2000A000 happens, it sees the undo record that it's > about to insert, is already discarded as per meta.discard (flushed by > checkpoint). In this case, should we just skip inserting the undo > record? I see two options: 1. We make it so that if you're allocating in recovery and discard > insert, we'll just set discard = insert so you can proceed. The code in undofile_get_segment_file() already copes with missing files during recovery. 2. We skip the insert as you said. I think option 1 is probably best, otherwise you have to cope with failure to insert by skipping, as you said. > 3. Currently, we create a backup image of the unlogged part of the > undo log's metadata only when some backend allocates some space from > the undo log (in UndoLogAllocate). This helps us restore the unlogged > meta part after a checkpoint. > When we perform an undo action, we also update the undo action > progress and emit an WAL record. The same operation can performed by > the undo worker which doesn't allocate any space from the undo log. > So, if an undo worker emits an WAL record to update undo action > progress after a checkpoint, it'll not be able to WAL log the backup > image of the meta unlogged part. IMHO, this breaks the recovery logic > of unlogged part of undo meta. I thought that was OK because those undo data updates don't depend on the insert pointer. But I see what you mean: the next modification of the page that DOES depend on the insert pointer might not log the meta-data if it's not the first WAL record to touch it after a checkpoint. Rats. I'll have to think about that some more. -- Thomas Munro https://enterprisedb.com
Hello Thomas, On Mon, Sep 16, 2019 at 11:23 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > > 1. In UndoLogAllocateInRecovery, when we find the current log number > > from the list of registered blocks, we don't check whether the > > block->in_use flag is true or not. In XLogResetInsertion, we just > > reset in_use flag without reseting the blocks[]->rnode information. > > So, if we don't check the in_use flag, it's possible that we'll > > consult some block information from the previous WAL record. IMHO, > > just adding an in_use check in UndoLogAllocateInRecovery will solve > > the problem. > > Agreed. I added a line to break out of that loop if !block->in_use. > I think we should skip the block if !block->in_use. Because, the undo buffer can be registered in a subsequent block as well. For different operations, we can use different block_id to register the undo buffer in the redo record. > BTW I am planning to simplify that code considerably, based on a plan > to introduce a new rule: there can be only one undo record and > therefore only one undo allocation per WAL record. > Okay. In that case, we need to rethink the cases for multi-inserts and non-inlace updates both of which currently inserts multiple undo record corresponding to a single WAL record. For multi-inserts, it can be solved easily by moving all the offset information in the payload. But, for non-inlace updates, we insert one undo record for the update and one for the insert. Wondering whether we've to insert two WAL records - one for update and one for the new insert. > > 2. A transaction, inserts one undo record and generated a WAL record > > for the same, say at WAL location 0/2000A000. Next, the undo record > > gets discarded and WAL is generated to update the meta.discard pointer > > at location 0/2000B000 At the same time, an ongoing checkpoint with > > checkpoint.redo at 0/20000000 flushes the latest meta.discard pointer. > > Now, the system crashes. > > Now, the recovery starts from the location 0/20000000. When the > > recovery of 0/2000A000 happens, it sees the undo record that it's > > about to insert, is already discarded as per meta.discard (flushed by > > checkpoint). In this case, should we just skip inserting the undo > > record? > > I see two options: > > 1. We make it so that if you're allocating in recovery and discard > > insert, we'll just set discard = insert so you can proceed. The code > in undofile_get_segment_file() already copes with missing files during > recovery. > Interesting. This should work. > > > 3. Currently, we create a backup image of the unlogged part of the > > undo log's metadata only when some backend allocates some space from > > the undo log (in UndoLogAllocate). This helps us restore the unlogged > > meta part after a checkpoint. > > When we perform an undo action, we also update the undo action > > progress and emit an WAL record. The same operation can performed by > > the undo worker which doesn't allocate any space from the undo log. > > So, if an undo worker emits an WAL record to update undo action > > progress after a checkpoint, it'll not be able to WAL log the backup > > image of the meta unlogged part. IMHO, this breaks the recovery logic > > of unlogged part of undo meta. > > I thought that was OK because those undo data updates don't depend on > the insert pointer. But I see what you mean: the next modification of > the page that DOES depend on the insert pointer might not log the > meta-data if it's not the first WAL record to touch it after a > checkpoint. 
Rats. I'll have to think about that some more. Cool. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 16, 2019 at 11:09 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > Okay. In that case, we need to rethink the cases for multi-inserts and > non-inlace updates both of which currently inserts multiple undo > record corresponding to a single WAL record. For multi-inserts, it can > be solved easily by moving all the offset information in the payload. > But, for non-inlace updates, we insert one undo record for the update > and one for the insert. Wondering whether we've to insert two WAL > records - one for update and one for the new insert. No, I think the solution is to put the information about both halves of the non-in-place update in the same undo record. I think the only reason why that's tricky is because we've got two block numbers and two offsets, and the only reason that's a problem is because UnpackedUndoRecord only has one field for each of those things, and that goes right back to Heikki's comments about the format not being flexible enough. If you see some other problem, it would be interesting to know what it is. One thing I've been thinking about is: suppose that you're following the undo chain for a tuple and you come to a non-in-place update record. Can you get confused? I don't think so, because you can compare the TID for which you're following the chain to the new TID and the old TID in the record and it should match one or the other but not both. But I don't think you even really need to do that much: if you started with a deleted item, the first thing in the undo chain has to be a delete or non-in-place update that got rid of it. And if you started with a non-deleted item, then the beginning of the undo chain, if it hasn't been discarded yet, will be the insert or non-in-place update that created it. There's nowhere else that you can hit a non-in-place update, and no room (that I can see) for any ambiguity. It seems to me that zheap went wrong in ending up with separate undo types for in-place and non-in-place updates. Why not just have ONE kind of undo record that describes an update, and allow that update to have either one TID or two TIDs depending on which kind of update it is? There may be a reason, but I don't know what it is, unless it's just that the UnpackedUndoRecord idea that I invented wasn't flexible enough and nobody thought of generalizing it. Curious to hear your thoughts on this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 17, 2019 at 3:09 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > On Mon, Sep 16, 2019 at 11:23 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > Agreed. I added a line to break out of that loop if !block->in_use. > > > I think we should skip the block if !block->in_use. Because, the undo > buffer can be registered in a subsequent block as well. For different > operations, we can use different block_id to register the undo buffer > in the redo record. Oops, right. So it should just be added to the if condition. Will do. -- Thomas Munro https://enterprisedb.com
On Mon, Sep 16, 2019 at 10:37 PM Robert Haas <robertmhaas@gmail.com> wrote: > > It seems to me that zheap went wrong in ending up with separate undo > types for in-place and non-in-place updates. Why not just have ONE > kind of undo record that describes an update, and allow that update to > have either one TID or two TIDs depending on which kind of update it > is? There may be a reason, but I don't know what it is, unless it's > just that the UnpackedUndoRecord idea that I invented wasn't flexible > enough and nobody thought of generalizing it. Curious to hear your > thoughts on this. > I think not only TID's, but we also need to two uur_prevundo (previous undo of the block) pointers. This is required both when we have to perform page-wise undo and chain traversal during visibility checks. So, we can keep a combination of TID and prevundo. The other thing is that during rollback when we collect the undo for each page, applying the action for this undo need some thoughts. For example, we can't apply the undo to rollback both Insert and non-inplace-update as both are on different pages. The reason is that the page where non-inplace-update has happened might have more undos that need to be applied before this. We can somehow make this undo available to apply while collecting undo for both the heap pages. I think there is also a need to identify which TID is for Insert and which is for non-inplace-update part of the operation because we won't know that while applying undo unless we check the state of a tuple on the page. So, with this idea, we will make one undo record part of multiple chains which might need some consideration at different places like above. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 17, 2019 at 10:03:20AM +1200, Thomas Munro wrote: > Oops, right. So it should just be added to the if condition. Will do. It's been a couple of months and the discussion has stale. It seems also that the patch was waiting for an update. So I am marking it as RwF for now. Please feel free to update it if you feel that's not adapted. -- Michael
Attachment
On Thu, Nov 28, 2019 at 3:45 PM Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Sep 17, 2019 at 10:03:20AM +1200, Thomas Munro wrote: > > Oops, right. So it should just be added to the if condition. Will do. > > It's been a couple of months and the discussion has stale. It seems > also that the patch was waiting for an update. So I am marking it as > RwF for now. Please feel free to update it if you feel that's not > adapted. Thanks. We decided to redesign a couple of aspects of the undo storage and record layers that this patch was intended to demonstrate, and work on that is underway. More on that soon.
Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Nov 28, 2019 at 3:45 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Sep 17, 2019 at 10:03:20AM +1200, Thomas Munro wrote: > > > Oops, right. So it should just be added to the if condition. Will do. > > > > It's been a couple of months and the discussion has stale. It seems > > also that the patch was waiting for an update. So I am marking it as > > RwF for now. Please feel free to update it if you feel that's not > > adapted. > > Thanks. We decided to redesign a couple of aspects of the undo > storage and record layers that this patch was intended to demonstrate, > and work on that is underway. More on that soon. As my boss expressed in his recent blog post, we'd like to contribute to the zheap development, and a couple of developers from other companies are interested in this as well. Amit Kapila suggested that the "cleanup of orphaned files" feature is a good start point in getting the code into PG core, so I've spent some time on it and tried to rebase the patch set. In fact what I did is not mere rebasing against the current master branch - I've also (besides various bug fixes) done some design changes. Incorporated the new Undo Record Set (URS) infrastructure --------------------------------------------------------- This is also pointed out in [0]. I started from [1] and tried to implement some missing parts (e.g. proper closing of the URSs after crash), introduced UNDO_DEBUG preprocessor macro which makes the undo log segments very small and fixed some bugs that the small segments exposed. The most significant change I've done was removal of the undo requests from checkpoint. I could not find any particular bug / race conditions related to including the requests into the checkpoint, but I concluded that it's easier to think about consistency and checkpoint timings if we scan the undo log on restart (after recovery has finished) and create the requests from scratch. [2] shows where I ended up before I started to rebase this patchset. No background undo ------------------ Reduced complexity of the patch seems to be the priority at the moment. Amit suggested that cleanup of an orphaned relation file is simple enough to be done on foreground and I agree. "undo worker" is still there, but it only processes undo requests after server restart because relation data can only be changed in a transaction - it seems cleaner to launch a background worker for this than to hack the startup process. Since the concept of undo requests is closely related to the undo worker, I removed undorequest.c too. The new (much simpler) undo worker gets the information on incomplete / aborted transactions from the undo log as mentioned above. SMGR enhancement ---------------- I used the 0001 patch from [3] rather than [4], although it's more invasive because I noticed somewhere in the discussion that there should be no reserved database OID for the undo log. (InvalidOid cannot be used because it's already in use for shared catalogs.) Components added ---------------- pg_undo_dump utility and test framework for undoread.c. BTW, undoread.c seems to need some refactoring. Following are a few areas which are not implemented yet because more discussion is needed there: Discarding ---------- There's no discard worker for the URS infrastructure yet. 
I thought about discarding the undo log during checkpoint, but checkpoint should probably do more straightforward tasks than the calculation of a new discard pointer for each undo log, so a background worker is needed. A few notes on that: * until the zheap AM gets added, only the transaction that creates the undo records needs to access them. This assumption should make the discarding algorithm a bit simpler. Note that with zheap, the other transactions need to look for old versions of tuples, so the concept of oldestXidHavingUndo variable is needed there. * it's rather simple to pass pointer the URS pointer to the discard worker when transaction either committed or the undo has been executed. If the URS only consists of one chunk, the discard pointer can simply be advanced to the end of the chunk. But if there are multiple chunks, the discard worker might need to scan quite some amount of the undo log because (IIUC) chunks of different URSs can be interleaved (if there's not enough space for a record in the log 1, log 2 is used, but before we get to discarding, another transaction could have added its chunk to the log 1) and because the chunks only contain links backwards, not forward. If we added the forward link to the chunk header, it would make chunk closing more complex. How about storing the type header (which includes XID) in each chunk instead of only the first chunk of the URS? Thus we'd be able to check for each chunk separately whether it can be discarded. * if the URS belongs to an aborted transaction or a transaction that could not finish due to server crash, the transaction status alone does not justify discarding: we also need to be sure that the underlying undo records have been applied. So if we want to do without the oldestXidHavingUndo variable, some sort of undo progress tracking is needed, see below. Do not execute the same undo record multiple times -------------------------------------------------- Although I've noticed in the zheap code that it checks whether particular undo action was already undone, I think this functionality fits better in the URS layer. Also note in [1] (i.e. the undo layer, no zheap) that the header comment of AtSubAbort_XactUndo() refers to this problem. I've tried to implement such a thing (not included in this patch) by adding last_rec_applied field to UndoRecordSetChunkHeader. When the UNDO stage of the transaction starts, this field is set to the last undo record of given chunk, and once that record is applied, the pointer moves to the previous record in terms of undo pointer (i.e. the next record to be applied - the records are applied in reverse order) and so on. For recovery purposes, the pointer is maintained in a similar way as the ud_insertion_point field of UndoPageHeaderData. However, although I haven't tested performance yet, I wonder if it's o.k. to lock the buffer containing the chunk header exclusively for each undo record execution. I wonder if there's a better place to store the progress information, maybe at page level? I can spend more time on this project, but need a hint which part I should focus on. Other hackers might have the same problem. Thanks for any suggestions. 
[0] https://www.postgresql.org/message-id/CA%2BTgmoZwkqXs3hpT_nd17fyMnZDkg8yU%3D5kG%2BHQw%2B80rumiwUA%40mail.gmail.com [1] https://github.com/EnterpriseDB/zheap/tree/undo-record-set [2] https://github.com/cybertec-postgresql/postgres/tree/undo-record-set-ah [3] https://www.postgresql.org/message-id/CA%2BhUKGJfznxutTwpMLKPMjU_k9GhERoogyxx2Sf105LOA2La2A%40mail.gmail.com [4] https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
On Thu, Nov 12, 2020 at 10:15 PM Antonin Houska <ah@cybertec.at> wrote: > Thomas Munro <thomas.munro@gmail.com> wrote: > > Thanks. We decided to redesign a couple of aspects of the undo > > storage and record layers that this patch was intended to demonstrate, > > and work on that is underway. More on that soon. > > As my boss expressed in his recent blog post, we'd like to contribute to the > zheap development, and a couple of developers from other companies are > interested in this as well. Amit Kapila suggested that the "cleanup of > orphaned files" feature is a good start point in getting the code into PG > core, so I've spent some time on it and tried to rebase the patch set. Hi Antonin, I saw that -- great news! -- and have been meaning to write for a while. I think I am nearly ready to talk about it again. I agree 100% that it's worth trying to do something much simpler than a new access manager, and this was the simplest useful feature solving a real-world-problem-that-people-actually-have we could come up with (based on an idea from Robert). I think it needs a convincing explanation for why there is no scenario where the relfilenode is recycled for a new unlucky table before the rollback is executed, which might depend on details that you might be working on/changing (scenarios where you execute undo twice because you forgot you already did it). > In fact what I did is not mere rebasing against the current master branch - > I've also (besides various bug fixes) done some design changes. > > Incorporated the new Undo Record Set (URS) infrastructure > --------------------------------------------------------- > > This is also pointed out in [0]. > > I started from [1] and tried to implement some missing parts (e.g. proper > closing of the URSs after crash), introduced UNDO_DEBUG preprocessor macro > which makes the undo log segments very small and fixed some bugs that the > small segments exposed. Cool! Getting up to speed on all these made up concepts like URS, and getting all these pieces assembled and rebased and up and running is already quite something, let alone adding missing parts and debugging. > The most significant change I've done was removal of the undo requests from > checkpoint. I could not find any particular bug / race conditions related to > including the requests into the checkpoint, but I concluded that it's easier > to think about consistency and checkpoint timings if we scan the undo log on > restart (after recovery has finished) and create the requests from scratch. Interesting. I guess that would be closer to textbook three-phase ARIES. > [2] shows where I ended up before I started to rebase this patchset. > > No background undo > ------------------ > > Reduced complexity of the patch seems to be the priority at the moment. Amit > suggested that cleanup of an orphaned relation file is simple enough to be > done on foreground and I agree. > > "undo worker" is still there, but it only processes undo requests after server > restart because relation data can only be changed in a transaction - it seems > cleaner to launch a background worker for this than to hack the startup > process. I suppose the simplest useful system would be one does the work at startup before allowing connections, and also in regular backends, and panics if a backend ever exits while it has pending undo (panic = "goto crash recovery"). 
Then you don't have to deal with undo workers running at the same time as regular sessions, which might run into trouble reacquiring locks (for an AM I mean), or due to OIDs being recycled with multiple checkpoints, or undo work that gets deferred until the next restart of the server. > Since the concept of undo requests is closely related to the undo worker, I > removed undorequest.c too. The new (much simpler) undo worker gets the > information on incomplete / aborted transactions from the undo log as > mentioned above. > > SMGR enhancement > ---------------- > > I used the 0001 patch from [3] rather than [4], although it's more invasive > because I noticed somewhere in the discussion that there should be no reserved > database OID for the undo log. (InvalidOid cannot be used because it's already > in use for shared catalogs.) I gave up thinking about the colour of the BufferTag shed and went back to magic database 9, mainly because there seemed to be more pressing matters. I don't even think it's that crazy to store this type of system-wide data in pseudo databases, and I know of other systems that do similar sorts of things without blinking... > Following are a few areas which are not implemented yet because more > discussion is needed there: Hmm. I'm thinking about these questions.
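To make the "simplest useful system" above concrete, here is a minimal sketch of that policy. This is illustration only: the types and helper functions below are hypothetical, not taken from any of the patches.

    #include "postgres.h"

    typedef struct UndoLogState
    {
        uint64      discard;    /* oldest byte of undo not yet discarded */
        uint64      insert;     /* where the next undo byte would go */
    } UndoLogState;

    /* Hypothetical: applies one record and advances 'discard' on success. */
    static bool apply_one_undo_record(UndoLogState *log);

    /*
     * Run at the end of crash recovery, before connections are allowed:
     * anything between 'discard' and 'insert' that belongs to an aborted
     * transaction is applied and then discarded, so the system never opens
     * for business with pending undo.
     */
    static void
    ApplyPendingUndoBeforeConnections(UndoLogState *logs, int nlogs)
    {
        for (int i = 0; i < nlogs; i++)
        {
            while (logs[i].discard < logs[i].insert)
            {
                if (!apply_one_undo_record(&logs[i]))
                    elog(PANIC, "could not apply pending undo");
            }
        }
    }

In normal running, an aborting backend would apply its own undo in the foreground; if it cannot, it PANICs, i.e. "goto crash recovery", and the loop above runs again.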
On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > No background undo > ------------------ > > Reduced complexity of the patch seems to be the priority at the moment. Amit > suggested that cleanup of an orphaned relation file is simple enough to be > done on foreground and I agree. > Yeah, I think we should try and see if we can make it work but I noticed that there are a few places like AbortOutOfAnyTransaction where we have the assumption that undo will be executed in the background. We need to deal with it. > "undo worker" is still there, but it only processes undo requests after server > restart because relation data can only be changed in a transaction - it seems > cleaner to launch a background worker for this than to hack the startup > process. > But, I see there are still multiple undoworkers that are getting launched and I am not sure if that works correctly because a particular undoworker is connected to a database and then it starts processing all the pending undo. > Since the concept of undo requests is closely related to the undo worker, I > removed undorequest.c too. The new (much simpler) undo worker gets the > information on incomplete / aborted transactions from the undo log as > mentioned above. > > SMGR enhancement > ---------------- > > I used the 0001 patch from [3] rather than [4], although it's more invasive > because I noticed somewhere in the discussion that there should be no reserved > database OID for the undo log. (InvalidOid cannot be used because it's already > in use for shared catalogs.) > > Components added > ---------------- > > pg_undo_dump utility and test framework for undoread.c. BTW, undoread.c seems > to need some refactoring. > > > Following are a few areas which are not implemented yet because more > discussion is needed there: > > Discarding > ---------- > > There's no discard worker for the URS infrastructure yet. I thought about > discarding the undo log during checkpoint, but checkpoint should probably do > more straightforward tasks than the calculation of a new discard pointer for > each undo log, so a background worker is needed. A few notes on that: > > * until the zheap AM gets added, only the transaction that creates the undo > records needs to access them. This assumption should make the discarding > algorithm a bit simpler. Note that with zheap, the other transactions need > to look for old versions of tuples, so the concept of oldestXidHavingUndo > variable is needed there. > > * it's rather simple to pass the URS pointer to the discard worker > when the transaction either committed or the undo has been executed. > Why can't we have a separate discard worker which keeps on scanning the undo records and discards accordingly? Putting the onus on the foreground process might be tricky because, say, the discard worker may not be up to speed and we run out of space to pass such information for each commit/abort request. > > Do not execute the same undo record multiple times > -------------------------------------------------- > > Although I've noticed in the zheap code that it checks whether a particular undo > action was already undone, I think this functionality fits better in the URS > layer. > If you want to track at undo record level, then won't it lead to performance overhead and probably additional WAL overhead considering this action needs to be WAL-logged? I think recording at page-level might be a better idea. > > I can spend more time on this project, but need a hint which part I should focus on.
I can easily imagine that this needs a lot of work and I can try to help with this as much as possible from my side. I feel at this stage we should try to focus on undo-related work (to start with you can look at finishing the undo-processing work for which I have shared some thoughts) and then probably at some point in time we need to rebase zheap over this. -- With Regards, Amit Kapila.
Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Nov 12, 2020 at 10:15 PM Antonin Houska <ah@cybertec.at> wrote: > I saw that -- great news! -- and have been meaning to write for a > while. I think I am nearly ready to talk about it again. I'm looking forward to it :-) > 100% that it's worth trying to do something much simpler than a new > access manager, and this was the simplest useful feature solving a > real-world-problem-that-people-actually-have we could come up with > (based on an idea from Robert). I think it needs a convincing > explanation for why there is no scenario where the relfilenode is > recycled for a new unlucky table before the rollback is executed, > which might depend on details that you might be working on/changing > (scenarios where you execute undo twice because you forgot you already > did it). Oh, I haven't thought about this problem yet. That might be another reason for the undo log infrastructure to record the progress somehow. > > No background undo > > ------------------ > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > suggested that cleanup of an orphaned relation file is simple enough to be > > done on foreground and I agree. > > > > "undo worker" is still there, but it only processes undo requests after server > > restart because relation data can only be changed in a transaction - it seems > > cleaner to launch a background worker for this than to hack the startup > > process. > > I suppose the simplest useful system would be one that does the work at > startup before allowing connections, and also in regular backends, and > panics if a backend ever exits while it has pending undo (panic = > "goto crash recovery"). Then you don't have to deal with undo workers > running at the same time as regular sessions, which might run into > trouble reacquiring locks (for an AM I mean), or due to OIDs being > recycled with multiple checkpoints, or undo work that gets deferred > until the next restart of the server. I think that zheap can recognize that a page has unapplied undo, so we don't need to reacquire any page lock on restart. However I agree that the background undo might introduce other concurrency issues. At least for now it's worth trying to move the cleanup into the startup process. We can reconsider this when implementing more expensive undo actions, especially the zheap rollback. > > Since the concept of undo requests is closely related to the undo worker, I > > removed undorequest.c too. The new (much simpler) undo worker gets the > > information on incomplete / aborted transactions from the undo log as > > mentioned above. > > > > SMGR enhancement > > ---------------- > > > > I used the 0001 patch from [3] rather than [4], although it's more invasive > > because I noticed somewhere in the discussion that there should be no reserved > > database OID for the undo log. (InvalidOid cannot be used because it's already > > in use for shared catalogs.) > > I gave up thinking about the colour of the BufferTag shed and went > back to magic database 9, mainly because there seemed to be more > pressing matters. I don't even think it's that crazy to store this > type of system-wide data in pseudo databases, and I know of other > systems that do similar sorts of things without blinking... ok -- Antonin Houska Web: https://www.cybertec-postgresql.com
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > No background undo > > ------------------ > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > suggested that cleanup of an orphaned relation file is simple enough to be > > done on foreground and I agree. > > > > Yeah, I think we should try and see if we can make it work but I > noticed that there are a few places like AbortOutOfAnyTransaction where > we have the assumption that undo will be executed in the background. > We need to deal with it. I think this is o.k. if we always check for unapplied undo during startup. > > "undo worker" is still there, but it only processes undo requests after server > > restart because relation data can only be changed in a transaction - it seems > > cleaner to launch a background worker for this than to hack the startup > > process. > > > > But, I see there are still multiple undoworkers that are getting > launched and I am not sure if that works correctly because a > particular undoworker is connected to a database and then it starts > processing all the pending undo. Each undo worker applies only transactions for its own database, see ProcessExistingUndoRequests():

    /* We can only process undo of the database we are connected to. */
    if (xact_hdr.dboid != MyDatabaseId)
        continue;

Nevertheless, as I've just mentioned in my response to Thomas, I admit that we should try to live w/o the undo worker altogether. > > Discarding > > ---------- > > > > There's no discard worker for the URS infrastructure yet. I thought about > > discarding the undo log during checkpoint, but checkpoint should probably do > > more straightforward tasks than the calculation of a new discard pointer for > > each undo log, so a background worker is needed. A few notes on that: > > > > * until the zheap AM gets added, only the transaction that creates the undo > > records needs to access them. This assumption should make the discarding > > algorithm a bit simpler. Note that with zheap, the other transactions need > > to look for old versions of tuples, so the concept of oldestXidHavingUndo > > variable is needed there. > > > > * it's rather simple to pass the URS pointer to the discard worker > > when the transaction either committed or the undo has been executed. > > > > Why can't we have a separate discard worker which keeps on scanning > the undo records and discards accordingly? Putting the onus on the foreground > process might be tricky because, say, the discard worker may not be up to speed > and we run out of space to pass such information for each commit/abort > request. Sure, there should be a discard worker. The question is how to make its work efficient. The initial run after restart probably needs to scan everything between 'discard' and 'insert' pointers, but then it should process only the parts created by individual transactions. > > > > Do not execute the same undo record multiple times > > -------------------------------------------------- > > > > Although I've noticed in the zheap code that it checks whether a particular undo > > action was already undone, I think this functionality fits better in the URS > > layer. > > > > If you want to track at undo record level, then won't it lead to > performance overhead and probably additional WAL overhead considering > this action needs to be WAL-logged? I think recording at page-level > might be a better idea.
I'm not worried about WAL because the undo execution needs to be WAL-logged anyway - see smgr_undo() in the 0005- part of the patch set. What needs to be evaluated regarding performance is the (exclusive) locking of the page that carries the progress information. I'm still not sure whether this info should be on every page or only in the chunk header. In either case, we have a problem if there are two or more chunks created by different transactions on the same page, and if more than one of these transactions needs to perform undo. I tend to believe that this should happen rarely though. > > I can spend more time on this project, but need a hint which part I should > > focus on. > > > > I can easily imagine that this needs a lot of work and I can try to > help with this as much as possible from my side. I feel at this stage > we should try to focus on undo-related work (to start with you can > look at finishing the undo-processing work for which I have shared some > thoughts) and then probably at some point in time we need to rebase > zheap over this. I agree, thanks! -- Antonin Houska Web: https://www.cybertec-postgresql.com
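For illustration, the initial discard pass described above might look roughly like this. Everything here except TransactionIdDidCommit() is a hypothetical name, and the real patches may divide the work differently:

    /* Hypothetical sketch of the discard worker's first pass after restart. */
    static void
    InitialDiscardScan(UndoLog *log)            /* hypothetical type */
    {
        UndoRecPtr  cur = log->discard;

        while (cur < log->insert)
        {
            /* read_chunk_header() is hypothetical: one chunk per transaction. */
            UndoChunkHeader hdr = read_chunk_header(log, cur);

            if (TransactionIdDidCommit(hdr.xid))
                cur = hdr.end;          /* committed: chunk is discardable */
            else if (chunk_fully_applied(&hdr)) /* hypothetical progress check */
                cur = hdr.end;          /* aborted, but undo already executed */
            else
                break;                  /* pending abort: execute undo first */
        }

        advance_discard_pointer(log, cur);      /* hypothetical */
    }

Subsequent passes would only need to look at the chunks added by individual transactions since the last pass, rather than rescanning the whole range.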
On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > No background undo > > ------------------ > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > suggested that cleanup of an orphaned relation file is simple enough to be > > done on foreground and I agree. > > > > Yeah, I think we should try and see if we can make it work but I > noticed that there are a few places like AbortOutOfAnyTransaction where > we have the assumption that undo will be executed in the background. > We need to deal with it. > > I think this is o.k. if we always check for unapplied undo during startup. > Hmm, how is it ok to leave undo (and rely on startup) unless it is a PANIC error? IIRC, this path is invoked in non-panic errors as well. Basically, we won't be able to discard such undo, which doesn't seem like a good idea. > > "undo worker" is still there, but it only processes undo requests after server > > restart because relation data can only be changed in a transaction - it seems > > cleaner to launch a background worker for this than to hack the startup > > process. > > > > But, I see there are still multiple undoworkers that are getting > launched and I am not sure if that works correctly because a > particular undoworker is connected to a database and then it starts > processing all the pending undo. > > Each undo worker applies only transactions for its own database, see > ProcessExistingUndoRequests():
>
> /* We can only process undo of the database we are connected to. */
> if (xact_hdr.dboid != MyDatabaseId)
>     continue;
>
> Nevertheless, as I've just mentioned in my response to Thomas, I admit that we > should try to live w/o the undo worker altogether. > Okay, but keep in mind that there could be a large amount of undo (unlike redo, which has some limit as we can replay it from the last checkpoint) which needs to be processed, but it might be okay to live with that for now. Another thing is that it seems we need to connect to the database to perform it, which might appear a bit odd: we don't allow users to connect to the database, but internally we are connecting to it. These are just some points to consider while finalizing the solution to this. > > > Discarding > > > ---------- > > > > > > There's no discard worker for the URS infrastructure yet. I thought about > > > discarding the undo log during checkpoint, but checkpoint should probably do > > > more straightforward tasks than the calculation of a new discard pointer for > > > each undo log, so a background worker is needed. A few notes on that: > > > > > > * until the zheap AM gets added, only the transaction that creates the undo > > > records needs to access them. This assumption should make the discarding > > > algorithm a bit simpler. Note that with zheap, the other transactions need > > > to look for old versions of tuples, so the concept of oldestXidHavingUndo > > > variable is needed there. > > > > > > * it's rather simple to pass the URS pointer to the discard worker > > > when the transaction either committed or the undo has been executed. > > > > Why can't we have a separate discard worker which keeps on scanning > the undo records and discards accordingly?
Putting the onus on the foreground > > process might be tricky because, say, the discard worker may not be up to speed > > and we run out of space to pass such information for each commit/abort > > request. > > Sure, there should be a discard worker. The question is how to make its work > efficient. The initial run after restart probably needs to scan everything > between 'discard' and 'insert' pointers, > Yeah, such an initial scan would be helpful to identify pending aborts and allow them to be processed. > but then it should process only the > parts created by individual transactions. > Yeah, it needs to process transaction-by-transaction to see which of them we can discard. Also, note that in Single-User mode we need to discard undo after commit. I think we also need to maintain oldestXidHavingUndo for CLOG truncation and transaction-wraparound. We can't allow CLOG truncation for the transaction whose undo is not discarded as that could be required by some other transaction. For similar reasons, we can't allow transaction-wraparound and we need to integrate this into the existing xid-allocation mechanism. I have found one of the old patches (Allow-execution-and-discard-of-undo-by-background-wo) attached where all these concepts were implemented. Unless you have a reason why we don't need these things, you might want to refer to the attached patch to either re-use or adapt these ideas. There are a few other things like undorequest and some undoworker mechanism which you can ignore. -- With Regards, Amit Kapila.
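The CLOG-truncation interaction above can be illustrated with a one-clamp sketch; GetOldestXidHavingUndo() is a hypothetical accessor for the variable being discussed:

    /*
     * Hypothetical sketch: clamp the CLOG truncation cutoff so that the
     * commit status of any transaction with undiscarded undo remains
     * available for undo processing.
     */
    static TransactionId
    ClampClogTruncationPoint(TransactionId cutoff)
    {
        TransactionId oldestXidHavingUndo = GetOldestXidHavingUndo();   /* hypothetical */

        if (TransactionIdIsValid(oldestXidHavingUndo) &&
            TransactionIdPrecedes(oldestXidHavingUndo, cutoff))
            cutoff = oldestXidHavingUndo;

        return cutoff;
    }

The same clamp would apply to the xid-allocation wraparound check: nextXid must not be allowed to approach oldestXidHavingUndo from behind.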
On Sun, Nov 15, 2020 at 11:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > > > No background undo > > > > ------------------ > > > > > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > > > suggested that cleanup of an orphaned relation file is simple enough to be > > > > done on foreground and I agree. > > > > > > > > > > Yeah, I think we should try and see if we can make it work but I > > > noticed that there are few places like AbortOutOfAnyTransaction where > > > we have the assumption that undo will be executed in the background. > > > We need to deal with it. > > > > I think this is o.k. if we always check for unapplied undo during startup. > > > > Hmm, how it is ok to leave undo (and rely on startup) unless it is a > PANIC error. IIRC, this path is invoked in non-panic errors as well. > Basically, we won't be able to discard such an undo which doesn't seem > like a good idea. > > > > > "undo worker" is still there, but it only processes undo requests after server > > > > restart because relation data can only be changed in a transaction - it seems > > > > cleaner to launch a background worker for this than to hack the startup > > > > process. > > > > > > > > > > But, I see there are still multiple undoworkers that are getting > > > launched and I am not sure if that works correctly because a > > > particular undoworker is connected to a database and then it starts > > > processing all the pending undo. > > > > Each undo worker applies only transactions for its own database, see > > ProcessExistingUndoRequests(): > > > > /* We can only process undo of the database we are connected to. */ > > if (xact_hdr.dboid != MyDatabaseId) > > continue; > > > > Nevertheless, as I've just mentioned in my response to Thomas, I admit that we > > should try to live w/o the undo worker altogether. > > > > Okay, but keep in mind that there could be a large amount of undo > (unlike redo which has some limit as we can replay it from the last > checkpoint) which needs to be processed but it might be okay to live > with that for now. Another thing is that it seems we need to connect > to the database to perform it which might appear a bit odd that we > don't allow users to connect to the database but internally we are > connecting it. These are just some points to consider while finalizing > the solution to this. > > > > > Discarding > > > > ---------- > > > > > > > > There's no discard worker for the URS infrastructure yet. I thought about > > > > discarding the undo log during checkpoint, but checkpoint should probably do > > > > more straightforward tasks than the calculation of a new discard pointer for > > > > each undo log, so a background worker is needed. A few notes on that: > > > > > > > > * until the zheap AM gets added, only the transaction that creates the undo > > > > records needs to access them. This assumption should make the discarding > > > > algorithm a bit simpler. Note that with zheap, the other transactions need > > > > to look for old versions of tuples, so the concept of oldestXidHavingUndo > > > > variable is needed there. > > > > > > > > * it's rather simple to pass pointer the URS pointer to the discard worker > > > > when transaction either committed or the undo has been executed. 
> > > > > > > > > > Why can't we have a separate discard worker which keeps on scanning > > > the undorecords and discard accordingly? Giving the onus of foreground > > > process might be tricky because say discard worker is not up to speed > > > and we ran out of space to pass such information for each commit/abort > > > request. > > > > Sure, there should be a discard worker. The question is how to make its work > > efficient. The initial run after restart probably needs to scan everything > > between 'discard' and 'insert' pointers, > > > > Yeah, such an initial scan would be helpful to identify pending aborts > and allow them to be processed. > > > but then it should process only the > > parts created by individual transactions. > > > > Yeah, it needs to process transaction-by-transaction to see which all > we can discard. Also, note that in Single-User mode we need to discard > undo after commit. I think we also need to maintain > oldestXidHavingUndo for CLOG truncation and transaction-wraparound. We > can't allow CLOG truncation for the transaction whose undo is not > discarded as that could be required by some other transaction. For > similar reasons, we can't allow transaction-wraparound and we need to > integrate this into the existing xid-allocation mechanism. I have > found one of the old patch > (Allow-execution-and-discard-of-undo-by-background-wo) attached > oops, forgot to attach the patch, doing now. -- With Regards, Amit Kapila.
On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > If you want to track at undo record level, then won't it lead to > > performance overhead and probably additional WAL overhead considering > > this action needs to be WAL-logged? I think recording at page-level > > might be a better idea. > > I'm not worried about WAL because the undo execution needs to be WAL-logged > anyway - see smgr_undo() in the 0005- part of the patch set. What needs to be > evaluated regarding performance is the (exclusive) locking of the page that > carries the progress information. > That is just for one kind of smgr; think about how you will do it for something like zheap. Their idea is to collect all the undo records (unless the undo for a transaction is very large) for one zheap-page and apply them together, so maintaining the status at each undo record level will surely lead to a large amount of additional WAL. See below how and why we have decided to do it differently. > I'm still not sure whether this info should > be on every page or only in the chunk header. In either case, we have a > problem if there are two or more chunks created by different transactions on > the same page, and if more than one of these transactions needs to perform > undo. I tend to believe that this should happen rarely though. > I think we need to maintain this information at the transaction level and need to update it after processing a few blocks, at least that is what was decided and implemented earlier. We also need to update it when the log is switched or all the actions of the transaction were applied. The reasoning is that for short transactions it won't matter and for larger transactions, it is good to update it after a few pages to avoid WAL and locking overhead. Also, it is better if we collect the undo in bulk; this has proved to be beneficial for large transactions. The earlier version of the patch, with all these ideas implemented, is attached (Infrastructure-to-execute-pending-undo-actions and Provide-interfaces-to-store-and-fetch-undo-records). The second one has some APIs used by the first one but the main concepts were implemented in the first one (Infrastructure-to-execute-pending-undo-actions). I see that in the current version these can't be used as-is, but they can still give us a good starting point and we might be able to re-use some code and/or ideas from these patches. -- With Regards, Amit Kapila.
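As an illustration of the bulk-apply idea, one possible shape is below; all names are invented for the sketch, not taken from the attached patches:

    #define PROGRESS_UPDATE_INTERVAL 8  /* invented: blocks between progress updates */

    typedef struct UndoBlockGroup       /* hypothetical: records grouped per page */
    {
        Buffer      buffer;
        /* ... the undo records targeting this block ... */
    } UndoBlockGroup;

    static void
    apply_undo_in_bulk(UndoBlockGroup *groups, int ngroups)
    {
        for (int i = 0; i < ngroups; i++)
        {
            /* Apply every record for one page under a single lock/WAL cycle. */
            LockBuffer(groups[i].buffer, BUFFER_LOCK_EXCLUSIVE);
            apply_group_records(&groups[i]);    /* hypothetical */
            UnlockReleaseBuffer(groups[i].buffer);

            /* Transaction-level progress, updated only every few blocks. */
            if ((i + 1) % PROGRESS_UPDATE_INTERVAL == 0)
                update_transaction_progress();  /* hypothetical, WAL-logged */
        }

        update_transaction_progress();          /* final update on completion */
    }

This keeps the per-record WAL and locking cost out of the common path: short transactions never hit the interval, and long ones pay for a progress update only once per batch.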
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > > > No background undo > > > > ------------------ > > > > > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > > > suggested that cleanup of an orphaned relation file is simple enough to be > > > > done on foreground and I agree. > > > > > > > > > > Yeah, I think we should try and see if we can make it work but I > > > noticed that there are a few places like AbortOutOfAnyTransaction where > > > we have the assumption that undo will be executed in the background. > > > We need to deal with it. > > > > I think this is o.k. if we always check for unapplied undo during startup. > > > > Hmm, how is it ok to leave undo (and rely on startup) unless it is a > PANIC error? IIRC, this path is invoked in non-panic errors as well. > Basically, we won't be able to discard such undo, which doesn't seem > like a good idea. Since failure to apply leaves inconsistent data, I assume it should always cause PANIC, shouldn't it? (Thomas seems to assume the same in [1].) > > > > "undo worker" is still there, but it only processes undo requests after server > > > > restart because relation data can only be changed in a transaction - it seems > > > > cleaner to launch a background worker for this than to hack the startup > > > > process. > > > > > > > > > > But, I see there are still multiple undoworkers that are getting > > > launched and I am not sure if that works correctly because a > > > particular undoworker is connected to a database and then it starts > > > processing all the pending undo. > > > > Each undo worker applies only transactions for its own database, see > > ProcessExistingUndoRequests():
> >
> > /* We can only process undo of the database we are connected to. */
> > if (xact_hdr.dboid != MyDatabaseId)
> >     continue;
> >
> > Nevertheless, as I've just mentioned in my response to Thomas, I admit that we > > should try to live w/o the undo worker altogether. > > Okay, but keep in mind that there could be a large amount of undo > (unlike redo, which has some limit as we can replay it from the last > checkpoint) which needs to be processed, but it might be okay to live > with that for now. Yes, the information needed to remove a relation file does not take much space in the undo log. > Another thing is that it seems we need to connect to the database to perform > it, which might appear a bit odd: we don't allow users to connect to the > database, but internally we are connecting to it. I think the implementation will need to follow the outcome of the part of the discussion that starts at [2], but I see your concern. I'm wondering why a database connection is not needed to apply WAL but is needed for UNDO. I think locks make the difference. So maybe we can make the RMGR-specific callbacks (rm_undo) aware of the fact that the cluster is still in the startup state, so the relations should be opened in NoLock mode? > > > > Discarding > > > > ---------- > > > > > > > > There's no discard worker for the URS infrastructure yet. I thought about > > > > discarding the undo log during checkpoint, but checkpoint should probably do > > > > more straightforward tasks than the calculation of a new discard pointer for > > > > each undo log, so a background worker is needed.
A few notes on that: > > > > * until the zheap AM gets added, only the transaction that creates the undo > > records needs to access them. This assumption should make the discarding > > algorithm a bit simpler. Note that with zheap, the other transactions need > > to look for old versions of tuples, so the concept of oldestXidHavingUndo > > variable is needed there. > > > > * it's rather simple to pass the URS pointer to the discard worker > > when the transaction either committed or the undo has been executed. > > > > > > Why can't we have a separate discard worker which keeps on scanning > > > the undo records and discards accordingly? Putting the onus on the foreground > > > process might be tricky because, say, the discard worker may not be up to speed > > > and we run out of space to pass such information for each commit/abort > > > request. > > > > Sure, there should be a discard worker. The question is how to make its work > > efficient. The initial run after restart probably needs to scan everything > > between 'discard' and 'insert' pointers, > > > > Yeah, such an initial scan would be helpful to identify pending aborts > and allow them to be processed. > > > but then it should process only the > > parts created by individual transactions. > > > > Yeah, it needs to process transaction-by-transaction to see which of them > we can discard. Also, note that in Single-User mode we need to discard > undo after commit. ok, I missed this problem so far. > I think we also need to maintain oldestXidHavingUndo for CLOG truncation and > transaction-wraparound. We can't allow CLOG truncation for the transaction > whose undo is not discarded as that could be required by some other > transaction. Good point. Even the discard worker might need to check the transaction status when deciding whether the undo log of that transaction should be discarded. > For similar reasons, we can't allow transaction-wraparound and > we need to integrate this into the existing xid-allocation mechanism. I have > found one of the old patches > (Allow-execution-and-discard-of-undo-by-background-wo) attached where all > these concepts were implemented. Unless you have a reason why we don't need these > things, you might want to refer to the attached patch to either re-use or > adapt these ideas. There are a few other things like undorequest and some > undoworker mechanism which you can ignore. Thanks. [1] https://www.postgresql.org/message-id/CA%2BhUKGJL4X1em70rxN1d_EC3rxiVhVd1woHviydW%3DHr2PeGBpg%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAH2-Wzk06ypb40z3B8HFiSsTVg961%3DE0%3DuQvqARJgT8_4QB2Mg%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
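As a sketch of that last idea, the startup-awareness could be a single flag passed down to the callback; the context struct and callback below are hypothetical, not the patch's actual rm_undo interface:

    typedef struct RmgrUndoContext
    {
        Oid         reloid;         /* hypothetical: target relation */
        bool        in_startup;     /* true while connections are not yet allowed */
    } RmgrUndoContext;

    static void
    example_rm_undo(RmgrUndoContext *ctx)
    {
        /*
         * During startup no other session can hold a conflicting lock, so a
         * heavyweight lock would be pointless; a normal backend must lock
         * the relation as usual.
         */
        LOCKMODE    mode = ctx->in_startup ? NoLock : AccessExclusiveLock;
        Relation    rel = relation_open(ctx->reloid, mode);

        /* ... apply this RMGR's undo records against rel ... */

        relation_close(rel, mode);
    }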
On Wed, Nov 18, 2020 at 4:03 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > > > > > > No background undo > > > > ------------------ > > > > > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > > > suggested that cleanup of an orphaned relation file is simple enough to be > > > > done on foreground and I agree. > > > > > > > > > > Yeah, I think we should try and see if we can make it work but I > > > noticed that there are a few places like AbortOutOfAnyTransaction where > > > we have the assumption that undo will be executed in the background. > > > We need to deal with it. > > > > I think this is o.k. if we always check for unapplied undo during startup. > > > > Hmm, how is it ok to leave undo (and rely on startup) unless it is a > PANIC error? IIRC, this path is invoked in non-panic errors as well. > Basically, we won't be able to discard such undo, which doesn't seem > like a good idea. > > Since failure to apply leaves inconsistent data, I assume it should always > cause PANIC, shouldn't it? > But how can we ensure that AbortOutOfAnyTransaction will be called only in that scenario? > (Thomas seems to assume the same in [1].) > > > > > "undo worker" is still there, but it only processes undo requests after server > > > > restart because relation data can only be changed in a transaction - it seems > > > > cleaner to launch a background worker for this than to hack the startup > > > > process. > > > > > > > > > > But, I see there are still multiple undoworkers that are getting > > > launched and I am not sure if that works correctly because a > > > particular undoworker is connected to a database and then it starts > > > processing all the pending undo. > > > > Each undo worker applies only transactions for its own database, see > > ProcessExistingUndoRequests():
> >
> > /* We can only process undo of the database we are connected to. */
> > if (xact_hdr.dboid != MyDatabaseId)
> >     continue;
> >
> > Nevertheless, as I've just mentioned in my response to Thomas, I admit that we > > should try to live w/o the undo worker altogether. > > Okay, but keep in mind that there could be a large amount of undo > (unlike redo, which has some limit as we can replay it from the last > checkpoint) which needs to be processed, but it might be okay to live > with that for now. > > Yes, the information needed to remove a relation file does not take much space in the > undo log. > > > Another thing is that it seems we need to connect to the database to perform > > it, which might appear a bit odd: we don't allow users to connect to the > > database, but internally we are connecting to it. > > I think the implementation will need to follow the outcome of the part of the > discussion that starts at [2], but I see your concern. I'm wondering why > a database connection is not needed to apply WAL but is needed for UNDO. I think > locks make the difference. > Yeah, it would probably be a good idea to see if we can make undo apply work without a db connection, especially if we want to do it before allowing connections. The other possibility could be to let the discard worker do this work lazily after allowing connections.
-- With Regards, Amit Kapila.
Antonin Houska <ah@cybertec.at> wrote: > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I think we also need to maintain oldestXidHavingUndo for CLOG truncation and > > transaction-wraparound. We can't allow CLOG truncation for the transaction > > whose undo is not discarded as that could be required by some other > > transaction. > > Good point. Even the discard worker might need to check the transaction status > when deciding whether the undo log of that transaction should be discarded. In the zheap code [1] I see that DiscardWorkerMain() discards undo log up to OldestXmin:

    OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_AUTOVACUUM |
                               PROCARRAY_FLAGS_VACUUM);

    oldestXidHavingUndo =
        GetXidFromEpochXid(pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo));

    /*
     * Call the discard routine if there oldestXidHavingUndo is lagging
     * behind OldestXmin.
     */
    if (OldestXmin != InvalidTransactionId &&
        TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin))
    {
        UndoDiscard(OldestXmin, &hibernate);

and that UndoDiscard() eventually advances oldestXidHavingUndo in the shared memory. I'm not sure this is correct because, IMO, OldestXmin can advance as soon as AbortTransaction() has cleared both xid and xmin fields of the transaction's PGXACT (by calling ProcArrayEndTransactionInternal). However the corresponding undo log may still be waiting for processing. Am I wrong? I think that oldestXidHavingUndo should be advanced at the time a transaction commits or when the undo log of an aborted transaction has been applied. Then the discard worker would simply discard the undo log up to oldestXidHavingUndo. However, as the transactions whose undo is still not applied may no longer be registered in the shared memory (proc array), I don't know how to determine the next value of oldestXidHavingUndo. Also I wonder if FullTransactionId is needed for oldestXidHavingUndo in the shared memory rather than plain TransactionId (see oldestXidWithEpochHavingUndo in PROC_HDR). I think that the value cannot lag behind nextFullXid by more than 2 billion transactions anyway because in that case it would cause XID wraparound. (That in turn makes me think that VACUUM FREEZE should be able to discard undo log too.) [1] https://github.com/EnterpriseDB/zheap/tree/master -- Antonin Houska Web: https://www.cybertec-postgresql.com
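As a side note on the FullTransactionId question, the epoch is exactly what makes "older than" well defined beyond the wraparound window. This much is ordinary PostgreSQL API (access/transam.h) rather than patch code:

    /*
     * Plain TransactionId comparisons are circular (modulo 2^32), so they
     * are only meaningful within about 2 billion xids.  FullTransactionId
     * carries the epoch, so a simple 64-bit comparison always works.
     */
    static void
    fxid_epoch_demo(void)
    {
        FullTransactionId a = FullTransactionIdFromEpochAndXid(5, 100);
        FullTransactionId b = FullTransactionIdFromEpochAndXid(6, 100);

        Assert(FullTransactionIdPrecedes(a, b));    /* unambiguous across epochs */
        Assert(U64FromFullTransactionId(b) - U64FromFullTransactionId(a) ==
               UINT64CONST(1) << 32);
    }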
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Nov 18, 2020 at 4:03 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > > > > > > > > > No background undo > > > > > > ------------------ > > > > > > > > > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > > > > > suggested that cleanup of an orphaned relation file is simple enough to be > > > > > > done on foreground and I agree. > > > > > > > > > > > > > > > > Yeah, I think we should try and see if we can make it work but I > > > > > noticed that there are few places like AbortOutOfAnyTransaction where > > > > > we have the assumption that undo will be executed in the background. > > > > > We need to deal with it. > > > > > > > > I think this is o.k. if we always check for unapplied undo during startup. > > > > > > > > > > Hmm, how it is ok to leave undo (and rely on startup) unless it is a > > > PANIC error. IIRC, this path is invoked in non-panic errors as well. > > > Basically, we won't be able to discard such an undo which doesn't seem > > > like a good idea. > > > > Since failure to apply leaves unconsistent data, I assume it should always > > cause PANIC, shouldn't it? > > > > But how can we ensure that AbortOutOfAnyTransaction will be called > only in that scenario? I meant that AbortOutOfAnyTransaction should PANIC itself if it sees that there is unapplied undo, so nothing changes for its callers. Do I still miss something? > > > Another thing is that it seems we need to connect to the database to perform > > > it which might appear a bit odd that we don't allow users to connect to the > > > database but internally we are connecting it. > > > > I think the implementation will need to follow the outcome of the part of the > > discussion that starts at [2], but I see your concern. I'm thinking why > > database connection is not needed to apply WAL but is needed for UNDO. I think > > locks make the difference. > > > > Yeah, it would be probably a good idea to see if we can make undo > apply work without db-connection especially if we want to do before > allowing connections. The other possibility could be to let discard > worker do this work lazily after allowing connections. Actually I hit the problem of missing connection when playing with the "undoxacttest" module. Those tests use table_open() / table_close() functions, but it might not be necessary for the real RMGRs. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Wed, Nov 25, 2020 at 8:00 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Wed, Nov 18, 2020 at 4:03 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > > > > > > > > > > > > No background undo > > > > > > > ------------------ > > > > > > > > > > > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > > > > > > suggested that cleanup of an orphaned relation file is simple enough to be > > > > > > > done on foreground and I agree. > > > > > > > > > > > > > > > > > > > Yeah, I think we should try and see if we can make it work but I > > > > > > noticed that there are few places like AbortOutOfAnyTransaction where > > > > > > we have the assumption that undo will be executed in the background. > > > > > > We need to deal with it. > > > > > > > > > > I think this is o.k. if we always check for unapplied undo during startup. > > > > > > > > > > > > > Hmm, how it is ok to leave undo (and rely on startup) unless it is a > > > > PANIC error. IIRC, this path is invoked in non-panic errors as well. > > > > Basically, we won't be able to discard such an undo which doesn't seem > > > > like a good idea. > > > > > > Since failure to apply leaves unconsistent data, I assume it should always > > > cause PANIC, shouldn't it? > > > > > > > But how can we ensure that AbortOutOfAnyTransaction will be called > > only in that scenario? > > I meant that AbortOutOfAnyTransaction should PANIC itself if it sees that > there is unapplied undo, so nothing changes for its callers. Do I still miss > something? > Adding PANIC in some generic code-path sounds scary. Why can't we simply try to execute undo? > > > > Another thing is that it seems we need to connect to the database to perform > > > > it which might appear a bit odd that we don't allow users to connect to the > > > > database but internally we are connecting it. > > > > > > I think the implementation will need to follow the outcome of the part of the > > > discussion that starts at [2], but I see your concern. I'm thinking why > > > database connection is not needed to apply WAL but is needed for UNDO. I think > > > locks make the difference. > > > > > > > Yeah, it would be probably a good idea to see if we can make undo > > apply work without db-connection especially if we want to do before > > allowing connections. The other possibility could be to let discard > > worker do this work lazily after allowing connections. > > Actually I hit the problem of missing connection when playing with the > "undoxacttest" module. Those tests use table_open() / table_close() functions, > but it might not be necessary for the real RMGRs. > How can we apply the action on a page without opening the relation? -- With Regards, Amit Kapila.
On Wed, Nov 25, 2020 at 7:47 PM Antonin Houska <ah@cybertec.at> wrote: > > Antonin Houska <ah@cybertec.at> wrote: > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I think we also need to maintain oldestXidHavingUndo for CLOG truncation and > > > transaction-wraparound. We can't allow CLOG truncation for the transaction > > > whose undo is not discarded as that could be required by some other > > > transaction. > > > > Good point. Even the discard worker might need to check the transaction status > > when deciding whether the undo log of that transaction should be discarded. > > In the zheap code [1] I see that DiscardWorkerMain() discards undo log up to > OldestXmin:
>
> OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_AUTOVACUUM |
>                            PROCARRAY_FLAGS_VACUUM);
>
> oldestXidHavingUndo =
>     GetXidFromEpochXid(pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo));
>
> /*
>  * Call the discard routine if there oldestXidHavingUndo is lagging
>  * behind OldestXmin.
>  */
> if (OldestXmin != InvalidTransactionId &&
>     TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin))
> {
>     UndoDiscard(OldestXmin, &hibernate);
>
> and that UndoDiscard() eventually advances oldestXidHavingUndo in the shared > memory. > > I'm not sure this is correct because, IMO, OldestXmin can advance as soon as > AbortTransaction() has cleared both xid and xmin fields of the transaction's > PGXACT (by calling ProcArrayEndTransactionInternal). However the corresponding > undo log may still be waiting for processing. Am I wrong? > The UndoDiscard->UndoDiscardOneLog ensures that we don't discard the undo if there is a pending abort. > I think that oldestXidHavingUndo should be advanced at the time a transaction > commits or when the undo log of an aborted transaction has been applied. > We can't advance oldestXidHavingUndo just on commit because later we need to rely on it for visibility: basically, any transaction older than oldestXidHavingUndo should be all-visible. > Then > the discard worker would simply discard the undo log up to > oldestXidHavingUndo. However, as the transactions whose undo is still not > applied may no longer be registered in the shared memory (proc array), I don't > know how to determine the next value of oldestXidHavingUndo. > > Also I wonder if FullTransactionId is needed for oldestXidHavingUndo in the > shared memory rather than plain TransactionId (see > oldestXidWithEpochHavingUndo in PROC_HDR). I think that the value cannot lag > behind nextFullXid by more than 2 billion transactions anyway because in that > case it would cause XID wraparound. > You are right, but still it is better to keep it as FullTransactionId because (a) zheap uses FullTransactionId and we need to compare it with oldestXidWithEpochHavingUndo for visibility purposes, and (b) in the future, we want to get rid of this limitation for undo as well. -- With Regards, Amit Kapila.
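So the variable doubles as a visibility horizon. A hypothetical sketch of the fast path it enables (the accessor name is invented):

    /*
     * Hypothetical sketch: any transaction older than oldestXidHavingUndo
     * has no undo left -- either it committed, or its undo has been fully
     * applied and discarded -- so its effects can be treated as permanent.
     */
    static bool
    XidIsCertainlyAllVisible(FullTransactionId fxid)
    {
        FullTransactionId horizon = GetOldestFullXidHavingUndo();   /* hypothetical */

        return FullTransactionIdPrecedes(fxid, horizon);
    }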
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Nov 25, 2020 at 7:47 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Antonin Houska <ah@cybertec.at> wrote: > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > I think we also need to maintain oldestXidHavingUndo for CLOG truncation and > > > > transaction-wraparound. We can't allow CLOG truncation for the transaction > > > > whose undo is not discarded as that could be required by some other > > > > transaction. > > > > > > Good point. Even the discard worker might need to check the transaction status > > > when deciding whether the undo log of that transaction should be discarded. > > > > In the zheap code [1] I see that DiscardWorkerMain() discards undo log up to > > OldestXmin:
> >
> > OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_AUTOVACUUM |
> >                            PROCARRAY_FLAGS_VACUUM);
> >
> > oldestXidHavingUndo =
> >     GetXidFromEpochXid(pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo));
> >
> > /*
> >  * Call the discard routine if there oldestXidHavingUndo is lagging
> >  * behind OldestXmin.
> >  */
> > if (OldestXmin != InvalidTransactionId &&
> >     TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin))
> > {
> >     UndoDiscard(OldestXmin, &hibernate);
> >
> > and that UndoDiscard() eventually advances oldestXidHavingUndo in the shared > > memory. > > > > I'm not sure this is correct because, IMO, OldestXmin can advance as soon as > > AbortTransaction() has cleared both xid and xmin fields of the transaction's > > PGXACT (by calling ProcArrayEndTransactionInternal). However the corresponding > > undo log may still be waiting for processing. Am I wrong? > The UndoDiscard->UndoDiscardOneLog ensures that we don't discard the > undo if there is a pending abort. ok, I should have dug deeper than just reading the header comment of UndoDiscard(). I've checked now and seem to understand why no information is lost. Nevertheless, I see in the zheap code that the discard worker may need to scan a lot of undo log each time. While the oldest_xid and oldest_data fields of UndoLogControl help to skip parts of the log, I'm not sure such information fits into the undo-record-set (URS) approach. For now I'm inclined to implement the "exhaustive" scan for the URS too, and later we can teach the discard worker to store some metadata so that the processing becomes incremental. > > I think that oldestXidHavingUndo should be advanced at the time a transaction > > commits or when the undo log of an aborted transaction has been applied. > > > > We can't advance oldestXidHavingUndo just on commit because later we > need to rely on it for visibility: basically, any transaction older > than oldestXidHavingUndo should be all-visible. ok -- Antonin Houska Web: https://www.cybertec-postgresql.com
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Nov 25, 2020 at 8:00 PM Antonin Houska <ah@cybertec.at> wrote: > > I meant that AbortOutOfAnyTransaction should PANIC itself if it sees that > > there is unapplied undo, so nothing changes for its callers. Do I still miss > > something? > > > > Adding PANIC in some generic code-path sounds scary. Why can't we > simply try to execute undo? Indeed it should try. I imagined it this way but probably got distracted by some other thought when writing the email :-) > > Actually I hit the problem of missing connection when playing with the > > "undoxacttest" module. Those tests use table_open() / table_close() functions, > > but it might not be necessary for the real RMGRs. > > > > How can we apply the action on a page without opening the relation? If the undo record contains RelFileNode, ReadBufferWithoutRelcache() can be used, just like it happens with WAL. Not sure how much it would affect zheap. -- Antonin Houska Web: https://www.cybertec-postgresql.com
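To illustrate, an undo handler that carries the RelFileNode in its records could touch the target page the same way redo does, without any relcache entry. The record layout below is invented, and the ReadBufferWithoutRelcache() signature is quoted from memory (see bufmgr.h):

    typedef struct UndoRecordSample     /* hypothetical record payload */
    {
        RelFileNode rnode;              /* physical relation identity */
        BlockNumber blkno;              /* target block */
    } UndoRecordSample;

    static void
    apply_undo_without_relcache(UndoRecordSample *rec)
    {
        Buffer      buf = ReadBufferWithoutRelcache(rec->rnode, MAIN_FORKNUM,
                                                    rec->blkno, RBM_NORMAL,
                                                    NULL);

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        /* ... revert the change on the page and WAL-log the modification ... */
        UnlockReleaseBuffer(buf);
    }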
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > If you want to track at undo record level, then won't it lead to > > > performance overhead and probably additional WAL overhead considering > > > this action needs to be WAL-logged? I think recording at page-level > > > might be a better idea. > > > > I'm not worried about WAL because the undo execution needs to be WAL-logged > > anyway - see smgr_undo() in the 0005- part of the patch set. What needs to be > > evaluated regarding performance is the (exclusive) locking of the page that > > carries the progress information. > > > > That is just for one kind of smgr; think about how you will do it for > something like zheap. Their idea is to collect all the undo records > (unless the undo for a transaction is very large) for one zheap-page > and apply them together, so maintaining the status at each undo record > level will surely lead to a large amount of additional WAL. See below > how and why we have decided to do it differently. > > > I'm still not sure whether this info should > > be on every page or only in the chunk header. In either case, we have a > > problem if there are two or more chunks created by different transactions on > > the same page, and if more than one of these transactions needs to perform > > undo. I tend to believe that this should happen rarely though. > > > > I think we need to maintain this information at the transaction level > and need to update it after processing a few blocks, at least that is > what was decided and implemented earlier. We also need to update it > when the log is switched or all the actions of the transaction were > applied. The reasoning is that for short transactions it won't matter > and for larger transactions, it is good to update it after a few pages > to avoid WAL and locking overhead. Also, it is better if we collect > the undo in bulk; this has proved to be beneficial for large > transactions. Attached is what I originally did not include in the patch series; see part 0012. I have no better idea so far. The progress information is stored in the chunk header. To avoid too frequent locking, maybe the UpdateLastAppliedRecord() function can be modified so it recognizes when it's necessary to update the progress info. Also the caller (zheap) should consider when to call the function. Since I've included 0012 now as a prerequisite for discarding (0013), currently it's only necessary to update the progress at an undo log chunk boundary. In this version of the patch series I wanted to publish the remaining ideas I haven't published yet. > The earlier version of the patch, with all these ideas > implemented, is attached > (Infrastructure-to-execute-pending-undo-actions and > Provide-interfaces-to-store-and-fetch-undo-records). The second one > has some APIs used by the first one but the main concepts were > implemented in the first one > (Infrastructure-to-execute-pending-undo-actions). I see that in the > current version these can't be used as-is, but they can still give us > a good starting point and we might be able to re-use some code > and/or ideas from these patches. Is there a branch with these patches applied? They reference some functions that I don't see in [1]. I'd like to examine if / how my approach can be aligned with the current zheap design.
[1] https://github.com/EnterpriseDB/zheap/tree/master -- Antonin Houska Web: https://www.cybertec-postgresql.com
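One cheap way to let UpdateLastAppliedRecord() recognize that on its own is to throttle on the amount of undo applied since the last report. A sketch under that assumption, with the threshold and the distance helper invented:

    #define PROGRESS_REPORT_THRESHOLD   (8 * 1024)     /* invented tuning knob */

    static void
    MaybeUpdateLastAppliedRecord(UndoRecPtr applied, UndoRecPtr *last_reported,
                                 bool chunk_boundary)
    {
        /*
         * Skip the exclusive page lock and the WAL record unless we reached
         * a chunk boundary or applied enough undo since the last report;
         * undo_bytes_between() is a hypothetical helper.
         */
        if (chunk_boundary ||
            undo_bytes_between(*last_reported, applied) >= PROGRESS_REPORT_THRESHOLD)
        {
            UpdateLastAppliedRecord(applied);   /* the patch's function */
            *last_reported = applied;
        }
    }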
On Fri, Dec 4, 2020 at 1:50 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > The earlier version of the patch, with all these ideas > > implemented, is attached > > (Infrastructure-to-execute-pending-undo-actions and > > Provide-interfaces-to-store-and-fetch-undo-records). The second one > > has some APIs used by the first one but the main concepts were > > implemented in the first one > > (Infrastructure-to-execute-pending-undo-actions). I see that in the > > current version these can't be used as-is, but they can still give us > > a good starting point and we might be able to re-use some code > > and/or ideas from these patches. > > Is there a branch with these patches applied? They reference some functions > that I don't see in [1]. I'd like to examine if / how my approach can be > aligned with the current zheap design. > Can you check the patch set attached to the email at [1]? [1] - https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com -- With Regards, Amit Kapila.
> On Fri, Dec 04, 2020 at 10:22:42AM +0100, Antonin Houska wrote: > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > If you want to track at undo record level, then won't it lead to > > > > performance overhead and probably additional WAL overhead considering > > > > this action needs to be WAL-logged? I think recording at page-level > > > > might be a better idea. > > > > > > I'm not worried about WAL because the undo execution needs to be WAL-logged > > > anyway - see smgr_undo() in the 0005- part of the patch set. What needs to be > > > evaluated regarding performance is the (exclusive) locking of the page that > > > carries the progress information. > > > > > > > That is just for one kind of smgr; think about how you will do it for > > something like zheap. Their idea is to collect all the undo records > > (unless the undo for a transaction is very large) for one zheap-page > > and apply them together, so maintaining the status at each undo record > > level will surely lead to a large amount of additional WAL. See below > > how and why we have decided to do it differently. > > > > > I'm still not sure whether this info should > > > be on every page or only in the chunk header. In either case, we have a > > > problem if there are two or more chunks created by different transactions on > > > the same page, and if more than one of these transactions needs to perform > > > undo. I tend to believe that this should happen rarely though. > > > > > > > I think we need to maintain this information at the transaction level > > and need to update it after processing a few blocks, at least that is > > what was decided and implemented earlier. We also need to update it > > when the log is switched or all the actions of the transaction were > > applied. The reasoning is that for short transactions it won't matter > > and for larger transactions, it is good to update it after a few pages > > to avoid WAL and locking overhead. Also, it is better if we collect > > the undo in bulk; this has proved to be beneficial for large > > transactions. > > Attached is what I originally did not include in the patch series; see > part 0012. I have no better idea so far. The progress information is stored in > the chunk header. > > To avoid too frequent locking, maybe the UpdateLastAppliedRecord() function > can be modified so it recognizes when it's necessary to update the progress > info. Also the caller (zheap) should consider when to call the function. > Since I've included 0012 now as a prerequisite for discarding (0013), > currently it's only necessary to update the progress at an undo log chunk > boundary. > > In this version of the patch series I wanted to publish the remaining ideas I > haven't published yet. Thanks for the updated patch. As I've mentioned off the list, I'm slowly looking through it with the intent to concentrate on undo progress tracking. But before I post anything I want to mention a couple of strange issues I see, otherwise I will forget for sure. Maybe it's already known, but running 'make installcheck' several times against a freshly built postgres with the patch applied, I observe various errors from time to time.
This one happens during crash recovery; it seems like UndoRecordSetXLogBufData has usr_type = USRT_INVALID and is involved in the replay process:

TRAP: FailedAssertion("page_offset + this_page_bytes <= uph->ud_insertion_point", File: "undopage.c", Line: 300)
postgres: startup recovering 000000010000000000000012(ExceptionalCondition+0xa1)[0x558b38b8a350]
postgres: startup recovering 000000010000000000000012(UndoPageSkipOverwrite+0x0)[0x558b38761b7e]
postgres: startup recovering 000000010000000000000012(UndoReplay+0xa1d)[0x558b38766f32]
postgres: startup recovering 000000010000000000000012(XactUndoReplay+0x77)[0x558b38769281]
postgres: startup recovering 000000010000000000000012(smgr_redo+0x1af)[0x558b387aa7bd]

This one is somewhat similar:

TRAP: FailedAssertion("page_offset >= SizeOfUndoPageHeaderData", File: "undopage.c", Line: 287)
postgres: undo worker for database 36893 (ExceptionalCondition+0xa1)[0x5559c90f1350]
postgres: undo worker for database 36893 (UndoPageOverwrite+0xa6)[0x5559c8cc8ae3]
postgres: undo worker for database 36893 (UpdateLastAppliedRecord+0xbe)[0x5559c8ccd008]
postgres: undo worker for database 36893 (smgr_undo+0xa6)[0x5559c8d11989]

There are also, here and there, messages about undo files that cannot be found:

ERROR: cannot open undo segment file 'base/undo/000008.0000020000': No such file or directory
WARNING: failed to undo transaction

I haven't found the trigger yet, but I got the impression that it happens after the create_table tests.
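The two assertions above amount to a range check on where within an undo page an overwrite may land. Here is a minimal paraphrase in C (not the actual undopage.c code; the function and parameter names are made up for illustration):

#include <stdbool.h>
#include <stddef.h>

/*
 * Paraphrase of the invariants behind the two failed assertions: an
 * overwrite must start past the page header (the Line 287 check) and
 * must end no later than the page's current insertion point (the Line
 * 300 check).  A violation means the caller computed the offset from
 * invalid state, e.g. an uninitialized record set type during replay.
 */
static bool
undo_overwrite_range_ok(size_t page_offset, size_t nbytes,
                        size_t page_header_size, size_t insertion_point)
{
    return page_offset >= page_header_size &&
           page_offset + nbytes <= insertion_point;
}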
Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Thanks for the updated patch. As I've mentioned off the list I'm slowly > looking through it with the intent to concentrate on undo progress > tracking. But before I will post anything I want to mention couple of > strange issues I see, otherwise I will forget for sure. Maybe it's > already known, but running several times 'make installcheck' against a > freshly build postgres with the patch applied from time to time I > observe various errors. > > This one happens on a crash recovery, seems like > UndoRecordSetXLogBufData has usr_type = USRT_INVALID and is involved in > the replay process: > > TRAP: FailedAssertion("page_offset + this_page_bytes <= uph->ud_insertion_point", File: "undopage.c", Line: 300) > postgres: startup recovering 000000010000000000000012(ExceptionalCondition+0xa1)[0x558b38b8a350] > postgres: startup recovering 000000010000000000000012(UndoPageSkipOverwrite+0x0)[0x558b38761b7e] > postgres: startup recovering 000000010000000000000012(UndoReplay+0xa1d)[0x558b38766f32] > postgres: startup recovering 000000010000000000000012(XactUndoReplay+0x77)[0x558b38769281] > postgres: startup recovering 000000010000000000000012(smgr_redo+0x1af)[0x558b387aa7bd] > > This one is somewhat similar: > > TRAP: FailedAssertion("page_offset >= SizeOfUndoPageHeaderData", File: "undopage.c", Line: 287) > postgres: undo worker for database 36893 (ExceptionalCondition+0xa1)[0x5559c90f1350] > postgres: undo worker for database 36893 (UndoPageOverwrite+0xa6)[0x5559c8cc8ae3] > postgres: undo worker for database 36893 (UpdateLastAppliedRecord+0xbe)[0x5559c8ccd008] > postgres: undo worker for database 36893 (smgr_undo+0xa6)[0x5559c8d11989] Well, on repeated run of the test I could also hit the first one. I could fix it and will post a new version of the patch (along with some other small changes) this week. > There are also here and there messages about not found undo files: > > ERROR: cannot open undo segment file 'base/undo/000008.0000020000': No such file or directory > WARNING: failed to undo transaction I don't see this one in the log so far, will try again. Thanks for the report! -- Antonin Houska Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at> wrote: > Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > Thanks for the updated patch. As I've mentioned off the list I'm slowly > > looking through it with the intent to concentrate on undo progress > > tracking. But before I will post anything I want to mention couple of > > strange issues I see, otherwise I will forget for sure. Maybe it's > > already known, but running several times 'make installcheck' against a > > freshly build postgres with the patch applied from time to time I > > observe various errors. > > > > This one happens on a crash recovery, seems like > > UndoRecordSetXLogBufData has usr_type = USRT_INVALID and is involved in > > the replay process: > > > > TRAP: FailedAssertion("page_offset + this_page_bytes <= uph->ud_insertion_point", File: "undopage.c", Line: 300) > > postgres: startup recovering 000000010000000000000012(ExceptionalCondition+0xa1)[0x558b38b8a350] > > postgres: startup recovering 000000010000000000000012(UndoPageSkipOverwrite+0x0)[0x558b38761b7e] > > postgres: startup recovering 000000010000000000000012(UndoReplay+0xa1d)[0x558b38766f32] > > postgres: startup recovering 000000010000000000000012(XactUndoReplay+0x77)[0x558b38769281] > > postgres: startup recovering 000000010000000000000012(smgr_redo+0x1af)[0x558b387aa7bd] > > > > This one is somewhat similar: > > > > TRAP: FailedAssertion("page_offset >= SizeOfUndoPageHeaderData", File: "undopage.c", Line: 287) > > postgres: undo worker for database 36893 (ExceptionalCondition+0xa1)[0x5559c90f1350] > > postgres: undo worker for database 36893 (UndoPageOverwrite+0xa6)[0x5559c8cc8ae3] > > postgres: undo worker for database 36893 (UpdateLastAppliedRecord+0xbe)[0x5559c8ccd008] > > postgres: undo worker for database 36893 (smgr_undo+0xa6)[0x5559c8d11989] > > Well, on repeated run of the test I could also hit the first one. I could fix > it and will post a new version of the patch (along with some other small > changes) this week.

Attached is the next version. Changes done:

* Removed the progress tracking and implemented undo discarding in a simpler way. Now, instead of maintaining the pointer to the last record applied, only a boolean field in the chunk header is set when ROLLBACK is done. This helps to determine whether the undo of a non-committed transaction can be discarded.

* Removed the "undo worker" that the previous version only used to apply the undo after crash recovery. The startup process does the work now.

* Implemented cleanup after crashed CREATE DATABASE and ALTER DATABASE ... SET TABLESPACE. BTW, I wonder if this change allows these commands to be executed in a transaction block. I think the reason to prohibit that is to minimize the window between creation of the files and transaction commit - if the server crashes in that window, the new database files survive but the catalog changes don't. But maybe there are other reasons. (I don't claim it's terribly useful to create a database in a transaction block though, because the client cannot connect to it w/o leaving the current transaction.)

* Reordered the diffs, i.e. moved the discarding in front of the actual features.

-- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
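To make the new discard rule concrete, here is a sketch of the decision it enables. The struct layout and function names are hypothetical (the real chunk header and retention rules are richer), and TransactionIdIsInProgress/TransactionIdDidCommit merely stand in for the real transaction status checks:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical view of the relevant undo log chunk header fields. */
typedef struct UndoChunkHeader
{
    TransactionId xid;      /* transaction that wrote the chunk */
    bool          applied;  /* set once ROLLBACK has executed the undo */
} UndoChunkHeader;

/* Stand-ins for the real transaction status checks. */
extern bool TransactionIdIsInProgress(TransactionId xid);
extern bool TransactionIdDidCommit(TransactionId xid);

/*
 * Undo of a committed transaction can be discarded (no rollback will
 * ever need it); undo of an aborted transaction only after the undo
 * actions actually ran, which is what the new boolean records.
 */
static bool
chunk_is_discardable(const UndoChunkHeader *hdr)
{
    if (TransactionIdIsInProgress(hdr->xid))
        return false;
    if (TransactionIdDidCommit(hdr->xid))
        return true;
    return hdr->applied;
}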
On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote: > Antonin Houska <ah@cybertec.at> wrote: > > Well, on repeated run of the test I could also hit the first one. I could fix > > it and will post a new version of the patch (along with some other small > > changes) this week. > > Attached is the next version. Changes done: Yikes, this patch is 23k lines, and most of it looks like added lines of code. Is this size expected? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Bruce Momjian <bruce@momjian.us> wrote: > On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote: > > Antonin Houska <ah@cybertec.at> wrote: > > > Well, on repeated run of the test I could also hit the first one. I could fix > > > it and will post a new version of the patch (along with some other small > > > changes) this week. > > > > Attached is the next version. Changes done: > > Yikes, this patch is 23k lines, and most of it looks like added lines of > code. Is this size expected? Yes, earlier versions of this patch, e.g. [1], were of comparable size. It's not really an "ordinary patch". [1] https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
From: Antonin Houska <ah@cybertec.at> > not really an "ordinary patch". > > [1] > https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhR > q-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com I'm a bit interested in zheap-related topics. I'm reading this discussion to see what I can do. (But this thread is too long... there are still 13,000 lines out of 45,000 lines.) What's the latest patch set to look at to achieve the undo infrastructure and its would-be first user, orphan file cleanup? As far as I've read, multiple people posted multiple patch sets, and I don't see how they are related. Regards Takayuki Tsunakawa
On Wed, Feb 3, 2021 at 2:45 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Antonin Houska <ah@cybertec.at> > > not really an "ordinary patch". > > > > [1] > > https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhR > > q-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com > > I'm a bit interested in zheap-related topics. I'm reading this discussion to see what I can do. (But this thread is toolong... there are still 13,000 lines out of 45,000 lines.) > > What's the latest patch set to look at to achieve the undo infrastructure and its would-be first user, orphan file cleanup? As far as I've read, multiple people posted multiple patch sets, and I don't see how they are related. > I feel it is good to start with the latest patch-set posted by Antonin [1]. [1] - https://www.postgresql.org/message-id/87363.1611941415%40antos -- With Regards, Amit Kapila.
tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > I'm crawling like a snail to read the patch set. Below are my first set of review comments, which are all minor. Thanks. > > (1) > + <indexterm><primary>tablespace</primary><secondary>temporary</secondary></indexterm> > > temporary -> undo Fixed. > > (2) > <term><varname>undo_tablespaces</varname> (<type>string</type>) > + > ... > + The value is a list of names of tablespaces. When there is more than > + one name in the list, <productname>PostgreSQL</productname> chooses an > + arbitrary one. If the name doesn't correspond to an existing > + tablespace, the next name is tried, and so on until all names have > + been tried. If no valid tablespace is specified, an error is raised. > + The validation of the name doesn't happen until the first attempt to > + write undo data. > > CREATE privilege needs to be mentioned like temp_tablespaces. Fixed. > > (3) > + The variable can only be changed before the first statement is > + executed in a transaction. > > Does it include any first statement that doesn't emit undo? Yes, it does. As soon as XID is assigned, the variable can no longer be set. > (4) > + <entry>One row for each undo log, showing current pointers, > + transactions and backends. > + See <xref linkend="pg-stat-undo-logs-view"/> for details. > > I think this can just be written like "showing usage information about the > undo log" just like other statistics views. That way, we can avoid having > to modify this sentence when we want to change the content of the view > later. Done. > > (5) > + <entry><structfield>discard</structfield></entry> > + <entry><type>text</type></entry> > + <entry>Location of the oldest data in this undo log.</entry> > > The name does not match the description intuitively. Maybe "oldest"? Discarding of the undo log is an important term used in the code. > BTW, how does this information help users? (I don't mean to say we > shouldn't output information that users cannot interpret; other DBMSs output > such information probably for technical support staff.) It's for DBA rather than a user. The value indicates whether discarding is working well or if it's blocked for some reason. If the latter happens, the undo log can pile up and consume too much disk space. > (6) > + <entry><structfield>insert</structfield></entry> > + <entry><type>text</type></entry> > + <entry>Location where the next data will be written in this undo > + log.</entry> > ... > + <entry><structfield>end</structfield></entry> > + <entry><type>text</type></entry> > + <entry>Location one byte past the end of the allocated physical storage > + backing this undo log.</entry> > > Again, how can these be used? If they are useful to calculate the amount of used space, shouldn't they be bigint? bigint is signed, so it cannot express 64-bit number. I think this deserves a new SQL type for the undo pointer, like pg_lsn for XLOG. > > (7) > @@ -65,7 +65,7 @@ > <structfield>smgrid</structfield> <type>integer</type> > </para> > <para> > - Block storage manager ID. 0 for regular relation data.</entry> > + Block storage manager ID. 0 for regular relation data. > </para></entry> > </row> > > I guess this change is mistakenly included? Fixed. > > (8) > diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml > @@ -216,6 +216,7 @@ Complete list of usable sgml source files in this directory. 
> <!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml"> > <!ENTITY pgupgrade SYSTEM "pgupgrade.sgml"> > <!ENTITY pgwaldump SYSTEM "pg_waldump.sgml"> > +<!ENTITY pgundodump SYSTEM "pg_undo_dump.sgml"> > <!ENTITY postgres SYSTEM "postgres-ref.sgml"> > > @@ -286,6 +286,7 @@ > &pgtesttiming; > &pgupgrade; > &pgwaldump; > + &pgundodump; > &postgres; > > It looks like this list needs to be ordered alphabetically. So, the new line is better placed between pg_test_timing and pg_upgrade? Fixed. > > (9) > I don't want to be disliked because of being picky, but maybe pg_undo_dump should be pg_undodump. Existing commands don't use '_' to separate words after pg_, except for pg_test_fsync and pg_test_timing. Done. > > (10) > + This utility can only be run by the user who installed the server, because > + it requires read-only access to the data directory. > > I guess you copied this from pg_waldump or pg_resetwal, but I'm afraid this should be as follows, which is an excerpt from pg_controldata's page. (The pages for pg_waldump and pg_resetwal should be fixed in a separate thread.) > > This utility can only be run by the user who initialized the cluster because it requires read access to the data directory. Fixed. > > (11) > + The <option>-m</option> option cannot be used if > + either <option>-c</option> or <option>-l</option> is used. > > -l -> -r Fixed. > Or, why don't we align option characters with pg_waldump? pg_waldump uses -r to filter by rmgr. pg_undodump can output record contents by default like pg_waldump. Considering pg_dump and pg_dumpall also output all data by default, that seems how PostgreSQL commands behave. I've made the -r value (print out the undo records) the default, will consider using -r for filtering by rmgr. > > (12) > + <arg choice="opt"><option>startseg</option><arg choice="opt"><option>endseg</option></arg></arg> > > startseg and endseg are not described. Fixed. (Of course, this is evidence that I used pg_waldump as a skeleton :-)) > > (13) > +Undo files backing undo logs in the default tablespace are stored under > ... > +Undo log files contain standard page headers as described in the next section, > > Fluctuations in expressions can be seen: undo file and undo log file. I think the following "undo data file" fits best. What do you think? > > + <entry><literal>UndoFileRead</literal></entry> > + <entry>Waiting for a read from an undo data file.</entry> > "Undo files backing undo logs ..." My feeling is that "data files" would be distracting here. I think the point of this sentence is simply that something resides in a file. "Undo log files contain standard page headers as described in the next section" I'm not opposed to "data files" here as there are also other kinds of files written by undo (at least the metadata written during checkpoint). Changed. > (14) > +Undo data exists in a 64-bit address space divided into 2^34 undo > +logs, each with a theoretical capacity of 1TB. The first time a > +backend writes undo, it attaches to an existing undo log whose > +capacity is not yet exhausted and which is not currently being used by > +any other backend; or if no suitable undo log already exists, it > +creates a new one. To avoid wasting space, each undo log is further > +divided into 1MB segment files, so that segments which are no longer > +needed can be removed (possibly recycling the underlying file by > +renaming it) and segments which are not yet needed do not need to be > +physically created on disk.
An undo segment file has a name like > +<filename>000004.0001200000</filename>, where > +<filename>000004</filename> is the undo log number and > +<filename>0001200000</filename> is the offset of the first byte > +held in the file. > > The number of undo logs is not 2^34 but 2^24 (2^(64-40), since each log covers 2^40 bytes = 1 TB). Fixed. > (15) src/backend/access/undo/README > \ No newline at end of file > > Let's add a newline. > Fixed. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
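As a side note on items (6) and (14), the pointer arithmetic is easy to sketch. The macro names below are invented, but the layout (24-bit log number, 40-bit offset, 1MB segments, hex file names) follows the quoted documentation; the example also shows why a signed bigint cannot carry the full unsigned 64-bit range:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t UndoRecPtr;    /* unsigned: the top bit is meaningful */

#define UNDO_OFFSET_BITS 40     /* 2^40 bytes = 1 TB per undo log */
#define UNDO_LOG_NUMBER(p)  ((p) >> UNDO_OFFSET_BITS)   /* 24 bits remain */
#define UNDO_LOG_OFFSET(p)  ((p) & ((UINT64_C(1) << UNDO_OFFSET_BITS) - 1))
#define UNDO_SEGMENT_SIZE   (UINT64_C(1) << 20)         /* 1MB segments */

int
main(void)
{
    /* Reconstructs the documented example file name 000004.0001200000. */
    UndoRecPtr p = (UINT64_C(4) << UNDO_OFFSET_BITS) | UINT64_C(0x1200000);

    printf("%06llX.%010llX\n",
           (unsigned long long) UNDO_LOG_NUMBER(p),
           (unsigned long long) (UNDO_LOG_OFFSET(p)
                                 - UNDO_LOG_OFFSET(p) % UNDO_SEGMENT_SIZE));
    return 0;
}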
Antonin Houska <ah@cybertec.at> wrote: > tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > > I'm crawling like a snail to read the patch set. Below are my first set of review comments, which are all minor. > > Thanks. I've added the patch to the upcoming CF [1], so it possibly gets more review and makes some progress. I've marked myself as the author so it's clear who will try to respond to the reviews. It's clear that other people did much more work on the feature than I did so far - they are welcome to add themselves to the author list. [1] https://commitfest.postgresql.org/33/3228/ -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Wed, Jun 30, 2021 at 11:10 PM Antonin Houska <ah@cybertec.at> wrote: > > Antonin Houska <ah@cybertec.at> wrote: > > > tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > > > > I'm crawling like a snail to read the patch set. Below are my first set of review comments, which are all minor. > > > > Thanks. > > I've added the patch to the upcoming CF [1], so it possibly gets more review > and makes some progress. I've marked myself as the author so it's clear who > will try to respond to the reviews. It's clear that other people did much more > work on the feature than I did so far - they are welcome to add themselves to > the author list. > > [1] https://commitfest.postgresql.org/33/3228/ > The patch does not apply on Head anymore, could you rebase and post a patch. I'm changing the status to "Waiting for Author". Regards, Vignesh
> On Wed, Jun 30, 2021 at 07:41:16PM +0200, Antonin Houska wrote: > Antonin Houska <ah@cybertec.at> wrote: > > > tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > > > > I'm crawling like a snail to read the patch set. Below are my first set of review comments, which are all minor. > > > > Thanks. > > I've added the patch to the upcoming CF [1], so it possibly gets more review > and makes some progress. I've marked myself as the author so it's clear who > will try to respond to the reviews. It's clear that other people did much more > work on the feature than I did so far - they are welcome to add themselves to > the author list. > > [1] https://commitfest.postgresql.org/33/3228/

Hi, I'm crawling through the patch set like an even slower creature than a snail, sorry for the long absence. I'm reading the latest version posted here and, although it's hard to give any high-level design comments on it yet, I thought it could be useful to post a few findings and questions in the meantime.

* One question about the functionality: > On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote: > Attached is the next version. Changes done: > > * Removed the progress tracking and implemented undo discarding in a simpler > way. Now, instead of maintaining the pointer to the last record applied, > only a boolean field in the chunk header is set when ROLLBACK is > done. This helps to determine whether the undo of a non-committed > transaction can be discarded. Just to clarify, the whole feature was removed for the sake of simplicity, right?

* By throwing `make installcheck` at the patchset, from time to time I'm getting an error on restart:

TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)", File: "undorecordset.c", Line: 1098, PID: 6055)

From what I see XLogReadBufferForRedoExtended finds an invalid buffer and returns BLK_NOTFOUND. The commentary says: If the block was not found, then it must be discarded later in the WAL. and continues with skip = false, but fails to get a page from an invalid buffer a few lines later. It seems that the skip flag is supposed to be used in this situation; should it also guard the BufferGetPage part?

* Another interesting issue I've found happened inside DropUndoLogsInTablespace, when the process got SIGTERM. It seems processing is stuck on:

slist_foreach_modify(iter, &UndoLogShared->shared_free_lists[i])

iterating on the same element over and over. My guess is clear_private_free_lists was called and caused such an unexpected outcome; should the access to shared_free_lists be somehow protected?

* I also wonder about the segments in base/undo; the commentary in pg_undodump says: Since the UNDO log is a continuous stream of changes, any hole terminates processing. It looks like it's relatively easy to end up with such holes, and pg_undodump ends up with a message ("found" is added by me and contains a found offset which does not match the expected value):

pg_undodump: error: segment 0000000000 missing in log 2, found 0000100000

This does not seem to cause any real issues, but it's not clear to me whether such a situation with gaps is fine or a problem.

Other than that, one more time thank you for this tremendous work; I find the topic to be of extreme importance.
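For readers following along, the replay pattern in question looks roughly like this. XLogReadBufferForRedoExtended, BLK_NOTFOUND and BufferGetPage are the real core APIs, but the function itself is a simplified sketch of the guard Dmitry suggests (buffer locking and the other XLogRedoAction cases are omitted):

#include "postgres.h"
#include "access/xlogutils.h"
#include "storage/bufmgr.h"

/*
 * Returns NULL when the block no longer exists (it is discarded later
 * in the WAL), so the caller skips the redo work instead of calling
 * BufferGetPage() on an invalid buffer -- the crash seen above.
 */
static Page
undo_redo_get_page(XLogReaderState *record, uint8 block_id, Buffer *buffer)
{
    if (XLogReadBufferForRedoExtended(record, block_id, RBM_NORMAL,
                                      false, buffer) == BLK_NOTFOUND)
        return NULL;

    return BufferGetPage(*buffer);
}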
Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Hi, > > I'm crawling through the patch set like even slower creature than a snail, > sorry for long absence. I'm reading the latest version posted here and, > although it's hard to give any high level design comments on it yet, I thought > it could be useful to post a few findings and questions in the meantime. > > * One question about the functionality: > > > On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote: > > Attached is the next version. Changes done: > > > > * Removed the progress tracking and implemented undo discarding in a simpler > > way. Now, instead of maintaining the pointer to the last record applied, > > only a boolean field in the chunk header is set when ROLLBACK is > > done. This helps to determine whether the undo of a non-committed > > transaction can be discarded. > > Just to clarify, the whole feature was removed for the sake of > simplicity, right? Amit Kapila told me that zheap can recognize that a particular undo record was already applied, and I could eventually find the corresponding code. So I removed the tracking from the undo log layer, although I still think it'd fit there. However, I then found out that at least a boolean flag in the chunk header is needed to handle the discarding, so I implemented it. > * By throwing at the patchset `make installcheck` I'm getting from time to time > and error on the restart: > > TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)", > File: "undorecordset.c", Line: 1098, PID: 6055) > > From what I see XLogReadBufferForRedoExtended finds an invalid buffer and > returns BLK_NOTFOUND. The commentary says: > > If the block was not found, then it must be discarded later in > the WAL. > > and continues with skip = false, but fails to get a page from an invalid > buffer few lines later. It seems that the skip flag is supposed to be used > this situation, should it also guard the BufferGetPage part? I could see this sometimes too, but can't reproduce it now. It's also not clear to me how XLogReadBufferForRedoExtended() can return BLK_NOTFOUND, as the whole undo log segment is created at once, even if only part of it is needed - see allocate_empty_undo_segment(). > * Another interesting issue I've found happened inside > DropUndoLogsInTablespace, when the process got SIGTERM. It seems processing > stuck on: > > slist_foreach_modify(iter, &UndoLogShared->shared_free_lists[i]) > > iterating on the same element over and over. My guess is > clear_private_free_lists was called and caused such unexpected outcome, > should the access to shared_free_lists be somehow protected? Well, I could get this error on a repeated run of the test too. Thanks for the report. The list is protected by UndoLogLock. I found out that the problem was that free_undo_log_slot() "freed" the slot but didn't remove it from the shared freelist. Then some other backend thought it was free, picked it from the shared slot array, used it, and pushed it back onto the shared freelist. If the same item is already at the list head, slist_push_head() makes the initial node point to itself. I fixed it by removing the slot from the freelist before calling free_undo_log_slot() from CheckPointUndoLogs(). (The other call site DropUndoLogsInTablespace() was o.k.) > * I also wonder about the segments in base/undo, the commentary in pg_undodump > says: > > Since the UNDO log is a continuous stream of changes, any hole > terminates processing.
> > It looks like it's relatively easy to end up with such holes, and pg_undodump > ends up with a message (found is added by me and contains a found offset > which do not match the expected value): > > pg_undodump: error: segment 0000000000 missing in log 2, found 0000100000 > > This seems to be not causing any real issues, but it's not clear for me if > such situation with gaps is fine or is it a problem? OK, I missed the point that the initial segment (or the initial sequence of segments) of the log can be missing due to discarding and segment recycling. I've fixed that, but if a segment is missing in the middle, it's still considered an error. > Other than that one more time thank you for this tremendous work, I find that > the topic is of extreme importance. I'm just trying to continue the tremendous work of others :-) Thanks for your review! -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
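The self-linking failure Antonin describes can be illustrated with the core slist API alone (lib/ilist.h is the real header; the slot struct is hypothetical). This is a demonstration of the hazard, not code from the patch:

#include "postgres.h"
#include "lib/ilist.h"

typedef struct UndoLogSlot      /* hypothetical, for illustration */
{
    slist_node freelist_node;
} UndoLogSlot;

static void
demonstrate_freelist_self_link(void)
{
    slist_head freelist;
    UndoLogSlot slot;

    slist_init(&freelist);
    slist_push_head(&freelist, &slot.freelist_node);

    /*
     * Pushing the node again while it is already the list head makes it
     * point to itself: slist_push_head() sets node->next to the current
     * head, i.e. to the node itself.  Any later slist_foreach_modify()
     * then visits this element forever, which is the hang observed in
     * DropUndoLogsInTablespace().
     */
    slist_push_head(&freelist, &slot.freelist_node);
}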
The cfbot complained that the patch series no longer applies, so I've rebased it and also tried to make sure that the other flags become green.

One particular problem was that pg_upgrade complained that "live undo data" remains in the old cluster. I found out that the temporary undo log causes the problem, so I've adjusted the query in check_for_undo_data() accordingly until the problem gets fixed properly.

The problem of the temporary undo log is that it's loaded into local buffers and that a backend can exit w/o flushing local buffers to disk, and thus we are not guaranteed to find enough information when trying to discard the undo log the backend wrote. I'm thinking about the following solutions:

1. Let the backend manage temporary undo log on its own (even the slot metadata would stay outside the shared memory, and in particular the insertion pointer could start from 1 for each session) and remove the segment files at the same moment the temporary relations are removed.

However, by moving the temporary undo slots away from the shared memory, computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would be affected. It might seem that a transaction which only writes undo log for temporary relations does not need to affect oldestFullXidHavingUndo, but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo prevents transactions from being truncated from the CLOG too early, I wonder if the following is possible (this scenario is only applicable to the zheap storage engine [1], which is not included in this patch, but should already be considered):

A transaction creates a temporary table, does some (many) changes and then gets rolled back. The undo records are being applied and it takes some time. Since the XID of the transaction did not affect oldestFullXidHavingUndo, the XID can disappear from the CLOG due to truncation. However zundo.c in [1] indicates that the transaction status *is* checked during undo execution, so we might have a problem. Or am I missing something?

UndoDiscard() in zheap seems to ignore temporary undo:

/* We can't process temporary undo logs. */
if (log->meta.persistence == UNDO_TEMP)
    continue;

2. Do not load the temporary undo into local buffers. If it's always in the shared buffers, we should never see incomplete data when trying to discard undo. In this case, persistence levels UNDOPERSISTENCE_UNLOGGED and UNDOPERSISTENCE_TEMP could be merged into a single level.

3. Implement the discarding in another way, but I don't have a new idea right now. Suggestions are welcome.

[1] https://github.com/EnterpriseDB/zheap/tree/master

-- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
Antonin Houska <ah@cybertec.at> wrote: > Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > * By throwing at the patchset `make installcheck` I'm getting from time to time > > and error on the restart: > > > > TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)", > > File: "undorecordset.c", Line: 1098, PID: 6055) > > > > From what I see XLogReadBufferForRedoExtended finds an invalid buffer and > > returns BLK_NOTFOUND. The commentary says: > > > > If the block was not found, then it must be discarded later in > > the WAL. > > > > and continues with skip = false, but fails to get a page from an invalid > > buffer few lines later. It seems that the skip flag is supposed to be used > > this situation, should it also guard the BufferGetPage part? > > I could see this sometime too, but can't reproduce it now. It's also not clear > to me how XLogReadBufferForRedoExtended() can return BLK_NOTFOUND, as the > whole undo log segment is created at once, even if only part of it is needed - > see allocate_empty_undo_segment(). I could eventually reproduce the problem. The root cause was that WAL records were created even for temporary / unlogged undo, and thus only empty pages could be found during replay. I've fixed that and also set up a regular test for the BLK_NOTFOUND value. That required a few more fixes to UndoReplay(). Attached is a new version. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
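The shape of that fix can be summarized as follows. This is a condensed sketch, not the patch's actual insert path; the UNDOPERSISTENCE_PERMANENT spelling is an assumption modeled on the level names mentioned earlier in the thread, and log_undo_insert() is a stand-in for the real WAL-logging step:

#include <stdbool.h>

typedef enum UndoPersistenceLevel
{
    UNDOPERSISTENCE_PERMANENT,  /* assumed name */
    UNDOPERSISTENCE_UNLOGGED,
    UNDOPERSISTENCE_TEMP
} UndoPersistenceLevel;

extern void log_undo_insert(void);  /* stand-in for the WAL-logging step */

/*
 * Only permanent undo may be WAL-logged.  Temporary and unlogged undo
 * does not survive a crash in a usable form, so replaying WAL for it
 * can only find empty pages -- the BLK_NOTFOUND symptom discussed
 * above.
 */
static void
undo_insert_finish(UndoPersistenceLevel level)
{
    if (level == UNDOPERSISTENCE_PERMANENT)
        log_undo_insert();
}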
On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska <ah@cybertec.at> wrote: > > The cfbot complained that the patch series no longer applies, so I've rebased > it and also tried to make sure that the other flags become green. > > One particular problem was that pg_upgrade complained that "live undo data" > remains in the old cluster. I found out that the temporary undo log causes the > problem, so I've adjusted the query in check_for_undo_data() accordingly until > the problem gets fixed properly. > > The problem of the temporary undo log is that it's loaded into local buffers > and that backend can exit w/o flushing local buffers to disk, and thus we are > not guaranteed to find enough information when trying to discard the undo log > the backend wrote. I'm thinking about the following solutions: > > 1. Let the backend manage temporary undo log on its own (even the slot > metadata would stay outside the shared memory, and in particular the > insertion pointer could start from 1 for each session) and remove the > segment files at the same moment the temporary relations are removed. > > However, by moving the temporary undo slots away from the shared memory, > computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > be affected. It might seem that a transaction which only writes undo log > for temporary relations does not need to affect oldestFullXidHavingUndo, > but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > prevents transactions to be truncated from the CLOG too early, I wonder if > the following is possible (This scenario is only applicable to the zheap > storage engine [1], which is not included in this patch, but should already > be considered.): > > A transaction creates a temporary table, does some (many) changes and then > gets rolled back. The undo records are being applied and it takes some > time. Since XID of the transaction did not affect oldestFullXidHavingUndo, > the XID can disappear from the CLOG due to truncation. > By above do you mean to say that in zheap code, we don't consider XIDs that operate on temp table/undo for oldestFullXidHavingUndo? > However zundo.c in > [1] indicates that the transaction status *is* checked during undo > execution, so we might have a problem. > It would be easier to follow if you can tell which exact code are you referring here? -- With Regards, Amit Kapila.
> On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > Antonin Houska <ah@cybertec.at> wrote: > > > Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > * By throwing at the patchset `make installcheck` I'm getting from time to time > > > and error on the restart: > > > > > > TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)", > > > File: "undorecordset.c", Line: 1098, PID: 6055) > > > > > > From what I see XLogReadBufferForRedoExtended finds an invalid buffer and > > > returns BLK_NOTFOUND. The commentary says: > > > > > > If the block was not found, then it must be discarded later in > > > the WAL. > > > > > > and continues with skip = false, but fails to get a page from an invalid > > > buffer few lines later. It seems that the skip flag is supposed to be used > > > this situation, should it also guard the BufferGetPage part? > > > > I could see this sometime too, but can't reproduce it now. It's also not clear > > to me how XLogReadBufferForRedoExtended() can return BLK_NOTFOUND, as the > > whole undo log segment is created at once, even if only part of it is needed - > > see allocate_empty_undo_segment(). > > I could eventually reproduce the problem. The root cause was that WAL records > were created even for temporary / unlogged undo, and thus only empty pages > could be found during replay. I've fixed that and also setup regular test for > the BLK_NOTFOUND value. That required a few more fixes to UndoReplay(). > > Attached is a new version.

Yep, makes sense, thanks. I have a few more questions:

* The use case with orphaned files is working somewhat differently after the rebase on the latest master; do you observe it as well? The difference is that ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up an orphaned relation file immediately (only later, at a checkpoint) because of empty pendingUnlinks. I haven't investigated more yet, but it seems like after this commit:

commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
Author: Thomas Munro <tmunro@postgresql.org>
Date: Mon Aug 2 17:32:20 2021 +1200

Run checkpointer and bgwriter in crash recovery.

Start up the checkpointer and bgwriter during crash recovery (except in --single mode), as we do for replication. This wasn't done back in commit cdd46c76 out of caution. Now it seems like a better idea to make the environment as similar as possible in both cases. There may also be some performance advantages.

something has to be updated (pendingOps are empty right now, so no unlink request is remembered).

* What happened with the idea of abandoning the discard worker for the sake of simplicity? From what I see, limiting everything to foreground undo could reduce the core of the patch series to the first four patches (forgetting about tests and docs, but I guess it would be enough at least for the design review), which is already less overwhelming.
On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > * What happened with the idea of abandoning discard worker for the sake > of simplicity? From what I see limiting everything to foreground undo > could reduce the core of the patch series to the first four patches > (forgetting about test and docs, but I guess it would be enough at > least for the design review), which is already less overwhelming. > I think the discard worker would be required even if we decide to apply all the undo in the foreground. We need to forget/remove the undo of committed transactions as well which we can't remove immediately after the commit. -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > * What happened with the idea of abandoning discard worker for the sake > > of simplicity? From what I see limiting everything to foreground undo > > could reduce the core of the patch series to the first four patches > > (forgetting about test and docs, but I guess it would be enough at > > least for the design review), which is already less overwhelming. > > > > I think the discard worker would be required even if we decide to > apply all the undo in the foreground. We need to forget/remove the > undo of committed transactions as well which we can't remove > immediately after the commit. I think I proposed foreground discarding at some point, but you reminded me that the undo may still be needed for some time even after transaction commit. Thus the discard worker is indispensable. What we can do without, at least for the cleanup of the orphaned files, is the *undo worker*. In this patch series the cleanup is handled by the startup process. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska <ah@cybertec.at> wrote: > > > > The problem of the temporary undo log is that it's loaded into local buffers > > > and that backend can exit w/o flushing local buffers to disk, and thus we are > > > not guaranteed to find enough information when trying to discard the undo log > > > the backend wrote. I'm thinking about the following solutions: > > > > > > 1. Let the backend manage temporary undo log on its own (even the slot > > > metadata would stay outside the shared memory, and in particular the > > > insertion pointer could start from 1 for each session) and remove the > > > segment files at the same moment the temporary relations are removed. > > > > > > However, by moving the temporary undo slots away from the shared memory, > > > computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > > > be affected. It might seem that a transaction which only writes undo log > > > for temporary relations does not need to affect oldestFullXidHavingUndo, > > > but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > > > prevents transactions to be truncated from the CLOG too early, I wonder if > > > the following is possible (This scenario is only applicable to the zheap > > > storage engine [1], which is not included in this patch, but should already > > > be considered.): > > > > > > A transaction creates a temporary table, does some (many) changes and then > > > gets rolled back. The undo records are being applied and it takes some > > > time. Since XID of the transaction did not affect oldestFullXidHavingUndo, > > > the XID can disappear from the CLOG due to truncation. > > > > > > > By above do you mean to say that in zheap code, we don't consider XIDs > > that operate on temp table/undo for oldestFullXidHavingUndo? I was referring to the code

/* We can't process temporary undo logs. */
if (log->meta.persistence == UNDO_TEMP)
    continue;

in undodiscard.c:UndoDiscard(). > > > However zundo.c in > > [1] indicates that the transaction status *is* checked during undo > > execution, so we might have a problem. > > > > It would be easier to follow if you can tell which exact code are you > referring here? I meant the call of TransactionIdDidCommit() in zundo.c:zheap_exec_pending_rollback(). -- Antonin Houska Web: https://www.cybertec-postgresql.com
Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > Antonin Houska <ah@cybertec.at> wrote: > > > > > Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > > * By throwing at the patchset `make installcheck` I'm getting from time to time > > > > and error on the restart: > > > > > > > > TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)", > > > > File: "undorecordset.c", Line: 1098, PID: 6055) > > > > > > > > From what I see XLogReadBufferForRedoExtended finds an invalid buffer and > > > > returns BLK_NOTFOUND. The commentary says: > > > > > > > > If the block was not found, then it must be discarded later in > > > > the WAL. > > > > > > > > and continues with skip = false, but fails to get a page from an invalid > > > > buffer few lines later. It seems that the skip flag is supposed to be used > > > > this situation, should it also guard the BufferGetPage part? > > > > > > I could see this sometime too, but can't reproduce it now. It's also not clear > > > to me how XLogReadBufferForRedoExtended() can return BLK_NOTFOUND, as the > > > whole undo log segment is created at once, even if only part of it is needed - > > > see allocate_empty_undo_segment(). > > > > I could eventually reproduce the problem. The root cause was that WAL records > > were created even for temporary / unlogged undo, and thus only empty pages > > could be found during replay. I've fixed that and also setup regular test for > > the BLK_NOTFOUND value. That required a few more fixes to UndoReplay(). > > > > Attached is a new version. > > Yep, makes sense, thanks. I have few more questions: > > * The use case with orphaned files is working somewhat differently after > the rebase on the latest master, do you observe it as well? The > difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up > an orphaned relation file immediately (only later on checkpoint) > because of empty pendingUnlinks. I haven't investigated more yet, but > seems like after this commit: > > commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 > Author: Thomas Munro <tmunro@postgresql.org> > Date: Mon Aug 2 17:32:20 2021 +1200 > > Run checkpointer and bgwriter in crash recovery. > > Start up the checkpointer and bgwriter during crash recovery (except in > --single mode), as we do for replication. This wasn't done back in > commit cdd46c76 out of caution. Now it seems like a better idea to make > the environment as similar as possible in both cases. There may also be > some performance advantages. > > something has to be updated (pendingOps are empty right now, so no > unlink request is remembered). I haven't been debugging that part recently, but yes, this commit is relevant, thanks for pointing that out! Attached is a patch that should fix it. I'll include it in the next version of the patch series, unless you tell me that something is still wrong. -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/access/undo/undorecordset.c b/src/backend/access/undo/undorecordset.c
index 59eba7dfb6..9d05824141 100644
--- a/src/backend/access/undo/undorecordset.c
+++ b/src/backend/access/undo/undorecordset.c
@@ -2622,14 +2622,6 @@ ApplyPendingUndo(void)
 		}
 	}
 
-	/*
-	 * Some undo actions may unlink files. Since the checkpointer is not
-	 * guaranteed to be up, it seems simpler to process the undo request
-	 * ourselves in the way the checkpointer would do.
-	 */
-	SyncPreCheckpoint();
-	SyncPostCheckpoint();
-
 	/* Cleanup. */
 	chunktable_destroy(sets);
 }
On Tue, 21 Sep 2021 09:00 Antonin Houska, <ah@cybertec.at> wrote:
Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Yep, makes sense, thanks. I have few more questions:
>
> * The use case with orphaned files is working somewhat differently after
> the rebase on the latest master, do you observe it as well? The
> difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up
> an orphaned relation file immediately (only later on checkpoint)
> because of empty pendingUnlinks. I haven't investigated more yet, but
> seems like after this commit:
>
> commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
> Author: Thomas Munro <tmunro@postgresql.org>
> Date: Mon Aug 2 17:32:20 2021 +1200
>
> Run checkpointer and bgwriter in crash recovery.
>
> Start up the checkpointer and bgwriter during crash recovery (except in
> --single mode), as we do for replication. This wasn't done back in
> commit cdd46c76 out of caution. Now it seems like a better idea to make
> the environment as similar as possible in both cases. There may also be
> some performance advantages.
>
> something has to be updated (pendingOps are empty right now, so no
> unlink request is remembered).
I haven't been debugging that part recently, but yes, this commit is relevant,
thanks for pointing that out! Attached is a patch that should fix it. I'll
include it in the next version of the patch series, unless you tell me that
something is still wrong.
Sure, but I can take a look only in a couple of days.
On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > > > * What happened with the idea of abandoning discard worker for the sake > > > of simplicity? From what I see limiting everything to foreground undo > > > could reduce the core of the patch series to the first four patches > > > (forgetting about test and docs, but I guess it would be enough at > > > least for the design review), which is already less overwhelming. > > > > > > > I think the discard worker would be required even if we decide to > > apply all the undo in the foreground. We need to forget/remove the > > undo of committed transactions as well which we can't remove > > immediately after the commit. > > I think I proposed foreground discarding at some point, but you reminded me > that the undo may still be needed for some time even after transaction > commit. Thus the discard worker is indispensable. Right. > What we can miss, at least for the cleanup of the orphaned files, is the *undo > worker*. In this patch series the cleanup is handled by the startup process. > Okay, I think various people at different points in time have suggested that idea. I think one thing we might need to consider is what to do in case of a FATAL error. In case of a FATAL error, it won't be advisable to execute undo immediately, so would we upgrade the error to PANIC in such cases? I vaguely remember that for cleanup of orphaned files, which should happen rarely, someone suggested upgrading the error to PANIC in such a case, but I don't remember the exact details. -- With Regards, Amit Kapila.
On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska <ah@cybertec.at> wrote: > > > > The problem of the temporary undo log is that it's loaded into local buffers > > > and that backend can exit w/o flushing local buffers to disk, and thus we are > > > not guaranteed to find enough information when trying to discard the undo log > > > the backend wrote. I'm thinking about the following solutions: > > > > > > 1. Let the backend manage temporary undo log on its own (even the slot > > > metadata would stay outside the shared memory, and in particular the > > > insertion pointer could start from 1 for each session) and remove the > > > segment files at the same moment the temporary relations are removed. > > > > > > However, by moving the temporary undo slots away from the shared memory, > > > computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > > > be affected. It might seem that a transaction which only writes undo log > > > for temporary relations does not need to affect oldestFullXidHavingUndo, > > > but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > > > prevents transactions to be truncated from the CLOG too early, I wonder if > > > the following is possible (This scenario is only applicable to the zheap > > > storage engine [1], which is not included in this patch, but should already > > > be considered.): > > > > > > A transaction creates a temporary table, does some (many) changes and then > > > gets rolled back. The undo records are being applied and it takes some > > > time. Since XID of the transaction did not affect oldestFullXidHavingUndo, > > > the XID can disappear from the CLOG due to truncation. > > > > > > > By above do you mean to say that in zheap code, we don't consider XIDs > > that operate on temp table/undo for oldestFullXidHavingUndo? > > I was referring to the code > > /* We can't process temporary undo logs. */ > if (log->meta.persistence == UNDO_TEMP) > continue; > > in undodiscard.c:UndoDiscard(). > Here, I think it will just skip undo of temporary undo logs and oldestFullXidHavingUndo should be advanced after skipping it. > > > > > However zundo.c in > > > [1] indicates that the transaction status *is* checked during undo > > > execution, so we might have a problem. > > > > > > > It would be easier to follow if you can tell which exact code are you > > referring here? > > In meant the call of TransactionIdDidCommit() in > zundo.c:zheap_exec_pending_rollback(). > IIRC, this should be called for temp tables after they have exited as this is only to apply the pending undo actions if any, and in case of temporary undo after session exit, we shouldn't need it. I am not able to understand what exact problem you are facing for temp tables after the session exit. Can you please explain it a bit more? -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > > > > > * What happened with the idea of abandoning discard worker for the sake > > > > of simplicity? From what I see limiting everything to foreground undo > > > > could reduce the core of the patch series to the first four patches > > > > (forgetting about test and docs, but I guess it would be enough at > > > > least for the design review), which is already less overwhelming. > > What we can miss, at least for the cleanup of the orphaned files, is the *undo > > worker*. In this patch series the cleanup is handled by the startup process. > > > > Okay, I think various people at different point of times has suggested > that idea. I think one thing we might need to consider is what to do > in case of a FATAL error? In case of FATAL error, it won't be > advisable to execute undo immediately, so would we upgrade the error > to PANIC in such cases. I remember vaguely that for clean up of > orphaned files that can happen rarely and someone has suggested > upgrading the error to PANIC in such a case but I don't remember the > exact details. Do you mean FATAL error during normal operation? As far as I understand, even zheap does not rely on immediate UNDO execution (otherwise it'd never introduce the undo worker), so FATAL only means that the undo needs to be applied later so it can be discarded. As for the orphaned files cleanup feature with no undo worker, we might need PANIC to ensure that the undo is applied during restart and that it can be discarded, otherwise the unapplied undo log would stay there until the next (regular) restart and it would block discarding. However upgrading FATAL to PANIC just because the current transaction created a table seems quite rude. So the undo worker might be needed even for this patch? Or do you mean FATAL error when executing the UNDO? -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Fri, Sep 24, 2021 at 4:44 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska <ah@cybertec.at> wrote: > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > > > > > > > * What happened with the idea of abandoning discard worker for the sake > > > > > of simplicity? From what I see limiting everything to foreground undo > > > > > could reduce the core of the patch series to the first four patches > > > > > (forgetting about test and docs, but I guess it would be enough at > > > > > least for the design review), which is already less overwhelming. > > > > What we can miss, at least for the cleanup of the orphaned files, is the *undo > > > worker*. In this patch series the cleanup is handled by the startup process. > > > > > > > Okay, I think various people at different point of times has suggested > > that idea. I think one thing we might need to consider is what to do > > in case of a FATAL error? In case of FATAL error, it won't be > > advisable to execute undo immediately, so would we upgrade the error > > to PANIC in such cases. I remember vaguely that for clean up of > > orphaned files that can happen rarely and someone has suggested > > upgrading the error to PANIC in such a case but I don't remember the > > exact details. > > Do you mean FATAL error during normal operation? > Yes. > As far as I understand, even > zheap does not rely on immediate UNDO execution (otherwise it'd never > introduce the undo worker), so FATAL only means that the undo needs to be > applied later so it can be discarded. > Yeah, zheap either applies undo later via background worker or next time before dml operation if there is a need. > As for the orphaned files cleanup feature with no undo worker, we might need > PANIC to ensure that the undo is applied during restart and that it can be > discarded, otherwise the unapplied undo log would stay there until the next > (regular) restart and it would block discarding. However upgrading FATAL to > PANIC just because the current transaction created a table seems quite > rude. > True, I guess but we can once see in what all scenarios it can generate FATAL during that operation. > So the undo worker might be needed even for this patch? > I think we can keep undo worker as a separate patch and for base patch keep the idea of promoting FATAL to PANIC. This will at the very least make the review easier. > Or do you mean FATAL error when executing the UNDO? > No. -- With Regards, Amit Kapila.
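To make the proposal concrete, the promotion Amit describes could look roughly like this. Everything here is hypothetical: the predicate name is made up, and exactly where such a check would live in the error-handling path is the open design question:

#include "postgres.h"

extern bool CurrentTransactionHasUndo(void);    /* made-up predicate */

/*
 * Sketch of the idea: a FATAL exit would leave this transaction's undo
 * unapplied until the next restart, blocking discard, while PANIC
 * forces crash recovery, which applies it.  Whether this trade-off is
 * acceptable is what the thread is debating.
 */
static int
adjust_elevel_for_undo(int elevel)
{
    if (elevel == FATAL && CurrentTransactionHasUndo())
        elevel = PANIC;
    return elevel;
}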
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Sep 24, 2021 at 4:44 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > > > > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > > > > > > > > > * What happened with the idea of abandoning discard worker for the sake > > > > > > of simplicity? From what I see limiting everything to foreground undo > > > > > > could reduce the core of the patch series to the first four patches > > > > > > (forgetting about test and docs, but I guess it would be enough at > > > > > > least for the design review), which is already less overwhelming. > > > > > > What we can miss, at least for the cleanup of the orphaned files, is the *undo > > > > worker*. In this patch series the cleanup is handled by the startup process. > > > > > > > > > > Okay, I think various people at different point of times has suggested > > > that idea. I think one thing we might need to consider is what to do > > > in case of a FATAL error? In case of FATAL error, it won't be > > > advisable to execute undo immediately, so would we upgrade the error > > > to PANIC in such cases. I remember vaguely that for clean up of > > > orphaned files that can happen rarely and someone has suggested > > > upgrading the error to PANIC in such a case but I don't remember the > > > exact details. > > > > Do you mean FATAL error during normal operation? > > > > Yes. > > > As far as I understand, even > > zheap does not rely on immediate UNDO execution (otherwise it'd never > > introduce the undo worker), so FATAL only means that the undo needs to be > > applied later so it can be discarded. > > > > Yeah, zheap either applies undo later via background worker or next > time before dml operation if there is a need. > > > As for the orphaned files cleanup feature with no undo worker, we might need > > PANIC to ensure that the undo is applied during restart and that it can be > > discarded, otherwise the unapplied undo log would stay there until the next > > (regular) restart and it would block discarding. However upgrading FATAL to > > PANIC just because the current transaction created a table seems quite > > rude. > > > > True, I guess but we can once see in what all scenarios it can > generate FATAL during that operation. By "that operation" you mean "CREATE TABLE"? It's not about FATAL during CREATE TABLE, rather it's about FATAL anytime during a transaction. Whichever operation caused the FATAL error, we'd need to upgrade it to PANIC as long as the transaction has some undo. Although the postgres core probably does not raise FATAL errors too often (OOM conditions seem to be the typical cause), I'm still not enthusiastic about idea that the undo feature turns such errors into PANIC. I wonder what the reason to avoid undoing transaction on FATAL is. If it's about possibly long duration of the undo execution, deletion of orphaned files (relations or the whole databases) via undo shouldn't make things worse because currently FATAL also triggers this sort of cleanup immediately, it's just implemented in different ways. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Mon, Sep 27, 2021 at 7:43 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > Although the postgres core probably does not raise FATAL errors too often (OOM > conditions seem to be the typical cause), I'm still not enthusiastic about > the idea that the undo feature turns such errors into PANIC. > > I wonder what the reason is to avoid undoing the transaction on FATAL. If it's > about the possibly long duration of undo execution, deletion of orphaned files > (relations or whole databases) via undo shouldn't make things worse, because > currently FATAL also triggers this sort of cleanup immediately; it's just > implemented in a different way. > During FATAL processing, we don't want to perform further operations that could make the situation worse. Say we are already short of memory (OOM); undo execution might try to allocate more memory, which won't do any good. Depending on the implementation, undo execution might sometimes need to perform WAL writes or data writes, which we don't want to do during FATAL error processing. -- With Regards, Amit Kapila.
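To make the alternative being discussed concrete, here is a minimal sketch of the FATAL-to-PANIC promotion; this is not code from the patch, and CurrentTransactionHasUndo() is a hypothetical helper:

extern bool CurrentTransactionHasUndo(void);	/* hypothetical */

/*
 * Sketch only: escalate FATAL to PANIC when the dying transaction still
 * has unapplied undo, so that the undo is executed by crash recovery
 * instead of by the dying backend. FATAL and PANIC are the elevels from
 * elog.h.
 */
static int
undo_adjust_elevel(int elevel)
{
	if (elevel == FATAL && CurrentTransactionHasUndo())
		return PANIC;			/* restart; recovery applies the undo */
	return elevel;
}

The cost of this approach is exactly what Antonin points out above: any FATAL error in a transaction that wrote undo, however unrelated to the undo itself, would force a cluster-wide restart.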
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > The problem of the temporary undo log is that it's loaded into local buffers > > > > and that a backend can exit w/o flushing local buffers to disk, and thus we are > > > > not guaranteed to find enough information when trying to discard the undo log > > > > the backend wrote. I'm thinking about the following solutions: > > > > > > > > 1. Let the backend manage temporary undo log on its own (even the slot > > > > metadata would stay outside the shared memory, and in particular the > > > > insertion pointer could start from 1 for each session) and remove the > > > > segment files at the same moment the temporary relations are removed. > > > > > > > > However, by moving the temporary undo slots away from the shared memory, > > > > computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > > > > be affected. It might seem that a transaction which only writes undo log > > > > for temporary relations does not need to affect oldestFullXidHavingUndo, > > > > but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > > > > prevents transactions from being truncated from the CLOG too early, I wonder if > > > > the following is possible (This scenario is only applicable to the zheap > > > > storage engine [1], which is not included in this patch, but should already > > > > be considered.): > > > > > > > > A transaction creates a temporary table, does some (many) changes and then > > > > gets rolled back. The undo records are being applied and it takes some > > > > time. Since XID of the transaction did not affect oldestFullXidHavingUndo, > > > > the XID can disappear from the CLOG due to truncation. > > > > > > > > > > By above do you mean to say that in zheap code, we don't consider XIDs > > > that operate on temp table/undo for oldestFullXidHavingUndo? > > > > I was referring to the code > > > > /* We can't process temporary undo logs. */ > > if (log->meta.persistence == UNDO_TEMP) > > continue; > > > > in undodiscard.c:UndoDiscard(). > > > > Here, I think it will just skip undo of temporary undo logs and > oldestFullXidHavingUndo should be advanced after skipping it. Right, it'll be advanced, but the transaction XID (if the transaction wrote only to temporary relations) might still be needed. > > > > > > > However zundo.c in > > > > [1] indicates that the transaction status *is* checked during undo > > > > execution, so we might have a problem. > > > > > > > > > > It would be easier to follow if you can tell which exact code you are > > > referring to here? > > > > I meant the call of TransactionIdDidCommit() in > > zundo.c:zheap_exec_pending_rollback(). > > > > IIRC, this should be called for temp tables after they have exited as > this is only to apply the pending undo actions if any, and in case of > temporary undo after session exit, we shouldn't need it. I see (had to play with the debugger a bit). Currently this works because the temporary relations are dropped by AbortTransaction() -> smgrDoPendingDeletes(), before the undo execution starts.
The situation will change as soon as the file removal is also handled by the undo subsystem; however, I'm still not sure how to hit the TransactionIdDidCommit() call for an XID already truncated from the CLOG. I'm starting to admit that there's no issue here: temporary undo is always applied immediately in the foreground, and thus the zheap_exec_pending_rollback() function never needs to examine an XID which no longer exists in the CLOG. > I am not able to understand what exact problem you are facing for temp > tables after the session exit. Can you please explain it a bit more? The problem is that the temporary undo is loaded into backend-local buffers. Thus there's no guarantee that we'll find consistent information in the undo file even if the backend exited cleanly (local buffers are not flushed at backend exit and there's no WAL for them). However, we need to read the undo file to find out if (part of) it can be discarded. I'm trying to find out whether we can ignore the temporary undo when trying to advance oldestFullXidHavingUndo or not. If we can, then each backend can manage its temporary undo on its own and - instead of checking which chunks can be discarded - simply delete the undo files on exit as a whole, just like it deletes temporary relations. Thus we wouldn't need to pay any special attention to discarding. Also, if backends managed the temporary undo this way, it wouldn't be necessary to track it via shared memory (UndoLogMetaData). (With this approach, the undo record to delete the temporary relation must not be temporary, but this should not be an issue.) -- Antonin Houska Web: https://www.cybertec-postgresql.com
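To illustrate what that per-backend management could look like, here is a minimal sketch; the directory name, the file-name pattern and the place of registration are assumptions for illustration, not the patch's actual scheme:

#include <dirent.h>				/* plus postgres.h, miscadmin.h, storage/ipc.h */

/*
 * Sketch only: remove this backend's temporary undo segment files at
 * backend exit, the way temporary relation files are removed.
 * Whole-file removal replaces the discard bookkeeping that shared undo
 * logs need.
 */
static void
RemoveTempUndoFiles(int code, Datum arg)
{
	DIR		   *dir = opendir("base/undo_tmp");	/* assumed location */
	struct dirent *de;
	char		prefix[32];

	if (dir == NULL)
		return;
	snprintf(prefix, sizeof(prefix), "t%d_", MyProcPid);
	while ((de = readdir(dir)) != NULL)
	{
		if (strncmp(de->d_name, prefix, strlen(prefix)) == 0)
		{
			char		path[MAXPGPATH];

			snprintf(path, sizeof(path), "base/undo_tmp/%s", de->d_name);
			unlink(path);		/* nothing to discard, just delete */
		}
	}
	closedir(dir);
}

/* registered once during backend startup */
static void
TempUndoInit(void)
{
	before_shmem_exit(RemoveTempUndoFiles, 0);
}

The point of the sketch is merely that whole-file deletion at exit would make the shared-memory slot metadata and the chunk-level discard logic unnecessary for temporary undo.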
> On Tue, Sep 21, 2021 at 10:07:55AM +0200, Dmitry Dolgov wrote: > On Tue, 21 Sep 2021 09:00 Antonin Houska, <ah@cybertec.at> wrote: > > > Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > Yep, makes sense, thanks. I have a few more questions: > > > > > > * The use case with orphaned files is working somewhat differently after > > > the rebase on the latest master, do you observe it as well? The > > > difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up > > > an orphaned relation file immediately (only later on checkpoint) > > > because of empty pendingUnlinks. I haven't investigated more yet, but > > > seems like after this commit: > > > > > > commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 > > > Author: Thomas Munro <tmunro@postgresql.org> > > > Date: Mon Aug 2 17:32:20 2021 +1200 > > > > > > Run checkpointer and bgwriter in crash recovery. > > > > > > Start up the checkpointer and bgwriter during crash recovery (except in > > > --single mode), as we do for replication. This wasn't done back in > > > commit cdd46c76 out of caution. Now it seems like a better idea to make > > > the environment as similar as possible in both cases. There may also be > > > some performance advantages. > > > > > > something has to be updated (pendingOps are empty right now, so no > > > unlink request is remembered). > > > > I haven't been debugging that part recently, but yes, this commit is relevant, > > thanks for pointing that out! Attached is a patch that should fix it. I'll > > include it in the next version of the patch series, unless you tell me that > > something is still wrong. > > > > Sure, but I can take a look only in a couple of days. Thanks for the patch. Hm, maybe there is some misunderstanding. My question above was about the changed behaviour, where orphaned files (e.g. relation files left behind after the backend was killed) are removed only by the checkpointer when it kicks in. As far as I understand, the original intention was to do this job right away; that's why SyncPre/PostCheckpoint was invoked. But the recent changes around the checkpointer make the current implementation insufficient. The patch you've proposed removes the invocation of SyncPre/PostCheckpoint; do I see it correctly? In this sense it doesn't change anything, except removing non-functioning code of course. But the question, probably reformulated from a more design-oriented point of view, stays the same: when, and by which process, do such orphaned files have to be removed? I've assumed that by removing them right away the previous version was trying to avoid any kind of thundering effect from removing too many files at once, but maybe I'm mistaken here.
Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > On Tue, Sep 21, 2021 at 10:07:55AM +0200, Dmitry Dolgov wrote: > > On Tue, 21 Sep 2021 09:00 Antonin Houska, <ah@cybertec.at> wrote: > > > > > Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > > > Yep, makes sense, thanks. I have a few more questions: > > > > > > > > * The use case with orphaned files is working somewhat differently after > > > > the rebase on the latest master, do you observe it as well? The > > > > difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up > > > > an orphaned relation file immediately (only later on checkpoint) > > > > because of empty pendingUnlinks. I haven't investigated more yet, but > > > > seems like after this commit: > > > > > > > > commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 > > > > Author: Thomas Munro <tmunro@postgresql.org> > > > > Date: Mon Aug 2 17:32:20 2021 +1200 > > > > > > > > Run checkpointer and bgwriter in crash recovery. > > > > > > > > Start up the checkpointer and bgwriter during crash recovery (except in > > > > --single mode), as we do for replication. This wasn't done back in > > > > commit cdd46c76 out of caution. Now it seems like a better idea to make > > > > the environment as similar as possible in both cases. There may also be > > > > some performance advantages. > > > > > > > > something has to be updated (pendingOps are empty right now, so no > > > > unlink request is remembered). > > > > > > I haven't been debugging that part recently, but yes, this commit is relevant, > > > thanks for pointing that out! Attached is a patch that should fix it. I'll > > > include it in the next version of the patch series, unless you tell me that > > > something is still wrong. > > > > Sure, but I can take a look only in a couple of days. Thanks for the patch. > > Hm, maybe there is some misunderstanding. My question above was about > the changed behaviour, where orphaned files (e.g. relation files left behind > after the backend was killed) are removed only by the checkpointer when it > kicks in. As far as I understand, the original intention was to do this job > right away; that's why SyncPre/PostCheckpoint was invoked. But the > recent changes around the checkpointer make the current implementation > insufficient. Yes, it sounds like a misunderstanding. I thought you were complaining about code which is no longer needed. The original intention was to make sure that the files get unlinked at all. IIRC, before commit 7ff23c6d27 the calls to SyncPre/PostCheckpoint were necessary because the checkpointer wasn't running that early during startup. Without these calls the startup process would exit without doing anything. Sorry, I see now that the comment incorrectly says "... it seems simpler ...", but in fact it was necessary. > The patch you've proposed removes the invocation of SyncPre/PostCheckpoint; > do I see it correctly? In this sense it doesn't change anything, except > removing non-functioning code of course. But the question, probably > reformulated from a more design-oriented point of view, stays the same: when, > and by which process, do such orphaned files have to be removed? I've > assumed that by removing them right away the previous version was trying to > avoid any kind of thundering effect from removing too many files at once, but > maybe I'm mistaken here. I'm just trying to use the existing infrastructure: the effects of DROP TABLE also appear to be performed by the checkpointer. However I don't know why the unlinks need to be performed by the checkpointer.
-- Antonin Houska Web: https://www.cybertec-postgresql.com
On Wed, Sep 29, 2021 at 8:18 AM Antonin Houska <ah@cybertec.at> wrote: > I'm just trying to use the existing infrastructure: the effect of DROP TABLE > also appear to be performed by the checkpointer. However I don't know why the > unlinks need to be performed by the checkpointer. For DROP TABLE, we leave an empty file (I've been calling it a "tombstone file") so that GetNewRelFileNode() won't let you reuse the same relfilenode in the same checkpoint cycle. One reason is that wal_level=minimal has a data-eating crash recovery failure mode if you reuse a relfilenode in a checkpoint cycle.
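For illustration, a simplified sketch of that tombstone behaviour (the real logic lives in md.c and goes through the sync-request machinery; queue_unlink_after_checkpoint() below stands in for that and is not a real function):

#include <fcntl.h>
#include <unistd.h>

extern void queue_unlink_after_checkpoint(const char *path);	/* hypothetical */

/*
 * Sketch only: instead of unlinking the main fork at DROP, truncate it
 * to zero length and leave the empty file in place. GetNewRelFileNode()
 * probes for an existing file, so the tombstone prevents the same
 * relfilenode from being handed out again within this checkpoint cycle;
 * the file is finally unlinked after the next checkpoint.
 */
static void
drop_main_fork(const char *path)
{
	int			fd = open(path, O_WRONLY | O_TRUNC, 0);

	if (fd >= 0)
		close(fd);				/* zero-length tombstone remains */
	queue_unlink_after_checkpoint(path);
}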
Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Sep 29, 2021 at 8:18 AM Antonin Houska <ah@cybertec.at> wrote: > > I'm just trying to use the existing infrastructure: the effects of DROP TABLE > > also appear to be performed by the checkpointer. However I don't know why the > > unlinks need to be performed by the checkpointer. > > For DROP TABLE, we leave an empty file (I've been calling it a > "tombstone file") so that GetNewRelFileNode() won't let you reuse the > same relfilenode in the same checkpoint cycle. One reason is that > wal_level=minimal has a data-eating crash recovery failure mode if you > reuse a relfilenode in a checkpoint cycle. Interesting. Is the problem that REDO of the DROP TABLE command deletes the relfilenode which already contains the new data, but the new data cannot be recovered because (due to wal_level=minimal) it's not present in WAL? In this case I suppose that the checkpoint just ensures that the DROP TABLE won't be replayed during the next crash recovery. BTW, does the attached comment fix make sense to you? The corresponding code in InitSync() is

/*
 * Create pending-operations hashtable if we need it. Currently, we need
 * it if we are standalone (not under a postmaster) or if we are a
 * checkpointer auxiliary process.
 */
if (!IsUnderPostmaster || AmCheckpointerProcess())

I suspect this is also related to commit 7ff23c6d27. Thanks for your answer, I was considering adding you to CC :-) -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 1c78581354..ae6c5ff8e4 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -563,7 +563,7 @@ RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
 
 	if (pendingOps != NULL)
 	{
-		/* standalone backend or startup process: fsync state is local */
+		/* standalone backend or checkpointer process: fsync state is local */
 		RememberSyncRequest(ftag, type);
 		return true;
 	}
On Tue, Sep 28, 2021 at 7:36 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska <ah@cybertec.at> wrote: > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > The problem of the temporary undo log is that it's loaded into local buffers > > > > > and that backend can exit w/o flushing local buffers to disk, and thus we are > > > > > not guaranteed to find enough information when trying to discard the undo log > > > > > the backend wrote. I'm thinking about the following solutions: > > > > > > > > > > 1. Let the backend manage temporary undo log on its own (even the slot > > > > > metadata would stay outside the shared memory, and in particular the > > > > > insertion pointer could start from 1 for each session) and remove the > > > > > segment files at the same moment the temporary relations are removed. > > > > > > > > > > However, by moving the temporary undo slots away from the shared memory, > > > > > computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > > > > > be affected. It might seem that a transaction which only writes undo log > > > > > for temporary relations does not need to affect oldestFullXidHavingUndo, > > > > > but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > > > > > prevents transactions to be truncated from the CLOG too early, I wonder if > > > > > the following is possible (This scenario is only applicable to the zheap > > > > > storage engine [1], which is not included in this patch, but should already > > > > > be considered.): > > > > > > > > > > A transaction creates a temporary table, does some (many) changes and then > > > > > gets rolled back. The undo records are being applied and it takes some > > > > > time. Since XID of the transaction did not affect oldestFullXidHavingUndo, > > > > > the XID can disappear from the CLOG due to truncation. > > > > > > > > > > > > > By above do you mean to say that in zheap code, we don't consider XIDs > > > > that operate on temp table/undo for oldestFullXidHavingUndo? > > > > > > I was referring to the code > > > > > > /* We can't process temporary undo logs. */ > > > if (log->meta.persistence == UNDO_TEMP) > > > continue; > > > > > > in undodiscard.c:UndoDiscard(). > > > > > > > Here, I think it will just skip undo of temporary undo logs and > > oldestFullXidHavingUndo should be advanced after skipping it. > > Right, it'll be adavanced, but the transaction XID (if the transaction wrote > only to temporary relations) might still be needed. > > > > > > > > > > However zundo.c in > > > > > [1] indicates that the transaction status *is* checked during undo > > > > > execution, so we might have a problem. > > > > > > > > > > > > > It would be easier to follow if you can tell which exact code are you > > > > referring here? > > > > > > In meant the call of TransactionIdDidCommit() in > > > zundo.c:zheap_exec_pending_rollback(). > > > > > > > IIRC, this should be called for temp tables after they have exited as > > this is only to apply the pending undo actions if any, and in case of > > temporary undo after session exit, we shouldn't need it. > > I see (had to play with debugger a bit). Currently this works because the > temporary relations are dropped by AbortTransaction() -> > smgrDoPendingDeletes(), before the undo execution starts. 
> The situation will change as soon as the file removal is also handled by the > undo subsystem; however, I'm still not sure how to hit the > TransactionIdDidCommit() call for an XID already truncated from the CLOG. > > I'm starting to admit that there's no issue here: temporary undo is always > applied immediately in the foreground, and thus the zheap_exec_pending_rollback() > function never needs to examine an XID which no longer exists in the CLOG. > > > I am not able to understand what exact problem you are facing for temp > > tables after the session exit. Can you please explain it a bit more? > > The problem is that the temporary undo is loaded into backend-local > buffers. Thus there's no guarantee that we'll find consistent information in > the undo file even if the backend exited cleanly (local buffers are not > flushed at backend exit and there's no WAL for them). However, we need to read > the undo file to find out if (part of) it can be discarded. > > I'm trying to find out whether we can ignore the temporary undo when trying to > advance oldestFullXidHavingUndo or not. > It seems this is the crucial point. In the code you pointed at, we ignore the temporary undo while advancing oldestFullXidHavingUndo, but if you find any case where that is not true then we need to discuss the best way to solve it. -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Sep 28, 2021 at 7:36 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > The problem of the temporary undo log is that it's loaded into local buffers > > > > > > and that backend can exit w/o flushing local buffers to disk, and thus we are > > > > > > not guaranteed to find enough information when trying to discard the undo log > > > > > > the backend wrote. I'm thinking about the following solutions: > > > > > > > > > > > > 1. Let the backend manage temporary undo log on its own (even the slot > > > > > > metadata would stay outside the shared memory, and in particular the > > > > > > insertion pointer could start from 1 for each session) and remove the > > > > > > segment files at the same moment the temporary relations are removed. > > > > > > > > > > > > However, by moving the temporary undo slots away from the shared memory, > > > > > > computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > > > > > > be affected. It might seem that a transaction which only writes undo log > > > > > > for temporary relations does not need to affect oldestFullXidHavingUndo, > > > > > > but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > > > > > > prevents transactions to be truncated from the CLOG too early, I wonder if > > > > > > the following is possible (This scenario is only applicable to the zheap > > > > > > storage engine [1], which is not included in this patch, but should already > > > > > > be considered.): > > > > > > > > > > > > A transaction creates a temporary table, does some (many) changes and then > > > > > > gets rolled back. The undo records are being applied and it takes some > > > > > > time. Since XID of the transaction did not affect oldestFullXidHavingUndo, > > > > > > the XID can disappear from the CLOG due to truncation. > > > > > > > > > > > > > > > > By above do you mean to say that in zheap code, we don't consider XIDs > > > > > that operate on temp table/undo for oldestFullXidHavingUndo? > > > > > > > > I was referring to the code > > > > > > > > /* We can't process temporary undo logs. */ > > > > if (log->meta.persistence == UNDO_TEMP) > > > > continue; > > > > > > > > in undodiscard.c:UndoDiscard(). > > > > > > > > > > Here, I think it will just skip undo of temporary undo logs and > > > oldestFullXidHavingUndo should be advanced after skipping it. > > > > Right, it'll be adavanced, but the transaction XID (if the transaction wrote > > only to temporary relations) might still be needed. > > > > > > > > > > > > > However zundo.c in > > > > > > [1] indicates that the transaction status *is* checked during undo > > > > > > execution, so we might have a problem. > > > > > > > > > > > > > > > > It would be easier to follow if you can tell which exact code are you > > > > > referring here? > > > > > > > > In meant the call of TransactionIdDidCommit() in > > > > zundo.c:zheap_exec_pending_rollback(). > > > > > > > > > > IIRC, this should be called for temp tables after they have exited as > > > this is only to apply the pending undo actions if any, and in case of > > > temporary undo after session exit, we shouldn't need it. > > > > I see (had to play with debugger a bit). 
> > Currently this works because the temporary relations are dropped by > > AbortTransaction() -> smgrDoPendingDeletes(), before the undo execution > > starts. The situation will change as soon as the file removal is also > > handled by the undo subsystem; however, I'm still not sure how to hit the > > TransactionIdDidCommit() call for an XID already truncated from the CLOG. > > > > I'm starting to admit that there's no issue here: temporary undo is always > > applied immediately in the foreground, and thus the zheap_exec_pending_rollback() > > function never needs to examine an XID which no longer exists in the CLOG. > > > > > I am not able to understand what exact problem you are facing for temp > > > tables after the session exit. Can you please explain it a bit more? > > > > The problem is that the temporary undo is loaded into backend-local > > buffers. Thus there's no guarantee that we'll find consistent information in > > the undo file even if the backend exited cleanly (local buffers are not > > flushed at backend exit and there's no WAL for them). However, we need to read > > the undo file to find out if (part of) it can be discarded. > > > > I'm trying to find out whether we can ignore the temporary undo when trying to > > advance oldestFullXidHavingUndo or not. > > > > It seems this is the crucial point. In the code you pointed at, we ignore the temporary undo while advancing oldestFullXidHavingUndo, but if you find any case where that is not true then we need to discuss the best way to solve it. As I already said above, I now think that the computation of oldestFullXidHavingUndo can actually ignore the temporary undo, as happens in the zheap fork of postgres. At least I eventually could not find a corner case that would break the current solution. So it should be ok if the temporary undo is managed and discarded by individual backends. Patch 0005 of the new series tries to do that. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Hi, On Thu, Nov 25, 2021 at 10:00 PM Antonin Houska <ah@cybertec.at> wrote: > > So it should be ok if the temporary undo is managed and discarded by > individual backends. Patch 0005 of the new series tries to do that. The cfbot reports that at least the 0001 patch doesn't apply anymore: http://cfbot.cputube.org/patch_36_3228.log > === applying patch ./undo-20211125/0001-Add-SmgrId-to-smgropen-and-BufferTag.patch > [...] > patching file src/bin/pg_waldump/pg_waldump.c > Hunk #1 succeeded at 480 (offset 17 lines). > Hunk #2 FAILED at 500. > Hunk #3 FAILED at 531. > 2 out of 3 hunks FAILED -- saving rejects to file src/bin/pg_waldump/pg_waldump.c.rej Could you send a rebased version? In the meantime I'll switch the cf entry to Waiting on Author.
Hi, Antonin. I am quite interested in zheap and have recently been reviewing the patch you submitted.
When I used the pg_undodump tool to dump the undo page chunks, I found that some chunk headers were abnormal.
After reading the relevant code in 0006-The-initial-implementation-of-the-pg_undodump-tool.patch,
I think there is a bug in the function parse_undo_page.
According to my understanding, the size in the chunk header includes the chunk header + type-specific header + undo records.
If the chunk spans pages, the size of the page header also needs to be added.
But I found that currently only the scenario of the chunk header spanning pages is considered; the scenario of the type-specific header spanning pages is not.
/*
* The page header size must eventually be subtracted from
* chunk_bytes_left because it's included in the chunk size. However,
* since chunk_bytes_left is unsigned, we do not subtract anything from it
* if it's still zero. This can happen if we're still reading the chunk
* header or the type-specific header. (The underflow should not be a
* problem because the chunk size will eventually be added, but it seems
* ugly and it makes debugging less convenient.)
*/
if (s->chunk_bytes_left > 0)
{
/* Chunk should not end within page header. */
Assert(s->chunk_bytes_left >= SizeOfUndoPageHeaderData);
s->chunk_bytes_left -= SizeOfUndoPageHeaderData;
s->chunk_bytes_to_skip = 0;
}
/* Processing the chunk header? */
else if (s->chunk_hdr_bytes_left > 0)
s->chunk_bytes_to_skip = SizeOfUndoPageHeaderData;
------------------------------------------------
Should this code be fixed like this? When the type-specific header spans the undo page, the page header should also be skipped:
else if (s->chunk_hdr_bytes_left > 0 || s->type_hdr_bytes_left > 0)
s->chunk_bytes_to_skip = SizeOfUndoPageHeaderData;
------------------------------------------------------------------
From: Antonin Houska <ah@cybertec.at>
Sent: Tuesday, March 29, 2022 17:25
To: Dmitry Dolgov <9erthalion6@gmail.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
The cfbot complained that the patch series no longer applies, so I've rebased
it and also tried to make sure that the other cfbot checks become green.
One particular problem was that pg_upgrade complained that "live undo data"
remains in the old cluster. I found out that the temporary undo log causes the
problem, so I've adjusted the query in check_for_undo_data() accordingly until
the problem gets fixed properly.
The problem of the temporary undo log is that it's loaded into local buffers
and that a backend can exit without flushing local buffers to disk, and thus we are
not guaranteed to find enough information when trying to discard the undo log
the backend wrote. I'm thinking about the following solutions:
1. Let the backend manage temporary undo log on its own (even the slot
metadata would stay outside the shared memory, and in particular the
insertion pointer could start from 1 for each session) and remove the
segment files at the same moment the temporary relations are removed.
However, by moving the temporary undo slots away from the shared memory,
computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would
be affected. It might seem that a transaction which only writes undo log
for temporary relations does not need to affect oldestFullXidHavingUndo,
but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo
prevents transactions from being truncated from the CLOG too early, I wonder if
the following is possible (This scenario is only applicable to the zheap
storage engine [1], which is not included in this patch, but should already
be considered.):
A transaction creates a temporary table, does some (many) changes and then
gets rolled back. The undo records are being applied and it takes some
time. Since XID of the transaction did not affect oldestFullXidHavingUndo,
the XID can disappear from the CLOG due to truncation. However zundo.c in
[1] indicates that the transaction status *is* checked during undo
execution, so we might have a problem.
Or am I missing something? UndoDiscard() in zheap seems to ignore temporary
undo:
/* We can't process temporary undo logs. */
if (log->meta.persistence == UNDO_TEMP)
continue;
2. Do not load the temporary undo into local buffers. If it's always in the
shared buffers, we should never see incomplete data when trying to discard
undo. In this case, persistence levels UNDOPERSISTENCE_UNLOGGED and
UNDOPERSISTENCE_TEMP could be merged into a single level.
3. Implement the discarding in another way, but I don't have a new idea right
now.
Suggestions are welcome.
[1] https://github.com/EnterpriseDB/zheap/tree/master
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
孔凡深(云梳) <fanshen.kfs@alibaba-inc.com> wrote: > Hi, Antonin. I am quite interested in zheap and have recently been reviewing the patch you submitted. > When I used the pg_undodump tool to dump the undo page chunks, I found that some chunk headers were abnormal. > After reading the relevant code in 0006-The-initial-implementation-of-the-pg_undodump-tool.patch, > I think there is a bug in the function parse_undo_page. Thanks, I'll take a look if the project happens to continue. Currently it seems that another approach is more likely to be taken: https://www.postgresql.org/message-id/CA%2BTgmoa_VNzG4ZouZyQQ9h%3DoRiy%3DZQV5%2BxHQXxMWmep4Ygg8Dg%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com