Thread: Refactoring the checkpointer's fsync request queue
Hello hackers,

Currently, md.c and checkpointer.c interact in a way that breaks smgr.c's modularity. That doesn't matter much if md.c is the only storage manager implementation, but currently there are two proposals to provide new kinds of block storage accessed via the buffer manager: UNDO and SLRU.

Here is a patch that rips the fsync stuff out of md.c, generalises it and puts it into a new translation unit smgrsync.c. It can deal with fsync()ing any files you want at checkpoint time, as long as they can be described by an SmgrFileTag (a struct type we can extend as needed). A pathname would work too, but I wanted something small and fixed in size. It's just a tag that can be converted to a path in case the file needs to be reopened (eg on Windows), but otherwise is used as a hash table key to merge requests.

There is one major fly in the ointment: fsyncgate[1]. Originally I planned to propose a patch on top of that one, but it's difficult -- both patches move a lot of the same stuff around. Personally, I don't think it would be a very good idea to back-patch that anyway. It'd be riskier than the problem it aims to solve, in terms of bugs and hard-to-foresee portability problems IMHO. I think we should consider back-patching some variant of Craig Ringer's PANIC patch, and consider this redesigned approach for future releases.

So, please find attached the WIP patch that I would like to propose for PostgreSQL 12, under a separate Commitfest entry. It incorporates the fsyncgate work by Andres Freund (original file descriptor transfer POC) and me (many bug fixes and improvements), and the refactoring work as described above. It can be compiled in two modes: with the macro CHECKPOINTER_TRANSFER_FILES defined, it sends fds to the checkpointer, but if you comment out that macro definition for testing, or build on Windows, it reverts to a mode that reopens files in the checkpointer. I'm hoping to find a Windows-savvy collaborator to help finish the Windows support. Right now it passes make check on AppVeyor, but it needs to be reviewed and tested on a real system with a small shared_buffers (installcheck, pgbench, other attempts to break it). Other than that, there are a couple of remaining XXX notes for small known details, but I wanted to post this version now.

[1] https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de

-- Thomas Munro http://www.enterprisedb.com
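For concreteness, here is a minimal sketch of what such a tag could look like, assuming fields along the lines described above (relation, fork, segment number); the layout is a guess, not necessarily what the attached patch uses:

    #include "storage/relfilenode.h"
    #include "common/relpath.h"        /* for ForkNumber */

    /*
     * Illustrative sketch only.  The point is a small, fixed-size value that
     * works as a hash table key for merging requests, yet carries enough
     * information to be converted back into a pathname if the file must be
     * reopened (eg on Windows).
     */
    typedef struct SmgrFileTag
    {
        RelFileNode rnode;          /* tablespace, database, relation */
        ForkNumber  forknum;        /* which fork of the relation */
        uint32      segno;          /* segment number within the fork */
    } SmgrFileTag;

Used as the key of the checkpointer's hash table, two requests for the same segment naturally merge into a single entry.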
Attachment
(Adding Dmitry to CC list.)

On Tue, Oct 16, 2018 at 12:02 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> There is one major fly in the ointment: fsyncgate[1]. Originally I
> planned to propose a patch on top of that one, but it's difficult --
> both patches move a lot of the same stuff around. Personally, I don't
> think it would be a very good idea to back-patch that anyway. It'd be
> riskier than the problem it aims to solve, in terms of bugs and
> hard-to-foresee portability problems IMHO. I think we should consider
> back-patching some variant of Craig Ringer's PANIC patch, and consider
> this redesigned approach for future releases.
>
> So, please find attached the WIP patch that I would like to propose
> for PostgreSQL 12, under a separate Commitfest entry. It incorporates
> the fsyncgate work by Andres Freund (original file descriptor transfer
> POC) and me (many bug fixes and improvements), and the refactoring
> work as described above.

Here is a rebased version of the patch, post pread()/pwrite(). I have also rewritten the commit message to try to explain the rationale concisely, instead of requiring the reader to consult multiple discussions that jump between lengthy email threads to understand the key points.

There is one major problem with this patch: BufferSync(), run in the checkpointer, can deadlock against a backend that holds a buffer lock and is blocked in SendFsyncRequest(). To break this deadlock, we need a way out of it on either the sending or receiving side. Here are three ideas:

1. Go back to the current pressure-valve strategy: make the sending side perform the fsync(), if it can't immediately write to the pipe.

2. Offload the BufferSync() work to bgwriter, so the checkpointer can keep draining the pipe. Communication between checkpointer and bgwriter can be fairly easily multiplexed with the pipe draining work.

3. Multiplex the checkpointer's work: Use LWLockConditionalAcquire() when locking buffers, and if that fails, try to drain the pipe, and then fall back to a LWLockTimedAcquire(), drain pipe, repeat loop. I can hear you groan already; that doesn't seem particularly elegant, and there are portability problems implementing LWLockTimedAcquire(): semtimedop() and sem_timedwait() are not available on all platforms (eg macOS). Maybe pthread_cond_timedwait() could help (!).

I'm not actually sure if idea 1 is correct, and I also don't like that behaviour under pressure, and I think pressure is more likely than in current master (since we gave up sender-side queue compaction, and it seems quite easy to fill up the pipe's buffer). Number 2 appeals to me the most right now, but I haven't looked into the details or tried it yet. Number 3 is a straw man that perhaps helps illustrate the problem but involves taking on unnecessary new problems and seems like a non-starter.

So, is there any reason the bgwriter process shouldn't do that work on the checkpointer's behalf? Is there another way?

-- Thomas Munro http://www.enterprisedb.com
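To make the shape of idea 3 (and of the deadlock itself) concrete, here is a rough sketch of the checkpointer's buffer-locking step; LWLockConditionalAcquire() and BufferDescriptorGetContentLock() are existing APIs, while AbsorbFsyncRequests() here stands in for whatever drains the request pipe in this patch, so treat this as illustration rather than the patch's code:

    /* Straw-man sketch of idea 3, not code from the attached patch. */
    for (;;)
    {
        if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(bufHdr),
                                     LW_SHARED))
            break;              /* got the content lock; go flush the buffer */

        /*
         * The lock holder may itself be blocked writing to our full
         * fsync-request pipe, so drain the pipe before retrying.  Ideally
         * we would sleep with a timeout here, hence the LWLockTimedAcquire()
         * idea above.
         */
        AbsorbFsyncRequests();
    }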
Attachment
On Sun, Nov 11, 2018 at 9:59 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> There is one major problem with this patch

If there's only one, you're doing great! Although admittedly this seems like a big one...

> 1. Go back to the current pressure-valve strategy: make the sending
> side perform the fsync(), if it can't immediately write to the pipe.

As you say, this will happen significantly more often without the sender-side deduplication. That deduplication logic got added in response to a real need. Before that, you could cause an individual backend to start doing its own fsyncs() with something as simple as a bulk load. The queue would absorb most of them, but not all, and the performance ramifications were noticeable.

> 2. Offload the BufferSync() work to bgwriter, so the checkpointer can
> keep draining the pipe. Communication between checkpointer and
> bgwriter can be fairly easily multiplexed with the pipe draining work.

That sounds a little like you are proposing to go back to the way things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and, belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess the division of labor wouldn't be quite the same.

> 3. Multiplex the checkpointer's work: Use LWLockConditionalAcquire()
> when locking buffers, and if that fails, try to drain the pipe, and
> then fall back to a LWLockTimedAcquire(), drain pipe, repeat loop. I
> can hear you groan already; that doesn't seem particularly elegant,
> and there are portability problems implementing LWLockTimedAcquire():
> semtimedop() and sem_timedwait() are not available on all platforms
> (eg macOS). Maybe pthread_cond_timedwait() could help (!).

You don't really need to invent LWLockTimedAcquire(). You could just keep retrying LWLockConditionalAcquire() in a delay loop. I agree that doesn't seem particularly elegant, though.

I still feel like this whole pass-the-fds-to-the-checkpointer thing is a bit of a fool's errand, though. I mean, there's no guarantee that the first FD that gets passed to the checkpointer is the first one opened, or even the first one written, is there? It seems like if you wanted to make this work reliably, you'd need to do it the other way around: have the checkpointer (or some other background process) open all the FDs, and anybody else who wants to have one open get it from the checkpointer.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2018-11-12 15:58:41 +1300, Thomas Munro wrote:
> There is one major problem with this patch: BufferSync(), run in the
> checkpointer, can deadlock against a backend that holds a buffer lock
> and is blocked in SendFsyncRequest(). To break this deadlock, we need
> a way out of it on either the sending or receiving side. Here are three
> ideas:

That's the deadlock I'd mentioned in Pune (?) btw.

> 1. Go back to the current pressure-valve strategy: make the sending
> side perform the fsync(), if it can't immediately write to the pipe.

I don't think that's correct / safe? I've previously wondered whether there's any way we could delay the write to a point where the buffer is not locked anymore - as far as I can tell it's actually not required for correctness that we send the fsync request before unlocking. It's architecturally a bit dicey tho :(

Greetings,

Andres Freund
Hi, On 2018-11-13 12:04:23 -0500, Robert Haas wrote: > I still feel like this whole pass-the-fds-to-the-checkpointer thing is > a bit of a fool's errand, though. I mean, there's no guarantee that > the first FD that gets passed to the checkpointer is the first one > opened, or even the first one written, is there? I'm not sure I understand the danger you're seeing here. It doesn't have to be the first fd opened, it has to be an fd that's older than all the writes that we need to ensure made it to disk. And that ought to be guaranteed by the logic? Between the FileWrite() and the register_dirty_segment() (and other relevant paths) the FD cannot be closed. > It seems like if you wanted to make this work reliably, you'd need to > do it the other way around: have the checkpointer (or some other > background process) open all the FDs, and anybody else who wants to > have one open get it from the checkpointer. That'd require a process context switch for each FD opened, which seems clearly like a no-go? Greetings, Andres Freund
On Tue, Nov 13, 2018 at 1:07 PM Andres Freund <andres@anarazel.de> wrote: > On 2018-11-13 12:04:23 -0500, Robert Haas wrote: > > I still feel like this whole pass-the-fds-to-the-checkpointer thing is > > a bit of a fool's errand, though. I mean, there's no guarantee that > > the first FD that gets passed to the checkpointer is the first one > > opened, or even the first one written, is there? > I'm not sure I understand the danger you're seeing here. It doesn't have > to be the first fd opened, it has to be an fd that's older than all the > writes that we need to ensure made it to disk. And that ought to be > guaranteed by the logic? Between the FileWrite() and the > register_dirty_segment() (and other relevant paths) the FD cannot be > closed. Suppose backend A and backend B open a segment around the same time. Is it possible that backend A does a write before backend B, but backend B's copy of the fd reaches the checkpointer before backend A's copy? If you send the FD to the checkpointer before writing anything then I think it's fine, but if you write first and then send the FD to the checkpointer I don't see what guarantees the ordering. > > It seems like if you wanted to make this work reliably, you'd need to > > do it the other way around: have the checkpointer (or some other > > background process) open all the FDs, and anybody else who wants to > > have one open get it from the checkpointer. > > That'd require a process context switch for each FD opened, which seems > clearly like a no-go? I don't know how bad that would be. But hey, no cost is too great to pay as a workaround for insane kernel semantics, right? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
(Replies to a couple of different messages below) On Wed, Nov 14, 2018 at 6:04 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Nov 11, 2018 at 9:59 PM Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > There is one major problem with this patch > > If there's only one, you're doing great! Although admittedly this > seems like a big one... Make that two. > > 2. Offload the BufferSync() work to bgwriter, so the checkpointer can > > keep draining the pipe. Communication between checkpointer and > > bgwriter can be fairly easily multiplexed with the pipe draining work. > > That sounds a little like you are proposing to go back to the way > things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and, > belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess > the division of labor wouldn't be quite the same. But is there an argument against it? The checkpointer would still be creating checkpoints including running fsync, but the background writer would be, erm, writing, erm, in the background. Admittedly it adds a whole extra rabbit hole to this rabbit hole, which was itself a diversion from my goal of refactoring the syncing machinery to support undo logs. But the other two ideas seem to suck and/or have correctness issues. On Wed, Nov 14, 2018 at 7:43 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 13, 2018 at 1:07 PM Andres Freund <andres@anarazel.de> wrote: > > On 2018-11-13 12:04:23 -0500, Robert Haas wrote: > > > I still feel like this whole pass-the-fds-to-the-checkpointer thing is > > > a bit of a fool's errand, though. I mean, there's no guarantee that > > > the first FD that gets passed to the checkpointer is the first one > > > opened, or even the first one written, is there? > > I'm not sure I understand the danger you're seeing here. It doesn't have > > to be the first fd opened, it has to be an fd that's older than all the > > writes that we need to ensure made it to disk. And that ought to be > > guaranteed by the logic? Between the FileWrite() and the > > register_dirty_segment() (and other relevant paths) the FD cannot be > > closed. > > Suppose backend A and backend B open a segment around the same time. > Is it possible that backend A does a write before backend B, but > backend B's copy of the fd reaches the checkpointer before backend A's > copy? If you send the FD to the checkpointer before writing anything > then I think it's fine, but if you write first and then send the FD to > the checkpointer I don't see what guarantees the ordering. I'm not sure if it matters whether we send the fd before or after the write, but we still need some kind of global ordering of fds that can order a given fd with respect to writes in other processes, so the patch introduces a global shared counter captured immediately after open() (including when reopened in the vfd machinery). In your example, both fds arrive in the checkpointer in some order, and it will keep the one with the older sequence number and close the other one. This sorting of all interesting fds will be forced before the checkpoint completes by AbsorbFsyncRequests(), which drains all messages from the pipe until it sees a message for the next checkpoint cycle. Hmm, I think there is a flaw in the plan here though. 
Messages for different checkpoint cycles race to enter the pipe around the time the cycle counter is bumped, so you could have a message for n hiding behind a message for n + 1 and not drain enough; I'm not sure and need to look at something else today, but I see a couple of potential solutions to that which I will mull over, based on either a new shared counter increment or a special pipe message written after BufferSync() by the bgwriter (if we go for idea #2; Andres had something similar in the original prototype but it could self-deadlock). I need to figure out if that is a suitable barrier due to buffer interlocking. > > > It seems like if you wanted to make this work reliably, you'd need to > > > do it the other way around: have the checkpointer (or some other > > > background process) open all the FDs, and anybody else who wants to > > > have one open get it from the checkpointer. > > > > That'd require a process context switch for each FD opened, which seems > > clearly like a no-go? > > I don't know how bad that would be. But hey, no cost is too great to > pay as a workaround for insane kernel semantics, right? Yeah, seems extremely expensive and unnecessary. It seems sufficient to track the global opening order... or at least a proxy that identifies the fd that performed the oldest write. Which I believe this patch is doing. -- Thomas Munro http://www.enterprisedb.com
> On Wed, 14 Nov 2018 at 00:44, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> Here is a rebased version of the patch, post pread()/pwrite(). I have
> also rewritten the commit message to try to explain the rationale
> concisely, instead of requiring the reader to consult multiple
> discussions that jump between lengthy email threads to understand the
> key points.

Thank you for working on this patch!

> There is one major problem with this patch: BufferSync(), run in the
> checkpointer, can deadlock against a backend that holds a buffer lock
> and is blocked in SendFsyncRequest(). To break this deadlock, we need
> a way out of it on either the sending or receiving side.

Or introduce a third side, but I'm not sure how appropriate that is here.

> 2. Offload the BufferSync() work to bgwriter, so the checkpointer can
> keep draining the pipe. Communication between checkpointer and
> bgwriter can be fairly easily multiplexed with the pipe draining work.

I also think it sounds better than the other options (although probably that's partially because these options were formulated while already having some bias towards one of the solutions).

> > > 2. Offload the BufferSync() work to bgwriter, so the checkpointer can
> > > keep draining the pipe. Communication between checkpointer and
> > > bgwriter can be fairly easily multiplexed with the pipe draining work.
> >
> > That sounds a little like you are proposing to go back to the way
> > things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
> > belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
> > the division of labor wouldn't be quite the same.

I had the same first thought, but then after reading the corresponding mailing thread I've got an impression that the purpose of this change was rather technical (to split work between different processes for performance reasons) and not exactly relevant to the division of labor - am I wrong here?

While testing this patch with frequent checkpoints I've stumbled upon an interesting error, that happened already after I finished one test:

TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574)
2018-11-13 22:06:29.773 CET [7886] LOG: checkpointer process (PID 7934) was terminated by signal 6: Aborted
2018-11-13 22:06:29.773 CET [7886] LOG: terminating any other active server processes
2018-11-13 22:06:29.773 CET [7937] WARNING: terminating connection because of crash of another server process
2018-11-13 22:06:29.773 CET [7937] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-11-13 22:06:29.773 CET [7937] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2018-11-13 22:06:29.778 CET [7886] LOG: all server processes terminated; reinitializing

I assume it shouldn't be like that?
I haven't investigated deeply yet, but backtrace looks like:

>>> bt
#0  0x00007f7ee7a3af00 in raise () from /lib64/libc.so.6
#1  0x00007f7ee7a3ca57 in abort () from /lib64/libc.so.6
#2  0x0000560e89d1858e in ExceptionalCondition (conditionName=conditionName@entry=0x560e89eca333 "!(rc > 0)", errorType=errorType@entry=0x560e89d6cec8 "FailedAssertion", fileName=fileName@entry=0x560e89eca2c9 "checkpointer.c", lineNumber=lineNumber@entry=574) at assert.c:54
#3  0x0000560e89b5e3ff in CheckpointerMain () at checkpointer.c:574
#4  0x0000560e8995ef9e in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffe05c32f60) at bootstrap.c:460
#5  0x0000560e89b69c55 in StartChildProcess (type=type@entry=CheckpointerProcess) at postmaster.c:5369
#6  0x0000560e89b6af15 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2916
#7  <signal handler called>
#8  0x00007f7ee7afe00b in select () from /lib64/libc.so.6
#9  0x0000560e89b6bd20 in ServerLoop () at postmaster.c:1679
#10 0x0000560e89b6d1bc in PostmasterMain (argc=3, argv=<optimized out>) at postmaster.c:1388
#11 0x0000560e89acadc6 in main (argc=3, argv=0x560e8ad42c00) at main.c:228
On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > That sounds a little like you are proposing to go back to the way > > things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and, > > belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess > > the division of labor wouldn't be quite the same. > > But is there an argument against it? The checkpointer would still be > creating checkpoints including running fsync, but the background > writer would be, erm, writing, erm, in the background. I don't know. I guess the fact that the checkpointer is still performing the fsyncs is probably a key point. I mean, in the old division of labor, fsyncs could interrupt the background writing that was supposed to be happening. > I'm not sure if it matters whether we send the fd before or after the > write, but we still need some kind of global ordering of fds that can > order a given fd with respect to writes in other processes, so the > patch introduces a global shared counter captured immediately after > open() (including when reopened in the vfd machinery). But how do you make reading that counter atomic with the open() itself? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018-11-14 16:36:49 -0500, Robert Haas wrote: > On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > I'm not sure if it matters whether we send the fd before or after the > > write, but we still need some kind of global ordering of fds that can > > order a given fd with respect to writes in other processes, so the > > patch introduces a global shared counter captured immediately after > > open() (including when reopened in the vfd machinery). > > But how do you make reading that counter atomic with the open() itself? I don't see why it has to be. As long as the "fd generation" assignment happens before fsync (and writes secondarily), there ought not to be any further need for synchronizity? Greetings, Andres Freund
On Wed, Nov 14, 2018 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
> On 2018-11-14 16:36:49 -0500, Robert Haas wrote:
> > But how do you make reading that counter atomic with the open() itself?
>
> I don't see why it has to be. As long as the "fd generation" assignment
> happens before fsync (and writes secondarily), there ought not to be any
> further need for synchronizity?

If the goal is to have the FD that is opened first end up in the checkpointer's table, grabbing a counter afterwards does not achieve it, because there's a race.

S1: open FD
S2: open FD
S2: local_counter = shared_counter++
S1: local_counter = shared_counter++

Now S1 was opened first but has a higher shared counter value than S2 which was opened later. Does that matter? Beats me! I just work here...

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Nov 15, 2018 at 5:09 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> While testing this patch with frequent checkpoints I've stumbled upon an
> interesting error, that happened already after I finished one test:
>
> TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574)

Thanks for testing! Yeah, that's:

+        rc = WaitEventSetWait(wes, cur_timeout * 1000, &event, 1, 0);
+        Assert(rc > 0);

I got confused about the API. If there is a timeout, you get rc == 0, but I think I was expecting rc == 1, event.event == WL_TIMEOUT. Oops. I will fix that when I post a new experimental version that uses the bgworker as discussed, and we can try to figure out if that design will fly.

-- Thomas Munro http://www.enterprisedb.com
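For illustration, WaitEventSetWait() returns the number of events that occurred, so a timeout comes back as 0; a minimal sketch of the corrected check, which may not match the eventual fix exactly:

    rc = WaitEventSetWait(wes, cur_timeout * 1000, &event, 1, 0);
    if (rc == 0)
    {
        /* timed out; fall through and decide whether it's time to checkpoint */
    }
    else
    {
        Assert(rc == 1);
        /* inspect event.events: latch set, postmaster death, socket data, ... */
    }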
On Sat, Nov 17, 2018 at 4:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 14, 2018 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2018-11-14 16:36:49 -0500, Robert Haas wrote:
> > > But how do you make reading that counter atomic with the open() itself?
> >
> > I don't see why it has to be. As long as the "fd generation" assignment
> > happens before fsync (and writes secondarily), there ought not to be any
> > further need for synchronizity?
>
> If the goal is to have the FD that is opened first end up in the
> checkpointer's table, grabbing a counter afterwards does not achieve
> it, because there's a race.
>
> S1: open FD
> S2: open FD
> S2: local_counter = shared_counter++
> S1: local_counter = shared_counter++
>
> Now S1 was opened first but has a higher shared counter value than S2
> which was opened later. Does that matter? Beats me! I just work
> here...

It's not important for the sequence numbers to match the opening order exactly (that'd work too but be expensive to orchestrate). It's important for the sequence numbers to be assigned before each backend does its first pwrite(). That gives us the following interleavings to worry about:

S1: local_counter = shared_counter++
S2: local_counter = shared_counter++
S1: pwrite()
S2: pwrite()

S1: local_counter = shared_counter++
S2: local_counter = shared_counter++
S2: pwrite()
S1: pwrite()

S1: local_counter = shared_counter++
S1: pwrite()
S2: local_counter = shared_counter++
S2: pwrite()

... plus the same interleavings with S1 and S2 labels swapped. In all 6 orderings, the fd that has the lowest sequence number can see errors relating to write-back of kernel buffers dirtied by both pwrite() calls. Or to put it another way, you can't be given a lower sequence number than another process that has already written, because that other process must have been given a sequence number before it wrote.

-- Thomas Munro http://www.enterprisedb.com
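As an illustration of that rule (the names and the location of the shared counter are invented here, not taken from the patch), the sequence number is captured immediately after open() and before any write through that descriptor:

    #include <fcntl.h>
    #include <sys/stat.h>
    #include "port/atomics.h"

    static uint64 fd_open_seq;      /* remembered alongside the vfd in the patch */

    static int
    open_with_seq(const char *path, int flags, mode_t mode)
    {
        int     fd = open(path, flags, mode);

        if (fd >= 0)
        {
            /* SharedCounters->open_seq is a hypothetical shared atomic */
            fd_open_seq = pg_atomic_fetch_add_u64(&SharedCounters->open_seq, 1);
        }
        return fd;
    }

Whichever descriptor then carries the lowest number was open before every pwrite() that any backend has performed on the file.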
On Fri, Nov 16, 2018 at 5:38 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Or to put it another way, you can't be given a lower sequence number > than another process that has already written, because that other > process must have been given a sequence number before it wrote. OK, that makes sense. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Nov 17, 2018 at 10:53 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Nov 15, 2018 at 5:09 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > While testing this patch with frequent checkpoints I've stumbled upon an > > interesting error, that happened already after I finished one test: > > > > TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574) Fixed in the 0001 patch (and a similar problem in the WIN32 branch). On Thu, Nov 15, 2018 at 10:37 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > > That sounds a little like you are proposing to go back to the way > > > things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and, > > > belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess > > > the division of labor wouldn't be quite the same. > > > > But is there an argument against it? The checkpointer would still be > > creating checkpoints including running fsync, but the background > > writer would be, erm, writing, erm, in the background. > > I don't know. I guess the fact that the checkpointer is still > performing the fsyncs is probably a key point. I mean, in the old > division of labor, fsyncs could interrupt the background writing that > was supposed to be happening. Robert explained off-list that BgBufferSync() and BufferSync() have rather different goals, and performing them in the same loop without major reengineering to merge their logic would probably not work out well. So I'm abandoning that plan for now (though it could perhaps be interesting if done right). I do have a new plan though... On Wed, Nov 14, 2018 at 7:01 AM Andres Freund <andres@anarazel.de> wrote: > ... I've previously wondered whether > there's any way we could delay the write to a point where the buffer is > not locked anymore - as far as I can tell it's actually not required for > correctness that we send the fsync request before unlocking. It's > architecturally a bit dicey tho :( ... and it's basically what Andres said there ^^^. The specific hazard I wondered about is when a checkpoint begins after BufferAlloc() calls pwrite() but before it calls sendto(), so that we fail to fsync() a file that was modified before the checkpoint LSN. But, AFAICS, assuming we call sendto() before we update the buffer header, there are only two possibilities from the point of view of BufferAlloc(): 1. The checkpointer's BufferSync() loop arrives before we update the buffer header, so it sees the buffer as dirty, writes it out (again), remembers that the segment is dirty, and then when we eventually get the buffer header lock we see that it's not dirty anymore and we just skip the buffer. 2. The checkpointer's BufferSync() loop arrives after we updated the buffer header, so it sees it as invalid (or some later state), which means that we have already called sendto() (before we updated the header). Either way, the checkpointer finishes up calling fsync() before the checkpoint completes, as it should, and the worst that can happen due to bad timing is a harmless double pwrite(). I noticed a subtle problem though. Suppose we have case 2 above. After BufferSync() returns in the checkpointer, our backend has called sendto() to register a dirty file. In v2 the checkpointer runs AbsorbAllFsyncRequests() to drain the pipe until it sees a message for the current cycle (that is, it absorbs messages for the previous cycle). 
That's obviously not good enough, since backends race to call sendto() and a message for cycle n - 1 might be hiding behind a message for cycle n. So I propose to drain the pipe until it is empty or we see a message for cycle n + 1 (where n is the current cycle before we start draining, meaning that we ran out of fds and forced a new cycle in FlushFsyncRequestQueueIfNecessary()).

I think that works, because although we can't be sure that we'll receive all messages for n - 1 before we receive a message for n due to races on the insert side, we *can* be certain that we'll receive all messages for n - 1 before we receive a message for n + 1, because we know that they were already in the pipe before we began. In the happy case, our system never runs out of fds so the pipe will eventually be empty, since backends send at most one message per cycle per file and the number of files is finite, and in the sad case, there are too many dirty files per cycle, so we keep creating new cycles while absorbing, but again the loop is bounded because we know that seeing n + 1 is enough for our purpose (which is to fsync all files that were already mentioned in messages sent to the pipe before we started our loop). That's implemented in the 0002 patch, separated for ease of review.

The above theories cover BufferAlloc()'s interaction with a checkpoint, and that seems to be the main story. I'm not entirely sure about the other callers of FlushBuffer() or FlushOneBuffer() (eg special case init fork stuff), but I've run out of brain power for now and wanted to post an update.

-- Thomas Munro http://www.enterprisedb.com
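Restating that absorb loop as a sketch (the helper names and message layout here are invented, not the 0002 patch's actual code):

    /*
     * Drain until the pipe is empty or we see a message from cycle n + 1,
     * where n was the current cycle when we started.  Every message written
     * before we began is for some cycle <= n, so it cannot be hiding behind
     * a cycle n + 1 message.
     */
    uint64      stop_cycle = current_cycle + 1;
    FsyncRequestMessage msg;

    while (read_fsync_message(&msg, false /* don't block */ ))
    {
        remember_fsync_request(&msg);
        if (msg.cycle >= stop_cycle)
            break;
    }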
Attachment
On Fri, Nov 23, 2018 at 5:45 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I do have a new plan though... Ugh. The plan in my previous email doesn't work, I was confused about the timing of the buffer header update. Back to the drawing board. -- Thomas Munro http://www.enterprisedb.com
> On Mon, Nov 26, 2018 at 11:47 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > On Fri, Nov 23, 2018 at 5:45 PM Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > I do have a new plan though... > > Ugh. The plan in my previous email doesn't work, I was confused about > the timing of the buffer header update. Back to the drawing board. Any chance to share the drawing board with the ideas? :) On the serious note, I assume you have plans to work on this during the next CF, right?
On Sun, Dec 2, 2018 at 1:46 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Mon, Nov 26, 2018 at 11:47 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > > On Fri, Nov 23, 2018 at 5:45 PM Thomas Munro
> > > <thomas.munro@enterprisedb.com> wrote:
> > > I do have a new plan though...
> >
> > Ugh. The plan in my previous email doesn't work, I was confused about
> > the timing of the buffer header update. Back to the drawing board.
>
> Any chance to share the drawing board with the ideas? :)
>
> On the serious note, I assume you have plans to work on this during the next
> CF, right?

Indeed I am. Unfortunately, the solution to that deadlock eludes me still. So, I have split this work into multiple patches. 0001 is a draft version of some new infrastructure I'd like to propose, 0002 is the thing originally described by the first two paragraphs in the first email in this thread, and the rest I'll have to defer for now (the fd passing stuff).

To restate the purpose of this work: I want to make it possible for other patches to teach the checkpointer to fsync new kinds of files that are accessed through the buffer pool. Specifically, undo segment files (for zheap) and SLRU files (see Shawn Debnath's plan to put clog et al into the standard buffer pool).

The main changes are:

1. A bunch of stuff moved out of md.c into smgrsync.c, where the same pendingOpTable machinery can be shared by any block storage implementation.

2. The actual fsync'ing now happens by going through smgrsyncimmed().

3. You can now tell the checkpointer to forget individual segments (undo and slru both need to be able to do that when they truncate data from the 'front').

4. The protocol for forgetting relations etc is slightly different: if a file is found to be missing, we call AbsorbFsyncRequests() and then probe to see if the segment number disappeared from the set (instead of cancel flags), though I need to test this case.

5. Requests (ie segment numbers) are now stored in a sorted vector, because it doesn't make sense to store large and potentially sparse integers in bitmapsets. See patch 0001 for new machinery to support that.

The interfaces in 0001 are perhaps a bit wordy and verbose (and hard to fit in 80 columns). Maybe I need something better for memory contexts. Speaking of which, it wasn't possible to do a guaranteed-no-alloc merge (like the one done for zero-anchored bitmapset in commit 1556cb2fc), so I had to add a second vector for 'in progress' segments. I merge them with the main set on the next attempt, if it's found to be non-empty. Very open to better ideas on how to do any of this.

-- Thomas Munro http://www.enterprisedb.com
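As a rough picture of the data structure implied by points 1, 3 and 5 (struct and field names here are illustrative, not taken from the patch): one pending-operations entry per relation fork, carrying sorted vectors of dirty segment numbers.

    typedef struct PendingFsyncEntry
    {
        RelFileNode rnode;          /* hash key, part 1: which relation */
        ForkNumber  forknum;        /* hash key, part 2: which fork */
        segnum_vector requests;     /* sorted segment numbers awaiting fsync */
        segnum_vector in_progress;  /* segments set aside during a sync
                                     * attempt; merged into requests on the
                                     * next attempt, avoiding the need for a
                                     * guaranteed-no-alloc merge */
    } PendingFsyncEntry;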
Attachment
On Tue, Jan 1, 2019 at 10:41 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> So, I have split this work into multiple patches. 0001 is a draft
> version of some new infrastructure I'd like to propose, 0002 is the
> thing originally described by the first two paragraphs in the first
> email in this thread, and the rest I'll have to defer for now (the fd
> passing stuff).

Apologies, there was a header missing from 0002, and a small change needed to a contrib file that I missed. Here is a new version.

For the 0001 patch, I'll probably want to reconsider the naming a bit ("simple" -> "specialized", "generic", ...?), refine it (ability to turn off the small vector optimisation? optional MemoryContext? ability to extend without copying or zero-initialising at the same time? comparators with a user data parameter? two-value comparators vs three-value comparators? qsort with inline comparator? etc etc), and remove some gratuitous C++ cargo cultisms, and maybe also instantiate the thing centrally for some common types (I mean, perhaps 0002 should use a common uint32_vector rather than defining its own segnum_vector?).

I suppose an alternative would be to use simplehash for the set of segment numbers here, but it seems like overkill and would waste a ton of memory in the common case of holding a single number. I wondered also about some kind of tree (basically, C++ set) but it seems much more complicated and would still be much larger for common cases. Sorted vectors seem to work pretty well here (but would lose in theoretical cases where you insert low values in large sets, but not in practice here AFAICS).

-- Thomas Munro http://www.enterprisedb.com
Attachment
On Wed, Jan 2, 2019 at 11:40 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > For the 0001 patch, I'll probably want to reconsider the naming a bit > ("simple -> "specialized", "generic", ...?), refine (ability to turn > off the small vector optimisation? optional MemoryContext? ability > to extend without copying or zero-initialising at the same time? > comparators with a user data parameter? two-value comparators vs > three-value comparators? qsort with inline comparator? etc etc), and > remove some gratuitous C++ cargo cultisms, and maybe also instantiate > the thing centrally for some common types (I mean, perhaps 0002 should > use a common uint32_vector rather than defining its own > segnum_vector?). Here's a new version that fixes a couple of stupid bugs (mainly a broken XXX_lower_bound(), which I replaced with the standard algorithm I see in many sources). I couldn't resist the urge to try porting pg_qsort() to this style. It seems to be about twice as fast as the original at sorting integers on my machine with -O2. I suppose people aren't going to be too enthusiastic about yet another copy of qsort in the tree, but maybe this approach (with a bit more work) could replace the Perl code-gen for tuple sorting. Then the net number of copies wouldn't go up, but this could be used for more things too, and it fits with the style of simplehash.h and simplevector.h. Thoughts? -- Thomas Munro http://www.enterprisedb.com
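As a standalone illustration of the technique (a toy, not simplevector.h or the ported pg_qsort()): when the element type and comparator are fixed by a macro at compile time, the comparisons and element moves can be fully inlined instead of going through a comparator function pointer and memcpy()-sized swaps, which is where a ~2x win on integers typically comes from.

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Toy template: generates a type-specialized in-place sort.  A real
     * template (like the ported pg_qsort()) would generate a proper
     * quicksort; insertion sort just keeps the example short.
     */
    #define DEFINE_SORT(name, type, lt)                     \
    static void                                             \
    name(type *a, size_t n)                                 \
    {                                                       \
        for (size_t i = 1; i < n; i++)                      \
        {                                                   \
            type    v = a[i];                               \
            size_t  j = i;                                  \
                                                            \
            while (j > 0 && lt(v, a[j - 1]))                \
            {                                               \
                a[j] = a[j - 1];                            \
                j--;                                        \
            }                                               \
            a[j] = v;                                       \
        }                                                   \
    }

    #define UINT32_LT(x, y) ((x) < (y))
    DEFINE_SORT(sort_uint32, uint32_t, UINT32_LT)
    /* usage: sort_uint32(array, nelems); */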
Attachment
With the help of VMware's Dirk Hohndel (VMware's Chief Open Source Officer, a VP position near the top of the organization, and a personal friend of Linus), I have been fortunate enough to make contact directly with Linus Torvalds to discuss this issue. In emails to me he has told me that this patch is no longer provisional: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3 Linus has given me permission to quote him, so here is a quote from an email he sent 2019-01-17: > That commit (b4678df184b3: "errseq: Always report a writeback error > once") was already backported to the stable trees (4.14 and 4.16), so > yes, everything should be fine. We did indeed miss old errors for a > while. > > > The latest information I could find on this said this commit was "provisional" but > > also that it might be back-patched to 4.13 and on. Can you clarify the status of > > this patch in either respect? > > It was definitely backported to both 4.14 and 4.16, I see it in my > email archives. > > The bug may remain in 4.13, but that isn't actually maintained any > more, and I don't think any distro uses it (distros tend to use the > long-term stable kernels that are maintained, or sometimes maintain > their own patch queue). I think that eliminates the need for this patch. -- Kevin Grittner VMware vCenter Server https://www.vmware.com/
On Tue, Jan 22, 2019 at 8:27 AM Kevin Grittner <kgrittn@gmail.com> wrote:
> With the help of VMware's Dirk Hohndel (VMware's Chief Open Source
> Officer, a VP position near the top of the organization, and a
> personal friend of Linus), I have been fortunate enough to make
> contact directly with Linus Torvalds to discuss this issue. In emails
> to me he has told me that this patch is no longer provisional:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3
>
> Linus has given me permission to quote him, so here is a quote from an
> email he sent 2019-01-17:
> > That commit (b4678df184b3: "errseq: Always report a writeback error
> > once") was already backported to the stable trees (4.14 and 4.16), so
> > yes, everything should be fine. We did indeed miss old errors for a
> > while.

Sorry, but somehow I got the parent link rather than the intended commit. Linus got it right, of course.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b4678df184b314a2bd47d2329feca2c2534aa12b

-- Kevin Grittner VMware vCenter Server https://www.vmware.com/
Hi,

On 2019-01-22 08:27:48 -0600, Kevin Grittner wrote:
> With the help of VMware's Dirk Hohndel (VMware's Chief Open Source
> Officer, a VP position near the top of the organization, and a
> personal friend of Linus), I have been fortunate enough to make
> contact directly with Linus Torvalds to discuss this issue. In emails
> to me he has told me that this patch is no longer provisional:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3

Unfortunately, unless something has changed recently, that patch is *not* sufficient to really solve the issue - we don't guarantee that there's always an fd preventing the necessary information from being evicted from memory:

"Note that we might still lose the error if the inode gets evicted from the cache before anything can reopen it, but that was the case before errseq_t was merged. At LSF/MM we had some discussion about keeping inodes with unreported writeback errors around in the cache for longer (possibly indefinitely), but that's really a separate problem"

And that's entirely possible in postgres.

The commit was discussed on list too, btw...

Greetings,

Andres Freund
On Tue, Jan 22, 2019 at 12:17 PM Andres Freund <andres@anarazel.de> wrote: > Unfortunately, unless something has changed recently, that patch is > *not* sufficient to really solve the issue - we don't guarantee that > there's always an fd preventing the necessary information from being > evicted from memory: But we can't lose an FD without either closing it or suffering an abrupt termination that would trigger a PANIC, can we? And close() always calls fsync(). And I thought our "PANIC on fsync" patch paid attention to close(). How do you see this happening??? > Note that we might still lose the error if the inode gets evicted from > the cache before anything can reopen it, but that was the case before > errseq_t was merged. At LSF/MM we had some discussion about keeping > inodes with unreported writeback errors around in the cache for longer > (possibly indefinitely), but that's really a separate problem" > > And that's entirely possibly in postgres. Is it possible for an inode to be evicted while there is an open FD referencing it? > The commit was dicussed on list too, btw... Can you point to a post explaining how the inode can be evicted? -- Kevin Grittner VMware vCenter Server https://www.vmware.com/
On Wed, Jan 23, 2019 at 9:29 AM Kevin Grittner <kgrittn@gmail.com> wrote:
> Can you point to a post explaining how the inode can be evicted?

Hi Kevin,

To recap the (admittedly confusing) list of problems with Linux fsync or our usage:

1. On Linux < 4.13, the error state can be cleared in various surprising ways so that we never hear about it. Jeff Layton identified and fixed this problem for 4.13+ by switching from an error flag to an error counter that is tracked in such a way that every fd hears about every error in the file.

2. Layton's changes originally assumed that you only wanted to hear about errors that happened after you opened the file (ie it set the fd's counter to the inode's current level at open time). Craig Ringer complained about this. Everyone complained about this. A fix was then made so that one fd also reports errors that happened before opening, if no one else has seen them yet. This is the change that was back-patched as far as Linux 4.14. So long as no third program comes along and calls fsync on a file that we don't have open anywhere, thereby eating the "not seen" flag before the checkpointer gets around to opening the file, all is well.

3. Regardless of the above changes, we also learned that pages are unceremoniously dropped from the page cache after write-back errors, so that calling fsync() again after a failure is a bad idea (it might report success, but your dirty data written before the previous fsync() call is gone). We handled that by introducing a PANIC after any fsync failure:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1

So did MySQL, MongoDB, and probably everyone else who spat out their cornflakes while reading articles like "PostgreSQL's fsync() surprise" in the Linux Weekly News that resulted from Craig's report:

https://github.com/mysql/mysql-server/commit/8590c8e12a3374eeccb547359750a9d2a128fa6a
https://github.com/wiredtiger/wiredtiger/commit/ae8bccce3d8a8248afa0e4e0cf67674a43dede96

4. Regardless of all of the above changes, there is still one way to lose track of an error, as Andres mentioned: during a period of time when neither the writing backend nor the checkpointer has the file open, the kernel may choose to evict the inode from kernel memory, and thereby forget about an error that we haven't received yet.

Problems 1-3 are solved by changes to Linux and PostgreSQL. Problem 4 would be solved by this "fd-passing" scheme (file descriptors are never closed until after fsync has been called, existing in the purgatory of Unix socket ancillary data until the checkpointer eventually deals with them), but it's complicated and not quite fully baked yet. It could also be solved by the kernel agreeing not to evict inodes that hold error state, or to promote the error to device level, or something like that. IIUC those kinds of ideas were rejected so far. (It can also be solved by using FreeBSD and/or ZFS, so you don't have problem 3 and therefore don't have the other problems.)

I'm not sure how likely that failure mode actually is, but I guess you need a large number of active files, a low PostgreSQL max_safe_fds so we close descriptors aggressively, a kernel that is low on memory or has a high vfs_cache_pressure setting so that it throws out recently used inodes aggressively, enough time between checkpoints for all of the above to happen, and then some IO errors when the kernel is writing back dirty data asynchronously while you don't have the file open anywhere.
-- Thomas Munro http://www.enterprisedb.com
Hi,

On 2019-01-22 14:29:23 -0600, Kevin Grittner wrote:
> On Tue, Jan 22, 2019 at 12:17 PM Andres Freund <andres@anarazel.de> wrote:
>
> > Unfortunately, unless something has changed recently, that patch is
> > *not* sufficient to really solve the issue - we don't guarantee that
> > there's always an fd preventing the necessary information from being
> > evicted from memory:
>
> But we can't lose an FD without either closing it or suffering an
> abrupt termination that would trigger a PANIC, can we? And close()
> always calls fsync(). And I thought our "PANIC on fsync" patch paid
> attention to close(). How do you see this happening???

close() doesn't trigger an fsync() in general (although it does on many NFS implementations), and doing so would be *terrible* for performance. Given that, it's pretty clear how you can get all FDs closed, right? You just need sufficient open files that files get closed due to max_files_per_process, and you can run into the issue. A thousand open files is pretty easy to reach with forks, indexes, partitions etc, so this isn't particularly artificial.

> > Note that we might still lose the error if the inode gets evicted from
> > the cache before anything can reopen it, but that was the case before
> > errseq_t was merged. At LSF/MM we had some discussion about keeping
> > inodes with unreported writeback errors around in the cache for longer
> > (possibly indefinitely), but that's really a separate problem"
> >
> > And that's entirely possible in postgres.
>
> Is it possible for an inode to be evicted while there is an open FD
> referencing it?

No, but we don't guarantee that there's always an FD open, due to the max_files_per_process limit mentioned above.

> > The commit was discussed on list too, btw...
>
> Can you point to a post explaining how the inode can be evicted?

https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de is, I think, a good overview, with a bunch of links. As is the referenced lwn article [1] and the commit message you linked.

Greetings,

Andres Freund

[1] https://lwn.net/Articles/752063/
On Tue, Jan 22, 2019 at 2:38 PM Andres Freund <andres@anarazel.de> wrote: > close() doesn't trigger an fsync() in general What sort of a performance hit was observed when testing the addition of fsync or fdatasync before any PostgreSQL close() of a writable file, or have we not yet checked that? > https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de > is, I think, a good overview, with a bunch of links. Thanks! Will review. -- Kevin Grittner VMware vCenter Server https://www.vmware.com/
Hi,

On 2019-01-22 14:53:11 -0600, Kevin Grittner wrote:
> On Tue, Jan 22, 2019 at 2:38 PM Andres Freund <andres@anarazel.de> wrote:
>
> > close() doesn't trigger an fsync() in general
>
> What sort of a performance hit was observed when testing the addition
> of fsync or fdatasync before any PostgreSQL close() of a writable
> file, or have we not yet checked that?

I briefly played with it, and it was so atrocious (as in, less than something like 0.2x the throughput) that I didn't continue far down that path. The two ways I tried, IIRC (and it's really just from memory), were:

a) Short lived connections that do a bunch of writes to files each. That turns each disconnect into an fsync of most files.

b) Workload with > max_files_per_process files (IIRC I just used a bunch of larger tables with a few indexes each) in a read/write workload that's a bit larger than shared buffers. That led to most file closes being integrity writes, with obvious downsides.

Greetings,

Andres Freund
I (finally) got a chance to go through these patches and they look great. Thank you for working on this! A few comments:

- I do not see SmgrFileTag being defined or used like you mentioned in your first email. I see RelFileNode still being used. Is this planned for the future?

- Would be great to add a set of tests for SimpleVector.

> For the 0001 patch, I'll probably want to reconsider the naming a bit
> ("simple" -> "specialized", "generic", ...?)

I think the name SimpleVector is fine, fits with the SimpleHash theme. If the goal is to shorten it, perhaps a PG prefix would suffice?

> 4. The protocol for forgetting relations etc is slightly different:
> if a file is found to be missing, we call AbsorbFsyncRequests() and then probe
> to see if the segment number disappeared from the set (instead of
> cancel flags), though I need to test this case.

Can you explain this part a bit more? I am likely missing something in the patch.

> I couldn't resist the urge to try porting pg_qsort() to this style.
> It seems to be about twice as fast as the original at sorting integers
> on my machine with -O2. I suppose people aren't going to be too
> enthusiastic about yet another copy of qsort in the tree, but maybe
> this approach (with a bit more work) could replace the Perl code-gen
> for tuple sorting. Then the net number of copies wouldn't go up, but
> this could be used for more things too, and it fits with the style of
> simplehash.h and simplevector.h. Thoughts?

+1 for avoiding duplicate code. Would it be acceptable to migrate the rest of the usages to this model over time perhaps?

I'd love to move this patch forward. I wonder if it might be better to introduce two different functions catering to the two different use cases for forcing an immediate sync:

- sync a relation: smgrimmedsyncrel(SMgrRelation, ForkNumber)
- sync a specific segment: smgrimmedsyncseg(SMgrRelation, ForkNumber, SegmentNumber)

This will avoid having to specify InvalidSegmentNumber for the majority of the callers today.

-- Shawn Debnath Amazon Web Services (AWS)
On Wed, Jan 30, 2019 at 09:59:38PM -0800, Shawn Debnath wrote: > I (finally) got a chance to go through these patches and they look > great. Thank you for working on this! This review is very recent, so I have moved the patch to next CF. -- Michael
Attachment
On Wed, Jan 30, 2019 at 09:59:38PM -0800, Shawn Debnath wrote:
> I wonder if it might be better to introduce two different functions
> catering to the two different use cases for forcing an immediate sync:
>
> - sync a relation: smgrimmedsyncrel(SMgrRelation, ForkNumber)
> - sync a specific segment: smgrimmedsyncseg(SMgrRelation, ForkNumber, SegmentNumber)
>
> This will avoid having to specify InvalidSegmentNumber for the majority of
> the callers today.

I have gone ahead and rebased the refactor patch so it could cleanly apply on heapam.c, see patch v7.

I am also attaching a patch (v8) that implements smgrimmedsyncrel() and smgrimmedsyncseg() as I mentioned in the previous email. It avoids callers having to pass in InvalidSegmentNumber when they just want to sync the whole relation. As a side effect, I was able to get rid of some extra checkpointer.h includes.

-- Shawn Debnath Amazon Web Services (AWS)
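A minimal sketch of the shape of those two wrappers, assuming a three-argument smgrimmedsync() underneath and a SegmentNumber type with an InvalidSegmentNumber sentinel (per the description above; not checked against the v8 patch):

    /* Sync every segment of the given relation fork. */
    void
    smgrimmedsyncrel(SMgrRelation reln, ForkNumber forknum)
    {
        smgrimmedsync(reln, forknum, InvalidSegmentNumber);
    }

    /* Sync one specific segment of the given relation fork. */
    void
    smgrimmedsyncseg(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
    {
        smgrimmedsync(reln, forknum, segno);
    }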
Attachment
On Wed, Feb 13, 2019 at 3:58 PM Shawn Debnath <sdn@amazon.com> wrote:
> On Wed, Jan 30, 2019 at 09:59:38PM -0800, Shawn Debnath wrote:
> > I wonder if it might be better to introduce two different functions
> > catering to the two different use cases for forcing an immediate sync:
> >
> > - sync a relation: smgrimmedsyncrel(SMgrRelation, ForkNumber)
> > - sync a specific segment: smgrimmedsyncseg(SMgrRelation, ForkNumber, SegmentNumber)
> >
> > This will avoid having to specify InvalidSegmentNumber for the majority of
> > the callers today.
>
> I have gone ahead and rebased the refactor patch so it could cleanly
> apply on heapam.c, see patch v7.
>
> I am also attaching a patch (v8) that implements smgrimmedsyncrel() and
> smgrimmedsyncseg() as I mentioned in the previous email. It avoids
> callers having to pass in InvalidSegmentNumber when they just want to
> sync the whole relation. As a side effect, I was able to get rid of
> some extra checkpointer.h includes.

Hi Shawn,

Thanks! And sorry for not replying sooner -- I got distracted by FOSDEM (and the associated 20 thousand miles of travel). On that trip I had a chance to discuss this patch with Andres Freund in person, and he opined that it might be better for the fsync request queue to work in terms of pathnames. Instead of the approach in this patch, where a backend sends an fsync request for { relfilenode, segno } inside mdwrite(), and then the checkpointer processes the request by calling smgrimmedsyncrel(), he speculated that it'd be better to have mdwrite() send an fsync request for a pathname, and then the checkpointer would just open that file by name and fsync() it. That is, the checkpointer wouldn't call back into smgr.

One of the advantages of that approach is that there are probably other files that need to be fsync'd for each checkpoint that could benefit from being offloaded to the checkpointer. Another is that you break the strange cycle mentioned above.

Here's a possible problem with it. The fsync request messages would have to be either large (MAXPGPATH) or variable sized and potentially large. I am a bit worried that such messages would be problematic for the atomicity requirement of the (future) fd-passing patch that passes them via a Unix domain socket (which is a bit tricky because SOCK_SEQPACKET and SOCK_DGRAM aren't portable enough, so we probably have to use SOCK_STREAM, but there is no formal guarantee like PIPE_BUF; we know that in practice small messages will be atomic, but certainty decreases with larger messages. This needs more study...). You need to include the path even in a message containing an fd, because the checkpointer will use that as a hashtable key to merge received requests. Perhaps you'd solve that by using a small tag that can be converted back to a path (as you noticed, my last patch had some leftover dead code from an experiment along those lines), but then I think you'd finish up needing an smgr interface to convert it back to the path (implementation different for md.c, undofile.c, slru.c). So you don't exactly break the cycle mentioned earlier. Hmm, or perhaps you could avoid even thinking about atomicity by passing 1 byte fd-bearing messages via the pipe, and pathnames via shared memory, in the same order.

Another consideration if we do that is that the existing scheme has a kind of hierarchy that allows fsync requests to be cancelled in bulk when you drop relations and databases. That is, the checkpointer knows about the internal hierarchy of tablespace, db, rel, seg.
If we get rid of that and have just paths, it seems like a bad idea to teach the checkpointer about the internal structure of the paths (even though we know they contain the same elements encoded somehow). You'd have to send an explicit cancel for every key; that is, if you're dropping a relation, you need to generate a cancel message for every segment, and if you're dropping a database, you need to generate a cancel message for every segment of every relation. Once again, if you used some kind of tag that is passed back to smgr, you could probably come up with a way to do it. Thoughts? -- Thomas Munro http://www.enterprisedb.com
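For concreteness, the kind of small fixed-size tag being weighed here could look roughly like this (a sketch only, reusing the SegmentNumber type proposed earlier in the thread):

typedef struct FileTag
{
    RelFileNode rnode;          /* tablespace, database, relation */
    ForkNumber  forknum;
    SegmentNumber segno;        /* which segment of the fork */
} FileTag;

/*
 * A struct like this is on the order of 20 bytes -- far below PIPE_BUF and
 * MAXPGPATH -- so it can travel in a single small message, serve directly
 * as a hash table key for merging requests, and be converted back into a
 * pathname only when the checkpointer actually needs to (re)open the file.
 */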
Hi, On 2019-02-13 18:40:05 +1300, Thomas Munro wrote: > Thanks! And sorry for not replying sooner -- I got distracted by > FOSDEM (and the associated 20 thousand miles of travel). On that trip > I had a chance to discuss this patch with Andres Freund in person, and > he opined that it might be better for the fsync request queue to work > in terms of pathnames. Instead of the approach in this patch, where a > backend sends an fsync request for { reflfilenode, segno } inside > mdwrite(), and then the checkpointer processes the request by calling > smgrdimmedsyncrel(), he speculated that it'd be better to have > mdwrite() send an fsync request for a pathname, and then the > checkpointer would just open that file by name and fsync() it. That > is, the checkpointer wouldn't call back into smgr. > > One of the advantages of that approach is that there are probably > other files that need to be fsync'd for each checkpoint that could > benefit from being offloaded to the checkpointer. Another is that you > break the strange cycle mentioned above. The other issue is that I think your approach moves the segmentation logic basically out of md into smgr. I think that's wrong. We shouldn't presume that every type of storage is going to have segmentation that's representable in a uniform way imo. > Another consideration if we do that is that the existing scheme has a > kind of hierarchy that allows fsync requests to be cancelled in bulk > when you drop relations and databases. That is, the checkpointer > knows about the internal hierarchy of tablespace, db, rel, seg. If we > get rid of that and have just paths, it seems like a bad idea to teach > the checkpointer about the internal structure of the paths (even > though we know they contain the same elements encoded somehow). You'd > have to send an explicit cancel for every key; that is, if you're > dropping a relation, you need to generate a cancel message for every > segment, and if you're dropping a database, you need to generate a > cancel message for every segment of every relation. I can't see that being a problem - compared to the overhead of dropping a relation, that doesn't seem to be a meaningfully large cost? Greetings, Andres Freund
On Fri, Feb 15, 2019 at 06:45:02PM -0800, Andres Freund wrote: > > One of the advantages of that approach is that there are probably > > other files that need to be fsync'd for each checkpoint that could > > benefit from being offloaded to the checkpointer. Another is that you > > break the strange cycle mentioned above. > > The other issue is that I think your approach moves the segmentation > logic basically out of md into smgr. I think that's wrong. We shouldn't > presume that every type of storage is going to have segmentation that's > representable in a uniform way imo. I had a discussion with Thomas on this and am working on a new version of the patch that incorporates what you guys discussed at FOSDEM, but avoiding passing pathnames to checkpointer. The mdsync machinery will be moved out of md.c and the pending ops table will incorporate the segment number as part of the key. Still deciding on how to cleanly re-factor _mdfd_getseg, which mdsync utilizes during the file sync operations. The ultimate goal is to get the checkpointer the file descriptor it can use to issue the fsync using FileSync. So perhaps a function in smgr that returns just that based on the RelFileNode, fork and segno combination. Dealing only with file descriptors will allow us to implement passing FDs to checkpointer directly as part of the request in the future. The goal is to encapsulate relation specific knowledge within md.c while allowing undo and generic block store (ex-SLRU) to do their own mapping within the smgr layer later. Yes, checkpointer will "call back" into smgr, but these calls would only retrieve information that should be managed by smgr, allowing the checkpointer to focus on its job of tracking requests and syncing files via the fd interfaces. > > Another consideration if we do that is that the existing scheme has a > > kind of hierarchy that allows fsync requests to be cancelled in bulk > > when you drop relations and databases. That is, the checkpointer > > knows about the internal hierarchy of tablespace, db, rel, seg. If we > > get rid of that and have just paths, it seems like a bad idea to teach > > the checkpointer about the internal structure of the paths (even > > though we know they contain the same elements encoded somehow). You'd > > have to send an explicit cancel for every key; that is, if you're > > dropping a relation, you need to generate a cancel message for every > > segment, and if you're dropping a database, you need to generate a > > cancel message for every segment of every relation. > > I can't see that being a problem - compared to the overhead of dropping > a relation, that doesn't seem to be a meaningfully large cost? With the scheme above, dropping hierarchies will require scanning the hash table for matching dboid or reloid and removing those entries. We do this today for FORGET_DATABASE_FSYNC in RememberFsyncRequest. The matching function will belong in smgr. We can see how scanning the whole hash table impacts performance and iterate on it from there if needed. -- Shawn Debnath Amazon Web Services (AWS)
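A sketch of that idea, with hypothetical names (neither smgrsyncfd nor the smgr_syncfd callback appear in any posted patch):

/*
 * Given the relation, fork and segment, return a File the checkpointer can
 * pass straight to FileSync().  md.c would implement the callback with
 * _mdfd_getseg()-style lookup; undo and generic block storage would do
 * their own mapping.
 */
File
smgrsyncfd(RelFileNode rnode, ForkNumber forknum, SegmentNumber segno)
{
    SMgrRelation reln = smgropen(rnode, InvalidBackendId);

    return smgrsw[reln->smgr_which].smgr_syncfd(reln, forknum, segno);
}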
As promised, here's a patch that addresses the points discussed by Andres and Thomas at FOSDEM. As a result of how we want checkpointer to track what files to fsync, the pending ops table now integrates the forknum and segno as part of the hash key, eliminating the need for the bitmapsets or vectors from the previous iterations. We re-construct the pathnames from the RelFileNode, ForkNumber and SegmentNumber and use PathNameOpenFile to get the file descriptor to use for fsync.

Apart from that, this patch moves the system for requesting and processing fsyncs out of md.c into smgr.c, allowing us to call on smgr component specific callbacks to retrieve metadata like relation and segment paths. This allows smgr components to maintain how relfilenodes, forks and segments map to specific files without exposing this knowledge to smgr. It redefines smgrsync() behavior to be closer to that of smgrimmedsync(), i.e., if a regular sync is required for a particular file, enqueue it locally or forward it to the checkpointer. smgrimmedsync() retains the existing behavior and fsyncs the file right away. The processing of fsync requests has been moved from mdsync() to a new ProcessFsyncRequests() function.

Testing
-------
Checkpointer stats didn't cover what I wanted to verify, i.e., time spent dealing with the pending operations table. So I added temporary instrumentation to get the numbers by timing the code in ProcessFsyncRequests, which starts by absorbing fsync requests from the checkpointer queue, processing them and finally issuing sync on the files. Similarly, I added the same instrumentation in the mdsync code in the master branch. The time to actually execute FileSync is irrelevant for this patch. I did two separate runs for 30 mins, both with scale=10,000 on i3.8xlarge instances [1] with default params to force frequent checkpoints:

1. A single pgbench run having 1000 clients update 4 tables; as a result we get 4 relations and their forks, with several segments in each being synced.

2. 10 parallel pgbench runs on 10 separate databases having 200 clients each. This results in more relations and more segments being touched, letting us better compare against the bitmapset optimizations.

Results
--------
The important metric to look at would be the total time spent absorbing and processing the fsync requests as that's what the changes revolve around. The other metrics are here for posterity. The new code is about 6% faster in total time taken to process the queue for the single pgbench run. For the 10x parallel pgbench run, we are seeing drops of up to 70% with the patch. Would be great if some other folks can verify this.

The temporary instrumentation patches for the master branch and one that applies after the main patch are attached. Enable log_checkpoints and then use grep and cut to extract the numbers from the log file after the runs.
[Requests Absorbed]

single pgbench run
         Min     Max      Average    Median   Mode    Std Dev
-------- ------- -------- ---------- -------- ------- ----------
patch    15144   144961   78628.84   76124    58619   24135.69
master   25728   138422   81455.04   80601    25728   21295.83

10 parallel pgbench runs
         Min      Max      Average     Median   Mode     Std Dev
-------- -------- -------- ----------- -------- -------- ----------
patch    45098    282158   155969.4    151603   153049   39990.91
master   191833   602512   416533.86   424946   191833   82014.48

[Files Synced]

single pgbench run
         Min   Max   Average   Median   Mode   Std Dev
-------- ----- ----- --------- -------- ------ ---------
patch    153   166   158.11    158      159    1.86
master   154   166   158.29    159      159    10.29

10 parallel pgbench runs
         Min    Max    Average   Median   Mode   Std Dev
-------- ------ ------ --------- -------- ------ ---------
patch    1540   1662   1556.42   1554     1552   11.12
master   1546   1546   1546      1559     1553   12.79

[Total Time in ProcessFsyncRequest/mdsync]

single pgbench run
         Min   Max       Average   Median   Mode   Std Dev
-------- ----- --------- --------- -------- ------ ---------
patch    500   3833.51   2305.22   2239     500    510.08
master   806   4430.32   2458.77   2382     806    497.01

10 parallel pgbench runs
         Min    Max     Average    Median   Mode   Std Dev
-------- ------ ------- ---------- -------- ------ ---------
patch    908    6927    3022.58    2863     908    939.09
master   4323   17858   10982.15   11154    4322   2760.47

[1] https://aws.amazon.com/ec2/instance-types/i3/

--
Shawn Debnath
Amazon Web Services (AWS)
Attachment
Hi, Thanks for the update! On 2019-02-20 15:27:40 -0800, Shawn Debnath wrote: > As promised, here's a patch that addresses the points discussed by > Andres and Thomas at FOSDEM. As a result of how we want checkpointer to > track what files to fsync, the pending ops table now integrates the > forknum and segno as part of the hash key eliminating the need for the > bitmapsets, or vectors from the previous iterations. We re-construct the > pathnames from the RelFileNode, ForkNumber and SegmentNumber and use > PathNameOpenFile to get the file descriptor to use for fsync. I still object to exposing segment numbers to smgr and above. I think that's an implementation detail of the various storage managers, and we shouldn't expose smgr.c further than it already is. > Apart from that, this patch moves the system for requesting and > processing fsyncs out of md.c into smgr.c, allowing us to call on smgr > component specific callbacks to retrieve metadata like relation and > segment paths. This allows smgr components to maintain how relfilenodes, > forks and segments map to specific files without exposing this knowledge > to smgr. It redefines smgrsync() behavior to be closer to that of > smgrimmedsysnc(), i.e., if a regular sync is required for a particular > file, enqueue it in locally or forward it to checkpointer. > smgrimmedsync() retains the existing behavior and fsyncs the file right > away. The processing of fsync requests has been moved from mdsync() to a > new ProcessFsyncRequests() function. I think that's also wrong, imo fsyncs are storage detail, and should be relegated to *below* md.c. That's not to say the code should live in md.c, but the issuing of such calls shouldn't be part of smgr.c - consider e.g. developing a storage engine living in non volatile ram. Greetings, Andres Freund
On Thu, Feb 21, 2019 at 12:50 PM Andres Freund <andres@anarazel.de> wrote: > Thanks for the update! +1, thanks Shawn. It's interesting that you measure performance improvements that IIUC can come only from dropping the bitmapset stuff (or I guess bugs). I don't understand the mechanism for that improvement yet. The idea of just including the segment number (in some representation, possibly opaque to this code) in the hash table key instead of carrying a segment list seems pretty good from here (and I withdraw the sorted vector machinery I mentioned earlier as it's redundant for this project)... except for one detail. In your patch, you still have FORGET_RELATION_FSYNC and FORGET_DATABASE_FSYNC. That relies on this sync manager code being able to understand which stuff to drop from the hash table, which means that it still has knowledge of the key hierarchy, so that it can match all entries for the relation or database. That's one of the things that Andres is arguing against. I can see how to fix that for the relation case: md.c could simply enqueue a FORGET_REQUEST message for every segment and fork in the relation, so they can be removed by O(1) hash table lookup. I can't immediately see how to deal with the database case though. Perhaps in a tag scheme, you'd have to make the tag mostly opaque, except for a DB OID field, so you can handle this case? (Or some kind of predicate callback, so you can say "does this tag cancel this other tag?"; seems over the top). > On 2019-02-20 15:27:40 -0800, Shawn Debnath wrote: > > As promised, here's a patch that addresses the points discussed by > > Andres and Thomas at FOSDEM. As a result of how we want checkpointer to > > track what files to fsync, the pending ops table now integrates the > > forknum and segno as part of the hash key eliminating the need for the > > bitmapsets, or vectors from the previous iterations. We re-construct the > > pathnames from the RelFileNode, ForkNumber and SegmentNumber and use > > PathNameOpenFile to get the file descriptor to use for fsync. > > I still object to exposing segment numbers to smgr and above. I think > that's an implementation detail of the various storage managers, and we > shouldn't expose smgr.c further than it already is. Ok by me. The solution to this is probably either raw paths (as Andres has suggested, but which seem problematic to me -- see below), or some kind of small fixed size tag type that is morally equivalent to a path and can be converted to a path but is more convenient for shoving through pipes and into hash tables. > > Apart from that, this patch moves the system for requesting and > > processing fsyncs out of md.c into smgr.c, allowing us to call on smgr > > component specific callbacks to retrieve metadata like relation and > > segment paths. This allows smgr components to maintain how relfilenodes, > > forks and segments map to specific files without exposing this knowledge > > to smgr. It redefines smgrsync() behavior to be closer to that of > > smgrimmedsysnc(), i.e., if a regular sync is required for a particular > > file, enqueue it in locally or forward it to checkpointer. > > smgrimmedsync() retains the existing behavior and fsyncs the file right > > away. The processing of fsync requests has been moved from mdsync() to a > > new ProcessFsyncRequests() function. > > I think that's also wrong, imo fsyncs are storage detail, and should be > relegated to *below* md.c.
> That's not to say the code should live in > md.c, but the issuing of such calls shouldn't be part of smgr.c - > consider e.g. developing a storage engine living in non volatile ram. How about we take all that sync-related stuff, that Shawn has moved from md.c into smgr.c, and my earlier patch had moved out of md.c into smgrsync.c, and we put it into a new place src/backend/storage/file/sync.c? Or somewhere else, but not under smgr. It doesn't know anything about smgr concepts, and it can be used to schedule file sync for any kind of file, not just the smgr implementations' files. Though they'd be the main customers, I guess. md.c and undofile.c etc would call it to register stuff, and checkpointer.c would call it to actually perform all the fsync calls. If we do that, the next question is how to represent filenames. One idea is to use tags that can be converted back to a path. I suppose there could be a table of function pointers somewhere, and the tag could be a discriminated union? Or just a discriminator + a small array of dumb uint32_t of a size big enough for all users, a bit like lock tags. One reason to want to use small fixed-sized tags is to avoid atomicity problems in the future when it comes to the fd-passing work, as mentioned up-thread. Here are some other ideas, to avoid having to use tags: * send the paths through a shm queue, but the fds through the Unix domain socket; the messages carrying fds somehow point to the pathname in the shm queue (and deal with slight disorder...) * send the paths through the socket, but hold an LWLock while doing so to make sure it's atomic, no matter what the size * somehow prove that it's really already atomic even for long paths, on every operating system we support, and that it'll never change, so there is no problem here Another problem with variable sized pathnames even without the future fd-passing work is that it's harder to size the shm queue: the current code sets max_requests to NBuffers, which makes some kind of sense because that's a hard upper bound on the number of dirty segments there could possibly be at a given moment in time (one you'll practically never hit), after deduplication. It's harder to come up with a decent size for a new variable-sized-message queue; MAXPGPATH * NBuffers would be insanely large (it'd be 1/8th the size of the buffer pool), but if you make it any smaller there is no guarantee that compacting it can create space. Perhaps the solution to that is simply to block/wait while the shm queue is full -- but that might have deadlock problems.
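Spelled out, the discriminated-tag idea might take a shape like this (a sketch; the *_tagtopath functions are the per-storage-manager implementations mentioned above, and all names are placeholders):

typedef enum SyncTagKind
{
    SYNC_TAG_MD,                /* md.c relation segments */
    SYNC_TAG_UNDO,              /* undofile.c */
    SYNC_TAG_SLRU               /* slru.c */
} SyncTagKind;

typedef struct SyncTag
{
    SyncTagKind kind;           /* selects the tag-to-path routine */
    uint32      data[4];        /* opaque payload, a bit like a lock tag */
} SyncTag;

/* Implemented by the respective storage code, not by sync.c. */
extern void md_tagtopath(const SyncTag *tag, char *path);
extern void undofile_tagtopath(const SyncTag *tag, char *path);
extern void slru_tagtopath(const SyncTag *tag, char *path);

typedef void (*TagToPathFn) (const SyncTag *tag, char *path);

static const TagToPathFn tag_to_path_funcs[] = {
    [SYNC_TAG_MD] = md_tagtopath,
    [SYNC_TAG_UNDO] = undofile_tagtopath,
    [SYNC_TAG_SLRU] = slru_tagtopath
};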
> It's interesting that you measure performance improvements that IIUC > can come only from dropping the bitmapset stuff (or I guess bugs). I > don't understand the mechanism for that improvement yet. I will be digging into this a bit more to understand what really is the cause for the improvements. But first, need to get the refactor patch in better shape :-) > The idea of just including the segment number (in some representation, > possibly opaque to this code) in the hash table key instead of > carrying a segment list seems pretty good from here (and I withdraw > the sorted vector machinery I mentioned earlier as it's redundant for > this project)... except for one detail. In your patch, you still have > FORGET_RELATION_FSYNC and FORGET_DATABASE_FSYNC. That relies on this > sync manager code being able to understand which stuff to drop from > the hash table, which means that is still has knowledge of the key > hierarchy, so that it can match all entries for the relation or > database. That's one of the things that Andres is arguing against. You are correct. I actually did mention having a callback to do the request resolution in an response to Andres back up in the thread, but oops, completely slipped my mind with my last patch. > How about we take all that sync-related stuff, that Shawn has moved > from md.c into smgr.c, and my earlier patch had moved out of md.c into > smgrsync.c, and we put it into a new place > src/backend/storage/file/sync.c? Or somewhere else, but not under > smgr. It doesn't know anything about smgr concepts, and it can be > used to schedule file sync for any kind of file, not just the smgr > implementations' files. Though they'd be the main customers, I guess. > md.c and undofile.c etc would call it to register stuff, and > checkpointer.c would call it to actually perform all the fsync calls. > > If we do that, the next question is how to represent filenames. One > idea is to use tags that can be converted back to a path. I suppose > there could be a table of function pointers somewhere, and the tag > could be a discriminated union? Or just a descriminator + a small > array of dumb uint32_t of a size big enough for all users, a bit like > lock tags. > > One reason to want to use small fixed-sized tags is to avoid atomicity > problems in the future when it comes to the fd-passing work, as > mentioned up-thread. Here are some other ideas, to avoid having to > use tags: > > * send the paths through a shm queue, but the fds through the Unix > domain socket; the messages carrying fds somehow point to the pathname > in the shm queue (and deal with slight disorder...) > * send the paths through the socket, but hold an LWLock while doing so > to make sure it's atomic, no matter what the size > * somehow prove that it's really already atomic even for long paths, > on every operating system we support, and that it's never change, so > there is no problem here > > Another problem with variable sized pathnames even without the future > fd-passing work is that it's harder to size the shm queue: the current > code sets max_requests to NBuffers, which makes some kind of sense > because that's a hard upper bound on the number of dirty segments > there could possibly be at a given moment in time (one you'll > practically never hit), after deduplication. 
> It's harder to come up > with a decent size for a new variable-sized-message queue; MAXPGPATH * > NBuffers would be insanely large (it's be 1/8th the size of the buffer > pool), but if you make it any smaller there is no guarantee that > compacting it can create space. Perhaps the solution to that is > simply to block/wait while shm queue is full -- but that might have > deadlock problems. I think I have a much better understanding of what Andres is envisioning and agree with what Thomas has said so far. To summarize, we want a "sync" component at the level of fd, that components higher up the chain like md, undo, slru and checkpointer will use to track and process fsync requests (I am refraining from putting in an ascii diagram here!). These checkpoint requests will be opaque to the sync machinery and will rely on requesters to provide the necessary details. I agree with Thomas in that I don't think passing full pathnames or variable pathnames is the right way to go, for all the reasons Thomas mentioned in his email. However, if we want to, in the future we can easily extend the checkpoint request to include a type, CHECKPOINT_REQUEST_FD or CHECKPOINT_REQUEST_PATH, and delegate the current relfilenode to be of type CHECKPOINT_REQUEST_RELFILENODE. Sync can then act on the requests based on the type, and in some cases wouldn't need to interact with any other component. The pieces of information we need to process fsyncs are (1) determine if a request is to invalidate other requests the queue currently holds and (2) determine the full path to the file to issue fsync on. I think using callbacks is the better path forward than having md or other components issue an invalidate request for each and every segment, which can get quite heavy handed for large databases. Performance would be the same as today since we already scan the entire hash table when we encounter a forget request. This new approach will involve one additional function call inside the loop which does a simple compare. And Thomas brought up a good point offline: if we followed the path of smgr for the callbacks, it would lead to a header file dependency nightmare. It would be best for components like md to register their callback functions with sync so that sync doesn't have to include higher level header files to get access to their prototypes. At the time of smgrinit(), mdinit() would call into sync and register its callbacks with an ID. We can use the same value that we are using for smgr_which to map the callbacks. Each fsync request will then also carry this ID, which the sync mechanism will use to call handlers for resolving forget requests or obtaining paths for files. I think this should satisfy Andres' requirement for not teaching smgr anything about segmentation while keeping sync related knowledge far below the smgr layer. This scheme works well for undo and generic block storage. Though we might have to teach immedsync to accept block numbers so that undo and other storage managers can determine what segment it maps to. Thoughts? I am going to get started with revising the patch unless I hear otherwise. And love the great feedback btw, thank you. -- Shawn Debnath Amazon Web Services (AWS)
Hi,
While working on another PostgreSQL feature, I was thinking that we could use a temporal table in PostgreSQL. Some existing databases offer this. I searched for any discussion on the PostgreSQL mailing list, but could not find any. Maybe my search wasn’t accurate enough: if anyone can point me to a discussion, that would be useful.
https://www.percona.com/community-blog/2018/12/14/notes-mariadb-system-versioned-tables/
https://www.mssqltips.com/sqlservertip/3680/introduction-to-sql-server-temporal-tables/
What?
A temporal table feature has two tables “Temporal Table” and “History Table”. The Temporal Table is where our current tuples are stored. This is the main table, just like other PostgreSQL tables. The history table is the other half of the feature and is where all the history of the main table is stored. This table is created automatically. The history table is used to query certain data at a certain time, useful for a point in time analysis. It also offers built-in versioning.
Why?
Normally users write triggers or procedures to write a history of a table’s data. Some time-sensitive applications will have code to write a data history somewhere. By having this functionality, PostgreSQL would do it automatically. For example, if we have a retail table where the price of each product inventory item is stored. The temporal table would hold the current price of the product. When we update the price of a product in the temporal table, then a new row with a timestamp would be added to the history table. That means on each update of the price, a new row containing the previous price would be added to the history table. The same would apply in the case of deletes. When we delete any product from our inventory, then a row would be added to the history table storing the last price of the product prior to delete. For any point in time, we can access the price at which we sold the product.
How?
I was thinking about the implementation of this feature and read the documentation on the internet. Microsoft SQL Server, for example, offers such a feature. If we come to the conclusion we should offer the feature, I will share the complete design.
Here are some ideas I have around this:
- Syntax.
CREATE TABLE tablename
(
…
start_time DATETIME,
end_time DATETIME,
PERIOD FOR SYSTEM_TIME (start_time, end_time)
)
WITH
(
SYSTEM_VERSIONING = ON (HISTORY_TABLE = tablename_history)
);
The tablename is the temporal table and tablename_history is the history table. The name of the history table is optional; if it is omitted, PostgreSQL will generate a table name. The two columns “start_time” and “end_time” are required for a temporal table. The PERIOD FOR SYSTEM_TIME clause is used to identify these columns.
ALTER TABLE SET SYSTEM_VERSIONING = ON/OFF
Due to this syntax addition in CREATE/ALTER TABLE, there are some grammar additions required in the parser.
PERIOD FOR SYSTEM TIME
SYSTEM VERSIONING
- Catalog Changes.
There are two options: one is to have a new catalog, pg_temporal, which will contain this information; alternatively, we could store that information in the existing pg_catalog tables.
Table "public.pg_temporal"
Column | Type | Collation | Nullable | Default
-----------------+------+-----------+----------+---------
temporal_id | oid | | |
hist_id | oid | | |
start_date_name | text | | |
end_date_name | text | | |
--
Ibrar Ahmed
On Fri, Feb 22, 2019 at 15:41, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote: > > While working on another PostgreSQL feature, I was thinking that we could use a temporal table in PostgreSQL. Some existing databases offer this. I searched for any discussion on the PostgreSQL mailing list, but could not find any. Maybe my search wasn’t accurate enough: if anyone can point me to a discussion, that would be useful. > https://www.postgresql.org/message-id/CA%2BrenyUb%2BXHzsrPHHR6ELqguxaUPGhOPyVc7NW%2BkRsRpBZuUFQ%40mail.gmail.com This is the last one. I don't know why it wasn't in the January CF. -- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consulting, Development, 24x7 Support and Training
On 2/22/19 11:31 AM, Euler Taveira wrote: > On Fri, Feb 22, 2019 at 15:41, Ibrar Ahmed > <ibrar.ahmad@gmail.com> wrote: >> >> While working on another PostgreSQL feature, I was thinking that we could use a temporal table in PostgreSQL. Some existing databases offer this. I searched for any discussion on the PostgreSQL mailing list, but could not find any. Maybe my search wasn’t accurate enough: if anyone can point me to a discussion, that would be useful. >> > https://www.postgresql.org/message-id/CA%2BrenyUb%2BXHzsrPHHR6ELqguxaUPGhOPyVc7NW%2BkRsRpBZuUFQ%40mail.gmail.com > > This is the last one. I don't know why it wasn't in the January CF. Oh that's by me! :-) I didn't put it into the CF because I wanted to get some feedback on primary keys before I got too far into foreign keys, but someone recently advised me to start adding to CFs anyway with "WIP" in the title, so I'll do that next time. Btw my own patch is very modest, and I'd love to see this other much more extensive patch get some attention: https://www.postgresql.org/message-id/flat/CAHO0eLYyvuqwF%3D2FsgDn1xOs_NOrFBu9Xh-Wq%2BaWfFy0y6%3DjWQ%40mail.gmail.com#4f7fbace3a2f2ce85fcc161cc3fdd273 They were told to adjust where in the query pipeline they do their work, and the latest patch does that (as I understand it), but I don't think anyone has looked at it yet. Both of these patches use range types rather than SQL:2011 PERIODs, but I'd like to *also* support PERIODs (and accept ranges everywhere we accept PERIODs). Vik Fearing already has a patch to let you *declare* PERIODs: https://www.postgresql-archive.org/Periods-td6022563.html Actually using PERIODs in queries seems like a decent chunk of work though: basically it means making our grammar & processing accept PERIODs anywhere they currently accept columns. I'd love to hear some thoughts/suggestions around that. For example: a PERIOD is *similar* to a GENERATED column, so maybe the work being done there can/should influence how we implement them. I'm excited to be getting some momentum around temporal features though! I'm supposed to give a talk about them at PGCon in Ottawa this spring, so hopefully that will help too. Yours, -- Paul ~{:-) pj@illuminatedcomputing.com
On Fri, Feb 22, 2019 at 18:16, Paul Jungwirth <pj@illuminatedcomputing.com> wrote: > > On 2/22/19 11:31 AM, Euler Taveira wrote: > > On Fri, Feb 22, 2019 at 15:41, Ibrar Ahmed > > <ibrar.ahmad@gmail.com> wrote: > >> > >> While working on another PostgreSQL feature, I was thinking that we could use a temporal table in PostgreSQL. Some existing databases offer this. I searched for any discussion on the PostgreSQL mailing list, but could not find any. Maybe my search wasn’t accurate enough: if anyone can point me to a discussion, that would be useful. > >> > > https://www.postgresql.org/message-id/CA%2BrenyUb%2BXHzsrPHHR6ELqguxaUPGhOPyVc7NW%2BkRsRpBZuUFQ%40mail.gmail.com > > > > This is the last one. I don't know why it wasn't in the January CF. > > Oh that's by me! :-) > Forgot to CC you. > I didn't put it into the CF because I wanted to get some feedback on > primary keys before I got too far into foreign keys, but someone > recently advised me to starting adding to CFs anyway with "WIP" in the > title, so I'll do that next time. > Getting some feedback is one of the CF goals. Even if you have just a WIP, that CF feedback could help you solve/improve some pieces of your current code. > Btw my own patch is very modest, and I'd love to see this other much > more extensive patch get some attention: > > https://www.postgresql.org/message-id/flat/CAHO0eLYyvuqwF%3D2FsgDn1xOs_NOrFBu9Xh-Wq%2BaWfFy0y6%3DjWQ%40mail.gmail.com#4f7fbace3a2f2ce85fcc161cc3fdd273 > It isn't in the CF 2019-03. If you want it to be reviewed you should add it. At this point, both patches should target v13. -- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consulting, Development, 24x7 Support and Training
Hi, On 2019-02-22 10:18:57 -0800, Shawn Debnath wrote: > I think using callbacks is the better path forward than having md or > other components issue an invalidate request for each and every segment > which can get quite heavy handed for large databases. I'm not sure I buy this. Unlinking files isn't cheap, involves many disk writes, etc. The cost of an inval request in comparison isn't particularly large. Especially for relation-level (rather than database level) truncation, per-segment invals will likely commonly be faster than the sequential scan. > At the time of smgrinit(), mdinit() would call into sync and register > it's callbacks with an ID. We can use the same value that we are using > for smgr_which to map the callbacks. Each fsync request will then also > accompany this ID which the sync mechanism will use to call handlers for > resolving forget requests or obtaining paths for files. I'm not seeing a need to do this dynamically at runtime. Given that smgr isn't extensible, why don't we just map callbacks (or even just some switch based logic) based on some enum? Doing things at *init time has more potential to go wrong, because say a preload_shared_library does different things in postmaster than normal backends (in EXEC_BACKEND cases). Besides those two points, I think this is going in a good direction! Greetings, Andres Freund
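A sketch of that statically-mapped alternative (all names here are illustrative; nothing is registered at init time, so the postmaster and EXEC_BACKEND children cannot end up with different mappings):

typedef enum SyncRequestOwner
{
    SYNC_OWNER_MD,              /* md.c relation segments */
    SYNC_OWNER_UNDO,            /* undo file storage */
    SYNC_OWNER_SLRU             /* generic block storage (ex-SLRU) */
} SyncRequestOwner;

/* sync.c resolves the owner with a plain switch, no callbacks stored. */
static void
sync_request_path(SyncRequestOwner owner, const FileTag *tag, char *path)
{
    switch (owner)
    {
        case SYNC_OWNER_MD:
            mdfilepath(tag, path);
            break;
        case SYNC_OWNER_UNDO:
            undofilepath(tag, path);
            break;
        case SYNC_OWNER_SLRU:
            slrufilepath(tag, path);
            break;
    }
}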
On Sat, Feb 23, 2019 at 11:15 AM Andres Freund <andres@anarazel.de> wrote: > On 2019-02-22 10:18:57 -0800, Shawn Debnath wrote: > > I think using callbacks is the better path forward than having md or > > other components issue an invalidate request for each and every segment > > which can get quite heavy handed for large databases. > > I'm not sure I buy this. Unlinking files isn't cheap, involves many disk > writes, etc. The cost of an inval request in comparison isn't > particularly large. Especially for relation-level (rather than database > level) truncation, per-segment invals will likely commonly be faster > than the sequential scan. Well even if you do it with individual segment cancel messages for relations, you still need a way to deal with whole-database drops (generating the cancels for every segment in every relation in the database would be nuts), and that means either exposing some structure to the requests, right? So the requests would have { request type, callback ID, db, opaque tag }, where request type is SYNC, CANCEL, CANCEL_WHOLE_DB, callback ID is used to find the function that converts opaque tags to paths, and db is used for handling CANCEL_WHOLE_DB requests where you need to scan the whole hash table. Right? > > At the time of smgrinit(), mdinit() would call into sync and register > > it's callbacks with an ID. We can use the same value that we are using > > for smgr_which to map the callbacks. Each fsync request will then also > > accompany this ID which the sync mechanism will use to call handlers for > > resolving forget requests or obtaining paths for files. > > I'm not seeing a need to do this dynamically at runtime. Given that smgr > isn't extensible, why don't we just map callbacks (or even just some > switch based logic) based on some enum? Doing things at *init time has > more potential to go wrong, because say a preload_shared_library does > different things in postmaster than normal backends (in EXEC_BACKEND > cases). Yeah I suggested dynamic registration to avoid the problem that eg src/backend/storage/sync.c otherwise needs to forward declare md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe #include <storage/md.h> etc, which seemed like exactly the sort of thing up with which you would not put. (Which reminds me, smgr.{c,h} suffers from a similar disease, as the comment says: "move me elsewhere -- ay 7/94"). > Besides those two points, I think this is going in a good direction! Phew. :-) -- Thomas Munro https://enterprisedb.com
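Written out, a request along those lines might look like this (a sketch; the field and enum names are placeholders, not from any posted patch):

typedef enum SyncRequestOp
{
    SYNC_REQUEST_SYNC,          /* fsync this file at the next checkpoint */
    SYNC_REQUEST_CANCEL,        /* drop a previously queued request */
    SYNC_REQUEST_CANCEL_WHOLE_DB    /* drop everything for one database */
} SyncRequestOp;

typedef struct SyncRequest
{
    SyncRequestOp op;
    int         callback_id;    /* which tag-to-path function to use */
    Oid         dboid;          /* only consulted for CANCEL_WHOLE_DB */
    FileTag     tag;            /* opaque to the checkpointer */
} SyncRequest;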
Hi, On 2019-02-23 11:42:49 +1300, Thomas Munro wrote: > On Sat, Feb 23, 2019 at 11:15 AM Andres Freund <andres@anarazel.de> wrote: > > On 2019-02-22 10:18:57 -0800, Shawn Debnath wrote: > > > I think using callbacks is the better path forward than having md or > > > other components issue an invalidate request for each and every segment > > > which can get quite heavy handed for large databases. > > > > I'm not sure I buy this. Unlinking files isn't cheap, involves many disk > > writes, etc. The cost of an inval request in comparison isn't > > particularly large. Especially for relation-level (rather than database > > level) truncation, per-segment invals will likely commonly be faster > > than the sequential scan. > > Well even if you do it with individual segment cancel messages for > relations, you still need a way to deal with whole-database drops > (generating the cancels for every segment in every relation in the > database would be nuts), and that means either exposing some structure > to the requests, right? So the requests would have { request type, > callback ID, db, opaque tag }, where request type is SYNC, CANCEL, > CANCEL_WHOLE_DB, callback ID is used to find the function that > converts opaque tags to paths, and db is used for handling > CANCEL_WHOLE_DB requests where you need to scan the whole hash table. > Right? I'm ok with using callbacks to allow pruning for things like droping databases. If we use callbacks, I don't see a need to explicitly include the db in the request however? The callback can look into the opaque tag, no? Also, why do we need a separation between request type and callback? That seems like it'll commonly be entirely redundant? > > > At the time of smgrinit(), mdinit() would call into sync and register > > > it's callbacks with an ID. We can use the same value that we are using > > > for smgr_which to map the callbacks. Each fsync request will then also > > > accompany this ID which the sync mechanism will use to call handlers for > > > resolving forget requests or obtaining paths for files. > > > > I'm not seeing a need to do this dynamically at runtime. Given that smgr > > isn't extensible, why don't we just map callbacks (or even just some > > switch based logic) based on some enum? Doing things at *init time has > > more potential to go wrong, because say a preload_shared_library does > > different things in postmaster than normal backends (in EXEC_BACKEND > > cases). > > Yeah I suggested dynamic registration to avoid the problem that eg > src/backend/storage/sync.c otherwise needs to forward declare > md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe > #include <storage/md.h> etc, which seemed like exactly the sort of > thing up with which you would not put. I'm not sure I understand. If we have a few known tag types, what's the problem with including the headers with knowledge of how to implement them into sync.c file? Greetings, Andres Freund
On Sat, Feb 23, 2019 at 11:48 AM Andres Freund <andres@anarazel.de> wrote: > > Yeah I suggested dynamic registration to avoid the problem that eg > > src/backend/storage/sync.c otherwise needs to forward declare > > md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe > > #include <storage/md.h> etc, which seemed like exactly the sort of > > thing up with which you would not put. > > I'm not sure I understand. If we have a few known tag types, what's the > problem with including the headers with knowledge of how to implement > them into sync.c file? Typo in my previous email: src/backend/storage/file/sync.c was my proposal for the translation unit holding this stuff (we don't have .c files directly under storage). But it didn't seem right that stuff under storage/file (things concerned with files) should know about stuff under storage/smgr (md.c functions, higher level smgr stuff). Perhaps that just means it should go into a different subdir, maybe src/backend/storage/sync/sync.c, that knows about files AND smgr stuff, or perhaps I'm worrying about nothing. -- Thomas Munro https://enterprisedb.com
Hi, On 2019-02-23 11:59:04 +1300, Thomas Munro wrote: > On Sat, Feb 23, 2019 at 11:48 AM Andres Freund <andres@anarazel.de> wrote: > > > Yeah I suggested dynamic registration to avoid the problem that eg > > > src/backend/storage/sync.c otherwise needs to forward declare > > > md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe > > > #include <storage/md.h> etc, which seemed like exactly the sort of > > > thing up with which you would not put. > > > > I'm not sure I understand. If we have a few known tag types, what's the > > problem with including the headers with knowledge of how to implement > > them into sync.c file? > > Typo in my previous email: src/backend/storage/file/sync.c was my > proposal for the translation unit holding this stuff (we don't have .c > files directly under storage). But it didn't seem right that stuff > under storage/file (things concerned with files) should know about > stuff under storage/smgr (md.c functions, higher level smgr stuff). > Perhaps that just means it should go into a different subdir, maybe > src/backend/storage/sync/sync.c, that knows about files AND smgr > stuff, or perhaps I'm worrying about nothing. I mean, if you have a md_tagtopath and need its behaviour, I don't understand why a local forward declaration changes things? But I also think this is a bit of a non-problem - the abbreviated path names are just a different representation of paths, I don't see a problem of having that in sync.[ch], especially if we have the option to also have a full-length pathname variant if we ever need that. Greetings, Andres Freund
> > Well even if you do it with individual segment cancel messages for > > relations, you still need a way to deal with whole-database drops > > (generating the cancels for every segment in every relation in the > > database would be nuts), and that means either exposing some structure > > to the requests, right? So the requests would have { request type, > > callback ID, db, opaque tag }, where request type is SYNC, CANCEL, > > CANCEL_WHOLE_DB, callback ID is used to find the function that > > converts opaque tags to paths, and db is used for handling > > CANCEL_WHOLE_DB requests where you need to scan the whole hash table. > > Right? > > I'm ok with using callbacks to allow pruning for things like droping > databases. If we use callbacks, I don't see a need to explicitly include > the db in the request however? The callback can look into the opaque > tag, no? Also, why do we need a separation between request type and > callback? That seems like it'll commonly be entirely redundant? By having the id to distinguish which smgr to use for callbacks, we don't need DB. You are correct, my plan was to make the whole request opaque to the sync mechanism. ForwardFsyncRequest will take requester smgr id, request type (forget [db/rel], sync), relfilenode, forknum, and segno and convert it into a opaque CheckpointRequest and queue it locally. The only responsibility here is to map the different pieces of data into the opaque CheckpointerRequest. Requester ID in combination with request type will help us look up which callback function to execute. A new enum for requester ID is perfectly okay. I was trying to recycle the smgr id but perhaps that's not the right approach. > > > > Yeah I suggested dynamic registration to avoid the problem that > > > > eg > > > > src/backend/storage/sync.c otherwise needs to forward declare > > > > md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe > > > > #include <storage/md.h> etc, which seemed like exactly the sort of > > > > thing up with which you would not put. > > > > > > I'm not sure I understand. If we have a few known tag types, what's the > > > problem with including the headers with knowledge of how to implement > > > them into sync.c file? > > > > Typo in my previous email: src/backend/storage/file/sync.c was my > > proposal for the translation unit holding this stuff (we don't have .c > > files directly under storage). But it didn't seem right that stuff > > under storage/file (things concerned with files) should know about > > stuff under storage/smgr (md.c functions, higher level smgr stuff). > > Perhaps that just means it should go into a different subdir, maybe > > src/backend/storage/sync/sync.c, that knows about files AND smgr > > stuff, or perhaps I'm worrying about nothing. > > I mean, if you have a md_tagtopath and need its behaviour, I don't > understand why a local forward declaration changes things? But I also > think this is a bit of a non-problem - the abbreviated path names are > just a different representation of paths, I don't see a problem of > having that in sync.[ch], especially if we have the option to also have > a full-length pathname variant if we ever need that. Today, all the callbacks for smgr have their prototypes defined in smgr.h and used in smgr.c. Forward declarations within the new sync.h would be fine but having md callbacks split in two different places may not be the cleanest approach? One could argue they serve different purposes so perhaps it correct that we define them separately. 
I am fine with either, but now I probably prefer the new enum to fixed function declaration mappings that Andres suggested. I agree it would be less error prone. -- Shawn Debnath Amazon Web Services (AWS)
On 2019-02-22 15:45:50 -0800, Shawn Debnath wrote: > > > Well even if you do it with individual segment cancel messages for > > > relations, you still need a way to deal with whole-database drops > > > (generating the cancels for every segment in every relation in the > > > database would be nuts), and that means either exposing some structure > > > to the requests, right? So the requests would have { request type, > > > callback ID, db, opaque tag }, where request type is SYNC, CANCEL, > > > CANCEL_WHOLE_DB, callback ID is used to find the function that > > > converts opaque tags to paths, and db is used for handling > > > CANCEL_WHOLE_DB requests where you need to scan the whole hash table. > > > Right? > > > > I'm ok with using callbacks to allow pruning for things like droping > > databases. If we use callbacks, I don't see a need to explicitly include > > the db in the request however? The callback can look into the opaque > > tag, no? Also, why do we need a separation between request type and > > callback? That seems like it'll commonly be entirely redundant? > > By having the id to distinguish which smgr to use for callbacks, we > don't need DB. You are correct, my plan was to make the whole request > opaque to the sync mechanism. ForwardFsyncRequest will take requester > smgr id, request type (forget [db/rel], sync), relfilenode, forknum, and > segno and convert it into a opaque CheckpointRequest and queue it > locally. The only responsibility here is to map the different pieces of > data into the opaque CheckpointerRequest. Requester ID in combination > with request type will help us look up which callback function to > execute. > > A new enum for requester ID is perfectly okay. I was trying to recycle > the smgr id but perhaps that's not the right approach. I think it'd be a bad idea to use the smgr id here. So +1 for separating. > > > > > Yeah I suggested dynamic registration to avoid the problem that > > > > > eg > > > > > src/backend/storage/sync.c otherwise needs to forward declare > > > > > md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe > > > > > #include <storage/md.h> etc, which seemed like exactly the sort of > > > > > thing up with which you would not put. > > > > > > > > I'm not sure I understand. If we have a few known tag types, what's the > > > > problem with including the headers with knowledge of how to implement > > > > them into sync.c file? > > > > > > Typo in my previous email: src/backend/storage/file/sync.c was my > > > proposal for the translation unit holding this stuff (we don't have .c > > > files directly under storage). But it didn't seem right that stuff > > > under storage/file (things concerned with files) should know about > > > stuff under storage/smgr (md.c functions, higher level smgr stuff). > > > Perhaps that just means it should go into a different subdir, maybe > > > src/backend/storage/sync/sync.c, that knows about files AND smgr > > > stuff, or perhaps I'm worrying about nothing. > > > > I mean, if you have a md_tagtopath and need its behaviour, I don't > > understand why a local forward declaration changes things? But I also > > think this is a bit of a non-problem - the abbreviated path names are > > just a different representation of paths, I don't see a problem of > > having that in sync.[ch], especially if we have the option to also have > > a full-length pathname variant if we ever need that. > > Today, all the callbacks for smgr have their prototypes defined in > smgr.h and used in smgr.c. 
> Forward declarations within the new sync.h > would be fine but having md callbacks split in two different places may > not be the cleanest approach? One could argue they serve different > purposes so perhaps it correct that we define them separately. I am > fine with either, but now I probably prefer the new enum to fixed > function declaration mappings that Andres suggested. I agree it would be > less error prone. I'd really just entirely separate the two. I'd not include any knowledge of this mechanism into smgr.h, and just make the expansion of paths statically dispatched (potentially into md.c) to actually do the work. I'm not sure why sync.h would need any forward declarations for that? Greetings, Andres Freund
On Fri, Feb 22, 2019 at 03:57:42PM -0800, Andres Freund wrote: > > Today, all the callbacks for smgr have their prototypes defined in > > smgr.h and used in smgr.c. Forward declarations within the new sync.h > > would be fine but having md callbacks split in two different places may > > not be the cleanest approach? One could argue they serve different > > purposes so perhaps it correct that we define them separately. I am > > fine with either, but now I probably prefer the new enum to fixed > > function declaration mappings that Andres suggested. I agree it would be > > less error prone. > > I'd really just entirely separate the two. I'd not include any knowledge > of this mechanism into smgr.h, and just make the expansion of paths > dispatch statically dispatched (potentially into md.c) to actually do > the work. I'm not sure why sync.h would need any forward declarations > for that? We had a quick offline discussion to get on the same page and we agreed to move forward with Andres' approach above. Attached is patch v10. Here's the overview of the patch:

1. Move the system for requesting and processing fsyncs out of md.c into storage/sync/sync.c with definitions in include/storage/sync.h. ProcessSyncRequests() is now responsible for processing the sync requests during checkpoint.

2. Removed the need for specific storage managers to implement pre and post checkpoint callbacks. These are now executed by the sync mechanism.

3. We now embed the fork number and the segment number as part of the hash key for the pending ops table. This eliminates the bitmapset based segment tracking for each relfilenode during fsync, as not all storage managers may map their segments from zero.

4. Each sync request now must include a type: sync, forget, forget hierarchy, or unlink, and the owner who will be responsible for generating paths or matching forget requests.

5. For cancelling relation sync requests, we now must send a forget request for each fork and segment in the relation.

6. We do not rely on smgr to provide the file descriptor we use to issue fsync. Instead, we generate the full path based on the FileTag in the sync request and use PathNameOpenFile to get the file descriptor.

Ran make check-world and repeated the tests described in [1]. The numbers show a 12% drop in total time for a single run of 1000 clients and a ~62% drop in total time for 10 parallel runs with 200 clients:

[Requests Absorbed]
                Min      Max      Average     Median    Std Dev
--------------- -------- -------- ----------- --------- ----------
patch-1x1000    17554    147410   85252.95    83835.5   21898.88
master-1x1000   25728    138422   81455.04    80601     21295.83

                Min      Max      Average     Median    Std Dev
--------------- -------- -------- ----------- --------- ----------
patch-10x200    125922   279682   197098.76   197055    34038.25
master-10x200   191833   602512   416533.86   424946    82014.48

[Files Synced]
                Min    Max    Average   Median   Std Dev
--------------- ------ ------ --------- -------- ---------
patch-1x1000    155    213    158.93    159      2.97
master-1x1000   154    166    158.29    159      10.29

                Min    Max    Average   Median   Std Dev
--------------- ------ ------ --------- -------- ---------
patch-10x200    1456   1577   1546.84   1547     11.23
master-10x200   1546   1546   1546      1559     12.79

[Total Time in ProcessFsyncRequest/mdsync]
                Min    Max       Average    Median   Std Dev
--------------- ------ --------- ---------- -------- ---------
patch-1x1000    606    4022      2145.11    2123     440.72
master-1x1000   806    4430.32   2458.77    2382     497.01

                Min    Max       Average    Median   Std Dev
--------------- ------ --------- ---------- -------- ---------
patch-10x200    2472   6960      4156.8     4141     817.56
master-10x200   4323   17858     10982.15   11154    2760.47

[1] https://www.postgresql.org/message-id/20190220232739.GA8280%40f01898859afd.ant.amazon.com

--
Shawn Debnath
Amazon Web Services (AWS)
Attachment
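To make the shape of the v10 interface concrete, here is a rough sketch (not taken from the patch; error handling and the queue-full fallback are elided) of how a storage manager could queue an fsync for a dirty segment, using the ForwardSyncRequest()/RememberSyncRequest() calls quoted in the review below:

/*
 * Hypothetical sketch only: register a dirty segment with the new sync
 * machinery.  Function and enum names follow the v10 patch as quoted in
 * the review messages below; the real patch also deals with a full
 * checkpointer queue.
 */
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
    FileTag     tag;

    tag.rnode = reln->smgr_rnode.node;
    tag.forknum = forknum;
    tag.segno = seg->mdfd_segno;

    if (IsSyncManagedLocally())
    {
        /* standalone backend or startup process: fsync state is local */
        RememberSyncRequest(tag, SYNC_REQUEST, SYNC_MD);
    }
    else
    {
        /* tell the checkpointer, waiting if its queue is full */
        while (!ForwardSyncRequest(tag, SYNC_REQUEST, SYNC_MD))
            pg_usleep(10000L);
    }
}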
On Thu, Feb 28, 2019 at 10:27 AM Shawn Debnath <sdn@amazon.com> wrote: > We had a quick offline discussion to get on the same page and we agreed > to move forward with Andres' approach above. Attached is patch v10. > Here's the overview of the patch: Thanks. I will review, and try to rebase my undo patches on top of this and see what problems I crash into. > Ran make check-world and repeated the tests described in [1]. The > numbers show a 12% drop in total time for single run of 1000 clients and > ~62% drop in total time for 10 parallel runs with 200 clients: Hmm, good but unexpected. Will poke at this here. -- Thomas Munro https://enterprisedb.com
On Thu, Feb 28, 2019 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Feb 28, 2019 at 10:27 AM Shawn Debnath <sdn@amazon.com> wrote: > > We had a quick offline discussion to get on the same page and we agreed > > to move forward with Andres' approach above. Attached is patch v10. > > Here's the overview of the patch: > > Thanks. I will review, and try to rebase my undo patches on top of > this and see what problems I crash into. Some initial feedback: @@ -8616,7 +8617,7 @@ CreateCheckPoint(int flags) * the REDO pointer. Note that smgr must not do anything that'd have to * be undone if we decide no checkpoint is needed. */ - smgrpreckpt(); + PreCheckpoint(); I would call this and the "post" variant something like SyncPreCheckpoint(). Otherwise it's too general sounding and not clear which module it's in. +static const int NSync = lengthof(syncsw); Unused. + fd = PathNameOpenFile(path, O_RDWR | PG_BINARY); Needs to be closed. + path = syncsw[entry->owner].sync_filepath(entry->ftag); Probably doesn't make much difference, but wouldn't it be better for the path to be written into a caller-supplied buffer of size MAXPGPATH? Then we could have that on the stack instead of alloc/free for every path. Hmm, mdfilepath() needs to use GetRelationPath(), and that already returns palloc'd memory. Oh well. + entry->canceled = true; Now that we killed the bitmapset, I wonder if we still need this canceled flag. What if we just removed the entry from the hash table? If you killed the canceled flag you could then replace this: + if (entry->canceled) + break; .. with another hash table probe to see if the entry went in the AbsorbSyncRequests() call (having first copied the key into a local variable since of course "entry" may have been freed). Or maybe you don't think that's better, I dunno, just an idea :-) +ForwardSyncRequest(FileTag ftag, SyncRequestType type, SyncRequestOwner owner) Is it a deliberate choice that you pass FileTag objects around by value? Rather than, say, pointer to const. Not really a complaint in the current coding since it's a small object anyway (not much bigger than a pointer), but I guess someone might eventually want to make it into a flexible sized object, or something, I dunno. -ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum) +ForgetRelationSyncRequests(RelFileNode rnode, ForkNumber forknum, + SegmentNumber segno) { - if (pendingOpsTable) + FileTag tag; + + tag.rnode = rnode; + tag.forknum = forknum; + tag.segno = segno; + + if (IsSyncManagedLocally()) { /* standalone backend or startup process: fsync state is local */ - RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC); + RememberSyncRequest(tag, FORGET_REQUEST, SYNC_MD); } ... You left this and similar functions in md.c, but I think it needs to move out to sync.c, and just take a FileTag directly. Otherwise I have to write similar functions in undofile.c, and it seems kinda weird that those modules are worrying about whether sync is managed locally or the message needs to be sent to the checkpointer, and even worse, they have to duplicate the loop that deals with ForwardSyncRequest() failing and retrying. Concretely I'm saying that sync.c should define a function like this: +/* + * PostSyncRequest + * + * Remember locally, or post to checkpointer as appropriate. 
+ */ +void +PostSyncRequest(FileTag tag, SyncRequestType type, SyncRequestOwner owner) +{ + if (IsSyncManagedLocally()) + { + /* standalone backend or startup process: fsync state is local */ + RememberSyncRequest(tag, type, owner); + } + else if (IsUnderPostmaster) + { + while (!ForwardSyncRequest(tag, FORGET_REQUEST, SYNC_MD)) + pg_usleep(10000L); /* 10 msec seems a good number */ + } +} Hmm, perhaps it would need to take an argument to say whether it should keep retrying or return false if it fails; that way register_dirty_segment() could perform the FileSync() itself if the queue is full, but register_unlink could tell it to keep trying. Does this make sense? +typedef enum syncrequesttype +{ + SYNC_REQUEST, + FORGET_REQUEST, + FORGET_HIERARCHY_REQUEST, + UNLINK_REQUEST +} syncrequesttype; These names are too generic for the global C namespace; how about SYNC_REQ_CANCEL or similar? +typedef enum syncrequestowner +{ + SYNC_MD = 0 /* md smgr */ +} syncrequestowner; I have a feeling that Andres wanted to see a single enum combining both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD, SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you have it. +/* md callback forward declarations */ +extern char* mdfilepath(FileTag ftag); +extern bool mdtagmatches(FileTag ftag, FileTag predicate, SyncRequestType type); It's weird that these ^ are declared in sync.h. I think they should be declared in a new header md.h and that should be included by sync.c. I know that we have this historical weird thing there md.c's functions are declared in smgr.h, but we should eventually fix that. More soon. -- Thomas Munro https://enterprisedb.com
Hi, On 2019-03-01 23:17:27 +1300, Thomas Munro wrote: > @@ -8616,7 +8617,7 @@ CreateCheckPoint(int flags) > * the REDO pointer. Note that smgr must not do anything that'd have to > * be undone if we decide no checkpoint is needed. > */ > - smgrpreckpt(); > + PreCheckpoint(); > > I would call this and the "post" variant something like > SyncPreCheckpoint(). Otherwise it's too general sounding and not > clear which module it's in. Definitely. > +typedef enum syncrequesttype > +{ > + SYNC_REQUEST, > + FORGET_REQUEST, > + FORGET_HIERARCHY_REQUEST, > + UNLINK_REQUEST > +} syncrequesttype; > > These names are too generic for the global C namespace; how about > SYNC_REQ_CANCEL or similar? > > +typedef enum syncrequestowner > +{ > + SYNC_MD = 0 /* md smgr */ > +} syncrequestowner; > > I have a feeling that Andres wanted to see a single enum combining > both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD, > SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you > have it. Obviously it's nicer looking this way, but OTOH, that means we have to send more data over the queue, because we can't easily combine the request + "owner". I don't have too strong feelings about it though. FWIW, I don't like the name owner here. Class? Method? Greetings, Andres Freund
On Fri, Mar 1, 2019 at 12:43 PM Andres Freund <andres@anarazel.de> wrote: > Obviously it's nicer looking this way, but OTOH, that means we have to > send more data over the queue, because we can't easily combine the > request + "owner". I don't have too strong feelings about it though. Yeah, I would lean toward combining those. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 01, 2019 at 01:15:21PM -0500, Robert Haas wrote: > > > > > > +typedef enum syncrequestowner > > > +{ > > > + SYNC_MD = 0 /* md smgr */ > > > +} syncrequestowner; > > > > > > I have a feeling that Andres wanted to see a single enum combining > > > both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD, > > > SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you > > > have it. > > > > Obviously it's nicer looking this way, but OTOH, that means we have to > > send more data over the queue, because we can't easily combine the > > request + "owner". I don't have too strong feelings about it though. > > Yeah, I would lean toward combining those. I disagree, at least with combining and retaining enums. Encoding all the possible request types for the current, planned and future SMGRs would cause a sheer explosion in the number of enum values. Not to mention that you would have multiple enum values for the same behavior -- which just isn't clean. And handling of these enums in the code would be ugly too. Do note that these are typedef'ed to uint8 currently. For a default config with 128 MB shared_buffers, we would use an extra 16kB (one extra byte to represent the owner). I am hesitant to change this right now unless folks feel strongly about it. If so, I would combine the type and owner by splitting them into 4-bit chunks, allowing for 16 request types and 16 smgrs. This change would only apply to the in-memory queue. The code and functions would continue using the enums. -- Shawn Debnath Amazon Web Services (AWS)
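For illustration, a minimal standalone sketch of the 4-bit packing Shawn describes (the names are hypothetical, not from any patch version):

#include <stdint.h>
#include <assert.h>

/*
 * Pack a 4-bit request type and a 4-bit owner into one byte for the
 * checkpointer's in-memory queue; unpack them again on the receiving
 * side.  Illustrative only.
 */
static inline uint8_t
pack_sync_request(uint8_t type, uint8_t owner)
{
    assert(type < 16 && owner < 16);
    return (uint8_t) ((owner << 4) | type);
}

static inline uint8_t
unpack_sync_type(uint8_t packed)
{
    return packed & 0x0F;
}

static inline uint8_t
unpack_sync_owner(uint8_t packed)
{
    return packed >> 4;
}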
On Sat, Mar 2, 2019 at 8:36 AM Shawn Debnath <sdn@amazon.com> wrote: > On Fri, Mar 01, 2019 at 01:15:21PM -0500, Robert Haas wrote: > > > > > > > > +typedef enum syncrequestowner > > > > +{ > > > > + SYNC_MD = 0 /* md smgr */ > > > > +} syncrequestowner; > > > > > > > > I have a feeling that Andres wanted to see a single enum combining > > > > both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD, > > > > SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you > > > > have it. > > > > > > Obviously it's nicer looking this way, but OTOH, that means we have to > > > send more data over the queue, because we can't easily combine the > > > request + "owner". I don't have too strong feelings about it though. > > > > Yeah, I would lean toward combining those. > > I disagree, at least with combining and retaining enums. Encoding all > the possible request types with the current, planned and future SMGRs > would cause a sheer explosion in the number of enum values. Not to > mention that you have multiple enum values for the same behavior - which > just isn't clean. And handling of these enums in the code would be ugly > too. > > Do note that these are typedef'ed to uint8 currently. For a default > config with 128 MB shared_buffers, we will use an extra 16kB (one extra > byte to represent the owner). I am hesitant to change this right now > unless folks feel strongly about it. > > If so, I would combine the type and owner by splitting it up in 4 bit > chunks, allowing for 16 request types and 16 smgrs. This change would > only apply for the in-memory queue. The code and functions would > continue using the enums. +1 -- Thomas Munro https://enterprisedb.com
On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote: > I disagree, at least with combining and retaining enums. Encoding all > the possible request types with the current, planned and future SMGRs > would cause a sheer explosion in the number of enum values. How big of an explosion would it be? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote: > On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote: > > I disagree, at least with combining and retaining enums. Encoding all > > the possible request types with the current, planned and future SMGRs > > would cause a sheer explosion in the number of enum values. > > How big of an explosion would it be? 4 enum values x # of smgrs; currently md, soon undo and slru so 12 in total. Any future smgr addition will expand this further. -- Shawn Debnath Amazon Web Services (AWS)
On Sat, Mar 2, 2019 at 9:35 AM Shawn Debnath <sdn@amazon.com> wrote: > On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote: > > On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote: > > > I disagree, at least with combining and retaining enums. Encoding all > > > the possible request types with the current, planned and future SMGRs > > > would cause a sheer explosion in the number of enum values. > > > > How big of an explosion would it be? > > 4 enum values x # of smgrs; currently md, soon undo and slru so 12 in > total. Any future smgr addition will expand this further. It's not so much the "explosion" that bothers me. I think we should have a distinct sync requester enum, because we need a way to index into the table of callbacks. How exactly you pack the two enums into compact space seems like a separate question; doing it with two words would obviously be wasteful, but it should be possible to stuff them into (say) a single uint8_t, uint16_t or whatever will pack nicely in the request struct and allow the full range of request types (4?) + the full range of sync requesters (which we propose to expand to 3 in the foreseeable future). Now perhaps the single enum idea was going to involve explicit values that encode the two values, e.g. SYNC_REQ_CANCEL_MD = 0x1 | (0x04 << 4), so you could still extract the requester part, but that's just the same thing with uglier code. -- Thomas Munro https://enterprisedb.com
On Fri, Mar 1, 2019 at 3:35 PM Shawn Debnath <sdn@amazon.com> wrote: > On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote: > > On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote: > > > I disagree, at least with combining and retaining enums. Encoding all > > > the possible request types with the current, planned and future SMGRs > > > would cause a sheer explosion in the number of enum values. > > > > How big of an explosion would it be? > > 4 enum values x # of smgrs; currently md, soon undo and slru so 12 in > total. Any future smgr addition will expand this further. I thought the idea was that each smgr might have a different set of requests. If they're all going to have the same set of requests then I agree with you. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 01, 2019 at 09:42:56AM -0800, Andres Freund wrote: > FWIW, I don't like the name owner here. Class? Method? How about handler? Similar to when looking up callback functions in lookup_index_am_handler_func()? -- Shawn Debnath Amazon Web Services (AWS)
On Fri, Mar 01, 2019 at 04:27:47PM -0500, Robert Haas wrote: > On Fri, Mar 1, 2019 at 3:35 PM Shawn Debnath <sdn@amazon.com> wrote: > > On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote: > > > On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote: > > > > I disagree, at least with combining and retaining enums. Encoding all > > > > the possible request types with the current, planned and future SMGRs > > > > would cause a sheer explosion in the number of enum values. > > > > > > How big of an explosion would it be? > > > > 4 enum values x # of smgrs; currently md, soon undo and slru so 12 in > > total. Any future smgr addition will expand this further. > > I thought the idea was that each smgr might have a different set of > requests. If they're all going to have the same set of requests then > I agree with you. Yeah, in this particular case and at this layer, the operations are consistent across all storage managers, in that, they want to queue a new sync request for a specific file, forget an already queued request, forget a hierarchy of requests, or unlink a specific file. The fun is at the smgr layer which was discussed in a sub-thread in the "Drop type smgr" thread started by Thomas. I started on a patch and will be sending it out after the refactor patch is revised. -- Shawn Debnath Amazon Web Services (AWS)
On Fri, Mar 01, 2019 at 11:17:27PM +1300, Thomas Munro wrote: > - smgrpreckpt(); > + PreCheckpoint(); > > I would call this and the "post" variant something like > SyncPreCheckpoint(). Otherwise it's too general sounding and not > clear which module it's in. Sure - fixed. > +static const int NSync = lengthof(syncsw); > > Unused. Good catch - interesting how there was no warning thrown for this. > + fd = PathNameOpenFile(path, O_RDWR | PG_BINARY); > > Needs to be closed. *ahem* fixed > > + path = syncsw[entry->owner].sync_filepath(entry->ftag); > > Probably doesn't make much difference, but wouldn't it be better for > the path to be written into a caller-supplied buffer of size > MAXPGPATH? Then we could have that on the stack instead of alloc/free > for every path. > > Hmm, mdfilepath() needs to use GetRelationPath(), and that already > returns palloc'd memory. Oh well. Yep - I tried to muck with this as well but the logic is too neatly encapsulated inside relpath.c and given the usages via relpath[perm|backend] - I chose to forgo the attempt. > > + entry->canceled = true; > > Now that we killed the bitmapset, I wonder if we still need this > canceled flag. What if we just removed the entry from the hash table? > If you killed the canceled flag you could then replace this: > > + if (entry->canceled) > + break; > > .. with another hash table probe to see if the entry went in the > AbsorbSyncRequests() call (having first copied the key into a local > variable since of course "entry" may have been freed). Or maybe you > don't think that's better, I dunno, just an idea :-) It seems safer and cleaner to have the canceled flag as other approaches don't really give us any gain. But this made me realize that I should also cancel the entry for the simple forget case instead of removing it from the hash table. Fixed and modified the for loop to break if entry was cancelled. Deletion of the entry now happens in one place - after it has been processed or skipped inside ProcessSyncRequests. > +ForwardSyncRequest(FileTag ftag, SyncRequestType type, SyncRequestOwner owner) > > Is it a deliberate choice that you pass FileTag objects around by > value? Rather than, say, pointer to const. Not really a complaint in > the current coding since it's a small object anyway (not much bigger > than a pointer), but I guess someone might eventually want to make it > into a flexible sized object, or something, I dunno. It was deliberate to match what we are doing today. You have a good point regarding future modifications. May as well get the API changed correctly the first time around. Changed. > -ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum) > +ForgetRelationSyncRequests(RelFileNode rnode, ForkNumber forknum, > + SegmentNumber segno) > { > - if (pendingOpsTable) > + FileTag tag; > + > + tag.rnode = rnode; > + tag.forknum = forknum; > + tag.segno = segno; > + > + if (IsSyncManagedLocally()) > { > /* standalone backend or startup process: fsync state > is local */ > - RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC); > + RememberSyncRequest(tag, FORGET_REQUEST, SYNC_MD); > } > ... > > You left this and similar functions in md.c, but I think it needs to > move out to sync.c, and just take a FileTag directly. 
Otherwise I > have to write similar functions in undofile.c, and it seems kinda > weird that those modules are worrying about whether sync is managed > locally or the message needs to be sent to the checkpointer, and even > worse, they have to duplicate the loop that deals with > ForwardSyncRequest() failing and retrying. Concretely I'm saying that > sync.c should define a function like this: > > +/* > + * PostSyncRequest > + * > + * Remember locally, or post to checkpointer as appropriate. > + */ > +void > +PostSyncRequest(FileTag tag, SyncRequestType type, SyncRequestOwner owner) > +{ > + if (IsSyncManagedLocally()) > + { > + /* standalone backend or startup process: fsync state > is local */ > + RememberSyncRequest(tag, type, owner); > + } > + else if (IsUnderPostmaster) > + { > + while (!ForwardSyncRequest(tag, FORGET_REQUEST, SYNC_MD)) > + pg_usleep(10000L); /* 10 msec seems a > good number */ > + } > +} > > Hmm, perhaps it would need to take an argument to say whether it > should keep retrying or return false if it fails; that way > register_dirty_segment() could perform the FileSync() itself if the > queue is full, but register_unlink could tell it to keep trying. Does > this make sense? Yeah - makes sense. I did have it on my list, but I think I wanted to minimize changes to md and wanted to wait for a use case. Though it seems clear it's needed and looks like you already ran into it with undo. Added a new function RegisterSyncRequest that checks for local table or forwards to checkpointer. Accepts a parameter retryOnError which does what it says or exits on failure. > +typedef enum syncrequesttype > +{ > + SYNC_REQUEST, > + FORGET_REQUEST, > + FORGET_HIERARCHY_REQUEST, > + UNLINK_REQUEST > +} syncrequesttype; > > These names are too generic for the global C namespace; how about > SYNC_REQ_CANCEL or similar? Yeah, again, was trying to match what we had before. I prefixed these enums with SYNC to have a more readable set: SYNC_REQUEST SYNC_FORGET_REQUEST SYNC_FORGET_HIERARCHY_REQUEST SYNC_UNLINK_REQUEST Except for the hierarchy one, they are pretty reasonable in length. > +typedef enum syncrequestowner > +{ > + SYNC_MD = 0 /* md smgr */ > +} syncrequestowner; > > I have a feeling that Andres wanted to see a single enum combining > both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD, > SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you > have it. This was tackled in a sub-thread. I am keeping the enums but changing the the checkpointer queue to persist the type and owner combo in 8 bits. > +/* md callback forward declarations */ > +extern char* mdfilepath(FileTag ftag); > +extern bool mdtagmatches(FileTag ftag, FileTag predicate, > SyncRequestType type); > > It's weird that these ^ are declared in sync.h. I think they should > be declared in a new header md.h and that should be included by > sync.c. I know that we have this historical weird thing there md.c's > functions are declared in smgr.h, but we should eventually fix that. Long story short - I wanted to avoid having md.h include sync.h for the FileTag definitions. But - it's cleaner this way. Changed. -- Shawn Debnath Amazon Web Services (AWS)
Attachment
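Based on the description above, the new entry point presumably looks something like the following sketch (the exact signature and semantics are whatever the attached v11 patch defines; this only restates the behaviour described in the message):

/*
 * Sketch, not patch code: remember the request locally, or forward it to
 * the checkpointer, optionally retrying until the queue has room.
 */
bool
RegisterSyncRequest(FileTag tag, SyncRequestType type, SyncRequestOwner owner,
                    bool retryOnError)
{
    bool        ret = false;

    if (IsSyncManagedLocally())
    {
        /* standalone backend or startup process: sync state is local */
        RememberSyncRequest(tag, type, owner);
        return true;
    }

    for (;;)
    {
        ret = ForwardSyncRequest(tag, type, owner);
        if (ret || !retryOnError)
            break;
        pg_usleep(10000L);      /* give the checkpointer time to drain */
    }

    return ret;
}

With this shape, register_dirty_segment() can pass retryOnError = false and fall back to performing the FileSync() itself when the queue is full, while register_unlink() can pass true and keep trying.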
On Tue, Mar 5, 2019 at 2:25 PM Shawn Debnath <sdn@amazon.com> wrote: > [v11 patch] Thanks. Hmm, something is wrong here because make check is dramatically slower -- for example the "insert" test runs in ~8-13 seconds instead of the usual ~0.2 seconds according to Travis, AppVeyor and my local FreeBSD system (note that fsync is disabled so it's not that -- it must be bogus queue-related CPU?) -- Thomas Munro https://enterprisedb.com
On Tue, Mar 05, 2019 at 10:45:37PM +1300, Thomas Munro wrote: > On Tue, Mar 5, 2019 at 2:25 PM Shawn Debnath <sdn@amazon.com> wrote: > > [v11 patch] > > Thanks. Hmm, something is wrong here because make check is > dramatically slower -- for example the "insert" test runs in ~8-13 > seconds instead of the usual ~0.2 seconds according to Travis, > AppVeyor and my local FreeBSD system (note that fsync is disabled so > it's not that -- it must be bogus queue-related CPU?) Confirmed. Patch shows 8900 ms vs 192 ms on master for the insert test. Interesting! It's reproducible so should be able to figure out what's going on. The only thing we do in ForwardSyncRequest() is split up the 8 bits into 2x4 bits and copy the FileTagData structure to the checkpointer queue. Will report back what I find. -- Shawn Debnath Amazon Web Services (AWS)
On Wed, Mar 6, 2019 at 5:07 AM Shawn Debnath <sdn@amazon.com> wrote: > Confirmed. Patch shows 8900 ms vs 192 ms on master for the insert test. > Interesting! It's reproducible so should be able to figure out what's > going on. The only thing we do in ForwardSyncRequest() is split up the 8 > bits into 2x4 bits and copy the FileTagData structure to the > checkpointer queue. Will report back what I find. More review, all superficial stuff: +typedef struct +{ + RelFileNode rnode; + ForkNumber forknum; + SegmentNumber segno; +} FileTagData; + +typedef FileTagData *FileTag; Even though I know I said we should take FileTag by pointer, and even though there is an older tradition in the tree of having a struct named "FooData" and a corresponding pointer typedef named "Foo", as far as I know most people are not following the convention for new code and I for one don't like it. One problem is that there isn't a way to make a pointer-to-const type given a pointer-to-non-const type, so you finish up throwing away const from your programs. I like const as documentation and a tiny bit of extra compiler checking. What do you think about "FileTag" for the struct and eg "const FileTag *tag" when receiving one as a function argument? -/* internals: move me elsewhere -- ay 7/94 */ Aha, about time too! +#include "fmgr.h" +#include "storage/block.h" +#include "storage/relfilenode.h" +#include "storage/smgr.h" +#include "storage/sync.h" Why do we need to include fmgr.h in md.h? +/* md storage manager funcationality */ Typo. +/* md sync callback forward declarations */ These aren't "forward" declarations, they're plain old declarations. +extern char* mdfilepath(FileTag ftag); Doesn't really matter too much because all of this will get pgindent-ed at some point, but FYI we write "char *md", not "char* md". #include "storage/smgr.h" +#include "storage/md.h" #include "utils/hsearch.h" Bad sorting. + FileTagData tag; + tag.rnode = reln->smgr_rnode.node; + tag.forknum = forknum; + tag.segno = seg->mdfd_segno; I wonder if it would be better practice to zero-initialise that sucker, so that if more members are added we don't leave them uninitialised. I like the syntax "FileTagData tag = {{0}}". (Unfortunately extra nesting required here because first member is a struct, and C99 doesn't allow us to use empty {} like C++, even though some versions of GCC accept it. Rats.) -- Thomas Munro https://enterprisedb.com
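A tiny standalone illustration of the const problem Thomas mentions with pointer-hiding typedefs (nothing here is from the patch; the member is a stand-in):

typedef struct FileTagData
{
    int         segno;          /* stand-in member for illustration */
} FileTagData;

typedef FileTagData *FileTag;   /* the older "FooData"/"Foo" convention */

/* With the plain struct name, the natural const-correct spelling works: */
extern char *mdfilepath(const FileTagData *ftag);

/*
 * With the pointer-hiding typedef there is no way to spell "pointer to
 * const FileTagData": "const FileTag" makes the pointer itself const,
 * while the pointed-to data remains writable.
 */
extern char *mdfilepath_via_typedef(const FileTag ftag);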
Thomas Munro <thomas.munro@gmail.com> writes: > +#include "fmgr.h" > +#include "storage/block.h" > +#include "storage/relfilenode.h" > +#include "storage/smgr.h" > +#include "storage/sync.h" > Why do we need to include fmgr.h in md.h? More generally, any massive increase in an include file's inclusions is probably a sign that you need to refactor. Cross-header inclusions are best avoided altogether if you can --- obviously that's not always possible, but we should minimize them. We've had some very unfortunate problems in the past from indiscriminate #includes in headers. regards, tom lane
On Tue, Mar 05, 2019 at 11:53:16AM -0500, Tom Lane wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > +#include "fmgr.h" > > +#include "storage/block.h" > > +#include "storage/relfilenode.h" > > +#include "storage/smgr.h" > > +#include "storage/sync.h" > > > Why do we need to include fmgr.h in md.h? > > More generally, any massive increase in an include file's inclusions > is probably a sign that you need to refactor. Cross-header inclusions > are best avoided altogether if you can --- obviously that's not always > possible, but we should minimize them. We've had some very unfortunate > problems in the past from indiscriminate #includes in headers. Agree - I do pay attention to these, but this one slipped through the cracks (copied smgr.h then edited to remove smgr bits). Thanks for catching this, will fix in the next patch iteration. -- Shawn Debnath Amazon Web Services (AWS)
On Wed, Mar 06, 2019 at 05:33:54AM +1300, Thomas Munro wrote: > On Wed, Mar 6, 2019 at 5:07 AM Shawn Debnath <sdn@amazon.com> wrote: > > Confirmed. Patch shows 8900 ms vs 192 ms on master for the insert test. > > Interesting! It's reproducible so should be able to figure out what's > > going on. The only thing we do in ForwardSyncRequest() is split up the 8 > > bits into 2x4 bits and copy the FileTagData structure to the > > checkpointer queue. Will report back what I find. Fixed - tried to be clever with a do while loop and ended up forcing a sleep of 10 ms for every register request. Silly mistake (had an assert with the right assertion right after!). Reverted to while(1) with a clean break if we meet the conditions. Thanks for Thomas for being a second set of eyes during the investigation. make check runs are happy: patch: make check 2.48s user 0.94s system 12% cpu 27.411 total master: make check 2.50s user 0.88s system 12% cpu 27.573 total > More review, all superficial stuff: > > +typedef struct > +{ > + RelFileNode rnode; > + ForkNumber forknum; > + SegmentNumber segno; > +} FileTagData; > + > +typedef FileTagData *FileTag; > > Even though I know I said we should take FileTag by pointer, and even > though there is an older tradition in the tree of having a struct > named "FooData" and a corresponding pointer typedef named "Foo", as > far as I know most people are not following the convention for new > code and I for one don't like it. One problem is that there isn't a > way to make a pointer-to-const type given a pointer-to-non-const type, > so you finish up throwing away const from your programs. I like const > as documentation and a tiny bit of extra compiler checking. What do > you think about "FileTag" for the struct and eg "const FileTag *tag" > when receiving one as a function argument? More compile time safety checks are always better. I have made the changes in this new patch. Also, followed BufferTag pattern and defined INIT_FILETAG that will initialize the structure with the correct values. This avoids the point you bring up of accidentally omitting initializing members when new ones are added. > +#include "fmgr.h" > +#include "storage/block.h" > +#include "storage/relfilenode.h" > +#include "storage/smgr.h" > +#include "storage/sync.h" > > Why do we need to include fmgr.h in md.h? Removed. > +/* md storage manager funcationality */ > > Typo. Fixed > +/* md sync callback forward declarations */ > > These aren't "forward" declarations, they're plain old declarations. Removed and simplified. > +extern char* mdfilepath(FileTag ftag); > > Doesn't really matter too much because all of this will get > pgindent-ed at some point, but FYI we write "char *md", not "char* > md". Hmm - different, noted. Changed. > #include "storage/smgr.h" > +#include "storage/md.h" > #include "utils/hsearch.h" > > Bad sorting. Ordered correctly.. > + FileTagData tag; > + tag.rnode = reln->smgr_rnode.node; > + tag.forknum = forknum; > + tag.segno = seg->mdfd_segno; > > I wonder if it would be better practice to zero-initialise that > sucker, so that if more members are added we don't leave them > uninitialised. I like the syntax "FileTagData tag = {{0}}". > (Unfortunately extra nesting required here because first member is a > struct, and C99 doesn't allow us to use empty {} like C++, even though > some versions of GCC accept it. Rats.) See comments above for re-defining FileTag. -- Shawn Debnath Amazon Web Services (AWS)
Attachment
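The INIT_FILETAG macro mentioned above presumably looks roughly like this, in the INIT_BUFFERTAG tradition (a guess at the shape, not the patch's exact definition):

/*
 * Hypothetical sketch; field names follow the FileTag discussed earlier.
 * Initializing every member in one place avoids leaving new members
 * uninitialized when the struct grows.
 */
#define INIT_FILETAG(a, xx_rnode, xx_forknum, xx_segno) \
( \
    (a).rnode = (xx_rnode), \
    (a).forknum = (xx_forknum), \
    (a).segno = (xx_segno) \
)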
On Wed, Mar 6, 2019 at 6:16 AM Shawn Debnath <sdn@amazon.com> wrote: > On Tue, Mar 05, 2019 at 11:53:16AM -0500, Tom Lane wrote: > > Thomas Munro <thomas.munro@gmail.com> writes: > > > Why do we need to include fmgr.h in md.h? > > > > More generally, any massive increase in an include file's inclusions > > is probably a sign that you need to refactor. Cross-header inclusions > > are best avoided altogether if you can --- obviously that's not always > > possible, but we should minimize them. We've had some very unfortunate > > problems in the past from indiscriminate #includes in headers. > > Agree - I do pay attention to these, but this one slipped through the > cracks (copied smgr.h then edited to remove smgr bits). Thanks for > catching this, will fix in the next patch iteration. Huh... so why it was in smgr.h then? Seems bogus. Fix pushed to master. -- Thomas Munro https://enterprisedb.com
On Thu, Mar 07, 2019 at 08:11:51PM +1300, Thomas Munro wrote: > On Wed, Mar 6, 2019 at 6:16 AM Shawn Debnath <sdn@amazon.com> wrote: > > On Tue, Mar 05, 2019 at 11:53:16AM -0500, Tom Lane wrote: > > > Thomas Munro <thomas.munro@gmail.com> writes: > > > > Why do we need to include fmgr.h in md.h? > > > > > > More generally, any massive increase in an include file's inclusions > > > is probably a sign that you need to refactor. Cross-header inclusions > > > are best avoided altogether if you can --- obviously that's not always > > > possible, but we should minimize them. We've had some very unfortunate > > > problems in the past from indiscriminate #includes in headers. > > > > Agree - I do pay attention to these, but this one slipped through the > > cracks (copied smgr.h then edited to remove smgr bits). Thanks for > > catching this, will fix in the next patch iteration. > > Huh... so why it was in smgr.h then? Seems bogus. Fix pushed to master. So ... wondering if there are any other left over items for this patch or is it good to go? I imagine there's at least a couple of us who would love to see this get in for PG12. Thanks! -- Shawn Debnath Amazon Web Services (AWS)
On Wed, Mar 13, 2019 at 2:00 PM Shawn Debnath <sdn@amazon.com> wrote: > So ... wondering if there are any other left over items for this patch > or is it good to go? I imagine there's at least a couple of us who would > love to see this get in for PG12. I rebased my WIP undo stuff[1] (targeting 13) on top of this, and that seemed to go smoothly and the interfaces made sense, which was reassuring. I do wonder if we'll need to expose a way for eg pg_rewind and perhaps external backup tools to find paths and offsets given WAL block references that might in future include an SMGR ID (well that's one proposal), but that handwavy requirement doesn't seem to conflict with anything we've done here. I'm planning to do another round of review and testing. Aside from some refactoring which I think looks good anyway and prepares for future patches, the main effect of this patch is to force the checkpointer to open and close the files every time which seems OK to me. I know Andres wants to make a pass through it too. [1] https://www.postgresql.org/message-id/CA%2BhUKGKN42jB%2BubCKru716HPtMbahdia39GwG5pLgWLMZ_y1ng%40mail.gmail.com -- Thomas Munro https://enterprisedb.com
On Wed, Mar 13, 2019 at 2:27 PM Thomas Munro <thomas.munro@gmail.com> wrote: > [...] Aside from some refactoring which I > think looks good anyway and prepares for future patches, the main > effect of this patch is to force the checkpointer to open and close > the files every time which seems OK to me. I've been trying to decide if that is a problem. Maybe there is a performance angle, and I'm also wondering if it might increase the risk of missing a write-back error. Of course we'll find a proper solution to that problem (perhaps fd-passing or sync-on-close[1]), but I don't want to commit anything in the name of refactoring that might make matters worse incidentally. Or perhaps those worries are bogus, since the checkpointer calls smgrcloseall() at the end anyway. On balance, I'm inclined to err on the side of caution and try to keep things a bit closer to the way they are for now. Here's a fixup patch. 0001 is the same as Shawn's v12 patch, and 0002 has my proposed changes to switch to callbacks that actually perform the sync and unlink operations given a file tag, and do so via the SMGR fd cache, rather than exposing the path to sync.c. This moves us back towards the way I had it in an earlier version of the patch, but instead of using smgrsyncimmed() as I had it, it goes via Shawn's new sync handler function lookup table, allowing for non-smgr components to use this machinery in future (as requested by Andres). Thoughts? It all needs to be pgindented, I'll do that later. I'll post a rebase of my undo stuff on top of this soon, to show how it looks with this interface. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGLMPXMnSLDwgnNRFPyxvf_0bJ5HwXcHWjCp7Cvh7G%3DxEA%40mail.gmail.com -- Thomas Munro https://enterprisedb.com
Attachment
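To visualise the interface change in 0002, the per-handler lookup table ends up looking something like this sketch (member and callback names are provisional; the patches under discussion define the real ones):

/*
 * Sketch of the sync handler table after the 0002 fixup: the callbacks
 * perform the sync/unlink themselves, going through the smgr fd cache,
 * rather than handing a path back to sync.c to open.
 */
typedef struct SyncOps
{
    int         (*sync_syncfiletag) (const FileTag *ftag);
    int         (*sync_unlinkfiletag) (const FileTag *ftag);
    bool        (*sync_filetagmatches) (const FileTag *ftag,
                                        const FileTag *candidate);
} SyncOps;

static const SyncOps syncsw[] = {
    {mdsyncfiletag, mdunlinkfiletag, mdfiletagmatches}  /* md */
};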
On Tue, Mar 26, 2019 at 12:20 AM Thomas Munro <thomas.munro@gmail.com> wrote: > I've been trying to decide if that is a problem. Maybe there is a > performance angle, and I'm also wondering if it might increase the > risk of missing a write-back error. Of course we'll find a proper > solution to that problem (perhaps fd-passing or sync-on-close[1]), but > I don't want to commit anything in the name of refactoring that might > make matters worse incidentally. Or perhaps those worries are bogus, > since the checkpointer calls smgrcloseall() at the end anyway. > > On balance, I'm inclined to err on the side of caution and try to keep > things a bit closer to the way they are for now. > > Here's a fixup patch. 0001 is the same as Shawn's v12 patch, and 0002 > has my proposed changes to switch to callbacks that actually perform > the sync and unlink operations given a file tag, and do so via the > SMGR fd cache, rather than exposing the path to sync.c. This moves us > back towards the way I had it in an earlier version of the patch, but > instead of using smgrsyncimmed() as I had it, it goes via Shawn's new > sync handler function lookup table, allowing for non-smgr components > to use this machinery in future (as requested by Andres). Strong +1. Not only might closing and reopening the files have performance and reliability implications, but a future smgr might talk to the network, having no local file to sync. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 27, 2019 at 5:48 AM Shawn Debnath <sdn@amazon.com> wrote: > On Tue, Mar 26, 2019 at 09:22:56AM -0400, Robert Haas wrote: > > On Tue, Mar 26, 2019 at 12:20 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > > Here's a fixup patch. 0001 is the same as Shawn's v12 patch, and 0002 > > > has my proposed changes to switch to callbacks that actually perform > > > the sync and unlink operations given a file tag, and do so via the > > > SMGR fd cache, rather than exposing the path to sync.c. This moves us > > > back towards the way I had it in an earlier version of the patch, but > > > instead of using smgrsyncimmed() as I had it, it goes via Shawn's new > > > sync handler function lookup table, allowing for non-smgr components > > > to use this machinery in future (as requested by Andres). > > > > Strong +1. Not only might closing and reopening the files have > > performance and reliability implications, but a future smgr might talk > > to the network, having no local file to sync. > > Makes sense for now. When we re-visit the fd-passing or sync-on-close > implementations, we can adapt the changes relatively easily given the > rest of the framework is staying intact. I am hoping these patches do > not delay the last fsync-gate issue discussion further. I found a few more things that I thought needed adjustment: * Packing handler and request type into a uint8 is cute but a waste of time if we're just going to put it in a struct next to a member that requires word-alignment. So I changed it to a pair of plain old int16 members. The ftag member starts at offset 4 either way, on my system. * I didn't really like the use of the word HIERARCHY in the name of the request type, and changed it to SYNC_FILTER_REQUEST. That word came up because we were implementing a kind of hierarchy, where if you drop a database you want to forget things for all segments inside all relations inside that database, but the whole point of this new API is that it doesn't understand that, it calls a filter function to decide which requests to keep. So I preferred "filter" as a name for the type of request. * I simplified the "matches" callback interface. * Various typos and comment clean-up. I'm going to do some more testing and tidying tomorrow (for example I think the segment.h header is silly and I'd like to remove that), and commit this. -- Thomas Munro https://enterprisedb.com
Attachment
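As an example of what the renamed "filter" request is for, dropping a database might cancel all of its pending requests roughly like this (a hedged sketch; the function name, argument order and enum spellings are approximate, and the tag is passed by pointer to const per the earlier review):

/*
 * Sketch only: forget every pending sync request for a database by
 * sending one SYNC_FILTER_REQUEST; the handler's "matches" callback
 * decides which queued requests the filter applies to.
 */
void
ForgetDatabaseSyncRequests(Oid dbid)
{
    FileTag     tag;

    memset(&tag, 0, sizeof(tag));
    tag.rnode.dbNode = dbid;

    /* keep retrying until the checkpointer accepts the request */
    RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, SYNC_MD, true);
}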
On Tue, Apr 2, 2019 at 11:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > I'm going to do some more testing and tidying tomorrow (for example I > think the segment.h header is silly and I'd like to remove that), and > commit this. As a sanity check on the programming interface this thing gives you, I tried teaching the SLRUs to use the fsync queue. I finished up making a few small improvements, but the main thing I learned is that "handler" needs to be part of the hash table key. I suppose the discriminator could even be inside FileTag itself, but I chose to keep it separate and introduce a new struct to hold hander enum + FileTag in the hash table. The 0001 patch is what I'm going to commit soon. I don't know why Shawn measured a performance change -- a bug in an earlier version? -- I can't see it, but I'm planning to look into that a bit more first. I've attached the 0002 SLRU patch for interest, but I'm not planning to commit that one. -- Thomas Munro https://enterprisedb.com
Attachment
On Thu, Apr 04, 2019 at 01:08:31AM +1300, Thomas Munro wrote: > On Tue, Apr 2, 2019 at 11:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > I'm going to do some more testing and tidying tomorrow (for example I > > think the segment.h header is silly and I'd like to remove that), and > > commit this. Given the dislike in the thread for introducing the concept of segments at any layer higher than the storage manager itself, I thought it would be better to leave the existing header files alone and introduce a new one to separate the concept. I am fine either way we go. > As a sanity check on the programming interface this thing gives you, I > tried teaching the SLRUs to use the fsync queue. I finished up making > a few small improvements, but the main thing I learned is that > "handler" needs to be part of the hash table key. I suppose the > discriminator could even be inside FileTag itself, but I chose to keep > it separate and introduce a new struct to hold hander enum + FileTag > in the hash table. I think this is fine, but can you elaborate a bit more on why we need to include the handler in the key for the hash table? We are de-duping relfilenodes here and these should never collide with files from another component. The component separation would be encoded in the RelFileNode that the target smgr, or SLRU in this case, would be able to decipher. Do you foresee, or know of, use cases where FileTag alone will result in conflicts on the same file but from different handlers? +/* + * Populate a file tag describing a segment file. We only use the segment + * number, since we can derive everything else we need by having separate + * sync handler functions for clog, multixact etc. + */ +#define INIT_SLRUFILETAG(a,xx_segno) \ +( \ + (a).rnode.spcNode = 0, \ + (a).rnode.dbNode = 0, \ + (a).rnode.relNode = 0, \ + (a).forknum = 0, \ + (a).segno = (xx_segno) \ +) Based on the definition of INIT_SLRUFILETAG in your patch, it seems you are trying to only use the segno to identify the file. Not sure why we can't use other unused fields in FileTag to identify the component? You could for example in the current SLRU implementation have the handler set to SYNC_HANDLER_SLRU when invoking RegisterSyncRequest() and use relNode to distinguish between each SLRU component in a wrapper function and call slrusyncfiletag() with the right SlruCtl. For the SLRU to buffer cache work, I was planning on using the relNode field to identify which specific component this tag belongs to. dbNode would be pointing to the type of smgr (as discussed earlier in the thread and still TBD). I would prefer not to expand the hash key unnecessarily and given this isn't persisted, we can expand the key in the future if needed. Keeps the code simpler for now. > The 0001 patch is what I'm going to commit soon. I don't know why > Shawn measured a performance change -- a bug in an earlier version? -- > I can't see it, but I'm planning to look into that a bit more first. > I've attached the 0002 SLRU patch for interest, but I'm not planning > to commit that one. Question is which block of code did you measure? I can redo the instrumentation on the latest patch and re-validate and share the results. I previously measured the average time it took mdsync() and ProcessSyncRequests() in the patch to complete under similar workloads. 
> I found a few more things that I thought needed adjustment: > > * Packing handler and request type into a uint8 is cute but a waste of > time if we're just going to put it in a struct next to a member that > requires word-alignment. So I changed it to a pair of plain old int16 > members. The ftag member starts at offset 4 either way, on my system. Good catch! For posterity, using packed attribute here would be bad well as it would result RelFileNode's spcNode in FileTag in subsequent entries to be misaligned given our usage of the requests as an array. This is potentially unsafe on platforms other than x86. Re-arranging the fields would lead to the same result. Thanks for catching and fixing this! > * I didn't really like the use of the word HIERARCHY in the name of > the request type, and changed it to SYNC_FILTER_REQUEST. That word > came up because we were implementing a kind of hierarchy, where if you > drop a database you want to forget things for all segments inside all > relations inside that database, but the whole point of this new API is > that it doesn't understand that, it calls a filter function to decide > which requests to keep. So I preferred "filter" as a name for the > type of request. Yeah, I never liked the word hierarchy too much, but couldn't think of a better one either. Filter is perfect. -- Shawn Debnath Amazon Web Services (AWS)
On Thu, Apr 4, 2019 at 10:44 AM Shawn Debnath <sdn@amazon.com> wrote: > On Thu, Apr 04, 2019 at 01:08:31AM +1300, Thomas Munro wrote: > > As a sanity check on the programming interface this thing gives you, I > > tried teaching the SLRUs to use the fsync queue. I finished up making > > a few small improvements, but the main thing I learned is that > > "handler" needs to be part of the hash table key. I suppose the > > discriminator could even be inside FileTag itself, but I chose to keep > > it separate and introduce a new struct to hold hander enum + FileTag > > in the hash table. > > I think this is fine, but can you elaborate a bit more on why we need to > include the handler in the key for the hash table? We are de-duping > relfilenodes here and these should never collide with files from another > component. The component separation would be encoded in the RelFileNode > that the target smgr, or SLRU in this case, would be able to decipher. > Do you foresee, or know of, use cases where FileTag alone will result in > conflicts on the same file but from different handlers? Well it depends how we think of the FileTag namespace. I didn't worry before because md.c and my development version of undofile.c generate FileTag objects that never collide, because undofile.c sets the dbNode to invalid (and in earlier versions had a special magic DB OID), while md.c always uses valid values. But when I did the SLRU experiment with that 0002 patch, I figured it was reasonable to want to set only the segment number part of the FileTag. Then sync requests for pg_xact, pg_multixact/offsets and pg_multixact/members got merged. Oops. As you say, we could decide to assign bogus reserved dbNode values to keep them the keyspaces apart and not allow future modules to deviate from using a RelFileNode in the tag, but we already more-or-less decided in another thread that we don't want to do that for buffer tags. People seemed keen to see another discriminator for smgr ID, rather than requiring eg md.c and undo.c to use non-colliding RelFileNode values via a fake dbNode scheme. This problem is essentially the same, except that we decided we want the set of sync handlers to be potentially larger than the set of smgr handlers. For example, until we have a patch that moves SLRUs into shared buffers, SLRUs don't have SMGR IDs, and get as the 0002 patch shows, they could theoretically still benefit from handing off fsync work; there may be other cases like that. Generally, the question is similar in both places: (1) do sync handlers each get their own separate namespace of FileTags?, (2) do smgrs get their own namespace of BufferTag? Perhaps that is an argument for putting the sync handler number *inside* the FileTag, since we currently intend to do that with smgr IDs in BufferTag (stealing space from ForkNumber). > +/* > + * Populate a file tag describing a segment file. We only use the segment > + * number, since we can derive everything else we need by having separate > + * sync handler functions for clog, multixact etc. > + */ > +#define INIT_SLRUFILETAG(a,xx_segno) \ > +( \ > + (a).rnode.spcNode = 0, \ > + (a).rnode.dbNode = 0, \ > + (a).rnode.relNode = 0, \ > + (a).forknum = 0, \ > + (a).segno = (xx_segno) \ > +) > > Based on the definition of INIT_SLRUFILETAG in your patch, it seems you > are trying to only use the segno to identify the file. Not sure why we > can't use other unused fields in FileTag to identify the component? 
You > could for example in the current SLRU implementation have the handler > set to SYNC_HANDLER_SLRU when invoking RegisterSyncRequest() and use > relNode to distinguish between each SLRU component in a wrapper function > and call slrusyncfiletag() with the right SlruCtl. Yes, that would work too, though then slru.c would have to know about clog, multixact etc so it could find their SlruCtlData objects, which are currently static in the eg clog.c, multixact.c etc. That's why I put a tiny callback into each of those that could call slru.c to do the work. I'm not really proposing anything here, and I understand that you might want to refactor this stuff completely, I just wanted a quick sanity check of *something* else using this interface, other than md.c and my undo patch, in a particular a thing-that-isn't-connected-to-smgr. If it turns out to be useful for later work, good, if not, I won't be sad. I also suppressed the urge to teach it to use fd.c files and keep them open in a small cache -- all topics for future threads. > For the SLRU to buffer cache work, I was planning on using the relNode > field to identify which specific component this tag belongs to. dbNode > would be pointing to the type of smgr (as discussed earlier in the > thread and still TBD). I look forward to the patches :-) But as mentioned, there was some serious push-back on hijacking dbNode like that. > > The 0001 patch is what I'm going to commit soon. I don't know why > > Shawn measured a performance change -- a bug in an earlier version? -- > > I can't see it, but I'm planning to look into that a bit more first. > > I've attached the 0002 SLRU patch for interest, but I'm not planning > > to commit that one. > > Question is which block of code did you measure? I can redo the > instrumentation on the latest patch and re-validate and share the > results. I previously measured the average time it took mdsync() and > ProcessSyncRequests() in the patch to complete under similar workloads. Ok. Let me try that again here too. -- Thomas Munro https://enterprisedb.com
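The "tiny callback" arrangement described above might look something like this sketch (all names here are hypothetical; only the idea -- clog.c handing its SlruCtl back to a generic worker in slru.c -- comes from the message):

/*
 * Hypothetical sketch of an SLRU sync handler callback: clog.c exposes a
 * small callback that passes its own SlruCtl to a generic routine in
 * slru.c, so slru.c itself doesn't need to know about clog, multixact
 * and friends.
 */
static int
clogsyncfiletag(const FileTag *ftag)
{
    return SlruSyncFileTag(ClogCtl, ftag);  /* assumed helper in slru.c */
}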
On Thu, Apr 4, 2019 at 11:39 AM Thomas Munro <thomas.munro@gmail.com> wrote: > ... Perhaps > that is an argument for putting the sync handler number *inside* the > FileTag, since we currently intend to do that with smgr IDs in > BufferTag (stealing space from ForkNumber). Here is a version like that. I like it better this way, and the extra space can be clawed back by using 16 bit types to hold the fork number and sync handler number. Here again is the straw-man 0002 patch updated to show how it might look for another potential user. -- Thomas Munro https://enterprisedb.com
Attachment
On Thu, Apr 04, 2019 at 02:01:14PM +1300, Thomas Munro wrote: > On Thu, Apr 4, 2019 at 11:39 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > ... Perhaps > > that is an argument for putting the sync handler number *inside* the > > FileTag, since we currently intend to do that with smgr IDs in > > BufferTag (stealing space from ForkNumber). > > Here is a version like that. I like it better this way, and the extra > space can be clawed back by using 16 bit types to hold the fork number > and sync handler number. +typedef struct FileTag +{ + int16 handler; /* SyncRequstHandler value, saving space */ + int16 forknum; /* ForkNumber, saving space */ + RelFileNode rnode; + BlockNumber segno; +} FileTag; Definitely makes sense. v16 looks good to me. Thanks! -- Shawn Debnath Amazon Web Services (AWS)
On 2019-04-03 21:19:45 -0700, Shawn Debnath wrote: > On Thu, Apr 04, 2019 at 02:01:14PM +1300, Thomas Munro wrote: > > On Thu, Apr 4, 2019 at 11:39 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > > ... Perhaps > > > that is an argument for putting the sync handler number *inside* the > > > FileTag, since we currently intend to do that with smgr IDs in > > > BufferTag (stealing space from ForkNumber). > > > > Here is a version like that. I like it better this way, and the extra > > space can be clawed back by using 16 bit types to hold the fork number > > and sync handler number. > > +typedef struct FileTag > +{ > + int16 handler; /* SyncRequstHandler value, saving space */ > + int16 forknum; /* ForkNumber, saving space */ > + RelFileNode rnode; > + BlockNumber segno; > +} FileTag; Seems odd to me to use BlockNumber for segno.
On Thu, Apr 4, 2019 at 5:36 PM Andres Freund <andres@anarazel.de> wrote: > On 2019-04-03 21:19:45 -0700, Shawn Debnath wrote: > > +typedef struct FileTag > > +{ > > + int16 handler; /* SyncRequstHandler value, saving space */ > > + int16 forknum; /* ForkNumber, saving space */ > > + RelFileNode rnode; > > + BlockNumber segno; > > +} FileTag; > > Seems odd to me to use BlockNumber for segno. That is a tradition in md.c code. I had a new typedef SegmentNumber in all sync.{c,h} stuff in an earlier version, but had trouble figuring out where to define it... -- Thomas Munro https://enterprisedb.com
On Thu, Apr 04, 2019 at 05:39:14PM +1300, Thomas Munro wrote: > On Thu, Apr 4, 2019 at 5:36 PM Andres Freund <andres@anarazel.de> wrote: > > On 2019-04-03 21:19:45 -0700, Shawn Debnath wrote: > > > +typedef struct FileTag > > > +{ > > > + int16 handler; /* SyncRequstHandler value, saving space */ > > > + int16 forknum; /* ForkNumber, saving space */ > > > + RelFileNode rnode; > > > + BlockNumber segno; > > > +} FileTag; > > > > Seems odd to me to use BlockNumber for segno. > > That is a tradition in md.c code. I had a new typedef SegmentNumber > in all sync.{c,h} stuff in an earlier version, but had trouble > figuring out where to define it... Thomas, this is why I had defined segment.h with the contents below :-) +++ b/src/include/storage/segment.h [...] +/* + * Segment Number: + * + * Each relation and its forks are divided into segments. This + * definition formalizes the definition of the segment number. + */ +typedef uint32 SegmentNumber; + +#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF) My last iteration, v12, patch had it. See [1] for comments on removal of segment.h. [1] https://www.postgresql.org/message-id/20190403214423.GA45392%40f01898859afd.ant.amazon.com -- Shawn Debnath Amazon Web Services (AWS)
On Thu, Apr 4, 2019 at 6:41 PM Shawn Debnath <sdn@amazon.com> wrote: > On Thu, Apr 04, 2019 at 05:39:14PM +1300, Thomas Munro wrote: > > On Thu, Apr 4, 2019 at 5:36 PM Andres Freund <andres@anarazel.de> wrote: > > > On 2019-04-03 21:19:45 -0700, Shawn Debnath wrote: > > > > +typedef struct FileTag > > > > +{ > > > > + int16 handler; /* SyncRequstHandler value, saving space */ > > > > + int16 forknum; /* ForkNumber, saving space */ > > > > + RelFileNode rnode; > > > > + BlockNumber segno; > > > > +} FileTag; > > > > > > Seems odd to me to use BlockNumber for segno. > > > > That is a tradition in md.c code. I had a new typedef SegmentNumber > > in all sync.{c,h} stuff in an earlier version, but had trouble > > figuring out where to define it... > > Thomas, this is why I had defined segment.h with the contents below :-) > > +++ b/src/include/storage/segment.h > [...] > +/* > + * Segment Number: > + * > + * Each relation and its forks are divided into segments. This > + * definition formalizes the definition of the segment number. > + */ > +typedef uint32 SegmentNumber; > + > +#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF) I don't think it's project policy to put a single typedef into its own header like that, and I'm not sure where else to put it. I have changed FileTag's segno member to plain old uint32 for now. md.c continues to use BlockNumber for segment numbers (as it has ever since commit e0c9301c87 switched over from int), but that's all internal to md.c and I think we can reasonably leave that sort of improvement to a later patch. Pushed. Thanks for all the work on this! -- Thomas Munro https://enterprisedb.com
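For readers skimming the thread, the tag as described at this point would look roughly like this; it is reconstructed from the messages above rather than copied from the committed header, so the field comments are paraphrased.

typedef struct FileTag
{
	int16		handler;		/* SyncRequestHandler value, saving space */
	int16		forknum;		/* ForkNumber, saving space */
	RelFileNode rnode;			/* identifies the relation */
	uint32		segno;			/* plain uint32 now, rather than BlockNumber */
} FileTag;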
On 2019-Apr-04, Thomas Munro wrote: > I don't think it's project policy to put a single typedef into its own > header like that, and I'm not sure where else to put it. shrug. Looks fine to me. I suppose if we don't have it anywhere, it's just because we haven't needed that particular trick yet. Creating a file with a lone typedef seems better than using uint32 to me. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Apr 5, 2019 at 2:03 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2019-Apr-04, Thomas Munro wrote: > > I don't think it's project policy to put a single typedef into its own > > header like that, and I'm not sure where else to put it. > > shrug. Looks fine to me. I suppose if we don't have it anywhere, it's > just because we haven't needed that particular trick yet. Creating a > file with a lone typedef seems better than using uint32 to me. It was commit 9fac5fd7 that gave me that idea. Ok, here is a patch that adds a one-typedef header and uses SegmentIndex to replace all cases of BlockNumber and int holding a segment number (whether as an "index" or a "count"). -- Thomas Munro https://enterprisedb.com
Attachment
On Fri, Apr 5, 2019 at 10:53 AM Thomas Munro <thomas.munro@gmail.com> wrote: > Ok, here is a patch that adds a one-typedef header and uses > SegmentIndex to replace all cases of BlockNumber and int holding a > segment number (whether as an "index" or a "count"). (sorry, I meant "SegmentNumber", not "SegmentIndex") -- Thomas Munro https://enterprisedb.com
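To illustrate the shape of that follow-up patch: the typedef matches the segment.h excerpt quoted earlier in the thread, while the md.c prototype below is just one representative declaration chosen for illustration; the full set of call sites is in the attachment and is not reproduced here.

/* src/include/storage/segment.h */
typedef uint32 SegmentNumber;

#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)

/*
 * In md.c, declarations that currently hold a segment number in a
 * BlockNumber (or an int), for example
 *     static MdfdVec *_mdfd_openseg(..., BlockNumber segno, int oflags);
 * would instead read:
 */
static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
							  SegmentNumber segno, int oflags);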
On Thu, Apr 4, 2019 at 11:47 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Pushed. Thanks for all the work on this! I managed to break this today while testing with RELSEG_SIZE set to 1 block (= squillions of 8kB files). The problem is incorrect arguments to _mdfd_getseg(), in code added recently by me. Without the EXTENSION_DONT_CHECK_SIZE flag, it refuses to open segments following segments that have been truncated, leading to a checkpointer fsync panic. It's also passing segnum where a blocknum is wanted. It should have used exactly the same arguments as in the old code, but didn't. I will push a fix shortly. -- Thomas Munro https://enterprisedb.com
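Based on that description (and not on the pushed fix itself), the corrected lookup presumably has to convert the tag's segment number back into a block number and pass the same flags the pre-refactoring mdsync() code used, so that a segment lying beyond a truncation point can still be opened and fsync'd. A sketch of how that call might look inside the md.c sync callback:

	MdfdVec    *v;

	/*
	 * Sketch only: blocknum = segno * RELSEG_SIZE, and
	 * EXTENSION_DONT_CHECK_SIZE allows opening a segment that lies beyond
	 * the relation's current length (e.g. after a truncation), instead of
	 * failing and triggering a checkpointer fsync PANIC.
	 */
	v = _mdfd_getseg(reln,
					 ftag->forknum,
					 (BlockNumber) ftag->segno * (BlockNumber) RELSEG_SIZE,
					 false,
					 EXTENSION_RETURN_NULL | EXTENSION_DONT_CHECK_SIZE);
	if (v == NULL)
		return -1;				/* let the caller decide how to report it */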
On Fri, Apr 05, 2019 at 10:53:53AM +1300, Thomas Munro wrote: > On Fri, Apr 5, 2019 at 2:03 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2019-Apr-04, Thomas Munro wrote: > > > I don't think it's project policy to put a single typedef into its own > > > header like that, and I'm not sure where else to put it. > > > > shrug. Looks fine to me. I suppose if we don't have it anywhere, it's > > just because we haven't needed that particular trick yet. Creating a > > file with a lone typedef seems better than using uint32 to me. > > It was commit 9fac5fd7 that gave me that idea. > > Ok, here is a patch that adds a one-typedef header and uses > SegmentIndex to replace all cases of BlockNumber and int holding a > segment number (whether as an "index" or a "count"). Looks good to me. -- Shawn Debnath Amazon Web Services (AWS)
On 2019-Apr-05, Thomas Munro wrote: > Ok, here is a patch that adds a one-typedef header and uses > SegmentIndex to replace all cases of BlockNumber and int holding a > segment number (whether as an "index" or a "count"). Hmm, I now see (while doing the pg_checksum translation) that this patch didn't make it. Pity ... -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services