Thread: Refactoring the checkpointer's fsync request queue
Hello hackers,

Currently, md.c and checkpointer.c interact in a way that breaks smgr.c's modularity. That doesn't matter much if md.c is the only storage manager implementation, but currently there are two proposals to provide new kinds of block storage accessed via the buffer manager: UNDO and SLRU.

Here is a patch that rips the fsync stuff out of md.c, generalises it and puts it into a new translation unit smgrsync.c. It can deal with fsync()ing any files you want at checkpoint time, as long as they can be described by an SmgrFileTag (a struct type we can extend as needed). A pathname would work too, but I wanted something small and fixed in size. It's just a tag that can be converted to a path in case the file needs to be reopened (eg on Windows), but otherwise is used as a hash table key to merge requests.

There is one major fly in the ointment: fsyncgate[1]. Originally I planned to propose a patch on top of that one, but it's difficult -- both patches move a lot of the same stuff around. Personally, I don't think it would be a very good idea to back-patch that anyway. It'd be riskier than the problem it aims to solve, in terms of bugs and hard-to-foresee portability problems IMHO. I think we should consider back-patching some variant of Craig Ringer's PANIC patch, and consider this redesigned approach for future releases.

So, please find attached the WIP patch that I would like to propose for PostgreSQL 12, under a separate Commitfest entry. It incorporates the fsyncgate work by Andres Freund (original file descriptor transfer POC) and me (many bug fixes and improvements), and the refactoring work as described above. It can be compiled in two modes: with the macro CHECKPOINTER_TRANSFER_FILES defined, it sends fds to the checkpointer, but if you comment out that macro definition for testing, or build on Windows, it reverts to a mode that reopens files in the checkpointer. I'm hoping to find a Windows-savvy collaborator to help finish the Windows support. Right now it passes make check on AppVeyor, but it needs to be reviewed and tested on a real system with a small shared_buffers (installcheck, pgbench, other attempts to break it). Other than that, there are a couple of remaining XXX notes for small known details, but I wanted to post this version now.

[1] https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de

-- Thomas Munro http://www.enterprisedb.com
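For concreteness, here is a minimal sketch of what such a tag could look like, assuming fields along the lines described above (relation, fork, segment number); the layout is a guess, not necessarily what the attached patch uses:

    #include "storage/relfilenode.h"
    #include "common/relpath.h"        /* for ForkNumber */

    /*
     * Illustrative sketch only.  The point is a small, fixed-size value that
     * works as a hash table key for merging requests, yet carries enough
     * information to be converted back into a pathname if the file must be
     * reopened (eg on Windows).
     */
    typedef struct SmgrFileTag
    {
        RelFileNode rnode;          /* tablespace, database, relation */
        ForkNumber  forknum;        /* which fork of the relation */
        uint32      segno;          /* segment number within the fork */
    } SmgrFileTag;

Used as the key of the checkpointer's hash table, two requests for the same segment naturally merge into a single entry.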
Attachment
(Adding Dmitry to CC list.)

On Tue, Oct 16, 2018 at 12:02 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> There is one major fly in the ointment: fsyncgate[1]. Originally I
> planned to propose a patch on top of that one, but it's difficult --
> both patches move a lot of the same stuff around. Personally, I don't
> think it would be a very good idea to back-patch that anyway. It'd be
> riskier than the problem it aims to solve, in terms of bugs and
> hard-to-foresee portability problems IMHO. I think we should consider
> back-patching some variant of Craig Ringer's PANIC patch, and consider
> this redesigned approach for future releases.
>
> So, please find attached the WIP patch that I would like to propose
> for PostgreSQL 12, under a separate Commitfest entry. It incorporates
> the fsyncgate work by Andres Freund (original file descriptor transfer
> POC) and me (many bug fixes and improvements), and the refactoring
> work as described above.

Here is a rebased version of the patch, post pread()/pwrite(). I have also rewritten the commit message to try to explain the rationale concisely, instead of requiring the reader to consult multiple discussions that jump between lengthy email threads to understand the key points.

There is one major problem with this patch: BufferSync(), run in the checkpointer, can deadlock against a backend that holds a buffer lock and is blocked in SendFsyncRequest(). To break this deadlock, we need a way out of it on either the sending or receiving side. Here are three ideas:

1. Go back to the current pressure-valve strategy: make the sending side perform the fsync(), if it can't immediately write to the pipe.

2. Offload the BufferSync() work to bgwriter, so the checkpointer can keep draining the pipe. Communication between checkpointer and bgwriter can be fairly easily multiplexed with the pipe draining work.

3. Multiplex the checkpointer's work: Use LWLockConditionalAcquire() when locking buffers, and if that fails, try to drain the pipe, and then fall back to a LWLockTimedAcquire(), drain pipe, repeat loop. I can hear you groan already; that doesn't seem particularly elegant, and there are portability problems implementing LWLockTimedAcquire(): semtimedop() and sem_timedwait() are not available on all platforms (eg macOS). Maybe pthread_cond_timedwait() could help (!).

I'm not actually sure if idea 1 is correct, and I also don't like that behaviour under pressure, and I think pressure is more likely than in current master (since we gave up sender-side queue compaction, and it seems quite easy to fill up the pipe's buffer). Number 2 appeals to me the most right now, but I haven't looked into the details or tried it yet. Number 3 is a straw man that perhaps helps illustrate the problem but involves taking on unnecessary new problems and seems like a non-starter.

So, is there any reason the bgwriter process shouldn't do that work on the checkpointer's behalf? Is there another way?

-- Thomas Munro http://www.enterprisedb.com
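To make the shape of idea 3 (and of the deadlock itself) concrete, here is a rough sketch of the checkpointer's buffer-locking step; LWLockConditionalAcquire() and BufferDescriptorGetContentLock() are existing APIs, while AbsorbFsyncRequests() here stands in for whatever drains the request pipe in this patch, so treat this as illustration rather than the patch's code:

    /* Straw-man sketch of idea 3, not code from the attached patch. */
    for (;;)
    {
        if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(bufHdr),
                                     LW_SHARED))
            break;              /* got the content lock; go flush the buffer */

        /*
         * The lock holder may itself be blocked writing to our full
         * fsync-request pipe, so drain the pipe before retrying.  Ideally
         * we would sleep with a timeout here, hence the LWLockTimedAcquire()
         * idea above.
         */
        AbsorbFsyncRequests();
    }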
Attachment
On Sun, Nov 11, 2018 at 9:59 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> There is one major problem with this patch

If there's only one, you're doing great! Although admittedly this seems like a big one...

> 1. Go back to the current pressure-valve strategy: make the sending
> side perform the fsync(), if it can't immediately write to the pipe.

As you say, this will happen significantly more often without the sender-side deduplication. That deduplication logic got added in response to a real need. Before that, you could cause an individual backend to start doing its own fsyncs() with something as simple as a bulk load. The queue would absorb most of them, but not all, and the performance ramifications were noticeable.

> 2. Offload the BufferSync() work to bgwriter, so the checkpointer can
> keep draining the pipe. Communication between checkpointer and
> bgwriter can be fairly easily multiplexed with the pipe draining work.

That sounds a little like you are proposing to go back to the way things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and, belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess the division of labor wouldn't be quite the same.

> 3. Multiplex the checkpointer's work: Use LWLockConditionalAcquire()
> when locking buffers, and if that fails, try to drain the pipe, and
> then fall back to a LWLockTimedAcquire(), drain pipe, repeat loop. I
> can hear you groan already; that doesn't seem particularly elegant,
> and there are portability problems implementing LWLockTimedAcquire():
> semtimedop() and sem_timedwait() are not available on all platforms
> (eg macOS). Maybe pthread_cond_timedwait() could help (!).

You don't really need to invent LWLockTimedAcquire(). You could just keep retrying LWLockConditionalAcquire() in a delay loop. I agree that doesn't seem particularly elegant, though.

I still feel like this whole pass-the-fds-to-the-checkpointer thing is a bit of a fool's errand, though. I mean, there's no guarantee that the first FD that gets passed to the checkpointer is the first one opened, or even the first one written, is there? It seems like if you wanted to make this work reliably, you'd need to do it the other way around: have the checkpointer (or some other background process) open all the FDs, and anybody else who wants to have one open get it from the checkpointer.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2018-11-12 15:58:41 +1300, Thomas Munro wrote:
> There is one major problem with this patch: BufferSync(), run in the
> checkpointer, can deadlock against a backend that holds a buffer lock
> and is blocked in SendFsyncRequest(). To break this deadlock, we need
> a way out of it on either the sending or receiving side. Here are three
> ideas:

That's the deadlock I'd mentioned in Pune (?) btw.

> 1. Go back to the current pressure-valve strategy: make the sending
> side perform the fsync(), if it can't immediately write to the pipe.

I don't think that's correct / safe? I've previously wondered whether there's any way we could delay the write to a point where the buffer is not locked anymore - as far as I can tell it's actually not required for correctness that we send the fsync request before unlocking. It's architecturally a bit dicey tho :(

Greetings,

Andres Freund
Hi, On 2018-11-13 12:04:23 -0500, Robert Haas wrote: > I still feel like this whole pass-the-fds-to-the-checkpointer thing is > a bit of a fool's errand, though. I mean, there's no guarantee that > the first FD that gets passed to the checkpointer is the first one > opened, or even the first one written, is there? I'm not sure I understand the danger you're seeing here. It doesn't have to be the first fd opened, it has to be an fd that's older than all the writes that we need to ensure made it to disk. And that ought to be guaranteed by the logic? Between the FileWrite() and the register_dirty_segment() (and other relevant paths) the FD cannot be closed. > It seems like if you wanted to make this work reliably, you'd need to > do it the other way around: have the checkpointer (or some other > background process) open all the FDs, and anybody else who wants to > have one open get it from the checkpointer. That'd require a process context switch for each FD opened, which seems clearly like a no-go? Greetings, Andres Freund
On Tue, Nov 13, 2018 at 1:07 PM Andres Freund <andres@anarazel.de> wrote: > On 2018-11-13 12:04:23 -0500, Robert Haas wrote: > > I still feel like this whole pass-the-fds-to-the-checkpointer thing is > > a bit of a fool's errand, though. I mean, there's no guarantee that > > the first FD that gets passed to the checkpointer is the first one > > opened, or even the first one written, is there? > I'm not sure I understand the danger you're seeing here. It doesn't have > to be the first fd opened, it has to be an fd that's older than all the > writes that we need to ensure made it to disk. And that ought to be > guaranteed by the logic? Between the FileWrite() and the > register_dirty_segment() (and other relevant paths) the FD cannot be > closed. Suppose backend A and backend B open a segment around the same time. Is it possible that backend A does a write before backend B, but backend B's copy of the fd reaches the checkpointer before backend A's copy? If you send the FD to the checkpointer before writing anything then I think it's fine, but if you write first and then send the FD to the checkpointer I don't see what guarantees the ordering. > > It seems like if you wanted to make this work reliably, you'd need to > > do it the other way around: have the checkpointer (or some other > > background process) open all the FDs, and anybody else who wants to > > have one open get it from the checkpointer. > > That'd require a process context switch for each FD opened, which seems > clearly like a no-go? I don't know how bad that would be. But hey, no cost is too great to pay as a workaround for insane kernel semantics, right? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
(Replies to a couple of different messages below) On Wed, Nov 14, 2018 at 6:04 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Nov 11, 2018 at 9:59 PM Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > There is one major problem with this patch > > If there's only one, you're doing great! Although admittedly this > seems like a big one... Make that two. > > 2. Offload the BufferSync() work to bgwriter, so the checkpointer can > > keep draining the pipe. Communication between checkpointer and > > bgwriter can be fairly easily multiplexed with the pipe draining work. > > That sounds a little like you are proposing to go back to the way > things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and, > belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess > the division of labor wouldn't be quite the same. But is there an argument against it? The checkpointer would still be creating checkpoints including running fsync, but the background writer would be, erm, writing, erm, in the background. Admittedly it adds a whole extra rabbit hole to this rabbit hole, which was itself a diversion from my goal of refactoring the syncing machinery to support undo logs. But the other two ideas seem to suck and/or have correctness issues. On Wed, Nov 14, 2018 at 7:43 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 13, 2018 at 1:07 PM Andres Freund <andres@anarazel.de> wrote: > > On 2018-11-13 12:04:23 -0500, Robert Haas wrote: > > > I still feel like this whole pass-the-fds-to-the-checkpointer thing is > > > a bit of a fool's errand, though. I mean, there's no guarantee that > > > the first FD that gets passed to the checkpointer is the first one > > > opened, or even the first one written, is there? > > I'm not sure I understand the danger you're seeing here. It doesn't have > > to be the first fd opened, it has to be an fd that's older than all the > > writes that we need to ensure made it to disk. And that ought to be > > guaranteed by the logic? Between the FileWrite() and the > > register_dirty_segment() (and other relevant paths) the FD cannot be > > closed. > > Suppose backend A and backend B open a segment around the same time. > Is it possible that backend A does a write before backend B, but > backend B's copy of the fd reaches the checkpointer before backend A's > copy? If you send the FD to the checkpointer before writing anything > then I think it's fine, but if you write first and then send the FD to > the checkpointer I don't see what guarantees the ordering. I'm not sure if it matters whether we send the fd before or after the write, but we still need some kind of global ordering of fds that can order a given fd with respect to writes in other processes, so the patch introduces a global shared counter captured immediately after open() (including when reopened in the vfd machinery). In your example, both fds arrive in the checkpointer in some order, and it will keep the one with the older sequence number and close the other one. This sorting of all interesting fds will be forced before the checkpoint completes by AbsorbFsyncRequests(), which drains all messages from the pipe until it sees a message for the next checkpoint cycle. Hmm, I think there is a flaw in the plan here though. 
Messages for different checkpoint cycles race to enter the pipe around the time the cycle counter is bumped, so you could have a message for n hiding behind a message for n + 1 and not drain enough; I'm not sure and need to look at something else today, but I see a couple of potential solutions to that which I will mull over, based on either a new shared counter increment or a special pipe message written after BufferSync() by the bgwriter (if we go for idea #2; Andres had something similar in the original prototype but it could self-deadlock). I need to figure out if that is a suitable barrier due to buffer interlocking. > > > It seems like if you wanted to make this work reliably, you'd need to > > > do it the other way around: have the checkpointer (or some other > > > background process) open all the FDs, and anybody else who wants to > > > have one open get it from the checkpointer. > > > > That'd require a process context switch for each FD opened, which seems > > clearly like a no-go? > > I don't know how bad that would be. But hey, no cost is too great to > pay as a workaround for insane kernel semantics, right? Yeah, seems extremely expensive and unnecessary. It seems sufficient to track the global opening order... or at least a proxy that identifies the fd that performed the oldest write. Which I believe this patch is doing. -- Thomas Munro http://www.enterprisedb.com
> On Wed, 14 Nov 2018 at 00:44, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> Here is a rebased version of the patch, post pread()/pwrite(). I have
> also rewritten the commit message to try to explain the rationale
> concisely, instead of requiring the reader to consult multiple
> discussions that jump between lengthy email threads to understand the
> key points.

Thank you for working on this patch!

> There is one major problem with this patch: BufferSync(), run in the
> checkpointer, can deadlock against a backend that holds a buffer lock
> and is blocked in SendFsyncRequest(). To break this deadlock, we need
> a way out of it on either the sending or receiving side.

Or introduce a third side, but I'm not sure how appropriate that is here.

> 2. Offload the BufferSync() work to bgwriter, so the checkpointer can
> keep draining the pipe. Communication between checkpointer and
> bgwriter can be fairly easily multiplexed with the pipe draining work.

I also think it sounds better than the other options (although probably that's partially because these options were formulated while already having some bias towards one of the solutions).

> > > 2. Offload the BufferSync() work to bgwriter, so the checkpointer can
> > > keep draining the pipe. Communication between checkpointer and
> > > bgwriter can be fairly easily multiplexed with the pipe draining work.
> >
> > That sounds a little like you are proposing to go back to the way
> > things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
> > belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
> > the division of labor wouldn't be quite the same.

I had the same first thought, but then after reading the corresponding mailing thread I've got an impression that the purpose of this change was rather technical (to split work between different processes for performance reasons) and not exactly relevant to the division of labor - am I wrong here?

While testing this patch with frequent checkpoints I've stumbled upon an interesting error, that happened already after I finished one test:

TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574)
2018-11-13 22:06:29.773 CET [7886] LOG: checkpointer process (PID 7934) was terminated by signal 6: Aborted
2018-11-13 22:06:29.773 CET [7886] LOG: terminating any other active server processes
2018-11-13 22:06:29.773 CET [7937] WARNING: terminating connection because of crash of another server process
2018-11-13 22:06:29.773 CET [7937] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-11-13 22:06:29.773 CET [7937] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2018-11-13 22:06:29.778 CET [7886] LOG: all server processes terminated; reinitializing

I assume it shouldn't be like that?
I haven't investigated deeply yet, but backtrace looks like:

>>> bt
#0  0x00007f7ee7a3af00 in raise () from /lib64/libc.so.6
#1  0x00007f7ee7a3ca57 in abort () from /lib64/libc.so.6
#2  0x0000560e89d1858e in ExceptionalCondition (conditionName=conditionName@entry=0x560e89eca333 "!(rc > 0)", errorType=errorType@entry=0x560e89d6cec8 "FailedAssertion", fileName=fileName@entry=0x560e89eca2c9 "checkpointer.c", lineNumber=lineNumber@entry=574) at assert.c:54
#3  0x0000560e89b5e3ff in CheckpointerMain () at checkpointer.c:574
#4  0x0000560e8995ef9e in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffe05c32f60) at bootstrap.c:460
#5  0x0000560e89b69c55 in StartChildProcess (type=type@entry=CheckpointerProcess) at postmaster.c:5369
#6  0x0000560e89b6af15 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2916
#7  <signal handler called>
#8  0x00007f7ee7afe00b in select () from /lib64/libc.so.6
#9  0x0000560e89b6bd20 in ServerLoop () at postmaster.c:1679
#10 0x0000560e89b6d1bc in PostmasterMain (argc=3, argv=<optimized out>) at postmaster.c:1388
#11 0x0000560e89acadc6 in main (argc=3, argv=0x560e8ad42c00) at main.c:228
On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > That sounds a little like you are proposing to go back to the way > > things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and, > > belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess > > the division of labor wouldn't be quite the same. > > But is there an argument against it? The checkpointer would still be > creating checkpoints including running fsync, but the background > writer would be, erm, writing, erm, in the background. I don't know. I guess the fact that the checkpointer is still performing the fsyncs is probably a key point. I mean, in the old division of labor, fsyncs could interrupt the background writing that was supposed to be happening. > I'm not sure if it matters whether we send the fd before or after the > write, but we still need some kind of global ordering of fds that can > order a given fd with respect to writes in other processes, so the > patch introduces a global shared counter captured immediately after > open() (including when reopened in the vfd machinery). But how do you make reading that counter atomic with the open() itself? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018-11-14 16:36:49 -0500, Robert Haas wrote: > On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > I'm not sure if it matters whether we send the fd before or after the > > write, but we still need some kind of global ordering of fds that can > > order a given fd with respect to writes in other processes, so the > > patch introduces a global shared counter captured immediately after > > open() (including when reopened in the vfd machinery). > > But how do you make reading that counter atomic with the open() itself? I don't see why it has to be. As long as the "fd generation" assignment happens before fsync (and writes secondarily), there ought not to be any further need for synchronizity? Greetings, Andres Freund
On Wed, Nov 14, 2018 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
> On 2018-11-14 16:36:49 -0500, Robert Haas wrote:
> > But how do you make reading that counter atomic with the open() itself?
>
> I don't see why it has to be. As long as the "fd generation" assignment
> happens before fsync (and writes secondarily), there ought not to be any
> further need for synchronizity?

If the goal is to have the FD that is opened first end up in the checkpointer's table, grabbing a counter afterwards does not achieve it, because there's a race.

S1: open FD
S2: open FD
S2: local_counter = shared_counter++
S1: local_counter = shared_counter++

Now S1 was opened first but has a higher shared counter value than S2 which was opened later. Does that matter? Beats me! I just work here...

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Nov 15, 2018 at 5:09 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> While testing this patch with frequent checkpoints I've stumbled upon an
> interesting error, that happened already after I finished one test:
>
> TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574)

Thanks for testing! Yeah, that's:

+        rc = WaitEventSetWait(wes, cur_timeout * 1000, &event, 1, 0);
+        Assert(rc > 0);

I got confused about the API. If there is a timeout, you get rc == 0, but I think I was expecting rc == 1, event.event == WL_TIMEOUT. Oops. I will fix that when I post a new experimental version that uses the bgworker as discussed, and we can try to figure out if that design will fly.

-- Thomas Munro http://www.enterprisedb.com
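For illustration, WaitEventSetWait() returns the number of events that occurred, so a timeout comes back as 0; a minimal sketch of the corrected check, which may not match the eventual fix exactly:

    rc = WaitEventSetWait(wes, cur_timeout * 1000, &event, 1, 0);
    if (rc == 0)
    {
        /* timed out; fall through and decide whether it's time to checkpoint */
    }
    else
    {
        Assert(rc == 1);
        /* inspect event.events: latch set, postmaster death, socket data, ... */
    }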
On Sat, Nov 17, 2018 at 4:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 14, 2018 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2018-11-14 16:36:49 -0500, Robert Haas wrote:
> > > But how do you make reading that counter atomic with the open() itself?
> >
> > I don't see why it has to be. As long as the "fd generation" assignment
> > happens before fsync (and writes secondarily), there ought not to be any
> > further need for synchronizity?
>
> If the goal is to have the FD that is opened first end up in the
> checkpointer's table, grabbing a counter afterwards does not achieve
> it, because there's a race.
>
> S1: open FD
> S2: open FD
> S2: local_counter = shared_counter++
> S1: local_counter = shared_counter++
>
> Now S1 was opened first but has a higher shared counter value than S2
> which was opened later. Does that matter? Beats me! I just work
> here...

It's not important for the sequence numbers to match the opening order exactly (that'd work too but be expensive to orchestrate). It's important for the sequence numbers to be assigned before each backend does its first pwrite(). That gives us the following interleavings to worry about:

S1: local_counter = shared_counter++
S2: local_counter = shared_counter++
S1: pwrite()
S2: pwrite()

S1: local_counter = shared_counter++
S2: local_counter = shared_counter++
S2: pwrite()
S1: pwrite()

S1: local_counter = shared_counter++
S1: pwrite()
S2: local_counter = shared_counter++
S2: pwrite()

... plus the same interleavings with S1 and S2 labels swapped. In all 6 orderings, the fd that has the lowest sequence number can see errors relating to write-back of kernel buffers dirtied by both pwrite() calls. Or to put it another way, you can't be given a lower sequence number than another process that has already written, because that other process must have been given a sequence number before it wrote.

-- Thomas Munro http://www.enterprisedb.com
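As an illustration of that rule (the names and the location of the shared counter are invented here, not taken from the patch), the sequence number is captured immediately after open() and before any write through that descriptor:

    #include <fcntl.h>
    #include <sys/stat.h>
    #include "port/atomics.h"

    static uint64 fd_open_seq;      /* remembered alongside the vfd in the patch */

    static int
    open_with_seq(const char *path, int flags, mode_t mode)
    {
        int     fd = open(path, flags, mode);

        if (fd >= 0)
        {
            /* SharedCounters->open_seq is a hypothetical shared atomic */
            fd_open_seq = pg_atomic_fetch_add_u64(&SharedCounters->open_seq, 1);
        }
        return fd;
    }

Whichever descriptor then carries the lowest number was open before every pwrite() that any backend has performed on the file.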
On Fri, Nov 16, 2018 at 5:38 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Or to put it another way, you can't be given a lower sequence number > than another process that has already written, because that other > process must have been given a sequence number before it wrote. OK, that makes sense. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Nov 17, 2018 at 10:53 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Nov 15, 2018 at 5:09 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > While testing this patch with frequent checkpoints I've stumbled upon an > > interesting error, that happened already after I finished one test: > > > > TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574) Fixed in the 0001 patch (and a similar problem in the WIN32 branch). On Thu, Nov 15, 2018 at 10:37 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > > That sounds a little like you are proposing to go back to the way > > > things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and, > > > belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess > > > the division of labor wouldn't be quite the same. > > > > But is there an argument against it? The checkpointer would still be > > creating checkpoints including running fsync, but the background > > writer would be, erm, writing, erm, in the background. > > I don't know. I guess the fact that the checkpointer is still > performing the fsyncs is probably a key point. I mean, in the old > division of labor, fsyncs could interrupt the background writing that > was supposed to be happening. Robert explained off-list that BgBufferSync() and BufferSync() have rather different goals, and performing them in the same loop without major reengineering to merge their logic would probably not work out well. So I'm abandoning that plan for now (though it could perhaps be interesting if done right). I do have a new plan though... On Wed, Nov 14, 2018 at 7:01 AM Andres Freund <andres@anarazel.de> wrote: > ... I've previously wondered whether > there's any way we could delay the write to a point where the buffer is > not locked anymore - as far as I can tell it's actually not required for > correctness that we send the fsync request before unlocking. It's > architecturally a bit dicey tho :( ... and it's basically what Andres said there ^^^. The specific hazard I wondered about is when a checkpoint begins after BufferAlloc() calls pwrite() but before it calls sendto(), so that we fail to fsync() a file that was modified before the checkpoint LSN. But, AFAICS, assuming we call sendto() before we update the buffer header, there are only two possibilities from the point of view of BufferAlloc(): 1. The checkpointer's BufferSync() loop arrives before we update the buffer header, so it sees the buffer as dirty, writes it out (again), remembers that the segment is dirty, and then when we eventually get the buffer header lock we see that it's not dirty anymore and we just skip the buffer. 2. The checkpointer's BufferSync() loop arrives after we updated the buffer header, so it sees it as invalid (or some later state), which means that we have already called sendto() (before we updated the header). Either way, the checkpointer finishes up calling fsync() before the checkpoint completes, as it should, and the worst that can happen due to bad timing is a harmless double pwrite(). I noticed a subtle problem though. Suppose we have case 2 above. After BufferSync() returns in the checkpointer, our backend has called sendto() to register a dirty file. In v2 the checkpointer runs AbsorbAllFsyncRequests() to drain the pipe until it sees a message for the current cycle (that is, it absorbs messages for the previous cycle). 
That's obviously not good enough, since backends race to call sendto() and a message for cycle n - 1 might be hiding behind a message for cycle n. So I propose to drain the pipe until it is empty or we see a message for cycle n + 1 (where n is the current cycle before we start draining, meaning that we ran out of fds and forced a new cycle in FlushFsyncRequestQueueIfNecessary()).

I think that works, because although we can't be sure that we'll receive all messages for n - 1 before we receive a message for n due to races on the insert side, we *can* be certain that we'll receive all messages for n - 1 before we receive a message for n + 1, because we know that they were already in the pipe before we began. In the happy case, our system never runs out of fds so the pipe will eventually be empty, since backends send at most one message per cycle per file and the number of files is finite, and in the sad case, there are too many dirty files per cycle, so we keep creating new cycles while absorbing, but again the loop is bounded because we know that seeing n + 1 is enough for our purpose (which is to fsync all files that were already mentioned in messages sent to the pipe before we started our loop). That's implemented in the 0002 patch, separated for ease of review.

The above theories cover BufferAlloc()'s interaction with a checkpoint, and that seems to be the main story. I'm not entirely sure about the other callers of FlushBuffer() or FlushOneBuffer() (eg special case init fork stuff), but I've run out of brain power for now and wanted to post an update.

-- Thomas Munro http://www.enterprisedb.com
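Restating that absorb loop as a sketch (the helper names and message layout here are invented, not the 0002 patch's actual code):

    /*
     * Drain until the pipe is empty or we see a message from cycle n + 1,
     * where n was the current cycle when we started.  Every message written
     * before we began is for some cycle <= n, so it cannot be hiding behind
     * a cycle n + 1 message.
     */
    uint64      stop_cycle = current_cycle + 1;
    FsyncRequestMessage msg;

    while (read_fsync_message(&msg, false /* don't block */ ))
    {
        remember_fsync_request(&msg);
        if (msg.cycle >= stop_cycle)
            break;
    }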
Attachment
On Fri, Nov 23, 2018 at 5:45 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I do have a new plan though... Ugh. The plan in my previous email doesn't work, I was confused about the timing of the buffer header update. Back to the drawing board. -- Thomas Munro http://www.enterprisedb.com
> On Mon, Nov 26, 2018 at 11:47 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > On Fri, Nov 23, 2018 at 5:45 PM Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > I do have a new plan though... > > Ugh. The plan in my previous email doesn't work, I was confused about > the timing of the buffer header update. Back to the drawing board. Any chance to share the drawing board with the ideas? :) On the serious note, I assume you have plans to work on this during the next CF, right?
On Sun, Dec 2, 2018 at 1:46 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Mon, Nov 26, 2018 at 11:47 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > > On Fri, Nov 23, 2018 at 5:45 PM Thomas Munro
> > > <thomas.munro@enterprisedb.com> wrote:
> > > I do have a new plan though...
> >
> > Ugh. The plan in my previous email doesn't work, I was confused about
> > the timing of the buffer header update. Back to the drawing board.
>
> Any chance to share the drawing board with the ideas? :)
>
> On the serious note, I assume you have plans to work on this during the next
> CF, right?

Indeed I am. Unfortunately, the solution to that deadlock eludes me still. So, I have split this work into multiple patches. 0001 is a draft version of some new infrastructure I'd like to propose, 0002 is the thing originally described by the first two paragraphs in the first email in this thread, and the rest I'll have to defer for now (the fd passing stuff).

To restate the purpose of this work: I want to make it possible for other patches to teach the checkpointer to fsync new kinds of files that are accessed through the buffer pool. Specifically, undo segment files (for zheap) and SLRU files (see Shawn Debnath's plan to put clog et al into the standard buffer pool).

The main changes are:

1. A bunch of stuff moved out of md.c into smgrsync.c, where the same pendingOpTable machinery can be shared by any block storage implementation.

2. The actual fsync'ing now happens by going through smgrsyncimmed().

3. You can now tell the checkpointer to forget individual segments (undo and slru both need to be able to do that when they truncate data from the 'front').

4. The protocol for forgetting relations etc is slightly different: if a file is found to be missing, we call AbsorbFsyncRequests() and then probe to see if the segment number disappeared from the set (instead of cancel flags), though I need to test this case.

5. Requests (ie segment numbers) are now stored in a sorted vector, because it doesn't make sense to store large and potentially sparse integers in bitmapsets. See patch 0001 for new machinery to support that.

The interfaces in 0001 are perhaps a bit wordy and verbose (and hard to fit in 80 columns). Maybe I need something better for memory contexts. Speaking of which, it wasn't possible to do a guaranteed-no-alloc merge (like the one done for zero-anchored bitmapset in commit 1556cb2fc), so I had to add a second vector for 'in progress' segments. I merge them with the main set on the next attempt, if it's found to be non-empty. Very open to better ideas on how to do any of this.

-- Thomas Munro http://www.enterprisedb.com
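As a rough picture of the data structure implied by points 1, 3 and 5 (struct and field names here are illustrative, not taken from the patch): one pending-operations entry per relation fork, carrying sorted vectors of dirty segment numbers.

    typedef struct PendingFsyncEntry
    {
        RelFileNode rnode;          /* hash key, part 1: which relation */
        ForkNumber  forknum;        /* hash key, part 2: which fork */
        segnum_vector requests;     /* sorted segment numbers awaiting fsync */
        segnum_vector in_progress;  /* segments set aside during a sync
                                     * attempt; merged into requests on the
                                     * next attempt, avoiding the need for a
                                     * guaranteed-no-alloc merge */
    } PendingFsyncEntry;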
Attachment
On Tue, Jan 1, 2019 at 10:41 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> So, I have split this work into multiple patches. 0001 is a draft
> version of some new infrastructure I'd like to propose, 0002 is the
> thing originally described by the first two paragraphs in the first
> email in this thread, and the rest I'll have to defer for now (the fd
> passing stuff).

Apologies, there was a header missing from 0002, and a small change needed to a contrib file that I missed. Here is a new version.

For the 0001 patch, I'll probably want to reconsider the naming a bit ("simple" -> "specialized", "generic", ...?), refine it (ability to turn off the small vector optimisation? optional MemoryContext? ability to extend without copying or zero-initialising at the same time? comparators with a user data parameter? two-value comparators vs three-value comparators? qsort with inline comparator? etc etc), and remove some gratuitous C++ cargo cultisms, and maybe also instantiate the thing centrally for some common types (I mean, perhaps 0002 should use a common uint32_vector rather than defining its own segnum_vector?).

I suppose an alternative would be to use simplehash for the set of segment numbers here, but it seems like overkill and would waste a ton of memory in the common case of holding a single number. I wondered also about some kind of tree (basically, C++ set) but it seems much more complicated and would still be much larger for common cases. Sorted vectors seem to work pretty well here (but would lose in theoretical cases where you insert low values in large sets, but not in practice here AFAICS).

-- Thomas Munro http://www.enterprisedb.com
Attachment
On Wed, Jan 2, 2019 at 11:40 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > For the 0001 patch, I'll probably want to reconsider the naming a bit > ("simple -> "specialized", "generic", ...?), refine (ability to turn > off the small vector optimisation? optional MemoryContext? ability > to extend without copying or zero-initialising at the same time? > comparators with a user data parameter? two-value comparators vs > three-value comparators? qsort with inline comparator? etc etc), and > remove some gratuitous C++ cargo cultisms, and maybe also instantiate > the thing centrally for some common types (I mean, perhaps 0002 should > use a common uint32_vector rather than defining its own > segnum_vector?). Here's a new version that fixes a couple of stupid bugs (mainly a broken XXX_lower_bound(), which I replaced with the standard algorithm I see in many sources). I couldn't resist the urge to try porting pg_qsort() to this style. It seems to be about twice as fast as the original at sorting integers on my machine with -O2. I suppose people aren't going to be too enthusiastic about yet another copy of qsort in the tree, but maybe this approach (with a bit more work) could replace the Perl code-gen for tuple sorting. Then the net number of copies wouldn't go up, but this could be used for more things too, and it fits with the style of simplehash.h and simplevector.h. Thoughts? -- Thomas Munro http://www.enterprisedb.com
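As a standalone illustration of the technique (a toy, not simplevector.h or the ported pg_qsort()): when the element type and comparator are fixed by a macro at compile time, the comparisons and element moves can be fully inlined instead of going through a comparator function pointer and memcpy()-sized swaps, which is where a ~2x win on integers typically comes from.

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Toy template: generates a type-specialized in-place sort.  A real
     * template (like the ported pg_qsort()) would generate a proper
     * quicksort; insertion sort just keeps the example short.
     */
    #define DEFINE_SORT(name, type, lt)                     \
    static void                                             \
    name(type *a, size_t n)                                 \
    {                                                       \
        for (size_t i = 1; i < n; i++)                      \
        {                                                   \
            type    v = a[i];                               \
            size_t  j = i;                                  \
                                                            \
            while (j > 0 && lt(v, a[j - 1]))                \
            {                                               \
                a[j] = a[j - 1];                            \
                j--;                                        \
            }                                               \
            a[j] = v;                                       \
        }                                                   \
    }

    #define UINT32_LT(x, y) ((x) < (y))
    DEFINE_SORT(sort_uint32, uint32_t, UINT32_LT)
    /* usage: sort_uint32(array, nelems); */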
Attachment
With the help of VMware's Dirk Hohndel (VMware's Chief Open Source Officer, a VP position near the top of the organization, and a personal friend of Linus), I have been fortunate enough to make contact directly with Linus Torvalds to discuss this issue. In emails to me he has told me that this patch is no longer provisional: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3 Linus has given me permission to quote him, so here is a quote from an email he sent 2019-01-17: > That commit (b4678df184b3: "errseq: Always report a writeback error > once") was already backported to the stable trees (4.14 and 4.16), so > yes, everything should be fine. We did indeed miss old errors for a > while. > > > The latest information I could find on this said this commit was "provisional" but > > also that it might be back-patched to 4.13 and on. Can you clarify the status of > > this patch in either respect? > > It was definitely backported to both 4.14 and 4.16, I see it in my > email archives. > > The bug may remain in 4.13, but that isn't actually maintained any > more, and I don't think any distro uses it (distros tend to use the > long-term stable kernels that are maintained, or sometimes maintain > their own patch queue). I think that eliminates the need for this patch. -- Kevin Grittner VMware vCenter Server https://www.vmware.com/
On Tue, Jan 22, 2019 at 8:27 AM Kevin Grittner <kgrittn@gmail.com> wrote:
> With the help of VMware's Dirk Hohndel (VMware's Chief Open Source
> Officer, a VP position near the top of the organization, and a
> personal friend of Linus), I have been fortunate enough to make
> contact directly with Linus Torvalds to discuss this issue. In emails
> to me he has told me that this patch is no longer provisional:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3
>
> Linus has given me permission to quote him, so here is a quote from an
> email he sent 2019-01-17:
> > That commit (b4678df184b3: "errseq: Always report a writeback error
> > once") was already backported to the stable trees (4.14 and 4.16), so
> > yes, everything should be fine. We did indeed miss old errors for a
> > while.

Sorry, but somehow I got the parent link rather than the intended commit. Linus got it right, of course.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b4678df184b314a2bd47d2329feca2c2534aa12b

-- Kevin Grittner VMware vCenter Server https://www.vmware.com/
Hi,

On 2019-01-22 08:27:48 -0600, Kevin Grittner wrote:
> With the help of VMware's Dirk Hohndel (VMware's Chief Open Source
> Officer, a VP position near the top of the organization, and a
> personal friend of Linus), I have been fortunate enough to make
> contact directly with Linus Torvalds to discuss this issue. In emails
> to me he has told me that this patch is no longer provisional:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3

Unfortunately, unless something has changed recently, that patch is *not* sufficient to really solve the issue - we don't guarantee that there's always an fd preventing the necessary information from being evicted from memory:

"Note that we might still lose the error if the inode gets evicted from the cache before anything can reopen it, but that was the case before errseq_t was merged. At LSF/MM we had some discussion about keeping inodes with unreported writeback errors around in the cache for longer (possibly indefinitely), but that's really a separate problem"

And that's entirely possible in postgres.

The commit was discussed on list too, btw...

Greetings,

Andres Freund
On Tue, Jan 22, 2019 at 12:17 PM Andres Freund <andres@anarazel.de> wrote: > Unfortunately, unless something has changed recently, that patch is > *not* sufficient to really solve the issue - we don't guarantee that > there's always an fd preventing the necessary information from being > evicted from memory: But we can't lose an FD without either closing it or suffering an abrupt termination that would trigger a PANIC, can we? And close() always calls fsync(). And I thought our "PANIC on fsync" patch paid attention to close(). How do you see this happening??? > Note that we might still lose the error if the inode gets evicted from > the cache before anything can reopen it, but that was the case before > errseq_t was merged. At LSF/MM we had some discussion about keeping > inodes with unreported writeback errors around in the cache for longer > (possibly indefinitely), but that's really a separate problem" > > And that's entirely possibly in postgres. Is it possible for an inode to be evicted while there is an open FD referencing it? > The commit was dicussed on list too, btw... Can you point to a post explaining how the inode can be evicted? -- Kevin Grittner VMware vCenter Server https://www.vmware.com/
On Wed, Jan 23, 2019 at 9:29 AM Kevin Grittner <kgrittn@gmail.com> wrote:
> Can you point to a post explaining how the inode can be evicted?

Hi Kevin,

To recap the (admittedly confusing) list of problems with Linux fsync or our usage:

1. On Linux < 4.13, the error state can be cleared in various surprising ways so that we never hear about it. Jeff Layton identified and fixed this problem for 4.13+ by switching from an error flag to an error counter that is tracked in such a way that every fd hears about every error in the file.

2. Layton's changes originally assumed that you only wanted to hear about errors that happened after you opened the file (ie it set the fd's counter to the inode's current level at open time). Craig Ringer complained about this. Everyone complained about this. A fix was then made so that one fd also reports errors that happened before opening, if no one else has seen them yet. This is the change that was back-patched as far as Linux 4.14. So long as no third program comes along and calls fsync on a file that we don't have open anywhere, thereby eating the "not seen" flag before the checkpointer gets around to opening the file, all is well.

3. Regardless of the above changes, we also learned that pages are unceremoniously dropped from the page cache after write-back errors, so that calling fsync() again after a failure is a bad idea (it might report success, but your dirty data written before the previous fsync() call is gone). We handled that by introducing a PANIC after any fsync failure:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1

So did MySQL, MongoDB, and probably everyone else who spat out their cornflakes while reading articles like "PostgreSQL's fsync() surprise" in the Linux Weekly News that resulted from Craig's report:

https://github.com/mysql/mysql-server/commit/8590c8e12a3374eeccb547359750a9d2a128fa6a
https://github.com/wiredtiger/wiredtiger/commit/ae8bccce3d8a8248afa0e4e0cf67674a43dede96

4. Regardless of all of the above changes, there is still one way to lose track of an error, as Andres mentioned: during a period of time when neither the writing backend nor the checkpointer has the file open, the kernel may choose to evict the inode from kernel memory, and thereby forget about an error that we haven't received yet.

Problems 1-3 are solved by changes to Linux and PostgreSQL. Problem 4 would be solved by this "fd-passing" scheme (file descriptors are never closed until after fsync has been called, existing in the purgatory of Unix socket ancillary data until the checkpointer eventually deals with them), but it's complicated and not quite fully baked yet. It could also be solved by the kernel agreeing not to evict inodes that hold error state, or to promote the error to device level, or something like that. IIUC those kinds of ideas were rejected so far. (It can also be solved by using FreeBSD and/or ZFS, so you don't have problem 3 and therefore don't have the other problems.)

I'm not sure how likely that failure mode actually is, but I guess you need a large number of active files, a low PostgreSQL max_safe_fds so we close descriptors aggressively, a kernel that is low on memory or has a high vfs_cache_pressure setting so that it throws out recently used inodes aggressively, enough time between checkpoints for all of the above to happen, and then some IO errors when the kernel is writing back dirty data asynchronously while you don't have the file open anywhere.
-- Thomas Munro http://www.enterprisedb.com
Hi,

On 2019-01-22 14:29:23 -0600, Kevin Grittner wrote:
> On Tue, Jan 22, 2019 at 12:17 PM Andres Freund <andres@anarazel.de> wrote:
>
> > Unfortunately, unless something has changed recently, that patch is
> > *not* sufficient to really solve the issue - we don't guarantee that
> > there's always an fd preventing the necessary information from being
> > evicted from memory:
>
> But we can't lose an FD without either closing it or suffering an
> abrupt termination that would trigger a PANIC, can we? And close()
> always calls fsync(). And I thought our "PANIC on fsync" patch paid
> attention to close(). How do you see this happening???

close() doesn't trigger an fsync() in general (although it does on many NFS implementations), and doing so would be *terrible* for performance. Given that, it's pretty clear how you can get all FDs closed, right? You just need sufficient open files that files get closed due to max_files_per_process, and you can run into the issue. A thousand open files is pretty easy to reach with forks, indexes, partitions etc, so this isn't particularly artificial.

> > Note that we might still lose the error if the inode gets evicted from
> > the cache before anything can reopen it, but that was the case before
> > errseq_t was merged. At LSF/MM we had some discussion about keeping
> > inodes with unreported writeback errors around in the cache for longer
> > (possibly indefinitely), but that's really a separate problem"
> >
> > And that's entirely possible in postgres.
>
> Is it possible for an inode to be evicted while there is an open FD
> referencing it?

No, but we don't guarantee that there's always an FD open, due to the max_files_per_process limit mentioned above.

> > The commit was discussed on list too, btw...
>
> Can you point to a post explaining how the inode can be evicted?

https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de is, I think, a good overview, with a bunch of links. As is the referenced lwn article [1] and the commit message you linked.

Greetings,

Andres Freund

[1] https://lwn.net/Articles/752063/
On Tue, Jan 22, 2019 at 2:38 PM Andres Freund <andres@anarazel.de> wrote: > close() doesn't trigger an fsync() in general What sort of a performance hit was observed when testing the addition of fsync or fdatasync before any PostgreSQL close() of a writable file, or have we not yet checked that? > https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de > is, I think, a good overview, with a bunch of links. Thanks! Will review. -- Kevin Grittner VMware vCenter Server https://www.vmware.com/
Hi,

On 2019-01-22 14:53:11 -0600, Kevin Grittner wrote:
> On Tue, Jan 22, 2019 at 2:38 PM Andres Freund <andres@anarazel.de> wrote:
>
> > close() doesn't trigger an fsync() in general
>
> What sort of a performance hit was observed when testing the addition
> of fsync or fdatasync before any PostgreSQL close() of a writable
> file, or have we not yet checked that?

I briefly played with it, and it was so atrocious (as in, less than something like 0.2x the throughput) that I didn't continue far down that path. The two ways I tried, IIRC (and it's really just from memory), were:

a) Short lived connections that do a bunch of writes to files each. That turns each disconnect into an fsync of most files.

b) Workload with > max_files_per_process files (IIRC I just used a bunch of larger tables with a few indexes each) in a read/write workload that's a bit larger than shared buffers. That led to most file closes being integrity writes, with obvious downsides.

Greetings,

Andres Freund
I (finally) got a chance to go through these patches and they look great. Thank you for working on this! A few comments:

- I do not see SmgrFileTag being defined or used like you mentioned in your first email. I see RelFileNode still being used. Is this planned for the future?

- Would be great to add a set of tests for SimpleVector.

> For the 0001 patch, I'll probably want to reconsider the naming a bit
> ("simple" -> "specialized", "generic", ...?)

I think the name SimpleVector is fine, fits with the SimpleHash theme. If the goal is to shorten it, perhaps a PG prefix would suffice?

> 4. The protocol for forgetting relations etc is slightly different:
> if a file is found to be missing, we call AbsorbFsyncRequests() and then probe
> to see if the segment number disappeared from the set (instead of
> cancel flags), though I need to test this case.

Can you explain this part a bit more? I am likely missing something in the patch.

> I couldn't resist the urge to try porting pg_qsort() to this style.
> It seems to be about twice as fast as the original at sorting integers
> on my machine with -O2. I suppose people aren't going to be too
> enthusiastic about yet another copy of qsort in the tree, but maybe
> this approach (with a bit more work) could replace the Perl code-gen
> for tuple sorting. Then the net number of copies wouldn't go up, but
> this could be used for more things too, and it fits with the style of
> simplehash.h and simplevector.h. Thoughts?

+1 for avoiding duplicate code. Would it be acceptable to migrate the rest of the usages to this model over time perhaps?

I'd love to move this patch forward. I wonder if it might be better to introduce two different functions catering to the two different use cases for forcing an immediate sync:

- sync a relation: smgrimmedsyncrel(SMgrRelation, ForkNumber)
- sync a specific segment: smgrimmedsyncseg(SMgrRelation, ForkNumber, SegmentNumber)

This will avoid having to specify InvalidSegmentNumber for the majority of the callers today.

-- Shawn Debnath Amazon Web Services (AWS)
On Wed, Jan 30, 2019 at 09:59:38PM -0800, Shawn Debnath wrote: > I (finally) got a chance to go through these patches and they look > great. Thank you for working on this! This review is very recent, so I have moved the patch to next CF. -- Michael
Attachment
On Wed, Jan 30, 2019 at 09:59:38PM -0800, Shawn Debnath wrote:
> I wonder if it might be better to introduce two different functions
> catering to the two different use cases for forcing an immediate sync:
>
> - sync a relation: smgrimmedsyncrel(SMgrRelation, ForkNumber)
> - sync a specific segment: smgrimmedsyncseg(SMgrRelation, ForkNumber, SegmentNumber)
>
> This will avoid having to specify InvalidSegmentNumber for the majority of
> the callers today.

I have gone ahead and rebased the refactor patch so it could cleanly apply on heapam.c, see patch v7.

I am also attaching a patch (v8) that implements smgrimmedsyncrel() and smgrimmedsyncseg() as I mentioned in the previous email. It avoids callers having to pass in InvalidSegmentNumber when they just want to sync the whole relation. As a side effect, I was able to get rid of some extra checkpointer.h includes.

-- Shawn Debnath Amazon Web Services (AWS)
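A minimal sketch of the shape of those two wrappers, assuming a three-argument smgrimmedsync() underneath and a SegmentNumber type with an InvalidSegmentNumber sentinel (per the description above; not checked against the v8 patch):

    /* Sync every segment of the given relation fork. */
    void
    smgrimmedsyncrel(SMgrRelation reln, ForkNumber forknum)
    {
        smgrimmedsync(reln, forknum, InvalidSegmentNumber);
    }

    /* Sync one specific segment of the given relation fork. */
    void
    smgrimmedsyncseg(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
    {
        smgrimmedsync(reln, forknum, segno);
    }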
Attachment
On Wed, Feb 13, 2019 at 3:58 PM Shawn Debnath <sdn@amazon.com> wrote:
> On Wed, Jan 30, 2019 at 09:59:38PM -0800, Shawn Debnath wrote:
> > I wonder if it might be better to introduce two different functions
> > catering to the two different use cases for forcing an immediate sync:
> >
> > - sync a relation: smgrimmedsyncrel(SMgrRelation, ForkNumber)
> > - sync a specific segment: smgrimmedsyncseg(SMgrRelation, ForkNumber, SegmentNumber)
> >
> > This will avoid having to specify InvalidSegmentNumber for the majority of
> > the callers today.
>
> I have gone ahead and rebased the refactor patch so it could cleanly
> apply on heapam.c, see patch v7.
>
> I am also attaching a patch (v8) that implements smgrimmedsyncrel() and
> smgrimmedsyncseg() as I mentioned in the previous email. It avoids
> callers having to pass in InvalidSegmentNumber when they just want to
> sync the whole relation. As a side effect, I was able to get rid of
> some extra checkpointer.h includes.

Hi Shawn,

Thanks! And sorry for not replying sooner -- I got distracted by FOSDEM (and the associated 20 thousand miles of travel). On that trip I had a chance to discuss this patch with Andres Freund in person, and he opined that it might be better for the fsync request queue to work in terms of pathnames. Instead of the approach in this patch, where a backend sends an fsync request for { relfilenode, segno } inside mdwrite(), and then the checkpointer processes the request by calling smgrimmedsyncrel(), he speculated that it'd be better to have mdwrite() send an fsync request for a pathname, and then the checkpointer would just open that file by name and fsync() it. That is, the checkpointer wouldn't call back into smgr.

One of the advantages of that approach is that there are probably other files that need to be fsync'd for each checkpoint that could benefit from being offloaded to the checkpointer. Another is that you break the strange cycle mentioned above.

Here's a possible problem with it. The fsync request messages would have to be either large (MAXPGPATH) or variable sized and potentially large. I am a bit worried that such messages would be problematic for the atomicity requirement of the (future) fd-passing patch that passes them via a Unix domain socket (which is a bit tricky because SOCK_SEQPACKET and SOCK_DGRAM aren't portable enough, so we probably have to use SOCK_STREAM, but there is no formal guarantee like PIPE_BUF; we know that in practice small messages will be atomic, but certainty decreases with larger messages. This needs more study...). You need to include the path even in a message containing an fd, because the checkpointer will use that as a hashtable key to merge received requests. Perhaps you'd solve that by using a small tag that can be converted back to a path (as you noticed, my last patch had some leftover dead code from an experiment along those lines), but then I think you'd finish up needing an smgr interface to convert it back to the path (implementation different for md.c, undofile.c, slru.c). So you don't exactly break the cycle mentioned earlier. Hmm, or perhaps you could avoid even thinking about atomicity by passing 1 byte fd-bearing messages via the pipe, and pathnames via shared memory, in the same order.

Another consideration if we do that is that the existing scheme has a kind of hierarchy that allows fsync requests to be cancelled in bulk when you drop relations and databases. That is, the checkpointer knows about the internal hierarchy of tablespace, db, rel, seg.
If we get rid of that and have just paths, it seems like a bad idea to teach the checkpointer about the internal structure of the paths (even though we know they contain the same elements encoded somehow). You'd have to send an explicit cancel for every key; that is, if you're dropping a relation, you need to generate a cancel message for every segment, and if you're dropping a database, you need to generate a cancel message for every segment of every relation. Once again, if you used some kind of tag that is passed back to smgr, you could probably come up with a way to do it. Thoughts? -- Thomas Munro http://www.enterprisedb.com
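For concreteness, the kind of small fixed-size tag being weighed here could look roughly like this (a sketch only, reusing the SegmentNumber type proposed earlier in the thread):

typedef struct FileTag
{
    RelFileNode rnode;          /* tablespace, database, relation */
    ForkNumber  forknum;
    SegmentNumber segno;        /* which segment of the fork */
} FileTag;

/*
 * A struct like this is on the order of 20 bytes -- far below PIPE_BUF and
 * MAXPGPATH -- so it can travel in a single small message, serve directly
 * as a hash table key for merging requests, and be converted back into a
 * pathname only when the checkpointer actually needs to (re)open the file.
 */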
Hi, On 2019-02-13 18:40:05 +1300, Thomas Munro wrote: > Thanks! And sorry for not replying sooner -- I got distracted by > FOSDEM (and the associated 20 thousand miles of travel). On that trip > I had a chance to discuss this patch with Andres Freund in person, and > he opined that it might be better for the fsync request queue to work > in terms of pathnames. Instead of the approach in this patch, where a > backend sends an fsync request for { reflfilenode, segno } inside > mdwrite(), and then the checkpointer processes the request by calling > smgrdimmedsyncrel(), he speculated that it'd be better to have > mdwrite() send an fsync request for a pathname, and then the > checkpointer would just open that file by name and fsync() it. That > is, the checkpointer wouldn't call back into smgr. > > One of the advantages of that approach is that there are probably > other files that need to be fsync'd for each checkpoint that could > benefit from being offloaded to the checkpointer. Another is that you > break the strange cycle mentioned above. The other issue is that I think your approach moves the segmentation logic basically out of md into smgr. I think that's wrong. We shouldn't presume that every type of storage is going to have segmentation that's representable in a uniform way imo. > Another consideration if we do that is that the existing scheme has a > kind of hierarchy that allows fsync requests to be cancelled in bulk > when you drop relations and databases. That is, the checkpointer > knows about the internal hierarchy of tablespace, db, rel, seg. If we > get rid of that and have just paths, it seems like a bad idea to teach > the checkpointer about the internal structure of the paths (even > though we know they contain the same elements encoded somehow). You'd > have to send an explicit cancel for every key; that is, if you're > dropping a relation, you need to generate a cancel message for every > segment, and if you're dropping a database, you need to generate a > cancel message for every segment of every relation. I can't see that being a problem - compared to the overhead of dropping a relation, that doesn't seem to be a meaningfully large cost? Greetings, Andres Freund
On Fri, Feb 15, 2019 at 06:45:02PM -0800, Andres Freund wrote: > > One of the advantages of that approach is that there are probably > > other files that need to be fsync'd for each checkpoint that could > > benefit from being offloaded to the checkpointer. Another is that you > > break the strange cycle mentioned above. > > The other issue is that I think your approach moves the segmentation > logic basically out of md into smgr. I think that's wrong. We shouldn't > presume that every type of storage is going to have segmentation that's > representable in a uniform way imo. I had a discussion with Thomas on this and am working on a new version of the patch that incorporates what you guys discussed at FOSDEM, but avoiding passing pathnames to checkpointer. The mdsync machinery will be moved out of md.c and the pending ops table will incorporate the segment number as part of the key. Still deciding on how to cleanly re-factor _mdfd_getseg, which mdsync utilizes during the file sync operations. The ultimate goal is to get the checkpointer the file descriptor it can use to issue the fsync using FileSync. So perhaps a function in smgr that returns just that based on the RelFileNode, fork and segno combination. Dealing only with file descriptors will allow us to implement passing FDs to checkpointer directly as part of the request in the future. The goal is to encapsulate relation specific knowledge within md.c while allowing undo and generic block store (ex-SLRU) to do their own mapping within the smgr layer later. Yes, checkpointer will "call back" into smgr, but these calls would only retrieve information that should be managed by smgr, allowing the checkpointer to focus on its job of tracking requests and syncing files via the fd interfaces. > > Another consideration if we do that is that the existing scheme has a > > kind of hierarchy that allows fsync requests to be cancelled in bulk > > when you drop relations and databases. That is, the checkpointer > > knows about the internal hierarchy of tablespace, db, rel, seg. If we > > get rid of that and have just paths, it seems like a bad idea to teach > > the checkpointer about the internal structure of the paths (even > > though we know they contain the same elements encoded somehow). You'd > > have to send an explicit cancel for every key; that is, if you're > > dropping a relation, you need to generate a cancel message for every > > segment, and if you're dropping a database, you need to generate a > > cancel message for every segment of every relation. > > I can't see that being a problem - compared to the overhead of dropping > a relation, that doesn't seem to be a meaningfully large cost? With the scheme above, dropping hierarchies will require scanning the hash table for matching dboid or reloid and removing those entries. We do this today for FORGET_DATABASE_FSYNC in RememberFsyncRequest. The matching function will belong in smgr. We can see how scanning the whole hash table impacts performance and iterate on it from there if needed. -- Shawn Debnath Amazon Web Services (AWS)
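A sketch of that idea, with hypothetical names (neither smgrsyncfd nor the smgr_syncfd callback appear in any posted patch):

/*
 * Given the relation, fork and segment, return a File the checkpointer can
 * pass straight to FileSync().  md.c would implement the callback with
 * _mdfd_getseg()-style lookup; undo and generic block storage would do
 * their own mapping.
 */
File
smgrsyncfd(RelFileNode rnode, ForkNumber forknum, SegmentNumber segno)
{
    SMgrRelation reln = smgropen(rnode, InvalidBackendId);

    return smgrsw[reln->smgr_which].smgr_syncfd(reln, forknum, segno);
}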
As promised, here's a patch that addresses the points discussed by Andres and Thomas at FOSDEM. As a result of how we want checkpointer to track what files to fsync, the pending ops table now integrates the forknum and segno as part of the hash key, eliminating the need for the bitmapsets or vectors from the previous iterations. We re-construct the pathnames from the RelFileNode, ForkNumber and SegmentNumber and use PathNameOpenFile to get the file descriptor to use for fsync.

Apart from that, this patch moves the system for requesting and processing fsyncs out of md.c into smgr.c, allowing us to call on smgr component specific callbacks to retrieve metadata like relation and segment paths. This allows smgr components to maintain how relfilenodes, forks and segments map to specific files without exposing this knowledge to smgr. It redefines smgrsync() behavior to be closer to that of smgrimmedsync(), i.e., if a regular sync is required for a particular file, enqueue it locally or forward it to the checkpointer. smgrimmedsync() retains the existing behavior and fsyncs the file right away. The processing of fsync requests has been moved from mdsync() to a new ProcessFsyncRequests() function.

Testing
-------
Checkpointer stats didn't cover what I wanted to verify, i.e., time spent dealing with the pending operations table. So I added temporary instrumentation to get the numbers by timing the code in ProcessFsyncRequests, which starts by absorbing fsync requests from the checkpointer queue, processing them and finally issuing sync on the files. Similarly, I added the same instrumentation in the mdsync code in the master branch. The time to actually execute FileSync is irrelevant for this patch. I did two separate runs for 30 mins, both with scale=10,000 on i3.8xlarge instances [1] with default params to force frequent checkpoints:

1. A single pgbench run having 1000 clients update 4 tables; as a result we get 4 relations and their forks, with several segments in each being synced.

2. 10 parallel pgbench runs on 10 separate databases having 200 clients each. This results in more relations and more segments being touched, letting us better compare against the bitmapset optimizations.

Results
--------
The important metric to look at would be the total time spent absorbing and processing the fsync requests as that's what the changes revolve around. The other metrics are here for posterity. The new code is about 6% faster in total time taken to process the queue for the single pgbench run. For the 10x parallel pgbench run, we are seeing drops of up to 70% with the patch. Would be great if some other folks can verify this.

The temporary instrumentation patches for the master branch and one that applies after the main patch are attached. Enable log_checkpoints and then use grep and cut to extract the numbers from the log file after the runs.
[Requests Absorbed]

single pgbench run
         Min     Max      Average    Median   Mode    Std Dev
-------- ------- -------- ---------- -------- ------- ----------
patch    15144   144961   78628.84   76124    58619   24135.69
master   25728   138422   81455.04   80601    25728   21295.83

10 parallel pgbench runs
         Min      Max      Average     Median   Mode     Std Dev
-------- -------- -------- ----------- -------- -------- ----------
patch    45098    282158   155969.4    151603   153049   39990.91
master   191833   602512   416533.86   424946   191833   82014.48

[Files Synced]

single pgbench run
         Min   Max   Average   Median   Mode   Std Dev
-------- ----- ----- --------- -------- ------ ---------
patch    153   166   158.11    158      159    1.86
master   154   166   158.29    159      159    10.29

10 parallel pgbench runs
         Min    Max    Average   Median   Mode   Std Dev
-------- ------ ------ --------- -------- ------ ---------
patch    1540   1662   1556.42   1554     1552   11.12
master   1546   1546   1546      1559     1553   12.79

[Total Time in ProcessFsyncRequest/mdsync]

single pgbench run
         Min   Max       Average   Median   Mode   Std Dev
-------- ----- --------- --------- -------- ------ ---------
patch    500   3833.51   2305.22   2239     500    510.08
master   806   4430.32   2458.77   2382     806    497.01

10 parallel pgbench runs
         Min    Max     Average    Median   Mode   Std Dev
-------- ------ ------- ---------- -------- ------ ---------
patch    908    6927    3022.58    2863     908    939.09
master   4323   17858   10982.15   11154    4322   2760.47

[1] https://aws.amazon.com/ec2/instance-types/i3/

--
Shawn Debnath
Amazon Web Services (AWS)
Attachment
Hi, Thanks for the update! On 2019-02-20 15:27:40 -0800, Shawn Debnath wrote: > As promised, here's a patch that addresses the points discussed by > Andres and Thomas at FOSDEM. As a result of how we want checkpointer to > track what files to fsync, the pending ops table now integrates the > forknum and segno as part of the hash key eliminating the need for the > bitmapsets, or vectors from the previous iterations. We re-construct the > pathnames from the RelFileNode, ForkNumber and SegmentNumber and use > PathNameOpenFile to get the file descriptor to use for fsync. I still object to exposing segment numbers to smgr and above. I think that's an implementation detail of the various storage managers, and we shouldn't expose smgr.c further than it already is. > Apart from that, this patch moves the system for requesting and > processing fsyncs out of md.c into smgr.c, allowing us to call on smgr > component specific callbacks to retrieve metadata like relation and > segment paths. This allows smgr components to maintain how relfilenodes, > forks and segments map to specific files without exposing this knowledge > to smgr. It redefines smgrsync() behavior to be closer to that of > smgrimmedsysnc(), i.e., if a regular sync is required for a particular > file, enqueue it in locally or forward it to checkpointer. > smgrimmedsync() retains the existing behavior and fsyncs the file right > away. The processing of fsync requests has been moved from mdsync() to a > new ProcessFsyncRequests() function. I think that's also wrong, imo fsyncs are storage detail, and should be relegated to *below* md.c. That's not to say the code should live in md.c, but the issuing of such calls shouldn't be part of smgr.c - consider e.g. developing a storage engine living in non volatile ram. Greetings, Andres Freund
On Thu, Feb 21, 2019 at 12:50 PM Andres Freund <andres@anarazel.de> wrote: > Thanks for the update! +1, thanks Shawn. It's interesting that you measure performance improvements that IIUC can come only from dropping the bitmapset stuff (or I guess bugs). I don't understand the mechanism for that improvement yet. The idea of just including the segment number (in some representation, possibly opaque to this code) in the hash table key instead of carrying a segment list seems pretty good from here (and I withdraw the sorted vector machinery I mentioned earlier as it's redundant for this project)... except for one detail. In your patch, you still have FORGET_RELATION_FSYNC and FORGET_DATABASE_FSYNC. That relies on this sync manager code being able to understand which stuff to drop from the hash table, which means that it still has knowledge of the key hierarchy, so that it can match all entries for the relation or database. That's one of the things that Andres is arguing against. I can see how to fix that for the relation case: md.c could simply enqueue a FORGET_REQUEST message for every segment and fork in the relation, so they can be removed by O(1) hash table lookup. I can't immediately see how to deal with the database case though. Perhaps in a tag scheme, you'd have to make the tag mostly opaque, except for a DB OID field, so you can handle this case? (Or some kind of predicate callback, so you can say "does this tag cancel this other tag?"; seems over the top). > On 2019-02-20 15:27:40 -0800, Shawn Debnath wrote: > > As promised, here's a patch that addresses the points discussed by > > Andres and Thomas at FOSDEM. As a result of how we want checkpointer to > > track what files to fsync, the pending ops table now integrates the > > forknum and segno as part of the hash key eliminating the need for the > > bitmapsets, or vectors from the previous iterations. We re-construct the > > pathnames from the RelFileNode, ForkNumber and SegmentNumber and use > > PathNameOpenFile to get the file descriptor to use for fsync. > > I still object to exposing segment numbers to smgr and above. I think > that's an implementation detail of the various storage managers, and we > shouldn't expose smgr.c further than it already is. Ok by me. The solution to this is probably either raw paths (as Andres has suggested, but which seem problematic to me -- see below), or some kind of small fixed size tag type that is morally equivalent to a path and can be converted to a path but is more convenient for shoving through pipes and into hash tables. > > Apart from that, this patch moves the system for requesting and > > processing fsyncs out of md.c into smgr.c, allowing us to call on smgr > > component specific callbacks to retrieve metadata like relation and > > segment paths. This allows smgr components to maintain how relfilenodes, > > forks and segments map to specific files without exposing this knowledge > > to smgr. It redefines smgrsync() behavior to be closer to that of > > smgrimmedsysnc(), i.e., if a regular sync is required for a particular > > file, enqueue it in locally or forward it to checkpointer. > > smgrimmedsync() retains the existing behavior and fsyncs the file right > > away. The processing of fsync requests has been moved from mdsync() to a > > new ProcessFsyncRequests() function. > > I think that's also wrong, imo fsyncs are storage detail, and should be > relegated to *below* md.c.
> That's not to say the code should live in > md.c, but the issuing of such calls shouldn't be part of smgr.c - > consider e.g. developing a storage engine living in non volatile ram. How about we take all that sync-related stuff, that Shawn has moved from md.c into smgr.c, and my earlier patch had moved out of md.c into smgrsync.c, and we put it into a new place src/backend/storage/file/sync.c? Or somewhere else, but not under smgr. It doesn't know anything about smgr concepts, and it can be used to schedule file sync for any kind of file, not just the smgr implementations' files. Though they'd be the main customers, I guess. md.c and undofile.c etc would call it to register stuff, and checkpointer.c would call it to actually perform all the fsync calls. If we do that, the next question is how to represent filenames. One idea is to use tags that can be converted back to a path. I suppose there could be a table of function pointers somewhere, and the tag could be a discriminated union? Or just a discriminator + a small array of dumb uint32_t of a size big enough for all users, a bit like lock tags. One reason to want to use small fixed-sized tags is to avoid atomicity problems in the future when it comes to the fd-passing work, as mentioned up-thread. Here are some other ideas, to avoid having to use tags: * send the paths through a shm queue, but the fds through the Unix domain socket; the messages carrying fds somehow point to the pathname in the shm queue (and deal with slight disorder...) * send the paths through the socket, but hold an LWLock while doing so to make sure it's atomic, no matter what the size * somehow prove that it's really already atomic even for long paths, on every operating system we support, and that it'll never change, so there is no problem here Another problem with variable sized pathnames even without the future fd-passing work is that it's harder to size the shm queue: the current code sets max_requests to NBuffers, which makes some kind of sense because that's a hard upper bound on the number of dirty segments there could possibly be at a given moment in time (one you'll practically never hit), after deduplication. It's harder to come up with a decent size for a new variable-sized-message queue; MAXPGPATH * NBuffers would be insanely large (it'd be 1/8th the size of the buffer pool), but if you make it any smaller there is no guarantee that compacting it can create space. Perhaps the solution to that is simply to block/wait while the shm queue is full -- but that might have deadlock problems.
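Spelled out, the discriminated-tag idea might take a shape like this (a sketch; the *_tagtopath functions are the per-storage-manager implementations mentioned above, and all names are placeholders):

typedef enum SyncTagKind
{
    SYNC_TAG_MD,                /* md.c relation segments */
    SYNC_TAG_UNDO,              /* undofile.c */
    SYNC_TAG_SLRU               /* slru.c */
} SyncTagKind;

typedef struct SyncTag
{
    SyncTagKind kind;           /* selects the tag-to-path routine */
    uint32      data[4];        /* opaque payload, a bit like a lock tag */
} SyncTag;

/* Implemented by the respective storage code, not by sync.c. */
extern void md_tagtopath(const SyncTag *tag, char *path);
extern void undofile_tagtopath(const SyncTag *tag, char *path);
extern void slru_tagtopath(const SyncTag *tag, char *path);

typedef void (*TagToPathFn) (const SyncTag *tag, char *path);

static const TagToPathFn tag_to_path_funcs[] = {
    [SYNC_TAG_MD] = md_tagtopath,
    [SYNC_TAG_UNDO] = undofile_tagtopath,
    [SYNC_TAG_SLRU] = slru_tagtopath
};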
> It's interesting that you measure performance improvements that IIUC > can come only from dropping the bitmapset stuff (or I guess bugs). I > don't understand the mechanism for that improvement yet. I will be digging into this a bit more to understand what really is the cause for the improvements. But first, need to get the refactor patch in better shape :-) > The idea of just including the segment number (in some representation, > possibly opaque to this code) in the hash table key instead of > carrying a segment list seems pretty good from here (and I withdraw > the sorted vector machinery I mentioned earlier as it's redundant for > this project)... except for one detail. In your patch, you still have > FORGET_RELATION_FSYNC and FORGET_DATABASE_FSYNC. That relies on this > sync manager code being able to understand which stuff to drop from > the hash table, which means that is still has knowledge of the key > hierarchy, so that it can match all entries for the relation or > database. That's one of the things that Andres is arguing against. You are correct. I actually did mention having a callback to do the request resolution in an response to Andres back up in the thread, but oops, completely slipped my mind with my last patch. > How about we take all that sync-related stuff, that Shawn has moved > from md.c into smgr.c, and my earlier patch had moved out of md.c into > smgrsync.c, and we put it into a new place > src/backend/storage/file/sync.c? Or somewhere else, but not under > smgr. It doesn't know anything about smgr concepts, and it can be > used to schedule file sync for any kind of file, not just the smgr > implementations' files. Though they'd be the main customers, I guess. > md.c and undofile.c etc would call it to register stuff, and > checkpointer.c would call it to actually perform all the fsync calls. > > If we do that, the next question is how to represent filenames. One > idea is to use tags that can be converted back to a path. I suppose > there could be a table of function pointers somewhere, and the tag > could be a discriminated union? Or just a descriminator + a small > array of dumb uint32_t of a size big enough for all users, a bit like > lock tags. > > One reason to want to use small fixed-sized tags is to avoid atomicity > problems in the future when it comes to the fd-passing work, as > mentioned up-thread. Here are some other ideas, to avoid having to > use tags: > > * send the paths through a shm queue, but the fds through the Unix > domain socket; the messages carrying fds somehow point to the pathname > in the shm queue (and deal with slight disorder...) > * send the paths through the socket, but hold an LWLock while doing so > to make sure it's atomic, no matter what the size > * somehow prove that it's really already atomic even for long paths, > on every operating system we support, and that it's never change, so > there is no problem here > > Another problem with variable sized pathnames even without the future > fd-passing work is that it's harder to size the shm queue: the current > code sets max_requests to NBuffers, which makes some kind of sense > because that's a hard upper bound on the number of dirty segments > there could possibly be at a given moment in time (one you'll > practically never hit), after deduplication. 
> It's harder to come up > with a decent size for a new variable-sized-message queue; MAXPGPATH * > NBuffers would be insanely large (it's be 1/8th the size of the buffer > pool), but if you make it any smaller there is no guarantee that > compacting it can create space. Perhaps the solution to that is > simply to block/wait while shm queue is full -- but that might have > deadlock problems. I think I have a much better understanding of what Andres is envisioning and agree with what Thomas has said so far. To summarize, we want a "sync" component at the level of fd, that components higher up the chain like md, undo, slru and checkpointer will use to track and process fsync requests (I am refraining from putting in an ascii diagram here!). These checkpoint requests will be opaque to the sync machinery and will rely on requesters to provide the necessary details. I agree with Thomas in that I don't think passing full pathnames or variable pathnames is the right way to go, for all the reasons Thomas mentioned in his email. However, if we want to, in the future we can easily extend the checkpoint request to include a type, CHECKPOINT_REQUEST_FD or CHECKPOINT_REQUEST_PATH, and delegate the current relfilenode to be of type CHECKPOINT_REQUEST_RELFILENODE. Sync can then act on the requests based on the type, and in some cases wouldn't need to interact with any other component. The pieces of information we need to process fsyncs are (1) determine if a request is to invalidate other requests the queue currently holds and (2) determine the full path to the file to issue fsync on. I think using callbacks is the better path forward than having md or other components issue an invalidate request for each and every segment, which can get quite heavy handed for large databases. Performance would be the same as today since we already scan the entire hash table when we encounter a forget request. This new approach will involve one additional function call inside the loop which does a simple compare. And Thomas brought up a good point offline: if we followed the path of smgr for the callbacks, it would lead to a header file dependency nightmare. It would be best for components like md to register their callback functions with sync so that sync doesn't have to include higher level header files to get access to their prototypes. At the time of smgrinit(), mdinit() would call into sync and register its callbacks with an ID. We can use the same value that we are using for smgr_which to map the callbacks. Each fsync request will then also carry this ID, which the sync mechanism will use to call handlers for resolving forget requests or obtaining paths for files. I think this should satisfy Andres' requirement for not teaching smgr anything about segmentation while keeping sync related knowledge far below the smgr layer. This scheme works well for undo and generic block storage. Though we might have to teach immedsync to accept block numbers so that undo and other storage managers can determine what segment it maps to. Thoughts? I am going to get started with revising the patch unless I hear otherwise. And love the great feedback btw, thank you. -- Shawn Debnath Amazon Web Services (AWS)
Hi,
While working on another PostgreSQL feature, I was thinking that we could use a temporal table in PostgreSQL. Some existing databases offer this. I searched for any discussion on the PostgreSQL mailing list, but could not find any. Maybe my search wasn’t accurate enough: if anyone can point me to a discussion, that would be useful.
https://www.percona.com/community-blog/2018/12/14/notes-mariadb-system-versioned-tables/
https://www.mssqltips.com/sqlservertip/3680/introduction-to-sql-server-temporal-tables/
What?
A temporal table feature has two tables “Temporal Table” and “History Table”. The Temporal Table is where our current tuples are stored. This is the main table, just like other PostgreSQL tables. The history table is the other half of the feature and is where all the history of the main table is stored. This table is created automatically. The history table is used to query certain data at a certain time, useful for a point in time analysis. It also offers built-in versioning.
Why?
Normally users write triggers or procedures to write a history of a table’s data. Some time-sensitive applications will have code to write a data history somewhere. By having this functionality, PostgreSQL would do it automatically. For example, if we have a retail table where the price of each product inventory item is stored. The temporal table would hold the current price of the product. When we update the price of a product in the temporal table, then a new row with a timestamp would be added to the history table. That means on each update of the price, a new row containing the previous price would be added to the history table. The same would apply in the case of deletes. When we delete any product from our inventory, then a row would be added to the history table storing the last price of the product prior to delete. For any point in time, we can access the price at which we sold the product.
How?
I was thinking about the implementation of this feature and read the documentation on the internet. Microsoft SQL Server, for example, offers such a feature. If we come to the conclusion we should offer the feature, I will share the complete design.
Here are some ideas I have around this:
- Syntax.
CREATE TABLE tablename
(
…
start_time DATETIME,
end_time DATETIME,
PERIOD FOR SYSTEM_TIME (start_time, end_time)
)
WITH
(
SYSTEM_VERSIONING = ON (HISTORY_TABLE = tablename_history)
);
The tablename is the temporal table and tablename_history is the history table. The name of the history table is optional; if it is omitted, PostgreSQL will generate a table name. The two columns “start_time” and “end_time” are required for a temporal table. The PERIOD FOR SYSTEM_TIME clause is used to identify these columns.
ALTER TABLE SET SYSTEM_VERSIONING = ON/OFF
Due to this syntax addition in CREATE/ALTER TABLE, there are some grammar additions required in the parser.
PERIOD FOR SYSTEM TIME
SYSTEM VERSIONING
- Catalog Changes.
There are two options: one is to have a new catalog, pg_temporal, which will contain this information; alternatively, we could store that information in the existing pg_catalog tables.
Table "public.pg_temporal"
Column | Type | Collation | Nullable | Default
-----------------+------+-----------+----------+---------
temporal_id | oid | | |
hist_id | oid | | |
start_date_name | text | | |
end_date_name | text | | |
--
Ibrar Ahmed
On Fri, Feb 22, 2019 at 15:41, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote: > > While working on another PostgreSQL feature, I was thinking that we could use a temporal table in PostgreSQL. Some existing databases offer this. I searched for any discussion on the PostgreSQL mailing list, but could not find any. Maybe my search wasn’t accurate enough: if anyone can point me to a discussion, that would be useful. > https://www.postgresql.org/message-id/CA%2BrenyUb%2BXHzsrPHHR6ELqguxaUPGhOPyVc7NW%2BkRsRpBZuUFQ%40mail.gmail.com This is the last one. I don't know why it wasn't in the January CF. -- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consulting, Development, 24x7 Support and Training
On 2/22/19 11:31 AM, Euler Taveira wrote: > On Fri, Feb 22, 2019 at 15:41, Ibrar Ahmed > <ibrar.ahmad@gmail.com> wrote: >> >> While working on another PostgreSQL feature, I was thinking that we could use a temporal table in PostgreSQL. Some existing databases offer this. I searched for any discussion on the PostgreSQL mailing list, but could not find any. Maybe my search wasn’t accurate enough: if anyone can point me to a discussion, that would be useful. >> > https://www.postgresql.org/message-id/CA%2BrenyUb%2BXHzsrPHHR6ELqguxaUPGhOPyVc7NW%2BkRsRpBZuUFQ%40mail.gmail.com > > This is the last one. I don't know why it wasn't in the January CF. Oh that's by me! :-) I didn't put it into the CF because I wanted to get some feedback on primary keys before I got too far into foreign keys, but someone recently advised me to start adding to CFs anyway with "WIP" in the title, so I'll do that next time. Btw my own patch is very modest, and I'd love to see this other much more extensive patch get some attention: https://www.postgresql.org/message-id/flat/CAHO0eLYyvuqwF%3D2FsgDn1xOs_NOrFBu9Xh-Wq%2BaWfFy0y6%3DjWQ%40mail.gmail.com#4f7fbace3a2f2ce85fcc161cc3fdd273 They were told to adjust where in the query pipeline they do their work, and the latest patch does that (as I understand it), but I don't think anyone has looked at it yet. Both of these patches use range types rather than SQL:2011 PERIODs, but I'd like to *also* support PERIODs (and accept ranges everywhere we accept PERIODs). Vik Fearing already has a patch to let you *declare* PERIODs: https://www.postgresql-archive.org/Periods-td6022563.html Actually using PERIODs in queries seems like a decent chunk of work though: basically it means making our grammar & processing accept PERIODs anywhere they currently accept columns. I'd love to hear some thoughts/suggestions around that. For example: a PERIOD is *similar* to a GENERATED column, so maybe the work being done there can/should influence how we implement them. I'm excited to be getting some momentum around temporal features though! I'm supposed to give a talk about them at PGCon in Ottawa this spring, so hopefully that will help too. Yours, -- Paul ~{:-) pj@illuminatedcomputing.com
On Fri, Feb 22, 2019 at 18:16, Paul Jungwirth <pj@illuminatedcomputing.com> wrote: > > On 2/22/19 11:31 AM, Euler Taveira wrote: > > On Fri, Feb 22, 2019 at 15:41, Ibrar Ahmed > > <ibrar.ahmad@gmail.com> wrote: > >> > >> While working on another PostgreSQL feature, I was thinking that we could use a temporal table in PostgreSQL. Some existing databases offer this. I searched for any discussion on the PostgreSQL mailing list, but could not find any. Maybe my search wasn’t accurate enough: if anyone can point me to a discussion, that would be useful. > >> > > https://www.postgresql.org/message-id/CA%2BrenyUb%2BXHzsrPHHR6ELqguxaUPGhOPyVc7NW%2BkRsRpBZuUFQ%40mail.gmail.com > > > > This is the last one. I don't know why it wasn't in the January CF. > > Oh that's by me! :-) > Forgot to CC you. > I didn't put it into the CF because I wanted to get some feedback on > primary keys before I got too far into foreign keys, but someone > recently advised me to starting adding to CFs anyway with "WIP" in the > title, so I'll do that next time. > Getting some feedback is one of the CF goals. Even if you have just a WIP, that CF feedback could help you solve/improve some pieces of your current code. > Btw my own patch is very modest, and I'd love to see this other much > more extensive patch get some attention: > > https://www.postgresql.org/message-id/flat/CAHO0eLYyvuqwF%3D2FsgDn1xOs_NOrFBu9Xh-Wq%2BaWfFy0y6%3DjWQ%40mail.gmail.com#4f7fbace3a2f2ce85fcc161cc3fdd273 > It isn't in the CF 2019-03. If you want it to be reviewed you should add it. At this point, both patches should target v13. -- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consulting, Development, 24x7 Support and Training
Hi, On 2019-02-22 10:18:57 -0800, Shawn Debnath wrote: > I think using callbacks is the better path forward than having md or > other components issue an invalidate request for each and every segment > which can get quite heavy handed for large databases. I'm not sure I buy this. Unlinking files isn't cheap, involves many disk writes, etc. The cost of an inval request in comparison isn't particularly large. Especially for relation-level (rather than database level) truncation, per-segment invals will likely commonly be faster than the sequential scan. > At the time of smgrinit(), mdinit() would call into sync and register > it's callbacks with an ID. We can use the same value that we are using > for smgr_which to map the callbacks. Each fsync request will then also > accompany this ID which the sync mechanism will use to call handlers for > resolving forget requests or obtaining paths for files. I'm not seeing a need to do this dynamically at runtime. Given that smgr isn't extensible, why don't we just map callbacks (or even just some switch based logic) based on some enum? Doing things at *init time has more potential to go wrong, because say a preload_shared_library does different things in postmaster than normal backends (in EXEC_BACKEND cases). Besides those two points, I think this is going in a good direction! Greetings, Andres Freund
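A sketch of that statically-mapped alternative (all names here are illustrative; nothing is registered at init time, so the postmaster and EXEC_BACKEND children cannot end up with different mappings):

typedef enum SyncRequestOwner
{
    SYNC_OWNER_MD,              /* md.c relation segments */
    SYNC_OWNER_UNDO,            /* undo file storage */
    SYNC_OWNER_SLRU             /* generic block storage (ex-SLRU) */
} SyncRequestOwner;

/* sync.c resolves the owner with a plain switch, no callbacks stored. */
static void
sync_request_path(SyncRequestOwner owner, const FileTag *tag, char *path)
{
    switch (owner)
    {
        case SYNC_OWNER_MD:
            mdfilepath(tag, path);
            break;
        case SYNC_OWNER_UNDO:
            undofilepath(tag, path);
            break;
        case SYNC_OWNER_SLRU:
            slrufilepath(tag, path);
            break;
    }
}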
On Sat, Feb 23, 2019 at 11:15 AM Andres Freund <andres@anarazel.de> wrote: > On 2019-02-22 10:18:57 -0800, Shawn Debnath wrote: > > I think using callbacks is the better path forward than having md or > > other components issue an invalidate request for each and every segment > > which can get quite heavy handed for large databases. > > I'm not sure I buy this. Unlinking files isn't cheap, involves many disk > writes, etc. The cost of an inval request in comparison isn't > particularly large. Especially for relation-level (rather than database > level) truncation, per-segment invals will likely commonly be faster > than the sequential scan. Well even if you do it with individual segment cancel messages for relations, you still need a way to deal with whole-database drops (generating the cancels for every segment in every relation in the database would be nuts), and that means either exposing some structure to the requests, right? So the requests would have { request type, callback ID, db, opaque tag }, where request type is SYNC, CANCEL, CANCEL_WHOLE_DB, callback ID is used to find the function that converts opaque tags to paths, and db is used for handling CANCEL_WHOLE_DB requests where you need to scan the whole hash table. Right? > > At the time of smgrinit(), mdinit() would call into sync and register > > it's callbacks with an ID. We can use the same value that we are using > > for smgr_which to map the callbacks. Each fsync request will then also > > accompany this ID which the sync mechanism will use to call handlers for > > resolving forget requests or obtaining paths for files. > > I'm not seeing a need to do this dynamically at runtime. Given that smgr > isn't extensible, why don't we just map callbacks (or even just some > switch based logic) based on some enum? Doing things at *init time has > more potential to go wrong, because say a preload_shared_library does > different things in postmaster than normal backends (in EXEC_BACKEND > cases). Yeah I suggested dynamic registration to avoid the problem that eg src/backend/storage/sync.c otherwise needs to forward declare md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe #include <storage/md.h> etc, which seemed like exactly the sort of thing up with which you would not put. (Which reminds me, smgr.{c,h} suffers from a similar disease, as the comment says: "move me elsewhere -- ay 7/94"). > Besides those two points, I think this is going in a good direction! Phew. :-) -- Thomas Munro https://enterprisedb.com
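Written out, a request along those lines might look like this (a sketch; the field and enum names are placeholders, not from any posted patch):

typedef enum SyncRequestOp
{
    SYNC_REQUEST_SYNC,          /* fsync this file at the next checkpoint */
    SYNC_REQUEST_CANCEL,        /* drop a previously queued request */
    SYNC_REQUEST_CANCEL_WHOLE_DB    /* drop everything for one database */
} SyncRequestOp;

typedef struct SyncRequest
{
    SyncRequestOp op;
    int         callback_id;    /* which tag-to-path function to use */
    Oid         dboid;          /* only consulted for CANCEL_WHOLE_DB */
    FileTag     tag;            /* opaque to the checkpointer */
} SyncRequest;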
Hi, On 2019-02-23 11:42:49 +1300, Thomas Munro wrote: > On Sat, Feb 23, 2019 at 11:15 AM Andres Freund <andres@anarazel.de> wrote: > > On 2019-02-22 10:18:57 -0800, Shawn Debnath wrote: > > > I think using callbacks is the better path forward than having md or > > > other components issue an invalidate request for each and every segment > > > which can get quite heavy handed for large databases. > > > > I'm not sure I buy this. Unlinking files isn't cheap, involves many disk > > writes, etc. The cost of an inval request in comparison isn't > > particularly large. Especially for relation-level (rather than database > > level) truncation, per-segment invals will likely commonly be faster > > than the sequential scan. > > Well even if you do it with individual segment cancel messages for > relations, you still need a way to deal with whole-database drops > (generating the cancels for every segment in every relation in the > database would be nuts), and that means either exposing some structure > to the requests, right? So the requests would have { request type, > callback ID, db, opaque tag }, where request type is SYNC, CANCEL, > CANCEL_WHOLE_DB, callback ID is used to find the function that > converts opaque tags to paths, and db is used for handling > CANCEL_WHOLE_DB requests where you need to scan the whole hash table. > Right? I'm ok with using callbacks to allow pruning for things like droping databases. If we use callbacks, I don't see a need to explicitly include the db in the request however? The callback can look into the opaque tag, no? Also, why do we need a separation between request type and callback? That seems like it'll commonly be entirely redundant? > > > At the time of smgrinit(), mdinit() would call into sync and register > > > it's callbacks with an ID. We can use the same value that we are using > > > for smgr_which to map the callbacks. Each fsync request will then also > > > accompany this ID which the sync mechanism will use to call handlers for > > > resolving forget requests or obtaining paths for files. > > > > I'm not seeing a need to do this dynamically at runtime. Given that smgr > > isn't extensible, why don't we just map callbacks (or even just some > > switch based logic) based on some enum? Doing things at *init time has > > more potential to go wrong, because say a preload_shared_library does > > different things in postmaster than normal backends (in EXEC_BACKEND > > cases). > > Yeah I suggested dynamic registration to avoid the problem that eg > src/backend/storage/sync.c otherwise needs to forward declare > md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe > #include <storage/md.h> etc, which seemed like exactly the sort of > thing up with which you would not put. I'm not sure I understand. If we have a few known tag types, what's the problem with including the headers with knowledge of how to implement them into sync.c file? Greetings, Andres Freund
On Sat, Feb 23, 2019 at 11:48 AM Andres Freund <andres@anarazel.de> wrote: > > Yeah I suggested dynamic registration to avoid the problem that eg > > src/backend/storage/sync.c otherwise needs to forward declare > > md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe > > #include <storage/md.h> etc, which seemed like exactly the sort of > > thing up with which you would not put. > > I'm not sure I understand. If we have a few known tag types, what's the > problem with including the headers with knowledge of how to implement > them into sync.c file? Typo in my previous email: src/backend/storage/file/sync.c was my proposal for the translation unit holding this stuff (we don't have .c files directly under storage). But it didn't seem right that stuff under storage/file (things concerned with files) should know about stuff under storage/smgr (md.c functions, higher level smgr stuff). Perhaps that just means it should go into a different subdir, maybe src/backend/storage/sync/sync.c, that knows about files AND smgr stuff, or perhaps I'm worrying about nothing. -- Thomas Munro https://enterprisedb.com
Hi, On 2019-02-23 11:59:04 +1300, Thomas Munro wrote: > On Sat, Feb 23, 2019 at 11:48 AM Andres Freund <andres@anarazel.de> wrote: > > > Yeah I suggested dynamic registration to avoid the problem that eg > > > src/backend/storage/sync.c otherwise needs to forward declare > > > md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe > > > #include <storage/md.h> etc, which seemed like exactly the sort of > > > thing up with which you would not put. > > > > I'm not sure I understand. If we have a few known tag types, what's the > > problem with including the headers with knowledge of how to implement > > them into sync.c file? > > Typo in my previous email: src/backend/storage/file/sync.c was my > proposal for the translation unit holding this stuff (we don't have .c > files directly under storage). But it didn't seem right that stuff > under storage/file (things concerned with files) should know about > stuff under storage/smgr (md.c functions, higher level smgr stuff). > Perhaps that just means it should go into a different subdir, maybe > src/backend/storage/sync/sync.c, that knows about files AND smgr > stuff, or perhaps I'm worrying about nothing. I mean, if you have a md_tagtopath and need its behaviour, I don't understand why a local forward declaration changes things? But I also think this is a bit of a non-problem - the abbreviated path names are just a different representation of paths, I don't see a problem of having that in sync.[ch], especially if we have the option to also have a full-length pathname variant if we ever need that. Greetings, Andres Freund
> > Well even if you do it with individual segment cancel messages for > > relations, you still need a way to deal with whole-database drops > > (generating the cancels for every segment in every relation in the > > database would be nuts), and that means either exposing some structure > > to the requests, right? So the requests would have { request type, > > callback ID, db, opaque tag }, where request type is SYNC, CANCEL, > > CANCEL_WHOLE_DB, callback ID is used to find the function that > > converts opaque tags to paths, and db is used for handling > > CANCEL_WHOLE_DB requests where you need to scan the whole hash table. > > Right? > > I'm ok with using callbacks to allow pruning for things like droping > databases. If we use callbacks, I don't see a need to explicitly include > the db in the request however? The callback can look into the opaque > tag, no? Also, why do we need a separation between request type and > callback? That seems like it'll commonly be entirely redundant? By having the id to distinguish which smgr to use for callbacks, we don't need DB. You are correct, my plan was to make the whole request opaque to the sync mechanism. ForwardFsyncRequest will take requester smgr id, request type (forget [db/rel], sync), relfilenode, forknum, and segno and convert it into a opaque CheckpointRequest and queue it locally. The only responsibility here is to map the different pieces of data into the opaque CheckpointerRequest. Requester ID in combination with request type will help us look up which callback function to execute. A new enum for requester ID is perfectly okay. I was trying to recycle the smgr id but perhaps that's not the right approach. > > > > Yeah I suggested dynamic registration to avoid the problem that > > > > eg > > > > src/backend/storage/sync.c otherwise needs to forward declare > > > > md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe > > > > #include <storage/md.h> etc, which seemed like exactly the sort of > > > > thing up with which you would not put. > > > > > > I'm not sure I understand. If we have a few known tag types, what's the > > > problem with including the headers with knowledge of how to implement > > > them into sync.c file? > > > > Typo in my previous email: src/backend/storage/file/sync.c was my > > proposal for the translation unit holding this stuff (we don't have .c > > files directly under storage). But it didn't seem right that stuff > > under storage/file (things concerned with files) should know about > > stuff under storage/smgr (md.c functions, higher level smgr stuff). > > Perhaps that just means it should go into a different subdir, maybe > > src/backend/storage/sync/sync.c, that knows about files AND smgr > > stuff, or perhaps I'm worrying about nothing. > > I mean, if you have a md_tagtopath and need its behaviour, I don't > understand why a local forward declaration changes things? But I also > think this is a bit of a non-problem - the abbreviated path names are > just a different representation of paths, I don't see a problem of > having that in sync.[ch], especially if we have the option to also have > a full-length pathname variant if we ever need that. Today, all the callbacks for smgr have their prototypes defined in smgr.h and used in smgr.c. Forward declarations within the new sync.h would be fine but having md callbacks split in two different places may not be the cleanest approach? One could argue they serve different purposes so perhaps it correct that we define them separately. 
I am fine with either, but now I probably prefer the new enum to fixed function declaration mappings that Andres suggested. I agree it would be less error prone. -- Shawn Debnath Amazon Web Services (AWS)
On 2019-02-22 15:45:50 -0800, Shawn Debnath wrote: > > > Well even if you do it with individual segment cancel messages for > > > relations, you still need a way to deal with whole-database drops > > > (generating the cancels for every segment in every relation in the > > > database would be nuts), and that means either exposing some structure > > > to the requests, right? So the requests would have { request type, > > > callback ID, db, opaque tag }, where request type is SYNC, CANCEL, > > > CANCEL_WHOLE_DB, callback ID is used to find the function that > > > converts opaque tags to paths, and db is used for handling > > > CANCEL_WHOLE_DB requests where you need to scan the whole hash table. > > > Right? > > > > I'm ok with using callbacks to allow pruning for things like droping > > databases. If we use callbacks, I don't see a need to explicitly include > > the db in the request however? The callback can look into the opaque > > tag, no? Also, why do we need a separation between request type and > > callback? That seems like it'll commonly be entirely redundant? > > By having the id to distinguish which smgr to use for callbacks, we > don't need DB. You are correct, my plan was to make the whole request > opaque to the sync mechanism. ForwardFsyncRequest will take requester > smgr id, request type (forget [db/rel], sync), relfilenode, forknum, and > segno and convert it into a opaque CheckpointRequest and queue it > locally. The only responsibility here is to map the different pieces of > data into the opaque CheckpointerRequest. Requester ID in combination > with request type will help us look up which callback function to > execute. > > A new enum for requester ID is perfectly okay. I was trying to recycle > the smgr id but perhaps that's not the right approach. I think it'd be a bad idea to use the smgr id here. So +1 for separating. > > > > > Yeah I suggested dynamic registration to avoid the problem that > > > > > eg > > > > > src/backend/storage/sync.c otherwise needs to forward declare > > > > > md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe > > > > > #include <storage/md.h> etc, which seemed like exactly the sort of > > > > > thing up with which you would not put. > > > > > > > > I'm not sure I understand. If we have a few known tag types, what's the > > > > problem with including the headers with knowledge of how to implement > > > > them into sync.c file? > > > > > > Typo in my previous email: src/backend/storage/file/sync.c was my > > > proposal for the translation unit holding this stuff (we don't have .c > > > files directly under storage). But it didn't seem right that stuff > > > under storage/file (things concerned with files) should know about > > > stuff under storage/smgr (md.c functions, higher level smgr stuff). > > > Perhaps that just means it should go into a different subdir, maybe > > > src/backend/storage/sync/sync.c, that knows about files AND smgr > > > stuff, or perhaps I'm worrying about nothing. > > > > I mean, if you have a md_tagtopath and need its behaviour, I don't > > understand why a local forward declaration changes things? But I also > > think this is a bit of a non-problem - the abbreviated path names are > > just a different representation of paths, I don't see a problem of > > having that in sync.[ch], especially if we have the option to also have > > a full-length pathname variant if we ever need that. > > Today, all the callbacks for smgr have their prototypes defined in > smgr.h and used in smgr.c. 
> Forward declarations within the new sync.h > would be fine but having md callbacks split in two different places may > not be the cleanest approach? One could argue they serve different > purposes so perhaps it correct that we define them separately. I am > fine with either, but now I probably prefer the new enum to fixed > function declaration mappings that Andres suggested. I agree it would be > less error prone. I'd really just entirely separate the two. I'd not include any knowledge of this mechanism into smgr.h, and just make the expansion of paths statically dispatched (potentially into md.c) to actually do the work. I'm not sure why sync.h would need any forward declarations for that? Greetings, Andres Freund
On Fri, Feb 22, 2019 at 03:57:42PM -0800, Andres Freund wrote: > > Today, all the callbacks for smgr have their prototypes defined in > > smgr.h and used in smgr.c. Forward declarations within the new sync.h > > would be fine but having md callbacks split in two different places may > > not be the cleanest approach? One could argue they serve different > > purposes so perhaps it correct that we define them separately. I am > > fine with either, but now I probably prefer the new enum to fixed > > function declaration mappings that Andres suggested. I agree it would be > > less error prone. > > I'd really just entirely separate the two. I'd not include any knowledge > of this mechanism into smgr.h, and just make the expansion of paths > dispatch statically dispatched (potentially into md.c) to actually do > the work. I'm not sure why sync.h would need any forward declarations > for that? We had a quick offline discussion to get on the same page and we agreed to move forward with Andres' approach above. Attached is patch v10. Here's the overview of the patch:

1. Move the system for requesting and processing fsyncs out of md.c into storage/sync/sync.c with definitions in include/storage/sync.h. ProcessSyncRequests() is now responsible for processing the sync requests during checkpoint.

2. Removed the need for specific storage managers to implement pre and post checkpoint callbacks. These are now executed by the sync mechanism.

3. We now embed the fork number and the segment number as part of the hash key for the pending ops table. This eliminates the bitmapset based segment tracking for each relfilenode during fsync, as not all storage managers may map their segments from zero.

4. Each sync request now must include a type: sync, forget, forget hierarchy, or unlink, and the owner who will be responsible for generating paths or matching forget requests.

5. For cancelling relation sync requests, we now must send a forget request for each fork and segment in the relation.

6. We do not rely on smgr to provide the file descriptor we use to issue fsync. Instead, we generate the full path based on the FileTag in the sync request and use PathNameOpenFile to get the file descriptor.

Ran make check-world and repeated the tests described in [1]. The numbers show a 12% drop in total time for a single run of 1000 clients and a ~62% drop in total time for 10 parallel runs with 200 clients:

[Requests Absorbed]
                Min      Max      Average     Median    Std Dev
--------------- -------- -------- ----------- --------- ----------
patch-1x1000    17554    147410   85252.95    83835.5   21898.88
master-1x1000   25728    138422   81455.04    80601     21295.83

                Min      Max      Average     Median    Std Dev
--------------- -------- -------- ----------- --------- ----------
patch-10x200    125922   279682   197098.76   197055    34038.25
master-10x200   191833   602512   416533.86   424946    82014.48

[Files Synced]
                Min    Max    Average   Median   Std Dev
--------------- ------ ------ --------- -------- ---------
patch-1x1000    155    213    158.93    159      2.97
master-1x1000   154    166    158.29    159      10.29

                Min    Max    Average   Median   Std Dev
--------------- ------ ------ --------- -------- ---------
patch-10x200    1456   1577   1546.84   1547     11.23
master-10x200   1546   1546   1546      1559     12.79

[Total Time in ProcessFsyncRequest/mdsync]
                Min    Max       Average    Median   Std Dev
--------------- ------ --------- ---------- -------- ---------
patch-1x1000    606    4022      2145.11    2123     440.72
master-1x1000   806    4430.32   2458.77    2382     497.01

                Min    Max       Average    Median   Std Dev
--------------- ------ --------- ---------- -------- ---------
patch-10x200    2472   6960      4156.8     4141     817.56
master-10x200   4323   17858     10982.15   11154    2760.47

[1] https://www.postgresql.org/message-id/20190220232739.GA8280%40f01898859afd.ant.amazon.com

--
Shawn Debnath
Amazon Web Services (AWS)
Attachment
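To make the shape of the v10 interface concrete, here is a rough sketch (not taken from the patch; error handling and the queue-full fallback are elided) of how a storage manager could queue an fsync for a dirty segment, using the ForwardSyncRequest()/RememberSyncRequest() calls quoted in the review below:

/*
 * Hypothetical sketch only: register a dirty segment with the new sync
 * machinery.  Function and enum names follow the v10 patch as quoted in
 * the review messages below; the real patch also deals with a full
 * checkpointer queue.
 */
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
    FileTag     tag;

    tag.rnode = reln->smgr_rnode.node;
    tag.forknum = forknum;
    tag.segno = seg->mdfd_segno;

    if (IsSyncManagedLocally())
    {
        /* standalone backend or startup process: fsync state is local */
        RememberSyncRequest(tag, SYNC_REQUEST, SYNC_MD);
    }
    else
    {
        /* tell the checkpointer, waiting if its queue is full */
        while (!ForwardSyncRequest(tag, SYNC_REQUEST, SYNC_MD))
            pg_usleep(10000L);
    }
}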
On Thu, Feb 28, 2019 at 10:27 AM Shawn Debnath <sdn@amazon.com> wrote: > We had a quick offline discussion to get on the same page and we agreed > to move forward with Andres' approach above. Attached is patch v10. > Here's the overview of the patch: Thanks. I will review, and try to rebase my undo patches on top of this and see what problems I crash into. > Ran make check-world and repeated the tests described in [1]. The > numbers show a 12% drop in total time for single run of 1000 clients and > ~62% drop in total time for 10 parallel runs with 200 clients: Hmm, good but unexpected. Will poke at this here. -- Thomas Munro https://enterprisedb.com
On Thu, Feb 28, 2019 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Feb 28, 2019 at 10:27 AM Shawn Debnath <sdn@amazon.com> wrote: > > We had a quick offline discussion to get on the same page and we agreed > > to move forward with Andres' approach above. Attached is patch v10. > > Here's the overview of the patch: > > Thanks. I will review, and try to rebase my undo patches on top of > this and see what problems I crash into. Some initial feedback: @@ -8616,7 +8617,7 @@ CreateCheckPoint(int flags) * the REDO pointer. Note that smgr must not do anything that'd have to * be undone if we decide no checkpoint is needed. */ - smgrpreckpt(); + PreCheckpoint(); I would call this and the "post" variant something like SyncPreCheckpoint(). Otherwise it's too general sounding and not clear which module it's in. +static const int NSync = lengthof(syncsw); Unused. + fd = PathNameOpenFile(path, O_RDWR | PG_BINARY); Needs to be closed. + path = syncsw[entry->owner].sync_filepath(entry->ftag); Probably doesn't make much difference, but wouldn't it be better for the path to be written into a caller-supplied buffer of size MAXPGPATH? Then we could have that on the stack instead of alloc/free for every path. Hmm, mdfilepath() needs to use GetRelationPath(), and that already returns palloc'd memory. Oh well. + entry->canceled = true; Now that we killed the bitmapset, I wonder if we still need this canceled flag. What if we just removed the entry from the hash table? If you killed the canceled flag you could then replace this: + if (entry->canceled) + break; .. with another hash table probe to see if the entry went in the AbsorbSyncRequests() call (having first copied the key into a local variable since of course "entry" may have been freed). Or maybe you don't think that's better, I dunno, just an idea :-) +ForwardSyncRequest(FileTag ftag, SyncRequestType type, SyncRequestOwner owner) Is it a deliberate choice that you pass FileTag objects around by value? Rather than, say, pointer to const. Not really a complaint in the current coding since it's a small object anyway (not much bigger than a pointer), but I guess someone might eventually want to make it into a flexible sized object, or something, I dunno. -ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum) +ForgetRelationSyncRequests(RelFileNode rnode, ForkNumber forknum, + SegmentNumber segno) { - if (pendingOpsTable) + FileTag tag; + + tag.rnode = rnode; + tag.forknum = forknum; + tag.segno = segno; + + if (IsSyncManagedLocally()) { /* standalone backend or startup process: fsync state is local */ - RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC); + RememberSyncRequest(tag, FORGET_REQUEST, SYNC_MD); } ... You left this and similar functions in md.c, but I think it needs to move out to sync.c, and just take a FileTag directly. Otherwise I have to write similar functions in undofile.c, and it seems kinda weird that those modules are worrying about whether sync is managed locally or the message needs to be sent to the checkpointer, and even worse, they have to duplicate the loop that deals with ForwardSyncRequest() failing and retrying. Concretely I'm saying that sync.c should define a function like this: +/* + * PostSyncRequest + * + * Remember locally, or post to checkpointer as appropriate. 
+ */ +void +PostSyncRequest(FileTag tag, SyncRequestType type, SyncRequestOwner owner) +{ + if (IsSyncManagedLocally()) + { + /* standalone backend or startup process: fsync state is local */ + RememberSyncRequest(tag, type, owner); + } + else if (IsUnderPostmaster) + { + while (!ForwardSyncRequest(tag, FORGET_REQUEST, SYNC_MD)) + pg_usleep(10000L); /* 10 msec seems a good number */ + } +} Hmm, perhaps it would need to take an argument to say whether it should keep retrying or return false if it fails; that way register_dirty_segment() could perform the FileSync() itself if the queue is full, but register_unlink could tell it to keep trying. Does this make sense? +typedef enum syncrequesttype +{ + SYNC_REQUEST, + FORGET_REQUEST, + FORGET_HIERARCHY_REQUEST, + UNLINK_REQUEST +} syncrequesttype; These names are too generic for the global C namespace; how about SYNC_REQ_CANCEL or similar? +typedef enum syncrequestowner +{ + SYNC_MD = 0 /* md smgr */ +} syncrequestowner; I have a feeling that Andres wanted to see a single enum combining both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD, SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you have it. +/* md callback forward declarations */ +extern char* mdfilepath(FileTag ftag); +extern bool mdtagmatches(FileTag ftag, FileTag predicate, SyncRequestType type); It's weird that these ^ are declared in sync.h. I think they should be declared in a new header md.h and that should be included by sync.c. I know that we have this historical weird thing there md.c's functions are declared in smgr.h, but we should eventually fix that. More soon. -- Thomas Munro https://enterprisedb.com
Hi, On 2019-03-01 23:17:27 +1300, Thomas Munro wrote: > @@ -8616,7 +8617,7 @@ CreateCheckPoint(int flags) > * the REDO pointer. Note that smgr must not do anything that'd have to > * be undone if we decide no checkpoint is needed. > */ > - smgrpreckpt(); > + PreCheckpoint(); > > I would call this and the "post" variant something like > SyncPreCheckpoint(). Otherwise it's too general sounding and not > clear which module it's in. Definitely. > +typedef enum syncrequesttype > +{ > + SYNC_REQUEST, > + FORGET_REQUEST, > + FORGET_HIERARCHY_REQUEST, > + UNLINK_REQUEST > +} syncrequesttype; > > These names are too generic for the global C namespace; how about > SYNC_REQ_CANCEL or similar? > > +typedef enum syncrequestowner > +{ > + SYNC_MD = 0 /* md smgr */ > +} syncrequestowner; > > I have a feeling that Andres wanted to see a single enum combining > both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD, > SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you > have it. Obviously it's nicer looking this way, but OTOH, that means we have to send more data over the queue, because we can't easily combine the request + "owner". I don't have too strong feelings about it though. FWIW, I don't like the name owner here. Class? Method? Greetings, Andres Freund
On Fri, Mar 1, 2019 at 12:43 PM Andres Freund <andres@anarazel.de> wrote: > Obviously it's nicer looking this way, but OTOH, that means we have to > send more data over the queue, because we can't easily combine the > request + "owner". I don't have too strong feelings about it though. Yeah, I would lean toward combining those. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 01, 2019 at 01:15:21PM -0500, Robert Haas wrote: > > > > > > +typedef enum syncrequestowner > > > +{ > > > + SYNC_MD = 0 /* md smgr */ > > > +} syncrequestowner; > > > > > > I have a feeling that Andres wanted to see a single enum combining > > > both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD, > > > SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you > > > have it. > > > > Obviously it's nicer looking this way, but OTOH, that means we have to > > send more data over the queue, because we can't easily combine the > > request + "owner". I don't have too strong feelings about it though. > > Yeah, I would lean toward combining those. I disagree, at least with combining and retaining enums. Encoding all the possible request types for the current, planned and future SMGRs would cause a sheer explosion in the number of enum values. Not to mention that you would have multiple enum values for the same behavior -- which just isn't clean. And handling of these enums in the code would be ugly too. Do note that these are typedef'ed to uint8 currently. For a default config with 128 MB shared_buffers, we would use an extra 16kB (one extra byte to represent the owner). I am hesitant to change this right now unless folks feel strongly about it. If so, I would combine the type and owner by splitting them into 4-bit chunks, allowing for 16 request types and 16 smgrs. This change would only apply to the in-memory queue. The code and functions would continue using the enums. -- Shawn Debnath Amazon Web Services (AWS)
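For illustration, a minimal standalone sketch of the 4-bit packing Shawn describes (the names are hypothetical, not from any patch version):

#include <stdint.h>
#include <assert.h>

/*
 * Pack a 4-bit request type and a 4-bit owner into one byte for the
 * checkpointer's in-memory queue; unpack them again on the receiving
 * side.  Illustrative only.
 */
static inline uint8_t
pack_sync_request(uint8_t type, uint8_t owner)
{
    assert(type < 16 && owner < 16);
    return (uint8_t) ((owner << 4) | type);
}

static inline uint8_t
unpack_sync_type(uint8_t packed)
{
    return packed & 0x0F;
}

static inline uint8_t
unpack_sync_owner(uint8_t packed)
{
    return packed >> 4;
}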
On Sat, Mar 2, 2019 at 8:36 AM Shawn Debnath <sdn@amazon.com> wrote: > On Fri, Mar 01, 2019 at 01:15:21PM -0500, Robert Haas wrote: > > > > > > > > +typedef enum syncrequestowner > > > > +{ > > > > + SYNC_MD = 0 /* md smgr */ > > > > +} syncrequestowner; > > > > > > > > I have a feeling that Andres wanted to see a single enum combining > > > > both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD, > > > > SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you > > > > have it. > > > > > > Obviously it's nicer looking this way, but OTOH, that means we have to > > > send more data over the queue, because we can't easily combine the > > > request + "owner". I don't have too strong feelings about it though. > > > > Yeah, I would lean toward combining those. > > I disagree, at least with combining and retaining enums. Encoding all > the possible request types with the current, planned and future SMGRs > would cause a sheer explosion in the number of enum values. Not to > mention that you have multiple enum values for the same behavior - which > just isn't clean. And handling of these enums in the code would be ugly > too. > > Do note that these are typedef'ed to uint8 currently. For a default > config with 128 MB shared_buffers, we will use an extra 16kB (one extra > byte to represent the owner). I am hesitant to change this right now > unless folks feel strongly about it. > > If so, I would combine the type and owner by splitting it up in 4 bit > chunks, allowing for 16 request types and 16 smgrs. This change would > only apply for the in-memory queue. The code and functions would > continue using the enums. +1 -- Thomas Munro https://enterprisedb.com
On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote: > I disagree, at least with combining and retaining enums. Encoding all > the possible request types with the current, planned and future SMGRs > would cause a sheer explosion in the number of enum values. How big of an explosion would it be? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote: > On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote: > > I disagree, at least with combining and retaining enums. Encoding all > > the possible request types with the current, planned and future SMGRs > > would cause a sheer explosion in the number of enum values. > > How big of an explosion would it be? 4 enum values x # of smgrs; currently md, soon undo and slru so 12 in total. Any future smgr addition will expand this further. -- Shawn Debnath Amazon Web Services (AWS)
On Sat, Mar 2, 2019 at 9:35 AM Shawn Debnath <sdn@amazon.com> wrote: > On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote: > > On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote: > > > I disagree, at least with combining and retaining enums. Encoding all > > > the possible request types with the current, planned and future SMGRs > > > would cause a sheer explosion in the number of enum values. > > > > How big of an explosion would it be? > > 4 enum values x # of smgrs; currently md, soon undo and slru so 12 in > total. Any future smgr addition will expand this further. It's not so much the "explosion" that bothers me. I think we should have a distinct sync requester enum, because we need a way to index into the table of callbacks. How exactly you pack the two enums into compact space seems like a separate question; doing it with two words would obviously be wasteful, but it should be possible to stuff them into (say) a single uint8_t, uint16_t or whatever will pack nicely in the request struct and allow the full range of request types (4?) + the full range of sync requesters (which we propose to expand to 3 in the foreseeable future). Now perhaps the single enum idea was going to involve explicit values that encode the two values, e.g. SYNC_REQ_CANCEL_MD = 0x1 | (0x04 << 4), so you could still extract the requester part, but that's just the same thing with uglier code. -- Thomas Munro https://enterprisedb.com
On Fri, Mar 1, 2019 at 3:35 PM Shawn Debnath <sdn@amazon.com> wrote: > On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote: > > On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote: > > > I disagree, at least with combining and retaining enums. Encoding all > > > the possible request types with the current, planned and future SMGRs > > > would cause a sheer explosion in the number of enum values. > > > > How big of an explosion would it be? > > 4 enum values x # of smgrs; currently md, soon undo and slru so 12 in > total. Any future smgr addition will expand this further. I thought the idea was that each smgr might have a different set of requests. If they're all going to have the same set of requests then I agree with you. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 01, 2019 at 09:42:56AM -0800, Andres Freund wrote: > FWIW, I don't like the name owner here. Class? Method? How about handler? Similar to when looking up callback functions in lookup_index_am_handler_func()? -- Shawn Debnath Amazon Web Services (AWS)
On Fri, Mar 01, 2019 at 04:27:47PM -0500, Robert Haas wrote: > On Fri, Mar 1, 2019 at 3:35 PM Shawn Debnath <sdn@amazon.com> wrote: > > On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote: > > > On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote: > > > > I disagree, at least with combining and retaining enums. Encoding all > > > > the possible request types with the current, planned and future SMGRs > > > > would cause a sheer explosion in the number of enum values. > > > > > > How big of an explosion would it be? > > > > 4 enum values x # of smgrs; currently md, soon undo and slru so 12 in > > total. Any future smgr addition will expand this further. > > I thought the idea was that each smgr might have a different set of > requests. If they're all going to have the same set of requests then > I agree with you. Yeah, in this particular case and at this layer, the operations are consistent across all storage managers, in that, they want to queue a new sync request for a specific file, forget an already queued request, forget a hierarchy of requests, or unlink a specific file. The fun is at the smgr layer which was discussed in a sub-thread in the "Drop type smgr" thread started by Thomas. I started on a patch and will be sending it out after the refactor patch is revised. -- Shawn Debnath Amazon Web Services (AWS)
On Fri, Mar 01, 2019 at 11:17:27PM +1300, Thomas Munro wrote: > - smgrpreckpt(); > + PreCheckpoint(); > > I would call this and the "post" variant something like > SyncPreCheckpoint(). Otherwise it's too general sounding and not > clear which module it's in. Sure - fixed. > +static const int NSync = lengthof(syncsw); > > Unused. Good catch - interesting how there was no warning thrown for this. > + fd = PathNameOpenFile(path, O_RDWR | PG_BINARY); > > Needs to be closed. *ahem* fixed > > + path = syncsw[entry->owner].sync_filepath(entry->ftag); > > Probably doesn't make much difference, but wouldn't it be better for > the path to be written into a caller-supplied buffer of size > MAXPGPATH? Then we could have that on the stack instead of alloc/free > for every path. > > Hmm, mdfilepath() needs to use GetRelationPath(), and that already > returns palloc'd memory. Oh well. Yep - I tried to muck with this as well but the logic is too neatly encapsulated inside relpath.c and given the usages via relpath[perm|backend] - I chose to forgo the attempt. > > + entry->canceled = true; > > Now that we killed the bitmapset, I wonder if we still need this > canceled flag. What if we just removed the entry from the hash table? > If you killed the canceled flag you could then replace this: > > + if (entry->canceled) > + break; > > .. with another hash table probe to see if the entry went in the > AbsorbSyncRequests() call (having first copied the key into a local > variable since of course "entry" may have been freed). Or maybe you > don't think that's better, I dunno, just an idea :-) It seems safer and cleaner to have the canceled flag as other approaches don't really give us any gain. But this made me realize that I should also cancel the entry for the simple forget case instead of removing it from the hash table. Fixed and modified the for loop to break if entry was cancelled. Deletion of the entry now happens in one place - after it has been processed or skipped inside ProcessSyncRequests. > +ForwardSyncRequest(FileTag ftag, SyncRequestType type, SyncRequestOwner owner) > > Is it a deliberate choice that you pass FileTag objects around by > value? Rather than, say, pointer to const. Not really a complaint in > the current coding since it's a small object anyway (not much bigger > than a pointer), but I guess someone might eventually want to make it > into a flexible sized object, or something, I dunno. It was deliberate to match what we are doing today. You have a good point regarding future modifications. May as well get the API changed correctly the first time around. Changed. > -ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum) > +ForgetRelationSyncRequests(RelFileNode rnode, ForkNumber forknum, > + SegmentNumber segno) > { > - if (pendingOpsTable) > + FileTag tag; > + > + tag.rnode = rnode; > + tag.forknum = forknum; > + tag.segno = segno; > + > + if (IsSyncManagedLocally()) > { > /* standalone backend or startup process: fsync state > is local */ > - RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC); > + RememberSyncRequest(tag, FORGET_REQUEST, SYNC_MD); > } > ... > > You left this and similar functions in md.c, but I think it needs to > move out to sync.c, and just take a FileTag directly. 
Otherwise I > have to write similar functions in undofile.c, and it seems kinda > weird that those modules are worrying about whether sync is managed > locally or the message needs to be sent to the checkpointer, and even > worse, they have to duplicate the loop that deals with > ForwardSyncRequest() failing and retrying. Concretely I'm saying that > sync.c should define a function like this: > > +/* > + * PostSyncRequest > + * > + * Remember locally, or post to checkpointer as appropriate. > + */ > +void > +PostSyncRequest(FileTag tag, SyncRequestType type, SyncRequestOwner owner) > +{ > + if (IsSyncManagedLocally()) > + { > + /* standalone backend or startup process: fsync state > is local */ > + RememberSyncRequest(tag, type, owner); > + } > + else if (IsUnderPostmaster) > + { > + while (!ForwardSyncRequest(tag, FORGET_REQUEST, SYNC_MD)) > + pg_usleep(10000L); /* 10 msec seems a > good number */ > + } > +} > > Hmm, perhaps it would need to take an argument to say whether it > should keep retrying or return false if it fails; that way > register_dirty_segment() could perform the FileSync() itself if the > queue is full, but register_unlink could tell it to keep trying. Does > this make sense? Yeah - makes sense. I did have it on my list, but I think I wanted to minimize changes to md and wanted to wait for a use case. Though it seems clear it's needed and looks like you already ran into it with undo. Added a new function RegisterSyncRequest that checks for local table or forwards to checkpointer. Accepts a parameter retryOnError which does what it says or exits on failure. > +typedef enum syncrequesttype > +{ > + SYNC_REQUEST, > + FORGET_REQUEST, > + FORGET_HIERARCHY_REQUEST, > + UNLINK_REQUEST > +} syncrequesttype; > > These names are too generic for the global C namespace; how about > SYNC_REQ_CANCEL or similar? Yeah, again, was trying to match what we had before. I prefixed these enums with SYNC to have a more readable set: SYNC_REQUEST SYNC_FORGET_REQUEST SYNC_FORGET_HIERARCHY_REQUEST SYNC_UNLINK_REQUEST Except for the hierarchy one, they are pretty reasonable in length. > +typedef enum syncrequestowner > +{ > + SYNC_MD = 0 /* md smgr */ > +} syncrequestowner; > > I have a feeling that Andres wanted to see a single enum combining > both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD, > SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you > have it. This was tackled in a sub-thread. I am keeping the enums but changing the the checkpointer queue to persist the type and owner combo in 8 bits. > +/* md callback forward declarations */ > +extern char* mdfilepath(FileTag ftag); > +extern bool mdtagmatches(FileTag ftag, FileTag predicate, > SyncRequestType type); > > It's weird that these ^ are declared in sync.h. I think they should > be declared in a new header md.h and that should be included by > sync.c. I know that we have this historical weird thing there md.c's > functions are declared in smgr.h, but we should eventually fix that. Long story short - I wanted to avoid having md.h include sync.h for the FileTag definitions. But - it's cleaner this way. Changed. -- Shawn Debnath Amazon Web Services (AWS)
Attachment
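Based on the description above, the new entry point presumably looks something like the following sketch (the exact signature and semantics are whatever the attached v11 patch defines; this only restates the behaviour described in the message):

/*
 * Sketch, not patch code: remember the request locally, or forward it to
 * the checkpointer, optionally retrying until the queue has room.
 */
bool
RegisterSyncRequest(FileTag tag, SyncRequestType type, SyncRequestOwner owner,
                    bool retryOnError)
{
    bool        ret = false;

    if (IsSyncManagedLocally())
    {
        /* standalone backend or startup process: sync state is local */
        RememberSyncRequest(tag, type, owner);
        return true;
    }

    for (;;)
    {
        ret = ForwardSyncRequest(tag, type, owner);
        if (ret || !retryOnError)
            break;
        pg_usleep(10000L);      /* give the checkpointer time to drain */
    }

    return ret;
}

With this shape, register_dirty_segment() can pass retryOnError = false and fall back to performing the FileSync() itself when the queue is full, while register_unlink() can pass true and keep trying.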
On Tue, Mar 5, 2019 at 2:25 PM Shawn Debnath <sdn@amazon.com> wrote: > [v11 patch] Thanks. Hmm, something is wrong here because make check is dramatically slower -- for example the "insert" test runs in ~8-13 seconds instead of the usual ~0.2 seconds according to Travis, AppVeyor and my local FreeBSD system (note that fsync is disabled so it's not that -- it must be bogus queue-related CPU?) -- Thomas Munro https://enterprisedb.com
On Tue, Mar 05, 2019 at 10:45:37PM +1300, Thomas Munro wrote: > On Tue, Mar 5, 2019 at 2:25 PM Shawn Debnath <sdn@amazon.com> wrote: > > [v11 patch] > > Thanks. Hmm, something is wrong here because make check is > dramatically slower -- for example the "insert" test runs in ~8-13 > seconds instead of the usual ~0.2 seconds according to Travis, > AppVeyor and my local FreeBSD system (note that fsync is disabled so > it's not that -- it must be bogus queue-related CPU?) Confirmed. Patch shows 8900 ms vs 192 ms on master for the insert test. Interesting! It's reproducible so should be able to figure out what's going on. The only thing we do in ForwardSyncRequest() is split up the 8 bits into 2x4 bits and copy the FileTagData structure to the checkpointer queue. Will report back what I find. -- Shawn Debnath Amazon Web Services (AWS)
On Wed, Mar 6, 2019 at 5:07 AM Shawn Debnath <sdn@amazon.com> wrote: > Confirmed. Patch shows 8900 ms vs 192 ms on master for the insert test. > Interesting! It's reproducible so should be able to figure out what's > going on. The only thing we do in ForwardSyncRequest() is split up the 8 > bits into 2x4 bits and copy the FileTagData structure to the > checkpointer queue. Will report back what I find. More review, all superficial stuff: +typedef struct +{ + RelFileNode rnode; + ForkNumber forknum; + SegmentNumber segno; +} FileTagData; + +typedef FileTagData *FileTag; Even though I know I said we should take FileTag by pointer, and even though there is an older tradition in the tree of having a struct named "FooData" and a corresponding pointer typedef named "Foo", as far as I know most people are not following the convention for new code and I for one don't like it. One problem is that there isn't a way to make a pointer-to-const type given a pointer-to-non-const type, so you finish up throwing away const from your programs. I like const as documentation and a tiny bit of extra compiler checking. What do you think about "FileTag" for the struct and eg "const FileTag *tag" when receiving one as a function argument? -/* internals: move me elsewhere -- ay 7/94 */ Aha, about time too! +#include "fmgr.h" +#include "storage/block.h" +#include "storage/relfilenode.h" +#include "storage/smgr.h" +#include "storage/sync.h" Why do we need to include fmgr.h in md.h? +/* md storage manager funcationality */ Typo. +/* md sync callback forward declarations */ These aren't "forward" declarations, they're plain old declarations. +extern char* mdfilepath(FileTag ftag); Doesn't really matter too much because all of this will get pgindent-ed at some point, but FYI we write "char *md", not "char* md". #include "storage/smgr.h" +#include "storage/md.h" #include "utils/hsearch.h" Bad sorting. + FileTagData tag; + tag.rnode = reln->smgr_rnode.node; + tag.forknum = forknum; + tag.segno = seg->mdfd_segno; I wonder if it would be better practice to zero-initialise that sucker, so that if more members are added we don't leave them uninitialised. I like the syntax "FileTagData tag = {{0}}". (Unfortunately extra nesting required here because first member is a struct, and C99 doesn't allow us to use empty {} like C++, even though some versions of GCC accept it. Rats.) -- Thomas Munro https://enterprisedb.com
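A tiny standalone illustration of the const problem Thomas mentions with pointer-hiding typedefs (nothing here is from the patch; the member is a stand-in):

typedef struct FileTagData
{
    int         segno;          /* stand-in member for illustration */
} FileTagData;

typedef FileTagData *FileTag;   /* the older "FooData"/"Foo" convention */

/* With the plain struct name, the natural const-correct spelling works: */
extern char *mdfilepath(const FileTagData *ftag);

/*
 * With the pointer-hiding typedef there is no way to spell "pointer to
 * const FileTagData": "const FileTag" makes the pointer itself const,
 * while the pointed-to data remains writable.
 */
extern char *mdfilepath_via_typedef(const FileTag ftag);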
Thomas Munro <thomas.munro@gmail.com> writes: > +#include "fmgr.h" > +#include "storage/block.h" > +#include "storage/relfilenode.h" > +#include "storage/smgr.h" > +#include "storage/sync.h" > Why do we need to include fmgr.h in md.h? More generally, any massive increase in an include file's inclusions is probably a sign that you need to refactor. Cross-header inclusions are best avoided altogether if you can --- obviously that's not always possible, but we should minimize them. We've had some very unfortunate problems in the past from indiscriminate #includes in headers. regards, tom lane
On Tue, Mar 05, 2019 at 11:53:16AM -0500, Tom Lane wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > +#include "fmgr.h" > > +#include "storage/block.h" > > +#include "storage/relfilenode.h" > > +#include "storage/smgr.h" > > +#include "storage/sync.h" > > > Why do we need to include fmgr.h in md.h? > > More generally, any massive increase in an include file's inclusions > is probably a sign that you need to refactor. Cross-header inclusions > are best avoided altogether if you can --- obviously that's not always > possible, but we should minimize them. We've had some very unfortunate > problems in the past from indiscriminate #includes in headers. Agree - I do pay attention to these, but this one slipped through the cracks (copied smgr.h then edited to remove smgr bits). Thanks for catching this, will fix in the next patch iteration. -- Shawn Debnath Amazon Web Services (AWS)
On Wed, Mar 06, 2019 at 05:33:54AM +1300, Thomas Munro wrote: > On Wed, Mar 6, 2019 at 5:07 AM Shawn Debnath <sdn@amazon.com> wrote: > > Confirmed. Patch shows 8900 ms vs 192 ms on master for the insert test. > > Interesting! It's reproducible so should be able to figure out what's > > going on. The only thing we do in ForwardSyncRequest() is split up the 8 > > bits into 2x4 bits and copy the FileTagData structure to the > > checkpointer queue. Will report back what I find. Fixed - tried to be clever with a do while loop and ended up forcing a sleep of 10 ms for every register request. Silly mistake (had an assert with the right assertion right after!). Reverted to while(1) with a clean break if we meet the conditions. Thanks for Thomas for being a second set of eyes during the investigation. make check runs are happy: patch: make check 2.48s user 0.94s system 12% cpu 27.411 total master: make check 2.50s user 0.88s system 12% cpu 27.573 total > More review, all superficial stuff: > > +typedef struct > +{ > + RelFileNode rnode; > + ForkNumber forknum; > + SegmentNumber segno; > +} FileTagData; > + > +typedef FileTagData *FileTag; > > Even though I know I said we should take FileTag by pointer, and even > though there is an older tradition in the tree of having a struct > named "FooData" and a corresponding pointer typedef named "Foo", as > far as I know most people are not following the convention for new > code and I for one don't like it. One problem is that there isn't a > way to make a pointer-to-const type given a pointer-to-non-const type, > so you finish up throwing away const from your programs. I like const > as documentation and a tiny bit of extra compiler checking. What do > you think about "FileTag" for the struct and eg "const FileTag *tag" > when receiving one as a function argument? More compile time safety checks are always better. I have made the changes in this new patch. Also, followed BufferTag pattern and defined INIT_FILETAG that will initialize the structure with the correct values. This avoids the point you bring up of accidentally omitting initializing members when new ones are added. > +#include "fmgr.h" > +#include "storage/block.h" > +#include "storage/relfilenode.h" > +#include "storage/smgr.h" > +#include "storage/sync.h" > > Why do we need to include fmgr.h in md.h? Removed. > +/* md storage manager funcationality */ > > Typo. Fixed > +/* md sync callback forward declarations */ > > These aren't "forward" declarations, they're plain old declarations. Removed and simplified. > +extern char* mdfilepath(FileTag ftag); > > Doesn't really matter too much because all of this will get > pgindent-ed at some point, but FYI we write "char *md", not "char* > md". Hmm - different, noted. Changed. > #include "storage/smgr.h" > +#include "storage/md.h" > #include "utils/hsearch.h" > > Bad sorting. Ordered correctly.. > + FileTagData tag; > + tag.rnode = reln->smgr_rnode.node; > + tag.forknum = forknum; > + tag.segno = seg->mdfd_segno; > > I wonder if it would be better practice to zero-initialise that > sucker, so that if more members are added we don't leave them > uninitialised. I like the syntax "FileTagData tag = {{0}}". > (Unfortunately extra nesting required here because first member is a > struct, and C99 doesn't allow us to use empty {} like C++, even though > some versions of GCC accept it. Rats.) See comments above for re-defining FileTag. -- Shawn Debnath Amazon Web Services (AWS)
Attachment
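The INIT_FILETAG macro mentioned above presumably looks roughly like this, in the INIT_BUFFERTAG tradition (a guess at the shape, not the patch's exact definition):

/*
 * Hypothetical sketch; field names follow the FileTag discussed earlier.
 * Initializing every member in one place avoids leaving new members
 * uninitialized when the struct grows.
 */
#define INIT_FILETAG(a, xx_rnode, xx_forknum, xx_segno) \
( \
    (a).rnode = (xx_rnode), \
    (a).forknum = (xx_forknum), \
    (a).segno = (xx_segno) \
)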
On Wed, Mar 6, 2019 at 6:16 AM Shawn Debnath <sdn@amazon.com> wrote: > On Tue, Mar 05, 2019 at 11:53:16AM -0500, Tom Lane wrote: > > Thomas Munro <thomas.munro@gmail.com> writes: > > > Why do we need to include fmgr.h in md.h? > > > > More generally, any massive increase in an include file's inclusions > > is probably a sign that you need to refactor. Cross-header inclusions > > are best avoided altogether if you can --- obviously that's not always > > possible, but we should minimize them. We've had some very unfortunate > > problems in the past from indiscriminate #includes in headers. > > Agree - I do pay attention to these, but this one slipped through the > cracks (copied smgr.h then edited to remove smgr bits). Thanks for > catching this, will fix in the next patch iteration. Huh... so why it was in smgr.h then? Seems bogus. Fix pushed to master. -- Thomas Munro https://enterprisedb.com
On Thu, Mar 07, 2019 at 08:11:51PM +1300, Thomas Munro wrote: > On Wed, Mar 6, 2019 at 6:16 AM Shawn Debnath <sdn@amazon.com> wrote: > > On Tue, Mar 05, 2019 at 11:53:16AM -0500, Tom Lane wrote: > > > Thomas Munro <thomas.munro@gmail.com> writes: > > > > Why do we need to include fmgr.h in md.h? > > > > > > More generally, any massive increase in an include file's inclusions > > > is probably a sign that you need to refactor. Cross-header inclusions > > > are best avoided altogether if you can --- obviously that's not always > > > possible, but we should minimize them. We've had some very unfortunate > > > problems in the past from indiscriminate #includes in headers. > > > > Agree - I do pay attention to these, but this one slipped through the > > cracks (copied smgr.h then edited to remove smgr bits). Thanks for > > catching this, will fix in the next patch iteration. > > Huh... so why it was in smgr.h then? Seems bogus. Fix pushed to master. So ... wondering if there are any other left over items for this patch or is it good to go? I imagine there's at least a couple of us who would love to see this get in for PG12. Thanks! -- Shawn Debnath Amazon Web Services (AWS)
On Wed, Mar 13, 2019 at 2:00 PM Shawn Debnath <sdn@amazon.com> wrote: > So ... wondering if there are any other left over items for this patch > or is it good to go? I imagine there's at least a couple of us who would > love to see this get in for PG12. I rebased my WIP undo stuff[1] (targeting 13) on top of this, and that seemed to go smoothly and the interfaces made sense, which was reassuring. I do wonder if we'll need to expose a way for eg pg_rewind and perhaps external backup tools to find paths and offsets given WAL block references that might in future include an SMGR ID (well that's one proposal), but that handwavy requirement doesn't seem to conflict with anything we've done here. I'm planning to do another round of review and testing. Aside from some refactoring which I think looks good anyway and prepares for future patches, the main effect of this patch is to force the checkpointer to open and close the files every time which seems OK to me. I know Andres wants to make a pass through it too. [1] https://www.postgresql.org/message-id/CA%2BhUKGKN42jB%2BubCKru716HPtMbahdia39GwG5pLgWLMZ_y1ng%40mail.gmail.com -- Thomas Munro https://enterprisedb.com
On Wed, Mar 13, 2019 at 2:27 PM Thomas Munro <thomas.munro@gmail.com> wrote: > [...] Aside from some refactoring which I > think looks good anyway and prepares for future patches, the main > effect of this patch is to force the checkpointer to open and close > the files every time which seems OK to me. I've been trying to decide if that is a problem. Maybe there is a performance angle, and I'm also wondering if it might increase the risk of missing a write-back error. Of course we'll find a proper solution to that problem (perhaps fd-passing or sync-on-close[1]), but I don't want to commit anything in the name of refactoring that might make matters worse incidentally. Or perhaps those worries are bogus, since the checkpointer calls smgrcloseall() at the end anyway. On balance, I'm inclined to err on the side of caution and try to keep things a bit closer to the way they are for now. Here's a fixup patch. 0001 is the same as Shawn's v12 patch, and 0002 has my proposed changes to switch to callbacks that actually perform the sync and unlink operations given a file tag, and do so via the SMGR fd cache, rather than exposing the path to sync.c. This moves us back towards the way I had it in an earlier version of the patch, but instead of using smgrsyncimmed() as I had it, it goes via Shawn's new sync handler function lookup table, allowing for non-smgr components to use this machinery in future (as requested by Andres). Thoughts? It all needs to be pgindented, I'll do that later. I'll post a rebase of my undo stuff on top of this soon, to show how it looks with this interface. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGLMPXMnSLDwgnNRFPyxvf_0bJ5HwXcHWjCp7Cvh7G%3DxEA%40mail.gmail.com -- Thomas Munro https://enterprisedb.com
Attachment
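To visualise the interface change in 0002, the per-handler lookup table ends up looking something like this sketch (member and callback names are provisional; the patches under discussion define the real ones):

/*
 * Sketch of the sync handler table after the 0002 fixup: the callbacks
 * perform the sync/unlink themselves, going through the smgr fd cache,
 * rather than handing a path back to sync.c to open.
 */
typedef struct SyncOps
{
    int         (*sync_syncfiletag) (const FileTag *ftag);
    int         (*sync_unlinkfiletag) (const FileTag *ftag);
    bool        (*sync_filetagmatches) (const FileTag *ftag,
                                        const FileTag *candidate);
} SyncOps;

static const SyncOps syncsw[] = {
    {mdsyncfiletag, mdunlinkfiletag, mdfiletagmatches}  /* md */
};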
On Tue, Mar 26, 2019 at 12:20 AM Thomas Munro <thomas.munro@gmail.com> wrote: > I've been trying to decide if that is a problem. Maybe there is a > performance angle, and I'm also wondering if it might increase the > risk of missing a write-back error. Of course we'll find a proper > solution to that problem (perhaps fd-passing or sync-on-close[1]), but > I don't want to commit anything in the name of refactoring that might > make matters worse incidentally. Or perhaps those worries are bogus, > since the checkpointer calls smgrcloseall() at the end anyway. > > On balance, I'm inclined to err on the side of caution and try to keep > things a bit closer to the way they are for now. > > Here's a fixup patch. 0001 is the same as Shawn's v12 patch, and 0002 > has my proposed changes to switch to callbacks that actually perform > the sync and unlink operations given a file tag, and do so via the > SMGR fd cache, rather than exposing the path to sync.c. This moves us > back towards the way I had it in an earlier version of the patch, but > instead of using smgrsyncimmed() as I had it, it goes via Shawn's new > sync handler function lookup table, allowing for non-smgr components > to use this machinery in future (as requested by Andres). Strong +1. Not only might closing and reopening the files have performance and reliability implications, but a future smgr might talk to the network, having no local file to sync. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 27, 2019 at 5:48 AM Shawn Debnath <sdn@amazon.com> wrote: > On Tue, Mar 26, 2019 at 09:22:56AM -0400, Robert Haas wrote: > > On Tue, Mar 26, 2019 at 12:20 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > > Here's a fixup patch. 0001 is the same as Shawn's v12 patch, and 0002 > > > has my proposed changes to switch to callbacks that actually perform > > > the sync and unlink operations given a file tag, and do so via the > > > SMGR fd cache, rather than exposing the path to sync.c. This moves us > > > back towards the way I had it in an earlier version of the patch, but > > > instead of using smgrsyncimmed() as I had it, it goes via Shawn's new > > > sync handler function lookup table, allowing for non-smgr components > > > to use this machinery in future (as requested by Andres). > > > > Strong +1. Not only might closing and reopening the files have > > performance and reliability implications, but a future smgr might talk > > to the network, having no local file to sync. > > Makes sense for now. When we re-visit the fd-passing or sync-on-close > implementations, we can adapt the changes relatively easily given the > rest of the framework is staying intact. I am hoping these patches do > not delay the last fsync-gate issue discussion further. I found a few more things that I thought needed adjustment: * Packing handler and request type into a uint8 is cute but a waste of time if we're just going to put it in a struct next to a member that requires word-alignment. So I changed it to a pair of plain old int16 members. The ftag member starts at offset 4 either way, on my system. * I didn't really like the use of the word HIERARCHY in the name of the request type, and changed it to SYNC_FILTER_REQUEST. That word came up because we were implementing a kind of hierarchy, where if you drop a database you want to forget things for all segments inside all relations inside that database, but the whole point of this new API is that it doesn't understand that, it calls a filter function to decide which requests to keep. So I preferred "filter" as a name for the type of request. * I simplified the "matches" callback interface. * Various typos and comment clean-up. I'm going to do some more testing and tidying tomorrow (for example I think the segment.h header is silly and I'd like to remove that), and commit this. -- Thomas Munro https://enterprisedb.com
Attachment
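As an example of what the renamed "filter" request is for, dropping a database might cancel all of its pending requests roughly like this (a hedged sketch; the function name, argument order and enum spellings are approximate, and the tag is passed by pointer to const per the earlier review):

/*
 * Sketch only: forget every pending sync request for a database by
 * sending one SYNC_FILTER_REQUEST; the handler's "matches" callback
 * decides which queued requests the filter applies to.
 */
void
ForgetDatabaseSyncRequests(Oid dbid)
{
    FileTag     tag;

    memset(&tag, 0, sizeof(tag));
    tag.rnode.dbNode = dbid;

    /* keep retrying until the checkpointer accepts the request */
    RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, SYNC_MD, true);
}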
On Tue, Apr 2, 2019 at 11:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > I'm going to do some more testing and tidying tomorrow (for example I > think the segment.h header is silly and I'd like to remove that), and > commit this. As a sanity check on the programming interface this thing gives you, I tried teaching the SLRUs to use the fsync queue. I finished up making a few small improvements, but the main thing I learned is that "handler" needs to be part of the hash table key. I suppose the discriminator could even be inside FileTag itself, but I chose to keep it separate and introduce a new struct to hold hander enum + FileTag in the hash table. The 0001 patch is what I'm going to commit soon. I don't know why Shawn measured a performance change -- a bug in an earlier version? -- I can't see it, but I'm planning to look into that a bit more first. I've attached the 0002 SLRU patch for interest, but I'm not planning to commit that one. -- Thomas Munro https://enterprisedb.com
Attachment
On Thu, Apr 04, 2019 at 01:08:31AM +1300, Thomas Munro wrote: > On Tue, Apr 2, 2019 at 11:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > I'm going to do some more testing and tidying tomorrow (for example I > > think the segment.h header is silly and I'd like to remove that), and > > commit this. Given the dislike in the thread for introducing the concept of segments at any layer higher than the storage manager itself, I thought it would be better to leave the existing header files alone and introduce a new one to separate the concept. I am fine either way we go. > As a sanity check on the programming interface this thing gives you, I > tried teaching the SLRUs to use the fsync queue. I finished up making > a few small improvements, but the main thing I learned is that > "handler" needs to be part of the hash table key. I suppose the > discriminator could even be inside FileTag itself, but I chose to keep > it separate and introduce a new struct to hold hander enum + FileTag > in the hash table. I think this is fine, but can you elaborate a bit more on why we need to include the handler in the key for the hash table? We are de-duping relfilenodes here and these should never collide with files from another component. The component separation would be encoded in the RelFileNode that the target smgr, or SLRU in this case, would be able to decipher. Do you foresee, or know of, use cases where FileTag alone will result in conflicts on the same file but from different handlers? +/* + * Populate a file tag describing a segment file. We only use the segment + * number, since we can derive everything else we need by having separate + * sync handler functions for clog, multixact etc. + */ +#define INIT_SLRUFILETAG(a,xx_segno) \ +( \ + (a).rnode.spcNode = 0, \ + (a).rnode.dbNode = 0, \ + (a).rnode.relNode = 0, \ + (a).forknum = 0, \ + (a).segno = (xx_segno) \ +) Based on the definition of INIT_SLRUFILETAG in your patch, it seems you are trying to only use the segno to identify the file. Not sure why we can't use other unused fields in FileTag to identify the component? You could for example in the current SLRU implementation have the handler set to SYNC_HANDLER_SLRU when invoking RegisterSyncRequest() and use relNode to distinguish between each SLRU component in a wrapper function and call slrusyncfiletag() with the right SlruCtl. For the SLRU to buffer cache work, I was planning on using the relNode field to identify which specific component this tag belongs to. dbNode would be pointing to the type of smgr (as discussed earlier in the thread and still TBD). I would prefer not to expand the hash key unnecessarily and given this isn't persisted, we can expand the key in the future if needed. Keeps the code simpler for now. > The 0001 patch is what I'm going to commit soon. I don't know why > Shawn measured a performance change -- a bug in an earlier version? -- > I can't see it, but I'm planning to look into that a bit more first. > I've attached the 0002 SLRU patch for interest, but I'm not planning > to commit that one. Question is which block of code did you measure? I can redo the instrumentation on the latest patch and re-validate and share the results. I previously measured the average time it took mdsync() and ProcessSyncRequests() in the patch to complete under similar workloads. 
> I found a few more things that I thought needed adjustment: > > * Packing handler and request type into a uint8 is cute but a waste of > time if we're just going to put it in a struct next to a member that > requires word-alignment. So I changed it to a pair of plain old int16 > members. The ftag member starts at offset 4 either way, on my system. Good catch! For posterity, using packed attribute here would be bad well as it would result RelFileNode's spcNode in FileTag in subsequent entries to be misaligned given our usage of the requests as an array. This is potentially unsafe on platforms other than x86. Re-arranging the fields would lead to the same result. Thanks for catching and fixing this! > * I didn't really like the use of the word HIERARCHY in the name of > the request type, and changed it to SYNC_FILTER_REQUEST. That word > came up because we were implementing a kind of hierarchy, where if you > drop a database you want to forget things for all segments inside all > relations inside that database, but the whole point of this new API is > that it doesn't understand that, it calls a filter function to decide > which requests to keep. So I preferred "filter" as a name for the > type of request. Yeah, I never liked the word hierarchy too much, but couldn't think of a better one either. Filter is perfect. -- Shawn Debnath Amazon Web Services (AWS)
On Thu, Apr 4, 2019 at 10:44 AM Shawn Debnath <sdn@amazon.com> wrote: > On Thu, Apr 04, 2019 at 01:08:31AM +1300, Thomas Munro wrote: > > As a sanity check on the programming interface this thing gives you, I > > tried teaching the SLRUs to use the fsync queue. I finished up making > > a few small improvements, but the main thing I learned is that > > "handler" needs to be part of the hash table key. I suppose the > > discriminator could even be inside FileTag itself, but I chose to keep > > it separate and introduce a new struct to hold hander enum + FileTag > > in the hash table. > > I think this is fine, but can you elaborate a bit more on why we need to > include the handler in the key for the hash table? We are de-duping > relfilenodes here and these should never collide with files from another > component. The component separation would be encoded in the RelFileNode > that the target smgr, or SLRU in this case, would be able to decipher. > Do you foresee, or know of, use cases where FileTag alone will result in > conflicts on the same file but from different handlers? Well it depends how we think of the FileTag namespace. I didn't worry before because md.c and my development version of undofile.c generate FileTag objects that never collide, because undofile.c sets the dbNode to invalid (and in earlier versions had a special magic DB OID), while md.c always uses valid values. But when I did the SLRU experiment with that 0002 patch, I figured it was reasonable to want to set only the segment number part of the FileTag. Then sync requests for pg_xact, pg_multixact/offsets and pg_multixact/members got merged. Oops. As you say, we could decide to assign bogus reserved dbNode values to keep them the keyspaces apart and not allow future modules to deviate from using a RelFileNode in the tag, but we already more-or-less decided in another thread that we don't want to do that for buffer tags. People seemed keen to see another discriminator for smgr ID, rather than requiring eg md.c and undo.c to use non-colliding RelFileNode values via a fake dbNode scheme. This problem is essentially the same, except that we decided we want the set of sync handlers to be potentially larger than the set of smgr handlers. For example, until we have a patch that moves SLRUs into shared buffers, SLRUs don't have SMGR IDs, and get as the 0002 patch shows, they could theoretically still benefit from handing off fsync work; there may be other cases like that. Generally, the question is similar in both places: (1) do sync handlers each get their own separate namespace of FileTags?, (2) do smgrs get their own namespace of BufferTag? Perhaps that is an argument for putting the sync handler number *inside* the FileTag, since we currently intend to do that with smgr IDs in BufferTag (stealing space from ForkNumber). > +/* > + * Populate a file tag describing a segment file. We only use the segment > + * number, since we can derive everything else we need by having separate > + * sync handler functions for clog, multixact etc. > + */ > +#define INIT_SLRUFILETAG(a,xx_segno) \ > +( \ > + (a).rnode.spcNode = 0, \ > + (a).rnode.dbNode = 0, \ > + (a).rnode.relNode = 0, \ > + (a).forknum = 0, \ > + (a).segno = (xx_segno) \ > +) > > Based on the definition of INIT_SLRUFILETAG in your patch, it seems you > are trying to only use the segno to identify the file. Not sure why we > can't use other unused fields in FileTag to identify the component? 
You > could for example in the current SLRU implementation have the handler > set to SYNC_HANDLER_SLRU when invoking RegisterSyncRequest() and use > relNode to distinguish between each SLRU component in a wrapper function > and call slrusyncfiletag() with the right SlruCtl. Yes, that would work too, though then slru.c would have to know about clog, multixact etc so it could find their SlruCtlData objects, which are currently static in the eg clog.c, multixact.c etc. That's why I put a tiny callback into each of those that could call slru.c to do the work. I'm not really proposing anything here, and I understand that you might want to refactor this stuff completely, I just wanted a quick sanity check of *something* else using this interface, other than md.c and my undo patch, in a particular a thing-that-isn't-connected-to-smgr. If it turns out to be useful for later work, good, if not, I won't be sad. I also suppressed the urge to teach it to use fd.c files and keep them open in a small cache -- all topics for future threads. > For the SLRU to buffer cache work, I was planning on using the relNode > field to identify which specific component this tag belongs to. dbNode > would be pointing to the type of smgr (as discussed earlier in the > thread and still TBD). I look forward to the patches :-) But as mentioned, there was some serious push-back on hijacking dbNode like that. > > The 0001 patch is what I'm going to commit soon. I don't know why > > Shawn measured a performance change -- a bug in an earlier version? -- > > I can't see it, but I'm planning to look into that a bit more first. > > I've attached the 0002 SLRU patch for interest, but I'm not planning > > to commit that one. > > Question is which block of code did you measure? I can redo the > instrumentation on the latest patch and re-validate and share the > results. I previously measured the average time it took mdsync() and > ProcessSyncRequests() in the patch to complete under similar workloads. Ok. Let me try that again here too. -- Thomas Munro https://enterprisedb.com
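The "tiny callback" arrangement described above might look something like this sketch (all names here are hypothetical; only the idea -- clog.c handing its SlruCtl back to a generic worker in slru.c -- comes from the message):

/*
 * Hypothetical sketch of an SLRU sync handler callback: clog.c exposes a
 * small callback that passes its own SlruCtl to a generic routine in
 * slru.c, so slru.c itself doesn't need to know about clog, multixact
 * and friends.
 */
static int
clogsyncfiletag(const FileTag *ftag)
{
    return SlruSyncFileTag(ClogCtl, ftag);  /* assumed helper in slru.c */
}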
On Thu, Apr 4, 2019 at 11:39 AM Thomas Munro <thomas.munro@gmail.com> wrote: > ... Perhaps > that is an argument for putting the sync handler number *inside* the > FileTag, since we currently intend to do that with smgr IDs in > BufferTag (stealing space from ForkNumber). Here is a version like that. I like it better this way, and the extra space can be clawed back by using 16 bit types to hold the fork number and sync handler number. Here again is the straw-man 0002 patch updated to show how it might look for another potential user. -- Thomas Munro https://enterprisedb.com
Attachment
On Thu, Apr 04, 2019 at 02:01:14PM +1300, Thomas Munro wrote: > On Thu, Apr 4, 2019 at 11:39 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > ... Perhaps > > that is an argument for putting the sync handler number *inside* the > > FileTag, since we currently intend to do that with smgr IDs in > > BufferTag (stealing space from ForkNumber). > > Here is a version like that. I like it better this way, and the extra > space can be clawed back by using 16 bit types to hold the fork number > and sync handler number. +typedef struct FileTag +{ + int16 handler; /* SyncRequstHandler value, saving space */ + int16 forknum; /* ForkNumber, saving space */ + RelFileNode rnode; + BlockNumber segno; +} FileTag; Definitely makes sense. v16 looks good to me. Thanks! -- Shawn Debnath Amazon Web Services (AWS)
On 2019-04-03 21:19:45 -0700, Shawn Debnath wrote: > On Thu, Apr 04, 2019 at 02:01:14PM +1300, Thomas Munro wrote: > > On Thu, Apr 4, 2019 at 11:39 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > > ... Perhaps > > > that is an argument for putting the sync handler number *inside* the > > > FileTag, since we currently intend to do that with smgr IDs in > > > BufferTag (stealing space from ForkNumber). > > > > Here is a version like that. I like it better this way, and the extra > > space can be clawed back by using 16 bit types to hold the fork number > > and sync handler number. > > +typedef struct FileTag > +{ > + int16 handler; /* SyncRequstHandler value, saving space */ > + int16 forknum; /* ForkNumber, saving space */ > + RelFileNode rnode; > + BlockNumber segno; > +} FileTag; Seems odd to me to use BlockNumber for segno.
On Thu, Apr 4, 2019 at 5:36 PM Andres Freund <andres@anarazel.de> wrote: > On 2019-04-03 21:19:45 -0700, Shawn Debnath wrote: > > +typedef struct FileTag > > +{ > > + int16 handler; /* SyncRequstHandler value, saving space */ > > + int16 forknum; /* ForkNumber, saving space */ > > + RelFileNode rnode; > > + BlockNumber segno; > > +} FileTag; > > Seems odd to me to use BlockNumber for segno. That is a tradition in md.c code. I had a new typedef SegmentNumber in all sync.{c,h} stuff in an earlier version, but had trouble figuring out where to define it... -- Thomas Munro https://enterprisedb.com
On Thu, Apr 04, 2019 at 05:39:14PM +1300, Thomas Munro wrote: > On Thu, Apr 4, 2019 at 5:36 PM Andres Freund <andres@anarazel.de> wrote: > > On 2019-04-03 21:19:45 -0700, Shawn Debnath wrote: > > > +typedef struct FileTag > > > +{ > > > + int16 handler; /* SyncRequstHandler value, saving space */ > > > + int16 forknum; /* ForkNumber, saving space */ > > > + RelFileNode rnode; > > > + BlockNumber segno; > > > +} FileTag; > > > > Seems odd to me to use BlockNumber for segno. > > That is a tradition in md.c code. I had a new typedef SegmentNumber > in all sync.{c,h} stuff in an earlier version, but had trouble > figuring out where to define it... Thomas, this is why I had defined segment.h with the contents below :-) +++ b/src/include/storage/segment.h [...] +/* + * Segment Number: + * + * Each relation and its forks are divided into segments. This + * definition formalizes the definition of the segment number. + */ +typedef uint32 SegmentNumber; + +#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF) My last iteration, v12, patch had it. See [1] for comments on removal of segment.h. [1] https://www.postgresql.org/message-id/20190403214423.GA45392%40f01898859afd.ant.amazon.com -- Shawn Debnath Amazon Web Services (AWS)
On Thu, Apr 4, 2019 at 6:41 PM Shawn Debnath <sdn@amazon.com> wrote: > On Thu, Apr 04, 2019 at 05:39:14PM +1300, Thomas Munro wrote: > > On Thu, Apr 4, 2019 at 5:36 PM Andres Freund <andres@anarazel.de> wrote: > > > On 2019-04-03 21:19:45 -0700, Shawn Debnath wrote: > > > > +typedef struct FileTag > > > > +{ > > > > + int16 handler; /* SyncRequstHandler value, saving space */ > > > > + int16 forknum; /* ForkNumber, saving space */ > > > > + RelFileNode rnode; > > > > + BlockNumber segno; > > > > +} FileTag; > > > > > > Seems odd to me to use BlockNumber for segno. > > > > That is a tradition in md.c code. I had a new typedef SegmentNumber > > in all sync.{c,h} stuff in an earlier version, but had trouble > > figuring out where to define it... > > Thomas, this is why I had defined segment.h with the contents below :-) > > +++ b/src/include/storage/segment.h > [...] > +/* > + * Segment Number: > + * > + * Each relation and its forks are divided into segments. This > + * definition formalizes the definition of the segment number. > + */ > +typedef uint32 SegmentNumber; > + > +#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF) I don't think it's project policy to put a single typedef into its own header like that, and I'm not sure where else to put it. I have changed FileTag's segno member to plain old uint32 for now. md.c continues to use BlockNumber for segment numbers (as it has ever since commit e0c9301c87 switched over from int), but that's all internal to md.c and I think we can reasonably leave that sort of improvement to a later patch. Pushed. Thanks for all the work on this! -- Thomas Munro https://enterprisedb.com
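For readers skimming the thread, the tag as described at this point would look roughly like this; it is reconstructed from the messages above rather than copied from the committed header, so the field comments are paraphrased.

typedef struct FileTag
{
	int16		handler;		/* SyncRequestHandler value, saving space */
	int16		forknum;		/* ForkNumber, saving space */
	RelFileNode rnode;			/* identifies the relation */
	uint32		segno;			/* plain uint32 now, rather than BlockNumber */
} FileTag;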
On 2019-Apr-04, Thomas Munro wrote: > I don't think it's project policy to put a single typedef into its own > header like that, and I'm not sure where else to put it. shrug. Looks fine to me. I suppose if we don't have it anywhere, it's just because we haven't needed that particular trick yet. Creating a file with a lone typedef seems better than using uint32 to me. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Apr 5, 2019 at 2:03 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2019-Apr-04, Thomas Munro wrote: > > I don't think it's project policy to put a single typedef into its own > > header like that, and I'm not sure where else to put it. > > shrug. Looks fine to me. I suppose if we don't have it anywhere, it's > just because we haven't needed that particular trick yet. Creating a > file with a lone typedef seems better than using uint32 to me. It was commit 9fac5fd7 that gave me that idea. Ok, here is a patch that adds a one-typedef header and uses SegmentIndex to replace all cases of BlockNumber and int holding a segment number (whether as an "index" or a "count"). -- Thomas Munro https://enterprisedb.com
Attachment
On Fri, Apr 5, 2019 at 10:53 AM Thomas Munro <thomas.munro@gmail.com> wrote: > Ok, here is a patch that adds a one-typedef header and uses > SegmentIndex to replace all cases of BlockNumber and int holding a > segment number (whether as an "index" or a "count"). (sorry, I meant "SegmentNumber", not "SegmentIndex") -- Thomas Munro https://enterprisedb.com
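To illustrate the shape of that follow-up patch: the typedef matches the segment.h excerpt quoted earlier in the thread, while the md.c prototype below is just one representative declaration chosen for illustration; the full set of call sites is in the attachment and is not reproduced here.

/* src/include/storage/segment.h */
typedef uint32 SegmentNumber;

#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)

/*
 * In md.c, declarations that currently hold a segment number in a
 * BlockNumber (or an int), for example
 *     static MdfdVec *_mdfd_openseg(..., BlockNumber segno, int oflags);
 * would instead read:
 */
static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
							  SegmentNumber segno, int oflags);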
On Thu, Apr 4, 2019 at 11:47 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Pushed. Thanks for all the work on this! I managed to break this today while testing with RELSEG_SIZE set to 1 block (= squillions of 8kB files). The problem is incorrect arguments to _mdfd_getseg(), in code added recently by me. Without the EXTENSION_DONT_CHECK_SIZE flag, it refuses to open segments following segments that have been truncated, leading to a checkpointer fsync panic. It's also passing segnum where a blocknum is wanted. It should have used exactly the same arguments as in the old code, but didn't. I will push a fix shortly. -- Thomas Munro https://enterprisedb.com
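Based on that description (and not on the pushed fix itself), the corrected lookup presumably has to convert the tag's segment number back into a block number and pass the same flags the pre-refactoring mdsync() code used, so that a segment lying beyond a truncation point can still be opened and fsync'd. A sketch of how that call might look inside the md.c sync callback:

	MdfdVec    *v;

	/*
	 * Sketch only: blocknum = segno * RELSEG_SIZE, and
	 * EXTENSION_DONT_CHECK_SIZE allows opening a segment that lies beyond
	 * the relation's current length (e.g. after a truncation), instead of
	 * failing and triggering a checkpointer fsync PANIC.
	 */
	v = _mdfd_getseg(reln,
					 ftag->forknum,
					 (BlockNumber) ftag->segno * (BlockNumber) RELSEG_SIZE,
					 false,
					 EXTENSION_RETURN_NULL | EXTENSION_DONT_CHECK_SIZE);
	if (v == NULL)
		return -1;				/* let the caller decide how to report it */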
On Fri, Apr 05, 2019 at 10:53:53AM +1300, Thomas Munro wrote: > On Fri, Apr 5, 2019 at 2:03 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2019-Apr-04, Thomas Munro wrote: > > > I don't think it's project policy to put a single typedef into its own > > > header like that, and I'm not sure where else to put it. > > > > shrug. Looks fine to me. I suppose if we don't have it anywhere, it's > > just because we haven't needed that particular trick yet. Creating a > > file with a lone typedef seems better than using uint32 to me. > > It was commit 9fac5fd7 that gave me that idea. > > Ok, here is a patch that adds a one-typedef header and uses > SegmentIndex to replace all cases of BlockNumber and int holding a > segment number (whether as an "index" or a "count"). Looks good to me. -- Shawn Debnath Amazon Web Services (AWS)
On 2019-Apr-05, Thomas Munro wrote: > Ok, here is a patch that adds a one-typedef header and uses > SegmentIndex to replace all cases of BlockNumber and int holding a > segment number (whether as an "index" or a "count"). Hmm, I now see (while doing the pg_checksum translation) that this patch didn't make it. Pity ... -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services