
From: Robert Haas
Subject: Re: Group commit, revised
Msg-id: CA+Tgmoa+jrb_vXckcExZNOzJcLJBaXq-RVoJhMbRtyZdaQMSMg@mail.gmail.com
In response to: Re: Group commit, revised (Peter Geoghegan <peter@2ndquadrant.com>)
Responses: Re: Group commit, revised
List: pgsql-hackers
On Thu, Jan 19, 2012 at 10:46 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 19 January 2012 17:40, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't know what you mean by this.  I think removing wal_writer_delay
>> is premature, because I think it still may have some utility, and the
>> patch removes it.  That's a separate change that should be factored
>> out of this patch and discussed separately.
>
> FWIW, I don't really care too much if we keep wal_writer_delay,
> provided it is only used in place of the patch's
> WALWRITER_NORMAL_TIMEOUT constant. I will note in passing that I doubt
> that the effect with asynchronous commits and hint bits is pronounced
> enough to have ever transferred through to someone making a practical
> recommendation to reduce wal_writer_delay to ameliorate clog
> contention.

It was very visible in some benchmarking Heikki did, and I was able to
reproduce it.

>> Yeah, I guess the shutdown sequence could get a bit complex if we try
>> to make everyone go through the WAL writer all the time.  But I wonder
>> if we could rejigger things somehow so that everything goes through the
>> WAL writer if it's dead.
>
> I'm not sure what you mean by this last bit. It sounds a bit hazardous.

That last "if" was supposed to say "unless", which may contribute to
the confusion.

> My problem with nominating a backend to the status of an auxiliary is
> that no matter what way you cut it, it increases the failure surface
> area, so to speak.

I think that's the wrong way of thinking about it.  Imagine this: we
maintain a queue of people who are waiting on the current WAL flush,
the current-flush-to LSN, a queue of people who are waiting on the
next WAL flush, and a leader.  All this data is protected by a
spinlock.  When you want to flush WAL, you grab the spinlock.  If the
current-flush-to LSN is greater than the LSN you need, you add
yourself to the waiting-on-current-flush queue, release the spinlock,
and go to sleep.  Otherwise, if there's no leader, you become the
leader, enter your flush LSN as the current-flush-to-LSN, and release
the spinlock.  If there is a leader, you add yourself to the
waiting-on-next-flush queue, release the spinlock, and sleep.
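
Just to make the shape of that concrete, here's a rough C sketch of the
join step.  All the names (FlushGroup, JoinFlushGroup, AddToWaitList,
SleepUntilWoken) are invented for this email, not taken from the patch,
and the shared-memory list and latch machinery is hand-waved away behind
placeholders:

#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/proc.h"
#include "storage/spin.h"

typedef struct FlushGroup
{
    slock_t     mutex;          /* protects everything below */
    bool        leaderPresent;  /* is somebody performing a flush right now? */
    XLogRecPtr  currentFlushTo; /* LSN the in-progress flush will reach */
    PGPROC     *currentWaiters; /* waiting on the in-progress flush */
    PGPROC     *nextWaiters;    /* waiting on the flush after that */
} FlushGroup;

/* Placeholders for whatever list and latch primitives we'd really use. */
extern void AddToWaitList(PGPROC **head, PGPROC *proc);
extern void SleepUntilWoken(PGPROC *proc);

/*
 * Returns true if we became the leader and must perform the flush up to
 * (at least) myLSN ourselves; otherwise we slept until someone flushed it.
 */
static bool
JoinFlushGroup(FlushGroup *fg, PGPROC *me, XLogRecPtr myLSN)
{
    bool        leader = false;

    SpinLockAcquire(&fg->mutex);
    if (fg->leaderPresent && XLByteLE(myLSN, fg->currentFlushTo))
    {
        /* the flush already in progress covers us; just wait for it */
        AddToWaitList(&fg->currentWaiters, me);
    }
    else if (!fg->leaderPresent)
    {
        /* nobody is flushing: volunteer, and record how far we'll flush */
        fg->leaderPresent = true;
        fg->currentFlushTo = myLSN;
        leader = true;
    }
    else
    {
        /* a flush is in progress but doesn't cover us; wait for the next one */
        AddToWaitList(&fg->nextWaiters, me);
    }
    SpinLockRelease(&fg->mutex);

    if (!leader)
        SleepUntilWoken(me);
    return leader;
}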

If you become the leader, you perform the flush.  Then you retake the
spinlock, dequeue anyone waiting on the current flush, move all of the
next flush waiters (or those within a certain LSN distance) to the
current flush list, remember who is at the head of that new queue, and
release the spinlock.  You then set a flag in the PGPROC of the
backend now at the head of that new queue and wake him up; he
checks that flag on waking to know whether he is the leader or whether
he's just been woken because his flush is done.  After waking him so
the next flush can get started, you wake all the people who were
waiting on the flush you already did.
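
And the leader's side, in the same invented notation (isFlushLeader would
be a new PGPROC field, analogous to the existing lwWaiting/lwWaitLink
fields; PerformWALFlush, NextWaiter, and WakeBackend are again
placeholders):

/* Placeholders, as above. */
extern void PerformWALFlush(XLogRecPtr upto);
extern void WakeBackend(PGPROC *proc);
extern PGPROC *NextWaiter(PGPROC *proc);

static void
LeadFlushGroup(FlushGroup *fg)
{
    PGPROC     *doneWaiters;
    PGPROC     *newLeader;

    /* We are the leader: do the actual write and fsync. */
    PerformWALFlush(fg->currentFlushTo);

    SpinLockAcquire(&fg->mutex);
    /* Everyone waiting on the flush we just did is now satisfied. */
    doneWaiters = fg->currentWaiters;
    /* Promote the next-flush waiters (or some prefix of them)... */
    fg->currentWaiters = fg->nextWaiters;
    fg->nextWaiters = NULL;
    /* ...and remember who heads the new queue: he leads the next flush. */
    newLeader = fg->currentWaiters;
    fg->leaderPresent = (newLeader != NULL);
    SpinLockRelease(&fg->mutex);

    if (newLeader != NULL)
    {
        /*
         * Hand off leadership first, so the next flush can start at once.
         * The new leader checks this flag on waking to tell "go flush"
         * apart from "your flush is done", and sets currentFlushTo for
         * his own group before flushing.
         */
        newLeader->isFlushLeader = true;    /* hypothetical new PGPROC field */
        WakeBackend(newLeader);
    }

    /* ...then wake everyone whose flush we already completed. */
    while (doneWaiters != NULL)
    {
        PGPROC     *next = NextWaiter(doneWaiters);

        WakeBackend(doneWaiters);
        doneWaiters = next;
    }
}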

This may or may not be a good design, but I don't think it has any
more failure surface area than what you have here.  In particular,
whether or not the WAL writer is running doesn't matter; the system
can run just fine without it, and can even still do group commit.  To
look at it another way, it's not a question of whether we're treating
a regular backend as an auxiliary process; there's no rule anywhere
that backends can't be dependent on the proper operation of other
backends for proper functioning - there are MANY places that have that
property, including LWLockAcquire() and LWLockRelease().  Nor is there
any rule that background processes are more reliable than foreground
processes, nor do I believe they are.  Much of the existing patch's
failure surface seems to me to come from the coordination it requires
between ordinary backends and the WAL writer, and possible race
conditions appertaining thereto: WAL writer dies, backend dies,
postmaster dies, postmaster and WAL writer die together, etc.

>> +       if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
>> +                       !ProcGlobal->groupCommitAvailable)
>>
>> That seems like a gigantic value.  I would think that we'd want to
>> forget about group commit any time we're behind by more than one
>> segment (maybe less).
>
> I'm sure that you're right - I myself described it as horrible in my
> original mail. I think that whatever value we set this to ought to be
> well reasoned. Right now, your magic number doesn't seem much better
> than my bogus heuristic (which only ever served as a placeholder
> implementation that hinted at a well-principled formula). What's your
> basis for suggesting that that limit would always be close to optimal?

It's probably not - I suspect even a single WAL segment is still too
large.  I'm pretty sure that it would be safe to always flush up to
the next block boundary, because we always write the whole block
anyway even if it's only partially filled.  So there's no possible
regression there.  Anything larger than "the end of the current 8kB
block" is going to be a trade-off between latency and throughput, and
it seems to me that there's probably only one way to figure out what's
reasonable: test a bunch of different values and see which ones
perform well in practice.  Maybe we could answer part of the question
by writing a test program which does repeated sequential writes of
increasingly-large chunks of data.  We might find out for example that
64kB is basically the same as 8kB because most of the overhead is in
the system call anyway, but beyond that the curve kinks.  Or it may be
that 16kB is already twice as slow as 8kB, or that you can go up to
1MB without a problem.  I don't see a way to know that without testing
it on a couple different pieces of hardware and seeing what happens.
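
Something like the following is what I have in mind - chunk sizes,
iteration count, and file name all pulled out of the air, and untested:

/*
 * Quick-and-dirty harness: time ITERATIONS sequential write()+fdatasync()
 * calls at each chunk size.  A more careful test would pre-extend the
 * file so block allocation doesn't skew the first pass.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 1024

int
main(void)
{
    static const size_t sizes[] = {8192, 16384, 32768, 65536, 262144, 1048576};
    int         nsizes = sizeof(sizes) / sizeof(sizes[0]);
    char       *buf = calloc(1, sizes[nsizes - 1]);
    int         fd = open("walsim.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    int         i,
                j;

    if (buf == NULL || fd < 0)
    {
        perror("setup");
        return 1;
    }

    for (i = 0; i < nsizes; i++)
    {
        struct timespec start,
                    end;
        double      elapsed;

        /* overwrite the same region of the file on every pass */
        lseek(fd, 0, SEEK_SET);

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (j = 0; j < ITERATIONS; j++)
        {
            if (write(fd, buf, sizes[i]) != (ssize_t) sizes[i] ||
                fdatasync(fd) != 0)
            {
                perror("write/fdatasync");
                return 1;
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        elapsed = (end.tv_sec - start.tv_sec) +
            (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("%7zu bytes: %.3f ms per write+fdatasync\n",
               sizes[i], elapsed * 1000.0 / ITERATIONS);
    }

    close(fd);
    free(buf);
    return 0;
}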

>> +               if (ProcDiePending || QueryCancelPending)
>> +               {
>> +                       GroupCommitCancelWait();
>> +
>> +                       /*
>> +                        * Let out-of-line interrupt handler take it from here. Cannot raise
>> +                        * an error here because we're in an enclosing critical block.
>> +                        */
>> +                       break;
>> +               }
>>
>> Presumably in this case we need to return false, not true.
>
> No, that is not an error. As described in the comment above that
> function, the return value indicates if group commit serviced the
> request, not necessarily successfully. If we were to return false,
> we'd be acting as if group commit just didn't want to flush that LSN,
> necessitating making alternative arrangements (calling XLogWrite()).
> However, we should just hurry up and get to the interrupt handler to
> error and report failure to the client (though not before executing
> the code immediately before "return true" at the end of the function).
> Maybe you could argue that it would be better to do that anyway, but I
> think that it could mask problems and unnecessarily complicate
> matters.

Ugh.  Our current system doesn't even require this code to be
interruptible, I think: you can't receive cancel or die interrupts
either while waiting for an LWLock or while holding it.
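
In other words, the caller-side contract as I understand it is roughly
this (GroupCommitWaitForLSN is an invented stand-in for the patch's wait
function, this presumes we're inside xlog.c where XLogWrite and
XLogwrtRqst live, and I'm writing XLogWrite()'s argument list from
memory):

extern bool GroupCommitWaitForLSN(XLogRecPtr recptr);   /* invented name */

static void
FlushCommitRecord(XLogRecPtr recptr)
{
    if (!GroupCommitWaitForLSN(recptr))
    {
        XLogwrtRqst WriteRqst;

        /* Group commit declined the request; flush the old-fashioned way. */
        WriteRqst.Write = recptr;
        WriteRqst.Flush = recptr;
        XLogWrite(WriteRqst, false, false);
    }

    /*
     * On a true return, group commit took responsibility - including the
     * cancel/die case, where we simply fall out here and let the
     * interrupt handler raise the error once we're outside the critical
     * section.
     */
}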

> By the way, how goes the introduction of memory barriers to Postgres?
> I'm aware that you committed an implementation back in September, but
> am not sure what the plan is as regards using them in 9.2.

Well, I was expecting a few patches to go in that needed them, but so
far that hasn't happened.  I think it would be a fine thing if we had
a good reason to commit some code that uses them, so that we can watch
the buildfarm turn pretty colors.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

