Re: Group commit, revised - Mailing list pgsql-hackers
From | Robert Haas
---|---
Subject | Re: Group commit, revised
Msg-id | CA+Tgmoa+jrb_vXckcExZNOzJcLJBaXq-RVoJhMbRtyZdaQMSMg@mail.gmail.com
In response to | Re: Group commit, revised (Peter Geoghegan <peter@2ndquadrant.com>)
Responses | Re: Group commit, revised
List | pgsql-hackers
On Thu, Jan 19, 2012 at 10:46 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 19 January 2012 17:40, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't know what you mean by this. I think removing wal_writer_delay
>> is premature, because I think it still may have some utility, and the
>> patch removes it. That's a separate change that should be factored
>> out of this patch and discussed separately.
>
> FWIW, I don't really care too much if we keep wal_writer_delay,
> provided it is only used in place of the patch's
> WALWRITER_NORMAL_TIMEOUT constant. I will note in passing that I doubt
> that the effect with asynchronous commits and hint bits is pronounced
> enough to have ever transferred through to someone making a practical
> recommendation to reduce wal_writer_delay to ameliorate clog
> contention.

It was very visible in some benchmarking Heikki did, and I was able to
reproduce it.

>> Yeah, I guess the shutdown sequence could get a bit complex if we try
>> to make everyone go through the WAL writer all the time. But I wonder
>> if we could rejigger things somehow so that everything goes through
>> WAL writer if it's dead.
>
> I'm not sure what you mean by this last bit. It sounds a bit hazardous.

That last "if" was supposed to say "unless", which may contribute to the
confusion.

> My problem with nominating a backend to the status of an auxiliary is
> that no matter what way you cut it, it increases the failure surface
> area, so to speak.

I think that's the wrong way of thinking about it. Imagine this: we
maintain a queue of people who are waiting on the current WAL flush, the
current-flush-to LSN, a queue of people who are waiting on the next WAL
flush, and a leader. All this data is protected by a spinlock.

When you want to flush WAL, you grab the spinlock. If the
current-flush-to LSN is greater than the LSN you need, you add yourself
to the waiting-on-current-flush queue, release the spinlock, and go to
sleep. Otherwise, if there's no leader, you become the leader, enter
your flush LSN as the current-flush-to LSN, and release the spinlock.
If there is a leader, you add yourself to the waiting-on-next-flush
queue, release the spinlock, and sleep.

If you become the leader, you perform the flush. Then you retake the
spinlock, dequeue anyone waiting on the current flush, move all of the
next-flush waiters (or those within a certain LSN distance) to the
current-flush list, remember who is at the head of that new queue, and
release the spinlock. You then set a flag in the PGPROC of the backend
now at the head of the next-flush queue and wake him up; he checks that
flag on waking to know whether he is the leader or whether he's just
been woken because his flush is done. After waking him so the next flush
can get started, you wake all the people who were waiting on the flush
you already did.

This may or may not be a good design, but I don't think it has any more
failure surface area than what you have here. In particular, whether or
not the WAL writer is running doesn't matter; the system can run just
fine without it, and can even still do group commit. To look at it
another way, it's not a question of whether we're treating a regular
backend as an auxiliary process; there's no rule anywhere that backends
can't be dependent on the proper operation of other backends for proper
functioning - there are MANY places that have that property, including
LWLockAcquire() and LWLockRelease().
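In case it makes the shape of that clearer, here's a toy, untested
standalone simulation of the idea. It is not PostgreSQL code: a pthread
mutex stands in for the spinlock, a condition-variable broadcast stands
in for the per-PGPROC wakeups (so the new leader is just whichever
next-flush waiter happens to reacquire the lock first, rather than an
explicitly designated queue head), and every name in it is made up for
illustration.

/*
 * Toy standalone simulation of the two-queue scheme described above: a
 * pthread mutex stands in for the spinlock, a condition-variable
 * broadcast for the per-PGPROC wakeups.  All names are invented.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

typedef uint64_t XLogRecPtr;            /* stand-in LSN */

static pthread_mutex_t gc_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gc_wakeup = PTHREAD_COND_INITIALIZER;

static XLogRecPtr flushed_up_to = 0;    /* everything <= this is on disk */
static XLogRecPtr current_flush_to = 0; /* LSN the in-progress flush covers */
static XLogRecPtr next_flush_to = 0;    /* high-water mark of queued requests */
static bool leader_present = false;

/* Pretend to write and fsync WAL up to "upto"; called without the lock. */
static void
do_flush(XLogRecPtr upto)
{
    (void) upto;
    usleep(1000);
}

static void
group_commit_flush(XLogRecPtr my_lsn)
{
    pthread_mutex_lock(&gc_lock);
    for (;;)
    {
        if (flushed_up_to >= my_lsn)
        {
            /* A flush covering our LSN already completed. */
            pthread_mutex_unlock(&gc_lock);
            return;
        }
        if (leader_present && current_flush_to >= my_lsn)
        {
            /* In-progress flush covers us: wait on the "current" queue. */
            pthread_cond_wait(&gc_wakeup, &gc_lock);
            continue;
        }
        if (!leader_present)
        {
            /* No leader: become it, flushing far enough for everyone queued. */
            XLogRecPtr target = (next_flush_to > my_lsn) ? next_flush_to : my_lsn;

            leader_present = true;
            current_flush_to = target;
            next_flush_to = 0;
            pthread_mutex_unlock(&gc_lock);

            do_flush(target);

            pthread_mutex_lock(&gc_lock);
            flushed_up_to = target;
            leader_present = false;
            /* Wake everyone; the first next-flush waiter to run leads next. */
            pthread_cond_broadcast(&gc_wakeup);
            pthread_mutex_unlock(&gc_lock);
            return;
        }
        /* Leader's flush stops short of us: queue for the next flush. */
        if (my_lsn > next_flush_to)
            next_flush_to = my_lsn;
        pthread_cond_wait(&gc_wakeup, &gc_lock);
    }
}

static void *
backend(void *arg)
{
    XLogRecPtr lsn = (XLogRecPtr) (uintptr_t) arg;

    group_commit_flush(lsn);
    printf("flushed through %llu\n", (unsigned long long) lsn);
    return NULL;
}

int
main(void)
{
    pthread_t threads[8];
    int i;

    for (i = 0; i < 8; i++)
        pthread_create(&threads[i], NULL, backend,
                       (void *) (uintptr_t) ((i + 1) * 10));
    for (i = 0; i < 8; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

The real thing would of course use a spinlock and the PGPROC semaphores,
keep actual queues so that the finishing leader can hand off explicitly
to the head of the next-flush queue, and cap how far past the caller's
LSN it is willing to flush.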
Nor is there any rule that background processes are more reliable than
foreground processes, nor do I believe they are. Much of the existing
patch's failure surface seems to me to come from the coordination it
requires between ordinary backends and the background writer, and
possible race conditions appertaining thereto: WAL writer dies, backend
dies, postmaster dies, postmaster and WAL writer die together, etc.

>> +     if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
>> +             !ProcGlobal->groupCommitAvailable)
>>
>> That seems like a gigantic value. I would think that we'd want to
>> forget about group commit any time we're behind by more than one
>> segment (maybe less).
>
> I'm sure that you're right - I myself described it as horrible in my
> original mail. I think that whatever value we set this to ought to be
> well reasoned. Right now, your magic number doesn't seem much better
> than my bogus heuristic (which only ever served as a placeholder
> implementation that hinted at a well-principled formula). What's your
> basis for suggesting that that limit would always be close to optimal?

It's probably not - I suspect even a single WAL segment is still too
large. I'm pretty sure that it would be safe to always flush up to the
next block boundary, because we always write the whole block anyway even
if it's only partially filled. So there's no possible regression there.
Anything larger than "the end of the current 8kB block" is going to be a
trade-off between latency and throughput, and it seems to me that
there's probably only one way to figure out what's reasonable: test a
bunch of different values and see which ones perform well in practice.
Maybe we could answer part of the question by writing a test program
which does repeated sequential writes of increasingly-large chunks of
data (see the sketch further down). We might find out, for example, that
64kB is basically the same as 8kB because most of the overhead is in the
system call anyway, but beyond that the curve kinks. Or it may be that
16kB is already twice as slow as 8kB, or that you can go up to 1MB
without a problem. I don't see a way to know that without testing it on
a couple of different pieces of hardware and seeing what happens.

>> +     if (ProcDiePending || QueryCancelPending)
>> +     {
>> +             GroupCommitCancelWait();
>> +
>> +             /*
>> +              * Let out-of-line interrupt handler take it from here.
>> +              * Cannot raise an error here because we're in an
>> +              * enclosing critical block.
>> +              */
>> +             break;
>> +     }
>>
>> Presumably in this case we need to return false, not true.
>
> No, that is not an error. As described in the comment above that
> function, the return value indicates if group commit serviced the
> request, not necessarily successfully. If we were to return false,
> we'd be acting as if group commit just didn't want to flush that LSN,
> necessitating making alternative arrangements (calling XLogWrite()).
> However, we should just hurry up and get to the interrupt handler to
> error and report failure to the client (though not before executing
> the code immediately before "return true" at the end of the function).
> Maybe you could argue that it would be better to do that anyway, but I
> think that it could mask problems and unnecessarily complicate matters.

Ugh. Our current system doesn't even require this code to be
interruptible, I think: you can't receive cancel or die interrupts
either while waiting for an LWLock or while holding it.
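Here's the sort of quick-and-dirty test program I'm imagining -
completely untested, and the file name, the 256MB written per pass, the
8kB-1MB chunk range, and the single fsync() at the end of each pass are
all arbitrary choices. To take the OS page cache out of the picture, one
might instead fsync after every write or open the file with O_DIRECT.

/*
 * Untested sketch: time sequential writes of increasingly large chunks.
 * The file name, 256MB total, and 8kB-1MB range are arbitrary.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define TOTAL_BYTES (256 * 1024 * 1024)

static double
elapsed_seconds(struct timeval *start, struct timeval *stop)
{
    return (stop->tv_sec - start->tv_sec) +
        (stop->tv_usec - start->tv_usec) / 1000000.0;
}

int
main(void)
{
    size_t chunk;

    /* chunk sizes from 8kB up to 1MB, doubling each time */
    for (chunk = 8192; chunk <= 1024 * 1024; chunk *= 2)
    {
        struct timeval start, stop;
        char *buf = malloc(chunk);
        size_t done = 0;
        int fd;

        if (buf == NULL)
        {
            perror("malloc");
            return 1;
        }
        memset(buf, 'x', chunk);

        fd = open("write_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        gettimeofday(&start, NULL);
        while (done < TOTAL_BYTES)
        {
            ssize_t written = write(fd, buf, chunk);

            if (written < 0)
            {
                perror("write");
                return 1;
            }
            done += written;
        }
        if (fsync(fd) < 0)
        {
            perror("fsync");
            return 1;
        }
        gettimeofday(&stop, NULL);

        printf("%7zu-byte chunks: %.3f seconds\n",
               chunk, elapsed_seconds(&start, &stop));

        close(fd);
        free(buf);
    }
    unlink("write_test.dat");
    return 0;
}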
> By the way, how goes the introduction of memory barriers to Postgres?
> I'm aware that you committed an implementation back in September, but
> am not sure what the plan is as regards using them in 9.2.

Well, I was expecting a few patches to go in that needed them, but so
far that hasn't happened. I think it would be a fine thing if we had a
good reason to commit some code that uses them, so that we can watch the
buildfarm turn pretty colors.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company