Re: Group commit, revised - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Group commit, revised
Msg-id: CA+Tgmoa5LRCrCevAbqjAZxnqX7Qgv0sK9GpxsOwm2Yg3=srZ5Q@mail.gmail.com
In response to: Re: Group commit, revised (Simon Riggs <simon@2ndQuadrant.com>)
Responses: Re: Group commit, revised
List: pgsql-hackers

On Wed, Jan 18, 2012 at 5:38 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, Jan 18, 2012 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jan 17, 2012 at 12:37 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>> I found it very helpful to reduce wal_writer_delay in pgbench tests, when
>>> running with synchronous_commit=off. The reason is that hint bits don't get
>>> set until the commit record is flushed to disk, so making the flushes more
>>> frequent reduces the contention on the clog. However, Simon made async
>>> commits nudge WAL writer if the WAL page fills up, so I'm not sure how
>>> relevant that experience is anymore.
>
> Still completely relevant and orthogonal to this discussion. The patch
> retains multi-modal behaviour.

I don't know what you mean by this. I think removing wal_writer_delay is premature, because I think it still may have some utility, and the patch removes it. That's a separate change that should be factored out of this patch and discussed separately.

>> There's still a small but measurable effect there in master. I think
>> we might be able to make it fully auto-tuning, but I don't think we're
>> fully there yet (not sure how much this patch changes that equation).
>>
>> I suggested a design similar to the one you just proposed to Simon
>> when he originally suggested this feature. It seems that if the WAL
>> writer is the only one doing WAL flushes, then there must be some IPC
>> overhead - and context switching - involved whenever WAL is flushed.
>> But clearly we're saving something somewhere else, on the basis of
>> Peter's results, so maybe it's not worth worrying about. It does seem
>> pretty odd to have all the regular backends going through the WAL
>> writer and the auxiliary processes using a different mechanism,
>> though. If we got rid of that, maybe WAL writer wouldn't even require
>> a lock, if there's only one process that can be doing it at a time.
>
> When we did sync rep it made sense to have the WALSender do the work
> and for others to just wait. It would be quite strange to require a
> different design for essentially the same thing for normal commits and
> WAL flushes to local disk. I should mention the original proposal for
> streaming replication had each backend send data to standby
> independently and that was recognised as a bad idea after some time.
> Same for sync rep also.

I don't think those cases are directly comparable. SR is talking to another machine, and I can't imagine that there is a terribly convenient or portable way for every backend that needs one to get a hold of the file descriptor for the socket. Even if it could, the data is sent as a stream, so if multiple backends sent to the same file descriptor you'd have to have some locking to prevent messages from getting interleaved. Or else you could have multiple connections, or use UDP, but that gets rather complex. Anyway, none of this is an issue for file I/O: anybody can open the file, and if two backends write data at different offsets at the same time - or the same data at the same offset at the same time - there's no problem. So the fact that it wasn't a good idea for SR doesn't convince me that it's also a bad idea here.
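(To make that concrete, here is a minimal standalone sketch - purely illustrative, not taken from the patch or from xlog.c, with an invented file path, offsets, and payloads - of two independent writers hitting the same file at disjoint offsets via pwrite(), with no locking needed:)

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char *path = "/tmp/walfile-demo";	/* invented path for the demo */
	char		block_a[8192];
	char		block_b[8192];
	int			fd_a;
	int			fd_b;

	memset(block_a, 'A', sizeof(block_a));
	memset(block_b, 'B', sizeof(block_b));

	/* Each "backend" opens the file independently... */
	fd_a = open(path, O_RDWR | O_CREAT, 0600);
	fd_b = open(path, O_RDWR | O_CREAT, 0600);
	if (fd_a < 0 || fd_b < 0)
	{
		perror("open");
		exit(1);
	}

	/*
	 * ...and each writes its own block at its own offset.  The offset
	 * travels with the request, so the two writes cannot clobber each
	 * other and no lock is required.
	 */
	if (pwrite(fd_a, block_a, sizeof(block_a), 0) < 0 ||
		pwrite(fd_b, block_b, sizeof(block_b), sizeof(block_a)) < 0)
	{
		perror("pwrite");
		exit(1);
	}

	close(fd_a);
	close(fd_b);
	return 0;
}

Each writer supplies the offset along with the data, so nothing has to serialize the two writes the way a shared stream socket would.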
On the other hand, I'm not saying we *should* do it that way, either - i.e. I am not trying to "require a different design" just because it's fun to make people change things. Rather, I am trying to figure out whether the design you've chosen is in fact the best one, and part of that involves reasoning about why it might or might not be. There are obvious reasons to think that having process A kick process B and go to sleep, then have process B do some work and wake up process A might be less efficient than having process A just do the work itself, in the uncontended case. The fact that that isn't happening seems awfully strange to me; it's hard to understand why 3 system calls are faster than one. That suggests that either the current system is badly broken in some way that this patch fixes (in which case, it would be nice to know what the broken thing is) or that there's an opportunity for further optimization of the new patch (either now or later, depending on how much work we're talking about). Just to be clear, it makes perfect sense that the new system is faster in the contended case, and the benchmark numbers are very impressive. It's a lot harder to understand why it's not slower in the uncontended case.

> Not sure why its odd to have backends do one thing and auxiliaries do
> another. The whole point of auxiliary processes is that they do a
> specific thing different to normal backends. Anyway, the important
> thing is to have auxiliary processes be independent of each other as
> much as possible, which simplifies error handling and state logic in
> the postmaster.

Yeah, I guess the shutdown sequence could get a bit complex if we try to make everyone go through the WAL writer all the time. But I wonder if we could rejigger things somehow so that everything goes through the WAL writer unless it's dead.

+ * Wait for group commit, and then return true, if group commit serviced the
+ * request (not necessarily successfully). Otherwise, return false and fastpath
+ * out of here, allowing the backend to make alternative arrangements to flush
+ * its WAL in a more granular fashion. This can happen because the record that
+ * the backend requests to have flushed in so far into the future that to group
+ * commit it would

This trails off in mid-sentence.

+   if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
+       !ProcGlobal->groupCommitAvailable)

That seems like a gigantic value. I would think that we'd want to forget about group commit any time we're behind by more than one segment (maybe less).

+   if (ProcDiePending || QueryCancelPending)
+   {
+       GroupCommitCancelWait();
+
+       /*
+        * Let out-of-line interrupt handler take it from here. Cannot raise
+        * an error here because we're in an enclosing critical block.
+        */
+       break;
+   }

Presumably in this case we need to return false, not true. TBH, I feel like this error handling is quite fragile, a fragility it inherits from sync rep, on which it is based. I am not sure I have a better idea, though.

+ /*
+  * Backends may still be waiting to have their transactions committed in batch,
+  * but calling this function prevents any backend from using group commit, and
+  * thus from having a dependency on the WAL Writer.
+  *
+  * When the WAL Writer finishes servicing those remaining backends, it will not
+  * have any additional work to do, and may shutdown.
+  */
+ void
+ GroupCommitDisable(void)
+ {
+     ProcGlobal->groupCommitAvailable = false;
+ }

I can't imagine that this is safe against memory ordering issues.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company