Thread: Configuring BLCKSZ and XLOGSEGSZ (in 8.3)
It seems possible to vary both BLCKSZ and XLOGSEGSZ rather than have
them set within pg_config_manual.h. There are a number of use-cases
where varying these values will offer increased performance (and for
many cases, no difference at all, I accept). Most of the PostgreSQL
user base don't recompile their own versions, let alone know how to
edit the source to change these parameters. Those people should have
access to the benefits currently known to and used by a small few.

- BLCKSZ could be set at initdb, via an additional option -Z allowing
  the value to be set to 4, 8, 16 or 32 KB at that point (default 8 KB).

- XLOGSEGSZ could also be set at initdb, though it could additionally
  be set using pg_resetxlog following a clean shutdown. (This would,
  for example, require a Warm Standby server to be reconfigured.)

Both of these changes would require updates to the control file, to
allow those aspects to be set prior to the initial write of the control
file. Values would be re-read from the control file on startup.

Some refactoring would be required to make BLCKSZ usable this way,
touching a number of parts of the code in minor ways. There aren't many
cases where BLCKSZ is used repeatedly at run time; mostly it is used
during startup of the server or in some parts of executor start. So,
AFAICS, there is relatively low overhead in supporting a variable
BLCKSZ.

The infrastructure to support a variable XLOGSEGSZ is already there, so
few changes are required to make this variable.

Comments?

(Sorry for raising so many threads at once; the 8.3 cycle is fairly
short, so I want to get going, now that 8.2 seems almost there)

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
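A minimal sketch, with invented names, of what "values would be re-read
from the control file on startup" could look like. The blcksz and
xlog_seg_size fields mirror the cross-check fields that already exist
in the real ControlFileData; AdoptControlFileSettings(), BlockSize and
XLogSegSize are hypothetical stand-ins, not the proposed patch:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct ControlFileData
    {
        uint32_t    blcksz;         /* data block size chosen at initdb */
        uint32_t    xlog_seg_size;  /* WAL segment size chosen at initdb */
        /* ... many other fields elided ... */
    } ControlFileData;

    static uint32_t BlockSize;      /* would replace compile-time BLCKSZ */
    static uint32_t XLogSegSize;    /* would replace XLOG_SEG_SIZE */

    static void
    AdoptControlFileSettings(const ControlFileData *cf)
    {
        /* accept only the proposed range: 4, 8, 16 or 32 KB, power of 2 */
        if (cf->blcksz < 4096 || cf->blcksz > 32768 ||
            (cf->blcksz & (cf->blcksz - 1)) != 0)
        {
            fprintf(stderr, "invalid block size %u in control file\n",
                    cf->blcksz);
            exit(1);
        }
        BlockSize = cf->blcksz;
        XLogSegSize = cf->xlog_seg_size;
    }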
On Monday, 27 November 2006 12:30, Simon Riggs wrote:
> It seems possible to vary both BLCKSZ and XLOGSEGSZ rather than have
> them set within pg_config_manual. There are a number of use-cases
> where varying these values will offer increased performance

Such as?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
On 11/27/06, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Monday, 27 November 2006 12:30, Simon Riggs wrote:
> > It seems possible to vary both BLCKSZ and XLOGSEGSZ rather than have
> > them set within pg_config_manual. There are a number of use-cases
> > where varying these values will offer increased performance
>
> Such as?

Reading 32k at a time from my SAN, instead of 8k, gave me a ~15%
increase in overall I/O throughput. Now, I'm not certain that I didn't
do something else stupid that the larger BLCKSZ partially counteracts,
but dd bears out my results (and 64k is even faster for dd).

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
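For anyone wanting to reproduce this sort of comparison outside dd, a
rough sequential-read micro-benchmark in C. The default file path is a
placeholder, and results will be skewed by the OS page cache unless the
file is much larger than RAM (or caches are dropped between runs):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/tmp/testfile";
        size_t      bufsz = (argc > 2) ? (size_t) atoi(argv[2]) : 32768;
        char       *buf = malloc(bufsz);
        int         fd = open(path, O_RDONLY);
        ssize_t     n;
        long long   total = 0;
        struct timeval t0, t1;
        double      secs;

        if (fd < 0 || buf == NULL)
        {
            perror("setup");
            return 1;
        }
        gettimeofday(&t0, NULL);
        while ((n = read(fd, buf, bufsz)) > 0)
            total += n;                 /* count bytes actually read */
        gettimeofday(&t1, NULL);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        if (secs <= 0)
            secs = 1e-9;                /* avoid divide-by-zero on tiny files */
        printf("%lld bytes in %.2f s = %.1f MB/s at %zu-byte reads\n",
               total, secs, total / secs / 1e6, bufsz);
        close(fd);
        free(buf);
        return 0;
    }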
On Mon, 2006-11-27 at 14:01 +0100, Peter Eisentraut wrote:
> On Monday, 27 November 2006 12:30, Simon Riggs wrote:
> > It seems possible to vary both BLCKSZ and XLOGSEGSZ rather than have
> > them set within pg_config_manual. There are a number of use-cases
> > where varying these values will offer increased performance
>
> Such as?

Increasing XLOGSEGSZ improves performance with write-intensive
workloads, where WAL is sufficiently active that switching WAL files
and fsyncing causes all commits to freeze momentarily.
http://blogs.sun.com/jkshah/category/Databases?page=1
Sun seems to think so as well, but that does appear to be rare
knowledge, AFAICS.

Increasing BLCKSZ has been claimed to help by
http://archives.postgresql.org/pgsql-performance/2006-05/msg00444.php
http://archives.postgresql.org/pgsql-performance/2005-12/msg00139.php
http://archives.postgresql.org/pgsql-performance/2004-12/msg00271.php

Discussion on that does seem somewhat inconclusive, but that may be
just because test results are rather thin on the ground, for lack of
the ability to test this without recompilation. One commentator says
the gain isn't worth the pain of having to recompile to get it, even
though there is measured benefit. Personally, I've not measured any
benefit for OLTP workloads, but there are many other workloads to try
out.

Increasing BLCKSZ would also allow increasing the size of GiST index
entries (IIRC?). It would certainly allow larger TOAST_TARGETs, letting
more data be held in a single longer tuple than is currently possible,
which would allow many text-based applications to avoid various
overheads.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > It seems possible to vary both BLCKSZ and XLOGSEGSZ rather than have > them set within pg_config_manual. The work required for this is much larger than you make it out to be, and zero evidence has been offered for any benefit. I have not heard of anyone bothering to use a custom BLCKSZ since we added TOAST to get rid of the row-length limitation ... regards, tom lane
Simon Riggs wrote:
> Increasing XLOGSEGSZ improves performance with write-intensive
> workloads, where WAL is sufficiently active that switching WAL files
> and fsyncing causes all commits to freeze momentarily.
> http://blogs.sun.com/jkshah/category/Databases?page=1

He increased the WAL segment size from 16 MB to 256 MB. Without any
further information about the system configuration, that seems to be
mostly equivalent to increasing the number of checkpoint segments.

> Increasing BLCKSZ has been claimed to help by [snip]
> Discussion on that does seem somewhat inconclusive, but that may be
> just because test results are rather thin on the ground, for lack of
> the ability to test this without recompilation.

I don't doubt that there may be a positive effect from increasing the
block size. But we haven't seen any analysis of why that might be. If
it's just to use the disk system bandwidth better, maybe we should
combine page writes instead, or something.

> Increasing BLCKSZ would also allow increasing the size of GiST index
> entries (IIRC?). It would certainly allow larger TOAST_TARGETs [snip]

Have there ever been demands for reconfiguring the TOASTing behavior?
The TOAST system seems to think that values larger than about 2 kB will
rarely or never be used in computations, only for retrieval. What was
the reason for choosing this particular limit? It seems to me that the
maximum size of useful lookup keys is mostly influenced by human
intelligence, not by the available computing hardware, so 2 kB seems to
be just fine.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
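A simplified illustration, not the actual 8.2 macros, of why the ~2 kB
figure is what it is: the threshold is derived from the page size so
that roughly four tuples still fit per page, which means it scales with
BLCKSZ rather than being an independent knob. The real definitions also
subtract page and tuple header overhead, so the true number is a little
under BLCKSZ/4:

    #include <stdio.h>

    /* Simplified: header overhead is rounded away here, so this is an
     * approximation of the real TOAST threshold derivation. */
    #define BLCKSZ                  8192
    #define TOAST_TUPLES_PER_PAGE   4
    #define TOAST_TUPLE_THRESHOLD   (BLCKSZ / TOAST_TUPLES_PER_PAGE)

    int
    main(void)
    {
        printf("approx TOAST threshold at BLCKSZ=%d: %d bytes\n",
               BLCKSZ, TOAST_TUPLE_THRESHOLD);
        return 0;
    }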
Peter Eisentraut <peter_e@gmx.net> writes:
> I don't doubt that there may be a positive effect from increasing the
> block size. But we haven't seen any analysis of why that might be.

It seems at least as likely that increased block size would *decrease*
performance by requiring even small writes to do more physical I/O.
This applies to both data files and xlog.

But the real issue here is whether there are grounds for supporting
run-time changes in the block size. AFAICS the evidence for supporting
even compile-time changes is pretty weak; why should we take the likely
complexity and performance costs of making it run-time changeable?

			regards, tom lane
On Mon, 2006-11-27 at 22:08 +0100, Peter Eisentraut wrote:
> Simon Riggs wrote:
> > Increasing XLOGSEGSZ improves performance with write-intensive
> > workloads, where WAL is sufficiently active that switching WAL files
> > and fsyncing causes all commits to freeze momentarily.
> > http://blogs.sun.com/jkshah/category/Databases?page=1
>
> He increased the WAL segment size from 16 MB to 256 MB. Without any
> further information about the system configuration, that seems to be
> mostly equivalent to increasing the number of checkpoint segments.

On a busy system you can switch WAL segments every few seconds at 16MB.
Fsync can freeze commits for more than a second, so raising the segment
size reduces the fsync overhead considerably. This doesn't drop away
fully with any of the various wal_sync_method settings.

256MB is good, 1GB is better. Obviously this changes the on-disk
footprint considerably, so some flexibility is needed to accommodate
both small PC configs and large performance servers.

It does also have the same effect as changing checkpoint segments, but
we already have variability in that dimension.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
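Back-of-envelope numbers behind "every few seconds", assuming an
illustrative sustained WAL write rate of 5 MB/s:

    segment switch interval = XLOGSEGSZ / WAL write rate

    16 MB  / (5 MB/s) ~   3 s between switches (and forced fsyncs)
    256 MB / (5 MB/s) ~  51 s
    1 GB   / (5 MB/s) ~ 205 s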
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Mon, 2006-11-27 at 22:08 +0100, Peter Eisentraut wrote: >> He increased the WAL segment size from 16 MB to 256 MB. Without any >> further information about the system configuration, that seems to be >> mostly equivalent to increasing the number of checkpoint segments. > On a busy system you can switch WAL segments every few seconds at 16MB. > Fsync can freeze commits for more than a second, so raising the segment > size reduces the fsync overhead considerably. Sorry, but that's just handwaving. The amount of data to be written for any specific commit isn't going to change in the least if you change XLOGSEGSZ --- it's still going to be whatever has been written since the last commit. I agree with Peter that the quoted Sun test appears to have failed to control the frequency of checkpoints, and that that was what really accounted for the performance change. So he'd have gotten the same result from increasing checkpoint_segments without bothering with a change in XLOGSEGSZ. I do note that XLogWrite() does this in the foreground path of control: * If we just wrote the whole last page of a logfile segment, * fsync the segment immediately. Thisavoids having to go back * and re-open prior segments when an fsync request comes along * later.Doing it here ensures that one and only one backend will * perform this fsync. This coding predates the existence of the bgwriter; now that we have that, it'd perhaps be interesting to try to put the burden on the bgwriter instead. (However, if a backend is trying to fsync a commit record just after the segment switch, it'd have to wait for the previous segment to be fsync'd anyway. The complexity and likely performance costs of arranging for that synchronization might outweigh any gains.) In any case, the existence of this code isn't an argument for raising XLOGSEGSZ, more the reverse --- the bigger the segment the more painful the fsync is likely to be. [ studies code a bit more... ] I'm also wondering whether the forced pg_control update at each xlog seg switch is worth its keep. Offhand it seems like the checkpoint pointer is enough; why are we maintaining logId/logSeg in pg_control? regards, tom lane
On Mon, 2006-11-27 at 18:26 -0500, Tom Lane wrote:
> Sorry, but that's just handwaving.

That's fine. I was responding to private comments that I was trying to
test things that had not been subject to community design. My response
was that the community would react badly if conjectures were discussed
without presenting firm performance evidence. Chicken and egg...

> [ studies code a bit more... ] I'm also wondering whether the forced
> pg_control update at each xlog seg switch is worth its keep. Offhand it
> seems like the checkpoint pointer is enough; why are we maintaining
> logId/logSeg in pg_control?

	ControlFile->logId = openLogId;
	ControlFile->logSeg = openLogSeg + 1;
	ControlFile->time = time(NULL);
	UpdateControlFile();

I've looked through the code paths related to the above code, run just
at xlog switch. There doesn't seem to be any useful effect of storing
these values in the control file. The logId and logSeg are never read,
only written.

There is a slight impact in that when the server crashes it will say
the database crashed at ControlFile->time, so if we remove the update,
the crash information will be slightly more out of date than it is now
in many cases. With a long checkpoint_timeout that could be as much as
an hour, but then that's no worse than it potentially is now, on a
system in a slack period when little WAL is written. Perhaps we could
say that if it's within a minute of the last switch time we update the
control file, and otherwise don't. That seems like coding for the sake
of it, though, and if we wanted that then we'd get the bgwriter to do
it, not a random backend.

Anyway, we can skip updating the control file and its fsync. IMHO
touching the control file less is likely to make us more robust. I'll
code up a patch for that and test to see if that improves things.

Not sure if this is RC material? No, OK, don't shout.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
> I don't doubt that there may be a positive effect from increasing the
> block size. But we haven't seen any analysis of why that might be. If
> it's just to use the disk system bandwidth better, maybe we should
> combine page writes instead or something.

It is usually the reads that need aid. Writes are fast on modern disk
systems, since they are cached. I think the main effect is reduced OS
overhead for the readahead prediction logic and reduced system call
overhead.

Andreas
On Mon, Nov 27, 2006 at 04:47:57PM -0500, Tom Lane wrote:
> It seems at least as likely that increased block size would *decrease*
> performance by requiring even small writes to do more physical I/O.
> This applies to both data files and xlog.

FWIW, a test we performed on just this some time ago was inconclusive,
and I chalked up the inconclusiveness to exactly the increase in
physical I/O for small writes. I couldn't release the results, just
because I wasn't in a position to release the test data, but we had a
fairly eclectic mixture of big and small rows. On certain workloads, it
was in fact slower than the stock size (IIRC we tried both 16k and
32k), which is what led me to that speculation. But I never chased any
of it down, because the preliminary results were so unpromising.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly
what you told them to. That actually seems sort of quaint now.
		--J.D. Baldwin
Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> It seems possible to vary both BLCKSZ and XLOGSEGSZ rather than have
>> them set within pg_config_manual.
>
> The work required for this is much larger than you make it out to be,
> and zero evidence has been offered for any benefit. I have not heard of
> anyone bothering to use a custom BLCKSZ since we added TOAST to get rid
> of the row-length limitation ...

I think configurable BLCKSZ could be useful for regression testing; see
the hash index problem. Some kind of stress test could then check
behavior on different BLCKSZ values without recompilation.

Zdenek
On Mon, 2006-11-27 at 18:26 -0500, Tom Lane wrote:
> [ studies code a bit more... ] I'm also wondering whether the forced
> pg_control update at each xlog seg switch is worth its keep. Offhand
> it seems like the checkpoint pointer is enough; why are we maintaining
> logId/logSeg in pg_control?

We maintain the values in shared memory to allow us to determine
whether or not it's time to checkpoint, and also to ensure that there
is one and only one call to checkpoint. So we need to keep track of
this somewhere, and it may as well be where it already is.

However, that doesn't mean we need to update the file on disk each time
we switch xlog files, so I've removed the UpdateControlFile() at that
point.

That fsync was done while holding WALWriteLock, so removing it should
be good for a few extra points of speed - at least we know there were
some problems in that area.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
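In outline (paraphrased, not the literal patch), the change described
amounts to keeping the shared-memory copy current while dropping the
on-disk write:

    if (finishing_seg)
    {
        issue_xlog_fsync();

        /* shared-memory values still maintained, for the checkpoint test */
        ControlFile->logId = openLogId;
        ControlFile->logSeg = openLogSeg + 1;

        /*
         * ... but no UpdateControlFile() any more: one fewer fsync
         * performed while WALWriteLock is held.
         */
    }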
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Mon, 2006-11-27 at 18:26 -0500, Tom Lane wrote: >> [ studies code a bit more... ] I'm also wondering whether the forced >> pg_control update at each xlog seg switch is worth its keep. Offhand >> it seems like the checkpoint pointer is enough; why are we maintaining >> logId/logSeg in pg_control? > We maintain the values in shared memory to allow us to determine whether > or not its time to checkpoint, and also to ensure that there is one and > only one call to checkpoint. So we need to keep track of this somewhere > and that may as well be where it already is. Say again? AFAICT those fields are write-only; the only place we consult them is to decide whether they need to be updated. My thought was to remove 'em altogether. regards, tom lane
On Tue, 2006-12-05 at 15:14 -0500, Tom Lane wrote:
> Say again? AFAICT those fields are write-only; the only place we
> consult them is to decide whether they need to be updated. My thought
> was to remove 'em altogether.

That's what I thought originally.

However, they guard the entrance to RequestCheckpoint(), and after they
have been set nobody else will call it - look at the test immediately
prior to the rows changed by the patch. That comparison is why we still
need them and why they aren't just write-only.

So they need to be there, but we just don't need to write them to
pg_control.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Tue, 2006-12-05 at 15:14 -0500, Tom Lane wrote: >> Say again? AFAICT those fields are write-only; the only place we >> consult them is to decide whether they need to be updated. My thought >> was to remove 'em altogether. > Thats what I thought originally. > However, they guard the entrance to RequestCheckpoint() and after they > have been set nobody else will call it - look at the test immediately > prior to the rows changed by the patch. Sure, what would happen is that every backend passing through this code would execute the several lines of computation needed to decide whether to call RequestCheckpoint. That's still way cheaper than an xlog switch as a whole, so it doesn't bother me. I think the first test is probably effectively redundant anyway, since the whole thing is executed with WALWriteLock held and so there can be only one backend doing it at a time --- it's not apparent to me that it's possible for someone else to have updated pg_control before the backend executing XLogWrite does. But in any case, the point here is that it doesn't matter whether the RequestCheckpoint code is inside the update-pg_control test or not. It was only put there on the thought that we could save some small number of cycles by not doing it if the update-pg_control test failed. regards, tom lane
On Tue, 2006-12-05 at 16:24 -0500, Tom Lane wrote:
> Sure, what would happen is that every backend passing through this code
> would execute the several lines of computation needed to decide whether
> to call RequestCheckpoint. That's still way cheaper than an xlog switch
> as a whole, so it doesn't bother me. I think the first test is probably
> effectively redundant anyway, since the whole thing is executed with
> WALWriteLock held and so there can be only one backend doing it at a
> time --- it's not apparent to me that it's possible for someone else to
> have updated pg_control before the backend executing XLogWrite does.

Right, but the calculation uses RedoRecPtr, which may not be completely
up to date. So presumably you want to re-read the shared memory value
again, to make sure we are exactly accurate and allow only one person
to call checkpoint? Either way we have to take a lock. The insert lock
causes deadlock, so we would need to use the info lock.

Yes, one backend at a time executes this code, but we need a way to
tell whether the backend is the first to come through it. I just left
it with the lock it was already requesting. If you really think it
should use the info lock then I'll code it that way instead.

> But in any case, the point here is that it doesn't matter whether the
> RequestCheckpoint code is inside the update-pg_control test or not.
> It was only put there on the thought that we could save some small
> number of cycles by not doing it if the update-pg_control test failed.

Understood; that wasn't why I left it that way.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Tue, 2006-12-05 at 16:24 -0500, Tom Lane wrote: >> Sure, what would happen is that every backend passing through this code >> would execute the several lines of computation needed to decide whether >> to call RequestCheckpoint. > Right, but the calculation uses RedoRecPtr, which may not be completely > up to date. So presumably you want to re-read the shared memory value > again to make sure we are exactly accurate and allow only one person to > call checkpoint? Either way we have to take a lock. Insert lock causes > deadlock, so we would need to use infolock. Not at all. It's highly unlikely that RedoRecPtr would be so out of date as to result in a false request for a checkpoint, and if it does, so what? Worst case is we perform an extra checkpoint. Also, given the current structure of the routine, this is probably not the best place for that code at all --- it'd make more sense for it to be in the just-finished-a-segment code stretch, which would ensure that it's only done by one backend once per segment. regards, tom lane
On Tue, 2006-12-05 at 17:26 -0500, Tom Lane wrote:
> Not at all. It's highly unlikely that RedoRecPtr would be so out of
> date as to result in a false request for a checkpoint, and if it does,
> so what? Worst case is we perform an extra checkpoint.

On its own, I wouldn't normally agree...

> Also, given the current structure of the routine, this is probably not
> the best place for that code at all --- it'd make more sense for it to
> be in the just-finished-a-segment code stretch, which would ensure that
> it's only done by one backend once per segment.

But that's a much better plan, since it requires no locking.

There are a lot more changes there for such a simple fix, though, and
lots more potential bugs, but I've coded it as you suggest and removed
the fields from pg_control.

The patch passes make check and applies cleanly on HEAD. pg_resetxlog
and pg_controldata tested.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On 11/27/06, Simon Riggs <simon@2ndquadrant.com> wrote:
> On a busy system you can switch WAL segments every few seconds at 16MB.
> Fsync can freeze commits for more than a second, so raising the segment
> size reduces the fsync overhead considerably. This doesn't drop away
> fully with any of the various wal_sync_method settings.
>
> 256MB is good, 1GB is better. Obviously this changes the on-disk
> footprint considerably, so some flexibility is needed to accommodate
> both small PC configs and large performance servers.

Also, 16MB WALs are quite a burden for backup systems (that's a lot of
files that just keep coming and coming). [1]

Regards,
Dawid

[1]: It really does make a difference, especially if you have a
centralized backup. And as for recovery, we have
pg_xlogfile_name_offset(), so the size of the WAL file should not be a
problem in HA setups.
"Simon Riggs" <simon@2ndquadrant.com> writes: > [ patch to remove logId/logSeg from pg_control ] Looking this over, I realize that there's an unresolved problem. Although it's true that xlog.c itself doesn't use the logId/logSeg fields for anything interesting, pg_resetxlog relies on them to determine how far the old WAL extends, so that it can determine a safely higher start address for the new WAL. This puts a damper both on my thought of removing the fields altogether, and on Simon's earlier proposal to update them in shared memory but not immediately write pg_control during a segment switch. The proposed patch uses pg_control's last checkpoint location to drive the end-of-WAL computation, but that is obviously not good enough, as WAL might have gone many segments beyond that. Now, underestimating the WAL end address is not fatal; AFAIK the only consequence would be some complaints about "xlog flush request is not satisfied" until we had managed to advance the end of WAL past the largest page LSN present in the data files. But it's still annoying. What I'm considering is having pg_resetxlog scan the pg_xlog directory and assume that any segment files present might have been used. Thoughts? regards, tom lane
"Simon Riggs" <simon@2ndquadrant.com> writes: > [ patch to remove logId/logSeg from pg_control ] Applied with revisions. regards, tom lane
On Fri, 2006-12-08 at 13:18 -0500, Tom Lane wrote:
> What I'm considering is having pg_resetxlog scan the pg_xlog directory
> and assume that any segment files present might have been used.

[Reading committed code...] That's a very neat shortcut - just using
the file names, rather than trying to read the files themselves as the
abortive patch a few months back tried to do.

Question: what happens when we run out of LogIds? I know we don't wrap
onto the next timeline, but do we start from LogId=1 again? Looks to me
like we just fall on the floor right now. We're probably not pressed
for an answer... :-)

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > Question: What happens when we run out of LogIds? We die. That will occur after generating 2^64 bytes of WAL, which for an installation generating 100MB/second would be something over 5000 years if I'm counting correctly. regards, tom lane
The original discussion of this patch was here:

	http://archives.postgresql.org/pgsql-hackers/2006-11/msg00876.php

Your patch has been added to the PostgreSQL unapplied patches list at:

	http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Tue, 2006-12-05 at 17:26 -0500, Tom Lane wrote:
> > Not at all. It's highly unlikely that RedoRecPtr would be so out of
> > date as to result in a false request for a checkpoint, and if it does,
> > so what? Worst case is we perform an extra checkpoint.
>
> On its own, I wouldn't normally agree...
>
> > Also, given the current structure of the routine, this is probably not
> > the best place for that code at all --- it'd make more sense for it to
> > be in the just-finished-a-segment code stretch, which would ensure that
> > it's only done by one backend once per segment.
>
> But that's a much better plan, since it requires no locking.
>
> There are a lot more changes there for such a simple fix, though, and
> lots more potential bugs, but I've coded it as you suggest and removed
> the fields from pg_control.
>
> The patch passes make check and applies cleanly on HEAD. pg_resetxlog
> and pg_controldata tested.

--
Bruce Momjian
bruce@momjian.us
EnterpriseDB   http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Sat, 2007-02-03 at 20:37 -0500, Bruce Momjian wrote:
> Your patch has been added to the PostgreSQL unapplied patches list at:
>
> 	http://momjian.postgresql.org/cgi-bin/pgpatches
>
> It will be applied as soon as one of the PostgreSQL committers reviews
> and approves it.

Tom applied the patch a few months ago.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Patch already applied by Tom. Removed from queue.

--
Bruce Momjian
bruce@momjian.us
EnterpriseDB   http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +