Thread: increasing the default WAL segment size
Hi,

I'd like to propose that we increase the default WAL segment size, which is currently 16MB. It was first set to that value in commit 47937403676d913c0e740eec6b85113865c6c8ab in October of 1999; prior to that, it was 64MB. Between 1999 and now, there have been three significant changes that make me think it might be time to rethink this value:

1. Transaction rates are vastly higher these days. In 1999, I think we were still limited to ~2^32 transactions during the entire lifetime of the server; transaction ID wraparound hadn't been invented yet.[1] Today, some installations do that many write transactions in under a week. The practical consequence of this is that WAL files fill up in extremely short periods of time. Some users generate multiple terabytes of WAL per day, which means they are generating - and very likely archiving - WAL files at a rate of greater than 1 per second! That poses multiple problems. For example, if your archive command happens to involve ssh, you might run into trouble because of this sort of thing:

[rhaas pgsql]$ /usr/bin/time ssh hydra true
        1.57 real         0.00 user         0.00 sys

Also, your operating system's implementation of directories and the commands to work with them (like ls) don't necessarily scale well to tens or hundreds of thousands of archived files. Furthermore, there is an enforced, synchronous fsync at the end of every segment, which actually does hurt performance on write-heavy workloads.[2] Of course, if that were the only reason to consider increasing the segment size, it would probably make more sense to just try to push that extra fsync into the background, but that's not really the case. From what I hear, the gigantic number of files is a bigger pain point.

2. Disks are a bit larger these days.
In the worst case, we waste just under twice as much space as whatever the segment size is: you might need 1 byte from the oldest segment you're keeping and 1 byte from the newest segment that you are keeping, but not the remaining contents of either file. In 1999, trying to limit disk wastage to <32MB probably seemed reasonable, but today that's very little disk space. I think at that time typical hard drive sizes were around 10 GB, whereas today they are around 1 TB.[3] I'm not sure whether the size of the sorts of high-performance storage likely to be used for pg_xlog has grown as fast as hard drives generally, but even so it seems pretty clear to me that trying to limit disk wastage to 32MB is excessively conservative on modern hardware.

3. archive_timeout is no longer a frequently used option. Obviously, if you are frequently archiving partial segments, you don't want the segment size to be too large, because if it is, each forced segment switch potentially wastes a large amount of space (and bandwidth). But given streaming replication and pg_receivexlog, the use case for archiving partial segments is, at least according to my understanding, a lot narrower than it used to be. So, I think we don't have to worry as much about keeping forced segment switches cheap as we did during the 8.x series.

Considering those three factors, I think we should consider pushing the default value up somewhat higher for v10. Reverting to the 64MB size that we had prior to 47937403676d913c0e740eec6b85113865c6c8ab sounds pretty reasonable. Users with really high transaction rates might even prefer a higher value (e.g. 256MB or 1GB), but that's hardly practical for small installs given our default of max_wal_size = 1GB. Possibly it would make sense for this to be configurable at initdb time instead of requiring a recompile; we probably don't save any significant number of cycles by compiling this into the server.

Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] I believe at that time we consumed an XID even for a read-only transaction, too; today, we can do 2^32 read transactions in a few hours.
[2] Amit did some benchmarking on this, I believe, but I don't have the numbers handy.
[3] https://commons.wikimedia.org/wiki/File:Hard_drive_capacity_over_time.png
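To put rough numbers on points 1 and 2, here is a back-of-the-envelope sketch (the 2TB/day workload is illustrative, not from any particular system):

```python
SEGMENT_SIZES_MB = [16, 64, 256, 1024]
WAL_PER_DAY_MB = 2 * 1024 * 1024   # an illustrative 2TB/day workload
SECONDS_PER_DAY = 86400

for seg_mb in SEGMENT_SIZES_MB:
    files_per_day = WAL_PER_DAY_MB // seg_mb
    seconds_per_file = SECONDS_PER_DAY / files_per_day
    # Worst case you need 1 byte from the oldest retained segment and
    # 1 byte from the newest, wasting just under two full segments.
    max_wastage_mb = 2 * seg_mb
    print(f"{seg_mb:>5} MB: {files_per_day:>6} files/day, "
          f"one every {seconds_per_file:5.1f}s, <{max_wastage_mb} MB wasted")
```

At 16MB segments that workload archives a file roughly every 0.66 seconds - faster than the ssh round-trip shown above - while even 1GB segments waste under 2GB of disk in the worst case.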
On Wed, Aug 24, 2016 at 10:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 1. Transaction rates are vastly higher these days. In 1999, I think
> we were still limited to ~2^32 transactions during the entire lifetime
> of the server; transaction ID wraparound hadn't been invented yet.[1]
> Today, some installations do that many write transactions in under a
> week. The practical consequence of this is that WAL files fill up in
> extremely short periods of time. Some users generate multiple
> terabytes of WAL per day, which means they are generating - and very
> likely archiving - WAL files a rate of greater than 1 per second!
> That poses multiple problems. For example, if your archive command
> happens to involve ssh, you might run into trouble because of this
> sort of thing:
>
> [rhaas pgsql]$ /usr/bin/time ssh hydra true
>         1.57 real         0.00 user         0.00 sys
...
> Considering those three factors, I think we should consider pushing
> the default value up somewhat higher for v10. Reverting to the 64MB
> size that we had prior to 47937403676d913c0e740eec6b85113865c6c8ab
> sounds pretty reasonable. Users with really high transaction rates
> might even prefer a higher value (e.g. 256MB, 1GB) but that's hardly
> practical for small installs given our default of max_wal_size = 1GB.
> Possibly it would make sense for this to be configurable at initdb
> time instead of requiring a recompile; we probably don't save any
> significant number of cycles by compiling this into the server.

FWIW, +1

We're already hurt by the small segments due to a similar phenomenon as the ssh case: TCP slow start. Designing the archive/recovery command to work around TCP slow start is quite complex, and bigger segments would just be a better thing.

Not to mention that bigger segments compress better.
Robert Haas <robertmhaas@gmail.com> writes:
> I'd like to propose that we increase the default WAL segment size,
> which is currently 16MB.

That seems like a reasonable thing to consider ...

> Possibly it would make sense for this to be configurable at initdb
> time instead of requiring a recompile;

... but I think this is just folly. You'd have to do major amounts of work to keep, eg, slave servers on the same page as the master about what the segment size is. Better to keep treating it like BLCKSZ, as a fixed parameter of a build. (There's a reason why we keep this number in pg_control.)

			regards, tom lane
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
> Considering those three factors, I think we should consider pushing the
> default value up somewhat higher for v10. Reverting to the 64MB size that
> we had prior to 47937403676d913c0e740eec6b85113865c6c8ab
> sounds pretty reasonable.

+1

The other downside is that the response time of transactions may degrade when they have to wait for a new WAL segment to be created. That might pop up as an occasional slow or higher maximum response time, which is a mystery to users. Maybe it's time to use posix_fallocate() to create WAL segments.

> Possibly it would make sense for this to be configurable at initdb time
> instead of requiring a recompile; we probably don't save any significant
> number of cycles by compiling this into the server.

+1

> 3. archive_timeout is no longer a frequently used option. Obviously, if
> you are frequently archiving partial segments, you don't want the segment
> size to be too large, because if it is, each forced segment switch
> potentially wastes a large amount of space (and bandwidth).
> But given streaming replication and pg_receivexlog, the use case for
> archiving partial segments is, at least according to my understanding, a
> lot narrower than it used to be. So, I think we don't have to worry as
> much about keeping forced segment switches cheap as we did during the 8.x
> series.

I'm not sure about this. I know (many or not) users use continuous archiving with archive_command and archive_timeout for backups, and don't want to use streaming replication, because the system is not worth the cost and trouble of HA. I heard from a few users that they were surprised when they learned that PostgreSQL generates WAL even when no update transaction is happening. Is this still true?

Regards
Takayuki Tsunakawa
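The posix_fallocate() idea could look roughly like the sketch below - Python here just for brevity (the real change would be in the C segment-creation path, which as I understand it currently writes out a full segment of zeroes); the file name and fsync placement are assumptions, not PostgreSQL's actual behavior:

```python
import os

WAL_SEGMENT_SIZE = 16 * 1024 * 1024   # today's default segment size

def create_wal_segment(path: str, size: int = WAL_SEGMENT_SIZE) -> int:
    """Reserve a full segment up front with posix_fallocate() instead of
    writing out 16MB of zeroes, so a backend waiting on a new segment
    waits for a block allocation rather than a bulk write."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        # Blocks are allocated by the filesystem without transferring data.
        os.posix_fallocate(fd, 0, size)
        os.fsync(fd)   # make the allocation durable before using the file
    finally:
        os.close(fd)
    return os.stat(path).st_size

```

(os.posix_fallocate is Unix-only; the underlying libc call is what a C implementation would use directly.)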
On 2016-08-24 22:33:49 -0400, Tom Lane wrote:
> > Possibly it would make sense for this to be configurable at initdb
> > time instead of requiring a recompile;
>
> ... but I think this is just folly. You'd have to do major amounts
> of work to keep, eg, slave servers on the same page as the master
> about what the segment size is.

Don't think it'd actually be all that complicated, we already verify the compatibility of some things. But I'm doubtful it's worth it, and I'm also rather doubtful that it's actually without overhead.

Andres
Andres Freund <andres@anarazel.de> writes:
> On 2016-08-24 22:33:49 -0400, Tom Lane wrote:
>> ... but I think this is just folly. You'd have to do major amounts
>> of work to keep, eg, slave servers on the same page as the master
>> about what the segment size is.

> Don't think it'd actually be all that complicated, we already verify
> the compatibility of some things. But I'm doubtful it's worth it, and
> I'm also rather doubtful that it's actually without overhead.

My point is basically that it'll introduce failure modes that we don't currently concern ourselves with. Yes, you can do configure --with-wal-segsize, but it's on your own head whether the resulting build will interoperate with anything else --- and I'm quite sure nobody tests, eg, walsender or walreceiver to see if they fail sanely in such cases. I don't think we'd get to take such a laissez-faire position with respect to an initdb option.

			regards, tom lane
On Wed, Aug 24, 2016 at 10:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ... but I think this is just folly. You'd have to do major amounts
> of work to keep, eg, slave servers on the same page as the master
> about what the segment size is.

I said an initdb-time parameter, meaning not capable of being changed within the lifetime of the cluster. So I don't see how the slave servers would get out of sync?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 24, 2016 at 10:54 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-08-24 22:33:49 -0400, Tom Lane wrote:
>> > Possibly it would make sense for this to be configurable at initdb
>> > time instead of requiring a recompile;
>>
>> ... but I think this is just folly. You'd have to do major amounts
>> of work to keep, eg, slave servers on the same page as the master
>> about what the segment size is.
>
> Don't think it'd actually be all that complicated, we already verify
> the compatibility of some things. But I'm doubtful it's worth it, and
> I'm also rather doubtful that it's actually without overhead.

Really? Where do you think the overhead would come from? What sort of test would you run to try to detect it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 24, 2016 at 11:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2016-08-24 22:33:49 -0400, Tom Lane wrote:
>>> ... but I think this is just folly. You'd have to do major amounts
>>> of work to keep, eg, slave servers on the same page as the master
>>> about what the segment size is.
>
>> Don't think it'd actually be all that complicated, we already verify
>> the compatibility of some things. But I'm doubtful it's worth it, and
>> I'm also rather doubtful that it's actually without overhead.
>
> My point is basically that it'll introduce failure modes that we don't
> currently concern ourselves with. Yes, you can do configure
> --with-wal-segsize, but it's on your own head whether the resulting build
> will interoperate with anything else --- and I'm quite sure nobody tests,
> eg, walsender or walreceiver to see if they fail sanely in such cases.
> I don't think we'd get to take such a laissez-faire position with respect
> to an initdb option.

I am really confused by this. If you connect a slave to a master other than the one that you cloned to create the slave, of course that's going to fail. But if the slave is cloned from the master, then the segment size is going to match. It seems like the only thing we need to do to make this work is make sure to get the segment size from the control file rather than anywhere else, which doesn't seem very difficult. What am I missing?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
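To make the proposed sanity check concrete, it amounts to something like the following schematic sketch (Python with invented names purely for illustration - the real check would live in C in the walsender/walreceiver handshake and read the segment size out of pg_control):

```python
class ControlFile:
    """Stand-in for the relevant field of pg_control (hypothetical)."""
    def __init__(self, wal_segment_size: int):
        self.wal_segment_size = wal_segment_size

def check_segment_size(local: ControlFile, primary_reported: int) -> None:
    """Refuse to stream if the primary's segment size differs from ours.
    A standby cloned from its primary inherits the primary's control
    file, so this normally passes; it exists to turn a misconfigured
    pairing into a clear error rather than silent corruption."""
    if local.wal_segment_size != primary_reported:
        raise RuntimeError(
            f"WAL segment size mismatch: primary reports "
            f"{primary_reported}, standby control file has "
            f"{local.wal_segment_size}")
```

The design point is that because the value lives in the control file and is copied when the standby is cloned, a mismatch can only arise from pairing unrelated clusters, which must fail anyway.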
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Aug 24, 2016 at 10:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... but I think this is just folly. You'd have to do major amounts
>> of work to keep, eg, slave servers on the same page as the master
>> about what the segment size is.

> I said an initdb-time parameter, meaning not capable of being changed
> within the lifetime of the cluster. So I don't see how the slave
> servers would get out of sync?

The point is that that now becomes something to worry about. I do not think I have to exhibit a live bug within five minutes' thought before saying that it's a risk area. It's something that we simply have not worried about before, and IME that generally means there's some squishy things there.

			regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> What am I missing?

Maybe nothing. But I'll point out that of the things that can currently be configured at initdb time, such as LC_COLLATE, there is not one single one that matters to walsender/walreceiver. If you think there is zero risk involved in introducing a parameter that will matter at that level, you have a different concept of risk than I do.

If you'd presented some positive reason why we ought to be taking some risk here, I'd be on board. But you haven't really. The current default value for this parameter is nearly old enough to vote; how is it that we suddenly need to make it easily configurable? Let's just change the value and be happy.

			regards, tom lane
On 2016-08-24 23:26:51 -0400, Robert Haas wrote:
> On Wed, Aug 24, 2016 at 10:54 PM, Andres Freund <andres@anarazel.de> wrote:
> > and I'm also rather doubtful that it's actually without overhead.
>
> Really? Where do you think the overhead would come from?

ATM we do math involving XLOG_BLCKSZ in a bunch of places (including doing a lot of %). Some of that happens with exclusive lwlocks held, and some even with a spinlock held IIRC. Making that variable won't be free. Whether it's actually measurable - hard to say. I do remember Heikki fighting hard to simplify some parts of the critical code during the xlog scalability stuff, and that even involved moving minor amounts of math out of critical sections.

> What sort of test would you run to try to detect it?

Xlog scalability tests (parallel copy, parallel inserts...), and decoding speed (pg_xlogdump --stats?)
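Part of why the current arithmetic is cheap is that the segment and block sizes are power-of-two compile-time constants, so a compiler can rewrite the modulo as a bit mask; a run-time segment size forces a real division. The identity involved, sketched in Python just to illustrate the arithmetic (the code in question is of course C, and the function names here are invented):

```python
XLOG_SEG_SIZE = 16 * 1024 * 1024   # compile-time constant today

def offset_in_segment(lsn: int) -> int:
    # With a power-of-two constant, the compiler can turn the modulo
    # into a single AND instruction -- effectively this:
    return lsn & (XLOG_SEG_SIZE - 1)

def offset_in_segment_variable(lsn: int, seg_size: int) -> int:
    # With a run-time segment size, a genuine div/mod is needed on
    # every call -- the overhead being discussed, if it is measurable.
    return lsn % seg_size
```

The mask and the modulo agree exactly when the size is a power of two, which is why the constant form is a free win today.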
On Wed, Aug 24, 2016 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> What am I missing?
>
> Maybe nothing. But I'll point out that of the things that can currently
> be configured at initdb time, such as LC_COLLATE, there is not one single
> one that matters to walsender/walreceiver. If you think there is zero
> risk involved in introducing a parameter that will matter at that level,
> you have a different concept of risk than I do.
>
> If you'd presented some positive reason why we ought to be taking some
> risk here, I'd be on board. But you haven't really. The current default
> value for this parameter is nearly old enough to vote; how is it that
> we suddenly need to make it easily configurable? Let's just change
> the value and be happy.

I certainly think that's a good first cut. As I said before, I think that increasing the value from 16MB to 64MB won't really hurt people with mostly-default configurations. max_wal_size=1GB currently means 64 16-MB segments; if it starts meaning 16 64-MB segments, I don't think that will have much impact on people one way or the other.

Meanwhile, we'll significantly help people who are currently generating painfully large but not totally insane numbers of WAL segments. Someone who is currently generating 32,768 WAL segments per day - about one every 2.6 seconds - will have a significantly easier time if they start generating 8,192 WAL segments per day - about one every 10.5 seconds - instead. It's just much easier for a reasonably simple archive command to keep up, "ls" doesn't have as many directory entries to sort, etc.

However, for people who have really high velocity systems - say 300,000 WAL segments per day - a fourfold increase in the segment size only gets them down to 75,000 WAL segments per day, which is still pretty nuts. High tens of thousands of segments per day is, surely, easier to manage than low hundreds of thousands, but it still puts really tight requirements on how fast your archive_command has to run. On that kind of system, you really want a segment size of maybe 1GB. In this example that gets you down to ~4700 WAL files per day, or about one every 18 seconds. But 1GB is clearly too large to be the default.

I think we're going to run into this issue more and more as people start running PostgreSQL on larger databases. In current releases, the cost of wraparound autovacuums can easily be the limiting factor here: the I/O cost is proportional to the XID burn rate multiplied by the entire size of the database. So mostly read-only databases, or databases that only take batch loads, can be fine even if they are really big, but it's hard to scale databases that do lots of transaction processing beyond a certain size, because you just end up running continuous wraparound vacuums and eventually you can't even do that fast enough. The freeze map changes in 9.6 should help with this problem, though, at least for databases that have hot spots rather than uniform access, which is of course very common. I think the result of that is likely to be that people try to scale up PostgreSQL to larger databases than ever before. New techniques for indexing large amounts of data (like BRIN) and for querying it (like parallel query, especially once we support having the driving scan be a bitmap heap scan) are going to encourage people in that direction, too.

You're asking why we suddenly need to make this configurable as if it were a surprising need, but I think it would be more surprising if scaling up didn't create some new needs. I can't think of any reason why a 100TB database and a 100MB database should both want to use the same WAL segment size, and I think we want to support both of those things.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 24, 2016 at 11:52 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-08-24 23:26:51 -0400, Robert Haas wrote:
>> On Wed, Aug 24, 2016 at 10:54 PM, Andres Freund <andres@anarazel.de> wrote:
>> > and I'm also rather doubtful that it's actually without overhead.
>>
>> Really? Where do you think the overhead would come from?
>
> ATM we do math involving XLOG_BLCKSZ in a bunch of places (including
> doing a lot of %). Some of that happens with exclusive lwlocks held, and
> some even with a spinlock held IIRC. Making that variable won't be
> free. Whether it's actually measurable - hard to say. I do remember
> Heikki fighting hard to simplify some parts of the critical code during
> xlog scalability stuff, and that that even involved moving minor amounts
> of math out of critical sections.

OK, that's helpful context.

>> What sort of test would you run to try to detect it?
>
> Xlog scalability tests (parallel copy, parallel inserts...), and
> decoding speed (pg_xlogdump --stats?)

Thanks; that's helpful, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-08-25 00:28:58 -0400, Robert Haas wrote:
> On Wed, Aug 24, 2016 at 11:52 PM, Andres Freund <andres@anarazel.de> wrote:
> > ATM we do math involving XLOG_BLCKSZ in a bunch of places (including
> > doing a lot of %). Some of that happens with exclusive lwlocks held, and
> > some even with a spinlock held IIRC. Making that variable won't be
> > free. Whether it's actually measurable - hard to say. I do remember
> > Heikki fighting hard to simplify some parts of the critical code during
> > xlog scalability stuff, and that that even involved moving minor amounts
> > of math out of critical sections.
>
> OK, that's helpful context.
>
> > Xlog scalability tests (parallel copy, parallel inserts...), and
> > decoding speed (pg_xlogdump --stats?)
>
> Thanks; that's helpful, too.

FWIW, I'm also doubtful that investing time into making this initdb configurable is a good use of time: The number of users that'll adjust initdb time parameters is going to be fairly small.
On Thu, Aug 25, 2016 at 12:35 AM, Andres Freund <andres@anarazel.de> wrote:
> FWIW, I'm also doubtful that investing time into making this initdb
> configurable is a good use of time: The number of users that'll adjust
> initdb time parameters is going to be fairly small.

I have to admit that I was skeptical about the idea of doing anything about this at all the first few times it came up. 16MB ought to be good enough for anyone! However, the time between beatings has now gotten short enough that the bruises don't have time to heal before the next beating arrives from a completely different customer. I try not to hold my views so firmly as to be impervious to contrary evidence.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello hackers,

I'm no PG hacker, so maybe I'm completely wrong, so sorry if I have wasted your time. I try to make the best out of Tom Lane's comment.

What would happen if there's a database on a server with an initdb (or whatever) parameter -with-wal-size=64MB and later someone decides to make it the master in a replicated system and has a slave without that parameter? Would the slave work with the "different" WAL size of the master? How could it be guaranteed that in such a scenario the replication either works correctly or fails with a meaningful error message?

But in general I think a more flexible WAL size is a good idea.
To answer Andres: You have found one of the (few?) users to adjust initdb parameters.

Regards

Robert Haas <robertmhaas@gmail.com> wrote on Thursday, 25 August 2016 at 6:43:
> On Thu, Aug 25, 2016 at 12:35 AM, Andres Freund <andres@anarazel.de> wrote:
> > FWIW, I'm also doubtful that investing time into making this initdb
> > configurable is a good use of time: The number of users that'll adjust
> > initdb time parameters is going to be fairly small.
>
> I have to admit that I was skeptical about the idea of doing anything
> about this at all the first few times it came up. 16MB ought to be
> good enough for anyone! However, the time between beatings has now
> gotten short enough that the bruises don't have time to heal before
> the next beating arrives from a completely different customer. I try
> not to hold my views so firmly as to be impervious to contrary
> evidence.
On Wed, Aug 24, 2016 at 10:40:06PM -0300, Claudio Freire wrote:
> > time instead of requiring a recompile; we probably don't save any
> > significant number of cycles by compiling this into the server.
>
> FWIW, +1
>
> We're already hurt by the small segments due to a similar phenomenon
> as the ssh case: TCP slow start. Designing the archive/recovery
> command to work around TCP slow start is quite complex, and bigger
> segments would just be a better thing.
>
> Not to mention that bigger segments compress better.

This would be a good time to rename the pg_xlog and pg_clog directories too.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +
On Thu, Aug 25, 2016 at 1:04 AM, Wolfgang Wilhelm <wolfgang20121964@yahoo.de> wrote:
> What would happen if there's a database on a server with initdb (or
> whatever) parameter -with-wal-size=64MB and later someone decides to make it
> the master in a replicated system and has a slave without that parameter?
> Would the slave work with the "different" wal size of the master? How could
> be guaranteed that in such a scenario the replication either works correctly
> or failes with a meaningful error message?

You make reference to an "initdb (or whatever) parameter" but actually there is a big difference between the "initdb" case and the "whatever" case. If the parameter is fixed at initdb time, then the master and the slave will definitely agree: the slave had to be created by copying the master, and that means the control file that contains the size was also copied. Neither can have been changed afterwards. That's what an initdb-time parameter means. On the other hand, if the parameter is, say, a GUC, then you would have exactly the kinds of problems that you are talking about here. I am not keen to solve any of those problems, which is why I am not proposing to go any further than an initdb-time parameter.

> But in general I thing a more flexible WAL size is a good idea.
> To answer Andres: You have found one of the (few?) users to adjust initdb
> parameters.

Good to know, thanks. In further defense of the idea that making this more configurable isn't nuts, it's worth noting that the history here is:

* When Vadim originally added XLogSegSize in 30659d43eb73272e20f2eb1d785a07ba3b553ed8 (September 1999), it was a constant.
* In c3c09be34b6b0d7892f1087a23fc6eb93f3c4f04 (February 2004), this became configurable via pg_config_manual.h.
* In cf9f6c8d8e9df28f3fbe1850ca7f042b2c01252e (May 2008), Tom made this configurable via configure.

So there's a well-established history of making this gradually easier for users to change.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> Meanwhile, we'll significantly help people who are currently
> generating painfully large but not totally insane numbers of WAL
> segments. Someone who is currently generating 32,768 WAL segments per
> day - about one every 2.6 seconds - will have a significantly easier
> time if they start generating 8,192 WAL segments per day - about one
> every 10.5 seconds - instead. It's just much easier for a reasonably
> simple archive command to keep up, "ls" doesn't have as many directory
> entries to sort, etc.

I'm generally on-board with increasing the WAL segment size, and I can see the point that we might want to make it more easily configurable, as it's valuable to set it differently on a small database vs. a large database, but I take exception with the notion that a "simple archive command" is ever appropriate.

Heikki's excellent talk at PGCon '15 (iirc) goes over why our archive command example is about as terrible as you can get, and that's primarily because it's just a simple 'cp'. archive_command needs to be doing things like fsync'ing the WAL file after it's been copied away, probably fsync'ing the directory the WAL file has been copied into, returning the correct exit code to PG, etc.

Thankfully, there are backup/WAL archive utilities which do this correctly and are even built to handle a large rate of WAL files for high transaction systems (including keeping open a long-running ssh/TCP connection to address the startup costs of both). Switching to 64MB would still be nice to simply reduce the number of files you have to deal with, and I'm all for it for that reason, but the ssh/TCP startup cost reasons aren't good ones for the switch, as people shouldn't be using a "simple" command anyway and the good tools for WAL archiving have already addressed those issues.

Thanks!

Stephen
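The safety properties described above can be sketched roughly like this - a hedged illustration of the fsync-file, fsync-directory, correct-exit-code pattern, not one of the production archiving tools (the script name and paths are invented, and real tools handle far more, e.g. compression and persistent connections):

```python
import os
import shutil
import sys

def archive_wal(src_path: str, archive_dir: str) -> int:
    """Copy a WAL segment into the archive the way a careful
    archive_command should: copy to a temp name, fsync the data,
    rename into place, fsync the directory entry, and report
    success or failure via the exit code."""
    dest_path = os.path.join(archive_dir, os.path.basename(src_path))
    if os.path.exists(dest_path):
        return 1   # never silently overwrite an already-archived segment
    tmp_path = dest_path + ".tmp"
    try:
        shutil.copy2(src_path, tmp_path)
        fd = os.open(tmp_path, os.O_RDONLY)
        try:
            os.fsync(fd)            # make sure the data reached disk
        finally:
            os.close(fd)
        os.rename(tmp_path, dest_path)
        dirfd = os.open(archive_dir, os.O_RDONLY)
        try:
            os.fsync(dirfd)         # ...and the directory entry too
        finally:
            os.close(dirfd)
    except OSError:
        return 1                    # nonzero tells PostgreSQL to retry
    return 0

if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(archive_wal(sys.argv[1], sys.argv[2]))
```

Wired up as something like archive_command = 'python archive_wal.py %p /mnt/archive' (names hypothetical); the exit code is what lets PostgreSQL know whether the segment may be recycled.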
On Thu, Aug 25, 2016 at 9:48 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> Meanwhile, we'll significantly help people who are currently
>> generating painfully large but not totally insane numbers of WAL
>> segments. Someone who is currently generating 32,768 WAL segments per
>> day - about one every 2.6 seconds - will have a significantly easier
>> time if they start generating 8,192 WAL segments per day - about one
>> every 10.5 seconds - instead. It's just much easier for a reasonably
>> simple archive command to keep up, "ls" doesn't have as many directory
>> entries to sort, etc.
>
> I'm generally on-board with increasing the WAL segment size, and I can
> see the point that we might want to make it more easily configurable as
> it's valuable to set it differently on a small database vs. a large
> database, but I take exception with the notion that a "simple archive
> command" is ever appropriate.

My point wasn't really that archive_command should actually be simple. My point was that if it's being run multiple times per second, there are additional challenges that wouldn't arise if it were being run only every 5-10 seconds. I guess I should have said "simpler" rather than "reasonably simple", because there's nothing simple about setting archive_command properly.

I mean, it could only actually be simple if somebody had a good backup tool that provided an archive_command that you could just drop in place. But I'm sure if somebody had such a tool, they'd take every opportunity to bring it up, so we doubtless would have heard about it by now. Right? :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Aug 25, 2016 at 9:48 AM, Stephen Frost <sfrost@snowman.net> wrote: > > * Robert Haas (robertmhaas@gmail.com) wrote: > >> Meanwhile, we'll significantly help people who are currently > >> generating painfully large but not totally insane numbers of WAL > >> segments. Someone who is currently generating 32,768 WAL segments per > >> day - about one every 2.6 seconds - will have a significantly easier > >> time if they start generating 8,192 WAL segments per day - about one > >> every 10.5 seconds - instead. It's just much easier for a reasonably > >> simple archive command to keep up, "ls" doesn't have as many directory > >> entries to sort, etc. > > > > I'm generally on board with increasing the WAL segment size, and I can > > see the point that we might want to make it more easily configurable as > > it's valuable to set it differently on a small database vs. a large > > database, but I take exception to the notion that a "simple archive > > command" is ever appropriate. > > My point wasn't really that archive_command should actually be simple. > My point was that if it's being run multiple times per second, there > are additional challenges that wouldn't arise if it were being run > only every 5-10 seconds. My point was that the concerns about TCP/ssh startup costs, which were part of your point #1 in your initial justification for the change, have been addressed through tooling. > I guess I should have said "simpler" rather than "reasonably simple", > because there's nothing simple about setting archive_command properly. Agreed. > I mean, it could only actually be simple if somebody had a good > backup tool that provided an archive_command that you could just drop > in place. But I'm sure if somebody had such a tool, they'd take every > opportunity to bring it up, so we doubtless would have heard about it > by now. Right?
> :-) Thankfully, there are actually multiple good open source and freely available tools that address this issue (albeit through different mechanisms). Thanks! Stephen
On Wed, Aug 24, 2016 at 08:52:20PM -0700, Andres Freund wrote: > On 2016-08-24 23:26:51 -0400, Robert Haas wrote: > > On Wed, Aug 24, 2016 at 10:54 PM, Andres Freund <andres@anarazel.de> wrote: > > > and I'm also rather doubtful that it's actually without overhead. > > > > Really? Where do you think the overhead would come from? > > ATM we do math involving XLOG_BLCKSZ in a bunch of places (including > doing a lot of %). Some of that happens with exclusive lwlocks held, and > some even with a spinlock held IIRC. Making that variable won't be > free. Whether it's actually measurable - hard to say. I do remember > Heikki fighting hard to simplify some parts of the critical code during > the xlog scalability work, and that that even involved moving minor amounts > of math out of critical sections. I think Robert made a good case that high-volume servers might want a larger WAL segment size, but as Andres pointed out, there are performance concerns. Those might be minimized by requiring the segment size to be a power-of-two multiple of 16MB. Another issue is that many users are coming from database products that have significant performance hits when switching WAL files, so they might be tempted to set very high segment sizes in inappropriate cases. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Thu, Aug 25, 2016 at 10:34 AM, Stephen Frost <sfrost@snowman.net> wrote: >> My point wasn't really that archive_command should actually be simple. >> My point was that if it's being run multiple times per second, there >> are additional challenges that wouldn't arise if it were being run >> only every 5-10 seconds. > > My point was that the concerns about TCP/ssh startup costs, which was > part of your point #1 in your initial justification for the change, > have been addressed through tooling. It's good to know that some tool sets have addressed that, but I'm pretty certain that not every tool set has done so, probably not even all of the ones in common use. Anyway, I think the requirements we impose on archive_command today are just crazy. All other things being equal, changes that make it easier to write a decent one are IMHO going in the right direction. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Aug 25, 2016 at 10:39 AM, Bruce Momjian <bruce@momjian.us> wrote: > Another issue is that many users are coming from database products that > have significant performance hits in switching WAL files so they might > be tempted to set very high segment sizes in inappropriate cases. Well, we have some hit there, too. It may be smaller, but it's certainly not zero. I'm generally in favor of preventing people from setting ridiculous values for settings; we shouldn't let somebody set the WAL segment size to 8kB or something silly like that. But it's more important to enable legitimate uses than it is to prohibit inappropriate uses. If a particular value of a particular setting may be legitimately useful to some users, we should allow it, even if some other user might choose that value under false assumptions. In short, let's eschew nannyism. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Aug 25, 2016 at 9:34 AM, Magnus Hagander <magnus@hagander.net> wrote: > Because it comes with the cluster during replication. I think it's more > likely that you accidentally end up with two instances compiled with > different values than that you get an issue from this. I hadn't thought about it that way, but I think you're right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 25 August 2016 at 02:31, Robert Haas <robertmhaas@gmail.com> wrote: > Furthermore, there is an enforced, synchronous fsync at the end of > every segment, which actually does hurt performance on write-heavy > workloads.[2] Of course, if that were the only reason to consider > increasing the segment size, it would probably make more sense to just > try to push that extra fsync into the background, but that's not > really the case. From what I hear, the gigantic number of files is a > bigger pain point. I think we should fully describe the problem before finding a solution. This is too big a change to just tweak a value without discussing the actual issue. And if the problem is as described, how can a change of x4 be enough to make it worth the pain of change? I think you're already admitting it can't be worth it by discussing initdb configuration. If we do have the pain of change, should we also consider making WAL files variable length? What do we gain by having the files all the same size? ISTM better to have WAL files that vary in length up to 1GB in size. (This is all about XLOG_SEG_SIZE; I presume XLOG_BLCKSZ can stay as it is, right?) -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
* Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Aug 25, 2016 at 10:34 AM, Stephen Frost <sfrost@snowman.net> wrote: > >> My point wasn't really that archive_command should actually be simple. > >> My point was that if it's being run multiple times per second, there > >> are additional challenges that wouldn't arise if it were being run > >> only every 5-10 seconds. > > > > My point was that the concerns about TCP/ssh startup costs, which was > > part of your point #1 in your initial justification for the change, > > have been addressed through tooling. > > It's good to know that some tool sets have addressed that, but I'm > pretty certain that not every tool set has done so, probably not even > all of the ones in common use. Anyway, I think the requirements we > impose on archive_command today are just crazy. All other things > being equal, changes that make it easier to write a decent one are > IMHO going in the right direction. Agreed, but, unfortunately, this isn't an "all other things being equal" case, or we wouldn't be having this discussion. Increasing the WAL segment size means it'll be longer before archive_command is called which means there's a larger amount of potential data loss for users who are using it without any other archiving/replication solution, along with the other concerns about it possibly resulting in a higher disk space cost. I agree that increasing it makes sense and that 64MB is a good number, but I wouldn't want to go much higher than that. That doesn't completely solve the TCP/SSH start-up cost penalty as there will be environments where that is still expensive even with 64MB WAL segments, but it will certainly be reduced. To try to summarize, I don't think we should be trying to solve the TCP/SSH start-up penalty issue for all users by encouraging them to increase the WAL segment size, at least not without covering the trade-offs. 
That isn't to say we shouldn't change the default, I agree that we should, but I believe we should keep it a reasonably conservative change and if we make it user-configurable then we need to be sure to document the trade-offs. Thanks! Stephen
On Thu, Aug 25, 2016 at 12:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 25 August 2016 at 02:31, Robert Haas <robertmhaas@gmail.com> wrote: > >> Furthermore, there is an enforced, synchronous fsync at the end of >> every segment, which actually does hurt performance on write-heavy >> workloads.[2] Of course, if that were the only reason to consider >> increasing the segment size, it would probably make more sense to just >> try to push that extra fsync into the background, but that's not >> really the case. From what I hear, the gigantic number of files is a >> bigger pain point. > > I think we should fully describe the problem before finding a solution. > > This is too big a change to just tweak a value without discussing the > actual issue. > > And if the problem is as described, how can a change of x4 be enough > to make it worth the pain of change? I think you're already admitting > it can't be worth it by discussing initdb configuration. > > If we do have the pain of change, should we also consider making WAL > files variable length? What do we gain by having the files all the > same size? ISTM better to have WAL files that vary in length up to 1GB > in size. > > (This is all about XLOG_SEG_SIZE; I presume XLOG_BLCKSZ can stay as it > is, right?) Avoiding variable sizes does avoid some failure modes on the filesystem side in the face of crashes/power loss. So making them variable size, while possible, wouldn't be simple at all (it would involve figuring out how filesystems behave in the face of a crash while a file is changing in size).
On Thu, Aug 25, 2016 at 11:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 25 August 2016 at 02:31, Robert Haas <robertmhaas@gmail.com> wrote: >> Furthermore, there is an enforced, synchronous fsync at the end of >> every segment, which actually does hurt performance on write-heavy >> workloads.[2] Of course, if that were the only reason to consider >> increasing the segment size, it would probably make more sense to just >> try to push that extra fsync into the background, but that's not >> really the case. From what I hear, the gigantic number of files is a >> bigger pain point. > > I think we should fully describe the problem before finding a solution. Sure, that's usually a good idea. I attempted to outline all of the possible issues of which I am aware in my original email, but of course you may know of considerations which I overlooked. > This is too big a change to just tweak a value without discussing the > actual issue. Again, I tried to make sure I was discussing the actual issues in my original email. In brief: having to run archive_command multiple times per second imposes very tight latency requirements on it; directories with hundreds of thousands or millions of files are hard to manage; enforced synchronous fsyncs at the end of each segment hurt performance. > And if the problem is as described, how can a change of x4 be enough > to make it worth the pain of change? I think you're already admitting > it can't be worth it by discussing initdb configuration. I guess it depends on how much pain of change you think there will be. I would expect a change from 16MB -> 64MB to be fairly painless, but (1) it might break tools that aren't designed to cope with differing segment sizes and (2) it will increase disk utilization for people who have such low velocity systems that they never end up with more than 2 WAL segments, and now those segments are bigger. 
If you know of other impacts or have reason to believe those problems will be serious, please fill in the details. Despite the fact that initdb configuration has dominated this thread, I mentioned it only in the very last sentence of my email and only as a possibility. I believe that a 4x change will be good enough for the majority of people for whom this is currently a pain point. However, yes, I do believe that there are some people for whom it won't be sufficient. And I believe that as we continue to enhance PostgreSQL to support higher and higher transaction rates, the number of people who need an extra-large WAL segment size will increase. As I see it, there are four options here: 1. Do nothing. So far, I don't see anybody arguing for that. 2. Change the default to 64MB and call it good. This idea seems to have considerable support. 3. Allow initdb-time configurability but keep the default at 16MB. I don't see any support for this. There is clearly support for configurability, but I don't see anyone arguing that the current default is preferable, unless that is what you are arguing. 4. Change the default to 64MB and also allow initdb-time configurability. This option also appears to enjoy substantial support, perhaps more than #2. Magnus seemed to be arguing that this is preferable to #2, because then it's easier for people to change the setting back if someone discovers a case where the higher default is a problem; Tom, on the other hand, seems to think this is overkill. Personally, I believe option #4 is for the best. I believe that the great majority of users will be better off with 64MB than with 16MB, but I like the idea of allowing for smaller values (for people with really low-velocity instances) and larger ones (for people with really high-velocity instances). > If we do have the pain of change, should we also consider making WAL > files variable length? What do we gain by having the files all the > same size?
ISTM better to have WAL files that vary in length up to 1GB > in size. This seems like an odd comment because the whole way we address WAL positions is based on the fact that segments are fixed size, as I would have thought you would know better than I. The file that contains a particular byte of WAL is based on lsn/XLOG_SEG_SIZE and the position within the file is lsn%XLOG_SEG_SIZE. Making files variable-size would vastly complicate this addressing scheme and maybe hurt performance in the process. I can't see any compelling reason to go there. > (This is all about XLOG_SEG_SIZE; I presume XLOG_BLCKSZ can stay as it > is, right?) Yep. Or at least, any discussion of changing the default XLOG block size would be a completely separate from the issues raised here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
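[Editor's note: the fixed-size addressing scheme Robert describes can be illustrated with a small sketch. The file-name layout below mirrors PostgreSQL's XLogFileName() convention (three 8-digit hex fields: timeline, "log" id, segment within that log id), but the function itself is hypothetical and the details should be treated as illustrative rather than authoritative.]

```python
def wal_position(lsn, seg_size=16 * 1024 * 1024, tli=1):
    """Map a WAL location (lsn, taken as a plain byte position in the WAL
    stream) to its segment file name and the byte offset within that file,
    per the formulas in the mail above: the containing segment is
    lsn / XLOG_SEG_SIZE and the in-file position is lsn % XLOG_SEG_SIZE."""
    segno = lsn // seg_size
    offset = lsn % seg_size
    # Each "log id" covers 4GB of WAL, so the number of segments per
    # log id depends on the segment size.
    segs_per_logid = 0x100000000 // seg_size
    fname = "%08X%08X%08X" % (tli, segno // segs_per_logid,
                              segno % segs_per_logid)
    return fname, offset
```

Because both the file name and the offset fall out of simple division and modulus on the LSN, making segments variable-length would replace this O(1) arithmetic with a lookup, which is the complication Robert is pointing at.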
On Thu, Aug 25, 2016 at 1:05 PM, Magnus Hagander <magnus@hagander.net> wrote: >> 1. Do nothing. So far, I don't see anybody arguing for that. >> >> 2. Change the default to 64MB and call it good. This idea seems to >> have considerable support. >> >> 3. Allow initdb-time configurability but keep the default at 16MB. I >> don't see any support for this. There is clearly support for >> configurability, but I don't see anyone arguing that the current >> default is preferable, unless that is what you are arguing. >> >> 4. Change the default to 64MB and also allow initdb-time >> configurability. This option also appears to enjoy substantial >> support, perhaps more than #2. Magnus seemed to be arguing that this >> is preferable to #2, because then it's easier for people to change the >> setting back if someone discovers a case where the higher default is a >> problem; Tom, on the other hand, seems to think this is overkill. > > I was not arguing for #4 over #2, at least not strongly. I think #2 is fine, > and I think #4 is fine. #4 allows a way out, but it's not *that* important > unless we go *beyond* 64MB. OK, thanks for clarifying. I can't see going beyond 64MB by default when we're shipping max_wal_size=1GB. In another 20 years when PB-size thumb drives are commonplace we might reconsider. > I was mainly arguing that we can't claim "it has a configure switch so it's > kinda configurable" as a way out. If we want it configurable *at all*, it > should be an initdb switch. If we are confident in our defaults, it doesn't > have to be. > > I agree that #4 is best. I'm not sure it's worth the cost. I'm not worried > at all about the risk of the master/slave sync thing, per previous statement. > But if it does have performance implications, per Andres' suggestion, then > making it configurable at initdb time probably comes with a cost that's not > worth paying. At this point it's hard to judge, because we don't have any idea what the cost might be. 
I guess if we want to pursue this approach, somebody will have to code it up and benchmark it. But what I'm inclined to do for starters is put together a patch to go from 16MB -> 64MB. Committing that early this cycle will give us time to reconsider if that turns out to be painful for reasons we haven't thought of yet. And give tool authors time to make adjustments, if any are needed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 08/25/2016 01:12 PM, Robert Haas wrote: >> I agree that #4 is best. I'm not sure it's worth the cost. I'm not worried >> > at all about the risk of master/slave sync thing, per previous statement. >> > But if it does have performance implications, per Andres suggestion, then >> > making it configurable at initdb time probably comes with a cost that's not >> > worth paying. > At this point it's hard to judge, because we don't have any idea what > the cost might be. I guess if we want to pursue this approach, > somebody will have to code it up and benchmark it. But what I'm > inclined to do for starters is put together a patch to go from 16MB -> > 64MB. Committing that early this cycle will give us time to > reconsider if that turns out to be painful for reasons we haven't > thought of yet. And give tool authors time to make adjustments, if > any are needed. The one thing I'd be worried about with the increase in size is folks using PostgreSQL for very small databases. If your database is only 30MB or so in size, the increase in size of the WAL will be pretty significant (+144MB for the base 3 WAL segments). I'm not sure this is a real problem which users will notice (in today's scales, 144MB ain't much), but if it turns out to be, it would be nice to have a way to switch it back *just for them* without recompiling. -- -- Josh Berkus Red Hat OSAS (any opinions are my own)
On Thu, Aug 25, 2016 at 1:43 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 08/25/2016 01:12 PM, Robert Haas wrote: >>> I agree that #4 is best. I'm not sure it's worth the cost. I'm not worried >>> > at all about the risk of master/slave sync thing, per previous statement. >>> > But if it does have performance implications, per Andres suggestion, then >>> > making it configurable at initdb time probably comes with a cost that's not >>> > worth paying. >> At this point it's hard to judge, because we don't have any idea what >> the cost might be. I guess if we want to pursue this approach, >> somebody will have to code it up and benchmark it. But what I'm >> inclined to do for starters is put together a patch to go from 16MB -> >> 64MB. Committing that early this cycle will give us time to >> reconsider if that turns out to be painful for reasons we haven't >> thought of yet. And give tool authors time to make adjustments, if >> any are needed. > > The one thing I'd be worried about with the increase in size is folks > using PostgreSQL for very small databases. If your database is only > 30MB or so in size, the increase in size of the WAL will be pretty > significant (+144MB for the base 3 WAL segments). I'm not sure this is > a real problem which users will notice (in today's scales, 144MB ain't > much), but if it turns out to be, it would be nice to have a way to > switch it back *just for them* without recompiling. I think you may be forgetting that "the base 3 WAL segments" is no longer the default configuration. checkpoint_segments=3 is history; we now have max_wal_size=1GB, which is a maximum of 64 WAL segments, not 3. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-08-25 13:45:29 -0400, Robert Haas wrote: > I think you may be forgetting that "the base 3 WAL segments" is no > longer the default configuration. checkpoint_segments=3 is history; > we now have max_wal_size=1GB, which is a maximum of 64 WAL segments, > not 3. Well, but min_wal_size still is 48MB. So sure, if you consistently have a high WAL throughput, it'll be bigger. But otherwise pg_xlog will shrink again.
On Thu, Aug 25, 2016 at 1:50 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-08-25 13:45:29 -0400, Robert Haas wrote: >> I think you may be forgetting that "the base 3 WAL segments" is no >> longer the default configuration. checkpoint_segments=3 is history; >> we now have max_wal_size=1GB, which is a maximum of 64 WAL segments, >> not 3. > > Well, but min_wal_size still is 48MB. So sure, if you consistently have > a high WAL throughput, it'll be bigger. But otherwise pg_xlog will > shrink again. Hmm, yeah. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Thu, Aug 25, 2016 at 1:43 PM, Josh Berkus <josh@agliodbs.com> wrote: > > The one thing I'd be worried about with the increase in size is folks > > using PostgreSQL for very small databases. If your database is only > > 30MB or so in size, the increase in size of the WAL will be pretty > > significant (+144MB for the base 3 WAL segments). I'm not sure this is > > a real problem which users will notice (in today's scales, 144MB ain't > > much), but if it turns out to be, it would be nice to have a way to > > switch it back *just for them* without recompiling. > > I think you may be forgetting that "the base 3 WAL segments" is no > longer the default configuration. checkpoint_segments=3 is history; > we now have max_wal_size=1GB, which is a maximum of 64 WAL segments, > not 3. I think the relevant one for that case is the minimum, though: #min_wal_size = 80MB which corresponds to 5 segments. I suppose the default value for this minimum would change to some multiple of 64MB. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 25, 2016 at 2:49 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> On Thu, Aug 25, 2016 at 1:43 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> > The one thing I'd be worried about with the increase in size is folks >> > using PostgreSQL for very small databases. If your database is only >> > 30MB or so in size, the increase in size of the WAL will be pretty >> > significant (+144MB for the base 3 WAL segments). I'm not sure this is >> > a real problem which users will notice (in today's scales, 144MB ain't >> > much), but if it turns out to be, it would be nice to have a way to >> > switch it back *just for them* without recompiling. >> >> I think you may be forgetting that "the base 3 WAL segments" is no >> longer the default configuration. checkpoint_segments=3 is history; >> we now have max_wal_size=1GB, which is a maximum of 64 WAL segments, >> not 3. > > I think the relevant one for that case is the minimum, though: > > #min_wal_size = 80MB > > which corresponds to 5 segments. I suppose the default value for this > minimum would change to some multiple of 64MB. Yeah, Andres made the same point, although it looks like he erroneously stated that the minimum was 48MB whereas you have it as 80MB, which seems to be the actual value. I assume we would have to raise that to either 128MB or 192MB, which does feel like a bit of a hefty increase. It doesn't matter if you're going to make extensive use of the cluster, but if somebody's spinning up hundreds of clusters each of which has very little activity it might not be an altogether welcome change. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
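[Editor's note: the 128MB/192MB figures come from rounding min_wal_size to a whole number of segments. A small sketch of that arithmetic, with hypothetical helper names; the floor of 2 segments is the hardcoded guc.c minimum mentioned in the thread:]

```python
def min_wal_size_mb(seg_size_mb, n_segments):
    """min_wal_size expressed as n_segments whole segments."""
    return seg_size_mb * n_segments

def round_up_to_segments_mb(size_mb, seg_size_mb, floor_segments=2):
    """Round a min_wal_size value (in MB) up to a whole number of
    segments, respecting a minimum segment count (2 in guc.c)."""
    segments = max(-(-size_mb // seg_size_mb), floor_segments)  # ceiling division
    return segments * seg_size_mb
```

So today's 80MB default is exactly 5 x 16MB segments, while at a 64MB segment size the smallest expressible values become 128MB (2 segments) or 192MB (3 segments).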
Robert Haas wrote: > On Thu, Aug 25, 2016 at 2:49 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > I think the relevant one for that case is the minimum, though: > > > > #min_wal_size = 80MB > > > > which corresponds to 5 segments. I suppose the default value for this > > minimum would change to some multiple of 64MB. > > Yeah, Andres made the same point, although it looks like he > erroneously stated that the minimum was 48MB whereas you have it as > 80MB, which seems to be the actual value. I assume we would have to > raise that to either 128MB or 192MB, which does feel like a bit of a > hefty increase. It doesn't matter if you're going to make extensive > use of the cluster, but if somebody's spinning up hundreds of clusters > each of which has very little activity it might not be an altogether > welcome change. Yeah, and it's also related to the point Josh Berkus was making about clusters with little activity. Does it work to set the minimum to one WAL segment, i.e. 64MB? guc.c has a hardcoded minimum of 2, but I couldn't find an explanation for it. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 25, 2016 at 04:21:33PM +0100, Simon Riggs wrote: > If we do have the pain of change, should we also consider making WAL > files variable length? What do we gain by having the files all the > same size? ISTM better to have WAL files that vary in length up to 1GB > in size. > > (This is all about XLOG_SEG_SIZE; I presume XLOG_BLCKSZ can stay as it > is, right?) I think having WAL use variable length files would add complexity for recycling WAL files. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Thu, Aug 25, 2016 at 3:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Yeah, and it's also related to the point Josh Berkus was making about > clusters with little activity. Right. > Does it work to set the minimum to one WAL segment, i.e. 64MB? guc.c > has a hardcoded minimum of 2, but I couldn't find an explanation for it. Well, I think that when you overrun the end of one segment, you're never going to be able to wrap around to the start of the same segment; you're going to get sucked into needing another file. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Thu, Aug 25, 2016 at 3:21 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > Does it work to set the minimum to one WAL segment, i.e. 64MB? guc.c > > has a hardcoded minimum of 2, but I couldn't find an explanation for it. > > Well, I think that when you overrun the end of one segment, you're > never going to be able to wrap around to the start of the same > segment; you're going to get sucked into needing another file. Sure, but that's a transient situation; after a couple of checkpoints, the old segment can be removed without any danger, leaving only the active segment. [thinks] Ah, on reflection, there's no way that this buys anything: it is always critical to have enough disk space to have one more segment to switch to. So even if you're on tight disk constraints, you cannot afford to allocate space for a single segment only, because if you only have that and the need comes to create the next one to switch to, you will just not have the space. If we were to use the WAL space in a way different than the POSIX file interface, we could probably do better. But that seems too onerous. I suppose the only option is to keep the minimum at 2. I don't see any point in forcing the minimum to be more than that, however. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 24, 2016 at 6:31 PM, Robert Haas <robertmhaas@gmail.com> wrote: > 3. archive_timeout is no longer a frequently used option. Obviously, > if you are frequently archiving partial segments, you don't want the > segment size to be too large, because if it is, each forced segment > switch potentially wastes a large amount of space (and bandwidth). > But given streaming replication and pg_receivexlog, the use case for > archiving partial segments is, at least according to my understanding, > a lot narrower than it used to be. So, I think we don't have to worry > as much about keeping forced segment switches cheap as we did during > the 8.x series. Heroku uses archive_timeout. It is considered important, because S3 archives are more reliable than EBS storage. We want to cap how much time can pass before WAL is shipped to S3, to some degree. It's weird to talk about degrees of durability, since we tend to assume that it's either/or, but distinctions like that start to matter when you have an enormous number of databases. S3 has an extremely good track record, reliability-wise. We're not too concerned about the overhead of all of this, I think, because WAL segments consist of zeroes at the end when archive_timeout is applied (at least from 9.4 on). We compress the WAL segments, and many zeroes compress very well. I admit that I haven't looked at it in much detail, but that is my current understanding. -- Peter Geoghegan
On 26/08/16 05:43, Josh Berkus wrote: > On 08/25/2016 01:12 PM, Robert Haas wrote: >>> I agree that #4 is best. I'm not sure it's worth the cost. I'm not worried >>>> at all about the risk of master/slave sync thing, per previous statement. >>>> But if it does have performance implications, per Andres suggestion, then >>>> making it configurable at initdb time probably comes with a cost that's not >>>> worth paying. >> At this point it's hard to judge, because we don't have any idea what >> the cost might be. I guess if we want to pursue this approach, >> somebody will have to code it up and benchmark it. But what I'm >> inclined to do for starters is put together a patch to go from 16MB -> >> 64MB. Committing that early this cycle will give us time to >> reconsider if that turns out to be painful for reasons we haven't >> thought of yet. And give tool authors time to make adjustments, if >> any are needed. > The one thing I'd be worried about with the increase in size is folks > using PostgreSQL for very small databases. If your database is only > 30MB or so in size, the increase in size of the WAL will be pretty > significant (+144MB for the base 3 WAL segments). I'm not sure this is > a real problem which users will notice (in today's scales, 144MB ain't > much), but if it turns out to be, it would be nice to have a way to > switch it back *just for them* without recompiling. > Let such folk use Microsoft Access??? <Ducks & runs away very fast!> More seriously: Surely most such people would be using very old hardware & not likely to be upgrading to the most recent version of pg in the near future? And for the ones using modern hardware: either they have enough resources not to notice, or very probably will know enough to hunt round for a way to reduce the WAL size - I strongly suspect. Currently, I'm not supporting pg in any production environment; I'm using it for testing & keeping up-to-date with pg. So it would affect me - however, I have enough resources so it is no problem in practice. Cheers, Gavin
On Thu, Aug 25, 2016 at 10:25 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Wed, Aug 24, 2016 at 10:40:06PM -0300, Claudio Freire wrote: >> > time instead of requiring a recompile; we probably don't save any >> > significant number of cycles by compiling this into the server. >> >> FWIW, +1 >> >> We're already hurt by the small segments due to a similar phenomenon >> as the ssh case: TCP slow start. Designing the archive/recovery >> command to work around TCP slow start is quite complex, and bigger >> segments would just be a better thing. >> >> Not to mention that bigger segments compress better. > > This would be good time to rename pg_xlog and pg_clog directories too. That would be an excellent timing to do so. The first CF is close by, and such a change would be better at the beginning of the development cycle. -- Michael

Michael, * Michael Paquier (michael.paquier@gmail.com) wrote: > On Thu, Aug 25, 2016 at 10:25 PM, Bruce Momjian <bruce@momjian.us> wrote: > > On Wed, Aug 24, 2016 at 10:40:06PM -0300, Claudio Freire wrote: > >> > time instead of requiring a recompile; we probably don't save any > >> > significant number of cycles by compiling this into the server. > >> > >> FWIW, +1 > >> > >> We're already hurt by the small segments due to a similar phenomenon > >> as the ssh case: TCP slow start. Designing the archive/recovery > >> command to work around TCP slow start is quite complex, and bigger > >> segments would just be a better thing. > >> > >> Not to mention that bigger segments compress better. > > > > This would be good time to rename pg_xlog and pg_clog directories too. > > That would be an excellent timing to do so. The first CF is close by, > and such a change would be better at the beginning of the development > cycle. If we're going to be renaming things, we might wish to consider further changes, such as putting everything that's temporary & not WAL-logged into "pgsql_tmp" directories, so we don't need lists of "directories to exclude" in things like the pg_basebackup-related code. We should really have an independent thread about this though, as it's not what Robert's asking about here. Thanks! Stephen
On Thu, Aug 25, 2016 at 10:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Aug 25, 2016 at 11:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 25 August 2016 at 02:31, Robert Haas <robertmhaas@gmail.com> wrote: >>> Furthermore, there is an enforced, synchronous fsync at the end of >>> every segment, which actually does hurt performance on write-heavy >>> workloads.[2] Of course, if that were the only reason to consider >>> increasing the segment size, it would probably make more sense to just >>> try to push that extra fsync into the background, but that's not >>> really the case. From what I hear, the gigantic number of files is a >>> bigger pain point. >> >> I think we should fully describe the problem before finding a solution. > > Sure, that's usually a good idea. I attempted to outline all of the > possible issues of which I am aware in my original email, but of > course you may know of considerations which I overlooked. > >> This is too big a change to just tweak a value without discussing the >> actual issue. > > Again, I tried to make sure I was discussing the actual issues in my > original email. In brief: having to run archive_command multiple > times per second imposes very tight latency requirements on it; > directories with hundreds of thousands or millions of files are hard > to manage; enforced synchronous fsyncs at the end of each segment hurt > performance. > >> And if the problem is as described, how can a change of x4 be enough >> to make it worth the pain of change? I think you're already admitting >> it can't be worth it by discussing initdb configuration. > > I guess it depends on how much pain of change you think there will be. 
> I would expect a change from 16MB -> 64MB to be fairly painless, but > (1) it might break tools that aren't designed to cope with differing > segment sizes and (2) it will increase disk utilization for people who > have such low velocity systems that they never end up with more than 2 > WAL segments, and now those segments are bigger. If you know of other > impacts or have reason to believe those problems will be serious, > please fill in the details. > > Despite the fact that initdb configuration has dominated this thread, > I mentioned it only in the very last sentence of my email and only as > a possibility. I believe that a 4x change will be good enough for the > majority of people for whom this is currently a pain point. However, > yes, I do believe that there are some people for whom it won't be > sufficient. And I believe that as we continue to enhance PostgreSQL > to support higher and higher transaction rates, the number of people > who need an extra-large WAL segment size will increase. As I see it, > there are three options here: > > 1. Do nothing. So far, I don't see anybody arguing for that. > > 2. Change the default to 64MB and call it good. This idea seems to > have considerable support. > > 3. Allow initdb-time configurability but keep the default at 16MB. I > don't see any support for this. There is clearly support for > configurability, but I don't see anyone arguing that the current > default is preferable, unless that is what you are arguing. > > 4. Change the default to 64MB and also allow initdb-time > configurability. This option also appears to enjoy substantial > support, perhaps more than #2. Magnus seemed to be arguing that this > is preferable to #2, because then it's easier for people to change the > setting back if someone discovers a case where the higher default is a > problem; Tom, on the other hand, seems to think this is overkill. 
If we change the default to 64MB, then I think it won't allow old databases to be used as-is, because we store it in pg_control (I think one will get the below error [1] for old databases, if we just change the default and don't do anything else). Do you have a way to address it, or do you think it is okay?

[1] -
FATAL: database files are incompatible with server
DETAIL: The database cluster was initialized with XLOG_SEG_SIZE 16777216, but the server was compiled with XLOG_SEG_SIZE 67108864.
HINT: It looks like you need to recompile or initdb.
LOG: database system is shut down

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 26, 2016 at 12:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > If we change the default to 64MB, then I think it won't allow to use > old databases as-is because we store it in pg_control (I think one > will get below error [1] for old databases, if we just change default > and don't do anything else). Do you have way to address it or you > think it is okay? Those would still be able to work with ./configure --with-wal-segsize=16, so that's not really an issue. -- Michael
Gavin Flower wrote: > On 26/08/16 05:43, Josh Berkus wrote: > >The one thing I'd be worried about with the increase in size is folks > >using PostgreSQL for very small databases. If your database is only > >30MB or so in size, the increase in size of the WAL will be pretty > >significant (+144MB for the base 3 WAL segments). I'm not sure this is > >a real problem which users will notice (in today's scales, 144MB ain't > >much), but if it turns out to be, it would be nice to have a way to > >switch it back *just for them* without recompiling. > > > Let such folk use Microsoft Access??? <Ducks & runs away very fast!> > > > More seriously: > Surely most such people would be using very old hardware & not likely to be > upgrading to the most recent version of pg in the near future? And for the > ones using modern hardware: either they have enough resources not to notice, > or very probably will know enough to hunt round for a way to reduce the WAL > size - I strongly suspect. I've seen people with unusual environments, such as running Pg in some embedded platform with minimal resources, where they were baffled that Postgres used so much disk space on files that were barely written to and never read. It wasn't a question of there being "large" drives to buy, but one of not wanting to have a drive in the first place. Now, I grant that this was a few years ago already and disk tech (SSDs) has changed that world; maybe that argument doesn't apply anymore. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Aug 26, 2016 at 9:04 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Aug 26, 2016 at 12:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> If we change the default to 64MB, then I think it won't allow to use >> old databases as-is because we store it in pg_control (I think one >> will get below error [1] for old databases, if we just change default >> and don't do anything else). Do you have way to address it or you >> think it is okay? > > Those would still be able to work with ./configure > --with-wal-segsize=16, so that's not really an issue. > Right, but do we need to suggest users do so? The question/point was that if we deliver the server with a default value of 64MB, then it won't allow an old database cluster to start. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 26, 2016 at 12:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Aug 26, 2016 at 9:04 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Fri, Aug 26, 2016 at 12:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> If we change the default to 64MB, then I think it won't allow to use >>> old databases as-is (I think one will get below error [1] for old databases, >>> if we just change default and don't do anything else). Do you have way to >>> address it or you think it is okay? >> >> Those would still be able to work with ./configure >> --with-wal-segsize=16, so that's not really an issue. >> > > Right, but do we need suggest users to do so? The question/point was > if we deliver server with default value as 64MB, then it won't allow > to start old database. Right, pg_upgrade could be made smarter by enforcing a conversion with a dedicated option: we could get away with filling the existing segments with zeros and adding an XLOG switch record at the end of each former 16MB segment converted to 64MB. That would still be better than converting each page LSN :( -- Michael
On 2016-08-26 13:07:09 +0900, Michael Paquier wrote: > On Fri, Aug 26, 2016 at 12:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Aug 26, 2016 at 9:04 AM, Michael Paquier > > <michael.paquier@gmail.com> wrote: > >> On Fri, Aug 26, 2016 at 12:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > >>> If we change the default to 64MB, then I think it won't allow to use > >>> old databases as-is because we store it in pg_control (I think one > >>> will get below error [1] for old databases, if we just change default > >>> and don't do anything else). Do you have way to address it or you > >>> think it is okay? > >> > >> Those would still be able to work with ./configure > >> --with-wal-segsize=16, so that's not really an issue. > >> > > > > Right, but do we need suggest users to do so? The question/point was > > if we deliver server with default value as 64MB, then it won't allow > > to start old database. > > Right, pg_upgrade could be made smarter by enforcing a conversion with > a dedicated option: we could get away by filling the existing segments > with zeros and add an XLOG switch record at the end of each segments > formerly at 16MB converted to 64MB. That would still be better than > converting each page LSN :( Maybe I'm missing something here - but why would we need to do any of that? The WAL already isn't compatible between versions, and we don't reuse the old server's WAL anyway? Isn't all that's needed relaxing some error check?
On Wed, 24 Aug 2016 21:31:35 -0400 Robert Haas <robertmhaas@gmail.com> wrote: > Hi, > > I'd like to propose that we increase the default WAL segment size, > which is currently 16MB. It was first set to that value in commit > 47937403676d913c0e740eec6b85113865c6c8ab in October of 1999; prior to > that, it was 64MB. Between 1999 and now, there have been three > significant changes that make me think it might be time to rethink > this value: <snip> > > Thoughts? From my ignorance, could block size affect this WAL size increase? In the past (I haven't tried it with >9 versions) you could change the disk block page size from 8KB to 16KB or 32KB (or 4KB) by modifying BLCKSZ 8192 in src/include/pg_config.h and recompiling. (There are some mails from 1999-2002 about this topic.) > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company --- Eduardo Morras <emorrasg@yahoo.es>
On Fri, Aug 26, 2016 at 12:39 AM, Andres Freund <andres@anarazel.de> wrote: > Maybe I'm missing something here - but why would we need to do any of > that? The WAL already isn't compatible between versions, and we don't > reuse the old server's WAL anyway? Isn't all that's needed relaxing some > error check? Yeah. If this change is made in a new major version - and how else would we do it? - it doesn't introduce any incompatibility that wouldn't be present already. pg_upgrade doesn't (and can't) migrate WAL. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Aug 26, 2016 at 4:45 AM, Eduardo Morras <emorrasg@yahoo.es> wrote: > From my ignorance, could block size affect this WAL size increase? > > In past (didn't tried with >9 versions) you can change disk block page size from 8KB to 16 KB or 32KB (or 4) modifing src/include/pg_config.hBLCKSZ 8192 and recompiling. (There are some mails in 1999-2002 about this topic) Yeah, I think that's still supposed to work although I don't know whether anyone has tried it lately. It is a separate topic from this issue, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 8/24/16 9:31 PM, Robert Haas wrote: > I'd like to propose that we increase the default WAL segment size, > which is currently 16MB. While the discussion about the best default value is ongoing, maybe we should at least *allow* some larger sizes, for testing out. Currently, configure says "Allowed values are 1,2,4,8,16,32,64.". What might be a good new upper limit? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 20, 2016 at 2:49 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 8/24/16 9:31 PM, Robert Haas wrote: >> I'd like to propose that we increase the default WAL segment size, >> which is currently 16MB. > > While the discussion about the best default value is ongoing, maybe we > should at least *allow* some larger sizes, for testing out. Currently, > configure says "Allowed values are 1,2,4,8,16,32,64.". What might be a > good new upper limit? 1GB? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-09-20 16:05:44 -0400, Robert Haas wrote: > On Tue, Sep 20, 2016 at 2:49 PM, Peter Eisentraut > <peter.eisentraut@2ndquadrant.com> wrote: > > On 8/24/16 9:31 PM, Robert Haas wrote: > >> I'd like to propose that we increase the default WAL segment size, > >> which is currently 16MB. > > > > While the discussion about the best default value is ongoing, maybe we > > should at least *allow* some larger sizes, for testing out. Currently, > > configure says "Allowed values are 1,2,4,8,16,32,64.". What might be a > > good new upper limit? I'm doubtful it's worth increasing this. > 1GB? That sounds way too big to me. WAL file allocation would trigger pretty massive IO storms during zeroing, max_wal_size is going to be hard to tune, the amounts of dirty data during bulk loads is going to be very hard to control. If somebody wants to do something like this they better be well informed enough to override a #define. Andres
On Tue, Sep 20, 2016 at 4:09 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-09-20 16:05:44 -0400, Robert Haas wrote: >> On Tue, Sep 20, 2016 at 2:49 PM, Peter Eisentraut >> <peter.eisentraut@2ndquadrant.com> wrote: >> > On 8/24/16 9:31 PM, Robert Haas wrote: >> >> I'd like to propose that we increase the default WAL segment size, >> >> which is currently 16MB. >> > >> > While the discussion about the best default value is ongoing, maybe we >> > should at least *allow* some larger sizes, for testing out. Currently, >> > configure says "Allowed values are 1,2,4,8,16,32,64.". What might be a >> > good new upper limit? > > I'm doubtful it's worth increasing this. > >> 1GB? > > That sounds way too big to me. WAL file allocation would trigger pretty > massive IO storms during zeroing, max_wal_size is going to be hard to > tune, the amounts of dirty data during bulk loads is going to be very > hard to control. If somebody wants to do something like this they > better be well informed enough to override a #define. EnterpriseDB has customers generating multiple TB of WAL per day. Even with a 1GB segment size, some of them will fill multiple files per minute. At the current limit of 64MB, a few of them would still fill more than one file per second. That is not sane. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-09-20 16:18:02 -0400, Robert Haas wrote: > On Tue, Sep 20, 2016 at 4:09 PM, Andres Freund <andres@anarazel.de> wrote: > > That sounds way too big to me. WAL file allocation would trigger pretty > > massive IO storms during zeroing, max_wal_size is going to be hard to > > tune, the amounts of dirty data during bulk loads is going to be very > > hard to control. If somebody wants to do something like this they > > better be well informed enough to override a #define. > > EnterpriseDB has customers generating multiple TB of WAL per day. Sure, that's kind of common. > Even with a 1GB segment size, some of them will fill multiple files > per minute. At the current limit of 64MB, a few of them would still > fill more than one file per second. That is not sane. I doubt generating much larger files actually helps a lot there. I bet you a patch review that 1GB files are going to regress in pretty much every situation; especially when taking latency into account. I think what's actually needed for that is: - make it easier to implement archiving via streaming WAL; i.e. make pg_receivexlog actually usable - make archiving parallel - decouple WAL write & fsyncing granularity from segment size Requiring a non-default compile time or even just cluster creation time option for tuning isn't something worth expending energy on imo. Andres
On Tue, Sep 20, 2016 at 4:25 PM, Andres Freund <andres@anarazel.de> wrote: >> Even with a 1GB segment size, some of them will fill multiple files >> per minute. At the current limit of 64MB, a few of them would still >> fill more than one file per second. That is not sane. > > I doubt generating much larger files actually helps a lot there. I bet > you a patch review that 1GB files are going to regress in pretty much > every situation; especially when taking latency into account. Well, you have a point: let's find out. Suppose we create a cluster that generates WAL very quickly, and then try different WAL segment sizes and see what works out best. Maybe something like: create N relatively small tables, with 100 or so tuples in each one. Have N backends, each assigned one of those tables, and it just updates all the rows over and over in a tight loop. Or feel free to suggest something else. > I think what's actually needed for that is: > - make it easier to implement archiving via streaming WAL; i.e. make > pg_receivexlog actually usable > - make archiving parallel > - decouple WAL write & fsyncing granularity from segment size > > Requiring a non-default compile time or even just cluster creation time > option for tuning isn't something worth expanding energy on imo. I don't agree. The latency requirements on an archive_command when you're churning out 16MB files multiple times per second are insanely tight, and saying that we shouldn't increase the size because it's better to go redesign a bunch of other things that will eventually *maybe* remove the need for archive_command does not seem like a reasonable response. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-09-20 16:32:46 -0400, Robert Haas wrote: > > Requiring a non-default compile time or even just cluster creation time > > option for tuning isn't something worth expending energy on imo. > > I don't agree. The latency requirements on an archive_command when > you're churning out 16MB files multiple times per second are insanely > tight, and saying that we shouldn't increase the size because it's > better to go redesign a bunch of other things that will eventually > *maybe* remove the need for archive_command does not seem like a > reasonable response. Oh, I'm on board with increasing the default size a bit. A different default size isn't a non-default compile time option anymore though, and I don't think 1GB is a reasonable default. Running multiple archive_commands concurrently - pretty easy to implement - isn't the same as removing the need for archive_command. I'm pretty sure that continuously, and if necessary concurrently, archiving a bunch of 64MB files is going to work better than irregularly creating / transferring 1GB files. Andres
On Tue, Sep 20, 2016 at 4:42 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-09-20 16:32:46 -0400, Robert Haas wrote: >> > Requiring a non-default compile time or even just cluster creation time >> > option for tuning isn't something worth expanding energy on imo. >> >> I don't agree. The latency requirements on an archive_command when >> you're churning out 16MB files multiple times per second are insanely >> tight, and saying that we shouldn't increase the size because it's >> better to go redesign a bunch of other things that will eventually >> *maybe* remove the need for archive_command does not seem like a >> reasonable response. > > Oh, I'm on board with increasing the default size a bit. A different > default size isn't a non-default compile time option anymore though, and > I don't think 1GB is a reasonable default. But that's not the question. What Peter said was: "maybe we should at least *allow* some larger sizes, for testing out". I see very little merit in restricting the values that people can set via configure. That just makes life difficult. If a user picks a setting that doesn't perform well, oops. > Running multiple archive_commands concurrently - pretty easy to > implement - isn't the same as removing the need for archive command. I'm > pretty sure that continously,and if necessary concurrently, archiving a > bunch of 64MB files is going to work better than irregularly > creating / transferring 1GB files. I'm not trying to block you from implementing parallel archiving, but right now we don't have it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 9/21/16 8:12 AM, Robert Haas wrote: >> Oh, I'm on board with increasing the default size a bit. A different >> > default size isn't a non-default compile time option anymore though, and >> > I don't think 1GB is a reasonable default. > But that's not the question. What Peter said was: "maybe we should at > least *allow* some larger sizes, for testing out". I see very little > merit in restricting the values that people can set via configure. > That just makes life difficult. If a user picks a setting that > doesn't perform well, oops. Right. If we think that a larger size can have some performance benefit and we think that 64MB might be a good new default (as was the initial suggestion), then we should surely allow at least say 128 and 256 to be tried out. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello all,

Please find attached a patch to make wal segment size initdb configurable. The attached patch removes the --with-wal-segsize configure option and adds a new initdb option --wal-segsize. The module initdb passes the wal-segsize value into an environment variable which is used to overwrite the guc value wal_segment_size and set the internal variables XLogSegSize and XLOG_SEG_SIZE (xlog_internal.h). The default wal_segment_size is not changed, but I have increased the maximum size to 1GB.

Since XLOG_SEG_SIZE is now variable, it could not be used directly in src/bin modules, and a few macros and a few changes had to be made:
- In guc.c, remove GUC_UNIT_XSEGS, which used XLOG_SEG_SIZE, and introduce show functions for the gucs which used the unit (min_wal_size and max_wal_size).
- For pg_basebackup, add a new replication command SHOW_WAL_SEGSZ to fetch the wal_segment_size in bytes.
- pg_controldata, pg_resetxlog, and pg_rewind fetch the xlog_seg_size from the ControlFile.
- Since pg_xlogdump reads the wal files, it uses the file size to determine the xlog_seg_size.
- In pg_test_fsync, a buffer of size XLOG_SEG_SIZE was created, filled with random data, and written to a temporary file to check for any write/fsync error before performing the tests. Since it does not affect the actual performance results, the XLOG_SEG_SIZE in the module is replaced with the default value (16MB).

Please note that the documents are not updated in this patch. Feedback and suggestions are welcome.

--
Beena Emerson
Have a Great Day!
Hi,

On 2016-12-19 15:14:50 +0530, Beena Emerson wrote:
> The attached patch removes --with-wal-segsize configure option and adds a new initdb option --wal-segsize. The module initdb passes the wal-segsize value into an environment variable which is used to overwrite the guc value wal_segment_size and set the internal variables: XLogSegSize and XLOG_SEG_SIZE (xlog_internal.h). The default wal_segment_size is not changed but I have increased the maximum size to 1GB.
>
> Since XLOG_SEG_SIZE is now variable, it could not be used directly in src/bin modules and few macros and few changes had to be made:

I do think this has the potential for negative performance implications. Consider code like

    /* skip over the page header */
    if (CurrPos % XLogSegSize == 0)
    {
        CurrPos += SizeOfXLogLongPHD;
        currpos += SizeOfXLogLongPHD;
    }
    else

Right now that's doable in an efficient manner, because XLogSegSize is constant. If it's a variable and the compiler doesn't even know it's a power-of-two, it'll have to do a full "div" - and that's quite easily noticeable in a lot of cases.

Now it could entirely be that the costs of this will be swamped by everything else, but I'd not want to rely on it. I think we need tests with concurrent large-file copies. And then also look at the profile to see whether the relevant places become new hotspots (not that we introduce something that's just hidden for now).

We might be able to do a bit better, efficiency wise, by storing XLogSegSize as a "shift factor". I.e. the 16M setting would be 24 (i.e. XLogSegSize would be defined as 1 << 24).

Greetings,

Andres Freund
Hi,
On 2016-12-19 15:14:50 +0530, Beena Emerson wrote:
> The attached patch removes --with-wal-segsize configure option and adds a
> new initdb option --wal-segsize. The module initdb passes the wal-segsize
> value into an environment variable which is used to overwrite the guc value
> wal_ segment_size and set the internal variables : XLogSegSize and
> XLOG_SEG_SIZE (xlog_internal.h). The default wal_segment_size is not
> changed but I have increased the maximum size to 1GB.
>
> Since XLOG_SEG_SIZE is now variable, it could not be used directly in
> src/bin modules and few macros and few changes had to be made:
I do think this has the potential for negative performance
implications. Consider code like
/* skip over the page header */
if (CurrPos % XLogSegSize == 0)
{
CurrPos += SizeOfXLogLongPHD;
currpos += SizeOfXLogLongPHD;
}
else
right now that's doable in an efficient manner, because XLogSegSize is
constant. If it's a variable and the compiler doesn't even know it's a
power-of-two, it'll have to do a full "div" - and that's quite easily
noticeable in a lot of cases.
Now it could entirely be that the costs of this will be swamped by
everything else, but I'd not want to rely on it.
I think we need tests with concurrent large-file copies. And then also
look at the profile to see whether the relevant places become new
hotspots (not that we introduce something that's just hidden for now).
We might be able to do a bit better, efficiency wise, by storing
XLogSegSize as a "shift factor". I.e. the 16M setting would be 24
(i.e. XLogSegSize would be defined as 1 << 24).
Greetings,
Andres Freund
On Tue, Dec 20, 2016 at 3:28 AM, Andres Freund <andres@anarazel.de> wrote:
> I do think this has the potential for negative performance
> implications. Consider code like
> /* skip over the page header */
> if (CurrPos % XLogSegSize == 0)
> {
> CurrPos += SizeOfXLogLongPHD;
> currpos += SizeOfXLogLongPHD;
> }
> else
> right now that's doable in an efficient manner, because XLogSegSize is
> constant. If it's a variable and the compiler doesn't even know it's a
> power-of-two, it'll have to do a full "div" - and that's quite easily
> noticeable in a lot of cases.
>
> Now it could entirely be that the costs of this will be swamped by
> everything else, but I'd not want to rely on it.

We could use the GUC assign hook to compute a mask and a shift, so that this could be written as (CurrPos & mask_variable) == 0. That would avoid the division instruction, though not the memory access.

I hope this is all in the noise, though. I know this code is hot, but I think it'll be hard to construct a test case where the bottleneck is anything other than the speed at which the disk can absorb bytes. I suppose we could set fsync=off and put the whole cluster on a RAMDISK to avoid those bottlenecks, but of course no real storage system behaves like that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-12-20 08:10:29 -0500, Robert Haas wrote:
> We could use the GUC assign hook to compute a mask and a shift, so
> that this could be written as (CurrPos & mask_variable) == 0. That
> would avoid the division instruction, though not the memory access.

I suspect that'd be fine.

> I hope this is all in the noise, though.

Could very well be.

> I know this code is hot but I think it'll be hard to construct a
> test case where the bottleneck is anything other than the speed at
> which the disk can absorb bytes.

I don't think that's really true. Heikki's WAL changes made a *BIG* difference. And pretty small changes in xlog.c can make noticeable throughput differences both in single and multi-threaded workloads. E.g. witnessed by the fact that the crc computation used to be a major bottleneck (and the crc32c instruction still shows up noticeably in profiles). SSDs have become fast enough that it's increasingly hard to saturate them.

Andres
On 12/20/2016 02:19 PM, Andres Freund wrote:
> On 2016-12-20 08:10:29 -0500, Robert Haas wrote:
>> We could use the GUC assign hook to compute a mask and a shift, so
>> that this could be written as (CurrPos & mask_variable) == 0. That
>> would avoid the division instruction, though not the memory access.
>
> I suspect that'd be fine.
>
>> I hope this is all in the noise, though.
>
> Could very well be.
>
>> I know this code is hot but I think it'll be hard to construct a
>> test case where the bottleneck is anything other than the speed at
>> which the disk can absorb bytes.
>
> I don't think that's really true. Heikki's WAL changes made a *BIG*
> difference. And pretty small changes in xlog.c can make noticeable
> throughput differences both in single and multi-threaded
> workloads. E.g. witnessed by the fact that the crc computation used to
> be a major bottleneck (and the crc32c instruction still shows up
> noticeably in profiles). SSDs have become fast enough that it's
> increasingly hard to saturate them.

It's not just SSDs. RAID controllers with write cache (which is typically just DDR3 memory anyway) have about the same effect even with spinning rust. So yes, this might make a measurable difference.

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

General comments:

There was some discussion about the impact of this on small installs, particularly around min_wal_size. The concern was that changing the default segment size to 64MB would significantly increase min_wal_size in terms of bytes. The default value for min_wal_size is 5 segments, so 16MB->64MB would mean going from 80MB to 320MB. IMHO if you're worried about that then just initdb with a smaller segment size. There's probably a number of other changes a small environment wants to make besides that. Perhaps it'd be worth making DEFAULT_XLOG_SEG_SIZE a configure option to better support that.

It's not clear from the thread that there is consensus that this feature is desired. In particular, the performance aspects of changing segment size from a C constant to a variable are in question. Someone with access to large hardware should test that. Andres[1] and Robert[2] did suggest that the option could be changed to a bitshift, which IMHO would also solve some sanity-checking issues.

+ * initdb passes the WAL segment size in an environment variable. We don't
+ * bother doing any sanity checking, we already check in initdb that the
+ * user gives a sane value.

That doesn't seem like a good idea to me. If anything, the backend should sanity-check and initdb just rely on that. Perhaps this is how other initdb options work, but it still seems bogus. In particular, verifying the size is a power of 2 seems important, as failing that would probably be ReallyBad(tm).

The patch also blindly trusts the value read from the control file; I'm not sure if that's standard procedure or not, but ISTM it'd be worth sanity-checking that as well.

The patch leaves the base GUC units for min_wal_size and max_wal_size as the # of segments. I'm not sure if that's a great idea.

+ * convert_unit
+ *
+ * This takes the value in kbytes and then returns value in user-readable format

This function needs a more specific name, such as pretty_print_kb().

+ /* Check if wal_segment_size is in the power of 2 */
+ for (i = 0;; i++, pow2 = pow(2, i))
+ if (pow2 >= wal_segment_size)
+ break;
+
+ if (wal_segment_size != 1 && pow2 > wal_segment_size)
+ {
+ fprintf(stderr, _("%s: WAL segment size must be in the power of 2\n"), progname);
+ exit(1);
+ }

IMHO it'd be better to use the n & (n-1) check detailed at [3]. Actually, there's got to be other places that need to check this, so it'd be nice to just create a function that verifies a number is a power of 2.

+ if (log_fname != NULL)
+ XLogFromFileName(log_fname, &minXlogTli, &minXlogSegNo);
+

Please add a comment about why XLogFromFileName has to be delayed.

/*
+ * DEFAULT_XLOG_SEG_SIZE is the size of a single WAL file. This must be a power
+ * of 2 and larger than XLOG_BLCKSZ (preferably, a great deal larger than
+ * XLOG_BLCKSZ).
+ *
+ * Changing DEFAULT_XLOG_SEG_SIZE requires an initdb.
+ */
+#define DEFAULT_XLOG_SEG_SIZE (16*1024*1024)

That comment isn't really accurate. It would be more useful to explain that DEFAULT_XLOG_SEG_SIZE is the default size of a WAL segment used by initdb if a different value isn't specified.

1: https://www.postgresql.org/message-id/20161220082847.7t3t6utvxb6m5tfe%40alap3.anarazel.de
2: https://www.postgresql.org/message-id/CA%2BTgmoZTgnL25o68uPBTS6BD37ojD-1y-N88PkP57FzKqwcmmQ%40mail.gmail.com
3: http://stackoverflow.com/questions/108318/whats-the-simplest-way-to-test-whether-a-number-is-a-power-of-2-in-c
On Tue, Jan 3, 2017 at 6:23 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> + /* Check if wal_segment_size is in the power of 2 */
> + for (i = 0;; i++, pow2 = pow(2, i))
> + if (pow2 >= wal_segment_size)
> + break;
> +
> + if (wal_segment_size != 1 && pow2 > wal_segment_size)
> + {
> + fprintf(stderr, _("%s: WAL segment size must be in the power of 2\n"), progname);
> + exit(1);
> + }

I recall that pow(x, 2) and x * x usually result in the same assembly code, but pow() can never be more optimal than a simple multiplication. So I'd think that it is wiser to avoid it in this code path.

Documentation is missing for the new replication command SHOW_WAL_SEG. Actually, why not just have an equivalent of the SQL command and be able to query parameter values?

--
Michael
On 2 January 2017 at 21:23, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > It's not clear from the thread that there is consensus that this feature is desired. In particular, the performance aspects of changing segment size from a C constant to a variable are in question. Someone with access to large hardware should test that. Andres[1] and Robert[2] did suggest that the option could be changed to a bitshift, which IMHO would also solve some sanity-checking issues. Overall, Robert has made a good case. The only discussion now is about the knock-on effects it causes. One concern that has only barely been discussed is the effect of zero-ing new WAL files. That is a linear effect and will adversely affect performance as WAL segment size increases. (The already stated fsync problem is also a linear effect but that reduces with WAL segment size, hence the need for a trade-off and hence why variable-size is preferable). If we wish this feature to get committed ISTM that we should examine server performance with a large fixed WAL segment size, so we can measure the effects of this, particularly with regard to the poor user that gets to add a new WAL file. ISTM that may reveal more work is needed to be handed off to the WALWriter process (or other issues/solutions). Once we have that information we can consider whether to apply this patch, so until then, -1 to apply this, though I am hopeful that this can be applied in this release. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 3, 2017 at 6:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 2 January 2017 at 21:23, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > >> It's not clear from the thread that there is consensus that this feature is desired. In particular, the performance aspectsof changing segment size from a C constant to a variable are in question. Someone with access to large hardware shouldtest that. Andres[1] and Robert[2] did suggest that the option could be changed to a bitshift, which IMHO would alsosolve some sanity-checking issues. > > Overall, Robert has made a good case. The only discussion now is about > the knock-on effects it causes. > > One concern that has only barely been discussed is the effect of > zero-ing new WAL files. That is a linear effect and will adversely > effect performance as WAL segment size increases. > Sorry, but I am not able to understand why this is a problem? The bigger the size of WAL segment, lesser the number of files. So IIUC, then it can only impact if zero-ing two 16MB files is cheaper than zero-ing one 32MB file. Is that your theory or you have something else in mind? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 3 January 2017 at 13:45, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jan 3, 2017 at 6:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 2 January 2017 at 21:23, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >> >>> It's not clear from the thread that there is consensus that this feature is desired. In particular, the performance aspects of changing segment size from a C constant to a variable are in question. Someone with access to large hardware should test that. Andres[1] and Robert[2] did suggest that the option could be changed to a bitshift, which IMHO would also solve some sanity-checking issues. >> >> Overall, Robert has made a good case. The only discussion now is about >> the knock-on effects it causes. >> >> One concern that has only barely been discussed is the effect of >> zero-ing new WAL files. That is a linear effect and will adversely >> affect performance as WAL segment size increases. >> > > Sorry, but I am not able to understand why this is a problem? The > bigger the size of WAL segment, lesser the number of files. So IIUC, > then it can only impact if zero-ing two 16MB files is cheaper than > zero-ing one 32MB file. Is that your theory or you have something > else in mind? The issue I see is that at present no backend needs to do more than 16MB of zeroing at one time, so the impact on response time is reduced. If we start doing zeroing in larger chunks then the impact on response times will increase. So instead of regular blips we have one large blip, less often. I think the latter will be worse, but welcome measurements that show that performance is smooth and regular with large file sizes. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 3, 2017 at 8:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 3 January 2017 at 13:45, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Tue, Jan 3, 2017 at 6:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> On 2 January 2017 at 21:23, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >>> >>>> It's not clear from the thread that there is consensus that this feature is desired. In particular, the performance aspects of changing segment size from a C constant to a variable are in question. Someone with access to large hardware should test that. Andres[1] and Robert[2] did suggest that the option could be changed to a bitshift, which IMHO would also solve some sanity-checking issues. >>> >>> Overall, Robert has made a good case. The only discussion now is about >>> the knock-on effects it causes. >>> >>> One concern that has only barely been discussed is the effect of >>> zero-ing new WAL files. That is a linear effect and will adversely >>> affect performance as WAL segment size increases. >>> >> >> Sorry, but I am not able to understand why this is a problem? The >> bigger the size of WAL segment, lesser the number of files. So IIUC, >> then it can only impact if zero-ing two 16MB files is cheaper than >> zero-ing one 32MB file. Is that your theory or you have something >> else in mind? > > The issue I see is that at present no backend needs to do more than > 16MB of zeroing at one time, so the impact on response time is > reduced. If we start doing zeroing in larger chunks then the impact on > response times will increase. So instead of regular blips we have one > large blip, less often. I think the latter will be worse, but welcome > measurements that show that performance is smooth and regular with > large file sizes. Yeah. I don't think there's any way to get around the fact that there will be bigger latency spikes in some cases with larger WAL files. I think the question is whether they'll be common enough or serious enough to worry about.
For example, in a quick test on my laptop, zero-filling a 16 megabyte file using "dd if=/dev/zero of=x bs=8k count=2048" takes about 11 milliseconds, and zero-filling a 64 megabyte file with a count of 8192 increases the time to almost 50 milliseconds. That's something, but I wouldn't rate it as concerning. There are a lot of things that can cause latency changes multiple orders of magnitude larger than that, so worrying about that one in particular would seem to me to be fairly pointless. However, that's also a measurement on an unloaded system with an SSD, and the impact may be a lot more on a big system with lots of concurrent activity, and if the process that does the write also has to do an fsync, that will increase the cost considerably, too. But the flip side is that it's wrong to imagine that there's no harm in leaving the situation as it is. Even my MacBook Pro can crank out about 2.7 WAL segments/second on "pgbench -c 16 -j 16 -T 60". I think a decent server with a few more CPU cores than my laptop has could do 4-5 times that. So we shouldn't imagine that the costs of spewing out a bajillion segment files are being paid only at the very high end. Even somebody running PostgreSQL on a low-end virtual machine might find it difficult to write an archive_command that can keep up if the system is under continuous load. Of course, as Stephen pointed out, there are toolkits that can do it and you should probably be using one of those anyway for other reasons, but nevertheless spitting out almost 3 WAL segments per second even on a laptop gives a whole new meaning to the term "continuous archiving". Another point to consider is that a bigger WAL segment size can actually *improve* latency because every segment switch triggers an immediate fsync, and every backend in the system ends up waiting for it to finish. We should probably eventually try to push those flushes into the background, and the zeroing work as well. My impression (possibly incorrect?)
is that we expect to settle into a routine where zeroing new segments is relatively uncommon because we reuse old segment files, but the forced end-of-segment flushes never go away. So it's possible we might actually come out ahead on latency with this change, at least sometimes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3 January 2017 at 15:44, Robert Haas <robertmhaas@gmail.com> wrote: > Yeah. I don't think there's any way to get around the fact that there > will be bigger latency spikes in some cases with larger WAL files. One way would be for the WALwriter to zerofill new files ahead of time, thus avoiding the latency spike. > For example, in a quick test on my laptop, > zero-filling a 16 megabyte file using "dd if=/dev/zero of=x bs=8k > count=2048" takes about 11 milliseconds, and zero-filling a 64 > megabyte file with a count of 8192 increases the time to almost 50 > milliseconds. That's something, but I wouldn't rate it as concerning. I would rate that as concerning, especially if we allow much larger sizes. > But the flip side is that it's wrong to imagine that there's no harm > in leaving the situation as it is. The case for change has been made; the only discussion is what's in the new patch. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 3, 2017 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 3 January 2017 at 15:44, Robert Haas <robertmhaas@gmail.com> wrote: >> Yeah. I don't think there's any way to get around the fact that there >> will be bigger latency spikes in some cases with larger WAL files. > > One way would be for the WALwriter to zerofill new files ahead of > time, thus avoiding the latency spike. Sure, we could do that. I think it's an independent improvement, though: it is beneficial with or without this patch. >> For example, in a quick test on my laptop, >> zero-filling a 16 megabyte file using "dd if=/dev/zero of=x bs=8k >> count=2048" takes about 11 milliseconds, and zero-filling a 64 >> megabyte file with a count of 8192 increases the time to almost 50 >> milliseconds. That's something, but I wouldn't rate it as concerning. > > I would rate that as concerning, especially if we allow much larger sizes. I don't really understand the concern. If we allow large sizes but they are not the default, people can make a throughput-vs-latency trade-off when choosing a value for their installation. Those kinds of trade-offs are common and unavoidable. If we raise the default, then it's more of a concern, but I'm not sure those numbers are big enough to worry about. I'm not sure how to decide which numbers are big enough to worry about, either. I guess we need some test results showing what happens with this patch in the real world before we go further. I agree that there's a possible downside to raising the segment size, but my suspicion is that the results are going to be better, not worse, because of reducing the number of end-of-segment fsyncs. There's no point worrying too much about how we're going to mitigate the negative impact until we know for sure that there is one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3 January 2017 at 16:24, Robert Haas <robertmhaas@gmail.com> wrote: > On Jan 3, 2017 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 3 January 2017 at 15:44, Robert Haas <robertmhaas@gmail.com> wrote: >>> Yeah. I don't think there's any way to get around the fact that there >>> will be bigger latency spikes in some cases with larger WAL files. >> >> One way would be for the WALwriter to zerofill new files ahead of >> time, thus avoiding the latency spike. > > Sure, we could do that. I think it's an independent improvement, > though: it is beneficial with or without this patch. The latency spike problem is exacerbated by increasing file size, so I think if we are allowing people to increase file size in this release then we should fix the knock-on problem it causes in this release also. If we don't fix it as part of this patch I would consider it an open item. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 3, 2017 at 3:38 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 3 January 2017 at 16:24, Robert Haas <robertmhaas@gmail.com> wrote: >> On Jan 3, 2017 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> On 3 January 2017 at 15:44, Robert Haas <robertmhaas@gmail.com> wrote: >>>> Yeah. I don't think there's any way to get around the fact that there >>>> will be bigger latency spikes in some cases with larger WAL files. >>> >>> One way would be for the WALwriter to zerofill new files ahead of >>> time, thus avoiding the latency spike. >> >> Sure, we could do that. I think it's an independent improvement, >> though: it is beneficial with or without this patch. > > The latency spike problem is exacerbated by increasing file size, so I > think if we are allowing people to increase file size in this release > then we should fix the knock-on problem it causes in this release > also. If we don't fix it as part of this patch I would consider it an > open item. I think I'd like to see some benchmark results before forming an opinion on whether that's a must-fix issue. I'm not sure I believe that allowing a larger WAL segment size is going to make things worse more than it makes them better. I think that should be tested, not assumed true. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3 January 2017 at 21:33, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jan 3, 2017 at 3:38 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 3 January 2017 at 16:24, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Jan 3, 2017 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>>> On 3 January 2017 at 15:44, Robert Haas <robertmhaas@gmail.com> wrote: >>>>> Yeah. I don't think there's any way to get around the fact that there >>>>> will be bigger latency spikes in some cases with larger WAL files. >>>> >>>> One way would be for the WALwriter to zerofill new files ahead of >>>> time, thus avoiding the latency spike. >>> >>> Sure, we could do that. I think it's an independent improvement, >>> though: it is beneficial with or without this patch. >> >> The latency spike problem is exacerbated by increasing file size, so I >> think if we are allowing people to increase file size in this release >> then we should fix the knock-on problem it causes in this release >> also. If we don't fix it as part of this patch I would consider it an >> open item. > > I think I'd like to see some benchmark results before forming an > opinion on whether that's a must-fix issue. I'm not sure I believe > that allowing a larger WAL segment size is going to make things worse > more than it makes them better. I think that should be tested, not > assumed true. Strange response. Nothing has been assumed. I asked for tests and you provided measurements. I suggest we fix just the problem as the fastest way forwards. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 4, 2017 at 3:05 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Strange response. Nothing has been assumed. I asked for tests and you > provided measurements. Sure, of zero-filling a file with dd. But I also pointed out that in a real PostgreSQL cluster, the change could actually *reduce* latency. > I suggest we fix just the problem as the fastest way forwards. If you want to do the work, sure. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4 January 2017 at 13:57, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jan 4, 2017 at 3:05 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Strange response. Nothing has been assumed. I asked for tests and you >> provided measurements. > > Sure, of zero-filling a file with dd. But I also pointed out that in > a real PostgreSQL cluster, the change could actually *reduce* latency. I think we are talking at cross purposes. We agree that the main change is useful, but it causes another problem which I can't see how you can characterize as reduced latency, based upon your own measurements. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 4, 2017 at 9:47 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 4 January 2017 at 13:57, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Jan 4, 2017 at 3:05 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> Strange response. Nothing has been assumed. I asked for tests and you >>> provided measurements. >> >> Sure, of zero-filling a file with dd. But I also pointed out that in >> a real PostgreSQL cluster, the change could actually *reduce* latency. > > I think we are talking at cross purposes. We agree that the main > change is useful, but it causes another problem which I can't see how > you can characterize as reduced latency, based upon your own > measurements. Zero-filling files will take longer if the files are bigger. That will increase latency. But we will also have fewer forced end-of-segment syncs. That will reduce latency. Which effect is bigger? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 5, 2017 at 12:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jan 4, 2017 at 9:47 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 4 January 2017 at 13:57, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Wed, Jan 4, 2017 at 3:05 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>>> Strange response. Nothing has been assumed. I asked for tests and you >>>> provided measurements. >>> >>> Sure, of zero-filling a file with dd. But I also pointed out that in >>> a real PostgreSQL cluster, the change could actually *reduce* latency. >> >> I think we are talking at cross purposes. We agree that the main >> change is useful, but it causes another problem which I can't see how >> you can characterize as reduced latency, based upon your own >> measurements. > > Zero-filling files will take longer if the files are bigger. That > will increase latency. But we will also have fewer forced > end-of-segment syncs. That will reduce latency. Which effect is > bigger? It depends on whether the environment is CPU-bound or I/O-bound. If the CPU is at its limit, zero-filling takes time. If it's the I/O, fsync() would take longer to complete. -- Michael
On 4 January 2017 at 01:16, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Jan 3, 2017 at 6:23 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> + /* Check if wal_segment_size is in the power of 2 */
>> + for (i = 0;; i++, pow2 = pow(2, i))
>> + if (pow2 >= wal_segment_size)
>> + break;
>> +
>> + if (wal_segment_size != 1 && pow2 > wal_segment_size)
>> + {
>> + fprintf(stderr, _("%s: WAL segment size must be in the power of 2\n"), progname);
>> + exit(1);
>> + }
>
> I recall that pow(x, 2) and x * x usually result in the same assembly
> code, but pow() can never be more optimal than a simple
> multiplication. So I'd think that it is wiser to avoid it in this code
> path. Documentation is missing for the new replication command
> SHOW_WAL_SEG. Actually, why not just have an equivalent of the SQL
> command and be able to query parameter values?

This would probably be nicer written using a bitwise trick to ensure that no less significant bits are set. If it's a power of 2, then subtracting 1 should set all the less significant bits to 1, so binary ANDing to that should be 0, i.e. no common bits. Something like:

/* ensure segment size is a power of 2 */
if ((wal_segment_size & (wal_segment_size - 1)) != 0)
{
    fprintf(stderr, _("%s: WAL segment size must be a power of 2\n"), progname);
    exit(1);
}

There's a similar trick in bitmapset.c for RIGHTMOST_ONE, so it looks like we already have assumptions about two's complement arithmetic.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested
General comments:
There was some discussion about the impact of this on small installs, particularly around min_wal_size. The concern was that changing the default segment size to 64MB would significantly increase min_wal_size in terms of bytes. The default value for min_wal_size is 5 segments, so 16MB->64MB would mean going from 80MB to 320MB. IMHO if you're worried about that then just initdb with a smaller segment size. There's probably a number of other changes a small environment wants to make besides that. Perhaps it'd be worth making DEFAULT_XLOG_SEG_SIZE a configure option to better support that.
It's not clear from the thread that there is consensus that this feature is desired. In particular, the performance aspects of changing segment size from a C constant to a variable are in question. Someone with access to large hardware should test that. Andres[1] and Robert[2] did suggest that the option could be changed to a bitshift, which IMHO would also solve some sanity-checking issues.
+ * initdb passes the WAL segment size in an environment variable. We don't
+ * bother doing any sanity checking, we already check in initdb that the
+ * user gives a sane value.
That doesn't seem like a good idea to me. If anything, the backend should sanity-check and initdb just rely on that. Perhaps this is how other initdb options work, but it still seems bogus. In particular, verifying the size is a power of 2 seems important, as failing that would probably be ReallyBad(tm).
The patch also blindly trusts the value read from the control file; I'm not sure if that's standard procedure or not, but ISTM it'd be worth sanity-checking that as well.
The patch leaves the base GUC units for min_wal_size and max_wal_size as the # of segments. I'm not sure if that's a great idea.
+ * convert_unit
+ *
+ * This takes the value in kbytes and then returns value in user-readable format
This function needs a more specific name, such as pretty_print_kb().
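As an illustration of what that renamed helper might do, here is a minimal sketch (the name pretty_print_kb and its exact behavior are assumptions about the patch's intent, not its actual code): render a kilobyte value using the largest unit that divides it evenly.

```c
#include <stdio.h>

/*
 * Hypothetical pretty_print_kb(): format a size given in kilobytes using
 * the largest unit that divides it evenly, in the spirit of how GUC
 * values are shown in user-readable form.
 */
static void
pretty_print_kb(long kb, char *buf, size_t bufsz)
{
    if (kb % (1024L * 1024) == 0)
        snprintf(buf, bufsz, "%ldGB", kb / (1024L * 1024));
    else if (kb % 1024 == 0)
        snprintf(buf, bufsz, "%ldMB", kb / 1024);
    else
        snprintf(buf, bufsz, "%ldkB", kb);
}
```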
+ /* Check if wal_segment_size is in the power of 2 */
+ for (i = 0;; i++, pow2 = pow(2, i))
+ if (pow2 >= wal_segment_size)
+ break;
+
+ if (wal_segment_size != 1 && pow2 > wal_segment_size)
+ {
+ fprintf(stderr, _("%s: WAL segment size must be in the power of 2\n"), progname);
+ exit(1);
+ }
IMHO it'd be better to use the n & (n-1) check detailed at [3].
Actually, there's got to be other places that need to check this, so it'd be nice to just create a function that verifies a number is a power of 2.
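Such a shared helper might look like this minimal sketch (the function name is hypothetical; nothing like it exists in the tree yet):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if n is a power of 2.  A power of 2 has exactly one bit
 * set, so clearing its rightmost one bit via n & (n - 1) must leave
 * zero; the n > 0 test excludes zero, which would otherwise pass the
 * mask check.
 */
static inline bool
pg_is_power_of_2(uint64_t n)
{
    return n > 0 && (n & (n - 1)) == 0;
}
```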
+ if (log_fname != NULL)
+ XLogFromFileName(log_fname, &minXlogTli, &minXlogSegNo);
+
Please add a comment about why XLogFromFileName has to be delayed.
/*
+ * DEFAULT_XLOG_SEG_SIZE is the size of a single WAL file. This must be a power
+ * of 2 and larger than XLOG_BLCKSZ (preferably, a great deal larger than
+ * XLOG_BLCKSZ).
+ *
+ * Changing DEFAULT_XLOG_SEG_SIZE requires an initdb.
+ */
+#define DEFAULT_XLOG_SEG_SIZE (16*1024*1024)
That comment isn't really accurate. It would be more useful to explain that DEFAULT_XLOG_SEG_SIZE is the default size of a WAL segment used by initdb if a different value isn't specified.
On Tue, Jan 3, 2017 at 6:23 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> + /* Check if wal_segment_size is in the power of 2 */
> + for (i = 0;; i++, pow2 = pow(2, i))
> + if (pow2 >= wal_segment_size)
> + break;
> +
> + if (wal_segment_size != 1 && pow2 > wal_segment_size)
> + {
> + fprintf(stderr, _("%s: WAL segment size must be in the power of 2\n"), progname);
> + exit(1);
> + }
I recall that pow(x, 2) and x * x usually result in the same assembly
code, but pow() can never be more optimal than a simple
multiplication, so I'd think that it is wiser to avoid it in this code
path. Documentation is missing for the new replication command
SHOW_WAL_SEG. Actually, why not just have an equivalent of the SQL
command and be able to query parameter values?
On Thu, Jan 5, 2017 at 6:39 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> This patch only needed the wal_segment_size and hence I made this specific
> command.
> How often and why would we need other parameter values in the replication
> connection?
> Making it a more general command to fetch any parameter can be a separate
> topic. If it gets consensus, maybe it could be done and used here.

I think the idea of supporting SHOW here is better than adding a special-purpose command just for the WAL size.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 5, 2017 at 8:39 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> On Tue, Jan 3, 2017 at 5:46 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>> Actually, why not just having an equivalent of the SQL
>> command and be able to query parameter values?
>
> This patch only needed the wal_segment_size and hence I made this specific
> command.
> How often and why would we need other parameter values in the replication
> connection?
> Making it a more general command to fetch any parameter can be a separate
> topic. If it gets consensus, maybe it could be done and used here.

I concur that for this patch it may not be necessary. But let's not paint ourselves into a corner when designing things. Being able to query parameter values is something I think is actually useful for cases where custom GUCs loaded via the server's shared_preload_libraries do validation checks (one case is a logical decoder on the backend, with a streaming receiver as the client expecting the logical decoder to provide a minimum). This would allow a client to do such checks using only a replication stream. Another case I have in mind is utilities like pg_rewind: we have been discussing being able to avoid requiring a superuser when querying the target server. Having such a command would allow pg_rewind, for example, to do a 'SHOW full_page_writes' without the need for an extra connection.

--
Michael
On Fri, Jan 6, 2017 at 6:32 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> I see the point. I will change the SHOW_WAL_SEGSZ to a general SHOW command
> in the next version of the patch.

Could you split things? There could be one patch to introduce the SHOW command, and one on top of it for your patch to be able to change the WAL segment size with initdb.

--
Michael
On 1/4/17 10:03 PM, David Rowley wrote:
>> I recall that pow(x, 2) and x * x usually result in the same assembly
>> code, but pow() can never be more optimal than a simple
>> multiplication. So I'd think that it is wiser to avoid it in this code
>> path. Documentation is missing for the new replication command
>> SHOW_WAL_SEG. Actually, why not just have an equivalent of the SQL
>> command and be able to query parameter values?
>
> This would probably be nicer written using a bitwise trick to ensure
> that no less significant bits are set. If it's a power of 2, then
> subtracting 1 should leave all the less significant bits as 1, so
> binary ANDing to that should be 0, i.e. no common bits.
>
> Something like:
>
> /* ensure segment size is a power of 2 */
> if ((wal_segment_size & (wal_segment_size - 1)) != 0)
> {
>     fprintf(stderr, _("%s: WAL segment size must be in the power of
> 2\n"), progname);
>     exit(1);
> }
>
> There's a similar trick in bitmapset.c for RIGHTMOST_ONE, so it looks
> like we already have assumptions about two's complement arithmetic.

Well, now that there are 3 places that need to do almost the same thing, I think it'd be best to just centralize this somewhere. I realize that's not going to save any significant amount of code, but it would make it crystal clear what's going on (assuming the excellent comment above RIGHTMOST_ONE was kept).

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
On 1/5/17 5:38 AM, Beena Emerson wrote:
> On Tue, Jan 3, 2017 at 2:53 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> General comments:
>> There was some discussion about the impact of this on small
>> installs, particularly around min_wal_size. The concern was that ...
>
> The patch maintains the current XLOG_SEG_SIZE of 16MB as the default.
> Only the capability to change its value has been moved from configure to
> initdb.

Ah, I missed that. Thanks for clarifying.

>> It's not clear from the thread that there is consensus that this
>> feature is desired. In particular, the performance aspects of
>> changing segment size from a C constant to a variable are in
>> question. Someone with access to large hardware should test that.
>> Andres[1] and Robert[2] did suggest that the option could be changed
>> to a bitshift, which IMHO would also solve some sanity-checking issues.

Are you going to change to a bitshift in the next patch?

>> + * initdb passes the WAL segment size in an environment variable. We don't
>> + * bother doing any sanity checking, we already check in initdb that the
>> + * user gives a sane value.
>>
>> That doesn't seem like a good idea to me. If anything, the backend
>> should sanity-check and initdb just rely on that. Perhaps this is
>> how other initdb options work, but it still seems bogus. In
>> particular, verifying the size is a power of 2 seems important, as
>> failing that would probably be ReallyBad(tm).
>>
>> The patch also blindly trusts the value read from the control file;
>> I'm not sure if that's standard procedure or not, but ISTM it'd be
>> worth sanity-checking that as well.
>
> There is a CRC check to detect errors in the file. I think all the
> ControlFile values are used directly and not re-verified.

Sounds good. I do still think the variable from initdb should be sanity-checked.

>> The patch leaves the base GUC units for min_wal_size and
>> max_wal_size as the # of segments. I'm not sure if that's a great idea.
>
> I think we can leave it as is. This is used in
> CalculateCheckpontSegments and in XLOGfileslop to calculate the segment
> numbers.

My concern here is that we just changed from segments to KB for all the checkpoint settings, and this is introducing segments back in, but ...

> I agree pretty_print_kb would have been a better name for this function.
> However, I have realised that using the show hook and this function is
> not suitable and have found a better way of handling the removal of
> GUC_UNIT_XSEGS which no longer needs this function: using
> GUC_UNIT_KB, convert the value in bytes to a wal segment count in
> the assign hook. The next version of the patch will use this.

... it sounds like you're going back to exposing KB to users, and that's all that really matters.

>> IMHO it'd be better to use the n & (n-1) check detailed at [3].

See my other email about that.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
On Sat, Jan 7, 2017 at 7:45 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> Well, now that there's 3 places that need to do almost the same thing, I
> think it'd be best to just centralize this somewhere. I realize that's not
> going to save any significant amount of code, but it would make it crystal
> clear what's going on (assuming the excellent comment above RIGHTMOST_ONE
> was kept).

Hmm. This sounds a lot like what fls() and my_log2() also do. I've been quietly advocating for fls() because we only provide an implementation in src/port if the operating system doesn't have it, and the operating system may have an implementation that optimizes to a single machine-language instruction (bsrl on x86, I think, see 4f658dc851a73fc309a61be2503c29ed78a1592e). But the fact that our src/port implementation uses a loop instead of the RIGHTMOST_ONE() trick seems non-optimal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10 January 2017 at 07:40, Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm. This sounds a lot like what fls() and my_log2() also do. I've
> been quietly advocating for fls() because we only provide an
> implementation in src/port if the operating system doesn't have it,
> and the operating system may have an implementation that optimizes to
> a single machine-language instruction (bsrl on x86, I think, see
> 4f658dc851a73fc309a61be2503c29ed78a1592e). But the fact that our
> src/port implementation uses a loop instead of the RIGHTMOST_ONE()
> trick seems non-optimal.

It does really sound like we need a bitutils.c, as mentioned in [1]. It would be good to make use of GCC's __builtin_popcount [2] instead of the number_of_ones[] array in bitmapset.c. It should be a bit faster and less cache-polluting.

[1] https://www.postgresql.org/message-id/14578.1462595165@sss.pgh.pa.us
[2] https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
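The two implementation styles being compared here — a portable loop like the src/port fallback versus a compiler builtin that can compile down to a single instruction — can be sketched as follows (function names are illustrative, not the actual src/port code):

```c
/*
 * Sketch of the two fls() implementation styles discussed: a portable
 * loop and a GCC/Clang builtin-backed version.  Both return the 1-based
 * index of the highest set bit, or 0 for an input of zero.
 */
static int
fls_loop(unsigned int x)
{
    int pos = 0;

    while (x != 0)
    {
        pos++;
        x >>= 1;
    }
    return pos;
}

#if defined(__GNUC__) || defined(__clang__)
static int
fls_builtin(unsigned int x)
{
    /* __builtin_clz is undefined for 0, so handle that case explicitly */
    return x == 0 ? 0 : 32 - __builtin_clz(x);
}
#endif
```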
On Sun, Jan 8, 2017 at 9:52 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> I agree pretty_print_kb would have been a better name for this function.
>> However, I have realised that using the show hook and this function is
>> not suitable and have found a better way of handling the removal of
>> GUC_UNIT_XSEGS which no longer needs this function: using
>> GUC_UNIT_KB, convert the value in bytes to a wal segment count in
>> the assign hook. The next version of the patch will use this.
>
> ... it sounds like you're going back to exposing KB to users, and that's all
> that really matters.
>
>> IMHO it'd be better to use the n & (n-1) check detailed at [3].

That would be better.

So I am looking at the proposed patch; though there have been reviews, the patch was in "Needs Review" state, and as far as I can see a couple of things remain for frontends. Just by grepping for XLOG_SEG_SIZE I have spotted the following problems:
- pg_standby uses it to know about the next segment available.
- pg_receivexlog still uses it in segment handling.

It may be a good idea to just remove XLOG_SEG_SIZE and fix the code paths that fail to compile without it, frontend utilities included, because a lot of them now rely on the value coded in xlog_internal.h, but with this patch the value is set up in the context of initdb. And this would induce major breakage in many backup tools, pg_rman coming first to mind... We could replace it with, for example, a macro that frontends could use to check if the size of the WAL segment is in a valid range when the tool does not have direct access to the Postgres instance (aka the size of the WAL segment used there), as there are offline tools as well.

-#define XLogSegSize ((uint32) XLOG_SEG_SIZE)
+
+extern uint32 XLogSegSize;
+#define XLOG_SEG_SIZE XLogSegSize

This bit is really bad for frontends declaring xlog_internal.h...

--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -62,7 +62,7 @@ static const char *progname;
 static int secs_per_test = 5;
 static int needs_unlink = 0;
-static char full_buf[XLOG_SEG_SIZE],
+static char full_buf[DEFAULT_XLOG_SEG_SIZE],

This would make sense as a new option of pg_test_fsync.

A performance study would be a good idea as well. Regarding the generic SHOW command in the replication protocol, I may do it for next CF; I have use cases for it in my pocket.

--
Michael
On Tue, Jan 17, 2017 at 4:06 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> I have already made patch for the generic SHOW replication command
> (attached) and am working on the new initdb patch based on that.
> I have not yet fixed the pg_standby issue. I am trying to address all the
> comments and bugs still.

Having documentation for this patch in protocol.sgml would be nice.

--
Michael
On Tue, Jan 17, 2017 at 12:38 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Jan 17, 2017 at 4:06 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>> I have already made patch for the generic SHOW replication command
>> (attached) and am working on the new initdb patch based on that.
>> I have not yet fixed the pg_standby issue. I am trying to address all the
>> comments and bugs still.
>
> Having documentation for this patch in protocol.sgml would be nice.

Yes. I will add that.
On Tue, Jan 17, 2017 at 7:19 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> PFA the patch with the documentation included.

It is usually better to keep doc lines under 72-80 characters if possible.

+ /* column 1: Wal segment size */
+ len = strlen(value);
+ pq_sendint(&buf, len, 4);
+ pq_sendbytes(&buf, value, len);

Bip. Error. This is a parameter value, not the WAL segment size.

Except for those minor points this works as expected, and is consistent with non-replication sessions, which is nice by itself:

=# create user toto replication login;
CREATE ROLE
$ psql "replication=1" -U toto
=> SHOW foo;
ERROR:  42704: unrecognized configuration parameter "foo"
LOCATION:  GetConfigOptionByName, guc.c:7968
Time: 0.245 ms
=> SHOW log_directory;
ERROR:  42501: must be superuser to examine "log_directory"
LOCATION:  GetConfigOptionByName, guc.c:7974

I think that we could get a committer to look at this, at the least.

--
Michael
On Tue, Jan 17, 2017 at 8:54 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> I think that we could get a committer look at that at the least.

This is sort of awkward, because it would be nice to reuse the code for the existing SHOW command rather than reinventing the wheel, but it's not very easy to do that, primarily because there are a number of places which rely on being able to do catalog access, which is not possible with a replication connection in hand. I got it working after hacking various things, so I have a complete list of the problems involved:

1. ShowGUCConfigOption() calls TupleDescInitEntry(), which does a catcache lookup to get the type's pg_type entry. This isn't any big problem; I hacked around it by adding a TupleDescInitBuiltinEntry() which knows about the types that guc.c (and likely other builtins) care about.

2. SendRowDescriptionMessage() calls getBaseTypeAndTypmod(), which does a catcache lookup to figure out whether the type is a domain. I short-circuited it by having it assume anything with an OID less than 10000 is not a domain.

3. printtup_prepare_info() calls getTypeOutputInfo(), which does a catcache lookup to figure out the type output function's OID and whether it's a varlena. I bypassed that with an unspeakable hack.

4. printtup.c's code in general assumes that a DR_printtup always has a portal. It doesn't seem to mind if the portal doesn't contain anything very meaningful, but it has to have one. This problem has nothing to do with catalog access, but it's a problem. I solved it by (surprise) creating a portal, but I am not sure that's a very good idea.

Problems 2-4 actually have to do with a DestReceiver of type DestRemote really, really wanting to have an associated Portal and database connection, so one approach is to create a stripped-down DestReceiver that doesn't care about those things and then pass that to GetPGVariable. That's not any less code than the way Beena coded it, of course; it's probably more. On the other hand, the stripped-down DestReceiver implementation is more likely to be usable the next time somebody wants to add a new replication command, whereas this ad-hoc code to directly construct protocol messages will not be reusable.

Opinions? (Hacked-up patch attached for educational purposes.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Jan 18, 2017 at 12:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Problems 2-4 actually have to do with a DestReceiver of type
> DestRemote really, really wanting to have an associated Portal and
> database connection, so one approach is to create a stripped-down
> DestReceiver that doesn't care about those things and then passing
> that to GetPGVariable.

I tried that and it worked out pretty well, so I'm inclined to go with this approach. Proposed patches attached. 0001 adds the new DestReceiver type, and 0002 is a revised patch to implement the SHOW command itself.

Thoughts, comments?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 20, 2017 at 12:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I tried that and it worked out pretty well, so I'm inclined to go with
> this approach. Proposed patches attached. 0001 adds the new
> DestReceiver type, and 0002 is a revised patch to implement the SHOW
> command itself.
>
> Thoughts, comments?

This looks like a sensible approach to me. DestRemoteSimple could be useful for background workers that are not connected to a database as well. Isn't there a problem with PGC_REAL parameters?

--
Michael
On Fri, Jan 20, 2017 at 2:34 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> This looks like a sensible approach to me. DestRemoteSimple could be
> useful for background workers that are not connected to a database as
> well. Isn't there a problem with PGC_REAL parameters?

No, because the output of SHOW is always of type text, regardless of the type of the GUC.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jan 21, 2017 at 4:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> No, because the output of SHOW is always of type text, regardless of
> the type of the GUC.

Thinking about that overnight, that looks pretty nice. What would be nicer though would be to add INT8OID and BYTEAOID to the list, and convert the other replication commands as well. In the end, I think that we should finish by being able to remove all pq_* routine dependencies in walsender.c, saving quite a few lines.

--
Michael
Hello,
Please find attached an updated WIP patch. I have incorporated almost all comments. This is to be applied over Robert's patches. I will post performance results later on.
1. shift (>>) and AND (&) operations: The assign hook of wal_segment_size sets WalModMask and WalShiftBit. All the modulo and division operations using XLogSegSize have been replaced with these. However, there are many preprocessor macros in xlog_internal.h that divide by XLogSegSize. I have not changed these because it would mean reassigning WalShiftBit along with XLogSegSize in all the modules that use those macros, which does not seem like a good idea. Also, this means the shift operator can be used in only a couple of places.
2. pg_standby: it deals with WAL files, so I have used the file size to set XLogSegSize (similar to pg_xlogdump). Also, the macro MaxSegmentsPerLogFile, which uses XLOG_SEG_SIZE, is now defined in SetWALFileNameForCleanup where it is used. Since XLOG_SEG_SIZE is no longer preset, the code that throws a message if the file size is greater than XLOG_SEG_SIZE had to be removed.
3. XLOGChooseNumBuffers: This function, called during the creation of the shared memory segment, requires XLogSegSize, which is set from the ControlFile. Hence the ControlFile is temporarily read in XLOGShmemSize before invoking XLOGChooseNumBuffers. The ControlFile is read again and stored in shared memory later on.
4. IsValidXLogSegSize: This is a macro to verify the XLogSegSize. This is used in initdb, pg_xlogdump, pg_standby.
5. Macro for power of 2: There were a couple of ideas for making it centralised. For now, I have just defined it in xlog_internal.h.
6. Since a CRC is used to verify the ControlFile before its contents are read as-is, I do not see the need to re-verify xlog_seg_size.
7. min/max_wal_size still take values in KB units and internally store them as a segment count, though the calculation has now shifted to their respective assign hooks because GUC_UNIT_XSEGS had to be removed.
8. Declaring XLogSegSize: There are 2 internal variables for the same parameter. In the original code, XLOG_SEG_SIZE is defined in the auto-generated file src/include/pg_config.h, and xlog_internal.h defines:
#define XLogSegSize ((uint32) XLOG_SEG_SIZE)
To avoid renaming all parts of code, I made the following change in xlog_internal.h
+ extern uint32 XLogSegSize;
+#define XLOG_SEG_SIZE XLogSegSize
Would it be better to just use the one variable XLogSegSize everywhere? A few external modules could be using XLOG_SEG_SIZE, though. Thoughts?
9. Documentation will be added in the next version of the patch.
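The file-size-based detection described in item 2 above can be sketched roughly as follows (the helper name is hypothetical; pg_standby and pg_xlogdump would do something in this spirit): take the on-disk size of an existing segment file and accept it only if it is a positive power of 2.

```c
#include <sys/stat.h>

/*
 * Hypothetical helper: infer the WAL segment size from the on-disk size
 * of an existing segment file, accepting it only if it is a positive
 * power of 2.  Returns -1 if the file cannot be examined or the size
 * looks invalid.
 */
static long
infer_wal_seg_size(const char *path)
{
    struct stat st;
    long        sz;

    if (stat(path, &st) != 0)
        return -1;
    sz = (long) st.st_size;
    if (sz > 0 && (sz & (sz - 1)) == 0)
        return sz;
    return -1;
}
```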
On Fri, Jan 20, 2017 at 7:00 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> No, because the output of SHOW is always of type text, regardless of
>> the type of the GUC.
>
> Thinking about that over night, that looks pretty nice. What would be
> nicer though would be to add INT8OID and BYTEAOID in the list, and
> convert as well the other replication commands. At the end, I think
> that we should finish by being able to remove all pq_* routine
> dependencies in walsender.c, saving quite a couple of lines.

Might be worth investigating, but I don't feel any obligation to do that right now. Thanks for the review; committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jan 25, 2017 at 6:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 20, 2017 at 7:00 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>> No, because the output of SHOW is always of type text, regardless of
>>> the type of the GUC.
>>
>> Thinking about that over night, that looks pretty nice. What would be
>> nicer though would be to add INT8OID and BYTEAOID in the list, and
>> convert as well the other replication commands. At the end, I think
>> that we should finish by being able to remove all pq_* routine
>> dependencies in walsender.c, saving quite a couple of lines.
>
> Might be worth investigating, but I don't feel any obligation to do
> that right now. Thanks for the review; committed.

OK, I have done this refactoring effort as attached because I think
that's really worth it. And here are the diff numbers:
3 files changed, 113 insertions(+), 162 deletions(-)
That's a bit less than what I thought first because of all the
singularities of bytea in its output and the way TIMELINE_HISTORY
takes advantage of the message level routines. Still for
IDENTIFY_SYSTEM, START_REPLICATION and CREATE_REPLICATION_SLOT the
gains in readability are here.

What do you think?
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Jan 24, 2017 at 10:26 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Jan 25, 2017 at 6:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Jan 20, 2017 at 7:00 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>>> No, because the output of SHOW is always of type text, regardless of
>>>> the type of the GUC.
>>>
>>> Thinking about that over night, that looks pretty nice. What would be
>>> nicer though would be to add INT8OID and BYTEAOID in the list, and
>>> convert as well the other replication commands. At the end, I think
>>> that we should finish by being able to remove all pq_* routine
>>> dependencies in walsender.c, saving quite a couple of lines.
>>
>> Might be worth investigating, but I don't feel any obligation to do
>> that right now. Thanks for the review; committed.
>
> OK, I have done this refactoring effort as attached because I think
> that's really worth it. And here are the diff numbers:
> 3 files changed, 113 insertions(+), 162 deletions(-)
> That's a bit less than what I thought first because of all the
> singularities of bytea in its output and the way TIMELINE_HISTORY
> takes advantage of the message level routines. Still for
> IDENTIFY_SYSTEM, START_REPLICATION and CREATE_REPLICATION_SLOT the
> gains in readability are here.

Seems OK to me, but I think I'd want to hear a few other opinions
before committing it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-01-26 13:16:13 -0500, Robert Haas wrote:
>> OK, I have done this refactoring effort as attached because I think
>> that's really worth it. And here are the diff numbers:
>> 3 files changed, 113 insertions(+), 162 deletions(-)
>> That's a bit less than what I thought first because of all the
>> singularities of bytea in its output and the way TIMELINE_HISTORY
>> takes advantage of the message level routines. Still for
>> IDENTIFY_SYSTEM, START_REPLICATION and CREATE_REPLICATION_SLOT the
>> gains in readability are here.
>
> Seems OK to me, but I think I'd want to hear a few other opinions
> before committing it.

Just to be absolutely sure: We're talking about Michael's cleanup patch,
not the thread's original topic?

Andres
On Thu, Jan 26, 2017 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-01-26 13:16:13 -0500, Robert Haas wrote:
>>> OK, I have done this refactoring effort as attached because I think
>>> that's really worth it. And here are the diff numbers:
>>> 3 files changed, 113 insertions(+), 162 deletions(-)
>>> That's a bit less than what I thought first because of all the
>>> singularities of bytea in its output and the way TIMELINE_HISTORY
>>> takes advantage of the message level routines. Still for
>>> IDENTIFY_SYSTEM, START_REPLICATION and CREATE_REPLICATION_SLOT the
>>> gains in readability are here.
>>
>> Seems OK to me, but I think I'd want to hear a few other opinions
>> before committing it.
>
> Just to be absolutely sure: We're talking about Michael's cleanup patch,
> not the thread's original topic?

Correct.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2017-01-25 12:26:21 +0900, Michael Paquier wrote:
> diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
> index 083c0303dc..2eb3a420ac 100644
> --- a/src/backend/access/common/tupdesc.c
> +++ b/src/backend/access/common/tupdesc.c
> @@ -629,6 +629,14 @@ TupleDescInitBuiltinEntry(TupleDesc desc,
>  			att->attstorage = 'p';
>  			att->attcollation = InvalidOid;
>  			break;
> +
> +		case INT8OID:
> +			att->attlen = 8;
> +			att->attbyval = true;
> +			att->attalign = 'd';
> +			att->attstorage = 'p';
> +			att->attcollation = InvalidOid;
> +			break;
> 	}
> }

INT8 isn't unconditionally byval, is it?

> 	/* slot_name */
> -	len = strlen(NameStr(MyReplicationSlot->data.name));
> -	pq_sendint(&buf, len, 4);	/* col1 len */
> -	pq_sendbytes(&buf, NameStr(MyReplicationSlot->data.name), len);
> +	values[0] = PointerGetDatum(cstring_to_text(NameStr(MyReplicationSlot->data.name)));

That seems a bit long.

I've not done like the most careful review ever, but I'm in favor of the
general change (provided the byval thing is fixed obviously).

Greetings,

Andres Freund
On Fri, Jan 27, 2017 at 4:20 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-01-25 12:26:21 +0900, Michael Paquier wrote:
>> diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
>> index 083c0303dc..2eb3a420ac 100644
>> --- a/src/backend/access/common/tupdesc.c
>> +++ b/src/backend/access/common/tupdesc.c
>> @@ -629,6 +629,14 @@ TupleDescInitBuiltinEntry(TupleDesc desc,
>>  			att->attstorage = 'p';
>>  			att->attcollation = InvalidOid;
>>  			break;
>> +
>> +		case INT8OID:
>> +			att->attlen = 8;
>> +			att->attbyval = true;
>> +			att->attalign = 'd';
>> +			att->attstorage = 'p';
>> +			att->attcollation = InvalidOid;
>> +			break;
>> 	}
>> }
>
> INT8 isn't unconditionally byval, is it?

Doh. Of course.

>> 	/* slot_name */
>> -	len = strlen(NameStr(MyReplicationSlot->data.name));
>> -	pq_sendint(&buf, len, 4);	/* col1 len */
>> -	pq_sendbytes(&buf, NameStr(MyReplicationSlot->data.name), len);
>> +	values[0] = PointerGetDatum(cstring_to_text(NameStr(MyReplicationSlot->data.name)));
>
> That seems a bit long.

Sure. What about that:
-	len = strlen(NameStr(MyReplicationSlot->data.name));
-	pq_sendint(&buf, len, 4);	/* col1 len */
-	pq_sendbytes(&buf, NameStr(MyReplicationSlot->data.name), len);
+	slot_name = NameStr(MyReplicationSlot->data.name);
+	values[0] = PointerGetDatum(cstring_to_text(slot_name));

> I've not done like the most careful review ever, but I'm in favor of the
> general change (provided the byval thing is fixed obviously).

Thanks for the review.
--
Michael
Hi,
On 2017-01-23 11:35:11 +0530, Beena Emerson wrote:
> Please find attached an updated WIP patch. I have incorporated almost all
> comments. This is to be applied over Robert's patches. I will post
> performance results later on.
>
> 1. shift (>>) and AND (&) operations: The assign hook of wal_segment_size
> sets the WalModMask and WalShiftBit. All the modulo and division operations
> using XLogSegSize has been replaced with these. However, there are many
> preprocessors which divide with XLogSegSize in xlog_internal.h. I have not
> changed these because it would mean I will have to reassign the WalShiftBit
> along with XLogSegSize in all the modules which use these macros. That does
> not seem to be a good idea. Also, this means shift operator can be used
> only in couple of places.
I think it'd be better not to have XLogSegSize anymore. Silently
changing a macros behaviour from being a compile time constant to
something runtime configurable is a bad idea.
> 8. Declaring XLogSegSize: There are 2 internal variables for the same
> parameter. In original code XLOG_SEG_SIZE is defined in the auto-generated
> file src/include/pg_config.h. And xlog_internal.h defines:
>
> #define XLogSegSize ((uint32) XLOG_SEG_SIZE)
>
> To avoid renaming all parts of code, I made the following change in
> xlog_internal.h
>
> + extern uint32 XLogSegSize;
>
> +#define XLOG_SEG_SIZE XLogSegSize
>
> would it be better to just use one variable XLogSegSize everywhere. But
> few external modules could be using XLOG_SEG_SIZE. Thoughts?
They'll quite possibly break with configurable size anyway. So I'd
rather have those broken explicitly.
> +/*
> + * These variables are set in assign_wal_segment_size
> + *
> + * WalModMask: It is an AND mask for XLogSegSize to allow for faster modulo
> + * operations using it.
> + *
> + * WalShiftBit: It is an shift bit for XLogSegSize to allow for faster
> + * division operations using it.
> + *
> + * UsableBytesInSegment: It is the number of bytes in a WAL segment usable for
> + * WAL data.
> + */
> +uint32 WalModMask;
> +static int UsableBytesInSegment;
> +static int WalShiftBit;
This could use some editorializing. "Faster modulo operations" isn't
explaining how/why it's actually being used. Same for WalShiftBit.
> /*
> * Private, possibly out-of-date copy of shared LogwrtResult.
> @@ -957,6 +975,7 @@ XLogInsertRecord(XLogRecData *rdata,
> if (!XLogInsertAllowed())
> elog(ERROR, "cannot make new WAL entries during recovery");
>
> +
> /*----------
> *
Spurious newline change.
> if (ptr % XLOG_BLCKSZ == SizeOfXLogShortPHD &&
> - ptr % XLOG_SEG_SIZE > XLOG_BLCKSZ)
> + (ptr & WalModMask) > XLOG_BLCKSZ)
> initializedUpto = ptr - SizeOfXLogShortPHD;
> else if (ptr % XLOG_BLCKSZ == SizeOfXLogLongPHD &&
> - ptr % XLOG_SEG_SIZE < XLOG_BLCKSZ)
> + (ptr & WalModMask) < XLOG_BLCKSZ)
> initializedUpto = ptr - SizeOfXLogLongPHD;
> else
> initializedUpto = ptr;
How about we introduce an XLogSegmentOffset(XLogRecPtr) function-like
macro in a first patch? That'll reduce the amount of change in the
commit actually changing things quite noticeably, and makes it easier to
adjust things later. I see very little benefit for in-place usage of
either % XLOG_SEG_SIZE or & WalModMask.
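For reference, the mask/shift equivalence the patch relies on can be sketched stand-alone as below. This is a hypothetical illustration with simplified names and a stand-in XLogRecPtr typedef, not the actual PostgreSQL definitions: for a power-of-two segment size, masking with size-1 reproduces modulo, and shifting by log2(size) reproduces division.

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified stand-in for illustration */

/* Assumed runtime-configurable segment size; must be a power of two. */
static uint64_t wal_segment_size = UINT64_C(64) * 1024 * 1024;  /* 64MB */

/* For power-of-two sizes, "ptr % size" equals "ptr & (size - 1)". */
#define XLogSegmentOffset(ptr)  ((ptr) & (wal_segment_size - 1))

/* ... and "ptr / size" equals "ptr >> log2(size)". The shift amount is
 * computed once the way the WIP patch's assign hook does it. */
static uint64_t
XLogSegmentNumber(XLogRecPtr ptr)
{
    unsigned shift = 0;
    uint64_t sz = wal_segment_size;

    while (sz > 1)
    {
        sz >>= 1;
        shift++;
    }
    return ptr >> shift;
}
```

Wrapping both behind function-like macros, as suggested above, keeps the call sites independent of whether the compiler sees a constant or a variable.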
> @@ -1794,6 +1813,7 @@ XLogBytePosToRecPtr(uint64 bytepos)
> uint32 seg_offset;
> XLogRecPtr result;
>
> +
> fullsegs = bytepos / UsableBytesInSegment;
> bytesleft = bytepos % UsableBytesInSegment;
spurious change.
> @@ -1878,7 +1898,7 @@ XLogRecPtrToBytePos(XLogRecPtr ptr)
>
> XLByteToSeg(ptr, fullsegs);
>
> - fullpages = (ptr % XLOG_SEG_SIZE) / XLOG_BLCKSZ;
> + fullpages = (ptr & WalModMask) / XLOG_BLCKSZ;
> offset = ptr % XLOG_BLCKSZ;
>
> if (fullpages == 0)
> @@ -2043,7 +2063,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
> /*
> * If first page of an XLOG segment file, make it a long header.
> */
> - if ((NewPage->xlp_pageaddr % XLogSegSize) == 0)
> + if ((NewPage->xlp_pageaddr & WalModMask) == 0)
> {
> XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
>
> @@ -2095,6 +2115,7 @@ CalculateCheckpointSegments(void)
> * number of segments consumed between checkpoints.
> *-------
> */
> +
> target = (double) max_wal_size / (2.0 + CheckPointCompletionTarget);
spurious change.
> void
> +assign_wal_segment_size(int newval, void *extra)
> +{
> + /*
> + * During system initialization, XLogSegSize is not set so we use
> + * DEFAULT_XLOG_SEG_SIZE instead.
> + */
> + int WalSegSize = (XLogSegSize == 0) ? DEFAULT_XLOG_SEG_SIZE : XLOG_SEG_SIZE;
> +
> + wal_segment_size = newval;
> + UsableBytesInSegment = (wal_segment_size * UsableBytesInPage) -
> + (SizeOfXLogLongPHD - SizeOfXLogShortPHD);
> + WalModMask = WalSegSize - 1;
> +
> + /* Set the WalShiftBit */
> + WalShiftBit = 0;
> + while (WalSegSize > 1)
> + {
> + WalSegSize = WalSegSize >> 1;
> + WalShiftBit++;
> + }
> +}
Hm. Are GUC hooks a good way to compute the masks? Interdependent GUCs
are unfortunately not working well, and several GUCs might end up
depending on these. I think it might be better to assign the variables
somewhere early in StartupXLOG() or such.
> +
> +void
> +assign_min_wal_size(int newval, void *extra)
> +{
> + /*
> + * During system initialization, XLogSegSize is not set so we use
> + * DEFAULT_XLOG_SEG_SIZE instead.
> + *
> + * min_wal_size is in kB and XLogSegSize is in bytes and so it is
> + * converted to kB for the calculation.
> + */
> + int WalSegSize = (XLogSegSize == 0) ? (DEFAULT_XLOG_SEG_SIZE / 1024) :
> + (XLOG_SEG_SIZE / 1024);
> +
> + min_wal_size = newval / WalSegSize;
> +}
> +
> +void
> assign_max_wal_size(int newval, void *extra)
> {
> - max_wal_size = newval;
> + /*
> + * During system initialization, XLogSegSize is not set so we use
> + * DEFAULT_XLOG_SEG_SIZE instead.
> + *
> + * max_wal_size is in kB and XLogSegSize is in bytes and so it is
> + * converted to bytes for the calculation.
> + */
> + int WalSegSize = (XLogSegSize == 0) ? (DEFAULT_XLOG_SEG_SIZE / 1024) :
> + (XLOG_SEG_SIZE / 1024);
> +
> + max_wal_size = newval / WalSegSize;
> CalculateCheckpointSegments();
> }
I don't think it's a good idea to have GUCs that are initially set to
the wrong value and such. How about just storing bytes, and converting
into segments upon use?
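A minimal sketch of that suggestion: keep the GUC value in bytes and convert to a segment count only at the point of use, so the assign hooks never depend on a possibly-uninitialized segment size. The ConvertToXSegs name follows a later version of the patch; the values and variable names here are illustrative assumptions, not the real GUC machinery.

```c
#include <stdint.h>

/* Settings stored in bytes rather than pre-converted segment counts. */
static uint64_t max_wal_size_bytes = UINT64_C(1024) * 1024 * 1024;  /* 1GB */
static uint64_t wal_segment_size   = UINT64_C(64) * 1024 * 1024;    /* 64MB */

/* Convert a byte quantity into a count of WAL segments on demand. */
#define ConvertToXSegs(bytes)  ((bytes) / wal_segment_size)
```

With this shape, changing wal_segment_size at startup automatically changes every derived segment count, with no ordering dependency between GUC assign hooks.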
> @@ -2135,8 +2205,8 @@ XLOGfileslop(XLogRecPtr PriorRedoPtr)
> * correspond to. Always recycle enough segments to meet the minimum, and
> * remove enough segments to stay below the maximum.
> */
> - minSegNo = PriorRedoPtr / XLOG_SEG_SIZE + min_wal_size - 1;
> - maxSegNo = PriorRedoPtr / XLOG_SEG_SIZE + max_wal_size - 1;
> + minSegNo = (PriorRedoPtr >> WalShiftBit) + min_wal_size - 1;
> + maxSegNo = (PriorRedoPtr >> WalShiftBit) + max_wal_size - 1;
I think a macro would be good here too (same prerequisite patch as above).
> @@ -4677,8 +4749,18 @@ XLOGShmemSize(void)
> */
> if (XLOGbuffers == -1)
> {
> - char buf[32];
> -
> + /*
> + * The calculation of XLOGbuffers, requires the now run-time parameter
> + * XLogSegSize from the ControlFile. The value determined here is
> + * required to create the shared memory segment. Hence, temporarily
> + * allocating space and reading ControlFile here.
> + */
I don't like comments containing things like "the now run-time
parameter" much - they are likely going to still be there in 10 years,
and will be hard to understand.
But anyway, how about we simply remove the "max one segment" boundary
instead? I don't think it's actually very meaningful - several people
posted benchmarks with more than one segment being beneficial.
> diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
> index 31290d3..87efc3c 100644
> --- a/src/bin/pg_basebackup/streamutil.c
> +++ b/src/bin/pg_basebackup/streamutil.c
> @@ -238,6 +238,59 @@ GetConnection(void)
> }
>
> /*
> + * Run the SHOW_WAL_SEGMENT_SIZE command to set the XLogSegSize
> + */
> +bool
> +SetXLogSegSize(PGconn *conn)
> +{
I think this is a confusing function name, because it sounds like
you're setting the SegSize remotely or such. I think making it
XLogRecPtr RetrieveXLogSegSize(conn); or such would lead to better code.
> diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
> index 963802e..4ceebdc 100644
> --- a/src/bin/pg_resetxlog/pg_resetxlog.c
> +++ b/src/bin/pg_resetxlog/pg_resetxlog.c
> @@ -57,6 +57,7 @@
> #include "storage/large_object.h"
> #include "pg_getopt.h"
>
> +uint32 XLogSegSize;
This seems like a bad idea - having the same local variable both in
frontend and backend programs seems like a recipe for disaster.
On Thu, Jan 26, 2017 at 8:53 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> I've not done like the most careful review ever, but I'm in favor of the
>> general change (provided the byval thing is fixed obviously).
>
> Thanks for the review.

Why not use pg_ltoa and pg_lltoa like the output functions for the
datatype do?

Might use CStringGetTextDatum(blah) instead of
PointerGetDatum(cstring_to_text(blah)).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jan 28, 2017 at 7:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 26, 2017 at 8:53 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>> I've not done like the most careful review ever, but I'm in favor of the
>>> general change (provided the byval thing is fixed obviously).
>>
>> Thanks for the review.
>
> Why not use pg_ltoa and pg_lltoa like the output functions for the datatype do?

No particular reason.

> Might use CStringGetTextDatum(blah) instead of
> PointerGetDatum(cstring_to_text(blah)).

Yes, thanks.
--
Michael
On Sat, Jan 28, 2017 at 8:04 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Jan 28, 2017 at 7:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Jan 26, 2017 at 8:53 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>>> I've not done like the most careful review ever, but I'm in favor of the
>>>> general change (provided the byval thing is fixed obviously).
>>>
>>> Thanks for the review.
>>
>> Why not use pg_ltoa and pg_lltoa like the output functions for the datatype do?
>
> No particular reason.
>
>> Might use CStringGetTextDatum(blah) instead of
>> PointerGetDatum(cstring_to_text(blah)).
>
> Yes, thanks.

I am going to create a new thread for this refactoring patch, as
that's different than the goal of this thread.

Now, regarding the main patch. Per the status of the last couple of
days, the patch has received a review but no new versions, so I am
marking it as returned with feedback for now. Feel free to update the
status of the patch to something else if you think that's more
adapted.
--
Michael
Hello Andres,

Thank you for your review.

On Fri, Jan 27, 2017 at 12:39 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-01-23 11:35:11 +0530, Beena Emerson wrote:
>> Please find attached an updated WIP patch. I have incorporated almost all
>> comments. This is to be applied over Robert's patches. I will post
>> performance results later on.
>>
>> 1. shift (>>) and AND (&) operations: The assign hook of wal_segment_size
>> sets the WalModMask and WalShiftBit. All the modulo and division operations
>> using XLogSegSize has been replaced with these. However, there are many
>> preprocessors which divide with XLogSegSize in xlog_internal.h. I have not
>> changed these because it would mean I will have to reassign the WalShiftBit
>> along with XLogSegSize in all the modules which use these macros. That does
>> not seem to be a good idea. Also, this means shift operator can be used
>> only in couple of places.
>
> I think it'd be better not to have XLogSegSize anymore. Silently
> changing a macros behaviour from being a compile time constant to
> something runtime configurable is a bad idea.

I don't think I understood you clearly. You mean convert the macros
using XLogSegSize to functions?

> Hm. Are GUC hooks a good way to compute the masks? Interdependent GUCs
> are unfortunately not working well, and several GUCs might end up
> depending on these. I think it might be better to assign the variables
> somewhere early in StartupXLOG() or such.

I am not sure about these interdependent GUCs. I need to study this
better and make changes as required.
The XLogSegSize adjustment in assign hooks has been removed, and a new
macro ConvertToXSegs is used to convert min_wal_size and max_wal_size
to segment counts when required. wal_segment_size set from
ReadControlFile also affects the CheckPointSegments value, and hence
assign_wal_segment_size calls CalculateCheckpointSegments.

The documentation has been updated.
Performance Tests:
I ran pgbench tests for different WAL segment sizes on a database of
scale factor 300 with shared_buffers of 8GB. Each test ran for 10 min
and the median of 3 readings was taken. The following table shows the
performance of the patch relative to HEAD for different client counts
and various wal-segsize values. We can say that there is no significant
performance difference.
 (clients) |    16 |   32 |    64 |   128
-----------+-------+------+-------+-------
 8MB       | -1.36 | 0.02 |  0.43 | -0.24
 16MB      | -0.38 | 0.18 | -0.09 |  0.4
 32MB      | -0.52 | 0.29 |  0.39 |  0.59
 64MB      | -0.15 | 0.04 |  0.52 |  0.38
Now that we've renamed "xlog" to "wal" in user-facing elements, I think
we should strive to use the name "wal" internally too in new code, not
"xlog" anymore. This patch introduces several variables, macros,
functions that ought to change names now -- XLogSegmentOffset should be
WALSegmentOffset for example. (I expect that as we touch code over
time, the use of "xlog" will decrease, though not fully disappear).

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 15, 2017 at 8:46 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Now that we've renamed "xlog" to "wal" in user-facing elements, I think
> we should strive to use the name "wal" internally too in new code, not
> "xlog" anymore. This patch introduces several variables, macros,
> functions that ought to change names now -- XLogSegmentOffset should be
> WALSegmentOffset for example. (I expect that as we touch code over
> time, the use of "xlog" will decrease, though not fully disappear).

Ugh. I think that's going to lead to a complete mess. We'll end up
with newer and older sections of the code being randomly inconsistent
with each other.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-02-15 22:46:38 -0300, Alvaro Herrera wrote:
> Now that we've renamed "xlog" to "wal" in user-facing elements, I think
> we should strive to use the name "wal" internally too in new code, not
> "xlog" anymore. This patch introduces several variables, macros,
> functions that ought to change names now -- XLogSegmentOffset should be
> WALSegmentOffset for example. (I expect that as we touch code over
> time, the use of "xlog" will decrease, though not fully disappear).

I think this will just decrease the consistency in xlog.c (note the
name) et al.
Andres Freund <andres@anarazel.de> writes:
> On 2017-02-15 22:46:38 -0300, Alvaro Herrera wrote:
>> Now that we've renamed "xlog" to "wal" in user-facing elements, I think
>> we should strive to use the name "wal" internally too in new code, not
>> "xlog" anymore. This patch introduces several variables, macros,
>> functions that ought to change names now -- XLogSegmentOffset should be
>> WALSegmentOffset for example. (I expect that as we touch code over
>> time, the use of "xlog" will decrease, though not fully disappear).

> I think this will just decrease the consistency in xlog.c (note the
> name) et al.

It's also going to make back-patching bug fixes in the area a real
nightmare. Let's not change the code more than necessary to implement
the desired user-facing behavior.

            regards, tom lane
On Mon, Feb 6, 2017 at 11:09 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>
> Hello,
> PFA the updated patches.

I've started reviewing the patches.
01-add-XLogSegmentOffset-macro.patch looks clean to me. I'll post my
detailed review after looking into the second patch. But, both the
patches need a rebase based on the commit 85c11324cabaddcfaf3347df7
(Rename user-facing tools with "xlog" in the name to say "wal").

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
Hello,

PFA the rebased patches.

On Thu, Feb 16, 2017 at 3:37 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> I've started reviewing the patches.
> 01-add-XLogSegmentOffset-macro.patch looks clean to me. I'll post my
> detailed review after looking into the second patch. But, both
> patches need a rebase based on the commit 85c11324cabaddcfaf3347df7
> (Rename user-facing tools with "xlog" in the name to say "wal").
On Fri, Feb 24, 2017 at 12:47 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> The recent commit c29aff959dc64f7321062e7f33d8c6ec23db53d has again changed
> the code and the second patch cannot be applied cleanly. Please find
> attached the rebased 02 patch. 01 patch is the same.

I've done an initial review of the patch. The objective of the patch is to make wal-segsize an initdb-time parameter instead of a compile-time parameter. The patch introduces the following three techniques to expose XLogSegSize to different modules:

1. Directly read XLogSegSize from the control file.

This is used by default, i.e., in StartupXLOG(), and looks good to me.

2. Run the SHOW wal_segment_size command to fetch and set XLogSegSize:

+ if (!RetrieveXLogSegSize(conn))
+     disconnect_and_exit(1);

You need the same logic in pg_receivewal.c as well.

3. Retrieve XLogSegSize by reading the file size of WAL files:

+ if (private.inpath != NULL)
+     sprintf(full_path, "%s/%s", private.inpath, fname);
+ else
+     strcpy(full_path, fname);
+
+ stat(full_path, &fst);
+
+ if (!IsValidXLogSegSize(fst.st_size))
+ {
+     fprintf(stderr,
+             _("%s: file size %d is invalid \n"),
+             progname, (int) fst.st_size);
+
+     return EXIT_FAILURE;
+ }
+
+ XLogSegSize = (int) fst.st_size;

I see a couple of issues with this approach:

* You should check the return value of stat() before going ahead, something like: if (stat(filename, &fst) < 0) error "file doesn't exist".

* You're considering any WAL file with a power-of-2 size as valid. Suppose the correct WAL segment size is 64MB. For some reason, the server generated a 16MB invalid WAL file (maybe it crashed while creating the WAL file). Your code seems to treat this as a valid file, which I think is incorrect. Do you agree with that?

Is it possible to unify these different techniques of reading XLogSegSize into a generalized function, with proper documentation describing the scope and limitations of each approach?
-- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
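The two fixes suggested in point 3 above — checking stat()'s return value before trusting the result, and range-limiting the power-of-2 test — could be sketched as follows. This is an illustrative sketch, not the patch's actual code; the 1 MB–1 GB bounds and the function name are assumptions.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Assumed valid range: a power of 2 between 1 MB and 1 GB (illustrative). */
#define MIN_SEG_SIZE (1024 * 1024)
#define MAX_SEG_SIZE (1024 * 1024 * 1024)
#define IsValidXLogSegSize(size) \
	((size) >= MIN_SEG_SIZE && (size) <= MAX_SEG_SIZE && \
	 (((size) & ((size) - 1)) == 0))

/* Infer the segment size from a WAL file's size; returns -1 on error. */
static long
infer_seg_size(const char *path)
{
	struct stat fst;

	if (stat(path, &fst) < 0)	/* check stat()'s return value first */
	{
		fprintf(stderr, "could not stat \"%s\"\n", path);
		return -1;
	}
	if (!IsValidXLogSegSize(fst.st_size))
	{
		fprintf(stderr, "file size %ld is invalid\n", (long) fst.st_size);
		return -1;
	}
	return (long) fst.st_size;
}
```

Even with both checks, a file that happens to have a valid power-of-2 size but the wrong one is still accepted, which is the residual risk raised above.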
On 2/24/17 6:30 AM, Kuntal Ghosh wrote: > * You're considering any WAL file with a power of 2 as valid. Suppose, > the correct WAL seg size is 64mb. For some reason, the server > generated a 16mb invalid WAL file(maybe it crashed while creating the > WAL file). Your code seems to treat this as a valid file which I think > is incorrect. Do you agree with that? Detecting correct WAL size based on the size of a random WAL file seems like a really bad idea to me. I also don't see the reason for #2... or is that how initdb writes out the correct control file? -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532)
On Tue, Feb 28, 2017 at 9:45 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > On 2/24/17 6:30 AM, Kuntal Ghosh wrote: >> >> * You're considering any WAL file with a power of 2 as valid. Suppose, >> the correct WAL seg size is 64mb. For some reason, the server >> generated a 16mb invalid WAL file(maybe it crashed while creating the >> WAL file). Your code seems to treat this as a valid file which I think >> is incorrect. Do you agree with that? > > > Detecting correct WAL size based on the size of a random WAL file seems like > a really bad idea to me. +1 -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Hi,

I took a look at this patch. Overall, the patch looks good to me. However, there are some review comments that I would like to share:

1. I think the macro 'PATH_MAX' used in the pg_waldump.c file is specific to Linux. It needs to be changed to some constant value, or maybe MAXPGPATH, in order to make it platform independent.

2. As already mentioned by Jim and Kuntal upthread, you are trying to detect the configured WAL segment size in the pg_waldump.c and pg_standby.c files based on the size of a random WAL file, which doesn't look like a good idea. But then I think the only option we have is to pass the location of the pg_control file to the pg_waldump module along with the start and end WAL segments.

3. When trying to compile '02-initdb-walsegsize-v2.patch' on Windows, I got this warning message:

Warning 1 warning C4005: 'DEFAULT_XLOG_SEG_SIZE' : macro redefinition c:\users\ashu\postgresql\src\include\pg_config_manual.h 20

Apart from these, I don't have any other comments as of now. I am still validating the patch on Windows. If I find any issues I will update the thread.

-- With Regards, Ashutosh Sharma. EnterpriseDB: http://www.enterprisedb.com
> > > The initdb module reads the size from the option provided and sets the > environment variable. This variable is read in > src/backend/access/transam/xlog.c and the ControlFile written. > Unlike pg_resetwal and pg_rewind, pg_basebackup cannot access the Control > file. It only accesses the wal log folder. So we get the XLogSegSize from > the SHOW command using replication connection. > As Kuntal pointed out, I might need to set it from pg_receivewal.c as well. > > Thank you, > > Beena Emerson > > EnterpriseDB: https://www.enterprisedb.com/ > The Enterprise PostgreSQL Company
On Mon, Feb 27, 2017 at 11:15 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> * You're considering any WAL file with a power of 2 as valid. Suppose,
>> the correct WAL seg size is 64mb. For some reason, the server
>> generated a 16mb invalid WAL file(maybe it crashed while creating the
>> WAL file). Your code seems to treat this as a valid file which I think
>> is incorrect. Do you agree with that?
>
> Detecting correct WAL size based on the size of a random WAL file seems like
> a really bad idea to me.

It's not the most elegant thing ever, but I'm not sure I really see a big problem with it. Today, if the WAL file were the wrong size, we'd just error out. With the patch, if the WAL file were the wrong size but happened to be a size that we consider legal, pg_waldump would treat it as a legal file and try to display the WAL records contained therein. This doesn't seem like a huge problem from here; what are you worried about?

I agree that it would be bad if, for example, pg_resetwal saw a broken WAL file in pg_wal and consequently did the reset incorrectly, because the whole point of pg_resetwal is to escape situations where the contents of pg_wal may be bogus. However, pg_resetwal can read the value from the control file, so the technique of believing the file size doesn't need to be used in that case anyway. The only tools that need to infer the WAL size from the sizes of the segments actually present are those that have neither a running cluster (where SHOW can be used) nor access to the control file. There aren't many of those, and pg_waldump, at least, is a debugging tool anyway. IIUC, the other case where this comes up is pg_standby, but if the WAL segments aren't all the same size, that tool is presumably going to croak with or without these changes, so I'm not really sure there's much of an issue here. I might be missing something.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
We (Prabhat and I) have started basic testing of this feature.

Thank you for your reviews Kuntal, Jim, Ashutosh. Attached is an updated 02 patch which:
- Call RetrieveXLogSegSize(conn) in pg_receivewal.c
- Remove the warning in Windows
- Change PATH_MAX in pg_waldump with MAXPGPATH
Regarding the usage of the WAL file size as the XLogSegSize, I agree with what Robert has said. Generally, the WAL file will be of the expected wal_segment_size, and for it to have any other size, especially a valid power-of-2 value, is extremely rare, so I feel it is not a major cause of concern.
Two quick issues:

1) At the time of initdb we set "--wal-segsize 4", so all the WAL files should be 4 MB each, but in the postgresql.conf file it is mentioned:

#wal_keep_segments = 0 # in logfile segments, 16MB each; 0 disables

so the comment (16MB) mentioned against the parameter 'wal_keep_segments' looks wrong; we should either remove it or modify it.

2) Getting an "Aborted (core dumped)" error at the time of running pg_basebackup (this issue occurs only on Linux32, not on Linux64);
we have double-checked to confirm it.
Steps to reproduce on Linux32
===================
fetch the sources
apply both the patches
./configure --with-zlib --enable-debug --enable-cassert --enable-depend --prefix=$PWD/edbpsql --with-openssl CFLAGS="-g -O0"; make all install
Performed initdb with switch "--wal-segsize 4"
start the server
run pg_basebackup
[centos@tushar-centos bin]$ ./pg_basebackup -v -D /tmp/myslave
*** glibc detected *** ./pg_basebackup: free(): invalid pointer: 0x08da7f00 ***
======= Backtrace: =========
/lib/libc.so.6[0xae7e31]
/home/centos/pg10_10mar/postgresql/edbpsql/lib/libpq.so.5(PQclear+0x16d)[0x6266f5]
./pg_basebackup[0x8051441]
./pg_basebackup[0x804e7b5]
/lib/libc.so.6(__libc_start_main+0xe6)[0xa8dd26]
./pg_basebackup[0x804a231]
======= Memory map: ========
00153000-0017b000 r-xp 00000000 fc:01 1271 /lib/libk5crypto.so.3.1
0017b000-0017c000 r--p 00028000 fc:01 1271 /lib/libk5crypto.so.3.1
0017c000-0017d000 rw-p 00029000 fc:01 1271 /lib/libk5crypto.so.3.1
0017d000-0017e000 rw-p 00000000 00:00 0
0017e000-00180000 r-xp 00000000 fc:01 1241 /lib/libkeyutils.so.1.3
00180000-00181000 r--p 00001000 fc:01 1241 /lib/libkeyutils.so.1.3
00181000-00182000 rw-p 00002000 fc:01 1241 /lib/libkeyutils.so.1.3
002ad000-002b9000 r-xp 00000000 fc:01 1152 /lib/libnss_files-2.12.so
002b9000-002ba000 r--p 0000b000 fc:01 1152 /lib/libnss_files-2.12.so
002ba000-002bb000 rw-p 0000c000 fc:01 1152 /lib/libnss_files-2.12.so
004ad000-004b0000 r-xp 00000000 fc:01 1267 /lib/libcom_err.so.2.1
004b0000-004b1000 r--p 00002000 fc:01 1267 /lib/libcom_err.so.2.1
004b1000-004b2000 rw-p 00003000 fc:01 1267 /lib/libcom_err.so.2.1
004ec000-005c3000 r-xp 00000000 fc:01 1199 /lib/libkrb5.so.3.3
005c3000-005c9000 r--p 000d6000 fc:01 1199 /lib/libkrb5.so.3.3
005c9000-005ca000 rw-p 000dc000 fc:01 1199 /lib/libkrb5.so.3.3
00617000-00642000 r-xp 00000000 fc:01 2099439 /home/centos/pg10_10mar/postgresql/edbpsql/lib/libpq.so.5.10
00642000-00644000 rw-p 0002a000 fc:01 2099439 /home/centos/pg10_10mar/postgresql/edbpsql/lib/libpq.so.5.10
00792000-0079c000 r-xp 00000000 fc:01 1255 /lib/libkrb5support.so.0.1
0079c000-0079d000 r--p 00009000 fc:01 1255 /lib/libkrb5support.so.0.1
0079d000-0079e000 rw-p 0000a000 fc:01 1255 /lib/libkrb5support.so.0.1
007fd000-0083b000 r-xp 00000000 fc:01 1280 /lib/libgssapi_krb5.so.2.2
0083b000-0083c000 r--p 0003e000 fc:01 1280 /lib/libgssapi_krb5.so.2.2
0083c000-0083d000 rw-p 0003f000 fc:01 1280 /lib/libgssapi_krb5.so.2.2
0083f000-009ed000 r-xp 00000000 fc:01 292057 /usr/lib/libcrypto.so.1.0.1e
009ed000-009fd000 r--p 001ae000 fc:01 292057 /usr/lib/libcrypto.so.1.0.1e
009fd000-00a04000 rw-p 001be000 fc:01 292057 /usr/lib/libcrypto.so.1.0.1e
00a04000-00a07000 rw-p 00000000 00:00 0
00a51000-00a6f000 r-xp 00000000 fc:01 14109 /lib/ld-2.12.so
00a6f000-00a70000 r--p 0001d000 fc:01 14109 /lib/ld-2.12.so
00a70000-00a71000 rw-p 0001e000 fc:01 14109 /lib/ld-2.12.so
00a77000-00c08000 r-xp 00000000 fc:01 14110 /lib/libc-2.12.so
00c08000-00c0a000 r--p 00191000 fc:01 14110 /lib/libc-2.12.so
00c0a000-00c0b000 rw-p 00193000 fc:01 14110 /lib/libc-2.12.so
00c0b000-00c0e000 rw-p 00000000 00:00 0
00c10000-00c22000 r-xp 00000000 fc:01 14355 /lib/libz.so.1.2.3
00c22000-00c23000 r--p 00011000 fc:01 14355 /lib/libz.so.1.2.3
00c23000-00c24000 rw-p 00012000 fc:01 14355 /lib/libz.so.1.2.3
00c52000-00c55000 r-xp 00000000 fc:01 14375 /lib/libdl-2.12.so
00c55000-00c56000 r--p 00002000 fc:01 14375 /lib/libdl-2.12.so
00c56000-00c57000 rw-p 00003000 fc:01 14375 /lib/libdl-2.12.so
00c59000-00c70000 r-xp 00000000 fc:01 14379 /lib/libpthread-2.12.so
00c70000-00c71000 r--p 00016000 fc:01 14379 /lib/libpthread-2.12.so
00c71000-00c72000 rw-p 00017000 fc:01 14379 /lib/libpthread-2.12.so
00c72000-00c74000 rw-p 00000000 00:00 0
00d0a000-00d0b000 r-xp 00000000 00:00 0 [vdso]
00d8f000-00dac000 r-xp 00000000 fc:01 14392 /lib/libselinux.so.1
00dac000-00dad000 r--p 0001d000 fc:01 14392 /lib/libselinux.so.1
00dad000-00dae000 rw-p 0001e000 fc:01 14392 /lib/libselinux.so.1
00db0000-00dc5000 r-xp 00000000 fc:01 1430 /lib/libresolv-2.12.so
00dc5000-00dc6000 ---p 00015000 fc:01 1430 /lib/libresolv-2.12.so
00dc6000-00dc7000 r--p 00015000 fc:01 1430 /lib/libresolv-2.12.so
00dc7000-00dc8000 rw-p 00016000 fc:01 1430 /lib/libresolv-2.12.so
00dc8000-00dca000 rw-p 00000000 00:00 0
00dcc000-00de9000 r-xp 00000000 fc:01 1312 /lib/libgcc_s-4.4.7-20120601.so.1
00de9000-00dea000 rw-p 0001d000 fc:01 1312 /lib/libgcc_s-4.4.7-20120601.so.1
05576000-055d8000 r-xp 00000000 fc:01 275065 /usr/lib/libssl.so.1.0.1e
055d8000-055db000 r--p 00061000 fc:01 275065 /usr/lib/libssl.so.1.0.1e
055db000-055df000 rw-p 00064000 fc:01 275065 /usr/lib/libssl.so.1.0.1e
08048000-0805f000 r-xp 00000000 fc:01 2099490 /home/centos/pg10_10mar/postgresql/edbpsql/bin/pg_basebackup
0805f000-08060000 rw-p 00016000 fc:01 2099490 /home/centos/pg10_10mar/postgresql/edbpsql/bin/pg_basebackup
08060000-08062000 rw-p 00000000 00:00 0
08d9f000-08dc0000 rw-p 00000000 00:00 0 [heap]
b7519000-b7719000 r--p 00000000 fc:01 269751 /usr/lib/locale/locale-archive
b7719000-b771e000 rw-p 00000000 00:00 0
b772a000-b772c000 rw-p 00000000 00:00 0
bfbf6000-bfc0b000 rw-p 00000000 00:00 0 [stack]
Aborted (core dumped)
[centos@tushar-centos bin]$
same scenario is working fine against HEAD (v10 ) on Linux32 [i.e no patch applied]
[centos@tushar-centos bin]$ ./pg_basebackup --verbose -D /tmp/slave11
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: transaction log start point: 0/2800024 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: transaction log end point: 0/28000E4
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed
[centos@tushar-centos bin]$
-- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
> Performed initdb with switch "--wal-segsize 4"

Does the crash occur with only size 4?
Crash occurs for the values of "--wal-segsize" 1, 2, 4, and 8, with stack details as below:
[bin]$ ./pg_basebackup -v -D /tmp/slave8
*** glibc detected *** ./pg_basebackup: free(): invalid pointer: 0x09fe9f00 ***
======= Backtrace: =========
/lib/libc.so.6[0xbafb81]
/home/edb/PGsrc13march/postgresql/edbpsql/lib/libpq.so.5(PQclear+0x16d)[0x5696f5]
./pg_basebackup[0x8051441]
./pg_basebackup[0x804e7b5]
/lib/libc.so.6(__libc_start_main+0xe6)[0xb55d36]
./pg_basebackup[0x804a231]
======= Memory map: ========
00165000-001c7000 r-xp 00000000 08:03 1333807 /usr/lib/libssl.so.1.0.1e
001c7000-001ca000 r--p 00061000 08:03 1333807 /usr/lib/libssl.so.1.0.1e
001ca000-001ce000 rw-p 00064000 08:03 1333807 /usr/lib/libssl.so.1.0.1e
001ce000-0020c000 r-xp 00000000 08:03 1717206 /lib/libgssapi_krb5.so.2.2
0020c000-0020d000 r--p 0003e000 08:03 1717206 /lib/libgssapi_krb5.so.2.2
0020d000-0020e000 rw-p 0003f000 08:03 1717206 /lib/libgssapi_krb5.so.2.2
0020e000-002e5000 r-xp 00000000 08:03 1717208 /lib/libkrb5.so.3.3
002e5000-002eb000 r--p 000d6000 08:03 1717208 /lib/libkrb5.so.3.3
002eb000-002ec000 rw-p 000dc000 08:03 1717208 /lib/libkrb5.so.3.3
002ec000-00309000 r-xp 00000000 08:03 1706348 /lib/libgcc_s-4.4.7-20120601.so.1
00309000-0030a000 rw-p 0001d000 08:03 1706348 /lib/libgcc_s-4.4.7-20120601.so.1
00362000-00510000 r-xp 00000000 08:03 1333806 /usr/lib/libcrypto.so.1.0.1e
00510000-00520000 r--p 001ae000 08:03 1333806 /usr/lib/libcrypto.so.1.0.1e
00520000-00527000 rw-p 001be000 08:03 1333806 /usr/lib/libcrypto.so.1.0.1e
00527000-0052a000 rw-p 00000000 00:00 0
0055a000-00585000 r-xp 00000000 08:03 419296 /home/edb/PGsrc13march/postgresql/edbpsql/lib/libpq.so.5.10
00585000-00587000 rw-p 0002a000 08:03 419296 /home/edb/PGsrc13march/postgresql/edbpsql/lib/libpq.so.5.10
0086b000-0086e000 r-xp 00000000 08:03 1717205 /lib/libcom_err.so.2.1
0086e000-0086f000 r--p 00002000 08:03 1717205 /lib/libcom_err.so.2.1
0086f000-00870000 rw-p 00003000 08:03 1717205 /lib/libcom_err.so.2.1
008e3000-00900000 r-xp 00000000 08:03 1725674 /lib/libselinux.so.1
00900000-00901000 r--p 0001d000 08:03 1725674 /lib/libselinux.so.1
00901000-00902000 rw-p 0001e000 08:03 1725674 /lib/libselinux.so.1
00a10000-00a11000 r-xp 00000000 00:00 0 [vdso]
00af9000-00b03000 r-xp 00000000 08:03 1717209 /lib/libkrb5support.so.0.1
00b03000-00b04000 r--p 00009000 08:03 1717209 /lib/libkrb5support.so.0.1
00b04000-00b05000 rw-p 0000a000 08:03 1717209 /lib/libkrb5support.so.0.1
00b19000-00b37000 r-xp 00000000 08:03 1704925 /lib/ld-2.12.so
00b37000-00b38000 r--p 0001d000 08:03 1704925 /lib/ld-2.12.so
00b38000-00b39000 rw-p 0001e000 08:03 1704925 /lib/ld-2.12.so
00b3f000-00ccf000 r-xp 00000000 08:03 1704931 /lib/libc-2.12.so
00ccf000-00cd0000 ---p 00190000 08:03 1704931 /lib/libc-2.12.so
00cd0000-00cd2000 r--p 00190000 08:03 1704931 /lib/libc-2.12.so
00cd2000-00cd3000 rw-p 00192000 08:03 1704931 /lib/libc-2.12.so
00cd3000-00cd6000 rw-p 00000000 00:00 0
00cd8000-00cef000 r-xp 00000000 08:03 1704933 /lib/libpthread-2.12.so
00cef000-00cf0000 r--p 00016000 08:03 1704933 /lib/libpthread-2.12.so
00cf0000-00cf1000 rw-p 00017000 08:03 1704933 /lib/libpthread-2.12.so
00cf1000-00cf3000 rw-p 00000000 00:00 0
00cf5000-00cf8000 r-xp 00000000 08:03 1704977 /lib/libdl-2.12.so
00cf8000-00cf9000 r--p 00002000 08:03 1704977 /lib/libdl-2.12.so
00cf9000-00cfa000 rw-p 00003000 08:03 1704977 /lib/libdl-2.12.so
00d33000-00d45000 r-xp 00000000 08:03 1704980 /lib/libz.so.1.2.3
00d45000-00d46000 r--p 00011000 08:03 1704980 /lib/libz.so.1.2.3
00d46000-00d47000 rw-p 00012000 08:03 1704980 /lib/libz.so.1.2.3
00de4000-00e0c000 r-xp 00000000 08:03 1710117 /lib/libk5crypto.so.3.1
00e0c000-00e0d000 r--p 00028000 08:03 1710117 /lib/libk5crypto.so.3.1
00e0d000-00e0e000 rw-p 00029000 08:03 1710117 /lib/libk5crypto.so.3.1
00e0e000-00e0f000 rw-p 00000000 00:00 0
00e18000-00e1a000 r-xp 00000000 08:03 1710112 /lib/libkeyutils.so.1.3
00e1a000-00e1b000 r--p 00001000 08:03 1710112 /lib/libkeyutils.so.1.3
00e1b000-00e1c000 rw-p 00002000 08:03 1710112 /lib/libkeyutils.so.1.3
00f04000-00f10000 r-xp 00000000 08:03 1704932 /lib/libnss_files-2.12.so
00f10000-00f11000 r--p 0000b000 08:03 1704932 /lib/libnss_files-2.12.so
00f11000-00f12000 rw-p 0000c000 08:03 1704932 /lib/libnss_files-2.12.so
07be1000-07bf6000 r-xp 00000000 08:03 1704916 /lib/libresolv-2.12.so
07bf6000-07bf7000 ---p 00015000 08:03 1704916 /lib/libresolv-2.12.so
07bf7000-07bf8000 r--p 00015000 08:03 1704916 /lib/libresolv-2.12.so
07bf8000-07bf9000 rw-p 00016000 08:03 1704916 /lib/libresolv-2.12.so
07bf9000-07bfb000 rw-p 00000000 00:00 0
08048000-0805f000 r-xp 00000000 08:03 539967 /home/edb/PGsrc13march/postgresql/edbpsql/bin/pg_basebackup
0805f000-08060000 rw-p 00016000 08:03 539967 /home/edb/PGsrc13march/postgresql/edbpsql/bin/pg_basebackup
08060000-08062000 rw-p 00000000 00:00 0
09fe1000-0a002000 rw-p 00000000 00:00 0 [heap]
b74ec000-b76ec000 r--p 00000000 08:03 1333666 /usr/lib/locale/locale-archive
b76ec000-b76f1000 rw-p 00000000 00:00 0
b7700000-b7702000 rw-p 00000000 00:00 0
bfcf0000-bfd05000 rw-p 00000000 00:00 0 [stack]
Aborted (core dumped)
For the values of "--wal-segsize" 16, 32, 64, ... (all multiples of 16) we get a "Segmentation fault" message as below:
[bin]$ ./pg_basebackup -v -D /tmp/slave16
Segmentation fault (core dumped)
And for all other values of "--wal-segsize" (3, 5, 7, 9, 10, 11, ... 15, 17, 18, ...) we get an "Invalid WAL segment size" message during initdb:
[bin]$ ./initdb -D data1 --wal-segsize=17
initdb: Invalid WAL segment size 17
[centos@tushar-centos bin]$

Just to confirm, was this done with the configure flag --with-wal-segsize=4?
We also tried configuring with the option "--with-wal-segsize=4" and got a warning:
./configure --with-zlib --enable-debug --enable-cassert --enable-depend --prefix=$PWD/inst --with-openssl CFLAGS="-g -O0" --with-wal-segsize=4
configure: WARNING: unrecognized options: --with-wal-segsize
> Hello,
> Attached is the updated patch. It fixes the issues and also updates a few
> code comments. Can you please check with the new patch?

Thanks, both issues have been fixed now.
-- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
On Tue, Mar 14, 2017 at 1:44 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> Attached is the updated patch. It fixes the issues and also updates few code
> comments.

I did an initial readthrough of this patch tonight just to get a feeling for what's going on. Based on that, here are a few review comments:

The changes to pg_standby seem to completely break the logic to wait until the file has attained the correct size. I don't know how to salvage that logic off-hand, but just breaking it isn't acceptable.

+ Note that changing this value requires an initdb.

Instead, maybe say something like "Note that this value is fixed for the lifetime of the database cluster."

-int max_wal_size = 64; /* 1 GB */
-int min_wal_size = 5; /* 80 MB */
+int wal_segment_size = 2048; /* 16 MB */
+int max_wal_size = 1024 * 1024; /* 1 GB */
+int min_wal_size = 80 * 1024; /* 80 MB */

If wal_segment_size is now measured in multiples of XLOG_BLCKSZ, then it's not the case that 2048 is always 16MB. If the other values are now measured in kB, perhaps rename the variables to add _kb, to avoid confusion with the way it used to work (and in general). The problem with leaving this as-is is that any existing references to max_wal_size in core or extension code will silently break; you want it to break in a noticeable way so that it gets fixed.

+ * UsableBytesInSegment: It is set in assign_wal_segment_size and stores the
+ * number of bytes in a WAL segment usable for WAL data.

The comment doesn't need to say where it gets set, and it doesn't need to repeat the variable name. Just say "The number of bytes in a..."

+assign_wal_segment_size(int newval, void *extra)

Why does a PGC_INTERNAL GUC need an assign hook? I think the GUC should only be there to expose the value; it shouldn't have calculation logic associated with it.

+ /*
+  * initdb passes the WAL segment size in an environment variable. We don't
+  * bother doing any sanity checking, we already check in initdb that the
+  * user gives a sane value.
+  */
+ XLogSegSize = pg_atoi(getenv("XLOG_SEG_SIZE"), sizeof(uint32), 0);

I think we should bother. I don't like the idea of the postmaster crashing in flames without so much as a reasonable error message if this parameter-passing mechanism goes wrong.

+ {"wal-segsize", required_argument, NULL, 'Z'},

When adding an option with no documented short form, generally one picks a number that isn't a character for the value at the end. See pg_regress.c or initdb.c for examples.

+ wal_segment_size = atoi(str_wal_segment_size);

So, you're comfortable interpreting --wal-segsize=1TB or --wal-segsize=1GB as 1? Implicitly, 1MB?

+ * ControlFile is not accessible here so use SHOW wal_segment_size command
+ * to set the XLogSegSize

Breaks compatibility with pre-9.6 servers.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
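A sketch of the kind of sanity check being requested here: fail with a clear message instead of crashing if the environment variable is missing or garbled. XLOG_SEG_SIZE is the variable name used by the patch; the rest (function name, power-of-2 rule) is illustrative.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse XLOG_SEG_SIZE from the environment; returns 0 on any problem
 * so the caller can report a proper error instead of crashing. */
static unsigned long
read_seg_size_env(void)
{
	const char *env = getenv("XLOG_SEG_SIZE");
	char *endptr;
	unsigned long val;

	if (env == NULL || *env == '\0')
	{
		fprintf(stderr, "XLOG_SEG_SIZE is not set\n");
		return 0;
	}
	val = strtoul(env, &endptr, 10);
	if (*endptr != '\0' || val == 0 || (val & (val - 1)) != 0)
	{
		fprintf(stderr, "invalid XLOG_SEG_SIZE: \"%s\"\n", env);
		return 0;
	}
	return val;
}
```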
On Fri, Mar 17, 2017 at 2:08 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> The option was intended to only accept values in MB as the original config
> --with-wal-segsize option, unfortunately, the patch does not throw error as
> in the config option when the units are specified.

Yeah, you want to use strtol(), so that you can throw an error if
*endptr isn't '\0'.

> Error with config option --with-wal-segsize=1MB
> configure: error: Invalid WAL segment size. Allowed values are
> 1,2,4,8,16,32,64.
>
> Should we imitate this behaviour and just add a check to see if it only
> contains numbers? or would it be better to allow the use of the units and
> make appropriate code changes?

I think just restricting it to numeric values would be fine. If
somebody wants to do the work to make it accept a unit suffix, I don't
have a problem with that, but it doesn't seem like a must-have.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
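A minimal sketch of the strtol() approach described above. The function name parse_segsize_mb is hypothetical (not from the patch); the point is only that checking *endptr lets you reject "1MB" or "1GB" instead of silently reading them as 1:

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Illustrative only: parse a --wal-segsize argument strictly, rejecting
 * trailing junk such as "1MB" or "1GB". Returns -1 on any parse failure.
 */
long
parse_segsize_mb(const char *arg)
{
	char	   *endptr;
	long		val;

	errno = 0;
	val = strtol(arg, &endptr, 10);
	if (errno != 0 || endptr == arg || *endptr != '\0')
		return -1;				/* not a plain decimal number */
	return val;
}
```

With atoi(), "1MB" parses as 1; with this check it is reported as an error, matching the behaviour of the existing configure option.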
On 3/16/17 21:10, Robert Haas wrote:
> The changes to pg_standby seem to completely break the logic to wait
> until the file has attained the correct size. I don't know how to
> salvage that logic off-hand, but just breaking it isn't acceptable.

I think we would have to extend restore_command with an additional
placeholder that communicates the segment size, and add a new
pg_standby option to accept that size somehow. And specifying the
size would have to be mandatory, for complete robustness. Urgh.

-- 
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3/17/17 16:20, Peter Eisentraut wrote:
> On 3/16/17 21:10, Robert Haas wrote:
>> The changes to pg_standby seem to completely break the logic to wait
>> until the file has attained the correct size. I don't know how to
>> salvage that logic off-hand, but just breaking it isn't acceptable.
>
> I think we would have to extend restore_command with an additional
> placeholder that communicates the segment size, and add a new pg_standby
> option to accept that size somehow. And specifying the size would have
> to be mandatory, for complete robustness. Urgh.

Another way would be to name the WAL files in a more self-describing
way. For example, instead of

000000010000000000000001
000000010000000000000002
000000010000000000000003

name them (for 16 MB)

000000010000000001
000000010000000002
000000010000000003

Then, pg_standby and similar tools can compute the expected file size
from the file name length:

16 ^ (24 - fnamelen)

However, that way you can't actually support 64 MB segments. The next
jump up would have to be 256 MB (unless you want to go to a base other
than 16).

-- 
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
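The 16 ^ (24 - fnamelen) arithmetic in this hypothetical scheme can be sketched as follows. Everything here is illustrative (the helper name and the scheme itself are from the proposal above, not from PostgreSQL): each trailing hex digit dropped from the fixed 24-character name multiplies the segment size by 16.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical-scheme sketch: derive the expected segment size from the
 * length of a WAL file name, i.e. 16 ^ (24 - fnamelen).
 */
uint64_t
segsize_from_fname_len(size_t fnamelen)
{
	uint64_t	size = 1;
	size_t		i;

	if (fnamelen > 24)
		return 0;				/* not a valid name under this scheme */
	for (i = 0; i < 24 - fnamelen; i++)
		size *= 16;
	return size;
}
```

For the 18-character names in the example above, this yields 16^6 = 16MB; a 24-character name yields 1 byte, which is why sizes that are not powers of 16 (such as 64 MB) cannot be represented.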
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 3/17/17 16:20, Peter Eisentraut wrote:
>> I think we would have to extend restore_command with an additional
>> placeholder that communicates the segment size, and add a new pg_standby
>> option to accept that size somehow. And specifying the size would have
>> to be mandatory, for complete robustness. Urgh.

> Another way would be to name the WAL files in a more self-describing
> way. For example, instead of

Actually, if you're content with having tools obtain this info by
examining the WAL files, we shouldn't need to muck with the WAL naming
convention (which seems like it would be a horrid mess, anyway --- too
much outside code knows that). Tools could get the segment size out of
XLogLongPageHeaderData.xlp_seg_size in the first page of the segment.

			regards, tom lane
On 3/17/17 16:56, Tom Lane wrote:
> Tools could get the segment size out of
> XLogLongPageHeaderData.xlp_seg_size in the first page of the segment.

OK, then pg_standby would have to wait until the file is at least
XLOG_BLCKSZ, then look inside and get the expected final size. A bit
more complicated than now, but seems doable.

-- 
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
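A rough sketch of that idea. The struct below is a simplified stand-in for XLogLongPageHeaderData (the real definition lives in access/xlog_internal.h and differs in detail), so treat the field list and the helper name as illustrative assumptions; the point is only the flow: once at least one page of the segment exists on disk, the long page header at the start of the file carries the segment size.

```c
#include <stdint.h>
#include <string.h>

/*
 * Simplified stand-in for XLogLongPageHeaderData -- illustrative only,
 * NOT the real on-disk layout.
 */
typedef struct SketchLongPageHeader
{
	uint16_t	xlp_magic;
	uint16_t	xlp_info;
	uint32_t	xlp_tli;
	uint64_t	xlp_pageaddr;
	uint32_t	xlp_rem_len;
	uint64_t	xlp_sysid;
	uint32_t	xlp_seg_size;
	uint32_t	xlp_xlog_blcksz;
} SketchLongPageHeader;

/*
 * Given a buffer holding the first page of a segment (caller has already
 * waited for the file to reach at least one block), extract the segment
 * size recorded in the header.
 */
uint32_t
segsize_from_first_page(const void *page)
{
	SketchLongPageHeader hdr;

	memcpy(&hdr, page, sizeof(hdr));	/* avoid alignment assumptions */
	return hdr.xlp_seg_size;
}
```

A real tool would additionally validate xlp_magic and sanity-check the returned size before trusting it.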
On Fri, Mar 17, 2017 at 6:11 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 3/17/17 16:56, Tom Lane wrote:
>> Tools could get the segment size out of
>> XLogLongPageHeaderData.xlp_seg_size in the first page of the segment.
>
> OK, then pg_standby would have to wait until the file is at least
> XLOG_BLCKSZ, then look inside and get the expected final size. A bit
> more complicated than now, but seems doable.

Yeah, that doesn't sound too bad.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3/17/17 4:56 PM, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> On 3/17/17 16:20, Peter Eisentraut wrote:
>>> I think we would have to extend restore_command with an additional
>>> placeholder that communicates the segment size, and add a new pg_standby
>>> option to accept that size somehow. And specifying the size would have
>>> to be mandatory, for complete robustness. Urgh.
>
>> Another way would be to name the WAL files in a more self-describing
>> way. For example, instead of
>
> Actually, if you're content with having tools obtain this info by
> examining the WAL files, we shouldn't need to muck with the WAL naming
> convention (which seems like it would be a horrid mess, anyway --- too
> much outside code knows that). Tools could get the segment size out of
> XLogLongPageHeaderData.xlp_seg_size in the first page of the segment.
>
> regards, tom lane

+1

-- 
-David
david@pgmasters.net
Hi Beena,

On 3/20/17 2:07 PM, Beena Emerson wrote:
> Added check for the version, the SHOW command will be run only in v10
> and above. Previous versions do not need this.

I've just had the chance to have a look at this patch. This is not a
complete review, just a test of something I've been curious about.

With 16MB WAL segments the filename neatly aligns with the LSN. For
example:

WAL FILE 0000000100000001000000FE = LSN 1/FE000000

This no longer holds true with this patch. I created a cluster with
1GB segments and the sequence looked like:

000000010000000000000001
000000010000000000000002
000000010000000000000003
000000010000000100000000

Whereas I had expected something like:

000000010000000000000040
000000010000000000000080
0000000100000000000000C0
000000010000000100000000

I scanned the thread but couldn't find any mention of this, so I'm
curious to know if it was considered. Was the prior correspondence
merely serendipitous?

I'm honestly not sure which way I think is better, but I know either
way it represents a pretty big behavioral change for any tools looking
at pg_wal or using the various helper functions. It's probably a good
thing to do at the same time as the rename; I just want to make sure
we are all aware of the changes.

-- 
-David
david@pgmasters.net
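The existing correspondence being discussed here can be sketched like this. The helper wal_file_name is an illustrative stand-in, mirroring the spirit of the XLogFileName macro rather than reproducing PostgreSQL's actual code: a segment number is the LSN divided by the segment size, split across the middle and low name fields.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative sketch of the current naming rule: name = timeline,
 * segno / (segments per 4GB "xlogid"), segno % (segments per xlogid).
 * With 16MB segments, LSN 1/FE000000 -> 0000000100000001000000FE.
 */
void
wal_file_name(char *buf, size_t buflen, uint32_t tli,
			  uint64_t lsn, uint64_t seg_size)
{
	uint64_t	segno = lsn / seg_size;
	uint64_t	segs_per_id = UINT64_C(0x100000000) / seg_size;

	snprintf(buf, buflen, "%08X%08X%08X",
			 (unsigned) tli,
			 (unsigned) (segno / segs_per_id),
			 (unsigned) (segno % segs_per_id));
}
```

With a 1GB segment size there are only four segments per xlogid, which is why the sequence observed above runs 01, 02, 03 and then rolls over to ...0100000000 rather than stepping by 0x40.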
On Mon, Mar 20, 2017 at 7:23 PM, David Steele <david@pgmasters.net> wrote:
> With 16MB WAL segments the filename neatly aligns with the LSN. For
> example:
>
> WAL FILE 0000000100000001000000FE = LSN 1/FE000000
>
> This no longer holds true with this patch.

It is already possible to change the WAL segment size using the
configure option --with-wal-segsize, and I think the patch should be
consistent with whatever that existing option does.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello,

PFA the updated patch.

On Fri, Mar 17, 2017 at 6:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 14, 2017 at 1:44 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> Attached is the updated patch. It fixes the issues and also updates few code
>> comments.
>
> I did an initial readthrough of this patch tonight just to get a
> feeling for what's going on. Based on that, here are a few review
> comments:
>
> The changes to pg_standby seem to completely break the logic to wait
> until the file has attained the correct size. I don't know how to
> salvage that logic off-hand, but just breaking it isn't acceptable.

Using the XLogLongPageHeader->xlp_seg_size, all the original checks
have been retained. This method is even used in pg_waldump.

> + Note that changing this value requires an initdb.
>
> Instead, maybe say something like "Note that this value is fixed for
> the lifetime of the database cluster."

Corrected.

> -int max_wal_size = 64; /* 1 GB */
> -int min_wal_size = 5; /* 80 MB */
> +int wal_segment_size = 2048; /* 16 MB */
> +int max_wal_size = 1024 * 1024; /* 1 GB */
> +int min_wal_size = 80 * 1024; /* 80 MB */
>
> If wal_segment_size is now measured in multiples of XLOG_BLCKSZ, then
> it's not the case that 2048 is always 16MB. If the other values are
> now measured in kB, perhaps rename the variables to add _kb, to avoid
> confusion with the way it used to work (and in general). The problem
> with leaving this as-is is that any existing references to
> max_wal_size in core or extension code will silently break; you want
> it to break in a noticeable way so that it gets fixed.

The wal_segment_size now is DEFAULT_XLOG_SEG_SIZE / XLOG_BLCKSZ;
min_wal_size and max_wal_size have the _kb postfix.

> + * UsableBytesInSegment: It is set in assign_wal_segment_size and stores the
> + * number of bytes in a WAL segment usable for WAL data.
>
> The comment doesn't need to say where it gets set, and it doesn't need
> to repeat the variable name. Just say "The number of bytes in a..."

Done.

> +assign_wal_segment_size(int newval, void *extra)
>
> Why does a PGC_INTERNAL GUC need an assign hook? I think the GUC
> should only be there to expose the value; it shouldn't have
> calculation logic associated with it.

Removed the function and called the functions in ReadControlFile.

> /*
> + * initdb passes the WAL segment size in an environment variable. We don't
> + * bother doing any sanity checking, we already check in initdb that the
> + * user gives a sane value.
> + */
> + XLogSegSize = pg_atoi(getenv("XLOG_SEG_SIZE"), sizeof(uint32), 0);
>
> I think we should bother. I don't like the idea of the postmaster
> crashing in flames without so much as a reasonable error message if
> this parameter-passing mechanism goes wrong.

I have rechecked the XLogSegSize.

> + {"wal-segsize", required_argument, NULL, 'Z'},
>
> When adding an option with no documented short form, generally one
> picks a number that isn't a character for the value at the end. See
> pg_regress.c or initdb.c for examples.

Done.

> + wal_segment_size = atoi(str_wal_segment_size);
>
> So, you're comfortable interpreting --wal-segsize=1TB or
> --wal-segsize=1GB as 1? Implicitly, 1MB?

Imitating the current behaviour of the configure option
--with-wal-segsize, I have used strtol to throw an error if the value
contains anything other than digits.

> + * ControlFile is not accessible here so use SHOW wal_segment_size command
> + * to set the XLogSegSize
>
> Breaks compatibility with pre-9.6 servers.

Added a check for the version; the SHOW command will be run only in
v10 and above. Previous versions do not need this.

Thank you,

Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Mar 20, 2017 at 7:23 PM, David Steele <david@pgmasters.net> wrote:
>> With 16MB WAL segments the filename neatly aligns with the LSN. For
>> example:
>>
>> WAL FILE 0000000100000001000000FE = LSN 1/FE000000
>>
>> This no longer holds true with this patch.
>
> It is already possible to change the WAL segment size using the
> configure option --with-wal-segsize, and I think the patch should be
> consistent with whatever that existing option does.

Considering how little usage that option has likely seen (I can't say
I've ever run into usage of it so far...), I'm not really sure that it
makes sense to treat it as final when we're talking about changing the
default here.

In short, I'm also concerned about this change to make WAL file names
no longer match up with LSNs and also about the odd stepping that you
get as a result of this change when it comes to WAL file names.

Thanks!

Stephen
On 3/21/17 9:04 AM, Stephen Frost wrote:
> Robert,
>
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> On Mon, Mar 20, 2017 at 7:23 PM, David Steele <david@pgmasters.net> wrote:
>>> With 16MB WAL segments the filename neatly aligns with the LSN. For
>>> example:
>>>
>>> WAL FILE 0000000100000001000000FE = LSN 1/FE000000
>>>
>>> This no longer holds true with this patch.
>>
>> It is already possible to change the WAL segment size using the
>> configure option --with-wal-segsize, and I think the patch should be
>> consistent with whatever that existing option does.
>
> Considering how little usage that option has likely seen (I can't say
> I've ever run into usage of it so far...), I'm not really sure that it
> makes sense to treat it as final when we're talking about changing the
> default here.

+1. A seldom-used compile-time option does not necessarily provide a
good model for a user-facing feature.

> In short, I'm also concerned about this change to make WAL file names no
> longer match up with LSNs and also about the odd stepping that you get
> as a result of this change when it comes to WAL file names.

I can't decide which way I like best. I like the filenames
corresponding to LSNs as they do now, but it seems like a straight
sequence might be easier to understand. Either way you need to know
that different segment sizes mean different numbers of segments per
lsn.xlogid. Even now the correspondence is a bit tenuous. I've always
thought:

00000001000000010000000F

Should be:

00000001000000010F000000

I'm really excited to (hopefully) have this feature in v10. I just
want to be sure we discuss this as it will be a big change for tool
authors and just about anybody who looks at WAL.

Thanks,

-- 
-David
david@pgmasters.net
On Tue, Mar 21, 2017 at 9:04 AM, Stephen Frost <sfrost@snowman.net> wrote:
> In short, I'm also concerned about this change to make WAL file names no
> longer match up with LSNs and also about the odd stepping that you get
> as a result of this change when it comes to WAL file names.

OK, that's a bit surprising to me, but what do you want to do about
it? If you take the approach that Beena did, then you lose the
correspondence with LSNs, which is admittedly not great but there are
already helper functions available to deal with LSN -> filename
mappings and I assume those will continue to work. If you take the
opposite approach, then WAL filenames stop being consecutive, which
seems to me to be far worse in terms of user and tool confusion.

Also note that, both currently and with the patch, you can also reduce
the WAL segment size. David's proposed naming scheme doesn't handle
that case, I think, and I think it would be all kinds of a bad idea to
use one file-naming approach for segments < 16MB and a separate
approach for segments > 16MB. That's not making anything easier for
users or tool authors.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3/21/17 3:22 PM, Robert Haas wrote:
> On Tue, Mar 21, 2017 at 9:04 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> In short, I'm also concerned about this change to make WAL file names no
>> longer match up with LSNs and also about the odd stepping that you get
>> as a result of this change when it comes to WAL file names.
>
> OK, that's a bit surprising to me, but what do you want to do about
> it? If you take the approach that Beena did, then you lose the
> correspondence with LSNs, which is admittedly not great but there are
> already helper functions available to deal with LSN -> filename
> mappings and I assume those will continue to work. If you take the
> opposite approach, then WAL filenames stop being consecutive, which
> seems to me to be far worse in terms of user and tool confusion.

They are already non-consecutive. Does 000000010000000200000000
really logically follow 0000000100000001000000FF? Yeah, sort of, if
you know the rules.

> Also
> note that, both currently and with the patch, you can also reduce the
> WAL segment size. David's proposed naming scheme doesn't handle that
> case, I think, and I think it would be all kinds of a bad idea to use
> one file-naming approach for segments < 16MB and a separate approach
> for segments > 16MB. That's not making anything easier for users or
> tool authors.

I believe it does handle that case, actually. The minimum WAL segment
size is 1MB, so they would increase like:

000000010000000100000000
000000010000000100100000
000000010000000100200000
...
0000000100000001FFF00000

You could always calculate the next WAL file by adding
(wal_seg_size_in_mb << 20) to the previous WAL file's LSN. This would
even work for WAL segments > 4GB. Overall, I think this would make
calculating WAL ranges simpler than it is now.

The biggest downside I can see is that this would change the naming
scheme for the default of 16MB compared to previous versions of
Postgres. However, for all other wal-seg-size values changes would
need to be made anyway.

-- 
-David
david@pgmasters.net
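The scheme proposed above can be sketched as follows. This is a hypothetical naming rule from the proposal, not current PostgreSQL behavior, and lsn_style_name is an illustrative helper: the name embeds the segment's starting LSN directly, so the next name is always the previous starting LSN plus the segment size.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical-scheme sketch: name = timeline followed by the 16 hex
 * digits of the segment's starting LSN. The successor of any file is
 * obtained by adding the segment size to its starting LSN.
 */
void
lsn_style_name(char *buf, size_t buflen, uint32_t tli, uint64_t start_lsn)
{
	snprintf(buf, buflen, "%08X%08X%08X",
			 (unsigned) tli,
			 (unsigned) (start_lsn >> 32),
			 (unsigned) (start_lsn & 0xFFFFFFFF));
}
```

Under this rule, the 1MB example above follows immediately: 000000010000000100000000, then start_lsn + (1 << 20) gives 000000010000000100100000, and so on.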
On 3/21/17 15:22, Robert Haas wrote:
> If you take the approach that Beena did, then you lose the
> correspondence with LSNs, which is admittedly not great but there are
> already helper functions available to deal with LSN -> filename
> mappings and I assume those will continue to work. If you take the
> opposite approach, then WAL filenames stop being consecutive, which
> seems to me to be far worse in terms of user and tool confusion.

Anecdotally, I think having the file numbers consecutive is very
important, for debugging and feel-good factor. If you want to raise
the segment size and preserve the LSN mapping, then pick 256 MB as
your next size.

I do think, however, that this has the potential of creating another
ongoing source of confusion similar to oid vs relfilenode, where the
numbers are often the same, except when they are not. With hindsight,
I would have made the relfilenodes completely different from the OIDs.
We chose to keep them (mostly) the same as the OIDs, for
compatibility. We are seemingly making a similar kind of decision
here.

-- 
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 21, 2017 at 6:02 PM, David Steele <david@pgmasters.net> wrote:
> The biggest downside I can see is that this would change the naming scheme
> for the default of 16MB compared to previous versions of Postgres. However,
> for all other wal-seg-size values changes would need to be made anyway.

I think changing the naming convention for 16MB WAL segments, which is
still going to be what 99% of people use, is an awfully large
compatibility break for an awfully marginal benefit. We've already
created quite a few incompatibilities in this release, and I'm not
entirely eager to just keep cranking them out at top speed. Where
it's necessary to achieve forward progress in some area, sure, but
this feels gratuitous to me.

I agree that we might have picked your scheme if we were starting from
scratch, but I have a hard time believing it's a good idea to do it
now just because of this patch. Changing the WAL segment size has
been supported for a long time, and I don't see the fact that it will
now potentially be initdb-configurable rather than
configure-configurable as a sufficient justification for whacking
around the naming scheme -- even though I don't love the naming scheme
we've got.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, Mar 21, 2017 at 6:02 PM, David Steele <david@pgmasters.net> wrote:
>> The biggest downside I can see is that this would change the naming scheme
>> for the default of 16MB compared to previous versions of Postgres. However,
>> for all other wal-seg-size values changes would need to be made anyway.
>
> I think changing the naming convention for 16MB WAL segments, which is
> still going to be what 99% of people use, is an awfully large
> compatibility break for an awfully marginal benefit.

It seems extremely unlikely to me that we're going to actually see
users deviate from whatever we set the default to, and so I'm not sure
that this is a real concern. We aren't changing what 9.6 and below's
naming scheme is, just what PG10+ do, and PG10+ are going to have a
different default WAL size. I realize the current patch still has the
16MB default even though a rather large portion of the early
discussion appeared in favor of changing it to 64MB. Once we've done
that, I don't think it makes one whit of difference what the naming
scheme looks like when you're using 16MB sizes, because essentially
zero people are going to actually use such a setting.

> We've already
> created quite a few incompatibilities in this release, and I'm not
> entirely eager to just keep cranking them out at top speed.

That position would seem to imply that you're in favor of keeping the
current default of 16MB, but that doesn't make sense given that you
started this discussion advocating to make it larger. Changing your
position is certainly fine, but it'd be good to be more clear if
that's what you meant here, or if you were just referring to the file
naming scheme but you do still want to increase the default size.

I'll admit that we might have a few more people using non-default
sizes once we make it an initdb option (though I'm tempted to suggest
that one might be able to count them using their digits ;), but it
seems very unlikely that they would do so to reduce it back down to
16MB, so I'm really not seeing the naming scheme change as a serious
backwards-incompatibility change.

Thanks!

Stephen
On Tue, Mar 21, 2017 at 8:10 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> We've already
>> created quite a few incompatibilities in this release, and I'm not
>> entirely eager to just keep cranking them out at top speed.
>
> That position would seem to imply that you're in favor of keeping the
> current default of 16MB, but that doesn't make sense given that you
> started this discussion advocating to make it larger. Changing your
> position is certainly fine, but it'd be good to be more clear if that's
> what you meant here or if you were just referring to the file naming
> scheme but you do still want to increase the default size.

To be honest, I'd sort of forgotten about the change which is the
nominal subject of this thread - I was more focused on the patch,
which makes it configurable. I was definitely initially in favor of
raising the value, but I got cold feet, a bit, when Alvaro pointed out
that going to 64MB would require a substantial increase in
min_wal_size. I'm not sure people with small installations will
appreciate seeing that value cranked up from 5 segments * 16MB = 80MB
to, say, 3 segments * 64MB = 192MB. That's an extra 100+ MB of space
that doesn't really do anything for you. And nobody's done any
benchmarking to see whether having only 3 segments is even a workable,
performant configuration, so maybe we'll end up with 5 * 64MB = 320MB
by default.

I'm a little worried that this whole question of changing the file
naming scheme is a diversion which will result in torpedoing any
chance of getting some kind of improvement here for v11. I don't
think the patch is all that far from being committable but it's not
going to get there if we start redesigning the world around it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 21, 2017 at 11:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm a little worried that this whole question of changing the file
> naming scheme is a diversion which will result in torpedoing any
> chance of getting some kind of improvement here for v11. I don't
> think the patch is all that far from being committable but it's not
> going to get there if we start redesigning the world around it.

Ha. A little Freudian slip there, since I obviously meant v10.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, Mar 21, 2017 at 8:10 PM, Stephen Frost <sfrost@snowman.net> wrote:
>>> We've already
>>> created quite a few incompatibilities in this release, and I'm not
>>> entirely eager to just keep cranking them out at top speed.
>>
>> That position would seem to imply that you're in favor of keeping the
>> current default of 16MB, but that doesn't make sense given that you
>> started this discussion advocating to make it larger. Changing your
>> position is certainly fine, but it'd be good to be more clear if that's
>> what you meant here or if you were just referring to the file naming
>> scheme but you do still want to increase the default size.
>
> To be honest, I'd sort of forgotten about the change which is the
> nominal subject of this thread - I was more focused on the patch,
> which makes it configurable. I was definitely initially in favor of
> raising the value, but I got cold feet, a bit, when Alvaro pointed out
> that going to 64MB would require a substantial increase in
> min_wal_size. I'm not sure people with small installations will
> appreciate seeing that value cranked up from 5 segments * 16MB = 80MB
> to, say, 3 segments * 64MB = 192MB. That's an extra 100+ MB of space
> that doesn't really do anything for you. And nobody's done any
> benchmarking to see whether having only 3 segments is even a workable,
> performant configuration, so maybe we'll end up with 5 * 64MB = 320MB
> by default.

The performance concern of having 3 segments is a red herring here if
we're talking about a default install, because the default for
max_wal_size is 1GB, not 192MB. I do think increasing the default WAL
size would be valuable to do even if it does mean a default install
will take up a bit more space.

I didn't see much discussion of it, but if this is really a concern
then couldn't we set the default to be 2 segments' worth instead of 3
also? That would mean an increase from 80MB to 128MB in the default
install if you never touch more than 128MB during a checkpoint.

> I'm a little worried that this whole question of changing the file
> naming scheme is a diversion which will result in torpedoing any
> chance of getting some kind of improvement here for v11. I don't
> think the patch is all that far from being committable but it's not
> going to get there if we start redesigning the world around it.

It's not my intent to 'torpedo' this patch, but I'm pretty
disappointed that we're introducing yet another initdb-time option
with, as far as I can tell, no option to change it after the cluster
has started (without some serious hackery), and continuing to have a
poor default, which is what most users will end up with.

I really don't like these kinds of options. I'd much rather have a
reasonable default that covers most cases and is less likely to be a
problem for most systems than have a minimal setting that's impossible
to change after you've got your data in the system. As much as I'd
like everyone to talk to me before doing an initdb, that's pretty
rare, and instead we end up having to break the bad news that they
should have known better and done the right thing at initdb time and,
no, sorry, there's no answer today but to dump out all of the data and
load it into a new cluster which was set up with the right initdb
settings.

Thanks!

Stephen
On 3/22/17 05:44, Beena Emerson wrote:
> As stated above, the default 16MB has not changed and so we can take
> this separately and not as part of this patch.

It's good to have that discussion separately, but if we're planning to
do it for PG10 (not saying we should), then we should have that
discussion very soon. Especially if we would be shipping a default
configuration where the mapping of files to LSNs fails, which will
require giving users some time to adjust.

-- 
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 21, 2017 at 11:49:30PM -0400, Robert Haas wrote:
> To be honest, I'd sort of forgotten about the change which is the
> nominal subject of this thread - I was more focused on the patch,
> which makes it configurable. I was definitely initially in favor of
> raising the value, but I got cold feet, a bit, when Alvaro pointed out
> that going to 64MB would require a substantial increase in
> min_wal_size. I'm not sure people with small installations will
> appreciate seeing that value cranked up from 5 segments * 16MB = 80MB
> to, say, 3 segments * 64MB = 192MB. That's an extra 100+ MB of space
> that doesn't really do anything for you. And nobody's done any
> benchmarking to see whether having only 3 segments is even a workable,
> performant configuration, so maybe we'll end up with 5 * 64MB = 320MB
> by default.

Maybe it's time to have a documentation section listing suggested
changes for small installs so we can have more reasonable defaults.

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On 3/22/17 08:46, Stephen Frost wrote: > It's not my intent to 'torpedo' this patch but I'm pretty disappointed > that we're introducing yet another initdb-time option with, as far as I > can tell, no option to change it after the cluster has started (without > some serious hackery), and continuing to have a poor default, which is > what most users will end up with. I understand this concern, but what alternative do you have in mind? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter, * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote: > On 3/22/17 08:46, Stephen Frost wrote: > > It's not my intent to 'torpedo' this patch but I'm pretty disappointed > > that we're introducing yet another initdb-time option with, as far as I > > can tell, no option to change it after the cluster has started (without > > some serious hackery), and continuing to have a poor default, which is > > what most users will end up with. > > I understand this concern, but what alternative do you have in mind? Changing the default to a more reasonable value would at least reduce the issue. I think it'd also be nice to have a way to change it post-initdb, but that's less of an issue if we are at least setting it to a good default to begin with instead of a minimal one. Thanks! Stephen
On Wed, Mar 22, 2017 at 3:14 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> PFA an updated patch which fixes a minor bug I found. It only increases the
> string size in the pretty_wal_size function.
> The 01-add-XLogSegmentOffset-macro.patch has also been rebased.

Thanks for the updated versions. Here is a partial review of the patch:

In pg_standby.c and pg_waldump.c,

+ XLogPageHeader hdr = (XLogPageHeader) buf;
+ XLogLongPageHeader NewLongPage = (XLogLongPageHeader) hdr;
+
+ XLogSegSize = NewLongPage->xlp_seg_size;

It waits until the file is at least XLOG_BLCKSZ, then gets the expected final size from XLogPageHeader. This looks really clean compared to the previous approach.

+ * Verify that the min and max wal_size meet the minimum requirements.

Better to write min_wal_size and max_wal_size.

+ errmsg("Insufficient value for \"min_wal_size\"")));

"min_wal_size %d is too low" may be? Use lower case for error messages. Same for max_wal_size.

+ /* Set the XLogSegSize */
+ XLogSegSize = ControlFile->xlog_seg_size;
+

A call to IsValidXLogSegSize() will be good after this, no?

+ /* Update variables using XLogSegSize */
+ check_wal_size();

The method name looks somewhat misleading compared to the comment for it, doesn't it?

+ * allocating space and reading ControlFile.

s/and/for

+ {"TB", GUC_UNIT_MB, 1024 * 1024},
+ {"GB", GUC_UNIT_MB, 1024},
+ {"MB", GUC_UNIT_MB, 1},
+ {"kB", GUC_UNIT_MB, -1024},

@@ -2235,10 +2231,10 @@ static struct config_int ConfigureNamesInt[] =
 {"min_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
 gettext_noop("Sets the minimum size to shrink the WAL to."),
 NULL,
- GUC_UNIT_XSEGS
+ GUC_UNIT_MB
 },
- &min_wal_size,
- 5, 2, INT_MAX,
+ &min_wal_size_mb,
+ DEFAULT_MIN_WAL_SEGS * 16, 2, INT_MAX,
 NULL, NULL, NULL
 },
@@ -2246,10 +2242,10 @@ static struct config_int ConfigureNamesInt[] =
 {"max_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
 gettext_noop("Sets the WAL size that triggers a checkpoint."),
 NULL,
- GUC_UNIT_XSEGS
+ GUC_UNIT_MB
 },
- &max_wal_size,
- 64, 2, INT_MAX,
+ &max_wal_size_mb,
+ DEFAULT_MAX_WAL_SEGS * 16, 2, INT_MAX,
 NULL, assign_max_wal_size, NULL
 },

This patch introduces a new guc_unit having values in MB for max_wal_size and min_wal_size. I'm not sure about the upper limit, which is set to INT_MAX for 32-bit systems as well. Is it needed to define something like MAX_MEGABYTES similar to MAX_KILOBYTES? It is worth mentioning that GUC_UNIT_KB can't be used in this case since MAX_KILOBYTES is INT_MAX / 1024 (< 2GB) on 32-bit systems. That's not a sufficient value for min_wal_size/max_wal_size.

While testing with pg_waldump, I got the following error:

bin/pg_waldump -p master/pg_wal/ -s 0/01000000
Floating point exception (core dumped)

Stack:
#0 0x00000000004039d6 in ReadPageInternal ()
#1 0x0000000000404c84 in XLogFindNextRecord ()
#2 0x0000000000401e08 in main ()

I think that the problem is in the following code:

/* parse files as start/end boundaries, extract path if not specified */
if (optind < argc)
{
....
+ if (!RetrieveXLogSegSize(full_path))
...
}

In this case, RetrieveXLogSegSize is conditionally called. So, if the condition is false, XLogSegSize will not be initialized.

I'm yet to review the pg_basebackup module and I'll try to finish that ASAP.
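[Editor's sketch] The header-read approach reviewed above (pulling xlp_seg_size out of the first page's long header) can be illustrated outside the server. This is a minimal Python sketch, assuming the field layout of a typical little-endian 64-bit build (an assumption; padding and endianness can differ), not the server's actual C code. It also shows the kind of validity check that would have avoided the division-by-zero crash seen in pg_waldump when XLogSegSize is left at zero:

```python
import struct

# Assumed offsets for XLogLongPageHeaderData on a little-endian 64-bit build:
#   0: xlp_magic (u16)    2: xlp_info (u16)     4: xlp_tli (u32)
#   8: xlp_pageaddr (u64) 16: xlp_rem_len (u32) [padded to 24]
#  24: xlp_sysid (u64)    32: xlp_seg_size (u32) 36: xlp_xlog_blcksz (u32)
XLP_LONG_HEADER = 0x0002  # xlp_info flag marking a long header

def read_wal_seg_size(first_page: bytes) -> int:
    """Extract the segment size recorded in a WAL file's first page."""
    magic, info = struct.unpack_from("<HH", first_page, 0)
    if not (info & XLP_LONG_HEADER):
        raise ValueError("first page lacks a long header")
    (seg_size,) = struct.unpack_from("<I", first_page, 32)
    # Reject zero or non-power-of-two sizes before they're ever used as a
    # divisor -- the guard the uninitialized-XLogSegSize crash calls for.
    if seg_size == 0 or (seg_size & (seg_size - 1)) != 0:
        raise ValueError("invalid segment size %d" % seg_size)
    return seg_size
```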
-- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 22, 2017 at 8:46 AM, Stephen Frost <sfrost@snowman.net> wrote: >> I was definitely initially in favor of >> raising the value, but I got cold feet, a bit, when Alvaro pointed out >> that going to 64MB would require a substantial increase in >> min_wal_size. > > The performance concern of having 3 segments is a red herring here if > we're talking about a default install because the default for > max_wal_size is 1G, not 192MB. I do think increasing the default WAL > size would be valuable to do even if it does mean a default install will > take up a bit more space. min_wal_size isn't the same thing as max_wal_size. > I didn't see much discussion of it, but if this is really a concern then > couldn't we set the default to be 2 segments worth instead of 3 also? > That would mean an increase from 80MB to 128MB in the default install if > you never touch more than 128MB during a checkpoint. Not sure. Need testing. >> I'm a little worried that this whole question of changing the file >> naming scheme is a diversion which will result in torpedoing any >> chance of getting some kind of improvement here for v11. I don't >> think the patch is all that far from being committable but it's not >> going to get there if we start redesigning the world around it. > > It's not my intent to 'torpedo' this patch but I'm pretty disappointed > that we're introducing yet another initdb-time option with, as far as I > can tell, no option to change it after the cluster has started (without > some serious hackery), and continuing to have a poor default, which is > what most users will end up with. > > I really don't like these kinds of options. I'd much rather have a > reasonable default that covers most cases and is less likely to be a > problem for most systems than have a minimal setting that's impossible > to change after you've got your data in the system. 
As much as I'd like > everyone to talk to me before doing an initdb, that's pretty rare and > instead we end up having to break the bad news that they should have > known better and done the right thing at initdb time and, no, sorry, > there's no answer today but to dump out all of the data and load it into > a new cluster which was set up with the right initdb settings. Well, right now, the alternative is to recompile the server, so I think being able to make the change at initdb time is pretty [ insert a word of your choice here ] great by comparison. Now, I completely agree that initdb-time configurability is inferior to server-restart configurability which is obviously inferior to on-the-fly reconfigurability, but we are not going to get either of those latter two things in v10, so I think we should take the one we can get, which is clearly better than what we've got now. In the future, if somebody is willing to put in the time and energy to allow this to be changed via a pg_resetexlog-like procedure, or even on the fly by some ALTER SYSTEM command, we can consider those changes then, but everything this patch does will still be necessary. On the topic of whether to also change the default, I'm not sure what is best and will defer to others. On the topic of whether to whack around the file naming scheme, -1 from me. This patch was posted three months ago and nobody suggested that course of action until this week. Even though it is on a related topic, it is a conceptually separate change that is previously undiscussed and on which we do not have agreement. Making those changes just before feature freeze isn't fair to the patch authors or people who may not have time to pay attention to this thread right this minute. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert, * Robert Haas (robertmhaas@gmail.com) wrote: > On the topic of whether to also change the default, I'm not sure what > is best and will defer to others. On the topic of whether to whack > around the file naming scheme, -1 from me. This patch was posted > three months ago and nobody suggested that course of action until this > week. Even though it is on a related topic, it is a conceptually > separate change that is previously undiscussed and on which we do not > have agreement. Making those changes just before feature freeze isn't > fair to the patch authors or people who may not have time to pay > attention to this thread right this minute. While I understand that you'd like to separate the concerns between changing the renaming scheme and changing the default and enabling this option, I don't agree that they can or should be independently considered. This is, in my view, basically the only opportunity we will have to change the naming scheme because once we make it an initdb option, while I don't think very many people will use it, there will be people who will and the various tool authors will also have to adjust to handle those cases. Chances are good that we will even see people start to recommend using that initdb option, leading to more people using a different default, at which point we simply are not going to be able to consider changing the naming scheme. Therefore, I would much rather we take this opportunity to change the naming scheme and the default at the same time to be more sane, because if we have this patch as-is in PG10, we won't be able to do so in the future without a great deal more pain. I'm willing to forgo the ability to change the WAL size with just a server restart for PG10 because that's something which can clearly be added later without any concerns about backwards-compatibility, but the same is not true regarding the naming scheme. Thanks! Stephen
> While I understand that you'd like to separate the concerns between
> changing the renaming scheme and changing the default and enabling this
> option, I don't agree that they can or should be independently
> considered.
Well, I don't understand what you think is going to happen here. Neither Beena nor any other contributor you don't employ is obliged to write a patch for those changes because you'd like them to get made, and Peter and I have already voted against including them. If you or David want to write a patch for those changes, post it for discussion, and try to get consensus to commit it, that's of course your right. But the patch will be more than three weeks after the feature freeze deadline and will have two committer votes against it from the outset.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Mar 22, 2017 at 12:22 PM, Stephen Frost <sfrost@snowman.net> wrote: > > While I understand that you'd like to separate the concerns between > > changing the renaming scheme and changing the default and enabling this > > option, I don't agree that they can or should be independently > > considered. > > Well, I don't understand what you think is going to happen here. Neither > Beena nor any other contributor you don't employ is obliged to write a > patch for those changes because you'd like them to get made, and Peter and > I have already voted against including them. If you or David want to write > a patch for those changes, post it for discussion, and try to get consensus > to commit it, that's of course your right. But the patch will be more than > three weeks after the feature freeze deadline and will have two committer > votes against it from the outset. This would clearly be an adjustment to the submitted patch, which happens regularly during the review and commit process and is part of the commitfest process, so I don't agree that holding it to new-feature level is correct when it's a change which is entirely relevant to this new feature, and one which a committer is asking to be included as part of the change. Nor do I feel particularly bad about asking for feature authors to be prepared to rework parts of their feature based on feedback during the commitfest process. I would have liked to have realized this oddity with the WAL naming scheme for not-16MB-WALs earlier too, but it's unfortunately not within my abilities to change that. That does not mean that we shouldn't be cognizant of the impact that this new feature will have in exposing this naming scheme, one which there seems to be agreement is bad, to users. That said, David is taking a look at it to try and be helpful. Vote-counting seems a bit premature given that there hasn't been any particularly clear asking for votes. 
Additionally, I believe Peter also seemed concerned that the existing naming scheme which, if used with, say, 64MB segments, wouldn't match LSNs either, in this post: 9795723f-b4dd-f9e9-62e4-ddaf6cd888f1@2ndquadrant.com Thanks! Stephen
On Wed, Mar 22, 2017 at 12:51 PM, Stephen Frost <sfrost@snowman.net> wrote: > This would clearly be an adjustment to the submitted patch, which > happens regularly during the review and commit process and is part of > the commitfest process, so I don't agree that holding it to new-feature > level is correct when it's a change which is entirely relevant to this > new feature, and one which a committer is asking to be included as part > of the change. Nor do I feel particularly bad about asking for feature > authors to be prepared to rework parts of their feature based on > feedback during the commitfest process. Obviously, reworking the patch is an expected part of the CommitFest process. However, I disagree that what you're asking for can in any way be fairly characterized that way. You're not trying to make it do the thing that it does better or differently. You're trying to make it do a second thing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Mar 22, 2017 at 12:51 PM, Stephen Frost <sfrost@snowman.net> wrote: > > This would clearly be an adjustment to the submitted patch, which > > happens regularly during the review and commit process and is part of > > the commitfest process, so I don't agree that holding it to new-feature > > level is correct when it's a change which is entirely relevant to this > > new feature, and one which a committer is asking to be included as part > > of the change. Nor do I feel particularly bad about asking for feature > > authors to be prepared to rework parts of their feature based on > > feedback during the commitfest process. > > Obviously, reworking the patch is an expected part of the CommitFest > process. However, I disagree that what you're asking for can in any > way be fairly characterized that way. You're not trying to make it do > the thing that it does better or differently. You're trying to make > it do a second thing. I don't agree with the particularly narrow definition you're using in this case to say that adding an option to initdb to change how big WAL files are, which will also change how they're named (even though this patch doesn't *specifically* do anything with the naming because there was a configure-time switch which existed before) means that asking for the WAL files names, which are already being changed, to be changed in a different way, is really outside the scope and a new feature. To put this in another light, had this issue been brought up post feature-freeze, your definition would mean that we would only have the option to either revert the patch entirely or to live with the poor naming scheme. For my 2c, in such a case, I would have voted to make the change even post feature-freeze unless we were very close to release as it's not really a new 'feature'. 
Thankfully, that isn't the case here and we do have time to consider changing it without having to worry about having a post feature-freeze discussion about it. Thanks! Stephen
On Wed, Mar 22, 2017 at 1:22 PM, Stephen Frost <sfrost@snowman.net> wrote: > To put this in another light, had this issue been brought up post > feature-freeze, your definition would mean that we would only have the > option to either revert the patch entirely or to live with the poor > naming scheme. Yeah, and I absolutely agree with that. In fact, I think it's *already* past the time when we should be considering the changes you want. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Mar 22, 2017 at 1:22 PM, Stephen Frost <sfrost@snowman.net> wrote: > > To put this in another light, had this issue been brought up post > > feature-freeze, your definition would mean that we would only have the > > option to either revert the patch entirely or to live with the poor > > naming scheme. > > Yeah, and I absolutely agree with that. In fact, I think it's > *already* past the time when we should be considering the changes you > want. Then perhaps we do need to be thinking of moving this to PG11 instead of exposing an option that users will start to use which will result in WAL naming that'll be confusing and inconsistent. I certainly don't think it's a good idea to move forward exposing an option with a naming scheme that's agreed to be bad. Thanks! Stephen
Robert,
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Mar 22, 2017 at 12:22 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > While I understand that you'd like to separate the concerns between
> > changing the renaming scheme and changing the default and enabling this
> > option, I don't agree that they can or should be independently
> > considered.
>
> Well, I don't understand what you think is going to happen here. Neither
> Beena nor any other contributor you don't employ is obliged to write a
> patch for those changes because you'd like them to get made, and Peter and
> I have already voted against including them. If you or David want to write
> a patch for those changes, post it for discussion, and try to get consensus
> to commit it, that's of course your right. But the patch will be more than
> three weeks after the feature freeze deadline and will have two committer
> votes against it from the outset.
This would clearly be an adjustment to the submitted patch, which
happens regularly during the review and commit process and is part of
the commitfest process, so I don't agree that holding it to new-feature
level is correct when it's a change which is entirely relevant to this
new feature, and one which a committer is asking to be included as part
of the change. Nor do I feel particularly bad about asking for feature
authors to be prepared to rework parts of their feature based on
feedback during the commitfest process.
I would have liked to have realized this oddity with the WAL naming
scheme for not-16MB-WALs earlier too, but it's unfortunately not within
my abilities to change that. That does not mean that we shouldn't be
cognizant of the impact that this new feature will have in exposing this
naming scheme, one which there seems to be agreement is bad, to users.
That said, David is taking a look at it to try and be helpful.
Vote-counting seems a bit premature given that there hasn't been any
particularly clear asking for votes. Additionally, I believe Peter also
seemed concerned that the existing naming scheme which, if used with,
say, 64MB segments, wouldn't match LSNs either, in this post:
9795723f-b4dd-f9e9-62e4-ddaf6cd888f1@2ndquadrant.com
On Wed, Mar 22, 2017 at 1:49 PM, Stephen Frost <sfrost@snowman.net> wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: >> On Wed, Mar 22, 2017 at 1:22 PM, Stephen Frost <sfrost@snowman.net> wrote: >> > To put this in another light, had this issue been brought up post >> > feature-freeze, your definition would mean that we would only have the >> > option to either revert the patch entirely or to live with the poor >> > naming scheme. >> >> Yeah, and I absolutely agree with that. In fact, I think it's >> *already* past the time when we should be considering the changes you >> want. > > Then perhaps we do need to be thinking of moving this to PG11 instead of > exposing an option that users will start to use which will result in WAL > naming that'll be confusing and inconsistent. I certainly don't think > it's a good idea to move forward exposing an option with a naming scheme > that's agreed to be bad. I'm not sure there is any such agreement. I agree that the naming scheme for WAL files probably isn't the greatest and that David's proposal is probably better, but we've had that naming scheme for many years, and I don't accept that making a previously-configure-time option initdb-time means that it's suddenly necessary to break everything for people who continue to use a 16MB WAL size. I really think that is very unlikely to be a majority position, no matter how firmly you and David hold to it. It is possible that a majority of people will agree that such a change should be made, but it seems very remote that a majority of people will agree that it has to (or even should be) the same commit that improves the configurability. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Mar 22, 2017 at 1:49 PM, Stephen Frost <sfrost@snowman.net> wrote: > > Then perhaps we do need to be thinking of moving this to PG11 instead of > > exposing an option that users will start to use which will result in WAL > > naming that'll be confusing and inconsistent. I certainly don't think > > it's a good idea to move forward exposing an option with a naming scheme > > that's agreed to be bad. > > I'm not sure there is any such agreement. I agree that the naming > scheme for WAL files probably isn't the greatest and that David's > proposal is probably better, but we've had that naming scheme for many > years, and I don't accept that making a previously-configure-time > option initdb-time means that it's suddenly necessary to break > everything for people who continue to use a 16MB WAL size. Apologies, I completely forgot to bring up how the discussion has evolved regarding the 16MB case even though we had moved past it in my head. Let me try to set that right here. One of the reasons to go with the LSN is that we would actually be maintaining what happens when the WAL files are 16MB in size. David's initial expectation was this for 64MB WAL files:

000000010000000000000040
000000010000000000000080
0000000100000000000000C0
000000010000000100000000

Which both matches the LSN *and* keeps the file names the same when they're 16MB. This is what David's looking at writing a patch for and is what I think we should be considering. This avoids breaking compatibility for people who choose to continue using 16MB (assuming we switch the default to 64MB, which I am still hopeful we will do). David had offered up another idea which would change the WAL naming for all sizes, but he and I chatted about it and it seemed like it'd make more sense to maintain the 16MB filenames and then to use the LSN for other sizes also in the same manner.
Regardless of which approach we end up using, I do think we need a formal function for converting WAL file names into LSNs and documentation included which spells out exactly how that's done. This is obviously important to backup tools which need to make sure that there aren't any gaps in the WAL stream and also need to figure out where the LSN returned by pg_start_backup() is. We have a function for the latter already, but no documentation explaining how it works, which I believe we should as tool authors need to implement this in their own code since they can't always assume access to a PG server is available. Thanks! Stephen
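[Editor's sketch] Under the current naming scheme, the formal conversion Stephen asks for amounts to a few lines of arithmetic. This Python sketch mirrors the math behind the server's XLogFileName/XLogFromFileName macros; the helper names here are illustrative, not an existing API:

```python
DEFAULT_SEG = 16 * 1024 * 1024

def wal_file_name(tli: int, lsn: int, seg_size: int = DEFAULT_SEG) -> str:
    """Name of the segment containing lsn, current scheme: TLI, then the
    segment number split at the 4GB (0x100000000) xlogid boundary."""
    segs_per_xlogid = 0x100000000 // seg_size
    segno = lsn // seg_size
    return "%08X%08X%08X" % (tli, segno // segs_per_xlogid, segno % segs_per_xlogid)

def wal_file_start_lsn(name: str, seg_size: int = DEFAULT_SEG) -> int:
    """Inverse: the first LSN covered by a segment file name."""
    segs_per_xlogid = 0x100000000 // seg_size
    xlogid = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return (xlogid * segs_per_xlogid + seg) * seg_size
```

For 16MB segments the name's low field happens to match the top byte of the LSN (wal_file_name(1, 0x01000000) is 000000010000000000000001), but for 64MB segments that same name begins at LSN 0/04000000, which is exactly the name-to-LSN mismatch under discussion.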
On 3/22/17 3:09 PM, Stephen Frost wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: >> On Wed, Mar 22, 2017 at 1:49 PM, Stephen Frost <sfrost@snowman.net> wrote: >>> Then perhaps we do need to be thinking of moving this to PG11 instead of >>> exposing an option that users will start to use which will result in WAL >>> naming that'll be confusing and inconsistent. I certainly don't think >>> it's a good idea to move forward exposing an option with a naming scheme >>> that's agreed to be bad. >> > > One of the reasons to go with the LSN is that we would actually be > maintaining what happens when the WAL files are 16MB in size. > > David's initial expectation was this for 64MB WAL files: > > 000000010000000000000040 > 000000010000000000000080 > 0000000100000000000000CO > 000000010000000100000000 This is the 1GB sequence, actually, but idea would be the same for 64MB files. -- -David david@pgmasters.net
* David Steele (david@pgmasters.net) wrote: > On 3/22/17 3:09 PM, Stephen Frost wrote: > >* Robert Haas (robertmhaas@gmail.com) wrote: > >>On Wed, Mar 22, 2017 at 1:49 PM, Stephen Frost <sfrost@snowman.net> wrote: > >>>Then perhaps we do need to be thinking of moving this to PG11 instead of > >>>exposing an option that users will start to use which will result in WAL > >>>naming that'll be confusing and inconsistent. I certainly don't think > >>>it's a good idea to move forward exposing an option with a naming scheme > >>>that's agreed to be bad. > >> > > > >One of the reasons to go with the LSN is that we would actually be > >maintaining what happens when the WAL files are 16MB in size. > > > >David's initial expectation was this for 64MB WAL files: > > > >000000010000000000000040 > >000000010000000000000080 > >0000000100000000000000CO > >000000010000000100000000 > > This is the 1GB sequence, actually, but idea would be the same for > 64MB files. Ah, right, sorry. Thanks! Stephen
On 3/22/17 15:09, Stephen Frost wrote: > David's initial expectation was this for 64MB WAL files: > > 000000010000000000000040 > 000000010000000000000080 > 0000000100000000000000CO > 000000010000000100000000 > > Which both matches the LSN *and* keeps the file names the same when > they're 16MB. This is what David's looking at writing a patch for and > is what I think we should be considering. This avoids breaking > compatibility for people who choose to continue using 16MB (assuming > we switch the default to 64MB, which I am still hopeful we will do). The question is, which property is more useful to preserve: matching LSN, or having a mostly consecutive numbering. Actually, I would really really like to have both, but if I had to pick one, I'd lean 55% toward consecutive numbering. For the issue at hand, I think it's fine to proceed with the naming schema that the existing compile-time option gives you. In fact, that would flush out some of the tools that look directly at the file names and interpret them, thus preserving the option to move to a more radically different format. If changing WAL sizes catches on, I do think we should keep thinking about a new format for a future release, because debugging will otherwise become a bit wild. I'm thinking something like {integer timeline}_{integer seq number}_{hex lsn} might address various interests. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3/22/17 15:37, Peter Eisentraut wrote: > If changing WAL sizes catches on, I do think we should keep thinking > about a new format for a future release, I think that means that I'm skeptical about changing the default size right now. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 22, 2017 at 3:24 PM, David Steele <david@pgmasters.net> wrote: >> One of the reasons to go with the LSN is that we would actually be >> maintaining what happens when the WAL files are 16MB in size. >> >> David's initial expectation was this for 64MB WAL files: >> >> 000000010000000000000040 >> 000000010000000000000080 >> 0000000100000000000000CO >> 000000010000000100000000 > > > This is the 1GB sequence, actually, but idea would be the same for 64MB > files. Wait, really? I thought you abandoned this approach because there's then no principled way to handle WAL segments of less than the default size. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
* Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote: > The question is, which property is more useful to preserve: matching > LSN, or having a mostly consecutive numbering. > > Actually, I would really really like to have both, but if I had to pick > one, I'd lean 55% toward consecutive numbering. > For the issue at hand, I think it's fine to proceed with the naming > schema that the existing compile-time option gives you. What I don't particularly like about that is that it's *not* actually consecutive, you end up with this: 000000010000000000000001 000000010000000000000002 000000010000000000000003 000000010000000100000000 Which is part of what I don't particularly like about this approach. > In fact, that would flush out some of the tools that look directly at > the file names and interpret them, thus preserving the option to move to > a more radically different format. This doesn't make a lot of sense to me. If we get people to change to using larger WAL segments and the tools are modified to understand the pseudo-consecutive format, and then you want to change it on them again in another release or two? I'm generally a fan of not feeling too bad breaking backwards compatibility, but it seems pretty rough even to me to do so immediately. This is exactly why I think it'd be better to work out a good naming scheme now that actually makes sense and that we'll be able to stick with for a while instead of rushing to get this ability in now, when we'll have people actually starting to use it and then try to change it. > If changing WAL sizes catches on, I do think we should keep thinking > about a new format for a future release, because debugging will > otherwise become a bit wild. I'm thinking something like > > {integer timeline}_{integer seq number}_{hex lsn} > > might address various interests. Right, I'd rather not have debugging WAL files become a bit wild. 
If we can't work out a sensible approach to naming that we expect to last us for at least a couple of releases for different sizes of WAL files, then I don't think we should rush to encourage users to use different sizes of WAL files. Thanks! Stephen
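[Editor's sketch] The two candidate numberings can be compared with a few lines of arithmetic. This sketch (illustrative helpers, not PostgreSQL code) generates the first segment names on timeline 1 for 1GB segments under the current consecutive scheme and under the LSN-preserving proposal discussed above:

```python
SEG_1GB = 1024 * 1024 * 1024
SEG_16MB = 16 * 1024 * 1024

def current_name(tli: int, segno: int, seg_size: int) -> str:
    # Current scheme: consecutive segment numbers, wrapping the low field
    # at every 4GB xlogid boundary -- hence "1, 2, 3, skip to the next".
    per_id = 0x100000000 // seg_size
    return "%08X%08X%08X" % (tli, segno // per_id, segno % per_id)

def lsn_name(tli: int, lsn: int) -> str:
    # LSN-preserving proposal: the low field counts 16MB units of the LSN,
    # so 16MB installs keep their existing names and every name matches
    # the segment's starting LSN.
    return "%08X%08X%08X" % (tli, lsn >> 32, (lsn & 0xFFFFFFFF) // SEG_16MB)
```

With 1GB segments the current scheme yields ...00, ...01, ...02, ...03 and then jumps to the next xlogid, while the proposal yields ...00, ...40, ...80, ...C0 before the same jump, skipping by a constant seg_size / 16MB each time.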
On 3/22/17 3:39 PM, Peter Eisentraut wrote: > On 3/22/17 15:37, Peter Eisentraut wrote: >> If changing WAL sizes catches on, I do think we should keep thinking >> about a new format for a future release, > > I think that means that I'm skeptical about changing the default size > right now. I think if we don't change the default size it's very unlikely I would use alternate WAL segment sizes or recommend that anyone else does, at least in v10. I simply don't think it would get the level of testing required to be production worthy and I doubt that most tool writers would be quick to add support for a feature that very few people (if any) use. -- -David david@pgmasters.net
Hi Robert, On 3/22/17 3:45 PM, Robert Haas wrote: > On Wed, Mar 22, 2017 at 3:24 PM, David Steele <david@pgmasters.net> wrote: >>> One of the reasons to go with the LSN is that we would actually be >>> maintaining what happens when the WAL files are 16MB in size. >>> >>> David's initial expectation was this for 64MB WAL files: >>> >>> 000000010000000000000040 >>> 000000010000000000000080 >>> 0000000100000000000000CO >>> 000000010000000100000000 >> >> >> This is the 1GB sequence, actually, but idea would be the same for 64MB >> files. > > Wait, really? I thought you abandoned this approach because there's > then no principled way to handle WAL segments of less than the default > size. I did say that, but I thought I had hit on a compromise. But, as I originally pointed out the hex characters in the filename are not aligned correctly for > 8 bits (< 16MB segments) and using different alignments just made it less consistent. It would be OK if we were willing to drop the 1,2,4,8 segment sizes because then the alignment would make sense and not change the current 16MB sequence. Even then, there are some interesting side effects. For 1GB segments the "0000000100000001000000C0" segment would include LSNs 1/C0000000 through 1/FFFFFFFF. This is correct but is not an obvious filename to LSN mapping, at least for LSNs that appear later in the segment. -- -David david@pgmasters.net
David,

* David Steele (david@pgmasters.net) wrote:
> On 3/22/17 3:45 PM, Robert Haas wrote:
> > On Wed, Mar 22, 2017 at 3:24 PM, David Steele <david@pgmasters.net> wrote:
> >>> One of the reasons to go with the LSN is that we would actually be
> >>> maintaining what happens when the WAL files are 16MB in size.
> >>>
> >>> David's initial expectation was this for 64MB WAL files:
> >>>
> >>> 000000010000000000000040
> >>> 000000010000000000000080
> >>> 0000000100000000000000C0
> >>> 000000010000000100000000
> >>
> >> This is the 1GB sequence, actually, but the idea would be the same for 64MB
> >> files.
> >
> > Wait, really? I thought you abandoned this approach because there's
> > then no principled way to handle WAL segments of less than the default
> > size.
>
> I did say that, but I thought I had hit on a compromise.

Strikes me as one, at least.

> But, as I originally pointed out the hex characters in the filename
> are not aligned correctly for > 8 bits (< 16MB segments) and using
> different alignments just made it less consistent.
>
> It would be OK if we were willing to drop the 1,2,4,8 segment sizes
> because then the alignment would make sense and not change the
> current 16MB sequence.

For my 2c, at least, it seems extremely unlikely that people are using smaller-than-16MB segments. Also, we don't have to actually drop support for those sizes, just handle the numbering differently, if we feel they're useful enough to keep; in particular I was thinking we could make the filename one digit longer, or shift the numbers up one position. But my general feeling is that it wouldn't ever be an exercised use case, and therefore we should just drop support for them.
Perhaps I'm being overly paranoid, but I share David's concern about non-standard/non-default WAL sizes being a serious risk due to lack of exposure for those code paths. That is another reason we should change the default if we feel it's valuable to have a larger WAL segment, rather than just creating an option which users can set at initdb time but which we very rarely actually test to ensure it's working. With any of these we need to have some buildfarm systems which are at *least* running our regression tests against the different options, if we would consider telling users to use them.

> Even then, there are some interesting side effects. For 1GB
> segments the "0000000100000001000000C0" segment would include LSNs
> 1/C0000000 through 1/FFFFFFFF. This is correct but is not an
> obvious filename to LSN mapping, at least for LSNs that appear later
> in the segment.

That doesn't seem unreasonable to me. If we're going to use the starting LSN of the segment then it's going to skip when you start varying the size of the segment. Even keeping the current scheme we end up with skipping; it's just different skipping that goes "1, 2, 3, skip to the next!" where how high the count goes depends on the size. With this approach, we're consistently skipping by the same amount, which is exactly the segment size divided by 16MB, always.

I do also like Peter's suggestion of using a separator between the components of the WAL filename, but that would change the naming for everyone, which is a concern I can understand us wishing to avoid.

From a user-experience point of view, keeping the mapping from the WAL filename to the starting LSN is quite nice, even if this change might complicate the backend code a bit.

Thanks!

Stephen
On 3/22/17 17:33, David Steele wrote:
> I think if we don't change the default size it's very unlikely I would
> use alternate WAL segment sizes or recommend that anyone else does, at
> least in v10.
>
> I simply don't think it would get the level of testing required to be
> production worthy

I think we could tweak the test harnesses to run all the tests with different segment sizes. That would get pretty good coverage.

More generally, the methodology that we should not add an option unless we also change the default, because the option would otherwise not get enough testing, is a bit dubious.

> and I doubt that most tool writers would be quick to
> add support for a feature that very few people (if any) use.

I'm not one of those tool writers, although I have written my share of DBA scripts over the years, but I wonder why those tools would really care. They are handed files with predetermined names to archive, and for restore, files with predetermined names are requested back from them. What else do they need? If something is missing that requires them to parse file names, then maybe that should be added.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter, * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote: > On 3/22/17 17:33, David Steele wrote: > > and I doubt that most tool writers would be quick to > > add support for a feature that very few people (if any) use. > > I'm not one of those tool writers, although I have written my share of > DBA scripts over the years, but I wonder why those tools would really > care. They are handed files with predetermined names to archive, and > for restore files with predetermined names are requested back from them. > What else do they need? If something is missing that requires them to > parse file names, then maybe that should be added. PG backup technology has come a fair ways from that simple characterization of it. :) The backup tools need to also get the LSN from the pg_stop_backup and verify that they have the WAL file associated with that LSN. They also need to make sure that they have all of the WAL files between the starting LSN and the ending LSN. Doing that requires understanding how the files are named to make sure there aren't any missing. David will probably point out other reasons that the backup tools need to understand the file naming, but those are ones I know of off-hand. Thanks! Stephen
On 3/23/17 16:58, Stephen Frost wrote:
> The backup tools need to also get the LSN from the pg_stop_backup and
> verify that they have the WAL file associated with that LSN.

There is a function for that.

> They also
> need to make sure that they have all of the WAL files between the
> starting LSN and the ending LSN. Doing that requires understanding how
> the files are named to make sure there aren't any missing.

There is not a function for that, but there could be one.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3/23/17 21:47, Jeff Janes wrote:
> I have a pg_restore which predicts the file 5 files ahead of the one it
> was asked for, and initiates a pre-fetch and decompression of it. Then
> it delivers the file it was asked for, either by pulling it out of the
> pre-staging area set up by the N-5th invocation, or by going directly to
> the archive to get it. This speeds up play-back dramatically when the
> files are stored compressed and non-local.

Yeah, some better support for prefetching would be necessary to avoid having to have any knowledge of the file naming.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter,

* Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
> There is a function for that.
[...]
> There is not a function for that, but there could be one.

I'm not sure you've really considered what you're suggesting here. We need to make sure we have every file between two LSNs. Yes, we could step the LSN forward one byte at a time, calling the appropriate function for every single byte, to make sure that we have that file, but that really isn't a reasonable approach. Nor would it be reasonable if we go on the assumption that WAL files can't be less than 1MB.

Beyond that, this also bakes in an assumption that we would then require access to a database (of a current enough version to have the functions needed, too!) to connect to and run these functions, which is a poor design. If the user is using a remote system to gather the WAL on, that system may not have easy access to PG. Further, backup tools will want to do things like off-line verification that the backup is complete, perhaps in another environment entirely which doesn't have PG, or maybe where what they're trying to do is make sure that a given backup is good before starting a restore to bring PG back up.

Also, given that one of the things we're talking about here is specifically that we want to be able to change the WAL size for different databases, you would have to make sure that the database you're running these functions on uses the same WAL file size as the one which is being backed up.

No, I don't agree that we can claim the LSN -> WAL filename mapping is an internal PG detail that we can whack around because there are functions to calculate the answer. External utilities need to be able to perform that translation, and we need to document for them how to do so correctly.

Thanks!

Stephen
On 3/24/17 12:27 AM, Peter Eisentraut wrote:
> On 3/23/17 16:58, Stephen Frost wrote:
>> The backup tools need to also get the LSN from the pg_stop_backup and
>> verify that they have the WAL file associated with that LSN.
>
> There is a function for that.
>
>> They also
>> need to make sure that they have all of the WAL files between the
>> starting LSN and the ending LSN. Doing that requires understanding how
>> the files are named to make sure there aren't any missing.
>
> There is not a function for that, but there could be one.

A function would be nice, but tools often cannot depend on the database being operational, so it's still necessary to re-implement them. Having a sane sequence in the WAL makes that easier.

--
-David
david@pgmasters.net
Jeff,

* Jeff Janes (jeff.janes@gmail.com) wrote:
> On Thu, Mar 23, 2017 at 1:45 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
> > On 3/22/17 17:33, David Steele wrote:
> > > and I doubt that most tool writers would be quick to
> > > add support for a feature that very few people (if any) use.
> >
> > I'm not one of those tool writers, although I have written my share of
> > DBA scripts over the years, but I wonder why those tools would really
> > care. They are handed files with predetermined names to archive, and
> > for restore files with predetermined names are requested back from them.
> > What else do they need? If something is missing that requires them to
> > parse file names, then maybe that should be added.
>
> I have a pg_restore which predicts the file 5 files ahead of the one it was
> asked for, and initiates a pre-fetch and decompression of it. Then it
> delivers the file it was asked for, either by pulling it out of the
> pre-staging area set up by the N-5th invocation, or by going directly to
> the archive to get it. This speeds up play-back dramatically when the
> files are stored compressed and non-local.

Ah, yes, that is on our road-map for pgBackRest to do also, along with parallel WAL fetch, which also needs to figure out the WAL names before being asked for them.

We do already have parallel push, which also needs to figure out what the upcoming file names are going to be so we can find them and push them when they're indicated as ready in archive_status. Perhaps we could just push whatever is ready and remember everything that was pushed for when PG asks, but that is really not ideal.

> That is why I need to know how the files are numbered. I don't think that
> that makes much of a difference, though. Any change is going to break
> that, no matter which change. Then I'll fix it.
Right, but the discussion here is actually about the idea that we're going to encourage people to use the initdb-time option to change the WAL size, meaning you'll need to deal with different WAL sizes and different naming due to that, and then we're going to turn around in the very next release and break the naming, meaning you'll have to adjust your tools first for the different possible WAL sizes in PG10 and then again for the different naming in PG11. I'm trying to suggest that if we're going to do this then, perhaps, we should try to make both changes in one release instead of across two.

> If we are going to break it, I'd prefer to just do away with the 'segment'
> thing altogether. You have timelines, and you have files. That's it.

I'm not sure I follow this proposal. We have to know which WAL file has which LSN in it; how do you do that with just 'timelines and files'?

Thanks!

Stephen
On 3/23/17 4:45 PM, Peter Eisentraut wrote:
> On 3/22/17 17:33, David Steele wrote:
>> I think if we don't change the default size it's very unlikely I would
>> use alternate WAL segment sizes or recommend that anyone else does, at
>> least in v10.
>>
>> I simply don't think it would get the level of testing required to be
>> production worthy
>
> I think we could tweak the test harnesses to run all the tests with
> different segment sizes. That would get pretty good coverage.

I would want to see 1, 16, and 64 at a minimum. More might be nice, but that gets a bit ridiculous at some point. I would be fine with different critters having different defaults; I don't think that each critter needs to test each value.

> More generally, the methodology that we should not add an option unless
> we also change the default because the option would otherwise not get
> enough testing is a bit dubious.

Generally, I would agree, but I think this is a special case. This option has been around for a long time and we are just now exposing it in a way that's useful to end users. It's easy to see how various assumptions may have arisen around the default and led to code that is not quite right when using different values. Even if that's not true in the backend code, it might affect bin, and certainly affects third-party tools.

--
-David
david@pgmasters.net
On Wed, Mar 22, 2017 at 3:14 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> PFA an updated patch which fixes a minor bug I found. It only increases the
> string size in pretty_wal_size function.
> The 01-add-XLogSegmentOffset-macro.patch has also been rebased.
Thanks for the updated versions. Here is a partial review of the patch:
In pg_standby.c and pg_waldump.c,
+ XLogPageHeader hdr = (XLogPageHeader) buf;
+ XLogLongPageHeader NewLongPage = (XLogLongPageHeader) hdr;
+
+ XLogSegSize = NewLongPage->xlp_seg_size;
It waits until the file is at least XLOG_BLCKSZ, then gets the
expected final size from XLogPageHeader. This looks really clean
compared to the previous approach.
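The header read can be sketched like this. The struct below is a hand-written mirror of the fields in XLogLongPageHeaderData; treat the exact layout as an assumption here, since the real code should take it from xlog_internal.h.

```c
#include <stdint.h>
#include <string.h>

/* Hand-written mirrors of the WAL page header structs (layout assumed). */
typedef struct
{
    uint16_t xlp_magic;
    uint16_t xlp_info;
    uint32_t xlp_tli;
    uint64_t xlp_pageaddr;
    uint32_t xlp_rem_len;
    /* compiler pads to 8-byte alignment here, as in the real header */
} SketchPageHeader;

typedef struct
{
    SketchPageHeader std;       /* standard header portion */
    uint64_t xlp_sysid;         /* system identifier */
    uint32_t xlp_seg_size;      /* the field the tools want */
    uint32_t xlp_xlog_blcksz;
} SketchLongPageHeader;

/*
 * Given the first bytes of a WAL segment, return the segment size recorded
 * in its long page header, or 0 if the buffer is too short to contain one.
 */
static uint32_t
seg_size_from_first_page(const void *page, size_t len)
{
    SketchLongPageHeader hdr;

    if (len < sizeof(hdr))
        return 0;
    memcpy(&hdr, page, sizeof(hdr));
    return hdr.xlp_seg_size;
}
```

This is why the tools only need to wait until XLOG_BLCKSZ bytes exist: the long header sits entirely within the segment's first page.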
+ * Verify that the min and max wal_size meet the minimum requirements.
Better to write min_wal_size and max_wal_size.
+ errmsg("Insufficient value for \"min_wal_size\"")));
"min_wal_size %d is too low" may be? Use lower case for error
messages. Same for max_wal_size.
+ /* Set the XLogSegSize */
+ XLogSegSize = ControlFile->xlog_seg_size;
+
A call to IsValidXLogSegSize() will be good after this, no?
+ /* Update variables using XLogSegSize */
+ check_wal_size();
The method name looks somewhat misleading compared to the comment for
it, doesn't it?
This patch introduces a new guc_unit having values in MB for
max_wal_size and min_wal_size. I'm not sure about the upper limit
which is set to INT_MAX for 32-bit systems as well. Is it needed to
define something like MAX_MEGABYTES similar to MAX_KILOBYTES?
It is worth mentioning that GUC_UNIT_KB can't be used in this case
since MAX_KILOBYTES is INT_MAX / 1024(<2GB) on 32-bit systems. That's
not a sufficient value for min_wal_size/max_wal_size.
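A quick back-of-envelope check of that limit (assuming the GUC value is stored in a 32-bit int and capped at INT_MAX / 1024, in the style of MAX_KILOBYTES; illustrative only):

```c
#include <limits.h>
#include <stdint.h>

/*
 * Largest byte value expressible when a GUC stores at most INT_MAX/1024
 * units of unit_bytes each (a hypothetical mirror of the MAX_KILOBYTES
 * style cap, not the actual guc.c code).
 */
static int64_t
max_setting_bytes(int64_t unit_bytes)
{
    int64_t max_units = INT_MAX / 1024;

    return max_units * unit_bytes;
}
```

With KB units the cap lands just under 2GB, which is too small for min_wal_size/max_wal_size once segments grow; MB units raise it to roughly 2TB.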
While testing with pg_waldump, I got the following error.
bin/pg_waldump -p master/pg_wal/ -s 0/01000000
Floating point exception (core dumped)
Stack:
#0 0x00000000004039d6 in ReadPageInternal ()
#1 0x0000000000404c84 in XLogFindNextRecord ()
#2 0x0000000000401e08 in main ()
I think that the problem is in following code:
/* parse files as start/end boundaries, extract path if not specified */
if (optind < argc)
{
....
+ if (!RetrieveXLogSegSize(full_path))
...
}
In this case, RetrieveXLogSegSize is conditionally called. So, if the
condition is false, XLogSegSize will not be initialized.
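The crash is the classic uninitialized-divisor failure: modulo/division by a zero XLogSegSize raises SIGFPE. A minimal illustration of the failure mode and a hypothetical guard (names are stand-ins, not the patch's code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the global that stays 0 when RetrieveXLogSegSize is skipped. */
static uint64_t XLogSegSizeSketch = 0;

/*
 * Guarded offset computation: refuse to divide by an uninitialized segment
 * size instead of crashing, so the caller can initialize it or error out.
 */
static bool
seg_offset(uint64_t lsn, uint64_t *offset)
{
    if (XLogSegSizeSketch == 0)
        return false;
    *offset = lsn % XLogSegSizeSketch;
    return true;
}
```

In pg_waldump the fix would be to ensure RetrieveXLogSegSize (or an equivalent default) always runs before any segment arithmetic.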
On Wed, Mar 22, 2017 at 6:05 PM, David Steele <david@pgmasters.net> wrote:
>> Wait, really? I thought you abandoned this approach because there's
>> then no principled way to handle WAL segments of less than the default
>> size.
>
> I did say that, but I thought I had hit on a compromise.
>
> But, as I originally pointed out the hex characters in the filename are not
> aligned correctly for > 8 bits (< 16MB segments) and using different
> alignments just made it less consistent.

I don't think I understand what the compromise is. Are you saying we should have one rule for segments < 16MB and another rule for segments > 16MB? I think using two different rules for file naming depending on the segment size will be a negative for both tool authors and ordinary users.

> It would be OK if we were willing to drop the 1,2,4,8 segment sizes because
> then the alignment would make sense and not change the current 16MB
> sequence.

Well, that is true. But the thing I'm trying to do here is to keep this patch down to what actually needs to be changed in order to accomplish the original purpose, not squeeze more and more changes into it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Robert,

On 3/24/17 3:00 PM, Robert Haas wrote:
> On Wed, Mar 22, 2017 at 6:05 PM, David Steele <david@pgmasters.net> wrote:
>>> Wait, really? I thought you abandoned this approach because there's
>>> then no principled way to handle WAL segments of less than the default
>>> size.
>>
>> I did say that, but I thought I had hit on a compromise.
>>
>> But, as I originally pointed out the hex characters in the filename are not
>> aligned correctly for > 8 bits (< 16MB segments) and using different
>> alignments just made it less consistent.
>
> I don't think I understand what the compromise is. Are you saying we
> should have one rule for segments < 16MB and another rule for segments
> > 16MB? I think using two different rules for file naming depending
> on the segment size will be a negative for both tool authors and
> ordinary users.

Sorry for the confusion. I meant to say that if we want to keep LSNs in the filenames and not change the alignment for the current default, then we would need to drop support for segment sizes < 16MB (more or less what I said below). Bad editing on my part.

>> It would be OK if we were willing to drop the 1,2,4,8 segment sizes because
>> then the alignment would make sense and not change the current 16MB
>> sequence.
>
> Well, that is true. But the thing I'm trying to do here is to keep
> this patch down to what actually needs to be changed in order to
> accomplish the original purpose, not squeeze more and more changes
> into it.

Attached is a patch to be applied on top of Beena's v8 patch that preserves LSNs in the file naming for all segment sizes. It's not quite complete because it doesn't modify the lower size limit everywhere, but I think it's enough so you can see what I'm getting at. This passes check-world, and I've poked at it with other segment sizes manually.

Behavior for the current default of 16MB is unchanged, and all other sizes go through a logical progression.
1GB:

000000010000000000000040
000000010000000000000080
0000000100000000000000C0
000000010000000100000000

256MB:

000000010000000000000010
000000010000000000000020
000000010000000000000030
...
0000000100000000000000E0
0000000100000000000000F0
000000010000000100000000

64MB:

000000010000000000000004
000000010000000000000008
00000001000000000000000C
...
0000000100000000000000F8
0000000100000000000000FC
000000010000000100000000

I believe that maintaining an easy correspondence between LSN and filename is important. The cluster will not always be up to help with these calculations, and tools that do the job may not be present or may have issues.

I'm happy to merge this with Beena's patch (and tidy my patch up) if this looks like an improvement to everyone.

--
-David
david@pgmasters.net
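The progressions above can be generated mechanically: each step advances the final field by seg_size / 16MB, rolling the middle field over at the 4GB boundary. A small illustrative sketch (not the patch's code):

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Print n consecutive file names starting at LSN "start" for a given
 * segment size, using the LSN-preserving naming discussed in this thread:
 * the last field is the segment start's low 32 bits divided by 16MB.
 */
static void
print_wal_sequence(uint32_t tli, uint64_t start, uint64_t seg_size, int n)
{
    for (int i = 0; i < n; i++)
    {
        uint64_t lsn = start + (uint64_t) i * seg_size;

        printf("%08X%08X%08X\n",
               tli,
               (uint32_t) (lsn >> 32),
               (uint32_t) ((lsn & 0xFFFFFFFFULL) / (16 * 1024 * 1024)));
    }
}
```

For 64MB segments this reproduces the ...F8, ...FC, then next-logical-file 00000000 rollover shown above.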
On 3/24/17 19:13, David Steele wrote:
> Behavior for the current default of 16MB is unchanged, and all other
> sizes go through a logical progression.

Just at a glance, without analyzing the math behind it, this scheme seems super confusing.

> 1GB:
> 000000010000000000000040
> 000000010000000000000080
> 0000000100000000000000C0
> 000000010000000100000000
>
> 256MB:
> 000000010000000000000010
> 000000010000000000000020
> 000000010000000000000030
> ...
> 0000000100000000000000E0
> 0000000100000000000000F0
> 000000010000000100000000
>
> 64MB:
> 000000010000000000000004
> 000000010000000000000008
> 00000001000000000000000C
> ...
> 0000000100000000000000F8
> 0000000100000000000000FC
> 000000010000000100000000

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3/24/17 08:18, Stephen Frost wrote:
> Peter,
>
> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
>> There is a function for that.
> [...]
>> There is not a function for that, but there could be one.
>
> I'm not sure you've really considered what you're suggesting here.

Create a set-returning function that returns all the to-be-expected file names between two LSNs.

> Beyond that, this also bakes in an assumption that we would then require
> access to a database

That is a good point, but then any change to the naming whatsoever will create trouble. Then we might as well choose which specific trouble.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter,

* Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
> On 3/24/17 19:13, David Steele wrote:
> > Behavior for the current default of 16MB is unchanged, and all other
> > sizes go through a logical progression.
>
> Just at a glance, without analyzing the math behind it, this scheme
> seems super confusing.

Compared to:

1GB:

000000010000000000000001
000000010000000000000002
000000010000000000000003
000000010000000100000000

...?

Having the naming no longer match the LSN and also, seemingly, jump randomly, strikes me as very confusing. At least with the LSN-based approach, we aren't jumping randomly but exactly in line with the starting LSN of the file, and always by the same amount (in hex).

Thanks!

Stephen
Peter,

* Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
> On 3/24/17 08:18, Stephen Frost wrote:
> > Beyond that, this also bakes in an assumption that we would then require
> > access to a database
>
> That is a good point, but then any change to the naming whatsoever will
> create trouble. Then we might as well choose which specific trouble.

Right, and I'd rather we work that out before we start encouraging users to change their WAL segment size, which is what this patch will do.

Personally, I'm alright with the patch David has produced, which is pretty small, maintains the same names when 16MB segments are used, and is pretty straightforward to reason about. I do think it'll need added documentation to clarify how WAL segment names are calculated, and perhaps another function which returns the size of WAL segments on a given cluster (I don't think we have that..?), and, ideally, added regression tests or buildfarm animals which try different sizes.

On the other hand, I don't have any particular issue with the naming scheme you proposed up-thread, which uses proper separators between the components of a WAL filename, but that would change what happens with 16MB WAL segments today.

I'm still of the opinion that we should be changing the default to 64MB for WAL segments.

Thanks!

Stephen
At this point, I suggest splitting this patch up into several potentially less controversial pieces.

One big piece is that we currently don't support segment sizes larger than 64 GB, for various internal arithmetic reasons. Your patch appears to address that. So I suggest isolating that. Assuming it works correctly, I think there would be no great concern about it.

The next piece would be making the various tools aware of varying segment sizes without having to rely on a built-in value.

The third piece would then be the rest that allows you to set the size at initdb.

If we take these in order, we would make it easier to test various sizes and see if there are any more unforeseen issues when changing sizes. It would also make it easier to do performance testing so we can address the original question of what the default size should be.

One concern I have is that your patch does not contain any tests. There should probably be lots of tests.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 25 March 2017 at 17:02, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> At this point, I suggest splitting this patch up into several
> potentially less controversial pieces.
>
> One big piece is that we currently don't support segment sizes larger
> than 64 GB, for various internal arithmetic reasons. Your patch appears
> to address that. So I suggest isolating that. Assuming it works
> correctly, I think there would be no great concern about it.

+1

> The next piece would be making the various tools aware of varying
> segment sizes without having to rely on a built-in value.

Hmm

> The third piece would then be the rest that allows you to set the size
> at initdb
>
> If we take these in order, we would make it easier to test various sizes
> and see if there are any more unforeseen issues when changing sizes. It
> would also make it easier to do performance testing so we can address
> the original question of what the default size should be.
>
> One concern I have is that your patch does not contain any tests. There
> should probably be lots of tests.

This is looking like a reject in its current form. Changing the WAL filename to a new form seems the best plan, but we don't have time to do that and get it right, especially with no tests.

My summary of useful requirements would be:

* Files smaller than 16MB and larger than 16MB are desirable
* The LSN <-> filename mapping must be clear
* A new filename format is best for debugging and clarity

My proposal from here is that we allow only one new size in this release, to minimize the splash zone. Keep the filename format as it is now, using David's suggestion. I suggest adding 1GB as the only additional option, which continues the idea of having 1GB as the max filesize.

A new filename format can come in PG11, allowing a much wider range of WAL filesizes, bigger and smaller.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Mar 25, 2017 at 10:32 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> At this point, I suggest splitting this patch up into several
> potentially less controversial pieces.
>
> One big piece is that we currently don't support segment sizes larger
> than 64 GB, for various internal arithmetic reasons. Your patch appears
> to address that. So I suggest isolating that. Assuming it works
> correctly, I think there would be no great concern about it.
>
> The next piece would be making the various tools aware of varying
> segment sizes without having to rely on a built-in value.
>
> The third piece would then be the rest that allows you to set the size
> at initdb
>
> If we take these in order, we would make it easier to test various sizes
> and see if there are any more unforeseen issues when changing sizes. It
> would also make it easier to do performance testing so we can address
> the original question of what the default size should be.

PFA the patches divided into 3 parts:

02-increase-max-wal-segsize.patch - Increases the wal-segsize and changes the internal representation of max_wal_size and min_wal_size to mb.

03-modify-tools.patch - Makes XLogSegSize into a variable, currently set as XLOG_SEG_SIZE, and modifies the tools to fetch the size instead of using the inbuilt value.

04-initdb-walsegsize.patch - Adds the initdb option to set wal-segsize and makes related changes. Updates pg_test_fsync to use DEFAULT_XLOG_SEG_SIZE instead of XLOG_SEG_SIZE.

> One concern I have is that your patch does not contain any tests. There
> should probably be lots of tests.
On Tue, Mar 28, 2017 at 1:06 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> On Sat, Mar 25, 2017 at 10:32 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> At this point, I suggest splitting this patch up into several
>> potentially less controversial pieces.
>>
>> One big piece is that we currently don't support segment sizes larger
>> than 64 GB, for various internal arithmetic reasons. Your patch appears
>> to address that. So I suggest isolating that. Assuming it works
>> correctly, I think there would be no great concern about it.
>>
>> The next piece would be making the various tools aware of varying
>> segment sizes without having to rely on a built-in value.
>>
>> The third piece would then be the rest that allows you to set the size
>> at initdb
>>
>> If we take these in order, we would make it easier to test various sizes
>> and see if there are any more unforeseen issues when changing sizes. It
>> would also make it easier to do performance testing so we can address
>> the original question of what the default size should be.
>
> PFA the patches divided into 3 parts:

Thanks for splitting the patches.

01-add-XLogSegmentOffset-macro.patch is the same as before and it looks good.

> 02-increase-max-wal-segsize.patch - Increases the wal-segsize and changes
> the internal representation of max_wal_size and min_wal_size to mb.

looks good.

> 03-modify-tools.patch - Makes XLogSegSize into a variable, currently set as
> XLOG_SEG_SIZE and modifies the tools to fetch the size instead of using
> inbuilt value.

Several methods are declared and defined in different tools to fetch the size of wal-seg-size.

In pg_standby.c,
RetrieveXLogSegSize() - /* Set XLogSegSize from the WAL file header */

In pg_basebackup/streamutil.c,
RetrieveXLogSegSize(PGconn *conn) - /* set XLogSegSize using SHOW wal_segment_size */

In pg_waldump.c,
ReadXLogFromDir(char *archive_loc)
RetrieveXLogSegSize(char *archive_path) /* Scan through the archive location to set XLogSegSize from the first WAL file */

IMHO, it's better to define a single method in xlog.c which, based on the strategy given, retrieves the XLogSegSize on behalf of the different modules. I suggested the same in my first review and I'll still vote for it. For example, in xlog.c, you can define something like the following:

bool RetrieveXLogSegSize(RetrieveStrategy rs, void* ptr)

Now, based on the RetrieveStrategy (say Conn, File, Dir), you can cast the void pointer to the appropriate type. So, when a new tool needs to retrieve XLogSegSize, it can just call this function instead of defining a new RetrieveXLogSegSize method.

It's just a suggestion from my side. Is there anything I'm missing which could cause the aforesaid approach not to work? Apart from that, I've nothing to add here.

> 04-initdb-walsegsize.patch - Adds the initdb option to set wal-segsize and
> make related changes. Update pg_test_fsync to use DEFAULT_XLOG_SEG_SIZE
> instead of XLOG_SEG_SIZE

looks good.

>> One concern I have is that your patch does not contain any tests. There
>> should probably be lots of tests.
>
> 05-initdb_tests.patch adds tap tests to initialize cluster with different
> wal_segment_size and then check the config values. What other tests do you
> have in mind? Checking the various tools?

Nothing from me to add here.

I've nothing further to add for the attached set of patches. On top of these, David's patch can be used for preserving LSNs in the file naming for all segment sizes.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
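The single-entry-point API suggested here could look roughly like the following. This is a hypothetical sketch: the enum values and the per-strategy handlers are stand-ins (here just stubs) for the three existing per-tool implementations, not code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical strategy enum along the lines suggested above. */
typedef enum
{
    RETRIEVE_FROM_CONN,     /* SHOW wal_segment_size over a connection */
    RETRIEVE_FROM_FILE,     /* long page header of a given WAL file */
    RETRIEVE_FROM_DIR       /* first WAL file found in a directory */
} RetrieveStrategy;

/* Stubs standing in for the three real per-tool implementations. */
static bool from_conn(void *conn, uint32_t *sz) { (void) conn; *sz = 0; return false; }
static bool from_file(void *path, uint32_t *sz) { (void) path; *sz = 16 * 1024 * 1024; return true; }
static bool from_dir(void *dir, uint32_t *sz)   { (void) dir;  *sz = 16 * 1024 * 1024; return true; }

/* Single entry point: casts ptr per the chosen strategy and dispatches. */
static bool
RetrieveXLogSegSizeSketch(RetrieveStrategy rs, void *ptr, uint32_t *seg_size)
{
    switch (rs)
    {
        case RETRIEVE_FROM_CONN: return from_conn(ptr, seg_size);
        case RETRIEVE_FROM_FILE: return from_file(ptr, seg_size);
        case RETRIEVE_FROM_DIR:  return from_dir(ptr, seg_size);
    }
    return false;
}
```

A new tool would then call the one function with the strategy it can support, rather than defining yet another RetrieveXLogSegSize variant.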
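[Editor's note: the single-entry-point suggestion above could be sketched roughly as follows. Every name and body here is a hypothetical stand-in — in the real tools the helpers would read the WAL file header, run SHOW wal_segment_size over a connection, or scan an archive directory; this only illustrates the proposed dispatch shape.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum RetrieveStrategy
{
	RETRIEVE_FROM_CONN,		/* streamutil.c: SHOW wal_segment_size */
	RETRIEVE_FROM_FILE,		/* pg_standby: read the WAL file header */
	RETRIEVE_FROM_DIR		/* pg_waldump: first WAL file in archive dir */
} RetrieveStrategy;

static uint32_t XLogSegSize = 0;

/* Stand-ins for the per-source helpers each tool currently defines. */
static bool
seg_size_from_conn(void *conn)
{
	(void) conn;
	XLogSegSize = 16 * 1024 * 1024;		/* pretend the server said 16MB */
	return true;
}

static bool
seg_size_from_file(const char *walpath)
{
	(void) walpath;
	XLogSegSize = 16 * 1024 * 1024;		/* pretend the header said 16MB */
	return true;
}

static bool
seg_size_from_dir(const char *archive_loc)
{
	(void) archive_loc;
	XLogSegSize = 16 * 1024 * 1024;
	return true;
}

/*
 * One shared entry point: the opaque pointer is cast according to the
 * chosen strategy, so a new tool calls this instead of defining its
 * own RetrieveXLogSegSize variant.
 */
bool
RetrieveXLogSegSize(RetrieveStrategy rs, void *ptr)
{
	switch (rs)
	{
		case RETRIEVE_FROM_CONN:
			return seg_size_from_conn(ptr);
		case RETRIEVE_FROM_FILE:
			return seg_size_from_file((const char *) ptr);
		case RETRIEVE_FROM_DIR:
			return seg_size_from_dir((const char *) ptr);
	}
	return false;
}
```

The trade-off Beena raises downthread is visible even in the sketch: the void-pointer cast moves type checking to run time, and the helpers would pull tool-specific code (libpq, archive scanning) into xlog.c.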
On Fri, Mar 31, 2017 at 10:40 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> On 30 Mar 2017 15:10, "Kuntal Ghosh" <kuntalghosh.2007@gmail.com> wrote:
>> 03-modify-tools.patch - Makes XLogSegSize into a variable, currently set
>> as XLOG_SEG_SIZE and modifies the tools to fetch the size instead of
>> using inbuilt value.
>
> Several methods are declared and defined in different tools to fetch
> the size of wal-seg-size.
>
> In pg_standby.c,
> RetrieveXLogSegSize() - /* Set XLogSegSize from the WAL file header */
>
> In pg_basebackup/streamutil.c,
> RetrieveXLogSegSize(PGconn *conn) - /* set XLogSegSize using
> SHOW wal_segment_size */
>
> In pg_waldump.c,
> ReadXLogFromDir(char *archive_loc)
> RetrieveXLogSegSize(char *archive_path) /* Scan through the archive
> location to set XLogSegSize from the first WAL file */
>
> IMHO, it's better to define a single method in xlog.c and based on the
> different strategy, it can retrieve the XLogSegSize on behalf of
> different modules. I've suggested the same in my first set review and
> I'll still vote for it. For example, in xlog.c, you can define
> something as following:
> bool RetrieveXLogSegSize(RetrieveStrategy rs, void* ptr)
>
> Now based on the RetrieveStrategy (say Conn, File, Dir), you can cast
> the void pointer to the appropriate type. So, when a new tool needs to
> retrieve XLogSegSize, it can just call this function instead of
> defining a new RetrieveXLogSegSize method.
>
> It's just a suggestion from my side. Is there anything I'm missing
> which can cause the aforesaid approach not to be working?
> Apart from that, I've nothing to add here.
>
> I do not think a generalised function is a good idea. Besides, I feel the
> respective approaches are best kept in the modules used also because the
> internal code is not easily accessible by utils.

Ahh, I wonder what the reason can be. Anyway, I'll leave that decision
for the committer. I'm moving the status to Ready for committer.

I've only tested the patch in my 64-bit linux system. It needs some
testing on other environment settings.
On 27 March 2017 at 15:36, Beena Emerson <memissemerson@gmail.com> wrote:
> 02-increase-max-wal-segsize.patch - Increases the wal-segsize and changes
> the internal representation of max_wal_size and min_wal_size to mb.

Committed first part to allow internal representation change (only).

No commitment yet to increasing wal-segsize in the way this patch has it.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 5, 2017 at 3:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 27 March 2017 at 15:36, Beena Emerson <memissemerson@gmail.com> wrote:
>
>> 02-increase-max-wal-segsize.patch - Increases the wal-segsize and changes
>> the internal representation of max_wal_size and min_wal_size to mb.
>
> Committed first part to allow internal representation change (only).
>
> No commitment yet to increasing wal-segsize in the way this patch has it.
>

What part of patch you don't like and do you have any suggestions to
improve the same?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 4/4/17 22:47, Amit Kapila wrote:
>> Committed first part to allow internal representation change (only).
>>
>> No commitment yet to increasing wal-segsize in the way this patch has it.
>>
>
> What part of patch you don't like and do you have any suggestions to
> improve the same?

I think there are still some questions and disagreements about how it
should behave.

I suggest the next step is to dial up the allowed segment size in
configure and run some tests about what a reasonable maximum value could
be. I did a little bit of that, but somewhere around 256 MB, things got
really slow.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 4 April 2017 at 22:47, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 5, 2017 at 3:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 27 March 2017 at 15:36, Beena Emerson <memissemerson@gmail.com> wrote:
>>
>>> 02-increase-max-wal-segsize.patch - Increases the wal-segsize and changes
>>> the internal representation of max_wal_size and min_wal_size to mb.
>>
>> Committed first part to allow internal representation change (only).
>>
>> No commitment yet to increasing wal-segsize in the way this patch has it.
>>
>
> What part of patch you don't like and do you have any suggestions to
> improve the same?

The only part of the patch uncommitted was related to choice of WAL
file size in the config file. I've already made suggestions on that
upthread.

I'm now looking at patch 03-modify-tools.patch
* Peter's "lack of tests" comment still applies
* I think we should remove pg_standby in this release, so we don't have
to care about it
* If we change pg_resetwal then it should allow changing XLogSegSize also
* "coulnot access the archive location"

03 looks mostly OK

04 is much more of a mess
* Lots of comments and notes pre-judge what the limits and
configurability are, so its hard to commit the patches without
committing to the basic assumptions. Please look at removing all
assumptions about what the values/options are, so we can change them
later

05 adds various tests but I don't think adds enough value to commit
On 5 April 2017 at 06:04, Beena Emerson <memissemerson@gmail.com> wrote:
>>>> No commitment yet to increasing wal-segsize in the way this patch has
>>>> it.
>>>
>>> What part of patch you don't like and do you have any suggestions to
>>> improve the same?
>>
>> I think there are still some questions and disagreements about how it
>> should behave.
>
> The WALfilename - LSN mapping disruption for higher values you mean? Is
> there anything else I have missed?

I see various issues raised but not properly addressed

1. we would need to drop support for segment sizes < 16MB unless we
adopt a new incompatible filename format.
I think at 16MB the naming should be the same as now and that
WALfilename -> LSN is very important.
For this release, I think we should just allow >= 16MB and avoid the
issue thru lack of time.

2. It's not clear to me the advantage of being able to pick varying
filesizes. I see great disadvantage in having too many options, which
greatly increases the chance of incompatibility, annoyance and
breakage. I favour a small number of values that have been shown by
testing to be sweet spots in performance and usability. (1GB has been
suggested)

3. New file allocation has been a problem raised with this patch for
some months now.

Lack of clarity and/or movement on these issues is very, very close to
getting the patch rejected now. Let's not approach this with the
viewpoint that I or others want it to be rejected, lets work forwards
and get some solid changes that will improve the world without causing
problems.
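[Editor's note: the WALfilename -> LSN arithmetic Simon wants preserved is worth spelling out. For the current 16MB default it looks like the sketch below — an illustrative reconstruction using the standard 24-hex-digit naming, not actual PostgreSQL source. With 16MB segments a file name reads back directly as an LSN; change the segment size and the same name denotes a different LSN, which is the disruption under discussion.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define XLOG_SEG_SIZE	((uint64_t) 16 * 1024 * 1024)	/* current default */
/* Segments per 4GB "xlogid": 256 when segments are 16MB. */
#define XLogSegmentsPerXLogId	(UINT64_C(0x100000000) / XLOG_SEG_SIZE)

/*
 * Build the 24-hex-digit WAL file name for the segment containing 'lsn':
 * 8 digits of timeline, 8 of segno / XLogSegmentsPerXLogId, and 8 of the
 * remainder. buf must have room for 25 bytes.
 */
static void
wal_file_name(char *buf, unsigned int timeline, uint64_t lsn)
{
	uint64_t	segno = lsn / XLOG_SEG_SIZE;

	snprintf(buf, 25, "%08X%08X%08X",
			 timeline,
			 (unsigned int) (segno / XLogSegmentsPerXLogId),
			 (unsigned int) (segno % XLogSegmentsPerXLogId));
}
```

For example, LSN 0/1000000 on timeline 1 lands in segment 000000010000000000000001, which is exactly what an operator expects to read off the name today.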
On 4/5/17 06:04, Beena Emerson wrote:
> I suggest the next step is to dial up the allowed segment size in
> configure and run some tests about what a reasonable maximum value could
> be. I did a little bit of that, but somewhere around 256 MB, things got
> really slow.
>
> Would it be better if just increase the limit to 128MB for now?
> In next we can change the WAL file name format and expand the range?

I don't think me saying it felt a bit slow around 256 MB is a proper
technical analysis that should lead to the conclusion that that upper
limit should be 128 MB. ;-)

This tells me that there is a lot of explore and test here before we
should let it loose on users.

I think the best we should do now is spend a bit of time exploring
whether/how larger values of segment size behave, and bump the hardcoded
configure limit if we get positive results. Everything else should
probably be postponed.

(Roughly speaking, to get started, this would mean compiling with
--with-wal-segsize 16, 32, 64, 128, 256, run make check-world both
sequentially and in parallel, and take note of a) passing, b) run time,
c) disk space.)
On 5 April 2017 at 08:36, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 4/5/17 06:04, Beena Emerson wrote:
>> I suggest the next step is to dial up the allowed segment size in
>> configure and run some tests about what a reasonable maximum value could
>> be. I did a little bit of that, but somewhere around 256 MB, things got
>> really slow.
>>
>> Would it be better if just increase the limit to 128MB for now?
>> In next we can change the WAL file name format and expand the range?
>
> I don't think me saying it felt a bit slow around 256 MB is a proper
> technical analysis that should lead to the conclusion that that upper
> limit should be 128 MB. ;-)
>
> This tells me that there is a lot of explore and test here before we
> should let it loose on users.

Agreed

> I think the best we should do now is spend a bit of time exploring
> whether/how larger values of segment size behave, and bump the hardcoded
> configure limit if we get positive results. Everything else should
> probably be postponed.
>
> (Roughly speaking, to get started, this would mean compiling with
> --with-wal-segsize 16, 32, 64, 128, 256, run make check-world both
> sequentially and in parallel, and take note of a) passing, b) run time,
> c) disk space.)

I've committed the rest of Beena's patch to allow this testing to occur
up to 1024MB.
On 4/6/17 07:13, Beena Emerson wrote:
> Does the options 16, 64 and 1024 seem good.
> We can remove sizes below 16 as most have agreed and as per the
> discussion, 64MB and 1GB seems favoured. We could probably allow 32MB
> since it was an already allowed size?

I don't see the need to remove any options right now.
On 4/5/17 7:29 AM, Simon Riggs wrote:
> On 5 April 2017 at 06:04, Beena Emerson <memissemerson@gmail.com> wrote:
>>
>> The WALfilename - LSN mapping disruption for higher values you mean? Is
>> there anything else I have missed?
>
> I see various issues raised but not properly addressed
>
> 1. we would need to drop support for segment sizes < 16MB unless we
> adopt a new incompatible filename format.
> I think at 16MB the naming should be the same as now and that
> WALfilename -> LSN is very important.
> For this release, I think we should just allow >= 16MB and avoid the
> issue thru lack of time.

+1.

> 2. It's not clear to me the advantage of being able to pick varying
> filesizes. I see great disadvantage in having too many options, which
> greatly increases the chance of incompatibility, annoyance and
> breakage. I favour a small number of values that have been shown by
> testing to be sweet spots in performance and usability. (1GB has been
> suggested)

I'm in favor of 16,64,256,1024.

> 3. New file allocation has been a problem raised with this patch for
> some months now.

I've been playing around with this and I don't think short tests show
larger sizes off to advantage. Larger segments will definitely perform
more poorly until Postgres starts recycling WAL. Once that happens I
think performance differences should be negligible, though of course
this needs to be verified with longer-running tests.

If archive_timeout is set then there will also be additional time taken
to zero out potentially larger unused parts of the segment. I don't see
this as an issue, however, because if archive_timeout is being triggered
then the system is very likely under lower than usual load.

> Lack of clarity and/or movement on these issues is very, very close to
> getting the patch rejected now. Let's not approach this with the
> viewpoint that I or others want it to be rejected, lets work forwards
> and get some solid changes that will improve the world without causing
> problems.

I would definitely like to see this go in, though I agree with Peter
that we need a lot more testing. I'd like to see some build farm
animals testing with different sizes at the very least, even if there's
no time to add new tests.

--
-David
david@pgmasters.net
On 04/06/2017 08:33 PM, David Steele wrote:
> On 4/5/17 7:29 AM, Simon Riggs wrote:
>> On 5 April 2017 at 06:04, Beena Emerson <memissemerson@gmail.com> wrote:
>>>
>>> The WALfilename - LSN mapping disruption for higher values you mean? Is
>>> there anything else I have missed?
>>
>> I see various issues raised but not properly addressed
>>
>> 1. we would need to drop support for segment sizes < 16MB unless we
>> adopt a new incompatible filename format.
>> I think at 16MB the naming should be the same as now and that
>> WALfilename -> LSN is very important.
>> For this release, I think we should just allow >= 16MB and avoid the
>> issue thru lack of time.
>
> +1.
>
>> 2. It's not clear to me the advantage of being able to pick varying
>> filesizes. I see great disadvantage in having too many options, which
>> greatly increases the chance of incompatibility, annoyance and
>> breakage. I favour a small number of values that have been shown by
>> testing to be sweet spots in performance and usability. (1GB has been
>> suggested)
>
> I'm in favor of 16,64,256,1024.

I don't see a particular reason for this, TBH. The sweet spots will be
likely dependent hardware / OS configuration etc. Assuming there
actually are sweet spots - no one demonstrated that yet.

Also, I don't see how supporting additional WAL sizes increases chance
of incompatibility. We already allow that, so either the tools (e.g.
backup solutions) assume WAL segments are always 16MB (in which case are
essentially broken) or support valid file sizes (in which case they
should have no issues with the new ones).

If we're going to do this, I'm in favor of deciding some reasonable
upper limit (say, 1GB or 2GB sounds good), and allowing all 2^n values
up to that limit.

>> 3. New file allocation has been a problem raised with this patch for
>> some months now.
>
> I've been playing around with this and I don't think short tests show
> larger sizes off to advantage. Larger segments will definitely perform
> more poorly until Postgres starts recycling WAL. Once that happens I
> think performance differences should be negligible, though of course
> this needs to be verified with longer-running tests.

I'm willing to do some extensive performance testing on the patch. I
don't see how that could happen in the next few days (before the feature
freeze), particularly considering we're interested in long tests.

The question however is whether we need to do this testing when we don't
actually change the default (at least the patch submitted on 3/27 does
seem to keep the 16MB). I assume people specifying a custom value when
calling initdb are expected to know what they are doing (and I don't see
how we can prevent distros from choosing a bad value in their packages -
they could already do that with configure-time option).

> If archive_timeout is set then there will also be additional time taken
> to zero out potentially larger unused parts of the segment. I don't see
> this as an issue, however, because if archive_timeout is being triggered
> then the system is very likely under lower than usual load.
>
>> Lack of clarity and/or movement on these issues is very, very close to
>> getting the patch rejected now. Let's not approach this with the
>> viewpoint that I or others want it to be rejected, lets work forwards
>> and get some solid changes that will improve the world without causing
>> problems.
>
> I would definitely like to see this go in, though I agree with Peter
> that we need a lot more testing. I'd like to see some build farm
> animals testing with different sizes at the very least, even if there's
> no time to add new tests.

Do we actually have any infrastructure for that? Or do you plan to add
some new animals with different WAL segment sizes?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
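[Editor's note: Tomas's "all 2^n values up to a limit" rule is easy to state precisely. The sketch below assumes Simon's 16MB lower bound and a 1GB upper limit — both still under discussion — and the function name is illustrative, not from the patch.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WAL_SEG_MIN_SIZE	((uint64_t) 16 * 1024 * 1024)	/* >= 16MB per Simon */
#define WAL_SEG_MAX_SIZE	((uint64_t) 1024 * 1024 * 1024)	/* 1GB cap, assumed */

/* Valid iff size is a power of two within [min, max]. */
static bool
is_valid_wal_segment_size(uint64_t size)
{
	/* power of two: exactly one bit set */
	bool	power_of_two = (size != 0) && ((size & (size - 1)) == 0);

	return power_of_two &&
		   size >= WAL_SEG_MIN_SIZE &&
		   size <= WAL_SEG_MAX_SIZE;
}
```

Under this rule initdb would accept 16, 32, 64, 128, 256, 512, and 1024 MB, and reject everything else, without hardcoding a short list of blessed values.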
On 4/6/17 5:05 PM, Tomas Vondra wrote:
> On 04/06/2017 08:33 PM, David Steele wrote:
>> On 4/5/17 7:29 AM, Simon Riggs wrote:
>>
>>> 2. It's not clear to me the advantage of being able to pick varying
>>> filesizes. I see great disadvantage in having too many options, which
>>> greatly increases the chance of incompatibility, annoyance and
>>> breakage. I favour a small number of values that have been shown by
>>> testing to be sweet spots in performance and usability. (1GB has been
>>> suggested)
>>
>> I'm in favor of 16,64,256,1024.
>
> I don't see a particular reason for this, TBH. The sweet spots will be
> likely dependent hardware / OS configuration etc. Assuming there
> actually are sweet spots - no one demonstrated that yet.

Fair enough, but my feeling is that this patch has never been about
server performance, per se. Rather, it is about archive management and
trying to stem the tide of WAL as servers get bigger and busier.
Generally, archive commands have to make a remote connection to offload
WAL and that has a cost per segment.

> Also, I don't see how supporting additional WAL sizes increases chance
> of incompatibility. We already allow that, so either the tools (e.g.
> backup solutions) assume WAL segments are always 16MB (in which case are
> essentially broken) or support valid file sizes (in which case they
> should have no issues with the new ones).

I don't see how a compile-time option counts as "supporting that" in
practice. How many people in the field are running custom builds of
Postgres? And of those, how many have changed the WAL segment size?
I've never encountered a non-standard segment size or talked to anyone
who has. I'm not saying it has *never* happened but I would venture to
say it's rare.

> If we're going to do this, I'm in favor of deciding some reasonable
> upper limit (say, 1GB or 2GB sounds good), and allowing all 2^n values
> up to that limit.

I'm OK with that. I'm also OK with providing a few reasonable choices.
I guess that means I'll just go with the majority opinion.

>>> 3. New file allocation has been a problem raised with this patch for
>>> some months now.
>>
>> I've been playing around with this and I don't think short tests show
>> larger sizes off to advantage. Larger segments will definitely perform
>> more poorly until Postgres starts recycling WAL. Once that happens I
>> think performance differences should be negligible, though of course
>> this needs to be verified with longer-running tests.
>
> I'm willing to do some extensive performance testing on the patch. I
> don't see how that could happen in the next few days (before the feature
> freeze), particularly considering we're interested in long tests.

Cool. I've been thinking about how to do some meaningful tests for this
(mostly pgbench related). I'd like to hear what you are thinking.

> The question however is whether we need to do this testing when we don't
> actually change the default (at least the patch submitted on 3/27 does
> seem to keep the 16MB). I assume people specifying a custom value when
> calling initdb are expected to know what they are doing (and I don't see
> how we can prevent distros from choosing a bad value in their packages -
> they could already do that with configure-time option).

Just because we don't change the default doesn't mean that others won't.
I still think testing for sizes other than 16MB is severely lacking and
I don't believe caveat emptor is the way to go.

> Do we actually have any infrastructure for that? Or do you plan to add
> some new animals with different WAL segment sizes?

I don't have plans to add animals. I think we'd need a way to tell
'make check' to use a different segment size for tests and then
hopefully reconfigure some of the existing animals.
On 04/06/2017 11:45 PM, David Steele wrote:
> On 4/6/17 5:05 PM, Tomas Vondra wrote:
>> On 04/06/2017 08:33 PM, David Steele wrote:
>>> On 4/5/17 7:29 AM, Simon Riggs wrote:
>>>
>>>> 2. It's not clear to me the advantage of being able to pick varying
>>>> filesizes. I see great disadvantage in having too many options, which
>>>> greatly increases the chance of incompatibility, annoyance and
>>>> breakage. I favour a small number of values that have been shown by
>>>> testing to be sweet spots in performance and usability. (1GB has been
>>>> suggested)
>>>
>>> I'm in favor of 16,64,256,1024.
>>
>> I don't see a particular reason for this, TBH. The sweet spots will be
>> likely dependent hardware / OS configuration etc. Assuming there
>> actually are sweet spots - no one demonstrated that yet.
>
> Fair enough, but my feeling is that this patch has never been about
> server performance, per se. Rather, it is about archive management and
> trying to stem the tide of WAL as servers get bigger and busier.
> Generally, archive commands have to make a remote connection to offload
> WAL and that has a cost per segment.

Perhaps, although Robert also mentioned that the fsync at the end of
each WAL segment is noticeable. But the thread is a bit difficult to
follow, different people have different ideas about the motivation of
the patch, etc.

>> Also, I don't see how supporting additional WAL sizes increases chance
>> of incompatibility. We already allow that, so either the tools (e.g.
>> backup solutions) assume WAL segments are always 16MB (in which case are
>> essentially broken) or support valid file sizes (in which case they
>> should have no issues with the new ones).
>
> I don't see how a compile-time option counts as "supporting that" in
> practice. How many people in the field are running custom builds of
> Postgres? And of those, how many have changed the WAL segment size?
> I've never encountered a non-standard segment size or talked to anyone
> who has. I'm not saying it has *never* happened but I would venture to
> say it's rare.

I agree it's rare, but I don't think that means we can just consider the
option as 'unsupported'. We're even mentioning it in the docs as a valid
way to customize granularity of the WAL archival.

I certainly know people who run custom builds, and some of them run with
custom WAL segment size. Some of them are our customers, some are not.
And yes, some of them actually patched the code to allow 256MB WAL
segments.

>> If we're going to do this, I'm in favor of deciding some reasonable
>> upper limit (say, 1GB or 2GB sounds good), and allowing all 2^n values
>> up to that limit.
>
> I'm OK with that. I'm also OK with providing a few reasonable choices.
> I guess that means I'll just go with the majority opinion.
>
>>>> 3. New file allocation has been a problem raised with this patch for
>>>> some months now.
>>>
>>> I've been playing around with this and I don't think short tests show
>>> larger sizes off to advantage. Larger segments will definitely perform
>>> more poorly until Postgres starts recycling WAL. Once that happens I
>>> think performance differences should be negligible, though of course
>>> this needs to be verified with longer-running tests.
>>
>> I'm willing to do some extensive performance testing on the patch. I
>> don't see how that could happen in the next few days (before the feature
>> freeze), particularly considering we're interested in long tests.
>
> Cool. I've been thinking about how to do some meaningful tests for this
> (mostly pgbench related). I'd like to hear what you are thinking.

My plan was to do some pgbench tests with different workloads, scales
(in shared buffers, in RAM, exceeds RAM), and different storage
configurations (SSD vs. HDD, WAL/datadir on the same/different
device/fs, possibly also ext4/xfs).

>> The question however is whether we need to do this testing when we don't
>> actually change the default (at least the patch submitted on 3/27 does
>> seem to keep the 16MB). I assume people specifying a custom value when
>> calling initdb are expected to know what they are doing (and I don't see
>> how we can prevent distros from choosing a bad value in their packages -
>> they could already do that with configure-time option).
>
> Just because we don't change the default doesn't mean that others won't.
> I still think testing for sizes other than 16MB is severely lacking and
> I don't believe caveat emptor is the way to go.

Aren't you mixing regression and performance testing? I agree we need to
be sure all segment sizes are handled correctly, no argument here.

>> Do we actually have any infrastructure for that? Or do you plan to add
>> some new animals with different WAL segment sizes?
>
> I don't have plans to add animals. I think we'd need a way to tell
> 'make check' to use a different segment size for tests and then
> hopefully reconfigure some of the existing animals.

OK. My point was that we don't have that capability now, and the latest
patch is not adding it either.

regards
On 4/6/17 6:52 PM, Tomas Vondra wrote:
> On 04/06/2017 11:45 PM, David Steele wrote:
>>
>> How many people in the field are running custom builds of
>> Postgres? And of those, how many have changed the WAL segment size?
>> I've never encountered a non-standard segment size or talked to anyone
>> who has. I'm not saying it has *never* happened but I would venture to
>> say it's rare.
>
> I agree it's rare, but I don't think that means we can just consider the
> option as 'unsupported'. We're even mentioning it in the docs as a valid
> way to customize granularity of the WAL archival.
>
> I certainly know people who run custom builds, and some of them run with
> custom WAL segment size. Some of them are our customers, some are not.
> And yes, some of them actually patched the code to allow 256MB WAL
> segments.

I feel a little better knowing that *somebody* is doing it in the field.

>> Just because we don't change the default doesn't mean that others won't.
>> I still think testing for sizes other than 16MB is severely lacking and
>> I don't believe caveat emptor is the way to go.
>
> Aren't you mixing regression and performance testing? I agree we need to
> be sure all segment sizes are handled correctly, no argument here.

Not intentionally. Our standard test suite is only regression as far as
I can see. All the performance testing I've seen has been done ad hoc.

>> I don't have plans to add animals. I think we'd need a way to tell
>> 'make check' to use a different segment size for tests and then
>> hopefully reconfigure some of the existing animals.
>
> OK. My point was that we don't have that capability now, and the latest
> patch is not adding it either.

Agreed.
(Roughly speaking, to get started, this would mean compiling with
--with-wal-segsize 16, 32, 64, 128, 256, run make check-world both
sequentially and in parallel, and take note of a) passing, b) run time,
c) disk space.)
The following are the results of the installcheck-world execution.
Run 1:

wal_size   time        cluster_size   pg_wal   files
16         5m32.761s   539M           417M     26
32         5m32.618s   545M           417M     13
64         5m39.685s   571M           449M     7
128        5m52.961s   641M           513M     4
256        6m13.402s   635M           512M     2
512        7m3.252s    1.2G           1G       2
1024       9m0.205s    1.2G           1G       1

Run 2:

wal_size   time        cluster_size   pg_wal   files
16         5m31.137s   542M           417M     26
32         5m29.532s   539M           417M     13
64         5m36.189s   571M           449M     7
128        5m50.027s   635M           513M     4
256        6m13.603s   635M           512M     2
512        7m4.154s    1.2G           1G       2
1024       8m55.357s   1.2G           1G       1
Every test passed except connect/test5 in src/interfaces/ecpg.
We can see that smaller segments take less time for the same amount of WAL (compare 128 with 256, and 512 with 1024). But these tests are not large enough to draw firm conclusions.
On 04/06/2017 08:33 PM, David Steele wrote:
> I'm in favor of 16, 64, 256, 1024.
I don't see a particular reason for this, TBH. The sweet spots will likely depend on hardware / OS configuration etc. Assuming there actually are sweet spots - no one has demonstrated that yet.
Also, I don't see how supporting additional WAL sizes increases the chance of incompatibility. We already allow that, so either the tools (e.g. backup solutions) assume WAL segments are always 16MB (in which case they are essentially broken) or they support all valid file sizes (in which case they should have no issues with the new ones).
If we're going to do this, I'm in favor of deciding some reasonable upper limit (say, 1GB or 2GB sounds good), and allowing all 2^n values up to that limit.
I did not get any degradation, in fact, higher values showed performance improvement for higher client count.
On 4/7/17 2:59 AM, Beena Emerson wrote:
> I ran tests and following are the details:
>
> Machine details:
> Architecture:          ppc64le
> Byte Order:            Little Endian
> CPU(s):                192
> On-line CPU(s) list:   0-191
> Thread(s) per core:    8
> Core(s) per socket:    1
> Socket(s):             24
> NUMA node(s):          4
> Model:                 IBM,8286-42A
>
> clients>   16           32           64           128
> size
> 16MB       18895.63486  28799.48759  37855.39521  27968.88309
> 32MB       18313.1461   29201.44954  40733.80051  32458.74147
> 64MB       18055.73141  30875.28687  42713.54447  38009.60542
> 128MB      18234.31424  33208.65419  48604.5593   45498.27689
> 256MB      19524.36498  35740.19032  54686.16898  54060.11168
> 512MB      20351.90719  37426.72174  55045.60719  56194.99349
> 1024MB     19667.67062  35696.19194  53666.60373  54353.0614
>
> I did not get any degradation, in fact, higher values showed performance
> improvement for higher client count.

This submission has been moved to CF 2017-07. -- -David david@pgmasters.net
Hello, On Tue, Mar 28, 2017 at 1:06 AM, Beena Emerson <memissemerson@gmail.com> wrote: > Hello, > > On Sat, Mar 25, 2017 at 10:32 PM, Peter Eisentraut > <peter.eisentraut@2ndquadrant.com> wrote: >> >> At this point, I suggest splitting this patch up into several >> potentially less controversial pieces. >> >> One big piece is that we currently don't support segment sizes larger >> than 64 GB, for various internal arithmetic reasons. Your patch appears >> to address that. So I suggest isolating that. Assuming it works >> correctly, I think there would be no great concern about it. >> >> The next piece would be making the various tools aware of varying >> segment sizes without having to rely on a built-in value. >> >> The third piece would then be the rest that allows you to set the size >> at initdb >> >> If we take these in order, we would make it easier to test various sizes >> and see if there are any more unforeseen issues when changing sizes. It >> would also make it easier to do performance testing so we can address >> the original question of what the default size should be. > > > PFA the patches divided into 3 parts: > > 02-increase-max-wal-segsize.patch - Increases the wal-segsize and changes > the internal representation of max_wal_size and min_wal_size to mb. Already committed. > > 03-modify-tools.patch - Makes XLogSegSize into a variable, currently set as > XLOG_SEG_SIZE and modifies the tools to fetch the size instead of using > inbuilt value. > The updated 03-modify-tools_v2.patch has the following changes: - Rebased over current HEAD - Improved comments - Added error messages where applicable. - Replaced XLOG_SEG_SIZE in the tools and xlog_internal.h with XLogSegSize. XLOG_SEG_SIZE is the wal_segment_size the server was compiled with and XLogSegSize is the wal_segment_size of the target server on which the tool is run.
When the binaries used and the target server are compiled with different wal_segment_size, the calculations would be affected and the tool would crash. To avoid this, all the calculations used by the tools should use XLogSegSize. - pg_waldump : The fuzzy_open_file is split into two functions - open_file_in_directory and identify_target_directory so that code can be reused when determining the XLogSegSize from the WAL file header. - IsValidXLogSegSize macro is moved from 04 to here so that we can use it for validating the size in all the tools. > 04-initdb-walsegsize.patch - Adds the initdb option to set wal-segsize and > make related changes. Update pg_test_fsync to use DEFAULT_XLOG_SEG_SIZE > instead of XLOG_SEG_SIZE The 04-initdb-walsegsize_v2.patch has the following improvements: - Rebased over new 03 patch - Pass the wal-segsize initdb option as a command-line option rather than in an environment variable. - Since new function check_wal_size had only two checks and was used once, moved the code to ReadControlFile where it is used and removed this function. - Improve comments and add validations where required. - Use DEFAULT_XLOG_SEG_SIZE to set the min_wal_size and max_wal_size, instead of the value 16. - Use XLogSegMaxSize and XLogSegMinSize to calculate the range of guc wal_segment_size instead of 16 - INT_MAX. > >> >> One concern I have is that your patch does not contain any tests. There >> should probably be lots of tests. > > 05-initdb_tests.patch adds tap tests to initialize a cluster with different > wal_segment_size and then check the config values. What other tests do you > have in mind? Checking the various tools? > > -- Beena Emerson EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 07/06/2017 12:04 PM, Beena Emerson wrote: > The 04-initdb-walsegsize_v2.patch has the following improvements: > - Rebased over new 03 patch > - Pass the wal-segsize initdb option as a command-line option rather > than in an environment variable. > - Since new function check_wal_size had only two checks and was > used once, moved the code to ReadControlFile where it is used and > removed this function. > - Improve comments and add validations where required. > - Use DEFAULT_XLOG_SEG_SIZE to set the min_wal_size and > max_wal_size, instead of the value 16. > - Use XLogSegMaxSize and XLogSegMinSize to calculate the range of guc > wal_segment_size instead of 16 - INT_MAX.

Thanks Beena. I tested the following scenarios and all are working as expected:
.) Different WAL segment sizes, i.e. 1, 2, 8, 16, 32, 64, 512, 1024, at the time of initdb
.) SR setup in place
.) Combinations of max/min_wal_size in postgresql.conf with different wal_segment_size
.) Shutting down the server forcefully (kill -9) / promoting the slave, to make sure recovery happened successfully
.) Different utilities like pg_resetwal/pg_upgrade/pg_waldump
.) Running pg_bench for substantial workloads (~1 hour)
.) A non-default WAL segment size (changed at the time of ./configure) + a different wal_segment_size (at the time of initdb)

Things look fine. -- regards, tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
On Thu, Jul 6, 2017 at 3:21 PM, tushar <tushar.ahuja@enterprisedb.com> wrote: > On 07/06/2017 12:04 PM, Beena Emerson wrote: >> >> The 04-initdb-walsegsize_v2.patch has the following improvements: >> - Rebased over new 03 patch >> - Pass the wal-segsize intidb option as command-line option rathern >> than in an environment variable. >> - Since new function check_wal_size had only had two checks and was >> sed once, moved the code to ReadControlFile where it is used and >> removed this function. >> - improve comments and add validations where required. >> - Use DEFAULT_XLOG_SEG_SIZE to set the min_wal_size and >> max_wal_size,instead of the value 16. >> - Use XLogSegMaxSize and XLogSegMinSize to calculate the range of guc >> wal_segment_size instead 16 - INT_MAX. > > Thanks Beena. I tested with below following scenarios and all are working > as expected > .)Different WAL segment size i.e 1,2,8,16,32,64,512,1024 at the time of > initdb > .)SR setup in place. > .)Combinations of max/min_wal_size in postgresql.conf with different > wal_segment_size. > .)shutdown the server forcefully (kill -9) / promote slave / to make sure > -recovery happened successfully. > .)with different utilities like pg_resetwal/pg_upgrade/pg_waldump > .)running pg_bench for substantial workloads (~ 1 hour) > .)WAL segment size is not default (changed at the time of ./configure) + > different wal_segment_size (at the time of initdb) . > > Things looks fine. > Thank you Tushar. -- Beena Emerson EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, Personally I find the split between 03 and 04 and their naming a bit confusing. I'd rather just merge them. These patches need a rebase, they don't apply anymore. On 2017-07-06 12:04:12 +0530, Beena Emerson wrote: > @@ -4813,6 +4836,18 @@ XLOGShmemSize(void) > { > char buf[32]; > > + /* > + * The calculation of XLOGbuffers requires the run-time parameter > + * XLogSegSize which is set from the control file. This value is > + * required to create the shared memory segment. Hence, temporarily > + * allocate space for reading the control file. > + */ This makes me uncomfortable. Having to read the control file multiple times seems wrong. We're effectively treating the control file as part of the configuration now, and that means we should move its parsing to an earlier part of startup. > + if (!IsBootstrapProcessingMode()) > + { > + ControlFile = palloc(sizeof(ControlFileData)); > + ReadControlFile(); > + pfree(ControlFile); General plea: Please reset to NULL in cases like this, otherwise the pointer will [temporarily] point into a freed memory location, which makes debugging harder. > @@ -8146,6 +8181,9 @@ InitXLOGAccess(void) > ThisTimeLineID = XLogCtl->ThisTimeLineID; > Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode()); > > + /* set XLogSegSize */ > + XLogSegSize = ControlFile->xlog_seg_size; > + Hm, why do we have two variables keeping track of the segment size? wal_segment_size and XLogSegSize? That's bound to lead to confusion. > /* Use GetRedoRecPtr to copy the RedoRecPtr safely */ > (void) GetRedoRecPtr(); > /* Also update our copy of doPageWrites.
*/ > diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c > index b3f0b3c..d2c524b 100644 > --- a/src/backend/bootstrap/bootstrap.c > +++ b/src/backend/bootstrap/bootstrap.c > @@ -19,6 +19,7 @@ > > #include "access/htup_details.h" > #include "access/xact.h" > +#include "access/xlog_internal.h" > #include "bootstrap/bootstrap.h" > #include "catalog/index.h" > #include "catalog/pg_collation.h" > @@ -47,6 +48,7 @@ > #include "utils/tqual.h" > > uint32 bootstrap_data_checksum_version = 0; /* No checksum */ > +uint32 XLogSegSize; So we define the same variable declared in a header in multiple files (once for each application)? I'm pretty strongly against that. Why's that needed/a good idea? Neither is it clear to me why the definition was moved from xlog.c to bootstrap.c? That doesn't sound like a natural place. > /* > + * Calculate the default wal_size in proper unit. > + */ > +static char * > +pretty_wal_size(int segment_count) > +{ > + double val = wal_segment_size / (1024 * 1024) * segment_count; > + double temp_val; > + char *result = malloc(10); > + > + /* > + * Wal segment size ranges from 1MB to 1GB and the default > + * segment_count is 5 for min_wal_size and 64 for max_wal_size, so the > + * final values can range from 5MB to 64GB. > + */ Referencing the defaults here seems unnecessary. And nearly a guarantee that the values in the comment will go out of date soon-ish. > + /* set default max_wal_size and min_wal_size */ > + snprintf(repltok, sizeof(repltok), "min_wal_size = %s", > + pretty_wal_size(DEFAULT_MIN_WAL_SEGS)); > + conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok); > + > + snprintf(repltok, sizeof(repltok), "max_wal_size = %s", > + pretty_wal_size(DEFAULT_MAX_WAL_SEGS)); > + conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok); > + Hm. So postgresql.conf.sample values are now going to contain misleading information for clusters with non-default segment sizes.
Have we discussed instead defaulting min_wal_size/max_wal_size to a constant amount of megabytes and rounding up when it doesn't work for a particular segment size? > diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h > index 9c0039c..c805f12 100644 > --- a/src/include/access/xlog_internal.h > +++ b/src/include/access/xlog_internal.h > @@ -91,6 +91,11 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader; > */ > > extern uint32 XLogSegSize; > +#define XLOG_SEG_SIZE XLogSegSize I don't think this is a good idea, we should rather rip the bandaid off and remove this macro. If people are assuming it's a macro they'll just run into more confusing errors/problems. > diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h > index f3b3529..f31c30e 100644 > --- a/src/include/pg_config_manual.h > +++ b/src/include/pg_config_manual.h > @@ -14,6 +14,12 @@ > */ > > /* > + * This is default value for WAL_segment_size to be used at intidb when run > + * without --walsegsize option. > + */ WAL_segment_size is a bit weirdly cased... > diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c > index d7fa2a8..279728d 100644 > --- a/contrib/pg_standby/pg_standby.c > +++ b/contrib/pg_standby/pg_standby.c > @@ -33,9 +33,12 @@ > #include "pg_getopt.h" > > #include "access/xlog_internal.h" > +#include "access/xlogreader.h" > > const char *progname; > > +uint32 XLogSegSize; > + > /* Options and defaults */ > int sleeptime = 5; /* amount of time to sleep between file checks */ > int waittime = -1; /* how long we have been waiting, -1 no wait > @@ -100,6 +103,72 @@ int nextWALFileType; > > struct stat stat_buf; > > +static bool SetWALFileNameForCleanup(void); > + > +/* Set XLogSegSize from the WAL file specified by WALFilePath */ Hm. Why don't we instead accept the segment size as a parameter expanded in restore_command? Then this magic isn't necessary. This won't be the only command needing it.
> +static bool > +RetrieveXLogSegSize() Please add void as argument. > -#define MaxSegmentsPerLogFile ( 0xFFFFFFFF / XLOG_SEG_SIZE ) > - > static void > CustomizableCleanupPriorWALFiles(void) > { > @@ -315,6 +384,7 @@ SetWALFileNameForCleanup(void) > uint32 log_diff = 0, > seg_diff = 0; > bool cleanup = false; > + int MaxSegmentsPerLogFile = (0xFFFFFFFF / XLogSegSize); Inconsistent variable naming here. > /* > + * From version 10, explicitly set XLogSegSize using SHOW wal_segment_size > + * since ControlFile is not accessible here. > + */ > +bool > +RetrieveXLogSegSize(PGconn *conn) > +{ > + /* wal_segment_size ranges from 1MB to 1GB */ > + tmp_result = pg_strdup(PQgetvalue(res, 0, 0)); Why strdup if we just do a sscanf? > +/* > + * Try to find fname in the given directory. Returns true if it is found, > + * false otherwise. If fname is NULL, search the complete directory for any > + * file with a valid WAL file name. > + */ > +static bool > +search_directory(char *directory, char *fname) This doesn't mention an important fact, namely that this routine tries to figure out XLogSegSize from the file... This is kind of an ugly approach, but I don't see anything really simpler. > +/* check that the given size is a valid XLogSegSize */ > +#define IsPowerOf2(x) (((x) & ((x)-1)) == 0) Not that it really matters here, but this isn't correct for 0 I believe. > +#define IsValidXLogSegSize(size) \ > + (IsPowerOf2(size) && \ > + (size >= XLogSegMinSize && size <= XLogSegMaxSize)) > + Please wrap references to size in parens. Should we consider making this an inline function instead? There's some multiple evaluation hazard here... > +#define XLogSegmentsPerXLogId (UINT64CONST(0x100000000) / XLogSegSize) > > #define XLogSegNoOffsetToRecPtr(segno, offset, dest) \ > - (dest) = (segno) * XLOG_SEG_SIZE + (offset) > + (dest) = (segno) * XLogSegSize + (offset) I don't think it's a good idea to implicitly reference a global variable in such a macro. 
IOW, I think this needs to grow another parameter, and callers should get adjusted. I know this'll affect a number of macros, but it still seems like the right thing to do. I'd welcome other opinions on this. Greetings, Andres Freund
PFA the updated patches. On Wed, Aug 16, 2017 at 1:55 AM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > Personally I find the split between 03 and 04 and their naming a bit > confusing. I'd rather just merge them. These patches need a rebase, > they don't apply anymore. 01 is rebased. 04 and 03 are now merged into 02-initdb-configurable-walsegsize.patch. It also fixes an issue on Windows: the XLogSegSize is now passed through BackendParameters so the values are available during process forking. > > > On 2017-07-06 12:04:12 +0530, Beena Emerson wrote: >> @@ -4813,6 +4836,18 @@ XLOGShmemSize(void) >> { >> char buf[32]; >> >> + /* >> + * The calculation of XLOGbuffers requires the run-time parameter >> + * XLogSegSize which is set from the control file. This value is >> + * required to create the shared memory segment. Hence, temporarily >> + * allocate space for reading the control file. >> + */ > > This makes me uncomfortable. Having to read the control file multiple > times seems wrong. We're effectively treating the control file as part > of the configuration now, and that means we should move its parsing to > an earlier part of startup. Yes, this may seem ugly. ControlFile was originally read into the shared memory segment, but we now need the XLogSegSize from the ControlFile to initialise the shared memory segment. I could not figure out any other way to achieve this. > > >> + if (!IsBootstrapProcessingMode()) >> + { >> + ControlFile = palloc(sizeof(ControlFileData)); >> + ReadControlFile(); >> + pfree(ControlFile); > > General plea: Please reset to NULL in cases like this, otherwise the > pointer will [temporarily] point into a freed memory location, which > makes debugging harder. Done.
> > > >> @@ -8146,6 +8181,9 @@ InitXLOGAccess(void) >> ThisTimeLineID = XLogCtl->ThisTimeLineID; >> Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode()); >> >> + /* set XLogSegSize */ >> + XLogSegSize = ControlFile->xlog_seg_size; >> + > > Hm, why do we have two variables keeping track of the segment size? > wal_segment_size and XLogSegSize? That's bound to lead to confusion. > wal_segment_size is the guc which stores the number of segments (XLogSegSize / XLOG_BLCKSZ). > >> /* Use GetRedoRecPtr to copy the RedoRecPtr safely */ >> (void) GetRedoRecPtr(); >> /* Also update our copy of doPageWrites. */ >> diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c >> index b3f0b3c..d2c524b 100644 >> --- a/src/backend/bootstrap/bootstrap.c >> +++ b/src/backend/bootstrap/bootstrap.c >> @@ -19,6 +19,7 @@ >> >> #include "access/htup_details.h" >> #include "access/xact.h" >> +#include "access/xlog_internal.h" >> #include "bootstrap/bootstrap.h" >> #include "catalog/index.h" >> #include "catalog/pg_collation.h" >> @@ -47,6 +48,7 @@ >> #include "utils/tqual.h" >> >> uint32 bootstrap_data_checksum_version = 0; /* No checksum */ >> +uint32 XLogSegSize; > > Se we define the same variable declared in a header in multiple files > (once for each applicationq)? I'm pretty strongly against that. Why's > that needed/a good idea? Neither is it clear to me why the definition > was moved from xlog.c to bootstrap.c? That doesn't sound like a natural > place. I have moved back to xlog.c. > > >> /* >> + * Calculate the default wal_size in proper unit. >> + */ >> +static char * >> +pretty_wal_size(int segment_count) >> +{ >> + double val = wal_segment_size / (1024 * 1024) * segment_count; >> + double temp_val; >> + char *result = malloc(10); >> + >> + /* >> + * Wal segment size ranges from 1MB to 1GB and the default >> + * segment_count is 5 for min_wal_size and 64 for max_wal_size, so the >> + * final values can range from 5MB to 64GB. 
>> + */ > > Referencing the defaults here seems unnecessary. And nearly a guarantee > that the values in the comment will go out of date soon-ish. Removed the comment. > > >> + /* set default max_wal_size and min_wal_size */ >> + snprintf(repltok, sizeof(repltok), "min_wal_size = %s", >> + pretty_wal_size(DEFAULT_MIN_WAL_SEGS)); >> + conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok); >> + >> + snprintf(repltok, sizeof(repltok), "max_wal_size = %s", >> + pretty_wal_size(DEFAULT_MAX_WAL_SEGS)); >> + conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok); >> + > > Hm. So postgresql.conf.sample values are now going to contain misleading > information for clusters with non-default segment sizes. > > Have we discussed instead defaulting min_wal_size/max_wal_size to a > constant amount of megabytes and rounding up when it doesn't work for > a particular segment size? This was not discussed. In the original code, the min_wal_size and max_wal_size are computed in the guc.c for any wal_segment_size set at configure. { {"min_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS, gettext_noop("Sets the minimum size to shrink the WAL to."), NULL, GUC_UNIT_MB }, &min_wal_size_mb, 5 * (XLOG_SEG_SIZE / (1024 * 1024)), 2, MAX_KILOBYTES, NULL, NULL, NULL }, { {"max_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS, gettext_noop("Sets the WAL size that triggers a checkpoint."), NULL, GUC_UNIT_MB }, &max_wal_size_mb, 64 * (XLOG_SEG_SIZE / (1024 * 1024)), 2, MAX_KILOBYTES, NULL, assign_max_wal_size, NULL }, Hence I have retained the same calculation for min_wal_size and max_wal_size. If we get consensus for fixing a default and updating when required, then I will change the code accordingly. 
> > >> diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h >> index 9c0039c..c805f12 100644 >> --- a/src/include/access/xlog_internal.h >> +++ b/src/include/access/xlog_internal.h >> @@ -91,6 +91,11 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader; >> */ >> >> extern uint32 XLogSegSize; >> +#define XLOG_SEG_SIZE XLogSegSize > > I don't think this is a good idea, we should rather rip the bandaid > of and remove this macro. If people are assuming it's a macro they'll > just run into more confusing errors/problems. > Okay. done. > >> diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h >> index f3b3529..f31c30e 100644 >> --- a/src/include/pg_config_manual.h >> +++ b/src/include/pg_config_manual.h >> @@ -14,6 +14,12 @@ >> */ >> >> /* >> + * This is default value for WAL_segment_size to be used at intidb when run >> + * without --walsegsize option. >> + */ > > WAL_segment_size is a bit weirdly cased... corrected. > > >> diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c >> index d7fa2a8..279728d 100644 >> --- a/contrib/pg_standby/pg_standby.c >> +++ b/contrib/pg_standby/pg_standby.c >> @@ -33,9 +33,12 @@ >> #include "pg_getopt.h" >> >> #include "access/xlog_internal.h" >> +#include "access/xlogreader.h" >> >> const char *progname; >> >> +uint32 XLogSegSize; >> + >> /* Options and defaults */ >> int sleeptime = 5; /* amount of time to sleep between file checks */ >> int waittime = -1; /* how long we have been waiting, -1 no wait >> @@ -100,6 +103,72 @@ int nextWALFileType; >> >> struct stat stat_buf; >> >> +static bool SetWALFileNameForCleanup(void); >> + >> +/* Set XLogSegSize from the WAL file specified by WALFilePath */ > > Hm. Why don't we instead accept the segment size as a parameter expanded > in restore_command? Then this magic isn't necessary. This won't be the > only command needing it. > > >> +static bool >> +RetrieveXLogSegSize() > > Please add void as argument. done. 
>> -#define MaxSegmentsPerLogFile ( 0xFFFFFFFF / XLOG_SEG_SIZE ) >> - >> static void >> CustomizableCleanupPriorWALFiles(void) >> { >> @@ -315,6 +384,7 @@ SetWALFileNameForCleanup(void) >> uint32 log_diff = 0, >> seg_diff = 0; >> bool cleanup = false; >> + int MaxSegmentsPerLogFile = (0xFFFFFFFF / XLogSegSize); > > Inconsistent variable naming here. XLOG_SEG_SIZE is now removed so we can only use the variable XLogSegSize. > >> /* >> + * From version 10, explicitly set XLogSegSize using SHOW wal_segment_size >> + * since ControlFile is not accessible here. >> + */ >> +bool >> +RetrieveXLogSegSize(PGconn *conn) >> +{ > >> + /* wal_segment_size ranges from 1MB to 1GB */ >> + tmp_result = pg_strdup(PQgetvalue(res, 0, 0)); > > Why strdup if we just do a sscanf? Fixed. > >> +/* >> + * Try to find fname in the given directory. Returns true if it is found, >> + * false otherwise. If fname is NULL, search the complete directory for any >> + * file with a valid WAL file name. >> + */ >> +static bool >> +search_directory(char *directory, char *fname) > > This doesn't mention an important fact, namely that this routine tries > to figure out XLogSegSize from the file... Added to the comment. > > This is kind of an ugly approach, but I don't see anything really simpler. > >> +/* check that the given size is a valid XLogSegSize */ >> +#define IsPowerOf2(x) (((x) & ((x)-1)) == 0) > > Not that it really matters here, but this isn't correct for 0 I > believe. > >> +#define IsValidXLogSegSize(size) \ >> + (IsPowerOf2(size) && \ >> + (size >= XLogSegMinSize && size <= XLogSegMaxSize)) >> + > > Please wrap references to size in parens. Added the parens. > > Should we consider making this an inline function instead? There's > some multiple evaluation hazard here... 
> > >> +#define XLogSegmentsPerXLogId (UINT64CONST(0x100000000) / XLogSegSize) >> >> #define XLogSegNoOffsetToRecPtr(segno, offset, dest) \ >> - (dest) = (segno) * XLOG_SEG_SIZE + (offset) >> + (dest) = (segno) * XLogSegSize + (offset) > > I don't think it's a good idea to implicitly reference a global variable > in such a macro. IOW, I think this needs to grow another parameter, and > callers should get adjusted. I know this'll affect a number of macros, > but it still seems like the right thing to do. I'd welcome other > opinions on this. I have added this change in the separate patch (03-modify-xlog-macros.patch) since it touches a lot of files. -- Beena Emerson EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2017-08-23 12:13:15 +0530, Beena Emerson wrote: > >> + /* > >> + * The calculation of XLOGbuffers requires the run-time parameter > >> + * XLogSegSize which is set from the control file. This value is > >> + * required to create the shared memory segment. Hence, temporarily > >> + * allocate space for reading the control file. > >> + */ > > > > This makes me uncomfortable. Having to read the control file multiple > > times seems wrong. We're effectively treating the control file as part > > of the configuration now, and that means we should move its parsing to > > an earlier part of startup. > > Yes, this may seem ugly. ControlFile was originally read into the > shared memory segment but then we now need the XLogSegSize from the > ControlFile to initialise the shared memory segment. I could not > figure out any other way to achieve this. I think reading it once into local memory inside the startup process and then copying it into shared memory from there should work? > >> @@ -8146,6 +8181,9 @@ InitXLOGAccess(void) > >> ThisTimeLineID = XLogCtl->ThisTimeLineID; > >> Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode()); > >> > >> + /* set XLogSegSize */ > >> + XLogSegSize = ControlFile->xlog_seg_size; > >> + > > > > Hm, why do we have two variables keeping track of the segment size? > > wal_segment_size and XLogSegSize? That's bound to lead to confusion. > > > > wal_segment_size is the guc which stores the number of segments > (XLogSegSize / XLOG_BLCKSZ). wal_segment_size and XLogSegSize are the same name, spelt differently, so if that's where we want to go, we should name them differently. But perhaps more fundamentally, I don't see why we need both: What stops us from just defining the GUC in bytes? Regards, Andres
On Wed, Aug 30, 2017 at 4:43 AM, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2017-08-23 12:13:15 +0530, Beena Emerson wrote: >> >> + /* >> >> + * The calculation of XLOGbuffers requires the run-time parameter >> >> + * XLogSegSize which is set from the control file. This value is >> >> + * required to create the shared memory segment. Hence, temporarily >> >> + * allocate space for reading the control file. >> >> + */ >> > >> > This makes me uncomfortable. Having to read the control file multiple >> > times seems wrong. We're effectively treating the control file as part >> > of the configuration now, and that means we should move its parsing to >> > an earlier part of startup. >> >> Yes, this may seem ugly. ControlFile was originally read into the >> shared memory segment but then we now need the XLogSegSize from the >> ControlFile to initialise the shared memory segment. I could not >> figure out any other way to achieve this. > > I think reading it once into local memory inside the startup process and > then copying it into shared memory from there should work? Done. > >> >> @@ -8146,6 +8181,9 @@ InitXLOGAccess(void) >> >> ThisTimeLineID = XLogCtl->ThisTimeLineID; >> >> Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode()); >> >> >> >> + /* set XLogSegSize */ >> >> + XLogSegSize = ControlFile->xlog_seg_size; >> >> + >> > >> > Hm, why do we have two variables keeping track of the segment size? >> > wal_segment_size and XLogSegSize? That's bound to lead to confusion. >> > >> >> wal_segment_size is the guc which stores the number of segments >> (XLogSegSize / XLOG_BLCKSZ). > > wal_segment_size and XLogSegSize are the same name, spelt differently, so > if that's where we want to go, we should name them differently. But > perhaps more fundamentally, I don't see why we need both: What stops us > from just defining the GUC in bytes?
I made a few changes for this: - Make XLogSegSize int instead of uint32 - Add a GUC_UNIT_BYT for the unit conversion so that show wal_segment_size displays user-friendly values. - track_activity_query_size unit is set to GUC_UNIT_BYT. This was initially null because we did not have a unit for bytes. This may not be necessary as it changes the output of the SHOW command. -- Beena Emerson EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, I was looking to commit this, but the changes I made ended up being pretty large. Here's what I changed in the attached:
- split GUC_UNIT_BYTE into a separate commit, squashed rest
- renamed GUC_UNIT_BYT to GUC_UNIT_BYTE, don't see why we'd have such a weird abbreviation?
- bumped control file version, otherwise things wouldn't work correctly
- wal_segment_size text still said "Shows the number of pages per write ahead log segment."
- I still feel strongly that exporting XLogSegSize, which previously was a macro and is now an integer variable, is a bad idea. Hence I've renamed it to wal_segment_size.
- There still were comments referencing XLOG_SEG_SIZE
- IsPowerOf2 regarded 0 as a valid power of two
- ConvertToXSegs() depended on a variable not passed as arg, bad idea.
- As previously mentioned, I don't think it's ok to rely on vars like XLogSegSize to be defined both in backend and frontend code.
- I don't think XLogReader can rely on XLogSegSize, needs to be parametrized.
- pg_rewind exported another copy of extern int XLogSegSize
- streamutil.h had an extern uint32 WalSegsz; but used RetrieveXlogSegSize, that seems needlessly different
- moved wal_segment_size (aka XLogSegSize) to xlog.h
- pg_standby included xlogreader, not sure why?
- MaxSegmentsPerLogFile still had a conflicting naming scheme
- you'd included "sys/stat.h", that's not really appropriate for system headers, should be <sys/stat.h> (and then grouped w/ rest)
- pg_controldata's warning about an invalid segsize missed newlines

Unresolved:
- this needs some new performance tests, the number of added instructions isn't trivial. Don't think there's anything, but ...
- read through it again, check long lines
- pg_standby's RetrieveWALSegSize() does too much for its name. It seems quite weird that a function named that way has the section below "/* check if clean up is necessary */"
- the way you redid the ReadControlFile() invocation doesn't quite seem right.
  Consider what happens if XLOGbuffers isn't -1: then we wouldn't read the control file, but you unconditionally copy it in XLOGShmemInit(). I think we should instead introduce something like XLOGPreShmemInit() that reads the control file unless in bootstrap mode, and then get rid of the second ReadControlFile() already present.
- in pg_resetwal.c:ReadControlFile() we ignore the file contents if there's an invalid segment size, but accept the contents as guessed if there's a crc failure - that seems a bit weird?
- verify EXEC_BACKEND does the right thing
- not this commit/patch, but XLogReadDetermineTimeline() could really use some simplifying of repetitive expressions
- XLOGShmemInit shouldn't memcpy to temp_cfile and such; why not just save the previous pointer in a local variable?
- could you fill in the Reviewed-By: line in the commit message?

Running out of concentration / time now.

- Andres
Attachment
Hello,

On Wed, Sep 6, 2017 at 7:37 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> I was looking to commit this, but the changes I made ended up being
> pretty large. Here's what I changed in the attached:
> - split GUC_UNIT_BYTE into a separate commit, squashed rest
> - renamed GUC_UNIT_BYT to GUC_UNIT_BYTE, don't see why we'd have such a
>   weird abbreviation?
> - bumped control file version, otherwise things wouldn't work correctly
> - wal_segment_size text still said "Shows the number of pages per write
>   ahead log segment."
> - I still feel strongly that exporting XLogSegSize, which previously was
>   a macro and now a integer variable, is a bad idea. Hence I've renamed
>   it to wal_segment_size.
> - There still were comments referencing XLOG_SEG_SIZE
> - IsPowerOf2 regarded 0 as a valid power of two
> - ConvertToXSegs() depended on a variable not passed as arg, bad idea.
> - As previously mentioned, I don't think it's ok to rely on vars like
>   XLogSegSize to be defined both in backend and frontend code.
> - I don't think XLogReader can rely on XLogSegSize, needs to be
>   parametrized.
> - pg_rewind exported another copy of extern int XLogSegSize
> - streamutil.h had a extern uint32 WalSegsz; but used
>   RetrieveXlogSegSize, that seems needlessly different
> - moved wal_segment_size (aka XLogSegSize) to xlog.h
> - pg_standby included xlogreader, not sure why?
> - MaxSegmentsPerLogFile still had a conflicting naming scheme
> - you'd included "sys/stat.h", that's not really appropriate for system
>   headers, should be <sys/stat.h> (and then grouped w/ rest)
> - pg_controldata's warning about an invalid segsize missed newlines

Thank you.

> Unresolved:
> - this needs some new performance tests, the number of added instructions
>   isn't trivial. Don't think there's anything, but ...

I will give out the results soon.

> - read through it again, check long lines

I have broken the long lines where necessary and applied pgindent as well.
> - pg_standby's RetrieveWALSegSize() does too much for it's name. It
>   seems quite weird that a function named that way has the section below
>   "/* check if clean up is necessary */"

We set two cleanup-related variables once WalSegSize is set, namely need_cleanup and exclusiveCleanupFileName. Does SetWALSegSizeAndCleanupValues look good?

> - the way you redid the ReadControlFile() invocation doesn't quite seem
>   right. Consider what happens if XLOGbuffers isn't -1 - then we
>   wouldn't read the control file, but you unconditionally copy it in
>   XLOGShmemInit(). I think we instead should introduce something like
>   XLOGPreShmemInit() that reads the control file unless in bootstrap
>   mode. Then get rid of the second ReadControlFile() already present.

I did not think it was necessary to create a new function; I have simply added the check and function call within XLOGShmemInit().

> - In pg_resetwal.c:ReadControlFile() we ignore the file contents if
>   there's an invalid segment size, but accept the contents as guessed if
>   there's a crc failure - that seems a bit weird?

I have changed the behaviour to treat it as guessed and also modified the error message.

> - verify EXEC_BACKEND does the right thing
> - not this commit/patch, but XLogReadDetermineTimeline() could really
>   use some simplifying of repetitive expressions

I will check this.

> - XLOGShmemInit shouldn't memcpy to temp_cfile and such, why not just
>   save previous pointer in a local variable?

Done.

> - could you fill in the Reviewed-By: line in the commit message?

I have added the names in alphabetical order.

Kindly check the attached v2 patch.

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Wed, Sep 6, 2017 at 8:24 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello,
>
> On Wed, Sep 6, 2017 at 7:37 AM, Andres Freund <andres@anarazel.de> wrote:
>> Hi,
>
>> Unresolved:
>> - this needs some new performance tests, the number of added instructions
>>   isn't trivial. Don't think there's anything, but ...
>
> I will give out the results soon.

Performance tests: the following results are the median of 3 runs for 32 and 56 clients/threads on a pgbench database of scale 300, each run of 900s (15 min), for various WAL segment sizes and shared_buffers = 8GB.

Following is the % difference of the performance of the patched code (initdb wal-segsize) over the original code (configure wal-segsize):

   size   |  c_32 |  c_56
----------+-------+-------
 4MB      |  1.11 | -0.18
 8MB      |  0.00 | -1.56
 16MB     |  0.79 |  0.23
 64MB     |  0.89 |  0.28
 1024MB   | -1.29 | -0.09

Median values:

   size   | 32_original | 32_patched  | 56_original | 56_patched
----------+-------------+-------------+-------------+-------------
 4MB      | 83999.06142 | 84933.78919 | 95667.13483 | 95492.21335
 8MB      | 84949.08195 | 84947.35953 | 96584.13828 | 95081.37257
 16MB     | 84155.40321 | 84820.98328 | 95697.53134 | 95914.98814
 64MB     | 84496.2927  | 85245.70758 | 96307.95222 | 96581.1183
 1024MB   | 76230.39323 | 75247.03348 | 92495.18142 | 92410.59222

We can conclude that there is not much difference.

[1] Previous performance results: https://www.postgresql.org/message-id/CAOG9ApESjqYm2VQWxNrZAKySzVo-vDw2JWhDqYQStzD%2BgwRUiA%40mail.gmail.com

>> - read through it again, check long lines
>
> I have broken the long lines where necessary and applied pgindent as well.
>
>> - pg_standby's RetrieveWALSegSize() does too much for it's name. It
>>   seems quite weird that a function named that way has the section below
>>   "/* check if clean up is necessary */"
>
> we set 2 cleanup related variables once WalSegSize is set, namely
> need_cleanup and exclusiveCleanupFileName.
> Does SetWALSegSizeAndCleanupValues look good?
>
>> - the way you redid the ReadControlFile() invocation doesn't quite seem
>>   right. Consider what happens if XLOGbuffers isn't -1 - then we
>>   wouldn't read the control file, but you unconditionally copy it in
>>   XLOGShmemInit(). I think we instead should introduce something like
>>   XLOGPreShmemInit() that reads the control file unless in bootstrap
>>   mode. Then get rid of the second ReadControlFile() already present.
>
> I did not think it was necessary to create a new function, I have
> simply added the check and function call within the XLOGShmemInit().
>
>> - In pg_resetwal.c:ReadControlFile() we ignore the file contents if
>>   there's an invalid segment size, but accept the contents as guessed if
>>   there's a crc failure - that seems a bit weird?
>
> I have changed the behaviour to treat it as guessed and also modified
> the error message.
>
>> - verify EXEC_BACKEND does the right thing

Ashutosh Sharma has verified this and confirms that there are no issues.

>> - not this commit/patch, but XLogReadDetermineTimeline() could really
>>   use some simplifying of repetitive expressions
>
> I will check this.
>
>> - XLOGShmemInit shouldn't memcpy to temp_cfile and such, why not just
>>   save previous pointer in a local variable?
>
> done.
>
>> - could you fill in the Reviewed-By: line in the commit message?
>
> I have added the names in alphabetical order.
>
> Kindly check the attached v2 patch.

PFA the rebased patch.

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

On 2017-09-06 20:24:16 +0530, Beena Emerson wrote:
>> - pg_standby's RetrieveWALSegSize() does too much for it's name. It
>>   seems quite weird that a function named that way has the section below
>>   "/* check if clean up is necessary */"
>
> we set 2 cleanup related variables once WalSegSize is set, namely
> need_cleanup and exclusiveCleanupFileName. Does
> SetWALSegSizeAndCleanupValues look good?

It's better, but see below.

>> - the way you redid the ReadControlFile() invocation doesn't quite seem
>>   right. Consider what happens if XLOGbuffers isn't -1 - then we
>>   wouldn't read the control file, but you unconditionally copy it in
>>   XLOGShmemInit(). I think we instead should introduce something like
>>   XLOGPreShmemInit() that reads the control file unless in bootstrap
>>   mode. Then get rid of the second ReadControlFile() already present.
>
> I did not think it was necessary to create a new function, I have
> simply added the check and function call within the XLOGShmemInit().

Which is wrong. XLogShmemSize() already needs to know the actual size, otherwise we allocate the wrong shmem size. You may sometimes succeed nevertheless because we leave some slop of unused shared memory space, but it's not ok to rely on that. See the refactoring I did in 0001.

Changes:

- refactored the way the control file is handled, moved it to a separate phase. I wrote this last and it's late, so I'm not yet fully confident in it, but it survives plain and EXEC_BACKEND builds. This also gets rid of ferrying wal_segment_size through the EXEC_BACKEND variable stuff, which didn't really do much, given how many other parts weren't carried over.
- renamed all the non-postgres-binary versions of wal_segment_size to WalSegSz; diverging seems pointless, and WalSegsz seems inconsistent.
- changed the malloc in pg_waldump's search_directory() to a stack allocation. Less because of efficiency, more because there wasn't any error handling.
- removed redundant char * -> XLogPageHeader -> XLogLongPageHeader casting.
- replaced the new malloc with pg_malloc in initdb (failure handling!)
- replaced the floating point logic in pretty_wal_size with an, imo much simpler, (sz % 1024) == 0 check
- it's inconsistent that the new code for pg_standby was added to the top of the file, where all the customizable stuff resides.
- other small changes

Issues:

- I think the pg_standby stuff isn't correct. And it's hard to understand. Consider the case where the first file restored is *not* a timeline history file, but *also* not a complete file. We'll start to spew "not enough data in file" errors and such, which we previously didn't. My preferred solution would be to remove pg_standby ([1]), but that's probably not quick enough. Unless we can quickly agree on that, I think we need to refactor this a bit; I've done so in the attached, but it's untested. Could you please verify it works and, if not, fix it up?

What do you think?

Regards,

Andres

[1] http://archives.postgresql.org/message-id/20170913064824.rqflkadxwpboabgw%40alap3.anarazel.de
Attachment
On Wed, Sep 13, 2017 at 2:58 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-09-06 20:24:16 +0530, Beena Emerson wrote:
>> > - pg_standby's RetrieveWALSegSize() does too much for it's name. It
>> >   seems quite weird that a function named that way has the section below
>> >   "/* check if clean up is necessary */"
>>
>> we set 2 cleanup related variables once WalSegSize is set, namely
>> need_cleanup and exclusiveCleanupFileName. Does
>> SetWALSegSizeAndCleanupValues look good?
>
> It's better, but see below.
>
>> > - the way you redid the ReadControlFile() invocation doesn't quite seem
>> >   right. Consider what happens if XLOGbuffers isn't -1 - then we
>> >   wouldn't read the control file, but you unconditionally copy it in
>> >   XLOGShmemInit(). I think we instead should introduce something like
>> >   XLOGPreShmemInit() that reads the control file unless in bootstrap
>> >   mode. Then get rid of the second ReadControlFile() already present.
>>
>> I did not think it was necessary to create a new function, I have
>> simply added the check and function call within the XLOGShmemInit().
>
> Which is wrong. XLogShmemSize() already needs to know the actual size,
> otherwise we allocate the wrong shmem size. You may sometimes succeed
> nevertheless because we leave some slop unused shared memory space, but
> it's not ok to rely on. See the refactoring I did in 0001.
>
> Changes:
> - refactored the way the control file was handled, moved it to separate
>   phase. I wrote this last and it's late, so I'm not yet fully confident
>   in it, but it survives plain and EXEC_BACKEND builds. This also gets
>   rid of ferrying wal_segment_size through the EXEC_BACKEND variable
>   stuff, which didn't really do much, given how many other parts weren't
>   carried over.
> - renamed all the non-postgres binary version of wal_segment_size to
>   WalSegSz, diverging seems pointless, and the WalSegsz seems
>   inconsistent.
> - changed malloc in pg_waldump's search_directory() to a stack
>   allocation. Less because of efficiency, more because there wasn't any
>   error handling.
> - removed redundant char * -> XLogPageHeader -> XLogLongPageHeader casting.
> - replace new malloc with pg_malloc in initdb (failure handling!)
> - replaced the floating point logic in pretty_wal_size with a, imo much
>   simpler, (sz % 1024) == 0
> - it's inconsistent that the new code for pg_standby was added to the
>   top of the file, where all the customizable stuff resides.
> - other small changes
>
> Issues:
>
> - I think the pg_standby stuff isn't correct. And it's hard to
>   understand. Consider the case where the first file restored is *not* a
>   timeline history file, but *also* not a complete file. We'll start to
>   spew "not enough data in file" errors and such, which we previously
>   didn't. My preferred solution would be to remove pg_standby ([1]),
>   but that's probably not quick enough. Unless we can quickly agree on
>   that, I think we need to refactor this a bit, I've done so in the
>   attached, but it's untested. Could you please verify it works and if
>   not fix it up?
>
> What do you think?

The change looks good and is working as expected. PFA the updated patch after running pgindent.

Thank you,
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

On 2017-09-14 11:31:33 +0530, Beena Emerson wrote:
> The change looks good and is working as expected.
> PFA the updated patch after running pgindent.

I've pushed this version. Yay! Thanks for the work Beena, everyone!

The only change I made is to run the pg_upgrade tests with a 1 MB segment size, as discussed in [1]. We'll probably want to refine that, but let's discuss that in the other thread.

Regards,

Andres

[1] http://archives.postgresql.org/message-id/20170919175457.liz3oreqiambuhca%40alap3.anarazel.de