Thread: Proposal: For Auto-Prewarm
On 10/27/16 6:39 AM, Mithun Cy wrote:
> # pg_autoprewarm.

IMO it would be better to add this functionality to pg_prewarm instead of creating another contrib module. That would reduce confusion and reduce the amount of code necessary.

+ cmp_member_elem(database);
+ cmp_member_elem(spcNode);
+ cmp_member_elem(filenode);
+ cmp_member_elem(forknum);
+ cmp_member_elem(blocknum);

Presumably the first 4 numbers will vary far less than blocknum, so it's probably worth reversing the order here (or at least put blocknum first).

I didn't look at the two load functions since presumably they'd go away/change significantly if this was combined with pg_prewarm.

+ if (!block_info_array)
+     elog(ERROR, "Out of Memory!");

AFAICT this isn't necessary since palloc will error itself if it fails.

+ snprintf(transient_dump_file_path, sizeof(dump_file_path),
+          "%s.save.tmp", DEFAULT_DUMP_FILENAME);

Since there's no method to change DEFAULT_DUMP_FILENAME, I would call it what it is: DUMP_FILENAME. Also, maybe worth an assert to make sure there was enough room for the complete filename. That'd need to be a real check if this was configurable anyway.

+ if (!avoid_dumping)
+     dump_now();

Perhaps that check should be moved into dump_now() itself...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)  mobile: 512-569-9461
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Mithun Cy
> # pg_autoprewarm.
>
> This a PostgreSQL contrib module which automatically dump all of the
> blocknums present in buffer pool at the time of server shutdown (smart and
> fast mode only, to be enhanced to dump at regular interval.) and load these
> blocks when server restarts.

I welcome this feature! I remember pg_hibernate did this. I wonder what happened to pg_hibernate. Did you check it?

Regards
Takayuki Tsunakawa
On Fri, Oct 28, 2016 at 5:15 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 10/27/16 6:39 AM, Mithun Cy wrote:
>>
>> # pg_autoprewarm.
>
> IMO it would be better to add this functionality to pg_prewarm instead of
> creating another contrib module. That would reduce confusion and reduce the
> amount of code necessary.
>
> + cmp_member_elem(database);
> + cmp_member_elem(spcNode);
> + cmp_member_elem(filenode);
> + cmp_member_elem(forknum);
> + cmp_member_elem(blocknum);
>
> Presumably the first 4 numbers will vary far less than blocknum, so it's
> probably worth reversing the order here (or at least put blocknum first).
>
> I didn't look at the two load functions since presumably they'd go
> away/change significantly if this was combined with pg_prewarm.
>
> + if (!block_info_array)
> +     elog(ERROR, "Out of Memory!");
> AFAICT this isn't necessary since palloc will error itself if it fails.
>
> + snprintf(transient_dump_file_path, sizeof(dump_file_path),
> +          "%s.save.tmp", DEFAULT_DUMP_FILENAME);
> Since there's no method to change DEFAULT_DUMP_FILENAME, I would call it
> what it is: DUMP_FILENAME.
>
> Also, maybe worth an assert to make sure there was enough room for the
> complete filename. That'd need to be a real check if this was configurable
> anyway.
>
> + if (!avoid_dumping)
> +     dump_now();
> Perhaps that check should be moved into dump_now() itself...

As this picked up my curiosity...

+/*
+ * ReadBufferForPrewarm -- This new interface is for pg_autoprewarm.
+ */
+Buffer
+ReadBufferForPrewarm(SMgrRelation smgr, char relpersistence,
+                     ForkNumber forkNum, BlockNumber blockNum,
+                     ReadBufferMode mode, BufferAccessStrategy strategy)
+{
+    bool hit;
+
+    return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
+                             mode, strategy, &hit);
+}

May be better to just expose ReadBuffer_common or rename it.
--
Michael
On Fri, Oct 28, 2016 at 1:45 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 10/27/16 6:39 AM, Mithun Cy wrote:
>>
>> # pg_autoprewarm.
>
> IMO it would be better to add this functionality to pg_prewarm instead of
> creating another contrib module.
>

There is not much common functionality between the two. The main job of this feature is to dump the contents of shared buffers and reload them at startup, and it takes care not to override the existing shared buffer contents while reloading the dump file. This is somewhat different from what pg_prewarm provides, which is to load relation blocks into memory or buffers; it has nothing to do with the prior shared buffer contents. So, I am not sure if it is a good idea to add it as a functionality of the prewarm module. OTOH, if more people want it that way, then as such there is no harm in doing it that way.

One point that seems to be worth discussing is when should the buffer information be dumped to a file? Shall we dump at each checkpoint, or at regular intervals via the auto prewarm worker, or at some other time?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-10-28 12:59:58 +0530, Amit Kapila wrote:
> On Fri, Oct 28, 2016 at 1:45 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> > On 10/27/16 6:39 AM, Mithun Cy wrote:
> >>
> >> # pg_autoprewarm.
> >
> > IMO it would be better to add this functionality to pg_prewarm instead of
> > creating another contrib module.
> >
>
> There is not much common functionality between the two.

I don't really agree. For me manual and automated prewarming are pretty closely related. Sure they have their independent usages, and not too much code might be shared. But grouping them in the same extension seems to make sense; it's confusing to have closely related but independent extensions.

> One point that seems to be worth discussing is when should the buffer
> information be dumped to a file? Shall we dump at each checkpoint or
> at regular intervals via auto prewarm worker or at some other time?

Should probably be at some regular interval - not sure if checkpoints are the best time (or if it's even realistic to tie a bgworker to checkpoints), since checkpoints have a significant impact on the state of shared_buffers.

Greetings,

Andres Freund
On Fri, Oct 28, 2016 at 3:40 AM, Andres Freund <andres@anarazel.de> wrote:
>>
>> There is not much common functionality between the two.
>
> I don't really agree. For me manual and automated prewarming are pretty
> closely related. Sure they have their independent usages, and not too
> much code might be shared. But grouping them in the same extension seems
> to make sense, it's confusing to have closely related but independent
> extensions.

I agree that putting them together would be fine.

>> One point that seems to be worth discussing is when should the buffer
>> information be dumped to a file? Shall we dump at each checkpoint or
>> at regular intervals via auto prewarm worker or at some other time?
>
> Should probably be at some regular interval - not sure if checkpoints
> are the best time (or if it's even realistic to tie a bgworker to
> checkpoints), since checkpoints have a significant impact on the state
> of shared_buffers.

Checkpoints don't cause any buffer replacement, which I think is what would be relevant here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> you assigned as reviewer to the current patch in the 11-2016 commitfest.
> But you haven't shared your review yet. Please share your views about
> the patch. This will help us in smoother operation of commitfest.
>

Thanks for the reminder. Mithun has not provided a patch addressing the comments upthread. I am waiting for his response to those comments.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Sorry I took some time on this; please find the latest patch with the review comments addressed. Apart from the fixes for those comments, I have introduced a new GUC variable for pg_autoprewarm, "buff_dump_interval". So now we dump the buffer pool metadata every buff_dump_interval seconds. Currently buff_dump_interval can be set only at startup time. I did not choose to do the dumping at checkpoint time, as it appeared these 2 things are not much related and keeping them independent would be nice for usage. Also, the overhead of any communication between them can be avoided.

On Fri, Oct 28, 2016 at 1:45 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> IMO it would be better to add this functionality to pg_prewarm instead of
> creating another contrib module. That would reduce confusion and reduce
> the amount of code necessary.

I have merged the pg_autoprewarm module into pg_prewarm. This is just a directory merge; a functionality merge is not possible. pg_prewarm is just a utility function with a specific signature to load a specific relation at runtime, whereas pg_autoprewarm is a bgworker which dumps the current buffer pool and loads it during startup.

> Presumably the first 4 numbers will vary far less than blocknum, so it's
> probably worth reversing the order here (or at least put blocknum first).

The function sort_cmp_func is for qsort, so an ordered comparison is needed to say which element is bigger or smaller. If we put blocknum first, we cannot decide that.

> AFAICT this isn't necessary since palloc will error itself if it fails.

Fixed.

> Since there's no method to change DEFAULT_DUMP_FILENAME, I would call it
> what it is: DUMP_FILENAME.

Fixed.

> Also, maybe worth an assert to make sure there was enough room for the
> complete filename. That'd need to be a real check if this was configurable
> anyway.

If we make it configurable, I think I can put that check in.

> + if (!avoid_dumping)
> +     dump_now();
> Perhaps that check should be moved into dump_now() itself...

Fixed.
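[For readers following along, a minimal sketch of the kind of ordered qsort comparator being discussed might look like the following; the struct name and layout are illustrative rather than taken from the patch.]

    /* Illustrative sketch of an ordered comparator over dumped block identifiers. */
    typedef struct BlockInfoRecord
    {
        Oid         database;
        Oid         spcNode;
        Oid         filenode;
        int32       forknum;
        uint32      blocknum;
    } BlockInfoRecord;

    /* Return -1/0/1 so qsort() gets a total order, not just an equality test. */
    #define cmp_member_elem(fld) \
    do { \
        if (a->fld < b->fld) \
            return -1; \
        else if (a->fld > b->fld) \
            return 1; \
    } while (0)

    static int
    sort_cmp_func(const void *p, const void *q)
    {
        const BlockInfoRecord *a = (const BlockInfoRecord *) p;
        const BlockInfoRecord *b = (const BlockInfoRecord *) q;

        cmp_member_elem(database);
        cmp_member_elem(spcNode);
        cmp_member_elem(filenode);
        cmp_member_elem(forknum);
        cmp_member_elem(blocknum);
        return 0;
    }

Comparing blocknum last keeps blocks of the same relation fork adjacent and in block order, which is what later makes the loading pass mostly sequential I/O.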
I took a look at this again, and it doesn't appear to be working for me. The library is being loaded during startup, but I don't see any further activity in the log, and I don't see an autoprewarm file in $PGDATA.

There needs to be some kind of documentation change as part of this patch.

I'm not sure the default GUC setting of 0 makes sense. If you've loaded the module, presumably you want it to be running. I think it'd be nice if the GUC had a -1 setting that meant to use checkpoint_timeout.

Having the GUC be restart-only is also pretty onerous. I don't think it'd be hard to make the worker respond to a reload... there's code in the autovacuum launcher you could use as an example.

I'm also wondering if this really needs to be a permanently running process... perhaps the worker could simply be started as necessary? Though maybe that means it wouldn't run at shutdown. Also not sure if it could be relaunched when a reload happens.

I'm marking this as waiting on author for now, because it's not working for me and needs documentation.

The new status of this patch is: Waiting on Author
On Tue, Jan 24, 2017 at 5:07 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I took a look at this again, and it doesn't appear to be working for me. The library is being loaded during startup, but I don't see any further activity in the log, and I don't see an autoprewarm file in $PGDATA.
Hi Jim,
Thanks for looking into this patch, I just downloaded the patch and
applied same to the latest code, I can see the file "autoprewarm.save" in
$PGDATA which is created and dumped at shutdown time and an activity
is logged as below
2017-01-24 13:22:25.012 IST [91755] LOG: Buffer Dump: saved metadata
of 59 blocks.
In my code by default, we only dump at shutdown time. If we want to
dump at regular interval then we need to set the GUC
pg_autoprewarm.buff_dump_interval to > 0. I think I am missing
something while trying to recreate the bug reported above. Can you
please let me know what exactly you mean by the library is not
working.
> There needs to be some kind of documentation change as part of this patch.
Thanks, I will add a sgml for same.
For remaining suggestions, I will try to address in my next patch
based on its feasibility.
1. I think the README was maintained as is from the 1st version and still says pg_autoprewarm is a contrib module. It should be rewritten to say that pg_autoprewarm is part of the pg_prewarm module. The documentation should be added to pgprewarm.sgml instead of the README.
2. buff_dump_interval could be renamed to just dump_interval or buffer_dump_interval. Also, since it is now part of pg_prewarm, I think it makes sense to have the conf parameter be pg_prewarm.xxx instead of pg_autoprewarm.xxx.
3. During startup we get the following message:
2017-01-24 16:13:57.615 IST [90061] LOG: Num buffers : 272
It could be better written as "pg_prewarm: 272 blocks loaded from dump" or something similar.
4. Also, the message while dumping says:
2017-01-24 16:15:17.712 IST [90061] LOG: Buffer Dump: saved metadata of 272 blocks
It would be better to write the module name in the message instead of "Buffer Dump".
On Fri, Oct 28, 2016 at 6:36 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> I welcome this feature! I remember pg_hibernate did this. I wonder what
> happened to pg_hibernate. Did you check it?

Thanks. When I checked pg_hibernate, there were two things people complained about. One, buffer loading starts only after recovery is finished and the database has reached a consistent state. Two, it can replace existing buffers which were loaded due to recovery and newly connected clients. This solution tries to solve both.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
On 1/24/17 4:56 AM, Beena Emerson wrote:
> 2. buff_dump_interval could be renamed to just dump_interval or
> buffer_dump_interval. Also, since it is now part of pg_prewarm. I think
> it makes sense to have the conf parameter be: pg_prewarm.xxx instead of
> pg_autoprewarm.xxx

I'd really like to find terminology other than "buffer dump", because that makes it sound like we're dumping the contents of the buffers themselves. Maybe block_map? Buffer_map?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
On 1/24/17 2:26 AM, Mithun Cy wrote:
> Thanks for looking into this patch, I just downloaded the patch and
> applied same to the latest code, I can see the file "autoprewarm.save" in
> $PGDATA which is created and dumped at shutdown time and an activity
> is logged as below
> 2017-01-24 13:22:25.012 IST [91755] LOG: Buffer Dump: saved metadata
> of 59 blocks.

Yeah, I wasn't getting that at all, though I did see the shared library being loaded. If I get a chance I'll try it again.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
On Tue, Jan 24, 2017 at 5:07 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I took a look at this again, and it doesn't appear to be working for me. The library is being loaded during startup, but I don't see any further activity in the log, and I don't see an autoprewarm file in $PGDATA.
>
> There needs to be some kind of documentation change as part of this patch.
>
> I'm not sure the default GUC setting of 0 makes sense. If you've loaded the module, presumably you want it to be running. I think it'd be nice if the GUC had a -1 setting that meant to use checkpoint_timeout.
>
> Having the GUC be restart-only is also pretty onerous. I don't think it'd be hard to make the worker respond to a reload... there's code in the autovacuum launcher you could use as an example.
>

+1. I don't think there should be any problem in making it PGC_SIGHUP.

> I'm also wondering if this really needs to be a permanently running process... perhaps the worker could simply be started as necessary?

Do you want to invoke the worker after every buff_dump_interval? I think that will be bad in terms of starting a new process, and who will monitor when to start such a process? I think it is better to keep it as a permanently running background process if it is loaded by the user.

> Though maybe that means it wouldn't run at shutdown.

Yeah, that will be another drawback.

Few comments found while glancing at the patch.

1.
+TO DO:
+------
+Add functionality to dump based on timer at regular interval.

I think you need to remove the above TO DO.

2.
+    /* Load the page only if there exist a free buffer. We do not want to
+     * replace an existing buffer. */

This is not a PG-style multiline comment.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
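[As a reference for the reload discussion, a bare-bones dump-worker loop that sleeps on its latch and re-reads the configuration on SIGHUP could look roughly like the sketch below. dump_interval and dump_now() stand in for the patch's GUC and dump routine, the entry-point name is made up, and the four-argument WaitLatch() assumes current development sources.]

    #include "postgres.h"
    #include "miscadmin.h"
    #include "pgstat.h"
    #include "postmaster/bgworker.h"
    #include "storage/ipc.h"
    #include "storage/latch.h"
    #include "utils/guc.h"

    extern int  dump_interval;      /* hypothetical GUC, in seconds */
    extern void dump_now(void);     /* hypothetical dump routine */

    static volatile sig_atomic_t got_sighup = false;
    static volatile sig_atomic_t got_sigterm = false;

    static void
    apw_sighup(SIGNAL_ARGS)
    {
        int save_errno = errno;

        got_sighup = true;
        SetLatch(MyLatch);
        errno = save_errno;
    }

    static void
    apw_sigterm(SIGNAL_ARGS)
    {
        int save_errno = errno;

        got_sigterm = true;
        SetLatch(MyLatch);
        errno = save_errno;
    }

    void
    auto_pg_prewarm_dump_main(Datum main_arg)
    {
        pqsignal(SIGHUP, apw_sighup);
        pqsignal(SIGTERM, apw_sigterm);
        BackgroundWorkerUnblockSignals();

        while (!got_sigterm)
        {
            int     rc;

            /* Sleep until the dump interval elapses or a signal arrives. */
            rc = WaitLatch(MyLatch,
                           WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                           dump_interval * 1000L,
                           PG_WAIT_EXTENSION);
            ResetLatch(MyLatch);

            if (rc & WL_POSTMASTER_DEATH)
                proc_exit(1);

            if (got_sighup)
            {
                got_sighup = false;
                ProcessConfigFile(PGC_SIGHUP);  /* picks up a changed interval */
            }

            if (rc & WL_TIMEOUT)
                dump_now();
        }

        dump_now();                 /* one final dump at shutdown */
        proc_exit(0);
    }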
On Mon, Jan 23, 2017 at 6:37 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I'm not sure the default GUC setting of 0 makes sense. If you've loaded the module, presumably you want it to be running. I think it'd be nice if the GUC had a -1 setting that meant to use checkpoint_timeout.

Actually, I think we need to use -1 to mean "don't run the worker at all". 0 means "run the worker, but don't do timed dumps". >0 means "run the worker, and dump at that interval".

I have to admit that when I was first thinking about this feature, my initial thought was "hey, let's dump once per checkpoint_timeout". But I think that Mithun's approach is better. There's no intrinsic connection between this and checkpointing, and letting the user pick the interval is a lot more flexible. We could still have a magic value that means "same as checkpoint_timeout", but it's not obvious to me that there's any value in that; the user might as well just pick the time interval that they want.

Actually, for busy systems, the interval is probably shorter than checkpoint_timeout. Dumping the list of buffers isn't that expensive, and if you are doing checkpoints every half hour or so, that's probably longer than what you want for this. So I suggest that the documentation could just recommend a suitable value (e.g. 5 minutes) and not worry about checkpoint_timeout.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
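[If the semantics land as described here, the GUC definition itself is small; a sketch with illustrative names, limits, and default follows. It is an assumption about how the patch might declare it, not the patch's actual code.]

    #include "postgres.h"
    #include <limits.h>
    #include "utils/guc.h"

    static int  dump_interval = 300;    /* illustrative default: 5 minutes */

    void
    _PG_init(void)
    {
        /*
         * -1 : don't run the worker at all
         *  0 : run the worker, but no timed dumps
         * >0 : run the worker and dump every dump_interval seconds
         */
        DefineCustomIntVariable("pg_prewarm.dump_interval",
                                "Minimum number of seconds between dumps of the buffer list.",
                                NULL,
                                &dump_interval,
                                300,        /* boot value */
                                -1,         /* min */
                                INT_MAX,    /* max */
                                PGC_SIGHUP, /* changeable on reload, per the review */
                                GUC_UNIT_S,
                                NULL, NULL, NULL);
    }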
On 1/24/17 11:13 PM, Beena Emerson wrote:
> On Wed, Jan 25, 2017 at 10:36 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 1/24/17 2:26 AM, Mithun Cy wrote:
>>> Thanks for looking into this patch, I just downloaded the patch and
>>> applied same to the latest code, I can see the file "autoprewarm.save" in
>>> $PGDATA which is created and dumped at shutdown time and an activity
>>> is logged as below
>>> 2017-01-24 13:22:25.012 IST [91755] LOG: Buffer Dump: saved metadata
>>> of 59 blocks.
>>
>> Yeah, I wasn't getting that at all, though I did see the shared
>> library being loaded. If I get a chance I'll try it again.
>
> Hope u added the following to the conf file:
>
> shared_preload_libraries = 'pg_autoprewarm'    # (change requires restart)

Therein lied my problem: I was preloading pg_prewarm, not pg_autoprewarm.

I think the two need to be integrated much better than they are right now. They should certainly be in the same .so, and as others have mentioned the docs need to be fixed. For consistency, I think the name should just be pg_prewarm, as well as the prefix for the GUC.

Based on that and other feedback I'm going to mark this as returned with feedback, though if you're able to get a revised patch in the next few days please do.

FYI (and this is just a suggestion), for testing purposes, it might also be handy to allow manual dump and load via functions, with the load function giving you the option to forcibly load (instead of doing nothing if there are no buffers on the free list). It would also be handy if those functions accepted a different filename. That way you could reset shared_buffers to a known condition before running a test.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
On 1/25/17 1:46 PM, Jim Nasby wrote:
> Based on that and other feedback I'm going to mark this as returned with
> feedback, though if you're able to get a revised patch in the next few
> days please do.

Actually, based on the message that popped up when I went to do that I guess it's better not to do that, so I marked it as Waiting on Author.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
On Wed, Jan 25, 2017 at 2:46 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I think the two need to be integrated much better than they are right now.
> They should certainly be in the same .so, and as others have mentioned the
> docs need to be fixed. For consistency, I think the name should just be
> pg_prewarm, as well as the prefix for the GUC.

Yikes. +1, definitely.

> It would also be handy of those functions
> accepted a different filename. That way you could reset shared_buffers to a
> known condition before running a test.

That would have some pretty unpleasant security implications unless it is awfully carefully thought out. I doubt this has enough utility to make it worth thinking that hard about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1/24/17 3:26 AM, Mithun Cy wrote:
> In my code by default, we only dump at shutdown time. If we want to
> dump at regular interval then we need to set the GUC
> pg_autoprewarm.buff_dump_interval to > 0.

Just a thought with an additional use case: If I want to set up a standby for offloading queries, could I take the dump file from the primary or another existing standby, copy it to the new standby, and have it be warmed up to the state of the other instance from that?

In my experience, that kind of use is just as interesting as preserving the buffers across a restart.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 26, 2017 at 8:45 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 1/24/17 3:26 AM, Mithun Cy wrote:
>> In my code by default, we only dump at shutdown time. If we want to
>> dump at regular interval then we need to set the GUC
>> pg_autoprewarm.buff_dump_interval to > 0.
>
> Just a thought with an additional use case: If I want to set up a
> standby for offloading queries, could I take the dump file from the
> primary or another existing standby, copy it to the new standby, and
> have it be warmed up to the state of the other instance from that?
>
> In my experience, that kind of use is just as interesting as preserving
> the buffers across a restart.
>
An interesting use case. I am not sure if somebody has tried that way
but it appears to me that the current proposed patch should work for
this use case.
On Thu, Jan 26, 2017 at 8:45 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> Just a thought with an additional use case: If I want to set up a
> standby for offloading queries, could I take the dump file from the
> primary or another existing standby, copy it to the new standby, and
> have it be warmed up to the state of the other instance from that?
>
> In my experience, that kind of use is just as interesting as preserving
> the buffers across a restart.

Initially, I did not think about this; thanks for asking. For now, we dump the buffer pool info in the format <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>. If a BlockNum in the new standby corresponds to the same set of rows as it did on the server where the dump was produced, I think we can directly use the dump file in the new standby. All we need to do is drop the ".save" file into the data directory and preload the library. The buffer pool will then be warmed with the blocks mentioned in ".save".

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
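[Concretely, each saved entry only identifies a block, not its contents, so a writer for one record could be as small as the sketch below, reusing the illustrative BlockInfoRecord from earlier. The patch's actual on-disk layout is not shown in this thread, so the CSV-style line is an assumption.]

    /* Illustrative only: one <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>
     * record per line of the ".save" file. */
    static void
    write_block_info(FILE *fp, const BlockInfoRecord *rec)
    {
        fprintf(fp, "%u,%u,%u,%d,%u\n",
                rec->database, rec->spcNode, rec->filenode,
                rec->forknum, rec->blocknum);
    }

Because only block identifiers are stored, the file is tiny, and it is only meaningful on a standby whose relation files contain the same blocks, which is exactly the caveat raised above.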
On 1/26/17 11:11 PM, Beena Emerson wrote:
> In that case, we could add the file location parameter. By default it
> would store in the cluster directory else in the location provided. We
> can update this parameter in standby for it to access the file.

I don't see the need for that.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 27, 2017 at 3:18 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 1/26/17 11:11 PM, Beena Emerson wrote:
>> In that case, we could add the file location parameter. By default it
>> would store in the cluster directory else in the location provided. We
>> can update this parameter in standby for it to access the file.
>
> I don't see the need for that.

+1. That seems like over-engineering this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1/26/17 10:11 PM, Beena Emerson wrote:
> In that case, we could add the file location parameter. By default it
> would store in the cluster directory else in the location provided. We
> can update this parameter in standby for it to access the file.

I don't see file location being as useful in this case. What would be more useful is being able to forcibly load blocks into shared buffers so that you didn't need to restart. Hmm, it occurs to me that could be accomplished by providing an SRF that returned the contents of the current save file.

In any case, I strongly suggest focusing on the issues that have already been identified before trying to add more features.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
On Fri, Jan 27, 2017 at 5:34 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 1/26/17 10:11 PM, Beena Emerson wrote:
>> In that case, we could add the file location parameter. By default it
>> would store in the cluster directory else in the location provided. We
>> can update this parameter in standby for it to access the file.
>
> I don't see file location being as useful in this case. What would be more
> useful is being able to forcibly load blocks into shared buffers so that you
> didn't need to restart.

Of course, you can already do that with the existing pg_prewarm and pg_buffercache functionality. Any time you want, you can use pg_buffercache to dump out a list of everything in shared_buffers, and pg_prewarm to suck that same stuff in on the same node or a separate node. All this patch is trying to do is provide a convenient, automated way to make that work.

(An argument could be made that this ought to be in core and the default behavior, because who really wants to start with an ice-cold buffer cache on a production system? It's possible that repopulating shared_buffers in the background after a restart could cause enough I/O to interfere with foreground activity that regrettably ends up needing none of the prewarmed buffers, but I think prewarming a few GB of data should be quite fast under normal circumstances, and any well-intentioned system can go wrong under some set of obscure circumstances. Still, the patch takes the conservative course of making this an opt-in behavior, and that's probably for the best.)

> In any case, I strongly suggest focusing on the issues that have already
> been identified before trying to add more features.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jan 31, 2017 at 3:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 27, 2017 at 5:34 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 1/26/17 10:11 PM, Beena Emerson wrote:
>>> In that case, we could add the file location parameter. By default it
>>> would store in the cluster directory else in the location provided. We
>>> can update this parameter in standby for it to access the file.
>>
>> I don't see file location being as useful in this case. What would be more
>> useful is being able to forcibly load blocks into shared buffers so that you
>> didn't need to restart.
>
> Of course, you can already do that with the existing pg_prewarm and
> pg_buffercache functionality. Any time you want, you can use
> pg_buffercache to dump out a list of everything in shared_buffers, and
> pg_prewarm to suck that same stuff in on the same node or a separate
> node. All this patch is trying to do is provide a convenient,
> automated way to make that work.
>
> (An argument could be made that this ought to be in core and the
> default behavior, because who really wants to start with an ice-cold
> buffer cache on a production system? It's possible that repopulating
> shared_buffers in the background after a restart could cause enough I/O
> to interfere with foreground activity that regrettably ends up needing
> none of the prewarmed buffers, but I think prewarming a few GB of data
> should be quite fast under normal circumstances, and any
> well-intentioned system can go wrong under some set of obscure
> circumstances. Still, the patch takes the conservative course of making
> this an opt-in behavior, and that's probably for the best.)

I partially agree with this paragraph; at least there are advantages to doing so for cases where the data fits in shared buffers. Even for data sets fitting in RAM that can be an advantage, as the buffers would get evicted from Postgres' cache but not the OS's. Now for cases where there are many load patterns on a given database (I have some here), it's hard to make this thing on by default.

This patch needs to be visibly reshaped anyway, so I am marking it as returned with feedback.

--
Michael
On Tue, Jan 31, 2017 at 1:48 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> I partially agree with this paragraph, at least there are advantages
> to do so for cases where the data fits in shared buffers. Even for
> data sets fitting in RAM that can be an advantage as the buffers would
> get evicted from Postgres' cache but not the OS. Now for cases where
> there are many load patterns on a given database (I have some here),
> that's hard to make this thing by default on.

Well, the question even for that case is whether it really costs anything. My bet is that it is nearly free when it doesn't help, but that could be wrong. My experience running pgbench tests is that prewarming all of pgbench_accounts on a scale factor that fits in shared_buffers using "dd" took just a few seconds, but when accessing the blocks in random order the cache took many minutes to heat up.

Now, I assume that this patch sorts the I/O (although I haven't checked that) and therefore I expect that the prewarm completes really fast. If that's not the case, then that's bad. But if it is the case, then it's not really hurting you even if the workload changes completely. Again, I'm not really arguing for enable-by-default, but I think if this is well-implemented the chances of it actually hurting anything are very low, so you'll either win or it'll make no difference.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jan 31, 2017 at 9:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Now, I assume that this patch sorts the I/O (although I haven't
> checked that) and therefore I expect that the prewarm completes really
> fast. If that's not the case, then that's bad. But if it is the
> case, then it's not really hurting you even if the workload changes
> completely.

-- JFYI, yes, in the patch we load the sorted <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> entries.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 1, 2017 at 1:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, the question even for that case is whether it really costs
> anything. My bet is that it is nearly free when it doesn't help, but
> that could be wrong. My experience running pgbench tests is that
> prewarming all of pgbench_accounts on a scale factor that fits in
> shared_buffers using "dd" took just a few seconds, but when accessing
> the blocks in random order the cache took many minutes to heat up.

And benchmarks like dbt-1 have a pre-warming period added in the test itself, where the user can specify a number of seconds over which to linearly increase the load from 0% to 100%, just for filling in the OS and PG's caches... This feature would be helpful.

> Now, I assume that this patch sorts the I/O (although I haven't
> checked that) and therefore I expect that the prewarm completes really
> fast. If that's not the case, then that's bad. But if it is the
> case, then it's not really hurting you even if the workload changes
> completely.

Having that working fast would be really nice.

--
Michael
Hi all,
Here is the new patch which fixes all of the above comments; I changed the design a bit, as described below.

What is it?
===========
A pair of bgworkers: one which automatically dumps the buffer pool's block info at a given interval, and another which loads those blocks into the buffer pool when the server restarts.

How does it work?
=================
The shared library pg_prewarm is preloaded during server startup. A bgworker "auto pg_prewarm load" is launched immediately after the server is started. The bgworker starts loading blocks obtained from the block info entries <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> in $PGDATA/AUTO_PG_PREWARM_FILE, for as long as there are free buffers in the buffer pool. This way we do not replace any new blocks which were loaded either by the recovery process or by the querying clients.

Once the "auto pg_prewarm load" bgworker has completed its job, it registers a dynamic bgworker "auto pg_prewarm dump", which is launched when the server reaches a consistent state. The new bgworker periodically scans the buffer pool and dumps the meta info of the blocks which are currently in the buffer pool. The GUC pg_prewarm.dump_interval, if set > 0, indicates the minimum time interval between two dumps. If pg_prewarm.dump_interval is set to AT_PWARM_DUMP_AT_SHUTDOWN_ONLY, the bgworker will only dump at the time of server shutdown. If it is set to AT_PWARM_LOAD_ONLY, we do not want the bgworker to dump at all, so it stops there. To relaunch a stopped "auto pg_prewarm dump" bgworker we can use the utility function "launch_pg_prewarm_dump".

==================
One problem I have kept open for now is that multiple "auto pg_prewarm dump" workers can be launched by calling launch_pg_prewarm_dump even if a dump/load worker is already running. I can avoid this by dropping a lock file before starting the bgworkers. But if there is another, simpler method to avoid launching a duplicate bgworker, I can do that instead. Any suggestion on this will be very helpful.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
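[For orientation, the startup side described above hinges on bgworker registration at preload time. A rough sketch is shown below; the worker and entry-point names are assumptions, the start time and flags depend on exactly when the loader is allowed to touch relations, and this would sit in the same _PG_init() as the GUC definition sketched earlier.]

    #include "postgres.h"
    #include "miscadmin.h"
    #include "postmaster/bgworker.h"

    void
    _PG_init(void)
    {
        BackgroundWorker worker;

        if (!process_shared_preload_libraries_in_progress)
            return;

        /* DefineCustomIntVariable("pg_prewarm.dump_interval", ...) as sketched earlier */

        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
            BGWORKER_BACKEND_DATABASE_CONNECTION;
        worker.bgw_start_time = BgWorkerStart_ConsistentState;
        worker.bgw_restart_time = BGW_NEVER_RESTART;
        snprintf(worker.bgw_name, BGW_MAXLEN, "auto pg_prewarm load");
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_prewarm");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "auto_pg_prewarm_load_main");
        RegisterBackgroundWorker(&worker);
    }

The load worker would then register the "auto pg_prewarm dump" worker dynamically with RegisterDynamicBackgroundWorker() once it has finished, as described above.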
On Tue, Feb 7, 2017 at 11:53 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello,
>
> Thank you for the updated patch.
>
> On Tue, Feb 7, 2017 at 10:44 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>>
>> Hi all,
>> Here is the new patch which fixes all of above comments, I changed the
>> design a bit now as below
>>
>> What is it?
>> ===========
>> A pair of bgworkers one which automatically dumps buffer pool's block
>> info at a given interval and another which loads those blocks into
>> buffer pool when the server restarts.
>
> Are 2 workers required?
>

I think in the new design there is a provision for launching the worker dynamically to dump the buffers, so there seems to be a need for separate workers for loading and dumping the buffers. However, there is no explanation in the patch or otherwise of when and why this needs a pair of workers. Also, if the dump interval is greater than zero, then do we really need to separately register a dynamic worker?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 7, 2017 at 10:44 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>
> ==================
> One problem now I have kept it open is multiple "auto pg_prewarm dump"
> can be launched even if already a dump/load worker is running by
> calling launch_pg_prewarm_dump. I can avoid this by dropping a
> lock-file before starting the bgworkers. But, if there is an another
> method to avoid launching bgworker on a simple method I can do same.
>

How about keeping a variable in the PROC_HDR structure to indicate whether a dump worker is already running, and then not allowing a new one to start?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
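[One way that suggestion could look, sketched here with a module-private shared-memory flag rather than an actual PROC_HDR change; all names are illustrative and the shared struct is assumed to be set up via the usual shmem request/init hooks.]

    #include "postgres.h"
    #include "port/atomics.h"

    typedef struct AutoPrewarmShared
    {
        pg_atomic_flag dump_worker_running;
    } AutoPrewarmShared;

    /* Assumed to point into shared memory, initialized at startup. */
    static AutoPrewarmShared *apw_shared;

    /* Returns true only for the first caller; a second dump worker would bail out. */
    static bool
    claim_dump_worker(void)
    {
        return pg_atomic_test_set_flag(&apw_shared->dump_worker_running);
    }

    static void
    release_dump_worker(void)
    {
        pg_atomic_clear_flag(&apw_shared->dump_worker_running);
    }

launch_pg_prewarm_dump() could check the same flag before registering another worker, which would avoid the lock-file approach.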
On Tue, Feb 7, 2017 at 11:53 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> launched by other applications. Also with max_worker_processes = 2 and
> restart, the system crashes when the 2nd worker is not launched:
> 2017-02-07 11:36:39.132 IST [20573] LOG: auto pg_prewarm load : number of
> buffers actually tried to load 64
> 2017-02-07 11:36:39.143 IST [18014] LOG: worker process: auto pg_prewarm
> load (PID 20573) was terminated by signal 11: Segmentation fault
SEGFAULT was the coding mistake I have called the C-language function
directly without initializing the functioncallinfo. Thanks for
raising. Below patch fixes same.
Few more comments:

= Background worker messages:
- Workers, when launched, show messages like "logical replication launcher started" and "autovacuum launcher started". We should probably have a similar message to show that the pg_prewarm load and dump bgworkers have started.
- "auto pg_prewarm load: number of buffers to load x": other messages show a space before and after ":", we should keep it consistent throughout.
= Action of -1.
I thought we decided that an interval value of -1 would mean that the auto prewarm worker will not be run at all. With the current code, -1 is explained to mean it will not dump. I noticed that reloading with the new option set to -1 stops both workers, but restarting loads the data and then quits. Why does it allow loading with -1? Please explain this better in the documents.
= launch_pg_prewarm_dump()
With dump_interval=-1, though the function returns a pid, this worker is not running in the 04 patch; in the 03 version it was launched. Dumping is not done now.
=# SELECT launch_pg_prewarm_dump();
launch_pg_prewarm_dump
------------------------
53552
(1 row)
$ ps -ef | grep 53552
b_emers+ 53555 4391 0 16:21 pts/1 00:00:00 grep --color=auto 53552
= Function names
- load_now could be better named as load_file, load_dumpfile or similar.
- dump_now -> dump_buffer or better?
= Corrupt file
If the dump file is corrupted, the system crashes and the prewarm bgworkers are not restarted. This needs to be handled better.
WARNING: terminating connection because of crash of another server process
2017-02-07 16:36:58.680 IST [54252] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory
= Documentation
I feel the documentation needs to be improved greatly.
- The first para in pg_prewarm should mention the autoload feature too.
- The new section should be named “The pg_prewarm autoload” or something better. "auto pg_prewarm bgworker” does not seem appropriate. The configuration parameter should also be listed out clearly like in auth-delay page. The new function launch_pg_prewarm_dump() should be listed under Functions.
Thanks Beena,
On Tue, Feb 7, 2017 at 4:46 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Few more comments:
>
> = Background worker messages:
>
> - Workers when launched, show messages like: "logical replication launcher
> started”, "autovacuum launcher started”. We should probably have a similar
> message to show that the pg_prewarm load and dump bgworker has started.
-- Thanks, I will add startup and shutdown message.
> - "auto pg_prewarm load: number of buffers to load x”, other messages show
> space before and after “:”, we should keep it consistent through out.
>
-- I think you are testing patch 03. The latest patch 04 has corrected
this. Can you please re-test it?
>
> = Action of -1.
> I thought we decided that interval value of -1 would mean that the auto
> prewarm worker will not be run at all. With current code, -1 is explained to
> mean it will not dump. I noticed that reloading with new option as -1 stops
> both the workers but restarting loads the data and then quits. Why does it
> allow loading with -1? Please explain this better in the documents.
>
-- '-1' means we do not want to dump at all. So we decide not to
continue with launched bgworker and it exits. As per your first
comment, if I register the startup and shutdown messages for auto
pg_prewarm I think it will look better. Will try to explain it in a
better way in documentation. The "auto pg_prewarm load" will not be
affected with dump_interval value. It will always start, load(prewarm)
and then exit.
>
> = launch_pg_prewarm_dump()
> =# SELECT launch_pg_prewarm_dump();
> launch_pg_prewarm_dump
> ------------------------
> 53552
> (1 row)
>
> $ ps -ef | grep 53552
> b_emers+ 53555 4391 0 16:21 pts/1 00:00:00 grep --color=auto 53552
-- If dump_interval = -1 "auto pg_prewarm dump" will exit immediately.
> = Function names
> - load_now could be better named as load_file, load_dumpfile or similar.
> - dump_now -> dump_buffer or better?
I did choose load_now and dump_now to indicate we are doing it
immediately as invoking them was based on a timer/state. Probably we
can improve that but dump_buffer, load_file may not be the right
replacement.
>
> = Corrupt file
> if the dump file is corrupted, the system crashes and the prewarm bgworkers
> are not restarted. This needs to be handled better.
>
> WARNING: terminating connection because of crash of another server process
> 2017-02-07 16:36:58.680 IST [54252] DETAIL: The postmaster has commanded
> this server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory
--- Can you please paste your autopgprewarm.save file? I edited the
file manually to contain some illegal entries but did not see any crash;
we only failed to load, as the block numbers were invalid. Please share
your tests so that I can reproduce the same.
> = Documentation
>
> I feel the documentation needs to be improved greatly.
>
> - The first para in pg_prewarm should mention the autoload feature too.
>
> - The new section should be named “The pg_prewarm autoload” or something
> better. "auto pg_prewarm bgworker” does not seem appropriate. The
> configuration parameter should also be listed out clearly like in auth-delay
> page. The new function launch_pg_prewarm_dump() should be listed under
> Functions.
-- Thanks I will try to improve the documentation. And, put more details there.
On Tue, Feb 7, 2017 at 12:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Feb 7, 2017 at 11:53 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> Are 2 workers required?
>>
>
> I think in the new design there is a provision of launching the worker
> dynamically to dump the buffers, so there seems to be a need of
> separate workers for loading and dumping the buffers. However, there
> is no explanation in the patch or otherwise when and why this needs a
> pair of workers. Also, if the dump interval is greater than zero,
> then do we really need to separately register a dynamic worker?
We have introduced a new value -1 for pg_prewarm.dump_interval this
means we will not dump at all, At this state, I thought auto
pg_prewarm process need not run at all, so I coded to exit the auto
pg_prewarm immediately. But If the user decides to start the auto
pg_prewarm to dump only without restarting the server, I have
introduced a launcher function "launch_pg_prewarm_dump" to restart the
auto pg_prewarm only to dump. Since now we can launch worker only to
dump, I thought we can redistribute the code between two workers, one
which only does prewarm (load only) and another dumps periodically.
This helped me to modularize and reuse the code. So once load worker
has finished its job, it registers a dump worker and then exits.
But if max_worker_processes is not enough to launch the "auto
pg_prewarm dump" bgworker
We throw an error
2017-02-07 14:51:59.789 IST [50481] ERROR: registering dynamic
bgworker "auto pg_prewarm dump" failed c
2017-02-07 14:51:59.789 IST [50481] HINT: Consider increasing
configuration parameter "max_worker_processes".
Now thinking again instead of such error and then correcting same by
explicitly launching the auto pg_prewarm dump bgworker through
launch_pg_prewarm_dump(), I can go back to original design where there
will be one worker which loads and then dumps periodically. And
launch_pg_prewarm_dump will relaunch dump only activity of that
worker. Does this sound good?
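[To make the relaunch path concrete, a SQL-callable launcher along the lines of launch_pg_prewarm_dump could be written roughly as below. This is a sketch rather than the patch's actual code; the entry-point name is an assumption, and the error texts simply mirror the messages quoted above.]

    #include "postgres.h"
    #include "fmgr.h"
    #include "miscadmin.h"
    #include "postmaster/bgworker.h"

    PG_FUNCTION_INFO_V1(launch_pg_prewarm_dump);

    Datum
    launch_pg_prewarm_dump(PG_FUNCTION_ARGS)
    {
        BackgroundWorker worker;
        BackgroundWorkerHandle *handle;
        BgwHandleStatus status;
        pid_t       pid;

        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
        worker.bgw_start_time = BgWorkerStart_ConsistentState;
        worker.bgw_restart_time = BGW_NEVER_RESTART;
        worker.bgw_notify_pid = MyProcPid;
        snprintf(worker.bgw_name, BGW_MAXLEN, "auto pg_prewarm dump");
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_prewarm");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "auto_pg_prewarm_dump_main");

        if (!RegisterDynamicBackgroundWorker(&worker, &handle))
            ereport(ERROR,
                    (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
                     errmsg("registering dynamic bgworker \"auto pg_prewarm dump\" failed"),
                     errhint("Consider increasing configuration parameter \"max_worker_processes\".")));

        status = WaitForBackgroundWorkerStartup(handle, &pid);
        if (status != BGWH_STARTED)
            ereport(ERROR,
                    (errmsg("could not start background worker \"auto pg_prewarm dump\"")));

        PG_RETURN_INT32((int32) pid);
    }

A CREATE FUNCTION ... LANGUAGE C declaration in the extension's SQL script would expose it, which is what Beena's SELECT launch_pg_prewarm_dump() example above exercises.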
On Tue, Feb 7, 2017 at 6:11 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Yes it would be better to have only one pg_prewarm worker as the loader is
> idle for the entire server run time after the initial load activity of few
> secs.

Sorry, that is again a bug in the code. The code to handle SIGUSR1 somehow got deleted before I submitted patch_03 and I failed to notice it. As the code stands, the loader bgworker is waiting on its latch to know the status of the dump bgworker. Actually, the loader bgworker should exit right after launching the dump bgworker. I will try to fix this and the other comments given by you in my next patch.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 7, 2017 at 5:14 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Tue, Feb 7, 2017 at 12:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Tue, Feb 7, 2017 at 11:53 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>>> Are 2 workers required?
>>>
>>
>> I think in the new design there is a provision of launching the worker
>> dynamically to dump the buffers, so there seems to be a need of
>> separate workers for loading and dumping the buffers. However, there
>> is no explanation in the patch or otherwise when and why this needs a
>> pair of workers. Also, if the dump interval is greater than zero,
>> then do we really need to separately register a dynamic worker?
>
> We have introduced a new value -1 for pg_prewarm.dump_interval this
> means we will not dump at all, At this state, I thought auto
> pg_prewarm process need not run at all, so I coded to exit the auto
> pg_prewarm immediately. But If the user decides to start the auto
> pg_prewarm to dump only without restarting the server, I have
> introduced a launcher function "launch_pg_prewarm_dump" to restart the
> auto pg_prewarm only to dump. Since now we can launch worker only to
> dump, I thought we can redistribute the code between two workers, one
> which only does prewarm (load only) and another dumps periodically.
> This helped me to modularize and reuse the code. So once load worker
> has finished its job, it registers a dump worker and then exits.
> But if max_worker_processes is not enough to launch the "auto
> pg_prewarm dump" bgworker
> We throw an error
> 2017-02-07 14:51:59.789 IST [50481] ERROR: registering dynamic
> bgworker "auto pg_prewarm dump" failed c
> 2017-02-07 14:51:59.789 IST [50481] HINT: Consider increasing
> configuration parameter "max_worker_processes".
>
> Now thinking again instead of such error and then correcting same by
> explicitly launching the auto pg_prewarm dump bgworker through
> launch_pg_prewarm_dump(), I can go back to original design where there
> will be one worker which loads and then dumps periodically. And
> launch_pg_prewarm_dump will relaunch dump only activity of that
> worker. Does this sound good?
>
Won't it be simple if you consider -1 as a value to just load library? For *_interval = -1, it will neither load nor dump. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 7, 2017 at 1:31 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > SEGFAULT was the coding mistake I have called the C-language function > directly without initializing the functioncallinfo. Thanks for > raising. Below patch fixes same. It would be nice if this had an option to preload only internal B-tree pages into shared_buffers. They're usually less than 1% of the total pages in a B-Tree, and are by far the most frequently accessed. It's reasonable to suppose that much of the random I/O incurred when warming the cache occurs there. Now, prewarming those will incur random I/O, often completely random I/O, but by and large it would be a matter of swallowing that cost sooner, through using your tool, rather than later, during the execution of queries. -- Peter Geoghegan
On Thu, Feb 9, 2017 at 12:36 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Feb 7, 2017 at 1:31 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >> SEGFAULT was the coding mistake I have called the C-language function >> directly without initializing the functioncallinfo. Thanks for >> raising. Below patch fixes same. > > It would be nice if this had an option to preload only internal B-tree > pages into shared_buffers. > Sure, but I think it won't directly fit into the current functionality of patch which loads the blocks that were dumped from shared buffers before the server has stopped. The way to extend it could be that while dumping it just dumps the btree internal pages or something like that. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 7, 2017 at 2:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Feb 7, 2017 at 10:44 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >> >> ================== >> One problem now I have kept it open is multiple "auto pg_prewarm dump" >> can be launched even if already a dump/load worker is running by >> calling launch_pg_prewarm_dump. I can avoid this by dropping a >> lock-file before starting the bgworkers. But, if there is an another >> method to avoid launching bgworker on a simple method I can do same. >> > > How about keeping a variable in PROC_HDR structure to indicate if > already one dump worker is running, then don't allow to start a new > one? A contrib module shouldn't change core (and shouldn't need to). It can register its own shared memory area if it wants. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 9, 2017 at 1:47 AM, Beena Emerson <memissemerson@gmail.com> wrote: >> Won't it be simple if you consider -1 as a value to just load library? >> For *_interval = -1, it will neither load nor dump. >> > +1 > That is what I thought was the behaviour we decided upon for -1. Right. I can't see why you'd want to be able to separately control those two things. If you're not dumping, you don't want to load; if you're not loading, you don't want to dump. I think what should happen is that there should be only one worker. If the GUC is -1, it never gets registered. Otherwise, it starts at database startup time and runs until shutdown. At startup time, it loads buffers until we run out of free buffers or until all saved buffers are loaded, whichever happens first. Buffers should be sorted by relfilenode (in any order) and block number (in increasing order). Once it finishes loading buffers, it repeatedly sleeps for the amount of time indicated by the GUC (or indefinitely if the GUC is 0), dumping after each sleep and at shutdown. Shutting down one worker to start up another doesn't seem to make sense. If for some reason you want the code for those in separate functions, you can call one function and then call the other. Putting them in completely separate processes doesn't buy anything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
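To make the suggested ordering concrete, here is a minimal sketch of a comparator over the dumped block records: it groups blocks of the same relation fork together (the relative order of relations does not matter) and orders blocks within a fork by increasing block number. The struct mirrors the <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> record format discussed in this thread; passing this comparator to pg_qsort() over the collected array is assumed, not quoted from the patch.

#include "postgres.h"
#include "common/relpath.h"
#include "storage/block.h"

/* One dumped buffer-pool entry, mirroring the record format in the patch. */
typedef struct BlockInfoRecord
{
	Oid			database;
	Oid			spcNode;
	Oid			filenode;
	ForkNumber	forknum;
	BlockNumber blocknum;
} BlockInfoRecord;

/*
 * qsort() comparator: group blocks belonging to the same relation fork,
 * then order within a fork by increasing block number so that loading
 * proceeds sequentially through each file.
 */
static int
blockinfo_cmp(const void *p, const void *q)
{
	const BlockInfoRecord *a = (const BlockInfoRecord *) p;
	const BlockInfoRecord *b = (const BlockInfoRecord *) q;

	if (a->database != b->database)
		return (a->database < b->database) ? -1 : 1;
	if (a->spcNode != b->spcNode)
		return (a->spcNode < b->spcNode) ? -1 : 1;
	if (a->filenode != b->filenode)
		return (a->filenode < b->filenode) ? -1 : 1;
	if (a->forknum != b->forknum)
		return (a->forknum < b->forknum) ? -1 : 1;
	if (a->blocknum != b->blocknum)
		return (a->blocknum < b->blocknum) ? -1 : 1;
	return 0;
}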
On 2/10/17 15:12, Robert Haas wrote: > Right. I can't see why you'd want to be able to separately control > those two things. If you're not dumping, you don't want to load; if > you're not loading, you don't want to dump. What about the case where you want to prewarm a standby from the info from the primary (or another standby)? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 22, 2017 at 4:06 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 2/10/17 15:12, Robert Haas wrote: >> Right. I can't see why you'd want to be able to separately control >> those two things. If you're not dumping, you don't want to load; if >> you're not loading, you don't want to dump. > > What about the case where you want to prewarm a standby from the info > from the primary (or another standby)? I think it's OK to treat that as something of a corner case. There's nothing to keep you from doing that today: just use pg_buffercache to dump a list of blocks on one server, and then pass those blocks to pg_prewarm on another server. The point of this patch, AIUI, is to automate a particularly common case of that, which is to dump before shutdown and load upon startup. It doesn't preclude other things that people want to do. I suppose we could have an SQL-callable function that does an immediate dump (without starting a worker). That wouldn't hurt anything, and might be useful in a case like the one you mention. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
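As a rough illustration of that suggestion (the names here are assumptions, though a later version of the patch in this thread does add a similar autoprewarm_dump_now()), such an immediate dump could simply be an ordinary SQL-callable C function that performs one dump in the calling backend:

#include "postgres.h"
#include "fmgr.h"

/* Assumed helper defined elsewhere in the module: scans the buffer pool,
 * writes the block list to disk, and returns the number of records written. */
static uint32 dump_now(void);

PG_FUNCTION_INFO_V1(autoprewarm_dump_now);

/* SQL-callable immediate dump; no background worker is involved. */
Datum
autoprewarm_dump_now(PG_FUNCTION_ARGS)
{
	uint32		num_blocks = dump_now();

	PG_RETURN_INT64((int64) num_blocks);
}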
Hi all thanks, I have tried to fix all of the comments given above with some more code cleanups. On Wed, Feb 22, 2017 at 6:28 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think it's OK to treat that as something of a corner case. There's > nothing to keep you from doing that today: just use pg_buffercache to > dump a list of blocks on one server, and then pass those blocks to > pg_prewarm on another server. The point of this patch, AIUI, is to > automate a particularly common case of that, which is to dump before > shutdown and load upon startup. It doesn't preclude other things that > people want to do. > > I suppose we could have an SQL-callable function that does an > immediate dump (without starting a worker). That wouldn't hurt > anything, and might be useful in a case like the one you mention. In the latest patch, I have moved things back to the old way: there will be one bgworker, "auto pg_prewarm", which automatically records information about the blocks present in the buffer pool before server shutdown and then prewarms the buffer pool with those blocks upon server restart. I have reverted the code which allowed relaunching a stopped "auto pg_prewarm" bgworker. The reason I introduced a launcher SQL utility function is that a running auto pg_prewarm can be stopped by the user by setting dump_interval to -1; if the user then wants to restart the stopped auto pg_prewarm (this time to dump only, so that it can prewarm on the next restart), he can use that utility to launch it while the server is still running. If that was not the concern, I think I misunderstood the comments and overdid the design, so as a first patch I will keep things simple. Also, using a separate process for the prewarm and dump activity was a bad design, hence I have reverted that as well. The auto pg_prewarm can only be launched by preloading the library. I can add additional utilities once we formalize what is really needed out of it. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Wed, Feb 22, 2017 at 11:49 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > Hi all thanks, > I have tried to fix all of the comments given above with some more > code cleanups. While reading this patch tonight, I realized a serious problem with the entire approach, which is that this patch is supposing that we can read relation blocks for every database from a single worker that's not connected to any database. I realize that I suggested that approach, but now I think it's broken, because the patch isn't taking any locks on the relations whose pages it is reading, and that is definitely going to break things. While autoprewarm is busy sucking blocks into the shared buffer cache, somebody could be, for example, dropping one of those relations. DropRelFileNodesAllBuffers and friends expect that nobody is going to be concurrently reading blocks back into the buffer cache because they hold AccessExclusiveLock, and they assume that anybody else who is touching it will hold at least AccessShareLock. But this violates that assumption, and probably some others. This is not easy to fix. The lock has to be taken based on the relation OID, not the relfilenode, but we don't have the relation OID in the dump file, and recording it there won't help, because the relfilenode can change under us if the relation is rewritten with CLUSTER or VACUUM FULL or relevant forms of ALTER TABLE. I don't see a solution other than launching a separate worker for each database, which seems like it could be extremely expensive if there are many databases. Also, I am pretty sure it's no good to take locks before recovery reaches a consistent state. I'm not sure off-hand whether crash-recovery will notice conflicting locks, but even if it does, blocking crash recovery in order to prewarm stuff is bad news. Here are some other review comments (which may not matter unless we can think up a solution to the problems above). - I think auto_pg_prewarm.c is an unnecessarily long name. How about autoprewarm.c? - It looks like you ran pgindent over this without adding the new typedefs to your typedefs.list, so things like the last line of each typedef is formatted incorrectly. - ReadBufferForPrewarm isn't a good name for this interface. You need to give it a generic name (and header comment) that describes what it does, rather than why you added it. Others might want to also use this interface. Actually, an even better idea might be to adjust ReadBufferWithoutRelcache() to serve your need here. That function's header comment seems to contemplate that somebody might want to add a relpersistence argument someday; perhaps that day has arrived. - have_free_buffer's comment shouldn't mention autoprewarm, but it should mention that this is a lockless test, so the results might be slightly stale. See similar comments in various other backend functions for an example of how to write this. - next_task could be static, and with such a generic name, it really MUST be static to avoid namespace conflicts. - load_block() has a race condition. The relation could be dropped after you check smgrexists() and before you access the relation. Also, the fork number validity check looks useless; it should never fail. - I suggest renaming the file that stores the blocks to autoprewarm.blocks or something like that. Calling it just "autopgprewarm" doesn't seem very clear. - I don't see any reason for the dump file to contain a header record with an expected record count. 
When rereading the file, you can just read until EOF; there's no real need to know the record count before you start. - You should test for multiple flags like this: if ((buf_state & (BM_VALID|VM_TAG_VALID|BM_PERSISTENT)) != 0). However, I also think the test is wrong. Even if the buffer isn't BM_VALID, that's not really a reason not to include it in the dump file. Same with BM_PERSISTENT. I think the reason for the latter restriction is that you're always calling load_block() with RELPERSISTENCE_PERMANENT, but that's not a good idea either. If the relation were made unlogged after you wrote the dump file, then on reload it you'd incorrectly set BM_PERMANENT on the reloaded blocks. - elog() should not be used for user-facing messages, but rather only for internal messages that we don't expect to get generated. Also, the messages you've picked don't conform to the project's message style guidelines. - The error handling loop around load_block() suggests that you're expecting some reads to fail, which I guess is because you could be trying to read blocks from a relation that's been rewritten under a different relfilenode, or partially or entirely truncated. But I don't think it's very reasonable to just let ReadBufferWhatever() spew error messages into the log and hope users don't panic. People will expect an automatic prewarm solution to handle any such cases quietly, not bleat loudly in the log. I suspect that this error-trapping code isn't entirely correct, but there's not much point in fixing it; what we really need to do is get rid of it (somehow). - dump_block_info_periodically() will sleep for too long - perhaps forever - if WaitLatch() repeatedly returns due to WL_LATCH_SET, which can probably happen if for any reason the process receives SIGUSR1 repeatedly. Every time the latch gets set, the timeout is reset, so it may never expire. There are examples of how to write a loop like this correctly in various places in the server; please check one of those. - I don't think you should need an error-catching loop in auto_pgprewarm_main(), either. Just let the worker die if there's an ERROR, and set the restart interval to something other than BGW_NEVER_RESTART. - Setting bgw_main isn't correct for extension code. Please read the documentation on background workers, which explains how to do this correctly in an extension. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
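For the WaitLatch point above, a minimal sketch of the usual pattern follows: remember when the last dump happened and recompute the remaining timeout after every wakeup, so that repeated latch sets cannot postpone the dump indefinitely. dump_interval, dump_now() and got_sigterm follow the names used in the patch; everything else is an assumption, and the sketch ignores the dump_interval = 0 (shutdown-only) case.

#include "postgres.h"
#include <time.h>
#include "pgstat.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/proc.h"

extern volatile bool got_sigterm;	/* set by the SIGTERM handler */
extern int	dump_interval;			/* GUC, in seconds; assumed > 0 here */
extern uint32 dump_now(void);

static void
dump_block_info_periodically(void)
{
	time_t		last_dump_time = time(NULL);

	while (!got_sigterm)
	{
		int			elapsed = (int) (time(NULL) - last_dump_time);
		int			rc;

		/* Dump only when the full interval has really elapsed; spurious
		 * wakeups just fall through to WaitLatch with a shorter timeout. */
		if (elapsed >= dump_interval)
		{
			dump_now();
			last_dump_time = time(NULL);
			elapsed = 0;
		}

		ResetLatch(&MyProc->procLatch);
		if (got_sigterm)
			break;
		rc = WaitLatch(&MyProc->procLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
					   (dump_interval - elapsed) * 1000L,
					   PG_WAIT_EXTENSION);
		if (rc & WL_POSTMASTER_DEATH)
			proc_exit(1);
	}
}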
On 2/26/17 11:46, Robert Haas wrote: > I don't see > a solution other than launching a separate worker for each database, > which seems like it could be extremely expensive if there are many > databases. You don't have to start all these workers at once. Starting one and not starting the next one until the first one is finished should be fine. It will have the same serial behavior that the patch is proposing anyway. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Feb 27, 2017 at 7:18 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 2/26/17 11:46, Robert Haas wrote: >> I don't see >> a solution other than launching a separate worker for each database, >> which seems like it could be extremely expensive if there are many >> databases. > > You don't have to start all these workers at once. Starting one and not > starting the next one until the first one is finished should be fine. > It will have the same serial behavior that the patch is proposing anyway. Yeah, true. The constant factor is higher, of course. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
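A minimal sketch of that serial launch, using the existing dynamic background worker API, is shown below. The bgw_* settings and passing the database OID as bgw_main_arg are assumptions (the patch passes a different argument to its load_one_database() entry point); the point illustrated is waiting for each sub-worker to shut down before registering the next.

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

/* Launch one sub-worker per database in the dump, strictly one at a time. */
static void
launch_subworkers_serially(Oid *databases, int ndatabases)
{
	int			i;

	for (i = 0; i < ndatabases; i++)
	{
		BackgroundWorker worker;
		BackgroundWorkerHandle *handle;

		memset(&worker, 0, sizeof(worker));
		worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
			BGWORKER_BACKEND_DATABASE_CONNECTION;
		worker.bgw_start_time = BgWorkerStart_ConsistentState;
		worker.bgw_restart_time = BGW_NEVER_RESTART;
		snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_prewarm");
		snprintf(worker.bgw_function_name, BGW_MAXLEN, "load_one_database");
		snprintf(worker.bgw_name, BGW_MAXLEN, "autoprewarm load");
		worker.bgw_main_arg = ObjectIdGetDatum(databases[i]);
		worker.bgw_notify_pid = MyProcPid;	/* needed to wait on the handle */

		if (!RegisterDynamicBackgroundWorker(&worker, &handle))
			break;				/* no free worker slots; give up quietly */

		/* Wait for this sub-worker to exit before starting the next one. */
		WaitForBackgroundWorkerShutdown(handle);
	}
}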
On Sun, Feb 26, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Feb 22, 2017 at 11:49 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >> Hi all thanks, >> I have tried to fix all of the comments given above with some more >> code cleanups. > > While reading this patch tonight, I realized a serious problem with > the entire approach, which is that this patch is supposing that we can > read relation blocks for every database from a single worker that's > not connected to any database. I realize that I suggested that > approach, but now I think it's broken, because the patch isn't taking > any locks on the relations whose pages it is reading, and that is > definitely going to break things. While autoprewarm is busy sucking > blocks into the shared buffer cache, somebody could be, for example, > dropping one of those relations. DropRelFileNodesAllBuffers and > friends expect that nobody is going to be concurrently reading blocks > back into the buffer cache because they hold AccessExclusiveLock, and > they assume that anybody else who is touching it will hold at least > AccessShareLock. But this violates that assumption, and probably some > others. > > This is not easy to fix. The lock has to be taken based on the > relation OID, not the relfilenode, but we don't have the relation OID > in the dump file, and recording it there won't help, because the > relfilenode can change under us if the relation is rewritten with > CLUSTER or VACUUM FULL or relevant forms of ALTER TABLE. I don't see > a solution other than launching a separate worker for each database, > which seems like it could be extremely expensive if there are many > databases. Also, I am pretty sure it's no good to take locks before > recovery reaches a consistent state. So we should move this loading of blocks once the recovery reaches a consistent state so that we can connect to a database. To allow worker, to take a lock, we need to dump relation oid as well. Is that what you are envisioning to fix this problem? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 2, 2017 at 7:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > So we should move this loading of blocks once the recovery reaches a > consistent state so that we can connect to a database. To allow > worker, to take a lock, we need to dump relation oid as well. Is that > what you are envisioning to fix this problem? No. The relation -> relfilenode mapping isn't constant, and the dumping process has no way to get the relation OIDs from the relfilenodes anyway. I think what you need to do is dump the relfilenodes as at present, but then at restore time you need a worker per database, and it connects to the database and then uses the infrastructure added by f01d1ae3a104019d6d68aeff85c4816a275130b3 to discover what relation OID, if any, currently corresponds to the proposed relfilenode. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
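A minimal sketch of what the per-database worker could do with that infrastructure, once it is connected to the right database and inside a transaction: map the dumped <tablespace, relfilenode> pair back to a relation OID and lock the relation before touching any of its blocks. RelidByRelfilenode() and try_relation_open() are existing server functions; the wrapper itself is an assumption.

#include "postgres.h"
#include "access/heapam.h"
#include "storage/lockdefs.h"
#include "utils/relcache.h"
#include "utils/relfilenodemap.h"

/*
 * Resolve a dumped <tablespace, relfilenode> pair to the relation that
 * currently uses it, and open it with AccessShareLock so that a concurrent
 * DROP or rewrite cannot remove its buffers while we load them.  Returns
 * NULL if the relfilenode no longer maps to any relation; the caller must
 * relation_close(rel, AccessShareLock) when done.
 */
static Relation
open_dumped_relation(Oid spcNode, Oid filenode)
{
	Oid			reloid = RelidByRelfilenode(spcNode, filenode);

	if (!OidIsValid(reloid))
		return NULL;

	return try_relation_open(reloid, AccessShareLock);
}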
Hi all, thanks for the feedback. Based on your recent comments I have implemented a new patch which is attached below, On Sun, Feb 26, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com> wrote: > This is not easy to fix. The lock has to be taken based on the > relation OID, not the relfilenode, but we don't have the relation OID > in the dump file, and recording it there won't help, because the > relfilenode can change under us if the relation is rewritten with > CLUSTER or VACUUM FULL or relevant forms of ALTER TABLE. I don't see > a solution other than launching a separate worker for each database, > which seems like it could be extremely expensive if there are many > databases. Also, I am pretty sure it's no good to take locks before > recovery reaches a consistent state. I'm not sure off-hand whether > crash-recovery will notice conflicting locks, but even if it does, > blocking crash recovery in order to prewarm stuff is bad news. On Mon, Feb 27, 2017 at 7:18 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > You don't have to start all these workers at once. Starting one and not > starting the next one until the first one is finished should be fine. > It will have the same serial behavior that the patch is proposing anyway. I have implemented a similar logic now. The prewarm bgworker will launch a sub-worker per database in the dump file. And, each sub-worker will load its database block info. The sub-workers will be launched only after previous one is finished. All of this will only start if the database has reached a consistent state. I have also introduced 2 more utility functions which were requested earlier. A. launch_autoprewarm_dump() RETURNS int4 This is a SQL callable function to launch the autoprewarm worker to dump the buffer pool information at regular interval. In a server, we can only run one autoprewarm worker so if a worker sees another existing worker it will exit immediately. The return value is pid of the worker which has been launched. B. autoprewarm_dump_now() RETURNS int8 This is a SQL callable function to dump buffer pool information immediately once by a backend. This can work in parallel with an autoprewarm worker while it is dumping. The return value is the number of blocks info dumped. I need some feedback on efficiently dividing the blocks among sub-workers. Existing dump format will not be much useful. I have chosen the following format Each entry of block info looks like this: <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall call it as BlockInfoRecord. Contents of AUTOPREWARM_FILE has been formated such a way that blockInfoRecord of each database can be given to different prewarm workers. format of AUTOPREWAM_FILE ======================================= [offset position of database map table] [sorted BlockInfoRecords..............] [database map table] ======================================= The [database map table] is a sequence of offset in file which will point to first BlockInfoRecords of each database in the dump. The prewarm worker will read this offset one by one in sequence and ask its sub-worker to seek to this position and then start loading the BlockInfoRecords one by one until it sees a BlockInfoRecords of a different database than it is actually connected to. NOTE: We store off_t inside file so the dump file will not be portable to be used across systems where sizeof off_t is different from each other. I also thought of having one dump file per database. 
Problems I thought which can go against it is there could be too many dump file (also stale files of databases which are no longer in buffer pool). Also, there is a possibility of dump file getting lost due to concurrent dumps by bgworker and autoprewarm_dump_now() SQL utility. ex: While implementing same, dump file names were chosen as number sequence 0,1,2,3......number_of_db. (This helps to avoid stale files being chosen before active database dump files). If 2 concurrent process dump together there will be a race condition. If one page of the new database is loaded to buffer pool at that point there could be a possibility that a recent dump of database blocks will be overwritten due to the race condition and it will go missing from dumps even though that database is still active in bufferpool. If any comments to fix this will be very helpful. On Sun, Feb 26, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Here are some other review comments (which may not matter unless we > can think up a solution to the problems above). > - I think auto_pg_prewarm.c is an unnecessarily long name. How about > autoprewarm.c? Fixed; I am also trying to replace the term "auto pg_prewarm" to "autoprewarm" in all relevant places. > - It looks like you ran pgindent over this without adding the new > typedefs to your typedefs.list, so things like the last line of each > typedef is formatted incorrectly. Fixed. > - ReadBufferForPrewarm isn't a good name for this interface. You need > to give it a generic name (and header comment) that describes what it > does, rather than why you added it. Others might want to also use > this interface. Actually, an even better idea might be to adjust > ReadBufferWithoutRelcache() to serve your need here. That function's > header comment seems to contemplate that somebody might want to add a > relpersistence argument someday; perhaps that day has arrived. -- I think it is not needed now as we have relation descriptor hence using ReadBufferExtended. > - have_free_buffer's comment shouldn't mention autoprewarm, but it > should mention that this is a lockless test, so the results might be > slightly stale. See similar comments in various other backend > functions for an example of how to write this. -- Fixed > - next_task could be static, and with such a generic name, it really > MUST be static to avoid namespace conflicts. -- Fixed. after the new code that variable is no longer needed. > - load_block() has a race condition. The relation could be dropped > after you check smgrexists() and before you access the relation. > Also, the fork number validity check looks useless; it should never > fail. -- Now we hold a relation lock so this should not be an issue. But extra check for forknum is required before calling smgrexists. Otherwise, it crashes if valid forknum is not given hence check is necessary. 
crash call stack after manually setting one of the forknum 10 #0 0x00007f333bb27694 in vfprintf () from /lib64/libc.so.6 #1 0x00007f333bb53179 in vsnprintf () from /lib64/libc.so.6 #2 0x00000000009ec683 in pvsnprintf (buf=0x139e8e0 "global/1260_", '\177' <repeats 115 times>, len=128, fmt=0xbebffe "global/%u_%s", args=0x7ffeaea65750) at psprintf.c:121 #3 0x00000000009ec5d1 in psprintf (fmt=0xbebffe "global/%u_%s") at psprintf.c:64 #4 0x00000000009eca0f in GetRelationPath (dbNode=0, spcNode=1664, relNode=1260, backendId=-1, forkNumber=10) at relpath.c:150 #5 0x000000000082cd95 in mdopen (reln=0x1359c78, forknum=10, behavior=2) at md.c:583 #6 0x000000000082c58d in mdexists (reln=0x1359c78, forkNum=10) at md.c:284 #7 0x000000000082f4cf in smgrexists (reln=0x1359c78, forknum=10) at smgr.c:289 #8 0x00007f33353b0294 in load_one_database (main_arg=21) at autoprewarm.c:546 #9 0x0000000000795232 in StartBackgroundWorker () at bgworker.c:757 #10 0x00000000007a707d in do_start_bgworker (rw=0x12fc3d0) at postmaster.c:5570 > - I suggest renaming the file that stores the blocks to > autoprewarm.blocks or something like that. Calling it just > "autopgprewarm" doesn't seem very clear. -- Fixed. The current file name is "autopgprewarm.save" changed it to autoprewarm.blocks. > - I don't see any reason for the dump file to contain a header record > with an expected record count. When rereading the file, you can just > read until EOF; there's no real need to know the record count before > you start. -- Fixed. Now removed. > - You should test for multiple flags like this: if ((buf_state & > (BM_VALID|VM_TAG_VALID|BM_PERSISTENT)) != 0). However, I also think > the test is wrong. Even if the buffer isn't BM_VALID, that's not > really a reason not to include it in the dump file. Same with > BM_PERSISTENT. I think the reason for the latter restriction is that > you're always calling load_block() with RELPERSISTENCE_PERMANENT, but > that's not a good idea either. If the relation were made unlogged > after you wrote the dump file, then on reload it you'd incorrectly set > BM_PERMANENT on the reloaded blocks. -- Fixed now removed BM_PERMANENT and BM_VALID only using BM_TAG_VALID so if the hash table has this buffer we dump it. > - elog() should not be used for user-facing messages, but rather only > for internal messages that we don't expect to get generated. Also, > the messages you've picked don't conform to the project's message > style guidelines. -- Fixed. > - The error handling loop around load_block() suggests that you're > expecting some reads to fail, which I guess is because you could be > trying to read blocks from a relation that's been rewritten under a > different relfilenode, or partially or entirely truncated. But I > don't think it's very reasonable to just let ReadBufferWhatever() spew > error messages into the log and hope users don't panic. People will > expect an automatic prewarm solution to handle any such cases quietly, > not bleat loudly in the log. I suspect that this error-trapping code > isn't entirely correct, but there's not much point in fixing it; what > we really need to do is get rid of it (somehow). [Need Reelook] -- Debug and check if block load fails what will happen. > - dump_block_info_periodically() will sleep for too long - perhaps > forever - if WaitLatch() repeatedly returns due to WL_LATCH_SET, which > can probably happen if for any reason the process receives SIGUSR1 > repeatedly. Every time the latch gets set, the timeout is reset, so > it may never expire. 
There are examples of how to write a loop like > this correctly in various places in the server; please check one of > those. -- Fixed same, to count elapsed time from the previous dump and then dump. > - I don't think you should need an error-catching loop in > auto_pgprewarm_main(), either. Just let the worker die if there's an > ERROR, and set the restart interval to something other than > BGW_NEVER_RESTART. -- Fixed we restart it now. > - Setting bgw_main isn't correct for extension code. Please read the > documentation on background workers, which explains how to do this > correctly in an extension. -Fixed. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
>> - The error handling loop around load_block() suggests that you're >> expecting some reads to fail, which I guess is because you could be >> trying to read blocks from a relation that's been rewritten under a >> different relfilenode, or partially or entirely truncated. But I >> don't think it's very reasonable to just let ReadBufferWhatever() spew >> error messages into the log and hope users don't panic. People will >> expect an automatic prewarm solution to handle any such cases quietly, >> not bleat loudly in the log. I suspect that this error-trapping code >> isn't entirely correct, but there's not much point in fixing it; what >> we really need to do is get rid of it (somehow). > > [Need Reelook] -- Debug and check if block load fails what will happen. Oops Sorry, this was for self-reference. It is fixed now -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com
On 3/13/17 09:15, Mithun Cy wrote: > A. launch_autoprewarm_dump() RETURNS int4 > This is a SQL callable function to launch the autoprewarm worker to > dump the buffer pool information at regular interval. In a server, we > can only run one autoprewarm worker so if a worker sees another > existing worker it will exit immediately. The return value is pid of > the worker which has been launched. Why do you need that? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Mar 25, 2017 at 11:46 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 3/13/17 09:15, Mithun Cy wrote: >> A. launch_autoprewarm_dump() RETURNS int4 >> This is a SQL callable function to launch the autoprewarm worker to >> dump the buffer pool information at regular interval. In a server, we >> can only run one autoprewarm worker so if a worker sees another >> existing worker it will exit immediately. The return value is pid of >> the worker which has been launched. > > Why do you need that? To launch an autoprewarm worker we have to preload the library, which needs a server restart. If we want to start periodic dumping on an already running server, so that it can automatically prewarm on its next restart, this API can be used to launch the autoprewarm worker. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com
On 2017-03-13 18:45:00 +0530, Mithun Cy wrote: > I have implemented a similar logic now. The prewarm bgworker will > launch a sub-worker per database in the dump file. And, each > sub-worker will load its database block info. The sub-workers will be > launched only after previous one is finished. All of this will only > start if the database has reached a consistent state. Hm. For replay performance it'd possibly be good to start earlier, before reaching consistency. Is there an issue starting earlier? > diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c > new file mode 100644 > index 0000000..f4b34ca > --- /dev/null > +++ b/contrib/pg_prewarm/autoprewarm.c > @@ -0,0 +1,1137 @@ > +/*------------------------------------------------------------------------- > + * > + * autoprewarm.c > + * > + * -- Automatically prewarm the shared buffer pool when server restarts. Don't think we ususally use -- here. > + * Copyright (c) 2013-2017, PostgreSQL Global Development Group Hm, that's a bit of a weird date range. > + * IDENTIFICATION > + * contrib/pg_prewarm.c/autoprewarm.c > + *------------------------------------------------------------------------- > + */ The pg_prewarm.c in there looks like some search & replace gone awry. > +#include "postgres.h" > +#include <unistd.h> > + > +/* These are always necessary for a bgworker. */ > +#include "miscadmin.h" > +#include "postmaster/bgworker.h" > +#include "storage/ipc.h" > +#include "storage/latch.h" > +#include "storage/lwlock.h" > +#include "storage/proc.h" > +#include "storage/shmem.h" > + > +/* These are necessary for prewarm utilities. */ > +#include "pgstat.h" > +#include "storage/buf_internals.h" > +#include "storage/smgr.h" > +#include "utils/memutils.h" > +#include "utils/resowner.h" > +#include "utils/guc.h" > +#include "catalog/pg_class.h" > +#include "catalog/pg_type.h" > +#include "executor/spi.h" > +#include "access/xact.h" > +#include "utils/rel.h" > +#include "port/atomics.h" I'd rather just sort these alphabetically. I think this should rather be in the initial header. > +/* > + * autoprewarm : > + * > + * What is it? > + * =========== > + * A bgworker which automatically records information about blocks which were > + * present in buffer pool before server shutdown and then prewarm the buffer > + * pool upon server restart with those blocks. > + * > + * How does it work? > + * ================= > + * When the shared library "pg_prewarm" is preloaded, a > + * bgworker "autoprewarm" is launched immediately after the server has reached > + * consistent state. The bgworker will start loading blocks recorded in the > + * format BlockInfoRecord > + * <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in > + * $PGDATA/AUTOPREWARM_FILE, until there is a free buffer left in the buffer > + * pool. This way we do not replace any new blocks which were loaded either by > + * the recovery process or the querying clients. s/until there is a/until there is no/? > +/* > + * ============================================================================ > + * =========================== SIGNAL HANDLERS =========================== > + * ============================================================================ > + */ Hm... > +static void sigtermHandler(SIGNAL_ARGS); > +static void sighupHandler(SIGNAL_ARGS); I don't think that's a casing we commonly use. We mostly use CamelCase or underscore_case. > +/* > + * Signal handler for SIGUSR1. 
> + */ > +static void > +sigusr1Handler(SIGNAL_ARGS) > +{ > + int save_errno = errno; > + > + if (MyProc) > + SetLatch(&MyProc->procLatch); > + > + errno = save_errno; > +} Hm, what's this one for? > +/* > + * Shared state information about the running autoprewarm bgworker. > + */ > +typedef struct AutoPrewarmSharedState > +{ > + pg_atomic_uint32 current_task; /* current tasks performed by > + * autoprewarm workers. */ > +} AutoPrewarmSharedState; Hm. Why do we need atomics here? I thought there's no concurrency? > +/* > + * sort_cmp_func - compare function used for qsort(). > + */ > +static int > +sort_cmp_func(const void *p, const void *q) > +{ rename to blockinfo_cmp? > +static AutoPrewarmTask > +get_autoprewarm_task(AutoPrewarmTask todo_task) > +{ > + bool found; > + > + state = NULL; > + > + LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE); > + state = ShmemInitStruct("autoprewarm", > + sizeof(AutoPrewarmSharedState), > + &found); > + if (!found) > + pg_atomic_write_u32(&(state->current_task), todo_task); Superflous parens (repeated a lot). > + LWLockRelease(AddinShmemInitLock); > + > + /* If found check if we can go ahead. */ > + if (found) > + { > + if (pg_atomic_read_u32(&(state->current_task)) == > + TASK_PREWARM_BUFFERPOOL) You repeat the read in every branch - why don't you store it in a variable instead? That aside, the use of an atomic doesn't seem to actually gain us anything here. If we need control over concurrency it seems a lot better to instead use a lwlock or spinlock. There's no contention here, using lock-free stuff just increases complexity without a corresponding benefit. > + { > + if (todo_task == TASK_PREWARM_BUFFERPOOL) > + { > + /* > + * we were prewarming and we are back to do same, time to > + * abort prewarming and move to dumping. > + */ I'm not sure what "back to do same" should mean here - changing to a different type of task surely is not the same. > + pg_atomic_write_u32(&(state->current_task), > + TASK_DUMP_BUFFERPOOL_INFO); > + return TASK_DUMP_BUFFERPOOL_INFO; > + } > + else > + return TASK_END; /* rest all cannot proceed further. */ What does that comment mean? > + } > + else if (pg_atomic_read_u32(&(state->current_task)) == > + TASK_DUMP_IMMEDIATE_ONCE) > + { > + uint32 current_state = TASK_DUMP_IMMEDIATE_ONCE; > + > + /* We cannot do a TASK_PREWARM_BUFFERPOOL but rest can go ahead */ > + if (todo_task == TASK_DUMP_IMMEDIATE_ONCE) > + return TASK_DUMP_IMMEDIATE_ONCE; > + > + if (todo_task == TASK_PREWARM_BUFFERPOOL) > + todo_task = TASK_DUMP_BUFFERPOOL_INFO; /* skip to do dump only */ > + > + /* > + * first guy who can atomically set the current_task get the > + * opportunity to proceed further > + */ > + if (pg_atomic_compare_exchange_u32(&(state->current_task), > + ¤t_state, > + TASK_DUMP_BUFFERPOOL_INFO)) > + { > + /* Wow! We won the race proceed with the task. */ > + return TASK_DUMP_BUFFERPOOL_INFO; > + } > + else > + return TASK_END; Note that it's not generally guaranteed that any pg_atomic_compare_exchange_u32 actually wins, it could temporarily fail for all. > +/* > + * getnextblockinfo -- given a BlkType get its next BlockInfoRecord from the > + * dump file. > + */ > +static BlkType > +getnextblockinfo(FILE *file, BlockInfoRecord *currblkinfo, BlkType reqblock, > + BlockInfoRecord *newblkinfo) > +{ > + BlkType nextblk; > + > + while (true) > + { > + /* get next block. 
*/ > + if (5 != fscanf(file, "%u,%u,%u,%u,%u\n", &(newblkinfo->database), > + &(newblkinfo->spcNode), &(newblkinfo->filenode), > + (uint32 *) &(newblkinfo->forknum), > + &(newblkinfo->blocknum))) > + return BLKTYPE_END; /* No more valid entry hence stop processing. */ Hm. Is it actually helpful to store the file as text? That's commonly going to increase the size of the file quite considerably, no? > +/* > + * GetRelOid -- given a filenode get its relation oid. > + */ > +static Oid > +get_reloid(Oid filenode) > +{ Function and comment don't agree on naming. But what is this actually used for? I thought Robert, in http://archives.postgresql.org/message-id/CA%2BTgmoa%3DUqCL2mR%2B9WTq05tB3Up-z4Sv2wkzkDxDwBP7Mj_2_w%40mail.gmail.com suggested storing the filenode in the dump, and then to use RelidByRelfilenode to get the corresponding relation? It seems a lot better to use relfilenodes, because otherwise table rewrites will lead to reloading wrong things. > + int ret; > + Oid relationid; > + bool isnull; > + Datum value[1] = {ObjectIdGetDatum(filenode)}; > + StringInfoData buf; > + Oid ptype[1] = {OIDOID}; > + > + initStringInfo(&buf); > + appendStringInfo(&buf, > + "select oid from pg_class where pg_relation_filenode(oid) = $1"); > + > + ret = SPI_execute_with_args(buf.data, 1, (Oid *) &ptype, (Datum *) &value, > + NULL, true, 1); > + > + if (ret != SPI_OK_SELECT) > + ereport(FATAL, (errmsg("SPI_execute failed: error code %d", ret))); > + > + if (SPI_processed < 1) > + return InvalidOid; > + > + relationid = DatumGetObjectId(SPI_getbinval(SPI_tuptable->vals[0], > + SPI_tuptable->tupdesc, > + 1, &isnull)); > + if (isnull) > + return InvalidOid; > + > + return relationid; > +} Doing this via SPI doesn't strike me as a good idea - that's really quite expensive. Why not call the underlying function directly? > +/* > + * load_one_database -- start of prewarm sub-worker, this will try to load > + * blocks of one database starting from block info position passed by main > + * prewarm worker. > + */ > +void > +load_one_database(Datum main_arg) > +{ > + /* check if file exists and open file in read mode. */ > + snprintf(dump_file_path, sizeof(dump_file_path), "%s", AUTOPREWARM_FILE); > + file = fopen(dump_file_path, PG_BINARY_R); > + if (!file) > + return; /* No file to load. */ Shouldn't this be an error case? In which case is it ok for the file to be gone after we launched the worker? > + /* > + * It should be a block info belonging to a new database. Or else dump > + * file is corrupted better to end the loading of bocks now. > + */ > + if (loadblocktype != BLKTYPE_NEW_DATABASE) > + goto end_load; /* should we raise a voice here? */ Yes, this should raise an error. > + case BLKTYPE_NEW_RELATION: > + > + /* > + * release lock on previous relation. > + */ > + if (rel) > + { > + relation_close(rel, AccessShareLock); > + rel = NULL; > + } > + > + loadblocktype = BLKTYPE_NEW_RELATION; > + > + /* > + * lock new relation. > + */ > + reloid = get_reloid(toload_block.filenode); > + > + if (!OidIsValid(reloid)) > + break; > + > + rel = try_relation_open(reloid, AccessShareLock); > + if (!rel) > + break; > + RelationOpenSmgr(rel); Now I'm confused. Your get_reloid used pg_relation_filenode() to map from relation oid to filenode - and then you're using it to lock the relation? Something's wrong. 
> + case BLKTYPE_NEW_FORK: > + > + /* > + * check if fork exists and if block is within the range > + */ > + loadblocktype = BLKTYPE_NEW_FORK; > + if ( /* toload_block.forknum > InvalidForkNumber && > + * toload_block.forknum <= MAX_FORKNUM && */ > + !smgrexists(rel->rd_smgr, toload_block.forknum)) > + break; Huh? What's with that commented out section of code? > + case BLKTYPE_NEW_BLOCK: > + > + /* check if blocknum is valid and with in fork file size. */ > + if (toload_block.blocknum >= nblocks) > + { > + /* move to next forknum. */ > + loadblocktype = BLKTYPE_NEW_FORK; > + break; > + } Hm. Why does the size of the underlying file allow us to skip to the next fork? Don't we have to read all the pending dump records? > + buf = ReadBufferExtended(rel, toload_block.forknum, > + toload_block.blocknum, RBM_NORMAL, > + NULL); > + if (BufferIsValid(buf)) > + { > + ReleaseBuffer(buf); > + } > + > + loadblocktype = BLKTYPE_NEW_BLOCK; > + break; Hm. RBM_NORMAL will error out in a bunch of cases, is that ok? > + if (have_dbconnection) > + { > + SPI_finish(); > + PopActiveSnapshot(); > + CommitTransactionCommand(); > + } > + return; > +} Are we really ok keeping open a transaction through all of this? That could potentially be quite long, no? How about doing that on a per-file basis, or even moving to session locks alltogether? > +/* This sub-module is for periodically dumping buffer pool's block info into > + * a dump file AUTOPREWARM_FILE. > + * Each entry of block info looks like this: > + * <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall call it > + * as BlockInfoRecord. > + * > + * Contents of AUTOPREWARM_FILE has been formated such a way that > + * blockInfoRecord of each database can be given to different prewarm workers. > + * > + * format of AUTOPREWAM_FILE > + * ======================================= > + * [offset position of database map table] > + * [sorted BlockInfoRecords..............] > + * [database map table] > + * ======================================= This doesn't mention storing things as ascii, instead of binary... > + * The [database map table] is sequence of offset in file which will point to > + * first BlockInfoRecords of each database in the dump. The prewarm worker > + * will read this offset one by one in sequence and ask its subworker to seek > + * to this position and then start loading the BlockInfoRecords one by one > + * until it see a BlockInfoRecords of a different database than it is actually > + * connected to. > + * NOTE : We store off_t inside file so the dump file will not be portable to > + * be used across systems where sizeof off_t is different from each other. > + */ Why are we using off_t? Shouldn't this just be BlockNumber? > +static uint32 > +dump_now(void) > +{ > + static char dump_file_path[MAXPGPATH], > + > + for (num_blocks = 0, i = 0; i < NBuffers; i++) > + { > + uint32 buf_state; > + > + bufHdr = GetBufferDescriptor(i); > + > + /* lock each buffer header before inspecting. */ > + buf_state = LockBufHdr(bufHdr); > + > + if (buf_state & BM_TAG_VALID) > + { > + block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode; > + block_info_array[num_blocks].spcNode = bufHdr->tag.rnode.spcNode; > + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode; > + block_info_array[num_blocks].forknum = bufHdr->tag.forkNum; > + block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum; > + ++num_blocks; > + } > + > + UnlockBufHdr(bufHdr, buf_state); > + } > + > + /* sorting now only to avoid sorting while loading. 
*/ "sorting while loading"? You mean random accesses? > + pg_qsort(block_info_array, num_blocks, sizeof(BlockInfoRecord), > + sort_cmp_func); > + snprintf(transient_dump_file_path, sizeof(dump_file_path), > + "%s.%d", AUTOPREWARM_FILE, MyProcPid); > + file = fopen(transient_dump_file_path, "w"); > + if (file == NULL) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("autoprewarm: could not open \"%s\": %m", > + dump_file_path))); What if that file already exists? You're not truncating it. Also, what if we error out in the middle of this? We'll leak an fd. I think this needs to use OpenTransientFile etc. > + snprintf(dump_file_path, sizeof(dump_file_path), > + "%s", AUTOPREWARM_FILE); > + ret = fprintf(file, "%020jd\n", (intmax_t) 0); > + if (ret < 0) > + { > + fclose(file); > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("autoprewarm: error writing to \"%s\" : %m", > + dump_file_path))); > + } > + > + database_map_table[num_db++] = ftello(file); > + > + for (i = 0; i < num_blocks; i++) > + { > + if (i > 0 && block_info_array[i].database != prev_database) > + { > + if (num_db == database_map_table_size) > + { > + database_map_table_size *= 2; /* double and repalloc. */ > + database_map_table = > + (off_t *) repalloc(database_map_table, > + sizeof(off_t) * database_map_table_size); > + } > + fflush(file); > + database_map_table[num_db++] = ftello(file); > + } > + > + ret = fprintf(file, "%u,%u,%u,%u,%u\n", > + block_info_array[i].database, > + block_info_array[i].spcNode, > + block_info_array[i].filenode, > + (uint32) block_info_array[i].forknum, > + block_info_array[i].blocknum); > + if (ret < 0) > + { > + fclose(file); > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("autoprewarm: error writing to \"%s\" : %m", > + dump_file_path))); > + } > + > + prev_database = block_info_array[i].database; > + } I think we should check for interrupts somewhere in that (and the preceding) loop. > +/* > + * dump_block_info_periodically - at regular intervals, which is defined by GUC > + * dump_interval, dump the info of blocks which are present in buffer pool. > + */ > +void > +dump_block_info_periodically() > +{ Suggest adding void to the parameter list. > + pg_time_t last_dump_time = (pg_time_t) time(NULL); > + > + while (!got_sigterm) > + { > + int rc; > + pg_time_t now; > + int elapsed_secs = 0, > + timeout = AT_PWARM_DEFAULT_DUMP_INTERVAL; > + > + if (dump_interval > AT_PWARM_DUMP_AT_SHUTDOWN_ONLY) > + { > + now = (pg_time_t) time(NULL); > + elapsed_secs = now - last_dump_time; > + > + if (elapsed_secs > dump_interval) > + { > + dump_now(); > + if (got_sigterm) > + return; /* got shutdown signal just after a dump. And, > + * I think better to return now. */ > + last_dump_time = (pg_time_t) time(NULL); > + elapsed_secs = 0; > + } > + > + timeout = dump_interval - elapsed_secs; > + } I suggest using GetCurrenttimstamp() and TimestampDifferenceExceeds() instead. > + /* Has been set not to dump. Nothing more to do. */ > + if (dump_interval == AT_PWARM_OFF) > + return; > + > + ResetLatch(&MyProc->procLatch); > + rc = WaitLatch(&MyProc->procLatch, > + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, > + timeout * 1000, PG_WAIT_EXTENSION); > + > + if (rc & WL_POSTMASTER_DEATH) > + proc_exit(1); > + > + /* > + * In case of a SIGHUP, just reload the configuration. > + */ > + if (got_sighup) > + { > + got_sighup = false; > + ProcessConfigFile(PGC_SIGHUP); > + } > + } > + > + /* One last block meta info dump while postmaster shutdown. 
*/ > + if (dump_interval != AT_PWARM_OFF) > + dump_now(); Uh, afaics we'll also do this if somebody SIGTERMed the process interactively? > +/* Extension's entry point. */ > +void > +_PG_init(void) > +{ > + BackgroundWorker autoprewarm; > + > + /* Define custom GUC variables. */ > + DefineCustomIntVariable("pg_prewarm.dump_interval", > + "Sets the maximum time between two buffer pool dumps", > + "If set to Zero, timer based dumping is disabled." > + " If set to -1, stops the running autoprewarm.", > + &dump_interval, > + AT_PWARM_DEFAULT_DUMP_INTERVAL, > + AT_PWARM_OFF, INT_MAX / 1000, > + PGC_SIGHUP, > + GUC_UNIT_S, > + NULL, > + NULL, > + NULL); > + > + /* if not run as a preloaded library, nothing more to do here! */ > + if (!process_shared_preload_libraries_in_progress) > + return; > + > + DefineCustomStringVariable("pg_prewarm.default_database", > + "default database to connect if dump has not recorded same.", > + NULL, > + &default_database, > + "postgres", > + PGC_POSTMASTER, > + 0, > + NULL, > + NULL, > + NULL); I don't think it's a good idea to make guc registration depending on process_shared_preload_libraries_in_progress. You should also use EmitWarningsOnPlaceholders() somewhere here. I also wonder whether we don't need to use prefetch to actually make this fast enough. I think it's pretty clear that this needs a bit more work and thus won't be ready for v10. Moved to the next CF. - Andres
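On the shared-state concurrency point in the review above, a minimal sketch of the lock-based alternative being suggested is shown below; the struct layout, field names and helper are assumptions, not the patch's. A small shmem struct protected by a spinlock is enough to guarantee that only one autoprewarm worker runs at a time, with no atomics needed.

#include "postgres.h"
#include "miscadmin.h"
#include "storage/spin.h"

/* Shared state; the lock is initialized with SpinLockInit() when the shmem
 * area is first created. */
typedef struct AutoPrewarmSharedState
{
	slock_t		lock;			/* protects the fields below */
	pid_t		bgworker_pid;	/* pid of the running worker, or 0 if none */
} AutoPrewarmSharedState;

/* Try to become "the" autoprewarm worker; fail if another one is running. */
static bool
claim_autoprewarm_worker(AutoPrewarmSharedState *state)
{
	bool		ok = false;

	SpinLockAcquire(&state->lock);
	if (state->bgworker_pid == 0)
	{
		state->bgworker_pid = MyProcPid;
		ok = true;
	}
	SpinLockRelease(&state->lock);

	return ok;
}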
> On 2017-03-13 18:45:00 +0530, Mithun Cy wrote:
>> I have implemented a similar logic now. The prewarm bgworker will
>> launch a sub-worker per database in the dump file. And, each
>> sub-worker will load its database block info. The sub-workers will be
>> launched only after previous one is finished. All of this will only
>> start if the database has reached a consistent state.
>
> Hm. For replay performance it'd possibly be good to start earlier,
> before reaching consistency. Is there an issue starting earlier?
Thanks Andres for the detailed review. I will try to address the comments in my next post, but I thought it important to reply to the above comment first. Earlier patches used to start loading blocks before reaching a consistent state. Then Robert, while reviewing, found a basic flaw in my approach[1]: the function DropRelFileNodesAllBuffers does not expect others to load blocks concurrently while it is getting rid of buffered blocks. So loading has to be delayed until the database reaches a consistent state, so that we can connect to each database and take a relation lock before loading any of its blocks.
[1] cannot load blocks without holding relation lock
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
Thanks, Andres, I have tried to fix all of your comments. One important change has happened with this patch is previously we used to read one block info structure at a time and load it. But now we read all of them together and load it into as DSA and then we distribute the block infos to the subgroups to load corresponding blocks. With this portability issue which I have mentioned above will no longer exists as we do not store any map tables within the dump file. On Thu, Apr 6, 2017 at 4:12 AM, Andres Freund <andres@anarazel.de> wrote: > On 2017-03-13 18:45:00 +0530, Mithun Cy wrote: >> launched only after previous one is finished. All of this will only >> start if the database has reached a consistent state. > > Hm. For replay performance it'd possibly be good to start earlier, > before reaching consistency. Is there an issue starting earlier? Earlier patches used to start loading blocks before reaching a consistent state. Then Robert while reviewing found a basic flaw in my approach. The function DropRelFileNodesAllBuffers do not expect others to load the blocks concurrently while it is getting rid of buffered blocks. So has to delay loading until database reaches consistent state so that we can connect to each database and take a relation lock before loading any of theirs blocks. >> + * -- Automatically prewarm the shared buffer pool when server restarts. > > Don't think we ususally use -- here. -- Fixed. >> + * Copyright (c) 2013-2017, PostgreSQL Global Development Group > > Hm, that's a bit of a weird date range. -- changed it to 2016-2017 is that right? > The pg_prewarm.c in there looks like some search & replace gone awry. -- sorry Fixed. >> +#include "utils/rel.h" >> +#include "port/atomics.h" > > I'd rather just sort these alphabetically. > I think this should rather be in the initial header. -- Fixed as suggested and have moved everything into a header file pg_prewarm.h. > s/until there is a/until there is no/? -- Fixed. > >> +/* >> + * ============================================================================ >> + * =========================== SIGNAL HANDLERS =========================== >> + * ============================================================================ >> + */ > > Hm... -- I have reverted those cosmetic changes now. > >> +static void sigtermHandler(SIGNAL_ARGS); >> +static void sighupHandler(SIGNAL_ARGS); > > I don't think that's a casing we commonly use. We mostly use CamelCase > or underscore_case. > -- Fixed with CamelCase. >> + * Signal handler for SIGUSR1. >> + */ >> +static void >> +sigusr1Handler(SIGNAL_ARGS) > Hm, what's this one for? -- The prewarm sub-workers will notify with SIGUSR1 on their startup/shutdown. Updated the comments. >> +/* >> + * Shared state information about the running autoprewarm bgworker. >> + */ >> +typedef struct AutoPrewarmSharedState >> +{ >> + pg_atomic_uint32 current_task; /* current tasks performed by >> + * autoprewarm workers. */ >> +} AutoPrewarmSharedState; > > Hm. Why do we need atomics here? I thought there's no concurrency? There are 3 methods in autoprewarm. A: prealoaded prewarm bgworker which also prewarm and then periodic dumping; B: bgwoker launched by launch_autoprewarm_dump() which do the periodic dumping. C: Immediate dump by backends. We do not want 2 bgworkers started and working concurrently. and do not want to dump while prewarm task is running. As you have suggested rewrote a simple logic with the use of a boolean variable. 
>> +sort_cmp_func(const void *p, const void *q) >> +{ > > rename to blockinfo_cmp? -- Fixed. > > Superflous parens (repeated a lot). -- Fixed in all places. > Hm. Is it actually helpful to store the file as text? That's commonly > going to increase the size of the file quite considerably, no? -- Having in the text could help in readability or if we want to modify/adjust it as needed. I shall so a small experiment to check how much we caould save and produce a patch on top of this for same. > > But what is this actually used for? I thought Robert, in > http://archives.postgresql.org/message-id/CA%2BTgmoa%3DUqCL2mR%2B9WTq05tB3Up-z4Sv2wkzkDxDwBP7Mj_2_w%40mail.gmail.com > suggested storing the filenode in the dump, and then to use > RelidByRelfilenode to get the corresponding relation? -- Fixed as suggested directly calling RelidByRelfilenode now. >> +load_one_database(Datum main_arg) >> +{ > >> + /* check if file exists and open file in read mode. */ >> + snprintf(dump_file_path, sizeof(dump_file_path), "%s", AUTOPREWARM_FILE); >> + file = fopen(dump_file_path, PG_BINARY_R); >> + if (!file) >> + return; /* No file to load. */ > > Shouldn't this be an error case? In which case is it ok for the file to > be gone after we launched the worker? -- Yes sorry a mistake. In the new patch I have changed the file map to dsa to distribute block infos to subworker so this code will go away. Now the main worker will load all of the blockinfos into a dynamic shared area(dsa) and sub worker will read blocks from them which belong to the database they are assigned to load. > >> + /* >> + * It should be a block info belonging to a new database. Or else dump >> + * file is corrupted better to end the loading of bocks now. >> + */ >> + if (loadblocktype != BLKTYPE_NEW_DATABASE) >> + goto end_load; /* should we raise a voice here? */ > > Yes, this should raise an error. -- Got rid of this check with the new code. >> + case BLKTYPE_NEW_RELATION: >> + >> + /* >> + * release lock on previous relation. >> + */ >> + if (rel) >> + { >> + relation_close(rel, AccessShareLock); >> + rel = NULL; >> + } >> + >> + loadblocktype = BLKTYPE_NEW_RELATION; >> + >> + /* >> + * lock new relation. >> + */ >> + reloid = get_reloid(toload_block.filenode); >> + >> + if (!OidIsValid(reloid)) >> + break; >> + >> + rel = try_relation_open(reloid, AccessShareLock); >> + if (!rel) >> + break; >> + RelationOpenSmgr(rel); > > Now I'm confused. Your get_reloid used pg_relation_filenode() to map > from relation oid to filenode - and then you're using it to lock the > relation? Something's wrong. We take a shared lock on relation id so that while we load the blocks of the relation, DropRelFileNodesAllBuffers is not called by another process. >> + case BLKTYPE_NEW_FORK: >> + >> + /* >> + * check if fork exists and if block is within the range >> + */ >> + loadblocktype = BLKTYPE_NEW_FORK; >> + if ( /* toload_block.forknum > InvalidForkNumber && >> + * toload_block.forknum <= MAX_FORKNUM && */ >> + !smgrexists(rel->rd_smgr, toload_block.forknum)) >> + break; > > Huh? What's with that commented out section of code? -- smgrexists is not safe it crashed if we pass illegal forknumber so a precheck. Accidently forgot to uncomment same sorry. Fixed it now. >> + case BLKTYPE_NEW_BLOCK: >> + >> + /* check if blocknum is valid and with in fork file size. */ >> + if (toload_block.blocknum >= nblocks) >> + { >> + /* move to next forknum. */ >> + loadblocktype = BLKTYPE_NEW_FORK; >> + break; >> + } > > Hm. 
Why does the size of the underlying file allow us to skip to the > next fork? Don't we have to read all the pending dump records? -- Blocks beyond the file size are not existing. So I thought we can move to next fork to load thier blocks. >> + buf = ReadBufferExtended(rel, toload_block.forknum, >> + toload_block.blocknum, RBM_NORMAL, >> + NULL); >> + if (BufferIsValid(buf)) >> + { >> + ReleaseBuffer(buf); >> + } >> + >> + loadblocktype = BLKTYPE_NEW_BLOCK; >> + break; > > Hm. RBM_NORMAL will error out in a bunch of cases, is that ok? For now, on error we restart the bgworker; Previously I did setjump to catch(process) the error and continue. But as a review comment fix I have moved to new logic of restart. >> + if (have_dbconnection) >> + { >> + SPI_finish(); >> + PopActiveSnapshot(); >> + CommitTransactionCommand(); >> + } >> + return; >> +} > > Are we really ok keeping open a transaction through all of this? That > could potentially be quite long, no? How about doing that on a per-file > basis, or even moving to session locks alltogether? > -- We hold the transaction until we load all the dumped blocks of the database or buffer pool becomes full, yes that can go long. Should I end and start a transaction on every fork file? > > This doesn't mention storing things as ascii, instead of binary... > -- Fixed same. >> + * NOTE : We store off_t inside file so the dump file will not be portable to >> + * be used across systems where sizeof off_t is different from each other. >> + */ > > Why are we using off_t? Shouldn't this just be BlockNumber? -- Previously we maintained a special map to block infos of each database, but now moved everything to dsa so off_t this code is removed now. >> +static uint32 >> +dump_now(void) >> +{ >> + static char dump_file_path[MAXPGPATH], > >> + >> + for (num_blocks = 0, i = 0; i < NBuffers; i++) >> + { >> + uint32 buf_state; >> + >> + bufHdr = GetBufferDescriptor(i); >> + >> + /* lock each buffer header before inspecting. */ >> + buf_state = LockBufHdr(bufHdr); >> + >> + if (buf_state & BM_TAG_VALID) >> + { >> + block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode; >> + block_info_array[num_blocks].spcNode = bufHdr->tag.rnode.spcNode; >> + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode; >> + block_info_array[num_blocks].forknum = bufHdr->tag.forkNum; >> + block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum; >> + ++num_blocks; >> + } >> + >> + UnlockBufHdr(bufHdr, buf_state); > >> + } >> + >> + /* sorting now only to avoid sorting while loading. */ > > "sorting while loading"? You mean random accesses? -- Yes fixed same. >> + pg_qsort(block_info_array, num_blocks, sizeof(BlockInfoRecord), >> + sort_cmp_func); > > > >> + snprintf(transient_dump_file_path, sizeof(dump_file_path), >> + "%s.%d", AUTOPREWARM_FILE, MyProcPid); >> + file = fopen(transient_dump_file_path, "w"); >> + if (file == NULL) >> + ereport(ERROR, >> + (errcode_for_file_access(), >> + errmsg("autoprewarm: could not open \"%s\": %m", >> + dump_file_path))); > > What if that file already exists? You're not truncating it. Also, what > if we error out in the middle of this? We'll leak an fd. I think this > needs to use OpenTransientFile etc. Thanks using OpenTransientFile now. 
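[As a minimal sketch of the OpenTransientFile() pattern being discussed here -- not the patch itself: the temporary file name, the hypothetical record pointer blk, and its member names are placeholders taken from the struct quoted earlier, and error handling is reduced to the essentials.]

#include "postgres.h"
#include "storage/fd.h"
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch only: write one formatted block record through a transient fd. */
static void
write_one_record_sketch(BlockInfoRecord *blk)
{
	int		fd;
	char	line[64];
	int		len;

	fd = OpenTransientFile("autoprewarm.blocks.tmp",
						   O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
						   S_IRUSR | S_IWUSR);
	if (fd < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m",
						"autoprewarm.blocks.tmp")));

	len = snprintf(line, sizeof(line), "%u,%u,%u,%u,%u\n",
				   blk->database, blk->spcNode, blk->filenode,
				   (uint32) blk->forknum, blk->blocknum);
	if (write(fd, line, len) != len)
	{
		int		save_errno = errno;

		/* save errno first, since close/unlink may clobber it */
		CloseTransientFile(fd);
		unlink("autoprewarm.blocks.tmp");
		errno = save_errno;
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not write to file \"%s\": %m",
						"autoprewarm.blocks.tmp")));
	}
	CloseTransientFile(fd);
}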
>> + snprintf(dump_file_path, sizeof(dump_file_path), >> + "%s", AUTOPREWARM_FILE); >> + ret = fprintf(file, "%020jd\n", (intmax_t) 0); >> + if (ret < 0) >> + { >> + fclose(file); >> + ereport(ERROR, >> + (errcode_for_file_access(), >> + errmsg("autoprewarm: error writing to \"%s\" : %m", >> + dump_file_path))); >> + } >> + >> + database_map_table[num_db++] = ftello(file); >> + >> + for (i = 0; i < num_blocks; i++) >> + { >> + if (i > 0 && block_info_array[i].database != prev_database) >> + { >> + if (num_db == database_map_table_size) >> + { >> + database_map_table_size *= 2; /* double and repalloc. */ >> + database_map_table = >> + (off_t *) repalloc(database_map_table, >> + sizeof(off_t) * database_map_table_size); >> + } >> + fflush(file); >> + database_map_table[num_db++] = ftello(file); >> + } >> + >> + ret = fprintf(file, "%u,%u,%u,%u,%u\n", >> + block_info_array[i].database, >> + block_info_array[i].spcNode, >> + block_info_array[i].filenode, >> + (uint32) block_info_array[i].forknum, >> + block_info_array[i].blocknum); >> + if (ret < 0) >> + { >> + fclose(file); >> + ereport(ERROR, >> + (errcode_for_file_access(), >> + errmsg("autoprewarm: error writing to \"%s\" : %m", >> + dump_file_path))); >> + } >> + >> + prev_database = block_info_array[i].database; >> + } > > I think we should check for interrupts somewhere in that (and the > preceding) loop. -- Now checking from any sighup and if it is asked not to dump with pg_prewarm.dump_interval set to -1 I will terminate the loop. > >> +/* >> + * dump_block_info_periodically - at regular intervals, which is defined by GUC >> + * dump_interval, dump the info of blocks which are present in buffer pool. >> + */ >> +void >> +dump_block_info_periodically() >> +{ > > Suggest adding void to the parameter list. -- Fixed. >> + last_dump_time = (pg_time_t) time(NULL); >> + elapsed_secs = 0; >> + } >> + >> + timeout = dump_interval - elapsed_secs; >> + } > > I suggest using GetCurrenttimstamp() and TimestampDifferenceExceeds() > instead. -- Fixed as suggested. >> + /* >> + * In case of a SIGHUP, just reload the configuration. >> + */ >> + if (got_sighup) >> + { >> + got_sighup = false; >> + ProcessConfigFile(PGC_SIGHUP); >> + } >> + } >> + >> + /* One last block meta info dump while postmaster shutdown. */ >> + if (dump_interval != AT_PWARM_OFF) >> + dump_now(); > > Uh, afaics we'll also do this if somebody SIGTERMed the process > interactively? -- Yes we dump and exit on a sigterm. >> + /* if not run as a preloaded library, nothing more to do here! */ >> + if (!process_shared_preload_libraries_in_progress) >> + return; >> + >> + DefineCustomStringVariable("pg_prewarm.default_database", >> + "default database to connect if dump has not recorded same.", >> + NULL, >> + &default_database, >> + "postgres", >> + PGC_POSTMASTER, >> + 0, >> + NULL, >> + NULL, >> + NULL); > > I don't think it's a good idea to make guc registration depending on > process_shared_preload_libraries_in_progress. -- Fixed now. > You should also use EmitWarningsOnPlaceholders() somewhere here. -- Added same. > I also wonder whether we don't need to use prefetch to actually make > this fast enough. -- I have not used prefetch but I will re-access and update the patch. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
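[A rough sketch of the GetCurrentTimestamp()/TimestampDifferenceExceeds() approach suggested above, assuming a dump_interval GUC in seconds plus the got_sigterm flag and dump_now() function from the patch; the real worker would sleep on its latch rather than pg_usleep().]

#include "postgres.h"
#include "utils/timestamp.h"

/* Sketch only: decide when the periodic dump is due. */
static void
dump_loop_sketch(void)
{
	TimestampTz last_dump_time = GetCurrentTimestamp();

	while (!got_sigterm)
	{
		if (TimestampDifferenceExceeds(last_dump_time,
									   GetCurrentTimestamp(),
									   dump_interval * 1000))
		{
			dump_now();
			last_dump_time = GetCurrentTimestamp();
		}

		/* in the real worker this would be a WaitLatch() call */
		pg_usleep(1000000L);
	}
}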
On Tue, May 23, 2017 at 7:06 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > Thanks, Andres, > > I have tried to fix all of your comments. There was a typo in the previous patch 07 where I had used != instead of ==, and a mistake in the comments. I have corrected both now. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Wed, May 24, 2017 at 6:28 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > On Tue, May 23, 2017 at 7:06 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >> Thanks, Andres, >> >> I have tried to fix all of your comments. > > There was a typo issue in previous patch 07 where instead of == I have > used !=. And, a mistake in comments. I have corrected same now. +/* + * autoprewarm : + * + * What is it? + * =========== + * A bgworker which automatically records information about blocks which were + * present in buffer pool before server shutdown and then prewarm the buffer + * pool upon server restart with those blocks. + * + * How does it work? + * ================= + * When the shared library "pg_prewarm" is preloaded, a + * bgworker "autoprewarm" is launched immediately after the server has reached + * consistent state. The bgworker will start loading blocks recorded in the + * format BlockInfoRecord + * <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in + * $PGDATA/AUTOPREWARM_FILE, until there is no free buffer left in the buffer + * pool. This way we do not replace any new blocks which were loaded either by + * the recovery process or the querying clients. + * + * Once the "autoprewarm" bgworker has completed its prewarm task, it will + * start a new task to periodically dump the BlockInfoRecords related to blocks + * which are currently in shared buffer pool. Upon next server restart, the + * bgworker will prewarm the buffer pool by loading those blocks. The GUC + * pg_prewarm.dump_interval will control the dumping activity of the bgworker. + */ Make this part of the file header comment. Also, add an enabling GUC. The default can be true, but it should be possible to preload the library so that the SQL functions are available without a dynamic library load without requiring you to get the auto-prewarm behavior. I suggest pg_prewarm.autoprewarm = true / false. Your SigtermHandler and SighupHandler routines are still capitalized in a way that's not very consistent with what we do elsewhere. I think all of our other signal handlers have names_like_this() not NamesLikeThis(). + * ============== types and variables used by autoprewam ============= Spelling. + * Meta-data of each persistent block which is dumped and used to load. Metadata +typedef struct BlockInfoRecord +{ + Oid database; /* database */ + Oid spcNode; /* tablespace */ + Oid filenode; /* relation's filenode. */ + ForkNumber forknum; /* fork number */ + BlockNumber blocknum; /* block number */ +} BlockInfoRecord; spcNode is capitalized differently from all of the other members. + LWLock *lock; /* protects SharedState */ Just declare this as LWLock lock, and initialize it using LWLockInitialize. The way you're doing it is more complicated. +static dsa_area *AutoPrewarmDSA = NULL; DSA seems like more than you need here. There's only one allocation being performed. I think it would be simpler and less error-prone to use DSM directly. I don't even think you need a shm_toc. You could just do: seg = dsm_create(segsize, 0); blkinfo = dsm_segment_address(seg); Then pass dsm_segment_handle(seg) to the worker using bgw_main_arg or bgw_extra, and have it call dsm_attach. An advantage of this approach is that you only allocate the memory you actually need, whereas DSA will allocate more, expecting that you might do further allocations. 
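[To make the DSM suggestion above concrete, here is a sketch of the hand-off; "worker" is the BackgroundWorker being registered, and num_blocks and BlockInfoRecord come from the patch -- this is an illustration of the idea, not the patch's actual code.]

/* Main autoprewarm worker: create the segment and pass its handle. */
dsm_segment *seg;
BlockInfoRecord *blkinfo;
Size		segsize = mul_size(num_blocks, sizeof(BlockInfoRecord));

seg = dsm_create(segsize, 0);
blkinfo = (BlockInfoRecord *) dsm_segment_address(seg);
/* ... fill blkinfo[0 .. num_blocks - 1] from the dump file ... */
worker.bgw_main_arg = UInt32GetDatum(dsm_segment_handle(seg));

/* Per-database worker: attach using the handle passed in bgw_main_arg. */
seg = dsm_attach(DatumGetUInt32(MyBgworkerEntry->bgw_main_arg));
if (seg == NULL)
	ereport(ERROR,
			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
			 errmsg("could not map dynamic shared memory segment")));
blkinfo = (BlockInfoRecord *) dsm_segment_address(seg);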
+ pg_qsort(block_info_array, num_blocks, sizeof(BlockInfoRecord), + blockinfo_cmp); I think it would make more sense to sort the array on the load side, in the autoprewarm worker, rather than when dumping. First, many dumps will never be reloaded, so there's no real reason to waste time sorting them. Second, the way you've got the code right now, it relies heavily on the assumption that the dump file will be sorted, but that doesn't seem like a necessary assumption. We don't really expect users to hand-edit the dump files, but if they do, it'd be nice if it didn't randomly break things unnecessarily. + errmsg("autoprewarm: could not open \"%s\": %m", + dump_file_path))); Use standard error message wordings! Don't create new translatable messages by adding "autoprewarm" to error messages. There are multiple instances of this throughout the patch. + snprintf(dump_file_path, sizeof(dump_file_path), + "%s", AUTOPREWARM_FILE); This is completely pointless. You can get rid of the dump_file_path variable and just use AUTOPREWARM_FILE wherever you would have used dump_file_path. It's just making a copy of a string you already have. Also, this code interrupts the flow of the surrounding logic in a weird way; even if we needed it, it would make more sense to put it up higher, where we construct the temporary path, or down lower, where the value is actually needed. + SPI_connect(); + PushActiveSnapshot(GetTransactionSnapshot()); It is really unclear why you need this, since the code does not use SPI for anything, or do anything that needs a snapshot. + goto end_load; I think you should try to rewrite this function so that it doesn't need "goto". I also think in general that this function could be written in a much more direct way by getting rid of the switch and the BLKTYPE_* constants altogether. connect_to_db() can only happen once, so do that the beginning. After that, the logic can look roughly like this: BlockInfoRecord *old_blk = NULL; while (!got_sigterm && pos < maxpos && have_free_buffer()) { BlockInfoRecord *blk = block_info[pos]; /* Quit if we've reached records for another database. */ if (old_blk != NULL && old_blk->dbOid != blk->dbOid) break; /* * When we reach a new relation, close the old one. Note, however, * that the previous try_relation_open mayhave failed, in which case * rel will be NULL. */ if (old_blk != NULL && old_blk->relOid != blk->relOid && rel!= NULL) { relation_close(rel, AccessShareLock); rel = NULL; } /* * Try to open each new relation, but only once, when we first * encounter it. If it's been dropped, skip theassociated blocks. */ if (old_blk == NULL || old_blk->relOid != blk->relOid) { Assert(rel == NULL); rel = try_relation_open(blk->relOid, AccessShareLock); } if (!rel) { ++pos; continue; } /* Once per fork, check for fork existence and size. */ if (old_blk == NULL || old_blk->forkNumber != blk->forkNumber) { RelationOpenSmgr(rel); if (smgrexists(rel->rd_smgr, blk->forkNumber)) nblocks =RelationGetNumberOfBlocksInFork(...); else nblocks = 0; } /* Prewarm buffer. */ buf = ReadBufferExtended(...); ... ++pos; } + errhint("Kill all remaining database processes and restart" + " the database."))); Don't break strings across lines. Just put it all on one (long) line. I don't think you should really need default_database. It seems like it should be possible to jigger things so that those blocks are loaded together with some other database (say, let the first worker do it). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
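[For reference, the load-side sort suggested above only needs an ordinary qsort comparator over BlockInfoRecord; a sketch, with field names as in the struct quoted earlier, might look like this. It would then be passed to pg_qsort() right after the dump file has been read in.]

/* Sketch only: order records by database, tablespace, relation, fork, block. */
static int
blockinfo_cmp(const void *p, const void *q)
{
	const BlockInfoRecord *a = (const BlockInfoRecord *) p;
	const BlockInfoRecord *b = (const BlockInfoRecord *) q;

	if (a->database != b->database)
		return (a->database < b->database) ? -1 : 1;
	if (a->spcNode != b->spcNode)
		return (a->spcNode < b->spcNode) ? -1 : 1;
	if (a->filenode != b->filenode)
		return (a->filenode < b->filenode) ? -1 : 1;
	if (a->forknum != b->forknum)
		return (a->forknum < b->forknum) ? -1 : 1;
	if (a->blocknum != b->blocknum)
		return (a->blocknum < b->blocknum) ? -1 : 1;
	return 0;
}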
Thanks Robert, On Wed, May 24, 2017 at 5:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: > + * > + * Once the "autoprewarm" bgworker has completed its prewarm task, it will > + * start a new task to periodically dump the BlockInfoRecords related to blocks > + * which are currently in shared buffer pool. Upon next server restart, the > + * bgworker will prewarm the buffer pool by loading those blocks. The GUC > + * pg_prewarm.dump_interval will control the dumping activity of the bgworker. > + */ > > Make this part of the file header comment. -- Thanks, I have moved same as part of header description. > Also, add an enabling GUC. > The default can be true, but it should be possible to preload the > library so that the SQL functions are available without a dynamic > library load without requiring you to get the auto-prewarm behavior. > I suggest pg_prewarm.autoprewarm = true / false. -- Thanks, I have added a boolean GUC pg_prewarm.autoprewarm with default true. I have made it as PGC_POSTMASTER level variable considering the intention is here to avoid starting the autoprewarm worker. Need help, have I missed anything? Currently, sql callable functions autoprewarm_dump_now(), launch_autoprewarm_dump() are only available after create extension pg_prewarm command will this change now? There is another GUC setting pg_prewarm.dump_interval if = -1 we stop the running autoprewarm worker. I have a doubt should we combine these 2 entities into one such that it control the state of autoprewarm worker? > Your SigtermHandler and SighupHandler routines are still capitalized > in a way that's not very consistent with what we do elsewhere. I > think all of our other signal handlers have names_like_this() not > NamesLikeThis(). > -- handler functions are renamed for example apw_sigterm_handler, as similar in autovacuum.c > + * ============== types and variables used by autoprewam ============= > > Spelling. -- Fixed, Sorry. > + * Meta-data of each persistent block which is dumped and used to load. > > Metadata -- Fixed. > +typedef struct BlockInfoRecord > +{ > + Oid database; /* database */ > + Oid spcNode; /* tablespace */ > + Oid filenode; /* relation's filenode. */ > + ForkNumber forknum; /* fork number */ > + BlockNumber blocknum; /* block number */ > +} BlockInfoRecord; > > spcNode is capitalized differently from all of the other members. -- renamed from spcNode to spcnode. > > + LWLock *lock; /* protects SharedState */ > > Just declare this as LWLock lock, and initialize it using > LWLockInitialize. The way you're doing it is more complicated. -- Fixed as suggested LWLockInitialize(&state->lock, LWLockNewTrancheId()); > +static dsa_area *AutoPrewarmDSA = NULL; > > DSA seems like more than you need here. There's only one allocation > being performed. I think it would be simpler and less error-prone to > use DSM directly. I don't even think you need a shm_toc. You could > just do: > > seg = dsm_create(segsize, 0); > blkinfo = dsm_segment_address(seg); > Then pass dsm_segment_handle(seg) to the worker using bgw_main_arg or > bgw_extra, and have it call dsm_attach. An advantage of this approach > is that you only allocate the memory you actually need, whereas DSA > will allocate more, expecting that you might do further allocations. - Thanks Fixed. And we pass following arguments to subwrokers through bgw_extra /* * The block_infos allocated to each sub-worker to do prewarming. 
*/ typedef struct prewarm_elem { dsm_handle block_info_handle; /* handle to dsm seg of block_infos */ Oid database; /* database to connect and load */ uint32 start_pos; /* start position within block_infos from * which sub-worker start prewaring blocks. */ uint32 end_of_blockinfos; /* End of block_infos in dsm */ } prewarm_elem; to distribute the prewarm work among subworkers. > > + pg_qsort(block_info_array, num_blocks, sizeof(BlockInfoRecord), > + blockinfo_cmp); > > I think it would make more sense to sort the array on the load side, > in the autoprewarm worker, rather than when dumping. Fixed Now sorting block_infos just before loading the blocks > + errmsg("autoprewarm: could not open \"%s\": %m", > + dump_file_path))); > > Use standard error message wordings! Don't create new translatable > messages by adding "autoprewarm" to error messages. There are > multiple instances of this throughout the patch. -- Removed autoprewarm as part of sufix in error message in all of the places. > + snprintf(dump_file_path, sizeof(dump_file_path), > + "%s", AUTOPREWARM_FILE); > > This is completely pointless. You can get rid of the dump_file_path -- dump_file_path is removed. > > + SPI_connect(); > + PushActiveSnapshot(GetTransactionSnapshot()); > > It is really unclear why you need this, since the code does not use > SPI for anything, or do anything that needs a snapshot. -Sorry Removed it now. > + goto end_load; > > I think you should try to rewrite this function so that it doesn't > need "goto". I also think in general that this function could be > written in a much more direct way by getting rid of the switch and the > BLKTYPE_* constants altogether. connect_to_db() can only happen once, > so do that the beginning. After that, the logic can look roughly like > this: -- Fixed using exact code framework as you have suggested previously. > > + errhint("Kill all remaining database processes and restart" > + " the database."))); > > Don't break strings across lines. Just put it all on one (long) line. -- Fixed; I have tried to trim the string which was going beyond 80chars but has fixed it now as you have suggested. > I don't think you should really need default_database. It seems like > it should be possible to jigger things so that those blocks are loaded > together with some other database (say, let the first worker do it). -- Fixed, for block_infos with database 0 will be combined with next database's block_info load. One problem which I have kept open is what if there are only block_info's with database 0 in dump file, With default_database we could have handled such case. After removing it I am ignoring block_infos of 0 databases in such cases. Is that okay?. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
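[A minimal sketch of how the pg_prewarm.autoprewarm boolean GUC described above could be registered in _PG_init(); the variable name and description strings are illustrative, not necessarily what the patch uses.]

#include "postgres.h"
#include "utils/guc.h"

static bool autoprewarm = true;		/* start the autoprewarm worker? */

void
_PG_init(void)
{
	DefineCustomBoolVariable("pg_prewarm.autoprewarm",
							 "Starts the autoprewarm worker.",
							 NULL,
							 &autoprewarm,
							 true,
							 PGC_POSTMASTER,
							 0,
							 NULL,
							 NULL,
							 NULL);
	EmitWarningsOnPlaceholders("pg_prewarm");

	/* bgworker registration would follow here when autoprewarm is true */
}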
On 27.10.2016 14:39, Mithun Cy wrote: > # pg_autoprewarm. > > This a PostgreSQL contrib module which automatically dump all of the > blocknums > present in buffer pool at the time of server shutdown(smart and fast > mode only, > to be enhanced to dump at regular interval.) and load these blocks > when server restarts. > > Design: > ------ > We have created a BG Worker Auto Pre-warmer which during shutdown > dumps all the > blocknum in buffer pool in sorted order. > Format of each entry is > <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>. > Auto Pre-warmer is started as soon as the postmaster is started we do > not wait > for recovery to finish and database to reach a consistent state. If > there is a > "dump_file" to load we start loading each block entry to buffer pool until > there is a free buffer. This way we do not replace any new blocks > which was > loaded either by recovery process or querying clients. Then it waits > until it receives > SIGTERM to dump the block information in buffer pool. > > HOW TO USE: > ----------- > Build and add the pg_autoprewarm to shared_preload_libraries. Auto > Pre-warmer > process automatically do dumping of buffer pool's block info and load > them when > restarted. > > TO DO: > ------ > Add functionality to dump based on timer at regular interval. > And some cleanups. I wonder if you considered parallel prewarming of a table? Right now either with pg_prewarm, either with pg_autoprewarm, preloading table's data is performed by one backend. It certainly makes sense if there is just one HDD and we want to minimize impact of pg_prewarm on normal DBMS activity. But sometimes we need to load data in memory as soon as possible. And modern systems has larger number of CPU cores and RAID devices make it possible to efficiently load data in parallel. I have asked this question in context of my CFS (compressed file system) for Postgres. The customer's complaint was that there are 64 cores at his system but when he is building index, decompression of heap data is performed by only one core. This is why I thought about prewarm... (parallel index construction is separate story...) pg_prewarm makes is possible to specify range of blocks, so, in principle, it is possible to manually preload table in parallel, by spawining pg_prewarm with different subranges in several backends. But it is definitely not user friendly approach. And as far as I understand pg_autoprewarm has all necessary infrastructure to do parallel load. We just need to spawn more than one background worker and specify separate block range for each worker. Do you think that such functionality (parallel autoprewarm) can be useful and be easily added? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Tue, May 30, 2017 at 10:16 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > Thanks Robert, Sorry, there was one more mistake (a typo) in dump_now(): I had used free instead of pfree. This is corrected in the new patch v10. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Tue, May 30, 2017 at 12:36 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > On 27.10.2016 14:39, Mithun Cy wrote: > And as far as I understand pg_autoprewarm has all necessary infrastructure > to do parallel load. We just need to spawn more than one background worker > and specify > separate block range for each worker. > > Do you think that such functionality (parallel autoprewarm) can be useful > and be easily added? I have not paid much attention to making autoprewarm parallel. But as per the current state of the code, making the subworkers load the blocks in parallel should be possible, and I think it could be added easily. Probably we would need a configurable parameter to restrict the maximum number of parallel prewarm sub-workers. Since I have not thought much about this, I might not be aware of all of the difficulties around it. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com
On Tue, May 30, 2017 at 12:36 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > On 27.10.2016 14:39, Mithun Cy wrote: > > > I wonder if you considered parallel prewarming of a table? > Right now either with pg_prewarm, either with pg_autoprewarm, preloading > table's data is performed by one backend. > It certainly makes sense if there is just one HDD and we want to minimize > impact of pg_prewarm on normal DBMS activity. > But sometimes we need to load data in memory as soon as possible. And modern > systems has larger number of CPU cores and > RAID devices make it possible to efficiently load data in parallel. > > I have asked this question in context of my CFS (compressed file system) for > Postgres. The customer's complaint was that there are 64 cores at his system > but when > he is building index, decompression of heap data is performed by only one > core. This is why I thought about prewarm... (parallel index construction is > separate story...) > > pg_prewarm makes is possible to specify range of blocks, so, in principle, > it is possible to manually preload table in parallel, by spawining > pg_prewarm > with different subranges in several backends. But it is definitely not user > friendly approach. > And as far as I understand pg_autoprewarm has all necessary infrastructure > to do parallel load. We just need to spawn more than one background worker > and specify > separate block range for each worker. > > Do you think that such functionality (parallel autoprewarm) can be useful > and be easily added? > I think parallel load functionality can be useful for few cases like when the system has multiple I/O channels. I think doing it parallelly might need some additional infrastructure to manage the workers based on how we decide to parallelism like whether we allow each worker to pick one block and load the same or specify the range of blocks for each worker. Each way has its own pros and cons. It seems like even if we want to add such an option to *prewarm functionality, it should be added as a separate patch as it has its own set of problems that needs to be solved. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, May 30, 2017 at 8:15 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > On Tue, May 30, 2017 at 10:16 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >> Thanks Robert, > > Sorry, there was one more mistake ( a typo) in dump_now() instead of > using pfree I used free corrected same in the new patch v10. + * contrib/autoprewarm.c Wrong. + Oid database; /* database */ + Oid spcnode; /* tablespace */ + Oid filenode; /* relation's filenode. */ + ForkNumber forknum; /* fork number */ + BlockNumber blocknum; /* block number */ Why spcnode rather than tablespace? Also, the third line has a period, but not the others. I think you could just drop these comments; they basically just duplicate the structure names, except for spcnode which doesn't but probably should. + bool can_do_prewarm; /* if set can't do prewarm task. */ The comment and the name of the variable imply opposite meanings. +/* + * detach_blkinfos - on_shm_exit detach the dsm allocated for blockinfos. + */ +static void +detach_blkinfos(int code, Datum arg) +{ + if (seg != NULL) + dsm_detach(seg); +} I assume you don't really need this. Presumably process exit is going to detach the DSM anyway. + if (seg == NULL) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("unable to map dynamic shared memory segment"))); Let's use the wording from the message in parallel.c rather than this wording. Actually, we should probably (as a separate patch) change test_shm_mq's worker.c to use the parallel.c wording also. + SetCurrentStatementStartTimestamp(); Why do we need this? + StartTransactionCommand(); Do we need a transaction? If so, how about starting a new transaction for each relation instead of using a single one for the entire prewarm? + if (status == BGWH_STOPPED) + return; + + if (status == BGWH_POSTMASTER_DIED) + { + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_RESOURCES), + errmsg("cannot start bgworker autoprewarm without postmaster"), + errhint("Kill all remaining database processes and restart" + " the database."))); + } + + Assert(0); Instead of writing it this way, remove the first "if" block and change the second one to Assert(status == BGWH_STOPPED). Also, I'd ditch the curly braces in this case. + file = fopen(AUTOPREWARM_FILE, PG_BINARY_R); Use AllocateFile() instead. See the comments for that function and at the top of fd.c. Then you don't need to worry about leaking the file handle on an error, so you can remove the fclose() before ereport() stuff -- which is incomplete anyway, because you could fail e.g. inside dsm_create(). + errmsg("Error reading num of elements in \"%s\" for" + " autoprewarm : %m", AUTOPREWARM_FILE))); As I said in a previous review, please do NOT split error messages across lines like this. Also, this error message is nothing close to PostgreSQL style. Please read https://www.postgresql.org/docs/devel/static/error-style-guide.html and learn to follow all those guidelines written therein. I see at least 3 separate problems with this message. + num_elements = i; I'd do something like if (i != num_elements) elog(ERROR, "autoprewarm block dump has %d entries but expected %d", i, num_elements); It seems OK for this to be elog() rather than ereport() because the file should never be corrupt unless the user has cheated by hand-editing it. I think you can get rid of next_db_pos altogether, and this prewarm_elem thing too. Design sketch: 1. Move all the stuff that's in prewarm_elem into AutoPrewarmSharedState. 
Rename start_pos to prewarm_start_idx and end_of_blockinfos to prewarm_stop_idx. 2. Instead of computing all of the database ranges first and then launching workers, do it all in one loop. So figure out where the records for the current database end and set prewarm_start_idx and prewarm_end_idx appropriately. Launch a worker. When the worker terminates, set prewarm_start_idx = prewarm_end_idx and advance prewarm_end_idx to the end of the next database's records. This saves having to allocate memory for the next_db_pos array, and it also avoids this crock: + memcpy(&pelem, MyBgworkerEntry->bgw_extra, sizeof(prewarm_elem)); The reason that's bad is because it only works so long as bgw_extra is large enough to hold prewarm_elem. If prewarm_elem grows or bgw_extra shrinks, this turns into a buffer overrun. I would use AUTOPREWARM_FILE ".tmp" rather than a name incorporating the PID for the temporary file. Otherwise, you might leave many temporary files behind under different names. If you use the same name every time, you'll never have more than one, and the next successful dump will end up getting rid of it along the way. + pfree(block_info_array); + CloseTransientFile(fd); + unlink(transient_dump_file_path); + ereport(ERROR, + (errcode_for_file_access(), + errmsg("error writing to \"%s\" : %m", + AUTOPREWARM_FILE))); Again, this is NOT a standard error message text. It occurs in zero other places in the source tree. You are not the first person to need an error message for a failed write to a file; please look at what the previous authors did. Also, the pfree() before report is not needed; isn't the whole process going to terminate? Also, you can't really use errcode_for_file_access() here, because you've meanwhile done CloseTransientFile() and unlink(), which will have clobbered errno. + ereport(LOG, (errmsg("saved metadata info of %d blocks", num_blocks))); Not project style for ereport(). Line break after the first comma. Similarly elsewhere. + * dump_now - the main routine which goes through each buffer header of buffer + * pool and dumps their meta data. We Sort these data and then dump them. + * Sorting is necessary as it facilitates sequential read during load. This is no longer true, because you moved the sort to the load side. It's also not capitalized properly. Discussions of the format of the autoprewarm dump file involve inexplicably varying number of < and > symbols: + * <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in + * <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall call it + buflen = sprintf(buf, "%u,%u,%u,%u,%u\n", +#ifndef __AUTOPREWARM_H__ +#define __AUTOPREWARM_H__ We don't use double-underscored names for header files. Please consult existing header files for the appropriate style. Also, why does this file exist at all, instead of just including them in the .c file? The pointer of a header is for things that need to be included by multiple .c files, but there's no such need here. + * load. If there are no other block infos than the global objects + * we silently ignore them. Should I throw error? Silently ignoring them seems fine. Throwing an error doesn't seem like it would improve things. + /* + * Register a sub-worker to load new database's block. Wait until the + * sub-worker finish its job before launching next sub-worker. + */ + launch_prewarm_subworker(&pelem); The function name implies that it launches the worker, but the comment implies that it also waits for it to terminate. 
Functions should be named in a way that matches what they do. I feel like the get_autoprewarm_task() function is introducing fairly baroque logic for something that really ought to be more simple. All autoprewarm_main() really needs to do is: if (!state->autoprewarm_done) autoprewarm(); dump_block_info_periodically(); The locking in autoprewarm_dump_now() is a bit tricky. There are two trouble cases. One is that we try to rename() our new dump file on top of the existing one while a background worker is still using it to perform an autoprewarm. The other is that we try to write a new temporary dump file while some other process is trying to do the same thing. I think we should fix this by storing a PID in AutoPrewarmSharedState; a process which wants to perform a dump or an autoprewarm must change the PID from 0 to their own PID, and change it back to 0 on successful completion or error exit. If we go to perform an immediate dump process and finds a non-zero value already just does ereport(ERROR, ...), including the PID of the other process in the message (e.g. "unable to perform block dump because dump file is being used by PID %d"). In a background worker, if we go to dump and find the file in use, log a message (e.g. "skipping block dump because it is already being performed by PID %d", "skipping prewarm because block dump file is being rewritten by PID %d"). I also think we should change is_bgworker_running to a PID, so that if we try to launch a new worker we can report something like "autoprewarm worker is already running under PID %d". So putting that all together, I suppose AutoPrewarmSharedState should end up looking like this: LWLock lock; /* mutual exclusion */ pid_t bgworker_pid; /* for main bgworker */ pid_t pid_using_dumpfile; /* for autoprewarm or block dump */ /* following items are for communication with per-database worker */ dsm_handle block_info_handle; Oid database; int prewarm_start_idx; int prewarm_stop_idx; I suggest going through and changing "subworker" to "per-database worker" throughout. BTW, have you tested how long this takes to run with, say, shared_buffers = 8GB? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
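[A sketch of the PID hand-shake proposed above for autoprewarm_dump_now(), assuming the AutoPrewarmSharedState layout just shown with apw_state pointing at it; the wording and structure here are illustrative only.]

/* Sketch only: claim the dump file, or error out naming the current holder. */
LWLockAcquire(&apw_state->lock, LW_EXCLUSIVE);
if (apw_state->pid_using_dumpfile != 0)
{
	pid_t		holder = apw_state->pid_using_dumpfile;

	LWLockRelease(&apw_state->lock);
	ereport(ERROR,
			(errmsg("could not perform block dump because dump file is being used by PID %d",
					(int) holder)));
}
apw_state->pid_using_dumpfile = MyProcPid;
LWLockRelease(&apw_state->lock);

/* ... write the temporary file and rename() it over the dump file ... */

LWLockAcquire(&apw_state->lock, LW_EXCLUSIVE);
apw_state->pid_using_dumpfile = 0;
LWLockRelease(&apw_state->lock);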
On Wed, May 31, 2017 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: > + * contrib/autoprewarm.c > > Wrong. -- Oops Sorry fixed. > + Oid database; /* database */ > + Oid spcnode; /* tablespace */ > + Oid filenode; /* relation's filenode. */ > + ForkNumber forknum; /* fork number */ > + BlockNumber blocknum; /* block number */ > > Why spcnode rather than tablespace? Also, the third line has a > period, but not the others. I think you could just drop these > comments; they basically just duplicate the structure names, except > for spcnode which doesn't but probably should. -- Dropped comments and changed spcnode to tablespace. > + bool can_do_prewarm; /* if set can't do prewarm task. */ > > The comment and the name of the variable imply opposite meanings. -- Sorry a typo. Now this variable has been removed as you have suggested with new variables in AutoPrewarmSharedState. > +/* > + * detach_blkinfos - on_shm_exit detach the dsm allocated for blockinfos. > + */ > I assume you don't really need this. Presumably process exit is going > to detach the DSM anyway. -- Yes considering process exit will detach the dsm, I have removed that function. > + if (seg == NULL) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("unable to map dynamic shared memory segment"))); > > Let's use the wording from the message in parallel.c rather than this > wording. Actually, we should probably (as a separate patch) change > test_shm_mq's worker.c to use the parallel.c wording also. -- I have corrected the message with "could not map dynamic shared memory segment" as in parallel.c > + SetCurrentStatementStartTimestamp(); > > Why do we need this? -- Removed Sorry forgot to remove same when I removed the SPI connection code. > + StartTransactionCommand(); > > Do we need a transaction? If so, how about starting a new transaction > for each relation instead of using a single one for the entire > prewarm? -- We do relation_open hence need a transaction. As suggested now we start a new transaction on every new relation. > + if (status == BGWH_STOPPED) > + return; > + > + if (status == BGWH_POSTMASTER_DIED) > + { > + ereport(ERROR, > + (errcode(ERRCODE_INSUFFICIENT_RESOURCES), > + errmsg("cannot start bgworker autoprewarm without postmaster"), > + errhint("Kill all remaining database processes and restart" > + " the database."))); > + } > + > + Assert(0); > > Instead of writing it this way, remove the first "if" block and change > the second one to Assert(status == BGWH_STOPPED). Also, I'd ditch the > curly braces in this case. -- Fixed as suggested. > > + file = fopen(AUTOPREWARM_FILE, PG_BINARY_R); > > Use AllocateFile() instead. See the comments for that function and at > the top of fd.c. Then you don't need to worry about leaking the file > handle on an error, so you can remove the fclose() before ereport() now> stuff -- which is incomplete anyway, because you could fail e.g. > inside dsm_create(). -- Using AllocateFile now. > > + errmsg("Error reading num of elements in \"%s\" for" > + " autoprewarm : %m", AUTOPREWARM_FILE))); > > As I said in a previous review, please do NOT split error messages > across lines like this. Also, this error message is nothing close to > PostgreSQL style. Please read > https://www.postgresql.org/docs/devel/static/error-style-guide.html > and learn to follow all those guidelines written therein. I see at > least 3 separate problems with this message. -- Thanks, I have tried to fix it now. 
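[The per-relation transaction arrangement mentioned above could be sketched roughly as follows; reloid and the surrounding loop over dumped blocks are assumed from the patch.]

/* Sketch only: open each relation in its own transaction while prewarming. */
Relation	rel;

StartTransactionCommand();
rel = try_relation_open(reloid, AccessShareLock);
if (rel)
{
	RelationOpenSmgr(rel);
	/* ... ReadBufferExtended() each dumped block of this relation ... */
	relation_close(rel, AccessShareLock);
}
CommitTransactionCommand();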
> + num_elements = i; > > I'd do something like if (i != num_elements) elog(ERROR, "autoprewarm > block dump has %d entries but expected %d", i, num_elements); It > seems OK for this to be elog() rather than ereport() because the file > should never be corrupt unless the user has cheated by hand-editing > it. -- Fixed as suggested. Now eloged as an ERROR. > I think you can get rid of next_db_pos altogether, and this > prewarm_elem thing too. Design sketch: > > 1. Move all the stuff that's in prewarm_elem into > AutoPrewarmSharedState. Rename start_pos to prewarm_start_idx and > end_of_blockinfos to prewarm_stop_idx. > > 2. Instead of computing all of the database ranges first and then > launching workers, do it all in one loop. So figure out where the > records for the current database end and set prewarm_start_idx and > prewarm_end_idx appropriately. Launch a worker. When the worker > terminates, set prewarm_start_idx = prewarm_end_idx and advance > prewarm_end_idx to the end of the next database's records. > > This saves having to allocate memory for the next_db_pos array, and it > also avoids this crock: > > + memcpy(&pelem, MyBgworkerEntry->bgw_extra, sizeof(prewarm_elem)); > -- Fixed as suggested. > The reason that's bad is because it only works so long as bgw_extra is > large enough to hold prewarm_elem. If prewarm_elem grows or bgw_extra > shrinks, this turns into a buffer overrun. -- passing prewarm info through bgw_extra helped us to restrict the scope and lifetime of prewarm_elem only to prewarm task. Moving them to shared memory made them global even though they are not needed once prewarm task is finished. As there are other disadvantages of using bgw_extra I have now implemented as you have suggested. > I would use AUTOPREWARM_FILE ".tmp" rather than a name incorporating > the PID for the temporary file. Otherwise, you might leave many > temporary files behind under different names. If you use the same > name every time, you'll never have more than one, and the next > successful dump will end up getting rid of it along the way. -- Fixed as sugested. Previosuly PID was used so that concurrent dump can happen between dump worker and immediate dump as they will write to two different files. With new way of registering PID before file access in shared memory I think that problem can be adressed. > > + pfree(block_info_array); > + CloseTransientFile(fd); > + unlink(transient_dump_file_path); > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("error writing to \"%s\" : %m", .> + AUTOPREWARM_FILE))); > > Again, this is NOT a standard error message text. It occurs in zero > other places in the source tree. You are not the first person to need > an error message for a failed write to a file; please look at what the > previous authors did. Also, the pfree() before report is not needed; > isn't the whole process going to terminate? Also, you can't really use > errcode_for_file_access() here, because you've meanwhile done > CloseTransientFile() and unlink(), which will have clobbered errno. -- Removed pfree, saved errno before CloseTransientFile() and unlink() > + ereport(LOG, (errmsg("saved metadata info of %d blocks", num_blocks))); > > Not project style for ereport(). Line break after the first comma. > Similarly elsewhere. -- Tried to fix same > + * dump_now - the main routine which goes through each buffer > header of buffer > + * pool and dumps their meta data. We Sort these data and then dump them. > + * Sorting is necessary as it facilitates sequential read during load. 
> > This is no longer true, because you moved the sort to the load side. > It's also not capitalized properly. -- Sorry removed now. > Discussions of the format of the autoprewarm dump file involve > inexplicably varying number of < and > symbols: > > + * <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in > + * <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall call it > + buflen = sprintf(buf, "%u,%u,%u,%u,%u\n", -- Sorry fixed now, in all of the places the block info formats will not have such (</>) delimiter. > +#ifndef __AUTOPREWARM_H__ > +#define __AUTOPREWARM_H__ > > We don't use double-underscored names for header files. Please > consult existing header files for the appropriate style. Also, why > does this file exist at all, instead of just including them in the .c > file? The pointer of a header is for things that need to be included > by multiple .c files, but there's no such need here. -- This was done to fix one of the previous review comments. I have moved them back to .c file. > + * load. If there are no other block infos than the global objects > + * we silently ignore them. Should I throw error? > > Silently ignoring them seems fine. Throwing an error doesn't seem > like it would improve things. -- Okay thanks. > > + /* > + * Register a sub-worker to load new database's block. Wait until the > + * sub-worker finish its job before launching next sub-worker. > + */ > + launch_prewarm_subworker(&pelem); > > The function name implies that it launches the worker, but the comment > implies that it also waits for it to terminate. Functions should be > named in a way that matches what they do. -- Have renamed it to launch_and_wait_for_per_database_worker > I feel like the get_autoprewarm_task() function is introducing fairly > baroque logic for something that really ought to be more simple. All > autoprewarm_main() really needs to do is: > > if (!state->autoprewarm_done) > autoprewarm(); > dump_block_info_periodically(); -- Have simplified things as suggested now. Function get_autoprewarm_task has been removed. > The locking in autoprewarm_dump_now() is a bit tricky. There are two > trouble cases. One is that we try to rename() our new dump file on > top of the existing one while a background worker is still using it to > perform an autoprewarm. The other is that we try to write a new > temporary dump file while some other process is trying to do the same > thing. I think we should fix this by storing a PID in > AutoPrewarmSharedState; a process which wants to perform a dump or an > autoprewarm must change the PID from 0 to their own PID, and change it > back to 0 on successful completion or error exit. If we go to perform > an immediate dump process and finds a non-zero value already just does > ereport(ERROR, ...), including the PID of the other process in the > message (e.g. "unable to perform block dump because dump file is being > used by PID %d"). In a background worker, if we go to dump and find > the file in use, log a message (e.g. "skipping block dump because it > is already being performed by PID %d", "skipping prewarm because block > dump file is being rewritten by PID %d"). -- Fixed as suggested. > I also think we should change is_bgworker_running to a PID, so that if > we try to launch a new worker we can report something like > "autoprewarm worker is already running under PID %d". -- Fixed. I could only "LOG" about another autoprewarm worker already running and then exit. 
Because on ERROR we try to restart the worker, so do not want to restart such workers. > So putting that all together, I suppose AutoPrewarmSharedState should > end up looking like this: > > LWLock lock; /* mutual exclusion */ > pid_t bgworker_pid; /* for main bgworker */ > pid_t pid_using_dumpfile; /* for autoprewarm or block dump */ -- I think one more member is required which state whether prewarm can be done when the worker restarts. > /* following items are for communication with per-database worker */ > dsm_handle block_info_handle; > Oid database; > int prewarm_start_idx; > int prewarm_stop_idx; -- Fixed as suggested > I suggest going through and changing "subworker" to "per-database > worker" throughout. -- Fixed as suggested. > > BTW, have you tested how long this takes to run with, say, shared_buffers = 8GB? I have tried same on my local machine with ssd as a storage. settings: shared_buffers = 8GB, loaded data with pg_bench scale_factor=1000. Total blocks got dumped autoprewarm_dump_now ---------------------- 1048576 5 different load time based logs 1. 2017-06-04 11:30:26.460 IST [116253] LOG: autoprewarm has started 2017-06-04 11:30:43.443 IST [116253] LOG: autoprewarm load task ended -- 17 secs 2 2017-06-04 11:31:13.565 IST [116291] LOG: autoprewarm has started 2017-06-04 11:31:30.317 IST [116291] LOG: autoprewarm load task ended -- 17 secs 3. 2017-06-04 11:32:12.995 IST [116329] LOG: autoprewarm has started 2017-06-04 11:32:29.982 IST [116329] LOG: autoprewarm load task ended -- 17 secs 4. 2017-06-04 11:32:58.974 IST [116361] LOG: autoprewarm has started 2017-06-04 11:33:15.017 IST [116361] LOG: autoprewarm load task ended -- 17secs 5. 2017-06-04 12:15:49.772 IST [117936] LOG: autoprewarm has started 2017-06-04 12:16:11.012 IST [117936] LOG: autoprewarm load task ended -- 22 secs. So mostly from 17 to 22 secs. But I think I need to do tests on a larger set of configuration on different storage types. I shall do same and upload later. I have also uploaded latest performance test results (on my local machine ssd drive) configuration: shared_buffer = 8GB, test setting: scale_factor=300 (data fits to shared_buffers) pgbench clients =1 TEST PGBENCH_RUN="./pgbench --no-vacuum --protocol=prepared --time=5 -j 1 -c 1 --select-only postgres" START_TIME=$SECONDS; echo TIME, TPS; while true; do TPS=$($PGBENCH_RUN | grep excluding | cut -d ' ' -f 3); TIME=$((SECONDS-START_TIME)); echo $TIME, $TPS; done -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Sun, Jun 4, 2017 at 12:45 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > [...] I had a look at the patch from stylistic/formatting point of view, please find the attached patch for the suggested modifications. -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
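For readers following the design sketch above, a minimal version of the proposed shared state might look like the following (the field list is taken from the discussion, including the extra restart flag Mithun mentions; the actual patch may lay it out differently):

    /*
     * Sketch of the shared state proposed above.  Needs storage/lwlock.h,
     * storage/dsm.h and sys/types.h; lives in shared memory.
     */
    typedef struct AutoPrewarmSharedState
    {
        LWLock      lock;               /* mutual exclusion */
        pid_t       bgworker_pid;       /* for main bgworker */
        pid_t       pid_using_dumpfile; /* for autoprewarm or block dump */

        /* Items for communication with the per-database worker */
        dsm_handle  block_info_handle;
        Oid         database;
        int         prewarm_start_idx;
        int         prewarm_stop_idx;

        /* Whether prewarm should be skipped when the main worker restarts */
        bool        skip_prewarm_on_restart;
    } AutoPrewarmSharedState;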
On Mon, Jun 5, 2017 at 7:58 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > I had a look at the patch from stylistic/formatting point of view, > please find the attached patch for the suggested modifications. Many of these seem worse, like these ones: - * Quit if we've reached records for another database. Unless the + * Quit if we've reached records of another database. Unless the - * When we reach a new relation, close the old one. Note, however, - * that the previous try_relation_open may have failed, in which case - * rel will be NULL. + * On reaching a new relation, close the old one. Note, that the + * previous try_relation_open may have failed, in which case rel will + * be NULL. - * Try to open each new relation, but only once, when we first - * encounter it. If it's been dropped, skip the associated blocks. + * Each relation is open only once at it's first encounter. If it's + * been dropped, skip the associated blocks. Others are better, like these: - (errmsg("could not continue autoprewarm worker is already running under PID %d", + (errmsg("autoprewarm worker is already running under PID %d", - * Start of prewarm per-database worker. This will try to load blocks of one + * Start prewarm per-database worker, which will load blocks of one Others don't really seem better or worse, like: - * Register a per-database worker to load new database's block. And - * wait until they finish their job to launch next one. + * Register a per-database worker to load new database's block. Wait + * until they finish their job to launch next one. IMHO, there's still a good bit of work needed here to make this sound like American English. For example: - * It is a bgworker which automatically records information about blocks - * which were present in buffer pool before server shutdown and then - * prewarm the buffer pool upon server restart with those blocks. + * It is a bgworker process that automatically records information about + * blocks which were present in buffer pool before server shutdown and then + * prewarms the buffer pool upon server restart with those blocks. This construction "It is a..." without a clear referent seems to be standard in Indian English, but it looks wrong to English speakers from other parts of the world, or at least to me. + * Since there could be at max one worker who could do a prewarm, hence, + * acquiring locks is not required before setting skip_prewarm_on_restart. To me, adding a comma before hence looks like a significant improvement, but the word hence itself seems out-of-place. Also, I'd change "at max" to "at most" and maybe reword the sentence a little. There's a lot of little things like this which I have tended be quite strict about changing before commit; I occasionally wonder whether it's really worth the effort. It's not really wrong, it just sounds weird to me as an American. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 5, 2017 at 8:06 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > Many of these seem worse, like these ones: > > - * Quit if we've reached records for another database. Unless the > + * Quit if we've reached records of another database. Unless the > Why is it worse? I've always encountered using "records of database" and not "records for database". Anyhow, I tried reframing the sentence altogether. > - * When we reach a new relation, close the old one. Note, however, > - * that the previous try_relation_open may have failed, in which case > - * rel will be NULL. > + * On reaching a new relation, close the old one. Note, that the > + * previous try_relation_open may have failed, in which case rel will > + * be NULL. > Reframed the sentence. > - * Try to open each new relation, but only once, when we first > - * encounter it. If it's been dropped, skip the associated blocks. > + * Each relation is open only once at it's first encounter. If it's > + * been dropped, skip the associated blocks. > Reframed. > IMHO, there's still a good bit of work needed here to make this sound > like American English. For example: > > - * It is a bgworker which automatically records information about blocks > - * which were present in buffer pool before server shutdown and then > - * prewarm the buffer pool upon server restart with those blocks. > + * It is a bgworker process that automatically records information about > + * blocks which were present in buffer pool before server > shutdown and then > + * prewarms the buffer pool upon server restart with those blocks. > > This construction "It is a..." without a clear referent seems to be > standard in Indian English, but it looks wrong to English speakers > from other parts of the world, or at least to me. > Agreed, tried reframing the sentence. > + * Since there could be at max one worker who could do a prewarm, hence, > + * acquiring locks is not required before setting skip_prewarm_on_restart. > > To me, adding a comma before hence looks like a significant > improvement, but the word hence itself seems out-of-place. Also, I'd > change "at max" to "at most" and maybe reword the sentence a little. > There's a lot of little things like this which I have tended be quite > strict about changing before commit; I occasionally wonder whether > it's really worth the effort. It's not really wrong, it just sounds > weird to me as an American. > Agree, sentence reframed. I am attaching the patch with the modifications I made on a second look. -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Tue, Jun 6, 2017 at 6:29 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > On Mon, Jun 5, 2017 at 8:06 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Many of these seem worse, like these ones: >> >> - * Quit if we've reached records for another database. Unless the >> + * Quit if we've reached records of another database. Unless the >> > Why is it worse? I've always encountered using "records of database" > and not "records for database". Anyhow, I tried reframing the sentence > altogether. My experience is the opposite. If I Google "from another database", including the quotes, I get 516,000 hits; if I do the same using "of another database", I get 110,000 hits. I think that means the former usage is more popular, although obviously to some degree it's a matter of taste. + * database, tablespace, filenode, forknum, blocknum in The file doesn't contain the spaces you added here, which is probably why Mithun did it as he did, but actually we don't really need this level of detail in the file header comment anyway. I think you could drop the specific mention of how blocks are identified. + * GUC variable to decide if autoprewarm worker has to be started when if->whether? has to be->should be? Actually this patch uses "if" in various places where I would use "whether", but that's probably a matter of taste to some extent. - * Start of prewarm per-database worker. This will try to load blocks of one - * database starting from block info position state->prewarm_start_idx to - * state->prewarm_stop_idx. + * Connect to the database and load the blocks of that database starting from + * the position state->prewarm_start_idx to state->prewarm_stop_idx. That's a really good rephrasing. It's more compact and more imperative. The only thing that seems a little off is that you say "starting from" and then mention both the start and stop indexes. Maybe say "between" instead. - * Quit if we've reached records for another database. Unless the - * previous blocks were of global objects which were combined with - * next database's block infos. + * Quit if we've reached records for another database. If previous + * blocks are of some global objects, then continue pre-warming. Good. - * When we reach a new relation, close the old one. Note, however, - * that the previous try_relation_open may have failed, in which case - * rel will be NULL. + * As soon as we encounter a block of a new relation, close the old + * relation. Note, that rel will be NULL if try_relation_open failed + * previously, in that case there is nothing to close. I wrote the original comment here, so you may not be too surprised to here that I liked it as it was. Actually, your rewrite of the first sentence seems like it might be better, but I think the second one is not. By deleting "however", you've made the comma after "Note" redundant, and by changing "in which case" to "in that case", you've made a dependent clause into a comma splice. I hate comma splices. https://en.wikipedia.org/wiki/Comma_splice + * $PGDATA/AUTOPREWARM_FILE to a dsm. Further, these BlockInfoRecords are I would probably capitalize DSM here. There's no data structure called dsm (lower-case) and we have other examples where it is capitalized in documentation and comment text (see, e.g. custom-scan.sgml). + * Since there could be at most one worker for prewarm, locking is not could -> can? - * For block info of a global object whose database will be 0 - * try to combine them with next non-zero database's block - * infos to load. 
+ * For the BlockRecordInfos of a global object, combine them + * with BlockRecordInfos of the next non-global object. Good. Or even "Combine BlockRecordInfos for a global object with the next non-global object", which I think is even more compact. + * If we are here with database having InvalidOid, then only + * BlockRecordInfos belonging to global objects exist. Since, we can + * not connect with InvalidOid skip prewarming for these objects. If we reach this point with current_db == InvalidOid, ... + * Sub-routine to periodically call dump_now(). Every existing use of the word subroutine in our code based spells it with no hyphen. [rhaas pgsql]$ git grep -- Subroutine | wc -l 47 [rhaas pgsql]$ git grep -- Sub-routine | wc -l 0 Personally, I find this change worse, because calling something a subroutine is totally generic, like saying that you met a "person" or that something was "nice". Calling it a loop gives at least a little bit of specific information. + * Call dum_now() at regular intervals defined by GUC variable dump_interval. This change introduces an obvious typographical error. - /* One last block info dump while postmaster shutdown. */ + /* It's time for postmaster shutdown, let's dump for one last time. */ Comma splice. + /* Perform autoprewarm's task. */ Uninformative. + * A Common function to initialize BackgroundWorker structure. Adding a period to the end is a good idea, but how about also fixing the capitalization of "Common"? I think this is a common usage in India, but not elsewhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I have merged Rafia's patch for cosmetic changes. I have also fixed some of the changes you have recommended over that. But kept few as it is since Rafia's opinion was needed on that. On Wed, Jun 7, 2017 at 5:57 PM, Robert Haas <robertmhaas@gmail.com> wrote: > My experience is the opposite. If I Google "from another database", > including the quotes, I get 516,000 hits; if I do the same using "of > another database", I get 110,000 hits. I think that means the former > usage is more popular, although obviously to some degree it's a matter > of taste. -- Open. > > + * database, tablespace, filenode, forknum, blocknum in > > The file doesn't contain the spaces you added here, which is probably > why Mithun did it as he did, but actually we don't really need this > level of detail in the file header comment anyway. I think you could > drop the specific mention of how blocks are identified. -- Fixed. It has been removed now. > + * GUC variable to decide if autoprewarm worker has to be started when > > if->whether? has to be->should be? > Actually this patch uses "if" in various places where I would use > "whether", but that's probably a matter of taste to some extent. -- Fixed. I have changed from if to whether. > > - * Start of prewarm per-database worker. This will try to load blocks of one > - * database starting from block info position state->prewarm_start_idx to > - * state->prewarm_stop_idx. > + * Connect to the database and load the blocks of that database starting from > + * the position state->prewarm_start_idx to state->prewarm_stop_idx. > > That's a really good rephrasing. It's more compact and more > imperative. The only thing that seems a little off is that you say > "starting from" and then mention both the start and stop indexes. > Maybe say "between" instead. -- It is a half-open interval so rewrote it within the notation [,) > > - * Quit if we've reached records for another database. Unless the > - * previous blocks were of global objects which were combined with > - * next database's block infos. > + * Quit if we've reached records for another database. If previous > + * blocks are of some global objects, then continue pre-warming. > > Good. > > - * When we reach a new relation, close the old one. Note, however, > - * that the previous try_relation_open may have failed, in which case > - * rel will be NULL. > + * As soon as we encounter a block of a new relation, close the old > + * relation. Note, that rel will be NULL if try_relation_open failed > + * previously, in that case there is nothing to close. > > I wrote the original comment here, so you may not be too surprised to > here that I liked it as it was. Actually, your rewrite of the first > sentence seems like it might be better, but I think the second one is > not. By deleting "however", you've made the comma after "Note" > redundant, and by changing "in which case" to "in that case", you've > made a dependent clause into a comma splice. I hate comma splices. > > https://en.wikipedia.org/wiki/Comma_splice -- Open > + * $PGDATA/AUTOPREWARM_FILE to a dsm. Further, these BlockInfoRecords are > I would probably capitalize DSM here. There's no data structure > called dsm (lower-case) and we have other examples where it is > capitalized in documentation and comment text (see, e.g. > custom-scan.sgml). -- Fixed. > > + * Since there could be at most one worker for prewarm, locking is not > > could -> can? -- Fixed. 
> > - * For block info of a global object whose database will be 0 > - * try to combine them with next non-zero database's block > - * infos to load. > + * For the BlockRecordInfos of a global object, combine them > + * with BlockRecordInfos of the next non-global object. > > Good. Or even "Combine BlockRecordInfos for a global object with the > next non-global object", which I think is even more compact. -- Fixed. > > + * If we are here with database having InvalidOid, then only > + * BlockRecordInfos belonging to global objects exist. Since, we can > + * not connect with InvalidOid skip prewarming for these objects. > > If we reach this point with current_db == InvalidOid, ... --Fixed. > > + * Sub-routine to periodically call dump_now(). > > Every existing use of the word subroutine in our code based spells it > with no hyphen. > > [rhaas pgsql]$ git grep -- Subroutine | wc -l > 47 > [rhaas pgsql]$ git grep -- Sub-routine | wc -l > 0 > > Personally, I find this change worse, because calling something a > subroutine is totally generic, like saying that you met a "person" or > that something was "nice". Calling it a loop gives at least a little > bit of specific information. -- Fixed. We call it a loop now. > > + * Call dum_now() at regular intervals defined by GUC variable dump_interval. > > This change introduces an obvious typographical error. -- Fixed. > > - /* One last block info dump while postmaster shutdown. */ > + /* It's time for postmaster shutdown, let's dump for one last time. */ > > Comma splice. -- Open > > + /* Perform autoprewarm's task. */ > > Uninformative. -- Fixed. I have removed same now. > > + * A Common function to initialize BackgroundWorker structure. > > Adding a period to the end is a good idea, but how about also fixing > the capitalization of "Common"? I think this is a common usage in > India, but not elsewhere. -- Fixed to common. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
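Putting the earlier design sketch together with the half-open [prewarm_start_idx, prewarm_stop_idx) convention just mentioned, the main worker's dispatch loop comes out roughly as below. This is only an illustration: names follow the thread (block_info[] is assumed to be the array sorted with database as the leading key, and launch_and_wait_for_per_database_worker() takes no arguments because everything is passed through shared memory), and the real code also has to deal with DSM setup and shutdown requests.

    int     i = 0;

    while (i < num_elements)
    {
        int     start = i;
        Oid     db = block_info[i].database;

        /* Find where this database's records end. */
        while (i < num_elements && block_info[i].database == db)
            i++;

        /*
         * Records for global objects (database 0) are folded into the next
         * database's range; if only global objects exist, skip them silently.
         */
        if (db == InvalidOid)
        {
            if (i >= num_elements)
                break;
            db = block_info[i].database;
            while (i < num_elements && block_info[i].database == db)
                i++;
        }

        state->database = db;
        state->prewarm_start_idx = start;
        state->prewarm_stop_idx = i;    /* one past the last record */

        launch_and_wait_for_per_database_worker();
    }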
On Fri, Jun 9, 2017 at 10:01 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > I have merged Rafia's patch for cosmetic changes. I have also fixed > some of the changes you have recommended over that. But kept few as it > is since Rafia's opinion was needed on that. > Few comments on the latest patch: 1. + /* Check whether blocknum is valid and within fork file size. */ + if (blk->blocknum >= nblocks) + { + /* Move to next forknum. */ + ++pos; + old_blk = blk; + continue; + } + + /* Prewarm buffer. */ + buf = ReadBufferExtended(rel, blk->forknum, blk->blocknum, RBM_NORMAL, + NULL); + if (BufferIsValid(buf)) + ReleaseBuffer(buf); + + old_blk = blk; + ++pos; You are incrementing position at different places in this loop. I think you can do it once at the start of the loop. 2. +dump_now(bool is_bgworker) { .. + fd = OpenTransientFile(transient_dump_file_path, + O_CREAT | O_WRONLY | O_TRUNC, 0666); +prewarm_buffer_pool(void) { .. + file = AllocateFile(AUTOPREWARM_FILE, PG_BINARY_R); During prewarm, you seem to be using binary mode to open a file whereas during dump binary flag is not passed. Is there a reason for such a difference? 3. + ereport(LOG, + (errmsg("saved metadata info of %d blocks", num_blocks))); It doesn't seem like a good idea to log this info at each dump interval. How about making this as a DEBUG1 message? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
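As an illustration of the first review point above, the position can be advanced in exactly one place by moving it into the for-header; everything else is as in the quoted snippet (the loop bounds, the block_info[] array name, and the relation/fork bookkeeping around old_blk are assumptions about the patch):

    for (pos = state->prewarm_start_idx; pos < state->prewarm_stop_idx; pos++)
    {
        BlockInfoRecord *blk = &block_info[pos];    /* type name as used in the thread */
        Buffer      buf;

        /* Check whether blocknum is valid and within fork file size. */
        if (blk->blocknum >= nblocks)
        {
            old_blk = blk;
            continue;           /* pos advances in the for-header */
        }

        /* Prewarm buffer. */
        buf = ReadBufferExtended(rel, blk->forknum, blk->blocknum,
                                 RBM_NORMAL, NULL);
        if (BufferIsValid(buf))
            ReleaseBuffer(buf);

        old_blk = blk;
    }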
On Mon, Jun 12, 2017 at 6:31 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > Thanks, Amit, > + /* Perform autoprewarm's task. */ + if (todo_task == TASK_PREWARM_BUFFERPOOL && + !state->skip_prewarm_on_restart) Why have you removed above comment in the new patch? I am not pointing this because above comment is meaningful, rather changing things in different versions of the patch without informing reviewer can increase the time to review. I feel you can write some better comment here. 1. new file mode 100644 index 0000000..6c35fb7 --- /dev/null +++ b/contrib/pg_prewarm/pg_prewarm--1.1--1.2.sql @@ -0,0 +1,14 @@ +/* contrib/pg_prewarm/pg_prewarm--1.0--1.1.sql */ In comments, the SQL file name is wrong. 2. + /* Define custom GUC variables. */ + if (process_shared_preload_libraries_in_progress) + DefineCustomBoolVariable("pg_prewarm.autoprewarm", + "Enable/Disable auto-prewarm feature.", + NULL, + &autoprewarm, + true, + PGC_POSTMASTER, + 0, + NULL, + NULL, + NULL); + + DefineCustomIntVariable("pg_prewarm.dump_interval", + "Sets the maximum time between two buffer pool dumps", + "If set to zero, timer based dumping is disabled." + " If set to -1, stops autoprewarm.", + &dump_interval, + AT_PWARM_DEFAULT_DUMP_INTERVAL, + AT_PWARM_OFF, INT_MAX / 1000, + PGC_SIGHUP, + GUC_UNIT_S, + NULL, + NULL, + NULL); + + EmitWarningsOnPlaceholders("pg_prewarm"); + + /* If not run as a preloaded library, nothing more to do. */ + if (!process_shared_preload_libraries_in_progress) + return; a. You can easily write this code such that process_shared_preload_libraries_in_progress needs to be checked just once. Move the define of pg_prewarm.dump_interval at first place and then check if (!process_shared_preload_libraries_in_progress ) return. b. Variable name autoprewarm isn't self-explanatory, also if you have to search the use of this variable in the code, it is difficult because a lot of unrelated usages can pop-up. How about naming it as start_prewarm_worker or enable_autoprewarm or something like that? 3. +static AutoPrewarmSharedState *state = NULL; Again, the naming of this variable (state) is not meaningful. How about SharedPrewarmState or something like that? 4. + ereport(LOG, + (errmsg("autoprewarm has started"))); .. + ereport(LOG, + (errmsg("autoprewarm shutting down"))); How about changing messages as "autoprewarm worker started" and "autoprewarm worker stopped" respectively? 5. +void +dump_block_info_periodically(void) +{ + TimestampTz last_dump_time = GetCurrentTimestamp(); .. + if (TimestampDifferenceExceeds(last_dump_time, + current_time, + (dump_interval * 1000))) + { + dump_now(true); .. With above coding, it will not dump the very first time it tries to dump the blocks information. I think it is better if it dumps the first time and then dumps after dump_interval. I think it is not difficult to arrange above code to do so if you also think that is better behavior? 6. +dump_now(bool is_bgworker) { .. + if (write(fd, buf, buflen) < buflen) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to file \"%s\" : %m", + transient_dump_file_path))); .. } You seem to forget to close and unlink the file in above code path. There are a lot of places in this function where you have to free memory or close file in case of an error condition. You can use multiple labels to define error exit paths, something like we have done in DecodeXLogRecord. 7. + for (i = 0; i < num_blocks; i++) + { + /* In case of a SIGHUP, just reload the configuration. 
*/ + if (got_sighup) + { + got_sighup = false; + ProcessConfigFile(PGC_SIGHUP); + } + + /* Have we been asked to stop dump? */ + if (dump_interval == AT_PWARM_OFF) + { + pfree(block_info_array); + CloseTransientFile(fd); + unlink(transient_dump_file_path); + return 0; + } + + buflen = sprintf(buf, "%u,%u,%u,%u,%u\n", + block_info_array[i].database, + block_info_array[i].tablespace, + block_info_array[i].filenode, + (uint32) block_info_array[i].forknum, + block_info_array[i].blocknum); + + if (write(fd, buf, buflen) < buflen) + { + int save_errno = errno; + + CloseTransientFile(fd); + unlink(transient_dump_file_path); + errno = save_errno; + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to file \"%s\": %m", + transient_dump_file_path))); + } It seems we can club the above writes into 8K sized writes instead of one write per block information. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
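For the last review point above (clubbing the writes), one way is to accumulate lines in an ~8K buffer and issue one write() when it fills. A rough sketch only; variable names are from the quoted snippet, DUMP_WRITE_BUFFER_SIZE is a hypothetical constant, and the SIGHUP/dump_interval checks and error-path cleanup are omitted:

    #define DUMP_WRITE_BUFFER_SIZE  8192

    char    write_buf[DUMP_WRITE_BUFFER_SIZE + 64]; /* headroom for one line */
    int     used = 0;

    for (i = 0; i < num_blocks; i++)
    {
        used += sprintf(write_buf + used, "%u,%u,%u,%u,%u\n",
                        block_info_array[i].database,
                        block_info_array[i].tablespace,
                        block_info_array[i].filenode,
                        (uint32) block_info_array[i].forknum,
                        block_info_array[i].blocknum);

        /* Flush when the buffer is nearly full, and after the last entry. */
        if (used >= DUMP_WRITE_BUFFER_SIZE || i == num_blocks - 1)
        {
            if (write(fd, write_buf, used) < used)
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not write to file \"%s\": %m",
                                transient_dump_file_path)));
            used = 0;
        }
    }

As it turns out further down the thread, simply switching the dump side to AllocateFile() gets equivalent buffering from stdio with less code.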
On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > [ new patch ] I think this is looking better. I have some suggestions: * I suggest renaming launch_autoprewarm_dump() to autoprewarm_start_worker(). I think that will be clearer. Remember that user-visible names, internal names, and the documentation should all match. * I think the GUC could be pg_prewarm.autoprewarm rather than pg_prewarm.enable_autoprewarm. It's shorter and, I think, no less clear. * In the documentation, don't say "This is a SQL callable function to....". This is a list of SQL-callable functions, so each thing in the list is one. Just delete this from the beginning of each sentence. * The reason for the AT_PWARM_* naming is not very obvious. Does AT mean "at" or "auto" or something else? How about AUTOPREWARM_INTERVAL_DISABLED, AUTOPREWARM_INTERVAL_SHUTDOWN_ONLY, AUTOPREWARM_INTERVAL_DEFAULT? * Instead of defining apw_sigusr1_handler, I think you could just use procsignal_sigusr1_handler. Instead of defining apw_sigterm_handler, perhaps you could just use die(). got_sigterm would go away, and you'd just CHECK_FOR_INTERRUPTS(). * The PG_TRY()/PG_CATCH() block in autoprewarm_dump_now() could reuse reset_apw_state(), which might be better named detach_apw_shmem(). Similarly, init_apw_state() could be init_apw_shmem(). * Instead of load_one_database(), I suggest autoprewarm_database_main(). That is more parallel to autoprewarm_main(), which you have elsewhere, and makes it more obvious that it's the main entrypoint for some background worker. * Instead of launch_and_wait_for_per_database_worker(), I suggest autoprewarm_one_database(), and instead of prewarm_buffer_pool(), I suggest autoprewarm_buffers(). The motivation for changing prewarm to autoprewarm is that we want the names here to be clearly distinct from the other parts of pg_prewarm that are not related to autoprewarm. The motivation for changing buffer_pool to buffers is just that it's a little shorter. Personally I also like the sound it of it better, but YMMV. * prewarm_buffer_pool() ends with a useless return statement. I suggest removing it. * Instead of creating our own buffering system via buffer_file_write() and buffer_file_flush(), why not just use the facilities provided by the operating system? fopen() et. al. provide buffering, and we have AllocateFile() to provide a FILE *; it's just like OpenTransientFile(), which you are using, but you'll get the buffering stuff for free. Maybe there's some reason why this won't work out nicely, but off-hand it seems like it might. It looks like you are already using AllocateFile() to read the dump, so using it to write the dump as well seems like it would be logical. * I think that it would be cool if, when autoprewarm completed, it printed a message at LOG rather than DEBUG1, and with a few more details, like "autoprewarm successfully prewarmed %d of %d previously-loaded blocks". This would require some additional tracking that you haven't got right now; you'd have to keep track not only of the number of blocks read from the file but how many of those some worker actually loaded. You could do that with an extra counter in the shared memory area that gets incremented by the per-database workers. * dump_block_info_periodically() calls ResetLatch() immediately before WaitLatch; that's backwards. See the commit message for commit 887feefe87b9099eeeec2967ec31ce20df4dfa9b and the comments it added to the top of latch.h for details on how to do this correctly. 
* dump_block_info_periodically()'s main loop is a bit confusing. I think that after calling dump_now(true) it should just "continue", which will automatically retest got_sigterm. You could rightly object to that plan on the grounds that we then wouldn't recheck got_sighup promptly, but you can fix that by moving the got_sighup test to the top of the loop, which is a good idea anyway for at least two other reasons. First, you probably want to check for a pending SIGHUP on initially entering this function, because something might have changed during the prewarm phase, and second, see the previous comment about using the "another valid coding pattern" from latch.h, which puts the ResetLatch() at the bottom of the loop. * I think that launch_autoprewarm_dump() should ereport(ERROR, ...) rather than just return NULL if the feature is disabled. Maybe something like ... ERROR: pg_prewarm.dump_interval must be non-negative in order to launch worker * Not sure about this one, but maybe we should consider just getting rid of pg_prewarm.dump_interval = -1 altogether and make the minimum value 0. If pg_prewarm.autoprewarm = on, then we start the worker and dump according to the dump interval; if pg_prewarm.autoprewarm = off then we don't start the worker automatically, but we still let you start it manually. If you do, it respects the configured dump_interval. With this design, we don't need the error suggested in the previous item at all, and the code can be simplified in various places --- all the checks for AT_PWARM_OFF go away. And I don't see that we're really losing anything. There's not much sense in dumping but not prewarming or prewarming but not dumping, so having pg_prewarm.autoprewarm configure whether the worker is started automatically rather than whether it prewarms (with a separate control for whether it dumps) seems to make sense. The one time when you want to do one without the other is when you first install the extension -- during the first server lifetime, you'll want to dump, so that after the next restart you have something to preload. But this design would allow that. That's all I have time for today - hope it helps. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
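A sketch of the loop shape being suggested here: SIGHUP check at the top, the work in the middle, and WaitLatch()/ResetLatch() at the bottom, per the "another valid coding pattern" in latch.h. The timestamp bookkeeping for the interval is left out to keep the shape visible, and the GUC/flag names are the ones used in this thread (WaitLatch's fourth argument assumes a PG 10-era signature; PG_WAIT_EXTENSION comes from pgstat.h):

    while (!got_sigterm)
    {
        int     rc;

        /* Handle any pending configuration reload first. */
        if (got_sighup)
        {
            got_sighup = false;
            ProcessConfigFile(PGC_SIGHUP);
        }

        dump_now(true);

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       dump_interval * 1000L,
                       PG_WAIT_EXTENSION);
        if (rc & WL_POSTMASTER_DEATH)
            proc_exit(1);

        ResetLatch(MyLatch);
    }

    /* One last block info dump at postmaster shutdown. */
    dump_now(true);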
On 22 June 2017 at 22:52, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >> [ new patch ] > > I think this is looking better. I have some suggestions: > > * I suggest renaming launch_autoprewarm_dump() to > autoprewarm_start_worker(). I think that will be clearer. Remember > that user-visible names, internal names, and the documentation should > all match. +1 I like related functions and GUCs to be similarly named so that they have the same prefix. > > * I think the GUC could be pg_prewarm.autoprewarm rather than > pg_prewarm.enable_autoprewarm. It's shorter and, I think, no less > clear. +1 I also think pg_prewarm.dump_interval should be renamed to pg_prewarm.autoprewarm_interval. > > * In the documentation, don't say "This is a SQL callable function > to....". This is a list of SQL-callable functions, so each thing in > the list is one. Just delete this from the beginning of each > sentence. I've made a pass at the documentation and ended up removing those intros. I haven't made any GUC/function renaming changes, but I have rewritten some paragraphs for clarity. Updated patch attached. One thing I couldn't quite make sense of is: "The autoprewarm process will start loading blocks recorded in $PGDATA/autoprewarm.blocks until there is a free buffer left in the buffer pool." Is this saying "until there is a single free buffer remaining in shared buffers"? I haven't corrected or clarified this as I don't understand it. Also, I find it a bit messy that launch_autoprewarm_dump() doesn't detect an autoprewarm process already running. I'd want this to return NULL or an error if called for a 2nd time. > > * The reason for the AT_PWARM_* naming is not very obvious. Does AT > mean "at" or "auto" or something else? How about > AUTOPREWARM_INTERVAL_DISABLED, AUTOPREWARM_INTERVAL_SHUTDOWN_ONLY, > AUTOPREWARM_INTERVAL_DEFAULT? > > * Instead of defining apw_sigusr1_handler, I think you could just use > procsignal_sigusr1_handler. Instead of defining apw_sigterm_handler, > perhaps you could just use die(). got_sigterm would go away, and > you'd just CHECK_FOR_INTERRUPTS(). > > * The PG_TRY()/PG_CATCH() block in autoprewarm_dump_now() could reuse > reset_apw_state(), which might be better named detach_apw_shmem(). > Similarly, init_apw_state() could be init_apw_shmem(). > > * Instead of load_one_database(), I suggest > autoprewarm_database_main(). That is more parallel to > autoprewarm_main(), which you have elsewhere, and makes it more > obvious that it's the main entrypoint for some background worker. > > * Instead of launch_and_wait_for_per_database_worker(), I suggest > autoprewarm_one_database(), and instead of prewarm_buffer_pool(), I > suggest autoprewarm_buffers(). The motivation for changing prewarm > to autoprewarm is that we want the names here to be clearly distinct > from the other parts of pg_prewarm that are not related to > autoprewarm. The motivation for changing buffer_pool to buffers is > just that it's a little shorter. Personally I also like the sound it > of it better, but YMMV. > > * prewarm_buffer_pool() ends with a useless return statement. I > suggest removing it. > > * Instead of creating our own buffering system via buffer_file_write() > and buffer_file_flush(), why not just use the facilities provided by > the operating system? fopen() et. al. 
provide buffering, and we have > AllocateFile() to provide a FILE *; it's just like > OpenTransientFile(), which you are using, but you'll get the buffering > stuff for free. Maybe there's some reason why this won't work out > nicely, but off-hand it seems like it might. It looks like you are > already using AllocateFile() to read the dump, so using it to write > the dump as well seems like it would be logical. > > * I think that it would be cool if, when autoprewarm completed, it > printed a message at LOG rather than DEBUG1, and with a few more > details, like "autoprewarm successfully prewarmed %d of %d > previously-loaded blocks". This would require some additional > tracking that you haven't got right now; you'd have to keep track not > only of the number of blocks read from the file but how many of those > some worker actually loaded. You could do that with an extra counter > in the shared memory area that gets incremented by the per-database > workers. > > * dump_block_info_periodically() calls ResetLatch() immediately before > WaitLatch; that's backwards. See the commit message for commit > 887feefe87b9099eeeec2967ec31ce20df4dfa9b and the comments it added to > the top of latch.h for details on how to do this correctly. > > * dump_block_info_periodically()'s main loop is a bit confusing. I > think that after calling dump_now(true) it should just "continue", > which will automatically retest got_sigterm. You could rightly object > to that plan on the grounds that we then wouldn't recheck got_sighup > promptly, but you can fix that by moving the got_sighup test to the > top of the loop, which is a good idea anyway for at least two other > reasons. First, you probably want to check for a pending SIGHUP on > initially entering this function, because something might have changed > during the prewarm phase, and second, see the previous comment about > using the "another valid coding pattern" from latch.h, which puts the > ResetLatch() at the bottom of the loop. > > * I think that launch_autoprewarm_dump() should ereport(ERROR, ...) > rather than just return NULL if the feature is disabled. Maybe > something like ... ERROR: pg_prewarm.dump_interval must be > non-negative in order to launch worker > > * Not sure about this one, but maybe we should consider just getting > rid of pg_prewarm.dump_interval = -1 altogether and make the minimum > value 0. If pg_prewarm.autoprewarm = on, then we start the worker and > dump according to the dump interval; if pg_prewarm.autoprewarm = off > then we don't start the worker automatically, but we still let you > start it manually. If you do, it respects the configured > dump_interval. With this design, we don't need the error suggested in > the previous item at all, and the code can be simplified in various > places --- all the checks for AT_PWARM_OFF go away. And I don't see > that we're really losing anything. There's not much sense in dumping > but not prewarming or prewarming but not dumping, so having > pg_prewarm.autoprewarm configure whether the worker is started > automatically rather than whether it prewarms (with a separate control > for whether it dumps) seems to make sense. The one time when you want > to do one without the other is when you first install the extension -- > during the first server lifetime, you'll want to dump, so that after > the next restart you have something to preload. But this design would > allow that. 
-- Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Thanks, Robert, I have tried to fix all of you comments and merged to fixes suggested by Thom in patch 15. On Fri, Jun 23, 2017 at 3:22 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > * I suggest renaming launch_autoprewarm_dump() to > autoprewarm_start_worker(). I think that will be clearer. Remember > that user-visible names, internal names, and the documentation should > all match. -- Fixed as suggested. > > * I think the GUC could be pg_prewarm.autoprewarm rather than > pg_prewarm.enable_autoprewarm. It's shorter and, I think, no less > clear. -- I have made GUC name as autoprewarm. > > * In the documentation, don't say "This is a SQL callable function > to....". This is a list of SQL-callable functions, so each thing in > the list is one. Just delete this from the beginning of each > sentence. -- Fixed, Thom has provided the fix and I have merged same to my patch. > * The reason for the AT_PWARM_* naming is not very obvious. Does AT > mean "at" or "auto" or something else? How about > AUTOPREWARM_INTERVAL_DISABLED, AUTOPREWARM_INTERVAL_SHUTDOWN_ONLY, > AUTOPREWARM_INTERVAL_DEFAULT? -- Fixed as suggested. The AUTOPREWARM_INTERVAL_DISABLED is removed now as suggested by below comments. > > * Instead of defining apw_sigusr1_handler, I think you could just use > procsignal_sigusr1_handler. Instead of defining apw_sigterm_handler, > perhaps you could just use die(). got_sigterm would go away, and > you'd just CHECK_FOR_INTERRUPTS(). -- Hi have registered procsignal_sigusr1_handler instead of apw_sigusr1_handler. But I have some doubts about using die instead of apw_sigterm_handler in main autoprewarm worker. On shutdown(sigterm) we should dump and then exit, so doing a CHECK_FOR_INTERRUPTS() we might miss dumping the buffer contents. I think I need to modify some server code in ProcessInterrupts to handle this, please let me know if I am wrong about this. For per-database prewarm worker, this seems right so I am registering die for SIGTERM and calling CHECK_FOR_INTERRUPTS(). Also for autoprewarm_dump_now(). > > * The PG_TRY()/PG_CATCH() block in autoprewarm_dump_now() could reuse > reset_apw_state(), which might be better named detach_apw_shmem(). > Similarly, init_apw_state() could be init_apw_shmem(). -- Fixed. > * Instead of load_one_database(), I suggest > autoprewarm_database_main(). That is more parallel to > autoprewarm_main(), which you have elsewhere, and makes it more > obvious that it's the main entrypoint for some background worker. -- Fixed. > * Instead of launch_and_wait_for_per_database_worker(), I suggest > autoprewarm_one_database(), and instead of prewarm_buffer_pool(), I > suggest autoprewarm_buffers(). The motivation for changing prewarm > to autoprewarm is that we want the names here to be clearly distinct > from the other parts of pg_prewarm that are not related to > autoprewarm. The motivation for changing buffer_pool to buffers is > just that it's a little shorter. Personally I also like the sound it > of it better, but YMMV. -- Fixed as suggested. I have renamed as suggested. > * prewarm_buffer_pool() ends with a useless return statement. I > suggest removing it. -- Sorry Fixed. > > * Instead of creating our own buffering system via buffer_file_write() > and buffer_file_flush(), why not just use the facilities provided by > the operating system? fopen() et. al. provide buffering, and we have > AllocateFile() to provide a FILE *; it's just like > OpenTransientFile(), which you are using, but you'll get the buffering > stuff for free. 
Maybe there's some reason why this won't work out > nicely, but off-hand it seems like it might. It looks like you are > already using AllocateFile() to read the dump, so using it to write > the dump as well seems like it would be logical. -- Now using AllocateFile(). > * I think that it would be cool if, when autoprewarm completed, it > printed a message at LOG rather than DEBUG1, and with a few more > details, like "autoprewarm successfully prewarmed %d of %d > previously-loaded blocks". This would require some additional > tracking that you haven't got right now; you'd have to keep track not > only of the number of blocks read from the file but how many of those > some worker actually loaded. You could do that with an extra counter > in the shared memory area that gets incremented by the per-database > workers. > > * dump_block_info_periodically() calls ResetLatch() immediately before > WaitLatch; that's backwards. See the commit message for commit > 887feefe87b9099eeeec2967ec31ce20df4dfa9b and the comments it added to > the top of latch.h for details on how to do this correctly. -- Sorry Fixed. > * dump_block_info_periodically()'s main loop is a bit confusing. I > think that after calling dump_now(true) it should just "continue", > which will automatically retest got_sigterm. You could rightly object > to that plan on the grounds that we then wouldn't recheck got_sighup > promptly, but you can fix that by moving the got_sighup test to the > top of the loop, which is a good idea anyway for at least two other > reasons. First, you probably want to check for a pending SIGHUP on > initially entering this function, because something might have changed > during the prewarm phase, and second, see the previous comment about > using the "another valid coding pattern" from latch.h, which puts the > ResetLatch() at the bottom of the loop. -- Agree, my idea was while we were dumping or just immediately after dumping if we receive sigterm we need not dump again for shutdown. I think I am wrong so fixed as you have suggested. > > * I think that launch_autoprewarm_dump() should ereport(ERROR, ...) > rather than just return NULL if the feature is disabled. Maybe > something like ... ERROR: pg_prewarm.dump_interval must be > non-negative in order to launch worker -- I have removed pg_prewarm.dump_interval = -1 case as you have suggested below. So no need for error now. > * Not sure about this one, but maybe we should consider just getting > rid of pg_prewarm.dump_interval = -1 altogether and make the minimum > value 0. If pg_prewarm.autoprewarm = on, then we start the worker and > dump according to the dump interval; if pg_prewarm.autoprewarm = off > then we don't start the worker automatically, but we still let you > start it manually. If you do, it respects the configured > dump_interval. With this design, we don't need the error suggested in > the previous item at all, and the code can be simplified in various > places --- all the checks for AT_PWARM_OFF go away. And I don't see > that we're really losing anything. There's not much sense in dumping > but not prewarming or prewarming but not dumping, so having > pg_prewarm.autoprewarm configure whether the worker is started > automatically rather than whether it prewarms (with a separate control > for whether it dumps) seems to make sense. 
The one time when you want > to do one without the other is when you first install the extension -- > during the first server lifetime, you'll want to dump, so that after > the next restart you have something to preload. But this design would > allow that. -- Agree. Removed the case pg_prewarm.dump_interval = -1. I had similar doubt which I have tried to raise previously On Tue, May 30, 2017 at 10:16 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >>There is another GUC setting pg_prewarm.dump_interval if = -1 we stop >>the running autoprewarm worker. I have a doubt should we combine these >>2 entities into one such that it controls the state of autoprewarm >>worker? Now I have one doubt, do we need a mechanism to stop running autoprewarm worker while keeping the server alive? Can I use the pg_prewarm.autoprewarm for the same purpose? -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
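The flag-and-latch style SIGTERM handler Mithun is arguing to keep (as opposed to die()/CHECK_FOR_INTERRUPTS()) would look roughly like this, so the main loop still gets to write one final dump before exiting; only a sketch, registration shown in a comment:

    static volatile sig_atomic_t got_sigterm = false;

    static void
    apw_sigterm_handler(SIGNAL_ARGS)
    {
        int     save_errno = errno;

        got_sigterm = true;
        SetLatch(MyLatch);
        errno = save_errno;
    }

    /*
     * Registered in the worker's entry point, before unblocking signals:
     *     pqsignal(SIGTERM, apw_sigterm_handler);
     *     BackgroundWorkerUnblockSignals();
     */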
On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <thom@linux.com> wrote: > > I also think pg_prewarm.dump_interval should be renamed to > pg_prewarm.autoprewarm_interval. Thanks, I have changed it to pg_prewarm.autoprewarm_interval. >> >> * In the documentation, don't say "This is a SQL callable function >> to....". This is a list of SQL-callable functions, so each thing in >> the list is one. Just delete this from the beginning of each >> sentence. > One thing I couldn't quite make sense of is: > > "The autoprewarm process will start loading blocks recorded in > $PGDATA/autoprewarm.blocks until there is a free buffer left in the > buffer pool." > > Is this saying "until there is a single free buffer remaining in > shared buffers"? I haven't corrected or clarified this as I don't > understand it. Sorry, that was a typo; I meant to say until there is no free buffer left. Fixed in autoprewarm_16.patch. > > Also, I find it a bit messy that launch_autoprewarm_dump() doesn't > detect an autoprewarm process already running. I'd want this to > return NULL or an error if called for a 2nd time. We log instead of error as we try to check only after launching the worker and inside worker. One solution could be as similar to autoprewam_dump_now(), the autoprewarm_start_worker() can init shared memory and check if we can launch worker in backend itself. I will try to fix same. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <thom@linux.com> wrote: >> >> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't >> detect an autoprewarm process already running. I'd want this to >> return NULL or an error if called for a 2nd time. > > We log instead of error as we try to check only after launching the > worker and inside worker. One solution could be as similar to > autoprewam_dump_now(), the autoprewarm_start_worker() can init shared > memory and check if we can launch worker in backend itself. I will try > to fix same. I have fixed it now as follows +Datum +autoprewarm_start_worker(PG_FUNCTION_ARGS) +{ + pid_t pid; + + init_apw_shmem(); + pid = apw_state->bgworker_pid; + if (pid != InvalidPid) + ereport(ERROR, + (errmsg("autoprewarm worker is already running under PID %d", + pid))); + + autoprewarm_dump_launcher(); + PG_RETURN_VOID(); +} In backend itself, we shall check if an autoprewarm worker is running then only start the server. There is a possibility if this function is executed concurrently when there is no worker already running (Which I think is not a normal usage) then both call will say it has successfully launched the worker even though only one could have successfully done that (other will log and silently die). I think that is okay as the objective was to get one worker up and running. I have changed the return value to void. The worker could be restarted when there is an error. So returned pid is not going to be same as worker pid in such cases. Also, I do not see any use of pid. Made documentation changes regarding above changes. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <thom@linux.com> wrote: >>> >>> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't >>> detect an autoprewarm process already running. I'd want this to >>> return NULL or an error if called for a 2nd time. >> >> We log instead of error as we try to check only after launching the >> worker and inside worker. One solution could be as similar to >> autoprewam_dump_now(), the autoprewarm_start_worker() can init shared >> memory and check if we can launch worker in backend itself. I will try >> to fix same. > > I have fixed it now as follows > > +Datum > +autoprewarm_start_worker(PG_FUNCTION_ARGS) > +{ > + pid_t pid; > + > + init_apw_shmem(); > + pid = apw_state->bgworker_pid; > + if (pid != InvalidPid) > + ereport(ERROR, > + (errmsg("autoprewarm worker is already running under PID %d", > + pid))); > + > + autoprewarm_dump_launcher(); > + PG_RETURN_VOID(); > +} > > In backend itself, we shall check if an autoprewarm worker is running > then only start the server. There is a possibility if this function is > executed concurrently when there is no worker already running (Which I > think is not a normal usage) then both call will say it has > successfully launched the worker even though only one could have > successfully done that (other will log and silently die). Why can't we close this remaining race condition? Basically, if we just perform all of the actions in this function under the lock and autoprewarm_dump_launcher waits till the autoprewarm worker has initialized the bgworker_pid, then there won't be any remaining race condition. I think if we don't close this race condition, it will be unpredictable whether the user will get the error or there will be only a server log for the same. > I think that > is okay as the objective was to get one worker up and running. > You are right that the objective will be met, but still, I feel the behavior of this API will be unpredictable which in my opinion should be fixed. If it is really not possible or extremely difficult to fix this behavior, then I think we should update the docs. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
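The "launcher waits for the worker" part can be built on the dynamic background-worker API along these lines. This is only a sketch of that one piece (worker naming follows the thread; the locking that actually closes the race and the bgw_main_arg/task selection are left to the patch):

    static void
    autoprewarm_dump_launcher(void)
    {
        BackgroundWorker worker;
        BackgroundWorkerHandle *handle;
        BgwHandleStatus status;
        pid_t       pid;

        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
            BGWORKER_BACKEND_DATABASE_CONNECTION;
        worker.bgw_start_time = BgWorkerStart_ConsistentState;
        worker.bgw_restart_time = BGW_NEVER_RESTART;    /* the patch may allow restarts */
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_prewarm");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "autoprewarm_main");
        snprintf(worker.bgw_name, BGW_MAXLEN, "autoprewarm");
        worker.bgw_notify_pid = MyProcPid;

        if (!RegisterDynamicBackgroundWorker(&worker, &handle))
            ereport(ERROR,
                    (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
                     errmsg("could not register background process"),
                     errhint("You may need to increase max_worker_processes.")));

        /* Wait for the worker to actually start before returning to the caller. */
        status = WaitForBackgroundWorkerStartup(handle, &pid);
        if (status != BGWH_STARTED)
            ereport(ERROR,
                    (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
                     errmsg("could not start background process"),
                     errhint("More details may be available in the server log.")));
    }

With that in place, the worker itself can claim apw_state->bgworker_pid under the lock when it starts and simply log and exit if it finds the slot already taken, so two concurrent calls cannot both end up with a running worker.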
On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <thom@linux.com> wrote: >>> >>> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't >>> detect an autoprewarm process already running. I'd want this to >>> return NULL or an error if called for a 2nd time. >> >> We log instead of error as we try to check only after launching the >> worker and inside worker. One solution could be as similar to >> autoprewam_dump_now(), the autoprewarm_start_worker() can init shared >> memory and check if we can launch worker in backend itself. I will try >> to fix same. > > I have fixed it now as follows Few comments on the latest patch: 1. + LWLockRelease(&apw_state->lock); + if (!is_bgworker) + ereport(ERROR, + (errmsg("could not perform block dump because dump file is being used by PID %d", + apw_state->pid_using_dumpfile))); + ereport(LOG, + (errmsg("skipping block dump because it is already being performed by PID %d", + apw_state->pid_using_dumpfile))); The above code looks confusing as both the messages are saying the same thing in different words. I think you keep one message (probably the first one) and decide error level based on if this is invoked for bgworker. Also, you can move LWLockRelease after error message, because if there is any error, then it will automatically release all lwlocks. 2. +autoprewarm_dump_now(PG_FUNCTION_ARGS) +{ + uint32 num_blocks = 0; + .. + PG_RETURN_INT64(num_blocks); .. } Is there any reason for using PG_RETURN_INT64 instead of PG_RETURN_UINT32? 3. +dump_now(bool is_bgworker) { .. + if (buf_state & BM_TAG_VALID) + { + block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode; + block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode; + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode; + block_info_array[num_blocks].forknum = bufHdr->tag.forkNum; + block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum; + ++num_blocks; + } .. } I think there is no use of writing Unlogged buffers unless the dump is for the shutdown. You might want to use BufferIsPermanent to detect the same. 4. +static uint32 +dump_now(bool is_bgworker) { .. + for (num_blocks = 0, i = 0; i < NBuffers; i++) + { + uint32 buf_state; + + if (!is_bgworker) + CHECK_FOR_INTERRUPTS(); .. } Why checking for interrupts is only for non-bgwroker cases? 5. + * Each entry of BlockRecordInfo consists of database, tablespace, filenode, + * forknum, blocknum. Note that this is in the text form so that the dump + * information is readable and can be edited, if required. + */ In the above comment, you are saying that the dump file is in text form whereas in the code you are using binary form. I think code should match comments. Is there a reason of preferring binary over text or vice versa? 6. +dump_now(bool is_bgworker) { .. + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR); + apw_state->pid_using_dumpfile = InvalidPid; .. } How will pid_using_dumpfile be set to InvalidPid in the case of error for non-bgworker cases? 7. +dump_now(bool is_bgworker) { .. + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR); .. } How will transient_dump_file_path be unlinked in the case of error in durable_rename? I think you need to use PG_TRY..PG_CATCH to ensure same? 8. 
+ file = AllocateFile(transient_dump_file_path, PG_BINARY_W); + if (!file) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open file \"%s\": %m", + transient_dump_file_path))); + + ret = fprintf(file, "<<%u>>\n", num_blocks); + if (ret < 0) + { + int save_errno = errno; + + FreeFile(file); I think you don't need to close the file in case of error, it will be automatically taken care in case of error (via transaction abort path). 9. + /* Register autoprewarm load. */ + setup_autoprewarm(&autoprewarm_worker, "autoprewarm", "autoprewarm_main", + Int32GetDatum(TASK_PREWARM_BUFFERPOOL), 0, 0); What does "load" signify in above comment? Do you want to say worker instead? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
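To make point 7 (and the non-bgworker part of point 6) concrete, a PG_TRY()/PG_CATCH() block in dump_now() could look roughly like the sketch below. The variable names follow the quoted patch; the cleanup actions shown are a suggestion, not the patch's actual code.

    PG_TRY();
    {
        /* ... write all block records into transient_dump_file_path ... */
        (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
    }
    PG_CATCH();
    {
        /* Remove the half-written transient file and release our claim. */
        unlink(transient_dump_file_path);
        apw_state->pid_using_dumpfile = InvalidPid;
        PG_RE_THROW();
    }
    PG_END_TRY();

    apw_state->pid_using_dumpfile = InvalidPid;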
On Fri, Jun 23, 2017 at 3:22 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > > * Instead of creating our own buffering system via buffer_file_write() > and buffer_file_flush(), why not just use the facilities provided by > the operating system? fopen() et. al. provide buffering, and we have > AllocateFile() to provide a FILE *; it's just like > OpenTransientFile(), which you are using, but you'll get the buffering > stuff for free. Maybe there's some reason why this won't work out > nicely, but off-hand it seems like it might. It looks like you are > already using AllocateFile() to read the dump, so using it to write > the dump as well seems like it would be logical. > One thing that is worth considering is AllocateFile is recommended to be used for short operations. Refer text atop AllocateFile(). If the number of blocks to be dumped is large, then the file can remain open for the significant period of time. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 3, 2017 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >>> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <thom@linux.com> wrote: >>>> >>>> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't >>>> detect an autoprewarm process already running. I'd want this to >>>> return NULL or an error if called for a 2nd time. >>> >>> We log instead of error as we try to check only after launching the >>> worker and inside worker. One solution could be as similar to >>> autoprewam_dump_now(), the autoprewarm_start_worker() can init shared >>> memory and check if we can launch worker in backend itself. I will try >>> to fix same. >> >> I have fixed it now as follows >> >> +Datum >> +autoprewarm_start_worker(PG_FUNCTION_ARGS) >> +{ >> + pid_t pid; >> + >> + init_apw_shmem(); >> + pid = apw_state->bgworker_pid; >> + if (pid != InvalidPid) >> + ereport(ERROR, >> + (errmsg("autoprewarm worker is already running under PID %d", >> + pid))); >> + >> + autoprewarm_dump_launcher(); >> + PG_RETURN_VOID(); >> +} >> >> In backend itself, we shall check if an autoprewarm worker is running >> then only start the server. There is a possibility if this function is >> executed concurrently when there is no worker already running (Which I >> think is not a normal usage) then both call will say it has >> successfully launched the worker even though only one could have >> successfully done that (other will log and silently die). > > Why can't we close this remaining race condition? Basically, if we > just perform all of the actions in this function under the lock and > autoprewarm_dump_launcher waits till the autoprewarm worker has > initialized the bgworker_pid, then there won't be any remaining race > condition. I think if we don't close this race condition, it will be > unpredictable whether the user will get the error or there will be > only a server log for the same. Yes, I can make autoprewarm_dump_launcher to wait until the launched bgworker set its pid, but this requires one more synchronization variable between launcher and worker. More than that I see ShmemInitStruct(), LWLockAcquire can throw ERROR (restarts the worker), which needs to be called before setting pid. So I thought it won't be harmful let two concurrent calls to launch workers and we just log failures. Please let me know if I need to rethink about it. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com
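For completeness, the worker-side counterpart that this discussion assumes could look roughly like the sketch below: the worker publishes its PID under the lock as soon as shared state is attached, and exits quietly if it lost the race. The names init_apw_shmem, apw_state and autoprewarm_main come from the patch; the structure of the early exit is only illustrative.

    void
    autoprewarm_main(Datum main_arg)
    {
        pid_t   pid;

        /* Signal handler setup and BackgroundWorkerUnblockSignals() omitted. */
        init_apw_shmem();

        LWLockAcquire(&apw_state->lock, LW_EXCLUSIVE);
        pid = apw_state->bgworker_pid;
        if (pid != InvalidPid)
        {
            LWLockRelease(&apw_state->lock);
            ereport(LOG,
                    (errmsg("autoprewarm worker is already running under PID %d",
                            pid)));
            return;     /* another worker won the race; exit quietly */
        }
        apw_state->bgworker_pid = MyProcPid;
        LWLockRelease(&apw_state->lock);

        /* ... load the previous dump and/or dump blocks periodically ... */
    }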
On Mon, Jul 3, 2017 at 3:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Jun 23, 2017 at 3:22 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >> >> * Instead of creating our own buffering system via buffer_file_write() >> and buffer_file_flush(), why not just use the facilities provided by >> the operating system? fopen() et. al. provide buffering, and we have >> AllocateFile() to provide a FILE *; it's just like >> OpenTransientFile(), which you are using, but you'll get the buffering >> stuff for free. Maybe there's some reason why this won't work out >> nicely, but off-hand it seems like it might. It looks like you are >> already using AllocateFile() to read the dump, so using it to write >> the dump as well seems like it would be logical. >> > > One thing that is worth considering is AllocateFile is recommended to > be used for short operations. Refer text atop AllocateFile(). If the > number of blocks to be dumped is large, then the file can remain open > for the significant period of time. -- Agreed. I could use a suggestion here: we will hold on to one fd, but I am not sure how long the file would have to stay open to count as a case against using AllocateFile(). -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 3, 2017 at 3:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > Few comments on the latest patch: > > 1. > + LWLockRelease(&apw_state->lock); > + if (!is_bgworker) > + ereport(ERROR, > + (errmsg("could not perform block dump because dump file is being > used by PID %d", > + apw_state->pid_using_dumpfile))); > + ereport(LOG, > + (errmsg("skipping block dump because it is already being performed by PID %d", > + apw_state->pid_using_dumpfile))); > > > The above code looks confusing as both the messages are saying the > same thing in different words. I think you keep one message (probably > the first one) and decide error level based on if this is invoked for > bgworker. Also, you can move LWLockRelease after error message, > because if there is any error, then it will automatically release all > lwlocks. ERROR is used for autoprewarm_dump_now which is called from the backend. LOG is used for bgworker. wordings used are to match the context if failing to dump is acceptable or not. In the case of bgworker, it is acceptable we are not particular about the start time of dump but the only interval between the dumps. So if already somebody doing it is acceptable. But one who calls autoprewarm_dump_now might be particular about the start time of dump so we throw error making him retry same. The wording's are suggested by Robert(below snapshot) in one of his previous comments and I also agree with it. If you think I should reconsider this and I am missing something I am open to suggestions. On Wed, May 31, 2017 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: +If we go to perform +an immediate dump process and finds a non-zero value already just does +ereport(ERROR, ...), including the PID of the other process in the +message (e.g. "unable to perform block dump because dump file is being +used by PID %d"). In a background worker, if we go to dump and find +the file in use, log a message (e.g. "skipping block dump because it +is already being performed by PID %d", "skipping prewarm because block +dump file is being rewritten by PID %d"). Thanks moved the LWLockRelease after ERROR call. > 2. > +autoprewarm_dump_now(PG_FUNCTION_ARGS) > +{ > + uint32 num_blocks = 0; > + > .. > + PG_RETURN_INT64(num_blocks); > .. > } > > Is there any reason for using PG_RETURN_INT64 instead of PG_RETURN_UINT32? Return type autoprewarm_dump_now() is pg_catalog.int8 to accommodate uint32 so I used PG_RETURN_INT64. I think PG_RETURN_UINT32 can be used as well I have replaced now. > 3. > +dump_now(bool is_bgworker) > { > .. > + if (buf_state & BM_TAG_VALID) > + { > + block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode; > + block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode; > + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode; > + block_info_array[num_blocks].forknum = bufHdr->tag.forkNum; > + block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum; > + ++num_blocks; > + } > .. > } >I think there is no use of writing Unlogged buffers unless the dump is >for the shutdown. You might want to use BufferIsPermanent to detect the same. -- I do not think that is true pages of the unlogged table are also read into buffers for read-only purpose. So if we miss to dump them while we shut down then the previous dump should be used. > 4. > +static uint32 > +dump_now(bool is_bgworker) > { > .. > + for (num_blocks = 0, i = 0; i < NBuffers; i++) > + { > + uint32 buf_state; > + > + if (!is_bgworker) > + CHECK_FOR_INTERRUPTS(); > .. 
> } > > Why checking for interrupts is only for non-bgwroker cases? -- autoprewarm_dump_now() is called directly from the backend; in that case we have to handle the signals registered for the backend in dump_now(). For the bgworker, dump_block_info_periodically(), the caller of dump_now(), handles SIGTERM and SIGUSR1, which are the signals we are interested in. > 5. > + * Each entry of BlockRecordInfo consists of database, tablespace, filenode, > + * forknum, blocknum. Note that this is in the text form so that the dump > + * information is readable and can be edited, if required. > + */ > > In the above comment, you are saying that the dump file is in text > form whereas in the code you are using binary form. I think code > should match comments. Is there a reason of preferring binary over > text or vice versa? -- Previously I used write() on a file descriptor. I should have changed the open mode to text mode when I moved the code to use AllocateFile; sorry, fixed now. > 6. > +dump_now(bool is_bgworker) > { > .. > + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR); > + apw_state->pid_using_dumpfile = InvalidPid; > .. > } > > How will pid_using_dumpfile be set to InvalidPid in the case of error > for non-bgworker cases? -- I have a PG_TRY()/PG_CATCH() in autoprewarm_dump_now(), so I think that is okay. > > 7. > +dump_now(bool is_bgworker) > { > .. > + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR); > > .. > } > > How will transient_dump_file_path be unlinked in the case of error in > durable_rename? I think you need to use PG_TRY..PG_CATCH to ensure > same? -- If durable_rename() is failing, then the basic functionality of autoprewarm is failing, so I want it to be an ERROR. I do not want to remove the temp file, as we always truncate it before reusing it again. So I think there is no need to catch every ERROR in dump_now() just to remove the temp file. > 8. > + file = AllocateFile(transient_dump_file_path, PG_BINARY_W); > + if (!file) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not open file \"%s\": %m", > + transient_dump_file_path))); > + > + ret = fprintf(file, "<<%u>>\n", num_blocks); > + if (ret < 0) > + { > + int save_errno = errno; > + > + FreeFile(file); > > I think you don't need to close the file in case of error, it will be > automatically taken care in case of error (via transaction abort > path). -- I was trying to close the file before unlinking it. I agree it is not necessary to do so, but did it just as a matter of practice. I have removed it now. > 9. > + /* Register autoprewarm load. */ > + setup_autoprewarm(&autoprewarm_worker, "autoprewarm", "autoprewarm_main", > + Int32GetDatum(TASK_PREWARM_BUFFERPOOL), 0, 0); > > What does "load" signify in above comment? Do you want to say worker instead? -- Sorry, fixed now. In addition to the above, I have made one more change: the per-database autoprewarm bgworker has been renamed from "autoprewarm" to "per-database autoprewarm". -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com
Attachment
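As an illustration of the text-format dump agreed on in points 5 and 8, the write path could look roughly like the sketch below, using AllocateFile()/fprintf() as Robert suggested. The BlockRecordInfo fields and the "<<%u>>" header follow the quoted patch; the helper's name write_block_dump and the comma-separated field layout are only illustrative.

    static void
    write_block_dump(const char *path, BlockRecordInfo *block_info_array,
                     uint32 num_blocks)
    {
        FILE   *file;
        uint32  i;

        file = AllocateFile(path, "w");     /* text mode, per point 5 */
        if (file == NULL)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not open file \"%s\": %m", path)));

        if (fprintf(file, "<<%u>>\n", num_blocks) < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not write to file \"%s\": %m", path)));

        for (i = 0; i < num_blocks; i++)
        {
            if (fprintf(file, "%u,%u,%u,%u,%u\n",
                        block_info_array[i].database,
                        block_info_array[i].tablespace,
                        block_info_array[i].filenode,
                        (uint32) block_info_array[i].forknum,
                        block_info_array[i].blocknum) < 0)
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not write to file \"%s\": %m", path)));
        }

        if (FreeFile(file) != 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not close file \"%s\": %m", path)));
    }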
On Wed, Jul 5, 2017 at 6:25 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > On Mon, Jul 3, 2017 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >>> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: >>>> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <thom@linux.com> wrote: >>>>> >>>>> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't >>>>> detect an autoprewarm process already running. I'd want this to >>>>> return NULL or an error if called for a 2nd time. >>>> >>>> We log instead of error as we try to check only after launching the >>>> worker and inside worker. One solution could be as similar to >>>> autoprewam_dump_now(), the autoprewarm_start_worker() can init shared >>>> memory and check if we can launch worker in backend itself. I will try >>>> to fix same. >>> >>> I have fixed it now as follows >>> >>> +Datum >>> +autoprewarm_start_worker(PG_FUNCTION_ARGS) >>> +{ >>> + pid_t pid; >>> + >>> + init_apw_shmem(); >>> + pid = apw_state->bgworker_pid; >>> + if (pid != InvalidPid) >>> + ereport(ERROR, >>> + (errmsg("autoprewarm worker is already running under PID %d", >>> + pid))); >>> + >>> + autoprewarm_dump_launcher(); >>> + PG_RETURN_VOID(); >>> +} >>> >>> In backend itself, we shall check if an autoprewarm worker is running >>> then only start the server. There is a possibility if this function is >>> executed concurrently when there is no worker already running (Which I >>> think is not a normal usage) then both call will say it has >>> successfully launched the worker even though only one could have >>> successfully done that (other will log and silently die). >> >> Why can't we close this remaining race condition? Basically, if we >> just perform all of the actions in this function under the lock and >> autoprewarm_dump_launcher waits till the autoprewarm worker has >> initialized the bgworker_pid, then there won't be any remaining race >> condition. I think if we don't close this race condition, it will be >> unpredictable whether the user will get the error or there will be >> only a server log for the same. > > Yes, I can make autoprewarm_dump_launcher to wait until the launched > bgworker set its pid, but this requires one more synchronization > variable between launcher and worker. More than that I see > ShmemInitStruct(), LWLockAcquire can throw ERROR (restarts the > worker), which needs to be called before setting pid. So I thought it > won't be harmful let two concurrent calls to launch workers and we > just log failures. Please let me know if I need to rethink about it. > I don't know whether you need to rethink but as presented in the patch, it seems unclear to me about the specs of API. As this is an exposed function to the user, I think the behavior should be well defined. If you don't think changing code has many advantages, then at the very least update the docs to indicate the expectation and behavior of the API. Also, I think it is better to add few comments in the code to tell about the unpredictable behavior in case of race condition and the reason for same. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
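If the behavior is left as-is, the in-code comment Amit asks for could say something along these lines (wording is only a suggestion):

    /*
     * Note: if two backends call autoprewarm_start_worker() concurrently
     * while no worker is running yet, both can pass the bgworker_pid check
     * and try to launch a worker.  Only one worker will register itself in
     * shared memory; the loser logs a message and exits.  Hence a caller
     * may see a successful return even though the worker it launched was
     * not the one that survived.
     */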