Thread: Proposal : For Auto-Prewarm.

Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
# pg_autoprewarm.

This is a PostgreSQL contrib module which automatically dumps all of the block
numbers present in the buffer pool at the time of server shutdown (smart and fast
modes only; to be enhanced to dump at regular intervals) and loads those blocks
back when the server restarts.

Design:
------
We have created a background worker, the Auto Pre-warmer, which during shutdown
dumps all the block numbers in the buffer pool in sorted order.
The format of each entry is <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>.
The Auto Pre-warmer is started as soon as the postmaster starts; we do not wait
for recovery to finish and the database to reach a consistent state. If there is a
"dump_file" to load, we load each block entry into the buffer pool for as long as
a free buffer is available. This way we do not replace any new blocks which were
loaded either by the recovery process or by querying clients. The worker then
waits until it receives SIGTERM, at which point it dumps the block information in
the buffer pool.
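
As a rough illustration, the per-entry record and the load loop might look like
this (a minimal sketch; the type and function names here are assumptions, not
necessarily the patch's exact code):

    /* One dump-file entry, identifying a block by its full path. */
    typedef struct BlockInfoRecord
    {
        Oid         database;       /* DatabaseId */
        Oid         spcNode;        /* TableSpaceId */
        Oid         filenode;       /* RelationId */
        ForkNumber  forknum;
        BlockNumber blocknum;
    } BlockInfoRecord;

    /*
     * Load entries only while free buffers remain, so nothing that
     * recovery or querying clients already pulled in gets evicted.
     */
    for (i = 0; i < num_entries && have_free_buffer(); i++)
        load_one_block(&block_info[i]);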

HOW TO USE:
-----------
Build the module and add pg_autoprewarm to shared_preload_libraries. The Auto
Pre-warmer process then automatically dumps the buffer pool's block info and
loads those blocks when the server is restarted.
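
For example, in postgresql.conf:

    shared_preload_libraries = 'pg_autoprewarm'    # (change requires restart)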

TO DO:
------
Add functionality to dump based on a timer at regular intervals.
And some cleanups.
--
Thanks and Regards
Mithun C Y

Attachment

Re: Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Thu, Oct 27, 2016 at 5:09 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> This is a PostgreSQL contrib module which automatically dumps all of the block
> numbers present in the buffer pool at the time of server shutdown (smart and fast
> modes only; to be enhanced to dump at regular intervals) and loads those blocks
> back when the server restarts.

Sorry, I forgot to add the warm-up time performance measurements done with this patch. Adding them now.
--
Thanks and Regards
Mithun C Y

Attachment

Re: Proposal : For Auto-Prewarm.

From
Jim Nasby
Date:
On 10/27/16 6:39 AM, Mithun Cy wrote:
> # pg_autoprewarm.

IMO it would be better to add this functionality to pg_prewarm instead 
of creating another contrib module. That would reduce confusion and 
reduce the amount of code necessary.

+    cmp_member_elem(database);
+    cmp_member_elem(spcNode);
+    cmp_member_elem(filenode);
+    cmp_member_elem(forknum);
+    cmp_member_elem(blocknum);

Presumably the first 4 numbers will vary far less than blocknum, so it's 
probably worth reversing the order here (or at least put blocknum first).
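
(Presumably the macro expands to an ordered per-field comparison along these
lines — a sketch, not necessarily the patch's actual definition:

    #define cmp_member_elem(fld)        \
    do {                                \
        if (a->fld < b->fld)            \
            return -1;                  \
        else if (a->fld > b->fld)       \
            return 1;                   \
    } while (0)

Putting blocknum first would let the common case return after a single
comparison, though it changes the resulting sort order.)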

I didn't look at the two load functions since presumably they'd go 
away/change significantly if this was combined with pg_prewarm.

+    if (!block_info_array)
+        elog(ERROR, "Out of Memory!");
AFAICT this isn't necessary since palloc will error itself if it fails.

+    snprintf(transient_dump_file_path, sizeof(dump_file_path),
+             "%s.save.tmp", DEFAULT_DUMP_FILENAME);
Since there's no method to change DEFAULT_DUMP_FILENAME, I would call it 
what it is: DUMP_FILENAME.

Also, maybe worth an assert to make sure there was enough room for the 
complete filename. That'd need to be a real check if this was 
configurable anyway.
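
(Roughly, as a sketch — and note that the quoted hunk passes
sizeof(dump_file_path) while writing into transient_dump_file_path, so the
check would want the buffer actually written to:

    int         len;

    len = snprintf(transient_dump_file_path,
                   sizeof(transient_dump_file_path),
                   "%s.save.tmp", DUMP_FILENAME);
    Assert(len > 0 && len < (int) sizeof(transient_dump_file_path));
)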

+    if (!avoid_dumping)
+        dump_now();
Perhaps that check should be moved into dump_now() itself...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461



Re: Proposal : For Auto-Prewarm.

From
"Tsunakawa, Takayuki"
Date:
From: Mithun Cy
> # pg_autoprewarm.
> 
> This is a PostgreSQL contrib module which automatically dumps all of the
> block numbers present in the buffer pool at the time of server shutdown (smart
> and fast modes only; to be enhanced to dump at regular intervals) and loads
> these blocks when the server restarts.

I welcome this feature!  I remember pg_hibernate did this.  I wonder what
happened to pg_hibernate.  Did you check it?

Regards
Takayuki Tsunakawa



Re: Proposal : For Auto-Prewarm.

From
Michael Paquier
Date:
On Fri, Oct 28, 2016 at 5:15 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 10/27/16 6:39 AM, Mithun Cy wrote:
>> # pg_autoprewarm.
>
> [...]

As this picked up my curiosity...

+/*
+ * ReadBufferForPrewarm -- This new interface is for pg_autoprewarm.
+ */
+Buffer
+ReadBufferForPrewarm(SMgrRelation smgr, char relpersistence,
+                     ForkNumber forkNum, BlockNumber blockNum,
+                     ReadBufferMode mode, BufferAccessStrategy strategy)
+{
+    bool        hit;
+
+    return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
+                             mode, strategy, &hit);
+}
May be better to just expose ReadBuffer_common or rename it.
--
Michael



Re: Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Fri, Oct 28, 2016 at 1:45 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 10/27/16 6:39 AM, Mithun Cy wrote:
>>
>> # pg_autoprewarm.
>
>
> IMO it would be better to add this functionality to pg_prewarm instead of
> creating another contrib module.
>

There is not much common functionality between the two.  The main job
of this feature is to dump the contents of shared buffers and reload
them at startup, and it takes care not to override the existing
shared buffer contents while reloading the dump file.  This is
somewhat different from what pg_prewarm provides, which is to load
relation blocks into memory or buffers regardless of prior shared
buffer contents.  So, I am not sure if it is a good idea to add it as
functionality of the prewarm module.  OTOH, if more people want it
that way, then as such there is no harm in doing so.

One point that seems to be worth discussing is when should the buffer
information be dumped to a file?  Shall we dump at each checkpoint or
at regular intervals via auto prewarm worker or at some other time?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Proposal : For Auto-Prewarm.

From
Andres Freund
Date:
On 2016-10-28 12:59:58 +0530, Amit Kapila wrote:
> On Fri, Oct 28, 2016 at 1:45 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> > On 10/27/16 6:39 AM, Mithun Cy wrote:
> >>
> >> # pg_autoprewarm.
> >
> >
> > IMO it would be better to add this functionality to pg_prewarm instead of
> > creating another contrib module.
> >
> 
> There is not much common functionality between the two.

I don't really agree. For me manual and automated prewarming are pretty
closely related. Sure they have their independent usages, and not too
much code might be shared. But grouping them in the same extension seems
to make sense, it's confusing to have closely related but independent
extensions.

> One point that seems to be worth discussing is when should the buffer
> information be dumped to a file?  Shall we dump at each checkpoint or
> at regular intervals via auto prewarm worker or at some other time?

Should probably be at some regular interval - not sure if checkpoints
are the best time (or if it's even realistic to tie a bgworker to
checkpoints), since checkpoints have a significant impact on the state
of shared_buffers.

Greetings,

Andres Freund



Re: Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Fri, Oct 28, 2016 at 3:40 AM, Andres Freund <andres@anarazel.de> wrote:
>>
>> There is not much common functionality between the two.
>
> I don't really agree. For me manual and automated prewarming are pretty
> closely related. Sure they have their independent usages, and not too
> much code might be shared. But grouping them in the same extension seems
> to make sense, it's confusing to have closely related but independent
> extensions.

I agree that putting them together would be fine.

>> One point that seems to be worth discussing is when should the buffer
>> information be dumped to a file?  Shall we dump at each checkpoint or
>> at regular intervals via auto prewarm worker or at some other time?
>
> Should probably be at some regular interval - not sure if checkpoints
> are the best time (or if it's even realistic to tie a bgworker to
> checkpoints), since checkpoints have a significant impact on the state
> of shared_buffers.

Checkpoints don't cause any buffer replacement, which I think is what
would be relevant here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal : For Auto-Prewarm.

From
Haribabu Kommi
Date:
Hi Ashutosh,

This is a gentle reminder.

You are assigned as a reviewer for the current patch in the 11-2016 commitfest,
but you haven't shared your review yet. Please share your views about
the patch. This will help us in the smoother operation of the commitfest.

Please ignore this if you have already shared your review.
 
Regards,
Hari Babu
Fujitsu Australia

Re: Proposal : For Auto-Prewarm.

From
Ashutosh Bapat
Date:
> You are assigned as a reviewer for the current patch in the 11-2016 commitfest,
> but you haven't shared your review yet. Please share your views about
> the patch. This will help us in the smoother operation of the commitfest.
>

Thanks for the reminder.

Mithun has not provided a patch addressing the comments upthread. I am
waiting for his response to those comments.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
Sorry, I took some time on this; please find the latest patch with the review comments addressed. Apart from the fixes for those comments, I have introduced a new GUC variable for pg_autoprewarm, "buff_dump_interval", so now we dump the buffer pool metadata every buff_dump_interval seconds. Currently buff_dump_interval can be set only at startup time. I chose not to dump at checkpoint time, as the two things are not much related and keeping them independent seems nicer for usage; it also avoids the overhead of any communication between them.
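
(For reference, registering such a GUC in _PG_init() might look roughly like
this — a sketch; the bounds and default shown are assumptions, not the patch's
exact code:

    static int  buff_dump_interval = 0;    /* 0 = dump only at shutdown */

    DefineCustomIntVariable("pg_autoprewarm.buff_dump_interval",
                            "Minimum seconds between buffer-pool metadata dumps.",
                            NULL,
                            &buff_dump_interval,
                            0,              /* boot value */
                            0,              /* min */
                            INT_MAX,        /* max */
                            PGC_POSTMASTER, /* settable only at startup */
                            GUC_UNIT_S,
                            NULL, NULL, NULL);
)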

On Fri, Oct 28, 2016 at 1:45 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
 > IMO it would be better to add this functionality to pg_prewarm instead of creating another contrib module. That would reduce confusion and reduce
 > the amount of code necessary.

I have merged the pg_autoprewarm module into pg_prewarm. This is just a directory merge; a functionality merge is not possible, since pg_prewarm is just a utility function with a specific signature to load a specific relation at runtime, whereas pg_autoprewarm is a bgworker which dumps the current buffer pool and loads it at startup time.
 
> Presumably the first 4 numbers will vary far less than blocknum, so it's probably worth reversing the order here (or at least put blocknum first).

The function sort_cmp_func is for qsort, so an ordered comparison is needed to say which entry is bigger or smaller. If we put blocknum first, we cannot decide that.
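
(As a sketch, with the field names from the dump-entry format above:

    static int
    sort_cmp_func(const void *p, const void *q)
    {
        const BlockInfoRecord *a = (const BlockInfoRecord *) p;
        const BlockInfoRecord *b = (const BlockInfoRecord *) q;

        /* most-significant field first, so qsort gets a total order */
        cmp_member_elem(database);
        cmp_member_elem(spcNode);
        cmp_member_elem(filenode);
        cmp_member_elem(forknum);
        cmp_member_elem(blocknum);
        return 0;
    }
)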

> AFAICT this isn't necessary since palloc will error itself if it fails.

Fixed.

> Since there's no method to change DEFAULT_DUMP_FILENAME, I would call it what it is: DUMP_FILENAME.

Fixed.
 
> Also, maybe worth an assert to make sure there was enough room for the complete filename. That'd need to be a real check if this was configurable
> anyway.

I think if we make it configurable, I can put in that check.

 > + if (!avoid_dumping)
 > +               dump_now();
 > Perhaps that check should be moved into dump_now() itself...
 
 Fixed.

--
Thanks and Regards
Mithun C Y

Attachment

Re: Proposal : For Auto-Prewarm.

From
Haribabu Kommi
Date:


On Tue, Nov 29, 2016 at 4:26 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> Sorry, I took some time on this; please find the latest patch with the
> review comments addressed. [...]


Moved to next CF with "needs review" status.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Jim Nasby
Date:
I took a look at this again, and it doesn't appear to be working for me. The
library is being loaded during startup, but I don't see any further activity in
the log, and I don't see an autoprewarm file in $PGDATA.

There needs to be some kind of documentation change as part of this patch.

I'm not sure the default GUC setting of 0 makes sense. If you've loaded the
module, presumably you want it to be running. I think it'd be nice if the GUC
had a -1 setting that meant to use checkpoint_timeout.

Having the GUC be restart-only is also pretty onerous. I don't think it'd be
hard to make the worker respond to a reload... there's code in the autovacuum
launcher you could use as an example.

I'm also wondering if this really needs to be a permanently running process...
perhaps the worker could simply be started as necessary? Though maybe that
means it wouldn't run at shutdown. Also not sure if it could be relaunched when
a reload happens.

I'm marking this as waiting on author for now, because it's not working for me
and needs documentation.

The new status of this patch is: Waiting on Author

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Tue, Jan 24, 2017 at 5:07 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I took a look at this again, and it doesn't appear to be working for me. The library is being loaded during startup,
> but I don't see any further activity in the log, and I don't see an autoprewarm file in $PGDATA.

Hi Jim,
Thanks for looking into this patch. I just downloaded the patch and
applied it to the latest code; I can see the file "autoprewarm.save" in
$PGDATA, which is created and dumped at shutdown time, and an activity
is logged as below:
2017-01-24 13:22:25.012 IST [91755] LOG:  Buffer Dump: saved metadata
of 59 blocks.

In my code, by default we only dump at shutdown time. If we want to
dump at regular intervals, then we need to set the GUC
pg_autoprewarm.buff_dump_interval to > 0. I think I am missing
something while trying to recreate the bug reported above. Can you
please let me know what exactly you mean by "the library is not
working"?

> There needs to be some kind of documentation change as part of this patch.
Thanks, I will add an sgml doc for the same.

For remaining suggestions, I will try to address in my next patch
based on its feasibility.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Beena Emerson
Date:


On Tue, Jan 24, 2017 at 1:56 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> Thanks for looking into this patch. I just downloaded the patch and
> applied it to the latest code; I can see the file "autoprewarm.save" in
> $PGDATA, which is created and dumped at shutdown time. [...]



The patch works for me too. 

Few initial comments:

1. I think the README was kept as-is from the 1st version and says pg_autoprewarm is a contrib module. It should be rewritten to say that pg_autoprewarm is part of the pg_prewarm module. The documentation should be added to pgprewarm.sgml instead of the README.

2. buff_dump_interval could be renamed to just dump_interval or buffer_dump_interval. Also, since it is now part of pg_prewarm, I think it makes sense to have the conf parameter be pg_prewarm.xxx instead of pg_autoprewarm.xxx.

3. During startup we get the following message:

2017-01-24 16:13:57.615 IST [90061] LOG:  Num buffers : 272

It could be better written as “pg_prewarm: 272 blocks loaded from dump” or something similar.

4. Also, the message while dumping says:

2017-01-24 16:15:17.712 IST [90061] LOG:  Buffer Dump: saved metadata of 272 blocks

It would be better to write the module name in the message instead of "Buffer Dump".


Thank you, 

Beena Emerson

Have a Great Day!

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Fri, Oct 28, 2016 at 6:36 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> I welcome this feature!  I remember pg_hibernate did this.   I wonder what happened to pg_hibernate.  Did you check
it?
Thanks. When I checked pg_hibernate, there were two things people
complained about: one, buffer loading starts only after recovery has
finished and the database has reached a consistent state; two, it can
replace existing buffers that were loaded due to recovery or by newly
connected clients. This solution tries to solve both.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Jim Nasby
Date:
On 1/24/17 4:56 AM, Beena Emerson wrote:
> 2. buff_dump_interval could be renamed to just dump_interval or
> buffer_dump_interval. Also, since it is now part of pg_prewarm. I think
> it makes sense to have the conf parameter be: pg_prewarm.xxx instead of
> pg_autoprewarm.xxx

I'd really like to find terminology other than "buffer dump", because 
that makes it sound like we're dumping the contents of the buffers 
themselves.

Maybe block_map? Buffer_map?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Jim Nasby
Date:
On 1/24/17 2:26 AM, Mithun Cy wrote:
> Thanks for looking into this patch, I just downloaded the patch and
> applied same to the latest code, I can see file " autoprewarm.save" in
> $PGDATA which is created and dumped at shutdown time and an activity
> is logged as below
> 2017-01-24 13:22:25.012 IST [91755] LOG:  Buffer Dump: saved metadata
> of 59 blocks.

Yeah, I wasn't getting that at all, though I did see the shared library 
being loaded. If I get a chance I'll try it again.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Beena Emerson
Date:


On Wed, Jan 25, 2017 at 10:36 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 1/24/17 2:26 AM, Mithun Cy wrote:
>> Thanks for looking into this patch. I just downloaded the patch and
>> applied it to the latest code; I can see the file "autoprewarm.save" in
>> $PGDATA, which is created and dumped at shutdown time, and an activity
>> is logged as below:
>> 2017-01-24 13:22:25.012 IST [91755] LOG:  Buffer Dump: saved metadata
>> of 59 blocks.
>
> Yeah, I wasn't getting that at all, though I did see the shared library being loaded. If I get a chance I'll try it again.


Hope you added the following to the conf file:

shared_preload_libraries = 'pg_autoprewarm'     # (change requires restart)
pg_autoprewarm.buff_dump_interval=20

If even after this you could not see the message, then that's strange.
 
--
Thank you, 

Beena Emerson

Have a Great Day!

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Tue, Jan 24, 2017 at 5:07 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I took a look at this again, and it doesn't appear to be working for me. The library is being loaded during startup,
> but I don't see any further activity in the log, and I don't see an autoprewarm file in $PGDATA.
>
> There needs to be some kind of documentation change as part of this patch.
>
> I'm not sure the default GUC setting of 0 makes sense. If you've loaded the module, presumably you want it to be
> running. I think it'd be nice if the GUC had a -1 setting that meant to use checkpoint_timeout.
>
> Having the GUC be restart-only is also pretty onerous. I don't think it'd be hard to make the worker respond to a
> reload... there's code in the autovacuum launcher you could use as an example.
>

+1.  I don't think there should be any problem in making it PGC_SIGHUP.

> I'm also wondering if this really needs to be a permanently running process... perhaps the worker could simply be
> started as necessary?

Do you want to invoke the worker after every buff_dump_interval?  I think
that will be bad in terms of starting a new process each time, and who will
monitor when to start such a process?  I think it is better to keep it
as a permanently running background process if loaded by the user.

> Though maybe that means it wouldn't run at shutdown.

Yeah, that will be another drawback.

A few comments found while glancing through the patch.

1.
+TO DO:
+------
+Add functionality to dump based on timer at regular interval.

I think you need to remove above TO DO.

2.
+ /* Load the page only if there exist a free buffer. We do not want to
+ * replace an existing buffer. */

This is not a PG style multiline comment.
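
(That is, PostgreSQL style would be:

    /*
     * Load the page only if there exists a free buffer.  We do not want
     * to replace an existing buffer.
     */
)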



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Mon, Jan 23, 2017 at 6:37 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I'm not sure the default GUC setting of 0 makes sense. If you've loaded the module, presumably you want it to be
> running. I think it'd be nice if the GUC had a -1 setting that meant to use checkpoint_timeout.

Actually, I think we need to use -1 to mean "don't run the worker at
all".  0 means "run the worker, but don't do timed dumps".  >0 means
"run the worker, and dump at that interval".

I have to admit that when I was first thinking about this feature, my
initial thought was "hey,  let's dump once per checkpoint_timeout".
But I think that Mithun's approach is better.  There's no intrinsic
connection between this and checkpointing, and letting the user pick
the interval is a lot more flexible.  We could still have a magic
value that means "same as checkpoint_timeout" but it's not obvious to
me that there's any value in that; the user might as well just pick
the time interval that they want.

Actually, for busy systems, the right interval is probably shorter than
checkpoint_timeout.  Dumping the list of buffers isn't that expensive,
and if you are doing checkpoints every half hour or so, that's
probably longer than what you want for this.  So I suggest that the
documentation should just recommend a suitable value
(e.g. 5 minutes) and not worry about checkpoint_timeout.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Jim Nasby
Date:
On 1/24/17 11:13 PM, Beena Emerson wrote:
>
> On Wed, Jan 25, 2017 at 10:36 AM, Jim Nasby <Jim.Nasby@bluetreble.com
> <mailto:Jim.Nasby@bluetreble.com>> wrote:
>
>     On 1/24/17 2:26 AM, Mithun Cy wrote:
>
>         Thanks for looking into this patch, I just downloaded the patch and
>         applied same to the latest code, I can see file "
>         autoprewarm.save" in
>         $PGDATA which is created and dumped at shutdown time and an activity
>         is logged as below
>         2017-01-24 13:22:25.012 IST [91755] LOG:  Buffer Dump: saved
>         metadata
>         of 59 blocks.
>
>
>     Yeah, I wasn't getting that at all, though I did see the shared
>     library being loaded. If I get a chance I'll try it again.
>
>
>
> Hope u added the following to the conf file:
>
> shared_preload_libraries = 'pg_autoprewarm'     # (change requires restart)

Therein lay my problem: I was preloading pg_prewarm, not pg_autoprewarm.

I think the two need to be integrated much better than they are right 
now. They should certainly be in the same .so, and as others have 
mentioned the docs need to be fixed. For consistency, I think the name 
should just be pg_prewarm, as well as the prefix for the GUC.

Based on that and other feedback I'm going to mark this as returned with 
feedback, though if you're able to get a revised patch in the next few 
days please do.

FYI (and this is just a suggestion), for testing purposes, it might also 
be handy to allow manual dump and load via functions, with the load 
function giving you the option to forcibly load (instead of doing 
nothing if there are no buffers on the free list). It would also be 
handy if those functions accepted a different filename. That way you 
could reset shared_buffers to a known condition before running a test.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Jim Nasby
Date:
On 1/25/17 1:46 PM, Jim Nasby wrote:
> Based on that and other feedback I'm going to mark this as returned with
> feedback, though if you're able to get a revised patch in the next few
> days please do.

Actually, based on the message that popped up when I went to do that I 
guess it's better not to do that, so I marked it as Waiting on Author.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Wed, Jan 25, 2017 at 2:46 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I think the two need to be integrated much better than they are right now.
> They should certainly be in the same .so, and as others have mentioned the
> docs need to be fixed. For consistency, I think the name should just be
> pg_prewarm, as well as the prefix for the GUC.

Yikes.  +1, definitely.

> It would also be handy if those functions
> accepted a different filename. That way you could reset shared_buffers to a
> known condition before running a test.

That would have some pretty unpleasant security implications unless it
is awfully carefully thought out.  I doubt this has enough utility to
make it worth thinking that hard about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Peter Eisentraut
Date:
On 1/24/17 3:26 AM, Mithun Cy wrote:
> In my code by default, we only dump at shutdown time. If we want to
> dump at regular interval then we need to set the GUC
> pg_autoprewarm.buff_dump_interval to > 0.

Just a thought with an additional use case:  If I want to set up a
standby for offloading queries, could I take the dump file from the
primary or another existing standby, copy it to the new standby, and
have it be warmed up to the state of the other instance from that?

In my experience, that kind of use is just as interesting as preserving
the buffers across a restart.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Thu, Jan 26, 2017 at 8:45 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 1/24/17 3:26 AM, Mithun Cy wrote:
>> In my code by default, we only dump at shutdown time. If we want to
>> dump at regular interval then we need to set the GUC
>> pg_autoprewarm.buff_dump_interval to > 0.
>
> Just a thought with an additional use case:  If I want to set up a
> standby for offloading queries, could I take the dump file from the
> primary or another existing standby, copy it to the new standby, and
> have it be warmed up to the state of the other instance from that?
>
> In my experience, that kind of use is just as interesting as preserving
> the buffers across a restart.
>

An interesting use case.  I am not sure if somebody has tried that way
but it appears to me that the current proposed patch should work for
this use case.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Beena Emerson
Date:


On Fri, Jan 27, 2017 at 8:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 26, 2017 at 8:45 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> Just a thought with an additional use case:  If I want to set up a
>> standby for offloading queries, could I take the dump file from the
>> primary or another existing standby, copy it to the new standby, and
>> have it be warmed up to the state of the other instance from that?
>
> An interesting use case.  I am not sure if somebody has tried that way,
> but it appears to me that the currently proposed patch should work for
> this use case.


I too feel this should work.
In that case, we could add a file-location parameter. By default the file would be stored in the cluster directory, else in the location provided. We could set this parameter on the standby for it to access the file.
Thoughts?


--
Thank you, 

Beena Emerson

Have a Great Day!

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Thu, Jan 26, 2017 at 8:45 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Just a thought with an additional use case:  If I want to set up a
> standby for offloading queries, could I take the dump file from the
> primary or another existing standby, copy it to the new standby, and
> have it be warmed up to the state of the other instance from that?
>
> In my experience, that kind of use is just as interesting as preserving
> the buffers across a restart.

Initially, I did not think about this; thanks for asking. For now, we
dump the buffer pool info in the format
<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>. If the BlockNums in
the new standby correspond to the same set of rows as on the
server where the dump was produced, I think we can directly use the
dump file on the new standby. All we need to do is drop the ".save"
file into the data directory and preload the library. The buffer pool will
be warmed with the blocks mentioned in the ".save" file.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Peter Eisentraut
Date:
On 1/26/17 11:11 PM, Beena Emerson wrote:
> In that case, we could add the file location parameter.  By default it
> would store in the cluster directory else in the location provided. We
> can update this parameter in standby for it to access the file.

I don't see the need for that.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Fri, Jan 27, 2017 at 3:18 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 1/26/17 11:11 PM, Beena Emerson wrote:
>> In that case, we could add the file location parameter.  By default it
>> would store in the cluster directory else in the location provided. We
>> can update this parameter in standby for it to access the file.
>
> I don't see the need for that.

+1.  That seems like over-engineering this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Jim Nasby
Date:
On 1/26/17 10:11 PM, Beena Emerson wrote:
> In that case, we could add the file location parameter.  By default it
> would store in the cluster directory else in the location provided. We
> can update this parameter in standby for it to access the file.

I don't see file location being as useful in this case. What would be 
more useful is being able to forcibly load blocks into shared buffers so 
that you didn't need to restart.

Hmm, it occurs to me that could be accomplished by providing an SRF that 
returned the contents of the current save file.

In any case, I strongly suggest focusing on the issues that have already 
been identified before trying to add more features.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Fri, Jan 27, 2017 at 5:34 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 1/26/17 10:11 PM, Beena Emerson wrote:
>> In that case, we could add the file location parameter.  By default it
>> would store in the cluster directory else in the location provided. We
>> can update this parameter in standby for it to access the file.
>
> I don't see file location being as useful in this case. What would be more
> useful is being able to forcibly load blocks into shared buffers so that you
> didn't need to restart.

Of course, you can already do that with the existing pg_prewarm and
pg_buffercache functionality.  Any time you want, you can use
pg_buffercache to dump out a list of everything in shared_buffers, and
pg_prewarm to suck that same stuff in on the same node or a separate
node.  All this patch is trying to do is provide a convenient,
automated way to make that work.
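
(Concretely, a rough sketch of that manual route using the stock extensions —
note this warms whole relations that currently have buffers, rather than
individual blocks:

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;
    CREATE EXTENSION IF NOT EXISTS pg_prewarm;

    -- prewarm every relation of the current database that has at least
    -- one page in shared_buffers right now
    SELECT pg_prewarm(c.oid::regclass, 'buffer')
    FROM (SELECT DISTINCT relfilenode
          FROM pg_buffercache
          WHERE reldatabase = (SELECT oid FROM pg_database
                               WHERE datname = current_database())) b
    JOIN pg_class c ON c.relfilenode = b.relfilenode;
)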

(An argument could be made that this ought to be in core and the
default behavior, because who really wants to start with an ice-cold
buffer cache on a production system?  It's possible that
repopulating shared_buffers in the background after a restart could
cause enough I/O to interfere with foreground activity that
regrettably ends up needing none of the prewarmed buffers, but I think
prewarming a few GB of data should be quite fast under normal
circumstances, and any well-intentioned system can go wrong under some
set of obscure circumstances.  Still, the patch takes the conservative
course of making this an opt-in behavior, and that's probably for the
best.)

> In any case, I strongly suggest focusing on the issues that have already
> been identified before trying to add more features.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Michael Paquier
Date:
On Tue, Jan 31, 2017 at 3:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 27, 2017 at 5:34 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> I don't see file location being as useful in this case. What would be more
>> useful is being able to forcibly load blocks into shared buffers so that you
>> didn't need to restart.
>
> [...]
>
> (An argument could be made that this ought to be in core and the
> default behavior, because who really wants to start with an ice-cold
> buffer cache on a production system?  [...]  Still, the patch takes the
> conservative course of making this an opt-in behavior, and that's
> probably for the best.)

I partially agree with this paragraph; at least there are advantages
to doing so for cases where the data fits in shared buffers. Even for
data sets fitting in RAM it can be an advantage, as the buffers would
get evicted from Postgres' cache but not the OS's. Now, for cases where
there are many load patterns on a given database (I have some here),
it's hard to make this on by default.

This patch needs to be visibly reshaped anyway, so I am marking it as
returned with feedback.
-- 
Michael



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Tue, Jan 31, 2017 at 1:48 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> I partially agree with this paragraph, at least there are advantages
> to do so for cases where the data fits in shared buffers. Even for
> data sets fitting in RAM that can be an advantage as the buffers would
> get evicted from Postgres' cache but not the OS. Now for cases where
> there are many load patterns on a given database (I have some here),
> that's hard to make this thing by default on.

Well, the question even for that case is whether it really costs
anything.  My bet is that it is nearly free when it doesn't help, but
that could be wrong.  My experience running pgbench tests is that
prewarming all of pgbench_accounts on a scale factor that fits in
shared_buffers using "dd" took just a few seconds, but when accessing
the blocks in random order the cache took many minutes to heat up.
Now, I assume that this patch sorts the I/O (although I haven't
checked that) and therefore I expect that the prewarm completes really
fast.  If that's not the case, then that's bad.  But if it is the
case, then it's not really hurting you even if the workload changes
completely.

Again, I'm not really arguing for enable-by-default, but I think if
this is well-implemented the chances of it actually hurting anything
are very low, so you'll either win or it'll make no difference.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Tue, Jan 31, 2017 at 9:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Now, I assume that this patch sorts the I/O (although I haven't
> checked that) and therefore I expect that the prewarm completes really
> fast.  If that's not the case, then that's bad.  But if it is the
> case, then it's not really hurting you even if the workload changes
> completely.

-- JFYI: yes, in the patch we load the sorted
<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> entries.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Michael Paquier
Date:
On Wed, Feb 1, 2017 at 1:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, the question even for that case is whether it really costs
> anything.  My bet is that it is nearly free when it doesn't help, but
> that could be wrong.  My experience running pgbench tests is that
> prewarming all of pgbench_accounts on a scale factor that fits in
> shared_buffers using "dd" took just a few seconds, but when accessing
> the blocks in random order the cache took many minutes to heat up.

And benchmarks like dbt-1 have a pre-warming period added in the test
itself, where the user can specify a number of seconds over which to
linearly increase the load from 0% to 100%, just for filling the OS's and
PG's caches... This feature would be helpful.

> Now, I assume that this patch sorts the I/O (although I haven't
> checked that) and therefore I expect that the prewarm completes really
> fast.  If that's not the case, then that's bad.  But if it is the
> case, then it's not really hurting you even if the workload changes
> completely.

Having that working fast would be really nice.
-- 
Michael



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
Hi all,
Here is the new patch, which fixes all of the above comments. I have
changed the design a bit now, as below.

What is it?
===========
A pair of bgworkers: one which automatically dumps the buffer pool's block
info at a given interval, and another which loads those blocks into the
buffer pool when the server restarts.

How does it work?
=================
When the shared library pg_prewarm is preloaded during server startup,
a bgworker "auto pg_prewarm load" is launched immediately after the
server is started. The bgworker starts loading blocks obtained from
the block info entries
<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> in
$PGDATA/AUTO_PG_PREWARM_FILE, for as long as there is a free buffer in
the buffer pool. This way we do not replace any new blocks which were
loaded either by the recovery process or by querying clients.

Once the "auto pg_prewarm load" bgworker has completed its job, it
registers a dynamic bgworker "auto pg_prewarm dump", which is launched
when the server reaches a consistent state. The new bgworker
periodically scans the buffer pool and dumps the meta info of the
blocks which are currently in the buffer pool. The GUC
pg_prewarm.dump_interval, if set > 0, indicates the minimum time
interval between two dumps. If
pg_prewarm.dump_interval is set to AT_PWARM_DUMP_AT_SHUTDOWN_ONLY, the
bgworker will only dump at the time of server shutdown. If it is set
to AT_PWARM_LOAD_ONLY, we do not want the bgworker to dump at all, so
it stops there.

To relaunch a stopped "auto pg_prewarm dump" bgworker, we can use the
utility function "launch_pg_prewarm_dump".
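
As a sketch, the registration step might look like this (worker and function
names are assumptions, not necessarily the patch's exact code):

    BackgroundWorker        worker;
    BackgroundWorkerHandle *handle;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
    worker.bgw_start_time = BgWorkerStart_ConsistentState;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_prewarm");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "auto_pgprewarm_dump_main");
    snprintf(worker.bgw_name, BGW_MAXLEN, "auto pg_prewarm dump");
    worker.bgw_notify_pid = MyProcPid;

    if (!RegisterDynamicBackgroundWorker(&worker, &handle))
        ereport(ERROR,
                (errmsg("registering dynamic bgworker \"auto pg_prewarm dump\" failed"),
                 errhint("Consider increasing configuration parameter \"max_worker_processes\".")));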

==================
One problem I have left open is that multiple "auto pg_prewarm dump"
workers can be launched even if a dump/load worker is already running,
by calling launch_pg_prewarm_dump. I can avoid this by dropping a
lock file before starting the bgworkers, but if there is a simpler
method to avoid launching the bgworker, I can do that instead.
Any suggestion on this will be very helpful.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Beena Emerson
Date:
Hello,

Thank you for the updated patch.

On Tue, Feb 7, 2017 at 10:44 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
Hi all,
Here is the new patch which fixes all of above comments, I changed the
design a bit now as below

What is it?
===========
A pair of bgwrokers one which automatically dumps buffer pool's block
info at a given interval and another which loads those block into
buffer pool when
the server restarts.

Are two workers required? This would reduce the number of workers available to be launched by other applications. Also, with max_worker_processes = 2 and a restart, the system crashes when the 2nd worker is not launched:
2017-02-07 11:36:39.132 IST [20573] LOG:  auto pg_prewarm load : number of buffers actually tried to load 64
2017-02-07 11:36:39.143 IST [18014] LOG:  worker process: auto pg_prewarm load (PID 20573) was terminated by signal 11: Segmentation fault 

I think the documentation should also mention that an appropriate max_worker_processes should be set, else the dump worker will not be launched at all.


--
Thank you, 

Beena Emerson

Have a Great Day!

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Tue, Feb 7, 2017 at 11:53 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello,
>
> Thank you for the updated patch.
>
> On Tue, Feb 7, 2017 at 10:44 AM, Mithun Cy <mithun.cy@enterprisedb.com>
> wrote:
>>
>> Hi all,
>> Here is the new patch which fixes all of above comments, I changed the
>> design a bit now as below
>>
>> What is it?
>> ===========
>> A pair of bgwrokers one which automatically dumps buffer pool's block
>> info at a given interval and another which loads those block into
>> buffer pool when
>> the server restarts.
>
>
> Are 2 workers required?
>

I think in the new design there is a provision for launching the worker
dynamically to dump the buffers, so there seems to be a need for
separate workers for loading and dumping the buffers.  However, there
is no explanation in the patch or otherwise of when and why this needs a
pair of workers.  Also, if the dump interval is greater than zero,
then do we really need to separately register a dynamic worker?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Tue, Feb 7, 2017 at 10:44 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>
> ==================
> One problem now I have kept it open is multiple "auto pg_prewarm dump"
> can be launched even if already a dump/load worker is running by
> calling launch_pg_prewarm_dump. I can avoid this by dropping a
> lock-file before starting the bgworkers. But, if there is an another
> method to avoid launching bgworker on a simple method I can do same.
>

How about keeping a variable in the PROC_HDR structure to indicate
whether a dump worker is already running, and then not allowing a new
one to start?
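
(A sketch of that idea — the field name is an assumption, and proper locking
is elided for brevity:

    /* hypothetical addition to PROC_HDR in shared memory */
    pid_t       autoprewarm_dump_pid;   /* 0 when no dump worker is running */

    /* claim the slot before registering a new dump worker */
    if (ProcGlobal->autoprewarm_dump_pid != 0)
        ereport(ERROR,
                (errmsg("an \"auto pg_prewarm dump\" worker is already running")));
    ProcGlobal->autoprewarm_dump_pid = MyProcPid;
)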

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Tue, Feb 7, 2017 at 11:53 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> launched by other  applications. Also with max_worker_processes = 2 and
> restart, the system crashes when the 2nd worker is not launched:
> 2017-02-07 11:36:39.132 IST [20573] LOG:  auto pg_prewarm load : number of
> buffers actually tried to load 64
> 2017-02-07 11:36:39.143 IST [18014] LOG:  worker process: auto pg_prewarm
> load (PID 20573) was terminated by signal 11: Segmentation fault

The SEGFAULT was a coding mistake: I called the C-language function
directly without initializing the functioncallinfo. Thanks for
raising this. The patch below fixes it.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Beena Emerson
Date:


On Tue, Feb 7, 2017 at 3:01 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Tue, Feb 7, 2017 at 11:53 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> launched by other applications. Also with max_worker_processes = 2 and
>> restart, the system crashes when the 2nd worker is not launched [...]
>
> The SEGFAULT was a coding mistake: I called the C-language function
> directly without initializing the functioncallinfo. Thanks for
> raising this. The patch below fixes it.


Few more comments:

= Background worker messages:

- Workers, when launched, show messages like "logical replication launcher started” and "autovacuum launcher started”. We should probably have a similar message to show that the pg_prewarm load and dump bgworkers have started.

- "auto pg_prewarm load: number of buffers to load x”: other messages show a space before and after “:”; we should keep it consistent throughout.


= Action of -1.
I thought we had decided that an interval value of -1 would mean that the auto prewarm worker will not run at all. With the current code, -1 is explained to mean it will not dump. I noticed that reloading with the new option set to -1 stops both workers, but restarting loads the data and then quits. Why does it allow loading with -1? Please explain this better in the documents.


= launch_pg_prewarm_dump()
With dump_interval = -1, though the function returns a PID, this worker is not running with the 04 patch (with the 03 version it was launched). Dumping is not done now.

=# SELECT launch_pg_prewarm_dump();
 launch_pg_prewarm_dump
------------------------
                  53552
(1 row)

$ ps -ef | grep 53552
b_emers+  53555   4391  0 16:21 pts/1    00:00:00 grep --color=auto 53552


= Function names
- load_now could be better named as load_file, load_dumpfile or similar.
- dump_now -> dump_buffer or  better?


= Corrupt file
If the dump file is corrupted, the system crashes and the prewarm bgworkers are not restarted. This needs to be handled better.

WARNING:  terminating connection because of crash of another server process
2017-02-07 16:36:58.680 IST [54252] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory


= Documentation

I feel the documentation needs to be improved greatly.

- The first para in pg_prewarm should mention the autoload feature too.

- The new section should be named “The pg_prewarm autoload” or something better; "auto pg_prewarm bgworker” does not seem appropriate. The configuration parameters should also be listed out clearly, as on the auth_delay page. The new function launch_pg_prewarm_dump() should be listed under Functions.

--
Thank you, 

Beena Emerson

Have a Great Day!

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Tue, Feb 7, 2017 at 12:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Feb 7, 2017 at 11:53 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> Are 2 workers required?
>>
>
> I think in the new design there is a provision of launching the worker
> dynamically to dump the buffers, so there seems to be a need of
> separate workers for loading and dumping the buffers.  However, there
> is no explanation in the patch or otherwise when and why this needs a
> pair of workers.  Also, if the dump interval is greater than zero,
> then do we really need to separately register a dynamic worker?

We have introduced a new value, -1, for pg_prewarm.dump_interval; this
means we will not dump at all. In this state, I thought the auto
pg_prewarm process need not run at all, so I coded the auto
pg_prewarm worker to exit immediately. But if the user decides to start
the auto pg_prewarm dump without restarting the server, I have
introduced a launcher function, "launch_pg_prewarm_dump", to restart
the auto pg_prewarm worker to dump only. Since we can now launch a
worker only to dump, I thought we could redistribute the code between
two workers: one which only does the prewarm (load only) and another
which dumps periodically. This helped me to modularize and reuse the
code. So once the load worker has finished its job, it registers a
dump worker and then exits.
But if max_worker_processes is not enough to launch the "auto
pg_prewarm dump" bgworker, we throw an error:
2017-02-07 14:51:59.789 IST [50481] ERROR:  registering dynamic
bgworker "auto pg_prewarm dump" failed
2017-02-07 14:51:59.789 IST [50481] HINT:  Consider increasing
configuration parameter "max_worker_processes".

Now, thinking about it again: instead of throwing such an error and
having it corrected by explicitly launching the auto pg_prewarm dump
bgworker through launch_pg_prewarm_dump(), I can go back to the
original design, where there is one worker which loads and then dumps
periodically, and launch_pg_prewarm_dump relaunches only the dump
activity of that worker. Does this sound good?

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
Thanks Beena,

On Tue, Feb 7, 2017 at 4:46 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Few more comments:
>
> = Background worker messages:
>
> - Workers when launched, show messages like: "logical replication launcher
> started”, "autovacuum launcher started”. We should probably have a similar
> message to show that the pg_prewarm load and dump bgworker has started.

-- Thanks, I will add startup and shutdown messages.

> - "auto pg_prewarm load: number of buffers to load x”, other messages show
> space before and after “:”, we should keep it consistent through out.
>

-- I think you are testing patch 03. The latest patch_04 has corrected
this. Can you please re-test it?

>
> = Action of -1.
> I thought we decided that interval value of -1 would mean that the auto
> prewarm worker will not be run at all. With current code, -1 is explained to
> mean it will not dump. I noticed that reloading with new option as -1 stops
> both the workers but restarting loads the data and then quits. Why does it
> allow loading with -1? Please explain this better in the documents.
>

-- '-1' means we do not want to dump at all, so we decide not to
continue with the launched bgworker and it exits. As per your first
comment, if I register the startup and shutdown messages for auto
pg_prewarm, I think it will look better. I will try to explain it in a
better way in the documentation. The "auto pg_prewarm load" worker is
not affected by the dump_interval value: it will always start,
load (prewarm), and then exit.

>
> = launch_pg_prewarm_dump()

> =# SELECT launch_pg_prewarm_dump();
>  launch_pg_prewarm_dump
> ------------------------
>                   53552
> (1 row)
>
> $ ps -ef | grep 53552
> b_emers+  53555   4391  0 16:21 pts/1    00:00:00 grep --color=auto 53552

-- If dump_interval = -1 "auto pg_prewarm dump" will exit immediately.

> = Function names
> - load_now could be better named as load_file, load_dumpfile or similar.
> - dump_now -> dump_buffer or  better?

I chose load_now and dump_now to indicate that we are doing it
immediately, as invoking them is based on a timer/state. Probably we
can improve on that, but dump_buffer and load_file may not be the
right replacements.

>
> = Corrupt file
> if the dump file is corrupted, the system crashes and the prewarm bgworkers
> are not restarted. This needs to be handled better.
>
> WARNING:  terminating connection because of crash of another server process
> 2017-02-07 16:36:58.680 IST [54252] DETAIL:  The postmaster has commanded
> this server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory

--- Can you please paste your autopgprewarm.save file? I edited the
file manually to add an illegal entry but did not see any crash; we
only failed to load, as the block numbers were invalid. Please share
your tests so that I can reproduce the same.

> = Documentation
>
> I feel the documentation needs to be improved greatly.
>
> - The first para in pg_prewarm should mention the autoload feature too.
>
> - The new section should be named “The pg_prewarm autoload” or something
> better. "auto pg_prewarm bgworker” does not seem appropriate.  The
> configuration parameter should also be listed out clearly like in auth-delay
> page. The new function launch_pg_prewarm_dump() should be listed under
> Functions.

-- Thanks, I will try to improve the documentation and put more details there.


--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Beena Emerson
Date:
Hello,

On Tue, Feb 7, 2017 at 5:52 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
Thanks Beena,

On Tue, Feb 7, 2017 at 4:46 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Few more comments:
>
> = Background worker messages:
>
> - Workers when launched, show messages like: "logical replication launcher
> started”, "autovacuum launcher started”. We should probably have a similar
> message to show that the pg_prewarm load and dump bgworker has started.

-- Thanks, I will add startup and shutdown messages.

> - "auto pg_prewarm load: number of buffers to load x”, other messages show
> space before and after “:”, we should keep it consistent throughout.
>

-- I think you are testing patch 03. The latest patch_04 has corrected
this. Can you please re-test it?

I had initially written comments for 03 and then tested 04 again, retaining the comments still valid for it. I guess I missed removing this one. Sorry.
 

>
> = Action of -1.
> I thought we decided that interval value of -1 would mean that the auto
> prewarm worker will not be run at all. With current code, -1 is explained to
> mean it will not dump. I noticed that reloading with new option as -1 stops
> both the workers but restarting loads the data and then quits. Why does it
> allow loading with -1? Please explain this better in the documents.
>

-- '-1' means we do not want to dump at all, so we decide not to
continue with the launched bgworker and it exits. As per your first
comment, if I register the startup and shutdown messages for auto
pg_prewarm I think it will look better. I will try to explain it in a
better way in the documentation. The "auto pg_prewarm load" worker is
not affected by the dump_interval value; it will always start,
load (prewarm) and then exit.

>
> = launch_pg_prewarm_dump()

> =# SELECT launch_pg_prewarm_dump();
>  launch_pg_prewarm_dump
> ------------------------
>                   53552
> (1 row)
>
> $ ps -ef | grep 53552
> b_emers+  53555   4391  0 16:21 pts/1    00:00:00 grep --color=auto 53552

-- If dump_interval = -1 "auto pg_prewarm dump" will exit immediately.

> = Function names
> - load_now could be better named as load_file, load_dumpfile or similar.
> - dump_now -> dump_buffer or  better?

I chose load_now and dump_now to indicate that we are doing it
immediately, as invoking them is based on a timer/state. Probably we
can improve on that, but dump_buffer and load_file may not be the
right replacements.

>
> = Corrupt file
> if the dump file is corrupted, the system crashes and the prewarm bgworkers
> are not restarted. This needs to be handled better.
>
> WARNING:  terminating connection because of crash of another server process
> 2017-02-07 16:36:58.680 IST [54252] DETAIL:  The postmaster has commanded
> this server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory

--- Can you please paste your autopgprewarm.save file? I edited the
file manually to add an illegal entry but did not see any crash; we
only failed to load, as the block numbers were invalid. Please share
your tests so that I can reproduce the same.

I only changed the fork number from 0 to 10 in one of the entries.


> = Documentation
>
> I feel the documentation needs to be improved greatly.
>
> - The first para in pg_prewarm should mention the autoload feature too.
>
> - The new section should be named “The pg_prewarm autoload” or something
> better. "auto pg_prewarm bgworker” does not seem appropriate.  The
> configuration parameter should also be listed out clearly like in auth-delay
> page. The new function launch_pg_prewarm_dump() should be listed under
> Functions.

-- Thanks, I will try to improve the documentation and put more details there.



--
Thank you, 

Beena Emerson

Have a Great Day!

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Beena Emerson
Date:
Hello,

On Tue, Feb 7, 2017 at 5:14 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
On Tue, Feb 7, 2017 at 12:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Feb 7, 2017 at 11:53 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> Are 2 workers required?
>>
>
> I think in the new design there is a provision of launching the worker
> dynamically to dump the buffers, so there seems to be a need of
> separate workers for loading and dumping the buffers.  However, there
> is no explanation in the patch or otherwise when and why this needs a
> pair of workers.  Also, if the dump interval is greater than zero,
> then do we really need to separately register a dynamic worker?

We have introduced a new value -1 for pg_prewarm.dump_interval; this
means we will not dump at all. In that state, I thought the auto
pg_prewarm process need not run at all, so I coded it to exit
immediately. But if the user decides to start the auto pg_prewarm to
dump only, without restarting the server, I have introduced a launcher
function "launch_pg_prewarm_dump" to restart the auto pg_prewarm for
dumping only. Since we can now launch a worker only to dump, I thought
we could redistribute the code between two workers: one which only
does prewarm (load only) and another which dumps periodically. This
helped me to modularize and reuse the code. So once the load worker
has finished its job, it registers a dump worker and then exits.
But if max_worker_processes is not enough to launch the "auto
pg_prewarm dump" bgworker, we throw an error:
2017-02-07 14:51:59.789 IST [50481] ERROR:  registering dynamic
bgworker "auto pg_prewarm dump" failed
2017-02-07 14:51:59.789 IST [50481] HINT:  Consider increasing
configuration parameter "max_worker_processes".

Now thinking again, instead of such an error and then correcting it by
explicitly launching the auto pg_prewarm dump bgworker through
launch_pg_prewarm_dump(), I can go back to the original design where
there will be one worker which loads and then dumps periodically, and
launch_pg_prewarm_dump will relaunch the dump-only activity of that
worker. Does this sound good?

Yes, it would be better to have only one pg_prewarm worker, as the loader is idle for the entire server run time after the initial load activity of a few seconds.

--
Thank you, 

Beena Emerson

Have a Great Day!

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Tue, Feb 7, 2017 at 6:11 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>  Yes, it would be better to have only one pg_prewarm worker, as the loader is
> idle for the entire server run time after the initial load activity of a few
> seconds.
Sorry, that is again a bug in the code. The code to handle SIGUSR1
somehow got deleted before I submitted patch_03 and I failed to notice
it.
As the code stands, the loader bgworker waits on the latch to learn
the status of the dump bgworker; actually, the loader bgworker should
exit right after launching the dump bgworker. I will try to fix this
and the other comments given by you in my next patch.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Tue, Feb 7, 2017 at 5:14 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Tue, Feb 7, 2017 at 12:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Tue, Feb 7, 2017 at 11:53 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>>> Are 2 workers required?
>>>
>>
>> I think in the new design there is a provision of launching the worker
>> dynamically to dump the buffers, so there seems to be a need of
>> separate workers for loading and dumping the buffers.  However, there
>> is no explanation in the patch or otherwise when and why this needs a
>> pair of workers.  Also, if the dump interval is greater than zero,
>> then do we really need to separately register a dynamic worker?
>
> We have introduced a new value -1 for pg_prewarm.dump_interval; this
> means we will not dump at all. In that state, I thought the auto
> pg_prewarm process need not run at all, so I coded it to exit
> immediately. But if the user decides to start the auto pg_prewarm to
> dump only, without restarting the server, I have introduced a launcher
> function "launch_pg_prewarm_dump" to restart the auto pg_prewarm for
> dumping only. Since we can now launch a worker only to dump, I thought
> we could redistribute the code between two workers: one which only
> does prewarm (load only) and another which dumps periodically. This
> helped me to modularize and reuse the code. So once the load worker
> has finished its job, it registers a dump worker and then exits.
> But if max_worker_processes is not enough to launch the "auto
> pg_prewarm dump" bgworker, we throw an error:
> 2017-02-07 14:51:59.789 IST [50481] ERROR:  registering dynamic
> bgworker "auto pg_prewarm dump" failed
> 2017-02-07 14:51:59.789 IST [50481] HINT:  Consider increasing
> configuration parameter "max_worker_processes".
>
> Now thinking again, instead of such an error and then correcting it by
> explicitly launching the auto pg_prewarm dump bgworker through
> launch_pg_prewarm_dump(), I can go back to the original design where
> there will be one worker which loads and then dumps periodically, and
> launch_pg_prewarm_dump will relaunch the dump-only activity of that
> worker. Does this sound good?
>

Won't it be simpler if you consider -1 as a value to just load the
library?  For *_interval = -1, it will neither load nor dump.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Beena Emerson
Date:
Hello,

On Wed, Feb 8, 2017 at 3:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Feb 7, 2017 at 5:14 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Tue, Feb 7, 2017 at 12:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Tue, Feb 7, 2017 at 11:53 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>>> Are 2 workers required?
>>>
>>
>> I think in the new design there is a provision of launching the worker
>> dynamically to dump the buffers, so there seems to be a need of
>> separate workers for loading and dumping the buffers.  However, there
>> is no explanation in the patch or otherwise when and why this needs a
>> pair of workers.  Also, if the dump interval is greater than zero,
>> then do we really need to separately register a dynamic worker?
>
> We have introduced a new value -1 for pg_prewarm.dump_interval; this
> means we will not dump at all. In that state, I thought the auto
> pg_prewarm process need not run at all, so I coded it to exit
> immediately. But if the user decides to start the auto pg_prewarm to
> dump only, without restarting the server, I have introduced a launcher
> function "launch_pg_prewarm_dump" to restart the auto pg_prewarm for
> dumping only. Since we can now launch a worker only to dump, I thought
> we could redistribute the code between two workers: one which only
> does prewarm (load only) and another which dumps periodically. This
> helped me to modularize and reuse the code. So once the load worker
> has finished its job, it registers a dump worker and then exits.
> But if max_worker_processes is not enough to launch the "auto
> pg_prewarm dump" bgworker, we throw an error:
> 2017-02-07 14:51:59.789 IST [50481] ERROR:  registering dynamic
> bgworker "auto pg_prewarm dump" failed
> 2017-02-07 14:51:59.789 IST [50481] HINT:  Consider increasing
> configuration parameter "max_worker_processes".
>
> Now thinking again, instead of such an error and then correcting it by
> explicitly launching the auto pg_prewarm dump bgworker through
> launch_pg_prewarm_dump(), I can go back to the original design where
> there will be one worker which loads and then dumps periodically, and
> launch_pg_prewarm_dump will relaunch the dump-only activity of that
> worker. Does this sound good?
>

Won't it be simpler if you consider -1 as a value to just load the library?
 For *_interval = -1, it will neither load nor dump.

 
+1
That is what I thought was the behaviour we decided upon for -1. 



--
Thank you, 

Beena Emerson

Have a Great Day!

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Peter Geoghegan
Date:
On Tue, Feb 7, 2017 at 1:31 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> SEGFAULT was a coding mistake: I had called the C-language function
> directly without initializing the functioncallinfo. Thanks for
> raising this. The below patch fixes the same.

It would be nice if this had an option to preload only internal B-tree
pages into shared_buffers. They're usually less than 1% of the total
pages in a B-Tree, and are by far the most frequently accessed. It's
reasonable to suppose that much of the random I/O incurred when
warming the cache occurs there. Now, prewarming those will incur
random I/O, often completely random I/O, but by and large it would be
a matter of swallowing that cost sooner, through using your tool,
rather than later, during the execution of queries.

-- 
Peter Geoghegan



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Thu, Feb 9, 2017 at 12:36 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Feb 7, 2017 at 1:31 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>> SEGFAULT was a coding mistake: I had called the C-language function
>> directly without initializing the functioncallinfo. Thanks for
>> raising this. The below patch fixes the same.
>
> It would be nice if this had an option to preload only internal B-tree
> pages into shared_buffers.
>

Sure, but I think it won't directly fit into the current functionality
of the patch, which loads the blocks that were dumped from shared
buffers before the server was stopped.  The way to extend it could be
that, while dumping, it dumps just the btree internal pages, or
something like that.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Tue, Feb 7, 2017 at 2:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Feb 7, 2017 at 10:44 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>>
>> ==================
>> One problem I have kept open for now is that multiple "auto pg_prewarm dump"
>> workers can be launched even if a dump/load worker is already running,
>> by calling launch_pg_prewarm_dump. I can avoid this by dropping a
>> lock-file before starting the bgworkers. But, if there is another
>> simpler method to avoid launching the bgworker, I can do the same.
>>
>
> How about keeping a variable in PROC_HDR structure to indicate if
> already one dump worker is running, then don't allow to start a new
> one?

A contrib module shouldn't change core (and shouldn't need to).  It
can register its own shared memory area if it wants.
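
For reference, the usual pattern looks roughly like this (the struct
and its bgworker_pid field are illustrative, not taken from the patch):

static AutoPrewarmSharedState *state = NULL;
static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

static void
apw_shmem_startup(void)
{
    bool        found;

    if (prev_shmem_startup_hook)
        prev_shmem_startup_hook();

    LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
    state = ShmemInitStruct("autoprewarm",
                            sizeof(AutoPrewarmSharedState), &found);
    if (!found)
        state->bgworker_pid = InvalidPid;   /* first time through */
    LWLockRelease(AddinShmemInitLock);
}

void
_PG_init(void)
{
    /* reserve the space and install the hook at preload time */
    RequestAddinShmemSpace(MAXALIGN(sizeof(AutoPrewarmSharedState)));
    prev_shmem_startup_hook = shmem_startup_hook;
    shmem_startup_hook = apw_shmem_startup;
}

launch_pg_prewarm_dump() could then test and set bgworker_pid under
the same lock to refuse a second worker.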

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Thu, Feb 9, 2017 at 1:47 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> Won't it be simpler if you consider -1 as a value to just load the library?
>>  For *_interval = -1, it will neither load nor dump.
>>
> +1
> That is what I thought was the behaviour we decided upon for -1.

Right.  I can't see why you'd want to be able to separately control
those two things.  If you're not dumping, you don't want to load; if
you're not loading, you don't want to dump.

I think what should happen is that there should be only one worker.
If the GUC is -1, it never gets registered.  Otherwise, it starts at
database startup time and runs until shutdown.  At startup time, it
loads buffers until we run out of free buffers or until all saved
buffers are loaded, whichever happens first.  Buffers should be sorted
by relfilenode (in any order) and block number (in increasing order).
Once it finishes loading buffers, it repeatedly sleeps for the amount
of time indicated by the GUC (or indefinitely if the GUC is 0),
dumping after each sleep and at shutdown.
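
In rough outline, something like this (untested sketch; load_now(),
dump_now(), got_sigterm and the dump_interval GUC are placeholder
names, and this assumes the current four-argument WaitLatch):

void
autoprewarm_main(Datum main_arg)
{
    pqsignal(SIGTERM, apw_sigterm_handler); /* handler sets got_sigterm */
    BackgroundWorkerUnblockSignals();

    load_now();     /* load saved blocks, sorted by relfilenode, block */

    while (!got_sigterm)
    {
        int     rc;
        int     events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
        long    timeout = -1L;

        if (dump_interval > 0)
        {
            events |= WL_TIMEOUT;
            timeout = dump_interval * 1000L;
        }

        rc = WaitLatch(MyLatch, events, timeout, PG_WAIT_EXTENSION);
        ResetLatch(MyLatch);

        if (rc & WL_POSTMASTER_DEATH)
            proc_exit(1);

        if (rc & WL_TIMEOUT)
            dump_now();     /* periodic dump */

        /*
         * NB: after a bare WL_LATCH_SET wakeup the remaining time until
         * the next dump must be recomputed, not restarted from scratch.
         */
    }

    dump_now();             /* final dump at shutdown */
    proc_exit(0);
}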

Shutting down one worker to start up another doesn't seem to make
sense.  If for some reason you want the code for those in separate
functions, you can call one function and then call the other.  Putting
them in completely separate processes doesn't buy anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Peter Eisentraut
Date:
On 2/10/17 15:12, Robert Haas wrote:
> Right.  I can't see why you'd want to be able to separately control
> those two things.  If you're not dumping, you don't want to load; if
> you're not loading, you don't want to dump.

What about the case where you want to prewarm a standby from the info
from the primary (or another standby)?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Wed, Feb 22, 2017 at 4:06 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 2/10/17 15:12, Robert Haas wrote:
>> Right.  I can't see why you'd want to be able to separately control
>> those two things.  If you're not dumping, you don't want to load; if
>> you're not loading, you don't want to dump.
>
> What about the case where you want to prewarm a standby from the info
> from the primary (or another standby)?

I think it's OK to treat that as something of a corner case.  There's
nothing to keep you from doing that today: just use pg_buffercache to
dump a list of blocks on one server, and then pass those blocks to
pg_prewarm on another server.  The point of this patch, AIUI, is to
automate a particularly common case of that, which is to dump before
shutdown and load upon startup.  It doesn't preclude other things that
people want to do.
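
Concretely, that manual route is something like this (the relation
name and block number in the second query are only illustrative):

-- on the source server: list the cached blocks of the current database
SELECT c.oid::regclass AS rel, b.relforknumber, b.relblocknumber
FROM pg_buffercache b
JOIN pg_class c ON pg_relation_filenode(c.oid) = b.relfilenode
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database());

-- on the other server, per block (relforknumber 0 is the 'main' fork)
SELECT pg_prewarm('my_table', 'buffer', 'main', 1234, 1234);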

I suppose we could have an SQL-callable function that does an
immediate dump (without starting a worker).  That wouldn't hurt
anything, and might be useful in a case like the one you mention.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
Hi all, thanks.
I have tried to fix all of the comments given above, with some more
code cleanups.

On Wed, Feb 22, 2017 at 6:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think it's OK to treat that as something of a corner case.  There's
> nothing to keep you from doing that today: just use pg_buffercache to
> dump a list of blocks on one server, and then pass those blocks to
> pg_prewarm on another server.  The point of this patch, AIUI, is to
> automate a particularly common case of that, which is to dump before
> shutdown and load upon startup.  It doesn't preclude other things that
> people want to do.
>
> I suppose we could have an SQL-callable function that does an
> immediate dump (without starting a worker).  That wouldn't hurt
> anything, and might be useful in a case like the one you mention.

In the latest patch, I have moved things back to the old way: there
will be one bgworker, "auto pg_prewarm", which automatically records
information about the blocks present in the buffer pool before server
shutdown and then prewarms the buffer pool with those blocks upon
server restart. I have reverted the code which helped us relaunch a
stopped "auto pg_prewarm" bgworker. The reason I introduced a launcher
SQL utility function is that the running auto pg_prewarm can be
stopped by the user by setting dump_interval to -1; if the user then
wants to restart the stopped auto pg_prewarm (this time dumping only,
to prewarm on the next restart), he can use that utility. The user can
thus launch the auto pg_prewarm to dump periodically while the server
is still running. If that was not the concern, I think I misunderstood
the comments and overdid the design, so as a first patch I will keep
things simple. Also, using a separate process for the prewarm and dump
activities was a bad design, hence I reverted that as well. The auto
pg_prewarm can only be launched by preloading the library. I can add
additional utilities once we formalize what is really needed out of
it.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Wed, Feb 22, 2017 at 11:49 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> Hi all thanks,
> I have tried to fix all of the comments given above with some more
> code cleanups.

While reading this patch tonight, I realized a serious problem with
the entire approach, which is that this patch is supposing that we can
read relation blocks for every database from a single worker that's
not connected to any database.  I realize that I suggested that
approach, but now I think it's broken, because the patch isn't taking
any locks on the relations whose pages it is reading, and that is
definitely going to break things.  While autoprewarm is busy sucking
blocks into the shared buffer cache, somebody could be, for example,
dropping one of those relations.  DropRelFileNodesAllBuffers and
friends expect that nobody is going to be concurrently reading blocks
back into the buffer cache because they hold AccessExclusiveLock, and
they assume that anybody else who is touching it will hold at least
AccessShareLock.  But this violates that assumption, and probably some
others.

This is not easy to fix.  The lock has to be taken based on the
relation OID, not the relfilenode, but we don't have the relation OID
in the dump file, and recording it there won't help, because the
relfilenode can change under us if the relation is rewritten with
CLUSTER or VACUUM FULL or relevant forms of ALTER TABLE.  I don't see
a solution other than launching a separate worker for each database,
which seems like it could be extremely expensive if there are many
databases.  Also, I am pretty sure it's no good to take locks before
recovery reaches a consistent state.  I'm not sure off-hand whether
crash-recovery will notice conflicting locks, but even if it does,
blocking crash recovery in order to prewarm stuff is bad news.

Here are some other review comments (which may not matter unless we
can think up a solution to the problems above).

- I think auto_pg_prewarm.c is an unnecessarily long name.  How about
autoprewarm.c?

- It looks like you ran pgindent over this without adding the new
typedefs to your typedefs.list, so things like the last line of each
typedef is formatted incorrectly.

- ReadBufferForPrewarm isn't a good name for this interface.  You need
to give it a generic name (and header comment) that describes what it
does, rather than why you added it.  Others might want to also use
this interface.  Actually, an even better idea might be to adjust
ReadBufferWithoutRelcache() to serve your need here.  That function's
header comment seems to contemplate that somebody might want to add a
relpersistence argument someday; perhaps that day has arrived.

- have_free_buffer's comment shouldn't mention autoprewarm, but it
should mention that this is a lockless test, so the results might be
slightly stale.  See similar comments in various other backend
functions for an example of how to write this.

- next_task could be static, and with such a generic name, it really
MUST be static to avoid namespace conflicts.

- load_block() has a race condition.  The relation could be dropped
after you check smgrexists() and before you access the relation.
Also, the fork number validity check looks useless; it should never
fail.

- I suggest renaming the file that stores the blocks to
autoprewarm.blocks or something like that.  Calling it just
"autopgprewarm" doesn't seem very clear.

- I don't see any reason for the dump file to contain a header record
with an expected record count.  When rereading the file, you can just
read until EOF; there's no real need to know the record count before
you start.

- You should test for multiple flags like this: if ((buf_state &
(BM_VALID|BM_TAG_VALID|BM_PERSISTENT)) != 0).  However, I also think
the test is wrong. Even if the buffer isn't BM_VALID, that's not
really a reason not to include it in the dump file.  Same with
BM_PERSISTENT.  I think the reason for the latter restriction is that
you're always calling load_block() with RELPERSISTENCE_PERMANENT, but
that's not a good idea either.  If the relation were made unlogged
after you wrote the dump file, then on reload it you'd incorrectly set
BM_PERMANENT on the reloaded blocks.

- elog() should not be used for user-facing messages, but rather only
for internal messages that we don't expect to get generated.  Also,
the messages you've picked don't conform to the project's message
style guidelines.

- The error handling loop around load_block() suggests that you're
expecting some reads to fail, which I guess is because you could be
trying to read blocks from a relation that's been rewritten under a
different relfilenode, or partially or entirely truncated.  But I
don't think it's very reasonable to just let ReadBufferWhatever() spew
error messages into the log and hope users don't panic.  People will
expect an automatic prewarm solution to handle any such cases quietly,
not bleat loudly in the log.  I suspect that this error-trapping code
isn't entirely correct, but there's not much point in fixing it; what
we really need to do is get rid of it (somehow).

- dump_block_info_periodically() will sleep for too long - perhaps
forever - if WaitLatch() repeatedly returns due to WL_LATCH_SET, which
can probably happen if for any reason the process receives SIGUSR1
repeatedly.  Every time the latch gets set, the timeout is reset, so
it may never expire.  There are examples of how to write a loop like
this correctly in various places in the server; please check one of
those.
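
The usual shape is to remember when the last dump happened and sleep
only for the remainder, e.g. (sketch, names illustrative):

TimestampTz last_dump_time = GetCurrentTimestamp();

for (;;)
{
    long    secs;
    int     usecs;
    long    remaining_ms;
    int     rc;

    TimestampDifference(last_dump_time, GetCurrentTimestamp(),
                        &secs, &usecs);
    remaining_ms = dump_interval * 1000L - (secs * 1000L + usecs / 1000);

    if (remaining_ms <= 0)
    {
        dump_now();
        last_dump_time = GetCurrentTimestamp();
        continue;
    }

    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                   remaining_ms, PG_WAIT_EXTENSION);
    ResetLatch(MyLatch);

    if (rc & WL_POSTMASTER_DEATH)
        proc_exit(1);

    /* on a bare latch wakeup, just loop and recompute remaining_ms */
}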

- I don't think you should need an error-catching loop in
auto_pgprewarm_main(), either.  Just let the worker die if there's an
ERROR, and set the restart interval to something other than
BGW_NEVER_RESTART.

- Setting bgw_main isn't correct for extension code.  Please read the
documentation on background workers, which explains how to do this
correctly in an extension.
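
That is, in _PG_init() the worker should be registered by library and
function name, roughly as below (the 60-second restart time is an
arbitrary choice for the example):

BackgroundWorker worker;

memset(&worker, 0, sizeof(worker));
worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                   BGWORKER_BACKEND_DATABASE_CONNECTION;
worker.bgw_start_time = BgWorkerStart_ConsistentState;
worker.bgw_restart_time = 60;           /* not BGW_NEVER_RESTART */
snprintf(worker.bgw_name, BGW_MAXLEN, "autoprewarm");
snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_prewarm");
snprintf(worker.bgw_function_name, BGW_MAXLEN, "autoprewarm_main");
worker.bgw_main_arg = (Datum) 0;
worker.bgw_notify_pid = 0;
RegisterBackgroundWorker(&worker);

bgw_main stays unset; the core looks the entry point up by name, which
also remains valid if the library is mapped at a different address
after a postmaster restart.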

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Peter Eisentraut
Date:
On 2/26/17 11:46, Robert Haas wrote:
> I don't see
> a solution other than launching a separate worker for each database,
> which seems like it could be extremely expensive if there are many
> databases.

You don't have to start all these workers at once.  Starting one and not
starting the next one until the first one is finished should be fine.
It will have the same serial behavior that the patch is proposing anyway.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Mon, Feb 27, 2017 at 7:18 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 2/26/17 11:46, Robert Haas wrote:
>> I don't see
>> a solution other than launching a separate worker for each database,
>> which seems like it could be extremely expensive if there are many
>> databases.
>
> You don't have to start all these workers at once.  Starting one and not
> starting the next one until the first one is finished should be fine.
> It will have the same serial behavior that the patch is proposing anyway.

Yeah, true.  The constant factor is higher, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Sun, Feb 26, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 22, 2017 at 11:49 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>> Hi all thanks,
>> I have tried to fix all of the comments given above with some more
>> code cleanups.
>
> While reading this patch tonight, I realized a serious problem with
> the entire approach, which is that this patch is supposing that we can
> read relation blocks for every database from a single worker that's
> not connected to any database.  I realize that I suggested that
> approach, but now I think it's broken, because the patch isn't taking
> any locks on the relations whose pages it is reading, and that is
> definitely going to break things.  While autoprewarm is busy sucking
> blocks into the shared buffer cache, somebody could be, for example,
> dropping one of those relations.  DropRelFileNodesAllBuffers and
> friends expect that nobody is going to be concurrently reading blocks
> back into the buffer cache because they hold AccessExclusiveLock, and
> they assume that anybody else who is touching it will hold at least
> AccessShareLock.  But this violates that assumption, and probably some
> others.
>
> This is not easy to fix.  The lock has to be taken based on the
> relation OID, not the relfilenode, but we don't have the relation OID
> in the dump file, and recording it there won't help, because the
> relfilenode can change under us if the relation is rewritten with
> CLUSTER or VACUUM FULL or relevant forms of ALTER TABLE.  I don't see
> a solution other than launching a separate worker for each database,
> which seems like it could be extremely expensive if there are many
> databases.  Also, I am pretty sure it's no good to take locks before
> recovery reaches a consistent state.

So we should move this loading of blocks to after recovery reaches a
consistent state, so that we can connect to a database.  To allow the
worker to take a lock, we need to dump the relation OID as well.  Is
that what you are envisioning to fix this problem?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Thu, Mar 2, 2017 at 7:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> So we should move this loading of blocks once the recovery reaches a
> consistent state so that we can connect to a database.  To allow
> worker, to take a lock, we need to dump relation oid as well.  Is that
> what you are envisioning to fix this problem?

No.  The relation -> relfilenode mapping isn't constant, and the
dumping process has no way to get the relation OIDs from the
relfilenodes anyway.  I think what you need to do is dump the
relfilenodes as at present, but then at restore time you need a worker
per database, and it connects to the database and then uses the
infrastructure added by f01d1ae3a104019d6d68aeff85c4816a275130b3 to
discover what relation OID, if any, currently corresponds to the
proposed relfilenode.
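
Per saved entry, the per-database worker would then do roughly this
(sketch; the blk field names follow the dump record, and the worker is
assumed to have called BackgroundWorkerInitializeConnection() for the
right database):

Oid         reloid;
Relation    rel;

StartTransactionCommand();

/* map the saved relfilenode back to whatever relation now owns it */
reloid = RelidByRelfilenode(blk->spcNode, blk->filenode);

if (OidIsValid(reloid))
{
    /* lock it the way any reader would, then pull the blocks in */
    rel = try_relation_open(reloid, AccessShareLock);
    if (rel != NULL)
    {
        /* ... ReadBufferExtended() each saved (forknum, blocknum) ... */
        relation_close(rel, AccessShareLock);
    }
}

CommitTransactionCommand();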

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
Hi all, thanks for the feedback. Based on your recent comments I have
implemented a new patch, which is attached below.

On Sun, Feb 26, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This is not easy to fix.  The lock has to be taken based on the
> relation OID, not the relfilenode, but we don't have the relation OID
> in the dump file, and recording it there won't help, because the
> relfilenode can change under us if the relation is rewritten with
> CLUSTER or VACUUM FULL or relevant forms of ALTER TABLE.  I don't see
> a solution other than launching a separate worker for each database,
> which seems like it could be extremely expensive if there are many
> databases.  Also, I am pretty sure it's no good to take locks before
> recovery reaches a consistent state.  I'm not sure off-hand whether
> crash-recovery will notice conflicting locks, but even if it does,
> blocking crash recovery in order to prewarm stuff is bad news.

On Mon, Feb 27, 2017 at 7:18 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> You don't have to start all these workers at once.  Starting one and not
> starting the next one until the first one is finished should be fine.
> It will have the same serial behavior that the patch is proposing anyway.

I have implemented similar logic now. The prewarm bgworker will launch
a sub-worker per database in the dump file, and each sub-worker will
load its database's block info. Each sub-worker will be launched only
after the previous one has finished. All of this will only start once
the database has reached a consistent state.

I have also introduced 2 more utility functions which were requested earlier.
A. launch_autoprewarm_dump() RETURNS int4
This is a SQL-callable function to launch the autoprewarm worker to
dump the buffer pool information at regular intervals. In a server we
can run only one autoprewarm worker, so if a worker sees another
existing worker it will exit immediately. The return value is the PID
of the worker which has been launched.

B. autoprewarm_dump_now() RETURNS int8
This is a SQL-callable function to dump the buffer pool information
immediately, once, from a backend. It can work in parallel with an
autoprewarm worker while that worker is dumping. The return value is
the number of block infos dumped.
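
So, once the library is loaded, usage looks like:

SELECT launch_autoprewarm_dump();   -- returns the PID of the new worker
SELECT autoprewarm_dump_now();      -- returns the number of block infos dumped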

I need some feedback on efficiently dividing the blocks among
sub-workers; the existing dump format will not be very useful for that.

I have chosen the following format. Each entry of block info looks
like this: <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>, and
we shall call it a BlockInfoRecord.
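
As a C struct the record is essentially this (sketch matching the
field list above):

typedef struct BlockInfoRecord
{
    Oid         database;   /* database OID */
    Oid         spcNode;    /* tablespace OID */
    Oid         filenode;   /* relfilenode of the relation */
    ForkNumber  forknum;    /* which fork */
    BlockNumber blocknum;   /* block within that fork */
} BlockInfoRecord;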

The contents of AUTOPREWARM_FILE are formatted in such a way that the
BlockInfoRecords of each database can be given to a different prewarm
worker.
Format of AUTOPREWARM_FILE
=======================================
[offset position of database map table]
[sorted BlockInfoRecords..............]
[database map table]
=======================================

The [database map table] is a sequence of file offsets, each pointing
to the first BlockInfoRecord of one database in the dump. The prewarm
worker will read these offsets one by one in sequence and ask its
sub-worker to seek to that position and then start loading the
BlockInfoRecords one by one, until it sees a BlockInfoRecord of a
database different from the one it is actually connected to. NOTE: We
store off_t values inside the file, so the dump file will not be
portable across systems where sizeof(off_t) differs.

I also thought of having one dump file per database. The problems I
see with that are that there could be too many dump files (including
stale files of databases which are no longer in the buffer pool), and
that a dump file could get lost due to concurrent dumps by the
bgworker and the autoprewarm_dump_now() SQL utility. For example,
while implementing this, dump file names were chosen as the number
sequence 0,1,2,3......number_of_db (this helps to avoid stale files
being chosen before active databases' dump files). If two processes
dump concurrently there is a race condition: if a page of a new
database is loaded into the buffer pool at that point, a recent dump
of another database's blocks could be overwritten and go missing from
the dumps even though that database is still active in the buffer
pool. Any comments on how to fix this would be very helpful.

On Sun, Feb 26, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Here are some other review comments (which may not matter unless we
> can think up a solution to the problems above).

> - I think auto_pg_prewarm.c is an unnecessarily long name.  How about
> autoprewarm.c?

Fixed; I am also trying to replace the term "auto pg_prewarm" with
"autoprewarm" in all relevant places.

> - It looks like you ran pgindent over this without adding the new
> typedefs to your typedefs.list, so things like the last line of each
> typedef is formatted incorrectly.

Fixed.

> - ReadBufferForPrewarm isn't a good name for this interface.  You need
> to give it a generic name (and header comment) that describes what it
> does, rather than why you added it.  Others might want to also use
> this interface.  Actually, an even better idea might be to adjust
> ReadBufferWithoutRelcache() to serve your need here.  That function's
> header comment seems to contemplate that somebody might want to add a
> relpersistence argument someday; perhaps that day has arrived.

-- I think it is not needed now; since we have the relation
descriptor, we use ReadBufferExtended.

> - have_free_buffer's comment shouldn't mention autoprewarm, but it
> should mention that this is a lockless test, so the results might be
> slightly stale.  See similar comments in various other backend
> functions for an example of how to write this.

-- Fixed

> - next_task could be static, and with such a generic name, it really
> MUST be static to avoid namespace conflicts.

-- Fixed. After the new code changes, that variable is no longer needed.

> - load_block() has a race condition.  The relation could be dropped
> after you check smgrexists() and before you access the relation.
> Also, the fork number validity check looks useless; it should never
> fail.

-- Now we hold a relation lock, so this should not be an issue. But an
extra check of the forknum is required before calling smgrexists;
otherwise it crashes if an invalid forknum is given, hence the check
is necessary.

Crash call stack after manually setting one of the forknums to 10:

#0  0x00007f333bb27694 in vfprintf () from /lib64/libc.so.6
#1  0x00007f333bb53179 in vsnprintf () from /lib64/libc.so.6
#2  0x00000000009ec683 in pvsnprintf (buf=0x139e8e0 "global/1260_",
'\177' <repeats 115 times>, len=128, fmt=0xbebffe "global/%u_%s",
args=0x7ffeaea65750)
    at psprintf.c:121
#3  0x00000000009ec5d1 in psprintf (fmt=0xbebffe "global/%u_%s") at
psprintf.c:64
#4  0x00000000009eca0f in GetRelationPath (dbNode=0, spcNode=1664,
relNode=1260, backendId=-1, forkNumber=10) at relpath.c:150
#5  0x000000000082cd95 in mdopen (reln=0x1359c78, forknum=10,
behavior=2) at md.c:583
#6  0x000000000082c58d in mdexists (reln=0x1359c78, forkNum=10) at md.c:284
#7  0x000000000082f4cf in smgrexists (reln=0x1359c78, forknum=10) at smgr.c:289
#8  0x00007f33353b0294 in load_one_database (main_arg=21) at autoprewarm.c:546
#9  0x0000000000795232 in StartBackgroundWorker () at bgworker.c:757
#10 0x00000000007a707d in do_start_bgworker (rw=0x12fc3d0) at postmaster.c:5570


> - I suggest renaming the file that stores the blocks to
> autoprewarm.blocks or something like that.  Calling it just
> "autopgprewarm" doesn't seem very clear.

-- Fixed. The previous file name was "autopgprewarm.save"; I have
changed it to autoprewarm.blocks.

> - I don't see any reason for the dump file to contain a header record
> with an expected record count.  When rereading the file, you can just
> read until EOF; there's no real need to know the record count before
> you start.

-- Fixed. Now removed.

> - You should test for multiple flags like this: if ((buf_state &
> (BM_VALID|BM_TAG_VALID|BM_PERSISTENT)) != 0).  However, I also think
> the test is wrong. Even if the buffer isn't BM_VALID, that's not
> really a reason not to include it in the dump file.  Same with
> BM_PERSISTENT.  I think the reason for the latter restriction is that
> you're always calling load_block() with RELPERSISTENCE_PERMANENT, but
> that's not a good idea either.  If the relation were made unlogged
> after you wrote the dump file, then on reload it you'd incorrectly set
> BM_PERMANENT on the reloaded blocks.

-- Fixed. The BM_PERMANENT and BM_VALID checks are now removed; only
BM_TAG_VALID is used, so if the hash table has this buffer we dump it.

> - elog() should not be used for user-facing messages, but rather only
> for internal messages that we don't expect to get generated.  Also,
> the messages you've picked don't conform to the project's message
> style guidelines.

-- Fixed.

> - The error handling loop around load_block() suggests that you're
> expecting some reads to fail, which I guess is because you could be
> trying to read blocks from a relation that's been rewritten under a
> different relfilenode, or partially or entirely truncated.  But I
> don't think it's very reasonable to just let ReadBufferWhatever() spew
> error messages into the log and hope users don't panic.  People will
> expect an automatic prewarm solution to handle any such cases quietly,
> not bleat loudly in the log.  I suspect that this error-trapping code
> isn't entirely correct, but there's not much point in fixing it; what
> we really need to do is get rid of it (somehow).

[Need relook] -- Debug and check what will happen if a block load fails.

> - dump_block_info_periodically() will sleep for too long - perhaps
> forever - if WaitLatch() repeatedly returns due to WL_LATCH_SET, which
> can probably happen if for any reason the process receives SIGUSR1
> repeatedly.  Every time the latch gets set, the timeout is reset, so
> it may never expire.  There are examples of how to write a loop like
> this correctly in various places in the server; please check one of
> those.

-- Fixed the same; we now count the elapsed time from the previous dump and then dump.

> - I don't think you should need an error-catching loop in
> auto_pgprewarm_main(), either.  Just let the worker die if there's an
> ERROR, and set the restart interval to something other than
> BGW_NEVER_RESTART.

-- Fixed; we restart it now.

> - Setting bgw_main isn't correct for extension code.  Please read the
> documentation on background workers, which explains how to do this
> correctly in an extension.

-- Fixed.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
>> - The error handling loop around load_block() suggests that you're
>> expecting some reads to fail, which I guess is because you could be
>> trying to read blocks from a relation that's been rewritten under a
>> different relfilenode, or partially or entirely truncated.  But I
>> don't think it's very reasonable to just let ReadBufferWhatever() spew
>> error messages into the log and hope users don't panic.  People will
>> expect an automatic prewarm solution to handle any such cases quietly,
>> not bleat loudly in the log.  I suspect that this error-trapping code
>> isn't entirely correct, but there's not much point in fixing it; what
>> we really need to do is get rid of it (somehow).
>
> [Need relook] -- Debug and check what will happen if a block load fails.

Oops, sorry, that was a note to myself. It is fixed now.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: Proposal : For Auto-Prewarm.

From
Peter Eisentraut
Date:
On 3/13/17 09:15, Mithun Cy wrote:
> A. launch_autoprewarm_dump() RETURNS int4
> This is a SQL-callable function to launch the autoprewarm worker to
> dump the buffer pool information at regular intervals. In a server we
> can run only one autoprewarm worker, so if a worker sees another
> existing worker it will exit immediately. The return value is the PID
> of the worker which has been launched.

Why do you need that?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Sat, Mar 25, 2017 at 11:46 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 3/13/17 09:15, Mithun Cy wrote:
>> A. launch_autoprewarm_dump() RETURNS int4
>> This is a SQL-callable function to launch the autoprewarm worker to
>> dump the buffer pool information at regular intervals. In a server we
>> can run only one autoprewarm worker, so if a worker sees another
>> existing worker it will exit immediately. The return value is the PID
>> of the worker which has been launched.
>
> Why do you need that?

To launch an autoprewarm worker we have to preload the library, which
needs a server restart. If we want to start periodic dumping on an
already-running server, so that it can automatically prewarm on its
next restart, this API can be used to launch the autoprewarm worker.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: Proposal : For Auto-Prewarm.

From
andres@anarazel.de (Andres Freund)
Date:
On 2017-03-13 18:45:00 +0530, Mithun Cy wrote:
> I have implemented similar logic now. The prewarm bgworker will
> launch a sub-worker per database in the dump file, and each
> sub-worker will load its database's block info. Each sub-worker will
> be launched only after the previous one has finished. All of this
> will only start once the database has reached a consistent state.

Hm. For replay performance it'd possibly be good to start earlier,
before reaching consistency.  Is there an issue starting earlier?


> diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
> new file mode 100644
> index 0000000..f4b34ca
> --- /dev/null
> +++ b/contrib/pg_prewarm/autoprewarm.c
> @@ -0,0 +1,1137 @@
> +/*-------------------------------------------------------------------------
> + *
> + * autoprewarm.c
> + *
> + * -- Automatically prewarm the shared buffer pool when server restarts.

Don't think we usually use -- here.


> + *    Copyright (c) 2013-2017, PostgreSQL Global Development Group

Hm, that's a bit of a weird date range.


> + *    IDENTIFICATION
> + *        contrib/pg_prewarm.c/autoprewarm.c
> + *-------------------------------------------------------------------------
> + */

The pg_prewarm.c in there looks like some search & replace gone awry.



> +#include "postgres.h"
> +#include <unistd.h>
> +
> +/* These are always necessary for a bgworker. */
> +#include "miscadmin.h"
> +#include "postmaster/bgworker.h"
> +#include "storage/ipc.h"
> +#include "storage/latch.h"
> +#include "storage/lwlock.h"
> +#include "storage/proc.h"
> +#include "storage/shmem.h"
> +
> +/* These are necessary for prewarm utilities. */
> +#include "pgstat.h"
> +#include "storage/buf_internals.h"
> +#include "storage/smgr.h"
> +#include "utils/memutils.h"
> +#include "utils/resowner.h"
> +#include "utils/guc.h"
> +#include "catalog/pg_class.h"
> +#include "catalog/pg_type.h"
> +#include "executor/spi.h"
> +#include "access/xact.h"
> +#include "utils/rel.h"
> +#include "port/atomics.h"

I'd rather just sort these alphabetically.




I think this should rather be in the initial header.

> +/*
> + * autoprewarm :
> + *
> + * What is it?
> + * ===========
> + * A bgworker which automatically records information about blocks which were
> + * present in buffer pool before server shutdown and then prewarm the buffer
> + * pool upon server restart with those blocks.
> + *
> + * How does it work?
> + * =================
> + * When the shared library "pg_prewarm" is preloaded, a
> + * bgworker "autoprewarm" is launched immediately after the server has reached
> + * consistent state. The bgworker will start loading blocks recorded in the
> + * format BlockInfoRecord
> + * <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in
> + * $PGDATA/AUTOPREWARM_FILE, until there is a free buffer left in the buffer
> + * pool. This way we do not replace any new blocks which were loaded either by
> + * the recovery process or the querying clients.

s/until there is a/until there is no/?


> +/*
> + * ============================================================================
> + * ===========================     SIGNAL HANDLERS    ===========================
> + * ============================================================================
> + */

Hm...

> +static void sigtermHandler(SIGNAL_ARGS);
> +static void sighupHandler(SIGNAL_ARGS);

I don't think that's a casing we commonly use.  We mostly use CamelCase
or underscore_case.


> +/*
> + *    Signal handler for SIGUSR1.
> + */
> +static void
> +sigusr1Handler(SIGNAL_ARGS)
> +{
> +    int            save_errno = errno;
> +
> +    if (MyProc)
> +        SetLatch(&MyProc->procLatch);
> +
> +    errno = save_errno;
> +}

Hm, what's this one for?


> +/*
> + * Shared state information about the running autoprewarm bgworker.
> + */
> +typedef struct AutoPrewarmSharedState
> +{
> +    pg_atomic_uint32 current_task;        /* current tasks performed by
> +                                         * autoprewarm workers. */
> +} AutoPrewarmSharedState;

Hm.  Why do we need atomics here?  I thought there's no concurrency?


> +/*
> + * sort_cmp_func - compare function used for qsort().
> + */
> +static int
> +sort_cmp_func(const void *p, const void *q)
> +{

rename to blockinfo_cmp?
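
Something of this shape, say (a sketch over the dump record described
upthread):

static int
blockinfo_cmp(const void *p, const void *q)
{
    const BlockInfoRecord *a = (const BlockInfoRecord *) p;
    const BlockInfoRecord *b = (const BlockInfoRecord *) q;

    /* compare field by field, in dump order */
    if (a->database != b->database)
        return a->database < b->database ? -1 : 1;
    if (a->spcNode != b->spcNode)
        return a->spcNode < b->spcNode ? -1 : 1;
    if (a->filenode != b->filenode)
        return a->filenode < b->filenode ? -1 : 1;
    if (a->forknum != b->forknum)
        return a->forknum < b->forknum ? -1 : 1;
    if (a->blocknum != b->blocknum)
        return a->blocknum < b->blocknum ? -1 : 1;
    return 0;
}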



> +static AutoPrewarmTask
> +get_autoprewarm_task(AutoPrewarmTask todo_task)
> +{
> +    bool        found;
> +
> +    state = NULL;
> +
> +    LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
> +    state = ShmemInitStruct("autoprewarm",
> +                            sizeof(AutoPrewarmSharedState),
> +                            &found);
> +    if (!found)
> +        pg_atomic_write_u32(&(state->current_task), todo_task);

Superfluous parens (repeated a lot).


> +    LWLockRelease(AddinShmemInitLock);
> +
> +    /* If found check if we can go ahead. */
> +    if (found)
> +    {
> +        if (pg_atomic_read_u32(&(state->current_task)) ==
> +            TASK_PREWARM_BUFFERPOOL)

You repeat the read in every branch - why don't you store it in a
variable instead?

That aside, the use of an atomic doesn't seem to actually gain us
anything here.  If we need control over concurrency it seems a lot
better to instead use a lwlock or spinlock.  There's no contention here,
using lock-free stuff just increases complexity without a corresponding
benefit.


> +        {
> +            if (todo_task == TASK_PREWARM_BUFFERPOOL)
> +            {
> +                /*
> +                 * we were prewarming and we are back to do same, time to
> +                 * abort prewarming and move to dumping.
> +                 */

I'm not sure what "back to do same" should mean here - changing to a
different type of task surely is not the same.


> +                pg_atomic_write_u32(&(state->current_task),
> +                                    TASK_DUMP_BUFFERPOOL_INFO);
> +                return TASK_DUMP_BUFFERPOOL_INFO;
> +            }
> +            else
> +                return TASK_END;    /* rest all cannot proceed further. */

What does that comment mean?


> +        }
> +        else if (pg_atomic_read_u32(&(state->current_task)) ==
> +                 TASK_DUMP_IMMEDIATE_ONCE)
> +        {
> +            uint32        current_state = TASK_DUMP_IMMEDIATE_ONCE;
> +
> +            /* We cannot do a TASK_PREWARM_BUFFERPOOL but rest can go ahead */
> +            if (todo_task == TASK_DUMP_IMMEDIATE_ONCE)
> +                return TASK_DUMP_IMMEDIATE_ONCE;
> +
> +            if (todo_task == TASK_PREWARM_BUFFERPOOL)
> +                todo_task = TASK_DUMP_BUFFERPOOL_INFO;    /* skip to do dump only */
> +
> +            /*
> +             * first guy who can atomically set the current_task get the
> +             * opportunity to proceed further
> +             */
> +            if (pg_atomic_compare_exchange_u32(&(state->current_task),
> +                                               &current_state,
> +                                               TASK_DUMP_BUFFERPOOL_INFO))
> +            {
> +                /* Wow! We won the race proceed with the task. */
> +                return TASK_DUMP_BUFFERPOOL_INFO;
> +            }
> +            else
> +                return TASK_END;

Note that it's not generally guaranteed that any
pg_atomic_compare_exchange_u32 actually wins, it could temporarily fail
for all.


> +/*
> + * getnextblockinfo -- given a BlkType get its next BlockInfoRecord from the
> + *                       dump file.
> + */
> +static BlkType
> +getnextblockinfo(FILE *file, BlockInfoRecord *currblkinfo, BlkType reqblock,
> +                 BlockInfoRecord *newblkinfo)
> +{
> +    BlkType        nextblk;
> +
> +    while (true)
> +    {
> +        /* get next block. */
> +        if (5 != fscanf(file, "%u,%u,%u,%u,%u\n", &(newblkinfo->database),
> +                        &(newblkinfo->spcNode), &(newblkinfo->filenode),
> +                        (uint32 *) &(newblkinfo->forknum),
> +                        &(newblkinfo->blocknum)))
> +            return BLKTYPE_END; /* No more valid entry hence stop processing. */

Hm.  Is it actually helpful to store the file as text?  That's commonly
going to increase the size of the file quite considerably, no?
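
For comparison, a fixed-width binary record would be written and read
back roughly like this (sketch; note a raw struct image also bakes in
padding and endianness, much like the off_t portability concern
mentioned upthread):

static void
write_blockinfo(FILE *file, const BlockInfoRecord *rec)
{
    if (fwrite(rec, sizeof(*rec), 1, file) != 1)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write block info: %m")));
}

static bool
read_blockinfo(FILE *file, BlockInfoRecord *rec)
{
    /* returns false at EOF (or on a short read) */
    return fread(rec, sizeof(*rec), 1, file) == 1;
}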

> +/*
> + * GetRelOid -- given a filenode get its relation oid.
> + */
> +static Oid
> +get_reloid(Oid filenode)
> +{

Function and comment don't agree on naming.


But what is this actually used for?  I thought Robert, in
http://archives.postgresql.org/message-id/CA%2BTgmoa%3DUqCL2mR%2B9WTq05tB3Up-z4Sv2wkzkDxDwBP7Mj_2_w%40mail.gmail.com
suggested storing the filenode in the dump, and then to use
RelidByRelfilenode to get the corresponding relation?

It seems a lot better to use relfilenodes, because otherwise table
rewrites will lead to reloading wrong things.


> +    int            ret;
> +    Oid            relationid;
> +    bool        isnull;
> +    Datum        value[1] = {ObjectIdGetDatum(filenode)};
> +    StringInfoData buf;
> +    Oid            ptype[1] = {OIDOID};
> +
> +    initStringInfo(&buf);
> +    appendStringInfo(&buf,
> +            "select oid from pg_class where pg_relation_filenode(oid) = $1");
> +
> +    ret = SPI_execute_with_args(buf.data, 1, (Oid *) &ptype, (Datum *) &value,
> +                                NULL, true, 1);
> +
> +    if (ret != SPI_OK_SELECT)
> +        ereport(FATAL, (errmsg("SPI_execute failed: error code %d", ret)));
> +
> +    if (SPI_processed < 1)
> +        return InvalidOid;
> +
> +    relationid = DatumGetObjectId(SPI_getbinval(SPI_tuptable->vals[0],
> +                                                SPI_tuptable->tupdesc,
> +                                                1, &isnull));
> +    if (isnull)
> +        return InvalidOid;
> +
> +    return relationid;
> +}

Doing this via SPI doesn't strike me as a good idea - that's really
quite expensive.  Why not call the underlying function directly?


> +/*
> + * load_one_database -- start of prewarm sub-worker, this will try to load
> + * blocks of one database starting from block info position passed by main
> + * prewarm worker.
> + */
> +void
> +load_one_database(Datum main_arg)
> +{

> +    /* check if file exists and open file in read mode. */
> +    snprintf(dump_file_path, sizeof(dump_file_path), "%s", AUTOPREWARM_FILE);
> +    file = fopen(dump_file_path, PG_BINARY_R);
> +    if (!file)
> +        return;                    /* No file to load. */

Shouldn't this be an error case?  In which case is it ok for the file to
be gone after we launched the worker?


> +    /*
> +     * It should be a block info belonging to a new database. Or else dump
> +     * file is corrupted better to end the loading of bocks now.
> +     */
> +    if (loadblocktype != BLKTYPE_NEW_DATABASE)
> +        goto end_load;            /* should we raise a voice here? */

Yes, this should raise an error.
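
E.g. (the errcode choice here is just a suggestion):

    if (loadblocktype != BLKTYPE_NEW_DATABASE)
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("autoprewarm dump file \"%s\" is corrupted",
                        AUTOPREWARM_FILE)));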



> +            case BLKTYPE_NEW_RELATION:
> +
> +                /*
> +                 * release lock on previous relation.
> +                 */
> +                if (rel)
> +                {
> +                    relation_close(rel, AccessShareLock);
> +                    rel = NULL;
> +                }
> +
> +                loadblocktype = BLKTYPE_NEW_RELATION;
> +
> +                /*
> +                 * lock new relation.
> +                 */
> +                reloid = get_reloid(toload_block.filenode);
> +
> +                if (!OidIsValid(reloid))
> +                    break;
> +
> +                rel = try_relation_open(reloid, AccessShareLock);
> +                if (!rel)
> +                    break;
> +                RelationOpenSmgr(rel);

Now I'm confused.  Your get_reloid used pg_relation_filenode() to map
from relation oid to filenode - and then you're using it to lock the
relation?  Something's wrong.


> +            case BLKTYPE_NEW_FORK:
> +
> +                /*
> +                 * check if fork exists and if block is within the range
> +                 */
> +                loadblocktype = BLKTYPE_NEW_FORK;
> +                if (            /* toload_block.forknum > InvalidForkNumber &&
> +                                 * toload_block.forknum <= MAX_FORKNUM && */
> +                    !smgrexists(rel->rd_smgr, toload_block.forknum))
> +                    break;

Huh? What's with that commented out section of code?


> +            case BLKTYPE_NEW_BLOCK:
> +
> +                /* check if blocknum is valid and with in fork file size. */
> +                if (toload_block.blocknum >= nblocks)
> +                {
> +                    /* move to next forknum. */
> +                    loadblocktype = BLKTYPE_NEW_FORK;
> +                    break;
> +                }

Hm. Why does the size of the underlying file allow us to skip to the
next fork? Don't we have to read all the pending dump records?


> +                buf = ReadBufferExtended(rel, toload_block.forknum,
> +                                         toload_block.blocknum, RBM_NORMAL,
> +                                         NULL);
> +                if (BufferIsValid(buf))
> +                {
> +                    ReleaseBuffer(buf);
> +                }
> +
> +                loadblocktype = BLKTYPE_NEW_BLOCK;
> +                break;

Hm. RBM_NORMAL will error out in a bunch of cases, is that ok?

> +    if (have_dbconnection)
> +    {
> +        SPI_finish();
> +        PopActiveSnapshot();
> +        CommitTransactionCommand();
> +    }
> +    return;
> +}

Are we really ok keeping open a transaction through all of this? That
could potentially be quite long, no?  How about doing that on a per-file
basis, or even moving to session locks altogether?



> +/* This sub-module is for periodically dumping buffer pool's block info into
> + * a dump file AUTOPREWARM_FILE.
> + * Each entry of block info looks like this:
> + * <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall call it
> + * as BlockInfoRecord.
> + *
> + * Contents of AUTOPREWARM_FILE has been formated such a way that
> + * blockInfoRecord of each database can be given to different prewarm workers.
> + *
> + *    format of AUTOPREWAM_FILE
> + *    =======================================
> + *    [offset position of database map table]
> + *    [sorted BlockInfoRecords..............]
> + *    [database map table]
> + *    =======================================

This doesn't mention storing things as ascii, instead of binary...


> + *    The [database map table] is sequence of offset in file which will point to
> + *    first BlockInfoRecords of each database in the dump. The prewarm worker
> + *    will read this offset one by one in sequence and ask its subworker to seek
> + *    to this position and then start loading the BlockInfoRecords one by one
> + *    until it see a BlockInfoRecords of a different database than it is actually
> + *    connected to.
> + *    NOTE : We store off_t inside file so the dump file will not be portable to
> + *    be used across systems where sizeof off_t is different from each other.
> + */

Why are we using off_t? Shouldn't this just be BlockNumber?


> +static uint32
> +dump_now(void)
> +{
> +    static char dump_file_path[MAXPGPATH],

> +
> +    for (num_blocks = 0, i = 0; i < NBuffers; i++)
> +    {
> +        uint32        buf_state;
> +
> +        bufHdr = GetBufferDescriptor(i);
> +
> +        /* lock each buffer header before inspecting. */
> +        buf_state = LockBufHdr(bufHdr);
> +
> +        if (buf_state & BM_TAG_VALID)
> +        {
> +            block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
> +            block_info_array[num_blocks].spcNode = bufHdr->tag.rnode.spcNode;
> +            block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
> +            block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
> +            block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
> +            ++num_blocks;
> +        }
> +
> +        UnlockBufHdr(bufHdr, buf_state);

> +    }
> +
> +    /* sorting now only to avoid sorting while loading. */

"sorting while loading"? You mean random accesses?

> +    pg_qsort(block_info_array, num_blocks, sizeof(BlockInfoRecord),
> +             sort_cmp_func);



> +    snprintf(transient_dump_file_path, sizeof(dump_file_path),
> +             "%s.%d", AUTOPREWARM_FILE, MyProcPid);
> +    file = fopen(transient_dump_file_path, "w");
> +    if (file == NULL)
> +        ereport(ERROR,
> +                (errcode_for_file_access(),
> +                 errmsg("autoprewarm: could not open \"%s\": %m",
> +                        dump_file_path)));

What if that file already exists? You're not truncating it.  Also, what
if we error out in the middle of this? We'll leak an fd.  I think this
needs to use OpenTransientFile etc.
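
A sketch of the suggested pattern; transient files are tracked by fd.c
and cleaned up automatically on transaction abort (three-argument
OpenTransientFile signature, as in this era of the tree):

    int         fd;

    fd = OpenTransientFile(transient_dump_file_path,
                           O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
                           S_IRUSR | S_IWUSR);
    if (fd < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not open file \"%s\": %m",
                        transient_dump_file_path)));
    /* ... write the dump ... */
    CloseTransientFile(fd);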

> +    snprintf(dump_file_path, sizeof(dump_file_path),
> +             "%s", AUTOPREWARM_FILE);
> +    ret = fprintf(file, "%020jd\n", (intmax_t) 0);
> +    if (ret < 0)
> +    {
> +        fclose(file);
> +        ereport(ERROR,
> +                (errcode_for_file_access(),
> +                 errmsg("autoprewarm: error writing to \"%s\" : %m",
> +                        dump_file_path)));
> +    }
> +
> +    database_map_table[num_db++] = ftello(file);
> +
> +    for (i = 0; i < num_blocks; i++)
> +    {
> +        if (i > 0 && block_info_array[i].database != prev_database)
> +        {
> +            if (num_db == database_map_table_size)
> +            {
> +                database_map_table_size *= 2;    /* double and repalloc. */
> +                database_map_table =
> +                    (off_t *) repalloc(database_map_table,
> +                                    sizeof(off_t) * database_map_table_size);
> +            }
> +            fflush(file);
> +            database_map_table[num_db++] = ftello(file);
> +        }
> +
> +        ret = fprintf(file, "%u,%u,%u,%u,%u\n",
> +                      block_info_array[i].database,
> +                      block_info_array[i].spcNode,
> +                      block_info_array[i].filenode,
> +                      (uint32) block_info_array[i].forknum,
> +                      block_info_array[i].blocknum);
> +        if (ret < 0)
> +        {
> +            fclose(file);
> +            ereport(ERROR,
> +                    (errcode_for_file_access(),
> +                     errmsg("autoprewarm: error writing to \"%s\" : %m",
> +                            dump_file_path)));
> +        }
> +
> +        prev_database = block_info_array[i].database;
> +    }

I think we should check for interrupts somewhere in that (and the
preceding) loop.
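
E.g. at the top of each iteration:

    for (i = 0; i < num_blocks; i++)
    {
        CHECK_FOR_INTERRUPTS();     /* service pending cancel/terminate */
        /* ... existing per-block work ... */
    }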

> +/*
> + * dump_block_info_periodically - at regular intervals, which is defined by GUC
> + * dump_interval, dump the info of blocks which are present in buffer pool.
> + */
> +void
> +dump_block_info_periodically()
> +{

Suggest adding void to the parameter list.


> +    pg_time_t    last_dump_time = (pg_time_t) time(NULL);
> +
> +    while (!got_sigterm)
> +    {
> +        int            rc;
> +        pg_time_t    now;
> +        int            elapsed_secs = 0,
> +                    timeout = AT_PWARM_DEFAULT_DUMP_INTERVAL;
> +
> +        if (dump_interval > AT_PWARM_DUMP_AT_SHUTDOWN_ONLY)
> +        {
> +            now = (pg_time_t) time(NULL);
> +            elapsed_secs = now - last_dump_time;
> +
> +            if (elapsed_secs > dump_interval)
> +            {
> +                dump_now();
> +                if (got_sigterm)
> +                    return;        /* got shutdown signal just after a dump. And,
> +                                 * I think better to return now. */
> +                last_dump_time = (pg_time_t) time(NULL);
> +                elapsed_secs = 0;
> +            }
> +
> +            timeout = dump_interval - elapsed_secs;
> +        }

I suggest using GetCurrentTimestamp() and TimestampDifferenceExceeds()
instead.
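
Roughly (dump_interval is in seconds per the GUC, so convert to msec):

    TimestampTz last_dump_time = GetCurrentTimestamp();
    ...
    if (TimestampDifferenceExceeds(last_dump_time,
                                   GetCurrentTimestamp(),
                                   dump_interval * 1000))
    {
        dump_now();
        last_dump_time = GetCurrentTimestamp();
    }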


> +        /* Has been set not to dump. Nothing more to do. */
> +        if (dump_interval == AT_PWARM_OFF)
> +            return;
> +
> +        ResetLatch(&MyProc->procLatch);
> +        rc = WaitLatch(&MyProc->procLatch,
> +                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
> +                       timeout * 1000, PG_WAIT_EXTENSION);
> +
> +        if (rc & WL_POSTMASTER_DEATH)
> +            proc_exit(1);
> +
> +        /*
> +         * In case of a SIGHUP, just reload the configuration.
> +         */
> +        if (got_sighup)
> +        {
> +            got_sighup = false;
> +            ProcessConfigFile(PGC_SIGHUP);
> +        }
> +    }
> +
> +    /* One last block meta info dump while postmaster shutdown. */
> +    if (dump_interval != AT_PWARM_OFF)
> +        dump_now();

Uh, afaics we'll also do this if somebody SIGTERMed the process
interactively?


> +/* Extension's entry point. */
> +void
> +_PG_init(void)
> +{
> +    BackgroundWorker autoprewarm;
> +
> +    /* Define custom GUC variables. */
> +    DefineCustomIntVariable("pg_prewarm.dump_interval",
> +                       "Sets the maximum time between two buffer pool dumps",
> +                            "If set to Zero, timer based dumping is disabled."
> +                            " If set to -1, stops the running autoprewarm.",
> +                            &dump_interval,
> +                            AT_PWARM_DEFAULT_DUMP_INTERVAL,
> +                            AT_PWARM_OFF, INT_MAX / 1000,
> +                            PGC_SIGHUP,
> +                            GUC_UNIT_S,
> +                            NULL,
> +                            NULL,
> +                            NULL);
> +
> +    /* if not run as a preloaded library, nothing more to do here! */
> +    if (!process_shared_preload_libraries_in_progress)
> +        return;
> +
> +    DefineCustomStringVariable("pg_prewarm.default_database",
> +                "default database to connect if dump has not recorded same.",
> +                               NULL,
> +                               &default_database,
> +                               "postgres",
> +                               PGC_POSTMASTER,
> +                               0,
> +                               NULL,
> +                               NULL,
> +                               NULL);

I don't think it's a good idea to make GUC registration depend on
process_shared_preload_libraries_in_progress.


You should also use EmitWarningsOnPlaceholders() somewhere here.
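
That is, at the end of _PG_init(), after all the DefineCustom*Variable()
calls:

    EmitWarningsOnPlaceholders("pg_prewarm");

so that misspelled pg_prewarm.* settings draw a warning.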




I also wonder whether we don't need to use prefetch to actually make
this fast enough.


I think it's pretty clear that this needs a bit more work and thus won't
be ready for v10.  Moved to the next CF.


- Andres



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Thu, Apr 6, 2017 at 4:12 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-13 18:45:00 +0530, Mithun Cy wrote:
>> I have implemented a similar logic now. The prewarm bgworker will
>> launch a sub-worker per database in the dump file. And, each
>> sub-worker will load its database block info. The sub-workers will be
>> launched only after previous one is finished. All of this will only
>> start if the database has reached a consistent state.
>
> Hm. For replay performance it'd possibly be good to start earlier,
> before reaching consistency.  Is there an issue starting earlier?

Thanks Andres for the detailed review. I will try to address the comments in my next post, but I thought it important to reply to the above comment first. Earlier patches used to start loading blocks before reaching a consistent state. Then Robert, while reviewing, found a basic flaw in my approach[1]: the function DropRelFileNodesAllBuffers does not expect others to load blocks concurrently while it is getting rid of buffered blocks. So we have to delay loading until the database reaches a consistent state, so that we can connect to each database and take a relation lock before loading any of its blocks.


[1] cannot load blocks without holding relation lock


--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
Thanks, Andres,

I have tried to fix all of your comments. One important change in
this patch: previously we used to read one block info structure at a
time and load it, but now we read all of them together, load them into
a DSA, and then distribute the block infos to the sub-workers to load
the corresponding blocks. With this, the portability issue I mentioned
earlier no longer exists, as we do not store any map tables within the
dump file.

On Thu, Apr 6, 2017 at 4:12 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-13 18:45:00 +0530, Mithun Cy wrote:
>> launched only after previous one is finished. All of this will only
>> start if the database has reached a consistent state.
>
> Hm. For replay performance it'd possibly be good to start earlier,
> before reaching consistency.  Is there an issue starting earlier?

Earlier patches used to start loading blocks before reaching a
consistent state. Then Robert, while reviewing, found a basic flaw in
my approach: the function DropRelFileNodesAllBuffers does not expect
others to load blocks concurrently while it is getting rid of buffered
blocks. So we have to delay loading until the database reaches a
consistent state, so that we can connect to each database and take a
relation lock before loading any of its blocks.

>> + * -- Automatically prewarm the shared buffer pool when server restarts.
>
> Don't think we ususally use -- here.

-- Fixed.


>> + *   Copyright (c) 2013-2017, PostgreSQL Global Development Group
>
> Hm, that's a bit of a weird date range.

-- Changed it to 2016-2017; is that right?


> The pg_prewarm.c in there looks like some search & replace gone awry.

-- sorry Fixed.


>> +#include "utils/rel.h"
>> +#include "port/atomics.h"
>
> I'd rather just sort these alphabetically.
> I think this should rather be in the initial header.

-- Fixed as suggested and have moved everything into a header file pg_prewarm.h.


> s/until there is a/until there is no/?

-- Fixed.

>
>> +/*
>> + * ============================================================================
>> + * ===========================        SIGNAL HANDLERS        ===========================
>> + * ============================================================================
>> + */
>
> Hm...

-- I have reverted those cosmetic changes now.

>
>> +static void sigtermHandler(SIGNAL_ARGS);
>> +static void sighupHandler(SIGNAL_ARGS);
>
> I don't think that's a casing we commonly use.  We mostly use CamelCase
> or underscore_case.
>

-- Fixed with CamelCase.


>> + *   Signal handler for SIGUSR1.
>> + */
>> +static void
>> +sigusr1Handler(SIGNAL_ARGS)

> Hm, what's this one for?

-- The prewarm sub-workers will notify with SIGUSR1 on their
startup/shutdown. Updated the comments.

>> +/*
>> + * Shared state information about the running autoprewarm bgworker.
>> + */
>> +typedef struct AutoPrewarmSharedState
>> +{
>> +     pg_atomic_uint32 current_task;          /* current tasks performed by
>> +                                                                              * autoprewarm workers. */
>> +} AutoPrewarmSharedState;
>
> Hm.  Why do we need atomics here?  I thought there's no concurrency?

There are 3 methods in autoprewarm:
A: the preloaded prewarm bgworker, which prewarms and then does periodic dumping;
B: a bgworker launched by launch_autoprewarm_dump(), which does the periodic dumping;
C: an immediate dump by backends.

We do not want 2 bgworkers started and working concurrently, and we do
not want to dump while the prewarm task is running. As you suggested, I
rewrote this as simple logic using a boolean variable.


>> +sort_cmp_func(const void *p, const void *q)
>> +{
>
> rename to blockinfo_cmp?

-- Fixed.

>
> Superflous parens (repeated a lot).

-- Fixed in all places.


> Hm.  Is it actually helpful to store the file as text?  That's commonly
> going to increase the size of the file quite considerably, no?

-- Having it in text could help readability, or if we want to
modify/adjust it as needed. I shall do a small experiment to check how
much we could save and produce a patch on top of this for the same.

>
> But what is this actually used for?  I thought Robert, in
> http://archives.postgresql.org/message-id/CA%2BTgmoa%3DUqCL2mR%2B9WTq05tB3Up-z4Sv2wkzkDxDwBP7Mj_2_w%40mail.gmail.com
> suggested storing the filenode in the dump, and then to use
> RelidByRelfilenode to get the corresponding relation?

-- Fixed as suggested; directly calling RelidByRelfilenode now.


>> +load_one_database(Datum main_arg)
>> +{
>
>> +     /* check if file exists and open file in read mode. */
>> +     snprintf(dump_file_path, sizeof(dump_file_path), "%s", AUTOPREWARM_FILE);
>> +     file = fopen(dump_file_path, PG_BINARY_R);
>> +     if (!file)
>> +             return;                                 /* No file to load. */
>
> Shouldn't this be an error case?  In which case is it ok for the file to
> be gone after we launched the worker?

-- Yes, sorry, a mistake. In the new patch I have changed the file map
to a DSA to distribute block infos to the sub-workers, so this code
will go away.
   Now the main worker will load all of the block infos into a dynamic
shared area (DSA), and each sub-worker will read the blocks that belong
to the database it is assigned to load.

>
>> +     /*
>> +      * It should be a block info belonging to a new database. Or else dump
>> +      * file is corrupted better to end the loading of bocks now.
>> +      */
>> +     if (loadblocktype != BLKTYPE_NEW_DATABASE)
>> +             goto end_load;                  /* should we raise a voice here? */
>
> Yes, this should raise an error.

-- Got rid of this check with the new code.

>> +                     case BLKTYPE_NEW_RELATION:
>> +
>> +                             /*
>> +                              * release lock on previous relation.
>> +                              */
>> +                             if (rel)
>> +                             {
>> +                                     relation_close(rel, AccessShareLock);
>> +                                     rel = NULL;
>> +                             }
>> +
>> +                             loadblocktype = BLKTYPE_NEW_RELATION;
>> +
>> +                             /*
>> +                              * lock new relation.
>> +                              */
>> +                             reloid = get_reloid(toload_block.filenode);
>> +
>> +                             if (!OidIsValid(reloid))
>> +                                     break;
>> +
>> +                             rel = try_relation_open(reloid, AccessShareLock);
>> +                             if (!rel)
>> +                                     break;
>> +                             RelationOpenSmgr(rel);
>
> Now I'm confused.  Your get_reloid used pg_relation_filenode() to map
> from relation oid to filenode - and then you're using it to lock the
> relation?  Something's wrong.

We take a shared lock on the relation so that DropRelFileNodesAllBuffers
cannot be called by another process while we load the relation's
blocks.

>> +                     case BLKTYPE_NEW_FORK:
>> +
>> +                             /*
>> +                              * check if fork exists and if block is within the range
>> +                              */
>> +                             loadblocktype = BLKTYPE_NEW_FORK;
>> +                             if (                    /* toload_block.forknum > InvalidForkNumber &&
>> +                                                              * toload_block.forknum <= MAX_FORKNUM && */
>> +                                     !smgrexists(rel->rd_smgr, toload_block.forknum))
>> +                                     break;
>
> Huh? What's with that commented out section of code?

-- smgrexists is not safe; it crashes if we pass an illegal fork
number, hence the precheck. I accidentally forgot to uncomment it,
sorry. Fixed now.


>> +                     case BLKTYPE_NEW_BLOCK:
>> +
>> +                             /* check if blocknum is valid and with in fork file size. */
>> +                             if (toload_block.blocknum >= nblocks)
>> +                             {
>> +                                     /* move to next forknum. */
>> +                                     loadblocktype = BLKTYPE_NEW_FORK;
>> +                                     break;
>> +                             }
>
> Hm. Why does the size of the underlying file allow us to skip to the
> next fork? Don't we have to read all the pending dump records?

-- Blocks beyond the file size do not exist, so I thought we could move
on to the next fork and load its blocks.

>> +                             buf = ReadBufferExtended(rel, toload_block.forknum,
>> +                                                                              toload_block.blocknum, RBM_NORMAL,
>> +                                                                              NULL);
>> +                             if (BufferIsValid(buf))
>> +                             {
>> +                                     ReleaseBuffer(buf);
>> +                             }
>> +
>> +                             loadblocktype = BLKTYPE_NEW_BLOCK;
>> +                             break;
>
> Hm. RBM_NORMAL will error out in a bunch of cases, is that ok?

For now, on error we restart the bgworker. Previously I used setjmp to
catch the error and continue, but per a review comment I have moved to
the new restart logic.

>> +     if (have_dbconnection)
>> +     {
>> +             SPI_finish();
>> +             PopActiveSnapshot();
>> +             CommitTransactionCommand();
>> +     }
>> +     return;
>> +}
>
> Are we really ok keeping open a transaction through all of this? That
> could potentially be quite long, no?  How about doing that on a per-file
> basis, or even moving to session locks alltogether?
>

-- We hold the transaction until we have loaded all the dumped blocks
of the database or the buffer pool becomes full; yes, that can take
long. Should I end and start a transaction for every fork file?


>
> This doesn't mention storing things as ascii, instead of binary...
>

-- Fixed same.


>> + *   NOTE : We store off_t inside file so the dump file will not be portable to
>> + *   be used across systems where sizeof off_t is different from each other.
>> + */
>
> Why are we using off_t? Shouldn't this just be BlockNumber?

-- Previously we maintained a special map into each database's block
infos, but now everything has moved to a DSA, so the off_t code is
removed.

>> +static uint32
>> +dump_now(void)
>> +{
>> +     static char dump_file_path[MAXPGPATH],
>
>> +
>> +     for (num_blocks = 0, i = 0; i < NBuffers; i++)
>> +     {
>> +             uint32          buf_state;
>> +
>> +             bufHdr = GetBufferDescriptor(i);
>> +
>> +             /* lock each buffer header before inspecting. */
>> +             buf_state = LockBufHdr(bufHdr);
>> +
>> +             if (buf_state & BM_TAG_VALID)
>> +             {
>> +                     block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
>> +                     block_info_array[num_blocks].spcNode = bufHdr->tag.rnode.spcNode;
>> +                     block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
>> +                     block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
>> +                     block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
>> +                     ++num_blocks;
>> +             }
>> +
>> +             UnlockBufHdr(bufHdr, buf_state);
>
>> +     }
>> +
>> +     /* sorting now only to avoid sorting while loading. */
>
> "sorting while loading"? You mean random accesses?

-- Yes fixed same.


>> +     pg_qsort(block_info_array, num_blocks, sizeof(BlockInfoRecord),
>> +                      sort_cmp_func);
>
>
>
>> +     snprintf(transient_dump_file_path, sizeof(dump_file_path),
>> +                      "%s.%d", AUTOPREWARM_FILE, MyProcPid);
>> +     file = fopen(transient_dump_file_path, "w");
>> +     if (file == NULL)
>> +             ereport(ERROR,
>> +                             (errcode_for_file_access(),
>> +                              errmsg("autoprewarm: could not open \"%s\": %m",
>> +                                             dump_file_path)));
>
> What if that file already exists? You're not truncating it.  Also, what
> if we error out in the middle of this? We'll leak an fd.  I think this
> needs to use OpenTransientFile etc.

Thanks, using OpenTransientFile now.


>> +     snprintf(dump_file_path, sizeof(dump_file_path),
>> +                      "%s", AUTOPREWARM_FILE);
>> +     ret = fprintf(file, "%020jd\n", (intmax_t) 0);
>> +     if (ret < 0)
>> +     {
>> +             fclose(file);
>> +             ereport(ERROR,
>> +                             (errcode_for_file_access(),
>> +                              errmsg("autoprewarm: error writing to \"%s\" : %m",
>> +                                             dump_file_path)));
>> +     }
>> +
>> +     database_map_table[num_db++] = ftello(file);
>> +
>> +     for (i = 0; i < num_blocks; i++)
>> +     {
>> +             if (i > 0 && block_info_array[i].database != prev_database)
>> +             {
>> +                     if (num_db == database_map_table_size)
>> +                     {
>> +                             database_map_table_size *= 2;   /* double and repalloc. */
>> +                             database_map_table =
>> +                                     (off_t *) repalloc(database_map_table,
>> +                                                                     sizeof(off_t) * database_map_table_size);
>> +                     }
>> +                     fflush(file);
>> +                     database_map_table[num_db++] = ftello(file);
>> +             }
>> +
>> +             ret = fprintf(file, "%u,%u,%u,%u,%u\n",
>> +                                       block_info_array[i].database,
>> +                                       block_info_array[i].spcNode,
>> +                                       block_info_array[i].filenode,
>> +                                       (uint32) block_info_array[i].forknum,
>> +                                       block_info_array[i].blocknum);
>> +             if (ret < 0)
>> +             {
>> +                     fclose(file);
>> +                     ereport(ERROR,
>> +                                     (errcode_for_file_access(),
>> +                                      errmsg("autoprewarm: error writing to \"%s\" : %m",
>> +                                                     dump_file_path)));
>> +             }
>> +
>> +             prev_database = block_info_array[i].database;
>> +     }
>
> I think we should check for interrupts somewhere in that (and the
> preceding) loop.

-- Now checking for any SIGHUP, and if we are asked not to dump
(pg_prewarm.dump_interval set to -1), I terminate the loop.

>
>> +/*
>> + * dump_block_info_periodically - at regular intervals, which is defined by GUC
>> + * dump_interval, dump the info of blocks which are present in buffer pool.
>> + */
>> +void
>> +dump_block_info_periodically()
>> +{
>
> Suggest adding void to the parameter list.

-- Fixed.


>> +                             last_dump_time = (pg_time_t) time(NULL);
>> +                             elapsed_secs = 0;
>> +                     }
>> +
>> +                     timeout = dump_interval - elapsed_secs;
>> +             }
>
> I suggest using GetCurrentTimestamp() and TimestampDifferenceExceeds()
> instead.

-- Fixed as suggested.


>> +             /*
>> +              * In case of a SIGHUP, just reload the configuration.
>> +              */
>> +             if (got_sighup)
>> +             {
>> +                     got_sighup = false;
>> +                     ProcessConfigFile(PGC_SIGHUP);
>> +             }
>> +     }
>> +
>> +     /* One last block meta info dump while postmaster shutdown. */
>> +     if (dump_interval != AT_PWARM_OFF)
>> +             dump_now();
>
> Uh, afaics we'll also do this if somebody SIGTERMed the process
> interactively?

-- Yes, we dump and exit on SIGTERM.


>> +     /* if not run as a preloaded library, nothing more to do here! */
>> +     if (!process_shared_preload_libraries_in_progress)
>> +             return;
>> +
>> +     DefineCustomStringVariable("pg_prewarm.default_database",
>> +                             "default database to connect if dump has not recorded same.",
>> +                                                        NULL,
>> +                                                        &default_database,
>> +                                                        "postgres",
>> +                                                        PGC_POSTMASTER,
>> +                                                        0,
>> +                                                        NULL,
>> +                                                        NULL,
>> +                                                        NULL);
>
> I don't think it's a good idea to make GUC registration depend on
> process_shared_preload_libraries_in_progress.

-- Fixed now.

> You should also use EmitWarningsOnPlaceholders() somewhere here.

-- Added same.

> I also wonder whether we don't need to use prefetch to actually make
> this fast enough.

-- I have not used prefetch, but I will re-assess and update the patch.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Tue, May 23, 2017 at 7:06 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> Thanks, Andres,
>
> I have tried to fix all of your comments.

There was a typo in the previous patch 07, where I used != instead of
==, and a mistake in the comments. I have corrected both now.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Wed, May 24, 2017 at 6:28 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Tue, May 23, 2017 at 7:06 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>> Thanks, Andres,
>>
>> I have tried to fix all of your comments.
>
> There was a typo in the previous patch 07, where I used != instead of
> ==, and a mistake in the comments. I have corrected both now.

+/*
+ * autoprewarm :
+ *
+ * What is it?
+ * ===========
+ * A bgworker which automatically records information about blocks which were
+ * present in buffer pool before server shutdown and then prewarm the buffer
+ * pool upon server restart with those blocks.
+ *
+ * How does it work?
+ * =================
+ * When the shared library "pg_prewarm" is preloaded, a
+ * bgworker "autoprewarm" is launched immediately after the server has reached
+ * consistent state. The bgworker will start loading blocks recorded in the
+ * format BlockInfoRecord
+ * <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in
+ * $PGDATA/AUTOPREWARM_FILE, until there is no free buffer left in the buffer
+ * pool. This way we do not replace any new blocks which were loaded either by
+ * the recovery process or the querying clients.
+ *
+ * Once the "autoprewarm" bgworker has completed its prewarm task, it will
+ * start a new task to periodically dump the BlockInfoRecords related to blocks
+ * which are currently in shared buffer pool. Upon next server restart, the
+ * bgworker will prewarm the buffer pool by loading those blocks. The GUC
+ * pg_prewarm.dump_interval will control the dumping activity of the bgworker.
+ */

Make this part of the file header comment.  Also, add an enabling GUC.
The default can be true, but it should be possible to preload the
library so that the SQL functions are available without a dynamic
library load without requiring you to get the auto-prewarm behavior.
I suggest pg_prewarm.autoprewarm = true / false.
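
A sketch of such a GUC (the name and the PGC_POSTMASTER context are my
suggestion):

    static bool autoprewarm = true;     /* start the autoprewarm worker? */

    DefineCustomBoolVariable("pg_prewarm.autoprewarm",
                             "Starts the autoprewarm worker.",
                             NULL,
                             &autoprewarm,
                             true,
                             PGC_POSTMASTER,
                             0,
                             NULL,
                             NULL,
                             NULL);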

Your SigtermHandler and SighupHandler routines are still capitalized
in a way that's not very consistent with what we do elsewhere.  I
think all of our other signal handlers have names_like_this() not
NamesLikeThis().

+ * ==============    types and variables used by autoprewam      =============

Spelling.

+ * Meta-data of each persistent block which is dumped and used to load.

Metadata

+typedef struct BlockInfoRecord
+{
+    Oid            database;        /* database */
+    Oid            spcNode;        /* tablespace */
+    Oid            filenode;        /* relation's filenode. */
+    ForkNumber    forknum;        /* fork number */
+    BlockNumber blocknum;        /* block number */
+} BlockInfoRecord;

spcNode is capitalized differently from all of the other members.

+    LWLock       *lock;            /* protects SharedState */

Just declare this as LWLock lock, and initialize it using
LWLockInitialize.  The way you're doing it is more complicated.

+static dsa_area *AutoPrewarmDSA = NULL;

DSA seems like more than you need here.  There's only one allocation
being performed.  I think it would be simpler and less error-prone to
use DSM directly.  I don't even think you need a shm_toc.  You could
just do:

seg = dsm_create(segsize, 0);
blkinfo = dsm_segment_address(seg);

Then pass dsm_segment_handle(seg) to the worker using bgw_main_arg or
bgw_extra, and have it call dsm_attach.  An advantage of this approach
is that you only allocate the memory you actually need, whereas DSA
will allocate more, expecting that you might do further allocations.

+    pg_qsort(block_info_array, num_blocks, sizeof(BlockInfoRecord),
+             blockinfo_cmp);

I think it would make more sense to sort the array on the load side,
in the autoprewarm worker, rather than when dumping.  First, many
dumps will never be reloaded, so there's no real reason to waste time
sorting them.  Second, the way you've got the code right now, it
relies heavily on the assumption that the dump file will be sorted,
but that doesn't seem like a necessary assumption.  We don't really
expect users to hand-edit the dump files, but if they do, it'd be nice
if it didn't randomly break things unnecessarily.

+                 errmsg("autoprewarm: could not open \"%s\": %m",
+                        dump_file_path)));

Use standard error message wordings!  Don't create new translatable
messages by adding "autoprewarm" to error messages.  There are
multiple instances of this throughout the patch.

+    snprintf(dump_file_path, sizeof(dump_file_path),
+             "%s", AUTOPREWARM_FILE);

This is completely pointless.  You can get rid of the dump_file_path
variable and just use AUTOPREWARM_FILE wherever you would have used
dump_file_path.  It's just making a copy of a string you already have.
Also, this code interrupts the flow of the surrounding logic in a
weird way; even if we needed it, it would make more sense to put it up
higher, where we construct the temporary path, or down lower, where
the value is actually needed.

+    SPI_connect();
+    PushActiveSnapshot(GetTransactionSnapshot());

It is really unclear why you need this, since the code does not use
SPI for anything, or do anything that needs a snapshot.

+            goto end_load;

I think you should try to rewrite this function so that it doesn't
need "goto".  I also think in general that this function could be
written in a much more direct way by getting rid of the switch and the
BLKTYPE_* constants altogether.  connect_to_db() can only happen once,
so do that at the beginning.  After that, the logic can look roughly like
this:

BlockInfoRecord *old_blk = NULL;

while (!got_sigterm && pos < maxpos && have_free_buffer())
{
    BlockInfoRecord *blk = block_info[pos];

    /* Quit if we've reached records for another database. */
    if (old_blk != NULL && old_blk->dbOid != blk->dbOid)
        break;

    /*
     * When we reach a new relation, close the old one.  Note, however,
     * that the previous try_relation_open may have failed, in which case
     * rel will be NULL.
     */
    if (old_blk != NULL && old_blk->relOid != blk->relOid && rel != NULL)
    {
        relation_close(rel, AccessShareLock);
        rel = NULL;
    }

    /*
     * Try to open each new relation, but only once, when we first
     * encounter it.  If it's been dropped, skip the associated blocks.
     */
    if (old_blk == NULL || old_blk->relOid != blk->relOid)
    {
        Assert(rel == NULL);
        rel = try_relation_open(blk->relOid, AccessShareLock);
    }
    if (!rel)
    {
        ++pos;
        continue;
    }

    /* Once per fork, check for fork existence and size. */
    if (old_blk == NULL || old_blk->forkNumber != blk->forkNumber)
    {
        RelationOpenSmgr(rel);
        if (smgrexists(rel->rd_smgr, blk->forkNumber))
            nblocks = RelationGetNumberOfBlocksInFork(...);
        else
            nblocks = 0;
    }

    /* Prewarm buffer. */
    buf = ReadBufferExtended(...);
    ...
    ++pos;
}

+                 errhint("Kill all remaining database processes and restart"
+                         " the database.")));

Don't break strings across lines.  Just put it all on one (long) line.

I don't think you should really need default_database. It seems like
it should be possible to jigger things so that those blocks are loaded
together with some other database (say, let the first worker do it).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
Thanks Robert,

On Wed, May 24, 2017 at 5:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> + *
> + * Once the "autoprewarm" bgworker has completed its prewarm task, it will
> + * start a new task to periodically dump the BlockInfoRecords related to blocks
> + * which are currently in shared buffer pool. Upon next server restart, the
> + * bgworker will prewarm the buffer pool by loading those blocks. The GUC
> + * pg_prewarm.dump_interval will control the dumping activity of the bgworker.
> + */
>
> Make this part of the file header comment.

-- Thanks, I have moved it into the file header comment.

> Also, add an enabling GUC.
> The default can be true, but it should be possible to preload the
> library so that the SQL functions are available without a dynamic
> library load without requiring you to get the auto-prewarm behavior.
> I suggest pg_prewarm.autoprewarm = true / false.

-- Thanks, I have added a boolean GUC pg_prewarm.autoprewarm with
default true. I have made it a PGC_POSTMASTER-level variable, since the
intention here is to avoid starting the autoprewarm worker. Have I
missed anything? Currently, the SQL-callable functions
autoprewarm_dump_now() and launch_autoprewarm_dump() are only available
after CREATE EXTENSION pg_prewarm; will this change now?
There is another GUC setting, pg_prewarm.dump_interval; if it is -1 we
stop the running autoprewarm worker. Should we combine these 2 entities
into one, such that it controls the state of the autoprewarm worker?

> Your SigtermHandler and SighupHandler routines are still capitalized
> in a way that's not very consistent with what we do elsewhere.  I
> think all of our other signal handlers have names_like_this() not
> NamesLikeThis().
>

-- Handler functions are renamed, for example apw_sigterm_handler,
similar to autovacuum.c.

> + * ==============    types and variables used by autoprewam      =============
>
> Spelling.

-- Fixed, Sorry.


> + * Meta-data of each persistent block which is dumped and used to load.
>
> Metadata

-- Fixed.


> +typedef struct BlockInfoRecord
> +{
> +    Oid            database;        /* database */
> +    Oid            spcNode;        /* tablespace */
> +    Oid            filenode;        /* relation's filenode. */
> +    ForkNumber    forknum;        /* fork number */
> +    BlockNumber blocknum;        /* block number */
> +} BlockInfoRecord;
>
> spcNode is capitalized differently from all of the other members.

-- renamed from spcNode to spcnode.
>
> +    LWLock       *lock;            /* protects SharedState */
>
> Just declare this as LWLock lock, and initialize it using
> LWLockInitialize.  The way you're doing it is more complicated.

-- Fixed as suggested
LWLockInitialize(&state->lock, LWLockNewTrancheId());

> +static dsa_area *AutoPrewarmDSA = NULL;
>
> DSA seems like more than you need here.  There's only one allocation
> being performed.  I think it would be simpler and less error-prone to
> use DSM directly.  I don't even think you need a shm_toc.  You could
> just do:
>
> seg = dsm_create(segsize, 0);
> blkinfo = dsm_segment_address(seg);
> Then pass dsm_segment_handle(seg) to the worker using bgw_main_arg or
> bgw_extra, and have it call dsm_attach.  An advantage of this approach
> is that you only allocate the memory you actually need, whereas DSA
> will allocate more, expecting that you might do further allocations.

- Thanks, fixed. We pass the following arguments to sub-workers through bgw_extra:
/*
 * The block_infos allocated to each sub-worker to do prewarming.
 */
typedef struct prewarm_elem
{
    dsm_handle  block_info_handle;      /* handle to dsm seg of block_infos */
    Oid         database;       /* database to connect to and load */
    uint32      start_pos;      /* start position within block_infos from
                                 * which the sub-worker starts prewarming
                                 * blocks. */
    uint32      end_of_blockinfos;      /* end of block_infos in dsm */
} prewarm_elem;

to distribute the prewarm work among the sub-workers.

>
> +    pg_qsort(block_info_array, num_blocks, sizeof(BlockInfoRecord),
> +             blockinfo_cmp);
>
> I think it would make more sense to sort the array on the load side,
> in the autoprewarm worker, rather than when dumping.

Fixed. Now sorting block_infos just before loading the blocks.

> +                 errmsg("autoprewarm: could not open \"%s\": %m",
> +                        dump_file_path)));
>
> Use standard error message wordings!  Don't create new translatable
> messages by adding "autoprewarm" to error messages.  There are
> multiple instances of this throughout the patch.

-- Removed the "autoprewarm" prefix from error messages in all places.


> +    snprintf(dump_file_path, sizeof(dump_file_path),
> +             "%s", AUTOPREWARM_FILE);
>
> This is completely pointless.  You can get rid of the dump_file_path
-- dump_file_path is removed.


>
> +    SPI_connect();
> +    PushActiveSnapshot(GetTransactionSnapshot());
>
> It is really unclear why you need this, since the code does not use
> SPI for anything, or do anything that needs a snapshot.

-- Sorry, removed it now.

> +            goto end_load;
>
> I think you should try to rewrite this function so that it doesn't
> need "goto".  I also think in general that this function could be
> written in a much more direct way by getting rid of the switch and the
> BLKTYPE_* constants altogether.  connect_to_db() can only happen once,
> so do that the beginning.  After that, the logic can look roughly like
> this:

-- Fixed using the exact code framework you suggested.

>
> +                 errhint("Kill all remaining database processes and restart"
> +                         " the database.")));
>
> Don't break strings across lines.  Just put it all on one (long) line.

-- Fixed; I had tried to trim the string that was going beyond 80
chars, but have now fixed it as you suggested.

> I don't think you should really need default_database. It seems like
> it should be possible to jigger things so that those blocks are loaded
> together with some other database (say, let the first worker do it).

-- Fixed; block_infos with database 0 will be combined with the next
database's block_info load. One problem I have kept open: what if there
are only block_infos with database 0 in the dump file? With
default_database we could have handled that case; after removing it, I
am ignoring block_infos of database 0 in such cases. Is that okay?


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Konstantin Knizhnik
Date:
On 27.10.2016 14:39, Mithun Cy wrote:
> # pg_autoprewarm.
>
> This a PostgreSQL contrib module which automatically dump all of the 
> blocknums
> present in buffer pool at the time of server shutdown(smart and fast 
> mode only,
> to be enhanced to dump at regular interval.) and load these blocks 
> when server restarts.
>
> Design:
> ------
> We have created a BG Worker Auto Pre-warmer which during shutdown 
> dumps all the
> blocknum in buffer pool in sorted order.
> Format of each entry is 
> <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>.
> Auto Pre-warmer is started as soon as the postmaster is started we do 
> not wait
> for recovery to finish and database to reach a consistent state. If 
> there is a
> "dump_file" to load we start loading each block entry to buffer pool until
> there is a free buffer. This way we do not replace any new blocks 
> which was
> loaded either by recovery process or querying clients. Then it waits 
> until it receives
> SIGTERM to dump the block information in buffer pool.
>
> HOW TO USE:
> -----------
> Build and add the pg_autoprewarm to shared_preload_libraries. Auto 
> Pre-warmer
> process automatically do dumping of buffer pool's block info and load 
> them when
> restarted.
>
> TO DO:
> ------
> Add functionality to dump based on timer at regular interval.
> And some cleanups.

I wonder if you have considered parallel prewarming of a table?
Right now, whether with pg_prewarm or with pg_autoprewarm, preloading a
table's data is performed by one backend.
That certainly makes sense if there is just one HDD and we want to
minimize the impact of pg_prewarm on normal DBMS activity.
But sometimes we need to load data into memory as soon as possible.
Modern systems have a large number of CPU cores, and RAID devices make
it possible to load data efficiently in parallel.

I asked this question in the context of my CFS (compressed file system)
for Postgres. The customer's complaint was that there are 64 cores on
his system, but when he is building an index, decompression of heap
data is performed by only one core. This is why I thought about
prewarm... (parallel index construction is a separate story...)

pg_prewarm makes it possible to specify a range of blocks, so, in
principle, it is possible to manually preload a table in parallel by
spawning pg_prewarm with different subranges in several backends. But
that is definitely not a user-friendly approach.
And as far as I understand, pg_autoprewarm has all the necessary
infrastructure to do a parallel load. We just need to spawn more than
one background worker and specify a separate block range for each
worker.

Do you think such functionality (parallel autoprewarm) would be useful
and could be easily added?

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Tue, May 30, 2017 at 10:16 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> Thanks Robert,

Sorry, there was one more mistake (a typo) in dump_now(): instead of
using pfree I used free. Corrected in the new patch v10.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Tue, May 30, 2017 at 12:36 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> On 27.10.2016 14:39, Mithun Cy wrote:
> And as far as I understand pg_autoprewarm has all necessary infrastructure
> to do parallel load. We just need to spawn more than one background worker
> and specify
> separate block range for each worker.
>
> Do you think that such functionality (parallel autoprewarm) can be useful
> and be easily added?

I have not paid any attention to making autoprewarm parallel. But given
the current state of the code, making the sub-workers load blocks in
parallel should be possible, and I think it could be added easily. We
would probably need a configurable parameter to restrict the maximum
number of parallel prewarm sub-workers. Since I have not thought much
about this, I might not be aware of all the difficulties around it.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Tue, May 30, 2017 at 12:36 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> On 27.10.2016 14:39, Mithun Cy wrote:
>
>
> I wonder if you considered parallel prewarming of a table?
> Right now either with pg_prewarm, either with pg_autoprewarm, preloading
> table's data is performed by one backend.
> It certainly makes sense if there is just one HDD and we want to minimize
> impact of pg_prewarm on normal DBMS activity.
> But sometimes we need to load data in memory as soon as possible. And modern
> systems has larger number of CPU cores and
> RAID devices make it possible to efficiently load data in parallel.
>
> I have asked this question in context of my CFS (compressed file system) for
> Postgres. The customer's complaint was that there are 64 cores at his system
> but when
> he is building index, decompression of heap data is performed by only one
> core. This is why I thought about prewarm... (parallel index construction is
> separate story...)
>
> pg_prewarm makes is possible to specify range of blocks, so, in principle,
> it is possible to manually preload table in parallel, by spawining
> pg_prewarm
> with different subranges in several backends. But it is definitely not user
> friendly approach.
> And as far as I understand pg_autoprewarm has all necessary infrastructure
> to do parallel load. We just need to spawn more than one background worker
> and specify
> separate block range for each worker.
>
> Do you think that such functionality (parallel autoprewarm) can be useful
> and be easily added?
>

I think parallel load functionality can be useful in a few cases, such
as when the system has multiple I/O channels.  Doing it in parallel
might need some additional infrastructure to manage the workers,
depending on how we decide to parallelize: whether we allow each worker
to pick one block and load it, or specify a range of blocks for each
worker.  Each way has its own pros and cons.  It seems that even if we
want to add such an option to the *prewarm functionality, it should be
added as a separate patch, as it has its own set of problems that need
to be solved.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Tue, May 30, 2017 at 8:15 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Tue, May 30, 2017 at 10:16 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>> Thanks Robert,
>
> Sorry, there was one more mistake (a typo) in dump_now(): instead of
> using pfree I used free. Corrected in the new patch v10.

+ *        contrib/autoprewarm.c

Wrong.

+    Oid            database;        /* database */
+    Oid            spcnode;        /* tablespace */
+    Oid            filenode;        /* relation's filenode. */
+    ForkNumber    forknum;        /* fork number */
+    BlockNumber blocknum;        /* block number */

Why spcnode rather than tablespace?  Also, the third line has a
period, but not the others.  I think you could just drop these
comments; they basically just duplicate the structure names, except
for spcnode which doesn't but probably should.

+    bool        can_do_prewarm; /* if set can't do prewarm task. */

The comment and the name of the variable imply opposite meanings.

+/*
+ * detach_blkinfos - on_shm_exit detach the dsm allocated for blockinfos.
+ */
+static void
+detach_blkinfos(int code, Datum arg)
+{
+    if (seg != NULL)
+        dsm_detach(seg);
+}

I assume you don't really need this.  Presumably process exit is going
to detach the DSM anyway.

+    if (seg == NULL)
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("unable to map dynamic shared memory segment")));

Let's use the wording from the message in parallel.c rather than this
wording.  Actually, we should probably (as a separate patch) change
test_shm_mq's worker.c to use the parallel.c wording also.

+    SetCurrentStatementStartTimestamp();

Why do we need this?

+    StartTransactionCommand();

Do we need a transaction?  If so, how about starting a new transaction
for each relation instead of using a single one for the entire
prewarm?

+    if (status == BGWH_STOPPED)
+        return;
+
+    if (status == BGWH_POSTMASTER_DIED)
+    {
+        ereport(ERROR,
+                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+              errmsg("cannot start bgworker autoprewarm without postmaster"),
+                 errhint("Kill all remaining database processes and restart"
+                         " the database.")));
+    }
+
+    Assert(0);

Instead of writing it this way, remove the first "if" block and change
the second one to Assert(status == BGWH_STOPPED).  Also, I'd ditch the
curly braces in this case.
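
That is:

    if (status == BGWH_POSTMASTER_DIED)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
                 errmsg("cannot start bgworker autoprewarm without postmaster"),
                 errhint("Kill all remaining database processes and restart the database.")));

    Assert(status == BGWH_STOPPED);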

+    file = fopen(AUTOPREWARM_FILE, PG_BINARY_R);

Use AllocateFile() instead.  See the comments for that function and at
the top of fd.c.  Then you don't need to worry about leaking the file
handle on an error, so you can remove the fclose() before ereport()
stuff -- which is incomplete anyway, because you could fail e.g.
inside dsm_create().

+                 errmsg("Error reading num of elements in \"%s\" for"
+                        " autoprewarm : %m", AUTOPREWARM_FILE)));

As I said in a previous review, please do NOT split error messages
across lines like this.  Also, this error message is nothing close to
PostgreSQL style. Please read
https://www.postgresql.org/docs/devel/static/error-style-guide.html
and learn to follow all those guidelines written therein.  I see at
least 3 separate problems with this message.

+    num_elements = i;

I'd do something like if (i != num_elements) elog(ERROR, "autoprewarm
block dump has %d entries but expected %d", i, num_elements);  It
seems OK for this to be elog() rather than ereport() because the file
should never be corrupt unless the user has cheated by hand-editing
it.

I think you can get rid of next_db_pos altogether, and this
prewarm_elem thing too.  Design sketch:

1. Move all the stuff that's in prewarm_elem into
AutoPrewarmSharedState.  Rename start_pos to prewarm_start_idx and
end_of_blockinfos to prewarm_stop_idx.

2. Instead of computing all of the database ranges first and then
launching workers, do it all in one loop.  So figure out where the
records for the current database end and set prewarm_start_idx and
prewarm_stop_idx appropriately.  Launch a worker.  When the worker
terminates, set prewarm_start_idx = prewarm_stop_idx and advance
prewarm_stop_idx to the end of the next database's records.

This saves having to allocate memory for the next_db_pos array, and it
also avoids this crock:

+    memcpy(&pelem, MyBgworkerEntry->bgw_extra, sizeof(prewarm_elem));

The reason that's bad is because it only works so long as bgw_extra is
large enough to hold prewarm_elem.  If prewarm_elem grows or bgw_extra
shrinks, this turns into a buffer overrun.
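
Putting those two steps together, the main loop could look roughly
like this (untested; the array and helper names are placeholders):

    int        idx = 0;

    while (idx < num_elements)
    {
        Oid        dboid = block_info[idx].database;
        int        start = idx;

        /* Find where this database's records end; the array is sorted. */
        while (idx < num_elements && block_info[idx].database == dboid)
            idx++;

        LWLockAcquire(&state->lock, LW_EXCLUSIVE);
        state->database = dboid;
        state->prewarm_start_idx = start;
        state->prewarm_stop_idx = idx;
        LWLockRelease(&state->lock);

        /* Placeholder helper: launch a per-database worker and wait. */
        launch_per_database_worker_and_wait();
    }

Records for global objects (database 0) would need to be folded into
the following database's range, as discussed below.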

I would use AUTOPREWARM_FILE ".tmp" rather than a name incorporating
the PID for the temporary file.  Otherwise, you might leave many
temporary files behind under different names.  If you use the same
name every time, you'll never have more than one, and the next
successful dump will end up getting rid of it along the way.

+            pfree(block_info_array);
+            CloseTransientFile(fd);
+            unlink(transient_dump_file_path);
+            ereport(ERROR,
+                    (errcode_for_file_access(),
+                     errmsg("error writing to \"%s\" : %m",
+                            AUTOPREWARM_FILE)));

Again, this is NOT a standard error message text.  It occurs in zero
other places in the source tree.  You are not the first person to need
an error message for a failed write to a file; please look at what the
previous authors did.  Also, the pfree() before report is not needed;
isn't the whole process going to terminate? Also, you can't really use
errcode_for_file_access() here, because you've meanwhile done
CloseTransientFile() and unlink(), which will have clobbered errno.

+    ereport(LOG, (errmsg("saved metadata info of %d blocks", num_blocks)));

Not project style for ereport().  Line break after the first comma.
Similarly elsewhere.

+ *    dump_now - the main routine which goes through each buffer
header of buffer
+ *    pool and dumps their meta data. We Sort these data and then dump them.
+ *    Sorting is necessary as it facilitates sequential read during load.

This is no longer true, because you moved the sort to the load side.
It's also not capitalized properly.

Discussions of the format of the autoprewarm dump file involve
inexplicably varying number of < and > symbols:

+ *        <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in
+ * <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall call it
+        buflen = sprintf(buf, "%u,%u,%u,%u,%u\n",

+#ifndef __AUTOPREWARM_H__
+#define __AUTOPREWARM_H__

We don't use double-underscored names for header files.  Please
consult existing header files for the appropriate style.  Also, why
does this file exist at all, instead of just including them in the .c
file?  The point of a header is for things that need to be included
by multiple .c files, but there's no such need here.

+             * load. If there are no other block infos than the global objects
+             * we silently ignore them. Should I throw error?

Silently ignoring them seems fine.  Throwing an error doesn't seem
like it would improve things.

+        /*
+         * Register a sub-worker to load new database's block. Wait until the
+         * sub-worker finish its job before launching next sub-worker.
+         */
+        launch_prewarm_subworker(&pelem);

The function name implies that it launches the worker, but the comment
implies that it also waits for it to terminate.  Functions should be
named in a way that matches what they do.

I feel like the get_autoprewarm_task() function is introducing fairly
baroque logic for something that really ought to be more simple.  All
autoprewarm_main() really needs to do is:

if (!state->autoprewarm_done)   autoprewarm();
dump_block_info_periodically();

The locking in autoprewarm_dump_now() is a bit tricky.  There are two
trouble cases.  One is that we try to rename() our new dump file on
top of the existing one while a background worker is still using it to
perform an autoprewarm.  The other is that we try to write a new
temporary dump file while some other process is trying to do the same
thing.  I think we should fix this by storing a PID in
AutoPrewarmSharedState; a process which wants to perform a dump or an
autoprewarm must change the PID from 0 to their own PID, and change it
back to 0 on successful completion or error exit.  If a process goes
to perform an immediate dump and finds a non-zero value already there,
it just does ereport(ERROR, ...), including the PID of the other process in the
message (e.g. "unable to perform block dump because dump file is being
used by PID %d").  In a background worker, if we go to dump and find
the file in use, log a message (e.g. "skipping block dump because it
is already being performed by PID %d", "skipping prewarm because block
dump file is being rewritten by PID %d").
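
An untested sketch of the claim/release protocol, using the field
names proposed below (resetting the PID on error exit would need
something like PG_ENSURE_ERROR_CLEANUP):

    LWLockAcquire(&state->lock, LW_EXCLUSIVE);
    if (state->pid_using_dumpfile != 0)
    {
        pid_t    other_pid = state->pid_using_dumpfile;

        LWLockRelease(&state->lock);
        ereport(ERROR,
                (errmsg("unable to perform block dump because dump file is being used by PID %d",
                        (int) other_pid)));
    }
    state->pid_using_dumpfile = MyProcPid;
    LWLockRelease(&state->lock);

    /* ... write the temporary file and rename() it into place ... */

    LWLockAcquire(&state->lock, LW_EXCLUSIVE);
    state->pid_using_dumpfile = 0;
    LWLockRelease(&state->lock);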

I also think we should change is_bgworker_running to a PID, so that if
we try to launch a new worker we can report something like
"autoprewarm worker is already running under PID %d".

So putting that all together, I suppose AutoPrewarmSharedState should
end up looking like this:

LWLock lock; /* mutual exclusion */
pid_t bgworker_pid;  /* for main bgworker */
pid_t pid_using_dumpfile; /* for autoprewarm or block dump */

/* following items are for communication with per-database worker */
dsm_handle block_info_handle;
Oid database;
int prewarm_start_idx;
int prewarm_stop_idx;

I suggest going through and changing "subworker" to "per-database
worker" throughout.

BTW, have you tested how long this takes to run with, say, shared_buffers = 8GB?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Wed, May 31, 2017 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> + *        contrib/autoprewarm.c
>
> Wrong.

-- Oops, sorry, fixed.

> +    Oid            database;        /* database */
> +    Oid            spcnode;        /* tablespace */
> +    Oid            filenode;        /* relation's filenode. */
> +    ForkNumber    forknum;        /* fork number */
> +    BlockNumber blocknum;        /* block number */
>
> Why spcnode rather than tablespace?  Also, the third line has a
> period, but not the others.  I think you could just drop these
> comments; they basically just duplicate the structure names, except
> for spcnode which doesn't but probably should.

-- Dropped comments and changed spcnode to tablespace.

> +    bool        can_do_prewarm; /* if set can't do prewarm task. */
>
> The comment and the name of the variable imply opposite meanings.

-- Sorry, a typo. This variable has now been removed, as you
suggested, in favor of the new variables in AutoPrewarmSharedState.

> +/*
> + * detach_blkinfos - on_shm_exit detach the dsm allocated for blockinfos.
> + */
> I assume you don't really need this.  Presumably process exit is going
> to detach the DSM anyway.

-- Yes, since process exit will detach the DSM, I have removed that
function.

> +    if (seg == NULL)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                 errmsg("unable to map dynamic shared memory segment")));
>
> Let's use the wording from the message in parallel.c rather than this
> wording.  Actually, we should probably (as a separate patch) change
> test_shm_mq's worker.c to use the parallel.c wording also.

-- I have corrected the message to "could not map dynamic shared
memory segment", as in parallel.c.

> +    SetCurrentStatementStartTimestamp();
>
> Why do we need this?

-- Removed. Sorry, I forgot to remove it when I removed the SPI connection code.

> +    StartTransactionCommand();
>
> Do we need a transaction?  If so, how about starting a new transaction
> for each relation instead of using a single one for the entire
> prewarm?

-- We do relation_open, hence we need a transaction. As suggested, we
now start a new transaction for every new relation.

> +    if (status == BGWH_STOPPED)
> +        return;
> +
> +    if (status == BGWH_POSTMASTER_DIED)
> +    {
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
> +              errmsg("cannot start bgworker autoprewarm without postmaster"),
> +                 errhint("Kill all remaining database processes and restart"
> +                         " the database.")));
> +    }
> +
> +    Assert(0);
>
> Instead of writing it this way, remove the first "if" block and change
> the second one to Assert(status == BGWH_STOPPED).  Also, I'd ditch the
> curly braces in this case.

-- Fixed as suggested.

>
> +    file = fopen(AUTOPREWARM_FILE, PG_BINARY_R);
>
> Use AllocateFile() instead.  See the comments for that function and at
> the top of fd.c.  Then you don't need to worry about leaking the file
> handle on an error, so you can remove the fclose() before ereport()
> stuff -- which is incomplete anyway, because you could fail e.g.
> inside dsm_create().

-- Using AllocateFile now.

>
> +                 errmsg("Error reading num of elements in \"%s\" for"
> +                        " autoprewarm : %m", AUTOPREWARM_FILE)));
>
> As I said in a previous review, please do NOT split error messages
> across lines like this.  Also, this error message is nothing close to
> PostgreSQL style. Please read
> https://www.postgresql.org/docs/devel/static/error-style-guide.html
> and learn to follow all those guidelines written therein.  I see at
> least 3 separate problems with this message.

-- Thanks, I have tried to fix it now.

> +    num_elements = i;
>
> I'd do something like if (i != num_elements) elog(ERROR, "autoprewarm
> block dump has %d entries but expected %d", i, num_elements);  It
> seems OK for this to be elog() rather than ereport() because the file
> should never be corrupt unless the user has cheated by hand-editing
> it.

-- Fixed as suggested. Now reported with elog(ERROR).

> I think you can get rid of next_db_pos altogether, and this
> prewarm_elem thing too.  Design sketch:
>
> 1. Move all the stuff that's in prewarm_elem into
> AutoPrewarmSharedState.  Rename start_pos to prewarm_start_idx and
> end_of_blockinfos to prewarm_stop_idx.
>
> 2. Instead of computing all of the database ranges first and then
> launching workers, do it all in one loop.  So figure out where the
> records for the current database end and set prewarm_start_idx and
> prewarm_stop_idx appropriately.  Launch a worker.  When the worker
> terminates, set prewarm_start_idx = prewarm_stop_idx and advance
> prewarm_stop_idx to the end of the next database's records.
>
> This saves having to allocate memory for the next_db_pos array, and it
> also avoids this crock:
>
> +    memcpy(&pelem, MyBgworkerEntry->bgw_extra, sizeof(prewarm_elem));
>

-- Fixed as suggested.

> The reason that's bad is because it only works so long as bgw_extra is
> large enough to hold prewarm_elem.  If prewarm_elem grows or bgw_extra
> shrinks, this turns into a buffer overrun.

-- Passing the prewarm info through bgw_extra helped restrict the
scope and lifetime of prewarm_elem to the prewarm task alone. Moving
it to shared memory makes it global even though it is not needed once
the prewarm task is finished. But as there are other disadvantages to
using bgw_extra, I have now implemented it as you suggested.

> I would use AUTOPREWARM_FILE ".tmp" rather than a name incorporating
> the PID for the temporary file.  Otherwise, you might leave many
> temporary files behind under different names.  If you use the same
> name every time, you'll never have more than one, and the next
> successful dump will end up getting rid of it along the way.

-- Fixed as suggested. Previously the PID was used so that the dump
worker and an immediate dump could run concurrently, as they would
write to two different files. With the new scheme of registering the
PID in shared memory before file access, I think that problem is
addressed.

>
> +            pfree(block_info_array);
> +            CloseTransientFile(fd);
> +            unlink(transient_dump_file_path);
> +            ereport(ERROR,
> +                    (errcode_for_file_access(),
> +                     errmsg("error writing to \"%s\" : %m",
> +                            AUTOPREWARM_FILE)));
>
> Again, this is NOT a standard error message text.  It occurs in zero
> other places in the source tree.  You are not the first person to need
> an error message for a failed write to a file; please look at what the
> previous authors did.  Also, the pfree() before report is not needed;
> isn't the whole process going to terminate? Also, you can't really use
> errcode_for_file_access() here, because you've meanwhile done
> CloseTransientFile() and unlink(), which will have clobbered errno.

-- Removed the pfree, and saved errno before CloseTransientFile() and unlink().

> +    ereport(LOG, (errmsg("saved metadata info of %d blocks", num_blocks)));
>
> Not project style for ereport().  Line break after the first comma.
> Similarly elsewhere.

-- Tried to fix the same.

> + *    dump_now - the main routine which goes through each buffer
> header of buffer
> + *    pool and dumps their meta data. We Sort these data and then dump them.
> + *    Sorting is necessary as it facilitates sequential read during load.
>
> This is no longer true, because you moved the sort to the load side.
> It's also not capitalized properly.

-- Sorry, removed now.

> Discussions of the format of the autoprewarm dump file involve
> inexplicably varying number of < and > symbols:
>
> + *        <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in
> + * <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall call it
> +        buflen = sprintf(buf, "%u,%u,%u,%u,%u\n",

-- Sorry, fixed now; the block info format no longer uses such (</>)
delimiters anywhere.

> +#ifndef __AUTOPREWARM_H__
> +#define __AUTOPREWARM_H__
>
> We don't use double-underscored names for header files.  Please
> consult existing header files for the appropriate style.  Also, why
> does this file exist at all, instead of just including them in the .c
> file?  The point of a header is for things that need to be included
> by multiple .c files, but there's no such need here.

-- This was done to fix one of the previous review comments. I have
moved them back to the .c file.

> +             * load. If there are no other block infos than the global objects
> +             * we silently ignore them. Should I throw error?
>
> Silently ignoring them seems fine.  Throwing an error doesn't seem
> like it would improve things.

-- Okay, thanks.

>
> +        /*
> +         * Register a sub-worker to load new database's block. Wait until the
> +         * sub-worker finish its job before launching next sub-worker.
> +         */
> +        launch_prewarm_subworker(&pelem);
>
> The function name implies that it launches the worker, but the comment
> implies that it also waits for it to terminate.  Functions should be
> named in a way that matches what they do.

-- Have renamed it to launch_and_wait_for_per_database_worker.

> I feel like the get_autoprewarm_task() function is introducing fairly
> baroque logic for something that really ought to be more simple.  All
> autoprewarm_main() really needs to do is:
>
> if (!state->autoprewarm_done)
>     autoprewarm();
> dump_block_info_periodically();

-- Have simplified things as suggested now. Function
get_autoprewarm_task has been removed.

> The locking in autoprewarm_dump_now() is a bit tricky.  There are two
> trouble cases.  One is that we try to rename() our new dump file on
> top of the existing one while a background worker is still using it to
> perform an autoprewarm.  The other is that we try to write a new
> temporary dump file while some other process is trying to do the same
> thing.  I think we should fix this by storing a PID in
> AutoPrewarmSharedState; a process which wants to perform a dump or an
> autoprewarm must change the PID from 0 to their own PID, and change it
> back to 0 on successful completion or error exit.  If a process goes
> to perform an immediate dump and finds a non-zero value already there,
> it just does ereport(ERROR, ...), including the PID of the other process in the
> message (e.g. "unable to perform block dump because dump file is being
> used by PID %d").  In a background worker, if we go to dump and find
> the file in use, log a message (e.g. "skipping block dump because it
> is already being performed by PID %d", "skipping prewarm because block
> dump file is being rewritten by PID %d").

-- Fixed as suggested.

> I also think we should change is_bgworker_running to a PID, so that if
> we try to launch a new worker we can report something like
> "autoprewarm worker is already running under PID %d".

-- Fixed. I could only LOG that another autoprewarm worker is already
running and then exit, because on ERROR we would try to restart the
worker, and we do not want to restart such workers.

> So putting that all together, I suppose AutoPrewarmSharedState should
> end up looking like this:
>
> LWLock lock; /* mutual exclusion */
> pid_t bgworker_pid;  /* for main bgworker */
> pid_t pid_using_dumpfile; /* for autoprewarm or block dump */

-- I think one more member is required, which states whether prewarm
can be done when the worker restarts.

> /* following items are for communication with per-database worker */
> dsm_handle block_info_handle;
> Oid database;
> int prewarm_start_idx;
> int prewarm_stop_idx;

-- Fixed as suggested.

> I suggest going through and changing "subworker" to "per-database
> worker" throughout.

-- Fixed as suggested.

>
> BTW, have you tested how long this takes to run with, say, shared_buffers = 8GB?

I have tried the same on my local machine with SSD storage.

settings: shared_buffers = 8GB, data loaded with pgbench scale_factor=1000.

Total blocks dumped:
autoprewarm_dump_now
----------------------
1048576

5 different load-time logs:

1.
2017-06-04 11:30:26.460 IST [116253] LOG:  autoprewarm has started
2017-06-04 11:30:43.443 IST [116253] LOG:  autoprewarm load task ended
-- 17 secs

2
2017-06-04 11:31:13.565 IST [116291] LOG:  autoprewarm has started
2017-06-04 11:31:30.317 IST [116291] LOG:  autoprewarm load task ended
-- 17 secs

3.
2017-06-04 11:32:12.995 IST [116329] LOG:  autoprewarm has started
2017-06-04 11:32:29.982 IST [116329] LOG:  autoprewarm load task ended
-- 17 secs

4.
2017-06-04 11:32:58.974 IST [116361] LOG:  autoprewarm has started
2017-06-04 11:33:15.017 IST [116361] LOG:  autoprewarm load task ended
-- 17 secs

5.
2017-06-04 12:15:49.772 IST [117936] LOG:  autoprewarm has started
2017-06-04 12:16:11.012 IST [117936] LOG:  autoprewarm load task ended
-- 22 secs.

So mostly from 17 to 22 secs.

But I think I need to run tests on a larger set of configurations and
on different storage types. I shall do the same and upload the results
later. I have also uploaded the latest performance test results (from
my local machine's SSD drive).
configuration: shared_buffers = 8GB
test setting: scale_factor = 300 (data fits in shared_buffers), pgbench clients = 1

TEST:
PGBENCH_RUN="./pgbench --no-vacuum --protocol=prepared --time=5 -j 1 -c 1 --select-only postgres"
START_TIME=$SECONDS
echo TIME, TPS
while true; do
    TPS=$($PGBENCH_RUN | grep excluding | cut -d ' ' -f 3)
    TIME=$((SECONDS-START_TIME))
    echo $TIME, $TPS
done


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Rafia Sabih
Date:
I had a look at the patch from a stylistic/formatting point of view;
please find attached a patch with the suggested modifications.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Mon, Jun 5, 2017 at 7:58 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> I had a look at the patch from stylistic/formatting point of view,
> please find the attached patch for the suggested modifications.

Many of these seem worse, like these ones:

-         * Quit if we've reached records for another database. Unless the
+         * Quit if we've reached records of another database. Unless the

-         * When we reach a new relation, close the old one.  Note, however,
-         * that the previous try_relation_open may have failed, in which case
-         * rel will be NULL.
+         * On reaching a new relation, close the old one.  Note, that the
+         * previous try_relation_open may have failed, in which case rel will
+         * be NULL.

-         * Try to open each new relation, but only once, when we first
-         * encounter it.  If it's been dropped, skip the associated blocks.
+         * Each relation is open only once at it's first encounter. If it's
+         * been dropped, skip the associated blocks.

Others are better, like these:

-                (errmsg("could not continue autoprewarm worker is
already running under PID %d",
+                (errmsg("autoprewarm worker is already running under PID %d",

- * Start of prewarm per-database worker. This will try to load blocks of one
+ * Start prewarm per-database worker, which will load blocks of one

Others don't really seem better or worse, like:

-         * Register a per-database worker to load new database's block. And
-         * wait until they finish their job to launch next one.
+         * Register a per-database worker to load new database's block. Wait
+         * until they finish their job to launch next one.

IMHO, there's still a good bit of work needed here to make this sound
like American English.  For example:

- *        It is a bgworker which automatically records information about blocks
- *        which were present in buffer pool before server shutdown and then
- *        prewarm the buffer pool upon server restart with those blocks.
+ *        It is a bgworker process that automatically records information about
+ *        blocks which were present in buffer pool before server
shutdown and then
+ *        prewarms the buffer pool upon server restart with those blocks.

This construction "It is a..." without a clear referent seems to be
standard in Indian English, but it looks wrong to English speakers
from other parts of the world, or at least to me.

+     * Since there could be at max one worker who could do a prewarm, hence,
+     * acquiring locks is not required before setting skip_prewarm_on_restart.

To me, adding a comma before hence looks like a significant
improvement, but the word hence itself seems out-of-place.  Also, I'd
change "at max" to "at most" and maybe reword the sentence a little.
There's a lot of little things like this which I have tended be quite
strict about changing before commit; I occasionally wonder whether
it's really worth the effort.  It's not really wrong, it just sounds
weird to me as an American.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Rafia Sabih
Date:
On Mon, Jun 5, 2017 at 8:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Many of these seem worse, like these ones:
>
> -         * Quit if we've reached records for another database. Unless the
> +         * Quit if we've reached records of another database. Unless the
>
Why is it worse? I've always encountered "records of database" and
not "records for database". Anyhow, I tried reframing the sentence
altogether.
> -         * When we reach a new relation, close the old one.  Note, however,
> -         * that the previous try_relation_open may have failed, in which case
> -         * rel will be NULL.
> +         * On reaching a new relation, close the old one.  Note, that the
> +         * previous try_relation_open may have failed, in which case rel will
> +         * be NULL.
>
Reframed the sentence.
> -         * Try to open each new relation, but only once, when we first
> -         * encounter it.  If it's been dropped, skip the associated blocks.
> +         * Each relation is open only once at it's first encounter. If it's
> +         * been dropped, skip the associated blocks.
>
Reframed.

> IMHO, there's still a good bit of work needed here to make this sound
> like American English.  For example:
>
> - *        It is a bgworker which automatically records information about blocks
> - *        which were present in buffer pool before server shutdown and then
> - *        prewarm the buffer pool upon server restart with those blocks.
> + *        It is a bgworker process that automatically records information about
> + *        blocks which were present in buffer pool before server
> shutdown and then
> + *        prewarms the buffer pool upon server restart with those blocks.
>
> This construction "It is a..." without a clear referent seems to be
> standard in Indian English, but it looks wrong to English speakers
> from other parts of the world, or at least to me.
>
Agreed, tried reframing the sentence.
> +     * Since there could be at max one worker who could do a prewarm, hence,
> +     * acquiring locks is not required before setting skip_prewarm_on_restart.
>
> To me, adding a comma before hence looks like a significant
> improvement, but the word hence itself seems out-of-place.  Also, I'd
> change "at max" to "at most" and maybe reword the sentence a little.
> There's a lot of little things like this which I have tended be quite
> strict about changing before commit; I occasionally wonder whether
> it's really worth the effort.  It's not really wrong, it just sounds
> weird to me as an American.
>
Agreed, sentence reframed.
I am attaching the patch with the modifications I made on a second look.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Tue, Jun 6, 2017 at 6:29 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> On Mon, Jun 5, 2017 at 8:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Many of these seem worse, like these ones:
>>
>> -         * Quit if we've reached records for another database. Unless the
>> +         * Quit if we've reached records of another database. Unless the
>>
> Why is it worse? I've always encountered using "records of database"
> and not "records for database". Anyhow, I tried reframing the sentence
> altogether.

My experience is the opposite.  If I Google "from another database",
including the quotes, I get 516,000 hits; if I do the same using "of
another database", I get 110,000 hits.  I think that means the former
usage is more popular, although obviously to some degree it's a matter
of taste.

+ *        database, tablespace, filenode, forknum, blocknum in

The file doesn't contain the spaces you added here, which is probably
why Mithun did it as he did, but actually we don't really need this
level of detail in the file header comment anyway.  I think you could
drop the specific mention of how blocks are identified.

+ * GUC variable to decide if autoprewarm worker has to be started when

if->whether? has to be->should be?

Actually this patch uses "if" in various places where I would use
"whether", but that's probably a matter of taste to some extent.

- * Start of prewarm per-database worker. This will try to load blocks of one
- * database starting from block info position state->prewarm_start_idx to
- * state->prewarm_stop_idx.
+ * Connect to the database and load the blocks of that database starting from
+ * the position state->prewarm_start_idx to state->prewarm_stop_idx.

That's a really good rephrasing.  It's more compact and more
imperative.  The only thing that seems a little off is that you say
"starting from" and then mention both the start and stop indexes.
Maybe say "between" instead.

-         * Quit if we've reached records for another database. Unless the
-         * previous blocks were of global objects which were combined with
-         * next database's block infos.
+         * Quit if we've reached records for another database. If previous
+         * blocks are of some global objects, then continue pre-warming.

Good.

-         * When we reach a new relation, close the old one.  Note, however,
-         * that the previous try_relation_open may have failed, in which case
-         * rel will be NULL.
+         * As soon as we encounter a block of a new relation, close the old
+         * relation. Note, that rel will be NULL if try_relation_open failed
+         * previously, in that case there is nothing to close.

I wrote the original comment here, so you may not be too surprised to
hear that I liked it as it was.  Actually, your rewrite of the first
sentence seems like it might be better, but I think the second one is
not.  By deleting "however", you've made the comma after "Note"
redundant, and by changing "in which case" to "in that case", you've
made a dependent clause into a comma splice.  I hate comma splices.

https://en.wikipedia.org/wiki/Comma_splice

+ * $PGDATA/AUTOPREWARM_FILE to a dsm. Further, these BlockInfoRecords are

I would probably capitalize DSM here.  There's no data structure
called dsm (lower-case) and we have other examples where it is
capitalized in documentation and comment text (see, e.g.
custom-scan.sgml).

+     * Since there could be at most one worker for prewarm, locking is not

could -> can?

-                 * For block info of a global object whose database will be 0
-                 * try to combine them with next non-zero database's block
-                 * infos to load.
+                 * For the BlockRecordInfos of a global object, combine them
+                 * with BlockRecordInfos of the next non-global object.

Good.  Or even "Combine BlockRecordInfos for a global object with the
next non-global object", which I think is even more compact.

+         * If we are here with database having InvalidOid, then only
+         * BlockRecordInfos belonging to global objects exist. Since, we can
+         * not connect with InvalidOid skip prewarming for these objects.

If we reach this point with current_db == InvalidOid, ...

+ *        Sub-routine to periodically call dump_now().

Every existing use of the word subroutine in our code base spells it
with no hyphen.

[rhaas pgsql]$ git grep -- Subroutine | wc -l
      47
[rhaas pgsql]$ git grep -- Sub-routine | wc -l
       0

Personally, I find this change worse, because calling something a
subroutine is totally generic, like saying that you met a "person" or
that something was "nice".  Calling it a loop gives at least a little
bit of specific information.

+ * Call dum_now() at regular intervals defined by GUC variable dump_interval.

This change introduces an obvious typographical error.

-    /* One last block info dump while postmaster shutdown. */
+    /* It's time for postmaster shutdown, let's dump for one last time. */

Comma splice.

+    /* Perform autoprewarm's task. */

Uninformative.

+ *        A Common function to initialize BackgroundWorker structure.

Adding a period to the end is a good idea, but how about also fixing
the capitalization of "Common"?  I think this is a common usage in
India, but not elsewhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
I have merged Rafia's patch for cosmetic changes. I have also made
some of the changes you recommended on top of that, but kept a few
as-is since Rafia's opinion is needed on those.

On Wed, Jun 7, 2017 at 5:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> My experience is the opposite.  If I Google "from another database",
> including the quotes, I get 516,000 hits; if I do the same using "of
> another database", I get 110,000 hits.  I think that means the former
> usage is more popular, although obviously to some degree it's a matter
> of taste.

-- Open.

>
> + *        database, tablespace, filenode, forknum, blocknum in
>
> The file doesn't contain the spaces you added here, which is probably
> why Mithun did it as he did, but actually we don't really need this
> level of detail in the file header comment anyway.  I think you could
> drop the specific mention of how blocks are identified.

-- Fixed. It has been removed now.

> + * GUC variable to decide if autoprewarm worker has to be started when
>
> if->whether? has to be->should be?

> Actually this patch uses "if" in various places where I would use
> "whether", but that's probably a matter of taste to some extent.

-- Fixed. I have changed "if" to "whether".

>
> - * Start of prewarm per-database worker. This will try to load blocks of one
> - * database starting from block info position state->prewarm_start_idx to
> - * state->prewarm_stop_idx.
> + * Connect to the database and load the blocks of that database starting from
> + * the position state->prewarm_start_idx to state->prewarm_stop_idx.
>
> That's a really good rephrasing.  It's more compact and more
> imperative.  The only thing that seems a little off is that you say
> "starting from" and then mention both the start and stop indexes.
> Maybe say "between" instead.

-- It is a half-open interval, so I rewrote it using the [,) notation.

>
> -         * Quit if we've reached records for another database. Unless the
> -         * previous blocks were of global objects which were combined with
> -         * next database's block infos.
> +         * Quit if we've reached records for another database. If previous
> +         * blocks are of some global objects, then continue pre-warming.
>
> Good.
>
> -         * When we reach a new relation, close the old one.  Note, however,
> -         * that the previous try_relation_open may have failed, in which case
> -         * rel will be NULL.
> +         * As soon as we encounter a block of a new relation, close the old
> +         * relation. Note, that rel will be NULL if try_relation_open failed
> +         * previously, in that case there is nothing to close.
>
> I wrote the original comment here, so you may not be too surprised to
> hear that I liked it as it was.  Actually, your rewrite of the first
> sentence seems like it might be better, but I think the second one is
> not.  By deleting "however", you've made the comma after "Note"
> redundant, and by changing "in which case" to "in that case", you've
> made a dependent clause into a comma splice.  I hate comma splices.
>
> https://en.wikipedia.org/wiki/Comma_splice

-- Open

> + * $PGDATA/AUTOPREWARM_FILE to a dsm. Further, these BlockInfoRecords are
> I would probably capitalize DSM here.  There's no data structure
> called dsm (lower-case) and we have other examples where it is
> capitalized in documentation and comment text (see, e.g.
> custom-scan.sgml).

-- Fixed.

>
> +     * Since there could be at most one worker for prewarm, locking is not
>
> could -> can?

-- Fixed.

>
> -                 * For block info of a global object whose database will be 0
> -                 * try to combine them with next non-zero database's block
> -                 * infos to load.
> +                 * For the BlockRecordInfos of a global object, combine them
> +                 * with BlockRecordInfos of the next non-global object.
>
> Good.  Or even "Combine BlockRecordInfos for a global object with the
> next non-global object", which I think is even more compact.

-- Fixed.

>
> +         * If we are here with database having InvalidOid, then only
> +         * BlockRecordInfos belonging to global objects exist. Since, we can
> +         * not connect with InvalidOid skip prewarming for these objects.
>
> If we reach this point with current_db == InvalidOid, ...

--Fixed.

>
> + *        Sub-routine to periodically call dump_now().
>
> Every existing use of the word subroutine in our code base spells it
> with no hyphen.
>
> [rhaas pgsql]$ git grep -- Subroutine | wc -l
>       47
> [rhaas pgsql]$ git grep -- Sub-routine | wc -l
>        0
>
> Personally, I find this change worse, because calling something a
> subroutine is totally generic, like saying that you met a "person" or
> that something was "nice".  Calling it a loop gives at least a little
> bit of specific information.

-- Fixed. We call it a loop now.

>
> + * Call dum_now() at regular intervals defined by GUC variable dump_interval.
>
> This change introduces an obvious typographical error.

-- Fixed.

>
> -    /* One last block info dump while postmaster shutdown. */
> +    /* It's time for postmaster shutdown, let's dump for one last time. */
>
> Comma splice.

-- Open

>
> +    /* Perform autoprewarm's task. */
>
> Uninformative.

-- Fixed. I have removed it now.

>
> + *        A Common function to initialize BackgroundWorker structure.
>
> Adding a period to the end is a good idea, but how about also fixing
> the capitalization of "Common"?  I think this is a common usage in
> India, but not elsewhere.

-- Fixed to "common".

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Fri, Jun 9, 2017 at 10:01 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> I have merged Rafia's patch for cosmetic changes. I have also fixed
> some of the changes you have recommended over that. But kept few as it
> is since Rafia's opinion was needed on that.
>

Few comments on the latest patch:
1.
+ /* Check whether blocknum is valid and within fork file size. */
+ if (blk->blocknum >= nblocks)
+ {
+ /* Move to next forknum. */
+ ++pos;
+ old_blk = blk;
+ continue;
+ }
+
+ /* Prewarm buffer. */
+ buf = ReadBufferExtended(rel, blk->forknum, blk->blocknum, RBM_NORMAL,
+ NULL);
+ if (BufferIsValid(buf))
+ ReleaseBuffer(buf);
+
+ old_blk = blk;
+ ++pos;

You are incrementing position at different places in this loop. I
think you can do it once at the start of the loop.
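
For example (untested; reusing the names from the quoted hunk, with
the record type name assumed):

    for (pos = 0; pos < num_elements; pos++)
    {
        BlockInfoRecord *blk = &block_info_array[pos];

        /* Check whether blocknum is valid and within fork file size. */
        if (blk->blocknum >= nblocks)
        {
            old_blk = blk;
            continue;            /* move to the next record */
        }

        /* Prewarm buffer. */
        buf = ReadBufferExtended(rel, blk->forknum, blk->blocknum,
                                 RBM_NORMAL, NULL);
        if (BufferIsValid(buf))
            ReleaseBuffer(buf);

        old_blk = blk;
    }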

2.
+dump_now(bool is_bgworker)
{
..
+ fd = OpenTransientFile(transient_dump_file_path,
+   O_CREAT | O_WRONLY | O_TRUNC, 0666);

+prewarm_buffer_pool(void)
{
..
+ file = AllocateFile(AUTOPREWARM_FILE, PG_BINARY_R);

During prewarm, you seem to be using binary mode to open the file,
whereas during dump the binary flag is not passed.  Is there a reason
for the difference?

3.
+ ereport(LOG,
+ (errmsg("saved metadata info of %d blocks", num_blocks)));

It doesn't seem like a good idea to log this info at each dump
interval.  How about making this a DEBUG1 message?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
Thanks, Amit,

On Fri, Jun 9, 2017 at 8:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Few comments on the latest patch:
+
+ /* Prewarm buffer. */
+ buf = ReadBufferExtended(rel, blk->forknum, blk->blocknum, RBM_NORMAL,
+ NULL);
+ if (BufferIsValid(buf))
+ ReleaseBuffer(buf);
+
+ old_blk = blk;
+ ++pos;

You are incrementing position at different places in this loop. I
think you can do it once at the start of the loop.

-- Fixed; moved all of the ++pos increments to one place.
 
2.
+dump_now(bool is_bgworker)
{
..
+ fd = OpenTransientFile(transient_dump_file_path,
+   O_CREAT | O_WRONLY | O_TRUNC, 0666);

+prewarm_buffer_pool(void)
{
..
+ file = AllocateFile(AUTOPREWARM_FILE, PG_BINARY_R);

During prewarm, you seem to be using binary mode to open the file,
whereas during dump the binary flag is not passed.  Is there a reason
for the difference?

-- Sorry, fixed now; both use binary mode.
 

3.
+ ereport(LOG,
+ (errmsg("saved metadata info of %d blocks", num_blocks)));

It doesn't seem like a good idea to log this info at each dump
interval.  How about making this a DEBUG1 message?
 
-- Fixed; made it DEBUG1, along with the "autoprewarm load task ended" message.


--
Thanks and Regards
Mithun C Y

Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Mon, Jun 12, 2017 at 6:31 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> Thanks, Amit,
>

+ /* Perform autoprewarm's task. */
+ if (todo_task == TASK_PREWARM_BUFFERPOOL &&
+ !state->skip_prewarm_on_restart)

Why have you removed the above comment in the new patch?  I am not
pointing this out because the above comment is especially meaningful;
rather, changing things in different versions of the patch without
informing the reviewer can increase the time to review.  I feel you
can write a better comment here.

1.
new file mode 100644
index 0000000..6c35fb7
--- /dev/null
+++ b/contrib/pg_prewarm/pg_prewarm--1.1--1.2.sql
@@ -0,0 +1,14 @@
+/* contrib/pg_prewarm/pg_prewarm--1.0--1.1.sql */

In comments, the SQL file name is wrong.

2.
+ /* Define custom GUC variables. */
+ if (process_shared_preload_libraries_in_progress)
+ DefineCustomBoolVariable("pg_prewarm.autoprewarm",
+ "Enable/Disable auto-prewarm feature.",
+ NULL,
+ &autoprewarm,
+ true,
+ PGC_POSTMASTER,
+ 0,
+ NULL,
+ NULL,
+ NULL);
+
+ DefineCustomIntVariable("pg_prewarm.dump_interval",
+   "Sets the maximum time between two buffer pool dumps",
+ "If set to zero, timer based dumping is disabled."
+ " If set to -1, stops autoprewarm.",
+ &dump_interval,
+ AT_PWARM_DEFAULT_DUMP_INTERVAL,
+ AT_PWARM_OFF, INT_MAX / 1000,
+ PGC_SIGHUP,
+ GUC_UNIT_S,
+ NULL,
+ NULL,
+ NULL);
+
+ EmitWarningsOnPlaceholders("pg_prewarm");
+
+ /* If not run as a preloaded library, nothing more to do. */
+ if (!process_shared_preload_libraries_in_progress)
+ return;

a. You can easily write this code such that
process_shared_preload_libraries_in_progress needs to be checked just
once.  Move the definition of pg_prewarm.dump_interval to first place
and then check if (!process_shared_preload_libraries_in_progress) return.

b. The variable name autoprewarm isn't self-explanatory; also, if you
have to search for uses of this variable in the code, it is difficult
because a lot of unrelated usages can pop up.  How about naming it
start_prewarm_worker or enable_autoprewarm or something like that?

3.
+static AutoPrewarmSharedState *state = NULL;

Again, the naming of this variable (state) is not meaningful.  How
about SharedPrewarmState or something like that?

4.
+ ereport(LOG,
+ (errmsg("autoprewarm has started")));
..
+ ereport(LOG,
+ (errmsg("autoprewarm shutting down")));

How about changing messages as "autoprewarm worker started" and
"autoprewarm worker stopped" respectively?

5.
+void
+dump_block_info_periodically(void)
+{
+ TimestampTz last_dump_time = GetCurrentTimestamp();
..
+ if (TimestampDifferenceExceeds(last_dump_time,
+   current_time,
+   (dump_interval * 1000)))
+ {
+ dump_now(true);
..

With the above coding, it will not dump the very first time it tries to
dump the block information.  I think it is better if it dumps the first
time and then dumps after each dump_interval.  It should not be
difficult to arrange the above code to do so, if you agree that is
better behavior.
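
For instance, a minimal sketch of that behavior, reusing the patch's
variable names (the rest of the loop stays as it is in the patch):

    /* Initializing to 0 forces the very first interval check to fire. */
    TimestampTz last_dump_time = 0;

    for (;;)
    {
        TimestampTz current_time = GetCurrentTimestamp();

        if (TimestampDifferenceExceeds(last_dump_time, current_time,
                                       dump_interval * 1000))
        {
            dump_now(true);
            last_dump_time = GetCurrentTimestamp();
        }
        /* latch wait as in the patch */
    }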

6.
+dump_now(bool is_bgworker)
{
..
+ if (write(fd, buf, buflen) < buflen)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\" : %m",
+ transient_dump_file_path)));
..
}

You seem to have forgotten to close and unlink the file in the above
code path.  There are a lot of places in this function where you have to
free memory or close the file in case of an error condition.  You can
use multiple labels to define error exit paths, something like what we
have done in DecodeXLogRecord.

7.
+ for (i = 0; i < num_blocks; i++)
+ {
+ /* In case of a SIGHUP, just reload the configuration. */
+ if (got_sighup)
+ {
+ got_sighup = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ /* Have we been asked to stop dump? */
+ if (dump_interval == AT_PWARM_OFF)
+ {
+ pfree(block_info_array);
+ CloseTransientFile(fd);
+ unlink(transient_dump_file_path);
+ return 0;
+ }
+
+ buflen = sprintf(buf, "%u,%u,%u,%u,%u\n",
+ block_info_array[i].database,
+ block_info_array[i].tablespace,
+ block_info_array[i].filenode,
+ (uint32) block_info_array[i].forknum,
+ block_info_array[i].blocknum);
+
+ if (write(fd, buf, buflen) < buflen)
+ {
+ int save_errno = errno;
+
+ CloseTransientFile(fd);
+ unlink(transient_dump_file_path);
+ errno = save_errno;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\": %m",
+ transient_dump_file_path)));
+ }

It seems we can club the above writes into 8K-sized writes instead of
one write per block entry.
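
A rough sketch of such batching (out_buf and used are illustrative
names, not from the patch; the error cleanup mirrors the existing code):

    char        out_buf[8192];
    int         used = 0;

    for (i = 0; i < num_blocks; i++)
    {
        char        line[64];
        int         len;

        len = snprintf(line, sizeof(line), "%u,%u,%u,%u,%u\n",
                       block_info_array[i].database,
                       block_info_array[i].tablespace,
                       block_info_array[i].filenode,
                       (uint32) block_info_array[i].forknum,
                       block_info_array[i].blocknum);

        /* Flush the staging buffer before it would overflow. */
        if (used + len > (int) sizeof(out_buf))
        {
            if (write(fd, out_buf, used) < used)
            {
                int save_errno = errno;

                CloseTransientFile(fd);
                unlink(transient_dump_file_path);
                errno = save_errno;
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not write to file \"%s\": %m",
                                transient_dump_file_path)));
            }
            used = 0;
        }
        memcpy(out_buf + used, line, len);
        used += len;
    }

    /* Flush whatever remains after the loop, with the same cleanup. */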

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Mon, Jun 12, 2017 at 7:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 12, 2017 at 6:31 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> Thanks, Amit,
>

+ /* Perform autoprewarm's task. */
+ if (todo_task == TASK_PREWARM_BUFFERPOOL &&
+ !state->skip_prewarm_on_restart)

Why have you removed the above comment in the new patch?  I am not
pointing this out because the above comment is especially meaningful;
rather, changing things between versions of the patch without informing
the reviewer can increase the time needed to review.  I feel you can
write a better comment here.

That happened during a previous comment fix. I think I removed it in patch_12 itself, and I stated the same in that mail. I felt this code was simple, so there was no need to add new comments. I have tried to add a few now, as suggested.
 

1.
new file mode 100644
index 0000000..6c35fb7
--- /dev/null
+++ b/contrib/pg_prewarm/pg_prewarm--1.1--1.2.sql
@@ -0,0 +1,14 @@
+/* contrib/pg_prewarm/pg_prewarm--1.0--1.1.sql */

In comments, the SQL file name is wrong.

-- Sorry, fixed.
 
2.
+ /* Define custom GUC variables. */
+ if (process_shared_preload_libraries_in_progress)
+ DefineCustomBoolVariable("pg_prewarm.autoprewarm",
+ "Enable/Disable auto-prewarm feature.",
+ NULL,
+ &autoprewarm,
+ true,
+ PGC_POSTMASTER,
+ 0,
+ NULL,
+ NULL,
+ NULL);
+
+ DefineCustomIntVariable("pg_prewarm.dump_interval",
+   "Sets the maximum time between two buffer pool dumps",
+ "If set to zero, timer based dumping is disabled."
+ " If set to -1, stops autoprewarm.",
+ &dump_interval,
+ AT_PWARM_DEFAULT_DUMP_INTERVAL,
+ AT_PWARM_OFF, INT_MAX / 1000,
+ PGC_SIGHUP,
+ GUC_UNIT_S,
+ NULL,
+ NULL,
+ NULL);
+
+ EmitWarningsOnPlaceholders("pg_prewarm");
+
+ /* If not run as a preloaded library, nothing more to do. */
+ if (!process_shared_preload_libraries_in_progress)
+ return;

a. You can easily write this code such that
process_shared_preload_libraries_in_progress needs to be checked just
once: move the definition of pg_prewarm.dump_interval to first place and
then check if (!process_shared_preload_libraries_in_progress) return.

-- Thanks, I have fixed it as suggested. Previously I did it that way to avoid calling EmitWarningsOnPlaceholders in two different places.
 

b. The variable name autoprewarm isn't self-explanatory; also, if you
have to search for uses of this variable in the code, it is difficult
because a lot of unrelated matches can pop up.  How about naming it
start_prewarm_worker or enable_autoprewarm or something like that?

-- The name was taken from previous review comments. I think enable_autoprewarm looks good, so I am renaming it to that. Please let me know if I need to reconsider.
 

3.
+static AutoPrewarmSharedState *state = NULL;

Again, the naming of this variable (state) is not meaningful.  How
about SharedPrewarmState or something like that?

-- The state is shared by both the prewarm and dump tasks, so I would like to keep the name simple and short. Elsewhere I have used "apw_sigterm_handler", so I think
"apw_state" could be a good compromise. I have renamed the functions to reset_apw_state and init_apw_state along similar lines. Please let me know if I need to reconsider.
 

4.
+ ereport(LOG,
+ (errmsg("autoprewarm has started")));
..
+ ereport(LOG,
+ (errmsg("autoprewarm shutting down")));

How about changing messages as "autoprewarm worker started" and
"autoprewarm worker stopped" respectively?

-- Thanks, fixed as suggested.
 

5.
+void
+dump_block_info_periodically(void)
+{
+ TimestampTz last_dump_time = GetCurrentTimestamp();
..
+ if (TimestampDifferenceExceeds(last_dump_time,
+   current_time,
+   (dump_interval * 1000)))
+ {
+ dump_now(true);
..

With the above coding, it will not dump the very first time it tries to
dump the block information.  I think it is better if it dumps the first
time and then dumps after each dump_interval.  It should not be
difficult to arrange the above code to do so, if you agree that is
better behavior.

-- Thanks, agreed; fixed as suggested.
 

6.
+dump_now(bool is_bgworker)
{
..
+ if (write(fd, buf, buflen) < buflen)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\" : %m",
+ transient_dump_file_path)));
..
}

You seem to have forgotten to close and unlink the file in the above
code path.  There are a lot of places in this function where you have to
free memory or close the file in case of an error condition.  You can
use multiple labels to define error exit paths, something like what we
have done in DecodeXLogRecord.

-- Fixed. I have moved those error messages into a new function, buffer_file_flush, while addressing the comments below, so I think a goto scheme as in DecodeXLogRecord is no longer necessary.

7.
+ for (i = 0; i < num_blocks; i++)
+ {
+ /* In case of a SIGHUP, just reload the configuration. */
+ if (got_sighup)
+ {
+ got_sighup = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ /* Have we been asked to stop dump? */
+ if (dump_interval == AT_PWARM_OFF)
+ {
+ pfree(block_info_array);
+ CloseTransientFile(fd);
+ unlink(transient_dump_file_path);
+ return 0;
+ }
+
+ buflen = sprintf(buf, "%u,%u,%u,%u,%u\n",
+ block_info_array[i].database,
+ block_info_array[i].tablespace,
+ block_info_array[i].filenode,
+ (uint32) block_info_array[i].forknum,
+ block_info_array[i].blocknum);
+
+ if (write(fd, buf, buflen) < buflen)
+ {
+ int save_errno = errno;
+
+ CloseTransientFile(fd);
+ unlink(transient_dump_file_path);
+ errno = save_errno;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\": %m",
+ transient_dump_file_path)));
+ }

It seems we can club the above writes into 8K-sized writes instead of
one write per block entry.

-- I have tried to fix this, mostly inspired by BufFileWrite in buffile.c. I have not used that facility directly because there is no suitable interface, and I am also not sure it can be used in our context. Should I also use a buffered file while reading from the autoprewarm.blocks file?
 
--
Thanks and Regards
Mithun C Y

Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Robert Haas
Date:
On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> [ new patch ]

I think this is looking better.  I have some suggestions:

* I suggest renaming launch_autoprewarm_dump() to
autoprewarm_start_worker().  I think that will be clearer.  Remember
that user-visible names, internal names, and the documentation should
all match.

* I think the GUC could be pg_prewarm.autoprewarm rather than
pg_prewarm.enable_autoprewarm.  It's shorter and, I think, no less
clear.

* In the documentation, don't say "This is a SQL callable function
to....".  This is a list of SQL-callable functions, so each thing in
the list is one.  Just delete this from the beginning of each
sentence.

* The reason for the AT_PWARM_* naming is not very obvious.  Does AT
mean "at" or "auto" or something else?  How about
AUTOPREWARM_INTERVAL_DISABLED, AUTOPREWARM_INTERVAL_SHUTDOWN_ONLY,
AUTOPREWARM_INTERVAL_DEFAULT?

* Instead of defining apw_sigusr1_handler, I think you could just use
procsignal_sigusr1_handler.  Instead of defining apw_sigterm_handler,
perhaps you could just use die().  got_sigterm would go away, and
you'd just CHECK_FOR_INTERRUPTS().
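
A minimal sketch of that setup (both handlers already exist in the
backend; this is illustrative, not the patch's code):

    /* In the worker's main function, before unblocking signals. */
    pqsignal(SIGTERM, die);
    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
    BackgroundWorkerUnblockSignals();

    /* ...and inside any long-running loop, instead of got_sigterm: */
    CHECK_FOR_INTERRUPTS();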

* The PG_TRY()/PG_CATCH() block in autoprewarm_dump_now() could reuse
reset_apw_state(), which might be better named detach_apw_shmem().
Similarly, init_apw_state() could be init_apw_shmem().

* Instead of load_one_database(), I suggest
autoprewarm_database_main().  That is more parallel to
autoprewarm_main(), which you have elsewhere, and makes it more
obvious that it's the main entrypoint for some background worker.

* Instead of launch_and_wait_for_per_database_worker(), I suggest
autoprewarm_one_database(), and instead of prewarm_buffer_pool(), I
suggest autoprewarm_buffers().   The motivation for changing prewarm
to autoprewarm is that we want the names here to be clearly distinct
from the other parts of pg_prewarm that are not related to
autoprewarm.  The motivation for changing buffer_pool to buffers is
just that it's a little shorter.  Personally I also like the sound of
it better, but YMMV.

* prewarm_buffer_pool() ends with a useless return statement.  I
suggest removing it.

* Instead of creating our own buffering system via buffer_file_write()
and buffer_file_flush(), why not just use the facilities provided by
the operating system?  fopen() et. al. provide buffering, and we have
AllocateFile() to provide a FILE *; it's just like
OpenTransientFile(), which you are using, but you'll get the buffering
stuff for free.  Maybe there's some reason why this won't work out
nicely, but off-hand it seems like it might.  It looks like you are
already using AllocateFile() to read the dump, so using it to write
the dump as well seems like it would be logical.
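
A sketch of what the write side might then look like (names follow the
patch; the transient-file rename and other cleanup are omitted):

    FILE       *file;
    uint32      i;

    file = AllocateFile(transient_dump_file_path, "w");
    if (file == NULL)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not open file \"%s\": %m",
                        transient_dump_file_path)));

    for (i = 0; i < num_blocks; i++)
        fprintf(file, "%u,%u,%u,%u,%u\n",   /* stdio buffers these writes */
                block_info_array[i].database,
                block_info_array[i].tablespace,
                block_info_array[i].filenode,
                (uint32) block_info_array[i].forknum,
                block_info_array[i].blocknum);

    if (FreeFile(file) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not close file \"%s\": %m",
                        transient_dump_file_path)));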

* I think that it would be cool if, when autoprewarm completed, it
printed a message at LOG rather than DEBUG1, and with a few more
details, like "autoprewarm successfully prewarmed %d of %d
previously-loaded blocks".  This would require some additional
tracking that you haven't got right now; you'd have to keep track not
only of the number of blocks read from the file but how many of those
some worker actually loaded.  You could do that with an extra counter
in the shared memory area that gets incremented by the per-database
workers.
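
Sketching that idea, the per-database worker's prewarm loop might count
successful reads into a new shared counter (the prewarmed_blocks field
is hypothetical, not in the current patch):

    buf = ReadBufferExtended(rel, blk->forknum, blk->blocknum,
                             RBM_NORMAL, NULL);
    if (BufferIsValid(buf))
    {
        ReleaseBuffer(buf);
        /* Assumes pg_atomic_uint32 prewarmed_blocks in shared memory. */
        pg_atomic_fetch_add_u32(&apw_state->prewarmed_blocks, 1);
    }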

* dump_block_info_periodically() calls ResetLatch() immediately before
WaitLatch; that's backwards.  See the commit message for commit
887feefe87b9099eeeec2967ec31ce20df4dfa9b and the comments it added to
the top of latch.h for details on how to do this correctly.

* dump_block_info_periodically()'s main loop is a bit confusing.  I
think that after calling dump_now(true) it should just "continue",
which will automatically retest got_sigterm.  You could rightly object
to that plan on the grounds that we then wouldn't recheck got_sighup
promptly, but you can fix that by moving the got_sighup test to the
top of the loop, which is a good idea anyway for at least two other
reasons.  First, you probably want to check for a pending SIGHUP on
initially entering this function, because something might have changed
during the prewarm phase, and second, see the previous comment about
using the "another valid coding pattern" from latch.h, which puts the
ResetLatch() at the bottom of the loop.
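
Putting that together, the restructured loop might look roughly like
this sketch against the patch's names, using the v10 WaitLatch
signature, with the got_sighup test at the top and ResetLatch() at the
bottom:

    for (;;)
    {
        /* Check for a pending SIGHUP first, including on initial entry. */
        if (got_sighup)
        {
            got_sighup = false;
            ProcessConfigFile(PGC_SIGHUP);
        }

        if (got_sigterm)
            break;          /* perform the final shutdown dump outside */

        if (TimestampDifferenceExceeds(last_dump_time,
                                       GetCurrentTimestamp(),
                                       dump_interval * 1000))
        {
            dump_now(true);
            last_dump_time = GetCurrentTimestamp();
            continue;       /* retest got_sigterm promptly */
        }

        (void) WaitLatch(&MyProc->procLatch,
                         WL_LATCH_SET | WL_TIMEOUT,
                         1000L,
                         PG_WAIT_EXTENSION);
        ResetLatch(&MyProc->procLatch);
    }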

* I think that launch_autoprewarm_dump() should ereport(ERROR, ...)
rather than just return NULL if the feature is disabled.  Maybe
something like ... ERROR: pg_prewarm.dump_interval must be
non-negative in order to launch worker

* Not sure about this one, but maybe we should consider just getting
rid of pg_prewarm.dump_interval = -1 altogether and make the minimum
value 0. If pg_prewarm.autoprewarm = on, then we start the worker and
dump according to the dump interval; if pg_prewarm.autoprewarm = off
then we don't start the worker automatically, but we still let you
start it manually.  If you do, it respects the configured
dump_interval.  With this design, we don't need the error suggested in
the previous item at all, and the code can be simplified in various
places --- all the checks for AT_PWARM_OFF go away.  And I don't see
that we're really losing anything.  There's not much sense in dumping
but not prewarming or prewarming but not dumping, so having
pg_prewarm.autoprewarm configure whether the worker is started
automatically rather than whether it prewarms (with a separate control
for whether it dumps) seems to make sense.  The one time when you want
to do one without the other is when you first install the extension --
during the first server lifetime, you'll want to dump, so that after
the next restart you have something to preload.  But this design would
allow that.

That's all I have time for today - hope it helps.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Thom Brown
Date:
On 22 June 2017 at 22:52, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>> [ new patch ]
>
> I think this is looking better.  I have some suggestions:
>
> * I suggest renaming launch_autoprewarm_dump() to
> autoprewarm_start_worker().  I think that will be clearer.  Remember
> that user-visible names, internal names, and the documentation should
> all match.

+1

I like related functions and GUCs to be similarly named so that they
have the same prefix.

>
> * I think the GUC could be pg_prewarm.autoprewarm rather than
> pg_prewarm.enable_autoprewarm.  It's shorter and, I think, no less
> clear.

+1

I also think pg_prewarm.dump_interval should be renamed to
pg_prewarm.autoprewarm_interval.

>
> * In the documentation, don't say "This is a SQL callable function
> to....".  This is a list of SQL-callable functions, so each thing in
> the list is one.  Just delete this from the beginning of each
> sentence.

I've made a pass at the documentation and ended up removing those
intros.  I haven't made any GUC/function renaming changes, but I have
rewritten some paragraphs for clarity.  Updated patch attached.

One thing I couldn't quite make sense of is:

"The autoprewarm process will start loading blocks recorded in
$PGDATA/autoprewarm.blocks until there is a free buffer left in the
buffer pool."

Is this saying "until there is a single free buffer remaining in
shared buffers"?  I haven't corrected or clarified this as I don't
understand it.

Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
detect an autoprewarm process already running.  I'd want this to
return NULL or an error if called for a 2nd time.

>
> * The reason for the AT_PWARM_* naming is not very obvious.  Does AT
> mean "at" or "auto" or something else?  How about
> AUTOPREWARM_INTERVAL_DISABLED, AUTOPREWARM_INTERVAL_SHUTDOWN_ONLY,
> AUTOPREWARM_INTERVAL_DEFAULT?
>
> * Instead of defining apw_sigusr1_handler, I think you could just use
> procsignal_sigusr1_handler.  Instead of defining apw_sigterm_handler,
> perhaps you could just use die().  got_sigterm would go away, and
> you'd just CHECK_FOR_INTERRUPTS().
>
> * The PG_TRY()/PG_CATCH() block in autoprewarm_dump_now() could reuse
> reset_apw_state(), which might be better named detach_apw_shmem().
> Similarly, init_apw_state() could be init_apw_shmem().
>
> * Instead of load_one_database(), I suggest
> autoprewarm_database_main().  That is more parallel to
> autoprewarm_main(), which you have elsewhere, and makes it more
> obvious that it's the main entrypoint for some background worker.
>
> * Instead of launch_and_wait_for_per_database_worker(), I suggest
> autoprewarm_one_database(), and instead of prewarm_buffer_pool(), I
> suggest autoprewarm_buffers().   The motivation for changing prewarm
> to autoprewarm is that we want the names here to be clearly distinct
> from the other parts of pg_prewarm that are not related to
> autoprewarm.  The motivation for changing buffer_pool to buffers is
> just that it's a little shorter.  Personally I also like the sound of
> it better, but YMMV.
>
> * prewarm_buffer_pool() ends with a useless return statement.  I
> suggest removing it.
>
> * Instead of creating our own buffering system via buffer_file_write()
> and buffer_file_flush(), why not just use the facilities provided by
> the operating system?  fopen() et. al. provide buffering, and we have
> AllocateFile() to provide a FILE *; it's just like
> OpenTransientFile(), which you are using, but you'll get the buffering
> stuff for free.  Maybe there's some reason why this won't work out
> nicely, but off-hand it seems like it might.  It looks like you are
> already using AllocateFile() to read the dump, so using it to write
> the dump as well seems like it would be logical.
>
> * I think that it would be cool if, when autoprewarm completed, it
> printed a message at LOG rather than DEBUG1, and with a few more
> details, like "autoprewarm successfully prewarmed %d of %d
> previously-loaded blocks".  This would require some additional
> tracking that you haven't got right now; you'd have to keep track not
> only of the number of blocks read from the file but how many of those
> some worker actually loaded.  You could do that with an extra counter
> in the shared memory area that gets incremented by the per-database
> workers.
>
> * dump_block_info_periodically() calls ResetLatch() immediately before
> WaitLatch; that's backwards.  See the commit message for commit
> 887feefe87b9099eeeec2967ec31ce20df4dfa9b and the comments it added to
> the top of latch.h for details on how to do this correctly.
>
> * dump_block_info_periodically()'s main loop is a bit confusing.  I
> think that after calling dump_now(true) it should just "continue",
> which will automatically retest got_sigterm.  You could rightly object
> to that plan on the grounds that we then wouldn't recheck got_sighup
> promptly, but you can fix that by moving the got_sighup test to the
> top of the loop, which is a good idea anyway for at least two other
> reasons.  First, you probably want to check for a pending SIGHUP on
> initially entering this function, because something might have changed
> during the prewarm phase, and second, see the previous comment about
> using the "another valid coding pattern" from latch.h, which puts the
> ResetLatch() at the bottom of the loop.
>
> * I think that launch_autoprewarm_dump() should ereport(ERROR, ...)
> rather than just return NULL if the feature is disabled.  Maybe
> something like ... ERROR: pg_prewarm.dump_interval must be
> non-negative in order to launch worker
>
> * Not sure about this one, but maybe we should consider just getting
> rid of pg_prewarm.dump_interval = -1 altogether and make the minimum
> value 0. If pg_prewarm.autoprewarm = on, then we start the worker and
> dump according to the dump interval; if pg_prewarm.autoprewarm = off
> then we don't start the worker automatically, but we still let you
> start it manually.  If you do, it respects the configured
> dump_interval.  With this design, we don't need the error suggested in
> the previous item at all, and the code can be simplified in various
> places --- all the checks for AT_PWARM_OFF go away.  And I don't see
> that we're really losing anything.  There's not much sense in dumping
> but not prewarming or prewarming but not dumping, so having
> pg_prewarm.autoprewarm configure whether the worker is started
> automatically rather than whether it prewarms (with a separate control
> for whether it dumps) seems to make sense.  The one time when you want
> to do one without the other is when you first install the extension --
> during the first server lifetime, you'll want to dump, so that after
> the next restart you have something to preload.  But this design would
> allow that.



-- 
Thom


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
Thanks, Robert. I have tried to address all of your comments and merged
the fixes suggested by Thom into patch 15.

On Fri, Jun 23, 2017 at 3:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> * I suggest renaming launch_autoprewarm_dump() to
> autoprewarm_start_worker().  I think that will be clearer.  Remember
> that user-visible names, internal names, and the documentation should
> all match.

-- Fixed as suggested.

>
> * I think the GUC could be pg_prewarm.autoprewarm rather than
> pg_prewarm.enable_autoprewarm.  It's shorter and, I think, no less
> clear.

-- I have renamed the GUC to pg_prewarm.autoprewarm.

>
> * In the documentation, don't say "This is a SQL callable function
> to....".  This is a list of SQL-callable functions, so each thing in
> the list is one.  Just delete this from the beginning of each
> sentence.

-- Fixed; Thom provided the fix and I have merged it into my patch.

> * The reason for the AT_PWARM_* naming is not very obvious.  Does AT
> mean "at" or "auto" or something else?  How about
> AUTOPREWARM_INTERVAL_DISABLED, AUTOPREWARM_INTERVAL_SHUTDOWN_ONLY,
> AUTOPREWARM_INTERVAL_DEFAULT?

-- Fixed as suggested. AUTOPREWARM_INTERVAL_DISABLED has now been
removed, per the comments below.

>
> * Instead of defining apw_sigusr1_handler, I think you could just use
> procsignal_sigusr1_handler.  Instead of defining apw_sigterm_handler,
> perhaps you could just use die().  got_sigterm would go away, and
> you'd just CHECK_FOR_INTERRUPTS().

-- I have registered procsignal_sigusr1_handler instead of
apw_sigusr1_handler. But I have some doubts about using die instead of
apw_sigterm_handler in the main autoprewarm worker: on shutdown
(SIGTERM) we should dump and then exit, so with a plain
CHECK_FOR_INTERRUPTS() we might miss dumping the buffer contents. I
think I would need to modify some server code in ProcessInterrupts to
handle this; please let me know if I am wrong about that.
For the per-database prewarm worker this approach seems right, so there
I am registering die for SIGTERM and calling CHECK_FOR_INTERRUPTS(), and
likewise for autoprewarm_dump_now().

>
> * The PG_TRY()/PG_CATCH() block in autoprewarm_dump_now() could reuse
> reset_apw_state(), which might be better named detach_apw_shmem().
> Similarly, init_apw_state() could be init_apw_shmem().

-- Fixed.

> * Instead of load_one_database(), I suggest
> autoprewarm_database_main().  That is more parallel to
> autoprewarm_main(), which you have elsewhere, and makes it more
> obvious that it's the main entrypoint for some background worker.

-- Fixed.

> * Instead of launch_and_wait_for_per_database_worker(), I suggest
> autoprewarm_one_database(), and instead of prewarm_buffer_pool(), I
> suggest autoprewarm_buffers().   The motivation for changing prewarm
> to autoprewarm is that we want the names here to be clearly distinct
> from the other parts of pg_prewarm that are not related to
> autoprewarm.  The motivation for changing buffer_pool to buffers is
> just that it's a little shorter.  Personally I also like the sound of
> it better, but YMMV.

-- Fixed; renamed as suggested.

> * prewarm_buffer_pool() ends with a useless return statement.  I
> suggest removing it.

-- Sorry, fixed.

>
> * Instead of creating our own buffering system via buffer_file_write()
> and buffer_file_flush(), why not just use the facilities provided by
> the operating system?  fopen() et. al. provide buffering, and we have
> AllocateFile() to provide a FILE *; it's just like
> OpenTransientFile(), which you are using, but you'll get the buffering
> stuff for free.  Maybe there's some reason why this won't work out
> nicely, but off-hand it seems like it might.  It looks like you are
> already using AllocateFile() to read the dump, so using it to write
> the dump as well seems like it would be logical.

-- Now using AllocateFile().

> * I think that it would be cool if, when autoprewarm completed, it
> printed a message at LOG rather than DEBUG1, and with a few more
> details, like "autoprewarm successfully prewarmed %d of %d
> previously-loaded blocks".  This would require some additional
> tracking that you haven't got right now; you'd have to keep track not
> only of the number of blocks read from the file but how many of those
> some worker actually loaded.  You could do that with an extra counter
> in the shared memory area that gets incremented by the per-database
> workers.
>
> * dump_block_info_periodically() calls ResetLatch() immediately before
> WaitLatch; that's backwards.  See the commit message for commit
> 887feefe87b9099eeeec2967ec31ce20df4dfa9b and the comments it added to
> the top of latch.h for details on how to do this correctly.

-- Sorry, fixed.

> * dump_block_info_periodically()'s main loop is a bit confusing.  I
> think that after calling dump_now(true) it should just "continue",
> which will automatically retest got_sigterm.  You could rightly object
> to that plan on the grounds that we then wouldn't recheck got_sighup
> promptly, but you can fix that by moving the got_sighup test to the
> top of the loop, which is a good idea anyway for at least two other
> reasons.  First, you probably want to check for a pending SIGHUP on
> initially entering this function, because something might have changed
> during the prewarm phase, and second, see the previous comment about
> using the "another valid coding pattern" from latch.h, which puts the
> ResetLatch() at the bottom of the loop.

-- Agreed. My idea was that if we receive SIGTERM while dumping, or just
after a dump, we need not dump again for shutdown. I think I was wrong,
so I have fixed it as you suggested.

>
> * I think that launch_autoprewarm_dump() should ereport(ERROR, ...)
> rather than just return NULL if the feature is disabled.  Maybe
> something like ... ERROR: pg_prewarm.dump_interval must be
> non-negative in order to launch worker

-- I have removed the pg_prewarm.dump_interval = -1 case as you
suggested below, so the error is no longer needed.

> * Not sure about this one, but maybe we should consider just getting
> rid of pg_prewarm.dump_interval = -1 altogether and make the minimum
> value 0. If pg_prewarm.autoprewarm = on, then we start the worker and
> dump according to the dump interval; if pg_prewarm.autoprewarm = off
> then we don't start the worker automatically, but we still let you
> start it manually.  If you do, it respects the configured
> dump_interval.  With this design, we don't need the error suggested in
> the previous item at all, and the code can be simplified in various
> places --- all the checks for AT_PWARM_OFF go away.  And I don't see
> that we're really losing anything.  There's not much sense in dumping
> but not prewarming or prewarming but not dumping, so having
> pg_prewarm.autoprewarm configure whether the worker is started
> automatically rather than whether it prewarms (with a separate control
> for whether it dumps) seems to make sense.  The one time when you want
> to do one without the other is when you first install the extension --
> during the first server lifetime, you'll want to dump, so that after
> the next restart you have something to preload.  But this design would
> allow that.

-- Agreed. I have removed the pg_prewarm.dump_interval = -1 case. I had
a similar doubt, which I tried to raise previously:

On Tue, May 30, 2017 at 10:16 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>>There is another GUC setting, pg_prewarm.dump_interval; if it is -1,
>>we stop the running autoprewarm worker. I have a doubt: should we
>>combine these two entities into one, such that it controls the state
>>of the autoprewarm worker?

Now I have one doubt: do we need a mechanism to stop a running
autoprewarm worker while keeping the server alive? Can I use
pg_prewarm.autoprewarm for that purpose?


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <thom@linux.com> wrote:

>
> I also think pg_prewarm.dump_interval should be renamed to
> pg_prewarm.autoprewarm_interval.

Thanks, I have changed it to pg_prewarm.autoprewarm_interval.

>>
>> * In the documentation, don't say "This is a SQL callable function
>> to....".  This is a list of SQL-callable functions, so each thing in
>> the list is one.  Just delete this from the beginning of each
>> sentence.


> One thing I couldn't quite make sense of is:
>
> "The autoprewarm process will start loading blocks recorded in
> $PGDATA/autoprewarm.blocks until there is a free buffer left in the
> buffer pool."
>
> Is this saying "until there is a single free buffer remaining in
> shared buffers"?  I haven't corrected or clarified this as I don't
> understand it.

Sorry, that was a typo; I meant to say "until there is no free buffer
left". Fixed in autoprewarm_16.patch.

>
> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
> detect an autoprewarm process already running.  I'd want this to
> return NULL or an error if called for a 2nd time.

We log instead of raising an error because we check only after launching
the worker, and from inside the worker. One solution could be that,
similar to autoprewarm_dump_now(), autoprewarm_start_worker() can
initialize shared memory and check whether we can launch the worker in
the backend itself. I will try to fix this.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <thom@linux.com> wrote:
>>
>> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
>> detect an autoprewarm process already running.  I'd want this to
>> return NULL or an error if called for a 2nd time.
>
> We log instead of raising an error because we check only after
> launching the worker, and from inside the worker. One solution could
> be that, similar to autoprewarm_dump_now(), autoprewarm_start_worker()
> can initialize shared memory and check whether we can launch the
> worker in the backend itself. I will try to fix this.

I have fixed it now as follows

+Datum
+autoprewarm_start_worker(PG_FUNCTION_ARGS)
+{
+   pid_t       pid;
+
+   init_apw_shmem();
+   pid = apw_state->bgworker_pid;
+   if (pid != InvalidPid)
+       ereport(ERROR,
+               (errmsg("autoprewarm worker is already running under PID %d",
+                       pid)));
+
+   autoprewarm_dump_launcher();
+   PG_RETURN_VOID();
+}

In the backend itself, we now check whether an autoprewarm worker is
already running, and only then start the worker. There is a possibility,
if this function is executed concurrently when no worker is running
(which I think is not normal usage), that both calls will claim to have
successfully launched the worker even though only one could have
actually done so (the other will log the failure and silently die). I
think that is okay, as the objective was to get one worker up and
running.

I have changed the return value to void. The worker could be restarted
after an error, so the returned PID would not match the actual worker's
PID in such cases; also, I do not see any use for the PID. I have made
the corresponding documentation changes.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <thom@linux.com> wrote:
>>>
>>> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
>>> detect an autoprewarm process already running.  I'd want this to
>>> return NULL or an error if called for a 2nd time.
>>
>> We log instead of raising an error because we check only after
>> launching the worker, and from inside the worker. One solution could
>> be that, similar to autoprewarm_dump_now(), autoprewarm_start_worker()
>> can initialize shared memory and check whether we can launch the
>> worker in the backend itself. I will try to fix this.
>
> I have fixed it now as follows
>
> +Datum
> +autoprewarm_start_worker(PG_FUNCTION_ARGS)
> +{
> +   pid_t       pid;
> +
> +   init_apw_shmem();
> +   pid = apw_state->bgworker_pid;
> +   if (pid != InvalidPid)
> +       ereport(ERROR,
> +               (errmsg("autoprewarm worker is already running under PID %d",
> +                       pid)));
> +
> +   autoprewarm_dump_launcher();
> +   PG_RETURN_VOID();
> +}
>
> In the backend itself, we now check whether an autoprewarm worker is
> already running, and only then start the worker. There is a
> possibility, if this function is executed concurrently when no worker
> is running (which I think is not normal usage), that both calls will
> claim to have successfully launched the worker even though only one
> could have actually done so (the other will log the failure and
> silently die).

Why can't we close this remaining race condition?  Basically, if we
just perform all of the actions in this function under the lock, and
autoprewarm_dump_launcher waits until the autoprewarm worker has
initialized bgworker_pid, then there won't be any remaining race
condition.  I think that if we don't close this race condition, it will
be unpredictable whether the user gets the error or there is only a
server log entry for it.
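
For illustration, one way to narrow that window is to launch through a
worker handle and wait for startup before returning, so the worker has
had a chance to publish its PID (a sketch only, assuming a
BackgroundWorker struct set up as in the patch's launcher):

    BackgroundWorkerHandle *handle;
    pid_t       pid;

    if (!RegisterDynamicBackgroundWorker(&worker, &handle))
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
                 errmsg("could not register background process"),
                 errhint("You may need to increase max_worker_processes.")));

    /* Block until the postmaster has actually started the worker. */
    if (WaitForBackgroundWorkerStartup(handle, &pid) != BGWH_STARTED)
        ereport(ERROR,
                (errmsg("could not start autoprewarm worker")));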

> I think that
> is okay as the objective was to get one worker up and running.
>

You are right that the objective will be met, but still, I feel the
behavior of this API will be unpredictable, which in my opinion should
be fixed.  If it is really not possible or extremely difficult to fix
this behavior, then I think we should update the docs.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <thom@linux.com> wrote:
>>>
>>> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
>>> detect an autoprewarm process already running.  I'd want this to
>>> return NULL or an error if called for a 2nd time.
>>
>> We log instead of raising an error because we check only after
>> launching the worker, and from inside the worker. One solution could
>> be that, similar to autoprewarm_dump_now(), autoprewarm_start_worker()
>> can initialize shared memory and check whether we can launch the
>> worker in the backend itself. I will try to fix this.
>
> I have fixed it now as follows

Few comments on the latest patch:

1.
+ LWLockRelease(&apw_state->lock);
+ if (!is_bgworker)
+ ereport(ERROR,
+ (errmsg("could not perform block dump because dump file is being
used by PID %d",
+ apw_state->pid_using_dumpfile)));
+ ereport(LOG,
+ (errmsg("skipping block dump because it is already being performed by PID %d",
+ apw_state->pid_using_dumpfile)));


The above code looks confusing, as both messages say the same thing in
different words.  I think you should keep one message (probably the
first one) and decide the error level based on whether this is invoked
from the bgworker.  Also, you can move LWLockRelease after the error
message, because if there is an error, it will automatically release
all lwlocks.
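
In other words, something like this sketch:

    ereport(is_bgworker ? LOG : ERROR,
            (errmsg("could not perform block dump because dump file is being used by PID %d",
                    apw_state->pid_using_dumpfile)));
    LWLockRelease(&apw_state->lock);    /* reached only in the LOG case */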

2.
+autoprewarm_dump_now(PG_FUNCTION_ARGS)
+{
+ uint32 num_blocks = 0;
+
..
+ PG_RETURN_INT64(num_blocks);
..
}

Is there any reason for using PG_RETURN_INT64 instead of PG_RETURN_UINT32?

3.
+dump_now(bool is_bgworker)
{
..
+ if (buf_state & BM_TAG_VALID)
+ {
+ block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
+ block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode;
+ block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
+ block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
+ block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
+ ++num_blocks;
+ }
..
}

I think there is no use in writing unlogged buffers unless the dump is
for shutdown.  You might want to use BufferIsPermanent to detect this.
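
One sketch of that check at the buffer-header level (BM_PERMANENT is the
header-level equivalent of BufferIsPermanent; the final_dump flag is
illustrative, not from the patch):

    if ((buf_state & BM_TAG_VALID) &&
        ((buf_state & BM_PERMANENT) || final_dump))
    {
        block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
        block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode;
        block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
        block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
        block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
        ++num_blocks;
    }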

4.
+static uint32
+dump_now(bool is_bgworker)
{
..
+ for (num_blocks = 0, i = 0; i < NBuffers; i++)
+ {
+ uint32 buf_state;
+
+ if (!is_bgworker)
+ CHECK_FOR_INTERRUPTS();
..
}

Why is checking for interrupts done only in non-bgworker cases?

5.
+ * Each entry of BlockRecordInfo consists of database, tablespace, filenode,
+ * forknum, blocknum. Note that this is in the text form so that the dump
+ * information is readable and can be edited, if required.
+ */

In the above comment, you are saying that the dump file is in text
form, whereas in the code you are using binary form.  I think the code
should match the comments. Is there a reason to prefer binary over
text, or vice versa?

6.
+dump_now(bool is_bgworker)
{
..
+ (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
+ apw_state->pid_using_dumpfile = InvalidPid;
..
}

How will pid_using_dumpfile be set to InvalidPid in the case of error
for non-bgworker cases?

7.
+dump_now(bool is_bgworker)
{
..
+ (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);

..
}

How will transient_dump_file_path be unlinked in the case of an error
in durable_rename?  I think you need to use PG_TRY..PG_CATCH to ensure
that.
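
For example, a sketch of that cleanup:

    PG_TRY();
    {
        (void) durable_rename(transient_dump_file_path,
                              AUTOPREWARM_FILE, ERROR);
    }
    PG_CATCH();
    {
        unlink(transient_dump_file_path);   /* best-effort cleanup */
        PG_RE_THROW();
    }
    PG_END_TRY();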

8.
+ file = AllocateFile(transient_dump_file_path, PG_BINARY_W);
+ if (!file)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ transient_dump_file_path)));
+
+ ret = fprintf(file, "<<%u>>\n", num_blocks);
+ if (ret < 0)
+ {
+ int save_errno = errno;
+
+ FreeFile(file);

I think you don't need to close the file in case of error; it will be
taken care of automatically (via the transaction abort path).

9.
+ /* Register autoprewarm load. */
+ setup_autoprewarm(&autoprewarm_worker, "autoprewarm", "autoprewarm_main",
+  Int32GetDatum(TASK_PREWARM_BUFFERPOOL), 0, 0);

What does "load" signify in the above comment?  Did you mean to say worker instead?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Fri, Jun 23, 2017 at 3:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>
> * Instead of creating our own buffering system via buffer_file_write()
> and buffer_file_flush(), why not just use the facilities provided by
> the operating system?  fopen() et. al. provide buffering, and we have
> AllocateFile() to provide a FILE *; it's just like
> OpenTransientFile(), which you are using, but you'll get the buffering
> stuff for free.  Maybe there's some reason why this won't work out
> nicely, but off-hand it seems like it might.  It looks like you are
> already using AllocateFile() to read the dump, so using it to write
> the dump as well seems like it would be logical.
>

One thing that is worth considering is that AllocateFile is recommended
only for short operations; refer to the text atop AllocateFile().  If
the number of blocks to be dumped is large, the file can remain open
for a significant period of time.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Mon, Jul 3, 2017 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>>> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <thom@linux.com> wrote:
>>>>
>>>> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
>>>> detect an autoprewarm process already running.  I'd want this to
>>>> return NULL or an error if called for a 2nd time.
>>>
>>> We log instead of raising an error because we check only after
>>> launching the worker, and from inside the worker. One solution could
>>> be that, similar to autoprewarm_dump_now(), autoprewarm_start_worker()
>>> can initialize shared memory and check whether we can launch the
>>> worker in the backend itself. I will try to fix this.
>>
>> I have fixed it now as follows
>>
>> +Datum
>> +autoprewarm_start_worker(PG_FUNCTION_ARGS)
>> +{
>> +   pid_t       pid;
>> +
>> +   init_apw_shmem();
>> +   pid = apw_state->bgworker_pid;
>> +   if (pid != InvalidPid)
>> +       ereport(ERROR,
>> +               (errmsg("autoprewarm worker is already running under PID %d",
>> +                       pid)));
>> +
>> +   autoprewarm_dump_launcher();
>> +   PG_RETURN_VOID();
>> +}
>>
>> In the backend itself, we now check whether an autoprewarm worker is
>> already running, and only then start the worker. There is a
>> possibility, if this function is executed concurrently when no worker
>> is running (which I think is not normal usage), that both calls will
>> claim to have successfully launched the worker even though only one
>> could have actually done so (the other will log the failure and
>> silently die).
>
> Why can't we close this remaining race condition?  Basically, if we
> just perform all of the actions in this function under the lock, and
> autoprewarm_dump_launcher waits until the autoprewarm worker has
> initialized bgworker_pid, then there won't be any remaining race
> condition.  I think that if we don't close this race condition, it
> will be unpredictable whether the user gets the error or there is only
> a server log entry for it.

Yes, I can make autoprewarm_dump_launcher wait until the launched
bgworker sets its PID, but this requires one more synchronization
variable between the launcher and the worker. More than that, I see that
ShmemInitStruct() and LWLockAcquire() can throw an ERROR (which restarts
the worker), and they need to be called before setting the PID. So I
thought it would not be harmful to let two concurrent calls launch
workers and just log the failures. Please let me know if I need to
rethink this.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Mon, Jul 3, 2017 at 3:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jun 23, 2017 at 3:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>>
>> * Instead of creating our own buffering system via buffer_file_write()
>> and buffer_file_flush(), why not just use the facilities provided by
>> the operating system?  fopen() et. al. provide buffering, and we have
>> AllocateFile() to provide a FILE *; it's just like
>> OpenTransientFile(), which you are using, but you'll get the buffering
>> stuff for free.  Maybe there's some reason why this won't work out
>> nicely, but off-hand it seems like it might.  It looks like you are
>> already using AllocateFile() to read the dump, so using it to write
>> the dump as well seems like it would be logical.
>>
>
> One thing that is worth considering is that AllocateFile is
> recommended only for short operations; refer to the text atop
> AllocateFile().  If the number of blocks to be dumped is large, the
> file can remain open for a significant period of time.

-- Agreed. I think I need a suggestion on this: we will hold on to one
fd, but I am not sure what length of time with the file held open
qualifies as a case against using AllocateFile().



-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Mithun Cy
Date:
On Mon, Jul 3, 2017 at 3:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Few comments on the latest patch:
>
> 1.
> + LWLockRelease(&apw_state->lock);
> + if (!is_bgworker)
> + ereport(ERROR,
> + (errmsg("could not perform block dump because dump file is being
> used by PID %d",
> + apw_state->pid_using_dumpfile)));
> + ereport(LOG,
> + (errmsg("skipping block dump because it is already being performed by PID %d",
> + apw_state->pid_using_dumpfile)));
>
>
> The above code looks confusing, as both messages say the same thing in
> different words.  I think you should keep one message (probably the
> first one) and decide the error level based on whether this is invoked
> from the bgworker.  Also, you can move LWLockRelease after the error
> message, because if there is an error, it will automatically release
> all lwlocks.

ERROR is used for autoprewarm_dump_now(), which is called from a
backend; LOG is used for the bgworker. The wordings are chosen to match
whether failing to dump is acceptable in that context. For the bgworker
it is acceptable: we are not particular about the start time of a dump,
only about the interval between dumps, so it is fine if somebody else is
already doing it. But a caller of autoprewarm_dump_now() might be
particular about the start time of the dump, so we throw an error,
letting the caller retry.

The wordings were suggested by Robert (snapshot below) in one of his
previous comments, and I agree with them. If you think I should
reconsider this and I am missing something, I am open to suggestions.

On Wed, May 31, 2017 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
+If we go to perform
+an immediate dump and find a non-zero value already present, we just do
+ereport(ERROR, ...), including the PID of the other process in the
+message (e.g. "unable to perform block dump because dump file is being
+used by PID %d").  In a background worker, if we go to dump and find
+the file in use, log a message (e.g. "skipping block dump because it
+is already being performed by PID %d", "skipping prewarm because block
+dump file is being rewritten by PID %d").

Thanks; I have moved the LWLockRelease to after the ERROR call.

> 2.
> +autoprewarm_dump_now(PG_FUNCTION_ARGS)
> +{
> + uint32 num_blocks = 0;
> +
> ..
> + PG_RETURN_INT64(num_blocks);
> ..
> }
>
> Is there any reason for using PG_RETURN_INT64 instead of PG_RETURN_UINT32?

The return type of autoprewarm_dump_now() is pg_catalog.int8, chosen to
accommodate uint32, so I used PG_RETURN_INT64. I think PG_RETURN_UINT32
can be used as well; I have replaced it now.

> 3.
> +dump_now(bool is_bgworker)
> {
> ..
> + if (buf_state & BM_TAG_VALID)
> + {
> + block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
> + block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode;
> + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
> + block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
> + block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
> + ++num_blocks;
> + }
> ..
> }
>I think there is no use in writing unlogged buffers unless the dump is
>for shutdown.  You might want to use BufferIsPermanent to detect this.

-- I do not think that is true; pages of unlogged tables are also read
into buffers for read-only purposes. So if we fail to dump them while
shutting down, the previous dump should still be used.

> 4.
> +static uint32
> +dump_now(bool is_bgworker)
> {
> ..
> + for (num_blocks = 0, i = 0; i < NBuffers; i++)
> + {
> + uint32 buf_state;
> +
> + if (!is_bgworker)
> + CHECK_FOR_INTERRUPTS();
> ..
> }
>
> Why is checking for interrupts done only in non-bgworker cases?

-- autoprewarm_dump_now() is called directly from a backend; in that
case, we have to handle the signals registered for the backend in
dump_now(). For the bgworker, dump_block_info_periodically(), the caller
of dump_now(), handles SIGTERM and SIGUSR1, which are the signals we are
interested in.

> 5.
> + * Each entry of BlockRecordInfo consists of database, tablespace, filenode,
> + * forknum, blocknum. Note that this is in the text form so that the dump
> + * information is readable and can be edited, if required.
> + */
>
> In the above comment, you are saying that the dump file is in text
> form, whereas in the code you are using binary form.  I think the code
> should match the comments. Is there a reason to prefer binary over
> text, or vice versa?

-- Previously I used write() on a file descriptor. Sorry, I should have
changed the open mode to text when I moved the code to use
AllocateFile(); fixed now.

> 6.
> +dump_now(bool is_bgworker)
> {
> ..
> + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
> + apw_state->pid_using_dumpfile = InvalidPid;
> ..
> }
>
> How will pid_using_dumpfile be set to InvalidPid in the case of error
> for non-bgworker cases?

-- I have a PG_TRY()/PG_CATCH() block in autoprewarm_dump_now(); I think that is okay.

>
> 7.
> +dump_now(bool is_bgworker)
> {
> ..
> + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
>
> ..
> }
>
> How will transient_dump_file_path be unlinked in the case of an error
> in durable_rename?  I think you need to use PG_TRY..PG_CATCH to ensure
> that.

-- If durable_rename() is failing, then the basic functionality of
autoprewarm is failing, so I want it to be an ERROR. I do not want to
remove the temp file, as we always truncate it before reusing it again.
So I think there is no need to catch every ERROR in dump_now() just to
remove the temp file.

> 8.
> + file = AllocateFile(transient_dump_file_path, PG_BINARY_W);
> + if (!file)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not open file \"%s\": %m",
> + transient_dump_file_path)));
> +
> + ret = fprintf(file, "<<%u>>\n", num_blocks);
> + if (ret < 0)
> + {
> + int save_errno = errno;
> +
> + FreeFile(file);
>
> I think you don't need to close the file in case of error; it will be
> taken care of automatically (via the transaction abort path).

-- I was trying to close the file before unlinking it. Agreed, it is
not necessary to do so; I just did it out of habit. I have removed it
now.

> 9.
> + /* Register autoprewarm load. */
> + setup_autoprewarm(&autoprewarm_worker, "autoprewarm", "autoprewarm_main",
> +  Int32GetDatum(TASK_PREWARM_BUFFERPOOL), 0, 0);
>
> What does "load" signify in the above comment?  Did you mean to say worker instead?

-- Sorry, fixed now.


In addition to the above, I have made one more change: the per-database
autoprewarm bgworker has been renamed from "autoprewarm" to
"per-database autoprewarm".

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Proposal : For Auto-Prewarm.

From
Amit Kapila
Date:
On Wed, Jul 5, 2017 at 6:25 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Mon, Jul 3, 2017 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>>> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>>>> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <thom@linux.com> wrote:
>>>>>
>>>>> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
>>>>> detect an autoprewarm process already running.  I'd want this to
>>>>> return NULL or an error if called for a 2nd time.
>>>>
>>>> We log instead of raising an error because we check only after
>>>> launching the worker, and from inside the worker. One solution could
>>>> be that, similar to autoprewarm_dump_now(), autoprewarm_start_worker()
>>>> can initialize shared memory and check whether we can launch the
>>>> worker in the backend itself. I will try to fix this.
>>>
>>> I have fixed it now as follows
>>>
>>> +Datum
>>> +autoprewarm_start_worker(PG_FUNCTION_ARGS)
>>> +{
>>> +   pid_t       pid;
>>> +
>>> +   init_apw_shmem();
>>> +   pid = apw_state->bgworker_pid;
>>> +   if (pid != InvalidPid)
>>> +       ereport(ERROR,
>>> +               (errmsg("autoprewarm worker is already running under PID %d",
>>> +                       pid)));
>>> +
>>> +   autoprewarm_dump_launcher();
>>> +   PG_RETURN_VOID();
>>> +}
>>>
>>> In the backend itself, we now check whether an autoprewarm worker is
>>> already running, and only then start the worker. There is a
>>> possibility, if this function is executed concurrently when no worker
>>> is running (which I think is not normal usage), that both calls will
>>> claim to have successfully launched the worker even though only one
>>> could have actually done so (the other will log the failure and
>>> silently die).
>>
>> Why can't we close this remaining race condition?  Basically, if we
>> just perform all of the actions in this function under the lock, and
>> autoprewarm_dump_launcher waits until the autoprewarm worker has
>> initialized bgworker_pid, then there won't be any remaining race
>> condition.  I think that if we don't close this race condition, it
>> will be unpredictable whether the user gets the error or there is
>> only a server log entry for it.
>
> Yes, I can make autoprewarm_dump_launcher wait until the launched
> bgworker sets its PID, but this requires one more synchronization
> variable between the launcher and the worker. More than that, I see
> that ShmemInitStruct() and LWLockAcquire() can throw an ERROR (which
> restarts the worker), and they need to be called before setting the
> PID. So I thought it would not be harmful to let two concurrent calls
> launch workers and just log the failures. Please let me know if I need
> to rethink this.
>

I don't know whether you need to rethink it, but as presented in the
patch, the specification of the API seems unclear to me.  As this is a
function exposed to the user, I think the behavior should be well
defined.  If you don't think changing the code has much advantage, then
at the very least update the docs to indicate the expected behavior of
the API.  Also, I think it is better to add a few comments in the code
about the unpredictable behavior in case of a race condition and the
reason for it.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com