Thread: patch for new feature: Buffer Cache Hibernation

patch for new feature: Buffer Cache Hibernation

From
Mitsuru IWASAKI
Date:
Hi,

I am working on new feature `Buffer Cache Hibernation' which enables
postgres to keep higher cache hit ratio even just started.

Postgres usually starts with ZERO buffer cache.  By saving the buffer
cache data structure into hibernation files just before shutdown, and
loading them at startup, postgres can start operations with the saved
buffer cache as the same condition as just before the last shutdown.

Here is the patch for 9.0.3 (also tested on 8.4.7)
http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-9.0.3.patch

The patch includes the following.
- At shutdown, buffer cache data structure (such as BufferDescriptors,
  BufferBlocks and StrategyControl) is saved into hibernation files.
- At startup, buffer cache data structure is loaded from hibernation
  files and buffer lookup hashtable is setup based on buffer descriptors.
- Above functions are enabled by specifying `enable_buffer_cache_hibernation=on'
  in postgresql.conf.

Any comments are welcome and I would very much appreciate merging the
patch in source tree.

Have fun and thanks!


Re: patch for new feature: Buffer Cache Hibernation

From
Andrew Dunstan
Date:

On 05/04/2011 10:10 AM, Mitsuru IWASAKI wrote:
> Hi,
>
> I am working on new feature `Buffer Cache Hibernation' which enables
> postgres to keep higher cache hit ratio even just started.
>
> Postgres usually starts with ZERO buffer cache.  By saving the buffer
> cache data structure into hibernation files just before shutdown, and
> loading them at startup, postgres can start operations with the saved
> buffer cache as the same condition as just before the last shutdown.
>
> Here is the patch for 9.0.3 (also tested on 8.4.7)
> http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-9.0.3.patch
>
> The patch includes the following.
> - At shutdown, buffer cache data structure (such as BufferDescriptors,
>    BufferBlocks and StrategyControl) is saved into hibernation files.
> - At startup, buffer cache data structure is loaded from hibernation
>    files and buffer lookup hashtable is setup based on buffer descriptors.
> - Above functions are enabled by specifying `enable_buffer_cache_hibernation=on'
>    in postgresql.conf.
>
> Any comments are welcome and I would very much appreciate merging the
> patch in source tree.
>
>

That sounds cool.

Please a) make sure your patch is up to date against the latest source 
in git and b) submit it to the next commitfest at 
<https://commitfest.postgresql.org/action/commitfest_view?id=10>

We don't backport features, and 9.1 is closed for features now, so the 
earliest release this could be used in is 9.2.

cheers

andrew


Re: patch for new feature: Buffer Cache Hibernation

From
Greg Stark
Date:
On Wed, May 4, 2011 at 3:10 PM, Mitsuru IWASAKI <iwasaki@jp.freebsd.org> wrote:
> Postgres usually starts with ZERO buffer cache.  By saving the buffer
> cache data structure into hibernation files just before shutdown, and
> loading them at startup, postgres can start operations with the saved
> buffer cache as the same condition as just before the last shutdown.

Offhand this seems pretty handy for benchmarks where it would help get
reproducible results.


--
greg


Re: patch for new feature: Buffer Cache Hibernation

From
Tom Lane
Date:
Mitsuru IWASAKI <iwasaki@jp.FreeBSD.org> writes:
> Postgres usually starts with ZERO buffer cache.  By saving the buffer
> cache data structure into hibernation files just before shutdown, and
> loading them at startup, postgres can start operations with the saved
> buffer cache as the same condition as just before the last shutdown.

This seems like a lot of complication for rather dubious gain.  What
happens when the DBA changes the shared_buffers setting, for instance?
How do you protect against the cached buffers getting out-of-sync with
the actual disk files (especially during recovery scenarios)?  What
about crash-induced corruption in the cache file itself (consider the
not-unlikely possibility that init will kill the database before it's
had time to dump all the buffers during a system shutdown)?  Do you have
any proof that writing out a few GB of buffers and then reading them
back in is actually much cheaper than letting the database re-read the
data from the disk files?
        regards, tom lane


Re: patch for new feature: Buffer Cache Hibernation

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of mié may 04 12:44:36 -0300 2011:

> This seems like a lot of complication for rather dubious gain.  What
> happens when the DBA changes the shared_buffers setting, for instance?
> How do you protect against the cached buffers getting out-of-sync with
> the actual disk files (especially during recovery scenarios)?  What
> about crash-induced corruption in the cache file itself (consider the
> not-unlikely possibility that init will kill the database before it's
> had time to dump all the buffers during a system shutdown)?  Do you have
> any proof that writing out a few GB of buffers and then reading them
> back in is actually much cheaper than letting the database re-read the
> data from the disk files?

I thought the idea wasn't to copy the entire buffer but only a
descriptor, so that the buffer would be loaded from the original page.

If shared_buffers changes, there's no problem.  If the new setting is
smaller, then the last pages would just not be copied, and would have
to be read from disk the first time they are accessed.  If the new
setting is larger, then the last few buffers would remain unused until
requested.

As for gain, I have heard of test setups requiring hours of runtime in
order to prime the buffer cache.

Crash safety would have to be researched, sure.  Maybe only do it in
clean shutdown.
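
A minimal sketch of the resizing behaviour described above, assuming the hibernation file holds only an ordered list of block identifiers rather than page contents. The names `BlockTag` and `fit_saved_map` are hypothetical and do not come from the patch.

```python
from typing import List, Optional, Tuple

# Simplified stand-in for a buffer tag: (relation file path, block number).
# This is an assumption for illustration, not PostgreSQL's BufferTag layout.
BlockTag = Tuple[str, int]


def fit_saved_map(saved: List[BlockTag], n_buffers: int) -> List[Optional[BlockTag]]:
    """Map a saved descriptor list onto a buffer pool of n_buffers slots."""
    # Start with every slot unused.
    slots: List[Optional[BlockTag]] = [None] * n_buffers
    # Copy as many saved entries as fit; any surplus entries are ignored.
    for i, tag in enumerate(saved[:n_buffers]):
        slots[i] = tag
    return slots
```

With a smaller pool the tail of the saved list is dropped and those pages are read from disk on first access; with a larger pool the extra slots stay empty until requested.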

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: patch for new feature: Buffer Cache Hibernation

From
"Dickson S. Guedes"
Date:
2011/5/4 Greg Stark <gsstark@mit.edu>:
> On Wed, May 4, 2011 at 3:10 PM, Mitsuru IWASAKI <iwasaki@jp.freebsd.org> wrote:
>> Postgres usually starts with ZERO buffer cache.  By saving the buffer
>> cache data structure into hibernation files just before shutdown, and
>> loading them at startup, postgres can start operations with the saved
>> buffer cache as the same condition as just before the last shutdown.
>
> Offhand this seems pretty handy for benchmarks where it would help get
> reproducible results.

It could have an option to force it or not at postgres startup. This
could help in benchmark scenarios.

--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br


Re: patch for new feature: Buffer Cache Hibernation

From
Greg Stark
Date:
On Wed, May 4, 2011 at 4:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Do you have
> any proof that writing out a few GB of buffers and then reading them
> back in is actually much cheaper than letting the database re-read the
> data from the disk files?

I believe he's just writing out the metadata, i.e., which blocks to
re-read from the disk files.

-- 
greg


Re: patch for new feature: Buffer Cache Hibernation

From
Greg Smith
Date:
Alvaro Herrera wrote:
> As for gain, I have heard of test setups requiring hours of runtime in
> order to prime the buffer cache.
>   

And production ones too.  I have multiple customers where a server 
restart is almost a planned multi-hour downtime.  The system may be back 
up, but for a couple of hours performance is so terrible it's barely 
usable.  You can watch the MB/s ramp up as the more random data fills in 
over time; getting that taken care of in a larger block more amenable to 
elevator sorting would be a huge help.

I never bothered with this particular idea though because shared_buffers 
is only a portion of the important data.  Cedric's pgfincore code digs 
into the OS cache, too, which can then save enough to be really useful 
here.  And that's already got a snapshot/restore feature.  The slides at 
http://www.pgcon.org/2010/schedule/events/261.en.html have a useful intro 
to that, pages 30 through 34 are the neat ones.  That provides some 
other neat APIs for preloading popular data into cache too.  I'd rather 
work on getting something like that into core, rather than adding 
something that only is targeting just shared_buffers.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: patch for new feature: Buffer Cache Hibernation

From
Jeff Janes
Date:
On Wed, May 4, 2011 at 7:10 AM, Mitsuru IWASAKI <iwasaki@jp.freebsd.org> wrote:
> Hi,
>
> I am working on new feature `Buffer Cache Hibernation' which enables
> postgres to keep higher cache hit ratio even just started.
>
> Postgres usually starts with ZERO buffer cache.  By saving the buffer
> cache data structure into hibernation files just before shutdown, and
> loading them at startup, postgres can start operations with the saved
> buffer cache as the same condition as just before the last shutdown.
>
> Here is the patch for 9.0.3 (also tested on 8.4.7)
> http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-9.0.3.patch
>
> The patch includes the following.
> - At shutdown, buffer cache data structure (such as BufferDescriptors,
>  BufferBlocks and StrategyControl) is saved into hibernation files.
> - At startup, buffer cache data structure is loaded from hibernation
>  files and buffer lookup hashtable is setup based on buffer descriptors.
> - Above functions are enabled by specifying `enable_buffer_cache_hibernation=on'
>  in postgresql.conf.
>
> Any comments are welcome and I would very much appreciate merging the
> patch in source tree.
>
> Have fun and thanks!

It applies and builds against head with offsets and some fuzz.  It
fails make check, but apparently only because
src/test/regress/expected/rangefuncs.out needs to be updated to
include the new setting.  (Although all the other "enable%" settings
are for the planner, so making a new setting with that prefix that
does something else might be undesirable)

I think that PgFincore (http://pgfoundry.org/projects/pgfincore/)
provides similar functionality.  Are you familiar with that?  If so,
could you contrast your approach with that one?

Cheers,

Jeff


Re: patch for new feature: Buffer Cache Hibernation

From
Josh Berkus
Date:
All,

I thought that Dimitri had already implemented this using Fincore.  It's
linux-only, but that should work well enough to test the general concept.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: patch for new feature: Buffer Cache Hibernation

From
Dimitri Fontaine
Date:
Josh Berkus <josh@agliodbs.com> writes:
> I thought that Dimitri had already implemented this using Fincore.  It's
> linux-only, but that should work well enough to test the general concept.

Actually, Cédric did, and I have a clone of his repository where I did
some debian packaging of it.
http://villemain.org/projects/pgfincore
http://git.postgresql.org/gitweb?p=pgfincore.git;a=summary
http://git.postgresql.org/gitweb?p=pgfincore.git;a=tree

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: patch for new feature: Buffer Cache Hibernation

From
Cédric Villemain
Date:
2011/5/4 Josh Berkus <josh@agliodbs.com>:
> All,
>
> I thought that Dimitri had already implemented this using Fincore.  It's
> linux-only, but that should work well enough to test the general concept.

Harald gave me some pointers at pgday in Stuttgart on making it work
with Windows but ... hum, I don't have Windows and wasn't motivated
enough to make it work there if no one needs it.

I haven't checked the different kernels recently, but any kernel
supporting mincore and posix_fadvise should work (so probably the
same set of kernels that support our 'effective_io_concurrency').

Still waiting for (free)BSD support .....


--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: patch for new feature: Buffer Cache Hibernation

From
Mitsuru IWASAKI
Date:
Hi, thanks for good suggestions.

> > Postgres usually starts with ZERO buffer cache.  By saving the buffer
> > cache data structure into hibernation files just before shutdown, and
> > loading them at startup, postgres can start operations with the saved
> > buffer cache as the same condition as just before the last shutdown.
> 
> This seems like a lot of complication for rather dubious gain.  What
> happens when the DBA changes the shared_buffers setting, for instance?

That was actually my first concern.  The current implementation stops
reading the hibernation file when it detects a size mismatch between
shared_buffers and the hibernation file.  I think that is the safe way.
As Alvaro Herrera mentioned, it would be possible to adjust how buffer
blocks are copied, but I think the shared_buffers setting is not
changed very often.

> How do you protect against the cached buffers getting out-of-sync with
> the actual disk files (especially during recovery scenarios)?  What

Saving the DB buffer cache is done at shutdown, after bgwriter's final
checkpoint has finished, so I believe no dirty buffers should
exist.
For recovery scenarios, I still need to research it...
Could you describe what needs to be considered?

> about crash-induced corruption in the cache file itself (consider the
> not-unlikely possibility that init will kill the database before it's
> had time to dump all the buffers during a system shutdown)?  Do you have

I think this is important point.  I'll implement validation function for
hibernation file.

> any proof that writing out a few GB of buffers and then reading them
> back in is actually much cheaper than letting the database re-read the
> data from the disk files?

I think this means sequential-read vs scattered-read.
The largest hibernation file is for buffer blocks, and sequential-read
from it would be much faster than scattered-read from database file
via smgrread() block by block.
As Greg Stark suggested, re-reading from database file based on buffer
descriptors was one of implementation candidates (it can reduce
storage consumption for hibernation), but I chose creating buffer
blocks raw image file and reading it for the performance.


Thanks


Re: patch for new feature: Buffer Cache Hibernation

From
Mitsuru IWASAKI
Date:
Hi,

> I think that PgFincore (http://pgfoundry.org/projects/pgfincore/)
> provides similar functionality.  Are you familiar with that?  If so,
> could you contrast your approach with that one?

I'm not familiar with PgFincore at all sorry, but I got source code
and documents and read through them just now.
# and I'm a novice on postgres actually...
The target both is to reduce physical I/O, but their approaches and
gains are different.
My understanding is like this;

+---------------------+     +---------------------+
| Postgres(backend)   |     | Postgres            |
| +-----------------+ |     |                     |
| | DB Buffer Cache | |     |                     |
| | (shared buffers)| |     |                     |
| |*my target       | |     |                     |
| +-----------------+ |     |                     |
|   ^      ^          |     |                     |
|   |      |          |     |                     |
|   v      v          |     |                     |
| +-----------------+ |     | +-----------------+ |
| |  buffer manager | |     | |    pgfincore    | |
| +-----------------+ |     | +-----------------+ |
+---^------^----------+     +----------^----------+
    |      |smgrread()                 |posix_fadvise()
    |read()|                           |                 userland
==================================================================
    |      |                           |                 kernel
    |      +-------------+-------------+
    |                    |
    |                    v
    |       +------------------------+
    |       | File System            |
    |       |   +-----------------+  |
    +------>|   | FS Buffer Cache |  |
            |   |*PgFincore target|  |
            |   +-----------------+  |
            |    ^       ^           |
            +----|-------|-----------+
                 |       |
==================================================================
                 |       |                               hardware
       +---------|-------|----------------+
       |         |       v  Physical Disk |
       |         |   +------------------+ |
       |         |   | base/16384/24598 | |
       |         v   +------------------+ |
       | +------------------------------+ |
       | |Buffer Cache Hibernation Files| |
       | +------------------------------+ |
       +----------------------------------+

In summary, PgFincore's target is File System Buffer Cache, Buffer
Cache Hibernation's target is DB Buffer Cache(shared buffers).

PgFincore is trying to preload database file by posix_fadvise() into
File System Buffer Cache, not into DB Buffer Cache(shared buffers).
On query execution, buffer manager will get DB buffer blocks by
smgrread() from file system unless necessary blocks exist in DB Buffer
Cache.  At this point, physical reads may not happen because part of
(or entire) database file is already loaded into FS Buffer Cache.

The gain depends on the file system, especially size of File System
Buffer Cache.
Preloading database file is equivalent to following command in short.
$ cat base/16384/24598 > /dev/null

I think PgFincore is good for data warehouse in applications.


Buffer Cache Hibernation, my approach, is more simple and straight forward.
It try to save/load the contents of DB Buffer Cache(shared buffers) using
regular files(called Buffer Cache Hibernation Files).
At startup, buffer manager will load DB buffer blocks into DB Buffer
Cache from Buffer Cache Hibernation Files which was saved at the last
shutdown.  Note that database file will not be read, so it is not
cached in File System Buffer Cache at all.  Only contents of DB Buffer
Cache are filled.  Therefore, the DB buffer cache miss penalty would
be larger than PgFincore's.

The gain depends on the size of shared buffers, and how often the
similar queries are executed before and after restarting.

Buffer Cache Hibernation is good for OLTP in applications.


I think that PgFincore and Buffer Cache Hibernation is not exclusive,
they can co-work together in different caching levels.



Sorry for my poor english skill, but I'm doing my best :)

Thanks


Re: patch for new feature: Buffer Cache Hibernation

From
Cédric Villemain
Date:
2011/5/5 Mitsuru IWASAKI <iwasaki@jp.freebsd.org>:
> Hi,
>
>> I think that PgFincore (http://pgfoundry.org/projects/pgfincore/)
>> provides similar functionality.  Are you familiar with that?  If so,
>> could you contrast your approach with that one?
>
> I'm not familiar with PgFincore at all sorry, but I got source code
> and documents and read through them just now.
> # and I'm a novice on postgres actually...
> The target both is to reduce physical I/O, but their approaches and
> gains are different.
> My understanding is like this;
>
> +---------------------+     +---------------------+
> | Postgres(backend)   |     | Postgres            |
> | +-----------------+ |     |                     |
> | | DB Buffer Cache | |     |                     |
> | | (shared buffers)| |     |                     |
> | |*my target       | |     |                     |
> | +-----------------+ |     |                     |
> |   ^      ^          |     |                     |
> |   |      |          |     |                     |
> |   v      v          |     |                     |
> | +-----------------+ |     | +-----------------+ |
> | |  buffer manager | |     | |    pgfincore    | |
> | +-----------------+ |     | +-----------------+ |
> +---^------^----------+     +----------^----------+
>    |      |smgrread()                 |posix_fadvise()
>    |read()|                           |                 userland
> ==================================================================
>    |      |                           |                 kernel
>    |      +-------------+-------------+
>    |                    |
>    |                    v
>    |       +------------------------+
>    |       | File System            |
>    |       |   +-----------------+  |
>    +------>|   | FS Buffer Cache |  |
>            |   |*PgFincore target|  |
>            |   +-----------------+  |
>            |    ^       ^           |
>            +----|-------|-----------+
>                 |       |
> ==================================================================
>                 |       |                               hardware
>       +---------|-------|----------------+
>       |         |       v  Physical Disk |
>       |         |   +------------------+ |
>       |         |   | base/16384/24598 | |
>       |         v   +------------------+ |
>       | +------------------------------+ |
>       | |Buffer Cache Hibernation Files| |
>       | +------------------------------+ |
>       +----------------------------------+
>

A small detail: pgfincore stores its data per relation in a file, like you do.
I have rewritten that part a bit, so it will store its data directly in
postgresql tables and will also be able to restore the cache
from a raw bitstring.

> In summary, PgFincore's target is File System Buffer Cache, Buffer
> Cache Hibernation's target is DB Buffer Cache(shared buffers).

Correct. (btw I am very happy with your idea and that you found time to do it)

>
> PgFincore is trying to preload database file by posix_fadvise() into
> File System Buffer Cache, not into DB Buffer Cache(shared buffers).
> On query execution, buffer manager will get DB buffer blocks by
> smgrread() from file system unless necessary blocks exist in DB Buffer
> Cache.  At this point, physical reads may not happen because part of
> (or entire) database file is already loaded into FS Buffer Cache.
>
> The gain depends on the file system, especially size of File System
> Buffer Cache.
> Preloading database file is equivalent to following command in short.
> $ cat base/16384/24598 > /dev/null

Not exactly.

There are 2 calls:
* pgfadv_WILLNEED
* pgfadv_WILLNEED_snapshot

The former asks to load each segment of a relation *but* the kernel can
decide not to do that, or to load only part of each segment (so it is not
as brutal as cat file > /dev/null).
The latter reads *exactly* the blocks required in each segment, not all
blocks unless all were in cache when the snapshot was taken (this one
is part of the snapshot/restore combo).
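
The difference between the two calls can be sketched with the plain posix_fadvise() interface. This is not pgfincore's code: the function names, the 8192-byte block size and the list-of-booleans snapshot format are assumptions for illustration, and the second function only issues per-block hints, which approximates the "exact blocks" behaviour. It needs a platform where Python exposes `os.posix_fadvise` (for example Linux).

```python
import os
from typing import Iterable

BLOCK_SIZE = 8192  # PostgreSQL's default page size; assumed here


def advise_whole_file(path: str) -> None:
    """Hint that the whole file will be needed; the kernel may load less."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and len=0 cover the file from start to end.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)


def advise_blocks(path: str, cached: Iterable[bool]) -> int:
    """Hint only the blocks flagged True in a snapshot; return how many."""
    fd = os.open(path, os.O_RDONLY)
    hinted = 0
    try:
        for block_no, was_cached in enumerate(cached):
            if was_cached:
                os.posix_fadvise(fd, block_no * BLOCK_SIZE, BLOCK_SIZE,
                                 os.POSIX_FADV_WILLNEED)
                hinted += 1
    finally:
        os.close(fd)
    return hinted
```

Neither function reads data into the process; both only ask the kernel to bring pages into the file system cache.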

>
> I think PgFincore is good for data warehouse in applications.

Pgfincore with bitstring storage in a table allows streaming to
HotStandbys, and gives better response in case of switch-over/fail-over
by doing some house-keeping on the HotStandby to keep it really hot
;)

Even web applications have large databases today ....

(there is more, but it is not the subject here)

>
>
> Buffer Cache Hibernation, my approach, is more simple and straight forward.
> It try to save/load the contents of DB Buffer Cache(shared buffers) using
> regular files(called Buffer Cache Hibernation Files).
> At startup, buffer manager will load DB buffer blocks into DB Buffer
> Cache from Buffer Cache Hibernation Files which was saved at the last
> shutdown.  Note that database file will not be read, so it is not
> cached in File System Buffer Cache at all.  Only contents of DB Buffer
> Cache are filled.  Therefore, the DB buffer cache miss penalty would
> be larger than PgFincore's.
>
> The gain depends on the size of shared buffers, and how often the
> similar queries are executed before and after restarting.
>
> Buffer Cache Hibernation is good for OLTP in applications.

It is very helpful for debugging and analysis purposes, also, IIUC.
I may prefer the per relation approach (so you can snapshot and
restore only the interesting tables/index). Given what I read in your
patch it looks easy to do, isn't it ?

I also prefer the idea of keeping a map of the Buffer Cache (yes, like
what I do with pgfincore) rather than storing the data directly and
reading it back directly. This latter part seems a bit dangerous to me,
even if it looks sane for a normal postgresql stop/start cycle.

>
>
> I think that PgFincore and Buffer Cache Hibernation is not exclusive,
> they can co-work together in different caching levels.

Yes.

>
>
>
> Sorry for my poor english skill, but I'm doing my best :)

Better than mine, and anyway your patch remains very easy to read in any case.

>
> Thanks
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: patch for new feature: Buffer Cache Hibernation

From
Mitsuru IWASAKI
Date:
Hi,

I revised the patch against HEAD, it's available at:
http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110506.patch

Implemented hibernation file validations:
- comparison with pg_control
  At shutdown:
    pg_control state should be DB_SHUTDOWNED.
  At startup:
    pg_control state should be DB_SHUTDOWNED.
    hibernation files should be newer than pg_control.

- CRC check
  At shutdown:
    compute CRC values for hibernation files and store them into a file.
  At startup:
    CRC values for hibernation files should match those read from the
    file created at shutdown.

- file size
  At startup:
    The size of each hibernation file should match the file size
    calculated from shared_buffers.

- buffer descriptors validation
  At startup:
    The descriptor flags should not include BM_DIRTY, BM_IO_IN_PROGRESS,
    BM_IO_ERROR, BM_JUST_DIRTIED and BM_PIN_COUNT_WAITER.
    Sanity checks for usage_count and usage_count should be done.
    (wait_backend_pid is zero-cleared because the process was terminated already)

- system call error checking
  At shutdown and startup:
    Return values of system calls (e.g. open(), read(), write(),
    etc.) should be checked.
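
The startup checks listed above can be sketched roughly as follows. This is not the patch's C code: the function names, the use of file modification times for the "newer than pg_control" test, zlib's CRC-32 and the flag bit values are all assumptions for illustration (the real flags are defined in PostgreSQL's buf_internals.h).

```python
import os
import zlib

BLOCK_SIZE = 8192  # default PostgreSQL page size; assumed here

# Illustrative bit values only, not taken from the patch.
BM_DIRTY = 1 << 0
BM_IO_IN_PROGRESS = 1 << 3
BM_IO_ERROR = 1 << 4
BM_JUST_DIRTIED = 1 << 5
BM_PIN_COUNT_WAITER = 1 << 6
FORBIDDEN_FLAGS = (BM_DIRTY | BM_IO_IN_PROGRESS | BM_IO_ERROR
                   | BM_JUST_DIRTIED | BM_PIN_COUNT_WAITER)


def file_crc(path: str) -> int:
    """CRC-32 of a file, computed in 1 MB chunks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return crc


def hibernation_file_usable(path: str, control_path: str, saved_crc: int,
                            n_buffers: int, clean_shutdown: bool) -> bool:
    """Apply the pg_control, age, size and CRC checks to a block image file."""
    if not clean_shutdown:  # pg_control state was not DB_SHUTDOWNED
        return False
    if os.path.getmtime(path) < os.path.getmtime(control_path):
        return False        # hibernation file is older than pg_control
    if os.path.getsize(path) != n_buffers * BLOCK_SIZE:
        return False        # shared_buffers changed since the file was written
    return file_crc(path) == saved_crc


def descriptor_flags_ok(flags: int) -> bool:
    """Reject descriptors carrying any flag that must not survive a restart."""
    return (flags & FORBIDDEN_FLAGS) == 0
```

Any failed check simply means the files are ignored and the server starts with an empty cache, as it does today.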

> > How do you protect against the cached buffers getting out-of-sync with
> > the actual disk files (especially during recovery scenarios)?  What
> 
> Saving the DB buffer cache is done at shutdown, after bgwriter's final
> checkpoint has finished, so I believe no dirty buffers should
> exist.
> For recovery scenarios, I still need to research it...
> Could you describe what needs to be considered?

I think hibernation should be allowed only when the system was shut down
normally, which is checked via the pg_control state.
Once an abnormal shutdown is detected, the hibernation files
should be ignored.
The latest patch includes this.
# modifications to xlog.c:ReadControlFile() were required though...

> > about crash-induced corruption in the cache file itself (consider the
> > not-unlikely possibility that init will kill the database before it's
> > had time to dump all the buffers during a system shutdown)?  Do you have
> 
> I think this is important point.  I'll implement validation function for
> hibernation file.

The added validations seem sufficient to me.
# because my understanding on postgres is not enough ;)
If any other considerations are required, please point them out.

Thanks


Re: patch for new feature: Buffer Cache Hibernation

From
Greg Smith
Date:
On 05/05/2011 05:06 AM, Mitsuru IWASAKI wrote:
> In summary, PgFincore's target is File System Buffer Cache, Buffer
> Cache Hibernation's target is DB Buffer Cache(shared buffers).
>    

Right.  The thing to realize is that shared_buffers is becoming a 
smaller fraction of the total RAM used by the database every year.  On 
Windows it's been stuck at useful settings being less than 512MB for a 
while now.  And on UNIX systems, around 8GB seems to be effective upper 
limit.  Best case, shared_buffers is only going to be around 25% of 
total RAM; worst-case, approximately, you might have Windows server with 
64GB of RAM where shared_buffers is less than 1% of total RAM.

There's nothing wrong with the general idea you're suggesting.  It's 
just only targeting a small (and shrinking) subset of the real problem 
here.  Rebuilding cache state starts with shared_buffers, but that's not 
enough of the problem to be an effective tweak on many systems.

I think that all the complexity with CRCs etc. is unlikely to lead 
anywhere too, and those two issues are not completely unrelated.  The 
simplest, safest thing here is the right way to approach this, not the 
most complicated one, and a simpler format might add some flexibility 
here to reload more cache state too.  The bottleneck on reloading the 
cache state is reading everything from disk.  Trying to micro-optimize 
any other part of that is moving in the wrong direction to me.  I doubt 
you'll ever measure a useful benefit that overcomes the expense of 
maintaining the code.  And you seem to be moving to where someone can't 
restore cache state when they change shared_buffers.  A simpler 
implementation might still work in that situation; reload until you run 
out of buffers if shared_buffers shrinks, reload until you're done with 
the original size.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us




Re: patch for new feature: Buffer Cache Hibernation

From
Robert Haas
Date:
On Fri, May 6, 2011 at 5:31 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 05/05/2011 05:06 AM, Mitsuru IWASAKI wrote:
>>
>> In summary, PgFincore's target is File System Buffer Cache, Buffer
>> Cache Hibernation's target is DB Buffer Cache(shared buffers).
>>
>
> Right.  The thing to realize is that shared_buffers is becoming a smaller
> fraction of the total RAM used by the database every year.  On Windows it's
> been stuck at useful settings being less than 512MB for a while now.  And on
> UNIX systems, around 8GB seems to be effective upper limit.  Best case,
> shared_buffers is only going to be around 25% of total RAM; worst-case,
> approximately, you might have Windows server with 64GB of RAM where
> shared_buffers is less than 1% of total RAM.
>
> There's nothing wrong with the general idea you're suggesting.  It's just
> only targeting a small (and shrinking) subset of the real problem here.
>  Rebuilding cache state starts with shared_buffers, but that's not enough of
> the problem to be an effective tweak on many systems.
>
> I think that all the complexity with CRCs etc. is unlikely to lead anywhere
> too, and those two issues are not completely unrelated.  The simplest,
> safest thing here is the right way to approach this, not the most
> complicated one, and a simpler format might add some flexibility here to
> reload more cache state too.  The bottleneck on reloading the cache state is
> reading everything from disk.  Trying to micro-optimize any other part of
> that is moving in the wrong direction to me.  I doubt you'll ever measure a
> useful benefit that overcomes the expense of maintaining the code.  And you
> seem to be moving to where someone can't restore cache state when they
> change shared_buffers.  A simpler implementation might still work in that
> situation; reload until you run out of buffers if shared_buffers shrinks,
> reload until you're done with the original size.

Yeah, I'm pretty well convinced this whole approach is a dead end.
Priming the OS buffer cache seems way more useful.  I also think
saving the blocks to be read rather than the actual blocks makes a lot
more sense.
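
A minimal sketch of "saving the blocks to be read rather than the actual blocks", assuming a JSON file of (relation path, block number) pairs. The file format and the names `save_block_map` and `load_block_map` are hypothetical; nothing like this exists in the posted patch.

```python
import json
from typing import List, Tuple

# Simplified block identifier: (relation file path, block number).
BlockTag = Tuple[str, int]


def save_block_map(path: str, tags: List[BlockTag]) -> None:
    """Write only the identifiers of cached blocks, not their contents."""
    with open(path, "w") as f:
        json.dump(tags, f)


def load_block_map(path: str, n_buffers: int) -> List[BlockTag]:
    """Return the blocks to prefetch, limited to the current pool size."""
    with open(path) as f:
        tags = [(rel, blk) for rel, blk in json.load(f)]
    # Keep what fits, then order by file and block number so the
    # resulting reads are mostly sequential within each file.
    return sorted(tags[:n_buffers])
```

Because the pages themselves are re-read from the database files, a stale or damaged map can at worst cause useless reads, not wrong data in the cache.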

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for new feature: Buffer Cache Hibernation

From
Mitsuru IWASAKI
Date:
Hi, thanks for your comments!
I'm glad to discuss this topic.

>  * pgfadv_WILLNEED
>  * pgfadv_WILLNEED_snapshot
> 
> The former asks to load each segment of a relation, *but* the kernel can
> decide not to do that, or to load only part of each segment. (So it is not
> as brutal as cat file > /dev/null.)
> The latter reads *exactly* the blocks required in each segment, not all
> blocks, unless all of them were in cache when the snapshot was taken. (This
> one is part of the snapshot/restore combo.)

Sorry about that, I'm not so familiar with posix_fadvise().
I'll check posix_fadvise() later.
Actually, I used to run a 'cat database_file > /dev/null' script on
another DBMS before starting it.
# or 'select /*+ INDEX(emp emp_pk) */ count(*) from emp;' to load
# index blocks

> I may prefer the per relation approach (so you can snapshot and
> restore only the interesting tables/index). Given what I read in your
> patch it looks easy to do, isn't it ?

I would like to keep my patch as simple as possible, because
it is just a hibernation function, not complicated buffer management.
But I want to try improving buffer management on my next vacation.
# I'm currently on an 11-day vacation until Sunday.

My rough idea for improving buffer management is like this;
SQL> alter table table_name buffer pin priority 7;
SQL> alter index index_name buffer pin priority 10;

This DDL sets a 'buffer pin priority' property on the table/index and
also on the buffer descriptors related to that table/index.
Optionally, preloading database files into the FS cache and relation
blocks into the DB cache would be possible.

When a new buffer is required, the buffer manager refers to the priority
of each buffer and selects a victim buffer.

I think it helps batch jobs run in a better buffer cache condition
by giving hints to buffer management.
For example, job-A reads table_A and index_A and writes only to table_B;
SQL> alter table table_A buffer pin priority 7;
SQL> alter index index_A buffer pin priority 10;
SQL> alter table table_B buffer pin priority 1;
This keeps the buffers of index_A and table_A (table_B's buffers will soon become victims).

Buffer pin priority can be reset like this;
SQL> alter system buffer pin priority 5;

Next job-B reads and writes table_C, reads index_C with preloading;
SQL> alter table table_C buffer pin priority 5;
SQL> alter index index_C buffer pin priority 10 with preloading 50%;
something like this.
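
A minimal sketch of how such a victim selection might look, assuming a
hypothetical per-buffer `pin_priority` field. Neither `PrioBuffer` nor
`pick_victim` exists in PostgreSQL or in the posted patch; the sketch simply
evicts the lowest-priority unused buffer and uses the usage count to break ties.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor: only the fields this sketch needs. */
typedef struct
{
    int pin_priority;   /* assumed scale: 1 (evict first) .. 10 (keep longest) */
    int usage_count;    /* lower means less recently used */
    int refcount;       /* > 0 means some backend is using the buffer */
} PrioBuffer;

/*
 * Return the index of the buffer to evict, or -1 if every buffer is in use.
 * Lowest priority wins; among equal priorities, the lowest usage count wins.
 */
int
pick_victim(const PrioBuffer *bufs, size_t nbufs)
{
    int victim = -1;

    assert(bufs != NULL);
    for (size_t i = 0; i < nbufs; i++)
    {
        if (bufs[i].refcount > 0)
            continue;           /* never evict a buffer that is in use */
        if (victim < 0 ||
            bufs[i].pin_priority < bufs[victim].pin_priority ||
            (bufs[i].pin_priority == bufs[victim].pin_priority &&
             bufs[i].usage_count < bufs[victim].usage_count))
            victim = (int) i;
    }
    return victim;
}
```

With the job-A settings above, buffers of table_B (priority 1) would be chosen
before those of table_A (7) or index_A (10), as long as they are not in use.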

> I also prefer the idea to keep a map of the Buffer Cache (yes, like
> what I do with pgfincore) than storing the data directly and reading
> it directly. This latter part seems a bit dangerous to me, even if it
> looks sane from a normal postgresql stop/start process.

Never mind :)
I added enough validations and will add more.

> better than me, and anyway your patch remains very easy to read in any case.

Thanks a lot!  My policy for experimental implementations is to keep them
easy to read so that people understand my idea quickly.
That's why my first patch doesn't have enough error checking ;)

Thanks




Re: patch for new feature: Buffer Cache Hibernation

From
Robert Haas
Date:
On Sat, May 7, 2011 at 3:32 AM, Mitsuru IWASAKI <iwasaki@jp.freebsd.org> wrote:
> I have one more day for working on this, but I may give up...

I think this is an interesting line of inquiry, but if you were hoping
to get something committable in a couple of days, you had unrealistic
expectations...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for new feature: Buffer Cache Hibernation

From
Mitsuru IWASAKI
Date:
Hi, folks!

> I'll do more testing tomorrow, and hopefully finalize my patch.

Done!  the patch is available at:
http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110508.patch 

I hope this would be committable and the final version.
Major changes from the experimental implementation are the following.

- add many validations against hibernation file corruption and so on.
- restore buffer blocks based on buffer descriptors, not from the saved file.
- support restoring cache state even if shared_buffers had changed.

My vacation ends today and I have to go back to work tomorrow,
but I will try to find spare time for this.

Thanks a lot for happy hacking days with you!


Re: patch for new feature: Buffer Cache Hibernation

From
Greg Smith
Date:
Mitsuru IWASAKI wrote:
> the patch is available at:
> http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110508.patch 
>   

We can't accept patches just based on a pointer to a web site.  Please 
e-mail this to the mailing list so that it can be considered a 
submission under the project's licensing terms.

> I hope this would be committable and the final version.
>   

PostgreSQL has high standards for code submissions.  Extremely few 
submissions are committed without significant revisions to them based on 
code review.  So far you've gotten a first round of high-level design 
review, there's several additional steps before something is considered 
for a commit.  The whole process is outlined at 
http://wiki.postgresql.org/wiki/Submitting_a_Patch
From a couple of minutes of reading the patch, the first things that 
pop out as problems are:

-All of the ControlFile -> controlFile renaming has added a larger 
difference to ReadControlFile than I would consider ideal.
-Touching StrategyControl is not something this patch should be doing.
-I don't think your justification ("debugging or portability") for 
keeping around your original code in here is going to be sufficient to 
do so.
-This should not be named enable_buffer_cache_hibernation.  That very 
large diff you ended up with in the regression tests is because all of 
the settings named enable_* are optimizer control settings.  Using the 
name "buffer_cache_hibernation" instead would make a better starting point.
From a bigger picture perspective, this really hasn't addressed any of 
my comments about shared_buffers only being the beginning of the useful 
cache state to worry about here.  I'd at least like the solution to the 
buffer cache save/restore to have a plan for how it might address that 
too one day.  This project is also picky about only committing code that 
fits into the long-term picture for desired features.

Having a working example of a server-side feature doing cache storage 
and restoration is helpful though.  Don't think your work here is 
unappreciated--it is.  Getting this feature added is just a harder 
problem than what you've done so far.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us




Re: patch for new feature: Buffer Cache Hibernation

From
Heikki Linnakangas
Date:
On 08.05.2011 07:58, Mitsuru IWASAKI wrote:
>> I'll do more testing tomorrow, and hopefully finalize my patch.
>
> Done!  the patch is available at:
> http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110508.patch

I'd suggest doing this as an extension module. All the changes to 
existing server code seem superficial.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: patch for new feature: Buffer Cache Hibernation

From
Mitsuru IWASAKI
Date:
Hi,
Sorry, I missed these messages because I didn't subscribe to this list.
# I've just subscribed temporarily

> > I think that all the complexity with CRCs etc. is unlikely to lead anywhere
> > too, and those two issues are not completely unrelated.  The simplest,
> > safest thing here is the right way to approach this, not the most
> > complicated one, and a simpler format might add some flexibility here to
> > reload more cache state too.  The bottleneck on reloading the cache state is
> > reading everything from disk.  Trying to micro-optimize any other part of
> > that is moving in the wrong direction to me.  I doubt you'll ever measure a
> > useful benefit that overcomes the expense of maintaining the code.  And you
> > seem to be moving to where someone can't restore cache state when they
> > change shared_buffers.  A simpler implementation might still work in that
> > situation; reload until you run out of buffers if shared_buffers shrinks,
> > reload until you're done with the original size.
>
> Yeah, I'm pretty well convinced this whole approach is a dead end.
> Priming the OS buffer cache seems way more useful.  I also think
> saving the blocks to be read rather than the actual blocks makes a lot
> more sense.

OK, there are two suggestions from you here, IIUC.
# if not, please correct me.
1. restore buffer blocks based on buffer descriptors, not from the saved file.
2. support restoring cache state even if shared_buffers had changed.

For 1, I've just finished my work.  The latest patch is available at:
http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch

On my box, shared_buffers can be set up to only 200MB.
Elapsed time for starting up is almost the same, about 3 sec (w/o
hibernation takes about 1 sec).
For shutdown, writing buffer blocks takes about 10 sec, otherwise
about 1 sec.

Well, it seems you were right :)
By restoring buffer blocks based on buffer descriptors, the OS buffer
cache will be filled too.  I believe this can help buffer update
performance.

I think saving buffer blocks is still useful for debugging or portability,
so I would like to keep the support code in my patch.


For 2, I'm not sure how to implement this.
The problem is that freelist.c:StrategyControl is also restored at
startup, and I currently have no idea how to adjust StrategyControl
when shared_buffers has changed.
StrategyControl holds important data on buffer allocation, so it should
match shared_buffers, I believe.

Changing shared_buffers does not happen often in a production environment.
The current implementation works like this:
if shared_buffers has changed, restoring is skipped for that startup only,
saving is done with the new shared_buffers at shutdown, and restoring
is done again at the next startup.
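
For comparison, the simpler behaviour suggested earlier in the thread (reload
until the buffers run out if shared_buffers shrinks, or until the saved list is
exhausted if it grows) could be sketched as follows. `SavedTag` and
`restore_tags` are hypothetical names, not part of the posted patch, and the
copy stands in for whatever actually reads a block back in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical identity of one saved buffer (relation, fork, block number). */
typedef struct
{
    unsigned rel;
    unsigned fork;
    unsigned block;
} SavedTag;

/*
 * Restore as many saved entries as fit into the current buffer pool.
 * Returns the number of entries restored; extra saved entries are ignored
 * and extra pool slots simply stay empty.
 */
size_t
restore_tags(const SavedTag *saved, size_t nsaved,
             SavedTag *pool, size_t npool)
{
    size_t n = nsaved < npool ? nsaved : npool;

    assert(saved != NULL && pool != NULL);
    for (size_t i = 0; i < n; i++)
        pool[i] = saved[i];     /* a real implementation would read the block here */
    return n;
}
```

This sidesteps the StrategyControl question only because it restores nothing
but block identities; it is a sketch of the idea, not a drop-in replacement.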

I have one more day for working on this, but I may give up...

Thanks


Re: patch for new feature: Buffer Cache Hibernation

From
Mitsuru IWASAKI
Date:
Hi,

> We can't accept patches just based on a pointer to a web site.  Please 
> e-mail this to the mailing list so that it can be considered a 
> submission under the project's licensing terms.
> 
> > I hope this would be committable and the final version.
> >   
> 
> PostgreSQL has high standards for code submissions.  Extremely few 
> submissions are committed without significant revisions to them based on 
> code review.  So far you've gotten a first round of high-level design 
> review, there's several additional steps before something is considered 
> for a commit.  The whole process is outlined at 
> http://wiki.postgresql.org/wiki/Submitting_a_Patch

OK, I will do so for my next patch.

>  From a couple of minutes of reading the patch, the first things that 
> pop out as problems are:
> 
> -All of the ControlFile -> controlFile renaming has added a larger 
> difference to ReadControlFile than I would consider ideal.

I think so too, I will consider this again.

> -Touching StrategyControl is not something this patch should be doing.

Sorry, I did not understand this.  Could you explain?
I think StrategyControl needs to be adjusted if shared_buffers setting
was changed.

> -I don't think your justification ("debugging or portability") for 
> keeping around your original code in here is going to be sufficient to 
> do so.
> -This should not be named enable_buffer_cache_hibernation.  That very 
> large diff you ended up with in the regression tests is because all of 
> the settings named enable_* are optimizer control settings.  Using the 
> name "buffer_cache_hibernation" instead would make a better starting point.

OK, how about `buffer_cache_hibernation_level'?
The value would be 0 to disable (default), 1 to save buffer descriptors only,
and 2 to save both buffer descriptors and buffer blocks.

>  From a bigger picture perspective, this really hasn't addressed any of 
> my comments about shared_buffers only being the beginning of the useful 
> cache state to worry about here.  I'd at least like the solution to the 
> buffer cache save/restore to have a plan for how it might address that 
> too one day.  This project is also picky about only committing code that 
> fits into the long-term picture for desired features.

My simple motivation for this is: `We don't want to restart our DB
server, because the DB buffer cache will be lost and the DB server
will need to start its operations with zero cache.  Does any DBMS product
support keeping the contents of the DB cache as they are across a restart,
just like the hibernation feature of a PC?'.
It's very simple, and I think many DB admins will soon be happy with this
feature.

Thanks


Re: patch for new feature: Buffer Cache Hibernation

From
Mitsuru IWASAKI
Date:
Hi,

> I'd suggest doing this as an extension module. All the changes to 
> existing server code seem superficial.

It sounds interesting.  I'll try it later.
Are there any good examples for extension module?

Thanks


Re: patch for new feature: Buffer Cache Hibernation

From
"Kevin Grittner"
Date:
Mitsuru IWASAKI  wrote:
> Are there any good examples for extension module?
Browse the subdirectories of contrib.
-Kevin



Re: patch for new feature: Buffer Cache Hibernation

From
Robert Haas
Date:
On Fri, May 6, 2011 at 5:31 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> I think that all the complexity with CRCs etc. is unlikely to lead anywhere
> too, and those two issues are not completely unrelated.  The simplest,
> safest thing here is the right way to approach this, not the most
> complicated one, and a simpler format might add some flexibility here to
> reload more cache state too.  The bottleneck on reloading the cache state is
> reading everything from disk.  Trying to micro-optimize any other part of
> that is moving in the wrong direction to me.  I doubt you'll ever measure a
> useful benefit that overcomes the expense of maintaining the code.  And you
> seem to be moving to where someone can't restore cache state when they
> change shared_buffers.  A simpler implementation might still work in that
> situation; reload until you run out of buffers if shared_buffers shrinks,
> reload until you're done with the original size.

I don't think there's any need for this to get data into
shared_buffers at all.  Getting it into the OS cache oughta be plenty
sufficient, no?

ISTM that a very simple approach here would be to save the contents of
each shared buffer on clean shutdown, and to POSIX_FADV_WILLNEED those
buffers on startup.  We could worry about additional complexity, like
using fincore to probe the OS cache, in a follow-on patch.  While
reloading only 8GB of maybe 30GB of cached data on restart would not
be as good as reloading all of it, it would be a lot better than
reloading none of it, and the gymnastics required seems substantially
less.
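
A minimal sketch of the startup half of this idea, assuming the shutdown step
saved a list of block numbers per relation file rather than page contents.
`prewarm_blocks` and `SKETCH_BLCKSZ` are hypothetical names; only
posix_fadvise() and POSIX_FADV_WILLNEED are real interfaces, and the kernel is
free to ignore the hint.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192      /* assumed block size (PostgreSQL's default) */

/*
 * Hint the kernel that the listed blocks of one file will be needed soon.
 * Returns the number of blocks successfully hinted, or -1 if the file
 * cannot be opened.
 */
int
prewarm_blocks(const char *path, const unsigned *blocks, size_t nblocks)
{
    int fd;
    int hinted = 0;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    for (size_t i = 0; i < nblocks; i++)
    {
        off_t offset = (off_t) blocks[i] * SKETCH_BLCKSZ;

        /* posix_fadvise() returns 0 on success and an error number otherwise */
        if (posix_fadvise(fd, offset, SKETCH_BLCKSZ, POSIX_FADV_WILLNEED) == 0)
            hinted++;
    }
    close(fd);
    return hinted;
}
```

Because the call only queues readahead, it returns quickly; the data reaches
the OS cache in the background, not shared_buffers.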

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for new feature: Buffer Cache Hibernation

From
Cédric Villemain
Date:
2011/5/15 Robert Haas <robertmhaas@gmail.com>:
> On Fri, May 6, 2011 at 5:31 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> I think that all the complexity with CRCs etc. is unlikely to lead anywhere
>> too, and those two issues are not completely unrelated.  The simplest,
>> safest thing here is the right way to approach this, not the most
>> complicated one, and a simpler format might add some flexibility here to
>> reload more cache state too.  The bottleneck on reloading the cache state is
>> reading everything from disk.  Trying to micro-optimize any other part of
>> that is moving in the wrong direction to me.  I doubt you'll ever measure a
>> useful benefit that overcomes the expense of maintaining the code.  And you
>> seem to be moving to where someone can't restore cache state when they
>> change shared_buffers.  A simpler implementation might still work in that
>> situation; reload until you run out of buffers if shared_buffers shrinks,
>> reload until you're done with the original size.
>
> I don't think there's any need for this to get data into
> shared_buffers at all.  Getting it into the OS cache oughta be plenty
> sufficient, no?
>
> ISTM that a very simple approach here would be to save the contents of
> each shared buffer on clean shutdown, and to POSIX_FADV_WILLNEED those
> buffers on startup.

+1
It is just an evolution of the current process if I understood the
explanations of the latest patch correctly.

> We could worry about additional complexity, like
> using fincore to probe the OS cache, in a follow-on patch.  While
> reloading only 8GB of maybe 30GB of cached data on restart would not
> be as good as reloading all of it, it would be a lot better than
> reloading none of it, and the gymnastics required seems substantially
> less.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: patch for new feature: Buffer Cache Hibernation

From
Greg Smith
Date:
On 05/07/2011 03:32 AM, Mitsuru IWASAKI wrote:
> For 1, I've just finished my work.  The latest patch is available at:
> http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch
>    

Reminder here--we can't accept code based on it being published to a web 
page.  You'll need to e-mail it to the pgsql-hackers mailing list to be 
considered for the next PostgreSQL CommitFest, which is starting in a 
few weeks.  Code submitted to the mailing list is considered a release 
of it to the project under the PostgreSQL license, which we can't just 
assume for things when given only a URL to them.

Also, you suggested you were out of time to work on this.  If that's the 
case, we'd like to know that so we don't keep cc'ing you about things in 
expectation of an answer.  Someone else may pick this up as a project to 
continue working on.  But it's going to need a fair amount of revision 
before it matches what people want here, and I'm not sure how much of 
what you've written is going to end up in any commit that may happen 
from this idea.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us




Re: patch for new feature: Buffer Cache Hibernation

From
Tatsuo Ishii
Date:
> Yeah, I'm pretty well convinced this whole approach is a dead end.
> Priming the OS buffer cache seems way more useful.  I also think
> saving the blocks to be read rather than the actual blocks makes a lot
> more sense.

Well, his proposal works on any platform PostgreSQL supports. On the
other hand, PgFincore works on Linux only. Who wants a Linux-only tool
in core?

Also I really want to see the performance comparison between these two
approaches in the real world database.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: patch for new feature: Buffer Cache Hibernation

From
Cédric Villemain
Date:
2011/6/1 Tatsuo Ishii <ishii@postgresql.org>:
>> Yeah, I'm pretty well convinced this whole approach is a dead end.
>> Priming the OS buffer cache seems way more useful.  I also think
>> saving the blocks to be read rather than the actual blocks makes a lot
>> more sense.
>
> Well, his proposal works on any platform PostgreSQL supports. On the
> other hand, PgFincore works on Linux only. Who wants a Linux-only tool
> in core?

I don't want to make the features compete here. Just for completeness:
a PgFincore 'snapshot' is possible on any platform supporting mincore()
(most support it; for Windows, alternatives exist). For restoring, it
can be a ReadBuffer for the PostgreSQL cache; for the OS cache it can be an
open(), read(X), read(Y), close() sequence *or* posix_fadvise(), which can be
less destructive (I did it only via posix_fadvise, but nothing prevents
changing that when POSIX support is not present).
And we already have a Linux-only feature in core, fortunately, because it
is a useful feature, and I would really like to add more posix_fadvise calls
(*this* will really help the read and cache strategy more than any hack we
can do to try to work around kernel decisions).
Note that BSD developers can change that and make posix_fadvise work:
it has been sitting in their TODO list for some years now.
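
A rough sketch of what a mincore()-based snapshot probe can look like, assuming
Linux, where the result vector is `unsigned char *` (some BSDs declare it as
`char *`). This is not PgFincore's actual code; `count_resident_pages` is a
hypothetical helper that only counts resident pages instead of saving a map.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Count how many pages of a file are resident in the OS page cache.
 * Stores the file's page count in *total_pages.  Returns -1 on error
 * (including an empty file, which cannot be mapped).
 */
long
count_resident_pages(const char *path, long *total_pages)
{
    struct stat st;
    long        pagesize = sysconf(_SC_PAGESIZE);
    long        resident = 0;
    size_t      npages;
    unsigned char *vec;
    void       *map;
    int         fd;

    assert(path != NULL && total_pages != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size == 0)
    {
        close(fd);
        return -1;
    }

    npages = ((size_t) st.st_size + (size_t) pagesize - 1) / (size_t) pagesize;
    map = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (map == MAP_FAILED)
        return -1;

    vec = malloc(npages);
    if (vec == NULL || mincore(map, (size_t) st.st_size, vec) != 0)
    {
        free(vec);
        munmap(map, (size_t) st.st_size);
        return -1;
    }

    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1; /* low bit set means the page is in core */

    free(vec);
    munmap(map, (size_t) st.st_size);
    *total_pages = (long) npages;
    return resident;
}
```

Mapping the file does not itself pull pages into memory, so the probe should
not disturb the cache state it is measuring.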

Anyway we need this patch on-list to go ahead.

>
> Also I really want to see the performance comparison between these two
> approaches in the real world database.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: patch for new feature: Buffer Cache Hibernation

From
Greg Smith
Date:
On 06/01/2011 03:03 AM, Tatsuo Ishii wrote:
> Also I really want to see the performance comparison between these two
> approaches in the real world database.
>    

Well, tell me how big of a performance improvement you want PgFincore to 
win by, and I'll construct a benchmark where it does that.  If you pick 
a database size that fits in the OS cache, but is bigger than 
shared_buffers, the difference between the approaches is huge.  The 
opposite--trying to find a case where this hibernation approach wins--is 
extremely hard to do.

Anyway, further discussion of this patch is kind of a waste right now.  
We've never gotten the patch actually sent to the list to establish a 
proper contribution (just pointers to a web page), and no feedback on 
that or other suggestions for redesign (extension repackaging, GUC 
renaming, removing unused code, and a few more).  Unless the author 
shows up again in the next two weeks, this is getting bounced back with 
no review as code we can't use.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us




Re: patch for new feature: Buffer Cache Hibernation

From
Jeff Janes
Date:
On Sun, May 15, 2011 at 11:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> I don't think there's any need for this to get data into
> shared_buffers at all.  Getting it into the OS cache oughta be plenty
> sufficient, no?
>
> ISTM that a very simple approach here would be to save the contents of
> each shared buffer on clean shutdown, and to POSIX_FADV_WILLNEED those
> buffers on startup.

Do you mean to save the contents of the buffer pages themselves into a
hibernation file, or to save just the identities (relation/fork/block
number) of the buffers?

In the first case, getting them into the OS cache would not help
because the kernel would not recognize that data as being equivalent
to the block it is a copy of.

In the latter case, wouldn't we just trigger the same inefficient
scattered read of the data that normal database operation would
trigger, taking about the same amount of time to reach cache-warmth?
Or is POSIX_FADV_WILLNEED going to be clever about reordering and
coalescing reads?

Cheers,

Jeff


Re: patch for new feature: Buffer Cache Hibernation

From
Robert Haas
Date:
On Wed, Jun 1, 2011 at 11:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sun, May 15, 2011 at 11:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't think there's any need for this to get data into
>> shared_buffers at all.  Getting it into the OS cache oughta be plenty
>> sufficient, no?
>>
>> ISTM that a very simple approach here would be to save the contents of
>> each shared buffer on clean shutdown, and to POSIX_FADV_WILLNEED those
>> buffers on startup.
>
> Do you mean to save the contents of the buffer pages themselves into a
> hibernation file, or to save just the identities (relation/fork/block
> number) of the buffers?

The latter.

> In the first case, getting them into the OS cache would not help
> because the kernel would not recognize that data as being equivalent
> to the block it is a copy of.
>
> In the latter case, wouldn't we just trigger the same inefficient
> scattered read of the data that normal database operation would
> trigger, taking about the same amount of time to reach cache-warmth?
> Or is POSIX_FADV_WILLNEED going to be clever about reordering and
> coalescing reads?

It would be nice if POSIX_FADV_WILLNEED is clever enough to reorder
and coalesce, but even if it isn't, we can help it along by doing all
the reads from any given file one after another and in increasing
block number order.
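
The ordering step could be sketched like this, assuming the saved identities
are held in a hypothetical `BlockId` array (relation file, fork, block number).
Sorting by that triple before issuing the hints groups all reads for a file
together and puts them in increasing block number order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical saved identity of one buffer. */
typedef struct
{
    unsigned relfile;
    unsigned fork;
    unsigned block;
} BlockId;

/* Three-way comparison that cannot overflow, unlike a subtraction. */
static int
cmp_unsigned(unsigned a, unsigned b)
{
    return (a > b) - (a < b);
}

/* qsort() comparator: order by file, then fork, then block number. */
static int
block_id_cmp(const void *pa, const void *pb)
{
    const BlockId *a = pa;
    const BlockId *b = pb;
    int         c;

    if ((c = cmp_unsigned(a->relfile, b->relfile)) != 0)
        return c;
    if ((c = cmp_unsigned(a->fork, b->fork)) != 0)
        return c;
    return cmp_unsigned(a->block, b->block);
}

/* Sort the saved identities in place so each file is read front to back. */
void
sort_block_ids(BlockId *ids, size_t n)
{
    assert(ids != NULL);
    qsort(ids, n, sizeof(BlockId), block_id_cmp);
}
```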

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for new feature: Buffer Cache Hibernation

From
Greg Stark
Date:
On Wed, Jun 1, 2011 at 8:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> In the latter case, wouldn't we just trigger the same inefficient
> scattered read of the data that normal database operation would
> trigger, taking about the same amount of time to reach cache-warmth?

If you have a system where you're bandwidth-constrained and processing
queries as fast as you can then yes.

But if you have an OLTP system where queries come in at a fixed rate
and it's latency that matters then there's a big difference. It might
take you hours to prime the cache at the rate that queries come in
organically and for that whole time every query requires multiple
cache misses and multiple seeks and random access reads. Once it's all
primed your whole database might actually fit in RAM and require no
i/o to serve requests. And it's possible that your system is
architected on the assumption that that's the case and performance is
inadequate until the whole database is read in.
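
The gap between the two cases can be put into rough arithmetic. The sketch
below is a back-of-the-envelope model, not a measurement: both function names
are hypothetical, and the organic estimate optimistically assumes that every
cache miss brings in a block that was not cached yet.

```c
#include <assert.h>

/*
 * Seconds until every block has been touched once when blocks only arrive
 * through cache misses of the normal query stream.
 */
double
organic_warmup_seconds(double cache_blocks, double queries_per_sec,
                       double misses_per_query)
{
    assert(queries_per_sec > 0 && misses_per_query > 0);
    return cache_blocks / (queries_per_sec * misses_per_query);
}

/* Seconds to read the same data in one sequential pass. */
double
bulk_warmup_seconds(double cache_bytes, double seq_bytes_per_sec)
{
    assert(seq_bytes_per_sec > 0);
    return cache_bytes / seq_bytes_per_sec;
}
```

With purely illustrative numbers (8GB of 8kB blocks, 100 queries per second
with 2 misses each, 100MB/s sequential reads), the first function gives about
5243 seconds, roughly 87 minutes, and the second about 82 seconds.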

Actually in that extreme case you can probably get away with a few dd
commands or perhaps an sql select count(*) on startup. I'm not sure in
practice how wide the use case is in the gap between that extreme case
and more average cases where the difference isn't so catastrophic.

I'm sure there will be people who will say it's big but I would like
to see numbers. And I'm not just talking about the usual knee-jerk
"let's see the benchmarks" response. I would love to see metrics on a
live database showing users how much of their response time depends on
the cache and how that performance varies as the cache gets warmer.
Right now I think users are kind of in the dark on cache effectiveness
and latency numbers.

-- 
greg


Re: patch for new feature: Buffer Cache Hibernation

From
Mitsuru IWASAKI
Date:
Hi,

> On 05/07/2011 03:32 AM, Mitsuru IWASAKI wrote:
> > For 1, I've just finished my work.  The latest patch is available at:
> > http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch
> >    
> 
> Reminder here--we can't accept code based on it being published to a web 
> page.  You'll need to e-mail it to the pgsql-hackers mailing list to be 
> considered for the next PostgreSQL CommitFest, which is starting in a 
> few weeks.  Code submitted to the mailing list is considered a release 
> of it to the project under the PostgreSQL license, which we can't just 
> assume for things when given only a URL to them.

Sorry about that, but I had enough time to revise my patches this week-end.
I have attached the patches to this mail, and will update the CommitFest page soon.

> Also, you suggested you were out of time to work on this.  If that's the 
> case, we'd like to know that so we don't keep cc'ing you about things in 
> expectation of an answer.  Someone else may pick this up as a project to 
> continue working on.  But it's going to need a fair amount of revision 
> before it matches what people want here, and I'm not sure how much of 
> what you've written is going to end up in any commit that may happen 
> from this idea.

It seems that I don't have enough time to complete this work.
You don't need to keep cc'ing me, and I would be very happy if postgres
became the first DBMS to support a buffer cache hibernation feature.

Thanks!


diff --git src/backend/access/transam/xlog.c src/backend/access/transam/xlog.c
index b0e4c41..7a3a207 100644
--- src/backend/access/transam/xlog.c
+++ src/backend/access/transam/xlog.c
@@ -4834,6 +4834,19 @@ ReadControlFile(void)
 #endif
 }
 
+bool
+GetControlFile(ControlFileData *controlFile)
+{
+    if (ControlFile == NULL)
+    {
+        return false;
+    }
+
+    memcpy(controlFile, ControlFile, sizeof(ControlFileData));
+
+    return true;
+}
+
 void
 UpdateControlFile(void)
 {
diff --git src/backend/bootstrap/bootstrap.c src/backend/bootstrap/bootstrap.c
index fc093cc..7ecf6bb 100644
--- src/backend/bootstrap/bootstrap.c
+++ src/backend/bootstrap/bootstrap.c
@@ -360,6 +360,15 @@ AuxiliaryProcessMain(int argc, char *argv[])
     BaseInit();
 
     /*
+     * Only StartupProcess can call ResumeBufferCacheHibernation() after
+     * InitFileAccess() and smgrinit().
+     */
+    if (auxType == StartupProcess && BufferCacheHibernationLevel > 0)
+    {
+        ResumeBufferCacheHibernation();
+    }
+
+    /*
      * When we are an auxiliary process, we aren't going to do the full
      * InitPostgres pushups, but there are a couple of things that need to get
      * lit up even in an auxiliary process.
 
diff --git src/backend/storage/buffer/buf_init.c src/backend/storage/buffer/buf_init.c
index dadb49d..52eb51a 100644
--- src/backend/storage/buffer/buf_init.c
+++ src/backend/storage/buffer/buf_init.c
@@ -127,6 +127,14 @@ InitBufferPool(void)
 
     /* Init other shared buffer-management stuff */
     StrategyInitialize(!foundDescs);
+
+    if (BufferCacheHibernationLevel > 0)
+    {
+        ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS,
+            (char *)BufferDescriptors, sizeof(BufferDesc), NBuffers);
+        ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS,
+            (char *)BufferBlocks, BLCKSZ, NBuffers);
+    }
 }
 
 /*
diff --git src/backend/storage/buffer/bufmgr.c src/backend/storage/buffer/bufmgr.c
index f96685d..dba8ebf 100644
--- src/backend/storage/buffer/bufmgr.c
+++ src/backend/storage/buffer/bufmgr.c
@@ -31,6 +31,7 @@
 #include "postgres.h"
 
 #include <sys/file.h>
+#include <sys/stat.h>
 #include <unistd.h>
 
 #include "catalog/catalog.h"
@@ -61,6 +62,13 @@
 #define BUF_WRITTEN                0x01
 #define BUF_REUSABLE            0x02
 
+/*
+ * Buffer Cache Hibernation stuff.
+ */
+/* enable this to debug buffer cache hibernation. */
+#if 0
+#define DEBUG_BUFFER_CACHE_HIBERNATION
+#endif
 
 /* GUC variables */
 bool        zero_damaged_pages = false;
@@ -765,6 +773,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
                 }
             }
 
+#ifdef DEBUG_BUFFER_CACHE_HIBERNATION
+            elog(DEBUG5,
+                "alloc  [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d",
+                    buf->buf_id, buf->flags, buf->usage_count, buf->refcount,
+                    buf->wait_backend_pid, buf->freeNext,
+                    newHash, newTag.rnode.spcNode,
+                    newTag.rnode.dbNode, newTag.rnode.relNode,
+                    newTag.forkNum, newTag.blockNum);
+#endif
+
             return buf;
         }
 
@@ -800,6 +818,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
      * the old content is no longer relevant.  (The usage_count starts out at
      * 1 so that the buffer can survive one clock-sweep pass.)
      */
+#ifdef DEBUG_BUFFER_CACHE_HIBERNATION
+    elog(DEBUG5,
+        "rename [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d",
+            buf->buf_id, buf->flags, buf->usage_count, buf->refcount,
+            buf->wait_backend_pid, buf->freeNext,
+            oldHash, oldTag.rnode.spcNode,
+            oldTag.rnode.dbNode, oldTag.rnode.relNode,
+            oldTag.forkNum, oldTag.blockNum);
+#endif
+
     buf->tag = newTag;
     buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT);
     if (relpersistence == RELPERSISTENCE_PERMANENT)
@@ -2772,3 +2800,716 @@ local_buffer_write_error_callback(void *arg)
         pfree(path);
     }
 }
+
+/* ----------------------------------------------------------------
+ *        Buffer Cache Hibernation support stuff
+ *
+ * Suspend/resume buffer cache data structure using hibernation files
+ * at shutdown/startup.
+ * ----------------------------------------------------------------
+ */
+
+int    BufferCacheHibernationLevel = 0;
+
+#define    BUFFER_CACHE_HIBERNATION_FILE_STRATEGY        "global/pg_buffer_cache_hibernation_strategy"
+#define    BUFFER_CACHE_HIBERNATION_FILE_DESCRIPTORS    "global/pg_buffer_cache_hibernation_descriptors"
+#define    BUFFER_CACHE_HIBERNATION_FILE_BLOCKS        "global/pg_buffer_cache_hibernation_blocks"
+#define    BUFFER_CACHE_HIBERNATION_FILE_CRC32            "global/pg_buffer_cache_hibernation_crc32"
+
+static struct
+{
+    char        *hibernation_file;
+    char        *data_ptr;
+    Size        record_length;    
+    Size        num_records;    
+    pg_crc32    crc;
+} BufferCacheHibernationData[] =
+{
+    /* BufferStrategyControl */
+    {
+        BUFFER_CACHE_HIBERNATION_FILE_STRATEGY,
+        NULL, 0, 0, 0
+    },
+
+    /* BufferDescriptors */
+    {
+        BUFFER_CACHE_HIBERNATION_FILE_DESCRIPTORS,
+        NULL, 0, 0, 0
+    },
+
+    /* BufferBlocks */
+    {
+        BUFFER_CACHE_HIBERNATION_FILE_BLOCKS,
+        NULL, 0, 0, 0
+    },
+
+    /* End-of-list marker */
+    {
+        NULL,
+        NULL, 0, 0, 0
+    },
+};
+
+static ControlFileData    controlFile;
+static bool                controlFileInitialized = false;
+
+/*
+ * AtProcExit_BufferCacheHibernation:
+ *         store the buffer cache into hibernation files at shutdown.
+ */
+static void
+AtProcExit_BufferCacheHibernation(int code, Datum arg)
+{
+    BufferHibernationFileType    id;
+    int                            i;
+    int                            fd;
+
+    if (BufferCacheHibernationLevel == 0)
+    {
+        return;
+    }
+
+    /*
+     * get the control file to check the system state validation.
+     */
+    if (GetControlFile(&controlFile) == false)
+    {
+        elog(WARNING,
+            "could not get control file, "
+            "aborting buffer cache hibernation");
+        return;
+    }
+
+    if (controlFile.state != DB_SHUTDOWNED)
+    {
+        elog(WARNING,
+            "database system was not shut down normally, "
+            "aborting buffer cache hibernation");
+        return;
+    }
+
+    /*
+     * suspend buffer cache data structure into hibernation files.
+     */
+    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
+    {
+        Size        record_length;
+        Size        num_records;
+        char        *ptr;
+        pg_crc32    crc;
+
+        if (BufferCacheHibernationLevel < 2 &&
+            id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
+        {
+            continue;
+        }
+
+        if (BufferCacheHibernationData[id].data_ptr == NULL ||
+            BufferCacheHibernationData[id].record_length == 0 ||
+            BufferCacheHibernationData[id].num_records == 0)
+        {
+            elog(WARNING,
+                "ResisterBufferCacheHibernation() was not called for %s",
+                BufferCacheHibernationData[id].hibernation_file);
+            goto cleanup;
+        }
+
+        fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file,
+                O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, S_IRUSR | S_IWUSR);
+        if (fd < 0)
+        {
+            elog(WARNING,
+                "could not open %s",
+                BufferCacheHibernationData[id].hibernation_file);
+            goto cleanup;
+        }
+
+        record_length = BufferCacheHibernationData[id].record_length;
+        num_records = BufferCacheHibernationData[id].num_records;
+
+        elog(NOTICE,
+            "buffer cache hibernate into %s",
+            BufferCacheHibernationData[id].hibernation_file);
+
+        INIT_CRC32(crc);
+        for (i = 0; i < num_records; i++)
+        {
+            ptr = BufferCacheHibernationData[id].data_ptr + (i * record_length);
+            if (write(fd, (void *)ptr, record_length) != record_length)
+            {
+                elog(WARNING,
+                    "could not write %s",
+                    BufferCacheHibernationData[id].hibernation_file);
+                goto cleanup;
+            }
+
+            COMP_CRC32(crc, ptr, record_length);
+        }
+
+        FIN_CRC32(crc);
+        close(fd);
+
+        BufferCacheHibernationData[id].crc = crc;
+    }
+
+    /*
+     * save the computed crc values for the validations at resuming.
+     */
+    fd = BasicOpenFile(BUFFER_CACHE_HIBERNATION_FILE_CRC32,
+            O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, S_IRUSR | S_IWUSR);
+    if (fd < 0)
+    {
+        elog(WARNING,
+            "could not open %s",
+            BUFFER_CACHE_HIBERNATION_FILE_CRC32);
+        goto cleanup;
+    }
+
+    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
+    {
+        pg_crc32    crc;
+
+        if (BufferCacheHibernationLevel < 2 &&
+            id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
+        {
+            continue;
+        }
+
+        crc = BufferCacheHibernationData[id].crc;
+        if (write(fd, (void *)&crc, sizeof(pg_crc32)) != sizeof(pg_crc32))
+        {
+            elog(WARNING,
+                "could not write %s for %s",
+                BUFFER_CACHE_HIBERNATION_FILE_CRC32,
+                BufferCacheHibernationData[id].hibernation_file);
+            goto cleanup;
+        }
+    }
+    close(fd);
+
+    elog(NOTICE,
+        "buffer cache suspended successfully");
+
+    return;
+
+cleanup:
+    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
+    {
+        unlink(BufferCacheHibernationData[id].hibernation_file);
+    }
+
+    return;
+}
+
+/*
+ * ResisterBufferCacheHibernation:
+ *         register the buffer cache data structure info.
+ */
+void
+ResisterBufferCacheHibernation(BufferHibernationFileType id, char *ptr, Size record_length, Size num_records)
+{
+    static bool                    first_time = true;
+
+    if (BufferCacheHibernationLevel == 0)
+    {
+        return;
+    }
+
+    if (id != BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY &&
+        id != BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS &&
+        id != BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
+    {
+        return;
+    }
+
+    if (first_time)
+    {
+        /*
+         * AtProcExit_BufferCacheHibernation to be called at shutdown.
+         */
+        on_shmem_exit(AtProcExit_BufferCacheHibernation, 0);
+        first_time = false;
+    }
+
+    /*
+     * get the control file to check the system state and
+     * hibernation file validations.
+     */
+    if (controlFileInitialized == false)
+    {
+        if (GetControlFile(&controlFile) == true)
+        {
+            controlFileInitialized = true;
+        }
+    }
+
+    BufferCacheHibernationData[id].data_ptr = ptr;
+    BufferCacheHibernationData[id].record_length = record_length;
+    BufferCacheHibernationData[id].num_records = num_records;
+}
+
+/*
+ * ResumeBufferCacheHibernation:
+ *         resume the buffer cache from hibernation file at startup.
+ */
+void
+ResumeBufferCacheHibernation(void)
+{
+    BufferHibernationFileType    id;
+    int                            i;
+    int                            fd;
+    Size                        num_records;
+    Size                        record_length;
+    char                        *buf_common;
+    int                            oldNBuffers;
+    bool                        buffer_block_processed;
+
+    if (BufferCacheHibernationLevel == 0)
+    {
+        return;
+    }
+
+    buf_common = NULL;
+    buffer_block_processed = false;
+
+    /*
+     * lock all buffer descriptors to prevent other processes from
+     * updating buffers.
+     */
+    for (i = 0; i < NBuffers; i++)
+    {
+        BufferDesc    *buf;
+
+        buf = &BufferDescriptors[i];
+        LockBufHdr(buf);
+    }
+
+    /*
+     * get the control file to check the system state and
+     * hibernation file validations.
+     */
+    if (controlFileInitialized == false)
+    {
+        elog(WARNING,
+            "could not get control file, "
+            "aborting buffer cache hibernation");
+        goto cleanup;
+    }
+
+    if (controlFile.state != DB_SHUTDOWNED)
+    {
+        elog(WARNING,
+            "database system was not shut down normally, "
+            "aborting buffer cache hibernation");
+        goto cleanup;
+    }
+
+    /*
+     * read the crc values which was computed when the hibernation
+     * files were created.
+     */
+    fd = BasicOpenFile(BUFFER_CACHE_HIBERNATION_FILE_CRC32,
+            O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR);
+    if (fd < 0)
+    {
+        elog(WARNING,
+            "could not open %s",
+            BUFFER_CACHE_HIBERNATION_FILE_CRC32);
+        goto cleanup;
+    }
+
+    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
+    {
+        pg_crc32    crc;
+
+        if (BufferCacheHibernationLevel < 2 &&
+            id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
+        {
+            continue;
+        }
+
+        if (read(fd, (void *)&crc, sizeof(pg_crc32)) != sizeof(pg_crc32))
+        {
+            if (BufferCacheHibernationLevel == 2 &&
+                id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
+            {
+                /*
+                 * if buffer_cache_hibernation_level changes 1 to 2,
+                 * the crc value of buffer block hibernation file may not exist.
+                 * just ignore it here.
+                 */
+                continue;
+            }
+
+            elog(WARNING,
+                "could not read %s for %s",
+                BUFFER_CACHE_HIBERNATION_FILE_CRC32,
+                BufferCacheHibernationData[id].hibernation_file);
+            close(fd);
+            goto cleanup;
+        }
+        BufferCacheHibernationData[id].crc = crc;
+    }
+
+    close(fd);
+
+    /*
+     * allocate a buffer to read the contents of the hibernation files
+     * for validations.
+     */
+    record_length = 0;
+    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
+    {
+        if (record_length < BufferCacheHibernationData[id].record_length)
+        {
+            record_length = BufferCacheHibernationData[id].record_length;
+        }
+    }
+
+    buf_common = malloc(record_length);
+    Assert(buf_common != NULL);
+
+    /* assume that the number of buffers have not changed. */
+    oldNBuffers = NBuffers;
+
+    /*
+     * check if all hibernation files are valid.
+     */
+    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
+    {
+        struct stat    sb;
+        pg_crc32    crc;
+
+        if (BufferCacheHibernationLevel < 2 &&
+            id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
+        {
+            continue;
+        }
+
+        if (BufferCacheHibernationData[id].data_ptr == NULL ||
+            BufferCacheHibernationData[id].record_length == 0 ||
+            BufferCacheHibernationData[id].num_records == 0)
+        {
+            elog(WARNING,
+                "ResisterBufferCacheHibernation() was not called for %s",
+                BufferCacheHibernationData[id].hibernation_file);
+            goto cleanup;
+        }
+
+        fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file,
+                O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR);
+        if (fd < 0)
+        {
+            if (BufferCacheHibernationLevel == 2 &&
+                id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
+            {
+                /*
+                 * if buffer_cache_hibernation_level changes 1 to 2,
+                 * the buffer block hibernation file may not exist.
+                 * just ignore it here.
+                 */
+                continue;
+            }
+
+            goto cleanup;
+        }
+
+        if (fstat(fd, &sb) < 0)
+        {
+            elog(WARNING,
+                "could not get stats of the buffer cache hibernation file: %s",
+                BufferCacheHibernationData[id].hibernation_file);
+            close(fd);
+            goto cleanup;
+        }
+
+        record_length = BufferCacheHibernationData[id].record_length;
+        num_records = BufferCacheHibernationData[id].num_records;
+
+        if (sb.st_size != (record_length * num_records))
+        {
+            /* The size of StrategyControl should be the same always. */
+            if (id == BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY ||
+                (sb.st_size % record_length) > 0)
+            {
+                elog(WARNING,
+                    "size mismatch on the buffer cache hibernation file: %s",
+                    BufferCacheHibernationData[id].hibernation_file);
+                close(fd);
+                goto cleanup;
+            }
+
+            /*
+             * The number of records of buffer descriptors and blocks
+             * should be the same.
+             */
+            if (oldNBuffers != NBuffers &&
+                oldNBuffers != (sb.st_size / record_length))
+            {
+                elog(WARNING,
+                    "size mismatch on the buffer cache hibernation file: %s",
+                    BufferCacheHibernationData[id].hibernation_file);
+                close(fd);
+                goto cleanup;
+            }
+            
+            oldNBuffers = sb.st_size / record_length;
+
+            elog(NOTICE,
+                "shared_buffers have changed from %d to %d: %s",
+                oldNBuffers, NBuffers,
+                BufferCacheHibernationData[id].hibernation_file);
+
+            /* use the original size to compute CRC of the hibernation file. */
+            num_records = oldNBuffers;
+        }
+
+        if ((pg_time_t)sb.st_mtime < controlFile.time)
+        {
+            elog(WARNING,
+                "the hibernation file is older than control file: %s",
+                BufferCacheHibernationData[id].hibernation_file);
+            close(fd);
+            goto cleanup;
+        }
+
+        INIT_CRC32(crc);
+        for (i = 0; i < num_records; i++)
+        {
+            if (read(fd, (void *)buf_common, record_length) != record_length)
+            {
+                elog(WARNING,
+                    "could not read the buffer cache hibernation file: %s",
+                    BufferCacheHibernationData[id].hibernation_file);
+                close(fd);
+                goto cleanup;
+            }
+
+            COMP_CRC32(crc, buf_common, record_length);
+
+            /*
+             * buffer descriptors validations.
+             */
+            if (id == BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS)
+            {
+                BufferDesc    *buf;
+                BufFlags    abnormal_flags;
+
+                if (i >= NBuffers)
+                {
+                    continue;
+                }
+
+                abnormal_flags = (BM_DIRTY | BM_IO_IN_PROGRESS | BM_IO_ERROR |
+                                  BM_JUST_DIRTIED | BM_PIN_COUNT_WAITER);
+
+                buf = (BufferDesc *)buf_common;
+
+                if (buf->flags & abnormal_flags)
+                {
+                    elog(WARNING,
+                        "abnormal flags in buffer descriptors: %d",
+                        buf->flags);
+                    close(fd);
+                    goto cleanup;
+                }
+
+                if (buf->usage_count > BM_MAX_USAGE_COUNT)
+                {
+                    elog(WARNING,
+                        "invalid usage count in buffer descriptors: %d",
+                        buf->usage_count);
+                    close(fd);
+                    goto cleanup;
+                }
+
+                if (buf->buf_id < 0 || buf->buf_id >= num_records)
+                {
+                    elog(WARNING,
+                        "invalid buffer id in buffer descriptors: %d",
+                        buf->buf_id);
+                    close(fd);
+                    goto cleanup;
+                }
+            }
+        }
+
+        FIN_CRC32(crc);
+        close(fd);
+
+        if (!EQ_CRC32(BufferCacheHibernationData[id].crc, crc))
+        {
+            elog(WARNING,
+                "crc mismatch on the buffer cache hibernation file: %s",
+                BufferCacheHibernationData[id].hibernation_file);
+            close(fd);
+            goto cleanup;
+        }
+    }
+
+    /*
+     * resume the buffer cache data structure from the hibernation files.
+     */
+    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
+    {
+        int            fd;
+        char        *ptr;
+
+        if (BufferCacheHibernationLevel < 2 &&
+            id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
+        {
+            continue;
+        }
+
+        record_length = BufferCacheHibernationData[id].record_length;
+        num_records = BufferCacheHibernationData[id].num_records;
+
+        if (id != BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY)
+        {
+            /* use the smaller number of buffers. */
+            num_records = (oldNBuffers < NBuffers)? oldNBuffers : NBuffers;
+        }
+
+        fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file,
+                O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR);
+        if (fd < 0)
+        {
+            if (BufferCacheHibernationLevel == 2 &&
+                id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
+            {
+                /*
+                 * if buffer_cache_hibernation_level changes 1 to 2,
+                 * the buffer block hibernation file may not exist.
+                 * just ignore it here.
+                 */
+                continue;
+            }
+
+            goto cleanup;
+        }
+
+        elog(NOTICE,
+            "buffer cache resume from %s(%d bytes * %d records)",
+            BufferCacheHibernationData[id].hibernation_file,
+            record_length, num_records);
+
+        for (i = 0; i < num_records; i++)
+        {
+            ptr = BufferCacheHibernationData[id].data_ptr + (i * record_length);
+            read(fd, (void *)ptr, record_length);
+
+            /* Re-lock the buffer descriptor if necessary. */
+            if (id == BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS)
+            {
+                BufferDesc    *buf;
+
+                buf = (BufferDesc *)ptr;
+                if (IsUnlockBufHdr(buf))
+                {
+                    LockBufHdr(buf);
+                }
+            }
+        }
+
+        close(fd);
+
+        if (id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
+        {
+            buffer_block_processed = true;
+        }
+    }
+
+    if (buffer_block_processed == false)
+    {
+        /* we didn't use the buffer block hibernation file, so delete it now. */
+        id = BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS;
+        unlink(BufferCacheHibernationData[id].hibernation_file);
+    }
+
+    /*
+     * set the rest data structures (eg. lookup hashtable) up
+     * based on the buffer descriptors.
+     */
+    num_records = (oldNBuffers < NBuffers)? oldNBuffers : NBuffers;
+    for (i = 0; i < num_records; i++)
+    {
+        BufferDesc        *buf;
+        BufferTag        newTag;
+        uint32            newHash;
+        int                buf_id;
+
+        buf = &BufferDescriptors[i];
+        if (buf->tag.rnode.spcNode    == InvalidOid &&
+            buf->tag.rnode.dbNode    == InvalidOid &&
+            buf->tag.rnode.relNode    == InvalidOid)
+        {
+            continue;
+        }
+
+        INIT_BUFFERTAG(newTag, buf->tag.rnode, buf->tag.forkNum, buf->tag.blockNum);
+        newHash = BufTableHashCode(&newTag);
+
+        if (buffer_block_processed == false)
+        {
+            Block            bufBlock;
+            SMgrRelation    smgr;
+
+            /*
+             * re-read buffer block.
+             */
+            bufBlock = BufHdrGetBlock(buf);
+            smgr = smgropen(buf->tag.rnode, InvalidBackendId);
+            smgrread(smgr, newTag.forkNum, newTag.blockNum, (char *) bufBlock);
+        }
+
+        buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+        if (buf_id != -1)
+        {
+            /* the entry exists already, return it to the freelist. */
+            buf->refcount = 0;
+            buf->flags = 0;
+            InvalidateBuffer(buf);
+            continue;
+        }
+
+        /* clear wait_backend_pid because the process was terminated already. */
+        buf->wait_backend_pid = 0;
+
+#ifdef DEBUG_BUFFER_CACHE_HIBERNATION
+        elog(DEBUG5,
+            "resume [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d",
+                buf->buf_id, buf->flags, buf->usage_count, buf->refcount,
+                buf->wait_backend_pid, buf->freeNext,
+                newHash, newTag.rnode.spcNode,
+                newTag.rnode.dbNode, newTag.rnode.relNode,
+                newTag.forkNum, newTag.blockNum);
+#endif
+    }
+
+    /*
+     * adjust StrategyControl based on the change of shared_buffers.
+     */
+    if (oldNBuffers != NBuffers)
+    {
+        AdjustStrategyControl(oldNBuffers);
+    }
+
+    elog(NOTICE,
+        "buffer cache resumed successfully");
+
+cleanup:
+    for (i = 0; i < NBuffers; i++)
+    {
+        BufferDesc    *buf;
+
+        buf = &BufferDescriptors[i];
+        UnlockBufHdr(buf);
+    }
+
+    if (buf_common != NULL)
+    {
+        free(buf_common);
+    }
+
+    return;
+}
diff --git src/backend/storage/buffer/freelist.c src/backend/storage/buffer/freelist.c
index bf9903b..ffc101d 100644
--- src/backend/storage/buffer/freelist.c
+++ src/backend/storage/buffer/freelist.c
@@ -347,6 +347,12 @@ StrategyInitialize(bool init)
     }
     else
         Assert(!init);
+
+    if (BufferCacheHibernationLevel > 0)
+    {
+        ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY,
+            (char *)StrategyControl, sizeof(BufferStrategyControl), 1);
+    }
 }
 
 
@@ -521,3 +527,47 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, volatile BufferDesc *buf)
 
     return true;
 }
+
+/*
+ * AdjustStrategyControl -- adjust the member variables of StrategyControl
+ *
+ * If the shared_buffers setting had changed, restored StrategyControl
+ * needs to be adjusted for in both cases of shrinking and enlarging.
+ * This is called only from bufmgr.c:ResumeBufferCacheHibernation().
+ */
+void
+AdjustStrategyControl(int oldNBuffers)
+{
+    if (oldNBuffers == NBuffers)
+    {
+        return;
+    }
+
+    /* enlarge or shrink the free buffer based on current NBuffers. */
+    StrategyControl->lastFreeBuffer = NBuffers - 1;
+
+    /* shared_buffers shrunk. */
+    if (oldNBuffers > NBuffers)
+    {
+        if (StrategyControl->nextVictimBuffer >= NBuffers)
+        {
+            /* set the tail of buffers. */
+            StrategyControl->nextVictimBuffer = NBuffers - 1;
+        }
+
+        if (StrategyControl->firstFreeBuffer >= NBuffers)
+        {
+            /* set FREENEXT_END_OF_LIST(-1). */
+            StrategyControl->firstFreeBuffer = FREENEXT_END_OF_LIST;
+        }
+    }
+    else
+    /* shared_buffers enlarged. */
+    {
+        if (StrategyControl->firstFreeBuffer < 0)
+        {
+            /* set the next entry of the tail of old buffers. */
+            StrategyControl->firstFreeBuffer = oldNBuffers;
+        }
+    }
+}
diff --git src/backend/utils/misc/guc.c src/backend/utils/misc/guc.c
index 738e215..5affc6e 100644
--- src/backend/utils/misc/guc.c
+++ src/backend/utils/misc/guc.c
@@ -2361,6 +2361,18 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"buffer_cache_hibernation_level", PGC_POSTMASTER, UNGROUPED,
+            gettext_noop("Sets buffer cache hibernation level."),
+            gettext_noop("0 to disable(default), "
+                         "1 for saving buffer descriptors only(recommended), "
+                         "2 for saving buffer descriptors and buffer blocks(slower at shutdown).")
+        },
+        &BufferCacheHibernationLevel,
+        0, 0, 2,
+        NULL, NULL, NULL
+    },
+
     /* End-of-list marker */
     {
         {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git src/backend/utils/misc/postgresql.conf.sample src/backend/utils/misc/postgresql.conf.sample
index b8a1582..44b6ff3 100644
--- src/backend/utils/misc/postgresql.conf.sample
+++ src/backend/utils/misc/postgresql.conf.sample
@@ -119,6 +119,17 @@
 #maintenance_work_mem = 16MB        # min 1MB
 #max_stack_depth = 2MB            # min 100kB
 
+
+# Buffer Cache Hibernation:
+#  Suspend/resume buffer cache data structure using hibernation files
+#  at shutdown/startup.
+#buffer_cache_hibernation_level = 0    # Sets buffer cache hibernation level.
+                    # 0 to disable(default),
+                    # 1 for saving buffer descriptors only
+                    #   (recommended),
+                    # 2 for saving buffer descriptors and
+                    #   buffer blocks(slower at shutdown).
+
 # - Kernel Resource Usage -
 
 #max_files_per_process = 1000        # min 25
diff --git src/include/access/xlog.h src/include/access/xlog.h
index 7056fd6..7a9fb99 100644
--- src/include/access/xlog.h
+++ src/include/access/xlog.h
@@ -13,6 +13,7 @@
 
 #include "access/rmgr.h"
 #include "access/xlogdefs.h"
+#include "catalog/pg_control.h"
 #include "lib/stringinfo.h"
 #include "storage/buf.h"
 #include "utils/pg_crc.h"
@@ -294,6 +295,7 @@ extern bool XLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(void);
 
+extern bool GetControlFile(ControlFileData *controlFile);
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
 extern Size XLOGShmemSize(void);
diff --git src/include/storage/buf_internals.h src/include/storage/buf_internals.h
index b7d4ea5..d537ef1 100644
--- src/include/storage/buf_internals.h
+++ src/include/storage/buf_internals.h
@@ -167,6 +167,7 @@ typedef struct sbufdesc
  */
 #define LockBufHdr(bufHdr)        SpinLockAcquire(&(bufHdr)->buf_hdr_lock)
 #define UnlockBufHdr(bufHdr)    SpinLockRelease(&(bufHdr)->buf_hdr_lock)
+#define IsUnlockBufHdr(bufHdr)    SpinLockFree(&(bufHdr)->buf_hdr_lock)
 
 
 /* in buf_init.c */
@@ -190,6 +191,7 @@ extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 extern int    StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
+extern void AdjustStrategyControl(int oldNBuffers);
 
 /* buf_table.c */
 extern Size BufTableShmemSize(int size);
diff --git src/include/storage/bufmgr.h src/include/storage/bufmgr.h
index b8fc87e..ddfeb9d 100644
--- src/include/storage/bufmgr.h
+++ src/include/storage/bufmgr.h
@@ -211,6 +211,20 @@ extern void BgBufferSync(void);
 
 extern void AtProcExit_LocalBuffers(void);
 
+/* buffer cache hibernation support stuff */
+extern int    BufferCacheHibernationLevel;
+
+typedef enum BufferHibernationFileType
+{   
+    BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY,
+    BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS,
+    BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS
+} BufferHibernationFileType;
+
+extern void ResisterBufferCacheHibernation(BufferHibernationFileType id,
+                char *ptr, Size record_length, Size num_records);
+extern void ResumeBufferCacheHibernation(void);
+
 /* in freelist.c */
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
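
For readers who want the suspend/resume flow without the backend plumbing, here is a minimal standalone sketch of the pattern the patch follows: write fixed-size records while accumulating a checksum, then on resume validate the file (size and checksum) before copying anything into the live array. It is not code from the patch. The names save_records, load_records and crc32_update are hypothetical, the bitwise CRC-32 is only a stand-in for PostgreSQL's pg_crc32 macros, and stdio replaces BasicOpenFile/read/write.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); stand-in for pg_crc32. */
static uint32_t crc32_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;
    size_t i;
    int bit;

    for (i = 0; i < len; i++)
    {
        crc ^= p[i];
        for (bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Suspend: write nrec records of rec_len bytes and report their CRC. */
static int save_records(const char *path, const char *data,
                        size_t rec_len, size_t nrec, uint32_t *crc_out)
{
    FILE *fp = fopen(path, "wb");
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;

    assert(rec_len > 0);
    if (fp == NULL)
        return -1;
    for (i = 0; i < nrec; i++)
    {
        const char *rec = data + i * rec_len;

        if (fwrite(rec, rec_len, 1, fp) != 1)
        {
            fclose(fp);
            remove(path);       /* never leave a partial file behind */
            return -1;
        }
        crc = crc32_update(crc, rec, rec_len);
    }
    if (fclose(fp) != 0)
    {
        remove(path);
        return -1;
    }
    *crc_out = crc ^ 0xFFFFFFFFu;
    return 0;
}

/* Resume: pass 1 validates size and CRC; pass 2 copies into dest. */
static int load_records(const char *path, char *dest,
                        size_t rec_len, size_t nrec, uint32_t expected_crc)
{
    FILE *fp = fopen(path, "rb");
    uint32_t crc = 0xFFFFFFFFu;
    char *scratch;
    size_t i;
    int ok = 1;

    assert(rec_len > 0);
    if (fp == NULL)
        return -1;
    scratch = malloc(rec_len);
    if (scratch == NULL)
    {
        fclose(fp);
        return -1;
    }
    for (i = 0; ok && i < nrec; i++)
    {
        if (fread(scratch, rec_len, 1, fp) != 1)
            ok = 0;             /* file is shorter than expected */
        else
            crc = crc32_update(crc, scratch, rec_len);
    }
    if (ok && fgetc(fp) != EOF)
        ok = 0;                 /* trailing bytes: size mismatch */
    if (ok && (crc ^ 0xFFFFFFFFu) != expected_crc)
        ok = 0;                 /* contents changed since suspend */
    if (ok)
    {
        rewind(fp);
        ok = (fread(dest, rec_len, nrec, fp) == nrec);
    }
    free(scratch);
    fclose(fp);
    return ok ? 0 : -1;
}
```

A caller would run save_records() at shutdown, keep the returned checksum in a separate file, and pass it to load_records() at startup. Because dest is touched only after both checks pass, a stale or truncated file leaves the in-memory array as it was. The sketch leaves out the control-file state and timestamp checks, the buffer header locking, and the handling of a changed shared_buffers setting that the patch performs.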
 


Re: patch for new feature: Buffer Cache Hibernation

From
Greg Smith
Date:
On 06/05/2011 08:50 AM, Mitsuru IWASAKI wrote:
> It seems that I don't have enough time to complete this work.
> You don't need to keep cc'ing me, and I'm very happy if postgres to be
> the first DBMS which support buffer cache hibernation feature.
>    

Thanks for submitting the patch, and we'll see what happens from here.  
I've switched to bcc'ing you here and we should get you off everyone 
else's cc: list here soon.  If this feature ends up getting committed, 
I'll try to remember to drop you a note about it so you can see what 
happened.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us




Re: patch for new feature: Buffer Cache Hibernation

From
Bruce Momjian
Date:
Should this be marked as TODO?

---------------------------------------------------------------------------

Mitsuru IWASAKI wrote:
> Hi,
> 
> > On 05/07/2011 03:32 AM, Mitsuru IWASAKI wrote:
> > > For 1, I've just finish my work.  The latest patch is available at:
> > > http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch
> > >    
> > 
> > Reminder here--we can't accept code based on it being published to a web 
> > page.  You'll need to e-mail it to the pgsql-hackers mailing list to be 
> > considered for the next PostgreSQL CommitFest, which is starting in a 
> > few weeks.  Code submitted to the mailing list is considered a release 
> > of it to the project under the PostgreSQL license, which we can't just 
> > assume for things when given only a URL to them.
> 
> Sorry about that, but I had enough time to revise my patches this week-end.
> I attached the patches in this mail, and will update CommitFest page soon.
> 
> > Also, you suggested you were out of time to work on this.  If that's the 
> > case, we'd like to know that so we don't keep cc'ing you about things in 
> > expectation of an answer.  Someone else may pick this up as a project to 
> > continue working on.  But it's going to need a fair amount of revision 
> > before it matches what people want here, and I'm not sure how much of 
> > what you've written is going to end up in any commit that may happen 
> > from this idea.
> 
> It seems that I don't have enough time to complete this work.
> You don't need to keep cc'ing me, and I'm very happy if postgres to be
> the first DBMS which support buffer cache hibernation feature.
> 
> Thanks!
> 
> 
> diff --git src/backend/access/transam/xlog.c src/backend/access/transam/xlog.c
> index b0e4c41..7a3a207 100644
> --- src/backend/access/transam/xlog.c
> +++ src/backend/access/transam/xlog.c
> @@ -4834,6 +4834,19 @@ ReadControlFile(void)
>  #endif
>  }
>  
> +bool
> +GetControlFile(ControlFileData *controlFile)
> +{
> +    if (ControlFile == NULL)
> +    {
> +        return false;
> +    }
> +
> +    memcpy(controlFile, ControlFile, sizeof(ControlFileData));
> +
> +    return true;
> +}
> +
>  void
>  UpdateControlFile(void)
>  {
> diff --git src/backend/bootstrap/bootstrap.c src/backend/bootstrap/bootstrap.c
> index fc093cc..7ecf6bb 100644
> --- src/backend/bootstrap/bootstrap.c
> +++ src/backend/bootstrap/bootstrap.c
> @@ -360,6 +360,15 @@ AuxiliaryProcessMain(int argc, char *argv[])
>      BaseInit();
>  
>      /*
> +     * Only StartupProcess can call ResumeBufferCacheHibernation() after
> +     * InitFileAccess() and smgrinit().
> +     */
> +    if (auxType == StartupProcess && BufferCacheHibernationLevel > 0)
> +    {
> +        ResumeBufferCacheHibernation();
> +    }
> +
> +    /*
>       * When we are an auxiliary process, we aren't going to do the full
>       * InitPostgres pushups, but there are a couple of things that need to get
>       * lit up even in an auxiliary process.
> diff --git src/backend/storage/buffer/buf_init.c src/backend/storage/buffer/buf_init.c
> index dadb49d..52eb51a 100644
> --- src/backend/storage/buffer/buf_init.c
> +++ src/backend/storage/buffer/buf_init.c
> @@ -127,6 +127,14 @@ InitBufferPool(void)
>  
>      /* Init other shared buffer-management stuff */
>      StrategyInitialize(!foundDescs);
> +
> +    if (BufferCacheHibernationLevel > 0)
> +    {
> +        ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS,
> +            (char *)BufferDescriptors, sizeof(BufferDesc), NBuffers);
> +        ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS,
> +            (char *)BufferBlocks, BLCKSZ, NBuffers);
> +    }
>  }
>  
>  /*
> diff --git src/backend/storage/buffer/bufmgr.c src/backend/storage/buffer/bufmgr.c
> index f96685d..dba8ebf 100644
> --- src/backend/storage/buffer/bufmgr.c
> +++ src/backend/storage/buffer/bufmgr.c
> @@ -31,6 +31,7 @@
>  #include "postgres.h"
>  
>  #include <sys/file.h>
> +#include <sys/stat.h>
>  #include <unistd.h>
>  
>  #include "catalog/catalog.h"
> @@ -61,6 +62,13 @@
>  #define BUF_WRITTEN                0x01
>  #define BUF_REUSABLE            0x02
>  
> +/*
> + * Buffer Cache Hibernation stuff.
> + */
> +/* enable this to debug buffer cache hibernation. */
> +#if 0
> +#define DEBUG_BUFFER_CACHE_HIBERNATION
> +#endif
>  
>  /* GUC variables */
>  bool        zero_damaged_pages = false;
> @@ -765,6 +773,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>                  }
>              }
>  
> +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION
> +            elog(DEBUG5,
> +                "alloc  [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d",
> +                    buf->buf_id, buf->flags, buf->usage_count, buf->refcount,
> +                    buf->wait_backend_pid, buf->freeNext,
> +                    newHash, newTag.rnode.spcNode,
> +                    newTag.rnode.dbNode, newTag.rnode.relNode,
> +                    newTag.forkNum, newTag.blockNum);
> +#endif
> +
>              return buf;
>          }
>  
> @@ -800,6 +818,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>       * the old content is no longer relevant.  (The usage_count starts out at
>       * 1 so that the buffer can survive one clock-sweep pass.)
>       */
> +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION
> +    elog(DEBUG5,
> +        "rename [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d",
> +            buf->buf_id, buf->flags, buf->usage_count, buf->refcount,
> +            buf->wait_backend_pid, buf->freeNext,
> +            oldHash, oldTag.rnode.spcNode,
> +            oldTag.rnode.dbNode, oldTag.rnode.relNode,
> +            oldTag.forkNum, oldTag.blockNum);
> +#endif
> +
>      buf->tag = newTag;
>      buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT);
>      if (relpersistence == RELPERSISTENCE_PERMANENT)
> @@ -2772,3 +2800,716 @@ local_buffer_write_error_callback(void *arg)
>          pfree(path);
>      }
>  }
> +
> +/* ----------------------------------------------------------------
> + *        Buffer Cache Hibernation support stuff
> + *
> + * Suspend/resume buffer cache data structure using hibernation files
> + * at shutdown/startup.
> + * ----------------------------------------------------------------
> + */
> +
> +int    BufferCacheHibernationLevel = 0;
> +
> +#define    BUFFER_CACHE_HIBERNATION_FILE_STRATEGY        "global/pg_buffer_cache_hibernation_strategy"
> +#define    BUFFER_CACHE_HIBERNATION_FILE_DESCRIPTORS    "global/pg_buffer_cache_hibernation_descriptors"
> +#define    BUFFER_CACHE_HIBERNATION_FILE_BLOCKS        "global/pg_buffer_cache_hibernation_blocks"
> +#define    BUFFER_CACHE_HIBERNATION_FILE_CRC32            "global/pg_buffer_cache_hibernation_crc32"
> +
> +static struct
> +{
> +    char        *hibernation_file;
> +    char        *data_ptr;
> +    Size        record_length;    
> +    Size        num_records;    
> +    pg_crc32    crc;
> +} BufferCacheHibernationData[] =
> +{
> +    /* BufferStrategyControl */
> +    {
> +        BUFFER_CACHE_HIBERNATION_FILE_STRATEGY,
> +        NULL, 0, 0, 0
> +    },
> +
> +    /* BufferDescriptors */
> +    {
> +        BUFFER_CACHE_HIBERNATION_FILE_DESCRIPTORS,
> +        NULL, 0, 0, 0
> +    },
> +
> +    /* BufferBlocks */
> +    {
> +        BUFFER_CACHE_HIBERNATION_FILE_BLOCKS,
> +        NULL, 0, 0, 0
> +    },
> +
> +    /* End-of-list marker */
> +    {
> +        NULL,
> +        NULL, 0, 0, 0
> +    },
> +};
> +
> +static ControlFileData    controlFile;
> +static bool                controlFileInitialized = false;
> +
> +/*
> + * AtProcExit_BufferCacheHibernation:
> + *         store the buffer cache into hibernation files at shutdown.
> + */
> +static void
> +AtProcExit_BufferCacheHibernation(int code, Datum arg)
> +{
> +    BufferHibernationFileType    id;
> +    int                            i;
> +    int                            fd;
> +
> +    if (BufferCacheHibernationLevel == 0)
> +    {
> +        return;
> +    }
> +
> +    /*
> +     * get the control file to check the system state validation.
> +     */
> +    if (GetControlFile(&controlFile) == false)
> +    {
> +        elog(WARNING,
> +            "could not get control file, "
> +            "aborting buffer cache hibernation");
> +        return;
> +    }
> +
> +    if (controlFile.state != DB_SHUTDOWNED)
> +    {
> +        elog(WARNING,
> +            "database system was not shut down normally, "
> +            "aborting buffer cache hibernation");
> +        return;
> +    }
> +
> +    /*
> +     * suspend buffer cache data structure into hibernation files.
> +     */
> +    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
> +    {
> +        Size        record_length;
> +        Size        num_records;
> +        char        *ptr;
> +        pg_crc32    crc;
> +
> +        if (BufferCacheHibernationLevel < 2 &&
> +            id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
> +        {
> +            continue;
> +        }
> +
> +        if (BufferCacheHibernationData[id].data_ptr == NULL ||
> +            BufferCacheHibernationData[id].record_length == 0 ||
> +            BufferCacheHibernationData[id].num_records == 0)
> +        {
> +            elog(WARNING,
> +                "ResisterBufferCacheHibernation() was not called for %s",
> +                BufferCacheHibernationData[id].hibernation_file);
> +            goto cleanup;
> +        }
> +
> +        fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file,
> +                O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, S_IRUSR | S_IWUSR);
> +        if (fd < 0)
> +        {
> +            elog(WARNING,
> +                "could not open %s",
> +                BufferCacheHibernationData[id].hibernation_file);
> +            goto cleanup;
> +        }
> +
> +        record_length = BufferCacheHibernationData[id].record_length;
> +        num_records = BufferCacheHibernationData[id].num_records;
> +
> +        elog(NOTICE,
> +            "buffer cache hibernate into %s",
> +            BufferCacheHibernationData[id].hibernation_file);
> +
> +        INIT_CRC32(crc);
> +        for (i = 0; i < num_records; i++)
> +        {
> +            ptr = BufferCacheHibernationData[id].data_ptr + (i * record_length);
> +            if (write(fd, (void *)ptr, record_length) != record_length)
> +            {
> +                elog(WARNING,
> +                    "could not write %s",
> +                    BufferCacheHibernationData[id].hibernation_file);
> +                goto cleanup;
> +            }
> +
> +            COMP_CRC32(crc, ptr, record_length);
> +        }
> +
> +        FIN_CRC32(crc);
> +        close(fd);
> +
> +        BufferCacheHibernationData[id].crc = crc;
> +    }
> +
> +    /*
> +     * save the computed crc values for the validations at resuming.
> +     */
> +    fd = BasicOpenFile(BUFFER_CACHE_HIBERNATION_FILE_CRC32,
> +            O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, S_IRUSR | S_IWUSR);
> +    if (fd < 0)
> +    {
> +        elog(WARNING,
> +            "could not open %s",
> +            BUFFER_CACHE_HIBERNATION_FILE_CRC32);
> +        goto cleanup;
> +    }
> +
> +    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
> +    {
> +        pg_crc32    crc;
> +
> +        if (BufferCacheHibernationLevel < 2 &&
> +            id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
> +        {
> +            continue;
> +        }
> +
> +        crc = BufferCacheHibernationData[id].crc;
> +        if (write(fd, (void *)&crc, sizeof(pg_crc32)) != sizeof(pg_crc32))
> +        {
> +            elog(WARNING,
> +                "could not write %s for %s",
> +                BUFFER_CACHE_HIBERNATION_FILE_CRC32,
> +                BufferCacheHibernationData[id].hibernation_file);
> +            goto cleanup;
> +        }
> +    }
> +    close(fd);
> +
> +    elog(NOTICE,
> +        "buffer cache suspended successfully");
> +
> +    return;
> +
> +cleanup:
> +    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
> +    {
> +        unlink(BufferCacheHibernationData[id].hibernation_file);
> +    }
> +
> +    return;
> +}
> +
> +/*
> + * ResisterBufferCacheHibernation:
> + *         register the buffer cache data structure info.
> + */
> +void
> +ResisterBufferCacheHibernation(BufferHibernationFileType id, char *ptr, Size record_length, Size num_records)
> +{
> +    static bool                    first_time = true;
> +
> +    if (BufferCacheHibernationLevel == 0)
> +    {
> +        return;
> +    }
> +
> +    if (id != BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY &&
> +        id != BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS &&
> +        id != BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
> +    {
> +        return;
> +    }
> +
> +    if (first_time)
> +    {
> +        /*
> +         * AtProcExit_BufferCacheHibernation to be called at shutdown.
> +         */
> +        on_shmem_exit(AtProcExit_BufferCacheHibernation, 0);
> +        first_time = false;
> +    }
> +
> +    /*
> +     * get the control file to check the system state and
> +     * hibernation file validations.
> +     */
> +    if (controlFileInitialized == false)
> +    {
> +        if (GetControlFile(&controlFile) == true)
> +        {
> +            controlFileInitialized = true;
> +        }
> +    }
> +
> +    BufferCacheHibernationData[id].data_ptr = ptr;
> +    BufferCacheHibernationData[id].record_length = record_length;
> +    BufferCacheHibernationData[id].num_records = num_records;
> +}
> +
> +/*
> + * ResumeBufferCacheHibernation:
> + *         resume the buffer cache from hibernation file at startup.
> + */
> +void
> +ResumeBufferCacheHibernation(void)
> +{
> +    BufferHibernationFileType    id;
> +    int                            i;
> +    int                            fd;
> +    Size                        num_records;
> +    Size                        record_length;
> +    char                        *buf_common;
> +    int                            oldNBuffers;
> +    bool                        buffer_block_processed;
> +
> +    if (BufferCacheHibernationLevel == 0)
> +    {
> +        return;
> +    }
> +
> +    buf_common = NULL;
> +    buffer_block_processed = false;
> +
> +    /*
> +     * lock all buffer descriptors to prevent other processes from
> +     * updating buffers.
> +     */
> +    for (i = 0; i < NBuffers; i++)
> +    {
> +        BufferDesc    *buf;
> +
> +        buf = &BufferDescriptors[i];
> +        LockBufHdr(buf);
> +    }
> +
> +    /*
> +     * get the control file to check the system state and
> +     * hibernation file validations.
> +     */
> +    if (controlFileInitialized == false)
> +    {
> +        elog(WARNING,
> +            "could not get control file, "
> +            "aborting buffer cache hibernation");
> +        goto cleanup;
> +    }
> +
> +    if (controlFile.state != DB_SHUTDOWNED)
> +    {
> +        elog(WARNING,
> +            "database system was not shut down normally, "
> +            "aborting buffer cache hibernation");
> +        goto cleanup;
> +    }
> +
> +    /*
> +     * read the crc values which were computed when the hibernation
> +     * files were created.
> +     */
> +    fd = BasicOpenFile(BUFFER_CACHE_HIBERNATION_FILE_CRC32,
> +            O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR);
> +    if (fd < 0)
> +    {
> +        elog(WARNING,
> +            "could not open %s",
> +            BUFFER_CACHE_HIBERNATION_FILE_CRC32);
> +        goto cleanup;
> +    }
> +
> +    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
> +    {
> +        pg_crc32    crc;
> +
> +        if (BufferCacheHibernationLevel < 2 &&
> +            id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
> +        {
> +            continue;
> +        }
> +
> +        if (read(fd, (void *)&crc, sizeof(pg_crc32)) != sizeof(pg_crc32))
> +        {
> +            if (BufferCacheHibernationLevel == 2 &&
> +                id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
> +            {
> +                /*
> +                 * if buffer_cache_hibernation_level changes 1 to 2,
> +                 * the crc value of buffer block hibernation file may not exist.
> +                 * just ignore it here.
> +                 */
> +                continue;
> +            }
> +
> +            elog(WARNING,
> +                "could not read %s for %s",
> +                BUFFER_CACHE_HIBERNATION_FILE_CRC32,
> +                BufferCacheHibernationData[id].hibernation_file);
> +            close(fd);
> +            goto cleanup;
> +        }
> +        BufferCacheHibernationData[id].crc = crc;
> +    }
> +
> +    close(fd);
> +
> +    /*
> +     * allocate a buffer to read the contents of the hibernation files
> +     * for validations.
> +     */
> +    record_length = 0;
> +    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
> +    {
> +        if (record_length < BufferCacheHibernationData[id].record_length)
> +        {
> +            record_length = BufferCacheHibernationData[id].record_length;
> +        }
> +    }
> +
> +    buf_common = malloc(record_length);
> +    Assert(buf_common != NULL);
> +
> +    /* assume that the number of buffers have not changed. */
> +    oldNBuffers = NBuffers;
> +
> +    /*
> +     * check if all hibernation files are valid.
> +     */
> +    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
> +    {
> +        struct stat    sb;
> +        pg_crc32    crc;
> +
> +        if (BufferCacheHibernationLevel < 2 &&
> +            id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
> +        {
> +            continue;
> +        }
> +
> +        if (BufferCacheHibernationData[id].data_ptr == NULL ||
> +            BufferCacheHibernationData[id].record_length == 0 ||
> +            BufferCacheHibernationData[id].num_records == 0)
> +        {
> +            elog(WARNING,
> +                "ResisterBufferCacheHibernation() was not called for %s",
> +                BufferCacheHibernationData[id].hibernation_file);
> +            goto cleanup;
> +        }
> +
> +        fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file,
> +                O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR);
> +        if (fd < 0)
> +        {
> +            if (BufferCacheHibernationLevel == 2 &&
> +                id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
> +            {
> +                /*
> +                 * if buffer_cache_hibernation_level changes 1 to 2,
> +                 * the buffer block hibernation file may not exist.
> +                 * just ignore it here.
> +                 */
> +                continue;
> +            }
> +
> +            goto cleanup;
> +        }
> +
> +        if (fstat(fd, &sb) < 0)
> +        {
> +            elog(WARNING,
> +                "could not get stats of the buffer cache hibernation file: %s",
> +                BufferCacheHibernationData[id].hibernation_file);
> +            close(fd);
> +            goto cleanup;
> +        }
> +
> +        record_length = BufferCacheHibernationData[id].record_length;
> +        num_records = BufferCacheHibernationData[id].num_records;
> +
> +        if (sb.st_size != (record_length * num_records))
> +        {
> +            /* The size of StrategyControl should be the same always. */
> +            if (id == BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY ||
> +                (sb.st_size % record_length) > 0)
> +            {
> +                elog(WARNING,
> +                    "size mismatch on the buffer cache hibernation file: %s",
> +                    BufferCacheHibernationData[id].hibernation_file);
> +                close(fd);
> +                goto cleanup;
> +            }
> +
> +            /*
> +             * The number of records of buffer descriptors and blocks
> +             * should be the same.
> +             */
> +            if (oldNBuffers != NBuffers &&
> +                oldNBuffers != (sb.st_size / record_length))
> +            {
> +                elog(WARNING,
> +                    "size mismatch on the buffer cache hibernation file: %s",
> +                    BufferCacheHibernationData[id].hibernation_file);
> +                close(fd);
> +                goto cleanup;
> +            }
> +            
> +            oldNBuffers = sb.st_size / record_length;
> +
> +            elog(NOTICE,
> +                "shared_buffers have changed from %d to %d: %s",
> +                oldNBuffers, NBuffers,
> +                BufferCacheHibernationData[id].hibernation_file);
> +
> +            /* use the original size to compute CRC of the hibernation file. */
> +            num_records = oldNBuffers;
> +        }
> +
> +        if ((pg_time_t)sb.st_mtime < controlFile.time)
> +        {
> +            elog(WARNING,
> +                "the hibernation file is older than control file: %s",
> +                BufferCacheHibernationData[id].hibernation_file);
> +            close(fd);
> +            goto cleanup;
> +        }
> +
> +        INIT_CRC32(crc);
> +        for (i = 0; i < num_records; i++)
> +        {
> +            if (read(fd, (void *)buf_common, record_length) != record_length)
> +            {
> +                elog(WARNING,
> +                    "could not read the buffer cache hibernation file: %s",
> +                    BufferCacheHibernationData[id].hibernation_file);
> +                close(fd);
> +                goto cleanup;
> +            }
> +
> +            COMP_CRC32(crc, buf_common, record_length);
> +
> +            /*
> +             * buffer descriptors validations.
> +             */
> +            if (id == BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS)
> +            {
> +                BufferDesc    *buf;
> +                BufFlags    abnormal_flags;
> +
> +                if (i >= NBuffers)
> +                {
> +                    continue;
> +                }
> +
> +                abnormal_flags = (BM_DIRTY | BM_IO_IN_PROGRESS | BM_IO_ERROR |
> +                                  BM_JUST_DIRTIED | BM_PIN_COUNT_WAITER);
> +
> +                buf = (BufferDesc *)buf_common;
> +
> +                if (buf->flags & abnormal_flags)
> +                {
> +                    elog(WARNING,
> +                        "abnormal flags in buffer descriptors: %d",
> +                        buf->flags);
> +                    close(fd);
> +                    goto cleanup;
> +                }
> +
> +                if (buf->usage_count > BM_MAX_USAGE_COUNT)
> +                {
> +                    elog(WARNING,
> +                        "invalid usage count in buffer descriptors: %d",
> +                        buf->usage_count);
> +                    close(fd);
> +                    goto cleanup;
> +                }
> +
> +                if (buf->buf_id < 0 || buf->buf_id >= num_records)
> +                {
> +                    elog(WARNING,
> +                        "invalid buffer id in buffer descriptors: %d",
> +                        buf->buf_id);
> +                    close(fd);
> +                    goto cleanup;
> +                }
> +            }
> +        }
> +
> +        FIN_CRC32(crc);
> +        close(fd);
> +
> +        if (!EQ_CRC32(BufferCacheHibernationData[id].crc, crc))
> +        {
> +            elog(WARNING,
> +                "crc mismatch on the buffer cache hibernation file: %s",
> +                BufferCacheHibernationData[id].hibernation_file);
> +            /* fd was already closed after FIN_CRC32 above */
> +            goto cleanup;
> +        }
> +    }
> +
> +    /*
> +     * resume the buffer cache data structure from the hibernation files.
> +     */
> +    for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
> +    {
> +        int            fd;
> +        char        *ptr;
> +
> +        if (BufferCacheHibernationLevel < 2 &&
> +            id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
> +        {
> +            continue;
> +        }
> +
> +        record_length = BufferCacheHibernationData[id].record_length;
> +        num_records = BufferCacheHibernationData[id].num_records;
> +
> +        if (id != BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY)
> +        {
> +            /* use the smaller number of buffers. */
> +            num_records = (oldNBuffers < NBuffers)? oldNBuffers : NBuffers;
> +        }
> +
> +        fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file,
> +                O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR);
> +        if (fd < 0)
> +        {
> +            if (BufferCacheHibernationLevel == 2 &&
> +                id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
> +            {
> +                /*
> +                 * if buffer_cache_hibernation_level changes 1 to 2,
> +                 * the buffer block hibernation file may not exist.
> +                 * just ignore it here.
> +                 */
> +                continue;
> +            }
> +
> +            goto cleanup;
> +        }
> +
> +        elog(NOTICE,
> +            "buffer cache resume from %s(%d bytes * %d records)",
> +            BufferCacheHibernationData[id].hibernation_file,
> +            record_length, num_records);
> +
> +        for (i = 0; i < num_records; i++)
> +        {
> +            ptr = BufferCacheHibernationData[id].data_ptr + (i * record_length);
> +            read(fd, (void *)ptr, record_length);
> +
> +            /* Re-lock the buffer descriptor if necessary. */
> +            if (id == BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS)
> +            {
> +                BufferDesc    *buf;
> +
> +                buf = (BufferDesc *)ptr;
> +                if (IsUnlockBufHdr(buf))
> +                {
> +                    LockBufHdr(buf);
> +                }
> +            }
> +        }
> +
> +        close(fd);
> +
> +        if (id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
> +        {
> +            buffer_block_processed = true;
> +        }
> +    }
> +
> +    if (buffer_block_processed == false)
> +    {
> +        /* we didn't use the buffer block hibernation file, so delete it now. */
> +        id = BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS;
> +        unlink(BufferCacheHibernationData[id].hibernation_file);
> +    }
> +
> +    /*
> +     * set up the remaining data structures (e.g. lookup hashtable)
> +     * based on the buffer descriptors.
> +     */
> +    num_records = (oldNBuffers < NBuffers)? oldNBuffers : NBuffers;
> +    for (i = 0; i < num_records; i++)
> +    {
> +        BufferDesc        *buf;
> +        BufferTag        newTag;
> +        uint32            newHash;
> +        int                buf_id;
> +
> +        buf = &BufferDescriptors[i];
> +        if (buf->tag.rnode.spcNode    == InvalidOid &&
> +            buf->tag.rnode.dbNode    == InvalidOid &&
> +            buf->tag.rnode.relNode    == InvalidOid)
> +        {
> +            continue;
> +        }
> +
> +        INIT_BUFFERTAG(newTag, buf->tag.rnode, buf->tag.forkNum, buf->tag.blockNum);
> +        newHash = BufTableHashCode(&newTag);
> +
> +        if (buffer_block_processed == false)
> +        {
> +            Block            bufBlock;
> +            SMgrRelation    smgr;
> +
> +            /*
> +             * re-read buffer block.
> +             */
> +            bufBlock = BufHdrGetBlock(buf);
> +            smgr = smgropen(buf->tag.rnode, InvalidBackendId);
> +            smgrread(smgr, newTag.forkNum, newTag.blockNum, (char *) bufBlock);
> +        }
> +
> +        buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
> +        if (buf_id != -1)
> +        {
> +            /* the entry exists already, return it to the freelist. */
> +            buf->refcount = 0;
> +            buf->flags = 0;
> +            InvalidateBuffer(buf);
> +            continue;
> +        }
> +
> +        /* clear wait_backend_pid because the process was terminated already. */
> +        buf->wait_backend_pid = 0;
> +
> +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION
> +        elog(DEBUG5,
> +            "resume [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d",
> +                buf->buf_id, buf->flags, buf->usage_count, buf->refcount,
> +                buf->wait_backend_pid, buf->freeNext,
> +                newHash, newTag.rnode.spcNode,
> +                newTag.rnode.dbNode, newTag.rnode.relNode,
> +                newTag.forkNum, newTag.blockNum);
> +#endif
> +    }
> +
> +    /*
> +     * adjust StrategyControl based on the change of shared_buffers.
> +     */
> +    if (oldNBuffers != NBuffers)
> +    {
> +        AdjustStrategyControl(oldNBuffers);
> +    }
> +
> +    elog(NOTICE,
> +        "buffer cache resumed successfully");
> +
> +cleanup:
> +    for (i = 0; i < NBuffers; i++)
> +    {
> +        BufferDesc    *buf;
> +
> +        buf = &BufferDescriptors[i];
> +        UnlockBufHdr(buf);
> +    }
> +
> +    if (buf_common != NULL)
> +    {
> +        free(buf_common);
> +    }
> +
> +    return;
> +}
> diff --git src/backend/storage/buffer/freelist.c src/backend/storage/buffer/freelist.c
> index bf9903b..ffc101d 100644
> --- src/backend/storage/buffer/freelist.c
> +++ src/backend/storage/buffer/freelist.c
> @@ -347,6 +347,12 @@ StrategyInitialize(bool init)
>      }
>      else
>          Assert(!init);
> +
> +    if (BufferCacheHibernationLevel > 0)
> +    {
> +        ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY,
> +            (char *)StrategyControl, sizeof(BufferStrategyControl), 1);
> +    }
>  }
>  
>  
> @@ -521,3 +527,47 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, volatile BufferDesc *buf)
>  
>      return true;
>  }
> +
> +/*
> + * AdjustStrategyControl -- adjust the member variables of StrategyControl
> + *
> + * If the shared_buffers setting has changed, the restored StrategyControl
> + * needs to be adjusted, whether the buffer pool shrank or grew.
> + * This is called only from bufmgr.c:ResumeBufferCacheHibernation().
> + */
> +void
> +AdjustStrategyControl(int oldNBuffers)
> +{
> +    if (oldNBuffers == NBuffers)
> +    {
> +        return;
> +    }
> +
> +    /* enlarge or shrink the free buffer based on current NBuffers. */
> +    StrategyControl->lastFreeBuffer = NBuffers - 1;
> +
> +    /* shared_buffers shrunk. */
> +    if (oldNBuffers > NBuffers)
> +    {
> +        if (StrategyControl->nextVictimBuffer >= NBuffers)
> +        {
> +            /* set the tail of buffers. */
> +            StrategyControl->nextVictimBuffer = NBuffers - 1;
> +        }
> +
> +        if (StrategyControl->firstFreeBuffer >= NBuffers)
> +        {
> +            /* set FREENEXT_END_OF_LIST(-1). */
> +            StrategyControl->firstFreeBuffer = FREENEXT_END_OF_LIST;
> +        }
> +    }
> +    else
> +    /* shared_buffers enlarged. */
> +    {
> +        if (StrategyControl->firstFreeBuffer < 0)
> +        {
> +            /* set the next entry of the tail of old buffers. */
> +            StrategyControl->firstFreeBuffer = oldNBuffers;
> +        }
> +    }
> +}
> diff --git src/backend/utils/misc/guc.c src/backend/utils/misc/guc.c
> index 738e215..5affc6e 100644
> --- src/backend/utils/misc/guc.c
> +++ src/backend/utils/misc/guc.c
> @@ -2361,6 +2361,18 @@ static struct config_int ConfigureNamesInt[] =
>          NULL, NULL, NULL
>      },
>  
> +    {
> +        {"buffer_cache_hibernation_level", PGC_POSTMASTER, UNGROUPED,
> +            gettext_noop("Sets buffer cache hibernation level."),
> +            gettext_noop("0 to disable(default), "
> +                         "1 for saving buffer descriptors only(recommended), "
> +                         "2 for saving buffer descriptors and buffer blocks(slower at shutdown).")
> +        },
> +        &BufferCacheHibernationLevel,
> +        0, 0, 2,
> +        NULL, NULL, NULL
> +    },
> +
>      /* End-of-list marker */
>      {
>          {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
> diff --git src/backend/utils/misc/postgresql.conf.sample src/backend/utils/misc/postgresql.conf.sample
> index b8a1582..44b6ff3 100644
> --- src/backend/utils/misc/postgresql.conf.sample
> +++ src/backend/utils/misc/postgresql.conf.sample
> @@ -119,6 +119,17 @@
>  #maintenance_work_mem = 16MB        # min 1MB
>  #max_stack_depth = 2MB            # min 100kB
>  
> +
> +# Buffer Cache Hibernation:
> +#  Suspend/resume buffer cache data structure using hibernation files
> +#  at shutdown/startup.
> +#buffer_cache_hibernation_level = 0    # Sets buffer cache hibernation level.
> +                    # 0 to disable(default),
> +                    # 1 for saving buffer descriptors only
> +                    #   (recommended),
> +                    # 2 for saving buffer descriptors and
> +                    #   buffer blocks(slower at shutdown).
> +
>  # - Kernel Resource Usage -
>  
>  #max_files_per_process = 1000        # min 25
> diff --git src/include/access/xlog.h src/include/access/xlog.h
> index 7056fd6..7a9fb99 100644
> --- src/include/access/xlog.h
> +++ src/include/access/xlog.h
> @@ -13,6 +13,7 @@
>  
>  #include "access/rmgr.h"
>  #include "access/xlogdefs.h"
> +#include "catalog/pg_control.h"
>  #include "lib/stringinfo.h"
>  #include "storage/buf.h"
>  #include "utils/pg_crc.h"
> @@ -294,6 +295,7 @@ extern bool XLogInsertAllowed(void);
>  extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
>  extern XLogRecPtr GetXLogReplayRecPtr(void);
>  
> +extern bool GetControlFile(ControlFileData *controlFile);
>  extern void UpdateControlFile(void);
>  extern uint64 GetSystemIdentifier(void);
>  extern Size XLOGShmemSize(void);
> diff --git src/include/storage/buf_internals.h src/include/storage/buf_internals.h
> index b7d4ea5..d537ef1 100644
> --- src/include/storage/buf_internals.h
> +++ src/include/storage/buf_internals.h
> @@ -167,6 +167,7 @@ typedef struct sbufdesc
>   */
>  #define LockBufHdr(bufHdr)        SpinLockAcquire(&(bufHdr)->buf_hdr_lock)
>  #define UnlockBufHdr(bufHdr)    SpinLockRelease(&(bufHdr)->buf_hdr_lock)
> +#define IsUnlockBufHdr(bufHdr)    SpinLockFree(&(bufHdr)->buf_hdr_lock)
>  
>  
>  /* in buf_init.c */
> @@ -190,6 +191,7 @@ extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
>  extern int    StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
>  extern Size StrategyShmemSize(void);
>  extern void StrategyInitialize(bool init);
> +extern void AdjustStrategyControl(int oldNBuffers);
>  
>  /* buf_table.c */
>  extern Size BufTableShmemSize(int size);
> diff --git src/include/storage/bufmgr.h src/include/storage/bufmgr.h
> index b8fc87e..ddfeb9d 100644
> --- src/include/storage/bufmgr.h
> +++ src/include/storage/bufmgr.h
> @@ -211,6 +211,20 @@ extern void BgBufferSync(void);
>  
>  extern void AtProcExit_LocalBuffers(void);
>  
> +/* buffer cache hibernation support stuff */
> +extern int    BufferCacheHibernationLevel;
> +
> +typedef enum BufferHibernationFileType
> +{   
> +    BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY,
> +    BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS,
> +    BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS
> +} BufferHibernationFileType;
> +
> +extern void ResisterBufferCacheHibernation(BufferHibernationFileType id,
> +                char *ptr, Size record_length, Size num_records);
> +extern void ResumeBufferCacheHibernation(void);
> +
>  /* in freelist.c */
>  extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
>  extern void FreeAccessStrategy(BufferAccessStrategy strategy);
> 

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
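
For readers following the header changes quoted above: the new ResisterBufferCacheHibernation() declaration takes an enum id, a pointer, a record length and a record count, and the patch keeps these in a static table indexed by that enum and terminated by an entry whose file name is NULL. Below is a minimal standalone sketch of that pattern. The names HibType, HibEntry, hib_register and hib_count_registered, and the file names in the table, are hypothetical stand-ins and are not part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the patch's BufferHibernationFileType. */
typedef enum
{
    HIB_STRATEGY,
    HIB_DESCRIPTORS,
    HIB_BLOCKS,
    HIB_NUM_TYPES               /* number of real entries, not a valid id */
} HibType;

/* One table slot per saved region; file == NULL marks the end of the list. */
typedef struct
{
    const char *file;
    char       *data_ptr;
    size_t      record_length;
    size_t      num_records;
} HibEntry;

static HibEntry hib_table[] = {
    {"hib_strategy", NULL, 0, 0},
    {"hib_descriptors", NULL, 0, 0},
    {"hib_blocks", NULL, 0, 0},
    {NULL, NULL, 0, 0}          /* end-of-list marker */
};

/* Remember where a region lives.  Returns 0 on success, -1 on a bad id. */
static int
hib_register(HibType id, char *ptr, size_t record_length, size_t num_records)
{
    if ((int) id < 0 || (int) id >= (int) HIB_NUM_TYPES)
        return -1;

    hib_table[id].data_ptr = ptr;
    hib_table[id].record_length = record_length;
    hib_table[id].num_records = num_records;
    return 0;
}

/* Walk the table up to the marker and count fully registered entries. */
static int
hib_count_registered(void)
{
    int         n = 0;

    for (int i = 0; hib_table[i].file != NULL; i++)
    {
        if (hib_table[i].data_ptr != NULL &&
            hib_table[i].record_length != 0 &&
            hib_table[i].num_records != 0)
            n++;
    }
    return n;
}
```

Unlike the patch, which silently returns on an unknown id, this sketch reports it to the caller; that is a choice made here for illustration only.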


Re: patch for new feature: Buffer Cache Hibernation

From
Cédric Villemain
Date:
2011/10/14 Bruce Momjian <bruce@momjian.us>:
>
> Should this be marked as TODO?

I suppose TODO items *are* wanted, so working on one should spare the
author the pain of convincing people here to accept the feature, shouldn't it?
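
For context on what such a TODO entry would cover: the patch quoted below writes each hibernation file record by record while accumulating a CRC, stores the CRCs in a separate file, and at startup rejects a file whose size is not a whole number of records or whose recomputed CRC differs. Here is a rough in-memory sketch of that check. It assumes a plain bitwise CRC-32 instead of PostgreSQL's pg_crc32 macros, and the names crc32_update, hib_checksum, hib_validate and the HIB_ERR_* codes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HIB_ERR_SIZE (-1)       /* image is not a whole number of records */
#define HIB_ERR_CRC  (-2)       /* recomputed checksum differs from saved one */

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); a stand-in for pg_crc32. */
static uint32_t
crc32_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Checksum an image one record at a time, as the save loop would. */
static uint32_t
hib_checksum(const char *image, size_t record_length, size_t num_records)
{
    uint32_t    crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < num_records; i++)
        crc = crc32_update(crc, image + i * record_length, record_length);
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Validate a saved image before using it.  Returns the number of records
 * on success, or a negative HIB_ERR_* code.
 */
static long
hib_validate(const char *image, size_t image_size,
             size_t record_length, uint32_t saved_crc)
{
    size_t      num_records;

    if (record_length == 0 || image_size % record_length != 0)
        return HIB_ERR_SIZE;

    num_records = image_size / record_length;
    if (hib_checksum(image, record_length, num_records) != saved_crc)
        return HIB_ERR_CRC;

    return (long) num_records;
}
```

The sketch leaves out the patch's other checks (control file state, file modification time, per-descriptor flag and usage-count sanity), which would sit alongside the size and CRC tests.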

>
> ---------------------------------------------------------------------------
>
> Mitsuru IWASAKI wrote:
>> Hi,
>>
>> > On 05/07/2011 03:32 AM, Mitsuru IWASAKI wrote:
>> > > For 1, I've just finished my work.  The latest patch is available at:
>> > > http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch
>> > >
>> >
>> > Reminder here--we can't accept code based on it being published to a web
>> > page.  You'll need to e-mail it to the pgsql-hackers mailing list to be
>> > considered for the next PostgreSQL CommitFest, which is starting in a
>> > few weeks.  Code submitted to the mailing list is considered a release
>> > of it to the project under the PostgreSQL license, which we can't just
>> > assume for things when given only a URL to them.
>>
>> Sorry about that, but I had enough time to revise my patches this weekend.
>> I have attached the patches to this mail, and will update the CommitFest page soon.
>>
>> > Also, you suggested you were out of time to work on this.  If that's the
>> > case, we'd like to know that so we don't keep cc'ing you about things in
>> > expectation of an answer.  Someone else may pick this up as a project to
>> > continue working on.  But it's going to need a fair amount of revision
>> > before it matches what people want here, and I'm not sure how much of
>> > what you've written is going to end up in any commit that may happen
>> > from this idea.
>>
>> It seems that I don't have enough time to complete this work.
>> You don't need to keep cc'ing me, and I would be very happy if postgres became
>> the first DBMS to support a buffer cache hibernation feature.
>>
>> Thanks!
>>
>>
>> diff --git src/backend/access/transam/xlog.c src/backend/access/transam/xlog.c
>> index b0e4c41..7a3a207 100644
>> --- src/backend/access/transam/xlog.c
>> +++ src/backend/access/transam/xlog.c
>> @@ -4834,6 +4834,19 @@ ReadControlFile(void)
>>  #endif
>>  }
>>
>> +bool
>> +GetControlFile(ControlFileData *controlFile)
>> +{
>> +     if (ControlFile == NULL)
>> +     {
>> +             return false;
>> +     }
>> +
>> +     memcpy(controlFile, ControlFile, sizeof(ControlFileData));
>> +
>> +     return true;
>> +}
>> +
>>  void
>>  UpdateControlFile(void)
>>  {
>> diff --git src/backend/bootstrap/bootstrap.c src/backend/bootstrap/bootstrap.c
>> index fc093cc..7ecf6bb 100644
>> --- src/backend/bootstrap/bootstrap.c
>> +++ src/backend/bootstrap/bootstrap.c
>> @@ -360,6 +360,15 @@ AuxiliaryProcessMain(int argc, char *argv[])
>>       BaseInit();
>>
>>       /*
>> +      * Only StartupProcess can call ResumeBufferCacheHibernation() after
>> +      * InitFileAccess() and smgrinit().
>> +      */
>> +     if (auxType == StartupProcess && BufferCacheHibernationLevel > 0)
>> +     {
>> +             ResumeBufferCacheHibernation();
>> +     }
>> +
>> +     /*
>>        * When we are an auxiliary process, we aren't going to do the full
>>        * InitPostgres pushups, but there are a couple of things that need to get
>>        * lit up even in an auxiliary process.
>> diff --git src/backend/storage/buffer/buf_init.c src/backend/storage/buffer/buf_init.c
>> index dadb49d..52eb51a 100644
>> --- src/backend/storage/buffer/buf_init.c
>> +++ src/backend/storage/buffer/buf_init.c
>> @@ -127,6 +127,14 @@ InitBufferPool(void)
>>
>>       /* Init other shared buffer-management stuff */
>>       StrategyInitialize(!foundDescs);
>> +
>> +     if (BufferCacheHibernationLevel > 0)
>> +     {
>> +             ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS,
>> +                     (char *)BufferDescriptors, sizeof(BufferDesc), NBuffers);
>> +             ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS,
>> +                     (char *)BufferBlocks, BLCKSZ, NBuffers);
>> +     }
>>  }
>>
>>  /*
>> diff --git src/backend/storage/buffer/bufmgr.c src/backend/storage/buffer/bufmgr.c
>> index f96685d..dba8ebf 100644
>> --- src/backend/storage/buffer/bufmgr.c
>> +++ src/backend/storage/buffer/bufmgr.c
>> @@ -31,6 +31,7 @@
>>  #include "postgres.h"
>>
>>  #include <sys/file.h>
>> +#include <sys/stat.h>
>>  #include <unistd.h>
>>
>>  #include "catalog/catalog.h"
>> @@ -61,6 +62,13 @@
>>  #define BUF_WRITTEN                          0x01
>>  #define BUF_REUSABLE                 0x02
>>
>> +/*
>> + * Buffer Cache Hibernation stuff.
>> + */
>> +/* enable this to debug buffer cache hibernation. */
>> +#if 0
>> +#define DEBUG_BUFFER_CACHE_HIBERNATION
>> +#endif
>>
>>  /* GUC variables */
>>  bool         zero_damaged_pages = false;
>> @@ -765,6 +773,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>>                               }
>>                       }
>>
>> +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION
>> +                     elog(DEBUG5,
>> +                             "alloc  [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d",
>> +                                     buf->buf_id, buf->flags, buf->usage_count, buf->refcount,
>> +                                     buf->wait_backend_pid, buf->freeNext,
>> +                                     newHash, newTag.rnode.spcNode,
>> +                                     newTag.rnode.dbNode, newTag.rnode.relNode,
>> +                                     newTag.forkNum, newTag.blockNum);
>> +#endif
>> +
>>                       return buf;
>>               }
>>
>> @@ -800,6 +818,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>>        * the old content is no longer relevant.  (The usage_count starts out at
>>        * 1 so that the buffer can survive one clock-sweep pass.)
>>        */
>> +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION
>> +     elog(DEBUG5,
>> +             "rename [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d",
>> +                     buf->buf_id, buf->flags, buf->usage_count, buf->refcount,
>> +                     buf->wait_backend_pid, buf->freeNext,
>> +                     oldHash, oldTag.rnode.spcNode,
>> +                     oldTag.rnode.dbNode, oldTag.rnode.relNode,
>> +                     oldTag.forkNum, oldTag.blockNum);
>> +#endif
>> +
>>       buf->tag = newTag;
>>       buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT);
>>       if (relpersistence == RELPERSISTENCE_PERMANENT)
>> @@ -2772,3 +2800,716 @@ local_buffer_write_error_callback(void *arg)
>>               pfree(path);
>>       }
>>  }
>> +
>> +/* ----------------------------------------------------------------
>> + *           Buffer Cache Hibernation support stuff
>> + *
>> + * Suspend/resume buffer cache data structure using hibernation files
>> + * at shutdown/startup.
>> + * ----------------------------------------------------------------
>> + */
>> +
>> +int  BufferCacheHibernationLevel = 0;
>> +
>> +#define      BUFFER_CACHE_HIBERNATION_FILE_STRATEGY          "global/pg_buffer_cache_hibernation_strategy"
>> +#define      BUFFER_CACHE_HIBERNATION_FILE_DESCRIPTORS       "global/pg_buffer_cache_hibernation_descriptors"
>> +#define      BUFFER_CACHE_HIBERNATION_FILE_BLOCKS            "global/pg_buffer_cache_hibernation_blocks"
>> +#define      BUFFER_CACHE_HIBERNATION_FILE_CRC32                     "global/pg_buffer_cache_hibernation_crc32"
>> +
>> +static struct
>> +{
>> +     char            *hibernation_file;
>> +     char            *data_ptr;
>> +     Size            record_length;
>> +     Size            num_records;
>> +     pg_crc32        crc;
>> +} BufferCacheHibernationData[] =
>> +{
>> +     /* BufferStrategyControl */
>> +     {
>> +             BUFFER_CACHE_HIBERNATION_FILE_STRATEGY,
>> +             NULL, 0, 0, 0
>> +     },
>> +
>> +     /* BufferDescriptors */
>> +     {
>> +             BUFFER_CACHE_HIBERNATION_FILE_DESCRIPTORS,
>> +             NULL, 0, 0, 0
>> +     },
>> +
>> +     /* BufferBlocks */
>> +     {
>> +             BUFFER_CACHE_HIBERNATION_FILE_BLOCKS,
>> +             NULL, 0, 0, 0
>> +     },
>> +
>> +     /* End-of-list marker */
>> +     {
>> +             NULL,
>> +             NULL, 0, 0, 0
>> +     },
>> +};
>> +
>> +static ControlFileData       controlFile;
>> +static bool                          controlFileInitialized = false;
>> +
>> +/*
>> + * AtProcExit_BufferCacheHibernation:
>> + *           store the buffer cache into hibernation files at shutdown.
>> + */
>> +static void
>> +AtProcExit_BufferCacheHibernation(int code, Datum arg)
>> +{
>> +     BufferHibernationFileType       id;
>> +     int                                                     i;
>> +     int                                                     fd;
>> +
>> +     if (BufferCacheHibernationLevel == 0)
>> +     {
>> +             return;
>> +     }
>> +
>> +     /*
>> +      * get the control file to validate the system state.
>> +      */
>> +     if (GetControlFile(&controlFile) == false)
>> +     {
>> +             elog(WARNING,
>> +                     "could not get control file, "
>> +                     "aborting buffer cache hibernation");
>> +             return;
>> +     }
>> +
>> +     if (controlFile.state != DB_SHUTDOWNED)
>> +     {
>> +             elog(WARNING,
>> +                     "database system was not shut down normally, "
>> +                     "aborting buffer cache hibernation");
>> +             return;
>> +     }
>> +
>> +     /*
>> +      * suspend buffer cache data structure into hibernation files.
>> +      */
>> +     for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
>> +     {
>> +             Size            record_length;
>> +             Size            num_records;
>> +             char            *ptr;
>> +             pg_crc32        crc;
>> +
>> +             if (BufferCacheHibernationLevel < 2 &&
>> +                     id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
>> +             {
>> +                     continue;
>> +             }
>> +
>> +             if (BufferCacheHibernationData[id].data_ptr == NULL ||
>> +                     BufferCacheHibernationData[id].record_length == 0 ||
>> +                     BufferCacheHibernationData[id].num_records == 0)
>> +             {
>> +                     elog(WARNING,
>> +                             "ResisterBufferCacheHibernation() was not called for %s",
>> +                             BufferCacheHibernationData[id].hibernation_file);
>> +                     goto cleanup;
>> +             }
>> +
>> +             fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file,
>> +                             O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, S_IRUSR | S_IWUSR);
>> +             if (fd < 0)
>> +             {
>> +                     elog(WARNING,
>> +                             "could not open %s",
>> +                             BufferCacheHibernationData[id].hibernation_file);
>> +                     goto cleanup;
>> +             }
>> +
>> +             record_length = BufferCacheHibernationData[id].record_length;
>> +             num_records = BufferCacheHibernationData[id].num_records;
>> +
>> +             elog(NOTICE,
>> +                     "buffer cache hibernate into %s",
>> +                     BufferCacheHibernationData[id].hibernation_file);
>> +
>> +             INIT_CRC32(crc);
>> +             for (i = 0; i < num_records; i++)
>> +             {
>> +                     ptr = BufferCacheHibernationData[id].data_ptr + (i * record_length);
>> +                     if (write(fd, (void *)ptr, record_length) != record_length)
>> +                     {
>> +                             elog(WARNING,
>> +                                     "could not write %s",
>> +                                     BufferCacheHibernationData[id].hibernation_file);
>> +                             goto cleanup;
>> +                     }
>> +
>> +                     COMP_CRC32(crc, ptr, record_length);
>> +             }
>> +
>> +             FIN_CRC32(crc);
>> +             close(fd);
>> +
>> +             BufferCacheHibernationData[id].crc = crc;
>> +     }
>> +
>> +     /*
>> +      * save the computed crc values for the validations at resuming.
>> +      */
>> +     fd = BasicOpenFile(BUFFER_CACHE_HIBERNATION_FILE_CRC32,
>> +                     O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, S_IRUSR | S_IWUSR);
>> +     if (fd < 0)
>> +     {
>> +             elog(WARNING,
>> +                     "could not open %s",
>> +                     BUFFER_CACHE_HIBERNATION_FILE_CRC32);
>> +             goto cleanup;
>> +     }
>> +
>> +     for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
>> +     {
>> +             pg_crc32        crc;
>> +
>> +             if (BufferCacheHibernationLevel < 2 &&
>> +                     id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
>> +             {
>> +                     continue;
>> +             }
>> +
>> +             crc = BufferCacheHibernationData[id].crc;
>> +             if (write(fd, (void *)&crc, sizeof(pg_crc32)) != sizeof(pg_crc32))
>> +             {
>> +                     elog(WARNING,
>> +                             "could not write %s for %s",
>> +                             BUFFER_CACHE_HIBERNATION_FILE_CRC32,
>> +                             BufferCacheHibernationData[id].hibernation_file);
>> +                     goto cleanup;
>> +             }
>> +     }
>> +     close(fd);
>> +
>> +     elog(NOTICE,
>> +             "buffer cache suspended successfully");
>> +
>> +     return;
>> +
>> +cleanup:
>> +     for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
>> +     {
>> +             unlink(BufferCacheHibernationData[id].hibernation_file);
>> +     }
>> +
>> +     return;
>> +}
>> +
>> +/*
>> + * ResisterBufferCacheHibernation:
>> + *           register the buffer cache data structure info.
>> + */
>> +void
>> +ResisterBufferCacheHibernation(BufferHibernationFileType id, char *ptr, Size record_length, Size num_records)
>> +{
>> +     static bool                                     first_time = true;
>> +
>> +     if (BufferCacheHibernationLevel == 0)
>> +     {
>> +             return;
>> +     }
>> +
>> +     if (id != BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY &&
>> +             id != BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS &&
>> +             id != BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
>> +     {
>> +             return;
>> +     }
>> +
>> +     if (first_time)
>> +     {
>> +             /*
>> +              * AtProcExit_BufferCacheHibernation to be called at shutdown.
>> +              */
>> +             on_shmem_exit(AtProcExit_BufferCacheHibernation, 0);
>> +             first_time = false;
>> +     }
>> +
>> +     /*
>> +      * get the control file to check the system state and
>> +      * hibernation file validations.
>> +      */
>> +     if (controlFileInitialized == false)
>> +     {
>> +             if (GetControlFile(&controlFile) == true)
>> +             {
>> +                     controlFileInitialized = true;
>> +             }
>> +     }
>> +
>> +     BufferCacheHibernationData[id].data_ptr = ptr;
>> +     BufferCacheHibernationData[id].record_length = record_length;
>> +     BufferCacheHibernationData[id].num_records = num_records;
>> +}
>> +
>> +/*
>> + * ResumeBufferCacheHibernation:
>> + *           resume the buffer cache from hibernation file at startup.
>> + */
>> +void
>> +ResumeBufferCacheHibernation(void)
>> +{
>> +     BufferHibernationFileType       id;
>> +     int                                                     i;
>> +     int                                                     fd;
>> +     Size                                            num_records;
>> +     Size                                            record_length;
>> +     char                                            *buf_common;
>> +     int                                                     oldNBuffers;
>> +     bool                                            buffer_block_processed;
>> +
>> +     if (BufferCacheHibernationLevel == 0)
>> +     {
>> +             return;
>> +     }
>> +
>> +     buf_common = NULL;
>> +     buffer_block_processed = false;
>> +
>> +     /*
>> +      * lock all buffer descriptors to prevent other processes from
>> +      * updating buffers.
>> +      */
>> +     for (i = 0; i < NBuffers; i++)
>> +     {
>> +             BufferDesc      *buf;
>> +
>> +             buf = &BufferDescriptors[i];
>> +             LockBufHdr(buf);
>> +     }
>> +
>> +     /*
>> +      * get the control file to check the system state and
>> +      * hibernation file validations.
>> +      */
>> +     if (controlFileInitialized == false)
>> +     {
>> +             elog(WARNING,
>> +                     "could not get control file, "
>> +                     "aborting buffer cache hibernation");
>> +             goto cleanup;
>> +     }
>> +
>> +     if (controlFile.state != DB_SHUTDOWNED)
>> +     {
>> +             elog(WARNING,
>> +                     "database system was not shut down normally, "
>> +                     "aborting buffer cache hibernation");
>> +             goto cleanup;
>> +     }
>> +
>> +     /*
>> +      * read the crc values which were computed when the hibernation
>> +      * files were created.
>> +      */
>> +     fd = BasicOpenFile(BUFFER_CACHE_HIBERNATION_FILE_CRC32,
>> +                     O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR);
>> +     if (fd < 0)
>> +     {
>> +             elog(WARNING,
>> +                     "could not open %s",
>> +                     BUFFER_CACHE_HIBERNATION_FILE_CRC32);
>> +             goto cleanup;
>> +     }
>> +
>> +     for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
>> +     {
>> +             pg_crc32        crc;
>> +
>> +             if (BufferCacheHibernationLevel < 2 &&
>> +                     id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
>> +             {
>> +                     continue;
>> +             }
>> +
>> +             if (read(fd, (void *)&crc, sizeof(pg_crc32)) != sizeof(pg_crc32))
>> +             {
>> +                     if (BufferCacheHibernationLevel == 2 &&
>> +                             id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
>> +                     {
>> +                             /*
>> +                              * if buffer_cache_hibernation_level changes 1 to 2,
>> +                              * the crc value of buffer block hibernation file may not exist.
>> +                              * just ignore it here.
>> +                              */
>> +                             continue;
>> +                     }
>> +
>> +                     elog(WARNING,
>> +                             "could not read %s for %s",
>> +                             BUFFER_CACHE_HIBERNATION_FILE_CRC32,
>> +                             BufferCacheHibernationData[id].hibernation_file);
>> +                     close(fd);
>> +                     goto cleanup;
>> +             }
>> +             BufferCacheHibernationData[id].crc = crc;
>> +     }
>> +
>> +     close(fd);
>> +
>> +     /*
>> +      * allocate a buffer to read the contents of the hibernation files
>> +      * for validations.
>> +      */
>> +     record_length = 0;
>> +     for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
>> +     {
>> +             if (record_length < BufferCacheHibernationData[id].record_length)
>> +             {
>> +                     record_length = BufferCacheHibernationData[id].record_length;
>> +             }
>> +     }
>> +
>> +     buf_common = malloc(record_length);
>> +     Assert(buf_common != NULL);
>> +
>> +     /* assume that the number of buffers has not changed. */
>> +     oldNBuffers = NBuffers;
>> +
>> +     /*
>> +      * check if all hibernation files are valid.
>> +      */
>> +     for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
>> +     {
>> +             struct stat     sb;
>> +             pg_crc32        crc;
>> +
>> +             if (BufferCacheHibernationLevel < 2 &&
>> +                     id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
>> +             {
>> +                     continue;
>> +             }
>> +
>> +             if (BufferCacheHibernationData[id].data_ptr == NULL ||
>> +                     BufferCacheHibernationData[id].record_length == 0 ||
>> +                     BufferCacheHibernationData[id].num_records == 0)
>> +             {
>> +                     elog(WARNING,
>> +                             "ResisterBufferCacheHibernation() was not called for %s",
>> +                             BufferCacheHibernationData[id].hibernation_file);
>> +                     goto cleanup;
>> +             }
>> +
>> +             fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file,
>> +                             O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR);
>> +             if (fd < 0)
>> +             {
>> +                     if (BufferCacheHibernationLevel == 2 &&
>> +                             id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
>> +                     {
>> +                             /*
>> +                              * if buffer_cache_hibernation_level changes 1 to 2,
>> +                              * the buffer block hibernation file may not exist.
>> +                              * just ignore it here.
>> +                              */
>> +                             continue;
>> +                     }
>> +
>> +                     goto cleanup;
>> +             }
>> +
>> +             if (fstat(fd, &sb) < 0)
>> +             {
>> +                     elog(WARNING,
>> +                             "could not get stats of the buffer cache hibernation file: %s",
>> +                             BufferCacheHibernationData[id].hibernation_file);
>> +                     close(fd);
>> +                     goto cleanup;
>> +             }
>> +
>> +             record_length = BufferCacheHibernationData[id].record_length;
>> +             num_records = BufferCacheHibernationData[id].num_records;
>> +
>> +             if (sb.st_size != (record_length * num_records))
>> +             {
>> +                     /* The size of StrategyControl should be the same always. */
>> +                     if (id == BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY ||
>> +                             (sb.st_size % record_length) > 0)
>> +                     {
>> +                             elog(WARNING,
>> +                                     "size mismatch on the buffer cache hibernation file: %s",
>> +                                     BufferCacheHibernationData[id].hibernation_file);
>> +                             close(fd);
>> +                             goto cleanup;
>> +                     }
>> +
>> +                     /*
>> +                      * The number of records of buffer descriptors and blocks
>> +                      * should be the same.
>> +                      */
>> +                     if (oldNBuffers != NBuffers &&
>> +                             oldNBuffers != (sb.st_size / record_length))
>> +                     {
>> +                             elog(WARNING,
>> +                                     "size mismatch on the buffer cache hibernation file: %s",
>> +                                     BufferCacheHibernationData[id].hibernation_file);
>> +                             close(fd);
>> +                             goto cleanup;
>> +                     }
>> +
>> +                     oldNBuffers = sb.st_size / record_length;
>> +
>> +                     elog(NOTICE,
>> +                             "shared_buffers have changed from %d to %d: %s",
>> +                             oldNBuffers, NBuffers,
>> +                             BufferCacheHibernationData[id].hibernation_file);
>> +
>> +                     /* use the original size to compute CRC of the hibernation file. */
>> +                     num_records = oldNBuffers;
>> +             }
>> +
>> +             if ((pg_time_t)sb.st_mtime < controlFile.time)
>> +             {
>> +                     elog(WARNING,
>> +                             "the hibernation file is older than control file: %s",
>> +                             BufferCacheHibernationData[id].hibernation_file);
>> +                     close(fd);
>> +                     goto cleanup;
>> +             }
>> +
>> +             INIT_CRC32(crc);
>> +             for (i = 0; i < num_records; i++)
>> +             {
>> +                     if (read(fd, (void *)buf_common, record_length) != record_length)
>> +                     {
>> +                             elog(WARNING,
>> +                                     "could not read the buffer cache hibernation file: %s",
>> +                                     BufferCacheHibernationData[id].hibernation_file);
>> +                             close(fd);
>> +                             goto cleanup;
>> +                     }
>> +
>> +                     COMP_CRC32(crc, buf_common, record_length);
>> +
>> +                     /*
>> +                      * buffer descriptors validations.
>> +                      */
>> +                     if (id == BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS)
>> +                     {
>> +                             BufferDesc      *buf;
>> +                             BufFlags        abnormal_flags;
>> +
>> +                             if (i >= NBuffers)
>> +                             {
>> +                                     continue;
>> +                             }
>> +
>> +                             abnormal_flags = (BM_DIRTY | BM_IO_IN_PROGRESS | BM_IO_ERROR |
>> +                                                               BM_JUST_DIRTIED | BM_PIN_COUNT_WAITER);
>> +
>> +                             buf = (BufferDesc *)buf_common;
>> +
>> +                             if (buf->flags & abnormal_flags)
>> +                             {
>> +                                     elog(WARNING,
>> +                                             "abnormal flags in buffer descriptors: %d",
>> +                                             buf->flags);
>> +                                     close(fd);
>> +                                     goto cleanup;
>> +                             }
>> +
>> +                             if (buf->usage_count > BM_MAX_USAGE_COUNT)
>> +                             {
>> +                                     elog(WARNING,
>> +                                             "invalid usage count in buffer descriptors: %d",
>> +                                             buf->usage_count);
>> +                                     close(fd);
>> +                                     goto cleanup;
>> +                             }
>> +
>> +                             if (buf->buf_id < 0 || buf->buf_id >= num_records)
>> +                             {
>> +                                     elog(WARNING,
>> +                                             "invalid buffer id in buffer descriptors: %d",
>> +                                             buf->buf_id);
>> +                                     close(fd);
>> +                                     goto cleanup;
>> +                             }
>> +                     }
>> +             }
>> +
>> +             FIN_CRC32(crc);
>> +             close(fd);
>> +
>> +             if (!EQ_CRC32(BufferCacheHibernationData[id].crc, crc))
>> +             {
>> +                     elog(WARNING,
>> +                             "crc mismatch on the buffer cache hibernation file: %s",
>> +                             BufferCacheHibernationData[id].hibernation_file);
>> +                     /* fd was already closed above; do not close it twice. */
>> +                     goto cleanup;
>> +             }
>> +     }
>> +
>> +     /*
>> +      * resume the buffer cache data structure from the hibernation files.
>> +      */
>> +     for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++)
>> +     {
>> +             int                     fd;
>> +             char            *ptr;
>> +
>> +             if (BufferCacheHibernationLevel < 2 &&
>> +                     id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
>> +             {
>> +                     continue;
>> +             }
>> +
>> +             record_length = BufferCacheHibernationData[id].record_length;
>> +             num_records = BufferCacheHibernationData[id].num_records;
>> +
>> +             if (id != BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY)
>> +             {
>> +                     /* use the smaller number of buffers. */
>> +                     num_records = (oldNBuffers < NBuffers)? oldNBuffers : NBuffers;
>> +             }
>> +
>> +             fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file,
>> +                             O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR);
>> +             if (fd < 0)
>> +             {
>> +                     if (BufferCacheHibernationLevel == 2 &&
>> +                             id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
>> +                     {
>> +                             /*
>> +                              * if buffer_cache_hibernation_level changes 1 to 2,
>> +                              * the buffer block hibernation file may not exist.
>> +                              * just ignore it here.
>> +                              */
>> +                             continue;
>> +                     }
>> +
>> +                     goto cleanup;
>> +             }
>> +
>> +             elog(NOTICE,
>> +                     "buffer cache resume from %s(%d bytes * %d records)",
>> +                     BufferCacheHibernationData[id].hibernation_file,
>> +                     record_length, num_records);
>> +
>> +             for (i = 0; i < num_records; i++)
>> +             {
>> +                     ptr = BufferCacheHibernationData[id].data_ptr + (i * record_length);
>> +                     read(fd, (void *)ptr, record_length);
>> +
>> +                     /* Re-lock the buffer descriptor if necessary. */
>> +                     if (id == BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS)
>> +                     {
>> +                             BufferDesc      *buf;
>> +
>> +                             buf = (BufferDesc *)ptr;
>> +                             if (IsUnlockBufHdr(buf))
>> +                             {
>> +                                     LockBufHdr(buf);
>> +                             }
>> +                     }
>> +             }
>> +
>> +             close(fd);
>> +
>> +             if (id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS)
>> +             {
>> +                     buffer_block_processed = true;
>> +             }
>> +     }
>> +
>> +     if (buffer_block_processed == false)
>> +     {
>> +             /* we didn't use the buffer block hibernation file, so delete it now. */
>> +             id = BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS;
>> +             unlink(BufferCacheHibernationData[id].hibernation_file);
>> +     }
>> +
>> +     /*
>> +      * set the rest data structures (eg. lookup hashtable) up
>> +      * based on the buffer descriptors.
>> +      */
>> +     num_records = (oldNBuffers < NBuffers)? oldNBuffers : NBuffers;
>> +     for (i = 0; i < num_records; i++)
>> +     {
>> +             BufferDesc              *buf;
>> +             BufferTag               newTag;
>> +             uint32                  newHash;
>> +             int                             buf_id;
>> +
>> +             buf = &BufferDescriptors[i];
>> +             if (buf->tag.rnode.spcNode      == InvalidOid &&
>> +                     buf->tag.rnode.dbNode   == InvalidOid &&
>> +                     buf->tag.rnode.relNode  == InvalidOid)
>> +             {
>> +                     continue;
>> +             }
>> +
>> +             INIT_BUFFERTAG(newTag, buf->tag.rnode, buf->tag.forkNum, buf->tag.blockNum);
>> +             newHash = BufTableHashCode(&newTag);
>> +
>> +             if (buffer_block_processed == false)
>> +             {
>> +                     Block                   bufBlock;
>> +                     SMgrRelation    smgr;
>> +
>> +                     /*
>> +                      * re-read buffer block.
>> +                      */
>> +                     bufBlock = BufHdrGetBlock(buf);
>> +                     smgr = smgropen(buf->tag.rnode, InvalidBackendId);
>> +                     smgrread(smgr, newTag.forkNum, newTag.blockNum, (char *) bufBlock);
>> +             }
>> +
>> +             buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
>> +             if (buf_id != -1)
>> +             {
>> +                     /* the entry exists already, return it to the freelist. */
>> +                     buf->refcount = 0;
>> +                     buf->flags = 0;
>> +                     InvalidateBuffer(buf);
>> +                     continue;
>> +             }
>> +
>> +             /* clear wait_backend_pid because the process was terminated already. */
>> +             buf->wait_backend_pid = 0;
>> +
>> +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION
>> +             elog(DEBUG5,
>> +                     "resume [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d",
>> +                             buf->buf_id, buf->flags, buf->usage_count, buf->refcount,
>> +                             buf->wait_backend_pid, buf->freeNext,
>> +                             newHash, newTag.rnode.spcNode,
>> +                             newTag.rnode.dbNode, newTag.rnode.relNode,
>> +                             newTag.forkNum, newTag.blockNum);
>> +#endif
>> +     }
>> +
>> +     /*
>> +      * adjust StrategyControl based on the change of shared_buffers.
>> +      */
>> +     if (oldNBuffers != NBuffers)
>> +     {
>> +             AdjustStrategyControl(oldNBuffers);
>> +     }
>> +
>> +     elog(NOTICE,
>> +             "buffer cache resumed successfully");
>> +
>> +cleanup:
>> +     for (i = 0; i < NBuffers; i++)
>> +     {
>> +             BufferDesc      *buf;
>> +
>> +             buf = &BufferDescriptors[i];
>> +             UnlockBufHdr(buf);
>> +     }
>> +
>> +     if (buf_common != NULL)
>> +     {
>> +             free(buf_common);
>> +     }
>> +
>> +     return;
>> +}
>> diff --git src/backend/storage/buffer/freelist.c src/backend/storage/buffer/freelist.c
>> index bf9903b..ffc101d 100644
>> --- src/backend/storage/buffer/freelist.c
>> +++ src/backend/storage/buffer/freelist.c
>> @@ -347,6 +347,12 @@ StrategyInitialize(bool init)
>>       }
>>       else
>>               Assert(!init);
>> +
>> +     if (BufferCacheHibernationLevel > 0)
>> +     {
>> +             ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY,
>> +                     (char *)StrategyControl, sizeof(BufferStrategyControl), 1);
>> +     }
>>  }
>>
>>
>> @@ -521,3 +527,47 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, volatile BufferDesc *buf)
>>
>>       return true;
>>  }
>> +
>> +/*
>> + * AdjustStrategyControl -- adjust the member variables of StrategyControl
>> + *
>> + * If the shared_buffers setting had changed, restored StrategyControl
>> + * needs to be adjusted for in both cases of shrinking and enlarging.
>> + * This is called only from bufmgr.c:ResumeBufferCacheHibernation().
>> + */
>> +void
>> +AdjustStrategyControl(int oldNBuffers)
>> +{
>> +     if (oldNBuffers == NBuffers)
>> +     {
>> +             return;
>> +     }
>> +
>> +     /* enlarge or shrink the free buffer based on current NBuffers. */
>> +     StrategyControl->lastFreeBuffer = NBuffers - 1;
>> +
>> +     /* shared_buffers shrunk. */
>> +     if (oldNBuffers > NBuffers)
>> +     {
>> +             if (StrategyControl->nextVictimBuffer >= NBuffers)
>> +             {
>> +                     /* set the tail of buffers. */
>> +                     StrategyControl->nextVictimBuffer = NBuffers - 1;
>> +             }
>> +
>> +             if (StrategyControl->firstFreeBuffer >= NBuffers)
>> +             {
>> +                     /* set FREENEXT_END_OF_LIST(-1). */
>> +                     StrategyControl->firstFreeBuffer = FREENEXT_END_OF_LIST;
>> +             }
>> +     }
>> +     else
>> +     /* shared_buffers enlarged. */
>> +     {
>> +             if (StrategyControl->firstFreeBuffer < 0)
>> +             {
>> +                     /* set the next entry of the tail of old buffers. */
>> +                     StrategyControl->firstFreeBuffer = oldNBuffers;
>> +             }
>> +     }
>> +}
>> diff --git src/backend/utils/misc/guc.c src/backend/utils/misc/guc.c
>> index 738e215..5affc6e 100644
>> --- src/backend/utils/misc/guc.c
>> +++ src/backend/utils/misc/guc.c
>> @@ -2361,6 +2361,18 @@ static struct config_int ConfigureNamesInt[] =
>>               NULL, NULL, NULL
>>       },
>>
>> +     {
>> +             {"buffer_cache_hibernation_level", PGC_POSTMASTER, UNGROUPED,
>> +                     gettext_noop("Sets buffer cache hibernation level."),
>> +                     gettext_noop("0 to disable(default), "
>> +                                              "1 for saving buffer descriptors only(recommended), "
>> +                                              "2 for saving buffer descriptors and buffer blocks(slower at shutdown).")
>> +             },
>> +             &BufferCacheHibernationLevel,
>> +             0, 0, 2,
>> +             NULL, NULL, NULL
>> +     },
>> +
>>       /* End-of-list marker */
>>       {
>>               {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
>> diff --git src/backend/utils/misc/postgresql.conf.sample src/backend/utils/misc/postgresql.conf.sample
>> index b8a1582..44b6ff3 100644
>> --- src/backend/utils/misc/postgresql.conf.sample
>> +++ src/backend/utils/misc/postgresql.conf.sample
>> @@ -119,6 +119,17 @@
>>  #maintenance_work_mem = 16MB         # min 1MB
>>  #max_stack_depth = 2MB                       # min 100kB
>>
>> +
>> +# Buffer Cache Hibernation:
>> +#  Suspend/resume buffer cache data structure using hibernation files
>> +#  at shutdown/startup.
>> +#buffer_cache_hibernation_level = 0  # Sets buffer cache hibernation level.
>> +                                     # 0 to disable(default),
>> +                                     # 1 for saving buffer descriptors only
>> +                                     #   (recommended),
>> +                                     # 2 for saving buffer descriptors and
>> +                                     #   buffer blocks(slower at shutdown).
>> +
>>  # - Kernel Resource Usage -
>>
>>  #max_files_per_process = 1000                # min 25
>> diff --git src/include/access/xlog.h src/include/access/xlog.h
>> index 7056fd6..7a9fb99 100644
>> --- src/include/access/xlog.h
>> +++ src/include/access/xlog.h
>> @@ -13,6 +13,7 @@
>>
>>  #include "access/rmgr.h"
>>  #include "access/xlogdefs.h"
>> +#include "catalog/pg_control.h"
>>  #include "lib/stringinfo.h"
>>  #include "storage/buf.h"
>>  #include "utils/pg_crc.h"
>> @@ -294,6 +295,7 @@ extern bool XLogInsertAllowed(void);
>>  extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
>>  extern XLogRecPtr GetXLogReplayRecPtr(void);
>>
>> +extern bool GetControlFile(ControlFileData *controlFile);
>>  extern void UpdateControlFile(void);
>>  extern uint64 GetSystemIdentifier(void);
>>  extern Size XLOGShmemSize(void);
>> diff --git src/include/storage/buf_internals.h src/include/storage/buf_internals.h
>> index b7d4ea5..d537ef1 100644
>> --- src/include/storage/buf_internals.h
>> +++ src/include/storage/buf_internals.h
>> @@ -167,6 +167,7 @@ typedef struct sbufdesc
>>   */
>>  #define LockBufHdr(bufHdr)           SpinLockAcquire(&(bufHdr)->buf_hdr_lock)
>>  #define UnlockBufHdr(bufHdr) SpinLockRelease(&(bufHdr)->buf_hdr_lock)
>> +#define IsUnlockBufHdr(bufHdr)       SpinLockFree(&(bufHdr)->buf_hdr_lock)
>>
>>
>>  /* in buf_init.c */
>> @@ -190,6 +191,7 @@ extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
>>  extern int   StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
>>  extern Size StrategyShmemSize(void);
>>  extern void StrategyInitialize(bool init);
>> +extern void AdjustStrategyControl(int oldNBuffers);
>>
>>  /* buf_table.c */
>>  extern Size BufTableShmemSize(int size);
>> diff --git src/include/storage/bufmgr.h src/include/storage/bufmgr.h
>> index b8fc87e..ddfeb9d 100644
>> --- src/include/storage/bufmgr.h
>> +++ src/include/storage/bufmgr.h
>> @@ -211,6 +211,20 @@ extern void BgBufferSync(void);
>>
>>  extern void AtProcExit_LocalBuffers(void);
>>
>> +/* buffer cache hibernation support stuff */
>> +extern int   BufferCacheHibernationLevel;
>> +
>> +typedef enum BufferHibernationFileType
>> +{
>> +    BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY,
>> +    BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS,
>> +    BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS
>> +} BufferHibernationFileType;
>> +
>> +extern void ResisterBufferCacheHibernation(BufferHibernationFileType id,
>> +                             char *ptr, Size record_length, Size num_records);
>> +extern void ResumeBufferCacheHibernation(void);
>> +
>>  /* in freelist.c */
>>  extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
>>  extern void FreeAccessStrategy(BufferAccessStrategy strategy);
>>
>> --
>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-hackers
>
> --
>  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>  EnterpriseDB                             http://enterprisedb.com
>
>  + It's impossible for everything to be true. +
>
>



--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


Re: patch for new feature: Buffer Cache Hibernation

From
Heikki Linnakangas
Date:
On 14.10.2011 11:44, Cédric Villemain wrote:
> 2011/10/14 Bruce Momjian<bruce@momjian.us>:
>>
>> Should this be marked as TODO?
>
> I suppose TODO items *are* wanted, aren't they?  So working on them
> should remove the pain of convincing people here to accept the feature.

I don't think this is worthwhile to have in the backend. Someone could 
write it as an extension on pgfoundry, but I don't think that belongs on 
the TODO.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: patch for new feature: Buffer Cache Hibernation

From
Tom Lane
Date:
Cédric Villemain <cedric.villemain.debian@gmail.com> writes:
> 2011/10/14 Bruce Momjian <bruce@momjian.us>:
>> Should this be marked as TODO?

> I suppose TODO items *are* wanted, aren't they?  So working on them
> should remove the pain of convincing people here to accept the feature.

There is plenty of stuff in the TODO list for which there is no
consensus.
        regards, tom lane


Re: patch for new feature: Buffer Cache Hibernation

From
Bruce Momjian
Date:
Tom Lane wrote:
> Cédric Villemain <cedric.villemain.debian@gmail.com> writes:
> > 2011/10/14 Bruce Momjian <bruce@momjian.us>:
> >> Should this be marked as TODO?
> 
> > I suppose TODO items *are* wanted, aren't they?  So working on them
> > should remove the pain of convincing people here to accept the feature.
> 
> There is plenty of stuff in the TODO list for which there is no
> consensus.

Uh, we should probably remove those then.  Can you think of any?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: patch for new feature: Buffer Cache Hibernation

From
Alvaro Herrera
Date:
Excerpts from Bruce Momjian's message of vie oct 14 11:56:22 -0300 2011:
> Tom Lane wrote:
> > Cédric Villemain <cedric.villemain.debian@gmail.com> writes:
> > > 2011/10/14 Bruce Momjian <bruce@momjian.us>:
> > >> Should this be marked as TODO?
> > 
> > > I suppose TODO items *are* wanted, aren't they?  So working on them
> > > should remove the pain of convincing people here to accept the feature.
> > 
> > There is plenty of stuff in the TODO list for which there is no
> > consensus.
> 
> Uh, we should probably remove those then.  Can you think of any?

The guideline, last I checked, was that before getting into coding any
item from the TODO list, the prospective hacker should check previous
discussions and initiate a new one on this list to ensure consensus.
Unless something is blatantly "not wanted", I don't think it should be
removed from the TODO list.  There not being consensus does not mean
that there cannot ever be.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: patch for new feature: Buffer Cache Hibernation

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> 
> Excerpts from Bruce Momjian's message of vie oct 14 11:56:22 -0300 2011:
> > Tom Lane wrote:
> > > Cédric Villemain <cedric.villemain.debian@gmail.com> writes:
> > > > 2011/10/14 Bruce Momjian <bruce@momjian.us>:
> > > >> Should this be marked as TODO?
> > > 
> > > > I suppose TODO items *are* wanted, aren't they?  So working on them
> > > > should remove the pain of convincing people here to accept the feature.
> > > 
> > > There is plenty of stuff in the TODO list for which there is no
> > > consensus.
> > 
> > Uh, we should probably remove those then.  Can you think of any?
> 
> The guideline, last I checked, was that before getting into coding any
> item from the TODO list, the prospective hacker should check previous
> discussions and initiate a new one on this list to ensure consensus.
> Unless something is blatantly "not wanted", I don't think it should be
> removed from the TODO list.  There not being consensus does not mean
> that there cannot ever be.

OK.  But if we are pretty sure we don't want something, e.g. hibernate,
we shouldn't add it.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: patch for new feature: Buffer Cache Hibernation

From
Robert Haas
Date:
On Fri, Oct 14, 2011 at 11:12 AM, Bruce Momjian <bruce@momjian.us> wrote:
> OK.  But if we are pretty sure we don't want something, e.g. hibernate,
> we shouldn't add it.

Fair enough, but I'm not even slightly sure that we don't want that.
I think having prewarming utilities available as contrib modules or on
PGXN would be useful, but integrating something into the backend would
allow it to be far more automated.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for new feature: Buffer Cache Hibernation

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Bruce Momjian's message of vie oct 14 11:56:22 -0300 2011:
>> Tom Lane wrote:
>>> There is plenty of stuff in the TODO list for which there is no
>>> consensus.

>> Uh, we should probably remove those then.  Can you think of any?

> Unless something is blatantly "not wanted", I don't think it should be
> removed from the TODO list.  There not being consensus does not mean
> that there cannot ever be.

Yeah.  The reason why something is on the TODO list (and not already
done) is typically one of

1. It's too hard, or too long/boring for the expected value.
2. There's no consensus about how to implement the feature.
3. There's no consensus about the user-visible design of the feature.

Cases where there's debate about whether we want it at all seem to me
to be a subset of #3.  But for anything in #3, someone could do the
legwork or have the bright idea needed to create consensus about how
to design the feature.

My gripe about the TODO list is not that we have some stuff in there
that's not clearly wanted, it's that some of the entries fail to make
it clear where the issue stands on this scale.  That could lead people
to waste time trying to code something that there's not consensus for
the design or implementation of.
        regards, tom lane


Re: patch for new feature: Buffer Cache Hibernation

From
Alvaro Herrera
Date:
Excerpts from Bruce Momjian's message of vie oct 14 12:12:22 -0300 2011:
> 
> Alvaro Herrera wrote:

> > The guideline, last I checked, was that before getting into coding any
> > item from the TODO list, the prospective hacker should check previous
> > discussions and initiate a new one on this list to ensure consensus.
> > Unless something is blatantly "not wanted", I don't think it should be
> > removed from the TODO list.  There not being consensus does not mean
> > that there cannot ever be.
> 
> OK.  But if we are pretty sure we don't want something, e.g. hibernate,
> we shouldn't add it.

If we're so sure we don't want it, we could add it to the "features we
do not want" section.  But as Robert says downthread, I don't see us
being so sure that we don't want hibernation.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: patch for new feature: Buffer Cache Hibernation

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Oct 14, 2011 at 11:12 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> OK.  But if we are pretty sure we don't want something, e.g. hibernate,
>> we shouldn't add it.

> Fair enough, but I'm not even slightly sure that we don't want that.
> I think having prewarming utilities available as contrib modules or on
> PGXN would be useful, but integrating something into the backend would
> allow it to be far more automated.

Right.  I think this one falls into my class #2, ie, we have no idea how
to implement it usefully.  Doesn't (necessarily) mean that the core
concept is without merit.
        regards, tom lane


Re: patch for new feature: Buffer Cache Hibernation

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> 
> Excerpts from Bruce Momjian's message of vie oct 14 12:12:22 -0300 2011:
> > 
> > Alvaro Herrera wrote:
> 
> > > The guideline, last I checked, was that before getting into coding any
> > > item from the TODO list, the prospective hacker should check previous
> > > discussions and initiate a new one on this list to ensure consensus.
> > > Unless something is blatantly "not wanted", I don't think it should be
> > > removed from the TODO list.  There not being consensus does not mean
> > > that there cannot ever be.
> > 
> > OK.  But if we are pretty sure we don't want something, e.g. hibernate,
> > we shouldn't add it.
> 
> If we're so sure we don't want it, we could add it to the "features we
> do not want" section.  But as Robert says downthread, I don't see us

Those are for features that people often ask for, and we don't want.  I
am sure there are a lot of things we don't want.

> being so sure that we don't want hibernation.

So, add it?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: patch for new feature: Buffer Cache Hibernation

From
Greg Stark
Date:
On Fri, Oct 14, 2011 at 4:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Right.  I think this one falls into my class #2, ie, we have no idea how
> to implement it usefully.  Doesn't (necessarily) mean that the core
> concept is without merit.

Hm. Given that we have an implementation, I wouldn't say we have *no*
clue.  But there are certainly some parts we don't have consensus on
yet. But then working code sometimes trumps a lack of absolute
consensus.

But just for the sake of argument I'm not sure that the implementation
of dumping the current contents of the buffer cache is actually
optimal. It doesn't handle resizing the buffer cache after a restart
for example which I think would be a significant case. There could be
other buffer cache algorithm parameters users might change -- though I
don't think we really have any currently.

If we had --to take it to an extreme-- a record of every buffer
request prior to the shutdown then we could replay that log virtually
with the new buffer cache size and know what buffers the new buffer
cache size would have had in it.
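
To make the replay idea concrete, here is a minimal sketch in C. It
assumes the access log is available as an in-memory array of block
numbers and uses a plain LRU policy as a stand-in; PostgreSQL's buffer
manager uses a clock-sweep algorithm, so this is a simplification.
replay_trace_lru is a hypothetical name that does not appear in the
patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Replay a trace of block numbers through a simulated LRU cache with
 * `capacity` slots.  On return, cache[0..n-1] holds the resident blocks,
 * most recently used first, and n is the return value.
 *
 * Hypothetical helper for illustration only; not part of the patch, and
 * LRU is an assumed stand-in for the real replacement policy.
 */
size_t
replay_trace_lru(const int *trace, size_t trace_len,
                 int *cache, size_t capacity)
{
    size_t used = 0;

    assert(capacity > 0);

    for (size_t t = 0; t < trace_len; t++)
    {
        int    block = trace[t];
        size_t pos = 0;

        /* Look for the block among the resident entries. */
        while (pos < used && cache[pos] != block)
            pos++;

        if (pos == used)
        {
            /* Miss: take a free slot, or reuse the LRU slot at the tail. */
            if (used < capacity)
                used++;
            pos = used - 1;
        }

        /* Shift newer entries down one slot and put the block in front. */
        for (size_t i = pos; i > 0; i--)
            cache[i] = cache[i - 1];
        cache[0] = block;
    }

    return used;
}
```

Replaying the trace {1, 2, 3, 1, 4, 2, 5} with a capacity of 3 leaves
blocks 5, 2, 4 resident; replaying the same trace with a capacity of 2
leaves 5, 2.  Each access costs a linear scan here, which is fine for
illustration but would not scale to a real request log.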

I'm not sure if there's any way to gather that data efficiently; if
there is, whether we could bound the amount of data we would have to
retain to anything less than nigh-infinite volumes; and if we could,
whether there's any way to limit how much has to be replayed on
restart.  But my point is that there may be other more general options
than snapshotting the actual buffer cache of the system shutting down.

--
greg


Re: patch for new feature: Buffer Cache Hibernation

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Fri, Oct 14, 2011 at 4:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Right.  I think this one falls into my class #2, ie, we have no idea how
>> to implement it usefully.  Doesn't (necessarily) mean that the core
>> concept is without merit.

> Hm. Given that we have an implementation, I wouldn't say we have *no*
> clue.  But there are certainly some parts we don't have consensus on
> yet. But then working code sometimes trumps a lack of absolute
> consensus.

In this context "working" means "shows a significant performance
benefit", and IIRC we don't have a demonstration of that.  Anyway this
was all discussed back in May.
        regards, tom lane