Thread: patch for new feature: Buffer Cache Hibernation
Hi,

I am working on new feature `Buffer Cache Hibernation' which enables
postgres to keep higher cache hit ratio even just started.

Postgres usually starts with ZERO buffer cache.  By saving the buffer
cache data structure into hibernation files just before shutdown, and
loading them at startup, postgres can start operations with the saved
buffer cache as the same condition as just before the last shutdown.

Here is the patch for 9.0.3 (also tested on 8.4.7)
http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-9.0.3.patch

The patch includes the following.
- At shutdown, buffer cache data structure (such as BufferDescriptors,
  BufferBlocks and StrategyControl) is saved into hibernation files.
- At startup, buffer cache data structure is loaded from hibernation
  files and buffer lookup hashtable is setup based on buffer descriptors.
- Above functions are enabled by specifying `enable_buffer_cache_hibernation=on'
  in postgresql.conf.

Any comments are welcome and I would very much appreciate merging the
patch in source tree.

Have fun and thanks!
On 05/04/2011 10:10 AM, Mitsuru IWASAKI wrote:
> Hi,
>
> I am working on new feature `Buffer Cache Hibernation' which enables
> postgres to keep higher cache hit ratio even just started.
>
> Postgres usually starts with ZERO buffer cache.  By saving the buffer
> cache data structure into hibernation files just before shutdown, and
> loading them at startup, postgres can start operations with the saved
> buffer cache as the same condition as just before the last shutdown.
>
> Here is the patch for 9.0.3 (also tested on 8.4.7)
> http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-9.0.3.patch
>
> The patch includes the following.
> - At shutdown, buffer cache data structure (such as BufferDescriptors,
>   BufferBlocks and StrategyControl) is saved into hibernation files.
> - At startup, buffer cache data structure is loaded from hibernation
>   files and buffer lookup hashtable is setup based on buffer descriptors.
> - Above functions are enabled by specifying `enable_buffer_cache_hibernation=on'
>   in postgresql.conf.
>
> Any comments are welcome and I would very much appreciate merging the
> patch in source tree.
>

That sounds cool. Please a) make sure your patch is up to date against
the latest source in git and b) submit it to the next commitfest at
<https://commitfest.postgresql.org/action/commitfest_view?id=10>

We don't backport features, and 9.1 is closed for features now, so the
earliest release this could be used in is 9.2.

cheers

andrew
On Wed, May 4, 2011 at 3:10 PM, Mitsuru IWASAKI <iwasaki@jp.freebsd.org> wrote: > Postgres usually starts with ZERO buffer cache. By saving the buffer > cache data structure into hibernation files just before shutdown, and > loading them at startup, postgres can start operations with the saved > buffer cache as the same condition as just before the last shutdown. Offhand this seems pretty handy for benchmarks where it would help get reproducible results. -- greg
Mitsuru IWASAKI <iwasaki@jp.FreeBSD.org> writes: > Postgres usually starts with ZERO buffer cache. By saving the buffer > cache data structure into hibernation files just before shutdown, and > loading them at startup, postgres can start operations with the saved > buffer cache as the same condition as just before the last shutdown. This seems like a lot of complication for rather dubious gain. What happens when the DBA changes the shared_buffers setting, for instance? How do you protect against the cached buffers getting out-of-sync with the actual disk files (especially during recovery scenarios)? What about crash-induced corruption in the cache file itself (consider the not-unlikely possibility that init will kill the database before it's had time to dump all the buffers during a system shutdown)? Do you have any proof that writing out a few GB of buffers and then reading them back in is actually much cheaper than letting the database re-read the data from the disk files? regards, tom lane
Excerpts from Tom Lane's message of Wed May 04 12:44:36 -0300 2011:

> This seems like a lot of complication for rather dubious gain.  What
> happens when the DBA changes the shared_buffers setting, for instance?
> How do you protect against the cached buffers getting out-of-sync with
> the actual disk files (especially during recovery scenarios)?  What
> about crash-induced corruption in the cache file itself (consider the
> not-unlikely possibility that init will kill the database before it's
> had time to dump all the buffers during a system shutdown)?  Do you have
> any proof that writing out a few GB of buffers and then reading them
> back in is actually much cheaper than letting the database re-read the
> data from the disk files?

I thought the idea wasn't to copy the entire buffer but only a
descriptor, so that the buffer would be loaded from the original page.

If shared_buffers changes, there's no problem.  If the new setting is
smaller, then the last pages would just not be copied, and would have to
be read from disk the first time they are accessed.  If the new setting
is larger, then the last few buffers would remain unused until
requested.

As for gain, I have heard of test setups requiring hours of runtime in
order to prime the buffer cache.

Crash safety would have to be researched, sure.  Maybe only do it in
clean shutdown.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
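A minimal sketch of the descriptor-only idea described above, assuming a buffer can be identified by a (relation file, block number) pair. The names `BlockTag`, `save_block_map` and `load_block_map` and the JSON file format are hypothetical illustrations; they are not taken from the patch or from PostgreSQL.

```python
import json
from typing import List, Optional, Tuple

# Hypothetical tag for one cached page: (relation file path, block number).
BlockTag = Tuple[str, int]


def save_block_map(buffers: List[Optional[BlockTag]], path: str) -> None:
    """Write only the tags of occupied buffers, never the page contents."""
    tags = [list(tag) for tag in buffers if tag is not None]
    with open(path, "w") as f:
        json.dump(tags, f)


def load_block_map(path: str, nbuffers: int) -> List[Optional[BlockTag]]:
    """Rebuild a pool of `nbuffers` slots from a saved map.

    A real implementation would re-read each page from its relation file;
    here a slot simply records which page it would hold.
    """
    with open(path) as f:
        tags = [(rel, blk) for rel, blk in json.load(f)]
    kept = tags[:nbuffers]  # smaller pool: trailing entries are skipped
    # larger pool: the extra slots stay empty until they are requested
    return kept + [None] * (nbuffers - len(kept))
```

With this shape a changed shared_buffers setting needs no special case: a smaller pool drops the tail of the map, and a larger one leaves its extra slots unused.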
2011/5/4 Greg Stark <gsstark@mit.edu>:
> On Wed, May 4, 2011 at 3:10 PM, Mitsuru IWASAKI <iwasaki@jp.freebsd.org> wrote:
>> Postgres usually starts with ZERO buffer cache.  By saving the buffer
>> cache data structure into hibernation files just before shutdown, and
>> loading them at startup, postgres can start operations with the saved
>> buffer cache as the same condition as just before the last shutdown.
>
> Offhand this seems pretty handy for benchmarks where it would help get
> reproducible results.

It could have an option to force it or not at postgres startup.  This
could help in benchmark scenarios.

-- 
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
On Wed, May 4, 2011 at 4:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Do you have
> any proof that writing out a few GB of buffers and then reading them
> back in is actually much cheaper than letting the database re-read the
> data from the disk files?

I believe he's just writing out the metadata, i.e., which blocks to
re-read from the disk files.

-- 
greg
Alvaro Herrera wrote:
> As for gain, I have heard of test setups requiring hours of runtime in
> order to prime the buffer cache.
>

And production ones too.  I have multiple customers where a server
restart is almost a planned multi-hour downtime.  The system may be back
up, but for a couple of hours performance is so terrible it's barely
usable.  You can watch the MB/s ramp up as the more random data fills in
over time; getting that taken care of in a larger block more amenable to
elevator sorting would be a huge help.

I never bothered with this particular idea though because shared_buffers
is only a portion of the important data.  Cedric's pgfincore code digs
into the OS cache, too, which can then save enough to be really useful
here.  And that's already got a snapshot/restore feature.  The slides at
http://www.pgcon.org/2010/schedule/events/261.en.html have a useful
intro to that; pages 30 through 34 are the neat ones.  That provides some
other neat APIs for preloading popular data into cache too.  I'd rather
work on getting something like that into core, rather than adding
something that only is targeting just shared_buffers.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
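One way to read the "larger block more amenable to elevator sorting" remark is to sort the saved block list and merge adjacent blocks into runs before issuing any reads. The sketch below only illustrates that ordering step; `plan_reads` and the (file, block) tag format are assumptions and do not come from pgfincore or the patch.

```python
from typing import Iterable, List, Tuple

# Hypothetical tag for one cached page: (relation file path, block number).
BlockTag = Tuple[str, int]


def plan_reads(tags: Iterable[BlockTag]) -> List[Tuple[str, int, int]]:
    """Turn an unordered block list into sorted (file, first_block, count) runs."""
    runs: List[Tuple[str, int, int]] = []
    for rel, blk in sorted(set(tags)):  # de-duplicate, then order by file and block
        if runs and runs[-1][0] == rel and runs[-1][1] + runs[-1][2] == blk:
            # Block directly follows the previous run in the same file: extend it.
            rel0, first, count = runs[-1]
            runs[-1] = (rel0, first, count + 1)
        else:
            runs.append((rel, blk, 1))
    return runs
```

Each run can then be fetched with one sequential read instead of `count` scattered ones.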
On Wed, May 4, 2011 at 7:10 AM, Mitsuru IWASAKI <iwasaki@jp.freebsd.org> wrote: > Hi, > > I am working on new feature `Buffer Cache Hibernation' which enables > postgres to keep higher cache hit ratio even just started. > > Postgres usually starts with ZERO buffer cache. By saving the buffer > cache data structure into hibernation files just before shutdown, and > loading them at startup, postgres can start operations with the saved > buffer cache as the same condition as just before the last shutdown. > > Here is the patch for 9.0.3 (also tested on 8.4.7) > http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-9.0.3.patch > > The patch includes the following. > - At shutdown, buffer cache data structure (such as BufferDescriptors, > BufferBlocks and StrategyControl) is saved into hibernation files. > - At startup, buffer cache data structure is loaded from hibernation > files and buffer lookup hashtable is setup based on buffer descriptors. > - Above functions are enabled by specifying `enable_buffer_cache_hibernation=on' > in postgresql.conf. > > Any comments are welcome and I would very much appreciate merging the > patch in source tree. > > Have fun and thanks! It applies and builds against head with offsets and some fuzz. It fails make check, but apparently only because src/test/regress/expected/rangefuncs.out needs to be updated to include the new setting. (Although all the other "enable%" settings are for the planner, so making a new setting with that prefix that does something else might be undesirable) I think that PgFincore (http://pgfoundry.org/projects/pgfincore/) provides similar functionality. Are you familiar with that? If so, could you contrast your approach with that one? Cheers, Jeff
All, I thought that Dimitri had already implemented this using Fincore. It's linux-only, but that should work well enough to test the general concept. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > I thought that Dimitri had already implemented this using Fincore. It's > linux-only, but that should work well enough to test the general concept. Actually, Cédric did, and I have a clone of his repository where I did some debian packaging of it. http://villemain.org/projects/pgfincore http://git.postgresql.org/gitweb?p=pgfincore.git;a=summary http://git.postgresql.org/gitweb?p=pgfincore.git;a=tree Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
2011/5/4 Josh Berkus <josh@agliodbs.com>:
> All,
>
> I thought that Dimitri had already implemented this using Fincore.  It's
> linux-only, but that should work well enough to test the general concept.

Harald provided me some pointers at pgday in Stuttgart to make it work
with Windows, but ... hum, I don't have Windows and wasn't motivated
enough to make it work there if no one needs it.

I haven't looked at the different kernels recently, but any kernel
supporting mincore and posix_fadvise should work (so probably the same
set of kernels that support our 'effective_io_concurrency').

Still waiting for (free)BSD support .....

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
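For readers unfamiliar with posix_fadvise, here is a minimal sketch of hinting the kernel to prefetch a file, either whole or only for selected blocks. It uses Python's `os.posix_fadvise` wrapper around the same system call; `advise_willneed` is a hypothetical helper, not pgfincore's API, and the 8192-byte block size is PostgreSQL's default rather than something read from a running server.

```python
import os

BLCKSZ = 8192  # PostgreSQL's default block size (assumed, not detected)


def advise_willneed(path, blocks=None):
    """Hint the kernel to prefetch `path`, or only the listed block numbers.

    Returns False when os.posix_fadvise is unavailable on this platform.
    The kernel is free to ignore the hint or load only part of the range.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        if blocks is None:
            # A length of 0 means "from offset to the end of the file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        else:
            for blk in blocks:
                os.posix_fadvise(fd, blk * BLCKSZ, BLCKSZ, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
    return True
```

The per-block form is what a snapshot/restore scheme would need: it touches only the pages that were recorded as cached.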
Hi, thanks for good suggestions.

> > Postgres usually starts with ZERO buffer cache.  By saving the buffer
> > cache data structure into hibernation files just before shutdown, and
> > loading them at startup, postgres can start operations with the saved
> > buffer cache as the same condition as just before the last shutdown.
>
> This seems like a lot of complication for rather dubious gain.  What
> happens when the DBA changes the shared_buffers setting, for instance?

It was my first concern actually.  The current implementation stops
reading the hibernation file when it detects a size mismatch between
shared_buffers and the hibernation file.  I think that is the safe way.
As Alvaro Herrera mentioned, it would be possible to adjust how buffer
blocks are copied, but I think the shared_buffers setting is not changed
very often.

> How do you protect against the cached buffers getting out-of-sync with
> the actual disk files (especially during recovery scenarios)?  What

Saving the DB buffer cache is called at shutdown after bgwriter's final
checkpoint has finished, so I believe no dirty buffers should exist.
For recovery scenarios, I need to research it though...
Could you describe what needs to be considered?

> about crash-induced corruption in the cache file itself (consider the
> not-unlikely possibility that init will kill the database before it's
> had time to dump all the buffers during a system shutdown)?  Do you have

I think this is an important point.  I'll implement a validation function
for the hibernation file.

> any proof that writing out a few GB of buffers and then reading them
> back in is actually much cheaper than letting the database re-read the
> data from the disk files?

I think this means sequential-read vs scattered-read.
The largest hibernation file is for buffer blocks, and sequential-read
from it would be much faster than scattered-read from database file via
smgrread() block by block.
As Greg Stark suggested, re-reading from database file based on buffer descriptors was one of implementation candidates (it can reduce storage consumption for hibernation), but I chose creating buffer blocks raw image file and reading it for the performance. Thanks
Hi,

> I think that PgFincore (http://pgfoundry.org/projects/pgfincore/)
> provides similar functionality.  Are you familiar with that?  If so,
> could you contrast your approach with that one?

I'm not familiar with PgFincore at all, sorry, but I got the source code
and documents and read through them just now.
# and I'm a novice on postgres actually...
Both aim to reduce physical I/O, but their approaches and gains are
different.
My understanding is like this;

+---------------------+     +---------------------+
| Postgres(backend)   |     | Postgres            |
| +-----------------+ |     |                     |
| | DB Buffer Cache | |     |                     |
| | (shared buffers)| |     |                     |
| |*my target       | |     |                     |
| +-----------------+ |     |                     |
|   ^      ^          |     |                     |
|   |      |          |     |                     |
|   v      v          |     |                     |
| +-----------------+ |     | +-----------------+ |
| | buffer manager  | |     | | pgfincore       | |
| +-----------------+ |     | +-----------------+ |
+---^------^----------+     +----------^----------+
    |      |smgrread()                 |posix_fadvise()
    |read()|                           |                  userland
==================================================================
    |      |                           |                  kernel
    |      +-------------+-------------+
    |                    |
    |                    v
    |       +------------------------+
    |       | File System            |
    |       | +-----------------+    |
    +------>| | FS Buffer Cache |    |
            | |*PgFincore target|    |
            | +-----------------+    |
            |    ^       ^           |
            +----|-------|-----------+
                 |       |
==================================================================
                 |       |                                hardware
       +---------|-------|----------------+
       |         |       v  Physical Disk |
       |         |  +------------------+  |
       |         |  | base/16384/24598 |  |
       |         v  +------------------+  |
       | +------------------------------+ |
       | |Buffer Cache Hibernation Files| |
       | +------------------------------+ |
       +----------------------------------+

In summary, PgFincore's target is File System Buffer Cache, Buffer
Cache Hibernation's target is DB Buffer Cache(shared buffers).

PgFincore is trying to preload database file by posix_fadvise() into
File System Buffer Cache, not into DB Buffer Cache(shared buffers).
On query execution, buffer manager will get DB buffer blocks by
smgrread() from file system unless necessary blocks exist in DB Buffer
Cache.  At this point, physical reads may not happen because part of
(or entire) database file is already loaded into FS Buffer Cache.

The gain depends on the file system, especially size of File System
Buffer Cache.
Preloading database file is equivalent to following command in short.
$ cat base/16384/24598 > /dev/null

I think PgFincore is good for data warehouse in applications.


Buffer Cache Hibernation, my approach, is simpler and more straightforward.
It tries to save/load the contents of DB Buffer Cache(shared buffers) using
regular files(called Buffer Cache Hibernation Files).
At startup, buffer manager will load DB buffer blocks into DB Buffer
Cache from Buffer Cache Hibernation Files which were saved at the last
shutdown.  Note that database file will not be read, so it is not
cached in File System Buffer Cache at all.  Only contents of DB Buffer
Cache are filled.  Therefore, the DB buffer cache miss penalty would
be larger than PgFincore's.

The gain depends on the size of shared buffers, and how often the
similar queries are executed before and after restarting.

Buffer Cache Hibernation is good for OLTP in applications.


I think that PgFincore and Buffer Cache Hibernation are not exclusive;
they can work together at different caching levels.

Sorry for my poor english skill, but I'm doing my best :)

Thanks
2011/5/5 Mitsuru IWASAKI <iwasaki@jp.freebsd.org>:
> Hi,
>
>> I think that PgFincore (http://pgfoundry.org/projects/pgfincore/)
>> provides similar functionality.  Are you familiar with that?  If so,
>> could you contrast your approach with that one?
>
> I'm not familiar with PgFincore at all sorry, but I got source code
> and documents and read through them just now.
> # and I'm a novice on postgres actually...
> The target both is to reduce physical I/O, but their approaches and
> gains are different.
> My understanding is like this;
>
> +---------------------+     +---------------------+
> | Postgres(backend)   |     | Postgres            |
> | +-----------------+ |     |                     |
> | | DB Buffer Cache | |     |                     |
> | | (shared buffers)| |     |                     |
> | |*my target       | |     |                     |
> | +-----------------+ |     |                     |
> |   ^      ^          |     |                     |
> |   |      |          |     |                     |
> |   v      v          |     |                     |
> | +-----------------+ |     | +-----------------+ |
> | | buffer manager  | |     | | pgfincore       | |
> | +-----------------+ |     | +-----------------+ |
> +---^------^----------+     +----------^----------+
>     |      |smgrread()                 |posix_fadvise()
>     |read()|                           |                  userland
> ==================================================================
>     |      |                           |                  kernel
>     |      +-------------+-------------+
>     |                    |
>     |                    v
>     |       +------------------------+
>     |       | File System            |
>     |       | +-----------------+    |
>     +------>| | FS Buffer Cache |    |
>             | |*PgFincore target|    |
>             | +-----------------+    |
>             |    ^       ^           |
>             +----|-------|-----------+
>                  |       |
> ==================================================================
>                  |       |                                hardware
>        +---------|-------|----------------+
>        |         |       v  Physical Disk |
>        |         |  +------------------+  |
>        |         |  | base/16384/24598 |  |
>        |         v  +------------------+  |
>        | +------------------------------+ |
>        | |Buffer Cache Hibernation Files| |
>        | +------------------------------+ |
>        +----------------------------------+
>

A little detail: pgfincore stores its data per relation in a file, like you do.
I rewrote that a bit: it will store its data directly in postgresql
tables, and it will also be able to restore the cache from a raw
bitstring.

> In summary, PgFincore's target is File System Buffer Cache, Buffer
> Cache Hibernation's target is DB Buffer Cache(shared buffers).

Correct. (btw I am very happy with your idea and that you found time to do it)

>
> PgFincore is trying to preload database file by posix_fadvise() into
> File System Buffer Cache, not into DB Buffer Cache(shared buffers).
> On query execution, buffer manager will get DB buffer blocks by
> smgrread() from file system unless necessary blocks exist in DB Buffer
> Cache.  At this point, physical reads may not happen because part of
> (or entire) database file is already loaded into FS Buffer Cache.
>
> The gain depends on the file system, especially size of File System
> Buffer Cache.
> Preloading database file is equivalent to following command in short.
> $ cat base/16384/24598 > /dev/null

Not exactly. There are 2 calls:

* pgfadv_WILLNEED
* pgfadv_WILLNEED_snapshot

The former asks to load each segment of a relation *but* the kernel can
decide not to do that, or to load only part of each segment (so it is
not as brutal as cat file > /dev/null).
The latter reads *exactly* the blocks required in each segment, not all
blocks unless all of them were in cache when the snapshot was taken
(this one is part of the snapshot/restore combo).

>
> I think PgFincore is good for data warehouse in applications.

Pgfincore with bitstring storage in a table allows streaming to
HotStandbys and gives better response in case of switch-over/fail-over,
by doing some house-keeping on the HotStandby to keep it really hot ;)
Even web applications have large databases today .... (there is more,
but it is not the subject)

>
>
> Buffer Cache Hibernation, my approach, is more simple and straight forward.
> It try to save/load the contents of DB Buffer Cache(shared buffers) using
> regular files(called Buffer Cache Hibernation Files).
> At startup, buffer manager will load DB buffer blocks into DB Buffer
> Cache from Buffer Cache Hibernation Files which was saved at the last
> shutdown.  Note that database file will not be read, so it is not
> cached in File System Buffer Cache at all.  Only contents of DB Buffer
> Cache are filled.  Therefore, the DB buffer cache miss penalty would
> be larger than PgFincore's.
>
> The gain depends on the size of shared buffers, and how often the
> similar queries are executed before and after restarting.
>
> Buffer Cache Hibernation is good for OLTP in applications.

It is also very helpful for debugging and analysis purposes, IIUC.

I may prefer the per-relation approach (so you can snapshot and
restore only the interesting tables/indexes). Given what I read in your
patch it looks easy to do, doesn't it?

I also prefer the idea of keeping a map of the Buffer Cache (yes, like
what I do with pgfincore) over storing the data directly and reading
it back directly. This latter part seems a bit dangerous to me, even if
it looks sane for a normal postgresql stop/start process.

>
>
> I think that PgFincore and Buffer Cache Hibernation is not exclusive,
> they can co-work together in different caching levels.

Yes.

>
> Sorry for my poor english skill, but I'm doing my best :)

Better than mine, and anyway your patch remains very easy to read in any case.

>
> Thanks

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
Hi,

I revised the patch against HEAD, it's available at:
http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110506.patch

Implemented hibernation file validations:

- comparison with pg_control
  At shutdown:
    pg_control state should be DB_SHUTDOWNED.
  At startup:
    pg_control state should be DB_SHUTDOWNED.
    hibernation files should be newer than pg_control.

- CRC check
  At shutdown:
    compute CRC values for hibernation files and store them into a file.
  At startup:
    CRC values for hibernation files should be the same as those read
    from the file created at shutdown.

- file size
  At startup:
    The size of hibernation file should be the same as the file size
    calculated from shared_buffers.

- buffer descriptors validation
  At startup:
    The descriptor flags should not include BM_DIRTY, BM_IO_IN_PROGRESS,
    BM_IO_ERROR, BM_JUST_DIRTIED and BM_PIN_COUNT_WAITER.
    Sanity checks for usage_count and usage_count should be done.
    (wait_backend_pid is zero-cleared because the process was terminated
    already)

- system call error checking
  At shutdown and startup:
    The return value of each system call (e.g. open(), read(), write()
    etc.) should be checked.

> > How do you protect against the cached buffers getting out-of-sync with
> > the actual disk files (especially during recovery scenarios)?  What
>
> Saving DB buffer cache is called at shutdown after finishing
> bgwriter's final checkpoint process, so dirty-buffers should not exist
> I believe.
> For recovery scenarios, I need to research it though...
> Could you describe what is need to be consider?

I think hibernation should be allowed only when the system was shut down
normally, which is checked via the pg_control state.  And once an abnormal
shutdown is detected, the hibernation files should be ignored.
The latest patch includes this.
# modifications for xlog.c:ReadControlFile() was required though...
> > about crash-induced corruption in the cache file itself (consider the > > not-unlikely possibility that init will kill the database before it's > > had time to dump all the buffers during a system shutdown)? Do you have > > I think this is important point. I'll implement validation function for > hibernation file. Added validations seem enough for me. # because my understanding on postgres is not enough ;) If any other considerations are required, please point them out. Thanks
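The startup checks listed in this message can be illustrated with a small sketch. Everything here is an assumption made for illustration: the flag bit values are stand-ins rather than PostgreSQL's real definitions, `zlib.crc32` stands in for whatever CRC the patch computes, and file modification time is only one possible reading of "newer than pg_control".

```python
import os
import zlib

# Stand-in bit values; PostgreSQL defines its own in its buffer headers.
BM_DIRTY = 1 << 0
BM_IO_IN_PROGRESS = 1 << 1
BM_IO_ERROR = 1 << 2
BM_JUST_DIRTIED = 1 << 3
BM_PIN_COUNT_WAITER = 1 << 4
FORBIDDEN = (BM_DIRTY | BM_IO_IN_PROGRESS | BM_IO_ERROR
             | BM_JUST_DIRTIED | BM_PIN_COUNT_WAITER)


def file_crc(path):
    """CRC-32 of a whole file, computed in 1 MB chunks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return crc


def hibernation_file_ok(path, expected_size, expected_crc,
                        control_mtime, clean_shutdown):
    """Accept the file only if every startup check passes."""
    if not clean_shutdown:          # pg_control state was not a clean shutdown
        return False
    try:
        st = os.stat(path)          # system call errors count as failure
    except OSError:
        return False
    if st.st_size != expected_size: # size derived from shared_buffers
        return False
    if st.st_mtime < control_mtime: # must not be older than pg_control
        return False
    return file_crc(path) == expected_crc


def descriptor_ok(flags):
    """A restored descriptor must carry none of the transient flags."""
    return (flags & FORBIDDEN) == 0
```

Any single failed check makes the caller fall back to a cold start, which is the conservative behaviour the message describes.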
On 05/05/2011 05:06 AM, Mitsuru IWASAKI wrote: > In summary, PgFincore's target is File System Buffer Cache, Buffer > Cache Hibernation's target is DB Buffer Cache(shared buffers). > Right. The thing to realize is that shared_buffers is becoming a smaller fraction of the total RAM used by the database every year. On Windows it's been stuck at useful settings being less than 512MB for a while now. And on UNIX systems, around 8GB seems to be effective upper limit. Best case, shared_buffers is only going to be around 25% of total RAM; worst-case, approximately, you might have Windows server with 64GB of RAM where shared_buffers is less than 1% of total RAM. There's nothing wrong with the general idea you're suggesting. It's just only targeting a small (and shrinking) subset of the real problem here. Rebuilding cache state starts with shared_buffers, but that's not enough of the problem to be an effective tweak on many systems. I think that all the complexity with CRCs etc. is unlikely to lead anywhere too, and those two issues are not completely unrelated. The simplest, safest thing here is the right way to approach this, not the most complicated one, and a simpler format might add some flexibility here to reload more cache state too. The bottleneck on reloading the cache state is reading everything from disk. Trying to micro-optimize any other part of that is moving in the wrong direction to me. I doubt you'll ever measure a useful benefit that overcomes the expense of maintaining the code. And you seem to be moving to where someone can't restore cache state when they change shared_buffers. A simpler implementation might still work in that situation; reload until you run out of buffers if shared_buffers shrinks, reload until you're done with the original size. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On Fri, May 6, 2011 at 5:31 PM, Greg Smith <greg@2ndquadrant.com> wrote: > On 05/05/2011 05:06 AM, Mitsuru IWASAKI wrote: >> >> In summary, PgFincore's target is File System Buffer Cache, Buffer >> Cache Hibernation's target is DB Buffer Cache(shared buffers). >> > > Right. The thing to realize is that shared_buffers is becoming a smaller > fraction of the total RAM used by the database every year. On Windows it's > been stuck at useful settings being less than 512MB for a while now. And on > UNIX systems, around 8GB seems to be effective upper limit. Best case, > shared_buffers is only going to be around 25% of total RAM; worst-case, > approximately, you might have Windows server with 64GB of RAM where > shared_buffers is less than 1% of total RAM. > > There's nothing wrong with the general idea you're suggesting. It's just > only targeting a small (and shrinking) subset of the real problem here. > Rebuilding cache state starts with shared_buffers, but that's not enough of > the problem to be an effective tweak on many systems. > > I think that all the complexity with CRCs etc. is unlikely to lead anywhere > too, and those two issues are not completely unrelated. The simplest, > safest thing here is the right way to approach this, not the most > complicated one, and a simpler format might add some flexibility here to > reload more cache state too. The bottleneck on reloading the cache state is > reading everything from disk. Trying to micro-optimize any other part of > that is moving in the wrong direction to me. I doubt you'll ever measure a > useful benefit that overcomes the expense of maintaining the code. And you > seem to be moving to where someone can't restore cache state when they > change shared_buffers. A simpler implementation might still work in that > situation; reload until you run out of buffers if shared_buffers shrinks, > reload until you're done with the original size. Yeah, I'm pretty well convinced this whole approach is a dead end. 
Priming the OS buffer cache seems way more useful. I also think saving the blocks to be read rather than the actual blocks makes a lot more sense. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, thanks for your comments!
I'm glad to discuss this topic.

> * pgfadv_WILLNEED
> * pgfadv_WILLNEED_snapshot
>
> The former ask to load each segment of a relation *but* the kernel can
> decide to not do that or load only part of each segment. (so it is not
> as brutal as cat file > /dev/null )
> The later read *exactly* each blocks required in each segment, not all
> blocks except if all were in cache while doing the snapshot. (this one
> is the part of the snapshot/restore combo)

Sorry about that, I'm not so familiar with posix_fadvise().
I'll check posix_fadvise() later.
Actually I used to execute a 'cat database_file > /dev/null' script on
another DBMS before starting.
# or 'select /*+ INDEX(emp emp_pk) */ count(*) from emp;' to load
# index blocks

> I may prefer the per relation approach (so you can snapshot and
> restore only the interesting tables/index). Given what I read in your
> patch it looks easy to do, isn't it ?

I would like to keep my patch as simple as possible, because it is just
a hibernation function, not complicated buffer management.
But I want to try improving buffer management on my next vacation.
# currently I'm on an 11-day vacation until Sunday.

My rough idea on improving buffer management is like this;

SQL> alter table table_name buffer pin priority 7;
SQL> alter index index_name buffer pin priority 10;

This DDL sets a 'buffer pin priority' property on the table/index and
also on the buffer descriptors related to that table/index.
Optionally, preloading database files into the FS cache and relation
blocks into the DB cache would be possible.

When a new buffer is required, the buffer manager refers to the priority
of each buffer and selects a victim buffer.

I think it helps batch jobs run in a better buffer cache condition by
giving hints for buffer management.
For example, job-A reads table_A, index_A and writes only table_B;

SQL> alter table table_A buffer pin priority 7;
SQL> alter index index_A buffer pin priority 10;
SQL> alter table table_B buffer pin priority 1;

keeps buffers of index_A, table_A (table_B will be victims soon).

Buffer pin priority can be reset like this;

SQL> alter system buffer pin priority 5;

Next job-B reads and writes table_C, reads index_C with preloading;

SQL> alter table table_C buffer pin priority 5;
SQL> alter index index_C buffer pin priority 10 with preloading 50%;

something like this.

> I also prefer the idea to keep a map of the Buffer Cache (yes, like
> what I do with pgfincore) than storing the data directly and reading
> it directly. This later part seems a bit dangerous to me, even if it
> looks sane from a normal postgresql stop/start process.

Never mind :)
I added enough validations and will add more.

> better than me, and anyway your patch remain very easy to read in all case.

Thanks a lot!  My policy on experimental implementation is easy-to-read
so that people understand my idea quickly.
That's why my first patch doesn't have enough error checking ;)

Thanks
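The 'buffer pin priority' proposal is only a rough idea in this message, so the following is a speculative sketch of how victim selection might use such a priority. The `Buf` class, `pick_victim`, and the tie-break on usage count are all hypothetical; PostgreSQL's actual buffer replacement does not work this way.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Buf:
    name: str          # relation the buffer belongs to (illustrative)
    priority: int      # hypothetical "buffer pin priority", higher = keep longer
    usage_count: int   # how recently/often the buffer was used
    pinned: bool = False


def pick_victim(bufs: List[Buf]) -> Optional[Buf]:
    """Choose the unpinned buffer with the lowest priority.

    Ties are broken by the lowest usage count; None means nothing is evictable.
    """
    candidates = [b for b in bufs if not b.pinned]
    if not candidates:
        return None
    return min(candidates, key=lambda b: (b.priority, b.usage_count))
```

With the job-A settings above (table_A at 7, index_A at 10, table_B at 1), buffers of table_B are evicted first even if they were used more recently.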
On Sat, May 7, 2011 at 3:32 AM, Mitsuru IWASAKI <iwasaki@jp.freebsd.org> wrote: > I have one more day for working on this, but I may give up... I think this is an interesting line of inquiry, but if you were hoping to get something committable in a couple of days, you had unrealistic expectations... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, folks!

> I'll do more testing tomorrow, and hopefully finalize my patch.

Done! The patch is available at:
http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110508.patch

I hope this would be committable and the final version.

Major changes from the experimental implementation are the following.
- add many validations against hibernation file corruption etc.
- restore buffer blocks based on buffer descriptors, not from the saved file.
- support restoring cache state even if shared_buffers has changed.

My vacation ends today and I have to go back to work tomorrow, but I will try to find spare time for this.

Thanks a lot for the happy hacking days with you!
Mitsuru IWASAKI wrote:
> the patch is available at:
> http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110508.patch
>

We can't accept patches just based on a pointer to a web site. Please e-mail this to the mailing list so that it can be considered a submission under the project's licensing terms.

> I hope this would be committable and the final version.
>

PostgreSQL has high standards for code submissions. Extremely few submissions are committed without significant revisions to them based on code review. So far you've gotten a first round of high-level design review; there are several additional steps before something is considered for a commit. The whole process is outlined at http://wiki.postgresql.org/wiki/Submitting_a_Patch

From a couple of minutes of reading the patch, the first things that pop out as problems are:

-All of the ControlFile -> controlFile renaming has added a larger difference to ReadControlFile than I would consider ideal.
-Touching StrategyControl is not something this patch should be doing.
-I don't think your justification ("debugging or portability") for keeping around your original code in here is going to be sufficient to do so.
-This should not be named enable_buffer_cache_hibernation. That very large diff you ended up with in the regression tests is because all of the settings named enable_* are optimizer control settings. Using the name "buffer_cache_hibernation" instead would make a better starting point.

From a bigger picture perspective, this really hasn't addressed any of my comments about shared_buffers only being the beginning of the useful cache state to worry about here. I'd at least like the solution to the buffer cache save/restore to have a plan for how it might address that too one day. This project is also picky about only committing code that fits into the long-term picture for desired features.

Having a working example of a server-side feature doing cache storage and restoration is helpful though. Don't think your work here is unappreciated--it is. Getting this feature added is just a harder problem than what you've done so far.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On 08.05.2011 07:58, Mitsuru IWASAKI wrote: >> I'll do more testing tomorrow, and hopefully finalize my patch. > > Done! the patch is available at: > http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110508.patch I'd suggest doing this as an extension module. All the changes to existing server code seem superficial. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi,

Sorry, I missed these messages because I wasn't subscribed to this list.
# I've just subscribed temporarily

> > I think that all the complexity with CRCs etc. is unlikely to lead anywhere
> > too, and those two issues are not completely unrelated. The simplest,
> > safest thing here is the right way to approach this, not the most
> > complicated one, and a simpler format might add some flexibility here to
> > reload more cache state too. The bottleneck on reloading the cache state is
> > reading everything from disk. Trying to micro-optimize any other part of
> > that is moving in the wrong direction to me. I doubt you'll ever measure a
> > useful benefit that overcomes the expense of maintaining the code. And you
> > seem to be moving to where someone can't restore cache state when they
> > change shared_buffers. A simpler implementation might still work in that
> > situation; reload until you run out of buffers if shared_buffers shrinks,
> > reload until you're done with the original size.
>
> Yeah, I'm pretty well convinced this whole approach is a dead end.
> Priming the OS buffer cache seems way more useful. I also think
> saving the blocks to be read rather than the actual blocks makes a lot
> more sense.

OK, there are two suggestions from you here, IIUC.
# if not, please correct me.

1. restore buffer blocks based on buffer descriptors, not from the saved file.
2. support restoring cache state even if shared_buffers had changed.

For 1, I've just finished my work. The latest patch is available at:
http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch

On my box, shared_buffers can be set up to only 200MB.
Elapsed time for starting up is almost the same, about 3 sec (w/o hibernation takes about 1 sec).
For shutdown, writing buffer blocks takes about 10 sec, otherwise about 1 sec.
Well, it seems you were right :)

By restoring buffer blocks based on buffer descriptors, the OS buffer cache will be filled too. I believe this can help buffer update performance.
I think saving buffer blocks is still useful for debugging or portability, so I would like to keep the support code in my patch.

For 2, I'm not sure how to implement this. The problem is that freelist.c:StrategyControl is also restored at startup, but I currently have no idea how to adjust StrategyControl when shared_buffers has changed.
StrategyControl holds important data on buffer allocation, so I believe it has to match shared_buffers.
shared_buffers is not changed very often in a production environment.
The current implementation works like this: if shared_buffers has changed, restoring is aborted for that one startup, saving is executed with the new shared_buffers at shutdown, and restoring is executed again at the next startup.

I have one more day for working on this, but I may give up...

Thanks
Hi,

> We can't accept patches just based on a pointer to a web site. Please
> e-mail this to the mailing list so that it can be considered a
> submission under the project's licensing terms.
>
> > I hope this would be committable and the final version.
> >
>
> PostgreSQL has high standards for code submissions. Extremely few
> submissions are committed without significant revisions to them based on
> code review. So far you've gotten a first round of high-level design
> review, there's several additional steps before something is considered
> for a commit. The whole process is outlined at
> http://wiki.postgresql.org/wiki/Submitting_a_Patch

OK, I will do so for my next patch.

> From a couple of minutes of reading the patch, the first things that
> pop out as problems are:
>
> -All of the ControlFile -> controlFile renaming has added a larger
> difference to ReadControlFile than I would consider ideal.

I think so too; I will reconsider this.

> -Touching StrategyControl is not something this patch should be doing.

Sorry, I did not understand this. Could you explain? I think StrategyControl needs to be adjusted if the shared_buffers setting was changed.

> -I don't think your justification ("debugging or portability") for
> keeping around your original code in here is going to be sufficient to
> do so.
> -This should not be named enable_buffer_cache_hibernation. That very
> large diff you ended up with in the regression tests is because all of
> the settings named enable_* are optimizer control settings. Using the
> name "buffer_cache_hibernation" instead would make a better starting point.

OK, how about `buffer_cache_hibernation_level'?
The value is 0 to disable (default), 1 for saving buffer descriptors only, and 2 for saving buffer descriptors and buffer blocks.

> From a bigger picture perspective, this really hasn't addressed any of
> my comments about shared_buffers only being the beginning of the useful
> cache state to worry about here. I'd at least like the solution to the
> buffer cache save/restore to have a plan for how it might address that
> too one day. This project is also picky about only committing code that
> fits into the long-term picture for desired features.

My simple motivation for this is: `We don't want to restart our DB server because the DB buffer cache will be lost and the DB server needs to start its operations with zero cache. Does any DBMS product support keeping the contents of the DB cache as they are across a restart, just like the hibernation feature of a PC?'.

It's very simple, and I think many DB admins will be happy with this feature.

Thanks
Hi,

> I'd suggest doing this as an extension module. All the changes to
> existing server code seem superficial.

That sounds interesting. I'll try it later.
Are there any good examples of extension modules?

Thanks
Mitsuru IWASAKI wrote:

> Are there any good examples of extension modules?

Browse the subdirectories of contrib.

-Kevin
On Fri, May 6, 2011 at 5:31 PM, Greg Smith <greg@2ndquadrant.com> wrote: > I think that all the complexity with CRCs etc. is unlikely to lead anywhere > too, and those two issues are not completely unrelated. The simplest, > safest thing here is the right way to approach this, not the most > complicated one, and a simpler format might add some flexibility here to > reload more cache state too. The bottleneck on reloading the cache state is > reading everything from disk. Trying to micro-optimize any other part of > that is moving in the wrong direction to me. I doubt you'll ever measure a > useful benefit that overcomes the expense of maintaining the code. And you > seem to be moving to where someone can't restore cache state when they > change shared_buffers. A simpler implementation might still work in that > situation; reload until you run out of buffers if shared_buffers shrinks, > reload until you're done with the original size. I don't think there's any need for this to get data into shared_buffers at all. Getting it into the OS cache oughta be plenty sufficient, no? ISTM that a very simple approach here would be to save the contents of each shared buffer on clean shutdown, and to POSIX_FADV_WILLNEED those buffers on startup. We could worry about additional complexity, like using fincore to probe the OS cache, in a follow-on patch. While reloading only 8GB of maybe 30GB of cached data on restart would not be as good as reloading all of it, it would be a lot better than reloading none of it, and the gymnastics required seems substantially less. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/5/15 Robert Haas <robertmhaas@gmail.com>:
> On Fri, May 6, 2011 at 5:31 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> I think that all the complexity with CRCs etc. is unlikely to lead anywhere
>> too, and those two issues are not completely unrelated. The simplest,
>> safest thing here is the right way to approach this, not the most
>> complicated one, and a simpler format might add some flexibility here to
>> reload more cache state too. The bottleneck on reloading the cache state is
>> reading everything from disk. Trying to micro-optimize any other part of
>> that is moving in the wrong direction to me. I doubt you'll ever measure a
>> useful benefit that overcomes the expense of maintaining the code. And you
>> seem to be moving to where someone can't restore cache state when they
>> change shared_buffers. A simpler implementation might still work in that
>> situation; reload until you run out of buffers if shared_buffers shrinks,
>> reload until you're done with the original size.
>
> I don't think there's any need for this to get data into
> shared_buffers at all. Getting it into the OS cache oughta be plenty
> sufficient, no?
>
> ISTM that a very simple approach here would be to save the contents of
> each shared buffer on clean shutdown, and to POSIX_FADV_WILLNEED those
> buffers on startup.

+1
It is just an evolution of the current process, if I understood the explanations of the latest patch correctly.

> We could worry about additional complexity, like
> using fincore to probe the OS cache, in a follow-on patch. While
> reloading only 8GB of maybe 30GB of cached data on restart would not
> be as good as reloading all of it, it would be a lot better than
> reloading none of it, and the gymnastics required seems substantially
> less.

--
Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/
PostgreSQL : Expertise, Formation et Support
On 05/07/2011 03:32 AM, Mitsuru IWASAKI wrote: > For 1, I've just finish my work. The latest patch is available at: > http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch > Reminder here--we can't accept code based on it being published to a web page. You'll need to e-mail it to the pgsql-hackers mailing list to be considered for the next PostgreSQL CommitFest, which is starting in a few weeks. Code submitted to the mailing list is considered a release of it to the project under the PostgreSQL license, which we can't just assume for things when given only a URL to them. Also, you suggested you were out of time to work on this. If that's the case, we'd like to know that so we don't keep cc'ing you about things in expectation of an answer. Someone else may pick this up as a project to continue working on. But it's going to need a fair amount of revision before it matches what people want here, and I'm not sure how much of what you've written is going to end up in any commit that may happen from this idea. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
> Yeah, I'm pretty well convinced this whole approach is a dead end.
> Priming the OS buffer cache seems way more useful. I also think
> saving the blocks to be read rather than the actual blocks makes a lot
> more sense.

Well, his proposal works on any platform PostgreSQL supports. On the other hand, PgFincore works on Linux only. Who wants a Linux-only tool to be in core?

Also I really want to see the performance comparison between these two approaches in the real world database.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
2011/6/1 Tatsuo Ishii <ishii@postgresql.org>:
>> Yeah, I'm pretty well convinced this whole approach is a dead end.
>> Priming the OS buffer cache seems way more useful. I also think
>> saving the blocks to be read rather than the actual blocks makes a lot
>> more sense.
>
> Well, his proposal works on any platform PostgreSQL supports. On the
> other hand, PgFincore works on Linux only. Who wants a Linux-only tool
> to be in core?

I don't want to make the features compete here.
Just for completeness:
A PgFincore 'snapshot' is possible on any platform supporting mincore() (most support it; for Windows, alternatives exist).
For restoring, it can be a ReadBuffer for the postgresql cache; for the OS it can be an open(), read(X), read(Y), close() *or* posix_fadvise(), which can be less destructive (I did it only via posix_fadvise, but nothing prevents changing that when posix support is not present).

And we already have Linux-only features in core, fortunately, because they are useful, and I would really like to add more posix_fadvise calls (*this* will really help read and cache strategy more than any hack we can do to try to work around kernel decisions).

Note that BSD developers can change that and make posix_fadvise work: it has been sitting in their TODO list for some years now.

Anyway, we need this patch on-list to go ahead.

> Also I really want to see the performance comparison between these two
> approaches in the real world database.

--
Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/
PostgreSQL : Expertise, Formation et Support
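For readers unfamiliar with the mincore() 'snapshot' mentioned above, here is a minimal sketch of the idea: map a file and ask the kernel which of its pages are currently resident in the OS cache. It is not PgFincore's code; `count_resident_pages` is a hypothetical helper, and the `unsigned char *` vector follows the Linux signature (some BSDs declare it as `char *`).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* exposes mincore() on glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of a file are resident in the
 * OS page cache.  Returns -1 on any error.
 */
static long
count_resident_pages(const char *path)
{
    struct stat st;
    long resident = -1;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    if (fstat(fd, &st) == 0)
    {
        if (st.st_size == 0)
            resident = 0;       /* empty file: nothing to be resident */
        else
        {
            size_t len = (size_t) st.st_size;
            size_t pagesz = (size_t) sysconf(_SC_PAGESIZE);
            size_t npages = (len + pagesz - 1) / pagesz;
            void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
            unsigned char *vec = malloc(npages);

            /* mincore() fills one byte per page; bit 0 set means resident. */
            if (map != MAP_FAILED && vec != NULL &&
                mincore(map, len, vec) == 0)
            {
                resident = 0;
                for (size_t i = 0; i < npages; i++)
                    resident += vec[i] & 1;
            }

            if (map != MAP_FAILED)
                munmap(map, len);
            free(vec);
        }
    }

    close(fd);
    return resident;
}
```

A snapshot tool would presumably record the per-page vector (not just the count) for each relation segment, so that the same pages can be requested again after a restart.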
On 06/01/2011 03:03 AM, Tatsuo Ishii wrote: > Also I really want to see the performance comparison between these two > approaches in the real world database. > Well, tell me how big of a performance improvement you want PgFincore to win by, and I'll construct a benchmark where it does that. If you pick a database size that fits in the OS cache, but is bigger than shared_buffers, the difference between the approaches is huge. The opposite--trying to find a case where this hibernation approach wins--is extremely hard to do. Anyway, further discussion of this patch is kind of a waste right now. We've never gotten the patch actually sent to the list to establish a proper contribution (just pointers to a web page), and no feedback on that or other suggestions for redesign (extension repackaging, GUC renaming, removing unused code, and a few more). Unless the author shows up again in the next two weeks, this is getting bounced back with no review as code we can't use. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On Sun, May 15, 2011 at 11:19 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I don't think there's any need for this to get data into > shared_buffers at all. Getting it into the OS cache oughta be plenty > sufficient, no? > > ISTM that a very simple approach here would be to save the contents of > each shared buffer on clean shutdown, and to POSIX_FADV_WILLNEED those > buffers on startup. Do you mean to save the contents of the buffer pages themselves into a hibernation file, or to save just the identities (relation/fork/block number) of the buffers? In the first case, getting them into the OS cache would not help because the kernel would not recognize that data as being equivalent to the block it is a copy of. In the latter case, wouldn't we just trigger the same inefficient scattered read of the data that normal database operation would trigger, taking about the same amount of time to reach cache-warmth? Or is POSIX_FADV_WILLNEED going to be clever about reordering and coalescing reads? Cheers, Jeff
On Wed, Jun 1, 2011 at 11:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Sun, May 15, 2011 at 11:19 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I don't think there's any need for this to get data into >> shared_buffers at all. Getting it into the OS cache oughta be plenty >> sufficient, no? >> >> ISTM that a very simple approach here would be to save the contents of >> each shared buffer on clean shutdown, and to POSIX_FADV_WILLNEED those >> buffers on startup. > > Do you mean to save the contents of the buffer pages themselves into a > hibernation file, or to save just the identities (relation/fork/block > number) of the buffers? The latter. > In the first case, getting them into the OS cache would not help > because the kernel would not recognize that data as being equivalent > to the block it is a copy of. > > In the latter case, wouldn't we just trigger the same inefficient > scattered read of the data that normal database operation would > trigger, taking about the same amount of time to reach cache-warmth? > Or is POSIX_FADV_WILLNEED going to be clever about reordering and > coalescing reads? It would be nice if POSIX_FADV_WILLNEED is clever enough to reorder and coalesce, but even if it isn't, we can help it along by doing all the reads from any given file one after another and in increasing block number order. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
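The "same file, increasing block number" ordering suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: `SavedBlock`, `saved_block_cmp` and `prefetch_saved_blocks` are made-up names, `BLOCK_SIZE` is assumed to equal the default 8kB BLCKSZ, and the caller is assumed to supply a table mapping `file_id` to a path. It is not taken from the posted patch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* exposes posix_fadvise() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192         /* assumption: default PostgreSQL BLCKSZ */

/* Hypothetical record of one block that was cached at shutdown. */
typedef struct
{
    unsigned file_id;           /* index into a caller-supplied path table */
    unsigned block_num;         /* block number within that file */
} SavedBlock;

/* Order by file first, then by increasing block number. */
static int
saved_block_cmp(const void *a, const void *b)
{
    const SavedBlock *x = a;
    const SavedBlock *y = b;

    if (x->file_id != y->file_id)
        return x->file_id < y->file_id ? -1 : 1;
    if (x->block_num != y->block_num)
        return x->block_num < y->block_num ? -1 : 1;
    return 0;
}

/*
 * Sort the saved block list and hint each block to the kernel with
 * POSIX_FADV_WILLNEED.  Returns the number of hints accepted, or -1 if a
 * file could not be opened.
 */
static int
prefetch_saved_blocks(SavedBlock *blocks, size_t n, const char *const *paths)
{
    int issued = 0;
    int fd = -1;
    unsigned cur = 0;

    qsort(blocks, n, sizeof(SavedBlock), saved_block_cmp);

    for (size_t i = 0; i < n; i++)
    {
        /* Switch files only when the sorted list moves on to the next one. */
        if (fd < 0 || blocks[i].file_id != cur)
        {
            if (fd >= 0)
                close(fd);
            cur = blocks[i].file_id;
            fd = open(paths[cur], O_RDONLY);
            if (fd < 0)
                return -1;
        }

        /* posix_fadvise() returns 0 on success, an error number otherwise. */
        if (posix_fadvise(fd, (off_t) blocks[i].block_num * BLOCK_SIZE,
                          BLOCK_SIZE, POSIX_FADV_WILLNEED) == 0)
            issued++;
    }

    if (fd >= 0)
        close(fd);
    return issued;
}
```

Sorting first means each file is opened once and its hints are issued in ascending offset order, which gives the kernel the best chance to merge adjacent requests even if it does no reordering of its own.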
On Wed, Jun 1, 2011 at 8:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > In the latter case, wouldn't we just trigger the same inefficient > scattered read of the data that normal database operation would > trigger, taking about the same amount of time to reach cache-warmth? If you have a system where you're bandwidth-constrained and processing queries as fast as you can then yes. But if you have an OLTP system where queries come in at a fixed rate and it's latency that matters then there's a big difference. It might take you hours to prime the cache at the rate that queries come in organically and for that whole time every query requires multiple cache misses and multiple seeks and random access reads. Once it's all primed your whole database might actually fit in RAM and require no i/o to serve requests. And it's possible that your system is architected on the assumption that that's the case and performance is inadequate until the whole database is read in. Actually in that extreme case you can probably get away with a few dd commands or perhaps an sql select count(*) on startup. I'm not sure in practice how wide the use case is in the gap between that extreme case and more average cases where the difference isn't so catastrophic. I'm sure there will be people who will say it's big but I would like to see numbers. And I'm not just talking about the usual knee-jerk "lets' see the benchmarks" response. I would love to see metrics on a live database showing users how much of their response time depends on the cache and how that performance varies as the cache gets warmer. Right now I think users are kind of in the dark on cache effectiveness and latency numbers. -- greg
Hi,

> On 05/07/2011 03:32 AM, Mitsuru IWASAKI wrote:
> > For 1, I've just finished my work. The latest patch is available at:
> > http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch
> >
>
> Reminder here--we can't accept code based on it being published to a web
> page. You'll need to e-mail it to the pgsql-hackers mailing list to be
> considered for the next PostgreSQL CommitFest, which is starting in a
> few weeks. Code submitted to the mailing list is considered a release
> of it to the project under the PostgreSQL license, which we can't just
> assume for things when given only a URL to them.

Sorry about that, but I had enough time to revise my patches this weekend.
I have attached the patches to this mail, and will update the CommitFest page soon.

> Also, you suggested you were out of time to work on this. If that's the
> case, we'd like to know that so we don't keep cc'ing you about things in
> expectation of an answer. Someone else may pick this up as a project to
> continue working on. But it's going to need a fair amount of revision
> before it matches what people want here, and I'm not sure how much of
> what you've written is going to end up in any commit that may happen
> from this idea.

It seems that I don't have enough time to complete this work.
You don't need to keep cc'ing me, and I would be very happy if postgres became the first DBMS to support a buffer cache hibernation feature.

Thanks!
diff --git src/backend/access/transam/xlog.c src/backend/access/transam/xlog.c
index b0e4c41..7a3a207 100644
--- src/backend/access/transam/xlog.c
+++ src/backend/access/transam/xlog.c
@@ -4834,6 +4834,19 @@ ReadControlFile(void)
 #endif
 }
 
+bool
+GetControlFile(ControlFileData *controlFile)
+{
+    if (ControlFile == NULL)
+    {
+        return false;
+    }
+
+    memcpy(controlFile, ControlFile, sizeof(ControlFileData));
+
+    return true;
+}
+
 void
 UpdateControlFile(void)
 {
diff --git src/backend/bootstrap/bootstrap.c src/backend/bootstrap/bootstrap.c
index fc093cc..7ecf6bb 100644
--- src/backend/bootstrap/bootstrap.c
+++ src/backend/bootstrap/bootstrap.c
@@ -360,6 +360,15 @@ AuxiliaryProcessMain(int argc, char *argv[])
     BaseInit();
 
     /*
+     * Only StartupProcess can call ResumeBufferCacheHibernation() after
+     * InitFileAccess() and smgrinit().
+     */
+    if (auxType == StartupProcess && BufferCacheHibernationLevel > 0)
+    {
+        ResumeBufferCacheHibernation();
+    }
+
+    /*
      * When we are an auxiliary process, we aren't going to do the full
      * InitPostgres pushups, but there are a couple of things that need to get
      * lit up even in an auxiliary process.
diff --git src/backend/storage/buffer/buf_init.c src/backend/storage/buffer/buf_init.c
index dadb49d..52eb51a 100644
--- src/backend/storage/buffer/buf_init.c
+++ src/backend/storage/buffer/buf_init.c
@@ -127,6 +127,14 @@ InitBufferPool(void)
 
     /* Init other shared buffer-management stuff */
     StrategyInitialize(!foundDescs);
+
+    if (BufferCacheHibernationLevel > 0)
+    {
+        ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS,
+            (char *)BufferDescriptors, sizeof(BufferDesc), NBuffers);
+        ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS,
+            (char *)BufferBlocks, BLCKSZ, NBuffers);
+    }
 }
 
 /*
diff --git src/backend/storage/buffer/bufmgr.c src/backend/storage/buffer/bufmgr.c
index f96685d..dba8ebf 100644
--- src/backend/storage/buffer/bufmgr.c
+++ src/backend/storage/buffer/bufmgr.c
@@ -31,6 +31,7 @@
 #include "postgres.h"
 
 #include <sys/file.h>
+#include <sys/stat.h>
 #include <unistd.h>
 
 #include "catalog/catalog.h"
@@ -61,6 +62,13 @@
 #define BUF_WRITTEN 0x01
 #define BUF_REUSABLE 0x02
 
+/*
+ * Buffer Cache Hibernation stuff.
+ */
+/* enable this to debug buffer cache hibernation. */
+#if 0
+#define DEBUG_BUFFER_CACHE_HIBERNATION
+#endif
 
 /* GUC variables */
 bool zero_damaged_pages = false;
@@ -765,6 +773,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
             }
         }
 
+#ifdef DEBUG_BUFFER_CACHE_HIBERNATION
+        elog(DEBUG5,
+            "alloc [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d",
+            buf->buf_id, buf->flags, buf->usage_count, buf->refcount,
+            buf->wait_backend_pid, buf->freeNext,
+            newHash, newTag.rnode.spcNode,
+            newTag.rnode.dbNode, newTag.rnode.relNode,
+            newTag.forkNum, newTag.blockNum);
+#endif
+
         return buf;
     }
 
@@ -800,6 +818,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
      * the old content is no longer relevant. (The usage_count starts out at
      * 1 so that the buffer can survive one clock-sweep pass.)
*/ +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION + elog(DEBUG5, + "rename [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d", + buf->buf_id, buf->flags, buf->usage_count, buf->refcount, + buf->wait_backend_pid, buf->freeNext, + oldHash, oldTag.rnode.spcNode, + oldTag.rnode.dbNode, oldTag.rnode.relNode, + oldTag.forkNum, oldTag.blockNum); +#endif + buf->tag = newTag; buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT); if (relpersistence == RELPERSISTENCE_PERMANENT) @@ -2772,3 +2800,716 @@ local_buffer_write_error_callback(void *arg) pfree(path); }} + +/* ---------------------------------------------------------------- + * Buffer Cache Hibernation support stuff + * + * Suspend/resume buffer cache data structure using hibernation files + * at shutdown/startup. + * ---------------------------------------------------------------- + */ + +int BufferCacheHibernationLevel = 0; + +#define BUFFER_CACHE_HIBERNATION_FILE_STRATEGY "global/pg_buffer_cache_hibernation_strategy" +#define BUFFER_CACHE_HIBERNATION_FILE_DESCRIPTORS "global/pg_buffer_cache_hibernation_descriptors" +#define BUFFER_CACHE_HIBERNATION_FILE_BLOCKS "global/pg_buffer_cache_hibernation_blocks" +#define BUFFER_CACHE_HIBERNATION_FILE_CRC32 "global/pg_buffer_cache_hibernation_crc32" + +static struct +{ + char *hibernation_file; + char *data_ptr; + Size record_length; + Size num_records; + pg_crc32 crc; +} BufferCacheHibernationData[] = +{ + /* BufferStrategyControl */ + { + BUFFER_CACHE_HIBERNATION_FILE_STRATEGY, + NULL, 0, 0, 0 + }, + + /* BufferDescriptors */ + { + BUFFER_CACHE_HIBERNATION_FILE_DESCRIPTORS, + NULL, 0, 0, 0 + }, + + /* BufferBlocks */ + { + BUFFER_CACHE_HIBERNATION_FILE_BLOCKS, + NULL, 0, 0, 0 + }, + + /* End-of-list marker */ + { + NULL, + NULL, 0, 0, 0 + }, +}; + +static ControlFileData controlFile; +static bool controlFileInitialized = false; + +/* + * AtProcExit_BufferCacheHibernation: + * store the buffer cache into hibernation files at shutdown. 
+ */ +static void +AtProcExit_BufferCacheHibernation(int code, Datum arg) +{ + BufferHibernationFileType id; + int i; + int fd; + + if (BufferCacheHibernationLevel == 0) + { + return; + } + + /* + * get the control file to check the system state validation. + */ + if (GetControlFile(&controlFile) == false) + { + elog(WARNING, + "could not get control file, " + "aborting buffer cache hibernation"); + return; + } + + if (controlFile.state != DB_SHUTDOWNED) + { + elog(WARNING, + "database system was not shut down normally, " + "aborting buffer cache hibernation"); + return; + } + + /* + * suspend buffer cache data structure into hibernation files. + */ + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) + { + Size record_length; + Size num_records; + char *ptr; + pg_crc32 crc; + + if (BufferCacheHibernationLevel < 2 && + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) + { + continue; + } + + if (BufferCacheHibernationData[id].data_ptr == NULL || + BufferCacheHibernationData[id].record_length == 0 || + BufferCacheHibernationData[id].num_records == 0) + { + elog(WARNING, + "ResisterBufferCacheHibernation() was not called for %s", + BufferCacheHibernationData[id].hibernation_file); + goto cleanup; + } + + fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file, + O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, S_IRUSR | S_IWUSR); + if (fd < 0) + { + elog(WARNING, + "could not open %s", + BufferCacheHibernationData[id].hibernation_file); + goto cleanup; + } + + record_length = BufferCacheHibernationData[id].record_length; + num_records = BufferCacheHibernationData[id].num_records; + + elog(NOTICE, + "buffer cache hibernate into %s", + BufferCacheHibernationData[id].hibernation_file); + + INIT_CRC32(crc); + for (i = 0; i < num_records; i++) + { + ptr = BufferCacheHibernationData[id].data_ptr + (i * record_length); + if (write(fd, (void *)ptr, record_length) != record_length) + { + elog(WARNING, + "could not write %s", + 
BufferCacheHibernationData[id].hibernation_file); + goto cleanup; + } + + COMP_CRC32(crc, ptr, record_length); + } + + FIN_CRC32(crc); + close(fd); + + BufferCacheHibernationData[id].crc = crc; + } + + /* + * save the computed crc values for the validations at resuming. + */ + fd = BasicOpenFile(BUFFER_CACHE_HIBERNATION_FILE_CRC32, + O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, S_IRUSR | S_IWUSR); + if (fd < 0) + { + elog(WARNING, + "could not open %s", + BUFFER_CACHE_HIBERNATION_FILE_CRC32); + goto cleanup; + } + + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) + { + pg_crc32 crc; + + if (BufferCacheHibernationLevel < 2 && + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) + { + continue; + } + + crc = BufferCacheHibernationData[id].crc; + if (write(fd, (void *)&crc, sizeof(pg_crc32)) != sizeof(pg_crc32)) + { + elog(WARNING, + "could not write %s for %s", + BUFFER_CACHE_HIBERNATION_FILE_CRC32, + BufferCacheHibernationData[id].hibernation_file); + goto cleanup; + } + } + close(fd); + + elog(NOTICE, + "buffer cache suspended successfully"); + + return; + +cleanup: + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) + { + unlink(BufferCacheHibernationData[id].hibernation_file); + } + + return; +} + +/* + * ResisterBufferCacheHibernation: + * register the buffer cache data structure info. + */ +void +ResisterBufferCacheHibernation(BufferHibernationFileType id, char *ptr, Size record_length, Size num_records) +{ + static bool first_time = true; + + if (BufferCacheHibernationLevel == 0) + { + return; + } + + if (id != BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY && + id != BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS && + id != BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) + { + return; + } + + if (first_time) + { + /* + * AtProcExit_BufferCacheHibernation to be called at shutdown. 
+ */ + on_shmem_exit(AtProcExit_BufferCacheHibernation, 0); + first_time = false; + } + + /* + * get the control file to check the system state and + * hibernation file validations. + */ + if (controlFileInitialized == false) + { + if (GetControlFile(&controlFile) == true) + { + controlFileInitialized = true; + } + } + + BufferCacheHibernationData[id].data_ptr = ptr; + BufferCacheHibernationData[id].record_length = record_length; + BufferCacheHibernationData[id].num_records = num_records; +} + +/* + * ResumeBufferCacheHibernation: + * resume the buffer cache from hibernation file at startup. + */ +void +ResumeBufferCacheHibernation(void) +{ + BufferHibernationFileType id; + int i; + int fd; + Size num_records; + Size record_length; + char *buf_common; + int oldNBuffers; + bool buffer_block_processed; + + if (BufferCacheHibernationLevel == 0) + { + return; + } + + buf_common = NULL; + buffer_block_processed = false; + + /* + * lock all buffer descriptors to prevent other processes from + * updating buffers. + */ + for (i = 0; i < NBuffers; i++) + { + BufferDesc *buf; + + buf = &BufferDescriptors[i]; + LockBufHdr(buf); + } + + /* + * get the control file to check the system state and + * hibernation file validations. + */ + if (controlFileInitialized == false) + { + elog(WARNING, + "could not get control file, " + "aborting buffer cache hibernation"); + goto cleanup; + } + + if (controlFile.state != DB_SHUTDOWNED) + { + elog(WARNING, + "database system was not shut down normally, " + "aborting buffer cache hibernation"); + goto cleanup; + } + + /* + * read the crc values which was computed when the hibernation + * files were created. 
+ */ + fd = BasicOpenFile(BUFFER_CACHE_HIBERNATION_FILE_CRC32, + O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR); + if (fd < 0) + { + elog(WARNING, + "could not open %s", + BUFFER_CACHE_HIBERNATION_FILE_CRC32); + goto cleanup; + } + + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) + { + pg_crc32 crc; + + if (BufferCacheHibernationLevel < 2 && + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) + { + continue; + } + + if (read(fd, (void *)&crc, sizeof(pg_crc32)) != sizeof(pg_crc32)) + { + if (BufferCacheHibernationLevel == 2 && + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) + { + /* + * if buffer_cache_hibernation_level changes 1 to 2, + * the crc value of buffer block hibernation file may not exist. + * just ignore it here. + */ + continue; + } + + elog(WARNING, + "could not read %s for %s", + BUFFER_CACHE_HIBERNATION_FILE_CRC32, + BufferCacheHibernationData[id].hibernation_file); + close(fd); + goto cleanup; + } + BufferCacheHibernationData[id].crc = crc; + } + + close(fd); + + /* + * allocate a buffer to read the contents of the hibernation files + * for validations. + */ + record_length = 0; + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) + { + if (record_length < BufferCacheHibernationData[id].record_length) + { + record_length = BufferCacheHibernationData[id].record_length; + } + } + + buf_common = malloc(record_length); + Assert(buf_common != NULL); + + /* assume that the number of buffers have not changed. */ + oldNBuffers = NBuffers; + + /* + * check if all hibernation files are valid. 
+ */ + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) + { + struct stat sb; + pg_crc32 crc; + + if (BufferCacheHibernationLevel < 2 && + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) + { + continue; + } + + if (BufferCacheHibernationData[id].data_ptr == NULL || + BufferCacheHibernationData[id].record_length == 0 || + BufferCacheHibernationData[id].num_records == 0) + { + elog(WARNING, + "ResisterBufferCacheHibernation() was not called for %s", + BufferCacheHibernationData[id].hibernation_file); + goto cleanup; + } + + fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file, + O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR); + if (fd < 0) + { + if (BufferCacheHibernationLevel == 2 && + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) + { + /* + * if buffer_cache_hibernation_level changes 1 to 2, + * the buffer block hibernation file may not exist. + * just ignore it here. + */ + continue; + } + + goto cleanup; + } + + if (fstat(fd, &sb) < 0) + { + elog(WARNING, + "could not get stats of the buffer cache hibernation file: %s", + BufferCacheHibernationData[id].hibernation_file); + close(fd); + goto cleanup; + } + + record_length = BufferCacheHibernationData[id].record_length; + num_records = BufferCacheHibernationData[id].num_records; + + if (sb.st_size != (record_length * num_records)) + { + /* The size of StrategyControl should be the same always. */ + if (id == BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY || + (sb.st_size % record_length) > 0) + { + elog(WARNING, + "size mismatch on the buffer cache hibernation file: %s", + BufferCacheHibernationData[id].hibernation_file); + close(fd); + goto cleanup; + } + + /* + * The number of records of buffer descriptors and blocks + * should be the same. 
+ */ + if (oldNBuffers != NBuffers && + oldNBuffers != (sb.st_size / record_length)) + { + elog(WARNING, + "size mismatch on the buffer cache hibernation file: %s", + BufferCacheHibernationData[id].hibernation_file); + close(fd); + goto cleanup; + } + + oldNBuffers = sb.st_size / record_length; + + elog(NOTICE, + "shared_buffers have changed from %d to %d: %s", + oldNBuffers, NBuffers, + BufferCacheHibernationData[id].hibernation_file); + + /* use the original size to compute CRC of the hibernation file. */ + num_records = oldNBuffers; + } + + if ((pg_time_t)sb.st_mtime < controlFile.time) + { + elog(WARNING, + "the hibernation file is older than control file: %s", + BufferCacheHibernationData[id].hibernation_file); + close(fd); + goto cleanup; + } + + INIT_CRC32(crc); + for (i = 0; i < num_records; i++) + { + if (read(fd, (void *)buf_common, record_length) != record_length) + { + elog(WARNING, + "could not read the buffer cache hibernation file: %s", + BufferCacheHibernationData[id].hibernation_file); + close(fd); + goto cleanup; + } + + COMP_CRC32(crc, buf_common, record_length); + + /* + * buffer descriptors validations. 
+ */ + if (id == BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS) + { + BufferDesc *buf; + BufFlags abnormal_flags; + + if (i >= NBuffers) + { + continue; + } + + abnormal_flags = (BM_DIRTY | BM_IO_IN_PROGRESS | BM_IO_ERROR | + BM_JUST_DIRTIED | BM_PIN_COUNT_WAITER); + + buf = (BufferDesc *)buf_common; + + if (buf->flags & abnormal_flags) + { + elog(WARNING, + "abnormal flags in buffer descriptors: %d", + buf->flags); + close(fd); + goto cleanup; + } + + if (buf->usage_count > BM_MAX_USAGE_COUNT) + { + elog(WARNING, + "invalid usage count in buffer descriptors: %d", + buf->usage_count); + close(fd); + goto cleanup; + } + + if (buf->buf_id < 0 || buf->buf_id >= num_records) + { + elog(WARNING, + "invalid buffer id in buffer descriptors: %d", + buf->buf_id); + close(fd); + goto cleanup; + } + } + } + + FIN_CRC32(crc); + close(fd); + + if (!EQ_CRC32(BufferCacheHibernationData[id].crc, crc)) + { + elog(WARNING, + "crc mismatch on the buffer cache hibernation file: %s", + BufferCacheHibernationData[id].hibernation_file); + close(fd); + goto cleanup; + } + } + + /* + * resume the buffer cache data structure from the hibernation files. + */ + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) + { + int fd; + char *ptr; + + if (BufferCacheHibernationLevel < 2 && + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) + { + continue; + } + + record_length = BufferCacheHibernationData[id].record_length; + num_records = BufferCacheHibernationData[id].num_records; + + if (id != BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY) + { + /* use the smaller number of buffers. */ + num_records = (oldNBuffers < NBuffers)? oldNBuffers : NBuffers; + } + + fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file, + O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR); + if (fd < 0) + { + if (BufferCacheHibernationLevel == 2 && + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) + { + /* + * if buffer_cache_hibernation_level changes 1 to 2, + * the buffer block hibernation file may not exist. 
+ * just ignore it here. + */ + continue; + } + + goto cleanup; + } + + elog(NOTICE, + "buffer cache resume from %s(%d bytes * %d records)", + BufferCacheHibernationData[id].hibernation_file, + record_length, num_records); + + for (i = 0; i < num_records; i++) + { + ptr = BufferCacheHibernationData[id].data_ptr + (i * record_length); + read(fd, (void *)ptr, record_length); + + /* Re-lock the buffer descriptor if necessary. */ + if (id == BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS) + { + BufferDesc *buf; + + buf = (BufferDesc *)ptr; + if (IsUnlockBufHdr(buf)) + { + LockBufHdr(buf); + } + } + } + + close(fd); + + if (id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) + { + buffer_block_processed = true; + } + } + + if (buffer_block_processed == false) + { + /* we didn't use the buffer block hibernation file, so delete it now. */ + id = BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS; + unlink(BufferCacheHibernationData[id].hibernation_file); + } + + /* + * set the rest data structures (eg. lookup hashtable) up + * based on the buffer descriptors. + */ + num_records = (oldNBuffers < NBuffers)? oldNBuffers : NBuffers; + for (i = 0; i < num_records; i++) + { + BufferDesc *buf; + BufferTag newTag; + uint32 newHash; + int buf_id; + + buf = &BufferDescriptors[i]; + if (buf->tag.rnode.spcNode == InvalidOid && + buf->tag.rnode.dbNode == InvalidOid && + buf->tag.rnode.relNode == InvalidOid) + { + continue; + } + + INIT_BUFFERTAG(newTag, buf->tag.rnode, buf->tag.forkNum, buf->tag.blockNum); + newHash = BufTableHashCode(&newTag); + + if (buffer_block_processed == false) + { + Block bufBlock; + SMgrRelation smgr; + + /* + * re-read buffer block. + */ + bufBlock = BufHdrGetBlock(buf); + smgr = smgropen(buf->tag.rnode, InvalidBackendId); + smgrread(smgr, newTag.forkNum, newTag.blockNum, (char *) bufBlock); + } + + buf_id = BufTableInsert(&newTag, newHash, buf->buf_id); + if (buf_id != -1) + { + /* the entry exists already, return it to the freelist. 
*/ + buf->refcount = 0; + buf->flags = 0; + InvalidateBuffer(buf); + continue; + } + + /* clear wait_backend_pid because the process was terminated already. */ + buf->wait_backend_pid = 0; + +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION + elog(DEBUG5, + "resume [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d", + buf->buf_id, buf->flags, buf->usage_count, buf->refcount, + buf->wait_backend_pid, buf->freeNext, + newHash, newTag.rnode.spcNode, + newTag.rnode.dbNode, newTag.rnode.relNode, + newTag.forkNum, newTag.blockNum); +#endif + } + + /* + * adjust StrategyControl based on the change of shared_buffers. + */ + if (oldNBuffers != NBuffers) + { + AdjustStrategyControl(oldNBuffers); + } + + elog(NOTICE, + "buffer cache resumed successfully"); + +cleanup: + for (i = 0; i < NBuffers; i++) + { + BufferDesc *buf; + + buf = &BufferDescriptors[i]; + UnlockBufHdr(buf); + } + + if (buf_common != NULL) + { + free(buf_common); + } + + return; +} diff --git src/backend/storage/buffer/freelist.c src/backend/storage/buffer/freelist.c index bf9903b..ffc101d 100644 --- src/backend/storage/buffer/freelist.c +++ src/backend/storage/buffer/freelist.c @@ -347,6 +347,12 @@ StrategyInitialize(bool init) } else Assert(!init); + + if (BufferCacheHibernationLevel > 0) + { + ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY, + (char *)StrategyControl, sizeof(BufferStrategyControl), 1); + }} @@ -521,3 +527,47 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, volatile BufferDesc *buf) return true;} + +/* + * AdjustStrategyControl -- adjust the member variables of StrategyControl + * + * If the shared_buffers setting had changed, restored StrategyControl + * needs to be adjusted for in both cases of shrinking and enlarging. + * This is called only from bufmgr.c:ResumeBufferCacheHibernation(). + */ +void +AdjustStrategyControl(int oldNBuffers) +{ + if (oldNBuffers == NBuffers) + { + return; + } + + /* enlarge or shrink the free buffer based on current NBuffers. 
*/ + StrategyControl->lastFreeBuffer = NBuffers - 1; + + /* shared_buffers shrunk. */ + if (oldNBuffers > NBuffers) + { + if (StrategyControl->nextVictimBuffer >= NBuffers) + { + /* set the tail of buffers. */ + StrategyControl->nextVictimBuffer = NBuffers - 1; + } + + if (StrategyControl->firstFreeBuffer >= NBuffers) + { + /* set FREENEXT_END_OF_LIST(-1). */ + StrategyControl->firstFreeBuffer = FREENEXT_END_OF_LIST; + } + } + else + /* shared_buffers enlarged. */ + { + if (StrategyControl->firstFreeBuffer < 0) + { + /* set the next entry of the tail of old buffers. */ + StrategyControl->firstFreeBuffer = oldNBuffers; + } + } +} diff --git src/backend/utils/misc/guc.c src/backend/utils/misc/guc.c index 738e215..5affc6e 100644 --- src/backend/utils/misc/guc.c +++ src/backend/utils/misc/guc.c @@ -2361,6 +2361,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"buffer_cache_hibernation_level", PGC_POSTMASTER, UNGROUPED, + gettext_noop("Sets buffer cache hibernation level."), + gettext_noop("0 to disable(default), " + "1 for saving buffer descriptors only(recommended), " + "2 for saving buffer descriptors and buffer blocks(slower at shutdown).") + }, + &BufferCacheHibernationLevel, + 0, 0, 2, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL diff --git src/backend/utils/misc/postgresql.conf.sample src/backend/utils/misc/postgresql.conf.sample index b8a1582..44b6ff3 100644 --- src/backend/utils/misc/postgresql.conf.sample +++ src/backend/utils/misc/postgresql.conf.sample @@ -119,6 +119,17 @@#maintenance_work_mem = 16MB # min 1MB#max_stack_depth = 2MB # min 100kB + +# Buffer Cache Hibernation: +# Suspend/resume buffer cache data structure using hibernation files +# at shutdown/startup. +#buffer_cache_hibernation_level = 0 # Sets buffer cache hibernation level. 
+ # 0 to disable(default), + # 1 for saving buffer descriptors only + # (recommended), + # 2 for saving buffer descriptors and + # buffer blocks(slower at shutdown). +# - Kernel Resource Usage -#max_files_per_process = 1000 # min 25 diff --git src/include/access/xlog.h src/include/access/xlog.h index 7056fd6..7a9fb99 100644 --- src/include/access/xlog.h +++ src/include/access/xlog.h @@ -13,6 +13,7 @@#include "access/rmgr.h"#include "access/xlogdefs.h" +#include "catalog/pg_control.h"#include "lib/stringinfo.h"#include "storage/buf.h"#include "utils/pg_crc.h" @@ -294,6 +295,7 @@ extern bool XLogInsertAllowed(void);extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);externXLogRecPtr GetXLogReplayRecPtr(void); +extern bool GetControlFile(ControlFileData *controlFile);extern void UpdateControlFile(void);extern uint64 GetSystemIdentifier(void);externSize XLOGShmemSize(void); diff --git src/include/storage/buf_internals.h src/include/storage/buf_internals.h index b7d4ea5..d537ef1 100644 --- src/include/storage/buf_internals.h +++ src/include/storage/buf_internals.h @@ -167,6 +167,7 @@ typedef struct sbufdesc */#define LockBufHdr(bufHdr) SpinLockAcquire(&(bufHdr)->buf_hdr_lock)#defineUnlockBufHdr(bufHdr) SpinLockRelease(&(bufHdr)->buf_hdr_lock) +#define IsUnlockBufHdr(bufHdr) SpinLockFree(&(bufHdr)->buf_hdr_lock)/* in buf_init.c */ @@ -190,6 +191,7 @@ extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,extern int StrategySyncStart(uint32*complete_passes, uint32 *num_buf_alloc);extern Size StrategyShmemSize(void);extern void StrategyInitialize(boolinit); +extern void AdjustStrategyControl(int oldNBuffers);/* buf_table.c */extern Size BufTableShmemSize(int size); diff --git src/include/storage/bufmgr.h src/include/storage/bufmgr.h index b8fc87e..ddfeb9d 100644 --- src/include/storage/bufmgr.h +++ src/include/storage/bufmgr.h @@ -211,6 +211,20 @@ extern void BgBufferSync(void);extern void AtProcExit_LocalBuffers(void); +/* buffer cache hibernation 
support stuff */ +extern int BufferCacheHibernationLevel; + +typedef enum BufferHibernationFileType +{ + BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY, + BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS, + BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS +} BufferHibernationFileType; + +extern void ResisterBufferCacheHibernation(BufferHibernationFileType id, + char *ptr, Size record_length, Size num_records); +extern void ResumeBufferCacheHibernation(void); +/* in freelist.c */extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);extern void FreeAccessStrategy(BufferAccessStrategystrategy);
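For readers following the freelist.c hunk: the adjustment that AdjustStrategyControl() applies when shared_buffers differs between shutdown and startup can be tried outside the server. The following is a minimal standalone sketch under stated assumptions, not the patch's code. StrategySketch and adjust_strategy_sketch() are hypothetical names, the struct carries only the three fields the hunk touches, and the new buffer count is passed as a parameter instead of being read from the NBuffers global.

```c
#include <assert.h>

#define FREENEXT_END_OF_LIST (-1)

/*
 * Hypothetical stand-in for the three BufferStrategyControl fields
 * that the freelist.c hunk modifies.
 */
typedef struct
{
    int nextVictimBuffer;   /* clock-sweep hand */
    int firstFreeBuffer;    /* head of the free list, or -1 if empty */
    int lastFreeBuffer;     /* tail of the free list */
} StrategySketch;

/*
 * Mirror of the hunk's logic: old_nbuffers is the count the hibernation
 * files were written with, new_nbuffers is the current setting.
 */
static void
adjust_strategy_sketch(StrategySketch *sc, int old_nbuffers, int new_nbuffers)
{
    assert(old_nbuffers > 0 && new_nbuffers > 0);

    if (old_nbuffers == new_nbuffers)
        return;

    /* In both directions the tail moves to the last valid buffer. */
    sc->lastFreeBuffer = new_nbuffers - 1;

    if (old_nbuffers > new_nbuffers)
    {
        /* Shrunk: clamp any index that now points past the end. */
        if (sc->nextVictimBuffer >= new_nbuffers)
            sc->nextVictimBuffer = new_nbuffers - 1;
        if (sc->firstFreeBuffer >= new_nbuffers)
            sc->firstFreeBuffer = FREENEXT_END_OF_LIST;
    }
    else
    {
        /* Enlarged: if the free list was empty, start it at the first new buffer. */
        if (sc->firstFreeBuffer < 0)
            sc->firstFreeBuffer = old_nbuffers;
    }
}
```

Shrinking from 100 to 50 buffers with the clock hand at 90 moves the hand to 49 and empties a free list whose head was beyond the new end. Enlarging from 100 to 200 with an empty free list makes buffer 100 the new head. Like the hunk it mirrors, the sketch only touches the three control fields.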
On 06/05/2011 08:50 AM, Mitsuru IWASAKI wrote:
> It seems that I don't have enough time to complete this work.
> You don't need to keep cc'ing me, and I'm very happy if postgres to be
> the first DBMS which support buffer cache hibernation feature.
>

Thanks for submitting the patch, and we'll see what happens from here. I've switched to bcc'ing you here, and we should get you off everyone else's cc: list soon. If this feature ends up getting committed, I'll try to remember to drop you a note about it so you can see what happened.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
Should this be marked as TODO?

---------------------------------------------------------------------------

Mitsuru IWASAKI wrote:
> Hi,
>
> > On 05/07/2011 03:32 AM, Mitsuru IWASAKI wrote:
> > > For 1, I've just finish my work. The latest patch is available at:
> > > http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch
> > >
> >
> > Reminder here--we can't accept code based on it being published to a web
> > page. You'll need to e-mail it to the pgsql-hackers mailing list to be
> > considered for the next PostgreSQL CommitFest, which is starting in a
> > few weeks. Code submitted to the mailing list is considered a release
> > of it to the project under the PostgreSQL license, which we can't just
> > assume for things when given only a URL to them.
>
> Sorry about that, but I had enough time to revise my patches this week-end.
> I attached the patches in this mail, and will update CommitFest page soon.
>
> > Also, you suggested you were out of time to work on this. If that's the
> > case, we'd like to know that so we don't keep cc'ing you about things in
> > expectation of an answer. Someone else may pick this up as a project to
> > continue working on. But it's going to need a fair amount of revision
> > before it matches what people want here, and I'm not sure how much of
> > what you've written is going to end up in any commit that may happen
> > from this idea.
>
> It seems that I don't have enough time to complete this work.
> You don't need to keep cc'ing me, and I'm very happy if postgres to be
> the first DBMS which support buffer cache hibernation feature.
>
> Thanks!
> > > diff --git src/backend/access/transam/xlog.c src/backend/access/transam/xlog.c > index b0e4c41..7a3a207 100644 > --- src/backend/access/transam/xlog.c > +++ src/backend/access/transam/xlog.c > @@ -4834,6 +4834,19 @@ ReadControlFile(void) > #endif > } > > +bool > +GetControlFile(ControlFileData *controlFile) > +{ > + if (ControlFile == NULL) > + { > + return false; > + } > + > + memcpy(controlFile, ControlFile, sizeof(ControlFileData)); > + > + return true; > +} > + > void > UpdateControlFile(void) > { > diff --git src/backend/bootstrap/bootstrap.c src/backend/bootstrap/bootstrap.c > index fc093cc..7ecf6bb 100644 > --- src/backend/bootstrap/bootstrap.c > +++ src/backend/bootstrap/bootstrap.c > @@ -360,6 +360,15 @@ AuxiliaryProcessMain(int argc, char *argv[]) > BaseInit(); > > /* > + * Only StartupProcess can call ResumeBufferCacheHibernation() after > + * InitFileAccess() and smgrinit(). > + */ > + if (auxType == StartupProcess && BufferCacheHibernationLevel > 0) > + { > + ResumeBufferCacheHibernation(); > + } > + > + /* > * When we are an auxiliary process, we aren't going to do the full > * InitPostgres pushups, but there are a couple of things that need to get > * lit up even in an auxiliary process. 
> diff --git src/backend/storage/buffer/buf_init.c src/backend/storage/buffer/buf_init.c > index dadb49d..52eb51a 100644 > --- src/backend/storage/buffer/buf_init.c > +++ src/backend/storage/buffer/buf_init.c > @@ -127,6 +127,14 @@ InitBufferPool(void) > > /* Init other shared buffer-management stuff */ > StrategyInitialize(!foundDescs); > + > + if (BufferCacheHibernationLevel > 0) > + { > + ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS, > + (char *)BufferDescriptors, sizeof(BufferDesc), NBuffers); > + ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS, > + (char *)BufferBlocks, BLCKSZ, NBuffers); > + } > } > > /* > diff --git src/backend/storage/buffer/bufmgr.c src/backend/storage/buffer/bufmgr.c > index f96685d..dba8ebf 100644 > --- src/backend/storage/buffer/bufmgr.c > +++ src/backend/storage/buffer/bufmgr.c > @@ -31,6 +31,7 @@ > #include "postgres.h" > > #include <sys/file.h> > +#include <sys/stat.h> > #include <unistd.h> > > #include "catalog/catalog.h" > @@ -61,6 +62,13 @@ > #define BUF_WRITTEN 0x01 > #define BUF_REUSABLE 0x02 > > +/* > + * Buffer Cache Hibernation stuff. > + */ > +/* enable this to debug buffer cache hibernation. */ > +#if 0 > +#define DEBUG_BUFFER_CACHE_HIBERNATION > +#endif > > /* GUC variables */ > bool zero_damaged_pages = false; > @@ -765,6 +773,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > } > } > > +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION > + elog(DEBUG5, > + "alloc [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d", > + buf->buf_id, buf->flags, buf->usage_count, buf->refcount, > + buf->wait_backend_pid, buf->freeNext, > + newHash, newTag.rnode.spcNode, > + newTag.rnode.dbNode, newTag.rnode.relNode, > + newTag.forkNum, newTag.blockNum); > +#endif > + > return buf; > } > > @@ -800,6 +818,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > * the old content is no longer relevant. 
(The usage_count starts out at > * 1 so that the buffer can survive one clock-sweep pass.) > */ > +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION > + elog(DEBUG5, > + "rename [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d", > + buf->buf_id, buf->flags, buf->usage_count, buf->refcount, > + buf->wait_backend_pid, buf->freeNext, > + oldHash, oldTag.rnode.spcNode, > + oldTag.rnode.dbNode, oldTag.rnode.relNode, > + oldTag.forkNum, oldTag.blockNum); > +#endif > + > buf->tag = newTag; > buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT); > if (relpersistence == RELPERSISTENCE_PERMANENT) > @@ -2772,3 +2800,716 @@ local_buffer_write_error_callback(void *arg) > pfree(path); > } > } > + > +/* ---------------------------------------------------------------- > + * Buffer Cache Hibernation support stuff > + * > + * Suspend/resume buffer cache data structure using hibernation files > + * at shutdown/startup. > + * ---------------------------------------------------------------- > + */ > + > +int BufferCacheHibernationLevel = 0; > + > +#define BUFFER_CACHE_HIBERNATION_FILE_STRATEGY "global/pg_buffer_cache_hibernation_strategy" > +#define BUFFER_CACHE_HIBERNATION_FILE_DESCRIPTORS "global/pg_buffer_cache_hibernation_descriptors" > +#define BUFFER_CACHE_HIBERNATION_FILE_BLOCKS "global/pg_buffer_cache_hibernation_blocks" > +#define BUFFER_CACHE_HIBERNATION_FILE_CRC32 "global/pg_buffer_cache_hibernation_crc32" > + > +static struct > +{ > + char *hibernation_file; > + char *data_ptr; > + Size record_length; > + Size num_records; > + pg_crc32 crc; > +} BufferCacheHibernationData[] = > +{ > + /* BufferStrategyControl */ > + { > + BUFFER_CACHE_HIBERNATION_FILE_STRATEGY, > + NULL, 0, 0, 0 > + }, > + > + /* BufferDescriptors */ > + { > + BUFFER_CACHE_HIBERNATION_FILE_DESCRIPTORS, > + NULL, 0, 0, 0 > + }, > + > + /* BufferBlocks */ > + { > + BUFFER_CACHE_HIBERNATION_FILE_BLOCKS, > + NULL, 0, 0, 0 > + }, > + > + /* End-of-list marker */ > + { > + 
NULL, > + NULL, 0, 0, 0 > + }, > +}; > + > +static ControlFileData controlFile; > +static bool controlFileInitialized = false; > + > +/* > + * AtProcExit_BufferCacheHibernation: > + * store the buffer cache into hibernation files at shutdown. > + */ > +static void > +AtProcExit_BufferCacheHibernation(int code, Datum arg) > +{ > + BufferHibernationFileType id; > + int i; > + int fd; > + > + if (BufferCacheHibernationLevel == 0) > + { > + return; > + } > + > + /* > + * get the control file to check the system state validation. > + */ > + if (GetControlFile(&controlFile) == false) > + { > + elog(WARNING, > + "could not get control file, " > + "aborting buffer cache hibernation"); > + return; > + } > + > + if (controlFile.state != DB_SHUTDOWNED) > + { > + elog(WARNING, > + "database system was not shut down normally, " > + "aborting buffer cache hibernation"); > + return; > + } > + > + /* > + * suspend buffer cache data structure into hibernation files. > + */ > + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) > + { > + Size record_length; > + Size num_records; > + char *ptr; > + pg_crc32 crc; > + > + if (BufferCacheHibernationLevel < 2 && > + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) > + { > + continue; > + } > + > + if (BufferCacheHibernationData[id].data_ptr == NULL || > + BufferCacheHibernationData[id].record_length == 0 || > + BufferCacheHibernationData[id].num_records == 0) > + { > + elog(WARNING, > + "ResisterBufferCacheHibernation() was not called for %s", > + BufferCacheHibernationData[id].hibernation_file); > + goto cleanup; > + } > + > + fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file, > + O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, S_IRUSR | S_IWUSR); > + if (fd < 0) > + { > + elog(WARNING, > + "could not open %s", > + BufferCacheHibernationData[id].hibernation_file); > + goto cleanup; > + } > + > + record_length = BufferCacheHibernationData[id].record_length; > + num_records = 
BufferCacheHibernationData[id].num_records; > + > + elog(NOTICE, > + "buffer cache hibernate into %s", > + BufferCacheHibernationData[id].hibernation_file); > + > + INIT_CRC32(crc); > + for (i = 0; i < num_records; i++) > + { > + ptr = BufferCacheHibernationData[id].data_ptr + (i * record_length); > + if (write(fd, (void *)ptr, record_length) != record_length) > + { > + elog(WARNING, > + "could not write %s", > + BufferCacheHibernationData[id].hibernation_file); > + goto cleanup; > + } > + > + COMP_CRC32(crc, ptr, record_length); > + } > + > + FIN_CRC32(crc); > + close(fd); > + > + BufferCacheHibernationData[id].crc = crc; > + } > + > + /* > + * save the computed crc values for the validations at resuming. > + */ > + fd = BasicOpenFile(BUFFER_CACHE_HIBERNATION_FILE_CRC32, > + O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, S_IRUSR | S_IWUSR); > + if (fd < 0) > + { > + elog(WARNING, > + "could not open %s", > + BUFFER_CACHE_HIBERNATION_FILE_CRC32); > + goto cleanup; > + } > + > + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) > + { > + pg_crc32 crc; > + > + if (BufferCacheHibernationLevel < 2 && > + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) > + { > + continue; > + } > + > + crc = BufferCacheHibernationData[id].crc; > + if (write(fd, (void *)&crc, sizeof(pg_crc32)) != sizeof(pg_crc32)) > + { > + elog(WARNING, > + "could not write %s for %s", > + BUFFER_CACHE_HIBERNATION_FILE_CRC32, > + BufferCacheHibernationData[id].hibernation_file); > + goto cleanup; > + } > + } > + close(fd); > + > + elog(NOTICE, > + "buffer cache suspended successfully"); > + > + return; > + > +cleanup: > + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) > + { > + unlink(BufferCacheHibernationData[id].hibernation_file); > + } > + > + return; > +} > + > +/* > + * ResisterBufferCacheHibernation: > + * register the buffer cache data structure info. 
> + */ > +void > +ResisterBufferCacheHibernation(BufferHibernationFileType id, char *ptr, Size record_length, Size num_records) > +{ > + static bool first_time = true; > + > + if (BufferCacheHibernationLevel == 0) > + { > + return; > + } > + > + if (id != BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY && > + id != BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS && > + id != BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) > + { > + return; > + } > + > + if (first_time) > + { > + /* > + * AtProcExit_BufferCacheHibernation to be called at shutdown. > + */ > + on_shmem_exit(AtProcExit_BufferCacheHibernation, 0); > + first_time = false; > + } > + > + /* > + * get the control file to check the system state and > + * hibernation file validations. > + */ > + if (controlFileInitialized == false) > + { > + if (GetControlFile(&controlFile) == true) > + { > + controlFileInitialized = true; > + } > + } > + > + BufferCacheHibernationData[id].data_ptr = ptr; > + BufferCacheHibernationData[id].record_length = record_length; > + BufferCacheHibernationData[id].num_records = num_records; > +} > + > +/* > + * ResumeBufferCacheHibernation: > + * resume the buffer cache from hibernation file at startup. > + */ > +void > +ResumeBufferCacheHibernation(void) > +{ > + BufferHibernationFileType id; > + int i; > + int fd; > + Size num_records; > + Size record_length; > + char *buf_common; > + int oldNBuffers; > + bool buffer_block_processed; > + > + if (BufferCacheHibernationLevel == 0) > + { > + return; > + } > + > + buf_common = NULL; > + buffer_block_processed = false; > + > + /* > + * lock all buffer descriptors to prevent other processes from > + * updating buffers. > + */ > + for (i = 0; i < NBuffers; i++) > + { > + BufferDesc *buf; > + > + buf = &BufferDescriptors[i]; > + LockBufHdr(buf); > + } > + > + /* > + * get the control file to check the system state and > + * hibernation file validations. 
> + */ > + if (controlFileInitialized == false) > + { > + elog(WARNING, > + "could not get control file, " > + "aborting buffer cache hibernation"); > + goto cleanup; > + } > + > + if (controlFile.state != DB_SHUTDOWNED) > + { > + elog(WARNING, > + "database system was not shut down normally, " > + "aborting buffer cache hibernation"); > + goto cleanup; > + } > + > + /* > + * read the crc values which was computed when the hibernation > + * files were created. > + */ > + fd = BasicOpenFile(BUFFER_CACHE_HIBERNATION_FILE_CRC32, > + O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR); > + if (fd < 0) > + { > + elog(WARNING, > + "could not open %s", > + BUFFER_CACHE_HIBERNATION_FILE_CRC32); > + goto cleanup; > + } > + > + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) > + { > + pg_crc32 crc; > + > + if (BufferCacheHibernationLevel < 2 && > + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) > + { > + continue; > + } > + > + if (read(fd, (void *)&crc, sizeof(pg_crc32)) != sizeof(pg_crc32)) > + { > + if (BufferCacheHibernationLevel == 2 && > + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) > + { > + /* > + * if buffer_cache_hibernation_level changes 1 to 2, > + * the crc value of buffer block hibernation file may not exist. > + * just ignore it here. > + */ > + continue; > + } > + > + elog(WARNING, > + "could not read %s for %s", > + BUFFER_CACHE_HIBERNATION_FILE_CRC32, > + BufferCacheHibernationData[id].hibernation_file); > + close(fd); > + goto cleanup; > + } > + BufferCacheHibernationData[id].crc = crc; > + } > + > + close(fd); > + > + /* > + * allocate a buffer to read the contents of the hibernation files > + * for validations. 
> + */ > + record_length = 0; > + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) > + { > + if (record_length < BufferCacheHibernationData[id].record_length) > + { > + record_length = BufferCacheHibernationData[id].record_length; > + } > + } > + > + buf_common = malloc(record_length); > + Assert(buf_common != NULL); > + > + /* assume that the number of buffers have not changed. */ > + oldNBuffers = NBuffers; > + > + /* > + * check if all hibernation files are valid. > + */ > + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) > + { > + struct stat sb; > + pg_crc32 crc; > + > + if (BufferCacheHibernationLevel < 2 && > + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) > + { > + continue; > + } > + > + if (BufferCacheHibernationData[id].data_ptr == NULL || > + BufferCacheHibernationData[id].record_length == 0 || > + BufferCacheHibernationData[id].num_records == 0) > + { > + elog(WARNING, > + "ResisterBufferCacheHibernation() was not called for %s", > + BufferCacheHibernationData[id].hibernation_file); > + goto cleanup; > + } > + > + fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file, > + O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR); > + if (fd < 0) > + { > + if (BufferCacheHibernationLevel == 2 && > + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) > + { > + /* > + * if buffer_cache_hibernation_level changes 1 to 2, > + * the buffer block hibernation file may not exist. > + * just ignore it here. 
> + */ > + continue; > + } > + > + goto cleanup; > + } > + > + if (fstat(fd, &sb) < 0) > + { > + elog(WARNING, > + "could not get stats of the buffer cache hibernation file: %s", > + BufferCacheHibernationData[id].hibernation_file); > + close(fd); > + goto cleanup; > + } > + > + record_length = BufferCacheHibernationData[id].record_length; > + num_records = BufferCacheHibernationData[id].num_records; > + > + if (sb.st_size != (record_length * num_records)) > + { > + /* The size of StrategyControl should be the same always. */ > + if (id == BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY || > + (sb.st_size % record_length) > 0) > + { > + elog(WARNING, > + "size mismatch on the buffer cache hibernation file: %s", > + BufferCacheHibernationData[id].hibernation_file); > + close(fd); > + goto cleanup; > + } > + > + /* > + * The number of records of buffer descriptors and blocks > + * should be the same. > + */ > + if (oldNBuffers != NBuffers && > + oldNBuffers != (sb.st_size / record_length)) > + { > + elog(WARNING, > + "size mismatch on the buffer cache hibernation file: %s", > + BufferCacheHibernationData[id].hibernation_file); > + close(fd); > + goto cleanup; > + } > + > + oldNBuffers = sb.st_size / record_length; > + > + elog(NOTICE, > + "shared_buffers have changed from %d to %d: %s", > + oldNBuffers, NBuffers, > + BufferCacheHibernationData[id].hibernation_file); > + > + /* use the original size to compute CRC of the hibernation file. 
*/ > + num_records = oldNBuffers; > + } > + > + if ((pg_time_t)sb.st_mtime < controlFile.time) > + { > + elog(WARNING, > + "the hibernation file is older than control file: %s", > + BufferCacheHibernationData[id].hibernation_file); > + close(fd); > + goto cleanup; > + } > + > + INIT_CRC32(crc); > + for (i = 0; i < num_records; i++) > + { > + if (read(fd, (void *)buf_common, record_length) != record_length) > + { > + elog(WARNING, > + "could not read the buffer cache hibernation file: %s", > + BufferCacheHibernationData[id].hibernation_file); > + close(fd); > + goto cleanup; > + } > + > + COMP_CRC32(crc, buf_common, record_length); > + > + /* > + * buffer descriptors validations. > + */ > + if (id == BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS) > + { > + BufferDesc *buf; > + BufFlags abnormal_flags; > + > + if (i >= NBuffers) > + { > + continue; > + } > + > + abnormal_flags = (BM_DIRTY | BM_IO_IN_PROGRESS | BM_IO_ERROR | > + BM_JUST_DIRTIED | BM_PIN_COUNT_WAITER); > + > + buf = (BufferDesc *)buf_common; > + > + if (buf->flags & abnormal_flags) > + { > + elog(WARNING, > + "abnormal flags in buffer descriptors: %d", > + buf->flags); > + close(fd); > + goto cleanup; > + } > + > + if (buf->usage_count > BM_MAX_USAGE_COUNT) > + { > + elog(WARNING, > + "invalid usage count in buffer descriptors: %d", > + buf->usage_count); > + close(fd); > + goto cleanup; > + } > + > + if (buf->buf_id < 0 || buf->buf_id >= num_records) > + { > + elog(WARNING, > + "invalid buffer id in buffer descriptors: %d", > + buf->buf_id); > + close(fd); > + goto cleanup; > + } > + } > + } > + > + FIN_CRC32(crc); > + close(fd); > + > + if (!EQ_CRC32(BufferCacheHibernationData[id].crc, crc)) > + { > + elog(WARNING, > + "crc mismatch on the buffer cache hibernation file: %s", > + BufferCacheHibernationData[id].hibernation_file); > + close(fd); > + goto cleanup; > + } > + } > + > + /* > + * resume the buffer cache data structure from the hibernation files. 
> + */ > + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) > + { > + int fd; > + char *ptr; > + > + if (BufferCacheHibernationLevel < 2 && > + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) > + { > + continue; > + } > + > + record_length = BufferCacheHibernationData[id].record_length; > + num_records = BufferCacheHibernationData[id].num_records; > + > + if (id != BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY) > + { > + /* use the smaller number of buffers. */ > + num_records = (oldNBuffers < NBuffers)? oldNBuffers : NBuffers; > + } > + > + fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file, > + O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR); > + if (fd < 0) > + { > + if (BufferCacheHibernationLevel == 2 && > + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) > + { > + /* > + * if buffer_cache_hibernation_level changes 1 to 2, > + * the buffer block hibernation file may not exist. > + * just ignore it here. > + */ > + continue; > + } > + > + goto cleanup; > + } > + > + elog(NOTICE, > + "buffer cache resume from %s(%d bytes * %d records)", > + BufferCacheHibernationData[id].hibernation_file, > + record_length, num_records); > + > + for (i = 0; i < num_records; i++) > + { > + ptr = BufferCacheHibernationData[id].data_ptr + (i * record_length); > + read(fd, (void *)ptr, record_length); > + > + /* Re-lock the buffer descriptor if necessary. */ > + if (id == BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS) > + { > + BufferDesc *buf; > + > + buf = (BufferDesc *)ptr; > + if (IsUnlockBufHdr(buf)) > + { > + LockBufHdr(buf); > + } > + } > + } > + > + close(fd); > + > + if (id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) > + { > + buffer_block_processed = true; > + } > + } > + > + if (buffer_block_processed == false) > + { > + /* we didn't use the buffer block hibernation file, so delete it now. */ > + id = BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS; > + unlink(BufferCacheHibernationData[id].hibernation_file); > + } > + > + /* > + * set the rest data structures (eg. 
lookup hashtable) up > + * based on the buffer descriptors. > + */ > + num_records = (oldNBuffers < NBuffers)? oldNBuffers : NBuffers; > + for (i = 0; i < num_records; i++) > + { > + BufferDesc *buf; > + BufferTag newTag; > + uint32 newHash; > + int buf_id; > + > + buf = &BufferDescriptors[i]; > + if (buf->tag.rnode.spcNode == InvalidOid && > + buf->tag.rnode.dbNode == InvalidOid && > + buf->tag.rnode.relNode == InvalidOid) > + { > + continue; > + } > + > + INIT_BUFFERTAG(newTag, buf->tag.rnode, buf->tag.forkNum, buf->tag.blockNum); > + newHash = BufTableHashCode(&newTag); > + > + if (buffer_block_processed == false) > + { > + Block bufBlock; > + SMgrRelation smgr; > + > + /* > + * re-read buffer block. > + */ > + bufBlock = BufHdrGetBlock(buf); > + smgr = smgropen(buf->tag.rnode, InvalidBackendId); > + smgrread(smgr, newTag.forkNum, newTag.blockNum, (char *) bufBlock); > + } > + > + buf_id = BufTableInsert(&newTag, newHash, buf->buf_id); > + if (buf_id != -1) > + { > + /* the entry exists already, return it to the freelist. */ > + buf->refcount = 0; > + buf->flags = 0; > + InvalidateBuffer(buf); > + continue; > + } > + > + /* clear wait_backend_pid because the process was terminated already. */ > + buf->wait_backend_pid = 0; > + > +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION > + elog(DEBUG5, > + "resume [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d", > + buf->buf_id, buf->flags, buf->usage_count, buf->refcount, > + buf->wait_backend_pid, buf->freeNext, > + newHash, newTag.rnode.spcNode, > + newTag.rnode.dbNode, newTag.rnode.relNode, > + newTag.forkNum, newTag.blockNum); > +#endif > + } > + > + /* > + * adjust StrategyControl based on the change of shared_buffers. 
> + */ > + if (oldNBuffers != NBuffers) > + { > + AdjustStrategyControl(oldNBuffers); > + } > + > + elog(NOTICE, > + "buffer cache resumed successfully"); > + > +cleanup: > + for (i = 0; i < NBuffers; i++) > + { > + BufferDesc *buf; > + > + buf = &BufferDescriptors[i]; > + UnlockBufHdr(buf); > + } > + > + if (buf_common != NULL) > + { > + free(buf_common); > + } > + > + return; > +} > diff --git src/backend/storage/buffer/freelist.c src/backend/storage/buffer/freelist.c > index bf9903b..ffc101d 100644 > --- src/backend/storage/buffer/freelist.c > +++ src/backend/storage/buffer/freelist.c > @@ -347,6 +347,12 @@ StrategyInitialize(bool init) > } > else > Assert(!init); > + > + if (BufferCacheHibernationLevel > 0) > + { > + ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY, > + (char *)StrategyControl, sizeof(BufferStrategyControl), 1); > + } > } > > > @@ -521,3 +527,47 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, volatile BufferDesc *buf) > > return true; > } > + > +/* > + * AdjustStrategyControl -- adjust the member variables of StrategyControl > + * > + * If the shared_buffers setting had changed, restored StrategyControl > + * needs to be adjusted for in both cases of shrinking and enlarging. > + * This is called only from bufmgr.c:ResumeBufferCacheHibernation(). > + */ > +void > +AdjustStrategyControl(int oldNBuffers) > +{ > + if (oldNBuffers == NBuffers) > + { > + return; > + } > + > + /* enlarge or shrink the free buffer based on current NBuffers. */ > + StrategyControl->lastFreeBuffer = NBuffers - 1; > + > + /* shared_buffers shrunk. */ > + if (oldNBuffers > NBuffers) > + { > + if (StrategyControl->nextVictimBuffer >= NBuffers) > + { > + /* set the tail of buffers. */ > + StrategyControl->nextVictimBuffer = NBuffers - 1; > + } > + > + if (StrategyControl->firstFreeBuffer >= NBuffers) > + { > + /* set FREENEXT_END_OF_LIST(-1). 
*/ > + StrategyControl->firstFreeBuffer = FREENEXT_END_OF_LIST; > + } > + } > + else > + /* shared_buffers enlarged. */ > + { > + if (StrategyControl->firstFreeBuffer < 0) > + { > + /* set the next entry of the tail of old buffers. */ > + StrategyControl->firstFreeBuffer = oldNBuffers; > + } > + } > +} > diff --git src/backend/utils/misc/guc.c src/backend/utils/misc/guc.c > index 738e215..5affc6e 100644 > --- src/backend/utils/misc/guc.c > +++ src/backend/utils/misc/guc.c > @@ -2361,6 +2361,18 @@ static struct config_int ConfigureNamesInt[] = > NULL, NULL, NULL > }, > > + { > + {"buffer_cache_hibernation_level", PGC_POSTMASTER, UNGROUPED, > + gettext_noop("Sets buffer cache hibernation level."), > + gettext_noop("0 to disable(default), " > + "1 for saving buffer descriptors only(recommended), " > + "2 for saving buffer descriptors and buffer blocks(slower at shutdown).") > + }, > + &BufferCacheHibernationLevel, > + 0, 0, 2, > + NULL, NULL, NULL > + }, > + > /* End-of-list marker */ > { > {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL > diff --git src/backend/utils/misc/postgresql.conf.sample src/backend/utils/misc/postgresql.conf.sample > index b8a1582..44b6ff3 100644 > --- src/backend/utils/misc/postgresql.conf.sample > +++ src/backend/utils/misc/postgresql.conf.sample > @@ -119,6 +119,17 @@ > #maintenance_work_mem = 16MB # min 1MB > #max_stack_depth = 2MB # min 100kB > > + > +# Buffer Cache Hibernation: > +# Suspend/resume buffer cache data structure using hibernation files > +# at shutdown/startup. > +#buffer_cache_hibernation_level = 0 # Sets buffer cache hibernation level. > + # 0 to disable(default), > + # 1 for saving buffer descriptors only > + # (recommended), > + # 2 for saving buffer descriptors and > + # buffer blocks(slower at shutdown). 
> + > # - Kernel Resource Usage - > > #max_files_per_process = 1000 # min 25 > diff --git src/include/access/xlog.h src/include/access/xlog.h > index 7056fd6..7a9fb99 100644 > --- src/include/access/xlog.h > +++ src/include/access/xlog.h > @@ -13,6 +13,7 @@ > > #include "access/rmgr.h" > #include "access/xlogdefs.h" > +#include "catalog/pg_control.h" > #include "lib/stringinfo.h" > #include "storage/buf.h" > #include "utils/pg_crc.h" > @@ -294,6 +295,7 @@ extern bool XLogInsertAllowed(void); > extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream); > extern XLogRecPtr GetXLogReplayRecPtr(void); > > +extern bool GetControlFile(ControlFileData *controlFile); > extern void UpdateControlFile(void); > extern uint64 GetSystemIdentifier(void); > extern Size XLOGShmemSize(void); > diff --git src/include/storage/buf_internals.h src/include/storage/buf_internals.h > index b7d4ea5..d537ef1 100644 > --- src/include/storage/buf_internals.h > +++ src/include/storage/buf_internals.h > @@ -167,6 +167,7 @@ typedef struct sbufdesc > */ > #define LockBufHdr(bufHdr) SpinLockAcquire(&(bufHdr)->buf_hdr_lock) > #define UnlockBufHdr(bufHdr) SpinLockRelease(&(bufHdr)->buf_hdr_lock) > +#define IsUnlockBufHdr(bufHdr) SpinLockFree(&(bufHdr)->buf_hdr_lock) > > > /* in buf_init.c */ > @@ -190,6 +191,7 @@ extern bool StrategyRejectBuffer(BufferAccessStrategy strategy, > extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc); > extern Size StrategyShmemSize(void); > extern void StrategyInitialize(bool init); > +extern void AdjustStrategyControl(int oldNBuffers); > > /* buf_table.c */ > extern Size BufTableShmemSize(int size); > diff --git src/include/storage/bufmgr.h src/include/storage/bufmgr.h > index b8fc87e..ddfeb9d 100644 > --- src/include/storage/bufmgr.h > +++ src/include/storage/bufmgr.h > @@ -211,6 +211,20 @@ extern void BgBufferSync(void); > > extern void AtProcExit_LocalBuffers(void); > > +/* buffer cache hibernation support stuff */ > +extern int 
BufferCacheHibernationLevel; > + > +typedef enum BufferHibernationFileType > +{ > + BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY, > + BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS, > + BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS > +} BufferHibernationFileType; > + > +extern void ResisterBufferCacheHibernation(BufferHibernationFileType id, > + char *ptr, Size record_length, Size num_records); > +extern void ResumeBufferCacheHibernation(void); > + > /* in freelist.c */ > extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype); > extern void FreeAccessStrategy(BufferAccessStrategy strategy); -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
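For readers following the patch: the core pattern in AtProcExit_BufferCacheHibernation() and ResumeBufferCacheHibernation() is to write fixed-size records while accumulating a CRC, and at startup to re-read and validate the whole file before copying anything into place. Below is a minimal standalone sketch of that pattern, not the patch's code. It assumes stdio in place of BasicOpenFile()/read()/write(), a hand-rolled CRC-32 in place of the pg_crc32 macros, and the hypothetical names save_records() and load_records(). Unlike the resume loop in the patch, it also checks the result of every read in the load pass.

```c
/* Illustrative sketch only: hypothetical helpers, not PostgreSQL API. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320); stands in for
 * the INIT_CRC32/COMP_CRC32/FIN_CRC32 macros used by the patch. */
static uint32_t crc32_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Write nrec records of rec_len bytes and report their CRC.
 * Returns 0 on success; on failure removes the partial file, returns -1. */
int save_records(const char *path, const void *src,
                 size_t rec_len, size_t nrec, uint32_t *crc_out)
{
    assert(rec_len > 0);

    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < nrec; i++) {
        const unsigned char *rec = (const unsigned char *)src + i * rec_len;

        if (fwrite(rec, rec_len, 1, fp) != 1) {
            fclose(fp);
            remove(path);
            return -1;
        }
        crc = crc32_update(crc, rec, rec_len);
    }
    if (fclose(fp) != 0) {
        remove(path);
        return -1;
    }
    *crc_out = crc ^ 0xFFFFFFFFu;
    return 0;
}

/* Pass 1 validates size and CRC through a scratch buffer without
 * touching dst; pass 2 copies into dst only after validation succeeds.
 * Returns 0 on success, -1 on any mismatch or I/O error. */
int load_records(const char *path, void *dst,
                 size_t rec_len, size_t nrec, uint32_t expected_crc)
{
    assert(rec_len > 0);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    unsigned char *scratch = malloc(rec_len);
    if (scratch == NULL) {
        fclose(fp);
        return -1;
    }

    int rc = -1;
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;

    for (i = 0; i < nrec; i++) {
        if (fread(scratch, rec_len, 1, fp) != 1)
            goto done;              /* short file or read error */
        crc = crc32_update(crc, scratch, rec_len);
    }
    if (fgetc(fp) != EOF)
        goto done;                  /* trailing bytes: size mismatch */
    if ((crc ^ 0xFFFFFFFFu) != expected_crc)
        goto done;                  /* CRC mismatch */

    rewind(fp);
    for (i = 0; i < nrec; i++) {
        /* A failure here can leave dst partially overwritten. */
        if (fread((unsigned char *)dst + i * rec_len, rec_len, 1, fp) != 1)
            goto done;
    }
    rc = 0;

done:
    free(scratch);
    fclose(fp);
    return rc;
}
```

A caller would keep the CRC returned by save_records() somewhere durable (the patch uses a separate crc32 file) and pass it back to load_records() at startup, falling back to an empty cache whenever load_records() returns -1.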
2011/10/14 Bruce Momjian <bruce@momjian.us>: > > Should this be marked as TODO? I suppose TODO items *are* wanted, so working on one should spare people the pain of convincing everyone here to accept the feature, shouldn't it? > > --------------------------------------------------------------------------- > > Mitsuru IWASAKI wrote: >> Hi, >> >> > On 05/07/2011 03:32 AM, Mitsuru IWASAKI wrote: >> > > For 1, I've just finish my work. The latest patch is available at: >> > > http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch >> > > >> > >> > Reminder here--we can't accept code based on it being published to a web >> > page. You'll need to e-mail it to the pgsql-hackers mailing list to be >> > considered for the next PostgreSQL CommitFest, which is starting in a >> > few weeks. Code submitted to the mailing list is considered a release >> > of it to the project under the PostgreSQL license, which we can't just >> > assume for things when given only a URL to them. >> >> Sorry about that, but I had enough time to revise my patches this week-end. >> I attached the patches in this mail, and will update CommitFest page soon. >> >> > Also, you suggested you were out of time to work on this. If that's the >> > case, we'd like to know that so we don't keep cc'ing you about things in >> > expectation of an answer. Someone else may pick this up as a project to >> > continue working on. But it's going to need a fair amount of revision >> > before it matches what people want here, and I'm not sure how much of >> > what you've written is going to end up in any commit that may happen >> > from this idea. >> >> It seems that I don't have enough time to complete this work. >> You don't need to keep cc'ing me, and I'm very happy if postgres to be >> the first DBMS which support buffer cache hibernation feature. >> >> Thanks!
>> >> >> diff --git src/backend/access/transam/xlog.c src/backend/access/transam/xlog.c >> index b0e4c41..7a3a207 100644 >> --- src/backend/access/transam/xlog.c >> +++ src/backend/access/transam/xlog.c >> @@ -4834,6 +4834,19 @@ ReadControlFile(void) >> #endif >> } >> >> +bool >> +GetControlFile(ControlFileData *controlFile) >> +{ >> + if (ControlFile == NULL) >> + { >> + return false; >> + } >> + >> + memcpy(controlFile, ControlFile, sizeof(ControlFileData)); >> + >> + return true; >> +} >> + >> void >> UpdateControlFile(void) >> { >> diff --git src/backend/bootstrap/bootstrap.c src/backend/bootstrap/bootstrap.c >> index fc093cc..7ecf6bb 100644 >> --- src/backend/bootstrap/bootstrap.c >> +++ src/backend/bootstrap/bootstrap.c >> @@ -360,6 +360,15 @@ AuxiliaryProcessMain(int argc, char *argv[]) >> BaseInit(); >> >> /* >> + * Only StartupProcess can call ResumeBufferCacheHibernation() after >> + * InitFileAccess() and smgrinit(). >> + */ >> + if (auxType == StartupProcess && BufferCacheHibernationLevel > 0) >> + { >> + ResumeBufferCacheHibernation(); >> + } >> + >> + /* >> * When we are an auxiliary process, we aren't going to do the full >> * InitPostgres pushups, but there are a couple of things that need to get >> * lit up even in an auxiliary process. 
>> diff --git src/backend/storage/buffer/buf_init.c src/backend/storage/buffer/buf_init.c >> index dadb49d..52eb51a 100644 >> --- src/backend/storage/buffer/buf_init.c >> +++ src/backend/storage/buffer/buf_init.c >> @@ -127,6 +127,14 @@ InitBufferPool(void) >> >> /* Init other shared buffer-management stuff */ >> StrategyInitialize(!foundDescs); >> + >> + if (BufferCacheHibernationLevel > 0) >> + { >> + ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS, >> + (char *)BufferDescriptors, sizeof(BufferDesc), NBuffers); >> + ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS, >> + (char *)BufferBlocks, BLCKSZ, NBuffers); >> + } >> } >> >> /* >> diff --git src/backend/storage/buffer/bufmgr.c src/backend/storage/buffer/bufmgr.c >> index f96685d..dba8ebf 100644 >> --- src/backend/storage/buffer/bufmgr.c >> +++ src/backend/storage/buffer/bufmgr.c >> @@ -31,6 +31,7 @@ >> #include "postgres.h" >> >> #include <sys/file.h> >> +#include <sys/stat.h> >> #include <unistd.h> >> >> #include "catalog/catalog.h" >> @@ -61,6 +62,13 @@ >> #define BUF_WRITTEN 0x01 >> #define BUF_REUSABLE 0x02 >> >> +/* >> + * Buffer Cache Hibernation stuff. >> + */ >> +/* enable this to debug buffer cache hibernation. 
*/ >> +#if 0 >> +#define DEBUG_BUFFER_CACHE_HIBERNATION >> +#endif >> >> /* GUC variables */ >> bool zero_damaged_pages = false; >> @@ -765,6 +773,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, >> } >> } >> >> +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION >> + elog(DEBUG5, >> + "alloc [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d", >> + buf->buf_id, buf->flags, buf->usage_count, buf->refcount, >> + buf->wait_backend_pid, buf->freeNext, >> + newHash, newTag.rnode.spcNode, >> + newTag.rnode.dbNode, newTag.rnode.relNode, >> + newTag.forkNum, newTag.blockNum); >> +#endif >> + >> return buf; >> } >> >> @@ -800,6 +818,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, >> * the old content is no longer relevant. (The usage_count starts out at >> * 1 so that the buffer can survive one clock-sweep pass.) >> */ >> +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION >> + elog(DEBUG5, >> + "rename [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d", >> + buf->buf_id, buf->flags, buf->usage_count, buf->refcount, >> + buf->wait_backend_pid, buf->freeNext, >> + oldHash, oldTag.rnode.spcNode, >> + oldTag.rnode.dbNode, oldTag.rnode.relNode, >> + oldTag.forkNum, oldTag.blockNum); >> +#endif >> + >> buf->tag = newTag; >> buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT); >> if (relpersistence == RELPERSISTENCE_PERMANENT) >> @@ -2772,3 +2800,716 @@ local_buffer_write_error_callback(void *arg) >> pfree(path); >> } >> } >> + >> +/* ---------------------------------------------------------------- >> + * Buffer Cache Hibernation support stuff >> + * >> + * Suspend/resume buffer cache data structure using hibernation files >> + * at shutdown/startup. 
>> + * ---------------------------------------------------------------- >> + */ >> + >> +int BufferCacheHibernationLevel = 0; >> + >> +#define BUFFER_CACHE_HIBERNATION_FILE_STRATEGY "global/pg_buffer_cache_hibernation_strategy" >> +#define BUFFER_CACHE_HIBERNATION_FILE_DESCRIPTORS "global/pg_buffer_cache_hibernation_descriptors" >> +#define BUFFER_CACHE_HIBERNATION_FILE_BLOCKS "global/pg_buffer_cache_hibernation_blocks" >> +#define BUFFER_CACHE_HIBERNATION_FILE_CRC32 "global/pg_buffer_cache_hibernation_crc32" >> + >> +static struct >> +{ >> + char *hibernation_file; >> + char *data_ptr; >> + Size record_length; >> + Size num_records; >> + pg_crc32 crc; >> +} BufferCacheHibernationData[] = >> +{ >> + /* BufferStrategyControl */ >> + { >> + BUFFER_CACHE_HIBERNATION_FILE_STRATEGY, >> + NULL, 0, 0, 0 >> + }, >> + >> + /* BufferDescriptors */ >> + { >> + BUFFER_CACHE_HIBERNATION_FILE_DESCRIPTORS, >> + NULL, 0, 0, 0 >> + }, >> + >> + /* BufferBlocks */ >> + { >> + BUFFER_CACHE_HIBERNATION_FILE_BLOCKS, >> + NULL, 0, 0, 0 >> + }, >> + >> + /* End-of-list marker */ >> + { >> + NULL, >> + NULL, 0, 0, 0 >> + }, >> +}; >> + >> +static ControlFileData controlFile; >> +static bool controlFileInitialized = false; >> + >> +/* >> + * AtProcExit_BufferCacheHibernation: >> + * store the buffer cache into hibernation files at shutdown. >> + */ >> +static void >> +AtProcExit_BufferCacheHibernation(int code, Datum arg) >> +{ >> + BufferHibernationFileType id; >> + int i; >> + int fd; >> + >> + if (BufferCacheHibernationLevel == 0) >> + { >> + return; >> + } >> + >> + /* >> + * get the control file to check the system state validation. 
>> + */ >> + if (GetControlFile(&controlFile) == false) >> + { >> + elog(WARNING, >> + "could not get control file, " >> + "aborting buffer cache hibernation"); >> + return; >> + } >> + >> + if (controlFile.state != DB_SHUTDOWNED) >> + { >> + elog(WARNING, >> + "database system was not shut down normally, " >> + "aborting buffer cache hibernation"); >> + return; >> + } >> + >> + /* >> + * suspend buffer cache data structure into hibernation files. >> + */ >> + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) >> + { >> + Size record_length; >> + Size num_records; >> + char *ptr; >> + pg_crc32 crc; >> + >> + if (BufferCacheHibernationLevel < 2 && >> + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) >> + { >> + continue; >> + } >> + >> + if (BufferCacheHibernationData[id].data_ptr == NULL || >> + BufferCacheHibernationData[id].record_length == 0 || >> + BufferCacheHibernationData[id].num_records == 0) >> + { >> + elog(WARNING, >> + "ResisterBufferCacheHibernation() was not called for %s", >> + BufferCacheHibernationData[id].hibernation_file); >> + goto cleanup; >> + } >> + >> + fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file, >> + O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, S_IRUSR | S_IWUSR); >> + if (fd < 0) >> + { >> + elog(WARNING, >> + "could not open %s", >> + BufferCacheHibernationData[id].hibernation_file); >> + goto cleanup; >> + } >> + >> + record_length = BufferCacheHibernationData[id].record_length; >> + num_records = BufferCacheHibernationData[id].num_records; >> + >> + elog(NOTICE, >> + "buffer cache hibernate into %s", >> + BufferCacheHibernationData[id].hibernation_file); >> + >> + INIT_CRC32(crc); >> + for (i = 0; i < num_records; i++) >> + { >> + ptr = BufferCacheHibernationData[id].data_ptr + (i * record_length); >> + if (write(fd, (void *)ptr, record_length) != record_length) >> + { >> + elog(WARNING, >> + "could not write %s", >> + BufferCacheHibernationData[id].hibernation_file); >> + goto cleanup; >> + } 
>> + >> + COMP_CRC32(crc, ptr, record_length); >> + } >> + >> + FIN_CRC32(crc); >> + close(fd); >> + >> + BufferCacheHibernationData[id].crc = crc; >> + } >> + >> + /* >> + * save the computed crc values for the validations at resuming. >> + */ >> + fd = BasicOpenFile(BUFFER_CACHE_HIBERNATION_FILE_CRC32, >> + O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, S_IRUSR | S_IWUSR); >> + if (fd < 0) >> + { >> + elog(WARNING, >> + "could not open %s", >> + BUFFER_CACHE_HIBERNATION_FILE_CRC32); >> + goto cleanup; >> + } >> + >> + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) >> + { >> + pg_crc32 crc; >> + >> + if (BufferCacheHibernationLevel < 2 && >> + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) >> + { >> + continue; >> + } >> + >> + crc = BufferCacheHibernationData[id].crc; >> + if (write(fd, (void *)&crc, sizeof(pg_crc32)) != sizeof(pg_crc32)) >> + { >> + elog(WARNING, >> + "could not write %s for %s", >> + BUFFER_CACHE_HIBERNATION_FILE_CRC32, >> + BufferCacheHibernationData[id].hibernation_file); >> + goto cleanup; >> + } >> + } >> + close(fd); >> + >> + elog(NOTICE, >> + "buffer cache suspended successfully"); >> + >> + return; >> + >> +cleanup: >> + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) >> + { >> + unlink(BufferCacheHibernationData[id].hibernation_file); >> + } >> + >> + return; >> +} >> + >> +/* >> + * ResisterBufferCacheHibernation: >> + * register the buffer cache data structure info. 
>> + */ >> +void >> +ResisterBufferCacheHibernation(BufferHibernationFileType id, char *ptr, Size record_length, Size num_records) >> +{ >> + static bool first_time = true; >> + >> + if (BufferCacheHibernationLevel == 0) >> + { >> + return; >> + } >> + >> + if (id != BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY && >> + id != BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS && >> + id != BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) >> + { >> + return; >> + } >> + >> + if (first_time) >> + { >> + /* >> + * AtProcExit_BufferCacheHibernation to be called at shutdown. >> + */ >> + on_shmem_exit(AtProcExit_BufferCacheHibernation, 0); >> + first_time = false; >> + } >> + >> + /* >> + * get the control file to check the system state and >> + * hibernation file validations. >> + */ >> + if (controlFileInitialized == false) >> + { >> + if (GetControlFile(&controlFile) == true) >> + { >> + controlFileInitialized = true; >> + } >> + } >> + >> + BufferCacheHibernationData[id].data_ptr = ptr; >> + BufferCacheHibernationData[id].record_length = record_length; >> + BufferCacheHibernationData[id].num_records = num_records; >> +} >> + >> +/* >> + * ResumeBufferCacheHibernation: >> + * resume the buffer cache from hibernation file at startup. >> + */ >> +void >> +ResumeBufferCacheHibernation(void) >> +{ >> + BufferHibernationFileType id; >> + int i; >> + int fd; >> + Size num_records; >> + Size record_length; >> + char *buf_common; >> + int oldNBuffers; >> + bool buffer_block_processed; >> + >> + if (BufferCacheHibernationLevel == 0) >> + { >> + return; >> + } >> + >> + buf_common = NULL; >> + buffer_block_processed = false; >> + >> + /* >> + * lock all buffer descriptors to prevent other processes from >> + * updating buffers. >> + */ >> + for (i = 0; i < NBuffers; i++) >> + { >> + BufferDesc *buf; >> + >> + buf = &BufferDescriptors[i]; >> + LockBufHdr(buf); >> + } >> + >> + /* >> + * get the control file to check the system state and >> + * hibernation file validations. 
>> + */ >> + if (controlFileInitialized == false) >> + { >> + elog(WARNING, >> + "could not get control file, " >> + "aborting buffer cache hibernation"); >> + goto cleanup; >> + } >> + >> + if (controlFile.state != DB_SHUTDOWNED) >> + { >> + elog(WARNING, >> + "database system was not shut down normally, " >> + "aborting buffer cache hibernation"); >> + goto cleanup; >> + } >> + >> + /* >> + * read the crc values which was computed when the hibernation >> + * files were created. >> + */ >> + fd = BasicOpenFile(BUFFER_CACHE_HIBERNATION_FILE_CRC32, >> + O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR); >> + if (fd < 0) >> + { >> + elog(WARNING, >> + "could not open %s", >> + BUFFER_CACHE_HIBERNATION_FILE_CRC32); >> + goto cleanup; >> + } >> + >> + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) >> + { >> + pg_crc32 crc; >> + >> + if (BufferCacheHibernationLevel < 2 && >> + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) >> + { >> + continue; >> + } >> + >> + if (read(fd, (void *)&crc, sizeof(pg_crc32)) != sizeof(pg_crc32)) >> + { >> + if (BufferCacheHibernationLevel == 2 && >> + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) >> + { >> + /* >> + * if buffer_cache_hibernation_level changes 1 to 2, >> + * the crc value of buffer block hibernation file may not exist. >> + * just ignore it here. >> + */ >> + continue; >> + } >> + >> + elog(WARNING, >> + "could not read %s for %s", >> + BUFFER_CACHE_HIBERNATION_FILE_CRC32, >> + BufferCacheHibernationData[id].hibernation_file); >> + close(fd); >> + goto cleanup; >> + } >> + BufferCacheHibernationData[id].crc = crc; >> + } >> + >> + close(fd); >> + >> + /* >> + * allocate a buffer to read the contents of the hibernation files >> + * for validations. 
>> + */ >> + record_length = 0; >> + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) >> + { >> + if (record_length < BufferCacheHibernationData[id].record_length) >> + { >> + record_length = BufferCacheHibernationData[id].record_length; >> + } >> + } >> + >> + buf_common = malloc(record_length); >> + Assert(buf_common != NULL); >> + >> + /* assume that the number of buffers have not changed. */ >> + oldNBuffers = NBuffers; >> + >> + /* >> + * check if all hibernation files are valid. >> + */ >> + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) >> + { >> + struct stat sb; >> + pg_crc32 crc; >> + >> + if (BufferCacheHibernationLevel < 2 && >> + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) >> + { >> + continue; >> + } >> + >> + if (BufferCacheHibernationData[id].data_ptr == NULL || >> + BufferCacheHibernationData[id].record_length == 0 || >> + BufferCacheHibernationData[id].num_records == 0) >> + { >> + elog(WARNING, >> + "ResisterBufferCacheHibernation() was not called for %s", >> + BufferCacheHibernationData[id].hibernation_file); >> + goto cleanup; >> + } >> + >> + fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file, >> + O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR); >> + if (fd < 0) >> + { >> + if (BufferCacheHibernationLevel == 2 && >> + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) >> + { >> + /* >> + * if buffer_cache_hibernation_level changes 1 to 2, >> + * the buffer block hibernation file may not exist. >> + * just ignore it here. 
>> + */ >> + continue; >> + } >> + >> + goto cleanup; >> + } >> + >> + if (fstat(fd, &sb) < 0) >> + { >> + elog(WARNING, >> + "could not get stats of the buffer cache hibernation file: %s", >> + BufferCacheHibernationData[id].hibernation_file); >> + close(fd); >> + goto cleanup; >> + } >> + >> + record_length = BufferCacheHibernationData[id].record_length; >> + num_records = BufferCacheHibernationData[id].num_records; >> + >> + if (sb.st_size != (record_length * num_records)) >> + { >> + /* The size of StrategyControl should be the same always. */ >> + if (id == BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY || >> + (sb.st_size % record_length) > 0) >> + { >> + elog(WARNING, >> + "size mismatch on the buffer cache hibernation file: %s", >> + BufferCacheHibernationData[id].hibernation_file); >> + close(fd); >> + goto cleanup; >> + } >> + >> + /* >> + * The number of records of buffer descriptors and blocks >> + * should be the same. >> + */ >> + if (oldNBuffers != NBuffers && >> + oldNBuffers != (sb.st_size / record_length)) >> + { >> + elog(WARNING, >> + "size mismatch on the buffer cache hibernation file: %s", >> + BufferCacheHibernationData[id].hibernation_file); >> + close(fd); >> + goto cleanup; >> + } >> + >> + oldNBuffers = sb.st_size / record_length; >> + >> + elog(NOTICE, >> + "shared_buffers have changed from %d to %d: %s", >> + oldNBuffers, NBuffers, >> + BufferCacheHibernationData[id].hibernation_file); >> + >> + /* use the original size to compute CRC of the hibernation file. 
*/ >> + num_records = oldNBuffers; >> + } >> + >> + if ((pg_time_t)sb.st_mtime < controlFile.time) >> + { >> + elog(WARNING, >> + "the hibernation file is older than control file: %s", >> + BufferCacheHibernationData[id].hibernation_file); >> + close(fd); >> + goto cleanup; >> + } >> + >> + INIT_CRC32(crc); >> + for (i = 0; i < num_records; i++) >> + { >> + if (read(fd, (void *)buf_common, record_length) != record_length) >> + { >> + elog(WARNING, >> + "could not read the buffer cache hibernation file: %s", >> + BufferCacheHibernationData[id].hibernation_file); >> + close(fd); >> + goto cleanup; >> + } >> + >> + COMP_CRC32(crc, buf_common, record_length); >> + >> + /* >> + * buffer descriptors validations. >> + */ >> + if (id == BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS) >> + { >> + BufferDesc *buf; >> + BufFlags abnormal_flags; >> + >> + if (i >= NBuffers) >> + { >> + continue; >> + } >> + >> + abnormal_flags = (BM_DIRTY | BM_IO_IN_PROGRESS | BM_IO_ERROR | >> + BM_JUST_DIRTIED | BM_PIN_COUNT_WAITER); >> + >> + buf = (BufferDesc *)buf_common; >> + >> + if (buf->flags & abnormal_flags) >> + { >> + elog(WARNING, >> + "abnormal flags in buffer descriptors: %d", >> + buf->flags); >> + close(fd); >> + goto cleanup; >> + } >> + >> + if (buf->usage_count > BM_MAX_USAGE_COUNT) >> + { >> + elog(WARNING, >> + "invalid usage count in buffer descriptors: %d", >> + buf->usage_count); >> + close(fd); >> + goto cleanup; >> + } >> + >> + if (buf->buf_id < 0 || buf->buf_id >= num_records) >> + { >> + elog(WARNING, >> + "invalid buffer id in buffer descriptors: %d", >> + buf->buf_id); >> + close(fd); >> + goto cleanup; >> + } >> + } >> + } >> + >> + FIN_CRC32(crc); >> + close(fd); >> + >> + if (!EQ_CRC32(BufferCacheHibernationData[id].crc, crc)) >> + { >> + elog(WARNING, >> + "crc mismatch on the buffer cache hibernation file: %s", >> + BufferCacheHibernationData[id].hibernation_file); >> + close(fd); >> + goto cleanup; >> + } >> + } >> + >> + /* >> + * resume the buffer cache data 
structure from the hibernation files. >> + */ >> + for (id = 0; BufferCacheHibernationData[id].hibernation_file != NULL; id++) >> + { >> + int fd; >> + char *ptr; >> + >> + if (BufferCacheHibernationLevel < 2 && >> + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) >> + { >> + continue; >> + } >> + >> + record_length = BufferCacheHibernationData[id].record_length; >> + num_records = BufferCacheHibernationData[id].num_records; >> + >> + if (id != BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY) >> + { >> + /* use the smaller number of buffers. */ >> + num_records = (oldNBuffers < NBuffers)? oldNBuffers : NBuffers; >> + } >> + >> + fd = BasicOpenFile(BufferCacheHibernationData[id].hibernation_file, >> + O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR); >> + if (fd < 0) >> + { >> + if (BufferCacheHibernationLevel == 2 && >> + id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) >> + { >> + /* >> + * if buffer_cache_hibernation_level changes 1 to 2, >> + * the buffer block hibernation file may not exist. >> + * just ignore it here. >> + */ >> + continue; >> + } >> + >> + goto cleanup; >> + } >> + >> + elog(NOTICE, >> + "buffer cache resume from %s(%d bytes * %d records)", >> + BufferCacheHibernationData[id].hibernation_file, >> + record_length, num_records); >> + >> + for (i = 0; i < num_records; i++) >> + { >> + ptr = BufferCacheHibernationData[id].data_ptr + (i * record_length); >> + read(fd, (void *)ptr, record_length); >> + >> + /* Re-lock the buffer descriptor if necessary. */ >> + if (id == BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS) >> + { >> + BufferDesc *buf; >> + >> + buf = (BufferDesc *)ptr; >> + if (IsUnlockBufHdr(buf)) >> + { >> + LockBufHdr(buf); >> + } >> + } >> + } >> + >> + close(fd); >> + >> + if (id == BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS) >> + { >> + buffer_block_processed = true; >> + } >> + } >> + >> + if (buffer_block_processed == false) >> + { >> + /* we didn't use the buffer block hibernation file, so delete it now. 
*/ >> + id = BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS; >> + unlink(BufferCacheHibernationData[id].hibernation_file); >> + } >> + >> + /* >> + * set the rest data structures (eg. lookup hashtable) up >> + * based on the buffer descriptors. >> + */ >> + num_records = (oldNBuffers < NBuffers)? oldNBuffers : NBuffers; >> + for (i = 0; i < num_records; i++) >> + { >> + BufferDesc *buf; >> + BufferTag newTag; >> + uint32 newHash; >> + int buf_id; >> + >> + buf = &BufferDescriptors[i]; >> + if (buf->tag.rnode.spcNode == InvalidOid && >> + buf->tag.rnode.dbNode == InvalidOid && >> + buf->tag.rnode.relNode == InvalidOid) >> + { >> + continue; >> + } >> + >> + INIT_BUFFERTAG(newTag, buf->tag.rnode, buf->tag.forkNum, buf->tag.blockNum); >> + newHash = BufTableHashCode(&newTag); >> + >> + if (buffer_block_processed == false) >> + { >> + Block bufBlock; >> + SMgrRelation smgr; >> + >> + /* >> + * re-read buffer block. >> + */ >> + bufBlock = BufHdrGetBlock(buf); >> + smgr = smgropen(buf->tag.rnode, InvalidBackendId); >> + smgrread(smgr, newTag.forkNum, newTag.blockNum, (char *) bufBlock); >> + } >> + >> + buf_id = BufTableInsert(&newTag, newHash, buf->buf_id); >> + if (buf_id != -1) >> + { >> + /* the entry exists already, return it to the freelist. */ >> + buf->refcount = 0; >> + buf->flags = 0; >> + InvalidateBuffer(buf); >> + continue; >> + } >> + >> + /* clear wait_backend_pid because the process was terminated already. */ >> + buf->wait_backend_pid = 0; >> + >> +#ifdef DEBUG_BUFFER_CACHE_HIBERNATION >> + elog(DEBUG5, >> + "resume [%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d", >> + buf->buf_id, buf->flags, buf->usage_count, buf->refcount, >> + buf->wait_backend_pid, buf->freeNext, >> + newHash, newTag.rnode.spcNode, >> + newTag.rnode.dbNode, newTag.rnode.relNode, >> + newTag.forkNum, newTag.blockNum); >> +#endif >> + } >> + >> + /* >> + * adjust StrategyControl based on the change of shared_buffers. 
>> + */ >> + if (oldNBuffers != NBuffers) >> + { >> + AdjustStrategyControl(oldNBuffers); >> + } >> + >> + elog(NOTICE, >> + "buffer cache resumed successfully"); >> + >> +cleanup: >> + for (i = 0; i < NBuffers; i++) >> + { >> + BufferDesc *buf; >> + >> + buf = &BufferDescriptors[i]; >> + UnlockBufHdr(buf); >> + } >> + >> + if (buf_common != NULL) >> + { >> + free(buf_common); >> + } >> + >> + return; >> +} >> diff --git src/backend/storage/buffer/freelist.c src/backend/storage/buffer/freelist.c >> index bf9903b..ffc101d 100644 >> --- src/backend/storage/buffer/freelist.c >> +++ src/backend/storage/buffer/freelist.c >> @@ -347,6 +347,12 @@ StrategyInitialize(bool init) >> } >> else >> Assert(!init); >> + >> + if (BufferCacheHibernationLevel > 0) >> + { >> + ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY, >> + (char *)StrategyControl, sizeof(BufferStrategyControl), 1); >> + } >> } >> >> >> @@ -521,3 +527,47 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, volatile BufferDesc *buf) >> >> return true; >> } >> + >> +/* >> + * AdjustStrategyControl -- adjust the member variables of StrategyControl >> + * >> + * If the shared_buffers setting had changed, restored StrategyControl >> + * needs to be adjusted for in both cases of shrinking and enlarging. >> + * This is called only from bufmgr.c:ResumeBufferCacheHibernation(). >> + */ >> +void >> +AdjustStrategyControl(int oldNBuffers) >> +{ >> + if (oldNBuffers == NBuffers) >> + { >> + return; >> + } >> + >> + /* enlarge or shrink the free buffer based on current NBuffers. */ >> + StrategyControl->lastFreeBuffer = NBuffers - 1; >> + >> + /* shared_buffers shrunk. */ >> + if (oldNBuffers > NBuffers) >> + { >> + if (StrategyControl->nextVictimBuffer >= NBuffers) >> + { >> + /* set the tail of buffers. */ >> + StrategyControl->nextVictimBuffer = NBuffers - 1; >> + } >> + >> + if (StrategyControl->firstFreeBuffer >= NBuffers) >> + { >> + /* set FREENEXT_END_OF_LIST(-1). 
*/ >> + StrategyControl->firstFreeBuffer = FREENEXT_END_OF_LIST; >> + } >> + } >> + else >> + /* shared_buffers enlarged. */ >> + { >> + if (StrategyControl->firstFreeBuffer < 0) >> + { >> + /* set the next entry of the tail of old buffers. */ >> + StrategyControl->firstFreeBuffer = oldNBuffers; >> + } >> + } >> +} >> diff --git src/backend/utils/misc/guc.c src/backend/utils/misc/guc.c >> index 738e215..5affc6e 100644 >> --- src/backend/utils/misc/guc.c >> +++ src/backend/utils/misc/guc.c >> @@ -2361,6 +2361,18 @@ static struct config_int ConfigureNamesInt[] = >> NULL, NULL, NULL >> }, >> >> + { >> + {"buffer_cache_hibernation_level", PGC_POSTMASTER, UNGROUPED, >> + gettext_noop("Sets buffer cache hibernation level."), >> + gettext_noop("0 to disable(default), " >> + "1 for saving buffer descriptors only(recommended), " >> + "2 for saving buffer descriptors and buffer blocks(slower at shutdown).") >> + }, >> + &BufferCacheHibernationLevel, >> + 0, 0, 2, >> + NULL, NULL, NULL >> + }, >> + >> /* End-of-list marker */ >> { >> {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL >> diff --git src/backend/utils/misc/postgresql.conf.sample src/backend/utils/misc/postgresql.conf.sample >> index b8a1582..44b6ff3 100644 >> --- src/backend/utils/misc/postgresql.conf.sample >> +++ src/backend/utils/misc/postgresql.conf.sample >> @@ -119,6 +119,17 @@ >> #maintenance_work_mem = 16MB # min 1MB >> #max_stack_depth = 2MB # min 100kB >> >> + >> +# Buffer Cache Hibernation: >> +# Suspend/resume buffer cache data structure using hibernation files >> +# at shutdown/startup. >> +#buffer_cache_hibernation_level = 0 # Sets buffer cache hibernation level. >> + # 0 to disable(default), >> + # 1 for saving buffer descriptors only >> + # (recommended), >> + # 2 for saving buffer descriptors and >> + # buffer blocks(slower at shutdown). 
>> + >> # - Kernel Resource Usage - >> >> #max_files_per_process = 1000 # min 25 >> diff --git src/include/access/xlog.h src/include/access/xlog.h >> index 7056fd6..7a9fb99 100644 >> --- src/include/access/xlog.h >> +++ src/include/access/xlog.h >> @@ -13,6 +13,7 @@ >> >> #include "access/rmgr.h" >> #include "access/xlogdefs.h" >> +#include "catalog/pg_control.h" >> #include "lib/stringinfo.h" >> #include "storage/buf.h" >> #include "utils/pg_crc.h" >> @@ -294,6 +295,7 @@ extern bool XLogInsertAllowed(void); >> extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream); >> extern XLogRecPtr GetXLogReplayRecPtr(void); >> >> +extern bool GetControlFile(ControlFileData *controlFile); >> extern void UpdateControlFile(void); >> extern uint64 GetSystemIdentifier(void); >> extern Size XLOGShmemSize(void); >> diff --git src/include/storage/buf_internals.h src/include/storage/buf_internals.h >> index b7d4ea5..d537ef1 100644 >> --- src/include/storage/buf_internals.h >> +++ src/include/storage/buf_internals.h >> @@ -167,6 +167,7 @@ typedef struct sbufdesc >> */ >> #define LockBufHdr(bufHdr) SpinLockAcquire(&(bufHdr)->buf_hdr_lock) >> #define UnlockBufHdr(bufHdr) SpinLockRelease(&(bufHdr)->buf_hdr_lock) >> +#define IsUnlockBufHdr(bufHdr) SpinLockFree(&(bufHdr)->buf_hdr_lock) >> >> >> /* in buf_init.c */ >> @@ -190,6 +191,7 @@ extern bool StrategyRejectBuffer(BufferAccessStrategy strategy, >> extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc); >> extern Size StrategyShmemSize(void); >> extern void StrategyInitialize(bool init); >> +extern void AdjustStrategyControl(int oldNBuffers); >> >> /* buf_table.c */ >> extern Size BufTableShmemSize(int size); >> diff --git src/include/storage/bufmgr.h src/include/storage/bufmgr.h >> index b8fc87e..ddfeb9d 100644 >> --- src/include/storage/bufmgr.h >> +++ src/include/storage/bufmgr.h >> @@ -211,6 +211,20 @@ extern void BgBufferSync(void); >> >> extern void AtProcExit_LocalBuffers(void); >> >> +/* 
buffer cache hibernation support stuff */ >> +extern int BufferCacheHibernationLevel; >> + >> +typedef enum BufferHibernationFileType >> +{ >> + BUFFER_CACHE_HIBERNATION_TYPE_STRATEGY, >> + BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS, >> + BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS >> +} BufferHibernationFileType; >> + >> +extern void ResisterBufferCacheHibernation(BufferHibernationFileType id, >> + char *ptr, Size record_length, Size num_records); >> +extern void ResumeBufferCacheHibernation(void); >> + >> /* in freelist.c */ >> extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype); >> extern void FreeAccessStrategy(BufferAccessStrategy strategy); > > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://enterprisedb.com > > + It's impossible for everything to be true. + > -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation
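For readers following the quoted patch, the shrink/enlarge handling in AdjustStrategyControl() can be looked at in isolation. The following is only a standalone sketch of that logic, not the patch's code: the struct `StrategyState` and the function `adjust_strategy` are hypothetical stand-ins for the three StrategyControl fields the patch touches, and the buffer counts are passed as plain arguments instead of being read from NBuffers.

```c
#include <assert.h>

#define FREENEXT_END_OF_LIST (-1)

/* Hypothetical stand-in for the StrategyControl fields the patch adjusts. */
typedef struct StrategyState
{
    int nextVictimBuffer;   /* clock-sweep hand */
    int firstFreeBuffer;    /* head of the free list, or -1 if empty */
    int lastFreeBuffer;     /* tail of the free list */
} StrategyState;

/*
 * Mirror of the quoted AdjustStrategyControl() logic, with both buffer
 * counts passed explicitly (an assumption made for testability).
 */
static void
adjust_strategy(StrategyState *s, int old_nbuffers, int new_nbuffers)
{
    assert(old_nbuffers > 0 && new_nbuffers > 0);

    if (old_nbuffers == new_nbuffers)
        return;

    /* The tail always follows the current number of buffers. */
    s->lastFreeBuffer = new_nbuffers - 1;

    if (old_nbuffers > new_nbuffers)
    {
        /* Shrunk: pull any index that now points past the end back in. */
        if (s->nextVictimBuffer >= new_nbuffers)
            s->nextVictimBuffer = new_nbuffers - 1;
        if (s->firstFreeBuffer >= new_nbuffers)
            s->firstFreeBuffer = FREENEXT_END_OF_LIST;
    }
    else
    {
        /* Enlarged: if the free list was empty, start it at the new slots. */
        if (s->firstFreeBuffer < 0)
            s->firstFreeBuffer = old_nbuffers;
    }
}
```

Calling `adjust_strategy(&s, 1000, 512)` on a state whose indexes all lie above 511 clamps the victim pointer to 511 and empties the free list; going from 100 to 200 buffers with an empty free list makes slot 100 its new head.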
On 14.10.2011 11:44, Cédric Villemain wrote: > 2011/10/14 Bruce Momjian<bruce@momjian.us>: >> >> Should this be marked as TODO? > > I suppose TODO items *are* wanted and so working on them should remove > the pain to convince people here to accept the feature, aren't they ? I don't think this is worthwhile to have in the backend. Someone could write it as an extension on pgfoundry, but I don't think that belongs on the TODO. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Cédric Villemain <cedric.villemain.debian@gmail.com> writes: > 2011/10/14 Bruce Momjian <bruce@momjian.us>: >> Should this be marked as TODO? > I suppose TODO items *are* wanted and so working on them should remove > the pain to convince people here to accept the feature, aren't they ? There is plenty of stuff in the TODO list for which there is no consensus. regards, tom lane
Tom Lane wrote: > Cédric Villemain <cedric.villemain.debian@gmail.com> writes: > > 2011/10/14 Bruce Momjian <bruce@momjian.us>: > >> Should this be marked as TODO? > > > I suppose TODO items *are* wanted and so working on them should remove > > the pain to convince people here to accept the feature, aren't they ? > > There is plenty of stuff in the TODO list for which there is no > consensus. Uh, we should probably remove those then. Can you think of any? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Excerpts from Bruce Momjian's message of vie oct 14 11:56:22 -0300 2011: > Tom Lane wrote: > > Cédric Villemain <cedric.villemain.debian@gmail.com> writes: > > > 2011/10/14 Bruce Momjian <bruce@momjian.us>: > > >> Should this be marked as TODO? > > > > > I suppose TODO items *are* wanted and so working on them should remove > > > the pain to convince people here to accept the feature, aren't they ? > > > > There is plenty of stuff in the TODO list for which there is no > > consensus. > > Uh, we should probably remove those then. Can you think of any? The guideline, last I checked, was that before getting into coding any item from the TODO list, the prospective hacker should check previous discussions and initiate a new one on this list to ensure consensus. Unless something is blatantly "not wanted", I don't think it should be removed from the TODO list. There not being consensus does not mean that there cannot ever be. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > > Excerpts from Bruce Momjian's message of vie oct 14 11:56:22 -0300 2011: > > Tom Lane wrote: > > > Cédric Villemain <cedric.villemain.debian@gmail.com> writes: > > > > 2011/10/14 Bruce Momjian <bruce@momjian.us>: > > > >> Should this be marked as TODO? > > > > > > > I suppose TODO items *are* wanted and so working on them should remove > > > > the pain to convince people here to accept the feature, aren't they ? > > > > > > There is plenty of stuff in the TODO list for which there is no > > > consensus. > > > > Uh, we should probably remove those then. Can you think of any? > > The guideline, last I checked, was that before getting into coding any > item from the TODO list, the prospective hacker should check previous > discussions and initiate a new one on this list to ensure consensus. > Unless something is blatantly "not wanted", I don't think it should be > removed from the TODO list. There not being consensus does not mean > that there cannot ever be. OK. But if we are pretty sure we don't want something, e.g. hibernate, we shouldn't add it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Fri, Oct 14, 2011 at 11:12 AM, Bruce Momjian <bruce@momjian.us> wrote: > OK. But if we are pretty sure we don't want something, e.g. hibernate, > we shouldn't add it. Fair enough, but I'm not even slightly sure that we don't want that. I think having prewarming utilities available as contrib modules or on PGXN would be useful, but integrating something into the backend would allow it to be far more automated. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Alvaro Herrera <alvherre@commandprompt.com> writes: > Excerpts from Bruce Momjian's message of vie oct 14 11:56:22 -0300 2011: >> Tom Lane wrote: >>> There is plenty of stuff in the TODO list for which there is no >>> consensus. >> Uh, we should probably remove those then. Can you think of any? > Unless something is blatantly "not wanted", I don't think it should be > removed from the TODO list. There not being consensus does not mean > that there cannot ever be. Yeah. The reason why something is on the TODO list (and not already done) is typically one of 1. It's too hard, or too long/boring for the expected value. 2. There's no consensus about how to implement the feature. 3. There's no consensus about the user-visible design of the feature. Cases where there's debate about whether we want it at all seem to me to be a subset of #3. But for anything in #3, someone could do the legwork or have the bright idea needed to create consensus about how to design the feature. My gripe about the TODO list is not that we have some stuff in there that's not clearly wanted, it's that some of the entries fail to make it clear where the issue stands on this scale. That could lead people to waste time trying to code something that there's not consensus for the design or implementation of. regards, tom lane
Excerpts from Bruce Momjian's message of vie oct 14 12:12:22 -0300 2011: > > Alvaro Herrera wrote: > > The guideline, last I checked, was that before getting into coding any > > item from the TODO list, the prospective hacker should check previous > > discussions and initiate a new one on this list to ensure consensus. > > Unless something is blatantly "not wanted", I don't think it should be > > removed from the TODO list. There not being consensus does not mean > > that there cannot ever be. > > OK. But if we are pretty sure we don't want something, e.g. hibernate, > we shouldn't add it. If we're so sure we don't want it, we could add it to the "features we do not want" section. But as Robert says downthread, I don't see us being so sure that we don't want hibernation. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 14, 2011 at 11:12 AM, Bruce Momjian <bruce@momjian.us> wrote: >> OK. But if we are pretty sure we don't want something, e.g. hibernate, >> we shouldn't add it. > Fair enough, but I'm not even slightly sure that we don't want that. > I think having prewarming utilities available as contrib modules or on > PGXN would be useful, but integrating something into the backend would > allow it to be far more automated. Right. I think this one falls into my class #2, ie, we have no idea how to implement it usefully. Doesn't (necessarily) mean that the core concept is without merit. regards, tom lane
Alvaro Herrera wrote: > > Excerpts from Bruce Momjian's message of vie oct 14 12:12:22 -0300 2011: > > > > Alvaro Herrera wrote: > > > > The guideline, last I checked, was that before getting into coding any > > > item from the TODO list, the prospective hacker should check previous > > > discussions and initiate a new one on this list to ensure consensus. > > > Unless something is blatantly "not wanted", I don't think it should be > > > removed from the TODO list. There not being consensus does not mean > > > that there cannot ever be. > > > > OK. But if we are pretty sure we don't want something, e.g. hibernate, > > we shouldn't add it. > > If we're so sure we don't want it, we could add it to the "features we > do not want" section. But as Robert says downthread, I don't see us Those are for features that people often ask for, and we don't want. I am sure there are a lot of things we don't want. > being so sure that we don't want hibernation. So, add it? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Fri, Oct 14, 2011 at 4:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Right. I think this one falls into my class #2, ie, we have no idea how > to implement it usefully. Doesn't (necessarily) mean that the core > concept is without merit. Hm. Given that we have an implementation I wouldn't say we have *no* clue. But there are certainly some parts we don't have consensus on yet. But then working code sometimes trumps a lack of absolute consensus. But just for the sake of argument I'm not sure that the implementation of dumping the current contents of the buffer cache is actually optimal. It doesn't handle resizing the buffer cache after a restart, for example, which I think would be a significant case. There could be other buffer cache algorithm parameters users might change -- though I don't think we really have any currently. If we had --to take it to an extreme-- a record of every buffer request prior to the shutdown then we could replay that log virtually with the new buffer cache size and know what buffers the new buffer cache size would have had in it. I'm not sure if there's any way to gather that data efficiently, and if we could, if there's any way to bound the amount of data we would have to retain to anything less than nigh-infinite volumes, and if we could, if there's any way to limit what has to be replayed on restart. But my point is that there may be other more general options than snapshotting the actual buffer cache of the system shutting down. -- greg
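The replay idea in the message above can be made concrete with a toy simulation. This is a minimal sketch under loud assumptions: it uses plain LRU replacement rather than PostgreSQL's clock-sweep, blocks are identified by a single `long` instead of a full buffer tag, and every name (`SimCache`, `sim_init`, `sim_request`, `sim_replay`, `sim_contains`) is hypothetical. It only shows that replaying the same request trace against a different cache size yields a different final cache content.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_MAX_SLOTS 1024

/* Hypothetical simulated cache: one entry per buffer slot. */
typedef struct SimCache
{
    int  nslots;                    /* simulated cache size, in blocks */
    int  used;                      /* slots currently occupied */
    long block[SIM_MAX_SLOTS];      /* block id held by each slot */
    long last_use[SIM_MAX_SLOTS];   /* logical time of the latest request */
    long clock;                     /* logical time, one tick per request */
} SimCache;

static void
sim_init(SimCache *c, int nslots)
{
    assert(nslots > 0 && nslots <= SIM_MAX_SLOTS);
    c->nslots = nslots;
    c->used = 0;
    c->clock = 0;
}

/* Replay one request; returns 1 on a hit, 0 on a miss. */
static int
sim_request(SimCache *c, long blk)
{
    int i;
    int victim = 0;

    c->clock++;
    for (i = 0; i < c->used; i++)
    {
        if (c->block[i] == blk)
        {
            c->last_use[i] = c->clock;
            return 1;
        }
    }

    if (c->used < c->nslots)
        victim = c->used++;         /* still room: take the next empty slot */
    else
    {
        /* full: evict the least recently used slot (LRU, an assumption) */
        for (i = 1; i < c->used; i++)
            if (c->last_use[i] < c->last_use[victim])
                victim = i;
    }

    c->block[victim] = blk;
    c->last_use[victim] = c->clock;
    return 0;
}

/* Replay a whole trace; returns the number of hits. */
static int
sim_replay(SimCache *c, const long *trace, size_t n)
{
    size_t i;
    int hits = 0;

    for (i = 0; i < n; i++)
        hits += sim_request(c, trace[i]);
    return hits;
}

/* Would this block be cached after the replay? */
static int
sim_contains(const SimCache *c, long blk)
{
    int i;

    for (i = 0; i < c->used; i++)
        if (c->block[i] == blk)
            return 1;
    return 0;
}
```

Replaying the trace 1, 2, 3, 1, 4, 2 with two slots ends with blocks 4 and 2 cached; with three slots it ends with 1, 4 and 2. The linear scans keep the sketch short; the cost concerns raised in the message (gathering, bounding and replaying the log) are not addressed here.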
Greg Stark <stark@mit.edu> writes: > On Fri, Oct 14, 2011 at 4:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Right. I think this one falls into my class #2, ie, we have no idea how >> to implement it usefully. Doesn't (necessarily) mean that the core >> concept is without merit. > Hm. given that we have an implementation I wouldn't say we have *no* > clue. But there are certainly some parts we don't have consensus yet > on. But then working code sometimes trumps a lack of absolute > consensus. In this context "working" means "shows a significant performance benefit", and IIRC we don't have a demonstration of that. Anyway this was all discussed back in May. regards, tom lane