bgwriter changes - Mailing list pgsql-hackers
From | Neil Conway |
---|---|
Subject | bgwriter changes |
Date | |
Msg-id | 41BEEB0E.4070003@samurai.com |
Responses | Re: bgwriter changes (3 replies) |
List | pgsql-hackers |
In a recent discussion[1] with Simon Riggs, there has been some talk of making some changes to the bgwriter. To summarize the problem: the bgwriter currently scans the entire T1+T2 buffer lists and returns a list of all the currently dirty buffers. It then selects a subset of that list (computed using bgwriter_percent and bgwriter_maxpages) to flush to disk. Not only does this mean we can end up scanning a significant portion of shared_buffers on every invocation of the bgwriter, we also do the scan while holding the BufMgrLock, which is likely to hurt scalability.

I think some fix for this is warranted for 8.0. Possible solutions:

(1) Special-case bgwriter_percent=100. The only reason we need to return a list of all the dirty buffers is so that we can choose n% of them to satisfy bgwriter_percent. That is obviously unnecessary if bgwriter_percent=100. I think this change won't help most users *unless* we also make bgwriter_percent=100 the default.

(2) Remove bgwriter_percent. I have yet to hear anyone argue that there's an actual need for bgwriter_percent in tuning bgwriter behavior, and one less GUC variable is a good thing, all else being equal. This is effectively the same as #1 with the default changed, only with less flexibility.

(3) Change the meaning of bgwriter_percent, per Simon's proposal: make it mean "the percentage of the buffer pool to scan, at most, to look for dirty buffers". I don't think this is workable, at least not at this point in the release cycle, because it means we might not smooth out the checkpoint load, which is one of the primary goals of the bgwriter. (In this proposal the bgwriter would only ever consider writing out a small subset of the total shared buffer cache: the least-recently-used n%, with 2% being a suggested default.) Some variant of this might be worth exploring for 8.1, though.

A patch (implementing #2) is attached; any benchmark results would be helpful.
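The scan-cost argument can be illustrated with a small C model of the two selection strategies. This is a sketch, not bufmgr.c code: it assumes a plain array of dirty flags in LRU-to-MRU order stands in for the T1+T2 lists, and it ignores locking, pinning and the BM_VALID check. `old_select()` and `new_select()` are hypothetical names; the rounding and limit logic follow the old `BufferSync()` and the patched `BufferSync(int maxpages)` as shown in the diff.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model, not PostgreSQL code: dirty[i] says whether the i-th
 * buffer in LRU-to-MRU order is dirty. *examined counts how many buffer
 * entries were looked at (in the real code, while holding BufMgrLock).
 */

/* Old scheme: collect every dirty buffer, then apply bgwriter_percent
 * (rounded up) and bgwriter_maxpages. -1 for either means "no limit". */
static int old_select(const bool *dirty, int nbuffers, int percent,
                      int maxpages, int *examined)
{
    int ndirty = 0;

    *examined = 0;
    if (percent == 0 || maxpages == 0)
        return 0;                       /* background writing disabled */
    assert(percent <= 100);

    for (int i = 0; i < nbuffers; i++)  /* always a full scan */
    {
        (*examined)++;
        if (dirty[i])
            ndirty++;
    }
    if (percent > 0)
        ndirty = (ndirty * percent + 99) / 100;
    if (maxpages > 0 && ndirty > maxpages)
        ndirty = maxpages;
    return ndirty;
}

/* Scheme with bgwriter_percent removed: stop scanning as soon as
 * maxpages dirty buffers have been found. -1 means "all buffers". */
static int new_select(const bool *dirty, int nbuffers, int maxpages,
                      int *examined)
{
    int ndirty = 0;

    *examined = 0;
    if (maxpages == 0)
        return 0;
    if (maxpages == -1)
        maxpages = nbuffers;

    for (int i = 0; i < nbuffers && ndirty < maxpages; i++)
    {
        (*examined)++;
        if (dirty[i])
            ndirty++;
    }
    return ndirty;
}
```

With 10,000 buffers of which every tenth is dirty, bgwriter_percent=100 and bgwriter_maxpages=100, both functions select 100 buffers, but `old_select()` examines all 10,000 entries while `new_select()` stops after 991. When few buffers are dirty, the early exit does not help and the new loop can still walk the whole list.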
Increasing shared_buffers (to 10,000 or more) should make the problem noticeable.

Opinions on which route is the best, or on some alternative solution? My inclination is toward #2, but I'm not dead-set on it.

-Neil

[1] http://archives.postgresql.org/pgsql-hackers/2004-12/msg00386.php

```diff
Index: doc/src/sgml/runtime.sgml
===================================================================
RCS file: /var/lib/cvs/pgsql/doc/src/sgml/runtime.sgml,v
retrieving revision 1.296
diff -c -r1.296 runtime.sgml
*** doc/src/sgml/runtime.sgml 13 Dec 2004 18:05:09 -0000 1.296
--- doc/src/sgml/runtime.sgml 14 Dec 2004 04:52:26 -0000
***************
*** 1350,1382 ****
       <para>
        Specifies the delay between activity rounds for the
        background writer. In each round the writer issues writes
!       for some number of dirty buffers (controllable by the
!       following parameters). The selected buffers will always be
!       the least recently used ones among the currently dirty
!       buffers. It then sleeps for <varname>bgwriter_delay</>
!       milliseconds, and repeats. The default value is 200. Note
!       that on many systems, the effective resolution of sleep
!       delays is 10 milliseconds; setting <varname>bgwriter_delay</>
!       to a value that is not a multiple of 10 may have the same
!       results as setting it to the next higher multiple of 10.
!       This option can only be set at server start or in the
!       <filename>postgresql.conf</filename> file.
!      </para>
!     </listitem>
!    </varlistentry>
!
!    <varlistentry id="guc-bgwriter-percent" xreflabel="bgwriter_percent">
!     <term><varname>bgwriter_percent</varname> (<type>integer</type>)</term>
!     <indexterm>
!      <primary><varname>bgwriter_percent</> configuration parameter</primary>
!     </indexterm>
!     <listitem>
!      <para>
!       In each round, no more than this percentage of the currently
!       dirty buffers will be written (rounding up any fraction to
!       the next whole number of buffers). The default value is
!       1. This option can only be set at server start or in the
!       <filename>postgresql.conf</filename> file.
       </para>
      </listitem>
     </varlistentry>
--- 1350,1367 ----
       <para>
        Specifies the delay between activity rounds for the
        background writer. In each round the writer issues writes
!       for some number of dirty buffers (controllable by
!       <varname>bgwriter_maxpages</varname>). The selected buffers
!       will always be the least recently used ones among the
!       currently dirty buffers. It then sleeps for
!       <varname>bgwriter_delay</> milliseconds, and repeats. The
!       default value is 200. Note that on many systems, the
!       effective resolution of sleep delays is 10 milliseconds;
!       setting <varname>bgwriter_delay</> to a value that is not a
!       multiple of 10 may have the same results as setting it to the
!       next higher multiple of 10. This option can only be set at
!       server start or in the <filename>postgresql.conf</filename>
!       file.
       </para>
      </listitem>
     </varlistentry>
***************
*** 1398,1409 ****
    </variablelist>

    <para>
!    Smaller values of <varname>bgwriter_percent</varname> and
!    <varname>bgwriter_maxpages</varname> reduce the extra I/O load
!    caused by the background writer, but leave more work to be done
!    at checkpoint time. To reduce load spikes at checkpoints,
!    increase the values. To disable background writing entirely,
!    set <varname>bgwriter_percent</varname> and/or
     <varname>bgwriter_maxpages</varname> to zero.
    </para>
   </sect3>
--- 1383,1396 ----
    </variablelist>

    <para>
!    Decreasing <varname>bgwriter_maxpages</varname> or increasing
!    <varname>bgwriter_delay</varname> will reduce the extra I/O load
!    caused by the background writer, but will leave more work to be
!    done at checkpoint time. To reduce load spikes at checkpoints,
!    increase the number of pages written per round
!    (<varname>bgwriter_maxpages</varname>) or reduce the delay
!    between rounds (<varname>bgwriter_delay</varname>). To disable
!    background writing entirely, set
     <varname>bgwriter_maxpages</varname> to zero.
    </para>
   </sect3>
Index: src/backend/catalog/index.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/catalog/index.c,v
retrieving revision 1.242
diff -c -r1.242 index.c
*** src/backend/catalog/index.c 1 Dec 2004 19:00:39 -0000 1.242
--- src/backend/catalog/index.c 14 Dec 2004 04:32:39 -0000
***************
*** 1062,1068 ****
          /* Send out shared cache inval if necessary */
          if (!IsBootstrapProcessingMode())
              CacheInvalidateHeapTuple(pg_class, tuple);
!         BufferSync(-1, -1);
      }
      else if (dirty)
      {
--- 1062,1068 ----
          /* Send out shared cache inval if necessary */
          if (!IsBootstrapProcessingMode())
              CacheInvalidateHeapTuple(pg_class, tuple);
!         BufferSync(-1);
      }
      else if (dirty)
      {
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.147
diff -c -r1.147 dbcommands.c
*** src/backend/commands/dbcommands.c 18 Nov 2004 01:14:26 -0000 1.147
--- src/backend/commands/dbcommands.c 14 Dec 2004 04:40:19 -0000
***************
*** 332,338 ****
       * up-to-date for the copy. (We really only need to flush buffers for
       * the source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync(-1, -1);

      /*
       * Close virtual file descriptors so the kernel has more available for
--- 332,338 ----
       * up-to-date for the copy. (We really only need to flush buffers for
       * the source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync(-1);

      /*
       * Close virtual file descriptors so the kernel has more available for
***************
*** 1206,1212 ****
       * up-to-date for the copy. (We really only need to flush buffers for
       * the source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync(-1, -1);

  #ifndef WIN32

--- 1206,1212 ----
       * up-to-date for the copy. (We really only need to flush buffers for
       * the source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync(-1);

  #ifndef WIN32

Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.11
diff -c -r1.11 bgwriter.c
*** src/backend/postmaster/bgwriter.c 5 Nov 2004 17:11:28 -0000 1.11
--- src/backend/postmaster/bgwriter.c 14 Dec 2004 04:44:26 -0000
***************
*** 116,122 ****
   * GUC parameters
   */
  int BgWriterDelay = 200;
- int BgWriterPercent = 1;
  int BgWriterMaxPages = 100;

  int CheckPointTimeout = 300;
--- 116,121 ----
***************
*** 372,378 ****
              n = 1;
          }
          else
!             n = BufferSync(BgWriterPercent, BgWriterMaxPages);

          /*
           * Nap for the configured time or sleep for 10 seconds if there
--- 371,377 ----
              n = 1;
          }
          else
!             n = BufferSync(BgWriterMaxPages);

          /*
           * Nap for the configured time or sleep for 10 seconds if there
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.182
diff -c -r1.182 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c 24 Nov 2004 02:56:17 -0000 1.182
--- src/backend/storage/buffer/bufmgr.c 14 Dec 2004 04:40:18 -0000
***************
*** 671,717 ****
   *
   * This is called at checkpoint time to write out all dirty shared buffers,
   * and by the background writer process to write out some of the dirty blocks.
!  * percent/maxpages should be -1 in the former case, and limit values (>= 0)
   * in the latter.
   *
   * Returns the number of buffers written.
   */
  int
! BufferSync(int percent, int maxpages)
  {
      BufferDesc **dirty_buffers;
      BufferTag *buftags;
      int num_buffer_dirty;
      int i;

!     /* If either limit is zero then we are disabled from doing anything... */
!     if (percent == 0 || maxpages == 0)
          return 0;

      /*
       * Get a list of all currently dirty buffers and how many there are.
       * We do not flush buffers that get dirtied after we started. They
       * have to wait until the next checkpoint.
       */
!     dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *));
!     buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag));

      LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
      num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
!                                                NBuffers);
!
!     /*
!      * If called by the background writer, we are usually asked to only
!      * write out some portion of dirty buffers now, to prevent the IO
!      * storm at checkpoint time.
!      */
!     if (percent > 0)
!     {
!         Assert(percent <= 100);
!         num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100;
!     }
!     if (maxpages > 0 && num_buffer_dirty > maxpages)
!         num_buffer_dirty = maxpages;

      /* Make sure we can handle the pin inside the loop */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 671,710 ----
   *
   * This is called at checkpoint time to write out all dirty shared buffers,
   * and by the background writer process to write out some of the dirty blocks.
!  * maxpages should be -1 in the former case, and a limit value (>= 0)
   * in the latter.
   *
   * Returns the number of buffers written.
   */
  int
! BufferSync(int maxpages)
  {
      BufferDesc **dirty_buffers;
      BufferTag *buftags;
      int num_buffer_dirty;
      int i;

!     /* If maxpages is zero then we're effectively disabled */
!     if (maxpages == 0)
          return 0;

+     /* If -1, flush all dirty buffers */
+     if (maxpages == -1)
+         maxpages = NBuffers;
+
      /*
+      * Get a list of up to "maxpages" dirty buffers, starting from LRU and
       * Get a list of all currently dirty buffers and how many there are.
       * We do not flush buffers that get dirtied after we started. They
       * have to wait until the next checkpoint.
       */
!     dirty_buffers = (BufferDesc **) palloc(maxpages * sizeof(BufferDesc *));
!     buftags = (BufferTag *) palloc(maxpages * sizeof(BufferTag));

      LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
      num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
!                                                maxpages);
!     Assert(num_buffer_dirty <= maxpages);

      /* Make sure we can handle the pin inside the loop */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
***************
*** 947,953 ****
  void
  FlushBufferPool(void)
  {
!     BufferSync(-1, -1);
      smgrsync();
  }

--- 940,946 ----
  void
  FlushBufferPool(void)
  {
!     BufferSync(-1);
      smgrsync();
  }

Index: src/backend/storage/buffer/freelist.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.48
diff -c -r1.48 freelist.c
*** src/backend/storage/buffer/freelist.c 16 Sep 2004 16:58:31 -0000 1.48
--- src/backend/storage/buffer/freelist.c 14 Dec 2004 04:22:02 -0000
***************
*** 753,810 ****
      int num_buffer_dirty = 0;
      int cdb_id_t1;
      int cdb_id_t2;
-     int buf_id;
-     BufferDesc *buf;

      /*
!      * Traverse the T1 and T2 list LRU to MRU in "parallel" and add all
!      * dirty buffers found in that order to the list. The ARC strategy
!      * keeps all used buffers including pinned ones in the T1 or T2 list.
!      * So we cannot miss any dirty buffers.
       */

      cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
      cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];

      while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
      {
          if (cdb_id_t1 >= 0)
          {
              buf_id = StrategyCDB[cdb_id_t1].buf_id;
-             buf = &BufferDescriptors[buf_id];
-
-             if (buf->flags & BM_VALID)
-             {
-                 if ((buf->flags & BM_DIRTY) || (buf->cntxDirty))
-                 {
-                     buffers[num_buffer_dirty] = buf;
-                     buftags[num_buffer_dirty] = buf->tag;
-                     num_buffer_dirty++;
-                     if (num_buffer_dirty >= max_buffers)
-                         break;
-                 }
-             }
-
              cdb_id_t1 = StrategyCDB[cdb_id_t1].next;
          }
!
!         if (cdb_id_t2 >= 0)
          {
              buf_id = StrategyCDB[cdb_id_t2].buf_id;
!             buf = &BufferDescriptors[buf_id];
!             if (buf->flags & BM_VALID)
              {
!                 if ((buf->flags & BM_DIRTY) || (buf->cntxDirty))
!                 {
!                     buffers[num_buffer_dirty] = buf;
!                     buftags[num_buffer_dirty] = buf->tag;
!                     num_buffer_dirty++;
!                     if (num_buffer_dirty >= max_buffers)
!                         break;
!                 }
              }
-
-             cdb_id_t2 = StrategyCDB[cdb_id_t2].next;
          }
      }

--- 753,797 ----
      int num_buffer_dirty = 0;
      int cdb_id_t1;
      int cdb_id_t2;

      /*
!      * Traverse the T1 and T2 list from LRU to MRU in "parallel" and
!      * add all dirty buffers found in that order to the list. The ARC
!      * strategy keeps all used buffers including pinned ones in the T1
!      * or T2 list. So we cannot miss any dirty buffers.
       */

      cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
      cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];

      while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
      {
+         int buf_id;
+         BufferDesc *buf;
+
          if (cdb_id_t1 >= 0)
          {
              buf_id = StrategyCDB[cdb_id_t1].buf_id;
              cdb_id_t1 = StrategyCDB[cdb_id_t1].next;
          }
!         else
          {
+             Assert(cdb_id_t2 >= 0);
              buf_id = StrategyCDB[cdb_id_t2].buf_id;
!             cdb_id_t2 = StrategyCDB[cdb_id_t2].next;
!         }
!
!         buf = &BufferDescriptors[buf_id];
!         if (buf->flags & BM_VALID)
!         {
!             if ((buf->flags & BM_DIRTY) || (buf->cntxDirty))
              {
!                 buffers[num_buffer_dirty] = buf;
!                 buftags[num_buffer_dirty] = buf->tag;
!                 num_buffer_dirty++;
!                 if (num_buffer_dirty >= max_buffers)
!                     break;
              }
          }
      }

Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.250
diff -c -r1.250 guc.c
*** src/backend/utils/misc/guc.c 24 Nov 2004 19:51:03 -0000 1.250
--- src/backend/utils/misc/guc.c 14 Dec 2004 04:44:40 -0000
***************
*** 1249,1263 ****
      },

      {
-         {"bgwriter_percent", PGC_SIGHUP, RESOURCES,
-             gettext_noop("Background writer percentage of dirty buffers to flush per round"),
-             NULL
-         },
-         &BgWriterPercent,
-         1, 0, 100, NULL, NULL
-     },
-
-     {
          {"bgwriter_maxpages", PGC_SIGHUP, RESOURCES,
              gettext_noop("Background writer maximum number of pages to flush per round"),
              NULL
--- 1249,1254 ----
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.134
diff -c -r1.134 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample 5 Nov 2004 19:16:16 -0000 1.134
--- src/backend/utils/misc/postgresql.conf.sample 14 Dec 2004 04:54:47 -0000
***************
*** 96,106 ****
  #vacuum_cost_page_dirty = 20 # 0-10000 credits
  #vacuum_cost_limit = 200 # 0-10000 credits

! # - Background writer -

  #bgwriter_delay = 200 # 10-10000 milliseconds between rounds
! #bgwriter_percent = 1 # 0-100% of dirty buffers in each round
! #bgwriter_maxpages = 100 # 0-1000 buffers max per round


  #---------------------------------------------------------------------------
--- 96,105 ----
  #vacuum_cost_page_dirty = 20 # 0-10000 credits
  #vacuum_cost_limit = 200 # 0-10000 credits

! # - Background Writer -

  #bgwriter_delay = 200 # 10-10000 milliseconds between rounds
! #bgwriter_maxpages = 100 # max buffers written per round, 0 disables


  #---------------------------------------------------------------------------
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /var/lib/cvs/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.3
diff -c -r1.3 bgwriter.h
*** src/include/postmaster/bgwriter.h 29 Aug 2004 04:13:09 -0000 1.3
--- src/include/postmaster/bgwriter.h 14 Dec 2004 04:44:44 -0000
***************
*** 18,24 ****

  /* GUC options */
  extern int BgWriterDelay;
- extern int BgWriterPercent;
  extern int BgWriterMaxPages;
  extern int CheckPointTimeout;
  extern int CheckPointWarning;
--- 18,23 ----
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /var/lib/cvs/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.88
diff -c -r1.88 bufmgr.h
*** src/include/storage/bufmgr.h 16 Oct 2004 18:57:26 -0000 1.88
--- src/include/storage/bufmgr.h 14 Dec 2004 04:40:09 -0000
***************
*** 150,156 ****
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern int BufferSync(int percent, int maxpages);

  extern void InitLocalBuffer(void);

--- 150,156 ----
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern int BufferSync(int maxpages);

  extern void InitLocalBuffer(void);

```
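The freelist.c hunk also changes the traversal order. As the diff reads, the old loop in `StrategyDirtyBufferList()` took one entry from T1 and then one from T2 on each iteration, while the restructured loop takes entries from T1 until that list is exhausted and only then moves on to T2, although its comment still says "in parallel". Once a `maxpages` limit cuts the list short, that can change which dirty buffers are chosen. The minimal C model below shows both loop shapes side by side. `CdbEntry`, `effective_limit()`, `collect_alternating()` and `collect_patched()` are hypothetical names, and a single `dirty[]` flag stands in for the BM_VALID / BM_DIRTY / cntxDirty tests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a StrategyCDB entry: the buffer it refers to
 * and the index of the next entry in its list (-1 ends the list). */
typedef struct
{
    int buf_id;
    int next;
} CdbEntry;

/* Mirrors the argument handling of the patched BufferSync(int maxpages):
 * 0 disables writing, -1 (checkpoint) means every buffer. */
static int effective_limit(int maxpages, int nbuffers)
{
    return (maxpages == -1) ? nbuffers : maxpages;
}

/* Loop shape before the patch: one step along T1, then one along T2. */
static int collect_alternating(const CdbEntry *cdb, int t1, int t2,
                               const bool *dirty, int *out, int max_buffers)
{
    int n = 0;

    while (t1 >= 0 || t2 >= 0)
    {
        if (t1 >= 0)
        {
            int buf_id = cdb[t1].buf_id;

            if (dirty[buf_id])
            {
                out[n++] = buf_id;
                if (n >= max_buffers)
                    break;          /* leaves the while loop */
            }
            t1 = cdb[t1].next;
        }
        if (t2 >= 0)
        {
            int buf_id = cdb[t2].buf_id;

            if (dirty[buf_id])
            {
                out[n++] = buf_id;
                if (n >= max_buffers)
                    break;
            }
            t2 = cdb[t2].next;
        }
    }
    return n;
}

/* Loop shape after the patch: take from T1 while it lasts, then T2. */
static int collect_patched(const CdbEntry *cdb, int t1, int t2,
                           const bool *dirty, int *out, int max_buffers)
{
    int n = 0;

    while (t1 >= 0 || t2 >= 0)
    {
        int buf_id;

        if (t1 >= 0)
        {
            buf_id = cdb[t1].buf_id;
            t1 = cdb[t1].next;
        }
        else
        {
            assert(t2 >= 0);
            buf_id = cdb[t2].buf_id;
            t2 = cdb[t2].next;
        }

        if (dirty[buf_id])
        {
            out[n++] = buf_id;
            if (n >= max_buffers)
                break;
        }
    }
    return n;
}
```

With T1 holding buffers 0, 1, 2 and T2 holding buffers 3, 4, where every buffer except 0 is dirty and the limit is 2, `collect_patched()` returns buffers 1 and 2 while `collect_alternating()` returns 3 and 1. With a limit of -1 both visit every entry, so only the order of the result differs.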