Thread: [HACKERS] GUC for cleanup indexes threshold.
Hi and happy new year.

Lazy vacuum calls lazy_cleanup_index at the end of lazy_scan_heap to update index statistics such as relpages and reltuples. For every index type, lazy_cleanup_index scans all index pages, and it does so even if the table has not been updated at all since the previous vacuum. The freeze map greatly reduces the execution time and cost of vacuuming the table itself when most of the table is frozen, but it does not help with index cleanup: if a very large static table has an index, the cleanup is still invoked and always scans all index pages, which takes a long time, as reported in [1].

The attached patch introduces a new GUC parameter, vacuum_cleanup_index_scale_factor, which specifies the fraction of table pages containing dead tuples required to trigger index cleanup. The default is 0.0, which means index cleanup is not invoked when the table has had no updates; in other words, if the table is completely frozen, lazy vacuum can skip the index scans as well. Increasing this value can reduce the total time of lazy vacuum, but the index statistics and the free space map of the index are then not updated.

[1] https://www.postgresql.org/message-id/MWHPR20MB142177B86D893C946FAFC9A4A18C0%40MWHPR20MB1421.namprd20.prod.outlook.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Wed, Jan 4, 2017 at 3:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Hi and happy new year.
>
> The lazy vacuum calls lazy_cleanup_index to update statistics of
> indexes on a table such as relpages, reltuples at the end of the
> lazy_scan_heap. In all type of indexes the lazy_cleanup_index scans
> all index pages. It happens even if table has not been updated at all
> since previous vacuum invoked. Freeze map reduces the execution time
> and cost of table vacuuming much if almost table has been frozen. But
> it doesn't work for cleaning up indexes. If a very large static table
> has index then because the cleaning up index is called and it always
> scans all index pages, it takes time to scan all pages of index as
> reported[1].
>
> Attached patch introduces new GUC parameter parameter
> vacuum_cleanup_index_scale_factor which specifies the fraction of the
> table pages containing dead tuple needed to trigger a cleaning up
> indexes. The default is 0.0, which means that the cleanup index is not
> invoked if no update on table. In other word, if table is completely
> frozen then lazy vacuum can skip the index scans as well. Increasing
> this value could reduce total time of lazy vacuum but the statistics
> and the free space map of index are not updated.

Cool. I'll look at this, but not until next CF.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jan 4, 2017 at 1:51 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached patch introduces new GUC parameter parameter
> vacuum_cleanup_index_scale_factor which specifies the fraction of the
> table pages containing dead tuple needed to trigger a cleaning up
> indexes. The default is 0.0, which means that the cleanup index is not
> invoked if no update on table. In other word, if table is completely
> frozen then lazy vacuum can skip the index scans as well. Increasing
> this value could reduce total time of lazy vacuum but the statistics
> and the free space map of index are not updated.
>
I was looking into your patch and trying to understand how the
following piece of code works.

+ if (vacuumed_pages > cleanupidx_thresh)
+ {
+ for (i = 0; i < nindexes; i++)
+ lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
+ }

So, you are skipping index cleanup if the number of vacuumed pages does
not reach the cleanup threshold, but you are still removing all the
dead tuples from the heap that the index entries point to. Hence, the
index will contain entries with invalid references. How does that work?
How will you remove those index entries later? (I'm a newbie.)

+ This parameter can only be set anywhere.

Oxymoron. :-)

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Feb 10, 2017 at 8:01 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Wed, Jan 4, 2017 at 1:51 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>> Attached patch introduces new GUC parameter parameter
>> vacuum_cleanup_index_scale_factor which specifies the fraction of the
>> table pages containing dead tuple needed to trigger a cleaning up
>> indexes. The default is 0.0, which means that the cleanup index is not
>> invoked if no update on table. In other word, if table is completely
>> frozen then lazy vacuum can skip the index scans as well. Increasing
>> this value could reduce total time of lazy vacuum but the statistics
>> and the free space map of index are not updated.
>>
> I was looking into your patch and trying to understand how the
> following piece of code works.

Thank you for looking at this patch!

> + if (vacuumed_pages > cleanupidx_thresh)
> + {
> + for (i = 0; i < nindexes; i++)
> + lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
> + }
> So, you are skipping deletion of index entries if it does not reach
> the clean-up index threshold. But, you are removing all dead tuples
> from the heap pointed by the same index. Hence, index will contain
> entries with invalid references.

I don't think so. By the time lazy_cleanup_index is called, all garbage
on the heap and in the indexes should already have been reclaimed by
lazy_vacuum_heap and lazy_vacuum_index.

> + This parameter can only be set anywhere.
> Oxymoron. :-)

Will fix it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi, I ran the regression tests and found some failures involving brin
and gin, though I haven't looked into them yet.

Here's a log:

*** /home/ideriha/postgres-master/src/test/regress/expected/brin.out 2017-02-13 11:33:43.270942937 +0900
--- /home/ideriha/postgres-master/src/test/regress/results/brin.out 2017-02-15 14:58:24.725984474 +0900
***************
*** 403,408 ****
SELECT brin_summarize_new_values('brinidx'); -- ok, no change expected
brin_summarize_new_values
---------------------------
! 0
(1 row)

--- 403,408 ----
SELECT brin_summarize_new_values('brinidx'); -- ok, no change expected
brin_summarize_new_values
---------------------------
! 5
(1 row)

======================================================================

*** /home/ideriha/postgres-master/src/test/regress/expected/gin.out 2016-12-20 16:49:09.513050050 +0900
--- /home/ideriha/postgres-master/src/test/regress/results/gin.out 2017-02-15 14:58:25.536984461 +0900
***************
*** 20,26 ****
select gin_clean_pending_list('gin_test_idx'); -- nothing to flush
gin_clean_pending_list
------------------------
! 0
(1 row)

-- Test vacuuming
--- 20,26 ----
select gin_clean_pending_list('gin_test_idx'); -- nothing to flush
gin_clean_pending_list
------------------------
! 8
(1 row)

-- Test vacuuming

======================================================================

Regards,
Ideriha Takeshi
On Wed, Feb 15, 2017 at 4:39 PM, Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com> wrote:
> Hi, I tried regression test and found some errors concerning brin and gin,
> though I didn't look into this.
>
> Here's a log:
>
> *** /home/ideriha/postgres-master/src/test/regress/expected/brin.out 2017-02-13 11:33:43.270942937 +0900
> --- /home/ideriha/postgres-master/src/test/regress/results/brin.out 2017-02-15 14:58:24.725984474 +0900
> ***************
> *** 403,408 ****
> SELECT brin_summarize_new_values('brinidx'); -- ok, no change expected
> brin_summarize_new_values
> ---------------------------
> ! 0
> (1 row)
>
> --- 403,408 ----
> SELECT brin_summarize_new_values('brinidx'); -- ok, no change expected
> brin_summarize_new_values
> ---------------------------
> ! 5
> (1 row)
>
> ======================================================================
>
> *** /home/ideriha/postgres-master/src/test/regress/expected/gin.out 2016-12-20 16:49:09.513050050 +0900
> --- /home/ideriha/postgres-master/src/test/regress/results/gin.out 2017-02-15 14:58:25.536984461 +0900
> ***************
> *** 20,26 ****
> select gin_clean_pending_list('gin_test_idx'); -- nothing to flush
> gin_clean_pending_list
> ------------------------
> ! 0
> (1 row)
>
> -- Test vacuuming
> --- 20,26 ----
> select gin_clean_pending_list('gin_test_idx'); -- nothing to flush
> gin_clean_pending_list
> ------------------------
> ! 8
> (1 row)
>
> -- Test vacuuming
>
> ======================================================================
>

Thank you for testing! It's a bug. Attached latest version patch, which
passed make check.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
On Mon, Feb 13, 2017 at 1:01 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thanks for the explanation. I've looked into the referred code, but I'm
still in doubt. vacuumed_pages is incremented only when there are no
indexes, i.e. nindexes == 0. Now, look at the following part in the
patch.

+ /*
+ * Do post-vacuum cleanup and statistics update for each index if
+ * the number of vacuumed page exceeds threshold.
+ */
+ cleanupidx_thresh = (float4) nblocks * vacuum_cleanup_index_scale;
+
+ elog(DEBUG3, "%s: vac: %d (threshold %0.f)",
+ RelationGetRelationName(onerel), nblocks, cleanupidx_thresh);
+ if (vacuumed_pages >= cleanupidx_thresh)
+ {
+ for (i = 0; i < nindexes; i++)
+ lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
+ }

So, when the table has indexes, vacuumed_pages stays zero, and unless
vacuum_cleanup_index_scale_factor is zero, lazy_cleanup_index will never
be called. IMO, this seems to be incorrect. Besides, I've tested with a
non-zero (0.5) vacuum_cleanup_index_scale_factor, and the regression
tests for brin and gin fail (make installcheck).

+ {"vacuum_cleanup_index_scale_factor", PGC_USERSET, CLIENT_CONN_STATEMENT,
+ gettext_noop("Number of pages containing dead tuple prior to vacuum as a fraction of relpages."),
+ NULL
+ },
+ &vacuum_cleanup_index_scale,
+ 0.0, 0.0, 100.0,
+ NULL, NULL, NULL
+ },

The maximum value for vacuum_cleanup_index_scale_factor should be 1
instead of 100: as the code indicates, it is treated as a fraction of
relpages, not a percentage.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> It's a bug. Attached latest version patch, which passed make check.

In its current form, I'm not sure this is a good idea. Problems...

1. I'm pretty sure the world doesn't need another VACUUM parameter

I suggest that we use the existing vacuum scale factor/4 to reflect
that indexes are more sensitive to bloat.

2. The current btree vacuum code requires 2 vacuums to fully reuse
half-dead pages. So skipping an index vacuum might mean that second
index scan never happens at all, which would be bad.

I suggest that we store the number of half-dead pages in the metapage
after each VACUUM, so we can decide whether to skip the scan or not.
And we use some math like each half-dead page that needs to be reused
is worth 250 index entries, so the decision to skip is based upon rows
and empty pages, not just recently vacuumed rows.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> It's a bug. Attached latest version patch, which passed make check. > > In its current form, I'm not sure this is a good idea. Problems... > > 1. I'm pretty sure the world doesn't need another VACUUM parameter > > I suggest that we use the existing vacuum scale factor/4 to reflect > that indexes are more sensitive to bloat. I do not think it's a good idea to control multiple behaviors with a single GUC. We don't really know that dividing by 4 will be right for everyone, or even for most people. It's better to have another parameter with a sensible default than to hardcode a ratio that might work out poorly for some people. > 2. The current btree vacuum code requires 2 vacuums to fully reuse > half-dead pages. So skipping an index vacuum might mean that second > index scan never happens at all, which would be bad. Maybe. If there are a tiny number of those half-dead pages in a huge index, it probably doesn't matter. Also, I don't think it would never happen, unless the table just never gets any more updates or deletes - but that case could also happen today. It's just a matter of happening less frequently. I guess the question is whether the accumulation of half-dead pages in the index could become a problem before the unsetting of all-visible bits in the heap becomes a problem. If the second one always happen first, then we don't have an issue here, but if it's possible for the first one to become a big problem before the second one gets to be a serious issue, then we need something more sophisticated. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> It's a bug. Attached latest version patch, which passed make check.
>>
>> In its current form, I'm not sure this is a good idea. Problems...
>>
>> 1. I'm pretty sure the world doesn't need another VACUUM parameter
>>
>> I suggest that we use the existing vacuum scale factor/4 to reflect
>> that indexes are more sensitive to bloat.
>
> I do not think it's a good idea to control multiple behaviors with a
> single GUC. We don't really know that dividing by 4 will be right for
> everyone, or even for most people. It's better to have another
> parameter with a sensible default than to hardcode a ratio that might
> work out poorly for some people.
>
>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>> half-dead pages. So skipping an index vacuum might mean that second
>> index scan never happens at all, which would be bad.
>
> Maybe. If there are a tiny number of those half-dead pages in a huge
> index, it probably doesn't matter. Also, I don't think it would never
> happen, unless the table just never gets any more updates or deletes -
> but that case could also happen today. It's just a matter of
> happening less frequently.

The half-dead pages are never cleaned up if the ratio of pages
containing garbage always stays below the threshold. Also, in a gin
index the pending list is never cleared, which becomes a big problem. I
guess we should take a different action for each index type.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 2/19/17 7:56 PM, Masahiko Sawada wrote: > The half-dead pages are never cleaned up if the ratio of pages > containing garbage is always lower than threshold. Also in gin index > the pending list is never cleared, which become big problem. I guess > that we should take action for each type of indexes. What worries me is that each AM is going to have a different notion of what needs to happen to support this. That indicates that trying to handle this at the vacuum level is not a good idea. I think it would be wiser to add support for skipping scans to the AM API instead. That also means you don't have to add support for this to every index type to start with. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532)
On Mon, Feb 20, 2017 at 11:35 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 2/19/17 7:56 PM, Masahiko Sawada wrote:
>>
>> The half-dead pages are never cleaned up if the ratio of pages
>> containing garbage is always lower than threshold. Also in gin index
>> the pending list is never cleared, which become big problem. I guess
>> that we should take action for each type of indexes.
>
> What worries me is that each AM is going to have a different notion of what
> needs to happen to support this. That indicates that trying to handle this
> at the vacuum level is not a good idea.
>
> I think it would be wiser to add support for skipping scans to the AM API
> instead. That also means you don't have to add support for this to every
> index type to start with.

Yeah, and it would be better to have it as an index storage parameter.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> It's a bug. Attached latest version patch, which passed make check.
>>>
>>> In its current form, I'm not sure this is a good idea. Problems...
>>>
>>> 1. I'm pretty sure the world doesn't need another VACUUM parameter
>>>
>>> I suggest that we use the existing vacuum scale factor/4 to reflect
>>> that indexes are more sensitive to bloat.
>>
>> I do not think it's a good idea to control multiple behaviors with a
>> single GUC. We don't really know that dividing by 4 will be right for
>> everyone, or even for most people. It's better to have another
>> parameter with a sensible default than to hardcode a ratio that might
>> work out poorly for some people.
>>
>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>>> half-dead pages. So skipping an index vacuum might mean that second
>>> index scan never happens at all, which would be bad.
>>
>> Maybe. If there are a tiny number of those half-dead pages in a huge
>> index, it probably doesn't matter. Also, I don't think it would never
>> happen, unless the table just never gets any more updates or deletes -
>> but that case could also happen today. It's just a matter of
>> happening less frequently.
>

Yeah, that's right, and I am not sure whether it is worth performing a
complete pass to reclaim dead/deleted pages unless we know somehow that
there are many such pages. Also, I think we do reclaim the complete
page while allocating a new page in btree.

>> The half-dead pages are never cleaned up if the ratio of pages
>> containing garbage is always lower than threshold.
>

Which threshold are you referring to here?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 20 February 2017 at 09:15, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> It's a bug. Attached latest version patch, which passed make check. >>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse >>>> half-dead pages. So skipping an index vacuum might mean that second >>>> index scan never happens at all, which would be bad. >>> >>> Maybe. If there are a tiny number of those half-dead pages in a huge >>> index, it probably doesn't matter. Also, I don't think it would never >>> happen, unless the table just never gets any more updates or deletes - >>> but that case could also happen today. It's just a matter of >>> happening less frequently. >> > > Yeah thats right and I am not sure if it is worth to perform a > complete pass to reclaim dead/deleted pages unless we know someway > that there are many such pages. Agreed.... which is why On 16 February 2017 at 11:17, Simon Riggs <simon@2ndquadrant.com> wrote: > I suggest that we store the number of half-dead pages in the metapage > after each VACUUM, so we can decide whether to skip the scan or not. > Also, I think we do reclaim the > complete page while allocating a new page in btree. That's not how it works according to the README at least. You might be referring to cleaning out killed tuples just before a page split? That's something different. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Feb 20, 2017 at 3:01 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 20 February 2017 at 09:15, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>>> It's a bug. Attached latest version patch, which passed make check. > >>>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse >>>>> half-dead pages. So skipping an index vacuum might mean that second >>>>> index scan never happens at all, which would be bad. >>>> >>>> Maybe. If there are a tiny number of those half-dead pages in a huge >>>> index, it probably doesn't matter. Also, I don't think it would never >>>> happen, unless the table just never gets any more updates or deletes - >>>> but that case could also happen today. It's just a matter of >>>> happening less frequently. >>> >> >> Yeah thats right and I am not sure if it is worth to perform a >> complete pass to reclaim dead/deleted pages unless we know someway >> that there are many such pages. > > Agreed.... which is why > On 16 February 2017 at 11:17, Simon Riggs <simon@2ndquadrant.com> wrote: >> I suggest that we store the number of half-dead pages in the metapage >> after each VACUUM, so we can decide whether to skip the scan or not. > > >> Also, I think we do reclaim the >> complete page while allocating a new page in btree. > > That's not how it works according to the README at least. > I am referring to code (_bt_getbuf()->if (_bt_page_recyclable(page))), won't that help us in reclaiming the space? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 20, 2017 at 6:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> It's a bug. Attached latest version patch, which passed make check. >>>> >>>> In its current form, I'm not sure this is a good idea. Problems... >>>> >>>> 1. I'm pretty sure the world doesn't need another VACUUM parameter >>>> >>>> I suggest that we use the existing vacuum scale factor/4 to reflect >>>> that indexes are more sensitive to bloat. >>> >>> I do not think it's a good idea to control multiple behaviors with a >>> single GUC. We don't really know that dividing by 4 will be right for >>> everyone, or even for most people. It's better to have another >>> parameter with a sensible default than to hardcode a ratio that might >>> work out poorly for some people. >>> >>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse >>>> half-dead pages. So skipping an index vacuum might mean that second >>>> index scan never happens at all, which would be bad. >>> >>> Maybe. If there are a tiny number of those half-dead pages in a huge >>> index, it probably doesn't matter. Also, I don't think it would never >>> happen, unless the table just never gets any more updates or deletes - >>> but that case could also happen today. It's just a matter of >>> happening less frequently. >> > > Yeah thats right and I am not sure if it is worth to perform a > complete pass to reclaim dead/deleted pages unless we know someway > that there are many such pages. Also, I think we do reclaim the > complete page while allocating a new page in btree. > >> The half-dead pages are never cleaned up if the ratio of pages >> containing garbage is always lower than threshold. 
>>
>
> Which threshold are you referring here?
>

I meant the new parameter in the current patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 20 February 2017 at 10:27, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Feb 20, 2017 at 3:01 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 20 February 2017 at 09:15, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>>>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>>>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>>>> It's a bug. Attached latest version patch, which passed make check. >> >>>>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse >>>>>> half-dead pages. So skipping an index vacuum might mean that second >>>>>> index scan never happens at all, which would be bad. >>>>> >>>>> Maybe. If there are a tiny number of those half-dead pages in a huge >>>>> index, it probably doesn't matter. Also, I don't think it would never >>>>> happen, unless the table just never gets any more updates or deletes - >>>>> but that case could also happen today. It's just a matter of >>>>> happening less frequently. >>>> >>> >>> Yeah thats right and I am not sure if it is worth to perform a >>> complete pass to reclaim dead/deleted pages unless we know someway >>> that there are many such pages. >> >> Agreed.... which is why >> On 16 February 2017 at 11:17, Simon Riggs <simon@2ndquadrant.com> wrote: >>> I suggest that we store the number of half-dead pages in the metapage >>> after each VACUUM, so we can decide whether to skip the scan or not. >> >> >>> Also, I think we do reclaim the >>> complete page while allocating a new page in btree. >> >> That's not how it works according to the README at least. >> > > I am referring to code (_bt_getbuf()->if (_bt_page_recyclable(page))), > won't that help us in reclaiming the space? Not unless the README is incorrect, no. 
That section of code is just a retest of pages retrieved from FSM; they aren't even added there until two scans have occurred and even then it may not be possible to recycle. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 16, 2017 at 10:41 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> 2. The current btree vacuum code requires 2 vacuums to fully reuse >> half-dead pages. So skipping an index vacuum might mean that second >> index scan never happens at all, which would be bad. > > Maybe. If there are a tiny number of those half-dead pages in a huge > index, it probably doesn't matter. Also, I don't think it would never > happen, unless the table just never gets any more updates or deletes - > but that case could also happen today. It's just a matter of > happening less frequently. > > I guess the question is whether the accumulation of half-dead pages in > the index could become a problem before the unsetting of all-visible > bits in the heap becomes a problem. If the second one always happen > first, then we don't have an issue here, but if it's possible for the > first one to become a big problem before the second one gets to be a > serious issue, then we need something more sophisticated. Not getting to a second VACUUM where you might have otherwise can only be a problem to the extent that users are sensitive to not reclaiming disk space from indexes at the level of the FSM. It's not accurate to say that pages could be left "half dead" indefinitely by this patch, since that is something that lasts only as long as the first phase of B-Tree page deletion. In fact, the only possible problem is that pages are recyclable in principle, but that doesn't happen due to this new GUC. That isn't analogous to heap bloat at all, because it's not as if there are any downlinks or right links or left links pointing to the recyclable (fully deleted) pages; the previous key space *has* in fact been *fully* reclaimed. These pages are fully dead, and as such are out of the critical path of index scans entirely once the second phase finishes. (They only need to continue to physically exist because old index scans might follow a stale pointer). 
Note that there is an interlock against RecentGlobalXmin, respected by
VACUUM, that prevents this sort of premature recycling.

I suspect that the restrictions on page deletion, as opposed to page
recycling, are vastly more likely to cause pain to users, and that's
not made any worse by this.

--
Peter Geoghegan
On Tue, Feb 21, 2017 at 1:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 20 February 2017 at 10:27, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Mon, Feb 20, 2017 at 3:01 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> On 20 February 2017 at 09:15, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>>>>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>>>>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>>>>> It's a bug. Attached latest version patch, which passed make check. >>> >>>>>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse >>>>>>> half-dead pages. So skipping an index vacuum might mean that second >>>>>>> index scan never happens at all, which would be bad. >>>>>> >>>>>> Maybe. If there are a tiny number of those half-dead pages in a huge >>>>>> index, it probably doesn't matter. Also, I don't think it would never >>>>>> happen, unless the table just never gets any more updates or deletes - >>>>>> but that case could also happen today. It's just a matter of >>>>>> happening less frequently. >>>>> >>>> >>>> Yeah thats right and I am not sure if it is worth to perform a >>>> complete pass to reclaim dead/deleted pages unless we know someway >>>> that there are many such pages. >>> >>> Agreed.... which is why >>> On 16 February 2017 at 11:17, Simon Riggs <simon@2ndquadrant.com> wrote: >>>> I suggest that we store the number of half-dead pages in the metapage >>>> after each VACUUM, so we can decide whether to skip the scan or not. >>> >>> >>>> Also, I think we do reclaim the >>>> complete page while allocating a new page in btree. >>> >>> That's not how it works according to the README at least. 
>>>
>>
>> I am referring to code (_bt_getbuf()->if (_bt_page_recyclable(page))),
>> won't that help us in reclaiming the space?
>
> Not unless the README is incorrect, no.
>

Just to ensure that we both have the same understanding, let me try to
write down what I understand about this reclaim algorithm. AFAIU, in
the first pass vacuum will mark the half-dead pages as Deleted, and in
the second pass it will record such pages as free in the FSM so that
they can be reused as new pages when the indexam asks for a new block,
instead of extending the index relation. Now, if we introduce this new
GUC, there are chances that we sometimes skip the second pass where it
would not have been skipped before.

Note that we do perform the second pass in the same vacuum cycle when
the index has not been scanned for deleting tuples, as per the below
code:

btvacuumcleanup()
{
..
    if (stats == NULL)
    {
        stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
        btvacuumscan(info, stats, NULL, NULL, 0);
..
}

In the above code, stats won't be NULL if the vacuum has scanned the
index for deleting tuples (btbulkdelete). So, based on this, I think
the second vacuum will skip scanning the index (and hence recycling
pages marked as deleted) only when there were no dead tuple removals in
that vacuum. Do we agree till here?

I understand that there could be some delay in reclaiming dead pages,
but do you think it is such a big deal that we should completely scan
the index for such cases, or even try to change the metapage format?

> That section of code is just a retest of pages retrieved from FSM;
>

Yes, I think you are right. In short, I agree that only vacuum can
reclaim half-dead pages.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 22, 2017 at 12:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Feb 21, 2017 at 1:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 20 February 2017 at 10:27, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> On Mon, Feb 20, 2017 at 3:01 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>>> On 20 February 2017 at 09:15, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>>> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>>> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>>>>>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>>>>>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>>>>>> It's a bug. Attached latest version patch, which passed make check. >>>> >>>>>>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse >>>>>>>> half-dead pages. So skipping an index vacuum might mean that second >>>>>>>> index scan never happens at all, which would be bad. >>>>>>> >>>>>>> Maybe. If there are a tiny number of those half-dead pages in a huge >>>>>>> index, it probably doesn't matter. Also, I don't think it would never >>>>>>> happen, unless the table just never gets any more updates or deletes - >>>>>>> but that case could also happen today. It's just a matter of >>>>>>> happening less frequently. >>>>>> >>>>> >>>>> Yeah thats right and I am not sure if it is worth to perform a >>>>> complete pass to reclaim dead/deleted pages unless we know someway >>>>> that there are many such pages. >>>> >>>> Agreed.... which is why >>>> On 16 February 2017 at 11:17, Simon Riggs <simon@2ndquadrant.com> wrote: >>>>> I suggest that we store the number of half-dead pages in the metapage >>>>> after each VACUUM, so we can decide whether to skip the scan or not. >>>> >>>> >>>>> Also, I think we do reclaim the >>>>> complete page while allocating a new page in btree. 
>>>> >>>> That's not how it works according to the README at least. >>>> >>> >>> I am referring to code (_bt_getbuf()->if (_bt_page_recyclable(page))), >>> won't that help us in reclaiming the space? >> >> Not unless the README is incorrect, no. >> > > Just to ensure that we both have the same understanding, let me try to > write what I understand about this reclaim algorithm. AFAIU, in the > first pass vacuum will mark the half dead pages as Deleted and in the > second pass, it will record such pages as free in FSM so that they can > be reused as new pages when the indexam asked for a new block instead > of extending the index relation. Now, if we introduce this new GUC, > then there are chances that sometimes we skip the second pass where it > would not have been skipped. > > Note that we do perform the second pass in the same vacuum cycle when > index has not been scanned for deleting the tuples as per below code: The first pass uses cycle id given by _bt_start_vacuum, but the second pass always uses cycle id 0. > btvacuumcleanup() > { > .. > if (stats == NULL) > { > stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult)); > btvacuumscan(info, stats, NULL, NULL, 0); > .. > } > > In above code stats won't be NULL, if the vacuum has scanned index for > deleting tuples (btbulkdelete). So, based on this I think it will > skip scanning the index (or recycling pages marked as deleted) in the > second vacuum only when there are no dead tuple removals in that > vacuum. Do we agree till here? Agreed. > I understand that there could be some delay in reclaiming dead pages > but do you think it is such a big deal that we completely scan the > index for such cases or even try to change the metapage format? IIUC, I think that we need to have the number of half-dead pages in meta page. Isn't it a problem that the freespace map of btree index is not vacuumed if all vacuums skip the second pass? 
Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Feb 23, 2017 at 5:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Feb 22, 2017 at 12:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > >> I understand that there could be some delay in reclaiming dead pages >> but do you think it is such a big deal that we completely scan the >> index for such cases or even try to change the metapage format? > > IIUC, I think that we need to have the number of half-dead pages in meta page. > Don't you think we need to consider backward compatibility if we want to do that? > Isn't it a problem that the freespace map of btree index is not > vacuumed if all vacuums skip the second pass? > AFAIU, you want to skip only when there is no dead tuple removal; if so, what is the need to update the freespace map of the btree index? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

On Fri, Feb 24, 2017 at 8:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> IIUC, I think that we need to have the number of half-dead pages in meta page. > > Don't you think we need to consider backward compatibility if we want > to do that? Yeah, that would be an on-disk format break. I think this thread is pretty short on evidence that would let us make a smart decision about what to do here. I see three possibilities. The first is that this patch is a good idea whether we do something about the issue of half-dead pages or not. The second is that this patch is a good idea if we do something about the issue of half-dead pages but a bad idea if we don't. The third is that this patch is a bad idea whether or not we do anything about the issue of half-dead pages. Unfortunately, we have no evidence at all that would let us figure out which of those three things is true. The original post didn't include any relevant benchmarks or test results. Simon's reply, which suggested that the problem of half-dead pages needed to be fixed first, didn't include any benchmarks or test results. In fact, in neither place were any tests suggested, even hypothetically, which would help us decide what to do. I had a hunch when I saw this thread that it was a good idea, and Simon has a hunch that this btree page recycling thing needs to be fixed first, and he might be right. Or we might both be wrong. Or ... who knows, really? I think we need to come up with some set of tests to figure out what actually works well in practice here. Theories are a good starting point, but good vacuum behavior is really important, and a patch that changes it ought to be backed up by at least some experimental evidence. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2/24/17 11:26 AM, Robert Haas wrote: > I think we need to come up with some set of tests to figure out what > actually works well in practice here. Theories are a good starting > point, but good vacuum behavior is really important, and a patch that > changes it ought to be backed up by at least some experimental > evidence. I think something else worth considering is that if we had some method of mapping heap TIDs back to indexes then a lot (all?) of these problems would go away. 10+ years ago the idea of keeping such a mapping would probably be untenable, but with resource forks and how much cheaper storage is maybe that's no longer the case. For btree I think this could be done by keeping a second btree ordered by ctid that points either to index entries or even just to whole index pages. At ~ 20 bytes per entry, even a 1B row index would take ~20GB. Page splits are obviously a big issue. Maybe it's safe to update the ctid map for every item that gets moved when a split happens. Would a ctid map work for other indexes as well? -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532)
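Jim's ~20GB figure checks out as a back-of-envelope computation; the helper below is purely illustrative (the function name and units are ours, not from any proposal):

```c
#include <assert.h>

/* Back-of-envelope sizing for the hypothetical ctid map: nrows entries
 * at entry_bytes each, expressed in (binary) gigabytes. */
static double
ctid_map_size_gb(long long nrows, int entry_bytes)
{
    return (double) nrows * entry_bytes / (1024.0 * 1024.0 * 1024.0);
}
```

For 1 billion rows at 20 bytes per entry this comes to roughly 18.6 GiB, i.e. the ~20GB cited above.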
On Fri, Feb 24, 2017 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think this thread is pretty short on evidence that would let us make > a smart decision about what to do here. I see three possibilities. > The first is that this patch is a good idea whether we do something > about the issue of half-dead pages or not. The second is that this > patch is a good idea if we do something about the issue of half-dead > pages but a bad idea if we don't. The third is that this patch is a > bad idea whether or not we do anything about the issue of half-dead > pages. Half-dead pages are not really relevant to this discussion, AFAICT. I think that both you and Simon mean "recyclable" pages. There are several levels of indirection involved here, to keep the locking very granular, so it gets tricky to talk about. B-Tree page deletion is like a page split in reverse. It has a symmetry with page splits, which have two phases (atomic operations). There are also two phases for deletion, the first of which leaves the target page without a downlink in its parent, and marks it half dead. By the end of the first phase, there are still sibling pointers, so an index scan can land on them before the second phase of deletion begins -- they can visit a half-dead page before such time as the second phase of deletion begins, where the sibling link goes away. So, the sibling link isn't stale as such, but the page is still morally dead. (Second phase is where we remove even the sibling links, and declare it fully dead.) Even though there are two phases of deletion, the second still occurs immediately after the first within VACUUM. The need to have two phases is hard to explain, so I won't try, but it suffices to say that VACUUM does not actually ever leave a page half dead unless there is a hard crash. Recall that VACUUMing of a B-Tree is performed sequentially, so blocks can be recycled without needing to be found via a downlink or sibling link by VACUUM. 
What is at issue here, then, is VACUUM's degree of "eagerness" around putting *fully* dead B-Tree pages in the FSM for recycling. The interlock with RecentGlobalXmin is what makes it impossible for VACUUM to generally fully delete pages, *as well as* mark them as recyclable (put them in the FSM) all at once. Maybe you get this already, since, as I said, the terminology is tricky in this area, but I can't tell. -- Peter Geoghegan
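The life cycle Peter describes can be summarized as a small state machine. The state names and transition functions below are illustrative only; the real logic lives in nbtree, which stamps an XID on the page at deletion time and compares it against RecentGlobalXmin later.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the B-Tree page life cycle under VACUUM. */
typedef enum
{
    BT_LIVE,        /* ordinary page */
    BT_HALF_DEAD,   /* phase 1: downlink removed, sibling links remain */
    BT_DELETED,     /* phase 2: fully unlinked; XID stamped on the page */
    BT_IN_FSM       /* a later VACUUM recorded it as recyclable */
} bt_page_state;

/* Deletion runs both phases within a single VACUUM (barring a crash). */
static bt_page_state
vacuum_delete(bt_page_state s)
{
    return (s == BT_LIVE || s == BT_HALF_DEAD) ? BT_DELETED : s;
}

/* A deleted page can only enter the FSM once no running transaction
 * could still land on it, i.e. RecentGlobalXmin has passed the XID
 * stamped at deletion time -- and only a later VACUUM does this. */
static bt_page_state
vacuum_recycle(bt_page_state s, bool xmin_passed_stamped_xid)
{
    return (s == BT_DELETED && xmin_passed_stamped_xid) ? BT_IN_FSM : s;
}
```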
On Sat, Feb 25, 2017 at 3:40 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Feb 24, 2017 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think this thread is pretty short on evidence that would let us make >> a smart decision about what to do here. I see three possibilities. >> The first is that this patch is a good idea whether we do something >> about the issue of half-dead pages or not. The second is that this >> patch is a good idea if we do something about the issue of half-dead >> pages but a bad idea if we don't. The third is that this patch is a >> bad idea whether or not we do anything about the issue of half-dead >> pages. > +1. I think we can track the stats from IndexBulkDeleteResult->pages_free to see the impact of the patch. > Half-dead pages are not really relevant to this discussion, AFAICT. I > think that both you and Simon mean "recyclable" pages. > Yes, I think so and I think that is just confusion about terminology. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Feb 25, 2017 at 3:40 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Feb 24, 2017 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think this thread is pretty short on evidence that would let us make >> a smart decision about what to do here. I see three possibilities. >> The first is that this patch is a good idea whether we do something >> about the issue of half-dead pages or not. The second is that this >> patch is a good idea if we do something about the issue of half-dead >> pages but a bad idea if we don't. The third is that this patch is a >> bad idea whether or not we do anything about the issue of half-dead >> pages. > > Half-dead pages are not really relevant to this discussion, AFAICT. I > think that both you and Simon mean "recyclable" pages. There are > several levels of indirection involved here, to keep the locking very > granular, so it gets tricky to talk about. Thanks for the clarification. I wasn't very clear on what was going on here; that helps. The thing that strikes me based on what you wrote is that our page recycling seems to admit of long delays even as things stand. There's no bound on how much time could pass between one index vacuum and the next, and RecentGlobalXmin could and probably usually will advance past the point that would allow recycling long before the next index vacuum cycle. I don't know whether that strengthens or weakens Simon's argument. On the one hand, you could argue that if we're already doing this on a long delay, making it even longer probably won't hurt much. On the other hand, you could argue that if the current situation is bad, we should at least avoid making it worse. I don't know which of those arguments is correct, if either. Do you have an idea about that, or any ideas for experiments we could try? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Feb 25, 2017 at 10:51 PM, Robert Haas <robertmhaas@gmail.com> wrote: > The thing that strikes me based on what you wrote is that our page > recycling seems to admit of long delays even as things stand. There's > no bound on how much time could pass between one index vacuum and the > next, and RecentGlobalXmin could and probably usually will advance > past the point that would allow recycling long before the next index > vacuum cycle. Agreed. > I don't know whether that strengthens or weakens > Simon's argument. I think it weakens it, but I hesitate to take a firm position on it just yet. > On the one hand, you could argue that if we're > already doing this on a long delay, making it even longer probably > won't hurt much. On the other hand, you could argue that if the > current situation is bad, we should at least avoid making it worse. I > don't know which of those arguments is correct, if either. Unsure. I will point out: * There probably is a big problem with B-Tree index bloat for certain workloads ("sparse deletion patterns"), that interacts badly with less frequent VACUUMing. * Whatever the bloat this patch makes worse is not *that* bloat, at least with the proposed default for vacuum_cleanup_index_scale_factor; it's not the bloat we usually think of when we talk about index bloat. A full index scan will not need to visit any of the dead pages, even just to immediately skip over them. We just won't be able to recycle them, which is another problem. * The problem of not recycling as soon as we'd prefer can only happen when everything else is, roughly speaking, going right. Which is still pretty good. (Again, remarks only apply when the default vacuum_cleanup_index_scale_factor is used.) * Roughly speaking, the recycling of disk blocks, and efficient use of disk space more generally is not a priority for the implementation. Nor should it be. 
I tend to think of this recycling as being more about the worst case for space utilization than about the average case. Kind of like the fast root vs. true root thing prevents our needing to descend "skinny" B-Trees from the true root, which can only really happen following vast changes to the key space, which are always going to be painful. These cases are a bit pathological. For more on this recycling stuff, see section 2.5 of the Lanin and Shasha paper, "Freeing empty nodes" [1]. It's describing what is essentially the RecentGlobalXmin interlock, and I think you're right to point out that that could stand to be a lot more aggressive, which is maybe the real problem, if any. (The paper's remarks suggest we could stand to be more eager about it.) > Do you have an idea about that, or any ideas for experiments we could try? Nothing occurs to me right now, unfortunately. However, my general sense is that it would probably be just fine when vacuum_cleanup_index_scale_factor was 0.0, but there might be non-linear increases in "the serious type of index bloat" as the proposed new setting was scaled up. I'd be much more worried about that. [1] https://archive.org/stream/symmetricconcurr00lani#page/6/mode/2up -- Peter Geoghegan
On 2/27/17 12:46 PM, Peter Geoghegan wrote: > On Sat, Feb 25, 2017 at 10:51 PM, Robert Haas <robertmhaas@gmail.com> wrote: > >> Do you have an idea about that, or any ideas for experiments we could try? > > Nothing occurs to me right now, unfortunately. However, my general > sense is that it would probably be just fine when > vacuum_cleanup_index_scale_factor was 0.0, but there might be > non-linear increases in "the serious type of index bloat" as the > proposed new setting was scaled up. I'd be much more worried about > that. This was originally marked "Waiting on Author" due to some minor problems with the patch but on the whole there are much larger issues at play. The tenor seems to be that we should somehow prove the effectiveness of this patch one way or the other, but nobody is quite sure how to go about that, and in fact it would probably be different for each AM. Sawada, if you have ideas about how to go about this then we would need to see something very soon. If not, I think marking this RWF is the best course of action. Thanks, -- -David david@pgmasters.net
On Sat, Feb 25, 2017 at 7:10 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Feb 24, 2017 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think this thread is pretty short on evidence that would let us make >> a smart decision about what to do here. I see three possibilities. >> The first is that this patch is a good idea whether we do something >> about the issue of half-dead pages or not. The second is that this >> patch is a good idea if we do something about the issue of half-dead >> pages but a bad idea if we don't. The third is that this patch is a >> bad idea whether or not we do anything about the issue of half-dead >> pages. > > Half-dead pages are not really relevant to this discussion, AFAICT. I > think that both you and Simon mean "recyclable" pages. There are > several levels of indirection involved here, to keep the locking very > granular, so it gets tricky to talk about. > > B-Tree page deletion is like a page split in reverse. It has a > symmetry with page splits, which have two phases (atomic operations). > There are also two phases for deletion, the first of which leaves the > target page without a downlink in its parent, and marks it half dead. > By the end of the first phase, there are still sibling pointers, so an > index scan can land on them before the second phase of deletion begins > -- they can visit a half-dead page before such time as the second > phase of deletion begins, where the sibling link goes away. So, the > sibling link isn't stale as such, but the page is still morally dead. > (Second phase is where we remove even the sibling links, and declare > it fully dead.) > > Even though there are two phases of deletion, the second still occurs > immediately after the first within VACUUM. The need to have two phases > is hard to explain, so I won't try, but it suffices to say that VACUUM > does not actually ever leave a page half dead unless there is a hard > crash. 
> > Recall that VACUUMing of a B-Tree is performed sequentially, so blocks > can be recycled without needing to be found via a downlink or sibling > link by VACUUM. What is at issue here, then, is VACUUM's degree of > "eagerness" around putting *fully* dead B-Tree pages in the FSM for > recycling. The interlock with RecentGlobalXmin is what makes it > impossible for VACUUM to generally fully delete pages, *as well as* > mark them as recyclable (put them in the FSM) all at once. > > Maybe you get this already, since, as I said, the terminology is > tricky in this area, but I can't tell. > Thank you for the clarification. Let me check my understanding. IIUC, skipping the second index vacuum path (lazy_cleanup_index) cannot leave pages in a half-dead state, but it could leave recyclable pages that are not yet marked as recyclable. But the latter can be reclaimed by the next index vacuum, because btvacuumpage calls RecordFreeIndexPage for recyclable pages. Am I missing something? My first motivation for this patch was to skip the second index vacuum pass when vacuum skipped the whole table using the visibility map. But as Robert suggested on another thread, I changed it to have a threshold. If my understanding is correct, we can have a threshold that specifies the fraction of table pages scanned by vacuum. If it's set to 0.1, lazy_scan_heap does the second index vacuum only when at least 10% of the table is scanned. IOW, if 90% of the table's pages are skipped, which means almost none of the table has changed since the previous vacuum, we can skip the second index vacuum. In this design, we could handle other types of AM as well. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
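The design Sawada describes above can be sketched as a single predicate in lazy_scan_heap(). This is a minimal illustration of the idea, not code from the patch; the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Run the second index pass only when at least `threshold` of the heap
 * was actually scanned (i.e. not skipped via the visibility map).
 * With threshold = 0.1, skipping 90%+ of the heap skips index cleanup. */
static bool
cleanup_index_wanted(long scanned_pages, long rel_pages, double threshold)
{
    if (rel_pages == 0)
        return false;           /* empty relation: nothing to clean up */
    return (double) scanned_pages / rel_pages >= threshold;
}
```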
On Fri, Mar 3, 2017 at 10:45 PM, David Steele <david@pgmasters.net> wrote: > On 2/27/17 12:46 PM, Peter Geoghegan wrote: >> On Sat, Feb 25, 2017 at 10:51 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >>> Do you have an idea about that, or any ideas for experiments we could try? >> >> Nothing occurs to me right now, unfortunately. However, my general >> sense is that it would probably be just fine when >> vacuum_cleanup_index_scale_factor was 0.0, but there might be >> non-linear increases in "the serious type of index bloat" as the >> proposed new setting was scaled up. I'd be much more worried about >> that. > > This was originally marked "Waiting on Author" due to some minor > problems with the patch but on the whole there are much larger issues at > play. > > The tenor seems to be that we should somehow prove the effectiveness of > this patch one way or the other, but nobody is quite sure how to go > about that, and in fact it would probably be different for each AM. > > Sawada, if you have ideas about how to go about this then we would need > to see something very soon. If not, I think marking this RWF is the > best course of action. > Thank you for the reminder. I've posted a new idea about this. After getting consensus on the design, I'm going to update the patch. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Mar 3, 2017 at 8:13 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Thank you for the clarification. Let me check my understanding. IIUC, > skipping the second index vacuum path (lazy_cleanup_index) cannot leave > pages in a half-dead state, but it could leave recyclable pages that > are not yet marked as recyclable. But the latter can be reclaimed by > the next index vacuum, because btvacuumpage calls RecordFreeIndexPage > for recyclable pages. Am I missing something? I think that this is a good summary. You have not missed anything. > My first motivation for this patch was to skip the second index vacuum > pass when vacuum skipped the whole table using the visibility map. Makes sense. > But as Robert > suggested on another thread, I changed it to have a threshold. If my > understanding is correct, we can have a threshold that specifies the > fraction of table pages scanned by vacuum. If it's set to 0.1, > lazy_scan_heap does the second index vacuum only when at least 10% of > the table is scanned. IOW, if 90% of the table's pages are skipped, > which means almost none of the table has changed since the previous > vacuum, we can skip the second index vacuum. It sounds like a setting of 0.1 would leave us in a position where it's very unlikely that a VACUUM of indexes could fail to occur when autovacuum has been triggered in the common way, by the "vacuum threshold" having been exceeded. Does this feeling that I have about it seem accurate to you? Is that actually your intent? It's hard to really be sure, because there are so many confounding factors (e.g. HOT), but if that was 100% true, then I suppose there would theoretically be zero new risk (except, perhaps, of the "other type of bloat" that I described, which I am not very worried about). Please verify my understanding of your thought process: We don't have to freeze indexes at all, ever, so if we see index bloat as a separate problem, we also see that there is no need to *link* index needs to the need for freezing. 
XID burn rate is a very bad proxy for how bloated an index may be. Besides, we already have a separate trigger for the thing that *actually* matters to indexes (the vacuum threshold stuff). Maybe 0.0 is too low as a default for vacuum_cleanup_index_scale_factor, even though initially it seemed attractive to me for theoretical reasons. Something like 0.01 is still "practically zero", but removes the extreme sensitivity that it would have with 0.0. So, 0.01 might make sense as a default for roughly the same reason that autovacuum_vacuum_threshold exists. (Maybe it should be more like autovacuum_vacuum_threshold, in that it's an absolute minimum number of heap blocks to trigger index clean-up.) At some point in the future, it may be possible to actually go ahead with index vacuuming, but do only a small amount of B-Tree vacuuming by a process that is similar to a conventional index scan: A process that collects index values from those rows found in the heap, and performs subsequent look-ups to kill tuples in the index. I imagine that in cases where the freeze map stuff really helps, the only index is often on a bigserial column or similar, which naturally has good locality. When you VACUUM heap pages, you attempt to build a range of values for each index, a little like a BRIN index build. This range is what you go on to use to do a cheap index-scan-based B-Tree VACUUM. This could have far, far less I/O, though it has obvious risks that we need to worry about. That's work for another release, of course. -- Peter Geoghegan
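The variant floated in that parenthetical — a scale factor paired with an absolute minimum, by analogy with autovacuum_vacuum_threshold plus autovacuum_vacuum_scale_factor — could be sketched as below. The names are ours, not from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Trigger index cleanup only once the number of changed heap pages
 * exceeds an absolute floor plus a fraction of the relation size,
 * mirroring how the autovacuum trigger combines its two settings. */
static bool
cleanup_threshold_exceeded(long changed_pages, long rel_pages,
                           double scale_factor, long min_pages)
{
    double threshold = min_pages + scale_factor * rel_pages;
    return changed_pages > threshold;
}
```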
On Fri, Mar 3, 2017 at 11:37 AM, Peter Geoghegan <pg@bowt.ie> wrote: > Please verify my understanding of your thought process: We don't have > to freeze indexes at all, ever, so if we see index bloat as a separate > problem, we also see that there is no need to *link* index needs to > the need for freezing. XID burn rate is a very bad proxy for how > bloated an index may be. Besides, we already have a separate trigger > for the thing that *actually* matters to indexes (the vacuum threshold > stuff). Another thing I wonder about: It would be okay to use the number of unset freeze map bits as a reasonable proxy for how much bloat is in the index the first time we vacuum. But, don't we then set the freeze map bits, losing any record of having skipped indexes? What mechanism exists that allows back-pressure to actually *build up* over many vacuum anti-wraparound cycles, so that we slowly but surely get around to actually vacuuming indexes at some point? -- Peter Geoghegan
On Fri, Mar 3, 2017 at 11:49 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> Please verify my understanding of your thought process: We don't have >> to freeze indexes at all, ever, so if we see index bloat as a separate >> problem, we also see that there is no need to *link* index needs to >> the need for freezing. XID burn rate is a very bad proxy for how >> bloated an index may be. Besides, we already have a separate trigger >> for the thing that *actually* matters to indexes (the vacuum threshold >> stuff). Wait, I have this wrong. You're only ever avoiding lazy_cleanup_index(), never lazy_vacuum_index() (I repeated Kuntal's mistake). So, you're focused on avoiding recycling pages in the case that no actual bloat is in the index. That hasn't changed from your earliest version, where there was no new GUC. You are not proposing to increase VACUUM's tolerance of "the bad kind of bloat". It seems that this is much less risky than I first thought, then. > Another thing I wonder about: It would be okay to use the number of > unset freeze map bits as a reasonable proxy for how much bloat is in > the index the first time we vacuum. But, don't we then set the freeze > map bits, losing any record of having skipped indexes? Still seems like we may want to remember that we skipped the btvacuumcleanup() calls, to build back-pressure, though. That is, we might want to do what Simon suggests: something may be stored in the B-Tree meta-page (for example) to remember that we tolerated not doing a btvacuumcleanup() when we could have and maybe should have. It would remember the last number of pages deleted by the last VACUUM. That's the only thing I can think of that matters. I believe this is what Simon meant when he said "half dead pages". OTOH, I see no reason to insist on this meta-page stuff. The only thing that can delete a B-Tree page is VACUUM. Hopefully, a subsequent VACUUM will occur that goes on to recycle those deleted pages. 
But, it's not as if a "recycling-orientated B-Tree VACUUM" (a btvacuumcleanup() call to btvacuumscan()) occurs because of a build-up of back-pressure that has any relationship to recycling. Back pressure is actually created by "real" bloat, or maybe using up XIDs, which is not the same thing. Bloat cannot occur within B-Tree pages that are already fully deleted and (almost?) eligible to be recycled. In other words, the number of B-Tree pages that the last VACUUM deleted, and thus made eligible to recycle by the next VACUUM has no relationship with the number of pages the next VACUUM will itself end up deleting, in general, or how long it will be before that next VACUUM comes, if it comes at all, or anything else that seems at all relevant. In other other words, you are only preventing recycling that would have occurred pretty much by accident before now, due to concerns that had nothing to do with recycling in particular. The really important thing is that you have bound the amount of recycling that VACUUM can now fail to do to only one "round" of VACUUM, because newly deleted pages can only occur due to another VACUUM, which will always do however much recycling it can manage to do as it scans a B-Tree index. I'm not very worried about VACUUM not being super-eager about reclaiming disk space for the FSM if there is a natural limit to that. It would be intolerable for it to ever be possible for recyclable-but-not-recycled pages to grow indefinitely, but I see no risk of that here. -- Peter Geoghegan
On Fri, Mar 3, 2017 at 2:41 PM, Peter Geoghegan <pg@bowt.ie> wrote: > In other words, the number of B-Tree pages that the last VACUUM > deleted, and thus made eligible to recycle by the next VACUUM has no > relationship with the number of pages the next VACUUM will itself end > up deleting, in general, or how long it will be before that next > VACUUM comes, if it comes at all, or anything else that seems at all > relevant. This raises another question, though: Why have this GUC at all? Why use *any* threshold that is to be compared against the number of heap pages that were processed by VACUUM this time? B-Tree page deletion isn't really part of the ordinary life cycle of a B-Tree index. In order for that to be the case, somebody would have to delete large ranges of indexed values (typically hundreds of logically contiguous values -- no gaps), without anyone else ever inserting new tuples that are in the same range before the next VACUUM. It's very unlikely that this would happen again and again in the real world. So, even if we never freeze, the number of B-Tree pages that we delete when we VACUUM today is generally a useless predictor of how many will be deleted by a VACUUM that occurs tomorrow. This is true despite the fact that the number of dead heap tuples is probably almost identical for each VACUUM (or the number of heap pages that end up being processed by VACUUM, if you prefer). Barring any concerns about crash safety, we can be completely certain that any "recycling-orientated B-Tree VACUUM" (a btvacuumcleanup() call to btvacuumscan(), which happens because there are no tuples in the index to kill) will end up recycling however many pages the last VACUUM managed to delete, which is a precisely knowable number (or could be made knowable if we stashed that number somewhere, like the meta-page). 
It will typically only take seconds or minutes after the VACUUM finishes for its RecentGlobalXmin interlock to stop being a problem (that is, for _bt_page_recyclable() to return "true" for any pages that that VACUUM deleted). From that point on, those deleted pages are "money in the bank" for the FSM. The only reason why we'd want to tie "the FSM withdrawing that money" to VACUUM is because that might be needed to clean up regular bloat anyway. The test performed by this patch within lazy_scan_heap(), to determine whether we should avoid calling lazy_cleanup_index() would therefore look like this, ideally: Do I want to go to the trouble of scanning this index (a cost that is proportionate to the size of the index) in order to recycle this number of known-deleted pages (a benefit that is proportionate to that number)? (I still think that the important thing is that we don't let the number of unclaimed-by-FSM recyclable pages grow forever, though.) (Thinks about it some more...) Unfortunately, I just saw a whole new problem with this patch: _bt_page_recyclable() is the one place in the B-Tree AM where we stash an XID. We don't need to set this to FrozenTransactionId at any point, because this is stashed for deleted pages only, pages that are likely to be recycled very soon. It might be that the VACUUM that ends up deleting any such page is an anti-wraparound VACUUM, especially in the case that this patch really wants to improve. However, with this patch, that recycling won't happen, of course. As a result, _bt_page_recyclable() will falsely report that the page is not recyclable if it is ever asked again. -- Peter Geoghegan
On Sat, Mar 4, 2017 at 5:59 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Mar 3, 2017 at 2:41 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> In other words, the number of B-Tree pages that the last VACUUM >> deleted, and thus made eligible to recycle by the next VACUUM has no >> relationship with the number of pages the next VACUUM will itself end >> up deleting, in general, or how long it will be before that next >> VACUUM comes, if it comes at all, or anything else that seems at all >> relevant. > > This raises another question, though: Why have this GUC at all? Why > use *any* threshold that is to be compared against the number of heap > pages that were processed by VACUUM this time? > > B-Tree page deletion isn't really part of the ordinary life cycle of a > B-Tree index. In order for that to be the case, somebody would have to > delete large ranges of indexed values (typically hundreds of logically > contiguous values -- no gaps), without anyone else ever inserting new > tuples that are in the same range before the next VACUUM. It's very > unlikely that this would happen again and again in the real world. So, > even if we never freeze, the number of B-Tree pages that we delete > when we VACUUM today is generally a useless predictor of how many will > be deleted by a VACUUM that occurs tomorrow. This is true despite the > fact that the number of dead heap tuples is probably almost identical > for each VACUUM (or the number of heap pages that end up being > processed by VACUUM, if you prefer). > > Barring any concerns about crash safety, we can be completely certain > that any "recycling-orientated B-Tree VACUUM" (a btvacuumcleanup() > call to btvacuumscan(), which happens because there are no tuples in > the index to kill) will end up recycling however many pages the last > VACUUM managed to delete, which is a precisely knowable number (or > could be made knowable if we stashed that number somewhere, like the > meta-page). 
It will typically only take seconds or minutes after the > VACUUM finishes for its RecentGlobalXmin interlock to stop being a > problem (that is, for _bt_page_recyclable() to return "true" for any > pages that that VACUUM deleted). From that point on, those deleted > pages are "money in the bank" for the FSM. The only reason why we'd > want to tie "the FSM withdrawing that money" to VACUUM is because that > might be needed to clean up regular bloat anyway. > > The test performed by this patch within lazy_scan_heap(), to determine > whether we should avoid calling lazy_cleanup_index() would therefore > look like this, ideally: Do I want to go to the trouble of scanning > this index (a cost that is proportionate to the size of the index) in > order to recycle this number of known-deleted pages (a benefit that is > proportionate to that number)? (I still think that the important thing > is that we don't let the number of unclaimed-by-FSM recyclable pages > grow forever, though.) > You are right that we don't want the number of unclaimed-by-FSM recyclable pages to grow forever, but I think that won't happen with this patch. As soon as there are more deletions (in heap), in the next vacuum cycle, the pages will be reclaimed by lazy_vacuum_index(). > (Thinks about it some more...) > > Unfortunately, I just saw a whole new problem with this patch: > _bt_page_recyclable() is the one place in the B-Tree AM where we stash > an XID. > Can you be more specific about which code you are referring to here? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 3, 2017 at 6:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > You are right that we don't want the number of unclaimed-by-FSM > recyclable pages to grow forever, but I think that won't happen with > this patch. As soon as there are more deletions (in heap), in the > next vacuum cycle, the pages will be reclaimed by lazy_vacuum_index(). Right. >> (Thinks about it some more...) >> >> Unfortunately, I just saw a whole new problem with this patch: >> _bt_page_recyclable() is the one place in the B-Tree AM where we stash >> an XID. >> > > Can you be more specific about which code you are referring to here? I meant that we stash an XID when a B-Tree page is deleted, used to determine when it's finally safe to recycle the page by comparing it to RecentGlobalXmin (recycling will happen during the next VACUUM). We can't immediately recycle a page, because an existing index scan might land on the page while following a stale pointer. _bt_page_recyclable(), which checks if recycling is safe (no possible index scan can land on the dead page), is a pretty small function. I'm concerned about what happens when pg_class.relfrozenxid/pg_database.datfrozenxid are advanced past opaque->btpo.xact. While I can't see this explained anywhere, I'm pretty sure that that's supposed to be impossible, which this patch changes. -- Peter Geoghegan
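The interlock described above can be sketched in a few lines of C. This is a simplified, self-contained model only, not the actual nbtree code (the real test lives in _bt_page_recyclable() and uses TransactionIdPrecedes()); names here are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Modular-arithmetic XID comparison, in the style of Postgres's
 * TransactionIdPrecedes(): XIDs live on a 32-bit circle, so "a < b"
 * is decided by the sign of the 32-bit difference.
 */
static bool
transaction_id_precedes(TransactionId a, TransactionId b)
{
    int32_t diff = (int32_t) (a - b);

    return diff < 0;
}

/*
 * Simplified model of the recyclability check: a deleted page whose
 * stashed deletion XID (btpo.xact) precedes RecentGlobalXmin can no
 * longer be visited by any in-flight index scan following a stale
 * pointer, so it is safe to recycle.
 */
static bool
page_recyclable(TransactionId btpo_xact, TransactionId recent_global_xmin)
{
    return transaction_id_precedes(btpo_xact, recent_global_xmin);
}
```

Note that the modular comparison only works while the deletion XID is within about two billion XIDs of the current horizon, which is exactly why an un-recycled page outliving relfrozenxid advancement is a concern.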
On Sat, Mar 4, 2017 at 8:55 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Mar 3, 2017 at 6:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> You are right that we don't want the number of unclaimed-by-FSM >> recyclable pages to grow forever, but I think that won't happen with >> this patch. As soon as there are more deletions (in heap), in the >> next vacuum cycle, the pages will be reclaimed by lazy_vacuum_index(). > > Right. > >>> (Thinks about it some more...) >>> >>> Unfortunately, I just saw a whole new problem with this patch: >>> _bt_page_recyclable() is the one place in the B-Tree AM where we stash >>> an XID. >>> >> >> Can you be more specific about which code you are referring to here? > > I meant that we stash an XID when a B-Tree page is deleted, used to > determine when it's finally safe to recycle the page by comparing > it to RecentGlobalXmin (recycling will happen during the next VACUUM). > We can't immediately recycle a page, because an existing index scan > might land on the page while following a stale pointer. > > _bt_page_recyclable(), which checks if recycling is safe (no possible > index scan can land on the dead page), is a pretty small function. I'm > concerned about what happens when > pg_class.relfrozenxid/pg_database.datfrozenxid are advanced past > opaque->btpo.xact. > Are you talking about the pg_class.relfrozenxid of the index or the table? IIUC, for indexes it will be InvalidTransactionId, and for tables I think it will be updated with or without the patch. > While I can't see this explained anywhere, I'm > pretty sure that that's supposed to be impossible, which this patch > changes. > What makes you think that the patch will allow pg_class.relfrozenxid to be advanced past opaque->btpo.xact, which was previously not possible? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Mar 4, 2017 at 4:37 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Mar 3, 2017 at 8:13 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Thank you for clarification. Let me check my understanding. IIUC, >> skipping second index vacuum path (lazy_cleanup_index) can not be >> cause of leaving page as half-dead state but could leave recyclable >> pages that are not marked as a recyclable. But second one, it can be >> reclaimed by next index vacuum because btvacuumpage calls >> RecordFreeIndexPage for recyclable page. Am I missing something? > > I think that this is a good summary. You have not missed anything. > >> My first motivation of this patch is to skip the second index vacuum >> patch when vacuum skipped whole table by visibility map. > > Makes sense. > >> But as Robert >> suggested on another thread, I changed it to have a threshold. If my >> understanding is correct, we can have a threshold that specifies the >> fraction of the scanned pages by vacuum. If it's set 0.1, >> lazy_scan_heap can do the second vacuum index only when 10% of table >> is scanned. IOW, if 90% of table pages is skipped, which means almost >> of table has not changed since previous vacuum, we can skip the second >> index vacuum. > > It sounds like a setting of 0.1 would leave us in a position where > it's very unlikely that a VACUUM of indexes could fail to occur when > autovacuum has been triggered in the common way, by the "vacuum > threshold" having been exceeded. Does this feeling that I have about > it seem accurate to you? Is that actually your intent? It's hard to > really be sure, because there are so many confounding factors (e.g. > HOT), but if that was 100% true, then I suppose there would > theoretically be zero new risk (except, perhaps, of the "other type of > bloat" that I described, which I am not very worried about). 
> > Please verify my understanding of your thought process: We don't have > to freeze indexes at all, ever, so if we see index bloat as a separate > problem, we also see that there is no need to *link* index needs to > the need for freezing. XID burn rate is a very bad proxy for how > bloated an index may be. Besides, we already have a separate trigger > for the thing that *actually* matters to indexes (the vacuum threshold > stuff). > > Maybe 0.0 is too low as a default for > vacuum_cleanup_index_scale_factor, even though initially it seemed > attractive to me for theoretical reasons. Something like 0.01 is still > "practically zero", but removes the extreme sensitivity that we'd > have with 0.0. So, 0.01 might make sense as a default for roughly the > same reason that autovacuum_vacuum_threshold exists. (Maybe it should > be more like autovacuum_vacuum_threshold, in that it's an absolute > minimum number of heap blocks to trigger index clean-up.) Agreed. Attached is the current design patch; I've also changed the default value to 0.01. Two issues remain: the first is what you mentioned, that pg_class.relfrozenxid/pg_database.datfrozenxid could be advanced past opaque->btpo.xact. The second is to find out the impact on index types other than btree. I'm investigating the second one now. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
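For reference, the decision being discussed for lazy_scan_heap() can be modeled like this. This is a rough sketch under the 0.01 default mentioned above; the helper name and signature are hypothetical, not the patch's actual code:

```c
#include <assert.h>
#include <stdbool.h>

/* GUC discussed above; 0.01 is "practically zero" but not exactly zero. */
static double vacuum_cleanup_index_scale_factor = 0.01;

/*
 * Hypothetical model of the lazy_scan_heap() decision: perform index
 * cleanup (lazy_cleanup_index) only when the fraction of heap pages
 * actually scanned by this VACUUM (i.e. not skipped thanks to the
 * visibility/freeze map) reaches the threshold.
 */
static bool
should_cleanup_index(double scanned_pages, double total_pages)
{
    if (total_pages <= 0)
        return false;           /* nothing scanned, nothing to do */

    return (scanned_pages / total_pages) >=
        vacuum_cleanup_index_scale_factor;
}
```

With the 0.0 default originally proposed, a single scanned page would trigger cleanup; with 0.01, a VACUUM that skips more than 99% of the heap skips index cleanup too.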
On Sat, Mar 4, 2017 at 1:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> While I can't see this explained anywhere, I'm >> pretty sure that that's supposed to be impossible, which this patch >> changes. >> > > What makes you think that patch will allow pg_class.relfrozenxid to be > advanced past opaque->btpo.xact which was previously not possible? By not reliably recycling pages in a timely manner, we won't change anything about the dead page. It just sticks around. This is mostly fine, but we still need VACUUM to be able to reason about it (to determine if it is in fact recyclable), which is what I'm concerned about breaking here. It still needs to be *possible* to recycle any recyclable page at some point (e.g., when we find it convenient). pg_class.relfrozenxid is InvalidTransactionId for indexes because indexes generally don't store XIDs. This is the one exception that I'm aware of, presumably justified by the fact that it's only for recyclable pages anyway, and those are currently *guaranteed* to get recycled quickly. In particular, they're guaranteed to get recycled by the next VACUUM. They may be recycled in the course of anti-wraparound VACUUM, even if VACUUM has no garbage tuples to kill (even if we only do lazy_cleanup_index() instead of lazy_vacuum_index()). This is the case that this patch proposes to have us skip touching indexes for. -- Peter Geoghegan
On Wed, Mar 8, 2017 at 1:43 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Sat, Mar 4, 2017 at 1:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> While I can't see this explained anywhere, I'm >>> pretty sure that that's supposed to be impossible, which this patch >>> changes. >>> >> >> What makes you think that patch will allow pg_class.relfrozenxid to be >> advanced past opaque->btpo.xact which was previously not possible? > > By not reliably recycling pages in a timely manner, we won't change > anything about the dead page. It just sticks around. This is mostly > fine, but we still need VACUUM to be able to reason about it (to > determine if it is in fact recyclable), which is what I'm concerned > about breaking here. It still needs to be *possible* to recycle any > recyclable page at some point (e.g., when we find it convenient). > > pg_class.relfrozenxid is InvalidTransactionId for indexes because > indexes generally don't store XIDs. This is the one exception that I'm > aware of, presumably justified by the fact that it's only for > recyclable pages anyway, and those are currently *guaranteed* to get > recycled quickly. In particular, they're guaranteed to get recycled by > the next VACUUM. They may be recycled in the course of anti-wraparound > VACUUM, even if VACUUM has no garbage tuples to kill (even if we only > do lazy_cleanup_index() instead of lazy_vacuum_index()). This is the > case that this patch proposes to have us skip touching indexes for. > To prevent this, I think we need to not skip lazy_cleanup_index during anti-wraparound vacuum (aggressive = true), even if the number of scanned pages is less than the threshold. This can ensure that pg_class.relfrozenxid is not advanced past opaque->btpo.xact with minimal pain. Even if the btree dead page is left behind, subsequent modifications make garbage on the heap, and then autovacuum recycles that page while vacuuming the index (lazy_vacuum_index).
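The rule proposed above amounts to one extra condition in the skip test. A minimal sketch (names hypothetical, not the patch's actual code): an aggressive (anti-wraparound) vacuum always performs index cleanup, so deleted B-Tree pages get recycled before relfrozenxid can pass btpo.xact; otherwise the scanned-page threshold applies:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the proposed rule: may we skip
 * lazy_cleanup_index for this VACUUM?
 */
static bool
skip_index_cleanup(bool aggressive, double scanned_frac, double threshold)
{
    if (aggressive)
        return false;           /* never skip in anti-wraparound vacuum */

    /* Otherwise, skip only when too little of the heap was scanned. */
    return scanned_frac < threshold;
}
```

The cost is that the case the patch most wants to optimize, a huge frozen table being vacuumed for wraparound, is precisely the case where cleanup still runs.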
Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Mar 9, 2017 at 8:51 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> pg_class.relfrozenxid is InvalidTransactionId for indexes because >> indexes generally don't store XIDs. This is the one exception that I'm >> aware of, presumably justified by the fact that it's only for >> recyclable pages anyway, and those are currently *guaranteed* to get >> recycled quickly. In particular, they're guaranteed to get recycled by >> the next VACUUM. They may be recycled in the course of anti-wraparound >> VACUUM, even if VACUUM has no garbage tuples to kill (even if we only >> do lazy_cleanup_index() instead of lazy_vacuum_index()). This is the >> case that this patch proposes to have us skip touching indexes for. >> > > To prevent this, I think we need to not skip lazy_cleanup_index > during anti-wraparound vacuum (aggressive = true), even if the number of > scanned pages is less than the threshold. This can ensure that > pg_class.relfrozenxid is not advanced past opaque->btpo.xact with > minimal pain. Even if the btree dead page is left behind, subsequent > modifications make garbage on the heap, and then autovacuum recycles that > page while vacuuming the index (lazy_vacuum_index). Sorry for the delay in getting back to you. I think that that's safe, but it is a little disappointing that it does not allow us to skip work in the case that you really had in mind when writing the patch. Better than nothing, though, and perhaps still a good start. I would like to hear other people's opinions. -- Peter Geoghegan
On Tue, Mar 14, 2017 at 2:48 PM, Peter Geoghegan <pg@bowt.ie> wrote: > I think that that's safe, but it is a little disappointing that it > does not allow us to skip work in the case that you really had in mind > when writing the patch. Better than nothing, though, and perhaps still > a good start. I would like to hear other people's opinions. Hmm. So, the SQL-callable function txid_current() exports a 64-bit XID, which includes an "epoch". If PostgreSQL used these 64-bit XIDs natively, we'd never need to freeze. Obviously we don't do that because the per-tuple overhead of visibility information is already high, and that would make it much worse. But, we can easily afford the extra overhead in a completely dead page. Maybe we can overcome the _bt_page_recyclable() limitation by being cleverer about how it determines if recycling is safe -- it can examine the epoch, too. This would also be required in the similar function vacuumRedirectAndPlaceholder(), which is a part of SP-GiST. We already have BTPageOpaqueData.btpo, a union whose contained type varies based on the page being dead. We could just do the same with some other field in that struct, and then store the epoch there. Clearly nobody really cares about most data that remains on the page. Index scans just need to be able to land on it to determine that it's dead, and VACUUM needs to be able to determine whether or not there could possibly be such an index scan at the time it considers recycling. Does anyone see a problem with that? -- Peter Geoghegan
On Tue, Mar 14, 2017 at 3:10 PM, Peter Geoghegan <pg@bowt.ie> wrote: > We already have BTPageOpaqueData.btpo, a union whose contained type > varies based on the page being dead. We could just do the same with > some other field in that struct, and then store epoch there. Clearly > nobody really cares about most data that remains on the page. Index > scans just need to be able to land on it to determine that it's dead, > and VACUUM needs to be able to determine whether or not there could > possibly be such an index scan at the time it considers recycling.. ISTM that we need all of the fields within BTPageOpaqueData even for dead pages, actually. The left links and right links still need to be sane, and the flag bits are needed. Plus, the field that stores an XID already is clearly necessary. Even if they weren't needed, it would probably still be a good idea to keep them around for forensic purposes. However, the page header field pd_prune_xid is currently unused for indexes, and is the same width as CheckPoint.nextXidEpoch (the extra thing we might want to store -- the epoch). Maybe you could store the epoch within that field when B-Tree VACUUM deletes a page, and then compare that within _bt_page_recyclable(). It would come before the existing XID comparison in that function. One nice thing about this idea is that pd_prune_xid will be all-zero for index pages from the current format, so there is no need to take special care to make sure that databases that have undergone pg_upgrade don't break. -- Peter Geoghegan
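The epoch idea above can be sketched as a plain 64-bit comparison. This is a hypothetical model only (it assumes the deletion epoch is stashed somewhere recoverable, e.g. pd_prune_xid as suggested), not actual nbtree code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Combine an epoch with a 32-bit XID to form a 64-bit "full" XID,
 * the same composition that txid_current() exposes at the SQL level.
 */
static uint64_t
full_xid(uint32_t epoch, TransactionId xid)
{
    return ((uint64_t) epoch << 32) | xid;
}

/*
 * Epoch-aware recyclability check: a plain 64-bit comparison, immune
 * to the 32-bit wraparound ambiguity that makes the current
 * _bt_page_recyclable() test unreliable once relfrozenxid has been
 * advanced far past the stashed btpo.xact.
 */
static bool
page_recyclable_epoch(uint32_t page_epoch, TransactionId btpo_xact,
                      uint32_t cur_epoch, TransactionId recent_global_xmin)
{
    return full_xid(page_epoch, btpo_xact) <
        full_xid(cur_epoch, recent_global_xmin);
}
```

A page deleted two epochs ago is trivially recyclable here, whereas the 32-bit modular comparison could wrongly report it as still unsafe.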
On Thu, Mar 9, 2017 at 10:21 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Mar 8, 2017 at 1:43 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> On Sat, Mar 4, 2017 at 1:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> While I can't see this explained anywhere, I'm >>>> pretty sure that that's supposed to be impossible, which this patch >>>> changes. >>>> >>> >>> What makes you think that patch will allow pg_class.relfrozenxid to be >>> advanced past opaque->btpo.xact which was previously not possible? >> >> By not reliably recycling pages in a timely manner, we won't change >> anything about the dead page. It just sticks around. This is mostly >> fine, but we still need VACUUM to be able to reason about it (to >> determine if it is in fact recyclable), which is what I'm concerned >> about breaking here. It still needs to be *possible* to recycle any >> recyclable page at some point (e.g., when we find it convenient). >> >> pg_class.relfrozenxid is InvalidTransactionId for indexes because >> indexes generally don't store XIDs. This is the one exception that I'm >> aware of, presumably justified by the fact that it's only for >> recyclable pages anyway, and those are currently *guaranteed* to get >> recycled quickly. In particular, they're guaranteed to get recycled by >> the next VACUUM. They may be recycled in the course of anti-wraparound >> VACUUM, even if VACUUM has no garbage tuples to kill (even if we only >> do lazy_cleanup_index() instead of lazy_vacuum_index()). This is the >> case that this patch proposes to have us skip touching indexes for. >> > > To prevent this, I think we need to not skip the lazy_cleanup_index > when ant-wraparound vacuum (aggressive = true) even if the number of > scanned pages is less then the threshold. This can ensure that > pg_class.relfrozenxid is not advanced past opaque->bp.xact with > minimal pain. 
Even if the btree dead page is left behind, the subsequent > modifications make garbage on the heap, and then autovacuum recycles that > page while vacuuming the index (lazy_vacuum_index). > What if somebody does a manual vacuum and there are no garbage tuples to clean? Wouldn't you want to avoid skipping lazy_cleanup_index in that case as well? Another option could be to skip updating the relfrozenxid if we have skipped the index cleanup. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 15, 2017 at 4:50 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Mar 9, 2017 at 10:21 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Wed, Mar 8, 2017 at 1:43 AM, Peter Geoghegan <pg@bowt.ie> wrote: >>> On Sat, Mar 4, 2017 at 1:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>>> While I can't see this explained anywhere, I'm >>>>> pretty sure that that's supposed to be impossible, which this patch >>>>> changes. >>>>> >>>> >>>> What makes you think that patch will allow pg_class.relfrozenxid to be >>>> advanced past opaque->btpo.xact which was previously not possible? >>> >>> By not reliably recycling pages in a timely manner, we won't change >>> anything about the dead page. It just sticks around. This is mostly >>> fine, but we still need VACUUM to be able to reason about it (to >>> determine if it is in fact recyclable), which is what I'm concerned >>> about breaking here. It still needs to be *possible* to recycle any >>> recyclable page at some point (e.g., when we find it convenient). >>> >>> pg_class.relfrozenxid is InvalidTransactionId for indexes because >>> indexes generally don't store XIDs. This is the one exception that I'm >>> aware of, presumably justified by the fact that it's only for >>> recyclable pages anyway, and those are currently *guaranteed* to get >>> recycled quickly. In particular, they're guaranteed to get recycled by >>> the next VACUUM. They may be recycled in the course of anti-wraparound >>> VACUUM, even if VACUUM has no garbage tuples to kill (even if we only >>> do lazy_cleanup_index() instead of lazy_vacuum_index()). This is the >>> case that this patch proposes to have us skip touching indexes for. >>> >> >> To prevent this, I think we need to not skip the lazy_cleanup_index >> when ant-wraparound vacuum (aggressive = true) even if the number of >> scanned pages is less then the threshold. 
This can ensure that >> pg_class.relfrozenxid is not advanced past opaque->btpo.xact with >> minimal pain. Even if the btree dead page is left behind, the subsequent >> modifications make garbage on the heap, and then autovacuum recycles that >> page while vacuuming the index (lazy_vacuum_index). >> > > What if somebody does a manual vacuum and there are no garbage > tuples to clean? Wouldn't you want to avoid skipping > lazy_cleanup_index in that case as well? Yes, in that case lazy_cleanup_index will be skipped. > Another option could be to skip updating the > relfrozenxid if we have skipped the index cleanup. This could make anti-wraparound VACUUM occur at high frequency, and then we could not skip lazy_cleanup_index during aggressive vacuum after all. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi, On 3/15/17 9:50 AM, Amit Kapila wrote: > > What about if somebody does manual vacuum and there are no garbage > tuples to clean, won't in that case also you want to avoid skipping > the lazy_cleanup_index? Another option could be to skip updating the > relfrozenxid if we have skipped the index cleanup. This thread has been idle for six days. Please respond and/or post a new patch by 2017-03-24 00:00 AoE (UTC-12) or this submission will be marked "Returned with Feedback". Thanks, -- -David david@pgmasters.net
On Wed, Mar 22, 2017 at 2:29 AM, David Steele <david@pgmasters.net> wrote: > Hi, > > On 3/15/17 9:50 AM, Amit Kapila wrote: >> >> >> What about if somebody does manual vacuum and there are no garbage >> tuples to clean, won't in that case also you want to avoid skipping >> the lazy_cleanup_index? Another option could be to skip updating the >> relfrozenxid if we have skipped the index cleanup. > > This thread has been idle for six days. Please respond and/or post a new > patch by 2017-03-24 00:00 AoE (UTC-12) or this submission will be marked > "Returned with Feedback". > I've changed the patch so that lazy_cleanup_index is not skipped during aggressive vacuum. Patch attached. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Mar 14, 2017 at 6:10 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Mar 14, 2017 at 2:48 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> I think that that's safe, but it is a little disappointing that it >> does not allow us to skip work in the case that you really had in mind >> when writing the patch. Better than nothing, though, and perhaps still >> a good start. I would like to hear other people's opinions. > > Hmm. So, the SQL-callable function txid_current() exports a 64-bit > XID, which includes an "epoch". If PostgreSQL used these 64-bit XIDs > natively, we'd never need to freeze. Obviously we don't do that > because the per-tuple overhead of visibility information is already > high, and that would make it much worse. But, we can easily afford the > extra overhead in a completely dead page. Maybe we can overcome the > _bt_page_recyclable() limitation by being cleverer about how it > determines if recycling is safe -- it can examine epoch, too. This > would also be required in the similar function > vacuumRedirectAndPlaceholder(), which is a part of SP-GiST. > > We already have BTPageOpaqueData.btpo, a union whose contained type > varies based on the page being dead. We could just do the same with > some other field in that struct, and then store epoch there. Clearly > nobody really cares about most data that remains on the page. Index > scans just need to be able to land on it to determine that it's dead, > and VACUUM needs to be able to determine whether or not there could > possibly be such an index scan at the time it considers recycling.. > > Does anyone see a problem with that? Wouldn't it break on-disk compatibility with existing btree indexes? I think we're still trying to solve a problem that Simon postulated in advance of evidence that shows how much of a problem it actually is. Not only might that be unnecessary, but if we don't have a test demonstrating the problem, we also don't have a test demonstrating that a given approach fixes it. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 21, 2017 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Wouldn't it break on-disk compatibility with existing btree indexes? Yes, it would, but see my later remarks on pd_prune_xid. I think that that would be safe. > I think we're still trying to solve a problem that Simon postulated in > advance of evidence that shows how much of a problem it actually is. I don't think we're still doing that. I think we're discussing the risk of recycling being broken indefinitely when it doesn't happen in time. > Not only might that be unnecessary, but if we don't have a test > demonstrating the problem, we also don't have a test demonstrating > that a given approach fixes it. Preventing recycling from happening until we feel like it is probably fine. It is not fine to break it forever, though. The specific problem is that there is an XID stored in dead B-Tree + SP-GiST pages that is used in the subsequent RecentGlobalXmin interlock that determines if recycling is safe (if there is no possible index scan that could land on the dead page). You know, the _bt_page_recyclable() check. -- Peter Geoghegan
On Tue, Mar 21, 2017 at 8:18 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> Not only might that be unnecessary, but if we don't have a test >> demonstrating the problem, we also don't have a test demonstrating >> that a given approach fixes it. > > Preventing recycling from happening until we feel like it is probably > fine. It is not fine to break it forever, though. The specific problem > is that there is an XID stored in dead B-Tree + SP-GiST pages that is > used in the subsequent RecentGlobalXmin interlock that determines if > recycling is safe (if there is no possible index scan that could land > on the dead page). You know, the _bt_page_recyclable() check. Oh. OK, so this is not just about bloat -- it's about whether this can be safely done at all. Somehow, I missed that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 15, 2017 at 7:51 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Mar 14, 2017 at 3:10 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> We already have BTPageOpaqueData.btpo, a union whose contained type >> varies based on the page being dead. We could just do the same with >> some other field in that struct, and then store epoch there. Clearly >> nobody really cares about most data that remains on the page. Index >> scans just need to be able to land on it to determine that it's dead, >> and VACUUM needs to be able to determine whether or not there could >> possibly be such an index scan at the time it considers recycling.. > > ISTM that we need all of the fields within BTPageOpaqueData even for > dead pages, actually. The left links and right links still need to be > sane, and the flag bits are needed. Plus, the field that stores an XID > already is clearly necessary. Even if they weren't needed, it would > probably still be a good idea to keep them around for forensic > purposes. However, the page header field pd_prune_xid is currently > unused for indexes, and is the same width as CheckPoint.nextXidEpoch > (the extra thing we might want to store -- the epoch). > > Maybe you could store the epoch within that field when B-Tree VACUUM > deletes a page, and then compare that within _bt_page_recyclable(). It > would come before the existing XID comparison in that function. One > nice thing about this idea is that pd_prune_xid will be all-zero for > index pages from the current format, so there is no need to take > special care to make sure that databases that have undergone > pg_upgrade don't break. > Thank you for the suggestion! If we store the epoch within the union field, I think we will not be able to use BTPageOpaqueData.btpo.xact at the same time. Since comparing btpo.xact is still necessary to determine whether the page is recyclable, we cannot store the epoch in the same union field.
And if we store it into BTPageOpaqueData, it would break disk compatibility. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 3/23/17 1:54 AM, Masahiko Sawada wrote: > On Wed, Mar 15, 2017 at 7:51 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> On Tue, Mar 14, 2017 at 3:10 PM, Peter Geoghegan <pg@bowt.ie> wrote: >>> We already have BTPageOpaqueData.btpo, a union whose contained type >>> varies based on the page being dead. We could just do the same with >>> some other field in that struct, and then store epoch there. Clearly >>> nobody really cares about most data that remains on the page. Index >>> scans just need to be able to land on it to determine that it's dead, >>> and VACUUM needs to be able to determine whether or not there could >>> possibly be such an index scan at the time it considers recycling.. >> >> ISTM that we need all of the fields within BTPageOpaqueData even for >> dead pages, actually. The left links and right links still need to be >> sane, and the flag bits are needed. Plus, the field that stores an XID >> already is clearly necessary. Even if they weren't needed, it would >> probably still be a good idea to keep them around for forensic >> purposes. However, the page header field pd_prune_xid is currently >> unused for indexes, and is the same width as CheckPoint.nextXidEpoch >> (the extra thing we might want to store -- the epoch). >> >> Maybe you could store the epoch within that field when B-Tree VACUUM >> deletes a page, and then compare that within _bt_page_recyclable(). It >> would come before the existing XID comparison in that function. One >> nice thing about this idea is that pd_prune_xid will be all-zero for >> index pages from the current format, so there is no need to take >> special care to make sure that databases that have undergone >> pg_upgrade don't break. >> > > Thank you for the suggestion! > If we store the poch within union field, I think we will not be able > to use BTPageOpaqueData.btpo.xact at the same time. 
Since comparing > btpo.xact is still necessary to determine if that page is recyclable > we cannot store the epoch into the same union field. And if we store > it into BTPageOpaqueData, it would break disk compatibility. I have marked this patch "Waiting for Author". This thread has been idle for five days. Please respond with a new patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked "Returned with Feedback". Thanks, -- -David david@pgmasters.net
On Wed, Mar 29, 2017 at 12:23 AM, David Steele <david@pgmasters.net> wrote: > On 3/23/17 1:54 AM, Masahiko Sawada wrote: >> >> On Wed, Mar 15, 2017 at 7:51 AM, Peter Geoghegan <pg@bowt.ie> wrote: >>> >>> On Tue, Mar 14, 2017 at 3:10 PM, Peter Geoghegan <pg@bowt.ie> wrote: >>>> >>>> We already have BTPageOpaqueData.btpo, a union whose contained type >>>> varies based on the page being dead. We could just do the same with >>>> some other field in that struct, and then store epoch there. Clearly >>>> nobody really cares about most data that remains on the page. Index >>>> scans just need to be able to land on it to determine that it's dead, >>>> and VACUUM needs to be able to determine whether or not there could >>>> possibly be such an index scan at the time it considers recycling.. >>> >>> >>> ISTM that we need all of the fields within BTPageOpaqueData even for >>> dead pages, actually. The left links and right links still need to be >>> sane, and the flag bits are needed. Plus, the field that stores an XID >>> already is clearly necessary. Even if they weren't needed, it would >>> probably still be a good idea to keep them around for forensic >>> purposes. However, the page header field pd_prune_xid is currently >>> unused for indexes, and is the same width as CheckPoint.nextXidEpoch >>> (the extra thing we might want to store -- the epoch). >>> >>> Maybe you could store the epoch within that field when B-Tree VACUUM >>> deletes a page, and then compare that within _bt_page_recyclable(). It >>> would come before the existing XID comparison in that function. One >>> nice thing about this idea is that pd_prune_xid will be all-zero for >>> index pages from the current format, so there is no need to take >>> special care to make sure that databases that have undergone >>> pg_upgrade don't break. >>> >> >> Thank you for the suggestion! >> If we store the poch within union field, I think we will not be able >> to use BTPageOpaqueData.btpo.xact at the same time. 
Since comparing >> btpo.xact is still necessary to determine if that page is recyclable >> we cannot store the epoch into the same union field. And if we store >> it into BTPageOpaqueData, it would break disk compatibility. > > > I have marked this patch "Waiting for Author". > > This thread has been idle for five days. Please respond with a new patch by > 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked "Returned > with Feedback". > I was thinking that the status of this patch is still "Needs review" because I sent the latest version patch[1]. We're still discussing how safely we can skip the cleanup vacuum on a btree index. That patch has some restrictions but it would be a good first step. [1] https://www.postgresql.org/message-id/CAD21AoDCmnoqKuKOmge6uc5AJAWOcLAz6jjB_WPSPFVQT5PkUA%40mail.gmail.com Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 3/29/17 2:23 AM, Masahiko Sawada wrote: > On Wed, Mar 29, 2017 at 12:23 AM, David Steele <david@pgmasters.net> wrote: >> On 3/23/17 1:54 AM, Masahiko Sawada wrote: >>> >>> On Wed, Mar 15, 2017 at 7:51 AM, Peter Geoghegan <pg@bowt.ie> wrote: >>>> >>>> On Tue, Mar 14, 2017 at 3:10 PM, Peter Geoghegan <pg@bowt.ie> wrote: >>>>> >>>>> We already have BTPageOpaqueData.btpo, a union whose contained type >>>>> varies based on the page being dead. We could just do the same with >>>>> some other field in that struct, and then store epoch there. Clearly >>>>> nobody really cares about most data that remains on the page. Index >>>>> scans just need to be able to land on it to determine that it's dead, >>>>> and VACUUM needs to be able to determine whether or not there could >>>>> possibly be such an index scan at the time it considers recycling.. >>>> >>>> >>>> ISTM that we need all of the fields within BTPageOpaqueData even for >>>> dead pages, actually. The left links and right links still need to be >>>> sane, and the flag bits are needed. Plus, the field that stores an XID >>>> already is clearly necessary. Even if they weren't needed, it would >>>> probably still be a good idea to keep them around for forensic >>>> purposes. However, the page header field pd_prune_xid is currently >>>> unused for indexes, and is the same width as CheckPoint.nextXidEpoch >>>> (the extra thing we might want to store -- the epoch). >>>> >>>> Maybe you could store the epoch within that field when B-Tree VACUUM >>>> deletes a page, and then compare that within _bt_page_recyclable(). It >>>> would come before the existing XID comparison in that function. One >>>> nice thing about this idea is that pd_prune_xid will be all-zero for >>>> index pages from the current format, so there is no need to take >>>> special care to make sure that databases that have undergone >>>> pg_upgrade don't break. >>>> >>> >>> Thank you for the suggestion! 
>>> If we store the poch within union field, I think we will not be able >>> to use BTPageOpaqueData.btpo.xact at the same time. Since comparing >>> btpo.xact is still necessary to determine if that page is recyclable >>> we cannot store the epoch into the same union field. And if we store >>> it into BTPageOpaqueData, it would break disk compatibility. >> >> >> I have marked this patch "Waiting for Author". >> >> This thread has been idle for five days. Please respond with a new patch by >> 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked "Returned >> with Feedback". >> > > I was thinking that the status of this patch is still "Needs review" > because I sent latest version patch[1]. We're still under discussion > how safely we can skip to the cleanup vacuum on btree index. That > patch has some restrictions but it would be good for first step. I've marked this patch as "Needs Review". Feel free to do that on your own if you think I've made a mistake in classifying a patch. My view is that the patch is stalled and something might be required on your part to get it moving again, perhaps trying another approach. -- -David david@pgmasters.net
On Wed, Mar 29, 2017 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I was thinking that the status of this patch is still "Needs review" > because I sent latest version patch[1]. I think you're right. I took a look at this today. I think there is some problem with the design of this patch. I originally proposed a threshold based on the percentage of not-all-visible pages on the theory that we'd just skip looking at the indexes altogether in that case. But that's not what the patch does: it only avoids the index *cleanup*, not the index *vacuum*. And the comments in btvacuumcleanup say this: /* * If btbulkdelete was called, we need not do anything, just return the * stats from the latest btbulkdelete call. If it wasn't called, we must * still do a pass over the index, to recycle any newly-recyclable pages * and to obtain index statistics. * * Since we aren't going to actually delete any leaf items, there's no * need to go through all the vacuum-cycle-ID pushups. */ So, if I'm reading this correctly, the only time this patch saves substantial work - at least in the case of a btree index - is in the case where there are no dead tuples at all. But if we only want to avoid the work in that case, then a threshold based on the percentage of all-visible pages seems like the wrong thing, because the other stuff btvacuumcleanup() is doing doesn't have anything to do with the number of all-visible pages. I'm not quite sure what the right thing to do is here, but I'm doubtful that this is it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 29, 2017 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> I was thinking that the status of this patch is still "Needs review" >> because I sent latest version patch[1]. > > I think you're right. > > I took a look at this today. I think there is some problem with the > design of this patch. I originally proposed a threshold based on the > percentage of not-all-visible pages on the theory that we'd just skip > looking at the indexes altogether in that case. But that's not what > the patch does: it only avoids the index *cleanup*, not the index > *vacuum*. Hmm, I guess there is a misunderstanding on this thread (my English might have confused you, sorry). I've been proposing a patch that allows lazy vacuum to skip only the index cleanup vacuum. The problem reported by xu jian[1] is that a second vacuum freeze takes a long time even if the table has not been modified since the first vacuum. The reason is that we cannot skip lazy_cleanup_index even if the table has not changed at all since the previous vacuum. If the table has garbage, the lazy vacuum does lazy_vacuum_index; on the other hand, if the table doesn't have garbage, which means that only insertions were executed or the table was not modified, the lazy vacuum does lazy_cleanup_index instead. Since I knew about this restriction, I proposed this; that's why the proposed new GUC parameter name has been "vacuum_cleanup_index_scale_factor". > And the comments in btvacuumcleanup say this: > > /* > * If btbulkdelete was called, we need not do anything, just return the > * stats from the latest btbulkdelete call. If it wasn't called, we must > * still do a pass over the index, to recycle any newly-recyclable pages > * and to obtain index statistics. > * > * Since we aren't going to actually delete any leaf items, there's no > * need to go through all the vacuum-cycle-ID pushups. 
> */ In the current design, we can skip lazy_cleanup_index in the case where the number of pages that need to be vacuumed is less than the threshold. For example, after the whole table is frozen by an aggressive vacuum, the vacuum can skip scanning the index if only a few insertions were executed on that table. But if a transaction created garbage on the table, this patch cannot prevent vacuuming of the index, as you mentioned. By skipping index cleanup we could leave recyclable pages that are not marked as recyclable. But they can be reclaimed by either the next index vacuum or the next index cleanup, because btvacuumpage calls RecordFreeIndexPage for recyclable pages. So I think it doesn't become a problem. And the next concern, pointed out by Peter, is that we stash an XID when a btree page is deleted, which is used to determine when it's finally safe to recycle the page. I prevent this situation by not allowing lazy vacuum to skip the index cleanup during an aggressive vacuum. But since this weakens the advantage of this patch, I'm considering a better idea. > So, if I'm reading this correctly, the only time this patch saves > substantial work - at least in the case of a btree index - is in the > case where there are no dead tuples at all. But if we only want to > avoid the work in that case, then a threshold based on the percentage > of all-visible pages seems like the wrong thing, because the other > stuff btvacuumcleanup() is doing doesn't have anything to do with the > number of all-visible pages. I thought that if the lazy vacuum skipped almost the whole table according to the visibility map, it means that the state of the table has changed just a little and we can skip updating the index statistics. Am I missing something? Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi, On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote: > On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > [ lots of valuable discussion ] I think this patch clearly still is in the design stage, and has received plenty feedback this CF. I'll therefore move this to the next commitfest. - Andres
On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote: >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> [ lots of valuable discussion ] > > I think this patch clearly still is in the design stage, and has > received plenty feedback this CF. I'll therefore move this to the next > commitfest. Does anyone have ideas on a way forward here? I don't, but then I haven't thought about it in detail in several months. -- Peter Geoghegan
I was just looking at this thread since it was found left alone for a long time in the CF app. At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in <CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com> > On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > > > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote: > >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > >> [ lots of valuable discussion ] > > > > I think this patch clearly still is in the design stage, and has > > received plenty feedback this CF. I'll therefore move this to the next > > commitfest. > > Does anyone have ideas on a way forward here? I don't, but then I > haven't thought about it in detail in several months. Is the additional storage in the metapage to store the current status of vacuum still unacceptable, even if it can avoid a useless full-page scan on indexes, especially for stable tables? Or, how about an additional 1 bit in pg_stat_*_index to indicate that the index *doesn't* require a vacuum cleanup stage? (The default value causes cleanup.) index_bulk_delete (or ambulkdelete) returns the flag in IndexBulkDeleteResult, then lazy_scan_heap stores the flag in stats, and in the next cycle it is looked up to decide the necessity of index cleanup. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Sep 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > I was just looking the thread since it is found left alone for a > long time in the CF app. > > At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in <CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com> >> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote: >> > Hi, >> > >> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote: >> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> [ lots of valuable discussion ] >> > >> > I think this patch clearly still is in the design stage, and has >> > received plenty feedback this CF. I'll therefore move this to the next >> > commitfest. >> >> Does anyone have ideas on a way forward here? I don't, but then I >> haven't thought about it in detail in several months. > > Is the additional storage in metapage to store the current status > of vaccum is still unacceptable even if it can avoid useless > full-page scan on indexes especially for stable tables? > > Or, how about additional 1 bit in pg_stat_*_index to indicate > that the index *don't* require vacuum cleanup stage. (default > value causes cleanup) > > index_bulk_delete (or ambulkdelete) returns the flag in > IndexBulkDeleteResult then lazy_scan_heap stores the flag in > stats and in the next cycle it is looked up to decide the > necessity of index cleanup. Maybe this is looking at the problem from the wrong direction. Why can't the page be added to the FSM immediately and the check be done at runtime when looking for a reusable page? Index FSMs currently store only 0 or 255, couldn't they store 128 for half-recyclable pages and make the caller re-check reusability before using it? This would make the second pass totally unnecessary, for a slight penalty when looking at the FSM. Ie: 2 lookups in the FSM instead of one. 
An initial one with min free space 128 to get a half-recyclable page; if the check fails on that page, a second lookup with min free space 255 to get a surely-recyclable page.
On Tue, Sep 19, 2017 at 4:47 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > Maybe this is looking at the problem from the wrong direction. > > Why can't the page be added to the FSM immediately and the check be > done at runtime when looking for a reusable page? > > Index FSMs currently store only 0 or 255, couldn't they store 128 for > half-recyclable pages and make the caller re-check reusability before > using it? No, because it's impossible for them to know whether or not the page that their index scan just landed on was recycled just a second ago, or was like this since before their xact began/snapshot was acquired. For your reference, this RecentGlobalXmin interlock stuff is what Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty Nodes". Seems pretty hard to do it any other way. -- Peter Geoghegan
Hi, At Tue, 19 Sep 2017 16:55:38 -0700, Peter Geoghegan <pg@bowt.ie> wrote in <CAH2-Wzn0-3zxGRp_qp1OaEXY7h1W0-W_VCFO0nDv0K_+kabyYQ@mail.gmail.com> > On Tue, Sep 19, 2017 at 4:47 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > > Maybe this is looking at the problem from the wrong direction. > > > > Why can't the page be added to the FSM immediately and the check be > > done at runtime when looking for a reusable page? > > > > Index FSMs currently store only 0 or 255, couldn't they store 128 for > > half-recyclable pages and make the caller re-check reusability before > > using it? > > No, because it's impossible for them to know whether or not the page > that their index scan just landed on recycled just a second ago, or > was like this since before their xact began/snapshot was acquired. > > For your reference, this RecentGlobalXmin interlock stuff is what > Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty > Nodes". Seems pretty hard to do it any other way. Anyway(:p) the attached first patch is a PoC for the cleanup-state-in-stats method that works only for btree. Some LOG-level debugging messages are put in the patch to show how it works. The following steps make a not-recyclable page, but I'm not sure it is general enough, and I couldn't generate half-dead pages. The pg_sleep() in the following steps is inserted in order to see the updated values in stats.

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a int);
CREATE INDEX ON t1 (a);
INSERT INTO t1 (SELECT a FROM generate_series(0, 800000) a);
DELETE FROM t1 WHERE a > 416700 AND a < 417250;
VACUUM t1;
DELETE FROM t1;
VACUUM t1; -- 1 (or wait for autovacuum)
select pg_sleep(1);
VACUUM t1; -- 2 (autovacuum doesn't work)
select pg_sleep(1);
VACUUM t1; -- 3 (ditto)

The following logs are emitted while the three VACUUMs are issued. 
# VACUUM t1; -- 1 (or wait for autovacuum)
LOG: btvacuumscan(t1_a_idx) result: deleted = 2185, notrecyclable = 1, hafldead = 0, no_cleanup_needed = false
LOG: Vacuum cleanup of index t1_a_idx is NOT skipped
LOG: btvacuumcleanup on index t1_a_idx is skipped since bulkdelete has run just before.

# VACUUM t1; -- 2
LOG: Vacuum cleanup of index t1_a_idx is NOT skipped
LOG: btvacuumscan(t1_a_idx) result: deleted = 2192, notrecyclable = 0, hafldead = 0, no_cleanup_needed = true

# VACUUM t1; -- 3
LOG: Vacuum cleanup of index t1_a_idx is skipped

VACUUM #1 leaves an unrecyclable page and requests the next cleanup. VACUUM #2 leaves no unrecyclable page and inhibits the next cleanup. VACUUM #3 (and ever after) executes no vacuum cleanup.

# I suppose it is a known issue that the cleanup cycles are not
# executed automatically unless new dead tuples are generated.

- Getting stats takes a very long time to fail during initdb. Since I couldn't find the right way to cope with this, I added a tentative function pgstat_live(), which checks that the backend has a valid stats socket.

- The patch calls pg_stat_get_vac_cleanup_needed using DirectFunctionCall. It might be better wrapped.

As a byproduct, this enables us to run extra autovacuum rounds for index cleanup. With the second attached, autovacuum works as follows. 
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a int);
CREATE INDEX ON t1 (a);
INSERT INTO t1 (SELECT a FROM generate_series(0, 800000) a);
DELETE FROM t1 WHERE a > 416700 AND a < 417250;

(autovacuum on t1 runs)
> LOG: btvacuumscan(t1_a_idx) result: deleted = 0, notrecyclable = 0, hafldead = 0, no_cleanup_needed = true
> LOG: Vacuum cleanup of index t1_a_idx is skipped
> LOG: automatic vacuum of table "postgres.public.t1": index scans: 1

DELETE FROM t1;

(autovacuum on t1 runs)
> LOG: btvacuumscan(t1_a_idx) result: deleted = 2185, notrecyclable = 1, hafldead = 0, no_cleanup_needed = false
> LOG: Vacuum cleanup of index t1_a_idx is NOT skipped
> LOG: btvacuumcleanup on index t1_a_idx is skipped since bulkdelete has run just before.
> LOG: automatic vacuum of table "postgres.public.t1": index scans: 1

(cleanup vacuum runs for t1 in the next autovac timing)
> LOG: Vacuum cleanup of index t1_a_idx is NOT skipped
> LOG: btvacuumscan(t1_a_idx) result: deleted = 2192, notrecyclable = 0, hafldead = 0, no_cleanup_needed = true
> LOG: automatic vacuum of table "postgres.public.t1": index scans: 0

Any suggestions are welcome. regards, -- Kyotaro Horiguchi NTT Open Source Software Center

*** a/src/backend/access/nbtree/nbtpage.c
--- b/src/backend/access/nbtree/nbtpage.c
*************** *** 1110,1116 **** _bt_pagedel(Relation rel, Buffer buf) { int ndeleted = 0; BlockNumber rightsib; ! bool rightsib_empty; Page page; BTPageOpaque opaque; --- 1110,1116 ---- { int ndeleted = 0; BlockNumber rightsib; ! 
bool rightsib_empty = false; Page page; BTPageOpaque opaque; *** a/src/backend/access/nbtree/nbtree.c --- b/src/backend/access/nbtree/nbtree.c *************** *** 63,68 **** typedef struct --- 63,70 ---- BlockNumber lastBlockLocked; /* highest blkno we've cleanup-locked */ BlockNumber totFreePages; /* true total # of free pages */ MemoryContext pagedelcontext; + uint32 pages_notrecyclable; /* # of not-yet-recyclable pages */ + uint32 pages_halfdead; /* # of half-dead pages */ } BTVacState; /* *************** *** 945,950 **** btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, --- 947,954 ---- IndexBulkDeleteResult * btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats) { + extern char *get_rel_name(Oid); + /* No-op in ANALYZE ONLY mode */ if (info->analyze_only) return stats; *************** *** 963,968 **** btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats) --- 967,977 ---- stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult)); btvacuumscan(info,stats, NULL, NULL, 0); } + else + ereport(LOG, + (errmsg ("btvacuumcleanup on index %s is skipped since bulkdelete has run just before.", + get_rel_name(info->index->rd_id)), + errhidestmt (true))); /* Finally, vacuum the FSM */ IndexFreeSpaceMapVacuum(info->index); *************** *** 1004,1009 **** btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, --- 1013,1019 ---- BlockNumber num_pages; BlockNumber blkno; bool needLock; + extern char *get_rel_name(Oid); /* * Reset counts that will be incremented during the scan; needed in case *************** *** 1022,1027 **** btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, --- 1032,1039 ---- vstate.lastBlockVacuumed = BTREE_METAPAGE; /* Initialise at first block */ vstate.lastBlockLocked= BTREE_METAPAGE; vstate.totFreePages = 0; + vstate.pages_notrecyclable = 0; + vstate.pages_halfdead = 0; /* Create a temporary memory context to run _bt_pagedel in */ vstate.pagedelcontext= 
AllocSetContextCreate(CurrentMemoryContext, *************** *** 1111,1116 **** btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, --- 1123,1139 ---- BlockNumber num_pages; BlockNumber blkno; bool needLock; + extern char *get_rel_name(Oid); /* * Reset counts that will be incremented during the scan; needed in case *************** *** 1022,1027 **** btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, --- 1032,1039 ---- vstate.lastBlockVacuumed = BTREE_METAPAGE; /* Initialise at first block */ vstate.lastBlockLocked = BTREE_METAPAGE; vstate.totFreePages = 0; + vstate.pages_notrecyclable = 0; + vstate.pages_halfdead = 0; /* Create a temporary memory context to run _bt_pagedel in */ vstate.pagedelcontext = *************** /* update statistics */ stats->num_pages = num_pages; stats->pages_free = vstate.totFreePages; + + /* check if we need no further cleanup */ + if (vstate.pages_notrecyclable == 0 && vstate.pages_halfdead == 0) + stats->no_cleanup_needed = true; + + ereport(LOG, + (errmsg ("btvacuumscan(%s) result: deleted = %d, notrecyclable = %d, hafldead = %d, no_cleanup_needed = %s", + get_rel_name(rel->rd_id), stats->pages_deleted, + vstate.pages_notrecyclable, vstate.pages_halfdead, + stats->no_cleanup_needed ? "true":"false"), + errhidestmt(true))); } /* *************** *** 1190,1195 **** restart: --- 1213,1219 ---- { /* Already deleted, but can't recycle yet */ stats->pages_deleted++; + vstate->pages_notrecyclable++; } else if (P_ISHALFDEAD(opaque)) { *************** *** 1359,1364 **** restart: --- 1383,1390 ---- /* count only this page, else may double-count parent */ if (ndel) stats->pages_deleted++; + else if (P_ISHALFDEAD(opaque)) + vstate->pages_halfdead++; /* Still half-dead */ MemoryContextSwitchTo(oldcontext); /* pagedel released buffer, so we shouldn't */ *** a/src/backend/commands/vacuumlazy.c --- b/src/backend/commands/vacuumlazy.c *************** *** 56,61 **** --- 56,62 ---- #include "storage/bufmgr.h" #include "storage/freespace.h" #include "storage/lmgr.h" + #include "utils/fmgrprotos.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/pg_rusage.h" *************** *** 129,134 **** typedef struct LVRelStats --- 130,137 ---- int num_index_scans; TransactionId latestRemovedXid; bool lock_waiter_detected; + int num_index_stats; + PgStat_MsgVacuum_indstate *indstats; } LVRelStats; *************** *** 152,158 **** static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats, LVRelStats *vacrelstats); static void lazy_cleanup_index(Relation indrel, ! 
IndexBulkDeleteResult *stats, LVRelStats *vacrelstats); static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer, int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer); --- 155,161 ---- IndexBulkDeleteResult **stats, LVRelStats *vacrelstats); static void lazy_cleanup_index(Relation indrel, ! IndexBulkDeleteResult **stats, LVRelStats *vacrelstats); static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer, int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer); *************** *** 342,348 **** lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params, pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared, new_live_tuples, ! vacrelstats->new_dead_tuples); pgstat_progress_end_command(); /* and log the action if appropriate */ --- 345,352 ---- pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared, new_live_tuples, ! vacrelstats->new_dead_tuples, ! vacrelstats->num_index_stats, vacrelstats->indstats); pgstat_progress_end_command(); /* and log the action if appropriate */ *************** *** 496,501 **** lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, --- 500,508 ---- indstats = (IndexBulkDeleteResult **) palloc0(nindexes * sizeof(IndexBulkDeleteResult *)); + vacrelstats->num_index_stats = nindexes; + vacrelstats->indstats = (PgStat_MsgVacuum_indstate *) + palloc0(nindexes * MAXALIGN(sizeof(PgStat_MsgVacuum_indstate))); nblocks = RelationGetNumberOfBlocks(onerel); vacrelstats->rel_pages = nblocks; *************** *** 1320,1326 **** lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, /* Do post-vacuum cleanup and statistics update for each index */ for (i = 0; i < nindexes; i++) ! lazy_cleanup_index(Irel[i], indstats[i], vacrelstats); /* If no indexes, make log report that lazy_vacuum_heap would've made */ if (vacuumed_pages) --- 1327,1344 ---- /* Do post-vacuum cleanup and statistics update for each index */ for (i = 0; i < nindexes; i++) ! { ! 
lazy_cleanup_index(Irel[i], &indstats[i], vacrelstats); ! ! /* update stats if indstats exists */ ! if (indstats[i]) ! { ! /* prepare to record the result to stats */ ! vacrelstats->indstats[i].indexoid = Irel[i]->rd_id; ! vacrelstats->indstats[i].vac_cleanup_needed = ! !(indstats[i] && indstats[i]->no_cleanup_needed); ! } ! } /* If no indexes, make log report that lazy_vacuum_heap would've made */ if (vacuumed_pages) *************** *** 1622,1632 **** lazy_vacuum_index(Relation indrel, */ static void lazy_cleanup_index(Relation indrel, ! IndexBulkDeleteResult *stats, LVRelStats *vacrelstats) { IndexVacuumInfo ivinfo; PGRUsage ru0; pg_rusage_init(&ru0); --- 1640,1652 ---- */ static void lazy_cleanup_index(Relation indrel, ! IndexBulkDeleteResult **stats, LVRelStats *vacrelstats) { IndexVacuumInfo ivinfo; PGRUsage ru0; + bool run_cleanup = true; + extern char *get_rel_name(Oid); pg_rusage_init(&ru0); *************** *** 1637,1655 **** lazy_cleanup_index(Relation indrel, ivinfo.num_heap_tuples = vacrelstats->new_rel_tuples; ivinfo.strategy= vac_strategy; ! stats = index_vacuum_cleanup(&ivinfo, stats); ! if (!stats) return; /* * Now update statistics in pg_class, but only if the index says the count * is accurate. */ ! if (!stats->estimated_count) vac_update_relstats(indrel, ! stats->num_pages, ! stats->num_index_tuples, 0, false, InvalidTransactionId, --- 1657,1696 ---- ivinfo.num_heap_tuples = vacrelstats->new_rel_tuples; ivinfo.strategy = vac_strategy; ! /* ! * If lazy_vacuum_index tells me that no cleanup is required, or stats ! * tells so, skip cleanup. ! */ ! if (*stats) ! { ! if ((*stats)->no_cleanup_needed) ! run_cleanup =false; ! } ! else ! run_cleanup = DatumGetBool( ! DirectFunctionCall1(pg_stat_get_vac_cleanup_needed, ! ObjectIdGetDatum(indrel->rd_id))); ! ereport(LOG, ! (errmsg ("Vacuum cleanup of index %s is %sskipped", ! get_rel_name(indrel->rd_id), ! run_cleanup ? "NOT ": ""), ! errhidestmt (true))); ! ! if (run_cleanup) ! 
*stats = index_vacuum_cleanup(&ivinfo, *stats); ! ! if (!*stats) return; /* * Now update statistics in pg_class, but only if the index says the count * is accurate. */ ! if (!(*stats)->estimated_count) vac_update_relstats(indrel, ! (*stats)->num_pages, ! (*stats)->num_index_tuples, 0, false, InvalidTransactionId, *************** *** 1659,1674 **** lazy_cleanup_index(Relation indrel, ereport(elevel, (errmsg("index \"%s\" now contains%.0f row versions in %u pages", RelationGetRelationName(indrel), ! stats->num_index_tuples, ! stats->num_pages), errdetail("%.0f index row versions were removed.\n" "%u index pages have been deleted, %u are currently reusable.\n" "%s.", ! stats->tuples_removed, ! stats->pages_deleted, stats->pages_free, pg_rusage_show(&ru0)))); - - pfree(stats); } /* --- 1700,1713 ---- ereport(elevel, (errmsg("index \"%s\" now contains %.0f row versions in %u pages", RelationGetRelationName(indrel), ! (*stats)->num_index_tuples, ! (*stats)->num_pages), errdetail("%.0f index row versions were removed.\n" "%u index pages have been deleted, %u are currently reusable.\n" "%s.", ! (*stats)->tuples_removed, ! (*stats)->pages_deleted, (*stats)->pages_free, pg_rusage_show(&ru0)))); } /* *** a/src/backend/postmaster/pgstat.c --- b/src/backend/postmaster/pgstat.c *************** *** 1403,1423 **** pgstat_report_autovac(Oid dboid) */ void pgstat_report_vacuum(Oid tableoid, bool shared, ! PgStat_Counter livetuples, PgStat_Counter deadtuples) { ! PgStat_MsgVacuum msg; if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts) return; ! pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM); ! msg.m_databaseid = shared ? InvalidOid : MyDatabaseId; ! msg.m_tableoid = tableoid; ! msg.m_autovacuum = IsAutoVacuumWorkerProcess(); ! msg.m_vacuumtime = GetCurrentTimestamp(); ! msg.m_live_tuples = livetuples; ! msg.m_dead_tuples = deadtuples; ! pgstat_send(&msg, sizeof(msg)); } /* -------- --- 1403,1437 ---- */ void pgstat_report_vacuum(Oid tableoid, bool shared, ! 
PgStat_Counter livetuples, PgStat_Counter deadtuples, ! int nindstats, PgStat_MsgVacuum_indstate *stats) { ! PgStat_MsgVacuum *msg; ! int i; ! int msgsize; if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts) return; + msgsize = offsetof(PgStat_MsgVacuum, m_indvacstates) + + MAXALIGN(sizeof(PgStat_MsgVacuum_indstate)) * nindstats; ! msg = (PgStat_MsgVacuum *) palloc(msgsize); ! pgstat_setheader(&msg->m_hdr, PGSTAT_MTYPE_VACUUM); ! msg->m_databaseid = shared ? InvalidOid : MyDatabaseId; ! msg->m_tableoid = tableoid; ! msg->m_autovacuum = IsAutoVacuumWorkerProcess(); ! msg->m_vacuumtime = GetCurrentTimestamp(); ! msg->m_live_tuples = livetuples; ! msg->m_dead_tuples = deadtuples; ! msg->m_n_indvac_states = nindstats; ! ! for (i = 0 ; i < nindstats ; i++) ! { ! msg->m_indvacstates[i].indexoid = stats[i].indexoid; ! msg->m_indvacstates[i].vac_cleanup_needed = stats[i].vac_cleanup_needed; ! } ! ! pgstat_send(msg, msgsize); } /* -------- *************** *** 1535,1541 **** pgstat_report_tempfile(size_t filesize) pgstat_send(&msg, sizeof(msg)); } ! /* ---------- * pgstat_ping() - * --- 1549,1561 ---- pgstat_send(&msg, sizeof(msg)); } ! bool ! pgstat_live(void) ! { ! if (pgStatSock == PGINVALID_SOCKET) ! return false; ! return true; ! 
} /* ---------- * pgstat_ping() - * *************** *** 4587,4592 **** pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create) --- 4607,4613 ---- result->analyze_count = 0; result->autovac_analyze_timestamp = 0; result->autovac_analyze_count= 0; + result->needs_vacuum_cleanup = true; } return result; *************** *** 5718,5723 **** pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len) --- 5739,5745 ---- tabentry->analyze_count = 0; tabentry->autovac_analyze_timestamp = 0; tabentry->autovac_analyze_count = 0; + tabentry->needs_vacuum_cleanup = true; } else { *************** *** 5963,5968 **** pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len) --- 5985,5991 ---- { PgStat_StatDBEntry *dbentry; PgStat_StatTabEntry *tabentry; + int i; /* * Store the data in the table's hashtable entry. *************** *** 5984,5989 **** pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len) --- 6007,6023 ---- tabentry->vacuum_timestamp = msg->m_vacuumtime; tabentry->vacuum_count++; } + + /* store index vacuum stats */ + for (i = 0 ; i < msg->m_n_indvac_states ; i++) + { + PgStat_StatTabEntry *indtabentry; + Oid indoid = msg->m_indvacstates[i].indexoid; + bool vac_cleanup_needed = msg->m_indvacstates[i].vac_cleanup_needed; + + indtabentry = pgstat_get_tab_entry(dbentry, indoid, true); + indtabentry->needs_vacuum_cleanup = vac_cleanup_needed; + } } /* ---------- *** a/src/backend/utils/adt/pgstatfuncs.c --- b/src/backend/utils/adt/pgstatfuncs.c *************** *** 27,32 **** --- 27,33 ---- #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" + #include "utils/syscache.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatileuint32 *)&(var)))) *************** *** 328,333 **** pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS) --- 329,366 ---- } Datum + pg_stat_get_vac_cleanup_needed(PG_FUNCTION_ARGS) + { + Oid relid = PG_GETARG_OID(0); + bool result; + PgStat_StatTabEntry *tabentry; + HeapTuple reltup; + bool is_index = 
false; + + if (!pgstat_live()) + return true; + + reltup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid)); + if (HeapTupleIsValid(reltup)) + { + if (((Form_pg_class) GETSTRUCT(reltup))->relkind == RELKIND_INDEX) + is_index = true; + + ReleaseSysCache(reltup); + } + + if (!is_index) + PG_RETURN_NULL(); + + if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL) + result = true; + else + result = tabentry->needs_vacuum_cleanup; + + PG_RETURN_BOOL(result); + } + + Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS) { Oid relid = PG_GETARG_OID(0); *** a/src/include/access/genam.h --- b/src/include/access/genam.h *************** *** 77,82 **** typedef struct IndexBulkDeleteResult --- 77,83 ---- double tuples_removed; /* # removed during vacuum operation */ BlockNumber pages_deleted; /* # unused pages in index */ BlockNumber pages_free; /* # pages available for reuse */ + bool no_cleanup_needed; /* true if no cleanup needed */ } IndexBulkDeleteResult; /* Typedef for callbackfunction to determine if a tuple is bulk-deletable */ *** a/src/include/access/nbtree.h --- b/src/include/access/nbtree.h *************** *** 416,421 **** typedef struct BTScanOpaqueData --- 416,422 ---- typedef BTScanOpaqueData *BTScanOpaque; + /* * We use some private sk_flags bits in preprocessed scan keys. We're allowed * to use bits 16-31 (see skey.h). 
The uppermost bits are copied from the *** a/src/include/catalog/pg_proc.h --- b/src/include/catalog/pg_proc.h *************** *** 2873,2878 **** DATA(insert OID = 3054 ( pg_stat_get_vacuum_count PGNSP PGUID 12 1 0 0 0 f f f f --- 2873,2880 ---- DESCR("statistics: number of manual vacuums for a table"); DATA(insert OID = 3055 ( pg_stat_get_autovacuum_countPGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autovacuum_count_null_ _null_ _null_ )); DESCR("statistics: number of auto vacuums for a table"); + DATA(insert OID = 3419 ( pg_stat_get_vac_cleanup_needed PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 16 "26" _null_ _null__null_ _null_ _null_ pg_stat_get_vac_cleanup_needed _null_ _null_ _null_ )); + DESCR("statistics: whether vacuum on a relation requires cleanup"); DATA(insert OID = 3056 ( pg_stat_get_analyze_countPGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_analyze_count_null_ _null_ _null_ )); DESCR("statistics: number of manual analyzes for a table"); DATA(insertOID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null__null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ )); *** a/src/include/pgstat.h --- b/src/include/pgstat.h *************** *** 360,365 **** typedef struct PgStat_MsgAutovacStart --- 360,372 ---- * after VACUUM * ---------- */ + typedef struct PgStat_MsgVacuum_indstate + { + Oid indexoid; + bool vac_cleanup_needed; + } PgStat_MsgVacuum_indstate; + + typedef struct PgStat_MsgVacuum { PgStat_MsgHdr m_hdr; *************** *** 369,377 **** typedef struct PgStat_MsgVacuum TimestampTz m_vacuumtime; PgStat_Counter m_live_tuples; PgStat_Counterm_dead_tuples; } PgStat_MsgVacuum; - /* ---------- * PgStat_MsgAnalyze Sent by the backend or autovacuum daemon * after ANALYZE --- 376,385 ---- TimestampTz m_vacuumtime; PgStat_Counter m_live_tuples; PgStat_Counter m_dead_tuples; + int 
m_n_indvac_states; + PgStat_MsgVacuum_indstate m_indvacstates[FLEXIBLE_ARRAY_MEMBER]; } PgStat_MsgVacuum; /* ---------- * PgStat_MsgAnalyze Sent by the backend or autovacuum daemon * after ANALYZE *************** *** 641,646 **** typedef struct PgStat_StatTabEntry --- 649,656 ---- PgStat_Counter analyze_count; TimestampTz autovac_analyze_timestamp; /* autovacuum initiated*/ PgStat_Counter autovac_analyze_count; + + bool needs_vacuum_cleanup; /* This index needs vac cleanup */ } PgStat_StatTabEntry; *************** *** 1159,1166 **** extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t extern void pgstat_report_autovac(Oiddboid); extern void pgstat_report_vacuum(Oid tableoid, bool shared, ! PgStat_Counter livetuples, PgStat_Counter deadtuples); ! extern void pgstat_report_analyze(Relation rel, PgStat_Counter livetuples, PgStat_Counter deadtuples, bool resetcounter); --- 1169,1177 ---- extern void pgstat_report_autovac(Oid dboid); extern void pgstat_report_vacuum(Oid tableoid, bool shared, ! PgStat_Counter livetuples, PgStat_Counter deadtuples, ! int nindstats, PgStat_MsgVacuum_indstate *states); ! 
extern void pgstat_report_analyze(Relation rel, PgStat_Counter livetuples, PgStat_Counter deadtuples, bool resetcounter); *************** *** 1172,1177 **** extern void pgstat_bestart(void); --- 1183,1189 ---- extern void pgstat_report_activity(BackendState state, const char *cmd_str); extern void pgstat_report_tempfile(size_tfilesize); + extern bool pgstat_live(void); extern void pgstat_report_appname(const char *appname); extern void pgstat_report_xact_timestamp(TimestampTztstamp); extern const char *pgstat_get_wait_event(uint32 wait_event_info); *** a/src/backend/postmaster/autovacuum.c --- b/src/backend/postmaster/autovacuum.c *************** *** 2791,2796 **** table_recheck_autovac(Oid relid, HTAB *table_toast_map, --- 2791,2803 ---- effective_multixact_freeze_max_age, &dovacuum,&doanalyze, &wraparound); + /* force vacuum if any index on the rel is requesting cleanup scan */ + if (!dovacuum) + dovacuum = + DatumGetBool( + DirectFunctionCall1(pg_stat_get_vac_cleanup_needed, + ObjectIdGetDatum(relid))); + /* ignore ANALYZE for toast tables */ if (classForm->relkind == RELKIND_TOASTVALUE) doanalyze = false; *************** *** 3045,3050 **** relation_needs_vacanalyze(Oid relid, --- 3052,3064 ---- /* Determine if this table needs vacuum or analyze. */ *dovacuum = force_vacuum || (vactuples> vacthresh); *doanalyze = (anltuples > anlthresh); + + /* still force vacuum if index cleanup is requested */ + if (!*dovacuum) + *dovacuum = + DatumGetBool( + DirectFunctionCall1(pg_stat_get_vac_cleanup_needed, + ObjectIdGetDatum(relid))); } else { *** a/src/backend/utils/adt/pgstatfuncs.c --- b/src/backend/utils/adt/pgstatfuncs.c *************** *** 349,361 **** pg_stat_get_vac_cleanup_needed(PG_FUNCTION_ARGS) ReleaseSysCache(reltup); } if (!is_index) ! PG_RETURN_NULL(); ! if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL) ! result = true; else ! 
result = tabentry->needs_vacuum_cleanup; PG_RETURN_BOOL(result); } --- 349,393 ---- ReleaseSysCache(reltup); } + /* + * If normal relaion is specified, return true if any index of the + * relation is explicitly requesting cleanup. + */ if (!is_index) ! { ! Relation indrel; ! SysScanDesc indscan; ! HeapTuple indtup; ! result = false; ! indrel = heap_open(IndexRelationId, AccessShareLock); ! indscan = systable_beginscan(indrel, InvalidOid, false, NULL, 0, NULL); ! while (HeapTupleIsValid(indtup = systable_getnext(indscan)) && ! !result) ! { ! Form_pg_index ind = (Form_pg_index) GETSTRUCT(indtup); ! ! if (ind->indrelid != relid) ! continue; ! ! if ((tabentry = pgstat_fetch_stat_tabentry(ind->indexrelid))) ! result |= tabentry->needs_vacuum_cleanup; ! } ! systable_endscan(indscan); ! heap_close(indrel, AccessShareLock); ! } else ! { ! /* ! * Elsewise reutrn the status of the index. As somewhat inconsistent ! * behavior with the normal relation case above, *true* is returned ! * for indexes with no stats here. ! */ ! if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL) ! result = true; ! else ! result = tabentry->needs_vacuum_cleanup; ! } PG_RETURN_BOOL(result); } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 19, 2017 at 8:55 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Sep 19, 2017 at 4:47 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Maybe this is looking at the problem from the wrong direction.
>>
>> Why can't the page be added to the FSM immediately and the check be
>> done at runtime when looking for a reusable page?
>>
>> Index FSMs currently store only 0 or 255, couldn't they store 128 for
>> half-recyclable pages and make the caller re-check reusability before
>> using it?
>
> No, because it's impossible for them to know whether or not the page
> that their index scan just landed on was recycled just a second ago, or
> was like this since before their xact began/snapshot was acquired.
>
> For your reference, this RecentGlobalXmin interlock stuff is what
> Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty
> Nodes". Seems pretty hard to do it any other way.

I don't see the difference between a vacuum run and distributed
maintenance at _bt_getbuf time. In fact, the code seems to be in
place already.

_bt_page_recyclable seems to prevent old transactions from treating
those pages as recyclable already, and the description of the
technique in 2.5 doesn't seem to preclude doing the drain while doing
other operations. In fact, Lehman even considers the possibility of
multiple concurrent garbage collectors.

It's only a matter of making the page visible in the FSM in a way that
can be efficiently skipped if we want to go directly to a page that
actually CAN be recycled, to avoid looping forever looking for a
recyclable page in _bt_getbuf. In fact, that's pretty much Lehman's
drain technique right there. FSM entries with 128 would be "the
queue", and FSM entries with 255 would be "the freelist". _bt_getbuf
can be the GC getting a buffer to try and recycle, give up after a few
tries, and get an actual recyclable buffer instead (or extend the
relation). In essence, microvacuum.
Unless I'm missing something and RecentGlobalXmin somehow doesn't
exclude all old transactions, I don't see a problem.

Lanin & Shasha use reference counting to do a GC wait during the drain,
and most of the ordering of operations needed there is because of
that, but using the xmin seems to make all those considerations moot.
An xact earlier than RecentGlobalXmin cannot have active transactions
able to follow links to that page AFAIK.

TBH, I didn't read the whole papers, though I probably will.

But, in essence, what's the difference between vacuum doing

    if (_bt_page_recyclable(page))
    {
        /* Okay to recycle this page */
        RecordFreeIndexPage(rel, blkno);
        vstate->totFreePages++;
        stats->pages_deleted++;
    }

vs doing it in _bt_getbuf?

What am I missing?
On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I was just looking at the thread since it was found left alone for a
> long time in the CF app.
>
> At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in <CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com>
>> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Hi,
>> >
>> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
>> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> >> [ lots of valuable discussion ]
>> >
>> > I think this patch clearly still is in the design stage, and has
>> > received plenty feedback this CF. I'll therefore move this to the next
>> > commitfest.
>>
>> Does anyone have ideas on a way forward here? I don't, but then I
>> haven't thought about it in detail in several months.
>
> Is the additional storage in the metapage to store the current status
> of vacuum still unacceptable, even if it can avoid a useless
> full-page scan on indexes, especially for stable tables?
>
> Or, how about an additional 1 bit in pg_stat_*_index to indicate
> that the index *doesn't* require the vacuum cleanup stage? (The default
> value causes cleanup.)

You meant that "the next cycle" is the lazy_cleanup_index() function
called by lazy_scan_heap()?

> index_bulk_delete (or ambulkdelete) returns the flag in
> IndexBulkDeleteResult, then lazy_scan_heap stores the flag in
> stats, and in the next cycle it is looked up to decide the
> necessity of index cleanup.

Could you elaborate on this? For example in a btree index, the index
cleanup skips the index scan if index_bulk_delete has been
called during vacuuming, because stats != NULL. So I think we don't
need such a flag.
Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com> > On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > I was just looking the thread since it is found left alone for a > > long time in the CF app. > > > > At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in <CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com> > >> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote: > >> > Hi, > >> > > >> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote: > >> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > >> >> [ lots of valuable discussion ] > >> > > >> > I think this patch clearly still is in the design stage, and has > >> > received plenty feedback this CF. I'll therefore move this to the next > >> > commitfest. > >> > >> Does anyone have ideas on a way forward here? I don't, but then I > >> haven't thought about it in detail in several months. > > > > Is the additional storage in metapage to store the current status > > of vaccum is still unacceptable even if it can avoid useless > > full-page scan on indexes especially for stable tables? > > > > Or, how about additional 1 bit in pg_stat_*_index to indicate > > that the index *don't* require vacuum cleanup stage. (default > > value causes cleanup) > > You meant that "the next cycle" is the lazy_cleanup_index() function > called by lazy_scan_heap()? Both finally call btvacuumscan under a certain condition, but what I meant by "the next cycle" is the lazy_cleanup_index call in the next round of vacuum since abstract layer (index am) isn't conscious of the detail of btree. 
> > index_bulk_delete (or ambulkdelete) returns the flag in
> > IndexBulkDeleteResult then lazy_scan_heap stores the flag in
> > stats and in the next cycle it is looked up to decide the
> > necessity of index cleanup.
>
> Could you elaborate about this? For example in btree index, the index
> cleanup skips the index scan if index_bulk_delete has been
> called during vacuuming because stats != NULL. So I think we don't
> need such a flag.

The flag works so that two successive index full scans don't
happen in a vacuum round. If any rows were fully deleted, the
following btvacuumcleanup does nothing.

I think what you wanted to solve here was the problem that
index_vacuum_cleanup runs a full scan even if it ends with no
actual work, for manual or anti-wraparound vacuums. (I'm
getting a bit confused on this..) It is caused by using the
pointer "stats" as the flag instructing it to do that. If the
stats-as-a-flag worked as expected, the GUC doesn't seem to be
required.

In addition to that, as Simon and Peter pointed out,
index_bulk_delete can leave not-fully-removed pages (so-called
half-dead pages and pages that are recyclable but not registered
in the FSM, AFAICS) in some cases, mainly because of the
RecentGlobalXmin interlock. In this case, just inhibiting the cleanup
scan by a threshold lets such dangling pages persist in the
index. (I couldn't make that many dangling pages, though..)

The first patch in the mail (*1) does that. It seems to have some
bugs, though..

Since the dangling pages persist until autovacuum decides to scan
the belonging table again, we should run a vacuum round (or
index_vacuum_cleanup itself) even with no dead rows if we want
to clean up such pages within a certain period. The second patch
does that.
[*1] https://www.postgresql.org/message-id/20170921.174957.236914340.horiguchi.kyotaro@lab.ntt.co.jp

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
I apologize in advance for possible silliness.

At Thu, 21 Sep 2017 13:54:01 -0300, Claudio Freire <klaussfreire@gmail.com> wrote in <CAGTBQpYvgdqxVaiyui=BKrzw7ZZfTQi9KECUL4-Lkc2ThqX8QQ@mail.gmail.com>
> On Tue, Sep 19, 2017 at 8:55 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> > On Tue, Sep 19, 2017 at 4:47 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> >> Maybe this is looking at the problem from the wrong direction.
> >>
> >> Why can't the page be added to the FSM immediately and the check be
> >> done at runtime when looking for a reusable page?
> >>
> >> Index FSMs currently store only 0 or 255, couldn't they store 128 for
> >> half-recyclable pages and make the caller re-check reusability before
> >> using it?
> >
> > No, because it's impossible for them to know whether or not the page
> > that their index scan just landed on was recycled just a second ago, or
> > was like this since before their xact began/snapshot was acquired.
> >
> > For your reference, this RecentGlobalXmin interlock stuff is what
> > Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty
> > Nodes". Seems pretty hard to do it any other way.
>
> I don't see the difference between a vacuum run and distributed
> maintenance at _bt_getbuf time. In fact, the code seems to be in
> place already.

The pages that RecentGlobalXmin prohibits from being registered as
"free" cannot be grabbed by _bt_getbuf, since such a page is linked
from nowhere and the FSM doesn't offer it as "free".

> _bt_page_recyclable seems to prevent old transactions from treating
> those pages as recyclable already, and the description of the
> technique in 2.5 doesn't seem to preclude doing the drain while doing
> other operations. In fact, Lehman even considers the possibility of
> multiple concurrent garbage collectors.
_bt_page_recyclable prevents a vacuum scan from discarding pages
that might still be reached from an active transaction, and the
"drain" itself is a technique to prevent freeing still-active pages,
so a scan using the "drain" technique can be executed freely alongside
other transactions. The paper might allow concurrent GCs (or vacuums),
but our nbtree says that no concurrent vacuum is assumed. Er... here
it is.

nbtpage.c:1589: _bt_unlink_halfdead_page

| * right. This search could fail if either the sibling or the target page
| * was deleted by someone else meanwhile; if so, give up. (Right now,
| * that should never happen, since page deletion is only done in VACUUM
| * and there shouldn't be multiple VACUUMs concurrently on the same
| * table.)

> It's only a matter of making the page visible in the FSM in a way that
> can be efficiently skipped if we want to go directly to a page that
> actually CAN be recycled to avoid looping forever looking for a
> recyclable page in _bt_getbuf. In fact, that's pretty much Lehman's

Mmm. What _bt_getbuf does is recheck the page given from the FSM as a
"free page". If the FSM gives no more pages, it just tries to extend
the index relation. Or am I reading you wrongly?

> drain technique right there. FSM entries with 128 would be "the
> queue", and FSM entries with 255 would be "the freelist". _bt_getbuf
> can be the GC getting a buffer to try and recycle, give up after a few
> tries, and get an actual recyclable buffer instead (or extend the
> relation). In essence, microvacuum.
>
> Unless I'm missing something and RecentGlobalXmin somehow doesn't
> exclude all old transactions, I don't see a problem.
>
> Lanin & Shasha use reference counting to do a GC wait during the drain,
> and most of the ordering of operations needed there is because of
> that, but using the xmin seems to make all those considerations moot.
> An xact earlier than RecentGlobalXmin cannot have active transactions
> able to follow links to that page AFAIK.
>
> TBH, I didn't read the whole papers, though I probably will.
>
> But, in essence, what's the difference of vacuum doing
>
>     if (_bt_page_recyclable(page))
>     {
>         /* Okay to recycle this page */
>         RecordFreeIndexPage(rel, blkno);
>         vstate->totFreePages++;
>         stats->pages_deleted++;
>     }
>
> VS doing it in _bt_getbuf?
>
> What am I missing?

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com> >> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > I was just looking the thread since it is found left alone for a >> > long time in the CF app. >> > >> > At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in <CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com> >> >> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote: >> >> > Hi, >> >> > >> >> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote: >> >> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> >> [ lots of valuable discussion ] >> >> > >> >> > I think this patch clearly still is in the design stage, and has >> >> > received plenty feedback this CF. I'll therefore move this to the next >> >> > commitfest. >> >> >> >> Does anyone have ideas on a way forward here? I don't, but then I >> >> haven't thought about it in detail in several months. >> > >> > Is the additional storage in metapage to store the current status >> > of vaccum is still unacceptable even if it can avoid useless >> > full-page scan on indexes especially for stable tables? >> > >> > Or, how about additional 1 bit in pg_stat_*_index to indicate >> > that the index *don't* require vacuum cleanup stage. (default >> > value causes cleanup) >> >> You meant that "the next cycle" is the lazy_cleanup_index() function >> called by lazy_scan_heap()? > > Both finally call btvacuumscan under a certain condition, but > what I meant by "the next cycle" is the lazy_cleanup_index call > in the next round of vacuum since abstract layer (index am) isn't > conscious of the detail of btree. 
>
>> > index_bulk_delete (or ambulkdelete) returns the flag in
>> > IndexBulkDeleteResult then lazy_scan_heap stores the flag in
>> > stats and in the next cycle it is looked up to decide the
>> > necessity of index cleanup.
>>
>> Could you elaborate about this? For example in btree index, the index
>> cleanup skips the index scan if index_bulk_delete has been
>> called during vacuuming because stats != NULL. So I think we don't
>> need such a flag.
>
> The flag works so that two successive index full scans don't
> happen in a vacuum round. If any rows were fully deleted, the
> following btvacuumcleanup does nothing.
>
> I think what you wanted to solve here was the problem that
> index_vacuum_cleanup runs a full scan even if it ends with no
> actual work, for manual or anti-wraparound vacuums. (I'm
> getting a bit confused on this..) It is caused by using the
> pointer "stats" as the flag instructing it to do that. If the
> stats-as-a-flag worked as expected, the GUC doesn't seem to be
> required.

Hmm, my proposal is that if a table hasn't changed much since the
previous vacuum, we skip cleaning up the index.

If the table has at least one garbage tuple, we do lazy_vacuum_index,
and an IndexBulkDeleteResult is stored, which causes the
btvacuumcleanup to be skipped. On the other hand, if the table doesn't
have any garbage but some new tuples were inserted since the previous
vacuum, we don't do lazy_vacuum_index but do lazy_cleanup_index. In
this case, we always do lazy_cleanup_index (i.e., we do the full scan)
even if only one tuple was inserted. That's why I proposed a new GUC
parameter which allows us to skip lazy_cleanup_index in that case.

> Addition to that, as Simon and Peter pointed out,
> index_bulk_delete can leave not-fully-removed pages (so-called
> half-dead pages and pages that are recyclable but not registered
> in FSM, AFAICS) in some cases, mainly by the RecentGlobalXmin
> interlock.
> In this case, just inhibiting the cleanup
> scan by a threshold lets such dangling pages persist in the index. (I
> couldn't make that many dangling pages, though..)
>
> The first patch in the mail (*1) does that. It seems to have some
> bugs, though..
>
> Since the dangling pages persist until autovacuum decides to scan
> the belonging table again, we should run a vacuum round (or
> index_vacuum_cleanup itself) even with no dead rows if we want
> to clean up such pages within a certain period. The second patch
> does that.

IIUC half-dead pages are not relevant to this proposal. The proposal
has two problems:

* By skipping index cleanup we could leave recyclable pages that are
  not marked as recyclable.

* We stash an XID when a btree page is deleted, which is used to
  determine when it's finally safe to recycle the page.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
At Fri, 22 Sep 2017 17:21:04 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBN9ucgMDuinx2ptU8upEToHnR-A35aBcQyZnLFvWdVPg@mail.gmail.com> > On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com> > >> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI > >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > >> Could you elaborate about this? For example in btree index, the index > >> cleanup skips to scan on the index scan if index_bulk_delete has been > >> called during vacuuming because stats != NULL. So I think we don't > >> need such a flag. > > > > The flag works so that successive two index full scans don't > > happen in a vacuum round. If any rows are fully deleted, just > > following btvacuumcleanup does nothing. > > > > I think what you wanted to solve here was the problem that > > index_vacuum_cleanup runs a full scan even if it ends with no > > actual work, when manual or anti-wraparound vacuums. (I'm > > getting a bit confused on this..) It is caused by using the > > pointer "stats" as the flag to instruct to do that. If the > > stats-as-a-flag worked as expected, the GUC doesn't seem to be > > required. > > Hmm, my proposal is like that if a table doesn't changed since the > previous vacuum much we skip the cleaning up index. > > If the table has at least one garbage we do the lazy_vacuum_index and > then IndexBulkDeleteResutl is stored, which causes to skip doing the > btvacuumcleanup. On the other hand, if the table doesn't have any > garbage but some new tuples inserted since the previous vacuum, we > don't do the lazy_vacuum_index but do the lazy_cleanup_index. In this > case, we always do the lazy_cleanup_index (i.g, we do the full scan) > even if only one tuple is inserted. 
That's why I proposed a new GUC > parameter which allows us to skip the lazy_cleanup_index in the case. I think the problem raised in this thread is that the last index scan may leave dangling pages. > > Addition to that, as Simon and Peter pointed out > > index_bulk_delete can leave not-fully-removed pages (so-called > > half-dead pages and pages that are recyclable but not registered > > in FSM, AFAICS) in some cases mainly by RecentGlobalXmin > > interlock. In this case, just inhibiting cleanup scan by a > > threshold lets such dangling pages persist in the index. (I > > conldn't make such a many dangling pages, though..) > > > > The first patch in the mail (*1) does that. It seems having some > > bugs, though.. > > > > > > Since the dangling pages persist until autovacuum decided to scan > > the belonging table again, we should run a vacuum round (or > > index_vacuum_cleanup itself) even having no dead rows if we want > > to clean up such pages within a certain period. The second patch > > doesn that. > > > > IIUC half-dead pages are not relevant to this proposal. The proposal > has two problems; > > * By skipping index cleanup we could leave recyclable pages that are > not marked as a recyclable. Yes. > * we stash an XID when a btree page is deleted, which is used to > determine when it's finally safe to recycle the page Is it a "problem" of this proposal? regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 22, 2017 at 4:46 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > I apologize in advance of possible silliness. > > At Thu, 21 Sep 2017 13:54:01 -0300, Claudio Freire <klaussfreire@gmail.com> wrote in <CAGTBQpYvgdqxVaiyui=BKrzw7ZZfTQi9KECUL4-Lkc2ThqX8QQ@mail.gmail.com> >> On Tue, Sep 19, 2017 at 8:55 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> > On Tue, Sep 19, 2017 at 4:47 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> >> Maybe this is looking at the problem from the wrong direction. >> >> >> >> Why can't the page be added to the FSM immediately and the check be >> >> done at runtime when looking for a reusable page? >> >> >> >> Index FSMs currently store only 0 or 255, couldn't they store 128 for >> >> half-recyclable pages and make the caller re-check reusability before >> >> using it? >> > >> > No, because it's impossible for them to know whether or not the page >> > that their index scan just landed on recycled just a second ago, or >> > was like this since before their xact began/snapshot was acquired. >> > >> > For your reference, this RecentGlobalXmin interlock stuff is what >> > Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty >> > Nodes". Seems pretty hard to do it any other way. >> >> I don't see the difference between a vacuum run and distributed >> maintainance at _bt_getbuf time. In fact, the code seems to be in >> place already. > > The pages prohibited to register as "free" by RecentGlobalXmin > cannot be grabbed _bt_getbuf since the page is liked from nowhere > nor FSM doesn't offer the pages is "free". Yes, but suppose vacuum did add them to the FSM in the first round, but with a special marker that differentiates them from immediately recycleable ones. >> _bt_page_recyclable seems to prevent old transactions from treating >> those pages as recyclable already, and the description of the >> technique in 2.5 doesn't seem to preclude doing the drain while doing >> other operations. 
In fact, Lehman even considers the possibility of >> multiple concurrent garbage collectors. > > _bt_page_recyclable prevent a vacuum scan from discarding pages > that might be looked from any active transaction, and the "drain" > itself is a technique to prevent freeing still-active pages so a > scan using the "drain" technique is freely executed > simultaneously with other transactions. The paper might allow > concurrent GCs (or vacuums) but our nbtree is saying that no > concurrent vacuum is assumed. Er... here it is. > > nbtpages.c:1589: _bt_unlink_halfdead_page > | * right. This search could fail if either the sibling or the target page > | * was deleted by someone else meanwhile; if so, give up. (Right now, > | * that should never happen, since page deletion is only done in VACUUM > | * and there shouldn't be multiple VACUUMs concurrently on the same > | * table.) Ok, yes, but we're not talking about halfdead pages, but deleted pages that haven't been recycled yet. >> It's only a matter of making the page visible in the FSM in a way that >> can be efficiently skipped if we want to go directly to a page that >> actually CAN be recycled to avoid looping forever looking for a >> recyclable page in _bt_getbuf. In fact, that's pretty much Lehman's > > Mmm. What _bt_getbuf does is recheck the page given from FSM as a > "free page". If FSM gives no more page, it just tries to extend > the index relation. Or am I reading you wrongly? On non-index FSMs, you can request a page that has at least N free bytes. Index FSMs always mark pages as fully empty or fully full, no in-betweens, but suppose we used that capability of the data structure to mark "maybe recycleable" pages with 50% free space, and "surely recycleable" pages with 100% free space. 
Then _bt_getbuf could request a 50% free page a few times, check whether each is recyclable (i.e. check _bt_page_recyclable), essentially doing a microvacuum on that page, and if it cannot find a recyclable page, try again with the 100% recyclable ones. The code is almost there; the only thing missing is the distinction between "maybe recyclable" and "surely recyclable" pages in the index FSM. Take this with a grain of salt, I'm not an expert on that code. But it seems feasible to me.
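Claudio's two-category FSM idea can be sketched as follows. This is a toy model in Python, assuming a dict-based FSM where the 128 and 255 categories stand for "maybe recyclable" and "surely recyclable"; real index FSMs store only 0 or 255, and all names here are hypothetical:

```python
MAYBE_RECYCLABLE = 128   # deleted, but possibly still protected by RecentGlobalXmin
RECYCLABLE = 255         # known safe to reuse

def bt_getbuf_sketch(fsm, is_recyclable, max_maybe_probes=3):
    """Prefer a surely-recyclable page; otherwise probe a bounded number
    of 'maybe' pages, re-checking recyclability at use time."""
    # First, take any page already known to be safe to reuse.
    for page, cat in fsm.items():
        if cat == RECYCLABLE:
            return page
    # Then re-check a bounded number of 'maybe' pages (the
    # _bt_page_recyclable re-check), so we cannot loop forever over
    # pages that are still protected by the interlock.
    probes = 0
    for page, cat in list(fsm.items()):
        if cat != MAYBE_RECYCLABLE or probes >= max_maybe_probes:
            continue
        probes += 1
        if is_recyclable(page):
            fsm[page] = RECYCLABLE   # promote for future callers
            return page
    return None                      # caller would extend the relation
```

The point of the sketch is only the control flow: vacuum could publish deleted pages immediately in the lower category, and the safety re-check moves from the cleanup scan to _bt_getbuf time.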
On 2017-09-22 11:21, Masahiko Sawada wrote: > On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada >> <sawada.mshk@gmail.com> wrote in >> <CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com> >>> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI >>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>> > I was just looking the thread since it is found left alone for a >>> > long time in the CF app. >>> > >>> > At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in <CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com> >>> >> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote: >>> >> > Hi, >>> >> > >>> >> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote: >>> >> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> >> >> [ lots of valuable discussion ] >>> >> > >>> >> > I think this patch clearly still is in the design stage, and has >>> >> > received plenty feedback this CF. I'll therefore move this to the next >>> >> > commitfest. >>> >> >>> >> Does anyone have ideas on a way forward here? I don't, but then I >>> >> haven't thought about it in detail in several months. >>> > >>> > Is the additional storage in metapage to store the current status >>> > of vaccum is still unacceptable even if it can avoid useless >>> > full-page scan on indexes especially for stable tables? >>> > >>> > Or, how about additional 1 bit in pg_stat_*_index to indicate >>> > that the index *don't* require vacuum cleanup stage. (default >>> > value causes cleanup) >>> >>> You meant that "the next cycle" is the lazy_cleanup_index() function >>> called by lazy_scan_heap()? 
>> >> Both finally call btvacuumscan under a certain condition, but >> what I meant by "the next cycle" is the lazy_cleanup_index call >> in the next round of vacuum since abstract layer (index am) isn't >> conscious of the detail of btree. >> >>> > index_bulk_delete (or ambulkdelete) returns the flag in >>> > IndexBulkDeleteResult then lazy_scan_heap stores the flag in >>> > stats and in the next cycle it is looked up to decide the >>> > necessity of index cleanup. >>> > >>> >>> Could you elaborate about this? For example in btree index, the index >>> cleanup skips to scan on the index scan if index_bulk_delete has been >>> called during vacuuming because stats != NULL. So I think we don't >>> need such a flag. >> >> The flag works so that successive two index full scans don't >> happen in a vacuum round. If any rows are fully deleted, just >> following btvacuumcleanup does nothing. >> >> I think what you wanted to solve here was the problem that >> index_vacuum_cleanup runs a full scan even if it ends with no >> actual work, when manual or anti-wraparound vacuums. (I'm >> getting a bit confused on this..) It is caused by using the >> pointer "stats" as the flag to instruct to do that. If the >> stats-as-a-flag worked as expected, the GUC doesn't seem to be >> required. > > Hmm, my proposal is like that if a table doesn't changed since the > previous vacuum much we skip the cleaning up index. > > If the table has at least one garbage we do the lazy_vacuum_index and > then IndexBulkDeleteResutl is stored, which causes to skip doing the > btvacuumcleanup. On the other hand, if the table doesn't have any > garbage but some new tuples inserted since the previous vacuum, we > don't do the lazy_vacuum_index but do the lazy_cleanup_index. In this > case, we always do the lazy_cleanup_index (i.g, we do the full scan) > even if only one tuple is inserted. That's why I proposed a new GUC > parameter which allows us to skip the lazy_cleanup_index in the case. 
> >> Addition to that, as Simon and Peter pointed out >> index_bulk_delete can leave not-fully-removed pages (so-called >> half-dead pages and pages that are recyclable but not registered >> in FSM, AFAICS) in some cases mainly by RecentGlobalXmin >> interlock. In this case, just inhibiting cleanup scan by a >> threshold lets such dangling pages persist in the index. (I >> conldn't make such a many dangling pages, though..) >> >> The first patch in the mail (*1) does that. It seems having some >> bugs, though.. >> >> >> Since the dangling pages persist until autovacuum decided to scan >> the belonging table again, we should run a vacuum round (or >> index_vacuum_cleanup itself) even having no dead rows if we want >> to clean up such pages within a certain period. The second patch >> doesn that. >> > > IIUC half-dead pages are not relevant to this proposal. The proposal > has two problems; > * By skipping index cleanup we could leave recyclable pages that are > not marked as a recyclable. > * we stash an XID when a btree page is deleted, which is used to > determine when it's finally safe to recycle the page > > Regards, > > -- > Masahiko Sawada > NIPPON TELEGRAPH AND TELEPHONE CORPORATION > NTT Open Source Software Center Here is a small patch that skips scanning the btree index if no pending deleted pages exist. It detects this situation by comparing pages_deleted with pages_free. If they are equal, then there are no half-deleted pages, and it is safe to skip the next lazy scan. The flag is stored in btpo_flags. It is unset with regular WAL before the scan. If no half-deleted pages are found, it is set without WAL (like a hint bit). (It is safe to miss setting the flag, but it is not safe to miss unsetting it.) This patch works correctly: - if rows are only inserted and never deleted, the index always skips scanning (starting with the second scan); - if some rows are updated/deleted, then some scans are not skipped.
But when all half-deleted pages are marked as deleted, lazy scans start to be skipped. Open question: what to do with index statistics? For simplicity this patch skips updating stats (it just returns NULL from btvacuumcleanup). Probably it should detect the proportion of table changes, and not skip scans if the table grows too much. -- Sokolov Yura Postgres Professional: https://postgrespro.ru The Russian Postgres Company
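The skip condition in Sokolov's patch can be reduced to a one-line invariant. This is a hedged sketch: the counter names loosely mirror the IndexBulkDeleteResult fields mentioned above, and the btpo_flags/WAL flag handling is deliberately omitted:

```python
def can_skip_next_cleanup(pages_deleted: int, pages_free: int) -> bool:
    """True when every page ever deleted has also been recorded as free,
    i.e. no pages are still pending recycling.

    pages_free can never exceed pages_deleted; equality means nothing is
    left half-done, so the next btree cleanup scan can be skipped.
    """
    return pages_deleted == pages_free
```

Once some pages are deleted but not yet freed, the condition fails and cleanup scans resume until recycling catches up.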
Attachment
On 2017-09-22 16:22, Sokolov Yura wrote: > On 2017-09-22 11:21, Masahiko Sawada wrote: >> On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>> At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada >>> <sawada.mshk@gmail.com> wrote in >>> <CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com> >>>> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI >>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>> > I was just looking the thread since it is found left alone for a >>>> > long time in the CF app. >>>> > >>>> > At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in <CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com> >>>> >> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote: >>>> >> > Hi, >>>> >> > >>>> >> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote: >>>> >> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> >> >> [ lots of valuable discussion ] >>>> >> > >>>> >> > I think this patch clearly still is in the design stage, and has >>>> >> > received plenty feedback this CF. I'll therefore move this to the next >>>> >> > commitfest. >>>> >> >>>> >> Does anyone have ideas on a way forward here? I don't, but then I >>>> >> haven't thought about it in detail in several months. >>>> > >>>> > Is the additional storage in metapage to store the current status >>>> > of vaccum is still unacceptable even if it can avoid useless >>>> > full-page scan on indexes especially for stable tables? >>>> > >>>> > Or, how about additional 1 bit in pg_stat_*_index to indicate >>>> > that the index *don't* require vacuum cleanup stage. (default >>>> > value causes cleanup) >>>> >>>> You meant that "the next cycle" is the lazy_cleanup_index() function >>>> called by lazy_scan_heap()? 
>>> >>> Both finally call btvacuumscan under a certain condition, but >>> what I meant by "the next cycle" is the lazy_cleanup_index call >>> in the next round of vacuum since abstract layer (index am) isn't >>> conscious of the detail of btree. >>> >>>> > index_bulk_delete (or ambulkdelete) returns the flag in >>>> > IndexBulkDeleteResult then lazy_scan_heap stores the flag in >>>> > stats and in the next cycle it is looked up to decide the >>>> > necessity of index cleanup. >>>> > >>>> >>>> Could you elaborate about this? For example in btree index, the >>>> index >>>> cleanup skips to scan on the index scan if index_bulk_delete has >>>> been >>>> called during vacuuming because stats != NULL. So I think we don't >>>> need such a flag. >>> >>> The flag works so that successive two index full scans don't >>> happen in a vacuum round. If any rows are fully deleted, just >>> following btvacuumcleanup does nothing. >>> >>> I think what you wanted to solve here was the problem that >>> index_vacuum_cleanup runs a full scan even if it ends with no >>> actual work, when manual or anti-wraparound vacuums. (I'm >>> getting a bit confused on this..) It is caused by using the >>> pointer "stats" as the flag to instruct to do that. If the >>> stats-as-a-flag worked as expected, the GUC doesn't seem to be >>> required. >> >> Hmm, my proposal is like that if a table doesn't changed since the >> previous vacuum much we skip the cleaning up index. >> >> If the table has at least one garbage we do the lazy_vacuum_index and >> then IndexBulkDeleteResutl is stored, which causes to skip doing the >> btvacuumcleanup. On the other hand, if the table doesn't have any >> garbage but some new tuples inserted since the previous vacuum, we >> don't do the lazy_vacuum_index but do the lazy_cleanup_index. In this >> case, we always do the lazy_cleanup_index (i.g, we do the full scan) >> even if only one tuple is inserted. 
That's why I proposed a new GUC >> parameter which allows us to skip the lazy_cleanup_index in the case. >> >>> >>> Addition to that, as Simon and Peter pointed out >>> index_bulk_delete can leave not-fully-removed pages (so-called >>> half-dead pages and pages that are recyclable but not registered >>> in FSM, AFAICS) in some cases mainly by RecentGlobalXmin >>> interlock. In this case, just inhibiting cleanup scan by a >>> threshold lets such dangling pages persist in the index. (I >>> conldn't make such a many dangling pages, though..) >>> >>> The first patch in the mail (*1) does that. It seems having some >>> bugs, though.. >>> >>> >>> Since the dangling pages persist until autovacuum decided to scan >>> the belonging table again, we should run a vacuum round (or >>> index_vacuum_cleanup itself) even having no dead rows if we want >>> to clean up such pages within a certain period. The second patch >>> doesn that. >>> >> >> IIUC half-dead pages are not relevant to this proposal. The proposal >> has two problems; >> * By skipping index cleanup we could leave recyclable pages that are >> not marked as a recyclable. >> * we stash an XID when a btree page is deleted, which is used to >> determine when it's finally safe to recycle the page >> >> Regards, >> >> -- >> Masahiko Sawada >> NIPPON TELEGRAPH AND TELEPHONE CORPORATION >> NTT Open Source Software Center > > Here is a small patch that skips scanning btree index if no pending > deleted pages exists. > > It detects this situation by comparing pages_deleted with pages_free. > If they are equal, then there is no half-deleted pages, and it is > safe to skip next lazy scan. > > Flag stored in a btpo_flags. It is unset using general wal before scan. > If no half-deleted pages found, it is set without wal (like hint bit). > (it is safe to miss setting flag, but it is not safe to miss unsetting > flag). 
> > This patch works correctly: > - if rows are only inserted and never deleted, index always skips > scanning (starting with second scan). > - if some rows updated/deleted, then some scans are not skipped. But > when all half-deleted pages are marked as deleted, lazy scans start to > be skipped. > > Open question: what to do with index statistic? For simplicity this > patch skips updating stats (just returns NULL from btvacuumcleanup). > Probably, it should detect proportion of table changes, and do not > skip scans if table grows too much. Excuse me, I didn't mean to overwrite the "last attachment" on the commitfest page. -- Sokolov Yura Postgres Professional: https://postgrespro.ru The Russian Postgres Company
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, passed We have been using Postgres with this patch for some time. In our use case we've got a quickly growing large table with events from our users. The table has a structure of (user_id, ts, <event data>). Events are append-only; each user generates events in a small predictable time frame, mostly each second. From time to time we need to read this table in the fashion of WHERE ts BETWEEN a AND b AND user_id=c. Such a query leads to an enormous amount of seeks, as records of each user are scattered across the relation and there are no pages that contain two events from the same user. To fight it, we created a btree index on (user_id, ts, <event_data>). The plan switched to index only scans, but heap fetches and execution times were still the same. We noticed that autovacuum skips scanning the relation and freezing the Visibility Map. We started frequently performing VACUUM manually on the relation. This helped with freezing the Visibility Map. However, we found out that VACUUM makes a full scan over the index. As the index does not fit into memory, this means that each run flushes all the disk caches and eats up Amazon IOPS credits. With this patch the behavior is much better for us - VACUUM finishes really quickly. As a future improvement, a similar improvement for other index types will be useful. After that happens, I'm looking forward to autovacuum kicking in on append-only tables to freeze the Visibility Map. The new status of this patch is: Ready for Committer
On Fri, Sep 22, 2017 at 5:31 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Fri, 22 Sep 2017 17:21:04 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBN9ucgMDuinx2ptU8upEToHnR-A35aBcQyZnLFvWdVPg@mail.gmail.com> >> On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com> >> >> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI >> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> >> Could you elaborate about this? For example in btree index, the index >> >> cleanup skips to scan on the index scan if index_bulk_delete has been >> >> called during vacuuming because stats != NULL. So I think we don't >> >> need such a flag. >> > >> > The flag works so that successive two index full scans don't >> > happen in a vacuum round. If any rows are fully deleted, just >> > following btvacuumcleanup does nothing. >> > >> > I think what you wanted to solve here was the problem that >> > index_vacuum_cleanup runs a full scan even if it ends with no >> > actual work, when manual or anti-wraparound vacuums. (I'm >> > getting a bit confused on this..) It is caused by using the >> > pointer "stats" as the flag to instruct to do that. If the >> > stats-as-a-flag worked as expected, the GUC doesn't seem to be >> > required. >> >> Hmm, my proposal is like that if a table doesn't changed since the >> previous vacuum much we skip the cleaning up index. >> >> If the table has at least one garbage we do the lazy_vacuum_index and >> then IndexBulkDeleteResutl is stored, which causes to skip doing the >> btvacuumcleanup. On the other hand, if the table doesn't have any >> garbage but some new tuples inserted since the previous vacuum, we >> don't do the lazy_vacuum_index but do the lazy_cleanup_index. 
In this >> case, we always do the lazy_cleanup_index (i.g, we do the full scan) >> even if only one tuple is inserted. That's why I proposed a new GUC >> parameter which allows us to skip the lazy_cleanup_index in the case. > > I think the problem raised in this thread is that the last index > scan may leave dangling pages. > >> > Addition to that, as Simon and Peter pointed out >> > index_bulk_delete can leave not-fully-removed pages (so-called >> > half-dead pages and pages that are recyclable but not registered >> > in FSM, AFAICS) in some cases mainly by RecentGlobalXmin >> > interlock. In this case, just inhibiting cleanup scan by a >> > threshold lets such dangling pages persist in the index. (I >> > conldn't make such a many dangling pages, though..) >> > >> > The first patch in the mail (*1) does that. It seems having some >> > bugs, though.. >> > >> > >> > Since the dangling pages persist until autovacuum decided to scan >> > the belonging table again, we should run a vacuum round (or >> > index_vacuum_cleanup itself) even having no dead rows if we want >> > to clean up such pages within a certain period. The second patch >> > doesn that. >> > >> >> IIUC half-dead pages are not relevant to this proposal. The proposal >> has two problems; >> >> * By skipping index cleanup we could leave recyclable pages that are >> not marked as a recyclable. > > Yes. > >> * we stash an XID when a btree page is deleted, which is used to >> determine when it's finally safe to recycle the page > > Is it a "problem" of this proposal? > As Peter explained before[1], the problem is that there is an XID stored in dead btree pages that is used in the subsequent RecentGlobalXmin interlock that determines if recycling is safe. 
[1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
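The XID interlock described here can be sketched in miniature. Plain integers stand in for PostgreSQL's wraparound-aware 32-bit XID comparison, so this is illustrative only; the function name is made up for the sketch:

```python
def page_is_recyclable(btpo_xact: int, recent_global_xmin: int) -> bool:
    """Mirrors the spirit of the _bt_page_recyclable() check:
    a deleted btree page stashes the XID current at deletion time, and
    may be recycled only once that XID precedes RecentGlobalXmin --
    i.e. no still-running transaction's snapshot could reach the page.
    (The real check uses TransactionIdPrecedes, which handles XID
    wraparound; plain < does not.)"""
    return btpo_xact < recent_global_xmin
```

This is why skipping cleanup indefinitely is a problem: the stashed XID is only ever consulted during an index scan, so a deleted page that is never revisited is never offered for reuse, however old its XID becomes.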
At Mon, 25 Sep 2017 19:20:07 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAzKJnc8UM728c0BMHx=7itJh4Db_Lj3Y31itnGrj-heQ@mail.gmail.com> > >> * we stash an XID when a btree page is deleted, which is used to > >> determine when it's finally safe to recycle the page > > > > Is it a "problem" of this proposal? > > > > As Peter explained before[1], the problem is that there is an XID > stored in dead btree pages that is used in the subsequent > RecentGlobalXmin interlock that determines if recycling is safe. > > [1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com Yeah, I know that, and I understand that this is the reason why it is bad to just skip cleanup based only on the GUC. On the other hand, by looking at the recycle status, I think we don't need the GUC. (I believe) the patch is *a PoC* in that direction. (I'd like to let RecordPageWithFreeSpace return the previous value to avoid a duplicate FSM search..) regards, -- Kyotaro Horiguchi NTT Open Source Software Center
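Horiguchi's parenthetical idea, sketched as a toy: a RecordPageWithFreeSpace-style helper that returns the page's previous FSM category, so the caller learns whether the page was already registered without issuing a second FSM lookup. The dict-based FSM and the exact signature are hypothetical, not the actual PostgreSQL API:

```python
def record_page_with_free_space(fsm: dict, page: int, category: int) -> int:
    """Set the page's FSM category and return the old category
    (0 if the page was not registered before)."""
    old = fsm.get(page, 0)
    fsm[page] = category
    return old
```

A caller could then count how many deleted pages were newly registered versus already known, which is exactly the "recycle status" Horiguchi suggests consulting instead of a GUC.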
At Fri, 22 Sep 2017 17:15:08 +0300, Sokolov Yura <y.sokolov@postgrespro.ru> wrote in <c7d736b5ea4cda67a644a0247f1a3951@postgrespro.ru> > On 2017-09-22 16:22, Sokolov Yura wrote: > > On 2017-09-22 11:21, Masahiko Sawada wrote: > >> On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI > >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > >>> At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada > >>> <sawada.mshk@gmail.com> wrote in > >>> <CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com> > >>>> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI > >>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > >>>> > I was just looking the thread since it is found left alone for a > >>>> > long time in the CF app. > >>>> > > >>>> > At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote > >>>> > in > >>>> > <CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com> > >>>> >> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> > >>>> >> wrote: > >>>> >> > Hi, > >>>> >> > > >>>> >> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote: > >>>> >> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas > >>>> >> >> <robertmhaas@gmail.com> wrote: > >>>> >> >> [ lots of valuable discussion ] > >>>> >> > > >>>> >> > I think this patch clearly still is in the design stage, and has > >>>> >> > received plenty feedback this CF. I'll therefore move this to the > >>>> >> > next > >>>> >> > commitfest. > >>>> >> > >>>> >> Does anyone have ideas on a way forward here? I don't, but then I > >>>> >> haven't thought about it in detail in several months. > >>>> > > >>>> > Is the additional storage in metapage to store the current status > >>>> > of vaccum is still unacceptable even if it can avoid useless > >>>> > full-page scan on indexes especially for stable tables? > >>>> > > >>>> > Or, how about additional 1 bit in pg_stat_*_index to indicate > >>>> > that the index *don't* require vacuum cleanup stage. 
(default > >>>> > value causes cleanup) > >>>> You meant that "the next cycle" is the lazy_cleanup_index() function > >>>> called by lazy_scan_heap()? > >>> Both finally call btvacuumscan under a certain condition, but > >>> what I meant by "the next cycle" is the lazy_cleanup_index call > >>> in the next round of vacuum since abstract layer (index am) isn't > >>> conscious of the detail of btree. > >>> > >>>> > index_bulk_delete (or ambulkdelete) returns the flag in > >>>> > IndexBulkDeleteResult then lazy_scan_heap stores the flag in > >>>> > stats and in the next cycle it is looked up to decide the > >>>> > necessity of index cleanup. > >>>> > > >>>> Could you elaborate about this? For example in btree index, the index > >>>> cleanup skips to scan on the index scan if index_bulk_delete has been > >>>> called during vacuuming because stats != NULL. So I think we don't > >>>> need such a flag. > >>> The flag works so that successive two index full scans don't > >>> happen in a vacuum round. If any rows are fully deleted, just > >>> following btvacuumcleanup does nothing. > >>> I think what you wanted to solve here was the problem that > >>> index_vacuum_cleanup runs a full scan even if it ends with no > >>> actual work, when manual or anti-wraparound vacuums. (I'm > >>> getting a bit confused on this..) It is caused by using the > >>> pointer "stats" as the flag to instruct to do that. If the > >>> stats-as-a-flag worked as expected, the GUC doesn't seem to be > >>> required. > >> Hmm, my proposal is like that if a table doesn't changed since the > >> previous vacuum much we skip the cleaning up index. > >> If the table has at least one garbage we do the lazy_vacuum_index and > >> then IndexBulkDeleteResutl is stored, which causes to skip doing the > >> btvacuumcleanup. On the other hand, if the table doesn't have any > >> garbage but some new tuples inserted since the previous vacuum, we > >> don't do the lazy_vacuum_index but do the lazy_cleanup_index. 
In this > >> case, we always do the lazy_cleanup_index (i.g, we do the full scan) > >> even if only one tuple is inserted. That's why I proposed a new GUC > >> parameter which allows us to skip the lazy_cleanup_index in the case. > >> > >>> Addition to that, as Simon and Peter pointed out > >>> index_bulk_delete can leave not-fully-removed pages (so-called > >>> half-dead pages and pages that are recyclable but not registered > >>> in FSM, AFAICS) in some cases mainly by RecentGlobalXmin > >>> interlock. In this case, just inhibiting cleanup scan by a > >>> threshold lets such dangling pages persist in the index. (I > >>> conldn't make such a many dangling pages, though..) > >>> The first patch in the mail (*1) does that. It seems having some > >>> bugs, though.. > >>> Since the dangling pages persist until autovacuum decided to scan > >>> the belonging table again, we should run a vacuum round (or > >>> index_vacuum_cleanup itself) even having no dead rows if we want > >>> to clean up such pages within a certain period. The second patch > >>> doesn that. > >>> > >> IIUC half-dead pages are not relevant to this proposal. The proposal > >> has two problems; > >> * By skipping index cleanup we could leave recyclable pages that are > >> not marked as a recyclable. > >> * we stash an XID when a btree page is deleted, which is used to > >> determine when it's finally safe to recycle the page > >> Regards, > >> -- > >> Masahiko Sawada > >> NIPPON TELEGRAPH AND TELEPHONE CORPORATION > >> NTT Open Source Software Center > > Here is a small patch that skips scanning btree index if no pending > > deleted pages exists. > > It detects this situation by comparing pages_deleted with pages_free. It seems to work to prevent needless cleanup scans. (Although I still uncertain how to deal with half-dead pages..) > > If they are equal, then there is no half-deleted pages, and it is > > safe to skip next lazy scan. > > Flag stored in a btpo_flags. 
It is unset using general wal before > > scan. I'm not sure it's good to add meta-page-specific information into per-page storage. > > If no half-deleted pages found, it is set without wal (like hint bit). > > (it is safe to miss setting flag, but it is not safe to miss unsetting > > flag). > > This patch works correctly: > > - if rows are only inserted and never deleted, index always skips > > scanning (starting with second scan). (Yes, I was confused here..) > > - if some rows updated/deleted, then some scans are not skipped. But > > when all half-deleted pages are marked as deleted, lazy scans start to > > be skipped. > > Open question: what to do with index statistic? For simplicity this > > patch skips updating stats (just returns NULL from btvacuumcleanup). > > Probably, it should detect proportion of table changes, and do not > > skip scans if table grows too much. Agreed. > Excuse me, I didn't mean to overwrite "last attachment" on commitfest > page. I'm the other one who also posted a patch in this thread:p I'm not sure how we should behave in this case, but I think it's inevitable to attach a patch when proposing a fundamental change of the design in a concrete shape. (But this also confuses the CI machinery:p) Actually, it was quite clear what you were thinking. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, Darafei. You wrote: DP> The following review has been posted through the commitfest application: DP> make installcheck-world: tested, passed DP> Implements feature: tested, passed DP> Spec compliant: tested, passed DP> Documentation: tested, passed DP> We've been using Postgres with this patch for some time. DP> In our use case we've got a quickly growing large table with events from our users. DP> The table has a structure of (user_id, ts, <event data>). Events are DP> append-only; each user generates events in a small predictable time frame, mostly each second. DP> From time to time we need to read this table in the fashion of WHERE DP> ts BETWEEN a AND b AND user_id=c. DP> Such a query leads to an enormous amount of seeks, as records of each DP> user are scattered across the relation and there are no pages that DP> contain two events from the same user. DP> To fight it, we created a btree index on (user_id, ts, DP> <event_data>). The plan switched to index-only scans, but heap fetches DP> and execution times were still the same. DP> Manual DP> We noticed that autovacuum skips scanning the relation and freezing the Visibility Map. DP> We started frequently performing VACUUM manually on the relation. DP> This helped with freezing the Visibility Map. DP> However, we found out that VACUUM makes a full scan over the index. DP> As the index does not fit into memory, this means that each run DP> flushes all the disk caches and eats up Amazon IOPS credits. DP> With this patch, behavior is much better for us - VACUUM finishes really quickly. DP> As a future improvement, similar handling for other index types will be useful. DP> After it happens, I'm looking forward to autovacuum kicking in on DP> append-only tables, to freeze the Visibility Map. DP> The new status of this patch is: Ready for Committer Seems like we are also going to hit this, and it would be cool to have this vacuum issue solved in the next PG version.
-- With best wishes, Pavel mailto:pavel@gf.microolap.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Oct 10, 2017 at 5:55 AM, Pavel Golub <pavel@microolap.com> wrote: > DP> The new status of this patch is: Ready for Committer > > Seems like we are also going to hit this, and it would be cool to have > this vacuum issue solved in the next PG version. Exactly which patch on this thread is someone proposing for commit? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Oct 13, 2017 at 1:14 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Oct 10, 2017 at 5:55 AM, Pavel Golub <pavel@microolap.com> wrote: >> DP> The new status of this patch is: Ready for Committer >> >> Seems like we are also going to hit this, and it would be cool to have >> this vacuum issue solved in the next PG version. > > Exactly which patch on this thread is someone proposing for commit? > I guess that is the patch I proposed. However, I think there is still room for discussion because the patch cannot skip the cleanup vacuum during an aggressive vacuum, which is one of the situations I really wanted to skip. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Oct 16, 2017 at 3:02 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I guess that is the patch I proposed. However, I think there is still > room for discussion because the patch cannot skip the cleanup vacuum > during an aggressive vacuum, which is one of the situations I really > wanted to skip. Well, I think there are outstanding concerns that the patch in question is not safe, and I don't see that anything has been done to resolve them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 25 September 2017 at 22:34, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > Here is a small patch that skips scanning btree index if no pending >> > deleted pages exists. >> > It detects this situation by comparing pages_deleted with pages_free. > > It seems to work to prevent needless cleanup scans. So this leaves us in the situation that 1. Masahiko's patch has unresolved problems 2. Yura's patch works and is useful Unless there is disagreement on the above, it seems we should apply Yura's patch (an edited version, perhaps). -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Nov 29, 2017 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 25 September 2017 at 22:34, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > >>> > Here is a small patch that skips scanning btree index if no pending >>> > deleted pages exists. >>> > It detects this situation by comparing pages_deleted with pages_free. >> >> It seems to work to prevent needless cleanup scans. > > So this leaves us in the situation that > > 1. Masahiko's patch has unresolved problems > 2. Yura's patch works and is useful > > Unless there is disagreement on the above, it seems we should apply > Yura's patch (an edited version, perhaps). > IIRC the patches that makes the cleanup scan skip has a problem pointed by Peter[1], that is that we stash an XID when a btree page is deleted, which is used to determine when it's finally safe to recycle the page. Yura's patch doesn't have that problem? [1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Greetings, (pruned the CC list) * Michael Paquier (michael.paquier@gmail.com) wrote: > On Thu, Nov 30, 2017 at 12:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Nov 29, 2017 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> Unless there is disagreement on the above, it seems we should apply > >> Yura's patch (an edited version, perhaps). > >> > > > > IIRC the patches that makes the cleanup scan skip has a problem > > pointed by Peter[1], that is that we stash an XID when a btree page is > > deleted, which is used to determine when it's finally safe to recycle > > the page. Yura's patch doesn't have that problem? > > > > [1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com > > The latest patch present on this thread does not apply > (https://www.postgresql.org/message-id/c4a47caef28024ab7528946bed264058@postgrespro.ru). > Please send a rebase. For now I am moving the patch to next CF, with > "waiting on author". We are now just about a week into the CF and this patch hasn't been updated and is still in 'waiting on author' state and it sounds like it might have a pretty serious issue based on the above comment. Masahiko Sawada, if this patch isn't viable or requires serious rework to be acceptable, then perhaps we should change its status to 'returned with feedback' and you can post a new patch for the next commitfest..? Otherwise, I'd encourage you to try and find time to post an updated patch soon, so that it can be reviewed. Thanks! Stephen
On Sat, Jan 6, 2018 at 2:20 PM, Stephen Frost <sfrost@snowman.net> wrote: >> > IIRC the patches that makes the cleanup scan skip has a problem >> > pointed by Peter[1], that is that we stash an XID when a btree page is >> > deleted, which is used to determine when it's finally safe to recycle >> > the page. Yura's patch doesn't have that problem? >> > >> > [1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com > Masahiko Sawada, if this patch isn't viable or requires serious rework > to be acceptable, then perhaps we should change its status to 'returned > with feedback' and you can post a new patch for the next commitfest..? I believe that the problem that I pointed out with freezing/wraparound is a solvable problem. If we think about it carefully, we will come up with a good solution. I have tried to get the ball rolling with my pd_prune_xid suggestion. I think it's important to not lose sight of the fact that the page deletion/recycling XID thing is just one detail that we need to think about some more. I cannot fault Sawada-san for waiting to hear other people's views before proceeding. It really needs to be properly discussed. -- Peter Geoghegan
Greetings Peter, * Peter Geoghegan (pg@bowt.ie) wrote: > On Sat, Jan 6, 2018 at 2:20 PM, Stephen Frost <sfrost@snowman.net> wrote: > >> > IIRC the patches that makes the cleanup scan skip has a problem > >> > pointed by Peter[1], that is that we stash an XID when a btree page is > >> > deleted, which is used to determine when it's finally safe to recycle > >> > the page. Yura's patch doesn't have that problem? > >> > > >> > [1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com > > > Masahiko Sawada, if this patch isn't viable or requires serious rework > > to be acceptable, then perhaps we should change its status to 'returned > > with feedback' and you can post a new patch for the next commitfest..? > > I believe that the problem that I pointed out with freezing/wraparound > is a solvable problem. If we think about it carefully, we will come up > with a good solution. I have tried to get the ball rolling with my > pd_prune_xid suggestion. I think it's important to not lose sight of > the fact that the page deletion/recycling XID thing is just one detail > that we need to think about some more. > > I cannot fault Sawada-san for waiting to hear other people's views > before proceeding. It really needs to be properly discussed. Thanks for commenting! Perhaps it should really be in Needs review state then..? Either way, thanks again for the clarification and hopefully this will revive the discussion. Stephen
Hi Stephen, On Sat, Jan 6, 2018 at 4:02 PM, Stephen Frost <sfrost@snowman.net> wrote: > Perhaps it should really be in Needs review state then..? Probably. As I pointed out already some time ago, this RecentGlobalXmin interlock stuff is our particular implementation of what Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty Nodes". It's not easy to understand why this is necessary in general (you have to understand the whole paper for that), but our implementation of the drain technique is simple. I'm afraid that I made the problem sound scarier than it actually is. As Lanin & Shasha put it: "freeing empty nodes by the drain technique is transparent to the rest of the algorithm and can be added by a separate module". nbtree doesn't actually care very much about XIDs -- testing an XID against RecentGlobalXmin was just the most convenient way of using L&S's technique to defer recycling until it is definitely safe. We only need to make sure that _bt_page_recyclable() cannot become confused by XID wraparound to fix this problem -- that's it. -- Peter Geoghegan
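To see concretely why the XID test in _bt_page_recyclable() needs care, here is a standalone model of circular XID comparison in the spirit of TransactionIdPrecedes (a simplification for illustration; real XIDs reserve special values and the server code differs). Once the stashed deletion XID and RecentGlobalXmin drift more than about 2^31 transactions apart, the comparison inverts and a long-dead page can appear to have been deleted "in the future" — exactly the wraparound confusion being discussed.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Circular (modulo 2^32) comparison modeled on TransactionIdPrecedes:
 * a logically precedes b iff the signed 32-bit difference is negative.
 * This only gives the right answer while the two XIDs are less than
 * 2^31 transactions apart; beyond that, the result flips.
 */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}
```

So a page whose btpo.xact was stashed at XID 100 is correctly seen as preceding RecentGlobalXmin = 200, but if the index goes unvacuumed for over 2^31 transactions, the same stashed XID no longer "precedes" RecentGlobalXmin and the page is never judged recyclable — hence the suggestion to also record an epoch.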
On Sun, Jan 7, 2018 at 8:40 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Sat, Jan 6, 2018 at 2:20 PM, Stephen Frost <sfrost@snowman.net> wrote: >>> > IIRC the patches that makes the cleanup scan skip has a problem >>> > pointed by Peter[1], that is that we stash an XID when a btree page is >>> > deleted, which is used to determine when it's finally safe to recycle >>> > the page. Yura's patch doesn't have that problem? >>> > >>> > [1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com > >> Masahiko Sawada, if this patch isn't viable or requires serious rework >> to be acceptable, then perhaps we should change its status to 'returned >> with feedback' and you can post a new patch for the next commitfest..? > > I believe that the problem that I pointed out with freezing/wraparound > is a solvable problem. If we think about it carefully, we will come up > with a good solution. I have tried to get the ball rolling with my > pd_prune_xid suggestion. I think it's important to not lose sight of > the fact that the page deletion/recycling XID thing is just one detail > that we need to think about some more. > > I cannot fault Sawada-san for waiting to hear other people's views > before proceeding. It really needs to be properly discussed. > Thank you for commenting. IIUC we have two approaches: one idea is based on Peter's suggestion. We can use pd_prune_xid to store the epoch of the XID at which the page was deleted. That way, we can correctly mark deleted pages as recyclable without breaking the on-disk format. The other idea was suggested by Sokolov Yura. His original patch adds a flag in btpo_flags that indicates whether the btree has deleted-but-not-yet-recyclable pages. I'd rather store it as a bool in BTMetaPageData. Currently btm_version is defined as uint32, but I think we won't use all of those bits. If we store it in part of btm_version we don't break the on-disk format.
However, we're currently assuming that vacuum on a btree index always scans the whole btree rather than a part of it, and this approach would entrench that assumption further. It might become a restriction in the future. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Jan 10, 2018 at 11:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sun, Jan 7, 2018 at 8:40 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> On Sat, Jan 6, 2018 at 2:20 PM, Stephen Frost <sfrost@snowman.net> wrote: >>>> > IIRC the patches that makes the cleanup scan skip has a problem >>>> > pointed by Peter[1], that is that we stash an XID when a btree page is >>>> > deleted, which is used to determine when it's finally safe to recycle >>>> > the page. Yura's patch doesn't have that problem? >>>> > >>>> > [1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com >> >>> Masahiko Sawada, if this patch isn't viable or requires serious rework >>> to be acceptable, then perhaps we should change its status to 'returned >>> with feedback' and you can post a new patch for the next commitfest..? >> >> I believe that the problem that I pointed out with freezing/wraparound >> is a solvable problem. If we think about it carefully, we will come up >> with a good solution. I have tried to get the ball rolling with my >> pd_prune_xid suggestion. I think it's important to not lose sight of >> the fact that the page deletion/recycling XID thing is just one detail >> that we need to think about some more. >> >> I cannot fault Sawada-san for waiting to hear other people's views >> before proceeding. It really needs to be properly discussed. >> > > Thank you for commenting. > > IIUC we have two approaches: one idea is based on Peter's suggestion. > We can use pd_prune_xid to store epoch of xid of which the page is > deleted. That way, we can correctly mark deleted pages as recyclable > without breaking on-disk format. > > Another idea is suggested by Sokolov Yura. His original patch makes > btree have a flag in btpo_flags that implies the btree has deleted but > not recyclable page or not. I'd rather want to store it as bool in > BTMetaPageData. 
Currently btm_version is defined as uint32 but I think > we won't use all of them. If we store it in part of btm_version we > don't break on-disk format. However, we're now assuming that the > vacuum on btree index always scans whole btree rather than a part, and > this approach will nurture it more. It might be possible that it will > become a restriction in the future. > On third thought, it's similar to what we are doing for heaps but we can store the oldest btpo.xact of a btree index somewhere (metapage or pg_class.relfrozenxid?) after vacuuming. We can skip cleanup vacuuming as long as that XID is not more than a certain threshold old (say vacuum_index_cleanup_age). Combined with the new parameter I proposed before, the condition for doing a cleanup index vacuum becomes as follows. if (there is garbage on heap) Vacuum index, and update oldest btpo.xact else /* there is no garbage on heap */ { if (# of scanned pages > (nblocks * vacuum_cleanup_index_scale_factor) OR oldest btpo.xact is more than vacuum_index_cleanup_age old?) Cleanup vacuum index, and update oldest btpo.xact } In current index vacuuming, we scan all index pages if there is even one garbage tuple on the heap (I'd like to improve that someday), which also leads to updating the oldest btpo.xact at that time. If vacuum_cleanup_index_scale_factor is high enough, we can skip scanning the whole index as long as either the table isn't modified for 2 billion transactions or the oldest btpo.xact isn't more than vacuum_index_cleanup_age transactions old. But if there is an open transaction for a very long time, we might not be able to mark deleted pages as recyclable. Some tricks might be required. Feedback is welcome. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
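The decision rule proposed above can be sketched as a small standalone C function. All names and parameters here are illustrative stand-ins for the proposal, not actual PostgreSQL symbols; the age arguments model "how many transactions old" the oldest btpo.xact is versus the proposed vacuum_index_cleanup_age threshold.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum
{
    INDEX_VACUUM,   /* bulk-delete scan: there is heap garbage to remove */
    INDEX_CLEANUP,  /* cleanup-only scan: stats/FSM/btpo.xact upkeep */
    INDEX_SKIP      /* skip index scans entirely */
} IndexVacAction;

/*
 * Model of the proposed rule (illustrative names): always vacuum the
 * index when the heap has garbage; otherwise run a cleanup pass only if
 * enough of the table was scanned, or the oldest stashed btpo.xact has
 * grown older than the cleanup-age threshold.
 */
static IndexVacAction
choose_index_action(bool heap_has_garbage,
                    uint64_t scanned_pages, uint64_t nblocks,
                    double cleanup_scale_factor,
                    uint64_t oldest_btpo_xact_age,
                    uint64_t vacuum_index_cleanup_age)
{
    if (heap_has_garbage)
        return INDEX_VACUUM;

    if ((double) scanned_pages > (double) nblocks * cleanup_scale_factor ||
        oldest_btpo_xact_age > vacuum_index_cleanup_age)
        return INDEX_CLEANUP;

    return INDEX_SKIP;
}
```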
On Fri, Jan 12, 2018 at 1:05 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Jan 10, 2018 at 11:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Sun, Jan 7, 2018 at 8:40 AM, Peter Geoghegan <pg@bowt.ie> wrote: >>> On Sat, Jan 6, 2018 at 2:20 PM, Stephen Frost <sfrost@snowman.net> wrote: >>>>> > IIRC the patches that makes the cleanup scan skip has a problem >>>>> > pointed by Peter[1], that is that we stash an XID when a btree page is >>>>> > deleted, which is used to determine when it's finally safe to recycle >>>>> > the page. Yura's patch doesn't have that problem? >>>>> > >>>>> > [1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com >>> >>>> Masahiko Sawada, if this patch isn't viable or requires serious rework >>>> to be acceptable, then perhaps we should change its status to 'returned >>>> with feedback' and you can post a new patch for the next commitfest..? >>> >>> I believe that the problem that I pointed out with freezing/wraparound >>> is a solvable problem. If we think about it carefully, we will come up >>> with a good solution. I have tried to get the ball rolling with my >>> pd_prune_xid suggestion. I think it's important to not lose sight of >>> the fact that the page deletion/recycling XID thing is just one detail >>> that we need to think about some more. >>> >>> I cannot fault Sawada-san for waiting to hear other people's views >>> before proceeding. It really needs to be properly discussed. >>> >> >> Thank you for commenting. >> >> IIUC we have two approaches: one idea is based on Peter's suggestion. >> We can use pd_prune_xid to store epoch of xid of which the page is >> deleted. That way, we can correctly mark deleted pages as recyclable >> without breaking on-disk format. >> >> Another idea is suggested by Sokolov Yura. His original patch makes >> btree have a flag in btpo_flags that implies the btree has deleted but >> not recyclable page or not. 
I'd rather want to store it as bool in >> BTMetaPageData. Currently btm_version is defined as uint32 but I think >> we won't use all of them. If we store it in part of btm_version we >> don't break on-disk format. However, we're now assuming that the >> vacuum on btree index always scans whole btree rather than a part, and >> this approach will nurture it more. It might be possible that it will >> become a restriction in the future. >> > > On third thought, it's similar to what we are doing for heaps but we > can store the oldest btpo.xact of a btree index to somewhere(metapage > or pg_class.relfrozenxid?) after vacuumed. We can skip cleanup > vacuuming as long as the xid is not more than a certain threshold old > (say vacuum_index_cleanup_age). Combining with new parameter I > proposed before, the condition of doing cleanup index vacuum will > become as follows. > > if (there is garbage on heap) > Vacuum index, and update oldest btpo.xact > else /* there is no garbage on heap */ > { > if (# of scanned pages > (nblocks * > vacuum_cleanup_index_scale_factor) OR oldest btpo.xact is more than > vacuum_index_cleanup_age old?)) > Cleanup vacuum index, and update oldest btpo.xact > } > > In current index vacuuming, it scans whole index pages if there is > even one garbage on heap(I'd like to improve it though someday), which > it also lead to update the oldest btpo.xact at the time. If > vacuum_cleanup_index_slace_factor is enough high, we can skip the > scanning whole index as long as either the table isn't modified for 2 > billion transactions or the oldest btpo.xact isn't more than > vacuum_index_cleanup_age transactions old. But if there is a opened > transaction for a very long time, we might not be able to mark deleted > page as recyclable. Some tricks might be required. Feedback is > welcome. > Since this patch still needs to be discussed before proceeding, I've marked it as "Needs Review". Feedback is very welcome. 
Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Nov 29, 2017 at 6:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Nov 29, 2017 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 25 September 2017 at 22:34, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
>>> > Here is a small patch that skips scanning btree index if no pending
>>> > deleted pages exists.
>>> > It detects this situation by comparing pages_deleted with pages_free.
>>
>> It seems to work to prevent needless cleanup scans.
>
> So this leaves us in the situation that
>
> 1. Masahiko's patch has unresolved problems
> 2. Yura's patch works and is useful
>
> Unless there is disagreement on the above, it seems we should apply
> Yura's patch (an edited version, perhaps).
>
IIRC the patches that makes the cleanup scan skip has a problem
pointed by Peter[1], that is that we stash an XID when a btree page is
deleted, which is used to determine when it's finally safe to recycle
the page. Yura's patch doesn't have that problem?
[1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com
Yes, I think Yura's patch doesn't have that problem, because it skips
cleanup only when there are no recyclable pages. And that means there
is no btpo.xact stored, so no XIDs to be wrapped around.
BTW, I've rebased Yura's patch. I think this patch has the following issues
for now:
1) Adding BTP_NEED_NO_CLEANUP as a per-page flag doesn't look nice.
2) In the append-only case, index statistics can lag indefinitely.
3) Patch definitely needs more comments.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Jan 12, 2018 at 7:05 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Jan 10, 2018 at 11:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sun, Jan 7, 2018 at 8:40 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Sat, Jan 6, 2018 at 2:20 PM, Stephen Frost <sfrost@snowman.net> wrote:
>>>> > IIRC the patches that makes the cleanup scan skip has a problem
>>>> > pointed by Peter[1], that is that we stash an XID when a btree page is
>>>> > deleted, which is used to determine when it's finally safe to recycle
>>>> > the page. Yura's patch doesn't have that problem?
>>>> >
>>>> > [1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com
>>
>>> Masahiko Sawada, if this patch isn't viable or requires serious rework
>>> to be acceptable, then perhaps we should change its status to 'returned
>>> with feedback' and you can post a new patch for the next commitfest..?
>>
>> I believe that the problem that I pointed out with freezing/wraparound
>> is a solvable problem. If we think about it carefully, we will come up
>> with a good solution. I have tried to get the ball rolling with my
>> pd_prune_xid suggestion. I think it's important to not lose sight of
>> the fact that the page deletion/recycling XID thing is just one detail
>> that we need to think about some more.
>>
>> I cannot fault Sawada-san for waiting to hear other people's views
>> before proceeding. It really needs to be properly discussed.
>>
>
> Thank you for commenting.
>
> IIUC we have two approaches: one idea is based on Peter's suggestion.
> We can use pd_prune_xid to store epoch of xid of which the page is
> deleted. That way, we can correctly mark deleted pages as recyclable
> without breaking on-disk format.
>
> Another idea is suggested by Sokolov Yura. His original patch makes
> btree have a flag in btpo_flags that implies the btree has deleted but
> not recyclable page or not. I'd rather want to store it as bool in
> BTMetaPageData. Currently btm_version is defined as uint32 but I think
> we won't use all of them. If we store it in part of btm_version we
> don't break on-disk format. However, we're now assuming that the
> vacuum on btree index always scans whole btree rather than a part, and
> this approach will nurture it more. It might be possible that it will
> become a restriction in the future.
>
On third thought, it's similar to what we are doing for heaps but we
can store the oldest btpo.xact of a btree index to somewhere(metapage
or pg_class.relfrozenxid?) after vacuumed. We can skip cleanup
vacuuming as long as the xid is not more than a certain threshold old
(say vacuum_index_cleanup_age). Combining with new parameter I
proposed before, the condition of doing cleanup index vacuum will
become as follows.
if (there is garbage on heap)
Vacuum index, and update oldest btpo.xact
else /* there is no garbage on heap */
{
if (# of scanned pages > (nblocks *
vacuum_cleanup_index_scale_factor) OR oldest btpo.xact is more than
vacuum_index_cleanup_age old?))
Cleanup vacuum index, and update oldest btpo.xact
}
In current index vacuuming, it scans whole index pages if there is
even one garbage on heap(I'd like to improve it though someday), which
it also lead to update the oldest btpo.xact at the time. If
vacuum_cleanup_index_slace_factor is enough high, we can skip the
scanning whole index as long as either the table isn't modified for 2
billion transactions or the oldest btpo.xact isn't more than
vacuum_index_cleanup_age transactions old. But if there is a opened
transaction for a very long time, we might not be able to mark deleted
page as recyclable. Some tricks might be required. Feedback is
welcome.
Storing the oldest btpo.xact seems like a good idea to me. So, by comparing
the oldest btpo.xact to RecentGlobalXmin we can determine if there are
some recyclable pages which are really going to be recycled during
a possible cleanup? Assuming that index page deletions are rare, that
seems like a sufficient criterion to run the cleanup. Also, this criterion
will eliminate the risk of wraparound, because we'll recycle index pages
no later than we do now.
Another case when running cleanup is necessary (or at least desirable)
is stalled index statistics. Assuming that index statistics can be stalled
only when there are no deletions from the index (otherwise the index would
be scanned during vacuum), we can use the number of index insertions as a
measure of statistics staleness.
So, what about the following logic?
if (there is garbage on heap)
Vacuum index, and update oldest btpo.xact
else /* there is no garbage on heap */
{
if (number of inserted index tuples since last btvacuumscan() /
total number of index tuples >= some threshold
OR oldest btpo.xact is older than RecentGlobalXmin)
Cleanup vacuum index, and update oldest btpo.xact
}
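That logic can be sketched in standalone C as follows. The names are illustrative only, and xid_precedes is a simplified stand-in for TransactionIdPrecedes, not the server's actual code:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Simplified circular XID comparison (model of TransactionIdPrecedes). */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Model of the cleanup criterion above (illustrative names): with no
 * heap garbage, run a cleanup scan only if enough tuples were inserted
 * since the last btvacuumscan(), or if the oldest stashed btpo.xact has
 * fallen behind RecentGlobalXmin, i.e. some deleted pages have become
 * recyclable.
 */
static bool
cleanup_needed(uint64_t inserted_since_last_scan,
               uint64_t total_index_tuples,
               double threshold,
               TransactionId oldest_btpo_xact,
               TransactionId recent_global_xmin)
{
    double frac = total_index_tuples == 0 ? 0.0 :
        (double) inserted_since_last_scan / (double) total_index_tuples;

    return frac >= threshold ||
           xid_precedes(oldest_btpo_xact, recent_global_xmin);
}
```

Note how this addresses both concerns at once: the insertion fraction catches stalled statistics on append-only tables, while the XID comparison ensures deleted pages are recycled no later than they would be today.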
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Feb 28, 2018 at 1:45 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Wed, Nov 29, 2017 at 6:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> On Wed, Nov 29, 2017 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> >> wrote: >> > On 25 September 2017 at 22:34, Kyotaro HORIGUCHI >> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > >> >>> > Here is a small patch that skips scanning btree index if no pending >> >>> > deleted pages exists. >> >>> > It detects this situation by comparing pages_deleted with >> >>> > pages_free. >> >> >> >> It seems to work to prevent needless cleanup scans. >> > >> > So this leaves us in the situation that >> > >> > 1. Masahiko's patch has unresolved problems >> > 2. Yura's patch works and is useful >> > >> > Unless there is disagreement on the above, it seems we should apply >> > Yura's patch (an edited version, perhaps). >> > >> >> IIRC the patches that makes the cleanup scan skip has a problem >> pointed by Peter[1], that is that we stash an XID when a btree page is >> deleted, which is used to determine when it's finally safe to recycle >> the page. Yura's patch doesn't have that problem? >> >> [1] >> https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com > > > Yes, I think Yura's patch doesn't have that problem, because it skips > cleanup only when there are no recyclable pages. And that means there > is no btpo.xact stored, so no XIDs to be wraparounded. I've looked at the patch again. And you're right, Yura's patch doesn't have that problem. > > BTW, I've rebased Yura's patch. I think this patch has following issues > for now: > > 1) Adding BTP_NEED_NO_CLEANUP as per-page flag doesn't look nice. Yeah, the alternative ideas are to store the flag in pd_prune_xid of the meta page, or to use 1 bit of btm_version, so that we don't break on-disk format compatibility. > 2) In the append-only case, index statistics can lag indefinitely.
The original proposal added a new GUC that specifies the fraction of modified pages needed to trigger an index cleanup. Combining that with Yura's patch, and having the lazy vacuum code tell indexes when cleanup is required, we can deal with it. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
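The btm_version idea mentioned above, i.e. spending one currently unused bit of a uint32 version field so that the on-disk format stays unchanged, could look roughly like this. The masks and names are invented for illustration and are not nbtree's actual definitions:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: reserve the top bit of a 32-bit version field as a
 * "has deleted but not-yet-recyclable pages" flag, keeping the low 31
 * bits as the version number. Existing meta pages (flag bit zero) read
 * back unchanged, so the on-disk format stays compatible.
 */
#define VERSION_MASK  0x7FFFFFFFu
#define PENDING_FLAG  0x80000000u

static uint32_t set_pending(uint32_t v)   { return v | PENDING_FLAG; }
static uint32_t clear_pending(uint32_t v) { return v & VERSION_MASK; }
static bool     has_pending(uint32_t v)   { return (v & PENDING_FLAG) != 0; }
static uint32_t version_of(uint32_t v)    { return v & VERSION_MASK; }
```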
On Fri, Mar 2, 2018 at 10:53 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 2) In the append-only case, index statistics can lag indefinitely.
The original proposal added a new GUC that specifies the fraction of
modified pages needed to trigger an index cleanup.
Regarding the original proposal, I didn't get what exactly it's intended to do.
You're checking whether vacuumed_pages >= nblocks * vacuum_cleanup_index_scale.
But vacuumed_pages is a variable that is only incremented when
no indexes exist on the table. When indexes are present, this variable is always
zero. I can assume that it's intended to compare the number of pages where
at least one tuple is deleted against nblocks * vacuum_cleanup_index_scale.
But that is also not an option for us, because we're going to optimize the
case when exactly zero tuples are deleted by vacuum.
The thing I'm going to propose is to add the estimated number of tuples in the
table to IndexVacuumInfo. Then B-tree can memorize in the meta-page the number
of tuples as of the last index scan. If the passed value
differs too much from the value in the meta-page, then cleanup is forced.
Any better ideas?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
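To make the IndexVacuumInfo idea concrete, here is a sketch of the struct with the proposed field added. The surrounding fields are an approximation of PostgreSQL's genam.h and may not match it exactly; the point here is only num_heap_tuples, the value B-tree would memorize in its meta-page:

```c
#include <stdbool.h>

/* Opaque placeholders standing in for the real PostgreSQL types. */
typedef struct RelationData *Relation;
typedef struct BufferAccessStrategyData *BufferAccessStrategy;

/*
 * Sketch: the estimated heap tuple count is made available to index AMs
 * through IndexVacuumInfo, so btvacuumcleanup() can compare it against
 * the value saved at the previous cleanup and decide whether a scan is
 * needed.
 */
typedef struct IndexVacuumInfo
{
    Relation    index;           /* the index being vacuumed */
    bool        analyze_only;    /* ANALYZE (without any actual vacuum) */
    bool        estimated_count; /* num_heap_tuples is only an estimate */
    int         message_level;   /* ereport level for progress messages */
    double      num_heap_tuples; /* estimated tuples remaining in the heap */
    BufferAccessStrategy strategy; /* buffer access strategy to use */
} IndexVacuumInfo;
```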
On Sun, Mar 4, 2018 at 8:59 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Fri, Mar 2, 2018 at 10:53 AM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> > 2) In the append-only case, index statistics can lag indefinitely. >> >> The original proposal proposed a new GUC that specifies a fraction of >> the modified pages to trigger a cleanup indexes. > > > Regarding original proposal, I didn't get what exactly it's intended to be. > You're checking if vacuumed_pages >= nblocks * vacuum_cleanup_index_scale. > But vacuumed_pages is the variable which could be incremented when > no indexes exist on the table. When indexes are present, this variable is > always > zero. I can assume, that it's intended to compare number of pages where > at least one tuple is deleted to nblocks * vacuum_cleanup_index_scale. > But that is also not an option for us, because we're going to optimize the > case when exactly zero tuples is deleted by vacuum. In the latest v4 patch, I compare scanned_pages and the threshold, which means if the number of pages that are modified since the last vacuum is larger than the threshold we force cleanup index. > The thing I'm going to propose is to add estimated number of tuples in > table to IndexVacuumInfo. Then B-tree can memorize that number of tuples > when last time index was scanned in the meta-page. If pass value > is differs from the value in meta-page too much, then cleanup is forced. > > Any better ideas? I think that would work. But I'm concerned about metapage format compatibility. And since I've not fully investigated about cleanup index of other index types I'm not sure that interface makes sense. It might not be better but an alternative idea is to add a condition (Irel[i]->rd_rel->relam == BTREE_AM_OID) in lazy_scan_heap. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Mar 5, 2018 at 5:56 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sun, Mar 4, 2018 at 8:59 AM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> On Fri, Mar 2, 2018 at 10:53 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> > 2) In the append-only case, index statistics can lag indefinitely.
>>
>> The original proposal proposed a new GUC that specifies a fraction of
>> the modified pages to trigger a cleanup indexes.
>
>
> Regarding original proposal, I didn't get what exactly it's intended to be.
> You're checking if vacuumed_pages >= nblocks * vacuum_cleanup_index_scale.
> But vacuumed_pages is the variable which could be incremented when
> no indexes exist on the table. When indexes are present, this variable is
> always
> zero. I can assume, that it's intended to compare number of pages where
> at least one tuple is deleted to nblocks * vacuum_cleanup_index_scale.
> But that is also not an option for us, because we're going to optimize the
> case when exactly zero tuples is deleted by vacuum.
> In the latest v4 patch, I compare scanned_pages and the threshold,
> which means if the number of pages that are modified since the last
> vacuum is larger than the threshold we force cleanup index.
Right, sorry, I've overlooked that. However, even if we use the number
of pages, I would still prefer a cumulative measure, so that multiple
vacuums are taken into account even if each of them touched only a
small number of pages.
> The thing I'm going to propose is to add estimated number of tuples in
> table to IndexVacuumInfo. Then B-tree can memorize that number of tuples
> when last time index was scanned in the meta-page. If pass value
> is differs from the value in meta-page too much, then cleanup is forced.
>
> Any better ideas?
> I think that would work. But I'm concerned about metapage format
> compatibility.
That's not a show-stopper. The B-tree meta page has a version number,
so it's no problem to provide an online upgrade.
> And since I've not fully investigated about cleanup
> index of other index types I'm not sure that interface makes sense. It
> might not be better but an alternative idea is to add a condition
> (Irel[i]->rd_rel->relam == BTREE_AM_OID) in lazy_scan_heap.
I meant putting this logic *inside* btvacuumcleanup(), while passing
the required measure to IndexVacuumInfo, which is accessible from
btvacuumcleanup().
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi!
I'd like to propose a revised patch based on various ideas upthread.
This patch works as follows.
1) B-tree meta page is extended with 2 additional parameters:
* btm_oldest_btpo_xact – the oldest btpo_xact among deleted pages,
* btm_last_cleanup_num_heap_tuples – the number of heap tuples at the last cleanup scan.
2) These parameters are reset during btbulkdelete() and set during btvacuumcleanup().
3) Index scans during the second and subsequent btvacuumcleanup() happen only if
btm_oldest_btpo_xact is older than RecentGlobalXmin
OR num_heap_tuples >= btm_last_cleanup_num_heap_tuples * (1 + vacuum_cleanup_index_scale_factor).
In other words, btvacuumcleanup() scans the index only if there are
recyclable pages, or index statistics are stale (the table has grown by more
than vacuum_cleanup_index_scale_factor since the last index statistics collection).
4) vacuum_cleanup_index_scale_factor can be set either by GUC or by reloption.
The default value is 0.1, so by default a cleanup scan is triggered after the
table size increases by 10%.
5) Since new fields are added to the metapage, BTREE_VERSION is bumped.
In order to support pg_upgrade, reading the previous metapage version is supported.
On metapage rewrite, it's upgraded to the new version.
So, since we don't skip the scan of recyclable pages, there is no risk of xid wraparound.
The risk of stale statistics is also small, because the default value of
vacuum_cleanup_index_scale_factor is quite low. Users can increase
vacuum_cleanup_index_scale_factor at their own risk, trading a larger gap in
index statistics for a lighter B-tree cleanup scan load.
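The trigger condition in items 1-3 can be sketched as follows. This is simplified, illustrative code rather than the patch itself; in particular, the wraparound-aware xid comparison is reduced to a signed-difference stand-in:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Simplified stand-in for PostgreSQL's wraparound-aware comparison. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

static double vacuum_cleanup_index_scale_factor = 0.1; /* proposed default */

/*
 * Sketch of the btvacuumcleanup() decision described above: scan the
 * index only if (1) there may be recyclable pages, i.e. the oldest
 * btpo_xact recorded in the meta-page is older than RecentGlobalXmin,
 * or (2) index statistics are stale, i.e. the heap grew by more than
 * vacuum_cleanup_index_scale_factor since the last cleanup.  Names
 * follow the message; the real function differs.
 */
static bool
bt_vacuum_needs_cleanup(TransactionId btm_oldest_btpo_xact,
                        double btm_last_cleanup_num_heap_tuples,
                        TransactionId recent_global_xmin,
                        double num_heap_tuples)
{
    if (btm_oldest_btpo_xact != InvalidTransactionId &&
        xid_precedes(btm_oldest_btpo_xact, recent_global_xmin))
        return true;            /* deleted pages may now be recyclable */

    if (num_heap_tuples >= btm_last_cleanup_num_heap_tuples *
                           (1.0 + vacuum_cleanup_index_scale_factor))
        return true;            /* statistics considered stale */

    return false;
}
```

This matches the append-only benchmark below: with no deleted pages and fewer than 10% new tuples, both branches fail and the cleanup scan is skipped.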
A simple benchmark shows the effect.
Before patch.
# insert into t select i from generate_series(1,100000000) i;
# create index t_i_idx on t(i);
# vacuum t;
VACUUM
Time: 15639,822 ms (00:15,640)
# insert into t select i from generate_series(1,1000) i;
INSERT 0 1000
Time: 6,195 ms
# vacuum t;
VACUUM
Time: 1012,794 ms (00:01,013)
# insert into t select i from generate_series(1,1000) i;
INSERT 0 1000
Time: 5,276 ms
# vacuum t;
VACUUM
Time: 1013,254 ms (00:01,013)
After patch.
# insert into t select i from generate_series(1,100000000) i;
# create index t_i_idx on t(i);
# vacuum t;
VACUUM
Time: 15689,450 ms (00:15,689)
# insert into t select i from generate_series(1,1000) i;
INSERT 0 1000
Time: 5,585 ms
# vacuum t;
VACUUM
Time: 50,777 ms
# insert into t select i from generate_series(1,1000) i;
INSERT 0 1000
Time: 5,641 ms
# vacuum t;
VACUUM
Time: 46,997 ms
Thus, vacuum time for an append-only table drops from about 1000 ms to 50 ms (roughly 20x).
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Hi! > Sorry for my late reply. > I'd like to propose a revised patch based on various ideas upthread. Thank you for proposing the patch! > > This patch works as following. > > 1) B-tree meta page is extended with 2 additional parameters: > * btm_oldest_btpo_xact – oldest btpo_xact among of deleted pages, > * btm_last_cleanup_num_heap_tuples – number of heap tuples during last > cleanup scan. > > 2) These parameters are reset during btbulkdelete() and set during > btvacuumcleanup(). Can't we set these parameters even during btbulkdelete()? By keeping them up to date, we will able to avoid an unnecessary cleanup vacuums even after index bulk-delete. > > 3) Index scans during second and subsequent btvacuumcleanup() happen only if > btm_oldest_btpo_xact is older than RecentGlobalXmin > OR num_heap_tuples >= btm_last_cleanup_num_heap_tuples(1 + > vacuum_cleanup_index_scale_factor). > > In other words btvacuumcleanup() scans the index only if there are > recyclable pages, > or index statistics is stalled (inserted more than > vacuum_cleanup_index_scale_factor > since last index statistics collection). > > 4) vacuum_cleanup_index_scale_factor can be set either by GUC or reloption. > Default value is 0.1. So, by default cleanup scan is triggered after > increasing of > table size by 10%. > > 5) Since new fields are added to the metapage, BTREE_VERSION is bumped. > In order to support pg_upgrade, read of previous metapage version is > supported. > On metapage rewrite, it's upgraded to the new version. > > So, since we don't skip scan of recyclable pages, there is no risk of xid > wraparound. > Risk of stalled statistics is also small, because > vacuum_cleanup_index_scale_factor > default value is quite low. User can increase > vacuum_cleanup_index_scale_factor > on his own risk and have less load of B-tree cleanup scan bought by more gap > in > index statistics. Agreed. 
I've not reviewed the code deeply yet, but the regression test of this patch seems to wrongly pass the regression tests, and the bt_metap() function of pageinspect needs to be updated. Attached is an updated patch that fixes these issues. I will review the patch again. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> 2) These parameters are reset during btbulkdelete() and set during
> btvacuumcleanup().
> Can't we set these parameters even during btbulkdelete()? By keeping
> them up to date, we will able to avoid an unnecessary cleanup vacuums
> even after index bulk-delete.
We certainly can update cleanup-related parameters during btbulkdelete().
However, in this case we would update the B-tree meta-page during each
VACUUM cycle. That may cause some overhead for non-append-only
workloads. I don't think this overhead would be significant, because in
non-append-only scenarios VACUUM typically writes much more information.
But I would like this patch, oriented to append-only workloads, to be
as harmless as possible for other workloads.
> I've not reviewed the code deeply yet but the regression test of this
> patch seems to wrongly pass the regression tests and bt_metap()
> function of pageinspect needs to be updated.
Right. Thank you for fixing this issue.
> Attached an updated patch
> fixed these issue. Will review the patch again.
Thank you!
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Mar 9, 2018 at 9:40 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Attached an updated patch fixed these issue. Will review the patch again.
> Thank you!
I've fixed a bug: _bt_vacuum_needs_cleanup() didn't release a lock when the metapage is in the old format.
Bug was found by Darafei Praliaskouski and Grigory Smolkin who tested this patch.
Revised patch is attached.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Sat, Mar 10, 2018 at 5:49 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> On Fri, Mar 9, 2018 at 9:40 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>> On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Attached an updated patch fixed these issue. Will review the patch again.
>> Thank you!
> I've fixed a bug: _bt_vacuum_needs_cleanup() didn't release a lock when metapage is in old format. Bug was found by Darafei Praliaskouski and Grigory Smolkin who tested this patch.
> Revised patch is attached.
I also found that the online upgrade of the meta-page didn't work: the version field wasn't set.
Fixed in the attached version of the patch.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov >> <a.korotkov@postgrespro.ru> wrote: >> > 2) These parameters are reset during btbulkdelete() and set during >> > btvacuumcleanup(). >> >> Can't we set these parameters even during btbulkdelete()? By keeping >> them up to date, we will able to avoid an unnecessary cleanup vacuums >> even after index bulk-delete. > > > We certainly can update cleanup-related parameters during btbulkdelete(). > However, in this case we would update B-tree meta-page during each > VACUUM cycle. That may cause some overhead for non append-only > workloads. I don't think this overhead would be sensible, because in > non append-only scenarios VACUUM typically writes much more of information. > But I would like this oriented to append-only workload patch to be > as harmless as possible for other workloads. What overhead are you referring here? I guess the overhead is only the calculating the oldest btpo.xact. And I think it would be harmless. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
>> <a.korotkov@postgrespro.ru> wrote:
>> > 2) These parameters are reset during btbulkdelete() and set during
>> > btvacuumcleanup().
>>
>> Can't we set these parameters even during btbulkdelete()? By keeping
>> them up to date, we will able to avoid an unnecessary cleanup vacuums
>> even after index bulk-delete.
>
>
> We certainly can update cleanup-related parameters during btbulkdelete().
> However, in this case we would update B-tree meta-page during each
> VACUUM cycle. That may cause some overhead for non append-only
> workloads. I don't think this overhead would be sensible, because in
> non append-only scenarios VACUUM typically writes much more of information.
> But I would like this oriented to append-only workload patch to be
> as harmless as possible for other workloads.
> What overhead are you referring here? I guess the overhead is only the
> calculating the oldest btpo.xact. And I think it would be harmless.
I meant the overhead of setting last_cleanup_num_heap_tuples after every
btbulkdelete, with WAL-logging of the meta-page. I bet it would also be
harmless, but I think that needs some testing.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov >> <a.korotkov@postgrespro.ru> wrote: >> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> >> > wrote: >> >> >> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov >> >> <a.korotkov@postgrespro.ru> wrote: >> >> > 2) These parameters are reset during btbulkdelete() and set during >> >> > btvacuumcleanup(). >> >> >> >> Can't we set these parameters even during btbulkdelete()? By keeping >> >> them up to date, we will able to avoid an unnecessary cleanup vacuums >> >> even after index bulk-delete. >> > >> > >> > We certainly can update cleanup-related parameters during >> > btbulkdelete(). >> > However, in this case we would update B-tree meta-page during each >> > VACUUM cycle. That may cause some overhead for non append-only >> > workloads. I don't think this overhead would be sensible, because in >> > non append-only scenarios VACUUM typically writes much more of >> > information. >> > But I would like this oriented to append-only workload patch to be >> > as harmless as possible for other workloads. >> >> What overhead are you referring here? I guess the overhead is only the >> calculating the oldest btpo.xact. And I think it would be harmless. > > > I meant overhead of setting last_cleanup_num_heap_tuples after every > btbulkdelete with wal-logging of meta-page. I bet it also would be > harmless, but I think that needs some testing. Agreed. After more thought, it might be too late but we can consider the possibility of another idea proposed by Peter. Attached patch addresses the original issue of index cleanups by storing the epoch number of page deletion XID into PageHeader->pd_prune_xid which is 4byte field. 
Compared to the currently proposed patch, this patch needs neither the page upgrade code nor extra WAL-logging. If we also want to address cases other than the append-only case, we will require the bulk-delete method to scan the whole index and to log WAL, but that leads to some extra overhead. With this patch we no longer need to depend on the full scan of the b-tree index. This might be useful in the future when we make the bulk-delete of the b-tree index not scan the whole index. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
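A sketch of the epoch idea in the patch above: combining the epoch saved at page deletion time (the patch reuses the 4-byte PageHeader->pd_prune_xid for it) with btpo.xact yields an effectively 64-bit deletion xid, so the recyclability test survives xid wraparound. Names here are illustrative, not the patch's:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Combine the xid epoch with a 32-bit xid into a single 64-bit value
 * that orders correctly across wraparound.
 */
static uint64_t
full_xid(uint32_t epoch, TransactionId xid)
{
    return ((uint64_t) epoch << 32) | xid;
}

/*
 * A deleted page is safe to recycle once its full deletion xid is older
 * than the oldest xmin any running transaction might still need.
 */
static bool
page_is_recyclable(uint32_t deletion_epoch, TransactionId btpo_xact,
                   uint64_t oldest_full_xmin)
{
    return full_xid(deletion_epoch, btpo_xact) < oldest_full_xmin;
}
```

Because the epoch disambiguates wrapped-around xids, a page deleted in a previous epoch is still correctly judged recyclable even when its 32-bit btpo.xact is numerically larger than current xids.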
At Mon, 19 Mar 2018 11:12:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAB8tQg9xwojupUJjKD=fMhtx6thDEPENDdhftVLWcR8A@mail.gmail.com> > On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> > > wrote: > >> > >> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov > >> <a.korotkov@postgrespro.ru> wrote: > >> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> > >> > wrote: > >> >> > >> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov > >> >> <a.korotkov@postgrespro.ru> wrote: > >> >> > 2) These parameters are reset during btbulkdelete() and set during > >> >> > btvacuumcleanup(). > >> >> > >> >> Can't we set these parameters even during btbulkdelete()? By keeping > >> >> them up to date, we will able to avoid an unnecessary cleanup vacuums > >> >> even after index bulk-delete. > >> > > >> > > >> > We certainly can update cleanup-related parameters during > >> > btbulkdelete(). > >> > However, in this case we would update B-tree meta-page during each > >> > VACUUM cycle. That may cause some overhead for non append-only > >> > workloads. I don't think this overhead would be sensible, because in > >> > non append-only scenarios VACUUM typically writes much more of > >> > information. > >> > But I would like this oriented to append-only workload patch to be > >> > as harmless as possible for other workloads. > >> > >> What overhead are you referring here? I guess the overhead is only the > >> calculating the oldest btpo.xact. And I think it would be harmless. > > > > > > I meant overhead of setting last_cleanup_num_heap_tuples after every > > btbulkdelete with wal-logging of meta-page. I bet it also would be > > harmless, but I think that needs some testing. > > Agreed. > > After more thought, it might be too late but we can consider the > possibility of another idea proposed by Peter. 
Attached patch > addresses the original issue of index cleanups by storing the epoch > number of page deletion XID into PageHeader->pd_prune_xid which is > 4byte field. Mmm. It seems to me that the story is returning to the beginning. Could I try retelling the story? I understant that the initial problem was vacuum runs apparently unnecessary full-scan on indexes many times. The reason for that is the fact that a cleanup scan may leave some (or many under certain condition) dead pages not-recycled but we don't know whether a cleanup is needed or not. They will be staying left forever unless we run additional cleanup-scans at the appropriate timing. (If I understand it correctly,) Sawada-san's latest proposal is (fundamentally the same to the first one,) just skipping the cleanup scan if the vacuum scan just before found that the number of *live* tuples are increased. If there where many deletions and insertions but no increase of total number of tuples, we don't have a cleanup. Consequently it had a wraparound problem and it is addressed in this version. (ditto.) Alexander proposed to record the oldest xid of recyclable pages in metapage (and the number of tuples at the last cleanup). This prevents needless cleanup scan and surely runs cleanups to remove all recyclable pages. I think that we can accept Sawada-san's proposal if we accept the fact that indexes can retain recyclable pages for a long time. (Honestly I don't think so.) If (as I might have mentioned as the same upthread for Yura's patch,) we accept to hold the information on index meta page, Alexander's way would be preferable. The difference betwen Yura's and Alexander's is the former runs cleanup scan if a recyclable page is present but the latter avoids that before any recyclable pages are knwon to be removed. > Comparing to the current proposed patch this patch > doesn't need neither the page upgrade code nor extra WAL-logging. 
If # By the way, my proposal was storing the information as Yura # proposed into stats collector. The information maybe be # available a bit lately, but it doesn't harm. This doesn't need # extra WAL logging nor the upgrad code:p > we also want to address cases other than append-only case we will I'm afraid that "the problem for the other cases" is a new one that this patch introduces, not an existing one. > require the bulk-delete method of scanning whole index and of logging > WAL. But it leads some extra overhead. With this patch we no longer > need to depend on the full scan on b-tree index. This might be useful > for a future when we make the bulk-delete of b-tree index not scan > whole index. Perhaps I'm taking something incorrectly, but is it just the result of skipping 'maybe needed' scans without condiering the actual necessity? I also don't like extra WAL logging, but it happens once (or twice?) per vaccum cycle (for every index). On the other hand I want to put the on-the-fly upgrade path out of the ordinary path. (Reviving the pg_upgrade's custom module?) regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Sorry I'd like to make a trivial but critical fix. At Mon, 19 Mar 2018 14:45:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180319.144505.166111203.horiguchi.kyotaro@lab.ntt.co.jp> > At Mon, 19 Mar 2018 11:12:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAB8tQg9xwojupUJjKD=fMhtx6thDEPENDdhftVLWcR8A@mail.gmail.com> > > On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov > > <a.korotkov@postgrespro.ru> wrote: > > > On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> > > > wrote: > > >> > > >> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov > > >> <a.korotkov@postgrespro.ru> wrote: > > >> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> > > >> > wrote: > > >> >> > > >> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov > > >> >> <a.korotkov@postgrespro.ru> wrote: > > >> >> > 2) These parameters are reset during btbulkdelete() and set during > > >> >> > btvacuumcleanup(). > > >> >> > > >> >> Can't we set these parameters even during btbulkdelete()? By keeping > > >> >> them up to date, we will able to avoid an unnecessary cleanup vacuums > > >> >> even after index bulk-delete. > > >> > > > >> > > > >> > We certainly can update cleanup-related parameters during > > >> > btbulkdelete(). > > >> > However, in this case we would update B-tree meta-page during each > > >> > VACUUM cycle. That may cause some overhead for non append-only > > >> > workloads. I don't think this overhead would be sensible, because in > > >> > non append-only scenarios VACUUM typically writes much more of > > >> > information. > > >> > But I would like this oriented to append-only workload patch to be > > >> > as harmless as possible for other workloads. > > >> > > >> What overhead are you referring here? I guess the overhead is only the > > >> calculating the oldest btpo.xact. And I think it would be harmless. 
> > > > > > > > > I meant overhead of setting last_cleanup_num_heap_tuples after every > > > btbulkdelete with wal-logging of meta-page. I bet it also would be > > > harmless, but I think that needs some testing. > > > > Agreed. > > > > After more thought, it might be too late but we can consider the > > possibility of another idea proposed by Peter. Attached patch > > addresses the original issue of index cleanups by storing the epoch > > number of page deletion XID into PageHeader->pd_prune_xid which is > > 4byte field. > > Mmm. It seems to me that the story is returning to the > beginning. Could I try retelling the story? > > I understant that the initial problem was vacuum runs apparently > unnecessary full-scan on indexes many times. The reason for that > is the fact that a cleanup scan may leave some (or many under > certain condition) dead pages not-recycled but we don't know > whether a cleanup is needed or not. They will be staying left > forever unless we run additional cleanup-scans at the appropriate > timing. > > (If I understand it correctly,) Sawada-san's latest proposal is > (fundamentally the same to the first one,) just skipping the > cleanup scan if the vacuum scan just before found that the number > of *live* tuples are increased. If there where many deletions and > insertions but no increase of total number of tuples, we don't > have a cleanup. Consequently it had a wraparound problem and it > is addressed in this version. > > (ditto.) Alexander proposed to record the oldest xid of > recyclable pages in metapage (and the number of tuples at the > last cleanup). This prevents needless cleanup scan and surely > runs cleanups to remove all recyclable pages. > > I think that we can accept Sawada-san's proposal if we accept the > fact that indexes can retain recyclable pages for a long > time. (Honestly I don't think so.) 
> > If (as I might have mentioned as the same upthread for Yura's > patch,) we accept to hold the information on index meta page, > Alexander's way would be preferable. The difference betwen Yura's > and Alexander's is the former runs cleanup scan if a recyclable > page is present but the latter avoids that before any recyclable - pages are knwon to be removed. + pages are knwon to be actually removable > > Comparing to the current proposed patch this patch > > doesn't need neither the page upgrade code nor extra WAL-logging. If > > # By the way, my proposal was storing the information as Yura > # proposed into stats collector. The information maybe be > # available a bit lately, but it doesn't harm. This doesn't need > # extra WAL logging nor the upgrad code:p > > > we also want to address cases other than append-only case we will > > I'm afraid that "the problem for the other cases" is a new one > that this patch introduces, not an existing one. > > > require the bulk-delete method of scanning whole index and of logging > > WAL. But it leads some extra overhead. With this patch we no longer > > need to depend on the full scan on b-tree index. This might be useful > > for a future when we make the bulk-delete of b-tree index not scan > > whole index. > > Perhaps I'm taking something incorrectly, but is it just the > result of skipping 'maybe needed' scans without condiering the > actual necessity? > > I also don't like extra WAL logging, but it happens once (or > twice?) per vaccum cycle (for every index). On the other hand I > want to put the on-the-fly upgrade path out of the ordinary > path. (Reviving the pg_upgrade's custom module?) regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Mar 19, 2018 at 2:45 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Mon, 19 Mar 2018 11:12:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAB8tQg9xwojupUJjKD=fMhtx6thDEPENDdhftVLWcR8A@mail.gmail.com>
>> On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov
>> <a.korotkov@postgrespro.ru> wrote:
>> > On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com>
>> > wrote:
>> >>
>> >> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
>> >> <a.korotkov@postgrespro.ru> wrote:
>> >> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
>> >> >> <a.korotkov@postgrespro.ru> wrote:
>> >> >> > 2) These parameters are reset during btbulkdelete() and set during
>> >> >> > btvacuumcleanup().
>> >> >>
>> >> >> Can't we set these parameters even during btbulkdelete()? By keeping
>> >> >> them up to date, we will be able to avoid unnecessary cleanup vacuums
>> >> >> even after index bulk-delete.
>> >> >
>> >> > We certainly can update cleanup-related parameters during
>> >> > btbulkdelete().
>> >> > However, in this case we would update B-tree meta-page during each
>> >> > VACUUM cycle. That may cause some overhead for non append-only
>> >> > workloads. I don't think this overhead would be sensible, because in
>> >> > non append-only scenarios VACUUM typically writes much more of
>> >> > information.
>> >> > But I would like this oriented to append-only workload patch to be
>> >> > as harmless as possible for other workloads.
>> >>
>> >> What overhead are you referring here? I guess the overhead is only the
>> >> calculating the oldest btpo.xact. And I think it would be harmless.
>> >
>> > I meant overhead of setting last_cleanup_num_heap_tuples after every
>> > btbulkdelete with wal-logging of meta-page. I bet it also would be
>> > harmless, but I think that needs some testing.
>>
>> Agreed.
>>
>> After more thought, it might be too late but we can consider the
>> possibility of another idea proposed by Peter. Attached patch
>> addresses the original issue of index cleanups by storing the epoch
>> number of page deletion XID into PageHeader->pd_prune_xid which is
>> 4byte field.
>
> Mmm. It seems to me that the story is returning to the
> beginning. Could I try retelling the story?
>
> I understand that the initial problem was vacuum runs apparently
> unnecessary full-scan on indexes many times. The reason for that
> is the fact that a cleanup scan may leave some (or many under
> certain condition) dead pages not-recycled but we don't know
> whether a cleanup is needed or not. They will be staying left
> forever unless we run additional cleanup-scans at the appropriate
> timing.
>
> (If I understand it correctly,) Sawada-san's latest proposal is
> (fundamentally the same to the first one,) just skipping the
> cleanup scan if the vacuum scan just before found that the number
> of *live* tuples are increased. If there were many deletions and
> insertions but no increase of total number of tuples, we don't
> have a cleanup. Consequently it had a wraparound problem and it
> is addressed in this version.

No, it doesn't have a wraparound problem. The patch I proposed based on
Peter's idea adds an epoch number for the page deletion xid and compares
them when we judge whether the page is recyclable or not. It's
something like creating a 64-bit xid out of the deletion xid. Also, if
there is even one deletion, the bulk-delete will be performed instead of
the index cleanup. So with this patch we do the index cleanup only when
the reltuples of the table has increased by the fraction
vacuum_index_cleanup_scale_factor since the previous stats. There is no
need to trigger the index cleanup by any xid threshold.

> (ditto.) Alexander proposed to record the oldest xid of
> recyclable pages in metapage (and the number of tuples at the
> last cleanup). This prevents needless cleanup scan and surely
> runs cleanups to remove all recyclable pages.

Yes, but the concerns we discussed are that we need extra WAL-logging
for updating the metapage, and that it works only for the append-only
case. If we also want to support more cases, we will need to update the
metapage during bulk-delete. The overhead of WAL-logging would be
harmless but should be tested, as Alexander mentioned.

> I think that we can accept Sawada-san's proposal if we accept the
> fact that indexes can retain recyclable pages for a long
> time. (Honestly I don't think so.)
>
> If (as I might have mentioned as the same upthread for Yura's
> patch,) we accept to hold the information on index meta page,
> Alexander's way would be preferable. The difference between Yura's
> and Alexander's is the former runs cleanup scan if a recyclable
> page is present but the latter avoids that before any recyclable
> pages are known to be removed.
>
>> Comparing to the current proposed patch this patch
>> doesn't need neither the page upgrade code nor extra WAL-logging. If
>
> # By the way, my proposal was storing the information as Yura
> # proposed into stats collector. The information may be
> # available a bit lately, but it doesn't harm. This doesn't need
> # extra WAL logging nor the upgrade code :p
>
>> we also want to address cases other than append-only case we will
>
> I'm afraid that "the problem for the other cases" is a new one
> that this patch introduces, not an existing one.

I meant that Alexander's current proposal works for an append-only
table. If we want to support other cases, we have to update the metapage
during bulk-delete, which assumes that bulk-delete always scans the
whole index.

>> require the bulk-delete method of scanning whole index and of logging
>> WAL. But it leads some extra overhead. With this patch we no longer
>> need to depend on the full scan on b-tree index. This might be useful
>> for a future when we make the bulk-delete of b-tree index not scan
>> whole index.
>
> Perhaps I'm taking something incorrectly, but is it just the
> result of skipping 'maybe needed' scans without considering the
> actual necessity?

I meant to scan only the index pages that are relevant to the garbage
TIDs on a table. The current b-tree index bulk-deletion is very slow and
heavy because we always scan the whole index even if there is only one
dead tuple in the table. To address this problem I'm thinking of a way
to make bulk-delete not scan the whole index if there are only a few
dead tuples in the table. That is, we do index scans to collect the
stack of btree pages and reclaim garbage. Maybe we would do a full index
scan if there are a lot of dead tuples, which would be similar to what
we're doing when planning access paths.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
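The epoch scheme described above can be sketched as follows. This is an illustrative reconstruction, not code from the patch: the function names and the exact way the epoch (stashed in `PageHeader->pd_prune_xid`) is combined with the 32-bit deletion XID are assumptions based on the description in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: combine the epoch stored in pd_prune_xid with the
 * 32-bit deletion XID stored on the deleted page, yielding a 64-bit value
 * that orders monotonically and therefore cannot wrap around. */
static inline uint64_t
full_deletion_xid(uint32_t epoch, uint32_t xid)
{
    return ((uint64_t) epoch << 32) | (uint64_t) xid;
}

/* A deleted page becomes recyclable once its 64-bit deletion XID precedes
 * the oldest transaction that could still see it. */
static inline bool
page_is_recyclable(uint32_t page_epoch, uint32_t page_xid,
                   uint32_t oldest_epoch, uint32_t oldest_xid)
{
    return full_deletion_xid(page_epoch, page_xid) <
           full_deletion_xid(oldest_epoch, oldest_xid);
}
```

With a 64-bit ordering, a plain `<` comparison replaces wraparound-aware XID arithmetic, which is the point of recording the epoch alongside the deletion XID: the moment of recycling can be chosen freely without risking a wrapped comparison.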
On Mon, Mar 19, 2018 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> require the bulk-delete method of scanning whole index and of logging
>>> WAL. But it leads some extra overhead. With this patch we no longer
>>> need to depend on the full scan on b-tree index. This might be useful
>>> for a future when we make the bulk-delete of b-tree index not scan
>>> whole index.
>>
>> Perhaps I'm taking something incorrectly, but is it just the
>> result of skipping 'maybe needed' scans without considering the
>> actual necessity?
>
> I meant to scan only index pages that are relevant with garbage TIDs
> on a table. The current b-tree index bulk-deletion is very slow and
> heavy because we always scans the whole index even if there is only 1
> dead tuples in a table. To address this problem I'm thinking a way to
> make bulk-delete not scan whole index if there is a few dead tuples in
> a table. That is, we do index scans to collect the stack of btree
> pages and reclaim garbage. Maybe we will full index scan if there are
> a lot of dead tuples, which would be same as what we're doing on
> planning access paths.

In theory it's not very difficult to do. I was pondering doing some
PoC after the other vacuum patches get through.

TL;DR version is, as long as there's enough MWM to fit the keys, they
can be stashed before vacuuming the heap, and used to perform index
scans instead for index cleanup. If MWM runs out, it just goes back to
bulk-delete scans (it would imply there's a lot to clean up anyway, so
index scans wouldn't be worth it). A finer decision can be made with
random_page_cost on which technique is likely faster as well.

So yes, let's not paint ourselves into a corner where full index scans
are required for correct operation; that would make the above
impossible.
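The TL;DR heuristic above can be sketched as a simple decision function. Everything here is hypothetical: the function name, the 5% threshold, and collapsing the `random_page_cost` consideration into a fixed dead-tuple fraction are all assumptions for illustration, not anything proposed verbatim in the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative choice between targeted index scans (re-finding each dead
 * key by descending the tree) and a full bulk-delete scan. */
static bool
use_targeted_index_scans(size_t dead_key_bytes,        /* memory to stash keys */
                         size_t maintenance_work_mem,  /* MWM budget, bytes */
                         double dead_frac)             /* dead / total tuples */
{
    /* If the stashed keys don't fit in MWM, fall back to the full scan:
     * a spill would imply there is a lot to clean up anyway. */
    if (dead_key_bytes > maintenance_work_mem)
        return false;

    /* With many dead tuples, sequential whole-index I/O wins, much like
     * the seqscan-vs-indexscan decision made when planning access paths.
     * The 5% cutoff is purely illustrative. */
    if (dead_frac > 0.05)
        return false;

    return true;
}
```

A real implementation would compare estimated costs using `random_page_cost`, as suggested above, rather than a hard-coded fraction.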
At Mon, 19 Mar 2018 20:50:48 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCfWXcX-po8Q1r779nyVGzs01pwpSLM=u7Sx3Hv+L+4gg@mail.gmail.com>
> On Mon, Mar 19, 2018 at 2:45 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> [...]
> > (If I understand it correctly,) Sawada-san's latest proposal is
> > (fundamentally the same to the first one,) just skipping the
> > cleanup scan if the vacuum scan just before found that the number
> > of *live* tuples are increased. If there were many deletions and
> > insertions but no increase of total number of tuples, we don't
> > have a cleanup. Consequently it had a wraparound problem and it
> > is addressed in this version.
>
> No, it doesn't have a wraparound problem. The patch based on Peter's
> idea I proposed adds an epoch number of page deletion xid and compare
> them when we judge whether the page is recyclable or not. It's
> something like we create 64-bit xid of deletion xid. Also, if there is
> even one deletion the bulk-delete will be performed instead of the
> index cleanup. So with this patch we do the index cleanup only when
> the reltuple of table is increased by fraction of
> vacuum_index_cleanup_scale_factor from previous stats. It doesn't need
> to do the index cleanup by any xid thresholds.

Perhaps you took me wrong. I know the last patch doesn't have (or
at least intends to get rid of) the problem, and I wrote that
the problem was introduced by your *first* patch.

> > (ditto.) Alexander proposed to record the oldest xid of
> > recyclable pages in metapage (and the number of tuples at the
> > last cleanup). This prevents needless cleanup scan and surely
> > runs cleanups to remove all recyclable pages.
>
> Yes, but the concerns we discussed are that we need extra WAL-logging
> for updating the metapage and it works only for append-only case. If
> we also want to support more cases we will need to update the metapage
> during bulk-delete. The overhead of WAL-logging would be harmless but
> should be tested as Alexander mentioned.

Agreed.

> > I think that we can accept Sawada-san's proposal if we accept the
> > fact that indexes can retain recyclable pages for a long
> > time. (Honestly I don't think so.)
> >
> > If (as I might have mentioned as the same upthread for Yura's
> > patch,) we accept to hold the information on index meta page,
> > Alexander's way would be preferable. The difference between Yura's
> > and Alexander's is the former runs cleanup scan if a recyclable
> > page is present but the latter avoids that before any recyclable
> > pages are known to be removed.
> >
> >> Comparing to the current proposed patch this patch
> >> doesn't need neither the page upgrade code nor extra WAL-logging. If
> >
> > # By the way, my proposal was storing the information as Yura
> > # proposed into stats collector. The information may be
> > # available a bit lately, but it doesn't harm. This doesn't need
> > # extra WAL logging nor the upgrade code :p
> >
> >> we also want to address cases other than append-only case we will
> >
> > I'm afraid that "the problem for the other cases" is a new one
> > that this patch introduces, not an existing one.
>
> I meant that the current Alexander's proposal works for append-only
> table. If we want to support other cases we have to update metapage
> during bulk-delete, which assumes that bulk-delete always scans whole
> index.

True. Currently no patch so far gets rid of the whole-cleanup-scan.

> >> require the bulk-delete method of scanning whole index and of logging
> >> WAL. But it leads some extra overhead. With this patch we no longer
> >> need to depend on the full scan on b-tree index. This might be useful
> >> for a future when we make the bulk-delete of b-tree index not scan
> >> whole index.
> >
> > Perhaps I'm taking something incorrectly, but is it just the
> > result of skipping 'maybe needed' scans without considering the
> > actual necessity?
>
> I meant to scan only index pages that are relevant with garbage TIDs
> on a table. The current b-tree index bulk-deletion is very slow and
> heavy because we always scans the whole index even if there is only 1
> dead tuples in a table. To address this problem I'm thinking a way to
> make bulk-delete not scan whole index if there is a few dead tuples in
> a table. That is, we do index scans to collect the stack of btree
> pages and reclaim garbage. Maybe we will full index scan if there are
> a lot of dead tuples, which would be same as what we're doing on
> planning access paths.

Yeah, that seems good! A possible problem of that is that the
pages we want to recycle in a cleanup scan can *not* be only them
that have found to be recyclable in the vacuum-scan just
before. When we leave some recyclable pages in a cleanup scan, we
should do a whole scan at the next chance if we don't have the TID
list (or in other smaller form, or just the number of recyclable
pages?) at the time.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 20 Mar 2018 13:57:19 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180320.135719.90053076.horiguchi.kyotaro@lab.ntt.co.jp>
> At Mon, 19 Mar 2018 20:50:48 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCfWXcX-po8Q1r779nyVGzs01pwpSLM=u7Sx3Hv+L+4gg@mail.gmail.com>
> [...]
> > I meant to scan only index pages that are relevant with garbage TIDs
> > on a table. The current b-tree index bulk-deletion is very slow and
> > heavy because we always scans the whole index even if there is only 1
> > dead tuples in a table. [...]
>
> Yeah, that seems good! A possible problem of that is that the
> pages we want to recycle in a cleanup scan can *not* be only them
> that have found to be recyclable in the vacuum-scan just
> before. When we leave some recyclable pages in a cleanup scan, we
> should do whole-scan at the next chance if we don't have the TID
> list (or in other smaller form, or just the number of recyclable
> pages?) at the time.

"the number of recyclable pages?" is wrong. It doesn't work to avoid
the whole scan. It just indicates whether "this" round of cleanup
needs a whole scan or not.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Mar 19, 2018 at 5:12 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
>> <a.korotkov@postgrespro.ru> wrote:
>> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>> > wrote:
>> >>
>> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
>> >> <a.korotkov@postgrespro.ru> wrote:
>> >> > 2) These parameters are reset during btbulkdelete() and set during
>> >> > btvacuumcleanup().
>> >>
>> >> Can't we set these parameters even during btbulkdelete()? By keeping
>> >> them up to date, we will be able to avoid unnecessary cleanup vacuums
>> >> even after index bulk-delete.
>> >
>> >
>> > We certainly can update cleanup-related parameters during
>> > btbulkdelete().
>> > However, in this case we would update B-tree meta-page during each
>> > VACUUM cycle. That may cause some overhead for non append-only
>> > workloads. I don't think this overhead would be sensible, because in
>> > non append-only scenarios VACUUM typically writes much more of
>> > information.
>> > But I would like this oriented to append-only workload patch to be
>> > as harmless as possible for other workloads.
>>
>> What overhead are you referring here? I guess the overhead is only the
>> calculating the oldest btpo.xact. And I think it would be harmless.
>
>
> I meant overhead of setting last_cleanup_num_heap_tuples after every
> btbulkdelete with wal-logging of meta-page. I bet it also would be
> harmless, but I think that needs some testing.

Agreed.

After more thought, it might be too late but we can consider the
possibility of another idea proposed by Peter. Attached patch
addresses the original issue of index cleanups by storing the epoch
number of page deletion XID into PageHeader->pd_prune_xid which is
4byte field. Comparing to the current proposed patch this patch
doesn't need neither the page upgrade code nor extra WAL-logging. If
we also want to address cases other than append-only case we will
require the bulk-delete method of scanning whole index and of logging
WAL. But it leads some extra overhead. With this patch we no longer
need to depend on the full scan on b-tree index. This might be useful
for a future when we make the bulk-delete of b-tree index not scan
whole index.
Storing epoch in deleted pages is good by itself, because it makes us
free to select the moment of time to recycle those pages without risk
of wraparound.
However, I see that you are comparing relative change of num_heap_tuples
before and after vacuum to vacuum_cleanup_index_scale_factor.
The problem is that if vacuum is very aggressive (and that is likely for
append only case if this patch is committed), then num_heap_tuples
changes very slightly every vacuum cycle. So, cleanup would never
happen and statistics could stall indefinitely.
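The stall Alexander describes can be seen numerically. This sketch is illustrative only (the function and variable names are assumptions, not patch code): the same relative-growth test either stalls or eventually fires depending on which baseline it is compared against.

```c
#include <stdbool.h>

/* Trigger cleanup when the tuple count grew by more than scale_factor
 * relative to some baseline. */
static bool
cleanup_is_needed(double baseline_tuples, double current_tuples,
                  double scale_factor)
{
    return current_tuples > baseline_tuples * (1.0 + scale_factor);
}
```

If the baseline is the tuple count before the *current* vacuum, an aggressive autovacuum sees only ~1% growth per cycle and a 10% threshold never fires, so statistics stall; if the baseline is the count recorded at the *last cleanup*, growth accumulates across cycles and the threshold eventually fires.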
Another issue is that append only workloads are frequently combined
with rare periodical bulk deletes of data. Assuming that in this patch
you don't consider deleted pages to trigger index cleanup, on such
workload large amounts of deleted pages may reside non-recycled until
next bulk delete cycle.
So, despite I generally like idea of storing epoch of deleted xid in the
page, I think my version of patch is closer to committable shape.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Mar 19, 2018 at 8:45 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Mon, 19 Mar 2018 11:12:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAB8tQg9xwojupUJjKD=fMhtx6thDEPENDdhftVLWcR8A@mail.gmail.com>
> require the bulk-delete method of scanning whole index and of logging
> WAL. But it leads some extra overhead. With this patch we no longer
> need to depend on the full scan on b-tree index. This might be useful
> for a future when we make the bulk-delete of b-tree index not scan
> whole index.
Perhaps I'm taking something incorrectly, but is it just the
result of skipping 'maybe needed' scans without considering the
actual necessity?
I also don't like extra WAL logging, but it happens once (or
twice?) per vacuum cycle (for every index).
In my version of the patch, WAL logging doesn't happen every vacuum cycle.
WAL-logging of meta-page happens only in two cases:
1) After the first vacuum which didn't delete any heap tuples, i.e. after
we switch from an update workload to an append-only workload.
2) During an append-only workload when cleanup is triggered (either
deleted pages become recyclable or statistics are considered as stalled).
Typically, in both update and append-only workloads, WAL logging
isn't happening during the vacuum cycle.
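The two cases above can be summarized as a predicate. This is an illustrative reading of the scheme described in the message, with assumed parameter names, not code from the patch:

```c
#include <stdbool.h>

/* Would this VACUUM cycle WAL-log the B-tree meta-page? */
static bool
metapage_needs_wal(bool tuples_deleted,    /* did this vacuum bulk-delete? */
                   bool stats_recorded,    /* cleanup stats already on meta-page */
                   bool cleanup_ran)       /* cleanup pass actually triggered */
{
    /* Case 1: first vacuum that deleted nothing records the baseline. */
    if (!tuples_deleted && !stats_recorded)
        return true;
    /* Case 2: a cleanup pass updates the meta-page anyway. */
    if (cleanup_ran)
        return true;
    /* Ordinary update workload: no extra meta-page write. */
    return false;
}
```

The point of the design is that the common update-workload path returns `false`, so the feature stays harmless for non append-only workloads.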
Masahiko Sawada has proposed to update meta-information during
btbulkdelete [1]. That would lead to WAL logging in almost every
vacuum cycle. I was slightly opposed to that, saying that this overhead
needs to be tested [2]. However, this concern is not related to the
current shape of my version of the patch.
On the other hand I
want to put the on-the-fly upgrade path out of the ordinary
path. (Reviving the pg_upgrade's custom module?)
I don't know. Previously, we successfully did an on-the-fly upgrade of
GIN posting lists. Our pg_upgrade machinery is probably already
very complicated and overloaded. Should we burden it with
upgrade of every meta-page of every B-tree index assuming that
we can handle it very well inside B-tree itself on the fly?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Mar 22, 2018 at 9:24 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Mon, Mar 19, 2018 at 5:12 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>> [...]
>
> Storing epoch in deleted pages is good by itself, because it makes us
> free to select the moment of time to recycle those pages without risk
> of wraparound.
>
> However, I see that you are comparing relative change of num_heap_tuples
> before and after vacuum to vacuum_cleanup_index_scale_factor.
> The problem is that if vacuum is very aggressive (and that is likely for
> append only case if this patch is committed), then num_heap_tuples
> changes very slightly every vacuum cycle. So, cleanup would never
> happen and statistics could stall indefinitely.

Good catch. I think we need to store the number of tuples at when
scanning whole index is performed (bulkdelete or cleanup) as your
patch does so. So it also would need the page-upgrading code. Since
that value would be helpful for other type of indexes too it might be
better to store it into system catalog.

> Another issue is that append only workloads are frequently combined
> with rare periodical bulk deletes of data. Assuming that in this patch
> you don't consider deleted pages to trigger index cleanup, on such
> workload large amounts of deleted pages may reside non-recycled until
> next bulk delete cycle.

Perhaps using that new value we can trigger the cleanup if the current
number of tuples has been increased or decreased the
vacuum_index_scale_factor% from n_tup_last_cleanup.

> So, despite I generally like idea of storing epoch of deleted xid in the
> page, I think my version of patch is closer to committable shape.

Agreed, but as I mentioned before, I'm concerned that your version of
the patch will become a restriction on future improvement. If the
community decides this feature works only for mostly append-only
workloads, I think your version of the patch is the best approach for
now.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Mar 27, 2018 at 11:55 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Mar 22, 2018 at 9:24 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> However, I see that you are comparing relative change of num_heap_tuples
> before and after vacuum to vacuum_cleanup_index_scale_factor.
> The problem is that if vacuum is very aggressive (and that is likely for
> append only case if this patch is committed), then num_heap_tuples
> changes very slightly every vacuum cycle. So, cleanup would never
> happen and statistics could stall indefinitely.

Good catch. I think we need to store the number of tuples at when
scanning whole index is performed (bulkdelete or cleanup) as your
patch does so. So it also would need the page-upgrading code. Since
that value would be helpful for other type of indexes too it might be
better to store it into system catalog.
>
> Another issue is that append only workloads are frequently combined
> with rare periodical bulk deletes of data. Assuming that in this patch
> you don't consider deleted pages to trigger index cleanup, on such
> workload large amounts of deleted pages may reside non-recycled until
> next bulk delete cycle.
Perhaps using that new value we can trigger the cleanup if the current
number of tuples has been increased or decreased the
vacuum_index_scale_factor% from n_tup_last_cleanup.
Yes, that might work. However, a decreased number of tuples could be an
inaccurate measure of the number of deleted pages. Depending on the distribution
of tuples per page, the same amount of deleted tuples can lead to very different
numbers of deleted pages (in corner cases it can range from zero to very
large amounts). If we want to skip scanning of deleted pages then it
would be better to store their exact count known by the previous bulk delete or
cleanup.
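The trigger condition being discussed can be sketched as follows. This is an illustrative sketch only; the names cleanup_triggered, n_tup_last_cleanup and scale_factor follow the discussion above rather than any committed PostgreSQL code:

```c
#include <stdbool.h>

/*
 * Illustrative sketch of the proposed trigger: run the full index scan
 * (cleanup) when the heap tuple count has drifted by more than
 * scale_factor (a fraction, e.g. 0.1 = 10%) from the count recorded at
 * the previous full index scan.  n_tup_last_cleanup < 0 means no
 * previous count is recorded, so cleanup must run.
 */
bool
cleanup_triggered(double n_tup_last_cleanup, double n_tup_now,
                  double scale_factor)
{
    double delta;

    if (n_tup_last_cleanup < 0)
        return true;
    delta = n_tup_now - n_tup_last_cleanup;
    if (delta < 0)
        delta = -delta;         /* both increase and decrease count */
    return delta > scale_factor * n_tup_last_cleanup;
}
```

For example, with scale_factor = 0.1 and 1000 tuples recorded at the last cleanup, a cleanup triggers once the table passes 1100 tuples or drops below 900.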
>
> So, despite I generally like idea of storing epoch of deleted xid in the
> page, I think my version of patch is closer to committable shape.
>
Agreed, but as I mentioned before, I'm concerned that your version
patch approach will become a restriction of future improvement. If
community decides this feature works only for mostly append-only
workloads I think your version of patch is the best approach for now.
At first I would like to note that all valid optimizations presented in the
thread optimize the append case. Thus they do give serious benefit
on mostly append-only workloads, since the patches were about
skipping/reducing the cleanup stage, which does serious work only in
the append-only case.
It could be possible in future to optimize also the cases when only a small
fraction of table tuples were deleted. Those tuples could be deleted
not by a full index scan, but by individual index lookups. But I think
such optimization is rather harder than everything mentioned in
this thread, and it should be considered separately.
The thing which could be improved in future is making btree able
to skip cleanup when some deleted pages are pending to be recycled.
But I'm not sure that this problem deserves a lot of time investment
right now, because I think that page deletion in our btree is unfortunately
quite a rare situation. In real life our btree is rather going to become bloated,
with bad fillfactor of pages, than to get many pages deleted.
So, I would like to clarify why could my patch block future improvements
in this area? For instance, if we would decide to make btree able to skip
cleanup when some delete pages are pending for recycle, we can add
it in future.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Sorry for the late response. On Tue, Mar 27, 2018 at 8:01 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Tue, Mar 27, 2018 at 11:55 AM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> On Thu, Mar 22, 2018 at 9:24 PM, Alexander Korotkov >> <a.korotkov@postgrespro.ru> wrote: >> > However, I see that you are comparing relative change of num_heap_tuples >> > before and after vacuum to vacuum_cleanup_index_scale_factor. >> > The problem is that if vacuum is very aggressive (and that is likely for >> > append only case if this patch is committed), then num_heap_tuples >> > changes very slightly every vacuum cycle. So, cleanup would never >> > happen and statistics could stall indefinitely. >> >> Good catch. I think we need to store the number of tuples at when >> scanning whole index is performed (bulkdelete or cleanup) as your >> patch does so. So it also would need the page-upgrading code. Since >> that value would be helpful for other type of indexes too it might be >> better to store it into system catalog. >> >> > >> > Another issue is that append only workloads are frequently combined >> > with rare periodical bulk deletes of data. Assuming that in this patch >> > you don't consider deleted pages to trigger index cleanup, on such >> > workload large amounts of deleted pages may reside non-recycled until >> > next bulk delete cycle. >> >> Perhaps using that new value we can trigger the cleanup if the current >> number of tuples has been increased or decreased the >> vacuum_index_scale_factor% from n_tup_last_cleanup. > > > Yes, that might work. However, decreased number of tuples could be > inaccurate measure of number of deleted pages. Depending on distribution > of tuples per pages, same amount of deleted tuples can lead to very > different > number of deleted pages (in corner cases in can start from zero to the very > large amounts). 
If we want to skip scanning of deleted pages then it > would be better store their exact count known by previous bulk delete or > cleanup. > >> > >> > So, despite I generally like idea of storing epoch of deleted xid in the >> > page, I think my version of patch is closer to committable shape. >> > >> >> Agreed, but as I mentioned before, I'm concerned that your version >> patch approach will become a restriction of future improvement. If >> community decides this feature works only for mostly append-only >> workloads I think your version of patch is the best approach for now. > > > At first I would like to note that all valid optimizations presented in the > thread optimizes append case. Thus they do give serious benefit > on mostly append-only workloads. Since patches were about > skipping/reducing cleanup stage which does serious work only in > append-only case. > > It could be possible in future to optimize also cases when only small > fraction of table tuples were deleted. Those tuples could be deleted > not by full index scan, but by individual index lookups. But I think > such optimization is rather harder than everything mentioned in > this thread, and it should be considered separately. Agreed. > > The thing which could be improved in future is making btree able > to skip cleanup when some deleted pages are pending to be recycled. > But I'm not sure that this problem deserves a lot of time investment > right now. Because I think that page deletion in our btree is unfortunately > quite rate situation. In real life our btree is rather going to be bloat > with bad > fillfactor of pages rather than got much pages deleted. Agreed. In your version of the patch we clean up recyclable pages if there might be any, whereas my version could leave recyclable pages behind. The thing I'm concerned about is that the patch would unnecessarily narrow its effect. That is, we might be able to update the btree meta page even during bulkdelete. 
In your version of the patch we have to scan the whole index at least twice (bulkdelete and cleanup) even if just one tuple is deleted. But if the deletion of index tuples didn't lead to index page deletions (which is a common case), the extra scan is really unnecessary. So I guess that we can update the meta page only when we deleted pages in the index vacuum scan. If no page deletion happened, we don't need to update the oldest btpo.xact, so we don't update the meta page and of course don't WAL-log it. This can be a separate patch though. At least the current approach is safer. > > So, I would like to clarify why could my patch block future improvements > in this area? For instance, if we would decide to make btree able to skip > cleanup when some delete pages are pending for recycle, we can add > it in future. > Anyway, regarding approaches for this feature, I agree with your version; your patch looks good to me as the first step of this feature. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
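The condition proposed here, rewriting the metapage only when the vacuum scan actually deleted index pages, can be sketched as below. This is a hypothetical illustration of the idea, not the logic that was committed; meta_needs_rewrite and its parameters are invented names:

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of the suggestion above: skip the WAL-logged
 * metapage rewrite when no index pages were deleted by this vacuum
 * scan, since the oldest btpo.xact recorded in the metapage cannot
 * have changed.  An outdated metapage version still forces a rewrite,
 * mirroring the "outdated version always needs rewrite" rule.
 */
bool
meta_needs_rewrite(long pages_deleted, bool version_outdated)
{
    if (version_outdated)
        return true;
    return pages_deleted > 0;
}
```

The point of the design choice is that skipping the rewrite also skips its WAL record, which matters on an otherwise idle, fully-frozen table.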
On Tue, Apr 3, 2018 at 6:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Mar 27, 2018 at 8:01 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> So, I would like to clarify why could my patch block future improvements
> in this area? For instance, if we would decide to make btree able to skip
> cleanup when some delete pages are pending for recycle, we can add
> it in future.
>
Anyway, for approaches of this feature I agree your version patch and
your patch looks good to me as the first step of this feature.
Agreed. I think we got consensus that this patch is a good first step,
which doesn't block further enhancements in future.
So, I'm attaching rebased version of patch and marking this RFC.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
Thanks for everyone, pushed with minor editorization Alexander Korotkov wrote: > On Tue, Apr 3, 2018 at 6:42 PM, Masahiko Sawada <sawada.mshk@gmail.com > <mailto:sawada.mshk@gmail.com>> wrote: > > On Tue, Mar 27, 2018 at 8:01 PM, Alexander Korotkov > > So, I would like to clarify why could my patch block future improvements > > in this area? For instance, if we would decide to make btree able to skip > > cleanup when some delete pages are pending for recycle, we can add > > it in future. > > > > Anyway, for approaches of this feature I agree your version patch and > your patch looks good to me as the first step of this feature. > > > Agreed. I think we got consensus that this patch is good first step, > which doesn't block further enhancements in future. > > So, I'm attaching rebased version of patch and marking this RFC. > > ------ > Alexander Korotkov > Postgres Professional:http://www.postgrespro.com <http://www.postgrespro.com/> > The Russian Postgres Company -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Thu, Apr 5, 2018 at 1:30 AM, Teodor Sigaev <teodor@sigaev.ru> wrote: > Thanks for everyone, pushed with minor editorization > Thank you for committing! I found a typo in nbtpage.c and attached a patch fixes it. s/overritten/overwritten/ Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
On Thu, Apr 5, 2018 at 2:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Thu, Apr 5, 2018 at 1:30 AM, Teodor Sigaev <teodor@sigaev.ru> wrote: >> Thanks for everyone, pushed with minor editorization >> > > Thank you for committing! > I found a typo in nbtpage.c and attached a patch fixes it. > I also found an incorrect documentation in create_index.sgml as follows. <term><literal>vacuum_cleanup_index_scale_factor</literal></term> <listitem> <para> Per-table value for <xref linkend="guc-vacuum-cleanup-index-scale-factor"/>. </para> </listitem> </varlistentry> I think it should be "Per-index". Attached a patch for fixing it. And sorry for missing it at review. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
Hello. The commit leaves three warnings for -Wunused-but-set-variable. Two of them are not assertion-only but really not used at all. I also found that nodeMerge.c has one such variable. regards. At Thu, 5 Apr 2018 15:43:55 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAtYzXeuhPqGw89KxqcJWpSKxWRNkYzyJxAqJHZham==Q@mail.gmail.com> > On Thu, Apr 5, 2018 at 2:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Apr 5, 2018 at 1:30 AM, Teodor Sigaev <teodor@sigaev.ru> wrote: > >> Thanks for everyone, pushed with minor editorization > >> > > > > Thank you for committing! > > I found a typo in nbtpage.c and attached a patch fixes it. > > > > I also found an incorrect documentation in create_index.sgml as follows. > > <term><literal>vacuum_cleanup_index_scale_factor</literal></term> > <listitem> > <para> > Per-table value for <xref > linkend="guc-vacuum-cleanup-index-scale-factor"/>. > </para> > </listitem> > </varlistentry> > > I think it should be "Per-index". Attached a patch for fixing it. And > sorry for missing it at review. 
> > Regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 505a67e6ed..b920d66731 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -85,7 +85,7 @@ void
 _bt_upgrademetapage(Page page)
 {
     BTMetaPageData *metad;
-    BTPageOpaque metaopaque;
+    BTPageOpaque metaopaque PG_USED_FOR_ASSERTS_ONLY;
 
     metad = BTPageGetMeta(page);
     metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -118,7 +118,6 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 {
     Buffer      metabuf;
     Page        metapg;
-    BTPageOpaque metaopaque;
     BTMetaPageData *metad;
     bool        needsRewrite = false;
     XLogRecPtr  recptr;
@@ -126,7 +125,6 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
     /* read the metapage and check if it needs rewrite */
     metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
     metapg = BufferGetPage(metabuf);
-    metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
     metad = BTPageGetMeta(metapg);
 
     /* outdated version of metapage always needs rewrite */
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 06badc90ba..66a66f2dad 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -786,13 +786,11 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 {
     Buffer      metabuf;
     Page        metapg;
-    BTPageOpaque metaopaque;
     BTMetaPageData *metad;
     bool        result = false;
 
     metabuf = _bt_getbuf(info->index, BTREE_METAPAGE, BT_READ);
     metapg = BufferGetPage(metabuf);
-    metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
     metad = BTPageGetMeta(metapg);
 
     if (metad->btm_version < BTREE_VERSION)
diff --git a/src/backend/executor/nodeMerge.c b/src/backend/executor/nodeMerge.c
index 0e0d0795d4..b21e69903d 100644
--- a/src/backend/executor/nodeMerge.c
+++ b/src/backend/executor/nodeMerge.c
@@ -485,7 +485,7 @@ ExecMerge(ModifyTableState *mtstate, EState *estate, TupleTableSlot *slot,
     ItemPointer tupleid;
     ItemPointerData tuple_ctid;
     bool        matched = false;
-    char        relkind;
+    char        relkind PG_USED_FOR_ASSERTS_ONLY;
     Datum       datum;
     bool        isNull;
Thanks to everyone, fixes are pushed except nodeMerge.c, I don't wish to increase entropy around MERGE patch :) Kyotaro HORIGUCHI wrote: > Hello. > > The commit leaves three warnings for > -Wunused-but-set-variable. Two of them are not assertion-only but > really not used at all. > > I also found that nodeMerge.c has one such variable. > > regards. > > At Thu, 5 Apr 2018 15:43:55 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAtYzXeuhPqGw89KxqcJWpSKxWRNkYzyJxAqJHZham==Q@mail.gmail.com> >> On Thu, Apr 5, 2018 at 2:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Thu, Apr 5, 2018 at 1:30 AM, Teodor Sigaev <teodor@sigaev.ru> wrote: >>>> Thanks for everyone, pushed with minor editorization >>>> >>> >>> Thank you for committing! >>> I found a typo in nbtpage.c and attached a patch fixes it. >>> >> >> I also found an incorrect documentation in create_index.sgml as follows. >> >> <term><literal>vacuum_cleanup_index_scale_factor</literal></term> >> <listitem> >> <para> >> Per-table value for <xref >> linkend="guc-vacuum-cleanup-index-scale-factor"/>. >> </para> >> </listitem> >> </varlistentry> >> >> I think it should be "Per-index". Attached a patch for fixing it. And >> sorry for missing it at review. >> >> Regards, > -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Thu, Apr 5, 2018 at 7:23 PM, Teodor Sigaev <teodor@sigaev.ru> wrote: > Thanks to everyone, fixes are pushed except nodeMerge.c, I don't wish to > increase entropy around MERGE patch :) > Thank you! Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
At Fri, 6 Apr 2018 10:52:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCYWuKu4yvTLE1Hgv2SsJRnBjYweTztTkBABUykXonb1w@mail.gmail.com> > On Thu, Apr 5, 2018 at 7:23 PM, Teodor Sigaev <teodor@sigaev.ru> wrote: > > Thanks to everyone, fixes are pushed except nodeMerge.c, I don't wish to > > increase entropy around MERGE patch :) No problem. Thanks! > Thank you! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi!
It is cool to see this in Postgres 11. However:
4) vacuum_cleanup_index_scale_factor can be set either by GUC or reloption. Default value is 0.1. So, by default cleanup scan is triggered after increasing of table size by 10%.
vacuum_cleanup_index_scale_factor can be set to the maximum of 100.
I imagine that on a large append-only table with a limited storage IOPS budget it may happen that I would want to never perform a full scan on the index. Roughly, with the parameter set to 100, if we vacuum the table the first time with 1 tuple and 130-byte-wide rows, we'll have a full scan at 130 bytes, 12 kbytes, 1.2MB, 123MB, 12 GB, 1.2TB.
If we happen to perform the first vacuum when there are 4 tuples in the table, it becomes 52kb, 5MB, 495MB, 48GB, and both 12GB and 48GB will exhaust any storage spike IOPS budget, slowing everything down rather suddenly.
Can the upper limit for this GUC be lifted, or have a value for "never"?
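The growth sequence described above can be checked with a quick model: with the factor set to 100, each cleanup fires only after the tuple count grows by a factor of 101, so the sizes of successive full index scans form a geometric series. A sketch (illustrative only; nth_full_scan_bytes is an invented helper, and the real trigger works on heap tuple counts, not bytes):

```c
/*
 * Size (in bytes) of the table at the n-th full cleanup scan, assuming
 * the first vacuum saw start_tuples tuples and each subsequent cleanup
 * fires after the tuple count grows by a factor of (1 + scale_factor).
 * Illustrative model of the quoted 130 B -> 12 kB -> 1.2 MB -> ...
 * progression.
 */
double
nth_full_scan_bytes(double row_bytes, double start_tuples,
                    double scale_factor, int n)
{
    double tuples = start_tuples;

    for (int i = 0; i < n; i++)
        tuples *= 1.0 + scale_factor;   /* next trigger threshold */
    return tuples * row_bytes;
}
```

With row_bytes = 130, start_tuples = 1 and scale_factor = 100, the first values are 130 B, about 13 kB, and about 1.3 MB, roughly matching the quoted progression up to decimal/binary rounding.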
Hi!

On Sat, Jun 16, 2018 at 11:23 PM Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
> It is cool to see this in Postgres 11. However:
>
>> 4) vacuum_cleanup_index_scale_factor can be set either by GUC or reloption.
>> Default value is 0.1. So, by default cleanup scan is triggered after increasing of
>> table size by 10%.
>
> vacuum_cleanup_index_scale_factor can be set to the maximum of 100.
> I imagine that on a large append-only table with IOPS storage system budget it may happen that I would want to never perform a full scan on index. Roughly, with parameter set to 100, if we vacuum the table first time with 1 tuple and 130 byte wide rows, we'll have a full scan at 130 bytes, 12 kbytes, 1.2MB, 123MB, 12 GB, 1.2TB.
>
> If we happen to perform the first vacuum when there are 4 tuples in the table, it becomes 52kb, 5MB, 495MB, 48GB - and both 12GB and 48GB will exhaust any storage spike IOPS budget, slowing everything down rather suddenly.
>
> Can the upper limit for this GUC be lifted, or have a value for "never"?

I have done some further exploration of how the statistics obtained by B-tree index vacuum cleanup are used.

1) Collected page and tuple numbers are not directly used, but used for an estimation of tuple density per page, while the current number of pages is estimated using smgr (see btcostestimate()). So, unless the density of tuples significantly changes, having index statistics stalled doesn't affect query plans.

2) Our optimization for skipping B-tree index vacuum cleanup works only in the case when the user manually vacuums the table in order to update the visibility map. Autovacuum is not triggered for append-only tables. So, if the user doesn't take special care of append-only tables, they're not vacuumed until "autovacuum to prevent wraparound". Thus, index statistics could be very stalled. And I don't think we have many occurrences of issues with stalled index statistics.

3) We have a very safe default of vacuum_cleanup_index_scale_factor equal to 0.1.
But as Darafei claimed, the maximum value of 100 is probably too low for advanced users, who really need the benefits of this optimization. So, I'm proposing to raise the maximum value of vacuum_cleanup_index_scale_factor to DBL_MAX. Any objections?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Mon, Jun 18, 2018 at 1:56 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Hi! > > On Sat, Jun 16, 2018 at 11:23 PM Darafei "Komяpa" Praliaskouski > <me@komzpa.net> wrote: >> It is cool to see this in Postgres 11. However: >> >>> >>> 4) vacuum_cleanup_index_scale_factor can be set either by GUC or reloption. >>> Default value is 0.1. So, by default cleanup scan is triggered after increasing of >>> table size by 10%. >> >> >> vacuum_cleanup_index_scale_factor can be set to the maximum of 100. >> I imagine that on a large append-only table with IOPS storage system budget it may happen that I would want to never perform a full scan on index. Roughly, with parameter set to 100, if we vacuum the table first time with 1 tuple and 130 byte wide rows, we'll have a full scan at 130 bytes, 12 kbytes, 1.2MB, 123MB, 12 GB, 1.2TB. >> >> If we happen to perform the first vacuum when there are 4 tuples in the table, it becomes 52kb, 5MB, 495MB, 48GB - and both 12GB and 48GB will exhaust any storage spike IOPS budget, slowing everything down rather suddenly. >> >> Can the upper limit for this GUC be lifted, or have a value for "never"? > > I have some further exploration of how statistics obtained by B-tree > index vacuum cleanup is used. > > 1) Collected pages and tuples numbers are not directly used, but used > for an estimation of tuples density per page, while current number of > page is estimated using smgr (see btcostestimate()). So, unless > density of tuples significantly changes, having index statistics > stalled doesn't affect query plans. > 2) Our optimization for skipping B-tree index vacuum cleanup works > only in case when use manually vacuums table in order to update > visibility map. Autovacuum is not triggered for append-only tables. > So, if user doesn't have special care about append-only tables, > they're not vacuumed until "autovacuum to prevent wraparound". Thus, > index statistics could be very stalled. 
And I don't think we have > many occurrences of issues with stalled index statistics. > 3) We have very safe defaul of vacuum_cleanup_index_scale_factor equal > to 1.1. But as Darafei claimed, 100 maximum value is probably too low > for advanced users, who really need benefits of this optimization. Thank you for the explanation. > So, I'm proposing to raise maximum valus of > vacuum_cleanup_index_scale_factor to DBL_MAX. Any objections? > I agree to expand the maximum value. But if users don't want index cleanup it would be helpful if we had an option (e.g. setting to -1) to disable index cleanup while documenting the risk of disabling it. It seems to me that setting very high values serves the same purpose. Also, your patch lacks a documentation update. BTW, I realized that postgresql.conf.sample doesn't have the vacuum_cleanup_index_scale_factor option. Attached patch fixes it. Regards, -- Masahiko Sawada
Attachment
On Tue, Jun 19, 2018 at 11:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Mon, Jun 18, 2018 at 1:56 PM, Alexander Korotkov > > So, I'm proposing to raise maximum valus of > > vacuum_cleanup_index_scale_factor to DBL_MAX. Any objections? > > > > I agree to expand the maximum value. But if users don't want index > cleanup it would be helpful if we have an option (e.g. setting to -1) > to disable index cleanup while documenting a risk of disabling index > cleanup. It seems to me that setting very high values means the same > purpose. Yes, providing an option to completely disable b-tree index cleanup would be good. But the problem is that we already use the -1 value for "use the default" in reloptions. So, even if we make the -1 GUC option mean "never cleanup", we still wouldn't be able to make the reloption work this way. Probably, we should use another "magical value" in the reloption for "use the default" semantics. > Also, your patch lacks documentation update. Good catch, thank you. > BTW, I realized that postgresql.conf.sample doesn't have > vacuum_cleanup_index_scale_factor option. Attached patch fixes it. It seems that you posted the wrong attachment, because the patch you sent is exactly the same as mine. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Tue, Jun 19, 2018 at 5:43 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Tue, Jun 19, 2018 at 11:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Mon, Jun 18, 2018 at 1:56 PM, Alexander Korotkov >> > So, I'm proposing to raise maximum valus of >> > vacuum_cleanup_index_scale_factor to DBL_MAX. Any objections? >> > >> >> I agree to expand the maximum value. But if users don't want index >> cleanup it would be helpful if we have an option (e.g. setting to -1) >> to disable index cleanup while documenting a risk of disabling index >> cleanup. It seems to me that setting very high values means the same >> purpose. > > Yes, providing an option to completely disable b-tree index cleanup > would be good. But the problem is that we already use -1 value for > "use the default" in reloption. So, if even we will make -1 guc > option to mean "never cleanup", then we still wouldn't be able to make > reloption to work this way. Probably, we should use another "magical > value" in reloption for "use the default" semantics. Right. We can add a new reloption specifying whether to use the default value of vacuum_cleanup_index_scale_factor or not, but it might be overkill. > >> Also, your patch lacks documentation update. > > Good catch, thank you. > >> BTW, I realized that postgresql.conf.sample doesn't have >> vacuum_cleanup_index_scale_factor option. Attached patch fixes it. > > It seems that you post a wrong attachment, because the patch you sent > is exactly same as mine. > Sorry, attached the correct one. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
On Tue, Jun 19, 2018 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Tue, Jun 19, 2018 at 5:43 PM, Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > On Tue, Jun 19, 2018 at 11:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > >> On Mon, Jun 18, 2018 at 1:56 PM, Alexander Korotkov > >> > So, I'm proposing to raise maximum valus of > >> > vacuum_cleanup_index_scale_factor to DBL_MAX. Any objections? > >> > > >> > >> I agree to expand the maximum value. But if users don't want index > >> cleanup it would be helpful if we have an option (e.g. setting to -1) > >> to disable index cleanup while documenting a risk of disabling index > >> cleanup. It seems to me that setting very high values means the same > >> purpose. > > > > Yes, providing an option to completely disable b-tree index cleanup > > would be good. But the problem is that we already use -1 value for > > "use the default" in reloption. So, if even we will make -1 guc > > option to mean "never cleanup", then we still wouldn't be able to make > > reloption to work this way. Probably, we should use another "magical > > value" in reloption for "use the default" semantics. > > Right. We can add a new reloption specifying whether we use default > value of vacuum_index_cleanup_scale_factor or not but it might be > overkill. That would be overkill, and also that would be different from how other reloptions behave. > >> Also, your patch lacks documentation update. > > > > Good catch, thank you. > > > >> BTW, I realized that postgresql.conf.sample doesn't have > >> vacuum_cleanup_index_scale_factor option. Attached patch fixes it. > > > > It seems that you post a wrong attachment, because the patch you sent > > is exactly same as mine. > > > > Sorry, attached correct one. Ok. I've rephrased the comment a bit. 
Ideally it should be grouped with other vacuum options, but we don't have a single section for that. The "Autovacuum" section is also not appropriate, because this GUC works not only for autovacuum. So, the most semantically close options, which affect vacuum in general, are vacuum_freeze_min_age, vacuum_freeze_table_age, vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age, which are located in the "client connection defaults" section. So, I decided to move this GUC into that section. I also changed the section in the GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT. I think we ideally should have a "vacuum" section, which would have two subsections: "client defaults" and "autovacuum". But that should be a separate patch, which needs to be discussed separately. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Wed, Jun 20, 2018 at 1:00 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Tue, Jun 19, 2018 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Tue, Jun 19, 2018 at 5:43 PM, Alexander Korotkov >> <a.korotkov@postgrespro.ru> wrote: >> > On Tue, Jun 19, 2018 at 11:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >> On Mon, Jun 18, 2018 at 1:56 PM, Alexander Korotkov >> >> > So, I'm proposing to raise maximum valus of >> >> > vacuum_cleanup_index_scale_factor to DBL_MAX. Any objections? >> >> > >> >> >> >> I agree to expand the maximum value. But if users don't want index >> >> cleanup it would be helpful if we have an option (e.g. setting to -1) >> >> to disable index cleanup while documenting a risk of disabling index >> >> cleanup. It seems to me that setting very high values means the same >> >> purpose. >> > >> > Yes, providing an option to completely disable b-tree index cleanup >> > would be good. But the problem is that we already use -1 value for >> > "use the default" in reloption. So, if even we will make -1 guc >> > option to mean "never cleanup", then we still wouldn't be able to make >> > reloption to work this way. Probably, we should use another "magical >> > value" in reloption for "use the default" semantics. >> >> Right. We can add a new reloption specifying whether we use default >> value of vacuum_index_cleanup_scale_factor or not but it might be >> overkill. > > That would be overkill, and also that would be different from how > other reloptions behaves. Agreed. > >> >> Also, your patch lacks documentation update. >> > >> > Good catch, thank you. >> > >> >> BTW, I realized that postgresql.conf.sample doesn't have >> >> vacuum_cleanup_index_scale_factor option. Attached patch fixes it. >> > >> > It seems that you post a wrong attachment, because the patch you sent >> > is exactly same as mine. >> > >> >> Sorry, attached correct one. > > Ok. I've rephrased comment a bit. 
Also, you created "index vacuum" > subsection in the "resource usage" section. I think it's not > appropriate for this option to be in "resource usage". Ideally it > should be grouped with other vacuum options, but we don't have single > section for that. "Autovacuum" section is also not appropriate, > because this guc works not only for autovacuum. So, most semantically > close options, which affects vacuum in general, are > vacuum_freeze_min_age, vacuum_freeze_table_age, > vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age, > which are located in "client connection defaults" section. So, I > decided to move this GUC into this section. I also change the section > in GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT. Agreed. So should we move it to 19.11 Client Connection Defaults in the docs as well? And I think it's better if this option is placed next to the other vacuum options so it is easier to find. Attached patch changes them so. Please review it. > I think we ideally should have a "vacuum" section, which would have > two subections: "client defaults" and "autovacuum". But that should > be a separate patch, which needs to be discussed separately. +1 Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
On Wed, Jun 20, 2018 at 8:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Jun 20, 2018 at 1:00 AM, Alexander Korotkov > > Ok. I've rephrased comment a bit. Also, you created "index vacuum" > > subsection in the "resource usage" section. I think it's not > > appropriate for this option to be in "resource usage". Ideally it > > should be grouped with other vacuum options, but we don't have single > > section for that. "Autovacuum" section is also not appropriate, > > because this guc works not only for autovacuum. So, most semantically > > close options, which affects vacuum in general, are > > vacuum_freeze_min_age, vacuum_freeze_table_age, > > vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age, > > which are located in "client connection defaults" section. So, I > > decided to move this GUC into this section. I also change the section > > in GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT. > > Agreed. So should we move it to 19.11 Client Connection Defaults in > doc as well? And I think it's better if this option places next to > other vacuum options for finding easier. Attached patch changes them > so. Please review it. Right, thank you. Looks good to me. I'm going to commit this if no objections. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Wed, Jun 20, 2018 at 12:00 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Wed, Jun 20, 2018 at 8:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jun 20, 2018 at 1:00 AM, Alexander Korotkov > > > Ok. I've rephrased comment a bit. Also, you created "index vacuum" > > > subsection in the "resource usage" section. I think it's not > > > appropriate for this option to be in "resource usage". Ideally it > > > should be grouped with other vacuum options, but we don't have single > > > section for that. "Autovacuum" section is also not appropriate, > > > because this guc works not only for autovacuum. So, most semantically > > > close options, which affects vacuum in general, are > > > vacuum_freeze_min_age, vacuum_freeze_table_age, > > > vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age, > > > which are located in "client connection defaults" section. So, I > > > decided to move this GUC into this section. I also change the section > > > in GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT. > > > > Agreed. So should we move it to 19.11 Client Connection Defaults in > > doc as well? And I think it's better if this option places next to > > other vacuum options for finding easier. Attached patch changes them > > so. Please review it. > > Right, thank you. Looks good for me. > I'm going to commit this if no objections. Pushed. Regarding maximum value for vacuum_cleanup_index_scale_factor. I think introducing special value with "never cleanup" meaning would be overkill for post feature freeze enhancement. So, I propose to just increase maximum value for both GUC and reloption. See the attached patch. It also changes calculations _bt_vacuum_needs_cleanup() for better handling of large values (just some kind of overflow paranoia). ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Fri, Jun 22, 2018 at 6:55 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Wed, Jun 20, 2018 at 12:00 PM Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: >> On Wed, Jun 20, 2018 at 8:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> > On Wed, Jun 20, 2018 at 1:00 AM, Alexander Korotkov >> > > Ok. I've rephrased comment a bit. Also, you created "index vacuum" >> > > subsection in the "resource usage" section. I think it's not >> > > appropriate for this option to be in "resource usage". Ideally it >> > > should be grouped with other vacuum options, but we don't have single >> > > section for that. "Autovacuum" section is also not appropriate, >> > > because this guc works not only for autovacuum. So, most semantically >> > > close options, which affects vacuum in general, are >> > > vacuum_freeze_min_age, vacuum_freeze_table_age, >> > > vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age, >> > > which are located in "client connection defaults" section. So, I >> > > decided to move this GUC into this section. I also change the section >> > > in GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT. >> > >> > Agreed. So should we move it to 19.11 Client Connection Defaults in >> > doc as well? And I think it's better if this option places next to >> > other vacuum options for finding easier. Attached patch changes them >> > so. Please review it. >> >> Right, thank you. Looks good for me. >> I'm going to commit this if no objections. > > Pushed. Thank you! > > Regarding maximum value for vacuum_cleanup_index_scale_factor. I > think introducing special value with "never cleanup" meaning would be > overkill for post feature freeze enhancement. After more thought, adding a threshold for invoking index cleanup could perhaps support "never cleanup", though that would be a PostgreSQL 12 item. If an index has fewer tuples than the threshold (say, vacuum_cleanup_index_threshold), index cleanup would never be executed. 
> So, I propose to just > increase maximum value for both GUC and reloption. See the attached > patch. It also changes calculations _bt_vacuum_needs_cleanup() for > better handling of large values (just some kind of overflow paranoia). > The patch looks good to me. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jun 26, 2018 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Jun 22, 2018 at 6:55 PM, Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > So, I propose to just > > increase maximum value for both GUC and reloption. See the attached > > patch. It also changes calculations _bt_vacuum_needs_cleanup() for > > better handling of large values (just some kind of overflow paranoia). > > The patch looks good to me. Pushed, thanks! ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Tue, Jun 26, 2018 at 15:42, Alexander Korotkov <a.korotkov@postgrespro.ru>:
On Tue, Jun 26, 2018 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Jun 22, 2018 at 6:55 PM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > So, I propose to just
> > increase maximum value for both GUC and reloption. See the attached
> > patch. It also changes calculations _bt_vacuum_needs_cleanup() for
> > better handling of large values (just some kind of overflow paranoia).
>
> The patch looks good to me.
Pushed, thanks!
Thank you for the enhancement. Now Index Only Scans over Append-Only tables in Postgres can be implemented, even if it requires manual kicking of VACUUM over a large table, and that's a great enhancement for moving object databases. :)
My eye catches another thing, the error message in tests is:
DETAIL: Valid values are between "0.000000" and "179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000".
a) do we really need to print digits of dblmax? "Valid values are double precision, non-negative"?
b) double precision binary-to-decimal noise starts at 16th digit. Why does it stop at the point, and we have precise ".000000"? Does it bite the conversion somewhere else too?
On Tue, Jun 26, 2018 at 4:11 PM Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote: > Tue, Jun 26, 2018 at 15:42, Alexander Korotkov <a.korotkov@postgrespro.ru>: >> >> On Tue, Jun 26, 2018 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> > On Fri, Jun 22, 2018 at 6:55 PM, Alexander Korotkov >> > <a.korotkov@postgrespro.ru> wrote: >> > > So, I propose to just >> > > increase maximum value for both GUC and reloption. See the attached >> > > patch. It also changes calculations _bt_vacuum_needs_cleanup() for >> > > better handling of large values (just some kind of overflow paranoia). >> > >> > The patch looks good to me. >> >> Pushed, thanks! > > > Thank you for the enhancement. Now Index Only Scans over Append-Only tables in Postgres can be implemented, even if it requires manual kicking of VACUUM over large table, and that's a great enhancement for moving object databases. :) > > My eye catches another thing, the error message in tests is: > > DETAIL: Valid values are between "0.000000" and "179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000". > > a) do we really need to print digits of dblmax? "Valid values are double precision, non-negative"? > b) double precision binary-to-decimal noise starts at 16th digit. Why does it stop at the point, and we have precise ".000000"? Does it bite the conversion somewhere else too? Thank you for pointing that out. I'm proposing to change output format from "%f" to "%g" [1] [2]. It looks better and the same as what we do for GUCs. 1. https://www.postgresql.org/message-id/CAPpHfdvewmr4PcpRjrkstoNn1n2_6dL-iHRB21CCfZ0efZdBTg%40mail.gmail.com 2. 
https://www.postgresql.org/message-id/CAPpHfdsFBbJAd24F0b_o9TfTtu%2B%2BjH0bR5XS_e9xbSwk8SJwyQ%40mail.gmail.com ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company