Thread: [HACKERS] GUC for cleanup indexes threshold.

[HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
Hi and happy new year.

Lazy vacuum calls lazy_cleanup_index at the end of lazy_scan_heap to
update index statistics such as relpages and reltuples. For every index
type, lazy_cleanup_index scans all index pages, and it does so even if
the table has not been updated at all since the previous vacuum. The
freeze map greatly reduces the execution time and cost of vacuuming the
table when most of it is already frozen, but it does not help with
cleaning up indexes. So if a very large, mostly static table has an
index, the cleanup step still scans every index page, which takes a
long time, as reported in [1].

The attached patch introduces a new GUC parameter,
vacuum_cleanup_index_scale_factor, which specifies the fraction of
table pages containing dead tuples needed to trigger index cleanup. The
default is 0.0, which means that index cleanup is not invoked if there
were no updates to the table. In other words, if the table is
completely frozen, lazy vacuum can skip the index scans as well.
Increasing this value can reduce the total time of lazy vacuum, but the
index statistics and the index free space map are then not updated.
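
Roughly, the intended check amounts to the following. This is just a
simplified, standalone sketch of the arithmetic (names and types are
illustrative; the real patch works on block counts inside
lazy_scan_heap):

#include <stdbool.h>
#include <stdio.h>

/*
 * Illustration only: index cleanup runs when the number of heap pages
 * that contained dead tuples exceeds rel_pages * scale_factor.  The
 * strict ">" makes the 0.0 default skip cleanup for a table with no
 * dead tuples at all.
 */
static bool
cleanup_indexes_needed(double vacuumed_pages, double rel_pages,
                       double cleanup_scale_factor)
{
    return vacuumed_pages > rel_pages * cleanup_scale_factor;
}

int
main(void)
{
    /* 1,000,000-page table with the 0.0 default */
    printf("%d\n", cleanup_indexes_needed(0, 1000000, 0.0));     /* 0: skip cleanup */
    printf("%d\n", cleanup_indexes_needed(1, 1000000, 0.0));     /* 1: run cleanup  */
    /* with 0.01, cleanup waits until more than 10,000 pages had dead tuples */
    printf("%d\n", cleanup_indexes_needed(5000, 1000000, 0.01)); /* 0: skip cleanup */
    return 0;
}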

[1]
https://www.postgresql.org/message-id/MWHPR20MB142177B86D893C946FAFC9A4A18C0%40MWHPR20MB1421.namprd20.prod.outlook.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Robert Haas
Date:
On Wed, Jan 4, 2017 at 3:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Hi and happy new year.
>
> Lazy vacuum calls lazy_cleanup_index at the end of lazy_scan_heap to
> update index statistics such as relpages and reltuples. For every index
> type, lazy_cleanup_index scans all index pages, and it does so even if
> the table has not been updated at all since the previous vacuum. The
> freeze map greatly reduces the execution time and cost of vacuuming the
> table when most of it is already frozen, but it does not help with
> cleaning up indexes. So if a very large, mostly static table has an
> index, the cleanup step still scans every index page, which takes a
> long time, as reported in [1].
>
> The attached patch introduces a new GUC parameter,
> vacuum_cleanup_index_scale_factor, which specifies the fraction of
> table pages containing dead tuples needed to trigger index cleanup. The
> default is 0.0, which means that index cleanup is not invoked if there
> were no updates to the table. In other words, if the table is
> completely frozen, lazy vacuum can skip the index scans as well.
> Increasing this value can reduce the total time of lazy vacuum, but the
> index statistics and the index free space map are then not updated.

Cool.  I'll look at this, but not until next CF.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kuntal Ghosh
Date:
On Wed, Jan 4, 2017 at 1:51 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> The attached patch introduces a new GUC parameter,
> vacuum_cleanup_index_scale_factor, which specifies the fraction of
> table pages containing dead tuples needed to trigger index cleanup. The
> default is 0.0, which means that index cleanup is not invoked if there
> were no updates to the table. In other words, if the table is
> completely frozen, lazy vacuum can skip the index scans as well.
> Increasing this value can reduce the total time of lazy vacuum, but the
> index statistics and the index free space map are then not updated.
>
I was looking into your patch and trying to understand how the
following piece of code works.
+    if (vacuumed_pages > cleanupidx_thresh)
+    {
+        for (i = 0; i < nindexes; i++)
+            lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
+    }
So, you are skipping the deletion of index entries if the clean-up
index threshold is not reached, but you are still removing all the dead
tuples from the heap that those index entries point to. Hence, the
index will contain entries with invalid references. How does that work?
How will you remove those index entries later? (I'm a newbie.)

+        This parameter can only be set anywhere.
Oxymoron. :-)


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Fri, Feb 10, 2017 at 8:01 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> On Wed, Jan 4, 2017 at 1:51 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>> The attached patch introduces a new GUC parameter,
>> vacuum_cleanup_index_scale_factor, which specifies the fraction of
>> table pages containing dead tuples needed to trigger index cleanup. The
>> default is 0.0, which means that index cleanup is not invoked if there
>> were no updates to the table. In other words, if the table is
>> completely frozen, lazy vacuum can skip the index scans as well.
>> Increasing this value can reduce the total time of lazy vacuum, but the
>> index statistics and the index free space map are then not updated.
>>
> I was looking into your patch and trying to understand how the
> following piece of code works.

Thank you for looking at this patch!

> +    if (vacuumed_pages > cleanupidx_thresh)
> +    {
> +        for (i = 0; i < nindexes; i++)
> +            lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
> +    }
> So, you are skipping the deletion of index entries if the clean-up
> index threshold is not reached, but you are still removing all the
> dead tuples from the heap that those index entries point to. Hence,
> the index will contain entries with invalid references.

I think not. Before lazy_cleanup_index is called, all garbage in the
heap and indexes should already have been reclaimed by lazy_vacuum_heap
and lazy_vacuum_index.
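
As a rough model of the ordering (the function names are the real ones
in vacuumlazy.c; the bodies here are obviously just stand-ins):

#include <stdio.h>

static void lazy_vacuum_index(void)  { puts("remove index entries pointing to dead tuples"); }
static void lazy_vacuum_heap(void)   { puts("remove the dead heap tuples themselves"); }
static void lazy_cleanup_index(void) { puts("update index statistics and free space only"); }

int
main(void)
{
    int num_dead_tuples = 42;   /* pretend the heap scan collected some garbage */

    if (num_dead_tuples > 0)
    {
        lazy_vacuum_index();    /* indexes first, so no index entry ... */
        lazy_vacuum_heap();     /* ... can point at an already-reclaimed heap slot */
    }

    /* only afterwards, and this is the step the patch proposes to skip */
    lazy_cleanup_index();
    return 0;
}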

> +        This parameter can only be set anywhere.
> Oxymoron. :-)
>

Will fix it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
"Ideriha, Takeshi"
Date:
Hi, I ran the regression tests and found some failures concerning brin
and gin, though I haven't looked into them yet.

Here's a log:

*** /home/ideriha/postgres-master/src/test/regress/expected/brin.out    2017-02-13 11:33:43.270942937 +0900
--- /home/ideriha/postgres-master/src/test/regress/results/brin.out    2017-02-15 14:58:24.725984474 +0900
***************
*** 403,408 ****
  SELECT brin_summarize_new_values('brinidx'); -- ok, no change expected
   brin_summarize_new_values
  ---------------------------
!                          0
  (1 row)

--- 403,408 ----
  SELECT brin_summarize_new_values('brinidx'); -- ok, no change expected
   brin_summarize_new_values
  ---------------------------
!                          5
  (1 row)

======================================================================

*** /home/ideriha/postgres-master/src/test/regress/expected/gin.out    2016-12-20 16:49:09.513050050 +0900
--- /home/ideriha/postgres-master/src/test/regress/results/gin.out    2017-02-15 14:58:25.536984461 +0900
***************
*** 20,26 ****
  select gin_clean_pending_list('gin_test_idx'); -- nothing to flush
   gin_clean_pending_list
  ------------------------
!                       0
  (1 row)

  -- Test vacuuming
--- 20,26 ----
  select gin_clean_pending_list('gin_test_idx'); -- nothing to flush
   gin_clean_pending_list
  ------------------------
!                       8
  (1 row)

  -- Test vacuuming

======================================================================


Regards,
Ideriha Takeshi

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Wed, Feb 15, 2017 at 4:39 PM, Ideriha, Takeshi
<ideriha.takeshi@jp.fujitsu.com> wrote:
> Hi, I tried regression test and found some errors concerning brin and gin,
> though I didn't look into this.
>
> Here's a log:
>
> *** /home/ideriha/postgres-master/src/test/regress/expected/brin.out    2017-02-13 11:33:43.270942937 +0900
> --- /home/ideriha/postgres-master/src/test/regress/results/brin.out     2017-02-15 14:58:24.725984474 +0900
> ***************
> *** 403,408 ****
>   SELECT brin_summarize_new_values('brinidx'); -- ok, no change expected
>    brin_summarize_new_values
>   ---------------------------
> !                          0
>   (1 row)
>
> --- 403,408 ----
>   SELECT brin_summarize_new_values('brinidx'); -- ok, no change expected
>    brin_summarize_new_values
>   ---------------------------
> !                          5
>   (1 row)
>
>
> ======================================================================
>
> *** /home/ideriha/postgres-master/src/test/regress/expected/gin.out     2016-12-20 16:49:09.513050050 +0900
> --- /home/ideriha/postgres-master/src/test/regress/results/gin.out      2017-02-15 14:58:25.536984461 +0900
> ***************
> *** 20,26 ****
>   select gin_clean_pending_list('gin_test_idx'); -- nothing to flush
>    gin_clean_pending_list
>   ------------------------
> !                       0
>   (1 row)
>
>   -- Test vacuuming
> --- 20,26 ----
>   select gin_clean_pending_list('gin_test_idx'); -- nothing to flush
>    gin_clean_pending_list
>   ------------------------
> !                       8
>   (1 row)
>
>   -- Test vacuuming
>
> ======================================================================
>
>

Thank you for testing!

It's a bug. Attached latest version patch, which passed make check.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kuntal Ghosh
Date:
On Mon, Feb 13, 2017 at 1:01 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thanks for the explanation. I've looked into the code you referred to,
but I'm still in doubt. vacuumed_pages is incremented only when there
are no indexes, i.e. nindexes == 0. Now, look at the following part of
the patch.

+       /*
+        * Do post-vacuum cleanup and statistics update for each index if
+        * the number of vacuumed page exceeds threshold.
+        */
+       cleanupidx_thresh = (float4) nblocks * vacuum_cleanup_index_scale;
+
+       elog(DEBUG3, "%s: vac: %d (threshold %0.f)",
+                RelationGetRelationName(onerel), nblocks, cleanupidx_thresh);
+       if (vacuumed_pages >= cleanupidx_thresh)
+       {
+               for (i = 0; i < nindexes; i++)
+                       lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
+       }
So, unless vacuum_cleanup_index_scale_factor is zero,
lazy_cleanup_index will never be called. IMO, this seems to be
incorrect. Besides, I've tested with a non-zero (0.5)
vacuum_cleanup_index_scale_factor, and the regression tests for brin
and gin fail (make installcheck).

+               {"vacuum_cleanup_index_scale_factor", PGC_USERSET,
CLIENT_CONN_STATEMENT,
+                gettext_noop("Number of pages containing dead tuple
prior to vacuum as a fraction of relpages."),
+                       NULL
+               },
+               &vacuum_cleanup_index_scale,
+               0.0, 0.0, 100.0,
+               NULL, NULL, NULL
+       },
The maximum value for vacuum_cleanup_index_scale_factor should be 1
instead of 100. As the code indicates, it is certainly not treated as
a percentage of relpages.
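
In other words, I would expect the entry to look roughly like this
(sketch only, mirroring the excerpt above with the corrected maximum
and a reworded description):

        {
                {"vacuum_cleanup_index_scale_factor", PGC_USERSET, CLIENT_CONN_STATEMENT,
                 gettext_noop("Fraction of table pages containing dead tuples "
                              "required to trigger index cleanup."),
                        NULL
                },
                &vacuum_cleanup_index_scale,
                0.0, 0.0, 1.0,
                NULL, NULL, NULL
        },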

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Simon Riggs
Date:
On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> It's a bug. Attached latest version patch, which passed make check.

In its current form, I'm not sure this is a good idea. Problems...

1. I'm pretty sure the world doesn't need another VACUUM parameter

I suggest that we use the existing vacuum scale factor/4 to reflect
that indexes are more sensitive to bloat.

2. The current btree vacuum code requires 2 vacuums to fully reuse
half-dead pages. So skipping an index vacuum might mean that second
index scan never happens at all, which would be bad.

I suggest that we store the number of half-dead pages in the metapage
after each VACUUM, so we can decide whether to skip the scan or not.
And we use some math like each half-dead page that needs to be reused
is worth 250 index entries, so the decision to skip is based upon rows
and empty pages, not just recently vacuumed rows.
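
A sketch of the kind of math I mean (the names, and the constant 250,
are of course just placeholders):

#include <stdbool.h>

/*
 * Hypothetical skip test: weigh each half-dead/empty page recorded in
 * the metapage as if it were 250 index entries, and compare against the
 * existing vacuum scale factor divided by 4.
 */
static bool
skip_index_vacuum(double dead_rows, double half_dead_pages,
                  double reltuples, double vacuum_scale_factor)
{
    double effective_garbage = dead_rows + 250.0 * half_dead_pages;

    return effective_garbage < reltuples * (vacuum_scale_factor / 4.0);
}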

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Robert Haas
Date:
On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> It's a bug. Attached latest version patch, which passed make check.
>
> In its current form, I'm not sure this is a good idea. Problems...
>
> 1. I'm pretty sure the world doesn't need another VACUUM parameter
>
> I suggest that we use the existing vacuum scale factor/4 to reflect
> that indexes are more sensitive to bloat.

I do not think it's a good idea to control multiple behaviors with a
single GUC.  We don't really know that dividing by 4 will be right for
everyone, or even for most people.  It's better to have another
parameter with a sensible default than to hardcode a ratio that might
work out poorly for some people.

> 2. The current btree vacuum code requires 2 vacuums to fully reuse
> half-dead pages. So skipping an index vacuum might mean that second
> index scan never happens at all, which would be bad.

Maybe.  If there are a tiny number of those half-dead pages in a huge
index, it probably doesn't matter.  Also, I don't think it would never
happen, unless the table just never gets any more updates or deletes -
but that case could also happen today.  It's just a matter of
happening less frequently.

I guess the question is whether the accumulation of half-dead pages in
the index could become a problem before the unsetting of all-visible
bits in the heap becomes a problem.  If the second one always happens
first, then we don't have an issue here, but if it's possible for the
first one to become a big problem before the second one gets to be a
serious issue, then we need something more sophisticated.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> It's a bug. Attached latest version patch, which passed make check.
>>
>> In its current form, I'm not sure this is a good idea. Problems...
>>
>> 1. I'm pretty sure the world doesn't need another VACUUM parameter
>>
>> I suggest that we use the existing vacuum scale factor/4 to reflect
>> that indexes are more sensitive to bloat.
>
> I do not think it's a good idea to control multiple behaviors with a
> single GUC.  We don't really know that dividing by 4 will be right for
> everyone, or even for most people.  It's better to have another
> parameter with a sensible default than to hardcode a ratio that might
> work out poorly for some people.
>
>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>> half-dead pages. So skipping an index vacuum might mean that second
>> index scan never happens at all, which would be bad.
>
> Maybe.  If there are a tiny number of those half-dead pages in a huge
> index, it probably doesn't matter.  Also, I don't think it would never
> happen, unless the table just never gets any more updates or deletes -
> but that case could also happen today.  It's just a matter of
> happening less frequently.

The half-dead pages are never cleaned up if the ratio of pages
containing garbage is always lower than the threshold. Also, in a gin
index the pending list is never cleared, which could become a big
problem. I guess that we should take action for each type of index.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Jim Nasby
Date:
On 2/19/17 7:56 PM, Masahiko Sawada wrote:
> The half-dead pages are never cleaned up if the ratio of pages
> containing garbage is always lower than the threshold. Also, in a gin
> index the pending list is never cleared, which could become a big
> problem. I guess that we should take action for each type of index.

What worries me is that each AM is going to have a different notion of 
what needs to happen to support this. That indicates that trying to 
handle this at the vacuum level is not a good idea.

I think it would be wiser to add support for skipping scans to the AM 
API instead. That also means you don't have to add support for this to 
every index type to start with.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Mon, Feb 20, 2017 at 11:35 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 2/19/17 7:56 PM, Masahiko Sawada wrote:
>>
>> The half-dead pages are never cleaned up if the ratio of pages
>> containing garbage is always lower than the threshold. Also, in a gin
>> index the pending list is never cleared, which could become a big
>> problem. I guess that we should take action for each type of index.
>
>
> What worries me is that each AM is going to have a different notion of what
> needs to happen to support this. That indicates that trying to handle this
> at the vacuum level is not a good idea.
>
> I think it would be wiser to add support for skipping scans to the AM API
> instead. That also means you don't have to add support for this to every
> index type to start with.

Yeah, and it's better to have it as an index storage parameter.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Amit Kapila
Date:
On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> It's a bug. Attached latest version patch, which passed make check.
>>>
>>> In its current form, I'm not sure this is a good idea. Problems...
>>>
>>> 1. I'm pretty sure the world doesn't need another VACUUM parameter
>>>
>>> I suggest that we use the existing vacuum scale factor/4 to reflect
>>> that indexes are more sensitive to bloat.
>>
>> I do not think it's a good idea to control multiple behaviors with a
>> single GUC.  We don't really know that dividing by 4 will be right for
>> everyone, or even for most people.  It's better to have another
>> parameter with a sensible default than to hardcode a ratio that might
>> work out poorly for some people.
>>
>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>>> half-dead pages. So skipping an index vacuum might mean that second
>>> index scan never happens at all, which would be bad.
>>
>> Maybe.  If there are a tiny number of those half-dead pages in a huge
>> index, it probably doesn't matter.  Also, I don't think it would never
>> happen, unless the table just never gets any more updates or deletes -
>> but that case could also happen today.  It's just a matter of
>> happening less frequently.
>

Yeah, that's right, and I am not sure whether it is worth performing a
complete pass to reclaim dead/deleted pages unless we somehow know
that there are many such pages.  Also, I think we do reclaim the
complete page while allocating a new page in btree.

> The half-dead pages are never cleaned up if the ratio of pages
> containing garbage is always lower than threshold.
>

Which threshold are you referring to here?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Simon Riggs
Date:
On 20 February 2017 at 09:15, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>> It's a bug. Attached latest version patch, which passed make check.

>>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>>>> half-dead pages. So skipping an index vacuum might mean that second
>>>> index scan never happens at all, which would be bad.
>>>
>>> Maybe.  If there are a tiny number of those half-dead pages in a huge
>>> index, it probably doesn't matter.  Also, I don't think it would never
>>> happen, unless the table just never gets any more updates or deletes -
>>> but that case could also happen today.  It's just a matter of
>>> happening less frequently.
>>
>
> Yeah thats right and I am not sure if it is worth to perform a
> complete pass to reclaim dead/deleted pages unless we know someway
> that there are many such pages.

Agreed.... which is why
On 16 February 2017 at 11:17, Simon Riggs <simon@2ndquadrant.com> wrote:
> I suggest that we store the number of half-dead pages in the metapage
> after each VACUUM, so we can decide whether to skip the scan or not.


> Also, I think we do reclaim the
> complete page while allocating a new page in btree.

That's not how it works according to the README at least.

You might be referring to cleaning out killed tuples just before a
page split? That's something different.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Amit Kapila
Date:
On Mon, Feb 20, 2017 at 3:01 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 20 February 2017 at 09:15, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>> It's a bug. Attached latest version patch, which passed make check.
>
>>>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>>>>> half-dead pages. So skipping an index vacuum might mean that second
>>>>> index scan never happens at all, which would be bad.
>>>>
>>>> Maybe.  If there are a tiny number of those half-dead pages in a huge
>>>> index, it probably doesn't matter.  Also, I don't think it would never
>>>> happen, unless the table just never gets any more updates or deletes -
>>>> but that case could also happen today.  It's just a matter of
>>>> happening less frequently.
>>>
>>
>> Yeah thats right and I am not sure if it is worth to perform a
>> complete pass to reclaim dead/deleted pages unless we know someway
>> that there are many such pages.
>
> Agreed.... which is why
> On 16 February 2017 at 11:17, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I suggest that we store the number of half-dead pages in the metapage
>> after each VACUUM, so we can decide whether to skip the scan or not.
>
>
>> Also, I think we do reclaim the
>> complete page while allocating a new page in btree.
>
> That's not how it works according to the README at least.
>

I am referring to code (_bt_getbuf()->if (_bt_page_recyclable(page))),
won't that help us in reclaiming the space?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Mon, Feb 20, 2017 at 6:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>> It's a bug. Attached latest version patch, which passed make check.
>>>>
>>>> In its current form, I'm not sure this is a good idea. Problems...
>>>>
>>>> 1. I'm pretty sure the world doesn't need another VACUUM parameter
>>>>
>>>> I suggest that we use the existing vacuum scale factor/4 to reflect
>>>> that indexes are more sensitive to bloat.
>>>
>>> I do not think it's a good idea to control multiple behaviors with a
>>> single GUC.  We don't really know that dividing by 4 will be right for
>>> everyone, or even for most people.  It's better to have another
>>> parameter with a sensible default than to hardcode a ratio that might
>>> work out poorly for some people.
>>>
>>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>>>> half-dead pages. So skipping an index vacuum might mean that second
>>>> index scan never happens at all, which would be bad.
>>>
>>> Maybe.  If there are a tiny number of those half-dead pages in a huge
>>> index, it probably doesn't matter.  Also, I don't think it would never
>>> happen, unless the table just never gets any more updates or deletes -
>>> but that case could also happen today.  It's just a matter of
>>> happening less frequently.
>>
>
> Yeah thats right and I am not sure if it is worth to perform a
> complete pass to reclaim dead/deleted pages unless we know someway
> that there are many such pages.  Also, I think we do reclaim the
> complete page while allocating a new page in btree.
>
>> The half-dead pages are never cleaned up if the ratio of pages
>> containing garbage is always lower than threshold.
>>
>
> Which threshold are you referring here?
>

I meant the new parameter in the current patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Simon Riggs
Date:
On 20 February 2017 at 10:27, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Feb 20, 2017 at 3:01 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 20 February 2017 at 09:15, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>>> It's a bug. Attached latest version patch, which passed make check.
>>
>>>>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>>>>>> half-dead pages. So skipping an index vacuum might mean that second
>>>>>> index scan never happens at all, which would be bad.
>>>>>
>>>>> Maybe.  If there are a tiny number of those half-dead pages in a huge
>>>>> index, it probably doesn't matter.  Also, I don't think it would never
>>>>> happen, unless the table just never gets any more updates or deletes -
>>>>> but that case could also happen today.  It's just a matter of
>>>>> happening less frequently.
>>>>
>>>
>>> Yeah thats right and I am not sure if it is worth to perform a
>>> complete pass to reclaim dead/deleted pages unless we know someway
>>> that there are many such pages.
>>
>> Agreed.... which is why
>> On 16 February 2017 at 11:17, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> I suggest that we store the number of half-dead pages in the metapage
>>> after each VACUUM, so we can decide whether to skip the scan or not.
>>
>>
>>> Also, I think we do reclaim the
>>> complete page while allocating a new page in btree.
>>
>> That's not how it works according to the README at least.
>>
>
> I am referring to code (_bt_getbuf()->if (_bt_page_recyclable(page))),
> won't that help us in reclaiming the space?

Not unless the README is incorrect, no.

That section of code is just a retest of pages retrieved from the FSM;
they aren't even added there until two scans have occurred, and even
then it may not be possible to recycle them.
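
For reference, the recyclability test itself amounts to roughly this
(paraphrased from _bt_page_recyclable() in nbtpage.c):

    /*
     * A page can be recycled only if it is brand new (all zeroes), or it
     * was marked deleted by some earlier VACUUM and no running
     * transaction can still be interested in it.
     */
    if (PageIsNew(page))
        return true;

    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    if (P_ISDELETED(opaque) &&
        TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
        return true;

    return false;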

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Thu, Feb 16, 2017 at 10:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>> half-dead pages. So skipping an index vacuum might mean that second
>> index scan never happens at all, which would be bad.
>
> Maybe.  If there are a tiny number of those half-dead pages in a huge
> index, it probably doesn't matter.  Also, I don't think it would never
> happen, unless the table just never gets any more updates or deletes -
> but that case could also happen today.  It's just a matter of
> happening less frequently.
>
> I guess the question is whether the accumulation of half-dead pages in
> the index could become a problem before the unsetting of all-visible
> bits in the heap becomes a problem.  If the second one always happen
> first, then we don't have an issue here, but if it's possible for the
> first one to become a big problem before the second one gets to be a
> serious issue, then we need something more sophisticated.

Not getting to a second VACUUM where you might have otherwise can only
be a problem to the extent that users are sensitive to not reclaiming
disk space from indexes at the level of the FSM. It's not accurate to
say that pages could be left "half dead" indefinitely by this patch,
since that is something that lasts only as long as the first phase of
B-Tree page deletion. In fact, the only possible problem is that pages
are recyclable in principle, but that doesn't happen due to this new
GUC.

That isn't analogous to heap bloat at all, because it's not as if
there are any downlinks or right links or left links pointing to the
recyclable (fully deleted) pages; the previous key space *has* in fact
been *fully* reclaimed. These pages are fully dead, and as such are
out of the critical path of index scans entirely once the second phase
finishes. (They only need to continue to physically exist because old
index scans might follow a stale pointer).

Note that there is an interlock against RecentGlobalXmin, respected by
VACUUM, that prevents this sort of recycling. I suspect that the
restrictions on page deletion, as opposed to page recycling, are vastly
more likely to cause pain to users, and that's not made any worse by
this.

-- 
Peter Geoghegan



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Amit Kapila
Date:
On Tue, Feb 21, 2017 at 1:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 20 February 2017 at 10:27, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Feb 20, 2017 at 3:01 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 20 February 2017 at 09:15, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>>>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>>>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>>>> It's a bug. Attached latest version patch, which passed make check.
>>>
>>>>>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>>>>>>> half-dead pages. So skipping an index vacuum might mean that second
>>>>>>> index scan never happens at all, which would be bad.
>>>>>>
>>>>>> Maybe.  If there are a tiny number of those half-dead pages in a huge
>>>>>> index, it probably doesn't matter.  Also, I don't think it would never
>>>>>> happen, unless the table just never gets any more updates or deletes -
>>>>>> but that case could also happen today.  It's just a matter of
>>>>>> happening less frequently.
>>>>>
>>>>
>>>> Yeah thats right and I am not sure if it is worth to perform a
>>>> complete pass to reclaim dead/deleted pages unless we know someway
>>>> that there are many such pages.
>>>
>>> Agreed.... which is why
>>> On 16 February 2017 at 11:17, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> I suggest that we store the number of half-dead pages in the metapage
>>>> after each VACUUM, so we can decide whether to skip the scan or not.
>>>
>>>
>>>> Also, I think we do reclaim the
>>>> complete page while allocating a new page in btree.
>>>
>>> That's not how it works according to the README at least.
>>>
>>
>> I am referring to code (_bt_getbuf()->if (_bt_page_recyclable(page))),
>> won't that help us in reclaiming the space?
>
> Not unless the README is incorrect, no.
>

Just to ensure that we both have the same understanding, let me try to
write down what I understand about this reclaim algorithm.  AFAIU, in
the first pass vacuum will mark the half-dead pages as Deleted, and in
the second pass it will record such pages as free in the FSM so that
they can be reused as new pages when the indexam asks for a new block,
instead of extending the index relation.  Now, if we introduce this new
GUC, there is a chance that we sometimes skip the second pass where it
would not have been skipped before.

Note that we do perform the second pass in the same vacuum cycle when
the index has not been scanned for deleting tuples, as per the code below:

btvacuumcleanup()
{
..
if (stats == NULL)
{
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
btvacuumscan(info, stats, NULL, NULL, 0);
..
}

In the above code, stats won't be NULL if the vacuum has scanned the
index for deleting tuples (btbulkdelete).  So, based on this, I think it
will skip scanning the index (or recycling pages marked as deleted) in
the second vacuum only when there are no dead tuple removals in that
vacuum.  Do we agree so far?
I understand that there could be some delay in reclaiming dead pages,
but do you think it is such a big deal that we should completely scan
the index for such cases, or even try to change the metapage format?

> That section of code is just a retest of pages retrieved from FSM;
>

Yes, I think you are right.  In short, I agree that only vacuum can
reclaim half-dead pages.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Wed, Feb 22, 2017 at 12:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Feb 21, 2017 at 1:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 20 February 2017 at 10:27, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Mon, Feb 20, 2017 at 3:01 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> On 20 February 2017 at 09:15, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>>>>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>>>>>> On 15 February 2017 at 08:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>>>>> It's a bug. Attached latest version patch, which passed make check.
>>>>
>>>>>>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>>>>>>>> half-dead pages. So skipping an index vacuum might mean that second
>>>>>>>> index scan never happens at all, which would be bad.
>>>>>>>
>>>>>>> Maybe.  If there are a tiny number of those half-dead pages in a huge
>>>>>>> index, it probably doesn't matter.  Also, I don't think it would never
>>>>>>> happen, unless the table just never gets any more updates or deletes -
>>>>>>> but that case could also happen today.  It's just a matter of
>>>>>>> happening less frequently.
>>>>>>
>>>>>
>>>>> Yeah thats right and I am not sure if it is worth to perform a
>>>>> complete pass to reclaim dead/deleted pages unless we know someway
>>>>> that there are many such pages.
>>>>
>>>> Agreed.... which is why
>>>> On 16 February 2017 at 11:17, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>>> I suggest that we store the number of half-dead pages in the metapage
>>>>> after each VACUUM, so we can decide whether to skip the scan or not.
>>>>
>>>>
>>>>> Also, I think we do reclaim the
>>>>> complete page while allocating a new page in btree.
>>>>
>>>> That's not how it works according to the README at least.
>>>>
>>>
>>> I am referring to code (_bt_getbuf()->if (_bt_page_recyclable(page))),
>>> won't that help us in reclaiming the space?
>>
>> Not unless the README is incorrect, no.
>>
>
> Just to ensure that we both have the same understanding, let me try to
> write down what I understand about this reclaim algorithm.  AFAIU, in
> the first pass vacuum will mark the half-dead pages as Deleted, and in
> the second pass it will record such pages as free in the FSM so that
> they can be reused as new pages when the indexam asks for a new block,
> instead of extending the index relation.  Now, if we introduce this new
> GUC, there is a chance that we sometimes skip the second pass where it
> would not have been skipped before.
>
> Note that we do perform the second pass in the same vacuum cycle when
> the index has not been scanned for deleting tuples, as per the code below:

The first pass uses the cycle id given by _bt_start_vacuum, but the
second pass always uses cycle id 0.

> btvacuumcleanup()
> {
> ..
> if (stats == NULL)
> {
> stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
> btvacuumscan(info, stats, NULL, NULL, 0);
> ..
> }
>
> In the above code, stats won't be NULL if the vacuum has scanned the
> index for deleting tuples (btbulkdelete).  So, based on this, I think it
> will skip scanning the index (or recycling pages marked as deleted) in
> the second vacuum only when there are no dead tuple removals in that
> vacuum.  Do we agree so far?

Agreed.

> I understand that there could be some delay in reclaiming dead pages,
> but do you think it is such a big deal that we should completely scan
> the index for such cases, or even try to change the metapage format?

IIUC, I think that we need to have the number of half-dead pages in the
meta page. Isn't it a problem that the free space map of a btree index
is not vacuumed if all vacuums skip the second pass?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Amit Kapila
Date:
On Thu, Feb 23, 2017 at 5:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Feb 22, 2017 at 12:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>> I understand that there could be some delay in reclaiming dead pages,
>> but do you think it is such a big deal that we should completely scan
>> the index for such cases, or even try to change the metapage format?
>
> IIUC, I think that we need to have the number of half-dead pages in meta page.
>

Don't you think we need to consider backward compatibility if we want
to do that?

> Isn't it a problem that the freespace map of btree index is not
> vacuumed if all vacuums skip the second pass?
>

AFAIU, you want to skip only when there is no dead tuple removal; if
so, what is the need to update the free space map of the btree index?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Robert Haas
Date:
On Fri, Feb 24, 2017 at 8:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> IIUC, I think that we need to have the number of half-dead pages in meta page.
>
> Don't you think we need to consider backward compatibility if we want
> to do that?

Yeah, that would be an on-disk format break.

I think this thread is pretty short on evidence that would let us make
a smart decision about what to do here.  I see three possibilities.
The first is that this patch is a good idea whether we do something
about the issue of half-dead pages or not.  The second is that this
patch is a good idea if we do something about the issue of half-dead
pages but a bad idea if we don't.  The third is that this patch is a
bad idea whether or not we do anything about the issue of half-dead
pages.

Unfortunately, we have no evidence at all that would let us figure out
which of those three things is true.  The original post didn't include
any relevant benchmarks or test results.  Simon's reply, which
suggested that the problem of half-dead pages needs to be addressed
first, didn't include any benchmarks or test results either.  In fact,
in neither place were any tests
suggested, even hypothetically, which would help us decide what to do.
I had a hunch when I saw this thread that it was a good idea, and
Simon has a hunch that this btree page recycling thing needs to be
fixed first, and he might be right.  Or we might both be wrong.  Or
... who knows, really?

I think we need to come up with some set of tests to figure out what
actually works well in practice here.  Theories are a good starting
point, but good vacuum behavior is really important, and a patch that
changes it ought to be backed up by at least some experimental
evidence.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Jim Nasby
Date:
On 2/24/17 11:26 AM, Robert Haas wrote:
> I think we need to come up with some set of tests to figure out what
> actually works well in practice here.  Theories are a good starting
> point, but good vacuum behavior is really important, and a patch that
> changes it ought to be backed up by at least some experimental
> evidence.

I think something else worth considering is that if we had some method 
of mapping heap TIDs back to indexes then a lot (all?) of these problems 
would go away. 10+ years ago the idea of keeping such a mapping would 
probably have been untenable, but with relation forks and how much cheaper 
storage is now, maybe that's no longer the case.

For btree I think this could be done by keeping a second btree ordered 
by ctid that points either to index entries or even just to whole index 
pages. At ~ 20 bytes per entry, even a 1B row index would take ~20GB.

Page splits are obviously a big issue. Maybe it's safe to update the 
ctid map for every item that gets moved when a split happens.

Would a ctid map work for other indexes as well?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Fri, Feb 24, 2017 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think this thread is pretty short on evidence that would let us make
> a smart decision about what to do here.  I see three possibilities.
> The first is that this patch is a good idea whether we do something
> about the issue of half-dead pages or not.  The second is that this
> patch is a good idea if we do something about the issue of half-dead
> pages but a bad idea if we don't.  The third is that this patch is a
> bad idea whether or not we do anything about the issue of half-dead
> pages.

Half-dead pages are not really relevant to this discussion, AFAICT. I
think that both you and Simon mean "recyclable" pages. There are
several levels of indirection involved here, to keep the locking very
granular, so it gets tricky to talk about.

B-Tree page deletion is like a page split in reverse. It has a
symmetry with page splits, which have two phases (atomic operations).
There are also two phases for deletion, the first of which leaves the
target page without a downlink in its parent, and marks it half dead.
By the end of the first phase, there are still sibling pointers, so an
index scan can land on them before the second phase of deletion begins
-- they can visit a half-dead page before such time as the second
phase of deletion begins, where the sibling link goes away. So, the
sibling link isn't stale as such, but the page is still morally dead.
(Second phase is where we remove even the sibling links, and declare
it fully dead.)

Even though there are two phases of deletion, the second still occurs
immediately after the first within VACUUM. The need to have two phases
is hard to explain, so I won't try, but it suffices to say that VACUUM
does not actually ever leave a page half dead unless there is a hard
crash.

Recall that VACUUMing of a B-Tree is performed sequentially, so blocks
can be recycled without needing to be found via a downlink or sibling
link by VACUUM. What is at issue here, then, is VACUUM's degree of
"eagerness" around putting *fully* dead B-Tree pages in the FSM for
recycling. The interlock with RecentGlobalXmin is what makes it
impossible for VACUUM to generally fully delete pages, *as well as*
mark them as recyclable (put them in the FSM) all at once.

Maybe you get this already, since, as I said, the terminology is
tricky in this area, but I can't tell.

-- 
Peter Geoghegan



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Amit Kapila
Date:
On Sat, Feb 25, 2017 at 3:40 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 24, 2017 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think this thread is pretty short on evidence that would let us make
>> a smart decision about what to do here.  I see three possibilities.
>> The first is that this patch is a good idea whether we do something
>> about the issue of half-dead pages or not.  The second is that this
>> patch is a good idea if we do something about the issue of half-dead
>> pages but a bad idea if we don't.  The third is that this patch is a
>> bad idea whether or not we do anything about the issue of half-dead
>> pages.
>

+1.  I think we can track the stats from
IndexBulkDeleteResult->pages_free to see the impact of the patch.


> Half-dead pages are not really relevant to this discussion, AFAICT. I
> think that both you and Simon mean "recyclable" pages.
>

Yes, I think so and I think that is just confusion about terminology.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Robert Haas
Date:
On Sat, Feb 25, 2017 at 3:40 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 24, 2017 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think this thread is pretty short on evidence that would let us make
>> a smart decision about what to do here.  I see three possibilities.
>> The first is that this patch is a good idea whether we do something
>> about the issue of half-dead pages or not.  The second is that this
>> patch is a good idea if we do something about the issue of half-dead
>> pages but a bad idea if we don't.  The third is that this patch is a
>> bad idea whether or not we do anything about the issue of half-dead
>> pages.
>
> Half-dead pages are not really relevant to this discussion, AFAICT. I
> think that both you and Simon mean "recyclable" pages. There are
> several levels of indirection involved here, to keep the locking very
> granular, so it gets tricky to talk about.

Thanks for the clarification.  I wasn't very clear on what was going
on here; that helps.

The thing that strikes me based on what you wrote is that our page
recycling seems to admit of long delays even as things stand.  There's
no bound on how much time could pass between one index vacuum and the
next, and RecentGlobalXmin could and probably usually will advance
past the point that would allow recycling long before the next index
vacuum cycle.  I don't know whether that strengthens or weakens
Simon's argument.  On the one hand, you could argue that if we're
already doing this on a long delay, making it even longer probably
won't hurt much.  On the other hand, you could argue that if the
current situation is bad, we should at least avoid making it worse.  I
don't know which of those arguments is correct, if either.  Do you
have an idea about that, or any ideas for experiments we could try?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Sat, Feb 25, 2017 at 10:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The thing that strikes me based on what you wrote is that our page
> recycling seems to admit of long delays even as things stand.  There's
> no bound on how much time could pass between one index vacuum and the
> next, and RecentGlobalXmin could and probably usually will advance
> past the point that would allow recycling long before the next index
> vacuum cycle.

Agreed.

> I don't know whether that strengthens or weakens
> Simon's argument.

I think it weakens it, but I hesitate to take a firm position on it just yet.

> On the one hand, you could argue that if we're
> already doing this on a long delay, making it even longer probably
> won't hurt much.  On the other hand, you could argue that if the
> current situation is bad, we should at least avoid making it worse.  I
> don't know which of those arguments is correct, if either.

Unsure. I will point out:

* There probably is a big problem with B-Tree index bloat for certain
workloads ("sparse deletion patterns"), that interacts badly with less
frequent VACUUMing.

* Whatever the bloat this patch makes worse is not *that* bloat, at
least with the proposed default for vacuum_cleanup_index_scale_factor;
it's not the bloat we usually think of when we talk about index bloat.
A full index scan will not need to visit any of the dead pages, even
just to immediately skip over them. We just won't be able to recycle
them, which is another problem.

* The problem of not recycling as soon as we'd prefer can only happen
when everything else is, roughly speaking, going right. Which is still
pretty good.  (Again, these remarks only apply when the default
vacuum_cleanup_index_scale_factor is used.)

* Roughly speaking, the recycling of disk blocks, and efficient use of
disk space more generally is not a priority for the implementation.
Nor should it be.

I tend to think of this recycling as being more about the worst case
for space utilization than about the average case. Kind of like the
fast root vs. true root thing prevents our needing to descend "skinny"
B-Trees from the true root, which can only really happen following
vast changes to the key space, which are always going to be painful.
These cases are a bit pathological.

For more on this recycling stuff, see section 2.5 of the Lanin and
Shasha paper, "Freeing empty nodes" [1]. It's describing what is
essentially the RecentGlobalXmin interlock, and I think you're right
to point out that that could stand to be a lot more aggressive, which
is maybe the real problem, if any. (The paper's remarks suggest we
could stand to be more eager about it.)

> Do you have an idea about that, or any ideas for experiments we could try?

Nothing occurs to me right now, unfortunately. However, my general
sense is that it would probably be just fine when
vacuum_cleanup_index_scale_factor was 0.0, but there might be
non-linear increases in "the serious type of index bloat" as the
proposed new setting was scaled up. I'd be much more worried about
that.

[1] https://archive.org/stream/symmetricconcurr00lani#page/6/mode/2up
-- 
Peter Geoghegan



Re: [HACKERS] GUC for cleanup indexes threshold.

From
David Steele
Date:
On 2/27/17 12:46 PM, Peter Geoghegan wrote:
> On Sat, Feb 25, 2017 at 10:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> Do you have an idea about that, or any ideas for experiments we could try?
> 
> Nothing occurs to me right now, unfortunately. However, my general
> sense is that it would probably be just fine when
> vacuum_cleanup_index_scale_factor was 0.0, but there might be
> non-linear increases in "the serious type of index bloat" as the
> proposed new setting was scaled up. I'd be much more worried about
> that.

This was originally marked "Waiting on Author" due to some minor
problems with the patch but on the whole there are much larger issues at
play.

The tenor seems to be that we should somehow prove the effectiveness of
this patch one way or the other, but nobody is quite sure how to go
about that, and in fact it would probably be different for each AM.

Sawada, if you have ideas about how to go about this then we would need
to see something very soon.  If not, I think marking this RWF is the
best course of action.

Thanks,
-- 
-David
david@pgmasters.net



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Sat, Feb 25, 2017 at 7:10 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 24, 2017 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think this thread is pretty short on evidence that would let us make
>> a smart decision about what to do here.  I see three possibilities.
>> The first is that this patch is a good idea whether we do something
>> about the issue of half-dead pages or not.  The second is that this
>> patch is a good idea if we do something about the issue of half-dead
>> pages but a bad idea if we don't.  The third is that this patch is a
>> bad idea whether or not we do anything about the issue of half-dead
>> pages.
>
> Half-dead pages are not really relevant to this discussion, AFAICT. I
> think that both you and Simon mean "recyclable" pages. There are
> several levels of indirection involved here, to keep the locking very
> granular, so it gets tricky to talk about.
>
> B-Tree page deletion is like a page split in reverse. It has a
> symmetry with page splits, which have two phases (atomic operations).
> There are also two phases for deletion, the first of which leaves the
> target page without a downlink in its parent, and marks it half dead.
> By the end of the first phase, there are still sibling pointers, so an
> index scan can land on them before the second phase of deletion begins
> -- they can visit a half-dead page before such time as the second
> phase of deletion begins, where the sibling link goes away. So, the
> sibling link isn't stale as such, but the page is still morally dead.
> (Second phase is where we remove even the sibling links, and declare
> it fully dead.)
>
> Even though there are two phases of deletion, the second still occurs
> immediately after the first within VACUUM. The need to have two phases
> is hard to explain, so I won't try, but it suffices to say that VACUUM
> does not actually ever leave a page half dead unless there is a hard
> crash.
>
> Recall that VACUUMing of a B-Tree is performed sequentially, so blocks
> can be recycled without needing to be found via a downlink or sibling
> link by VACUUM. What is at issue here, then, is VACUUM's degree of
> "eagerness" around putting *fully* dead B-Tree pages in the FSM for
> recycling. The interlock with RecentGlobalXmin is what makes it
> impossible for VACUUM to generally fully delete pages, *as well as*
> mark them as recyclable (put them in the FSM) all at once.
>
> Maybe you get this already, since, as I said, the terminology is
> tricky in this area, but I can't tell.
>

Thank you for the clarification. Let me check my understanding. IIUC,
skipping the second index vacuum path (lazy_cleanup_index) cannot leave
pages in the half-dead state, but it could leave recyclable pages that
are not yet marked as recyclable. As for the second case, such pages can
be reclaimed by the next index vacuum, because btvacuumpage calls
RecordFreeIndexPage for recyclable pages. Am I missing something?

My original motivation for this patch was to skip the second index
vacuum path when vacuum skipped the whole table via the visibility map.
But as Robert suggested on another thread, I changed it to use a
threshold. If my understanding is correct, we can have a threshold that
specifies the fraction of table pages scanned by vacuum. If it is set
to 0.1, lazy_scan_heap does the second index vacuum only when at least
10% of the table was scanned. IOW, if 90% of the table pages were
skipped, which means almost all of the table has not changed since the
previous vacuum, we can skip the second index vacuum.

In this design, we could handle other types of AM as well.
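
To make the shape of that check concrete, here is a minimal sketch of
what the test in lazy_scan_heap could look like (the GUC name follows
this thread; the exact form in the attached patch may differ):

    /*
     * Minimal sketch only: run the cleanup pass over the indexes only
     * when the fraction of heap pages this VACUUM actually scanned
     * reaches the new GUC.  scanned_pages and rel_pages are the
     * existing LVRelStats counters; the surrounding logic in the real
     * patch may differ.
     */
    if ((double) vacrelstats->scanned_pages >=
        (double) vacrelstats->rel_pages * vacuum_cleanup_index_scale_factor)
    {
        for (i = 0; i < nindexes; i++)
            lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
    }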

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Fri, Mar 3, 2017 at 10:45 PM, David Steele <david@pgmasters.net> wrote:
> On 2/27/17 12:46 PM, Peter Geoghegan wrote:
>> On Sat, Feb 25, 2017 at 10:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>>> Do you have an idea about that, or any ideas for experiments we could try?
>>
>> Nothing occurs to me right now, unfortunately. However, my general
>> sense is that it would probably be just fine when
>> vacuum_cleanup_index_scale_factor was 0.0, but there might be
>> non-linear increases in "the serious type of index bloat" as the
>> proposed new setting was scaled up. I'd be much more worried about
>> that.
>
> This was originally marked "Waiting on Author" due to some minor
> problems with the patch but on the whole there are much larger issues at
> play.
>
> The tenor seems to be that we should somehow prove the effectiveness of
> this patch one way or the other, but nobody is quite sure how to go
> about that, and in fact it would probably be different for each AM.
>
> Sawada, if you have ideas about how to go about this then we would need
> to see something very soon.  If not, I think marking this RWF is the
> best course of action.
>

Thank you for the reminder. I've posted a new idea about this. Once
we've reached consensus on the design, I'm going to update the patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Fri, Mar 3, 2017 at 8:13 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for clarification. Let me check my understanding. IIUC,
> skipping second index vacuum path (lazy_cleanup_index) can not be
> cause of leaving page as half-dead state but could leave recyclable
> pages that are not marked as a recyclable. But second one, it can be
> reclaimed by next index vacuum because btvacuumpage calls
> RecordFreeIndexPage for recyclable page. Am I missing something?

I think that this is a good summary. You have not missed anything.

> My first motivation of this patch is to skip the second index vacuum
> patch when vacuum skipped whole table by visibility map.

Makes sense.

> But as Robert
> suggested on another thread, I changed it to have a threshold. If my
> understanding is correct, we can have a threshold that specifies the
> fraction of the scanned pages by vacuum. If it's set 0.1,
> lazy_scan_heap can do the second vacuum index only when 10% of table
> is scanned. IOW, if 90% of table pages is skipped, which means almost
> of table has not changed since previous vacuum, we can skip the second
> index vacuum.

It sounds like a setting of 0.1 would leave us in a position where
it's very unlikely that a VACUUM of indexes could fail to occur when
autovacuum has been triggered in the common way, by the "vacuum
threshold" having been exceeded. Does this feeling that I have about
it seem accurate to you? Is that actually your intent? It's hard to
really be sure, because there are so many confounding factors (e.g.
HOT), but if that was 100% true, then I suppose there would
theoretically be zero new risk (except, perhaps, of the "other type of
bloat" that I described, which I am not very worried about).

Please verify my understanding of your thought process: We don't have
to freeze indexes at all, ever, so if we see index bloat as a separate
problem, we also see that there is no need to *link* index needs to
the need for freezing. XID burn rate is a very bad proxy for how
bloated an index may be. Besides, we already have a separate trigger
for the thing that *actually* matters to indexes (the vacuum threshold
stuff).

Maybe 0.0 is too low as a default for
vacuum_cleanup_index_scale_factor, even though initially it seemed
attractive to me for theoretical reasons. Something like 0.01 is still
"practically zero", but removes the extreme sensitivity that would
have with 0.0. So, 0.01 might make sense as a default for roughly the
same reason that autovacuum_vacuum_threshold exists. (Maybe it should
be more like autovacuum_vacuum_threshold, in that it's an absolute
minimum number of heap blocks to trigger index clean-up.)
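
Purely to illustrate that shape, a hedged sketch (both GUC names below
are hypothetical; nothing here is code from the patch):

    /*
     * Hypothetical shape only: combine an absolute floor with a
     * fraction of the table, the same way autovacuum_vacuum_threshold
     * and autovacuum_vacuum_scale_factor are combined for the vacuum
     * trigger.
     */
    double  cleanup_trigger;

    cleanup_trigger = vacuum_cleanup_index_threshold +
        vacuum_cleanup_index_scale_factor * vacrelstats->rel_pages;

    if (vacrelstats->scanned_pages >= cleanup_trigger)
    {
        /* ... call lazy_cleanup_index() for each index ... */
    }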

At some point in the future, it may be possible to actually go ahead
with index vacuuming, but do only a small amount of B-Tree vacuuming
by a process that is similar to a conventional index scan: A process
that collects index values from those rows found in the heap, and
performs subsequent look-ups to kill tuples in the index. I imagine
that in cases where the freeze map stuff really helps, the only index
is often on a bigserial column or similar, which naturally has good
locality. When you VACUUM heap pages, you attempt to build a range of
values for each index, a little like a BRIN index build. This range is
what you go on to use to do a cheap index-scan-based B-Tree VACUUM.
This could have far, far less I/O, though it has obvious risks that we
need to worry about. That's work for another release, of course.

-- 
Peter Geoghegan



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Fri, Mar 3, 2017 at 11:37 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> Please verify my understanding of your thought process: We don't have
> to freeze indexes at all, ever, so if we see index bloat as a separate
> problem, we also see that there is no need to *link* index needs to
> the need for freezing. XID burn rate is a very bad proxy for how
> bloated an index may be. Besides, we already have a separate trigger
> for the thing that *actually* matters to indexes (the vacuum threshold
> stuff).

Another thing I wonder about: It would be okay to use the number of
unset freeze map bits as a reasonable proxy for how much bloat is in
the index the first time we vacuum. But, don't we then set the freeze
map bits, losing any record of having skipped indexes?

What mechanism exists that allows back-pressure to actually *build up*
over many vacuum anti-wraparound cycles, so that we slowly but surely
get around to actually vacuuming indexes at some point?

-- 
Peter Geoghegan



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Fri, Mar 3, 2017 at 11:49 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> Please verify my understanding of your thought process: We don't have
>> to freeze indexes at all, ever, so if we see index bloat as a separate
>> problem, we also see that there is no need to *link* index needs to
>> the need for freezing. XID burn rate is a very bad proxy for how
>> bloated an index may be. Besides, we already have a separate trigger
>> for the thing that *actually* matters to indexes (the vacuum threshold
>> stuff).

Wait, I have this wrong. You're only ever avoiding
lazy_cleanup_index(), never lazy_vacuum_index() (I repeated Kuntal's
mistake). So, you're focused on avoiding recycling pages in the case
that no actual bloat is in the index. That hasn't changed from your
earliest version, where there was no new GUC. You are not proposing to
increase VACUUM's tolerance of "the bad kind of bloat". It seems that
this is much less risky than I first thought, then.

> Another thing I wonder about: It would be okay to use the number of
> unset freeze map bits as a reasonable proxy for how much bloat is in
> the index the first time we vacuum. But, don't we then set the freeze
> map bits, losing any record of having skipped indexes?

Still seems like we may want to remember that we skipped the
btvacuumcleanup() calls, to build back-pressure, though. That is, we
might want to do what Simon suggests: something may be stored in the
B-Tree meta-page (for example) to remember that we tolerated not doing
a btvacuumcleanup() when we could have and maybe should have. It would
remember the last number of pages deleted by the last VACUUM. That's
the only thing I can think of that matters. I believe this is what
Simon meant when he said "half dead pages".

OTOH, I see no reason to insist on this meta-page stuff. The only
thing that can delete a B-Tree page is VACUUM. Hopefully, a subsequent
VACUUM will occur that goes on to recycle those deleted pages. But,
it's not as if a "recycling-orientated B-Tree VACUUM" (a
btvacuumcleanup() call to btvacuumscan()) occurs because of a build-up
of back-pressure that has any relationship to recycling. Back pressure
is actually created by "real" bloat, or maybe using up XIDs, which is
not the same thing. Bloat cannot occur within B-Tree pages that are
already fully deleted and (almost?) eligible to be recycled.

In other words, the number of B-Tree pages that the last VACUUM
deleted, and thus made eligible to recycle by the next VACUUM has no
relationship with the number of pages the next VACUUM will itself end
up deleting, in general, or how long it will be before that next
VACUUM comes, if it comes at all, or anything else that seems at all
relevant. In other other words, you are only preventing recycling that
would have occurred pretty much by accident before now, due to
concerns that had nothing to do with recycling in particular.

The really important thing is that you have bound the amount of
recycling that VACUUM can now fail to do to only one "round" of
VACUUM, because newly deleted pages can only occur due to another
VACUUM, which will always do however much recycling it can manage to
do as it scans a B-Tree index. I'm not very worried about VACUUM not
being super-eager about reclaiming disk space for the FSM if there is
a natural limit to that.

It would be intolerable for it to ever be possible for
recyclable-but-not-recycled pages to grow indefinitely, but I see no
risk of that here.

-- 
Peter Geoghegan



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Fri, Mar 3, 2017 at 2:41 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> In other words, the number of B-Tree pages that the last VACUUM
> deleted, and thus made eligible to recycle by the next VACUUM has no
> relationship with the number of pages the next VACUUM will itself end
> up deleting, in general, or how long it will be before that next
> VACUUM comes, if it comes at all, or anything else that seems at all
> relevant.

This raises another question, though: Why have this GUC at all? Why
use *any* threshold that is to be compared against the number of heap
pages that were processed by VACUUM this time?

B-Tree page deletion isn't really part of the ordinary life cycle of a
B-Tree index. In order for that to be the case, somebody would have to
delete large ranges of indexed values (typically hundreds of logically
contiguous values -- no gaps), without anyone else ever inserting new
tuples that are in the same range before the next VACUUM. It's very
unlikely that this would happen again and again in the real world. So,
even if we never freeze, the number of B-Tree pages that we delete
when we VACUUM today is generally a useless predictor of how many will
be deleted by a VACUUM that occurs tomorrow. This is true despite the
fact that the number of dead heap tuples is probably almost identical
for each VACUUM (or the number of heap pages that end up being
processed by VACUUM, if you prefer).

Barring any concerns about crash safety, we can be completely certain
that any "recycling-orientated B-Tree VACUUM" (a btvacuumcleanup()
call to btvacuumscan(), which happens because there are no tuples in
the index to kill) will end up recycling however many pages the last
VACUUM managed to delete, which is a precisely knowable number (or
could be made knowable if we stashed that number somewhere, like the
meta-page). It will typically only take seconds or minutes after the
VACUUM finishes for its RecentGlobalXmin interlock to stop being a
problem (that is, for _bt_page_recyclable() to return "true" for any
pages that that VACUUM deleted). From that point on, those deleted
pages are "money in the bank" for the FSM. The only reason why we'd
want to tie "the FSM withdrawing that money" to VACUUM is because that
might be needed to clean up regular bloat anyway.

The test performed by this patch within lazy_scan_heap(), to determine
whether we should avoid calling lazy_cleanup_index() would therefore
look like this, ideally: Do I want to go to the trouble of scanning
this index (a cost that is proportionate to the size of the index) in
order to recycle this number of known-deleted pages (a benefit that is
proportionate to that number)? (I still think that the important thing
is that we don't let the number of unclaimed-by-FSM recyclable pages
grow forever, though.)

(Thinks about it some more...)

Unfortunately, I just saw a whole new problem with this patch:
_bt_page_recyclable() is the one place in the B-Tree AM where we stash
an XID. We don't need to set this to FrozenTransactionId at any point,
because this is stashed for deleted pages only, pages that are likely
to be recycled very soon. It might be that the VACUUM that ends up
deleting any such page is an anti-wraparound VACUUM, especially in the
case that this patch really wants to improve. However, with this
patch, that recycling won't happen, of course. As a result,
_bt_page_recyclable() will falsely report that the page is not
recyclable if it is ever asked again.

-- 
Peter Geoghegan



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Amit Kapila
Date:
On Sat, Mar 4, 2017 at 5:59 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Mar 3, 2017 at 2:41 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> In other words, the number of B-Tree pages that the last VACUUM
>> deleted, and thus made eligible to recycle by the next VACUUM has no
>> relationship with the number of pages the next VACUUM will itself end
>> up deleting, in general, or how long it will be before that next
>> VACUUM comes, if it comes at all, or anything else that seems at all
>> relevant.
>
> This raises another question, though: Why have this GUC at all? Why
> use *any* threshold that is to be compared against the number of heap
> pages that were processed by VACUUM this time?
>
> B-Tree page deletion isn't really part of the ordinary life cycle of a
> B-Tree index. In order for that to be the case, somebody would have to
> delete large ranges of indexed values (typically hundreds of logically
> contiguous values -- no gaps), without anyone else ever inserting new
> tuples that are in the same range before the next VACUUM. It's very
> unlikely that this would happen again and again in the real world. So,
> even if we never freeze, the number of B-Tree pages that we delete
> when we VACUUM today is generally a useless predictor of how many will
> be deleted by a VACUUM that occurs tomorrow. This is true despite the
> fact that the number of dead heap tuples is probably almost identical
> for each VACUUM (or the number of heap pages that end up being
> processed by VACUUM, if you prefer).
>
> Barring any concerns about crash safety, we can be completely certain
> that any "recycling-orientated B-Tree VACUUM" (a btvacuumcleanup()
> call to btvacuumscan(), which happens because there are no tuples in
> the index to kill) will end up recycling however many pages the last
> VACUUM managed to delete, which is a precisely knowable number (or
> could be made knowable if we stashed that number somewhere, like the
> meta-page). It will typically only take seconds or minutes after the
> VACUUM finishes for its RecentGlobalXmin interlock to stop being a
> problem (that is, for _bt_page_recyclable() to return "true" for any
> pages that that VACUUM deleted). From that point on, those deleted
> pages are "money in the bank" for the FSM. The only reason why we'd
> want to tie "the FSM withdrawing that money" to VACUUM is because that
> might be needed to clean up regular bloat anyway.
>
> The test performed by this patch within lazy_scan_heap(), to determine
> whether we should avoid calling lazy_cleanup_index() would therefore
> look like this, ideally: Do I want to go to the trouble of scanning
> this index (a cost that is proportionate to the size of the index) in
> order to recycle this number of known-deleted pages (a benefit that is
> proportionate to that number)? (I still think that the important thing
> is that we don't let the number of unclaimed-by-FSM recyclable pages
> grow forever, though.)
>

You are right that we don't want the number of unclaimed-by-FSM
recyclable pages to grow forever, but I think that won't happen with
this patch.  As soon as there are more deletions (in heap), in the
next vacuum cycle, the pages will be reclaimed by lazy_vacuum_index().


> (Thinks about it some more...)
>
> Unfortunately, I just saw a whole new problem with this patch:
> _bt_page_recyclable() is the one place in the B-Tree AM where we stash
> an XID.
>

Can you be more specific about which code exactly you are referring to here?



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Fri, Mar 3, 2017 at 6:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> You are right that we don't want the number of unclaimed-by-FSM
> recyclable pages to grow forever, but I think that won't happen with
> this patch.  As soon as there are more deletions (in heap), in the
> next vacuum cycle, the pages will be reclaimed by lazy_vacuum_index().

Right.

>> (Thinks about it some more...)
>>
>> Unfortunately, I just saw a whole new problem with this patch:
>> _bt_page_recyclable() is the one place in the B-Tree AM where we stash
>> an XID.
>>
>
> Can you be more specific to tell which code exactly you are referring here?

I meant that we stash an XID when a B-Tree page is deleted, which is
used to determine when it's finally safe to recycle the page by
comparing it to RecentGlobalXmin (recycling will happen during the
next VACUUM). We can't immediately recycle a page, because an existing
index scan might land on the page while following a stale pointer.

_bt_page_recyclable(), which checks if recycling is safe (no possible
index scan can land on the dead page), is a pretty small function. I'm
concerned about what happens when
pg_class.relfrozenxid/pg_database.datfrozenxid are advanced past
opaque->btpo.xact. While I can't see this explained anywhere, I'm
pretty sure that that's supposed to be impossible, which this patch
changes.
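
For reference, the function in question is roughly this (paraphrased
from nbtpage.c):

    /*
     * Paraphrase of _bt_page_recyclable(): a deleted page can be put
     * in the FSM only once no running transaction could still land on it.
     */
    bool
    _bt_page_recyclable(Page page)
    {
        BTPageOpaque opaque;

        /*
         * An all-zero page (e.g. left behind by a crashed relation
         * extension) can always be reclaimed.
         */
        if (PageIsNew(page))
            return true;

        /*
         * Otherwise, recycle only if the page is deleted and the XID
         * stashed at deletion time now precedes every transaction that
         * could still hold a link to it.
         */
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        if (P_ISDELETED(opaque) &&
            TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
            return true;
        return false;
    }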

-- 
Peter Geoghegan



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Amit Kapila
Date:
On Sat, Mar 4, 2017 at 8:55 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Mar 3, 2017 at 6:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> You are right that we don't want the number of unclaimed-by-FSM
>> recyclable pages to grow forever, but I think that won't happen with
>> this patch.  As soon as there are more deletions (in heap), in the
>> next vacuum cycle, the pages will be reclaimed by lazy_vacuum_index().
>
> Right.
>
>>> (Thinks about it some more...)
>>>
>>> Unfortunately, I just saw a whole new problem with this patch:
>>> _bt_page_recyclable() is the one place in the B-Tree AM where we stash
>>> an XID.
>>>
>>
>> Can you be more specific to tell which code exactly you are referring here?
>
> I meant that we stash an XID when a B-Tree page is deleted, used to
> determine when it's finally safe to to recycle the page by comparing
> it to RecentGlobalXmin (recyling will happen during the next VACUUM).
> We can't immediately recycle a page, because an existing index scan
> might land on the page while following a stale pointer.
>
> _bt_page_recyclable(), which checks if recycling is safe (no possible
> index scan can land of dead page), is a pretty small function. I'm
> concerned about what happens when
> pg_class.relfrozenxid/pg_database.datfrozenxid are advanced past
> opaque->btpo.xact.
>

Are you talking about pg_class.relfrozenxid of the index or of the
table?  IIUC, for indexes it will be InvalidTransactionId, and for
tables, I think it will be updated with or without the patch.

> While I can't see this explained anywhere, I'm
> pretty sure that that's supposed to be impossible, which this patch
> changes.
>

What makes you think the patch will allow pg_class.relfrozenxid to be
advanced past opaque->btpo.xact, which was previously not possible?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Sat, Mar 4, 2017 at 4:37 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Mar 3, 2017 at 8:13 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Thank you for clarification. Let me check my understanding. IIUC,
>> skipping second index vacuum path (lazy_cleanup_index) can not be
>> cause of leaving page as half-dead state but could leave recyclable
>> pages that are not marked as a recyclable. But second one, it can be
>> reclaimed by next index vacuum because btvacuumpage calls
>> RecordFreeIndexPage for recyclable page. Am I missing something?
>
> I think that this is a good summary. You have not missed anything.
>
>> My first motivation of this patch is to skip the second index vacuum
>> patch when vacuum skipped whole table by visibility map.
>
> Makes sense.
>
>> But as Robert
>> suggested on another thread, I changed it to have a threshold. If my
>> understanding is correct, we can have a threshold that specifies the
>> fraction of the scanned pages by vacuum. If it's set 0.1,
>> lazy_scan_heap can do the second vacuum index only when 10% of table
>> is scanned. IOW, if 90% of table pages is skipped, which means almost
>> of table has not changed since previous vacuum, we can skip the second
>> index vacuum.
>
> It sounds like a setting of 0.1 would leave us in a position where
> it's very unlikely that a VACUUM of indexes could fail to occur when
> autovacuum has been triggered in the common way, by the "vacuum
> threshold" having been exceeded. Does this feeling that I have about
> it seem accurate to you? Is that actually your intent? It's hard to
> really be sure, because there are so many confounding factors (e.g.
> HOT), but if that was 100% true, then I suppose there would
> theoretically be zero new risk (except, perhaps, of the "other type of
> bloat" that I described, which I am not very worried about).
>
> Please verify my understanding of your thought process: We don't have
> to freeze indexes at all, ever, so if we see index bloat as a separate
> problem, we also see that there is no need to *link* index needs to
> the need for freezing. XID burn rate is a very bad proxy for how
> bloated an index may be. Besides, we already have a separate trigger
> for the thing that *actually* matters to indexes (the vacuum threshold
> stuff).
>
> Maybe 0.0 is too low as a default for
> vacuum_cleanup_index_scale_factor, even though initially it seemed
> attractive to me for theoretical reasons. Something like 0.01 is still
> "practically zero", but removes the extreme sensitivity that would
> have with 0.0. So, 0.01 might make sense as a default for roughly the
> same reason that autovacuum_vacuum_threshold exists. (Maybe it should
> be more like autovacuum_vacuum_threshold, in that it's an absolute
> minimum number of heap blocks to trigger index clean-up.)

Agreed. Attached is a patch with the current design. Also, I've changed the default value to 0.01.

Two issues remain. The first is what you mentioned: that
pg_class.relfrozenxid/pg_database.datfrozenxid could be advanced past
opaque->btpo.xact. The second is to find out the impact on index types
other than btree. I'm investigating the second one now.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Sat, Mar 4, 2017 at 1:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> While I can't see this explained anywhere, I'm
>> pretty sure that that's supposed to be impossible, which this patch
>> changes.
>>
>
> What makes you think that patch will allow pg_class.relfrozenxid to be
> advanced past opaque->btpo.xact which was previously not possible?

By not reliably recycling pages in a timely manner, we won't change
anything about the dead page. It just sticks around. This is mostly
fine, but we still need VACUUM to be able to reason about it (to
determine if it is in fact recyclable), which is what I'm concerned
about breaking here. It still needs to be *possible* to recycle any
recyclable page at some point (e.g., when we find it convenient).

pg_class.relfrozenxid is InvalidTransactionId for indexes because
indexes generally don't store XIDs. This is the one exception that I'm
aware of, presumably justified by the fact that it's only for
recyclable pages anyway, and those are currently *guaranteed* to get
recycled quickly. In particular, they're guaranteed to get recycled by
the next VACUUM. They may be recycled in the course of anti-wraparound
VACUUM, even if VACUUM has no garbage tuples to kill (even if we only
do lazy_cleanup_index() instead of lazy_vacuum_index()). This is the
case that this patch proposes to have us skip touching indexes for.

-- 
Peter Geoghegan



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Wed, Mar 8, 2017 at 1:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Mar 4, 2017 at 1:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> While I can't see this explained anywhere, I'm
>>> pretty sure that that's supposed to be impossible, which this patch
>>> changes.
>>>
>>
>> What makes you think that patch will allow pg_class.relfrozenxid to be
>> advanced past opaque->btpo.xact which was previously not possible?
>
> By not reliably recycling pages in a timely manner, we won't change
> anything about the dead page. It just sticks around. This is mostly
> fine, but we still need VACUUM to be able to reason about it (to
> determine if it is in fact recyclable), which is what I'm concerned
> about breaking here. It still needs to be *possible* to recycle any
> recyclable page at some point (e.g., when we find it convenient).
>
> pg_class.relfrozenxid is InvalidTransactionId for indexes because
> indexes generally don't store XIDs. This is the one exception that I'm
> aware of, presumably justified by the fact that it's only for
> recyclable pages anyway, and those are currently *guaranteed* to get
> recycled quickly. In particular, they're guaranteed to get recycled by
> the next VACUUM. They may be recycled in the course of anti-wraparound
> VACUUM, even if VACUUM has no garbage tuples to kill (even if we only
> do lazy_cleanup_index() instead of lazy_vacuum_index()). This is the
> case that this patch proposes to have us skip touching indexes for.
>

To prevent this, I think we need to not skip lazy_cleanup_index during
anti-wraparound vacuum (aggressive = true), even if the number of
scanned pages is less than the threshold. This can ensure that
pg_class.relfrozenxid is not advanced past opaque->btpo.xact with
minimal pain. Even if a dead btree page is left behind, subsequent
modifications create garbage on the heap, and autovacuum then recycles
that page while vacuuming the index (lazy_vacuum_index).
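
A minimal sketch of that change to the same test (illustrative only,
not the exact code in the attached patch):

    /*
     * Illustrative sketch: during an aggressive (anti-wraparound)
     * vacuum the cleanup pass is never skipped, so relfrozenxid cannot
     * be advanced past the XID stashed in a deleted btree page that
     * has not been recycled yet.
     */
    if (aggressive ||
        (double) vacrelstats->scanned_pages >=
        (double) vacrelstats->rel_pages * vacuum_cleanup_index_scale_factor)
    {
        for (i = 0; i < nindexes; i++)
            lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
    }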

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Thu, Mar 9, 2017 at 8:51 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> pg_class.relfrozenxid is InvalidTransactionId for indexes because
>> indexes generally don't store XIDs. This is the one exception that I'm
>> aware of, presumably justified by the fact that it's only for
>> recyclable pages anyway, and those are currently *guaranteed* to get
>> recycled quickly. In particular, they're guaranteed to get recycled by
>> the next VACUUM. They may be recycled in the course of anti-wraparound
>> VACUUM, even if VACUUM has no garbage tuples to kill (even if we only
>> do lazy_cleanup_index() instead of lazy_vacuum_index()). This is the
>> case that this patch proposes to have us skip touching indexes for.
>>
>
> To prevent this, I think we need to not skip the lazy_cleanup_index
> when ant-wraparound vacuum (aggressive = true) even if the number of
> scanned pages is less then the threshold. This can ensure that
> pg_class.relfrozenxid is not advanced past opaque->bp.xact with
> minimal pain. Even if the btree dead page is leaved, the subsequent
> modification makes garbage on heap and then autovauum recycles that
> page while index vacuuming(lazy_index_vacuum).

Sorry for the delay in getting back to you.

I think that that's safe, but it is a little disappointing that it
does not allow us to skip work in the case that you really had in mind
when writing the patch. Better than nothing, though, and perhaps still
a good start. I would like to hear other people's opinions.

-- 
Peter Geoghegan



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Tue, Mar 14, 2017 at 2:48 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I think that that's safe, but it is a little disappointing that it
> does not allow us to skip work in the case that you really had in mind
> when writing the patch. Better than nothing, though, and perhaps still
> a good start. I would like to hear other people's opinions.

Hmm. So, the SQL-callable function txid_current() exports a 64-bit
XID, which includes an "epoch". If PostgreSQL used these 64-bit XIDs
natively, we'd never need to freeze. Obviously we don't do that
because the per-tuple overhead of visibility information is already
high, and that would make it much worse. But, we can easily afford the
extra overhead in a completely dead page. Maybe we can overcome the
_bt_page_recyclable() limitation by being cleverer about how it
determines if recycling is safe -- it can examine epoch, too. This
would also be required in the similar function
vacuumRedirectAndPlaceholder(), which is a part of SP-GiST.

We already have BTPageOpaqueData.btpo, a union whose contained type
varies based on the page being dead. We could just do the same with
some other field in that struct, and then store epoch there. Clearly
nobody really cares about most data that remains on the page. Index
scans just need to be able to land on it to determine that it's dead,
and VACUUM needs to be able to determine whether or not there could
possibly be such an index scan at the time it considers recycling.

Does anyone see a problem with that?

-- 
Peter Geoghegan



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Tue, Mar 14, 2017 at 3:10 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> We already have BTPageOpaqueData.btpo, a union whose contained type
> varies based on the page being dead. We could just do the same with
> some other field in that struct, and then store epoch there. Clearly
> nobody really cares about most data that remains on the page. Index
> scans just need to be able to land on it to determine that it's dead,
> and VACUUM needs to be able to determine whether or not there could
> possibly be such an index scan at the time it considers recycling..

ISTM that we need all of the fields within BTPageOpaqueData even for
dead pages, actually. The left links and right links still need to be
sane, and the flag bits are needed. Plus, the field that stores an XID
already is clearly necessary. Even if they weren't needed, it would
probably still be a good idea to keep them around for forensic
purposes. However, the page header field pd_prune_xid is currently
unused for indexes, and is the same width as CheckPoint.nextXidEpoch
(the extra thing we might want to store -- the epoch).

Maybe you could store the epoch within that field when B-Tree VACUUM
deletes a page, and then compare that within _bt_page_recyclable(). It
would come before the existing XID comparison in that function. One
nice thing about this idea is that pd_prune_xid will be all-zero for
index pages from the current format, so there is no need to take
special care to make sure that databases that have undergone
pg_upgrade don't break.
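
A hedged sketch of what the extra test inside _bt_page_recyclable()
might look like under that idea. GetNextXidAndEpoch() is the existing
routine behind txid_current(); everything else here, including the use
of pd_prune_xid to hold the deletion-time epoch and the exact
comparison, is an assumption for illustration, not code from the patch:

    /*
     * Sketch only, as a fragment inside _bt_page_recyclable() after
     * "opaque" has been set up.  Assumes the XID epoch current at page
     * deletion was stashed in the otherwise-unused header field
     * pd_prune_xid.
     */
    if (P_ISDELETED(opaque))
    {
        uint32          epoch;
        TransactionId   nextxid;
        uint64          next_full;
        uint64          deleted_full;

        GetNextXidAndEpoch(&nextxid, &epoch);
        next_full = ((uint64) epoch << 32) | nextxid;
        deleted_full = ((uint64) ((PageHeader) page)->pd_prune_xid << 32) |
            opaque->btpo.xact;

        /*
         * If more than 2^31 XIDs have been consumed since the page was
         * deleted, no scan that could land on it can still be running,
         * so it is recyclable even though the plain 32-bit comparison
         * below could by then be fooled by wraparound.  pd_prune_xid is
         * all-zero in existing indexes, so pg_upgraded pages simply
         * fall through to the current check.
         */
        if (((PageHeader) page)->pd_prune_xid != 0 &&
            deleted_full + ((uint64) 1 << 31) < next_full)
            return true;

        /* Existing RecentGlobalXmin interlock. */
        if (TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
            return true;
    }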

-- 
Peter Geoghegan



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Amit Kapila
Date:
On Thu, Mar 9, 2017 at 10:21 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Mar 8, 2017 at 1:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Sat, Mar 4, 2017 at 1:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> While I can't see this explained anywhere, I'm
>>>> pretty sure that that's supposed to be impossible, which this patch
>>>> changes.
>>>>
>>>
>>> What makes you think that patch will allow pg_class.relfrozenxid to be
>>> advanced past opaque->btpo.xact which was previously not possible?
>>
>> By not reliably recycling pages in a timely manner, we won't change
>> anything about the dead page. It just sticks around. This is mostly
>> fine, but we still need VACUUM to be able to reason about it (to
>> determine if it is in fact recyclable), which is what I'm concerned
>> about breaking here. It still needs to be *possible* to recycle any
>> recyclable page at some point (e.g., when we find it convenient).
>>
>> pg_class.relfrozenxid is InvalidTransactionId for indexes because
>> indexes generally don't store XIDs. This is the one exception that I'm
>> aware of, presumably justified by the fact that it's only for
>> recyclable pages anyway, and those are currently *guaranteed* to get
>> recycled quickly. In particular, they're guaranteed to get recycled by
>> the next VACUUM. They may be recycled in the course of anti-wraparound
>> VACUUM, even if VACUUM has no garbage tuples to kill (even if we only
>> do lazy_cleanup_index() instead of lazy_vacuum_index()). This is the
>> case that this patch proposes to have us skip touching indexes for.
>>
>
> To prevent this, I think we need to not skip the lazy_cleanup_index
> when ant-wraparound vacuum (aggressive = true) even if the number of
> scanned pages is less then the threshold. This can ensure that
> pg_class.relfrozenxid is not advanced past opaque->bp.xact with
> minimal pain. Even if the btree dead page is leaved, the subsequent
> modification makes garbage on heap and then autovauum recycles that
> page while index vacuuming(lazy_index_vacuum).
>

What if somebody does a manual vacuum and there are no garbage tuples
to clean? Wouldn't you want to avoid skipping lazy_cleanup_index in
that case as well?  Another option could be to skip updating
relfrozenxid if we have skipped the index cleanup.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Wed, Mar 15, 2017 at 4:50 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 9, 2017 at 10:21 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Wed, Mar 8, 2017 at 1:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> On Sat, Mar 4, 2017 at 1:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>> While I can't see this explained anywhere, I'm
>>>>> pretty sure that that's supposed to be impossible, which this patch
>>>>> changes.
>>>>>
>>>>
>>>> What makes you think that patch will allow pg_class.relfrozenxid to be
>>>> advanced past opaque->btpo.xact which was previously not possible?
>>>
>>> By not reliably recycling pages in a timely manner, we won't change
>>> anything about the dead page. It just sticks around. This is mostly
>>> fine, but we still need VACUUM to be able to reason about it (to
>>> determine if it is in fact recyclable), which is what I'm concerned
>>> about breaking here. It still needs to be *possible* to recycle any
>>> recyclable page at some point (e.g., when we find it convenient).
>>>
>>> pg_class.relfrozenxid is InvalidTransactionId for indexes because
>>> indexes generally don't store XIDs. This is the one exception that I'm
>>> aware of, presumably justified by the fact that it's only for
>>> recyclable pages anyway, and those are currently *guaranteed* to get
>>> recycled quickly. In particular, they're guaranteed to get recycled by
>>> the next VACUUM. They may be recycled in the course of anti-wraparound
>>> VACUUM, even if VACUUM has no garbage tuples to kill (even if we only
>>> do lazy_cleanup_index() instead of lazy_vacuum_index()). This is the
>>> case that this patch proposes to have us skip touching indexes for.
>>>
>>
>> To prevent this, I think we need to not skip the lazy_cleanup_index
>> when ant-wraparound vacuum (aggressive = true) even if the number of
>> scanned pages is less then the threshold. This can ensure that
>> pg_class.relfrozenxid is not advanced past opaque->bp.xact with
>> minimal pain. Even if the btree dead page is leaved, the subsequent
>> modification makes garbage on heap and then autovauum recycles that
>> page while index vacuuming(lazy_index_vacuum).
>>
>
> What about if somebody does manual vacuum and there are no garbage
> tuples to clean, won't in that case also you want to avoid skipping
> the lazy_cleanup_index?

Yes, in that case lazy_cleanup_index will be skipped.

> Another option could be to skip updating the
> relfrozenxid if we have skipped the index cleanup.

That could make anti-wraparound VACUUM occur at high frequency, and
since aggressive vacuum does not skip lazy_cleanup_index, we would end
up not skipping it after all.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
David Steele
Date:
Hi,

On 3/15/17 9:50 AM, Amit Kapila wrote:
>
> What about if somebody does manual vacuum and there are no garbage
> tuples to clean, won't in that case also you want to avoid skipping
> the lazy_cleanup_index?  Another option could be to skip updating the
> relfrozenxid if we have skipped the index cleanup.

This thread has been idle for six days.  Please respond and/or post a 
new patch by 2017-03-24 00:00 AoE (UTC-12) or this submission will be 
marked "Returned with Feedback".

Thanks,
-- 
-David
david@pgmasters.net



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Wed, Mar 22, 2017 at 2:29 AM, David Steele <david@pgmasters.net> wrote:
> Hi,
>
> On 3/15/17 9:50 AM, Amit Kapila wrote:
>>
>>
>> What about if somebody does manual vacuum and there are no garbage
>> tuples to clean, won't in that case also you want to avoid skipping
>> the lazy_cleanup_index?  Another option could be to skip updating the
>> relfrozenxid if we have skipped the index cleanup.
>
>
> This thread has been idle for six days.  Please respond and/or post a new
> patch by 2017-03-24 00:00 AoE (UTC-12) or this submission will be marked
> "Returned with Feedback".
>

I've changed the patch so that lazy_cleanup_index is not skipped
during aggressive vacuum. Patch attached.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Robert Haas
Date:
On Tue, Mar 14, 2017 at 6:10 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Mar 14, 2017 at 2:48 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I think that that's safe, but it is a little disappointing that it
>> does not allow us to skip work in the case that you really had in mind
>> when writing the patch. Better than nothing, though, and perhaps still
>> a good start. I would like to hear other people's opinions.
>
> Hmm. So, the SQL-callable function txid_current() exports a 64-bit
> XID, which includes an "epoch". If PostgreSQL used these 64-bit XIDs
> natively, we'd never need to freeze. Obviously we don't do that
> because the per-tuple overhead of visibility information is already
> high, and that would make it much worse. But, we can easily afford the
> extra overhead in a completely dead page. Maybe we can overcome the
> _bt_page_recyclable() limitation by being cleverer about how it
> determines if recycling is safe -- it can examine epoch, too. This
> would also be required in the similar function
> vacuumRedirectAndPlaceholder(), which is a part of SP-GiST.
>
> We already have BTPageOpaqueData.btpo, a union whose contained type
> varies based on the page being dead. We could just do the same with
> some other field in that struct, and then store epoch there. Clearly
> nobody really cares about most data that remains on the page. Index
> scans just need to be able to land on it to determine that it's dead,
> and VACUUM needs to be able to determine whether or not there could
> possibly be such an index scan at the time it considers recycling..
>
> Does anyone see a problem with that?

Wouldn't it break on-disk compatibility with existing btree indexes?

I think we're still trying to solve a problem that Simon postulated in
advance of evidence that shows how much of a problem it actually is.
Not only might that be unnecessary, but if we don't have a test
demonstrating the problem, we also don't have a test demonstrating
that a given approach fixes it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Tue, Mar 21, 2017 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Wouldn't it break on-disk compatibility with existing btree indexes?

Yes, it would, but see my later remarks on pd_prune_xid. I think that
that would be safe.

> I think we're still trying to solve a problem that Simon postulated in
> advance of evidence that shows how much of a problem it actually is.

I don't think we're still doing that. I think we're discussing the
risk of recycling being broken indefinitely when it doesn't happen in
time.

> Not only might that be unnecessary, but if we don't have a test
> demonstrating the problem, we also don't have a test demonstrating
> that a given approach fixes it.

Preventing recycling from happening until we feel like it is probably
fine. It is not fine to break it forever, though. The specific problem
is that there is an XID stored in dead B-Tree + SP-GiST pages that is
used in the subsequent RecentGlobalXmin interlock that determines if
recycling is safe (if there is no possible index scan that could land
on the dead page). You know, the _bt_page_recyclable() check.

-- 
Peter Geoghegan



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Robert Haas
Date:
On Tue, Mar 21, 2017 at 8:18 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> Not only might that be unnecessary, but if we don't have a test
>> demonstrating the problem, we also don't have a test demonstrating
>> that a given approach fixes it.
>
> Preventing recycling from happening until we feel like it is probably
> fine. It is not fine to break it forever, though. The specific problem
> is that there is an XID stored in dead B-Tree + SP-GiST pages that is
> used in the subsequent RecentGlobalXmin interlock that determines if
> recycling is safe (if there is no possible index scan that could land
> on the dead page). You know, the _bt_page_recyclable() check.

Oh.  OK, so this is not just about bloat -- it's about whether this
can be safely done at all.  Somehow, I missed that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Wed, Mar 15, 2017 at 7:51 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Mar 14, 2017 at 3:10 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> We already have BTPageOpaqueData.btpo, a union whose contained type
>> varies based on the page being dead. We could just do the same with
>> some other field in that struct, and then store epoch there. Clearly
>> nobody really cares about most data that remains on the page. Index
>> scans just need to be able to land on it to determine that it's dead,
>> and VACUUM needs to be able to determine whether or not there could
>> possibly be such an index scan at the time it considers recycling..
>
> ISTM that we need all of the fields within BTPageOpaqueData even for
> dead pages, actually. The left links and right links still need to be
> sane, and the flag bits are needed. Plus, the field that stores an XID
> already is clearly necessary. Even if they weren't needed, it would
> probably still be a good idea to keep them around for forensic
> purposes. However, the page header field pd_prune_xid is currently
> unused for indexes, and is the same width as CheckPoint.nextXidEpoch
> (the extra thing we might want to store -- the epoch).
>
> Maybe you could store the epoch within that field when B-Tree VACUUM
> deletes a page, and then compare that within _bt_page_recyclable(). It
> would come before the existing XID comparison in that function. One
> nice thing about this idea is that pd_prune_xid will be all-zero for
> index pages from the current format, so there is no need to take
> special care to make sure that databases that have undergone
> pg_upgrade don't break.
>

Thank you for the suggestion!
If we store the epoch within the union field, I think we will not be
able to use BTPageOpaqueData.btpo.xact at the same time. Since
comparing btpo.xact is still necessary to determine whether a page is
recyclable, we cannot store the epoch in the same union field. And if
we add it to BTPageOpaqueData, it would break on-disk compatibility.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: GUC for cleanup indexes threshold.

From
David Steele
Date:
On 3/23/17 1:54 AM, Masahiko Sawada wrote:
> On Wed, Mar 15, 2017 at 7:51 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Tue, Mar 14, 2017 at 3:10 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> We already have BTPageOpaqueData.btpo, a union whose contained type
>>> varies based on the page being dead. We could just do the same with
>>> some other field in that struct, and then store epoch there. Clearly
>>> nobody really cares about most data that remains on the page. Index
>>> scans just need to be able to land on it to determine that it's dead,
>>> and VACUUM needs to be able to determine whether or not there could
>>> possibly be such an index scan at the time it considers recycling..
>>
>> ISTM that we need all of the fields within BTPageOpaqueData even for
>> dead pages, actually. The left links and right links still need to be
>> sane, and the flag bits are needed. Plus, the field that stores an XID
>> already is clearly necessary. Even if they weren't needed, it would
>> probably still be a good idea to keep them around for forensic
>> purposes. However, the page header field pd_prune_xid is currently
>> unused for indexes, and is the same width as CheckPoint.nextXidEpoch
>> (the extra thing we might want to store -- the epoch).
>>
>> Maybe you could store the epoch within that field when B-Tree VACUUM
>> deletes a page, and then compare that within _bt_page_recyclable(). It
>> would come before the existing XID comparison in that function. One
>> nice thing about this idea is that pd_prune_xid will be all-zero for
>> index pages from the current format, so there is no need to take
>> special care to make sure that databases that have undergone
>> pg_upgrade don't break.
>>
>
> Thank you for the suggestion!
> If we store the poch within union field, I think we will not be able
> to use BTPageOpaqueData.btpo.xact at the same time. Since comparing
> btpo.xact is still necessary to determine if that page is recyclable
> we cannot store the epoch into the same union field. And if we store
> it into BTPageOpaqueData, it would break disk compatibility.

I have marked this patch "Waiting for Author".

This thread has been idle for five days.  Please respond with a new 
patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked 
"Returned with Feedback".

Thanks,
-- 
-David
david@pgmasters.net



Re: GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Wed, Mar 29, 2017 at 12:23 AM, David Steele <david@pgmasters.net> wrote:
> On 3/23/17 1:54 AM, Masahiko Sawada wrote:
>>
>> On Wed, Mar 15, 2017 at 7:51 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>>>
>>> On Tue, Mar 14, 2017 at 3:10 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>>>
>>>> We already have BTPageOpaqueData.btpo, a union whose contained type
>>>> varies based on the page being dead. We could just do the same with
>>>> some other field in that struct, and then store epoch there. Clearly
>>>> nobody really cares about most data that remains on the page. Index
>>>> scans just need to be able to land on it to determine that it's dead,
>>>> and VACUUM needs to be able to determine whether or not there could
>>>> possibly be such an index scan at the time it considers recycling..
>>>
>>>
>>> ISTM that we need all of the fields within BTPageOpaqueData even for
>>> dead pages, actually. The left links and right links still need to be
>>> sane, and the flag bits are needed. Plus, the field that stores an XID
>>> already is clearly necessary. Even if they weren't needed, it would
>>> probably still be a good idea to keep them around for forensic
>>> purposes. However, the page header field pd_prune_xid is currently
>>> unused for indexes, and is the same width as CheckPoint.nextXidEpoch
>>> (the extra thing we might want to store -- the epoch).
>>>
>>> Maybe you could store the epoch within that field when B-Tree VACUUM
>>> deletes a page, and then compare that within _bt_page_recyclable(). It
>>> would come before the existing XID comparison in that function. One
>>> nice thing about this idea is that pd_prune_xid will be all-zero for
>>> index pages from the current format, so there is no need to take
>>> special care to make sure that databases that have undergone
>>> pg_upgrade don't break.
>>>
>>
>> Thank you for the suggestion!
>> If we store the poch within union field, I think we will not be able
>> to use BTPageOpaqueData.btpo.xact at the same time. Since comparing
>> btpo.xact is still necessary to determine if that page is recyclable
>> we cannot store the epoch into the same union field. And if we store
>> it into BTPageOpaqueData, it would break disk compatibility.
>
>
> I have marked this patch "Waiting for Author".
>
> This thread has been idle for five days.  Please respond with a new patch by
> 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked "Returned
> with Feedback".
>

I was thinking that the status of this patch is still "Needs review"
because I sent the latest version of the patch[1]. We're still
discussing how we can safely skip the cleanup vacuum on btree indexes.
That patch has some restrictions, but it would be a good first step.

[1] https://www.postgresql.org/message-id/CAD21AoDCmnoqKuKOmge6uc5AJAWOcLAz6jjB_WPSPFVQT5PkUA%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: GUC for cleanup indexes threshold.

From
David Steele
Date:
On 3/29/17 2:23 AM, Masahiko Sawada wrote:
> On Wed, Mar 29, 2017 at 12:23 AM, David Steele <david@pgmasters.net> wrote:
>> On 3/23/17 1:54 AM, Masahiko Sawada wrote:
>>>
>>> On Wed, Mar 15, 2017 at 7:51 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>>>>
>>>> On Tue, Mar 14, 2017 at 3:10 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>>>>
>>>>> We already have BTPageOpaqueData.btpo, a union whose contained type
>>>>> varies based on the page being dead. We could just do the same with
>>>>> some other field in that struct, and then store epoch there. Clearly
>>>>> nobody really cares about most data that remains on the page. Index
>>>>> scans just need to be able to land on it to determine that it's dead,
>>>>> and VACUUM needs to be able to determine whether or not there could
>>>>> possibly be such an index scan at the time it considers recycling..
>>>>
>>>>
>>>> ISTM that we need all of the fields within BTPageOpaqueData even for
>>>> dead pages, actually. The left links and right links still need to be
>>>> sane, and the flag bits are needed. Plus, the field that stores an XID
>>>> already is clearly necessary. Even if they weren't needed, it would
>>>> probably still be a good idea to keep them around for forensic
>>>> purposes. However, the page header field pd_prune_xid is currently
>>>> unused for indexes, and is the same width as CheckPoint.nextXidEpoch
>>>> (the extra thing we might want to store -- the epoch).
>>>>
>>>> Maybe you could store the epoch within that field when B-Tree VACUUM
>>>> deletes a page, and then compare that within _bt_page_recyclable(). It
>>>> would come before the existing XID comparison in that function. One
>>>> nice thing about this idea is that pd_prune_xid will be all-zero for
>>>> index pages from the current format, so there is no need to take
>>>> special care to make sure that databases that have undergone
>>>> pg_upgrade don't break.
>>>>
>>>
>>> Thank you for the suggestion!
>>> If we store the poch within union field, I think we will not be able
>>> to use BTPageOpaqueData.btpo.xact at the same time. Since comparing
>>> btpo.xact is still necessary to determine if that page is recyclable
>>> we cannot store the epoch into the same union field. And if we store
>>> it into BTPageOpaqueData, it would break disk compatibility.
>>
>>
>> I have marked this patch "Waiting for Author".
>>
>> This thread has been idle for five days.  Please respond with a new patch by
>> 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked "Returned
>> with Feedback".
>>
>
> I was thinking that the status of this patch is still "Needs review"
> because I sent latest version patch[1]. We're still under discussion
> how safely we can skip to the cleanup vacuum on btree index. That
> patch has some restrictions but it would be good for first step.

I've marked this patch as "Needs Review".  Feel free to do that on your 
own if you think I've made a mistake in classifying a patch.

My view is that the patch is stalled and something might be required on 
your part to get it moving again, perhaps trying another approach.

-- 
-David
david@pgmasters.net



Re: GUC for cleanup indexes threshold.

From
Robert Haas
Date:
On Wed, Mar 29, 2017 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I was thinking that the status of this patch is still "Needs review"
> because I sent latest version patch[1].

I think you're right.

I took a look at this today.  I think there is some problem with the
design of this patch.  I originally proposed a threshold based on the
percentage of not-all-visible pages on the theory that we'd just skip
looking at the indexes altogether in that case.  But that's not what
the patch does: it only avoids the index *cleanup*, not the index
*vacuum*.  And the comments in btvacuumcleanup say this:
    /*
     * If btbulkdelete was called, we need not do anything, just return the
     * stats from the latest btbulkdelete call.  If it wasn't called, we must
     * still do a pass over the index, to recycle any newly-recyclable pages
     * and to obtain index statistics.
     *
     * Since we aren't going to actually delete any leaf items, there's no
     * need to go through all the vacuum-cycle-ID pushups.
     */

So, if I'm reading this correctly, the only time this patch saves
substantial work - at least in the case of a btree index - is in the
case where there are no dead tuples at all.  But if we only want to
avoid the work in that case, then a threshold based on the percentage
of all-visible pages seems like the wrong thing, because the other
stuff btvacuumcleanup() is doing doesn't have anything to do with the
number of all-visible pages.

I'm not quite sure what the right thing to do is here, but I'm
doubtful that this is it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 29, 2017 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I was thinking that the status of this patch is still "Needs review"
>> because I sent latest version patch[1].
>
> I think you're right.
>
> I took a look at this today.  I think there is some problem with the
> design of this patch.  I originally proposed a threshold based on the
> percentage of not-all-visible pages on the theory that we'd just skip
> looking at the indexes altogether in that case.  But that's not what
> the patch does: it only avoids the index *cleanup*, not the index
> *vacuum*.

Hmm, I guess there is a misunderstanding on this thread (my English
might have confused you, sorry).

I've been proposing a patch that allows lazy vacuum to skip only the
index *cleanup*. The problem reported by xu jian[1] is that a second
vacuum freeze takes a long time even if the table has not been
modified since the first vacuum. The reason is that we cannot skip
lazy_cleanup_index even if the table has not changed at all since the
previous vacuum. If the table has any garbage, lazy vacuum does
lazy_vacuum_index; on the other hand, if the table has no garbage,
which means that only insertions were executed or the table was not
modified, lazy vacuum does lazy_cleanup_index instead. Since I knew
this restriction, I proposed the patch this way. That's why the
proposed GUC parameter name has been "vacuum_cleanup_index_scale_factor".

> And the comments in btvacuumcleanup say this:
>
>     /*
>      * If btbulkdelete was called, we need not do anything, just return the
>      * stats from the latest btbulkdelete call.  If it wasn't called, we must
>      * still do a pass over the index, to recycle any newly-recyclable pages
>      * and to obtain index statistics.
>      *
>      * Since we aren't going to actually delete any leaf items, there's no
>      * need to go through all the vacuum-cycle-ID pushups.
>      */

In the current design, we can skip lazy_cleanup_index when the number
of pages that need to be vacuumed is less than the threshold. For
example, after the whole table has been frozen by an aggressive
vacuum, vacuum can skip scanning the index if only a few insertions
have been executed on that table. But if a transaction created garbage
on the table, this patch cannot prevent vacuuming the index, as you
mentioned.
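
To make the intent concrete, the check is roughly of this shape (a
rough sketch only; the counter name is illustrative and not exactly
what the posted patch uses):

    /*
     * Sketch: skip the cleanup pass when too few table pages contained
     * dead tuples since the last vacuum.  "pages_with_dead_tuples" is an
     * assumed counter, not an existing LVRelStats field.
     */
    if ((double) pages_with_dead_tuples <
        vacuum_cleanup_index_scale_factor * vacrelstats->rel_pages)
        return;                 /* skip lazy_cleanup_index() */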

By skipping index cleanup we could leave recyclable pages that are
not yet marked as recyclable. But they can be reclaimed by both the
next index vacuum and the next index cleanup, because btvacuumpage
calls RecordFreeIndexPage for recyclable pages. So I think it doesn't
become a problem.

And the next concern, pointed out by Peter, is that we stash an XID
when a btree page is deleted, which is used to determine when it's
finally safe to recycle the page. I prevent this situation by not
allowing lazy vacuum to skip the index cleanup during an aggressive
vacuum. But since this weakens the advantage of the patch, I'm
considering a better idea.

> So, if I'm reading this correctly, the only time this patch saves
> substantial work - at least in the case of a btree index - is in the
> case where there are no dead tuples at all.  But if we only want to
> avoid the work in that case, then a threshold based on the percentage
> of all-visible pages seems like the wrong thing, because the other
> stuff btvacuumcleanup() is doing doesn't have anything to do with the
> number of all-visible pages.

I thought that if lazy vacuum skipped almost the whole table according
to the visibility map, it means the state of the table has changed
only a little and we can skip updating the index statistics. Am I
missing something?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: GUC for cleanup indexes threshold.

From
Andres Freund
Date:
Hi,

On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> [ lots of valuable discussion ]

I think this patch clearly still is in the design stage, and has
received plenty feedback this CF.  I'll therefore move this to the next
commitfest.

- Andres



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
>> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> [ lots of valuable discussion ]
>
> I think this patch clearly still is in the design stage, and has
> received plenty feedback this CF.  I'll therefore move this to the next
> commitfest.

Does anyone have ideas on a way forward here? I don't, but then I
haven't thought about it in detail in several months.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kyotaro HORIGUCHI
Date:
I was just looking at this thread since I found it has been left
alone for a long time in the CF app.

At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in
<CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com>
> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote:
> > Hi,
> >
> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> [ lots of valuable discussion ]
> >
> > I think this patch clearly still is in the design stage, and has
> > received plenty feedback this CF.  I'll therefore move this to the next
> > commitfest.
> 
> Does anyone have ideas on a way forward here? I don't, but then I
> haven't thought about it in detail in several months.

Is the additional storage in the metapage to store the current
status of vacuum still unacceptable, even if it can avoid a
useless full-page scan on indexes, especially for stable tables?

Or, how about an additional 1 bit in pg_stat_*_index to indicate
that the index does *not* require the vacuum cleanup stage? (The
default value causes cleanup.)

index_bulk_delete (or ambulkdelete) returns the flag in
IndexBulkDeleteResult, then lazy_scan_heap stores the flag in
stats, and in the next cycle it is looked up to decide the
necessity of index cleanup.
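
A minimal sketch of that flow, with assumed member and variable names
(only an illustration of the idea, not a worked-out patch):

    /* assumed extra member on the existing IndexBulkDeleteResult */
    bool        cleanup_needed;     /* does this index still want a cleanup scan? */

    /* in lazy_scan_heap(), after the bulk-delete pass: remember the flag
     * per index and ship it to the stats collector with the vacuum report */
    index_cleanup_needed[i] = (indstats[i] == NULL) || indstats[i]->cleanup_needed;

    /* in the next vacuum round, before lazy_cleanup_index(): consult the
     * stored bit (assumed lookup helper) and skip the full index scan */
    if (!stored_cleanup_flag_for(indrel))
        return;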

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Claudio Freire
Date:
On Tue, Sep 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I was just looking the thread since it is found left alone for a
> long time in the CF app.
>
> At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in
<CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com>
>> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Hi,
>> >
>> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
>> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> >> [ lots of valuable discussion ]
>> >
>> > I think this patch clearly still is in the design stage, and has
>> > received plenty feedback this CF.  I'll therefore move this to the next
>> > commitfest.
>>
>> Does anyone have ideas on a way forward here? I don't, but then I
>> haven't thought about it in detail in several months.
>
> Is the additional storage in metapage to store the current status
> of vaccum is still unacceptable even if it can avoid useless
> full-page scan on indexes especially for stable tables?
>
> Or, how about additional 1 bit in pg_stat_*_index to indicate
> that the index *don't* require vacuum cleanup stage. (default
> value causes cleanup)
>
> index_bulk_delete (or ambulkdelete) returns the flag in
> IndexBulkDeleteResult then lazy_scan_heap stores the flag in
> stats and in the next cycle it is looked up to decide the
> necessity of index cleanup.

Maybe this is looking at the problem from the wrong direction.

Why can't the page be added to the FSM immediately and the check be
done at runtime when looking for a reusable page?

Index FSMs currently store only 0 or 255, couldn't they store 128 for
half-recyclable pages and make the caller re-check reusability before
using it?

This would make the second pass totally unnecessary, for a slight
penalty when looking at the FSM, i.e. two lookups in the FSM instead
of one: an initial one with min free space 128 to get a
half-recyclable page and, if the check fails on that page, a second
lookup with min free space 255 to get a surely-recyclable page.
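
To make that concrete, a rough sketch of the two-step lookup inside
_bt_getbuf (GetHalfFreeIndexPage is a hypothetical helper for the
128-level lookup; it does not exist today):

    BlockNumber blkno;
    Buffer      buf;

    for (;;)
    {
        /* assumed helper: FSM lookup for "maybe recyclable" (128) pages */
        blkno = GetHalfFreeIndexPage(rel);
        if (blkno == InvalidBlockNumber)
            break;              /* nothing queued, fall through */

        buf = ReadBuffer(rel, blkno);
        if (ConditionalLockBuffer(buf))
        {
            Page        page = BufferGetPage(buf);

            /* re-check reusability at runtime, as _bt_getbuf already does */
            if (_bt_page_recyclable(page))
                return buf;     /* safe to reuse after all */
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        }
        ReleaseBuffer(buf);
    }
    /* else fall back to a surely-recyclable (255) page, or extend the rel */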


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Tue, Sep 19, 2017 at 4:47 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Maybe this is looking at the problem from the wrong direction.
>
> Why can't the page be added to the FSM immediately and the check be
> done at runtime when looking for a reusable page?
>
> Index FSMs currently store only 0 or 255, couldn't they store 128 for
> half-recyclable pages and make the caller re-check reusability before
> using it?

No, because it's impossible for them to know whether or not the page
that their index scan just landed on was recycled just a second ago,
or was like this since before their xact began/snapshot was acquired.

For your reference, this RecentGlobalXmin interlock stuff is what
Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty
Nodes". Seems pretty hard to do it any other way.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kyotaro HORIGUCHI
Date:
Hi,

At Tue, 19 Sep 2017 16:55:38 -0700, Peter Geoghegan <pg@bowt.ie> wrote in
<CAH2-Wzn0-3zxGRp_qp1OaEXY7h1W0-W_VCFO0nDv0K_+kabyYQ@mail.gmail.com>
> On Tue, Sep 19, 2017 at 4:47 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > Maybe this is looking at the problem from the wrong direction.
> >
> > Why can't the page be added to the FSM immediately and the check be
> > done at runtime when looking for a reusable page?
> >
> > Index FSMs currently store only 0 or 255, couldn't they store 128 for
> > half-recyclable pages and make the caller re-check reusability before
> > using it?
> 
> No, because it's impossible for them to know whether or not the page
> that their index scan just landed on recycled just a second ago, or
> was like this since before their xact began/snapshot was acquired.
> 
> For your reference, this RecentGlobalXmin interlock stuff is what
> Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty
> Nodes". Seems pretty hard to do it any other way.

Anyway (:p), the attached first patch is a PoC for the
cleanup-state-in-stats method; it works only for btree. Some
LOG-level debugging messages are put in the patch to show how it
works.

The following steps make a not-recyclable page, but I'm not sure
they are general enough, and I couldn't generate half-dead pages.
The pg_sleep() in the steps below is inserted in order to see
the updated values in the stats.


DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a int);
CREATE INDEX ON t1 (a);
INSERT INTO t1 (SELECT a FROM generate_series(0, 800000) a);
DELETE FROM t1 WHERE a > 416700 AND a < 417250;
VACUUM t1;
DELETE FROM t1;
VACUUM t1;  -- 1 (or wait for autovacuum)
select pg_sleep(1);
VACUUM t1;  -- 2 (autovacuum doesn't work)
select pg_sleep(1);
VACUUM t1;  -- 3 (ditto)


The following logs are emitted while the three VACUUMs are issued.

# VACUUM t1;  -- 1 (or wait for autovacuum)
LOG:  btvacuumscan(t1_a_idx) result: deleted = 2185, notrecyclable = 1, hafldead = 0, no_cleanup_needed = false
LOG:  Vacuum cleanup of index t1_a_idx is NOT skipped
LOG:  btvacuumcleanup on index t1_a_idx is skipped since bulkdelete has run just before.

# VACUUM t1;  -- 2
LOG:  Vacuum cleanup of index t1_a_idx is NOT skipped
LOG:  btvacuumscan(t1_a_idx) result: deleted = 2192, notrecyclable = 0, hafldead = 0, no_cleanup_needed = true

# VACUUM t1;  -- 3
LOG:  Vacuum cleanup of index t1_a_idx is skipped


VACUUM #1 leaves an unrecyclable page and requests the next cleanup.
VACUUM #2 leaves no unrecyclable page and inhibits the next cleanup.
VACUUM #3 (and every one after) executes no vacuum cleanup.

# I suppose it is a known issue that the cleanup cycles are not
# executed automatically unless new dead tuples are generated.

- Getting stats takes a very long time to fail during initdb. Since I
couldn't find the right way to cope with this, I added a tentative
function pgstat_live(), which checks that the backend has a valid
stats socket.

- The patch calls pg_stat_get_vac_cleanup_needed using
DirectFunctionCall. It might be better to wrap it.


As a byproduct, this enables us to run extra autovacuum rounds for
index cleanup. With the second attached patch, autovacuum works as
follows.

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a int);
CREATE INDEX ON t1 (a);
INSERT INTO t1 (SELECT a FROM generate_series(0, 800000) a);
DELETE FROM t1 WHERE a > 416700 AND a < 417250;
(autovacuum on t1 runs)
> LOG:  btvacuumscan(t1_a_idx) result: deleted = 0, notrecyclable = 0, hafldead = 0, no_cleanup_needed = true
> LOG:  Vacuum cleanup of index t1_a_idx is skipped
> LOG:  automatic vacuum of table "postgres.public.t1": index scans: 1
DELETE FROM t1;
(autovacuum on t1 runs)
> LOG:  btvacuumscan(t1_a_idx) result: deleted = 2185, notrecyclable = 1, hafldead = 0, no_cleanup_needed = false
> LOG:  Vacuum cleanup of index t1_a_idx is NOT skipped
> LOG:  btvacuumcleanup on index t1_a_idx is skipped since bulkdelete has run just before.
> LOG:  automatic vacuum of table "postgres.public.t1": index scans: 1
(cleanup vacuum runs for t1 in the next autovac timing)
> LOG:  Vacuum cleanup of index t1_a_idx is NOT skipped
> LOG:  btvacuumscan(t1_a_idx) result: deleted = 2192, notrecyclable = 0, hafldead = 0, no_cleanup_needed = true
> LOG:  automatic vacuum of table "postgres.public.t1": index scans: 0


Any suggestions are welcome.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/src/backend/access/nbtree/nbtpage.c
--- b/src/backend/access/nbtree/nbtpage.c
***************
*** 1110,1116 **** _bt_pagedel(Relation rel, Buffer buf) {     int            ndeleted = 0;     BlockNumber rightsib;
!     bool        rightsib_empty;     Page        page;     BTPageOpaque opaque; 
--- 1110,1116 ---- {     int            ndeleted = 0;     BlockNumber rightsib;
!     bool        rightsib_empty = false;     Page        page;     BTPageOpaque opaque; 
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***************
*** 63,68 **** typedef struct
--- 63,70 ----     BlockNumber lastBlockLocked;    /* highest blkno we've cleanup-locked */     BlockNumber
totFreePages;   /* true total # of free pages */     MemoryContext pagedelcontext;
 
+     uint32        pages_notrecyclable;    /* # of not-yet-recyclable pages */
+     uint32        pages_halfdead;        /* # of half-dead pages */ } BTVacState;  /*
***************
*** 945,950 **** btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
--- 947,954 ---- IndexBulkDeleteResult * btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats) {
+     extern char *get_rel_name(Oid);
+      /* No-op in ANALYZE ONLY mode */     if (info->analyze_only)         return stats;
***************
*** 963,968 **** btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
--- 967,977 ----         stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
btvacuumscan(info,stats, NULL, NULL, 0);     }
 
+     else
+         ereport(LOG,
+                 (errmsg ("btvacuumcleanup on index %s is skipped since bulkdelete has run just before.",
+                          get_rel_name(info->index->rd_id)),
+                  errhidestmt (true)));      /* Finally, vacuum the FSM */     IndexFreeSpaceMapVacuum(info->index);
***************
*** 1004,1009 **** btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
--- 1013,1019 ----     BlockNumber num_pages;     BlockNumber blkno;     bool        needLock;
+     extern char *get_rel_name(Oid);      /*      * Reset counts that will be incremented during the scan; needed in
case
***************
*** 1022,1027 **** btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
--- 1032,1039 ----     vstate.lastBlockVacuumed = BTREE_METAPAGE;    /* Initialise at first block */
vstate.lastBlockLocked= BTREE_METAPAGE;     vstate.totFreePages = 0;
 
+     vstate.pages_notrecyclable = 0;
+     vstate.pages_halfdead = 0;      /* Create a temporary memory context to run _bt_pagedel in */
vstate.pagedelcontext= AllocSetContextCreate(CurrentMemoryContext,
 
***************
*** 1111,1116 **** btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
--- 1123,1139 ----     /* update statistics */     stats->num_pages = num_pages;     stats->pages_free =
vstate.totFreePages;
+ 
+     /* check if we need no further clenaup */
+     if (vstate.pages_notrecyclable == 0 && vstate.pages_halfdead == 0)
+         stats->no_cleanup_needed = true;
+ 
+     ereport(LOG,
+             (errmsg ("btvacuumscan(%s) result: deleted = %d, notrecyclable = %d, hafldead = %d, no_cleanup_needed =
%s",
+                      get_rel_name(rel->rd_id), stats->pages_deleted,
+                      vstate.pages_notrecyclable, vstate.pages_halfdead,
+                      stats->no_cleanup_needed ? "true":"false"),
+              errhidestmt(true))); }  /*
***************
*** 1190,1195 **** restart:
--- 1213,1219 ----     {         /* Already deleted, but can't recycle yet */         stats->pages_deleted++;
+         vstate->pages_notrecyclable++;     }     else if (P_ISHALFDEAD(opaque))     {
***************
*** 1359,1364 **** restart:
--- 1383,1390 ----         /* count only this page, else may double-count parent */         if (ndel)
stats->pages_deleted++;
+         else if (P_ISHALFDEAD(opaque))
+             vstate->pages_halfdead++;    /* Still half-dead */          MemoryContextSwitchTo(oldcontext);         /*
pagedelreleased buffer, so we shouldn't */
 
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 56,61 ****
--- 56,62 ---- #include "storage/bufmgr.h" #include "storage/freespace.h" #include "storage/lmgr.h"
+ #include "utils/fmgrprotos.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/pg_rusage.h"
***************
*** 129,134 **** typedef struct LVRelStats
--- 130,137 ----     int            num_index_scans;     TransactionId latestRemovedXid;     bool
lock_waiter_detected;
+     int            num_index_stats;
+     PgStat_MsgVacuum_indstate *indstats; } LVRelStats;  
***************
*** 152,158 **** static void lazy_vacuum_index(Relation indrel,                   IndexBulkDeleteResult **stats,
          LVRelStats *vacrelstats); static void lazy_cleanup_index(Relation indrel,
 
!                    IndexBulkDeleteResult *stats,                    LVRelStats *vacrelstats); static int
lazy_vacuum_page(Relationonerel, BlockNumber blkno, Buffer buffer,                  int tupindex, LVRelStats
*vacrelstats,Buffer *vmbuffer);
 
--- 155,161 ----                   IndexBulkDeleteResult **stats,                   LVRelStats *vacrelstats); static
voidlazy_cleanup_index(Relation indrel,
 
!                    IndexBulkDeleteResult **stats,                    LVRelStats *vacrelstats); static int
lazy_vacuum_page(Relationonerel, BlockNumber blkno, Buffer buffer,                  int tupindex, LVRelStats
*vacrelstats,Buffer *vmbuffer);
 
***************
*** 342,348 **** lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),                         onerel->rd_rel->relisshared,
    new_live_tuples,
 
!                          vacrelstats->new_dead_tuples);     pgstat_progress_end_command();      /* and log the action
ifappropriate */
 
--- 345,352 ----     pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,                         new_live_tuples,
 
!                          vacrelstats->new_dead_tuples,
!                          vacrelstats->num_index_stats, vacrelstats->indstats);     pgstat_progress_end_command();
/* and log the action if appropriate */
 
***************
*** 496,501 **** lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
--- 500,508 ----      indstats = (IndexBulkDeleteResult **)         palloc0(nindexes * sizeof(IndexBulkDeleteResult
*));
+     vacrelstats->num_index_stats = nindexes;
+     vacrelstats->indstats = (PgStat_MsgVacuum_indstate *)
+         palloc0(nindexes * MAXALIGN(sizeof(PgStat_MsgVacuum_indstate)));      nblocks =
RelationGetNumberOfBlocks(onerel);    vacrelstats->rel_pages = nblocks;
 
***************
*** 1320,1326 **** lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,      /* Do post-vacuum cleanup
andstatistics update for each index */     for (i = 0; i < nindexes; i++)
 
!         lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);      /* If no indexes, make log report that
lazy_vacuum_heapwould've made */     if (vacuumed_pages)
 
--- 1327,1344 ----      /* Do post-vacuum cleanup and statistics update for each index */     for (i = 0; i < nindexes;
i++)
!     {
!         lazy_cleanup_index(Irel[i], &indstats[i], vacrelstats);
! 
!         /* update stats if indstats exists */
!         if (indstats[i])
!         {
!             /* prepare to record the result to stats */
!             vacrelstats->indstats[i].indexoid = Irel[i]->rd_id;
!             vacrelstats->indstats[i].vac_cleanup_needed =
!                 !(indstats[i] && indstats[i]->no_cleanup_needed);
!         }
!     }      /* If no indexes, make log report that lazy_vacuum_heap would've made */     if (vacuumed_pages)
***************
*** 1622,1632 **** lazy_vacuum_index(Relation indrel,  */ static void lazy_cleanup_index(Relation indrel,
!                    IndexBulkDeleteResult *stats,                    LVRelStats *vacrelstats) {     IndexVacuumInfo
ivinfo;    PGRUsage    ru0;      pg_rusage_init(&ru0); 
 
--- 1640,1652 ----  */ static void lazy_cleanup_index(Relation indrel,
!                    IndexBulkDeleteResult **stats,                    LVRelStats *vacrelstats) {     IndexVacuumInfo
ivinfo;    PGRUsage    ru0;
 
+     bool        run_cleanup = true;
+     extern char *get_rel_name(Oid);      pg_rusage_init(&ru0); 
***************
*** 1637,1655 **** lazy_cleanup_index(Relation indrel,     ivinfo.num_heap_tuples = vacrelstats->new_rel_tuples;
ivinfo.strategy= vac_strategy; 
 
!     stats = index_vacuum_cleanup(&ivinfo, stats); 
!     if (!stats)         return;      /*      * Now update statistics in pg_class, but only if the index says the
count     * is accurate.      */
 
!     if (!stats->estimated_count)         vac_update_relstats(indrel,
!                             stats->num_pages,
!                             stats->num_index_tuples,                             0,
false,                            InvalidTransactionId,
 
--- 1657,1696 ----     ivinfo.num_heap_tuples = vacrelstats->new_rel_tuples;     ivinfo.strategy = vac_strategy; 
!     /*
!      * If lazy_vacuum_index tells me that no cleanup is required, or stats
!      * tells so, skip cleanup.
!      */
!     if (*stats)
!     {
!         if ((*stats)->no_cleanup_needed)
!             run_cleanup =false;
!     }
!     else
!         run_cleanup = DatumGetBool(
!             DirectFunctionCall1(pg_stat_get_vac_cleanup_needed,
!                                 ObjectIdGetDatum(indrel->rd_id))); 
!     ereport(LOG,
!             (errmsg ("Vacuum cleanup of index %s is %sskipped",
!                      get_rel_name(indrel->rd_id),
!                      run_cleanup ? "NOT ": ""),
!              errhidestmt (true)));
! 
!     if (run_cleanup)
!         *stats = index_vacuum_cleanup(&ivinfo, *stats);
! 
!     if (!*stats)         return;      /*      * Now update statistics in pg_class, but only if the index says the
count     * is accurate.      */
 
!     if (!(*stats)->estimated_count)         vac_update_relstats(indrel,
!                             (*stats)->num_pages,
!                             (*stats)->num_index_tuples,                             0,
false,                            InvalidTransactionId,
 
***************
*** 1659,1674 **** lazy_cleanup_index(Relation indrel,     ereport(elevel,             (errmsg("index \"%s\" now
contains%.0f row versions in %u pages",                     RelationGetRelationName(indrel),
 
!                     stats->num_index_tuples,
!                     stats->num_pages),              errdetail("%.0f index row versions were removed.\n"
        "%u index pages have been deleted, %u are currently reusable.\n"                        "%s.",
 
!                        stats->tuples_removed,
!                        stats->pages_deleted, stats->pages_free,                        pg_rusage_show(&ru0))));
- 
-     pfree(stats); }  /*
--- 1700,1713 ----     ereport(elevel,             (errmsg("index \"%s\" now contains %.0f row versions in %u pages",
                 RelationGetRelationName(indrel),
 
!                     (*stats)->num_index_tuples,
!                     (*stats)->num_pages),              errdetail("%.0f index row versions were removed.\n"
           "%u index pages have been deleted, %u are currently reusable.\n"                        "%s.",
 
!                        (*stats)->tuples_removed,
!                        (*stats)->pages_deleted, (*stats)->pages_free,                        pg_rusage_show(&ru0))));
} /*
 
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***************
*** 1403,1423 **** pgstat_report_autovac(Oid dboid)  */ void pgstat_report_vacuum(Oid tableoid, bool shared,
!                      PgStat_Counter livetuples, PgStat_Counter deadtuples) {
!     PgStat_MsgVacuum msg;      if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)         return; 
!     pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
!     msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
!     msg.m_tableoid = tableoid;
!     msg.m_autovacuum = IsAutoVacuumWorkerProcess();
!     msg.m_vacuumtime = GetCurrentTimestamp();
!     msg.m_live_tuples = livetuples;
!     msg.m_dead_tuples = deadtuples;
!     pgstat_send(&msg, sizeof(msg)); }  /* --------
--- 1403,1437 ----  */ void pgstat_report_vacuum(Oid tableoid, bool shared,
!                      PgStat_Counter livetuples, PgStat_Counter deadtuples,
!                      int nindstats, PgStat_MsgVacuum_indstate *stats) {
!     PgStat_MsgVacuum *msg;
!     int i;
!     int msgsize;      if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)         return;
+     msgsize = offsetof(PgStat_MsgVacuum, m_indvacstates) +
+         MAXALIGN(sizeof(PgStat_MsgVacuum_indstate)) * nindstats; 
!     msg = (PgStat_MsgVacuum *) palloc(msgsize);
!     pgstat_setheader(&msg->m_hdr, PGSTAT_MTYPE_VACUUM);
!     msg->m_databaseid = shared ? InvalidOid : MyDatabaseId;
!     msg->m_tableoid = tableoid;
!     msg->m_autovacuum = IsAutoVacuumWorkerProcess();
!     msg->m_vacuumtime = GetCurrentTimestamp();
!     msg->m_live_tuples = livetuples;
!     msg->m_dead_tuples = deadtuples;
!     msg->m_n_indvac_states = nindstats;
! 
!     for (i = 0 ; i < nindstats ; i++)
!     {
!         msg->m_indvacstates[i].indexoid = stats[i].indexoid;
!         msg->m_indvacstates[i].vac_cleanup_needed = stats[i].vac_cleanup_needed;
!     }
! 
!     pgstat_send(msg, msgsize); }  /* --------
***************
*** 1535,1541 **** pgstat_report_tempfile(size_t filesize)     pgstat_send(&msg, sizeof(msg)); } 
!  /* ----------  * pgstat_ping() -  *
--- 1549,1561 ----     pgstat_send(&msg, sizeof(msg)); } 
! bool
! pgstat_live(void)
! {
!     if (pgStatSock == PGINVALID_SOCKET)
!         return false;
!     return true;
! } /* ----------  * pgstat_ping() -  *
***************
*** 4587,4592 **** pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
--- 4607,4613 ----         result->analyze_count = 0;         result->autovac_analyze_timestamp = 0;
result->autovac_analyze_count= 0;
 
+         result->needs_vacuum_cleanup = true;     }      return result;
***************
*** 5718,5723 **** pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
--- 5739,5745 ----             tabentry->analyze_count = 0;             tabentry->autovac_analyze_timestamp = 0;
    tabentry->autovac_analyze_count = 0;
 
+             tabentry->needs_vacuum_cleanup = true;         }         else         {
***************
*** 5963,5968 **** pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
--- 5985,5991 ---- {     PgStat_StatDBEntry *dbentry;     PgStat_StatTabEntry *tabentry;
+     int i;      /*      * Store the data in the table's hashtable entry.
***************
*** 5984,5989 **** pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
--- 6007,6023 ----         tabentry->vacuum_timestamp = msg->m_vacuumtime;         tabentry->vacuum_count++;     }
+ 
+     /* store index vacuum stats */
+     for (i = 0 ; i < msg->m_n_indvac_states ; i++)
+     {
+         PgStat_StatTabEntry *indtabentry;
+         Oid indoid = msg->m_indvacstates[i].indexoid;
+         bool vac_cleanup_needed = msg->m_indvacstates[i].vac_cleanup_needed;
+ 
+         indtabentry = pgstat_get_tab_entry(dbentry, indoid, true);
+         indtabentry->needs_vacuum_cleanup = vac_cleanup_needed;
+     } }  /* ----------
*** a/src/backend/utils/adt/pgstatfuncs.c
--- b/src/backend/utils/adt/pgstatfuncs.c
***************
*** 27,32 ****
--- 27,33 ---- #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h"
+ #include "utils/syscache.h" #include "utils/timestamp.h"  #define UINT32_ACCESS_ONCE(var)
((uint32)(*((volatileuint32 *)&(var))))
 
***************
*** 328,333 **** pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS)
--- 329,366 ---- }  Datum
+ pg_stat_get_vac_cleanup_needed(PG_FUNCTION_ARGS)
+ {
+     Oid            relid = PG_GETARG_OID(0);
+     bool result;
+     PgStat_StatTabEntry *tabentry;
+     HeapTuple    reltup;
+     bool is_index = false;
+ 
+     if (!pgstat_live())
+         return true;
+ 
+     reltup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
+     if (HeapTupleIsValid(reltup))
+     {
+         if (((Form_pg_class) GETSTRUCT(reltup))->relkind == RELKIND_INDEX)
+             is_index = true;
+ 
+         ReleaseSysCache(reltup);
+     }
+ 
+     if (!is_index)
+         PG_RETURN_NULL();
+ 
+     if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+         result = true;
+     else
+         result = tabentry->needs_vacuum_cleanup;
+ 
+     PG_RETURN_BOOL(result);
+ }
+ 
+ Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS) {     Oid            relid = PG_GETARG_OID(0);
*** a/src/include/access/genam.h
--- b/src/include/access/genam.h
***************
*** 77,82 **** typedef struct IndexBulkDeleteResult
--- 77,83 ----     double        tuples_removed; /* # removed during vacuum operation */     BlockNumber pages_deleted;
  /* # unused pages in index */     BlockNumber pages_free;        /* # pages available for reuse */
 
+     bool        no_cleanup_needed;  /* true if no cleanup needed  */ } IndexBulkDeleteResult;  /* Typedef for
callbackfunction to determine if a tuple is bulk-deletable */ 
*** a/src/include/access/nbtree.h
--- b/src/include/access/nbtree.h
***************
*** 416,421 **** typedef struct BTScanOpaqueData
--- 416,422 ----  typedef BTScanOpaqueData *BTScanOpaque; 
+  /*  * We use some private sk_flags bits in preprocessed scan keys.  We're allowed  * to use bits 16-31 (see skey.h).
The uppermost bits are copied from the
 
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 2873,2878 **** DATA(insert OID = 3054 ( pg_stat_get_vacuum_count PGNSP PGUID 12 1 0 0 0 f f f f
--- 2873,2880 ---- DESCR("statistics: number of manual vacuums for a table"); DATA(insert OID = 3055 (
pg_stat_get_autovacuum_countPGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_
pg_stat_get_autovacuum_count_null_ _null_ _null_ )); DESCR("statistics: number of auto vacuums for a table");
 
+ DATA(insert OID = 3419 ( pg_stat_get_vac_cleanup_needed PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 16 "26" _null_
_null__null_ _null_ _null_ pg_stat_get_vac_cleanup_needed _null_ _null_ _null_ ));
 
+ DESCR("statistics: whether vacuum on a relation requires cleanup"); DATA(insert OID = 3056 (
pg_stat_get_analyze_countPGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_
pg_stat_get_analyze_count_null_ _null_ _null_ )); DESCR("statistics: number of manual analyzes for a table");
DATA(insertOID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_
_null__null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
 
*** a/src/include/pgstat.h
--- b/src/include/pgstat.h
***************
*** 360,365 **** typedef struct PgStat_MsgAutovacStart
--- 360,372 ----  *                                after VACUUM  * ----------  */
+ typedef struct PgStat_MsgVacuum_indstate
+ {
+     Oid        indexoid;
+     bool    vac_cleanup_needed;
+ } PgStat_MsgVacuum_indstate;
+ 
+  typedef struct PgStat_MsgVacuum {     PgStat_MsgHdr m_hdr;
***************
*** 369,377 **** typedef struct PgStat_MsgVacuum     TimestampTz m_vacuumtime;     PgStat_Counter m_live_tuples;
PgStat_Counterm_dead_tuples; } PgStat_MsgVacuum; 
 
-  /* ----------  * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon  *
   after ANALYZE
 
--- 376,385 ----     TimestampTz m_vacuumtime;     PgStat_Counter m_live_tuples;     PgStat_Counter m_dead_tuples;
+     int            m_n_indvac_states;
+     PgStat_MsgVacuum_indstate m_indvacstates[FLEXIBLE_ARRAY_MEMBER]; } PgStat_MsgVacuum;  /* ----------  *
PgStat_MsgAnalyze           Sent by the backend or autovacuum daemon  *                                after ANALYZE
 
***************
*** 641,646 **** typedef struct PgStat_StatTabEntry
--- 649,656 ----     PgStat_Counter analyze_count;     TimestampTz autovac_analyze_timestamp;    /* autovacuum
initiated*/     PgStat_Counter autovac_analyze_count;
 
+ 
+     bool         needs_vacuum_cleanup;    /* This index needs vac cleanup */ } PgStat_StatTabEntry;  
***************
*** 1159,1166 **** extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t  extern void
pgstat_report_autovac(Oiddboid); extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 
!                      PgStat_Counter livetuples, PgStat_Counter deadtuples);
! extern void pgstat_report_analyze(Relation rel,                       PgStat_Counter livetuples, PgStat_Counter
deadtuples,                      bool resetcounter); 
 
--- 1169,1177 ----  extern void pgstat_report_autovac(Oid dboid); extern void pgstat_report_vacuum(Oid tableoid, bool
shared,
!                      PgStat_Counter livetuples, PgStat_Counter deadtuples,
!                      int nindstats, PgStat_MsgVacuum_indstate *states);
!     extern void pgstat_report_analyze(Relation rel,                       PgStat_Counter livetuples, PgStat_Counter
deadtuples,                      bool resetcounter); 
 
***************
*** 1172,1177 **** extern void pgstat_bestart(void);
--- 1183,1189 ----  extern void pgstat_report_activity(BackendState state, const char *cmd_str); extern void
pgstat_report_tempfile(size_tfilesize);
 
+ extern bool pgstat_live(void); extern void pgstat_report_appname(const char *appname); extern void
pgstat_report_xact_timestamp(TimestampTztstamp); extern const char *pgstat_get_wait_event(uint32 wait_event_info); 
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
***************
*** 2791,2796 **** table_recheck_autovac(Oid relid, HTAB *table_toast_map,
--- 2791,2803 ----                               effective_multixact_freeze_max_age,
&dovacuum,&doanalyze, &wraparound); 
 
+     /* force vacuum if any index on the rel is requesting cleanup scan */
+     if (!dovacuum)
+         dovacuum =
+             DatumGetBool(
+                 DirectFunctionCall1(pg_stat_get_vac_cleanup_needed,
+                                     ObjectIdGetDatum(relid)));
+      /* ignore ANALYZE for toast tables */     if (classForm->relkind == RELKIND_TOASTVALUE)         doanalyze =
false;
***************
*** 3045,3050 **** relation_needs_vacanalyze(Oid relid,
--- 3052,3064 ----         /* Determine if this table needs vacuum or analyze. */         *dovacuum = force_vacuum ||
(vactuples> vacthresh);         *doanalyze = (anltuples > anlthresh);
 
+ 
+         /* still force vacuum if index cleanup is requested */
+         if (!*dovacuum)
+             *dovacuum =
+                 DatumGetBool(
+                     DirectFunctionCall1(pg_stat_get_vac_cleanup_needed,
+                                         ObjectIdGetDatum(relid)));     }     else     {
*** a/src/backend/utils/adt/pgstatfuncs.c
--- b/src/backend/utils/adt/pgstatfuncs.c
***************
*** 349,361 **** pg_stat_get_vac_cleanup_needed(PG_FUNCTION_ARGS)         ReleaseSysCache(reltup);     }      if
(!is_index)
!         PG_RETURN_NULL(); 
!     if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
!         result = true;     else
!         result = tabentry->needs_vacuum_cleanup;      PG_RETURN_BOOL(result); }
--- 349,393 ----         ReleaseSysCache(reltup);     } 
+     /*
+      * If normal relaion is specified, return true if any index of the
+      * relation is explicitly requesting cleanup.
+      */     if (!is_index)
!     {
!         Relation indrel;
!         SysScanDesc indscan;
!         HeapTuple indtup; 
!         result = false;
!         indrel = heap_open(IndexRelationId, AccessShareLock);
!         indscan = systable_beginscan(indrel, InvalidOid, false, NULL, 0, NULL);
!         while (HeapTupleIsValid(indtup = systable_getnext(indscan)) &&
!                !result)
!         {
!             Form_pg_index ind = (Form_pg_index) GETSTRUCT(indtup);
! 
!             if (ind->indrelid != relid)
!                 continue;
! 
!             if ((tabentry = pgstat_fetch_stat_tabentry(ind->indexrelid)))
!                 result |= tabentry->needs_vacuum_cleanup;
!         }
!         systable_endscan(indscan);
!         heap_close(indrel, AccessShareLock);
!     }     else
!     {
!         /*
!          * Elsewise reutrn the status of the index. As somewhat inconsistent
!          * behavior with the normal relation case above, *true* is returned
!          * for indexes with no stats here.
!          */
!         if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
!             result = true;
!         else
!             result = tabentry->needs_vacuum_cleanup;
!     }      PG_RETURN_BOOL(result); }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Claudio Freire
Date:
On Tue, Sep 19, 2017 at 8:55 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Sep 19, 2017 at 4:47 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Maybe this is looking at the problem from the wrong direction.
>>
>> Why can't the page be added to the FSM immediately and the check be
>> done at runtime when looking for a reusable page?
>>
>> Index FSMs currently store only 0 or 255, couldn't they store 128 for
>> half-recyclable pages and make the caller re-check reusability before
>> using it?
>
> No, because it's impossible for them to know whether or not the page
> that their index scan just landed on recycled just a second ago, or
> was like this since before their xact began/snapshot was acquired.
>
> For your reference, this RecentGlobalXmin interlock stuff is what
> Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty
> Nodes". Seems pretty hard to do it any other way.

I don't see the difference between a vacuum run and distributed
maintenance at _bt_getbuf time. In fact, the code seems to be in
place already.

_bt_page_recyclable seems to prevent old transactions from treating
those pages as recyclable already, and the description of the
technique in 2.5 doesn't seem to preclude doing the drain while doing
other operations. In fact, Lehman even considers the possibility of
multiple concurrent garbage collectors.

It's only a matter of making the page visible in the FSM in a way that
can be efficiently skipped if we want to go directly to a page that
actually CAN be recycled to avoid looping forever looking for a
recyclable page in _bt_getbuf. In fact, that's pretty much Lehman's
drain technique right there. FSM entries with 128 would be "the
queue", and FSM entries with 255 would be "the freelist". _bt_getbuf
can be the GC getting a buffer to try and recycle, give up after a few
tries, and get an actual recyclable buffer instead (or extend the
relation). In essence, microvacuum.

Unless I'm missing something and RecentGlobalXmin somehow doesn't
exclude all old transactions, I don't see a problem.

Lanin & Shasha use reference counting to do GC wait during the drain,
and most of the ordering of operations needed there is because of
that, but using the xmin seems to make all those considerations moot.
An xact earlier than RecentGlobalXmin cannot have active transactions
able to follow links to that page AFAIK.

TBH, I didn't read the whole papers, though I probably will.

But, in essence, what's the difference of vacuum doing

        if (_bt_page_recyclable(page))
        {
                /* Okay to recycle this page */
                RecordFreeIndexPage(rel, blkno);
                vstate->totFreePages++;
                stats->pages_deleted++;
        }
 

VS doing it in _bt_getbuf?

What am I missing?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I was just looking the thread since it is found left alone for a
> long time in the CF app.
>
> At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in
<CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com>
>> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Hi,
>> >
>> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
>> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> >> [ lots of valuable discussion ]
>> >
>> > I think this patch clearly still is in the design stage, and has
>> > received plenty feedback this CF.  I'll therefore move this to the next
>> > commitfest.
>>
>> Does anyone have ideas on a way forward here? I don't, but then I
>> haven't thought about it in detail in several months.
>
> Is the additional storage in metapage to store the current status
> of vaccum is still unacceptable even if it can avoid useless
> full-page scan on indexes especially for stable tables?
>
> Or, how about additional 1 bit in pg_stat_*_index to indicate
> that the index *don't* require vacuum cleanup stage. (default
> value causes cleanup)

You meant that "the next cycle" is the lazy_cleanup_index() function
called by lazy_scan_heap()?

>
> index_bulk_delete (or ambulkdelete) returns the flag in
> IndexBulkDeleteResult then lazy_scan_heap stores the flag in
> stats and in the next cycle it is looked up to decide the
> necessity of index cleanup.
>

Could you elaborate on this? For example, in a btree index, the index
cleanup skips the index scan if index_bulk_delete has been called
during vacuuming, because stats != NULL. So I think we don't need
such a flag.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kyotaro HORIGUCHI
Date:
At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com>
> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > I was just looking the thread since it is found left alone for a
> > long time in the CF app.
> >
> > At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in
<CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com>
> >> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote:
> >> > Hi,
> >> >
> >> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
> >> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> >> [ lots of valuable discussion ]
> >> >
> >> > I think this patch clearly still is in the design stage, and has
> >> > received plenty feedback this CF.  I'll therefore move this to the next
> >> > commitfest.
> >>
> >> Does anyone have ideas on a way forward here? I don't, but then I
> >> haven't thought about it in detail in several months.
> >
> > Is the additional storage in metapage to store the current status
> > of vaccum is still unacceptable even if it can avoid useless
> > full-page scan on indexes especially for stable tables?
> >
> > Or, how about additional 1 bit in pg_stat_*_index to indicate
> > that the index *don't* require vacuum cleanup stage. (default
> > value causes cleanup)
> 
> You meant that "the next cycle" is the lazy_cleanup_index() function
> called by lazy_scan_heap()?

Both finally call btvacuumscan under a certain condition, but
what I meant by "the next cycle" is the lazy_cleanup_index call
in the next round of vacuum, since the abstraction layer (the
index AM) isn't conscious of the details of btree.

> > index_bulk_delete (or ambulkdelete) returns the flag in
> > IndexBulkDeleteResult then lazy_scan_heap stores the flag in
> > stats and in the next cycle it is looked up to decide the
> > necessity of index cleanup.
> >
> 
> Could you elaborate about this? For example in btree index, the index
> cleanup skips to scan on the index scan if index_bulk_delete has been
> called during vacuuming because stats != NULL. So I think we don't
> need such a flag.

The flag works so that two successive index full scans don't
happen in one vacuum round. If any rows are deleted, the
btvacuumcleanup that follows does nothing.

I think what you wanted to solve here was the problem that
index_vacuum_cleanup runs a full scan even if it ends up doing no
actual work, during manual or anti-wraparound vacuums.  (I'm
getting a bit confused on this..) It is caused by using the
pointer "stats" as the flag that instructs it to do so. If
stats-as-a-flag worked as expected, the GUC doesn't seem to be
required.
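
For reference, the stats-as-a-flag behavior is roughly this part of
btvacuumcleanup():

    /* No-op in ANALYZE ONLY mode */
    if (info->analyze_only)
        return stats;

    /* if btbulkdelete already ran, "stats" is non-NULL and the full
     * index scan below is skipped */
    if (stats == NULL)
    {
        stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
        btvacuumscan(info, stats, NULL, NULL, 0);
    }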

In addition to that, as Simon and Peter pointed out,
index_bulk_delete can leave not-fully-removed pages (so-called
half-dead pages, and pages that are recyclable but not registered
in the FSM, AFAICS) in some cases, mainly because of the
RecentGlobalXmin interlock. In this case, just inhibiting the
cleanup scan by a threshold lets such dangling pages persist in
the index. (I couldn't make that many dangling pages, though..)

The first patch in the mail (*1) does that. It seems to have some
bugs, though..


Since the dangling pages persist until autovacuum decides to scan
the owning table again, we should run a vacuum round (or
index_vacuum_cleanup itself) even with no dead rows if we want
to clean up such pages within a certain period. The second patch
does that.


[*1] https://www.postgresql.org/message-id/20170921.174957.236914340.horiguchi.kyotaro@lab.ntt.co.jp

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kyotaro HORIGUCHI
Date:
I apologize in advance for possible silliness.

At Thu, 21 Sep 2017 13:54:01 -0300, Claudio Freire <klaussfreire@gmail.com> wrote in
<CAGTBQpYvgdqxVaiyui=BKrzw7ZZfTQi9KECUL4-Lkc2ThqX8QQ@mail.gmail.com>
> On Tue, Sep 19, 2017 at 8:55 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> > On Tue, Sep 19, 2017 at 4:47 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> >> Maybe this is looking at the problem from the wrong direction.
> >>
> >> Why can't the page be added to the FSM immediately and the check be
> >> done at runtime when looking for a reusable page?
> >>
> >> Index FSMs currently store only 0 or 255, couldn't they store 128 for
> >> half-recyclable pages and make the caller re-check reusability before
> >> using it?
> >
> > No, because it's impossible for them to know whether or not the page
> > that their index scan just landed on recycled just a second ago, or
> > was like this since before their xact began/snapshot was acquired.
> >
> > For your reference, this RecentGlobalXmin interlock stuff is what
> > Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty
> > Nodes". Seems pretty hard to do it any other way.
> 
> I don't see the difference between a vacuum run and distributed
> maintainance at _bt_getbuf time. In fact, the code seems to be in
> place already.

The pages that RecentGlobalXmin prohibits from being registered as
"free" cannot be grabbed by _bt_getbuf, since such a page is linked
from nowhere and the FSM doesn't offer it as "free".

> _bt_page_recyclable seems to prevent old transactions from treating
> those pages as recyclable already, and the description of the
> technique in 2.5 doesn't seem to preclude doing the drain while doing
> other operations. In fact, Lehman even considers the possibility of
> multiple concurrent garbage collectors.

_bt_page_recyclable prevents a vacuum scan from discarding pages
that might still be reachable by an active transaction, and the
"drain" itself is a technique to prevent freeing still-active
pages, so a scan using the "drain" technique can be executed
freely and simultaneously with other transactions. The paper might
allow concurrent GCs (or vacuums), but our nbtree code says that
no concurrent vacuum is assumed. Er... here it is.

nbtpage.c:1589: _bt_unlink_halfdead_page
| * right.  This search could fail if either the sibling or the target page
| * was deleted by someone else meanwhile; if so, give up.  (Right now,
| * that should never happen, since page deletion is only done in VACUUM
| * and there shouldn't be multiple VACUUMs concurrently on the same
| * table.)


> It's only a matter of making the page visible in the FSM in a way that
> can be efficiently skipped if we want to go directly to a page that
> actually CAN be recycled to avoid looping forever looking for a
> recyclable page in _bt_getbuf. In fact, that's pretty much Lehman's

Mmm. What _bt_getbuf does is recheck a page handed out by the FSM as
a "free page". If the FSM gives no more pages, it just tries to
extend the index relation. Or am I reading you wrongly?

> drain technique right there. FSM entries with 128 would be "the
> queue", and FSM entries with 255 would be "the freelist". _bt_getbuf
> can be the GC getting a buffer to try and recycle, give up after a few
> tries, and get an actual recycleable buffer instead (or extend the
> relationship). In essence, microvacuum.
> 
> Unless I'm missing something and RecentGlobalXmin somehow doesn't
> exclude all old transactions, I don't see a problem.
> 
> Lanin & Shasha use reference counting to do GC wait during the drain,
> and most of the ordering of operations needed there is because of
> that, but using the xmin seems to make all those considerations moot.
> An xact earlier than RecentGlobalXmin cannot have active transactions
> able to follow links to that page AFAIK.
> 
> TBH, I didn't read the whole papers, though I probably will.
> 
> But, in essence, what's the difference of vacuum doing
> 
>         if (_bt_page_recyclable(page))
>         {
>                 /* Okay to recycle this page */
>                 RecordFreeIndexPage(rel, blkno);
>                 vstate->totFreePages++;
>                 stats->pages_deleted++;
>         }
> 
> VS doing it in _bt_getbuf?
> 
> What am I missing?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com>
>> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > I was just looking the thread since it is found left alone for a
>> > long time in the CF app.
>> >
>> > At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in
<CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com>
>> >> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote:
>> >> > Hi,
>> >> >
>> >> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
>> >> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> >> >> [ lots of valuable discussion ]
>> >> >
>> >> > I think this patch clearly still is in the design stage, and has
>> >> > received plenty feedback this CF.  I'll therefore move this to the next
>> >> > commitfest.
>> >>
>> >> Does anyone have ideas on a way forward here? I don't, but then I
>> >> haven't thought about it in detail in several months.
>> >
>> > Is the additional storage in metapage to store the current status
>> > of vaccum is still unacceptable even if it can avoid useless
>> > full-page scan on indexes especially for stable tables?
>> >
>> > Or, how about additional 1 bit in pg_stat_*_index to indicate
>> > that the index *don't* require vacuum cleanup stage. (default
>> > value causes cleanup)
>>
>> You meant that "the next cycle" is the lazy_cleanup_index() function
>> called by lazy_scan_heap()?
>
> Both finally call btvacuumscan under a certain condition, but
> what I meant by "the next cycle" is the lazy_cleanup_index call
> in the next round of vacuum since abstract layer (index am) isn't
> conscious of the detail of btree.
>
>> > index_bulk_delete (or ambulkdelete) returns the flag in
>> > IndexBulkDeleteResult then lazy_scan_heap stores the flag in
>> > stats and in the next cycle it is looked up to decide the
>> > necessity of index cleanup.
>> >
>>
>> Could you elaborate about this? For example in btree index, the index
>> cleanup skips to scan on the index scan if index_bulk_delete has been
>> called during vacuuming because stats != NULL. So I think we don't
>> need such a flag.
>
> The flag works so that successive two index full scans don't
> happen in a vacuum round. If any rows are fully deleted, just
> following btvacuumcleanup does nothing.
>
> I think what you wanted to solve here was the problem that
> index_vacuum_cleanup runs a full scan even if it ends with no
> actual work, when manual or anti-wraparound vacuums.  (I'm
> getting a bit confused on this..) It is caused by using the
> pointer "stats" as the flag to instruct to do that. If the
> stats-as-a-flag worked as expected, the GUC doesn't seem to be
> required.

Hmm, my proposal is that if a table hasn't changed much since the
previous vacuum, we skip cleaning up the index.

If the table has at least one garbage tuple, we do lazy_vacuum_index
and then an IndexBulkDeleteResult is stored, which causes
btvacuumcleanup to be skipped. On the other hand, if the table
doesn't have any garbage but some new tuples were inserted since the
previous vacuum, we don't do lazy_vacuum_index but do
lazy_cleanup_index. In this case, we always do lazy_cleanup_index
(i.e., we do the full scan) even if only one tuple was inserted.
That's why I proposed a new GUC parameter which allows us to skip
lazy_cleanup_index in that case.
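
To spell out those two paths, the end of lazy_scan_heap() today is
roughly (simplified):

    /* only when at least one dead tuple was collected */
    if (vacrelstats->num_dead_tuples > 0)
    {
        for (i = 0; i < nindexes; i++)
            lazy_vacuum_index(Irel[i], &indstats[i], vacrelstats);
        lazy_vacuum_heap(onerel, vacrelstats);
    }

    /* post-vacuum cleanup runs for every index, whether or not a bulk
     * delete happened; for btree this means an unconditional full scan
     * when no bulk delete ran */
    for (i = 0; i < nindexes; i++)
        lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);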

>
> Addition to that, as Simon and Peter pointed out
> index_bulk_delete can leave not-fully-removed pages (so-called
> half-dead pages and pages that are recyclable but not registered
> in FSM, AFAICS) in some cases mainly by RecentGlobalXmin
> interlock. In this case, just inhibiting cleanup scan by a
> threshold lets such dangling pages persist in the index. (I
> conldn't make such a many dangling pages, though..)
>
> The first patch in the mail (*1) does that. It seems having some
> bugs, though..
>
>
> Since the dangling pages persist until autovacuum decided to scan
> the belonging table again, we should run a vacuum round (or
> index_vacuum_cleanup itself) even having no dead rows if we want
> to clean up such pages within a certain period. The second patch
> doesn that.
>

IIUC half-dead pages are not relevant to this proposal. The proposal
has two problems:
* By skipping index cleanup we could leave recyclable pages that are
not marked as recyclable.
* We stash an XID when a btree page is deleted, which is used to
determine when it's finally safe to recycle the page.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kyotaro HORIGUCHI
Date:
At Fri, 22 Sep 2017 17:21:04 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoBN9ucgMDuinx2ptU8upEToHnR-A35aBcQyZnLFvWdVPg@mail.gmail.com>
> On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com>
> >> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI
> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >> Could you elaborate about this? For example in btree index, the index
> >> cleanup skips to scan on the index scan if index_bulk_delete has been
> >> called during vacuuming because stats != NULL. So I think we don't
> >> need such a flag.
> >
> > The flag works so that successive two index full scans don't
> > happen in a vacuum round. If any rows are fully deleted, just
> > following btvacuumcleanup does nothing.
> >
> > I think what you wanted to solve here was the problem that
> > index_vacuum_cleanup runs a full scan even if it ends with no
> > actual work, when manual or anti-wraparound vacuums.  (I'm
> > getting a bit confused on this..) It is caused by using the
> > pointer "stats" as the flag to instruct to do that. If the
> > stats-as-a-flag worked as expected, the GUC doesn't seem to be
> > required.
> 
> Hmm, my proposal is like that if a table doesn't changed since the
> previous vacuum much we skip the cleaning up index.
> 
> If the table has at least one garbage we do the lazy_vacuum_index and
> then IndexBulkDeleteResutl is stored, which causes to skip doing the
> btvacuumcleanup. On the other hand, if the table doesn't have any
> garbage but some new tuples inserted since the previous vacuum, we
> don't do the lazy_vacuum_index but do the lazy_cleanup_index. In this
> case, we always do the lazy_cleanup_index (i.g, we do the full scan)
> even if only one tuple is inserted. That's why I proposed a new GUC
> parameter which allows us to skip the lazy_cleanup_index in the case.

I think the problem raised in this thread is that the last index
scan may leave dangling pages.

> > Addition to that, as Simon and Peter pointed out
> > index_bulk_delete can leave not-fully-removed pages (so-called
> > half-dead pages and pages that are recyclable but not registered
> > in FSM, AFAICS) in some cases mainly by RecentGlobalXmin
> > interlock. In this case, just inhibiting cleanup scan by a
> > threshold lets such dangling pages persist in the index. (I
> > conldn't make such a many dangling pages, though..)
> >
> > The first patch in the mail (*1) does that. It seems having some
> > bugs, though..
> >
> >
> > Since the dangling pages persist until autovacuum decided to scan
> > the belonging table again, we should run a vacuum round (or
> > index_vacuum_cleanup itself) even having no dead rows if we want
> > to clean up such pages within a certain period. The second patch
> > doesn that.
> >
> 
> IIUC half-dead pages are not relevant to this proposal. The proposal
> has two problems;
> 
> * By skipping index cleanup we could leave recyclable pages that are
> not marked as a recyclable.

Yes.

> * we stash an XID when a btree page is deleted, which is used to
> determine when it's finally safe to recycle the page

Is it a "problem" of this proposal?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Claudio Freire
Date:
On Fri, Sep 22, 2017 at 4:46 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I apologize in advance of possible silliness.
>
> At Thu, 21 Sep 2017 13:54:01 -0300, Claudio Freire <klaussfreire@gmail.com> wrote in
<CAGTBQpYvgdqxVaiyui=BKrzw7ZZfTQi9KECUL4-Lkc2ThqX8QQ@mail.gmail.com>
>> On Tue, Sep 19, 2017 at 8:55 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> > On Tue, Sep 19, 2017 at 4:47 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> >> Maybe this is looking at the problem from the wrong direction.
>> >>
>> >> Why can't the page be added to the FSM immediately and the check be
>> >> done at runtime when looking for a reusable page?
>> >>
>> >> Index FSMs currently store only 0 or 255, couldn't they store 128 for
>> >> half-recyclable pages and make the caller re-check reusability before
>> >> using it?
>> >
>> > No, because it's impossible for them to know whether or not the page
>> > that their index scan just landed on recycled just a second ago, or
>> > was like this since before their xact began/snapshot was acquired.
>> >
>> > For your reference, this RecentGlobalXmin interlock stuff is what
>> > Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty
>> > Nodes". Seems pretty hard to do it any other way.
>>
>> I don't see the difference between a vacuum run and distributed
>> maintainance at _bt_getbuf time. In fact, the code seems to be in
>> place already.
>
> The pages prohibited to register as "free" by RecentGlobalXmin
> cannot be grabbed _bt_getbuf since the page is liked from nowhere
> nor FSM doesn't offer the pages is "free".

Yes, but suppose vacuum did add them to the FSM in the first round,
but with a special marker that differentiates them from immediately
recyclable ones.

>> _bt_page_recyclable seems to prevent old transactions from treating
>> those pages as recyclable already, and the description of the
>> technique in 2.5 doesn't seem to preclude doing the drain while doing
>> other operations. In fact, Lehman even considers the possibility of
>> multiple concurrent garbage collectors.
>
> _bt_page_recyclable prevent a vacuum scan from discarding pages
> that might be looked from any active transaction, and the "drain"
> itself is a technique to prevent freeing still-active pages so a
> scan using the "drain" technique is freely executed
> simultaneously with other transactions. The paper might allow
> concurrent GCs (or vacuums) but our nbtree is saying that no
> concurrent vacuum is assumed. Er... here it is.
>
> nbtpages.c:1589: _bt_unlink_halfdead_page
> | * right.  This search could fail if either the sibling or the target page
> | * was deleted by someone else meanwhile; if so, give up.  (Right now,
> | * that should never happen, since page deletion is only done in VACUUM
> | * and there shouldn't be multiple VACUUMs concurrently on the same
> | * table.)

Ok, yes, but we're not talking about half-dead pages, but about deleted
pages that haven't been recycled yet.

>> It's only a matter of making the page visible in the FSM in a way that
>> can be efficiently skipped if we want to go directly to a page that
>> actually CAN be recycled to avoid looping forever looking for a
>> recyclable page in _bt_getbuf. In fact, that's pretty much Lehman's
>
> Mmm. What _bt_getbuf does is recheck the page given from FSM as a
> "free page". If FSM gives no more page, it just tries to extend
> the index relation. Or am I reading you wrongly?

On non-index FSMs, you can request a page that has at least N free bytes.

Index FSMs always mark pages as fully empty or fully full, no in-betweens,
but suppose we used that capability of the data structure to mark "maybe
recyclable" pages with 50% free space, and "surely recyclable" pages with
100% free space.

Then _bt_getbuf could request a 50%-free page a few times, check whether
they're recyclable (i.e. check _bt_page_recyclable), essentially doing a
microvacuum on each such page, and if it cannot find a recyclable page,
try again with the 100% recyclable ones.

The code is almost there; the only thing missing is the distinction
between "maybe recyclable" and "surely recyclable" pages in the index FSM.

Take this with a grain of salt, I'm not an expert on that code. But it
seems feasible to me.
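
To make the idea a bit more concrete, a rough sketch of the free-page
loop in _bt_getbuf could look like the following. This is an
illustration only: today GetFreeIndexPage() takes no category argument,
the FSM_* constants and the two-category convention are hypothetical,
and rel is the index relation as in _bt_getbuf.

    #define FSM_MAYBE_RECYCLABLE   128      /* hypothetical category */
    #define FSM_SURELY_RECYCLABLE  255      /* hypothetical category */

    Buffer      buf;
    Page        page;
    BlockNumber blkno;

    for (;;)
    {
        /* Prefer pages vacuum marked as only "maybe" recyclable ... */
        blkno = GetFreeIndexPage(rel, FSM_MAYBE_RECYCLABLE);    /* hypothetical extra argument */
        if (blkno == InvalidBlockNumber)
        {
            /* ... then fall back to pages already known to be recyclable. */
            blkno = GetFreeIndexPage(rel, FSM_SURELY_RECYCLABLE);
            if (blkno == InvalidBlockNumber)
                break;                      /* nothing free: extend the relation */
        }

        buf = ReadBuffer(rel, blkno);
        if (ConditionalLockBuffer(buf))
        {
            page = BufferGetPage(buf);
            if (_bt_page_recyclable(page))
                return buf;                 /* _bt_getbuf would hand this page out */
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        }
        ReleaseBuffer(buf);                 /* not reusable yet; keep looking */
    }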


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Sokolov Yura
Date:
On 2017-09-22 11:21, Masahiko Sawada wrote:
> On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada 
>> <sawada.mshk@gmail.com> wrote in 
>> <CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com>
>>> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI
>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> > I was just looking the thread since it is found left alone for a
>>> > long time in the CF app.
>>> >
>>> > At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in
<CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com>
>>> >> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote:
>>> >> > Hi,
>>> >> >
>>> >> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
>>> >> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> >> >> [ lots of valuable discussion ]
>>> >> >
>>> >> > I think this patch clearly still is in the design stage, and has
>>> >> > received plenty feedback this CF.  I'll therefore move this to the next
>>> >> > commitfest.
>>> >>
>>> >> Does anyone have ideas on a way forward here? I don't, but then I
>>> >> haven't thought about it in detail in several months.
>>> >
>>> > Is the additional storage in metapage to store the current status
>>> > of vaccum is still unacceptable even if it can avoid useless
>>> > full-page scan on indexes especially for stable tables?
>>> >
>>> > Or, how about additional 1 bit in pg_stat_*_index to indicate
>>> > that the index *don't* require vacuum cleanup stage. (default
>>> > value causes cleanup)
>>> 
>>> You meant that "the next cycle" is the lazy_cleanup_index() function
>>> called by lazy_scan_heap()?
>> 
>> Both finally call btvacuumscan under a certain condition, but
>> what I meant by "the next cycle" is the lazy_cleanup_index call
>> in the next round of vacuum since abstract layer (index am) isn't
>> conscious of the detail of btree.
>> 
>>> > index_bulk_delete (or ambulkdelete) returns the flag in
>>> > IndexBulkDeleteResult then lazy_scan_heap stores the flag in
>>> > stats and in the next cycle it is looked up to decide the
>>> > necessity of index cleanup.
>>> >
>>> 
>>> Could you elaborate about this? For example in btree index, the index
>>> cleanup skips to scan on the index scan if index_bulk_delete has been
>>> called during vacuuming because stats != NULL. So I think we don't
>>> need such a flag.
>> 
>> The flag works so that successive two index full scans don't
>> happen in a vacuum round. If any rows are fully deleted, just
>> following btvacuumcleanup does nothing.
>> 
>> I think what you wanted to solve here was the problem that
>> index_vacuum_cleanup runs a full scan even if it ends with no
>> actual work, when manual or anti-wraparound vacuums.  (I'm
>> getting a bit confused on this..) It is caused by using the
>> pointer "stats" as the flag to instruct to do that. If the
>> stats-as-a-flag worked as expected, the GUC doesn't seem to be
>> required.
> 
> Hmm, my proposal is like that if a table doesn't changed since the
> previous vacuum much we skip the cleaning up index.
> 
> If the table has at least one garbage we do the lazy_vacuum_index and
> then IndexBulkDeleteResutl is stored, which causes to skip doing the
> btvacuumcleanup. On the other hand, if the table doesn't have any
> garbage but some new tuples inserted since the previous vacuum, we
> don't do the lazy_vacuum_index but do the lazy_cleanup_index. In this
> case, we always do the lazy_cleanup_index (i.g, we do the full scan)
> even if only one tuple is inserted. That's why I proposed a new GUC
> parameter which allows us to skip the lazy_cleanup_index in the case.
> 
>> 
>> Addition to that, as Simon and Peter pointed out
>> index_bulk_delete can leave not-fully-removed pages (so-called
>> half-dead pages and pages that are recyclable but not registered
>> in FSM, AFAICS) in some cases mainly by RecentGlobalXmin
>> interlock. In this case, just inhibiting cleanup scan by a
>> threshold lets such dangling pages persist in the index. (I
>> conldn't make such a many dangling pages, though..)
>> 
>> The first patch in the mail (*1) does that. It seems having some
>> bugs, though..
>> 
>> 
>> Since the dangling pages persist until autovacuum decided to scan
>> the belonging table again, we should run a vacuum round (or
>> index_vacuum_cleanup itself) even having no dead rows if we want
>> to clean up such pages within a certain period. The second patch
>> doesn that.
>> 
> 
> IIUC half-dead pages are not relevant to this proposal. The proposal
> has two problems;
> * By skipping index cleanup we could leave recyclable pages that are
> not marked as a recyclable.
> * we stash an XID when a btree page is deleted, which is used to
> determine when it's finally safe to recycle the page
> 
> Regards,
> 
> --
> Masahiko Sawada
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center

Here is a small patch that skips scanning a btree index if no pending
deleted pages exist.

It detects this situation by comparing pages_deleted with pages_free.
If they are equal, then there are no half-deleted pages, and it is
safe to skip the next lazy scan.

The flag is stored in btpo_flags. It is unset with a regular WAL record
before the scan. If no half-deleted pages are found, it is set without
WAL (like a hint bit). (It is safe to miss setting the flag, but it is
not safe to miss unsetting it.)

This patch works correctly:
- if rows are only inserted and never deleted, the index always skips
scanning (starting with the second scan).
- if some rows are updated/deleted, then some scans are not skipped. But
once all half-deleted pages are marked as deleted, lazy scans start to
be skipped.

Open question: what to do with index statistics? For simplicity this
patch skips updating stats (it just returns NULL from btvacuumcleanup).
Probably it should detect the proportion of table changes, and not
skip scans if the table grows too much.
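
To illustrate the check described above (a sketch, not the attached
patch; mark_no_pending_deletes() is a hypothetical stand-in for the
btpo_flags manipulation), the end of the vacuum scan could decide like
this:

    /*
     * pages_deleted and pages_free are the existing counters in
     * IndexBulkDeleteResult.  If every deleted page has already been
     * recorded as free, no deleted-but-not-yet-recyclable pages remain,
     * so the next cleanup-only scan can be skipped.
     */
    if (stats->pages_deleted == stats->pages_free)
        mark_no_pending_deletes(rel);   /* hypothetical; set like a hint bit,
                                         * without WAL, as described above */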

-- 
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Sokolov Yura
Date:
On 2017-09-22 16:22, Sokolov Yura wrote:
> On 2017-09-22 11:21, Masahiko Sawada wrote:
>> On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada 
>>> <sawada.mshk@gmail.com> wrote in 
>>> <CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com>
>>>> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI
>>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>> > I was just looking the thread since it is found left alone for a
>>>> > long time in the CF app.
>>>> >
>>>> > At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote in
<CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com>
>>>> >> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de> wrote:
>>>> >> > Hi,
>>>> >> >
>>>> >> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
>>>> >> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> >> >> [ lots of valuable discussion ]
>>>> >> >
>>>> >> > I think this patch clearly still is in the design stage, and has
>>>> >> > received plenty feedback this CF.  I'll therefore move this to the next
>>>> >> > commitfest.
>>>> >>
>>>> >> Does anyone have ideas on a way forward here? I don't, but then I
>>>> >> haven't thought about it in detail in several months.
>>>> >
>>>> > Is the additional storage in metapage to store the current status
>>>> > of vaccum is still unacceptable even if it can avoid useless
>>>> > full-page scan on indexes especially for stable tables?
>>>> >
>>>> > Or, how about additional 1 bit in pg_stat_*_index to indicate
>>>> > that the index *don't* require vacuum cleanup stage. (default
>>>> > value causes cleanup)
>>>> 
>>>> You meant that "the next cycle" is the lazy_cleanup_index() function
>>>> called by lazy_scan_heap()?
>>> 
>>> Both finally call btvacuumscan under a certain condition, but
>>> what I meant by "the next cycle" is the lazy_cleanup_index call
>>> in the next round of vacuum since abstract layer (index am) isn't
>>> conscious of the detail of btree.
>>> 
>>>> > index_bulk_delete (or ambulkdelete) returns the flag in
>>>> > IndexBulkDeleteResult then lazy_scan_heap stores the flag in
>>>> > stats and in the next cycle it is looked up to decide the
>>>> > necessity of index cleanup.
>>>> >
>>>> 
>>>> Could you elaborate about this? For example in btree index, the 
>>>> index
>>>> cleanup skips to scan on the index scan if index_bulk_delete has 
>>>> been
>>>> called during vacuuming because stats != NULL. So I think we don't
>>>> need such a flag.
>>> 
>>> The flag works so that successive two index full scans don't
>>> happen in a vacuum round. If any rows are fully deleted, just
>>> following btvacuumcleanup does nothing.
>>> 
>>> I think what you wanted to solve here was the problem that
>>> index_vacuum_cleanup runs a full scan even if it ends with no
>>> actual work, when manual or anti-wraparound vacuums.  (I'm
>>> getting a bit confused on this..) It is caused by using the
>>> pointer "stats" as the flag to instruct to do that. If the
>>> stats-as-a-flag worked as expected, the GUC doesn't seem to be
>>> required.
>> 
>> Hmm, my proposal is like that if a table doesn't changed since the
>> previous vacuum much we skip the cleaning up index.
>> 
>> If the table has at least one garbage we do the lazy_vacuum_index and
>> then IndexBulkDeleteResutl is stored, which causes to skip doing the
>> btvacuumcleanup. On the other hand, if the table doesn't have any
>> garbage but some new tuples inserted since the previous vacuum, we
>> don't do the lazy_vacuum_index but do the lazy_cleanup_index. In this
>> case, we always do the lazy_cleanup_index (i.g, we do the full scan)
>> even if only one tuple is inserted. That's why I proposed a new GUC
>> parameter which allows us to skip the lazy_cleanup_index in the case.
>> 
>>> 
>>> Addition to that, as Simon and Peter pointed out
>>> index_bulk_delete can leave not-fully-removed pages (so-called
>>> half-dead pages and pages that are recyclable but not registered
>>> in FSM, AFAICS) in some cases mainly by RecentGlobalXmin
>>> interlock. In this case, just inhibiting cleanup scan by a
>>> threshold lets such dangling pages persist in the index. (I
>>> conldn't make such a many dangling pages, though..)
>>> 
>>> The first patch in the mail (*1) does that. It seems having some
>>> bugs, though..
>>> 
>>> 
>>> Since the dangling pages persist until autovacuum decided to scan
>>> the belonging table again, we should run a vacuum round (or
>>> index_vacuum_cleanup itself) even having no dead rows if we want
>>> to clean up such pages within a certain period. The second patch
>>> doesn that.
>>> 
>> 
>> IIUC half-dead pages are not relevant to this proposal. The proposal
>> has two problems;
>> * By skipping index cleanup we could leave recyclable pages that are
>> not marked as a recyclable.
>> * we stash an XID when a btree page is deleted, which is used to
>> determine when it's finally safe to recycle the page
>> 
>> Regards,
>> 
>> --
>> Masahiko Sawada
>> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
>> NTT Open Source Software Center
> 
> Here is a small patch that skips scanning btree index if no pending
> deleted pages exists.
> 
> It detects this situation by comparing pages_deleted with pages_free.
> If they are equal, then there is no half-deleted pages, and it is
> safe to skip next lazy scan.
> 
> Flag stored in a btpo_flags. It is unset using general wal before scan.
> If no half-deleted pages found, it is set without wal (like hint bit).
> (it is safe to miss setting flag, but it is not safe to miss unsetting
> flag).
> 
> This patch works correctly:
> - if rows are only inserted and never deleted, index always skips
> scanning (starting with second scan).
> - if some rows updated/deleted, then some scans are not skipped. But
> when all half-deleted pages are marked as deleted, lazy scans start to
> be skipped.
> 
> Open question: what to do with index statistic? For simplicity this
> patch skips updating stats (just returns NULL from btvacuumcleanup).
> Probably, it should detect proportion of table changes, and do not
> skip scans if table grows too much.

Excuse me, I didn't mean to overwrite the "last attachment" on the
commitfest page.

-- 
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Darafei Praliaskouski
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

We've been using Postgres with this patch for some time.

In our use case we've got a quickly growing large table with events from our users.
The table has a structure of (user_id, ts, <event data>). Events are append-only; each user generates events in a small
predictable time frame, mostly each second.

From time to time we need to read this table in the fashion of WHERE ts BETWEEN a AND b AND user_id=c.
Such a query leads to an enormous amount of seeks, as the records of each user are scattered across the relation and there are no
pages that contain two events from the same user.

To fight it, we created a btree index on (user_id, ts, <event_data>). The plan switched to index only scans, but heap
fetches and execution times were still the same.
 
Manual 
We noticed that autovacuum skips scanning the relation and freezing the Visibility Map.

We started frequently performing VACUUM manually on the relation. This helped with freezing the Visibility Map.
However, we found out that VACUUM makes a full scan over the index.
As the index does not fit into memory, this means that each run flushes all the disk caches and eats up Amazon IOPS
credits.

With this patch the behavior is much better for us - VACUUM finishes really quickly.

As a future improvement, similar support for other index types will be useful.
After that happens, I'm looking forward to autovacuum kicking in on append-only tables, to freeze the Visibility Map.

The new status of this patch is: Ready for Committer

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Fri, Sep 22, 2017 at 5:31 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Fri, 22 Sep 2017 17:21:04 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoBN9ucgMDuinx2ptU8upEToHnR-A35aBcQyZnLFvWdVPg@mail.gmail.com>
>> On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com>
>> >> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI
>> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> >> Could you elaborate about this? For example in btree index, the index
>> >> cleanup skips to scan on the index scan if index_bulk_delete has been
>> >> called during vacuuming because stats != NULL. So I think we don't
>> >> need such a flag.
>> >
>> > The flag works so that successive two index full scans don't
>> > happen in a vacuum round. If any rows are fully deleted, just
>> > following btvacuumcleanup does nothing.
>> >
>> > I think what you wanted to solve here was the problem that
>> > index_vacuum_cleanup runs a full scan even if it ends with no
>> > actual work, when manual or anti-wraparound vacuums.  (I'm
>> > getting a bit confused on this..) It is caused by using the
>> > pointer "stats" as the flag to instruct to do that. If the
>> > stats-as-a-flag worked as expected, the GUC doesn't seem to be
>> > required.
>>
>> Hmm, my proposal is like that if a table doesn't changed since the
>> previous vacuum much we skip the cleaning up index.
>>
>> If the table has at least one garbage we do the lazy_vacuum_index and
>> then IndexBulkDeleteResutl is stored, which causes to skip doing the
>> btvacuumcleanup. On the other hand, if the table doesn't have any
>> garbage but some new tuples inserted since the previous vacuum, we
>> don't do the lazy_vacuum_index but do the lazy_cleanup_index. In this
>> case, we always do the lazy_cleanup_index (i.g, we do the full scan)
>> even if only one tuple is inserted. That's why I proposed a new GUC
>> parameter which allows us to skip the lazy_cleanup_index in the case.
>
> I think the problem raised in this thread is that the last index
> scan may leave dangling pages.
>
>> > Addition to that, as Simon and Peter pointed out
>> > index_bulk_delete can leave not-fully-removed pages (so-called
>> > half-dead pages and pages that are recyclable but not registered
>> > in FSM, AFAICS) in some cases mainly by RecentGlobalXmin
>> > interlock. In this case, just inhibiting cleanup scan by a
>> > threshold lets such dangling pages persist in the index. (I
>> > conldn't make such a many dangling pages, though..)
>> >
>> > The first patch in the mail (*1) does that. It seems having some
>> > bugs, though..
>> >
>> >
>> > Since the dangling pages persist until autovacuum decided to scan
>> > the belonging table again, we should run a vacuum round (or
>> > index_vacuum_cleanup itself) even having no dead rows if we want
>> > to clean up such pages within a certain period. The second patch
>> > doesn that.
>> >
>>
>> IIUC half-dead pages are not relevant to this proposal. The proposal
>> has two problems;
>>
>> * By skipping index cleanup we could leave recyclable pages that are
>> not marked as a recyclable.
>
> Yes.
>
>> * we stash an XID when a btree page is deleted, which is used to
>> determine when it's finally safe to recycle the page
>
> Is it a "problem" of this proposal?
>

As Peter explained before[1], the problem is that there is an XID
stored in dead btree pages that is used in the subsequent
RecentGlobalXmin interlock that determines if recycling is safe.

[1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kyotaro HORIGUCHI
Date:
At Mon, 25 Sep 2017 19:20:07 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAzKJnc8UM728c0BMHx=7itJh4Db_Lj3Y31itnGrj-heQ@mail.gmail.com>
> >> * we stash an XID when a btree page is deleted, which is used to
> >> determine when it's finally safe to recycle the page
> >
> > Is it a "problem" of this proposal?
> >
> 
> As Peter explained before[1], the problem is that there is an XID
> stored in dead btree pages that is used in the subsequent
> RecentGlobalXmin interlock that determines if recycling is safe.
> 
> [1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com

Yeah, I know that, and I understand that it is the reason why it
is bad to just skip the cleanup scan based on the GUC regardless of
the recycle status.

On the other hand, if we look at the recycle status, I think we don't
need the GUC. (I believe) The patch is *a PoC* in that direction. (I'd
like to let RecordPageWithFreeSpace return the previous value
to avoid a duplicate FSM search..)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kyotaro HORIGUCHI
Date:
At Fri, 22 Sep 2017 17:15:08 +0300, Sokolov Yura <y.sokolov@postgrespro.ru> wrote in
<c7d736b5ea4cda67a644a0247f1a3951@postgrespro.ru>
> On 2017-09-22 16:22, Sokolov Yura wrote:
> > On 2017-09-22 11:21, Masahiko Sawada wrote:
> >> On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI
> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >>> At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada
> >>> <sawada.mshk@gmail.com> wrote in
> >>> <CAD21AoD6zgb1W6ps1aXj0CcAB_chDYiiTNtEdpMhkefGg13-GQ@mail.gmail.com>
> >>>> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI
> >>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >>>> > I was just looking the thread since it is found left alone for a
> >>>> > long time in the CF app.
> >>>> >
> >>>> > At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <pg@bowt.ie> wrote
> >>>> > in
> >>>> > <CAH2-WzkhJhAXD+6DdBp7D8WYLfJ3D0m=AZbGsiw=USUjTmuv-g@mail.gmail.com>
> >>>> >> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <andres@anarazel.de>
> >>>> >> wrote:
> >>>> >> > Hi,
> >>>> >> >
> >>>> >> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
> >>>> >> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas
> >>>> >> >> <robertmhaas@gmail.com> wrote:
> >>>> >> >> [ lots of valuable discussion ]
> >>>> >> >
> >>>> >> > I think this patch clearly still is in the design stage, and has
> >>>> >> > received plenty feedback this CF.  I'll therefore move this to the
> >>>> >> > next
> >>>> >> > commitfest.
> >>>> >>
> >>>> >> Does anyone have ideas on a way forward here? I don't, but then I
> >>>> >> haven't thought about it in detail in several months.
> >>>> >
> >>>> > Is the additional storage in metapage to store the current status
> >>>> > of vaccum is still unacceptable even if it can avoid useless
> >>>> > full-page scan on indexes especially for stable tables?
> >>>> >
> >>>> > Or, how about additional 1 bit in pg_stat_*_index to indicate
> >>>> > that the index *don't* require vacuum cleanup stage. (default
> >>>> > value causes cleanup)
> >>>> You meant that "the next cycle" is the lazy_cleanup_index() function
> >>>> called by lazy_scan_heap()?
> >>> Both finally call btvacuumscan under a certain condition, but
> >>> what I meant by "the next cycle" is the lazy_cleanup_index call
> >>> in the next round of vacuum since abstract layer (index am) isn't
> >>> conscious of the detail of btree.
> >>> 
> >>>> > index_bulk_delete (or ambulkdelete) returns the flag in
> >>>> > IndexBulkDeleteResult then lazy_scan_heap stores the flag in
> >>>> > stats and in the next cycle it is looked up to decide the
> >>>> > necessity of index cleanup.
> >>>> >
> >>>> Could you elaborate about this? For example in btree index, the index
> >>>> cleanup skips to scan on the index scan if index_bulk_delete has been
> >>>> called during vacuuming because stats != NULL. So I think we don't
> >>>> need such a flag.
> >>> The flag works so that successive two index full scans don't
> >>> happen in a vacuum round. If any rows are fully deleted, just
> >>> following btvacuumcleanup does nothing.
> >>> I think what you wanted to solve here was the problem that
> >>> index_vacuum_cleanup runs a full scan even if it ends with no
> >>> actual work, when manual or anti-wraparound vacuums.  (I'm
> >>> getting a bit confused on this..) It is caused by using the
> >>> pointer "stats" as the flag to instruct to do that. If the
> >>> stats-as-a-flag worked as expected, the GUC doesn't seem to be
> >>> required.
> >> Hmm, my proposal is like that if a table doesn't changed since the
> >> previous vacuum much we skip the cleaning up index.
> >> If the table has at least one garbage we do the lazy_vacuum_index and
> >> then IndexBulkDeleteResutl is stored, which causes to skip doing the
> >> btvacuumcleanup. On the other hand, if the table doesn't have any
> >> garbage but some new tuples inserted since the previous vacuum, we
> >> don't do the lazy_vacuum_index but do the lazy_cleanup_index. In this
> >> case, we always do the lazy_cleanup_index (i.g, we do the full scan)
> >> even if only one tuple is inserted. That's why I proposed a new GUC
> >> parameter which allows us to skip the lazy_cleanup_index in the case.
> >> 
> >>> Addition to that, as Simon and Peter pointed out
> >>> index_bulk_delete can leave not-fully-removed pages (so-called
> >>> half-dead pages and pages that are recyclable but not registered
> >>> in FSM, AFAICS) in some cases mainly by RecentGlobalXmin
> >>> interlock. In this case, just inhibiting cleanup scan by a
> >>> threshold lets such dangling pages persist in the index. (I
> >>> conldn't make such a many dangling pages, though..)
> >>> The first patch in the mail (*1) does that. It seems having some
> >>> bugs, though..
> >>> Since the dangling pages persist until autovacuum decided to scan
> >>> the belonging table again, we should run a vacuum round (or
> >>> index_vacuum_cleanup itself) even having no dead rows if we want
> >>> to clean up such pages within a certain period. The second patch
> >>> doesn that.
> >>> 
> >> IIUC half-dead pages are not relevant to this proposal. The proposal
> >> has two problems;
> >> * By skipping index cleanup we could leave recyclable pages that are
> >> not marked as a recyclable.
> >> * we stash an XID when a btree page is deleted, which is used to
> >> determine when it's finally safe to recycle the page
> >> Regards,
> >> --
> >> Masahiko Sawada
> >> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> >> NTT Open Source Software Center
> > Here is a small patch that skips scanning btree index if no pending
> > deleted pages exists.
> > It detects this situation by comparing pages_deleted with pages_free.

It seems to work to prevent needless cleanup scans. (Although I'm
still uncertain how to deal with half-dead pages..)

> > If they are equal, then there is no half-deleted pages, and it is
> > safe to skip next lazy scan.
> > Flag stored in a btpo_flags. It is unset using general wal before
> > scan.

I'm not sure it's good to add meta-page-specific information
to per-page storage.

> > If no half-deleted pages found, it is set without wal (like hint bit).
> > (it is safe to miss setting flag, but it is not safe to miss unsetting
> > flag).
> > This patch works correctly:
> > - if rows are only inserted and never deleted, index always skips
> > scanning (starting with second scan).
(Yes, I was confused here..)
> > - if some rows updated/deleted, then some scans are not skipped. But
> > when all half-deleted pages are marked as deleted, lazy scans start to
> > be skipped.

> > Open question: what to do with index statistic? For simplicity this
> > patch skips updating stats (just returns NULL from btvacuumcleanup).
> > Probably, it should detect proportion of table changes, and do not
> > skip scans if table grows too much.

Agreed.

> Excuse me, I didn't mean to overwrite "last attachment" on commitfest
> page.

I'm the other one who also posted a patch in this thread :p

I'm not sure how we should behave in that case, but I think it's
inevitable to attach a patch when proposing a fundamental change
of the design in a concrete shape. (But this also confuses the CI
machinery :p)

Actually, it was quite clear what you were thinking.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Pavel Golub
Date:
Hello, Darafei.

You wrote:

DP> The following review has been posted through the commitfest application:
DP> make installcheck-world:  tested, passed
DP> Implements feature:       tested, passed
DP> Spec compliant:           tested, passed
DP> Documentation:            tested, passed

DP> We're using Postgres with this patch for some time.

DP> In our use case we've got a quickly growing large table with events from our users.
DP> Table has a structure of (user_id, ts, <event data>). Events are
DP> append only, each user generates events in small predictable time frame, mostly each second.
DP> From time to time we need to read this table in fashion of WHERE
DP> ts BETWEEN a AND b AND user_id=c.
DP> Such query leads to enormous amount of seeks, as records of each
DP> user are scattered across relation and there are no pages that
DP> contain two events from same user.

DP> To fight it, we created a btree index on (user_id, ts,
DP> <event_data>). Plan switched to index only scans, but heap fetches
DP> and execution times were still the same. 
DP> Manual 
DP> We noticed that autovacuum skips scanning the relation and freezing the Visibility Map.

DP> We started frequently performing VACUUM manually on the relation.
DP> This helped with freezing the Visibility Map.
DP> However, we found out that VACUUM makes a full scan over the index.
DP> As index does not fit into memory, this means that each run
DP> flushes all the disk caches and eats up Amazon IOPS credits. 

DP> With this patch behavior is much better for us - VACUUM finishes real quick.

DP> As a future improvement, a similar improvement for other index types will be useful.
DP> After it happens, I'm looking forward to autovacuum kicking in on
DP> append-only tables, to freeze the Visibility Map.

DP> The new status of this patch is: Ready for Committer


Seems like we may also be going to hit it, and it would be cool to have
this vacuum issue solved for the next PG version.

-- 
With best wishes, Pavel                          mailto:pavel@gf.microolap.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Robert Haas
Date:
On Tue, Oct 10, 2017 at 5:55 AM, Pavel Golub <pavel@microolap.com> wrote:
> DP> The new status of this patch is: Ready for Committer
>
> Seems  like,  we  may  also  going to hit it and it would be cool this
> vacuum issue solved for next PG version.

Exactly which patch on this thread is someone proposing for commit?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Fri, Oct 13, 2017 at 1:14 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 10, 2017 at 5:55 AM, Pavel Golub <pavel@microolap.com> wrote:
>> DP> The new status of this patch is: Ready for Committer
>>
>> Seems  like,  we  may  also  going to hit it and it would be cool this
>> vacuum issue solved for next PG version.
>
> Exactly which patch on this thread is someone proposing for commit?
>

I guess that is the patch I proposed. However, I think there is still
room for discussion, because the patch cannot skip the cleanup vacuum
during an aggressive vacuum, which is one of the situations I really
wanted to skip.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Robert Haas
Date:
On Mon, Oct 16, 2017 at 3:02 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I guess that is the patch I proposed. However I think that there still
> is room for discussion because the patch cannot skip to cleanup vacuum
> when aggressive vacuum, which is one of the situation that I really
> wanted to skip.

Well, I think there are outstanding concerns that the patch in
question is not safe, and I don't see that anything has been done to
resolve them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Simon Riggs
Date:
On 25 September 2017 at 22:34, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

>> > Here is a small patch that skips scanning btree index if no pending
>> > deleted pages exists.
>> > It detects this situation by comparing pages_deleted with pages_free.
>
> It seems to work to prevent needless cleanup scans.

So this leaves us in the situation that

1. Masahiko's patch has unresolved problems
2. Yura's patch works and is useful

Unless there is disagreement on the above, it seems we should apply
Yura's patch (an edited version, perhaps).

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Wed, Nov 29, 2017 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 25 September 2017 at 22:34, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
>>> > Here is a small patch that skips scanning btree index if no pending
>>> > deleted pages exists.
>>> > It detects this situation by comparing pages_deleted with pages_free.
>>
>> It seems to work to prevent needless cleanup scans.
>
> So this leaves us in the situation that
>
> 1. Masahiko's patch has unresolved problems
> 2. Yura's patch works and is useful
>
> Unless there is disagreement on the above, it seems we should apply
> Yura's patch (an edited version, perhaps).
>

IIRC the patches that make the cleanup scan skippable have a problem
pointed out by Peter[1]: we stash an XID when a btree page is deleted,
which is used to determine when it's finally safe to recycle the page.
Doesn't Yura's patch have that problem?

[1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Stephen Frost
Date:
Greetings,

(pruned the CC list)

* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Thu, Nov 30, 2017 at 12:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Wed, Nov 29, 2017 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> Unless there is disagreement on the above, it seems we should apply
> >> Yura's patch (an edited version, perhaps).
> >>
> >
> > IIRC the patches that makes the cleanup scan skip has a problem
> > pointed by Peter[1], that is that we stash an XID when a btree page is
> > deleted, which is used to determine when it's finally safe to recycle
> > the page. Yura's patch doesn't have that problem?
> >
> > [1]
https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com
>
> The latest patch present on this thread does not apply
> (https://www.postgresql.org/message-id/c4a47caef28024ab7528946bed264058@postgrespro.ru).
> Please send a rebase. For now I am moving the patch to next CF, with
> "waiting on author".

We are now just about a week into the CF, this patch hasn't been
updated and is still in 'waiting on author' state, and it sounds like it
might have a pretty serious issue based on the above comment.

Masahiko Sawada, if this patch isn't viable or requires serious rework
to be acceptable, then perhaps we should change its status to 'returned
with feedback' and you can post a new patch for the next commitfest..?

Otherwise, I'd encourage you to try and find time to post an updated
patch soon, so that it can be reviewed.

Thanks!

Stephen

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
On Sat, Jan 6, 2018 at 2:20 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> > IIRC the patches that makes the cleanup scan skip has a problem
>> > pointed by Peter[1], that is that we stash an XID when a btree page is
>> > deleted, which is used to determine when it's finally safe to recycle
>> > the page. Yura's patch doesn't have that problem?
>> >
>> > [1]
https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com

> Masahiko Sawada, if this patch isn't viable or requires serious rework
> to be acceptable, then perhaps we should change its status to 'returned
> with feedback' and you can post a new patch for the next commitfest..?

I believe that the problem that I pointed out with freezing/wraparound
is a solvable problem. If we think about it carefully, we will come up
with a good solution. I have tried to get the ball rolling with my
pd_prune_xid suggestion. I think it's important to not lose sight of
the fact that the page deletion/recycling XID thing is just one detail
that we need to think about some more.

I cannot fault Sawada-san for waiting to hear other people's views
before proceeding. It really needs to be properly discussed.

-- 
Peter Geoghegan


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Stephen Frost
Date:
Greetings Peter,

* Peter Geoghegan (pg@bowt.ie) wrote:
> On Sat, Jan 6, 2018 at 2:20 PM, Stephen Frost <sfrost@snowman.net> wrote:
> >> > IIRC the patches that makes the cleanup scan skip has a problem
> >> > pointed by Peter[1], that is that we stash an XID when a btree page is
> >> > deleted, which is used to determine when it's finally safe to recycle
> >> > the page. Yura's patch doesn't have that problem?
> >> >
> >> > [1]
https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com
>
> > Masahiko Sawada, if this patch isn't viable or requires serious rework
> > to be acceptable, then perhaps we should change its status to 'returned
> > with feedback' and you can post a new patch for the next commitfest..?
>
> I believe that the problem that I pointed out with freezing/wraparound
> is a solvable problem. If we think about it carefully, we will come up
> with a good solution. I have tried to get the ball rolling with my
> pd_prune_xid suggestion. I think it's important to not lose sight of
> the fact that the page deletion/recycling XID thing is just one detail
> that we need to think about some more.
>
> I cannot fault Sawada-san for waiting to hear other people's views
> before proceeding. It really needs to be properly discussed.

Thanks for commenting!

Perhaps it should really be in Needs review state then..?

Either way, thanks again for the clarification and hopefully this will
revive the discussion.

Stephen

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Peter Geoghegan
Date:
Hi Stephen,

On Sat, Jan 6, 2018 at 4:02 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Perhaps it should really be in Needs review state then..?

Probably.

As I pointed out already some time ago, this RecentGlobalXmin
interlock stuff is our particular implementation of what Lanin &
Shasha call "The Drain Technique" within "2.5 Freeing Empty Nodes".
It's not easy to understand why this is necessary in general (you have
to understand the whole paper for that), but our implementation of the
drain technique is simple. I'm afraid that I made the problem sound
scarier than it actually is.

As Lanin & Shasha put it: "freeing empty nodes by the drain technique
is transparent to the rest of the algorithm and can be added by a
separate module". nbtree doesn't actually care very much about XIDs --
testing an XID against RecentGlobalXmin was just the most convenient
way of using L&S's technique to defer recycling until it is definitely
safe. We only need to make sure that _bt_page_recyclable() cannot
become confused by XID wraparound to fix this problem -- that's it.
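
For reference, the check in question is approximately the following
(paraphrased from nbtpage.c as it stood at the time); the hazard is the
modulo-2^32 XID comparison on the stashed deletion XID:

    /* Paraphrase of _bt_page_recyclable(), for reference. */
    bool
    _bt_page_recyclable(Page page)
    {
        BTPageOpaque opaque;

        /* An all-zeroes page (e.g. left behind by a crashed relation
         * extension) can be reclaimed immediately. */
        if (PageIsNew(page))
            return true;

        /* Otherwise recycle only if the page is deleted and its stashed
         * deletion XID is older than any snapshot that could still reach it. */
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        if (P_ISDELETED(opaque) &&
            TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
            return true;

        return false;
    }

    /*
     * TransactionIdPrecedes() compares XIDs modulo 2^32, so if cleanup is
     * skipped for around two billion transactions the stashed btpo.xact can
     * start to look "in the future" again and the page is wrongly, or never,
     * considered recyclable -- that is the wraparound problem to solve.
     */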

-- 
Peter Geoghegan


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Sun, Jan 7, 2018 at 8:40 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Jan 6, 2018 at 2:20 PM, Stephen Frost <sfrost@snowman.net> wrote:
>>> > IIRC the patches that makes the cleanup scan skip has a problem
>>> > pointed by Peter[1], that is that we stash an XID when a btree page is
>>> > deleted, which is used to determine when it's finally safe to recycle
>>> > the page. Yura's patch doesn't have that problem?
>>> >
>>> > [1]
https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com
>
>> Masahiko Sawada, if this patch isn't viable or requires serious rework
>> to be acceptable, then perhaps we should change its status to 'returned
>> with feedback' and you can post a new patch for the next commitfest..?
>
> I believe that the problem that I pointed out with freezing/wraparound
> is a solvable problem. If we think about it carefully, we will come up
> with a good solution. I have tried to get the ball rolling with my
> pd_prune_xid suggestion. I think it's important to not lose sight of
> the fact that the page deletion/recycling XID thing is just one detail
> that we need to think about some more.
>
> I cannot fault Sawada-san for waiting to hear other people's views
> before proceeding. It really needs to be properly discussed.
>

Thank you for commenting.

IIUC we have two approaches. One idea is based on Peter's suggestion:
we can use pd_prune_xid to store the epoch of the XID at which the page
was deleted. That way, we can correctly mark deleted pages as recyclable
without breaking the on-disk format.

The other idea was suggested by Sokolov Yura. His original patch gives
btree a flag in btpo_flags that indicates whether the btree has
deleted-but-not-yet-recyclable pages. I'd rather store it as a bool in
BTMetaPageData. Currently btm_version is defined as uint32, but I think
we won't use all of those bits; if we store it in part of btm_version we
don't break the on-disk format. However, we're currently assuming that a
vacuum on a btree index always scans the whole btree rather than a part
of it, and this approach would entrench that assumption further. It
might become a restriction in the future.
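
To illustrate the first idea (a sketch under assumptions, not a patch:
the storage location for the epoch and every name below are
hypothetical), a 64-bit value built from epoch plus XID can be compared
without any wraparound ambiguity:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * If the 32-bit epoch of the deletion XID were stored next to btpo.xact
     * (e.g. borrowing pd_prune_xid, as Peter suggested), recyclability could
     * be judged on full 64-bit transaction IDs.
     */
    static bool
    deleted_page_is_recyclable(uint32_t deleted_epoch, uint32_t deleted_xid,
                               uint64_t oldest_running_full_xid)
    {
        uint64_t deleted_full_xid =
            ((uint64_t) deleted_epoch << 32) | deleted_xid;

        /* Safe to recycle once no running transaction can still need the page. */
        return deleted_full_xid < oldest_running_full_xid;
    }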

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Wed, Jan 10, 2018 at 11:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sun, Jan 7, 2018 at 8:40 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Sat, Jan 6, 2018 at 2:20 PM, Stephen Frost <sfrost@snowman.net> wrote:
>>>> > IIRC the patches that makes the cleanup scan skip has a problem
>>>> > pointed by Peter[1], that is that we stash an XID when a btree page is
>>>> > deleted, which is used to determine when it's finally safe to recycle
>>>> > the page. Yura's patch doesn't have that problem?
>>>> >
>>>> > [1]
https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com
>>
>>> Masahiko Sawada, if this patch isn't viable or requires serious rework
>>> to be acceptable, then perhaps we should change its status to 'returned
>>> with feedback' and you can post a new patch for the next commitfest..?
>>
>> I believe that the problem that I pointed out with freezing/wraparound
>> is a solvable problem. If we think about it carefully, we will come up
>> with a good solution. I have tried to get the ball rolling with my
>> pd_prune_xid suggestion. I think it's important to not lose sight of
>> the fact that the page deletion/recycling XID thing is just one detail
>> that we need to think about some more.
>>
>> I cannot fault Sawada-san for waiting to hear other people's views
>> before proceeding. It really needs to be properly discussed.
>>
>
> Thank you for commenting.
>
> IIUC we have two approaches: one idea is based on Peter's suggestion.
> We can use pd_prune_xid to store epoch of xid of which the page is
> deleted. That way, we can correctly mark deleted pages as recyclable
> without breaking on-disk format.
>
> Another idea is suggested by  Sokolov Yura. His original patch makes
> btree have a flag in btpo_flags that implies the btree has deleted but
> not recyclable page or not. I'd rather want to store it as bool in
> BTMetaPageData. Currently btm_version is defined as uint32 but I think
> we won't use all of them. If we store it in part of btm_version we
> don't break on-disk format. However, we're now assuming that the
> vacuum on btree index always scans whole btree rather than a part, and
> this approach will nurture it more. It might be possible that it will
> become a restriction in the future.
>

On third thought, similarly to what we do for heaps, we could store the
oldest btpo.xact of a btree index somewhere (the metapage or
pg_class.relfrozenxid?) after it has been vacuumed. We can then skip
cleanup vacuuming as long as that XID is not more than a certain
threshold old (say vacuum_index_cleanup_age). Combined with the new
parameter I proposed before, the condition for doing the index cleanup
vacuum becomes as follows.

if (there is garbage on heap)
    Vacuum index, and update oldest btpo.xact
else /* there is no garbage on heap */
{
    if (# of scanned pages > (nblocks * vacuum_cleanup_index_scale_factor)
        OR oldest btpo.xact is more than vacuum_index_cleanup_age old)
        Cleanup vacuum index, and update oldest btpo.xact
}

In the current index vacuuming, we scan all index pages if there is
even one garbage tuple on the heap (I'd like to improve that someday),
which also updates the oldest btpo.xact at that time. If
vacuum_cleanup_index_scale_factor is high enough, we can skip scanning
the whole index as long as either the table isn't modified for 2
billion transactions or the oldest btpo.xact isn't more than
vacuum_index_cleanup_age transactions old. But if a transaction stays
open for a very long time, we might not be able to mark deleted pages
as recyclable. Some tricks might be required. Feedback is
welcome.
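
Rewriting that condition as a small function may make it easier to
follow (a sketch only; every name is illustrative, and the age of the
oldest btpo.xact is assumed to be computed elsewhere):

    #include <stdbool.h>
    #include <stdint.h>

    /* Should this vacuum run the cleanup index scan even with no garbage? */
    static bool
    need_cleanup_index_scan(uint64_t scanned_pages, uint64_t nblocks,
                            double cleanup_scale_factor,
                            uint32_t oldest_btpo_xact_age,
                            uint32_t vacuum_index_cleanup_age)
    {
        /* Enough of the table changed: refresh index stats and FSM. */
        if ((double) scanned_pages > cleanup_scale_factor * (double) nblocks)
            return true;

        /* Deleted pages are getting old: recycle them before their stashed
         * XIDs get anywhere near wraparound. */
        if (oldest_btpo_xact_age > vacuum_index_cleanup_age)
            return true;

        return false;
    }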

Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Fri, Jan 12, 2018 at 1:05 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Jan 10, 2018 at 11:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Sun, Jan 7, 2018 at 8:40 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> On Sat, Jan 6, 2018 at 2:20 PM, Stephen Frost <sfrost@snowman.net> wrote:
>>>>> > IIRC the patches that makes the cleanup scan skip has a problem
>>>>> > pointed by Peter[1], that is that we stash an XID when a btree page is
>>>>> > deleted, which is used to determine when it's finally safe to recycle
>>>>> > the page. Yura's patch doesn't have that problem?
>>>>> >
>>>>> > [1]
https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com
>>>
>>>> Masahiko Sawada, if this patch isn't viable or requires serious rework
>>>> to be acceptable, then perhaps we should change its status to 'returned
>>>> with feedback' and you can post a new patch for the next commitfest..?
>>>
>>> I believe that the problem that I pointed out with freezing/wraparound
>>> is a solvable problem. If we think about it carefully, we will come up
>>> with a good solution. I have tried to get the ball rolling with my
>>> pd_prune_xid suggestion. I think it's important to not lose sight of
>>> the fact that the page deletion/recycling XID thing is just one detail
>>> that we need to think about some more.
>>>
>>> I cannot fault Sawada-san for waiting to hear other people's views
>>> before proceeding. It really needs to be properly discussed.
>>>
>>
>> Thank you for commenting.
>>
>> IIUC we have two approaches: one idea is based on Peter's suggestion.
>> We can use pd_prune_xid to store epoch of xid of which the page is
>> deleted. That way, we can correctly mark deleted pages as recyclable
>> without breaking on-disk format.
>>
>> Another idea is suggested by  Sokolov Yura. His original patch makes
>> btree have a flag in btpo_flags that implies the btree has deleted but
>> not recyclable page or not. I'd rather want to store it as bool in
>> BTMetaPageData. Currently btm_version is defined as uint32 but I think
>> we won't use all of them. If we store it in part of btm_version we
>> don't break on-disk format. However, we're now assuming that the
>> vacuum on btree index always scans whole btree rather than a part, and
>> this approach will nurture it more. It might be possible that it will
>> become a restriction in the future.
>>
>
> On third thought, it's similar to what we are doing for heaps but we
> can store the oldest btpo.xact of a btree index to somewhere(metapage
> or pg_class.relfrozenxid?) after vacuumed. We can skip cleanup
> vacuuming as long as the xid is not more than a certain threshold old
> (say vacuum_index_cleanup_age). Combining with new parameter I
> proposed before, the condition of doing cleanup index vacuum will
> become as follows.
>
> if (there is garbage on heap)
>    Vacuum index, and update oldest btpo.xact
> else /* there is no garbage on heap */
> {
>     if (# of scanned pages > (nblocks *
> vacuum_cleanup_index_scale_factor) OR oldest btpo.xact is more than
> vacuum_index_cleanup_age old?))
>         Cleanup vacuum index, and update oldest btpo.xact
> }
>
> In current index vacuuming, it scans whole index pages if there is
> even one garbage on heap(I'd like to improve it though someday), which
> it also lead to update the oldest btpo.xact at the time. If
> vacuum_cleanup_index_slace_factor is enough high, we can skip the
> scanning whole index as long as either the table isn't modified for 2
> billion transactions or the oldest btpo.xact isn't more than
> vacuum_index_cleanup_age transactions old. But if there is a opened
> transaction for a very long time, we might not be able to mark deleted
> page as recyclable. Some tricks might be required. Feedback is
> welcome.
>

Since this patch still needs to be discussed before proceeding, I've
marked it as "Needs Review". Feedback is very welcome.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Wed, Nov 29, 2017 at 6:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Nov 29, 2017 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 25 September 2017 at 22:34, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
>>> > Here is a small patch that skips scanning btree index if no pending
>>> > deleted pages exists.
>>> > It detects this situation by comparing pages_deleted with pages_free.
>>
>> It seems to work to prevent needless cleanup scans.
>
> So this leaves us in the situation that
>
> 1. Masahiko's patch has unresolved problems
> 2. Yura's patch works and is useful
>
> Unless there is disagreement on the above, it seems we should apply
> Yura's patch (an edited version, perhaps).
>

IIRC the patches that makes the cleanup scan skip has a problem
pointed by Peter[1], that is that we stash an XID when a btree page is
deleted, which is used to determine when it's finally safe to recycle
the page. Yura's patch doesn't have that problem?

[1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com

Yes, I think Yura's patch doesn't have that problem, because it skips
cleanup only when there are no recyclable pages.  And that means there
is no btpo.xact stored, so no XIDs that could wrap around.

BTW, I've rebased Yura's patch.  I think this patch has the following
issues for now:

1) Adding BTP_NEED_NO_CLEANUP as a per-page flag doesn't look nice.
2) In the append-only case, index statistics can lag indefinitely.
3) The patch definitely needs more comments.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Fri, Jan 12, 2018 at 7:05 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Jan 10, 2018 at 11:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sun, Jan 7, 2018 at 8:40 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Sat, Jan 6, 2018 at 2:20 PM, Stephen Frost <sfrost@snowman.net> wrote:
>>>> > IIRC the patches that makes the cleanup scan skip has a problem
>>>> > pointed by Peter[1], that is that we stash an XID when a btree page is
>>>> > deleted, which is used to determine when it's finally safe to recycle
>>>> > the page. Yura's patch doesn't have that problem?
>>>> >
>>>> > [1] https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com
>>
>>> Masahiko Sawada, if this patch isn't viable or requires serious rework
>>> to be acceptable, then perhaps we should change its status to 'returned
>>> with feedback' and you can post a new patch for the next commitfest..?
>>
>> I believe that the problem that I pointed out with freezing/wraparound
>> is a solvable problem. If we think about it carefully, we will come up
>> with a good solution. I have tried to get the ball rolling with my
>> pd_prune_xid suggestion. I think it's important to not lose sight of
>> the fact that the page deletion/recycling XID thing is just one detail
>> that we need to think about some more.
>>
>> I cannot fault Sawada-san for waiting to hear other people's views
>> before proceeding. It really needs to be properly discussed.
>>
>
> Thank you for commenting.
>
> IIUC we have two approaches: one idea is based on Peter's suggestion.
> We can use pd_prune_xid to store epoch of xid of which the page is
> deleted. That way, we can correctly mark deleted pages as recyclable
> without breaking on-disk format.
>
> Another idea is suggested by  Sokolov Yura. His original patch makes
> btree have a flag in btpo_flags that implies the btree has deleted but
> not recyclable page or not. I'd rather want to store it as bool in
> BTMetaPageData. Currently btm_version is defined as uint32 but I think
> we won't use all of them. If we store it in part of btm_version we
> don't break on-disk format. However, we're now assuming that the
> vacuum on btree index always scans whole btree rather than a part, and
> this approach will nurture it more. It might be possible that it will
> become a restriction in the future.
>

On third thought, it's similar to what we are doing for heaps but we
can store the oldest btpo.xact of a btree index to somewhere(metapage
or pg_class.relfrozenxid?) after vacuumed. We can skip cleanup
vacuuming as long as the xid is not more than a certain threshold old
(say vacuum_index_cleanup_age). Combining with new parameter I
proposed before, the condition of doing cleanup index vacuum will
become as follows.

if (there is garbage on heap)
   Vacuum index, and update oldest btpo.xact
else /* there is no garbage on heap */
{
    if (# of scanned pages > (nblocks *
vacuum_cleanup_index_scale_factor) OR oldest btpo.xact is more than
vacuum_index_cleanup_age old?))
        Cleanup vacuum index, and update oldest btpo.xact
}

In current index vacuuming, it scans whole index pages if there is
even one garbage on heap(I'd like to improve it though someday), which
it also lead to update the oldest btpo.xact at the time. If
vacuum_cleanup_index_slace_factor is enough high, we can skip the
scanning whole index as long as either the table isn't modified for 2
billion transactions or the oldest btpo.xact isn't more than
vacuum_index_cleanup_age transactions old. But if there is a opened
transaction for a very long time, we might not be able to mark deleted
page as recyclable. Some tricks might be required. Feedback is
welcome.

Storing the oldest btpo.xact seems like a good idea to me.  So, by comparing
the oldest btpo.xact to RecentGlobalXmin we can determine whether there are
some recyclable pages which are really going to be recycled during a
possible cleanup?  Assuming that index page deletions are rare, that
seems like a sufficient criterion to run the cleanup.  Also, this criterion
will eliminate the risk of wraparound, because we'll recycle index pages
no later than we do now.

Another case when running cleanup is necessary (or at least desirable)
is stale index statistics.  Assuming that index statistics can become stale
only when there are no deletions from the index (otherwise the index would
be scanned during vacuum), we can use the number of index insertions as
a measure of statistics staleness.

So, what about following logic?

if (there is garbage on heap)
   Vacuum index, and update oldest btpo.xact
else /* there is no garbage on heap */
{
    if (number of inserted index tuples since last btvacuumscan() /
        total number of index tuples >= some threshold
        OR oldest btpo.xact is older than RecentGlobalXmin)
        Cleanup vacuum index, and update oldest btpo.xact
}
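
A minimal standalone sketch of that criterion, with simplified stand-in
types; real code would read these values from the B-tree meta page and use
a wraparound-aware xid comparison such as TransactionIdPrecedes():

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;                     /* stand-in typedef */
#define InvalidTransactionId ((TransactionId) 0)

bool
btree_cleanup_needed(TransactionId oldest_btpo_xact,    /* kept in meta page */
                     TransactionId recent_global_xmin,  /* RecentGlobalXmin */
                     double inserted_since_last_scan,
                     double total_index_tuples,
                     double threshold)
{
    /* some deleted pages may have become recyclable by now */
    if (oldest_btpo_xact != InvalidTransactionId &&
        oldest_btpo_xact < recent_global_xmin)      /* simplified, no wraparound */
        return true;

    /* index statistics look stale: too many inserts since the last scan */
    if (total_index_tuples > 0 &&
        inserted_since_last_scan / total_index_tuples >= threshold)
        return true;

    return false;
}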

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Wed, Feb 28, 2018 at 1:45 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Wed, Nov 29, 2017 at 6:06 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Wed, Nov 29, 2017 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com>
>> wrote:
>> > On 25 September 2017 at 22:34, Kyotaro HORIGUCHI
>> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> >
>> >>> > Here is a small patch that skips scanning btree index if no pending
>> >>> > deleted pages exists.
>> >>> > It detects this situation by comparing pages_deleted with
>> >>> > pages_free.
>> >>
>> >> It seems to work to prevent needless cleanup scans.
>> >
>> > So this leaves us in the situation that
>> >
>> > 1. Masahiko's patch has unresolved problems
>> > 2. Yura's patch works and is useful
>> >
>> > Unless there is disagreement on the above, it seems we should apply
>> > Yura's patch (an edited version, perhaps).
>> >
>>
>> IIRC the patches that makes the cleanup scan skip has a problem
>> pointed by Peter[1], that is that we stash an XID when a btree page is
>> deleted, which is used to determine when it's finally safe to recycle
>> the page. Yura's patch doesn't have that problem?
>>
>> [1]
>> https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com
>
>
> Yes, I think Yura's patch doesn't have that problem, because it skips
> cleanup only when there are no recyclable pages.  And that means there
> is no btpo.xact stored, so no XIDs to be wraparounded.

I've looked at the patch again. And you're right, Yura's patch doesn't
have that problem.

>
> BTW, I've rebased Yura's patch.  I think this patch has following issues
> for now:
>
> 1) Adding BTP_NEED_NO_CLEANUP as per-page flag doesn't look nice.

Yeah, the alternative ideas are to store the flag in pd_prune_xid of the
meta page or to use 1 bit of btm_version so that we don't break on-disk
format compatibility.

> 2) In the append-only case, index statistics can lag indefinitely.

The original proposal added a new GUC that specifies the fraction of
modified pages needed to trigger an index cleanup. Combining that with
Yura's patch, and having the lazy vacuum code tell indexes whether cleanup
is required, we can deal with it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Fri, Mar 2, 2018 at 10:53 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 2) In the append-only case, index statistics can lag indefinitely.

The original proposal proposed a new GUC that specifies a fraction of
the modified pages to trigger a cleanup indexes.

Regarding the original proposal, I didn't get what exactly it is intended to do.
You're checking whether vacuumed_pages >= nblocks * vacuum_cleanup_index_scale.
But vacuumed_pages is a variable that can only be incremented when
no indexes exist on the table.  When indexes are present, this variable is always
zero.  I can assume that it's intended to compare the number of pages where
at least one tuple is deleted against nblocks * vacuum_cleanup_index_scale.
But that is also not an option for us, because we're going to optimize the
case where exactly zero tuples are deleted by vacuum.

The thing I'm going to propose is to add the estimated number of tuples in the
table to IndexVacuumInfo.  Then B-tree can memorize that number of tuples,
as of the last index scan, in the meta page.  If the passed value
differs from the value in the meta page too much, then cleanup is forced.
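
As a rough illustration of this interface idea (simplified stand-in
structs, not the actual PostgreSQL definitions; the eventual field names
may differ):

#include <stdbool.h>

/* Simplified stand-ins; in PostgreSQL, IndexVacuumInfo is the struct passed
 * to the index AM vacuum callbacks, and the B-tree meta page would keep the
 * remembered value. */
typedef struct IndexVacuumInfoSketch
{
    double num_heap_tuples;     /* estimated heap tuples, passed in by vacuum */
} IndexVacuumInfoSketch;

typedef struct BTMetaSketch
{
    double last_cleanup_num_heap_tuples;    /* -1 if never recorded */
} BTMetaSketch;

/* Force a cleanup scan when the heap grew too much since the last scan. */
bool
index_stats_stale(const IndexVacuumInfoSketch *info,
                  const BTMetaSketch *meta,
                  double scale_factor)
{
    if (meta->last_cleanup_num_heap_tuples < 0)
        return true;            /* nothing remembered yet */

    return info->num_heap_tuples >
           meta->last_cleanup_num_heap_tuples * (1.0 + scale_factor);
}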

Any better ideas?


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Sun, Mar 4, 2018 at 8:59 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Fri, Mar 2, 2018 at 10:53 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> > 2) In the append-only case, index statistics can lag indefinitely.
>>
>> The original proposal proposed a new GUC that specifies a fraction of
>> the modified pages to trigger a cleanup indexes.
>
>
> Regarding original proposal, I didn't get what exactly it's intended to be.
> You're checking if vacuumed_pages >= nblocks * vacuum_cleanup_index_scale.
> But vacuumed_pages is the variable which could be incremented when
> no indexes exist on the table.  When indexes are present, this variable is
> always
> zero.  I can assume, that it's intended to compare number of pages where
> at least one tuple is deleted to nblocks * vacuum_cleanup_index_scale.
> But that is also not an option for us, because we're going to optimize the
> case when exactly zero tuples is deleted by vacuum.

In the latest v4 patch, I compare scanned_pages with the threshold,
which means that if the number of pages modified since the last
vacuum is larger than the threshold, we force an index cleanup.

> The thing I'm going to propose is to add estimated number of tuples in
> table to IndexVacuumInfo.  Then B-tree can memorize that number of tuples
> when last time index was scanned in the meta-page.  If pass value
> is differs from the value in meta-page too much, then cleanup is forced.
>
> Any better ideas?

I think that would work. But I'm concerned about metapage format
compatibility. And since I've not fully investigated index cleanup
for other index types, I'm not sure that interface makes sense. It
might not be better, but an alternative idea is to add a condition
(Irel[i]->rd_rel->relam == BTREE_AM_OID) in lazy_scan_heap.
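
For illustration only, a standalone sketch of what such a
lazy_scan_heap-side check could look like (stand-in types; only
BTREE_AM_OID is a real PostgreSQL constant here):

#include <stdbool.h>
#include <stdint.h>

#define BTREE_AM_OID 403        /* pg_am OID of the btree access method */

typedef struct IndexRelSketch
{
    uint32_t relam;             /* access method OID of this index */
} IndexRelSketch;

bool
should_skip_index_cleanup(const IndexRelSketch *index,
                          bool heap_had_dead_tuples,
                          bool btree_says_cleanup_not_needed)
{
    if (heap_had_dead_tuples)
        return false;           /* bulk-delete path handles the index */

    /* Only btree can answer "is cleanup needed?" in this sketch,
     * so other AMs keep their current behaviour. */
    if (index->relam != BTREE_AM_OID)
        return false;

    return btree_says_cleanup_not_needed;
}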

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Mon, Mar 5, 2018 at 5:56 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, Mar 4, 2018 at 8:59 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Fri, Mar 2, 2018 at 10:53 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> > 2) In the append-only case, index statistics can lag indefinitely.
>>
>> The original proposal proposed a new GUC that specifies a fraction of
>> the modified pages to trigger a cleanup indexes.
>
>
> Regarding original proposal, I didn't get what exactly it's intended to be.
> You're checking if vacuumed_pages >= nblocks * vacuum_cleanup_index_scale.
> But vacuumed_pages is the variable which could be incremented when
> no indexes exist on the table.  When indexes are present, this variable is
> always
> zero.  I can assume, that it's intended to compare number of pages where
> at least one tuple is deleted to nblocks * vacuum_cleanup_index_scale.
> But that is also not an option for us, because we're going to optimize the
> case when exactly zero tuples is deleted by vacuum.

In the latest v4 patch, I compare scanned_pages and the threshold,
which means if the number of pages that are modified since the last
vacuum is larger than the threshold we force cleanup index.

Right, sorry, I overlooked that.  However, even if we use the number of pages,
I would still prefer a cumulative measure, so that multiple vacuums are
taken into account even if each of them touched only a small number of
pages.
 
> The thing I'm going to propose is to add estimated number of tuples in
> table to IndexVacuumInfo.  Then B-tree can memorize that number of tuples
> when last time index was scanned in the meta-page.  If pass value
> is differs from the value in meta-page too much, then cleanup is forced.
>
> Any better ideas?

I think that would work. But I'm concerned about metapage format
compatibility.

That's not a show-stopper.  The B-tree meta page has a version number.  So,
it's no problem to provide an online upgrade.
 
And since I've not fully investigated about cleanup
index of other index types I'm not sure that interface makes sense. It
might not be better but an alternative idea is to add a condition
(Irel[i]->rd_rel->relam == BTREE_AM_OID) in lazy_scan_heap.

I meant putting this logic *inside* btvacuumcleanup(), while passing the
required measure through IndexVacuumInfo, which is accessible from
btvacuumcleanup().

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

 

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
Hi!

I'd like to propose a revised patch based on various ideas upthread.

This patch works as follows.

1) The B-tree meta page is extended with 2 additional parameters:
 * btm_oldest_btpo_xact – the oldest btpo.xact among deleted pages,
 * btm_last_cleanup_num_heap_tuples – the number of heap tuples at the last cleanup scan.

2) These parameters are reset during btbulkdelete() and set during btvacuumcleanup().

3) Index scans during the second and subsequent btvacuumcleanup() happen only if
   btm_oldest_btpo_xact is older than RecentGlobalXmin
   OR num_heap_tuples >= btm_last_cleanup_num_heap_tuples * (1 + vacuum_cleanup_index_scale_factor).

In other words, btvacuumcleanup() scans the index only if there are recyclable pages,
or the index statistics are stale (more than a vacuum_cleanup_index_scale_factor
fraction of tuples has been inserted since the last index statistics collection).

4) vacuum_cleanup_index_scale_factor can be set either as a GUC or as a reloption.
The default value is 0.1.  So, by default a cleanup scan is triggered after the
table size increases by 10%.

5) Since new fields are added to the metapage, BTREE_VERSION is bumped.
In order to support pg_upgrade, reading the previous metapage version is supported.
On metapage rewrite, it's upgraded to the new version.

So, since we don't skip the scan when there are recyclable pages, there is no risk
of xid wraparound.  The risk of stale statistics is also small, because the
vacuum_cleanup_index_scale_factor default value is quite low.  Users can increase
vacuum_cleanup_index_scale_factor at their own risk and get a lighter B-tree cleanup
scan load, bought with a bigger gap in index statistics.
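
A simplified standalone sketch of the extended meta-page layout and the
version check (types and values here are illustrative stand-ins, not the
patch's actual definitions):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;                 /* stand-in typedef */

#define SKETCH_BTREE_OLD_VERSION 2              /* meta page without new fields */
#define SKETCH_BTREE_NEW_VERSION 3              /* meta page with new fields */

typedef struct BTMetaPageDataSketch
{
    uint32_t      btm_version;
    /* ... existing fields (root, level, fastroot, ...) elided ... */
    TransactionId btm_oldest_btpo_xact;              /* oldest btpo.xact among
                                                      * deleted pages; reset by
                                                      * btbulkdelete() */
    double        btm_last_cleanup_num_heap_tuples;  /* heap tuples seen at the
                                                      * last cleanup scan; set by
                                                      * btvacuumcleanup() */
} BTMetaPageDataSketch;

/*
 * Pre-upgrade meta pages (SKETCH_BTREE_OLD_VERSION) don't carry the new
 * fields, so a cleanup scan must be assumed necessary until the page is
 * rewritten in the new format.
 */
bool
meta_has_cleanup_info(const BTMetaPageDataSketch *meta)
{
    return meta->btm_version >= SKETCH_BTREE_NEW_VERSION;
}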

Some simple benchmark shows the effect.

Before patch.

# insert into t select i from generate_series(1,100000000) i;
# create index t_i_idx on t(i);
# vacuum t;
VACUUM
Time: 15639,822 ms (00:15,640)
# insert into t select i from generate_series(1,1000) i;
INSERT 0 1000
Time: 6,195 ms
# vacuum t;
VACUUM
Time: 1012,794 ms (00:01,013)
# insert into t select i from generate_series(1,1000) i;
INSERT 0 1000
Time: 5,276 ms
# vacuum t;
VACUUM
Time: 1013,254 ms (00:01,013)

After patch.

# insert into t select i from generate_series(1,100000000) i;
# create index t_i_idx on t(i);
# vacuum t;
VACUUM
Time: 15689,450 ms (00:15,689)
# insert into t select i from generate_series(1,1000) i;
INSERT 0 1000
Time: 5,585 ms
# vacuum t;
VACUUM
Time: 50,777 ms
# insert into t select i from generate_series(1,1000) i;
INSERT 0 1000
Time: 5,641 ms
# vacuum t;
VACUUM
Time: 46,997 ms

Thus, vacuum time for an append-only table drops from about 1000 ms to 50 ms (roughly 20x).

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Hi!
>

Sorry for my late reply.

> I'd like to propose a revised patch based on various ideas upthread.

Thank you for proposing the patch!

>
> This patch works as following.
>
> 1) B-tree meta page is extended with 2 additional parameters:
>  * btm_oldest_btpo_xact – oldest btpo_xact among of deleted pages,
>  * btm_last_cleanup_num_heap_tuples – number of heap tuples during last
> cleanup scan.
>
> 2) These parameters are reset during btbulkdelete() and set during
> btvacuumcleanup().

Can't we set these parameters even during btbulkdelete()? By keeping
them up to date, we will be able to avoid unnecessary cleanup vacuums
even after an index bulk-delete.

>
> 3) Index scans during second and subsequent btvacuumcleanup() happen only if
>    btm_oldest_btpo_xact is older than RecentGlobalXmin
>   OR num_heap_tuples >= btm_last_cleanup_num_heap_tuples(1 +
> vacuum_cleanup_index_scale_factor).
>
> In other words btvacuumcleanup() scans the index only if there are
> recyclable pages,
> or index statistics is stalled (inserted more than
> vacuum_cleanup_index_scale_factor
> since last index statistics collection).
>
> 4) vacuum_cleanup_index_scale_factor can be set either by GUC or reloption.
> Default value is 0.1.  So, by default cleanup scan is triggered after
> increasing of
> table size by 10%.
>
> 5) Since new fields are added to the metapage, BTREE_VERSION is bumped.
> In order to support pg_upgrade, read of previous metapage version is
> supported.
> On metapage rewrite, it's upgraded to the new version.
>
> So, since we don't skip scan of recyclable pages, there is no risk of xid
> wraparound.
> Risk of stalled statistics is also small, because
> vacuum_cleanup_index_scale_factor
> default value is quite low.  User can increase
> vacuum_cleanup_index_scale_factor
> on his own risk and have less load of B-tree cleanup scan bought by more gap
> in
> index statistics.

Agreed.

I've not reviewed the code deeply yet, but the patch seems to wrongly pass
the regression tests, and the bt_metap() function of pageinspect needs to be
updated. Attached is an updated patch that fixes these issues. I will review
the patch again.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> 2) These parameters are reset during btbulkdelete() and set during
> btvacuumcleanup().

Can't we set these parameters even during btbulkdelete()? By keeping
them up to date, we will able to avoid an unnecessary cleanup vacuums
even after index bulk-delete.
 
We certainly can update cleanup-related parameters during btbulkdelete().
However, in this case we would update the B-tree meta page during each
VACUUM cycle.  That may cause some overhead for non-append-only
workloads.  I don't think this overhead would be noticeable, because in
non-append-only scenarios VACUUM typically writes much more information.
But I would like this patch, which is oriented to append-only workloads, to be
as harmless as possible for other workloads.

I've not reviewed the code deeply yet but the regression test of this
patch seems to wrongly pass the regression tests and bt_metap()
function of pageinspect needs to be updated.

Right.  Thank you for fixing this issue.
 
Attached an updated patch
fixed these issue. Will review the patch again.

Thank you!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Fri, Mar 9, 2018 at 9:40 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached an updated patch
fixed these issue. Will review the patch again.

Thank you!

I've fixed a bug: _bt_vacuum_needs_cleanup() didn't release a lock when the metapage is in the old format.
The bug was found by Darafei Praliaskouski and Grigory Smolkin, who tested this patch.
Revised patch is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Sat, Mar 10, 2018 at 5:49 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Fri, Mar 9, 2018 at 9:40 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached an updated patch
fixed these issue. Will review the patch again.

Thank you!

I've fixed a bug: _bt_vacuum_needs_cleanup() didn't release a lock when metapage is in old format.
Bug was found by Darafei Praliaskouski and Grigory Smolkin who tested this patch.
Revised patch is attached.

I also found that the online upgrade of the meta page didn't work: the version field wasn't set.
Fixed in the attached version of patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
>> <a.korotkov@postgrespro.ru> wrote:
>> > 2) These parameters are reset during btbulkdelete() and set during
>> > btvacuumcleanup().
>>
>> Can't we set these parameters even during btbulkdelete()? By keeping
>> them up to date, we will able to avoid an unnecessary cleanup vacuums
>> even after index bulk-delete.
>
>
> We certainly can update cleanup-related parameters during btbulkdelete().
> However, in this case we would update B-tree meta-page during each
> VACUUM cycle.  That may cause some overhead for non append-only
> workloads.  I don't think this overhead would be sensible, because in
> non append-only scenarios VACUUM typically writes much more of information.
> But I would like this oriented to append-only workload patch to be
> as harmless as possible for other workloads.

What overhead are you referring to here? I guess the overhead is only
calculating the oldest btpo.xact, and I think it would be harmless.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
>> <a.korotkov@postgrespro.ru> wrote:
>> > 2) These parameters are reset during btbulkdelete() and set during
>> > btvacuumcleanup().
>>
>> Can't we set these parameters even during btbulkdelete()? By keeping
>> them up to date, we will able to avoid an unnecessary cleanup vacuums
>> even after index bulk-delete.
>
>
> We certainly can update cleanup-related parameters during btbulkdelete().
> However, in this case we would update B-tree meta-page during each
> VACUUM cycle.  That may cause some overhead for non append-only
> workloads.  I don't think this overhead would be sensible, because in
> non append-only scenarios VACUUM typically writes much more of information.
> But I would like this oriented to append-only workload patch to be
> as harmless as possible for other workloads.

What overhead are you referring here? I guess the overhead is only the
calculating the oldest btpo.xact. And I think it would be harmless.

I meant the overhead of setting last_cleanup_num_heap_tuples after every
btbulkdelete, with WAL-logging of the meta page.  I bet it would also be
harmless, but I think that needs some testing.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
>> <a.korotkov@postgrespro.ru> wrote:
>> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>> > wrote:
>> >>
>> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
>> >> <a.korotkov@postgrespro.ru> wrote:
>> >> > 2) These parameters are reset during btbulkdelete() and set during
>> >> > btvacuumcleanup().
>> >>
>> >> Can't we set these parameters even during btbulkdelete()? By keeping
>> >> them up to date, we will able to avoid an unnecessary cleanup vacuums
>> >> even after index bulk-delete.
>> >
>> >
>> > We certainly can update cleanup-related parameters during
>> > btbulkdelete().
>> > However, in this case we would update B-tree meta-page during each
>> > VACUUM cycle.  That may cause some overhead for non append-only
>> > workloads.  I don't think this overhead would be sensible, because in
>> > non append-only scenarios VACUUM typically writes much more of
>> > information.
>> > But I would like this oriented to append-only workload patch to be
>> > as harmless as possible for other workloads.
>>
>> What overhead are you referring here? I guess the overhead is only the
>> calculating the oldest btpo.xact. And I think it would be harmless.
>
>
> I meant overhead of setting last_cleanup_num_heap_tuples after every
> btbulkdelete with wal-logging of meta-page.  I bet it also would be
> harmless, but I think that needs some testing.

Agreed.

After more thought, it might be too late, but we can consider the
possibility of another idea proposed by Peter. The attached patch
addresses the original issue of index cleanups by storing the epoch
number of the page deletion XID into PageHeader->pd_prune_xid, which is
a 4-byte field. Compared to the currently proposed patch, this patch
needs neither the page upgrade code nor extra WAL-logging. If
we also want to address cases other than the append-only case, we will
require the bulk-delete method of scanning the whole index and of logging
WAL, but that leads to some extra overhead. With this patch we no longer
need to depend on the full scan of the b-tree index. This might be useful
for a future where we make the bulk-delete of a b-tree index not scan the
whole index.
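
A minimal standalone sketch of the epoch idea (illustrative names and
layout only; per the proposal, the deletion-time epoch would be stored in
the page header's pd_prune_xid):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;                 /* stand-in typedef */

/* Combine a 32-bit epoch with a 32-bit xid into one monotonic 64-bit value. */
static inline uint64_t
full_xid(uint32_t epoch, TransactionId xid)
{
    return ((uint64_t) epoch << 32) | xid;
}

/*
 * A deleted B-tree page becomes recyclable once the transaction that
 * deleted it (btpo.xact, paired with its stored epoch) is older than the
 * oldest xmin any running transaction can still see.
 */
bool
deleted_page_is_recyclable(uint32_t page_epoch, TransactionId btpo_xact,
                           uint32_t xmin_epoch, TransactionId oldest_xmin)
{
    return full_xid(page_epoch, btpo_xact) < full_xid(xmin_epoch, oldest_xmin);
}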

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kyotaro HORIGUCHI
Date:
At Mon, 19 Mar 2018 11:12:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAB8tQg9xwojupUJjKD=fMhtx6thDEPENDdhftVLWcR8A@mail.gmail.com>
> On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> > wrote:
> >>
> >> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
> >> <a.korotkov@postgrespro.ru> wrote:
> >> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> >> > wrote:
> >> >>
> >> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
> >> >> <a.korotkov@postgrespro.ru> wrote:
> >> >> > 2) These parameters are reset during btbulkdelete() and set during
> >> >> > btvacuumcleanup().
> >> >>
> >> >> Can't we set these parameters even during btbulkdelete()? By keeping
> >> >> them up to date, we will able to avoid an unnecessary cleanup vacuums
> >> >> even after index bulk-delete.
> >> >
> >> >
> >> > We certainly can update cleanup-related parameters during
> >> > btbulkdelete().
> >> > However, in this case we would update B-tree meta-page during each
> >> > VACUUM cycle.  That may cause some overhead for non append-only
> >> > workloads.  I don't think this overhead would be sensible, because in
> >> > non append-only scenarios VACUUM typically writes much more of
> >> > information.
> >> > But I would like this oriented to append-only workload patch to be
> >> > as harmless as possible for other workloads.
> >>
> >> What overhead are you referring here? I guess the overhead is only the
> >> calculating the oldest btpo.xact. And I think it would be harmless.
> >
> >
> > I meant overhead of setting last_cleanup_num_heap_tuples after every
> > btbulkdelete with wal-logging of meta-page.  I bet it also would be
> > harmless, but I think that needs some testing.
> 
> Agreed.
> 
> After more thought, it might be too late but we can consider the
> possibility of another idea proposed by Peter. Attached patch
> addresses the original issue of index cleanups by storing the epoch
> number of page deletion XID into PageHeader->pd_prune_xid which is
> 4byte field.

Mmm. It seems to me that the story is returning to the
beginning. Could I try retelling the story?

I understand that the initial problem was that vacuum runs apparently
unnecessary full scans on indexes many times. The reason for that
is the fact that a cleanup scan may leave some (or, under certain
conditions, many) dead pages not recycled, but we don't know
whether a further cleanup is needed or not. They will stay there
forever unless we run additional cleanup scans at the appropriate
timing.

(If I understand it correctly,) Sawada-san's latest proposal is
(fundamentally the same as the first one,) just skipping the
cleanup scan if the preceding vacuum scan found that the number
of *live* tuples has increased. If there were many deletions and
insertions but no increase in the total number of tuples, we don't
run a cleanup. Consequently it had a wraparound problem, and that
is addressed in this version.

(ditto.) Alexander proposed to record the oldest xid of
recyclable pages in the metapage (and the number of tuples at the
last cleanup). This prevents needless cleanup scans and reliably
runs cleanups to remove all recyclable pages.

I think that we can accept Sawada-san's proposal if we accept the
fact that indexes can retain recyclable pages for a long
time. (Honestly, I don't think so.)

If (as I might have mentioned upthread for Yura's
patch) we accept holding the information on the index meta page,
Alexander's way would be preferable. The difference between Yura's
and Alexander's is that the former runs a cleanup scan if a recyclable
page is present, but the latter avoids that before any recyclable
pages are known to be removed.

>               Comparing to the current proposed patch this patch
> doesn't need neither the page upgrade code nor extra WAL-logging. If

# By the way, my proposal was to store the information, as Yura
# proposed, in the stats collector. The information may be
# available a bit late, but that doesn't hurt. This needs neither
# extra WAL logging nor the upgrade code :p

> we also want to address cases other than append-only case we will

I'm afraid that "the problem for the other cases" is a new one
that this patch introduces, not an existing one.

> require the bulk-delete method of scanning whole index and of logging
> WAL. But it leads some extra overhead. With this patch we no longer
> need to depend on the full scan on b-tree index. This might be useful
> for a future when we make the bulk-delete of b-tree index not scan
> whole index.

Perhaps I'm taking something incorrectly, but is it just the
result of skipping 'maybe needed' scans without considering the
actual necessity?

I also don't like extra WAL logging, but it happens once (or
twice?) per vacuum cycle (for every index). On the other hand I
want to keep the on-the-fly upgrade path out of the ordinary
path. (Reviving pg_upgrade's custom module?)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kyotaro HORIGUCHI
Date:
Sorry, I'd like to make a trivial but critical fix.

At Mon, 19 Mar 2018 14:45:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180319.144505.166111203.horiguchi.kyotaro@lab.ntt.co.jp>
> At Mon, 19 Mar 2018 11:12:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAB8tQg9xwojupUJjKD=fMhtx6thDEPENDdhftVLWcR8A@mail.gmail.com>
> > On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov
> > <a.korotkov@postgrespro.ru> wrote:
> > > On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> > > wrote:
> > >>
> > >> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
> > >> <a.korotkov@postgrespro.ru> wrote:
> > >> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> > >> > wrote:
> > >> >>
> > >> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
> > >> >> <a.korotkov@postgrespro.ru> wrote:
> > >> >> > 2) These parameters are reset during btbulkdelete() and set during
> > >> >> > btvacuumcleanup().
> > >> >>
> > >> >> Can't we set these parameters even during btbulkdelete()? By keeping
> > >> >> them up to date, we will able to avoid an unnecessary cleanup vacuums
> > >> >> even after index bulk-delete.
> > >> >
> > >> >
> > >> > We certainly can update cleanup-related parameters during
> > >> > btbulkdelete().
> > >> > However, in this case we would update B-tree meta-page during each
> > >> > VACUUM cycle.  That may cause some overhead for non append-only
> > >> > workloads.  I don't think this overhead would be sensible, because in
> > >> > non append-only scenarios VACUUM typically writes much more of
> > >> > information.
> > >> > But I would like this oriented to append-only workload patch to be
> > >> > as harmless as possible for other workloads.
> > >>
> > >> What overhead are you referring here? I guess the overhead is only the
> > >> calculating the oldest btpo.xact. And I think it would be harmless.
> > >
> > >
> > > I meant overhead of setting last_cleanup_num_heap_tuples after every
> > > btbulkdelete with wal-logging of meta-page.  I bet it also would be
> > > harmless, but I think that needs some testing.
> > 
> > Agreed.
> > 
> > After more thought, it might be too late but we can consider the
> > possibility of another idea proposed by Peter. Attached patch
> > addresses the original issue of index cleanups by storing the epoch
> > number of page deletion XID into PageHeader->pd_prune_xid which is
> > 4byte field.
> 
> Mmm. It seems to me that the story is returning to the
> beginning. Could I try retelling the story?
> 
> I understant that the initial problem was vacuum runs apparently
> unnecessary full-scan on indexes many times. The reason for that
> is the fact that a cleanup scan may leave some (or many under
> certain condition) dead pages not-recycled but we don't know
> whether a cleanup is needed or not. They will be staying left
> forever unless we run additional cleanup-scans at the appropriate
> timing.
> 
> (If I understand it correctly,) Sawada-san's latest proposal is
> (fundamentally the same to the first one,) just skipping the
> cleanup scan if the vacuum scan just before found that the number
> of *live* tuples are increased. If there where many deletions and
> insertions but no increase of total number of tuples, we don't
> have a cleanup. Consequently it had a wraparound problem and it
> is addressed in this version.
> 
> (ditto.) Alexander proposed to record the oldest xid of
> recyclable pages in metapage (and the number of tuples at the
> last cleanup). This prevents needless cleanup scan and surely
> runs cleanups to remove all recyclable pages.
> 
> I think that we can accept Sawada-san's proposal if we accept the
> fact that indexes can retain recyclable pages for a long
> time. (Honestly I don't think so.)
> 
> If (as I might have mentioned as the same upthread for Yura's
> patch,) we accept to hold the information on index meta page,
> Alexander's way would be preferable. The difference betwen Yura's
> and Alexander's is the former runs cleanup scan if a recyclable
> page is present but the latter avoids that before any recyclable

- pages are known to be removed.
+ pages are known to be actually removable

> >               Comparing to the current proposed patch this patch
> > doesn't need neither the page upgrade code nor extra WAL-logging. If
> 
> # By the way, my proposal was storing the information as Yura
> # proposed into stats collector. The information maybe be
> # available a bit lately, but it doesn't harm. This doesn't need
> # extra WAL logging nor the upgrad code:p
> 
> > we also want to address cases other than append-only case we will
> 
> I'm afraid that "the problem for the other cases" is a new one
> that this patch introduces, not an existing one.
> 
> > require the bulk-delete method of scanning whole index and of logging
> > WAL. But it leads some extra overhead. With this patch we no longer
> > need to depend on the full scan on b-tree index. This might be useful
> > for a future when we make the bulk-delete of b-tree index not scan
> > whole index.
> 
> Perhaps I'm taking something incorrectly, but is it just the
> result of skipping 'maybe needed' scans without condiering the
> actual necessity?
> 
> I also don't like extra WAL logging, but it happens once (or
> twice?) per vaccum cycle (for every index). On the other hand I
> want to put the on-the-fly upgrade path out of the ordinary
> path. (Reviving the pg_upgrade's custom module?)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Mon, Mar 19, 2018 at 2:45 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Mon, 19 Mar 2018 11:12:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAB8tQg9xwojupUJjKD=fMhtx6thDEPENDdhftVLWcR8A@mail.gmail.com>
>> On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov
>> <a.korotkov@postgrespro.ru> wrote:
>> > On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com>
>> > wrote:
>> >>
>> >> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
>> >> <a.korotkov@postgrespro.ru> wrote:
>> >> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
>> >> >> <a.korotkov@postgrespro.ru> wrote:
>> >> >> > 2) These parameters are reset during btbulkdelete() and set during
>> >> >> > btvacuumcleanup().
>> >> >>
>> >> >> Can't we set these parameters even during btbulkdelete()? By keeping
>> >> >> them up to date, we will able to avoid an unnecessary cleanup vacuums
>> >> >> even after index bulk-delete.
>> >> >
>> >> >
>> >> > We certainly can update cleanup-related parameters during
>> >> > btbulkdelete().
>> >> > However, in this case we would update B-tree meta-page during each
>> >> > VACUUM cycle.  That may cause some overhead for non append-only
>> >> > workloads.  I don't think this overhead would be sensible, because in
>> >> > non append-only scenarios VACUUM typically writes much more of
>> >> > information.
>> >> > But I would like this oriented to append-only workload patch to be
>> >> > as harmless as possible for other workloads.
>> >>
>> >> What overhead are you referring here? I guess the overhead is only the
>> >> calculating the oldest btpo.xact. And I think it would be harmless.
>> >
>> >
>> > I meant overhead of setting last_cleanup_num_heap_tuples after every
>> > btbulkdelete with wal-logging of meta-page.  I bet it also would be
>> > harmless, but I think that needs some testing.
>>
>> Agreed.
>>
>> After more thought, it might be too late but we can consider the
>> possibility of another idea proposed by Peter. Attached patch
>> addresses the original issue of index cleanups by storing the epoch
>> number of page deletion XID into PageHeader->pd_prune_xid which is
>> 4byte field.
>
> Mmm. It seems to me that the story is returning to the
> beginning. Could I try retelling the story?
>
> I understant that the initial problem was vacuum runs apparently
> unnecessary full-scan on indexes many times. The reason for that
> is the fact that a cleanup scan may leave some (or many under
> certain condition) dead pages not-recycled but we don't know
> whether a cleanup is needed or not. They will be staying left
> forever unless we run additional cleanup-scans at the appropriate
> timing.
>
> (If I understand it correctly,) Sawada-san's latest proposal is
> (fundamentally the same to the first one,) just skipping the
> cleanup scan if the vacuum scan just before found that the number
> of *live* tuples are increased. If there where many deletions and
> insertions but no increase of total number of tuples, we don't
> have a cleanup. Consequently it had a wraparound problem and it
> is addressed in this version.

No, it doesn't have a wraparound problem. The patch based on Peter's
idea I proposed adds an epoch number to the page deletion xid and compares
them when we judge whether the page is recyclable or not. It's
something like we create a 64-bit xid out of the deletion xid. Also, if there is
even one deletion, the bulk-delete will be performed instead of the
index cleanup. So with this patch we do the index cleanup only when
the reltuples of the table has increased by a fraction of
vacuum_cleanup_index_scale_factor from the previous stats. It doesn't need
to do the index cleanup based on any xid thresholds.

> (ditto.) Alexander proposed to record the oldest xid of
> recyclable pages in metapage (and the number of tuples at the
> last cleanup). This prevents needless cleanup scan and surely
> runs cleanups to remove all recyclable pages.

Yes, but the concerns we discussed are that we need extra WAL-logging
for updating the metapage and that it works only for the append-only case. If
we also want to support more cases we will need to update the metapage
during bulk-delete. The overhead of WAL-logging would be harmless but
should be tested, as Alexander mentioned.

>
> I think that we can accept Sawada-san's proposal if we accept the
> fact that indexes can retain recyclable pages for a long
> time. (Honestly I don't think so.)
>
> If (as I might have mentioned as the same upthread for Yura's
> patch,) we accept to hold the information on index meta page,
> Alexander's way would be preferable. The difference betwen Yura's
> and Alexander's is the former runs cleanup scan if a recyclable
> page is present but the latter avoids that before any recyclable
> pages are knwon to be removed.
>
>>               Comparing to the current proposed patch this patch
>> doesn't need neither the page upgrade code nor extra WAL-logging. If
>
> # By the way, my proposal was storing the information as Yura
> # proposed into stats collector. The information maybe be
> # available a bit lately, but it doesn't harm. This doesn't need
> # extra WAL logging nor the upgrad code:p
>
>> we also want to address cases other than append-only case we will
>
> I'm afraid that "the problem for the other cases" is a new one
> that this patch introduces, not an existing one.

I meant that Alexander's current proposal works for append-only
tables. If we want to support other cases we have to update the metapage
during bulk-delete, which assumes that bulk-delete always scans the whole
index.

>
>> require the bulk-delete method of scanning whole index and of logging
>> WAL. But it leads some extra overhead. With this patch we no longer
>> need to depend on the full scan on b-tree index. This might be useful
>> for a future when we make the bulk-delete of b-tree index not scan
>> whole index.
>
> Perhaps I'm taking something incorrectly, but is it just the
> result of skipping 'maybe needed' scans without condiering the
> actual necessity?

I meant to scan only the index pages that are relevant to the garbage TIDs
of a table. The current b-tree index bulk-deletion is very slow and
heavy because we always scan the whole index even if there is only one
dead tuple in a table. To address this problem I'm thinking of a way to
make bulk-delete not scan the whole index if there are only a few dead
tuples in a table. That is, we do index scans to collect the stack of btree
pages and reclaim garbage. Maybe we would do a full index scan if there are
a lot of dead tuples, which would be similar to what we're doing when
planning access paths.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Claudio Freire
Date:
On Mon, Mar 19, 2018 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> require the bulk-delete method of scanning whole index and of logging
>>> WAL. But it leads some extra overhead. With this patch we no longer
>>> need to depend on the full scan on b-tree index. This might be useful
>>> for a future when we make the bulk-delete of b-tree index not scan
>>> whole index.
>>
>> Perhaps I'm taking something incorrectly, but is it just the
>> result of skipping 'maybe needed' scans without condiering the
>> actual necessity?
>
> I meant to scan only index pages that are relevant with garbages TIDs
> on a table. The current b-tree index bulk-deletion is very slow and
> heavy because we always scans the whole index even if there is only 1
> dead tuples in a table. To address this problem I'm thinking a way to
> make bulk-delete not scan whole index if there is a few dead tuples in
> a table. That is, we do index scans to collect the stack of btree
> pages and reclaim garbage. Maybe we will full index scan if there are
> a lot of dead tuples, which would be same as what we're doing on
> planning access paths.

In theory it's not very difficult to do. I was pondering doing some
PoC after the other vacuum patches get through.

TL;DR version is, as long as there's enough MWM (maintenance_work_mem) to
fit the keys, they can be stashed before vacuuming the heap, and used to
perform targeted index scans for index cleanup instead. If MWM runs out, it
just goes back to bulk-delete scans (it would imply there's a lot to clean
up anyway, so index scans wouldn't be worth it). A finer decision can be
made with random_page_cost as well, on which technique is likely faster.
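
A very rough sketch of the shape of that decision, with invented names (it
only illustrates the idea, not any existing PostgreSQL code):

#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    CLEANUP_BY_INDEX_SCANS,     /* look up each stashed key individually */
    CLEANUP_BY_BULKDELETE       /* fall back to a full index scan */
} IndexCleanupStrategySketch;

IndexCleanupStrategySketch
choose_cleanup_strategy(size_t stashed_key_bytes,
                        size_t maintenance_work_mem_bytes,
                        double index_scan_cost_estimate,
                        double bulkdelete_cost_estimate)
{
    /* If the keys no longer fit in memory, there is a lot to clean up
     * anyway, so a full bulk-delete scan is the sane fallback. */
    if (stashed_key_bytes > maintenance_work_mem_bytes)
        return CLEANUP_BY_BULKDELETE;

    /* Otherwise pick whichever looks cheaper (e.g. factoring in
     * random_page_cost for the per-key index descents). */
    return (index_scan_cost_estimate < bulkdelete_cost_estimate)
        ? CLEANUP_BY_INDEX_SCANS
        : CLEANUP_BY_BULKDELETE;
}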

So yes, let's not paint ourselves into a corner where full index scans
are required for correct operation; that would make the above
impossible.


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kyotaro HORIGUCHI
Date:
At Mon, 19 Mar 2018 20:50:48 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCfWXcX-po8Q1r779nyVGzs01pwpSLM=u7Sx3Hv+L+4gg@mail.gmail.com>
> On Mon, Mar 19, 2018 at 2:45 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Mon, 19 Mar 2018 11:12:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAB8tQg9xwojupUJjKD=fMhtx6thDEPENDdhftVLWcR8A@mail.gmail.com>
> >> On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov
> >> <a.korotkov@postgrespro.ru> wrote:
> >> > On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> >> > wrote:
> >> >>
> >> >> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
> >> >> <a.korotkov@postgrespro.ru> wrote:
> >> >> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
> >> >> >> <a.korotkov@postgrespro.ru> wrote:
> >> >> >> > 2) These parameters are reset during btbulkdelete() and set during
> >> >> >> > btvacuumcleanup().
> >> >> >>
> >> >> >> Can't we set these parameters even during btbulkdelete()? By keeping
> >> >> >> them up to date, we will able to avoid an unnecessary cleanup vacuums
> >> >> >> even after index bulk-delete.
> >> >> >
> >> >> >
> >> >> > We certainly can update cleanup-related parameters during
> >> >> > btbulkdelete().
> >> >> > However, in this case we would update B-tree meta-page during each
> >> >> > VACUUM cycle.  That may cause some overhead for non append-only
> >> >> > workloads.  I don't think this overhead would be sensible, because in
> >> >> > non append-only scenarios VACUUM typically writes much more of
> >> >> > information.
> >> >> > But I would like this oriented to append-only workload patch to be
> >> >> > as harmless as possible for other workloads.
> >> >>
> >> >> What overhead are you referring here? I guess the overhead is only the
> >> >> calculating the oldest btpo.xact. And I think it would be harmless.
> >> >
> >> >
> >> > I meant overhead of setting last_cleanup_num_heap_tuples after every
> >> > btbulkdelete with wal-logging of meta-page.  I bet it also would be
> >> > harmless, but I think that needs some testing.
> >>
> >> Agreed.
> >>
> >> After more thought, it might be too late but we can consider the
> >> possibility of another idea proposed by Peter. Attached patch
> >> addresses the original issue of index cleanups by storing the epoch
> >> number of page deletion XID into PageHeader->pd_prune_xid which is
> >> 4byte field.
> >
> > Mmm. It seems to me that the story is returning to the
> > beginning. Could I try retelling the story?
> >
> > I understant that the initial problem was vacuum runs apparently
> > unnecessary full-scan on indexes many times. The reason for that
> > is the fact that a cleanup scan may leave some (or many under
> > certain condition) dead pages not-recycled but we don't know
> > whether a cleanup is needed or not. They will be staying left
> > forever unless we run additional cleanup-scans at the appropriate
> > timing.
> >
> > (If I understand it correctly,) Sawada-san's latest proposal is
> > (fundamentally the same to the first one,) just skipping the
> > cleanup scan if the vacuum scan just before found that the number
> > of *live* tuples are increased. If there where many deletions and
> > insertions but no increase of total number of tuples, we don't
> > have a cleanup. Consequently it had a wraparound problem and it
> > is addressed in this version.
> 
> No, it doesn't have a wraparound problem. The patch based on Peter's
> idea I proposed adds an epoch number of page deletion xid and compare
> them when we judge whether the page is recyclable or not. It's
> something like we create 64-bit xid of deletion xid. Also, if there is
> even one deletion the bulk-delete will be performed instead of the
> index cleanup. So with this patch we do the index cleanup only when
> the reltuple of table is increased by fraction of
> vacuum_index_cleanup_scale_factor from previous stats. It doesn't need
> to do the index cleanup by any xid thresholds.


Perhaps you took me wrong. I know the last patch doesn't have (or
at least intends to get rid of) the problem, and I wrote that
the problem was introduced by your *first* patch.

> > (ditto.) Alexander proposed to record the oldest xid of
> > recyclable pages in metapage (and the number of tuples at the
> > last cleanup). This prevents needless cleanup scan and surely
> > runs cleanups to remove all recyclable pages.
> 
> Yes, but the concerns we discussed are that we need extra WAL-logging
> for updating the metapage and it works only for append-only case. If
> we also want to support more cases we will need to update the metapage
> during bulk-delete. The overhead of WAL-logging would be harmless but
> should be tested as Alexander mentioned.

Agreed.

> > I think that we can accept Sawada-san's proposal if we accept the
> > fact that indexes can retain recyclable pages for a long
> > time. (Honestly I don't think so.)
> >
> > If (as I might have mentioned as the same upthread for Yura's
> > patch,) we accept to hold the information on index meta page,
> > Alexander's way would be preferable. The difference betwen Yura's
> > and Alexander's is the former runs cleanup scan if a recyclable
> > page is present but the latter avoids that before any recyclable
> > pages are knwon to be removed.
> >
> >>               Comparing to the current proposed patch this patch
> >> doesn't need neither the page upgrade code nor extra WAL-logging. If
> >
> > # By the way, my proposal was storing the information as Yura
> > # proposed into stats collector. The information maybe be
> > # available a bit lately, but it doesn't harm. This doesn't need
> > # extra WAL logging nor the upgrad code:p
> >
> >> we also want to address cases other than append-only case we will
> >
> > I'm afraid that "the problem for the other cases" is a new one
> > that this patch introduces, not an existing one.
> 
> I meant that Alexander's current proposal works for append-only
> tables. If we want to support other cases we have to update the metapage
> during bulk-delete, which assumes that bulk-delete always scans the whole
> index.

True. None of the patches so far gets rid of the whole-index cleanup scan.

> >> require the bulk-delete method of scanning the whole index and of logging
> >> WAL. But that adds some extra overhead. With this patch we no longer
> >> need to depend on the full scan of the b-tree index. This might be useful
> >> for a future where we make the bulk-delete of a b-tree index not scan
> >> the whole index.
> >
> > Perhaps I'm taking something incorrectly, but is it just the
> > result of skipping 'maybe needed' scans without considering the
> > actual necessity?
> 
> I meant to scan only the index pages that are relevant to the garbage TIDs
> of a table. The current b-tree index bulk-deletion is very slow and
> heavy because we always scan the whole index even if there is only one
> dead tuple in the table. To address this problem I'm thinking of a way to
> make bulk-delete not scan the whole index if there are only a few dead tuples in
> the table. That is, we do index scans to collect the stack of btree
> pages and reclaim garbage. Maybe we would do a full index scan if there are
> a lot of dead tuples, which would be the same as what we're doing when
> planning access paths.

Yeah, that seems good!  A possible problem with that is that the
pages we want to recycle in a cleanup scan may *not* be only those
that were found to be recyclable in the vacuum scan just
before. When we leave some recyclable pages behind in a cleanup scan, we
should do a whole scan at the next chance if we don't have the TID
list (or some smaller form of it, or just the number of recyclable
pages?) at that time.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kyotaro HORIGUCHI
Date:
At Tue, 20 Mar 2018 13:57:19 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180320.135719.90053076.horiguchi.kyotaro@lab.ntt.co.jp>
> At Mon, 19 Mar 2018 20:50:48 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCfWXcX-po8Q1r779nyVGzs01pwpSLM=u7Sx3Hv+L+4gg@mail.gmail.com>
> > On Mon, Mar 19, 2018 at 2:45 PM, Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > At Mon, 19 Mar 2018 11:12:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAB8tQg9xwojupUJjKD=fMhtx6thDEPENDdhftVLWcR8A@mail.gmail.com>
> > >> On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov
> > >> <a.korotkov@postgrespro.ru> wrote:
> > >> > On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> > >> > wrote:
> > >> >>
> > >> >> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
> > >> >> <a.korotkov@postgrespro.ru> wrote:
> > >> >> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> > >> >> > wrote:
> > >> >> >>
> > >> >> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
> > >> >> >> <a.korotkov@postgrespro.ru> wrote:
> > >> >> >> > 2) These parameters are reset during btbulkdelete() and set during
> > >> >> >> > btvacuumcleanup().
> > >> >> >>
> > >> >> >> Can't we set these parameters even during btbulkdelete()? By keeping
> > >> >> >> them up to date, we will able to avoid an unnecessary cleanup vacuums
> > >> >> >> even after index bulk-delete.
> > >> >> >
> > >> >> >
> > >> >> > We certainly can update cleanup-related parameters during
> > >> >> > btbulkdelete().
> > >> >> > However, in this case we would update B-tree meta-page during each
> > >> >> > VACUUM cycle.  That may cause some overhead for non append-only
> > >> >> > workloads.  I don't think this overhead would be sensible, because in
> > >> >> > non append-only scenarios VACUUM typically writes much more of
> > >> >> > information.
> > >> >> > But I would like this oriented to append-only workload patch to be
> > >> >> > as harmless as possible for other workloads.
> > >> >>
> > >> >> What overhead are you referring here? I guess the overhead is only the
> > >> >> calculating the oldest btpo.xact. And I think it would be harmless.
> > >> >
> > >> >
> > >> > I meant overhead of setting last_cleanup_num_heap_tuples after every
> > >> > btbulkdelete with wal-logging of meta-page.  I bet it also would be
> > >> > harmless, but I think that needs some testing.
> > >>
> > >> Agreed.
> > >>
> > >> After more thought, it might be too late but we can consider the
> > >> possibility of another idea proposed by Peter. Attached patch
> > >> addresses the original issue of index cleanups by storing the epoch
> > >> number of page deletion XID into PageHeader->pd_prune_xid which is
> > >> 4byte field.
> > >
> > > Mmm. It seems to me that the story is returning to the
> > > beginning. Could I try retelling the story?
> > >
> > > I understand that the initial problem was that vacuum runs apparently
> > > unnecessary full scans on indexes many times. The reason for that
> > > is the fact that a cleanup scan may leave some (or, under certain
> > > conditions, many) dead pages not recycled, but we don't know
> > > whether a cleanup is needed or not. They will be left there
> > > forever unless we run additional cleanup scans at the appropriate
> > > timing.
> > >
> > > (If I understand it correctly,) Sawada-san's latest proposal is
> > > (fundamentally the same as the first one,) just skipping the
> > > cleanup scan if the vacuum scan just before found that the number
> > > of *live* tuples has increased. If there were many deletions and
> > > insertions but no increase in the total number of tuples, we don't
> > > have a cleanup. Consequently it had a wraparound problem, and that
> > > is addressed in this version.
> > 
> > No, it doesn't have a wraparound problem. The patch based on Peter's
> > idea I proposed adds an epoch number to the page deletion xid and compares
> > them when we judge whether the page is recyclable or not. It's
> > something like creating a 64-bit xid out of the deletion xid. Also, if there is
> > even one deletion, bulk-delete will be performed instead of the
> > index cleanup. So with this patch we do the index cleanup only when
> > the reltuples of the table has increased by the
> > vacuum_index_cleanup_scale_factor fraction from the previous stats. It doesn't need
> > to trigger the index cleanup by any xid threshold.
> 
> 
> Perhaps you took me wrong. I know the latest patch doesn't have (or
> at least intends to get rid of) the problem, and I wrote that
> the problem was introduced by your *first* patch.
> 
> > > (ditto.) Alexander proposed to record the oldest xid of
> > > recyclable pages in metapage (and the number of tuples at the
> > > last cleanup). This prevents needless cleanup scan and surely
> > > runs cleanups to remove all recyclable pages.
> > 
> > Yes, but the concerns we discussed are that we need extra WAL-logging
> > for updating the metapage and it works only for append-only case. If
> > we also want to support more cases we will need to update the metapage
> > during bulk-delete. The overhead of WAL-logging would be harmless but
> > should be tested as Alexander mentioned.
> 
> Agreed.
> 
> > > I think that we can accept Sawada-san's proposal if we accept the
> > > fact that indexes can retain recyclable pages for a long
> > > time. (Honestly, I don't think so.)
> > >
> > > If (as I might have mentioned upthread for Yura's
> > > patch) we accept holding the information on the index meta page,
> > > Alexander's way would be preferable. The difference between Yura's
> > > and Alexander's is that the former runs a cleanup scan if a recyclable
> > > page is present, while the latter avoids that before any recyclable
> > > pages are known to be removable.
> > >
> > >>               Compared to the currently proposed patch, this patch
> > >> needs neither the page upgrade code nor extra WAL-logging. If
> > >
> > > # By the way, my proposal was storing the information, as Yura
> > > # proposed, into the stats collector. The information may be
> > > # available a bit late, but that does no harm. This needs neither
> > > # extra WAL logging nor the upgrade code :p
> > >
> > >> we also want to address cases other than append-only case we will
> > >
> > > I'm afraid that "the problem for the other cases" is a new one
> > > that this patch introduces, not an existing one.
> > 
> > I meant that Alexander's current proposal works for append-only
> > tables. If we want to support other cases we have to update the metapage
> > during bulk-delete, which assumes that bulk-delete always scans the whole
> > index.
> 
> True. None of the patches so far gets rid of the whole-index cleanup scan.
> 
> > >> require the bulk-delete method of scanning the whole index and of logging
> > >> WAL. But that adds some extra overhead. With this patch we no longer
> > >> need to depend on the full scan of the b-tree index. This might be useful
> > >> for a future where we make the bulk-delete of a b-tree index not scan
> > >> the whole index.
> > >
> > > Perhaps I'm taking something incorrectly, but is it just the
> > > result of skipping 'maybe needed' scans without considering the
> > > actual necessity?
> > 
> > I meant to scan only the index pages that are relevant to the garbage TIDs
> > of a table. The current b-tree index bulk-deletion is very slow and
> > heavy because we always scan the whole index even if there is only one
> > dead tuple in the table. To address this problem I'm thinking of a way to
> > make bulk-delete not scan the whole index if there are only a few dead tuples in
> > the table. That is, we do index scans to collect the stack of btree
> > pages and reclaim garbage. Maybe we would do a full index scan if there are
> > a lot of dead tuples, which would be the same as what we're doing when
> > planning access paths.
> 
> Yeah, that seems good!  A possible problem with that is that the
> pages we want to recycle in a cleanup scan may *not* be only those
> that were found to be recyclable in the vacuum scan just
> before. When we leave some recyclable pages behind in a cleanup scan, we
> should do a whole scan at the next chance if we don't have the TID
> list (or some smaller form of it, or just the number of recyclable
> pages?) at that time.

"the number of recyclable pages?" is wrong.  It doesn't work to
avoid whole-scan. It just indicates whether the "this" round of
cleanup needs whole-scan or not.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Mon, Mar 19, 2018 at 5:12 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
>> <a.korotkov@postgrespro.ru> wrote:
>> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>> > wrote:
>> >>
>> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
>> >> <a.korotkov@postgrespro.ru> wrote:
>> >> > 2) These parameters are reset during btbulkdelete() and set during
>> >> > btvacuumcleanup().
>> >>
>> >> Can't we set these parameters even during btbulkdelete()? By keeping
>> >> them up to date, we will able to avoid an unnecessary cleanup vacuums
>> >> even after index bulk-delete.
>> >
>> >
>> > We certainly can update cleanup-related parameters during
>> > btbulkdelete().
>> > However, in this case we would update B-tree meta-page during each
>> > VACUUM cycle.  That may cause some overhead for non append-only
>> > workloads.  I don't think this overhead would be sensible, because in
>> > non append-only scenarios VACUUM typically writes much more of
>> > information.
>> > But I would like this oriented to append-only workload patch to be
>> > as harmless as possible for other workloads.
>>
>> What overhead are you referring here? I guess the overhead is only the
>> calculating the oldest btpo.xact. And I think it would be harmless.
>
>
> I meant overhead of setting last_cleanup_num_heap_tuples after every
> btbulkdelete with wal-logging of meta-page.  I bet it also would be
> harmless, but I think that needs some testing.

Agreed.

After more thought, it might be too late but we can consider the
possibility of another idea proposed by Peter. Attached patch
addresses the original issue of index cleanups by storing the epoch
number of page deletion XID into PageHeader->pd_prune_xid which is
4byte field. Comparing to the current proposed patch this patch
doesn't need neither the page upgrade code nor extra WAL-logging. If
we also want to address cases other than append-only case we will
require the bulk-delete method of scanning whole index and of logging
WAL. But it leads some extra overhead. With this patch we no longer
need to depend on the full scan on b-tree index. This might be useful
for a future when we make the bulk-delete of b-tree index not scan
whole index.

Storing epoch in deleted pages is good by itself, because it makes us
free to select the moment of time to recycle those pages without risk
of wraparound.
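
For illustration, a rough sketch of how the stored epoch makes the
recyclability check wraparound-safe (standalone C with hypothetical names,
not code from either patch): the 32-bit epoch and the 32-bit deletion xid
are combined into a 64-bit value before comparing against the oldest xmin.

#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

static uint64_t
make_full_xid(uint32_t epoch, TransactionId xid)
{
    return ((uint64_t) epoch << 32) | xid;
}

/*
 * A deleted page becomes recyclable once its deletion xid precedes the
 * oldest xmin; comparing in 64-bit space makes xid wraparound harmless.
 */
static bool
deleted_page_recyclable(uint32_t page_epoch, TransactionId page_xact,
                        uint32_t cur_epoch, TransactionId oldest_xmin)
{
    return make_full_xid(page_epoch, page_xact) <
           make_full_xid(cur_epoch, oldest_xmin);
}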

However, I see that you are comparing relative change of num_heap_tuples
before and after vacuum to vacuum_cleanup_index_scale_factor.
The problem is that if vacuum is very aggressive (and that is likely for
append only case if this patch is committed), then num_heap_tuples
changes very slightly every vacuum cycle.  So, cleanup would never
happen and statistics could stall indefinitely.
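
To make the concern concrete, here is a rough sketch of the two possible
trigger conditions (hypothetical names, not code from either patch):
measured against the previous vacuum cycle, tiny per-cycle growth never
reaches the threshold, while measured against the count remembered at the
last cleanup, growth accumulates until the threshold is eventually crossed.

/* true when current_tuples has grown past baseline by more than scale_factor */
static bool
growth_exceeds_threshold(double baseline_tuples, double current_tuples,
                         double scale_factor)
{
    return current_tuples > baseline_tuples * (1.0 + scale_factor);
}

/*
 * Per-cycle check: baseline is the tuple count just before this vacuum, so
 * an aggressive vacuum schedule keeps the growth fraction near zero:
 *     growth_exceeds_threshold(tuples_before_this_vacuum, tuples_now, 0.1)
 *
 * Since-last-cleanup check: baseline is the count remembered at the last
 * cleanup, so repeated small increments eventually trip the threshold:
 *     growth_exceeds_threshold(last_cleanup_num_heap_tuples, tuples_now, 0.1)
 */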

Another issue is that append only workloads are frequently combined
with rare periodical bulk deletes of data.  Assuming that in this patch
you don't consider deleted pages to trigger index cleanup, on such
workload large amounts of deleted pages may reside non-recycled until
next bulk delete cycle.

So, despite I generally like idea of storing epoch of deleted xid in the
page, I think my version of patch is closer to committable shape.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Mon, Mar 19, 2018 at 8:45 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Mon, 19 Mar 2018 11:12:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAB8tQg9xwojupUJjKD=fMhtx6thDEPENDdhftVLWcR8A@mail.gmail.com>
> require the bulk-delete method of scanning whole index and of logging
> WAL. But it leads some extra overhead. With this patch we no longer
> need to depend on the full scan on b-tree index. This might be useful
> for a future when we make the bulk-delete of b-tree index not scan
> whole index.

Perhaps I'm taking something incorrectly, but is it just the
result of skipping 'maybe needed' scans without considering the
actual necessity?

I also don't like extra WAL logging, but it happens once (or
twice?) per vacuum cycle (for every index).

In my version of the patch, WAL-logging doesn't happen every vacuum cycle.
WAL-logging of the meta-page happens only in two cases:
1) After the first vacuum which didn't delete any heap tuples, i.e. after we
switch from an update workload to an append-only workload.
2) During an append-only workload when cleanup is triggered (either
deleted pages become recyclable or statistics are considered stale).

Typically, in both update and append-only workloads, WAL-logging
doesn't happen during the vacuum cycle.

Masahiko Sawada has proposed to update the meta-information during
btbulkdelete [1].  That would lead to WAL-logging in almost every
vacuum cycle.  I was slightly opposed to that, saying that this overhead
needs to be tested [2].  However, this concern is not related to the
current shape of my version of the patch.
 
On the other hand I
want to put the on-the-fly upgrade path out of the ordinary
path. (Reviving the pg_upgrade's custom module?)

I don't know.  Previously, we successfully did an on-the-fly upgrade of
GIN posting lists.  Our pg_upgrade machinery is probably already
very complicated and overloaded.  Should we burden it with
upgrading every meta-page of every B-tree index, given that
we can handle it very well inside B-tree itself on the fly?


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

 

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Thu, Mar 22, 2018 at 9:24 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Mon, Mar 19, 2018 at 5:12 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov
>> <a.korotkov@postgrespro.ru> wrote:
>> > On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com>
>> > wrote:
>> >>
>> >> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
>> >> <a.korotkov@postgrespro.ru> wrote:
>> >> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada
>> >> > <sawada.mshk@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
>> >> >> <a.korotkov@postgrespro.ru> wrote:
>> >> >> > 2) These parameters are reset during btbulkdelete() and set during
>> >> >> > btvacuumcleanup().
>> >> >>
>> >> >> Can't we set these parameters even during btbulkdelete()? By keeping
>> >> >> them up to date, we will able to avoid an unnecessary cleanup
>> >> >> vacuums
>> >> >> even after index bulk-delete.
>> >> >
>> >> >
>> >> > We certainly can update cleanup-related parameters during
>> >> > btbulkdelete().
>> >> > However, in this case we would update B-tree meta-page during each
>> >> > VACUUM cycle.  That may cause some overhead for non append-only
>> >> > workloads.  I don't think this overhead would be sensible, because in
>> >> > non append-only scenarios VACUUM typically writes much more of
>> >> > information.
>> >> > But I would like this oriented to append-only workload patch to be
>> >> > as harmless as possible for other workloads.
>> >>
>> >> What overhead are you referring here? I guess the overhead is only the
>> >> calculating the oldest btpo.xact. And I think it would be harmless.
>> >
>> >
>> > I meant overhead of setting last_cleanup_num_heap_tuples after every
>> > btbulkdelete with wal-logging of meta-page.  I bet it also would be
>> > harmless, but I think that needs some testing.
>>
>> Agreed.
>>
>> After more thought, it might be too late but we can consider the
>> possibility of another idea proposed by Peter. Attached patch
>> addresses the original issue of index cleanups by storing the epoch
>> number of page deletion XID into PageHeader->pd_prune_xid which is
>> 4byte field. Comparing to the current proposed patch this patch
>> doesn't need neither the page upgrade code nor extra WAL-logging. If
>> we also want to address cases other than append-only case we will
>> require the bulk-delete method of scanning whole index and of logging
>> WAL. But it leads some extra overhead. With this patch we no longer
>> need to depend on the full scan on b-tree index. This might be useful
>> for a future when we make the bulk-delete of b-tree index not scan
>> whole index.
>
>
> Storing epoch in deleted pages is good by itself, because it makes us
> free to select the moment of time to recycle those pages without risk
> of wraparound.
>
> However, I see that you are comparing relative change of num_heap_tuples
> before and after vacuum to vacuum_cleanup_index_scale_factor.
> The problem is that if vacuum is very aggressive (and that is likely for
> append only case if this patch is committed), then num_heap_tuples
> changes very slightly every vacuum cycle.  So, cleanup would never
> happen and statistics could stall indefinitely.

Good catch. I think we need to store the number of tuples at the time the
whole-index scan is performed (bulkdelete or cleanup), as your
patch does. So it would also need the page-upgrading code. Since
that value would be helpful for other types of indexes too, it might be
better to store it in a system catalog.

>
> Another issue is that append only workloads are frequently combined
> with rare periodical bulk deletes of data.  Assuming that in this patch
> you don't consider deleted pages to trigger index cleanup, on such
> workload large amounts of deleted pages may reside non-recycled until
> next bulk delete cycle.

Perhaps using that new value we can trigger the cleanup if the current
number of tuples has increased or decreased by the
vacuum_index_scale_factor fraction from n_tup_last_cleanup.

>
> So, despite I generally like idea of storing epoch of deleted xid in the
> page, I think my version of patch is closer to committable shape.
>

Agreed, but as I mentioned before, I'm concerned that the approach of your
version of the patch will become a restriction on future improvement. If
the community decides this feature works only for mostly append-only
workloads, I think your version of the patch is the best approach for now.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Tue, Mar 27, 2018 at 11:55 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Mar 22, 2018 at 9:24 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> However, I see that you are comparing relative change of num_heap_tuples
> before and after vacuum to vacuum_cleanup_index_scale_factor.
> The problem is that if vacuum is very aggressive (and that is likely for
> append only case if this patch is committed), then num_heap_tuples
> changes very slightly every vacuum cycle.  So, cleanup would never
> happen and statistics could stall indefinitely.

Good catch. I think we need to store the number of tuples at when
scanning whole index is performed (bulkdelete or cleanup) as your
patch does so. So it also would need the page-upgrading code. Since
that value would be helpful for other type of indexes too it might be
better to  store it into system catalog.

>
> Another issue is that append only workloads are frequently combined
> with rare periodical bulk deletes of data.  Assuming that in this patch
> you don't consider deleted pages to trigger index cleanup, on such
> workload large amounts of deleted pages may reside non-recycled until
> next bulk delete cycle.

Perhaps using that new value we can trigger the cleanup if the current
number of tuples has been increased or decreased the
vacuum_index_scale_factor% from n_tup_last_cleanup.
 
Yes, that might work.  However, a decreased number of tuples could be an
inaccurate measure of the number of deleted pages.  Depending on the distribution
of tuples per page, the same number of deleted tuples can lead to very different
numbers of deleted pages (in corner cases it can range from zero to very
large amounts).  If we want to skip scanning of deleted pages, then it
would be better to store their exact count as known by the previous bulk delete or
cleanup.
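
A sketch of the kind of bookkeeping this implies (purely illustrative,
PostgreSQL-style declarations; the deleted-page counter is a hypothetical
field, not part of either patch):

typedef struct BTCleanupStateSketch
{
    TransactionId oldest_btpo_xact;             /* oldest deletion xid among
                                                 * deleted-but-unrecycled pages */
    double        last_cleanup_num_heap_tuples; /* heap tuple count remembered
                                                 * at the last cleanup */
    uint32        last_cleanup_num_del_pages;   /* hypothetical: exact count of
                                                 * deleted pages left behind by
                                                 * the last whole-index scan */
} BTCleanupStateSketch;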

>
> So, despite I generally like idea of storing epoch of deleted xid in the
> page, I think my version of patch is closer to committable shape.
>

Agreed, but as I mentioned before, I'm concerned that your version
patch approach will become a restriction of future improvement. If
community decides this feature works only for mostly append-only
workloads I think your version of patch is the best approach for now.

First I would like to note that all valid optimizations presented in the
thread optimize the append case.  Thus they do give serious benefit
on mostly append-only workloads, since the patches were about
skipping/reducing the cleanup stage, which does serious work only in
the append-only case.

It could be possible in the future to also optimize cases where only a small
fraction of the table tuples were deleted.  Those tuples could be deleted
not by a full index scan, but by individual index lookups.  But I think
such an optimization is rather harder than everything mentioned in
this thread, and it should be considered separately.

The thing which could be improved in the future is making btree able
to skip cleanup when some deleted pages are pending to be recycled.
But I'm not sure that this problem deserves a lot of time investment
right now, because I think that page deletion in our btree is unfortunately
quite a rare situation.  In real life our btree is more likely to bloat with a bad
fillfactor of pages than to get many pages deleted.

So, I would like to clarify why my patch could block future improvements
in this area.  For instance, if we decide to make btree able to skip
cleanup when some deleted pages are pending recycling, we can add
that in the future.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
Sorry for the late response.

On Tue, Mar 27, 2018 at 8:01 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Tue, Mar 27, 2018 at 11:55 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Thu, Mar 22, 2018 at 9:24 PM, Alexander Korotkov
>> <a.korotkov@postgrespro.ru> wrote:
>> > However, I see that you are comparing relative change of num_heap_tuples
>> > before and after vacuum to vacuum_cleanup_index_scale_factor.
>> > The problem is that if vacuum is very aggressive (and that is likely for
>> > append only case if this patch is committed), then num_heap_tuples
>> > changes very slightly every vacuum cycle.  So, cleanup would never
>> > happen and statistics could stall indefinitely.
>>
>> Good catch. I think we need to store the number of tuples at when
>> scanning whole index is performed (bulkdelete or cleanup) as your
>> patch does so. So it also would need the page-upgrading code. Since
>> that value would be helpful for other type of indexes too it might be
>> better to  store it into system catalog.
>>
>> >
>> > Another issue is that append only workloads are frequently combined
>> > with rare periodical bulk deletes of data.  Assuming that in this patch
>> > you don't consider deleted pages to trigger index cleanup, on such
>> > workload large amounts of deleted pages may reside non-recycled until
>> > next bulk delete cycle.
>>
>> Perhaps using that new value we can trigger the cleanup if the current
>> number of tuples has been increased or decreased the
>> vacuum_index_scale_factor% from n_tup_last_cleanup.
>
>
> Yes, that might work.  However, decreased number of tuples could be
> inaccurate measure of number of deleted pages.  Depending on distribution
> of tuples per pages, same amount of deleted tuples can lead to very
> different
> number of deleted pages (in corner cases in can start from zero to the very
> large amounts).  If we want to skip scanning of deleted pages then it
> would be better store their exact count known by previous bulk delete or
> cleanup.
>
>> >
>> > So, despite I generally like idea of storing epoch of deleted xid in the
>> > page, I think my version of patch is closer to committable shape.
>> >
>>
>> Agreed, but as I mentioned before, I'm concerned that your version
>> patch approach will become a restriction of future improvement. If
>> community decides this feature works only for mostly append-only
>> workloads I think your version of patch is the best approach for now.
>
>
> At first I would like to note that all valid optimizations presented in the
> thread optimizes append case.  Thus they do give serious benefit
> on mostly append-only workloads.  Since patches were about
> skipping/reducing cleanup stage which does serious work only in
> append-only case.
>
> It could be possible in future to optimize also cases when only small
> fraction of table tuples were deleted.  Those tuples could be deleted
> not by full index scan, but by individual index lookups.  But I think
> such optimization is rather harder than everything mentioned in
> this thread, and it should be considered separately.

Agreed.

>
> The thing which could be improved in future is making btree able
> to skip cleanup when some deleted pages are pending to be recycled.
> But I'm not sure that this problem deserves a lot of time investment
> right now.  Because I think that page deletion in our btree is unfortunately
> quite rate situation.  In real life our btree is rather going to be bloat
> with bad
> fillfactor of pages rather than got much pages deleted.

Agreed. In your version of the patch we clean up recyclable pages whenever
there might be some, whereas my version of the patch could leave recyclable
pages behind. The thing I'm concerned about is that the patch unnecessarily
narrows its effect. That is, we might be able to update the btree meta page
even during bulkdelete.

In your version of the patch we have to scan the whole index at least twice
(bulkdelete and cleanup) if even one tuple is deleted. But if the deletion
of index tuples didn't lead to index page deletions (which is the common
case), the extra scan is really unnecessary. So I guess that we
can update the meta page only when we deleted pages in the index
vacuum scan. If no page deletion happened, we don't need to
update the oldest btpo.xact, so we don't update the meta page and of course
don't WAL-log. This could be a separate patch though. At least the
current approach is safer.
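
A hypothetical sketch of that suggestion (PostgreSQL-style C, names partly
invented; not code from either patch): after the btbulkdelete() scan, the
metapage is rewritten and WAL-logged only if that scan actually deleted
pages, so the common no-page-deletion case pays no extra metapage write.

static void
maybe_record_cleanup_info(Relation rel, BlockNumber pages_deleted,
                          TransactionId oldest_btpo_xact,
                          double num_heap_tuples)
{
    if (pages_deleted == 0)
        return;                 /* nothing became recyclable; skip the
                                 * metapage write and its WAL record */

    /* hypothetical helper: store the cleanup info in the metapage, with WAL */
    record_index_cleanup_info(rel, oldest_btpo_xact, num_heap_tuples);
}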

>
> So, I would like to clarify why could my patch block future improvements
> in this area?  For instance, if we would decide to make btree able to skip
> cleanup when some delete pages are pending for recycle, we can add
> it in future.
>

Anyway, regarding the approach for this feature, I agree with your version of
the patch, and it looks good to me as the first step of this feature.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Tue, Apr 3, 2018 at 6:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Mar 27, 2018 at 8:01 PM, Alexander Korotkov
> So, I would like to clarify why could my patch block future improvements
> in this area?  For instance, if we would decide to make btree able to skip
> cleanup when some delete pages are pending for recycle, we can add
> it in future.
>

Anyway, for approaches of this feature I agree your version patch and
your patch looks good to me as the first step of this feature.

Agreed.  I think we have consensus that this patch is a good first step,
which doesn't block further enhancements in the future.

So, I'm attaching a rebased version of the patch and marking this RFC.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Teodor Sigaev
Date:
Thanks for everyone, pushed with minor editorization


Alexander Korotkov wrote:
> On Tue, Apr 3, 2018 at 6:42 PM, Masahiko Sawada <sawada.mshk@gmail.com 
> <mailto:sawada.mshk@gmail.com>> wrote:
> 
>     On Tue, Mar 27, 2018 at 8:01 PM, Alexander Korotkov
>     > So, I would like to clarify why could my patch block future improvements
>     > in this area?  For instance, if we would decide to make btree able to skip
>     > cleanup when some delete pages are pending for recycle, we can add
>     > it in future.
>     >
> 
>     Anyway, for approaches of this feature I agree your version patch and
>     your patch looks good to me as the first step of this feature.
> 
> 
> Agreed.  I think we got consensus that this patch is good first step,
> which doesn't block further enhancements in future.
> 
> So, I'm attaching rebased version of patch and marking this RFC.
> 
> ------
> Alexander Korotkov
> Postgres Professional:http://www.postgrespro.com <http://www.postgrespro.com/>
> The Russian Postgres Company

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Thu, Apr 5, 2018 at 1:30 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thanks for everyone, pushed with minor editorization
>

Thank you for committing!
I found a typo in nbtpage.c and attached a patch that fixes it.

s/overritten/overwritten/

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Thu, Apr 5, 2018 at 2:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Apr 5, 2018 at 1:30 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> Thanks for everyone, pushed with minor editorization
>>
>
> Thank you for committing!
> I found a typo in nbtpage.c and attached a patch fixes it.
>

I also found incorrect documentation in create_index.sgml, as follows.

     <term><literal>vacuum_cleanup_index_scale_factor</literal></term>
     <listitem>
     <para>
      Per-table value for <xref
linkend="guc-vacuum-cleanup-index-scale-factor"/>.
     </para>
     </listitem>
    </varlistentry>

I think it should be "Per-index". Attached is a patch fixing it, and
sorry for missing it during review.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kyotaro HORIGUCHI
Date:
Hello.

The commit leaves three warnings for
-Wunused-but-set-variable. Two of the variables are not assertion-only but
are really not used at all.

I also found that nodeMerge.c has one such variable.

regards.

At Thu, 5 Apr 2018 15:43:55 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAtYzXeuhPqGw89KxqcJWpSKxWRNkYzyJxAqJHZham==Q@mail.gmail.com>
> On Thu, Apr 5, 2018 at 2:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Thu, Apr 5, 2018 at 1:30 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> >> Thanks for everyone, pushed with minor editorization
> >>
> >
> > Thank you for committing!
> > I found a typo in nbtpage.c and attached a patch fixes it.
> >
> 
> I also found an incorrect documentation in create_index.sgml as follows.
> 
>      <term><literal>vacuum_cleanup_index_scale_factor</literal></term>
>      <listitem>
>      <para>
>       Per-table value for <xref
> linkend="guc-vacuum-cleanup-index-scale-factor"/>.
>      </para>
>      </listitem>
>     </varlistentry>
> 
> I think it should be "Per-index". Attached a patch for fixing it. And
> sorry for missing it at review.
> 
> Regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 505a67e6ed..b920d66731 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -85,7 +85,7 @@ void
 _bt_upgrademetapage(Page page)
 {
     BTMetaPageData *metad;
-    BTPageOpaque metaopaque;
+    BTPageOpaque metaopaque PG_USED_FOR_ASSERTS_ONLY;
 
     metad = BTPageGetMeta(page);
     metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -118,7 +118,6 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 {
     Buffer            metabuf;
     Page            metapg;
-    BTPageOpaque    metaopaque;
     BTMetaPageData *metad;
     bool            needsRewrite = false;
     XLogRecPtr        recptr;
@@ -126,7 +125,6 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
     /* read the metapage and check if it needs rewrite */
     metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
     metapg = BufferGetPage(metabuf);
-    metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
     metad = BTPageGetMeta(metapg);
 
     /* outdated version of metapage always needs rewrite */
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 06badc90ba..66a66f2dad 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -786,13 +786,11 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 {
     Buffer            metabuf;
     Page            metapg;
-    BTPageOpaque    metaopaque;
     BTMetaPageData *metad;
     bool            result = false;
 
     metabuf = _bt_getbuf(info->index, BTREE_METAPAGE, BT_READ);
     metapg = BufferGetPage(metabuf);
-    metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
     metad = BTPageGetMeta(metapg);
 
     if (metad->btm_version < BTREE_VERSION)
diff --git a/src/backend/executor/nodeMerge.c b/src/backend/executor/nodeMerge.c
index 0e0d0795d4..b21e69903d 100644
--- a/src/backend/executor/nodeMerge.c
+++ b/src/backend/executor/nodeMerge.c
@@ -485,7 +485,7 @@ ExecMerge(ModifyTableState *mtstate, EState *estate, TupleTableSlot *slot,
     ItemPointer tupleid;
     ItemPointerData tuple_ctid;
     bool        matched = false;
-    char        relkind;
+    char        relkind PG_USED_FOR_ASSERTS_ONLY;
     Datum        datum;
     bool        isNull;


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Teodor Sigaev
Date:
Thanks to everyone, fixes are pushed except nodeMerge.c, I don't wish to 
increase entropy around MERGE patch :)

Kyotaro HORIGUCHI wrote:
> Hello.
> 
> The commit leaves three warnings for
> -Wunused-but-set-variable. Two of them are not assertion-only but
> really not used at all.
> 
> I also found that nodeMerge.c has one such variable.
> 
> regards.
> 
> At Thu, 5 Apr 2018 15:43:55 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAtYzXeuhPqGw89KxqcJWpSKxWRNkYzyJxAqJHZham==Q@mail.gmail.com>
>> On Thu, Apr 5, 2018 at 2:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Thu, Apr 5, 2018 at 1:30 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>>>> Thanks for everyone, pushed with minor editorization
>>>>
>>>
>>> Thank you for committing!
>>> I found a typo in nbtpage.c and attached a patch fixes it.
>>>
>>
>> I also found an incorrect documentation in create_index.sgml as follows.
>>
>>       <term><literal>vacuum_cleanup_index_scale_factor</literal></term>
>>       <listitem>
>>       <para>
>>        Per-table value for <xref
>> linkend="guc-vacuum-cleanup-index-scale-factor"/>.
>>       </para>
>>       </listitem>
>>      </varlistentry>
>>
>> I think it should be "Per-index". Attached a patch for fixing it. And
>> sorry for missing it at review.
>>
>> Regards,
> 

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Thu, Apr 5, 2018 at 7:23 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thanks to everyone, fixes are pushed except nodeMerge.c, I don't wish to
> increase entropy around MERGE patch :)
>

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Kyotaro HORIGUCHI
Date:
At Fri, 6 Apr 2018 10:52:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCYWuKu4yvTLE1Hgv2SsJRnBjYweTztTkBABUykXonb1w@mail.gmail.com>
> On Thu, Apr 5, 2018 at 7:23 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> > Thanks to everyone, fixes are pushed except nodeMerge.c, I don't wish to
> > increase entropy around MERGE patch :)

No problem. Thanks!

> Thank you!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] GUC for cleanup indexes threshold.

From
Darafei "Komяpa" Praliaskouski
Date:
Hi!

It is cool to see this in Postgres 11. However:
 
4) vacuum_cleanup_index_scale_factor can be set either by GUC or reloption.
Default value is 0.1.  So, by default cleanup scan is triggered after increasing of
table size by 10%.

vacuum_cleanup_index_scale_factor can be set to a maximum of 100.
I imagine that on a large append-only table with a limited IOPS budget on the storage system, I may want to never perform a full scan on the index. Roughly, with the parameter set to 100, if we vacuum the table for the first time with 1 tuple and 130-byte-wide rows, we'll have a full scan at 130 bytes, 12 kbytes, 1.2MB, 123MB, 12 GB, 1.2TB.

If we happen to perform the first vacuum when there are 4 tuples in the table, it becomes 52kb, 5MB, 495MB, 48GB - and both 12GB and 48GB will exhaust any storage spike IOPS budget, slowing everything down rather suddenly.

Can the upper limit for this GUC be lifted, or can it have a value for "never"?


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
Hi!

On Sat, Jun 16, 2018 at 11:23 PM Darafei "Komяpa" Praliaskouski
<me@komzpa.net> wrote:
> It is cool to see this in Postgres 11. However:
>
>>
>> 4) vacuum_cleanup_index_scale_factor can be set either by GUC or reloption.
>> Default value is 0.1.  So, by default cleanup scan is triggered after increasing of
>> table size by 10%.
>
>
> vacuum_cleanup_index_scale_factor can be set to the maximum of 100.
> I imagine that on a large append-only table with IOPS storage system budget it may happen that I would want to never
> perform a full scan on index. Roughly, with parameter set to 100, if we vacuum the table first time with 1 tuple and 130
> byte wide rows, we'll have a full scan at 130 bytes, 12 kbytes, 1.2MB, 123MB, 12 GB, 1.2TB.
>
> If we happen to perform the first vacuum when there are 4 tuples in the table, it becomes 52kb, 5MB, 495MB, 48GB -
> and both 12GB and 48GB will exhaust any storage spike IOPS budget, slowing everything down rather suddenly.
>
> Can the upper limit for this GUC be lifted, or have a value for "never"?

I have some further exploration of how statistics obtained by B-tree
index vacuum cleanup is used.

1) The collected page and tuple counts are not used directly, but are used
for an estimation of the tuple density per page, while the current number of
pages is estimated using smgr (see btcostestimate()).  So, unless the
density of tuples changes significantly, having stale index statistics
doesn't affect query plans (see the sketch after this list).
2) Our optimization for skipping B-tree index vacuum cleanup works
only in the case when the user manually vacuums the table in order to update the
visibility map. Autovacuum is not triggered for append-only tables.
So, if the user doesn't take special care of append-only tables,
they're not vacuumed until "autovacuum to prevent wraparound".  Thus,
index statistics could be very stale.  And I don't think we have
many occurrences of issues with stale index statistics.
3) We have a very safe default of vacuum_cleanup_index_scale_factor equal
to 0.1.  But as Darafei claimed, the maximum value of 100 is probably too low
for advanced users, who really need the benefits of this optimization.
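
A simplified illustration of point 1) (not the planner's actual code; the
fallback density is hypothetical): the stored statistics mostly contribute a
tuples-per-page density, while the current physical size comes from smgr, so
the estimate stays reasonable as long as that density is stable.

static double
estimate_index_tuples(double stats_reltuples, double stats_relpages,
                      double pages_from_smgr)
{
    double density;

    if (stats_relpages <= 0)
        density = 100.0;        /* hypothetical fallback density */
    else
        density = stats_reltuples / stats_relpages;

    return density * pages_from_smgr;
}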

So, I'm proposing to raise the maximum value of
vacuum_cleanup_index_scale_factor to DBL_MAX.  Any objections?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Mon, Jun 18, 2018 at 1:56 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Hi!
>
> On Sat, Jun 16, 2018 at 11:23 PM Darafei "Komяpa" Praliaskouski
> <me@komzpa.net> wrote:
>> It is cool to see this in Postgres 11. However:
>>
>>>
>>> 4) vacuum_cleanup_index_scale_factor can be set either by GUC or reloption.
>>> Default value is 0.1.  So, by default cleanup scan is triggered after increasing of
>>> table size by 10%.
>>
>>
>> vacuum_cleanup_index_scale_factor can be set to the maximum of 100.
>> I imagine that on a large append-only table with IOPS storage system budget it may happen that I would want to never
>> perform a full scan on index. Roughly, with parameter set to 100, if we vacuum the table first time with 1 tuple and 130
>> byte wide rows, we'll have a full scan at 130 bytes, 12 kbytes, 1.2MB, 123MB, 12 GB, 1.2TB.
>>
>> If we happen to perform the first vacuum when there are 4 tuples in the table, it becomes 52kb, 5MB, 495MB, 48GB -
>> and both 12GB and 48GB will exhaust any storage spike IOPS budget, slowing everything down rather suddenly.
>>
>> Can the upper limit for this GUC be lifted, or have a value for "never"?
>
> I have some further exploration of how statistics obtained by B-tree
> index vacuum cleanup is used.
>
> 1) Collected pages and tuples numbers are not directly used, but used
> for an estimation of tuples density per page, while current number of
> page is estimated using smgr (see btcostestimate()).  So, unless
> density of tuples significantly changes, having index statistics
> stalled doesn't affect query plans.
> 2) Our optimization for skipping B-tree index vacuum cleanup works
> only in case when use manually vacuums table in order to update
> visibility map. Autovacuum is not triggered for append-only tables.
> So, if user doesn't have special care about append-only tables,
> they're not vacuumed until "autovacuum to prevent wraparound".  Thus,
> index statistics could be very stalled.  And I don't think we have
> many occurrences of issues with stalled index statistics.
> 3) We have very safe defaul of vacuum_cleanup_index_scale_factor equal
> to 1.1.  But as Darafei claimed, 100 maximum value is probably too low
> for advanced users, who really need benefits of this optimization.

Thank you for explanation.

> So, I'm proposing to raise maximum valus of
> vacuum_cleanup_index_scale_factor to DBL_MAX.  Any objections?
>

I agree to expanding the maximum value. But if users don't want index
cleanup, it would be helpful to have an option (e.g. setting it to -1)
to disable index cleanup, while documenting the risk of disabling index
cleanup. It seems to me that setting a very high value serves the same
purpose.
Also, your patch lacks a documentation update.

BTW, I realized that postgresql.conf.sample doesn't have the
vacuum_cleanup_index_scale_factor option. The attached patch fixes it.

Regards,

--
Masahiko Sawada

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Tue, Jun 19, 2018 at 11:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Jun 18, 2018 at 1:56 PM, Alexander Korotkov
> > So, I'm proposing to raise maximum valus of
> > vacuum_cleanup_index_scale_factor to DBL_MAX.  Any objections?
> >
>
> I agree to expand the maximum value. But if users don't want index
> cleanup it would be helpful if we have an option (e.g. setting to -1)
> to disable index cleanup while documenting a risk of disabling index
> cleanup. It seems to me that setting very high values means the same
> purpose.

Yes, providing an option to completely disable b-tree index cleanup
would be good.  But the problem is that we already use the value -1 for
"use the default" in the reloption.  So, even if we make the -1 GUC
setting mean "never cleanup", we still wouldn't be able to make the
reloption work this way.  Probably, we should use another "magical
value" in the reloption for the "use the default" semantics.

> Also, your patch lacks documentation update.

Good catch, thank you.

> BTW, I realized that postgresql.conf.sample doesn't have
> vacuum_cleanup_index_scale_factor option. Attached patch fixes it.

It seems that you posted the wrong attachment, because the patch you sent
is exactly the same as mine.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Tue, Jun 19, 2018 at 5:43 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Tue, Jun 19, 2018 at 11:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Mon, Jun 18, 2018 at 1:56 PM, Alexander Korotkov
>> > So, I'm proposing to raise maximum valus of
>> > vacuum_cleanup_index_scale_factor to DBL_MAX.  Any objections?
>> >
>>
>> I agree to expand the maximum value. But if users don't want index
>> cleanup it would be helpful if we have an option (e.g. setting to -1)
>> to disable index cleanup while documenting a risk of disabling index
>> cleanup. It seems to me that setting very high values means the same
>> purpose.
>
> Yes, providing an option to completely disable b-tree index cleanup
> would be good.  But the problem is that we already use -1 value for
> "use the default" in reloption.  So, if even we will make -1 guc
> option to mean "never cleanup", then we still wouldn't be able to make
> reloption to work this way.  Probably, we should use another "magical
> value" in reloption for "use the default" semantics.

Right. We could add a new reloption specifying whether to use the default
value of vacuum_index_cleanup_scale_factor or not, but it might be
overkill.

>
>> Also, your patch lacks documentation update.
>
> Good catch, thank you.
>
>> BTW, I realized that postgresql.conf.sample doesn't have
>> vacuum_cleanup_index_scale_factor option. Attached patch fixes it.
>
> It seems that you post a wrong attachment, because the patch you sent
> is exactly same as mine.
>

Sorry, attached the correct one.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Tue, Jun 19, 2018 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Jun 19, 2018 at 5:43 PM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > On Tue, Jun 19, 2018 at 11:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >> On Mon, Jun 18, 2018 at 1:56 PM, Alexander Korotkov
> >> > So, I'm proposing to raise maximum valus of
> >> > vacuum_cleanup_index_scale_factor to DBL_MAX.  Any objections?
> >> >
> >>
> >> I agree to expand the maximum value. But if users don't want index
> >> cleanup it would be helpful if we have an option (e.g. setting to -1)
> >> to disable index cleanup while documenting a risk of disabling index
> >> cleanup. It seems to me that setting very high values means the same
> >> purpose.
> >
> > Yes, providing an option to completely disable b-tree index cleanup
> > would be good.  But the problem is that we already use -1 value for
> > "use the default" in reloption.  So, if even we will make -1 guc
> > option to mean "never cleanup", then we still wouldn't be able to make
> > reloption to work this way.  Probably, we should use another "magical
> > value" in reloption for "use the default" semantics.
>
> Right. We can add a new reloption specifying whether we use default
> value of vacuum_index_cleanup_scale_factor or not but it might be
> overkill.

That would be overkill, and it would also be different from how
other reloptions behave.

> >> Also, your patch lacks documentation update.
> >
> > Good catch, thank you.
> >
> >> BTW, I realized that postgresql.conf.sample doesn't have
> >> vacuum_cleanup_index_scale_factor option. Attached patch fixes it.
> >
> > It seems that you post a wrong attachment, because the patch you sent
> > is exactly same as mine.
> >
>
> Sorry, attached correct one.

Ok.  I've rephrased the comment a bit.  Also, you created an "index vacuum"
subsection in the "resource usage" section.  I think it's not
appropriate for this option to be in "resource usage".  Ideally it
should be grouped with other vacuum options, but we don't have a single
section for that.  The "Autovacuum" section is also not appropriate,
because this GUC works not only for autovacuum.  So, the most semantically
close options, which affect vacuum in general, are
vacuum_freeze_min_age, vacuum_freeze_table_age,
vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age,
which are located in the "client connection defaults" section.  So, I
decided to move this GUC into that section.  I also changed the section
in the GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT.

I think we ideally should have a "vacuum" section, which would have
two subsections: "client defaults" and "autovacuum".  But that should
be a separate patch, which needs to be discussed separately.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Wed, Jun 20, 2018 at 1:00 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Tue, Jun 19, 2018 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Tue, Jun 19, 2018 at 5:43 PM, Alexander Korotkov
>> <a.korotkov@postgrespro.ru> wrote:
>> > On Tue, Jun 19, 2018 at 11:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >> On Mon, Jun 18, 2018 at 1:56 PM, Alexander Korotkov
>> >> > So, I'm proposing to raise maximum valus of
>> >> > vacuum_cleanup_index_scale_factor to DBL_MAX.  Any objections?
>> >> >
>> >>
>> >> I agree to expand the maximum value. But if users don't want index
>> >> cleanup it would be helpful if we have an option (e.g. setting to -1)
>> >> to disable index cleanup while documenting a risk of disabling index
>> >> cleanup. It seems to me that setting very high values means the same
>> >> purpose.
>> >
>> > Yes, providing an option to completely disable b-tree index cleanup
>> > would be good.  But the problem is that we already use -1 value for
>> > "use the default" in reloption.  So, if even we will make -1 guc
>> > option to mean "never cleanup", then we still wouldn't be able to make
>> > reloption to work this way.  Probably, we should use another "magical
>> > value" in reloption for "use the default" semantics.
>>
>> Right. We can add a new reloption specifying whether we use default
>> value of vacuum_index_cleanup_scale_factor or not but it might be
>> overkill.
>
> That would be overkill, and also that would be different from how
> other reloptions behaves.

Agreed.

>
>> >> Also, your patch lacks documentation update.
>> >
>> > Good catch, thank you.
>> >
>> >> BTW, I realized that postgresql.conf.sample doesn't have
>> >> vacuum_cleanup_index_scale_factor option. Attached patch fixes it.
>> >
>> > It seems that you post a wrong attachment, because the patch you sent
>> > is exactly same as mine.
>> >
>>
>> Sorry, attached correct one.
>
> Ok.  I've rephrased comment a bit.  Also, you created "index vacuum"
> subsection in the "resource usage" section.  I think it's not
> appropriate for this option to be in "resource usage".  Ideally it
> should be grouped with other vacuum options, but we don't have single
> section for that.  "Autovacuum" section is also not appropriate,
> because this guc works not only for autovacuum.  So, most semantically
> close options, which affects vacuum in general, are
> vacuum_freeze_min_age, vacuum_freeze_table_age,
> vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age,
> which are located in "client connection defaults" section.  So, I
> decided to move this GUC into this section.  I also change the section
> in GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT.

Agreed. So should we move it to 19.11 Client Connection Defaults in the
docs as well? And I think it's better if this option is placed next to the
other vacuum options so it is easier to find. The attached patch changes them
accordingly. Please review it.

> I think we ideally should have a "vacuum" section, which would have
> two subections: "client defaults" and "autovacuum".  But that should
> be a separate patch, which needs to be discussed separately.

+1

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Wed, Jun 20, 2018 at 8:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Jun 20, 2018 at 1:00 AM, Alexander Korotkov
> > Ok.  I've rephrased comment a bit.  Also, you created "index vacuum"
> > subsection in the "resource usage" section.  I think it's not
> > appropriate for this option to be in "resource usage".  Ideally it
> > should be grouped with other vacuum options, but we don't have single
> > section for that.  "Autovacuum" section is also not appropriate,
> > because this guc works not only for autovacuum.  So, most semantically
> > close options, which affects vacuum in general, are
> > vacuum_freeze_min_age, vacuum_freeze_table_age,
> > vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age,
> > which are located in "client connection defaults" section.  So, I
> > decided to move this GUC into this section.  I also change the section
> > in GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT.
>
> Agreed. So should we move it to 19.11 Client Connection Defaults in
> doc as well? And I think it's better if this option places next to
> other vacuum options for finding easier. Attached patch changes them
> so. Please review it.

Right, thank you.  Looks good to me.
I'm going to commit this if there are no objections.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Wed, Jun 20, 2018 at 12:00 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Wed, Jun 20, 2018 at 8:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Wed, Jun 20, 2018 at 1:00 AM, Alexander Korotkov
> > > Ok.  I've rephrased comment a bit.  Also, you created "index vacuum"
> > > subsection in the "resource usage" section.  I think it's not
> > > appropriate for this option to be in "resource usage".  Ideally it
> > > should be grouped with other vacuum options, but we don't have single
> > > section for that.  "Autovacuum" section is also not appropriate,
> > > because this guc works not only for autovacuum.  So, most semantically
> > > close options, which affects vacuum in general, are
> > > vacuum_freeze_min_age, vacuum_freeze_table_age,
> > > vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age,
> > > which are located in "client connection defaults" section.  So, I
> > > decided to move this GUC into this section.  I also change the section
> > > in GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT.
> >
> > Agreed. So should we move it to 19.11 Client Connection Defaults in
> > the docs as well? And I think it's better to place this option next to
> > the other vacuum options so it is easier to find. The attached patch
> > makes these changes. Please review it.
>
> Right, thank you.  Looks good to me.
> I'm going to commit this if there are no objections.

Pushed.

Regarding the maximum value for vacuum_cleanup_index_scale_factor: I
think introducing a special value with "never cleanup" meaning would be
overkill for a post-feature-freeze enhancement.  So, I propose to just
increase the maximum value for both the GUC and the reloption.  See the
attached patch.  It also changes the calculations in
_bt_vacuum_needs_cleanup() for better handling of large values (just
some kind of overflow paranoia).
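
To make the "overflow paranoia" concrete: instead of testing something like
num_heap_tuples > prev_num_heap_tuples * (1 + scale_factor), which can blow
up for very large scale factors, the check can compare the relative growth
against the factor directly.  A minimal standalone sketch of that idea
(simplified, not the actual nbtree code; it assumes a negative previous
count means "unknown"):

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of an overflow-safe cleanup trigger: compare the relative growth
 * of the heap tuple count against the scale factor, rather than multiplying
 * the previous count by (1 + scale_factor).
 */
static bool
cleanup_needed(double prev_num_heap_tuples, double num_heap_tuples,
               double cleanup_scale_factor)
{
    /* A non-positive factor or an unknown previous count forces cleanup. */
    if (cleanup_scale_factor <= 0 || prev_num_heap_tuples < 0)
        return true;

    return (num_heap_tuples - prev_num_heap_tuples) / prev_num_heap_tuples >=
        cleanup_scale_factor;
}

int
main(void)
{
    /* With a huge factor such as 1e10, the division form stays well-behaved. */
    printf("%d\n", cleanup_needed(1000000.0, 1001000.0, 1e10));   /* prints 0 */
    printf("%d\n", cleanup_needed(1000000.0, 1001000.0, 0.0));    /* prints 1 */
    return 0;
}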

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Masahiko Sawada
Date:
On Fri, Jun 22, 2018 at 6:55 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Wed, Jun 20, 2018 at 12:00 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
>> On Wed, Jun 20, 2018 at 8:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> > On Wed, Jun 20, 2018 at 1:00 AM, Alexander Korotkov
>> > > Ok.  I've rephrased comment a bit.  Also, you created "index vacuum"
>> > > subsection in the "resource usage" section.  I think it's not
>> > > appropriate for this option to be in "resource usage".  Ideally it
>> > > should be grouped with other vacuum options, but we don't have single
>> > > section for that.  "Autovacuum" section is also not appropriate,
>> > > because this guc works not only for autovacuum.  So, most semantically
>> > > close options, which affects vacuum in general, are
>> > > vacuum_freeze_min_age, vacuum_freeze_table_age,
>> > > vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age,
>> > > which are located in "client connection defaults" section.  So, I
>> > > decided to move this GUC into this section.  I also change the section
>> > > in GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT.
>> >
>> > Agreed. So should we move it to 19.11 Client Connection Defaults in
>> > the docs as well? And I think it's better to place this option next to
>> > the other vacuum options so it is easier to find. The attached patch
>> > makes these changes. Please review it.
>>
>> Right, thank you.  Looks good to me.
>> I'm going to commit this if there are no objections.
>
> Pushed.

Thank you!

>
> Regarding the maximum value for vacuum_cleanup_index_scale_factor: I
> think introducing a special value with "never cleanup" meaning would be
> overkill for a post-feature-freeze enhancement.

After more thought, adding a threshold for invoking index cleanup
could perhaps work as a way to support "never cleanup", though it
would be a PostgreSQL 12 item. If an index has fewer tuples than the
threshold (say, vacuum_cleanup_index_threshold), index cleanup would
never be executed.
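
Roughly sketched in C (purely illustrative; vacuum_cleanup_index_threshold
is a hypothetical parameter and does not exist in any patch yet):

#include <stdbool.h>

/*
 * Hypothetical sketch only: skip index cleanup entirely for indexes with
 * fewer tuples than the (hypothetical) threshold, and otherwise trigger
 * cleanup when relative heap growth exceeds the scale factor.
 */
static bool
cleanup_needed_with_threshold(double index_num_tuples,
                              double cleanup_index_threshold,
                              double prev_num_heap_tuples,
                              double num_heap_tuples,
                              double cleanup_scale_factor)
{
    /* Small index: never run the cleanup scan. */
    if (index_num_tuples < cleanup_index_threshold)
        return false;

    /* A non-positive factor or an unknown previous count triggers cleanup. */
    if (cleanup_scale_factor <= 0 || prev_num_heap_tuples < 0)
        return true;

    return (num_heap_tuples - prev_num_heap_tuples) / prev_num_heap_tuples >=
        cleanup_scale_factor;
}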

> So, I propose to just
> increase the maximum value for both the GUC and the reloption.  See the
> attached patch.  It also changes the calculations in
> _bt_vacuum_needs_cleanup() for better handling of large values (just
> some kind of overflow paranoia).
>

The patch looks good to me.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Tue, Jun 26, 2018 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Jun 22, 2018 at 6:55 PM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > So, I propose to just
> > increase the maximum value for both the GUC and the reloption.  See the
> > attached patch.  It also changes the calculations in
> > _bt_vacuum_needs_cleanup() for better handling of large values (just
> > some kind of overflow paranoia).
>
> The patch looks good to me.

Pushed, thanks!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] GUC for cleanup indexes threshold.

From
Darafei "Komяpa" Praliaskouski
Date:

Tue, 26 Jun 2018 at 15:42, Alexander Korotkov <a.korotkov@postgrespro.ru>:
On Tue, Jun 26, 2018 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Jun 22, 2018 at 6:55 PM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > So, I propose to just
> > increase the maximum value for both the GUC and the reloption.  See the
> > attached patch.  It also changes the calculations in
> > _bt_vacuum_needs_cleanup() for better handling of large values (just
> > some kind of overflow paranoia).
>
> The patch looks good to me.

Pushed, thanks!

Thank you for the enhancement. Now Index Only Scans over Append-Only tables in Postgres can be implemented, even if it requires manually kicking off VACUUM on a large table, and that's a great enhancement for moving-object databases. :)

My eye catches another thing, the error message in tests is:

DETAIL:  Valid values are between "0.000000" and "179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000".

a) do we really need to print the digits of DBL_MAX? Why not just "Valid values are double precision, non-negative"?
b) double precision binary-to-decimal noise starts at the 16th digit. Why does it stop at the decimal point, so that we get an exact ".000000"? Does it bite the conversion somewhere else too?

Re: [HACKERS] GUC for cleanup indexes threshold.

From
Alexander Korotkov
Date:
On Tue, Jun 26, 2018 at 4:11 PM Darafei "Komяpa" Praliaskouski
<me@komzpa.net> wrote:
> Tue, 26 Jun 2018 at 15:42, Alexander Korotkov <a.korotkov@postgrespro.ru>:
>>
>> On Tue, Jun 26, 2018 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> > On Fri, Jun 22, 2018 at 6:55 PM, Alexander Korotkov
>> > <a.korotkov@postgrespro.ru> wrote:
>> > > So, I propose to just
>> > > increase the maximum value for both the GUC and the reloption.  See the
>> > > attached patch.  It also changes the calculations in
>> > > _bt_vacuum_needs_cleanup() for better handling of large values (just
>> > > some kind of overflow paranoia).
>> >
>> > The patch looks good to me.
>>
>> Pushed, thanks!
>
>
> Thank you for the enhancement. Now Index Only Scans over Append-Only tables in Postgres can be implemented, even if it requires manually kicking off VACUUM on a large table, and that's a great enhancement for moving-object databases. :)
>
> My eye catches another thing, the error message in tests is:
>
> DETAIL:  Valid values are between "0.000000" and "179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000".
>
> a) do we really need to print the digits of DBL_MAX? Why not just "Valid values are double precision, non-negative"?
> b) double precision binary-to-decimal noise starts at the 16th digit. Why does it stop at the decimal point, so that we get an exact ".000000"? Does it bite the conversion somewhere else too?

Thank you for pointing that out.  I'm proposing to change the output
format from "%f" to "%g" [1] [2].  It looks better and matches what we
do for GUCs.

1. https://www.postgresql.org/message-id/CAPpHfdvewmr4PcpRjrkstoNn1n2_6dL-iHRB21CCfZ0efZdBTg%40mail.gmail.com
2. https://www.postgresql.org/message-id/CAPpHfdsFBbJAd24F0b_o9TfTtu%2B%2BjH0bR5XS_e9xbSwk8SJwyQ%40mail.gmail.com
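
For anyone curious what the difference looks like, a quick standalone C
illustration (not PostgreSQL code): with the default precision, "%f" spells
out every integral digit of DBL_MAX plus six zero fractional digits, while
"%g" switches to scientific notation.  (DBL_MAX itself is an exact integer
value, so the ".000000" really is exact rather than rounded.)

#include <stdio.h>
#include <float.h>

int
main(void)
{
    /* "%f" prints the full integral part of DBL_MAX followed by ".000000". */
    printf("%%f: %f\n", DBL_MAX);

    /* "%g" falls back to scientific notation: 1.79769e+308. */
    printf("%%g: %g\n", DBL_MAX);

    return 0;
}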

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company