Thread: Berserk Autovacuum (let's save next Mandrill)

Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:
Hi hackers,

Attached is a sketch of a small patch that fixes several edge cases with autovacuum. Long story short, autovacuum never comes to append-only tables, killing large production systems.

First case, mine.
 https://www.postgresql.org/message-id/CAC8Q8tLBeAxR%2BBXWuKK%2BHP5m8tEVYn270CVrDvKXt%3D0PkJTY9g%40mail.gmail.com

We had a table we were appending to and wanted Index Only Scan to work. For it to work, you need to call VACUUM manually, since VACUUM is the only way to mark pages all-visible, and autovacuum never comes to append-only tables. We were clever enough to invent a workflow without dead tuples, and it painfully bit us.

Second case, just read in the news.
https://mailchimp.com/what-we-learned-from-the-recent-mandrill-outage/

Mandrill has a 6 TB append-only table that autovacuum probably never vacuumed. Then the anti-wraparound vacuum came and production went down. If autovacuum had done its job before that last moment, it would probably have been okay.

Idea: look not at dead tuples, but at changes, just like ANALYZE does.
It's my first patch for Postgres, so it's probably all wrong, but I hope it helps you get the idea.
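
For illustration, here is a rough sketch in Python-like pseudologic (not the patch itself; the function names are made up, and the thresholds mirror the stock autovacuum/autoanalyze defaults) of the difference between today's dead-tuple trigger and an ANALYZE-style changes trigger:

def autovacuum_triggers_today(dead_tuples, reltuples):
    # Current rule: only dead tuples count, so a purely append-only
    # table (dead_tuples == 0) never qualifies.
    return dead_tuples > 50 + 0.2 * reltuples

def autoanalyze_triggers_today(changes_since_analyze, reltuples):
    # ANALYZE already counts inserts + updates + deletes.
    return changes_since_analyze > 50 + 0.1 * reltuples

def autovacuum_triggers_proposed(changes_since_vacuum, reltuples):
    # Proposed: count all changes, like ANALYZE, so append-only tables
    # eventually get visited and their pages marked all-visible/frozen.
    return changes_since_vacuum > 50 + 0.2 * reltuples
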
--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa
Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
Alvaro Herrera
Date:
On 2019-Mar-27, Darafei "Komяpa" Praliaskouski wrote:

> Attached is sketch of small patch that fixes several edge cases with
> autovacuum. Long story short autovacuum never comes to append only tables,
> killing large productions.

Yeah, autovac is not coping with these scenarios (and probably others).
However, rather than taking your patch's idea verbatim, I think we
should have autovacuum use separate actions for those two (wildly
different) scenarios.  For example:

* certain tables would have some sort of partial scan that sets the
  visibility map.  There's no reason to invoke the whole vacuuming
  machinery.  I don't think this is limited to append-only tables, but
  rather those are just the ones that are affected the most.

* tables nearing wraparound danger should use the (yet to be committed)
  option to skip index cleaning, which makes the cleanup action faster.
  Again, no need for complete vacuuming.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:
Hi,

On Thu, 28 Mar 2019 at 00:32, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Mar-27, Darafei "Komяpa" Praliaskouski wrote:

> Attached is sketch of small patch that fixes several edge cases with
> autovacuum. Long story short autovacuum never comes to append only tables,
> killing large productions.

Yeah, autovac is not coping with these scenarios (and probably others).
However, rather than taking your patch's idea verbatim, I think we
should have autovacuum use separate actions for those two (wildly
different) scenarios.  For example:

* certain tables would have some sort of partial scan that sets the
  visibility map.  There's no reason to invoke the whole vacuuming
  machinery.  I don't think this is limited to append-only tables, but
  rather those are just the ones that are affected the most.

What other machinery runs on VACUUM invocation that is not wanted there?
Since Postgres 11, index cleanup is already skipped on append-only tables.
 
* tables nearing wraparound danger should use the (yet to be committed)
  option to skip index cleaning, which makes the cleanup action faster.
  Again, no need for complete vacuuming.

"Nearing wraparound" is too late already. In Amazon, reading table from gp2 after you exhausted your IOPS burst budget is like reading a floppy drive, you have to freeze a lot earlier than you hit several terabytes of unfrozen data, or you're dead like Mandrill's Search and Url tables from the link I shared.
 


--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa

Re: Berserk Autovacuum (let's save next Mandrill)

From
Alvaro Herrera
Date:
On 2019-Mar-28, Darafei "Komяpa" Praliaskouski wrote:


> On Thu, 28 Mar 2019 at 00:32, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> 
> > On 2019-Mar-27, Darafei "Komяpa" Praliaskouski wrote:

> > * certain tables would have some sort of partial scan that sets the
> >   visibility map.  There's no reason to invoke the whole vacuuming
> >   machinery.  I don't think this is limited to append-only tables, but
> >   rather those are just the ones that are affected the most.
> 
> What other machinery runs on VACUUM invocation that is not wanted there?
> Since Postgres 11 index cleanup is already skipped on append-only tables.

Well, I think it would be useful to set all-visible earlier than waiting
for a vacuum to be necessary, even for tables that are not append-only.
So if you think about this just for the append-only table, you leave
money on the table.

> > * tables nearing wraparound danger should use the (yet to be committed)
> >   option to skip index cleaning, which makes the cleanup action faster.
> >   Again, no need for complete vacuuming.
> 
> "Nearing wraparound" is too late already. In Amazon, reading table from gp2
> after you exhausted your IOPS burst budget is like reading a floppy drive,
> you have to freeze a lot earlier than you hit several terabytes of unfrozen
> data, or you're dead like Mandrill's Search and Url tables from the link I
> shared.

OK, then start freezing tuples in the cheap mode (skip index updates)
earlier than that.  I suppose a good question is when to start.


I wonder if Mandrill's problem is related to Mailchimp raising the
freeze_max_age to a point where autovac did not have enough time to
react with an emergency vacuum.  If you keep raising that value because
the vacuums cause problems for you (they block DDL), there's something
wrong.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Thu, 28 Mar 2019 at 11:01, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2019-Mar-28, Darafei "Komяpa" Praliaskouski wrote:
> > "Nearing wraparound" is too late already. In Amazon, reading table from gp2
> > after you exhausted your IOPS burst budget is like reading a floppy drive,
> > you have to freeze a lot earlier than you hit several terabytes of unfrozen
> > data, or you're dead like Mandrill's Search and Url tables from the link I
> > shared.
>
> OK, then start freezing tuples in the cheap mode (skip index updates)
> earlier than that.  I suppose a good question is when to start.

I thought recently that it would be good to have some sort of
pro-active auto-vacuum mode that made use of idle workers.  Probably
there would need to be some mode flag that indicated which workers
were in proactive mode so that these could be cancelled when more
pressing work came in.  I don't have an exact idea of what
"pro-active" would actually be defined as, but I know that when
consuming a single transaction ID causes terabytes of tables to
suddenly need an anti-wraparound vacuum, it's not a good
situation to be in. Perhaps getting to some percentage of
autovacuum_freeze_max_age could be classed as pro-active.
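
As a rough illustration (not actual PostgreSQL code), such a pro-active trigger could be a simple fraction of the existing wraparound threshold; the 0.5 fraction and the names below are assumptions:

AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000  # stock default
PROACTIVE_FRACTION = 0.5                 # assumed knob

def wants_proactive_freeze(relfrozenxid_age):
    # Hypothetical: start a low-priority, cancellable freeze pass early.
    return relfrozenxid_age > PROACTIVE_FRACTION * AUTOVACUUM_FREEZE_MAX_AGE

def forces_antiwraparound_vacuum(relfrozenxid_age):
    # Existing behaviour: the forced anti-wraparound vacuum.
    return relfrozenxid_age > AUTOVACUUM_FREEZE_MAX_AGE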

> I wonder if Mandrill's problem is related to Mailchimp raising the
> freeze_max_age to a point where autovac did not have enough time to
> react with an emergency vacuum.  If you keep raising that value because
> the vacuums cause problems for you (they block DDL), there's something
> wrong.

I have seen some very high autovacuum_freeze_max_age settings
recently. It would be interesting to know what they had theirs set to.
I see they mentioned "Search and Url tables". I can imagine "search"
never needs any UPDATEs, so quite possibly those were append-only, in
which case the anti-wraparound vacuum would have had quite a lot of
work on its hands since possibly every page needed to be frozen. A table
receiving regular auto-vacuums from dead tuples would likely get some
pages frozen during those.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:


On Thu, 28 Mar 2019 at 01:01, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Mar-28, Darafei "Komяpa" Praliaskouski wrote:


> On Thu, 28 Mar 2019 at 00:32, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> > On 2019-Mar-27, Darafei "Komяpa" Praliaskouski wrote:

> > * certain tables would have some sort of partial scan that sets the
> >   visibility map.  There's no reason to invoke the whole vacuuming
> >   machinery.  I don't think this is limited to append-only tables, but
> >   rather those are just the ones that are affected the most.
>
> What other machinery runs on VACUUM invocation that is not wanted there?
> Since Postgres 11 index cleanup is already skipped on append-only tables.

Well, I think it would be useful to set all-visible earlier than waiting
for a vacuum to be necessary, even for tables that are not append-only.
So if you think about this just for the append-only table, you leave
money on the table.

The thing is, the problem does not exist for non-append-only tables; they're going to be vacuumed automatically after 50 rows get updated.
 

> > * tables nearing wraparound danger should use the (yet to be committed)
> >   option to skip index cleaning, which makes the cleanup action faster.
> >   Again, no need for complete vacuuming.
>
> "Nearing wraparound" is too late already. In Amazon, reading table from gp2
> after you exhausted your IOPS burst budget is like reading a floppy drive,
> you have to freeze a lot earlier than you hit several terabytes of unfrozen
> data, or you're dead like Mandrill's Search and Url tables from the link I
> shared.

OK, then start freezing tuples in the cheap mode (skip index updates)
earlier than that.  I suppose a good question is when to start.

The attached code (autovacuum_berserk_v1.patch) achieves that. For append-only tables, since https://commitfest.postgresql.org/16/952/, vacuum already skips index cleanup if no updates happened. You just need to trigger it, and it will already be "cheap".

--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa

Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:


On Thu, Mar 28, 2019 at 2:36 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
On Thu, 28 Mar 2019 at 11:01, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2019-Mar-28, Darafei "Komяpa" Praliaskouski wrote:
> > "Nearing wraparound" is too late already. In Amazon, reading table from gp2
> > after you exhausted your IOPS burst budget is like reading a floppy drive,
> > you have to freeze a lot earlier than you hit several terabytes of unfrozen
> > data, or you're dead like Mandrill's Search and Url tables from the link I
> > shared.
>
> OK, then start freezing tuples in the cheap mode (skip index updates)
> earlier than that.  I suppose a good question is when to start.

I thought recently that it would be good to have some sort of
pro-active auto-vacuum mode that made use of idle workers. 

Problem with "idle" is that it never happens on system that are going to wraparound on their lifetime. This has to be a part of normal database functioning.

Why not select a table that has inserts, updates and deletes for autovacuum, just like we do for autoanalyze, instead of only deletes and updates like we do now?


--
Darafei Praliaskouski

Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Thu, 28 Mar 2019 at 22:04, Darafei "Komяpa" Praliaskouski
<me@komzpa.net> wrote:
>
> On Thu, Mar 28, 2019 at 2:36 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I thought recently that it would be good to have some sort of
>> pro-active auto-vacuum mode that made use of idle workers.
>
> Problem with "idle" is that it never happens on system that are going to wraparound on their lifetime. This has to be
apart of normal database functioning. 

I'd say auto-vacuum is configured to run too slowly if you never have
an idle worker. The chances that it happens to be running at exactly
the right speed to keep up with demand must be close to nil.

> Why not select a table that has inserts, updates and deletes for autovacuum just like we do for autoanalyze, not only deletes and updates like we do now?

Sounds like a good idea, although I do agree with Alvaro when he
mentions that it would be good to only invoke a worker that was only
going to freeze tuples and not look at the indexes. I've not looked at
it, but there's a patch [1] in the current CF for that.  I'd say a
good course of action would be to review that then write a patch with
a new bool flag in relation_needs_vacanalyze for "freezeonly" and have
auto-vacuum invoke vacuum in this new freeze only mode if freezeonly
is set and dovacuum is not.

Any patch not in the current CF is already PG13 or beyond. Having at
least a freeze-only vacuum mode may ease some pain, even if it still
needs to be done manually for anyone finding themselves in a similar
situation to Mailchimp's.

The idea I was mentioning was more targeted to ease the sudden rush of
auto-vacuum activity when suddenly a bunch of large tables require an
anti-wraparound vacuum all at once.

[1] https://commitfest.postgresql.org/22/1817/

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
Masahiko Sawada
Date:
On Thu, Mar 28, 2019 at 6:32 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On Thu, 28 Mar 2019 at 22:04, Darafei "Komяpa" Praliaskouski
> <me@komzpa.net> wrote:
> >
> > On Thu, Mar 28, 2019 at 2:36 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
> >> I thought recently that it would be good to have some sort of
> >> pro-active auto-vacuum mode that made use of idle workers.
> >
> > Problem with "idle" is that it never happens on system that are going to wraparound on their lifetime. This has to
bea part of normal database functioning. 
>
> I'd say auto-vacuum is configured to run too slowly if you never have
> an idle worker. The chances that it happens to be running at exactly
> the right speed to keep up with demand must be about close to nil.
>
> > Why not select a table that has inserts, updates and deletes for autovacuum just like we do for autoanalyze, not only deletes and updates like we do now?
>
> Sounds like a good idea, although I do agree with Alvaro when he
> mentions that it would be good to only invoke a worker that was only
> going to freeze tuples and not look at the indexes.

Invoking autovacuum on a table based on inserts, not only deletes
and updates, seems like a good idea to me. But in this case, I think we
can not only freeze tuples but also update the visibility map by
setting pages all-visible. Roughly speaking, I think vacuum does the
following operations.

1. heap vacuum
2. HOT pruning
3. freezing tuples
4. updating visibility map (all-visible and all-frozen)
5. index vacuum/cleanup
6. truncation

With the proposed patch[1] we can control whether to do 5 or not. In
addition to that, another proposed patch[2] allows us to control 6.

For append-only tables (and similar tables), what we periodically want
to do would be 3 and 4 (possibly we can do 2 as well). So maybe we
need to have both an option of (auto)vacuum to control whether to do 1
and something like a new autovacuum threshold (or an option) to invoke
the vacuum that disables 1, 5 and 6. The vacuum that does only 2, 3
and 4 would be much cheaper than today's vacuum, and anti-wraparound
vacuum would be able to skip almost all pages.

[1] https://commitfest.postgresql.org/22/1817/
[2] https://commitfest.postgresql.org/22/1981/

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:


On Thu, Mar 28, 2019 at 12:32 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
On Thu, 28 Mar 2019 at 22:04, Darafei "Komяpa" Praliaskouski
<me@komzpa.net> wrote:
>
> On Thu, Mar 28, 2019 at 2:36 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I thought recently that it would be good to have some sort of
>> pro-active auto-vacuum mode that made use of idle workers.
>
> Problem with "idle" is that it never happens on system that are going to wraparound on their lifetime. This has to be a part of normal database functioning.

I'd say auto-vacuum is configured to run too slowly if you never have
an idle worker. The chances that it happens to be running at exactly
the right speed to keep up with demand must be about close to nil.

> Why not select a table that has inserts, updates and deletes for autovacuum just like we do for autoanalyze, not only deletes and updates like we do now?

Sounds like a good idea, although I do agree with Alvaro when he
mentions that it would be good to only invoke a worker that was only
going to freeze tuples and not look at the indexes.

This is the current behavior of VACUUM on tables without dead tuples already. The issue is that nothing triggers this VACUUM apart from the user performing VACUUM manually, or the super-late anti-wraparound vacuum.

Any patch not in the current CF is already PG13 or beyond. Having at
least a freeze only vacuum mode main ease some pain, even if it still
needs to be done manually for anyone finding themselves in a similar
situation as mailchimp.

If you're in a wraparound halt with a super large table on Amazon gp2, nothing will help you - the issue is, there's no option to "rewrite all of it quickly". The burst limit lets the shared drive feel like an SSD for most of your load, but reading and re-writing all the storage gets throttled, and there's no option to escape this quickly.

The process that freezes and marks pages all-visible has to run in parallel and at the speed of your backend pushing pages to disk, maybe lagging behind a bit - but not to the point of "we need to rescan the whole table".
 

The idea I was mentioning was more targeted to ease the sudden rush of
auto-vacuum activity when suddenly a bunch of large tables require an
anti-wraparound vacuum all at once.

[1] https://commitfest.postgresql.org/22/1817/

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Darafei Praliaskouski

Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:
Hi,

> > Why not select a table that has inserts, updates and deletes for autovacuum just like we do for autoanalyze, not only deletes and updates like we do now?
>
> Sounds like a good idea, although I do agree with Alvaro when he
> mentions that it would be good to only invoke a worker that was only
> going to freeze tuples and not look at the indexes.

The invoking autovacuum on table based on inserts, not only deletes
and updates, seems good idea to me. But in this case, I think that we
can not only freeze tuples but also update visibility map even when
setting all-visible. Roughly speaking  I think vacuum does the
following operations.

1. heap vacuum
2. HOT pruning
Is it worth skipping it if we're writing the page anyway for the sake of hint bits and new xids? This will all be a no-op anyway on append-only tables, and will happen only when we actually need something.
 
3. freezing tuples
4. updating visibility map (all-visible and all-frozen)
These two are needed, and the current autovacuum launch process does not take into account that they are also needed for non-dead tuples.
 
5. index vacuum/cleanup
There is a separate patch for that. But since https://commitfest.postgresql.org/16/952/, for almost a year already Postgres skips index cleanup on tables without new dead tuples, so this case is taken care of already?
 
6. truncation
This shouldn't be a heavy operation?
 

With the proposed patch[1] we can control to do 5 or not. In addition
to that, another proposed patch[2] allows us to control 6.

For append-only tables (and similar tables), what we periodically want
to do would be 3 and 4 (possibly we can do 2 as well). So maybe we
need to have both an option of (auto)vacuum to control whether to do 1
and something like a new autovacuum threshold (or an option) to invoke
the vacuum that disables 1, 5 and 6. The vacuum that does only 2, 3
and 4 would be much cheaper than today's vacuum and anti-wraparound
vacuum would be able to skip almost pages.

Why would we want to get rid of 1? It's a no-op from the write perspective, and doing it in the same pass saves a separate scan when it's not a no-op.

Why make it faster in emergency situations when the situation can be made non-emergency from the very beginning instead?
 

[1] https://commitfest.postgresql.org/22/1817/
[2] https://commitfest.postgresql.org/22/1981/

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


--
Darafei Praliaskouski

Re: Berserk Autovacuum (let's save next Mandrill)

From
Masahiko Sawada
Date:
On Thu, Mar 28, 2019 at 8:58 PM Darafei "Komяpa" Praliaskouski
<me@komzpa.net> wrote:
>
> Hi,
>
> > > Why not select a table that has inserts, updates and deletes for autovacuum just like we do for autoanalyze, not only deletes and updates like we do now?
>>
>> >
>> > Sounds like a good idea, although I do agree with Alvaro when he
>> > mentions that it would be good to only invoke a worker that was only
>> > going to freeze tuples and not look at the indexes.
>>
>> The invoking autovacuum on table based on inserts, not only deletes
>> and updates, seems good idea to me. But in this case, I think that we
>> can not only freeze tuples but also update visibility map even when
>> setting all-visible. Roughly speaking  I think vacuum does the
>> following operations.
>>
>> 1. heap vacuum
>>
>> 2. HOT pruning
>
> Is it worth skipping it if we're writing a page anyway for the sake of hint bits and new xids? This will all be no-op anyway on append-only tables and happen only when we actually need something?
>

Yeah, these operations are required only when the table has actual
garbage. IOW, append-only tables never require them.

>>
>> 3. freezing tuples
>> 4. updating visibility map (all-visible and all-frozen)
>
> These two are needed, and current autovacuum launch process does not take into account that this is also needed for non-dead tuples.
>
>>
>> 5. index vacuum/cleanup
>
> There is a separate patch for that. But, since https://commitfest.postgresql.org/16/952/ for almost a year already Postgres skips index cleanup on tables without new dead tuples, so this case is taken care of already?

I think that's not enough. The feature "GUC for cleanup index
threshold" allows us to skip only the index cleanup, and only when there
are fewer insertions than the given fraction of the total number of heap
tuples since the last index cleanup. Therefore it helps only append-only
tables (and supports only btree indexes for now). We still have to do
index vacuuming even if the table has just a few dead tuples. The
proposed patch[1] helps this situation; vacuum can run while skipping
index vacuuming and index cleanup.

>
>>
>> 6. truncation
>
> This shouldn't be a heavy operation?
>

I don't think so. This could take an AccessExclusiveLock on the table and
take a long time with a large shared buffer, as reported in that
thread[2].

>>
>>
>> With the proposed patch[1] we can control to do 5 or not. In addition
>> to that, another proposed patch[2] allows us to control 6.
>>
>> For append-only tables (and similar tables), what we periodically want
>> to do would be 3 and 4 (possibly we can do 2 as well). So maybe we
>> need to have both an option of (auto)vacuum to control whether to do 1
>> and something like a new autovacuum threshold (or an option) to invoke
>> the vacuum that disables 1, 5 and 6. The vacuum that does only 2, 3
>> and 4 would be much cheaper than today's vacuum and anti-wraparound
>> vacuum would be able to skip almost pages.
>
>
> Why will we want to get rid of 1? It's a noop from write perspective and saves a scan to do it if it's not noop.
>

Because that's for tables that have many inserts but also some
updates/deletes. I think that this strategy would help not only
append-only tables but also such tables.

> Why make it faster in emergency situations when situation can be made non-emergency from the very beginning instead?
>

I don't understand the meaning of "situation can be made non-emergency
from the very beginning". Could you please elaborate on that?


>> [1] https://commitfest.postgresql.org/22/1817/
>> [2] https://commitfest.postgresql.org/22/1981/

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Berserk Autovacuum (let's save next Mandrill)

From
Michael Banck
Date:
Hi,

On Thu, Mar 28, 2019 at 12:36:24PM +1300, David Rowley wrote:
> On Thu, 28 Mar 2019 at 11:01, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > I wonder if Mandrill's problem is related to Mailchimp raising the
> > freeze_max_age to a point where autovac did not have enough time to
> > react with an emergency vacuum.  If you keep raising that value because
> > the vacuums cause problems for you (they block DDL), there's something
> > wrong.
>
> I have seen some very high autovacuum_freeze_max_age settings
> recently. It would be interesting to know what they had theirs set to.
> I see they mentioned "Search and Url tables". I can imagine "search"
> never needs any UPDATEs, so quite possibly those were append-only, in
> which case the anti-wraparound vacuum would have had quite a lot of
> work on its hands since possibly every page needed frozen. A table
> receiving regular auto-vacuums from dead tuples would likely get some
> pages frozen during those.

By the way, the Routine Vacuuming chapter of the documentation says:

"The sole disadvantage of increasing autovacuum_freeze_max_age (and
vacuum_freeze_table_age along with it) is that the pg_xact and
pg_commit_ts subdirectories of the database cluster will take more space

[...]

If [pg_xact and pg_commit_ts taking 0.5 and 20 GB, respectively]
is trivial compared to your total database size, setting
autovacuum_freeze_max_age to its maximum allowed value is recommended."

Maybe this should be qualified with "unless you have trouble with your
autovacuum keeping up" or so; or generally reworded?


Michael



Re: Berserk Autovacuum (let's save next Mandrill)

From
Robert Haas
Date:
On Wed, Mar 27, 2019 at 5:32 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> * certain tables would have some sort of partial scan that sets the
>   visibility map.  There's no reason to invoke the whole vacuuming
>   machinery.  I don't think this is limited to append-only tables, but
>   rather those are just the ones that are affected the most.

I think this is a really good idea, but in order for it to work well I
think we would need to have some kind of estimate of autovacuum
pressure.

If we know that we're currently fairly on top of things, and there is
not much for autovacuum to do, periodically vacuuming a chunk of some
table that has a lot of unset visibility-map bits is probably a good
idea.  However, we can't possibly guess how aggressively to do this if
we have no idea how long it's going to be before we need to vacuum
that table for real.  If the number of XIDs remaining until the table
gets a wraparound vacuum is X, and the number of XIDs being consumed
per day is Y, we can estimate that in roughly X/Y days, we're going to
need to do a wraparound vacuum.  That value might be in the range of
months, or in the range of hours.

If it's months, we probably want to limit vacuum to working at a pretty
slow rate, say 1% of the table size per hour or something.  If it's in
hours, we need to be a lot more aggressive.  Right now we have no
information to tell us which of those things is the case, so we'd just
be shooting in the dark.
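
As a sketch of that estimate (illustrative Python, made-up names, assuming the whole table still needs its visibility map set):

def days_until_wraparound_vacuum(xids_remaining, xids_per_day):
    # X XIDs remaining / Y XIDs consumed per day ~= days left.
    return xids_remaining / xids_per_day

def required_fraction_per_hour(unfrozen_fraction, days_left):
    # Fraction of the table to process per hour to finish just in time,
    # spreading the work evenly over the remaining window.
    hours_left = max(days_left * 24.0, 1.0)
    return unfrozen_fraction / hours_left

# e.g. a fully unfrozen table: 60 days left -> ~0.07% per hour (slow pace),
# 12 hours left -> ~8% per hour (much more aggressive).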

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Berserk Autovacuum (let's save next Mandrill)

From
Amit Langote
Date:
On Sun, Mar 31, 2019 at 1:11 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Mar 27, 2019 at 5:32 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > * certain tables would have some sort of partial scan that sets the
> >   visibility map.  There's no reason to invoke the whole vacuuming
> >   machinery.  I don't think this is limited to append-only tables, but
> >   rather those are just the ones that are affected the most.
>
> I think this is a really good idea, but in order for it to work well I
> think we would need to have some kind of estimate of autovacuum
> pressure.
>
> If we know that we're currently fairly on top of things, and there is
> not much for autovacuum to do, periodically vacuuming a chunk of some
> table that has a lot of unset visibility-map bits is probably a good
> idea.  However, we can't possibly guess how aggressively to do this if
> we have no idea how long it's going to be before we need to vacuum
> that table for real.  If the number of XIDs remaining until the table
> gets a wraparound vacuum is X, and the number of XIDs being consumed
> per day is Y, we can estimate that in roughly X/Y days, we're going to
> need to do a wraparound vacuum.  That value might be in the range of
> months, or in the range of hours.
>
> If it's months, we probably want limit vacuum to working at a pretty
> slow rate, say 1% of the table size per hour or something.  If it's in
> hours, we need to be a lot more aggressive.  Right now we have no
> information to tell us which of those things is the case, so we'd just
> be shooting in the dark.

Sawada-san presented some ideas in his PGCon 2018 talk that may be related.

https://www.pgcon.org/2018/schedule/attachments/488_Vacuum_More_Efficient_Than_Ever

(slide 32~)

Thanks,
Amit



Re: Berserk Autovacuum (let's save next Mandrill)

From
Vik Fearing
Date:
On 27/03/2019 21:54, Darafei "Komяpa" Praliaskouski wrote:
> Hi hackers,
> 
> Attached is sketch of small patch that fixes several edge cases with
> autovacuum. Long story short autovacuum never comes to append only
> tables, killing large productions.
> 
> First case, mine.
>  https://www.postgresql.org/message-id/CAC8Q8tLBeAxR%2BBXWuKK%2BHP5m8tEVYn270CVrDvKXt%3D0PkJTY9g%40mail.gmail.com
> 
> We had a table we were appending and wanted Index Only Scan to work. For
> it to work, you need to call VACUUM manually, since VACUUM is the only
> way to mark pages all visible, and autovacuum never comes to append only
> tables. We were clever to invent a workflow without dead tuples and it
> painfully bit us.
> 
> Second case, just read in the news.
> https://mailchimp.com/what-we-learned-from-the-recent-mandrill-outage/
> 
> Mandrill has 6TB append only table that autovacuum probably never
> vacuumed. Then anti-wraparound came and production went down. If
> autovacuum did its job before that last moment, it would probably be okay.
> 
> Idea: look not on dead tuples, but on changes, just like ANALYZE does.
> It's my first patch on Postgres, it's probably all wrong but I hope it
> helps you get the idea.

This was suggested and rejected years ago:
https://www.postgresql.org/message-id/b970f20f-f096-2d3a-6c6d-ee887bd30cfb@2ndquadrant.fr
-- 
Vik Fearing                                          +33 6 46 75 15 36
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support



Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:


On Thu, Mar 28, 2019 at 6:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> 1. heap vacuum
>>
>> 2. HOT pruning
>
> Is it worth skipping it if we're writing a page anyway for the sake of hint bits and new xids? This will all be no-op anyway on append-only tables and happen only when we actually need something?
>

Yeah, these operations are required only when the table has actual
garbage. IOW, append-only tables never require them.

>>
>> 3. freezing tuples
>> 4. updating visibility map (all-visible and all-frozen)
>
> These two are needed, and current autovacuum launch process does not take into account that this is also needed for non-dead tuples.
>
>>
>> 5. index vacuum/cleanup
>
> There is a separate patch for that. But, since https://commitfest.postgresql.org/16/952/ for almost a year already Postgres skips index cleanup on tables without new dead tuples, so this case is taken care of already?

I think that's not enough. The feature "GUC for cleanup index
threshold" allows us to skip only index cleanup when there are less
insertion than the fraction of the total number of heap tuples since
last index cleanup. Therefore it helps only append-only tables (and
supporting only btree index for now). We still have to do index
vacuuming even if the table has just a few dead tuple. The proposed
patch[1] helps this situation; vacuum can run while skipping index
vacuuming and index cleanup.

So, the patch I posted can be technically applied after https://commitfest.postgresql.org/22/1817/ gets merged?

The change with my patch is that a table with 49 insertions and one delete:
 - previously would wait for 49 more deletes by default (ignoring the insertions), and only then clean up both the table and the indexes.
 - with the patch, will freeze/update the VM for the insertions, and scan the index.

In my experience only a btree index requires a slow full index scan; that's why only btree was covered by the "GUC for cleanup index threshold" patch. Is that wrong - do more index types do a full index scan on vacuum after deletion of a single tuple?

 
>> 6. truncation
>
> This shouldn't be a heavy operation?
>

I don't think so. This could take AccessExclusiveLock on the table and
take a long time with large shared buffer as per reported on that
thread[2].

While this can be a useful optimization, I believe it is out of scope for this patch. I want to fix vacuum never coming to append-only tables without breaking other behaviors. Truncation is likely a case where there are already enough dead tuples to trigger a vacuum via the currently existing mechanisms.
 
>>
>>
>> With the proposed patch[1] we can control to do 5 or not. In addition
>> to that, another proposed patch[2] allows us to control 6.
>>
>> For append-only tables (and similar tables), what we periodically want
>> to do would be 3 and 4 (possibly we can do 2 as well). So maybe we
>> need to have both an option of (auto)vacuum to control whether to do 1
>> and something like a new autovacuum threshold (or an option) to invoke
>> the vacuum that disables 1, 5 and 6. The vacuum that does only 2, 3
>> and 4 would be much cheaper than today's vacuum and anti-wraparound
>> vacuum would be able to skip almost pages.
>
>
> Why will we want to get rid of 1? It's a noop from write perspective and saves a scan to do it if it's not noop.
>

Because that's for tables that have many inserts but have some
updates/deletes. I think that this strategy would help not only
append-only tables but also such tables.

How much do we save by skipping the heap vacuum on an almost-append-only table, where the number of updates is below 50, the current threshold?
 

> Why make it faster in emergency situations when situation can be made non-emergency from the very beginning instead?
>

I don't understand the meaning of "situation can be made non-emergency
from the very beginning". Could you please elaborate on that?

Let's imagine a simple append-only workflow on a current default-settings Postgres. You create a table and start inserting tuples, one per transaction. Let's imagine a page fits 50 tuples (my case for taxi movement data), and Amazon gp2 storage, which caps you at around 1000 IOPS in non-burst mode.
The anti-wraparound autovacuum (we need a drawing of a misreading of this term, with a crossed-out car bent in space) will be triggered after autovacuum_freeze_max_age inserts, 200000000 by default. That converts into 4000000 pages, or around 32 GB. It will be the first vacuum ever on that table, since no other mechanism triggers it, and if it steals all the available IOPS, it will finish in 200000000 / 50 / 1000 = 4000 seconds, killing prod for over an hour.

Telemetry workloads can easily generate 32 GB of data a day (I've seen more, but let's stick to that number). Production going down for an hour a day isn't good, and I consider it an emergency.
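
The same arithmetic as a tiny sketch, using only the assumptions stated above plus 8 kB pages:

freeze_max_age = 200_000_000   # default autovacuum_freeze_max_age
tuples_per_page = 50
page_size = 8192               # bytes
pages_per_second = 1000        # gp2 with the burst budget exhausted

pages = freeze_max_age // tuples_per_page         # 4,000,000 pages
table_gb = pages * page_size / 1024**3            # ~30.5 GiB, "around 32 GB"
first_vacuum_seconds = pages / pages_per_second   # 4000 s, over an hour
print(pages, round(table_gb, 1), first_vacuum_seconds)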

Now, here are the two fixes that reading the documentation leads you to while you're sleepy and trying to get prod back up:
 - raise autovacuum_freeze_max_age so VACUUM keeps sleeping;
 - rewrite the code to use batching, inserting more tuples per transaction.

We don't have a better recommendation mechanism for settings, and experience in tuning autovacuum in the right direction comes to people at the cost of a job or a company :)

Both ways do not fix the problem but just delay the inevitable. The ratio of "one hour of vacuum per day of operation" holds; you just delay it.
Let's say we had the same thing with 1000-record batched inserts, and moved autovacuum_freeze_max_age to the highest possible value. How long will the downtime last?

2**31 (max tid) * 1000 (tuples per tid) / 50 (tuples in page) / 1000 (pages per second) / 86400 (seconds in day) = 49 days.

This matches the highest estimate in Mandrill's report, so that might be what happened to them.

None of this would be needed if autovacuum came after 50 inserted tuples. It would just mark the page as all-visible and all-frozen and be gone, while the page is still in memory. That would get rid of the emergency altogether.

Is this an elaborate enough disaster scenario? :)


--
Darafei Praliaskouski

Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:
By the way, the Routine Vacuuming chapter of the documentation says:

"The sole disadvantage of increasing autovacuum_freeze_max_age (and
vacuum_freeze_table_age along with it) is that the pg_xact and
pg_commit_ts subdirectories of the database cluster will take more space

[...]

If [pg_xact and pg_commit_ts taking 0.5 and 20 GB, respectively]
is trivial compared to your total database size, setting
autovacuum_freeze_max_age to its maximum allowed value is recommended."

Maybe this should be qualified with "unless you have trouble with your
autovacuum keeping up" or so; or generally reworded?

This recommendation is in the mindset of "wraparound never happens".
If your database is large, you have more chances to hit it painfully, and if it's append-only even more so.

Alternative point of "if your database is super large and actively written, you may want to set autovacuum_freeze_max_age to even smaller values so that autovacuum load is more evenly spread over time" may be needed.



 
--
Darafei Praliaskouski

Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:
If it's months, we probably want limit vacuum to working at a pretty
slow rate, say 1% of the table size per hour or something.  If it's in
hours, we need to be a lot more aggressive.  Right now we have no
information to tell us which of those things is the case, so we'd just
be shooting in the dark.

The thing is, you don't need to spread your vacuum out in time if the rate of vacuuming matches the rate of table growth. Can we mark tuples/pages as all-visible and all-frozen, say, the moment they're pushed out of shared_buffers?
 

--
Darafei Praliaskouski

Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:

> Idea: look not on dead tuples, but on changes, just like ANALYZE does.
> It's my first patch on Postgres, it's probably all wrong but I hope it
> helps you get the idea.

This was suggested and rejected years ago:
https://www.postgresql.org/message-id/b970f20f-f096-2d3a-6c6d-ee887bd30cfb@2ndquadrant.fr

Thank you for sharing the link. I've read through the thread and see you posted two patches, the first being similar to but different from mine, and the second being about a different matter.

I don't see "rejected" there, just a common distraction of "you should also consider this" and time-out leading to "returned with feedback" at the end.

The thing is, we now have dead large production systems and post-mortems because your patch wasn't pushed back in 2016, so the situation is different. Let's push at least the first of your two patches, or mine.

Which one is better and why?

I believe mine, as it just follows a pattern already established and proven in autoanalyze. If vacuum comes and is unable to harvest some dead tuples, in your case it will come back again, while in mine it will just sleep until it gets new dead tuples, which looks better to me - there's no endless loop in case some dead tuples are stuck forever.
If someone thinks yours is better we may also consider it for autoanalyze?


--
Darafei Praliaskouski

Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:
The invoking autovacuum on table based on inserts, not only deletes
and updates, seems good idea to me. But in this case, I think that we
can not only freeze tuples but also update visibility map even when
setting all-visible. Roughly speaking  I think vacuum does the
following operations.

1. heap vacuum
2. HOT pruning
3. freezing tuples
4. updating visibility map (all-visible and all-frozen)
5. index vacuum/cleanup
6. truncation

With the proposed patch[1] we can control to do 5 or not. In addition
to that, another proposed patch[2] allows us to control 6.

[1] is committed, and [2] is nearing commit. It seems we now have all the infrastructure to teach autovacuum to run itself based on inserts and not hurt anybody?

... 
[1] https://commitfest.postgresql.org/22/1817/
[2] https://commitfest.postgresql.org/22/1981/


--
Darafei Praliaskouski

Re: Berserk Autovacuum (let's save next Mandrill)

From
Chris Travers
Date:


On Sat, Apr 6, 2019 at 9:56 AM Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
The invoking autovacuum on table based on inserts, not only deletes
and updates, seems good idea to me. But in this case, I think that we
can not only freeze tuples but also update visibility map even when
setting all-visible. Roughly speaking  I think vacuum does the
following operations.

1. heap vacuum
2. HOT pruning
3. freezing tuples
4. updating visibility map (all-visible and all-frozen)
5. index vacuum/cleanup
6. truncation

With the proposed patch[1] we can control to do 5 or not. In addition
to that, another proposed patch[2] allows us to control 6.

[1] is committed, [2] nears commit. Seems we have now all the infra to teach autovacuum to run itself based on inserts and not hurt anybody?

... 
[1] https://commitfest.postgresql.org/22/1817/
[2] https://commitfest.postgresql.org/22/1981/


Reading the thread and the patch, I generally agree that:
1.  With the current infrastructure, having autovacuum periodically scan append-only tables for freezing would be good, and
2.  I can't think of any cases where this would be a bad thing.

Also, I am not 100% convinced that the problems are avoidable by setting the wraparound prevention thresholds low enough.  In cases where one is doing large bulk inserts all the time, vacuum freeze could have a lot of work to do, and in some cases I could imagine IO storms making that difficult.

I plan to run some benchmarks on this to try to assess the performance impact of this patch in standard pgbench scenarios. I will also try to come up with some other benchmarks for append-only workloads.
 

--
Darafei Praliaskouski


--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: Berserk Autovacuum (let's save next Mandrill)

From
Alvaro Herrera
Date:
On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote:

> Alternative point of "if your database is super large and actively written,
> you may want to set autovacuum_freeze_max_age to even smaller values so
> that autovacuum load is more evenly spread over time" may be needed.

I don't think it's helpful to force emergency vacuuming more frequently;
quite the contrary, it's likely to cause even more issues.  We should
tweak autovacuum to perform freezing more preemptively instead.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On April 10, 2019 8:13:06 AM PDT, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote:
>
>> Alternative point of "if your database is super large and actively
>written,
>> you may want to set autovacuum_freeze_max_age to even smaller values
>so
>> that autovacuum load is more evenly spread over time" may be needed.
>
>I don't think it's helpful to force emergency vacuuming more
>frequently;
>quite the contrary, it's likely to cause even more issues.  We should
>tweak autovacuum to perform freezing more preemtively instead.

I still think the fundamental issue with making vacuum less painful is that all indexes have to be read entirely, even if there's not much work (say millions of rows frozen, hundreds removed). Without that issue we could vacuum much more frequently, and do it properly in insert-only workloads.


Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Berserk Autovacuum (let's save next Mandrill)

From
Chris Travers
Date:


On Wed, Apr 10, 2019 at 5:21 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On April 10, 2019 8:13:06 AM PDT, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote:
>
>> Alternative point of "if your database is super large and actively
>written,
>> you may want to set autovacuum_freeze_max_age to even smaller values
>so
>> that autovacuum load is more evenly spread over time" may be needed.
>
>I don't think it's helpful to force emergency vacuuming more
>frequently;
>quite the contrary, it's likely to cause even more issues.  We should
>tweak autovacuum to perform freezing more preemtively instead.

I still think the fundamental issue with making vacuum less painful is that the all indexes have to be read entirely. Even if there's not much work (say millions of rows frozen, hundreds removed). Without that issue we could vacuum much more frequently. And do it properly in insert only workloads.

So I see a couple of issues here, and I'm wondering what the best approach is.

The first is to just skip lazy_cleanup_index if no rows were removed.  Is this the approach you have in mind?  Or is that insufficient?

The second approach would be to replace the whole idea of this patch with a lazy freeze worker which would basically periodically do a vacuum freeze on relations matching certain criteria.  This could have a lower max worker count than autovacuum and therefore pose less of a threat in terms of total IO usage.

Thoughts?


Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.




--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: Berserk Autovacuum (let's save next Mandrill)

From
Tomas Vondra
Date:
On Thu, Apr 11, 2019 at 11:25:29AM +0200, Chris Travers wrote:
>   On Wed, Apr 10, 2019 at 5:21 PM Andres Freund <andres@anarazel.de> wrote:
>
>     Hi,
>
>     On April 10, 2019 8:13:06 AM PDT, Alvaro Herrera
>     <alvherre@2ndquadrant.com> wrote:
>     >On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote:
>     >
>     >> Alternative point of "if your database is super large and actively
>     >written,
>     >> you may want to set autovacuum_freeze_max_age to even smaller values
>     >so
>     >> that autovacuum load is more evenly spread over time" may be needed.
>     >
>     >I don't think it's helpful to force emergency vacuuming more
>     >frequently;
>     >quite the contrary, it's likely to cause even more issues.  We should
>     >tweak autovacuum to perform freezing more preemtively instead.
>
>     I still think the fundamental issue with making vacuum less painful is
>     that the all indexes have to be read entirely. Even if there's not much
>     work (say millions of rows frozen, hundreds removed). Without that issue
>     we could vacuum much more frequently. And do it properly in insert only
>     workloads.
>
>   So I see a couple of issues here and wondering what the best approach is.
>   The first is to just skip lazy_cleanup_index if no rows were removed.  Is
>   this the approach you have in mind?  Or is that insufficient?

I don't think that's what Andres had in mind, as he explicitly mentioned
removed rows. So just skipping lazy_cleanup_index when there were no
deleted rows would not help in that case.

What I think we could do is simply leave the tuple pointers in the table
(and indexes) when there are only very few of them, and only do the
expensive table/index cleanup once there's enough of them.

>   The second approach would be to replace the whole idea of this patch with
>   a lazy freeze worker which would basically periodically do a vacuum freeze
>   on relations matching certain criteria.  This could have a lower max
>   workers than autovacuum and therefore less of a threat in terms of total
>   IO usage.
>   Thoughts?
>

Not sure. I find it rather difficult to manage more and more different
types of cleanup workers.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:


On Wed, Apr 10, 2019 at 6:13 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote:

> Alternative point of "if your database is super large and actively written,
> you may want to set autovacuum_freeze_max_age to even smaller values so
> that autovacuum load is more evenly spread over time" may be needed.

I don't think it's helpful to force emergency vacuuming more frequently;
quite the contrary, it's likely to cause even more issues.  We should
tweak autovacuum to perform freezing more preemtively instead.

Okay. What would be your recommendation for the case of Mandrill running current Postgres 11? Which parameters shall they tune and to which values?

 

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Darafei Praliaskouski

Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:

>I don't think it's helpful to force emergency vacuuming more
>frequently;
>quite the contrary, it's likely to cause even more issues.  We should
>tweak autovacuum to perform freezing more preemtively instead.

I still think the fundamental issue with making vacuum less painful is that the all indexes have to be read entirely. Even if there's not much work (say millions of rows frozen, hundreds removed). Without that issue we could vacuum much more frequently. And do it properly in insert only workloads.

Deletion of hundreds of rows on default settings will cause the same behavior now.
If there were 0 updates, the index cleanup will currently be skipped.

https://commitfest.postgresql.org/22/1817/ got merged. This means autovacuum can have two separate thresholds - the current one, on dead tuples, triggering VACUUM the same way it does now, and a new one, on inserted tuples only, triggering VACUUM (INDEX_CLEANUP FALSE)?

--
Darafei Praliaskouski

Re: Berserk Autovacuum (let's save next Mandrill)

From
Masahiko Sawada
Date:
On Sun, Apr 14, 2019 at 4:51 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Apr 11, 2019 at 11:25:29AM +0200, Chris Travers wrote:
> >   On Wed, Apr 10, 2019 at 5:21 PM Andres Freund <andres@anarazel.de> wrote:
> >
> >     Hi,
> >
> >     On April 10, 2019 8:13:06 AM PDT, Alvaro Herrera
> >     <alvherre@2ndquadrant.com> wrote:
> >     >On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote:
> >     >
> >     >> Alternative point of "if your database is super large and actively
> >     >written,
> >     >> you may want to set autovacuum_freeze_max_age to even smaller values
> >     >so
> >     >> that autovacuum load is more evenly spread over time" may be needed.
> >     >
> >     >I don't think it's helpful to force emergency vacuuming more
> >     >frequently;
> >     >quite the contrary, it's likely to cause even more issues.  We should
> >     >tweak autovacuum to perform freezing more preemtively instead.
> >
> >     I still think the fundamental issue with making vacuum less painful is
> >     that the all indexes have to be read entirely. Even if there's not much
> >     work (say millions of rows frozen, hundreds removed). Without that issue
> >     we could vacuum much more frequently. And do it properly in insert only
> >     workloads.
> >
> >   So I see a couple of issues here and wondering what the best approach is.
> >   The first is to just skip lazy_cleanup_index if no rows were removed.  Is
> >   this the approach you have in mind?  Or is that insufficient?
>
> I don't think that's what Andres had in mind, as he explicitly mentioned
> removed rows. So just skipping lazy_cleanup_index when there were no
> deleted would not help in that case.
>
> What I think we could do is simply leave the tuple pointers in the table
> (and indexes) when there are only very few of them, and only do the
> expensive table/index cleanup once there's anough of them.

Yeah, we now have infrastructure that skips index vacuuming by
leaving the tuple pointers. So we can then have a threshold for
autovacuum to invoke index vacuuming. Another idea is to delete
index entries more actively by index lookups instead of scanning
the whole index. That has been proposed[1].

[1] I couldn't get the URL of the thread right now for some reason but
the thread subject is " [WIP] [B-Tree] Retail IndexTuple deletion".

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Berserk Autovacuum (let's save next Mandrill)

From
Masahiko Sawada
Date:
On Mon, Apr 15, 2019 at 10:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Apr 14, 2019 at 4:51 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > On Thu, Apr 11, 2019 at 11:25:29AM +0200, Chris Travers wrote:
> > >   On Wed, Apr 10, 2019 at 5:21 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > >     Hi,
> > >
> > >     On April 10, 2019 8:13:06 AM PDT, Alvaro Herrera
> > >     <alvherre@2ndquadrant.com> wrote:
> > >     >On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote:
> > >     >
> > >     >> Alternative point of "if your database is super large and actively
> > >     >written,
> > >     >> you may want to set autovacuum_freeze_max_age to even smaller values
> > >     >so
> > >     >> that autovacuum load is more evenly spread over time" may be needed.
> > >     >
> > >     >I don't think it's helpful to force emergency vacuuming more
> > >     >frequently;
> > >     >quite the contrary, it's likely to cause even more issues.  We should
> > >     >tweak autovacuum to perform freezing more preemtively instead.
> > >
> > >     I still think the fundamental issue with making vacuum less painful is
> > >     that the all indexes have to be read entirely. Even if there's not much
> > >     work (say millions of rows frozen, hundreds removed). Without that issue
> > >     we could vacuum much more frequently. And do it properly in insert only
> > >     workloads.
> > >
> > >   So I see a couple of issues here and wondering what the best approach is.
> > >   The first is to just skip lazy_cleanup_index if no rows were removed.  Is
> > >   this the approach you have in mind?  Or is that insufficient?
> >
> > I don't think that's what Andres had in mind, as he explicitly mentioned
> > removed rows. So just skipping lazy_cleanup_index when there were no
> > deleted would not help in that case.
> >
> > What I think we could do is simply leave the tuple pointers in the table
> > (and indexes) when there are only very few of them, and only do the
> > expensive table/index cleanup once there's anough of them.
>
> Yeah, we now have an infrastructure that skips index vacuuming by
> leaving the tuples pointers. So we then can have a threshold for
> autovacuum to invoke index vacuuming. Or an another idea is to delete
> index entries more actively by index looking up instead of scanning
> the whole index. It's proposed[1].
>
> [1] I couldn't get the URL of the thread right now for some reason but
> the thread subject is " [WIP] [B-Tree] Retail IndexTuple deletion".

Now I got https://www.postgresql.org/message-id/425db134-8bba-005c-b59d-56e50de3b41e%40postgrespro.ru

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Berserk Autovacuum (let's save next Mandrill)

From
Masahiko Sawada
Date:
On Sun, Apr 14, 2019 at 9:59 PM Darafei "Komяpa" Praliaskouski
<me@komzpa.net> wrote:
>>
>>
>> >I don't think it's helpful to force emergency vacuuming more
>> >frequently;
>> >quite the contrary, it's likely to cause even more issues.  We should
>> >tweak autovacuum to perform freezing more preemptively instead.
>>
>> I still think the fundamental issue with making vacuum less painful is that all the indexes have to be read
>> entirely. Even if there's not much work (say millions of rows frozen, hundreds removed). Without that issue we could
>> vacuum much more frequently. And do it properly in insert only workloads.
>
>
> Deletion of hundreds of rows on default settings will cause the same behavior now.
> If there were 0 updates, the index cleanup will currently be skipped.
>
> https://commitfest.postgresql.org/22/1817/ got merged. This means Autovacuum can have two separate thresholds - the
> current, on dead tuples, triggering VACUUM the same way it triggers it now, and a new one, on inserted tuples only,
> triggering VACUUM (INDEX_CLEANUP FALSE)?
>

Agreed.

To invoke autovacuum even on insert-only tables we would need to check
the number of inserted tuples since the last vacuum. I think we can keep
track of the number of inserted tuples since the last vacuum in the stats
collector and add a threshold to invoke vacuum with INDEX_CLEANUP =
false. If an autovacuum worker confirms that the number of inserted
tuples exceeds the threshold it invokes vacuum with INDEX_CLEANUP =
false. However, if the number of dead tuples also exceeds the
autovacuum thresholds (autovacuum_vacuum_threshold and
autovacuum_vacuum_scale_factor) it should invoke vacuum with
INDEX_CLEANUP = true. Therefore the new threshold makes sense only when
it's lower than the autovacuum thresholds.

I guess we can have one new GUC parameter to control the scale factor.
Since only relatively large tables will require this feature, we might
not need a threshold based on the number of tuples.
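
To make the triggering rule concrete, here is a minimal self-contained C
sketch of the decision described above. It is only an illustration, not
patch code: the counter and threshold names are placeholders for whatever
the stats collector and the GUCs would actually provide.

#include <stdbool.h>
#include <stdio.h>

struct vacuum_decision
{
    bool do_vacuum;         /* run autovacuum on the table at all? */
    bool index_cleanup;     /* run it with INDEX_CLEANUP = true? */
};

/*
 * The existing dead-tuple threshold wins and keeps index cleanup on;
 * the new insert threshold only fires when the dead-tuple threshold
 * does not, and then it requests a vacuum with INDEX_CLEANUP = false.
 */
static struct vacuum_decision
decide_autovacuum(double dead_tuples, double inserted_since_vacuum,
                  double dead_threshold, double insert_threshold)
{
    struct vacuum_decision d = {false, true};

    if (dead_tuples > dead_threshold)
        d.do_vacuum = true;
    else if (inserted_since_vacuum > insert_threshold)
    {
        d.do_vacuum = true;
        d.index_cleanup = false;
    }
    return d;
}

int
main(void)
{
    /* an insert-only table: many inserts, no dead tuples */
    struct vacuum_decision d = decide_autovacuum(0, 12000000, 50, 10000000);

    printf("vacuum: %d, index cleanup: %d\n", d.do_vacuum, d.index_cleanup);
    return 0;
}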

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Berserk Autovacuum (let's save next Mandrill)

From
Amit Kapila
Date:
On Tue, Jul 23, 2019 at 1:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
> To invoke autovacuum even on insert-only tables we would need check
> the number of inserted tuples since last vacuum. I think we can keep
> track of the number of inserted tuples since last vacuum to the stats
> collector and add the threshold to invoke vacuum with INDEX_CLEANUP =
> false. If an autovacuum worker confirms that the number of inserted
> tuples exceeds the threshold it invokes vacuum with INDEX_CLEANUP =
> false. However if the number of dead tuples also exceeds the
> autovacuum thresholds (autovacuum_vacuum_threshold and
> autovacuum_vacuum_scale_factor) it should invoke vacuum with
> INDEX_CLEANUP = true. Therefore new threshold makes sense only when
> it's lower than the autovacuum thresholds.
>
> I guess we can have one new GUC parameter to control scale factor.
> Since only relatively large tables will require this feature we might
> not need the threshold based the number of tuples.
>

Generally speaking, having more guc's for autovacuum and that too
which are in some way dependent on existing guc's sounds a bit scary,
but OTOH whatever you wrote makes sense and can help the scenarios
which this thread is trying to deal with.   Have you given any thought
to what Alvaro mentioned up-thread "certain tables would have some
sort of partial scan that sets the visibility map.  There's no reason
to invoke the whole vacuuming machinery.  I don't think this is
limited to append-only tables, but
rather those are just the ones that are affected the most."?

This thread seems to be stalled for the reason that we don't have a
clear consensus on what is the right solution for the problem being
discussed.  Alvaro, does anyone have any thoughts on how we can move forward
with this work?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Berserk Autovacuum (let's save next Mandrill)

From
Masahiko Sawada
Date:
On Tue, Sep 10, 2019 at 8:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 23, 2019 at 1:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> >
> > To invoke autovacuum even on insert-only tables we would need check
> > the number of inserted tuples since last vacuum. I think we can keep
> > track of the number of inserted tuples since last vacuum to the stats
> > collector and add the threshold to invoke vacuum with INDEX_CLEANUP =
> > false. If an autovacuum worker confirms that the number of inserted
> > tuples exceeds the threshold it invokes vacuum with INDEX_CLEANUP =
> > false. However if the number of dead tuples also exceeds the
> > autovacuum thresholds (autovacuum_vacuum_threshold and
> > autovacuum_vacuum_scale_factor) it should invoke vacuum with
> > INDEX_CLEANUP = true. Therefore new threshold makes sense only when
> > it's lower than the autovacuum thresholds.
> >
> > I guess we can have one new GUC parameter to control scale factor.
> > Since only relatively large tables will require this feature we might
> > not need the threshold based the number of tuples.
> >
>
> Generally speaking, having more guc's for autovacuum and that too
> which are in some way dependent on existing guc's sounds bit scary,
> but OTOH whatever you wrote makes sense and can help the scenarios
> which this thread is trying to deal with.   Have you given any thought
> to what Alvaro mentioned up-thread "certain tables would have some
> sort of partial scan that sets the visibility map.  There's no reason
> to invoke the whole vacuuming machinery.  I don't think this is
> limited to append-only tables, but
> rather those are just the ones that are affected the most."?
>

Speaking of partial scans, I've considered before that we could use WAL
to find which pages have much garbage and are not all-visible. We could
then vacuum only the particular parts of a table where garbage collection
is most effective, instead of the whole table. I've shared some results
of that at PGCon and it's still in PoC state.

Also, to address the issue of updating the VM of mostly-append-only tables
I considered some possible solutions:

1. With INDEX_CLEANUP = false and TRUNCATE = false, vacuum does HOT
pruning, vacuums the table and updates the VM. In addition to updating
the VM we need to do the other two operations, but since mostly-insert-only
tables would have little garbage, the HOT pruning and table vacuuming
should be a light workload. This is what I proposed up-thread.
2. This may have already been discussed before, but we could update the
VM when doing HOT pruning during a SELECT operation. Since this affects
SELECT performance it should be enabled only on particular tables at the
user's request.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Berserk Autovacuum (let's save next Mandrill)

From
Amit Kapila
Date:
On Fri, Sep 13, 2019 at 8:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Sep 10, 2019 at 8:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Generally speaking, having more guc's for autovacuum and that too
> > which are in some way dependent on existing guc's sounds bit scary,
> > but OTOH whatever you wrote makes sense and can help the scenarios
> > which this thread is trying to deal with.   Have you given any thought
> > to what Alvaro mentioned up-thread "certain tables would have some
> > sort of partial scan that sets the visibility map.  There's no reason
> > to invoke the whole vacuuming machinery.  I don't think this is
> > limited to append-only tables, but
> > rather those are just the ones that are affected the most."?
> >
>
> Speaking of partial scan I've considered before that we could use WAL
> to find which pages have garbage much and not all-visible pages. We
> can vacuum only a particular part of table that is most effective of
> garbage collection instead of whole tables. I've shared some results
> of that at PGCon and it's still in PoC state.
>
> Also, to address the issue of updating VM of mostly-append-only tables
> I considered some possible solutions:
>
> 1. Using INDEX_CLEANUP = false and TRUNCATE = false vacuum does hot
> pruning, vacuuming table and updating VM. In addition to updating VM
> we need to do other two operations but since the mostly-insert-only
> tables would have less garbage the hot pruning and vacuuming table
> should be light workload. This is what I proposed on up-thread.
>

Yes, this is an option, but it might be better if we can somehow avoid
triggering the vacuum machinery.

> 2. This may have already been discussed before but we could update
> VM when hot pruning during SELECT operation. Since this affects SELECT
> performance it should be enabled on only particular tables by user
> request.
>

Yeah, doing anything additional in SELECTs can be tricky.  Think of
a case where there is actually nothing to prune on the page; if we still
run the visibility checks and then mark the visibility map, that can be
a noticeable overhead.  OTOH, I think this will be a one-time overhead,
because after the first scan the visibility map will be updated and
future scans don't need to update it unless someone has updated that
page.  I was wondering why not do this during write workloads.  For
example, when an INSERT operation finds that there is no space on the
current page and it has to move to the next page, it can check whether
the page (that doesn't have space to accommodate the current tuple) can
be marked all-visible.  In this case, we would have already done the
costly part of the operation, which is to read/lock the buffer.
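
As a very rough C sketch of the shape of that idea (the page structure
and helper below are invented for illustration; they are not the actual
heap or visibility-map APIs):

#include <stdbool.h>
#include <stddef.h>

/* Invented, simplified stand-ins for a heap page; not PostgreSQL types. */
typedef struct
{
    size_t free_space;                 /* bytes left on the page */
    int    ntuples;                    /* tuples currently on the page */
    bool   tuple_visible_to_all[256];  /* simplified per-tuple visibility */
    bool   all_visible;                /* page-level flag / VM bit */
} SketchPage;

/*
 * When an INSERT cannot fit its tuple on the current page and is about
 * to move on, it already holds the buffer, so checking visibility and
 * setting the all-visible flag here is cheap compared with re-reading
 * the page later just to set the visibility map.
 */
bool
insert_fits_or_mark_all_visible(SketchPage *page, size_t tuple_size)
{
    if (page->free_space >= tuple_size)
        return true;                /* caller inserts into this page */

    if (!page->all_visible)
    {
        bool all_visible = true;

        for (int i = 0; i < page->ntuples; i++)
        {
            if (!page->tuple_visible_to_all[i])
            {
                all_visible = false;
                break;
            }
        }
        if (all_visible)
            page->all_visible = true;   /* would also set the VM bit */
    }
    return false;                   /* caller moves on to the next page */
}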

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Berserk Autovacuum (let's save next Mandrill)

From
Tomas Vondra
Date:
Hi,

This patch is currently in "needs review" state, but that seems quite
wrong - there's been a lot of discussions about how we might improve
behavior for append-only-tables, but IMO there's no clear consensus nor
a patch that we might review.

So I think this should be either "waiting on author" or maybe "rejected
with feedback". Is there any chance of getting a reviewable patch in the
current commitfest? If not, I propose to mark it as RWF.

I still hope we can improve this somehow in time for PG13. The policy is
not to allow new non-trivial patches in the last CF, but hopefully this
might be considered an old patch.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Tue, 2020-01-07 at 19:05 +0100, Tomas Vondra wrote:
> This patch is currently in "needs review" state, but that seems quite
> wrong - there's been a lot of discussions about how we might improve
> behavior for append-only-tables, but IMO there's no clear consensus nor
> a patch that we might review.
> 
> So I think this should be either "waiting on author" or maybe "rejected
> with feedback". Is there any chance of getting a reviewable patch in the
> current commitfest? If not, I propose to mark it as RWF.
> 
> I still hope we can improve this somehow in time for PG13. The policy is
> not to allow new non-trivial patches in the last CF, but hopefully this
> might be considered an old patch.

I think that no conclusion was reached because there are *many* things
that could be improved, and *many* interesting and ambitious ideas were
vented.

But I think it would be good to have *something* that addresses the immediate
problem (INSERT-only tables are autovacuumed too late), as long as
that does not have negative side-effects or blocks further improvements.

I don't feel totally well with the very simplistic approach of this
patch (use the same metric to trigger autoanalyze and autovacuum),
but what about this:

- a new table storage option autovacuum_vacuum_insert_threshold,
  perhaps a GUC of the same name, by default deactivated.

- if tabentry->tuples_inserted exceeds this threshold, but not one
  of the others, launch autovacuum with index_cleanup=off.

How would you feel about that?

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Mon, 2020-03-02 at 14:57 +0100, I wrote:
> But I think it would be good to have *something* that addresses the immediate
> problem (INSERT-only tables are autovacuumed too late), as long as
> that does not have negative side-effects or blocks further improvements.
> 
> I don't feel totally well with the very simplistic approach of this
> patch (use the same metric to trigger autoanalyze and autovacuum),
> but what about this:
> 
> - a new table storage option autovacuum_vacuum_insert_threshold,
>   perhaps a GUC of the same name, by default deactivated.
> 
> - if tabentry->tuples_inserted exceeds this threshold, but not one
>   of the others, launch autovacuum with index_cleanup=off.

As a more substantial base for discussion, here is a patch that:

- introduces a GUC and reloption "autovacuum_vacuum_insert_limit",
  default 10000000

- introduces a statistics counter "inserts_since_vacuum" per table
  that gets reset to 0 after vacuum

- causes autovacuum to run without cleaning up indexes if
  inserts_since_vacuum > autovacuum_vacuum_insert_limit
  and there is no other reason for an autovacuum

No doc patch is included yet, and perhaps the new counter should
be shown in "pg_stat_user_tables".

Yours,
Laurenz Albe

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Tue, 2020-03-03 at 16:28 +0100, Laurenz Albe wrote:
> As a more substantial base for discussion, here is a patch that:
> 
> - introduces a GUC and reloption "autovacuum_vacuum_insert_limit",
>   default 10000000
> 
> - introduces a statistics counter "inserts_since_vacuum" per table
>   that gets reset to 0 after vacuum
> 
> - causes autovacuum to run without cleaning up indexes if
>   inserts_since_vacuum > autovacuum_vacuum_insert_limit
>   and there is no other reason for an autovacuum

I just realized that the exercise is pointless unless that
autovacuum also runs with FREEZE on.

Updated patch attached.

Yours,
Laurenz Albe

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Thu, 5 Mar 2020 at 04:15, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> I just realized that the exercise is pointless unless that
> autovacuum also runs with FREEZE on.

I think we need to move forward with doing something to cope with
INSERT-only tables not being auto-vacuumed.

I think the patch you have is something along the lines of what I'd
have imagined we should do. However, there are a few things that I'd
do a different way.

1. I'd go for 2 new GUCs and reloptions.
autovacuum_vacuum_insert_threshold (you're currently calling this
autovacuum_vacuum_insert_limit. I don't see why the word "limit" is
relevant here). The other GUC I think should be named
autovacuum_vacuum_insert_scale_factor and these should work exactly
the same way as autovacuum_vacuum_threshold and
autovacuum_vacuum_scale_factor, but be applied in a similar way to the
vacuum settings, but only be applied after we've checked to ensure the
table is not otherwise eligible to be vacuumed.

2. I believe you're right in setting the freeze_min_age to 0 rather
than the configured freeze_min_age.  My understanding of freeze_min_age
is that it keeps us from freezing tuples too pro-actively, as in many
workloads a freshly INSERTed tuple is more likely to receive an UPDATE
than an old tuple is.  e.g. something like new orders having their
status updated various times until the order is complete, where it's
not updated ever again.  In the INSERT-only case, there seems to be not
much reason not to just freeze right away.

3. The name "insert_only" does not seem the best for the new boolean
variable that you're using in various places.  That name seems to be
too closely related to our current intended use case. Maybe
skip_index_cleanup is more to the point.

4. Are you sure you mean "Maximum" here?  Isn't it the minimum? At
least it will be once you add both options. Otherwise, I think Maximum
is not the correct word. Perhaps "The threshold"

+ {"autovacuum_vacuum_insert_limit", PGC_SIGHUP, AUTOVACUUM,
+ gettext_noop("Maximum number of tuple inserts prior to vacuum."),
+ NULL
+ },


5. I think the new field in this struct should be named vacuum_insert_threshold

@@ -252,6 +252,7 @@ typedef struct AutoVacOpts
 {
  bool enabled;
  int vacuum_threshold;
+ int vacuum_ins_limit;

6. Are you sure you meant to default this to 50?

index e58e4788a8..9d96d58ed2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -598,6 +598,8 @@
 #autovacuum_naptime = 1min # time between autovacuum runs
 #autovacuum_vacuum_threshold = 50 # min number of row updates before
  # vacuum
+#autovacuum_vacuum_insert_limit = 50 # max number of row inserts before
+ # vacuum

Seems excessive given there's no scale factor in the current patch.

7. I know you know.... missing docs... would be good to get those.

8. Should we care when setting the insert counter back to 0 if
auto-vacuum has skipped pages?

9. You should add a new column to the pg_stat_all_tables view to allow
visibility of the insert since the last vacuum.  The column should be
named n_ins_since_vacuum. This seems like the best combination of
n_mod_since_analyze and n_tup_ins.

10. I'm slightly worried about the case where we don't quite trigger a
normal vacuum but trigger a vacuum due to INSERTs then skip cleaning
up the indexes but proceed to leave dead index entries causing indexes
to become bloated.  It does not seem impossible that given the right
balance of INSERTs and UPDATE/DELETEs that this could happen every
time and the indexes would just become larger and larger.

It's pretty easy to see this in action with:

create extension if not exists pgstattuple;
create table t0 (a int primary key);
alter table t0 set (autovacuum_enabled=off);
insert into t0 select generate_Series(1,1000000);

delete from t0 where a&1=0;
vacuum (index_cleanup off) t0;
insert into t0 select generate_series(2,1000000,2);
select * from pgstattuple('t0'), pg_relation_size('t0') as t0_size;
select n_dead_tup from pg_stat_all_tables where relid = 't0'::regclass;
-- repeat this a few times and watch the indexes bloat

11. We probably do also need to debate if we want this on or off by
default.   I'd have leaned towards enabling by default if I'd not
personally witnessed the fact that people rarely* increase auto-vacuum
to run faster than the standard cost settings. I've seen hundreds of
servers over the years with all workers busy for days on something
they'll never finish quickly enough.  We increased those settings 10x
in PG12, so there will be fewer people around suffering from that now,
but even after having reduced the vacuum_cost_delay x10 over the PG11
settings, it's by no means fast enough for everyone.  I've mixed
feelings about giving auto-vacuum more work to do for those people, so
perhaps the best option is to keep this off by default so as not to
affect the people who don't tune auto-vacuum.  They'll just suffer the
pain all at once when they hit max freeze age instead of more
gradually with the additional load on the workers.   At least adding
this feature gives the people who do tune auto-vacuum some ability to
handle read-only tables in some sane way.

An alternative way of doing it would be to set the threshold to some
number of million tuples and set the scale_factor to 0.2 so that it
only has an effect on larger tables, of which people generally only
have a smallish number.
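
(For a sense of scale, using the patch's 10-million-tuple default as the
threshold and the usual base-threshold + scale_factor * reltuples formula
with scale_factor = 0.2: a 1-million-row table would be vacuumed after
about 10.2 million inserts, a 100-million-row table after about 30
million.)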


* My opinion may be biased as the sample of people did arrive asking for help



Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:
Hi,

Thanks Laurenz for taking action on this and writing a better patch
than my initial one.
This will help avoid both Mandrill-like downtimes and get Index Only
Scans to just work on large telemetry databases like the one I was
responsible for back when I was in Juno.

On Thu, Mar 5, 2020 at 9:40 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 5 Mar 2020 at 04:15, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > I just realized that the exercise is pointless unless that
> > autovacuum also runs with FREEZE on.

> 8. Should we care when setting the insert counter back to 0 if
> auto-vacuum has skipped pages?

I believe it would be enough just to leave a comment about this in code.

> 10. I'm slightly worried about the case where we don't quite trigger a
> normal vacuum but trigger a vacuum due to INSERTs then skip cleaning
> up the indexes but proceed to leave dead index entries causing indexes
> to become bloated.  It does not seem impossible that given the right
> balance of INSERTs and UPDATE/DELETEs that this could happen every
> time and the indexes would just become larger and larger.

Can we not reset statistics about dead tuples upon index-skipping
vacuum, since we didn't really take care of them?

> 11. We probably do also need to debate if we want this on or off by
> default.   I'd have leaned towards enabling by default if I'd not
> personally witnessed the fact that people rarely* increase auto-vacuum
> to run faster than the standard cost settings. I've seen hundreds of
> servers over the years with all workers busy for days on something
> they'll never finish quickly enough.  We increased those settings 10x
> in PG12, so there will be fewer people around suffering from that now,
> but even after having reduced the vacuum_cost_delay x10 over the PG11
> settings, it's by no means fast enough for everyone.  I've mixed
> feelings about giving auto-vacuum more work to do for those people, so
> perhaps the best option is to keep this off by default so as not to
> affect the people who don't tune auto-vacuum.  They'll just suffer the
> pain all at once when they hit max freeze age instead of more
> gradually with the additional load on the workers.   At least adding
> this feature gives the people who do tune auto-vacuum some ability to
> handle read-only tables in some sane way.

That's exactly the situation we're trying to avoid with this patch.
Suffering all at once takes large production deployments down for
weeks, and that gets into the news.
In current cloud setups it's plain impossible to read the whole
database at all, let alone rewrite it, within IO budgets.
I say we should enable this setting by default.
If my calculations are correct, that will freeze the table once for each
new gigabyte of data written, which usually fits into burst read
thresholds.
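
(As a rough sanity check of that claim: assuming an average row width of
about 100 bytes, the proposed default threshold of 10 million inserted
rows works out to roughly 1 GB of new heap data per insert-triggered
vacuum.)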




-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Thu, 2020-03-05 at 19:40 +1300, David Rowley wrote:
> I think we need to move forward with doing something to cope with
> INSERT-only tables not being auto-vacuumed.
> 
> I think the patch you have is something along the lines to what I'd
> have imagined we should do. However, there are a few things that I'd
> do a different way.

Thanks for the review, and that's good news.

> 1. I'd go for 2 new GUCs and reloptions.
> autovacuum_vacuum_insert_threshold (you're currently calling this
> autovacuum_vacuum_insert_limit. I don't see why the word "limit" is
> relevant here). The other GUC I think should be named
> autovacuum_vacuum_insert_scale_factor and these should work exactly
> the same way as autovacuum_vacuum_threshold and
> autovacuum_vacuum_scale_factor, but be applied in a similar way to the
> vacuum settings, but only be applied after we've checked to ensure the
> table is not otherwise eligible to be vacuumed.

Yes, "threshold" is better than "limit" I have renamed the GUC and
the reloption.

I disagree about the scale_factor (and have not added it to the
updated version of the patch).  If we have a scale_factor, then the
time between successive autovacuum runs would increase as the table
gets bigger, which defeats the purpose of reducing the impact of each
autovacuum run.

Since autovacuum skips pages where it has nothing to do, we can expect
that runs on a large table won't be much more expensive than runs on a
smaller table, right?

> 3. The name "insert_only" does not seem the best for the new boolean
> variable that you're using in various places.  That name seems to be
> too closely related to our current intended use case. Maybe
> skip_index_cleanup is more to the point.

I originally called the variable "skip_indexes", but when I decided
that such vacuum runs also aggressively freeze the table, I thought
that the name was misleading and renamed it.

I won't put up a fight about this, though.

> 4. Are you sure you mean "Maximum" here?  Isn't it the minimum? At
> least it will be once you add both options. Otherwise, I think Maximum
> is not the correct word. Perhaps "The threshold"
> 
> + {"autovacuum_vacuum_insert_limit", PGC_SIGHUP, AUTOVACUUM,
> + gettext_noop("Maximum number of tuple inserts prior to vacuum."),
> + NULL
> + },

I had actually been debating whether to use "maximum" or "minimum".
I realize now that this strange uncertainty stems from the fact that
there is (yet) only a single parameter to govern this.

The updated patch describes the GUC as
"Number of tuple inserts prior to vacuum."

> 5. I think the new field in this struct should be named vacuum_insert_threshold
> 
> @@ -252,6 +252,7 @@ typedef struct AutoVacOpts
>  {
>   bool enabled;
>   int vacuum_threshold;
> + int vacuum_ins_limit;

I agree as above, renamed.

> 6. Are you sure you meant to default this to 50?
> 
> index e58e4788a8..9d96d58ed2 100644
> --- a/src/backend/utils/misc/postgresql.conf.sample
> +++ b/src/backend/utils/misc/postgresql.conf.sample
> @@ -598,6 +598,8 @@
>  #autovacuum_naptime = 1min # time between autovacuum runs
>  #autovacuum_vacuum_threshold = 50 # min number of row updates before
>   # vacuum
> +#autovacuum_vacuum_insert_limit = 50 # max number of row inserts before
> + # vacuum
> 
> Seems excessive given there's no scale factor in the current patch.

That was a mistake.
I chose 10000000 as the actual default value, but forgot to put the
same value into "postgresql.conf".

> 7. I know you know.... missing docs... would be good to get those.

The updated version of the patch has documentation.

I just wanted to get a feeling if my patch would be killed cold before
I went to the effort of writing documentation.

> 8. Should we care when setting the insert counter back to 0 if
> auto-vacuum has skipped pages?

Since this is only an approximate value anyway, I decided not to care.
I don't know if that is acceptable.

> 9. You should add a new column to the pg_stat_all_tables view to allow
> visibility of the insert since the last vacuum.  The column should be
> named n_ins_since_vacuum. This seems like the best combination of
> n_mod_since_analyze and n_tup_ins.

Done.

> 10. I'm slightly worried about the case where we don't quite trigger a
> normal vacuum but trigger a vacuum due to INSERTs then skip cleaning
> up the indexes but proceed to leave dead index entries causing indexes
> to become bloated.  It does not seem impossible that given the right
> balance of INSERTs and UPDATE/DELETEs that this could happen every
> time and the indexes would just become larger and larger.

I understand.

This might particularly be a problem with larger tables, where
a normal autovacuum is rare because of the scale_factor.

Perhaps we can take care of the problem by *not* skipping index
cleanup if "changes_since_analyze" is substantially greater than 0.

What do you think?

> 11. We probably do also need to debate if we want this on or off by
> default.   I'd have leaned towards enabling by default if I'd not
> personally witnessed the fact that people rarely* increase auto-vacuum
> to run faster than the standard cost settings. I've seen hundreds of
> servers over the years with all workers busy for days on something
> they'll never finish quickly enough.  We increased those settings 10x
> in PG12, so there will be fewer people around suffering from that now,
> but even after having reduced the vacuum_cost_delay x10 over the PG11
> settings, it's by no means fast enough for everyone.  I've mixed
> feelings about giving auto-vacuum more work to do for those people, so
> perhaps the best option is to keep this off by default so as not to
> affect the people who don't tune auto-vacuum.  They'll just suffer the
> pain all at once when they hit max freeze age instead of more
> gradually with the additional load on the workers.   At least adding
> this feature gives the people who do tune auto-vacuum some ability to
> handle read-only tables in some sane way.
> 
> An alternative way of doing it would be to set the threshold to some
> number of million tuples and set the scale_factor to 0.2 so that it
> only has an effect on larger tables, of which generally people only
> have a smallish number of.

Yes, I think that disabling this by default defeats the purpose.

Knowledgeable people can avoid the problem today by manually scheduling
VACUUM runs on insert-only tables, and the functionality proposed here
is specifically to improve the lives of people who don't know enough
to tune autovacuum.

My original idea was to set the threshold to 10 million and have no scale
factor.


Updated patch attached.

Yours,
Laurenz Albe

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Thu, Mar 05, 2020 at 03:27:31PM +0100, Laurenz Albe wrote:
> On Thu, 2020-03-05 at 19:40 +1300, David Rowley wrote:
> > 1. I'd go for 2 new GUCs and reloptions.
> > autovacuum_vacuum_insert_scale_factor and these should work exactly
> 
> I disagree about the scale_factor (and have not added it to the
> updated version of the patch).  If we have a scale_factor, then the
> time between successive autovacuum runs would increase as the table
> gets bigger, which defeats the purpose of reducing the impact of each
> autovacuum run.

I would vote to include scale factor.  You're right that a nonzero scale factor
would cause vacuum to run with geometrically decreasing frequency.  The same
thing currently happens with autoanalyze as a table grows in size.  I found
that our monthly-partitioned tables were being analyzed too infrequently
towards the end of the month.  (At the beginning of the month, 10% is 2.4 hours
worth of timeseries data, but at the end of the month 10% is 3 days, which was
an issue because queries against the previous day could see rowcount estimates near zero.)
If someone wanted to avoid that, they'd set scale_factor=0.  I think this patch
should parallel what's already in place, and we can add documention for the
behavior if need be.  Possibly scale_factor should default to zero, which I
think might make sense since insert-only tables seem to be the main target of
this patch.

> +++ b/doc/src/sgml/maintenance.sgml
> +   <para>
> +    Tables that have received more than
> +    <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>
> +    inserts since they were last vacuumed and are not eligible for vacuuming
> +    based on the above criteria will be vacuumed to reduce the impact of a future
> +    anti-wraparound vacuum run.
> +    Such a vacuum will aggressively freeze tuples, and it will not clean up dead
> +    index tuples.

"BUT will not clean .."

> +++ b/src/backend/postmaster/autovacuum.c
> +        /*
> +         * If the number of inserted tuples exceeds the limit

I would say "exceeds the threshold"

Thanks for working on this; we would use this feature on our insert-only
tables.

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Fri, 6 Mar 2020 at 03:27, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Thu, 2020-03-05 at 19:40 +1300, David Rowley wrote:
> > 1. I'd go for 2 new GUCs and reloptions.
> > autovacuum_vacuum_insert_threshold (you're currently calling this
> > autovacuum_vacuum_insert_limit. I don't see why the word "limit" is
> > relevant here). The other GUC I think should be named
> > autovacuum_vacuum_insert_scale_factor and these should work exactly
> > the same way as autovacuum_vacuum_threshold and
> > autovacuum_vacuum_scale_factor, but be applied in a similar way to the
> > vacuum settings, but only be applied after we've checked to ensure the
> > table is not otherwise eligible to be vacuumed.
>
> I disagree about the scale_factor (and have not added it to the
> updated version of the patch).  If we have a scale_factor, then the
> time between successive autovacuum runs would increase as the table
> gets bigger, which defeats the purpose of reducing the impact of each
> autovacuum run.

My view here is not really to debate what logically makes the most
sense.  I don't really think for a minute that the current
auto-vacuum's scale_factor and thresholds are perfect for the job. It's
true that the larger a table becomes, the less often it'll be
vacuumed, but these are control knobs that people have become
accustomed to and I don't really think that making an exception for
this is warranted.  Perhaps we can zero out the scale factor by
default and set the threshold into the millions of tuples. We can have
people chime in on what they think about that and why once the code is
written and even perhaps committed.

Lack of a scale_factor does leave people who regularly truncate their
"append-only" tables out in the cold a bit.  Perhaps they'd like
index-only scans to kick in soon after they truncate without having to
wait for 10 million tuples, or so.

> > 10. I'm slightly worried about the case where we don't quite trigger a
> > normal vacuum but trigger a vacuum due to INSERTs then skip cleaning
> > up the indexes but proceed to leave dead index entries causing indexes
> > to become bloated.  It does not seem impossible that given the right
> > balance of INSERTs and UPDATE/DELETEs that this could happen every
> > time and the indexes would just become larger and larger.
>
> I understand.
>
> This might particularly be a problem with larger tables, where
> a normal autovacuum is rare because of the scale_factor.
>
> Perhaps we can take care of the problem by *not* skipping index
> cleanup if "changes_since_analyze" is substantially greater than 0.
>
> What do you think?

Well, there is code that skips the index scans when there are 0 dead
tuples found in the heap. If the table is truly INSERT-only then it
won't do any harm since we'll skip the index scan anyway.  I think
it's less risky to clean the indexes. If we skip that then there will
be a group of people who will suffer from index bloat due to this,
whether they realise it or not.

> > 11. We probably do also need to debate if we want this on or off by
> > default.   I'd have leaned towards enabling by default if I'd not
> > personally witnessed the fact that people rarely* increase auto-vacuum
> > to run faster than the standard cost settings. I've seen hundreds of
> > servers over the years with all workers busy for days on something
> > they'll never finish quickly enough.  We increased those settings 10x
> > in PG12, so there will be fewer people around suffering from that now,
> > but even after having reduced the vacuum_cost_delay x10 over the PG11
> > settings, it's by no means fast enough for everyone.  I've mixed
> > feelings about giving auto-vacuum more work to do for those people, so
> > perhaps the best option is to keep this off by default so as not to
> > affect the people who don't tune auto-vacuum.  They'll just suffer the
> > pain all at once when they hit max freeze age instead of more
> > gradually with the additional load on the workers.   At least adding
> > this feature gives the people who do tune auto-vacuum some ability to
> > handle read-only tables in some sane way.
> >
> > An alternative way of doing it would be to set the threshold to some
> > number of million tuples and set the scale_factor to 0.2 so that it
> > only has an effect on larger tables, of which generally people only
> > have a smallish number of.
>
> Yes, I think that disabling this by default defeats the purpose.

Perhaps the solution to that is somewhere else then.  I can picture
some sort of load average counters for auto-vacuum and spamming the
logs with WARNINGs if we maintain high enough load for long enough,
but we'd likely be better off just completely overhauling the vacuum cost
settings to be a percentage of total effort rather than some fixed
speed.  That would allow more powerful servers to run vacuum more
quickly and it would also run more quickly during low load periods.
We'd just need to sample now and again how long vacuuming a series of
pages takes, then sleep for a time based on how long that took. That's
not for this patch though.

> Updated patch attached.

Thanks.  I've not looked yet as I really think we need a scale_factor
for this.  I'm interested to hear what others think. So far both
Justin and I think it's a good idea.



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
Thanks, Justin, for the review.
I have applied the changes where still applicable.

On Fri, 2020-03-06 at 10:52 +1300, David Rowley wrote:
> On Fri, 6 Mar 2020 at 03:27, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > On Thu, 2020-03-05 at 19:40 +1300, David Rowley wrote:
> > > 1. I'd go for 2 new GUCs and reloptions.
> > > autovacuum_vacuum_insert_threshold (you're currently calling this
> > > autovacuum_vacuum_insert_limit. I don't see why the word "limit" is
> > > relevant here). The other GUC I think should be named
> > > autovacuum_vacuum_insert_scale_factor and these should work exactly
> > > the same way as autovacuum_vacuum_threshold and
> > > autovacuum_vacuum_scale_factor, but be applied in a similar way to the
> > > vacuum settings, but only be applied after we've checked to ensure the
> > > table is not otherwise eligible to be vacuumed.
> > 
> > I disagree about the scale_factor (and have not added it to the
> > updated version of the patch).  If we have a scale_factor, then the
> > time between successive autovacuum runs would increase as the table
> > gets bigger, which defeats the purpose of reducing the impact of each
> > autovacuum run.
> 
> My view here is not really to debate what logically makes the most
> sense.  I don't really think for a minute that the current
> auto-vacuums scale_factor and thresholds are perfect for the job. It's
> true that the larger a table becomes, the less often it'll be
> vacuumed, but these are control knobs that people have become
> accustomed to and I don't really think that making an exception for
> this is warranted.  Perhaps we can zero out the scale factor by
> default and set the threshold into the millions of tuples. We can have
> people chime in on what they think about that and why once the code is
> written and even perhaps committed.

Ok, I submit.  My main desire was to keep the number of new GUCs as
low as reasonably possible, but making the feature tunable along the
known and "trusted" lines may be a good thing.

The new parameter is called "autovacuum_vacuum_insert_scale_factor".

> Lack of a scale_factor does leave people who regularly truncate their
> "append-only" tables out in the cold a bit.  Perhaps they'd like
> index-only scans to kick in soon after they truncate without having to
> wait for 10 million tuples, or so.

That point I don't see.
Truncating a table resets the counters to 0.

> > > 10. I'm slightly worried about the case where we don't quite trigger a
> > > normal vacuum but trigger a vacuum due to INSERTs then skip cleaning
> > > up the indexes but proceed to leave dead index entries causing indexes
> > > to become bloated.  It does not seem impossible that given the right
> > > balance of INSERTs and UPDATE/DELETEs that this could happen every
> > > time and the indexes would just become larger and larger.
> > 
> > Perhaps we can take care of the problem by *not* skipping index
> > cleanup if "changes_since_analyze" is substantially greater than 0.
> > 
> > What do you think?
> 
> Well, there is code that skips the index scans when there are 0 dead
> tuples found in the heap. If the table is truly INSERT-only then it
> won't do any harm since we'll skip the index scan anyway.  I think
> it's less risky to clean the indexes. If we skip that then there will
> be a group of people will suffer from index bloat due to this, no
> matter if they realise it or not.

Oh I didn't know that.

In that case it is better to have this vacuum process indexes as well.
I have changed the patch so that it freezes tuples, but does not skip
index cleanup.

Better err on the side of caution.

> > Yes, I think that disabling this by default defeats the purpose.
> 
> Perhaps the solution to that is somewhere else then.  I can picture
> some sort of load average counters for auto-vacuum and spamming the
> logs with WARNINGs if we maintain high enough load for long enough,
> but we'd likely be better just completely overhauling the vacuum cost
> settings to be a percentage of total effort rather than some fixed
> speed.  That would allow more powerful servers to run vacuum more
> quickly and it would also run more quickly during low load periods.
> We'd just need to sample now and again how long vacuuming a series of
> page takes then sleep for a time based on how long that took. That's
> not for this patch though.

Right.


Updated patch attached.

Yours,
Laurenz Albe

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Sat, 7 Mar 2020 at 03:45, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> Thanks, Justin, for the review.
> I have applied the changes where still applicable.
>
> On Fri, 2020-03-06 at 10:52 +1300, David Rowley wrote:
> > Lack of a scale_factor does leave people who regularly truncate their
> > "append-only" tables out in the cold a bit.  Perhaps they'd like
> > index-only scans to kick in soon after they truncate without having to
> > wait for 10 million tuples, or so.
>
> That point I don't see.
> Truncating a table resets the counters to 0.

The scenario there is that we don't have any
autovacuum_vacuum_insert_scale_factor and we set the threshold to 10
million tuples.  The user truncates the table on a monthly basis, and
nearer to the end of the month the table accumulates around 100
million tuples; roughly 3.2 million are inserted per day, so
auto-vacuum kicks in for this table around once every 3 days.  At the
start of the month, the table is truncated and it begins refilling.
The n_ins_since_vacuum is reset to 0 during the truncate. Meanwhile,
the table is being queried constantly and it takes 3 days for us to
vacuum the table again. Queries hitting the table are unable to use
Index Only Scans for 3 days.  The DBAs don't have a lot of control
over this.

I think we can help users with that by giving them a bit more control
over when auto-vacuum will run for the table. scale_factor and
threshold.

> Updated patch attached.

Great. I'll have a look.



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Tue, 10 Mar 2020 at 09:56, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sat, 7 Mar 2020 at 03:45, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > Updated patch attached.
>
> Great. I'll have a look.

I don't really have many complaints about the v4 patch.  However,
during my pass of it, I did note down a few things that you might want
to have a look at.

1. Do we need to change documentation on freeze_min_age to mention
that it does not apply in all cases? I'm leaning towards not changing
this as `VACUUM FREEZE` is also an exception to this, which I don't
see mentioned.

2. Perhaps the documentation in maintenance.sgml should mention that
the table will be vacuumed with the equivalent of having
vacuum_freeze_min_age = 0, instead of:

"Such a vacuum will aggressively freeze tuples."

aggressive is the wrong word here. We call it an aggressive vacuum if
we disable page skipping, not for setting the vacuum_freeze_min_age to
0.

See heap_vacuum_rel()

/*
* We request an aggressive scan if the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit; or if DISABLE_PAGE_SKIPPING was specified.
*/
aggressive = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
   xidFullScanLimit);
aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
  mxactFullScanLimit);
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
aggressive = true;

3. The following DEBUG3 elog should be updated to include the new values:

elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f)",
NameStr(classForm->relname),
vactuples, vacthresh, anltuples, anlthresh);

Someone might be confused at why auto-vacuum is running if you don't
put those in.

4. This would be nicer if you swapped the order of the operands to the
< condition and replaced the operator with >. That'll match the way it
is done above.

/*
* If the number of inserted tuples exceeds the threshold and no
* vacuum is necessary for other reasons, run an "insert-only" vacuum
* that freezes aggressively.
*/
if (!(*dovacuum) && vacinsthresh < tabentry->inserts_since_vacuum)
{
*dovacuum = true;
*freeze_all = true;
}

It would also be nicer if you assigned the value of
tabentry->inserts_since_vacuum to a variable, so as to match what the
other code there is doing. That'll also make the change for #3 neater.

5. The following text:

    A threshold similar to the above is calculated from
    <xref linkend="guc-autovacuum-vacuum-insert-threshold"/> and
    <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
    Tables that have received more inserts than the calculated threshold
    since they were last vacuumed (and are not eligible for vacuuming for
    other reasons) will be vacuumed to reduce the impact of a future
    anti-wraparound vacuum run.

I think "... will be vacuumed with the equivalent of having <xref
linkend="guc-vacuum-freeze-min-age"/> set to <literal>0</literal>".
I'm not sure we need to mention the reduction of impact to
anti-wraparound vacuums.

6. Please run the regression tests and make sure they pass. The
"rules" test is currently failing due to the new column in
"pg_stat_all_tables"

Apart from the above, does anyone else have objections or concerns
with the patch?  I'd like to take a serious look at pushing it once
the above points are resolved.



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Tue, 2020-03-10 at 09:56 +1300, David Rowley wrote:
> > > Lack of a scale_factor does leave people who regularly truncate their
> > > "append-only" tables out in the cold a bit.  Perhaps they'd like
> > > index-only scans to kick in soon after they truncate without having to
> > > wait for 10 million tuples, or so.
> > 
> > That point I don't see.
> > Truncating a table resets the counters to 0.
> 
> The scenario there is that if we don't have any
> autovacuum_vacuum_insert_scale_factor and we set the threshold to 10
> million tuples.  The user truncates the table on a monthly basis and
> nearer to the end of the month the tuples accumulates around 100
> million tuples, roughly 3.2 million are inserted per day, so
> auto-vacuum kicks in for this table around once every 3 days.  At the
> start of the month, the table is truncated and it begins refilling.
> The n_ins_since_vacuum is reset to 0 during the truncate. Meanwhile,
> the table is being queried constantly and it takes 3 days for us to
> vacuum the table again. Queries hitting the table are unable to use
> Index Only Scans for 3 days.  The DBAs don't have a lot of control
> over this.
> 
> I think we can help users with that by giving them a bit more control
> over when auto-vacuum will run for the table. scale_factor and
> threshold.

Oh, that's a good point.
I only thought about anti-wraparound vacuum, but the feature might be useful
for index-only scans as well.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
> +++ b/src/backend/utils/misc/postgresql.conf.sample
> +#autovacuum_vacuum_insert_threshold = 10000000    # min number of row inserts
> +                    # before vacuum

Similar to a previous comment [0] about reloptions or GUC:

Can we say "threshold number of insertions before vacuum" ?
..or "maximum number of insertions before triggering autovacuum"

-- 
Justin

[0] https://www.postgresql.org/message-id/602873766faa0e9200a60dcc26dc10c636761d5d.camel%40cybertec.at



Re: Berserk Autovacuum (let's save next Mandrill)

From
Masahiko Sawada
Date:
On Fri, 6 Mar 2020 at 23:46, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> Thanks, Justin, for the review.
> I have applied the changes where still applicable.
>
> On Fri, 2020-03-06 at 10:52 +1300, David Rowley wrote:
> > On Fri, 6 Mar 2020 at 03:27, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > > On Thu, 2020-03-05 at 19:40 +1300, David Rowley wrote:
> > > > 1. I'd go for 2 new GUCs and reloptions.
> > > > autovacuum_vacuum_insert_threshold (you're currently calling this
> > > > autovacuum_vacuum_insert_limit. I don't see why the word "limit" is
> > > > relevant here). The other GUC I think should be named
> > > > autovacuum_vacuum_insert_scale_factor and these should work exactly
> > > > the same way as autovacuum_vacuum_threshold and
> > > > autovacuum_vacuum_scale_factor, but be applied in a similar way to the
> > > > vacuum settings, but only be applied after we've checked to ensure the
> > > > table is not otherwise eligible to be vacuumed.
> > >
> > > I disagree about the scale_factor (and have not added it to the
> > > updated version of the patch).  If we have a scale_factor, then the
> > > time between successive autovacuum runs would increase as the table
> > > gets bigger, which defeats the purpose of reducing the impact of each
> > > autovacuum run.
> >
> > My view here is not really to debate what logically makes the most
> > sense.  I don't really think for a minute that the current
> > auto-vacuums scale_factor and thresholds are perfect for the job. It's
> > true that the larger a table becomes, the less often it'll be
> > vacuumed, but these are control knobs that people have become
> > accustomed to and I don't really think that making an exception for
> > this is warranted.  Perhaps we can zero out the scale factor by
> > default and set the threshold into the millions of tuples. We can have
> > people chime in on what they think about that and why once the code is
> > written and even perhaps committed.
>
> Ok, I submit.  My main desire was to keep the number of new GUCs as
> low as reasonably possible, but making the feature tunable along the
> known and "trusted" lines may be a good thing.
>
> The new parameter is called "autovacuum_vacuum_insert_scale_factor".
>
> > Lack of a scale_factor does leave people who regularly truncate their
> > "append-only" tables out in the cold a bit.  Perhaps they'd like
> > index-only scans to kick in soon after they truncate without having to
> > wait for 10 million tuples, or so.
>
> That point I don't see.
> Truncating a table resets the counters to 0.
>
> > > > 10. I'm slightly worried about the case where we don't quite trigger a
> > > > normal vacuum but trigger a vacuum due to INSERTs then skip cleaning
> > > > up the indexes but proceed to leave dead index entries causing indexes
> > > > to become bloated.  It does not seem impossible that given the right
> > > > balance of INSERTs and UPDATE/DELETEs that this could happen every
> > > > time and the indexes would just become larger and larger.
> > >
> > > Perhaps we can take care of the problem by *not* skipping index
> > > cleanup if "changes_since_analyze" is substantially greater than 0.
> > >
> > > What do you think?
> >
> > Well, there is code that skips the index scans when there are 0 dead
> > tuples found in the heap. If the table is truly INSERT-only then it
> > won't do any harm since we'll skip the index scan anyway.  I think
> > it's less risky to clean the indexes. If we skip that then there will
> > be a group of people will suffer from index bloat due to this, no
> > matter if they realise it or not.

+1

FYI actually vacuum could perform the index cleanup phase (i.e. the
PROGRESS_VACUUM_PHASE_INDEX_CLEANUP phase) on a table even if it's a
truly INSERT-only table, depending on
vacuum_cleanup_index_scale_factor. Anyway, I also agree with not
disabling index cleanup in the insert-only vacuum case, because it could
become not only a cause of index bloat but also a big performance
issue. For example, if autovacuum on a table always runs without index
cleanup, a gin index on that table will accumulate inserted tuples in
its pending list, and the list will be cleaned up by a backend process
while inserting a new tuple, not by an autovacuum process. We can disable
index vacuuming per table with the index_cleanup storage parameter, so it
would be better to leave these settings up to users.

I have one question about this patch from an architectural perspective:
have you considered using autovacuum_vacuum_threshold and
autovacuum_vacuum_scale_factor for this purpose as well? That is, we
compare the threshold computed from these values not only to the number
of dead tuples but also to the number of inserted tuples. If the number
of dead tuples exceeds the threshold, we trigger autovacuum as usual.
On the other hand, if the number of inserted tuples exceeds it, we trigger
autovacuum with vacuum_freeze_min_age = 0. I'm concerned about how users
will reason about the settings of the two newly added parameters. We will
have 4 parameters in total. Amit was also concerned about that[1].

I think this idea also works fine. In the insert-only table case, since
only the number of inserted tuples increases, one threshold
(that is, the threshold computed from autovacuum_vacuum_threshold and
autovacuum_vacuum_scale_factor) is enough to trigger autovacuum. And
in the mostly-insert table case, we can already trigger
autovacuum in current PostgreSQL, since there are some dead tuples.
But if we want to trigger autovacuum more frequently based on the number
of newly inserted tuples, we can set that threshold lower while
considering only the number of inserted tuples.
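
Roughly, that alternative could look like the following C sketch
(placeholder names, not patch code): a single threshold derived from the
existing parameters, compared against both counters.

#include <stdbool.h>

/*
 * Sketch of the alternative: one threshold, computed from the existing
 * autovacuum_vacuum_threshold / autovacuum_vacuum_scale_factor, is
 * compared to both the dead-tuple count and the inserted-tuple count.
 * Dead tuples trigger a normal vacuum; inserts alone trigger a vacuum
 * that freezes eagerly (vacuum_freeze_min_age = 0).
 */
struct av_request
{
    bool do_vacuum;
    int  freeze_min_age;    /* -1 means "use the configured value" */
};

struct av_request
decide_with_single_threshold(double reltuples,
                             double dead_tuples,
                             double inserted_tuples,
                             double vac_base_threshold,
                             double vac_scale_factor)
{
    double threshold = vac_base_threshold + vac_scale_factor * reltuples;
    struct av_request r = {false, -1};

    if (dead_tuples > threshold)
        r.do_vacuum = true;             /* usual trigger, usual freezing */
    else if (inserted_tuples > threshold)
    {
        r.do_vacuum = true;
        r.freeze_min_age = 0;           /* insert-driven: freeze eagerly */
    }
    return r;
}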

And I briefly looked at this patch:

@@ -2889,7 +2898,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
        tab->at_params.truncate = VACOPT_TERNARY_DEFAULT;
        /* As of now, we don't support parallel vacuum for autovacuum */
        tab->at_params.nworkers = -1;
-       tab->at_params.freeze_min_age = freeze_min_age;
+       tab->at_params.freeze_min_age = freeze_all ? 0 : freeze_min_age;
        tab->at_params.freeze_table_age = freeze_table_age;
        tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
        tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;

I think we can set multixact_freeze_min_age to 0 as well.

Regards,

[1] https://www.postgresql.org/message-id/CAA4eK1%2BrCxS_Pg4GdSa6G8ESOTHK%2BjDVgqYd_dnO07rGNaewKA%40mail.gmail.com

--
Masahiko Sawada            http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Tue, 2020-03-10 at 13:53 +1300, David Rowley wrote:
> 1. Do we need to change documentation on freeze_min_age to mention
> that it does not apply in all cases? I'm leaning towards not changing
> this as `VACUUM FREEZE` is also an exception to this, which I don't
> see mentioned.

I agree with that.  Too little documentation is bad, but too much of
it can also confuse and make it hard to find the needle in the haystack.

> 2. Perhaps the documentation in maintenance.sgml should mention that
> the table will be vacuumed with the equivalent of having
> vacuum_freeze_min_age = 0, instead of:
> 
> "Such a vacuum will aggressively freeze tuples."
> 
> aggressive is the wrong word here. We call it an aggressive vacuum if
> we disable page skipping, not for setting the vacuum_freeze_min_age to
> 0.

Agreed, see below.

> 3. The following DEBUG3 elog should be updated to include the new values:
> 
> elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f)",
> NameStr(classForm->relname),
> vactuples, vacthresh, anltuples, anlthresh);

Done.

> Someone might be confused at why auto-vacuum is running if you don't
> put those in.
> 
> 4. This would be nicer if you swapped the order of the operands to the
> < condition and replaced the operator with >. That'll match the way it
> is done above.
> 
> /*
> * If the number of inserted tuples exceeds the threshold and no
> * vacuum is necessary for other reasons, run an "insert-only" vacuum
> * that freezes aggressively.
> */
> if (!(*dovacuum) && vacinsthresh < tabentry->inserts_since_vacuum)
> {
> *dovacuum = true;
> *freeze_all = true;
> }
> 
> It would also be nicer if you assigned the value of
> tabentry->inserts_since_vacuum to a variable, so as to match what the
> other code there is doing. That'll also make the change for #3 neater.

Changed that way.

> 5. The following text:
> 
>     A threshold similar to the above is calculated from
>     <xref linkend="guc-autovacuum-vacuum-insert-threshold"/> and
>     <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
>     Tables that have received more inserts than the calculated threshold
>     since they were last vacuumed (and are not eligible for vacuuming for
>     other reasons) will be vacuumed to reduce the impact of a future
>     anti-wraparound vacuum run.
> 
> I think "... will be vacuumed with the equivalent of having <xref
> linkend="guc-vacuum-freeze-min-age"/> set to <literal>0</literal>".
> I'm not sure we need to mention the reduction of impact to
> anti-wraparound vacuums.

Done like that.

I left in the explanation of the purpose of this setting.
Understanding the purpose of the GUCs will make it easier to tune them
correctly.

> 6. Please run the regression tests and make sure they pass. The
> "rules" test is currently failing due to the new column in
> "pg_stat_all_tables"

Oops, sorry.  I ran pgindent, but forgot to re-run the regression tests.

Done.


Attached is V5, which also fixes the bug discovered by Masahiko Sawada.
He made an interesting suggestion which we should consider before committing.

Yours,
Laurenz Albe

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Tue, 2020-03-10 at 00:00 -0500, Justin Pryzby wrote:
> > +++ b/src/backend/utils/misc/postgresql.conf.sample
> > +#autovacuum_vacuum_insert_threshold = 10000000       # min number of row inserts
> > +                                     # before vacuum
> 
> Similar to a previous comment [0] about reloptions or GUC:
> 
> Can we say "threshold number of insertions before vacuum" ?
> ..or "maximum number of insertions before triggering autovacuum"

Hmm.  I copied the wording from "autovacuum_vacuum_threshold".

Since the parameters have similar semantics, a different wording
would confuse.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Tue, 2020-03-10 at 18:14 +0900, Masahiko Sawada wrote:

Thanks for the review and your thoughts!

> FYI actually vacuum could perform index cleanup phase (i.g.
> PROGRESS_VACUUM_PHASE_INDEX_CLEANUP phase) on a table even if it's a
> truly INSERT-only table, depending on
> vacuum_cleanup_index_scale_factor. Anyway, I also agree with not
> disabling index cleanup in insert-only vacuum case, because it could
> become not only a cause of index bloat but also a big performance
> issue. For example, if autovacuum on a table always run without index
> cleanup, gin index on that table will accumulate insertion tuples in
> its pending list and will be cleaned up by a backend process while
> inserting new tuple, not by a autovacuum process. We can disable index
> vacuum by index_cleanup storage parameter per tables, so it would be
> better to defer these settings to users.

Thanks for the confirmation.

> I have one question about this patch from architectural perspective:
> have you considered to use autovacuum_vacuum_threshold and
> autovacuum_vacuum_scale_factor also for this purpose? That is, we
> compare the threshold computed by these values to not only the number
> of dead tuples but also the number of inserted tuples. If the number
> of dead tuples exceeds the threshold, we trigger autovacuum as usual.
> On the other hand if the number of inserted tuples exceeds, we trigger
> autovacuum with vacuum_freeze_min_age = 0. I'm concerned that how user
> consider the settings of newly added two parameters. We will have in
> total 4 parameters. Amit also was concerned about that[1].
> 
> I think this idea also works fine. In insert-only table case, since
> only the number of inserted tuples gets increased, only one threshold
> (that is, threshold computed by autovacuum_vacuum_threshold and
> autovacuum_vacuum_scale_factor) is enough to trigger autovacuum. And
> in mostly-insert table case, in the first place, we can trigger
> autovacuum even in current PostgreSQL, since we have some dead tuples.
> But if we want to trigger autovacuum more frequently by the number of
> newly inserted tuples, we can set that threshold lower while
> considering only the number of inserted tuples.

I am torn.

On the one hand it would be wonderful not to have to add yet more GUCs
to the already complicated autovacuum configuration.  It already confuses
too many users.

On the other hand that will lead to unnecessary vacuums for small
tables.
Worse, the progression caused by the comparatively large scale
factor may make it vacuum large tables too seldom.

I'd be grateful if somebody knowledgeable could throw his or her opinion
into the scales.

> And I briefly looked at this patch:
> 
> @@ -2889,7 +2898,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
>         tab->at_params.truncate = VACOPT_TERNARY_DEFAULT;
>         /* As of now, we don't support parallel vacuum for autovacuum */
>         tab->at_params.nworkers = -1;
> -       tab->at_params.freeze_min_age = freeze_min_age;
> +       tab->at_params.freeze_min_age = freeze_all ? 0 : freeze_min_age;
>         tab->at_params.freeze_table_age = freeze_table_age;
>         tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
>         tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
> 
> I think we can set multixact_freeze_min_age to 0 as well.

Ugh, yes, that is a clear oversight.
I have fixed it in the latest version.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Wed, 11 Mar 2020 at 08:17, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Tue, 2020-03-10 at 18:14 +0900, Masahiko Sawada wrote:
> > I have one question about this patch from architectural perspective:
> > have you considered to use autovacuum_vacuum_threshold and
> > autovacuum_vacuum_scale_factor also for this purpose? That is, we
> > compare the threshold computed by these values to not only the number
> > of dead tuples but also the number of inserted tuples. If the number
> > of dead tuples exceeds the threshold, we trigger autovacuum as usual.
> > On the other hand if the number of inserted tuples exceeds, we trigger
> > autovacuum with vacuum_freeze_min_age = 0. I'm concerned that how user
> > consider the settings of newly added two parameters. We will have in
> > total 4 parameters. Amit also was concerned about that[1].
> >
> > I think this idea also works fine. In insert-only table case, since
> > only the number of inserted tuples gets increased, only one threshold
> > (that is, threshold computed by autovacuum_vacuum_threshold and
> > autovacuum_vacuum_scale_factor) is enough to trigger autovacuum. And
> > in mostly-insert table case, in the first place, we can trigger
> > autovacuum even in current PostgreSQL, since we have some dead tuples.
> > But if we want to trigger autovacuum more frequently by the number of
> > newly inserted tuples, we can set that threshold lower while
> > considering only the number of inserted tuples.
>
> I am torn.
>
> On the one hand it would be wonderful not to have to add yet more GUCs
> to the already complicated autovacuum configuration.  It already confuses
> too many users.
>
> On the other hand that will lead to unnecessary vacuums for small
> tables.
> Worse, the progression caused by the comparatively large scale
> factor may make it vacuum large tables too seldom.

I think we really need to discuss what the default values for these
INSERT-only vacuums should be before we can decide whether we need 2
further GUCs to control the feature.  Right now the defaults are a
scale factor of 0.0 and a threshold of 10 million tuples.  I'm not
saying those are good or bad values, but if they are good, then they're
pretty different from the normal threshold of 50 and the normal scale
factor of 0.2, so (assuming the delete/update thresholds are also good)
we need the additional GUCs.

If someone wants to put forward a case for making the defaults more
similar, then perhaps we can consider merging the options.  One case
might be the fact that we want INSERT-only tables to benefit from
Index Only Scans more often than after 10 million inserts.
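
To illustrate how different the two sets of defaults are in practice,
here is some back-of-the-envelope arithmetic (plain SQL, table sizes
purely illustrative):

SELECT reltuples,
       10000000 + 0.0 * reltuples AS inserts_needed_proposed_defaults,
       50 + 0.2 * reltuples       AS dead_tuples_needed_classic_defaults
FROM (VALUES (100000::float8), (10000000), (1000000000)) AS t(reltuples);

For a billion-row table the proposed defaults trigger a vacuum after
10 million inserts, while the classic formula would wait for roughly
200 million dead tuples; for a 100,000-row table it is the other way
around.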

As for pros and cons. Feel free to add to the following list:

For new GUCs/reloptions:
1. Gives users more control over this new auto-vacuum behaviour
2. The new feature can be completely disabled. This might be very
useful for people who suffer from auto-vacuum starvation.

Against new GUCs/reloptions:
1. Adds more code, documentation and maintenance.
2. Adds more complexity to auto-vacuum configuration.

As for my opinion, I'm leaning towards keeping the additional options.
I think if we were just adding auto-vacuum to core code now, then I'd
be voting to keep the configuration as simple as possible. However,
that's far from the case, and we do have over a decade of people that
have gotten used to how auto-vacuum currently behaves. Many people are
unlikely to even notice the change, but some will, and then there will
be another group of people who want to turn it off, and that group
might be upset when we tell them that they can't, at least not without
flipping the big red "autovacuum" switch into the off position (which
I'm pretty hesitant to recommend that anyone ever do).



Re: Berserk Autovacuum (let's save next Mandrill)

From
Masahiko Sawada
Date:
On Wed, 11 Mar 2020 at 04:17, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Tue, 2020-03-10 at 18:14 +0900, Masahiko Sawada wrote:
>
> Thanks for the review and your thoughts!
>
> > FYI actually vacuum could perform index cleanup phase (i.g.
> > PROGRESS_VACUUM_PHASE_INDEX_CLEANUP phase) on a table even if it's a
> > truly INSERT-only table, depending on
> > vacuum_cleanup_index_scale_factor. Anyway, I also agree with not
> > disabling index cleanup in insert-only vacuum case, because it could
> > become not only a cause of index bloat but also a big performance
> > issue. For example, if autovacuum on a table always run without index
> > cleanup, gin index on that table will accumulate insertion tuples in
> > its pending list and will be cleaned up by a backend process while
> > inserting new tuple, not by a autovacuum process. We can disable index
> > vacuum by index_cleanup storage parameter per tables, so it would be
> > better to defer these settings to users.
>
> Thanks for the confirmation.
>
> > I have one question about this patch from architectural perspective:
> > have you considered to use autovacuum_vacuum_threshold and
> > autovacuum_vacuum_scale_factor also for this purpose? That is, we
> > compare the threshold computed by these values to not only the number
> > of dead tuples but also the number of inserted tuples. If the number
> > of dead tuples exceeds the threshold, we trigger autovacuum as usual.
> > On the other hand if the number of inserted tuples exceeds, we trigger
> > autovacuum with vacuum_freeze_min_age = 0. I'm concerned that how user
> > consider the settings of newly added two parameters. We will have in
> > total 4 parameters. Amit also was concerned about that[1].
> >
> > I think this idea also works fine. In insert-only table case, since
> > only the number of inserted tuples gets increased, only one threshold
> > (that is, threshold computed by autovacuum_vacuum_threshold and
> > autovacuum_vacuum_scale_factor) is enough to trigger autovacuum. And
> > in mostly-insert table case, in the first place, we can trigger
> > autovacuum even in current PostgreSQL, since we have some dead tuples.
> > But if we want to trigger autovacuum more frequently by the number of
> > newly inserted tuples, we can set that threshold lower while
> > considering only the number of inserted tuples.
>
> I am torn.
>
> On the one hand it would be wonderful not to have to add yet more GUCs
> to the already complicated autovacuum configuration.  It already confuses
> too many users.
>
> On the other hand that will lead to unnecessary vacuums for small
> tables.
> Worse, the progression caused by the comparatively large scale
> factor may make it vacuum large tables too seldom.
>

I might be missing your point, but could you elaborate on the kind of
case in which you think this leads to unnecessary vacuums?

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Wed, 2020-03-11 at 12:00 +0900, Masahiko Sawada wrote:
> > > I have one question about this patch from architectural perspective:
> > > have you considered to use autovacuum_vacuum_threshold and
> > > autovacuum_vacuum_scale_factor also for this purpose?
> >
> > I am torn.
> > 
> > On the one hand it would be wonderful not to have to add yet more GUCs
> > to the already complicated autovacuum configuration.  It already confuses
> > too many users.
> > 
> > On the other hand that will lead to unnecessary vacuums for small
> > tables.
> > Worse, the progression caused by the comparatively large scale
> > factor may make it vacuum large tables too seldom.
> 
> I might be missing your point but could you elaborate on that in what
> kind of case you think this lead to unnecessary vacuums?

If you have an insert-only table that has 100000 entries, it will get
vacuumed roughly every 20000 new entries.  The impact is probably too
little to care, but it will increase the contention for the three
autovacuum workers available by default.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Masahiko Sawada
Date:
On Wed, 11 Mar 2020 at 13:24, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Wed, 2020-03-11 at 12:00 +0900, Masahiko Sawada wrote:
> > > > I have one question about this patch from architectural perspective:
> > > > have you considered to use autovacuum_vacuum_threshold and
> > > > autovacuum_vacuum_scale_factor also for this purpose?
> > >
> > > I am torn.
> > >
> > > On the one hand it would be wonderful not to have to add yet more GUCs
> > > to the already complicated autovacuum configuration.  It already confuses
> > > too many users.
> > >
> > > On the other hand that will lead to unnecessary vacuums for small
> > > tables.
> > > Worse, the progression caused by the comparatively large scale
> > > factor may make it vacuum large tables too seldom.
> >
> > I might be missing your point but could you elaborate on that in what
> > kind of case you think this lead to unnecessary vacuums?
>
> If you have an insert-only table that has 100000 entries, it will get
> vacuumed roughly every 20000 new entries.  The impact is probably too
> little to care, but it will increase the contention for the three
> autovacuum workers available by default.

The same is true for a read-write table, right? If that becomes a
problem, it's a mis-configuration, and the user should increase these
values just like we do when setting them for read-write tables.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Wed, 11 Mar 2020 at 17:24, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Wed, 2020-03-11 at 12:00 +0900, Masahiko Sawada wrote:
> > I might be missing your point but could you elaborate on that in what
> > kind of case you think this lead to unnecessary vacuums?
>
> If you have an insert-only table that has 100000 entries, it will get
> vacuumed roughly every 20000 new entries.  The impact is probably too
> little to care, but it will increase the contention for the three
> autovacuum workers available by default.

I guess that depends on your definition of unnecessary. If you want
Index Only Scans, then those settings don't seem unreasonable.  If you
want it just to reduce the chances or impact of an anti-wraparound
vacuum then likely it's a bit too often.

I understand this patch was born due to the anti-wraparound case, but
should we really just ignore the Index Only Scan case?



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Wed, 11 Mar 2020 at 19:00, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Wed, 11 Mar 2020 at 13:24, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > If you have an insert-only table that has 100000 entries, it will get
> > vacuumed roughly every 20000 new entries.  The impact is probably too
> > little to care, but it will increase the contention for the three
> > autovacuum workers available by default.
>
> The same is true for read-write table, right? If that becomes a
> problem, it's a mis-configuration and user should increase these
> values just like when we set these values for read-write tables.

It is true that if vacuum has more to do than it can do, then
something is not configured correctly.

I imagine Laurenz set the scale factor to 0.0 and the threshold to 10
million to reduce the chances that someone will encounter that
problem. I mentioned somewhere upthread that I commonly used to see
production servers running with the standard vacuum_cost_limit of 200
and the (pre-PG12) autovacuum_vacuum_cost_delay of 20. Generally, it
didn't go well for them.  autovacuum_vacuum_cost_delay is now 2 by
default, so auto-vacuum in PG12 and beyond runs 10x faster, but it's
still pretty conservative and it'll still need another bump in several
years when hardware is faster than it is today. So, by no means did
that 10x increase mean that nobody will suffer from auto-vacuum
starvation ever again.

Now, perhaps it remains to be seen if adding additional work onto
auto-vacuum will help or hinder those people.  If their auto-vacuum
can just keep up until the cluster is old enough to need
anti-wraparound vacuums and then falls massively behind, then perhaps
this is a good thing as they might notice at some point before their
server explodes in the middle of the night. By that time they might
have become complacent. Additionally, I think this is pretty well
aligned to the case mentioned in the subject line of this email.  We
now have a freeze map, so performing vacuums to freeze tuples twice as
often is not really much more expensive in total than doing that
vacuuming half as often. Even tables (e.g log tables) that are never
queried won't become much more costly to maintain.  In the meantime,
for tables that do receive queries, then we're more likely to get an
index-only scan.

Perhaps a good way to decide what the scale_factor should be set to is
to compare the run-time of an Index Only Scan with that of an Index Scan.

create table ios (a int, b text);
insert into ios select x,x::text from generate_series(1,1000000)x;
create index on ios (a);
vacuum analyze ios;

explain (analyze, buffers) select a from ios order by a; -- on 2nd exec
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using ios_a_idx on ios  (cost=0.42..25980.42 rows=1000000 width=4) (actual time=0.035..212.602 rows=1000000 loops=1)
   Heap Fetches: 0
   Buffers: shared hit=2736
 Planning Time: 0.095 ms
 Execution Time: 246.864 ms
(5 rows)

set enable_indexonlyscan=0;
explain (analyze, buffers) select a from ios order by a;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ios_a_idx on ios  (cost=0.42..31388.42 rows=1000000 width=4) (actual time=0.036..451.381 rows=1000000 loops=1)
   Buffers: shared hit=8140
 Planning Time: 0.089 ms
 Execution Time: 486.582 ms
(4 rows)

So about twice as fast with the IOS. When it's going to be beneficial
to perform the vacuum will depend on the reads to insert ratio.  I'm
starting to think that we should set the scale_factor to something
like 0.3 and the threshold to 50. Is anyone strongly against that?  Or
Laurenz, are you really set on the 10 million threshold?



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Thu, 2020-03-12 at 17:47 +1300, David Rowley wrote:
> I'm starting to think that we should set the scale_factor to something
> like 0.3 and the threshold to 50. Is anyone strongly against that?  Or
> Laurenz, are you really set on the 10 million threshold?

These values are almost the same as "autovacuum_vacuum_scale_factor"
and "autovacuum_vacuum_threshold", so you actually agree with Masahiko
with the exception that you want it tunable separately.

I don't like the high scale factor.

If your insert-only table was last vacuumed when it had 500 million rows,
the next autovacuum will freeze 150 million tuples, which is a lot.
The impact will be less than that of an anti-wraparound vacuum because
it is not as persistent, but if our 150 million tuple autovacuum backs
down because it hits a lock or gets killed by the DBA, that is also not
good, since it will just come again.
And the bigger the vacuum run is, the more likely it is to meet an obstacle.

So I think that large insert-only tables should be vacuumed more often
than that.  If the number of tuples that have to be frozen is small,
the vacuum run will be short and is less likely to cause problems.
That is why I chose a scale factor of 0 here.


But I totally see your point about index-only scans.

I think the problem is that this insert-only autovacuum serves two masters:
1. preventing massive anti-wraparound vacuum that severely impacts the system
2. maintaining the visibility map for index-only scans

I thought of the first case when I chose the parameter values.

I am afraid that we cannot come up with one setting that fits all, so I
advocate a setting that targets the first problem, which I think is more
important (and was the motivation for this thread).

I could add a paragraph to the documentation that tells people how to
configure the parameters if they want to use it to get index-only scans.
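
As a sketch of what such a configuration could look like (reloption
names as proposed in this patch; table name and values purely
illustrative):

-- Vacuum an insert-mostly table after roughly every 5% of new rows,
-- to keep the visibility map current for index-only scans:
ALTER TABLE sensor_data
    SET (autovacuum_vacuum_insert_threshold = 10000,
         autovacuum_vacuum_insert_scale_factor = 0.05);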

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Thu, 12 Mar 2020 at 18:38, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Thu, 2020-03-12 at 17:47 +1300, David Rowley wrote:
> > Laurenz, are you really set on the 10 million threshold?
>
> These values are almost the same as "autovacuum_vacuum_scale_factor"
> and "autovacuum_vacuum_threshold", so you actually agree with Masahiko
> with the exception that you want it tunable separately.
>
> I don't like the high scale factor.
>
> If your insert-only table was last vacuumed when it had 500 million rows,
> the next autovacuum will freeze 150 million tuples, which is a lot.
> The impact will be less than that of an anti-wraparound vacuum because
> it is not as persistent, but if our 150 million tuple autovacuum backs
> down because it hits a lock or gets killed by the DBA, that is also not
> good, since it will just come again.
> And the bigger the vacuum run is, the more likely it is to meet an obstacle.
>
> So I think that large insert-only tables should be vacuumed more often
> than that.  If the number of tuples that have to be frozen is small,
> the vacuum run will be short and is less likely to cause problems.
> That is why I chose a scale factor of 0 here.

That's a good point.  If those 150 million inserts were done one per
transaction, then it wouldn't take many more tuples before wraparound
vacuums occur more often than insert vacuums.  The only way I see
around that is to a) configure it the way you'd like, or b) add yet
another GUC and reloption to represent how close to
autovacuum_freeze_max_age / autovacuum_multixact_freeze_max_age the
table is.  I'm not very excited about adding yet another GUC, plus
anti-wraparound vacuums already occur 10 times more often than they
need to. If we added such a GUC and set it to, say, 0.1, then they'd
happen 100 times more often than needed before actual wraparound
occurs.

I'm starting to see now why you were opposed to the scale_factor in
the first place.

I think this is really a problem with the design of the threshold and
scale_factor system.  I used to commonly see people with larger tables
zeroing out the scale_factor and setting a reasonable threshold, or
dropping the scale_factor down to some fraction of a percent. I don't
really have any better design in mind though, at least not one that
does not require adding new vacuum options.



Re: Berserk Autovacuum (let's save next Mandrill)

From
Masahiko Sawada
Date:
On Thu, 12 Mar 2020 at 14:38, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Thu, 2020-03-12 at 17:47 +1300, David Rowley wrote:
> > I'm starting to think that we should set the scale_factor to something
> > like 0.3 and the threshold to 50. Is anyone strongly against that?  Or
> > Laurenz, are you really set on the 10 million threshold?
>
> These values are almost the same as "autovacuum_vacuum_scale_factor"
> and "autovacuum_vacuum_threshold", so you actually agree with Masahiko
> with the exception that you want it tunable separately.
>
> I don't like the high scale factor.
>
> If your insert-only table was last vacuumed when it had 500 million rows,
> the next autovacuum will freeze 150 million tuples, which is a lot.
> The impact will be less than that of an anti-wraparound vacuum because
> it is not as persistent, but if our 150 million tuple autovacuum backs
> down because it hits a lock or gets killed by the DBA, that is also not
> good, since it will just come again.
> And the bigger the vacuum run is, the more likely it is to meet an obstacle.
>
> So I think that large insert-only tables should be vacuumed more often
> than that.  If the number of tuples that have to be frozen is small,
> the vacuum run will be short and is less likely to cause problems.
> That is why I chose a scale factor of 0 here.

Is the reason why you want to add new GUC parameters to use different
default values for the insert-update table case and the insert-only
table case? I think I understand the pros and cons of adding separate
parameters, but I still cannot see a use case that we cannot handle
without separate parameters.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Thu, 12 Mar 2020 at 19:50, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> The reason why you want to add new GUC parameters is to use different
> default values for insert-update table case and insert-only table
> case?

Yes, but in particular so it can be completely disabled easily.

> I think I understand the pros and cons of adding separate
> parameters, but I still cannot understand use cases where we cannot
> handle without separate parameters.

That's a lot of negatives. I think I understand that you don't feel
that additional GUCs are worth it?

Laurenz highlighted a seemingly very valid reason that the current
GUCs cannot be reused. Namely, say the table has 1 billion rows, if we
use the current scale factor of 0.2, then we'll run an insert-only
vacuum every 200 million rows. If those INSERTs are one per
transaction then the new feature does nothing as the wraparound vacuum
will run instead. Since this feature was born due to large insert-only
tables, this concern seems very valid to me.



Re: Berserk Autovacuum (let's save next Mandrill)

From
Masahiko Sawada
Date:
On Thu, 12 Mar 2020 at 16:28, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 12 Mar 2020 at 19:50, Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > The reason why you want to add new GUC parameters is to use different
> > default values for insert-update table case and insert-only table
> > case?
>
> Yes, but in particular so it can be completely disabled easily.
>
> > I think I understand the pros and cons of adding separate
> > parameters, but I still cannot understand use cases where we cannot
> > handle without separate parameters.
>
> That's a lot of negatives. I think I understand that you don't feel
> that additional GUCs are worth it?
>
> Laurenz highlighted a seemingly very valid reason that the current
> GUCs cannot be reused. Namely, say the table has 1 billion rows, if we
> use the current scale factor of 0.2, then we'll run an insert-only
> vacuum every 200 million rows. If those INSERTs are one per
> transaction then the new feature does nothing as the wraparound vacuum
> will run instead. Since this feature was born due to large insert-only
> tables, this concern seems very valid to me.

Yeah, I understand and agree that, since most people would use the
default values, we can reduce mis-configuration cases by adding
separate GUCs that have appropriate default values for that purpose.
But on the other hand, I'm not sure it's worth covering the large
insert-only table case with separate GUCs when we could cover it even
with the two existing GUCs. If we want to disable this feature on a
particular table, we could have a storage parameter that means "do not
consider the number of inserted tuples", rather than multiple GUCs that
allow fine-tuning. And IIUC, even in the above case, if we trigger
insert-only vacuum by comparing the number of inserted tuples to the
threshold computed from the existing threshold and scale factor, we can
cover it. But since you and Laurenz have already agreed on adding two
GUCs, I'm not going to insist on that.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Fri, 13 Mar 2020 at 01:43, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Thu, 12 Mar 2020 at 16:28, David Rowley <dgrowleyml@gmail.com> wrote:
> > Laurenz highlighted a seemingly very valid reason that the current
> > GUCs cannot be reused. Namely, say the table has 1 billion rows, if we
> > use the current scale factor of 0.2, then we'll run an insert-only
> > vacuum every 200 million rows. If those INSERTs are one per
> > transaction then the new feature does nothing as the wraparound vacuum
> > will run instead. Since this feature was born due to large insert-only
> > tables, this concern seems very valid to me.
>
> Yeah, I understand and agree that since most people would use default
> values we can reduce mis-configuration cases by adding separate GUCs
> that have appropriate default values for that purpose but on the other
> hand I'm not sure it's worth that we cover the large insert-only table
> case by adding separate GUCs in spite of being able to cover it even
> by existing two GUCs.

In light of the case above, do you have an alternative suggestion?

> If we want to disable this feature on the
> particular table, we can have a storage parameter that means not to
> consider the number of inserted tuples rather than having multiple
> GUCs that allows us to fine tuning. And IIUC even in the above case, I
> think that if we trigger insert-only vacuum by comparing the number of
> inserted tuples to the threshold computed by existing threshold and
> scale factor, we can cover it.

So you're suggesting we drive the insert-vacuums from existing
scale_factor and threshold?  What about the 1 billion row table
example above?



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Fri, 2020-03-13 at 09:10 +1300, David Rowley wrote:
> So you're suggesting we drive the insert-vacuums from existing
> scale_factor and threshold?  What about the 1 billion row table
> example above?

I am still not 100% certain that that is really realistic.
Transactions that insert only a single row are probably the
exception in large insert-only tables.

But I think that we can probably always find a case where any given
parameter setting is not so great, so in order to get ahead
let's decide on something that is not outright stupid.
Changing the defaults later is always an option.

So the three options are:

1. introduce no new parameters and trigger autovacuum if the number
   of inserts exceeds the regular vacuum threshold.

2. introduce the new parameters with high base threshold and zero scale factor.

3. introduce the new parameters with low base threshold and high scale factor.

I think all three are viable.
If nobody else wants to weigh in, throw a coin.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:
On Fri, Mar 13, 2020 at 3:19 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Fri, 2020-03-13 at 09:10 +1300, David Rowley wrote:
> > So you're suggesting we drive the insert-vacuums from existing
> > scale_factor and threshold?  What about the 1 billion row table
> > example above?
>
> I am still not 100% certain if that is really realistic.
> Transactions that insert only a single row are probably the
> exception in large insert-only tables.
>
> But I think that we probably always can find a case where any given
> parameter setting is not so great, so in order to get ahead
> let's decide on something that is not right out stupid.
> Changing the defaults later is always an option.
>
> So the three options are:
>
> 1. introduce no new parameters and trigger autovacuum if the number
>    of inserts exceeds the regular vacuum threshold.
>
> 2. introduce the new parameters with high base threshold and zero scale factor.

Both of these look good to me. 1 is the approach in my initial patch
sketch, 2 is the approach taken by Laurenz.
The values I think in when considering vacuum are "how many megabytes
of the table aren't frozen/visible" (since that's what translates into
processing time, given the I/O limits of the storage) and "how many
pages aren't vacuumed yet".

The threshold in Laurenz's patch was good enough for my taste - it's
basically "vacuum after every gigabyte", and that's exactly what we
implemented when working around this issue manually. There's a good
chance that the latest gigabyte is in RAM and vacuum will be super fast
on it; reading a gigabyte of data is not a showstopper for most
contemporary physical and cloud environments I can think of. If
reading a gigabyte is already a problem, then wraparound is a
guaranteed disaster.

Regarding index only scans, this threshold seems good enough too.
There's a good chance the last gigabyte is already in RAM, and earlier
data was processed by the previous vacuum. Anyway - with this patch,
Index Only Scan actually starts working :)

I'd vote for 2, with a note to "rip it all out later and redesign the
scale factor and threshold system into something more easily
graspable". Whoever needs to cancel the new behavior for some reason
will have a knob then, and the patch is already laid out.

> 3. introduce the new parameters with low base threshold and high scale factor.

This looks bad to me. "The bigger the table, the longer we wait" is
not a good property for something designed as a measure preventing
issues with big tables.

> I think all three are viable.
> If nobody else wants to weigh in, throw a coin.
>
> Yours,
> Laurenz Albe
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa



Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Wed, Mar 11, 2020 at 10:32:47AM +1300, David Rowley wrote:
> 2. The new feature can be completely disabled. This might be very
> useful for people who suffer from auto-vacuum starvation.

On Thu, Mar 12, 2020 at 08:28:05PM +1300, David Rowley wrote:
> Yes, but in particular so it can be completely disabled easily.

How is it disabled ?  By setting scale_factor=100 ?

+               {
+                       "autovacuum_vacuum_insert_scale_factor",
+                       "Number of tuple inserts prior to vacuum as a fraction of reltuples",
+                       RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+                       ShareUpdateExclusiveLock
+               },
+               -1, 0.0, 100.0

Note, vacuum_cleanup_index_scale_factor uses max: 1e10
See 4d54543efa5eb074ead4d0fadb2af4161c943044

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Fri, 2020-03-13 at 12:05 +0300, Darafei "Komяpa" Praliaskouski wrote:
>  1. introduce no new parameters and trigger autovacuum if the number
> >     of inserts exceeds the regular vacuum threshold.
> > 
> > 2. introduce the new parameters with high base threshold and zero scale factor.
> 
> Both of these look good to me. 1 is approach in my initial patch
> sketch, 2 is approach taken by Laurenz.
> Values I think in when considering vacuum is "how many megabytes of
> table aren't frozen/visible" (since that's what translates into
> processing time knowing io limits of storage), and "how many pages
> aren't yet vacuumed".
> 
> Threshold in Laurenz's patch was good enough for my taste - it's
> basically "vacuum after every gigabyte", and that's exactly what we
> implemented when working around this issue manually. There's enough
> chance that latest gigabyte is in RAM and vacuum will be super fast on
> it; reading a gigabyte of data is not a showstopper for most
> contemporary physical and cloud environments I can think of. If
> reading a gigabyte is a problem already then wraparound is a
> guaranteed disaster.
> 
> About index only scan, this threshold seems good enough too. There's a
> good chance last gig is already in RAM, and previous data was
> processed with previous vacuum. Anyway - with this patch Index Only
> Scan starts actually working :)
> 
> I'd vote for 2 with a note "rip it off all together later and redesign
> scale factors and thresholds system to something more easily
> graspable". Whoever needs to cancel the new behavior for some reason
> will have a knob then, and patch is laid out already.
> 
> > 3. introduce the new parameters with low base threshold and high scale factor.
> 
> This looks bad to me. "the bigger the table, the longer we wait" does
> not look good for me for something designed as a measure preventing
> issues with big tables.

Thanks for the feedback.

It looks like we have a loose consensus on #2, i.e. my patch.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Fri, 2020-03-13 at 07:00 -0500, Justin Pryzby wrote:
> > 2. The new feature can be completely disabled. This might be very
> > useful for people who suffer from auto-vacuum starvation.
> 
> > Yes, but in particular so it can be completely disabled easily.
> 
> How is it disabled ?  By setting scale_factor=100 ?
> 
> +               {
> +                       "autovacuum_vacuum_insert_scale_factor",
> +                       "Number of tuple inserts prior to vacuum as a fraction of reltuples",
> +                       RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
> +                       ShareUpdateExclusiveLock
> +               },
> +               -1, 0.0, 100.0
> 
> Note, vacuum_cleanup_index_scale_factor uses max: 1e10
> See 4d54543efa5eb074ead4d0fadb2af4161c943044

By setting the threshold very high, or by setting the scale factor to 100.
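
For example, a per-table opt-out could look like this (parameter name
as in the patch; table name made up, threshold chosen to be practically
unreachable):

ALTER TABLE audit_log SET (autovacuum_vacuum_insert_threshold = 2000000000);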

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Tue, Mar 10, 2020 at 01:53:42PM +1300, David Rowley wrote:
> 2. Perhaps the documentation in maintenance.sgml should mention that
> the table will be vacuumed with the equivalent of having
> vacuum_freeze_min_age = 0, instead of:
> 
> "Such a vacuum will aggressively freeze tuples."
> 
> aggressive is the wrong word here. We call it an aggressive vacuum if
> we disable page skipping, not for setting the vacuum_freeze_min_age to
> 0.

Possibly it would be better to run VACUUM *without* freeze_min_age=0?  (I get
confused and have to spend 20min re-reading the vacuum GUC docs every time I
deal with this stuff, so maybe I'm off.)

As I understand, the initial motivation of this patch was to avoid disruptive
anti-wraparound vacuums on insert-only table.  But if vacuum were triggered at
all, it would freeze the oldest tuples, which is all that's needed; especially
since fd31cd2651 "Don't vacuum all-frozen pages.", those pages would never need
to be vacuumed again.  Recently written tuples wouldn't be frozen, which is ok,
they're handled next time.

Another motivation of the patch is to allow indexonly scan, for which the
planner looks at pages' "relallvisible" fraction (and at execution if a page
isn't allvisible, visits the heap).  Again, that happens if vacuum were run at
all.  Again, some pages won't be marked allvisible, which is fine, they're
handled next time.

I think freeze_min_age=0 could negatively affect people who have insert-mostly
tables (I'm not concerned, but that includes us).  If they consistently hit the
autovacuum insert threshold before the cleanup threshold for updated/deleted
tuples, any updated/deleted tuples would be frozen, which would be
wasteful:  

|One disadvantage of decreasing vacuum_freeze_min_age is that it might cause
|VACUUM to do useless work: freezing a row version is a waste of time if the row
|is modified soon thereafter (causing it to acquire a new XID). So the setting
|should be large enough that rows are not frozen until they are unlikely to
|change any more.

So my question is if autovacuum triggered by insert threshold should trigger
VACUUM with the same settings as a vacuum due to deleted tuples.  I realize the
DBA could just configure the thresholds so they'd hit vacuum for cleaning dead
tuples, so my suggestion maybe just improves the case with the default
settings.  It's possible to set the reloption autovacuum_freeze_min_age, which
I think supports the idea of running a vacuum normally and letting it (and the
DBA) decide what to do with the existing logic.

Also, there was a discussion about index cleanup with the conclusion that it
was safer not to skip it, since otherwise indexes might bloat.  I think that's
right, since vacuum for cleanup is triggered by the number of dead heap tuples.
To skip index cleanup, I think you'd want a metric for
n_dead_since_index_cleanup.  (Or maybe analyze could track dead index tuples
and trigger vacuum of each index separately).

Having now played with the patch, I'll suggest that 10000000 is too high a
threshold.  If autovacuum runs without FREEZE, I don't see why it couldn't be
much lower (100000?) or use (0.2 * n_ins + 50) like the other autovacuum GUC.  

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-13 13:44:42 -0500, Justin Pryzby wrote:
> As I understand, the initial motivation of this patch was to avoid disruptive
> anti-wraparound vacuums on insert-only table.  But if vacuum were triggered at
> all, it would freeze the oldest tuples, which is all that's needed; especially
> since fd31cd2651 "Don't vacuum all-frozen pages.", those pages would never need
> to be vacuumed again.  Recently written tuples wouldn't be frozen, which is ok,
> they're handled next time.
> 
> Another motivation of the patch is to allow indexonly scan, for which the
> planner looks at pages' "relallvisible" fraction (and at execution if a page
> isn't allvisible, visits the heap).  Again, that happens if vacuum were run at
> all.  Again, some pages won't be marked allvisible, which is fine, they're
> handled next time.
> 
> I think freeze_min_age=0 could negatively affect people who have insert-mostly
> tables (I'm not concerned, but that includes us).  If they consistently hit the
> autovacuum insert threshold before the cleanup threshold for updated/deleted
> tuples, any updated/deleted tuples would be frozen, which would be
> wasteful:  

I think that's a valid concern.


> |One disadvantage of decreasing vacuum_freeze_min_age is that it might cause
> |VACUUM to do useless work: freezing a row version is a waste of time if the row
> |is modified soon thereafter (causing it to acquire a new XID). So the setting
> |should be large enough that rows are not frozen until they are unlikely to
> |change any more.

I think the overhead here might be a bit overstated. Once a page is
dirtied (or already dirty) during vacuum, and we freeze a single row
(necessitating WAL logging), there's not really a good reason not to
also freeze the rest of the rows on that page. The added cost for
freezing another row is minuscule compared to the "constant" cost of
freezing anything on the page.  It's of course different if there are
otherwise no tuples worth freezing on the page (not uncommon). But there's really
no reason for that to be the case:

Afaict the only problem with more aggressively freezing when we touch
(beyond hint bits) the page anyway is that we commonly end up with
multiple WAL records for the same page:

1) lazy_scan_heap()->heap_page_prune() will log an XLOG_HEAP2_CLEAN record, but leave
   itemids in place most of the time
2) lazy_scan_heap()->log_heap_freeze() will log an XLOG_HEAP2_FREEZE_PAGE record
3a) if no indexes exist/index cleanup is disabled:
  lazy_scan_heap()->lazy_vacuum_page() will log an XLOG_HEAP2_CLEAN
  record, removing dead tuples (including itemids)
3b) if indexes need to be cleaned up,
  lazy_vacuum_heap()->lazy_vacuum_page() will log an XLOG_HEAP2_CLEAN

which is not nice. It likely is worth merging xl_heap_freeze_page into
xl_heap_clean, and having heap pruning always freeze once it decides to
dirty a page.

We could probably always prune dead tuples as part of heap_prune_chain()
if there's no indexes - but I'm doubtful it's worth it, since there'll
be few tables with lots of dead tuples that don't have indexes.

Merging 3b's WAL record would be harder, I think.


There's also a significant source of additional WAL records here, one
that I think should really not have been introduced:

4) HeapTupleSatisfiesVacuum(), called both by heap_prune_chain() and
  lazy_scan_heap(), will often trigger a WAL record when checksums or
  wal_log_hints are enabled. If the page hasn't been modified in the
  current checkpoint window (extremely common for VACUUM, reasonably
  common for opportunistic pruning), we will log a full page write.

  Imo this really should have been avoided when checksums were added,
  that's a pretty substantial and unnecessary increase in overhead.


It's probably overkill to tie fixing the 'insert only' case to improving
the WAL logging for vacuuming / pruning. But it certainly would
largely remove the tradeoff discussed here, by removing additional
overhead of freezing in tables that are also updated.


> Also, there was a discussion about index cleanup with the conclusion that it
> was safer not to skip it, since otherwise indexes might bloat.  I think that's
> right, since vacuum for cleanup is triggered by the number of dead heap tuples.
> To skip index cleanup, I think you'd want a metric for
> n_dead_since_index_cleanup.  (Or maybe analyze could track dead index tuples
> and trigger vacuum of each index separately).
> 
> Having now played with the patch, I'll suggest that 10000000 is too high a
> threshold.  If autovacuum runs without FREEZE, I don't see why it couldn't be
> much lower (100000?) or use (0.2 * n_ins + 50) like the other autovacuum GUC.

ISTM that the danger of regressing workloads due to suddenly repeatedly
scanning huge indexes that previously were never / rarely scanned is
significant (if there are a few dead tuples; otherwise most indexes will
be able to skip the scan since the introduction of
vacuum_cleanup_index_scale_factor).

Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Fri, 2020-03-13 at 13:44 -0500, Justin Pryzby wrote:
> Possible it would be better to run VACUUM *without* freeze_min_age=0 ?  (I get
> confused and have to spend 20min re-reading the vacuum GUC docs every time I
> deal with this stuff, so maybe I'm off).
> 
> As I understand, the initial motivation of this patch was to avoid disruptive
> anti-wraparound vacuums on insert-only table.  But if vacuum were triggered at
> all, it would freeze the oldest tuples, which is all that's needed; especially
> since fd31cd2651 "Don't vacuum all-frozen pages.", those pages would never need
> to be vacuumed again.  Recently written tuples wouldn't be frozen, which is ok,
> they're handled next time.

Freezing tuples too early is wasteful if the tuples get updated or deleted
soon after, but based on the assumption that an autovacuum triggered by insert
is dealing with an insert-mostly table, it is not that wasteful.

If we didn't freeze all tuples, it would be easy to envision a situation where
bulk data loads insert several million rows in a few transactions, which
would trigger a vacuum.  With the normal vacuum_freeze_min_age, that vacuum
would do nothing at all.  It is better if each vacuum freezes some rows,
in other words, if it does some of the anti-wraparound work.
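
Put differently, on such a freshly loaded table the insert-triggered
vacuum would behave roughly like an explicit freeze run (hypothetical
table name):

-- Equivalent manual operation: freeze everything in the just-loaded table,
-- i.e. vacuum it with vacuum_freeze_min_age = 0:
VACUUM (FREEZE) bulk_loaded_table;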

> Another motivation of the patch is to allow indexonly scan, for which the
> planner looks at pages' "relallvisible" fraction (and at execution if a page
> isn't allvisible, visits the heap).  Again, that happens if vacuum were run at
> all.  Again, some pages won't be marked allvisible, which is fine, they're
> handled next time.

Yes, freezing is irrelevant with respect to index only scans, but it helps
with mitigating the impact of anti-wraparound vacuum runs.

> I think freeze_min_age=0 could negatively affect people who have insert-mostly
> tables (I'm not concerned, but that includes us).  If they consistently hit the
> autovacuum insert threshold before the cleanup threshold for updated/deleted
> tuples, any updated/deleted tuples would be frozen, which would be
> wasteful:  

I don't get that.  Surely tuples whose xmax is committed won't be frozen.

> So my question is if autovacuum triggered by insert threshold should trigger
> VACUUM with the same settings as a vacuum due to deleted tuples.  I realize the
> DBA could just configure the thresholds so they'd hit vacuum for cleaning dead
> tuples, so my suggestion maybe just improves the case with the default
> settings.  It's possible to set the reloption autovacuum_freeze_min_age, which
> I think supports the idea of running a vacuum normally and letting it (and the
> DBA) decide what do with existing logic.

Yes, the DBA can explicitly set vacuum_freeze_min_age to 0.

But for one DBA who understands his or her workload well enough, and who knows
the workings of autovacuum well enough to do that kind of tuning, there are
99 DBAs who don't, and it is the goal of the patch (expressed in the subject)
to make things work for those people who go with the default.

And I believe that is better achieved with freezing as many tuples as possible.

> Also, there was a discussion about index cleanup with the conclusion that it
> was safer not to skip it, since otherwise indexes might bloat.  I think that's
> right, since vacuum for cleanup is triggered by the number of dead heap tuples.
> To skip index cleanup, I think you'd want a metric for
> n_dead_since_index_cleanup.  (Or maybe analyze could track dead index tuples
> and trigger vacuum of each index separately).

Yes, I think we pretty much all agree on that.

> Having now played with the patch, I'll suggest that 10000000 is too high a
> threshold.  If autovacuum runs without FREEZE, I don't see why it couldn't be
> much lower (100000?) or use (0.2 * n_ins + 50) like the other autovacuum GUC.

There is the concern that that might treat large tables too seldom.

I am curious - what were the findings that led you to think that 10000000
is too high?

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Fri, Mar 13, 2020 at 02:38:51PM -0700, Andres Freund wrote:
> > |One disadvantage of decreasing vacuum_freeze_min_age is that it might cause
> > |VACUUM to do useless work: freezing a row version is a waste of time if the row
> > |is modified soon thereafter (causing it to acquire a new XID). So the setting
> > |should be large enough that rows are not frozen until they are unlikely to
> > |change any more.
> 
> I think the overhead here might be a bit overstated. Once a page is

Could you clarify if you mean the language in docs in general or specifically
in the context of this patch ?

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Fri, Mar 13, 2020 at 02:38:51PM -0700, Andres Freund wrote:
> > Having now played with the patch, I'll suggest that 10000000 is too high a
> > threshold.  If autovacuum runs without FREEZE, I don't see why it couldn't be
> > much lower (100000?) or use (0.2 * n_ins + 50) like the other autovacuum GUC.
> 
> ISTM that the danger of regressing workloads due to suddenly repeatedly
> scanning huge indexes that previously were never / rarely scanned is
> significant

You're right - at one point, I was going to argue to skip index cleanup, and I
think I wrote that before I finished convincing myself why it wasn't ok to skip.

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Masahiko Sawada
Date:
On Fri, 13 Mar 2020 at 05:11, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 13 Mar 2020 at 01:43, Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Thu, 12 Mar 2020 at 16:28, David Rowley <dgrowleyml@gmail.com> wrote:
> > > Laurenz highlighted a seemingly very valid reason that the current
> > > GUCs cannot be reused. Namely, say the table has 1 billion rows, if we
> > > use the current scale factor of 0.2, then we'll run an insert-only
> > > vacuum every 200 million rows. If those INSERTs are one per
> > > transaction then the new feature does nothing as the wraparound vacuum
> > > will run instead. Since this feature was born due to large insert-only
> > > tables, this concern seems very valid to me.
> >
> > Yeah, I understand and agree that since most people would use default
> > values we can reduce mis-configuration cases by adding separate GUCs
> > that have appropriate default values for that purpose but on the other
> > hand I'm not sure it's worth that we cover the large insert-only table
> > case by adding separate GUCs in spite of being able to cover it even
> > by existing two GUCs.
>
> In light of the case above, do you have an alternative suggestion?
>
> > If we want to disable this feature on the
> > particular table, we can have a storage parameter that means not to
> > consider the number of inserted tuples rather than having multiple
> > GUCs that allows us to fine tuning. And IIUC even in the above case, I
> > think that if we trigger insert-only vacuum by comparing the number of
> > inserted tuples to the threshold computed by existing threshold and
> > scale factor, we can cover it.
>
> So you're suggesting we drive the insert-vacuums from existing
> scale_factor and threshold?  What about the 1 billion row table
> example above?

My suggestion is the initial approach proposed by Justin; comparing
the number of inserted tuples to the threshold computed by
autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor in order
to trigger autovacuum. But as discussed, there is a downside; if the
number of inserted tuples is almost the same as, but a little larger
than, the number of dead tuples, we will trigger insert-only vacuum
but it's wasteful.

There is already a consensus on introducing 2 new parameters, but as
the second idea I'd like to add one (or two) GUC(s) to my suggestion,
say autovacuum_vacuum_freeze_insert_ratio; this parameter is the ratio
of the number of inserted tuples to the total number of tuples modified
and inserted, used to decide whether to trigger insert-only vacuum. For
example, suppose the table has 1,000,000 tuples and we set threshold = 0,
scale_factor = 0.2 and freeze_insert_ratio = 0.9; we will trigger
normal autovacuum when n_dead_tup + n_ins_since_vacuum > 200,000, but
we will instead trigger insert-only autovacuum, which is a vacuum with
vacuum_freeze_min_age = 0, when n_ins_since_vacuum > 180,000 (=200,000
* 0.9). IOW, if 90% of modified tuples are insertions, we freeze tuples
aggressively. If we want to trigger insert-only vacuum only on an
insert-only table we can set freeze_insert_ratio = 1.0. The downside
of this idea is that we cannot disable autovacuum triggered by the
number of inserted tuples, although we might be able to introduce one
more GUC that controls whether to include the number of inserted tuples
when triggering autovacuum (say, autovacuum_vacuum_triggered_by_insert =
on|off). The pro of this idea would be that we can ensure that
insert-only vacuum will run only in the case where the ratio of
insertions is large enough.
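
To illustrate one way of reading that proposal, here is a rough, untested
sketch (all names are mine, not from any patch), modeled on the shape of
relation_needs_vacanalyze():

    #include <stdbool.h>

    static bool
    table_needs_vacuum(float reltuples, float dead_tuples, float ins_tuples,
                       float base_thresh, float scale_factor,
                       float freeze_insert_ratio, bool *insert_only)
    {
        float   vacthresh = base_thresh + scale_factor * reltuples;

        /* inserts alone make up at least freeze_insert_ratio of the
         * threshold: run the vacuum with vacuum_freeze_min_age = 0 */
        *insert_only = ins_tuples > vacthresh * freeze_insert_ratio;

        /* trigger when dead plus inserted tuples exceed the threshold */
        return (dead_tuples + ins_tuples) > vacthresh;
    }

With the numbers above (reltuples = 1,000,000, threshold = 0,
scale_factor = 0.2, freeze_insert_ratio = 0.9) vacuum triggers at 200,000
modified-or-inserted tuples and becomes insert-only once 180,000 of them
are insertions.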

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Fri, Mar 13, 2020 at 10:48:27PM +0100, Laurenz Albe wrote:
> On Fri, 2020-03-13 at 13:44 -0500, Justin Pryzby wrote:
> > Possible it would be better to run VACUUM *without* freeze_min_age=0 ?  (I get
> > confused and have to spend 20min re-reading the vacuum GUC docs every time I
> > deal with this stuff, so maybe I'm off).
> > 
> > As I understand, the initial motivation of this patch was to avoid disruptive
> > anti-wraparound vacuums on insert-only table.  But if vacuum were triggered at
> > all, it would freeze the oldest tuples, which is all that's needed; especially
> > since fd31cd2651 "Don't vacuum all-frozen pages.", those pages would never need
> > to be vacuumed again.  Recently written tuples wouldn't be frozen, which is ok,
> > they're handled next time.
> 
> Freezing tuples too early is wasteful if the tuples get updated or deleted
> soon after, but based on the assumption that an autovacuum triggered by insert
> is dealing with an insert-mostly table, it is not that wasteful.

You're right that it's not *that* wasteful.  If it's a table that gets 90%
inserts/10% updates, then only 10% of its tuples will be frozen.  In the worst
case, it's the same tuples every time, and that's somewhat wasteful.  In the
best case, those tuples are clustered on a small number of pages.

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Mon, 2020-03-16 at 12:53 +0900, Masahiko Sawada wrote:
> There is already a consensus on introducing new 2 parameters, but as
> the second idea I'd like to add one (or two) GUC(s) to my suggestion,
> say autovacuum_vacuum_freeze_insert_ratio; this parameter is the ratio
> of the number of inserted tuples for total number of tuples modified
> and inserted, in order to trigger insert-only vacuum. For example,
> suppose the table has 1,000,000 tuples and we set threshold = 0,
> scale_factor = 0.2 and freeze_insert_ratio = 0.9, we will trigger
> normal autovacuum when n_dead_tup + n_ins_since_vacuum > 200,000, but
> we will instead trigger insert-only autovacuum, which is a vacuum with
> vacuum_freeze_min_age = 0, when n_ins_since_vacuum > 180,000 (=200,000
> * 0.9). IOW if 90% of modified tuples are insertions, we freeze tuples
> aggressively. If we want to trigger insert-only vacuum only on
> insert-only table we can set freeze_insert_ratio = 1.0. The down side
> of this idea is that we cannot disable autovacuum triggered by the
> number of inserted, although we might be able to introduce more one
> GUC that controls whether to include the number of inserted tuples for
> triggering autovacuum (say, autovacuum_vacuum_triggered_by_insert =
> on|off). The pros of this idea would be that we can ensure that
> insert-only vacuum will run only in the case where the ratio of
> insertion is large enough.

Two more parameters :^(  But your reasoning is good.

How about we go with what we have now and leave that for future
discussion and patches?

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Mon, Mar 16, 2020 at 12:53:43PM +0900, Masahiko Sawada wrote:

> There is already a consensus on introducing new 2 parameters, but as
> the second idea I'd like to add one (or two) GUC(s) to my suggestion,
> say autovacuum_vacuum_freeze_insert_ratio; this parameter is the ratio
> of the number of inserted tuples for total number of tuples modified
> and inserted, in order to trigger insert-only vacuum. For example,
> suppose the table has 1,000,000 tuples and we set threshold = 0,
> scale_factor = 0.2 and freeze_insert_ratio = 0.9, we will trigger
> normal autovacuum when n_dead_tup + n_ins_since_vacuum > 200,000, but
> we will instead trigger insert-only autovacuum, which is a vacuum with
> vacuum_freeze_min_age = 0, when n_ins_since_vacuum > 180,000 (=200,000
> * 0.9). IOW if 90% of modified tuples are insertions, we freeze tuples
> aggressively. If we want to trigger insert-only vacuum only on
> insert-only table we can set freeze_insert_ratio = 1.0. The down side
> of this idea is that we cannot disable autovacuum triggered by the
> number of inserted, although we might be able to introduce more one
> GUC that controls whether to include the number of inserted tuples for
> triggering autovacuum (say, autovacuum_vacuum_triggered_by_insert =
> on|off). The pros of this idea would be that we can ensure that
> insert-only vacuum will run only in the case where the ratio of
> insertion is large enough.

I was thinking about something like this myself.  I would appreciate keeping
separate the thresholds for 1) triggering vacuum; and, 2) the options
autovacuum uses when it runs (in this case, FREEZE).  Someone might want
autovacuum to run with FREEZE on a table vacuumed due to dead tuples (say, on a
partitioned table), or might *not* want to run FREEZE on a table vacuumed due
to insertions (maybe because index scans are too expensive or FREEZE makes it
too slow).

Normally, when someone complains about a bad plan related to no index-only scan,
we tell them to run vacuum, and if that helps, then ALTER TABLE .. SET
(autovacuum_vacuum_scale_factor=0.005).

If there's two thresholds (4 GUCs and 4 relopts) for autovacuum, then do we
have to help determine which one was being hit, and which relopt to set?

I wonder if the new insert GUCs should default to -1 (disabled)?  And the
insert thresholds should be set by new insert relopt (if set), or by new insert
GUC (default -1), else normal relopt, or normal GUC.  The defaults would give
50 + 0.20*n.  When someone asks about IOS, we'd tell them to set
autovacuum_vacuum_scale_factor=0.005, same as now.

vac_ins_scale_factor =
    (relopts && relopts->vacuum_ins_scale_factor >= 0) ? relopts->vacuum_ins_scale_factor :
    autovacuum_vac_ins_scale >= 0 ? autovacuum_vac_ins_scale : 
    (relopts && relopts->vacuum_scale_factor >= 0) ? relopts->vacuum_scale_factor :
    autovacuum_vac_scale;

One would disable autovacuum triggered by insertions by setting
autovacuum_vacuum_insert_scale_factor=1e10 (which I think should also be the
max for this patch).

It seems to me that the easy thing to do is to implement this initially without
FREEZE (which is controlled by vacuum_freeze_table_age), and defer until
July/v14 further discussion and implementation of another GUC/relopt for
autovacuum freezing to be controlled by insert thresholds (or ratio).

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-13 19:10:00 -0500, Justin Pryzby wrote:
> On Fri, Mar 13, 2020 at 02:38:51PM -0700, Andres Freund wrote:
> > > |One disadvantage of decreasing vacuum_freeze_min_age is that it might cause
> > > |VACUUM to do useless work: freezing a row version is a waste of time if the row
> > > |is modified soon thereafter (causing it to acquire a new XID). So the setting
> > > |should be large enough that rows are not frozen until they are unlikely to
> > > |change any more.
> > 
> > I think the overhead here might be a bit overstated. Once a page is
> 
> Could you clarify if you mean the language in docs in general or specifically
> in the context of this patch ?

In the docs.

Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Mon, 2020-03-16 at 07:47 -0500, Justin Pryzby wrote:
> It seems to me that the easy thing to do is to implement this initially without
> FREEZE (which is controlled by vacuum_freeze_table_age), and defer until
> July/v14 further discussion and implementation of another GUC/relopt for
> autovacuum freezing to be controlled by insert thresholds (or ratio).

Freezing tuples is the point of this patch.
As I have said, if you have a table where you insert many rows in few
transactions, you would trigger an autovacuum that then ends up doing nothing
because none of the rows have reached vacuum_freeze_table_age yet.

Then some time later you will get a really large vacuum run.

It seems to me that if we keep trying to find the formula that will vacuum
every table just right and never do the wrong thing, we will never get anywhere.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-16 20:49:43 +0100, Laurenz Albe wrote:
> On Mon, 2020-03-16 at 07:47 -0500, Justin Pryzby wrote:
> > It seems to me that the easy thing to do is to implement this initially without
> > FREEZE (which is controlled by vacuum_freeze_table_age), and defer until
> > July/v14 further discussion and implementation of another GUC/relopt for
> > autovacuum freezing to be controlled by insert thresholds (or ratio).
> 
> Freezing tuples is the point of this patch.

Sure. But not hurting existing installations is also a goal of the
patch. Since this is introducing potentially significant performance
downsides, I think it's good to be a bit conservative with the default
configuration.

I'm getting a bit more bullish on implementing some of what I
discussed in
https://www.postgresql.org/message-id/20200313213851.ejrk5gptnmp65uoo%40alap3.anarazel.de
at the same time as this patch.

In particular, I think it'd make sense to *not* have a lower freezing
horizon for insert vacuums (because it *will* cause problems), but if
the page is dirty anyway, then do the freezing even if freeze_min_age
etc would otherwise prevent us from doing so?

It'd probably be ok to incur the WAL logging overhead unconditionally,
but I'm not sure about it.
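
To make that concrete, a rough, untested sketch of what I mean inside
lazy_scan_heap(), mirroring the existing heap_prepare_freeze_tuple() call;
"page_already_dirty" is a flag I'm assuming the caller would track (set when
pruning or hint-bit setting already dirtied the buffer), not an existing
variable:

    /*
     * If we are dirtying (or have already dirtied) this page anyway,
     * freeze everything on it, ignoring freeze_min_age; otherwise use
     * the normal cutoffs.
     */
    if (heap_prepare_freeze_tuple(tuple.t_data,
                                  relfrozenxid, relminmxid,
                                  page_already_dirty ? 0 : FreezeLimit,
                                  page_already_dirty ? 0 : MultiXactCutoff,
                                  &frozen[nfrozen],
                                  &tuple_totally_frozen))
        frozen[nfrozen++].offset = offnum;  /* body unchanged */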


> As I have said, if you have a table where you insert many rows in few
> transactions, you would trigger an autovacuum that then ends up doing nothing
> because none of the rows have reached vacuum_freeze_table_age yet.

> Then some time later you will get a really large vacuum run.

Well, only if you don't further insert into the table. Which isn't that
common a case for a table having a "really large vacuum run".


Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Mon, Mar 16, 2020 at 08:49:43PM +0100, Laurenz Albe wrote:
> On Mon, 2020-03-16 at 07:47 -0500, Justin Pryzby wrote:
> > It seems to me that the easy thing to do is to implement this initially without
> > FREEZE (which is controlled by vacuum_freeze_table_age), and defer until
> > July/v14 further discussion and implementation of another GUC/relopt for
> > autovacuum freezing to be controlled by insert thresholds (or ratio).
> 
> Freezing tuples is the point of this patch.
> As I have said, if you have a table where you insert many rows in few
> transactions, you would trigger an autovacuum that then ends up doing nothing
> because none of the rows have reached vacuum_freeze_table_age yet.
> 
> Then some time later you will get a really large vacuum run.

Best practice is to vacuum following bulk load.  I don't think this patch is
going to change that.  Bulk-loaded tuples will be autovacuumed, which is nice,
but I don't think it'll be ideal if large bulk loads trigger an autovacuum with
cost delays, which ISTM will take even longer if it runs with FREEZE.

If it's a bulk load, then I think it's okay to assume it was vacuumed, or
otherwise that it'll eventually be hit by autovac at some later date.

If it's not a "bulk load" but a normal runtime, and the table continues to
receive inserts/deletes, then eventually it'll hit a vacuum threshold and
tuples can be frozen.

If it receives a bunch of activity, which then stops (like a partition of a
table of timeseries data), then maybe it doesn't hit a vacuum threshold, until
wraparound vacuum.  I think in that case it's not catastrophic, since then it
wasn't big enough to hit any threshold (it's partitioned).  If every day,
autovacuum kicks in and does wraparound vacuum on table with data from (say)
100 days ago, I think that's reasonable.

One case which would suck is if the insert_threshold were 1e6, and you restore
a DB with 1000 tables of historic data (which are no longer being inserted
into) which have 9e5 rows each (just below the threshold).  Then autovacuum
will hit them all at once.  The solution to that is a manual vacuum after bulk
load, same as today.  As a practical matter, some of the tables are likely to
hit the autovacuum insert threshold, and some are likely to be pruned (or
updated) before wraparound vacuum, so the patch usually does improve that case.

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Mon, 2020-03-16 at 13:13 -0700, Andres Freund wrote:
> > Freezing tuples is the point of this patch.
> 
> Sure. But not hurting existing installation is also a goal of the
> patch. Since this is introducing potentially significant performance
> downsides, I think it's good to be a bit conservative with the default
> configuration.
> 
> I'm gettin a bit more bullish on implementing some of what what I
> discussed in
> https://www.postgresql.org/message-id/20200313213851.ejrk5gptnmp65uoo%40alap3.anarazel.de
> at the same time as this patch.
>
> In particularl, I think it'd make sense to *not* have a lower freezing
> horizon for insert vacuums (because it *will* cause problems), but if
> the page is dirty anyway, then do the freezing even if freeze_min_age
> etc would otherwise prevent us from doing so?

I don't quite see why freezing tuples in insert-only tables will cause
problems - are you saying that more WAL will be written compared to
freezing with a higher freeze_min_age?

> > As I have said, if you have a table where you insert many rows in few
> > transactions, you would trigger an autovacuum that then ends up doing nothing
> > because none of the rows have reached vacuum_freeze_table_age yet.
> > Then some time later you will get a really large vacuum run.
> 
> Well, only if you don't further insert into the table. Which isn't that
> common a case for a table having a "really large vacuum run".

Ah, yes, you are right.
So it actually would not be worse if we use the normal freeze_min_age
for insert-only vacuums.

So do you think the patch would be ok as it is if we change only that?

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Mon, 2020-03-16 at 16:07 -0500, Justin Pryzby wrote:
> Best practice is to vacuum following bulk load.

Yes.

> If it's a bulk load, then I think it's okay to assume it was vacuumed,

No.  This patch is there precisely because too many people don't know
that they should vacuum their table after a bulk insert.
The idea of autovacuum is to do these things for you automatically.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-16 22:25:11 +0100, Laurenz Albe wrote:
> On Mon, 2020-03-16 at 13:13 -0700, Andres Freund wrote:
> > > Freezing tuples is the point of this patch.
> > 
> > Sure. But not hurting existing installation is also a goal of the
> > patch. Since this is introducing potentially significant performance
> > downsides, I think it's good to be a bit conservative with the default
> > configuration.
> > 
> > I'm gettin a bit more bullish on implementing some of what what I
> > discussed in
> > https://www.postgresql.org/message-id/20200313213851.ejrk5gptnmp65uoo%40alap3.anarazel.de
> > at the same time as this patch.
> >
> > In particularl, I think it'd make sense to *not* have a lower freezing
> > horizon for insert vacuums (because it *will* cause problems), but if
> > the page is dirty anyway, then do the freezing even if freeze_min_age
> > etc would otherwise prevent us from doing so?
> 
> I don't quite see why freezing tuples in insert-only tables will cause
> problems - are you saying that more WAL will be written compared to
> freezing with a higher freeze_min_age?

As far as I understand the patch may trigger additional vacuums e.g. for
tables that have some heavily updated parts / key ranges, and otherwise
are largely insert only (as long as there are in total considerably more
inserts than updates). That's not at all uncommon.

And for the heavily updated regions the additional vacuums with a 0 min
age could prove to be costly.  I've not looked at the new code, but it'd
be particularly bad if the changes were to trigger the
lazy_check_needs_freeze() check in lazy_scan_heap() - it'd have the
potential for a lot more contention.


> > > As I have said, if you have a table where you insert many rows in few
> > > transactions, you would trigger an autovacuum that then ends up doing nothing
> > > because none of the rows have reached vacuum_freeze_table_age yet.
> > > Then some time later you will get a really large vacuum run.
> > 
> > Well, only if you don't further insert into the table. Which isn't that
> > common a case for a table having a "really large vacuum run".
> 
> Ah, yes, you are right.
> So it actually would not be worse if we use the normal freeze_min_age
> for insert-only vacuums.

Well, it'd still be worse, because it'd likely trigger more writes of
the same pages: once for setting hint bits during the first vacuum, and
then later a second time for freezing. Which is why I was pondering using the
logic described above (freeze opportunistically when the page is dirty anyway).


> So do you think the patch would be ok as it is if we change only that?

I've not looked at it in enough detail so far to say either way, sorry.

Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Mon, 2020-03-16 at 14:34 -0700, Andres Freund wrote:
> > > In particularl, I think it'd make sense to *not* have a lower freezing
> > > horizon for insert vacuums (because it *will* cause problems), but if
> > > the page is dirty anyway, then do the freezing even if freeze_min_age
> > > etc would otherwise prevent us from doing so?
> > 
> > I don't quite see why freezing tuples in insert-only tables will cause
> > problems - are you saying that more WAL will be written compared to
> > freezing with a higher freeze_min_age?
> 
> As far as I understand the patch may trigger additional vacuums e.g. for
> tables that have some heavily updated parts / key ranges, and otherwise
> are largely insert only (as long as there are in total considerably more
> inserts than updates). That's not at all uncommon.
> 
> And for the heavily updated regions the additional vacuums with a 0 min
> age could prove to be costly.  I've not looked at the new code, but it'd
> be particularly bad if the changes were to trigger the
> lazy_check_needs_freeze() check in lazy_scan_heap() - it'd have the
> potential for a lot more contention.

I think I got it.

Here is a version of the patch that does *not* freeze more tuples than
normal, except if a prior tuple on the same page is already eligible for freezing.

lazy_check_needs_freeze() is only called for an aggressive vacuum, which
this isn't.

Does that look sane?

Yours,
Laurenz Albe

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Tue, Mar 17, 2020 at 01:14:02AM +0100, Laurenz Albe wrote:
> lazy_check_needs_freeze() is only called for an aggressive vacuum, which
> this isn't.

> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -1388,17 +1388,26 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
>              else
>              {
>                  bool        tuple_totally_frozen;
> +                bool        freeze_all;
>  
>                  num_tuples += 1;
>                  hastup = true;
>  
> +                /*
> +                 * If any tuple was already frozen in the block and this is
> +                 * an insert-only vacuum, we might as well freeze all other
> +                 * tuples in that block.
> +                 */
> +                freeze_all = params->is_insert_only && has_dead_tuples;
> +

You're checking if any (previously-scanned) tuple was *dead*, but I think you
need to check nfrozen > 0.

Also, this will fail to freeze tuples on a page which *could* be
opportunistically-frozen, but *follow* the first tuple which *needs* to be
frozen.
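
I.e., something like this (untested, just to spell out what I mean):

    /* at least one tuple on this page is already slated for freezing */
    freeze_all = params->is_insert_only && nfrozen > 0;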

I think Andres was thinking this would maybe be an optimization independent of
is_insert_only (?)

>                  /*
>                   * Each non-removable tuple must be checked to see if it needs
>                   * freezing.  Note we already have exclusive buffer lock.
>                   */
>                  if (heap_prepare_freeze_tuple(tuple.t_data,
>                                                relfrozenxid, relminmxid,
> -                                              FreezeLimit, MultiXactCutoff,
> +                                              freeze_all ? 0 : FreezeLimit,
> +                                              freeze_all ? 0 : MultiXactCutoff,
>                                                &frozen[nfrozen],
>                                                &tuple_totally_frozen))

> +    /* normal autovacuum shouldn't freeze aggressively */
> +    *insert_only = false;

Aggressively is a bad choice of words.  In the context of vacuum, it usually
means "visit all pages, even those which are allvisible".

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Tue, 2020-03-17 at 10:24 -0500, Justin Pryzby wrote:
> > --- a/src/backend/access/heap/vacuumlazy.c
> > +++ b/src/backend/access/heap/vacuumlazy.c
> > @@ -1388,17 +1388,26 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
> >                        else
> >                        {
> >                                bool            tuple_totally_frozen;
> > +                             bool            freeze_all;
> >   
> >                                num_tuples += 1;
> >                                hastup = true;
> >   
> > +                             /*
> > +                              * If any tuple was already frozen in the block and this is
> > +                              * an insert-only vacuum, we might as well freeze all other
> > +                              * tuples in that block.
> > +                              */
> > +                             freeze_all = params->is_insert_only && has_dead_tuples;
> > +
> 
> You're checking if any (previously-scanned) tuple was *dead*, but I think you
> need to check nfrozen>=0.

Yes, that was a silly typo.

> Also, this will fail to freeze tuples on a page which *could* be
> oppotunistically-frozen, but *follow* the first tuple which *needs* to be
> frozen.

I am aware of that.  I was trying to see if that went in the direction that
Andres intends before trying more invasive modifications.

> I think Andres was thinking this would maybe be an optimization independent of
> is_insert_only (?)

I wasn't sure.

In the light of that, I have ripped out that code again.

Also, since aggressive^H^H^H^H^H^H^H^H^H^Hproactive freezing seems to be a
performance problem in some cases (pages with UPDATEs and DELETEs in otherwise
INSERT-mostly tables), I have done away with the whole freezing thing,
which made the whole patch much smaller and simpler.

Now all that is introduced are the threshold and scale factor and
the new statistics counter to track the number of inserts since the last
VACUUM.

> > +     /* normal autovacuum shouldn't freeze aggressively */
> > +     *insert_only = false;
> 
> Aggressively is a bad choice of words.  In the context of vacuum, it usually
> means "visit all pages, even those which are allvisible".

This is gone in the latest patch.

Updated patch attached.

Perhaps we can reach a consensus on this reduced functionality.

Yours,
Laurenz Albe

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Tue, Mar 17, 2020 at 08:42:07PM +0100, Laurenz Albe wrote:
> Also, since aggressive^H^H^H^H^H^H^H^H^H^Hproactive freezing seems to be a
> performance problem in some cases (pages with UPDATEs and DELETEs in otherwise
> INSERT-mostly tables), I have done away with the whole freezing thing,
> which made the whole patch much smaller and simpler.
> 
> Now all that is introduced are the threshold and scale factor and
> the new statistics counter to track the number of inserts since the last
> VACUUM.
> 
> Updated patch attached.
> 
> Perhaps we can reach a consensus on this reduced functionality.

+1

I still suggest scale_factor maximum of 1e10, like
4d54543efa5eb074ead4d0fadb2af4161c943044

Which allows more effectively disabling it than a factor of 100, which would
progress like: ~1, 1e2, 1e4, 1e6, 1e8, 1e10, ..

I don't think that 1e4 would be a problem, but 1e6 and 1e8 could be.  With
1e10, it's first vacuumed when there's 10 billion inserts, if we didn't previously
hit the n_dead threshold.

I think that's ok?  If one wanted to disable it up to 1e11 tuples, I think
they'd disable autovacuum, or preferably just implement a vacuum job.

The commit message says:
|The scale factor defaults to 0, which means that it is
|effectively disabled, but it offers some flexibility
..but "it" is ambiguous, so should say something like: "the table size does not
contribute to the autovacuum threshold".

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Tue, 2020-03-17 at 14:56 -0500, Justin Pryzby wrote:
> I still suggest scale_factor maximum of 1e10, like
> 4d54543efa5eb074ead4d0fadb2af4161c943044
> 
> Which alows more effectively disabling it than a factor of 100, which would
> progress like: ~1, 1e2, 1e4, 1e6, 1e8, 1e10, ..
> 
> I don't think that 1e4 would be a problem, but 1e6 and 1e8 could be.  With
> 1e10, it's first vacuumed when there's 10billion inserts, if we didn't previous
> hit the n_dead threshold.
> 
> I think that's ok?  If one wanted to disable it up to 1e11 tuples, I think
> they'd disable autovacuum, or preferably just implement an vacuum job.

Assume a scale factor >= 1, for example 2, and n live tuples.
The table has just been vacuumed.

Now we insert m tuples (which are live).

Then the condition

  threshold + scale_factor * live_tuples < newly_inserted_tuples

becomes

  10000000 + 2 * (n + m) < m

which can never be true for non-negative n and m.

So a scale factor >= 1 disables the feature.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Tue, Mar 17, 2020 at 10:01:15PM +0100, Laurenz Albe wrote:
> On Tue, 2020-03-17 at 14:56 -0500, Justin Pryzby wrote:
> > I still suggest scale_factor maximum of 1e10, like
> > 4d54543efa5eb074ead4d0fadb2af4161c943044
> > 
> > Which alows more effectively disabling it than a factor of 100, which would
> > progress like: ~1, 1e2, 1e4, 1e6, 1e8, 1e10, ..
> > 
> > I don't think that 1e4 would be a problem, but 1e6 and 1e8 could be.  With
> > 1e10, it's first vacuumed when there's 10billion inserts, if we didn't previous
> > hit the n_dead threshold.
> > 
> > I think that's ok?  If one wanted to disable it up to 1e11 tuples, I think
> > they'd disable autovacuum, or preferably just implement an vacuum job.
> 
> Assume a scale factor >= 1, for example 2, and n live tuples.
> The table has just been vacuumed.
> 
> Now we insert m number tuples (which are live).
> 
> Then the condition
> 
>   threshold + scale_factor * live_tuples < newly_inserted_tuples
> 
> becomes
> 
>   10000000 + 2 * (n + m) < m
> 
> which can never be true for non-negative n and m.
> 
> So a scale factor >= 1 disables the feature.

No, this is what we mailed about privately yesterday, and I demonstrated that
autovac can still run with factor=100.  I said:

|It's a multiplier, not a percent out of 100 (fraction is not a great choice of
|words).
|
|                &autovacuum_vac_scale,
|                0.2, 0.0, 100.0,
|
|The default is 0.2 (20%), so 100 means after updating/deleting 100*reltuples.

live tuples is an estimate, from the most recent vacuum OR analyze.

If 1.0 disabled the feature, it wouldn't make much sense to allow factor up to
100.

+       {
+               {"autovacuum_vacuum_insert_scale_factor", PGC_SIGHUP, AUTOVACUUM,
+                       gettext_noop("Number of tuple inserts prior to vacuum as a fraction of reltuples."),
+                       NULL
+               },
+               &autovacuum_vac_ins_scale,
+               0.0, 0.0, 100.0,
+               NULL, NULL, NULL
+       },

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Tue, 2020-03-17 at 16:07 -0500, Justin Pryzby wrote:
> > Assume a scale factor >= 1, for example 2, and n live tuples.
> > The table has just been vacuumed.
> > 
> > Now we insert m number tuples (which are live).
> > 
> > Then the condition
> > 
> >    threshold + scale_factor * live_tuples < newly_inserted_tuples
> > 
> > becomes
> > 
> >    10000000 + 2 * (n + m) < m
> > 
> > which can never be true for non-negative n and m.
> > 
> > So a scale factor >= 1 disables the feature.
> 
> No, this is what we mailed about privately yesterday, and I demonstrated that
> autovac can still run with factor=100.  I said:

I remember.
Can you point out where exactly the flaw in my reasoning is?

> > It's a multiplier, not a percent out of 100 (fraction is not a great choice of
> > words).
> > 
> >                 &autovacuum_vac_scale,
> >                 0.2, 0.0, 100.0,
> > 
> > The default is 0.2 (20%), so 100 means after updating/deleting 100*reltuples.

Yes, exactly.

> If 1.0 disabled the feature, it wouldn't make much sense to allow factor up to
> 100.

True, we could set the upper limit to 2, but it doesn't matter much.

Note that this is different from autovacuum_vacuum_scale_factor,
because inserted tuples are live, while dead tuples are not.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Tue, Mar 17, 2020 at 10:22:44PM +0100, Laurenz Albe wrote:
> On Tue, 2020-03-17 at 16:07 -0500, Justin Pryzby wrote:
> > > Assume a scale factor >= 1, for example 2, and n live tuples.
> > > The table has just been vacuumed.
> > > 
> > > Now we insert m number tuples (which are live).

.. but not yet counted in reltuples.

On Tue, Mar 17, 2020 at 10:22:44PM +0100, Laurenz Albe wrote:
> Note that this is different from autovacuum_vacuum_scale_factor,
> because inserted tuples are live, while dead tuples are not.

But they're not counted in reltuples until after the next vacuum (or analyze),
which is circular, since it's exactly what we're trying to schedule.

                reltuples = classForm->reltuples;
                vactuples = tabentry->n_dead_tuples;
+               instuples = tabentry->inserts_since_vacuum;
                anltuples = tabentry->changes_since_analyze;
 
                vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;
+               vacinsthresh = (float4) vac_ins_base_thresh + vac_ins_scale_factor * reltuples;

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Tue, 2020-03-17 at 16:34 -0500, Justin Pryzby wrote:
> > > > Now we insert m number tuples (which are live).
> 
> .. but not yet counted in reltuples.

Thanks for pointing out my mistake.

Here is another patch, no changes except setting the upper limit
for autovacuum_vacuum_insert_scale_factor to 1e10.

Yours,
Laurenz Albe

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Mon, Mar 16, 2020 at 07:47:13AM -0500, Justin Pryzby wrote:
> Normally, when someone complains about bad plan related to no index-onlyscan,
> we tell them to run vacuum, and if that helps, then ALTER TABLE .. SET
> (autovacuum_vacuum_scale_factor=0.005).
> 
> If there's two thresholds (4 GUCs and 4 relopts) for autovacuum, then do we
> have to help determine which one was being hit, and which relopt to set?

I don't think we came to any resolution on this.

Right now, to encourage IOS, we'd tell someone to set
autovacuum_vacuum_scale_factor=0.005.  That wouldn't work for an insert-only
table, but I've never heard back from someone that it didn't work.

So with this patch, we'd maybe tell them to do this, to also get IOS on
insert-only tables ?
|ALTER TABLE .. SET (autovacuum_vacuum_scale_factor=0.005, autovacuum_vacuum_insert_threshold=50000);

> I wonder if the new insert GUCs should default to -1 (disabled)?  And the
> insert thresholds should be set by new insert relopt (if set), or by new insert
> GUC (default -1), else normal relopt, or normal GUC.  The defaults would give
> 50 + 0.20*n.  When someone asks about IOS, we'd tell them to set
> autovacuum_vacuum_scale_factor=0.005, same as now.
> 
> vac_ins_scale_factor =
>     (relopts && relopts->vacuum_ins_scale_factor >= 0) ? relopts->vacuum_ins_scale_factor :
>     autovacuum_vac_ins_scale >= 0 ? autovacuum_vac_ins_scale : 
>     (relopts && relopts->vacuum_scale_factor >= 0) ? relopts->vacuum_scale_factor :
>     autovacuum_vac_scale;

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-17 01:14:02 +0100, Laurenz Albe wrote:
> lazy_check_needs_freeze() is only called for an aggressive vacuum, which
> this isn't.

Hm? I mean some of these will be aggressive vacuums, because it's older
than vacuum_freeze_table_age? And the lower age limit would make that
potentially more painful, no?

Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-17 20:42:07 +0100, Laurenz Albe wrote:
> > I think Andres was thinking this would maybe be an optimization independent of
> > is_insert_only (?)
>
> I wasn't sure.

I'm not sure myself - but I'm doubtful that using a 0 min age by default
will be ok.

I was trying to say (in a later email) that I think it might be a good
compromise to opportunistically freeze if we're dirtying the page
anyway, but not optimize WAL emission etc. That's a pretty simple
change, and it'd address a lot of the potential performance regressions,
while still freezing for the "first" vacuum in insert only workloads.


> Add "autovacuum_vacuum_insert_threshold" and
> "autovacuum_vacuum_insert_scale_factor" GUC and reloption.
> The default value for the threshold is 10000000.
> The scale factor defaults to 0, which means that it is
> effectively disabled, but it offers some flexibility
> to tune the feature similar to other autovacuum knobs.

I don't think a default scale factor of 0 is going to be ok. For
large-ish tables this will basically cause permanent vacuums. And it'll
sometimes trigger for tables that actually coped well so far. 10 million
rows could be just a few seconds' worth of inserts, not more.

I don't think that the argument that otherwise a table might not get
vacuumed before autovacuum_freeze_max_age is convincing enough.

a) if that's indeed the argument, we should increase the default
  autovacuum_freeze_max_age - now that there's insert triggered vacuums,
  the main argument against that from before isn't valid anymore.

b) there's not really a good argument for vacuuming more often than
  autovacuum_freeze_max_age for such tables. It'll not be frequent
  enough to allow IOS for new data, and you're not preventing
  anti-wraparound vacuums from happening.

Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
James Coleman
Date:
On Tue, Mar 17, 2020 at 9:03 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-17 20:42:07 +0100, Laurenz Albe wrote:
> > > I think Andres was thinking this would maybe be an optimization independent of
> > > is_insert_only (?)
> >
> > I wasn't sure.
>
> I'm not sure myself - but I'm doubtful that using a 0 min age by default
> will be ok.
>
> I was trying to say (in a later email) that I think it might be a good
> compromise to opportunistically freeze if we're dirtying the page
> anyway, but not optimize WAL emission etc. That's a pretty simple
> change, and it'd address a lot of the potential performance regressions,
> while still freezing for the "first" vacuum in insert only workloads.

If we have truly insert-only tables, then doesn't vacuuming with
freezing every tuple actually decrease total vacuum cost (perhaps
significantly) since otherwise every vacuum keeps having to scan the
heap for dead tuples on pages where we know there are none? Those
pages could conceptually be frozen and ignored, but are not frozen
because of the default behavior, correct?

We have tables that log each change to a business object (as I suspect
many transactional workloads do), and I've often thought that
immediately freezing every page as soon as it fills up would be a real
win for us.

If that's all true, it seems to me that removing that part of the
patch significantly lowers its value.

If we opportunistically freeze only if we're already dirtying a page,
would that help a truly insert-only workload? E.g., are there hint
bits on the page that would need to change the first time we vacuum a
full page with no dead tuples? I would have assumed the answer was
"no" (since if so I think it would follow that _all_ pages need
updated the first time they're vacuumed?). But if that's the case,
then this kind of opportunistic freezing wouldn't help this kind of
workload. Maybe there's something I'm misunderstanding about how
vacuum works though.

Thanks,
James



Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Tue, Mar 17, 2020 at 09:58:53PM -0400, James Coleman wrote:
> On Tue, Mar 17, 2020 at 9:03 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2020-03-17 20:42:07 +0100, Laurenz Albe wrote:
> > > > I think Andres was thinking this would maybe be an optimization independent of
> > > > is_insert_only (?)
> > >
> > > I wasn't sure.
> >
> > I'm not sure myself - but I'm doubtful that using a 0 min age by default
> > will be ok.
> >
> > I was trying to say (in a later email) that I think it might be a good
> > compromise to opportunistically freeze if we're dirtying the page
> > anyway, but not optimize WAL emission etc. That's a pretty simple
> > change, and it'd address a lot of the potential performance regressions,
> > while still freezing for the "first" vacuum in insert only workloads.
> 
> If we have truly insert-only tables, then doesn't vacuuming with
> freezing every tuple actually decrease total vacuum cost (perhaps
> significantly) since otherwise every vacuum keeps having to scan the
> heap for dead tuples on pages where we know there are none? Those
> pages could conceptually be frozen and ignored, but are not frozen
> because of the default behavior, correct?

The essential part of this patch is to trigger vacuum *at all* on an
insert-only table.  Before today's updated patch, it also used FREEZE on any
table which hit the new insert threshold.  The concern I raised is for
insert-MOSTLY tables.  I thought it might be an issue if repeatedly freezing
updated tuples caused vacuum to be too slow, especially if they're distributed
in pages all across the table rather than clustered.

And I asked that the behavior (FREEZE) be configurable by a separate setting
than the one that triggers autovacuum to run.  FREEZE is already controlled by
the vacuum_freeze_table_age param.

I think you're right that VACUUM FREEZE on an insert-only table would be less
expensive than vacuum once without freeze and vacuum again later, which uses
freeze.  To me, that suggests setting vacuum_freeze_table_age to a low value on
those tables.

Regular vacuum avoids scanning all-visible pages, so for an insert-only table
pages should only be vacuumed once (if frozen the 1st time) or twice (if not).

         * Except when aggressive is set, we want to skip pages that are
         * all-visible according to the visibility map, but only when we can skip

postgres=# CREATE TABLE t (i int) ; INSERT INTO t SELECT generate_series(1,999999); VACUUM VERBOSE t; VACUUM VERBOSE t;
...
INFO:  "t": found 0 removable, 999999 nonremovable row versions in 4425 out of 4425 pages
...
VACUUM
Time: 106.038 ms
INFO:  "t": found 0 removable, 175 nonremovable row versions in 1 out of 4425 pages
VACUUM
Time: 1.828 ms

=> That's vacuum's not-very-clear way of saying that it only scanned 1 page
the 2nd time around.

> We have tables that log each change to a business object (as I suspect
> many transactional workloads do), and I've often thought that
> immediately freeze every page as soon as it fills up would be a real
> win for us.
> 
> If that's all true, it seems to me that removing that part of the
> patch significantly lowers its value.

> If we opportunistically freeze only if we're already dirtying a page,
> would that help a truly insert-only workload? E.g., are there hint
> bits on the page that would need to change the first time we vacuum a
> full page with no dead tuples? I would have assumed the answer was
> "no" (since if so I think it would follow that _all_ pages need
> updated the first time they're vacuumed?).

You probably know that hint bits are written by the first process to access the
tuple after it was written.  I think you're asking if the first *vacuum*
requires additional writes beyond that.  And I think vacuum wouldn't touch the
page until it decides to freeze tuples.

I do have a patch to display the number of hint bits written and pages frozen.
https://www.postgresql.org/message-id/flat/20200126141328.GP13621%40telsasoft.com

> But if that's the case, then this kind of opportunistic freezing wouldn't
> help this kind of workload. Maybe there's something I'm misunderstanding
> about how vacuum works though.

I am reminding myself about vacuum with increasing frequency and usually still
learn something new.

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-17 21:58:53 -0400, James Coleman wrote:
> On Tue, Mar 17, 2020 at 9:03 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2020-03-17 20:42:07 +0100, Laurenz Albe wrote:
> > > > I think Andres was thinking this would maybe be an optimization independent of
> > > > is_insert_only (?)
> > >
> > > I wasn't sure.
> >
> > I'm not sure myself - but I'm doubtful that using a 0 min age by default
> > will be ok.
> >
> > I was trying to say (in a later email) that I think it might be a good
> > compromise to opportunistically freeze if we're dirtying the page
> > anyway, but not optimize WAL emission etc. That's a pretty simple
> > change, and it'd address a lot of the potential performance regressions,
> > while still freezing for the "first" vacuum in insert only workloads.
> 
> If we have truly insert-only tables, then doesn't vacuuming with
> freezing every tuple actually decrease total vacuum cost (perhaps
> significantly) since otherwise every vacuum keeps having to scan the
> heap for dead tuples on pages where we know there are none? Those
> pages could conceptually be frozen and ignored, but are not frozen
> because of the default behavior, correct?

Yes.


> If that's all true, it seems to me that removing that part of the
> patch significantly lowers its value.

Well, perfect sometimes is the enemy of the good. We gotta get something
in, and having some automated vacuuming for insert mostly/only tables is
a huge step forward. And avoiding regressions is an important part of
doing so.

I outlined the steps we could take to allow for more aggressive
vacuuming upthread.


> If we opportunistically freeze only if we're already dirtying a page,
> would that help a truly insert-only workload?

Yes.


> E.g., are there hint bits on the page that would need to change the
> first time we vacuum a full page with no dead tuples?

Yes. HEAP_XMIN_COMMITTED.


> I would have assumed the answer was "no" (since if so I think it would
> follow that _all_ pages need updated the first time they're
> vacuumed?).

That is the case. Although they might already be set when the tuples are
accessed for other reasons.


Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
James Coleman
Date:
On Tue, Mar 17, 2020 at 11:37 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Tue, Mar 17, 2020 at 09:58:53PM -0400, James Coleman wrote:
> > On Tue, Mar 17, 2020 at 9:03 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On 2020-03-17 20:42:07 +0100, Laurenz Albe wrote:
> > > > > I think Andres was thinking this would maybe be an optimization independent of
> > > > > is_insert_only (?)
> > > >
> > > > I wasn't sure.
> > >
> > > I'm not sure myself - but I'm doubtful that using a 0 min age by default
> > > will be ok.
> > >
> > > I was trying to say (in a later email) that I think it might be a good
> > > compromise to opportunistically freeze if we're dirtying the page
> > > anyway, but not optimize WAL emission etc. That's a pretty simple
> > > change, and it'd address a lot of the potential performance regressions,
> > > while still freezing for the "first" vacuum in insert only workloads.
> >
> > If we have truly insert-only tables, then doesn't vacuuming with
> > freezing every tuple actually decrease total vacuum cost (perhaps
> > significantly) since otherwise every vacuum keeps having to scan the
> > heap for dead tuples on pages where we know there are none? Those
> > pages could conceptually be frozen and ignored, but are not frozen
> > because of the default behavior, correct?
>
> The essential part of this patch is to trigger vacuum *at all* on an
> insert-only table.  Before today's updated patch, it also used FREEZE on any
> table which hit the new insert threshold.  The concern I raised is for
> insert-MOSTLY tables.  I thought it might be an issue if repeatedly freezing
> updated tuples caused vacuum to be too slow, especially if they're distributed
> in pages all across the table rather than clustered.

Yeah, for some reason I'd completely forgotten (caught up in thinking
about the best possible outcome re: freezing insert only tables) that
the bigger problem was just triggering vacuum at all on those tables.

> And I asked that the behavior (FREEZE) be configurable by a separate setting
> than the one that triggers autovacuum to run.  FREEZE is already controlled by
> the vacuum_freeze_table_age param.
>
> I think you're right that VACUUM FREEZE on an insert-only table would be less
> expensive than vacuum once without freeze and vacuum again later, which uses
> freeze.  To me, that suggests setting vacuum_freeze_table_age to a low value on
> those tables.
>
> Regular vacuum avoids scanning all-visible pages, so for an insert-only table
> pages should only be vacuumed once (if frozen the 1st time) or twice (if not).
>
>          * Except when aggressive is set, we want to skip pages that are
>          * all-visible according to the visibility map, but only when we can skip
>
> postgres=# CREATE TABLE t (i int) ; INSERT INTO t SELECT generate_series(1,999999); VACUUM VERBOSE t; VACUUM VERBOSE
t;
> ...
> INFO:  "t": found 0 removable, 999999 nonremovable row versions in 4425 out of 4425 pages
> ...
> VACUUM
> Time: 106.038 ms
> INFO:  "t": found 0 removable, 175 nonremovable row versions in 1 out of 4425 pages
> VACUUM
> Time: 1.828 ms
>
> => That's its not very clear way of saying that it only scanned 1 page the 2nd
> time around.

I didn't realize that about the visibility map being taken into account.

> > We have tables that log each change to a business object (as I suspect
> > many transactional workloads do), and I've often thought that
> > immediately freeze every page as soon as it fills up would be a real
> > win for us.
> >
> > If that's all true, it seems to me that removing that part of the
> > patch significantly lowers its value.
>
> > If we opportunistically freeze only if we're already dirtying a page,
> > would that help a truly insert-only workload? E.g., are there hint
> > bits on the page that would need to change the first time we vacuum a
> > full page with no dead tuples? I would have assumed the answer was
> > "no" (since if so I think it would follow that _all_ pages need
> > updated the first time they're vacuumed?).
>
> You probably know that hint bits are written by the first process to access the
> tuple after it was written.  I think you're asking if the first *vacuum*
> requires additional writes beyond that.  And I think vacuum wouldn't touch the
> page until it decides to freeze tuples.

I think my assumption is that (at least in our case) the first
process to access a tuple will definitely not be vacuum on any regular basis.

> I do have a patch to display the number of hint bits written and pages frozen.
> https://www.postgresql.org/message-id/flat/20200126141328.GP13621%40telsasoft.com

I'll take a look at that too.

> > But if that's the case, then this kind of opportunistic freezing wouldn't
> > help this kind of workload. Maybe there's something I'm misunderstanding
> > about how vacuum works though.
>
> I am reminding myself about vacuum with increasing frequency and usually still
> learn something new.

For sure.

Thanks,
James



Re: Berserk Autovacuum (let's save next Mandrill)

From
James Coleman
Date:
On Wed, Mar 18, 2020 at 1:08 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-17 21:58:53 -0400, James Coleman wrote:
> > On Tue, Mar 17, 2020 at 9:03 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2020-03-17 20:42:07 +0100, Laurenz Albe wrote:
> > > > > I think Andres was thinking this would maybe be an optimization independent of
> > > > > is_insert_only (?)
> > > >
> > > > I wasn't sure.
> > >
> > > I'm not sure myself - but I'm doubtful that using a 0 min age by default
> > > will be ok.
> > >
> > > I was trying to say (in a later email) that I think it might be a good
> > > compromise to opportunistically freeze if we're dirtying the page
> > > anyway, but not optimize WAL emission etc. That's a pretty simple
> > > change, and it'd address a lot of the potential performance regressions,
> > > while still freezing for the "first" vacuum in insert only workloads.
> >
> > If we have truly insert-only tables, then doesn't vacuuming with
> > freezing every tuple actually decrease total vacuum cost (perhaps
> > significantly) since otherwise every vacuum keeps having to scan the
> > heap for dead tuples on pages where we know there are none? Those
> > pages could conceptually be frozen and ignored, but are not frozen
> > because of the default behavior, correct?
>
> Yes.
>
>
> > If that's all true, it seems to me that removing that part of the
> > patch significantly lowers its value.
>
> Well, perfect sometimes is the enemy of the good. We gotta get something
> in, and having some automated vacuuming for insert mostly/only tables is
> a huge step forward. And avoiding regressions is an important part of
> doing so.

Yep, as I responded to Justin, in thinking about the details I'd lost
sight of the biggest issue.

So I withdraw that concern in favor of getting something out that
improves things now.

...

> > If we opportunistically freeze only if we're already dirtying a page,
> > would that help a truly insert-only workload?
>
> Yes.

Only if some other process hasn't already read and caused hint bits to
be written, correct? Or am I missing something there too?

> > E.g., are there hint bits on the page that would need to change the
> > first time we vacuum a full page with no dead tuples?
>
> Yes. HEAP_XMIN_COMMITTED.

This can be set opportunistically by other non-vacuum processes though?

> > I would have assumed the answer was "no" (since if so I think it would
> > follow that _all_ pages need updated the first time they're
> > vacuumed?).
>
> That is the case. Although they might already be set when the tuples are
> accessed for other reasons.

Ah, I think this is answering what I'd asked above.

I'm very excited to see improvements in flight on this use case.

Thanks,
James



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Tue, 2020-03-17 at 17:26 -0700, Andres Freund wrote:
> On 2020-03-17 01:14:02 +0100, Laurenz Albe wrote:
> > lazy_check_needs_freeze() is only called for an aggressive vacuum, which
> > this isn't.
> 
> Hm? I mean some of these will be aggressive vacuums, because it's older
> than vacuum_freeze_table_age? And the lower age limit would make that
> potentially more painful, no?

You are right.  I thought of autovacuum_freeze_max_age, but not of
vacuum_freeze_table_age.

Autovacuum configuration is so woefully complicated that it makes me
feel bad to propose two more parameters :^(

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Tue, 2020-03-17 at 18:02 -0700, Andres Freund wrote:
> I don't think a default scale factor of 0 is going to be ok. For
> large-ish tables this will basically cause permanent vacuums. And it'll
> sometimes trigger for tables that actually coped well so far. 10 million
> rows could be a few seconds, not more.
> 
> I don't think that the argument that otherwise a table might not get
> vacuumed before autovacuum_freeze_max_age is convincing enough.
> 
> a) if that's indeed the argument, we should increase the default
>   autovacuum_freeze_max_age - now that there's insert triggered vacuums,
>   the main argument against that from before isn't valid anymore.
> 
> b) there's not really a good arguments for vacuuming more often than
>   autovacuum_freeze_max_age for such tables. It'll not be not frequent
>   enough to allow IOS for new data, and you're not preventing
>   anti-wraparound vacuums from happening.

According to my reckoning, that is the remaining objection to the patch
as it is (with ordinary freezing behavior).

How about a scale_factor of 0.005?  That will be high enough for large
tables, which seem to be the main concern here.

I fully agree with your point a) - should that be part of the patch?

I am not sure about b).  In my mind, the objective is not to prevent
anti-wraparound vacuums, but to see that they have less work to do,
because previous autovacuum runs already have frozen anything older than
vacuum_freeze_min_age.  So, assuming linear growth, the number of tuples
to freeze during any run would be at most one fourth of today's number
when we hit autovacuum_freeze_max_age (with the defaults, that is
vacuum_freeze_min_age = 50 million out of autovacuum_freeze_max_age = 200 million).

I am still sorry to see more proactive freezing go, which would
reduce the impact for truly insert-only tables.
After sleeping on it, here is one last idea.

Granted, freezing with vacuum_freeze_min_age = 0 poses a problem
for those parts of the table that will receive updates or deletes.
But what if insert-triggered vacuum operates with - say -
one tenth of vacuum_freeze_min_age (unless explicitly overridden
for the table)?  That might still be high enough not to needlessly
freeze too many tuples that will still be modified, but it will
reduce the impact on insert-only tables.
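
In code, that idea would amount to something as small as the following
(untested sketch; "is_insert_only" is the flag the earlier patch versions
carried in VacuumParams, and handling of an explicit per-table override of
vacuum_freeze_min_age is omitted):

    /* insert-triggered vacuums use a ten times lower freeze horizon */
    if (params->is_insert_only)
        freeze_min_age = freeze_min_age / 10;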

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Fri, Mar 13, 2020 at 02:38:51PM -0700, Andres Freund wrote:
> On 2020-03-13 13:44:42 -0500, Justin Pryzby wrote:
> > Having now played with the patch, I'll suggest that 10000000 is too high a
> > threshold.  If autovacuum runs without FREEZE, I don't see why it couldn't be
> > much lower (100000?) or use (0.2 * n_ins + 50) like the other autovacuum GUC.
> 
> ISTM that the danger of regressing workloads due to suddenly repeatedly
> scanning huge indexes that previously were never / rarely scanned is
> significant (if there's a few dead tuples, otherwise most indexes will
> be able to skip the scan since the vacuum_cleanup_index_scale_factor
> introduction)).

We could try to avoid that issue here:

|        /* If any tuples need to be deleted, perform final vacuum cycle */
|        /* XXX put a threshold on min number of tuples here? */
|        if (dead_tuples->num_tuples > 0)
|        {
|                /* Work on all the indexes, and then the heap */
|                lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
|                                                                lps, nindexes);
|
|                /* Remove tuples from heap */
|                lazy_vacuum_heap(onerel, vacrelstats);
|        }

As you said, an insert-only table can skip scanning indexes, but an
insert-mostly table currently cannot.

Maybe we could skip the final index scan if we hit the autovacuum insert
threshold?
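
To make it concrete, here is a toy standalone sketch of the decision I
have in mind (the insert-triggered flag and the skip threshold are made
up; this is not a patch, just an illustration):

#include <stdio.h>
#include <stdbool.h>

/*
 * Toy sketch only: skip the index/heap vacuum pass when the vacuum was
 * triggered purely by the insert threshold and only a token number of
 * dead tuples was found.  The dead tuples stay behind and get removed
 * by a later, dead-tuple-triggered vacuum.  All names here are made up.
 */
static bool
run_index_and_heap_vacuum(long num_dead_tuples,
                          bool insert_triggered,
                          long dead_tuple_skip_threshold)
{
    if (num_dead_tuples == 0)
        return false;               /* nothing to remove at all */
    if (insert_triggered && num_dead_tuples < dead_tuple_skip_threshold)
        return false;               /* defer cleanup to a future vacuum */
    return true;                    /* do the full index + heap pass */
}

int
main(void)
{
    printf("%d\n", run_index_and_heap_vacuum(10, true, 1000));   /* 0: skipped */
    printf("%d\n", run_index_and_heap_vacuum(5000, true, 1000)); /* 1: vacuumed */
    return 0;
}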

I still don't like mixing the thresholds with the behavior they imply, but
maybe what's needed is better docs describing all of vacuum's roles and its
procedure and priority in executing them.

The dead tuples would just be cleaned up during a future vacuum, right ?  So
that would be less efficient, but (no surprise) there's a balance to strike and
that can be tuned.  I think that wouldn't be an issue for most people; the
worst case would be if you set high maint_work_mem, and low insert threshold,
and you got increased bloat.  But faster vacuum if we avoided idx scans.

That might allow more flexibility in our discussion around default values for
thresholds for insert-triggered vacuum.

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Thu, 19 Mar 2020 at 18:45, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Tue, 2020-03-17 at 18:02 -0700, Andres Freund wrote:
> > I don't think a default scale factor of 0 is going to be ok. For
> > large-ish tables this will basically cause permanent vacuums. And it'll
> > sometimes trigger for tables that actually coped well so far. 10 million
> > rows could be a few seconds, not more.
> >
> > I don't think that the argument that otherwise a table might not get
> > vacuumed before autovacuum_freeze_max_age is convincing enough.
> >
> > a) if that's indeed the argument, we should increase the default
> >   autovacuum_freeze_max_age - now that there's insert triggered vacuums,
> >   the main argument against that from before isn't valid anymore.
> >
> > b) there's not really a good arguments for vacuuming more often than
> >   autovacuum_freeze_max_age for such tables. It'll not be not frequent
> >   enough to allow IOS for new data, and you're not preventing
> >   anti-wraparound vacuums from happening.
>
> According to my reckoning, that is the remaining objection to the patch
> as it is (with ordinary freezing behavior).
>
> How about a scale_factor od 0.005?  That will be high enough for large
> tables, which seem to be the main concern here.

I agree with that; however, I'd thought 0.01, just so we still end up
with about 100 times less work to do for huge insert-only tables when
it comes time to perform an anti-wraparound vacuum.

> I fully agree with your point a) - should that be part of the patch?

I think it will be a good idea to increase this, but I really don't
think this patch should be touching it.  It's something to put on the
issues list for after the CF so more people have the bandwidth to chip
in their thoughts.

> I am not sure about b).  In my mind, the objective is not to prevent
> anti-wraparound vacuums, but to see that they have less work to do,
> because previous autovacuum runs already have frozen anything older than
> vacuum_freeze_min_age.  So, assuming linear growth, the number of tuples
> to freeze during any run would be at most one fourth of today's number
> when we hit autovacuum_freeze_max_age.

I hear what Andres is saying about proactive freezing for already
dirty pages.  I think that's worth looking into, but don't feel like
we need to do it for this patch. The patch is worthy without it and
such a change affects more than insert-vacuums, so should be a
separate commit.

If people really do have an insert-only table then we can recommend
that they set the table's autovacuum_freeze_min_age to 0.

> I am still sorry to see more proactive freezing go, which would
> reduce the impact for truly insert-only tables.
> After sleeping on it, here is one last idea.
>
> Granted, freezing with vacuum_freeze_min_age = 0 poses a problem
> for those parts of the table that will receive updates or deletes.
> But what if insert-triggered vacuum operates with - say -
> one tenth of vacuum_freeze_min_age (unless explicitly overridden
> for the table)?  That might still be high enough not to needlessly
> freeze too many tuples that will still be modified, but it will
> reduce the impact on insert-only tables.

I think that might be a bit too magical and may not be what some
people want. I know that most people won't set
autovacuum_freeze_min_age to 0 for insert-only tables, but we can at
least add something to the documentation to mention that it's a good
idea; however, looking over the docs, I'm not too sure of the best
place to note that down.

I've attached a small fix which I'd like to apply to your v8 patch.
With that, and pending one final look, I'd like to push this during my
Monday (New Zealand time).  So if anyone strongly objects to that,
please state their case before then.

David

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Thu, 19 Mar 2020 at 19:07, Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Fri, Mar 13, 2020 at 02:38:51PM -0700, Andres Freund wrote:
> > On 2020-03-13 13:44:42 -0500, Justin Pryzby wrote:
> > > Having now played with the patch, I'll suggest that 10000000 is too high a
> > > threshold.  If autovacuum runs without FREEZE, I don't see why it couldn't be
> > > much lower (100000?) or use (0.2 * n_ins + 50) like the other autovacuum GUC.
> >
> > ISTM that the danger of regressing workloads due to suddenly repeatedly
> > scanning huge indexes that previously were never / rarely scanned is
> > significant (if there's a few dead tuples, otherwise most indexes will
> > be able to skip the scan since the vacuum_cleanup_index_scale_factor
> > introduction)).
>
> We could try to avoid that issue here:
>
> |        /* If any tuples need to be deleted, perform final vacuum cycle */
> |        /* XXX put a threshold on min number of tuples here? */
> |        if (dead_tuples->num_tuples > 0)
> |        {
> |                /* Work on all the indexes, and then the heap */
> |                lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
> |                                                                lps, nindexes);
> |
> |                /* Remove tuples from heap */
> |                lazy_vacuum_heap(onerel, vacrelstats);
> |        }
>
> As you said, an insert-only table can skip scanning indexes, but an
> insert-mostly table currently cannot.
>
> Maybe we could skip the final index scan if we hit the autovacuum insert
> threshold?
>
> I still don't like mixing the thresholds with the behavior they imply, but
> maybe what's needed is better docs describing all of vacuum's roles and its
> procedure and priority in executing them.
>
> The dead tuples would just be cleaned up during a future vacuum, right ?  So
> that would be less efficient, but (no surprise) there's a balance to strike and
> that can be tuned.  I think that wouldn't be an issue for most people; the
> worst case would be if you set high maint_work_mem, and low insert threshold,
> and you got increased bloat.  But faster vacuum if we avoided idx scans.
>
> That might allow more flexibility in our discussion around default values for
> thresholds for insert-triggered vacuum.

We went over this a bit already. The risk is that if you have an
insert-mostly table and always trigger an auto-vacuum for inserts and
never due to dead tuples, then you'll forego the index cleanup every
time causing the indexes to bloat over time.

I think any considerations to add some sort of threshold on dead
tuples before cleaning the index should be considered independently.
Trying to get everyone to agree to what's happening here is hard
enough without adding more options to the list.  I understand that
insert-only tables with a tiny number of dead tuples, perhaps due to
aborts, could cause some issues while scanning the index, but that's
really one of the big reasons why the
10 million insert threshold has been added. Just in the past few hours
we've talked about having a very small scale factor to protect from
over-vacuum on huge tables that see 10 million tuples inserted in
short spaces of time.  I think that's a good compromise, however,
certainly not perfect.

David



Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Thu, Mar 19, 2020 at 09:52:11PM +1300, David Rowley wrote:
> On Thu, 19 Mar 2020 at 19:07, Justin Pryzby <pryzby@telsasoft.com> wrote:
> >
> > On Fri, Mar 13, 2020 at 02:38:51PM -0700, Andres Freund wrote:
> > > On 2020-03-13 13:44:42 -0500, Justin Pryzby wrote:
> > > > Having now played with the patch, I'll suggest that 10000000 is too high a
> > > > threshold.  If autovacuum runs without FREEZE, I don't see why it couldn't be
> > > > much lower (100000?) or use (0.2 * n_ins + 50) like the other autovacuum GUC.
> > >
> > > ISTM that the danger of regressing workloads due to suddenly repeatedly
> > > scanning huge indexes that previously were never / rarely scanned is
> > > significant (if there's a few dead tuples, otherwise most indexes will
> > > be able to skip the scan since the vacuum_cleanup_index_scale_factor
> > > introduction)).
> >
> > We could try to avoid that issue here:
> >
> > |        /* If any tuples need to be deleted, perform final vacuum cycle */
> > |        /* XXX put a threshold on min number of tuples here? */
> > |        if (dead_tuples->num_tuples > 0)
> > |        {
> > |                /* Work on all the indexes, and then the heap */
> > |                lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
> > |                                                                lps, nindexes);
> > |
> > |                /* Remove tuples from heap */
> > |                lazy_vacuum_heap(onerel, vacrelstats);
> > |        }
> >
> > As you said, an insert-only table can skip scanning indexes, but an
> > insert-mostly table currently cannot.
> >
> > Maybe we could skip the final index scan if we hit the autovacuum insert
> > threshold?
> >
> > I still don't like mixing the thresholds with the behavior they imply, but
> > maybe what's needed is better docs describing all of vacuum's roles and its
> > procedure and priority in executing them.
> >
> > The dead tuples would just be cleaned up during a future vacuum, right ?  So
> > that would be less efficient, but (no surprise) there's a balance to strike and
> > that can be tuned.  I think that wouldn't be an issue for most people; the
> > worst case would be if you set high maint_work_mem, and low insert threshold,
> > and you got increased bloat.  But faster vacuum if we avoided idx scans.
> >
> > That might allow more flexibility in our discussion around default values for
> > thresholds for insert-triggered vacuum.
> 
> We went over this a bit already. The risk is that if you have an
> insert-mostly table and always trigger an auto-vacuum for inserts and
> never due to dead tuples, then you'll forego the index cleanup every
> time causing the indexes to bloat over time.

At the time, we were talking about skipping the index *cleanup* phase.
Which also incurs an index scan.
>+        tab->at_params.index_cleanup = insert_only ? VACOPT_TERNARY_DISABLED : VACOPT_TERNARY_DEFAULT;
We decided not to skip this, since it would allow index bloat, if vacuum were
only ever run due to inserts, and cleanup never happened.

I'm suggesting the possibility of skipping not index *cleanup* but index (and
heap) *vacuum*.  So that saves an index scan itself, and I think implies later
skipping cleanup (since no index tuples were removed).  But more importantly, I
think if we skip that during an insert-triggered vacuum, the dead heap tuples
are still there during the next vacuum instance.  So unlike index cleanup
(where we don't keep track of the total number of dead index tuples), this can
accumulate over time, and eventually trigger index+heap vacuum, and cleanup.

> I think any considerations to add some sort of threshold on dead
> tuples before cleaning the index should be considered independently.

+1, yes.  I'm hoping to anticipate and mitigate any objections and regressions
more than raise them.

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Thu, 2020-03-19 at 21:39 +1300, David Rowley wrote:
> > According to my reckoning, that is the remaining objection to the patch
> > as it is (with ordinary freezing behavior).
> >
> > How about a scale_factor od 0.005?  That will be high enough for large
> > tables, which seem to be the main concern here.
> 
> I agree with that, however, I'd thought 0.01, just so we're still
> close to having about 100 times less work to do for huge insert-only
> tables when it comes to having to perform an anti-wraparound vacuum.

Fine with me.

> > I am still sorry to see more proactive freezing go, which would
> > reduce the impact for truly insert-only tables.
> > After sleeping on it, here is one last idea.
> >
> > Granted, freezing with vacuum_freeze_min_age = 0 poses a problem
> > for those parts of the table that will receive updates or deletes.
> > But what if insert-triggered vacuum operates with - say -
> > one tenth of vacuum_freeze_min_age (unless explicitly overridden
> > for the table)?  That might still be high enough not to needlessly
> > freeze too many tuples that will still be modified, but it will
> > reduce the impact on insert-only tables.
> 
> I think that might be a bit too magical and may not be what some
> people want. I know that most people won't set
> autovacuum_freeze_min_age to 0 for insert-only tables, but we can at
> least throw something in the documents to mention it's a good idea,
> however, looking over the docs I'm not too sure the best place to note
> that down.

I was afraid that idea would be too cute to be appealing.

> I've attached a small fix which I'd like to apply to your v8 patch.
> With that, and pending one final look, I'd like to push this during my
> Monday (New Zealand time).  So if anyone strongly objects to that,
> please state their case before then.

Thanks!

I have rolled your edits into the attached patch v9, rebased against
current master.

Yours,
Laurenz Albe

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-19 06:45:48 +0100, Laurenz Albe wrote:
> On Tue, 2020-03-17 at 18:02 -0700, Andres Freund wrote:
> > I don't think a default scale factor of 0 is going to be ok. For
> > large-ish tables this will basically cause permanent vacuums. And it'll
> > sometimes trigger for tables that actually coped well so far. 10 million
> > rows could be a few seconds, not more.
> > 
> > I don't think that the argument that otherwise a table might not get
> > vacuumed before autovacuum_freeze_max_age is convincing enough.
> > 
> > a) if that's indeed the argument, we should increase the default
> >   autovacuum_freeze_max_age - now that there's insert triggered vacuums,
> >   the main argument against that from before isn't valid anymore.
> > 
> > b) there's not really a good arguments for vacuuming more often than
> >   autovacuum_freeze_max_age for such tables. It'll not be not frequent
> >   enough to allow IOS for new data, and you're not preventing
> >   anti-wraparound vacuums from happening.
> 
> According to my reckoning, that is the remaining objection to the patch
> as it is (with ordinary freezing behavior).
> 
> How about a scale_factor od 0.005?  That will be high enough for large
> tables, which seem to be the main concern here.

Seems low at first blush. On a large-ish table with 1 billion tuples,
we'd vacuum every 5 million inserts. For many ETL workloads this will
result in a vacuum after every bulk operation, potentially with an index
scan attached (even if there are no errors, a lot of bulk loads use ON
CONFLICT inserts, leading to the occasional update).

Personally I think we should be considerably more conservative in the
first release or two. Exposing a lot of people who previously didn't
have many problems to *massively* more aggressive vacuuming, basically
running permanently on an insert-only table, will be bad.


> I fully agree with your point a) - should that be part of the patch?

We can just make it a separate patch committed shortly afterwards.


> I am not sure about b).  In my mind, the objective is not to prevent
> anti-wraparound vacuums, but to see that they have less work to do,
> because previous autovacuum runs already have frozen anything older than
> vacuum_freeze_min_age.  So, assuming linear growth, the number of tuples
> to freeze during any run would be at most one fourth of today's number
> when we hit autovacuum_freeze_max_age.

This whole chain of arguments seems like it actually has little to do
with vacuuming insert only/mostly tables. The same problem exists for
tables that aren't insert only/mostly. Instead it IMO is an argument for
a general change in logic about when to freeze.

What exactly is it that you want to achieve by having anti-wrap vacuums
be quicker? If the goal is to reduce the window in which autovacuums
aren't automatically cancelled when there's a conflicting lock request,
or in which autovacuum just schedules based on xid age, then you can't
have wraparound vacuums needing to do substantial amount of work.

Except for not auto-cancelling, and the autovac scheduling issue,
there's really nothing magic about anti-wrap vacuums.


If the goal is to avoid redundant writes, then it's largely unrelated to
anti-wrap vacuums, and can to a large degree be addressed by
opportunistically freezing (best even during hot pruning!).


I am more and more convinced that it's a seriously bad idea to tie
committing "autovacuum after inserts" to also committing a change in
logic around freezing. That's not to say we shouldn't try to address
both this cycle, but discussing them as if they really are one item
makes it both more likely that we get nothing in, and more likely that
we miss the larger picture.


> I am still sorry to see more proactive freezing go, which would
> reduce the impact for truly insert-only tables.
> After sleeping on it, here is one last idea.
> 
> Granted, freezing with vacuum_freeze_min_age = 0 poses a problem
> for those parts of the table that will receive updates or deletes.

IMO it's not at all just those regions that are potentially negatively
affected:
If there are no other modifications to the page, freezing more
aggressively can lead to seriously increased write volume. It's quite normal
to have databases where data in insert only tables *never* gets old
enough to need to be frozen (either because xid usage is low, or because
older partitions are dropped).  If data in an insert-only table isn't
write-only, the hint bits are likely to already be set, which means that
vacuum will just cause the entire table to be written another time,
without a reason.


I don't see how it's OK to substantially regress this very common
workload. IMO this basically means that more aggressive,
non-opportunistic freezing simply is a no-go (be it for insert or
other causes for vacuuming).

What am I missing?

Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
Darafei "Komяpa" Praliaskouski
Date:
> > According to my reckoning, that is the remaining objection to the patch
> > as it is (with ordinary freezing behavior).
> >
> > How about a scale_factor od 0.005?  That will be high enough for large
> > tables, which seem to be the main concern here.
>
> Seems low on a first blush. On a large-ish table with 1 billion tuples,
> we'd vacuum every 5 million inserts. For many ETL workloads this will
> result in a vacuum after every bulk operation. Potentially with an index
> scan associated (even if there's no errors, a lot of bulk loads use ON
> CONFLICT INSERT leading to the occasional update).

This is a good and wanted thing. Upthread it was already suggested
that "everyone knows to vacuum after bulk operations". This will vacuum
the data while it's still hot and in the caches, not later, when it
has to be read back from disk.


> > I am not sure about b).  In my mind, the objective is not to prevent
> > anti-wraparound vacuums, but to see that they have less work to do,
> > because previous autovacuum runs already have frozen anything older than
> > vacuum_freeze_min_age.  So, assuming linear growth, the number of tuples
> > to freeze during any run would be at most one fourth of today's number
> > when we hit autovacuum_freeze_max_age.
>
> This whole chain of arguments seems like it actually has little to do
> with vacuuming insert only/mostly tables. The same problem exists for
> tables that aren't insert only/mostly. Instead it IMO is an argument for
> a general change in logic about when to freeze.
>
> What exactly is it that you want to achieve by having anti-wrap vacuums
> be quicker? If the goal is to reduce the window in which autovacuums
> aren't automatically cancelled when there's a conflicting lock request,
> or in which autovacuum just schedules based on xid age, then you can't
> have wraparound vacuums needing to do substantial amount of work.

The problem hit by Mandrill is simple: in modern cloud environments
it's sometimes simply impossible to read all the data on disk because
of different kinds of throttling.
At some point your production database just shuts down and asks you to
VACUUM in single-user mode for 40 days.

You want vacuum to happen long before that, preferably while the data
is still in RAM, or at least still fits within your cloud provider's
disk burst performance budget, where the block device performs like an
SSD and not like a floppy disk.

Some more reading on how that works:
https://aws.amazon.com/ru/blogs/database/understanding-burst-vs-baseline-performance-with-amazon-rds-and-gp2/

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa



Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-19 20:47:40 +0100, Laurenz Albe wrote:
> On Thu, 2020-03-19 at 21:39 +1300, David Rowley wrote:
> > I've attached a small fix which I'd like to apply to your v8 patch.
> > With that, and pending one final look, I'd like to push this during my
> > Monday (New Zealand time).  So if anyone strongly objects to that,
> > please state their case before then.

I am doubtful it should be committed with the current settings. See below.


> From 3ba4b572d82969bbb2af787d1bccc72f417ad3a0 Mon Sep 17 00:00:00 2001
> From: Laurenz Albe <laurenz.albe@cybertec.at>
> Date: Thu, 19 Mar 2020 20:26:43 +0100
> Subject: [PATCH] Autovacuum tables that have received only inserts
>
> Add "autovacuum_vacuum_insert_threshold" and
> "autovacuum_vacuum_insert_scale_factor" GUC and reloption.
> The default value for the threshold is 10000000;
> the scale factor defaults to 0.01.
>
> Any table that has received more inserts since it was
> last vacuumed (and that is not vacuumed for another
> reason) will be autovacuumed.
>
> This avoids the known problem that insert-only tables
> are never autovacuumed until they need to have their
> anti-wraparound autovacuum, which then can be massive
> and disruptive.

Shouldn't this also mention index only scans? IMO that's at least as big
a problem as the "large vacuum" problem.


> +     <varlistentry id="guc-autovacuum-vacuum-insert-threshold" xreflabel="autovacuum_vacuum_insert_threshold">
> +      <term><varname>autovacuum_vacuum_insert_threshold</varname> (<type>integer</type>)
> +      <indexterm>
> +       <primary><varname>autovacuum_vacuum_insert_threshold</varname></primary>
> +       <secondary>configuration parameter</secondary>
> +      </indexterm>
> +      </term>
> +      <listitem>
> +       <para>
> +        Specifies the number of inserted tuples needed to trigger a
> +        <command>VACUUM</command> in any one table.
> +        The default is 10000000 tuples.
> +        This parameter can only be set in the <filename>postgresql.conf</filename>
> +        file or on the server command line;
> +        but the setting can be overridden for individual tables by
> +        changing table storage parameters.
> +       </para>
> +      </listitem>
> +     </varlistentry>
> +
>       <varlistentry id="guc-autovacuum-analyze-threshold" xreflabel="autovacuum_analyze_threshold">
>        <term><varname>autovacuum_analyze_threshold</varname> (<type>integer</type>)
>        <indexterm>
> @@ -7342,6 +7362,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
>        </listitem>
>       </varlistentry>
>
> +     <varlistentry id="guc-autovacuum-vacuum-insert-scale-factor" xreflabel="autovacuum_vacuum_insert_scale_factor">
> +      <term><varname>autovacuum_vacuum_insert_scale_factor</varname> (<type>floating point</type>)
> +      <indexterm>
> +       <primary><varname>autovacuum_vacuum_insert_scale_factor</varname></primary>
> +       <secondary>configuration parameter</secondary>
> +      </indexterm>
> +      </term>
> +      <listitem>
> +       <para>
> +        Specifies a fraction of the table size to add to
> +        <varname>autovacuum_vacuum_insert_threshold</varname>
> +        when deciding whether to trigger a <command>VACUUM</command>.
> +        The default is 0.01 (1% of table size).
> +        This parameter can only be set in the <filename>postgresql.conf</filename>
> +        file or on the server command line;
> +        but the setting can be overridden for individual tables by
> +        changing table storage parameters.
> +       </para>
> +      </listitem>
> +     </varlistentry>
> +

I am *VERY* doubtful that the attempt of using a large threshold, and a
tiny scale factor, is going to work out well. I'm not confident enough
in my gut feeling to full throatedly object, but confident enough that
I'd immediately change it on any important database I operated.

Independent of how large a constant you set the threshold to, for
databases with substantially bigger tables this will lead to [near]
constant vacuuming. As soon as you hit 1 billion rows - which isn't
actually that much - this is equivalent to setting
autovacuum_{vacuum,analyze}_scale_factor to 0.01. There's cases where
that can be a sensible setting, but I don't think anybody would suggest
it as a default.


After thinking about it for a while, I think it's fundamentally flawed
to use large constant thresholds to avoid unnecessary vacuums. It's easy
to see cases where it's bad for common databases of today, but it'll be
much worse a few years down the line where common table sizes have grown
by a magnitude or two. Nor do they address the difference between tables
of a certain size with e.g. 2kb wide rows, and a same sized table with
28 byte wide rows.  The point of constant thresholds imo can only be to
avoid unnecessary work at the *small* (even tiny) end, not the opposite.


I think there's too much "reinventing" autovacuum scheduling in a
"local" insert-only manner happening in this thread. And as far as I can
tell additionally only looking at a somewhat narrow slice of insert only
workloads.


I, again, strongly suggest using much more conservative values here. And
then try to address the shortcomings - like not freezing aggressively
enough - in separate patches (and by now separate releases, in all
likelihood).


This will have a huge impact on a lot of postgres
installations. Autovacuum already is perceived as one of the biggest
issues around postgres. If the ratio of cases where these changes
improve things to the cases it regresses isn't huge, it'll be painful
(silent improvements are obviously less noticed than breakages).

Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-20 01:11:23 +0300, Darafei "Komяpa" Praliaskouski wrote:
> > > According to my reckoning, that is the remaining objection to the patch
> > > as it is (with ordinary freezing behavior).
> > >
> > > How about a scale_factor od 0.005?  That will be high enough for large
> > > tables, which seem to be the main concern here.
> >
> > Seems low on a first blush. On a large-ish table with 1 billion tuples,
> > we'd vacuum every 5 million inserts. For many ETL workloads this will
> > result in a vacuum after every bulk operation. Potentially with an index
> > scan associated (even if there's no errors, a lot of bulk loads use ON
> > CONFLICT INSERT leading to the occasional update).
> 
> This is a good and wanted thing.

I don't think that's true in general. As proposed this can increase the
overall amount of IO (both reads and writes) due to vacuum by a *LOT*.


> Upthread it was already suggested that "everyone knows to vacuum after
> bulk operations". This will go and vacuum the data while it's hot and
> in caches, not afterwards, reading from disk.

For many bulk load cases the data will not be in cache, in particular not
when individual bulk inserts are more than a few gigabytes.


> The problem hit by Mandrill is simple: in modern cloud environments
> it's sometimes simply impossible to read all the data on disk because
> of different kinds of throttling.

Yes. Which is one of the reasons why this has the potential to cause
serious issues. The proposed changes very often will *increase* the
total amount of IO. A good fraction of the time that will be "hidden" by
caching, but it'll be far from all the time.


> At some point your production database just shuts down and asks to
> VACUUM in single user mode for 40 days.

That basically has nothing to do with what we're talking about here. The
default wraparound trigger is 200 million xids, and shutdowns start at
more than 2 billion xids. If an anti-wrap autovacuum can't finish within
that window of roughly 2 billion xids, then this won't be addressed by vacuuming more
frequently (including more frequent index scans, causing a lot more
IO!).

Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Fri, 20 Mar 2020 at 11:17, Andres Freund <andres@anarazel.de> wrote:
> I think there's too much "reinventing" autovacuum scheduling in a
> "local" insert-only manner happening in this thread. And as far as I can
> tell additionally only looking at a somewhat narrow slice of insert only
> workloads.

I understand your concern and you might be right. However, I think the
main reason that the default settings for the new threshold and scale
factor have deviated this far from the existing settings is the example
of a large insert-only table that receives inserts of 1
row per xact.  If we were to copy the existing settings then when that
table gets to 1 billion rows, it would be eligible for an
insert-vacuum after 200 million tuples/xacts, which does not help the
situation since an anti-wraparound vacuum would be triggering then
anyway.

I'm unsure if it will help with the discussion, but I put together a
quick and dirty C program to show when a table will be eligible for an
auto-vacuum with the given scale_factor and threshold

$ gcc -O2 vacuum.c -o vacuum
$ ./vacuum
Syntax ./vacuum <scale_factor> <threshold> <maximum table size in rows>
$ ./vacuum 0.01 10000000 100000000000 | tail -n 1
Vacuum 463 at 99183465731 reltuples, 991915456 inserts
$ ./vacuum 0.2 50 100000000000 | tail -n 1
Vacuum 108 at 90395206733 reltuples, 15065868288 inserts

So, yeah, certainly, there are more than four times as many vacuums
with an insert-only table of 100 billion rows using the proposed
settings vs the defaults for the existing scale_factor and threshold.
However, at the tail end of the first run there, we were close to a
billion rows (991,915,456) between vacuums. Is that too excessive?

I'm sharing this in the hope that it'll make it easy to experiment
with settings which we can all agree on.

For a 1 billion row table, the proposed settings give us 69 vacuums
and the standard settings 83.
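
For anyone who cannot grab the attachment, a rough equivalent, assuming
the trigger is inserts-since-last-vacuum exceeding
threshold + scale_factor * reltuples, would look like this (it is only
an approximation, not the attached file, so the numbers can differ
slightly from the output above):

#include <stdio.h>
#include <stdlib.h>

/*
 * Rough equivalent of the attached simulation (not the same file): print
 * each point at which an insert-triggered autovacuum would fire while a
 * table grows towards "max rows", assuming the trigger condition is
 *     inserts since last vacuum > threshold + scale_factor * reltuples
 */
int
main(int argc, char **argv)
{
    double      scale_factor;
    double      threshold;
    double      max_rows;
    double      reltuples = 0;
    int         vacuums = 0;

    if (argc != 4)
    {
        fprintf(stderr, "Syntax %s <scale_factor> <threshold> <maximum table size in rows>\n",
                argv[0]);
        return 1;
    }
    scale_factor = atof(argv[1]);
    threshold = atof(argv[2]);
    max_rows = atof(argv[3]);

    while (reltuples < max_rows)
    {
        /* smallest number of inserts that exceeds the trigger condition */
        double  inserts = threshold + scale_factor * reltuples + 1;

        reltuples += inserts;
        vacuums++;
        printf("Vacuum %d at %.0f reltuples, %.0f inserts\n",
               vacuums, reltuples, inserts);
    }
    return 0;
}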

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-20 15:05:03 +1300, David Rowley wrote:
> On Fri, 20 Mar 2020 at 11:17, Andres Freund <andres@anarazel.de> wrote:
> > I think there's too much "reinventing" autovacuum scheduling in a
> > "local" insert-only manner happening in this thread. And as far as I can
> > tell additionally only looking at a somewhat narrow slice of insert only
> > workloads.
> 
> I understand your concern and you might be right. However, I think the
> main reason that the default settings for the new threshold and scale
> factor has deviated this far from the existing settings is regarding
> the example of a large insert-only table that receives inserts of 1
> row per xact.  If we were to copy the existing settings then when that
> table gets to 1 billion rows, it would be eligible for an
> insert-vacuum after 200 million tuples/xacts, which does not help the
> situation since an anti-wraparound vacuum would be triggering then
> anyway.

Sure, that'd happen for inserts that happen after that threshold. I'm
just not convinced that this is as huge a problem as presented in this
thread. And I'm fairly convinced the proposed solution is the wrong
direction to go into.

It's not like that's not an issue for updates? If you update one row per
transaction, then you run into exactly the same issue for a table of the
same size?  You could maybe argue that it's more common to insert 1
billion tuples in individual transactions than it is to update 1 billion
tuples in individual transactions, but I don't think it's a huge
difference, if it even exists.

In fact the problem is worse for the update case, because that tends to
generate a lot more random looking IO during vacuum (both because only
parts of the table are updated causing small block reads/writes, and
because it will need [multiple] index scans/vacuum, and because the
vacuum is a lot more expensive CPU time wise).

Imo this line of reasoning is about adding autovacuum scheduling based
on xid age, not about insert only workloads.

Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Thu, 2020-03-19 at 15:17 -0700, Andres Freund wrote:
> I am doubtful it should be committed with the current settings. See below.
> 
> > From 3ba4b572d82969bbb2af787d1bccc72f417ad3a0 Mon Sep 17 00:00:00 2001
> > From: Laurenz Albe <laurenz.albe@cybertec.at>
> > Date: Thu, 19 Mar 2020 20:26:43 +0100
> > Subject: [PATCH] Autovacuum tables that have received only inserts
> > 
> > This avoids the known problem that insert-only tables
> > are never autovacuumed until they need to have their
> > anti-wraparound autovacuum, which then can be massive
> > and disruptive.
> 
> Shouldn't this also mention index only scans? IMO that's at least as big
> a problem as the "large vacuum" problem.

Yes, that would be good.

> I am *VERY* doubtful that the attempt of using a large threshold, and a
> tiny scale factor, is going to work out well. I'm not confident enough
> in my gut feeling to full throatedly object, but confident enough that
> I'd immediately change it on any important database I operated.
> 
> Independent of how large a constant you set the threshold to, for
> databases with substantially bigger tables this will lead to [near]
> constant vacuuming. As soon as you hit 1 billion rows - which isn't
> actually that much - this is equivalent to setting
> autovacuum_{vacuum,analyze}_scale_factor to 0.01. There's cases where
> that can be a sensible setting, but I don't think anybody would suggest
> it as a default.

In that, you are assuming that the bigger a table is, the more data
modifications it will get, so that making the scale factor the dominant
element will work out better.

My experience is that it is more likely for the change rate (inserts,
I am less certain about updates and deletes) to be independent of the
table size.  (Too) many large databases are so large not because the
data influx grows linearly over time, but because people don't want to
get rid of old data (or would very much like to do so, but never planned
for it).

This second scenario would be much better served by a high threshold and
a low scale factor.

> After thinking about it for a while, I think it's fundamentally flawed
> to use large constant thresholds to avoid unnecessary vacuums. It's easy
> to see cases where it's bad for common databases of today, but it'll be
> much worse a few years down the line where common table sizes have grown
> by a magnitude or two. Nor do they address the difference between tables
> of a certain size with e.g. 2kb wide rows, and a same sized table with
> 28 byte wide rows.  The point of constant thresholds imo can only be to
> avoid unnecessary work at the *small* (even tiny) end, not the opposite.
> 
> 
> I think there's too much "reinventing" autovacuum scheduling in a
> "local" insert-only manner happening in this thread. And as far as I can
> tell additionally only looking at a somewhat narrow slice of insert only
> workloads.

Perhaps.  The traditional "high scale factor, low threshold" system
is (in my perception) mostly based on the objective of cleaning up
dead tuples.  When autovacuum was introduced, index only scans were
only a dream.

With the objective of getting rid of dead tuples, having the scale factor
be the dominant part makes sense: it is OK for bloat to be a certain
percentage of the table size.

Also, as you say, tables were much smaller then, and they will only
become bigger in the future.  But I find that to be an argument *for*
making the threshold the dominant element: otherwise, you vacuum less
and less often, and the individual runs become larger and larger.
Now that vacuum skips pages where it knows it has nothing to do,
doesn't take away much of the pain of vacuuming large tables where
nothing much has changed?

> I, again, strongly suggest using much more conservative values here. And
> then try to address the shortcomings - like not freezing aggressively
> enough - in separate patches (and by now separate releases, in all
> likelihood).

There is much to say for that, I agree.

> This will have a huge impact on a lot of postgres
> installations. Autovacuum already is perceived as one of the biggest
> issues around postgres. If the ratio of cases where these changes
> improve things to the cases it regresses isn't huge, it'll be painful
> (silent improvements are obviously less noticed than breakages).

Yes, that makes it scary to mess with autovacuum.

One of the problems I see in the course of this discussion is that one
can always come up with examples that make any choice look bad.
It is impossible to do it right for everybody.

In the light of that, I won't object to a more conservative default
value for the parameters, even though my considerations above suggest
to me the opposite.  But perhaps my conclusions are based on flawed
premises.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Thu, 2020-03-19 at 14:38 -0700, Andres Freund wrote:
> > I am not sure about b).  In my mind, the objective is not to prevent
> > anti-wraparound vacuums, but to see that they have less work to do,
> > because previous autovacuum runs already have frozen anything older than
> > vacuum_freeze_min_age.  So, assuming linear growth, the number of tuples
> > to freeze during any run would be at most one fourth of today's number
> > when we hit autovacuum_freeze_max_age.
> 
> This whole chain of arguments seems like it actually has little to do
> with vacuuming insert only/mostly tables. The same problem exists for
> tables that aren't insert only/mostly. Instead it IMO is an argument for
> a general change in logic about when to freeze.

My goal was to keep individual vacuum runs from having too much
work to do.  The freezing was an afterthought.

The difference (for me) is that I am more convinced that the insert
rate for an insert-only table is constant over time than I am that the
update rate is.

> What exactly is it that you want to achieve by having anti-wrap vacuums
> be quicker? If the goal is to reduce the window in which autovacuums
> aren't automatically cancelled when there's a conflicting lock request,
> or in which autovacuum just schedules based on xid age, then you can't
> have wraparound vacuums needing to do substantial amount of work.
> 
> Except for not auto-cancelling, and the autovac scheduling issue,
> there's really nothing magic about anti-wrap vacuums.

Yes.  I am under the impression that it is the duration and amount
of work per vacuum run that is the problem here, not the aggressiveness
as such.

If you are in the habit of frequently locking tables with high
lock modes (and I have seen people do that), you are lost anyway:
normal autovacuum runs will always die, and anti-wraparound vacuum
will kill you.  There is nothing we can do about that, except perhaps
put a fat warning in the documentation of LOCK.

> If the goal is to avoid redundant writes, then it's largely unrelated to
> anti-wrap vacuums, and can to a large degree addressed by
> opportunistically freezing (best even during hot pruning!).
> 
> 
> I am more and more convinced that it's a seriously bad idea to tie
> committing "autovacuum after inserts" to also committing a change in
> logic around freezing. That's not to say we shouldn't try to address
> both this cycle, but discussing them as if they really are one item
> makes it both more likely that we get nothing in, and more likely that
> we miss the larger picture.

I hear you, and I agree that we shouldn't do it with this patch.

> If there are no other modifications to the page, more aggressively
> freezing can lead to seriously increased write volume. Its quite normal
> to have databases where data in insert only tables *never* gets old
> enough to need to be frozen (either because xid usage is low, or because
> older partitions are dropped).  If data in an insert-only table isn't
> write-only, the hint bits are likely to already be set, which means that
> vacuum will just cause the entire table to be written another time,
> without a reason.
> 
> 
> I don't see how it's ok to substantially regress this very common
> workload. IMO this basically means that more aggressively and
> non-opportunistically freezing simply is a no-go (be it for insert or
> other causes for vacuuming).
> 
> What am I missing?

Nothing that I can see, and these are good examples why eager freezing
may not be such a smart idea after all.

I think your idea of freezing everything on a page when we know it is
going to be dirtied anyway is the smartest way of going about that.

My only remaining quibbles are about scale factor and threshold, see
my other mail.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-19 06:45:48 +0100, Laurenz Albe wrote:
> On Tue, 2020-03-17 at 18:02 -0700, Andres Freund wrote:
> > I don't think a default scale factor of 0 is going to be ok. For
> > large-ish tables this will basically cause permanent vacuums. And it'll
> > sometimes trigger for tables that actually coped well so far. 10 million
> > rows could be a few seconds, not more.
> > 
> > I don't think that the argument that otherwise a table might not get
> > vacuumed before autovacuum_freeze_max_age is convincing enough.
> > 
> > a) if that's indeed the argument, we should increase the default
> >   autovacuum_freeze_max_age - now that there's insert triggered vacuums,
> >   the main argument against that from before isn't valid anymore.
> > 
> > b) there's not really a good arguments for vacuuming more often than
> >   autovacuum_freeze_max_age for such tables. It'll not be not frequent
> >   enough to allow IOS for new data, and you're not preventing
> >   anti-wraparound vacuums from happening.
> 
> According to my reckoning, that is the remaining objection to the patch
> as it is (with ordinary freezing behavior).
> 
> How about a scale_factor od 0.005?  That will be high enough for large
> tables, which seem to be the main concern here.
> 
> I fully agree with your point a) - should that be part of the patch?
> 
> I am not sure about b).  In my mind, the objective is not to prevent
> anti-wraparound vacuums, but to see that they have less work to do,
> because previous autovacuum runs already have frozen anything older than
> vacuum_freeze_min_age.  So, assuming linear growth, the number of tuples
> to freeze during any run would be at most one fourth of today's number
> when we hit autovacuum_freeze_max_age.

Based on two IM conversations I think it might be worth emphasizing how
vacuum_cleanup_index_scale_factor works:

For btree, even if there is not a single deleted tuple, we can *still*
end up doing a full index scan at the end of vacuum. As the docs describe
vacuum_cleanup_index_scale_factor:

       <para>
        Specifies the fraction of the total number of heap tuples counted in
        the previous statistics collection that can be inserted without
        incurring an index scan at the <command>VACUUM</command> cleanup stage.
        This setting currently applies to B-tree indexes only.
       </para>

I.e. with the default settings we will perform a whole-index scan
(without visibility map or such) after every 10% growth of the
table. Which means that, even if the visibility map prevents repeated
table accesses, increasing the rate of vacuuming for insert-only tables
can cause a lot more whole index scans.  Which means that vacuuming an
insert-only workload frequently *will* increase the total amount of IO,
even if there is not a single dead tuple. Rather than just spreading the
same amount of IO over more vacuums.

And both gin and gist just always do a full index scan, regardless of
vacuum_cleanup_index_scale_factor (either during a bulk delete, or
during the cleanup).  Thus more frequent vacuuming for insert-only
tables can cause a *lot* of pain (even an approx quadratic increase of
IO?  O(increased_frequency * peak_index_size)?) if you have large
indexes - which is very common for gin/gist.
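
To put a rough number on the btree part, here is a toy model (pure
back-of-the-envelope, nothing taken from the actual code) comparing the
whole-index pages read by cleanup scans when an insert-only table is
vacuumed once at the end versus after every 10% of growth:

#include <stdio.h>

/*
 * Back-of-the-envelope toy model, not derived from the code: an
 * insert-only table's btree index grows from ~nothing to "final_pages".
 * Compare the whole-index pages read by cleanup scans when
 *   (a) the table is vacuumed only once, at the end (one whole-index scan), vs
 *   (b) it is vacuumed whenever the table has grown by 10% since the
 *       last cleanup scan (default vacuum_cleanup_index_scale_factor = 0.1),
 *       each such cleanup reading the whole index as it is at that time.
 */
int
main(void)
{
    double      final_pages = 10.0 * 1000 * 1000;  /* ~80 GB of index at 8 kB/page */
    double      once = final_pages;                /* case (a) */
    double      frequent = 0;                      /* case (b) */
    double      size = 1;                          /* seed to get growth started */

    while (size < final_pages)
    {
        size *= 1.1;            /* next cleanup scan after 10% growth */
        if (size > final_pages)
            size = final_pages;
        frequent += size;       /* that scan reads the whole index */
    }

    printf("one vacuum at the end:      %.0f pages read\n", once);
    printf("one vacuum per 10%% growth: %.0f pages read\n", frequent);
    return 0;
}

With those (made up) numbers the frequent case reads roughly an order of
magnitude more index pages in total than the single scan at the end.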


Is there something missing in the above description?

Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-20 06:59:57 +0100, Laurenz Albe wrote:
> On Thu, 2020-03-19 at 15:17 -0700, Andres Freund wrote:
> > I am *VERY* doubtful that the attempt of using a large threshold, and a
> > tiny scale factor, is going to work out well. I'm not confident enough
> > in my gut feeling to full throatedly object, but confident enough that
> > I'd immediately change it on any important database I operated.
> > 
> > Independent of how large a constant you set the threshold to, for
> > databases with substantially bigger tables this will lead to [near]
> > constant vacuuming. As soon as you hit 1 billion rows - which isn't
> > actually that much - this is equivalent to setting
> > autovacuum_{vacuum,analyze}_scale_factor to 0.01. There's cases where
> > that can be a sensible setting, but I don't think anybody would suggest
> > it as a default.
> 
> In that, you are assuming that the bigger a table is, the more data
> modifications it will get, so that making the scale factor the dominant
> element will work out better.

> My experience is that it is more likely for the change rate (inserts,
> I am less certain about updates and deletes) to be independent of the
> table size.  (Too) many large databases are so large not because the
> data influx grows linearly over time, but because people don't want to
> get rid of old data (or would very much like to do so, but never planned
> for it).

I don't think a growing ingest rate into insert-only tables is exactly
rare. Maybe I've been too long in the Bay Area though.


> This second scenario would be much better served by a high threshold and
> a low scale factor.

I don't think that's really true. As soon as there are any gin/gist
indexes, a single non-HOT dead tuple, or a btree index that grew by more
than vacuum_cleanup_index_scale_factor, indexes are scanned as a
whole. See the email I just concurrently happened to write:
https://postgr.es/m/20200320062031.uwagypenawujwajx%40alap3.anarazel.de

Which means that often each additional vacuum causes IO that's
proportional to the *total* index size, *not* the table size
delta. Which means that the difference in total IO basically is
O(increased_frequency * peak_table_size) in the worst case.




> > After thinking about it for a while, I think it's fundamentally flawed
> > to use large constant thresholds to avoid unnecessary vacuums. It's easy
> > to see cases where it's bad for common databases of today, but it'll be
> > much worse a few years down the line where common table sizes have grown
> > by a magnitude or two. Nor do they address the difference between tables
> > of a certain size with e.g. 2kb wide rows, and a same sized table with
> > 28 byte wide rows.  The point of constant thresholds imo can only be to
> > avoid unnecessary work at the *small* (even tiny) end, not the opposite.
> > 
> > 
> > I think there's too much "reinventing" autovacuum scheduling in a
> > "local" insert-only manner happening in this thread. And as far as I can
> > tell additionally only looking at a somewhat narrow slice of insert only
> > workloads.
> 
> Perhaps.  The traditional "high scale factor, low threshold" system
> is (in my perception) mostly based on the objective of cleaning up
> dead tuples.  When autovacuum was introduced, index only scans were
> only a dream.
> 
> With the objective of getting rid of dead tuples, having the scale factor
> be the dominant part makes sense: it is OK for bloat to be a certain
> percentage of the table size.
> 

As far as I can tell this argument doesn't make sense in light of the
fact that many vacuums trigger whole-index scans, even if there are no
deleted tuples, as described above?


Even disregarding the index issue, I still don't think your argument is
very convincing.  For one, as I mentioned in another recent email, 10
million rows in a narrow table is something entirely different than 10
million rows in a very wide table. scale_factor doesn't have that
problem to the same degree.  Also, it's fairly obvious that this
argument doesn't hold in the general sense, otherwise we could just set
a threshold of, say, 10000.

There's also the issue that frequent vacuums will often not be able to
mark most of the new data all-visible, due to concurrent
sessions. E.g. concurrent bulk loading sessions, analytics queries
actually looking at the data, replicas all can easily prevent data that
was just inserted from being marked 'all-visible' (not to speak of
frozen). That's not likely to be a problem in a purely OLTP system that
inserts only single rows per xact and has no long-lived readers (nor
replicas with hs_feedback = on), but outside of that...


> Also, as you say, tables were much smaller then, and they will only
> become bigger in the future.  But I find that to be an argument *for*
> making the threshold the dominant element: otherwise, you vacuum less
> and less often, and the individual runs become larger and larger.

Which mostly is ok, because there are significant costs that scale with
the table size. And in a lot (but far from all!) of cases the benefits
of vacuuming scale more with the overall table size than with the delta
of the size.


> Now that vacuum skips pages where it knows it has nothing to do,
> doesn't take away much of the pain of vacuuming large tables where
> nothing much has changed?

Unfortunately not really.

Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
Masahiko Sawada
Date:
On Fri, 20 Mar 2020 at 15:20, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-19 06:45:48 +0100, Laurenz Albe wrote:
> > On Tue, 2020-03-17 at 18:02 -0700, Andres Freund wrote:
> > > I don't think a default scale factor of 0 is going to be ok. For
> > > large-ish tables this will basically cause permanent vacuums. And it'll
> > > sometimes trigger for tables that actually coped well so far. 10 million
> > > rows could be a few seconds, not more.
> > >
> > > I don't think that the argument that otherwise a table might not get
> > > vacuumed before autovacuum_freeze_max_age is convincing enough.
> > >
> > > a) if that's indeed the argument, we should increase the default
> > >   autovacuum_freeze_max_age - now that there's insert triggered vacuums,
> > >   the main argument against that from before isn't valid anymore.
> > >
> > > b) there's not really a good arguments for vacuuming more often than
> > >   autovacuum_freeze_max_age for such tables. It'll not be not frequent
> > >   enough to allow IOS for new data, and you're not preventing
> > >   anti-wraparound vacuums from happening.
> >
> > According to my reckoning, that is the remaining objection to the patch
> > as it is (with ordinary freezing behavior).
> >
> > How about a scale_factor od 0.005?  That will be high enough for large
> > tables, which seem to be the main concern here.
> >
> > I fully agree with your point a) - should that be part of the patch?
> >
> > I am not sure about b).  In my mind, the objective is not to prevent
> > anti-wraparound vacuums, but to see that they have less work to do,
> > because previous autovacuum runs already have frozen anything older than
> > vacuum_freeze_min_age.  So, assuming linear growth, the number of tuples
> > to freeze during any run would be at most one fourth of today's number
> > when we hit autovacuum_freeze_max_age.
>
> Based on two IM conversations I think it might be worth emphasizing how
> vacuum_cleanup_index_scale_factor works:
>
> For btree, even if there is not a single deleted tuple, we can *still*
> end up doing a full index scans at the end of vacuum. As the docs describe
> vacuum_cleanup_index_scale_factor:
>
>        <para>
>         Specifies the fraction of the total number of heap tuples counted in
>         the previous statistics collection that can be inserted without
>         incurring an index scan at the <command>VACUUM</command> cleanup stage.
>         This setting currently applies to B-tree indexes only.
>        </para>
>
> I.e. with the default settings we will perform a whole-index scan
> (without visibility map or such) after every 10% growth of the
> table. Which means that, even if the visibility map prevents repeated
> tables accesses, increasing the rate of vacuuming for insert-only tables
> can cause a lot more whole index scans.  Which means that vacuuming an
> insert-only workload frequently *will* increase the total amount of IO,
> even if there is not a single dead tuple. Rather than just spreading the
> same amount of IO over more vacuums.

Right.

>
> And both gin and gist just always do a full index scan, regardless of
> vacuum_cleanup_index_scale_factor (either during a bulk delete, or
> during the cleanup).  Thus more frequent vacuuming for insert-only
> tables can cause a *lot* of pain (even an approx quadratic increase of
> IO?  O(increased_frequency * peak_index_size)?) if you have large
> indexes - which is very common for gin/gist.

That's right, but for gin, more frequent vacuuming of insert-only
tables can help to clean up the pending list, which increases search
speed and is better than having a backend process do it.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Thu, 2020-03-19 at 23:20 -0700, Andres Freund wrote:
> > I am not sure about b).  In my mind, the objective is not to prevent
> > anti-wraparound vacuums, but to see that they have less work to do,
> > because previous autovacuum runs already have frozen anything older than
> > vacuum_freeze_min_age.  So, assuming linear growth, the number of tuples
> > to freeze during any run would be at most one fourth of today's number
> > when we hit autovacuum_freeze_max_age.
> 
> Based on two IM conversations I think it might be worth emphasizing how
> vacuum_cleanup_index_scale_factor works:
> 
> For btree, even if there is not a single deleted tuple, we can *still*
> end up doing a full index scans at the end of vacuum. As the docs describe
> vacuum_cleanup_index_scale_factor:
> 
>        <para>
>         Specifies the fraction of the total number of heap tuples counted in
>         the previous statistics collection that can be inserted without
>         incurring an index scan at the <command>VACUUM</command> cleanup stage.
>         This setting currently applies to B-tree indexes only.
>        </para>
> 
> I.e. with the default settings we will perform a whole-index scan
> (without visibility map or such) after every 10% growth of the
> table. Which means that, even if the visibility map prevents repeated
> tables accesses, increasing the rate of vacuuming for insert-only tables
> can cause a lot more whole index scans.  Which means that vacuuming an
> insert-only workload frequently *will* increase the total amount of IO,
> even if there is not a single dead tuple. Rather than just spreading the
> same amount of IO over more vacuums.
> 
> And both gin and gist just always do a full index scan, regardless of
> vacuum_cleanup_index_scale_factor (either during a bulk delete, or
> during the cleanup).  Thus more frequent vacuuming for insert-only
> tables can cause a *lot* of pain (even an approx quadratic increase of
> IO?  O(increased_frequency * peak_index_size)?) if you have large
> indexes - which is very common for gin/gist.

Ok, ok.  Thanks for the explanation.

In the light of that, I agree that we should increase the scale_factor.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Fri, 2020-03-20 at 14:43 +0100, Laurenz Albe wrote:
> > I.e. with the default settings we will perform a whole-index scan
> > (without visibility map or such) after every 10% growth of the
> > table. Which means that, even if the visibility map prevents repeated
> > table accesses, increasing the rate of vacuuming for insert-only tables
> > can cause a lot more whole index scans.  Which means that vacuuming an
> > insert-only workload frequently *will* increase the total amount of IO,
> > even if there is not a single dead tuple. Rather than just spreading the
> > same amount of IO over more vacuums.
> > 
> > And both gin and gist just always do a full index scan, regardless of
> > vacuum_cleanup_index_scale_factor (either during a bulk delete, or
> > during the cleanup).  Thus more frequent vacuuming for insert-only
> > tables can cause a *lot* of pain (even an approx quadratic increase of
> > IO?  O(increased_frequency * peak_index_size)?) if you have large
> > indexes - which is very common for gin/gist.
> 
> In the light of that, I agree that we should increase the scale_factor.

Here is version 10 of the patch, which uses a scale factor of 0.2.

Yours,
Laurenz Albe

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Mon, 2020-03-23 at 14:27 +0100, Laurenz Albe wrote:
> Here is version 10 of the patch, which uses a scale factor of 0.2.

This patch should be what everybody can live with.

It would be good if we can get at least that committed before feature freeze.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Mon, Mar 23, 2020 at 02:27:29PM +0100, Laurenz Albe wrote:
> Here is version 10 of the patch, which uses a scale factor of 0.2.

Thanks

> Any table that has received more inserts since it was
> last vacuumed (and that is not vacuumed for another
> reason) will be autovacuumed.

Since this vacuum doesn't trigger any special behavior (freeze), you can remove
the parenthesized part: "(and that is not vacuumed for another reason)".

Maybe in the docs you can write this with thousands separators: 10,000,000

It looks like the GUC uses scale factor max=1e10, but the relopt is still
max=100, which means it's less possible to disable for a single rel.
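
For illustration (table name made up), with the relopt capped at 100 the
closest one can get to an "off" switch for a single relation is roughly:

    ALTER TABLE big_append_only
        SET (autovacuum_vacuum_insert_scale_factor = 100);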

> +++ b/src/backend/access/common/reloptions.c
> @@ -398,6 +407,15 @@ static relopt_real realRelOpts[] =
>          },
>          -1, 0.0, 100.0
>      },
> +    {
> +        {
> +            "autovacuum_vacuum_insert_scale_factor",
> +            "Number of tuple inserts prior to vacuum as a fraction of reltuples",
> +            RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
> +            ShareUpdateExclusiveLock
> +        },
> +        -1, 0.0, 100.0
> +    },
>      {
>          {
>              "autovacuum_analyze_scale_factor",

> +++ b/src/backend/utils/misc/guc.c
> @@ -3549,6 +3558,17 @@ static struct config_real ConfigureNamesReal[] =
>          0.2, 0.0, 100.0,
>          NULL, NULL, NULL
>      },
> +
> +    {
> +        {"autovacuum_vacuum_insert_scale_factor", PGC_SIGHUP, AUTOVACUUM,
> +            gettext_noop("Number of tuple inserts prior to vacuum as a fraction of reltuples."),
> +            NULL
> +        },
> +        &autovacuum_vac_ins_scale,
> +        0.2, 0.0, 1e10,
> +        NULL, NULL, NULL
> +    },
> +
>      {
>          {"autovacuum_analyze_scale_factor", PGC_SIGHUP, AUTOVACUUM,
>             gettext_noop("Number of tuple inserts, updates, or deletes prior to analyze as a fraction of reltuples."),



Re: Berserk Autovacuum (let's save next Mandrill)

From
Alvaro Herrera
Date:
On 2020-Mar-25, Justin Pryzby wrote:

> Maybe in the docs you can write this with thousands separators: 10,000,000
> 
> It looks like the GUC uses scale factor max=1e10, but the relopt is still
> max=100, which means it's less possible to disable for a single rel.

I have paid no attention to this thread, but how does it make sense to
have a scale factor higher than 100?  Surely you mean the threshold
value should be set to ten million, not the scale factor?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
Justin Pryzby
Date:
On Wed, Mar 25, 2020 at 12:46:52PM -0300, Alvaro Herrera wrote:
> On 2020-Mar-25, Justin Pryzby wrote:
> 
> > Maybe in the docs you can write this with thousands separators: 10,000,000
> > 
> > It looks like the GUC uses scale factor max=1e10, but the relopt is still
> > max=100, which means it's less possible to disable for a single rel.
> 
> I have paid no attention to this thread, but how does it make sense to
> have a scale factor higher than 100?  Surely you mean the threshold
> value should be set to ten million, not the scale factor?

We went over this here:
https://www.postgresql.org/message-id/20200317195616.GZ26184%40telsasoft.com
...
https://www.postgresql.org/message-id/20200317213426.GB26184%40telsasoft.com

The scale factor is relative to the reltuples estimate, which comes from vacuum
(which presently doesn't run against insert-only tables, and what we're trying
to schedule), or analyze, which probably runs adequately, but might be disabled
or run too infrequently.

Since we talked about how scale_factor can be used to effectively disable this
new feature, I thought that scale=100 was too small and suggested 1e10 (same as
max for vacuum_cleanup_index_scale_factor since 4d54543ef).  That should allow
handling the case that analyze is disabled, or its threshold is high, or it
hasn't run yet, or it's running but hasn't finished, or analyze is triggered at
the same time as vacuum.

A table with 1e7 tuples (threshold) into which one inserts 1e9 tuples would hit
scale_factor=100 threshold, which means scale_factor failed to "disable" the
feature, as claimed.  If anything, I think it may need to be larger...
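
To spell out the arithmetic (assuming the draft default threshold of
10,000,000 and the trigger rule of inserts since the last vacuum
exceeding threshold + scale_factor * reltuples):

    SELECT 10000000 + 100 * 10000000 AS insert_vacuum_cutoff;
    -- => 1,010,000,000, so an insert-only load on the order of a billion
    --    rows still reaches the cutoff; scale_factor = 100 is no hard "off"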

-- 
Justin



Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-25 11:05:21 -0500, Justin Pryzby wrote:
> Since we talked about how scale_factor can be used to effectively disable this
> new feature, I thought that scale=100 was too small and suggested 1e10 (same as
> max for vacuum_cleanup_index_scale_factor since 4d54543ef).  That should allow
> handling the case that analyze is disabled, or its threshold is high, or it
> hasn't run yet, or it's running but hasn't finished, or analyze is triggered at
> the same time as vacuum.

For disabling we instead should allow -1, and disable the feature if set
to < 0.

Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
Alexander Korotkov
Date:
On Wed, Mar 25, 2020 at 10:26 PM Andres Freund <andres@anarazel.de> wrote:
> On 2020-03-25 11:05:21 -0500, Justin Pryzby wrote:
> > Since we talked about how scale_factor can be used to effectively disable this
> > new feature, I thought that scale=100 was too small and suggested 1e10 (same as
> > max for vacuum_cleanup_index_scale_factor since 4d54543ef).  That should allow
> > handling the case that analyze is disabled, or its threshold is high, or it
> > hasn't run yet, or it's running but hasn't finished, or analyze is triggered at
> > the same time as vacuum.
>
> For disabling we instead should allow -1, and disable the feature if set
> to < 0.

This patch introduces both GUC and reloption.  In reloptions we
typically use -1 for "disable reloption, use GUC value instead"
semantics.  So it's unclear how should we allow reloption to both
disable feature and disable reloption.  I think we don't have a
precedent in the codebase yet.  We could allow -2 (disable reloption)
and -1 (disable feature) for reloption.  Opinions?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Wed, 2020-03-25 at 23:19 +0300, Alexander Korotkov wrote:
> On Wed, Mar 25, 2020 at 10:26 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2020-03-25 11:05:21 -0500, Justin Pryzby wrote:
> > > Since we talked about how scale_factor can be used to effectively disable this
> > > new feature, I thought that scale=100 was too small and suggested 1e10 (same as
> > > max for vacuum_cleanup_index_scale_factor since 4d54543ef).  That should allow
> > > handling the case that analyze is disabled, or its threshold is high, or it
> > > hasn't run yet, or it's running but hasn't finished, or analyze is triggered at
> > > the same time as vacuum.
> > 
> > For disabling we instead should allow -1, and disable the feature if set
> > to < 0.
> 
> This patch introduces both GUC and reloption.  In reloptions we
> typically use -1 for "disable reloption, use GUC value instead"
> semantics.  So it's unclear how should we allow reloption to both
> disable feature and disable reloption.  I think we don't have a
> precedent in the codebase yet.  We could allow -2 (disable reloption)
> and -1 (disable feature) for reloption.  Opinions?

Here is patch v11, where the reloption has the same upper limit 1e10
as the GUC.  There is no good reason to have them different.

I am reluctant to introduce new semantics like a reloption value of -2
to disable a feature in this patch right before feature freeze.

I believe there are enough options to disable insert-only vacuuming for
an individual table:

- Set the threshold to 2147483647.  True, that will not work for very
  large tables, but I think that there are few tables that insert that
  many rows before they hit autovacuum_freeze_max_age anyway.

- Set the scale factor to some astronomical value.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
Andres Freund
Date:
Hi,

On 2020-03-26 10:12:39 +0100, Laurenz Albe wrote:
> On Wed, 2020-03-25 at 23:19 +0300, Alexander Korotkov wrote:
> I am reluctant to introduce new semantics like a reloption value of -2
> to disable a feature in this patch right before feature freeze.
> 
> I believe there are enough options to disable insert-only vacuuming for
> an individual table:

> - Set the threshold to 2147483647.  True, that will not work for very
>   large tables, but I think that there are few tables that insert that
>   many rows before they hit autovacuum_freeze_max_age anyway.
> 
> - Set the scale factor to some astronomical value.

Meh. You *are* adding new semantics with these. And they're terrible.

Greetings,

Andres Freund



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Fri, 27 Mar 2020 at 07:51, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-26 10:12:39 +0100, Laurenz Albe wrote:
> > On Wed, 2020-03-25 at 23:19 +0300, Alexander Korotkov wrote:
> > I am reluctant to introduce new semantics like a reloption value of -2
> > to disable a feature in this patch right before feature freeze.
> >
> > I believe there are enough options to disable insert-only vacuuming for
> > an individual table:
>
> > - Set the threshold to 2147483647.  True, that will not work for very
> >   large tables, but I think that there are few tables that insert that
> >   many rows before they hit autovacuum_freeze_max_age anyway.
> >
> > - Set the scale factor to some astronomical value.
>
> Meh. You *are* adding new semantics with these. And they're terrible.

I've modified this to allow a proper way to disable the entire feature
by allowing the setting to be set to -1 to disable the feature. I feel
people are fairly used to using -1 to disable various features (e.g.
log_autovacuum_min_duration).  I've used the special value of -2 for
the reloption to have that cascade to using the GUC instead.  The
autovacuum_vacuum_insert_threshold reloption may be explicitly set to
-1 to disable autovacuums for inserts for the relation.

I've also knocked the default threshold down to 1000. I feel this is a
better value given that the scale factor is now 0.2. There seemed to
be no need to exclude smaller tables from seeing gains such as
index-only scans.
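
With those semantics, opting a single table out (or the whole cluster)
would look roughly like this (the table name is made up):

    ALTER TABLE append_only_tab
        SET (autovacuum_vacuum_insert_threshold = -1);
    -- or, cluster-wide:
    ALTER SYSTEM SET autovacuum_vacuum_insert_threshold = -1;
    SELECT pg_reload_conf();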

If nobody objects, I plan to push this one shortly.

David

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Fri, 2020-03-27 at 10:18 +1300, David Rowley wrote:
> > > I believe there are enough options to disable insert-only vacuuming for
> > > an individual table:
> >
> > > - Set the threshold to 2147483647.  True, that will not work for very
> > >   large tables, but I think that there are few tables that insert that
> > >   many rows before they hit autovacuum_freeze_max_age anyway.
> > >
> > > - Set the scale factor to some astronomical value.
> >
> > Meh. You *are* adding new semantics with these. And they're terrible.
> 
> I've modified this to allow a proper way to disable the entire feature
> by allowing the setting to be set to -1 to disable the feature. I feel
> people are fairly used to using -1 to disable various features (e.g.
> log_autovacuum_min_duration).  I've used the special value of -2 for
> the reloption to have that cascade to using the GUC instead.  The
> autovacuum_vacuum_insert_threshold reloption may be explicitly set to
> -1 to disable autovacuums for inserts for the relation.
> 
> I've also knocked the default threshold down to 1000. I feel this is a
> better value given that the scale factor is now 0.2. There seemed to
> be no need to exclude smaller tables from seeing gains such as
> index-only scans.
> 
> If nobody objects, I plan to push this one shortly.

Thanks for the help!

The new meaning of -2 should be documented, other than that it looks
good to me.

I'll accept the new semantics, but they don't make me happy.  People are
used to -1 meaning "use the GUC value instead".

I still don't see why we need that.  Contrary to Andres' opinion, I don't
think that disabling a parameter by setting it to a value high enough that
it does not take effect is a bad thing.

I won't put up a fight though.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Fri, 27 Mar 2020 at 22:40, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> The new meaning of -2 should be documented, other than that it looks
> good to me.

But the users don't need to know anything about -2. It's not possible
to explicitly set the value to -2. This is just the reset value of the
reloption which means "use the GUC".

> I'll accept the new semantics, but they don't make me happy.  People are
> used to -1 meaning "use the GUC value instead".

The problem with having -1 on the reloption meaning use the GUC, in
this case, is that it means the reset value of the reloption must be
-1 and we need to allow them to set -2 explicitly, and if we do that,
then -1 also becomes a valid value that users can set. Maybe that's
not the end of the world, but I'd rather have the reset value be
unsettable by users. To me, that's less confusing as there are fewer
special values to remember the meaning of.

The reason I want a method to explicitly disable the feature is the
fact that it's easy to document and it should reduce the number of
people who are confused about the best method to disable the feature.
I know there's going to be a non-zero number of people who'll want to
do that.



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Sat, 2020-03-28 at 11:59 +1300, David Rowley wrote:
> On Fri, 27 Mar 2020 at 22:40, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > The new meaning of -2 should be documented, other than that it looks
> > good to me.
> 
> But the users don't need to know anything about -2. It's not possible
> to explicitly set the value to -2. This is just the reset value of the
> reloption which means "use the GUC".

I see.

> > I'll accept the new semantics, but they don't make me happy.  People are
> > used to -1 meaning "use the GUC value instead".
> 
> The problem with having -1 on the reloption meaning use the GUC, in
> this case, is that it means the reset value of the reloption must be
> -1 and we need to allow them to set -2 explicitly, and if we do that,
> then -1 also becomes a valid value that users can set. Maybe that's
> not the end of the world, but I'd rather have the reset value be
> unsettable by users. To me, that's less confusing as there are fewer
> special values to remember the meaning of.
> 
> The reason I want a method to explicitly disable the feature is the
> fact that it's easy to document and it should reduce the number of
> people who are confused about the best method to disable the feature.
> I know there's going to be a non-zero number of people who'll want to
> do that.

In the light of that, I have no objections.

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Sat, 28 Mar 2020 at 17:12, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> In the light of that, I have no objections.

Thank you.  Pushed.

David



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Sat, 28 Mar 2020 at 19:21, David Rowley <dgrowleyml@gmail.com> wrote:
> Thank you.  Pushed.

I'm unsure yet if this has caused an instability on lousyjack's run in
[1]. I see that table does have 30,000 rows inserted, so it does seem
probable that it may receive an autovacuum now when it didn't before. I
did a quick local test to see if swapping the "ANALYZE pagg_tab_ml;"
to "VACUUM ANALYZE pagg_tab_ml;" would do the same on my local
machine, but it didn't.

I'll keep an eye on lousyjack's next run.  If it passes, I may add
some SQL to determine whether pg_stat_all_tables.autovacuum_count for
those tables varies between passing and failing runs.
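
Something along these lines, presumably (relation names guessed from the
failing test):

    SELECT relname, vacuum_count, autovacuum_count,
           analyze_count, autoanalyze_count
    FROM pg_stat_all_tables
    WHERE relname LIKE 'pagg_tab_ml%'
    ORDER BY relname;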

David

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-03-28%2006%3A33%3A02



Re: Berserk Autovacuum (let's save next Mandrill)

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Sat, 28 Mar 2020 at 17:12, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>> In the light of that, I have no objections.

> Thank you.  Pushed.

It seems like this commit has resulted in some buildfarm instability:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-03-28%2006%3A33%3A02

apparent change of plan

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2009%3A20%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2013%3A20%3A05

unstable results in stats_ext test

I initially thought that Dean's functional-stats adjustment might be
the culprit, but the timestamps on these failures disprove that.

            regards, tom lane



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Sun, 29 Mar 2020 at 06:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > On Sat, 28 Mar 2020 at 17:12, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> >> In the light of that, I have no objections.
>
> > Thank you.  Pushed.
>
> It seems like this commit has resulted in some buildfarm instability:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-03-28%2006%3A33%3A02
>
> apparent change of plan
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2009%3A20%3A05
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2013%3A20%3A05
>
> unstable results in stats_ext test

Yeah, thanks for pointing that out.  I'm just doing some tests locally
to see if I can recreate those results after vacuuming the mcv_list
table; so far I'm unable to.



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Sun, 29 Mar 2020 at 10:30, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sun, 29 Mar 2020 at 06:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > David Rowley <dgrowleyml@gmail.com> writes:
> > > On Sat, 28 Mar 2020 at 17:12, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > >> In the light of that, I have no objections.
> >
> > > Thank you.  Pushed.
> >
> > It seems like this commit has resulted in some buildfarm instability:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-03-28%2006%3A33%3A02
> >
> > apparent change of plan
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2009%3A20%3A05
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2013%3A20%3A05
> >
> > unstable results in stats_ext test
>
> Yeah, thanks for pointing that out.  I'm just doing some tests locally
> to see if I can recreate those results after vacuuming the mcv_list
> table; so far I'm unable to.

I'm considering pushing the attached to try to get some confirmation
that additional autovacuums are the issue. However, I'm not too sure
it's a wise idea, as I can trigger an additional auto-vacuum and
have these new tests fail with make installcheck after setting
autovacuum_naptime to 1s, but I'm not getting the other diffs
experienced by lousyjack and petalura.  The patch may just cause more
failures without proving much, especially so with slower machines.

The other idea I had was just to change the
autovacuum_vacuum_insert_threshold relopt to -1 for the problem tables
and see if that stabilises things.

Yet another option would be to see if reltuples varies between runs
and ditch the autovacuum_count column from the attached.  There does
not appear to be any part of the tests which would cause any dead
tuples in any of the affected relations, so I'm unsure why reltuples
would vary between what ANALYZE and VACUUM would set it to.

I'm still thinking. Input welcome.

David

Attachment

Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Sun, 29 Mar 2020 at 15:29, David Rowley <dgrowleyml@gmail.com> wrote:
> I'm considering pushing the attached to try to get some confirmation
> that additional autovacuums are the issue. However, I'm not too sure
> it's a wise idea, as I can trigger an additional auto-vacuum and
> have these new tests fail with make installcheck after setting
> autovacuum_naptime to 1s, but I'm not getting the other diffs
> experienced by lousyjack and petalura.  The patch may just cause more
> failures without proving much, especially so with slower machines.

Instead of the above, I ended up modifying the two intermittently
failing tests to change the ANALYZE into a VACUUM ANALYZE.  This
should prevent autovacuum sneaking in a vacuum at some point in time
after the ANALYZE has taken place.

I don't believe any of the current buildfarm failures can be
attributed to any of the recent changes to autovacuum, but I'll
continue to monitor the farm to see if anything is suspect.

David



Re: Berserk Autovacuum (let's save next Mandrill)

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> I don't believe any of the current buildfarm failures can be
> attributed to any of the recent changes to autovacuum, but I'll
> continue to monitor the farm to see if anything is suspect.

I agree none of the failures I see right now are related to that
(there's some "No space left on device" failures, Windows randomicity,
snapper's compiler bug, and don't-know-what on hyrax).

But the ones that were seemingly due to that were intermittent,
so we'll have to watch for awhile.

            regards, tom lane



Re: Berserk Autovacuum (let's save next Mandrill)

From
Amit Kapila
Date:
On Mon, Mar 30, 2020 at 7:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > I don't believe any of the current buildfarm failures can be
> > attributed to any of the recent changes to autovacuum, but I'll
> > continue to monitor the farm to see if anything is suspect.
>
> I agree none of the failures I see right now are related to that
> (there's some "No space left on device" failures, Windows randomicity,
> snapper's compiler bug, and don't-know-what on hyrax).
>
> But the ones that were seemingly due to that were intermittent,
> so we'll have to watch for awhile.
>

Today, stats_ext failed on petalura [1].  Can it be due to this?  I
have also committed a patch, but I don't immediately see how it could
be related to my commit.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-30%2002%3A20%3A03

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Berserk Autovacuum (let's save next Mandrill)

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Mon, Mar 30, 2020 at 7:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> But the ones that were seemingly due to that were intermittent,
>> so we'll have to watch for awhile.

> Today, stats_ext failed on petalura [1].  Can it be due to this?  I
> have also committed a patch, but I don't immediately see how it could
> be related to my commit.

Yeah, this looks just like petalura's previous failures, so the
problem is still there :-(

            regards, tom lane



Re: Berserk Autovacuum (let's save next Mandrill)

From
Laurenz Albe
Date:
On Sat, 2020-03-28 at 19:21 +1300, David Rowley wrote:
> Thank you.  Pushed.

Thanks for your efforts on this, and thanks for working on the fallout.

How can it be that even after an explicit VACUUM, this patch can cause
unstable regression test results?

Yours,
Laurenz Albe




Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Mon, 30 Mar 2020 at 17:57, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> How can it be that even after an explicit VACUUM, this patch can cause
> unstable regression test results?

I only added vacuums for mcv_lists. The problem with petalura [1] is
with the functional_dependencies table.

I'll see if I can come up with a more deterministic way to determine
which tables need vacuums added, rather than waiting for failures and
reacting after the fact.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-30%2002%3A20%3A03



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Mon, 30 Mar 2020 at 19:49, David Rowley <dgrowleyml@gmail.com> wrote:
> I'll see if I can come up with a more deterministic way to determine
> which tables need vacuums added, rather than waiting for failures and
> reacting after the fact.

I ended up running make installcheck on an instance with
autovacuum_naptime set to 1s with a small additional debug line in
autovacuum.c, namely:

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e97ffab27..ad81e321dc 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3099,6 +3099,9 @@ relation_needs_vacanalyze(Oid relid,
                *dovacuum = force_vacuum || (vactuples > vacthresh) ||
                                        (vac_ins_base_thresh >= 0 && instuples > vacinsthresh);
                *doanalyze = (anltuples > anlthresh);
+
+               if (vac_ins_base_thresh >= 0 && instuples > vacinsthresh)
+                       elog(LOG, "******** %s", NameStr(classForm->relname));
        }
        else
        {

I grepped the log after the installcheck to grab the table names that
saw an insert vacuum during the test, then grepped the test output to
see if the table appears to pose a risk of test instability.

I've classed each table with a risk factor. "VeryLow" means there is
almost no risk because the tests never look at EXPLAIN.  Low-risk
tables look at EXPLAIN, but I feel they are not looking in enough
detail to cause issues.  Medium-risk tables look at EXPLAIN and I feel
there's a risk of some change; I think these are all Append nodes,
which order their subnodes based on cost.  High risk... those are
the ones I'm about to look into changing.

The full results of my analysis are:

Table: agg_group_1 aggregates.out. Nothing looks at EXPLAIN. Risk:VeryLow
Table: agg_hash_1 aggregates.out. Nothing looks at EXPLAIN. Risk:VeryLow
Table: atest12 privileges.out. Lots of looking at EXPLAIN, but nothing
appears to look into row estimates in detail. Risk:Low
Table: brin_test brin.out.  Test already does VACUUM ANALYZE. Risk:VeryLow
Table: bt_f8_heap btree_index.out, create_index.out. Rows loaded in
copy.source. Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: bt_i4_heap btree_index.out, create_index.out. Rows loaded in
copy.source. Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: bt_name_heap btree_index.out, create_index.out. Rows loaded in
copy.source. Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: bt_txt_heap btree_index.out, create_index.out. Rows loaded in
copy.source. Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: dupindexcols create_index.out. Some looking at EXPLAIN plans,
but nothing appears to look into row estimates in detail. Risk:Low
Table: fast_emp4000 create_am.out, create_index.out, create_misc.out.
Lots of looking at EXPLAIN, but nothing appears to look into row
estimates in detail. Risk:Low
Table: functional_dependencies stats_ext.out. Lots of looking at
EXPLAIN output. Test looks at row estimates. Risk:High
Table: gist_tbl gist.out. Lots of looking at EXPLAIN, but nothing
appears to look into row estimates in detail. Risk:Low
Table: hash_f8_heap hash_index.out. Rows loaded in copy.source.
Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: hash_i4_heap hash_index.out. Rows loaded in copy.source.
Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: hash_name_heap hash_index.out. Rows loaded in copy.source.
Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: hash_txt_heap hash_index.out. Rows loaded in copy.source.
Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: kd_point_tbl create_index_spgist.out. Lots of looking at
EXPLAIN, but nothing appears to look into row estimates in detail.
Risk:Low
Table: mcv_lists stats_ext.out. Lots of looking at EXPLAIN, but tests
appear to VACUUM after loading rows. Risk:Low
Table: mcv_lists_arrays stats_ext.out. Nothing appears to look at
EXPLAIN. Risk:VeryLow
Table: mcv_lists_bool stats_ext.out.  Lots of looking at EXPLAIN
output. Test looks at row estimates. Risk:High
Table: ndistinct stats_ext.out.  Lots of looking at EXPLAIN output.
Test looks at row estimates. Only 1000 rows are loaded initially and
then 5000 after a truncate. 1000 rows won't trigger the auto-vacuum.
Risk:High
Table: onek Lots of files. Sees a VACUUM in sanity_check test,
however, some tests run before sanity_check, e.g. create_index,
select, copy, none of which appear to pay particular attention to
anything vacuum might change. Risk:Low
Table: pagg_tab_ml_p2_s1 partition_aggregate.out Appears to be some
risk of Append reordering partitions based on cost. Risk:Medium
Table: pagg_tab_ml_p2_s2 partition_aggregate.out Appears to be some
risk of Append reordering partitions based on cost. Risk:Medium
Table: pagg_tab_ml_p3_s1 partition_aggregate.out Appears to be some
risk of Append reordering partitions based on cost. Risk:Medium
Table: pagg_tab_ml_p3_s2 partition_aggregate.out Appears to be some
risk of Append reordering partitions based on cost. Risk:Medium
Table: pagg_tab_para_p1 partition_aggregate.out Appears to be some
risk of Append reordering partitions based on cost. Risk:Medium
Table: pagg_tab_para_p2 partition_aggregate.out Appears to be some
risk of Append reordering partitions based on cost. Risk:Medium
Table: pagg_tab_para_p3 partition_aggregate.out Appears to be some
risk of Append reordering partitions based on cost. Risk:Medium
Table: pg_attribute Seen in several tests. Nothing appears to look at
EXPLAIN. Risk:VeryLow
Table: pg_depend Seen in several tests. Nothing appears to look at
EXPLAIN. Risk:VeryLow
Table: pg_largeobject Seen in several tests. Nothing appears to look
at EXPLAIN. Risk:VeryLow
Table: quad_box_tbl box.out. Sees some use of EXPLAIN, but nothing
looks critical. Risk:Low
Table: quad_box_tbl_ord_seq1 box.out. No EXPLAIN usage. Risk:VeryLow
Table: quad_box_tbl_ord_seq2 box.out. No EXPLAIN usage. Risk:VeryLow
Table: quad_point_tbl create_index_spgist.out Sees some use of
EXPLAIN. Index Only Scans are already being used. Risk:Low
Table: quad_poly_tbl polygon.out Some usages of EXPLAIN. Risk:Low
Table: radix_text_tbl create_index_spgist.out Some usages of EXPLAIN. Risk:Low
Table: road various tests. Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: slow_emp4000 various tests. Nothing appears to look at EXPLAIN.
Risk:VeryLow
Table: spgist_box_tbl spgist.out. Nothing appears to look at EXPLAIN.
Risk:VeryLow
Table: spgist_point_tbl spgist.out. Nothing appears to look at
EXPLAIN. Risk:VeryLow
Table: spgist_text_tbl spgist.out. Nothing appears to look at EXPLAIN.
Risk:VeryLow
Table: tenk1 aggregates.out, groupingsets.out, join.out, limit.out,
misc_functions.out, rowtypes.out,select_distinct.out,
select_parallel.out, subselect.out, tablesample.out, tidscan.out,
union.out, window.out and write_parallel.out are after vacuum in
sanity_check. EXPLAIN used in create_index.out and inherit.out, which
are all run before sanity_check does the vacuum. Risk:Medium
Table: tenk2 Only sees EXPLAIN usages in select_parallel.out, which is
after the table is vacuumed in sanity_check. Risk:Low
Table: test_range_gist rangetypes.out. Nothing appears to look at
EXPLAIN. Risk:VeryLow
Table: test_range_spgist rangetypes.out. Some EXPLAIN usage. Risk:Low
Table: testjsonb jsonb.out. Some EXPLAIN usage. Risk:Low
Table: transition_table_level2 plpgsql.out. Nothing appears to look at
EXPLAIN. Risk:VeryLow
Table: transition_table_status plpgsql.out. Nothing appears to look at
EXPLAIN. Risk:VeryLow

I'd like to wait to see if we get failures for the ones I've classed
as medium risk.

David



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Sat, 28 Mar 2020 at 22:22, David Rowley <dgrowleyml@gmail.com> wrote:
> I'm unsure yet if this has caused an instability on lousyjack's run in
> [1].

pogona has just joined in on the fun [1], so we're not out of the woods
on this yet. I'll start having a look at this in more detail.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2020-03-30%2023%3A10%3A03



Re: Berserk Autovacuum (let's save next Mandrill)

From
Dean Rasheed
Date:
On Tue, 31 Mar 2020 at 04:39, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sat, 28 Mar 2020 at 22:22, David Rowley <dgrowleyml@gmail.com> wrote:
> > I'm unsure yet if this has caused an instability on lousyjack's run in
> > [1].
>
> pogona has just joined in on the fun [1], so we're not out of the woods
> on this yet. I'll start having a look at this in more detail.
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2020-03-30%2023%3A10%3A03
>

I had a go at reproducing this. I wasn't able to produce the reported
failure, but I can reliably produce an Assert failure that may be
related by doing a VACUUM FULL simultaneously with an ANALYZE that is
generating extended stats, which produces:

#0  0x00007f28081c9520 in raise () from /lib64/libc.so.6
#1  0x00007f28081cab01 in abort () from /lib64/libc.so.6
#2  0x0000000000aad1ad in ExceptionalCondition (conditionName=0xb2f1a1
"ItemIdIsNormal(lp)", errorType=0xb2e7c9 "FailedAssertion",
fileName=0xb2e848 "heapam.c", lineNumber=3016) at assert.c:67
#3  0x00000000004fb79e in heap_update (relation=0x7f27feebeda8,
otid=0x2d881fc, newtup=0x2d881f8, cid=0, crosscheck=0x0, wait=true,
tmfd=0x7ffc568a5900, lockmode=0x7ffc568a58fc) at heapam.c:3016
#4  0x00000000004fdead in simple_heap_update (relation=0x7f27feebeda8,
otid=0x2d881fc, tup=0x2d881f8) at heapam.c:3902
#5  0x00000000005be860 in CatalogTupleUpdate (heapRel=0x7f27feebeda8,
otid=0x2d881fc, tup=0x2d881f8) at indexing.c:230
#6  0x00000000008df898 in statext_store (statOid=18964, ndistinct=0x0,
dependencies=0x2a85fe0, mcv=0x0, stats=0x2a86570) at
extended_stats.c:553
#7  0x00000000008deec0 in BuildRelationExtStatistics
(onerel=0x7f27feed9008, totalrows=5000, numrows=5000, rows=0x2ad5a30,
natts=7, vacattrstats=0x2a75f40) at extended_stats.c:187
#8  0x000000000065c1b7 in do_analyze_rel (onerel=0x7f27feed9008,
params=0x7ffc568a5fc0, va_cols=0x0, acquirefunc=0x65ce37
<acquire_sample_rows>, relpages=31, inh=false, in_outer_xact=false,
elevel=13) at analyze.c:606
#9  0x000000000065b532 in analyze_rel (relid=18956,
relation=0x29b0bc0, params=0x7ffc568a5fc0, va_cols=0x0,
in_outer_xact=false, bstrategy=0x2a7dfa0) at analyze.c:263
#10 0x00000000006fd768 in vacuum (relations=0x2a7e148,
params=0x7ffc568a5fc0, bstrategy=0x2a7dfa0, isTopLevel=true) at
vacuum.c:468
#11 0x00000000006fd22c in ExecVacuum (pstate=0x2a57a00,
vacstmt=0x29b0ca8, isTopLevel=true) at vacuum.c:251

It looks to me as though the problem is that statext_store() needs to
take its lock on pg_statistic_ext_data *before* searching for the
stats tuple to update.

It's late here, so I haven't worked up a patch yet, but it looks
pretty straightforward.

Regards,
Dean



Re: Berserk Autovacuum (let's save next Mandrill)

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> I had a go at reproducing this. I wasn't able to produce the reported
> failure, but I can reliably produce an Assert failure that may be
> related by doing a VACUUM FULL simultaneously with an ANALYZE that is
> generating extended stats, which produces:
> ...
> It looks to me as though the problem is that statext_store() needs to
> take its lock on pg_statistic_ext_data *before* searching for the
> stats tuple to update.

Hmm, yeah, that seems like clearly a bad idea.

> It's late here, so I haven't worked up a patch yet, but it looks
> pretty straightforward.

I can take care of it.

            regards, tom lane



Re: Berserk Autovacuum (let's save next Mandrill)

From
Tom Lane
Date:
I wrote:
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> I had a go at reproducing this. I wasn't able to produce the reported
>> failure, but I can reliably produce an Assert failure that may be
>> related by doing a VACUUM FULL simultaneously with an ANALYZE that is
>> generating extended stats, which produces:
>> ...
>> It looks to me as though the problem is that statext_store() needs to
>> take its lock on pg_statistic_ext_data *before* searching for the
>> stats tuple to update.

> Hmm, yeah, that seems like clearly a bad idea.

I pushed a fix for that, but I think it must be unrelated to the
buildfarm failures we're seeing.  For that coding to be a problem,
it would have to run concurrently with a VACUUM FULL or CLUSTER
on pg_statistic_ext_data (which would give all the tuples new TIDs).
AFAICS that won't happen with the tests that are giving trouble.

            regards, tom lane



Re: Berserk Autovacuum (let's save next Mandrill)

From
Dean Rasheed
Date:
On Tue, 31 Mar 2020 at 22:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> >> ...
> >> It looks to me as though the problem is that statext_store() needs to
> >> take its lock on pg_statistic_ext_data *before* searching for the
> >> stats tuple to update.
>
> > Hmm, yeah, that seems like clearly a bad idea.
>
> I pushed a fix for that

Thanks for doing that (looks like it was my mistake originally).

> but I think it must be unrelated to the
> buildfarm failures we're seeing.  For that coding to be a problem,
> it would have to run concurrently with a VACUUM FULL or CLUSTER
> on pg_statistic_ext_data (which would give all the tuples new TIDs).
> AFAICS that won't happen with the tests that are giving trouble.
>

Yeah, that makes sense. I still can't see what might be causing those
failures. The tests that were doing an ALTER COLUMN and then expecting
to see the results of a non-analysed table ought to be fixed by
0936d1b6f, but that doesn't match the buildfarm failures. Possibly
0936d1b6f will help with those anyway, but if so, it'll be annoying
not understanding why.

Regards,
Dean



Re: Berserk Autovacuum (let's save next Mandrill)

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> Yeah, that makes sense. I still can't see what might be causing those
> failures. The tests that were doing an ALTER COLUMN and then expecting
> to see the results of a non-analysed table ought to be fixed by
> 0936d1b6f, but that doesn't match the buildfarm failures. Possibly
> 0936d1b6f will help with those anyway, but if so, it'll be annoying
> not understanding why.

Quite :-(.  While it's too early to declare victory, we've seen no
more failures of this ilk since 0936d1b6f, so it's sure looking like
autovacuum did have something to do with it.

Just to save people repeating the search I did, these are the buildfarm
failures of interest so far:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-03-28%2006%3A33%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2009%3A20%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2013%3A20%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-03-28%2020%3A03%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2020-03-28%2022%3A00%3A19
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-03-29%2006%3A45%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-30%2002%3A20%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2020-03-30%2006%3A00%3A06
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2020-03-30%2006%3A10%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2020-03-30%2023%3A10%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2020-03-31%2005%3A00%3A35

The first of those is unlike the rest, and I'm not trying to account for
it here.  In the rest, what we see is that sometimes the estimates are off
by a little bit from what's expected, up or down just a percent or two.
And those deltas kick at inconsistent spots partway through a series of
similar tests, so it's hard to deny that *something* asynchronous to the
test script is causing it.

After contemplating the failures for awhile, I have a theory that
at least partially matches the data.  What I think is happening is
that autovacuum (NOT auto-analyze) launches on the table, and since
it is running concurrently with the foreground test script, it fails
to immediately acquire buffer lock on one or more of the table pages.
Since this isn't an aggressive vacuum scan, it just politely backs
off and doesn't scan those pages.  And that translates to not getting
a perfectly accurate reltuples estimate at the end of the vacuum.
On my x86_64 machine, which matches the buildfarm critters having
trouble, the actual contents of both of the troublesome tables will
be 5000 tuples in 31 pages --- which comes out to be 30 pages with
162 tuples each and then 140 tuples in the last page.  Working through
the math in vac_estimate_reltuples (and assuming that the "old" values
were accurate numbers from the test script's own ANALYZE), what I find
is that autovacuum will conclude there are 4999 tuples if it misses
scanning one of the first 30 pages, or 5021 tuples if it misses scanning
the last page, because its interpolation from the old tuple density
figure will underestimate or overestimate the number of missed tuples
accordingly.  Once that slightly-off number gets pushed into pg_class,
we start to get slightly-off rowcount estimates in the test cases.
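
For concreteness, those two figures can be re-derived from the
interpolation vac_estimate_reltuples uses (paraphrased below as SQL; a
sketch of the math, not the actual C):

    -- old_density = old_rel_tuples / old_rel_pages
    -- new_density = scanned_tuples / scanned_pages
    -- blend new into old, weighted by the fraction of pages scanned,
    -- then scale back up by the total number of pages
    SELECT skipped,
           round((old_d + (new_d - old_d) * 30 / 31.0) * 31) AS reltuples_estimate
    FROM (VALUES ('one of the first 30 pages', 5000 / 31.0, 4838 / 30.0),
                 ('the last page',             5000 / 31.0, 4860 / 30.0))
         AS v(skipped, old_d, new_d);
    -- yields about 4999 and 5021 respectively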

So what I'm hypothesizing is that the pg_statistic data is perfectly
fine but pg_class.reltuples goes off a little bit after autovacuum.
The percentage changes in reltuples that I predict this way don't
quite square with the percentage changes we see in the overall
rowcount estimates, which is a problem for this theory.  But the test
cases are exercising some fairly complex estimation logic, and it
wouldn't surprise me much if the estimates aren't linearly affected by
reltuples.  (Tomas, do you want to comment further on that point?)

            regards, tom lane



Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Thu, 2 Apr 2020 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> > Yeah, that makes sense. I still can't see what might be causing those
> > failures. The tests that were doing an ALTER COLUMN and then expecting
> > to see the results of a non-analysed table ought to be fixed by
> > 0936d1b6f, but that doesn't match the buildfarm failures. Possibly
> > 0936d1b6f will help with those anyway, but if so, it'll be annoying
> > not understanding why.
>
> Quite :-(.  While it's too early to declare victory, we've seen no
> more failures of this ilk since 0936d1b6f, so it's sure looking like
> autovacuum did have something to do with it.

How about [1]? It seems related to me and also post 0936d1b6f.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-04-01%2017%3A03%3A05



Re: Berserk Autovacuum (let's save next Mandrill)

From
Tomas Vondra
Date:
On Wed, Apr 01, 2020 at 11:13:12PM -0400, Tom Lane wrote:
>Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> Yeah, that makes sense. I still can't see what might be causing those
>> failures. The tests that were doing an ALTER COLUMN and then expecting
>> to see the results of a non-analysed table ought to be fixed by
>> 0936d1b6f, but that doesn't match the buildfarm failures. Possibly
>> 0936d1b6f will help with those anyway, but if so, it'll be annoying
>> not understanding why.
>
>Quite :-(.  While it's too early to declare victory, we've seen no
>more failures of this ilk since 0936d1b6f, so it's sure looking like
>autovacuum did have something to do with it.
>
>Just to save people repeating the search I did, these are the buildfarm
>failures of interest so far:
>
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-03-28%2006%3A33%3A02
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2009%3A20%3A05
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2013%3A20%3A05
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-03-28%2020%3A03%3A03
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2020-03-28%2022%3A00%3A19
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-03-29%2006%3A45%3A02
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-30%2002%3A20%3A03
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2020-03-30%2006%3A00%3A06
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2020-03-30%2006%3A10%3A05
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2020-03-30%2023%3A10%3A03
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2020-03-31%2005%3A00%3A35
>
>The first of those is unlike the rest, and I'm not trying to account for
>it here.  In the rest, what we see is that sometimes the estimates are off
>by a little bit from what's expected, up or down just a percent or two.
>And those deltas kick at inconsistent spots partway through a series of
>similar tests, so it's hard to deny that *something* asynchronous to the
>test script is causing it.
>
>After contemplating the failures for awhile, I have a theory that
>at least partially matches the data.  What I think is happening is
>that autovacuum (NOT auto-analyze) launches on the table, and since
>it is running concurrently with the foreground test script, it fails
>to immediately acquire buffer lock on one or more of the table pages.
>Since this isn't an aggressive vacuum scan, it just politely backs
>off and doesn't scan those pages.  And that translates to not getting
>a perfectly accurate reltuples estimate at the end of the vacuum.
>On my x86_64 machine, which matches the buildfarm critters having
>trouble, the actual contents of both of the troublesome tables will
>be 5000 tuples in 31 pages --- which comes out to be 30 pages with
>162 tuples each and then 140 tuples in the last page.  Working through
>the math in vac_estimate_reltuples (and assuming that the "old" values
>were accurate numbers from the test script's own ANALYZE), what I find
>is that autovacuum will conclude there are 4999 tuples if it misses
>scanning one of the first 30 pages, or 5021 tuples if it misses scanning
>the last page, because its interpolation from the old tuple density
>figure will underestimate or overestimate the number of missed tuples
>accordingly.  Once that slightly-off number gets pushed into pg_class,
>we start to get slightly-off rowcount estimates in the test cases.
>
>So what I'm hypothesizing is that the pg_statistic data is perfectly
>fine but pg_class.reltuples goes off a little bit after autovacuum.
>The percentage changes in reltuples that I predict this way don't
>quite square with the percentage changes we see in the overall
>rowcount estimates, which is a problem for this theory.  But the test
>cases are exercising some fairly complex estimation logic, and it
>wouldn't surprise me much if the estimates aren't linearly affected by
>reltuples.  (Tomas, do you want to comment further on that point?)
>

I think this theory makes perfect sense. I think it's much less likely
to see the last page skipped, so we're likely to end up with reltuples
lower than 5000 (as opposed to seeing the 5021). That kinda matches the
reports, where we generally see estimates reduced by 1 or 2. The -1
change could be explained by rounding errors, I guess - with 5000 we
might have produced 139.51, rounded up to 140, a slight drop may get us
139. Not sure about the -2 changes, but I suppose it's possible we might
actually skip multiple pages, reducing the reltuples estimate even more?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Berserk Autovacuum (let's save next Mandrill)

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Thu, 2 Apr 2020 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Quite :-(.  While it's too early to declare victory, we've seen no
>> more failures of this ilk since 0936d1b6f, so it's sure looking like
>> autovacuum did have something to do with it.

> How about [1]? It seems related to me and also post 0936d1b6f.

That looks much like the first lousyjack failure, which as I said
I wasn't trying to account for at that point.

After looking at those failures, though, I believe that the root cause
may be the same, ie small changes in pg_class.reltuples due to
autovacuum not seeing all pages of the tables.  The test structure
is a bit different, but it is accessing the tables in between EXPLAIN
attempts, so it could be preventing a concurrent autovac from seeing
all pages.

I see your fix at cefb82d49, but it feels a bit brute-force.  Unlike
stats_ext.sql, we're not (supposed to be) dependent on exact planner
estimates in this test.  So I think the real problem here is crappy test
case design.  Namely, that these various sub-tables are exactly the
same size, despite which the test is expecting that the planner will
order them consistently --- with a planning algorithm that prefers
to put larger tables first in parallel appends (cf. create_append_path).
It's not surprising that the result is unstable in the face of small
variations in the rowcount estimates.

I'd be inclined to undo what you did in favor of initializing the
test tables to contain significantly different numbers of rows,
because that would (a) achieve plan stability more directly,
and (b) demonstrate that the planner is actually ordering the
tables by cost correctly.  Maybe somewhere else we have a test
that is verifying (b), but these test cases abysmally fail to
check that point.

I'm not really on board with disabling autovacuum in the regression
tests anywhere we aren't absolutely forced to do so.  It's not
representative of real world practice (or at least not real world
best practice ;-)) and it could help hide actual bugs.  We don't seem
to have much choice with the stats_ext tests as they are constituted,
but those tests look really fragile to me.  Let's not adopt that
technique where we have other possible ways to stabilize test results.

            regards, tom lane



Re: Berserk Autovacuum (let's save next Mandrill)

From
Tom Lane
Date:
I wrote:
> I'd be inclined to undo what you did in favor of initializing the
> test tables to contain significantly different numbers of rows,
> because that would (a) achieve plan stability more directly,
> and (b) demonstrate that the planner is actually ordering the
> tables by cost correctly.  Maybe somewhere else we have a test
> that is verifying (b), but these test cases abysmally fail to
> check that point.

Concretely, I suggest the attached, which replaces the autovac disables
with adjusting partition boundaries so that the partitions contain
different numbers of rows.

I did not touch the partition boundaries for pagg_tab1 and pagg_tab2,
because that would have required also changing the associated test
queries (which are designed to access only particular partitions).
It seemed like too much work to verify that the answers were still
right, and it's not really necessary because those tables are so
small as to fit in single pages.  That means that autovac will either
see the whole table or none of it, and in either case it won't change
reltuples.

This does cause the order of partitions to change in a couple of the
pagg_tab_ml EXPLAIN results, but I think that's fine; the ordering
does now match the actual sizes of the partitions.

            regards, tom lane

diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index d8a6836..a4dc12b 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -2,9 +2,9 @@
 -- PARTITION_AGGREGATE
 -- Test partitionwise aggregation on partitioned tables
 --
--- Note: Various tests located within are sensitive to tables being
--- auto-vacuumed while the tests are running.  For this reason we
--- run autovacuum_enabled = off for all tables.
+-- Note: to ensure plan stability, it's a good idea to make the partitions of
+-- any one partitioned table in this test all have different numbers of rows.
+--
 -- Enable partitionwise aggregate, which by default is disabled.
 SET enable_partitionwise_aggregate TO true;
 -- Enable partitionwise join, which by default is disabled.
@@ -15,9 +15,9 @@ SET max_parallel_workers_per_gather TO 0;
 -- Tests for list partitioned tables.
 --
 CREATE TABLE pagg_tab (a int, b int, c text, d int) PARTITION BY LIST(c);
-CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000', '0001', '0002', '0003') WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0004', '0005', '0006', '0007') WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0008', '0009', '0010', '0011') WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000', '0001', '0002', '0003', '0004');
+CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0005', '0006', '0007', '0008');
+CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0009', '0010', '0011');
 INSERT INTO pagg_tab SELECT i % 20, i % 30, to_char(i % 12, 'FM0000'), i % 30 FROM generate_series(0, 2999) i;
 ANALYZE pagg_tab;
 -- When GROUP BY clause matches; full aggregation is performed for each partition.
@@ -400,13 +400,13 @@ SELECT a, sum(b order by a) FROM pagg_tab GROUP BY a ORDER BY 1, 2;

 -- JOIN query
 CREATE TABLE pagg_tab1(x int, y int) PARTITION BY RANGE(x);
-CREATE TABLE pagg_tab1_p1 PARTITION OF pagg_tab1 FOR VALUES FROM (0) TO (10) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab1_p2 PARTITION OF pagg_tab1 FOR VALUES FROM (10) TO (20) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab1_p3 PARTITION OF pagg_tab1 FOR VALUES FROM (20) TO (30) WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab1_p1 PARTITION OF pagg_tab1 FOR VALUES FROM (0) TO (10);
+CREATE TABLE pagg_tab1_p2 PARTITION OF pagg_tab1 FOR VALUES FROM (10) TO (20);
+CREATE TABLE pagg_tab1_p3 PARTITION OF pagg_tab1 FOR VALUES FROM (20) TO (30);
 CREATE TABLE pagg_tab2(x int, y int) PARTITION BY RANGE(y);
-CREATE TABLE pagg_tab2_p1 PARTITION OF pagg_tab2 FOR VALUES FROM (0) TO (10) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab2_p2 PARTITION OF pagg_tab2 FOR VALUES FROM (10) TO (20) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab2_p3 PARTITION OF pagg_tab2 FOR VALUES FROM (20) TO (30) WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab2_p1 PARTITION OF pagg_tab2 FOR VALUES FROM (0) TO (10);
+CREATE TABLE pagg_tab2_p2 PARTITION OF pagg_tab2 FOR VALUES FROM (10) TO (20);
+CREATE TABLE pagg_tab2_p3 PARTITION OF pagg_tab2 FOR VALUES FROM (20) TO (30);
 INSERT INTO pagg_tab1 SELECT i % 30, i % 20 FROM generate_series(0, 299, 2) i;
 INSERT INTO pagg_tab2 SELECT i % 20, i % 30 FROM generate_series(0, 299, 3) i;
 ANALYZE pagg_tab1;
@@ -820,9 +820,9 @@ SELECT a.x, a.y, count(*) FROM (SELECT * FROM pagg_tab1 WHERE x = 1 AND x = 2) a

 -- Partition by multiple columns
 CREATE TABLE pagg_tab_m (a int, b int, c int) PARTITION BY RANGE(a, ((a+b)/2));
-CREATE TABLE pagg_tab_m_p1 PARTITION OF pagg_tab_m FOR VALUES FROM (0, 0) TO (10, 10) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_m_p2 PARTITION OF pagg_tab_m FOR VALUES FROM (10, 10) TO (20, 20) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_m_p3 PARTITION OF pagg_tab_m FOR VALUES FROM (20, 20) TO (30, 30) WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab_m_p1 PARTITION OF pagg_tab_m FOR VALUES FROM (0, 0) TO (12, 12);
+CREATE TABLE pagg_tab_m_p2 PARTITION OF pagg_tab_m FOR VALUES FROM (12, 12) TO (22, 22);
+CREATE TABLE pagg_tab_m_p3 PARTITION OF pagg_tab_m FOR VALUES FROM (22, 22) TO (30, 30);
 INSERT INTO pagg_tab_m SELECT i % 30, i % 40, i % 50 FROM generate_series(0, 2999) i;
 ANALYZE pagg_tab_m;
 -- Partial aggregation as GROUP BY clause does not match with PARTITION KEY
@@ -926,15 +926,15 @@ SELECT a, c, sum(b), avg(c), count(*) FROM pagg_tab_m GROUP BY (a+b)/2, 2, 1 HAV

 -- Test with multi-level partitioning scheme
 CREATE TABLE pagg_tab_ml (a int, b int, c text) PARTITION BY RANGE(a);
-CREATE TABLE pagg_tab_ml_p1 PARTITION OF pagg_tab_ml FOR VALUES FROM (0) TO (10) WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab_ml_p1 PARTITION OF pagg_tab_ml FOR VALUES FROM (0) TO (10);
 CREATE TABLE pagg_tab_ml_p2 PARTITION OF pagg_tab_ml FOR VALUES FROM (10) TO (20) PARTITION BY LIST (c);
-CREATE TABLE pagg_tab_ml_p2_s1 PARTITION OF pagg_tab_ml_p2 FOR VALUES IN ('0000', '0001') WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_ml_p2_s2 PARTITION OF pagg_tab_ml_p2 FOR VALUES IN ('0002', '0003') WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab_ml_p2_s1 PARTITION OF pagg_tab_ml_p2 FOR VALUES IN ('0000', '0001', '0002');
+CREATE TABLE pagg_tab_ml_p2_s2 PARTITION OF pagg_tab_ml_p2 FOR VALUES IN ('0003');
 -- This level of partitioning has different column positions than the parent
 CREATE TABLE pagg_tab_ml_p3(b int, c text, a int) PARTITION BY RANGE (b);
-CREATE TABLE pagg_tab_ml_p3_s1(c text, a int, b int) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_ml_p3_s2 PARTITION OF pagg_tab_ml_p3 FOR VALUES FROM (5) TO (10) WITH (autovacuum_enabled = off);
-ALTER TABLE pagg_tab_ml_p3 ATTACH PARTITION pagg_tab_ml_p3_s1 FOR VALUES FROM (0) TO (5);
+CREATE TABLE pagg_tab_ml_p3_s1(c text, a int, b int);
+CREATE TABLE pagg_tab_ml_p3_s2 PARTITION OF pagg_tab_ml_p3 FOR VALUES FROM (7) TO (10);
+ALTER TABLE pagg_tab_ml_p3 ATTACH PARTITION pagg_tab_ml_p3_s1 FOR VALUES FROM (0) TO (7);
 ALTER TABLE pagg_tab_ml ATTACH PARTITION pagg_tab_ml_p3 FOR VALUES FROM (20) TO (30);
 INSERT INTO pagg_tab_ml SELECT i % 30, i % 10, to_char(i % 4, 'FM0000') FROM generate_series(0, 29999) i;
 ANALYZE pagg_tab_ml;
@@ -1253,14 +1253,14 @@ SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b ORDER BY 1, 2, 3;
                                  Group Key: pagg_tab_ml_1.b
                                  ->  Parallel Seq Scan on pagg_tab_ml_p2_s1 pagg_tab_ml_1
                            ->  Partial HashAggregate
-                                 Group Key: pagg_tab_ml_2.b
-                                 ->  Parallel Seq Scan on pagg_tab_ml_p2_s2 pagg_tab_ml_2
-                           ->  Partial HashAggregate
                                  Group Key: pagg_tab_ml_3.b
                                  ->  Parallel Seq Scan on pagg_tab_ml_p3_s1 pagg_tab_ml_3
                            ->  Partial HashAggregate
                                  Group Key: pagg_tab_ml_4.b
                                  ->  Parallel Seq Scan on pagg_tab_ml_p3_s2 pagg_tab_ml_4
+                           ->  Partial HashAggregate
+                                 Group Key: pagg_tab_ml_2.b
+                                 ->  Parallel Seq Scan on pagg_tab_ml_p2_s2 pagg_tab_ml_2
 (24 rows)

 SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b HAVING avg(a) < 15 ORDER BY 1, 2, 3;
@@ -1292,10 +1292,6 @@ SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a, b, c HAVING avg(b) > 7 O
                      Filter: (avg(pagg_tab_ml_1.b) > '7'::numeric)
                      ->  Seq Scan on pagg_tab_ml_p2_s1 pagg_tab_ml_1
                ->  HashAggregate
-                     Group Key: pagg_tab_ml_2.a, pagg_tab_ml_2.b, pagg_tab_ml_2.c
-                     Filter: (avg(pagg_tab_ml_2.b) > '7'::numeric)
-                     ->  Seq Scan on pagg_tab_ml_p2_s2 pagg_tab_ml_2
-               ->  HashAggregate
                      Group Key: pagg_tab_ml_3.a, pagg_tab_ml_3.b, pagg_tab_ml_3.c
                      Filter: (avg(pagg_tab_ml_3.b) > '7'::numeric)
                      ->  Seq Scan on pagg_tab_ml_p3_s1 pagg_tab_ml_3
@@ -1303,6 +1299,10 @@ SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a, b, c HAVING avg(b) > 7 O
                      Group Key: pagg_tab_ml_4.a, pagg_tab_ml_4.b, pagg_tab_ml_4.c
                      Filter: (avg(pagg_tab_ml_4.b) > '7'::numeric)
                      ->  Seq Scan on pagg_tab_ml_p3_s2 pagg_tab_ml_4
+               ->  HashAggregate
+                     Group Key: pagg_tab_ml_2.a, pagg_tab_ml_2.b, pagg_tab_ml_2.c
+                     Filter: (avg(pagg_tab_ml_2.b) > '7'::numeric)
+                     ->  Seq Scan on pagg_tab_ml_p2_s2 pagg_tab_ml_2
 (25 rows)

 SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a, b, c HAVING avg(b) > 7 ORDER BY 1, 2, 3;
@@ -1332,9 +1332,9 @@ SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a, b, c HAVING avg(b) > 7 O
 -- costing such plans.
 SET parallel_setup_cost TO 10;
 CREATE TABLE pagg_tab_para(x int, y int) PARTITION BY RANGE(x);
-CREATE TABLE pagg_tab_para_p1 PARTITION OF pagg_tab_para FOR VALUES FROM (0) TO (10) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_para_p2 PARTITION OF pagg_tab_para FOR VALUES FROM (10) TO (20) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_para_p3 PARTITION OF pagg_tab_para FOR VALUES FROM (20) TO (30) WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab_para_p1 PARTITION OF pagg_tab_para FOR VALUES FROM (0) TO (12);
+CREATE TABLE pagg_tab_para_p2 PARTITION OF pagg_tab_para FOR VALUES FROM (12) TO (22);
+CREATE TABLE pagg_tab_para_p3 PARTITION OF pagg_tab_para FOR VALUES FROM (22) TO (30);
 INSERT INTO pagg_tab_para SELECT i % 30, i % 20 FROM generate_series(0, 29999) i;
 ANALYZE pagg_tab_para;
 -- When GROUP BY clause matches; full aggregation is performed for each partition.
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index 8350ddc..946197f 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -2,9 +2,9 @@
 -- PARTITION_AGGREGATE
 -- Test partitionwise aggregation on partitioned tables
 --
--- Note: Various tests located within are sensitive to tables being
--- auto-vacuumed while the tests are running.  For this reason we
--- run autovacuum_enabled = off for all tables.
+-- Note: to ensure plan stability, it's a good idea to make the partitions of
+-- any one partitioned table in this test all have different numbers of rows.
+--

 -- Enable partitionwise aggregate, which by default is disabled.
 SET enable_partitionwise_aggregate TO true;
@@ -17,9 +17,9 @@ SET max_parallel_workers_per_gather TO 0;
 -- Tests for list partitioned tables.
 --
 CREATE TABLE pagg_tab (a int, b int, c text, d int) PARTITION BY LIST(c);
-CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000', '0001', '0002', '0003') WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0004', '0005', '0006', '0007') WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0008', '0009', '0010', '0011') WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000', '0001', '0002', '0003', '0004');
+CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0005', '0006', '0007', '0008');
+CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0009', '0010', '0011');
 INSERT INTO pagg_tab SELECT i % 20, i % 30, to_char(i % 12, 'FM0000'), i % 30 FROM generate_series(0, 2999) i;
 ANALYZE pagg_tab;

@@ -94,14 +94,14 @@ SELECT a, sum(b order by a) FROM pagg_tab GROUP BY a ORDER BY 1, 2;
 -- JOIN query

 CREATE TABLE pagg_tab1(x int, y int) PARTITION BY RANGE(x);
-CREATE TABLE pagg_tab1_p1 PARTITION OF pagg_tab1 FOR VALUES FROM (0) TO (10) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab1_p2 PARTITION OF pagg_tab1 FOR VALUES FROM (10) TO (20) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab1_p3 PARTITION OF pagg_tab1 FOR VALUES FROM (20) TO (30) WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab1_p1 PARTITION OF pagg_tab1 FOR VALUES FROM (0) TO (10);
+CREATE TABLE pagg_tab1_p2 PARTITION OF pagg_tab1 FOR VALUES FROM (10) TO (20);
+CREATE TABLE pagg_tab1_p3 PARTITION OF pagg_tab1 FOR VALUES FROM (20) TO (30);

 CREATE TABLE pagg_tab2(x int, y int) PARTITION BY RANGE(y);
-CREATE TABLE pagg_tab2_p1 PARTITION OF pagg_tab2 FOR VALUES FROM (0) TO (10) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab2_p2 PARTITION OF pagg_tab2 FOR VALUES FROM (10) TO (20) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab2_p3 PARTITION OF pagg_tab2 FOR VALUES FROM (20) TO (30) WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab2_p1 PARTITION OF pagg_tab2 FOR VALUES FROM (0) TO (10);
+CREATE TABLE pagg_tab2_p2 PARTITION OF pagg_tab2 FOR VALUES FROM (10) TO (20);
+CREATE TABLE pagg_tab2_p3 PARTITION OF pagg_tab2 FOR VALUES FROM (20) TO (30);

 INSERT INTO pagg_tab1 SELECT i % 30, i % 20 FROM generate_series(0, 299, 2) i;
 INSERT INTO pagg_tab2 SELECT i % 20, i % 30 FROM generate_series(0, 299, 3) i;
@@ -177,9 +177,9 @@ SELECT a.x, a.y, count(*) FROM (SELECT * FROM pagg_tab1 WHERE x = 1 AND x = 2) a
 -- Partition by multiple columns

 CREATE TABLE pagg_tab_m (a int, b int, c int) PARTITION BY RANGE(a, ((a+b)/2));
-CREATE TABLE pagg_tab_m_p1 PARTITION OF pagg_tab_m FOR VALUES FROM (0, 0) TO (10, 10) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_m_p2 PARTITION OF pagg_tab_m FOR VALUES FROM (10, 10) TO (20, 20) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_m_p3 PARTITION OF pagg_tab_m FOR VALUES FROM (20, 20) TO (30, 30) WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab_m_p1 PARTITION OF pagg_tab_m FOR VALUES FROM (0, 0) TO (12, 12);
+CREATE TABLE pagg_tab_m_p2 PARTITION OF pagg_tab_m FOR VALUES FROM (12, 12) TO (22, 22);
+CREATE TABLE pagg_tab_m_p3 PARTITION OF pagg_tab_m FOR VALUES FROM (22, 22) TO (30, 30);
 INSERT INTO pagg_tab_m SELECT i % 30, i % 40, i % 50 FROM generate_series(0, 2999) i;
 ANALYZE pagg_tab_m;

@@ -202,17 +202,17 @@ SELECT a, c, sum(b), avg(c), count(*) FROM pagg_tab_m GROUP BY (a+b)/2, 2, 1 HAV
 -- Test with multi-level partitioning scheme

 CREATE TABLE pagg_tab_ml (a int, b int, c text) PARTITION BY RANGE(a);
-CREATE TABLE pagg_tab_ml_p1 PARTITION OF pagg_tab_ml FOR VALUES FROM (0) TO (10) WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab_ml_p1 PARTITION OF pagg_tab_ml FOR VALUES FROM (0) TO (10);
 CREATE TABLE pagg_tab_ml_p2 PARTITION OF pagg_tab_ml FOR VALUES FROM (10) TO (20) PARTITION BY LIST (c);
-CREATE TABLE pagg_tab_ml_p2_s1 PARTITION OF pagg_tab_ml_p2 FOR VALUES IN ('0000', '0001') WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_ml_p2_s2 PARTITION OF pagg_tab_ml_p2 FOR VALUES IN ('0002', '0003') WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab_ml_p2_s1 PARTITION OF pagg_tab_ml_p2 FOR VALUES IN ('0000', '0001', '0002');
+CREATE TABLE pagg_tab_ml_p2_s2 PARTITION OF pagg_tab_ml_p2 FOR VALUES IN ('0003');

 -- This level of partitioning has different column positions than the parent
 CREATE TABLE pagg_tab_ml_p3(b int, c text, a int) PARTITION BY RANGE (b);
-CREATE TABLE pagg_tab_ml_p3_s1(c text, a int, b int) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_ml_p3_s2 PARTITION OF pagg_tab_ml_p3 FOR VALUES FROM (5) TO (10) WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab_ml_p3_s1(c text, a int, b int);
+CREATE TABLE pagg_tab_ml_p3_s2 PARTITION OF pagg_tab_ml_p3 FOR VALUES FROM (7) TO (10);

-ALTER TABLE pagg_tab_ml_p3 ATTACH PARTITION pagg_tab_ml_p3_s1 FOR VALUES FROM (0) TO (5);
+ALTER TABLE pagg_tab_ml_p3 ATTACH PARTITION pagg_tab_ml_p3_s1 FOR VALUES FROM (0) TO (7);
 ALTER TABLE pagg_tab_ml ATTACH PARTITION pagg_tab_ml_p3 FOR VALUES FROM (20) TO (30);

 INSERT INTO pagg_tab_ml SELECT i % 30, i % 10, to_char(i % 4, 'FM0000') FROM generate_series(0, 29999) i;
@@ -287,9 +287,9 @@ SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a, b, c HAVING avg(b) > 7 O
 SET parallel_setup_cost TO 10;

 CREATE TABLE pagg_tab_para(x int, y int) PARTITION BY RANGE(x);
-CREATE TABLE pagg_tab_para_p1 PARTITION OF pagg_tab_para FOR VALUES FROM (0) TO (10) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_para_p2 PARTITION OF pagg_tab_para FOR VALUES FROM (10) TO (20) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_para_p3 PARTITION OF pagg_tab_para FOR VALUES FROM (20) TO (30) WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab_para_p1 PARTITION OF pagg_tab_para FOR VALUES FROM (0) TO (12);
+CREATE TABLE pagg_tab_para_p2 PARTITION OF pagg_tab_para FOR VALUES FROM (12) TO (22);
+CREATE TABLE pagg_tab_para_p3 PARTITION OF pagg_tab_para FOR VALUES FROM (22) TO (30);

 INSERT INTO pagg_tab_para SELECT i % 30, i % 20 FROM generate_series(0, 29999) i;


Re: Berserk Autovacuum (let's save next Mandrill)

From
David Rowley
Date:
On Fri, 3 Apr 2020 at 04:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > I'd be inclined to undo what you did in favor of initializing the
> > test tables to contain significantly different numbers of rows,
> > because that would (a) achieve plan stability more directly,
> > and (b) demonstrate that the planner is actually ordering the
> > tables by cost correctly.  Maybe somewhere else we have a test
> > that is verifying (b), but these test cases abysmally fail to
> > check that point.
>
> Concretely, I suggest the attached, which replaces the autovac disables
> with adjusting partition boundaries so that the partitions contain
> different numbers of rows.

I've looked over this and I agree that it's a better solution to the problem.

I'm happy for you to go ahead on this.

David



Re: Berserk Autovacuum (let's save next Mandrill)

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Fri, 3 Apr 2020 at 04:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Concretely, I suggest the attached, which replaces the autovac disables
>> with adjusting partition boundaries so that the partitions contain
>> different numbers of rows.

> I've looked over this and I agree that it's a better solution to the problem.
> I'm happy for you to go ahead on this.

Pushed, thanks for looking at it!

            regards, tom lane