Re: Autovacuum on partitioned table (autoanalyze) - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Autovacuum on partitioned table (autoanalyze)
Date
Msg-id 20210730010355.6yodvn2ag3arfihi@alap3.anarazel.de
Whole thread Raw
In response to Re: Autovacuum on partitioned table (autoanalyze)  (Andres Freund <andres@anarazel.de>)
Responses Re: Autovacuum on partitioned table (autoanalyze)  (Andrew Dunstan <andrew@dunslane.net>)
Re: Autovacuum on partitioned table (autoanalyze)  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
Hi,

CCing RMT because I think we need to do something about this for v14.

On 2021-07-27 19:23:42 -0700, Andres Freund wrote:
> On 2021-07-22 13:54:58 -0700, Andres Freund wrote:
> > On 2021-04-08 01:20:14 -0400, Alvaro Herrera wrote:
> > > On 2021-Apr-07, Alvaro Herrera wrote:
> > >
> > > > OK, I bit the bullet and re-did the logic in the way I had proposed
> > > > earlier in the thread: do the propagation on the collector's side, by
> > > > sending only the list of ancestors: the collector can read the tuple
> > > > change count by itself, to add it to each ancestor.  This seems less
> > > > wasteful.  Attached is v16 which does it that way and seems to work
> > > > nicely under my testing.
> > >
> > > Pushed with this approach.  Thanks for persisting with this.
> >
> > I'm looking at this in the context of rebasing & polishing the shared
> > memory stats patch.
> >
> > I have a few questions / concerns:
>
> Another one, and I think this might warrant thinking about for v14:
>
> Isn't this going to create a *lot* of redundant sampling?  Especially if you
> have any sort of nested partition tree. In the most absurd case a partition
> with n parents will get sampled n times, solely due to changes to itself.
>
> Look at the following example:
>
> BEGIN;
> DROP TABLE if exists p;
> CREATE TABLE p (i int) partition by range(i);
> CREATE TABLE p_0 PARTITION OF p FOR VALUES FROM     (   0) to (5000) partition by range(i);
> CREATE TABLE p_0_0 PARTITION OF p_0 FOR VALUES FROM (   0) to (1000);
> CREATE TABLE p_0_1 PARTITION OF p_0 FOR VALUES FROM (1000) to (2000);
> CREATE TABLE p_0_2 PARTITION OF p_0 FOR VALUES FROM (2000) to (3000);
> CREATE TABLE p_0_3 PARTITION OF p_0 FOR VALUES FROM (3000) to (4000);
> CREATE TABLE p_0_4 PARTITION OF p_0 FOR VALUES FROM (4000) to (5000);
> -- create some initial data
> INSERT INTO p select generate_series(0, 5000 - 1) data FROM generate_series(1, 100) reps;
> COMMIT;
>
> UPDATE p_0_4 SET i = i;
>
>
> Whenever the update is executed, all partitions will be sampled at least twice
> (once for p and once for p_0), with p_0_4 sampled three times.
>
> Of course, this is an extreme example, but it's not hard to imagine cases
> where v14 will cause the number of auto-analyzes increase sufficiently to bog
> down autovacuum to a problematic degree.
>
>
> Additionally, while analyzing all child partitions for a partitioned tables
> are AccessShareLock'ed at once. If a partition hierarchy has more than one
> level, it actually is likely that multiple autovacuum workers will end up
> processing the ancestors separately.  This seems like it might contribute to
> lock exhaustion issues with larger partition hierarchies?


I started to write a patch rejiggering autovacuum.c portion of this
change. While testing it I hit the case of manual ANALYZEs leaving
changes_since_analyze for partitioned tables in a bogus state - without a
minimally invasive way to fix that. After a bit of confused staring I realized
that the current code has a very similar problem:

Using the same setup as above:

INSERT INTO p VALUES (0,0); /* repeat as many times as desired */
ANALYZE p_0_0;

At this point the system will have lost track of the changes to p_0_0, unless
an autovacuum worker was launched between the INSERTs and the ANALYZE (which
would cause pgstat_report_anl_ancestors() to report the change count upwards).

There appears to be code trying to address that, but I don't see how it
ever does anything meaningful?

    /*
     * Now report ANALYZE to the stats collector.  For regular tables, we do
     * it only if not doing inherited stats.  For partitioned tables, we only
     * do it for inherited stats. (We're never called for not-inherited stats
     * on partitioned tables anyway.)
     *
     * Reset the changes_since_analyze counter only if we analyzed all
     * columns; otherwise, there is still work for auto-analyze to do.
     */
    if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
        pgstat_report_analyze(onerel, totalrows, totaldeadrows,
                              (va_cols == NIL));

    /*
     * If this is a manual analyze of all columns of a permanent leaf
     * partition, and not doing inherited stats, also let the collector know
     * about the ancestor tables of this partition.  Autovacuum does the
     * equivalent of this at the start of its run, so there's no reason to do
     * it there.
     */
    if (!inh && !IsAutoVacuumWorkerProcess() &&
        (va_cols == NIL) &&
        onerel->rd_rel->relispartition &&
        onerel->rd_rel->relkind == RELKIND_RELATION &&
        onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
    {
        pgstat_report_anl_ancestors(RelationGetRelid(onerel));
    }

The pgstat_report_analyze() triggers pgstat_recv_analyze() to reset the
counter that pgstat_recv_anl_ancestors() would use to report changes
upwards:

    /*
     * If commanded, reset changes_since_analyze to zero.  This forgets any
     * changes that were committed while the ANALYZE was in progress, but we
     * have no good way to estimate how many of those there were.
     */
    if (msg->m_resetcounter)
    {
        tabentry->changes_since_analyze = 0;
        tabentry->changes_since_analyze_reported = 0;
    }

And if one instead inverts the order of pgstat_report_analyze() and
pgstat_report_anl_ancestors() one gets a slightly different problem: A manual
ANALYZE of the partition root results in the partition root having a non-zero
changes_since_analyze afterwards. expand_vacuum() causes child partitions to be
added to the list of relations, which *first* causes the partition root to be
analyzed, and *then* partitions. The partitions then report their
changes_since_analyze upwards.


I don't think the code as is is fit for v14. It looks like it was rewritten
with a new approach just before the freeze ([1]), and as far as I can tell the
concerns I quoted above weren't even discussed in the whole thread.  Alvaro,
any comments?

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20210408032235.GA6842%40alvherre.pgsql



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Replace l337sp34k in comments.
Next
From: John Naylor
Date:
Subject: Re: speed up verifying UTF-8