Re: autovac issue with large number of tables - Mailing list pgsql-hackers

From Kasahara Tatsuhito
Subject Re: autovac issue with large number of tables
Date
Msg-id CAP0=ZV+dpHjgj6tPwrTyKzsUw_amvVxR+Y0v49H=qp8QAqaFNg@mail.gmail.com
Whole thread Raw
In response to Re: autovac issue with large number of tables  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: autovac issue with large number of tables  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
Hi,

On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
> <kasahara.tatsuhito@gmail.com> wrote:
> >
> > Hi,
> >
> > On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
> > <kasahara.tatsuhito@gmail.com> wrote:
> > > > I wonder if we could have table_recheck_autovac do two probes of the stats
> > > > data.  First probe the existing stats data, and if it shows the table to
> > > > be already vacuumed, return immediately.  If not, *then* force a stats
> > > > re-read, and check a second time.
> > > Does the above mean that the second and subsequent table_recheck_autovac()
> > > will be improved to first check using the previous refreshed statistics?
> > > I think that certainly works.
> > >
> > > If that's correct, I'll try to create a patch for the PoC
> >
> > I still don't know how to reproduce Jim's troubles, but I was able to reproduce
> > what was probably a very similar problem.
> >
> > This problem seems to be more likely to occur in cases where you have
> > a large number of tables,
> > i.e., a large amount of stats, and many small tables need VACUUM at
> > the same time.
> >
> > So I followed Tom's advice and created a patch for the PoC.
> > This patch will enable a flag in the table_recheck_autovac function to use
> > the existing stats next time if VACUUM (or ANALYZE) has already been done
> > by another worker on the check after the stats have been updated.
> > If the tables continue to require VACUUM after the refresh, then a refresh
> > will be required instead of using the existing statistics.
> >
> > I did simple test with HEAD and HEAD + this PoC patch.
> > The tests were conducted in two cases.
> > (I changed few configurations. see attached scripts)
> >
> > 1. Normal VACUUM case
> >   - SET autovacuum = off
> >   - CREATE tables with 100 rows
> >   - DELETE 90 rows for each tables
> >   - SET autovacuum = on and restart PostgreSQL
> >   - Measure the time it takes for all tables to be VACUUMed
> >
> > 2. Anti wrap round VACUUM case
> >   - CREATE brank tables
> >   - SELECT all of these tables (for generate stats)
> >   - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
> >   - Consumes a lot of XIDs by using txid_curent()
> >   - Measure the time it takes for all tables to be VACUUMed
> >
> > For each test case, the following results were obtained by changing
> > autovacuum_max_workers parameters to 1, 2, 3(def) 5 and 10.
> > Also changing num of tables to 1000, 5000, 10000 and 20000.
> >
> > Due to the poor VM environment (2 VCPU/4 GB), the results are a little unstable,
> > but I think it's enough to ask for a trend.
> >
> > ===========================================================================
> > [1.Normal VACUUM case]
> >  tables:1000
> >   autovacuum_max_workers 1:   (HEAD) 20 sec VS (with patch)  20 sec
> >   autovacuum_max_workers 2:   (HEAD) 18 sec VS (with patch)  16 sec
> >   autovacuum_max_workers 3:   (HEAD) 18 sec VS (with patch)  16 sec
> >   autovacuum_max_workers 5:   (HEAD) 19 sec VS (with patch)  17 sec
> >   autovacuum_max_workers 10:  (HEAD) 19 sec VS (with patch)  17 sec
> >
> >  tables:5000
> >   autovacuum_max_workers 1:   (HEAD) 77 sec VS (with patch)  78 sec
> >   autovacuum_max_workers 2:   (HEAD) 61 sec VS (with patch)  43 sec
> >   autovacuum_max_workers 3:   (HEAD) 38 sec VS (with patch)  38 sec
> >   autovacuum_max_workers 5:   (HEAD) 45 sec VS (with patch)  37 sec
> >   autovacuum_max_workers 10:  (HEAD) 43 sec VS (with patch)  35 sec
> >
> >  tables:10000
> >   autovacuum_max_workers 1:   (HEAD) 152 sec VS (with patch)  153 sec
> >   autovacuum_max_workers 2:   (HEAD) 119 sec VS (with patch)   98 sec
> >   autovacuum_max_workers 3:   (HEAD)  87 sec VS (with patch)   78 sec
> >   autovacuum_max_workers 5:   (HEAD) 100 sec VS (with patch)   66 sec
> >   autovacuum_max_workers 10:  (HEAD)  97 sec VS (with patch)   56 sec
> >
> >  tables:20000
> >   autovacuum_max_workers 1:   (HEAD) 338 sec VS (with patch)  339 sec
> >   autovacuum_max_workers 2:   (HEAD) 231 sec VS (with patch)  229 sec
> >   autovacuum_max_workers 3:   (HEAD) 220 sec VS (with patch)  191 sec
> >   autovacuum_max_workers 5:   (HEAD) 234 sec VS (with patch)  147 sec
> >   autovacuum_max_workers 10:  (HEAD) 320 sec VS (with patch)  113 sec
> >
> > [2.Anti wrap round VACUUM case]
> >  tables:1000
> >   autovacuum_max_workers 1:   (HEAD) 19 sec VS (with patch) 18 sec
> >   autovacuum_max_workers 2:   (HEAD) 14 sec VS (with patch) 15 sec
> >   autovacuum_max_workers 3:   (HEAD) 14 sec VS (with patch) 14 sec
> >   autovacuum_max_workers 5:   (HEAD) 14 sec VS (with patch) 16 sec
> >   autovacuum_max_workers 10:  (HEAD) 16 sec VS (with patch) 14 sec
> >
> >  tables:5000
> >   autovacuum_max_workers 1:   (HEAD) 69 sec VS (with patch) 69 sec
> >   autovacuum_max_workers 2:   (HEAD) 66 sec VS (with patch) 47 sec
> >   autovacuum_max_workers 3:   (HEAD) 59 sec VS (with patch) 37 sec
> >   autovacuum_max_workers 5:   (HEAD) 39 sec VS (with patch) 28 sec
> >   autovacuum_max_workers 10:  (HEAD) 39 sec VS (with patch) 29 sec
> >
> >  tables:10000
> >   autovacuum_max_workers 1:   (HEAD) 139 sec VS (with patch) 138 sec
> >   autovacuum_max_workers 2:   (HEAD) 130 sec VS (with patch)  86 sec
> >   autovacuum_max_workers 3:   (HEAD) 120 sec VS (with patch)  68 sec
> >   autovacuum_max_workers 5:   (HEAD)  96 sec VS (with patch)  41 sec
> >   autovacuum_max_workers 10:  (HEAD)  90 sec VS (with patch)  39 sec
> >
> >  tables:20000
> >   autovacuum_max_workers 1:   (HEAD) 313 sec VS (with patch) 331 sec
> >   autovacuum_max_workers 2:   (HEAD) 209 sec VS (with patch) 201 sec
> >   autovacuum_max_workers 3:   (HEAD) 227 sec VS (with patch) 141 sec
> >   autovacuum_max_workers 5:   (HEAD) 236 sec VS (with patch)  88 sec
> >   autovacuum_max_workers 10:  (HEAD) 309 sec VS (with patch)  74 sec
> > ===========================================================================
> >
> > The cases without patch, the scalability of the worker has decreased
> > as the number of tables has increased.
> > In fact, the more workers there are, the longer it takes to complete
> > VACUUM to all tables.
> > The cases with patch, it shows good scalability with respect to the
> > number of workers.
>
> It seems a good performance improvement even without the patch of
> shared memory based stats collector.
>
> >
> > Note that perf top results showed that hash_search_with_hash_value,
> > hash_seq_search and
> > pgstat_read_statsfiles are dominant during VACUUM in all patterns,
> > with or without the patch.
> >
> > Therefore, there is still a need to find ways to optimize the reading
> > of large amounts of stats.
> > However, this patch is effective in its own right, and since there are
> > only a few parts to modify,
> > I think it should be able to be applied to current (preferably
> > pre-v13) PostgreSQL.
>
> +1
>
> +
> +       /* We might be better to refresh stats */
> +       use_existing_stats = false;
>     }
> +   else
> +   {
>
> -   heap_freetuple(classTup);
> +       heap_freetuple(classTup);
> +       /* The relid has already vacuumed, so we might be better to
> use exiting stats */
> +       use_existing_stats = true;
> +   }
>
> With that patch, the autovacuum process refreshes the stats in the
> next check if it finds out that the table still needs to be vacuumed.
> But I guess it's not necessarily true because the next table might be
> vacuumed already. So I think we might want to always use the existing
> for the first check. What do you think?
Thanks for your comment.

If we assume the case where some workers vacuum on large tables
and a single worker vacuum on small tables, the processing
performance of the single worker will be slightly lower if the
existing statistics are checked every time.

In fact, at first I tried to check the existing stats every time,
but the performance was slightly worse in cases with a small number of workers.
(Checking the existing stats is lightweight , but at high frequency,
 it affects processing performance.)
Therefore, at after refresh statistics, determine whether autovac
should use the existing statistics.

BTW, I found some typos in comments, so attache a  fixed version.

Best regards,

-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: vac_update_datfrozenxid will raise "wrong tuple length" if pg_database tuple contains toast attribute.
Next
From: Peter Smith
Date:
Subject: Re: [HACKERS] logical decoding of two-phase transactions