Re: Autovacuum on partitioned table (autoanalyze) - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Autovacuum on partitioned table (autoanalyze) |
Date | |
Msg-id | 20210730010355.6yodvn2ag3arfihi@alap3.anarazel.de Whole thread Raw |
In response to | Re: Autovacuum on partitioned table (autoanalyze) (Andres Freund <andres@anarazel.de>) |
Responses |
Re: Autovacuum on partitioned table (autoanalyze)
Re: Autovacuum on partitioned table (autoanalyze) |
List | pgsql-hackers |
Hi, CCing RMT because I think we need to do something about this for v14. On 2021-07-27 19:23:42 -0700, Andres Freund wrote: > On 2021-07-22 13:54:58 -0700, Andres Freund wrote: > > On 2021-04-08 01:20:14 -0400, Alvaro Herrera wrote: > > > On 2021-Apr-07, Alvaro Herrera wrote: > > > > > > > OK, I bit the bullet and re-did the logic in the way I had proposed > > > > earlier in the thread: do the propagation on the collector's side, by > > > > sending only the list of ancestors: the collector can read the tuple > > > > change count by itself, to add it to each ancestor. This seems less > > > > wasteful. Attached is v16 which does it that way and seems to work > > > > nicely under my testing. > > > > > > Pushed with this approach. Thanks for persisting with this. > > > > I'm looking at this in the context of rebasing & polishing the shared > > memory stats patch. > > > > I have a few questions / concerns: > > Another one, and I think this might warrant thinking about for v14: > > Isn't this going to create a *lot* of redundant sampling? Especially if you > have any sort of nested partition tree. In the most absurd case a partition > with n parents will get sampled n times, solely due to changes to itself. > > Look at the following example: > > BEGIN; > DROP TABLE if exists p; > CREATE TABLE p (i int) partition by range(i); > CREATE TABLE p_0 PARTITION OF p FOR VALUES FROM ( 0) to (5000) partition by range(i); > CREATE TABLE p_0_0 PARTITION OF p_0 FOR VALUES FROM ( 0) to (1000); > CREATE TABLE p_0_1 PARTITION OF p_0 FOR VALUES FROM (1000) to (2000); > CREATE TABLE p_0_2 PARTITION OF p_0 FOR VALUES FROM (2000) to (3000); > CREATE TABLE p_0_3 PARTITION OF p_0 FOR VALUES FROM (3000) to (4000); > CREATE TABLE p_0_4 PARTITION OF p_0 FOR VALUES FROM (4000) to (5000); > -- create some initial data > INSERT INTO p select generate_series(0, 5000 - 1) data FROM generate_series(1, 100) reps; > COMMIT; > > UPDATE p_0_4 SET i = i; > > > Whenever the update is executed, all partitions will be sampled at least twice > (once for p and once for p_0), with p_0_4 sampled three times. > > Of course, this is an extreme example, but it's not hard to imagine cases > where v14 will cause the number of auto-analyzes increase sufficiently to bog > down autovacuum to a problematic degree. > > > Additionally, while analyzing all child partitions for a partitioned tables > are AccessShareLock'ed at once. If a partition hierarchy has more than one > level, it actually is likely that multiple autovacuum workers will end up > processing the ancestors separately. This seems like it might contribute to > lock exhaustion issues with larger partition hierarchies? I started to write a patch rejiggering autovacuum.c portion of this change. While testing it I hit the case of manual ANALYZEs leaving changes_since_analyze for partitioned tables in a bogus state - without a minimally invasive way to fix that. After a bit of confused staring I realized that the current code has a very similar problem: Using the same setup as above: INSERT INTO p VALUES (0,0); /* repeat as many times as desired */ ANALYZE p_0_0; At this point the system will have lost track of the changes to p_0_0, unless an autovacuum worker was launched between the INSERTs and the ANALYZE (which would cause pgstat_report_anl_ancestors() to report the change count upwards). There appears to be code trying to address that, but I don't see how it ever does anything meaningful? /* * Now report ANALYZE to the stats collector. For regular tables, we do * it only if not doing inherited stats. For partitioned tables, we only * do it for inherited stats. (We're never called for not-inherited stats * on partitioned tables anyway.) * * Reset the changes_since_analyze counter only if we analyzed all * columns; otherwise, there is still work for auto-analyze to do. */ if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) pgstat_report_analyze(onerel, totalrows, totaldeadrows, (va_cols == NIL)); /* * If this is a manual analyze of all columns of a permanent leaf * partition, and not doing inherited stats, also let the collector know * about the ancestor tables of this partition. Autovacuum does the * equivalent of this at the start of its run, so there's no reason to do * it there. */ if (!inh && !IsAutoVacuumWorkerProcess() && (va_cols == NIL) && onerel->rd_rel->relispartition && onerel->rd_rel->relkind == RELKIND_RELATION && onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT) { pgstat_report_anl_ancestors(RelationGetRelid(onerel)); } The pgstat_report_analyze() triggers pgstat_recv_analyze() to reset the counter that pgstat_recv_anl_ancestors() would use to report changes upwards: /* * If commanded, reset changes_since_analyze to zero. This forgets any * changes that were committed while the ANALYZE was in progress, but we * have no good way to estimate how many of those there were. */ if (msg->m_resetcounter) { tabentry->changes_since_analyze = 0; tabentry->changes_since_analyze_reported = 0; } And if one instead inverts the order of pgstat_report_analyze() and pgstat_report_anl_ancestors() one gets a slightly different problem: A manual ANALYZE of the partition root results in the partition root having a non-zero changes_since_analyze afterwards. expand_vacuum() causes child partitions to be added to the list of relations, which *first* causes the partition root to be analyzed, and *then* partitions. The partitions then report their changes_since_analyze upwards. I don't think the code as is is fit for v14. It looks like it was rewritten with a new approach just before the freeze ([1]), and as far as I can tell the concerns I quoted above weren't even discussed in the whole thread. Alvaro, any comments? Greetings, Andres Freund [1] https://www.postgresql.org/message-id/20210408032235.GA6842%40alvherre.pgsql
pgsql-hackers by date: