Unfortunate Nested Loop + Missing Autovacuum

From: Lincoln Swaine-Moore

Hi all--

Writing in with a two-parter performance issue.

I've got two tables roughly shaped like this:

                                              Table "public.rawdata"
        Column         |            Type             | Collation | Nullable |                     Default
-----------------------+-----------------------------+-----------+----------+-------------------------------------------------
 id                    | integer                     |           | not null | nextval('rawdata_id_seq'::regclass)
 timestamp             | timestamp without time zone |           | not null |
 sn                    | character varying(36)       |           | not null |
 raw_value             | double precision            |           |          |
Indexes:
    "rawdata_pkey" PRIMARY KEY, btree (id)
    "rawdata_multicol_sn_and_timestamp_desc_idx" btree (sn, "timestamp" DESC)
Foreign-key constraints:
    "rawdata_sn_fkey" FOREIGN KEY (sn) REFERENCES devices(sn) ON UPDATE CASCADE ON DELETE CASCADE
Referenced by:
    TABLE "cleandata" CONSTRAINT "cleandata_raw_table_id_fkey" FOREIGN KEY (raw_table_id) REFERENCES rawdata(id)

                                            Table "public.cleandata"
     Column      |            Type             | Collation | Nullable |                      Default
-----------------+-----------------------------+-----------+----------+---------------------------------------------------
 id              | integer                     |           | not null | nextval('cleandata_id_seq'::regclass)
 timestamp       | timestamp without time zone |           | not null |
 sn              | character varying(36)       |           | not null |
 raw_table_id    | integer                     |           |          |
 clean_value     | double precision            |           |          |
Indexes:
    "cleandata_pkey" PRIMARY KEY, btree (id)
    "cleandata_multicol_sn_and_timestamp_desc_idx" btree (sn, "timestamp" DESC)
Foreign-key constraints:
    "cleandata_raw_table_id_fkey" FOREIGN KEY (raw_table_id) REFERENCES rawdata(id)
    "cleandata_sn_fkey" FOREIGN KEY (sn) REFERENCES devices(sn) ON UPDATE CASCADE ON DELETE CASCADE


Rows are inserted into the raw table, then slightly later inserted into the cleandata table, at a rate of ~1-5 million
rows a day. The tables are on the order of hundreds of millions of rows. They are almost never updated or deleted,
but are selected regularly. Technically, not every row in rawdata will have a corresponding row in cleandata,
but in practice the vast majority do.

Thanks to the combined index on sn / timestamp, selecting data from either of these tables on its own is
fairly performant. Where things get hairy is when I try to join between them. I can write a performant query that joins
from clean to raw, but the opposite direction (which is not logically equivalent) is painful because raw_table_id is not indexed on cleandata.
I think in the medium term, it would be good to build that index.
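What I have in mind is roughly the following (untested; the index name is just a placeholder, and CONCURRENTLY
so the steady stream of inserts isn't blocked):

CREATE INDEX CONCURRENTLY cleandata_raw_table_id_idx
    ON cleandata (raw_table_id);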
In the short term, though, I can help speed things up by limiting the data with a double filter, like:

explain analyze SELECT count(*)
FROM (
    SELECT *
    FROM
        rawdata
    WHERE
       timestamp >= NOW() - interval '2 weeks'
        AND sn ='FOO'
    ) r
JOIN (
    SELECT *
    FROM
        cleandata
    WHERE
        timestamp >= NOW() - interval '2 weeks'
         AND sn ='FOO'
) c
ON
    r.id = c.raw_table_id;


which produces:
Hash Join  (cost=42.80..83.52 rows=1 width=44) (actual time=852.341..1338.713 rows=20141 loops=1)
  Hash Cond: (cleandata.raw_table_id = rawdata.id)
  ->  Index Scan using cleandata_multicol_sn_and_timestamp_desc_idx on cleandata  (cost=0.70..41.30 rows=47 width=12) (actual time=0.051..476.287 rows=20141 loops=1)
"        Index Cond: (((sn)::text = 'FOO'::text) AND (""timestamp"" >= (now() - '14 days'::interval)))"
  ->  Hash  (cost=41.51..41.51 rows=47 width=40) (actual time=852.273..852.275 rows=20141 loops=1)
        Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1751kB
        ->  Index Scan using raw_multicol_sn_and_timestamp_desc_idx on rawdata  (cost=0.70..41.51 rows=47 width=40) (actual time=0.050..844.555 rows=20141 loops=1)
"              Index Cond: (((sn)::text = 'FOO'::text) AND (""timestamp"" >= (now() - '14 days'::interval)))"
Planning Time: 2.211 ms
Execution Time: 1339.680 ms


The row estimates are badly off, but the plan is good and the query is fast.
However, when I change the filter slightly to `timestamp >= NOW() - interval '2 weeks' AND timestamp < NOW()`,
which covers the same period (and really any other time range behaves the same way), the plan changes dramatically:
Nested Loop  (cost=1.42..5.47 rows=1 width=572) (actual time=0.100..185218.340 rows=20141 loops=1)
  Join Filter: (rawdata.id = cleandata.raw_table_id)
  Rows Removed by Join Filter: 405639740
  ->  Index Scan using raw_multicol_sn_and_timestamp_desc_idx on rawdata  (cost=0.71..2.73 rows=1 width=424) (actual time=0.056..100.014 rows=20141 loops=1)
"        Index Cond: (((sn)::text = 'FOO'::text) AND (""timestamp"" >= (now() - '14 days'::interval)) AND (""timestamp"" < now()))"
  ->  Index Scan using cleandata_multicol_sn_and_timestamp_desc_idx on cleandata  (cost=0.71..2.73 rows=1 width=148) (actual time=0.045..7.584 rows=20141 loops=20141)
"        Index Cond: (((sn)::text = 'FOO'::text) AND (""timestamp"" >= (now() - '14 days'::interval)) AND (""timestamp"" < now()))"
Planning Time: 0.450 ms
Execution Time: 185225.028 ms


The row estimate for each table gets even worse, and I guess that's enough to push Postgres into a Nested Loop here, with disastrous consequences.

So, obviously there's a statistics problem, which led me to realize that actually these tables have *never* been autovacuumed/analyzed according to pg_stat_user_tables. 
I'm using a managed database which makes it a little tricky to debug, but all my settings
(autovacuum/autovacuum_vacuum_threshold/autovacuum_analyze_threshold/autovacuum_vacuum_insert_threshold) are default,
and I can see that other tables have been vacuumed recently.

I assume this has something to do with the fact that these tables don't accumulate dead tuples since they're basically append-only?
But I still think INSERTs should trigger autovacuum/analyze eventually (at least when the table grows by 10%, because of autovacuum_analyze_scale_factor),
and I'm confused why that doesn't seem to have happened. Seems like this is probably hurting my queries' performance elsewhere.
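For reference, this is roughly how I've been checking (just pg_stat_user_tables and the relevant settings; the
insert-threshold setting only exists on newer versions):

SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze,
       autovacuum_count, autoanalyze_count, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname IN ('rawdata', 'cleandata');

SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum',
               'autovacuum_vacuum_threshold',
               'autovacuum_analyze_threshold',
               'autovacuum_analyze_scale_factor',
               'autovacuum_vacuum_insert_threshold');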

So, my two big questions are:
- Is there a better way to write my query to hint away from the awful nested loop join?
- Can anyone think of why autovacuum is declining to vacuum/analyze these tables?

Thanks for reading, and for any help with this!

--
Lincoln Swaine-Moore

Re: Unfortunate Nested Loop + Missing Autovacuum

From: Andrei Lepikhov

On 22/2/2025 00:46, Lincoln Swaine-Moore wrote:
> So, obviously there's a statistics problem, which led me to realize that 
> actually these tables have *never* been autovacuumed/analyzed according 
> to pg_stat_user_tables.
> I'm using a managed database which makes it a little tricky to debug, 
> but all my settings
> (autovacuum/autovacuum_vacuum_threshold/autovacuum_analyze_threshold/ 
> autovacuum_vacuum_insert_threshold) are default,
> and I can see that other tables have been vacuumed recently.
I know of a couple of reports about this kind of behaviour, and of
different solutions to it. But first: if you execute the ANALYZE
command on these problematic tables, does it fix your issue? Could you
live with a manual VACUUM ANALYZE each time after a batch insertion?
If not, could you provide a synthetic reproduction of the case?

-- 
regards, Andrei Lepikhov



Re: Unfortunate Nested Loop + Missing Autovacuum

From: Lincoln Swaine-Moore

Thanks for the reply! I tried the analysis on our much shorter staging table and it did change the plan. I haven’t tried it on the production ones because my understanding is that the autovacuum process is gentler with resource consumption and I didn’t want to gum things up in the meantime. But that may be false or avoidable. 

Unfortunately the insertions are not batched—they arrive in a fairly steady stream. Given that, and the fact that we’re talking about the behavior of the daemon, I’m not sure how to provide a reproduction (though would be game if you had pointers). I could in theory make a job analyze these tables on a regular cadence, though I guess that feels like I’d be stepping on autovacuum’s toes.
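If I did go that route, I was picturing something like this, assuming pg_cron (or some external scheduler) is available on the managed instance; just a sketch:

-- nightly ANALYZE of the two busy tables as a stopgap (assumes the pg_cron extension)
SELECT cron.schedule('analyze-rawdata',   '15 3 * * *', 'ANALYZE public.rawdata');
SELECT cron.schedule('analyze-cleandata', '30 3 * * *', 'ANALYZE public.cleandata');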

Would be happy to read over similar reports if you could steer me toward them.

Thanks again for your help. 

Lincoln Swaine-Moore



Re: Unfortunate Nested Loop + Missing Autovacuum

From: Jeff Janes

On Sun, Feb 23, 2025 at 5:49 PM Lincoln Swaine-Moore <lswainemoore@gmail.com> wrote:
Thanks for the reply! I tried the analysis on our much shorter staging table and it did change the plan. I haven’t tried it on the production ones because my understanding is that the autovacuum process is gentler with resource consumption and I didn’t want to gum things up in the meantime. But that may be false or avoidable. 

The default setting of vacuum_cost_delay is zero, while the default setting of autovacuum_vacuum_cost_delay is 2ms (or 20ms, depending on the software version).  So you can give a manual run the same resource consumption as an autorun just by changing vacuum_cost_delay in that session so that it has the same value as autovacuum_vacuum_cost_delay.  Also, if you just do ANALYZE (rather than VACUUM ANALYZE) the resource usage should be rather modest anyway.
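In session form that would look something like this (the 2ms is just illustrative; use whatever autovacuum_vacuum_cost_delay is set to on your version):

SET vacuum_cost_delay = '2ms';  -- match autovacuum_vacuum_cost_delay so the manual run is throttled the same way
ANALYZE public.rawdata;
ANALYZE public.cleandata;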

Another difference is that autovac will cancel itself if it detects it is blocking something else, while a manual vac/analyze operation will not do that.  Normal operations (DML) don't block against vacuum anyway, only things like index creation or partition maintenance do that.  But if those types of operation are frequent, then doing a manual VACUUM or ANALYZE could indeed gum things up.  Also, if those operations are frequent, it could explain the missing autovac.  If every auto attempt gets cancelled before it completes, then it will never complete.  pg_stat_user_tables doesn't reflect cancelled vacuum or analyze so those will go missing.  (You should see mentions of cancelled autovac operations in the log file though.)

Database restarts will also interrupt vacuums.  So if your database is shut down and restarted regularly (for cold back-ups, or just out of some misguided belief that restarting occasionally is a good practice) and the period between restarts is shorter than how long it would take autovac to run, this could also explain the lack of completed autovacs.  Also, if a table qualifies for both auto vacuum and auto analyze, the vacuum is done first.  So even if auto analyze would be fast by itself, it still won't complete if auto vacuum is slow and never gets to finish.

It is possible to override your vacuum settings on a per-table basis.  So another possible explanation for the missing autovacs is that those two tables have been specifically configured to disable autovacuum on them and only them.  If you use psql, \dt+ (but not regular \dt) will show such customizations.  I'm sure other tools also have ways to detect this, but I don't know what those ways are off the top of my head.
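One check that should work anywhere is to look at the stored per-table options directly, something like:

SELECT c.relname, c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
  AND c.relname IN ('rawdata', 'cleandata');

Any autovacuum_enabled = false (or other autovacuum_* overrides) would show up in reloptions; NULL means no per-table customizations.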

Cheers,

Jeff

Re: Unfortunate Nested Loop + Missing Autovacuum

From: Andrei Lepikhov

On 23/2/2025 23:49, Lincoln Swaine-Moore wrote:
> Thanks for the reply! I tried the analysis on our much shorter staging 
> table and it did change the plan. I haven’t tried it on the production 
> ones because my understanding is that the autovacuum process is gentler 
> with resource consumption and I didn’t want to gum things up in the 
> meantime. But that may be false or avoidable.
> 
> Unfortunately the insertions are not batched—they arrive in a fairly 
> steady stream. Given that, and the fact that we’re talking about the 
> behavior of the daemon, I’m not sure how to provide a reproduction 
> (though would be game if you had pointers). I could in theory make a job 
> analyze these tables on a regular cadence, though I guess that feels 
> like I’d be stepping on autovacuum’s toes.
> 
> Would be happy to read over similar reports if you could steer me toward 
> them.
I know of at least two problematic use cases:
1. Batch insertions of tuples with a steadily growing value in some
column (a timestamp, for example). When selecting "just arrived"
tuples, we can't estimate cardinalities correctly, because those values
lie outside the histogram and MCV statistics until the next analyze.
2. The table is so busy with queries that autovacuum never gets a
chance to process it.

For case No.1, I attempted to extend the 'probe indexes' technique [1].
However, it is applicable only under specific conditions (an analytic
load, for example) and should be designed as an extension;
unfortunately, core Postgres still does not provide the selectivity
hooks that would make that possible.

To resolve case No.2 (and not only that one), I have built a 'plan
freezing' extension [2,3]. Using it, you can find the optimal plan once
and freeze it in the instance's plan cache, choosing which constants
should be treated as parameters. Changes to the table contents,
statistics, or even GUCs will not affect the query plan until it is
"unfrozen".

Also, you may use the pg_hint_plan extension [4] to provide the
Postgres optimiser with hints that help it generate the plan you need.
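For example, something like this (only a sketch; the hint names must
match the aliases used in the query, and pg_hint_plan has to be
installed and loaded) should steer the planner back to the hash join:

/*+ HashJoin(r c) */
SELECT count(*)
FROM rawdata r
JOIN cleandata c ON r.id = c.raw_table_id
WHERE r.sn = 'FOO'
  AND r."timestamp" >= now() - interval '2 weeks' AND r."timestamp" < now()
  AND c.sn = 'FOO'
  AND c."timestamp" >= now() - interval '2 weeks' AND c."timestamp" < now();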

[1] https://danolivo.substack.com/p/probing-indexes-to-survive-data-skew
[2] https://danolivo.substack.com/p/designing-a-prototype-postgres-plan
[3] https://postgrespro.com/docs/enterprise/16/sr-plan
[4] https://github.com/ossc-db/pg_hint_plan

-- 
regards, Andrei Lepikhov



Re: Unfortunate Nested Loop + Missing Autovacuum

From: Lincoln Swaine-Moore

Thanks, Jeff, for the response. I did end up just analyzing the tables manually as a stopgap. Resource consumption was a non-issue, as you predicted (and the plan was corrected, though the estimates are still a little off).

With respect to operations that would block the autovacuum/analyze: no, those shouldn't be running frequently in our case. Ditto re: database restarts; that would be out of my control since it's DigitalOcean's doing, but I don't see any evidence of it. Nor anything amiss in \dt+, unfortunately.

I'll try to figure out if I can get access to the logs to search for cancellations. Do you happen to know what that would look like if I'm grep-ing for it? And do you have any other guesses about possible explanations?

Thanks again for your help.

Lincoln



--
Lincoln Swaine-Moore