From: Core Studios Inc.
Subject: Debugging sudden increase in planning CPU time (possibly due to get_actual_variable_range)
Msg-id: b0f47540-2915-4537-90a4-4896183bb027@gmail.com
List: pgsql-hackers

Hello hackers,

We have a hot table (1.4k QPS) with 150M records, 55 GB in size. Recently, we noticed a sudden spike in CPU utilization (from 20% to 80%) right after an ANALYZE operation finished on that table:

automatic analyze of table "cars" system usage: CPU: user: 1.50 s, system: 0.55 s, elapsed: 25.47 s

The CPU spike correlated perfectly with an increase in `index_rows_read` and `index_rows_fetched` (as reported by DataDog) for a b-tree index on a timestamp column, and we narrowed it to a specific query. However, we could not find any query plans that used the index, no matter how many different ways we tried. The query is the following:

explain (buffers,analyze) SELECT
  "cars".*
FROM
  "cars"
WHERE
  ("cars"."user_id" = 'xxx')
  AND ("cars"."brand_id" in ('xxx'))
  AND (replacement IS NULL)
  AND ("cars"."status" = 'active')
  AND ("cars"."sold_at" > NOW())
  AND ("cars"."bought_at" > NOW())
  AND (timeout = 0
    OR EXTRACT(EPOCH
    FROM (NOW() - GREATEST(updated_at,sold_at))) <= timeout);
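
For context, I believe the DataDog metrics above (index_rows_read / index_rows_fetched) map to the cumulative idx_tup_read / idx_tup_fetch counters in pg_stat_user_indexes, so the per-index numbers can also be checked directly with something like:

-- per-index counters for the table since the last stats reset;
-- the spike we saw was on the index over the timestamp column
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'cars';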

The index is a b-tree index on bought_at (timestamp without time zone), named `index_cars_on_bought_at`. The query plan looks like this:

------------------------------------------------------------------------------------------------------------
 Index Scan using index_cars_on_brand_id_include_sold_at_bought_at on cars  (cost=0.70..7.58 rows=1 width=508) (actual time=0.046..0.046 rows=1 loops=1)
   Index Cond: ((brand_id)::text = 'xxx'::text)
   Filter: ((replacement IS NULL) AND ((user_id)::text = 'xxx'::text) AND ((status)::text = 'active'::text) AND (sold_at > now()) AND (bought_at > now()) AND ((timeout = 0) OR (date_part('epoch'::text, (now() - (GREATEST(updated_at, sold_at))::timestamp with time zone)) <= (timeout)::double precision)))
   Rows Removed by Filter: 4
   Buffers: shared hit=10
 Planning:
   Buffers: shared hit=6311 dirtied=4
 Planning Time: 9.989 ms
 Execution Time: 0.073 ms
(9 rows)

The `Planning: Buffers: shared hit=6311` and `Planning Time: 9.989 ms` lines stood out to me. If we remove the `bought_at > NOW()` predicate, the plan looks like this:

  Index Scan using index_cars_on_brand_id_include_sold_at_bought_at on cars  (cost=0.70..7.58 rows=1 width=508) (actual time=0.046..0.046 rows=1 loops=1)
   Index Cond: ((brand_id)::text = 'xxx'::text)
   Filter: ((replacement IS NULL) AND ((user_id)::text = 'xxx'::text) AND ((status)::text = 'active'::text) AND (sold_at > now()) AND ((timeout = 0) OR (date_part('epoch'::text, (now() - (GREATEST(updated_at, sold_at))::timestamp with time zone)) <= (timeout)::double precision)))
   Rows Removed by Filter: 4
   Buffers: shared hit=10
 Planning Time: 0.107 ms
 Execution Time: 0.055 ms
(7 rows)

This makes me hypothesize that the increased planning time comes from the planner scanning through `index_cars_on_bought_at`. Eventually I ran into a similar report[1] which mentions get_actual_variable_range() as the offender. Indeed, after I performed a VACUUM on the table, planning time and CPU utilization reverted to normal levels. Unfortunately, since this is a managed database, we don't have access to attach to the machine and grab a CPU profile.
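
Lacking a profiler, the closest check I can think of is a stripped-down statement that only references bought_at; if the hypothesis is right, a similar "Planning: Buffers" line should show up even for this (untested sketch):

-- the range predicate alone should make the planner estimate its selectivity,
-- which is where it may probe the index for the column's actual extreme value;
-- LIMIT 1 keeps the execution side cheap
EXPLAIN (BUFFERS, ANALYZE)
SELECT 1 FROM cars WHERE bought_at > NOW() LIMIT 1;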

Periodically (every 30 min) we delete all rows from the table for which `bought_at > NOW()`, in batches of 50k. This means we mostly delete from the beginning of the index, which (if I understand correctly) would cause problems when the planner tries to compute the `min` of the `bought_at` column, which it does if `NOW()` happens to be at the exact first point of the histogram. Is my understanding correct?
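
If that reasoning is right, it should be visible in the statistics that ANALYZE collected: the interesting question is where NOW() falls relative to the recorded histogram boundaries for the column. A sketch of how one could check:

-- histogram bucket boundaries recorded by the last ANALYZE for cars.bought_at
SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'cars' AND attname = 'bought_at';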

I know this pattern creates dead tuples that might accumulate and cause slowdowns due to table bloat. However, given that we have been doing this for the past four years and never encountered a similar problem, I'm also looking to understand what might have changed _now_ to trigger this (assuming this is actually the issue). Table size and query volume did not change significantly, apart from the usual gradual growth of the past years.
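
For reference, the dead-tuple bookkeeping and vacuum/analyze history for the table can be checked with something along these lines (whether the dead tuples actually sit at one end of the bought_at index is not visible here, so that part remains a guess):

-- live/dead tuple estimates and the last (auto)vacuum / (auto)analyze times
SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'cars';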

Thanks in advance!

[1] https://www.postgresql.org/message-id/flat/CABC5ruQQz6xzS%2B%2BxKaCdVSyb_F%3Do4OAOdbsxYM7-jmCggWGaUQ%40mail.gmail.com

P.S. The instance has 96 vCPUs, runs on GCP Cloud SQL and is on Postgres v13.20.
