Thread: Understanding query planner cpu usage
I'm having an perplexing issue in PG 10.1 wherein deleting a large amount of rows from a table causes query planning time to spike dramatically for a while. This happens with or without autovacuums so vacuuming isn't the issue.
CPU usage during this time spikes as well. I can't determine if the query planning issue is caused by CPU starvation or if the CPU starvation is caused by the query planning. Either way, query planning spikes from 15ms to nearly 7 seconds, and the load average on the machine spikes from around 1.0 to 80 or 90. Despite query planning time increasing by several orders of magnitude, query execution time remains very similar.
The schema and query are similar to what's below, with each table containing about 60 million rows. The load spike happens after deleting approximately 1% of the table data from each of the tables in a transaction. It happens immediately after the transaction commits, not during the transaction itself. The transaction takes about a minute to execute.
It's possible there are other longer lived transactions that may be working on data in these tables, and I think there's probably a clue to the cause here.
Does anyone have any clues or pointers for what I can look for? Is there a way to interrogate the query planner and see what it's doing? (explain analyze only gives the final chosen plan and how long query planning took to execute)
CREATE TABLE data (
data_id bigint NOT NULL PRIMARY KEY
);
CREATE TABLE segment (
data_id bigint NOT NULL,
segment_id bigint NOT NULL PRIMARY KEY);
CREATE TABLE raw (
segment_id bigint NOT NULL,
raw_id bigint NOT NULL PRIMARY KEY
);
CREATE TABLE processed (
segment_id bigint NOT NULL,
raw_id bigint NOT NULL,
processed_id bigint NOT NULL PRIMARY KEY,
magical_id integer
);
All primary keys are auto-incrementing sequences.
data -> segment is 1:many (but typically 1:1)
segment -> raw is 1:many (but almost always 1:1)
raw -> processed is 1:1
select * from processed
left join raw using (raw_id, segment_id)
left join segment using (segment_id)
left join data using (data_id)
where processed.magical_id = [magical_id]
order by processed_id desc;
Lucas Fairchild-Madar <lucas.madar@gmail.com> writes: > I'm having an perplexing issue in PG 10.1 wherein deleting a large amount > of rows from a table causes query planning time to spike dramatically for a > while. This happens with or without autovacuums so vacuuming isn't the > issue. Would the deleted rows happen to be the extremal values of some indexed column that is a join key in the slowly-planned queries? If so, this might be some manifestation of a problem we've seen before: the planner tries to find out the current live max value of the column by scanning the index, and that's really slow if there are a lot of recently-dead entries at the index end, because each of them has to be inspected and then hinted dead. You'd pay that overhead at some point anyway, of course. The cases where it becomes a problem are where the planner inspects these values but *can't* hint them dead, such as when the deletion hasn't committed yet, or they're newly inserted rows that likewise aren't committed. Then each incoming query does the work over again until the transient state is resolved. We've done various things to ameliorate this, but maybe you've found some new way to cause it to be a pain point. Is there anything special about the way you're deleting the rows? Maybe there's a long-running transaction in the background that can still see the deleted rows? Or I might be barking up the wrong tree entirely. But this sure sounds reminiscent of that class of problems. regards, tom lane
On Wed, Feb 21, 2018 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Lucas Fairchild-Madar <lucas.madar@gmail.com> writes:
> I'm having an perplexing issue in PG 10.1 wherein deleting a large amount
> of rows from a table causes query planning time to spike dramatically for a
> while. This happens with or without autovacuums so vacuuming isn't the
> issue.
Would the deleted rows happen to be the extremal values of some indexed
column that is a join key in the slowly-planned queries?
Yes. Basically, this is sort of a moving set of several weeks of data. We have a job that goes through (currently, every 3 hours) and periodically wipes out data from the very beginning. At the same time, more data is continually being appended to the table. All of these rows are immutable. The vast majority of the work happening in the database happens in the last two or three million rows of these tables; the rest is just there for auditing and analytics.
If so, this might be some manifestation of a problem we've seen before:
the planner tries to find out the current live max value of the column
by scanning the index, and that's really slow if there are a lot of
recently-dead entries at the index end, because each of them has to be
inspected and then hinted dead. You'd pay that overhead at some point
anyway, of course. The cases where it becomes a problem are where the
planner inspects these values but *can't* hint them dead, such as when
the deletion hasn't committed yet, or they're newly inserted rows that
likewise aren't committed. Then each incoming query does the work
over again until the transient state is resolved.
We've done various things to ameliorate this, but maybe you've found
some new way to cause it to be a pain point. Is there anything special
about the way you're deleting the rows? Maybe there's a long-running
transaction in the background that can still see the deleted rows?
This sounds like the trigger here. We have a long-running (~20 minute) transaction that's doing analytical rollups on these sets of tables. It's doing them only on very recent data, but I assume having an open transaction means the tuples can't be marked as dead yet because the queries within that transaction could still potentially access them.
I took a look further in this direction and found that our cleanup process and this analytical rollup both happened to run very close to each other twice a day. I moved the cleanup process to run at a time when it should never collide with analytical rollups and we'll see if the load spike happens again. When this happens, the long-running query takes almost an hour so the problem compounds.
It would be helpful to have a way to introspect in to what the query planner is doing. For instance, if I could explain (analyze, verbose, queryplan) the query and the queryplan told me that it spent 5000ms skipping over dead tuples in the index, knowing that would arm me with more information to tune the system better.
Perhaps this is a stupid question, but would creating the index in descending order solve this problem (as a unique index, not a primary key)? What is the planner doing when trying to find the current live max value of the column?
Thanks,
Lucas
Lucas Fairchild-Madar <lucas.madar@gmail.com> writes: > On Wed, Feb 21, 2018 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If so, this might be some manifestation of a problem we've seen before: >> the planner tries to find out the current live max value of the column >> by scanning the index, and that's really slow if there are a lot of >> recently-dead entries at the index end, because each of them has to be >> inspected and then hinted dead. > I took a look further in this direction and found that our cleanup process > and this analytical rollup both happened to run very close to each other > twice a day. I moved the cleanup process to run at a time when it should > never collide with analytical rollups and we'll see if the load spike > happens again. I'm thinking that will fix it for you. > Perhaps this is a stupid question, but would creating the index in > descending order solve this problem (as a unique index, not a primary key)? I think most likely it'd just move the issue to the other end of the index, though you could try. See below. > What is the planner doing when trying to find the current live max value of > the column? It's trying to estimate whether a mergejoin will be able to stop short of reading all the tuples from the other side of the join. (For instance, if you've got 1,3,5 on one side, and 1,4,5,7,8,9,19 on the other, the second input doesn't have to be read past "7" because once we run off the end of the first input, we know we couldn't see any matches later on the second input. So the planner wants to compare the ending key value on each side to the key distribution on the other side, to see what this might save.) Now, that's a unidirectional question for any particular mergejoin plan, so that for any one cost estimate it's only going to need to look at one end of the key range. But I think it will consider merge joins with both sort directions, so that both ends of the key range will get investigated in this way. I might be wrong though; it's late and I've not looked at that code in awhile ... regards, tom lane
On Wed, Feb 21, 2018 at 7:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What is the planner doing when trying to find the current live max value of
> the column?
It's trying to estimate whether a mergejoin will be able to stop short of
reading all the tuples from the other side of the join. (For instance,
if you've got 1,3,5 on one side, and 1,4,5,7,8,9,19 on the other, the
second input doesn't have to be read past "7" because once we run off the
end of the first input, we know we couldn't see any matches later on the
second input. So the planner wants to compare the ending key value on
each side to the key distribution on the other side, to see what this might
save.) Now, that's a unidirectional question for any particular mergejoin
plan, so that for any one cost estimate it's only going to need to look at
one end of the key range. But I think it will consider merge joins with
both sort directions, so that both ends of the key range will get
investigated in this way. I might be wrong though; it's late and I've
not looked at that code in awhile ...
I'm thinking the least painful solution here might be to set enable_mergejoin = false for this particular query, since the rows joined are quite sparse.