Thread: Degraded performance during table rewrite
Hi all,
I am attempting a series of table rewrites to shrink bloated tables, in the absence of pg_repack and VACUUM FULL (neither is an option for us). The database is growing fast and has significant bloat in both heaps and indexes, so index rebuilds alone won't cut it. We found that table inheritance can be used to achieve this rather nicely. We are running PG 9.6.18.
We are setting up the inheritance as follows:
BEGIN;
ALTER TABLE #{table} RENAME TO #{table}_old;
CREATE TABLE #{table} (LIKE #{table}_old INCLUDING ALL);
ALTER TABLE #{table}_old INHERIT #{table};
ALTER SEQUENCE #{table}_id_seq OWNED BY #{table}.id;
COMMIT;
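As a sanity check that the inheritance is wired up correctly (illustrative only, assuming the same setup has been applied to the towns repro table defined further down, so that towns_old inherits from towns): reads through the parent still see every existing row, while ONLY restricts the scan to the new, empty table.
SELECT count(*) FROM towns;       -- parent plus inherited towns_old: all existing rows
SELECT count(*) FROM ONLY towns;  -- just the new table: initially zero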
Then the migration itself runs as follows (each batch in its own transaction, looping through records and sleeping for a bit between batches):
WITH del AS (
  DELETE FROM #{old_table}
  WHERE id IN (
    SELECT id
    FROM #{old_table}
    WHERE id > #{max_deleted_id} -- This is the max deleted from the previous batch, we grab it programmatically.
    ORDER BY id ASC
    LIMIT #{batch_size}
  )
  RETURNING *
)
INSERT INTO #{table}
SELECT * FROM del
RETURNING id
For instance, the batch_size can be 10_000, and the code sleeps programmatically for 200ms between batches (this is done in a Ruby script).
"max_deleted_id" is passed to each run from the previous one, which speeds up the inner SELECT query.
This works very well. However, I noticed two surprising things:
1. The performance of the delete-and-insert degrades substantially as the script runs. For instance, in one run it climbs from an average of 150ms per batch to 700ms per batch.
2. The EXPLAIN itself takes a while to run on a sample batch. For one table, the EXPLAIN alone took four seconds.
To try and reproduce this locally, I used the following dummy table:
create table towns (id serial primary key, code text, article text, name text, department text);
insert into towns (
code, article, name, department
)
select
left(md5(i::text), 10),
md5(random()::text),
md5(random()::text),
left(md5(random()::text), 4)
from generate_series(1, 100000000) s(i);
This starts at about 150ms per batch and climbs to 700ms per batch. A vacuum of the old table brings it back down to 150ms, but I don't understand why, because we structure the query to skip over all previously deleted rows. There is an old thread in which Tom Lane mentions that the planner might itself be walking that primary-key index. Is that applicable here? And is there anything we can do besides more aggressive and continued vacuuming of the old table (or a change in autovacuum settings)? Ideally, we want to run this overnight without much supervision.
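As an aside for anyone reproducing this: the planning overhead shows up even with a plain EXPLAIN of the inner SELECT (nothing is executed), e.g. with \timing on in psql. A sketch against the towns repro table; the literal id is just a stand-in for max_deleted_id:
EXPLAIN
SELECT id
FROM towns_old
WHERE id > 12345678  -- stand-in for max_deleted_id
ORDER BY id ASC
LIMIT 10000;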
On Fri, Jul 3, 2020 at 4:24 PM Mohamed Wael Khobalatte <mkhobalatte@grubhub.com> wrote:
I just ran a comparison between our version and PG 12. To make things easier to pinpoint, I did the following:
- Increased the WAL size (this was never an issue anyway) and spread out checkpointing.
- Disabled autovacuum on both the old table and the new one (per table; see the sketch below).
I then ran the migration on both PG 9.6 and PG 12. The 9.6 run still climbs from 150ms to 700ms per batch, while PG 12 stays stable at 150ms per batch, so it seems to me like *something* improved between the versions, but I am not sure what.
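The per-table autovacuum change was along these lines (a sketch; the exact statements aren't shown above, and the table names are from the towns repro):
-- Sketch: disable autovacuum just for the two tables involved in the rewrite.
ALTER TABLE towns_old SET (autovacuum_enabled = false);
ALTER TABLE towns SET (autovacuum_enabled = false);
-- And to undo it afterwards:
-- ALTER TABLE towns_old RESET (autovacuum_enabled);
-- ALTER TABLE towns RESET (autovacuum_enabled);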
Mohamed Wael Khobalatte <mkhobalatte@grubhub.com> writes:
> This spends 150ms per batch, which climbs to 700ms per batch. A vacuum of
> the old table lowers is back to 150ms, but I don't understand why, because
> we structure the query to jump over all previously dead rows. There is an
> old thread in which Tom Lane mentions that the planner might itself be
> walking that primary index. Is this applicable here?

Yeah, given that the slowdown seems to be in the planner, and given your further observation that v12 is better, I'd say that this is an issue with get_actual_variable_range. That's going to be invoked to try to determine the selectivity of the "WHERE id > #{max_deleted_id}" clause, if the constant is past the last value in the histogram for the id column.

The improvement you see in v12 actually came in in v11, and I think I'll just quote the commit log:

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL_11_BR [3ca930fc3] 2017-09-07 19:41:51 -0400

Improve performance of get_actual_variable_range with recently-dead tuples.

In commit fccebe421, we hacked get_actual_variable_range() to scan the index with SnapshotDirty, so that if there are many uncommitted tuples at the end of the index range, it wouldn't laboriously scan through all of them looking for a live value to return. However, that didn't fix it for the case of many recently-dead tuples at the end of the index; SnapshotDirty recognizes those as committed dead and so we're back to the same problem.

To improve the situation, invent a "SnapshotNonVacuumable" snapshot type and use that instead. The reason this helps is that, if the snapshot rejects a given index entry, we know that the indexscan will mark that index entry as killed. This means the next get_actual_variable_range() scan will proceed past that entry without visiting the heap, making the scan a lot faster. We may end up accepting a recently-dead tuple as being the estimated extremal value, but that doesn't seem much worse than the compromise we made before to accept not-yet-committed extremal values.

The cost of the scan is still proportional to the number of dead index entries at the end of the range, so in the interval after a mass delete but before VACUUM's cleaned up the mess, it's still possible for get_actual_variable_range() to take a noticeable amount of time, if you've got enough such dead entries. But the constant factor is much much better than before, since all we need to do with each index entry is test its "killed" bit.

We chose to back-patch commit fccebe421 at the time, but I'm hesitant to do so here, because this form of the problem seems to affect many fewer people. Also, even when it happens, it's less bad than the case fixed by commit fccebe421 because we don't get the contention effects from expensive TransactionIdIsInProgress tests.

Dmitriy Sarafannikov, reviewed by Andrey Borodin

Discussion: https://postgr.es/m/05C72CF7-B5F6-4DB9-8A09-5AC897653113@yandex.ru

There are a number of possibilities for working around this in your particular situation, short of an upgrade to v11+. You could try doing a manual VACUUM between deletion steps, but that could fail to fix it if anything else is running concurrently (because the VACUUM might not think it's safe to recycle the recently-dead tuples yet). I think possibly a better approach is to try to avoid the situation wherein estimating "WHERE id > #{max_deleted_id}" requires determining the table's true endpoint id value. For that, the last id value seen in the pg_stats histogram for the id column has to be greater than the max_deleted_id value. So you might find that increasing the deletion batch size (thereby reducing max_deleted_id) does the trick; or you could increase the statistics target for that column, making the histogram larger and hence (probably) making its endpoint higher.

regards, tom lane
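A minimal sketch of the statistics-target workaround described above, using the towns repro table (10000 is the maximum allowed per-column target; an appropriate value depends on the table):
-- Sketch: enlarge the id histogram so its endpoint is (probably) above max_deleted_id.
ALTER TABLE towns_old ALTER COLUMN id SET STATISTICS 10000;
ANALYZE towns_old (id);  -- rebuild statistics for just that column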
On 7/3/20 3:24 PM, Mohamed Wael Khobalatte wrote:
Is the problem really about writing the new table, or reading from the old table?
--
Angular momentum makes the world go 'round.
On Fri, Jul 3, 2020 at 5:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hi Tom, thanks for your response.
I did increase the statistics target to 10_000 in my local testing, and that didn't do the trick; the time per batch still increases. A regular VACUUM ANALYZE does bring it back down, though, as we established. The issue is that some of these tables are very large (north of 400GB for some), so such vacuums will take a while, and I am not sure the gain is worth running them frequently only to see the problem come back again (and to make matters worse, there *will* be concurrent activity, making the vacuums less likely to do the job).
One curious thing I noticed is that in my testing with autovacuum disabled, the batch time sometimes comes back down to 150ms on its own, so something must be making the planner's life easier, but I can't tell what.
Do you happen to know if there is an upper limit to how much time the planner is willing to spend on this? Since I've seen it climb to four seconds, I suppose not, but I am not sure. It would help us estimate the timing of these runs better.
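One way to check whether anything cleaned up the old table in the meantime (a sketch against the towns repro table; not something from the thread):
-- Sketch: dead-tuple counts and last (auto)vacuum times for the old table.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'towns_old';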
On Fri, Jul 3, 2020 at 10:16 PM Mohamed Wael Khobalatte <mkhobalatte@grubhub.com> wrote:
Forgot to address one data point from your response, Tom, which is increasing the batch size. That will depend on the size of the tuples. We might be able to afford it for some tables, but the wider the records, the more trouble we've seen with WAL build-up and replication/HA lag. The plan is to find a sweet spot, but the planner issue is trickier.
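In case it is useful for tuning the batch size, replication lag on 9.6 can be watched with something like this (a sketch; these are the 9.6 names, renamed to *_lsn and pg_wal_lsn_diff in v10):
-- Sketch: bytes of WAL sent to each standby but not yet replayed.
SELECT application_name, state,
       pg_xlog_location_diff(sent_location, replay_location) AS replay_lag_bytes
FROM pg_stat_replication;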
Mohamed Wael Khobalatte <mkhobalatte@grubhub.com> writes:
> Do you happen to know if there is an upper limit to how much time the
> planner is willing to spend on this?

There is not. (I have considered that idea, but it's not implemented. I'm not sure whether there's still much practical problem given the v11+ behavior, so I'm not in a hurry to complicate things more.)

Another idea that might conceivably be useful to you on 9.6 is to reorganize the bulk deletions so that most of them aren't at the endpoint of the live id range. If you made it so that the extremal values are deleted last, you'd never hit this behavior.

regards, tom lane
Hm, I am not sure I understood your suggestion. We'd still need to find a range of ids to delete, and the ORDER BY id ASC has the nice property that it moves old records first, which helps tables with a lot of activity on recently created tuples (and I suppose ordering in the other direction would suffer from the same problem).
On Fri, Jul 3, 2020 at 11:30 PM Mohamed Wael Khobalatte <mkhobalatte@grubhub.com> wrote:
Tom, I think I understood what you meant after getting some sleep. What I do now is:
- Initially load a lot of ids, say a million.
- Then I slice that array by my preferred batch size, say 10_000.
- Loop through these slices and run an improved query, which looks like this:
WITH del AS (
  DELETE FROM #{old_table}
  WHERE id >= #{first_id_in_slice} AND id <= #{last_id_in_slice}
  RETURNING *
)
INSERT INTO #{table}
SELECT * FROM del
RETURNING id
This not only stays *steady*, but dropping the inner query brings it down to under 100ms per batch! The initial query to find the million ids does slow down a little as we go along (autovacuum will help that one), but it is a far cry from the earlier situation. I think this also preserves the desired correctness behavior. Is this what you had in mind?
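The initial id load itself is just a plain ordered SELECT, roughly like this (a sketch; the slicing into 10_000-id batches happens client-side in the Ruby script, and the starting id is wherever the previous range left off):
-- Sketch: grab the next million live ids in order; 63299753 is just the
-- last id of the previous range, taken from the log below.
SELECT id
FROM towns_old
WHERE id > 63299753
ORDER BY id ASC
LIMIT 1000000;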
Here is sample output I logged of the migration:
Migrated 10000 records from towns_old to towns in 82.92 ms.
Migrated 10000 records from towns_old to towns in 84.89 ms.
Migrated 10000 records from towns_old to towns in 85.92 ms.
Migrated 10000 records from towns_old to towns in 86.57 ms.
Here is a log of the initial query to load ID ranges:
Loaded towns_old ids in 3569.71 ms. Current id range is 63299754 to 64299753.
Loaded towns_old ids in 1990.62 ms. Current id range is 64299754 to 65299753.
Loaded towns_old ids in 2542.46 ms. Current id range is 65299754 to 66299753.
Loaded towns_old ids in 2040.88 ms. Current id range is 66299754 to 67299753.
Loaded towns_old ids in 1907.96 ms. Current id range is 67299754 to 68299753.
Loaded towns_old ids in 2626.74 ms. Current id range is 68299754 to 70435753.
Loaded towns_old ids in 3510.16 ms. Current id range is 70435754 to 71435753.
Loaded towns_old ids in 1841.95 ms. Current id range is 71435754 to 72435753.
Loaded towns_old ids in 1774.52 ms. Current id range is 72435754 to 73435753.
These range-loading times seem acceptable to me, and they were not affected by autovacuum, considering the number of tuples actually requested. Even if we spend an average of 2s here, it's still remarkably better than before: we save an average of roughly 500ms per batch, and at 100 batches per id range, that's 50 seconds saved!
Mohamed Wael Khobalatte <mkhobalatte@grubhub.com> writes:
> Tom, I think I understood what you meant after getting some sleep. What I
> do now is:
> - Initially load a lot of ids, say a million.
> - Then I slice that array by my preferred batch, say a 10000.
> - Loop through these slices and run an improved query, which looks like
> this:
> WITH del AS (
> DELETE FROM #{old_table}
> WHERE id >= #{first_id_in_slice} AND id <= #{last_id_in_slice}
> RETURNING *
> )
> INSERT INTO #{table}
> SELECT * FROM del
> RETURNING id

Right, the problem occurs when you're deleting stuff that's at the end of the id range, and have already deleted stuff that was previously the end of the range. If you start somewhere in the middle and work towards the end, you dodge that problem, even for the very last delete that *is* handling the endmost ids.

regards, tom lane
Amazing, thank you very much for your help.