Thread: [GENERAL] efficiently migrating 'old' data from one table to another
I'm just wondering if there's a more efficient way of handling a certain periodic data migration.

We have a pair of tables with this structure:

    table_a__live
        column_1 INT
        column_2 INT
        record_timestamp TIMESTAMP

    table_a__archive
        column_1 INT
        column_2 INT
        record_timestamp TIMESTAMP

periodically, we must migrate items that are 'stale' from `table_a__live` to `table_a__archive`. The entries are copied over to the archive, then deleted.

The staleness is calculated based on age -- so we need to use INTERVAL. the "live" table can have anywhere from 100k to 20MM records.

the primary key on `table_a__live` is a composite of column_1 & column_2.

In order to minimize scanning the table, we opted to hint migrations with a dedicated column:

    ALTER TABLE table_a__live ADD is_migrate BOOLEAN DEFAULT NULL;
    CREATE INDEX idx_table_a__live_migrate ON table_a__live(is_migrate) WHERE is_migrate IS NOT NULL;

so our migration is then based on that `is_migrate` column:

    BEGIN;
    UPDATE table_a__live SET is_migrate = TRUE WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
    INSERT INTO table_a__archive (column_1, column_2, record_timestamp) SELECT column_1, column_2, record_timestamp FROM table_a__live WHERE is_migrate IS TRUE;
    DELETE FROM table_a__live WHERE is_migrate IS TRUE;
    COMMIT;

The inserts & deletes are blazing fast, but the UPDATE is a bit slow from postgres re-writing all the rows.

can anyone suggest a better approach?

I considered copying everything to a tmp table then inserting/deleting based on that table -- but there's a lot of disk-io on that approach too.

fwiw we're on postgres 9.6.1
Re: [GENERAL] efficiently migrating 'old' data from one table to another
From: "btober@computer.org"
----- Original Message -----
> From: "Jonathan Vanasco" <postgres@2xlp.com>
> To: "pgsql-general general" <pgsql-general@postgresql.org>
> Sent: Thursday, January 12, 2017 3:06:14 PM
> Subject: [GENERAL] efficiently migrating 'old' data from one table to another
>
> I'm just wondering if there's a more efficient way of handling a certain
> periodic data migration.
>
> We have a pair of tables with this structure:
>
> table_a__live
>     column_1 INT
>     column_2 INT
>     record_timestamp TIMESTAMP
>
> table_a__archive
>     column_1 INT
>     column_2 INT
>     record_timestamp TIMESTAMP
>
> periodically, we must migrate items that are 'stale' from `table_a__live ` to
> `table_a__archive`. The entries are copied over to the archive, then
> deleted.
>
> The staleness is calculated based on age-- so we need to use INTERVAL. the
> "live" table can have anywhere from 100k to 20MM records.
>
> the primary key on `table_a__live` is a composite of column_1 & column_2,
>
> In order to minimize scanning the table, we opted to hint migrations with a
> dedicated column:
>
> ALTER TABLE table_a__live ADD is_migrate BOOLEAN DEFAULT NULL;
> CREATE INDEX idx_table_a__live_migrate ON table_a__live(is_migrate) WHERE
> is_migrate IS NOT NULL;
>
> so our migration is then based on that `is_migrate` column:
>
> BEGIN;
> UPDATE table_a__live SET is_migrate = TRUE WHERE record_timestamp <
> transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
> INSERT INTO table_a__archive (column_1, column_2, record_timestamp) SELECT
> column_1, column_2, record_timestamp FROM table_a__live WHERE is_migrate IS
> TRUE;
> DELETE FROM table_a__live WHERE is_migrate IS TRUE;
> COMMIT;
>
> The inserts & deletes are blazing fast, but the UPDATE is a bit slow from
> postgres re-writing all the rows.
>
> can anyone suggest a better approach?
>
> I considered copying everything to a tmp table then inserting/deleting based
> on that table -- but there's a lot of disk-io on that approach too.

Review manual section 7.8.2. Data-Modifying Statements in WITH

https://www.postgresql.org/docs/9.6/static/queries-with.html

-- B
On 1/12/2017 12:06 PM, Jonathan Vanasco wrote:
> I'm just wondering if there's a more efficient way of handling a certain periodic data migration.

partition the tables by some date interval such as week (if you do this archiving weekly).

each week, disconnect the oldest partition from the 'active' partition set, and add it to the 'archive' partition set. voila, no inserts, deletes, or updates are done at all, just some metadata operations.

--
john r pierce, recycling bits in santa cruz
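On 9.6 that scheme would use table inheritance rather than the declarative partitioning added in PostgreSQL 10. A minimal sketch of the detach/attach step John describes, borrowing the table names from the original post (the weekly child-table name and date range are invented for illustration, and the base three-column schema without the is_migrate hint is assumed):

    -- one weekly child of the "live" parent, created ahead of time
    CREATE TABLE table_a__live_2017w02 (
        CHECK (record_timestamp >= '2017-01-09'
           AND record_timestamp <  '2017-01-16')
    ) INHERITS (table_a__live);

    -- when that week goes stale, move the whole child with metadata-only
    -- operations: detach it from the live parent, attach it to the archive
    BEGIN;
    ALTER TABLE table_a__live_2017w02 NO INHERIT table_a__live;
    ALTER TABLE table_a__live_2017w02 INHERIT table_a__archive;
    COMMIT;

A real setup would also need something to route new inserts into the current week's child, e.g. a BEFORE INSERT trigger on table_a__live or having the application insert into the child directly.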
On 01/12/2017 12:06 PM, Jonathan Vanasco wrote:
> I'm just wondering if there's a more efficient way of handling a certain periodic data migration.
>
> We have a pair of tables with this structure:
>
> table_a__live
>     column_1 INT
>     column_2 INT
>     record_timestamp TIMESTAMP
>
> table_a__archive
>     column_1 INT
>     column_2 INT
>     record_timestamp TIMESTAMP
>
> periodically, we must migrate items that are 'stale' from `table_a__live ` to `table_a__archive`. The entries are copied over to the archive, then deleted.
>
> The staleness is calculated based on age -- so we need to use INTERVAL. the "live" table can have anywhere from 100k to 20MM records.
>
> the primary key on `table_a__live` is a composite of column_1 & column_2,
>
> In order to minimize scanning the table, we opted to hint migrations with a dedicated column:
>
> ALTER TABLE table_a__live ADD is_migrate BOOLEAN DEFAULT NULL;
> CREATE INDEX idx_table_a__live_migrate ON table_a__live(is_migrate) WHERE is_migrate IS NOT NULL;
>
> so our migration is then based on that `is_migrate` column:
>
> BEGIN;
> UPDATE table_a__live SET is_migrate = TRUE WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
> INSERT INTO table_a__archive (column_1, column_2, record_timestamp) SELECT column_1, column_2, record_timestamp FROM table_a__live WHERE is_migrate IS TRUE;
> DELETE FROM table_a__live WHERE is_migrate IS TRUE;
> COMMIT;
>
> The inserts & deletes are blazing fast, but the UPDATE is a bit slow from postgres re-writing all the rows.

Maybe I am missing something, but why do the UPDATE?

Why not?:

    BEGIN;
    INSERT INTO table_a__archive (column_1, column_2, record_timestamp)
    SELECT column_1, column_2, record_timestamp
    FROM table_a__live
    WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
    DELETE FROM table_a__live
    WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
    COMMIT;

With an index on record_timestamp.

> can anyone suggest a better approach?
>
> I considered copying everything to a tmp table then inserting/deleting based on that table -- but there's a lot of disk-io on that approach too.
>
> fwiw we're on postgres 9.6.1

--
Adrian Klaver
adrian.klaver@aklaver.com
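For reference, the supporting index Adrian mentions would just be a plain btree on the timestamp column; the index name below is illustrative:

    CREATE INDEX idx_table_a__live_record_timestamp
        ON table_a__live (record_timestamp);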
Re: [GENERAL] efficiently migrating 'old' data from one table to another
From: "David G. Johnston"
>> so our migration is then based on that `is_migrate` column:
>>
>> BEGIN;
>> UPDATE table_a__live SET is_migrate = TRUE WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
>> INSERT INTO table_a__archive (column_1, column_2, record_timestamp) SELECT column_1, column_2, record_timestamp FROM table_a__live WHERE is_migrate IS TRUE;
>> DELETE FROM table_a__live WHERE is_migrate IS TRUE;
>> COMMIT;
>>
>> The inserts & deletes are blazing fast, but the UPDATE is a bit slow from postgres re-writing all the rows.
>
> Maybe I am missing something, but why do the UPDATE?
Not to mention doubling the amount of I/O vacuum is going to chew up.
David J.
On Thu, Jan 12, 2017 at 2:19 PM, btober@computer.org
<btober@broadstripe.net> wrote:
>
>
> ----- Original Message -----
>> From: "Jonathan Vanasco" <postgres@2xlp.com>
>> To: "pgsql-general general" <pgsql-general@postgresql.org>
>> Sent: Thursday, January 12, 2017 3:06:14 PM
>> Subject: [GENERAL] efficiently migrating 'old' data from one table to another
>>
>> I'm just wondering if there's a more efficient way of handling a certain
>> periodic data migration.
>>
>> We have a pair of tables with this structure:
>>
>> table_a__live
>>     column_1 INT
>>     column_2 INT
>>     record_timestamp TIMESTAMP
>>
>> table_a__archive
>>     column_1 INT
>>     column_2 INT
>>     record_timestamp TIMESTAMP
>>
>> periodically, we must migrate items that are 'stale' from `table_a__live ` to
>> `table_a__archive`. The entries are copied over to the archive, then
>> deleted.
>>
>> The staleness is calculated based on age-- so we need to use INTERVAL. the
>> "live" table can have anywhere from 100k to 20MM records.
>>
>> the primary key on `table_a__live` is a composite of column_1 & column_2,
>>
>> In order to minimize scanning the table, we opted to hint migrations with a
>> dedicated column:
>>
>> ALTER TABLE table_a__live ADD is_migrate BOOLEAN DEFAULT NULL;
>> CREATE INDEX idx_table_a__live_migrate ON table_a__live(is_migrate) WHERE
>> is_migrate IS NOT NULL;
>>
>> so our migration is then based on that `is_migrate` column:
>>
>> BEGIN;
>> UPDATE table_a__live SET is_migrate = TRUE WHERE record_timestamp <
>> transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
>> INSERT INTO table_a__archive (column_1, column_2, record_timestamp) SELECT
>> column_1, column_2, record_timestamp FROM table_a__live WHERE is_migrate IS
>> TRUE;
>> DELETE FROM table_a__live WHERE is_migrate IS TRUE;
>> COMMIT;
>>
>> The inserts & deletes are blazing fast, but the UPDATE is a bit slow from
>> postgres re-writing all the rows.
>>
>> can anyone suggest a better approach?
>>
>> I considered copying everything to a tmp table then inserting/deleting based
>> on that table -- but there's a lot of disk-io on that approach too.
>
>
> Review manual section 7.8.2. Data-Modifying Statements in WITH
>
>
> https://www.postgresql.org/docs/9.6/static/queries-with.html

this.

with data as (delete from foo where ... returning * ) insert into
foo_backup select * from data;
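Applied to the tables from the original post, that writable-CTE move would look roughly like this (a sketch only; the column list is spelled out instead of * on the assumption that the archive table lacks the is_migrate column):

    WITH moved AS (
        DELETE FROM table_a__live
        WHERE record_timestamp <
              transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month'
        RETURNING column_1, column_2, record_timestamp
    )
    INSERT INTO table_a__archive (column_1, column_2, record_timestamp)
    SELECT column_1, column_2, record_timestamp
    FROM moved;

The DELETE and INSERT run as a single statement, so the is_migrate flag and the extra UPDATE pass are no longer needed.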
Re: [GENERAL] efficiently migrating 'old' data from one table to another
From: Jonathan Vanasco
On Jan 12, 2017, at 5:52 PM, Merlin Moncure wrote:
> On Thu, Jan 12, 2017 at 2:19 PM, btober@computer.org
> <btober@broadstripe.net> wrote:
>>
>> Review manual section 7.8.2. Data-Modifying Statements in WITH
>>
>>
>> https://www.postgresql.org/docs/9.6/static/queries-with.html
>
> this.
>
> with data as (delete from foo where ... returning * ) insert into
> foo_backup select * from data;

Thanks, btober and merlin. that's exactly what i want.

On Jan 12, 2017, at 4:45 PM, Adrian Klaver wrote:
> Maybe I am missing something, but why do the UPDATE?
> Why not?:
> ...
> With an index on record_timestamp.

That's actually the production deployment that we're trying to optimize. Depending on the size of the table (rows, width) it performs "less than great", even with the index on record_timestamp.

The UPDATE actually worked faster in most situations. I honestly don't know why (the only thing that makes sense to me is server-load)... but the update + bool test ended up being (much) faster than the timestamp comparison.
On Jan 12, 2017, Jonathan Vanasco <postgres@2xlp.com> wrote:
>On Jan 12, 2017, at 5:52 PM, Merlin Moncure wrote:
>
>> On Thu, Jan 12, 2017 at 2:19 PM, btober@computer.org
>> <btober@broadstripe.net> wrote:
>>>
>>> Review manual section 7.8.2. Data-Modifying Statements in WITH
>>>
>>>
>>> https://www.postgresql.org/docs/9.6/static/queries-with.html
>>
>> this.
>>
>> with data as (delete from foo where ... returning * ) insert into
>> foo_backup select * from data;
>
>Thanks, btober and merlin. that's exactly what i want.

To help you a little more, I just did this for a set of tables within the last week. :) The heart of the program is this sql:

    my $Chunk_size = 10000;
    my $Interval = 24;
    my $sql = "
        WITH
        keys AS (
            SELECT $pk_column
            FROM $table
            WHERE $time_column < NOW() - '$Interval MONTHS'::INTERVAL
            ORDER BY $pk_column
            LIMIT $Chunk_size ),
        data AS (
            DELETE FROM $table
            WHERE $pk_column <= (SELECT MAX($pk_column) FROM keys)
            RETURNING * )
        INSERT INTO archive_$table SELECT * FROM data;";

That's from Perl, but I suspect you can guess as to what each var should be for your application. You can set $Chunk_size to whatever you want. There is obviously a loop around that which executes until we get 0 rows, then we move on to the next table.

The point of the chunks was to limit the impact on the production tables as we move data out of them. If you don't have that concern and want to do all rows at once then remove the LIMIT and ORDER BY.

HTH,
Kevin
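Adapted to the single table from the original post, the chunked variant can also be sketched in plain plpgsql. The chunk size is arbitrary, and because the primary key is composite the batch is joined back on (column_1, column_2) rather than bounded by MAX(pk) as in Kevin's version. Note that a DO block runs in a single transaction on 9.6, so to actually commit between chunks you would drive a loop like this from the client, the way Kevin does:

    DO $$
    DECLARE
        rows_moved integer;
    BEGIN
        LOOP
            -- pick a batch of stale keys, delete those rows, and copy them
            -- into the archive, all in one statement
            WITH batch AS (
                SELECT column_1, column_2
                FROM table_a__live
                WHERE record_timestamp <
                      transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month'
                LIMIT 10000
            ),
            moved AS (
                DELETE FROM table_a__live l
                USING batch b
                WHERE l.column_1 = b.column_1
                  AND l.column_2 = b.column_2
                RETURNING l.column_1, l.column_2, l.record_timestamp
            )
            INSERT INTO table_a__archive (column_1, column_2, record_timestamp)
            SELECT column_1, column_2, record_timestamp
            FROM moved;

            GET DIAGNOSTICS rows_moved = ROW_COUNT;
            EXIT WHEN rows_moved = 0;
        END LOOP;
    END;
    $$;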
On Fri, Jan 13, 2017 at 12:03 PM, <kbrannen@pwhome.com> wrote:
> On Jan 12, 2017, Jonathan Vanasco <postgres@2xlp.com> wrote:
>>On Jan 12, 2017, at 5:52 PM, Merlin Moncure wrote:
>>
>>> On Thu, Jan 12, 2017 at 2:19 PM, btober@computer.org
>>> <btober@broadstripe.net> wrote:
>>>>
>>>> Review manual section 7.8.2. Data-Modifying Statements in WITH
>>>>
>>>>
>>>> https://www.postgresql.org/docs/9.6/static/queries-with.html
>>>
>>> this.
>>>
>>> with data as (delete from foo where ... returning * ) insert into
>>> foo_backup select * from data;
>>
>>Thanks, btober and merlin. that's exactly what i want.
>
> To help you a little more, I just did this for a set of tables within the
> last week. :) The heart of the program is this sql:
>
> my $Chunk_size = 10000;
> my $Interval = 24;
> my $sql = "
>     WITH
>     keys AS (
>         SELECT $pk_column
>         FROM $table
>         WHERE $time_column < NOW() - '$Interval MONTHS'::INTERVAL
>         ORDER BY $pk_column
>         LIMIT $Chunk_size ),
>     data AS (
>         DELETE FROM $table
>         WHERE $pk_column <= (SELECT MAX($pk_column) FROM keys)
>         RETURNING * )
>     INSERT INTO archive_$table SELECT * FROM data;";
>
> That's from Perl, but I suspect you can guess as to what each var should be for
> your application. You can set $Chunk_size to whatever you want. There is
> obviously a loop around that which executes until we get 0 rows, then we move
> on to the next table.
>
> The point of the chunks was to limit the impact on the production tables
> as we move data out of them. If you don't have that concern and want to do all
> rows at once then remove the LIMIT and ORDER BY.

FYI, although it's likely ok in this instance, directly inserting table names without precaution is considered dubious and should be avoided as a practice. SQL injection is a risk, and your code will fail in the presence of unusual but legal table names containing spaces. For posterity, handling this kind of action inside the database (via plpgsql/EXECUTE) in order to leverage some internal routines, especially quote_ident(), is generally a good idea.

merlin
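A minimal sketch of what Merlin is suggesting, i.e. wrapping the move in a plpgsql function so identifiers are quoted safely via format()/quote_ident() (the function name and signature are invented for illustration, and it assumes the live and archive tables have identical column lists, as in the foo/foo_backup example):

    CREATE OR REPLACE FUNCTION archive_stale_rows(
        live_table    text,
        archive_table text,
        time_column   text,
        max_age       interval
    ) RETURNS bigint AS $$
    DECLARE
        moved bigint;
    BEGIN
        -- format()'s %I applies identifier quoting (same rules as
        -- quote_ident), so odd-but-legal table names are handled and
        -- injection via identifiers is avoided
        EXECUTE format(
            'WITH data AS (
                 DELETE FROM %I
                 WHERE %I < now() - $1
                 RETURNING * )
             INSERT INTO %I SELECT * FROM data',
            live_table, time_column, archive_table)
        USING max_age;
        GET DIAGNOSTICS moved = ROW_COUNT;
        RETURN moved;
    END;
    $$ LANGUAGE plpgsql;

    -- e.g.: SELECT archive_stale_rows('table_a__live', 'table_a__archive',
    --                                 'record_timestamp', INTERVAL '1 month');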