Thread: Thoughts on how to avoid a massive integer update.
Hi all,
This is a shot in the dark in hopes of finding a magic bullet to fix an issue I have; I can't think of any solution myself.
I have a database with hundreds of terabytes of data, where every table has an integer column referencing a small table. For reasons that are out of my control and that I cannot change, I NEED to update every single row in all these tables, changing the integer value to a different integer.
Since I have to deal with dead space, I can only do a couple of tables at a time, then do a vacuum full after each one.
Another option is to build a new table with the new values, then drop the old one and swap in the new; either way is very time-consuming.
Initial tests suggest this effort will take several months to complete, not to mention cause blocking issues on tables being worked on.
Does anyone have any hackery ideas on how to achieve this in less time? I was looking at possibly converting the integer column type to another that would present the integer differently, like a hex value, but everything still ends up requiring all data to be re-written to disk. In a well-designed database (I didn't design it :) ), I would simply change the data in the referenced table (200 total rows); however, the key being referenced isn't just an arbitrary ID, it's actual 'data', and must be changed.
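A minimal sketch of the type-conversion idea, using the data_table/info_table names from the example later in the thread; any ALTER ... TYPE with a USING expression still forces the full table rewrite described above:

    -- Hypothetical: present the key as hex text instead of an integer.
    -- PostgreSQL must still rewrite every row of data_table to apply
    -- the USING expression, so nothing is saved.
    ALTER TABLE data_table
        ALTER COLUMN info_table_id TYPE text
        USING to_hex(info_table_id);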
Thanks for any thoughts or ideas,
- Brian F
On 5/4/20 2:32 PM, Fehrle, Brian wrote:
> Since I have to deal with dead space, I can only do a couple of tables
> at a time, then do a vacuum full after each one.

Why? A regular vacuum would mark the space as available. More below.

> In a well-designed database (I didn't design it :) ), I would simply
> change the data in the referenced table (200 total rows); however, the
> key being referenced isn't just an arbitrary ID, it's actual 'data',
> and must be changed.

I'm not following above. Could you show an example table relationship?

--
Adrian Klaver
adrian.klaver@aklaver.com
Is the new value computable from the original value with a single function? Could you cover your tables with views transforming the column with said function?
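A minimal sketch of the view idea, assuming a hypothetical IMMUTABLE mapping function and the table/column names given later in the thread:

    -- Hypothetical mapping from old key values to new ones.
    CREATE FUNCTION new_id(old_id integer) RETURNS integer
    LANGUAGE sql IMMUTABLE AS $$
        SELECT CASE old_id WHEN 123 THEN 456 ELSE old_id END;
    $$;

    -- Cover the big table with a view that presents the transformed key;
    -- no rows are rewritten on disk.
    CREATE VIEW data_table_v AS
    SELECT data_table_sid,
           new_id(info_table_id) AS info_table_id
    FROM data_table;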
No? You'll need to divorce each table from that lookup table and any foreign key relationships to avoid all those lookups with every update.
Here's my naive solution: add a column to the small table, filled with a duplicate of the original column. Change the foreign keys of the huge tables to reference the new column in the small table. Update the original values to what you now need them to be. (Not sure how much re-index cost you'll see here, but non-zero I'm guessing.)
Your reports must now of course return the new value via joins to the dictionary(?) table. Views may be your best bet here.
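A sketch of this approach with the table names from later in the thread; the constraint names are assumptions (PostgreSQL's defaults), and the new column needs a UNIQUE constraint before a foreign key can reference it:

    -- Duplicate the original key into a new column on the 200-row table.
    ALTER TABLE info_table ADD COLUMN orig_sid integer;
    UPDATE info_table SET orig_sid = info_table_sid;
    ALTER TABLE info_table
        ADD CONSTRAINT info_table_orig_sid_key UNIQUE (orig_sid);

    -- Repoint the huge table's foreign key at the duplicated column;
    -- this validates the constraint but rewrites no rows in data_table.
    ALTER TABLE data_table DROP CONSTRAINT data_table_info_table_id_fkey;
    ALTER TABLE data_table
        ADD CONSTRAINT data_table_info_table_id_fkey
        FOREIGN KEY (info_table_id) REFERENCES info_table (orig_sid);

    -- The original column is now free to take the new values.
    UPDATE info_table SET info_table_sid = 456 WHERE info_table_sid = 123;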
On 2020-05-04 21:32:56 +0000, Fehrle, Brian wrote:
> I have a database with hundreds of terabytes of data, where every table
> has an integer column referencing a small table. For reasons out of my
> control and cannot change, I NEED to update every single row in all
> these tables, changing the integer value to a different integer.
>
> Since I have to deal with dead space, I can only do a couple tables at a
> time, then do a vacuum full after each one.

Do you have to change all the values at the same time? If you can change them one (or a few) at a time, the tables shouldn't bloat much.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
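A sketch of that batching pattern with the table names from later in the thread (the second key pair is made up): each pass dead-tuples only the rows for one key, and a plain VACUUM between passes lets the next pass reuse the freed space, so bloat stays near the size of one key's rows rather than the whole table:

    -- One key value per pass; plain VACUUM (not FULL) between passes.
    -- VACUUM cannot run inside a transaction block, so run as a script.
    -- (Assumes old and new key ranges don't overlap, or order the
    -- passes carefully so rows aren't remapped twice.)
    UPDATE data_table SET info_table_id = 456 WHERE info_table_id = 123;
    VACUUM data_table;

    UPDATE data_table SET info_table_id = 790 WHERE info_table_id = 124;
    VACUUM data_table;
    -- ...repeat for each of the ~200 key values.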
On 5/4/20, 3:56 PM, "Adrian Klaver" <adrian.klaver@aklaver.com> wrote:

> Why? A regular vacuum would mark the space as available.

A regular vacuum would mark the space as available ***for re-use***; it will not release the space back to the drive as 'unused'. 99% of my tables are old data that will not receive any future inserts or updates, which means that space that is marked ready for 'reuse' will not ever be used.

This means that a 100GB table will be updated, and every row marked as dead and re-created with the newly updated data. This table is now 200GB in size. A vacuum will keep it at 200GB of space used, freeing up the 100GB of dead space as ready-to-reuse. A vacuum full will make it 100GB again.

Since 99% of my tables will never be updated or inserted into again, this means my ~300 terabytes of data would be ~600 terabytes of data on disk. Thus, vacuum full.
> I'm not following above. Could you show an example table relationship?
It’s a simple one-to-many relationship:
*Info_table*
info_table_sid integer
*data_table*
data_table_sid integer,
info_table_id integer references info_table(info_table_sid)
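For clarity, a minimal runnable version of that schema; the PRIMARY KEY on info_table_sid is an assumption implied by the REFERENCES clause:

    CREATE TABLE info_table (
        info_table_sid integer PRIMARY KEY
    );

    CREATE TABLE data_table (
        data_table_sid integer,
        info_table_id  integer REFERENCES info_table (info_table_sid)
    );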
From: "David G. Johnston" <david.g.johnston@gmail.com>
Date: Friday, May 8, 2020 at 11:48 AM
To: "Fehrle, Brian" <bfehrle@comscore.com>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Subject: Re: Thoughts on how to avoid a massive integer update.
On Mon, May 4, 2020 at 2:33 PM Fehrle, Brian <bfehrle@comscore.com> wrote:
> I NEED to update every single row in all these tables, changing the integer value to a different integer.
> Does anyone have any hackery ideas on how to achieve this in less time?
Probably the only solution that would perform computationally faster would take the same amount of time or more to code and debug, and be considerably riskier: basically, shut down PostgreSQL and modify the data files directly to change one integer byte sequence to another. On the positive side, the source code for PostgreSQL is open source and the data, while complex, is structured.
On the point of "vacuum" versus "vacuum full" - I don't know if this is how it would work in reality, but conceptually, if you updated half the table, vacuumed, updated the second half, and vacuumed again, the second batch of updates would reuse the space freed from the first batch, and you'd only increase the disk consumption to 1.5x instead of 2.0x. As you increase the number of batches, the percentage of additional space consumed decreases. Though if you have the space, I'd have to imagine that creating a brand new table and dropping the old one would be the best solution when taken in isolation.
David J.
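A sketch of that conceptual point, assuming data_table_sid can be ranged over; the midpoint value is made up, and each batch's dead tuples are reclaimed by a plain VACUUM for the following batch:

    -- Batch 1: first half, then reclaim the dead tuples for reuse.
    UPDATE data_table SET info_table_id = 456
    WHERE info_table_id = 123 AND data_table_sid <= 50000000;
    VACUUM data_table;

    -- Batch 2: reuses the space freed by batch 1, so the table grows
    -- by roughly half its size instead of doubling.
    UPDATE data_table SET info_table_id = 456
    WHERE info_table_id = 123 AND data_table_sid > 50000000;
    VACUUM data_table;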
Modifying data files is too risky and I wouldn’t be able to get that kind of work approved.
Even with keeping excess space to an additional 50%, that's tons of storage I'd need to order, so either vacuum full or re-creating tables for minimal on-disk usage are my only options.
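A sketch of the rewrite-and-swap option from the original post; the CASE mapping, the lock, and the omitted index/constraint rebuild are stand-ins for whatever the real migration needs:

    BEGIN;
    LOCK TABLE data_table IN ACCESS EXCLUSIVE MODE;

    -- Rewrite once, applying the key mapping, into a fresh table
    -- with zero dead space (no VACUUM FULL needed afterwards).
    CREATE TABLE data_table_new AS
    SELECT data_table_sid,
           CASE info_table_id WHEN 123 THEN 456
                              ELSE info_table_id END AS info_table_id
    FROM data_table;

    -- Recreate indexes and the foreign key on data_table_new here.
    DROP TABLE data_table;
    ALTER TABLE data_table_new RENAME TO data_table;
    COMMIT;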
From: Rob Sargent <robjsargent@gmail.com>
Date: Friday, May 8, 2020 at 11:05 AM
To: "Fehrle, Brian" <bfehrle@comscore.com>
Cc: Adrian Klaver <adrian.klaver@aklaver.com>, "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Subject: Re: Thoughts on how to avoid a massive integer update.
>>> Could you show an example table relationship?
>>
>> It's a simple one-to-many relationship:
>>
>> Info_table:
>>     info_table_sid integer
>>
>> data_table:
>>     data_table_sid integer,
>>     info_table_id integer references info_table(info_table_sid)
> Right, and now you wish to change the values in the referenced table (info_table.info_table_sid), correct?
Correct. If info_table.info_table_sid = 123 and many rows in data_table point to it, the ID needs to be changed to 456 in the info_table, as well as in all the referencing rows in the data_table.
On May 8, 2020, at 1:36 PM, Fehrle, Brian <bfehrle@comscore.com> wrote:
> Correct. If info_table.info_table_sid = 123 and many rows in data_table point to it, the ID needs to be changed to 456 in the info_table, as well as in all the referencing rows in the data_table.

Well as I said, I think you could add a column to info_table (you have to do the second block per orig_id):

    alter table info_table add orig_id int;
    update info_table set orig_id = info_table_sid;
    update info_table set info_table_sid = 456 where info_table_sid = 456;

    alter table data_table drop reference                         -- NOT SQL
    alter table data_table make reference to info_table.orig_id   -- NOT SQL
On Fri, May 8, 2020 at 12:49 PM Rob Sargent <robjsargent@gmail.com> wrote:
> Well as I said, I think you could add a column to info_table
>
>     alter table info_table add orig_id int;
>     update info_table set orig_id = info_table_sid;
>     update info_table set info_table_sid = 456 where info_table_sid = 456;

huh?

>     alter table data_table drop reference NOT SQL
>     alter table data_table make reference to info_table.orig_id NOT SQL

You don't seem to understand the requirement. The data_table integer value must be changed - all you are doing is a convoluted column rename on the table holding the PK half of the relationship.

David J.
My understanding is the keys in the info_table need to change. That causes the very expensive update in the data tables. No?
On Fri, May 8, 2020 at 1:41 PM Rob Sargent <robjsargent@gmail.com> wrote:
> My understanding is the keys in the info_table need to change. That causes the very expensive update in the data tables. No?

The keys in the info_table need to change because their contents are no longer legal to be stored (the OP has not specified, but think of using an integer value of someone's social security number as a key). The FK side of the relationship equality has the same illegal data values problem and needs to be changed too.

David J.
Wow, I couldn't disagree more ;)

On May 8, 2020, at 2:43 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
> The keys in the info_table need to change because their contents are no longer legal to be stored (the OP has not specified, but think of using an integer value of someone's social security number as a key). The FK side of the relationship equality has the same illegal data values problem and needs to be changed too.
On Fri, May 8, 2020 at 1:51 PM Rob Sargent <robjsargent@gmail.com> wrote:
> Wow, I couldn't disagree more ;)

Your agreement or disagreement with the problem statement is immaterial here - the OP has stated what the requirement is, for which I have made a simplistic analogy in order to try and get the point across to you. As the OP has said, it is a poor design - and now it is being corrected. The request is whether there is some way to do so better than the two options the OP already described.

David J.
David’s assessment is correct (and I think we’re all on the same page). The value of the foreign keys that tie the tables together must be changed, and yeah that value _should_ simply be an additional column in the info_table and the foreign key be an arbitrary integer, but since it wasn’t set up that way from the beginning (over a decade ago), this is what I’m stuck with.
Blah.
From: Rob Sargent <robjsargent@gmail.com>
Date: Friday, May 8, 2020 at 3:05 PM
To: "David G. Johnston" <david.g.johnston@gmail.com>
Cc: "Fehrle, Brian" <bfehrle@comscore.com>, "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Subject: Re: Thoughts on how to avoid a massive integer update.
On May 8, 2020, at 2:57 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

> On Fri, May 8, 2020 at 1:51 PM Rob Sargent <robjsargent@gmail.com> wrote:
>> Wow, I couldn't disagree more ;)
>
> Your agreement or disagreement with the problem statement is immaterial here - the OP has stated what the requirement is, for which I have made a simplistic analogy in order to try and get the point across to you. As the OP has said, it is a poor design - and now it is being corrected. The request is whether there is some way to do so better than the two options the OP already described.
>
> David J.
Sorry, I wasn't disagreeing with the problem statement. The OP did say the "info.id" needed to change from 123 to 456. With the current foreign key alignment that is very expensive; I think we're all in agreement there. Pushing "456" back out to the data table I see as a perpetuation of the problem. I didn't sense that the OP felt it necessary to continue in the current mode as a requirement. If so, my mistake.