Thread: Thoughts on how to avoid a massive integer update.
Hi all,
This is a shot in the dark in hopes of finding a magic bullet to fix an issue I have; I can't think of any solution myself.
I have a database with hundreds of terabytes of data, where every table has an integer column referencing a small table. For reasons that are out of my control and that I cannot change, I NEED to update every single row in all these tables, changing the integer value to a different integer.
Since I have to deal with dead space, I can only do a couple of tables at a time, then do a vacuum full after each one.
Another option is to build a new table with the new values, then drop the old one and swap in the new; either way is very time-consuming.
Initial tests suggest this effort will take several months to complete, not to mention cause blocking issues on tables being worked on.
Does anyone have any hackery ideas on how to achieve this in less time? I was looking at possibly converting the integer column type to another that would present the integer differently, like a hex value, but everything still ends up requiring all data to be re-written to disk. In a well-designed database (I didn't design it :) ), I would simply change the data in the referenced table (200 total rows); however, the key being referenced isn't just an arbitrary ID, it's actual 'data', and must be changed.
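A minimal sketch of the type-conversion idea, using the data_table/info_table names from the example later in the thread; any ALTER ... TYPE with a USING expression still forces the full table rewrite described above:

    -- Hypothetical: present the key as hex text instead of an integer.
    -- PostgreSQL must still rewrite every row of data_table to apply
    -- the USING expression, so nothing is saved.
    ALTER TABLE data_table
        ALTER COLUMN info_table_id TYPE text
        USING to_hex(info_table_id);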
Thanks for any thoughts or ideas,
- Brian F
On 5/4/20 2:32 PM, Fehrle, Brian wrote:
> Since I have to deal with dead space, I can only do a couple of tables
> at a time, then do a vacuum full after each one.

Why? A regular vacuum would mark the space as available. More below.

> In a well-designed database (I didn't design it :) ), I would simply
> change the data in the referenced table (200 total rows); however, the
> key being referenced isn't just an arbitrary ID, it's actual 'data',
> and must be changed.

I'm not following above. Could you show an example table relationship?

--
Adrian Klaver
adrian.klaver@aklaver.com
Is the new value computable from the original value with a single function? Could you cover your tables with views transforming the column with said function?
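A minimal sketch of the view idea, assuming a hypothetical IMMUTABLE mapping function and the table/column names given later in the thread:

    -- Hypothetical mapping from old key values to new ones.
    CREATE FUNCTION new_id(old_id integer) RETURNS integer
    LANGUAGE sql IMMUTABLE AS $$
        SELECT CASE old_id WHEN 123 THEN 456 ELSE old_id END;
    $$;

    -- Cover the big table with a view that presents the transformed key;
    -- no rows are rewritten on disk.
    CREATE VIEW data_table_v AS
    SELECT data_table_sid,
           new_id(info_table_id) AS info_table_id
    FROM data_table;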
No? You'll need to divorce each table from that lookup table and any foreign key relationships to avoid all those lookups with every update.
Here's my naive solution: add a column to the small table, filled with a duplicate of the original column. Change the foreign keys of the huge tables to reference the new column in the small table. Update the original values to what you now need them to be. (Not sure how much re-index cost you'll see here, but non-zero I'm guessing.)
Your reports must now of course return the new value via joins to the dictionary(?) table. Views may be your best bet here.
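A sketch of this approach with the table names from later in the thread; the constraint names are assumptions (PostgreSQL's defaults), and the new column needs a UNIQUE constraint before a foreign key can reference it:

    -- Duplicate the original key into a new column on the 200-row table.
    ALTER TABLE info_table ADD COLUMN orig_sid integer;
    UPDATE info_table SET orig_sid = info_table_sid;
    ALTER TABLE info_table
        ADD CONSTRAINT info_table_orig_sid_key UNIQUE (orig_sid);

    -- Repoint the huge table's foreign key at the duplicated column;
    -- this validates the constraint but rewrites no rows in data_table.
    ALTER TABLE data_table DROP CONSTRAINT data_table_info_table_id_fkey;
    ALTER TABLE data_table
        ADD CONSTRAINT data_table_info_table_id_fkey
        FOREIGN KEY (info_table_id) REFERENCES info_table (orig_sid);

    -- The original column is now free to take the new values.
    UPDATE info_table SET info_table_sid = 456 WHERE info_table_sid = 123;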
On 2020-05-04 21:32:56 +0000, Fehrle, Brian wrote:
> I have a database with hundreds of terabytes of data, where every table
> has an integer column referencing a small table. For reasons out of my
> control and cannot change, I NEED to update every single row in all
> these tables, changing the integer value to a different integer.
>
> Since I have to deal with dead space, I can only do a couple tables at a
> time, then do a vacuum full after each one.

Do you have to change all the values at the same time? If you can change them one (or a few) at a time, the tables shouldn't bloat much.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
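A sketch of that batching pattern with the table names from later in the thread (the second key pair is made up): each pass dead-tuples only the rows for one key, and a plain VACUUM between passes lets the next pass reuse the freed space, so bloat stays near the size of one key's rows rather than the whole table:

    -- One key value per pass; plain VACUUM (not FULL) between passes.
    -- VACUUM cannot run inside a transaction block, so run as a script.
    -- (Assumes old and new key ranges don't overlap, or order the
    -- passes carefully so rows aren't remapped twice.)
    UPDATE data_table SET info_table_id = 456 WHERE info_table_id = 123;
    VACUUM data_table;

    UPDATE data_table SET info_table_id = 790 WHERE info_table_id = 124;
    VACUUM data_table;
    -- ...repeat for each of the ~200 key values.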
On 5/4/20, 3:56 PM, "Adrian Klaver" <adrian.klaver@aklaver.com> wrote:

> Why? A regular vacuum would mark the space as available.

A regular vacuum would mark the space as available ***for re-use***; it will not release the space back to the drive as 'unused'. 99% of my tables are old data that will not receive any future inserts or updates, which means that space that is marked ready for 'reuse' will not ever be used.

This means that a 100GB table will be updated, and every row marked as dead and re-created with the newly updated data. This table is now 200GB in size. A vacuum will keep it at 200GB of space used, freeing up the 100GB of dead space as ready-to-reuse. A vacuum full will make it 100GB again.

Since 99% of my tables will never be updated or inserted into again, this means my ~300 terabytes of data would be ~600 terabytes of data on disk. Thus, vacuum full.
> I'm not following above. Could you show an example table relationship?
It’s a simple one-to-many relationship:
*Info_table*
info_table_sid integer
*data_table*
data_table_sid integer,
info_table_id integer references info_table(info_table_sid)
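For clarity, a minimal runnable version of that schema; the PRIMARY KEY on info_table_sid is an assumption implied by the REFERENCES clause:

    CREATE TABLE info_table (
        info_table_sid integer PRIMARY KEY
    );

    CREATE TABLE data_table (
        data_table_sid integer,
        info_table_id  integer REFERENCES info_table (info_table_sid)
    );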
From: "David G. Johnston" <david.g.johnston@gmail.com>
Date: Friday, May 8, 2020 at 11:48 AM
To: "Fehrle, Brian" <bfehrle@comscore.com>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Subject: Re: Thoughts on how to avoid a massive integer update.
On Mon, May 4, 2020 at 2:33 PM Fehrle, Brian <bfehrle@comscore.com> wrote:
> I NEED to update every single row in all these tables, changing the integer value to a different integer.
> Does anyone have any hackery ideas on how to achieve this in less time?
Probably the only solution that would perform computationally faster would take the same amount of time or more to code and debug, and be considerably riskier: basically, shut down PostgreSQL and modify the data files directly to change one integer byte sequence to another. On the positive side, the source code for PostgreSQL is open source and the data, while complex, is structured.
On the point of "vacuum" versus "vacuum full" - I don't know if this is how it would work in reality, but conceptually, if you updated half the table, vacuumed, updated the second half, and vacuumed again, the second batch of updates would reuse the space freed from the first batch, and you'd only increase the disk consumption to 1.5x instead of 2.0x. As you increase the number of batches, the percentage of additional space consumed decreases. Though if you have the space, I'd have to imagine that creating a brand new table and dropping the old one would be the best solution when taken in isolation.
David J.
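A sketch of that conceptual point, assuming data_table_sid can be ranged over; the midpoint value is made up, and each batch's dead tuples are reclaimed by a plain VACUUM for the following batch:

    -- Batch 1: first half, then reclaim the dead tuples for reuse.
    UPDATE data_table SET info_table_id = 456
    WHERE info_table_id = 123 AND data_table_sid <= 50000000;
    VACUUM data_table;

    -- Batch 2: reuses the space freed by batch 1, so the table grows
    -- by roughly half its size instead of doubling.
    UPDATE data_table SET info_table_id = 456
    WHERE info_table_id = 123 AND data_table_sid > 50000000;
    VACUUM data_table;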
Modifying data files is too risky and I wouldn’t be able to get that kind of work approved.
Even with keeping excess space to an additional 50%, that's tons of storage I'd need to order, so either vacuum full or re-creating tables for minimal on-disk usage are my only options.
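A sketch of the rewrite-and-swap option from the original post; the CASE mapping, the lock, and the omitted index/constraint rebuild are stand-ins for whatever the real migration needs:

    BEGIN;
    LOCK TABLE data_table IN ACCESS EXCLUSIVE MODE;

    -- Rewrite once, applying the key mapping, into a fresh table
    -- with zero dead space (no VACUUM FULL needed afterwards).
    CREATE TABLE data_table_new AS
    SELECT data_table_sid,
           CASE info_table_id WHEN 123 THEN 456
                              ELSE info_table_id END AS info_table_id
    FROM data_table;

    -- Recreate indexes and the foreign key on data_table_new here.
    DROP TABLE data_table;
    ALTER TABLE data_table_new RENAME TO data_table;
    COMMIT;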
From: Rob Sargent <robjsargent@gmail.com>
Date: Friday, May 8, 2020 at 11:05 AM
To: "Fehrle, Brian" <bfehrle@comscore.com>
Cc: Adrian Klaver <adrian.klaver@aklaver.com>, "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Subject: Re: Thoughts on how to avoid a massive integer update.
>>> Could you show an example table relationship?
>>
>> It's a simple one-to-many relationship:
>>
>> Info_table:
>>     info_table_sid integer
>>
>> data_table:
>>     data_table_sid integer,
>>     info_table_id integer references info_table(info_table_sid)
> Right, and now you wish to change the values in the referenced table (info_table.info_table_sid), correct?
Correct. If info_table.info_table_sid = 123 and many rows in data_table point to it, the ID needs to be changed to 456 in the info_table, as well as in all the referencing rows in the data_table.
On May 8, 2020, at 1:36 PM, Fehrle, Brian <bfehrle@comscore.com> wrote:
> Correct. If info_table.info_table_sid = 123 and many rows in data_table point to it, the ID needs to be changed to 456 in the info_table, as well as in all the referencing rows in the data_table.

Well as I said, I think you could add a column to info_table (you have to do the second block per orig_id):

    alter table info_table add orig_id int;
    update info_table set orig_id = info_table_sid;
    update info_table set info_table_sid = 456 where info_table_sid = 456;

    alter table data_table drop reference                         -- NOT SQL
    alter table data_table make reference to info_table.orig_id   -- NOT SQL
On Fri, May 8, 2020 at 12:49 PM Rob Sargent <robjsargent@gmail.com> wrote:
> Well as I said, I think you could add a column to info_table
>
>     alter table info_table add orig_id int;
>     update info_table set orig_id = info_table_sid;
>     update info_table set info_table_sid = 456 where info_table_sid = 456;

huh?

>     alter table data_table drop reference NOT SQL
>     alter table data_table make reference to info_table.orig_id NOT SQL

You don't seem to understand the requirement. The data_table integer value must be changed - all you are doing is a convoluted column rename on the table holding the PK half of the relationship.

David J.
My understanding is the keys in the info_table need to change. That causes the very expensive update in the data tables. No?
On Fri, May 8, 2020 at 1:41 PM Rob Sargent <robjsargent@gmail.com> wrote:
> My understanding is the keys in the info_table need to change. That causes the very expensive update in the data tables. No?

The keys in the info_table need to change because their contents are no longer legal to be stored (the OP has not specified, but think of using an integer value of someone's social security number as a key). The FK side of the relationship equality has the same illegal data values problem and needs to be changed too.

David J.
Wow, I couldn't disagree more ;)

On May 8, 2020, at 2:43 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
> The keys in the info_table need to change because their contents are no longer legal to be stored (the OP has not specified, but think of using an integer value of someone's social security number as a key). The FK side of the relationship equality has the same illegal data values problem and needs to be changed too.
On Fri, May 8, 2020 at 1:51 PM Rob Sargent <robjsargent@gmail.com> wrote:
> Wow, I couldn't disagree more ;)

Your agreement or disagreement with the problem statement is immaterial here - the OP has stated what the requirement is, for which I have made a simplistic analogy in order to try and get the point across to you. As the OP has said, it is a poor design - and now it is being corrected. The request is whether there is some way to do so better than the two options the OP already described.

David J.
David’s assessment is correct (and I think we’re all on the same page). The value of the foreign keys that tie the tables together must be changed, and yeah that value _should_ simply be an additional column in the info_table and the foreign key be an arbitrary integer, but since it wasn’t set up that way from the beginning (over a decade ago), this is what I’m stuck with.
Blah.
From: Rob Sargent <robjsargent@gmail.com>
Date: Friday, May 8, 2020 at 3:05 PM
To: "David G. Johnston" <david.g.johnston@gmail.com>
Cc: "Fehrle, Brian" <bfehrle@comscore.com>, "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Subject: Re: Thoughts on how to avoid a massive integer update.
On May 8, 2020, at 2:57 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

> On Fri, May 8, 2020 at 1:51 PM Rob Sargent <robjsargent@gmail.com> wrote:
>> Wow, I couldn't disagree more ;)
>
> Your agreement or disagreement with the problem statement is immaterial here - the OP has stated what the requirement is, for which I have made a simplistic analogy in order to try and get the point across to you. As the OP has said, it is a poor design - and now it is being corrected. The request is whether there is some way to do so better than the two options the OP already described.
>
> David J.
Sorry, I wasn't disagreeing with the problem statement. The OP did say the "info.id" needed to change from 123 to 456. With the current foreign key alignment that is very expensive; I think we're all in agreement there. Pushing "456" back out to the data table I see as a perpetuation of the problem. I didn't sense that the OP felt it necessary to continue in the current mode as a requirement. If so, my mistake.