Re: Massive table (500M rows) update nightmare - Mailing list pgsql-performance

From: Kevin Kempter
Subject: Re: Massive table (500M rows) update nightmare
Msg-id: 201001071046.24305.kevink@consistentstate.com
In response to: Re: Massive table (500M rows) update nightmare ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List: pgsql-performance
On Thursday 07 January 2010 09:57:59 Kevin Grittner wrote:
> Ludwik Dylag <ldylag@gmail.com> wrote:
> > I would suggest:
> > 1. turn off autovacuum
> > 1a. optionally tune db for better performance for this kind of
> >     operation (can't help here)
> > 2. restart database
> > 3. drop all indexes
> > 4. update
> > 5. vacuum full table
> > 6. create indexes
> > 7. turn on autovacuum
>
> I've only ever attempted something like that with a few tens of
> millions of rows.  I gave up on waiting for the VACUUM FULL step
> after a few days.
>
> If some scheduled down time is acceptable (with "some" being kind of hard
> to estimate accurately), the best bet would be to add the column with
> the USING clause to fill in the value.  (I think that would cause a
> table rewrite; if not, then add something to the ALTER TABLE which
> would.)  My impression is that the OP would rather stretch out the
> implementation than suffer down time, which can certainly be a
> valid call.
>
> If that is the goal, then the real question is whether there's a way
> to tune the incremental updates to speed that phase.  Carlo, what
> version of PostgreSQL is this?  Can you show us the results of an
> EXPLAIN ANALYZE for the run of one iteration of the UPDATE?
> Information on the OS, hardware, PostgreSQL build configuration, and
> the contents of postgresql.conf (excluding all comments) could help
> us spot possible techniques to speed this up.
>
> -Kevin
>



If you can come up with an effective method of tracking inserts, updates, and
deletes, such as a trigger that writes the PK to a separate table whenever a row
in the table is inserted, updated, or deleted, you could do something like this:



1) create new table (no indexes) with the structure you want the table to have
at the end of the process (i.e. the post-altered state) [new_tab]

2) create the insert, update, and delete triggers mentioned above on the existing
table [curr_tab] and write all the PK ids that change into a 3rd table
[changed_keys] (see the trigger sketch after step 5)

3) kick off a process that simply does a select from curr_tab into new_tab and
populates/back-fills the new column as part of the query (see the back-fill
sketch after step 5)

4) let it run as long as it takes

5) once it's complete, do this:

   Create all the indexes on new_tab

   BEGIN;

   -- block further changes to curr_tab while we catch up and swap
   LOCK TABLE curr_tab IN ACCESS EXCLUSIVE MODE;

   -- remove the copies of any rows that changed after the back-fill started
   DELETE from new_tab
       where pk_id in (select distinct pk_id from changed_keys);

   -- re-copy the current versions of those rows (this also picks up new inserts)
   INSERT into new_tab
       select * from curr_tab
       where curr_tab.pk_id in (select distinct pk_id from changed_keys);

   ALTER TABLE curr_tab RENAME to old_tab;

   ALTER TABLE new_tab RENAME to curr_tab;

   COMMIT;
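
As a rough illustration of step 2, here is a minimal sketch of the change-tracking
pieces. It assumes the PK column is named pk_id and is a bigint, and the function
and trigger names are just placeholders, so adjust everything to your actual schema:

   -- holds the PK of every row inserted, updated or deleted while the
   -- back-fill is running
   CREATE TABLE changed_keys (pk_id bigint);

   CREATE OR REPLACE FUNCTION track_changed_keys() RETURNS trigger AS $$
   BEGIN
       IF TG_OP = 'DELETE' THEN
           INSERT INTO changed_keys (pk_id) VALUES (OLD.pk_id);
           RETURN OLD;
       ELSE
           INSERT INTO changed_keys (pk_id) VALUES (NEW.pk_id);
           RETURN NEW;
       END IF;
   END;
   $$ LANGUAGE plpgsql;

   CREATE TRIGGER curr_tab_track_changes
       AFTER INSERT OR UPDATE OR DELETE ON curr_tab
       FOR EACH ROW EXECUTE PROCEDURE track_changed_keys();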



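A similarly hedged sketch of the step-3 back-fill pass; pk_id, col1, col2 and
new_col are stand-ins for the real column list, and the constant 0 stands in for
whatever expression actually populates the new column:

   -- one sequential pass over curr_tab; new_tab has no indexes yet,
   -- so this is mostly straight reading and writing
   INSERT INTO new_tab (pk_id, col1, col2, new_col)
   SELECT pk_id,
          col1,
          col2,
          0    -- replace with the real back-fill expression
   FROM curr_tab;
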
Also, you might want to consider partitioning this table in the process (see the
sketch below)...
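
If you do go that way, a very rough sketch of inheritance-based partitioning (the
mechanism available in the 8.x releases current at the time; the created_date
column and the yearly ranges here are made up purely for the example):

   CREATE TABLE new_tab_2009 (
       CHECK ( created_date >= DATE '2009-01-01'
           AND created_date <  DATE '2010-01-01' )
   ) INHERITS (new_tab);

   CREATE TABLE new_tab_2010 (
       CHECK ( created_date >= DATE '2010-01-01'
           AND created_date <  DATE '2011-01-01' )
   ) INHERITS (new_tab);

   -- with constraint_exclusion enabled, queries that filter on created_date
   -- only touch the children whose CHECK constraints can match
   SET constraint_exclusion = on;

You'd also have to route inserts to the correct child, either by inserting into
the children directly or with a BEFORE INSERT trigger on the parent.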



Once you're confident you no longer need the old table [old_tab], you can drop it:
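
   DROP TABLE old_tab;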


