Thread: Updating a very large table

Updating a very large table

From
Rafael Domiciano
Date:
Hello people,

I have a large table that needs an update on every tuple; the table (table1) has about 8 million tuples, and I had to add a new column (new_column).
Table1 is one of the most accessed tables in my cluster, so I can't do a direct update because almost everyone ends up waiting.
Doing the update at night was unsuccessful; in the morning it was still running and we had to kill it.

So, my question is: what is the best practice (how do you handle a situation like this) for updating a new column in every tuple of a very large table?

P.S.: Some people may think this is a design problem, but it isn't; new_column has to belong to table1 in this specific case.

Thanks for any response,

Rafael Domiciano

Re: Updating a very large table

From
"Kevin Grittner"
Date:
Rafael Domiciano <rafael.domiciano@gmail.com> wrote:

> I have a large table that needs an update on every tuple; the table
> (table1) has about 8 million tuples, and I had to add a new column
> (new_column).  Table1 is one of the most accessed tables in my
> cluster, so I can't do a direct update because almost everyone ends
> up waiting.  Doing the update at night was unsuccessful; in the
> morning it was still running and we had to kill it.

Doing that with an ALTER TABLE against only 8 million rows should
probably not have taken all night (depending, of course, on a lot of
details you haven't provided).  Killing it after running for all those
hours will have bloated your tables and indexes.  (Perhaps they were
already bloated, in which case this will have made it worse.)

What version of PostgreSQL is this, on what OS?

What does a VACUUM ANALYZE VERBOSE on this table show as output?

What are the last few lines of VACUUM ANALYZE VERBOSE on the whole
database?

If you can show us the table description and how you tried to modify
it, that would help.

-Kevin

Re: Updating a very large table

From
Rafael Domiciano
Date:
Hello Kevin, thanks for the response,

Doing the alter table to add the new column was fast: ALTER TABLE table1 ADD COLUMN new_column date;

The problem is that I have to update this column, and the values are going to be derived from 2 other columns of table1, something like this:

update table1
set new_column = (date)
where
  new_column is null;

Postgres Version: 8.3.6
Os.: Fedora Core 9
4 Gb Ram


On Thu, Apr 23, 2009 at 2:29 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
Rafael Domiciano <rafael.domiciano@gmail.com> wrote:

> I have a large table that needs an update on every tuple; the table
> (table1) has about 8 million tuples, and I had to add a new column
> (new_column).  Table1 is one of the most accessed tables in my
> cluster, so I can't do a direct update because almost everyone ends
> up waiting.  Doing the update at night was unsuccessful; in the
> morning it was still running and we had to kill it.

Doing that with an ALTER TABLE against only 8 million rows should
probably not have taken all night (depending, of course, on a lot of
details you haven't provided).  Killing it after running for all those
hours will have bloated your tables and indexes.  (Perhaps they were
already bloated, in which case this will have made it worse.)

What version of PostgreSQL is this, on what OS?

What does a VACUUM ANALYZE VERBOSE on this table show as output?

INFO:  "table1": found 1572 removable, 8022357 nonremovable row versions in 244388 pages
DETAIL:  7101 dead row versions cannot be removed yet.
There were 1657653 unused item pointers.
62515 pages contain useful free space.
0 pages are entirely empty.
CPU 9.38s/26.74u sec elapsed 27540.53 sec.

What are the last few lines of VACUUM ANALYZE VERBOSE on the whole
database?

INFO:  free space map contains 152886 pages in 907 relations
DETAIL:  A total of 151280 page slots are in use (including overhead).
151280 page slots are required to track all free space.
Current limits are:  153600 page slots, 2500 relations, using 1065 kB.

If you can show us the table description and how you tried to modify
it, that would help.

-Kevin

Re: Updating a very large table

From
"Kevin Grittner"
Date:
Rafael Domiciano <rafael.domiciano@gmail.com> wrote:

> Doing the alter table to add the new column was fast: ALTER TABLE
> table1 ADD COLUMN new_column date;

So far, so good.

> The problem is that I have to update this column, and the values
> are going to be derived from 2 other columns of table1, something
> like this:
>
> update table1
> set new_column = (date)
> where
>   new_column is null;

You're probably going to want to do that in small chunks (I would try
to avoid updating more than about 10,000 rows per transaction.)
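
For instance, a minimal sketch of one such chunk (this assumes table1
has a single-column primary key named id, and uses col_a/col_b as
stand-ins for whatever the real calculation is; adjust to your schema):

  -- one chunk; run it repeatedly, each run as its own transaction,
  -- until it reports UPDATE 0
  UPDATE table1
     SET new_column = coalesce(col_a, col_b)
   WHERE new_column IS NULL
     AND id IN (SELECT id FROM table1
                 WHERE new_column IS NULL
                 LIMIT 10000);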

> Postgres Version: 8.3.6

You should update to the latest 8.3 bug-fix version, if you can.  No
conversion needed; just stop the server running the old software and
start it with the new.

> Os.: Fedora Core 9
> 4 Gb Ram

I assume that you've tuned PostgreSQL, but if my other suggestions
don't help, please post again with the non-commented lines of the
postgresql.conf file, and actual table specifications and query text,
along with an EXPLAIN of the query.

> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>> What does a VACUUM ANALYZE VERBOSE on this table show as output?
>
> INFO:  "table1": found 1572 removable, 8022357 nonremovable row
> versions in 244388 pages
> DETAIL:  7101 dead row versions cannot be removed yet.
> There were 1657653 unused item pointers.
> 62515 pages contain useful free space.
> 0 pages are entirely empty.
> CPU 9.38s/26.74u sec elapsed 27540.53 sec.

That could be better.  You might want to schedule an overnight CLUSTER
of the table (followed by an ANALYZE) before attempting the update.

You don't show any index or toast-table information.  Was there none?

>> What are the last few lines of VACUUM ANALYZE VERBOSE on the whole
>> database?
>
> INFO:  free space map contains 152886 pages in 907 relations
> DETAIL:  A total of 151280 page slots are in use (including overhead).
> 151280 page slots are required to track all free space.
> Current limits are:  153600 page slots, 2500 relations, using 1065 kB.

That looks OK, although you're right at the edge.

-Kevin

Re: Updating a very large table

From
Rafael Domiciano
Date:


On Thu, Apr 23, 2009 at 4:06 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
Rafael Domiciano <rafael.domiciano@gmail.com> wrote:

> Doing the alter table to add the new column was fast: ALTER TABLE
> table1 ADD COLUMN new_column date;

So far, so good.

> The problem is that I have to update this column, and the values
> are going to be derived from 2 other columns of table1, something
> like this:
>
> update table1
> set new_column = (date)
> where
>   new_column is null;

You're probably going to want to do that in small chunks (I would try
to avoid updating more than about 10,000 rows per transaction.)

Yeah, we are trying this. Thanks.

> Postgres Version: 8.3.6

You should update to the latest 8.3 bug-fix version, if you can.  No
conversion needed; just stop the server running the old software and
start it with the new.

> Os.: Fedora Core 9
> 4 Gb Ram

I assume that you've tuned PostgreSQL, but if my other suggestions
don't help, please post again with the non-commented lines of the
postgresql.conf file, and actual table specifications and query text,
along with an EXPLAIN of the query.

Yes, I've tuned the configuration to fit the hardware, and it's OK.

> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>> What does a VACUUM ANALYZE VERBOSE on this table show as output?
>
> INFO:  "table1": found 1572 removable, 8022357 nonremovable row
> versions in 244388 pages
> DETAIL:  7101 dead row versions cannot be removed yet.
> There were 1657653 unused item pointers.
> 62515 pages contain useful free space.
> 0 pages are entirely empty.
> CPU 9.38s/26.74u sec elapsed 27540.53 sec.

That could be better.  You might want to schedule an overnight CLUSTER
of the table (followed by an ANALYZE) before attempting the update.

You don't show any index or toast-table information.  Was there none?

this table has about 15 indexes...

How good is it to cluster the table? Are there any criteria for clustering a table? How can I do it?

>> What are the last few lines of VACUUM ANALYZE VERBOSE on the whole
>> database?
>
> INFO:  free space map contains 152886 pages in 907 relations
> DETAIL:  A total of 151280 page slots are in use (including overhead).
> 151280 page slots are required to track all free space.
> Current limits are:  153600 page slots, 2500 relations, using 1065 kB.

That looks OK, although you're right at the edge.

-Kevin

Re: Updating a very large table

From
Michael Monnerie
Date:
On Friday 24 April 2009 Rafael Domiciano wrote:
> this table has about 15 indexes...
>
> How good is it to cluster the table? Are there any criteria for
> clustering a table? How can I do it?

http://www.postgresql.org/docs/8.3/interactive/sql-cluster.html
CLUSTER tablename [ USING indexname ]
where indexname is the index after which you want the data to be sorted.

Postgres then rewrites the whole table, which creates a new file on the
disk with the table entries written in the order of the index you chose.
Don't forget the ANALYZE afterwards.
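
For example (the index name here is just a hypothetical placeholder;
pick whichever of your indexes matches the way the table is usually
read):

  -- rewrites table1 in the physical order of the chosen index;
  -- note that CLUSTER takes an exclusive lock for the whole run
  CLUSTER table1 USING table1_some_idx;
  ANALYZE table1;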

Question: This will not rewrite the indices for that table, right? Then
a REINDEX could be interesting too.

mfg zmi
--
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0660 / 415 65 31                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net                  Key-ID: 1C1209B4


Re: Updating a very large table

From
Scott Marlowe
Date:
On Thu, Apr 23, 2009 at 11:21 PM, Michael Monnerie
<michael.monnerie@is.it-management.at> wrote:
> On Friday 24 April 2009 Rafael Domiciano wrote:
>> this table has about 15 indexes...
>>
>> How good is it to cluster the table? Are there any criteria for
>> clustering a table? How can I do it?
>
> http://www.postgresql.org/docs/8.3/interactive/sql-cluster.html
> CLUSTER tablename [ USING indexname ]
> where indexname is the index after which you want the data to be sorted.
>
> Postgres then rewrites the whole table, which creates a new file on the
> disk with the table entries written in the order of the index you chose.
> Don't forget the ANALYZE afterwards.

Note that cluster on a table with random ordering can be really slow.
I've found it's much faster to reorder into a temp table, then truncate
the original and refill it with insert into ... select from temptable
order by xyz.  Since either cluster or the truncate-and-reload approach
is going to lock the table against users, you might as well use
whichever is faster for you.
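
Roughly, as a sketch only (some_col is a hypothetical stand-in for
whatever ordering you want, and this needs a maintenance window because
of the locks):

  BEGIN;
  CREATE TEMP TABLE table1_sorted AS
    SELECT * FROM table1 ORDER BY some_col;
  TRUNCATE table1;   -- dropping the indexes here and recreating them
                     -- after the INSERT is usually faster (see below)
  INSERT INTO table1 SELECT * FROM table1_sorted ORDER BY some_col;
  COMMIT;
  ANALYZE table1;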

> Question: This will not rewrite the indices for that table, right? Then
> a REINDEX could be interesting too.

Also, if you're doing the select into thing, you can drop the indexes
and then recreate them.  That's usually also faster on a big table
being reordered.

Re: Updating a very large table

From
"Kevin Grittner"
Date:
Rafael Domiciano <rafael.domiciano@gmail.com> wrote:

> this table has about 15 indexes...

That would tend to make mass updates like this slow.

> How good is it to cluster the table? Are there any criteria for
> clustering a table? How can I do it?

CLUSTER reads through the table in the sequence of an index, which you
specify, and creates a new copy of the table and then replaces the
original table with this copy.  The table then has no bloat and the
data rows will (until you start modifying the table) be in the same
sequence as that index.

You must have room for a second copy of the table in order for this to
succeed.  All indexes, permissions, etc. are set to match the original
table.  The only choice is which index to use -- if there is an index
which is often used to select a number of rows, it is a good candidate
for use in the CLUSTER, since that will minimize disk access.

As has already been pointed out, there are ways to do the same thing
with a sequential pass of the data.  If there is no index which is
often used to select a number of rows, or the CLUSTER is unable to
complete in whatever maintenance window you have, the unordered
approach might be better than CLUSTER.

-Kevin

Re: Updating a very large table

From
"Kevin Grittner"
Date:
Michael Monnerie <michael.monnerie@is.it-management.at> wrote:

> Question: [CLUSTER] will not rewrite the indices for that table,
> right? Then a REINDEX could be interesting too.

CLUSTER always rebuilds all indexes on the table.  There's no point
doing a REINDEX afterward or dropping indexes first.

-Kevin

Re: Updating a very large table

From
Michael Monnerie
Date:
On Friday 24 April 2009 Scott Marlowe wrote:
> Also, if you're doing the select into thing, you can drop the indexes
> then recreate them.  Usually also faster on a big table being
> reordered.
> CLUSTER always rebuilds all indexes on the table.  There's no point
> doing a REINDEX afterward or dropping indexes first.

But once you have done the first CLUSTER, subsequent ones are pretty
quick.  In dbmail (a mail system using postgres as the mail store),
this boosted performance a lot, as the data is ordered on disk the way
it is read, which helps a lot with caching.  And it works
automatically.  I have been running it once a week for some time now;
it takes a fraction of the backup time to CLUSTER most tables (except
some minor tables where order doesn't matter).

mfg zmi
--
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0660 / 415 65 31                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net                  Key-ID: 1C1209B4


Re: Updating a very large table

From
Chris Browne
Date:
rafael.domiciano@gmail.com (Rafael Domiciano) writes:
> Hello Kevin, thanks for the response,
> Doing the alter table to add the new column was fast: ALTER TABLE table1 ADD COLUMN new_column date;
> The problem is that I have to update this column, and the values are
> going to be derived from 2 other columns of table1, something like
> this:
> update table1
> set new_column = (date)
> where
>   new_column is null;
> Postgres Version: 8.3.6
> Os.: Fedora Core 9
> 4 Gb Ram

If you try to do this in one "swell foop," it's going to take hours,
lock anything else that would want to access the table, and bloat the
table, all of which is exactly what you don't want...

I'd suggest doing the updates in more bite-sized pieces, a few
thousand tuples at a time.

Further, to make that efficient, I'd suggest adding an index, at least
temporarily, on some column in the table that's generally unique.  (A
date stamp that *tends* to vary would be plenty good enough; it
doesn't need to be strictly unique.  What's important is that there
shouldn't be many repeated values in the column.)

Thus, the initial set of changes would look something like this...

  alter table table1 add column new_column timestamptz;
  create index concurrently temp_newcol_idx on table1(quasi_unique_column) where (new_column is null);

It'll take a while for that index to be available, but it's not really
necessary to use it until you have a lot of tuples converted to have
new_column set.

Then, run a query like the following:

  update table1 set new_column = [whatever calculation]
  where new_column is null and
        quasi_unique_column in
          (select quasi_unique_column from table1 where new_column is null limit 1000);

This should be repeated until it no longer finds any tuples to "fix."
Once this is complete, the temporary index may be dropped.
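
One possible way to drive that repetition, sketched as a small plpgsql
function (this assumes plpgsql is installed, and uses col_a/col_b as
hypothetical stand-ins for the real calculation):

  -- each call processes one batch and returns the number of rows fixed
  CREATE OR REPLACE FUNCTION backfill_new_column(batch_size integer)
  RETURNS integer AS $$
  DECLARE
    n integer;
  BEGIN
    UPDATE table1
       SET new_column = coalesce(col_a, col_b)   -- your real calculation
     WHERE new_column IS NULL
       AND quasi_unique_column IN
           (SELECT quasi_unique_column
              FROM table1
             WHERE new_column IS NULL
             LIMIT batch_size);
    GET DIAGNOSTICS n = ROW_COUNT;
    RETURN n;
  END;
  $$ LANGUAGE plpgsql;

  -- call this over and over (each call is its own transaction under
  -- autocommit), vacuuming table1 now and then; when it returns 0:
  SELECT backfill_new_column(1000);
  DROP INDEX temp_newcol_idx;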

The number 1000 is somewhat arbitrary:

  - 1 would be bad, as that means you need to do 8 million queries to
    process an 8M tuple table

  - 8000000 would be bad, as that would try to do the whole thing in
    one big batch, taking a long time, locking things, bloating
    things, and consuming a lot of memory

1000 is materially larger than 1, but also materially smaller than
8000000.

Using 10000, instead, would mean more work is done in each
transaction; you might want to try varying counts, and stop increasing
the count when you cease to see improvements due to doing more work in
bulk.  I doubt that there's a material difference between 1000 and
10000.

Make sure that the table is being vacuumed once in a while; that
doesn't need to be continuous, but if you want the table to bloat by
only ~10%, that means you should vacuum once for every 10% of the
table that you update.
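
With ~8 million rows, 10% is roughly 800,000 rows, so with batches of
1000 that works out to running a plain

  VACUUM table1;

after roughly every 800 batches.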
--
(reverse (concatenate 'string "gro.mca" "@" "enworbbc"))
http://www3.sympatico.ca/cbbrowne/advocacy.html
Mary had a little lambda
A sheep she couldn't clone
And every where that lambda went
Her calculus got blown

Re: Updating a very large table

From
"Kevin Grittner"
Date:
Chris Browne <cbbrowne@acm.org> wrote:

> I'd suggest adding an index

The OP said the table had 15 indexes already.  I would guess one of
those could be used.  Perhaps it has a primary key....

>   update table1 set new_column = [whatever calculation]
>   where new_column is null and
>         quasi_unique_column in
>           (select quasi_unique_column from table1
>            where new_column is null limit 1000);

Or, if the primary key (or other unique or quasi-unique existing
index) has multiple columns, this could still be done with:

  update table1 set new_column = [whatever calculation]
  where new_column is null and
        (col1, col2) in
          (select col1, col2 from table1
           where new_column is null limit 1000);

-Kevin

Re: Updating a very large table

From
Ron Mayer
Date:
Kevin Grittner wrote:
> Chris Browne <cbbrowne@acm.org> wrote:
>
>> I'd suggest adding an index
>
> The OP said the table had 15 indexes already.  I would guess one of
> those could be used.  Perhaps it has a primary key....
>
>>   update table1 set new_column = [whatever calculation]
>>   where new_column is null and
>>         quasi_unique_column in
>>           (select quasi_unique_column from table1
>>            where new_column is null limit 1000);
>
> Or, if the primary key (or other unique or quasi-unique existing
> index) has multiple columns, this could still be done with:
>
>   update table1 set new_column = [whatever calculation]
>   where new_column is null and
>         (col1, col2) in
>           (select col1, col2 from table1
>            where new_column is null limit 1000);
>

Would doing something with ctid be even better?
Or does it have some risks I'm missing?  I'm thinking
something like:

fli=# select max(ctid) from table1;
     max
-------------
 (183000,42)
(1 row)

Then

update table1 set new_column=[whatever] where ctid<'(10000,1)';
vacuum;
update table1 set new_column=[whatever] where ctid>'(10000,1)' and ctid<'(20000,1)';
vacuum;
...
update table1 set new_column=[whatever] where ctid>'(180000,1)';
vacuum;

and perhaps a final

update table1 set new_column=[whatever] where new_column is null;

to catch any this might have missed?


Seems this makes it easier to control how much the table will
bloat too -- if I only want it to bloat 5% I divide max(ctid) by 20
for each group size....


Re: Updating a very large table

From
Tom Lane
Date:
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> Would doing something with ctid be even better?
> Or does it have some risks I'm missing?  I'm thinking
> something like:

> update table1 set new_column=[whatever] where ctid<'(10000,1)';
> vacuum;
> update table1 set new_column=[whatever] where ctid>'(10000,1)' and ctid<'(20000,1)';
> vacuum;
> ...
> update table1 set new_column=[whatever] where ctid>'(180000,1)';
> vacuum;

You are forgetting the fact that an UPDATE in itself changes a row's
ctid.
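
A quick way to see that (throwaway demo table, just to illustrate):

  CREATE TEMP TABLE ctid_demo (id int);
  INSERT INTO ctid_demo VALUES (1);
  SELECT ctid FROM ctid_demo;   -- e.g. (0,1)
  UPDATE ctid_demo SET id = 2;
  SELECT ctid FROM ctid_demo;   -- now e.g. (0,2); the updated row has
                                -- moved, so ctid-range batches can miss
                                -- rows or process them twice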

            regards, tom lane