Re: Need to update all my 60 million rows at once without transactional integrity - Mailing list pgsql-general

From Chris Browne
Subject Re: Need to update all my 60 million rows at once without transactional integrity
Date
Msg-id 60tzhs8b92.fsf@dba2.int.libertyrms.com
Whole thread Raw
In response to Need to update all my 60 million rows at once without transactional integrity  (christian_behrens@gmx.net)
Responses Re: Need to update all my 60 million rows at once without transactional integrity  ("Dann Corbit" <DCorbit@connx.com>)
List pgsql-general
christian_behrens@gmx.net writes:
> How can I make a Update of a column in a very large table for all
> rows without using the double amount of disc space and without any
> need for atomic operation?

You may need to redefine the problem.

> I have a very large table with about 60 million rows. I sometimes
> need to do a simple update to ALL rows that resets a status-flag to
> zero.

That seems like a pretty useless update to me...

Why not instead redefine the "status-we-need-to-have" to be zero?
[e.g. - don't try to turn off the Sun; instead, define "brightness" as
the standard ;-)]

> I don't need to have transactional integrity (but of course if the system crashes, there should be no data
corruption.A separate flag in the file system can well save the fact that that bulk update was in progress) for this, I
don'tcare or want a abort or "all or nothing" like SQL mandates. The idea is basically that either this update succeeds
orit succeeds or - there is no "not". It must succeed. It must be tried until it works. If the update is halfway
finished,that's okay. 
>
> If I just do an
> UPDATE table SET flag=0;
> then Pg will make a copy of every row which must be cleaned up by vaccuum. I understand - and don't like during this
specificproblem - that PG is a MVCC database and will keep my old versions of the rows before the update alive. This is
verybad. 
>
> If I do a batched loop like this:
> UPDATE table SET flag=0 where id>=0 and id <200;
> UPDATE table SET flag=0 where id>=200 and id <400;
> UPDATE table SET flag=0 where id>=400 and id <600;
> ...
>
> then PG will seek all over my harddrive I think.

If I *were* going to do this, I'd want to do:

   update table set flag=0 where flag <> 0 and id in (select id from table where flag <> 0 limit 50000);

which will go through the table 50K rows at a time.

> It would be much better if it could just start in the beginning of
> the table and work it's way towards the end. But which sort-criteria
> would satisfy this? I don't think that there is any SQL that does
> something like that.

You couldn't guarantee physical ordering anyways, so that seems
futile.

> Another ideas (and I think it's a quite good idea) would be to
> drop the column and recreate it with a new default value.

The "column drop" would be virtually instantaneous; it would
essentially be hidden from view.

But the addition of the column would rewrite the table, doubling its
size :-(.

> And I don't think I should create new columns (the old one would be
> hidden and their internal column ids lost I think) all the time,
> that might have a limit.

That's correct.  It will eventually cause a problem.

> Is there any other way to go?

Step back to what value you're putting into that column, and why.

Perhaps you never need to change this value.

In the Slony-I replication system, we have something sort of similar;
we collect a table of entries that need to get applied to another
database.  (Table called sl_log_1.)  We *NEVER* alter its contents;
that is not necessary to indicate that data has been replicated (which
is the status of interest).

Instead of updating tuples to mark that they are processed, we instead
store information in another table that indicates up to which point in
time we have finished replicating.  (It's more complex and indirect
than that, but nonetheless, it's still a fair characterization...)

So, perhaps you should be storing a per-day value in the table, and
store, somewhere else, what point you're "done up to."
--
let name="cbbrowne" and tld="linuxdatabases.info" in name ^ "@" ^ tld;;
http://linuxdatabases.info/info/emacs.html
Last night  I played a  blank tape at  full blast. The mime  next door
went nuts.

pgsql-general by date:

Previous
From: "Scott Marlowe"
Date:
Subject: Re: How to modify ENUM datatypes?
Next
From: brian
Date:
Subject: Re: How to modify ENUM datatypes?