Thread: "Healing" a table after massive updates

"Healing" a table after massive updates

From
"Gauthier, Dave"
Date:

Hi:

 

I have a job that loads a large table, but then has to “update” about half the records for various reasons.  My perception of what happens on update for a particular record is...

- A new record is inserted with the updated value(s).

- The old record is marked as obsolete.

- I'm not sure what happens to index entries that pointed to the original (now obsolete) record. Are they updated to point directly at the newly inserted record?  Or is the obsolete record used as a “link” to the newly inserted record?

 

My concern is that the resulting table is not in optimal shape for queries.  I would like to get rid of the obsoleted records (VACUUM, I believe) but also “heal” the table by filling in the holes left where those deleted records used to be (a query will gather more records per disk block read if the on-disk record density is greater).  Is there a way to do this?

 

Thanks

Re: "Healing" a table after massive updates

From
"Gauthier, Dave"
Date:

I might be able to answer my own question...

vacuum FULL (analyze is optional)

 

Correct?

Re: "Healing" a table after massive updates

From
Brad Nicholson
Date:
On Thu, 2008-09-11 at 07:01 -0700, Gauthier, Dave wrote:
> Hi:
>
>
>
> I have a job that loads a large table, but then has to “update” about
> half the records for various reasons.  My perception of what happens
> on update for a particular record is...
>
> - A new record is inserted with the updated value(s).
>
> - The old record is marked as obsolete.
>
> - I'm not sure what happens to index entries that pointed to the original
> (now obsolete) record. Are they updated to point directly at the newly
> inserted record?  Or is the obsolete record used as a “link” to
> the newly inserted record?

Depends on the version of Postgres.  Prior to 8.3, the obsolete tuples
and index entries are simply left dead.  In 8.3, the updates can be HOT
updates, which don't leave dead index entries behind.  8.3 might be a
big help for you; it could remove the need to vacuum this table entirely.
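For illustration (table and column names here are hypothetical, not from the original post): a HOT update is possible in 8.3 when none of the changed columns are indexed and the new row version fits on the same heap page, e.g.

UPDATE bigtab SET status = 'processed' WHERE id = 42;  -- "status" assumed not to be indexed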

> My concern is that the resulting table is not in optimal shape for
> queries.  I would like to get rid of the obsoleted records (VACUUM, I
> believe) but also “heal” the table by filling in the holes
> left where those deleted records used to be (a query will gather more
> records per disk block read if the on-disk record density is greater).
> Is there a way to do this?

Regular VACUUM is the correct operation to get rid of the dead tuples.

If you want to compact the table, you either need to use CLUSTER or
VACUUM FULL + REINDEX.
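For example, with a hypothetical table "bigtab" and index "bigtab_pkey" (placeholder names, not from the original post):

VACUUM ANALYZE bigtab;              -- reclaims dead tuples; no exclusive lock, no compaction
VACUUM FULL bigtab;                 -- compacts the table, but takes an exclusive lock
REINDEX TABLE bigtab;               -- rebuilds the indexes, which VACUUM FULL tends to bloat
CLUSTER bigtab USING bigtab_pkey;   -- 8.3 syntax; 8.2 and earlier use: CLUSTER bigtab_pkey ON bigtab
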
--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.


Re: "Healing" a table after massive updates

From
Alvaro Herrera
Date:
Brad Nicholson wrote:

> If you want to compact the table, you either need to use CLUSTER or
> VACUUM FULL + REINDEX.

Actually those are all pretty slow.  If you can do a no-op ALTER TYPE
that rewrites the entire table, it is a lot faster.  Something like

ALTER TABLE tab ALTER COLUMN col TYPE integer;

This assumes that column "col" on table "tab" already has type integer.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: "Healing" a table after massive updates

From
Bill Moran
Date:
In response to "Gauthier, Dave" <dave.gauthier@intel.com>:

> I might be able to answer my own question...
>
> vacuum FULL (analyze is optional)

CLUSTER _may_ be a better choice, but carefully read the docs regarding
its drawbacks first.  You may want to do some benchmarks to see if it's
really needed before you commit to it as a scheduled operation.
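One simple way to gauge whether the compaction is worth scheduling is to compare the table's on-disk size before and after a one-off run, e.g. (the table name is a placeholder):

SELECT pg_size_pretty(pg_relation_size('bigtab'));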

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023


Re: "Healing" a table after massive updates

From
Alvaro Herrera
Date:
Bill Moran wrote:
> In response to "Gauthier, Dave" <dave.gauthier@intel.com>:
>
> > I might be able to answer my own question...
> >
> > vacuum FULL (analyze is optional)
>
> CLUSTER _may_ be a better choice, but carefully read the docs regarding
> its drawbacks first.  You may want to do some benchmarks to see if it's
> really needed before you commit to it as a scheduled operation.

What drawbacks?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: "Healing" a table after massive updates

From
Bill Moran
Date:
In response to Alvaro Herrera <alvherre@commandprompt.com>:

> Bill Moran wrote:
> > In response to "Gauthier, Dave" <dave.gauthier@intel.com>:
> >
> > > I might be able to answer my own question...
> > >
> > > vacuum FULL (analyze is optional)
> >
> > CLUSTER _may_ be a better choice, but carefully read the docs regarding
> > its drawbacks first.  You may want to do some benchmarks to see if it's
> > really needed before you commit to it as a scheduled operation.
>
> What drawbacks?

There's the whole "there will be two copies of the table on-disk" thing
that could be an issue if it's a large table.

Depending on the version of PG in use, there are warnings about tuple
visibility during CLUSTER.  It seems as if most of these have been
removed for 8.3.

And while this applies to VACUUM FULL as well, it might just be a bad
idea.  If the number of rows inserted isn't a significant increase in
the overall size of the table, he may find that overall performance is
better if he uses plain old VACUUM so that the FSM stays at a stable
size.

Some of this is dependent on the PG version, which the OP neglected to mention.
Other stuff is dependent on what kind of maintenance window he has, which
was also not mentioned.  Other stuff is dependent on various
details of the actual process, which (looking back through the thread)
I may have misunderstood.  I took the post to mean that he was loading
additional data into a pre-existing table, which is a different scenario
than loading up a virgin table.

So, you know, standard disclaimer about YMMV, which it seems that _I_
worded poorly.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

Re: "Healing" a table after massive updates

From
"Scott Marlowe"
Date:
On Thu, Sep 11, 2008 at 8:56 AM, Bill Moran
<wmoran@collaborativefusion.com> wrote:
> In response to Alvaro Herrera <alvherre@commandprompt.com>:
>
>> Bill Moran wrote:
>> > In response to "Gauthier, Dave" <dave.gauthier@intel.com>:
>> >
>> > > I might be able to answer my own question...
>> > >
>> > > vacuum FULL (analyze is optional)
>> >
>> > CLUSTER _may_ be a better choice, but carefully read the docs regarding
>> > its drawbacks first.  You may want to do some benchmarks to see if it's
>> > really needed before you commit to it as a scheduled operation.
>>
>> What drawbacks?
>
> There's the whole "there will be two copies of the table on-disk" thing
> that could be an issue if it's a large table.

I've also found CLUSTER to be pretty slow, even on 8.3.  On a server
that hits 30-40 MB/s write speed for random access during
pgbench, CLUSTER writes out at 1 to 2 MB/s when it runs,
and takes the better part of a day on our biggest table.  vacuumdb -fz
+ reindexdb ran in about 6 hours, which means we could fit it into our
maintenance window.  VACUUM moves a lot more data per second than
CLUSTER.

Re: "Healing" a table after massive updates

From
Alan Hodgson
Date:
On Thursday 11 September 2008, "Gauthier, Dave" <dave.gauthier@intel.com>
wrote:
> I have a job that loads a large table, but then has to "update" about
> half the records for various reasons.  My perception of what happens on
> update for a particular record is...
>
> - A new record is inserted with the updated value(s).
>
> - The old record is marked as obsolete.
>

What you might consider doing is loading the data into a temp table,
updating it there, then copying that data into the final destination.
Depending on the indexes involved, you might even find this to be faster.
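A rough sketch of that approach (table, column, and file names here are hypothetical):

CREATE TEMP TABLE staging (LIKE bigtab INCLUDING DEFAULTS);
COPY staging FROM '/tmp/load_file.csv' WITH CSV;            -- bulk load into the temp table
UPDATE staging SET status = 'fixed' WHERE status IS NULL;   -- do the mass updates here
INSERT INTO bigtab SELECT * FROM staging;                   -- then copy the final rows across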

--
Alan

Re: "Healing" a table after massive updates

From
"Scott Marlowe"
Date:
On Thu, Sep 11, 2008 at 11:15 AM, Alan Hodgson <ahodgson@simkin.ca> wrote:
> On Thursday 11 September 2008, "Gauthier, Dave" <dave.gauthier@intel.com>
> wrote:
>> I have a job that loads a large table, but then has to "update" about
>> half the records for various reasons.  My perception of what happens on
>> update for a particular record is...
>>
>> - A new record is inserted with the updated value(s).
>>
>> - The old record is marked as obsolete.
>>
>
> What you might consider doing is loading the data into a temp table,
> updating it there, then copying that data into the final destination.
> Depending on the indexes involved, you might even find this to be faster.

Especially if you can drop the indexes on the real table before
reimporting into it, then recreate them afterwards.
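Something along these lines, reusing the hypothetical "staging" table from the sketch above (index and column names are made up for illustration):

DROP INDEX bigtab_status_idx;                         -- drop secondary indexes before the bulk insert
INSERT INTO bigtab SELECT * FROM staging;             -- reimport the prepared rows
CREATE INDEX bigtab_status_idx ON bigtab (status);    -- rebuild the index afterwards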

Re: "Healing" a table after massive updates

From
Gregory Stark
Date:
"Scott Marlowe" <scott.marlowe@gmail.com> writes:

> On Thu, Sep 11, 2008 at 8:56 AM, Bill Moran
> <wmoran@collaborativefusion.com> wrote:
>> In response to Alvaro Herrera <alvherre@commandprompt.com>:
>>
>>> Bill Moran wrote:
>>> > In response to "Gauthier, Dave" <dave.gauthier@intel.com>:
>>> >
>>> > > I might be able to answer my own question...
>>> > >
>>> > > vacuum FULL (analyze is optional)
>>> >
>>> > CLUSTER _may_ be a better choice, but carefully read the docs regarding
>>> > its drawbacks first.  You may want to do some benchmarks to see if it's
>>> > really needed before you commit to it as a scheduled operation.
>>>
>>> What drawbacks?
>>
>> There's the whole "there will be two copies of the table on-disk" thing
>> that could be an issue if it's a large table.
>
> I've also found cluster to be pretty slow, even on 8.3.  On a server
> that hits 30-40Megs a second write speed for random access during
> pgbench, it's writing out at 1 to 2 megabytes a second when it runs,
> and takes the better part of a day on our biggest table.  vacuumdb -fz
> + reindexdb ran in about 6 hours which means we could fit it into our
> maintenance window.  vacuum moves a lot more data per second than
> cluster.

Alternatively you can do

ALTER TABLE tab ALTER col TYPE <sametype> USING col;

which will rewrite the table without using an index. This is much faster, but
it shares the other disadvantages.


Incidentally, the visibility bugs are indeed entirely fixed in 8.3.  In 8.2 and
before, CLUSTER and ALTER TABLE rewrites can both cause tuples to not appear
for transactions which were started before the cluster or alter table, such as
a long-running pg_dump.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

Re: "Healing" a table after massive updates

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Incidentally, the visibility bugs are indeed entirely fixed in 8.3.  In 8.2 and
> before, CLUSTER and ALTER TABLE rewrites can both cause tuples to not appear
> for transactions which were started before the cluster or alter table, such as
> a long-running pg_dump.

AFAIK that's true only for CLUSTER, not ALTER TABLE.  There would be a
bunch of logical inconsistencies in altering rows and then pretending
you hadn't.

            regards, tom lane

Re: "Healing" a table after massive updates

From
Gregory Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> Incidentally, the visibility bugs are indeed entirely fixed in 8.3.  In 8.2 and
>> before, CLUSTER and ALTER TABLE rewrites can both cause tuples to not appear
>> for transactions which were started before the cluster or alter table, such as
>> a long-running pg_dump.
>
> AFAIK that's true only for CLUSTER, not ALTER TABLE.  There would be a
> bunch of logical inconsistencies in altering rows and then pretending
> you hadn't.

Uh, what's true? That the bugs are fixed or that we ever had them to begin
with?

Oh, are you saying that ALTER TABLE doesn't go through the rewrite code to
carry along old versions of the tuples because it can't alter the old
versions? That is a bit annoying. That means there's no safe way to do a table
rewrite without clustering the table?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!