Thread: loading data, creating indexes, clustering, vacuum...

From
"Angva"
Date:
Hi everyone,

Looking for a small bit of advice...

I have a script that updates several tables with large amounts of data.
Before running the updates, it drops all indexes for optimal
performance. When the updates have finished, I run the following
procedure:

recreate the indexes
cluster the tables
vacuum full analyze on the tables
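
In SQL terms, per table, it's roughly this (hypothetical table and index
names):

  CREATE INDEX orders_date_idx ON orders (order_date);
  CLUSTER orders_date_idx ON orders;
  VACUUM FULL ANALYZE orders;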

I was hoping an expert could comment on the optimal way to order these
three commands. For instance I have a hunch that creating the indexes
first (as I do now) could slow down the clustering - perhaps the row
locations in the indexes all have to be updated as the cluster command
shifts their locations? And perhaps vacuuming should be done before
clustering so that dead tuples aren't "in the way"?

Of course I could just test every combination until I get it right, but
I'd like to have a good understanding as well.

Any insight would be much appreciated.

Thank you,
Mark


Re: loading data, creating indexes, clustering, vacuum...

From
Alan Hodgson
Date:
On Thursday 07 December 2006 08:38, "Angva" <angvaw@gmail.com> wrote:
> three commands. For instance I have a hunch that creating the indexes
> first (as I do now) could slow down the clustering - perhaps the row
> locations in the indexes all have to be updated as the cluster command
> shifts their locations? And perhaps vacuuming should be done before
> clustering so that dead tuples aren't "in the way"?

clustering also removes the dead tuples.

I would just:

- create one index, the one to be clustered
- cluster the table
- create the remaining indexes
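
In SQL, with hypothetical names, something like:

  -- only the index the table will be clustered on
  CREATE INDEX orders_date_idx ON orders (order_date);
  CLUSTER orders_date_idx ON orders;
  -- the rest, built after the rows are already in order
  CREATE INDEX orders_cust_idx ON orders (customer_id);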

--
Eat right. Exercise regularly. Die anyway.


Re: loading data, creating indexes, clustering, vacuum...

From
Martijn van Oosterhout
Date:
On Thu, Dec 07, 2006 at 02:52:11PM -0800, Alan Hodgson wrote:
> On Thursday 07 December 2006 08:38, "Angva" <angvaw@gmail.com> wrote:
> > three commands. For instance I have a hunch that creating the indexes
> > first (as I do now) could slow down the clustering - perhaps the row
> > locations in the indexes all have to be updated as the cluster command
> > shifts their locations? And perhaps vacuuming should be done before
> > clustering so that dead tuples aren't "in the way"?
>
> clustering also removes the dead tuples.
>
> I would just:
>
> - create one index, the one to be clustered
> - cluster the table
> - create the remaining indexes

And then run ANALYSE. No need to vacuum because the cluster did that
already.
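
For example (hypothetical table name):

  ANALYSE orders;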

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Marking indexes out of date (WAS: loading data, creating indexes, clustering, vacuum...)

From
Glen Parker
Date:
Angva wrote:
> Looking for a small bit of advice...
>
> I have a script that updates several tables with large amounts of data.
> Before running the updates, it drops all indexes for optimal
> performance. When the updates have finished, I run the following
> procedure:
>
> recreate the indexes
> cluster the tables
> vacuum full analyze on the tables

Hi all,

I'd like to see a general way to take indexes off line without actually
losing their definitions.  For example, something like "ALTER TABLE [EN
| DIS] ABLE INDEXES", "ALTER INDEX [EN | DIS] ABLE", etc.  This could
also be used internally when a backend encounters an error
reading/writing an index.  Rather than refusing to execute queries, it
could just ignore indexes it knows are disabled or bad in some way and
re-plan as needed.

This would have two benefits.  First, the above scenario would be much
simpler.  Rather than dropping and re-creating new indexes, you could
just disable and then re-enable them without having any knowledge of
their structure.  Secondly, it would allow us to put indexes in an
alternate table space on a non-redundant volume and, in the case of a
drive failure, be able to limp along, and get the system back to normal
simply by replacing the disk and issuing a REINDEX command.

I realize there are a couple of gotchas with this.  For example, what to do
with unique indexes?  Perhaps a backend would still need to refuse to do
updates/inserts on a table with degraded unique indexes, unless the index
was disabled explicitly?  And then, refuse to rebuild/re-enable the
index as normal if non-unique values are found?
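
To sketch the idea (hypothetical, proposed syntax only; nothing like this
exists in PostgreSQL today):

  -- hypothetical: stop maintaining the indexes but keep their definitions
  ALTER TABLE orders DISABLE INDEXES;
  -- ... bulk updates run here ...
  -- hypothetical: rebuild the indexes and bring them back online
  ALTER TABLE orders ENABLE INDEXES;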


Thx for considering :-)

-Glen Parker

From
Glen Parker
Date:
Hi all,

Since PITR works well, my use of pg_dump has shifted.  Rather than using
it as a backup tool, I now use it as a snapshotting tool.  At the end of
each  month we do an ASCII dump to keep around, so if we ever need to,
we can see the data as it was any number of months or years ago.  Not a
backup at all, just a raw data archive.

These archives do not need to hold all our data, for example, system
logs would be useless later.  There also is no reason to include
indexes.  Ignoring extraneous tables and indexes is a great way to keep
the archive small and keep the time to restore as low as possible.

So, it would be great if pg_dump could accept some sort of argument to
make it simply not dump certain types of objects.  Indexes, views,
functions, etc.


Thx for considering :-)

-Glen Parker

Re: Marking indexes out of date (WAS: loading data, creating indexes, clustering, vacuum...)

From
Martijn van Oosterhout
Date:
On Fri, Dec 08, 2006 at 12:49:30PM -0800, Glen Parker wrote:
> I'd like to see a general way to take indexes off line without actually
> losing their definitions.  For example, something like "ALTER TABLE [EN
> | DIS] ABLE INDEXES", "ALTER INDEX [EN | DIS] ABLE", etc.  This could
> also be used internally when a backend encounters an error
> reading/writing an index.  Rather than refusing to execute queries, it
> could just ignore indexes it knows are disabled or bad in some way and
> re-plan as needed.

One issue would be that even disabled indexes would need to be updated
when there are new rows. If you don't update the index when it's
disabled, then re-enabling will essentially need to rebuild the index.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


Re: Marking indexes out of date (WAS: loading data, creating indexes, clustering, vacuum...)

From
Glen Parker
Date:
Martijn van Oosterhout wrote:
> On Fri, Dec 08, 2006 at 12:49:30PM -0800, Glen Parker wrote:
>> I'd like to see a general way to take indexes off line without actually
>> losing their definitions.  For example, something like "ALTER TABLE [EN
>> | DIS] ABLE INDEXES", "ALTER INDEX [EN | DIS] ABLE", etc.  This could
>> also be used internally when a backend encounters an error
>> reading/writing an index.  Rather than refusing to execute queries, it
>> could just ignore indexes it knows are disabled or bad in some way and
>> re-plan as needed.
>
> One issue would be that even disabled indexes would need to be updated
> when there are new rows. If you don't update the index when it's
> disabled, then re-enabling will essentially need to rebuild the index.


That's what I had in mind.  You could just as easily blow away the index
file(s).  It's just that I don't want it to toss the index *definition*.

To continue to update such an index would completely negate the
benefit of disabling it!

-Glen

Re: Marking indexes out of date (WAS: loading data, creating indexes, clustering, vacuum...)

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> One issue would be that even disabled indexes would need to be updated
> when there are new rows. If you don't update the index when it's
> disabled, then re-enabling will essentially need to rebuild the index.

I assume that's what he wants.  However, it's not immediately clear that
there's a sufficient use case for this to justify the extra apparatus
compared to just DROP INDEX (and recreate it later).

            regards, tom lane

Re: loading data, creating indexes, clustering, vacuum...

From
"Angva"
Date:
> clustering also removes the dead tuples.

I have a followup question. What if the set of dead tuples is too big
and I need to VACUUM FULL, as opposed to VACUUM. (The update size
varies greatly from day to day.) Will the clustering effectively do a
VACUUM FULL, or just a VACUUM?

Thanks again for your help,
Mark


Re: loading data, creating indexes, clustering, vacuum...

From
"Angva"
Date:
Thank you very much, Alan and Martijn, for the advice!

Mark

Martijn van Oosterhout wrote:
> On Thu, Dec 07, 2006 at 02:52:11PM -0800, Alan Hodgson wrote:
> > On Thursday 07 December 2006 08:38, "Angva" <angvaw@gmail.com> wrote:
> > > three commands. For instance I have a hunch that creating the indexes
> > > first (as I do now) could slow down the clustering - perhaps the row
> > > locations in the indexes all have to be updated as the cluster command
> > > shifts their locations? And perhaps vacuuming should be done before
> > > clustering so that dead tuples aren't "in the way"?
> >
> > clustering also removes the dead tuples.
> >
> > I would just:
> >
> > - create one index, the one to be clustered
> > - cluster the table
> > - create the remaining indexes
>
> And then run ANALYSE. No need to vacuum because the cluster did that
> already.
>
> Have a nice day,
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


Re: loading data, creating indexes, clustering, vacuum...

From
Bill Moran
Date:
"Angva" <angvaw@gmail.com> wrote:
>
> > clustering also removes the dead tuples.
>
> I have a followup question. What if the set of dead tuples is too big
> and I need to VACUUM FULL, as opposed to VACUUM. (The update size
> varies greatly from day to day.) Will the clustering effectively do a
> VACUUM FULL, or just a VACUUM?

CLUSTER is the equivalent of VACUUM FULL, with the addition that it
orders the data in the table in the order of the index you specify.

VACUUM FULL doesn't follow any particular order for the data.
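
So for a badly bloated table, either of these reclaims the space
(hypothetical table name; CLUSTER with just a table name requires that
the table has been clustered before):

  VACUUM FULL orders;  -- compacts in place, no particular row order
  CLUSTER orders;      -- rewrites the table in the cluster index's order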

-Bill

Re: loading data, creating indexes, clustering, vacuum...

From
"Shoaib Mir"
Date:
If you have max_fsm_pages set correctly, VACUUM stays fast and you remove the need for VACUUM FULL or REINDEX. It should be slightly more than the total number of data pages that will be touched by updates and deletes between vacuums.
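
A quick sanity check from psql (these free space map settings exist in
the 8.x releases discussed here):

  SHOW max_fsm_pages;
  -- a database-wide VACUUM VERBOSE reports at the end how many
  -- page slots the free space map actually needs
  VACUUM VERBOSE;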

Thanks,
--------
Shoaib Mir
EnterpriseDB (www.enterprisedb.com)

On 8 Dec 2006 09:33:31 -0800, Angva <angvaw@gmail.com> wrote:
> > clustering also removes the dead tuples.
>
> I have a followup question. What if the set of dead tuples is too big
> and I need to VACUUM FULL, as opposed to VACUUM. (The update size
> varies greatly from day to day.) Will the clustering effectively do a
> VACUUM FULL, or just a VACUUM?
>
> Thanks again for your help,
> Mark



PITR and moving objects between table spaces

From
Glen Parker
Date:
Gurus,

I hope I can make this clear somehow...  Anyway...  This all involves PG
8.1.4 on a 64-bit FC5 box.

Select version() says "PostgreSQL 8.1.4 on x86_64-redhat-linux-gnu,
compiled by GCC x86_64-redhat-linux-gcc (GCC) 4.1.0 20060304 (Red Hat
4.1.0-3)".

I guess the best question I can see is, under what circumstances is the
directory name in pg_tablespace actually used?

I have a scenario where I want to restore from a PITR backup, into an
alternate location on the same machine it came from, while the original
database is still up and running.  I have one alternate table space.

It goes like this.  First I expand the base archive into an alternate
location, then expand the table space archive(s) into alternate
location(s).  Then I recreate the links under pg_tblspc.  I then fiddle
a little bit with config files and run postgres on the new alternate
location.  Everthing goes fine, the database rolls forward, and then
postgres quits (because I give it an SQL file for stdin).  Great.

But now I have a problem.  What if I move objects from the main
tablespace to the alternate one, such as indexes, between the time of
the backup and the restore?  During the restore/recovery, the
pg_tablespace table is out of date.  If the tablespace directory listed
there was used in copying files, I'd have a big fat mess involving a
badly broken production database.

Hopefully that all makes sense...

-Glen

Re: PITR and moving objects between table spaces

From
Tom Lane
Date:
Glen Parker <glenebob@nwlink.com> writes:
> I guess the best question I can see is, under what circumstances is the
> directory name in pg_tablespace actually used?

It isn't used (except by pg_dumpall) ... what counts is where the
symlink in $PGDATA/pg_tblspc points.
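
For example (the spclocation column as it exists in these 8.x releases):

  SELECT spcname, spclocation FROM pg_tablespace;

The path it returns is informational; the server resolves tablespaces
through the pg_tblspc symlinks.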

> I have a scenario where I want to restore from a PITR backup, into an
> alternate location on the same machine it came from, while the original
> database is still up and running.  I have one alternate table space.

You can probably make this work if you don't issue any CREATE TABLESPACE
commands while PITR logging is active, but you'll want to test your
procedures pretty carefully.

            regards, tom lane

Re: PITR and moving objects between table spaces

From
Glen Parker
Date:
> You can probably make this work if you don't issue any CREATE TABLESPACE
> commands while PITR logging is active, but you'll want to test your
> procedures pretty carefully.

That's what I thought, and after your message, I went ahead with it and
had no problems.  Thx, Tom.

-Glen

Re: Marking indexes out of date (WAS: loading data, creating indexes, clustering, vacuum...)

From
Bruce Momjian
Date:
Glen Parker wrote:
> Martijn van Oosterhout wrote:
> > On Fri, Dec 08, 2006 at 12:49:30PM -0800, Glen Parker wrote:
> >> I'd like to see a general way to take indexes off line without actually
> >> losing their definitions.  For example, something like "ALTER TABLE [EN
> >> | DIS] ABLE INDEXES", "ALTER INDEX [EN | DIS] ABLE", etc.  This could
> >> also be used internally when a backend encounters an error
> >> reading/writing an index.  Rather than refusing to execute queries, it
> >> could just ignore indexes it knows are disabled or bad in some way and
> >> re-plan as needed.
> >
> > One issue would be that even disabled indexes would need to be updated
> > when there are new rows. If you don't update the index when it's
> > disabled, then re-enabling will essentially need to rebuild the index.
>
>
> That's what I had in mind.  You could just as easily blow away the index
> file(s).  It's just that I don't want it to toss the index *definition*.

Well, you can just pg_dump the index definition to a file.  What more
would someone want?
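
For instance, something like this captures the definitions first
(hypothetical table name):

  SELECT pg_get_indexdef(i.indexrelid)
  FROM pg_index i
  JOIN pg_class t ON t.oid = i.indrelid
  WHERE t.relname = 'orders';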

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +