Home > mailing lists

Thread: When/if to Reindex

When/if to Reindex

From

"Steven Flatt"

Date:

18 July 2007, 14:08:36

We're using Postgres 8.2.4.

I'm trying to decide whether it's worthwhile to implement a process that does periodic reindexing. In a few ad hoc tests, where I've tried to set up data similar to how our application does it, I've noticed decent performance increases after doing a reindex as well as the planner being more likely to choose an index scan.

Some background: we make extensive use of partitioned tables. In fact, I'm really only considering reindexing partitions that have "just closed". In our simplest/most general case, we have a table partitioned by a timestamp column, each partition 24 hours wide. The partition will have an index on the timestamp column as well as a few other indexes including a primary key index (all b-tree). Is there a programmatic way I can decide, upon the "closing" of a partition, which, if any, of these indexes will benefit from a reindex? Can I determine things like average node density, node depth, or any other indication as to the quality of an index? Will pg_class.relpages be any help here?

Is it a simple matter of running some queries, reindexing the table, then running the queries again to determine overall performance change? If so, what queries would exercise this best?

Just trying to determine if the extra cost of reindexing newly closed partitions will be worth the performance benefit of querying the data. Reindexing a table with a day's worth of data is taking on the order of a few hours (10s of millions of rows).

The docs say that:

"...for B-tree indexes a freshly-constructed index is somewhat faster to access than one that has been updated many times, because logically adjacent pages are usually also physically adjacent in a newly built index... It might be worthwhile to reindex periodically just to improve access speed."

Thanks,

Steve

Re: When/if to Reindex

From

"Jim C. Nasby"

Date:

18 July 2007, 15:53:00

On Wed, Jul 18, 2007 at 01:08:30PM -0400, Steven Flatt wrote:
> We're using Postgres 8.2.4.
>
> I'm trying to decide whether it's worthwhile to implement a process that
> does periodic reindexing.  In a few ad hoc tests, where I've tried to set up
> data similar to how our application does it, I've noticed decent performance
> increases after doing a reindex as well as the planner being more likely to
> choose an index scan.
>
> Some background: we make extensive use of partitioned tables.  In fact, I'm
> really only considering reindexing partitions that have "just closed".  In
> our simplest/most general case, we have a table partitioned by a timestamp
> column, each partition 24 hours wide.  The partition will have an index on
> the timestamp column as well as a few other indexes including a primary key
> index (all b-tree).  Is there a programmatic way I can decide, upon the
> "closing" of a partition, which, if any, of these indexes will benefit from
> a reindex?  Can I determine things like average node density, node depth, or
> any other indication as to the quality of an index?  Will pg_class.relpages
> be any help here?

Looking at that stuff will help determine if the index is bloated, or if
it's just bigger than optimal. Once you're done writing to an index, it
might be worth reindexing with a fillfactor of 100% to shrink things
down a bit.

> Is it a simple matter of running some queries, reindexing the table, then
> running the queries again to determine overall performance change?  If so,
> what queries would exercise this best?
>
> Just trying to determine if the extra cost of reindexing newly closed
> partitions will be worth the performance benefit of querying the data.
> Reindexing a table with a day's worth of data is taking on the order of a
> few hours (10s of millions of rows).
>
> The docs say that:
>
> "...for B-tree indexes a freshly-constructed index is somewhat faster to
> access than one that has been updated many times, because logically adjacent
> pages are usually also physically adjacent in a newly built index... It
> might be worthwhile to reindex periodically just to improve access speed."

That's the other consideration, though if you're seeing a big difference
I suspect it's an issue of indexes fitting in cache or not.
--
Jim Nasby                                      decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Attachment

msg-22763-27541.dat

Re: When/if to Reindex

From

Vivek Khera

Date:

08 August 2007, 14:52:41

On Jul 18, 2007, at 1:08 PM, Steven Flatt wrote:

> Some background: we make extensive use of partitioned tables.  In
> fact, I'm
> really only considering reindexing partitions that have "just
> closed".  In
> our simplest/most general case, we have a table partitioned by a
> timestamp
> column, each partition 24 hours wide.  The partition will have an
> index on
> the timestamp column as well as a few other indexes including a
> primary key

If all you ever did was insert into that table, then you probably
don't need to reindex.  If you did mass updates/deletes mixed with
your inserts, then perhaps you do.

Do some experiments comparing pg_class.relpages for your table and
its indexes before and after a reindex.  Decide if the number of
pages you save on the index is worth the trouble.  If it shaves off
just a handful of pages, I'd vote no...

Re: When/if to Reindex

From

"Steven Flatt"

Date:

08 August 2007, 16:13:22

On 8/8/07, Vivek Khera <vivek@khera.org> wrote:

If all you ever did was insert into that table, then you probably
don't need to reindex.  If you did mass updates/deletes mixed with
your inserts, then perhaps you do.

Do some experiments comparing pg_class.relpages for your table and
its indexes before and after a reindex.  Decide if the number of
pages you save on the index is worth the trouble.  If it shaves off
just a handful of pages, I'd vote no...

What's interesting is that an insert-only table can benefit significantly from reindexing after the table is fully loaded. I had done experiments exactly as you suggest (looking at pg_class.relpages), and determined that reindexing results in about a 30% space savings for all indexes except the PK index. The PK index (integer based on a sequence) does not benefit at all. By setting fillfactor=100 on the index prior to reindexing, I get another 10% space savings on all the indexes.

Not to mention the general performance improvements when reading from the table...

So, we decided that reindexing partitions after they're fully loaded *was* worth it.

Steve

Re: When/if to Reindex

From

Bill Moran

Date:

08 August 2007, 16:28:05

In response to "Steven Flatt" <steven.flatt@gmail.com>:

> On 8/8/07, Vivek Khera <vivek@khera.org> wrote:
> >
> > If all you ever did was insert into that table, then you probably
> > don't need to reindex.  If you did mass updates/deletes mixed with
> > your inserts, then perhaps you do.
> >
> > Do some experiments comparing pg_class.relpages for your table and
> > its indexes before and after a reindex.  Decide if the number of
> > pages you save on the index is worth the trouble.  If it shaves off
> > just a handful of pages, I'd vote no...
>
>
> What's interesting is that an insert-only table can benefit significantly
> from reindexing after the table is fully loaded.  I had done experiments
> exactly as you suggest (looking at pg_class.relpages), and determined that
> reindexing results in about a 30% space savings for all indexes except the
> PK index.  The PK index (integer based on a sequence) does not benefit at
> all.  By setting fillfactor=100 on the index prior to reindexing, I get
> another 10% space savings on all the indexes.
>
> Not to mention the general performance improvements when reading from the
> table...
>
> So, we decided that reindexing partitions after they're fully loaded *was*
> worth it.

I've had similar experience.  One thing you didn't mention that I've noticed
is that VACUUM FULL often bloats indexes.  I've made it SOP that
after application upgrades (which usually includes lots of ALTER TABLES and
other massive schema and data changes) I VACUUM FULL and REINDEX (in that
order).

Lots of ALTER TABLEs seem to bloat the database size considerably, beyond
what normal VACUUM seems to fix.  A FULL seems to fix that, but it appears
to bloat the indexes, thus a REINDEX helps.

I would expect that setting fillfactor to 100 will encourage indexs to bloat
faster, and would only be recommended if you didn't expect the index contents
to change?

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

Re: When/if to Reindex

From

Decibel!

Date:

08 August 2007, 17:36:35

On Wed, Aug 08, 2007 at 03:27:57PM -0400, Bill Moran wrote:
> I've had similar experience.  One thing you didn't mention that I've noticed
> is that VACUUM FULL often bloats indexes.  I've made it SOP that
> after application upgrades (which usually includes lots of ALTER TABLES and
> other massive schema and data changes) I VACUUM FULL and REINDEX (in that
> order).

You'd be better off with a CLUSTER in that case. It'll be faster, and
you'll ensure that the table has optimal ordering.

> Lots of ALTER TABLEs seem to bloat the database size considerably, beyond
> what normal VACUUM seems to fix.  A FULL seems to fix that, but it appears
> to bloat the indexes, thus a REINDEX helps.

Hrm, are you sure that's still true? I just did an ALTER TABLE ... TYPE
and it created a new file, meaning no bloating.

> I would expect that setting fillfactor to 100 will encourage indexs to bloat
> faster, and would only be recommended if you didn't expect the index contents
> to change?

Yes.
--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Attachment

msg-22763-27873.dat

Re: When/if to Reindex

From

Tom Lane

Date:

08 August 2007, 22:52:03

Bill Moran <wmoran@collaborativefusion.com> writes:
> In response to "Steven Flatt" <steven.flatt@gmail.com>:
>> What's interesting is that an insert-only table can benefit significantly
>> from reindexing after the table is fully loaded.

> I've had similar experience.  One thing you didn't mention that I've noticed
> is that VACUUM FULL often bloats indexes.  I've made it SOP that
> after application upgrades (which usually includes lots of ALTER TABLES and
> other massive schema and data changes) I VACUUM FULL and REINDEX (in that
> order).

Actually, if that is your intent then the best plan is: drop indexes,
VACUUM FULL, create indexes from scratch.  A huge proportion of VACUUM
FULL's time goes into updating the indexes, and that work is basically
wasted if you are going to reindex afterwards.

CLUSTER is a good substitute for V.F. partly because it doesn't try to
update the indexes incrementally, but just does the equivalent of
REINDEX after it's reordered the heap.

I'd make the same remark about Steven's case: if possible, don't create
the indexes at all until you've loaded the table fully.

            regards, tom lane

Re: When/if to Reindex

From

Bill Moran

Date:

09 August 2007, 10:04:48

In response to "Decibel!" <decibel@decibel.org>:

> On Wed, Aug 08, 2007 at 03:27:57PM -0400, Bill Moran wrote:
> > I've had similar experience.  One thing you didn't mention that I've noticed
> > is that VACUUM FULL often bloats indexes.  I've made it SOP that
> > after application upgrades (which usually includes lots of ALTER TABLES and
> > other massive schema and data changes) I VACUUM FULL and REINDEX (in that
> > order).
>
> You'd be better off with a CLUSTER in that case. It'll be faster, and
> you'll ensure that the table has optimal ordering.

Point taken.

> > Lots of ALTER TABLEs seem to bloat the database size considerably, beyond
> > what normal VACUUM seems to fix.  A FULL seems to fix that, but it appears
> > to bloat the indexes, thus a REINDEX helps.
>
> Hrm, are you sure that's still true? I just did an ALTER TABLE ... TYPE
> and it created a new file, meaning no bloating.

No, I'm not.  This isn't something I've analyzed or investigated in detail.
During upgrades, a lot happens: ATLER TABLES, tables are dropped, new tables
are created, massive amounts of data may be altered in a short period, stored
procedures are replaced, etc, etc.

I don't remember what led me to believe that the ALTER TABLES were causing the
worst of the problem, but it's entirely possible that I was off-base.  (I seem
to remember being concerned about too many DROP COLUMN and ADD COLUMNs)  In any
event, my original statement (that it's a good idea to REINDEX after VACUUM
FULL) still seems to be correct.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

Re: When/if to Reindex

From

"Steven Flatt"

Date:

09 August 2007, 10:51:56

On 8/8/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'd make the same remark about Steven's case: if possible, don't create
the indexes at all until you've loaded the table fully.

We considered this, however in some of our 12-hour partitions, there are upwards of 50 or 60 million rows near the end of the 12 hours so read performance gets bad on the current partition very quickly if there are no indexes.

It makes more sense for us to have ~1 hour's worth of reindexing afterwards during which read performance on that partition is "compromised".

Steve

Re: When/if to Reindex

From

"Steven Flatt"

Date:

22 August 2007, 13:55:37

It makes more sense for us to have ~1 hour's worth of reindexing afterwards during which read performance on that partition is "compromised".

So, based on the docs, I was expecting read performance to be compromised during a reindex, specifically reads would not be allowed to use the index:

"REINDEX locks out writes but not reads of the index's parent table. It also takes an exclusive lock on the specific index being processed, which will block reads that attempt to use that index."

However I'm seeing that all readers of that table are blocked until the reindex finishes, even reads that do not attempt to use the index. Is this a problem with the docs or a bug?

I'm considering creating a new index with the same definition as the first (different name), so while that index is being created, read access to the table, and the original index, is not blocked. When the new index is created, drop the original index and rename the new index to the original, and we've essentially accomplished the same thing. In fact, why isn't reindex doing this sort of thing in the background anways?

Thanks,

Steve

Re: When/if to Reindex

From

Gregory Stark

Date:

22 August 2007, 20:51:04

"Steven Flatt" <steven.flatt@gmail.com> writes:

> However I'm seeing that all readers of that table are blocked until the
> reindex finishes, even reads that do not attempt to use the index.  Is this
> a problem with the docs or a bug?

You'll have to describe in more detail what you're doing so we can see what's
causing it to not work for you because "works for me":

postgres=# create table test (i integer);
CREATE TABLE
postgres=# insert into test select generate_series(1,1000);
INSERT 0 1000
postgres=# create or replace function slow(integer) returns integer as 'begin perform pg_sleep(0); return $1; end'
languageplpgsql immutable strict; 
CREATE FUNCTION
postgres=# create index slowi on test (slow(i));
CREATE INDEX
postgres=# create or replace function slow(integer) returns integer as 'begin perform pg_sleep(1); return $1; end'
languageplpgsql immutable strict; 
CREATE FUNCTION
postgres=# reindex index slowi;

While that's running I ran:

postgres=# select count(*) from test;
 count
-------
  1000
(1 row)

> I'm considering creating a new index with the same definition as the first
> (different name), so while that index is being created, read access to the
> table, and the original index, is not blocked.  When the new index is
> created, drop the original index and rename the new index to the original,
> and we've essentially accomplished the same thing.  In fact, why isn't
> reindex doing this sort of thing in the background anways?

It is but one level lower down. But the locks which block people from using
the index must be at this level. Consider for example that one of the
operations someone might be doing is creating a foreign key which depends on
this index. If we created a new index and then tried to drop this one the drop
would fail because of the foreign key which needs it. It's possible these
problems could all be worked out but it would still take quite a bit of work
to do so.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: When/if to Reindex

From

"Steven Flatt"

Date:

23 August 2007, 19:25:31

On 8/22/07, Gregory Stark <stark@enterprisedb.com> wrote:

postgres=# create table test (i integer);
CREATE TABLE
postgres=# insert into test select generate_series(1,1000);
INSERT 0 1000
postgres=# create or replace function slow(integer) returns integer as 'begin perform pg_sleep(0); return $1; end' language plpgsql immutable strict;
CREATE FUNCTION
postgres=# create index slowi on test (slow(i));
CREATE INDEX
postgres=# create or replace function slow(integer) returns integer as 'begin perform pg_sleep(1); return $1; end' language plpgsql immutable strict;
CREATE FUNCTION
postgres=# reindex index slowi;

While that's running I ran:

postgres=# select count(*) from test;
count
-------
1000
(1 row)

Interestingly enough, the example you've given does not work for me either. The select count(*) from test blocks until the reindex completes. Are we using the same pg version?

# select version();

version

--------------------------------------------------------------------------------
----------------
PostgreSQL 8.2.4 on i386-portbld-freebsd6.1, compiled by GCC cc (GCC) 3.4.4 [FreeBSD] 20050518
(1 row)

Looking at the pg_locks table, I see:

# select locktype,relation,mode,granted from pg_locks where not granted;
locktype | relation | mode | granted
----------+----------+-----------------+---------
relation | 69293 | AccessShareLock | f
(1 row)

# select relname from pg_class where oid = 69293;
relname
---------
slowi
(1 row)

# select locktype,relation,mode,granted from pg_locks where relation = 69293;
locktype | relation |        mode         | granted
----------+----------+---------------------+---------
relation |    69293 | AccessShareLock     | f
relation |    69293 | AccessExclusiveLock | t
(2 rows)

So the reindex statement has an AccessExclusiveLock on the index, which seems right, and this blocks the select count(*) from getting an AccessShareLock on the index. Why does the select count(*) need a lock on the index? Is there some Postgres setting that could cause this behaviour? I can't even do an "explain select count(*) from test" without blocking.

Any ideas?

Steve

Re: When/if to Reindex

From

Tom Lane

Date:

24 August 2007, 00:59:21

"Steven Flatt" <steven.flatt@gmail.com> writes:
> Interestingly enough, the example you've given does not work for me either.
> The select count(*) from test blocks until the reindex completes.  Are we
> using the same pg version?

Seems like a fair question, because Greg's example blocks for me too,
in plancat.c where the planner is trying to acquire information on each
index.  This seems to be an unwanted side effect of this 8.2-era patch
http://archives.postgresql.org/pgsql-committers/2006-07/msg00356.php
specifically, note here
http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/optimizer/util/plancat.c.diff?r1=1.121;r2=1.122;f=h
how the new planner coding takes at least AccessShareLock on each index,
where the old coding took no lock at all.

I think that the new coding rule of "you *must* take some lock when
opening the relation" is essential for tables, but it might not be
necessary for indexes if you've got a lock on the parent table.
We don't allow any schema changes on an index to be made without holding
exclusive lock on the parent, so plancat.c's basic purpose of finding
out the properties of the index could be done safely without any index
lock.

The fly in the ointment is that after collecting the pg_index definition
of the index, plancat.c also wants to know how big it is --- it calls
RelationGetNumberOfBlocks.  And that absolutely does look at the
physical storage, which means it absolutely is unsafe to do in parallel
with a REINDEX that will be dropping the old physical storage at some
point.

So maybe we are stuck and we have to say "that doesn't work anymore".
But it feels like we might not be too far away from letting it still
work.  Thoughts, ideas?

            regards, tom lane

Re: When/if to Reindex

From

Mark Kirkwood

Date:

24 August 2007, 02:27:25

Tom Lane wrote:
>
> The fly in the ointment is that after collecting the pg_index definition
> of the index, plancat.c also wants to know how big it is --- it calls
> RelationGetNumberOfBlocks.  And that absolutely does look at the
> physical storage, which means it absolutely is unsafe to do in parallel
> with a REINDEX that will be dropping the old physical storage at some
> point.
>
> So maybe we are stuck and we have to say "that doesn't work anymore".
> But it feels like we might not be too far away from letting it still
> work.  Thoughts, ideas?
>

A suggestion that seems a bit like a leap backwards in time - maybe just
use the pg_class.relpages entry for the index size?

I'm punting that with autovacuum being enabled by default now, the
relpages entries for all relations will be more representative than they
used to in previous releases.

Cheers

Mark

Re: When/if to Reindex

From

Gregory Stark

Date:

24 August 2007, 03:49:28

"Steven Flatt" <steven.flatt@gmail.com> writes:

> On 8/22/07, Gregory Stark <stark@enterprisedb.com> wrote:
>
> Interestingly enough, the example you've given does not work for me either.
> The select count(*) from test blocks until the reindex completes.  Are we
> using the same pg version?

I was using CVS head but given Tom's explanation I wouldn't expect to see any
different behaviour here.

I just retried it and it did block. I can't think of anything I could have
done wrong last time to make it appear not to block. If I had missed an error
at some point along the way I would have expected the reindex to complete
quickly or fail or something but it was definitely just blocked. I remember
noting (much) later that it had finished.

Strange.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: When/if to Reindex

From

"Steven Flatt"

Date:

24 August 2007, 11:22:48

On 8/24/07, Mark Kirkwood <markir@paradise.net.nz> wrote:

Tom Lane wrote:
>
> The fly in the ointment is that after collecting the pg_index definition
> of the index, plancat.c also wants to know how big it is --- it calls
> RelationGetNumberOfBlocks. And that absolutely does look at the
> physical storage, which means it absolutely is unsafe to do in parallel
> with a REINDEX that will be dropping the old physical storage at some
> point.

A suggestion that seems a bit like a leap backwards in time - maybe just
use the pg_class.relpages entry for the index size?

Just throwing this out there (looking from a higher level)...

Why do we even need to consider calling RelationGetNumberOfBlocks or looking at the pg_class.relpages entry? My understanding of the expected behaviour is that while a reindex is happening, all queries run against the parent table are planned as though the index isn't there ( i.e. it's unusable). This may/will result in sub-optimal query plans, but the point is that reindex never blocks readers. Not sure if from an implementation standpoint it's easy to mark an index as "being reindexed" in which case the planner should just skip it.

Steve

Re: When/if to Reindex

From

Tom Lane

Date:

24 August 2007, 13:15:37

"Steven Flatt" <steven.flatt@gmail.com> writes:
>> The fly in the ointment is that after collecting the pg_index definition
>> of the index, plancat.c also wants to know how big it is --- it calls
>> RelationGetNumberOfBlocks.

> Why do we even need to consider calling RelationGetNumberOfBlocks or looking
> at the pg_class.relpages entry?  My understanding of the expected behaviour
> is that while a reindex is happening, all queries run against the parent
> table are planned as though the index isn't there (i.e. it's unusable).

Where in the world did you get that idea?

If we had a REINDEX CONCURRENTLY it might work that way.  A normal
REINDEX cannot "mark" anything because it runs within a single
transaction; there is no way that it can emit any catalog changes
that will be visible before it's over.

            regards, tom lane

Re: When/if to Reindex

From

"Steven Flatt"

Date:

24 August 2007, 13:56:27

On 8/24/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Steven Flatt" <steven.flatt@gmail.com> writes:
> Why do we even need to consider calling RelationGetNumberOfBlocks or looking
> at the pg_class.relpages entry? My understanding of the expected behaviour
> is that while a reindex is happening, all queries run against the parent
> table are planned as though the index isn't there (i.e. it's unusable).

Where in the world did you get that idea?

Maybe that's what I was *hoping* the behaviour would be. :)

From the docs:

"REINDEX locks out writes but not reads of the index's parent table."

"It also takes an exclusive lock on the specific index being processed..."

I believe those two statements imply that reads of the parent table don't take any lock whatsoever on the index being processed, i.e. they ignore it.

If we had a REINDEX CONCURRENTLY it might work that way. A normal
REINDEX cannot "mark" anything because it runs within a single
transaction; there is no way that it can emit any catalog changes
that will be visible before it's over.

... but I understand this difficulty.

So, can we simply trust what's in pg_class.relpages and ignore looking directly at the index? This is a fairly serious concern for us, that reindex is blocking all readers of the parent table.

Thanks,

Steve

Re: When/if to Reindex

From

Tom Lane

Date:

24 August 2007, 14:29:11

"Steven Flatt" <steven.flatt@gmail.com> writes:
> So, can we simply trust what's in pg_class.relpages and ignore looking
> directly at the index?

No, we can't.  In the light of morning I remember more about the reason
for the aforesaid patch: it's actually unsafe to read the pg_class row
at all if you have not got lock on the index.  We are reading with
SnapshotNow in order to be sure we see up-to-date info, and that means
that a concurrent update of the row (eg, for REINDEX to report the new
relfilenode) can have the following behavior:

1. REINDEX inserts the new modified version of the index's pg_class row.

2. Would-be reader process visits the new version of the pg_class row.
   It's not committed yet, so we ignore it and continue scanning.

3. REINDEX commits.

4. Reader process visits the old version of the pg_class row.  It's
   now committed dead, so we ignore it and continue scanning.

5. Reader process bombs out with a complaint about no pg_class row for
   the index.

So we really have to have the lock.

> This is a fairly serious concern for us, that
> reindex is blocking all readers of the parent table.

I'm afraid you're kinda stuck: I don't see any fix that would be
practical to put into 8.2, or even 8.3 considering that it's way too
late to be thinking of implementing REINDEX CONCURRENTLY for 8.3.

You might be able to work around it for now by faking such a reindex
"by hand"; that is, create a duplicate new index under a different
name using CREATE INDEX CONCURRENTLY, then exclusive-lock the table
for just long enough to drop the old index and rename the new one
to match.

It's probably worth asking also how badly you really need routine
reindexing.  Are you certain your app still needs that with 8.2,
or is it a hangover from a few releases back?  Could more aggressive
(auto)vacuuming provide a better solution?

            regards, tom lane

Re: When/if to Reindex

From

"Steven Flatt"

Date:

24 August 2007, 14:49:12

On 8/24/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

You might be able to work around it for now by faking such a reindex
"by hand"; that is, create a duplicate new index under a different
name using CREATE INDEX CONCURRENTLY, then exclusive-lock the table
for just long enough to drop the old index and rename the new one
to match.

This is a good suggestion, one that we had thought of earlier. Looks like it might be time to try it out and observe system impact.

It's probably worth asking also how badly you really need routine
reindexing. Are you certain your app still needs that with 8.2,
or is it a hangover from a few releases back? Could more aggressive
(auto)vacuuming provide a better solution?

Routine reindexing was added (recently, since moving to 8.2) as more of an optimization than a necessity. If the idea above doesn't work for us or causes locking issues, then we could always do away with the periodic reindexing. That would be unfortunate, because reindexing serves to be quite a nice optimization for us. We've observed up to 40% space savings (after setting the fillfactor to 100, then reindexing) along with general improvement in read performance (although hard to quantify).

As mentioned earlier in this thread, we're only reindexing insert-only partitioned tables, once they're fully loaded.

Thanks for your help.

Steve

Re: When/if to Reindex

From

Gregory Stark

Date:

24 August 2007, 17:07:12

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> "Steven Flatt" <steven.flatt@gmail.com> writes:
>> So, can we simply trust what's in pg_class.relpages and ignore looking
>> directly at the index?
>
> No, we can't.  In the light of morning I remember more about the reason
> for the aforesaid patch: it's actually unsafe to read the pg_class row
> at all if you have not got lock on the index.  We are reading with
> SnapshotNow in order to be sure we see up-to-date info, and that means
> that a concurrent update of the row (eg, for REINDEX to report the new
> relfilenode) can have the following behavior:

Should reindex be doing an in-place update? Don't we have to do in-place
updates for other system catalogs which are read in snapshotnow for precisely
the same reasons?

Alternatively, why does the planner need access to the pg_class entry and not
just the pg_index record?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: When/if to Reindex

From

Tom Lane

Date:

24 August 2007, 18:44:12

Gregory Stark <stark@enterprisedb.com> writes:
> Should reindex be doing an in-place update?

Not if you'd like it to be crash-safe.

> Alternatively, why does the planner need access to the pg_class entry and not
> just the pg_index record?

For one thing, to find out how big the index is ... though if we could
get around that problem, it might indeed be possible to treat the
pg_index records as property of the parent table not the index itself,
which would give us license to read them without locking the index.

            regards, tom lane

Re: When/if to Reindex

From

Bruce Momjian

Date:

14 September 2007, 01:23:20

This has been saved for the 8.4 release:

    http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Tom Lane wrote:
> "Steven Flatt" <steven.flatt@gmail.com> writes:
> > So, can we simply trust what's in pg_class.relpages and ignore looking
> > directly at the index?
>
> No, we can't.  In the light of morning I remember more about the reason
> for the aforesaid patch: it's actually unsafe to read the pg_class row
> at all if you have not got lock on the index.  We are reading with
> SnapshotNow in order to be sure we see up-to-date info, and that means
> that a concurrent update of the row (eg, for REINDEX to report the new
> relfilenode) can have the following behavior:
>
> 1. REINDEX inserts the new modified version of the index's pg_class row.
>
> 2. Would-be reader process visits the new version of the pg_class row.
>    It's not committed yet, so we ignore it and continue scanning.
>
> 3. REINDEX commits.
>
> 4. Reader process visits the old version of the pg_class row.  It's
>    now committed dead, so we ignore it and continue scanning.
>
> 5. Reader process bombs out with a complaint about no pg_class row for
>    the index.
>
> So we really have to have the lock.
>
> > This is a fairly serious concern for us, that
> > reindex is blocking all readers of the parent table.
>
> I'm afraid you're kinda stuck: I don't see any fix that would be
> practical to put into 8.2, or even 8.3 considering that it's way too
> late to be thinking of implementing REINDEX CONCURRENTLY for 8.3.
>
> You might be able to work around it for now by faking such a reindex
> "by hand"; that is, create a duplicate new index under a different
> name using CREATE INDEX CONCURRENTLY, then exclusive-lock the table
> for just long enough to drop the old index and rename the new one
> to match.
>
> It's probably worth asking also how badly you really need routine
> reindexing.  Are you certain your app still needs that with 8.2,
> or is it a hangover from a few releases back?  Could more aggressive
> (auto)vacuuming provide a better solution?
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +