Thread: automatic REINDEX-ing

automatic REINDEX-ing

From
Joao Ferreira gmail
Date:
Hello all

[[[ while dealing with a disk size problem I realised my REINDEX cron
script was not really being called every week :(   so... ]]]

I executed REINDEX by hand and the disk ocupation imediatelly dropped 6
Giga...!!!

is there a way to configure postgres to automatically execute the needed
REINDEXING (on indexes and tables) for a given database....

something similar to auto-vacuum... I guess....

thx
joao



Re: automatic REINDEX-ing

From
Lennin Caro
Date:
you can use a cron job


--- On Tue, 8/12/08, Joao Ferreira gmail <joao.miguel.c.ferreira@gmail.com> wrote:

> From: Joao Ferreira gmail <joao.miguel.c.ferreira@gmail.com>
> Subject: [GENERAL] automatic REINDEX-ing
> To: "pgsql-general" <pgsql-general@postgresql.org>
> Date: Tuesday, August 12, 2008, 3:13 PM
> Hello all
>
> [[[ while dealing with a disk size problem I realised my
> REINDEX cron
> script was not really being called every week :(   so...
> ]]]
>
> I executed REINDEX by hand and the disk ocupation
> imediatelly dropped 6
> Giga...!!!
>
> is there a way to configure postgres to automatically
> execute the needed
> REINDEXING (on indexes and tables) for a given database....
>
> something similar to auto-vacuum... I guess....
>
> thx
> joao
>
>
>
> --
> Sent via pgsql-general mailing list
> (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general





Re: automatic REINDEX-ing

From
Tom Lane
Date:
Joao Ferreira gmail <joao.miguel.c.ferreira@gmail.com> writes:
> I executed REINDEX by hand and the disk ocupation imediatelly dropped 6
> Giga...!!!

> is there a way to configure postgres to automatically execute the needed
> REINDEXING (on indexes and tables) for a given database....

Generally speaking, there shouldn't be a need for "automatic
reindexing".  What the above suggests is that you need more aggressive
routine vacuuming, so that you don't get into this situation in the
first place.

BTW, "more aggressive routine vacuuming" does NOT mean "use vacuum full".
Vacuum full tends to make index bloat worse, not better.

            regards, tom lane

Re: automatic REINDEX-ing

From
Joao Ferreira gmail
Date:
On Tue, 2008-08-12 at 11:53 -0400, Tom Lane wrote:
> TW, "more aggressive routine vacuuming" does NOT mean "use vacuum
> full".
> Vacuum full tends to make index bloat worse, not better.
>
>                         regards, tom lane
>
Ok. so what does it mean ?



I'm a bit lost here.  I'm currently executing VACUUM FULL _and_ REINDEX
(tbls & idxs) every week.

Should I keep the REINDEX and drop VACUUM FULL ?

How do I iterate to a better approach ?

thanks.

joao



Re: automatic REINDEX-ing

From
"Scott Marlowe"
Date:
On Tue, Aug 12, 2008 at 10:04 AM, Joao Ferreira gmail
<joao.miguel.c.ferreira@gmail.com> wrote:
>
> On Tue, 2008-08-12 at 11:53 -0400, Tom Lane wrote:
>> TW, "more aggressive routine vacuuming" does NOT mean "use vacuum
>> full".
>> Vacuum full tends to make index bloat worse, not better.
>>
>>                         regards, tom lane
>>
> Ok. so what does it mean ?
>
>
>
> I'm a bit lost here.  I'm currently executing VACUUM FULL _and_ REINDEX
> (tbls & idxs) every week.
>
> Should I keep the REINDEX and drop VACUUM FULL ?
>
> How do I iterate to a better approach ?

It's better to run REGULAR vacuums more often than to vacuum full OR
reindex OR both.

If your machine doesn't have the I/O bandwidth to withstand being
vacuumed during the day then you either have to have a fairly large
free space map and vacuum off hours or buy a machine with more I/O
bandwidth.

With the sleep settings in vacuum and autovacuum you can usually get
away with autovacuum running during the day.

Re: automatic REINDEX-ing

From
William Garrison
Date:
Tom Lane wrote:
> Joao Ferreira gmail <joao.miguel.c.ferreira@gmail.com> writes:
>
>> I executed REINDEX by hand and the disk ocupation imediatelly dropped 6
>> Giga...!!!
>>
>
>
>> is there a way to configure postgres to automatically execute the needed
>> REINDEXING (on indexes and tables) for a given database....
>>
>
> Generally speaking, there shouldn't be a need for "automatic
> reindexing".  What the above suggests is that you need more aggressive
> routine vacuuming, so that you don't get into this situation in the
> first place.
>
> BTW, "more aggressive routine vacuuming" does NOT mean "use vacuum full".
> Vacuum full tends to make index bloat worse, not better.
>
>             regards, tom lane
>
>

So now that we know what that term does not mean, what does it mean?
Just doing it more often by adjusting the autovacuum parameters?

Re: automatic REINDEX-ing

From
"Scott Marlowe"
Date:
On Tue, Aug 12, 2008 at 10:09 AM, William Garrison
<postgres@mobydisk.com> wrote:
> Tom Lane wrote:
>>
>> Joao Ferreira gmail <joao.miguel.c.ferreira@gmail.com> writes:
>>
>>>
>>> I executed REINDEX by hand and the disk ocupation imediatelly dropped 6
>>> Giga...!!!
>>>
>>
>>
>>>
>>> is there a way to configure postgres to automatically execute the needed
>>> REINDEXING (on indexes and tables) for a given database....
>>>
>>
>> Generally speaking, there shouldn't be a need for "automatic
>> reindexing".  What the above suggests is that you need more aggressive
>> routine vacuuming, so that you don't get into this situation in the
>> first place.
>>
>> BTW, "more aggressive routine vacuuming" does NOT mean "use vacuum full".
>> Vacuum full tends to make index bloat worse, not better.
>>
>
> So now that we know what that term does not mean, what does it mean?  Just
> doing it more often by adjusting the autovacuum parameters?

exactly.  Or running cronned vacuums on particular tables if they need
it more often.

Re: automatic REINDEX-ing

From
Bill Moran
Date:
In response to Joao Ferreira gmail <joao.miguel.c.ferreira@gmail.com>:

>
> On Tue, 2008-08-12 at 11:53 -0400, Tom Lane wrote:
> > TW, "more aggressive routine vacuuming" does NOT mean "use vacuum
> > full".
> > Vacuum full tends to make index bloat worse, not better.
> >
> >                         regards, tom lane
> >
> Ok. so what does it mean ?
>
> I'm a bit lost here.  I'm currently executing VACUUM FULL _and_ REINDEX
> (tbls & idxs) every week.
>
> Should I keep the REINDEX and drop VACUUM FULL ?

Don't "vacuum full" as part of regular maintenance.  Do plain vacuum.
If that's unable to keep up with the database bloat, then do it more
often.  Whether you use autovacuum or cron isn't as important as
whether you're vacuuming often enough.

Personally, I like to put explicit VACUUM commands in my applications
after operations that are known to bloat tables.  This isn't always
possible as it sometimes introduces a performance issue, but I use it
where it doesn't cause problem as it solves the bloat problem at the
point of creation.

REINDEX is normally not needed, although there _are_ some corner cases
that seem to require it.  One particular corner case is VACUUM FULL,
which tends to bloat indexes.

If you're using vacuum on a schedule appropriate to your database
activity, you'll probably not need reindex.  If you do find that your
use is one of those corner cases where reindex is necessary, then go
ahead and do it.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

Re: automatic REINDEX-ing

From
Ow Mun Heng
Date:
On Tue, 2008-08-12 at 08:38 -0700, Lennin Caro wrote:
> you can use a cron job
>

I have my cron setup to do database wide vacuums each night and it
usually takes ~between 4-6 hours on ~200G DB size.

On days where there is huge activity, it can drag on for like 15+ hours.

I've recently dropped all my indexes and started to only rebuild _some_
needed ones.


What's the method for looking at index bloats anyway?

Re: automatic REINDEX-ing

From
Kevin Hunter
Date:
At 12:04p -0400 on Tue, 12 Aug 2008, Joao Ferreira wrote:
> I'm a bit lost here.  I'm currently executing VACUUM FULL _and_ REINDEX
> (tbls & idxs) every week.
>
> Should I keep the REINDEX and drop VACUUM FULL ?
>
> How do I iterate to a better approach ?

It might behoove you to read the Notes section of the docs about vacuuming:

http://www.postgresql.org/docs/current/static/sql-vacuum.html

Roughly, VACUUM simply reclaims/frees disk space, while VACUUM FULL
additionally reorganizes disk usage.  I'm still don't know *why* this
leads to index bloat, however, just that it does.  I must defer that
explication to a Postgres guru.

Kevin

Re: automatic REINDEX-ing

From
Martijn van Oosterhout
Date:
On Wed, Aug 13, 2008 at 12:41:41PM -0400, Kevin Hunter wrote:
> Roughly, VACUUM simply reclaims/frees disk space, while VACUUM FULL
> additionally reorganizes disk usage.  I'm still don't know *why* this
> leads to index bloat, however, just that it does.  I must defer that
> explication to a Postgres guru.

Because VACUUM FULL needs to move stuff around in the table which means it
need to mess around with the indexes (adding new entries). Ordinary
VACUUM only needs to delete stuff so doesn't cause anywhere near as
many problems.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Attachment

Re: automatic REINDEX-ing

From
Joao Ferreira gmail
Date:
> Because VACUUM FULL needs to move stuff around in the table which means it
> need to mess around with the indexes (adding new entries). Ordinary
> VACUUM only needs to delete stuff so doesn't cause anywhere near as
> many problems.

so.... in  the event that I really end up running VACUUM FULL once in a
while, it seems to me it is a good idea to run REINDEX.

Is this correct ?

thx
joao


>
> Have a nice day,



Re: automatic REINDEX-ing

From
Kevin Hunter
Date:
At 12:53p -0400 on Wed, 13 Aug 2008, Martijn van Oosterhout wrote:
> On Wed, Aug 13, 2008 at 12:41:41PM -0400, Kevin Hunter wrote:
>> Roughly, VACUUM simply reclaims/frees disk space, while VACUUM FULL
>> additionally reorganizes disk usage.  I'm still don't know *why* this
>> leads to index bloat, however, just that it does.  I must defer that
>> explication to a Postgres guru.
>
> Because VACUUM FULL needs to move stuff around in the table which means it
> need to mess around with the indexes (adding new entries). Ordinary
> VACUUM only needs to delete stuff so doesn't cause anywhere near as
> many problems.

Hmm.  I get the reorganization bit, but so what?  Since VACUUM FULL
already has an exclusive lock, what prevents it from updating the
indexes in-place to point to the new physical disk location?  Why does
it need to create extra bloat?

Or, failing that, what's the reason to not issue a REINDEX CONCURRENTLY
automatically after a VACUUM FULL (or something to that effect)?

Kevin

Re: automatic REINDEX-ing

From
Martijn van Oosterhout
Date:
On Wed, Aug 13, 2008 at 01:16:03PM -0400, Kevin Hunter wrote:
> Hmm.  I get the reorganization bit, but so what?  Since VACUUM FULL
> already has an exclusive lock, what prevents it from updating the
> indexes in-place to point to the new physical disk location?  Why does
> it need to create extra bloat?

AIUI, people know VACUUM FULL sucks and that in the cases where it
really helps CLUSTER is faster anyway and doesn't have the index
problems. The TODO list reference several discussions on the topic.

> Or, failing that, what's the reason to not issue a REINDEX CONCURRENTLY
> automatically after a VACUUM FULL (or something to that effect)?

Or how about not doing VACUUM FULL at all. It's not a command that
should be run regularly in most situations.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Attachment

Re: automatic REINDEX-ing

From
Kevin Hunter
Date:
At 2:44p -0400 on Wed, 13 Aug 2008, Martijn van Oosterhout wrote:
> Or how about not doing VACUUM FULL at all. It's not a command that
> should be run regularly in most situations.

Heh, fair enough.  My questions were rather academic anyway.  :-)

I appreciate it, Martijn!

Kevin

Re: automatic REINDEX-ing

From
Tino Wildenhain
Date:
Martijn van Oosterhout wrote:
...
> AIUI, people know VACUUM FULL sucks and that in the cases where it
> really helps CLUSTER is faster anyway and doesn't have the index
> problems. The TODO list reference several discussions on the topic.
>
>> Or, failing that, what's the reason to not issue a REINDEX CONCURRENTLY
>> automatically after a VACUUM FULL (or something to that effect)?
>
> Or how about not doing VACUUM FULL at all. It's not a command that
> should be run regularly in most situations.

Which makes me think if the solution would be to just run CLUSTER under
the hood when VACUUM FULL is requested. Would that introduce any
other problems?

Cheers
Tino

Attachment

Re: automatic REINDEX-ing

From
Alvaro Herrera
Date:
Tino Wildenhain escribió:

> Which makes me think if the solution would be to just run CLUSTER under
> the hood when VACUUM FULL is requested. Would that introduce any
> other problems?

The difference is that CLUSTER requires double the disk space in table +
indexes.  VACUUM FULL has no such requirement.

A possibly approach would be to do an ANALYZE (to have fresh stats about
dead tuple density), and do a CLUSTER if the density is too high.

There has been talk about rewriting VACUUM FULL anyway; it's complex
code and it introduces extra complications in other parts of code that
would be otherwise unneeded, e.g. HOT.  I have no idea what a rewritten
VACUUM FULL would look like, though.

Another thing we should do in this area is rewrite CLUSTER to use a
seqscan + sort instead of indexscan when the heap/index order
correlation is low.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: automatic REINDEX-ing

From
Tom Lane
Date:
Kevin Hunter <hunteke@earlham.edu> writes:
> Hmm.  I get the reorganization bit, but so what?  Since VACUUM FULL
> already has an exclusive lock, what prevents it from updating the
> indexes in-place to point to the new physical disk location?  Why does
> it need to create extra bloat?

Crash safety.  The basic process in vacuum full is:

* detect that there is room in a lower-numbered page to move the
physically last tuple to.

* put a copy of the last tuple there.  Mark the last tuple itself
as deleted by the VACUUM.

* make new index entries for that copy.

* repeat until there's no longer any free space that can hold the
last undeleted tuple.

At this point there are two copies of every moved-down tuple, and
two sets of index entries for it.  The inserted copies are marked
as inserted by the VACUUM FULL.  Therefore, a crash anywhere along
this part of the process will leave them having the same status
as tuples inserted by any other failed transaction, ie, not good.
The original copies are deleted by the vacuum, but again, since
it crashed, the deletion is a no-op.  Result: you still have one
and only one good copy of each tuple.

Next: commit the VACUUM FULL transaction and start a new one
to continue working in.  The commit is atomic and instantaneous
via the same WAL magic that makes every other commit atomic
and instantaneous.

If we crash at some point after the commit, the state is that
all the moved-down tuple copies are good, all the original ones
are dead.  Therefore we still have one and only one good copy
of each row, and there's no data corruption.

Next, vacuum full runs around and deletes the index entries for
the original copies of the moved tuples.  Crash partway through
here doesn't matter, because it doesn't actually matter whether
a dead tuple has all, some, or none of its expected index entries.
It's not going to be returned by any indexscan anyway.  The next
vacuum will finish up the job of removing index entries for any
dead tuples it finds.

Once we know there are no index entries left for the dead tuples,
we can physically remove the tuples.  Once that's done, we
can truncate away any wholly-free pages at the end of the table.

So in short, at no point in a vacuum sequence can you lose data
due to a crash.  That wouldn't be true if we overwrote valid
data or index entries.

Plain vacuum just removes index entries for dead tuples and then the
tuples themselves.  It doesn't try to move anything across page
boundaries.  So while it frees space that can be re-used for subsequent
insertions, it will very seldom be able to shrink the table length ---
usually there will still be some live tuples near the end of the table.

Vacuum full's problem with indexes is comparable: the structure of the
index is largely dictated by the index access method, and there's no
very good way to force pages at the physical end of the index to
become empty.  So the excess index entries that it temporarily adds
tend to lead to long-term index bloat.  The longer you wait between
vacuum fulls, the more tuples have to be moved to compact the table,
and the worse the index bloat gets.

            regards, tom lane

Re: automatic REINDEX-ing

From
Alvaro Herrera
Date:
Tom Lane escribió:

> Crash safety.  The basic process in vacuum full is:
>
> * detect that there is room in a lower-numbered page to move the
> physically last tuple to.
>
> * put a copy of the last tuple there.  Mark the last tuple itself
> as deleted by the VACUUM.

Hmm, I've seen people doing piecemeal table shrinking with no exclusive
locking by looking up the live tuples in the last page, doing no-op
UPDATEs until they moves those tuples to an earlier page with free
space, and then hoping that plain VACUUM will detect that this page is
empty and truncating it.  I wonder if we could write some (semi-)
automatic way of doing this.  The benefit is that the page ends up
truncated to a reasonable of pages, reducing bloat, without requiring
the table to be exclusive-locked.

The main problem with this approach is that it is fairly painful to deal
with CTID values.  The operator support for them is pretty poor.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support