Thread: Re: FW: index bloat
"David Esposito" <pgsql-general@esposito.newnetco.com> writes: >> BTW, the tail of the VACUUM VERBOSE output ought to have >> something about >> overall usage of the FSM --- what does that look like? > INFO: free space map: 528 relations, 172357 pages stored; 170096 total > pages needed > DETAIL: Allocated FSM size: 10000 relations + 1000000 pages = 6511 kB > shared memory. OK, so you are definitely not running out of FSM slots... I spent some time this morning trying to reproduce the bloating behavior, without any success. I made a table with a plain "serial primary key" column, and ran a test program that did insert 10000 rows update about 10% of the rows at random if more than 500000 rows, delete the oldest 10000 vacuum repeat which is intended to emulate your daily cycle with about one-tenth as much data (just to keep the runtime reasonable). I didn't see any bloat at all: the index growth looked like INFO: index "t_pkey" now contains 450000 row versions in 1374 pages INFO: index "t_pkey" now contains 460000 row versions in 1404 pages INFO: index "t_pkey" now contains 470000 row versions in 1435 pages INFO: index "t_pkey" now contains 480000 row versions in 1465 pages INFO: index "t_pkey" now contains 490000 row versions in 1496 pages INFO: index "t_pkey" now contains 500000 row versions in 1527 pages INFO: index "t_pkey" now contains 500000 row versions in 1557 pages INFO: index "t_pkey" now contains 500000 row versions in 1588 pages INFO: index "t_pkey" now contains 500000 row versions in 1588 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1590 pages INFO: index "t_pkey" now contains 500000 row versions in 1590 pages INFO: index "t_pkey" now contains 500000 row versions in 1590 pages INFO: index "t_pkey" now contains 500000 row versions in 1590 pages INFO: index "t_pkey" now contains 500000 row versions in 1590 pages INFO: index "t_pkey" now contains 500000 row versions in 1590 pages and it never grew any larger than that even in several hundred "days". This test was against CVS tip, but I'm pretty certain the relevant algorithms were all the same in 7.4. So there is some important aspect in which this test does not replicate the conditions your index is seeing. Can you think of any way that I've missed capturing your usage pattern? regards, tom lane
First, thank you for spending so much time on this issue.

Second, I think I might have found a good lead ... I replicated the test
you described below (minus the updating of 10% of the records) ... I've
attached the PHP script (I'm more proficient at writing PHP than a shell
script; you should be able to run it from the command line
('php -q bloat.test.php') as long as you've created a db named 'test'
first).

You're right that the index behavior is well-behaved with the cycle of
INSERT / DELETE / VACUUM ... But while it was running, I started a second
session to the database after the 60th iteration and did

    BEGIN;
    SELECT COUNT(*) FROM bigboy;
    ROLLBACK;

During my transaction, I saw the relpages charge upwards steadily until I
issued the ROLLBACK ... but even after the ROLLBACK (and even after
closing the second DB connection), the pages weren't reclaimed on the next
VACUUM.

This sounds exactly like what may be happening to us ... This is a 24x7
app, so during the course of the VACUUM there are probably some open
transactions ... Over the past 4 or 5 weeks, that could explain the
steady, unlimited growth ...

Another funny thing to note: I was able to cause this same behavior if I
did the following:

    psql template1
    BEGIN;
    SELECT COUNT(*) FROM pg_database;
    ROLLBACK;
    \q

FYI, I'm using the 8.0.1 RPM build for RHEL3 (2PGDG)

-dave
Attachment: bloat.test.php
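[One way to watch that relpages growth from another session is to poll
pg_class.  This query is a sketch, not part of the thread or the attached
script; 'bigboy' is the table the script creates, and relpages is only
refreshed by each VACUUM the test loop issues.]

    -- Re-run from a separate session while the test loop is going.
    -- relkind = 'r' is the table itself, 'i' its indexes.
    SELECT relname, relkind, relpages
    FROM pg_class
    WHERE relname LIKE 'bigboy%'
    ORDER BY relname;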
"David Esposito" <pgsql-general@esposito.newnetco.com> writes: > You're right that the index behavior is well-behaved with the cycle of > INSERT / DELETE / VACUUM ... But while it was running, I started a second > session to the database after the 60th iteration and did > BEGIN; > SELECT COUNT(*) FROM bigboy; > ROLLBACK; > During my transaction, I saw the relpages charge upwards steadily until I > issued the ROLLBACK .. Sure. VACUUM can't remove rows that might still be visible to any open transaction, so sitting with an open transaction is going to effectively disable VACUUM. Client apps that hold open transactions for long intervals are bad news (but this is true for many databases not just PG). > but even after the ROLLBACK (and even after closing > the second DB connection), the pages weren't reclaimed on the next VACUUM Plain VACUUM doesn't try very hard to shorten the table physically, so that's not surprising either. But the internal free space should get picked up at this point. This does not strike me as an explanation for ongoing bloat. There are always going to be a few tuples not immediately reclaimable, but normally that just factors in as part of the steady-state overhead. Your VACUUM VERBOSE traces showed DETAIL: 2 dead row versions cannot be removed yet. DETAIL: 1 dead row versions cannot be removed yet. so you're not having any major problem with not-yet-removable rows. So I'm still pretty baffled :-( regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Wednesday, July 13, 2005 2:10 PM
> To: David Esposito
>
> Plain VACUUM doesn't try very hard to shorten the table physically, so
> that's not surprising either.  But the internal free space should get
> picked up at this point.
>
> This does not strike me as an explanation for ongoing bloat.  There
> are always going to be a few tuples not immediately reclaimable, but
> normally that just factors in as part of the steady-state overhead.
> Your VACUUM VERBOSE traces showed
>
> DETAIL: 2 dead row versions cannot be removed yet.
> DETAIL: 1 dead row versions cannot be removed yet.
>
> so you're not having any major problem with not-yet-removable rows.
>
> So I'm still pretty baffled :-(

Hmm, if I keep running the following query while the test program is going
(giving it a few iterations to rest between executions), the steady-state
usage of the indexes seems to go up ... It doesn't happen every time you
run the query, but if you do it 10 times, it seems to go up at least once
every few times you run it ... And the usage keeps charging upwards long
after the UPDATE query finishes (at least 3 or 4 iterations afterwards,
until it levels off again) ... It would seem like the steady state should
be reached after the first couple of runs and then never creep up any
further, because there should be enough slack in the index, right?

    UPDATE bigboy SET creation_date = CURRENT_TIMESTAMP
    WHERE creation_date BETWEEN CURRENT_TIMESTAMP - INTERVAL '15 seconds'
                            AND CURRENT_TIMESTAMP - INTERVAL '5 seconds';

Is there any way to disassemble an index (either through some fancy SQL
query or by running the actual physical file through a tool) to get an
idea of where the slack could be accumulating? Like somehow be able to
determine that all of the oldest pages have a 0.01% population?

At this point I realize I'm grasping at straws and you're welcome to give
up on my problem at any time ... you've given it a good run ... :-)

-dave
"David Esposito" <pgsql-general@esposito.newnetco.com> writes: > Hmm, if I keep running the following query while the test program is going > (giving it a few iterations to rest between executions), the steady-state > usage of the indexes seems to go up ... it doesn't happen every time you run > the query, but if you do it 10 times, it seems to go up at least once every > few times you run it .. And the usage keeps charging upwards long after the > UPDATE query finishes (at least 3 or 4 iterations afterwards until it levels > off again) ... It would seem like the steady-state should be reached after > the first couple of runs and then never creep up any further because there > should be enough slack in the index, right? > UPDATE bigboy SET creation_date = CURRENT_TIMESTAMP > WHERE creation_date BETWEEN CURRENT_TIMESTAMP - INTERVAL '15 seconds' > AND CURRENT_TIMESTAMP - INTERVAL '5 seconds'; Hmm, this is preferentially touching stuff near the right end of the index, ie, it's going to bloat the pages associated with higher keys. As I understand your usage of these indexes, pages generally only get reclaimed off the left end (as records get old enough to be moved to archival storage). So if you were to repeat this test for long enough for the data to cycle all the way through the table and out again (50 days in your real-world usage) then the extra space would be evenly distributed and the usage would reach a steady state. The testing I've been doing so far involves UPDATEs that touch a uniformly distributed subset of the table --- maybe that's the aspect that is failing to match your reality. Do you mostly update recently-added rows? Can you quantify the effect at all? > Is there any way to disassemble an index (either through some fancy SQL > query or by running the actual physical file through a tool) to get an idea > on where the slack could be accumulating? No such code exists AFAIK, though the idea is sounding pretty attractive at the moment ;-). You could get some crude numbers by adding debug printouts to btbulkdelete() ... regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Wednesday, July 13, 2005 3:45 PM
>
> Hmm, this is preferentially touching stuff near the right end of the
> index, ie, it's going to bloat the pages associated with higher keys.
> As I understand your usage of these indexes, pages generally only get
> reclaimed off the left end (as records get old enough to be moved to
> archival storage).  So if you were to repeat this test for long enough
> for the data to cycle all the way through the table and out again
> (50 days in your real-world usage) then the extra space would be
> evenly distributed and the usage would reach a steady state.

So I guess the test of this would be whether the index usage increases
after the 50-day mark passes ... Our last REINDEX was on June 5th, so that
should mean that things would level off on July 25th ...

What you're saying is that when VACUUM runs nightly, it clears out all of
the dead row versions from the indexes ... but since the bulk of my dead
row versions are all clustered around the right side of the index, that
means that those pages become very sparsely populated with data ... As the
index ages (assuming time-zero is the day I did my REINDEX), the index
becomes exclusively populated with those sparse pages ... And the way new
keys are inserted into the index is to always add them to a new page
(where the 'new' page is either a truly new page, or a page that is
completely empty), rather than using up some of the fragmented space
within existing pages? And this must differ from the way that space is
allocated for the actual table record, otherwise I would be witnessing the
same bloat with the table entity itself?

Out of curiosity, why is it necessary for MVCC to create copies of the
index keys if the value of the indexed field doesn't change with an UPDATE
to the underlying record? (Perhaps answering my own question: because the
entry in the index is the key and the rowid that it's pointing to ... so
when you UPDATE the row, you have to create a new index entry to point at
the new rowid.)

> The testing I've been doing so far involves UPDATEs that touch a
> uniformly distributed subset of the table --- maybe that's the aspect
> that is failing to match your reality.  Do you mostly update
> recently-added rows?  Can you quantify the effect at all?

This is probably a good point ... The histogram of UPDATE activity to the
table probably looks something like

    90% - records created < 24 hours ago
     7% - records created 1 - 2 days ago
     2% - records created 2 - 7 days ago
     1% - records older than 7 days

I don't have a way to easily confirm this quantitatively, but based on
what we see in our reports and what we recommend to customers (as far as
when they can consider a mailing "closed"), we usually recommend
considering the 48-hour mark as the end of the vast majority of the
activity ...

Thanks again,
Dave
"David Esposito" <pgsql-general@esposito.newnetco.com> writes: > ... and the way new keys are > inserted into the index is to always add them to a new page (where the 'new' > page is either a truly new page, or a page that is completely empty), rather > than using up some of the fragmented space within existing pages? Well, they are added to a page associated with the key range they are in. You can't just stuff keys into any randomly chosen index page; otherwise the index wouldn't be readily searchable, which is more or less the entire point of an index. >> The testing I've been doing so far involves UPDATEs that touch a >> uniformly distributed subset of the table --- maybe that's the aspect >> that is failing to match your reality. Do you mostly update >> recently-added rows? Can you quantify the effect at all? > This is probably a good point ... The histogram of UPDATE activity to the > table probably looks something like > 90% - records created < 24 hours ago > 7% - records created 1 - 2 days ago > 2% - records created 2 - 7 days ago > 1% - records older than 7 days Ah; now I think things are starting to make sense. We had already estimated that about 10% of the records are updated each day, but what this says is that the majority of those updates happen to records that have never yet seen a VACUUM --- which is to say the last 2% (1/50th) of the table. So roughly speaking, the average record is updated about five times in its first day of existence, and on average less than once during its remaining 49 days of existence? Given those sorts of numbers, what we've got is that by the time the nightly VACUUM runs, the leading-edge part of the index (today's entries) has been bloated to about 5x what its minimum size would be. And what this means is that the steady-state situation will be that sort of density throughout the whole index. The difficulty is that the btree code will only reclaim entirely-empty index pages for reuse. Given that an index on integer keys can fit about 500 keys per page, even a 5x bloated index has still got about 100 keys per page, making it fairly unlikely for any of the pages to go completely empty until you purge that whole range of keys at the end of the record's life. This is a situation where it'd be nice if VACUUM could merge adjacent partly-full index pages so as to reclaim space before the pages go entirely empty on their own. We looked at doing that when the btree compaction code was first written, but set it aside as too hard because of various concurrency problems. (I think this is actually the first case I've seen reported from the field where that decision looks wrong. You've got a pretty odd update distribution here --- not so much that the skew to recent entries is surprising, as that the index keys are all on non-updating fields and so there's no spreading out of the index ranges affected by the updates.) Barring someone stepping up and making page merging happen (don't hold your breath), it seems you've basically got two alternatives: 1. VACUUM more often than once per day. 2. Live with a steady-state index size that's about 5x the minimum. I'd recommend taking a hard look at choice #1 --- you could experiment with using the "vacuum cost delay" settings to tamp down VACUUM's I/O demand to the point where it doesn't kill interactive performance, and then run it maybe every hour or two on the table(s) where you have this problem. (Memo to hackers: this is a fairly interesting case for autovacuum I think. 
The overall update rate on the table is not high enough to trigger frequent vacuums, unless autovacuum is somehow made aware that particular index key ranges are getting hit more heavily than others. Maybe this says that autovac needs to be tracking btree index page splits, or some such statistic, more than just overall updates.) regards, tom lane
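[A sketch of what choice #1 might look like using the cost-based vacuum
delay settings introduced in 8.0; the specific values are illustrative
assumptions, not tuned recommendations, and the hourly schedule would come
from cron or a similar external scheduler.]

    -- Throttle this session's VACUUM, then vacuum only the problem table;
    -- run this every hour or two rather than once nightly.
    SET vacuum_cost_delay = 10;   -- sleep 10 ms whenever the cost limit is reached
    SET vacuum_cost_limit = 200;  -- accumulated I/O cost that triggers the sleep
    VACUUM ANALYZE bigboy;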
On Wed, Jul 13, 2005 at 05:39:33PM -0400, Tom Lane wrote:
> (Memo to hackers: this is a fairly interesting case for autovacuum
> I think.  The overall update rate on the table is not high enough to
> trigger frequent vacuums, unless autovacuum is somehow made aware that
> particular index key ranges are getting hit more heavily than others.
> Maybe this says that autovac needs to be tracking btree index page
> splits, or some such statistic, more than just overall updates.)

Interesting.  I wonder exactly what metric we want to track in the first
place.  Just the number of page splits does not necessarily mean a lot --
it could be a table that is under heavy insertion, with no dead tuples.

I guess we could do something with the ratio of dead tuples vs. new tuples
(in this case it seems to be close to 1, rather than 0, which would be the
case I mention above), times the number of btree page splits since the
last vacuum.  If the number is "high", then we need to vacuum.  I guess
the threshold needs to be related to average key length.

-- 
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"Take out the book that your religion considers the right one for finding
the prayer that brings peace to your soul.  Then reboot the computer and
see if it works."  (Carlos Duclós)
This week is looking busy for me, but hopefully I'll be able to play
around with various vacuuming frequencies for this table ... Thanks for
all of your help; I'll report on my progress.

-Dave

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Wednesday, July 13, 2005 5:40 PM
>
> 1. VACUUM more often than once per day.
>
> 2. Live with a steady-state index size that's about 5x the minimum.
>
> I'd recommend taking a hard look at choice #1 --- you could experiment
> with using the "vacuum cost delay" settings to tamp down VACUUM's I/O
> demand to the point where it doesn't kill interactive performance, and
> then run it maybe every hour or two on the table(s) where you have this
> problem.