Thread: Re: FW: index bloat
"David Esposito" <pgsql-general@esposito.newnetco.com> writes: >> BTW, the tail of the VACUUM VERBOSE output ought to have >> something about >> overall usage of the FSM --- what does that look like? > INFO: free space map: 528 relations, 172357 pages stored; 170096 total > pages needed > DETAIL: Allocated FSM size: 10000 relations + 1000000 pages = 6511 kB > shared memory. OK, so you are definitely not running out of FSM slots... I spent some time this morning trying to reproduce the bloating behavior, without any success. I made a table with a plain "serial primary key" column, and ran a test program that did insert 10000 rows update about 10% of the rows at random if more than 500000 rows, delete the oldest 10000 vacuum repeat which is intended to emulate your daily cycle with about one-tenth as much data (just to keep the runtime reasonable). I didn't see any bloat at all: the index growth looked like INFO: index "t_pkey" now contains 450000 row versions in 1374 pages INFO: index "t_pkey" now contains 460000 row versions in 1404 pages INFO: index "t_pkey" now contains 470000 row versions in 1435 pages INFO: index "t_pkey" now contains 480000 row versions in 1465 pages INFO: index "t_pkey" now contains 490000 row versions in 1496 pages INFO: index "t_pkey" now contains 500000 row versions in 1527 pages INFO: index "t_pkey" now contains 500000 row versions in 1557 pages INFO: index "t_pkey" now contains 500000 row versions in 1588 pages INFO: index "t_pkey" now contains 500000 row versions in 1588 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1589 pages INFO: index "t_pkey" now contains 500000 row versions in 1590 pages INFO: index "t_pkey" now contains 500000 row versions in 1590 pages INFO: index "t_pkey" now contains 500000 row versions in 1590 pages INFO: index "t_pkey" now contains 500000 row versions in 1590 pages INFO: index "t_pkey" now contains 500000 row versions in 1590 pages INFO: index "t_pkey" now contains 500000 row versions in 1590 pages and it never grew any larger than that even in several hundred "days". This test was against CVS tip, but I'm pretty certain the relevant algorithms were all the same in 7.4. So there is some important aspect in which this test does not replicate the conditions your index is seeing. Can you think of any way that I've missed capturing your usage pattern? regards, tom lane
First, thank you for spending so much time on this issue.

Second, I think I might have found a good lead ... I replicated the test
you described below (minus the updating of 10% of the records) ... I've
attached the PHP script (I'm more proficient at writing PHP than a shell
script; you should be able to run it from the command line
('php -q bloat.test.php') as long as you've created a db named 'test'
first).

You're right that the index behavior is well-behaved with the cycle of
INSERT / DELETE / VACUUM ... But while it was running, I started a second
session to the database after the 60th iteration and did

    BEGIN;
    SELECT COUNT(*) FROM bigboy;
    ROLLBACK;

During my transaction, I saw the relpages charge upwards steadily until I
issued the ROLLBACK ... but even after the ROLLBACK (and even after
closing the second DB connection), the pages weren't reclaimed on the next
VACUUM.

This sounds exactly like what may be happening to us ... This is a 24x7
app, so during the course of the VACUUM there are probably some open
transactions ... Over the past 4 or 5 weeks, that could explain the
steady, unlimited growth ...

Another funny thing to note: I was able to cause this same behavior if I
did the following:

    psql template1
    BEGIN;
    SELECT COUNT(*) FROM pg_database;
    ROLLBACK;
    \q

FYI, I'm using the 8.0.1 RPM build for RHEL3 (2PGDG)

-dave
Attachment: bloat.test.php
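[One way to watch that relpages growth from another session is to poll
pg_class.  This query is a sketch, not part of the thread or the attached
script; 'bigboy' is the table the script creates, and relpages is only
refreshed by each VACUUM the test loop issues.]

    -- Re-run from a separate session while the test loop is going.
    -- relkind = 'r' is the table itself, 'i' its indexes.
    SELECT relname, relkind, relpages
    FROM pg_class
    WHERE relname LIKE 'bigboy%'
    ORDER BY relname;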
"David Esposito" <pgsql-general@esposito.newnetco.com> writes: > You're right that the index behavior is well-behaved with the cycle of > INSERT / DELETE / VACUUM ... But while it was running, I started a second > session to the database after the 60th iteration and did > BEGIN; > SELECT COUNT(*) FROM bigboy; > ROLLBACK; > During my transaction, I saw the relpages charge upwards steadily until I > issued the ROLLBACK .. Sure. VACUUM can't remove rows that might still be visible to any open transaction, so sitting with an open transaction is going to effectively disable VACUUM. Client apps that hold open transactions for long intervals are bad news (but this is true for many databases not just PG). > but even after the ROLLBACK (and even after closing > the second DB connection), the pages weren't reclaimed on the next VACUUM Plain VACUUM doesn't try very hard to shorten the table physically, so that's not surprising either. But the internal free space should get picked up at this point. This does not strike me as an explanation for ongoing bloat. There are always going to be a few tuples not immediately reclaimable, but normally that just factors in as part of the steady-state overhead. Your VACUUM VERBOSE traces showed DETAIL: 2 dead row versions cannot be removed yet. DETAIL: 1 dead row versions cannot be removed yet. so you're not having any major problem with not-yet-removable rows. So I'm still pretty baffled :-( regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Wednesday, July 13, 2005 2:10 PM
> To: David Esposito
>
> Plain VACUUM doesn't try very hard to shorten the table physically, so
> that's not surprising either.  But the internal free space should get
> picked up at this point.
>
> This does not strike me as an explanation for ongoing bloat.  There
> are always going to be a few tuples not immediately reclaimable, but
> normally that just factors in as part of the steady-state overhead.
> Your VACUUM VERBOSE traces showed
>
> DETAIL: 2 dead row versions cannot be removed yet.
> DETAIL: 1 dead row versions cannot be removed yet.
>
> so you're not having any major problem with not-yet-removable rows.
>
> So I'm still pretty baffled :-(

Hmm, if I keep running the following query while the test program is going
(giving it a few iterations to rest between executions), the steady-state
usage of the indexes seems to go up ... It doesn't happen every time you
run the query, but if you do it 10 times, it seems to go up at least once
every few times you run it ... And the usage keeps charging upwards long
after the UPDATE query finishes (at least 3 or 4 iterations afterwards,
until it levels off again) ... It would seem like the steady state should
be reached after the first couple of runs and then never creep up any
further, because there should be enough slack in the index, right?

    UPDATE bigboy SET creation_date = CURRENT_TIMESTAMP
    WHERE creation_date BETWEEN CURRENT_TIMESTAMP - INTERVAL '15 seconds'
                            AND CURRENT_TIMESTAMP - INTERVAL '5 seconds';

Is there any way to disassemble an index (either through some fancy SQL
query or by running the actual physical file through a tool) to get an
idea of where the slack could be accumulating? Like somehow be able to
determine that all of the oldest pages have a 0.01% population?

At this point I realize I'm grasping at straws and you're welcome to give
up on my problem at any time ... you've given it a good run ... :-)

-dave
"David Esposito" <pgsql-general@esposito.newnetco.com> writes: > Hmm, if I keep running the following query while the test program is going > (giving it a few iterations to rest between executions), the steady-state > usage of the indexes seems to go up ... it doesn't happen every time you run > the query, but if you do it 10 times, it seems to go up at least once every > few times you run it .. And the usage keeps charging upwards long after the > UPDATE query finishes (at least 3 or 4 iterations afterwards until it levels > off again) ... It would seem like the steady-state should be reached after > the first couple of runs and then never creep up any further because there > should be enough slack in the index, right? > UPDATE bigboy SET creation_date = CURRENT_TIMESTAMP > WHERE creation_date BETWEEN CURRENT_TIMESTAMP - INTERVAL '15 seconds' > AND CURRENT_TIMESTAMP - INTERVAL '5 seconds'; Hmm, this is preferentially touching stuff near the right end of the index, ie, it's going to bloat the pages associated with higher keys. As I understand your usage of these indexes, pages generally only get reclaimed off the left end (as records get old enough to be moved to archival storage). So if you were to repeat this test for long enough for the data to cycle all the way through the table and out again (50 days in your real-world usage) then the extra space would be evenly distributed and the usage would reach a steady state. The testing I've been doing so far involves UPDATEs that touch a uniformly distributed subset of the table --- maybe that's the aspect that is failing to match your reality. Do you mostly update recently-added rows? Can you quantify the effect at all? > Is there any way to disassemble an index (either through some fancy SQL > query or by running the actual physical file through a tool) to get an idea > on where the slack could be accumulating? No such code exists AFAIK, though the idea is sounding pretty attractive at the moment ;-). You could get some crude numbers by adding debug printouts to btbulkdelete() ... regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Wednesday, July 13, 2005 3:45 PM
>
> Hmm, this is preferentially touching stuff near the right end of the
> index, ie, it's going to bloat the pages associated with higher keys.
> As I understand your usage of these indexes, pages generally only get
> reclaimed off the left end (as records get old enough to be moved to
> archival storage).  So if you were to repeat this test for long enough
> for the data to cycle all the way through the table and out again
> (50 days in your real-world usage) then the extra space would be
> evenly distributed and the usage would reach a steady state.

So I guess the test of this would be whether the index usage increases
after the 50-day mark passes ... Our last REINDEX was on June 5th, so that
should mean that things would level off on July 25th ...

What you're saying is that when VACUUM runs nightly, it clears out all of
the dead row versions from the indexes ... but since the bulk of my dead
row versions are all clustered around the right side of the index, that
means that those pages become very sparsely populated with data ... As the
index ages (assuming time-zero is the day I did my REINDEX), the index
becomes exclusively populated with those sparse pages ... And the way new
keys are inserted into the index is to always add them to a new page
(where the 'new' page is either a truly new page, or a page that is
completely empty), rather than using up some of the fragmented space
within existing pages? And this must differ from the way that space is
allocated for the actual table record, otherwise I would be witnessing the
same bloat with the table entity itself?

Out of curiosity, why is it necessary for MVCC to create copies of the
index keys if the value of the indexed field doesn't change with an UPDATE
to the underlying record? (Perhaps answering my own question: because the
entry in the index is the key and the rowid that it's pointing to ... so
when you UPDATE the row, you have to create a new index entry to point at
the new rowid.)

> The testing I've been doing so far involves UPDATEs that touch a
> uniformly distributed subset of the table --- maybe that's the aspect
> that is failing to match your reality.  Do you mostly update
> recently-added rows?  Can you quantify the effect at all?

This is probably a good point ... The histogram of UPDATE activity to the
table probably looks something like

    90% - records created < 24 hours ago
     7% - records created 1 - 2 days ago
     2% - records created 2 - 7 days ago
     1% - records older than 7 days

I don't have a way to easily confirm this quantitatively, but based on
what we see in our reports and what we recommend to customers (as far as
when they can consider a mailing "closed"), we usually recommend
considering the 48-hour mark as the end of the vast majority of the
activity ...

Thanks again,
Dave
"David Esposito" <pgsql-general@esposito.newnetco.com> writes: > ... and the way new keys are > inserted into the index is to always add them to a new page (where the 'new' > page is either a truly new page, or a page that is completely empty), rather > than using up some of the fragmented space within existing pages? Well, they are added to a page associated with the key range they are in. You can't just stuff keys into any randomly chosen index page; otherwise the index wouldn't be readily searchable, which is more or less the entire point of an index. >> The testing I've been doing so far involves UPDATEs that touch a >> uniformly distributed subset of the table --- maybe that's the aspect >> that is failing to match your reality. Do you mostly update >> recently-added rows? Can you quantify the effect at all? > This is probably a good point ... The histogram of UPDATE activity to the > table probably looks something like > 90% - records created < 24 hours ago > 7% - records created 1 - 2 days ago > 2% - records created 2 - 7 days ago > 1% - records older than 7 days Ah; now I think things are starting to make sense. We had already estimated that about 10% of the records are updated each day, but what this says is that the majority of those updates happen to records that have never yet seen a VACUUM --- which is to say the last 2% (1/50th) of the table. So roughly speaking, the average record is updated about five times in its first day of existence, and on average less than once during its remaining 49 days of existence? Given those sorts of numbers, what we've got is that by the time the nightly VACUUM runs, the leading-edge part of the index (today's entries) has been bloated to about 5x what its minimum size would be. And what this means is that the steady-state situation will be that sort of density throughout the whole index. The difficulty is that the btree code will only reclaim entirely-empty index pages for reuse. Given that an index on integer keys can fit about 500 keys per page, even a 5x bloated index has still got about 100 keys per page, making it fairly unlikely for any of the pages to go completely empty until you purge that whole range of keys at the end of the record's life. This is a situation where it'd be nice if VACUUM could merge adjacent partly-full index pages so as to reclaim space before the pages go entirely empty on their own. We looked at doing that when the btree compaction code was first written, but set it aside as too hard because of various concurrency problems. (I think this is actually the first case I've seen reported from the field where that decision looks wrong. You've got a pretty odd update distribution here --- not so much that the skew to recent entries is surprising, as that the index keys are all on non-updating fields and so there's no spreading out of the index ranges affected by the updates.) Barring someone stepping up and making page merging happen (don't hold your breath), it seems you've basically got two alternatives: 1. VACUUM more often than once per day. 2. Live with a steady-state index size that's about 5x the minimum. I'd recommend taking a hard look at choice #1 --- you could experiment with using the "vacuum cost delay" settings to tamp down VACUUM's I/O demand to the point where it doesn't kill interactive performance, and then run it maybe every hour or two on the table(s) where you have this problem. (Memo to hackers: this is a fairly interesting case for autovacuum I think. 
The overall update rate on the table is not high enough to trigger frequent vacuums, unless autovacuum is somehow made aware that particular index key ranges are getting hit more heavily than others. Maybe this says that autovac needs to be tracking btree index page splits, or some such statistic, more than just overall updates.) regards, tom lane
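[A sketch of what choice #1 might look like using the cost-based vacuum
delay settings introduced in 8.0; the specific values are illustrative
assumptions, not tuned recommendations, and the hourly schedule would come
from cron or a similar external scheduler.]

    -- Throttle this session's VACUUM, then vacuum only the problem table;
    -- run this every hour or two rather than once nightly.
    SET vacuum_cost_delay = 10;   -- sleep 10 ms whenever the cost limit is reached
    SET vacuum_cost_limit = 200;  -- accumulated I/O cost that triggers the sleep
    VACUUM ANALYZE bigboy;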
On Wed, Jul 13, 2005 at 05:39:33PM -0400, Tom Lane wrote:
> (Memo to hackers: this is a fairly interesting case for autovacuum
> I think.  The overall update rate on the table is not high enough to
> trigger frequent vacuums, unless autovacuum is somehow made aware that
> particular index key ranges are getting hit more heavily than others.
> Maybe this says that autovac needs to be tracking btree index page
> splits, or some such statistic, more than just overall updates.)

Interesting.  I wonder exactly what metric we want to track in the first
place.  Just the number of page splits does not necessarily mean a lot --
it could be a table that is under heavy insertion, with no dead tuples.

I guess we could do something with the ratio of dead tuples vs. new tuples
(in this case it seems to be close to 1, rather than 0, which would be the
case I mention above), times the number of btree page splits since the
last vacuum.  If the number is "high", then we need to vacuum.  I guess
the threshold needs to be related to average key length.

-- 
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"Take out the book that your religion considers the right one for finding
the prayer that brings peace to your soul.  Then reboot the computer and
see if it works."  (Carlos Duclós)
This week is looking busy for me, but hopefully I'll be able to play
around with various vacuuming frequencies for this table ... Thanks for
all of your help; I'll report on my progress.

-Dave

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Wednesday, July 13, 2005 5:40 PM
>
> 1. VACUUM more often than once per day.
>
> 2. Live with a steady-state index size that's about 5x the minimum.
>
> I'd recommend taking a hard look at choice #1 --- you could experiment
> with using the "vacuum cost delay" settings to tamp down VACUUM's I/O
> demand to the point where it doesn't kill interactive performance, and
> then run it maybe every hour or two on the table(s) where you have this
> problem.