Thread: Extreme bloating of intarray GiST indexes

From: Josh Berkus
Hackers,

I'm currently looking at a database which has some extreme bloating of
intarray GiST indexes.  As in 1000% bloating in only a few months.  This
is not a particularly high-transaction-rate database, so the bloating is
a little surprising; I can only explain it if vacuum wasn't cleaning the
indexes at all, and maybe not even then.

We're currently instrumenting the database so that we can collect a bit
more data on update activity, but in the meantime, has anyone seen
anything like this?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Extreme bloating of intarray GiST indexes

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> I'm currently looking at a database which has some extreme bloating of
> intarray GiST indexes.  As in 1000% bloating in only a few months.  This
> is not a particularly high-transaction-rate database, so the bloating is
> a little surprising; I can only explain it if vacuum wasn't cleaning the
> indexes at all, and maybe not even then.

> We're currently instrumenting the database so that we can collect a bit
> more data on update activity, but in the meantime, has anyone seen
> anything like this?

1. What PG version?
2. If new enough to have contrib/pgstattuple, what does pgstattuple() have to say about the index?

I'm suspicious that this might be bloat caused by a bad picksplit function,
not from having a lot of dead entries in the index.  We've fixed several
other bogus picksplit functions in contrib in the past.
        regards, tom lane


Re: Extreme bloating of intarray GiST indexes

From: Josh Berkus
> 1. What PG version?

8.4.4, so it has the broken picksplit.

> 2. If new enough to have contrib/pgstattuple, what does pgstattuple()
>    have to say about the index?

Will check.

> I'm suspicious that this might be bloat caused by a bad picksplit function,
> not from having a lot of dead entries in the index.  We've fixed several
> other bogus picksplit functions in contrib in the past.

Yeah, I'll test updating to 8.4.8.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Extreme bloating of intarray GiST indexes

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
>> 1. What PG version?

> 8.4.4, so it has the broken picksplit.
> ...
> Yeah, I'll test updating to 8.4.8.

Uh, no, the picksplit bugs we fixed were in cube and seg --- there's
no reason to think that updating will help this.  But 8.4's pgstattuple
does appear to support gist indexes, so please run that and see what
you get.
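
A check along these lines would do it (assuming the contrib module's
functions have been loaded into the public schema; the index name is a
placeholder):

```sql
-- Assumes contrib/pgstattuple is installed; substitute the name of the
-- bloated GiST index in question.
SELECT * FROM public.pgstattuple('some_gist_index');
```

The free_percent and dead_tuple_percent columns distinguish dead-entry
bloat from pages that are simply mostly empty.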
        regards, tom lane


Re: Extreme bloating of intarray GiST indexes

From: Alexander Korotkov
On Thu, Apr 28, 2011 at 11:11 PM, Josh Berkus <josh@agliodbs.com> wrote:
> I'm currently looking at a database which has some extreme bloating of
> intarray GiST indexes.  As in 1000% bloating in only a few months.  This
> is not a particularly high-transaction-rate database, so the bloating is
> a little surprising; I can only explain it if vacuum wasn't cleaning the
> indexes at all, and maybe not even then.
>
> We're currently instrumenting the database so that we can collect a bit
> more data on update activity, but in the meantime, has anyone seen
> anything like this?

Which opclass is used for the GiST index: gist__int_ops or gist__intbig_ops?
Have you taken into account that gist__int_ops is very inefficient for large datasets?
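
One quick way to answer that is to look at the index definition, which
spells out any non-default opclass (index name hypothetical):

```sql
-- indexdef includes the opclass whenever it is not the default
-- for the indexed column's type.
SELECT indexdef
FROM pg_indexes
WHERE indexname = 'some_gist_index';
```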

----
With best regards,
Alexander Korotkov.

Re: Extreme bloating of intarray GiST indexes

From: Tom Lane
Alexander Korotkov <aekorotkov@gmail.com> writes:
> What opclass is used for GiST index: gist__int_ops or gist__intbig_ops?
> Do you take into account that gist__int_ops is very inefficient for large
> datasets?

I seem to recall some discussion recently about documenting where you
should cut over to using "gist__intbig_ops" --- IIRC, it wasn't all that
"big" by modern standards.  But it doesn't look like any such change made
it into the docs.  Should we reopen that discussion?
        regards, tom lane


Re: Extreme bloating of intarray GiST indexes

From: Alexander Korotkov
On Fri, Apr 29, 2011 at 1:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I seem to recall some discussion recently about documenting where you
> should cut over to using "gist__intbig_ops" --- IIRC, it wasn't all that
> "big" by modern standards.  But it doesn't look like any such change made
> it into the docs.  Should we reopen that discussion?

Actually, I don't see a reason to choose between gist__int_ops and gist__intbig_ops at all, because we could choose between full enumeration and a lossy bitmap on the fly, based on array length (once some length threshold is reached, the array is converted to a bitmap).  If this problem is urgent, I can write a patch with an opclass that seems to me more suitable as a default, when I have time for it.

----
With best regards,
Alexander Korotkov.

Re: Extreme bloating of intarray GiST indexes

From: Dimitri Fontaine
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Uh, no, the picksplit bugs we fixed were in cube and seg --- there's
> no reason to think that updating will help this.  But 8.4's pgstattuple
> does appear to support gist indexes, so please run that and see what
> you get.

There's also gevel, which I used to inspect GiST indexes during
development, and I found it pretty useful.  I don't know yet how it
compares to pgstattuple.
 http://www.sai.msu.su/~megera/wiki/Gevel

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Extreme bloating of intarray GiST indexes

From: Josh Berkus
Tom, Alexander,

We are using gist__intbig_ops, so that's not the issue.

Using pgstattuple might be a bit of a challenge.  The client doesn't
have it installed, and I can't pull it from Yum without also upgrading
PostgreSQL, since Yum doesn't stock old versions AFAIK.

Maybe we should consider making diagnostic utilities like this standard
with PostgreSQL?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Extreme bloating of intarray GiST indexes

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> Tom, Alexander,
> So we are using gist_intbig_ops, so that's not the issue.

> Using pgstattuple might be a bit of a challenge.  The client doesn't
> have it installed, and I can't pull it from Yum without also upgrading
> PostgreSQL, since Yum doesn't stock old versions AFAIK.

And updating Postgres to latest minor release is a bad thing why?
I can't believe you're not holding your client's feet to the fire
about running an old version, quite independently of the fact that
they need that contrib module.

But having said that, what you say makes no sense at all.  They have
intarray installed, so they have postgresql-contrib.  I know of no
Yum-accessible distributions in which intarray and pgstattuple wouldn't
be delivered in the same RPM.
        regards, tom lane


Re: Extreme bloating of intarray GiST indexes

From: Josh Berkus
Tom, Alexander,

So, some data:

corp=# select indexname,
pg_size_pretty(pg_relation_size(indexname::text)) as indexsize,
pg_size_pretty(pg_relation_size(tablename::text)) as tablesize
from pg_indexes where indexname like '%__listings_features'
order by pg_relation_size(indexname::text) desc;
       indexname        | indexsize | tablesize
------------------------+-----------+-----------
 idx__listings_features | 52 MB     | 20 MB


corp=# select * from pg_indexes where indexname = 'idx__listings_features';
-[ RECORD 1 ]--------------------------------------------------------------
schemaname | boards
tablename  | listings
indexname  | idx__listings_features
tablespace |
indexdef   | CREATE INDEX idx__listings_features ON listings USING gist
(features public.gist__intbig_ops) WHERE ((deleted_at IS NULL) AND
(status_id = 1))

corp=# select * from public.pgstattuple('idx__listings_features');
-[ RECORD 1 ]------+---------
table_len          | 54190080
tuple_count        | 7786
tuple_len          | 2117792
tuple_percent      | 3.91
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 49297536
free_percent       | 90.97
                     ^^^^^^^^^

Well, that explains the bloating.  Why all that free space, though?

Maybe autovac isn't running?

Nope:

corp=# select * from pg_stat_user_tables where relname = 'listings';

-[ RECORD 1 ]----+------------------------------
relid            | 110919
schemaname       | boards
relname          | listings
seq_scan         | 37492
seq_tup_read     | 328794009
idx_scan         | 33982523
idx_tup_fetch    | 302782765
n_tup_ins        | 19490
n_tup_upd        | 668445
n_tup_del        | 9826
n_tup_hot_upd    | 266661
n_live_tup       | 9664
n_dead_tup       | 776
last_vacuum      | 2010-07-25 19:46:45.922861+00
last_autovacuum  | 2011-04-30 17:30:40.555311+00
last_analyze     | 2010-07-25 19:46:45.922861+00
last_autoanalyze | 2011-04-28 23:49:54.968689+00

I don't know when stats were last reset (see, this is why we need a
reset timestamp!) so not sure how long those have been accumulating.
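
(Newer releases did grow exactly this: from 9.1 on, pg_stat_database
records a per-database reset time.)

```sql
-- Available in PostgreSQL 9.1 and later, not in the 8.4 server
-- under discussion here.
SELECT datname, stats_reset FROM pg_stat_database;
```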

(note: object names changed for confidentiality)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Extreme bloating of intarray GiST indexes

From: Josh Berkus
All,

Some trending data, since there's a lot of bloated indexes here:

select 'index_' || ( row_number() over ( order by free_percent desc ) )
as "index", *
from (
select (public.pgstattuple(indexname::text)).free_percent,
round(( n_tup_upd )::numeric / n_tup_ins, 2) as update_ratio,
round(( n_tup_hot_upd )::numeric / n_tup_ins, 2) as hot_update_ratio,
round(( n_tup_del * 100 )::numeric / n_tup_ins)  as delete_percent,
extract('days' from ( now() - last_autovacuum )) as days_since_vac,
n_live_tup / 1000 as "1K_tuples"
from pg_indexes join pg_stat_user_tables as tables
ON pg_indexes.schemaname = tables.schemaname
AND pg_indexes.tablename = tables.relname
where indexname like '%__listings_features'
) as idxstats
order by free_percent desc;

  index   | free_percent | update_ratio | hot_update_ratio | delete_percent | days_since_vac | 1K_tuples
----------+--------------+--------------+------------------+----------------+----------------+-----------
 index_1  |        90.97 |        34.30 |            13.68 |             50 |              3 |         9
 index_2  |        87.14 |        15.54 |             2.99 |             41 |              1 |         2
 index_3  |        85.08 |        10.86 |             1.42 |             35 |              5 |        77
 index_4  |        84.28 |        22.27 |             5.47 |             18 |              4 |       370
 index_5  |         82.4 |        13.65 |             3.89 |             24 |             49 |        82
 index_6  |         82.2 |        11.32 |             2.22 |             29 |              3 |        54
 index_7  |        80.97 |        14.38 |             2.95 |              6 |             14 |        17
 index_8  |        80.59 |        15.64 |             2.73 |             48 |              1 |        29
 index_9  |        78.43 |        12.81 |             2.97 |             21 |             37 |        42
 index_10 |        77.91 |        11.24 |             2.33 |             57 |              1 |        21
 index_11 |        77.26 |        12.73 |             2.00 |             18 |             11 |        55
 index_12 |        77.07 |        16.62 |             2.71 |             15 |              7 |         7
 index_13 |        76.56 |        12.20 |             3.20 |             11 |             11 |        18
 index_14 |        75.94 |        14.52 |             2.00 |             23 |             13 |        15
 index_15 |        74.73 |        14.94 |             2.68 |             17 |             11 |        34
 index_16 |        73.78 |        15.94 |             3.77 |             25 |              5 |         2
 index_17 |        73.54 |        50.19 |             4.26 |             14 |             14 |        10
 index_18 |        73.11 |        15.07 |             6.70 |             20 |             20 |         7
 index_19 |        72.82 |        10.26 |             4.63 |             19 |             11 |         7
 index_20 |        72.55 |        15.59 |             5.14 |             22 |              3 |        13
 index_21 |        68.52 |        19.69 |             5.49 |             13 |             11 |         3
 index_22 |        61.47 |        14.00 |             4.61 |             27 |             47 |         2
 index_23 |        45.06 |        18.10 |            11.65 |             19 |             96 |         2
 index_24 |        37.75 |         6.04 |             1.32 |             36 |             96 |        15
 index_25 |        36.87 |        15.32 |             3.71 |             10 |             96 |        17
 index_26 |        32.32 |         7.07 |             2.15 |             18 |             96 |        15
 index_27 |            0 |         6.28 |             0.74 |             10 |            316 |        48

This makes a pretty graph, but the only thing it tells me is that the
handful of non-bloated indexes are the ones which weren't vacuumed
recently, and either have very few rows or haven't gotten a lot of
updates.  This is not a surprise.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Extreme bloating of intarray GiST indexes

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> So, some data:

> corp=# select * from public.pgstattuple('idx__listings_features');
> -[ RECORD 1 ]------+---------
> table_len          | 54190080
> tuple_count        | 7786
> tuple_len          | 2117792
> tuple_percent      | 3.91
> dead_tuple_count   | 0
> dead_tuple_len     | 0
> dead_tuple_percent | 0
> free_space         | 49297536
> free_percent       | 90.97
>                      ^^^^^^^^^
> Well, that explains the bloating.  Why all that free space, though?

> Maybe autovac isn't running?

No, because you have under 10% dead tuples in the main table.
I think this is sufficient proof of the crummy-page-splits theory.
Can you provide the data in the column that's indexed?
        regards, tom lane


Re: Extreme bloating of intarray GiST indexes

From: Alexander Korotkov
I have another hypothesis about the cause of the index bloat.  AFAIK, the vacuum procedure on GiST doesn't make any storage-utilization guarantee.  For example, if only one live item remains on some page, then only that one item will be left on the page, i.e. there is no index reorganization during vacuum.  If there aren't many inserts into such pages later, they will stay bloated.

----
With best regards,
Alexander Korotkov.

Re: Extreme bloating of intarray GiST indexes

From: Tom Lane
Alexander Korotkov <aekorotkov@gmail.com> writes:
> I have another hypothesis about the cause of the index bloat.  AFAIK,
> the vacuum procedure on GiST doesn't make any storage-utilization
> guarantee.  For example, if only one live item remains on some page,
> then only that one item will be left on the page, i.e. there is no
> index reorganization during vacuum.  If there aren't many inserts into
> such pages later, they will stay bloated.

Possibly, but the same is true of btree indexes, and we very seldom see
cases where that's a serious issue.  In any case, this is all just
speculation without evidence --- we need to see actual data to figure
out what's going on.
        regards, tom lane


Re: Extreme bloating of intarray GiST indexes

From: Josh Berkus
> No, because you have under 10% dead tuples in the main table.
> I think this is sufficient proof of the crummy-page-splits theory.
> Can you provide the data in the column that's indexed?

Yes, I can.  Fortunately, none of it is identifiable.

Attached.  This is for the index which is 90% free space.

So, some other characteristics of this index:

* If you didn't notice earlier, it's a partial index.  The two columns
which determine the partial index change more often than the intarray
column.

* We've also determined some other unusual patterns from watching the
application:

(a) the "listings" table is a very wide table, with about 60 columns

(b) whenever the table gets updated, the application code updates these
60 columns in 4 sections.  So there's 4 updates to the same row, in a
single transaction.

(c) we *think* that other columns of the table, including other indexed
columns, are changed much more frequently than the intarray column is.
Currently doing analysis on that.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Attachment

Re: Extreme bloating of intarray GiST indexes

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> (c) we *think* that other columns of the table, including other indexed
> columns, are changed much more frequently than the intarray column is.
> Currently doing analysis on that.

Yeah, I noticed that your statistics for the table showed far more
updates than insertions or deletions.  If the intarray itself didn't
change often, that would result in lots of duplicate entries being made
in the index.  They'd get cleaned by vacuum eventually, but maybe not
fast enough to avoid the one-live-tuple-per-page syndrome that Alexander
was speculating about.
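
The proportion of HOT updates is visible in the same statistics view,
e.g. (table name taken from Josh's earlier output):

```sql
-- Share of updates that were HOT and therefore created no new
-- index entries.
SELECT n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE relname = 'listings';
```

In Josh's numbers that ratio is roughly 40% (266661 of 668445), so the
majority of updates do create dead index entries.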
        regards, tom lane


Re: Extreme bloating of intarray GiST indexes

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
>> Can you provide the data in the column that's indexed?

> Attached.  This is for the index which is 90% free space.

I tried loading this data in fresh, and then creating a gist__intbig_ops
index on it.  I got these pgstattuple numbers (in 8.4.8):

table_len          | 8806400
tuple_count        | 15005
tuple_len          | 4081360
tuple_percent      | 46.35
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 4088100
free_percent       | 46.42

On the other hand, loading the data with a pre-existing empty index gave

table_len          | 7798784
tuple_count        | 15005
tuple_len          | 4081360
tuple_percent      | 52.33
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 3183672
free_percent       | 40.82

Neither of those numbers is great, and it's a bit surprising that
CREATE INDEX produces a result notably worse than incremental loading;
but still a darn sight better than 90% free space.  So I think probably
the update pattern has a lot to do with this.

> * If you didn't notice earlier, it's a partial index.  The two columns
> which determine the partial index change more often than the intarray
> column.

Yeah, with only about half of the table actually indexed, since you
showed only 7786 index entries in your results.  But unless there's
reason to think the indexed and unindexed entries are substantially
different in the intarray column, this is probably not very relevant.

> * We've also determined some other unusual patterns from watching the
> application:

> (a) the "listings" table is a very wide table, with about 60 columns

> (b) whenever the table gets updated, the application code updates these
> 60 columns in 4 sections.  So there's 4 updates to the same row, in a
> single transaction.

Hmm.  That is going to lead to four dead index entries for every live
one (unless some of the updates are HOT, which won't happen if you're
changing any indexed columns).  VACUUM will get back the space
eventually, but not before you've caused some index bloat.

I tried doing something similar to my test table:

contrib_regression=# alter table listings add column junk int;
ALTER TABLE
contrib_regression=# create index li on listings(junk);
CREATE INDEX
contrib_regression=# begin;
BEGIN
contrib_regression=# update listings set junk=1;
UPDATE 15005
contrib_regression=# update listings set junk=2;
UPDATE 15005
contrib_regression=# update listings set junk=3;
UPDATE 15005
contrib_regression=# update listings set junk=4;
UPDATE 15005
contrib_regression=# commit;
COMMIT
contrib_regression=# vacuum listings;
VACUUM

and then got these pgstattuple numbers:

table_len          | 39460864
tuple_count        | 15005
tuple_len          | 4081360
tuple_percent      | 10.34
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 32923872
free_percent       | 83.43

which is up in the same ballpark with your problem.  Now probably your
client's app is not updating all rows at once, but still this is a
pretty wasteful update pattern.  Is there a reason not to update all
the columns in a single update?
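
A sketch of the fix, with hypothetical column names --- one statement
means one new row version per transaction instead of four:

```sql
-- Instead of four UPDATEs touching disjoint column groups within one
-- transaction, set all the columns in a single statement:
UPDATE listings
SET section1_col = $1,
    section2_col = $2,
    section3_col = $3,
    section4_col = $4
WHERE id = $5;
```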

If you can't change the app, I'd suggest more aggressive autovacuuming
as the least painful fix.
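
Since 8.4 that can be done per table with storage parameters, e.g. (the
threshold here is illustrative, not a recommendation):

```sql
-- Vacuum after ~5% of rows change instead of the default 20%.
ALTER TABLE listings SET (autovacuum_vacuum_scale_factor = 0.05);
```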
        regards, tom lane


Re: Extreme bloating of intarray GiST indexes

From: Josh Berkus
On 5/4/11 11:29 AM, Tom Lane wrote:
> which is up in the same ballpark with your problem.  Now probably your
> client's app is not updating all rows at once, but still this is a
> pretty wasteful update pattern.  Is there a reason not to update all
> the columns in a single update?

Yeah, really crappy application code.  Discussing it with app developers
now ...

> If you can't change the app, I'd suggest more aggressive autovacuuming
> as the least painful fix.

Will test that.  It's not clear that vacuuming is helping at all.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Extreme bloating of intarray GiST indexes

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> On 5/4/11 11:29 AM, Tom Lane wrote:
>> If you can't change the app, I'd suggest more aggressive autovacuuming
>> as the least painful fix.

> Will test that.  It's not clear that vacuuming is helping at all.

Well, you realize of course that you need a REINDEX to get the index
size back down to a sane range.  Autovacuum may or may not be able to
keep it from creeping back up ... but vacuum definitely won't remove
existing bloat.
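
Note that REINDEX locks out writes to the table for the duration; on a
busy table, an online alternative is to build a replacement index
concurrently and swap it in (using the definition from Josh's earlier
message):

```sql
-- Online rebuild: CREATE INDEX CONCURRENTLY avoids blocking writes,
-- at the cost of a slower build.
CREATE INDEX CONCURRENTLY idx__listings_features_new
    ON listings USING gist (features public.gist__intbig_ops)
    WHERE deleted_at IS NULL AND status_id = 1;
DROP INDEX idx__listings_features;
ALTER INDEX idx__listings_features_new RENAME TO idx__listings_features;
```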
        regards, tom lane