Thread: rtree/gist index taking enormous amount of space in 8.2.3
Hi,
In version 8.1.5, I have an rtree index on a 1.5 GB table. The size of this index is 500 MB. After migrating to 8.2.3, the size of this index has increased to 35GB. I’ve dropped are recreated the index and got the same result. In 8.2.3 the index type is gist, does this have something to do with it? At 35GB we have seen a decrease in performance. Any help or hints are appreciated.
CREATE INDEX binloc_boxrange
ON featureloc
USING rtree
(boxrange(fmin, fmax));
CREATE TABLE featureloc
(
featureloc_id serial NOT NULL,
feature_id integer NOT NULL,
srcfeature_id integer,
fmin integer,
is_fmin_partial boolean NOT NULL DEFAULT false,
fmax integer,
is_fmax_partial boolean NOT NULL DEFAULT false,
strand smallint,
phase integer,
residue_info text,
locgroup integer NOT NULL DEFAULT 0,
rank integer NOT NULL DEFAULT 0,
….
CREATE OR REPLACE FUNCTION boxrange(integer, integer)
RETURNS box AS
'SELECT box (create_point(0, $1), create_point($2,500000000))'
LANGUAGE 'sql' IMMUTABLE;
ALTER FUNCTION boxrange(integer, integer) OWNER TO cjm;
CREATE OR REPLACE FUNCTION create_point(integer, integer)
RETURNS point AS
'SELECT point ($1, $2)'
LANGUAGE 'sql' VOLATILE;
ALTER FUNCTION create_point(integer, integer) OWNER TO cjm;
Thanks,
Tom
"Dolafi, Tom" <dolafit@janelia.hhmi.org> writes: > In version 8.1.5, I have an rtree index on a 1.5 GB table. The size of > this index is 500 MB. After migrating to 8.2.3, the size of this index > has increased to 35GB. I've dropped are recreated the index and got the > same result. In 8.2.3 the index type is gist, does this have something > to do with it? We dropped rtree in 8.2 on the strength of experiments that seemed to show gist was always better, but you seem to have an outlier case... Can you tell us something about the distribution of the data values (fmin and fmax)? Is there anything particularly magic about the two constants you're using in the boxes? I suppose you've hit some bad corner case in the gist box opclass, but it's not clear what. regards, tom lane
min(fmin) | max(fmin) | avg(fmin) 1 | 55296469 | 11423945 min(fmax) | max(fmax) | avg(fmax) 18 | 55553288 | 11424491 There are 5,704,211 rows in the table. This application has been inherited by us. As far as I can tell the magic of the two constants seem to represent the notion of infinity. Thanks, Tom -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Thursday, June 28, 2007 12:09 PM To: Dolafi, Tom Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] rtree/gist index taking enormous amount of space in 8.2.3 "Dolafi, Tom" <dolafit@janelia.hhmi.org> writes: > In version 8.1.5, I have an rtree index on a 1.5 GB table. The size of > this index is 500 MB. After migrating to 8.2.3, the size of this index > has increased to 35GB. I've dropped are recreated the index and got the > same result. In 8.2.3 the index type is gist, does this have something > to do with it? We dropped rtree in 8.2 on the strength of experiments that seemed to show gist was always better, but you seem to have an outlier case... Can you tell us something about the distribution of the data values (fmin and fmax)? Is there anything particularly magic about the two constants you're using in the boxes? I suppose you've hit some bad corner case in the gist box opclass, but it's not clear what. regards, tom lane
Dolafi, Tom wrote: > min(fmin) | max(fmin) | avg(fmin) > 1 | 55296469 | 11423945 > > min(fmax) | max(fmax) | avg(fmax) > 18 | 55553288 | 11424491 > > There are 5,704,211 rows in the table. When you're looking for weird index problems, it's more interesting to know if there are certain numbers that occur a LOT. From your statistics above, each number occurs about 10 times in the table. But do some particular numbers occur thousands,or even millions, of times? Here is a query that will print a list of the highest-occuring values. You might expect a few occurances of 20, and maybe30, but if you have thousands or millions of occurances of certain numbers, then that can screw up an index. select fmax, c from (select fmax, count(fmax) as c from your_table group by fmax) as foo where c > 3 order by c desc; Craig
"Dolafi, Tom" <dolafit@janelia.hhmi.org> writes: > min(fmin) | max(fmin) | avg(fmin) > 1 | 55296469 | 11423945 > min(fmax) | max(fmax) | avg(fmax) > 18 | 55553288 | 11424491 OK, I was able to reproduce a problem after making the further guess that fmax is usually a little bit greater than fmin. The attached test script generates an rtree index of around 800 pages on 8.1.9, and the index build time is about 6 seconds on my machine. On CVS HEAD, the script generates a gist index of over 30000 pages and the build time is over 60 seconds. Since I'm using random() the numbers move around a bit, but they're consistently awful. I experimented with a few other distributions, such as fmin and fmax chosen independently in the same range, and saw gist build time usually better than rtree and index size only somewhat larger, so this particular distribution apparently fools gist_box_picksplit rather badly. The problem seems nonlinear too --- I had originally tried it with 1 million test rows instead of 100000, and gave up waiting for the index build after more than an hour. Oleg, Teodor, can this be improved? regards, tom lane drop table featureloc; CREATE TABLE featureloc ( fmin integer, fmax integer ); insert into featureloc select r1, r1 + 1 + random() * 1000 from (select 1 + random() * 55000000 as r1, 1 + random() * 55000000 as r2 from generate_series(1,100000) offset 0) as ss; CREATE OR REPLACE FUNCTION boxrange(integer, integer) RETURNS box AS 'SELECT box (point(0, $1), point($2, 500000000))' LANGUAGE 'sql' STRICT IMMUTABLE; CREATE INDEX binloc_boxrange ON featureloc USING rtree (boxrange(fmin, fmax)); vacuum verbose featureloc;
The data is not distributed well... Top 20 occurrences of fmin and fmax: fmin | count ----------+-------- 0 | 214476 19281576 | 2870 2490005 | 2290 1266332 | 2261 15539680 | 2086 11022233 | 2022 25559658 | 1923 3054411 | 1906 10237885 | 1890 13827272 | 1876 19187021 | 1847 18101335 | 1845 1518230 | 1843 21199488 | 1842 1922518 | 1826 1216144 | 1798 25802126 | 1762 8307335 | 1745 21271866 | 1736 8361667 | 1721 fmax | count ----------+-------- 25 | 197551 21272002 | 547 21271988 | 335 21271969 | 321 6045781 | 247 1339301 | 243 21669151 | 235 7779506 | 232 2571422 | 229 7715946 | 228 27421323 | 222 7048089 | 221 87364 | 219 13656535 | 217 26034147 | 214 19184612 | 213 7048451 | 213 21668877 | 213 6587492 | 212 9484598 | 212 Also, out of 5.7 million rows there are 1.6 million unique fmin and 1.6 million unique fmax values. Thanks, Tom -----Original Message----- From: Craig James [mailto:craig_james@emolecules.com] Sent: Friday, June 29, 2007 12:14 PM To: Dolafi, Tom Cc: Tom Lane; pgsql-performance@postgresql.org Subject: Re: [PERFORM] rtree/gist index taking enormous amount of space in 8.2.3 Dolafi, Tom wrote: > min(fmin) | max(fmin) | avg(fmin) > 1 | 55296469 | 11423945 > > min(fmax) | max(fmax) | avg(fmax) > 18 | 55553288 | 11424491 > > There are 5,704,211 rows in the table. When you're looking for weird index problems, it's more interesting to know if there are certain numbers that occur a LOT. From your statistics above, each number occurs about 10 times in the table. But do some particular numbers occur thousands, or even millions, of times? Here is a query that will print a list of the highest-occuring values. You might expect a few occurances of 20, and maybe 30, but if you have thousands or millions of occurances of certain numbers, then that can screw up an index. select fmax, c from (select fmax, count(fmax) as c from your_table group by fmax) as foo where c > 3 order by c desc; Craig
Thanks for looking into this and reproducing a similar result. The index took 6 hours to complete on a 1.5GB table resulting in 35GB of storage, and it took 36 hours to vacuum... I'm patient :-) In the mean time I've dropped the index which has resulted in overall performance gain on queries against the table, but we have not tested the part of the application which would utilize this index. - Tom -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Friday, June 29, 2007 1:58 PM To: Dolafi, Tom Cc: pgsql-performance@postgresql.org; Oleg Bartunov; Teodor Sigaev Subject: Re: [PERFORM] rtree/gist index taking enormous amount of space in 8.2.3 "Dolafi, Tom" <dolafit@janelia.hhmi.org> writes: > min(fmin) | max(fmin) | avg(fmin) > 1 | 55296469 | 11423945 > min(fmax) | max(fmax) | avg(fmax) > 18 | 55553288 | 11424491 OK, I was able to reproduce a problem after making the further guess that fmax is usually a little bit greater than fmin. The attached test script generates an rtree index of around 800 pages on 8.1.9, and the index build time is about 6 seconds on my machine. On CVS HEAD, the script generates a gist index of over 30000 pages and the build time is over 60 seconds. Since I'm using random() the numbers move around a bit, but they're consistently awful. I experimented with a few other distributions, such as fmin and fmax chosen independently in the same range, and saw gist build time usually better than rtree and index size only somewhat larger, so this particular distribution apparently fools gist_box_picksplit rather badly. The problem seems nonlinear too --- I had originally tried it with 1 million test rows instead of 100000, and gave up waiting for the index build after more than an hour. Oleg, Teodor, can this be improved? regards, tom lane drop table featureloc; CREATE TABLE featureloc ( fmin integer, fmax integer ); insert into featureloc select r1, r1 + 1 + random() * 1000 from (select 1 + random() * 55000000 as r1, 1 + random() * 55000000 as r2 from generate_series(1,100000) offset 0) as ss; CREATE OR REPLACE FUNCTION boxrange(integer, integer) RETURNS box AS 'SELECT box (point(0, $1), point($2, 500000000))' LANGUAGE 'sql' STRICT IMMUTABLE; CREATE INDEX binloc_boxrange ON featureloc USING rtree (boxrange(fmin, fmax)); vacuum verbose featureloc;
"Dolafi, Tom" <dolafit@janelia.hhmi.org> writes: > The data is not distributed well... Can you show us min, max, and avg of fmax minus fmin? I'd like to check my guess about that being a fairly narrow range. regards, tom lane
"Dolafi, Tom" <dolafit@janelia.hhmi.org> writes: > In the mean time I've dropped the index which has resulted in overall > performance gain on queries against the table, but we have not tested > the part of the application which would utilize this index. I noted that with the same (guessed-at) distribution of fmin/fmax, the index size remains reasonable if you change the derived boxes to CREATE OR REPLACE FUNCTION boxrange(integer, integer) RETURNS box AS 'SELECT box (point($1, $1), point($2, $2))' LANGUAGE 'sql' STRICT IMMUTABLE; which makes sense from the point of view of geometric intuition: instead of a bunch of very tall, mostly very narrow, mostly overlapping boxes, you have a bunch of small square boxes spread out along a line. So it stands to reason that a geometrically-motivated index structure would work a lot better on the latter. I don't know though whether your queries can be adapted to work with this. What was the index being used for, exactly? regards, tom lane
(fmax-fmin)... min | max | avg ---------+---------+---------------------- 1 | 2278225 | 546 I noticed 3000 occurrences where fmax is less than fmin. I excluded these values to get the min difference between the two. Also, there are 20 "invalid"/"bogus" rows with negative values which were excluded from the queries. Thanks, Tom -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Friday, June 29, 2007 2:30 PM To: Dolafi, Tom Cc: Craig James; pgsql-performance@postgresql.org Subject: Re: [PERFORM] rtree/gist index taking enormous amount of space in 8.2.3 "Dolafi, Tom" <dolafit@janelia.hhmi.org> writes: > The data is not distributed well... Can you show us min, max, and avg of fmax minus fmin? I'd like to check my guess about that being a fairly narrow range. regards, tom lane
The application need is to determine genomic features present in a user-defined portion of a chromosome. My guess is that features (boxes) are overlapping along a line (chromosome), and there is a need to represent them as being stacked. Since I'm not certain of its exact use, I've emailed the application owner to find the motivation as to why a geometric index structure is used, and why the boxes are tall and overlapping. As a side note, the data model for our application is based on a popular bioinformatics open source project called chado. Thanks, Tom -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Friday, June 29, 2007 2:38 PM To: Dolafi, Tom Cc: pgsql-performance@postgresql.org; Oleg Bartunov; Teodor Sigaev Subject: Re: [PERFORM] rtree/gist index taking enormous amount of space in 8.2.3 "Dolafi, Tom" <dolafit@janelia.hhmi.org> writes: > In the mean time I've dropped the index which has resulted in overall > performance gain on queries against the table, but we have not tested > the part of the application which would utilize this index. I noted that with the same (guessed-at) distribution of fmin/fmax, the index size remains reasonable if you change the derived boxes to CREATE OR REPLACE FUNCTION boxrange(integer, integer) RETURNS box AS 'SELECT box (point($1, $1), point($2, $2))' LANGUAGE 'sql' STRICT IMMUTABLE; which makes sense from the point of view of geometric intuition: instead of a bunch of very tall, mostly very narrow, mostly overlapping boxes, you have a bunch of small square boxes spread out along a line. So it stands to reason that a geometrically-motivated index structure would work a lot better on the latter. I don't know though whether your queries can be adapted to work with this. What was the index being used for, exactly? regards, tom lane
Thank you for the patch. The index size is back down to 500MB and there are no performance issues with queries against the table. -----Original Message----- From: Teodor Sigaev [mailto:teodor@sigaev.ru] Sent: Friday, July 06, 2007 8:08 AM To: Tom Lane Cc: Dolafi, Tom; pgsql-performance@postgresql.org; Oleg Bartunov Subject: Re: [PERFORM] rtree/gist index taking enormous amount of space in 8.2.3 > Oleg, Teodor, can this be improved? Attached patch improves creation of index for similar corner cases. And split algorithm still demonstrates O(n). It possible to make fallback to Guttman's split algorithm in corner cases, but I don't like this: used linear algorithm is much faster and usually has better performance in search. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/