Thread: Creation of tsearch2 index is very slow
Hello!

I noticed that the creation of a GIST index for tsearch2 takes very long - about 20 minutes. CPU utilization is 100 %, the resulting index file size is ~25 MB. Is this behaviour normal?

Full text columns:   title, author_list
tsearch2 word lists: fti_title, fti_author_list
tsearch2 indexes:    idx_fti_title, idx_fti_author_list

The table has 700,000 records. When I create a normal B-Tree index on the same column for testing purposes, it works quite fast (approx. 30 seconds). The columns that should be indexed are small, only about 10 words on average.

System specs: Athlon64 X2 3800+, 2 GB RAM, PostgreSQL 8.1.2, Windows XP SP2.

I've never noticed this problem before, so could it possibly be related to v8.1.2? Or is there something in my configuration or table definition that causes this sluggishness?

Thanks very much in advance for your help!

- Stephan

This is the table definition:
-----------------------------------------------------------------
CREATE TABLE publications
(
  id int4 NOT NULL DEFAULT nextval('publications_id_seq'::regclass),
  publication_type_id int4 NOT NULL DEFAULT 0,
  keyword text NOT NULL,
  mdate date,
  "year" date,
  title text,
  fti_title tsvector,
  author_list text,
  fti_author_list tsvector,
  overview_timestamp timestamp,
  overview_xml text,
  CONSTRAINT publications_pkey PRIMARY KEY (keyword)
    USING INDEX TABLESPACE dblp_index,
  CONSTRAINT publications_publication_type_id_fkey FOREIGN KEY (publication_type_id)
    REFERENCES publication_types (id) MATCH SIMPLE
    ON UPDATE RESTRICT ON DELETE RESTRICT,
  CONSTRAINT publications_year_check CHECK (
    date_part('month'::text, "year") = 1::double precision
    AND date_part('day'::text, "year") = 1::double precision)
)
WITHOUT OIDS;

CREATE INDEX fki_publications_publication_type_id
  ON publications USING btree (publication_type_id) TABLESPACE dblp_index;

CREATE INDEX idx_fti_author_list
  ON publications USING gist (fti_author_list) TABLESPACE dblp_index;

CREATE INDEX idx_fti_title
  ON publications USING gist (fti_title) TABLESPACE dblp_index;

CREATE INDEX idx_publications_year
  ON publications USING btree ("year") TABLESPACE dblp_index;

CREATE INDEX idx_publications_year_part
  ON publications USING btree (date_part('year'::text, "year")) TABLESPACE dblp_index;

CREATE TRIGGER tsvectorupdate_all
  BEFORE INSERT OR UPDATE ON publications
  FOR EACH ROW EXECUTE PROCEDURE multi_tsearch2();
PS: What I forgot to mention was that inserting records into the table is also about 2-3 times slower than before (most likely due to the slow index update operations). I dropped the whole database and restored the dump file, but the result is the same. When the indexes are recreated after COPYing the data, it takes more than 20 minutes for _each_ of the two tsearch2 indexes, so the total time to restore this table is more than 45 minutes!

- Stephan
Stephan Vollmer <svollmer@gmx.de> writes:
> I noticed that the creation of a GIST index for tsearch2 takes very
> long - about 20 minutes. CPU utilization is 100 %, the resulting
> index file size is ~25 MB. Is this behaviour normal?

This has been complained of before. GIST is always going to be slower at index-building than btree; in the btree case there's a simple optimal strategy for making an index (ie, sort the keys) but for GIST we don't know anything better to do than insert the keys one at a time.

However, I'm not sure that anyone's tried to do any performance optimization on the GIST insert code ... there might be some low-hanging fruit there. It'd be interesting to look at a gprof profile of what the backend is doing during the index build. Do you have the ability to do that, or would you be willing to give your data to someone else to investigate with? (The behavior is very possibly data-dependent, which is why I want to see a profile with your specific data and not just some random dataset or other.)

			regards, tom lane
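[Editor's note: the build-strategy difference Tom describes can be sketched with a toy cost model. This is a hypothetical back-of-the-envelope illustration, not actual backend code; the fanout of 100 children per page is an assumed figure.]

```python
import math

# Toy cost model: a btree bulk build sorts the keys once, while a GiST
# build descends the tree for every key, evaluating the penalty function
# against every child of every node on the way down.

def btree_build_cost(n):
    # Sort-based build: roughly n * log2(n) key comparisons.
    return int(n * math.log2(n))

def gist_build_cost(n, fanout=100):
    # One-at-a-time inserts: each insert walks root-to-leaf, calling
    # penalty for every child at every level to pick the cheapest subtree.
    cost = 0
    for i in range(2, n + 1):
        depth = max(1, math.ceil(math.log(i, fanout)))
        cost += depth * fanout  # penalty calls for this one insert
    return cost
```

Under this toy model, for 700,000 keys the penalty-call count exceeds the sort-comparison count by more than an order of magnitude, which is at least consistent with 30 seconds for btree vs 20 minutes for GiST.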
Tom Lane wrote:
> Stephan Vollmer <svollmer@gmx.de> writes:
>> I noticed that the creation of a GIST index for tsearch2 takes very
>> long - about 20 minutes. CPU utilization is 100 %, the resulting
>> index file size is ~25 MB. Is this behaviour normal?
>
> This has been complained of before. GIST is always going to be slower
> at index-building than btree; in the btree case there's a simple optimal
> strategy for making an index (ie, sort the keys) but for GIST we don't
> know anything better to do than insert the keys one at a time.

Ah, ok. That explains a lot, although I wonder why it is so much slower.

> However, I'm not sure that anyone's tried to do any performance
> optimization on the GIST insert code ... there might be some low-hanging
> fruit there. It'd be interesting to look at a gprof profile of what the
> backend is doing during the index build. Do you have the ability to do
> that, or would you be willing to give your data to someone else to
> investigate with?

Unfortunately, I'm not able to investigate it further myself as I'm quite a Postgres newbie. But I could provide someone else with the example table. Maybe someone else could find out why it is so slow.

I dropped all unnecessary columns and trimmed the table down to 235,000 rows. The dumped table (compressed with RAR) is 7,1 MB. I don't have a website to upload it but I could send it to someone via e-mail.

With this 235,000-row table, index creation times are:

- GIST:   347063 ms
- B-Tree:   2515 ms

Thanks for your help!

- Stephan
On Fri, Jan 20, 2006 at 10:35:21AM -0500, Tom Lane wrote:
> However, I'm not sure that anyone's tried to do any performance
> optimization on the GIST insert code ... there might be some low-hanging
> fruit there. It'd be interesting to look at a gprof profile of what the
> backend is doing during the index build. Do you have the ability to do
> that, or would you be willing to give your data to someone else to
> investigate with? (The behavior is very possibly data-dependent, which
> is why I want to see a profile with your specific data and not just some
> random dataset or other.)

The cost of inserting would generally go to either penalty or picksplit. Certainly if you're inserting lots of values in a short interval, I can imagine picksplit being nasty, since the algorithms for a lot of datatypes are not really renowned for their speed.

I'm wondering if you could improve the process by grouping into larger blocks. For example, pull out enough tuples to cover 4 pages and then call picksplit three times to split them into the four pages. This gives you 4 entries for the level above the leaves. Keep reading tuples and splitting until you get enough for the next level, and then call picksplit on those, etc.

The thing is, you never call penalty here, so it's questionable whether the tree will be as efficient as one built by plain insertion. For example, if you have a data type representing ranges (a,b), straight inserting can produce the perfect tree order like a b-tree (assuming non-overlapping entries). The above process will produce something close, but not quite. Should probably get out pen and paper to model this. After all, if the cost of picksplit increases superlinearly with the number of elements, calling it with larger sets may prove to be a loss overall...

Perhaps the easiest would be to allow datatypes to provide a bulkinsert function, like b-tree does? The question is, what should its input and output be?
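[Editor's note: Martijn's bottom-up grouping idea can be sketched as follows. This is a hypothetical toy in Python: `picksplit` here is a naive stand-in for a datatype's real picksplit, and the "union key" representing each group at the next level up is faked by taking the group's first element.]

```python
# Bottom-up bulk load: carve the input into page-sized groups with a
# picksplit-style partition, then treat one key per group as the entry
# for the level above, and repeat until a single root page remains.

def picksplit(entries, pages):
    """Toy stand-in for a datatype's picksplit: cut `entries` into
    `pages` roughly equal groups. A real picksplit would cluster
    entries by the datatype's own notion of similarity."""
    size = -(-len(entries) // pages)  # ceiling division
    return [entries[i:i + size] for i in range(0, len(entries), size)]

def bulk_load(entries, page_capacity=4):
    """Build tree levels bottom-up, leaves first. Note that penalty is
    never called, which is exactly the concern raised above."""
    level = entries
    tree = []
    while len(level) > page_capacity:
        groups = picksplit(level, -(-len(level) // page_capacity))
        tree.append(groups)
        # one (faked) union key per group represents it one level up
        level = [g[0] for g in groups]
    tree.append([level])  # the root page
    return tree
```

Because no penalty function ever runs, the resulting tree may be worse-clustered than one built by ordinary insertion, as the range-type example above suggests.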
Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
Stephan Vollmer <svollmer@gmx.de> writes:
> Unfortunately, I'm not able to investigate it further myself as I'm
> quite a Postgres newbie. But I could provide someone else with the
> example table. Maybe someone else could find out why it is so slow.

I'd be willing to take a look, if you'll send me the dump file off-list.

> I dropped all unnecessary columns and trimmed the table down to
> 235,000 rows. The dumped table (compressed with RAR) is 7,1 MB. I
> don't have a website to upload it but I could send it to someone via
> e-mail.

Don't have RAR --- gzip or bzip2 is fine ...

			regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes:
> The cost of inserting would generally go to either penalty or
> picksplit. Certainly if you're inserting lots of values in a short
> interval, I can imagine picksplit being nasty, since the algorithms for
> a lot of datatypes are not really renowned for their speed.

Tut tut ... in the words of the sage, it is a capital mistake to theorize in advance of the data. You may well be right, but on the other hand it could easily be something dumb like an O(N^2) loop over a list structure.

I'll post some gprof numbers after Stephan sends me the dump. We should probably move the thread to someplace like pgsql-performance, too.

			regards, tom lane
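[Editor's note: for illustration, the sort of accidental O(N^2) list loop Tom has in mind might look like this. A hypothetical toy, not actual backend code.]

```python
# Appending to a singly linked list by walking to the tail on every
# insert: each individual append looks cheap, but total work grows
# quadratically with the number of inserts.

class Node:
    def __init__(self, value):
        self.value, self.next = value, None

def append_walking(head, value):
    """Append `value`, returning (head, steps walked)."""
    node = Node(value)
    if head is None:
        return node, 1
    steps, cur = 1, head
    while cur.next:          # re-walk the whole list every time
        cur = cur.next
        steps += 1
    cur.next = node
    return head, steps

head, total = None, 0
for i in range(1000):
    head, steps = append_walking(head, i)
    total += steps
# total work is ~ n*(n-1)/2 node visits for n inserts
```

Keeping a tail pointer would make each append O(1); the point is that this kind of hidden quadratic loop is easy to miss without a profile, which is why the gprof numbers matter.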
Stephan Vollmer <svollmer@gmx.de> writes:
> Tom Lane wrote:
>> However, I'm not sure that anyone's tried to do any performance
>> optimization on the GIST insert code ... there might be some low-hanging
>> fruit there.

> Unfortunately, I'm not able to investigate it further myself as I'm
> quite a Postgres newbie. But I could provide someone else with the
> example table. Maybe someone else could find out why it is so slow.

The problem seems to be mostly tsearch2's fault rather than the general GIST code. I've applied a partial fix to the 8.1 and HEAD branches, which you can find here if you're in a hurry for it:
http://archives.postgresql.org/pgsql-committers/2006-01/msg00283.php
(the gistidx.c change is all you need for tsearch2)

There is some followup discussion in the pgsql-performance list. It seems possible that we can get another factor of 10 or better with a smarter picksplit algorithm --- but that patch will probably be too large to be considered for back-patching into the stable branches.

			regards, tom lane
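[Editor's note: for readers following along, tsearch2's GiST keys are fixed-length bit signatures compared by Hamming distance, and a picksplit over them follows the classic seed-and-assign pattern. The sketch below is a hypothetical toy version of that general idea in Python, not the committed patch.]

```python
# Toy signature picksplit: pick the two most dissimilar signatures as
# seeds, then assign every entry to whichever seed is closer by
# Hamming distance. Assumes at least two signatures.

def hamming(a, b):
    # number of differing bits between two integer bit-signatures
    return bin(a ^ b).count("1")

def picksplit(signatures):
    # Seeds: the pair of entries with maximal Hamming distance.
    seeds = max(
        ((i, j)
         for i in range(len(signatures))
         for j in range(i + 1, len(signatures))),
        key=lambda p: hamming(signatures[p[0]], signatures[p[1]]),
    )
    left, right = [], []
    for idx, sig in enumerate(signatures):
        d_left = hamming(sig, signatures[seeds[0]])
        d_right = hamming(sig, signatures[seeds[1]])
        (left if d_left <= d_right else right).append(idx)
    return left, right
```

The seed search here is itself O(N^2) in the number of entries on the splitting page, which echoes Martijn's earlier caution that picksplit cost can grow superlinearly; the real speedups discussed on pgsql-performance come from cheaper distance computations and better seed heuristics.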
Tom Lane wrote:
> The problem seems to be mostly tsearch2's fault rather than the general
> GIST code. I've applied a partial fix to 8.1 and HEAD branches, which
> you can find here if you're in a hurry for it:
> http://archives.postgresql.org/pgsql-committers/2006-01/msg00283.php
> (the gistidx.c change is all you need for tsearch2)

Thanks for all the time and work you and the other guys are spending on this matter! I'll look into the new version, but a.p.o seems to be unreachable at the moment.

> There is some followup discussion in the pgsql-performance list. It
> seems possible that we can get another factor of 10 or better with a
> smarter picksplit algorithm --- but that patch will probably be too
> large to be considered for back-patching into the stable branches.

I've already been following the discussion on pgsql-performance, although I have to admit that I don't understand every detail of the tsearch2 implementation. :-) Thus, I'm sorry that I won't be able to help directly on that problem. But it is interesting to read anyway.

Best regards,

- Stephan