Thread: Creation of tsearch2 index is very slow

Creation of tsearch2 index is very slow

From
Stephan Vollmer
Date:
Hello!

I noticed that the creation of a GIST index for tsearch2 takes very
long - about 20 minutes. CPU utilization is 100 %, the resulting
index file size is ~25 MB. Is this behaviour normal?

Full text columns:   title          author_list
tsearch2 word lists: fti_title      fti_author_list
tsearch2 indexes:    idx_fti_title  idx_fti_author_list

The table has 700,000 records. When I create a normal B-Tree index
on the same column for testing purposes, it works quite fast
(approx. 30 seconds).

The columns that should be indexed are small, only about 10 words on
average.

System specs:
Athlon64 X2 3800+, 2 GB RAM
PostgreSQL 8.1.2, Windows XP SP2

I've never noticed this problem before, so could it probably be
related to v8.1.2? Or is something in my configuration or table
definition that causes this sluggishness?

Thanks very much in advance for your help!

- Stephan



This is the table definition:
-----------------------------------------------------------------
CREATE TABLE publications
(
  id int4 NOT NULL DEFAULT nextval('publications_id_seq'::regclass),
  publication_type_id int4 NOT NULL DEFAULT 0,
  keyword text NOT NULL,
  mdate date,
  "year" date,
  title text,
  fti_title tsvector,
  author_list text,
  fti_author_list tsvector,
  overview_timestamp timestamp,
  overview_xml text,
  CONSTRAINT publications_pkey PRIMARY KEY (keyword) USING INDEX
      TABLESPACE dblp_index,
  CONSTRAINT publications_publication_type_id_fkey FOREIGN KEY
     (publication_type_id)
      REFERENCES publication_types (id) MATCH SIMPLE
      ON UPDATE RESTRICT ON DELETE RESTRICT,
  CONSTRAINT publications_year_check CHECK (date_part('month'::text,
"year") = 1::double precision AND date_part('day'::text, "year") =
1::double precision)
)
WITHOUT OIDS;

CREATE INDEX fki_publications_publication_type_id
  ON publications
  USING btree
  (publication_type_id)
  TABLESPACE dblp_index;

CREATE INDEX idx_fti_author_list
  ON publications
  USING gist
  (fti_author_list)
  TABLESPACE dblp_index;

CREATE INDEX idx_fti_title
  ON publications
  USING gist
  (fti_title)
  TABLESPACE dblp_index;

CREATE INDEX idx_publications_year
  ON publications
  USING btree
  ("year")
  TABLESPACE dblp_index;

CREATE INDEX idx_publications_year_part
  ON publications
  USING btree
  (date_part('year'::text, "year"))
  TABLESPACE dblp_index;


CREATE TRIGGER tsvectorupdate_all
  BEFORE INSERT OR UPDATE
  ON publications
  FOR EACH ROW
  EXECUTE PROCEDURE multi_tsearch2();



Attachment

Re: Creation of tsearch2 index is very slow

From
Stephan Vollmer
Date:
PS:

What I forgot to mention was that inserting records into the table
is also about 2-3 times slower than before (most likely due to the
slow index update operations).

I dropped the whole database and restored the dumpfile, but the
result it the same. When the index is recreated after COPYing the
data, it takes more than 20 minutes for _each_ of both tsearch2
indexes. So the total time to restore this table is more than 45
minutes!

- Stephan


Attachment

Re: Creation of tsearch2 index is very slow

From
Tom Lane
Date:
Stephan Vollmer <svollmer@gmx.de> writes:
> I noticed that the creation of a GIST index for tsearch2 takes very
> long - about 20 minutes. CPU utilization is 100 %, the resulting
> index file size is ~25 MB. Is this behaviour normal?

This has been complained of before.  GIST is always going to be slower
at index-building than btree; in the btree case there's a simple optimal
strategy for making an index (ie, sort the keys) but for GIST we don't
know anything better to do than insert the keys one at a time.

However, I'm not sure that anyone's tried to do any performance
optimization on the GIST insert code ... there might be some low-hanging
fruit there.  It'd be interesting to look at a gprof profile of what the
backend is doing during the index build.  Do you have the ability to do
that, or would you be willing to give your data to someone else to
investigate with?  (The behavior is very possibly data-dependent, which
is why I want to see a profile with your specific data and not just some
random dataset or other.)

            regards, tom lane

Re: Creation of tsearch2 index is very slow

From
Stephan Vollmer
Date:
Tom Lane wrote:
> Stephan Vollmer <svollmer@gmx.de> writes:
>> I noticed that the creation of a GIST index for tsearch2 takes very
>> long - about 20 minutes. CPU utilization is 100 %, the resulting
>> index file size is ~25 MB. Is this behaviour normal?
>
> This has been complained of before.  GIST is always going to be slower
> at index-building than btree; in the btree case there's a simple optimal
> strategy for making an index (ie, sort the keys) but for GIST we don't
> know anything better to do than insert the keys one at a time.

Ah, ok. That explains a lot, although I wonder why it is so much slower.


> However, I'm not sure that anyone's tried to do any performance
> optimization on the GIST insert code ... there might be some low-hanging
> fruit there.  It'd be interesting to look at a gprof profile of what the
> backend is doing during the index build.  Do you have the ability to do
> that, or would you be willing to give your data to someone else to
> investigate with?

Unfortunately, I'm not able to investigate it further myself as I'm
quite a Postgres newbie. But I could provide someone else with the
example table. Maybe someone else could find out why it is so slow.

I dropped all unnecessary columns and trimmed the table down to
235,000 rows. The dumped table (compressed with RAR) is 7,1 MB. I
don't have a website to upload it but I could send it to someone via
e-mail.

With this 235,000 row table, index creation times are:
- GIST    347063 ms
- B-Tree    2515 ms


Thanks for your help!

- Stephan



Attachment

Re: Creation of tsearch2 index is very slow

From
Martijn van Oosterhout
Date:
On Fri, Jan 20, 2006 at 10:35:21AM -0500, Tom Lane wrote:
> However, I'm not sure that anyone's tried to do any performance
> optimization on the GIST insert code ... there might be some low-hanging
> fruit there.  It'd be interesting to look at a gprof profile of what the
> backend is doing during the index build.  Do you have the ability to do
> that, or would you be willing to give your data to someone else to
> investigate with?  (The behavior is very possibly data-dependent, which
> is why I want to see a profile with your specific data and not just some
> random dataset or other.)

The cost on inserting would generally go to either penalty, or
picksplit. Certainly if you're inserting lots of values in a short
interval, I can imagine picksplit being nasty, since the algorithms for
a lot of datatypes are not really reknown for their speed.

I'm wondering if you could possibly improve the process by grouping
into larger blocks. For example, pull out enough tuples to cover 4
pages and then call picksplit three times to split it into the four
pages. This gives you 4 entries for the level above the leaves. Keep
reading tuples and splitting until you get enough for the next level
and call picksplit on those. etc etc.

The thing is, you never call penalty here so it's questionable whether
the tree will be as efficient as just inserting. For example, if have a
data type representing ranges (a,b), straight inserting can produce the
perfect tree order like a b-tree (assuming non-overlapping entries).
The above process will produce something close, but not quite...

Should probably get out a pen-and-paper to model this. After all, if
the speed of the picksplit increases superlinearly to the number of
elements, calling it will larger sets may prove to be a loss overall...

Perhaps the easiest would be to allow datatypes to provide a bulkinsert
function, like b-tree does? The question is, what should be its input
and output?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: Creation of tsearch2 index is very slow

From
Tom Lane
Date:
Stephan Vollmer <svollmer@gmx.de> writes:
> Unfortunately, I'm not able to investigate it further myself as I'm
> quite a Postgres newbie. But I could provide someone else with the
> example table. Maybe someone else could find out why it is so slow.

I'd be willing to take a look, if you'll send me the dump file off-list.

> I dropped all unnecessary columns and trimmed the table down to
> 235,000 rows. The dumped table (compressed with RAR) is 7,1 MB. I
> don't have a website to upload it but I could send it to someone via
> e-mail.

Don't have RAR --- gzip or bzip2 is fine ...

            regards, tom lane

Re: Creation of tsearch2 index is very slow

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> The cost on inserting would generally go to either penalty, or
> picksplit. Certainly if you're inserting lots of values in a short
> interval, I can imagine picksplit being nasty, since the algorithms for
> a lot of datatypes are not really reknown for their speed.

Tut tut ... in the words of the sage, it is a capital mistake to
theorize in advance of the data.  You may well be right, but on the
other hand it could easily be something dumb like an O(N^2) loop over
a list structure.

I'll post some gprof numbers after Stephan sends me the dump.  We
should probably move the thread to someplace like pgsql-perform, too.

            regards, tom lane

Re: Creation of tsearch2 index is very slow

From
Tom Lane
Date:
Stephan Vollmer <svollmer@gmx.de> writes:
> Tom Lane wrote:
>> However, I'm not sure that anyone's tried to do any performance
>> optimization on the GIST insert code ... there might be some low-hanging
>> fruit there.

> Unfortunately, I'm not able to investigate it further myself as I'm
> quite a Postgres newbie. But I could provide someone else with the
> example table. Maybe someone else could find out why it is so slow.

The problem seems to be mostly tsearch2's fault rather than the general
GIST code.  I've applied a partial fix to 8.1 and HEAD branches, which
you can find here if you're in a hurry for it:
http://archives.postgresql.org/pgsql-committers/2006-01/msg00283.php
(the gistidx.c change is all you need for tsearch2)

There is some followup discussion in the pgsql-performance list.  It
seems possible that we can get another factor of 10 or better with a
smarter picksplit algorithm --- but that patch will probably be too
large to be considered for back-patching into the stable branches.

            regards, tom lane

Re: Creation of tsearch2 index is very slow

From
Stephan Vollmer
Date:
Tom Lane wrote:

> The problem seems to be mostly tsearch2's fault rather than the general
> GIST code.  I've applied a partial fix to 8.1 and HEAD branches, which
> you can find here if you're in a hurry for it:
> http://archives.postgresql.org/pgsql-committers/2006-01/msg00283.php
> (the gistidx.c change is all you need for tsearch2)

Thanks for all your time and work you and the other guys are
spending on this matter! I'll look into the new version, but a.p.o
seems to be unreachable at the moment.


> There is some followup discussion in the pgsql-performance list.  It
> seems possible that we can get another factor of 10 or better with a
> smarter picksplit algorithm --- but that patch will probably be too
> large to be considered for back-patching into the stable branches.

I've already been following the discussion on pgsql-perform,
although I have to admit that don't understand every detail of the
tsearch2 implementation. :-)  Thus, I'm sorry that I won't be able
to help directly on that problem. But it is interesting to read anyway.

Best regards,

- Stephan


Attachment