Re: Creation of tsearch2 index is very slow - Mailing list pgsql-general

From Martijn van Oosterhout
Subject Re: Creation of tsearch2 index is very slow
Msg-id 20060120170452.GC31908@svana.org
In response to Re: Creation of tsearch2 index is very slow  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Creation of tsearch2 index is very slow
List pgsql-general
On Fri, Jan 20, 2006 at 10:35:21AM -0500, Tom Lane wrote:
> However, I'm not sure that anyone's tried to do any performance
> optimization on the GIST insert code ... there might be some low-hanging
> fruit there.  It'd be interesting to look at a gprof profile of what the
> backend is doing during the index build.  Do you have the ability to do
> that, or would you be willing to give your data to someone else to
> investigate with?  (The behavior is very possibly data-dependent, which
> is why I want to see a profile with your specific data and not just some
> random dataset or other.)

The cost of inserting would generally go to either penalty or
picksplit. Certainly if you're inserting lots of values in a short
interval, I can imagine picksplit being nasty, since the algorithms for
a lot of datatypes are not exactly renowned for their speed.

I'm wondering if you could improve the index build by grouping tuples
into larger blocks. For example, pull out enough tuples to cover 4
pages and then call picksplit three times to split them into the four
pages. This gives you 4 entries for the level above the leaves. Keep
reading tuples and splitting until you have enough entries for the next
level, call picksplit on those, and so on up the tree.
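
To make the idea concrete, here is a rough sketch in Python rather than
backend C. Everything in it is an assumption for illustration: the keys
are toy integer ranges, PAGE_CAPACITY and GROUP_PAGES are made-up
numbers, and picksplit() is a stand-in that sorts by lower bound and
cuts in half, not any real opclass method. It also assumes picksplit
divides its input roughly evenly, which real implementations do not
guarantee.

```python
from typing import List, Optional, Tuple

Key = Tuple[int, int]               # toy range key (lo, hi), stands in for a GiST key
Entry = Tuple[Key, Optional[list]]  # (key, child page), child is None for a leaf tuple

PAGE_CAPACITY = 4   # hypothetical number of entries that fit on one page
GROUP_PAGES = 4     # pages filled per batch, as in the example above

stats = {"picksplit_calls": 0}  # counter so the number of calls can be inspected


def union(keys: List[Key]) -> Key:
    """Smallest range covering all given ranges (the parent key)."""
    return (min(k[0] for k in keys), max(k[1] for k in keys))


def picksplit(entries: List[Entry]) -> Tuple[List[Entry], List[Entry]]:
    """Toy picksplit: order by key and cut the list in half."""
    stats["picksplit_calls"] += 1
    ordered = sorted(entries, key=lambda e: e[0])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def split_into_pages(entries: List[Entry]) -> List[List[Entry]]:
    """Keep splitting until every piece fits on a page.

    A full batch of GROUP_PAGES * PAGE_CAPACITY entries takes three
    picksplit calls and yields four pages.
    """
    if len(entries) <= PAGE_CAPACITY:
        return [entries]
    left, right = picksplit(entries)
    return split_into_pages(left) + split_into_pages(right)


def build_level(entries: List[Entry]) -> List[Entry]:
    """Turn one level into the entries of the level above it."""
    batch = GROUP_PAGES * PAGE_CAPACITY
    parents: List[Entry] = []
    for start in range(0, len(entries), batch):
        for page in split_into_pages(entries[start:start + batch]):
            parents.append((union([key for key, _ in page]), page))
    return parents


def bulk_build(keys: List[Key]) -> List[Entry]:
    """Build bottom-up until the remaining entries fit on a root page."""
    level: List[Entry] = [(key, None) for key in keys]
    while len(level) > PAGE_CAPACITY:
        level = build_level(level)
    return level
```

Note that batches are taken in arrival order, so with unsorted input the
pages produced from different batches can overlap each other at the
parent level. Penalty is never consulted to avoid that.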

The thing is, you never call penalty here, so it's questionable whether
the tree will be as efficient as one built by plain inserting. For
example, if you have a data type representing ranges (a,b), straight
inserting can produce a perfectly ordered tree, like a b-tree (assuming
non-overlapping entries). The above process will produce something
close, but not quite...

I should probably get out pen and paper to model this. After all, if
the cost of picksplit grows superlinearly in the number of elements,
calling it with larger sets may prove to be a loss overall...
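
As a first stab at that pen-and-paper model, here is a small Python
sketch. It is back-of-envelope only and rests on assumptions I have not
measured: picksplit on n entries costs n**p, plain inserting splits a
page each time it overflows (page_cap + 1 entries) and ends up with
pages filled to a guessed average of `fill`, and the batched approach
halves each batch until the pieces fit on a page. Only leaf-level
picksplit cost is counted; penalty and tree descent are ignored.

```python
def picksplit_cost(n: float, p: float) -> float:
    """Assumed cost of one picksplit call on n entries."""
    return n ** p


def incremental_cost(n_tuples: int, page_cap: int, p: float,
                     fill: float = 0.75) -> float:
    """Leaf-level picksplit cost when inserting tuples one at a time.

    Each split adds one page, so the number of splits is the final
    page count minus one. `fill` is a guessed average page fill.
    """
    pages = n_tuples / (fill * page_cap)
    splits = max(pages - 1.0, 0.0)
    return splits * picksplit_cost(page_cap + 1, p)


def batched_cost(n_tuples: int, page_cap: int, p: float,
                 group_pages: int = 4) -> float:
    """Leaf-level picksplit cost when splitting batches of group_pages pages.

    Assumes group_pages is a power of two and every split is even:
    one call on the whole batch, two on the halves, and so on.
    """
    batch = group_pages * page_cap
    per_batch = 0.0
    size, calls = batch, 1
    while size > page_cap:
        per_batch += calls * picksplit_cost(size, p)
        size //= 2
        calls *= 2
    return (n_tuples / batch) * per_batch


def ratio(n_tuples: int, page_cap: int, p: float) -> float:
    """How many times more picksplit work batching does in this model."""
    return batched_cost(n_tuples, page_cap, p) / incremental_cost(n_tuples, page_cap, p)
```

In this model the batched build already does more picksplit work than
plain inserting when the cost is linear (p = 1), and the gap widens as p
grows. Whether skipping penalty and the repeated descents makes up for
that is exactly what would need measuring.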

Perhaps the easiest would be to allow datatypes to provide a bulkinsert
function, like b-tree does? The question is, what should be its input
and output?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
