Thread: GIN index creation extremely slow ?

GIN index creation extremely slow ?

From
Stefan Kaltenbrunner
Date:
on IRC somebody mentioned that it took >34h to greate a GIN index (on a
tsvector) on a ~3 Million column table (wikipedia dump) with a
reasonable speced box (AMD 3400+).
After getting hold of a dump of said table (around 4,1GB in size) I
managed to get the following timings:

test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector);
CREATE INDEX
Time: 416122.896 ms

so about 7 minutes - sounds very reasonable

test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector);
CREATE INDEX
Time: 52681605.101 ms

ouch - that makes for a whoppy 14,6hours(!). During that time the box is
completely CPU bottlenecked and during virtually no IO at all - (varing
maintainance_work_mem does not seem to make any noticable difference).

That box is a fast Dual Opteron 2.6Ghz with 8GB RAM and a 4 disk RAID10
for the WAL and 12 disks for the data running a very recent -HEAD
checkout ...

It looks like we still don't have any docs for GIN in the tree so I
don't know if those timings are expected or not ...


Stefan


Re: GIN index creation extremely slow ?

From
Oleg Bartunov
Date:
gin uses maintenance_work_mem,so try to increase it and see dramatic
improvements

Oleg
On Mon, 26 Jun 2006, Stefan Kaltenbrunner wrote:

> on IRC somebody mentioned that it took >34h to greate a GIN index (on a
> tsvector) on a ~3 Million column table (wikipedia dump) with a
> reasonable speced box (AMD 3400+).
> After getting hold of a dump of said table (around 4,1GB in size) I
> managed to get the following timings:
>
> test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector);
> CREATE INDEX
> Time: 416122.896 ms
>
> so about 7 minutes - sounds very reasonable
>
> test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector);
> CREATE INDEX
> Time: 52681605.101 ms
>
> ouch - that makes for a whoppy 14,6hours(!). During that time the box is
> completely CPU bottlenecked and during virtually no IO at all - (varing
> maintainance_work_mem does not seem to make any noticable difference).
>
> That box is a fast Dual Opteron 2.6Ghz with 8GB RAM and a 4 disk RAID10
> for the WAL and 12 disks for the data running a very recent -HEAD
> checkout ...
>
> It looks like we still don't have any docs for GIN in the tree so I
> don't know if those timings are expected or not ...
>
>
> Stefan
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: GIN index creation extremely slow ?

From
Christopher Kings-Lynne
Date:
> It looks like we still don't have any docs for GIN in the tree so I
> don't know if those timings are expected or not ...

Ummm my bad.  Sorry...



Re: GIN index creation extremely slow ?

From
Teodor Sigaev
Date:
> test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector);
> CREATE INDEX
> Time: 416122.896 ms
> 
> so about 7 minutes - sounds very reasonable
> 
> test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector);
> CREATE INDEX
> Time: 52681605.101 ms

I'll look at this,  but GiST time creation is suspiciously small.
Can you test on smaller table, for example with 100000 records and if results 
are repeat, pls, send to me test suite...

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: GIN index creation extremely slow ?

From
Stefan Kaltenbrunner
Date:
Teodor Sigaev wrote:
>> test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector);
>> CREATE INDEX
>> Time: 416122.896 ms
>>
>> so about 7 minutes - sounds very reasonable
>>
>> test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector);
>> CREATE INDEX
>> Time: 52681605.101 ms
> 
> I'll look at this,  but GiST time creation is suspiciously small.
> Can you test on smaller table, for example with 100000 records and if
> results are repeat, pls, send to me test suite...

I won't have access to the original testcase and server for a few days
but I just redid some testing on a slower personal box of mine with a
smaller(but similiar) testset and on that box I could not reproduce that
issue.
So the problem is either caused by the size of the table or somehow by
the data itself :-(


Stefan


Re: GIN index creation extremely slow ?

From
Teodor Sigaev
Date:
> I won't have access to the original testcase and server for a few days
> but I just redid some testing on a slower personal box of mine with a
> smaller(but similiar) testset and on that box I could not reproduce that
> issue.
> So the problem is either caused by the size of the table or somehow by
> the data itself :-(

We tested gin with 5 millions records (but not a wikipedia's text, but with blog 
records which is usually mush shorter) and index creation time was near 8 hours 
on Dual Xeon 3.2, but we didn't try GiST. In other hand, creating GiST index on 
0.5 millions emails took about 12 hours.

That's why 7 minutes is very suspicious result.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: GIN index creation extremely slow ?

From
Martijn van Oosterhout
Date:
On Wed, Jun 28, 2006 at 01:09:23PM +0400, Teodor Sigaev wrote:
> We tested gin with 5 millions records (but not a wikipedia's text, but with
> blog records which is usually mush shorter) and index creation time was
> near 8 hours on Dual Xeon 3.2, but we didn't try GiST. In other hand,
> creating GiST index on 0.5 millions emails took about 12 hours.
>
> That's why 7 minutes is very suspicious result.

Tom did commit a patch a while ago which made a huge difference in
index creation time for tsearch by changing one routine. I don't know
if it got backpatched, so it might be worth checking people are working
on the same version.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: GIN index creation extremely slow ?

From
Oleg Bartunov
Date:
On Wed, 28 Jun 2006, Martijn van Oosterhout wrote:

> On Wed, Jun 28, 2006 at 01:09:23PM +0400, Teodor Sigaev wrote:
>> We tested gin with 5 millions records (but not a wikipedia's text, but with
>> blog records which is usually mush shorter) and index creation time was
>> near 8 hours on Dual Xeon 3.2, but we didn't try GiST. In other hand,
>> creating GiST index on 0.5 millions emails took about 12 hours.
>>
>> That's why 7 minutes is very suspicious result.
>
> Tom did commit a patch a while ago which made a huge difference in
> index creation time for tsearch by changing one routine. I don't know
> if it got backpatched, so it might be worth checking people are working
> on the same version.

it's on 8.1.4 and speedup is about 10 times.

>
> Have a nice day,
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: GIN index creation extremely slow ?

From
Teodor Sigaev
Date:
> Tom did commit a patch a while ago which made a huge difference in
> index creation time for tsearch by changing one routine. I don't know
> if it got backpatched, so it might be worth checking people are working
> on the same version.

I saw that patch, but I still think that 7 minutes is too small :)

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: GIN index creation extremely slow ?

From
Stefan Kaltenbrunner
Date:
Teodor Sigaev wrote:
>> Tom did commit a patch a while ago which made a huge difference in
>> index creation time for tsearch by changing one routine. I don't know
>> if it got backpatched, so it might be worth checking people are working
>> on the same version.
> 
> I saw that patch, but I still think that 7 minutes is too small :)

hmm I did some further testing on that and it looks like you might
indeed be right in suspecting that there is something fishy with the
GIST results.
It might be possible that there was some issue with the generated
tsvectors (all of them empty due to hitting the "too long value" error
case for exceeding MAXSTRPOS) in the GIST case - sorry for the confusion:-(

Speaking of the "too long value" error message - some of the
errormessages in tsvector.c are a bit "terse" (ie it is not really
obvious what is causing the above error without looking at the source
for example).


Stefan


Re: GIN index creation extremely slow ?

From
Teodor Sigaev
Date:
Try again, today's patch solves the problem.

Stefan Kaltenbrunner wrote:
> on IRC somebody mentioned that it took >34h to greate a GIN index (on a
> tsvector) on a ~3 Million column table (wikipedia dump) with a
> reasonable speced box (AMD 3400+).
> After getting hold of a dump of said table (around 4,1GB in size) I
> managed to get the following timings:
> 
> test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector);
> CREATE INDEX
> Time: 416122.896 ms
> 
> so about 7 minutes - sounds very reasonable
> 
> test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector);
> CREATE INDEX
> Time: 52681605.101 ms
> 
> ouch - that makes for a whoppy 14,6hours(!). During that time the box is
> completely CPU bottlenecked and during virtually no IO at all - (varing
> maintainance_work_mem does not seem to make any noticable difference).
> 
> That box is a fast Dual Opteron 2.6Ghz with 8GB RAM and a 4 disk RAID10
> for the WAL and 12 disks for the data running a very recent -HEAD
> checkout ...
> 
> It looks like we still don't have any docs for GIN in the tree so I
> don't know if those timings are expected or not ...
> 
> 
> Stefan
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/