Thread: GIN index creation extremely slow ?
on IRC somebody mentioned that it took >34h to greate a GIN index (on a tsvector) on a ~3 Million column table (wikipedia dump) with a reasonable speced box (AMD 3400+). After getting hold of a dump of said table (around 4,1GB in size) I managed to get the following timings: test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector); CREATE INDEX Time: 416122.896 ms so about 7 minutes - sounds very reasonable test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector); CREATE INDEX Time: 52681605.101 ms ouch - that makes for a whoppy 14,6hours(!). During that time the box is completely CPU bottlenecked and during virtually no IO at all - (varing maintainance_work_mem does not seem to make any noticable difference). That box is a fast Dual Opteron 2.6Ghz with 8GB RAM and a 4 disk RAID10 for the WAL and 12 disks for the data running a very recent -HEAD checkout ... It looks like we still don't have any docs for GIN in the tree so I don't know if those timings are expected or not ... Stefan
gin uses maintenance_work_mem,so try to increase it and see dramatic improvements Oleg On Mon, 26 Jun 2006, Stefan Kaltenbrunner wrote: > on IRC somebody mentioned that it took >34h to greate a GIN index (on a > tsvector) on a ~3 Million column table (wikipedia dump) with a > reasonable speced box (AMD 3400+). > After getting hold of a dump of said table (around 4,1GB in size) I > managed to get the following timings: > > test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector); > CREATE INDEX > Time: 416122.896 ms > > so about 7 minutes - sounds very reasonable > > test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector); > CREATE INDEX > Time: 52681605.101 ms > > ouch - that makes for a whoppy 14,6hours(!). During that time the box is > completely CPU bottlenecked and during virtually no IO at all - (varing > maintainance_work_mem does not seem to make any noticable difference). > > That box is a fast Dual Opteron 2.6Ghz with 8GB RAM and a 4 disk RAID10 > for the WAL and 12 disks for the data running a very recent -HEAD > checkout ... > > It looks like we still don't have any docs for GIN in the tree so I > don't know if those timings are expected or not ... > > > Stefan > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
> It looks like we still don't have any docs for GIN in the tree so I > don't know if those timings are expected or not ... Ummm my bad. Sorry...
> test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector); > CREATE INDEX > Time: 416122.896 ms > > so about 7 minutes - sounds very reasonable > > test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector); > CREATE INDEX > Time: 52681605.101 ms I'll look at this, but GiST time creation is suspiciously small. Can you test on smaller table, for example with 100000 records and if results are repeat, pls, send to me test suite... -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev wrote: >> test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector); >> CREATE INDEX >> Time: 416122.896 ms >> >> so about 7 minutes - sounds very reasonable >> >> test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector); >> CREATE INDEX >> Time: 52681605.101 ms > > I'll look at this, but GiST time creation is suspiciously small. > Can you test on smaller table, for example with 100000 records and if > results are repeat, pls, send to me test suite... I won't have access to the original testcase and server for a few days but I just redid some testing on a slower personal box of mine with a smaller(but similiar) testset and on that box I could not reproduce that issue. So the problem is either caused by the size of the table or somehow by the data itself :-( Stefan
> I won't have access to the original testcase and server for a few days > but I just redid some testing on a slower personal box of mine with a > smaller(but similiar) testset and on that box I could not reproduce that > issue. > So the problem is either caused by the size of the table or somehow by > the data itself :-( We tested gin with 5 millions records (but not a wikipedia's text, but with blog records which is usually mush shorter) and index creation time was near 8 hours on Dual Xeon 3.2, but we didn't try GiST. In other hand, creating GiST index on 0.5 millions emails took about 12 hours. That's why 7 minutes is very suspicious result. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Wed, Jun 28, 2006 at 01:09:23PM +0400, Teodor Sigaev wrote: > We tested gin with 5 millions records (but not a wikipedia's text, but with > blog records which is usually mush shorter) and index creation time was > near 8 hours on Dual Xeon 3.2, but we didn't try GiST. In other hand, > creating GiST index on 0.5 millions emails took about 12 hours. > > That's why 7 minutes is very suspicious result. Tom did commit a patch a while ago which made a huge difference in index creation time for tsearch by changing one routine. I don't know if it got backpatched, so it might be worth checking people are working on the same version. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
On Wed, 28 Jun 2006, Martijn van Oosterhout wrote: > On Wed, Jun 28, 2006 at 01:09:23PM +0400, Teodor Sigaev wrote: >> We tested gin with 5 millions records (but not a wikipedia's text, but with >> blog records which is usually mush shorter) and index creation time was >> near 8 hours on Dual Xeon 3.2, but we didn't try GiST. In other hand, >> creating GiST index on 0.5 millions emails took about 12 hours. >> >> That's why 7 minutes is very suspicious result. > > Tom did commit a patch a while ago which made a huge difference in > index creation time for tsearch by changing one routine. I don't know > if it got backpatched, so it might be worth checking people are working > on the same version. it's on 8.1.4 and speedup is about 10 times. > > Have a nice day, > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
> Tom did commit a patch a while ago which made a huge difference in > index creation time for tsearch by changing one routine. I don't know > if it got backpatched, so it might be worth checking people are working > on the same version. I saw that patch, but I still think that 7 minutes is too small :) -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev wrote: >> Tom did commit a patch a while ago which made a huge difference in >> index creation time for tsearch by changing one routine. I don't know >> if it got backpatched, so it might be worth checking people are working >> on the same version. > > I saw that patch, but I still think that 7 minutes is too small :) hmm I did some further testing on that and it looks like you might indeed be right in suspecting that there is something fishy with the GIST results. It might be possible that there was some issue with the generated tsvectors (all of them empty due to hitting the "too long value" error case for exceeding MAXSTRPOS) in the GIST case - sorry for the confusion:-( Speaking of the "too long value" error message - some of the errormessages in tsvector.c are a bit "terse" (ie it is not really obvious what is causing the above error without looking at the source for example). Stefan
Try again, today's patch solves the problem. Stefan Kaltenbrunner wrote: > on IRC somebody mentioned that it took >34h to greate a GIN index (on a > tsvector) on a ~3 Million column table (wikipedia dump) with a > reasonable speced box (AMD 3400+). > After getting hold of a dump of said table (around 4,1GB in size) I > managed to get the following timings: > > test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector); > CREATE INDEX > Time: 416122.896 ms > > so about 7 minutes - sounds very reasonable > > test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector); > CREATE INDEX > Time: 52681605.101 ms > > ouch - that makes for a whoppy 14,6hours(!). During that time the box is > completely CPU bottlenecked and during virtually no IO at all - (varing > maintainance_work_mem does not seem to make any noticable difference). > > That box is a fast Dual Opteron 2.6Ghz with 8GB RAM and a 4 disk RAID10 > for the WAL and 12 disks for the data running a very recent -HEAD > checkout ... > > It looks like we still don't have any docs for GIN in the tree so I > don't know if those timings are expected or not ... > > > Stefan > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/