Thread: Building full-text index

Building full-text index

From
"Sean Davis"
Date:
I am trying to build a full-text index (gin(to_tsvector('english',
title || abstract))) on about 18 million abstracts and titles from
medical literature.  However, I keep getting out-of-memory errors.  (I
am on a 32Gb linux system with maintenance_work_mem set to 20Gb and
shared buffers at 4Gb; postgres 8.3beta).  Does creation of a
full-text index require that the entire index fit into memory?  I know
this is probably a naive question, but I guess I assumed that the file
system could be used.  Any hints on how to build such an index?

Thanks,
Sean

Re: Building full-text index

From
Tom Lane
Date:
"Sean Davis" <sdavis2@mail.nih.gov> writes:
> I am trying to build a full-text index (gin(to_tsvector('english',
> title || abstract))) on about 18 million abstracts and titles from
> medical literature.  However, I keep getting out-of-memory errors.  (I
> am on a 32Gb linux system with maintenance_work_mem set to 20Gb and
> shared buffers at 4Gb; postgres 8.3beta).  Does creation of a
> full-text index require that the entire index fit into memory?

I can't reproduce any memory-leak issue here.  I wonder whether your
maintenance_work_mem setting is optimistically large (like, higher
than the ulimit restriction on the postmaster).

            regards, tom lane

Re: Building full-text index

From
"Sean Davis"
Date:
On Nov 15, 2007 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Sean Davis" <sdavis2@mail.nih.gov> writes:
> > I am trying to build a full-text index (gin(to_tsvector('english',
> > title || abstract))) on about 18 million abstracts and titles from
> > medical literature.  However, I keep getting out-of-memory errors.  (I
> > am on a 32Gb linux system with maintenance_work_mem set to 20Gb and
> > shared buffers at 4Gb; postgres 8.3beta).  Does creation of a
> > full-text index require that the entire index fit into memory?
>
> I can't reproduce any memory-leak issue here.  I wonder whether your
> maintenance_work_mem setting is optimistically large (like, higher
> than the ulimit restriction on the postmaster).

Thanks, Tom.  ulimit -a shows unlimited, but there may be something
else going on.  I'll try leaving it lower and see what that does for
me.

Sean

Re: Building full-text index

From
Tom Lane
Date:
"Sean Davis" <sdavis2@mail.nih.gov> writes:
> I am trying to build a full-text index (gin(to_tsvector('english',
> title || abstract))) on about 18 million abstracts and titles from
> medical literature.  However, I keep getting out-of-memory errors.  (I
> am on a 32Gb linux system with maintenance_work_mem set to 20Gb and
> shared buffers at 4Gb; postgres 8.3beta).  Does creation of a
> full-text index require that the entire index fit into memory?

I looked closer at this and discovered that there's an overflow problem
in the GIN index build code: with maintenance_work_mem above 8Gb, it
miscalculates how much space it's used and never realizes when it's
reached the intended limit.  So indeed you were seeing it try to create
the entire index in memory :-(.

This will be fixed in the next beta, but in the meantime set
maintenance_work_mem to something less than 8Gb.

            regards, tom lane

Re: Building full-text index

From
"Sean Davis"
Date:
On Nov 16, 2007 5:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Sean Davis" <sdavis2@mail.nih.gov> writes:
> > I am trying to build a full-text index (gin(to_tsvector('english',
> > title || abstract))) on about 18 million abstracts and titles from
> > medical literature.  However, I keep getting out-of-memory errors.  (I
> > am on a 32Gb linux system with maintenance_work_mem set to 20Gb and
> > shared buffers at 4Gb; postgres 8.3beta).  Does creation of a
> > full-text index require that the entire index fit into memory?
>
> I looked closer at this and discovered that there's an overflow problem
> in the GIN index build code: with maintenance_work_mem above 8Gb, it
> miscalculates how much space it's used and never realizes when it's
> reached the intended limit.  So indeed you were seeing it try to create
> the entire index in memory :-(.
>
> This will be fixed in the next beta, but in the meantime set
> maintenance_work_mem to something less than 8Gb.

Thanks, Tom.  I had tried this empirically and things worked fine.
Glad to hear that it is fixed in the next beta.

Sean