Thread: string_to_array eats too much memory?

string_to_array eats too much memory?

From: Tatsuo Ishii
Hi,

I'm playing with GIN to make a full text search system. GIN comes with
built-in TEXT[] support and I use string_to_array() to make a
TEXT[]. The problem is that if there is a large number of array
elements, string_to_array() consumes too much memory. For example, to
make ~70k array elements, string_to_array() seems to eat several
gigabytes of memory. ~70k array elements means the same number of
words in a document, which is not that big for a large text, IMO.

Comments?
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: string_to_array eats too much memory?

From: Teodor Sigaev
> I'm playing with GIN to make a full text search system. GIN comes with
> built-in TEXT[] support and I use string_to_array() to make a
> TEXT[]. The problem is that if there is a large number of array
> elements, string_to_array() consumes too much memory. For example, to
> make ~70k array elements, string_to_array() seems to eat several
> gigabytes of memory. ~70k array elements means the same number of
> words in a document, which is not that big for a large text, IMO.

Do you mean 70k unique lexemes? Ugh.
Why don't you use the tsearch framework?

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: string_to_array eats too much memory?

From: Tatsuo Ishii
> > I'm playing with GIN to make a full text search system. GIN comes with
> > built-in TEXT[] support and I use string_to_array() to make a
> > TEXT[]. The problem is that if there is a large number of array
> > elements, string_to_array() consumes too much memory. For example, to
> > make ~70k array elements, string_to_array() seems to eat several
> > gigabytes of memory. ~70k array elements means the same number of
> > words in a document, which is not that big for a large text, IMO.
> 
> Do you mean 70k unique lexemes? Ugh.

I'm testing how GIN scales.

> Why don't you use the tsearch framework?

? I thought GIN was superior to tsearch2.

From your GIN proposal posted to pgsql-hackers:

"The primary goal of the Gin index is a scalable full text search in
PostgreSQL"

What do you think? :-)
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: string_to_array eats too much memory?

From: "Magnus Hagander"
> > > I'm playing with GIN to make a full text search system. GIN comes
> > > with built-in TEXT[] support and I use string_to_array() to make a
> > > TEXT[]. The problem is that if there is a large number of array
> > > elements, string_to_array() consumes too much memory. For example,
> > > to make ~70k array elements, string_to_array() seems to eat several
> > > gigabytes of memory. ~70k array elements means the same number of
> > > words in a document, which is not that big for a large text, IMO.
> >
> > Do you mean 70k unique lexemes? Ugh.
>
> I'm testing how GIN scales.
>
> > Why don't you use the tsearch framework?
>
> ? I thought GIN was superior to tsearch2.
>
> From your GIN proposal posted to pgsql-hackers:
>
> "The primary goal of the Gin index is a scalable full text
> search in PostgreSQL"

tsearch2 *uses* GIN in 8.2. Just CREATE INDEX foo ON bar USING
gin(mytsvector).
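
For example (table and column names are hypothetical; to_tsvector() and
to_tsquery() here are tsearch2's, using its default configuration):

    CREATE TABLE docs (id serial, body text, body_tsv tsvector);
    -- populate the tsvector column, then index it with GIN
    UPDATE docs SET body_tsv = to_tsvector(body);
    CREATE INDEX docs_tsv_idx ON docs USING gin (body_tsv);
    -- searches of this form can now use the index
    SELECT id FROM docs WHERE body_tsv @@ to_tsquery('scalable & search');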

And tsearch2 in 8.2 with GIN can be a *lot* faster than with GiST. I've
been running experiments on the website search with tsearch2/GIN and
I've been seeing fantastic performance compared to previous versions.



//Magnus


Re: string_to_array eats too much memory?

From: Teodor Sigaev
> I'm testing how GIN scales.

Have a look at http://www.sigaev.ru/cvsweb/cvsweb.cgi/ftsbench/ - the
utility was developed specifically for measuring the performance of
full-text search solutions (right now it supports PgSQL (GiST, GIN) and
MySQL). I'm currently looking for good query statistics to simulate
load, but that data is closely held by internet-wide search engines :(


> ? I thought GIN was superior to tsearch2.
> 
> From your GIN proposal posted to pgsql-hackers:
> 
> "The primary goal of the Gin index is a scalable full text search in
> PostgreSQL"

GIN itself is just a tool for speeding up searches; the linguistic part
is still in tsearch2.

It's possible to use tsearch2 without any indexes at all. GiST and GIN
are just ways to speed up searches.

Of course, you can develop another framework for full text search, and
that framework may use GIN as it wishes :)




-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: string_to_array eats too much memory?

From: Tatsuo Ishii
> > I'm testing how GIN scales.
> 
> Have a look at http://www.sigaev.ru/cvsweb/cvsweb.cgi/ftsbench/ - the
> utility was developed specifically for measuring the performance of
> full-text search solutions (right now it supports PgSQL (GiST, GIN) and
> MySQL). I'm currently looking for good query statistics to simulate
> load, but that data is closely held by internet-wide search engines :(

Thanks.

> GIN itself is just a tool for speeding up searches; the linguistic
> part is still in tsearch2.
> 
> It's possible to use tsearch2 without any indexes at all. GiST and GIN
> are just ways to speed up searches.
> 
> Of course, you can develop another framework for full text search, and
> that framework may use GIN as it wishes :)

The problem with Japanese is that it's an agglutinative language and we
need to separate the individual words in a sentence. So I need to
modify tsearch2 anyway (I know someone from Japan is working on this).

BTW, can tsearch2 handle ~70k words in a document?
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: string_to_array eats too much memory?

From: Teodor Sigaev
> The problem with Japanese is that it's an agglutinative language and we
> need to separate the individual words in a sentence. So I need to
> modify tsearch2 anyway (I know someone from Japan is working on this).
https://www.oss.ecl.ntt.co.jp/tsearch2j/index.html
That's it?

> 
> BTW, can tsearch2 handle ~70k words in a document?

I don't see any problem. The tsvector size must not be greater than
1MB, however.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: string_to_array eats too much memory?

From: Tom Lane
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> string_to_array() consumes too much memory. For example, to make ~70k
> array elements, string_to_array() seems to eat several gigabytes of
> memory.

I'd argue that the problem comes from enlarging the work arrays only 64
elements at a time in accumArrayResult().  Most of the rest of the code
deals with resizing arrays using a "double it each time it has to grow"
approach; I wonder why this is different?
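
Rough arithmetic (assuming each enlargement has to copy all existing
elements): growing by 64 means about n/64 reallocations, copying roughly

    64 * (1 + 2 + ... + n/64) ~= n^2/128 ~= 3.8e7

elements in total for n = 70000, whereas doubling would copy fewer than
2n ~= 1.4e5.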
        regards, tom lane


Re: string_to_array eats too much memory?

From: Tatsuo Ishii
> > The problem with Japanese is that it's an agglutinative language and we
> > need to separate the individual words in a sentence. So I need to
> > modify tsearch2 anyway (I know someone from Japan is working on this).
> https://www.oss.ecl.ntt.co.jp/tsearch2j/index.html
> That's it?

Yes. However, I'm going to use a different "word separation" library
than theirs and will make some tweaks.

> > BTW, can tsearch2 handle ~70k words in a document?
> 
> I don't see any problem.

Great. I have done a little trial, and tsearch2 seems to work great
with GIN.

> The tsvector size must not be greater than 1MB, however.

Is this documented somewhere? Also, I noticed that tsearch2 treats ":"
as a special character. Are there any other special characters? If so,
where are they documented?
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: string_to_array eats too much memory?

From: Teodor Sigaev
>> The tsvector size must not be greater than 1MB, however.
> 
> Is this documented somewhere? Also, I noticed that tsearch2 treats ":"
> as a special character. Are there any other special characters? If so,
> where are they documented?
http://www.sai.msu.su/~megera/wiki/Tsearch_V2_in_Brief
Limitations
13.1  2048 bytes per lexeme.
13.2  ts_vector is limited to about 1MB. The exact value depends on the
      quantity of positional information: if there is no positional
      information, the sum of the lexeme lengths must be less than 1MB;
      otherwise, the sum of the lexeme lengths and the positional info
      must be. Positional information uses 2 bytes per position and 2
      bytes per lexeme with positional info. The number of lexemes is
      limited by 4^32, so in practice it's unlimited.
13.3  ts_query: the number of entries (nodes, i.e. the sum of lexemes
      and operations) is limited: the internal representation is in
      Polish notation and the position of an operand is pointed to by
      an int2, so it's a rather soft limit. In any case, the lower end
      of the limit is 32768 nodes. Notice: ts_query is not designed for
      storing in a table and is optimized for speed, not for size.
13.4  Positional information in ts_vector:
      13.4.1  A position value may not be greater than 2^14 (16384);
              any value greater than this limit will be replaced by
              16383.
      13.4.2  Only 256 positional entries per lexeme.


Some useful articles
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/custom-dict.html


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: string_to_array eats too much memory?

From: Teodor Sigaev
> Is this documented somewhere? Also, I noticed that tsearch2 treats ":"
> as a special character. Are there any other special characters? If so,
> where are they documented?

You can avoid confusion with special characters by quoting:

# select '''wow:'''::tsvector;
 tsvector
----------
 'wow:'
(1 row)

':' is the separator between a lexeme and its positional information.
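
Without quoting, whatever follows the ':' is parsed as positional
information instead:

# select 'wow:1,5'::tsvector;
  tsvector
------------
 'wow':1,5
(1 row)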
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: string_to_array eats too much memory?

From: Teodor Sigaev
> Limitations
Sorry for the noise - it's mentioned in README.tsearch2.
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: string_to_array eats too much memory?

From: Oleg Bartunov
On Thu, 9 Nov 2006, Tatsuo Ishii wrote:

>>> The problem with Japanese is that it's an agglutinative language and we
>>> need to separate the individual words in a sentence. So I need to
>>> modify tsearch2 anyway (I know someone from Japan is working on this).
>> https://www.oss.ecl.ntt.co.jp/tsearch2j/index.html
>> That's it?
>
> Yes. However, I'm going to use a different "word separation" library
> than theirs and will make some tweaks.
>
>>> BTW, can tsearch2 handle ~70k words in a document?
>>
>> I don't see any problem.
>
> Great. I have done a little trial, and tsearch2 seems to work great
> with GIN.

Tatsuo, ideally I'd like to have tsearch2 untouched, but with Japanese
parser(s) and dictionaries (programs) available. This is how tsearch2
was designed. If something prevents doing so, we should improve
tsearch2. This is important now, since we're going to build tsearch2
into the PostgreSQL core for 8.3.


    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: string_to_array eats too much memory?

From: Michael Paesold
Tom Lane writes:
> Tatsuo Ishii <ishii@sraoss.co.jp> writes:
>> string_to_array() consumes too much memory. For example, to make
>> ~70k array elements, string_to_array() seems to eat several gigabytes
>> of memory.
> 
> I'd argue that the problem comes from enlarging the work arrays only
> 64 elements at a time in accumArrayResult(). Most of the rest of the
> code deals with resizing arrays using a "double it each time it has
> to grow" approach, I wonder why this is different?

Without reading the code, I guess that simply means O(n^2) runtime. This 
should be fixed, then, right?

Best Regards,
Michael Paesold