Thread: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Bruce Momjian
Euler Taveira de Oliveira wrote:
> Edwin Groothuis wrote:
>
> > Ouch. But... since very long words are already not indexed (is the
> > length configurable anywhere? I don't mind setting it to 50 characters),
> > I don't think it should bomb out on this, but instead print a similar
> > warning, like "String only partly indexed".
> >
> This is not a bug. I would say it's a limitation. Look at
> src/include/tsearch/ts_type.h. You could decrease len in WordEntry to 9
> (512 characters) and increase pos to 22 (4 Mb). Don't forget to update
> MAXSTRLEN and MAXSTRPOS accordingly.
>
> > I'm still trying to determine how big the message it failed on was...
> >
> Maybe we should change the "string is too long for tsvector" to "string
> is too long (%ld bytes, max %ld bytes) for tsvector".
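
For reference, the fields Euler mentions: an approximate sketch of the
8.3-era layout in src/include/tsearch/ts_type.h (recalled from the
source, so treat details as approximate), with his proposed
9-bit/22-bit split shown as a comment only, not committed code:

    typedef struct
    {
        uint32  haspos:1,
                len:11,   /* lexeme length, max MAXSTRLEN = 2^11 - 1 = 2047 */
                pos:20;   /* byte offset,   max MAXSTRPOS = 2^20 - 1 = 1048575 (~1 MB) */
    } WordEntry;

    #define MAXSTRLEN  ( (1 << 11) - 1 )
    #define MAXSTRPOS  ( (1 << 20) - 1 )

    /* Euler's suggestion: len:9 (511-byte lexemes) and pos:22 (~4 MB),
     * with MAXSTRLEN and MAXSTRPOS adjusted to match. */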

Good idea.  I have applied the following patch to report the string
length and the maximum in the error message, like we already do for
long words:

Old:
    test=> select repeat('a', 3000)::tsvector;
    ERROR:  word is too long (3000 bytes, max 2046 bytes)

New:
    test=> select repeat('a ', 3000000)::tsvector;
    ERROR:  string is too long for tsvector (1048576 bytes, max 1048575 bytes)

I did not backpatch this to 8.3 because it would require translation
string updates.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Index: src/backend/tsearch/to_tsany.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/to_tsany.c,v
retrieving revision 1.8
diff -c -c -r1.8 to_tsany.c
*** src/backend/tsearch/to_tsany.c    1 Jan 2008 19:45:52 -0000    1.8
--- src/backend/tsearch/to_tsany.c    5 Mar 2008 15:41:36 -0000
***************
*** 163,169 ****
      if (lenstr > MAXSTRPOS)
          ereport(ERROR,
                  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                  errmsg("string is too long for tsvector")));

      totallen = CALCDATASIZE(prs->curwords, lenstr);
      in = (TSVector) palloc0(totallen);
--- 163,169 ----
      if (lenstr > MAXSTRPOS)
          ereport(ERROR,
                  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                      errmsg("string is too long for tsvector (%d bytes, max %d bytes)", lenstr, MAXSTRPOS)));

      totallen = CALCDATASIZE(prs->curwords, lenstr);
      in = (TSVector) palloc0(totallen);
Index: src/backend/utils/adt/tsvector.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/tsvector.c,v
retrieving revision 1.11
diff -c -c -r1.11 tsvector.c
*** src/backend/utils/adt/tsvector.c    1 Jan 2008 19:45:53 -0000    1.11
--- src/backend/utils/adt/tsvector.c    5 Mar 2008 15:41:36 -0000
***************
*** 224,230 ****
          if (cur - tmpbuf > MAXSTRPOS)
              ereport(ERROR,
                      (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                      errmsg("string is too long for tsvector")));

          /*
           * Enlarge buffers if needed
--- 224,230 ----
          if (cur - tmpbuf > MAXSTRPOS)
              ereport(ERROR,
                      (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                       errmsg("string is too long for tsvector (%d bytes, max %d bytes)", cur - tmpbuf, MAXSTRPOS)));

          /*
           * Enlarge buffers if needed
***************
*** 273,279 ****
      if (buflen > MAXSTRPOS)
          ereport(ERROR,
                  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                  errmsg("string is too long for tsvector")));

      totallen = CALCDATASIZE(len, buflen);
      in = (TSVector) palloc0(totallen);
--- 273,279 ----
      if (buflen > MAXSTRPOS)
          ereport(ERROR,
                  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                  errmsg("string is too long for tsvector (%d bytes, max %d bytes)", buflen, MAXSTRPOS)));

      totallen = CALCDATASIZE(len, buflen);
      in = (TSVector) palloc0(totallen);
Index: src/backend/utils/adt/tsvector_op.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/tsvector_op.c,v
retrieving revision 1.12
diff -c -c -r1.12 tsvector_op.c
*** src/backend/utils/adt/tsvector_op.c    1 Jan 2008 19:45:53 -0000    1.12
--- src/backend/utils/adt/tsvector_op.c    5 Mar 2008 15:41:36 -0000
***************
*** 488,494 ****
      if (dataoff > MAXSTRPOS)
          ereport(ERROR,
                  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                  errmsg("string is too long for tsvector")));

      out->size = ptr - ARRPTR(out);
      SET_VARSIZE(out, CALCDATASIZE(out->size, dataoff));
--- 488,494 ----
      if (dataoff > MAXSTRPOS)
          ereport(ERROR,
                  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                  errmsg("string is too long for tsvector (%d bytes, max %d bytes)", dataoff, MAXSTRPOS)));

      out->size = ptr - ARRPTR(out);
      SET_VARSIZE(out, CALCDATASIZE(out->size, dataoff));

Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Edwin Groothuis
On Wed, Mar 05, 2008 at 10:53:38AM -0500, Bruce Momjian wrote:
> Euler Taveira de Oliveira wrote:
> > Edwin Groothuis wrote:
> >
> > > Ouch. But... since very long words are already not indexed (is the
> > > length configurable anywhere? I don't mind setting it to 50 characters),
> > > I don't think it should bomb out on this, but instead print a similar
> > > warning, like "String only partly indexed".
> > >
> > This is not a bug. I would say it's a limitation. Look at
> > src/include/tsearch/ts_type.h. You could decrease len in WordEntry to 9
> > (512 characters) and increase pos to 22 (4 Mb). Don't forget to update
> > MAXSTRLEN and MAXSTRPOS accordingly.
> >
> > > I'm still trying to determine how big the message it failed on was...
> > >
> > Maybe we should change the "string is too long for tsvector" to "string
> > is too long (%ld bytes, max %ld bytes) for tsvector".
>
> Good idea.  I have applied the following patch to report the string
> length and the maximum in the error message, like we already do for
> long words:
>
> Old:
>     test=> select repeat('a', 3000)::tsvector;
>     ERROR:  word is too long (3000 bytes, max 2046 bytes)
>
> New:
>     test=> select repeat('a ', 3000000)::tsvector;
>     ERROR:  string is too long for tsvector (1048576 bytes, max 1048575 bytes)

Is it possible to make it a WARNING instead of an ERROR? Right now I get:

NOTICE: word is too long to be indexed
DETAIL: Words longer than 2047 characters are ignored.

when updating the dictionary on a table, which will make it continue,
but with some long messages I get:

ERROR: string is too long for tsvector

which is quite fatal for the whole UPDATE / INSERT statement.

Edwin

--
Edwin Groothuis      |            Personal website: http://www.mavetju.org
edwin@mavetju.org    |              Weblog: http://www.mavetju.org/weblog/

Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Euler Taveira de Oliveira
Edwin Groothuis wrote:

> Is it possible to make it a WARNING instead of an ERROR? Right now I get:
>
No. All of the other types emit an ERROR if you try to use an
out-of-range value.


--
   Euler Taveira de Oliveira
   http://www.timbira.com/

Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Edwin Groothuis
On Thu, Mar 06, 2008 at 08:19:35PM -0300, Euler Taveira de Oliveira wrote:
> Edwin Groothuis wrote:
>
> > Is it possible to make it a WARNING instead of an ERROR? Right now I get:
> >
> No. All of the other types emit an ERROR if you try to use an
> out-of-range value.

Does that then mean that, because of the triggers on inserts, we
will never be able to add this record?

Edwin
--
Edwin Groothuis      |            Personal website: http://www.mavetju.org
edwin@mavetju.org    |              Weblog: http://www.mavetju.org/weblog/

Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Tom Lane
Euler Taveira de Oliveira <euler@timbira.com> writes:
> Edwin Groothuis wrote:
>> Is it possible to make it a WARNING instead of an ERROR? Right now I get:
>>
> No. All of the other types emit an ERROR if you try to use an
> out-of-range value.

I don't think that follows.  A tsearch index is lossy anyway, so there's
no hard and fast reason why it should reject entries that it can't index
completely.  I think it would be more useful to index whatever it can
(probably just the words in the first N bytes of the document) than to
prevent you from storing the document.

There is another precedent too, namely that tsearch already discards
individual words over so-many-bytes long.  If it doesn't throw an error
for that case, why is it throwing an error for this one?

            regards, tom lane

Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Euler Taveira de Oliveira
Tom Lane wrote:

> I don't think that follows.  A tsearch index is lossy anyway, so there's
> no hard and fast reason why it should reject entries that it can't index
> completely.  I think it would be more useful to index whatever it can
> (probably just the words in the first N bytes of the document) than to
> prevent you from storing the document.
>
The problem with this approach is how to select the part of the document
to index. How will you ensure you're not ignoring the more important
words of the document?
IMHO Postgres shouldn't decide this; it would be good if a user could
set it at runtime and/or in postgresql.conf.


--
   Euler Taveira de Oliveira
   http://www.timbira.com/

Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Tom Lane
Euler Taveira de Oliveira <euler@timbira.com> writes:
> The problem with this approach is how to select the part of the document
> to index. How will you ensure you're not ignoring the more important
> words of the document?

That's *always* a risk, anytime you do any sort of processing or
normalization on the text.  The question here is not whether or not
we will make tradeoffs, only which ones to make.

> IMHO Postgres shouldn't decide this; it would be good if a user could
> set it at runtime and/or in postgresql.conf.

Well, there is exactly zero chance of that happening in 8.3.x, because
the bit allocations for on-disk tsvector representation are already
determined.  It's fairly hard to see a way of doing it in future
releases that would have acceptable costs, either.

But more to the point: no matter what the document length limit is,
why should it be a hard error to exceed it?  The downside of not
indexing words beyond the length limit is that searches won't find
documents in which the search terms occur only very far into the
document.  The downside of throwing an error is that we can't store such
documents at all, which surely guarantees that searches won't find
them.  How can you possibly argue that that option is better?

            regards, tom lane

Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Bruce Momjian
Tom Lane wrote:
> Euler Taveira de Oliveira <euler@timbira.com> writes:
> > Edwin Groothuis wrote:
> >> Is it possible to make it a WARNING instead of an ERROR? Right now I get:
> >>
> > No. All of the other types emit an ERROR if you try to use an
> > out-of-range value.
>
> I don't think that follows.  A tsearch index is lossy anyway, so there's

Uh, the index is lossy but I thought it was lossy in a way that just
required additional heap accesses, not lossy in that it doesn't index
everything.

I think just indexing what we can and throwing away the rest is a MySQL
approach we don't want to take.  We could throw a warning on truncation
but that seems wrong too.

I am concerned a 1MB limit is too low though.  Exactly why can't we have
a higher limit?  Is positional information that significant?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Bruce Momjian
Edwin Groothuis wrote:
> On Thu, Mar 06, 2008 at 08:19:35PM -0300, Euler Taveira de Oliveira wrote:
> > Edwin Groothuis wrote:
> >
> > > Is it possible to make it a WARNING instead of an ERROR? Right now I get:
> > >
> > No. All of the other types emit an ERROR if you try to use an
> > out-of-range value.
>
> Does that then mean that, because of the triggers on inserts, we
> will never be able to add this record?

Right now, yes, unless we figure something else out.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Tom Lane
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> I don't think that follows.  A tsearch index is lossy anyway, so there's

> Uh, the index is lossy but I thought it was lossy in a way that just
> required additional heap accesses, not lossy in that it doesn't index
> everything.

Sure it's lossy.  It doesn't index stopwords, and it doesn't index the
difference between various forms of a word (when the dictionaries reduce
them to a common root).

> I am concerned a 1MB limit is too low though.  Exactly why can't we have
> a higher limit?  Is positional information that significant?

That's pretty much exactly the point: it's not very significant, and it
doesn't justify a total inability to index large documents.

One thing we could do is index words that are past the limit but not
store a position, or perhaps have the convention that the maximum
position value means "somewhere past here".
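
For context, tsvector already applies a "pin at the maximum" convention
of this flavor to word positions past 2^14 - a sketch of the 8.3-era
macros from src/include/tsearch/ts_type.h (recalled from the source, so
treat the exact names as approximate):

    #define MAXENTRYPOS  (1 << 14)
    #define LIMITPOS(x)  ( ( (x) >= MAXENTRYPOS ) ? (MAXENTRYPOS - 1) : (x) )

Extending the same clamping idea to byte offsets past MAXSTRPOS would
match that existing practice.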

            regards, tom lane

Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> I don't think that follows.  A tsearch index is lossy anyway, so there's
>
> > Uh, the index is lossy but I thought it was lossy in a way that just
> > required additional heap accesses, not lossy in that it doesn't index
> > everything.
>
> Sure it's lossy.  It doesn't index stopwords, and it doesn't index the
> difference between various forms of a word (when the dictionaries reduce
> them to a common root).

Yes, but you specify the stop words and stemming rules --- it isn't like
it drops words that are out of your control.

> > I am concerned a 1MB limit is too low though.  Exactly why can't we have
> > a higher limit?  Is positional information that significant?
>
> That's pretty much exactly the point: it's not very significant, and it
> doesn't justify a total inability to index large documents.

Agreed.  I think losing positional information after 1MB is acceptable.

> One thing we could do is index words that are past the limit but not
> store a position, or perhaps have the convention that the maximum
> position value means "somewhere past here".

Sure.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Teodor Sigaev
To be precise about tsvector:

1) A GiST index is lossy for any kind of tsearch query; a GIN index is
not lossy for the @@ operation, but is lossy for @@@.

2) The number of positions per word is limited to 256 - a bigger number
of positions is not helpful for ranking, but produces a big tsvector. If
a word has a lot of positions in a document, then it is close to being a
stopword. We could easily increase this limit to 65536 positions.

3) The maximum position value is 2^14, because positions are stored as
uint16, with 2 bits of that integer reserved for the weight of the
position. It would be possible to widen uint16 to uint32, but that would
double the tsvector size, which is impractical, I suppose. So the part
of a document used for ranking is its first 16384 words - roughly the
first 50-100 kilobytes.

4) The limit on the total size of a tsvector comes from the
WordEntry->pos field (ts_type.h). It holds the number of bytes between
the first lexeme in the tsvector and the lexeme in question, so the
limit covers the total length of the lexemes plus their positional
information.
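
For reference, a sketch of the position encoding behind points 2 and 3
(based on the 8.3-era src/include/tsearch/ts_type.h; exact names
recalled from the source, so treat them as approximate):

    typedef uint16 WordEntryPos;    /* 2 bits of weight + 14 bits of position */

    #define WEP_GETWEIGHT(x)  ( (x) >> 14 )      /* weight in the top 2 bits */
    #define WEP_GETPOS(x)     ( (x) & 0x3fff )   /* position in the low 14 bits */

    #define MAXNUMPOS  256      /* at most 256 positions are kept per lexeme */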


--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Euler Taveira de Oliveira
Tom Lane wrote:

> Well, there is exactly zero chance of that happening in 8.3.x, because
> the bit allocations for on-disk tsvector representation are already
> determined.  It's fairly hard to see a way of doing it in future
> releases that would have acceptable costs, either.
>
I think you missed my point, or I didn't explain it in enough detail.
I'm talking about making the error-or-notice behavior a GUC variable
(e.g. ignore_tsearch_limits = on/off).

> But more to the point: no matter what the document length limit is,
> why should it be a hard error to exceed it?  The downside of not
> indexing words beyond the length limit is that searches won't find
> documents in which the search terms occur only very far into the
> document.  The downside of throwing an error is that we can't store such
> documents at all, which surely guarantees that searches won't find
> them.  How can you possibly argue that that option is better?
>
IMHO, both approaches are "bad"; that's why I propose a user option, so
the user can choose between being strict (error out when a limit is
exceeded) and being lenient (emit a notice instead). Maybe some day we
can increase the limits (e.g. a ts_type.h redesign).
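
A sketch of how such an option might look at one of the error sites in
to_tsany.c - purely illustrative: ignore_tsearch_limits is Euler's
hypothetical GUC and was never implemented, and a real patch would also
have to discard the lexemes that no longer fit rather than just report
the overflow:

    if (lenstr > MAXSTRPOS)
    {
        if (!ignore_tsearch_limits)     /* proposed GUC, default off */
            ereport(ERROR,
                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                     errmsg("string is too long for tsvector (%d bytes, max %d bytes)",
                            lenstr, MAXSTRPOS)));

        ereport(NOTICE,
                (errmsg("tsvector truncated (%d bytes, max %d bytes)",
                        lenstr, MAXSTRPOS)));
        /* ... drop the words whose byte offsets exceed MAXSTRPOS here ... */
    }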


--
   Euler Taveira de Oliveira
   http://www.timbira.com/


Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

From: Bruce Momjian
Added to TODO:

        o Consider changing error to warning for strings larger than one
          megabyte

          http://archives.postgresql.org/pgsql-bugs/2008-02/msg00190.php
          http://archives.postgresql.org/pgsql-patches/2008-03/msg00062.php


---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> I don't think that follows.  A tsearch index is lossy anyway, so there's
>
> > Uh, the index is lossy but I thought it was lossy in a way that just
> > required additional heap accesses, not lossy in that it doesn't index
> > everything.
>
> Sure it's lossy.  It doesn't index stopwords, and it doesn't index the
> difference between various forms of a word (when the dictionaries reduce
> them to a common root).
>
> > I am concerned a 1MB limit is too low though.  Exactly why can't we have
> > a higher limit?  Is positional information that significant?
>
> That's pretty much exactly the point: it's not very significant, and it
> doesn't justify a total inability to index large documents.
>
> One thing we could do is index words that are past the limit but not
> store a position, or perhaps have the convention that the maximum
> position value means "somewhere past here".
>
>             regards, tom lane

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +