Thread: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Euler Taveira de Oliveira wrote:
> Edwin Groothuis wrote:
>
> > Ouch. But... since very long words are already not indexed (is the length configurable anywhere because I don't mind setting it to 50 characters), I don't think that it should bomb out of this but print a similar warning like "String only partly indexed".
>
> This is not a bug. I would say it's a limitation. Look at src/include/tsearch/ts_type.h. You could decrease len in WordEntry to 9 (512 characters) and increase pos to 22 (4 Mb). Don't forget to update MAXSTRLEN and MAXSTRPOS accordingly.
>
> > I'm still trying to determine how big the message it failed on was...
>
> Maybe we should change the "string is too long for tsvector" to "string is too long (%ld bytes, max %ld bytes) for tsvector".

Good idea.  I have applied the following patch to report in the error message the string length and maximum, like we already do for long words:

    Old:
        test=> select repeat('a', 3000)::tsvector;
        ERROR:  word is too long (3000 bytes, max 2046 bytes)

    New:
        test=> select repeat('a ', 3000000)::tsvector;
        ERROR:  string is too long for tsvector (1048576 bytes, max 1048575 bytes)

I did not backpatch this to 8.3 because it would require translation string updates.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Index: src/backend/tsearch/to_tsany.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/to_tsany.c,v
retrieving revision 1.8
diff -c -c -r1.8 to_tsany.c
*** src/backend/tsearch/to_tsany.c    1 Jan 2008 19:45:52 -0000    1.8
--- src/backend/tsearch/to_tsany.c    5 Mar 2008 15:41:36 -0000
***************
*** 163,169 ****
      if (lenstr > MAXSTRPOS)
          ereport(ERROR,
                  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                  errmsg("string is too long for tsvector")));
  
      totallen = CALCDATASIZE(prs->curwords, lenstr);
      in = (TSVector) palloc0(totallen);
--- 163,169 ----
      if (lenstr > MAXSTRPOS)
          ereport(ERROR,
                  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                  errmsg("string is too long for tsvector (%d bytes, max %d bytes)", lenstr, MAXSTRPOS)));
  
      totallen = CALCDATASIZE(prs->curwords, lenstr);
      in = (TSVector) palloc0(totallen);
Index: src/backend/utils/adt/tsvector.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/tsvector.c,v
retrieving revision 1.11
diff -c -c -r1.11 tsvector.c
*** src/backend/utils/adt/tsvector.c    1 Jan 2008 19:45:53 -0000    1.11
--- src/backend/utils/adt/tsvector.c    5 Mar 2008 15:41:36 -0000
***************
*** 224,230 ****
          if (cur - tmpbuf > MAXSTRPOS)
              ereport(ERROR,
                      (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                      errmsg("string is too long for tsvector")));
  
          /*
           * Enlarge buffers if needed
--- 224,230 ----
          if (cur - tmpbuf > MAXSTRPOS)
              ereport(ERROR,
                      (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                      errmsg("string is too long for tsvector (%d bytes, max %d bytes)", cur - tmpbuf, MAXSTRPOS)));
  
          /*
           * Enlarge buffers if needed
***************
*** 273,279 ****
      if (buflen > MAXSTRPOS)
          ereport(ERROR,
                  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                  errmsg("string is too long for tsvector")));
  
      totallen = CALCDATASIZE(len, buflen);
      in = (TSVector) palloc0(totallen);
--- 273,279 ----
      if (buflen > MAXSTRPOS)
          ereport(ERROR,
                  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                  errmsg("string is too long for tsvector (%d bytes, max %d bytes)", buflen, MAXSTRPOS)));
  
      totallen = CALCDATASIZE(len, buflen);
      in = (TSVector) palloc0(totallen);
Index: src/backend/utils/adt/tsvector_op.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/tsvector_op.c,v
retrieving revision 1.12
diff -c -c -r1.12 tsvector_op.c
*** src/backend/utils/adt/tsvector_op.c    1 Jan 2008 19:45:53 -0000    1.12
--- src/backend/utils/adt/tsvector_op.c    5 Mar 2008 15:41:36 -0000
***************
*** 488,494 ****
      if (dataoff > MAXSTRPOS)
          ereport(ERROR,
                  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                  errmsg("string is too long for tsvector")));
  
      out->size = ptr - ARRPTR(out);
      SET_VARSIZE(out, CALCDATASIZE(out->size, dataoff));
--- 488,494 ----
      if (dataoff > MAXSTRPOS)
          ereport(ERROR,
                  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
!                  errmsg("string is too long for tsvector (%d bytes, max %d bytes)", dataoff, MAXSTRPOS)));
  
      out->size = ptr - ARRPTR(out);
      SET_VARSIZE(out, CALCDATASIZE(out->size, dataoff));
On Wed, Mar 05, 2008 at 10:53:38AM -0500, Bruce Momjian wrote:
> Euler Taveira de Oliveira wrote:
> > Edwin Groothuis wrote:
> >
> > > Ouch. But... since very long words are already not indexed (is the length configurable anywhere because I don't mind setting it to 50 characters), I don't think that it should bomb out of this but print a similar warning like "String only partly indexed".
> >
> > This is not a bug. I would say it's a limitation. Look at src/include/tsearch/ts_type.h. You could decrease len in WordEntry to 9 (512 characters) and increase pos to 22 (4 Mb). Don't forget to update MAXSTRLEN and MAXSTRPOS accordingly.
> >
> > > I'm still trying to determine how big the message it failed on was...
> >
> > Maybe we should change the "string is too long for tsvector" to "string is too long (%ld bytes, max %ld bytes) for tsvector".
>
> Good idea.  I have applied the following patch to report in the error message the string length and maximum, like we already do for long words:
>
>     Old:
>         test=> select repeat('a', 3000)::tsvector;
>         ERROR:  word is too long (3000 bytes, max 2046 bytes)
>
>     New:
>         test=> select repeat('a ', 3000000)::tsvector;
>         ERROR:  string is too long for tsvector (1048576 bytes, max 1048575 bytes)

Is it possible to make it a WARNING instead of an ERROR? Right now I get:

    NOTICE:  word is too long to be indexed
    DETAIL:  Words longer than 2047 characters are ignored.

when updating the dictionary on a table, which will make it continue, but with some long messages I get:

    ERROR:  string is too long for tsvector

Which is quite fatal for the whole UPDATE / INSERT statement.

Edwin

--
Edwin Groothuis      | Personal website: http://www.mavetju.org
edwin@mavetju.org    | Weblog: http://www.mavetju.org/weblog/
Edwin Groothuis wrote:
> Is it possible to make it a WARNING instead of an ERROR? Right now I get:
>
No. All of the other types emit an ERROR if you're trying an out of range value.

--
Euler Taveira de Oliveira
http://www.timbira.com/
On Thu, Mar 06, 2008 at 08:19:35PM -0300, Euler Taveira de Oliveira wrote:
> Edwin Groothuis wrote:
>
> > Is it possible to make it a WARNING instead of an ERROR? Right now I get:
> >
> No. All of the other types emit an ERROR if you're trying an out of range value.

Does that then mean that, because of the triggers on inserts, we will never be able to add this record?

Edwin

--
Edwin Groothuis      | Personal website: http://www.mavetju.org
edwin@mavetju.org    | Weblog: http://www.mavetju.org/weblog/
Euler Taveira de Oliveira <euler@timbira.com> writes:
> Edwin Groothuis wrote:
>> Is it possible to make it a WARNING instead of an ERROR? Right now I get:
>>
> No. All of the other types emit an ERROR if you're trying an out of range value.

I don't think that follows.  A tsearch index is lossy anyway, so there's no hard and fast reason why it should reject entries that it can't index completely.  I think it would be more useful to index whatever it can (probably just the words in the first N bytes of the document) than to prevent you from storing the document.

There is another precedent too, namely that tsearch already discards individual words over so-many-bytes long.  If it doesn't throw an error for that case, why is it throwing an error for this one?

            regards, tom lane
Tom Lane wrote:
> I don't think that follows.  A tsearch index is lossy anyway, so there's no hard and fast reason why it should reject entries that it can't index completely.  I think it would be more useful to index whatever it can (probably just the words in the first N bytes of the document) than to prevent you from storing the document.
>
The problem with this approach is how to select the part of the document to index. How will you ensure you're not ignoring the more important words of the document? IMHO Postgres shouldn't decide it; it would be good if an user could set it runtime and/or on postgresql.conf.

--
Euler Taveira de Oliveira
http://www.timbira.com/
Euler Taveira de Oliveira <euler@timbira.com> writes:
> The problem with this approach is how to select the part of the document to index. How will you ensure you're not ignoring the more important words of the document?

That's *always* a risk, anytime you do any sort of processing or normalization on the text.  The question here is not whether or not we will make tradeoffs, only which ones to make.

> IMHO Postgres shouldn't decide it; it would be good if an user could set it runtime and/or on postgresql.conf.

Well, there is exactly zero chance of that happening in 8.3.x, because the bit allocations for on-disk tsvector representation are already determined.  It's fairly hard to see a way of doing it in future releases that would have acceptable costs, either.

But more to the point: no matter what the document length limit is, why should it be a hard error to exceed it?  The downside of not indexing words beyond the length limit is that searches won't find documents in which the search terms occur only very far into the document.  The downside of throwing an error is that we can't store such documents at all, which surely guarantees that searches won't find them.  How can you possibly argue that that option is better?

            regards, tom lane
Tom Lane wrote:
> Euler Taveira de Oliveira <euler@timbira.com> writes:
> > Edwin Groothuis wrote:
> >> Is it possible to make it a WARNING instead of an ERROR? Right now I get:
> >>
> > No. All of the other types emit an ERROR if you're trying an out of range value.
>
> I don't think that follows.  A tsearch index is lossy anyway, so there's

Uh, the index is lossy but I thought it was lossy in a way that just required additional heap accesses, not lossy in that it doesn't index everything.

I think just indexing what we can and throwing away the rest is a MySQL approach we don't want to take.  We could throw a warning on truncation but that seems wrong too.

I am concerned a 1mb limit is too low though.  Exactly why can't we have a higher limit?  Is positional information that significant?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Edwin Groothuis wrote:
> On Thu, Mar 06, 2008 at 08:19:35PM -0300, Euler Taveira de Oliveira wrote:
> > Edwin Groothuis wrote:
> >
> > > Is it possible to make it a WARNING instead of an ERROR? Right now I get:
> > >
> > No. All of the other types emit an ERROR if you're trying an out of range value.
>
> Does that then mean that, because of the triggers on inserts, we will never be able to add this record?

Right now, yes, unless we figure something else out.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> I don't think that follows.  A tsearch index is lossy anyway, so there's

> Uh, the index is lossy but I thought it was lossy in a way that just required additional heap accesses, not lossy in that it doesn't index everything.

Sure it's lossy.  It doesn't index stopwords, and it doesn't index the difference between various forms of a word (when the dictionaries reduce them to a common root).

> I am concerned a 1mb limit is too low though.  Exactly why can't we have a higher limit?  Is positional information that significant?

That's pretty much exactly the point: it's not very significant, and it doesn't justify a total inability to index large documents.

One thing we could do is index words that are past the limit but not store a position, or perhaps have the convention that the maximum position value means "somewhere past here".

            regards, tom lane
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> I don't think that follows.  A tsearch index is lossy anyway, so there's
>
> > Uh, the index is lossy but I thought it was lossy in a way that just required additional heap accesses, not lossy in that it doesn't index everything.
>
> Sure it's lossy.  It doesn't index stopwords, and it doesn't index the difference between various forms of a word (when the dictionaries reduce them to a common root).

Yes, but you specify the stop words and stemming rules --- it isn't like it drops words that are out of your control.

> > I am concerned a 1mb limit is too low though.  Exactly why can't we have a higher limit?  Is positional information that significant?
>
> That's pretty much exactly the point: it's not very significant, and it doesn't justify a total inability to index large documents.

Agreed.  I think losing positional information after 1mb is acceptable.

> One thing we could do is index words that are past the limit but not store a position, or perhaps have the convention that the maximum position value means "somewhere past here".

Sure.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
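[To make the second idea concrete, a minimal sketch of the "maximum position value means somewhere past here" convention, in illustrative C only and not taken from any patch; the 2^14 figure anticipates the position limit Teodor spells out in the next message:

    #include <stdint.h>

    /* Illustrative only: saturate instead of erroring out.  The largest
     * storable position doubles as "somewhere past here". */
    #define TS_MAX_WORD_POS ((uint32_t) 1 << 14)    /* 16384, see next message */

    static uint16_t
    saturating_position(uint32_t wordnum)
    {
        if (wordnum >= TS_MAX_WORD_POS)
            return (uint16_t) (TS_MAX_WORD_POS - 1);
        return (uint16_t) wordnum;
    }
]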
To be precise about tsvector:

1) A GiST index is lossy for any kind of tsearch query; a GIN index is not lossy for the @@ operation, but is lossy for @@@.

2) The number of positions per word is limited to 256 - a bigger number of positions is not helpful for ranking but produces a big tsvector. If a word has a lot of positions in a document, it is close to being a stopword. We could easily increase this limit to 65536 positions.

3) The maximum value of a position is 2^14, because positions are stored in a uint16 and 2 bits of it are reserved for the position's weight. It would be possible to widen int16 to int32, but that would double the tsvector size, which is impractical, I suppose. So the part of a document used for ranking is its first 16384 words - roughly the first 50-100 kilobytes.

4) The limit on the total size of a tsvector comes from the WordEntry->pos field (ts_type.h), which holds the number of bytes between the first lexeme in the tsvector and the lexeme in question. So the limit applies to the total length of the lexemes plus their positional information.

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                   WWW: http://www.sigaev.ru/
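[For reference, the encoding Teodor describes corresponds to roughly these definitions in src/include/tsearch/ts_type.h; this is a sketch from the 8.3-era header, so check the actual file for the authoritative spelling:

    typedef uint16 WordEntryPos;        /* 2 bits of weight + 14 bits of position */

    #define WEP_GETWEIGHT(x)    ( (x) >> 14 )
    #define WEP_GETPOS(x)       ( (x) & 0x3fff )

    #define MAXENTRYPOS (1 << 14)       /* positions beyond this are clamped */
    #define MAXNUMPOS   (256)           /* at most 256 positions kept per lexeme */
    #define LIMITPOS(x) ( ( (x) >= MAXENTRYPOS ) ? (MAXENTRYPOS - 1) : (x) )

Note that the "maximum position means somewhere past here" convention effectively already exists for word positions: LIMITPOS folds anything beyond 2^14 into the last representable slot.]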
Tom Lane wrote:
> Well, there is exactly zero chance of that happening in 8.3.x, because the bit allocations for on-disk tsvector representation are already determined.  It's fairly hard to see a way of doing it in future releases that would have acceptable costs, either.
>
I think you missed my point, or I didn't explain it in enough detail. I'm talking about making the error-or-notice condition a guc variable (eg ignore_tsearch_limits = on/off).

> But more to the point: no matter what the document length limit is, why should it be a hard error to exceed it?  The downside of not indexing words beyond the length limit is that searches won't find documents in which the search terms occur only very far into the document.  The downside of throwing an error is that we can't store such documents at all, which surely guarantees that searches won't find them.  How can you possibly argue that that option is better?
>
IMHO, both approaches are "bad"; that's why I propose a user option. The user can then choose between being strict (error out when the limit is exceeded) and being relaxed (emit a notice when the limit is exceeded). Maybe some day we can increase the limits (eg a ts_type.h redesign).

--
Euler Taveira de Oliveira
http://www.timbira.com/
Added to TODO:

        o Consider changing error to warning for strings larger than one megabyte

          http://archives.postgresql.org/pgsql-bugs/2008-02/msg00190.php
          http://archives.postgresql.org/pgsql-patches/2008-03/msg00062.php

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> I don't think that follows.  A tsearch index is lossy anyway, so there's
>
> > Uh, the index is lossy but I thought it was lossy in a way that just required additional heap accesses, not lossy in that it doesn't index everything.
>
> Sure it's lossy.  It doesn't index stopwords, and it doesn't index the difference between various forms of a word (when the dictionaries reduce them to a common root).
>
> > I am concerned a 1mb limit is too low though.  Exactly why can't we have a higher limit?  Is positional information that significant?
>
> That's pretty much exactly the point: it's not very significant, and it doesn't justify a total inability to index large documents.
>
> One thing we could do is index words that are past the limit but not store a position, or perhaps have the convention that the maximum position value means "somewhere past here".
>
>             regards, tom lane

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +