Thread: Re: tsearch2 column update produces "word too long"error

Re: tsearch2 column update produces "word too long"error

From: Markus Wollny
Date: Mon, 24 Nov 2003

Hi!

Now I really couldn't code C to save my life, but I managed to elicit
some more debugging info. As suspected, the cause is silly user input,
but that is something I have to take as a given; here's the
"patch" for ts_cfg.c:

if (lenlemm >= MAXSTRLEN)
                        ereport(ERROR,
                                        (errcode(ERRCODE_SYNTAX_ERROR),
!                                        errmsg("word is too long(%d): %s",lenlemm,lemm)));
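
Printing the whole word can flood the log when the word runs to thousands of characters. A minimal standalone sketch of a bounded variant follows; it is not tsearch2 code. The names format_too_long and PREVIEW_LEN are hypothetical, and it assumes the word pointer may not be NUL-terminated, so the length is passed explicitly through the "%.*s" precision.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical preview limit for this sketch; not a tsearch2 constant. */
#define PREVIEW_LEN 40

/*
 * Format a "word is too long" diagnostic into buf.
 * Only the first PREVIEW_LEN bytes of the word are printed, and the
 * word is never read past wordlen, so it need not be NUL-terminated.
 * Returns the snprintf result (length the full message would have).
 */
static int
format_too_long(char *buf, size_t bufsize, const char *word, int wordlen)
{
    int shown;

    assert(buf != NULL && word != NULL && wordlen >= 0);
    shown = wordlen < PREVIEW_LEN ? wordlen : PREVIEW_LEN;

    return snprintf(buf, bufsize, "word is too long(%d): %.*s%s",
                    wordlen, shown, word,
                    shown < wordlen ? "..." : "");
}
```

The message still reports the full length, which is usually enough to find the offending row together with the preview.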

Now when I try

UPDATE ct_com_board_message
     SET ftindex=to_tsvector('default',coalesce(user_login,'') ||' '||
         coalesce(title,'') ||' '|| coalesce(text,''));

I eventually get:

ERROR:  word is too long(2724):
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajjajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajjajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajjajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajjajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajjajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajjajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajjajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajjajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajjajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajjajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj

This is a brightly shining example of utterly wanton user-stupidity, I
think: A 2k+ string of |:ja:|. Input like that cannot be helped, though
- if he'd been a bit more imaginative, he could have used a few dozen
"Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch" in a row or
anything else; unfortunately there's no app that could automatically
whack a user if he's doing something stupid.

But on the other hand I cannot think of any reason why crap like that
should be indexed in the first place. Therefore I would like to see some
sort of option that lets me keep using tsearch2 while automatically
excluding anything exceeding MAXSTRLEN - so the UPDATE might throw a
NOTICE (if anything at all) but still get on with the rest.
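
The behaviour asked for here can be sketched in standalone C, outside of tsearch2: split the text on whitespace, hand each acceptable word to a callback, and report and skip any word at or above the limit instead of aborting. This is only an illustration under stated assumptions. The names index_words_skipping_long, word_cb and count_word are hypothetical, the notice goes to stderr rather than through ereport, and the limit is a parameter because the thread only describes MAXSTRLEN as "2K".

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Callback invoked for every word that is short enough to index. */
typedef void (*word_cb)(const char *word, size_t len, void *arg);

/*
 * Split text on whitespace and pass each word to cb.
 * Words of maxlen bytes or more are reported and skipped,
 * and processing continues with the rest of the text.
 * Returns the number of skipped words.
 */
static size_t
index_words_skipping_long(const char *text, size_t maxlen,
                          word_cb cb, void *arg)
{
    size_t      skipped = 0;
    const char *p = text;

    assert(text != NULL && cb != NULL);

    while (*p != '\0')
    {
        const char *start;
        size_t      len;

        /* skip the whitespace in front of the next word */
        while (*p != '\0' && isspace((unsigned char) *p))
            p++;
        start = p;
        /* advance to the end of the word */
        while (*p != '\0' && !isspace((unsigned char) *p))
            p++;
        len = (size_t) (p - start);

        if (len == 0)
            break;              /* only trailing whitespace was left */

        if (len >= maxlen)
        {
            fprintf(stderr,
                    "NOTICE: word is too long (%zu bytes), skipped\n", len);
            skipped++;
            continue;
        }
        cb(start, len, arg);
    }
    return skipped;
}

/* Example callback: counts the words that were accepted. */
static void
count_word(const char *word, size_t len, void *arg)
{
    (void) word;
    (void) len;
    (*(size_t *) arg)++;
}
```

Called with a limit of 2048 and a callback that stores the word, a 2724-character run of "ja" would produce one notice while the remaining words of the message are still processed.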

An alteration like that does however exceed my limited abilities with C
by far and I don't want to mess with something I do not fully understand
and then use that mess in a production environment. Is there a way to
get around this problem with oversized words?

Kind regards

    Markus


> -----Original Message-----
> From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> Sent: Friday, 21 November 2003 15:13
> To: Markus Wollny
> Cc: pgsql-general@postgresql.org
> Subject: Re: AW: [GENERAL] tsearch2 column update produces "word too
> long"error
>
>
> On Fri, 21 Nov 2003, Markus Wollny wrote:
>
> > Hello!
> >
> > > From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> > > Sent: Friday, 21 November 2003 13:06
> > > To: Markus Wollny
> > > Cc: pgsql-general@postgresql.org
> > >
> > > Word length is limited to 2K. What exactly is the word
> > > tsearch2 complained about?
> > > 'Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch'
> > > is fine :)
> >
> > This was a silly example, I know - it is a long word, but not too
> > long to worry a machine. The offending word will surely be much
> > longer, but as a matter of fact, I cannot think of any user actually
> > typing a 2k+ string without any spaces in between. I'm not sure which
> > word tsearch2 complained about; it doesn't tell, and even logging did
> > not provide me with any more detail:
> >
> > 2003-11-21 14:06:44 [26497] ERROR:  42601: word is too long
> > LOCATION:  parsetext_v2, ts_cfg.c:294
> > STATEMENT:  UPDATE ct_com_board_message
> >                     SET
> > ftindex=to_tsvector('default',coalesce(user_login,'') ||' '||
> > coalesce(title,'') ||' '|| coalesce(text,''));
> >
> > Is there some way to find the exact position?
>
> I'm afraid you need to hack ts_cfg.c:294 yourself to print the word
> that's bugging you :)
>
> >
> > > btw, don't forget to configure the dictionaries properly, so you
> > > don't have a lot of unique words.
> >
> > I won't forget that; I just wanted to run a quick first test
> > before diving deeper into Ispell and other issues which are as yet
> > a bit of a mystery to me.
> >
> > Kind Regards
> >
> >     Markus
> >
>
>     Regards,
>         Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
>

Re: tsearch2 column update produces "word too

From: Oleg Bartunov

Markus,

thanks for your analysis! I think we'll submit a patch to throw a NOTICE
and skip such useless words when indexing.

Oleg
On Mon, 24 Nov 2003, Markus Wollny wrote:

> [...]

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: tsearch2 column update produces "word too long"error

From: Teodor Sigaev

Patch submitted to 7.5devel and REL7_4_STABLE.

--
Teodor Sigaev                                  E-mail: teodor@sigaev.ru