Thread: BUG #18080: to_tsvector fails for long text input

BUG #18080: to_tsvector fails for long text input

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      18080
Logged by:          Uwe Binder
Email address:      uwe.binder@pass-consulting.com
PostgreSQL version: 13.11
Operating system:   Rocky Linux 9
Description:

PostgreSQL 13.11 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 11.3.1
20221121 (Red Hat 11.3.1-4), 64-bit

SELECT to_tsvector('english'::regconfig, (REPEAT('<Long123456789/>'::text,
20000000)));
results in
ERROR:  invalid memory alloc request size 2133333320

whereas SELECT LENGTH(REPEAT('<Long123456789/>'::text, 20000000));
correctly returns 320000000.

PostgreSQL is running in a Docker container with 4GB.


Re: BUG #18080: to_tsvector fails for long text input

From: Alvaro Herrera
On 2023-Sep-04, PG Bug reporting form wrote:

> SELECT to_tsvector('english'::regconfig, (REPEAT('<Long123456789/>'::text,
> 20000000)));
> results in
> ERROR:  invalid memory alloc request size 2133333320

This is because to_tsvector_byid does this:

    prs.lenwords = VARSIZE_ANY_EXHDR(in) / 6;    /* just estimation of word's
                                                 * number */
    if (prs.lenwords < 2)
        prs.lenwords = 2;
    prs.curwords = 0;
    prs.pos = 0;
    prs.words = (ParsedWord *) palloc(sizeof(ParsedWord) * prs.lenwords);

where sizeof(ParsedWord) is 40 (on my laptop).  So this tries to
allocate more memory than palloc() is willing to give it.  The attached
patch fixes just the query you supplied and nothing else.
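
The arithmetic behind the error message can be checked with a small
standalone sketch.  It is not PostgreSQL code: MAX_ALLOC_SIZE restates
PostgreSQL's MaxAllocSize (0x3fffffff, one byte short of 1GB), the 40-byte
element size is the 64-bit figure quoted above, and both helper names are
made up for illustration.

```c
#include <stdint.h>

/* Restates PostgreSQL's MaxAllocSize (1 gigabyte - 1) for illustration. */
#define MAX_ALLOC_SIZE ((uint64_t) 0x3fffffff)

/*
 * Hypothetical helper: bytes requested for the initial words array, using
 * the same "one word per 6 input bytes" estimate as to_tsvector_byid.
 */
static uint64_t
initial_request_bytes(uint64_t input_bytes, uint64_t word_size)
{
    uint64_t    lenwords = input_bytes / 6;

    if (lenwords < 2)
        lenwords = 2;
    return lenwords * word_size;
}

/* Hypothetical helper: would an ordinary (non-huge) allocation reject this? */
static int
exceeds_alloc_limit(uint64_t request_bytes)
{
    return request_bytes > MAX_ALLOC_SIZE;
}
```

Calling initial_request_bytes(320000000, 40) reproduces the 2133333320
bytes named in the error, which is roughly twice the limit.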

I wonder if we want to support this kind of thing; I suspect we don't.
Other parts of text-search would fail in the same way and would also
need to receive similar fixes.  However, the real problem comes when we
try to store such huge tsvectors, because that means we end up with
"huge" tuples on disk that need I/O support.  Eventually, AFAIR, you run
into the size limit in the FE/BE protocol and it all crashes and burns,
because that one cannot be changed without bumping the version.

So I don't think this patch actually does you any good.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/


Re: BUG #18080: to_tsvector fails for long text input

From: Tom Lane
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2023-Sep-04, PG Bug reporting form wrote:
>> SELECT to_tsvector('english'::regconfig, (REPEAT('<Long123456789/>'::text,
>> 20000000)));
>> results in
>> ERROR:  invalid memory alloc request size 2133333320

> This is because to_tsvector_byid does this:
>     prs.lenwords = VARSIZE_ANY_EXHDR(in) / 6;    /* just estimation of word's
>                                                  * number */
>     if (prs.lenwords < 2)
>         prs.lenwords = 2;

Yeah.  My thought about blocking the error had been to limit
prs.lenwords to MaxAllocSize/sizeof(ParsedWord) in this code.
I doubt that switching over to MCXT_ALLOC_HUGE is a good idea.
(Would we not also have to touch the places that repalloc that
array?)

            regards, tom lane



Re: BUG #18080: to_tsvector fails for long text input

From: Tom Lane
I wrote:
> Yeah.  My thought about blocking the error had been to limit
> prs.lenwords to MaxAllocSize/sizeof(ParsedWord) in this code.

Concretely, as attached.  This allows the given test case to
complete, since it doesn't actually create very many distinct
words.  In other cases we could expect to fail when the array
has to get enlarged, but that's just a normal implementation
limitation.
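
A standalone sketch of the clamp, and of why later growth can still fail,
might look like the following.  The helper names are hypothetical,
MAX_ALLOC_SIZE restates MaxAllocSize, and the assumption that the words
array grows by doubling is not stated in this thread.

```c
#include <stddef.h>

/* Restates PostgreSQL's MaxAllocSize (1 gigabyte - 1) for illustration. */
#define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)

/*
 * Hypothetical helper: initial estimate of the word count, clamped so that
 * lenwords * word_size never exceeds the ordinary allocation limit.
 */
static size_t
clamp_lenwords(size_t input_bytes, size_t word_size)
{
    size_t      lenwords = input_bytes / 6;
    size_t      max_words = MAX_ALLOC_SIZE / word_size;

    if (lenwords < 2)
        lenwords = 2;
    else if (lenwords > max_words)
        lenwords = max_words;
    return lenwords;
}

/*
 * Hypothetical helper: would doubling the array (assumed growth policy)
 * still fit under the limit?  Written to avoid overflow in the product.
 */
static int
doubling_fits(size_t lenwords, size_t word_size)
{
    return lenwords <= MAX_ALLOC_SIZE / word_size / 2;
}
```

With the reported input, the clamped estimate fits in one allocation, but
a doubling from that point would not, which corresponds to the
"normal implementation limitation" for inputs with very many words.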

I looked for other places that might initialize lenwords
to not-sane values, and didn't find any.

BTW, the field order in ParsedWord is such that there's a fair
amount of wasted pad space on 64-bit builds.  I doubt we can
get away with rearranging it in released branches; but maybe
it's worth doing something about that in HEAD, to push out
the point at which you hit the 1GB limit.
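
To make the padding concrete, here is a standalone sketch that restates
the existing field order next to the reordered one from the ts_utils.h
diff later in this thread.  The type names OldParsedWord and
NewParsedWord and the helper max_words are made up for illustration, and
the sizes depend on the platform ABI.

```c
#include <stddef.h>
#include <stdint.h>

/* Restates PostgreSQL's MaxAllocSize (1 gigabyte - 1) for illustration. */
#define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)

/* Existing field order: each pointer forces padding after a 16-bit run. */
typedef struct
{
    uint16_t    len;
    uint16_t    nvariant;
    union
    {
        uint16_t    pos;
        uint16_t   *apos;
    }           pos;
    uint16_t    flags;
    char       *word;
    uint32_t    alen;
} OldParsedWord;

/* Reordered layout: four 16-bit fields first, then the two pointers. */
typedef struct
{
    uint16_t    flags;
    uint16_t    len;
    uint16_t    nvariant;
    uint16_t    alen;
    union
    {
        uint16_t    pos;
        uint16_t   *apos;
    }           pos;
    char       *word;
} NewParsedWord;

/* Hypothetical helper: how many elements fit in one ordinary allocation. */
static size_t
max_words(size_t word_size)
{
    return MAX_ALLOC_SIZE / word_size;
}
```

On a typical 64-bit ABI, comparing sizeof(OldParsedWord) with
sizeof(NewParsedWord) shows how much of each element is alignment
padding, and max_words() shows how far the array can grow in each case.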

            regards, tom lane

diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index 3b6d41f9e8..fe39d6c4b9 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -252,6 +252,8 @@ to_tsvector_byid(PG_FUNCTION_ARGS)
                                                  * number */
     if (prs.lenwords < 2)
         prs.lenwords = 2;
+    else if (prs.lenwords > MaxAllocSize / sizeof(ParsedWord))
+        prs.lenwords = MaxAllocSize / sizeof(ParsedWord);
     prs.curwords = 0;
     prs.pos = 0;
     prs.words = (ParsedWord *) palloc(sizeof(ParsedWord) * prs.lenwords);

Re: BUG #18080: to_tsvector fails for long text input

From: Tom Lane
I wrote:
> BTW, the field order in ParsedWord is such that there's a fair
> amount of wasted pad space on 64-bit builds.  I doubt we can
> get away with rearranging it in released branches; but maybe
> it's worth doing something about that in HEAD, to push out
> the point at which you hit the 1GB limit.

I poked at that a little bit.  We can reduce 64-bit sizeof(ParsedWord)
from 40 bytes to 24 bytes with the attached patch.  The main thing
needed to make this pack tightly is to reduce the "alen" field from
uint32 to uint16.  While it's not immediately obvious that that's
a good thing to do, a look at the one place where alen is increased
(uniqueWORD() in to_tsany.c) shows that it cannot get to more than
twice MAXNUMPOS:

            if (res->pos.apos[0] < MAXNUMPOS - 1 && ...)
            {
                if (res->pos.apos[0] + 1 >= res->alen)
                {
                    res->alen *= 2;
                    res->pos.apos = (uint16 *) repalloc(res->pos.apos, sizeof(uint16) * res->alen);
                }

MAXNUMPOS is currently 256, and even if it's possible to increase
that, it seems unlikely that we'd want to make it more than 32k.
So this limitation seems OK to me.
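
The bound on alen can be checked with a small model of the growth rule
quoted above.  This is a sketch under assumptions: it starts from an
allocation of 2 entries holding one position, and it treats the elided
part of the condition as always true, so every attempt tries to append a
new position.  The function name is hypothetical.

```c
#include <stdint.h>

#define MAXNUMPOS 256           /* current value, as stated above */

/*
 * Hypothetical model: offer `attempts` further positions for one word and
 * return the allocated array length (alen) afterwards.  npos stands in
 * for apos[0], the number of positions stored so far.
 */
static uint32_t
alen_after(uint32_t attempts)
{
    uint32_t    alen = 2;       /* assumed initial allocation */
    uint32_t    npos = 1;       /* assumed: one position already stored */
    uint32_t    i;

    for (i = 0; i < attempts; i++)
    {
        if (npos < MAXNUMPOS - 1)
        {
            /* Same test as in the excerpt: grow before the array is full. */
            if (npos + 1 >= alen)
                alen *= 2;
            npos++;
        }
        /* Otherwise the position is dropped and alen cannot change. */
    }
    return alen;
}
```

Because doubling only happens while npos is below MAXNUMPOS - 1, alen
stays within twice MAXNUMPOS however many positions are offered, which
fits in a uint16 with a wide margin.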

            regards, tom lane

diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index d3dc8bae47..d2aae0c337 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -81,8 +81,10 @@ extern void pushOperator(TSQueryParserState state, int8 oper, int16 distance);
  */
 typedef struct
 {
+    uint16        flags;            /* currently, only TSL_PREFIX */
     uint16        len;
     uint16        nvariant;
+    uint16        alen;
     union
     {
         uint16        pos;
@@ -90,13 +92,11 @@ typedef struct
         /*
          * When apos array is used, apos[0] is the number of elements in the
          * array (excluding apos[0]), and alen is the allocated size of the
-         * array.
+         * array.  We do not allow more than MAXNUMPOS array elements.
          */
         uint16       *apos;
     }            pos;
-    uint16        flags;            /* currently, only TSL_PREFIX */
     char       *word;
-    uint32        alen;
 } ParsedWord;

 typedef struct