Well I understand. But in my scenario I am using ansi characters which
are just 1-byte-utf8. The system should be capable of handling that.
Maybe we could implement a different approach for text column content >
256 GB, where we implement something like a "streaming" read where
memory is not allocated based on fixed 4-bytes per character, but on the
true length of characters found in the text column or returned by a
function like lower().
But as a user i would expect that I can store as many characters as
possible into a text field without getting errors in any way.
The following example will give an error although the amount of text
easily fits into the column because the 'x' is ansi and takes 1 byte.
create temp table mytest(mydata text);
create index on mytest(lower(mydata));
insert into mytest
select repeat('x',300*1024*1024) as mydata;
Without the index I could easily store 800mb in the text column:
select repeat('x',800*1024*1024) as mydata;
So from what I see we should at least extend the documentation and tell
people that although a text field can hold up to 1GB of string data,
only 256MB can safely be used if one wants to use function-based indexed
or function on the column inside a where clauses.
So for everyone using text columns in a more than basic way, we should
simply tell everyone not not store more than 256 * 1024*1024 characters
in a text column.
Julien Rouhaud wrote:
> On Mon, Aug 16, 2021 at 4:46 PM PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> When I try to search a text field in a where clause that holds more than 250
>> MB of text, PostgreSQL runs out of memory for the request.
>>
>> You can reproduce this behaviour with the following sql statement:
>>
>> with
>> q1 as
>> (
>> -- 260*1024*1024 = 260MB
>> select repeat('x',260*1024*1024) as mydata
>> )
>> select count(*) from q1 where lower(mydata) like '%a%'
>> ;
>>
>> ERROR: invalid memory alloc request size 1090519044
>>
>> Using just a 250MB string:
>>
>> with
>> q1 as
>> (
>> -- 250*1024*1024 = 250MB
>> select repeat('x',250*1024*1024) as mydata
>> )
>> select count(*) from q1 where lower(mydata) like '%a%'
>> ;
>>
>> Everything is fine!
>>
>> The alloc request size seems to be 4 times the length of the text field.
>> [...]
>> Is this an expected behaviour?
>> Why ist PostgreSQL allocating 4 times the column's memory when doing a
>> search like above?
> This is unfortunately the expected behavior, assuming that you're not
> dealing with C/POSIX encoding.
>
> This is because in multibyte encoding each character can occupy up to
> 4B. Postgres needs to allocate a single chunk of memory to hold the
> resulting text, and it has no way to know how many multibyte
> characters are present in the input string or how many character will
> have a different size when down-cased, so it has to allocate the
> maximum size that may be needed, which is 4 times the size of the
> input string. And there's a strict 1GB limitation for a single field
> size, thus the ~256MB limit.