Re: pg_dump / copy bugs with "big lines" ? - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: pg_dump / copy bugs with "big lines" ?
Date
Msg-id ba62f67a-b585-03f2-cfec-fefb95549efd@2ndquadrant.com
Whole thread Raw
In response to Re: pg_dump / copy bugs with "big lines" ?  (Craig Ringer <craig@2ndquadrant.com>)
Responses Re: pg_dump / copy bugs with "big lines" ?  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: pg_dump / copy bugs with "big lines" ?  ("Daniel Verite" <daniel@manitou-mail.org>)
Re: pg_dump / copy bugs with "big lines" ?  ("Daniel Verite" <daniel@manitou-mail.org>)
List pgsql-hackers
Hi,

I've subscribed to review this patch in the current CF, so let me start 
with a brief summary of the thread as it started more than a year ago.

In general the thread is about hitting the 1GB allocation size limit 
(and 32-bit length in FE/BE protocol) in various ways:


1) attributes are smaller than 1GB, but row exceeds the limit

This causes issues in COPY/pg_dump, as we allocate a single StringInfo 
for the whole row, and send it as a single message (which includes int32 
length, just like most FE/BE messages).


2) attribute values that get over the 1GB due to encoding

For example it's possible to construct values that are OK in hex 
encoding but >1GB in escape (and vice versa). Those values get stored 
just fine, but there's no way to dump / COPY them. And if you happen to 
have both, you can't do pg_dump :-/


I think it's OK not to be able to store extremely large values, but only 
if we make it predictable (ideally by rejecting the value before storing 
in in the database). The cases discussed in the thread are particularly 
annoying exactly because we do the opposite - we allow the value to be 
stored, but then fail when retrieving the value or trying to backup the 
database (which is particularly nasty, IMNSHO).

So although we don't have that many reports about this, it'd be nice to 
improve the behavior a bit.


The trouble is, this rabbit hole is fairly deep - wherever we palloc or 
detoast a value, we're likely to hit those issues with wide values/rows.

Honestly, I don't think it's feasible to make all the code paths work 
with such wide values/rows - as TL points out, particularly with the 
wide values it would be enormously invasive.

So the patch aims to fix just the simplest case, i.e. when the values 
are smaller than 1GB but the total row is larger. It does so by allowing 
StringInfo to exceed the MaxAllocSize in special cases (currently only 
COPY FROM/TO), and using MCXT_ALLOC_HUGE when forming the heap tuple (to 
make it possible to load the data).

It seems to work and I do think it's a reasonable first step to make 
things work.

A few minor comments regarding the patch:

1) CopyStartSend seems pretty pointless - It only has one function call 
in it, and is called on exactly one place (and all other places simply 
call allowLongStringInfo directly). I'd get rid of this function and 
replace the call in CopyOneRowTo(() with allowLongStringInfo().

2) allowlong seems awkward, allowLong or allow_long would be better

3) Why does resetStringInfo reset the allowLong flag? Are there any 
cases when we want/need to forget the flag value? I don't think so, so 
let's simply not reset it and get rid of the allowLongStringInfo() 
calls. Maybe it'd be better to invent a new makeLongStringInfo() method 
instead, which would make it clear that the flag value is permanent.

4) HandleParallelMessage needs a tweak, as it uses msg->len in a log 
message, but with '%d' and not '%ld' (as needed after changing the type 
to Size).


I however wonder whether it wouldn't be better to try Robert's proposal, 
i.e. not building the huge StringInfo for the whole row, but sending the 
attribute data directly. I don't mean to send messages for each 
attribute - the current FE/BE protocol does not allow that (it says that 
each 'd' message is a whole row), but just sending the network message 
in smaller chunks.

That would make the StringInfo changes unnecessary, reduce the memory 
consumption considerably (right now we need 2x the memory, and we know 
we're dealing with gigabytes of data).

Of course, it'd require significant changes of how copy works internally 
(instead of accumulating data for the whole row, we'd have to send them 
right in smaller chunks). Which would allow us getting rid of the 
StringInfo changes, but we'd not be able to backpatch this.


Regarding interpreting the Int32 length field in FE/BE protocol as 
unsigned, I'm a bit worried it might qualify as breaking the protocol. 
It's true we don't really say whether it's signed or unsigned, and we 
handle it differently depending on the message type, but I wonder how 
many libraries simply use int32. OTOH those clients are unlikely to 
handle even the 2GB we might send them without breaking the protocol, so 
I guess this is fine. And 4GB seems like the best we can do.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: System load consideration before spawning parallel workers
Next
From: Jim Nasby
Date:
Subject: Re: autonomous transactions