COPY, bytea streaming and memory footprint - Mailing list pgsql-hackers

From Jerome Wagner
Subject COPY, bytea streaming and memory footprint
Date
Msg-id CA+=V_fNkQyZFcUo2XEwFz95=9eaOUViyNcW0JbgiTdy8RsqcFA@mail.gmail.com
Whole thread Raw
List pgsql-hackers
Hello,

I am trying to understand/optimize how a COPY operation behaves when transfering a bytea from the database to a client.

For simplicity, I'll consider that I have only one bytea _image_ in the _images_ table.

Starting with
COPY (SELECT image FROM images) TO STDOUT BINARY

I understand that :
 - the server will create a linear buffer on the server side holding the whole image and then start sending it over the network in one big copyData message chunked in 64KB network chunks
 - the client can manage to extract this copyData payload by re-assembling those chunks in memory or by streaming the relevant data parts of the chunks elsewhere.

so the problem I see in a streaming situation is that the server actually needs to buffer the whole image in memory.

Now the image is already compressed so if I
ALTER TABLE images ALTER image SET STORAGE EXTERNAL
I can use the fact that substring on non compressed toasted values will fetch only the needed parts and do

COPY (
  SELECT (
    SELECT substring(image from n for 65536) from images)
    FROM generate_series(1, (select length(image) from images), 65536) n
  ) TO STDOUT BINARY

As I understand it, this would be less memory intensive on the server side if the server starts sending rows before all rows of the subselect are built because it would only need to prepare a sequence of  65536 bytes long buffers for the rows it would decide to have in memory.

but is there a way to know if such a COPY/SELECT statement will indeed start sending rows before they are all prepared on the server ? Does it depend on the request and  is there a difference if I add an order by on the select versus the natural order of the table ?
How many rows will be needed in memory before the sending begins ?

I hope my explanation was clear. I am looking for help in better understanding how the server decides to stream the COPY data out of the server vs the internal retrieval of the COPY'd subselect.

Thank you
Jérôme







pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Parallel Seq Scan vs kernel read ahead
Next
From: David Rowley
Date:
Subject: Recording test runtimes with the buildfarm