Home > mailing lists

COPY, bytea streaming and memory footprint - Mailing list pgsql-hackers

From	Jerome Wagner
Subject	COPY, bytea streaming and memory footprint
Date	June 10, 2020 15:52:02
Msg-id	CA+=V_fNkQyZFcUo2XEwFz95=9eaOUViyNcW0JbgiTdy8RsqcFA@mail.gmail.com Whole thread Raw
List	pgsql-hackers

Tree view

Hello,

I am trying to understand/optimize how a COPY operation behaves when transfering a bytea from the database to a client.

For simplicity, I'll consider that I have only one bytea _image_ in the _images_ table.

Starting with

COPY (SELECT image FROM images) TO STDOUT BINARY

I understand that :

- the server will create a linear buffer on the server side holding the whole image and then start sending it over the network in one big copyData message chunked in 64KB network chunks

- the client can manage to extract this copyData payload by re-assembling those chunks in memory or by streaming the relevant data parts of the chunks elsewhere.

so the problem I see in a streaming situation is that the server actually needs to buffer the whole image in memory.

Now the image is already compressed so if I

ALTER TABLE images ALTER image SET STORAGE EXTERNAL

I can use the fact that substring on non compressed toasted values will fetch only the needed parts and do

COPY (

SELECT (

SELECT substring(image from n for 65536) from images)

FROM generate_series(1, (select length(image) from images), 65536) n

) TO STDOUT BINARY

As I understand it, this would be less memory intensive on the server side if the server starts sending rows before all rows of the subselect are built because it would only need to prepare a sequence of 65536 bytes long buffers for the rows it would decide to have in memory.

but is there a way to know if such a COPY/SELECT statement will indeed start sending rows before they are all prepared on the server ? Does it depend on the request and is there a difference if I add an order by on the select versus the natural order of the table ?

How many rows will be needed in memory before the sending begins ?

I hope my explanation was clear. I am looking for help in better understanding how the server decides to stream the COPY data out of the server vs the internal retrieval of the COPY'd subselect.

Thank you

Jérôme

pgsql-hackers by date:

From: David Rowley
Date: 10 June 2020, 15:33:45
Subject: Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley
Date: 10 June 2020, 15:58:34
Subject: Recording test runtimes with the buildfarm

COPY, bytea streaming and memory footprint - Mailing list pgsql-hackers

Previous

Next