
Re: Binary COPY IN size reduction - Mailing list pgsql-hackers

From	Lőrinc Pap
Subject	Re: Binary COPY IN size reduction
Date	April 28, 2020 12:13:47
Msg-id	CAMyrAscemUmZxKrYCDPf4HisGbMWSBvESUWpR6r__OL9_rgXFA@mail.gmail.com
In response to	Re: Binary COPY IN size reduction (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Binary COPY IN size reduction
List	pgsql-hackers


Thanks for the quick response, Tom!

What about implementing only the first part of my proposal, i.e. BINARY COPY without the redundant column count & size info?

That alone would already be a big win. I agree that the rest of the proposed changes would only complicate usage, but I'd argue that leaving out the duplicated info would even simplify it!
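To make the duplication concrete, here is a minimal Python sketch of how a single row of (bigint, integer, smallint) is framed. The first function follows the documented binary COPY tuple layout: a 16-bit field count, then a 32-bit length word before each value, all in network byte order. The second function is only an assumption about what the proposed layout might look like (fixed-width, non-NULL values whose sizes are known from the column types); the thread does not specify it, and both function names are hypothetical.

```python
import struct


def encode_row_current(bigint_val: int, int_val: int, smallint_val: int) -> bytes:
    """Encode one tuple in the existing binary COPY layout.

    Layout: int16 field count, then for each field an int32 byte length
    followed by the value itself. "!" selects network byte order with
    no padding.
    """
    return struct.pack(
        "!hiqiiih",
        3,                 # field count, repeated for every row
        8, bigint_val,     # length word + bigint
        4, int_val,        # length word + integer
        2, smallint_val,   # length word + smallint
    )


def encode_row_proposed(bigint_val: int, int_val: int, smallint_val: int) -> bytes:
    """Hypothetical layout without the field count and length words.

    Assumption: every value is non-NULL and fixed-width, so the reader
    can derive each size from the column type alone.
    """
    return struct.pack("!qih", bigint_val, int_val, smallint_val)


current_row = encode_row_current(1, 2, 3)
proposed_row = encode_row_proposed(1, 2, 3)
print(len(current_row), len(proposed_row))
```

Under these assumptions the framing accounts for half of each row: 14 bytes of values plus 14 bytes of field count and length words.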

I'll give a better example this time - writing 1.8 million rows with column types bigint, integer, smallint results in the following COPY IN payloads:

20.8MB - Text protocol
51.3MB - Binary protocol
25.6MB - Binary, without column size info (proposal)

I.e. this would make the binary protocol almost as small as the text one (which isn't an unreasonable expectation, I think), while making it easier to use at the same time.
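The two binary figures above can be approximated with a back-of-the-envelope calculation. The sketch below assumes exactly 1,800,000 rows (the real count is only given as "1.8 million", so the totals come out slightly lower than the measured ones), ignores the small fixed file header and trailer, and assumes no NULL values. The constant and function names are illustrative only.

```python
ROWS = 1_800_000  # assumption: the message only says "1.8 million"

# On-the-wire value sizes in the binary format, in bytes.
COLUMN_BYTES = {"bigint": 8, "integer": 4, "smallint": 2}

FIELD_COUNT_BYTES = 2   # int16 field count sent with every tuple
LENGTH_WORD_BYTES = 4   # int32 length sent before every field


def row_bytes(with_framing: bool) -> int:
    """Bytes for one row, with or without the per-row count and sizes."""
    data = sum(COLUMN_BYTES.values())
    if not with_framing:
        return data
    framing = FIELD_COUNT_BYTES + LENGTH_WORD_BYTES * len(COLUMN_BYTES)
    return data + framing


current_bytes = ROWS * row_bytes(with_framing=True)
proposed_bytes = ROWS * row_bytes(with_framing=False)

print(f"current binary:  {current_bytes / 1e6:.1f} MB")
print(f"without framing: {proposed_bytes / 1e6:.1f} MB")
```

This gives roughly 50.4 MB versus 25.2 MB, i.e. the same 2x ratio as the measured 51.3 MB and 25.6 MB.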

Thanks for your time,

Lőrinc

On Fri, Apr 24, 2020 at 4:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Lőrinc Pap <lorinc@gradle.com> writes:
> We've switched recently from TEXT based COPY to the BINARY one.
> We've noticed a slight performance increase, mostly because we don't need
> to escape the content anymore.
> Unfortunately the binary protocol's output ended up being slightly bigger
> than the text one (e.g. for one payload it's *373MB* now, was *356MB* before)
> ...
> By skipping the column count and sizes for every row, in our example this
> change would reduce the payload to *332MB* (most of our payload is binary,
> lightweight structures consisting of numbers only could see a >*2x*
> decrease in size).

TBH, that amount of gain does not seem to be worth the enormous
compatibility costs of introducing a new COPY data format. What you
propose also makes the format a great deal less robust (readers are
less able to detect errors), which has other costs. I'd vote no.

regards, tom lane

Lőrinc Pap

Senior Software Engineer
