Thread: Make COPY extendable in order to support Parquet and other formats
Hi hackers,

In several conversations I had recently with colleagues it was pointed
out that it would be great if PostgreSQL supported COPY to/from Parquet
and other formats. I've found a corresponding discussion [1] on
pgsql-general@. The consensus reached back in 2018 seems to be that this
shouldn't be implemented in core, but rather that an API should be
provided for extensions. To my knowledge this was never implemented,
though.

I would like to invest some time into providing a corresponding patch
for core and implementing a "pg_copy_parquet" extension as a practical
example, plus yet another, a bit simpler, extension as an API usage
example for the core codebase. I just wanted to double-check that this
is still a wanted feature and that no one on pgsql-hackers@ objects to
the idea.

Any feedback, suggestions and ideas are most welcome.

[1]: https://postgr.es/m/20180210151304.fonjztsynewldfba%40gmail.com

--
Best regards,
Aleksander Alekseev
On Mon, Jun 20, 2022 at 8:35 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> I would like to invest some time into providing a corresponding patch
> for the core and implementing "pg_copy_parquet" extension as a
> practical example, and yet another, a bit simpler, extension as an API
> usage example for the core codebase. I just wanted to double-check
> that this is still a wanted feature and no one on pgsql-hackers@
> objects the idea.

An extension just for COPY to/from Parquet looks limited in
functionality. Shouldn't this be viewed as FDW or Table AM support for
Parquet and other formats? Of course, the latter is much larger in scope
than the former. But there may already be efforts underway [1]. I have
not used it myself or worked with it.

[1] https://www.postgresql.org/about/news/parquet-s3-fdw-01-was-newly-released-2179/

--
Best Wishes,
Ashutosh Bapat
Re: Make COPY extendable in order to support Parquet and other formats
From: Aleksander Alekseev
Hi Ashutosh,

> An extension just for COPY to/from parquet looks limited in
> functionality. Shouldn't this be viewed as an FDW or Table AM support
> for parquet or other formats? Of course the later is much larger in
> scope compared to the first one. But there may already be efforts
> underway
> https://www.postgresql.org/about/news/parquet-s3-fdw-01-was-newly-released-2179/

Many thanks for sharing your thoughts on this!

We are using parquet_fdw [2], but this is a read-only FDW. What users
typically need is to dump their data as fast as possible in a given
format and either upload it to the cloud as historical data or transfer
it to another system (Spark, etc). The data can be accessed later if
needed, as read-only. Note that when accessing the historical data with
parquet_fdw you basically have zero ingestion time.

Another possible use case is transferring data to PostgreSQL from
another source. Here the requirements are similar: the data should be
dumped as fast as possible from the source, transferred over the
network, and imported as fast as possible.

In other words, personally I'm unaware of use cases where somebody needs
a complete read/write FDW or TableAM implementation for formats like
Parquet, ORC, etc. Also, to my knowledge, those interfaces are not
particularly optimized for this.

[2]: https://github.com/adjust/parquet_fdw

--
Best regards,
Aleksander Alekseev
On Tue, Jun 21, 2022 at 3:26 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> In other words, personally I'm unaware of use cases when somebody
> needs a complete read/write FDW or TableAM implementation for formats
> like Parquet, ORC, etc. Also to my knowledge they are not particularly
> optimized for this.

IIUC, you want extensibility in the FORMAT argument to the COPY command
(https://www.postgresql.org/docs/current/sql-copy.html), where the
format is pluggable. That seems useful.

Another option is to dump the data in CSV format and use an external
utility to convert the CSV to Parquet or whatever the other format is.
I understand that that's not going to be as efficient as dumping
directly in the desired format.

--
Best Wishes,
Ashutosh Bapat
Hi Ashutosh,

> IIUC, you want extensibility in FORMAT argument to COPY command
> https://www.postgresql.org/docs/current/sql-copy.html. Where the
> format is pluggable. That seems useful.
> Another option is to dump the data in csv format but use external
> utility to convert csv to parquet or whatever other format is. I
> understand that that's not going to be as efficient as dumping
> directly in the desired format.

Exactly. However, to clarify, I suspect this may be a bit more involved
than simply extending the FORMAT arguments. That change per se would not
be extremely useful: currently nothing prevents an extension author from
iterating over a table using the heap_open(), heap_getnext(), etc. API
and dumping its content in any format. The user would have to write
"dump_table(foo, filename)" instead of "COPY ...", but that's not a big
deal.

The problem is that every new extension has to reinvent things like
figuring out the schema, validating the data, etc. If we could do this
in core, so that an extension author has to implement only the minimal
format-dependent list of callbacks, that would be really great. In order
to make the interface practical, though, one would have to implement a
practical extension as well, for instance a Parquet one.

This being said, if it turns out that for some reason this is not
realistic to deliver, ending up with simply extending this part of the
syntax a bit should be fine too.

--
Best regards,
Aleksander Alekseev
Hi,

On 2022-06-22 16:59:16 +0530, Ashutosh Bapat wrote:
> On Tue, Jun 21, 2022 at 3:26 PM Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> > In other words, personally I'm unaware of use cases when somebody
> > needs a complete read/write FDW or TableAM implementation for formats
> > like Parquet, ORC, etc. Also to my knowledge they are not particularly
> > optimized for this.
>
> IIUC, you want extensibility in FORMAT argument to COPY command
> https://www.postgresql.org/docs/current/sql-copy.html. Where the
> format is pluggable. That seems useful.

Agreed. But I think it needs quite a bit of care. Just plugging a bunch
of per-row (or worse, per-field) switches into COPY's input / output
parsing will make the code even harder to read, and even slower.

I suspect that we'd first need a patch to refactor the existing copy
code a good bit to clean things up. After that it hopefully will be
possible to plug in a new format without being too intrusive.

I know little about Parquet - can it support FROM STDIN efficiently?

Greetings,

Andres Freund
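The concern about per-row switches can be illustrated with a toy sketch (all names and the two toy "formats" here are invented for illustration, not COPY internals): resolving the output routine once per COPY keeps the hot per-row path free of any branching on the format.

```python
def emit_text(value):
    # Toy "text" serialization
    return str(value)

def emit_csv(value):
    # Toy "csv" serialization with naive comma escaping
    return str(value).replace(",", "\\,")

def copy_with_per_row_switch(rows, fmt):
    # Anti-pattern: the format is re-examined for every single row
    out = []
    for value in rows:
        if fmt == "text":
            out.append(emit_text(value))
        elif fmt == "csv":
            out.append(emit_csv(value))
        else:
            raise ValueError(fmt)
    return out

def copy_with_single_dispatch(rows, fmt):
    # The branch is taken exactly once, before the row loop
    emit = {"text": emit_text, "csv": emit_csv}[fmt]
    return [emit(value) for value in rows]
```

Both produce the same output; the difference is only where the format decision lives, which is exactly what the proposed refactoring of the copy code would have to sort out before new formats can be plugged in cleanly.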
Andres Freund <andres@anarazel.de> writes:
> On 2022-06-22 16:59:16 +0530, Ashutosh Bapat wrote:
>> IIUC, you want extensibility in FORMAT argument to COPY command
>> https://www.postgresql.org/docs/current/sql-copy.html. Where the
>> format is pluggable. That seems useful.

> Agreed.

Ditto.

> I suspect that we'd first need a patch to refactor the existing copy code a
> good bit to clean things up. After that it hopefully will be possible to plug
> in a new format without being too intrusive.

I think that step 1 ought to be to convert the existing formats into
plug-ins, and demonstrate that there's no significant loss of
performance.

			regards, tom lane
Andres, Tom,

> > I suspect that we'd first need a patch to refactor the existing copy code a
> > good bit to clean things up. After that it hopefully will be possible to plug
> > in a new format without being too intrusive.
>
> I think that step 1 ought to be to convert the existing formats into
> plug-ins, and demonstrate that there's no significant loss of performance.

Yep, this looks like a promising strategy to me too.

> I know little about parquet - can it support FROM STDIN efficiently?

Parquet is a compressed binary format with data grouped by columns [1].
I wouldn't assume that streaming is a primary use case for this
particular format.

[1]: https://parquet.apache.org/docs/file-format/

--
Best regards,
Aleksander Alekseev
Hi,

On 2022-06-23 11:38:29 +0300, Aleksander Alekseev wrote:
> > I know little about parquet - can it support FROM STDIN efficiently?
>
> Parquet is a compressed binary format with data grouped by columns
> [1]. I wouldn't assume that this is a primary use case for this
> particular format.

IMO decent COPY FROM / TO STDIN support is crucial, because otherwise
you can't do COPY from/to a client, which would make the feature
unusable for anybody who is not a superuser, including just about all
users of hosted PG.

Greetings,

Andres Freund
On 2022-06-23 Th 21:45, Andres Freund wrote:
> Hi,
>
> On 2022-06-23 11:38:29 +0300, Aleksander Alekseev wrote:
>>> I know little about parquet - can it support FROM STDIN efficiently?
>> Parquet is a compressed binary format with data grouped by columns
>> [1]. I wouldn't assume that this is a primary use case for this
>> particular format.
> IMO decent COPY FROM / TO STDIN support is crucial, because otherwise you
> can't do COPY from/to a client. Which would make the feature unusable for
> anybody not superuser, including just about all users of hosted PG.

+1

Note that Parquet puts the metadata at the end of each file, which makes
it nice to write but somewhat unfriendly for streaming readers, which
would have to accumulate the whole file in order to process it.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
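The trailing-metadata point can be made concrete with a toy sketch of Parquet's outer framing as described in the Parquet format documentation: "PAR1" magic at both ends of the file, with a 4-byte little-endian footer length immediately before the trailing magic. This is only a mock of that framing, not a real Parquet reader or writer; the point is that a reader must seek to the end of the file before it can interpret anything, which is exactly what a streaming consumer cannot do.

```python
import struct

MAGIC = b"PAR1"

def toy_parquet_layout(data_pages: bytes, footer: bytes) -> bytes:
    """Mimic Parquet's physical layout:
    magic | data | footer metadata | footer length (uint32 LE) | magic"""
    return MAGIC + data_pages + footer + struct.pack("<I", len(footer)) + MAGIC

def read_footer(buf: bytes) -> bytes:
    # A reader starts at the END of the file: the last 4 bytes are the
    # magic, and the 4 bytes before them give the footer length.
    assert buf[-4:] == MAGIC, "not a (toy) Parquet file"
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    return buf[-8 - footer_len:-8]

buf = toy_parquet_layout(b"<row group bytes>", b"<file metadata>")
print(read_footer(buf))
```

So a COPY FROM STDIN implementation for such a format would indeed have to spool the whole stream (to disk or memory) before decoding, as Andrew notes.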
Hi Andrew,

> > IMO decent COPY FROM / TO STDIN support is crucial, because otherwise you
> > can't do COPY from/to a client. Which would make the feature unusable for
> > anybody not superuser, including just about all users of hosted PG.
>
> +1
>
> Note that Parquet puts the metadata at the end of each file, which makes
> it nice to write but somewhat unfriendly for streaming readers, which
> would have to accumulate the whole file in order to process it.

It's not necessarily that bad, since the data is divided into pages and
each page can be processed separately. However, personally I have
limited experience with Parquet at this point, so some experimentation
is required.

I will keep in mind the requirement regarding COPY FROM / TO STDIN.

--
Best regards,
Aleksander Alekseev