Thread: Make COPY extendable in order to support Parquet and other formats
Hi hackers,

In several conversations I had recently with colleagues it was pointed
out that it would be great if PostgreSQL supported COPY to/from Parquet
and other formats. I've found a corresponding discussion [1] on
pgsql-general@. The consensus reached back in 2018 seems to be that this
shouldn't be implemented in core, but rather that an API should be
provided for extensions. To my knowledge this was never implemented,
though.

I would like to invest some time into providing a corresponding patch
for core and implementing a "pg_copy_parquet" extension as a practical
example, plus yet another, a bit simpler, extension as an API usage
example for the core codebase. I just wanted to double-check that this
is still a wanted feature and that no one on pgsql-hackers@ objects to
the idea.

Any feedback, suggestions and ideas are most welcome.

[1]: https://postgr.es/m/20180210151304.fonjztsynewldfba%40gmail.com

--
Best regards,
Aleksander Alekseev
On Mon, Jun 20, 2022 at 8:35 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> I would like to invest some time into providing a corresponding patch
> for the core and implementing "pg_copy_parquet" extension as a
> practical example, and yet another, a bit simpler, extension as an API
> usage example for the core codebase. I just wanted to double-check
> that this is still a wanted feature and no one on pgsql-hackers@
> objects the idea.

An extension just for COPY to/from Parquet looks limited in
functionality. Shouldn't this be viewed as FDW or Table AM support for
Parquet and other formats? Of course, the latter is much larger in scope
than the former. But there may already be efforts underway [1]. I have
not used it myself or worked with it.

[1] https://www.postgresql.org/about/news/parquet-s3-fdw-01-was-newly-released-2179/

--
Best Wishes,
Ashutosh Bapat
Re: Make COPY extendable in order to support Parquet and other formats
From: Aleksander Alekseev
Hi Ashutosh,

> An extension just for COPY to/from parquet looks limited in
> functionality. Shouldn't this be viewed as an FDW or Table AM support
> for parquet or other formats? Of course the later is much larger in
> scope compared to the first one. But there may already be efforts
> underway
> https://www.postgresql.org/about/news/parquet-s3-fdw-01-was-newly-released-2179/

Many thanks for sharing your thoughts on this!

We are using parquet_fdw [2], but this is a read-only FDW. What users
typically need is to dump their data as fast as possible in a given
format and either upload it to the cloud as historical data or transfer
it to another system (Spark, etc). The data can be accessed later if
needed, as read-only. Note that when accessing the historical data with
parquet_fdw you basically have zero ingestion time.

Another possible use case is transferring data to PostgreSQL from
another source. Here the requirements are similar: the data should be
dumped as fast as possible from the source, transferred over the
network, and imported as fast as possible.

In other words, personally I'm unaware of use cases where somebody needs
a complete read/write FDW or TableAM implementation for formats like
Parquet, ORC, etc. Also, to my knowledge, those interfaces are not
particularly optimized for this.

[2]: https://github.com/adjust/parquet_fdw

--
Best regards,
Aleksander Alekseev
On Tue, Jun 21, 2022 at 3:26 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> In other words, personally I'm unaware of use cases when somebody
> needs a complete read/write FDW or TableAM implementation for formats
> like Parquet, ORC, etc. Also to my knowledge they are not particularly
> optimized for this.

IIUC, you want extensibility in the FORMAT argument to the COPY command
(https://www.postgresql.org/docs/current/sql-copy.html), where the
format is pluggable. That seems useful.

Another option is to dump the data in CSV format and use an external
utility to convert the CSV to Parquet or whatever the other format is.
I understand that that's not going to be as efficient as dumping
directly in the desired format.

--
Best Wishes,
Ashutosh Bapat
Hi Ashutosh,

> IIUC, you want extensibility in FORMAT argument to COPY command
> https://www.postgresql.org/docs/current/sql-copy.html. Where the
> format is pluggable. That seems useful.
> Another option is to dump the data in csv format but use external
> utility to convert csv to parquet or whatever other format is. I
> understand that that's not going to be as efficient as dumping
> directly in the desired format.

Exactly. However, to clarify, I suspect this may be a bit more involved
than simply extending the FORMAT arguments. That change per se would not
be extremely useful: currently nothing prevents an extension author from
iterating over a table using the heap_open(), heap_getnext(), etc. API
and dumping its content in any format. The user would have to write
"dump_table(foo, filename)" instead of "COPY ...", but that's not a big
deal.

The problem is that every new extension has to reinvent things like
figuring out the schema, validating the data, etc. If we could do this
in core, so that an extension author has to implement only the minimal
format-dependent list of callbacks, that would be really great. In order
to make the interface practical, though, one would have to implement a
practical extension as well, for instance a Parquet one.

This being said, if it turns out that for some reason this is not
realistic to deliver, ending up with simply extending this part of the
syntax a bit should be fine too.

--
Best regards,
Aleksander Alekseev
Hi,

On 2022-06-22 16:59:16 +0530, Ashutosh Bapat wrote:
> On Tue, Jun 21, 2022 at 3:26 PM Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> > In other words, personally I'm unaware of use cases when somebody
> > needs a complete read/write FDW or TableAM implementation for formats
> > like Parquet, ORC, etc. Also to my knowledge they are not particularly
> > optimized for this.
>
> IIUC, you want extensibility in FORMAT argument to COPY command
> https://www.postgresql.org/docs/current/sql-copy.html. Where the
> format is pluggable. That seems useful.

Agreed. But I think it needs quite a bit of care. Just plugging a bunch
of per-row (or worse, per-field) switches into COPY's input / output
parsing will make the code even harder to read, and even slower.

I suspect that we'd first need a patch to refactor the existing copy
code a good bit to clean things up. After that it hopefully will be
possible to plug in a new format without being too intrusive.

I know little about Parquet - can it support FROM STDIN efficiently?

Greetings,

Andres Freund
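The concern about per-row switches can be illustrated with a toy sketch (all names and the two toy "formats" here are invented for illustration, not COPY internals): resolving the output routine once per COPY keeps the hot per-row path free of any branching on the format.

```python
def emit_text(value):
    # Toy "text" serialization
    return str(value)

def emit_csv(value):
    # Toy "csv" serialization with naive comma escaping
    return str(value).replace(",", "\\,")

def copy_with_per_row_switch(rows, fmt):
    # Anti-pattern: the format is re-examined for every single row
    out = []
    for value in rows:
        if fmt == "text":
            out.append(emit_text(value))
        elif fmt == "csv":
            out.append(emit_csv(value))
        else:
            raise ValueError(fmt)
    return out

def copy_with_single_dispatch(rows, fmt):
    # The branch is taken exactly once, before the row loop
    emit = {"text": emit_text, "csv": emit_csv}[fmt]
    return [emit(value) for value in rows]
```

Both produce the same output; the difference is only where the format decision lives, which is exactly what the proposed refactoring of the copy code would have to sort out before new formats can be plugged in cleanly.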
Andres Freund <andres@anarazel.de> writes:
> On 2022-06-22 16:59:16 +0530, Ashutosh Bapat wrote:
>> IIUC, you want extensibility in FORMAT argument to COPY command
>> https://www.postgresql.org/docs/current/sql-copy.html. Where the
>> format is pluggable. That seems useful.

> Agreed.

Ditto.

> I suspect that we'd first need a patch to refactor the existing copy code a
> good bit to clean things up. After that it hopefully will be possible to plug
> in a new format without being too intrusive.

I think that step 1 ought to be to convert the existing formats into
plug-ins, and demonstrate that there's no significant loss of
performance.

			regards, tom lane
Andres, Tom,

> > I suspect that we'd first need a patch to refactor the existing copy code a
> > good bit to clean things up. After that it hopefully will be possible to plug
> > in a new format without being too intrusive.
>
> I think that step 1 ought to be to convert the existing formats into
> plug-ins, and demonstrate that there's no significant loss of performance.

Yep, this looks like a promising strategy to me too.

> I know little about parquet - can it support FROM STDIN efficiently?

Parquet is a compressed binary format with data grouped by columns [1].
I wouldn't assume that streaming is a primary use case for this
particular format.

[1]: https://parquet.apache.org/docs/file-format/

--
Best regards,
Aleksander Alekseev
Hi,

On 2022-06-23 11:38:29 +0300, Aleksander Alekseev wrote:
> > I know little about parquet - can it support FROM STDIN efficiently?
>
> Parquet is a compressed binary format with data grouped by columns
> [1]. I wouldn't assume that this is a primary use case for this
> particular format.

IMO decent COPY FROM / TO STDIN support is crucial, because otherwise
you can't do COPY from/to a client, which would make the feature
unusable for anybody who is not a superuser, including just about all
users of hosted PG.

Greetings,

Andres Freund
On 2022-06-23 Th 21:45, Andres Freund wrote:
> Hi,
>
> On 2022-06-23 11:38:29 +0300, Aleksander Alekseev wrote:
>>> I know little about parquet - can it support FROM STDIN efficiently?
>> Parquet is a compressed binary format with data grouped by columns
>> [1]. I wouldn't assume that this is a primary use case for this
>> particular format.
> IMO decent COPY FROM / TO STDIN support is crucial, because otherwise you
> can't do COPY from/to a client. Which would make the feature unusable for
> anybody not superuser, including just about all users of hosted PG.

+1

Note that Parquet puts the metadata at the end of each file, which makes
it nice to write but somewhat unfriendly for streaming readers, which
would have to accumulate the whole file in order to process it.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
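The trailing-metadata point can be made concrete with a toy sketch of Parquet's outer framing as described in the Parquet format documentation: "PAR1" magic at both ends of the file, with a 4-byte little-endian footer length immediately before the trailing magic. This is only a mock of that framing, not a real Parquet reader or writer; the point is that a reader must seek to the end of the file before it can interpret anything, which is exactly what a streaming consumer cannot do.

```python
import struct

MAGIC = b"PAR1"

def toy_parquet_layout(data_pages: bytes, footer: bytes) -> bytes:
    """Mimic Parquet's physical layout:
    magic | data | footer metadata | footer length (uint32 LE) | magic"""
    return MAGIC + data_pages + footer + struct.pack("<I", len(footer)) + MAGIC

def read_footer(buf: bytes) -> bytes:
    # A reader starts at the END of the file: the last 4 bytes are the
    # magic, and the 4 bytes before them give the footer length.
    assert buf[-4:] == MAGIC, "not a (toy) Parquet file"
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    return buf[-8 - footer_len:-8]

buf = toy_parquet_layout(b"<row group bytes>", b"<file metadata>")
print(read_footer(buf))
```

So a COPY FROM STDIN implementation for such a format would indeed have to spool the whole stream (to disk or memory) before decoding, as Andrew notes.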
Hi Andrew,

> > IMO decent COPY FROM / TO STDIN support is crucial, because otherwise you
> > can't do COPY from/to a client. Which would make the feature unusable for
> > anybody not superuser, including just about all users of hosted PG.
>
> +1
>
> Note that Parquet puts the metadata at the end of each file, which makes
> it nice to write but somewhat unfriendly for streaming readers, which
> would have to accumulate the whole file in order to process it.

It's not necessarily that bad, since the data is divided into pages and
each page can be processed separately. However, personally I have
limited experience with Parquet at this point, so some experimentation
is required.

I will keep in mind the requirement regarding COPY FROM / TO STDIN.

--
Best regards,
Aleksander Alekseev