Thread: New Copy Formats - avro/orc/parquet
Hello

I'd find it useful to be able to import/export from postgres to those modern data formats:
- avro (c writer=https://avro.apache.org/docs/1.8.2/api/c/index.html)
- parquet (c++ writer=https://github.com/apache/parquet-cpp)
- orc (all writers=https://github.com/apache/orc)

Something like:
COPY table TO STDOUT ORC;

Would be lovely.

This would greatly enhance how postgres integrates in the big-data ecosystem.

Any thoughts?

Thanks
On 02/10/2018 07:13 AM, Nicolas Paris wrote:
> Hello
>
> I'd find it useful to be able to import/export from postgres to those modern data formats:
> - avro (c writer=https://avro.apache.org/docs/1.8.2/api/c/index.html)
> - parquet (c++ writer=https://github.com/apache/parquet-cpp)
> - orc (all writers=https://github.com/apache/orc)
>
> Something like:
> COPY table TO STDOUT ORC;
>
> Would be lovely.
>
> This would greatly enhance how postgres integrates in the big-data ecosystem.
>
> Any thoughts?

https://www.postgresql.org/docs/10/static/sql-copy.html

"PROGRAM

A command to execute. In COPY FROM, the input is read from standard output of the command, and in COPY TO, the output is written to the standard input of the command.

Note that the command is invoked by the shell, so if you need to pass any arguments to a shell command that come from an untrusted source, you must be careful to strip or escape any special characters that might have a special meaning for the shell. For security reasons, it is best to use a fixed command string, or at least avoid passing any user input in it."

> Thanks

--
Adrian Klaver
adrian.klaver@aklaver.com
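For illustration, the PROGRAM route quoted above could be used like this (csv2orc here is a hypothetical converter, not an existing tool):

    -- Pipe COPY output through an external converter; the server still
    -- formats every row as CSV first, and the converter must parse it back.
    COPY mytable TO PROGRAM 'csv2orc > /tmp/mytable.orc' WITH (FORMAT csv);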
>> I'd find it useful to be able to import/export from postgres to those modern data formats:
>> - avro (c writer=https://avro.apache.org/docs/1.8.2/api/c/index.html)
>> - parquet (c++ writer=https://github.com/apache/parquet-cpp)
>> - orc (all writers=https://github.com/apache/orc)
>>
>> Something like:
>> COPY table TO STDOUT ORC;
>>
>> Would be lovely.
>>
>> This would greatly enhance how postgres integrates in the big-data ecosystem.
>>
>> Any thoughts?
>
> https://www.postgresql.org/docs/10/static/sql-copy.html
>
> "PROGRAM
>
> A command to execute. In COPY FROM, the input is read from standard output of the command, and in COPY TO, the output is written to the standard input of the command.
>
> Note that the command is invoked by the shell, so if you need to pass any arguments to a shell command that come from an untrusted source, you must be careful to strip or escape any special characters that might have a special meaning for the shell. For security reasons, it is best to use a fixed command string, or at least avoid passing any user input in it."

PROGRAM would involve the overhead of transforming data from CSV or BINARY to AVRO, for example.

Here, I am talking about native format exports/imports for performance considerations.
On Saturday, February 10, 2018, Nicolas Paris <niparisco@gmail.com> wrote:
> Hello
>
> I'd find it useful to be able to import/export from postgres to those modern data formats:
> - avro (c writer=https://avro.apache.org/docs/1.8.2/api/c/index.html)
> - parquet (c++ writer=https://github.com/apache/parquet-cpp)
> - orc (all writers=https://github.com/apache/orc)
>
> Something like:
> COPY table TO STDOUT ORC;
>
> Would be lovely.
>
> This would greatly enhance how postgres integrates in the big-data ecosystem.
It would be written "... with (format 'orc')" and your best bet would be to create an extension. I don't think that having such code in core (or contrib) is desirable.
David J.
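Spelled out, the syntax David suggests would presumably look like the following; this is purely illustrative, since no 'orc' format exists in core or as an extension today:

    -- hypothetical, assuming an extension has registered an 'orc' format
    COPY mytable TO STDOUT WITH (FORMAT 'orc');
    COPY mytable FROM '/data/mytable.orc' WITH (FORMAT 'orc');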
On 02/10/2018 04:38 PM, David G. Johnston wrote:
> On Saturday, February 10, 2018, Nicolas Paris <niparisco@gmail.com> wrote:
>
>> Hello
>>
>> I'd find it useful to be able to import/export from postgres to those modern data formats:
>> - avro (c writer=https://avro.apache.org/docs/1.8.2/api/c/index.html)
>> - parquet (c++ writer=https://github.com/apache/parquet-cpp)
>> - orc (all writers=https://github.com/apache/orc)
>>
>> Something like:
>> COPY table TO STDOUT ORC;
>>
>> Would be lovely.
>>
>> This would greatly enhance how postgres integrates in the big-data ecosystem.
>
> It would be written "... with (format 'orc')" and your best bet would be to create an extension. I don't think that having such code in core (or contrib) is desirable.

I don't think this is currently possible to do from an extension (i.e. the formats supported by COPY are hard-coded in core). But I agree that if we are to add multiple new formats, it'd be nice to allow doing that in an extension.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 02/10/2018 04:30 PM, Nicolas Paris wrote:
>>> I'd find it useful to be able to import/export from postgres to those modern data formats:
>>> - avro (c writer=https://avro.apache.org/docs/1.8.2/api/c/index.html)
>>> - parquet (c++ writer=https://github.com/apache/parquet-cpp)
>>> - orc (all writers=https://github.com/apache/orc)
>>>
>>> Something like:
>>> COPY table TO STDOUT ORC;
>>>
>>> Would be lovely.
>>>
>>> This would greatly enhance how postgres integrates in the big-data ecosystem.
>>>
>>> Any thoughts?
>>
>> https://www.postgresql.org/docs/10/static/sql-copy.html
>>
>> "PROGRAM
>>
>> A command to execute. In COPY FROM, the input is read from standard output of the command, and in COPY TO, the output is written to the standard input of the command.
>>
>> Note that the command is invoked by the shell, so if you need to pass any arguments to a shell command that come from an untrusted source, you must be careful to strip or escape any special characters that might have a special meaning for the shell. For security reasons, it is best to use a fixed command string, or at least avoid passing any user input in it."
>
> PROGRAM would involve the overhead of transforming data from CSV or
> BINARY to AVRO, for example.
>
> Here, I am talking about native format exports/imports for
> performance considerations.

That is true, but the question is how significant the overhead is. If it's 50% then reducing it would make perfect sense. If it's 1% then no one is going to be bothered by it. Without these numbers it's hard to make any judgments.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
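One rough way to put numbers on that question from psql (a sketch only; a real benchmark would need a realistic table and repeated runs, and writing to a server-side file requires appropriate privileges):

    \timing on
    -- cost of row formatting alone, discarding the output
    COPY bigtable TO '/dev/null' WITH (FORMAT text);
    COPY bigtable TO '/dev/null' WITH (FORMAT binary);
    -- plus the extra hop through an external process
    COPY bigtable TO PROGRAM 'cat > /dev/null' WITH (FORMAT binary);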
Hi,

On 2018-02-10 18:21:37 +0100, Tomas Vondra wrote:
> That is true, but the question is how significant the overhead is. If
> it's 50% then reducing it would make perfect sense. If it's 1% then no
> one is going to be bothered by it.

I think it's pretty clear that it's going to be way more than 1%. It's trivial to construct cases where input parsing / output formatting takes the majority of the time. And a lot of that you're going to be able to avoid with binary formats.

Greetings,

Andres Freund
>> That is true, but the question is how significant the overhead is. If
>> it's 50% then reducing it would make perfect sense. If it's 1% then no
>> one is going to be bothered by it.
>
> I think it's pretty clear that it's going to be way more than 1%.

Good news, but I'm not sure I understand why.

> It's trivial to construct cases where input parsing / output
> formatting takes the majority of the time.

    Binary -> ORC
           ^
           |
        PROGRAM parsing/output formatting on the fly

> And a lot of that you're going to be able to avoid with binary formats.

Still, the above diagram shows both a parsing and a formatting step, doesn't it?
On February 11, 2018 12:00:12 PM PST, Nicolas Paris <niparisco@gmail.com> wrote:
>>> That is true, but the question is how significant the overhead is. If
>>> it's 50% then reducing it would make perfect sense. If it's 1% then no
>>> one is going to be bothered by it.
>>
>> I think it's pretty clear that it's going to be way more than 1%.
>
> Good news, but I'm not sure I understand why.

I think you might have misunderstood my reply? I'm saying that going through PROGRAM will have significant overhead. I can't quite make sense of the rest of your reply otherwise?

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 11 Feb 2018 at 21:03, Andres Freund wrote:
> On February 11, 2018 12:00:12 PM PST, Nicolas Paris <niparisco@gmail.com> wrote:
>>>> That is true, but the question is how significant the overhead is. If
>>>> it's 50% then reducing it would make perfect sense. If it's 1% then no
>>>> one is going to be bothered by it.
>>>
>>> I think it's pretty clear that it's going to be way more than 1%.
>>
>> Good news, but I'm not sure I understand why.
>
> I think you might have misunderstood my reply? I'm saying that going through PROGRAM will have significant overhead. I can't quite make sense of the rest of your reply otherwise?

True, I misunderstood. Then I agree the computation overhead should be non-negligible.

I also have the storage and network transfer overhead in mind: all those new formats are compressed; this is not true for the current postgres BINARY format, and obviously not for the text-based formats. In my experience, the binary format is 10 to 30% larger than the text one. On the contrary, an ORC file can be up to 10 times smaller than a text-based format.
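The size claim is easy to check on a concrete table; a minimal sketch using PROGRAM to count bytes on the wire:

    -- compare on-the-wire sizes of the two built-in formats
    COPY mytable TO PROGRAM 'wc -c > /tmp/text_size'   WITH (FORMAT text);
    COPY mytable TO PROGRAM 'wc -c > /tmp/binary_size' WITH (FORMAT binary);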
On 2018-02-11 21:41:26 +0100, Nicolas Paris wrote:
> I also have the storage and network transfer overhead in mind: all
> those new formats are compressed; this is not true for the current
> postgres BINARY format, and obviously not for the text-based formats.
> In my experience, the binary format is 10 to 30% larger than the text
> one. On the contrary, an ORC file can be up to 10 times smaller than a
> text-based format.

That seems largely irrelevant when arguing about using PROGRAM though, right?

Greetings,

Andres Freund
On 11 Feb 2018 at 21:53, Andres Freund wrote:
> On 2018-02-11 21:41:26 +0100, Nicolas Paris wrote:
>> I also have the storage and network transfer overhead in mind: all
>> those new formats are compressed; this is not true for the current
>> postgres BINARY format, and obviously not for the text-based formats.
>> In my experience, the binary format is 10 to 30% larger than the text
>> one. On the contrary, an ORC file can be up to 10 times smaller than a
>> text-based format.
>
> That seems largely irrelevant when arguing about using PROGRAM though,
> right?

Indeed, those storage and network transfer figures only apply versus the CSV/BINARY formats. No link with the PROGRAM aspect.
On 2018-02-10 16:13:04 +0100, Nicolas Paris wrote:
> Hello
>
> I'd find it useful to be able to import/export from postgres to those modern data formats:
> - avro (c writer=https://avro.apache.org/docs/1.8.2/api/c/index.html)
> - parquet (c++ writer=https://github.com/apache/parquet-cpp)
> - orc (all writers=https://github.com/apache/orc)
>
> Something like:
> COPY table TO STDOUT ORC;
>
> Would be lovely.
>
> This would greatly enhance how postgres integrates in the big-data ecosystem.
>
> Any thoughts?

So, I think making COPY extensible would be quite beneficial. I'm however quite doubtful that we want to add core code to handle all of the above. I think we should make the COPY input/output formatting extensible by extensions.

I think we'd have something like

    CREATE COPY HANDLER name HANDLER somefunction;

somefunction would have to be a function that returns type pg_copy_handler, which in turn basically is a struct of function pointers.

I imagine we'd have callbacks for

- start copy in / out
- output row, with an array of values/nulls
- parse row, with an input buffer as argument, returning values / nulls arrays
- finish copy in / out

We'd also need to expose a few more utility functions from copy.c externally. I think this'd require a good bit of cleanup in copy.c...

Greetings,

Andres Freund
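To make the shape of that proposal concrete, here is a minimal C sketch of what such a handler struct might look like. Every name below (CopyFormatRoutine, the orc_* callbacks, the field names) is invented for illustration; nothing like this exists in copy.c today:

    #include "postgres.h"
    #include "fmgr.h"
    #include "lib/stringinfo.h"
    #include "nodes/pg_list.h"

    /* Hypothetical struct of function pointers returned by a copy handler. */
    typedef struct CopyFormatRoutine
    {
        /* start copy in / out: build per-COPY state from the option list */
        void *(*startup) (List *options, bool is_copy_from);

        /* output row: emit one row given its column values and null flags */
        void (*output_row) (void *state, Datum *values, bool *nulls);

        /* parse row: fill values/nulls from the input buffer; false at EOF */
        bool (*parse_row) (void *state, StringInfo buf,
                           Datum *values, bool *nulls);

        /* finish copy in / out: flush output and release resources */
        void (*shutdown) (void *state);
    } CopyFormatRoutine;

    /* Format-specific callbacks, implemented elsewhere in the extension. */
    static void *orc_startup(List *options, bool is_copy_from);
    static void orc_output_row(void *state, Datum *values, bool *nulls);
    static bool orc_parse_row(void *state, StringInfo buf,
                              Datum *values, bool *nulls);
    static void orc_shutdown(void *state);

    /* The function named in CREATE COPY HANDLER would return the struct. */
    PG_FUNCTION_INFO_V1(orc_copy_handler);

    Datum
    orc_copy_handler(PG_FUNCTION_ARGS)
    {
        static CopyFormatRoutine routine = {
            orc_startup, orc_output_row, orc_parse_row, orc_shutdown
        };

        PG_RETURN_POINTER(&routine);
    }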
On 02/11/2018 12:57 PM, Nicolas Paris wrote:
> On 11 Feb 2018 at 21:53, Andres Freund wrote:
>> On 2018-02-11 21:41:26 +0100, Nicolas Paris wrote:
>>> I also have the storage and network transfer overhead in mind: all
>>> those new formats are compressed; this is not true for the current
>>> postgres BINARY format, and obviously not for the text-based formats.
>>> In my experience, the binary format is 10 to 30% larger than the text
>>> one. On the contrary, an ORC file can be up to 10 times smaller than a
>>> text-based format.
>>
>> That seems largely irrelevant when arguing about using PROGRAM though,
>> right?
>
> Indeed, those storage and network transfer figures only apply versus
> the CSV/BINARY formats. No link with the PROGRAM aspect.

Just wondering what your time frame is on this? Asking because this would be considered a new feature and so would need to be added to a major release of Postgres. Currently work is going on for Postgres version 11, to be released (just a guess) late Fall 2018/early Winter 2019. The CommitFest (https://commitfest.postgresql.org/) for this release is currently approximately 3/4 of the way through. Not sure that new code could make it in at this point. This means it would be bumped to version 12 for 2019/2020.

--
Adrian Klaver
adrian.klaver@aklaver.com
On 11 Feb 2018 at 22:19, Adrian Klaver wrote:
> On 02/11/2018 12:57 PM, Nicolas Paris wrote:
>> On 11 Feb 2018 at 21:53, Andres Freund wrote:
>>> On 2018-02-11 21:41:26 +0100, Nicolas Paris wrote:
>>>> I also have the storage and network transfer overhead in mind: all
>>>> those new formats are compressed; this is not true for the current
>>>> postgres BINARY format, and obviously not for the text-based formats.
>>>> In my experience, the binary format is 10 to 30% larger than the text
>>>> one. On the contrary, an ORC file can be up to 10 times smaller than a
>>>> text-based format.
>>>
>>> That seems largely irrelevant when arguing about using PROGRAM though,
>>> right?
>>
>> Indeed, those storage and network transfer figures only apply versus
>> the CSV/BINARY formats. No link with the PROGRAM aspect.
>
> Just wondering what your time frame is on this? Asking because this would be considered a new feature and so would need to be added to a major release of Postgres. Currently work is going on for Postgres version 11, to be released (just a guess) late Fall 2018/early Winter 2019. The CommitFest (https://commitfest.postgresql.org/) for this release is currently approximately 3/4 of the way through. Not sure that new code could make it in at this point. This means it would be bumped to version 12 for 2019/2020.

Right now, exporting (billions of rows * hundreds of columns) from postgres to distributed tools such as spark is feasible, but only at the cost of parsing, transfer, tooling and workaround overhead. Waiting until 2020 for the opportunity to write COPY extensions would mean using this feature around 2022. I mean, writing the ORC COPY extension, extending the postgres JDBC driver, extending the spark jdbc connector, all from different communities: this will be a long process.

But again, postgres would be the most advanced RDBMS, because AFAIK no other DB deals with those distributed formats for the moment. Knowing that such a feature will be released one day makes one plan the place of postgres in a data warehouse architecture accordingly.
Andres Freund <andres@anarazel.de> writes:
> So, I think making COPY extensible would be quite beneficial. I'm
> however quite doubtful that we want to add core code to handle all of
> the above. I think we should make the COPY input/output formatting
> extensible by extensions.

+1. I can't see carrying code for these formats in-core, but I've no objection to making it possible for someone else to maintain them.

> I imagine we'd have callbacks for
> - start copy in / out
> - output row, with an array of values/nulls
> - parse row, with an input buffer as argument, returning values / nulls arrays
> - finish copy in / out

Also something to allow absorbing format-specific options, if the precedent of CSV is anything to go by. (Any such patch should manage to turn COPY-CSV into an extension, at least so far as copy.c is concerned, even if we don't package it as one.)

regards, tom lane
On February 11, 2018 2:48:13 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (Any such patch should manage
> to turn COPY-CSV into an extension, at least so far as copy.c is
> concerned, even if we don't package it as one.)

Yea, I was thinking we should move all three (default, csv, binary) supported formats to using such a facility. I guess we'd just install the respective format handlers by default in the respective catalogs, rather than doing so in an extension. The handler function should easily be able to return pointers to functions in the main binary, not just additional shlibs.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
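In DDL terms, the built-in and extension-provided formats would then be registered the same way; a hypothetical sketch (none of this syntax exists today):

    -- pre-installed in the catalogs at initdb time:
    --   CREATE COPY HANDLER text   HANDLER text_copy_handler;
    --   CREATE COPY HANDLER csv    HANDLER csv_copy_handler;
    --   CREATE COPY HANDLER binary HANDLER binary_copy_handler;

    -- an ORC extension would simply add its own:
    CREATE FUNCTION orc_copy_handler(internal)
        RETURNS pg_copy_handler
        AS 'MODULE_PATHNAME' LANGUAGE C;

    CREATE COPY HANDLER orc HANDLER orc_copy_handler;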
Andres Freund <andres@anarazel.de> writes:
> On February 11, 2018 2:48:13 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> (Any such patch should manage
>> to turn COPY-CSV into an extension, at least so far as copy.c is
>> concerned, even if we don't package it as one.)

> Yea, I was thinking we should move all three (default, csv, binary)
> supported formats to using such a facility.

Hm, yeah, probably. That opens up a different dimension of variation: is the raw data text or binary? I'm thinking we shouldn't make the format handlers reinvent e.g. encoding conversion if they don't have to, so it's likely that some of the infrastructure for text data would remain in core copy.c.

regards, tom lane
On Sun, Feb 11, 2018 at 11:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> So, I think making COPY extensible would be quite beneficial. I'm
>> however quite doubtful that we want to add core code to handle all of
>> the above. I think we should make the COPY input/output formatting
>> extensible by extensions.
>
> +1. I can't see carrying code for these formats in-core, but I've
> no objection to making it possible for someone else to maintain them.

+1. And bonus points if an API can also be defined so that such an extension's parsing also becomes useful to file_fdw automatically (or at least optionally).
Magnus Hagander <magnus@hagander.net> writes:
> +1. And bonus points if an API can also be defined so that such an
> extension's parsing also becomes useful to file_fdw automatically (or
> at least optionally).

Hm, well, file_fdw already goes through COPY FROM, so it seems like it'd almost just work. "Almost" because there'd need to be a way for it to support the appropriate options. So this means that whatever the mechanism is for extension-specific options, it has to be introspectable enough for file_fdw.c to understand what to expose as FDW options.

regards, tom lane
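From the file_fdw side, that could eventually look something like this hypothetical sketch (the 'orc' format and its option are illustrative only):

    CREATE EXTENSION file_fdw;
    CREATE SERVER files FOREIGN DATA WRAPPER file_fdw;

    -- file_fdw passes its options through to COPY, so an extensible COPY
    -- would let a foreign table declare an extension-provided format:
    CREATE FOREIGN TABLE events (
        id      bigint,
        payload jsonb
    ) SERVER files
      OPTIONS (filename '/data/events.orc', format 'orc');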