Thread: COPY command & binary format
Hello,
What is the way to build a binary format (instead of a csv) ? Is there specification for this file ?
http://www.postgresql.org/docs/9.5/static/sql-copy.html
http://www.postgresql.org/docs/9.5/static/sql-copy.html
Could I create such format from java ?
I guess this would be far faster, and maybe safer than CSVs
Thanks by advance,
<div class="moz-cite-prefix">Il 10/05/2016 12:56, Nicolas Paris ha scritto:<br /></div><blockquote cite="mid:CA+ssMORTVs9cnvaS6wOOuPcmmsT5MJ6sfDYs9h3peXqW4zOA5A@mail.gmail.com"type="cite"><div dir="ltr"><div class="gmail_default"style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">Hello,<br /><br /></div><div class="gmail_default"style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">What is the way to build a binary format (insteadof a csv) ? Is there specification for this file ?<br /><a href="http://www.postgresql.org/docs/9.5/static/sql-copy.html" moz-do-not-send="true">http://www.postgresql.org/docs/9.5/static/sql-copy.html</a><br/></div></div></blockquote> I alwayscreate binary files with<br /> COPY table TO 'path/to/file' WITH BINARY<br /><br /> Cheers<br /> Moreno.-<br />
On Tue, May 10, 2016 at 4:26 PM Nicolas Paris <niparisco@gmail.com> wrote:
Hello,What is the way to build a binary format (instead of a csv) ? Is there specification for this file ?
http://www.postgresql.org/docs/9.5/static/sql-copy.html
Could I create such format from java ?
You can use COPY JDBC API to copy to STDOUT and then compress it before you use usual Java file operations to write it to a file. You will have to follow the reverse process while reading from this file and LOADING to a table.
But why would you want to do that?
I guess this would be far faster, and maybe safer than CSVs
I don't think assumption is right. COPY is not meant for backup, it is for LOAD and UN-LOAD.
What you probably need is pg_dump with -Fc format.
What you probably need is pg_dump with -Fc format.
Thanks by advance,
--
--
Best Regards
Sameer Kumar | DB Solution Architect
ASHNIK PTE. LTD.
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533
T: +65 6438 3504 | M: +65 8110 0350 | www.ashnik.com
On Tue, May 10, 2016 at 4:36 PM Sameer Kumar <sameer.kumar@ashnik.com> wrote:
On Tue, May 10, 2016 at 4:26 PM Nicolas Paris <niparisco@gmail.com> wrote:Hello,What is the way to build a binary format (instead of a csv) ? Is there specification for this file ?
http://www.postgresql.org/docs/9.5/static/sql-copy.htmlCould I create such format from java ?You can use COPY JDBC API to copy to STDOUT and then compress it before you use usual Java file operations to write it to a file. You will have to follow the reverse process while reading from this file and LOADING to a table.But why would you want to do that?I guess this would be far faster, and maybe safer than CSVsI don't think assumption is right. COPY is not meant for backup, it is for LOAD and UN-LOAD.
What you probably need is pg_dump with -Fc format.
Like someone else suggested upthread you can use Binary format in COPY command (default is text)
Thanks by advance,----Best RegardsSameer Kumar | DB Solution ArchitectASHNIK PTE. LTD.101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533
T: +65 6438 3504 | M: +65 8110 0350 | www.ashnik.com
--
--
Best Regards
Sameer Kumar | DB Solution Architect
ASHNIK PTE. LTD.
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533
T: +65 6438 3504 | M: +65 8110 0350 | www.ashnik.com
2016-05-10 13:04 GMT+02:00 Moreno Andreo <moreno.andreo@evolu-s.it>:
Il 10/05/2016 12:56, Nicolas Paris ha scritto:I always create binary files withHello,What is the way to build a binary format (instead of a csv) ? Is there specification for this file ?
http://www.postgresql.org/docs/9.5/static/sql-copy.html
COPY table TO 'path/to/file' WITH BINARY
Fine, this works in this way :
postgresql -> binary
postgresql -> binary
binary -> postgresql
The way I want is :
csv -> binary -> postgresql
And if possible, transforming csv to binary throught java.
Use case is ETL process.
Le 10/05/2016 à 12:56, Nicolas Paris a écrit :
HiHello,What is the way to build a binary format (instead of a csv) ? Is there specification for this file ?
http://www.postgresql.org/docs/9.5/static/sql-copy.htmlCould I create such format from java ?I guess this would be far faster, and maybe safer than CSVsThanks by advance,
Making a driver that do what you want is not difficult. You will achieve better performances than than loading data from CSV, and you also will have better precision for floating values (there is no text conversion).
In the link you provide there is a description of the file format in section Binary Format.
Mathieu Pujol
Il 10/05/2016 13:38, Nicolas Paris ha scritto: <blockquote cite="mid:CA+ssMOSVC-Yu+yQXeLHL3c0+iz4g1PEtRM49cVvQxyNFN3d4sA@mail.gmail.com" type="cite"> 2016-05-10 13:04 GMT+02:00 Moreno Andreo <<a moz-do-not-send="true" href="mailto:moreno.andreo@evolu-s.it" target="_blank">moreno.andreo@evolu-s.it>: <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Il 10/05/2016 12:56, Nicolas Paris ha scritto: <div style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">Hello, <div style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">What is the way to build a binary format (instead of a csv) ? Is there specification for this file ? <a moz-do-not-send="true" href="http://www.postgresql.org/docs/9.5/static/sql-copy.html" target="_blank">http://www.postgresql.org/docs/9.5/static/sql-copy.html I always create binary files with COPY table TO 'path/to/file' WITH BINARY Â <div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">â Fine, this works in this way : postgresql -> binary <div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">binary -> postgresql <div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">The way I want is : <div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">csv -> binary -> postgresql Is this just to be quicker or are you going to add some business logic while converting CSV data? As you mentioned ETL, I assume the second, as I don't think that converting CSV to binary and then loading it to PostgreSQL will be more convenient than loading directly from CSV... as quicker as it can be, you have anyway to load data from CSV. Binary file format is briefly described in the last part of the doc you linked, under "Binary format", and there's also reference to source files. <blockquote cite="mid:CA+ssMOSVC-Yu+yQXeLHL3c0+iz4g1PEtRM49cVvQxyNFN3d4sA@mail.gmail.com" type="cite"> <div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)"> <div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">And if possible, transforming csv to binary throught javaâ. This is beyond my knowledge, ATM. I'm just starting with Java and JDBC is still in the TODO list, sorry... :-) Cheers Moreno.-
2016-05-10 14:47 GMT+02:00 Moreno Andreo <moreno.andreo@evolu-s.it>:
Il 10/05/2016 13:38, Nicolas Paris ha scritto:Is this just to be quicker or are you going to add some business logic while converting CSV data?2016-05-10 13:04 GMT+02:00 Moreno Andreo <moreno.andreo@evolu-s.it>:Il 10/05/2016 12:56, Nicolas Paris ha scritto:I always create binary files withHello,What is the way to build a binary format (instead of a csv) ? Is there specification for this file ?
http://www.postgresql.org/docs/9.5/static/sql-copy.html
COPY table TO 'path/to/file' WITH BINARY Fine, this works in this way :
postgresql -> binarybinary -> postgresqlThe way I want is :csv -> binary -> postgresql
As you mentioned ETL, I assume the second, as I don't think that converting CSV to binary and then loading it to PostgreSQL will be more convenient than loading directly from CSV... as quicker as it can be, you have anyway to load data from CSV.
Right, ETL process means huge business logic.
Right, ETL process means huge business logic.
get the data (csv or other) -> transform it -> produce a binary -> copy from binary from stdin
Producing 100GO CSVs, is a waste of time.
Binary file format is briefly described in the last part of the doc you linked, under "Binary format", and there's also reference to source files.This is beyond my knowledge, ATM. I'm just starting with Java and JDBC is still in the TODO list, sorry... :-)And if possible, transforming csv to binary throught java.
Cheers
Moreno.-
Documentation explains a bit. Moreover, I have found a detailled answer here :
My ultimate goal is to encapsulate it in a Talend component. (talend is an open-source java based ETL software).
Thanks, I ll keep you aware.
Thanks, I ll keep you aware.
On Tue, May 10, 2016 at 01:38:12PM +0200, Nicolas Paris wrote: > The way I want is : > csv -> binary -> postgresql > > And if possible, transforming csv to binary throught java. > > Use case is ETL process. Not sure what the point would be tbh if the data is already in CSV. You might aswell submit the CSV to postgres and let it deal with it. It'll probably be faster. It'll also be more portable. The BINARY format is what Postgres uses internally (more or less). I had to look at the source code to figure out how to insert a timestamp (FYI: Postgres stores timestamps as epoch based off the year 2000 not 1970 amongst other fun things). -- "A search of his car uncovered pornography, a homemade sex aid, women's stockings and a Jack Russell terrier." - http://www.dailytelegraph.com.au/news/wacky/indeed/story-e6frev20-1111118083480
On Tue, May 10, 2016 at 03:00:55PM +0200, Nicolas Paris wrote: > > The way I want is : > > csv -> binary -> postgresql > > > > Is this just to be quicker or are you going to add some business logic > > while converting CSV data? > > As you mentioned ETL, I assume the second, as I don't think that > > converting CSV to binary and then loading it to PostgreSQL will be more > > convenient than loading directly from CSV... as quicker as it can be, you > > have anyway to load data from CSV. > > > > Right, ETL process means huge business logic. > get the data (csv or other) -> transform it -> produce a binary -> copy > from binary from stdin > > Producing 100GO CSVs, is a waste of time. Ah. You need to fiddle with the data. Then you need to weigh the pros of something agnostic to Postgres's internals to something that needs to be aware of them. You will need to delve into the source code for data types more complex than INTEGER, TEXT and BYTEA (which was the majority of my data when I was just looking into it). -- "A search of his car uncovered pornography, a homemade sex aid, women's stockings and a Jack Russell terrier." - http://www.dailytelegraph.com.au/news/wacky/indeed/story-e6frev20-1111118083480
Well the job is done. The talend component is working (https://github.com/parisni/talend/tree/master/tPostgresqlOutputBulkAPHP). It allows creating a file (binary or csv) locally, and then use the COPY function with "FROM STDIN" that does not need to push the file on a remote database server.
I have made a little comparison test:
column1: character varying
column2: integer
column3: boolean
10 000 000 tuples
Type | Create file time | Bulk load time | Total Time | File size
Binary | 11137 milliseconds | 21661 milliseconds | 32798 milliseconds | 250 MOCSV | 23226 milliseconds | 22192 milliseconds | 45418 milliseconds | 179 MO
Binary format is definitely faster and safer
- faster because writing binary is faster than text file. I guess the bulk load time bottleneck is the network, then this is equivalent for both format. It is two time faster to load a binary when the file is on the database server.
- safer thanks to the format (each value is preceded by its lenght) more robust thant CSV and separators (that can be present in the text).
Code has been based on :
- https://github.com/uwescience/myria/blob/master/src/edu/washington/escience/myria/PostgresBinaryTupleWriter.java
- https://github.com/bytefish/PgBulkInsert/tree/master/PgBulkInsert/src/main/de/bytefish/pgbulkinsert/pgsql/handlers
- https://github.com/uwescience/myria/blob/master/src/edu/washington/escience/myria/PostgresBinaryTupleWriter.java
- https://github.com/bytefish/PgBulkInsert/tree/master/PgBulkInsert/src/main/de/bytefish/pgbulkinsert/pgsql/handlers
Thanks,
2016-05-10 15:08 GMT+02:00 Cat <cat@zip.com.au>:
On Tue, May 10, 2016 at 03:00:55PM +0200, Nicolas Paris wrote:
> > The way I want is :
> > csv -> binary -> postgresql
> >
> > Is this just to be quicker or are you going to add some business logic
> > while converting CSV data?
> > As you mentioned ETL, I assume the second, as I don't think that
> > converting CSV to binary and then loading it to PostgreSQL will be more
> > convenient than loading directly from CSV... as quicker as it can be, you
> > have anyway to load data from CSV.
> >
>
> Right, ETL process means huge business logic.
> get the data (csv or other) -> transform it -> produce a binary -> copy
> from binary from stdin
>
> Producing 100GO CSVs, is a waste of time.
Ah. You need to fiddle with the data. Then you need to weigh the pros of
something agnostic to Postgres's internals to something that needs to be
aware of them.
You will need to delve into the source code for data types more complex
than INTEGER, TEXT and BYTEA (which was the majority of my data when I
was just looking into it).
--
"A search of his car uncovered pornography, a homemade sex aid, women's
stockings and a Jack Russell terrier."
- http://www.dailytelegraph.com.au/news/wacky/indeed/story-e6frev20-1111118083480