Thread: Extending copy_expert

Extending copy_expert

From
Andrea Riciputi
Date:
Hi all,
a couple of weeks ago at work we had to produce a quite big CSV data file which should be used as input by another
pieceof software. 

Since the file must be produced on a daily basis, is big (let say half a TB), and it contains data stored in our PG
database,letting PG produce the file itself seemed the right approach. Thanks to psycopg the whole operation is
performedin C, resulting fast enough for our purpose. 

However the target software for which the file is produced, is, let say, “legacy” software and can only accept CRLF as
EOLcharacter. However by calling COPY TO STDOUT from psycopg ends up in a CSV file with LF as EOL forcing us to pass
thefile a second time to convert EOL, which is inconvenient. Plus, doing it in Python, make it a little bit to slow. 

My first attempt was to ask the pgsql-hackers ML for extending the COPY TO syntax to allow a “FORCE_EOL” parameter, but
theykindly rejected my proposal. They also suggested to me to use the result of PQgetCopyData() and convert there the
LFcharacter with whatever is suitable for me. 

So I studied the psycopg codebase and spotted out where and how to change it to allow such an use case. My intent was
toadd a new keyword argument to the copy_expert() method, let me call it “eol” with a default of “\n”. If the user
decidesto override it using a different EOL (i.e. “\r\n” or “\r”) every EOL returned by PQgetCopyData() in
_pq_copy_out_v3()can be converted. 

However I’m a little bit concerned with this solution, and before going on with a pull request, I’d like to have your
feedbackhere. My main concern is that extending the copy_expert() method in psycopg leaves the user completely alone
aboutusing this new keyword argument in the right way. 

We can easily allow only CR, LF, and CRLF as the values for that argument, but what if the user uses the “eol” kwarg
andfor example issues a “COPY TO … AS BINARY” query? In that case the resulting output file can end up being corrupted
withoutthe user can even notice that. Of course psycopg can parse the “COPY TO” query (by means of PG’s
ProcessCopyOptions())and check if the “eol” kwarg is consistent with the issued query. But, frankly this is seems to
becomea little bit too complex  to me. 

So I’m asking to you, what’s your take on this, what do you think about that? Do you see any better way to get it done?
Anyonehere also involved in pgsql-hackers ML can support my idea to extend the COPY TO syntax directly in PG? 

Thanks for you help, and apologies for the long email.
a.



Re: Extending copy_expert

From
Rich Shepard
Date:
On Sun, 12 Oct 2014, Andrea Riciputi wrote:

> However the target software for which the file is produced, is, let say,
> “legacy” software and can only accept CRLF as EOL character. However by
> calling COPY TO STDOUT from psycopg ends up in a CSV file with LF as EOL
> forcing us to pass the file a second time to convert EOL, which is
> inconvenient. Plus, doing it in Python, make it a little bit to slow.

Andrea,

   How about using unix2dos or todos? These linux utilities translate line
endings. When I get client data in text form with the CR/LF ends I translate
them to LF (\n) with fromdos (or the older flavor called dos2unix). It's
quick, but I've not tried them on a 500G file.

HTH,

Rich


Re: Extending copy_expert

From
Adrian Klaver
Date:
On 10/12/2014 02:28 PM, Andrea Riciputi wrote:
> Hi all,
> a couple of weeks ago at work we had to produce a quite big CSV data file which should be used as input by another
pieceof software. 
>
> Since the file must be produced on a daily basis, is big (let say half a TB), and it contains data stored in our PG
database,letting PG produce the file itself seemed the right approach. Thanks to psycopg the whole operation is
performedin C, resulting fast enough for our purpose. 
>
> However the target software for which the file is produced, is, let say, “legacy” software and can only accept CRLF
asEOL character. However by calling COPY TO STDOUT from psycopg ends up in a CSV file with LF as EOL forcing us to pass
thefile a second time to convert EOL, which is inconvenient. Plus, doing it in Python, make it a little bit to slow. 
>
> My first attempt was to ask the pgsql-hackers ML for extending the COPY TO syntax to allow a “FORCE_EOL” parameter,
butthey kindly rejected my proposal. They also suggested to me to use the result of PQgetCopyData() and convert there
theLF character with whatever is suitable for me. 
>
> So I studied the psycopg codebase and spotted out where and how to change it to allow such an use case. My intent was
toadd a new keyword argument to the copy_expert() method, let me call it “eol” with a default of “\n”. If the user
decidesto override it using a different EOL (i.e. “\r\n” or “\r”) every EOL returned by PQgetCopyData() in
_pq_copy_out_v3()can be converted. 
>
> However I’m a little bit concerned with this solution, and before going on with a pull request, I’d like to have your
feedbackhere. My main concern is that extending the copy_expert() method in psycopg leaves the user completely alone
aboutusing this new keyword argument in the right way. 
>
> We can easily allow only CR, LF, and CRLF as the values for that argument, but what if the user uses the “eol” kwarg
andfor example issues a “COPY TO … AS BINARY” query? In that case the resulting output file can end up being corrupted
withoutthe user can even notice that. Of course psycopg can parse the “COPY TO” query (by means of PG’s
ProcessCopyOptions())and check if the “eol” kwarg is consistent with the issued query. But, frankly this is seems to
becomea little bit too complex  to me. 
>
> So I’m asking to you, what’s your take on this, what do you think about that? Do you see any better way to get it
done?Anyone here also involved in pgsql-hackers ML can support my idea to extend the COPY TO syntax directly in PG? 
>
> Thanks for you help, and apologies for the long email.

My suspicion is this is something that can be done within the context of
the open(). I know in Python 3 there is a newline argument to open()
that allows you to specify the newline ending. In Python 2 it would
involve playing around with the Universal newline mechanism. I have not
had time to investigate any further, but I thought I would throw it out
there.

> a.
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Extending copy_expert

From
Adrian Klaver
Date:
On 10/12/2014 02:28 PM, Andrea Riciputi wrote:
> Hi all,
> a couple of weeks ago at work we had to produce a quite big CSV data file which should be used as input by another
pieceof software. 
>
> Since the file must be produced on a daily basis, is big (let say half a TB), and it contains data stored in our PG
database,letting PG produce the file itself seemed the right approach. Thanks to psycopg the whole operation is
performedin C, resulting fast enough for our purpose. 
>
> However the target software for which the file is produced, is, let say, “legacy” software and can only accept CRLF
asEOL character. However by calling COPY TO STDOUT from psycopg ends up in a CSV file with LF as EOL forcing us to pass
thefile a second time to convert EOL, which is inconvenient. Plus, doing it in Python, make it a little bit to slow. 
>
> My first attempt was to ask the pgsql-hackers ML for extending the COPY TO syntax to allow a “FORCE_EOL” parameter,
butthey kindly rejected my proposal. They also suggested to me to use the result of PQgetCopyData() and convert there
theLF character with whatever is suitable for me. 
>
> So I studied the psycopg codebase and spotted out where and how to change it to allow such an use case. My intent was
toadd a new keyword argument to the copy_expert() method, let me call it “eol” with a default of “\n”. If the user
decidesto override it using a different EOL (i.e. “\r\n” or “\r”) every EOL returned by PQgetCopyData() in
_pq_copy_out_v3()can be converted. 
>
> However I’m a little bit concerned with this solution, and before going on with a pull request, I’d like to have your
feedbackhere. My main concern is that extending the copy_expert() method in psycopg leaves the user completely alone
aboutusing this new keyword argument in the right way. 
>
> We can easily allow only CR, LF, and CRLF as the values for that argument, but what if the user uses the “eol” kwarg
andfor example issues a “COPY TO … AS BINARY” query? In that case the resulting output file can end up being corrupted
withoutthe user can even notice that. Of course psycopg can parse the “COPY TO” query (by means of PG’s
ProcessCopyOptions())and check if the “eol” kwarg is consistent with the issued query. But, frankly this is seems to
becomea little bit too complex  to me. 
>
> So I’m asking to you, what’s your take on this, what do you think about that? Do you see any better way to get it
done?Anyone here also involved in pgsql-hackers ML can support my idea to extend the COPY TO syntax directly in PG? 
>
> Thanks for you help, and apologies for the long email.

Alright to follow up on my previous post about open. In Python 2 newline
is available in the io module, so a simple example:

f = io.open('io_newline.csv', 'w',  newline='\r\n')

cur = con.cursor()

cur.copy_expert("COPY cell_per TO STDOUT WITH CSV HEADER", f)

f.close()

aklaver@panda:~/software_projects> file io_newline.csv
io_newline.csv: ASCII text, with CRLF line terminators

> a.
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Extending copy_expert

From
Andrea Riciputi
Date:
Hi all,
thanks for your suggestions but they don’t fit the use-case at hand.

Regarding using unix2dos, it is quite slow when the file become large and here we are talking of several hundreds of
GB.

Using io.open() is a no way either, since the “newline” kwarg only works for “text” files. This means that all the data
comingfrom PG (which are python bytes/str objects) must be converted to Python unicode objects and then back again to
bytes.This (useless) decoding/encoding dance nearly doubles execution times. 

My point was not to get the result, which is trivial, but to get it efficiently. Enabling psycopg (or even better
Postgresitself) to write the EOL straight away in C is a much more efficient way to get the task done. 

In my opinion it’d be better to push such a feature upstream to PG, but even having it in psycopg could be a good
compromise.Do you have any strong argument against such a feature in psycopg? Do you think it’d be better part of PG
itself?If so, how do you think I can gain support in the pgsql-hackers ml? 

Thanks,
a.

On 13 Oct 2014, at 15:45, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

> On 10/12/2014 02:28 PM, Andrea Riciputi wrote:
>> Hi all,
>> a couple of weeks ago at work we had to produce a quite big CSV data file which should be used as input by another
pieceof software. 
>>
>> Since the file must be produced on a daily basis, is big (let say half a TB), and it contains data stored in our PG
database,letting PG produce the file itself seemed the right approach. Thanks to psycopg the whole operation is
performedin C, resulting fast enough for our purpose. 
>>
>> However the target software for which the file is produced, is, let say, “legacy” software and can only accept CRLF
asEOL character. However by calling COPY TO STDOUT from psycopg ends up in a CSV file with LF as EOL forcing us to pass
thefile a second time to convert EOL, which is inconvenient. Plus, doing it in Python, make it a little bit to slow. 
>>
>> My first attempt was to ask the pgsql-hackers ML for extending the COPY TO syntax to allow a “FORCE_EOL” parameter,
butthey kindly rejected my proposal. They also suggested to me to use the result of PQgetCopyData() and convert there
theLF character with whatever is suitable for me. 
>>
>> So I studied the psycopg codebase and spotted out where and how to change it to allow such an use case. My intent
wasto add a new keyword argument to the copy_expert() method, let me call it “eol” with a default of “\n”. If the user
decidesto override it using a different EOL (i.e. “\r\n” or “\r”) every EOL returned by PQgetCopyData() in
_pq_copy_out_v3()can be converted. 
>>
>> However I’m a little bit concerned with this solution, and before going on with a pull request, I’d like to have
yourfeedback here. My main concern is that extending the copy_expert() method in psycopg leaves the user completely
aloneabout using this new keyword argument in the right way. 
>>
>> We can easily allow only CR, LF, and CRLF as the values for that argument, but what if the user uses the “eol” kwarg
andfor example issues a “COPY TO … AS BINARY” query? In that case the resulting output file can end up being corrupted
withoutthe user can even notice that. Of course psycopg can parse the “COPY TO” query (by means of PG’s
ProcessCopyOptions())and check if the “eol” kwarg is consistent with the issued query. But, frankly this is seems to
becomea little bit too complex  to me. 
>>
>> So I’m asking to you, what’s your take on this, what do you think about that? Do you see any better way to get it
done?Anyone here also involved in pgsql-hackers ML can support my idea to extend the COPY TO syntax directly in PG? 
>>
>> Thanks for you help, and apologies for the long email.
>
> Alright to follow up on my previous post about open. In Python 2 newline is available in the io module, so a simple
example:
>
> f = io.open('io_newline.csv', 'w',  newline='\r\n')
>
> cur = con.cursor()
>
> cur.copy_expert("COPY cell_per TO STDOUT WITH CSV HEADER", f)
>
> f.close()
>
> aklaver@panda:~/software_projects> file io_newline.csv
> io_newline.csv: ASCII text, with CRLF line terminators
>
>> a.
>>
>>
>>
>
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com



Re: Extending copy_expert

From
Daniele Varrazzo
Date:
On Mon, Oct 13, 2014 at 9:49 PM, Andrea Riciputi
<andrea.riciputi@gmail.com> wrote:
> In my opinion it’d be better to push such a feature upstream to PG, but even having it in psycopg could be a good
compromise.Do you have any strong argument against such a feature in psycopg? Do you think it’d be better part of PG
itself?If so, how do you think I can gain support in the pgsql-hackers ml? 

Psycopg doesn't do much here: it receives buffers via PQgetCopyData
[1] and forwards them to the Python file in argument. You could easily
hack the function and replace the \n into \r\n in C, but this would
still imply a copy so I'm not sure it would be more efficient than
running copy from psql to stdout and piping the result througn sed
's/$/\r/', leaving Python entirely out of the equation.

[1] http://www.postgresql.org/docs/9.3/static/libpq-copy.html#LIBPQ-PQGETCOPYDATA

-- Daniele


Re: Extending copy_expert

From
Adrian Klaver
Date:
On 10/13/2014 01:49 PM, Andrea Riciputi wrote:
> Hi all,
> thanks for your suggestions but they don’t fit the use-case at hand.
>
> Regarding using unix2dos, it is quite slow when the file become large and here we are talking of several hundreds of
GB.
>
> Using io.open() is a no way either, since the “newline” kwarg only works for “text” files. This means that all the
datacoming from PG (which are python bytes/str objects) must be converted to Python unicode objects and then back again
tobytes. This (useless) decoding/encoding dance nearly doubles execution times. 

Well I am only seeing about a 30% difference, but I am not working with
anything close to the same size dataset.

>
> My point was not to get the result, which is trivial, but to get it efficiently. Enabling psycopg (or even better
Postgresitself) to write the EOL straight away in C is a much more efficient way to get the task done. 
>
> In my opinion it’d be better to push such a feature upstream to PG, but even having it in psycopg could be a good
compromise.Do you have any strong argument against such a feature in psycopg? Do you think it’d be better part of PG
itself?If so, how do you think I can gain support in the pgsql-hackers ml? 

I have been on Postgres related lists for many years now and this is the
first time I have seen this issue brought up. This is probably why you
are not seeing a lot of enthusiasm about a patch. It is seen as a one-of
solution, not generally applicable. I would say your best bet would be
to present is as an extension, either a Psycopg extension/extra or a
Postgres contrib module/extension.

>
> Thanks,
> a.
>
> On 13 Oct 2014, at 15:45, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>



--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Extending copy_expert

From
Christophe Pettus
Date:
On Oct 13, 2014, at 1:49 PM, Andrea Riciputi <andrea.riciputi@gmail.com> wrote:

> My point was not to get the result, which is trivial, but to get it efficiently.

Given the size of the dataset and the use case, it might very well make sense to write your own small C utility that
interactswith PostgreSQL and does what you want.  Honestly, in the time it takes to negotiate the process of getting
thefeature you want into either PostgreSQL or psycopg2, you could have that utility working. :) 
--
-- Christophe Pettus
   xof@thebuild.com