Re: Extending copy_expert - Mailing list psycopg

From	Andrea Riciputi
Subject	Re: Extending copy_expert
Date	October 13, 2014 23:49:27
Msg-id	B8BAE8B1-8FB9-4623-AD4B-AB3035B1A377@gmail.com
In response to	Re: Extending copy_expert (Adrian Klaver <adrian.klaver@aklaver.com>)
Responses	Re: Extending copy_expert (Daniele Varrazzo <daniele.varrazzo@gmail.com>) Re: Extending copy_expert (Adrian Klaver <adrian.klaver@aklaver.com>) Re: Extending copy_expert (Christophe Pettus <xof@thebuild.com>)
List	psycopg

Hi all,
thanks for your suggestions but they don’t fit the use-case at hand.

As for unix2dos, it is quite slow when the file becomes large, and here we are talking about several hundred GB.

Using io.open() is not an option either, since the “newline” kwarg only works for “text” files. This means that all the
data coming from PG (which are Python bytes/str objects) must be converted to Python unicode objects and then back again
to bytes. This (useless) decoding/encoding dance nearly doubles execution times.
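
For illustration only, the byte-level conversion discussed in this thread can be sketched in Python as a write-only
wrapper that replaces LF on raw bytes, with no decode/encode step. The class name EolTranslatingWriter and its “eol”
argument are hypothetical and not part of psycopg; the sketch feeds the wrapper hand-written chunks in place of a real
COPY stream.

```python
import io


class EolTranslatingWriter:
    """Hypothetical write-only wrapper that rewrites LF line endings on raw bytes."""

    ALLOWED_EOLS = (b"\n", b"\r\n", b"\r")

    def __init__(self, raw, eol=b"\r\n"):
        # Only LF, CRLF and CR are accepted, as suggested in the thread.
        if eol not in self.ALLOWED_EOLS:
            raise ValueError("eol must be LF, CRLF or CR")
        self._raw = raw
        self._eol = eol

    def write(self, data):
        # The replacement works on bytes directly: nothing is decoded.
        if self._eol != b"\n":
            data = data.replace(b"\n", self._eol)
        return self._raw.write(data)


# Stand-in for the row chunks a COPY ... TO STDOUT would deliver.
buf = io.BytesIO()
out = EolTranslatingWriter(buf)
out.write(b"id,name\n")
out.write(b"1,foo\n")
result = buf.getvalue()
```

Assuming copy_expert() hands bytes to any file-like object that is not a text stream, such a wrapper around a file
opened in binary mode could be passed in place of the file. Two caveats apply: an LF embedded in a quoted CSV field is
rewritten as well, and the output of a binary COPY would be corrupted. Each row still goes through a Python-level call,
so this does not remove the overhead that doing the conversion in C would avoid.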

My point was not to get the result, which is trivial, but to get it efficiently. Enabling psycopg (or even better
Postgres itself) to write the EOL straight away in C is a much more efficient way to get the task done.

In my opinion it’d be better to push such a feature upstream to PG, but even having it in psycopg could be a good
compromise. Do you have any strong argument against such a feature in psycopg? Do you think it’d be better as part of
PG itself? If so, how do you think I can gain support on the pgsql-hackers ML?

Thanks,
a.

On 13 Oct 2014, at 15:45, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

> On 10/12/2014 02:28 PM, Andrea Riciputi wrote:
>> Hi all,
>> a couple of weeks ago at work we had to produce a quite big CSV data file to be used as input by another
>> piece of software.
>>
>> Since the file must be produced on a daily basis, is big (let’s say half a TB), and contains data stored in our PG
>> database, letting PG produce the file itself seemed the right approach. Thanks to psycopg the whole operation is
>> performed in C, making it fast enough for our purpose.
>>
>> However, the target software for which the file is produced is, let’s say, “legacy” software and can only accept
>> CRLF as the EOL sequence. Calling COPY TO STDOUT from psycopg produces a CSV file with LF as EOL, forcing us to pass
>> over the file a second time to convert the EOL, which is inconvenient. Plus, doing it in Python makes it a little
>> bit too slow.
>>
>> My first attempt was to ask the pgsql-hackers ML to extend the COPY TO syntax to allow a “FORCE_EOL” parameter,
>> but they kindly rejected my proposal. They also suggested that I take the result of PQgetCopyData() and replace the
>> LF character there with whatever is suitable for me.
>>
>> So I studied the psycopg codebase and spotted where and how to change it to allow such a use case. My intent
>> was to add a new keyword argument to the copy_expert() method, let me call it “eol”, with a default of “\n”. If the
>> user decides to override it using a different EOL (e.g. “\r\n” or “\r”), every EOL returned by PQgetCopyData() in
>> _pq_copy_out_v3() can be converted.
>>
>> However, I’m a little bit concerned about this solution, and before going on with a pull request, I’d like to have
>> your feedback here. My main concern is that extending the copy_expert() method in psycopg leaves the user completely
>> alone in using this new keyword argument the right way.
>>
>> We can easily allow only CR, LF, and CRLF as the values for that argument, but what if the user uses the “eol” kwarg
>> and, for example, issues a “COPY TO … AS BINARY” query? In that case the resulting output file can end up being
>> corrupted without the user even noticing. Of course psycopg can parse the “COPY TO” query (by means of PG’s
>> ProcessCopyOptions()) and check whether the “eol” kwarg is consistent with the issued query. But frankly, this seems
>> to become a little bit too complex to me.
>>
>> So I’m asking you: what’s your take on this? Do you see any better way to get it done? Can anyone here who is also
>> involved in the pgsql-hackers ML support my idea of extending the COPY TO syntax directly in PG?
>>
>> Thanks for your help, and apologies for the long email.
>
> Alright, to follow up on my previous post about open: in Python 2, newline is available in the io module, so a
> simple example:
>
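> import io
>
> # newline='\r\n' translates every '\n' written to the file into CRLF
> f = io.open('io_newline.csv', 'w', newline='\r\n')
>
> # con is an existing psycopg2 connection
> cur = con.cursor()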
>
> cur.copy_expert("COPY cell_per TO STDOUT WITH CSV HEADER", f)
>
> f.close()
>
> aklaver@panda:~/software_projects> file io_newline.csv
> io_newline.csv: ASCII text, with CRLF line terminators
>
>> a.
>>
>>
>>
>
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
