Thread: Extending copy_expert
Hi all,

a couple of weeks ago at work we had to produce quite a big CSV data file to be used as input by another piece of software. Since the file must be produced on a daily basis, is big (let's say half a TB), and contains data stored in our PG database, letting PG produce the file itself seemed the right approach. Thanks to psycopg the whole operation is performed in C, which is fast enough for our purposes.

However, the target software the file is produced for is, let's say, "legacy" software and can only accept CRLF as the EOL character, while calling COPY TO STDOUT from psycopg ends up producing a CSV file with LF as the EOL. This forces us to pass over the file a second time to convert the EOLs, which is inconvenient; plus, doing it in Python makes it a little bit too slow.

My first attempt was to ask the pgsql-hackers ML to extend the COPY TO syntax with a "FORCE_EOL" parameter, but they kindly rejected my proposal. They suggested instead that I take the result of PQgetCopyData() and convert the LF characters there into whatever suits me.

So I studied the psycopg codebase and spotted where and how to change it to allow such a use case. My intent was to add a new keyword argument to the copy_expert() method, let me call it "eol", with a default of "\n". If the user overrides it with a different EOL (i.e. "\r\n" or "\r"), every EOL returned by PQgetCopyData() in _pq_copy_out_v3() can be converted.

However, I'm a little concerned about this solution, and before going on with a pull request I'd like to have your feedback here. My main concern is that extending the copy_expert() method in psycopg leaves the user completely alone in using this new keyword argument correctly. We can easily allow only CR, LF, and CRLF as values for the argument, but what if the user passes the "eol" kwarg and, for example, issues a "COPY TO … WITH BINARY" query? In that case the resulting output file can end up corrupted without the user even noticing.
Of course psycopg could parse the "COPY TO" query (by means of PG's ProcessCopyOptions()) and check that the "eol" kwarg is consistent with the issued query, but frankly this seems to become a little too complex to me.

So I'm asking you: what's your take on this, and do you see a better way to get it done? Could anyone here who is also involved in the pgsql-hackers ML support my idea to extend the COPY TO syntax directly in PG?

Thanks for your help, and apologies for the long email.

a.
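[Editor's note: the conversion the proposed "eol" kwarg would perform can be approximated today at the Python level by wrapping the target file in a small writer that translates line endings on the raw bytes, avoiding any decode/encode round-trip. A minimal sketch follows; the wrapper class and the table/file names are made up for illustration. Note that a naive byte replacement like this would also rewrite any LF embedded inside a quoted CSV field, and would corrupt binary COPY output outright, which is exactly the consistency concern raised in the email.]

```python
class CRLFWriter(object):
    """File-like wrapper that rewrites LF to a chosen EOL on raw bytes.

    Sketch only: assumes textual CSV COPY output. It would corrupt binary
    COPY data and any LF embedded inside quoted CSV fields.
    """

    def __init__(self, f, eol=b'\r\n'):
        self.f = f
        self.eol = eol

    def write(self, data):
        # copy_expert() hands each COPY buffer to write() as bytes;
        # translate every LF before it reaches the underlying file.
        self.f.write(data.replace(b'\n', self.eol))

# Hypothetical usage with psycopg2 (table and file names invented):
# with open('dump.csv', 'wb') as f:
#     cur.copy_expert("COPY big_table TO STDOUT WITH CSV HEADER",
#                     CRLFWriter(f))
```

This keeps everything in bytes, so it sidesteps the unicode round-trip that a text-mode file would impose, though the replace() still costs one extra pass over each buffer in Python rather than in C.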
On Sun, 12 Oct 2014, Andrea Riciputi wrote:

> However, the target software the file is produced for is, let's say,
> "legacy" software and can only accept CRLF as the EOL character, while
> calling COPY TO STDOUT from psycopg ends up producing a CSV file with
> LF as the EOL. This forces us to pass over the file a second time to
> convert the EOLs, which is inconvenient; plus, doing it in Python makes
> it a little bit too slow.

Andrea,

How about using unix2dos or todos? These Linux utilities translate line endings. When I get client data in text form with CR/LF line endings, I translate them to LF (\n) with fromdos (or the older flavor called dos2unix). It's quick, but I've not tried them on a 500G file.

HTH,

Rich
On 10/12/2014 02:28 PM, Andrea Riciputi wrote:
> Hi all,
> a couple of weeks ago at work we had to produce quite a big CSV data file to be used as input by another piece of software.
> [...]
> However, the target software the file is produced for is, let's say, "legacy" software and can only accept CRLF as the EOL character, while calling COPY TO STDOUT from psycopg ends up producing a CSV file with LF as the EOL. This forces us to pass over the file a second time to convert the EOLs, which is inconvenient; plus, doing it in Python makes it a little bit too slow.
> [...]

My suspicion is this is something that can be done within the context of open(). I know in Python 3 there is a newline argument to open() that allows you to specify the line ending. In Python 2 it would involve playing around with the universal newlines mechanism. I have not had time to investigate any further, but I thought I would throw it out there.

> a.

--
Adrian Klaver
adrian.klaver@aklaver.com
On 10/12/2014 02:28 PM, Andrea Riciputi wrote:
> Hi all,
> [...]
> However, the target software the file is produced for is, let's say, "legacy" software and can only accept CRLF as the EOL character, while calling COPY TO STDOUT from psycopg ends up producing a CSV file with LF as the EOL.
> [...]

Alright, to follow up on my previous post about open(): in Python 2 newline is available in the io module, so a simple example:

    f = io.open('io_newline.csv', 'w', newline='\r\n')

    cur = con.cursor()

    cur.copy_expert("COPY cell_per TO STDOUT WITH CSV HEADER", f)

    f.close()

    aklaver@panda:~/software_projects> file io_newline.csv
    io_newline.csv: ASCII text, with CRLF line terminators

> a.

--
Adrian Klaver
adrian.klaver@aklaver.com
Hi all,

thanks for your suggestions, but they don't fit the use case at hand.

Regarding unix2dos: it is quite slow when the file becomes large, and here we are talking of several hundred GB.

Using io.open() is a non-starter as well, since the "newline" kwarg only works for text files. This means all the data coming from PG (which are Python bytes/str objects) must be converted to Python unicode objects and then back again to bytes. This (useless) decoding/encoding dance nearly doubles execution time.

My point was not to get the result, which is trivial, but to get it efficiently. Enabling psycopg (or, even better, Postgres itself) to write the EOL straight away in C is a much more efficient way to get the task done.

In my opinion it'd be better to push such a feature upstream to PG, but even having it in psycopg could be a good compromise. Do you have any strong argument against such a feature in psycopg? Do you think it'd be better as part of PG itself? If so, how do you think I can gain support on the pgsql-hackers ML?

Thanks,
a.

On 13 Oct 2014, at 15:45, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> Alright, to follow up on my previous post about open(): in Python 2 newline is available in the io module, so a simple example:
>
>     f = io.open('io_newline.csv', 'w', newline='\r\n')
>
>     cur = con.cursor()
>
>     cur.copy_expert("COPY cell_per TO STDOUT WITH CSV HEADER", f)
>
>     f.close()
>
>     aklaver@panda:~/software_projects> file io_newline.csv
>     io_newline.csv: ASCII text, with CRLF line terminators
>
> [...]
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
On Mon, Oct 13, 2014 at 9:49 PM, Andrea Riciputi <andrea.riciputi@gmail.com> wrote:
> In my opinion it'd be better to push such a feature upstream to PG, but even having it in psycopg could be a good compromise. Do you have any strong argument against such a feature in psycopg? Do you think it'd be better as part of PG itself? If so, how do you think I can gain support on the pgsql-hackers ML?

Psycopg doesn't do much here: it receives buffers via PQgetCopyData [1] and forwards them to the Python file passed as argument. You could easily hack the function and replace the \n with \r\n in C, but this would still imply a copy, so I'm not sure it would be more efficient than running the copy from psql to stdout and piping the result through sed 's/$/\r/', leaving Python entirely out of the equation.

[1] http://www.postgresql.org/docs/9.3/static/libpq-copy.html#LIBPQ-PQGETCOPYDATA

--
Daniele
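[Editor's note: a sketch of the pipeline Daniele describes, with made-up database, table, and output names. The sed command appends a CR to the end of every line as the data streams out of psql, so no second pass over the file is needed. Caveat: GNU sed understands the \r escape in the replacement; BSD sed generally does not, and needs a literal carriage return instead.]

```shell
# Stream COPY output straight through sed, turning LF into CRLF on the fly.
# Database, table, and file names here are invented for illustration.
psql -d mydb -c "COPY big_table TO STDOUT WITH CSV HEADER" \
    | sed 's/$/\r/' > dump.csv
```

Since both psql and sed process the stream in constant memory, this approach scales to arbitrarily large dumps, at the cost of the conversion living outside the Python application.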
On 10/13/2014 01:49 PM, Andrea Riciputi wrote:
> Hi all,
> thanks for your suggestions, but they don't fit the use case at hand.
> [...]
> Using io.open() is a non-starter as well, since the "newline" kwarg only works for text files. This means all the data coming from PG (which are Python bytes/str objects) must be converted to Python unicode objects and then back again to bytes. This (useless) decoding/encoding dance nearly doubles execution time.

Well, I am only seeing about a 30% difference, but I am not working with anything close to the same size dataset.

> My point was not to get the result, which is trivial, but to get it efficiently. Enabling psycopg (or, even better, Postgres itself) to write the EOL straight away in C is a much more efficient way to get the task done.
>
> In my opinion it'd be better to push such a feature upstream to PG, but even having it in psycopg could be a good compromise. Do you have any strong argument against such a feature in psycopg? Do you think it'd be better as part of PG itself? If so, how do you think I can gain support on the pgsql-hackers ML?

I have been on Postgres-related lists for many years now, and this is the first time I have seen this issue brought up. That is probably why you are not seeing a lot of enthusiasm for a patch: it is seen as a one-off solution, not generally applicable. I would say your best bet would be to present it as an extension, either a psycopg extension/extra or a Postgres contrib module/extension.

> Thanks,
> a.
>
> On 13 Oct 2014, at 15:45, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

--
Adrian Klaver
adrian.klaver@aklaver.com
On Oct 13, 2014, at 1:49 PM, Andrea Riciputi <andrea.riciputi@gmail.com> wrote:
> My point was not to get the result, which is trivial, but to get it efficiently.

Given the size of the dataset and the use case, it might very well make sense to write your own small C utility that interacts with PostgreSQL and does what you want. Honestly, in the time it takes to negotiate the process of getting the feature you want into either PostgreSQL or psycopg2, you could have that utility working. :)

--
-- Christophe Pettus
xof@thebuild.com