Thread: copy_from does not stop reading after an error

copy_from does not stop reading after an error

From
Nicolas Grilly
Date:
Hello,

It looks like copy_from does not stop reading after an error. When the input file is short, it is okay. But when the input file is very long, it is really boring to wait for the entire file to be read just to discover there is an error on the 10th row. 

Given the same input file, it looks like psql \copy command behaves correctly and stops just after the incorrect row, without reading the entire file. I have checked that just by looking at the command execution time that seems proportional to the number of processed rows...

Here is a script to reproduce this bug (just create a database "test" and run the script):

I don't know what to do with this bug. Thanks for your help and advice.

Regards,

Nicolas Grilly

Here is the same script inline:

import traceback
import psycopg2

def main():
    con = psycopg2.connect(database='test', user='postgres')
    cur = con.cursor()

    try:
        cur.execute("create temp table test (id integer primary key, data text)")

        rows = iter_rows()
        copy_file = CopyFile(rows)
        cur.copy_from(copy_file, 'test')

        cur.execute("select count(*) from test")
        print "{0} rows written to database".format(cur.fetchall()[0][0])

    finally:
        cur.close()
        con.close()

def iter_rows():
    random_data = 'x' * 100
    for i in range(0, 500):
        if i == 10:
            i = 'Bad key'
            print "Yield incorrect data row (copy_from should stop reading after that)"
        else:
            print "Yield correct data row"
        yield '{0}\t{1}\n'.format(i, random_data)

class CopyFile(object):

    def __init__(self, row_reader):
        self.next_row = row_reader.next
        self.buffer = ''

    def read(self, limit=-1):
        print "Read {0} bytes".format(limit)

        try:
            buffer = self.buffer

            while limit < 0 or len(buffer) < limit:
                try:
                    buffer += self.next_row()
                except StopIteration:
                    break

            if limit < 0:
                self.buffer = ''
                return buffer
            else:
                self.buffer = buffer[limit:]
                return buffer[:limit]

        except:
            # Report unexpected errors because psycopg2 does not report them
            traceback.print_exc()
            raise

    # Method readline is required by psycopg2 but actually never called
    def readline(self, limit=-1):
        raise NotImplementedError()

if __name__ == '__main__':
    main()

Re: copy_from does not stop reading after an error

From
Federico Di Gregorio
Date:
On 01/02/11 11:24, Nicolas Grilly wrote:
> It looks like copy_from does not stop reading after an error. When the
> input file is short, it is okay. But when the input file is very long,
> it is really boring to wait for the entire file to be read just to
> discover there is an error on the 10th row.

Thank you very much. I just used your email to create a bug report on
Lighthouse:

http://psycopg.lighthouseapp.com/projects/62710-psycopg/tickets/37-copy_from-does-not-stop-reading-after-an-error

federico

--
Federico Di Gregorio                         federico.digregorio@dndg.it
Studio Associato Di Nunzio e Di Gregorio                  http://dndg.it
  Gli avvoltoi cinesi si nutrono di arte, ma possono anche mangiare
   i `domani'.                                        -- Haruki Murakami

Re: copy_from does not stop reading after an error

From
Nicolas Grilly
Date:
Thank you Federico for your answer.

I have ran the same script with pg8000, and it does not stop reading after an error either... Maybe it is not a bug, and just a limitation of the PostgreSQL protocol? Maybe the copy from protocol is not designed to return errors in the middle of the data stream, and I just have to split my data stream into many chunks and call copy_from for each chunk?

On Tue, Feb 1, 2011 at 11:48, Federico Di Gregorio <federico.digregorio@dndg.it> wrote:
On 01/02/11 11:24, Nicolas Grilly wrote:
> It looks like copy_from does not stop reading after an error. When the
> input file is short, it is okay. But when the input file is very long,
> it is really boring to wait for the entire file to be read just to
> discover there is an error on the 10th row.

Thank you very much. I just used your email to create a bug report on
Lighthouse:

http://psycopg.lighthouseapp.com/projects/62710-psycopg/tickets/37-copy_from-does-not-stop-reading-after-an-error

federico

Re: copy_from does not stop reading after an error

From
Nicolas Grilly
Date:
I have tested again the \copy command of psql and, contrary to what I wrote before, it looks like psql does not stop reading after an error either, and has to read the complete file before reporting the error.

Conclusion: psycopg2, pg8000 and psql have the same behavior regarding the command "copy from stdin". The input data file is read entirely, even if there is incorrect data at the start of the file, and errors are reported only after having read the complete file.

Therefore it is probably not a bug in psycopg2, and just a "limitation" of PostgreSQL protocol. Here is the protocol official documentation:

I understand we have to "end" the copy before having a chance to retrieve PostgreSQL backend response and know if our data are correct, or not. Do you confirm this analysis?

It means copy_from is not designed to send a 10 gigabytes stream to PostgreSQL, with just one "copy from stdin" command. Maybe I have to split my input stream into smaller chunks and execute a "copy from stdin" command for each of them. Do you confirm this is the only (and adequate) solution?

Thanks for you help and advice.

PS: I've copied that email to lighthouse for future reference.


On Tue, Feb 1, 2011 at 12:34, Nicolas Grilly <nicolas@gardentechno.com> wrote:
Thank you Federico for your answer.

I have ran the same script with pg8000, and it does not stop reading after an error either... Maybe it is not a bug, and just a limitation of the PostgreSQL protocol? Maybe the copy from protocol is not designed to return errors in the middle of the data stream, and I just have to split my data stream into many chunks and call copy_from for each chunk?

On Tue, Feb 1, 2011 at 11:48, Federico Di Gregorio <federico.digregorio@dndg.it> wrote:
On 01/02/11 11:24, Nicolas Grilly wrote:
> It looks like copy_from does not stop reading after an error. When the
> input file is short, it is okay. But when the input file is very long,
> it is really boring to wait for the entire file to be read just to
> discover there is an error on the 10th row.

Thank you very much. I just used your email to create a bug report on
Lighthouse:

http://psycopg.lighthouseapp.com/projects/62710-psycopg/tickets/37-copy_from-does-not-stop-reading-after-an-error

federico


Re: copy_from does not stop reading after an error

From
Marko Kreen
Date:
On Tue, Feb 1, 2011 at 2:57 PM, Nicolas Grilly <nicolas@gardentechno.com> wrote:
> I have tested again the \copy command of psql and, contrary to what I wrote
> before, it looks like psql does not stop reading after an error either, and
> has to read the complete file before reporting the error.
> Conclusion: psycopg2, pg8000 and psql have the same behavior regarding the
> command "copy from stdin". The input data file is read entirely, even if
> there is incorrect data at the start of the file, and errors are reported
> only after having read the complete file.
> Therefore it is probably not a bug in psycopg2, and just a "limitation" of
> PostgreSQL protocol. Here is the protocol official documentation:
> http://www.postgresql.org/docs/9.0/static/protocol-flow.html#PROTOCOL-COPY
> I understand we have to "end" the copy before having a chance to retrieve
> PostgreSQL backend response and know if our data are correct, or not. Do you
> confirm this analysis?

No, the error message should arrive immediately.  But it may be deficiency of
libpq that you cannot aquire it before ending the copy.

I'm not sure about that actually.  It should be possible to call
select() & PQconsumeInput
between copy calls, thus also PQgetResult to get the error.

> It means copy_from is not designed to send a 10 gigabytes stream to
> PostgreSQL, with just one "copy from stdin" command. Maybe I have to split
> my input stream into smaller chunks and execute a "copy from stdin" command
> for each of them. Do you confirm this is the only (and adequate) solution?

It should work around the problem, with speed decrease.

--
marko

Re: copy_from does not stop reading after an error

From
Nicolas Grilly
Date:
So, considering the test case I attached in my previous message, do you think it should be possible to retrieve the error message (Invalid syntax for integer "Bad key", CONTEXT: COPY test, line 10, column id: "Bad key") without waiting for complete transmission?

Do you think it is possible to modify psycopg2 to implement what you suggested? I guess the related code is in file pqpath.c, function _pq_copy_in_v3, correct?

On Tue, Feb 1, 2011 at 14:18, Marko Kreen <markokr@gmail.com> wrote:
No, the error message should arrive immediately.  But it may be deficiency of
libpq that you cannot aquire it before ending the copy.

I'm not sure about that actually.  It should be possible to call
select() & PQconsumeInput
between copy calls, thus also PQgetResult to get the error.
--
marko

Re: copy_from does not stop reading after an error

From
Nicolas Grilly
Date:
On Tue, Feb 1, 2011 at 14:18, Marko Kreen <markokr@gmail.com> wrote:
No, the error message should arrive immediately.  But it may be deficiency of
libpq that you cannot aquire it before ending the copy.

I have analyzed the PostgreSQL protocol using Wireshark (an open source packet analyzer), and I confirm that the PostgreSQL backend, while doing a COPY ... FROM STDIN, reports errors as soon as possible (especially errors related to invalid data).

Therefore, the "late" reporting of errors while doing a COPY ... FROM STDIN is not a limitation of the underlying protocol; it is a limitation (or a design choice) of the libpq library.

It looks like this is a well known issue because it is listed on the todo list:

And was discussed before:
 
I'm not sure about that actually.  It should be possible to call
select() & PQconsumeInput
between copy calls, thus also PQgetResult to get the error.

I have tried that, and many other things, and I have carefully read libpq source code, and my understanding is that it is impossible with the current implementation of libpq.

It is impossible because the function pqParseInput3 (file fe-protocol3.c) does not parse error responses while the connection is in PGASYNC_COPY_IN state. We have to call PQputCopyEnd to switch the connection to PGASYNC_BUSY and start error messages parsing.

Can Marko or someone else confirm this analysis? Any idea to improve that?

Regards,

Nicolas Grilly

PS : Considering that both psycopg2 and psql are built on top of libpq, this a quite logical they behave the same regarding late reporting of copy errors, contrary to pg8000 that could easily read, parse and report the errors early.

Re: copy_from does not stop reading after an error

From
Marko Kreen
Date:
On Mon, Feb 7, 2011 at 5:06 PM, Nicolas Grilly <nicolas@gardentechno.com> wrote:
> On Tue, Feb 1, 2011 at 14:18, Marko Kreen <markokr@gmail.com> wrote:
>>
>> No, the error message should arrive immediately.  But it may be deficiency
>> of
>> libpq that you cannot aquire it before ending the copy.
>
> I have analyzed the PostgreSQL protocol using Wireshark (an open source
> packet analyzer), and I confirm that the PostgreSQL backend, while doing a
> COPY ... FROM STDIN, reports errors as soon as possible (especially errors
> related to invalid data).
> Therefore, the "late" reporting of errors while doing a COPY ... FROM STDIN
> is not a limitation of the underlying protocol; it is a limitation (or a
> design choice) of the libpq library.
> It looks like this is a well known issue because it is listed on the todo
> list:
> http://wiki.postgresql.org/wiki/Todo#COPY
> And was discussed before:
> http://archives.postgresql.org/pgsql-hackers/2008-04/msg01169.php
>
>>
>> I'm not sure about that actually.  It should be possible to call
>> select() & PQconsumeInput
>> between copy calls, thus also PQgetResult to get the error.
>
> I have tried that, and many other things, and I have carefully read libpq
> source code, and my understanding is that it is impossible with the current
> implementation of libpq.
> It is impossible because the function pqParseInput3 (file fe-protocol3.c)
> does not parse error responses while the connection is in PGASYNC_COPY_IN
> state. We have to call PQputCopyEnd to switch the connection to PGASYNC_BUSY
> and start error messages parsing.
> Can Marko or someone else confirm this analysis? Any idea to improve that?
> Regards,
> Nicolas Grilly
> PS : Considering that both psycopg2 and psql are built on top of libpq, this
> a quite logical they behave the same regarding late reporting of copy
> errors, contrary to pg8000 that could easily read, parse and report the
> errors early.

Yeah, seems your analysis is right and it's libpq (design?) bug,
so no way to work around it.

--
marko

Re: copy_from does not stop reading after an error

From
Nicolas Grilly
Date:
On Tue, Feb 8, 2011 at 13:30, Marko Kreen <markokr@gmail.com> wrote:
Yeah, seems your analysis is right and it's libpq (design?) bug,
so no way to work around it.

Nice to know that my analysis is right... but eventually it is a bad news because the only way to fix the issue is to split input data into many chunks or to directly modify libpq... By the way, I have posted my analysis to pgsql-hackers and asked what the list thinks about modify pqParseInput to handle this case. But I have never posted to this list before so I don't know what is the good way to ask for feedback.

Nicolas Grilly

Re: copy_from does not stop reading after an error

From
Marko Kreen
Date:
On Tue, Feb 8, 2011 at 2:38 PM, Nicolas Grilly <nicolas@gardentechno.com> wrote:
> On Tue, Feb 8, 2011 at 13:30, Marko Kreen <markokr@gmail.com> wrote:
>>
>> Yeah, seems your analysis is right and it's libpq (design?) bug,
>> so no way to work around it.
>
> Nice to know that my analysis is right... but eventually it is a bad news
> because the only way to fix the issue is to split input data into many
> chunks or to directly modify libpq... By the way, I have posted my analysis
> to pgsql-hackers and asked what the list thinks about modify pqParseInput to
> handle this case. But I have never posted to this list before so I don't
> know what is the good way to ask for feedback.

It's the high point of the of last flame^Wcommitfest for 9.1,
so it may take some time.

--
marko

Re: copy_from does not stop reading after an error

From
Karsten Hilbert
Date:
On Tue, Feb 08, 2011 at 01:38:20PM +0100, Nicolas Grilly wrote:

> Nice to know that my analysis is right... but eventually it is a bad news
> because the only way to fix the issue is to split input data into many
> chunks

Maybe there can be a parameter to the Python level .copy()
telling psycopg2 which way to do this. The default would be
the standard behaviour as it is now.

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: copy_from does not stop reading after an error

From
Nicolas Grilly
Date:
On Tue, Feb 8, 2011 at 13:52, Marko Kreen <markokr@gmail.com> wrote:
It's the high point of the of last flame^Wcommitfest for 9.1,
so it may take some time.

Understood. Thanks for this information.