Thread: Why "copy ... from stdio" does not return immediately when reading invalid data?

Why "copy ... from stdio" does not return immediately when reading invalid data?

From

Nicolas Grilly

Date:

02 February 2011, 14:20:26

Hello,

I am importing gigabytes of data into PostgreSQL, and I don't want to wait 10 minutes just to discover an error in the 10th line of my input file.

I tried the command "\copy ... from stdio" in psql and it looks like psql has to read the entire input before returning a potential error, even if the invalid value is in one of the first rows.

Is it a limitation of PostgreSQL protocol, of the library lipq, or of the tool psql?

Is the copy protocol (aka PQputCopyData and PQputCopyEnd) designed to send gigabytes of data with just one "copy ... from stdio" query, and is there a way to be notified of a potential error before calling PQputCopyEnd? Or do I have to send my data in small chunks (for example batch of 10000 rows), issue a PQputCopyEnd, check for errors, and continue with the next chunk?

Thanks for your help and advice.

Regards,

Nicolas Grilly

Re: Why "copy ... from stdio" does not return immediately when reading invalid data?

From

John R Pierce

Date:

02 February 2011, 15:06:19

On 02/02/11 10:20 AM, Nicolas Grilly wrote:
> Is the copy protocol (aka PQputCopyData and PQputCopyEnd) designed to
> send gigabytes of data with just one "copy ... from stdio" query, and
> is there a way to be notified of a potential error before calling
> PQputCopyEnd? Or do I have to send my data in small chunks (for
> example batch of 10000 rows), issue a PQputCopyEnd, check for errors,
> and continue with the next chunk?

I would batch the data, maybe 1000 lines or even 100 lines at a time if
these errors are at all frequent.  put the errored batches in an
exception list or something so you can sort them out later.

Re: Why "copy ... from stdio" does not return immediately when reading invalid data?

From

Nicolas Grilly

Date:

07 February 2011, 10:55:35

I have analyzed the PostgreSQL protocol using Wireshark (an open source packet analyzer), and I observed that the PostgreSQL backend, while doing a COPY ... FROM STDIN, reports errors as soon as possible (especially errors related to invalid data).

Therefore, the "late" reporting of errors while doing a COPY ... FROM STDIN is not a limitation of the underlying protocol; it is a limitation (or a design choice) of the libpq library.

It looks like this is a well known issue because it is listed on the todo list:

http://wiki.postgresql.org/wiki/Todo#COPY

And was discussed before:

http://archives.postgresql.org/pgsql-hackers/2008-04/msg01169.php

Do you think it is possible to change that behavior, or work around it?

While reading libpq source code, I noticed the function pqParseInput3 (file fe-protocol3.c) ignores error responses while the connection is in PGASYNC_COPY_IN state. Maybe we can make a special case for the COPY FROM subprotocol and handle errors early, in order to make them available to PQgetResult? Is is feasible in a simple way or is it a bad idea?

Regards,

Nicolas Grilly

On Wed, Feb 2, 2011 at 20:06, John R Pierce <pierce@hogranch.com> wrote:

On 02/02/11 10:20 AM, Nicolas Grilly wrote:
Is the copy protocol (aka PQputCopyData and PQputCopyEnd) designed to send gigabytes of data with just one "copy ... from stdio" query, and is there a way to be notified of a potential error before calling PQputCopyEnd? Or do I have to send my data in small chunks (for example batch of 10000 rows), issue a PQputCopyEnd, check for errors, and continue with the next chunk?

I would batch the data, maybe 1000 lines or even 100 lines at a time if these errors are at all frequent. put the errored batches in an exception list or something so you can sort them out later.