Thread: BUG #8173: Inserting heap tuples in bulk in COPY patch returns wrong line on failure 999 out of 1000 times.

The following bug has been logged on the website:

Bug reference:      8173
Logged by:          Lloyd Albin
Email address:      lalbin@fhcrc.org
PostgreSQL version: 9.2.4
Operating system:   SUSE Linux (64-bit)
Description:

During testing for our 9.2 upgrade, we found that the error messages we were
expecting did not match what the server reported. Going through the release
notes from our current version, 9.0.12, up to the 9.2.4 we are testing, I
believe I have tracked the issue down to "Improve COPY performance by adding
tuples to the heap in batches". Looking at the patch code in
http://www.postgresql.org/message-id/4E708759.40206@enterprisedb.com I found
that rows are inserted 1000 at a time. The problem is that on failure, the
error reports either line 1000 of the batch or the last line read, whichever
comes first. This can be confusing, as you will see in the demo code below.

CREATE TABLE public.table1 (
  key INTEGER,
  PRIMARY KEY(key)
);

Create a csv file with only one column of data, numbered from 1 to 1008.

Make two copies of the file and name them csv_test.csv and csv_test2.csv.

Edit csv_test.csv and change the entry 1000 to 500.

Edit csv_test2.csv and change the entry 900 to 500.
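The file setup above can be scripted; a minimal sketch, assuming GNU seq and sed (the filenames and the edited lines are taken from the steps above):

```shell
# Generate one integer per line, 1 through 1008
seq 1 1008 > csv_test.csv
cp csv_test.csv csv_test2.csv

# Introduce the duplicate keys: in csv_test.csv line 1000 becomes 500,
# in csv_test2.csv line 900 becomes 500
sed -i '1000s/.*/500/' csv_test.csv
sed -i '900s/.*/500/' csv_test2.csv
```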

On 9.0.12 Server

COPY public.table1 FROM 'csv_test.csv';

ERROR:  duplicate key value violates unique constraint "table1_pkey"
DETAIL:  Key (key)=(500) already exists.
CONTEXT:  COPY table1, line 1000: "500"

COPY public.table1 FROM 'csv_test2.csv';

ERROR:  duplicate key value violates unique constraint "table1_pkey"
DETAIL:  Key (key)=(500) already exists.
CONTEXT:  COPY table1, line 900: "500"

Both times the context gave us the correct information.

Now try the same thing on 9.2.4 Server

COPY public.table1 FROM 'csv_test.csv';

ERROR:  duplicate key value violates unique constraint "table1_pkey"
DETAIL:  Key (key)=(500) already exists.
CONTEXT:  COPY table1, line 1000: "500"

COPY public.table1 FROM 'csv_test2.csv';

ERROR:  duplicate key value violates unique constraint "table1_pkey"
DETAIL:  Key (key)=(500) already exists.
CONTEXT:  COPY table1, line 1000: "1000"

As you can see, the second test reported the last line of the batch of tuples
being inserted, not the line that actually failed.

Make a copy of csv_test2.csv and name it csv_test3.csv.
Edit csv_test3.csv and remove all entries after 994.
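The third file can be produced the same way; this sketch regenerates csv_test2.csv so it runs standalone:

```shell
# (Re)create csv_test2.csv as above: 1..1008 with line 900 changed to 500
seq 1 1008 > csv_test2.csv
sed -i '900s/.*/500/' csv_test2.csv

# csv_test3.csv: same content, truncated to the first 994 lines
head -n 994 csv_test2.csv > csv_test3.csv
```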

COPY public.table1 FROM 'csv_test3.csv';

ERROR:  duplicate key value violates unique constraint "table1_pkey"
DETAIL:  Key (key)=(500) already exists.
CONTEXT:  COPY table1, line 995: ""

If the final batch contains fewer than 1000 lines, the error reports the line
after the last line read, with an empty value of "".

Lloyd Albin
Statistical Center for HIV/AIDS Research and Prevention (SCHARP)
Vaccine and Infectious Disease Division (VIDD)
Fred Hutchinson Cancer Research Center (FHCRC)
On 21.05.2013 18:20, lalbin@fhcrc.org wrote:
> If you are writing less than 1000 lines then it will return the line after
> the last line with a value of "".

Hmm, yeah, it's quite self-evident what's happening; the server reports
the last line that was *read*, not the line where the error happened.

Committed a fix for this. Unfortunately we only keep the last line read
buffered in text format, so after this you'll only get the line number,
not the content of that line:

postgres=# copy foo from '/tmp/foo';
ERROR:  duplicate key value violates unique constraint "foo_pkey"
DETAIL:  Key (id)=(4500) already exists.
CONTEXT:  COPY foo, line 4500
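The bookkeeping implied by the fix can be illustrated with a schematic Python sketch (this is not the PostgreSQL C code; the names and the duplicate-key check are stand-ins): each buffered tuple carries its own source line number, so an error raised while flushing the batch can name the line that actually failed rather than the last line read.

```python
BATCH_SIZE = 1000  # tuples buffered before a batch insert

def copy_from(values):
    """Simulate batched COPY of integer keys with a unique constraint."""
    seen = set()       # stand-in for the unique index
    buffered = []      # (source line number, key) pairs
    for lineno, key in enumerate(values, start=1):
        buffered.append((lineno, key))
        if len(buffered) == BATCH_SIZE:
            flush(buffered, seen)
            buffered.clear()
    flush(buffered, seen)  # final, possibly partial, batch

def flush(buffered, seen):
    for lineno, key in buffered:
        if key in seen:
            # Because the tuple's own line number was saved when it was
            # read, the error context is correct even though the reader
            # is already many lines ahead.
            raise ValueError(
                f'duplicate key; CONTEXT: COPY table1, line {lineno}')
        seen.add(key)
```

Running this on the csv_test2.csv data (1..1008 with line 900 changed to 500) raises an error naming line 900, matching the fixed behavior.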

Thanks for the report!

- Heikki