Thread: Re: Bug in copy
On Feb 8, 2025 at 13:28 +0800, me nefcanto <sn.1361@gmail.com>, wrote:
Hello
I run this command:
copy "Parts" ("Id","Title") from stdin with (format csv, delimiter ",", on_error ignore)
But I receive this error:
duplicate key value violates unique constraint "PartsUniqueLocaleTitle"
This means that the on_error setting is not working. When I try to insert a million records, this becomes extremely annoying and counterproductive.
When we set on_error to ignore, any type of error, including data type inconsistencies, check constraint violations, foreign key violations, etc., should be ignored, and Postgres should move on to the next record rather than fail the entire bulk operation.
Regards
Saeed Nemati
Hi,
As I understand it, on_error is designed to handle errors during data type conversions in PostgreSQL, similar to what we do in Greenplum or Cloudberry. Since these rows are valid, on_error doesn't raise any concerns.
--
Zhang Mingli
HashData
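A small illustration of that distinction on PostgreSQL 17 or later, with a hypothetical table and invented data: on_error ignore skips rows whose values cannot be converted to the column types, while a constraint violation still aborts the whole COPY.

    CREATE TABLE parts_demo (id int PRIMARY KEY, title text);

    -- 'abc' cannot be converted to int, so with on_error ignore this row is
    -- skipped and the COPY continues (PostgreSQL 17+).
    COPY parts_demo (id, title) FROM STDIN WITH (FORMAT csv, ON_ERROR ignore);
    abc,bad type
    1,first row
    \.

    -- A unique-constraint violation is not a conversion error, so this COPY
    -- still aborts, exactly as reported above:
    COPY parts_demo (id, title) FROM STDIN WITH (FORMAT csv, ON_ERROR ignore);
    1,duplicate id
    \.
    -- ERROR:  duplicate key value violates unique constraint "parts_demo_pkey"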
Hi, thank you for the response. Semantically it should have been called on_type_error or something similar, but what matters is the problem at hand. Inserting a million records without all-or-nothing failure is a requirement. What options do we have for that?
@laurenz if I use `insert into` or `merge`, would I be able to bypass records with errors, or would I fail there too? I mean, there are lots of ways a record can be rejected: unique indexes, check constraints, foreign key constraints, etc. What happens in those cases?
And why not fix "on_error ignore" in the first place? Maybe that would be the simpler way. I don't know the internals of bulk insertion, but if at some point it has a loop in it, then it should be much simpler to catch errors in that loop.
Regards
Saeed
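For what it's worth, the closest thing to that per-row loop on the SQL level today is a PL/pgSQL block that inserts row by row and traps any error. A rough sketch, assuming a hypothetical staging table parts_staging that holds the raw rows; each trapped error rolls back an internal subtransaction, so this is far slower than a plain COPY:

    DO $$
    DECLARE
        r record;
        skipped bigint := 0;
    BEGIN
        FOR r IN SELECT "Id", "Title" FROM parts_staging LOOP
            BEGIN
                INSERT INTO "Parts" ("Id", "Title") VALUES (r."Id", r."Title");
            EXCEPTION WHEN OTHERS THEN
                -- any error (unique, check, foreign key, ...) skips this row only
                skipped := skipped + 1;
                RAISE NOTICE 'skipped Id %: %', r."Id", SQLERRM;
            END;
        END LOOP;
        RAISE NOTICE '% row(s) skipped', skipped;
    END;
    $$;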
On Sun, Feb 9, 2025 at 9:32 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Sat, 2025-02-08 at 09:31 +0330, me nefcanto wrote:
> Inserting a million records without all-or-nothing failure is a requirement. What options do we have for that?
Use COPY to load the data into a new (temporary?) table.
Then use INSERT INTO ... SELECT ... ON CONFLICT ... or MERGE to merge
the data from that table to the actual destination.
COPY is not a full-fledged ETL tool.
Yours,
Laurenz Albe
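A rough sketch of that approach, with hypothetical staging-table and column names; note that ON CONFLICT DO NOTHING only covers unique and exclusion constraints, other violations would still abort the INSERT:

    -- 1. Load everything into a constraint-free staging table first.
    CREATE TEMP TABLE parts_staging ("Id" bigint, "Title" text);

    COPY parts_staging ("Id", "Title") FROM STDIN WITH (FORMAT csv);
    1,First part
    2,Second part
    \.

    -- 2. Move the rows into the real table; rows that would violate a unique
    --    constraint are silently dropped.
    INSERT INTO "Parts" ("Id", "Title")
    SELECT "Id", "Title"
    FROM parts_staging
    ON CONFLICT DO NOTHING;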
@David, I saw that pg_bulkload. Amazing performance. But that's a command line tool. I need to insert bulk data in my Node.js app, via code.
On Sunday, February 9, 2025, me nefcanto <sn.1361@gmail.com> wrote:
> @David, I saw that pg_bulkload. Amazing performance. But that's a command line tool. I need to insert bulk data in my Node.js app, via code.
That seems like an arbitrary limitation to impose on yourself. Given that, there is no solution that will satisfy you that I am aware of.
David J.
Sorry for the long delay.
Let's analyze the scenario of fake data insertion. I want to create a million fake products, sometimes even 100 million (we're on MariaDB now and we plan to migrate to Postgres). My team uses fake data for performance tests and other use cases. There could be literally no way to sanitize those records.
Another scenario is translations. Even in production we have translation files for more than 20 languages, and for more than 2 thousand keys. That means we need to insert 40 thousand translation records into production.
Another scenario is updating nested model values for a large hierarchical table, for example the categories table. Anytime the user changes a record in that table, we need to recalculate the nested model for the entire categories table and bulk-update the results.
The point is, the database schema is not in our hands. We don't know what rules exist on each table or which rules change, and it's not practical or feasible to spend resources keeping our bulk-insertion logic in sync with database changes.
It would be good design for Postgres to add a catch-all handler for each row and report accordingly. Give it 1 million records, and it should give you back 1 million results.
Is there a problem with implementing this? After all, one expects the most advanced open-source database to support this real-world requirement.
Regards
Saeed
On Sun, Feb 9, 2025 at 8:09 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Sun, 2025-02-09 at 16:00 +0330, me nefcanto wrote:
> @laurenz if I use `insert into` or `merge`, would I be able to bypass records
> with errors, or would I fail there too? I mean, there are lots of ways a record
> can be rejected: unique indexes, check constraints, foreign key constraints, etc.
> What happens in those cases?
With INSERT ... ON CONFLICT, you can only handle primary key and unique key violations.
MERGE allows some more freedom, but it also only checks for rows that match existing
rows.
You won't find a command that ignores or handles arbitrary kinds of errors.
You have to figure out what kinds of errors you expect and handle them explicitly
by running queries against the data.
I don't think that a catch-it-all handler that handles all errors would be very
useful. Normally, there are certain errors you want to tolerate, while others
should be considered unrecoverable and lead to errors.
Yours,
Laurenz Albe
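As an illustration of "handle them explicitly by running queries against the data", one could filter the staging rows before inserting. Everything below is hypothetical: a Categories table referenced by a foreign key, a CategoryId column in the staging table, and a check constraint limiting the title length.

    INSERT INTO "Parts" ("Id", "Title", "CategoryId")
    SELECT s."Id", s."Title", s."CategoryId"
    FROM parts_staging AS s
    -- drop rows whose foreign key has no matching category
    JOIN "Categories" AS c ON c."Id" = s."CategoryId"
    -- drop rows that would violate a hypothetical CHECK (length("Title") <= 200)
    WHERE length(s."Title") <= 200
    ON CONFLICT DO NOTHING;   -- drop duplicates against unique constraints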
On Mon, Feb 24, 2025 at 9:09 AM me nefcanto <sn.1361@gmail.com> wrote:
> The point is, the database schema is not in our hands. We don't know what rules exist on each table or which rules change, and it's not practical or feasible to spend resources keeping our bulk-insertion logic in sync with database changes.
That's a company problem, and not one that can be solved by changing the way COPY works.
> Is there a problem with implementing this? After all, one expects the most advanced open-source database to support this real-world requirement.
We're not going to change Postgres into an ETL application. This thread has given you numerous Postgres-specific solutions, but there is also no shortage of ETL applications you can try out.
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
I didn't understand how adding a catch-all handler turns Postgres into an ETL application. And about all those numerous Postgres-specific solutions, can you please enumerate them? There are not numerous options, just a couple, and none of them work. Inserting the data into a table without constraints and then performing a merge won't work, because merge also does not have a catch-all handler. What other solutions have been proposed? pg_bulkload?
So basically two solutions have been suggested: MERGE and pg_bulkload. One does not work (it handles more kinds of errors but still fails on the first error it cannot handle), and the other is a command-line utility.
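For reference, a hedged sketch of the MERGE variant with hypothetical names; the join columns must cover the unique constraint you want to avoid tripping. It only prevents errors for rows that match existing ones, so check or foreign-key violations on the remaining rows still abort the statement.

    MERGE INTO "Parts" AS p
    USING parts_staging AS s
        ON p."Id" = s."Id"
    WHEN NOT MATCHED THEN
        INSERT ("Id", "Title") VALUES (s."Id", s."Title");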
I also did some quick research using AI and came up with this table:
RDBMS System - Row-Level Error Handling for Bulk Copy
Oracle Database - Yes (e.g., SQL*Loader can direct bad records)
Microsoft SQL Server - No (BULK INSERT generally aborts on row errors)
MySQL - Yes (using LOAD DATA with IGNORE options)
PostgreSQL - No (COPY stops on error without workarounds)
IBM DB2 - Yes (LOAD utility supports error logging)
Teradata - Yes (has options for capturing reject rows)
SAP HANA - No (bulk load typically aborts on errors)
MariaDB - Yes (similar to MySQL with error-handling options)
SQLite - No (no dedicated bulk load command with error handling)
Amazon Redshift - Yes (COPY command allows error tolerance settings)
Google BigQuery - Yes (load jobs can be configured to skip bad rows)
Netezza - Yes (supports reject files for bulk loads)
Snowflake - Yes (COPY INTO offers error handling parameters)
Vertica - Yes (COPY command logs errors and continues)
Ten RDBMS systems support row-level error handling for bulk loads, and only four do not. This means that the majority have found it to be a useful feature.
Greg, may I ask what your argument is against having a catch-all error handler in bulk operations? The ETL argument is not a valid one; as I mentioned previously, none of my real-world requirements were related to ETL.
Regards
Saeed
Please raise this on pgsql-general; this is a feature request, not a bug. I am not responding to this thread further.