Thread: Please implement a catch-all error handler per row, for COPY
Hello
Please consider these scenarios:
- I want to create a million fake products, sometimes even 100 million (we're on MariaDB now and we plan to migrate to Postgres). My team uses fake data for performance tests and other use cases.
- Another scenario is translations. Even in production, we have translation files for more than 20 languages and more than 2 thousand keys. That means we need to insert 40 thousand translation records in production.
- Another scenario is updating nested-model values for a large hierarchical table, for example the categories table. Any time a user changes a record in that table, we need to recalculate the nested model for all categories and bulk-update the results (see the sketch after this list).
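For the third scenario, this is roughly the pattern we use; a simplified sketch, with the table and column names invented for this example:

    -- Load the recalculated hierarchy values into a staging table,
    -- then bulk-update the real table from it.
    CREATE TEMP TABLE category_staging (
        id    bigint PRIMARY KEY,
        lft   int,
        rgt   int,
        depth int,
        path  text
    );

    COPY category_staging FROM '/tmp/nested_sets.csv' WITH (FORMAT csv);

    UPDATE categories AS c
    SET lft = s.lft, rgt = s.rgt, depth = s.depth, path = s.path
    FROM category_staging AS s
    WHERE c.id = s.id;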
All of these scenarios are such that data sanitation is difficult, if not impossible, before doing the bulk operation (copy).
I realized that when we specify `on_error ignore`, it handles only a handful of errors. I thought this was a bug and sent an email to the pgsql-bugs mailing list, but they said it's the intended behavior.
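For reference, this is the kind of invocation I mean (file path and column list are illustrative):

    COPY products (id, sku, price)
    FROM '/tmp/products.csv'
    WITH (FORMAT csv, ON_ERROR ignore);

    -- Rows whose values fail data-type parsing are skipped;
    -- anything else (e.g. a unique-constraint violation) still
    -- aborts the whole command.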
Can you please provide a row-level catch-all handler for the copy command?
Regards
Saeed
me nefcanto <sn.1361@gmail.com> writes:
> Can you please provide a row-level catch-all handler for the copy command?

Very unlikely to happen. COPY is not intended as a general purpose ETL mechanism, and we don't want to load it down with features that would create substantial performance penalties. Which that would.

Even ignoring the performance angle, this request seems remarkably ill-specified. What is a "row-level handler" for errors that have to do with identifying row boundaries?

regards, tom lane
On 3/1/25 07:18, me nefcanto wrote:
> Hello
>
> All of these scenarios are such that data sanitation is difficult, if not impossible, before doing the bulk operation (copy).
>
> I realized that when we specify `on_error ignore`, it handles only a handful of errors. I thought this was a bug and sent an email to the pgsql-bugs mailing list, but they said it's the intended behavior.

Take a look at:

http://ossc-db.github.io/pg_bulkload/index.html

and

https://pgloader.io/

They may be able to meet your needs.

> Regards
> Saeed

-- 
Adrian Klaver
adrian.klaver@aklaver.com
On Sat, Mar 1, 2025 at 9:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> me nefcanto <sn.1361@gmail.com> writes:
>> Can you please provide a row-level catch-all handler for the copy command?
>
> Very unlikely to happen. COPY is not intended as a general purpose
> ETL mechanism, and we don't want to load it down with features that
> would create substantial performance penalties. Which that would.
Maybe it isn't a general-purpose ETL tool, but there is no reasonable way to do some things unless COPY can be put into a mode that doesn't have the same performance requirements it must meet to serve as our dump/restore tool of choice.
I have to imagine such a mode, when not enabled, would have little to no impact on how COPY behaves today. It's kinda like how VACUUM FULL existing doesn't impact how VACUUM behaves. That seems more desirable than inventing a whole new SQL command and copy-pasting all of the COPY code, since it would do mostly the same thing.
In short, it's probably worth a try if someone wants to take it on. Rejecting it without seeing a proposal seems premature.
Now, it's probably challenging enough that if the person requesting the feature isn't driving its development, the odds of it getting worked on are fairly low.
David J.
FYI the -bugs thread in question:
seems to imply the primary blocker was a unique constraint.
Cheers,
Greg
Thank you all for responding. With respect, I don't understand why COPY gets related to ETL. All of the requirements I mentioned above have nothing to do with ETL.

We have a table of categories, a hierarchical table in a multitenant app with hundreds of thousands of records. We want to calculate the hierarchical properties (nested set model, node depth, leaf detection, materialized paths, etc.) and then update the entire table. What does that have to do with ETL? Or, as I said, we create millions of records for products, product attributes, pricing, and media to load-test our system. Again, what does that have to do with ETL?
The point is that there is already an `on_error ignore` clause. That means that somewhere there is a try/catch per row. If I'm wrong, please let me know. But if `on_error ignore` can catch problem X for each row, then it can catch all problems for each row without any performance penalty.
Let me give you an example in C#:
try
{
    // insert one row
}
catch (SomeException ex)
{
    // handle this one specific error; skip the row
}
becomes:
try
{
    // insert one row
}
catch (Exception ex)
{
    if (ex is SomeException)
    {
        // handle the specific error
    }
    // any other error: skip the row as well
}
The last catch clause catches everything. How would that affect performance? Running a simple if-check across hundreds of millions of iterations costs next to nothing.
As I specified in the bug thread, 7 out of 11 RDBMSs support this. So it's not an uncommon or weird request.
Regards
Saeed
On Sat, Mar 1, 2025 at 8:45 PM Greg Sabino Mullane <htamfids@gmail.com> wrote:
> FYI the -bugs thread in question:
> seems to imply the primary blocker was a unique constraint.
>
> Cheers,
> Greg
> On Mar 2, 2025, at 19:44, me nefcanto <sn.1361@gmail.com> wrote:
>
> As I specified in the bug thread, 7 out of 11 RDBMSs support this. So it's not an uncommon or weird request.

If your organization is interested in producing a design and a patch, or in paying a developer or organization to do so, that would be the best way forward.

Everyone who works on the core PostgreSQL code is either a volunteer or extremely busy. Those that are paid to work on PostgreSQL usually have their priorities mapped out for months (years?) already. The best way to convince the community to adopt a feature is to come bearing a high-quality patch. That's no guarantee, of course, but the probability is much higher that way.
me nefcanto <sn.1361@gmail.com> writes:
> The point is that there is already an `on_error ignore` clause. That means that somewhere there is a try/catch per row. If I'm wrong, please let me know.

You are wrong. See

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=d9f7f5d32
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=9e2d87011

and a ton of related commits.

It's possible that the specific case of unique-index violations could be handled in a similar style. But "catch any error whatever" is simply not going to happen here, because a subtransaction per row is not practical.

regards, tom lane
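To make the distinction concrete, here is a small sketch (temp table and data invented for the example) of what `on_error ignore` does and does not catch on PostgreSQL 17: malformed input is skipped via the soft-error paths in the commits above, while a unique-key violation still aborts the COPY.

    CREATE TEMP TABLE demo (id int PRIMARY KEY);

    COPY demo FROM STDIN WITH (ON_ERROR ignore);
    1
    abc
    2
    \.
    -- "abc" fails int input parsing; the input function reports a
    -- soft error instead of throwing, so only that row is skipped.

    COPY demo FROM STDIN WITH (ON_ERROR ignore);
    2
    \.
    -- ERROR: duplicate key value violates unique constraint "demo_pkey"
    -- The unique violation is raised later, by the index machinery,
    -- where only a subtransaction per row could catch it.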
On 3/2/25 19:44, me nefcanto wrote:
> Thank you all for responding. With respect, I don't understand why COPY gets related to ETL. All of the requirements I mentioned above have nothing to do with ETL.
>
> We have a table of categories, a hierarchical table in a multitenant app with hundreds of thousands of records. We want to calculate the hierarchical properties (nested set model, node depth, leaf detection, materialized paths, etc.) and then update the entire table. What does that have to do with ETL?

Pretty sure that defines ETL: extract data from a source, transform it to meet the properties, and load it.

At any rate:

1) ON_ERROR IGNORE only just appeared in Postgres 17.

2) Any further change to it would constitute a feature change, which can only happen in a major version upgrade. The next major version is due this fall and development is already well under way. The chance of this making it in, given the time frame, is slim, so that pushes a practicable deadline to the fall of 2026.

3) As I mentioned in a previous post, there are packages already out there that may get you what you want now.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
On Mon, 2025-03-03 at 07:14 +0330, me nefcanto wrote:
> The point is that there is already an `on_error ignore` clause. That means that somewhere there is a try/catch per row. If I'm wrong, please let me know.

The crucial point that Tom referred to is the "per row". What is a row?

Imagine you have a COPY FROM statement for a table with three columns, and the data look like this:

1,Smith,John
2,Lewis,Jerry Lee
3,Prince
4,Albe,Laurenz

We may be able to guess what is meant, but how shall the machine know where the line boundaries are, which data to ignore and which to process?

Currently, that is no problem, because errors are only identified after the data have been successfully parsed.

Yours,
Laurenz Albe