Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features) - Mailing list pgsql-hackers

From Tom Lane
Subject Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date
Msg-id 752672.1699474336@sss.pgh.pa.us
Whole thread Raw
In response to Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)  (Daniel Gustafsson <daniel@yesql.se>)
Responses Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
List pgsql-hackers
Daniel Gustafsson <daniel@yesql.se> writes:
>> On 8 Nov 2023, at 19:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think an actually usable feature of this sort would involve
>> copying all the failed lines to some alternate output medium,
>> perhaps a second table with a TEXT column to receive the original
>> data line.  (Or maybe an array of text that could receive the
>> broken-down field values?)  Maybe we could dump the message info,
>> line number, field name etc into additional columns.

> I agree that the errors should be easily visible to the user in some way.  The
> feature is for sure interesting, especially in data warehouse type jobs where
> dirty data is often ingested.

I agree it's interesting, but we need to get it right the first time.

Here is a very straw-man-level sketch of what I think might work.
The option to COPY FROM looks something like

    ERRORS TO other_table_name (item [, item [, ...]])

where the "items" are keywords identifying the information item
we will insert into each successive column of the target table.
This design allows the user to decide which items are of use
to them.  I envision items like

LINENO    bigint        COPY line number, counting from 1
LINE    text        raw text of line (after encoding conversion)
FIELDS    text[]        separated, de-escaped string fields (the data
            that was or would be fed to input functions)
FIELD    text        name of troublesome field, if field-specific
MESSAGE    text        error message text
DETAIL    text        error message detail, if any
SQLSTATE text        error SQLSTATE code

Some of these would have to be populated as NULL if we didn't get
that far in processing the line.  In the worst case, which is
encoding conversion failure, I think we couldn't populate any of
the data items except LINENO.

Not sure if we need to insist that the target table columns be
exactly the data types I show above.  It'd be nice to allow
the LINENO target to be plain int, perhaps.  OTOH, do we really
want to have to deal with issues like conversion failures while
trying to report an error?

> As a data point, Greenplum has this feature with additional SQL syntax to
> control it:
>     COPY .. LOG ERRORS SEGMENT REJECT LIMIT xyz ROWS;
> LOG ERRORS instructs the database to log the faulty rows and SEGMENT REJECT
> LIMIT xyz ROWS sets the limit of how many rows can be faulty before the
> operation errors out.  I'm not at all advocating that we should mimic this,
> just wanted to add a reference to postgres derivative where this has been
> implemented.

Hm.  A "reject limit" might be a useful add-on, but I wouldn't advocate
including it in the initial patch.

            regards, tom lane



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: XX000: tuple concurrently deleted during DROP STATISTICS
Next
From: Bruce Momjian
Date:
Subject: Re: XID-wraparound hazards in LISTEN/NOTIFY