Thread: VLDB Features

VLDB Features

From
Simon Riggs
Date:
I'm starting work on my next projects for 8.4.

Many applications have the need to store very large data volumes for
both archival and analysis. The analytic databases are commonly known as
Data Warehouses, though there isn't a common term for large archival
data stores. The two use cases often blur together, and many people
see them as a single use case. My initial interest is in the large
archival data stores.

One of the main issues to be faced is simply data maintenance and
management. Loading, deleting, vacuuming data all takes time. Those
issues relate mainly to the size of the data store rather than any
particular workload, so I'm calling that set of required features "Very
Large Database" (or VLDB) features.

VLDB Features I'm expecting to work on are
- Read Only Tables/WORM tables
- Advanced Partitioning
- Compression
plus related performance features

Details of those will be covered in separate mails over the next few
weeks and months. This is just to let everybody know where I'm headed,
so you can see the big picture with me.

I'll be working on other projects as well, many of which I've listed
here: http://developer.postgresql.org/index.php/Simon_Riggs%27_Development_Projects
I expect the list is too long to complete for 8.4, but I'm allowing for
various issues arising during development.

So please keep specific discussion to the other mails as they arrive.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: VLDB Features

From
Josh Berkus
Date:
Simon.

> VLDB Features I'm expecting to work on are
> - Read Only Tables/WORM tables
> - Advanced Partitioning
> - Compression
> plus related performance features

Just so you don't lose sight of it, one of the biggest VLDB features we're 
missing is fault-tolerant bulk load.  Unfortunately, I don't know anyone 
who's working on it.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: VLDB Features

From
Hannu Krosing
Date:
On Tue, 2007-12-11 at 10:53, Josh Berkus wrote:
> Simon.
>
> > VLDB Features I'm expecting to work on are
> > - Read Only Tables/WORM tables
> > - Advanced Partitioning
> > - Compression
> > plus related performance features
>
> Just so you don't lose sight of it, one of the biggest VLDB features we're
> missing is fault-tolerant bulk load.

What do you mean by fault-tolerant here?

Just

COPY ... WITH ERRORS TO ...

or something more advanced, like a bulk load which can be continued
after a crash?

--------------
Hannu




Re: VLDB Features

From
Simon Riggs
Date:
On Tue, 2007-12-11 at 10:53 -0800, Josh Berkus wrote:
> Simon.
> 
> > VLDB Features I'm expecting to work on are
> > - Read Only Tables/WORM tables
> > - Advanced Partitioning
> > - Compression
> > plus related performance features
> 
> Just so you don't lose sight of it, one of the biggest VLDB features we're 
> missing is fault-tolerant bulk load.  Unfortunately, I don't know anyone 
> who's working on it.

Not lost sight of it; I have a design, but I have to prioritise also.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: VLDB Features

From
Josh Berkus
Date:
Hannu,

> COPY ... WITH ERRORS TO ...

Yeah, that's a start.

> or something more advanced, like bulkload which can be continued after
> crash ?

Well, we could also use a loader which automatically parallelized, but that 
functionality can be done at the middleware level.  WITH ERRORS is the 
most critical part.

Here's the other VLDB features we're missing:

Parallel Query
Windowing Functions
Parallel Index Build (not sure how this works exactly, but it speeds Oracle 
up considerably)
On-disk Bitmap Index (anyone game to finish GP patch?)

Simon, we should start a VLDB-Postgres developer wiki page.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: VLDB Features

From
Neil Conway
Date:
On Tue, 2007-12-11 at 10:53 -0800, Josh Berkus wrote:
> Just so you don't lose sight of it, one of the biggest VLDB features we're 
> missing is fault-tolerant bulk load.

I actually had to cook up a version of this for Truviso recently. I'll
take a look at submitting a cleaned-up implementation for 8.4.

-Neil




Re: VLDB Features

From
Simon Riggs
Date:
On Tue, 2007-12-11 at 15:31 -0800, Josh Berkus wrote:

> Here's the other VLDB features we're missing:
> 
> Parallel Query
> Windowing Functions
> Parallel Index Build (not sure how this works exactly, but it speeds Oracle 
> up considerably)
> On-disk Bitmap Index (anyone game to finish GP patch?)

I would call those VLDB Data Warehousing features to differentiate
between that and the use of VLDBs for other purposes.

I'd add Materialized View support in the planner, as well as saying it's
more important than parallel query, IMHO. MVs are to DW what indexes are
to OLTP. It's the same as indexes vs. seqscan: you can speed up the seq
scan or you can avoid it. Brute force is cool, but being smarter is even
better.
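To illustrate (just a sketch; the table and column names are invented):
a typical DW aggregate query like

  SELECT date_trunc('month', order_date) AS month, sum(amount)
  FROM orders
  GROUP BY 1;

could be answered from a small, pre-computed summary table

  CREATE TABLE orders_monthly AS
    SELECT date_trunc('month', order_date) AS month,
           sum(amount) AS amount
    FROM orders
    GROUP BY 1;

instead of seq-scanning the whole fact table, if the planner knew the
two were equivalent.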

The reason they don't normally show up high on anybody's feature list is
that the TPC benchmarks specifically disallow them, which, as I once
observed, is very good evidence that they are a useful feature in
practice. (Oracle originally brought out MV support as a way of
improving their TPC scores at a time when Teradata was wiping the floor
with its parallel query implementation.)

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: VLDB Features

From
Greg Smith
Date:
On Tue, 11 Dec 2007, Josh Berkus wrote:

> Just so you don't lose sight of it, one of the biggest VLDB features we're
> missing is fault-tolerant bulk load.  Unfortunately, I don't know anyone
> who's working on it.

I'm curious what you feel is missing such that pgloader doesn't fill that 
requirement:  http://pgfoundry.org/projects/pgloader/

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: VLDB Features

From
Josh Berkus
Date:
Greg,

> I'm curious what you feel is missing such that pgloader doesn't fill that
> requirement:  http://pgfoundry.org/projects/pgloader/

Because pgloader is implemented in middleware, it carries a very high overhead 
if you have bad rows.  As little as 1% bad rows will slow down loading by 20% 
due to retries.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: VLDB Features

From
Dimitri Fontaine
Date:
Hi,

On Wednesday 12 December 2007, Josh Berkus wrote:
> > I'm curious what you feel is missing such that pgloader doesn't fill that
> > requirement:  http://pgfoundry.org/projects/pgloader/
>
> Because pgloader is implemented in middleware, it carries a very high
> overhead if you have bad rows.  As little as 1% bad rows will slow down
> loading by 20% due to retries.

Not that much, in fact, I'd say.
pgloader lets its user configure how large a COPY buffer to use (a
global parameter as of now; it could easily be a per-section
configuration knob, I just haven't seen any need for this yet).
It's the 'copy_every' parameter, as seen in the man page here: http://pgloader.projects.postgresql.org/#toc4

pgloader will prepare an in-memory buffer of copy_every tuples to give
to COPY, and in case of error will split it and retry: the classic
dichotomy approach, from the initial implementation by Jan Wieck.

So you can easily balance the error recovery costs against the COPY bulk size.

Note also that the overall loading time with pgloader does not scale
linearly with the COPY buffer size; the optimal choice depends on the
dataset (and the data massaging pgloader has to do on it), and I've seen
the best results with buffers of 10000 to 15000 tuples so far.

FYI, now that the pgloader topic is on the table, the next items I think
I'm going to develop for it are configurable behavior on erroneous
tuples (e.g. load to another table on a PK error) and some limited
DDL-partitioning support.

I'm playing with the idea of pgloader being able to read some
partitioning schemes (parsing the CHECK constraints on inherited tables)
and load directly into the right partitions, as sketched below.
That would of course be done only when configured this way, and if the
constraints were misread it would only result in a lot more rejected
rows than expected; you could still retry using your insert trigger
instead of pgloader's buggy smartness.
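
For example, with the usual inheritance-based partitioning setup (a
sketch; names invented):

  CREATE TABLE events (ts timestamptz, payload text);

  CREATE TABLE events_2007_12 (
      CHECK (ts >= '2007-12-01' AND ts < '2008-01-01')
  ) INHERITS (events);

pgloader would parse the CHECK constraint and COPY the matching rows
straight into events_2007_12 rather than into the parent table.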

Comments welcome, regards,
--
dim

Re: VLDB Features

From
Simon Riggs
Date:
On Tue, 2007-12-11 at 15:31 -0800, Josh Berkus wrote:

> Simon, we should start a VLDB-Postgres developer wiki page.

http://developer.postgresql.org/index.php/DataWarehousing

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: VLDB Features

From
Markus Schiltknecht
Date:
Hi,

Josh Berkus wrote:
> Here's the other VLDB features we're missing:
> 
> Parallel Query

Uh.. this only makes sense in a distributed database, no? I've thought 
about parallel querying on top of Postgres-R. Does it make sense 
implementing some form of parallel querying apart from the distribution 
or replication engine?

> Windowing Functions

Isn't Gavin Sherry working on this? Haven't read anything from him lately...

> Parallel Index Build (not sure how this works exactly, but it speeds Oracle 
> up considerably)

Sounds interesting *turns-away-to-google*

> On-disk Bitmap Index (anyone game to finish GP patch?)

Anybody having an idea of what's missing there (besides good use cases, 
which some people doubt)? Again: Gavin?

> Simon, we should start a VLDB-Postgres developer wiki page.

Thanks, Simon, wiki page looks good!

Regards

Markus



Re: VLDB Features

From
Josh Berkus
Date:
Markus,

> > Parallel Query
>
> Uh.. this only makes sense in a distributed database, no? I've thought
> about parallel querying on top of Postgres-R. Does it make sense
> implementing some form of parallel querying apart from the distribution
> or replication engine?

Sure.  Imagine you have a 5TB database on a machine with 8 cores and only one 
concurrent user.  You'd like to have 1 core doing I/O, and say 4-5 cores 
dividing the scan and join processing into 4-5 chunks.

I'd say implementing a separate I/O worker would be the first step towards 
this; if we could avoid doing I/O in the same process/thread where we're 
doing row parsing it would speed up large scans by 100%.  I know Oracle does 
this, and their large-table-I/O is 30-40% faster than ours despite having 
less efficient storage.

Maybe Greenplum or EnterpriseDB will contribute something.  ;-)

> > Windowing Functions
>
> Isn't Gavin Sherry working on this? Haven't read anything from him
> lately...

Me neither.  Swallowed by Greenplum and France.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: VLDB Features

From
Markus Schiltknecht
Date:
Hi Josh,

Josh Berkus wrote:
> Sure.  Imagine you have a 5TB database on a machine with 8 cores and only one 
> concurrent user.  You'd like to have 1 core doing I/O, and say 4-5 cores 
> dividing the scan and join processing into 4-5 chunks.

Ah, right, thanks for the enlightenment. Heck, I'm definitely too
focused on replication and distributed databases :-)

However, there's certainly a great deal of intersection between
parallel processing on different machines and parallel processing on
multiple CPUs - especially considering NUMA architectures.
*comes-to-think-again*...

>> Isn't Gavin Sherry working on this? Haven't read anything from him
>> lately...
> 
> Me neither.  Swallowed by Greenplum and France.

Hm.. good for him, I guess!

Regards

Markus


Re: VLDB Features

From
Gavin Sherry
Date:
On Wed, Dec 12, 2007 at 08:26:16PM +0100, Markus Schiltknecht wrote:
> >>Isn't Gavin Sherry working on this? Haven't read anything from him
> >>lately...
> >
> >Me neither.  Swallowed by Greenplum and France.
> 
> Hm.. good for him, I guess!

Yes, I'm around -- just extremely busy with a big release at Greenplum as 
well as other Real Life stuff.

Thanks,

Gavin


Re: VLDB Features

From
"Joshua D. Drake"
Date:


> Greenplum as well as other Real Life stuff.

For those of us here who have no idea what you are talking about can
you define what "Real Life" is like?

Joshua D. Drake



- -- 
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/ 
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'



Re: VLDB Features

From
Gregory Stark
Date:
"Josh Berkus" <josh@agliodbs.com> writes:

> Markus,
>
>> > Parallel Query
>>
>> Uh.. this only makes sense in a distributed database, no? I've thought
>> about parallel querying on top of Postgres-R. Does it make sense
>> implementing some form of parallel querying apart from the distribution
>> or replication engine?

Yes, but not for the reasons Josh describes.

> I'd say implementing a separate I/O worker would be the first step towards 
> this; if we could avoid doing I/O in the same process/thread where we're 
> doing row parsing it would speed up large scans by 100%.  I know Oracle does 
> this, and their large-table-I/O is 30-40% faster than ours despite having 
> less efficient storage.

Oracle is using Direct I/O so they need the reader and writer threads to avoid
blocking on i/o all the time. We count on the OS doing readahead and buffering
our writes so we don't have to. Direct I/O and needing some way to do
asynchronous writes and reads are directly tied.

Where Parallel query is useful is when you have queries that involve a
substantial amount of cpu resources, especially if you have a very fast I/O
system which can saturate the bandwidth to a single cpu.

So for example if you have a merge join which requires sorting both sides of
the query you could easily have subprocesses handle those sorts allowing you
to bring two processors to bear on the problem instead of being limited to a
single processor.

On Oracle Parallel Query goes great with partitioned tables. Their query
planner will almost always turn the partition scans into parallel scans and
use separate processors to scan different partitions. 

--  Gregory Stark EnterpriseDB  http://www.enterprisedb.com  Ask me about EnterpriseDB's Slony Replication support!


Re: VLDB Features

From
Markus Schiltknecht
Date:
Hello Gregory,

Gregory Stark wrote:
> Oracle is using Direct I/O so they need the reader and writer threads to avoid
> blocking on i/o all the time. We count on the OS doing readahead and buffering
> our writes so we don't have to. Direct I/O and needing some way to do
> asynchronous writes and reads are directly tied.

Yeah, except in cases where we can predict non-sequential reads ahead
of time. Which admittedly doesn't come up too frequently, and can
probably be handled with posix_fadvise - as you are currently testing.

> Where Parallel query is useful is when you have queries that involve a
> substantial amount of cpu resources, especially if you have a very fast I/O
> system which can saturate the bandwidth to a single cpu.

Full ACK, the very same applies to parallel querying on shared-nothing 
clusters. Those can help if the bandwidth to all processing cores 
together becomes the bottleneck (and the resulting data is relatively 
small compared to the input data).

For example, Sun's UltraSparc T2 features only 8 PCIe lanes for those 8 
cores, so you end up with 250 MiB/sec per core or about 32 MiB/sec per 
thread on average. To be fair: their 10 Gig Ethernet ports don't go via 
PCIe, so you get an additional 2x 1 GiB/sec for the complete chip. And 
memory bandwidth looks a lot better: Sun claims 60+ GiB/sec, leaving 
almost 8 GiB/sec per core or 1 GiB/sec per thread.

If my calculations for Intel are correct, a Quad Xeon with a 1.33 GHz 
FSB has around 21 GiB/sec throughput to main memory, giving 5 GiB/sec 
per core. (Why are these numbers so hard to find? It looks like Intel 
deliberately obfuscates them with FSB MHz or Giga-transactions per sec 
and the like.)

Regards

Markus



Re: VLDB Features

From
Hannu Krosing
Date:
On Tue, 2007-12-11 at 15:41, Neil Conway wrote:
> On Tue, 2007-12-11 at 10:53 -0800, Josh Berkus wrote:
> > Just so you don't lose sight of it, one of the biggest VLDB features we're
> > missing is fault-tolerant bulk load.
>
> I actually had to cook up a version of this for Truviso recently. I'll
> take a look at submitting a cleaned-up implementation for 8.4.

How did you do it?

Did you enhance the COPY command or was it something completely new?

-----------
Hannu




Re: VLDB Features

From
Neil Conway
Date:
On Fri, 2007-12-14 at 14:48 +0200, Hannu Krosing wrote:
> How did you do it?
> 
> Did you enhance the COPY command or was it something completely new?

By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
to drop (and log) rows that contain malformed data. That is, rows with
too many or too few columns, rows that result in constraint violations,
and rows containing columns where the data type's input function raises
an error. The last case is the only thing that would be a bit tricky to
implement, I think: you could use PG_TRY() around the InputFunctionCall,
but I guess you'd need a subtransaction to ensure that you reset your
state correctly after catching an error.
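
At the SQL level, that per-row subtransaction is roughly what a client
would get by doing this for every input row (illustration only; target
is an invented name):

  BEGIN;
  SAVEPOINT row_n;
  INSERT INTO target VALUES (...);  -- one parsed input row
  -- on success: RELEASE SAVEPOINT row_n and continue
  -- on error:   ROLLBACK TO SAVEPOINT row_n, log the row, continue
  COMMIT;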

-Neil




Re: VLDB Features

From
Andrew Dunstan
Date:

Neil Conway wrote:
> On Fri, 2007-12-14 at 14:48 +0200, Hannu Krosing wrote:
>   
>> How did you do it?
>>
>> Did you enhance the COPY command or was it something completely new?
>>     
>
> By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
> to drop (and log) rows that contain malformed data. That is, rows with
> too many or too few columns, rows that result in constraint violations,
> and rows containing columns where the data type's input function raises
> an error. The last case is the only thing that would be a bit tricky to
> implement, I think: you could use PG_TRY() around the InputFunctionCall,
> but I guess you'd need a subtransaction to ensure that you reset your
> state correctly after catching an error.
>
>
>   

Ideally I think you would put the failing input line in another table, 
or maybe another file. If a table, it would probably have to be as bytea.
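
A minimal sketch of such a table (hypothetical, not taken from any
patch):

  CREATE TABLE copy_errors (
      tablename  name,    -- target of the COPY
      lineno     bigint,  -- input line number
      errmsg     text,    -- what went wrong
      rawline    bytea    -- raw input line; bytea, since it may not
                          -- even be valid text in the database encoding
  );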

cheers

andrew


Re: VLDB Features

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
> to drop (and log) rows that contain malformed data. That is, rows with
> too many or too few columns, rows that result in constraint violations,
> and rows containing columns where the data type's input function raises
> an error. The last case is the only thing that would be a bit tricky to
> implement, I think: you could use PG_TRY() around the InputFunctionCall,
> but I guess you'd need a subtransaction to ensure that you reset your
> state correctly after catching an error.

Yeah.  It's the subtransaction per row that's daunting --- not only the
cycles spent for that, but the ensuing limitation to 4G rows imported
per COPY.

If we could somehow only do a subtransaction per failure, things would
be much better, but I don't see how.
        regards, tom lane


Re: VLDB Features

From
Neil Conway
Date:
On Fri, 2007-12-14 at 18:22 -0500, Tom Lane wrote:
> If we could somehow only do a subtransaction per failure, things would
> be much better, but I don't see how.

One approach would be to essentially implement the pg_bulkloader
approach inside the backend. That is, begin by doing a subtransaction
for every k rows (with k = 1000, say). If you get any errors, then
either repeat the process with k/2 until you locate the individual
row(s) causing the trouble, or perhaps just immediately switch to k = 1.
Fairly ugly though, and would be quite slow for data sets with a high
proportion of erroneous data.

Another approach would be to distinguish between errors that require a
subtransaction to recover to a consistent state, and less serious errors
that don't have this requirement (e.g. invalid input to a data type
input function). If all the errors that we want to tolerate during a
bulk load fall into the latter category, we can do without
subtransactions.
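
For illustration only, the first approach driven from the client side
with savepoints (the batch size k and all names are invented):

  BEGIN;
  SAVEPOINT batch;
  -- INSERT the next k (say 1000) rows here ...
  -- on success: RELEASE SAVEPOINT batch and start the next batch
  -- on failure: ROLLBACK TO SAVEPOINT batch, then retry the same rows
  --             with k/2 (or k = 1) to isolate and log the bad row(s)
  COMMIT;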

-Neil




Re: VLDB Features

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> One approach would be to essentially implement the pg_bulkloader
> approach inside the backend. That is, begin by doing a subtransaction
> for every k rows (with k = 1000, say). If you get any errors, then
> either repeat the process with k/2 until you locate the individual
> row(s) causing the trouble, or perhaps just immediately switch to k = 1.
> Fairly ugly though, and would be quite slow for data sets with a high
> proportion of erroneous data.

You could make it self-tuning, perhaps: initially, or after an error,
set k = 1, and increase k after a successful set of rows.

> Another approach would be to distinguish between errors that require a
> subtransaction to recover to a consistent state, and less serious errors
> that don't have this requirement (e.g. invalid input to a data type
> input function). If all the errors that we want to tolerate during a
> bulk load fall into the latter category, we can do without
> subtransactions.

I think such an approach is doomed to hopeless unreliability.  There is
no concept of an error that doesn't require a transaction abort in the
system now, and that doesn't seem to me like something that can be
successfully bolted on after the fact.  Also, there's a lot of
bookkeeping (eg buffer pins) that has to be cleaned up regardless of the
exact nature of the error, and all those mechanisms are hung off
transactions.
        regards, tom lane


Re: VLDB Features

From
Josh Berkus
Date:
Tom,

> I think such an approach is doomed to hopeless unreliability.  There is
> no concept of an error that doesn't require a transaction abort in the
> system now, and that doesn't seem to me like something that can be
> successfully bolted on after the fact.  Also, there's a lot of
> bookkeeping (eg buffer pins) that has to be cleaned up regardless of the
> exact nature of the error, and all those mechanisms are hung off
> transactions.

There's no way we can do a transactionless load, then?  I'm thinking of the 
load-into-new-partition case, which is a single pass/fail operation.  Would 
ignoring individual row errors in this case still cause these kinds of 
problems?

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: VLDB Features

From
Trent Shipley
Date:
On Friday 2007-12-14 16:22, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
> > to drop (and log) rows that contain malformed data. That is, rows with
> > too many or too few columns, rows that result in constraint violations,
> > and rows containing columns where the data type's input function raises
> > an error. The last case is the only thing that would be a bit tricky to
> > implement, I think: you could use PG_TRY() around the InputFunctionCall,
> > but I guess you'd need a subtransaction to ensure that you reset your
> > state correctly after catching an error.
>
> Yeah.  It's the subtransaction per row that's daunting --- not only the
> cycles spent for that, but the ensuing limitation to 4G rows imported
> per COPY.

You could extend the COPY FROM syntax with a COMMIT EVERY n clause.  This 
would help with the 4G subtransaction limit.  The cost to the ETL process is 
that a simple rollback would not be guaranteed to send the process back to its 
initial state.  There are easy ways to deal with the rollback issue though.  

A {NO} RETRY {USING algorithm} clause might be useful.   If the NO RETRY 
option is selected then the COPY FROM can run without subtransactions and in 
excess of the 4G per transaction limit.  NO RETRY should be the default since 
it preserves the legacy behavior of COPY FROM.

You could have an EXCEPTIONS TO {filename|STDERR} clause. I would not give the 
option of sending exceptions to a table since they are presumably malformed, 
otherwise they would not be exceptions.  (Users should re-process the exception 
files if they want an "if good then table A, else exceptions to table B" flow.)

EXCEPTIONS TO and NO RETRY would be mutually exclusive.
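
Putting those clauses together, a load might then look something like
this (hypothetical syntax sketching the proposal; nothing here is
implemented):

  COPY orders FROM '/data/orders.dat'
      COMMIT EVERY 1000000 ROWS
      EXCEPTIONS TO '/data/orders.rej';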


> If we could somehow only do a subtransaction per failure, things would
> be much better, but I don't see how.




Re: VLDB Features

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> There's no way we can do a transactionless load, then?  I'm thinking of the 
> load-into-new-partition case, which is a single pass/fail operation.  Would 
> ignoring individual row errors in this case still cause these kinds of 
> problems?

Given that COPY fires triggers and runs CHECK constraints, there is no
part of the system that cannot be exercised during COPY.  So I think
supposing that we can just deal with some simplified subset of reality
is mere folly.
        regards, tom lane


Re: VLDB Features

From
NikhilS
Date:
Hi,

Another approach would be to distinguish between errors that require a
subtransaction to recover to a consistent state, and less serious errors
that don't have this requirement (e.g. invalid input to a data type
input function). If all the errors that we want to tolerate during a
bulk load fall into the latter category, we can do without
subtransactions.
 
I think errors which occur after we have done the heap_insert of the
tuple generated from the current input row are the ones which would
require the subtransaction to recover. Examples could be unique/primary
key violation errors or FKey/trigger-related errors. Any errors which
occur before doing the heap_insert should not require any recovery
according to me.

The overhead of having a subtransaction per row is a very valid concern.
But instead of using a per-insert or batch-insert subtransaction, I am
thinking that we can start off a subtransaction and continue it till we
encounter a failure. The moment an error is encountered, since we have
the offending (already in heap) tuple around, we can call a
simple_heap_delete on the same and commit (instead of aborting) this
subtransaction after doing some minor cleanup. This current input data
row can also be logged into a bad file. Recall that we need to only
handle those errors in which the simple_heap_insert is successful, but
the index insertion or the after-row-insert trigger causes an error. The
rest of the load can then go ahead with the start of a new
subtransaction.
 
Regards,
Nikhils
--
EnterpriseDB               http://www.enterprisedb.com

Re: VLDB Features

From
Tom Lane
Date:
NikhilS <nikkhils@gmail.com> writes:
> Any errors which occur before doing the heap_insert should not require
> any recovery according to me.

A sufficient (though far from all-encompassing) rejoinder to that is
"triggers and CHECK constraints can do anything".

> The overhead of having a subtransaction per row is a very valid concern.
> But instead of using a per-insert or batch-insert subtransaction, I am
> thinking that we can start off a subtransaction and continue it till we
> encounter a failure.

What of failures that occur only at (sub)transaction commit, such as
foreign key checks?
        regards, tom lane


Re: VLDB Features

From
Simon Riggs
Date:
On Fri, 2007-12-14 at 18:22 -0500, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
> > to drop (and log) rows that contain malformed data. That is, rows with
> > too many or too few columns, rows that result in constraint violations,
> > and rows containing columns where the data type's input function raises
> > an error. The last case is the only thing that would be a bit tricky to
> > implement, I think: you could use PG_TRY() around the InputFunctionCall,
> > but I guess you'd need a subtransaction to ensure that you reset your
> > state correctly after catching an error.
> 
> Yeah.  It's the subtransaction per row that's daunting --- not only the
> cycles spent for that, but the ensuing limitation to 4G rows imported
> per COPY.

I'd suggest doing everything at block level
- wrap each new block of data in a subtransaction
- apply data to the table block by block (can still work with FSM). 
- apply indexes in bulk for each block, unique ones first. 

That then gives you a limit of more than 500 trillion rows, which should
be enough for anyone.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: VLDB Features

From
Neil Conway
Date:
On Tue, 2007-12-11 at 19:11 -0500, Greg Smith wrote:
> I'm curious what you feel is missing such that pgloader doesn't fill that 
> requirement:  http://pgfoundry.org/projects/pgloader/

For complicated ETL, I agree that using an external tool makes the most
sense. But I think there is still merit in adding support to COPY for
the simple case of trying to load a data file that has some corrupted,
invalid or duplicate records.

-Neil




Re: VLDB Features

From
"Pavel Stehule"
Date:
On 16/12/2007, Neil Conway <neilc@samurai.com> wrote:
> On Tue, 2007-12-11 at 19:11 -0500, Greg Smith wrote:
> > I'm curious what you feel is missing such that pgloader doesn't fill that
> > requirement:  http://pgfoundry.org/projects/pgloader/
>
> For complicated ETL, I agree that using an external tool makes the most
> sense. But I think there is still merit in adding support to COPY for
> the simple case of trying to load a data file that has some corrupted,
> invalid or duplicate records.
>
> -Neil
>
>

Any simple enhancement of COPY is welcome. I have lost a lot of time
with repeated imports.

Regards
Pavel




Re: VLDB Features

From
Trent Shipley
Date:
On Saturday 2007-12-15 02:14, Simon Riggs wrote:
> On Fri, 2007-12-14 at 18:22 -0500, Tom Lane wrote:
> > Neil Conway <neilc@samurai.com> writes:
> > > By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
> > > to drop (and log) rows that contain malformed data. That is, rows with
> > > too many or too few columns, rows that result in constraint violations,
> > > and rows containing columns where the data type's input function raises
> > > an error. The last case is the only thing that would be a bit tricky to
> > > implement, I think: you could use PG_TRY() around the
> > > InputFunctionCall, but I guess you'd need a subtransaction to ensure
> > > that you reset your state correctly after catching an error.
> >
> > Yeah.  It's the subtransaction per row that's daunting --- not only the
> > cycles spent for that, but the ensuing limitation to 4G rows imported
> > per COPY.
>
> I'd suggest doing everything at block level
> - wrap each new block of data in a subtransaction
> - apply data to the table block by block (can still work with FSM).
> - apply indexes in bulk for each block, unique ones first.
>
> That then gives you a limit of more than 500 trillion rows, which should
> be enough for anyone.

Wouldn't it only give you more than 500T rows in the best case?  If it hits a 
bad row it has to back off and roll forward one row and one subtransaction at 
a time for the failed block.  So in the worst case, where there is at least 
one exception row per block, I think you would still wind up with only a 
capacity of 4G rows.


Re: VLDB Features

From
Hannu Krosing
Date:
On Sat, 2007-12-15 at 01:12, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > There's no way we can do a transactionless load, then?  I'm thinking of the
> > load-into-new-partition case, which is a single pass/fail operation.  Would
> > ignoring individual row errors in this case still cause these kinds of
> > problems?
>
> Given that COPY fires triggers and runs CHECK constraints, there is no
> part of the system that cannot be exercised during COPY.  So I think
> supposing that we can just deal with some simplified subset of reality
> is mere folly.

But can't we _define_ such a subset, where we can do a transactionless
load?

I don't think that most DW/VLDB schemas fire complex triggers or custom
data-modifying functions inside CHECKs.

Then we could just run the remaining simple CHECK constraints ourselves,
and not abort on a failed check but just log the rows?

The COPY ... WITH ERRORS TO ... would essentially become a big
conditional RULE through which the incoming data is processed.

------------------
Hannu



Re: VLDB Features

From
Tom Lane
Date:
Hannu Krosing <hannu@skype.net> writes:
> But can't we _define_ such a subset, where we can do a transactionless
> load ?

Sure ... but you'll find that it's not large enough to be useful.
Once you remove all the interesting consistency checks such as
unique indexes and foreign keys, the COPY will tend to go through
just fine, and then you're still stuck trying to weed out bad data
without very good tools for it.  The only errors we could really
separate out without subtransaction fencing are extremely trivial
ones like too many or too few fields on a line ... which can be
caught with a sed script.
        regards, tom lane


Re: VLDB Features

From
NikhilS
Date:
Hi,

On Dec 15, 2007 1:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> NikhilS <nikkhils@gmail.com> writes:
> > Any errors which occur before doing the heap_insert should not require
> > any recovery according to me.
>
> A sufficient (though far from all-encompassing) rejoinder to that is
> "triggers and CHECK constraints can do anything".
>
> > The overhead of having a subtransaction per row is a very valid concern.
> > But instead of using a per-insert or batch-insert subtransaction, I am
> > thinking that we can start off a subtransaction and continue it till we
> > encounter a failure. The moment an error is encountered, since we have
> > the offending (already in heap) tuple around, we can call a
> > simple_heap_delete on the same and commit (instead of aborting) this
> > subtransaction
>
> What of failures that occur only at (sub)transaction commit, such as
> foreign key checks?

What if we identify and define a subset where we could do
subtransaction-based COPY? The following could be supported:

* A subset of triggers and CHECK constraints which do not move the tuple around (identifying this subset might be an issue though?)
* Primary/unique key indexes

As Hannu mentioned elsewhere in this thread, there should not be very many instances of complex triggers/CHECKs around. And maybe in those instances (and also the foreign key checks case), the behaviour could default to a subtransaction-per-row or even the existing single-transaction model?

Regards,
Nikhils
--
EnterpriseDB               http://www.enterprisedb.com

Re: VLDB Features

From
"Michał Zaborowski"
Date:
2007/12/16, Tom Lane <tgl@sss.pgh.pa.us>:
> Hannu Krosing <hannu@skype.net> writes:
> > But can't we _define_ such a subset, where we can do a transactionless
> > load ?
>
> Sure ... but you'll find that it's not large enough to be useful.
> Once you remove all the interesting consistency checks such as
> unique indexes and foreign keys, the COPY will tend to go through
> just fine, and then you're still stuck trying to weed out bad data
> without very good tools for it.  The only errors we could really
> separate out without subtransaction fencing are extremely trivial
> ones like too many or too few fields on a line ... which can be
> caught with a sed script.
>
I have a dump file. I would like to load it ASAP.
Constraints will be applied at the end, so any problem can be detected.
I would like it to be as direct as possible and as bulk as possible -
just allocate pages and fill them with the data. Maybe it should be a
different mode - single user or so. Right now I can save some IO - like
turning off fsync - but that is all :(
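
That is, roughly (a sketch; names invented):

  CREATE TABLE t (id int, payload text);   -- no constraints yet
  COPY t FROM '/path/dump.data';           -- raw bulk load, no checks
  ALTER TABLE t ADD PRIMARY KEY (id);      -- validate once, at the end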

I got something like that:
http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results
I have no idea how to load a single file in many threads, but... the
point is that it can be much faster than a single-thread load -
surprisingly, at least for me.

--
Regards, Michał Zaborowski (TeXXaS)

Re: VLDB Features

From
Decibel!
Date:
On Dec 12, 2007, at 1:26 PM, Markus Schiltknecht wrote:
> Josh Berkus wrote:
>> Sure.  Imagine you have a 5TB database on a machine with 8 cores
>> and only one concurrent user.  You'd like to have 1 core doing I/O,
>> and say 4-5 cores dividing the scan and join processing into
>> 4-5 chunks.
>
> Ah, right, thanks for the enlightenment. Heck, I'm definitely too
> focused on replication and distributed databases :-)
>
> However, there's certainly a great deal of intersection between
> parallel processing on different machines and parallel processing
> on multiple CPUs - especially considering NUMA architectures.
> *comes-to-think-again*...


Except that doing something in-machine is often far simpler than  
trying to go cross-machine, especially when that something is a  
background reader.

Let's walk before we run. :)
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: VLDB Features

From
Josh Berkus
Date:
Tom,

> Sure ... but you'll find that it's not large enough to be useful.
> Once you remove all the interesting consistency checks such as
> unique indexes and foreign keys, the COPY will tend to go through
> just fine, and then you're still stuck trying to weed out bad data
> without very good tools for it.  The only errors we could really
> separate out without subtransaction fencing are extremely trivial
> ones like too many or too few fields on a line ... which can be
> caught with a sed script.

Speaking as someone who did a LOT of DW load design only a couple of years 
ago, I'll say that the "special case" of no triggers and no constraint checks 
except length and type-safety actually constitutes about 50% of DW bulk 
loading.  The only exception to that is unique indexes, which would normally 
be included and would be the difficult thing.

Also, "special case bulk loading" would in fact give users of other types of 
applications a lot more flexibility -- they could always load into a holding 
table just to clean up the type safety issues and then merge into the real 
table.  
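
For example (a sketch; names invented):

  CREATE TABLE staging (id text, amount text);  -- everything as text
  COPY staging FROM '/data/orders.dat';
  INSERT INTO orders                            -- cast and merge the
    SELECT id::int, amount::numeric             -- rows that pass the
    FROM staging                                -- type-safety filter
    WHERE amount ~ '^[0-9.]+$';
  DROP TABLE staging;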

So I don't agree that "load into new partition without dependencies" is 
too much of a special case to be worth pursuing.  It might be a bad idea for 
other reasons, but not because it's too obscure.

--Josh



POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alex Shulgin
Date:
Trent Shipley <trent_shipley@qwest.net> writes:

> [Trent's COMMIT EVERY / EXCEPTIONS TO proposal, quoted in full above;
> trimmed]
>
>> If we could somehow only do a subtransaction per failure, things would
>> be much better, but I don't see how.

Hello,

Attached is a proof of concept patch for this TODO item.  There are no
docs yet, I just wanted to know if the approach is sane.

The added syntax is like the following:

  COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout]

The way it's done is by abusing Copy Both mode, and from my limited
testing that seems to just work.  The error trapping itself is done
using PG_TRY/PG_CATCH and can only catch formatting or before-insert
trigger errors; no attempt is made to recover from a failed unique
constraint, etc.

Example in action:

postgres=# \d test_copy2
  Table "public.test_copy2"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 val    | integer |

postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout;
1
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 1: "1"
2
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 2: "2"
3
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 3: "3"
NOTICE:  total exceptions ignored: 3

postgres=# \d test_copy1
  Table "public.test_copy1"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null

postgres=# set client_min_messages to warning;
SET
postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout;
...
vmstat
zoneinfo
postgres=#

Limited performance testing shows no significant difference between
error-catching and plain code path.  For example, timing

  copy test_copy1 from program 'seq 1000000' [exceptions to stdout]

shows similar numbers with or without the added "exceptions to" clause.

Now that I'm sending this I wonder if the original comment about the
need for a subtransaction around every loaded line still holds.  Any
example of what would not be properly rolled back by just PG_TRY?

Happy hacking!
--
Alex


Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Pavel Stehule
Date:


2014-12-25 22:23 GMT+01:00 Alex Shulgin <ash@commandprompt.com>:
[Alex's proof-of-concept mail, quoted in full above; trimmed]

> Now that I'm sending this I wonder if the original comment about the
> need for a subtransaction around every loaded line still holds.  Any
> example of what would not be properly rolled back by just PG_TRY?

This method is unsafe: exception handlers don't usually free memory, so there is a risk of memory and resource leaks.

You can get the same performance with block subtransactions: when you use one subtransaction per 1000 rows, the impact of the subtransactions is minimal.

When a block fails, you can fall back to a row-level subtransaction; that works well when you expect almost-correct data.

Regards

Pavel


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Pavel Stehule
Date:


2014-12-26 11:41 GMT+01:00 Pavel Stehule <pavel.stehule@gmail.com>:
[previous mails quoted in full above; trimmed]

> When a block fails, you can fall back to a row-level subtransaction;
> that works well when you expect almost-correct data.

Two years ago I wrote an extension that did this, but I have not had time to finish it and push it upstream.

Regards

Pavel



Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
Hello.

I wrote a patch implementing COPY that ignores errors in rows, using block subtransactions.

Syntax: COPY [table] FROM [file/stdin] WITH IGNORE_ERRORS;

Examples:
CREATE TABLE check_ign_err (n int, m int, k int);
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 2 2 2
3 3 3
\.
WARNING:  COPY check_ign_err, line 2: "2 2 2 2"
SELECT * FROM check_ign_err;
 n | m | k
---+---+---
 1 | 1 | 1
 3 | 3 | 3
(2 rows)

##################################################

TRUNCATE check_ign_err;
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 2
3 3 3
\.
WARNING:  COPY check_ign_err, line 2: "2 2"
SELECT * FROM check_ign_err;
 n | m | k
---+---+---
 1 | 1 | 1
 3 | 3 | 3
(2 rows)

##################################################

TRUNCATE check_ign_err;
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 a 2
3 3 3
\.
WARNING:  COPY check_ign_err, line 2, column m: "a"
SELECT * FROM check_ign_err;
 n | m | k
---+---+---
 1 | 1 | 1
 3 | 3 | 3
(2 rows)



Regards, Damir

пт, 10 дек. 2021 г. в 21:48, Pavel Stehule <pavel.stehule@gmail.com>:


2014-12-26 11:41 GMT+01:00 Pavel Stehule <pavel.stehule@gmail.com>:


2014-12-25 22:23 GMT+01:00 Alex Shulgin <ash@commandprompt.com>:
Trent Shipley <trent_shipley@qwest.net> writes:

> On Friday 2007-12-14 16:22, Tom Lane wrote:
>> Neil Conway <neilc@samurai.com> writes:
>> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
>> > to drop (and log) rows that contain malformed data. That is, rows with
>> > too many or too few columns, rows that result in constraint violations,
>> > and rows containing columns where the data type's input function raises
>> > an error. The last case is the only thing that would be a bit tricky to
>> > implement, I think: you could use PG_TRY() around the InputFunctionCall,
>> > but I guess you'd need a subtransaction to ensure that you reset your
>> > state correctly after catching an error.
>>
>> Yeah.  It's the subtransaction per row that's daunting --- not only the
>> cycles spent for that, but the ensuing limitation to 4G rows imported
>> per COPY.
>
> You could extend the COPY FROM syntax with a COMMIT EVERY n clause.  This
> would help with the 4G subtransaction limit.  The cost to the ETL process is
> that a simple rollback would not be guaranteed send the process back to it's
> initial state.  There are easy ways to deal with the rollback issue though.
>
> A {NO} RETRY {USING algorithm} clause might be useful.   If the NO RETRY
> option is selected then the COPY FROM can run without subtransactions and in
> excess of the 4G per transaction limit.  NO RETRY should be the default since
> it preserves the legacy behavior of COPY FROM.
>
> You could have an EXCEPTIONS TO {filename|STDERR} clause. I would not give the
> option of sending exceptions to a table since they are presumably malformed,
> otherwise they would not be exceptions.  (Users should re-process exception
> files if they want an if good then table a else exception to table b ...)
>
> EXCEPTIONS TO and NO RETRY would be mutually exclusive.
>
>
>> If we could somehow only do a subtransaction per failure, things would
>> be much better, but I don't see how.

Hello,

Attached is a proof of concept patch for this TODO item.  There is no
docs yet, I just wanted to know if approach is sane.

The added syntax is like the following:

  COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout]

The way it's done it is abusing Copy Both mode and from my limited
testing, that seems to just work.  The error trapping itself is done
using PG_TRY/PG_CATCH and can only catch formatting or before-insert
trigger errors, no attempt is made to recover from a failed unique
constraint, etc.

Example in action:

postgres=# \d test_copy2
  Table "public.test_copy2"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 val    | integer |

postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout;
1
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 1: "1"
2
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 2: "2"
3
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 3: "3"
NOTICE:  total exceptions ignored: 3

postgres=# \d test_copy1
  Table "public.test_copy1"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null

postgres=# set client_min_messages to warning;
SET
postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout;
...
vmstat
zoneinfo
postgres=#

Limited performance testing shows no significant difference between
error-catching and plain code path.  For example, timing

  copy test_copy1 from program 'seq 1000000' [exceptions to stdout]

shows similar numbers with or without the added "exceptions to" clause.

Now that I'm sending this I wonder if the original comment about the
need for a subtransaction around every loaded line still holds.  Any
example of what would not be properly rolled back by just PG_TRY?

This method is unsafe: exception handlers usually don't free memory, so there is a risk of memory and resource leaks.

You can get almost the same performance with block subtransactions: when you use one subtransaction per 1000 rows, the overhead of subtransactions is minimal.

When a block fails, you can fall back to row-level subtransactions; this works well when you expect mostly correct data.
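
A rough sketch of that batching scheme (illustrative helper names, not a
working patch):

```
/*
 * Sketch: one subtransaction per batch; on failure, discard the batch
 * and retry row by row. insert_next_row() and replay_batch_by_row()
 * are hypothetical helpers.
 */
while (more_rows)
{
    BeginInternalSubTransaction(NULL);

    PG_TRY();
    {
        int     i;

        for (i = 0; i < 1000 && more_rows; i++)
            insert_next_row();
        /* whole batch succeeded: commit the subtransaction */
        ReleaseCurrentSubTransaction();
    }
    PG_CATCH();
    {
        /* batch failed: roll it back and retry one row at a time */
        RollbackAndReleaseCurrentSubTransaction();
        FlushErrorState();
        replay_batch_by_row();
    }
    PG_END_TRY();
}
```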

Two years ago I wrote an extension that did this, but I have not had time to finish it and push it upstream.

Regards

Pavel
 

 

Happy hacking!
--
Alex







Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Pavel Stehule
Date:
Hi

On Sat, 18 Dec 2021 at 9:55, Damir Belyalov <dam.bel07@gmail.com> wrote:
Hello.

I wrote a patch implementing COPY that ignores errors in rows, using block subtransactions.

It is great that you are working on this patch. Unfortunately, I am afraid this simple design is not optimal. Using a subtransaction for every row has too much overhead. I think it should use a subtransaction for blocks of rows (1000 rows), and only when there is an exception should it replay the inserts with a subtransaction per row. You should check the performance overhead.

Regards

Pavel



Syntax: COPY [table] FROM [file/stdin] WITH IGNORE_ERRORS;

Examples:
CREATE TABLE check_ign_err (n int, m int, k int);
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 2 2 2
3 3 3
\.
WARNING:  COPY check_ign_err, line 2: "2 2 2 2"
SELECT * FROM check_ign_err;
 n | m | k
---+---+---
 1 | 1 | 1
 3 | 3 | 3
(2 rows)

##################################################

TRUNCATE check_ign_err;
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 2
3 3 3
\.
WARNING:  COPY check_ign_err, line 2: "2 2"
SELECT * FROM check_ign_err;
 n | m | k
---+---+---
 1 | 1 | 1
 3 | 3 | 3
(2 rows)

##################################################

TRUNCATE check_ign_err;
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 a 2
3 3 3
\.
WARNING:  COPY check_ign_err, line 2, column m: "a"
SELECT * FROM check_ign_err;
 n | m | k
---+---+---
 1 | 1 | 1
 3 | 3 | 3
(2 rows)



Regards, Damir


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
Hi!

Improved my patch by adding block subtransactions.
The block size is determined by the REPLAY_BUFFER_SIZE parameter.
I used the idea of a buffer that accumulates tuples.
If we read REPLAY_BUFFER_SIZE rows without errors, the subtransaction is committed.
If we find an error, the subtransaction is rolled back and the tuples accumulated in the buffer are replayed.

In the patch REPLAY_BUFFER_SIZE equals 3, but it can be changed to any other number (for example 1000).
There is an idea to create a GUC parameter for it. 
Maybe also create a GUC parameter for the number of WARNINGs emitted for rows with errors.

For CIM_MULTI and CIM_MULTI_CONDITIONAL cases the buffer is not needed.
It is needed for the CIM_SINGLE case.

Tests:

-- CIM_MULTI case
CREATE TABLE check_ign_err (n int, m int, k int);
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
WARNING:  COPY check_ign_err, line 2: "2 2 2 2"
WARNING:  COPY check_ign_err, line 3: "3 3"
WARNING:  COPY check_ign_err, line 4, column n: "a"
WARNING:  COPY check_ign_err, line 5, column m: "b"
WARNING:  COPY check_ign_err, line 6, column n: ""
1 1 1
2 2 2 2
3 3
a 4 4
5 b b

7 7 7
\.
SELECT * FROM check_ign_err;
WARNING:  COPY check_ign_err, line 2: "2 2 2 2"
WARNING:  COPY check_ign_err, line 3: "3 3"
WARNING:  COPY check_ign_err, line 4, column n: "a"
WARNING:  COPY check_ign_err, line 5, column m: "b"
WARNING:  COPY check_ign_err, line 6, column n: ""
 n | m | k
---+---+---
 1 | 1 | 1
 7 | 7 | 7
(2 rows) 

##################################################


-- CIM_SINGLE case
-- BEFORE row trigger
CREATE TABLE trig_test(n int, m int);
CREATE FUNCTION fn_trig_before () RETURNS TRIGGER AS '
  BEGIN
    INSERT INTO trig_test VALUES(NEW.n, NEW.m);
    RETURN NEW;
  END;
' LANGUAGE plpgsql;
CREATE TRIGGER trig_before BEFORE INSERT ON check_ign_err
FOR EACH ROW EXECUTE PROCEDURE fn_trig_before();
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
WARNING:  COPY check_ign_err, line 2: "2 2 2 2"
WARNING:  COPY check_ign_err, line 3: "3 3"
WARNING:  COPY check_ign_err, line 4, column n: "a"
WARNING:  COPY check_ign_err, line 5, column m: "b"
WARNING:  COPY check_ign_err, line 6, column n: ""
1 1 1
2 2 2 2
3 3
a 4 4
5 b b

7 7 7
\.
SELECT * FROM check_ign_err;
 n | m | k
---+---+---
 1 | 1 | 1
 7 | 7 | 7
(2 rows)

##################################################

-- INSTEAD OF row trigger
CREATE VIEW check_ign_err_view AS SELECT * FROM check_ign_err;
CREATE FUNCTION fn_trig_instead_of () RETURNS TRIGGER AS '
  BEGIN
    INSERT INTO check_ign_err VALUES(NEW.n, NEW.m, NEW.k);
    RETURN NEW;
  END;
' LANGUAGE plpgsql;
CREATE TRIGGER trig_instead_of INSTEAD OF INSERT ON check_ign_err_view
FOR EACH ROW EXECUTE PROCEDURE fn_trig_instead_of();
COPY check_ign_err_view FROM STDIN WITH IGNORE_ERRORS;
WARNING:  COPY check_ign_err, line 2: "2 2 2 2"
WARNING:  COPY check_ign_err, line 3: "3 3"
WARNING:  COPY check_ign_err, line 4, column n: "a"
WARNING:  COPY check_ign_err, line 5, column m: "b"
WARNING:  COPY check_ign_err, line 6, column n: ""
SELECT * FROM check_ign_err;
1 1 1
2 2 2 2
3 3
a 4 4
5 b b

7 7 7
\.
SELECT * FROM check_ign_err_view;
 n | m | k
---+---+---
 1 | 1 | 1
 7 | 7 | 7
(2 rows)

##################################################

-- foreign table case in postgres_fdw extension

##################################################
-- volatile function in WHERE clause
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS
  WHERE n = floor(random()*(1-1+1))+1; /* find values equal 1 */

WARNING:  COPY check_ign_err, line 2: "2 2 2 2"
WARNING:  COPY check_ign_err, line 3: "3 3"
WARNING:  COPY check_ign_err, line 4, column n: "a"
WARNING:  COPY check_ign_err, line 5, column m: "b"
WARNING:  COPY check_ign_err, line 6, column n: ""
SELECT * FROM check_ign_err;
1 1 1
2 2 2 2
3 3
a 4 4
5 b b

7 7 7
\.
SELECT * FROM check_ign_err;
 n | m | k
---+---+---
 1 | 1 | 1
(1 row)

##################################################
-- CIM_MULTI_CONDITIONAL case
-- INSERT triggers for partition tables
CREATE TABLE check_ign_err (n int, m int, k int) PARTITION BY RANGE (n);
CREATE TABLE check_ign_err_part1 PARTITION OF check_ign_err
  FOR VALUES FROM (1) TO (4);
CREATE TABLE check_ign_err_part2 PARTITION OF check_ign_err
  FOR VALUES FROM (4) TO (8);
CREATE FUNCTION fn_trig_before_part () RETURNS TRIGGER AS '
  BEGIN
    INSERT INTO trig_test VALUES(NEW.n, NEW.m);
    RETURN NEW;
  END;
' LANGUAGE plpgsql;
CREATE TRIGGER trig_before_part BEFORE INSERT ON check_ign_err
FOR EACH ROW EXECUTE PROCEDURE fn_trig_before_part();
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;

WARNING:  COPY check_ign_err, line 2: "2 2 2 2"
WARNING:  COPY check_ign_err, line 3: "3 3"
WARNING:  COPY check_ign_err, line 4, column n: "a"
WARNING:  COPY check_ign_err, line 5, column m: "b"
WARNING:  COPY check_ign_err, line 6, column n: ""
SELECT * FROM check_ign_err;
1 1 1
2 2 2 2
3 3
a 4 4
5 b b

7 7 7
\.
 n | m | k
---+---+---
 1 | 1 | 1
 7 | 7 | 7
(2 rows)

Thanks for feedback.
Regards, Damir
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2022-07-19 21:40, Damir Belyalov wrote:
> Hi!
> 
> Improved my patch by adding block subtransactions.
> The block size is determined by the REPLAY_BUFFER_SIZE parameter.
> I used the idea of a buffer for accumulating tuples in it.
> If we read REPLAY_BUFFER_SIZE rows without errors, the subtransaction
> will be committed.
> If we find an error, the subtransaction will rollback and the buffer
> will be replayed containing tuples.

Thanks for working on this!

I tested 0002-COPY-IGNORE_ERRORS.patch and faced an unexpected behavior.

I loaded 10000 rows which contained 1 wrong row.
I expected I could see 9999 rows after COPY, but just saw 999 rows.

Since the number of loaded rows also changed when I changed 
MAX_BUFFERED_TUPLES from 1000 to other values, I imagine 
MAX_BUFFERED_TUPLES might be influencing this behavior.

```sh
$ cat /tmp/test10000.dat

1   aaa
2   aaa
3   aaa
4   aaa
5   aaa
6   aaa
7   aaa
8   aaa
9   aaa
10  aaa
11  aaa
...
9994    aaa
9995    aaa
9996    aaa
9997    aaa
9998    aaa
9999    aaa
xxx aaa
```

```SQL
=# CREATE TABLE test (id int, data text);

=# COPY test FROM '/tmp/test10000.dat' WITH (IGNORE_ERRORS);
WARNING:  COPY test, line 10000, column i: "xxx"
COPY 9999

=# SELECT COUNT(*) FROM test;
  count
-------
    999
(1 row)
```

BTW I may be overlooking it, but have you submit this proposal to the 
next CommitFest?

https://commitfest.postgresql.org/39/


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:

Thank you for feedback.
I improved my patch recently and tested it on different sizes of MAX_BUFFERED_TUPLES and REPLAY_BUFFER_SIZE.

> I loaded 10000 rows which contained 1 wrong row.
> I expected I could see 9999 rows after COPY, but just saw 999 rows.
I also reproduced your case and it worked correctly.

> BTW I may be overlooking it, but have you submit this proposal to the next CommitFest?
Good idea. Haven't done it yet.

Regards,
Damir
Postgres Professional
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2022-08-15 22:23, Damir Belyalov wrote:
>> I expected I could see 9999 rows after COPY, but just saw 999 rows.
> Also I implemented your case and it worked correctly.

Thanks for the new patch!

Here are some comments on it.

> +           if (safecstate->saved_tuples < REPLAY_BUFFER_SIZE)
> +           {
> +               valid_row = NextCopyFrom(cstate, econtext, values, 
> nulls);
> +               if (valid_row)
> +               {
> +                   /* Fill replay_buffer in oldcontext*/
> +                   MemoryContextSwitchTo(safecstate->oldcontext);
> +                   
> safecstate->replay_buffer[safecstate->saved_tuples++] = 
> heap_form_tuple(RelationGetDescr(cstate->rel), values, nulls);
> 
> +               /* Buffer was filled, commit subtransaction and prepare 
> to replay */
> +               ReleaseCurrentSubTransaction();

What is actually being committed by this ReleaseCurrentSubTransaction()?
It seems to me that just safecstate->replay_buffer is filled before 
this commit.
As a test, I rewrote this ReleaseCurrentSubTransaction() to 
RollbackAndReleaseCurrentSubTransaction() and COPYed over 1000 rows of 
data, but the same data were loaded.

> +#define            REPLAY_BUFFER_SIZE 1000

I feel it might be better to have it as a parameter rather than fixed at 
1000.

> +/*
> + * Analog of NextCopyFrom() but ignore rows with errors while copying.
> + */
> +static bool
> +safeNextCopyFrom(CopyFromState cstate, ExprContext *econtext, Datum 
> *values, bool *nulls)

NextCopyFrom() is in copyfromparse.c while safeNextCopyFrom() is in 
copyfrom.c.
Since safeNextCopyFrom() is analog of NextCopyFrom() as commented, would 
it be natural to put them in the same file?

> 188 +               safecstate->errors++;
> 189 +               if (safecstate->errors <= 100)
> 190 +                   ereport(WARNING,
> 191 +                           (errcode(errdata->sqlerrcode),
> 192 +                           errmsg("%s", errdata->context)));

It would be better to state in the manual that errors exceeding 100 are 
not displayed.
Or, it might be acceptable to output all errors that exceed 100.

> +typedef struct SafeCopyFromState
> +{
> +#define            REPLAY_BUFFER_SIZE 1000
> +   HeapTuple       replay_buffer[REPLAY_BUFFER_SIZE]; /* accumulates 
> tuples for replaying it after an error */
> +   int             saved_tuples;               /* # of tuples in 
> replay_buffer */
> +   int             replayed_tuples;            /* # of tuples was 
> replayed from buffer */
> +   int             errors;                     /* total # of errors */
> +   bool            replay_is_active;
> +   bool            begin_subtransaction;
> +   bool            processed_remaining_tuples; /* for case of 
> replaying last tuples */
> +   bool            skip_row;

It would be helpful to add comments about skip_row, etc.

```
$ git apply ../patch/0003-COPY_IGNORE_ERRORS.patch

../patch/0003-COPY_IGNORE_ERRORS.patch:86: indent with spaces.
                             Datum *values, bool *nulls);
warning: 1 line adds whitespace errors.
```

There was a warning when applying the patch.


```
=# copy test from '/tmp/10000.data' with (ignore_errors);
WARNING:  FIND 0 ERRORS
COPY 1003
```

When there were no errors, this WARNING seems not necessary.


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
> +               /* Buffer was filled, commit subtransaction and prepare to replay */
> +               ReleaseCurrentSubTransaction();
> What is actually being committed by this ReleaseCurrentSubTransaction()?
> It seems to me that just safecstate->replay_buffer is filled before
> this commit.
All tuples are collected in replay_buffer, which is created in the ''oldcontext'' of CopyFrom() (not the context of a subtransaction). That's why it wasn't cleaned up when you used RollbackAndReleaseCurrentSubTransaction().
Subtransactions are needed for better safety. There is no error when copying from a file into the replay_buffer, but an error may occur at the next stage - when adding a tuple to the table. There may also be other errors that are not obvious at first glance.
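
In code terms the pattern is roughly (simplified from the patch):

```
/* Buffer entries are built in CopyFrom()'s own context, so a
 * subtransaction abort does not free them. */
MemoryContextSwitchTo(safecstate->oldcontext);
safecstate->replay_buffer[safecstate->saved_tuples++] =
    heap_form_tuple(RelationGetDescr(cstate->rel), values, nulls);
MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
```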

> I feel it might be better to have it as a parameter rather than fixed at
> 1000.
Yes, I thought about it too, so I created another version with a GUC parameter - replay_buffer_size. Attached it. But I think users won't need to change replay_buffer_size.
Also, replay_buffer does the same thing the MultiInsert buffer does, and the MultiInsert buffer size is defined by a constant = 1000.

> NextCopyFrom() is in copyfromparse.c while safeNextCopyFrom() is in
> copyfrom.c.
> Since safeNextCopyFrom() is analog of NextCopyFrom() as commented, would
> it be natural to put them in the same file?
Sure, corrected it. 
 
> 188 +               safecstate->errors++;
> 189 +               if (safecstate->errors <= 100)
> 190 +                   ereport(WARNING,
> 191 +                           (errcode(errdata->sqlerrcode),
> 192 +                           errmsg("%s", errdata->context)));

> It would be better to state in the manual that errors exceeding 100 are
> not displayed.
> Or, it might be acceptable to output all errors that exceed 100.
It'll be too complicated to create a new parameter just for showing the given number of errors. I think 100 is an optimal size.
 
> +typedef struct SafeCopyFromState
> +{
> +#define            REPLAY_BUFFER_SIZE 1000
> +   HeapTuple       replay_buffer[REPLAY_BUFFER_SIZE]; /* accumulates
> tuples for replaying it after an error */
> +   int             saved_tuples;               /* # of tuples in
> replay_buffer */
> +   int             replayed_tuples;            /* # of tuples was
> replayed from buffer */
> +   int             errors;                     /* total # of errors */
> +   bool            replay_is_active;
> +   bool            begin_subtransaction;
> +   bool            processed_remaining_tuples; /* for case of
> replaying last tuples */
> +   bool            skip_row;

> It would be helpful to add comments about skip_row, etc.
Corrected it.
 
> WARNING:  FIND 0 ERRORS
> When there were no errors, this WARNING seems not necessary.
Corrected it. 

Added to this patch the processing of other errors and constraints, plus tests for them.
I had to create another function, safeExecConstraints(), just for processing constraints, because ExecConstraints() runs after NextCopyFrom() and is not inside PG_TRY. This complicated the code a bit.
Maybe a good approach would be to create a new function SafeCopyFrom() and do all the ''safe copying'' in PG_TRY, but it would almost duplicate the CopyFrom() code.
Or maybe create a function just for the for(;;) loop, but then we have the same problem of duplicating code and passing a lot of variables (created at the beginning of CopyFrom()) to that function.

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2022-08-25 01:54, Damir Belyalov wrote:
>>> +               /* Buffer was filled, commit subtransaction and
>> prepare to replay */
>>> +               ReleaseCurrentSubTransaction();
>> What is actually being committed by this
>> ReleaseCurrentSubTransaction()?
>> It seems to me that just safecstate->replay_buffer is fulfilled
>> before
>> this commit.
> 
> All tuples are collected in replay_buffer, which is created in
> ''oldcontext'' of CopyFrom() (not context of a subtransaction). That's
> why it didn't clean up when you used
> RollbackAndReleaseCurrentSubTransaction().
> Subtransactions are needed for better safety. There is no error when
> copying from a file to the replay_buffer, but an error may occur at
> the next stage - when adding a tuple to the table. Also there may be
> other errors that are not obvious at first glance.

Thanks for the explanation and updating patch.
I now understand that the data being COPYed are not the target of 
COMMIT.

Although in past discussions[1] the data being COPYed also seem to have 
been subject to COMMIT, I understand this patch has adopted another 
design.

350 +               /* Buffer was filled, commit subtransaction and 
prepare to replay */
351 +               ReleaseCurrentSubTransaction();
352 +               CurrentResourceOwner = safecstate->oldowner;
353 +
354 +               safecstate->replay_is_active = true;

BTW in v4 patch, data are loaded into the buffer one by one, and when 
the buffer fills up, the data in the buffer are 'replayed' also one by 
one, right?
Wouldn't this have more overhead than a normal COPY?

As a test, I COPYed slightly larger data with and without ignore_errors 
option.
There might be other reasons, but I found a performance difference.

```
=# copy test from '/tmp/10000000.data' ;
COPY 10000000
Time: 6001.325 ms (00:06.001)

=# copy test from '/tmp/10000000.data' with (ignore_errors);
NOTICE:  FIND 0 ERRORS
COPY 10000000
Time: 7711.555 ms (00:07.712)
```

>> I feel it might be better to have it as a parameter rather than
>> fixed at
>> 1000.
> 
> Yes, I thought about it too. So I created another version with the GUC
> parameter - replay_buffer_size. Attached it. But I think users won't
> need to change replay_buffer_size.
> Also replay_buffer does the same thing as MultiInsert buffer does and
> MultiInsert buffer defined by const = 1000.

Yeah, when the data being COPYed are not the target of COMMIT, I also 
think users won't need to change it.
> 
>> NextCopyFrom() is in copyfromparse.c while safeNextCopyFrom() is in
>> copyfrom.c.
>> Since safeNextCopyFrom() is analog of NextCopyFrom() as commented,
>> would
>> it be natural to put them in the same file?
> 
> Sure, corrected it.

Thanks.

> 
>>> 188 +               safecstate->errors++;
>>> 189 +               if (safecstate->errors <= 100)
>>> 190 +                   ereport(WARNING,
>>> 191 +                           (errcode(errdata->sqlerrcode),
>>> 192 +                           errmsg("%s", errdata->context)));
>> 
>> It would be better to state in the manual that errors exceeding 100
>> are
>> not displayed.
>> Or, it might be acceptable to output all errors that exceed 100.
> 
> It'll be too complicated to create a new parameter just for showing
> the given number of errors. I think 100 is an optimal size.

Yeah, I may have misled you, but I also don't think this needs a new 
parameter.
I just thought 'either' of the following would be better:
- Put in the documentation that the warnings will not be output for more 
than 101 cases.
- Output all the warnings.

>> It would be helpful to add comments about skip_row, etc.
> 
> Corrected it.

Thanks.

> 
>> WARNING:  FIND 0 ERRORS
>> When there were no errors, this WARNING seems not necessary.
> 
> Corrected it.

Thanks.


I applied the v4 patch, and when I canceled the COPY, there was a case 
where I found myself left in a transaction.
Should this situation be prevented from occurring?

```
=# copy test from '/tmp/10000000.data' with (ignore_errors );

^CCancel request sent
ERROR:  canceling statement due to user request

=# truncate test;
ERROR:  current transaction is aborted, commands ignored until end of 
transaction block
```


[1] 
https://www.postgresql.org/message-id/1197677930.1536.18.camel%40dell.linuxdev.us.dell.com

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
Thank you for reviewing.
In the previous patch there was an error when processing constraints. The patch was fixed, but the code grew and became more complicated (0005-COPY_IGNORE_ERRORS). I also simplified the logic of safeNextCopyFrom().
You asked why we need subtransactions; the answer is in this patch. When processing a row that does not satisfy constraints or INSTEAD OF triggers, it is necessary to roll back the subtransaction and return the table to its original state.
Because of the complexity, I had to abandon constraint and trigger processing and handle only errors that occur when reading the file. Attaching a simplified patch (0006-COPY_IGNORE_ERRORS).
Checked these patches against all COPY regress tests and they worked correctly.
 
> BTW in v4 patch, data are loaded into the buffer one by one, and when
> the buffer fills up, the data in the buffer are 'replayed' also one by
> one, right?
> Wouldn't this have more overhead than a normal COPY?
The data is loaded into the buffer one by one, but in "replay mode" ignore_errors works like standard COPY: tuples are added to the slot, and depending on the insert method (CIM_SINGLE or CIM_MULTI) are replayed in the corresponding way.
For the 0006 patch you can imagine that we divide the for(;;) loop into 2 parts: the first part adds tuples to the buffer and the second part inserts the tuples into the table. These parts don't intersect with each other and are executed sequentially.
The main idea of replay_buffer is that it is needed for the CIM_SINGLE case. You can implement the CIM_SINGLE case and see that tuples read before an error occurs don't get added to the table. The logic of the 0005 patch is similar, with some differences.
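
Schematically (hypothetical helper names, not the patch's actual functions):

```
/* Conceptual shape of the 0006 loop: the two phases run sequentially
 * and do not intersect. fill_replay_buffer() and
 * replay_buffered_tuples() are hypothetical helpers. */
for (;;)
{
    /* phase 1: read rows into replay_buffer inside a subtransaction */
    if (!fill_replay_buffer(cstate))
        break;                  /* no more input */

    /* phase 2: insert the buffered tuples through the normal COPY
     * path (CIM_SINGLE or CIM_MULTI, as in standard COPY) */
    replay_buffered_tuples(cstate);
}
```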

> As a test, I COPYed slightly larger data with and without ignore_errors
> option.
> There might be other reasons, but I found a performance difference.
Tried to reduce the performance difference by cleaning up replay_buffer, resetting a new memory context created for it (replay_cxt).
```
Before:
Without ignore_errors:
COPY 10000000
Time: 15538,579 ms (00:15,539)
With ignore_errors:
COPY 10000000
Time: 21289,121 ms (00:21,289)

After:
Without ignore_errors:
COPY 10000000
Time: 15318,922 ms (00:15,319)
With ignore_errors:
COPY 10000000
Time: 19868,175 ms (00:19,868)
```

> - Put in the documentation that the warnings will not be output for more
> than 101 cases.
Yeah, I pointed it out in the doc.
 
> I applied the v4 patch, and when I canceled the COPY, there was a case
> where I found myself left in a transaction.
> Should this situation be prevented from occurring?
>
> ```
> =# copy test from '/tmp/10000000.data' with (ignore_errors );
>
> ^CCancel request sent
> ERROR:  canceling statement due to user request
>
> =# truncate test;
> ERROR:  current transaction is aborted, commands ignored until end of
> transaction block
> ```
Tried to reproduce your error and could not. The result was the same as with plain COPY FROM. 
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2022-09-21 21:11, Damir Belyalov wrote:
Thanks for updating the patch.

> In the previous patch there was an error when processing constraints.
> The patch was fixed, but the code grew up and became more complicated
> (0005-COPY_IGNORE_ERRORS). I also simplified the logic of
> safeNextCopyFrom().
> You asked why we need subtransactions, so the answer is in this patch.
> When processing a row that does not satisfy constraints or INSTEAD OF
> triggers, it is necessary to rollback the subtransaction and return
> the table to its original state.
> Cause of complexity, I had to abandon the constraints, triggers
> processing in and handle only errors that occur when reading the file.
> Attaching simplified patch (0006-COPY_IGNORE_ERRORS).

Do you mean you stop dealing with errors concerned with constraints and 
triggers and we should review 0006-COPY_IGNORE_ERRORS?

> Tried to implement your error and could not. The result was the same
> as COPY FROM implements.

Hmm, I applied the v6 patch, and when I canceled COPY by sending SIGINT (Ctrl + 
C), I faced the same situation as below.
I tested it on CentOS 8.4.

   =# COPY test FROM '/home/tori/pgsql/master/10000000.data' WITH 
(IGNORE_ERRORS);
   ^CCancel request sent
   ERROR:  canceling statement due to user request
   CONTEXT:  COPY test, line 628000: "628000       xxx"

   =# SELECT 1;
   ERROR:  current transaction is aborted, commands ignored until end of 
transaction block

  =# ABORT;
   FATAL:  UserAbortTransactionBlock: unexpected state STARTED
   server closed the connection unexpectedly
           This probably means the server terminated abnormally
           before or while processing the request.
   The connection to the server was lost. Attempting reset: Succeeded.

I did the same procedure on COPY FROM without IGNORE_ERRORS and didn't 
face this situation.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
> Do you mean you stop dealing with errors concerned with constraints and
> triggers and we should review 0006-COPY_IGNORE_ERRORS?
Yes, this patch is simpler and I think it's worth adding it first.
 
> Hmm, I applied the v6 patch, and when I canceled COPY by sending SIGINT (Ctrl +
> C), I faced the same situation as below.
> I tested it on CentOS 8.4.
Thank you for pointing out this error; it really needs to be taken into account. In the previous 0006 patch, there were 2 stages in one subtransaction - filling the buffer and 'replay mode' (reading from the buffer). Since only NextCopyFrom() is included in PG_TRY(), and the ERRCODE_QUERY_CANCELED error can occur anywhere, it is impossible to catch it and roll back the subtransaction.

I changed the 0006 patch and fixed this error; now only the 'replay buffer filling' is done in the subtransaction.

Patch 0005 (which processed constraints) needs to be finalized, because it requires subtransactions to roll back constraints and triggers; therefore it is not possible to fix it yet. One option is to put the for(;;) loop in PG_TRY. That would solve the problem, but the code would be too complicated.
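
That option would look roughly like this (sketch only; the recovery logic
is illustrative):

```
/* Wrap the whole loop so any error, including a query cancel,
 * unwinds through a single recovery point. */
PG_TRY();
{
    for (;;)
    {
        /* read, convert and insert rows, managing subtransactions */
    }
}
PG_CATCH();
{
    /* clean up the open subtransaction, then decide whether to
     * skip the row or re-throw */
    RollbackAndReleaseCurrentSubTransaction();
    PG_RE_THROW();
}
PG_END_TRY();
```
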
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
Updated the patch due to conflicts when applying to master.
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
Updated the patch:
- Optimized and simplified logic of IGNORE_ERRORS
- Changed variable names to more understandable ones
- Added an analogue of MAX_BUFFERED_BYTES for safe buffer


Regards,
Damir Belyalov
Postgres Professional
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Nikita Malakhov
Date:
Hi Damir!

Your work looks like a very promising feature for production systems,
where data often needs to be loaded from external sources.

I've looked over the discussion and want to make a proposal:
when we load a bunch of records into the database, it does not make sense
to output errors to the command output, and it does not make sense to limit
error output to any number at all, because if we decided to load the data
anyway, we would want a list (a file) of all records that were
discarded because of errors, with the related error information, so we can,
say, deal with the errors and process these records later. It looks like a
reasonable addition to your patch.

As command output, a limited number of error messages means much
less than overall stats: records processed, records loaded,
records discarded, total number of errors.

For example, you can look at the Oracle SQL*Loader feature; I hope this could
give some ideas for further improvements.

On Wed, Nov 2, 2022 at 11:46 AM Damir Belyalov <dam.bel07@gmail.com> wrote:
Updated the patch:
- Optimized and simplified logic of IGNORE_ERRORS
- Changed variable names to more understandable ones
- Added an analogue of MAX_BUFFERED_BYTES for safe buffer


Regards,
Damir Belyalov
Postgres Professional


--
Regards,
Nikita Malakhov
Postgres Professional 

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Danil Anisimow
Date:
Hi!

I have looked at your patch and have a few questions.

110: static bool SafeCopying(CopyFromState cstate, ExprContext *econtext,
111: TupleTableSlot *myslot);
---
636: bool
637: SafeCopying(CopyFromState cstate, ExprContext *econtext, TupleTableSlot *myslot)

Why is there no static keyword in the definition of the SafeCopying() function when it appears in the function declaration?

675: MemoryContext cxt = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
676:
677: valid_row = NextCopyFrom(cstate, econtext, myslot->tts_values, myslot->tts_isnull);
678: tuple_is_valid = valid_row;
679:
680: if (valid_row)
681: sfcstate->safeBufferBytes += cstate->line_buf.len;
682:
683: CurrentMemoryContext = cxt;

Why are you using a direct assignment to CurrentMemoryContext instead of using the MemoryContextSwitchTo function in the SafeCopying() routine?

1160: /* Standard copying with option "safe copying" enabled by IGNORE_ERRORS. */
1161: if (!SafeCopying(cstate, econtext, myslot))
1162: break;

I checked with GDB that the CurrentMemoryContext changes when SafeCopying returns. And the target context may be different each time you do a COPY in psql.

1879: cstate->sfcstate->safe_cxt = AllocSetContextCreate(oldcontext,
1880: "Safe_context",
1881: ALLOCSET_DEFAULT_SIZES);

When you initialize the cstate->sfcstate structure, you create a cstate->sfcstate->safe_cxt memory context that inherits from oldcontext.
Was it intended to use cstate->copycontext as the parent context here?

On Wed, Nov 2, 2022 at 11:46 AM Damir Belyalov <dam.bel07@gmail.com> wrote:
Updated the patch:
- Optimized and simplified logic of IGNORE_ERRORS
- Changed variable names to more understandable ones
- Added an analogue of MAX_BUFFERED_BYTES for safe buffer


Regards,
Damir Belyalov
Postgres Professional

Regards,
Daniil Anisimov
Postgres Professional

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
Hi, Danil and Nikita!
Thank you for reviewing.

> Why is there no static keyword in the definition of the SafeCopying() function when it appears in the function declaration?
Corrected this.

> 675: MemoryContext cxt = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
> 676:
> 677: valid_row = NextCopyFrom(cstate, econtext, myslot->tts_values, myslot->tts_isnull);
> 678: tuple_is_valid = valid_row;
> 679:
> 680: if (valid_row)
> 681: sfcstate->safeBufferBytes += cstate->line_buf.len;
> 682:
> 683: CurrentMemoryContext = cxt;
>
> Why are you using a direct assignment to CurrentMemoryContext instead of using the MemoryContextSwitchTo function in the SafeCopying() routine?
 
"CurrentMemoryContext = cxt" is the same as "MemoryContextSwitchTo(cxt)", you can see it in the implementation of MemoryContextSwitchTo(). Also correct this.


> When you initialize the cstate->sfcstate structure, you create a cstate->sfcstate->safe_cxt memory context that inherits from oldcontext.
> Was it intended to use cstate->copycontext as the parent context here?
 
Good remark, correct this.



Thanks, Nikita Malakhov, for the advice to create a file with the errors. But I decided to log errors to the system logfile and not print them to the terminal. The error output in the logfile is rather simple - only the error context is logged (maybe it's better to log all the error information?).
There are 2 reasons why logging errors to the logfile is better than logging them to a separate file (e.g. PGDATA/copy_ignore_errors.txt). The user is used to looking for errors in the logfile. Creating another file raises questions like: 'what file name to create?', 'do we need file rotation?', 'where should this file be created?' (we can't create it in PGDATA because of memory constraints).



Regards,
Damir Belyalov
Postgres Professional
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Andres Freund
Date:
Hi,

On 2023-02-03 13:27:24 +0300, Damir Belyalov wrote:
> @@ -625,6 +628,173 @@ CopyMultiInsertInfoStore(CopyMultiInsertInfo *miinfo, ResultRelInfo *rri,
>      miinfo->bufferedBytes += tuplen;
>  }
>  
> +/*
> + * Safely reads source data, converts to a tuple and fills tuple buffer.
> + * Skips some data in the case of failed conversion if data source for
> + * a next tuple can be surely read without a danger.
> + */
> +static bool
> +SafeCopying(CopyFromState cstate, ExprContext *econtext, TupleTableSlot *myslot)


> +    BeginInternalSubTransaction(NULL);
> +    CurrentResourceOwner = sfcstate->oldowner;

I don't think this is the right approach. Creating a subtransaction for
each row will cause substantial performance issues.

We now can call data type input functions without throwing errors, see
InputFunctionCallSafe(). Use that to avoid throwing an error instead of
catching it.
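
For reference, the soft-error calling convention looks roughly like this
(a sketch; the counter and reset handling are illustrative):

```
/* On bad input, InputFunctionCallSafe() returns false and records the
 * details in escontext instead of throwing an error. */
ErrorSaveContext escontext = {T_ErrorSaveContext};

escontext.details_wanted = true;

if (!InputFunctionCallSafe(&in_functions[m], string,
                           typioparams[m], att->atttypmod,
                           (Node *) &escontext, &values[m]))
{
    skipped_rows++;                     /* illustrative counter */
    escontext.error_occurred = false;   /* reset for the next row */
    continue;
}
```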

Greetings,

Andres Freund



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
Hi, Andres!

Thank you for reviewing.
 
> I don't think this is the right approach. Creating a subtransaction for
> each row will cause substantial performance issues.

Subtransactions aren't created for each row. The block of rows in one subtransaction is 1000 (SAFE_BUFFER_SIZE) and can be changed. There is also a limit, MAX_SAFE_BUFFER_BYTES, on the number of bytes in safe_buffer:
 while (sfcstate->saved_tuples < SAFE_BUFFER_SIZE &&
     sfcstate->safeBufferBytes < MAX_SAFE_BUFFER_BYTES)
 


> We now can call data type input functions without throwing errors, see
> InputFunctionCallSafe(). Use that to avoid throwing an error instead of
> catching it.
InputFunctionCallSafe() is good for detecting errors from input functions, but there are errors from NextCopyFrom() that cannot be detected with InputFunctionCallSafe(), e.g. "wrong number of columns in row". Do you propose to process input-function errors separately from other errors? Right now all errors are processed in one "switch" loop in PG_CATCH, so this change could complicate the code.



Regards,
Damir Belyalov
Postgres Professional

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Tom Lane
Date:
Damir Belyalov <dam.bel07@gmail.com> writes:
>> I don't think this is the right approach. Creating a subtransaction for
>> each row will cause substantial performance issues.

> Subtransactions aren't created for each row. The block of rows in one
> subtransaction is 1000 (SAFE_BUFFER_SIZE) and can be changed.

I think that at this point, any patch that involves adding subtransactions
to COPY is dead on arrival; whether it's batched or not is irrelevant.
(It's not like batching has no downsides.)

> InputFunctionCallSafe() is good for detecting errors from input-functions
> but there are such errors from NextCopyFrom () that can not be detected
> with InputFunctionCallSafe(), e.g. "wrong number of columns in row''.

If you want to deal with those, then there's more work to be done to make
those bits non-error-throwing.  But there's a very finite amount of code
involved and no obvious reason why it couldn't be done.  The major problem
here has always been the indefinite amount of code implicated by calling
datatype input functions, and we have now created a plausible answer to
that problem.

            regards, tom lane



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Andres Freund
Date:
Hi,

On February 5, 2023 9:12:17 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Damir Belyalov <dam.bel07@gmail.com> writes:
>>> I don't think this is the right approach. Creating a subtransaction for
>>> each row will cause substantial performance issues.
>
>> Subtransactions aren't created for each row. The block of rows in one
>> subtransaction is 1000 (SAFE_BUFFER_SIZE) and can be changed.
>
>I think that at this point, any patch that involves adding subtransactions
>to COPY is dead on arrival; whether it's batched or not is irrelevant.
>(It's not like batching has no downsides.)

Indeed.

>> InputFunctionCallSafe() is good for detecting errors from input-functions
>> but there are such errors from NextCopyFrom () that can not be detected
>> with InputFunctionCallSafe(), e.g. "wrong number of columns in row''.
>
>If you want to deal with those, then there's more work to be done to make
>those bits non-error-throwing.  But there's a very finite amount of code
>involved and no obvious reason why it couldn't be done.  The major problem
>here has always been the indefinite amount of code implicated by calling
>datatype input functions, and we have now created a plausible answer to
>that problem.

I'm not even sure it makes sense to avoid that kind of error. An invalid column count or such is something quite
different than failing some data type input routine, or failing a constraint.



--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On February 5, 2023 9:12:17 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Damir Belyalov <dam.bel07@gmail.com> writes:
>>> InputFunctionCallSafe() is good for detecting errors from input-functions
>>> but there are such errors from NextCopyFrom () that can not be detected
>>> with InputFunctionCallSafe(), e.g. "wrong number of columns in row''.

>> If you want to deal with those, then there's more work to be done to make
>> those bits non-error-throwing.  But there's a very finite amount of code
>> involved and no obvious reason why it couldn't be done.

> I'm not even sure it makes sense to avoid that kind of error. And
> invalid column count or such is something quite different than failing
> some data type input routine, or failing a constraint.

I think it could be reasonable to put COPY's overall-line-format
requirements on the same level as datatype input format violations.
I agree that trying to trap every kind of error is a bad idea,
for largely the same reason that the soft-input-errors patches
only trap certain kinds of errors: it's too hard to tell whether
an error is an "internal" error that it's scary to continue past.

            regards, tom lane



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2023-02-06 15:00, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On February 5, 2023 9:12:17 PM PST, Tom Lane <tgl@sss.pgh.pa.us> 
>> wrote:
>>> Damir Belyalov <dam.bel07@gmail.com> writes:
>>>> InputFunctionCallSafe() is good for detecting errors from 
>>>> input-functions
>>>> but there are such errors from NextCopyFrom () that can not be 
>>>> detected
>>>> with InputFunctionCallSafe(), e.g. "wrong number of columns in 
>>>> row''.
> 
>>> If you want to deal with those, then there's more work to be done to 
>>> make
>>> those bits non-error-throwing.  But there's a very finite amount of 
>>> code
>>> involved and no obvious reason why it couldn't be done.
> 
>> I'm not even sure it makes sense to avoid that kind of error. And
>> invalid column count or such is something quite different than failing
>> some data type input routine, or falling a constraint.
> 
> I think it could be reasonable to put COPY's overall-line-format
> requirements on the same level as datatype input format violations.
> I agree that trying to trap every kind of error is a bad idea,
> for largely the same reason that the soft-input-errors patches
> only trap certain kinds of errors: it's too hard to tell whether
> an error is an "internal" error that it's scary to continue past.

Is it a bad idea to limit the scope of allowing errors to 'soft' errors 
in InputFunctionCallSafe()?

I think it could be still useful for some usecases.

   diff --git a/src/test/regress/sql/copy2.sql 
b/src/test/regress/sql/copy2.sql

   +-- tests for IGNORE_DATATYPE_ERRORS option
   +CREATE TABLE check_ign_err (n int, m int[], k int);
   +COPY check_ign_err FROM STDIN WITH IGNORE_DATATYPE_ERRORS;
   +1  {1} 1
   +a  {2} 2
   +3  {3} 3333333333
   +4  {a, 4}  4
   +
   +5  {5} 5
   +\.
   +SELECT * FROM check_ign_err;

   diff --git a/src/test/regress/expected/copy2.out 
b/src/test/regress/expected/copy2.out
   index 090ef6c7a8..08e8056fc1 100644

   +-- tests for IGNORE_DATATYPE_ERRORS option
   +CREATE TABLE check_ign_err (n int, m int[], k int);
   +COPY check_ign_err FROM STDIN WITH IGNORE_DATATYPE_ERRORS;
   +WARNING:  invalid input syntax for type integer: "a"
   +WARNING:  value "3333333333" is out of range for type integer
   +WARNING:  invalid input syntax for type integer: "a"
   +WARNING:  invalid input syntax for type integer: ""
   +SELECT * FROM check_ign_err;
   + n |  m  | k
   +---+-----+---
   + 1 | {1} | 1
   + 5 | {5} | 5
   +(2 rows)

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
Hello

Tested the patch on all cases: CIM_SINGLE, CIM_MULTI, CIM_MULTI_CONDITION. As expected, it works.
Also added a description to copy.sgml and made a review on patch.

I added an 'ignored_errors' integer parameter that is output after the operation finishes.
All errors are logged to the system logfile with full detailed context. Maybe it's better to log only the error message.



Regards, Damir Belyalov
Postgres Professional
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Daniel Gustafsson
Date:
> On 28 Feb 2023, at 15:28, Damir Belyalov <dam.bel07@gmail.com> wrote:

> Tested patch on all cases: CIM_SINGLE, CIM_MULTI, CIM_MULTI_CONDITION. As expected it works.
> Also added a description to copy.sgml and made a review on patch.
>
> I added 'ignored_errors' integer parameter that should be output after the option is finished.
> All errors were added to the system logfile with full detailed context. Maybe it's better to log only error message.

FWIW, Greenplum has a similar construct (but which also logs the errors in the
db) where data type errors are skipped as long as the number of errors doesn't
exceed a reject limit.  If the reject limit is reached then the COPY fails:

    LOG ERRORS [ SEGMENT REJECT LIMIT <count> [ ROWS | PERCENT ]]

IIRC the gist of this was to catch when the user copies the wrong input data or
simply has a broken file.  Rather than finding out after copying n rows which
are likely to be garbage, the process can be restarted.

This version of the patch has a compiler error in the error message:

copyfrom.c: In function ‘CopyFrom’:
copyfrom.c:1008:29: error: format ‘%ld’ expects argument of type ‘long int’, but argument 2 has type ‘uint64’ {aka
‘long long unsigned int’} [-Werror=format=]
1008 | ereport(WARNING, errmsg("Errors: %ld", cstate->ignored_errors));
     |                          ^~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~
     |                                              |
     |                                              uint64 {aka long long unsigned int}


On that note though, it seems to me that this error message leaves a bit to be
desired with regards to the level of detail.

--
Daniel Gustafsson




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2023-03-06 23:03, Daniel Gustafsson wrote:
>> On 28 Feb 2023, at 15:28, Damir Belyalov <dam.bel07@gmail.com> wrote:
> 
>> Tested patch on all cases: CIM_SINGLE, CIM_MULTI, CIM_MULTI_CONDITION. 
>> As expected it works.
>> Also added a description to copy.sgml and made a review on patch.
Thanks for your tests and improvements!

>> I added 'ignored_errors' integer parameter that should be output after 
>> the option is finished.
>> All errors were added to the system logfile with full detailed 
>> context. Maybe it's better to log only error message.
Certainly.

> FWIW, Greenplum has a similar construct (but which also logs the errors 
> in the
> db) where data type errors are skipped as long as the number of errors 
> don't
> exceed a reject limit.  If the reject limit is reached then the COPY 
> fails:
> 
>     LOG ERRORS [ SEGMENT REJECT LIMIT <count> [ ROWS | PERCENT ]]
> 
> IIRC the gist of this was to catch when the user copies the wrong input 
> data or
> plain has a broken file.  Rather than finding out after copying n rows 
> which
> are likely to be garbage the process can be restarted.
> 
> This version of the patch has a compiler error in the error message:
> 
> copyfrom.c: In function ‘CopyFrom’:
> copyfrom.c:1008:29: error: format ‘%ld’ expects argument of type ‘long
> int’, but argument 2 has type ‘uint64’ {aka ‘long long unsigned int’}
> [-Werror=format=]
> 1008 | ereport(WARNING, errmsg("Errors: %ld", cstate->ignored_errors));
>      |                          ^~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~
>      |                                              |
>      |                                              uint64 {aka long
> long unsigned int}
> 
> 
> On that note though, it seems to me that this error message leaves a 
> bit to be
> desired with regards to the level of detail.
+1.
I felt just logging "Error: %ld" would make people wonder the meaning of 
the %ld. Logging something like "Error: %ld data type errors were 
found" might be clearer.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
> FWIW, Greenplum has a similar construct (but which also logs the errors in the
> db) where data type errors are skipped as long as the number of errors doesn't
> exceed a reject limit.  If the reject limit is reached then the COPY fails:
>
>       LOG ERRORS [ SEGMENT REJECT LIMIT <count> [ ROWS | PERCENT ]]
>
> IIRC the gist of this was to catch when the user copies the wrong input data or
> simply has a broken file.  Rather than finding out after copying n rows which
> are likely to be garbage, the process can be restarted.
 
I think this is a matter for discussion. The same question is: "Where to log errors - to separate files or to the system logfile?".
IMO it's better to log a short, detailed error message to the system logfile and not output errors to the terminal. 
 

> This version of the patch has a compiler error in the error message:
Yes, corrected it. Changed "ignored_errors" to int64 because "processed" (used for counting copied rows) is int64.


> I felt just logging "Error: %ld" would make people wonder the meaning of
> the %ld. Logging something like "Error: %ld data type errors were
> found" might be clearer.
 
Thanks. For more clarity, changed the message to: "Errors were found: %". 

Regards, Damir Belyalov
Postgres Professional
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Daniel Gustafsson
Date:
> On 7 Mar 2023, at 09:35, Damir Belyalov <dam.bel07@gmail.com> wrote:

> I felt just logging "Error: %ld" would make people wonder the meaning of
> the %ld. Logging something like ""Error: %ld data type errors were
> found" might be clearer.
>
> Thanks. For more clearance change the message to: "Errors were found: %".

I'm not convinced that this adds enough clarity to assist the user.  We also
shouldn't use "error" in a WARNING log since the user has explicitly asked to
skip rows on error, so it's not an error per se. How about something like:

  ereport(WARNING,
          (errmsg("%ld rows were skipped due to data type incompatibility", cstate->ignored_errors),
           errhint("Skipped rows can be inspected in the database log for reprocessing.")));

--
Daniel Gustafsson




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2023-03-07 18:09, Daniel Gustafsson wrote:
>> On 7 Mar 2023, at 09:35, Damir Belyalov <dam.bel07@gmail.com> wrote:
> 
>> I felt just logging "Error: %ld" would make people wonder the meaning 
>> of
>> the %ld. Logging something like ""Error: %ld data type errors were
>> found" might be clearer.
>> 
>> Thanks. For more clearance change the message to: "Errors were found: 
>> %".
> 
> I'm not convinced that this adds enough clarity to assist the user.  We 
> also
> shouldn't use "error" in a WARNING log since the user has explicitly 
> asked to
> skip rows on error, so it's not an error per se.
+1

> How about something like:
> 
>   ereport(WARNING,
>           (errmsg("%ld rows were skipped due to data type
> incompatibility", cstate->ignored_errors),
>            errhint("Skipped rows can be inspected in the database log
> for reprocessing.")));
Since skipped rows cannot be inspected in the log when 
log_error_verbosity is set to terse,
it might be better without this errhint.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2023-03-17 21:23, torikoshia wrote:
> On 2023-03-07 18:09, Daniel Gustafsson wrote:
>>> On 7 Mar 2023, at 09:35, Damir Belyalov <dam.bel07@gmail.com> wrote:
>> 
>>> I felt just logging "Error: %ld" would make people wonder the meaning 
>>> of
>>> the %ld. Logging something like ""Error: %ld data type errors were
>>> found" might be clearer.
>>> 
>>> Thanks. For more clearance change the message to: "Errors were found: 
>>> %".
>> 
>> I'm not convinced that this adds enough clarity to assist the user.  
>> We also
>> shouldn't use "error" in a WARNING log since the user has explicitly 
>> asked to
>> skip rows on error, so it's not an error per se.
> +1
> 
>> How about something like:
>> 
>>   ereport(WARNING,
>>           (errmsg("%ld rows were skipped due to data type
>> incompatibility", cstate->ignored_errors),
>>            errhint("Skipped rows can be inspected in the database log
>> for reprocessing.")));
> Since skipped rows cannot be inspected in the log when
> log_error_verbosity is set to terse,
> it might be better without this errhint.

Removed errhint.

Modified some codes since v3 couldn't be applied HEAD anymore.

Also modified v3 patch as below:

> 65 +   if (cstate->opts.ignore_datatype_errors)
> 66 +       cstate->ignored_errors = 0;
> 67 +

It seems unnecessary since cstate is initialized by palloc0() in 
BeginCopyFrom().

> 134 +               ereport(LOG,
> 135 +                       errmsg("%s", 
> cstate->escontext.error_data->message));
> 136 +
> 137 +               return true;

Since LOG means 'Reports information of interest to administrators'
according to the manual[1], a datatype error should not be logged as
LOG. I put it back to WARNING.

[1] 
https://www.postgresql.org/docs/current/runtime-config-logging.html#RUNTIME-CONFIG-SEVERITY-LEVELS


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Andres Freund
Date:
Hi,

Tom, see below - I wonder if we should provide one more piece of infrastructure
around the saved error stuff...


Have you measured whether this has negative performance effects when *NOT*
using the new option?


As-is this does not work with FORMAT BINARY - and converting the binary input
functions to support soft errors won't happen for 16. So I think you need to
raise an error if BINARY and IGNORE_DATATYPE_ERRORS are specified.


On 2023-03-22 22:34:20 +0900, torikoshia wrote:
> @@ -985,9 +986,28 @@ CopyFrom(CopyFromState cstate)
>  
>          ExecClearTuple(myslot);
>  
> +        if (cstate->opts.ignore_datatype_errors)
> +        {
> +            escontext.details_wanted = true;
> +            cstate->escontext = escontext;
> +        }

I think it might be worth pulling this out of the loop. That does mean you'd
have to reset escontext.error_occurred after an error, but that doesn't seem
too bad, you need to do other cleanup anyway.
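
A rough sketch of that restructuring (identifier names follow the quoted
patch; exactly where the reset happens is a judgment call):

    /* once, before the retrieval loop */
    if (cstate->opts.ignore_datatype_errors)
    {
        escontext.details_wanted = true;
        cstate->escontext = escontext;
    }

    for (;;)
    {
        /* ... */
        if (!NextCopyFrom(cstate, econtext, myslot->tts_values, myslot->tts_isnull))
            break;

        if (cstate->escontext.error_occurred)
        {
            /* a soft error was captured: count it, reset for the next row */
            cstate->ignored_errors++;
            cstate->escontext.error_occurred = false;
            continue;
        }

        /* ... rest of per-row processing ... */
    }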


> @@ -956,10 +957,20 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
>                  values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]);
>              }
>              else
> -                values[m] = InputFunctionCall(&in_functions[m],
> -                                              string,
> -                                              typioparams[m],
> -                                              att->atttypmod);
> +                /* If IGNORE_DATATYPE_ERRORS is enabled skip rows with datatype errors */
> +                if (!InputFunctionCallSafe(&in_functions[m],
> +                                           string,
> +                                           typioparams[m],
> +                                           att->atttypmod,
> +                                           (Node *) &cstate->escontext,
> +                                           &values[m]))
> +                {
> +                    cstate->ignored_errors++;
> +
> +                    ereport(WARNING,
> +                            errmsg("%s", cstate->escontext.error_data->message));

That isn't right - you lose all the details of the message. As is you'd also
leak the error context.

I think the best bet for now is to do something like
    /* adjust elevel so we don't jump out */
    cstate->escontext.error_data->elevel = WARNING;
    /* despite the name, this won't raise an error if elevel < ERROR */
    ThrowErrorData(cstate->escontext.error_data);

I wonder if we ought to provide a wrapper for this? It could e.g. know to
mention the original elevel and such?
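
Something along these lines, perhaps (a hypothetical helper, not an
existing API; the name and details are assumptions):

    /*
     * Hypothetical wrapper: re-emit an error saved by a soft-error context
     * without longjmp'ing out of the caller.
     */
    static void
    ReThrowSavedErrorAsWarning(ErrorSaveContext *escontext)
    {
        ErrorData  *edata = escontext->error_data;

        /*
         * Adjust elevel so ThrowErrorData() doesn't jump out; this is also
         * where the wrapper could decorate the message with the original
         * elevel.
         */
        edata->elevel = WARNING;
        ThrowErrorData(edata);

        /* ready the context to capture the next soft error */
        escontext->error_occurred = false;
    }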


I don't think NextCopyFrom() is the right place to emit this warning - it
e.g. is also called from file_fdw.c, which might want to do something else
with the error. From a layering POV it seems cleaner to do this in
CopyFrom(). You already have a check for escontext.error_occurred there
anyway.



> @@ -3378,6 +3378,10 @@ copy_opt_item:
>                  {
>                      $$ = makeDefElem("freeze", (Node *) makeBoolean(true), @1);
>                  }
> +            | IGNORE_DATATYPE_ERRORS
> +                {
> +                    $$ = makeDefElem("ignore_datatype_errors", (Node *)makeBoolean(true), @1);
> +                }
>              | DELIMITER opt_as Sconst
>                  {
>                      $$ = makeDefElem("delimiter", (Node *) makeString($3), @1);


I think we shouldn't add a new keyword for this, but only support this via
/* new COPY option syntax */
copy_generic_opt_list:
            copy_generic_opt_elem

Further increasing the size of the grammar with random keywords when we have
more generic ways to represent them seems unnecessary.
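
For comparison, the generic path needs no new keyword at all; the handling
in ProcessCopyOptions() would be roughly like this (modeled on the existing
FREEZE handling, so the details are assumptions):

    else if (strcmp(defel->defname, "ignore_datatype_errors") == 0)
    {
        if (ignore_datatype_errors_specified)
            errorConflictingDefElem(defel, pstate);
        ignore_datatype_errors_specified = true;
        opts_out->ignore_datatype_errors = defGetBoolean(defel);
    }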


> +-- tests for IGNORE_DATATYPE_ERRORS option
> +CREATE TABLE check_ign_err (n int, m int[], k int);
> +COPY check_ign_err FROM STDIN WITH IGNORE_DATATYPE_ERRORS;
> +1    {1}    1
> +a    {2}    2
> +3    {3}    3333333333
> +4    {a, 4}    4
> +
> +5    {5}    5
> +\.
> +SELECT * FROM check_ign_err;
> +

I suggest adding a few more tests:

- COPY with a datatype error that can't be handled as a soft error
- test documenting that COPY FORMAT BINARY is incompatible with IGNORE_DATATYPE_ERRORS
- a soft error showing the error context - although that will require some
  care to avoid the function name + line in the output

Greetings,

Andres Freund



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2023-03-23 02:50, Andres Freund wrote:
> Hi,
> 
> Tom, see below - I wonder if we should provide one more piece of 
> infrastructure
> around the saved error stuff...
> 
> 
> Have you measured whether this has negative performance effects when 
> *NOT*
> using the new option?
> 
> 
> As-is this does not work with FORMAT BINARY - and converting the binary 
> input
> functions to support soft errors won't happen for 16. So I think you 
> need to
> raise an error if BINARY and IGNORE_DATATYPE_ERRORS are specified.
> 
> 
> On 2023-03-22 22:34:20 +0900, torikoshia wrote:
>> @@ -985,9 +986,28 @@ CopyFrom(CopyFromState cstate)
>> 
>>          ExecClearTuple(myslot);
>> 
>> +        if (cstate->opts.ignore_datatype_errors)
>> +        {
>> +            escontext.details_wanted = true;
>> +            cstate->escontext = escontext;
>> +        }
> 
> I think it might be worth pulling this out of the loop. That does mean 
> you'd
> have to reset escontext.error_occurred after an error, but that doesn't 
> seem
> too bad, you need to do other cleanup anyway.
> 
> 
>> @@ -956,10 +957,20 @@ NextCopyFrom(CopyFromState cstate, ExprContext 
>> *econtext,
>>                  values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]);
>>              }
>>              else
>> -                values[m] = InputFunctionCall(&in_functions[m],
>> -                                              string,
>> -                                              typioparams[m],
>> -                                              att->atttypmod);
>> +                /* If IGNORE_DATATYPE_ERRORS is enabled skip rows with datatype 
>> errors */
>> +                if (!InputFunctionCallSafe(&in_functions[m],
>> +                                           string,
>> +                                           typioparams[m],
>> +                                           att->atttypmod,
>> +                                           (Node *) &cstate->escontext,
>> +                                           &values[m]))
>> +                {
>> +                    cstate->ignored_errors++;
>> +
>> +                    ereport(WARNING,
>> +                            errmsg("%s", cstate->escontext.error_data->message));
> 
> That isn't right - you lose all the details of the message. As is 
> you'd also
> leak the error context.
> 
> I think the best bet for now is to do something like
>     /* adjust elevel so we don't jump out */
>     cstate->escontext.error_data->elevel = WARNING;
>     /* despite the name, this won't raise an error if elevel < ERROR */
>     ThrowErrorData(cstate->escontext.error_data);

Thanks for your review!
I'll try to fix it this way for the time being.

> I wonder if we ought to provide a wrapper for this? It could e.g. know 
> to
> mention the original elevel and such?

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2023-03-23 02:50, Andres Freund wrote:
Thanks again for your review.
Attached is the v5 patch.

> Have you measured whether this has negative performance effects when 
> *NOT*
> using the new option?

I loaded 10000000 rows of pgbench_accounts on my laptop and compared the 
elapsed time.
GUCs changed from the default are logging_collector = on, 
log_error_verbosity = verbose.

Three runs were made under each condition; the median is listed 
below:

- patch NOT applied(36f40ce2dc66): 35299ms
- patch applied, without IGNORE_DATATYPE_ERRORS: 34409ms
- patch applied, with IGNORE_DATATYPE_ERRORS: 35510ms

There seems to be no significant degradation.

I also tested the elapsed time when loading data containing datatype 
errors with IGNORE_DATATYPE_ERRORS:
- data has 100 rows of error: 35269ms
- data has 1000 rows of error: 34577ms
- data has 5000000 rows of error: 48925ms

The 5000000-error case takes much longer, but that seems to be dominated 
by logging time.
Here are the results with log_min_messages and client_min_messages set to 
'error':
- data has 5000000 data type error: 23972ms
- data has 0 data type error: 34320ms

Conversely, with many datatype errors the load now takes less time, 
which seems reasonable since more rows are skipped and never inserted.


> As-is this does not work with FORMAT BINARY - and converting the binary 
> input
> functions to support soft errors won't happen for 16. So I think you 
> need to
> raise an error if BINARY and IGNORE_DATATYPE_ERRORS are specified.

Added the option check.
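
The check is essentially the following in ProcessCopyOptions() (a sketch;
the exact message wording in the patch may differ):

    if (opts_out->binary && opts_out->ignore_datatype_errors)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot specify IGNORE_DATATYPE_ERRORS in BINARY mode")));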

> On 2023-03-22 22:34:20 +0900, torikoshia wrote:
> > @@ -985,9 +986,28 @@ CopyFrom(CopyFromState cstate)
> >
> >               ExecClearTuple(myslot);
> >
> > +             if (cstate->opts.ignore_datatype_errors)
> > +             {
> > +                     escontext.details_wanted = true;
> > +                     cstate->escontext = escontext;
> > +             }
> 
> I think it might be worth pulling this out of the loop. That does mean 
> you'd
> have to reset escontext.error_occurred after an error, but that doesn't 
> seem
> too bad, you need to do other cleanup anyway.

Pulled this out of the loop and added code to reset escontext.

> > @@ -956,10 +957,20 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
> >                               values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]);
> >                       }
> >                       else
> > -                             values[m] = InputFunctionCall(&in_functions[m],
> > -                                                                                       string,
> > -                                                                                       typioparams[m],
> > -                                                                                       att->atttypmod);
> > +                             /* If IGNORE_DATATYPE_ERRORS is enabled skip rows with datatype errors */
> > +                             if (!InputFunctionCallSafe(&in_functions[m],
> > +                                                                                string,
> > +                                                                                typioparams[m],
> > +                                                                                att->atttypmod,
> > +                                                                                (Node *) &cstate->escontext,
> > +                                                                                &values[m]))
> > +                             {
> > +                                     cstate->ignored_errors++;
> > +
> > +                                     ereport(WARNING,
> > +                                                     errmsg("%s", cstate->escontext.error_data->message));
> 
> That isn't right - you lose all the details of the message. As is 
> you'd also
> leak the error context.
> 
> I think the best bet for now is to do something like
>     /* adjust elevel so we don't jump out */
>     cstate->escontext.error_data->elevel = WARNING;
>     /* despite the name, this won't raise an error if elevel < ERROR */
>     ThrowErrorData(cstate->escontext.error_data);

As I mentioned in a previous email, I added the above code for now.

> I wonder if we ought to provide a wrapper for this? It could e.g. know 
> to
> mention the original elevel and such?
> 
> 
> I don't think NextCopyFrom() is the right place to emit this warning - 
> it
> e.g. is also called from file_fdw.c, which might want to do something 
> else
> with the error. From a layering POV it seems cleaner to do this in
> CopyFrom(). You already have a check for escontext.error_occurred there
> anyway.

Agreed.

> > @@ -3378,6 +3378,10 @@ copy_opt_item:
> >                               {
> >                                       $$ = makeDefElem("freeze", (Node *) makeBoolean(true), @1);
> >                               }
> > +                     | IGNORE_DATATYPE_ERRORS
> > +                             {
> > +                                     $$ = makeDefElem("ignore_datatype_errors", (Node *)makeBoolean(true), @1);
> > +                             }
> >                       | DELIMITER opt_as Sconst
> >                               {
> >                                       $$ = makeDefElem("delimiter", (Node *) makeString($3), @1);
> 
> 
> I think we shouldn't add a new keyword for this, but only support this 
> via
> /* new COPY option syntax */
> copy_generic_opt_list:
>                         copy_generic_opt_elem
> 
> Further increasing the size of the grammar with random keywords when we 
> have
> more generic ways to represent them seems unnecessary.

Agreed.

> > +-- tests for IGNORE_DATATYPE_ERRORS option
> > +CREATE TABLE check_ign_err (n int, m int[], k int);
> > +COPY check_ign_err FROM STDIN WITH IGNORE_DATATYPE_ERRORS;
> > +1    {1}     1
> > +a    {2}     2
> > +3    {3}     3333333333
> > +4    {a, 4}  4
> > +
> > +5    {5}     5
> > +\.
> > +SELECT * FROM check_ign_err;
> > +
> 
> I suggest adding a few more tests:
> 
> - COPY with a datatype error that can't be handled as a soft error

Added a test for rows with missing columns.
However, that is a missing-data error rather than a datatype error, so
perhaps not what you expected?

> - test documenting that COPY FORMAT BINARY is incompatible with 
> IGNORE_DATATYPE_ERRORS

Added it.

> - a soft error showing the error context - although that will require 
> some
>   care to avoid the function name + line in the output

I assume you mean a test that checks the server log, but I haven't come 
up with a way to do that.
A TAP test might work, but adding one just for this seems like overkill.


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
Hi!

I made the specified changes and my patch turned out the same as yours. The performance measurements were the same too.

The only thing left to do is to avoid adding IGNORE_DATATYPE_ERRORS as a keyword. See how this is done for options such as FORCE_NOT_NULL, FORCE_NULL, and FORCE_QUOTE: they are not in kwlist.h and are not keywords in gram.y.

Regards,
Damir Belyalov
Postgres Professional

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2023-03-27 23:28, Damir Belyalov wrote:
> Hi!
> 
> I made the specified changes and my patch turned out the same as
> yours. The performance measurements were the same too.

Thanks for your review and measurements.

> The only thing left to do is to avoid adding IGNORE_DATATYPE_ERRORS as
> a keyword. See how this is done for options such as FORCE_NOT_NULL,
> FORCE_NULL, and FORCE_QUOTE: they are not in kwlist.h and are not
> keywords in gram.y.

I might misunderstand something, but I believe the v5 patch uses 
copy_generic_opt_list and it does not add IGNORE_DATATYPE_ERRORS as a 
keyword.
It modifies neither kwlist.h nor gram.y.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
I might misunderstand something, but I believe the v5 patch uses
copy_generic_opt_list and it does not add IGNORE_DATATYPE_ERRORS as a
keyword.
It modifies neither kwlist.h nor gram.y.

Sorry, didn't notice that. I think everything is alright now.

Regards,
Damir Belyalov
Postgres Professional

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alena Rybakina
Date:

Hi!

Thank you, Damir, for your patch. It is very interesting to review it!

It seemed to me that the variable naming is not consistent everywhere.

I noticed that you used the ignore_datatype_errors_specified variable in copy.c, while the option has the short name ignore_datatype_errors. You also used the short name in the CopyFormatOptions structure.
The name ignore_datatype_errors_specified seems very long to me; maybe use the short version (ignore_datatype_errors) in copy.c too?

Besides, I noticed that you used the ignored_errors variable in the CopyFromStateData structure; its name is strikingly similar to ignore_datatype_errors, but they have different meanings.
Maybe it would be better to rename it to ignored_errors_counter?


I tested the latest version, v5-0001-Add-new-COPY-option-IGNORE_DATATYPE_ERRORS.patch, with the bytea data type and with transaction cases, and didn't find any problems there.
I've described my steps in detail below, in case I missed something.

First, I ran COPY with the IGNORE_DATATYPE_ERRORS option inside a transaction block.

The file t2.csv contains:

id,b
769,\
1,\6e
2,\x5
5,\x

Test:
CREATE TABLE t (id INT , b  BYTEA) ;
postgres=# BEGIN;
copy t FROM '/home/alena/postgres/t2.csv'  WITH (format 'csv', IGNORE_DATATYPE_ERRORS, delimiter ',', HEADER);
SAVEPOINT my_savepoint;
BEGIN
WARNING:  invalid input syntax for type bytea
WARNING:  invalid input syntax for type bytea
WARNING:  invalid hexadecimal data: odd number of digits
WARNING:  3 rows were skipped due to data type incompatibility
COPY 1
SAVEPOINT
postgres=*# copy t FROM '/home/alena/postgres/t2.csv'  WITH (format 'csv', IGNORE_DATATYPE_ERRORS, delimiter ',', HEADER);
WARNING:  invalid input syntax for type bytea
WARNING:  invalid input syntax for type bytea
WARNING:  invalid hexadecimal data: odd number of digits
WARNING:  3 rows were skipped due to data type incompatibility
COPY 1
postgres=*# ROLLBACK TO my_savepoint;
ROLLBACK
postgres=*# select * from t;
 id | b  
----+----
  5 | \x
(1 row)

postgres=*# copy t FROM '/home/alena/postgres/t2.csv'  WITH (format 'csv', IGNORE_DATATYPE_ERRORS, delimiter ',', HEADER);
WARNING:  invalid input syntax for type bytea
WARNING:  invalid input syntax for type bytea
WARNING:  invalid hexadecimal data: odd number of digits
WARNING:  3 rows were skipped due to data type incompatibility
COPY 1
postgres=*# select * from t;
 id | b  
----+----
  5 | \x
  5 | \x
(2 rows)

postgres=*# commit;
COMMIT


I tried a similar test with the transaction block moved into a function:
CREATE FUNCTION public.log2()
 RETURNS void
 LANGUAGE plpgsql
 SECURITY DEFINER
AS $function$
BEGIN;
copy t FROM '/home/alena/postgres/t2.csv'  WITH (format 'csv', IGNORE_DATATYPE_ERRORS, delimiter ',', HEADER);
SAVEPOINT my_savepoint;
END;
$function$;

postgres=# delete from t;

postgres=# select 1 as t from log2();
WARNING:  invalid input syntax for type bytea
WARNING:  invalid input syntax for type bytea
WARNING:  invalid hexadecimal data: odd number of digits
WARNING:  3 rows were skipped due to data type incompatibility
 t
---
 1
(1 row)


Second, I checked COPY with the bytea datatype.
t1.csv contains:
id,b
769,\x2d
1,\x6e
2,\x5c
5,\x

And I ran it:
postgres=# delete from t;
DELETE 4
postgres=# copy t FROM '/home/alena/postgres/t2.csv'  WITH (format 'csv', IGNORE_DATATYPE_ERRORS, delimiter ',', HEADER);
WARNING:  invalid input syntax for type bytea
WARNING:  invalid input syntax for type bytea
WARNING:  invalid hexadecimal data: odd number of digits
WARNING:  3 rows were skipped due to data type incompatibility
COPY 1
postgres=# select * from t;
 id | b  
----+----
  5 | \x
(1 row)

-- 
--- 
Alena Rybakina
Postgres Professional

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2023-05-07 05:05, Alena Rybakina wrote:
Thanks for your review and comments!

> I noticed that you used _ignore_datatype_errors_specified_ variable in
> _copy.c_ , but guc has a short name _ignore_datatype_errors_. Also you
> used the short variable name in _CopyFormatOptions_ structure.

You may already know this, but these variable names are given in 
imitation of the FREEZE and BINARY cases:

   --- a/src/include/commands/copy.h
   +++ b/src/include/commands/copy.h
   @@ -42,6 +42,7 @@ typedef struct CopyFormatOptions
                                    * -1 if not specified */
       bool        binary;         /* binary format? */
       bool        freeze;         /* freeze rows on loading? */
   +   bool        ignore_datatype_errors;  /* ignore rows with datatype 
errors */

   --- a/src/backend/commands/copy.c
   +++ b/src/backend/commands/copy.c
   @@ -419,6 +419,7 @@ ProcessCopyOptions(ParseState *pstate,
       bool        format_specified = false;
       bool        freeze_specified = false;
       bool        header_specified = false;
   +   bool        ignore_datatype_errors_specified = false;


> The name _ignore_datatype_errors_specified_ seems very long to
> me; maybe use the short version (_ignore_datatype_errors_) in
> _copy.c_ too?

I think it would be sane to align the names with the FREEZE and BINARY 
options.

I agree that the name is too long; we once used the name 
'ignore_errors'.
However, the current implementation does not ignore all errors, just 
datatype errors, so I renamed it.
There may be a better name, but I haven't come up with one.

> Besides, I noticed that you used the _ignored_errors_ variable in the
> _CopyFromStateData_ structure; its name is strikingly similar to
> _ignore_datatype_errors_, but they have different meanings.
> Maybe it would be better to rename it to _ignored_errors_counter_?

From a quick look at the PostgreSQL source code, there are few variable 
names ending in "_counter"; that suffix seems to be used for function names.
Something like "ignored_errors_count" might be better.


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
Since the v5 patch no longer applies, I've updated it.

On 2023-03-23 02:50, Andres Freund wrote:
> I suggest adding a few more tests:
> 
> - COPY with a datatype error that can't be handled as a soft error

I wasn't sure of the proper way to test this, but I found that widget_in(),
the input function of the widget data type in regress.c, is defined to
raise a hard error, so the attached patch adds a test using it.

--
Atsushi Torikoshi
NTT DATA CORPORATION
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:

> Since the v5 patch no longer applies, I've updated it.
Thank you for updating the patch. I did a small review of it and corrected some formatting.


> - COPY with a datatype error that can't be handled as a soft error

I wasn't sure of the proper way to test this, but I found that widget_in(),
the input function of the widget data type in regress.c, is defined to
raise a hard error, so the attached patch adds a test using it.

This test seems a bit odd because of the "widget" type. A hard error is already thrown by the previous test with missing data. It would also be interesting to list all the cases in which a hard error can be thrown.

Regards,
Damir Belyalov
Postgres Professional
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2023-09-15 19:02, Damir Belyalov wrote:
>> Since the v5 patch no longer applies, I've updated it.
> 
> Thank you for updating the patch. I did a small review of it and
> corrected some formatting.
> 

Thanks for your review and update!
I have no objections to the modifications of the code and comments.
The v7 patch doesn't carry a commit message on the patch; I think 
keeping one there is helpful for reviewers.

>>> - COPY with a datatype error that can't be handled as a soft error
>> 
>> I wasn't sure of the proper way to test this, but I found that
>> widget_in(), the input function of the widget data type in regress.c,
>> is defined to raise a hard error, so the attached patch adds a test
>> using it.
> 
> This test seems a bit odd because of the "widget" type. A hard error
> is already thrown by the previous test with missing data. It would
> also be interesting to list all the cases in which a hard error can
> be thrown.

Although a missing-data error is a hard error, Andres's suggestion was 
to add a datatype error:

> - COPY with a datatype error that can't be handled as a soft error

As described in widget_in(), widget is intentionally left emitting a hard 
error for testing purposes:

>   * Note: DON'T convert this error to "soft" style (errsave/ereturn).  
> We
>   * want this data type to stay permanently in the hard-error world so 
> that
>   * it can be used for testing that such cases still work reasonably.

From this point of view, I think this is an intended way of using widget.
OTOH, widget is declared in create_type.sql and I'm not sure it's OK to 
use it in another test, copy2.sql.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir
Date:
> The v7 patch doesn't carry a commit message on the patch; I think 
> keeping one there is helpful for reviewers.

Sure, didn't notice it. Added the commit message to the updated patch.


>   * Note: DON'T convert this error to "soft" style (errsave/ereturn). We
>>   * want this data type to stay permanently in the hard-error world 
>> so that
>>   * it can be used for testing that such cases still work reasonably.
>
> From this point of view, I think this is an intended way of using widget.

I agree, it's a good approach for checking datatype errors, because 
that's what was intended.


> OTOH, widget is declared in create_type.sql and I'm not sure it's OK to 
> use it in another test, copy2.sql.

I think future regression tests using the 'widget' type need not live 
only in create_type.sql. It's not a problem for one regression test to 
use a type or function defined in another; for example, the table 'onek' 
is used in many regression tests.


Regards,

Damir Belyalov

Postgres Professional

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Tom Lane
Date:
Damir <dam.bel07@gmail.com> writes:
> [ v7-0002-Add-new-COPY-option-IGNORE_DATATYPE_ERRORS.patch ]

Sorry for being so late to the party, but ... I don't think this
is a well-designed feature as it stands.  Simply dropping failed rows
seems like an unusable definition for any application that has
pretensions of robustness.  "But", you say, "we're emitting WARNING
messages about it".  That's *useless*.  For most applications WARNING
messages just go into the bit bucket, or worse they cause memory leaks
(because the app never reads them).  An app that tried to read them
would have to cope with all sorts of fun such as translated messages.
Furthermore, as best I can tell from the provided test cases, the
messages completely lack basic context such as which field or line
the problem occurred in.  An app trying to use this to understand
which input lines had failed would not get far.

I think an actually usable feature of this sort would involve
copying all the failed lines to some alternate output medium,
perhaps a second table with a TEXT column to receive the original
data line.  (Or maybe an array of text that could receive the
broken-down field values?)  Maybe we could dump the message info,
line number, field name etc into additional columns.

Also it'd be a good idea to have a vision of how the feature
could be extended to cope with lower-level errors, such as
lines that have the wrong number of columns or other problems
with line-level syntax.  I don't say we need to cope with that
immediately, but it's going to be something people will want
to add, I think.

            regards, tom lane



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Daniel Gustafsson
Date:
> On 8 Nov 2023, at 19:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I think an actually usable feature of this sort would involve
> copying all the failed lines to some alternate output medium,
> perhaps a second table with a TEXT column to receive the original
> data line.  (Or maybe an array of text that could receive the
> broken-down field values?)  Maybe we could dump the message info,
> line number, field name etc into additional columns.

I agree that the errors should be easily visible to the user in some way.  The
feature is for sure interesting, especially in data warehouse type jobs where
dirty data is often ingested.

As a data point, Greenplum has this feature with additional SQL syntax to
control it:

    COPY .. LOG ERRORS SEGMENT REJECT LIMIT xyz ROWS;

LOG ERRORS instructs the database to log the faulty rows and SEGMENT REJECT
LIMIT xyz ROWS sets the limit of how many rows can be faulty before the
operation errors out.  I'm not at all advocating that we should mimic this,
just wanted to add a reference to postgres derivative where this has been
implemented.

--
Daniel Gustafsson




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Tom Lane
Date:
Daniel Gustafsson <daniel@yesql.se> writes:
>> On 8 Nov 2023, at 19:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think an actually usable feature of this sort would involve
>> copying all the failed lines to some alternate output medium,
>> perhaps a second table with a TEXT column to receive the original
>> data line.  (Or maybe an array of text that could receive the
>> broken-down field values?)  Maybe we could dump the message info,
>> line number, field name etc into additional columns.

> I agree that the errors should be easily visible to the user in some way.  The
> feature is for sure interesting, especially in data warehouse type jobs where
> dirty data is often ingested.

I agree it's interesting, but we need to get it right the first time.

Here is a very straw-man-level sketch of what I think might work.
The option to COPY FROM looks something like

    ERRORS TO other_table_name (item [, item [, ...]])

where the "items" are keywords identifying the information item
we will insert into each successive column of the target table.
This design allows the user to decide which items are of use
to them.  I envision items like

LINENO   bigint   COPY line number, counting from 1
LINE     text     raw text of line (after encoding conversion)
FIELDS   text[]   separated, de-escaped string fields (the data
                  that was or would be fed to input functions)
FIELD    text     name of troublesome field, if field-specific
MESSAGE  text     error message text
DETAIL   text     error message detail, if any
SQLSTATE text     error SQLSTATE code

Some of these would have to be populated as NULL if we didn't get
that far in processing the line.  In the worst case, which is
encoding conversion failure, I think we couldn't populate any of
the data items except LINENO.

Not sure if we need to insist that the target table columns be
exactly the data types I show above.  It'd be nice to allow
the LINENO target to be plain int, perhaps.  OTOH, do we really
want to have to deal with issues like conversion failures while
trying to report an error?

> As a data point, Greenplum has this feature with additional SQL syntax to
> control it:
>     COPY .. LOG ERRORS SEGMENT REJECT LIMIT xyz ROWS;
> LOG ERRORS instructs the database to log the faulty rows and SEGMENT REJECT
> LIMIT xyz ROWS sets the limit of how many rows can be faulty before the
> operation errors out.  I'm not at all advocating that we should mimic this,
> just wanted to add a reference to postgres derivative where this has been
> implemented.

Hm.  A "reject limit" might be a useful add-on, but I wouldn't advocate
including it in the initial patch.

            regards, tom lane



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Andres Freund
Date:
Hi,

On 2023-11-08 13:18:39 -0500, Tom Lane wrote:
> Damir <dam.bel07@gmail.com> writes:
> > [ v7-0002-Add-new-COPY-option-IGNORE_DATATYPE_ERRORS.patch ]
> 
> Sorry for being so late to the party, but ... I don't think this
> is a well-designed feature as it stands.  Simply dropping failed rows
> seems like an unusable definition for any application that has
> pretensions of robustness.

Not everything needs to be a robust application though. I've definitely cursed
at postgres for lacking this.


> I think an actually usable feature of this sort would involve
> copying all the failed lines to some alternate output medium,
> perhaps a second table with a TEXT column to receive the original
> data line.  (Or maybe an array of text that could receive the
> broken-down field values?)  Maybe we could dump the message info,
> line number, field name etc into additional columns.

If we go in that direction, we should make it possible to *not* use such a
table as well, for some uses it'd be pointless.


Another way of reporting errors could be for copy to return invalid input back
to the client, via the copy protocol. That would allow the client to handle
failing rows and also to abort if the number of errors or the type of errors
gets to be too big.

Greetings,

Andres Freund



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2023-11-08 13:18:39 -0500, Tom Lane wrote:
>> I think an actually usable feature of this sort would involve
>> copying all the failed lines to some alternate output medium,
>> perhaps a second table with a TEXT column to receive the original
>> data line.

> If we go in that direction, we should make it possible to *not* use such a
> table as well, for some uses it'd be pointless.

Why?  You can always just drop the errors table if you don't want it.
But I fail to see the use-case for ignoring errors altogether.

> Another way of reporting errors could be for copy to return invalid input back
> to the client, via the copy protocol.

Color me skeptical.  There are approximately zero clients in the
world today that could handle simultaneous return of data during
a COPY.  Certainly neither libpq nor psql are within hailing
distance of being able to support that.  Maybe in some far
future it could be made to work --- but if you want it in the v1
patch, you just moved the goalposts into the next county.

            regards, tom lane



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Andres Freund
Date:
Hi,

On 2023-11-08 19:00:01 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-11-08 13:18:39 -0500, Tom Lane wrote:
> >> I think an actually usable feature of this sort would involve
> >> copying all the failed lines to some alternate output medium,
> >> perhaps a second table with a TEXT column to receive the original
> >> data line.
> 
> > If we go in that direction, we should make it possible to *not* use such a
> > table as well, for some uses it'd be pointless.
> 
> Why?  You can always just drop the errors table if you don't want it.

I think it'll often just end up littering the database, particularly if the
callers don't care about a few errors.


> But I fail to see the use-case for ignoring errors altogether.

My experience is that there's often a few errors due to bad encoding, missing
escaping etc that you don't care sufficiently about when importing large
quantities of data.

Greetings,

Andres Freund



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
zhihuifan1213@163.com
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Daniel Gustafsson <daniel@yesql.se> writes:
>>> On 8 Nov 2023, at 19:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I think an actually usable feature of this sort would involve
>>> copying all the failed lines to some alternate output medium,
>>> perhaps a second table with a TEXT column to receive the original
>>> data line.  (Or maybe an array of text that could receive the
>>> broken-down field values?)  Maybe we could dump the message info,
>>> line number, field name etc into additional columns.
>
>> I agree that the errors should be easily visible to the user in some way.  The
>> feature is for sure interesting, especially in data warehouse type jobs where
>> dirty data is often ingested.
>
> I agree it's interesting, but we need to get it right the first time.
>
> Here is a very straw-man-level sketch of what I think might work.
> The option to COPY FROM looks something like
>
>     ERRORS TO other_table_name (item [, item [, ...]])
>
> where the "items" are keywords identifying the information item
> we will insert into each successive column of the target table.
> This design allows the user to decide which items are of use
> to them.  I envision items like

I'm pretty happy with the overall design, especially the 'ERRORS TO
other_table_name' part. But I'm a bit confused about why we need to
write code for (item [, item [, ...]]): not only does it require more
coding, it also requires the user to make more decisions.
Would anything be wrong with making all of them the default?

> LINENO    bigint        COPY line number, counting from 1
> LINE    text        raw text of line (after encoding conversion)
> FIELDS    text[]        separated, de-escaped string fields (the data
>             that was or would be fed to input functions)
> FIELD    text        name of troublesome field, if field-specific
> MESSAGE    text        error message text
> DETAIL    text        error message detail, if any
> SQLSTATE text        error SQLSTATE code
>


-- 
Best Regards
Andy Fan




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Damir Belyalov
Date:
 Here is a very straw-man-level sketch of what I think might work.
 The option to COPY FROM looks something like

      ERRORS TO other_table_name (item [, item [, ...]])

I tried to implement the patch using a table and came across a number of questions.

What kind of table should we use for this feature: a system catalog table, a table stored as a file, or a newly created table?

In each of these cases, security and access-rights issues arise.
Other users should not be able to see another user's error lines, and it is not clear how access rights to this table would be inherited or granted.


--
Regards,
Damir Belyalov
Postgres Professional

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alena Rybakina
Date:

Hi!

On 14.11.2023 13:10, Damir Belyalov wrote:
 Here is a very straw-man-level sketch of what I think might work.
 The option to COPY FROM looks something like

      ERRORS TO other_table_name (item [, item [, ...]])

I tried to implement the patch using a table and came across a number of questions.

What kind of table should we use for this feature: a system catalog table, a table stored as a file, or a newly created table?

In each of these cases, security and access-rights issues arise.
Other users should not be able to see another user's error lines, and it is not clear how access rights to this table would be inherited or granted.


Maybe we could add a GUC or an option that outputs such errors during the COPY operation, and check whether the user has sufficient rights to set it?

That is, I propose giving the user the choice to run COPY with or without saving errors, while immediately checking whether error output is even possible for that user.

-- 
Regards,
Alena Rybakina

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
zhihuifan1213@163.com
Date:
Damir Belyalov <dam.bel07@gmail.com> writes:

>   Here is a very straw-man-level sketch of what I think might work.
>   The option to COPY FROM looks something like
>
>        ERRORS TO other_table_name (item [, item [, ...]])
>
> I tried to implement the patch using a table and came across a number of questions.
>
> What kind of table should we use for this feature: a system catalog table, a table stored as a file, or a newly
> created table?

I think a system catalog should not be an option in the first place,
since it requires extra work; see the calls of
IsCatalogRelation in heapam.c.

I'd prefer creating a new normal heap relation rather than a file, since
heap relations probably have better APIs.

> In each of these cases, security and access-rights issues arise.
> Other users should not be able to see another user's error lines,
> and it is not clear how access rights to this table would be
> inherited or granted.

How about creating the table with read/write access only for the current
user, or with the same access rights as the relation we are copying into?

-- 
Best Regards
Andy Fan




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
jian he
Date:
On Thu, Nov 9, 2023 at 4:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
> >> On 8 Nov 2023, at 19:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I think an actually usable feature of this sort would involve
> >> copying all the failed lines to some alternate output medium,
> >> perhaps a second table with a TEXT column to receive the original
> >> data line.  (Or maybe an array of text that could receive the
> >> broken-down field values?)  Maybe we could dump the message info,
> >> line number, field name etc into additional columns.
>
> > I agree that the errors should be easily visible to the user in some way.  The
> > feature is for sure interesting, especially in data warehouse type jobs where
> > dirty data is often ingested.
>
> I agree it's interesting, but we need to get it right the first time.
>
> Here is a very straw-man-level sketch of what I think might work.
> The option to COPY FROM looks something like
>
>         ERRORS TO other_table_name (item [, item [, ...]])
>
> where the "items" are keywords identifying the information item
> we will insert into each successive column of the target table.
> This design allows the user to decide which items are of use
> to them.  I envision items like
>
> LINENO  bigint          COPY line number, counting from 1
> LINE    text            raw text of line (after encoding conversion)
> FIELDS  text[]          separated, de-escaped string fields (the data
>                         that was or would be fed to input functions)
> FIELD   text            name of troublesome field, if field-specific
> MESSAGE text            error message text
> DETAIL  text            error message detail, if any
> SQLSTATE text           error SQLSTATE code
>

How about just

SAVE ERRORS

which automatically creates a table to hold the errors (validating
the auto-generated table name's uniqueness and the CREATE privilege)?
The table would have the related info above; if there are no errors,
the table gets dropped.



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Andrei Lepikhov
Date:
On 14/11/2023 17:10, Damir Belyalov wrote:
>       Here is a very straw-man-level sketch of what I think might work.
>       The option to COPY FROM looks something like
> 
>            ERRORS TO other_table_name (item [, item [, ...]])
> 
> 
> I tried to implement the patch using a table and came across a number of 
> questions.
> 
> Which table should we implement for this feature: a system catalog table 
> or store this table as a file or create a new table?
> 
> In these cases, security and user rights management issues arise.
> It is better for other users not to see error lines from another user. 
> It is also not clear how access rights to this table are inherited and 
> be given.

Previous reviews have given helpful ideas about storing errors in a 
new table.
It should be trivial code: use the current table name + 'err' + a suffix, 
as we already do for conflicting auto-generated index names.
The 'errors table' should inherit any access policies from the table 
into which we are copying.
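
If the existing helper for auto-generated index names fits here (an
assumption), picking the name could be as simple as:

    /* sketch: choose a non-conflicting name like "tab_err", "tab_err1", ... */
    char       *errtable_name;

    errtable_name = ChooseRelationName(RelationGetRelationName(rel),
                                       NULL,        /* no second name part */
                                       "err",       /* label suffix */
                                       RelationGetNamespace(rel),
                                       false);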

-- 
regards,
Andrei Lepikhov
Postgres Professional




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
jian he
Date:
hi.
Here is my implementation based on previous discussions.

It adds a new COPY FROM option, save_error.
save_error only works in non-BINARY mode.
save_error was easier for me to implement; with two words ("save error") I
worry gram.y would not work.
save_error also works with other options like {csv mode, force_null, force_not_null}.

The overall logic is:
if save_error is specified then
   if the error_holding table does not exist, create one
   if the error_holding table exists, set error_firsttime to false
if save_error is not specified, work as the master branch does.

If errors happen, insert the error info into the error_holding table.
If no errors occur and error_firsttime is true, drop the table.
If no errors occur and error_firsttime is false, raise a
notice: All the past error holding saved at %s.%s

Error-holding table:
the schema will be the same as the COPY destination table's.
the table name will be the COPY destination name concatenated with "_error".

error_holding table definition:
CREATE TABLE err_nsp.error_rel (LINENO BIGINT, LINE TEXT,
FIELD TEXT, SOURCE TEXT, ERR_MESSAGE TEXT,
ERR_DETAIL TEXT, ERRORCODE TEXT);

The following field is not implemented:
FIELDS  text[]  separated, de-escaped string fields (the data that was
or would be fed to input functions)

because, imagine the following case:
create type test as (a int, b text);
create table copy_comp (c1 int, c2 test default '(11,test)', c3 date);
copy copy_comp from stdin with (default '\D');
1 \D '2022-07-04'
\.
table copy_comp;

I feel it's hard to get from the textual '\D' to the text[] `(11,test)` via SPI.
--------------------------------------
demo:

create table copy_default_error_save (
id integer,
text_value text not null default 'test',
ts_value timestamp without time zone not null default '2022-07-05'
);
copy copy_default_error_save from stdin with (save_error, default '\D');
k value '2022-07-04'
z \D '2022-07-03ASKL'
s \D \D
\.

NOTICE:  3 rows were skipped because of error. skipped row saved to
table public.copy_default_error_save_error
select  * from copy_default_error_save_error;
 lineno |             line               |  field   |      source      |                         err_message                          | err_detail | errorcode
--------+--------------------------------+----------+------------------+--------------------------------------------------------------+------------+-----------
      1 | k      value   '2022-07-04'    | id       | k                | invalid input syntax for type integer: "k"                   |            | 22P02
      2 | z      \D      '2022-07-03ASKL'| id       | z                | invalid input syntax for type integer: "z"                   |            | 22P02
      2 | z      \D      '2022-07-03ASKL'| ts_value | '2022-07-03ASKL' | invalid input syntax for type timestamp: "'2022-07-03ASKL'"  |            | 22007
      3 | s      \D      \D              | id       | s                | invalid input syntax for type integer: "s"                   |            | 22P02
(4 rows)

The docs are not so good yet.

COPY FROM (save_error) will not be as fast as COPY FROM (save_error false).
With save_error, we can only use InputFunctionCallSafe, which I
believe is not as fast as InputFunctionCall.
If any conversion error happens, we need to call the SPI interface,
which adds more overhead; also, we can currently only insert error rows
one by one. (Maybe we could insert into error_save with VALUES (error1),
(error2), ...; I will try later.)

The main code is about constructing the SPI query, plus tests and test output.
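
For concreteness, inserting one captured error via SPI looks roughly like
this (a trimmed sketch with hypothetical locals; the real patch builds the
query and the parameter list dynamically):

    /* lineno, rawline and errmessage are hypothetical locals here */
    Oid     argtypes[3] = {INT8OID, TEXTOID, TEXTOID};
    Datum   values[3];

    values[0] = Int64GetDatum(lineno);
    values[1] = CStringGetTextDatum(rawline);
    values[2] = CStringGetTextDatum(errmessage);

    SPI_connect();
    SPI_execute_with_args("INSERT INTO err_nsp.error_rel (lineno, line, err_message) "
                          "VALUES ($1, $2, $3)",
                          3, argtypes, values, NULL, false, 0);
    SPI_finish();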

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alena Rybakina
Date:

Hi!

Thank you for your contribution to this thread.

On 04.12.2023 05:23, jian he wrote:
hi.
here is my implementation based on previous discussions

It adds a new COPY FROM option, save_error.
save_error only works in non-BINARY mode.
save_error was easier for me to implement; with two words ("save error") I
worry gram.y would not work.
save_error also works with other options like {csv mode, force_null, force_not_null}.

The overall logic is:
if save_error is specified then
   if the error_holding table does not exist, create one
   if the error_holding table exists, set error_firsttime to false
if save_error is not specified, work as the master branch does.

If errors happen, insert the error info into the error_holding table.
If no errors occur and error_firsttime is true, drop the table.
If no errors occur and error_firsttime is false, raise a
notice: All the past error holding saved at %s.%s

Error-holding table:
the schema will be the same as the COPY destination table's.
the table name will be the COPY destination name concatenated with "_error".

error_holding table definition:
CREATE TABLE err_nsp.error_rel (LINENO BIGINT, LINE TEXT,
FIELD TEXT, SOURCE TEXT, ERR_MESSAGE TEXT,
ERR_DETAIL TEXT, ERRORCODE TEXT);

The following field is not implemented:
FIELDS  text[]  separated, de-escaped string fields (the data that was
or would be fed to input functions)

because, imagine the following case:
create type test as (a int, b text);
create table copy_comp (c1 int, c2 test default '(11,test)', c3 date);
copy copy_comp from stdin with (default '\D');
1 \D '2022-07-04'
\.
table copy_comp;

I feel it's hard to get from the textual '\D' to the text[] `(11,test)` via SPI.
--------------------------------------
demo:

create table copy_default_error_save (
id integer,
text_value text not null default 'test',
ts_value timestamp without time zone not null default '2022-07-05'
);
copy copy_default_error_save from stdin with (save_error, default '\D');
k value '2022-07-04'
z \D '2022-07-03ASKL'
s \D \D
\.

NOTICE:  3 rows were skipped because of error. skipped row saved to
table public.copy_default_error_save_error
select  * from copy_default_error_save_error;
 lineno |             line               |  field   |      source      |                         err_message                          | err_detail | errorcode
--------+--------------------------------+----------+------------------+--------------------------------------------------------------+------------+-----------
      1 | k      value   '2022-07-04'    | id       | k                | invalid input syntax for type integer: "k"                   |            | 22P02
      2 | z      \D      '2022-07-03ASKL'| id       | z                | invalid input syntax for type integer: "z"                   |            | 22P02
      2 | z      \D      '2022-07-03ASKL'| ts_value | '2022-07-03ASKL' | invalid input syntax for type timestamp: "'2022-07-03ASKL'"  |            | 22007
      3 | s      \D      \D              | id       | s                | invalid input syntax for type integer: "s"                   |            | 22P02
(4 rows)

The docs are not so good yet.

COPY FROM (save_error) will not be as fast as COPY FROM (save_error false).
With save_error, we can only use InputFunctionCallSafe, which I
believe is not as fast as InputFunctionCall.
If any conversion error happens, we need to call the SPI interface,
which adds more overhead; also, we can currently only insert error rows
one by one. (Maybe we could insert into error_save with VALUES (error1),
(error2), ...; I will try later.)

The main code is about constructing the SPI query, plus tests and test output.
I reviewed it and have a few questions.

1. I saw that you drop the table to which you want to add errors from a failed "copy from" operation before creating it. I think this is wrong, because that table may hold data useful to the user.
At a minimum we should warn the user about this, but I think we could just append a number to the name, such as name_table_1, name_table_2.

2. I noticed that you form the table name using the type of the errors that prevented rows from being added during the 'copy from' operation.
I think it would be better to use the name of the source file that was used while 'copy from' was running.
Also, there may be several such files, which is worth considering too.

3. I found spelling:

/* no err_nsp.error_rel table then crete one. for holding error. */

4. Maybe rewrite this comment:

these info need, no error will drop err_nsp.error_rel table

to:

this information is necessary; no errors will lead to the deletion of the err_nsp.error_rel table.

5. Is this part of the comment needed? I think it duplicates the information below when we form the query.

 * . column list(order by attnum, begin from ctid) =
 *    {ctid, lineno,line,field,source,err_message,err_detail,errorcode}
 * . data types (from attnum = -1) ={tid, int8,text,text,text,text,text,text}

I'm not sure if we need to order the rows by number. It might be easier to work with these lines in the order they appear.

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
jian he
Date:
On Tue, Dec 5, 2023 at 6:07 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> Hi!
>
> Thank you for your contribution to this thread.
>
>
> I reviewed it and have a few questions.
>
> 1. I saw that you drop the table to which you want to add errors from a failed "copy from" operation before creating it. I think this is wrong, because that table may hold data useful to the user.
> At a minimum we should warn the user about this, but I think we could just append a number to the name, such as name_table_1, name_table_2.

Sorry, I don't understand this part.
Currently, if the error table name already exists, the COPY will
fail and an error will be reported.
I first try to create the table; if no errors occur, the error table is dropped.
Can you demo the expected behavior?

> 2. I noticed that you form the table name using the type of the errors that prevented rows from being added during the 'copy from' operation.
> I think it would be better to use the name of the source file that was used while 'copy from' was running.
> Also, there may be several such files, which is worth considering too.
>

Another column added.
now it looks like:

SELECT * FROM save_error_csv_error;
 filename | lineno |              line              | field | source |                 err_message                  | err_detail | errorcode
----------+--------+--------------------------------+-------+--------+----------------------------------------------+------------+-----------
 STDIN    |      1 | 2002  232  40  50  60  70  80  | NULL  | NULL   | extra data after last expected column        | NULL       | 22P04
 STDIN    |      1 | 2000  230  23                  | d     | NULL   | missing data for column "d"                  | NULL       | 22P04
 STDIN    |      1 | z,,""                          | a     | z      | invalid input syntax for type integer: "z"   | NULL       | 22P02
 STDIN    |      2 | \0,,                           | a     | \0     | invalid input syntax for type integer: "\0"  | NULL       | 22P02


> 3. I found spelling:
>
> /* no err_nsp.error_rel table then crete one. for holding error. */
>

fixed.

> 4. Maybe rewrite this comment
>
> these info need, no error will drop err_nsp.error_rel table
> to:
> this information is necessary; no errors will lead to the deletion of the err_nsp.error_rel table.
>

fixed.

> 5. Is this part of the comment needed? I think it duplicates the information below when we form the query.
>
>  * . column list(order by attnum, begin from ctid) =
>  *    {ctid, lineno,line,field,source,err_message,err_detail,errorcode}
>  * . data types (from attnum = -1) ={tid, int8,text,text,text,text,text,text}
>
> I'm not sure if we need to order the rows by number. It might be easier to work with these lines in the order they
appear.
>
Simplified the comment. "order by attnum" is to make sure that if
a table already exists with matching column names and data types,
we consider it suitable for holding the error info.

For COPY FROM, the main entry point is NextCopyFrom.
Now, in non-binary mode, if you specify save_error it will not
fail in NextCopyFrom.
All three of these errors are tolerated: extra data after the last
expected column, missing data for a column, and data type conversion errors.

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alena Rybakina
Date:
Thank you for your work. Unfortunately, your code produced errors during make:

'SAVEPOINT' after 'SAVE_ERROR' in unreserved_keyword list is misplaced
'SAVEPOINT' after 'SAVE_ERROR' in bare_label_keyword list is misplaced
make[2]: *** [../../../src/Makefile.global:783: gram.c] Error 1
make[1]: *** [Makefile:131: parser/gram.h] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [src/Makefile.global:383: submake-generated-headers] Error 2


I am using the Ubuntu 22.04 operating system.

On 06.12.2023 13:47, jian he wrote:
On Tue, Dec 5, 2023 at 6:07 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
Hi!

Thank you for your contribution to this thread.


I reviewed it and have a few questions.

1. I have seen that you delete a table before creating it, to which you want to add errors due to a failed "copy from" operation. I think this is wrong because this table can save useful data for the user.
At a minimum, we should warn the user about this, but I think we can just add some number at the end of the name, such as name_table1, name_table_2.
Sorry. I don't understand this part.
Currently, if the error table name already exists, the copy will
fail and an error will be reported.
I first create the error table; if no error occurs, it is dropped.
To be honest, first of all, I misunderstood this part of the code. Now I see that it works the way you mentioned.

However, I didn't see whether you dealt with cases where we already had a table with the same name as the error table.
I mean, what happens when it tries to create the table for the first time, or can such a problem never arise?
Can you demo the expected behavior?
Unfortunately, I was unable to launch it due to a build issue.

2. I noticed that you are forming a table name using the type of errors that prevent rows from being added during 'copy from' operation.
I think it would be better to use the name of the source file that was used while 'copy from' was running.
In addition, there may be several such files, it is also worth considering.

Another column has been added.
Now it looks like:

SELECT * FROM save_error_csv_error;
 filename | lineno | line                           | field | source | err_message                                 | err_detail | errorcode
----------+--------+--------------------------------+-------+--------+---------------------------------------------+------------+-----------
 STDIN    |      1 | 2002  232  40  50  60  70  80  | NULL  | NULL   | extra data after last expected column       | NULL       | 22P04
 STDIN    |      1 | 2000  230  23                  | d     | NULL   | missing data for column "d"                 | NULL       | 22P04
 STDIN    |      1 | z,,""                          | a     | z      | invalid input syntax for type integer: "z"  | NULL       | 22P02
 STDIN    |      2 | \0,,                           | a     | \0     | invalid input syntax for type integer: "\0" | NULL       | 22P02

Yes, I see the "filename" column, and this will solve the problem, but "STDIN" is unclear to me.
3. I found spelling:

/* no err_nsp.error_rel table then crete one. for holding error. */

fixed.

4. Maybe rewrite this comment

these info need, no error will drop err_nsp.error_rel table
to:
this information is necessary, no error will lead to the deletion of the err_sp.error_rel table.

fixed.
Thank you.
5. Is this part of the comment needed? I think it duplicates the information below when we form the query.
 * . column list(order by attnum, begin from ctid) =
 *    {ctid, lineno,line,field,source,err_message,err_detail,errorcode}
 * . data types (from attnum = -1) ={tid, int8,text,text,text,text,text,text}

I'm not sure if we need to order the rows by number. It might be easier to work with these lines in the order they appear.

Simplified the comment. The "order by attnum" is there to make sure
that if a table already exists with matching column names and data
types, we consider that table good for holding the potential error
info.

COPY FROM's main entry point is NextCopyFrom.
Now, in non-binary mode, if you specified save_error, the copy will not
fail inside NextCopyFrom.
All three of these errors will be tolerated: extra data after the last
expected column, missing data for a column, and data type conversion
errors.
It looks clearer and better, thanks!

Comments phrased as questions are unusual to me; I read them as something to think about, for example, as here (contrib/bloom/blinsert.c:312):

/*
 * Didn't find place to insert in notFullPage array.  Allocate new page.
 * (XXX is it good to do this while holding ex-lock on the metapage??)
 */

Maybe we can rewrite it like this:

/* Check whether the err_nsp.error_rel table already exists
 * and, if it does, check its column names and data types.

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
jian he
Date:
On Fri, Dec 8, 2023 at 3:09 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> Thank you for your work. Unfortunately, your code produced errors during the make build:
>
> 'SAVEPOINT' after 'SAVE_ERROR' in unreserved_keyword list is misplaced
> 'SAVEPOINT' after 'SAVE_ERROR' in bare_label_keyword list is misplaced
> make[2]: *** [../../../src/Makefile.global:783: gram.c] Error 1
> make[1]: *** [Makefile:131: parser/gram.h] Error 2
> make[1]: *** Waiting for unfinished jobs....
> make: *** [src/Makefile.global:383: submake-generated-headers] Error 2
>
> I am using the Ubuntu 22.04 operating system.
>
> On 06.12.2023 13:47, jian he wrote:
>
> On Tue, Dec 5, 2023 at 6:07 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> Hi!
>
> Thank you for your contribution to this thread.
>
>
> I reviewed it and have a few questions.
>
> 1. I have seen that you delete a table before creating it, to which you want to add errors due to a failed "copy
> from" operation. I think this is wrong because this table can save useful data for the user.
> At a minimum, we should warn the user about this, but I think we can just add some number at the end of the name,
> such as name_table1, name_table_2.
>
> Sorry. I don't understand this part.
> Currently, if the error table name already exists, the copy will
> fail and an error will be reported.
> I first create the error table; if no error occurs, it is dropped.
>
> To be honest, first of all, I misunderstood this part of the code. Now I see that it works the way you mentioned.
>
> However, I didn't see whether you dealt with cases where we already had a table with the same name as the error table.
> I mean, what happens when it tries to create the table for the first time, or can such a problem never arise?
>
> Can you demo the expected behavior?
>
> Unfortunately, I was unable to launch it due to a build issue.
>

Hopefully attached will work.

> 2. I noticed that you are forming a table name using the type of errors that prevent rows from being added during
> 'copy from' operation.
> I think it would be better to use the name of the source file that was used while 'copy from' was running.
> In addition, there may be several such files, it is also worth considering.
>
> Another column has been added.
> Now it looks like:
>
> SELECT * FROM save_error_csv_error;
>  filename | lineno | line                           | field | source | err_message                                 | err_detail | errorcode
> ----------+--------+--------------------------------+-------+--------+---------------------------------------------+------------+-----------
>  STDIN    |      1 | 2002  232  40  50  60  70  80  | NULL  | NULL   | extra data after last expected column       | NULL       | 22P04
>  STDIN    |      1 | 2000  230  23                  | d     | NULL   | missing data for column "d"                 | NULL       | 22P04
>  STDIN    |      1 | z,,""                          | a     | z      | invalid input syntax for type integer: "z"  | NULL       | 22P02
>  STDIN    |      2 | \0,,                           | a     | \0     | invalid input syntax for type integer: "\0" | NULL       | 22P02
>
> Yes, I see the "filename" column, and this will solve the problem, but "STDIN" is unclear to me.

please see comment in struct CopyFromStateData:
char    *filename; /* filename, or NULL for STDIN */
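
In other words (my illustration of the comment above, not something the patch itself documents): a server-side COPY opens the file in the backend and can record its path, while psql's \copy reads the client-side file and feeds it to COPY ... FROM STDIN, so the server never sees a filename:

-- server-side COPY: the backend opens the file, so a path can be recorded
COPY test FROM '/tmp/copy_test.csv' WITH (FORMAT csv);

-- psql \copy: the client reads the file and streams it to COPY ... FROM STDIN,
-- so from the server's point of view the source is STDIN
\copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV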


>  */
>
> Maybe we can rewrite it like this:
>
> /* Check whether the err_nsp.error_rel table already exists
> * and, if it does, check its column names and data types.
>
refactored.

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alena Rybakina
Date:

Hi! Thank you for your work. Your patch looks better!

On 10.12.2023 13:32, jian he wrote:
On Fri, Dec 8, 2023 at 3:09 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
Thank you for your work. Unfortunately, your code produced errors during the make build:

'SAVEPOINT' after 'SAVE_ERROR' in unreserved_keyword list is misplaced
'SAVEPOINT' after 'SAVE_ERROR' in bare_label_keyword list is misplaced
make[2]: *** [../../../src/Makefile.global:783: gram.c] Error 1
make[1]: *** [Makefile:131: parser/gram.h] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [src/Makefile.global:383: submake-generated-headers] Error 2

I am using the Ubuntu 22.04 operating system.

On 06.12.2023 13:47, jian he wrote:

On Tue, Dec 5, 2023 at 6:07 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:

Hi!

Thank you for your contribution to this thread.


I reviewed it and have a few questions.

1. I have seen that you delete a table before creating it, to which you want to add errors due to a failed "copy from" operation. I think this is wrong because this table can save useful data for the user.
At a minimum, we should warn the user about this, but I think we can just add some number at the end of the name, such as name_table1, name_table_2.

Sorry. I don't understand this part.
Currently, if the error table name already exists, the copy will
fail and an error will be reported.
I first create the error table; if no error occurs, it is dropped.

To be honest, first of all, I misunderstood this part of the code. Now I see that it works the way you mentioned.

However, I didn't see whether you dealt with cases where we already had a table with the same name as the error table.
I mean, what happens when it tries to create the table for the first time, or can such a problem never arise?

Can you demo the expected behavior?

Unfortunately, I was unable to launch it due to a build issue.

Hopefully attached will work.

Yes, thank you! It works fine, and I see that the regression tests pass. 🙂


However, when I ran a 'copy from ... save_error' operation with simple csv files (copy_test.csv, copy_test1.csv) against tables test and test1 (created as described below):
postgres=# create table test (x int primary key, y int not null);
postgres=# create table test1 (x int, z int, CONSTRAINT fk_x       
      FOREIGN KEY(x)
          REFERENCES test(x));

I did not find a table with saved errors after the operation, although I received a log message about it:

postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV save_error
NOTICE:  2 rows were skipped because of error. skipped row saved to table public.test_error
ERROR:  duplicate key value violates unique constraint "test_pkey"
DETAIL:  Key (x)=(2) already exists.
CONTEXT:  COPY test, line 3

postgres=# select * from public.test_error;
ERROR:  relation "public.test_error" does not exist
LINE 1: select * from public.test_error;

postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error
NOTICE:  2 rows were skipped because of error. skipped row saved to table public.test1_error
ERROR:  insert or update on table "test1" violates foreign key constraint "fk_x"
DETAIL:  Key (x)=(2) is not present in table "test".

postgres=# select * from public.test1_error;
ERROR:  relation "public.test1_error" does not exist
LINE 1: select * from public.test1_error;

Two lines were written correctly in the csv files, therefore they should have been added to the tables, but they were not added to the tables test and test1. 

If I leave only the correct rows, everything works fine and the rows are added to the tables.

in copy_test.csv:

2,0

1,1

in copy_test1.csv:

2,0

2,1

1,1
postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV
COPY 2
postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error
NOTICE:  No error happened.Error holding table public.test1_error will be droped
COPY 3

Maybe I'm launching it the wrong way. If so, let me know about it.


I also notice interesting behavior if the table was previously created by the user. When I was creating an error_table before the 'copy from' operation,
I received a message saying that it is impossible to create a table with the same name (it is shown below) during the 'copy from' operation.
I think you should add information about this in the documentation, since this seems to be normal behavior to me.
postgres=# CREATE TABLE test_error (LINENO BIGINT, LINE TEXT,
FIELD TEXT, SOURCE TEXT, ERR_MESSAGE TEXT,
ERR_DETAIL TEXT, ERRORCODE TEXT);
CREATE TABLE
postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV save_error
ERROR:  Error save table public.test_error already exists. Cannot use it for COPY FROM error saving


2. I noticed that you are forming a table name using the type of errors that prevent rows from being added during 'copy from' operation.
I think it would be better to use the name of the source file that was used while 'copy from' was running.
In addition, there may be several such files, it is also worth considering.

Another column has been added.
Now it looks like:

SELECT * FROM save_error_csv_error;
 filename | lineno | line                           | field | source | err_message                                 | err_detail | errorcode
----------+--------+--------------------------------+-------+--------+---------------------------------------------+------------+-----------
 STDIN    |      1 | 2002  232  40  50  60  70  80  | NULL  | NULL   | extra data after last expected column       | NULL       | 22P04
 STDIN    |      1 | 2000  230  23                  | d     | NULL   | missing data for column "d"                 | NULL       | 22P04
 STDIN    |      1 | z,,""                          | a     | z      | invalid input syntax for type integer: "z"  | NULL       | 22P02
 STDIN    |      2 | \0,,                           | a     | \0     | invalid input syntax for type integer: "\0" | NULL       | 22P02

Yes, I see the "filename" column, and this will solve the problem, but "STDIN" is unclear to me.
please see comment in struct CopyFromStateData:
char    *filename; /* filename, or NULL for STDIN */

Yes, I can see that.

I haven't figured out how to fix it yet either.

 */

Maybe we can rewrite it like this:

/* Check whether the err_nsp.error_rel table already exists
 * and, if it does, check its column names and data types.

refactored.

Fine)

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
jian he
Date:
On Mon, Dec 11, 2023 at 10:05 PM Alena Rybakina
<lena.ribackina@yandex.ru> wrote:
>
> Hi! Thank you for your work. Your patch looks better!
> Yes, thank you! It works fine, and I see that the regression tests pass. 🙂
> However, when I ran a 'copy from ... save_error' operation with simple csv files (copy_test.csv, copy_test1.csv)
> against tables test and test1 (created as described below):
>
> postgres=# create table test (x int primary key, y int not null);
> postgres=# create table test1 (x int, z int, CONSTRAINT fk_x
>       FOREIGN KEY(x)
>           REFERENCES test(x));
>
> I did not find a table with saved errors after the operation, although I received a log message about it:
>
> postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV save_error
> NOTICE:  2 rows were skipped because of error. skipped row saved to table public.test_error
> ERROR:  duplicate key value violates unique constraint "test_pkey"
> DETAIL:  Key (x)=(2) already exists.
> CONTEXT:  COPY test, line 3
>
> postgres=# select * from public.test_error;
> ERROR:  relation "public.test_error" does not exist
> LINE 1: select * from public.test_error;
>
> postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error
> NOTICE:  2 rows were skipped because of error. skipped row saved to table public.test1_error
> ERROR:  insert or update on table "test1" violates foreign key constraint "fk_x"
> DETAIL:  Key (x)=(2) is not present in table "test".
>
> postgres=# select * from public.test1_error;
> ERROR:  relation "public.test1_error" does not exist
> LINE 1: select * from public.test1_error;
>
> Two lines were written correctly in the csv files, therefore they should have been added to the tables, but they were
> not added to the tables test and test1.
>
> If I leave only the correct rows, everything works fine and the rows are added to the tables.
>
> in copy_test.csv:
>
> 2,0
>
> 1,1
>
> in copy_test1.csv:
>
> 2,0
>
> 2,1
>
> 1,1
>
> postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV
> COPY 2
> postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error
> NOTICE:  No error happened.Error holding table public.test1_error will be droped
> COPY 3
>
> Maybe I'm launching it the wrong way. If so, let me know about it.

It looks like the above is about constraint violations while copying.
Constraint violations during COPY are not in the scope of this patch.

Since COPY FROM is very much like the INSERT command,
do you really want all the valid constraints to be checked against all the copied rows?

but the notice raised by the patch is not right.
So I place the drop error saving table or raise notice logic above
`ExecResetTupleTable(estate->es_tupleTable, false)` in the function
CopyFrom.

>
> I also notice interesting behavior if the table was previously created by the user. When I was creating an
> error_table before the 'copy from' operation,
> I received a message saying that it is impossible to create a table with the same name (it is shown below) during the
> 'copy from' operation.
> I think you should add information about this in the documentation, since this seems to be normal behavior to me.
>

Doc changed; you may check it.

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alena Rybakina
Date:
On 12.12.2023 16:04, jian he wrote:
On Mon, Dec 11, 2023 at 10:05 PM Alena Rybakina
<lena.ribackina@yandex.ru> wrote:
Hi! Thank you for your work. Your patch looks better!
Yes, thank you! It works fine, and I see that the regression tests pass. 🙂
However, when I ran a 'copy from ... save_error' operation with simple csv files (copy_test.csv, copy_test1.csv) against tables test and test1 (created as described below):

postgres=# create table test (x int primary key, y int not null);
postgres=# create table test1 (x int, z int, CONSTRAINT fk_x      FOREIGN KEY(x)          REFERENCES test(x));

I did not find a table with saved errors after the operation, although I received a log message about it:

postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV save_error
NOTICE:  2 rows were skipped because of error. skipped row saved to table public.test_error
ERROR:  duplicate key value violates unique constraint "test_pkey"
DETAIL:  Key (x)=(2) already exists.
CONTEXT:  COPY test, line 3

postgres=# select * from public.test_error;
ERROR:  relation "public.test_error" does not exist
LINE 1: select * from public.test_error;

postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error
NOTICE:  2 rows were skipped because of error. skipped row saved to table public.test1_error
ERROR:  insert or update on table "test1" violates foreign key constraint "fk_x"
DETAIL:  Key (x)=(2) is not present in table "test".

postgres=# select * from public.test1_error;
ERROR:  relation "public.test1_error" does not exist
LINE 1: select * from public.test1_error;

Two lines were written correctly in the csv files, therefore they should have been added to the tables, but they were not added to the tables test and test1.

If I leave only the correct rows, everything works fine and the rows are added to the tables.

in copy_test.csv:

2,0

1,1

in copy_test1.csv:

2,0

2,1

1,1

postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV
COPY 2
postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error
NOTICE:  No error happened.Error holding table public.test1_error will be droped
COPY 3

Maybe I'm launching it the wrong way. If so, let me know about it.
It looks like the above is about constraint violations while copying.
Constraint violations during COPY are not in the scope of this patch.

Since COPY FROM is very much like the INSERT command,
do you really want all the valid constraints to be checked against all the copied rows?
No, I think it will be too much.
but the notice raised by the patch is not right.
So I place the drop error saving table or raise notice logic above
`ExecResetTupleTable(estate->es_tupleTable, false)` in the function
CopyFrom.
Yes, I see it and agree with you.
I also notice interesting behavior if the table was previously created by the user. When I was creating an error_table before the 'copy from' operation,
I received a message saying that it is impossible to create a table with the same name (it is shown below) during the 'copy from' operation.
I think you should add information about this in the documentation, since this seems to be normal behavior to me.

Doc changed; you may check it.
Yes, I saw it. Thank you.

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
Hi,

On Tue, Dec 12, 2023 at 10:04 PM jian he <jian.universality@gmail.com> wrote:
>
> On Mon, Dec 11, 2023 at 10:05 PM Alena Rybakina
> <lena.ribackina@yandex.ru> wrote:
> >
> > Hi! Thank you for your work. Your patch looks better!
> > Yes, thank you! It works fine, and I see that the regression tests pass. 🙂
> > However, when I ran a 'copy from ... save_error' operation with simple csv files (copy_test.csv, copy_test1.csv)
> > against tables test and test1 (created as described below):
> >
> > postgres=# create table test (x int primary key, y int not null);
> > postgres=# create table test1 (x int, z int, CONSTRAINT fk_x
> >       FOREIGN KEY(x)
> >           REFERENCES test(x));
> >
> > I did not find a table with saved errors after the operation, although I received a log message about it:
> >
> > postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV save_error
> > NOTICE:  2 rows were skipped because of error. skipped row saved to table public.test_error
> > ERROR:  duplicate key value violates unique constraint "test_pkey"
> > DETAIL:  Key (x)=(2) already exists.
> > CONTEXT:  COPY test, line 3
> >
> > postgres=# select * from public.test_error;
> > ERROR:  relation "public.test_error" does not exist
> > LINE 1: select * from public.test_error;
> >
> > postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error
> > NOTICE:  2 rows were skipped because of error. skipped row saved to table public.test1_error
> > ERROR:  insert or update on table "test1" violates foreign key constraint "fk_x"
> > DETAIL:  Key (x)=(2) is not present in table "test".
> >
> > postgres=# select * from public.test1_error;
> > ERROR:  relation "public.test1_error" does not exist
> > LINE 1: select * from public.test1_error;
> >
> > Two lines were written correctly in the csv files, therefore they should have been added to the tables, but they
> > were not added to the tables test and test1.
> >
> > If I leave only the correct rows, everything works fine and the rows are added to the tables.
> >
> > in copy_test.csv:
> >
> > 2,0
> >
> > 1,1
> >
> > in copy_test1.csv:
> >
> > 2,0
> >
> > 2,1
> >
> > 1,1
> >
> > postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV
> > COPY 2
> > postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error
> > NOTICE:  No error happened.Error holding table public.test1_error will be droped
> > COPY 3
> >
> > Maybe I'm launching it the wrong way. If so, let me know about it.
>
> It looks like the above is about constraint violations while copying.
> Constraint violations during COPY are not in the scope of this patch.
>
> Since COPY FROM is very much like the INSERT command,
> do you really want all the valid constraints to be checked against all the copied rows?
>
> but the notice raised by the patch is not right.
> So I place the drop error saving table or raise notice logic above
> `ExecResetTupleTable(estate->es_tupleTable, false)` in the function
> CopyFrom.
>
> >
> > I also notice interesting behavior if the table was previously created by the user. When I was creating an
> > error_table before the 'copy from' operation,
> > I received a message saying that it is impossible to create a table with the same name (it is shown below) during
> > the 'copy from' operation.
> > I think you should add information about this in the documentation, since this seems to be normal behavior to me.
> >
>
> Doc changed; you may check it.

I've read this thread and the latest patch. IIUC with SAVE_ERROR
option, COPY FROM creates an error table for the target table and
writes error information there.

While I agree that the final shape of this feature would be something
like that design, I'm concerned some features are missing in order to
make this feature useful in practice. For instance, error logs are
inserted to error tables without bounds, meaning that users who want
to tolerate errors during COPY FROM  will have to truncate or drop the
error tables periodically, or the database will grow with error logs
without limit. Ideally such maintenance work should be done by the
database. There might be some users who want to log such conversion
errors in server logs to avoid such maintenance work. I think we
should provide an option for where to write, at least. Also, since the
error tables are normal user tables internally, error logs are also
replicated to subscribers if there is a publication FOR ALL TABLES,
unlike system catalogs. I think some users would not like such
behavior.

Looking at SAVE_ERROR feature closely, I think it consists of two
separate features. That is, it enables COPY FROM to load data while
(1) tolerating errors and (2) logging errors to somewhere (i.e., an
error table). If we implement only (1), it would be like COPY FROM
tolerate errors infinitely and log errors to /dev/null. The user
cannot see the error details but I guess it could still help some
cases as Andres mentioned[1] (it might be a good idea to send the
number of rows successfully loaded in a NOTICE message if some rows
could not be loaded). Then with (2), COPY FROM can log error
information to somewhere such as tables and server logs and the user
can select it. So I'm thinking we may be able to implement this
feature incrementally. The first step would be something like an
option to ignore all errors or an option to specify the maximum number
of errors to tolerate before raising an ERROR. The second step would
be to support logging destinations such as server logs and tables.
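
As a sketch of what these two steps could look like at the SQL level (the option names below are hypothetical, purely for illustration; the thread has not settled on any syntax):

-- step 1: only tolerate errors, with no logging destination yet
COPY t FROM '/path/data.csv' WITH (FORMAT csv, ignore_errors true);
-- NOTICE:  2 rows were skipped because of errors

-- step 2: additionally choose where the error information goes
COPY t FROM '/path/data.csv' WITH (FORMAT csv, save_error_to 'log');
COPY t FROM '/path/data.csv' WITH (FORMAT csv, save_error_to 'table');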

Regards,

[1] https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
jian he
Date:
On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> I've read this thread and the latest patch. IIUC with SAVE_ERROR
> option, COPY FROM creates an error table for the target table and
> writes error information there.
>
> While I agree that the final shape of this feature would be something
> like that design, I'm concerned some features are missing in order to
> make this feature useful in practice. For instance, error logs are
> inserted to error tables without bounds, meaning that users who want
> to tolerate errors during COPY FROM  will have to truncate or drop the
> error tables periodically, or the database will grow with error logs
> without limit. Ideally such maintenance work should be done by the
> database. There might be some users who want to log such conversion
> errors in server logs to avoid such maintenance work. I think we
> should provide an option for where to write, at least. Also, since the
> error tables are normal user tables internally, error logs are also
> replicated to subscribers if there is a publication FOR ALL TABLES,
> unlike system catalogs. I think some users would not like such
> behavior.

Saving the error metadata to system catalogs would be more expensive;
please see the explanation below.
I have no knowledge of publications,
but I feel there is a feature request here: publication FOR ALL TABLES
excluding a regex_pattern.
Anyway, that would be another topic.

> Looking at SAVE_ERROR feature closely, I think it consists of two
> separate features. That is, it enables COPY FROM to load data while
> (1) tolerating errors and (2) logging errors to somewhere (i.e., an
> error table). If we implement only (1), it would be like COPY FROM
> tolerate errors infinitely and log errors to /dev/null. The user
> cannot see the error details but I guess it could still help some
> cases as Andres mentioned[1] (it might be a good idea to send the
> number of rows successfully loaded in a NOTICE message if some rows
> could not be loaded). Then with (2), COPY FROM can log error
> information to somewhere such as tables and server logs and the user
> can select it. So I'm thinking we may be able to implement this
> feature incrementally. The first step would be something like an
> option to ignore all errors or an option to specify the maximum number
> of errors to tolerate before raising an ERROR. The second step would
> be to support logging destinations such as server logs and tables.
>
> Regards,
>
> [1] https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com

> feature incrementally. The first step would be something like an
> option to ignore all errors or an option to specify the maximum number
> of errors to tolerate before raising an ERROR. The second step would

I don't think "specify the maximum number  of errors to tolerate
before raising an ERROR." is very useful....

QUOTE from [1]
MAXERROR [AS] error_count
If the load returns the error_count number of errors or greater, the
load fails. If the load returns fewer errors, it continues and returns
an INFO message that states the number of rows that could not be
loaded. Use this parameter to allow loads to continue when certain
rows fail to load into the table because of formatting errors or other
inconsistencies in the data.
Set this value to 0 or 1 if you want the load to fail as soon as the
first error occurs. The AS keyword is optional. The MAXERROR default
value is 0 and the limit is 100000.
The actual number of errors reported might be greater than the
specified MAXERROR because of the parallel nature of Amazon Redshift.
If any node in the Amazon Redshift cluster detects that MAXERROR has
been exceeded, each node reports all of the errors it has encountered.
END OF QUOTE

Option MAXERROR error_count: IIUC, it fails while validating line
error_count + 1; otherwise it raises a notice telling you how many rows
have errors.

* If error_count is small and the copy fails, it only tells you that at
least error_count lines have malformed data. But what if the number of
malformed rows is actually very large? In this case, the failure
message is not that helpful.
* If error_count is very large and the copy does not fail, the number
of malformed rows may still be very large (though less than
error_count), but there is no error report, so you don't know which
lines have errors.

Either way, if the file has a large portion of malformed rows, then
the MAXERROR option does not make sense.
So maybe we don't need a threshold for tolerating errors.

However, we could have an option that does not actually copy to the
table but only validates, similar to NOLOAD in [1].

Why we save the error:
* If only a small portion of the rows are malformed, saving the error
metadata is cheap.
* If a large portion of the rows are malformed, the copy will be slow,
but we have saved the error metadata, and you can fix the data based on
it.

I think saving errors to a regular table or a text file seems sane, but
not to a catalog table.
* For a text file with M rows and N fields, a contrived corner case
would be (M-2) * N errors, where the last 2 rows have duplicate keys
that violate the primary key constraint. In this case, we would first
insert (M-2) * N rows into the catalog table and then undo them because
of the error. I think that would be expensive.
* The error meta info is not as important as other pg_catalog tables.

The log format is quite verbose, so save_error to the log seems not so
good, I guess.

I suppose we can specify an ERRORFILE directory; there is a similar
implementation in [2], with a demo in [3].
It will generate 2 files: one shows the malformed line content as
is, the other shows the error info.

Let's assume we save the error info to a table:
The previous thread said that having one COPY operation create one
error table is not a good idea; looking back, I agree.
Similar to [4],
I come up with the following logic/ideas:
* the save_error table name would be COPY_ERRORS, with the schema the
same as the COPY FROM destination table.
* one COPY_ERRORS table saves the error metadata generated by all COPY
FROM commands.
* if save_error is specified, then before doing the COPY FROM, first
check whether the table COPY_ERRORS exists;
if not, create one? Or raise an error saying that COPY_ERRORS does
not exist, so save_error cannot be used?
* the COPY_ERRORS table owner would be the current database owner?
* only the table owner is allowed to INSERT/DELETE/UPDATE; others are
not allowed to INSERT/DELETE/UPDATE.
When an error happens while copying, record the userid, then switch to
the COPY_ERRORS owner to execute the insert command.
* the user who is doing the COPY FROM operation is solely allowed to
view (select) the errored rows they generated.

The COPY_ERRORS table would be:
userid           oid          -- the user who is doing this operation
error_time       timestamptz  -- when this error happened (not 100% sure this column is needed)
filename         text         -- the copy from source
table_name       text         -- the copy from destination
lineno           bigint       -- the error line number
line             text         -- the whole raw line content
colname          text         -- field with the error
raw_field_value  text         -- the value of the field that leads to the error
err_message      text         -- same as ErrorData->message
err_detail       text         -- same as ErrorData->detail
errorcode        text         -- transformed errcode, example "22P02"
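
Spelled out as DDL, the column list above would look roughly like this (a sketch only; ownership, privileges, and row level security handling are deliberately omitted):

CREATE TABLE copy_errors (
    userid          oid,          -- the user who ran the COPY
    error_time      timestamptz,  -- when the error happened
    filename        text,         -- the COPY FROM source
    table_name      text,         -- the COPY FROM destination
    lineno          bigint,       -- the error line number
    line            text,         -- the whole raw line content
    colname         text,         -- field with the error
    raw_field_value text,         -- value of the field that caused the error
    err_message     text,         -- same as ErrorData->message
    err_detail      text,         -- same as ErrorData->detail
    errorcode       text          -- transformed errcode, e.g. "22P02"
);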


[1] https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html
[2] https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver16
[3] https://www.sqlshack.com/working-with-line-numbers-and-errors-using-bulk-insert/
[4] https://docs.aws.amazon.com/redshift/latest/dg/r_STL_LOAD_ERRORS.html



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2023-12-15 05:48, Masahiko Sawada wrote:

Thanks for joining this discussion!

> I've read this thread and the latest patch. IIUC with SAVE_ERROR
> option, COPY FROM creates an error table for the target table and
> writes error information there.
> 
> While I agree that the final shape of this feature would be something
> like that design, I'm concerned some features are missing in order to
> make this feature useful in practice. For instance, error logs are
> inserted to error tables without bounds, meaning that users who want
> to tolerate errors during COPY FROM  will have to truncate or drop the
> error tables periodically, or the database will grow with error logs
> without limit. Ideally such maintenance work should be done by the
> database. There might be some users who want to log such conversion
> errors in server logs to avoid such maintenance work. I think we
> should provide an option for where to write, at least. Also, since the
> error tables are normal user tables internally, error logs are also
> replicated to subscribers if there is a publication FOR ALL TABLES,
> unlike system catalogs. I think some users would not like such
> behavior.
> 
> Looking at SAVE_ERROR feature closely, I think it consists of two
> separate features. That is, it enables COPY FROM to load data while
> (1) tolerating errors and (2) logging errors to somewhere (i.e., an
> error table). If we implement only (1), it would be like COPY FROM
> tolerate errors infinitely and log errors to /dev/null. The user
> cannot see the error details but I guess it could still help some
> cases as Andres mentioned[1] (it might be a good idea to send the
> number of rows successfully loaded in a NOTICE message if some rows
> could not be loaded). Then with (2), COPY FROM can log error
> information to somewhere such as tables and server logs and the user
> can select it.

+1.
I may be biased since I wrote some ~v6 patches which just output the
soft errors and the number of skipped rows to the log, but I think just
(1) would be worth implementing, as you pointed out, and I would like
it if users could choose where to log the output.

I think there would be situations where it is preferable to save errors
to the server log, even considering the problems pointed out in [1],
i.e. manually loading data.

[1] 
https://www.postgresql.org/message-id/739953.1699467519%40sss.pgh.pa.us

> feature incrementally. The first step would be something like an
> option to ignore all errors or an option to specify the maximum number
> of errors to tolerate before raising an ERROR. The second step would
> be to support logging destinations such as server logs and tables.
> 
> Regards,
> 
> [1] 
> https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
Hi,

> save the error metadata to  system catalogs would be more expensive,
> please see below explanation.
> I have no knowledge of publications.
> but i feel there is a feature request: publication FOR ALL TABLES
> exclude regex_pattern.
> Anyway, that would be another topic.

I think saving error metadata to a system catalog is not a good idea either.
And I believe Sawada-san just pointed out missing features and did not
suggest that we use a system catalog.

> I don't think "specify the maximum number  of errors to tolerate
> before raising an ERROR." is very useful....

That may be so.
I imagine it's useful in some use cases, since some loading tools have
such options.
Anyway, I agree it's not necessary for the initial patch, as mentioned in [1].

> I suppose we can specify an ERRORFILE directory; there is a similar
> implementation in [2], with a demo in [3].
> It will generate 2 files: one shows the malformed line content as
> is, the other shows the error info.

That may be a good option when considering "(2) logging errors to 
somewhere".

What do you think about the proposal to develop these features
incrementally?

On 2023-12-15 05:48, Masahiko Sawada wrote:
> So I'm thinking we may be able to implement this
> feature incrementally. The first step would be something like an
> option to ignore all errors or an option to specify the maximum number
> of errors to tolerate before raising an ERROR. The second step would
> be to support logging destinations such as server logs and tables.

[1] 
https://www.postgresql.org/message-id/752672.1699474336%40sss.pgh.pa.us

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
jian he
Date:
On Mon, Dec 18, 2023 at 1:09 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> Hi,
>
> > Saving the error metadata to system catalogs would be more expensive;
> > please see the explanation below.
> > I have no knowledge of publications,
> > but I feel there is a feature request here: publication FOR ALL TABLES
> > excluding a regex_pattern.
> > Anyway, that would be another topic.
>
> I think saving error metadata to a system catalog is not a good idea either.
> And I believe Sawada-san just pointed out missing features and did not
> suggest that we use a system catalog.
>
> > I don't think "specify the maximum number  of errors to tolerate
> > before raising an ERROR." is very useful....
>
> That may be so.
> I imagine it's useful in some use cases, since some loading tools have
> such options.
> Anyway, I agree it's not necessary for the initial patch, as mentioned in [1].
>
> > I suppose we can specify an ERRORFILE directory; there is a similar
> > implementation in [2], with a demo in [3].
> > It will generate 2 files: one shows the malformed line content as
> > is, the other shows the error info.
>
> That may be a good option when considering "(2) logging errors to
> somewhere".
>
> What do you think about the proposal to develop these features
> incrementally?
>

I am more with Tom's idea [1]: that is, when errors happen (data type
conversion only), do not fail, AND save the error to a table. I
guess we can implement this logic together, with only a new COPY
option.

Imagine a case (it's not that contrived, imho) where, while converting
from text to the table's int, postgres's isspace differs from the
source text file's isspace logic;
then all the lines are malformed. If we just say "on error continue"
and do not save the error meta info, the user is still confused about
which field has the wrong data, and will probably have to test
incrementally which field contains the malformed data.

Since we need to save the error somewhere, and everyone who has the
INSERT privilege can do COPY,
I think we also need to handle the access privileges.
So, as I mentioned above, with one copy_error error table hub,
everyone can view/select their own copy failure records.

But if we save to a server text file/directory, it is not easy for a
user with only INSERT privilege to see these files, I think.
Similarly, it is not easy to see these failed records in the log with
limited privileges.

If someone wants to fail at maxerror rows, they can do it, since we
will count how many rows failed,
even though I don't really see the point of it.

[1] https://www.postgresql.org/message-id/900123.1699488001%40sss.pgh.pa.us



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
On Mon, Dec 18, 2023 at 9:16 AM jian he <jian.universality@gmail.com> wrote:
>
> On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Hi,
> >
> > I've read this thread and the latest patch. IIUC with SAVE_ERROR
> > option, COPY FROM creates an error table for the target table and
> > writes error information there.
> >
> > While I agree that the final shape of this feature would be something
> > like that design, I'm concerned some features are missing in order to
> > make this feature useful in practice. For instance, error logs are
> > inserted to error tables without bounds, meaning that users who want
> > to tolerate errors during COPY FROM  will have to truncate or drop the
> > error tables periodically, or the database will grow with error logs
> > without limit. Ideally such maintenance work should be done by the
> > database. There might be some users who want to log such conversion
> > errors in server logs to avoid such maintenance work. I think we
> > should provide an option for where to write, at least. Also, since the
> > error tables are normal user tables internally, error logs are also
> > replicated to subscribers if there is a publication FOR ALL TABLES,
> > unlike system catalogs. I think some users would not like such
> > behavior.
>
> Saving the error metadata to system catalogs would be more expensive;
> please see the explanation below.
> I have no knowledge of publications,
> but I feel there is a feature request here: publication FOR ALL TABLES
> excluding a regex_pattern.
> Anyway, that would be another topic.

I don't think the new regex idea would be a good solution for the
existing users who are using FOR ALL TABLES publication. It's not
desirable that they have to change the publication because of this
feature. With the current patch, a logical replication using FOR ALL
TABLES publication will stop immediately after an error information is
inserted into a new error table unless the same error table is created
on subscribers.

>
> > Looking at SAVE_ERROR feature closely, I think it consists of two
> > separate features. That is, it enables COPY FROM to load data while
> > (1) tolerating errors and (2) logging errors to somewhere (i.e., an
> > error table). If we implement only (1), it would be like COPY FROM
> > tolerate errors infinitely and log errors to /dev/null. The user
> > cannot see the error details but I guess it could still help some
> > cases as Andres mentioned[1] (it might be a good idea to send the
> > number of rows successfully loaded in a NOTICE message if some rows
> > could not be loaded). Then with (2), COPY FROM can log error
> > information to somewhere such as tables and server logs and the user
> > can select it. So I'm thinking we may be able to implement this
> > feature incrementally. The first step would be something like an
> > option to ignore all errors or an option to specify the maximum number
> > of errors to tolerate before raising an ERROR. The second step would
> > be to support logging destinations such as server logs and tables.
> >
> > Regards,
> >
> > [1] https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de
> >
> > --
> > Masahiko Sawada
> > Amazon Web Services: https://aws.amazon.com
>
> > feature incrementally. The first step would be something like an
> > option to ignore all errors or an option to specify the maximum number
> > of errors to tolerate before raising an ERROR. The second step would
>
> I don't think "specify the maximum number  of errors to tolerate
> before raising an ERROR." is very useful....
>
> QUOTE from [1]
> MAXERROR [AS] error_count
> If the load returns the error_count number of errors or greater, the
> load fails. If the load returns fewer errors, it continues and returns
> an INFO message that states the number of rows that could not be
> loaded. Use this parameter to allow loads to continue when certain
> rows fail to load into the table because of formatting errors or other
> inconsistencies in the data.
> Set this value to 0 or 1 if you want the load to fail as soon as the
> first error occurs. The AS keyword is optional. The MAXERROR default
> value is 0 and the limit is 100000.
> The actual number of errors reported might be greater than the
> specified MAXERROR because of the parallel nature of Amazon Redshift.
> If any node in the Amazon Redshift cluster detects that MAXERROR has
> been exceeded, each node reports all of the errors it has encountered.
> END OF QUOTE
>
> Option MAXERROR error_count: IIUC, it fails while validating line
> error_count + 1; otherwise it raises a notice telling you how many rows
> have errors.
>
> * If error_count is small and the copy fails, it only tells you that at
> least error_count lines have malformed data. But what if the number of
> malformed rows is actually very large? In this case, the failure
> message is not that helpful.
> * If error_count is very large and the copy does not fail, the number
> of malformed rows may still be very large (though less than
> error_count), but there is no error report, so you don't know which
> lines have errors.
>
> Either way, if the file has a large portion of malformed rows, then
> the MAXERROR option does not make sense.
> So maybe we don't need a threshold for tolerating errors.
>
> However, we could have an option that does not actually copy to the
> table but only validates, similar to NOLOAD in [1].

I'm fine even if the feature is not like MAXERROR. If we want a
feature to tolerate errors during COPY FROM, I just thought it might
be a good idea to have a tuning knob for better flexibility, not just
an on/off switch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
On Mon, Dec 18, 2023 at 4:41 PM jian he <jian.universality@gmail.com> wrote:
>
> On Mon, Dec 18, 2023 at 1:09 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
> >
> > Hi,
> >
> > > Saving the error metadata to system catalogs would be more expensive;
> > > please see the explanation below.
> > > I have no knowledge of publications,
> > > but I feel there is a feature request here: publication FOR ALL TABLES
> > > excluding a regex_pattern.
> > > Anyway, that would be another topic.
> >
> > I think saving error metadata to a system catalog is not a good idea either.
> > And I believe Sawada-san just pointed out missing features and did not
> > suggest that we use a system catalog.
> >
> > > I don't think "specify the maximum number  of errors to tolerate
> > > before raising an ERROR." is very useful....
> >
> > That may be so.
> > I imagine it's useful in some use cases, since some loading tools have
> > such options.
> > Anyway, I agree it's not necessary for the initial patch, as mentioned in [1].
> >
> > > I suppose we can specify an ERRORFILE directory; there is a similar
> > > implementation in [2], with a demo in [3].
> > > It will generate 2 files: one shows the malformed line content as
> > > is, the other shows the error info.
> >
> > That may be a good option when considering "(2) logging errors to
> > somewhere".
> >
> > What do you think about the proposal to develop these features
> > incrementally?
> >
>
> I am more with Tom's idea [1]: that is, when errors happen (data type
> conversion only), do not fail, AND save the error to a table. I
> guess we can implement this logic together, with only a new COPY
> option.

If we want only such a feature we need to implement it together (the
patch could be split, though). But if some parts of the feature are
useful for users as well, I'd recommend implementing it incrementally.
That way, the patches can get small and it would be easy for reviewers
and committers to review/commit them.

>
> Imagine a case (it's not that contrived, imho) where, while converting
> from text to the table's int, postgres's isspace differs from the
> source text file's isspace logic;
> then all the lines are malformed. If we just say "on error continue"
> and do not save the error meta info, the user is still confused about
> which field has the wrong data, and will probably have to test
> incrementally which field contains the malformed data.
>
> Since we need to save the error somewhere, and everyone who has the
> INSERT privilege can do COPY,
> I think we also need to handle the access privileges.
> So, as I mentioned above, with one copy_error error table hub,
> everyone can view/select their own copy failure records.

The error table hub idea is still unclear to me. I assume that there
are error tables at least on each database. And an error table can
have error data that happened during COPY FROM, including malformed
lines. Do the error tables grow without bounds and the users have to
delete rows at some point? If so, who can do that? How can we achieve
that the users can see only errored rows they generated? And the issue
with logical replication also needs to be resolved. Anyway, if we go
this direction, we need to discuss the overall design.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
jian he
Date:
On Tue, Dec 19, 2023 at 9:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
> The error table hub idea is still unclear to me. I assume that there
> are error tables at least on each database. And an error table can
> have error data that happened during COPY FROM, including malformed
> lines. Do the error tables grow without bounds and the users have to
> delete rows at some point? If so, who can do that? How can we achieve
> that the users can see only errored rows they generated? And the issue
> with logical replication also needs to be resolved. Anyway, if we go
> this direction, we need to discuss the overall design.
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com

Please check my latest attached POC.
The main content is building the SPI query, executing the SPI query,
and the regression test and its expected output.

copy_errors is one per schema.
foo.copy_errors will be owned by the owner of schema foo.

If you can insert into a table in that specific schema, let's say foo,
then you will get the privilege to INSERT/DELETE/SELECT
on foo.copy_errors.
If you are not a superuser, you are only allowed to do
INSERT/DELETE/SELECT on the foo.copy_errors rows where USERID =
current_user::regrole::oid.
This is done via row level security.
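
A minimal sketch of what that row level security setup could look like, assuming the copy_errors column list proposed earlier in this thread (the exact policy in the POC patch may differ):

ALTER TABLE foo.copy_errors ENABLE ROW LEVEL SECURITY;

-- non-superusers may only see and modify the rows they generated themselves
CREATE POLICY copy_errors_self ON foo.copy_errors
    USING (userid = current_user::regrole::oid)
    WITH CHECK (userid = current_user::regrole::oid);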

Since foo.copy_errors mainly receives INSERT operations, if copy_errors
grows too much, that means your source file has many errors and it will
take a very long time to finish the whole COPY; maybe we can watch how
many errors have been encountered from another client.

I don't know how to deal with logical replication. Looking for ideas.

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
On Wed, Dec 20, 2023 at 1:07 PM jian he <jian.universality@gmail.com> wrote:
>
> On Tue, Dec 19, 2023 at 9:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> >
> > The error table hub idea is still unclear to me. I assume that there
> > are error tables at least on each database. And an error table can
> > have error data that happened during COPY FROM, including malformed
> > lines. Do the error tables grow without bounds and the users have to
> > delete rows at some point? If so, who can do that? How can we achieve
> > that the users can see only errored rows they generated? And the issue
> > with logical replication also needs to be resolved. Anyway, if we go
> > this direction, we need to discuss the overall design.
> >
> > Regards,
> >
> > --
> > Masahiko Sawada
> > Amazon Web Services: https://aws.amazon.com
>
> Please check my latest attached POC.
> Main content is to build spi query, execute the spi query, regress
> test and regress output.

Why do we need to use SPI? I think we can form heap tuples and insert
them to the error table. Creating the error table also doesn't need to
use SPI.

>
> copy_errors is one per schema.
> foo.copy_errors will be owned by the owner of schema foo.

It seems that the error table is created when the SAVE_ERROR is used
for the first time. It probably blocks concurrent COPY FROM commands
with SAVE_ERROR option to different tables if the error table is not
created yet.

>
> If you can insert into a table in that specific schema, let's say foo,
> then you will get the privilege to INSERT/DELETE/SELECT
> on foo.copy_errors.
> If you are not a superuser, you are only allowed to do
> INSERT/DELETE/SELECT on the foo.copy_errors rows where USERID =
> current_user::regrole::oid.
> This is done via row level security.

I don't think it works. If the user is dropped, the user's oid could
be reused for a different user.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
jian he
Date:
On Wed, Dec 20, 2023 at 8:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
> Why do we need to use SPI? I think we can form heap tuples and insert
> them to the error table. Creating the error table also doesn't need to
> use SPI.
>
Thanks for pointing it out. I figured out how to form heap tuples and
insert them into the error table,
but I don't know how to create the error table without using SPI.
Please point it out.

> >
> > copy_errors is one per schema.
> > foo.copy_errors will be owned by the owner of schema foo.
>
> It seems that the error table is created when the SAVE_ERROR is used
> for the first time. It probably blocks concurrent COPY FROM commands
> with SAVE_ERROR option to different tables if the error table is not
> created yet.
>
I don't know how to solve this problem... Maybe we can document this,
but it will block the COPY FROM immediately.

> >
> > If you can insert into a table in that specific schema, let's say foo,
> > then you will get the privilege to INSERT/DELETE/SELECT
> > on foo.copy_errors.
> > If you are not a superuser, you are only allowed to do
> > INSERT/DELETE/SELECT on the foo.copy_errors rows where USERID =
> > current_user::regrole::oid.
> > This is done via row level security.
>
> I don't think it works. If the user is dropped, the user's oid could
> be reused for a different user.
>

You are right,
so I changed it: now the schema owner will be the error table owner.
For every error table tuple insert,
I switch to the schema owner, do the insert, then switch back to the
user running COPY FROM (a sketch of that switching pattern is below).
Now everyone (except superusers) will need an explicit grant to access
the error table.
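
For reference, a minimal sketch of the backend's usual user-switching
pattern; schema_owner_oid and insert_error_tuple() are hypothetical
placeholders for the owner lookup and the insert itself:

    Oid     save_userid;
    int     save_sec_context;

    GetUserIdAndSecContext(&save_userid, &save_sec_context);
    SetUserIdAndSecContext(schema_owner_oid,
                           save_sec_context | SECURITY_RESTRICTED_OPERATION);

    insert_error_tuple(errrel, tup);    /* hypothetical helper */

    /* restore the identity of the user running COPY FROM */
    SetUserIdAndSecContext(save_userid, save_sec_context);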

Attachment
On Thu, 28 Dec 2023 at 09:27, jian he <jian.universality@gmail.com> wrote:
>
> On Wed, Dec 20, 2023 at 8:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> >
> > Why do we need to use SPI? I think we can form heap tuples and insert
> > them to the error table. Creating the error table also doesn't need to
> > use SPI.
> >
> Thanks for pointing it out. I figured out how to form heap tuples and
> insert them to the error table.
> but I don't know how to create the error table without using SPI.
> Please pointer it out.
>
> > >
> > > copy_errors one per schema.
> > > foo.copy_errors will be owned by the schema: foo owner.
> >
> > It seems that the error table is created when the SAVE_ERROR is used
> > for the first time. It probably blocks concurrent COPY FROM commands
> > with SAVE_ERROR option to different tables if the error table is not
> > created yet.
> >
> I don't know how to solve this problem.... Maybe we can document this.
> but it will block the COPY FROM immediately.
>
> > >
> > > if you can insert to a table in that specific schema let's say foo,
> > > then you will get privilege to INSERT/DELETE/SELECT
> > > to foo.copy_errors.
> > > If you are not a superuser, you are only allowed to do
> > > INSERT/DELETE/SELECT on foo.copy_errors rows where USERID =
> > > current_user::regrole::oid.
> > > This is done via row level security.
> >
> > I don't think it works. If the user is dropped, the user's oid could
> > be reused for a different user.
> >
>
> You are right.
> so I changed, now the schema owner will be the error table owner.
> every error table tuple inserts,
> I switch to schema owner, do the insert, then switch back to the
> COPY_FROM operation user.
> now everyone (except superuser) will need explicit grant to access the
> error table.

There are some compilation issues reported at [1] for the patch:
[04:04:26.288] copyfromparse.c: In function ‘NextCopyFrom’:
[04:04:26.288] copyfromparse.c:1126:25: error: ‘copy_errors_tupDesc’
may be used uninitialized in this function
[-Werror=maybe-uninitialized]
[04:04:26.288] 1126 | copy_errors_tup = heap_form_tuple(copy_errors_tupDesc,
[04:04:26.288] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[04:04:26.288] 1127 | t_values,
[04:04:26.288] | ~~~~~~~~~
[04:04:26.288] 1128 | t_isnull);
[04:04:26.288] | ~~~~~~~~~
[04:04:26.288] copyfromparse.c:1160:4: error: ‘copy_errorsrel’ may be
used uninitialized in this function [-Werror=maybe-uninitialized]
[04:04:26.288] 1160 | table_close(copy_errorsrel, RowExclusiveLock);
[04:04:26.288] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[1] - https://cirrus-ci.com/task/4785221183209472
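
One common way to silence this class of warning, sketched with the
variable names from the output above, is to initialize the variables at
declaration and make the cleanup conditional:

    TupleDesc   copy_errors_tupDesc = NULL;
    Relation    copy_errorsrel = NULL;

    /* ... both are only assigned when SAVE_ERROR is actually in use ... */

    if (copy_errorsrel != NULL)
        table_close(copy_errorsrel, RowExclusiveLock);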

Regards,
Vignesh



On Fri, Jan 5, 2024 at 12:05 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, 28 Dec 2023 at 09:27, jian he <jian.universality@gmail.com> wrote:
> >
> > On Wed, Dec 20, 2023 at 8:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > >
> > > Why do we need to use SPI? I think we can form heap tuples and insert
> > > them to the error table. Creating the error table also doesn't need to
> > > use SPI.
> > >
> > Thanks for pointing it out. I figured out how to form heap tuples and
> > insert them to the error table.
> > but I don't know how to create the error table without using SPI.
> > Please pointer it out.
> >
> > > >
> > > > copy_errors one per schema.
> > > > foo.copy_errors will be owned by the schema: foo owner.
> > >
> > > It seems that the error table is created when the SAVE_ERROR is used
> > > for the first time. It probably blocks concurrent COPY FROM commands
> > > with SAVE_ERROR option to different tables if the error table is not
> > > created yet.
> > >
> > I don't know how to solve this problem.... Maybe we can document this.
> > but it will block the COPY FROM immediately.
> >
> > > >
> > > > if you can insert to a table in that specific schema let's say foo,
> > > > then you will get privilege to INSERT/DELETE/SELECT
> > > > to foo.copy_errors.
> > > > If you are not a superuser, you are only allowed to do
> > > > INSERT/DELETE/SELECT on foo.copy_errors rows where USERID =
> > > > current_user::regrole::oid.
> > > > This is done via row level security.
> > >
> > > I don't think it works. If the user is dropped, the user's oid could
> > > be reused for a different user.
> > >
> >
> > You are right.
> > so I changed, now the schema owner will be the error table owner.
> > every error table tuple inserts,
> > I switch to schema owner, do the insert, then switch back to the
> > COPY_FROM operation user.
> > now everyone (except superuser) will need explicit grant to access the
> > error table.
>
> There are some compilation issues reported at [1] for the patch:
> [04:04:26.288] copyfromparse.c: In function ‘NextCopyFrom’:
> [04:04:26.288] copyfromparse.c:1126:25: error: ‘copy_errors_tupDesc’
> may be used uninitialized in this function
> [-Werror=maybe-uninitialized]
> [04:04:26.288] 1126 | copy_errors_tup = heap_form_tuple(copy_errors_tupDesc,
> [04:04:26.288] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> [04:04:26.288] 1127 | t_values,
> [04:04:26.288] | ~~~~~~~~~
> [04:04:26.288] 1128 | t_isnull);
> [04:04:26.288] | ~~~~~~~~~
> [04:04:26.288] copyfromparse.c:1160:4: error: ‘copy_errorsrel’ may be
> used uninitialized in this function [-Werror=maybe-uninitialized]
> [04:04:26.288] 1160 | table_close(copy_errorsrel, RowExclusiveLock);
> [04:04:26.288] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> [1] - https://cirrus-ci.com/task/4785221183209472
>

I fixed this issue and also improved the docs.
The rest of the implementation is unchanged.

Attachment
On Fri, Jan 5, 2024 at 4:37 PM jian he <jian.universality@gmail.com> wrote:
>
> > > > be reused for a different user.
> > > >
> > >
> > > You are right.
> > > so I changed, now the schema owner will be the error table owner.
> > > every error table tuple inserts,
> > > I switch to schema owner, do the insert, then switch back to the
> > > COPY_FROM operation user.
> > > now everyone (except superuser) will need explicit grant to access the
> > > error table.
> >
> > There are some compilation issues reported at [1] for the patch:
> > [04:04:26.288] copyfromparse.c: In function ‘NextCopyFrom’:
> > [04:04:26.288] copyfromparse.c:1126:25: error: ‘copy_errors_tupDesc’
> > may be used uninitialized in this function
> > [-Werror=maybe-uninitialized]
> > [04:04:26.288] 1126 | copy_errors_tup = heap_form_tuple(copy_errors_tupDesc,
> > [04:04:26.288] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > [04:04:26.288] 1127 | t_values,
> > [04:04:26.288] | ~~~~~~~~~
> > [04:04:26.288] 1128 | t_isnull);
> > [04:04:26.288] | ~~~~~~~~~
> > [04:04:26.288] copyfromparse.c:1160:4: error: ‘copy_errorsrel’ may be
> > used uninitialized in this function [-Werror=maybe-uninitialized]
> > [04:04:26.288] 1160 | table_close(copy_errorsrel, RowExclusiveLock);
> > [04:04:26.288] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > [1] - https://cirrus-ci.com/task/4785221183209472
> >
>
> I fixed this issue, and also improved the doc.
> Other implementations have not changed.

Sorry to bother you again.
This time I ran it through the CI tests again;
now there should be no warnings.

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On Tue, Dec 19, 2023 at 10:14 AM Masahiko Sawada <sawada.mshk@gmail.com> 
wrote:
> If we want only such a feature we need to implement it together (the
> patch could be split, though). But if some parts of the feature are
> useful for users as well, I'd recommend implementing it incrementally.
> That way, the patches can get small and it would be easy for reviewers
> and committers to review/commit them.

Jian, what do you think about this comment?

Looking back at the discussion so far, it seems that not everyone thinks
saving error information to a table is the best idea[1], and some people
think just skipping error data is useful[2].

Since there are design issues to be considered, such as the treatment of
physical/logical replication, putting error information into a table is
likely to take time for consensus building and development.

Wouldn't it be better to follow the advice below and develop the
functionality incrementally?

On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada 
<sawada(dot)mshk(at)gmail(dot)com> wrote:
> So I'm thinking we may be able to implement this
> feature incrementally. The first step would be something like an
> option to ignore all errors or an option to specify the maximum number
> of errors to tolerate before raising an ERROR. The second step would
> be to support logging destinations such as server logs and tables.


Attached is a patch for this "first step", based on the v7 patch, which
logged errors and is simpler than the latest one.
- This patch adds a new option SAVE_ERROR_TO, but currently only supports
'none', which means it just skips error data. It is expected to support
'log' and 'table' later.
- This patch skips just soft errors and doesn't handle other errors such
as missing column data (a sketch of this kind of soft-error handling
follows below).
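
For context, soft errors are the ones type input functions can report
through an ErrorSaveContext instead of throwing. A minimal sketch of
skipping a row on such an error; the surrounding names (in_function,
field_string, typioparam, typmod, cstate) are assumed from a
NextCopyFrom-like caller:

    ErrorSaveContext escontext = {T_ErrorSaveContext};  /* no details wanted */
    Datum   value;

    if (!InputFunctionCallSafe(&in_function, field_string,
                               typioparam, typmod,
                               (Node *) &escontext,
                               &value))
    {
        cstate->num_errors++;   /* count the malformed row ... */
        return false;           /* ... and let the caller skip it */
    }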


BTW I have a question and a comment about the v15 patch:

> +           {
> +               /*
> +               *
> +               * InputFunctionCall is more faster than 
> InputFunctionCallSafe.
> +               *
> +               */

Have you measured this?
When I tested it in an older patch, there was no big difference[3].

  > -   SAVEPOINT SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P SECURITY 
SELECT
  > +   SAVEPOINT SAVE_ERROR SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P 
SECURITY SELECT

There was a comment that we shouldn't add a new keyword for this[4].


I left these points as they were in the v7 patch.


[1] 
https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de
[2] 
https://www.postgresql.org/message-id/CAD21AoCeEOBN49fu43e6tBTynnswugA3oZ5AZvLeyDCpxpCXPg%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/19551e8c2717c24689913083f841ddb5%40oss.nttdata.com
[4] 
https://www.postgresql.org/message-id/20230322175000.qbdctk7bnmifh5an%40awork3.anarazel.de


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
On Tue, Jan 9, 2024 at 11:36 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On Tue, Dec 19, 2023 at 10:14 AM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> > If we want only such a feature we need to implement it together (the
> > patch could be split, though). But if some parts of the feature are
> > useful for users as well, I'd recommend implementing it incrementally.
> > That way, the patches can get small and it would be easy for reviewers
> > and committers to review/commit them.
>
> Jian, how do you think this comment?
>
> Looking back at the discussion so far, it seems that not everyone thinks
> saving table information is the best idea[1] and some people think just
> skipping error data is useful.[2]
>
> Since there are issues to be considered from the design such as
> physical/logical replication treatment, putting error information to
> table is likely to take time for consensus building and development.
>
> Wouldn't it be better to follow the following advice and develop the
> functionality incrementally?

Yeah, I'm still thinking it's better to implement this feature
incrementally. Given we're getting close to feature freeze, I think it's
unlikely we'll get the whole feature into PG17, since there are still
many design discussions we need in addition to what Torikoshi-san
pointed out. Features like "ignore errors" or "logging errors" would
have a higher chance. Even if we get only these parts of the whole
"error table" feature into PG17, it will make it much easier to
implement the "error tables" feature later.

>
> On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada
> <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > So I'm thinking we may be able to implement this
> > feature incrementally. The first step would be something like an
> > option to ignore all errors or an option to specify the maximum number
> > of errors to tolerate before raising an ERROR. The second step would
> > be to support logging destinations such as server logs and tables.
>
>
> Attached a patch for this "first step" with reference to v7 patch, which
> logged errors and simpler than latest one.
> - This patch adds new option SAVE_ERROR_TO, but currently only supports
> 'none', which means just skips error data. It is expected to support
> 'log' and 'table'.
> - This patch Skips just soft errors and don't handle other errors such
> as missing column data.

Seems promising. I'll look at the patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Tue, Jan 9, 2024 at 10:36 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On Tue, Dec 19, 2023 at 10:14 AM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> > If we want only such a feature we need to implement it together (the
> > patch could be split, though). But if some parts of the feature are
> > useful for users as well, I'd recommend implementing it incrementally.
> > That way, the patches can get small and it would be easy for reviewers
> > and committers to review/commit them.
>
> Jian, how do you think this comment?
>
> Looking back at the discussion so far, it seems that not everyone thinks
> saving table information is the best idea[1] and some people think just
> skipping error data is useful.[2]
>
> Since there are issues to be considered from the design such as
> physical/logical replication treatment, putting error information to
> table is likely to take time for consensus building and development.
>
> Wouldn't it be better to follow the following advice and develop the
> functionality incrementally?
>
> On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada
> <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > So I'm thinking we may be able to implement this
> > feature incrementally. The first step would be something like an
> > option to ignore all errors or an option to specify the maximum number
> > of errors to tolerate before raising an ERROR. The second step would
> > be to support logging destinations such as server logs and tables.
>
>
> Attached a patch for this "first step" with reference to v7 patch, which
> logged errors and simpler than latest one.
> - This patch adds new option SAVE_ERROR_TO, but currently only supports
> 'none', which means just skips error data. It is expected to support
> 'log' and 'table'.
> - This patch Skips just soft errors and don't handle other errors such
> as missing column data.

Hi.
I made the following changes based on your patch
(v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch):

* When SAVE_ERROR_TO is specified, move the initialization of
ErrorSaveContext to the function BeginCopyFrom.
I think that's the right place to initialize a struct CopyFromState
field (see the sketch after this list).
* With your patch, I think that when N rows have malformed data, it
will initialize N ErrorSaveContexts.
In struct CopyFromStateData, I changed the field to ErrorSaveContext
*escontext, so if an error occurs you can just set the escontext
accordingly.
* doc: mention "If this option is omitted, <command>COPY</command>
stops operation at the first error."
* Since we only support 'none' for now, and 'none' means we don't want
ErrorSaveContext metadata,
we should set cstate->escontext->details_wanted to false.
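
A minimal sketch of that BeginCopyFrom initialization; the option field
and the COPY_SAVE_ERROR_TO_UNSPECIFIED value come up later in this
thread, and the rest is an assumption rather than the patch itself:

    if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED)
    {
        /* one ErrorSaveContext per COPY, reused for every row */
        cstate->escontext = makeNode(ErrorSaveContext);

        /* 'none' discards the error details, so don't collect them */
        cstate->escontext->details_wanted = false;
    }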

> BTW I have question and comment about v15 patch:
>
> > +           {
> > +               /*
> > +               *
> > +               * InputFunctionCall is more faster than
> > InputFunctionCallSafe.
> > +               *
> > +               */
>
> Have you measured this?
> When I tested it in an older patch, there were no big difference[3].
Thanks for pointing it out; I probably was overthinking it.

>   > -   SAVEPOINT SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P SECURITY
> SELECT
>   > +   SAVEPOINT SAVE_ERROR SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P
> SECURITY SELECT
>
> There was a comment that we shouldn't add new keyword for this[4].
>
Thanks for pointing it out.

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On Wed, Jan 10, 2024 at 4:42 PM Masahiko Sawada <sawada.mshk@gmail.com> 
wrote:

> Yeah, I'm still thinking it's better to implement this feature
> incrementally. Given we're closing to feature freeze, I think it's
> unlikely to get the whole feature into PG17 since there are still many
> design discussions we need in addition to what Torikoshi-san pointed
> out. The feature like "ignore errors" or "logging errors" would have
> higher possibilities. Even if we get only these parts of the whole
> "error table" feature into PG17, it will make it much easier to
implement "error tables" feature.

+1.
I'm also going to make a patch for "logging errors", since this
functionality is separable from the v7 patch.

> Seems promising. I'll look at the patch.
Thanks a lot!
Sorry for attaching v2 if you've already reviewed v1.

On 2024-01-11 12:13, jian he wrote:
> On Tue, Jan 9, 2024 at 10:36 PM torikoshia <torikoshia@oss.nttdata.com> 
> wrote:
>> 
>> On Tue, Dec 19, 2023 at 10:14 AM Masahiko Sawada 
>> <sawada.mshk@gmail.com>
>> wrote:
>> > If we want only such a feature we need to implement it together (the
>> > patch could be split, though). But if some parts of the feature are
>> > useful for users as well, I'd recommend implementing it incrementally.
>> > That way, the patches can get small and it would be easy for reviewers
>> > and committers to review/commit them.
>> 
>> Jian, how do you think this comment?
>> 
>> Looking back at the discussion so far, it seems that not everyone 
>> thinks
>> saving table information is the best idea[1] and some people think 
>> just
>> skipping error data is useful.[2]
>> 
>> Since there are issues to be considered from the design such as
>> physical/logical replication treatment, putting error information to
>> table is likely to take time for consensus building and development.
>> 
>> Wouldn't it be better to follow the following advice and develop the
>> functionality incrementally?
>> 
>> On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada
>> <sawada(dot)mshk(at)gmail(dot)com> wrote:
>> > So I'm thinking we may be able to implement this
>> > feature incrementally. The first step would be something like an
>> > option to ignore all errors or an option to specify the maximum number
>> > of errors to tolerate before raising an ERROR. The second step would
>> > be to support logging destinations such as server logs and tables.
>> 
>> 
>> Attached a patch for this "first step" with reference to v7 patch, 
>> which
>> logged errors and simpler than latest one.
>> - This patch adds new option SAVE_ERROR_TO, but currently only 
>> supports
>> 'none', which means just skips error data. It is expected to support
>> 'log' and 'table'.
>> - This patch Skips just soft errors and don't handle other errors such
>> as missing column data.
> 
> Hi.
> I made the following change based on your patch
> (v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch)
> 
> * when specified SAVE_ERROR_TO, move the initialization of
> ErrorSaveContext to the function BeginCopyFrom.
> I think that's the right place to initialize struct CopyFromState 
> field.
> * I think your patch when there are N rows have malformed data, then it
> will initialize N ErrorSaveContext.
> In the struct CopyFromStateData, I changed it to ErrorSaveContext 
> *escontext.
> So if an error occurred, you can just set the escontext accordingly.
> * doc: mention "If this option is omitted, <command>COPY</command>
> stops operation at the first error."
> * Since we only support 'none' for now, 'none' means we don't want
> ErrorSaveContext metadata,
>  so we should set cstate->escontext->details_wanted to false.
> 
>> BTW I have question and comment about v15 patch:
>> 
>> > +           {
>> > +               /*
>> > +               *
>> > +               * InputFunctionCall is more faster than
>> > InputFunctionCallSafe.
>> > +               *
>> > +               */
>> 
>> Have you measured this?
>> When I tested it in an older patch, there were no big difference[3].
> Thanks for pointing it out, I probably was over thinking.
> 
>>   > -   SAVEPOINT SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P 
>> SECURITY
>> SELECT
>>   > +   SAVEPOINT SAVE_ERROR SCALAR SCHEMA SCHEMAS SCROLL SEARCH 
>> SECOND_P
>> SECURITY SELECT
>> 
>> There was a comment that we shouldn't add new keyword for this[4].
>> 
> Thanks for pointing it out.

Thanks for reviewing!

Updated the patch, merging your suggestions except for the points below:

> +   cstate->num_errors = 0;

Since cstate is already zeroed by the lines below, this may be
redundant.

|     /* Allocate workspace and zero all fields */
|     cstate = (CopyFromStateData *) palloc0(sizeof(CopyFromStateData));


>  +                   Assert(!cstate->escontext->details_wanted);

I'm not sure this is necessary, considering we're going to add other 
options like 'table' and 'log', which need details_wanted soon.


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachment
On Fri, Jan 12, 2024 at 10:59 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
>
> Thanks for reviewing!
>
> Updated the patch merging your suggestions except below points:
>
> > +   cstate->num_errors = 0;
>
> Since cstate is already initialized in below lines, this may be
> redundant.
>
> |     /* Allocate workspace and zero all fields */
> |     cstate = (CopyFromStateData *) palloc0(sizeof(CopyFromStateData));
>
>
> >  +                   Assert(!cstate->escontext->details_wanted);
>
> I'm not sure this is necessary, considering we're going to add other
> options like 'table' and 'log', which need details_wanted soon.
>
>
> --
> Regards,

Made it so the save_error_to option cannot be used with COPY TO,
and added a test exercising save_error_to with COPY TO.

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alexander Korotkov
Date:
Hi!

I think this is an in-demand and long-awaited feature.  The thread is
pretty long, but mostly it was disputes about how to save the errors.
The present patch includes the basic infrastructure and the ability to
ignore errors, so it's pretty simple.

On Sat, Jan 13, 2024 at 4:20 PM jian he <jian.universality@gmail.com> wrote:
> On Fri, Jan 12, 2024 at 10:59 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> >
> >
> > Thanks for reviewing!
> >
> > Updated the patch merging your suggestions except below points:
> >
> > > +   cstate->num_errors = 0;
> >
> > Since cstate is already initialized in below lines, this may be
> > redundant.
> >
> > |     /* Allocate workspace and zero all fields */
> > |     cstate = (CopyFromStateData *) palloc0(sizeof(CopyFromStateData));
> >
> >
> > >  +                   Assert(!cstate->escontext->details_wanted);
> >
> > I'm not sure this is necessary, considering we're going to add other
> > options like 'table' and 'log', which need details_wanted soon.
> >
> >
> > --
> > Regards,
>
> make save_error_to option cannot be used with COPY TO.
> add redundant test, save_error_to with COPY TO test.

I've incorporated these changes.  Also, I've changed
CopyFormatOptions.save_error_to to an enum and made some edits in
the comments and the commit message.  I'm going to push this if there
are no objections.

------
Regards,
Alexander Korotkov

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
On Sun, Jan 14, 2024 at 10:30 AM Alexander Korotkov
<aekorotkov@gmail.com> wrote:
>
> Hi!
>
> I think this is a demanding and long-waited feature.  The thread is
> pretty long, but mostly it was disputes about how to save the errors.
> The present patch includes basic infrastructure and ability to ignore
> errors, thus it's pretty simple.
>
> On Sat, Jan 13, 2024 at 4:20 PM jian he <jian.universality@gmail.com> wrote:
> > On Fri, Jan 12, 2024 at 10:59 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> > >
> > >
> > > Thanks for reviewing!
> > >
> > > Updated the patch merging your suggestions except below points:
> > >
> > > > +   cstate->num_errors = 0;
> > >
> > > Since cstate is already initialized in below lines, this may be
> > > redundant.
> > >
> > > |     /* Allocate workspace and zero all fields */
> > > |     cstate = (CopyFromStateData *) palloc0(sizeof(CopyFromStateData));
> > >
> > >
> > > >  +                   Assert(!cstate->escontext->details_wanted);
> > >
> > > I'm not sure this is necessary, considering we're going to add other
> > > options like 'table' and 'log', which need details_wanted soon.
> > >
> > >
> > > --
> > > Regards,
> >
> > make save_error_to option cannot be used with COPY TO.
> > add redundant test, save_error_to with COPY TO test.
>
> I've incorporated these changes.  Also, I've changed
> CopyFormatOptions.save_error_to to enum and made some edits in
> comments and the commit message.  I'm going to push this if there are
> no objections.

Thank you for updating the patch. Here are two comments:

---
+   if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED &&
+       cstate->num_errors > 0)
+       ereport(WARNING,
+               errmsg("%zd rows were skipped due to data type incompatibility",
+                      cstate->num_errors));
+
    /* Done, clean up */
    error_context_stack = errcallback.previous;

If a malformed input line is not the last one, the context message seems odd:

postgres(1:1769258)=# create table test (a int);
CREATE TABLE
postgres(1:1769258)=# copy test from stdin (save_error_to none);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> a
>> 1
>>
2024-01-15 05:05:53.980 JST [1769258] WARNING:  1 rows were skipped
due to data type incompatibility
2024-01-15 05:05:53.980 JST [1769258] CONTEXT:  COPY test, line 3: ""
COPY 1

I think it's better to report the WARNING after resetting the
error_context_stack. Or is a WARNING really appropriate here? The
v15-0001-Make-COPY-FROM-more-error-tolerant.patch[1] uses NOTICE but
the v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch[2] changes it to
WARNING without explanation.
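
To illustrate the suggested reordering, a sketch based on the hunk
quoted above (only the placement of the report changes; NOTICE is used
here as one of the levels under discussion):

    /* Done, clean up */
    error_context_stack = errcallback.previous;

    if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED &&
        cstate->num_errors > 0)
        ereport(NOTICE,
                errmsg("%zd rows were skipped due to data type incompatibility",
                       cstate->num_errors));

This way the summary no longer picks up the per-line COPY context.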

---
+-- test missing data: should fail
+COPY check_ign_err FROM STDIN WITH (save_error_to none);
+1  {1}
+\.

We might want to cover the extra data cases too.

Regards,

[1] https://www.postgresql.org/message-id/CACJufxEkkqnozdnvNMGxVAA94KZaCPkYw_Cx4JKG9ueNaZma_A%40mail.gmail.com
[2] https://www.postgresql.org/message-id/3d0b349ddbd4ae5f605f77b491697158%40oss.nttdata.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alexander Korotkov
Date:
On Sun, Jan 14, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for updating the patch. Here are two comments:
>
> ---
> +   if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED &&
> +       cstate->num_errors > 0)
> +       ereport(WARNING,
> +               errmsg("%zd rows were skipped due to data type incompatibility",
> +                      cstate->num_errors));
> +
>     /* Done, clean up */
>     error_context_stack = errcallback.previous;
>
> If a malformed input is not the last data, the context message seems odd:
>
> postgres(1:1769258)=# create table test (a int);
> CREATE TABLE
> postgres(1:1769258)=# copy test from stdin (save_error_to none);
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> a
> >> 1
> >>
> 2024-01-15 05:05:53.980 JST [1769258] WARNING:  1 rows were skipped
> due to data type incompatibility
> 2024-01-15 05:05:53.980 JST [1769258] CONTEXT:  COPY test, line 3: ""
> COPY 1
>
> I think it's better to report the WARNING after resetting the
> error_context_stack. Or is a WARNING really appropriate here? The
> v15-0001-Make-COPY-FROM-more-error-tolerant.patch[1] uses NOTICE but
> the v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch[2] changes it to
> WARNING without explanation.

Thank you for noticing this.  I think NOTICE is more appropriate here.
There is nothing to "worry" about: the user asked to ignore the errors
and we did.  And yes, it doesn't make sense to use the last line as
the context.  Fixed.

> ---
> +-- test missing data: should fail
> +COPY check_ign_err FROM STDIN WITH (save_error_to none);
> +1  {1}
> +\.
>
> We might want to cover the extra data cases too.

Agreed, the relevant test is added.

------
Regards,
Alexander Korotkov

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
On Mon, Jan 15, 2024 at 8:21 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Sun, Jan 14, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Thank you for updating the patch. Here are two comments:
> >
> > ---
> > +   if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED &&
> > +       cstate->num_errors > 0)
> > +       ereport(WARNING,
> > +               errmsg("%zd rows were skipped due to data type incompatibility",
> > +                      cstate->num_errors));
> > +
> >     /* Done, clean up */
> >     error_context_stack = errcallback.previous;
> >
> > If a malformed input is not the last data, the context message seems odd:
> >
> > postgres(1:1769258)=# create table test (a int);
> > CREATE TABLE
> > postgres(1:1769258)=# copy test from stdin (save_error_to none);
> > Enter data to be copied followed by a newline.
> > End with a backslash and a period on a line by itself, or an EOF signal.
> > >> a
> > >> 1
> > >>
> > 2024-01-15 05:05:53.980 JST [1769258] WARNING:  1 rows were skipped
> > due to data type incompatibility
> > 2024-01-15 05:05:53.980 JST [1769258] CONTEXT:  COPY test, line 3: ""
> > COPY 1
> >
> > I think it's better to report the WARNING after resetting the
> > error_context_stack. Or is a WARNING really appropriate here? The
> > v15-0001-Make-COPY-FROM-more-error-tolerant.patch[1] uses NOTICE but
> > the v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch[2] changes it to
> > WARNING without explanation.
>
> Thank you for noticing this.  I think NOTICE is more appropriate here.
> There is nothing to "worry" about: the user asked to ignore the errors
> and we did.  And yes, it doesn't make sense to use the last line as
> the context.  Fixed.
>
> > ---
> > +-- test missing data: should fail
> > +COPY check_ign_err FROM STDIN WITH (save_error_to none);
> > +1  {1}
> > +\.
> >
> > We might want to cover the extra data cases too.
>
> Agreed, the relevant test is added.

Thank you for updating the patch. I have one minor point:

+       if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED &&
+               cstate->num_errors > 0)
+               ereport(NOTICE,
+                               errmsg("%zd rows were skipped due to
data type incompatibility",
+                                          cstate->num_errors));
+

We can use errmsg_plural() instead.
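
For example, a sketch of the errmsg_plural() form (note it also needs a
separate singular wording):

    ereport(NOTICE,
            errmsg_plural("%zd row was skipped due to data type incompatibility",
                          "%zd rows were skipped due to data type incompatibility",
                          cstate->num_errors,
                          cstate->num_errors));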

I have a question about the option values; do you think we need to
have another value for the SAVE_ERROR_TO option to explicitly specify
the current default behavior, i.e. not accepting any error? With the v4
patch, the user has to omit the SAVE_ERROR_TO option to get that
behavior during COPY FROM. If we change the default behavior in the
future, many users will be affected and will probably end up changing
their applications to keep the current default behavior.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alexander Korotkov
Date:
On Mon, Jan 15, 2024 at 8:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 15, 2024 at 8:21 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > On Sun, Jan 14, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Thank you for updating the patch. Here are two comments:
> > >
> > > ---
> > > +   if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED &&
> > > +       cstate->num_errors > 0)
> > > +       ereport(WARNING,
> > > +               errmsg("%zd rows were skipped due to data type incompatibility",
> > > +                      cstate->num_errors));
> > > +
> > >     /* Done, clean up */
> > >     error_context_stack = errcallback.previous;
> > >
> > > If a malformed input is not the last data, the context message seems odd:
> > >
> > > postgres(1:1769258)=# create table test (a int);
> > > CREATE TABLE
> > > postgres(1:1769258)=# copy test from stdin (save_error_to none);
> > > Enter data to be copied followed by a newline.
> > > End with a backslash and a period on a line by itself, or an EOF signal.
> > > >> a
> > > >> 1
> > > >>
> > > 2024-01-15 05:05:53.980 JST [1769258] WARNING:  1 rows were skipped
> > > due to data type incompatibility
> > > 2024-01-15 05:05:53.980 JST [1769258] CONTEXT:  COPY test, line 3: ""
> > > COPY 1
> > >
> > > I think it's better to report the WARNING after resetting the
> > > error_context_stack. Or is a WARNING really appropriate here? The
> > > v15-0001-Make-COPY-FROM-more-error-tolerant.patch[1] uses NOTICE but
> > > the v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch[2] changes it to
> > > WARNING without explanation.
> >
> > Thank you for noticing this.  I think NOTICE is more appropriate here.
> > There is nothing to "worry" about: the user asked to ignore the errors
> > and we did.  And yes, it doesn't make sense to use the last line as
> > the context.  Fixed.
> >
> > > ---
> > > +-- test missing data: should fail
> > > +COPY check_ign_err FROM STDIN WITH (save_error_to none);
> > > +1  {1}
> > > +\.
> > >
> > > We might want to cover the extra data cases too.
> >
> > Agreed, the relevant test is added.
>
> Thank you for updating the patch. I have one minor point:
>
> +       if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED &&
> +               cstate->num_errors > 0)
> +               ereport(NOTICE,
> +                               errmsg("%zd rows were skipped due to
> data type incompatibility",
> +                                          cstate->num_errors));
> +
>
> We can use errmsg_plural() instead.

Makes sense.  Fixed.

> I have a question about the option values; do you think we need to
> have another value of SAVE_ERROR_TO option to explicitly specify the
> current default behavior, i.e. not accept any error? With the v4
> patch, the user needs to omit SAVE_ERROR_TO option to accept errors
> during COPY FROM. If we change the default behavior in the future,
> many users will be affected and probably end up changing their
> applications to keep the current default behavior.

Valid point.  I've implemented the handling of CopySaveErrorToChoice
in a similar way to CopyHeaderChoice.

Please check the revised patch attached.
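
For readers following along, a sketch of what parsing the option "in a
similar way to CopyHeaderChoice" could look like; the function name and
the enum values other than COPY_SAVE_ERROR_TO_UNSPECIFIED are
assumptions, not necessarily the committed code:

    static CopySaveErrorToChoice
    defGetCopySaveErrorToChoice(DefElem *def, ParseState *pstate)
    {
        char   *sval = defGetString(def);

        /* accept "error" (the explicit default) or "none" (skip bad rows) */
        if (pg_strcasecmp(sval, "error") == 0)
            return COPY_SAVE_ERROR_TO_ERROR;
        if (pg_strcasecmp(sval, "none") == 0)
            return COPY_SAVE_ERROR_TO_NONE;

        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("%s requires a valid error-handling value",
                        def->defname),
                 parser_errposition(pstate, def->location)));
        return COPY_SAVE_ERROR_TO_ERROR;    /* keep the compiler quiet */
    }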

------
Regards,
Alexander Korotkov

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2024-01-16 00:17, Alexander Korotkov wrote:
> On Mon, Jan 15, 2024 at 8:44 AM Masahiko Sawada <sawada.mshk@gmail.com> 
> wrote:
>> 
>> On Mon, Jan 15, 2024 at 8:21 AM Alexander Korotkov 
>> <aekorotkov@gmail.com> wrote:
>> >
>> > On Sun, Jan 14, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> > > Thank you for updating the patch. Here are two comments:
>> > >
>> > > ---
>> > > +   if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED &&
>> > > +       cstate->num_errors > 0)
>> > > +       ereport(WARNING,
>> > > +               errmsg("%zd rows were skipped due to data type incompatibility",
>> > > +                      cstate->num_errors));
>> > > +
>> > >     /* Done, clean up */
>> > >     error_context_stack = errcallback.previous;
>> > >
>> > > If a malformed input is not the last data, the context message seems odd:
>> > >
>> > > postgres(1:1769258)=# create table test (a int);
>> > > CREATE TABLE
>> > > postgres(1:1769258)=# copy test from stdin (save_error_to none);
>> > > Enter data to be copied followed by a newline.
>> > > End with a backslash and a period on a line by itself, or an EOF signal.
>> > > >> a
>> > > >> 1
>> > > >>
>> > > 2024-01-15 05:05:53.980 JST [1769258] WARNING:  1 rows were skipped
>> > > due to data type incompatibility
>> > > 2024-01-15 05:05:53.980 JST [1769258] CONTEXT:  COPY test, line 3: ""
>> > > COPY 1
>> > >
>> > > I think it's better to report the WARNING after resetting the
>> > > error_context_stack. Or is a WARNING really appropriate here? The
>> > > v15-0001-Make-COPY-FROM-more-error-tolerant.patch[1] uses NOTICE but
>> > > the v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch[2] changes it to
>> > > WARNING without explanation.
>> >
>> > Thank you for noticing this.  I think NOTICE is more appropriate here.
>> > There is nothing to "worry" about: the user asked to ignore the errors
>> > and we did.  And yes, it doesn't make sense to use the last line as
>> > the context.  Fixed.
>> >
>> > > ---
>> > > +-- test missing data: should fail
>> > > +COPY check_ign_err FROM STDIN WITH (save_error_to none);
>> > > +1  {1}
>> > > +\.
>> > >
>> > > We might want to cover the extra data cases too.
>> >
>> > Agreed, the relevant test is added.
>> 
>> Thank you for updating the patch. I have one minor point:
>> 
>> +       if (cstate->opts.save_error_to != 
>> COPY_SAVE_ERROR_TO_UNSPECIFIED &&
>> +               cstate->num_errors > 0)
>> +               ereport(NOTICE,
>> +                               errmsg("%zd rows were skipped due to
>> data type incompatibility",
>> +                                          cstate->num_errors));
>> +
>> 
>> We can use errmsg_plural() instead.
> 
> Makes sense.  Fixed.
> 
>> I have a question about the option values; do you think we need to
>> have another value of SAVE_ERROR_TO option to explicitly specify the
>> current default behavior, i.e. not accept any error? With the v4
>> patch, the user needs to omit SAVE_ERROR_TO option to accept errors
>> during COPY FROM. If we change the default behavior in the future,
>> many users will be affected and probably end up changing their
>> applications to keep the current default behavior.
> 
> Valid point.  I've implemented the handling of CopySaveErrorToChoice
> in a similar way to CopyHeaderChoice.
> 
> Please, check the revised patch attached.

Thanks for updating the patch!

Here is a minor comment:

> +/*
> + * Extract a defGetCopySaveErrorToChoice value from a DefElem.
> + */

Shouldn't it be "Extract a CopySaveErrorToChoice value"?


BTW I'm thinking we should add a column to pg_stat_progress_copy that 
counts soft errors. I'll suggest this in another thread.
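
As a rough sketch of that idea, with PROGRESS_COPY_TUPLES_SKIPPED being
a hypothetical counter index rather than an existing one:

    /* after skipping a malformed row in CopyFrom() */
    cstate->num_errors++;
    pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
                                 cstate->num_errors);

pgstat_progress_update_param() already drives the existing
pg_stat_progress_copy columns, so a skipped-rows counter would follow
the same pattern.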

> ------
> Regards,
> Alexander Korotkov

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alexander Korotkov
Date:
Hi,

> Thanks for updating the patch!

You're welcome!

> Here is a minor comment:
>
> > +/*
> > + * Extract a defGetCopySaveErrorToChoice value from a DefElem.
> > + */
>
> Should be Extract a "CopySaveErrorToChoice"?

Fixed.

> BTW I'm thinking we should add a column to pg_stat_progress_copy that
> counts soft errors. I'll suggest this in another thread.

Please do!

------
Regards,
Alexander Korotkov

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
Hi,

Thanks for applying!

> +                               errmsg_plural("%zd row were skipped due 
> to data type incompatibility",

Sorry, I just noticed it, but 'were' should be 'was' here?

>> BTW I'm thinking we should add a column to pg_stat_progress_copy that
>> counts soft errors. I'll suggest this in another thread.
> Please do!

I've started it here:

https://www.postgresql.org/message-id/d12fd8c99adcae2744212cb23feff6ed@oss.nttdata.com


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Kyotaro Horiguchi
Date:
At Wed, 17 Jan 2024 14:38:54 +0900, torikoshia <torikoshia@oss.nttdata.com> wrote in 
> Hi,
> 
> Thanks for applying!
> 
> > + errmsg_plural("%zd row were skipped due to data type
> > incompatibility",
> 
> Sorry, I just noticed it, but 'were' should be 'was' here?
> 
> >> BTW I'm thinking we should add a column to pg_stat_progress_copy that
> >> counts soft errors. I'll suggest this in another thread.
> > Please do!
> 
> I've started it here:
> 
> https://www.postgresql.org/message-id/d12fd8c99adcae2744212cb23feff6ed@oss.nttdata.com

Switching topics, this commit (9e2d870119) adds the following help message:


>                       "COPY { %s [ ( %s [, ...] ) ] | ( %s ) }\n"
>                       "    TO { '%s' | PROGRAM '%s' | STDOUT }\n"
> ...
>                       "    SAVE_ERROR_TO '%s'\n"
> ...
>                       _("location"),

On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which
indicate "immediately error out" and "just ignore the failure"
respectively, but these options hardly seem to denote a 'location';
they appear more like an 'action'. I somewhat suspect that this
parameter name was initially conceived with the assumption that it
would take file names or similar parameters. I'm not sure if others
will agree, but I think the parameter name might not be the best
choice. For instance, considering the addition of a third value
'log', something like on_error_action (error, ignore, log) would be
more intuitively understandable. What do you think?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alexander Korotkov
Date:
On Wed, Jan 17, 2024 at 7:38 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> Hi,
>
> Thanks for applying!
>
> > +                               errmsg_plural("%zd row were skipped due
> > to data type incompatibility",
>
> Sorry, I just noticed it, but 'were' should be 'was' here?

Sure, the fix is pushed.

------
Regards,
Alexander Korotkov



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alexander Korotkov
Date:
On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Wed, 17 Jan 2024 14:38:54 +0900, torikoshia <torikoshia@oss.nttdata.com> wrote in
> > Hi,
> >
> > Thanks for applying!
> >
> > > + errmsg_plural("%zd row were skipped due to data type
> > > incompatibility",
> >
> > Sorry, I just noticed it, but 'were' should be 'was' here?
> >
> > >> BTW I'm thinking we should add a column to pg_stat_progress_copy that
> > >> counts soft errors. I'll suggest this in another thread.
> > > Please do!
> >
> > I've started it here:
> >
> > https://www.postgresql.org/message-id/d12fd8c99adcae2744212cb23feff6ed@oss.nttdata.com
>
> Switching topics, this commit (9e2d870119) adds the following help message:
>
>
> >                       "COPY { %s [ ( %s [, ...] ) ] | ( %s ) }\n"
> >                       "    TO { '%s' | PROGRAM '%s' | STDOUT }\n"
> > ...
> >                       "    SAVE_ERROR_TO '%s'\n"
> > ...
> >                       _("location"),
>
> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which
> indicate "immediately error out" and 'just ignore the failure'
> respectively, but these options hardly seem to denote a 'location',
> and appear more like an 'action'. I somewhat suspect that this
> parameter name intially conceived with the assupmtion that it would
> take file names or similar parameters. I'm not sure if others will
> agree, but I think the parameter name might not be the best
> choice. For instance, considering the addition of the third value
> 'log', something like on_error_action (error, ignore, log) would be
> more intuitively understandable. What do you think?

Probably, but I'm not sure about that.  The name SAVE_ERROR_TO assumes
the next word will be a location, not an action.  With some stretch we
can take 'error' to be a location.  I think it would be even more of a
stretch to read what follows SAVE_ERROR_TO as an action.  Perhaps we
could replace SAVE_ERROR_TO with another name that could naturally be
followed by an action, but I don't have anything appropriate in mind.
However, I'm not a native English speaker and could certainly be
missing something.

------
Regards,
Alexander Korotkov



Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which
>> indicate "immediately error out" and 'just ignore the failure'
>> respectively, but these options hardly seem to denote a 'location',
>> and appear more like an 'action'. I somewhat suspect that this
>> parameter name intially conceived with the assupmtion that it would
>> take file names or similar parameters. I'm not sure if others will
>> agree, but I think the parameter name might not be the best
>> choice. For instance, considering the addition of the third value
>> 'log', something like on_error_action (error, ignore, log) would be
>> more intuitively understandable. What do you think?

> Probably, but I'm not sure about that.  The name SAVE_ERROR_TO assumes
> the next word will be location, not action.  With some stretch we can
> assume 'error' to be location.  I think it would be even more stretchy
> to think that SAVE_ERROR_TO is followed by action.

The other problem with this terminology is that with 'none', what it
is doing is the exact opposite of "saving" the errors.  I agree we
need a better name.

Kyotaro-san's suggestion isn't bad, though I might shorten it to
error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
You will need a separate parameter anyway to specify the destination
of "log", unless "none" became an illegal table name when I wasn't
looking.  I don't buy that one parameter that has some special values
while other values could be names will be a good design.  Moreover,
what if we want to support (say) log-to-file along with log-to-table?
Trying to distinguish a file name from a table name without any other
context seems impossible.

            regards, tom lane



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alexander Korotkov <aekorotkov@gmail.com> writes:
> > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which
> >> indicate "immediately error out" and 'just ignore the failure'
> >> respectively, but these options hardly seem to denote a 'location',
> >> and appear more like an 'action'. I somewhat suspect that this
> >> parameter name intially conceived with the assupmtion that it would
> >> take file names or similar parameters. I'm not sure if others will
> >> agree, but I think the parameter name might not be the best
> >> choice. For instance, considering the addition of the third value
> >> 'log', something like on_error_action (error, ignore, log) would be
> >> more intuitively understandable. What do you think?
>
> > Probably, but I'm not sure about that.  The name SAVE_ERROR_TO assumes
> > the next word will be location, not action.  With some stretch we can
> > assume 'error' to be location.  I think it would be even more stretchy
> > to think that SAVE_ERROR_TO is followed by action.
>
> The other problem with this terminology is that with 'none', what it
> is doing is the exact opposite of "saving" the errors.  I agree we
> need a better name.

Agreed.

>
> Kyotaro-san's suggestion isn't bad, though I might shorten it to
> error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
> You will need a separate parameter anyway to specify the destination
> of "log", unless "none" became an illegal table name when I wasn't
> looking.  I don't buy that one parameter that has some special values
> while other values could be names will be a good design.  Moreover,
> what if we want to support (say) log-to-file along with log-to-table?
> Trying to distinguish a file name from a table name without any other
> context seems impossible.

I've been thinking we can add more values to this option to log errors
not only to the server logs but also to an error table (I'm not sure
about the details, but I imagine an error table being created for each
table on error), without an additional option for the destination name.
The values would be something like error_action
{error|ignore|save-logs|save-table}.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Alexander Korotkov <aekorotkov@gmail.com> writes:
> > > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi
> > > <horikyota.ntt@gmail.com> wrote:
> > >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which
> > >> indicate "immediately error out" and 'just ignore the failure'
> > >> respectively, but these options hardly seem to denote a 'location',
> > >> and appear more like an 'action'. I somewhat suspect that this
> > >> parameter name intially conceived with the assupmtion that it would
> > >> take file names or similar parameters. I'm not sure if others will
> > >> agree, but I think the parameter name might not be the best
> > >> choice. For instance, considering the addition of the third value
> > >> 'log', something like on_error_action (error, ignore, log) would be
> > >> more intuitively understandable. What do you think?
> >
> > > Probably, but I'm not sure about that.  The name SAVE_ERROR_TO assumes
> > > the next word will be location, not action.  With some stretch we can
> > > assume 'error' to be location.  I think it would be even more stretchy
> > > to think that SAVE_ERROR_TO is followed by action.
> >
> > The other problem with this terminology is that with 'none', what it
> > is doing is the exact opposite of "saving" the errors.  I agree we
> > need a better name.
>
> Agreed.
>
> >
> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
> > You will need a separate parameter anyway to specify the destination
> > of "log", unless "none" became an illegal table name when I wasn't
> > looking.  I don't buy that one parameter that has some special values
> > while other values could be names will be a good design.  Moreover,
> > what if we want to support (say) log-to-file along with log-to-table?
> > Trying to distinguish a file name from a table name without any other
> > context seems impossible.
>
> I've been thinking we can add more values to this option to log errors
> not only to the server logs but also to the error table (not sure
> details but I imagined an error table is created for each table on
> error), without an additional option for the destination name. The
> values would be like error_action {error|ignore|save-logs|save-table}.
>

Another idea:
on_error {error|ignore|other_future_option}
If not specified, the default is ERROR.
You can also specify ERROR or IGNORE for now.

I agree, the parameter "error_action" is better than "location".



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2024-01-18 10:10, jian he wrote:
> On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> 
> wrote:
>> 
>> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >
>> > Alexander Korotkov <aekorotkov@gmail.com> writes:
>> > > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi
>> > > <horikyota.ntt@gmail.com> wrote:
>> > >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which
>> > >> indicate "immediately error out" and 'just ignore the failure'
>> > >> respectively, but these options hardly seem to denote a 'location',
>> > >> and appear more like an 'action'. I somewhat suspect that this
>> > >> parameter name intially conceived with the assupmtion that it would
>> > >> take file names or similar parameters. I'm not sure if others will
>> > >> agree, but I think the parameter name might not be the best
>> > >> choice. For instance, considering the addition of the third value
>> > >> 'log', something like on_error_action (error, ignore, log) would be
>> > >> more intuitively understandable. What do you think?
>> >
>> > > Probably, but I'm not sure about that.  The name SAVE_ERROR_TO assumes
>> > > the next word will be location, not action.  With some stretch we can
>> > > assume 'error' to be location.  I think it would be even more stretchy
>> > > to think that SAVE_ERROR_TO is followed by action.
>> >
>> > The other problem with this terminology is that with 'none', what it
>> > is doing is the exact opposite of "saving" the errors.  I agree we
>> > need a better name.
>> 
>> Agreed.
>> 
>> >
>> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
>> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
>> > You will need a separate parameter anyway to specify the destination
>> > of "log", unless "none" became an illegal table name when I wasn't
>> > looking.  I don't buy that one parameter that has some special values
>> > while other values could be names will be a good design.  Moreover,
>> > what if we want to support (say) log-to-file along with log-to-table?
>> > Trying to distinguish a file name from a table name without any other
>> > context seems impossible.
>> 
>> I've been thinking we can add more values to this option to log errors
>> not only to the server logs but also to the error table (not sure
>> details but I imagined an error table is created for each table on
>> error), without an additional option for the destination name. The
>> values would be like error_action {error|ignore|save-logs|save-table}.
>> 
> 
> another idea:
> on_error {error|ignore|other_future_option}
> if not specified then by default ERROR.
> You can also specify ERROR or IGNORE for now.
> 
> I agree, the parameter "error_action" is better than "location".

I'm not sure whether error_action or on_error is better, but either way
"error_action error" and "on_error error" seem a bit odd to me.
I feel "stop" is better in both cases, as Tom suggested.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alexander Korotkov
Date:
On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> On 2024-01-18 10:10, jian he wrote:
> > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com>
> > wrote:
> >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
> >> > You will need a separate parameter anyway to specify the destination
> >> > of "log", unless "none" became an illegal table name when I wasn't
> >> > looking.  I don't buy that one parameter that has some special values
> >> > while other values could be names will be a good design.  Moreover,
> >> > what if we want to support (say) log-to-file along with log-to-table?
> >> > Trying to distinguish a file name from a table name without any other
> >> > context seems impossible.
> >>
> >> I've been thinking we can add more values to this option to log errors
> >> not only to the server logs but also to the error table (not sure
> >> details but I imagined an error table is created for each table on
> >> error), without an additional option for the destination name. The
> >> values would be like error_action {error|ignore|save-logs|save-table}.
> >>
> >
> > another idea:
> > on_error {error|ignore|other_future_option}
> > if not specified then by default ERROR.
> > You can also specify ERROR or IGNORE for now.
> >
> > I agree, the parameter "error_action" is better than "location".
>
> I'm not sure whether error_action or on_error is better, but either way
> "error_action error" and "on_error error" seems a bit odd to me.
> I feel "stop" is better for both cases as Tom suggested.

OK.  What about this?
on_error {stop|ignore|other_future_option}
where other_future_option might be a compound value like "file 'copy.log'"
or "table 'copy_log'".

------
Regards,
Alexander Korotkov



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Pavel Stehule
Date:


On Thu, Jan 18, 2024 at 8:59 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> > On 2024-01-18 10:10, jian he wrote:
> > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com>
> > > wrote:
> > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
> > >> > You will need a separate parameter anyway to specify the destination
> > >> > of "log", unless "none" became an illegal table name when I wasn't
> > >> > looking.  I don't buy that one parameter that has some special values
> > >> > while other values could be names will be a good design.  Moreover,
> > >> > what if we want to support (say) log-to-file along with log-to-table?
> > >> > Trying to distinguish a file name from a table name without any other
> > >> > context seems impossible.
> > >>
> > >> I've been thinking we can add more values to this option to log errors
> > >> not only to the server logs but also to the error table (not sure
> > >> details but I imagined an error table is created for each table on
> > >> error), without an additional option for the destination name. The
> > >> values would be like error_action {error|ignore|save-logs|save-table}.
> > >>
> > >
> > > another idea:
> > > on_error {error|ignore|other_future_option}
> > > if not specified then by default ERROR.
> > > You can also specify ERROR or IGNORE for now.
> > >
> > > I agree, the parameter "error_action" is better than "location".
> >
> > I'm not sure whether error_action or on_error is better, but either way
> > "error_action error" and "on_error error" seems a bit odd to me.
> > I feel "stop" is better for both cases as Tom suggested.
>
> OK.  What about this?
> on_error {stop|ignore|other_future_option}
> where other_future_option might be compound like "file 'copy.log'" or
> "table 'copy_log'".

+1

It is consistent with psql.

Regards

Pavel



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> > On 2024-01-18 10:10, jian he wrote:
> > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com>
> > > wrote:
> > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
> > >> > You will need a separate parameter anyway to specify the destination
> > >> > of "log", unless "none" became an illegal table name when I wasn't
> > >> > looking.  I don't buy that one parameter that has some special values
> > >> > while other values could be names will be a good design.  Moreover,
> > >> > what if we want to support (say) log-to-file along with log-to-table?
> > >> > Trying to distinguish a file name from a table name without any other
> > >> > context seems impossible.
> > >>
> > >> I've been thinking we can add more values to this option to log errors
> > >> not only to the server logs but also to the error table (not sure
> > >> details but I imagined an error table is created for each table on
> > >> error), without an additional option for the destination name. The
> > >> values would be like error_action {error|ignore|save-logs|save-table}.
> > >>
> > >
> > > another idea:
> > > on_error {error|ignore|other_future_option}
> > > if not specified then by default ERROR.
> > > You can also specify ERROR or IGNORE for now.
> > >
> > > I agree, the parameter "error_action" is better than "location".
> >
> > I'm not sure whether error_action or on_error is better, but either way
> > "error_action error" and "on_error error" seems a bit odd to me.
> > I feel "stop" is better for both cases as Tom suggested.
>
> OK.  What about this?
> on_error {stop|ignore|other_future_option}
> where other_future_option might be compound like "file 'copy.log'" or
> "table 'copy_log'".

+1

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2024-01-18 16:59, Alexander Korotkov wrote:
> On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> 
> wrote:
>> On 2024-01-18 10:10, jian he wrote:
>> > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com>
>> > wrote:
>> >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
>> >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
>> >> > You will need a separate parameter anyway to specify the destination
>> >> > of "log", unless "none" became an illegal table name when I wasn't
>> >> > looking.  I don't buy that one parameter that has some special values
>> >> > while other values could be names will be a good design.  Moreover,
>> >> > what if we want to support (say) log-to-file along with log-to-table?
>> >> > Trying to distinguish a file name from a table name without any other
>> >> > context seems impossible.
>> >>
>> >> I've been thinking we can add more values to this option to log errors
>> >> not only to the server logs but also to the error table (not sure
>> >> details but I imagined an error table is created for each table on
>> >> error), without an additional option for the destination name. The
>> >> values would be like error_action {error|ignore|save-logs|save-table}.
>> >>
>> >
>> > another idea:
>> > on_error {error|ignore|other_future_option}
>> > if not specified then by default ERROR.
>> > You can also specify ERROR or IGNORE for now.
>> >
>> > I agree, the parameter "error_action" is better than "location".
>> 
>> I'm not sure whether error_action or on_error is better, but either 
>> way
>> "error_action error" and "on_error error" seems a bit odd to me.
>> I feel "stop" is better for both cases as Tom suggested.
> 
> OK.  What about this?
> on_error {stop|ignore|other_future_option}
> where other_future_option might be compound like "file 'copy.log'" or
> "table 'copy_log'".

Thanks, also +1 from me.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
jian he
Date:
Hi.
Patch refactored based on "on_error {stop|ignore}".
Doc changes:

--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -43,7 +43,7 @@ COPY { <replaceable
class="parameter">table_name</replaceable> [ ( <replaceable
     FORCE_QUOTE { ( <replaceable
class="parameter">column_name</replaceable> [, ...] ) | * }
     FORCE_NOT_NULL { ( <replaceable
class="parameter">column_name</replaceable> [, ...] ) | * }
     FORCE_NULL { ( <replaceable
class="parameter">column_name</replaceable> [, ...] ) | * }
-    SAVE_ERROR_TO '<replaceable class="parameter">location</replaceable>'
+    ON_ERROR '<replaceable class="parameter">error_action</replaceable>'
     ENCODING '<replaceable class="parameter">encoding_name</replaceable>'
 </synopsis>
  </refsynopsisdiv>
@@ -375,20 +375,20 @@ COPY { <replaceable
class="parameter">table_name</replaceable> [ ( <replaceable
    </varlistentry>

    <varlistentry>
-    <term><literal>SAVE_ERROR_TO</literal></term>
+    <term><literal>ON_ERROR</literal></term>
     <listitem>
      <para>
-      Specifies to save error information to <replaceable class="parameter">
-      location</replaceable> when there is malformed data in the input.
-      Currently, only <literal>error</literal> (default) and
<literal>none</literal>
+      Specifies which <replaceable class="parameter">
+      error_action</replaceable> to perform when there is malformed
data in the input.
+      Currently, only <literal>stop</literal> (default) and
<literal>ignore</literal>
       values are supported.
-      If the <literal>error</literal> value is specified,
+      If the <literal>stop</literal> value is specified,
       <command>COPY</command> stops operation at the first error.
-      If the <literal>none</literal> value is specified,
+      If the <literal>ignore</literal> value is specified,
       <command>COPY</command> skips malformed data and continues copying data.
       The option is allowed only in <command>COPY FROM</command>.
-      The <literal>none</literal> value is allowed only when
-      not using <literal>binary</literal> format.
+      Only <literal>stop</literal> value is allowed only when
+      using <literal>binary</literal> format.
      </para>
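
As a minimal sketch of the documented behavior (assuming text format and
a table t(a int); per the doc change above, ignore is not allowed with
binary format):

   CREATE TABLE t (a int);
   COPY t FROM stdin WITH (ON_ERROR ignore);
   1
   not_a_number
   2
   \.
   -- the malformed row is skipped and copying continues; with
   -- ON_ERROR stop (the default), COPY fails at the first error.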

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2024-01-18 23:59, jian he wrote:
> Hi.
> patch refactored based on "on_error {stop|ignore}"
> doc changes:
> 
> --- a/doc/src/sgml/ref/copy.sgml
> +++ b/doc/src/sgml/ref/copy.sgml
> @@ -43,7 +43,7 @@ COPY { <replaceable
> class="parameter">table_name</replaceable> [ ( <replaceable
>      FORCE_QUOTE { ( <replaceable
> class="parameter">column_name</replaceable> [, ...] ) | * }
>      FORCE_NOT_NULL { ( <replaceable
> class="parameter">column_name</replaceable> [, ...] ) | * }
>      FORCE_NULL { ( <replaceable
> class="parameter">column_name</replaceable> [, ...] ) | * }
> -    SAVE_ERROR_TO '<replaceable 
> class="parameter">location</replaceable>'
> +    ON_ERROR '<replaceable 
> class="parameter">error_action</replaceable>'
>      ENCODING '<replaceable 
> class="parameter">encoding_name</replaceable>'
>  </synopsis>
>   </refsynopsisdiv>
> @@ -375,20 +375,20 @@ COPY { <replaceable
> class="parameter">table_name</replaceable> [ ( <replaceable
>     </varlistentry>
> 
>     <varlistentry>
> -    <term><literal>SAVE_ERROR_TO</literal></term>
> +    <term><literal>ON_ERROR</literal></term>
>      <listitem>
>       <para>
> -      Specifies to save error information to <replaceable 
> class="parameter">
> -      location</replaceable> when there is malformed data in the 
> input.
> -      Currently, only <literal>error</literal> (default) and
> <literal>none</literal>
> +      Specifies which <replaceable class="parameter">
> +      error_action</replaceable> to perform when there is malformed
> data in the input.
> +      Currently, only <literal>stop</literal> (default) and
> <literal>ignore</literal>
>        values are supported.
> -      If the <literal>error</literal> value is specified,
> +      If the <literal>stop</literal> value is specified,
>        <command>COPY</command> stops operation at the first error.
> -      If the <literal>none</literal> value is specified,
> +      If the <literal>ignore</literal> value is specified,
>        <command>COPY</command> skips malformed data and continues 
> copying data.
>        The option is allowed only in <command>COPY FROM</command>.
> -      The <literal>none</literal> value is allowed only when
> -      not using <literal>binary</literal> format.
> +      Only <literal>stop</literal> value is allowed only when
> +      using <literal>binary</literal> format.
>       </para>

Thanks for making the patch!

Here are some comments:

> -      The <literal>none</literal> value is allowed only when
> -      not using <literal>binary</literal> format.
> +      Only <literal>stop</literal> value is allowed only when
> +      using <literal>binary</literal> format.

The second 'only' may be unnecessary.

> -                       /* If SAVE_ERROR_TO is specified, skip rows 
> with soft errors */
> +                       /* If ON_ERROR is specified with IGNORE, skip 
> rows with soft errors */

This is correct now, but considering future work that adds other 
options like "file 'copy.log'" and
"table 'copy_log'", it may be better not to limit the case to 'IGNORE'.
How about something like this?

   If ON_ERROR is specified and the value is not STOP, skip rows with 
soft errors

> -COPY x from stdin (format BINARY, save_error_to none);
> -COPY x to stdin (save_error_to none);
> +COPY x from stdin (format BINARY, ON_ERROR ignore);
> +COPY x from stdin (ON_ERROR unsupported);
>  COPY x to stdin (format TEXT, force_quote(a));
>  COPY x from stdin (format CSV, force_quote(a));

In the existing test in copy2.sql, the COPY options are written in 
lower case (e.g. 'format') while the option values (e.g. 'BINARY') are 
written in upper case.
It would be more consistent to align them.


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Alexander Korotkov
Date:
Hi!

On Fri, Jan 19, 2024 at 2:37 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
> Thanks for making the patch!

The patch is pushed!  The proposed changes have been incorporated, except for the following:

> > -                       /* If SAVE_ERROR_TO is specified, skip rows
> > with soft errors */
> > +                       /* If ON_ERROR is specified with IGNORE, skip
> > rows with soft errors */
>
> This is correct now, but considering future works which add other
> options like "file 'copy.log'" and
> "table 'copy_log'", it may be better not to limit the case to 'IGNORE'.
> How about something like this?
>
>    If ON_ERROR is specified and the value is not STOP, skip rows with
> soft errors

I think when we have more options, we wouldn't just skip rows
with soft errors but rather save them.  So, I left this comment as is
for now.

------
Regards,
Alexander Korotkov



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2024-01-19 22:27, Alexander Korotkov wrote:
> Hi!
> 
> On Fri, Jan 19, 2024 at 2:37 PM torikoshia <torikoshia@oss.nttdata.com> 
> wrote:
>> Thanks for making the patch!
> 
> The patch is pushed!  The proposed changes are incorporated excluding 
> this.
> 
>> > -                       /* If SAVE_ERROR_TO is specified, skip rows
>> > with soft errors */
>> > +                       /* If ON_ERROR is specified with IGNORE, skip
>> > rows with soft errors */
>> 
>> This is correct now, but considering future works which add other
>> options like "file 'copy.log'" and
>> "table 'copy_log'", it may be better not to limit the case to 
>> 'IGNORE'.
>> How about something like this?
>> 
>>    If ON_ERROR is specified and the value is not STOP, skip rows with
>> soft errors
> 
> I think when we have more options, then we wouldn't just skip rows
> with soft errors but rather save them.  So, I left this comment as is
> for now.

Agreed.
Thanks for the notification!

> 
> ------
> Regards,
> Alexander Korotkov

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
Hi,

On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> > > On 2024-01-18 10:10, jian he wrote:
> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com>
> > > > wrote:
> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
> > > >> > You will need a separate parameter anyway to specify the destination
> > > >> > of "log", unless "none" became an illegal table name when I wasn't
> > > >> > looking.  I don't buy that one parameter that has some special values
> > > >> > while other values could be names will be a good design.  Moreover,
> > > >> > what if we want to support (say) log-to-file along with log-to-table?
> > > >> > Trying to distinguish a file name from a table name without any other
> > > >> > context seems impossible.
> > > >>
> > > >> I've been thinking we can add more values to this option to log errors
> > > >> not only to the server logs but also to the error table (not sure
> > > >> details but I imagined an error table is created for each table on
> > > >> error), without an additional option for the destination name. The
> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
> > > >>
> > > >
> > > > another idea:
> > > > on_error {error|ignore|other_future_option}
> > > > if not specified then by default ERROR.
> > > > You can also specify ERROR or IGNORE for now.
> > > >
> > > > I agree, the parameter "error_action" is better than "location".
> > >
> > > I'm not sure whether error_action or on_error is better, but either way
> > > "error_action error" and "on_error error" seems a bit odd to me.
> > > I feel "stop" is better for both cases as Tom suggested.
> >
> > OK.  What about this?
> > on_error {stop|ignore|other_future_option}
> > where other_future_option might be compound like "file 'copy.log'" or
> > "table 'copy_log'".
>
> +1
>

I realized that the ON_ERROR syntax synopsis in the documentation is not
correct. The option doesn't require the value to be quoted, and the
value can be omitted. The attached patch fixes it.
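
For instance, both of these spellings are accepted (a sketch, with an
assumed table t; at this point in the thread the value could also be
left out entirely):

   COPY t FROM stdin WITH (ON_ERROR ignore);
   COPY t FROM stdin WITH (ON_ERROR 'ignore');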

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2024-03-28 10:20, Masahiko Sawada wrote:
> Hi,
> 
> On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com> 
> wrote:
>> 
>> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov 
>> <aekorotkov@gmail.com> wrote:
>> >
>> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>> > > On 2024-01-18 10:10, jian he wrote:
>> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com>
>> > > > wrote:
>> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
>> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
>> > > >> > You will need a separate parameter anyway to specify the destination
>> > > >> > of "log", unless "none" became an illegal table name when I wasn't
>> > > >> > looking.  I don't buy that one parameter that has some special values
>> > > >> > while other values could be names will be a good design.  Moreover,
>> > > >> > what if we want to support (say) log-to-file along with log-to-table?
>> > > >> > Trying to distinguish a file name from a table name without any other
>> > > >> > context seems impossible.
>> > > >>
>> > > >> I've been thinking we can add more values to this option to log errors
>> > > >> not only to the server logs but also to the error table (not sure
>> > > >> details but I imagined an error table is created for each table on
>> > > >> error), without an additional option for the destination name. The
>> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
>> > > >>
>> > > >
>> > > > another idea:
>> > > > on_error {error|ignore|other_future_option}
>> > > > if not specified then by default ERROR.
>> > > > You can also specify ERROR or IGNORE for now.
>> > > >
>> > > > I agree, the parameter "error_action" is better than "location".
>> > >
>> > > I'm not sure whether error_action or on_error is better, but either way
>> > > "error_action error" and "on_error error" seems a bit odd to me.
>> > > I feel "stop" is better for both cases as Tom suggested.
>> >
>> > OK.  What about this?
>> > on_error {stop|ignore|other_future_option}
>> > where other_future_option might be compound like "file 'copy.log'" or
>> > "table 'copy_log'".
>> 
>> +1
>> 
> 
> I realized that ON_ERROR syntax synoposis in the documentation is not
> correct. The option doesn't require the value to be quoted and the
> value can be omitted. The attached patch fixes it.
> 
> Regards,

Thanks!

Attached patch fixes the doc, but I'm wondering whether it might be 
better to modify the code to prohibit abbreviation of the value.

When seeing a query which abbreviates the ON_ERROR value, I feel it's not 
obvious what happens, compared to other options which tolerate 
abbreviation of the value, such as FREEZE or HEADER.

   COPY t1 FROM stdin WITH (ON_ERROR);

What do you think?
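
For comparison, a sketch of how value omission currently plays out
(syntax only, with an assumed table t1; note that FREEZE also has
run-time preconditions on the table):

   COPY t1 FROM stdin WITH (FREEZE);          -- boolean-like: same as FREEZE true
   COPY t1 FROM stdin WITH (HEADER);          -- same as HEADER true
   COPY t1 FROM stdin WITH (ON_ERROR ignore); -- non-boolean: explicit value
   COPY t1 FROM stdin WITH (ON_ERROR);        -- accepted today, but unclear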

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
On Thu, Mar 28, 2024 at 9:38 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2024-03-28 10:20, Masahiko Sawada wrote:
> > Hi,
> >
> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com>
> > wrote:
> >>
> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >> <aekorotkov@gmail.com> wrote:
> >> >
> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> >> > > On 2024-01-18 10:10, jian he wrote:
> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com>
> >> > > > wrote:
> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
> >> > > >> > You will need a separate parameter anyway to specify the destination
> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't
> >> > > >> > looking.  I don't buy that one parameter that has some special values
> >> > > >> > while other values could be names will be a good design.  Moreover,
> >> > > >> > what if we want to support (say) log-to-file along with log-to-table?
> >> > > >> > Trying to distinguish a file name from a table name without any other
> >> > > >> > context seems impossible.
> >> > > >>
> >> > > >> I've been thinking we can add more values to this option to log errors
> >> > > >> not only to the server logs but also to the error table (not sure
> >> > > >> details but I imagined an error table is created for each table on
> >> > > >> error), without an additional option for the destination name. The
> >> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
> >> > > >>
> >> > > >
> >> > > > another idea:
> >> > > > on_error {error|ignore|other_future_option}
> >> > > > if not specified then by default ERROR.
> >> > > > You can also specify ERROR or IGNORE for now.
> >> > > >
> >> > > > I agree, the parameter "error_action" is better than "location".
> >> > >
> >> > > I'm not sure whether error_action or on_error is better, but either way
> >> > > "error_action error" and "on_error error" seems a bit odd to me.
> >> > > I feel "stop" is better for both cases as Tom suggested.
> >> >
> >> > OK.  What about this?
> >> > on_error {stop|ignore|other_future_option}
> >> > where other_future_option might be compound like "file 'copy.log'" or
> >> > "table 'copy_log'".
> >>
> >> +1
> >>
> >
> > I realized that ON_ERROR syntax synoposis in the documentation is not
> > correct. The option doesn't require the value to be quoted and the
> > value can be omitted. The attached patch fixes it.
> >
> > Regards,
>
> Thanks!
>
> Attached patch fixes the doc, but I'm wondering perhaps it might be
> better to modify the codes to prohibit abbreviation of the value.
>
> When seeing the query which abbreviates ON_ERROR value, I feel it's not
> obvious what happens compared to other options which tolerates
> abbreviation of the value such as FREEZE or HEADER.
>
>    COPY t1 FROM stdin WITH (ON_ERROR);
>
> What do you think?

Indeed. Looking at the options of other commands such as VACUUM and
EXPLAIN, I can see that we can omit a boolean value, but non-boolean
parameters require a value. The HEADER option is not a pure boolean
parameter, but we can omit the value. That seems to be for backward
compatibility; it used to be a boolean parameter. I agree that the
above example would confuse users.
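
A sketch of that pattern with EXPLAIN, for comparison:

   EXPLAIN (ANALYZE) SELECT 1;      -- boolean value omitted, defaults to true
   EXPLAIN (FORMAT json) SELECT 1;  -- non-boolean option: value required
   EXPLAIN (FORMAT) SELECT 1;       -- error: a value is required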

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2024-03-28 21:54, Masahiko Sawada wrote:
> On Thu, Mar 28, 2024 at 9:38 PM torikoshia <torikoshia@oss.nttdata.com> 
> wrote:
>> 
>> On 2024-03-28 10:20, Masahiko Sawada wrote:
>> > Hi,
>> >
>> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com>
>> > wrote:
>> >>
>> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
>> >> <aekorotkov@gmail.com> wrote:
>> >> >
>> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>> >> > > On 2024-01-18 10:10, jian he wrote:
>> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com>
>> >> > > > wrote:
>> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
>> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
>> >> > > >> > You will need a separate parameter anyway to specify the destination
>> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't
>> >> > > >> > looking.  I don't buy that one parameter that has some special values
>> >> > > >> > while other values could be names will be a good design.  Moreover,
>> >> > > >> > what if we want to support (say) log-to-file along with log-to-table?
>> >> > > >> > Trying to distinguish a file name from a table name without any other
>> >> > > >> > context seems impossible.
>> >> > > >>
>> >> > > >> I've been thinking we can add more values to this option to log errors
>> >> > > >> not only to the server logs but also to the error table (not sure
>> >> > > >> details but I imagined an error table is created for each table on
>> >> > > >> error), without an additional option for the destination name. The
>> >> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
>> >> > > >>
>> >> > > >
>> >> > > > another idea:
>> >> > > > on_error {error|ignore|other_future_option}
>> >> > > > if not specified then by default ERROR.
>> >> > > > You can also specify ERROR or IGNORE for now.
>> >> > > >
>> >> > > > I agree, the parameter "error_action" is better than "location".
>> >> > >
>> >> > > I'm not sure whether error_action or on_error is better, but either way
>> >> > > "error_action error" and "on_error error" seems a bit odd to me.
>> >> > > I feel "stop" is better for both cases as Tom suggested.
>> >> >
>> >> > OK.  What about this?
>> >> > on_error {stop|ignore|other_future_option}
>> >> > where other_future_option might be compound like "file 'copy.log'" or
>> >> > "table 'copy_log'".
>> >>
>> >> +1
>> >>
>> >
>> > I realized that ON_ERROR syntax synoposis in the documentation is not
>> > correct. The option doesn't require the value to be quoted and the
>> > value can be omitted. The attached patch fixes it.
>> >
>> > Regards,
>> 
>> Thanks!
>> 
>> Attached patch fixes the doc, but I'm wondering perhaps it might be
>> better to modify the codes to prohibit abbreviation of the value.
>> 
>> When seeing the query which abbreviates ON_ERROR value, I feel it's 
>> not
>> obvious what happens compared to other options which tolerates
>> abbreviation of the value such as FREEZE or HEADER.
>> 
>>    COPY t1 FROM stdin WITH (ON_ERROR);
>> 
>> What do you think?
> 
> Indeed. Looking at options of other commands such as VACUUM and
> EXPLAIN, I can see that we can omit a boolean value, but non-boolean
> parameters require its value. The HEADER option is not a pure boolean
> parameter but we can omit the value. It seems to be for backward
> compatibility; it used to be a boolean parameter. I agree that the
> above example would confuse users.
> 
> Regards,

Thanks for your comment!

Attached a patch which modifies the code to prohibit omission of its 
value.

I was a little unsure about adding a regression test for this, but I 
have not added one, since the other COPY options don't test the omission 
of their values.
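
With the patch, a sketch of the intended behavior (the table t1 is
assumed, and the exact error wording is determined by the patch):

   COPY t1 FROM stdin WITH (ON_ERROR ignore); -- OK
   COPY t1 FROM stdin WITH (ON_ERROR);        -- now rejected: value required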

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
On Fri, Mar 29, 2024 at 11:54 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2024-03-28 21:54, Masahiko Sawada wrote:
> > On Thu, Mar 28, 2024 at 9:38 PM torikoshia <torikoshia@oss.nttdata.com>
> > wrote:
> >>
> >> On 2024-03-28 10:20, Masahiko Sawada wrote:
> >> > Hi,
> >> >
> >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com>
> >> > wrote:
> >> >>
> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >> >> <aekorotkov@gmail.com> wrote:
> >> >> >
> >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> >> >> > > On 2024-01-18 10:10, jian he wrote:
> >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com>
> >> >> > > > wrote:
> >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
> >> >> > > >> > You will need a separate parameter anyway to specify the destination
> >> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't
> >> >> > > >> > looking.  I don't buy that one parameter that has some special values
> >> >> > > >> > while other values could be names will be a good design.  Moreover,
> >> >> > > >> > what if we want to support (say) log-to-file along with log-to-table?
> >> >> > > >> > Trying to distinguish a file name from a table name without any other
> >> >> > > >> > context seems impossible.
> >> >> > > >>
> >> >> > > >> I've been thinking we can add more values to this option to log errors
> >> >> > > >> not only to the server logs but also to the error table (not sure
> >> >> > > >> details but I imagined an error table is created for each table on
> >> >> > > >> error), without an additional option for the destination name. The
> >> >> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
> >> >> > > >>
> >> >> > > >
> >> >> > > > another idea:
> >> >> > > > on_error {error|ignore|other_future_option}
> >> >> > > > if not specified then by default ERROR.
> >> >> > > > You can also specify ERROR or IGNORE for now.
> >> >> > > >
> >> >> > > > I agree, the parameter "error_action" is better than "location".
> >> >> > >
> >> >> > > I'm not sure whether error_action or on_error is better, but either way
> >> >> > > "error_action error" and "on_error error" seems a bit odd to me.
> >> >> > > I feel "stop" is better for both cases as Tom suggested.
> >> >> >
> >> >> > OK.  What about this?
> >> >> > on_error {stop|ignore|other_future_option}
> >> >> > where other_future_option might be compound like "file 'copy.log'" or
> >> >> > "table 'copy_log'".
> >> >>
> >> >> +1
> >> >>
> >> >
> >> > I realized that ON_ERROR syntax synoposis in the documentation is not
> >> > correct. The option doesn't require the value to be quoted and the
> >> > value can be omitted. The attached patch fixes it.
> >> >
> >> > Regards,
> >>
> >> Thanks!
> >>
> >> Attached patch fixes the doc, but I'm wondering perhaps it might be
> >> better to modify the codes to prohibit abbreviation of the value.
> >>
> >> When seeing the query which abbreviates ON_ERROR value, I feel it's
> >> not
> >> obvious what happens compared to other options which tolerates
> >> abbreviation of the value such as FREEZE or HEADER.
> >>
> >>    COPY t1 FROM stdin WITH (ON_ERROR);
> >>
> >> What do you think?
> >
> > Indeed. Looking at options of other commands such as VACUUM and
> > EXPLAIN, I can see that we can omit a boolean value, but non-boolean
> > parameters require its value. The HEADER option is not a pure boolean
> > parameter but we can omit the value. It seems to be for backward
> > compatibility; it used to be a boolean parameter. I agree that the
> > above example would confuse users.
> >
> > Regards,
>
> Thanks for your comment!
>
> Attached a patch which modifies the code to prohibit omission of its
> value.
>
> I was a little unsure about adding a regression test for this, but I
> have not added it since other COPY option doesn't test the omission of
> its value.

Should we perhaps change the doc as well, since the ON_ERROR value doesn't
necessarily need to be single-quoted?

The rest looks good to me.

Alexander, what do you think about this change as you're the committer
of this feature?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2024-04-01 11:31, Masahiko Sawada wrote:
> On Fri, Mar 29, 2024 at 11:54 AM torikoshia 
> <torikoshia@oss.nttdata.com> wrote:
>> 
>> On 2024-03-28 21:54, Masahiko Sawada wrote:
>> > On Thu, Mar 28, 2024 at 9:38 PM torikoshia <torikoshia@oss.nttdata.com>
>> > wrote:
>> >>
>> >> On 2024-03-28 10:20, Masahiko Sawada wrote:
>> >> > Hi,
>> >> >
>> >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
>> >> >> <aekorotkov@gmail.com> wrote:
>> >> >> >
>> >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>> >> >> > > On 2024-01-18 10:10, jian he wrote:
>> >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com>
>> >> >> > > > wrote:
>> >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
>> >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
>> >> >> > > >> > You will need a separate parameter anyway to specify the destination
>> >> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't
>> >> >> > > >> > looking.  I don't buy that one parameter that has some special values
>> >> >> > > >> > while other values could be names will be a good design.  Moreover,
>> >> >> > > >> > what if we want to support (say) log-to-file along with log-to-table?
>> >> >> > > >> > Trying to distinguish a file name from a table name without any other
>> >> >> > > >> > context seems impossible.
>> >> >> > > >>
>> >> >> > > >> I've been thinking we can add more values to this option to log errors
>> >> >> > > >> not only to the server logs but also to the error table (not sure
>> >> >> > > >> details but I imagined an error table is created for each table on
>> >> >> > > >> error), without an additional option for the destination name. The
>> >> >> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
>> >> >> > > >>
>> >> >> > > >
>> >> >> > > > another idea:
>> >> >> > > > on_error {error|ignore|other_future_option}
>> >> >> > > > if not specified then by default ERROR.
>> >> >> > > > You can also specify ERROR or IGNORE for now.
>> >> >> > > >
>> >> >> > > > I agree, the parameter "error_action" is better than "location".
>> >> >> > >
>> >> >> > > I'm not sure whether error_action or on_error is better, but either way
>> >> >> > > "error_action error" and "on_error error" seems a bit odd to me.
>> >> >> > > I feel "stop" is better for both cases as Tom suggested.
>> >> >> >
>> >> >> > OK.  What about this?
>> >> >> > on_error {stop|ignore|other_future_option}
>> >> >> > where other_future_option might be compound like "file 'copy.log'" or
>> >> >> > "table 'copy_log'".
>> >> >>
>> >> >> +1
>> >> >>
>> >> >
>> >> > I realized that ON_ERROR syntax synoposis in the documentation is not
>> >> > correct. The option doesn't require the value to be quoted and the
>> >> > value can be omitted. The attached patch fixes it.
>> >> >
>> >> > Regards,
>> >>
>> >> Thanks!
>> >>
>> >> Attached patch fixes the doc, but I'm wondering perhaps it might be
>> >> better to modify the codes to prohibit abbreviation of the value.
>> >>
>> >> When seeing the query which abbreviates ON_ERROR value, I feel it's
>> >> not
>> >> obvious what happens compared to other options which tolerates
>> >> abbreviation of the value such as FREEZE or HEADER.
>> >>
>> >>    COPY t1 FROM stdin WITH (ON_ERROR);
>> >>
>> >> What do you think?
>> >
>> > Indeed. Looking at options of other commands such as VACUUM and
>> > EXPLAIN, I can see that we can omit a boolean value, but non-boolean
>> > parameters require its value. The HEADER option is not a pure boolean
>> > parameter but we can omit the value. It seems to be for backward
>> > compatibility; it used to be a boolean parameter. I agree that the
>> > above example would confuse users.
>> >
>> > Regards,
>> 
>> Thanks for your comment!
>> 
>> Attached a patch which modifies the code to prohibit omission of its
>> value.
>> 
>> I was a little unsure about adding a regression test for this, but I
>> have not added it since other COPY option doesn't test the omission of
>> its value.
> 
> Probably should we change the doc as well since ON_ERROR value doesn't
> necessarily need to be single-quoted?

Agreed.
Since this issue seems independent of the omission of the ON_ERROR 
option value, I've attached a separate patch.

> The rest looks good to me.
> 
> Alexander, what do you think about this change as you're the committer
> of this feature?


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachment

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
On Tue, Apr 2, 2024 at 7:34 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2024-04-01 11:31, Masahiko Sawada wrote:
> > On Fri, Mar 29, 2024 at 11:54 AM torikoshia
> > <torikoshia@oss.nttdata.com> wrote:
> >>
> >> On 2024-03-28 21:54, Masahiko Sawada wrote:
> >> > On Thu, Mar 28, 2024 at 9:38 PM torikoshia <torikoshia@oss.nttdata.com>
> >> > wrote:
> >> >>
> >> >> On 2024-03-28 10:20, Masahiko Sawada wrote:
> >> >> > Hi,
> >> >> >
> >> >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >> >> >> <aekorotkov@gmail.com> wrote:
> >> >> >> >
> >> >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> >> >> >> > > On 2024-01-18 10:10, jian he wrote:
> >> >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com>
> >> >> >> > > > wrote:
> >> >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> >> >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
> >> >> >> > > >> > You will need a separate parameter anyway to specify the destination
> >> >> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't
> >> >> >> > > >> > looking.  I don't buy that one parameter that has some special values
> >> >> >> > > >> > while other values could be names will be a good design.  Moreover,
> >> >> >> > > >> > what if we want to support (say) log-to-file along with log-to-table?
> >> >> >> > > >> > Trying to distinguish a file name from a table name without any other
> >> >> >> > > >> > context seems impossible.
> >> >> >> > > >>
> >> >> >> > > >> I've been thinking we can add more values to this option to log errors
> >> >> >> > > >> not only to the server logs but also to the error table (not sure
> >> >> >> > > >> details but I imagined an error table is created for each table on
> >> >> >> > > >> error), without an additional option for the destination name. The
> >> >> >> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
> >> >> >> > > >>
> >> >> >> > > >
> >> >> >> > > > another idea:
> >> >> >> > > > on_error {error|ignore|other_future_option}
> >> >> >> > > > if not specified then by default ERROR.
> >> >> >> > > > You can also specify ERROR or IGNORE for now.
> >> >> >> > > >
> >> >> >> > > > I agree, the parameter "error_action" is better than "location".
> >> >> >> > >
> >> >> >> > > I'm not sure whether error_action or on_error is better, but either way
> >> >> >> > > "error_action error" and "on_error error" seems a bit odd to me.
> >> >> >> > > I feel "stop" is better for both cases as Tom suggested.
> >> >> >> >
> >> >> >> > OK.  What about this?
> >> >> >> > on_error {stop|ignore|other_future_option}
> >> >> >> > where other_future_option might be compound like "file 'copy.log'" or
> >> >> >> > "table 'copy_log'".
> >> >> >>
> >> >> >> +1
> >> >> >>
> >> >> >
> >> >> > I realized that ON_ERROR syntax synoposis in the documentation is not
> >> >> > correct. The option doesn't require the value to be quoted and the
> >> >> > value can be omitted. The attached patch fixes it.
> >> >> >
> >> >> > Regards,
> >> >>
> >> >> Thanks!
> >> >>
> >> >> Attached patch fixes the doc, but I'm wondering perhaps it might be
> >> >> better to modify the codes to prohibit abbreviation of the value.
> >> >>
> >> >> When seeing the query which abbreviates ON_ERROR value, I feel it's
> >> >> not
> >> >> obvious what happens compared to other options which tolerates
> >> >> abbreviation of the value such as FREEZE or HEADER.
> >> >>
> >> >>    COPY t1 FROM stdin WITH (ON_ERROR);
> >> >>
> >> >> What do you think?
> >> >
> >> > Indeed. Looking at options of other commands such as VACUUM and
> >> > EXPLAIN, I can see that we can omit a boolean value, but non-boolean
> >> > parameters require its value. The HEADER option is not a pure boolean
> >> > parameter but we can omit the value. It seems to be for backward
> >> > compatibility; it used to be a boolean parameter. I agree that the
> >> > above example would confuse users.
> >> >
> >> > Regards,
> >>
> >> Thanks for your comment!
> >>
> >> Attached a patch which modifies the code to prohibit omission of its
> >> value.
> >>
> >> I was a little unsure about adding a regression test for this, but I
> >> have not added it since other COPY option doesn't test the omission of
> >> its value.
> >
> > Probably should we change the doc as well since ON_ERROR value doesn't
> > necessarily need to be single-quoted?
>
> Agreed.
> Since it seems this issue is independent from the omission of ON_ERROR
> option value, attached a separate patch.
>

Thank you for the patches! They look good to me. I'll push
them, barring any objections.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
torikoshia
Date:
On 2024-04-16 13:16, Masahiko Sawada wrote:
> On Tue, Apr 2, 2024 at 7:34 PM torikoshia <torikoshia@oss.nttdata.com> 
> wrote:
>> 
>> On 2024-04-01 11:31, Masahiko Sawada wrote:
>> > On Fri, Mar 29, 2024 at 11:54 AM torikoshia
>> > <torikoshia@oss.nttdata.com> wrote:
>> >>
>> >> On 2024-03-28 21:54, Masahiko Sawada wrote:
>> >> > On Thu, Mar 28, 2024 at 9:38 PM torikoshia <torikoshia@oss.nttdata.com>
>> >> > wrote:
>> >> >>
>> >> >> On 2024-03-28 10:20, Masahiko Sawada wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
>> >> >> >> <aekorotkov@gmail.com> wrote:
>> >> >> >> >
>> >> >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>> >> >> >> > > On 2024-01-18 10:10, jian he wrote:
>> >> >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com>
>> >> >> >> > > > wrote:
>> >> >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
>> >> >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
>> >> >> >> > > >> > You will need a separate parameter anyway to specify the destination
>> >> >> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't
>> >> >> >> > > >> > looking.  I don't buy that one parameter that has some special values
>> >> >> >> > > >> > while other values could be names will be a good design.  Moreover,
>> >> >> >> > > >> > what if we want to support (say) log-to-file along with log-to-table?
>> >> >> >> > > >> > Trying to distinguish a file name from a table name without any other
>> >> >> >> > > >> > context seems impossible.
>> >> >> >> > > >>
>> >> >> >> > > >> I've been thinking we can add more values to this option to log errors
>> >> >> >> > > >> not only to the server logs but also to the error table (not sure
>> >> >> >> > > >> details but I imagined an error table is created for each table on
>> >> >> >> > > >> error), without an additional option for the destination name. The
>> >> >> >> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
>> >> >> >> > > >>
>> >> >> >> > > >
>> >> >> >> > > > another idea:
>> >> >> >> > > > on_error {error|ignore|other_future_option}
>> >> >> >> > > > if not specified then by default ERROR.
>> >> >> >> > > > You can also specify ERROR or IGNORE for now.
>> >> >> >> > > >
>> >> >> >> > > > I agree, the parameter "error_action" is better than "location".
>> >> >> >> > >
>> >> >> >> > > I'm not sure whether error_action or on_error is better, but either way
>> >> >> >> > > "error_action error" and "on_error error" seems a bit odd to me.
>> >> >> >> > > I feel "stop" is better for both cases as Tom suggested.
>> >> >> >> >
>> >> >> >> > OK.  What about this?
>> >> >> >> > on_error {stop|ignore|other_future_option}
>> >> >> >> > where other_future_option might be compound like "file 'copy.log'" or
>> >> >> >> > "table 'copy_log'".
>> >> >> >>
>> >> >> >> +1
>> >> >> >>
>> >> >> >
>> >> >> > I realized that ON_ERROR syntax synoposis in the documentation is not
>> >> >> > correct. The option doesn't require the value to be quoted and the
>> >> >> > value can be omitted. The attached patch fixes it.
>> >> >> >
>> >> >> > Regards,
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> Attached patch fixes the doc, but I'm wondering perhaps it might be
>> >> >> better to modify the codes to prohibit abbreviation of the value.
>> >> >>
>> >> >> When seeing the query which abbreviates ON_ERROR value, I feel it's
>> >> >> not
>> >> >> obvious what happens compared to other options which tolerates
>> >> >> abbreviation of the value such as FREEZE or HEADER.
>> >> >>
>> >> >>    COPY t1 FROM stdin WITH (ON_ERROR);
>> >> >>
>> >> >> What do you think?
>> >> >
>> >> > Indeed. Looking at options of other commands such as VACUUM and
>> >> > EXPLAIN, I can see that we can omit a boolean value, but non-boolean
>> >> > parameters require its value. The HEADER option is not a pure boolean
>> >> > parameter but we can omit the value. It seems to be for backward
>> >> > compatibility; it used to be a boolean parameter. I agree that the
>> >> > above example would confuse users.
>> >> >
>> >> > Regards,
>> >>
>> >> Thanks for your comment!
>> >>
>> >> Attached a patch which modifies the code to prohibit omission of its
>> >> value.
>> >>
>> >> I was a little unsure about adding a regression test for this, but I
>> >> have not added it since other COPY option doesn't test the omission of
>> >> its value.
>> >
>> > Probably should we change the doc as well since ON_ERROR value doesn't
>> > necessarily need to be single-quoted?
>> 
>> Agreed.
>> Since it seems this issue is independent from the omission of ON_ERROR
>> option value, attached a separate patch.
>> 
> 
> Thank you for the patches! These patches look good to me. I'll push
> them, barring any objections.
> 
> Regards,

Thanks for your review and for applying them!

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From
Masahiko Sawada
Date:
On Wed, Apr 17, 2024 at 4:28 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2024-04-16 13:16, Masahiko Sawada wrote:
> > On Tue, Apr 2, 2024 at 7:34 PM torikoshia <torikoshia@oss.nttdata.com>
> > wrote:
> >>
> >> On 2024-04-01 11:31, Masahiko Sawada wrote:
> >> > On Fri, Mar 29, 2024 at 11:54 AM torikoshia
> >> > <torikoshia@oss.nttdata.com> wrote:
> >> >>
> >> >> On 2024-03-28 21:54, Masahiko Sawada wrote:
> >> >> > On Thu, Mar 28, 2024 at 9:38 PM torikoshia <torikoshia@oss.nttdata.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> On 2024-03-28 10:20, Masahiko Sawada wrote:
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com>
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >> >> >> >> <aekorotkov@gmail.com> wrote:
> >> >> >> >> >
> >> >> >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> >> >> >> >> > > On 2024-01-18 10:10, jian he wrote:
> >> >> >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com>
> >> >> >> >> > > > wrote:
> >> >> >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> >> >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> >> >> >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
> >> >> >> >> > > >> > You will need a separate parameter anyway to specify the destination
> >> >> >> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't
> >> >> >> >> > > >> > looking.  I don't buy that one parameter that has some special values
> >> >> >> >> > > >> > while other values could be names will be a good design.  Moreover,
> >> >> >> >> > > >> > what if we want to support (say) log-to-file along with log-to-table?
> >> >> >> >> > > >> > Trying to distinguish a file name from a table name without any other
> >> >> >> >> > > >> > context seems impossible.
> >> >> >> >> > > >>
> >> >> >> >> > > >> I've been thinking we can add more values to this option to log errors
> >> >> >> >> > > >> not only to the server logs but also to the error table (not sure
> >> >> >> >> > > >> details but I imagined an error table is created for each table on
> >> >> >> >> > > >> error), without an additional option for the destination name. The
> >> >> >> >> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
> >> >> >> >> > > >>
> >> >> >> >> > > >
> >> >> >> >> > > > another idea:
> >> >> >> >> > > > on_error {error|ignore|other_future_option}
> >> >> >> >> > > > if not specified then by default ERROR.
> >> >> >> >> > > > You can also specify ERROR or IGNORE for now.
> >> >> >> >> > > >
> >> >> >> >> > > > I agree, the parameter "error_action" is better than "location".
> >> >> >> >> > >
> >> >> >> >> > > I'm not sure whether error_action or on_error is better, but either way
> >> >> >> >> > > "error_action error" and "on_error error" seems a bit odd to me.
> >> >> >> >> > > I feel "stop" is better for both cases as Tom suggested.
> >> >> >> >> >
> >> >> >> >> > OK.  What about this?
> >> >> >> >> > on_error {stop|ignore|other_future_option}
> >> >> >> >> > where other_future_option might be compound like "file 'copy.log'" or
> >> >> >> >> > "table 'copy_log'".
> >> >> >> >>
> >> >> >> >> +1
> >> >> >> >>
> >> >> >> >
> >> >> >> > I realized that ON_ERROR syntax synoposis in the documentation is not
> >> >> >> > correct. The option doesn't require the value to be quoted and the
> >> >> >> > value can be omitted. The attached patch fixes it.
> >> >> >> >
> >> >> >> > Regards,
> >> >> >>
> >> >> >> Thanks!
> >> >> >>
> >> >> >> Attached patch fixes the doc, but I'm wondering perhaps it might be
> >> >> >> better to modify the codes to prohibit abbreviation of the value.
> >> >> >>
> >> >> >> When seeing the query which abbreviates ON_ERROR value, I feel it's
> >> >> >> not
> >> >> >> obvious what happens compared to other options which tolerates
> >> >> >> abbreviation of the value such as FREEZE or HEADER.
> >> >> >>
> >> >> >>    COPY t1 FROM stdin WITH (ON_ERROR);
> >> >> >>
> >> >> >> What do you think?
> >> >> >
> >> >> > Indeed. Looking at options of other commands such as VACUUM and
> >> >> > EXPLAIN, I can see that we can omit a boolean value, but non-boolean
> >> >> > parameters require its value. The HEADER option is not a pure boolean
> >> >> > parameter but we can omit the value. It seems to be for backward
> >> >> > compatibility; it used to be a boolean parameter. I agree that the
> >> >> > above example would confuse users.
> >> >> >
> >> >> > Regards,
> >> >>
> >> >> Thanks for your comment!
> >> >>
> >> >> Attached a patch which modifies the code to prohibit omission of its
> >> >> value.
> >> >>
> >> >> I was a little unsure about adding a regression test for this, but I
> >> >> have not added it since other COPY option doesn't test the omission of
> >> >> its value.
> >> >
> >> > Probably should we change the doc as well since ON_ERROR value doesn't
> >> > necessarily need to be single-quoted?
> >>
> >> Agreed.
> >> Since it seems this issue is independent from the omission of ON_ERROR
> >> option value, attached a separate patch.
> >>
> >
> > Thank you for the patches! These patches look good to me. I'll push
> > them, barring any objections.
> >
> > Regards,
>
> Thanks for your review and apply!

Thank you for the patches!

Pushed: a6d0fa5ef8 and f6f8ac8e75.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com