Thread: VLDB Features
I'm starting work on next projects for 8.4.

Many applications have the need to store very large data volumes for both archival and analysis. The analytic databases are commonly known as Data Warehouses, though there isn't a common term for large archival data stores. The use cases for those can often be blurred and many people see those as only one use case. My initial interest is in the large archival data stores.

One of the main issues to be faced is simply data maintenance and management. Loading, deleting, vacuuming data all takes time. Those issues relate mainly to the size of the data store rather than any particular workload, so I'm calling that set of required features "Very Large Database" (or VLDB) features.

VLDB Features I'm expecting to work on are
- Read Only Tables/WORM tables
- Advanced Partitioning
- Compression
plus related performance features

Details of those will be covered in separate mails over next few weeks and months. So just to let everybody know that's where I'm headed, so you see the big picture with me.

I'll be working on other projects as well, many of which I've listed here:
http://developer.postgresql.org/index.php/Simon_Riggs%27_Development_Projects

I expect the list is too long to complete for 8.4, but I'm allowing for various issues arising during development. So specific discussion on other mails as they arrive, please.

-- 
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon.

> VLDB Features I'm expecting to work on are
> - Read Only Tables/WORM tables
> - Advanced Partitioning
> - Compression
> plus related performance features

Just so you don't lose sight of it, one of the biggest VLDB features we're missing is fault-tolerant bulk load. Unfortunately, I don't know anyone who's working on it.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco
On Tue, 2007-12-11 at 10:53, Josh Berkus wrote:
> Simon.
>
> > VLDB Features I'm expecting to work on are
> > - Read Only Tables/WORM tables
> > - Advanced Partitioning
> > - Compression
> > plus related performance features
>
> Just so you don't lose sight of it, one of the biggest VLDB features we're
> missing is fault-tolerant bulk load.

What do you mean by fault-tolerant here? Just

COPY ... WITH ERRORS TO ...

or something more advanced, like a bulkload which can be continued after a crash?

--------------
Hannu
On Tue, 2007-12-11 at 10:53 -0800, Josh Berkus wrote:
> Simon.
>
> > VLDB Features I'm expecting to work on are
> > - Read Only Tables/WORM tables
> > - Advanced Partitioning
> > - Compression
> > plus related performance features
>
> Just so you don't lose sight of it, one of the biggest VLDB features we're
> missing is fault-tolerant bulk load. Unfortunately, I don't know anyone
> who's working on it.

Not lost sight of it; I have a design, but I have to prioritise also.

-- 
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Hannu,

> COPY ... WITH ERRORS TO ...

Yeah, that's a start.

> or something more advanced, like bulkload which can be continued after
> crash ?

Well, we could also use a loader which automatically parallelized, but that functionality can be done at the middleware level. WITH ERRORS is the most critical part.

Here's the other VLDB features we're missing:

Parallel Query
Windowing Functions
Parallel Index Build (not sure how this works exactly, but it speeds Oracle up considerably)
On-disk Bitmap Index (anyone game to finish GP patch?)

Simon, we should start a VLDB-Postgres developer wiki page.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
On Tue, 2007-12-11 at 10:53 -0800, Josh Berkus wrote:
> Just so you don't lose sight of it, one of the biggest VLDB features we're
> missing is fault-tolerant bulk load.

I actually had to cook up a version of this for Truviso recently. I'll take a look at submitting a cleaned-up implementation for 8.4.

-Neil
On Tue, 2007-12-11 at 15:31 -0800, Josh Berkus wrote:
> Here's the other VLDB features we're missing:
>
> Parallel Query
> Windowing Functions
> Parallel Index Build (not sure how this works exactly, but it speeds Oracle
> up considerably)
> On-disk Bitmap Index (anyone game to finish GP patch?)

I would call those VLDB Data Warehousing features to differentiate between that and the use of VLDBs for other purposes.

I'd add Materialized View support in the planner, as well as saying it's more important than parallel query, IMHO. MVs are to DW what indexes are to OLTP. It's the same as indexes vs. seqscan; you can speed up the seq scan or you can avoid it. Brute force is cool, but being smarter is even better.

The reason they don't normally show up high on anybody's feature list is that the TPC benchmarks specifically disallow them, which as I once observed is very good support for them being a useful feature in practice. (Oracle originally brought out MV support as a way of improving their TPC scores at a time when Teradata was wiping the floor with them with its parallel query implementation.)

-- 
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Tue, 11 Dec 2007, Josh Berkus wrote:
> Just so you don't lose sight of it, one of the biggest VLDB features we're
> missing is fault-tolerant bulk load. Unfortunately, I don't know anyone
> who's working on it.

I'm curious what you feel is missing that pgloader doesn't fill that requirement: http://pgfoundry.org/projects/pgloader/

-- 
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
Greg,

> I'm curious what you feel is missing that pgloader doesn't fill that
> requirement: http://pgfoundry.org/projects/pgloader/

Because pgloader is implemented in middleware, it carries a very high overhead if you have bad rows. As little as 1% bad rows will slow down loading by 20% due to retries.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco
Hi,

On Wednesday, December 12, 2007, Josh Berkus wrote:
> > I'm curious what you feel is missing that pgloader doesn't fill that
> > requirement: http://pgfoundry.org/projects/pgloader/
>
> Because pgloader is implemented in middleware, it carries a very high
> overhead if you have bad rows. As little as 1% bad rows will slow down
> loading by 20% due to retries.

Not that much, in fact, I'd say. pgloader allows its user to configure how large a COPY buffer to use (a global parameter as of now; it could easily become a per-section configuration knob, I just didn't see any need for this yet). It's the 'copy_every' parameter as seen on the man page here:
http://pgloader.projects.postgresql.org/#toc4

pgloader will obviously prepare an in-memory buffer of copy_every tuples to give to COPY, and in case of error will cut it and retry. Classic dichotomy approach, from the initial implementation by Jan Wieck. So you can easily balance the error recovery costs against the COPY bulk size.

Note also that the overall loading time with pgloader does not scale the same way as the COPY buffer size; the optimal choice depends on the dataset --- and the data massaging pgloader has to do on it --- and I've experienced the best results with 10000 and 15000 tuple buffers so far.

FYI, now that the pgloader topic is on the table, the next items I think I'm going to develop for it are configurable behavior on error tuples (load to another table on pk error, e.g.), and some limited ddl-partitioning support. I'm playing with the idea for pgloader to be able to read some partitioning schemes (parsing CHECK constraints on inherited tables) and load directly into the right partitions. That would of course be done only when configured this way, and if constraints are misread it would only result in a lot more rejected rows than expected, and you can still retry using your insert trigger instead of pgloader's buggy smartness.

Comments welcome, regards,
-- 
dim
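In rough outline, the dichotomy retry described above works like the following sketch (a hedged illustration only: pgloader itself is not written in C, and copy_rows(), log_rejected_row() and the Row type are hypothetical stand-ins, not pgloader's actual API):

    /*
     * Sketch of the "cut the buffer and retry" dichotomy.  copy_rows() is
     * assumed to COPY rows[lo..hi) in one round trip and return false if
     * the server rejected the batch; log_rejected_row() writes a bad row
     * to the reject file.
     */
    typedef struct { char *line; } Row;             /* stand-in for one input row */

    static bool copy_rows(Row *rows, int lo, int hi);      /* hypothetical */
    static void log_rejected_row(const Row *row);           /* hypothetical */

    static void
    load_with_bisection(Row *rows, int lo, int hi)
    {
        if (lo >= hi)
            return;

        if (copy_rows(rows, lo, hi))
            return;                     /* whole slice loaded cleanly */

        if (hi - lo == 1)
        {
            log_rejected_row(&rows[lo]);    /* single bad row: reject and move on */
            return;
        }

        /* split the failing slice in two and retry each half */
        load_with_bisection(rows, lo, lo + (hi - lo) / 2);
        load_with_bisection(rows, lo + (hi - lo) / 2, hi);
    }

The cost of a bad row is then logarithmic in the buffer size, which is why the copy_every setting lets you trade error-recovery cost against round-trip overhead.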
On Tue, 2007-12-11 at 15:31 -0800, Josh Berkus wrote:
> Simon, we should start a VLDB-Postgres developer wiki page.

http://developer.postgresql.org/index.php/DataWarehousing

-- 
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Hi,

Josh Berkus wrote:
> Here's the other VLDB features we're missing:
>
> Parallel Query

Uh.. this only makes sense in a distributed database, no? I've thought about parallel querying on top of Postgres-R. Does it make sense implementing some form of parallel querying apart from the distribution or replication engine?

> Windowing Functions

Isn't Gavin Sherry working on this? Haven't read anything from him lately...

> Parallel Index Build (not sure how this works exactly, but it speeds Oracle
> up considerably)

Sounds interesting *turns-away-to-google*

> On-disk Bitmap Index (anyone game to finish GP patch?)

Anybody having an idea of what's missing there (besides good use cases, which some people doubt)? Again: Gavin?

> Simon, we should start a VLDB-Postgres developer wiki page.

Thanks, Simon, the wiki page looks good!

Regards

Markus
Markus,

> > Parallel Query
>
> Uh.. this only makes sense in a distributed database, no? I've thought
> about parallel querying on top of Postgres-R. Does it make sense
> implementing some form of parallel querying apart from the distribution
> or replication engine?

Sure. Imagine you have a 5TB database on a machine with 8 cores and only one concurrent user. You'd like to have 1 core doing I/O, and say 4-5 cores dividing the scan and join processing into 4-5 chunks.

I'd say implementing a separate I/O worker would be the first step towards this; if we could avoid doing I/O in the same process/thread where we're doing row parsing it would speed up large scans by 100%. I know Oracle does this, and their large-table I/O is 30-40% faster than ours despite having less efficient storage.

Maybe Greenplum or EnterpriseDB will contribute something. ;-)

> > Windowing Functions
>
> Isn't Gavin Sherry working on this? Haven't read anything from him
> lately...

Me neither. Swallowed by Greenplum and France.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco
Hi Josh,

Josh Berkus wrote:
> Sure. Imagine you have a 5TB database on a machine with 8 cores and only one
> concurrent user. You'd like to have 1 core doing I/O, and say 4-5 cores
> dividing the scan and join processing into 4-5 chunks.

Ah, right, thanks for the enlightenment. Heck, I'm definitely too focused on replication and distributed databases :-)

However, there's certainly a great deal of an intersection between parallel processing on different machines and parallel processing on multiple CPUs - especially considering NUMA architecture. *comes-to-think-again*...

> > Isn't Gavin Sherry working on this? Haven't read anything from him
> > lately...
>
> Me neither. Swallowed by Greenplum and France.

Hm.. good for him, I guess!

Regards

Markus
On Wed, Dec 12, 2007 at 08:26:16PM +0100, Markus Schiltknecht wrote:
> > > Isn't Gavin Sherry working on this? Haven't read anything from him
> > > lately...
> >
> > Me neither. Swallowed by Greenplum and France.
>
> Hm.. good for him, I guess!

Yes, I'm around -- just extremely busy with a big release at Greenplum as well as other Real Life stuff.

Thanks,

Gavin
> Greenplum as well as other Real Life stuff.

For those of us here who have no idea what you are talking about can you define what "Real Life" is like?

Joshua D. Drake

-- 
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'
"Josh Berkus" <josh@agliodbs.com> writes: > Markus, > >> > Parallel Query >> >> Uh.. this only makes sense in a distributed database, no? I've thought >> about parallel querying on top of Postgres-R. Does it make sense >> implementing some form of parallel querying apart from the distribution >> or replication engine? Yes, but not for the reasons Josh describes. > I'd say implementing a separate I/O worker would be the first step towards > this; if we could avoid doing I/O in the same process/thread where we're > doing row parsing it would speed up large scans by 100%. I know Oracle does > this, and their large-table-I/O is 30-40% faster than ours despite having > less efficient storage. Oracle is using Direct I/O so they need the reader and writer threads to avoid blocking on i/o all the time. We count on the OS doing readahead and buffering our writes so we don't have to. Direct I/O and needing some way to do asynchronous writes and reads are directly tied. Where Parallel query is useful is when you have queries that involve a substantial amount of cpu resources, especially if you have a very fast I/O system which can saturate the bandwidth to a single cpu. So for example if you have a merge join which requires sorting both sides of the query you could easily have subprocesses handle those sorts allowing you to bring two processors to bear on the problem instead of being limited to a single processor. On Oracle Parallel Query goes great with partitioned tables. Their query planner will almost always turn the partition scans into parallel scans and use separate processors to scan different partitions. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Hello Gregory,

Gregory Stark wrote:
> Oracle is using Direct I/O so they need the reader and writer threads to avoid
> blocking on i/o all the time. We count on the OS doing readahead and buffering
> our writes so we don't have to. Direct I/O and needing some way to do
> asynchronous writes and reads are directly tied.

Yeah, except in cases where we can tell ahead of time about non-sequential reads. Which admittedly doesn't come up too frequently and can probably be handled with posix_fadvise - as you are currently testing.

> Where Parallel query is useful is when you have queries that involve a
> substantial amount of cpu resources, especially if you have a very fast I/O
> system which can saturate the bandwidth to a single cpu.

Full ACK; the very same applies to parallel querying on shared-nothing clusters. Those can help if the bandwidth to all processing cores together becomes the bottleneck (and the resulting data is relatively small compared to the input data).

For example, Sun's UltraSparc T2 features only 8 PCIe lanes for those 8 cores, so you end up with 250 MiB/sec per core or about 32 MiB/sec per thread on average. To be fair: their 10 Gig Ethernet ports don't go via PCIe, so you get an additional 2x 1 GiB/sec for the complete chip. And memory bandwidth looks a lot better: Sun claims 60+ GiB/sec, leaving almost 8 GiB/sec per core or 1 GiB/sec per thread.

If my calculations for Intel are correct, a Quad Xeon with a 1.33 GHz FSB has around 21 GiB/sec throughput to main memory, giving 5 GiB/sec per core. (Why are these numbers so hard to find? It looks like Intel deliberately obfuscates them with FSB MHz or giga-transactions per sec and the like.)

Regards

Markus
On Tue, 2007-12-11 at 15:41, Neil Conway wrote:
> On Tue, 2007-12-11 at 10:53 -0800, Josh Berkus wrote:
> > Just so you don't lose sight of it, one of the biggest VLDB features we're
> > missing is fault-tolerant bulk load.
>
> I actually had to cook up a version of this for Truviso recently. I'll
> take a look at submitting a cleaned-up implementation for 8.4.

How did you do it?

Did you enhance the COPY command or was it something completely new?

-----------
Hannu
On Fri, 2007-12-14 at 14:48 +0200, Hannu Krosing wrote:
> How did you do it ?
>
> Did you enchance COPY command or was it something completely new ?

By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY to drop (and log) rows that contain malformed data. That is, rows with too many or too few columns, rows that result in constraint violations, and rows containing columns where the data type's input function raises an error. The last case is the only thing that would be a bit tricky to implement, I think: you could use PG_TRY() around the InputFunctionCall, but I guess you'd need a subtransaction to ensure that you reset your state correctly after catching an error.

-Neil
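In rough outline, that PG_TRY-plus-subtransaction shape could look something like the sketch below. This is illustrative only (not the patch that appears later in this thread); the function name safe_input_call is made up, and the error reporting details are left to the caller:

    /*
     * Sketch: call a column's input function inside its own subtransaction
     * so a malformed value can be skipped instead of aborting the whole
     * COPY.  In copy.c this would rely on the usual headers: postgres.h,
     * access/xact.h, fmgr.h, utils/resowner.h.
     */
    static bool
    safe_input_call(FmgrInfo *flinfo, char *string, Oid typioparam,
                    int32 typmod, Datum *result)
    {
        MemoryContext oldcontext = CurrentMemoryContext;
        ResourceOwner oldowner = CurrentResourceOwner;
        bool          ok = true;

        BeginInternalSubTransaction(NULL);
        /* do the work in the outer memory context so *result survives */
        MemoryContextSwitchTo(oldcontext);

        PG_TRY();
        {
            *result = InputFunctionCall(flinfo, string, typioparam, typmod);

            /* success: commit the subtransaction and restore state */
            ReleaseCurrentSubTransaction();
            MemoryContextSwitchTo(oldcontext);
            CurrentResourceOwner = oldowner;
        }
        PG_CATCH();
        {
            /* failure: clean up the error and undo the subtransaction */
            MemoryContextSwitchTo(oldcontext);
            FlushErrorState();
            RollbackAndReleaseCurrentSubTransaction();
            MemoryContextSwitchTo(oldcontext);
            CurrentResourceOwner = oldowner;

            ok = false;         /* caller logs and skips this row */
        }
        PG_END_TRY();

        return ok;
    }

The subtransaction is what makes the catch safe: it gives the abort machinery something to roll back, rather than trying to continue in a half-cleaned-up state.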
Neil Conway wrote:
> By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
> to drop (and log) rows that contain malformed data. That is, rows with
> too many or too few columns, rows that result in constraint violations,
> and rows containing columns where the data type's input function raises
> an error. The last case is the only thing that would be a bit tricky to
> implement, I think: you could use PG_TRY() around the InputFunctionCall,
> but I guess you'd need a subtransaction to ensure that you reset your
> state correctly after catching an error.

Ideally I think you would put the failing input line in another table, or maybe another file. If a table, it would probably have to be as bytea.

cheers

andrew
Neil Conway <neilc@samurai.com> writes:
> By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
> to drop (and log) rows that contain malformed data. That is, rows with
> too many or too few columns, rows that result in constraint violations,
> and rows containing columns where the data type's input function raises
> an error. The last case is the only thing that would be a bit tricky to
> implement, I think: you could use PG_TRY() around the InputFunctionCall,
> but I guess you'd need a subtransaction to ensure that you reset your
> state correctly after catching an error.

Yeah. It's the subtransaction per row that's daunting --- not only the cycles spent for that, but the ensuing limitation to 4G rows imported per COPY.

If we could somehow only do a subtransaction per failure, things would be much better, but I don't see how.

			regards, tom lane
On Fri, 2007-12-14 at 18:22 -0500, Tom Lane wrote:
> If we could somehow only do a subtransaction per failure, things would
> be much better, but I don't see how.

One approach would be to essentially implement the pg_bulkloader approach inside the backend. That is, begin by doing a subtransaction for every k rows (with k = 1000, say). If you get any errors, then either repeat the process with k/2 until you locate the individual row(s) causing the trouble, or perhaps just immediately switch to k = 1. Fairly ugly though, and would be quite slow for data sets with a high proportion of erroneous data.

Another approach would be to distinguish between errors that require a subtransaction to recover to a consistent state, and less serious errors that don't have this requirement (e.g. invalid input to a data type input function). If all the errors that we want to tolerate during a bulk load fall into the latter category, we can do without subtransactions.

-Neil
Neil Conway <neilc@samurai.com> writes:
> One approach would be to essentially implement the pg_bulkloader
> approach inside the backend. That is, begin by doing a subtransaction
> for every k rows (with k = 1000, say). If you get any errors, then
> either repeat the process with k/2 until you locate the individual
> row(s) causing the trouble, or perhaps just immediately switch to k = 1.
> Fairly ugly though, and would be quite slow for data sets with a high
> proportion of erroneous data.

You could make it self-tuning, perhaps: initially, or after an error, set k = 1, and increase k after a successful set of rows.

> Another approach would be to distinguish between errors that require a
> subtransaction to recover to a consistent state, and less serious errors
> that don't have this requirement (e.g. invalid input to a data type
> input function). If all the errors that we want to tolerate during a
> bulk load fall into the latter category, we can do without
> subtransactions.

I think such an approach is doomed to hopeless unreliability. There is no concept of an error that doesn't require a transaction abort in the system now, and that doesn't seem to me like something that can be successfully bolted on after the fact. Also, there's a lot of bookkeeping (eg buffer pins) that has to be cleaned up regardless of the exact nature of the error, and all those mechanisms are hung off transactions.

			regards, tom lane
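As a sketch, that self-tuning loop might look something like the following (hypothetical helpers only; nothing like this exists in the backend at this point, and load_rows_in_subxact() is assumed to insert the next k rows inside one subtransaction, returning false if it had to be rolled back):

    static bool next_rows_available(void);       /* hypothetical */
    static bool load_rows_in_subxact(int k);     /* hypothetical */

    static void
    copy_with_self_tuning_batches(void)
    {
        int k = 1;                      /* start with one row per subxact */

        while (next_rows_available())
        {
            if (load_rows_in_subxact(k))
            {
                /* batch committed cleanly: grow the batch, up to a cap */
                if (k < 1000)
                    k *= 2;
            }
            else
            {
                /*
                 * Some row in the batch failed and the whole subtransaction
                 * was rolled back.  Drop back to k = 1 so the offending row
                 * can be isolated, logged, and skipped on the retry (this
                 * assumes the caller can re-read the rows of the failed
                 * batch).
                 */
                k = 1;
            }
        }
    }

With mostly clean data this consumes roughly one subtransaction per thousand rows instead of one per row, which is what makes it attractive against the 4G-subtransaction ceiling.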
Tom,

> I think such an approach is doomed to hopeless unreliability. There is
> no concept of an error that doesn't require a transaction abort in the
> system now, and that doesn't seem to me like something that can be
> successfully bolted on after the fact. Also, there's a lot of
> bookkeeping (eg buffer pins) that has to be cleaned up regardless of the
> exact nature of the error, and all those mechanisms are hung off
> transactions.

There's no way we can do a transactionless load, then? I'm thinking of the load-into-new-partition case, which is a single pass/fail operation. Would ignoring individual row errors for this case still cause these kinds of problems?

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco
On Friday 2007-12-14 16:22, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
> > to drop (and log) rows that contain malformed data. ...
>
> Yeah. It's the subtransaction per row that's daunting --- not only the
> cycles spent for that, but the ensuing limitation to 4G rows imported
> per COPY.

You could extend the COPY FROM syntax with a COMMIT EVERY n clause. This would help with the 4G subtransaction limit. The cost to the ETL process is that a simple rollback would not be guaranteed to send the process back to its initial state. There are easy ways to deal with the rollback issue though.

A {NO} RETRY {USING algorithm} clause might be useful. If the NO RETRY option is selected then the COPY FROM can run without subtransactions and in excess of the 4G per transaction limit. NO RETRY should be the default since it preserves the legacy behavior of COPY FROM.

You could have an EXCEPTIONS TO {filename|STDERR} clause. I would not give the option of sending exceptions to a table since they are presumably malformed, otherwise they would not be exceptions. (Users should re-process exception files if they want an "if good then table a else exception to table b" ...)

EXCEPTIONS TO and NO RETRY would be mutually exclusive.

> If we could somehow only do a subtransaction per failure, things would
> be much better, but I don't see how.
Josh Berkus <josh@agliodbs.com> writes:
> There's no way we can do a transactionless load, then? I'm thinking of the
> load-into-new-partition which is a single pass/fail operation. Would
> ignoring individual row errors in for this case still cause these kinds of
> problems?

Given that COPY fires triggers and runs CHECK constraints, there is no part of the system that cannot be exercised during COPY. So I think supposing that we can just deal with some simplified subset of reality is mere folly.

			regards, tom lane
Hi,

Neil Conway wrote:
> Another approach would be to distinguish between errors that require a
> subtransaction to recover to a consistent state, and less serious errors
> that don't have this requirement (e.g. invalid input to a data type
> input function). If all the errors that we want to tolerate during a
> bulk load fall into the latter category, we can do without
> subtransactions.

Any errors which occur before doing the heap_insert should not require any recovery according to me.

The overhead of having a subtransaction per row is a very valid concern. But instead of using a per-insert or a batch-insert subtransaction, I am thinking that we can start off a subtransaction and continue it till we encounter a failure. The moment an error is encountered, since we have the offending (already in heap) tuple around, we can call a simple_heap_delete on the same and commit (instead of aborting) this subtransaction.

Regards,
Nikhils
-- 
EnterpriseDB http://www.enterprisedb.com
NikhilS <nikkhils@gmail.com> writes:
> Any errors which occur before doing the heap_insert should not require
> any recovery according to me.

A sufficient (though far from all-encompassing) rejoinder to that is "triggers and CHECK constraints can do anything".

> The overhead of having a subtransaction per row is a very valid concern. But
> instead of using a per insert or a batch insert substraction, I am
> thinking that we can start off a subtraction and continue it till we
> encounter a failure.

What of failures that occur only at (sub)transaction commit, such as foreign key checks?

			regards, tom lane
On Fri, 2007-12-14 at 18:22 -0500, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
> > to drop (and log) rows that contain malformed data. ...
>
> Yeah. It's the subtransaction per row that's daunting --- not only the
> cycles spent for that, but the ensuing limitation to 4G rows imported
> per COPY.

I'd suggest doing everything at block level:
- wrap each new block of data in a subtransaction
- apply data to the table block by block (can still work with FSM)
- apply indexes in bulk for each block, unique ones first

That then gives you a limit of more than 500 trillion rows, which should be enough for anyone.

-- 
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Tue, 2007-12-11 at 19:11 -0500, Greg Smith wrote:
> I'm curious what you feel is missing that pgloader doesn't fill that
> requirement: http://pgfoundry.org/projects/pgloader/

For complicated ETL, I agree that using an external tool makes the most sense. But I think there is still merit in adding support to COPY for the simple case of trying to load a data file that has some corrupted, invalid or duplicate records.

-Neil
On 16/12/2007, Neil Conway <neilc@samurai.com> wrote:
> For complicated ETL, I agree that using an external tool makes the most
> sense. But I think there is still merit in adding support to COPY for
> the simple case of trying to load a data file that has some corrupted,
> invalid or duplicate records.

Any simple enhancement of COPY is welcome. I have lost a lot of time with repeated imports.

Regards
Pavel
On Saturday 2007-12-15 02:14, Simon Riggs wrote:
> I'd suggest doing everything at block level
> - wrap each new block of data in a subtransaction
> - apply data to the table block by block (can still work with FSM).
> - apply indexes in bulk for each block, unique ones first.
>
> That then gives you a limit of more than 500 trillion rows, which should
> be enough for anyone.

Wouldn't it only give you more than 500T rows in the best case? If it hits a bad row it has to back off and roll forward one row and one subtransaction at a time for the failed block. So in the worst case, where there is at least one exception row per block, I think you would still wind up with only a capacity of 4G rows.
On Sat, 2007-12-15 at 01:12, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > There's no way we can do a transactionless load, then? I'm thinking of the
> > load-into-new-partition which is a single pass/fail operation. Would
> > ignoring individual row errors in for this case still cause these kinds of
> > problems?
>
> Given that COPY fires triggers and runs CHECK constraints, there is no
> part of the system that cannot be exercised during COPY. So I think
> supposing that we can just deal with some simplified subset of reality
> is mere folly.

But can't we _define_ such a subset, where we can do a transactionless load?

I don't think that most DW/VLDB schemas fire complex triggers or custom data-modifying functions inside CHECKs. Then we could just run the remaining simple CHECK constraints ourselves and not abort on a failed check, but just log those rows?

The COPY ... WITH ERRORS TO ... would essentially become a big conditional RULE through which the incoming data is processed.

------------------
Hannu
Hannu Krosing <hannu@skype.net> writes:
> But can't we _define_ such a subset, where we can do a transactionless
> load ?

Sure ... but you'll find that it's not large enough to be useful. Once you remove all the interesting consistency checks such as unique indexes and foreign keys, the COPY will tend to go through just fine, and then you're still stuck trying to weed out bad data without very good tools for it. The only errors we could really separate out without subtransaction fencing are extremely trivial ones like too many or too few fields on a line ... which can be caught with a sed script.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> NikhilS <nikkhils@gmail.com> writes:
> > Any errors which occur before doing the heap_insert should not require
> > any recovery according to me.
>
> A sufficient (though far from all-encompassing) rejoinder to that is
> "triggers and CHECK constraints can do anything".
>
> > The overhead of having a subtransaction per row is a very valid concern.
> > But instead of using a per-insert or a batch-insert subtransaction, I am
> > thinking that we can start off a subtransaction and continue it till we
> > encounter a failure. The moment an error is encountered, since we have
> > the offending (already in heap) tuple around, we can call a
> > simple_heap_delete on the same and commit (instead of aborting) this
> > subtransaction.
>
> What of failures that occur only at (sub)transaction commit, such as
> foreign key checks?

What if we identify and define a subset where we could do subtransaction-based COPY? The following could be supported:

* A subset of triggers and CHECK constraints which do not move the tuple around. (Identifying this subset might be an issue though?)
* Primary/unique key indexes

As Hannu mentioned elsewhere in this thread, there should not be very many instances of complex triggers/CHECKs around? And maybe in those instances (and also the foreign key checks case), the behaviour could default to using a per-subtransaction-per-row or even the existing single transaction model?

Regards,
Nikhils
-- 
EnterpriseDB http://www.enterprisedb.com
2007/12/16, Tom Lane <tgl@sss.pgh.pa.us>:
> Hannu Krosing <hannu@skype.net> writes:
> > But can't we _define_ such a subset, where we can do a transactionless
> > load ?
>
> Sure ... but you'll find that it's not large enough to be useful.
> Once you remove all the interesting consistency checks such as
> unique indexes and foreign keys, the COPY will tend to go through
> just fine, and then you're still stuck trying to weed out bad data
> without very good tools for it. The only errors we could really
> separate out without subtransaction fencing are extremely trivial
> ones like too many or too few fields on a line ... which can be
> caught with a sed script.

I have a dump file. I would like to load it ASAP. Constraints will be applied at the end, so any problem can be detected. I would like it to be as direct as possible and as bulk as possible - just allocate pages and fill them with the data. Maybe it should be a different mode - single user or so. Right now I can save some IO - like turning off fsync - but that is all :(

I got something like that: http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results
I have no idea how to load a single file in many threads, but... the point is that it can be much faster than a single-threaded load - surprisingly - at least for me.

-- 
Regards,
  Michał Zaborowski (TeXXaS)
On Dec 12, 2007, at 1:26 PM, Markus Schiltknecht wrote:
> Josh Berkus wrote:
> > Sure. Imagine you have a 5TB database on a machine with 8 cores
> > and only one concurrent user. You'd like to have 1 core doing I/O,
> > and say 4-5 cores dividing the scan and join processing into
> > 4-5 chunks.
>
> Ah, right, thanks for the enlightenment. Heck, I'm definitely too
> focused on replication and distributed databases :-)
>
> However, there's certainly a great deal of an intersection between
> parallel processing on different machines and parallel processing
> on multiple CPUs - especially considering NUMA architecture.
> *comes-to-think-again*...

Except that doing something in-machine is often far simpler than trying to go cross-machine, especially when that something is a background reader. Let's walk before we run. :)

-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Tom,

> Sure ... but you'll find that it's not large enough to be useful.
> Once you remove all the interesting consistency checks such as
> unique indexes and foreign keys, the COPY will tend to go through
> just fine, and then you're still stuck trying to weed out bad data
> without very good tools for it. The only errors we could really
> separate out without subtransaction fencing are extremely trivial
> ones like too many or too few fields on a line ... which can be
> caught with a sed script.

Speaking as someone who did a LOT of DW load design only a couple of years ago, I'll say that the "special case" of no triggers, no constraint checks except length, and a type-safety check actually constitutes about 50% of DW bulk loading. The only exception to that is unique indexes, which would normally be included and would be the difficult thing.

Also, "special case bulk loading" would in fact give users of other types of applications a lot more flexibility -- they could always load into a holding table just to clean up the type-safety issues and then merge into the real table.

So I don't agree that the "load into new partition without dependencies" case is too much of a special case to be worth pursuing. It might be a bad idea for other reasons, but not because it's too obscure.

--Josh
Trent Shipley <trent_shipley@qwest.net> writes:
> On Friday 2007-12-14 16:22, Tom Lane wrote:
> > Yeah. It's the subtransaction per row that's daunting --- not only the
> > cycles spent for that, but the ensuing limitation to 4G rows imported
> > per COPY.
>
> You could have an EXCEPTIONS TO {filename|STDERR} clause. I would not give the
> option of sending exceptions to a table since they are presumably malformed,
> otherwise they would not be exceptions. (Users should re-process exception
> files if they want an if good then table a else exception to table b ...)
>
> > If we could somehow only do a subtransaction per failure, things would
> > be much better, but I don't see how.

Hello,

Attached is a proof of concept patch for this TODO item. There are no docs yet, I just wanted to know if the approach is sane.

The added syntax is like the following:

    COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout]

The way it's done is by abusing Copy Both mode and, from my limited testing, that seems to just work. The error trapping itself is done using PG_TRY/PG_CATCH and can only catch formatting or before-insert trigger errors; no attempt is made to recover from a failed unique constraint, etc.

Example in action:

postgres=# \d test_copy2
     Table "public.test_copy2"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 val    | integer |

postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout;
1
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 1: "1"
2
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 2: "2"
3
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 3: "3"
NOTICE:  total exceptions ignored: 3

postgres=# \d test_copy1
  Table "public.test_copy1"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null

postgres=# set client_min_messages to warning;
SET
postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout;
...
vmstat
zoneinfo
postgres=#

Limited performance testing shows no significant difference between the error-catching and plain code paths. For example, timing

    copy test_copy1 from program 'seq 1000000' [exceptions to stdout]

shows similar numbers with or without the added "exceptions to" clause.

Now that I'm sending this I wonder if the original comment about the need for a subtransaction around every loaded line still holds. Any example of what would not be properly rolled back by just PG_TRY?

Happy hacking!
--
Alex
2014-12-25 22:23 GMT+01:00 Alex Shulgin <ash@commandprompt.com>:
> Now that I'm sending this I wonder if the original comment about the
> need for a subtransaction around every loaded line still holds. Any
> example of what would not be properly rolled back by just PG_TRY?

This method is unsafe: exception handlers don't usually free memory, so there is a risk of memory and resource leaks.

You can get the same performance with block subtransactions - when you use a subtransaction per 1000 rows, the impact of subtransactions is minimal.

When a block fails, you can fall back to a row-level subtransaction - that works well when you expect almost correct data.

Regards

Pavel
2014-12-26 11:41 GMT+01:00 Pavel Stehule <pavel.stehule@gmail.com>:
> You can get the same performance with block subtransactions - when you use
> a subtransaction per 1000 rows, the impact of subtransactions is minimal.
>
> When a block fails, you can fall back to a row-level subtransaction - that
> works well when you expect almost correct data.

Two years ago I wrote an extension that did this, but I have not had time to finish it and push it upstream.

Regards

Pavel
Hello.Wrote a patch implementing COPY with ignoring errors in rows using block subtransactions.
Syntax: COPY [table] FROM [file/stdin] WITH IGNORE_ERROS;Examples:CREATE TABLE check_ign_err (n int, m int, k int);
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 2 2 2
3 3 3
\.WARNING: COPY check_ign_err, line 2: "2 2 2 2"SELECT * FROM check_ign_err;
n | m | k
---+---+---
1 | 1 | 1
3 | 3 | 3
(2 rows)##################################################TRUNCATE check_ign_err;
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 2
3 3 3
\.WARNING: COPY check_ign_err, line 2: "2 2"
SELECT * FROM check_ign_err;
n | m | k
---+---+---
1 | 1 | 1
3 | 3 | 3
(2 rows)##################################################TRUNCATE check_ign_err;
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 a 2
3 3 3
\.WARNING: COPY check_ign_err, line 2, column m: "a"
SELECT * FROM check_ign_err;
n | m | k
---+---+---
1 | 1 | 1
3 | 3 | 3
(2 rows)Regards, Damirпт, 10 дек. 2021 г. в 21:48, Pavel Stehule <pavel.stehule@gmail.com>:2014-12-26 11:41 GMT+01:00 Pavel Stehule <pavel.stehule@gmail.com>:2014-12-25 22:23 GMT+01:00 Alex Shulgin <ash@commandprompt.com>:Trent Shipley <trent_shipley@qwest.net> writes:
> On Friday 2007-12-14 16:22, Tom Lane wrote:
>> Neil Conway <neilc@samurai.com> writes:
>> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
>> > to drop (and log) rows that contain malformed data. That is, rows with
>> > too many or too few columns, rows that result in constraint violations,
>> > and rows containing columns where the data type's input function raises
>> > an error. The last case is the only thing that would be a bit tricky to
>> > implement, I think: you could use PG_TRY() around the InputFunctionCall,
>> > but I guess you'd need a subtransaction to ensure that you reset your
>> > state correctly after catching an error.
>>
>> Yeah. It's the subtransaction per row that's daunting --- not only the
>> cycles spent for that, but the ensuing limitation to 4G rows imported
>> per COPY.
>
> You could extend the COPY FROM syntax with a COMMIT EVERY n clause. This
> would help with the 4G subtransaction limit. The cost to the ETL process is
> that a simple rollback would not be guaranteed send the process back to it's
> initial state. There are easy ways to deal with the rollback issue though.
>
> A {NO} RETRY {USING algorithm} clause might be useful. If the NO RETRY
> option is selected then the COPY FROM can run without subtransactions and in
> excess of the 4G per transaction limit. NO RETRY should be the default since
> it preserves the legacy behavior of COPY FROM.
>
> You could have an EXCEPTIONS TO {filename|STDERR} clause. I would not give the
> option of sending exceptions to a table since they are presumably malformed,
> otherwise they would not be exceptions. (Users should re-process exception
> files if they want an if good then table a else exception to table b ...)
>
> EXCEPTIONS TO and NO RETRY would be mutually exclusive.
>
>
>> If we could somehow only do a subtransaction per failure, things would
>> be much better, but I don't see how.
Hello,
Attached is a proof of concept patch for this TODO item. There is no
docs yet, I just wanted to know if approach is sane.
The added syntax is like the following:
COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout]
The way it's done it is abusing Copy Both mode and from my limited
testing, that seems to just work. The error trapping itself is done
using PG_TRY/PG_CATCH and can only catch formatting or before-insert
trigger errors, no attempt is made to recover from a failed unique
constraint, etc.
Example in action:
postgres=# \d test_copy2
Table "public.test_copy2"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
val | integer |
postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout;
1
NOTICE: missing data for column "val"
CONTEXT: COPY test_copy2, line 1: "1"
2
NOTICE: missing data for column "val"
CONTEXT: COPY test_copy2, line 2: "2"
3
NOTICE: missing data for column "val"
CONTEXT: COPY test_copy2, line 3: "3"
NOTICE: total exceptions ignored: 3
postgres=# \d test_copy1
Table "public.test_copy1"
Column | Type | Modifiers
--------+---------+-----------
id | integer | not null
postgres=# set client_min_messages to warning;
SET
postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout;
...
vmstat
zoneinfo
postgres=#
Limited performance testing shows no significant difference between
error-catching and plain code path. For example, timing
copy test_copy1 from program 'seq 1000000' [exceptions to stdout]
shows similar numbers with or without the added "exceptions to" clause.
Now that I'm sending this I wonder if the original comment about the
need for subtransaction around every loaded line still holds. Any
example of what would be not properly rolled back by just PG_TRY?

this method is unsafe .. exception handlers usually don't free memory - there is a risk of memory leaks and resource leaks

you can enforce the same performance with block subtransactions - when you use a subtransaction for 1000 rows, then the impact of subtransactions is minimal

when a block fails, then you can use row-level subtransactions - it works well when you expect almost correct data.

Two years ago I wrote an extension that did it - but I have not had time to finish it and push it upstream.

Regards
Pavel

Regards
Pavel
Happy hacking!
--
Alex
The block size is determined by the REPLAY_BUFFER_SIZE parameter.
I used the idea of a buffer for accumulating tuples in it.
If we find an error, the subtransaction will rollback and the buffer will be replayed containing tuples.
In the patch REPLAY_BUFFER_SIZE equals 3, but it can be changed to any other number (for example 1000).
For CIM_MULTI and CIM_MULTI_CONDITIONAL cases the buffer is not needed.
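Roughly, the control flow being described is the following (a pseudocode-level C sketch; cstate, econtext, values, nulls, replay_buffer and REPLAY_BUFFER_SIZE stand in for the patch's variables, and memory-context handling and the actual insertion of the buffered tuples are elided):

```c
/* one batch of the buffer-and-replay scheme described above (sketch) */
ResourceOwner oldowner = CurrentResourceOwner;
bool          failed = false;
int           nbuffered = 0;

BeginInternalSubTransaction(NULL);

PG_TRY();
{
    /* read up to REPLAY_BUFFER_SIZE rows, buffering the formed tuples */
    while (nbuffered < REPLAY_BUFFER_SIZE &&
           NextCopyFrom(cstate, econtext, values, nulls))
        replay_buffer[nbuffered++] =
            heap_form_tuple(RelationGetDescr(cstate->rel), values, nulls);
}
PG_CATCH();
{
    /* a bad row: clear the error and abort the failed subtransaction */
    FlushErrorState();
    RollbackAndReleaseCurrentSubTransaction();
    CurrentResourceOwner = oldowner;
    failed = true;
}
PG_END_TRY();

if (!failed)
{
    ReleaseCurrentSubTransaction();     /* commit the clean batch */
    CurrentResourceOwner = oldowner;
}

/*
 * Either way, the tuples collected in replay_buffer are inserted next;
 * after a failure they are replayed one by one so the bad row is skipped.
 */
```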
Tests:
CREATE TABLE check_ign_err (n int, m int, k int);
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
WARNING: COPY check_ign_err, line 3: "3 3"
WARNING: COPY check_ign_err, line 4, column n: "a"
WARNING: COPY check_ign_err, line 5, column m: "b"
WARNING: COPY check_ign_err, line 6, column n: ""
1 1 1
2 2 2 2
3 3
a 4 4
5 b b
7 7 7
\.
SELECT * FROM check_ign_err;
WARNING: COPY check_ign_err, line 3: "3 3"
WARNING: COPY check_ign_err, line 4, column n: "a"
WARNING: COPY check_ign_err, line 5, column m: "b"
WARNING: COPY check_ign_err, line 6, column n: ""
 n | m | k
---+---+---
1 | 1 | 1
7 | 7 | 7
(2 rows)
##################################################
-- BEFORE row trigger
CREATE TABLE trig_test(n int, m int);
CREATE FUNCTION fn_trig_before () RETURNS TRIGGER AS '
BEGIN
INSERT INTO trig_test VALUES(NEW.n, NEW.m);
RETURN NEW;
END;
' LANGUAGE plpgsql;
CREATE TRIGGER trig_before BEFORE INSERT ON check_ign_err
FOR EACH ROW EXECUTE PROCEDURE fn_trig_before();
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
WARNING: COPY check_ign_err, line 3: "3 3"
WARNING: COPY check_ign_err, line 4, column n: "a"
WARNING: COPY check_ign_err, line 5, column m: "b"
WARNING: COPY check_ign_err, line 6, column n: ""
1 1 1
2 2 2 2
3 3
a 4 4
5 b b
7 7 7
\.
SELECT * FROM check_ign_err;
 n | m | k
---+---+---
1 | 1 | 1
7 | 7 | 7
(2 rows)
##################################################
CREATE VIEW check_ign_err_view AS SELECT * FROM check_ign_err;
CREATE FUNCTION fn_trig_instead_of () RETURNS TRIGGER AS '
BEGIN
INSERT INTO check_ign_err VALUES(NEW.n, NEW.m, NEW.k);
RETURN NEW;
END;
' LANGUAGE plpgsql;
CREATE TRIGGER trig_instead_of INSTEAD OF INSERT ON check_ign_err_view
FOR EACH ROW EXECUTE PROCEDURE fn_trig_instead_of();
COPY check_ign_err_view FROM STDIN WITH IGNORE_ERRORS;
WARNING: COPY check_ign_err, line 3: "3 3"
WARNING: COPY check_ign_err, line 4, column n: "a"
WARNING: COPY check_ign_err, line 5, column m: "b"
WARNING: COPY check_ign_err, line 6, column n: ""
SELECT * FROM check_ign_err;
1 1 1
2 2 2 2
3 3
a 4 4
5 b b
7 7 7
\.
SELECT * FROM check_ign_err_view;
 n | m | k
---+---+---
1 | 1 | 1
7 | 7 | 7
(2 rows)
##################################################
##################################################
-- volatile function in WHERE clause
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS
WHERE n = floor(random()*(1-1+1))+1; /* find values equal 1 */
WARNING: COPY check_ign_err, line 3: "3 3"
WARNING: COPY check_ign_err, line 4, column n: "a"
WARNING: COPY check_ign_err, line 5, column m: "b"
WARNING: COPY check_ign_err, line 6, column n: ""
SELECT * FROM check_ign_err;
1 1 1
2 2 2 2
3 3
a 4 4
5 b b
7 7 7
\.
SELECT * FROM check_ign_err;
 n | m | k
---+---+---
1 | 1 | 1
(1 row)
##################################################
-- CIM_MULTI_CONDITIONAL case
-- INSERT triggers for partition tables
CREATE TABLE check_ign_err (n int, m int, k int) PARTITION BY RANGE (n);
CREATE TABLE check_ign_err_part1 PARTITION OF check_ign_err
FOR VALUES FROM (1) TO (4);
CREATE TABLE check_ign_err_part2 PARTITION OF check_ign_err
FOR VALUES FROM (4) TO (8);
CREATE FUNCTION fn_trig_before_part () RETURNS TRIGGER AS '
BEGIN
INSERT INTO trig_test VALUES(NEW.n, NEW.m);
RETURN NEW;
END;
' LANGUAGE plpgsql;
CREATE TRIGGER trig_before_part BEFORE INSERT ON check_ign_err
FOR EACH ROW EXECUTE PROCEDURE fn_trig_before_part();
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
WARNING: COPY check_ign_err, line 3: "3 3"
WARNING: COPY check_ign_err, line 4, column n: "a"
WARNING: COPY check_ign_err, line 5, column m: "b"
WARNING: COPY check_ign_err, line 6, column n: ""
SELECT * FROM check_ign_err;
1 1 1
2 2 2 2
3 3
a 4 4
5 b b
7 7 7
\.
 n | m | k
---+---+---
1 | 1 | 1
7 | 7 | 7
(2 rows)
On 2022-07-19 21:40, Damir Belyalov wrote:
> Hi!
>
> Improved my patch by adding block subtransactions.
> The block size is determined by the REPLAY_BUFFER_SIZE parameter.
> I used the idea of a buffer for accumulating tuples in it.
> If we read REPLAY_BUFFER_SIZE rows without errors, the subtransaction
> will be committed.
> If we find an error, the subtransaction will rollback and the buffer
> will be replayed containing tuples.

Thanks for working on this!

I tested 0002-COPY-IGNORE_ERRORS.patch and faced an unexpected behavior.
I loaded 10000 rows which contained 1 wrong row.
I expected I could see 9999 rows after COPY, but just saw 999 rows.

Since the number of loaded rows also changed when I changed
MAX_BUFFERED_TUPLES from 1000 to other values, I imagine
MAX_BUFFERED_TUPLES might be influencing this behavior.

```sh
$ cat /tmp/test10000.dat
1 aaa
2 aaa
3 aaa
4 aaa
5 aaa
6 aaa
7 aaa
8 aaa
9 aaa
10 aaa
11 aaa
...
9994 aaa
9995 aaa
9996 aaa
9997 aaa
9998 aaa
9999 aaa
xxx aaa
```

```SQL
=# CREATE TABLE test (id int, data text);
=# COPY test FROM '/tmp/test10000.dat' WITH (IGNORE_ERRORS);
WARNING: COPY test, line 10000, column i: "xxx"
COPY 9999
=# SELECT COUNT(*) FROM test;
 count
-------
   999
(1 row)
```

BTW I may be overlooking it, but have you submitted this proposal to the
next CommitFest?
https://commitfest.postgresql.org/39/

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
> I expected I could see 9999 rows after COPY, but just saw 999 rows.
On 2022-08-15 22:23, Damir Belyalov wrote: >> I expected I could see 9999 rows after COPY, but just saw 999 rows. > Also I implemented your case and it worked correctly. Thanks for the new patch! Here are some comments on it. > + if (safecstate->saved_tuples < REPLAY_BUFFER_SIZE) > + { > + valid_row = NextCopyFrom(cstate, econtext, values, > nulls); > + if (valid_row) > + { > + /* Fill replay_buffer in oldcontext*/ > + MemoryContextSwitchTo(safecstate->oldcontext); > + > safecstate->replay_buffer[safecstate->saved_tuples++] = > heap_form_tuple(RelationGetDescr(cstate->rel), values, nulls); > > + /* Buffer was filled, commit subtransaction and prepare > to replay */ > + ReleaseCurrentSubTransaction(); What is actually being committed by this ReleaseCurrentSubTransaction()? It seems to me that just safecstate->replay_buffer is fulfilled before this commit. As a test, I rewrote this ReleaseCurrentSubTransaction() to RollbackAndReleaseCurrentSubTransaction() and COPYed over 1000 rows of data, but same data were loaded. > +#define REPLAY_BUFFER_SIZE 1000 I feel it might be better to have it as a parameter rather than fixed at 1000. > +/* > + * Analog of NextCopyFrom() but ignore rows with errors while copying. > + */ > +static bool > +safeNextCopyFrom(CopyFromState cstate, ExprContext *econtext, Datum > *values, bool *nulls) NextCopyFrom() is in copyfromparse.c while safeNextCopyFrom() is in copyfrom.c. Since safeNextCopyFrom() is analog of NextCopyFrom() as commented, would it be natural to put them in the same file? > 188 + safecstate->errors++; > 189 + if (safecstate->errors <= 100) > 190 + ereport(WARNING, > 191 + (errcode(errdata->sqlerrcode), > 192 + errmsg("%s", errdata->context))); It would be better to state in the manual that errors exceeding 100 are not displayed. Or, it might be acceptable to output all errors that exceed 100. > +typedef struct SafeCopyFromState > +{ > +#define REPLAY_BUFFER_SIZE 1000 > + HeapTuple replay_buffer[REPLAY_BUFFER_SIZE]; /* accumulates > tuples for replaying it after an error */ > + int saved_tuples; /* # of tuples in > replay_buffer */ > + int replayed_tuples; /* # of tuples was > replayed from buffer */ > + int errors; /* total # of errors */ > + bool replay_is_active; > + bool begin_subtransaction; > + bool processed_remaining_tuples; /* for case of > replaying last tuples */ > + bool skip_row; It would be helpful to add comments about skip_row, etc. ``` $ git apply ../patch/0003-COPY_IGNORE_ERRORS.patch ../patch/0003-COPY_IGNORE_ERRORS.patch:86: indent with spaces. Datum *values, bool *nulls); warning: 1 line adds whitespace errors. ``` There was a warning when applying the patch. ``` =# copy test from '/tmp/10000.data' with (ignore_errors); WARNING: FIND 0 ERRORS COPY 1003 ``` When there were no errors, this WARNING seems not necessary. -- Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
> + /* Buffer was filled, commit subtransaction and prepare to replay */
> + ReleaseCurrentSubTransaction();
What is actually being committed by this ReleaseCurrentSubTransaction()?
It seems to me that just safecstate->replay_buffer is fulfilled before
this commit.
I feel it might be better to have it as a parameter rather than fixed at
1000.
NextCopyFrom() is in copyfromparse.c while safeNextCopyFrom() is in
copyfrom.c.
Since safeNextCopyFrom() is analog of NextCopyFrom() as commented, would
it be natural to put them in the same file?
> 188 + safecstate->errors++;
> 189 + if (safecstate->errors <= 100)
> 190 + ereport(WARNING,
> 191 + (errcode(errdata->sqlerrcode),
> 192 + errmsg("%s", errdata->context)));
It would be better to state in the manual that errors exceeding 100 are
not displayed.
Or, it might be acceptable to output all errors that exceed 100.
> +typedef struct SafeCopyFromState
> +{
> +#define REPLAY_BUFFER_SIZE 1000
> + HeapTuple replay_buffer[REPLAY_BUFFER_SIZE]; /* accumulates
> tuples for replaying it after an error */
> + int saved_tuples; /* # of tuples in
> replay_buffer */
> + int replayed_tuples; /* # of tuples was
> replayed from buffer */
> + int errors; /* total # of errors */
> + bool replay_is_active;
> + bool begin_subtransaction;
> + bool processed_remaining_tuples; /* for case of
> replaying last tuples */
> + bool skip_row;
It would be helpful to add comments about skip_row, etc.
WARNING: FIND 0 ERRORS
When there were no errors, this WARNING seems not necessary.
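The straightforward remedy, presumably, is to emit the summary only when something was actually skipped, along these lines (sketch; the counter name follows the patch excerpts quoted above):

```c
/* only report a summary if any rows were skipped */
if (safecstate->errors > 0)
    ereport(WARNING,
            (errmsg("%d rows were skipped due to errors",
                    safecstate->errors)));
```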
On 2022-08-25 01:54, Damir Belyalov wrote: >>> + /* Buffer was filled, commit subtransaction and >> prepare to replay */ >>> + ReleaseCurrentSubTransaction(); >> What is actually being committed by this >> ReleaseCurrentSubTransaction()? >> It seems to me that just safecstate->replay_buffer is fulfilled >> before >> this commit. > > All tuples are collected in replay_buffer, which is created in > ''oldcontext'' of CopyFrom() (not context of a subtransaction). That's > why it didn't clean up when you used > RollbackAndReleaseCurrentSubTransaction(). > Subtransactions are needed for better safety. There is no error when > copying from a file to the replay_buffer, but an error may occur at > the next stage - when adding a tuple to the table. Also there may be > other errors that are not obvious at first glance. Thanks for the explanation and updating patch. I now understand that the data being COPYed are not the target of COMMIT. Although in past discussions[1] it seems the data to be COPYed were also subject to COMMIT, but I understand this patch has adopted another design. 350 + /* Buffer was filled, commit subtransaction and prepare to replay */ 351 + ReleaseCurrentSubTransaction(); 352 + CurrentResourceOwner = safecstate->oldowner; 353 + 354 + safecstate->replay_is_active = true; BTW in v4 patch, data are loaded into the buffer one by one, and when the buffer fills up, the data in the buffer are 'replayed' also one by one, right? Wouldn't this have more overhead than a normal COPY? As a test, I COPYed slightly larger data with and without ignore_errors option. There might be other reasons, but I found a performance difference. ``` =# copy test from '/tmp/10000000.data' ; COPY 10000000 Time: 6001.325 ms (00:06.001) =# copy test from '/tmp/10000000.data' with (ignore_errors); NOTICE: FIND 0 ERRORS COPY 10000000 Time: 7711.555 ms (00:07.712) ``` >> I feel it might be better to have it as a parameter rather than >> fixed at >> 1000. > > Yes, I thought about it too. So I created another version with the GUC > parameter - replay_buffer_size. Attached it. But I think users won't > need to change replay_buffer_size. > Also replay_buffer does the same thing as MultiInsert buffer does and > MultiInsert buffer defined by const = 1000. Yeah, when the data being COPYed are not the target of COMMIT, I also think users won't neet to change it. > >> NextCopyFrom() is in copyfromparse.c while safeNextCopyFrom() is in >> copyfrom.c. >> Since safeNextCopyFrom() is analog of NextCopyFrom() as commented, >> would >> it be natural to put them in the same file? > > Sure, corrected it. Thanks. > >>> 188 + safecstate->errors++; >>> 189 + if (safecstate->errors <= 100) >>> 190 + ereport(WARNING, >>> 191 + (errcode(errdata->sqlerrcode), >>> 192 + errmsg("%s", errdata->context))); >> >> It would be better to state in the manual that errors exceeding 100 >> are >> not displayed. >> Or, it might be acceptable to output all errors that exceed 100. > > It'll be too complicated to create a new parameter just for showing > the given number of errors. I think 100 is an optimal size. Yeah, I may have misled you, but I also don't think this needs a new parameter. I just thought 'either' of the following would be better: - Put in the documentation that the warnings will not be output for more than 101 cases. - Output all the warnings. >> It would be helpful to add comments about skip_row, etc. > > Corrected it. Thanks. > >> WARNING: FIND 0 ERRORS >> When there were no errors, this WARNING seems not necessary. 
> > Corrected it.

Thanks.

I applied v4 patch and when canceled the COPY, there was a case I found
myself left in a transaction.
Should this situation be prevented from occurring?

```
=# copy test from '/tmp/10000000.data' with (ignore_errors );
^CCancel request sent
ERROR: canceling statement due to user request

=# truncate test;
ERROR: current transaction is aborted, commands ignored until end of
transaction block
```

[1] https://www.postgresql.org/message-id/1197677930.1536.18.camel%40dell.linuxdev.us.dell.com

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
BTW in v4 patch, data are loaded into the buffer one by one, and when
the buffer fills up, the data in the buffer are 'replayed' also one by
one, right?
Wouldn't this have more overhead than a normal COPY?
As a test, I COPYed slightly larger data with and without ignore_errors
option.
There might be other reasons, but I found a performance difference.
Time: 15538,579 ms (00:15,539)
COPY 10000000
Time: 21289,121 ms (00:21,289)
Time: 15318,922 ms (00:15,319)
COPY 10000000
Time: 19868,175 ms (00:19,868)
- Put in the documentation that the warnings will not be output for more
than 101 cases.
I applied v4 patch and when canceled the COPY, there was a case I found
myself left in a transaction.
Should this situation be prevented from occurring?
```
=# copy test from '/tmp/10000000.data' with (ignore_errors );
^CCancel request sent
ERROR: canceling statement due to user request
=# truncate test;
ERROR: current transaction is aborted, commands ignored until end of
transaction block
```
On 2022-09-21 21:11, Damir Belyalov wrote:

Thanks for updating the patch.

> In the previous patch there was an error when processing constraints.
> The patch was fixed, but the code grew up and became more complicated
> (0005-COPY_IGNORE_ERRORS). I also simplified the logic of
> safeNextCopyFrom().
> You asked why we need subtransactions, so the answer is in this patch.
> When processing a row that does not satisfy constraints or INSTEAD OF
> triggers, it is necessary to rollback the subtransaction and return
> the table to its original state.
> Cause of complexity, I had to abandon the constraints, triggers
> processing and handle only errors that occur when reading the file.
> Attaching simplified patch (0006-COPY_IGNORE_ERRORS).

Do you mean you stop dealing with errors concerned with constraints and
triggers and we should review 0006-COPY_IGNORE_ERRORS?

> Tried to implement your error and could not. The result was the same
> as COPY FROM implements.

Hmm, I applied v6 patch and when canceled COPY by sending SIGINT(ctrl +
C), I faced the same situation as below.
I tested it on CentOS 8.4.

=# COPY test FROM '/home/tori/pgsql/master/10000000.data' WITH (IGNORE_ERRORS);
^CCancel request sent
ERROR: canceling statement due to user request
CONTEXT: COPY test, line 628000: "628000 xxx"

=# SELECT 1;
ERROR: current transaction is aborted, commands ignored until end of
transaction block

=# ABORT;
FATAL: UserAbortTransactionBlock: unexpected state STARTED
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

I did the same procedure on COPY FROM without IGNORE_ERRORS and didn't
face this situation.

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
Do you mean you stop dealing with errors concerned with constraints and
triggers and we should review 0006-COPY_IGNORE_ERRORS?
Hmm, I applied v6 patch and when canceled COPY by sending SIGINT(ctrl +
C), I faced the same situation as below.
I tested it on CentOS 8.4.

Thank you for pointing out this error. It really needs to be taken into
account. In the previous 0006 patch, there were 2 stages in one
subtransaction - filling the buffer and 'replay mode' (reading from the
buffer). Since only NextCopyFrom() is included in PG_TRY(), and the
ERRCODE_QUERY_CANCELED error can occur anywhere, it is impossible to
catch it and roll back the subtransaction.
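Purely as an illustration of the constraint being described (not code from any of the posted patches): whatever block traps the per-row error has to re-throw interrupts such as query cancel instead of treating them as bad data, e.g.

```c
PG_CATCH();
{
    ErrorData  *edata;

    MemoryContextSwitchTo(oldcontext);     /* leave ErrorContext */
    edata = CopyErrorData();

    /* cancel/termination must not be swallowed as a skippable row */
    if (edata->sqlerrcode == ERRCODE_QUERY_CANCELED)
    {
        FreeErrorData(edata);
        PG_RE_THROW();
    }

    /* otherwise count the row as skipped and keep going */
    FlushErrorState();
    FreeErrorData(edata);
    skipped_rows++;                        /* hypothetical counter */
}
PG_END_TRY();
```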
Updated the patch:
- Optimized and simplified logic of IGNORE_ERRORS
- Changed variable names to more understandable ones
- Added an analogue of MAX_BUFFERED_BYTES for the safe buffer

Regards,
Damir Belyalov
Postgres Professional
110: static bool SafeCopying(CopyFromState cstate, ExprContext *econtext,
111: TupleTableSlot *myslot);
---
636: bool
637: SafeCopying(CopyFromState cstate, ExprContext *econtext, TupleTableSlot *myslot)
Why is there no static keyword in the definition of the SafeCopying() function, but it is in the function declaration.
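For reference, the prior static declaration already gives the function internal linkage in C, so the definition is not wrong, merely inconsistent; the usual style is to repeat the storage class (generic example, not the patch code):

```c
/* declaration near the top of the file */
static bool SafeCopying(CopyFromState cstate, ExprContext *econtext,
                        TupleTableSlot *myslot);

/* definition, repeating the storage class for clarity */
static bool
SafeCopying(CopyFromState cstate, ExprContext *econtext, TupleTableSlot *myslot)
{
    /* ... */
}
```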
675: MemoryContext cxt = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
676:
677: valid_row = NextCopyFrom(cstate, econtext, myslot->tts_values, myslot->tts_isnull);
678: tuple_is_valid = valid_row;
679:
680: if (valid_row)
681: sfcstate->safeBufferBytes += cstate->line_buf.len;
682:
683: CurrentMemoryContext = cxt;
Why are you using a direct assignment to CurrentMemoryContext instead of using the MemoryContextSwitchTo function in the SafeCopying() routine?
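The conventional idiom for a temporary switch in backend code is to save and restore the context through MemoryContextSwitchTo(), e.g. (sketch mirroring the lines quoted above):

```c
MemoryContext oldcxt;

oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);

/* allocations made here go into the per-tuple context */
valid_row = NextCopyFrom(cstate, econtext, myslot->tts_values, myslot->tts_isnull);

MemoryContextSwitchTo(oldcxt);
```

Assigning CurrentMemoryContext directly happens to work mechanically, but it bypasses the usual pattern and is easy to get wrong if the code in between switches contexts itself.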
1160: /* Standard copying with option "safe copying" enabled by IGNORE_ERRORS. */
1161: if (!SafeCopying(cstate, econtext, myslot))
1162: break;
I checked with GDB that the CurrentMemoryContext changes when SafeCopying returns. And the target context may be different each time you do a COPY in psql.
1879: cstate->sfcstate->safe_cxt = AllocSetContextCreate(oldcontext,
1880: "Safe_context",
1881: ALLOCSET_DEFAULT_SIZES);
When you initialize the cstate->sfcstate structure, you create a cstate->sfcstate->safe_cxt memory context that inherits from oldcontext.
Was it intended to use cstate->copycontext as the parent context here?
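If the intent was to tie the buffer's lifetime to the COPY itself, the creation would presumably look like this instead (sketch; whether copycontext is the right parent is exactly the question raised above):

```c
cstate->sfcstate->safe_cxt = AllocSetContextCreate(cstate->copycontext,
                                                   "Safe_context",
                                                   ALLOCSET_DEFAULT_SIZES);
```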
Updated the patch:
- Optimized and simplified logic of IGNORE_ERRORS
- Changed variable names to more understandable ones
- Added an analogue of MAX_BUFFERED_BYTES for the safe buffer

Regards,
Damir Belyalov
Postgres Professional
Why is there no static keyword in the definition of the SafeCopying() function, but it is in the function declaration.
675: MemoryContext cxt = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
676:
677: valid_row = NextCopyFrom(cstate, econtext, myslot->tts_values, myslot->tts_isnull);
678: tuple_is_valid = valid_row;
679:
680: if (valid_row)
681: sfcstate->safeBufferBytes += cstate->line_buf.len;
682:
683: CurrentMemoryContext = cxt;
Why are you using a direct assignment to CurrentMemoryContext instead of using the MemoryContextSwitchTo function in the SafeCopying() routine?
When you initialize the cstate->sfcstate structure, you create a cstate->sfcstate->safe_cxt memory context that inherits from oldcontext.
Was it intended to use cstate->copycontext as the parent context here?
Damir Belyalov
Postgres Professional
Hi,

On 2023-02-03 13:27:24 +0300, Damir Belyalov wrote:
> @@ -625,6 +628,173 @@ CopyMultiInsertInfoStore(CopyMultiInsertInfo *miinfo, ResultRelInfo *rri,
> miinfo->bufferedBytes += tuplen;
> }
>
> +/*
> + * Safely reads source data, converts to a tuple and fills tuple buffer.
> + * Skips some data in the case of failed conversion if data source for
> + * a next tuple can be surely read without a danger.
> + */
> +static bool
> +SafeCopying(CopyFromState cstate, ExprContext *econtext, TupleTableSlot *myslot)

> + BeginInternalSubTransaction(NULL);
> + CurrentResourceOwner = sfcstate->oldowner;

I don't think this is the right approach. Creating a subtransaction for
each row will cause substantial performance issues.

We now can call data type input functions without throwing errors, see
InputFunctionCallSafe(). Use that to avoid throwing an error instead of
catching it.

Greetings,

Andres Freund
I don't think this is the right approach. Creating a subtransaction for
each row will cause substantial performance issues.
sfcstate->safeBufferBytes < MAX_SAFE_BUFFER_BYTES)
We now can call data type input functions without throwing errors, see
InputFunctionCallSafe(). Use that to avoid throwing an error instead of
catching it.
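For context, the soft-error API referred to here (InputFunctionCallSafe(), together with an ErrorSaveContext from nodes/miscnodes.h) returns false instead of throwing when the input function reports a soft error; a minimal usage sketch, with illustrative variable names, looks like:

```c
ErrorSaveContext escontext = {T_ErrorSaveContext};
Datum           value;

escontext.details_wanted = true;    /* keep the ErrorData for reporting */

if (!InputFunctionCallSafe(&in_functions[m], string,
                           typioparams[m], att->atttypmod,
                           (Node *) &escontext, &value))
{
    /* soft error: details are in escontext.error_data; skip this row */
    skipped_rows++;
}
else
    values[m] = value;
```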
Damir Belyalov
Postgres Professional
Damir Belyalov <dam.bel07@gmail.com> writes:
>> I don't think this is the right approach. Creating a subtransaction for
>> each row will cause substantial performance issues.

> Subtransactions aren't created for each row. The block of rows in one
> subtransaction is 1000 (SAFE_BUFFER_SIZE) and can be changed.

I think that at this point, any patch that involves adding subtransactions
to COPY is dead on arrival; whether it's batched or not is irrelevant.
(It's not like batching has no downsides.)

> InputFunctionCallSafe() is good for detecting errors from input-functions
> but there are such errors from NextCopyFrom () that can not be detected
> with InputFunctionCallSafe(), e.g. "wrong number of columns in row".

If you want to deal with those, then there's more work to be done to make
those bits non-error-throwing. But there's a very finite amount of code
involved and no obvious reason why it couldn't be done. The major problem
here has always been the indefinite amount of code implicated by calling
datatype input functions, and we have now created a plausible answer to
that problem.

			regards, tom lane
Hi,

On February 5, 2023 9:12:17 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Damir Belyalov <dam.bel07@gmail.com> writes:
>>> I don't think this is the right approach. Creating a subtransaction for
>>> each row will cause substantial performance issues.
>
>> Subtransactions aren't created for each row. The block of rows in one
>> subtransaction is 1000 (SAFE_BUFFER_SIZE) and can be changed.
>
>I think that at this point, any patch that involves adding subtransactions
>to COPY is dead on arrival; whether it's batched or not is irrelevant.
>(It's not like batching has no downsides.)

Indeed.

>> InputFunctionCallSafe() is good for detecting errors from input-functions
>> but there are such errors from NextCopyFrom () that can not be detected
>> with InputFunctionCallSafe(), e.g. "wrong number of columns in row".
>
>If you want to deal with those, then there's more work to be done to make
>those bits non-error-throwing. But there's a very finite amount of code
>involved and no obvious reason why it couldn't be done. The major problem
>here has always been the indefinite amount of code implicated by calling
>datatype input functions, and we have now created a plausible answer to
>that problem.

I'm not even sure it makes sense to avoid that kind of error. An invalid
column count or such is something quite different than failing some data
type input routine, or failing a constraint.

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes:
> On February 5, 2023 9:12:17 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Damir Belyalov <dam.bel07@gmail.com> writes:
>>> InputFunctionCallSafe() is good for detecting errors from input-functions
>>> but there are such errors from NextCopyFrom () that can not be detected
>>> with InputFunctionCallSafe(), e.g. "wrong number of columns in row".

>> If you want to deal with those, then there's more work to be done to make
>> those bits non-error-throwing. But there's a very finite amount of code
>> involved and no obvious reason why it couldn't be done.

> I'm not even sure it makes sense to avoid that kind of error. And
> invalid column count or such is something quite different than failing
> some data type input routine, or falling a constraint.

I think it could be reasonable to put COPY's overall-line-format
requirements on the same level as datatype input format violations.
I agree that trying to trap every kind of error is a bad idea,
for largely the same reason that the soft-input-errors patches
only trap certain kinds of errors: it's too hard to tell whether
an error is an "internal" error that it's scary to continue past.

			regards, tom lane
On 2023-02-06 15:00, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: >> On February 5, 2023 9:12:17 PM PST, Tom Lane <tgl@sss.pgh.pa.us> >> wrote: >>> Damir Belyalov <dam.bel07@gmail.com> writes: >>>> InputFunctionCallSafe() is good for detecting errors from >>>> input-functions >>>> but there are such errors from NextCopyFrom () that can not be >>>> detected >>>> with InputFunctionCallSafe(), e.g. "wrong number of columns in >>>> row''. > >>> If you want to deal with those, then there's more work to be done to >>> make >>> those bits non-error-throwing. But there's a very finite amount of >>> code >>> involved and no obvious reason why it couldn't be done. > >> I'm not even sure it makes sense to avoid that kind of error. And >> invalid column count or such is something quite different than failing >> some data type input routine, or falling a constraint. > > I think it could be reasonable to put COPY's overall-line-format > requirements on the same level as datatype input format violations. > I agree that trying to trap every kind of error is a bad idea, > for largely the same reason that the soft-input-errors patches > only trap certain kinds of errors: it's too hard to tell whether > an error is an "internal" error that it's scary to continue past. Is it a bad idea to limit the scope of allowing errors to 'soft' errors in InputFunctionCallSafe()? I think it could be still useful for some usecases. diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql +-- tests for IGNORE_DATATYPE_ERRORS option +CREATE TABLE check_ign_err (n int, m int[], k int); +COPY check_ign_err FROM STDIN WITH IGNORE_DATATYPE_ERRORS; +1 {1} 1 +a {2} 2 +3 {3} 3333333333 +4 {a, 4} 4 + +5 {5} 5 +\. +SELECT * FROM check_ign_err; diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out index 090ef6c7a8..08e8056fc1 100644 +-- tests for IGNORE_DATATYPE_ERRORS option +CREATE TABLE check_ign_err (n int, m int[], k int); +COPY check_ign_err FROM STDIN WITH IGNORE_DATATYPE_ERRORS; +WARNING: invalid input syntax for type integer: "a" +WARNING: value "3333333333" is out of range for type integer +WARNING: invalid input syntax for type integer: "a" +WARNING: invalid input syntax for type integer: "" +SELECT * FROM check_ign_err; + n | m | k +---+-----+--- + 1 | {1} | 1 + 5 | {5} | 5 +(2 rows) -- Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
> On 28 Feb 2023, at 15:28, Damir Belyalov <dam.bel07@gmail.com> wrote:

> Tested patch on all cases: CIM_SINGLE, CIM_MULTI, CIM_MULTI_CONDITION. As expected it works.
> Also added a description to copy.sgml and made a review on patch.
>
> I added 'ignored_errors' integer parameter that should be output after the option is finished.
> All errors were added to the system logfile with full detailed context. Maybe it's better to log only error message.

FWIW, Greenplum has a similar construct (but which also logs the errors in the
db) where data type errors are skipped as long as the number of errors don't
exceed a reject limit. If the reject limit is reached then the COPY fails:

    LOG ERRORS [ SEGMENT REJECT LIMIT <count> [ ROWS | PERCENT ]]

IIRC the gist of this was to catch when the user copies the wrong input data or
plain has a broken file. Rather than finding out after copying n rows which
are likely to be garbage the process can be restarted.

This version of the patch has a compiler error in the error message:

copyfrom.c: In function ‘CopyFrom’:
copyfrom.c:1008:29: error: format ‘%ld’ expects argument of type ‘long int’,
but argument 2 has type ‘uint64’ {aka ‘long long unsigned int’} [-Werror=format=]
 1008 | ereport(WARNING, errmsg("Errors: %ld", cstate->ignored_errors));
      |                                  ^~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~
      |                                  |
      |                                  uint64 {aka long long unsigned int}

On that note though, it seems to me that this error message leaves a bit to be
desired with regards to the level of detail.

--
Daniel Gustafsson
On 2023-03-06 23:03, Daniel Gustafsson wrote: >> On 28 Feb 2023, at 15:28, Damir Belyalov <dam.bel07@gmail.com> wrote: > >> Tested patch on all cases: CIM_SINGLE, CIM_MULTI, CIM_MULTI_CONDITION. >> As expected it works. >> Also added a description to copy.sgml and made a review on patch. Thanks for your tests and improvements! >> I added 'ignored_errors' integer parameter that should be output after >> the option is finished. >> All errors were added to the system logfile with full detailed >> context. Maybe it's better to log only error message. Certainly. > FWIW, Greenplum has a similar construct (but which also logs the errors > in the > db) where data type errors are skipped as long as the number of errors > don't > exceed a reject limit. If the reject limit is reached then the COPY > fails: > > LOG ERRORS [ SEGMENT REJECT LIMIT <count> [ ROWS | PERCENT ]] > > IIRC the gist of this was to catch then the user copies the wrong input > data or > plain has a broken file. Rather than finding out after copying n rows > which > are likely to be garbage the process can be restarted. > > This version of the patch has a compiler error in the error message: > > copyfrom.c: In function ‘CopyFrom’: > copyfrom.c:1008:29: error: format ‘%ld’ expects argument of type ‘long > int’, but argument 2 has type ‘uint64’ {aka ‘long long unsigned int’} > [-Werror=format=] > 1008 | ereport(WARNING, errmsg("Errors: %ld", cstate->ignored_errors)); > | ^~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~ > | | > | uint64 {aka long > long unsigned int} > > > On that note though, it seems to me that this error message leaves a > bit to be > desired with regards to the level of detail. +1. I felt just logging "Error: %ld" would make people wonder the meaning of the %ld. Logging something like ""Error: %ld data type errors were found" might be clearer. -- Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
FWIW, Greenplum has a similar construct (but which also logs the errors
in the
db) where data type errors are skipped as long as the number of errors
don't
exceed a reject limit. If the reject limit is reached then the COPY
fails:
>
> LOG ERRORS [ SEGMENT REJECT LIMIT <count> [ ROWS | PERCENT ]]
>
IIRC the gist of this was to catch then the user copies the wrong input
data or
plain has a broken file. Rather than finding out after copying n rows
which
are likely to be garbage the process can be restarted.
This version of the patch has a compiler error in the error message:
I felt just logging "Error: %ld" would make people wonder the meaning of
the %ld. Logging something like ""Error: %ld data type errors were
found" might be clearer.
> On 7 Mar 2023, at 09:35, Damir Belyalov <dam.bel07@gmail.com> wrote:

>> I felt just logging "Error: %ld" would make people wonder the meaning of
>> the %ld. Logging something like "Error: %ld data type errors were
>> found" might be clearer.
>
> Thanks. For more clearance change the message to: "Errors were found: %".

I'm not convinced that this adds enough clarity to assist the user. We also
shouldn't use "error" in a WARNING log since the user has explicitly asked to
skip rows on error, so it's not an error per se.

How about something like:

    ereport(WARNING,
            (errmsg("%ld rows were skipped due to data type incompatibility",
                    cstate->ignored_errors),
             errhint("Skipped rows can be inspected in the database log for reprocessing.")));

--
Daniel Gustafsson
On 2023-03-07 18:09, Daniel Gustafsson wrote:
>> On 7 Mar 2023, at 09:35, Damir Belyalov <dam.bel07@gmail.com> wrote:
>
>> I felt just logging "Error: %ld" would make people wonder the meaning of
>> the %ld. Logging something like "Error: %ld data type errors were
>> found" might be clearer.
>>
>> Thanks. For more clearance change the message to: "Errors were found: %".
>
> I'm not convinced that this adds enough clarity to assist the user. We also
> shouldn't use "error" in a WARNING log since the user has explicitly asked to
> skip rows on error, so it's not an error per se.

+1

> How about something like:
>
>     ereport(WARNING,
>             (errmsg("%ld rows were skipped due to data type incompatibility",
>                     cstate->ignored_errors),
>              errhint("Skipped rows can be inspected in the database log for reprocessing.")));

Since skipped rows cannot be inspected in the log when
log_error_verbosity is set to terse,
it might be better without this errhint.

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
On 2023-03-17 21:23, torikoshia wrote: > On 2023-03-07 18:09, Daniel Gustafsson wrote: >>> On 7 Mar 2023, at 09:35, Damir Belyalov <dam.bel07@gmail.com> wrote: >> >>> I felt just logging "Error: %ld" would make people wonder the meaning >>> of >>> the %ld. Logging something like ""Error: %ld data type errors were >>> found" might be clearer. >>> >>> Thanks. For more clearance change the message to: "Errors were found: >>> %". >> >> I'm not convinced that this adds enough clarity to assist the user. >> We also >> shouldn't use "error" in a WARNING log since the user has explicitly >> asked to >> skip rows on error, so it's not an error per se. > +1 > >> How about something like: >> >> ereport(WARNING, >> (errmsg("%ld rows were skipped due to data type >> incompatibility", cstate->ignored_errors), >> errhint("Skipped rows can be inspected in the database log >> for reprocessing."))); > Since skipped rows cannot be inspected in the log when > log_error_verbosity is set to terse, > it might be better without this errhint. Removed errhint. Modified some codes since v3 couldn't be applied HEAD anymore. Also modified v3 patch as below: > 65 + if (cstate->opts.ignore_datatype_errors) > 66 + cstate->ignored_errors = 0; > 67 + It seems not necessary since cstate is initialized by palloc0() in BeginCopyFrom(). > 134 + ereport(LOG, > 135 + errmsg("%s", > cstate->escontext.error_data->message)); > 136 + > 137 + return true; Since LOG means 'Reports information of interest to administrators' according to the manual[1], datatype error should not be logged as LOG. I put it back in WARNING. [1] https://www.postgresql.org/docs/current/runtime-config-logging.html#RUNTIME-CONFIG-SEVERITY-LEVELS -- Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
Hi, Tom, see below - I wonder if should provide one more piece of infrastructure around the saved error stuff... Have you measured whether this has negative performance effects when *NOT* using the new option? As-is this does not work with FORMAT BINARY - and converting the binary input functions to support soft errors won't happen for 16. So I think you need to raise an error if BINARY and IGNORE_DATATYPE_ERRORS are specified. On 2023-03-22 22:34:20 +0900, torikoshia wrote: > @@ -985,9 +986,28 @@ CopyFrom(CopyFromState cstate) > > ExecClearTuple(myslot); > > + if (cstate->opts.ignore_datatype_errors) > + { > + escontext.details_wanted = true; > + cstate->escontext = escontext; > + } I think it might be worth pulling this out of the loop. That does mean you'd have to reset escontext.error_occurred after an error, but that doesn't seem too bad, you need to do other cleanup anyway. > @@ -956,10 +957,20 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, > values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]); > } > else > - values[m] = InputFunctionCall(&in_functions[m], > - string, > - typioparams[m], > - att->atttypmod); > + /* If IGNORE_DATATYPE_ERRORS is enabled skip rows with datatype errors */ > + if (!InputFunctionCallSafe(&in_functions[m], > + string, > + typioparams[m], > + att->atttypmod, > + (Node *) &cstate->escontext, > + &values[m])) > + { > + cstate->ignored_errors++; > + > + ereport(WARNING, > + errmsg("%s", cstate->escontext.error_data->message)); That isn't right - you loose all the details of the message. As is you'd also leak the error context. I think the best bet for now is to do something like /* adjust elevel so we don't jump out */ cstate->escontext.error_data->elevel = WARNING; /* despite the name, this won't raise an error if elevel < ERROR */ ThrowErrorData(cstate->escontext.error_data); I wonder if we ought to provide a wrapper for this? It could e.g. know to mention the original elevel and such? I don't think NextCopyFrom() is the right place to emit this warning - it e.g. is also called from file_fdw.c, which might want to do something else with the error. From a layering POV it seems cleaner to do this in CopyFrom(). You already have a check for escontext.error_occurred there anyway. > @@ -3378,6 +3378,10 @@ copy_opt_item: > { > $$ = makeDefElem("freeze", (Node *) makeBoolean(true), @1); > } > + | IGNORE_DATATYPE_ERRORS > + { > + $$ = makeDefElem("ignore_datatype_errors", (Node *)makeBoolean(true), @1); > + } > | DELIMITER opt_as Sconst > { > $$ = makeDefElem("delimiter", (Node *) makeString($3), @1); I think we shouldn't add a new keyword for this, but only support this via /* new COPY option syntax */ copy_generic_opt_list: copy_generic_opt_elem Further increasing the size of the grammar with random keywords when we have more generic ways to represent them seems unnecessary. > +-- tests for IGNORE_DATATYPE_ERRORS option > +CREATE TABLE check_ign_err (n int, m int[], k int); > +COPY check_ign_err FROM STDIN WITH IGNORE_DATATYPE_ERRORS; > +1 {1} 1 > +a {2} 2 > +3 {3} 3333333333 > +4 {a, 4} 4 > + > +5 {5} 5 > +\. > +SELECT * FROM check_ign_err; > + I suggest adding a few more tests: - COPY with a datatype error that can't be handled as a soft error - test documenting that COPY FORMAT BINARY is incompatible with IGNORE_DATATYPE_ERRORS - a soft error showing the error context - although that will require some care to avoid the function name + line in the output Greetings, Andres Freund
On 2023-03-23 02:50, Andres Freund wrote: > Hi, > > Tom, see below - I wonder if should provide one more piece of > infrastructure > around the saved error stuff... > > > Have you measured whether this has negative performance effects when > *NOT* > using the new option? > > > As-is this does not work with FORMAT BINARY - and converting the binary > input > functions to support soft errors won't happen for 16. So I think you > need to > raise an error if BINARY and IGNORE_DATATYPE_ERRORS are specified. > > > On 2023-03-22 22:34:20 +0900, torikoshia wrote: >> @@ -985,9 +986,28 @@ CopyFrom(CopyFromState cstate) >> >> ExecClearTuple(myslot); >> >> + if (cstate->opts.ignore_datatype_errors) >> + { >> + escontext.details_wanted = true; >> + cstate->escontext = escontext; >> + } > > I think it might be worth pulling this out of the loop. That does mean > you'd > have to reset escontext.error_occurred after an error, but that doesn't > seem > too bad, you need to do other cleanup anyway. > > >> @@ -956,10 +957,20 @@ NextCopyFrom(CopyFromState cstate, ExprContext >> *econtext, >> values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]); >> } >> else >> - values[m] = InputFunctionCall(&in_functions[m], >> - string, >> - typioparams[m], >> - att->atttypmod); >> + /* If IGNORE_DATATYPE_ERRORS is enabled skip rows with datatype >> errors */ >> + if (!InputFunctionCallSafe(&in_functions[m], >> + string, >> + typioparams[m], >> + att->atttypmod, >> + (Node *) &cstate->escontext, >> + &values[m])) >> + { >> + cstate->ignored_errors++; >> + >> + ereport(WARNING, >> + errmsg("%s", cstate->escontext.error_data->message)); > > That isn't right - you loose all the details of the message. As is > you'd also > leak the error context. > > I think the best bet for now is to do something like > /* adjust elevel so we don't jump out */ > cstate->escontext.error_data->elevel = WARNING; > /* despite the name, this won't raise an error if elevel < ERROR */ > ThrowErrorData(cstate->escontext.error_data); Thanks for your reviewing! I'll try to fix it this way for the time being. > I wonder if we ought to provide a wrapper for this? It could e.g. know > to > mention the original elevel and such? -- Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
On 2023-03-23 02:50, Andres Freund wrote: Thanks again for your review. Attached v5 patch. > Have you measured whether this has negative performance effects when > *NOT* > using the new option? I loaded 10000000 rows of pgbench_accounts on my laptop and compared the elapsed time. GUCs changed from the default are logging_collector = on, log_error_verbosity = verbose. Three tests were run under each condition and the middle of them is listed below: - patch NOT applied(36f40ce2dc66): 35299ms - patch applied, without IGNORE_DATATYPE_ERRORS: 34409ms - patch applied, with IGNORE_DATATYPE_ERRORS: 35510ms It seems there are no significant degradation. Also tested the elapsed time when loading data which has some datatype error with IGNORE_DATATYPE_ERRORS: - data has 100 rows of error: 35269ms - data has 1000 rows of error: 34577ms - data has 5000000 rows of error: 48925ms 5000000 rows of error consumes much time, but it seems to be influenced by logging time. Here are test results under log_min_messages and client_min_messages are 'error': - data has 5000000 data type error: 23972ms - data has 0 data type error: 34320ms Now conversely, when there were many data type errors, it consumes less time. This seems like a reasonable result since the amount of skipped data is increasing. > As-is this does not work with FORMAT BINARY - and converting the binary > input > functions to support soft errors won't happen for 16. So I think you > need to > raise an error if BINARY and IGNORE_DATATYPE_ERRORS are specified. Added the option check. > On 2023-03-22 22:34:20 +0900, torikoshia wrote: > > @@ -985,9 +986,28 @@ CopyFrom(CopyFromState cstate) > > > > ExecClearTuple(myslot); > > > > + if (cstate->opts.ignore_datatype_errors) > > + { > > + escontext.details_wanted = true; > > + cstate->escontext = escontext; > > + } > > I think it might be worth pulling this out of the loop. That does mean > you'd > have to reset escontext.error_occurred after an error, but that doesn't > seem > too bad, you need to do other cleanup anyway. Pull this out of the loop and added process for resetting escontext. > > @@ -956,10 +957,20 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, > > values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]); > > } > > else > > - values[m] = InputFunctionCall(&in_functions[m], > > - string, > > - typioparams[m], > > - att->atttypmod); > > + /* If IGNORE_DATATYPE_ERRORS is enabled skip rows with datatype errors */ > > + if (!InputFunctionCallSafe(&in_functions[m], > > + string, > > + typioparams[m], > > + att->atttypmod, > > + (Node *) &cstate->escontext, > > + &values[m])) > > + { > > + cstate->ignored_errors++; > > + > > + ereport(WARNING, > > + errmsg("%s", cstate->escontext.error_data->message)); > > That isn't right - you loose all the details of the message. As is > you'd also > leak the error context. > > I think the best bet for now is to do something like > /* adjust elevel so we don't jump out */ > cstate->escontext.error_data->elevel = WARNING; > /* despite the name, this won't raise an error if elevel < ERROR */ > ThrowErrorData(cstate->escontext.error_data); As I mentioned in one previous email, added above codes for now. > I wonder if we ought to provide a wrapper for this? It could e.g. know > to > mention the original elevel and such? > > > I don't think NextCopyFrom() is the right place to emit this warning - > it > e.g. is also called from file_fdw.c, which might want to do something > else > with the error. 
From a layering POV it seems cleaner to do this in > CopyFrom(). You already have a check for escontext.error_occurred there > anyway. Agreed. > > @@ -3378,6 +3378,10 @@ copy_opt_item: > > { > > $$ = makeDefElem("freeze", (Node *) makeBoolean(true), @1); > > } > > + | IGNORE_DATATYPE_ERRORS > > + { > > + $$ = makeDefElem("ignore_datatype_errors", (Node *)makeBoolean(true), @1); > > + } > > | DELIMITER opt_as Sconst > > { > > $$ = makeDefElem("delimiter", (Node *) makeString($3), @1); > > > I think we shouldn't add a new keyword for this, but only support this > via > /* new COPY option syntax */ > copy_generic_opt_list: > copy_generic_opt_elem > > Further increasing the size of the grammar with random keywords when we > have > more generic ways to represent them seems unnecessary. Agreed. > > +-- tests for IGNORE_DATATYPE_ERRORS option > > +CREATE TABLE check_ign_err (n int, m int[], k int); > > +COPY check_ign_err FROM STDIN WITH IGNORE_DATATYPE_ERRORS; > > +1 {1} 1 > > +a {2} 2 > > +3 {3} 3333333333 > > +4 {a, 4} 4 > > + > > +5 {5} 5 > > +\. > > +SELECT * FROM check_ign_err; > > + > > I suggest adding a few more tests: > > - COPY with a datatype error that can't be handled as a soft error Added a test for cases with missing columns. However it's not datatype error and not what you expected, is it? > - test documenting that COPY FORMAT BINARY is incompatible with > IGNORE_DATATYPE_ERRORS Added it. > - a soft error showing the error context - although that will require > some > care to avoid the function name + line in the output I assume you mean a test to check the server log, but I haven't come up with a way to do it. Adding a TAP test might do it, but I think it would be overkill to add one just for this. -- Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
The only thing left to do is how not to add IGNORE_DATATYPE_ERRORS as a keyword. See how this is done for parameters such as FORCE_NOT_NULL, FORCE_NULL, FORCE_QUOTE. They are not in kwlist.h and are not as keywords in gram.y.
On 2023-03-27 23:28, Damir Belyalov wrote:
> Hi!
>
> I made the specified changes and my patch turned out the same as
> yours. The performance measurements were the same too.

Thanks for your review and measurements.

> The only thing left to do is how not to add IGNORE_DATATYPE_ERRORS as
> a keyword. See how this is done for parameters such as FORCE_NOT_NULL,
> FORCE_NULL, FORCE_QUOTE. They are not in kwlist.h and are not as
> keywords in gram.y.

I might misunderstand something, but I believe the v5 patch uses
copy_generic_opt_list and it does not add IGNORE_DATATYPE_ERRORS as a
keyword.
It modifies neither kwlist.h nor gram.y.

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
I might misunderstand something, but I believe the v5 patch uses
copy_generic_opt_list and it does not add IGNORE_DATATYPE_ERRORS as a
keyword.
It modifies neither kwlist.h nor gram.y.
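Concretely, a generic option only needs a branch in ProcessCopyOptions(); presumably something along these lines, modeled on how the other boolean options are handled (a sketch, reusing the field and flag names quoted elsewhere in the thread):

```c
/* inside the foreach(option, options) loop of ProcessCopyOptions() */
else if (strcmp(defel->defname, "ignore_datatype_errors") == 0)
{
    if (ignore_datatype_errors_specified)
        errorConflictingDefElem(defel, pstate);
    ignore_datatype_errors_specified = true;
    opts_out->ignore_datatype_errors = defGetBoolean(defel);
}
```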
Hi!
Thank you, Damir, for your patch. It is very interesting to review it!
It seemed to me that the names of variables are not the same everywhere.
I noticed that you used ignore_datatype_errors_specified variable in copy.c , but guc has a short name ignore_datatype_errors. Also you used the short variable name in CopyFormatOptions structure.
The name ignore_datatype_errors_specified seemed very long to me; maybe use a short version of it (ignore_datatype_errors) in copy.c too?
Besides, I noticed that you used the ignored_errors variable in the CopyFromStateData structure and its name is strikingly similar to the name (ignore_datatype_errors), but they have different meanings.
Maybe it will be better to rename it as ignored_errors_counter?
I tested last version v5-0001-Add-new-COPY-option-IGNORE_DATATYPE_ERRORS.patch with bytea data type and transaction cases. Eventually, I didn't find any problem there.
I described my steps in more detail, if I missed something.
First of all, I ran COPY with the IGNORE_DATATYPE_ERRORS parameter inside a transaction block.
File t2.csv exists:
id,b
769,\
1,\6e
2,\x5
5,\x
Test:
CREATE TABLE t (id INT , b BYTEA) ;
postgres=# BEGIN;
copy t FROM '/home/alena/postgres/t2.csv' WITH (format 'csv', IGNORE_DATATYPE_ERRORS, delimiter ',', HEADER);
SAVEPOINT my_savepoint;
BEGIN
WARNING: invalid input syntax for type bytea
WARNING: invalid input syntax for type bytea
WARNING: invalid hexadecimal data: odd number of digits
WARNING: 3 rows were skipped due to data type incompatibility
COPY 1
SAVEPOINT
postgres=*# copy t FROM '/home/alena/postgres/t2.csv' WITH (format 'csv', IGNORE_DATATYPE_ERRORS, delimiter ',', HEADER);
WARNING: invalid input syntax for type bytea
WARNING: invalid input syntax for type bytea
WARNING: invalid hexadecimal data: odd number of digits
WARNING: 3 rows were skipped due to data type incompatibility
COPY 1
postgres=*# ROLLBACK TO my_savepoint;
ROLLBACK
postgres=*# select * from t;
id | b
----+----
5 | \x
(1 row)
postgres=*# copy t FROM '/home/alena/postgres/t2.csv' WITH (format 'csv', IGNORE_DATATYPE_ERRORS, delimiter ',', HEADER);
WARNING: invalid input syntax for type bytea
WARNING: invalid input syntax for type bytea
WARNING: invalid hexadecimal data: odd number of digits
WARNING: 3 rows were skipped due to data type incompatibility
COPY 1
postgres=*# select * from t;
id | b
----+----
5 | \x
5 | \x
(2 rows)
postgres=*# commit;
COMMIT
I tried a similar test with the transaction block moved into a function:
CREATE FUNCTION public.log2()
RETURNS void
LANGUAGE plpgsql
SECURITY DEFINER
AS $function$
BEGIN;
copy t FROM '/home/alena/postgres/t2.csv' WITH (format 'csv', IGNORE_DATATYPE_ERRORS, delimiter ',', HEADER);
SAVEPOINT my_savepoint;
END;
$function$;
postgres=# delete from t;
postgres=# select 1 as t from log2();
WARNING: invalid input syntax for type bytea
WARNING: invalid input syntax for type bytea
WARNING: invalid hexadecimal data: odd number of digits
WARNING: 3 rows were skipped due to data type incompatibility
t
---
1
(1 row)
Secondly I checked function copy with bytea datatype.
t1.csv consists:
id,b
769,\x2d
1,\x6e
2,\x5c
5,\x
And I ran it:
postgres=# delete from t;
DELETE 4
postgres=# copy t FROM '/home/alena/postgres/t2.csv' WITH (format 'csv', IGNORE_DATATYPE_ERRORS, delimiter ',', HEADER);
WARNING: invalid input syntax for type bytea
WARNING: invalid input syntax for type bytea
WARNING: invalid hexadecimal data: odd number of digits
WARNING: 3 rows were skipped due to data type incompatibility
COPY 1
postgres=# select * from t;
id | b
----+----
5 | \x
(1 row)
--
Alena Rybakina
Postgres Professional
On 2023-05-07 05:05, Alena Rybakina wrote: Thanks for your reviewing and comments! > I noticed that you used _ignore_datatype_errors_specified_ variable in > _copy.c_ , but guc has a short name _ignore_datatype_errors_. Also you > used the short variable name in _CopyFormatOptions_ structure. You may already understand it, but these variable names are given in imitation of FREEZE and BINARY cases: --- a/src/include/commands/copy.h +++ b/src/include/commands/copy.h @@ -42,6 +42,7 @@ typedef struct CopyFormatOptions * -1 if not specified */ bool binary; /* binary format? */ bool freeze; /* freeze rows on loading? */ + bool ignore_datatype_errors; /* ignore rows with datatype errors */ --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -419,6 +419,7 @@ ProcessCopyOptions(ParseState *pstate, bool format_specified = false; bool freeze_specified = false; bool header_specified = false; + bool ignore_datatype_errors_specified = false; > Name used _ignore_datatype_errors_specified _is seemed very long to > me, may be use a short version of it (_ignore_datatype_errors_) in > _copy.c_ too? I think it would be sane to align the names with the FREEZE and BINARY options. I agree with the name is too long and we once used the name 'ignore_errors'. However, current implementation does not ignore all errors but just data type error, so I renamed it. There may be a better name, but I haven't come up with one. > Besides, I noticed that you used _ignored_errors_ variable in > _CopyFromStateData_ structure and it's name is strikingly similar to > name (_ignore_datatype_error__s_), but they have different meanings. > Maybe it will be better to rename it as _ignored_errors_counter_? As far as I take a quick look at on PostgreSQL source code, there're few variable name with "_counter". It seems to be used for function names. Something like "ignored_errors_count" might be better. -- Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
Since v5 patch failed applying anymore, updated the patch.

On 2023-03-23 02:50, Andres Freund wrote:
> I suggest adding a few more tests:
>
> - COPY with a datatype error that can't be handled as a soft error

I didn't know proper way to test this, but I've found data type widget's
input function widget_in() defined to occur hard-error in regress.c,
attached patch added a test using it.

--
Atsushi Torikoshi
NTT DATA CORPORATION
> - COPY with a datatype error that can't be handled as a soft error
I didn't know proper way to test this, but I've found data type widget's
input function widget_in() defined to occur hard-error in regress.c,
attached patch added a test using it.
On 2023-09-15 19:02, Damir Belyalov wrote: >> Since v5 patch failed applying anymore, updated the patch. > > Thank you for updating the patch . I made a little review on it where > corrected some formatting. > Thanks for your review and update! I don't have objections the modification of the codes and comments. Although v7 patch doesn't have commit messages on the patch, I think leave commit message is good for reviewers. >>> - COPY with a datatype error that can't be handled as a soft error >> >> I didn't know proper way to test this, but I've found data type >> widget's >> input function widget_in() defined to occur hard-error in regress.c, >> attached patch added a test using it. > > This test seems to be weird a bit, because of the "widget" type. The > hard error is thrown by the previous test with missing data. Also > it'll be interesting for me to list all cases when a hard error can be > thrown. Although missing data error is hard error, the suggestion from Andres was adding `dataype` error: > - COPY with a datatype error that can't be handled as a soft error As described in widghet_in(), widget is intentionally left emitting hard error for testing purpose: > * Note: DON'T convert this error to "soft" style (errsave/ereturn). > We > * want this data type to stay permanently in the hard-error world so > that > * it can be used for testing that such cases still work reasonably. From this point of view, I think this is a supposed way of using widget. OTOH widget is declared in create_type.sql and I'm not sure it's ok to use it in another test copy2.sql. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
> Although the v7 patch doesn't carry a commit message, I think leaving a
> commit message on the patch is helpful for reviewers.

Sure, I hadn't noticed that. I've added the commit message to the updated
patch.

>> * Note: DON'T convert this error to "soft" style (errsave/ereturn). We
>> * want this data type to stay permanently in the hard-error world so that
>> * it can be used for testing that such cases still work reasonably.
>
> From this point of view, I think this is an intended way of using widget.

I agree; it's a good approach for checking datatype errors, because that's
exactly what it was intended for.

> OTOH, widget is declared in create_type.sql and I'm not sure it's OK to
> use it in another test, copy2.sql.

I think future regression tests that use the 'widget' type won't
necessarily live only in create_type.sql, so it isn't a problem that a
type or function is taken from another regression test. For example, the
table 'onek' is used in many regression tests.

Regards,
Damir Belyalov
Postgres Professional
Attachment
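To illustrate the kind of test being discussed, a rough sketch (the table name and data here are invented; IGNORE_DATATYPE_ERRORS is the patch's option name, and widget is the regression-suite type whose input function widget_in() deliberately raises a hard error):

CREATE TABLE hard_err_test (w widget);

COPY hard_err_test FROM STDIN WITH (IGNORE_DATATYPE_ERRORS);
1
\.
-- Expected: the COPY still fails with an ERROR, because widget_in()
-- reports a hard (non-soft) error that IGNORE_DATATYPE_ERRORS cannot skip.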
Damir <dam.bel07@gmail.com> writes:
> [ v7-0002-Add-new-COPY-option-IGNORE_DATATYPE_ERRORS.patch ]

Sorry for being so late to the party, but ... I don't think this is a
well-designed feature as it stands. Simply dropping failed rows seems
like an unusable definition for any application that has pretensions of
robustness.

"But", you say, "we're emitting WARNING messages about it". That's
*useless*. For most applications WARNING messages just go into the bit
bucket, or worse they cause memory leaks (because the app never reads
them). An app that tried to read them would have to cope with all sorts
of fun such as translated messages. Furthermore, as best I can tell from
the provided test cases, the messages completely lack basic context such
as which field or line the problem occurred in. An app trying to use
this to understand which input lines had failed would not get far.

I think an actually usable feature of this sort would involve copying
all the failed lines to some alternate output medium, perhaps a second
table with a TEXT column to receive the original data line. (Or maybe an
array of text that could receive the broken-down field values?) Maybe we
could dump the message info, line number, field name etc into additional
columns.

Also it'd be a good idea to have a vision of how the feature could be
extended to cope with lower-level errors, such as lines that have the
wrong number of columns or other problems with line-level syntax. I
don't say we need to cope with that immediately, but it's going to be
something people will want to add, I think.

			regards, tom lane
> On 8 Nov 2023, at 19:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think an actually usable feature of this sort would involve
> copying all the failed lines to some alternate output medium,
> perhaps a second table with a TEXT column to receive the original
> data line. (Or maybe an array of text that could receive the
> broken-down field values?) Maybe we could dump the message info,
> line number, field name etc into additional columns.

I agree that the errors should be easily visible to the user in some way.
The feature is for sure interesting, especially in data warehouse type
jobs where dirty data is often ingested.

As a data point, Greenplum has this feature with additional SQL syntax to
control it:

    COPY .. LOG ERRORS SEGMENT REJECT LIMIT xyz ROWS;

LOG ERRORS instructs the database to log the faulty rows, and SEGMENT
REJECT LIMIT xyz ROWS sets the limit of how many rows can be faulty
before the operation errors out. I'm not at all advocating that we should
mimic this, just wanted to add a reference to a Postgres derivative where
this has been implemented.

--
Daniel Gustafsson
Daniel Gustafsson <daniel@yesql.se> writes:
>> On 8 Nov 2023, at 19:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think an actually usable feature of this sort would involve
>> copying all the failed lines to some alternate output medium,
>> perhaps a second table with a TEXT column to receive the original
>> data line. (Or maybe an array of text that could receive the
>> broken-down field values?) Maybe we could dump the message info,
>> line number, field name etc into additional columns.

> I agree that the errors should be easily visible to the user in some
> way. The feature is for sure interesting, especially in data warehouse
> type jobs where dirty data is often ingested.

I agree it's interesting, but we need to get it right the first time.

Here is a very straw-man-level sketch of what I think might work.
The option to COPY FROM looks something like

    ERRORS TO other_table_name (item [, item [, ...]])

where the "items" are keywords identifying the information item we will
insert into each successive column of the target table. This design
allows the user to decide which items are of use to them. I envision
items like

    LINENO    bigint    COPY line number, counting from 1
    LINE      text      raw text of line (after encoding conversion)
    FIELDS    text[]    separated, de-escaped string fields (the data
                        that was or would be fed to input functions)
    FIELD     text      name of troublesome field, if field-specific
    MESSAGE   text      error message text
    DETAIL    text      error message detail, if any
    SQLSTATE  text      error SQLSTATE code

Some of these would have to be populated as NULL if we didn't get that
far in processing the line. In the worst case, which is encoding
conversion failure, I think we couldn't populate any of the data items
except LINENO.

Not sure if we need to insist that the target table columns be exactly
the data types I show above. It'd be nice to allow the LINENO target to
be plain int, perhaps. OTOH, do we really want to have to deal with
issues like conversion failures while trying to report an error?

> As a data point, Greenplum has this feature with additional SQL syntax to
> control it:
> COPY .. LOG ERRORS SEGMENT REJECT LIMIT xyz ROWS;
> LOG ERRORS instructs the database to log the faulty rows and SEGMENT REJECT
> LIMIT xyz ROWS sets the limit of how many rows can be faulty before the
> operation errors out.

Hm. A "reject limit" might be a useful add-on, but I wouldn't advocate
including it in the initial patch.

			regards, tom lane
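For illustration only, here is roughly how the straw-man syntax above might be used; none of this exists, and the error-table name and chosen items are arbitrary picks from the proposal:

CREATE TABLE load_errors (lineno bigint, line text, message text, sqlstate text);

COPY target_table FROM '/path/to/data.csv' WITH (FORMAT csv)
    ERRORS TO load_errors (LINENO, LINE, MESSAGE, SQLSTATE);

-- Rows that fail would be recorded in load_errors with the requested
-- items, while the rest of the load proceeds.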
Hi,

On 2023-11-08 13:18:39 -0500, Tom Lane wrote:
> Damir <dam.bel07@gmail.com> writes:
>> [ v7-0002-Add-new-COPY-option-IGNORE_DATATYPE_ERRORS.patch ]
>
> Sorry for being so late to the party, but ... I don't think this
> is a well-designed feature as it stands. Simply dropping failed rows
> seems like an unusable definition for any application that has
> pretensions of robustness.

Not everything needs to be a robust application though. I've definitely
cursed at postgres for lacking this.

> I think an actually usable feature of this sort would involve
> copying all the failed lines to some alternate output medium,
> perhaps a second table with a TEXT column to receive the original
> data line. (Or maybe an array of text that could receive the
> broken-down field values?) Maybe we could dump the message info,
> line number, field name etc into additional columns.

If we go in that direction, we should make it possible to *not* use such
a table as well; for some uses it'd be pointless.

Another way of reporting errors could be for copy to return invalid
input back to the client, via the copy protocol. That would allow the
client to handle failing rows and also to abort if the number of errors
or the type of errors gets to be too big.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2023-11-08 13:18:39 -0500, Tom Lane wrote:
>> I think an actually usable feature of this sort would involve
>> copying all the failed lines to some alternate output medium,
>> perhaps a second table with a TEXT column to receive the original
>> data line.

> If we go in that direction, we should make it possible to *not* use such a
> table as well, for some uses it'd be pointless.

Why? You can always just drop the errors table if you don't want it.
But I fail to see the use-case for ignoring errors altogether.

> Another way of reporting errors could be for copy to return invalid input back
> to the client, via the copy protocol.

Color me skeptical. There are approximately zero clients in the world
today that could handle simultaneous return of data during a COPY.
Certainly neither libpq nor psql are within hailing distance of being
able to support that. Maybe in some far future it could be made to work
--- but if you want it in the v1 patch, you just moved the goalposts into
the next county.

			regards, tom lane
Hi,

On 2023-11-08 19:00:01 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2023-11-08 13:18:39 -0500, Tom Lane wrote:
>>> I think an actually usable feature of this sort would involve
>>> copying all the failed lines to some alternate output medium,
>>> perhaps a second table with a TEXT column to receive the original
>>> data line.
>
>> If we go in that direction, we should make it possible to *not* use such a
>> table as well, for some uses it'd be pointless.
>
> Why? You can always just drop the errors table if you don't want it.

I think it'll often just end up littering the database, particularly if
the callers don't care about a few errors.

> But I fail to see the use-case for ignoring errors altogether.

My experience is that there's often a few errors due to bad encoding,
missing escaping etc that you don't care sufficiently about when
importing large quantities of data.

Greetings,

Andres Freund
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Here is a very straw-man-level sketch of what I think might work.
> The option to COPY FROM looks something like
>
>     ERRORS TO other_table_name (item [, item [, ...]])
>
> where the "items" are keywords identifying the information item
> we will insert into each successive column of the target table.
> This design allows the user to decide which items are of use
> to them. I envision items like

While I'm pretty happy with the overall design, especially the
'ERRORS TO other_table_name' part, I'm a bit confused about why we need
to write the code for (item [, item [, ...]]). Not only does it require
more coding, it also requires the user to make more decisions. Would
anything be wrong with making all of them the default?

>     LINENO    bigint    COPY line number, counting from 1
>     LINE      text      raw text of line (after encoding conversion)
>     FIELDS    text[]    separated, de-escaped string fields (the data
>                         that was or would be fed to input functions)
>     FIELD     text      name of troublesome field, if field-specific
>     MESSAGE   text      error message text
>     DETAIL    text      error message detail, if any
>     SQLSTATE  text      error SQLSTATE code

--
Best Regards
Andy Fan
> Here is a very straw-man-level sketch of what I think might work.
> The option to COPY FROM looks something like
>
>     ERRORS TO other_table_name (item [, item [, ...]])

I tried to implement the patch using a table and came across a number of questions.

Which table should we implement for this feature: a system catalog table, or should we store it as a file, or create a new regular table?

In these cases, security and user rights management issues arise. It is better for other users not to see error lines from another user. It is also not clear how access rights to this table should be inherited and granted.
Damir Belyalov
Postgres Professional
Hi!

>> Here is a very straw-man-level sketch of what I think might work.
>> The option to COPY FROM looks something like
>>
>>     ERRORS TO other_table_name (item [, item [, ...]])
>
> I tried to implement the patch using a table and came across a number
> of questions.
>
> Which table should we implement for this feature: a system catalog
> table, or should we store it as a file, or create a new regular table?
>
> In these cases, security and user rights management issues arise. It is
> better for other users not to see error lines from another user. It is
> also not clear how access rights to this table should be inherited and
> granted.

Maybe we can add a GUC or a parameter to output such errors during the
execution of the copy-with-errors operation, and check whether the user
has enough rights to set such a parameter?

That is, I propose to give the user a choice to run copy with or without
saving errors, and at the same time immediately check whether the option
with error output is possible for them at all.

--
Regards,
Alena Rybakina
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Damir Belyalov <dam.bel07@gmail.com> writes:

> Here is a very straw-man-level sketch of what I think might work.
> The option to COPY FROM looks something like
>
>     ERRORS TO other_table_name (item [, item [, ...]])
>
> I tried to implement the patch using a table and came across a number
> of questions.
>
> Which table should we implement for this feature: a system catalog
> table, or should we store it as a file, or create a new regular table?

I think a system catalog should not be an option in the first place,
since it requires more extra work; see the calls of IsCatalogRelation in
heapam.c. I prefer creating a new normal heap relation rather than a
file, since a heap relation probably has better APIs.

> In these cases, security and user rights management issues arise. It is
> better for other users not to see error lines from another user. It is
> also not clear how access rights to this table should be inherited and
> granted.

How about creating the table so that only the current user can read and
write it, or giving it the same rights as the relation we are copying to?

--
Best Regards
Andy Fan
On Thu, Nov 9, 2023 at 4:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Here is a very straw-man-level sketch of what I think might work.
> The option to COPY FROM looks something like
>
>     ERRORS TO other_table_name (item [, item [, ...]])
>
> where the "items" are keywords identifying the information item
> we will insert into each successive column of the target table.
> [...]

Just SAVE ERRORS, automatically creating a table to hold the errors
(validating auto-generated table name uniqueness and the CREATE
privilege); the table would have the above related info. If no error
occurs, the table gets dropped.
On 14/11/2023 17:10, Damir Belyalov wrote:
> Here is a very straw-man-level sketch of what I think might work.
> The option to COPY FROM looks something like
>
>     ERRORS TO other_table_name (item [, item [, ...]])
>
> I tried to implement the patch using a table and came across a number
> of questions.
>
> Which table should we implement for this feature: a system catalog
> table, or should we store it as a file, or create a new regular table?
>
> In these cases, security and user rights management issues arise. It is
> better for other users not to see error lines from another user. It is
> also not clear how access rights to this table should be inherited and
> granted.

Previous reviews have given helpful ideas about storing errors in a new
table. It should be trivial code: use the current table name + 'err' + a
suffix, as we already do in the case of conflicting auto-generated index
names. The 'errors table' must inherit any rights policies from the table
into which we are copying.

--
regards,
Andrei Lepikhov
Postgres Professional
Hi. Here is my implementation based on the previous discussions: add a new COPY FROM flag, save_error.

save_error only works with non-BINARY flags. save_error is easier for me to implement; if we used "save error" (two words), I worry gram.y will not work. save_error also works with other flags like {csv mode, force_null, force_not_null}.

The overall logic is:

if save_error is specified then
    if the error-holding table does not exist then create one
    if the error-holding table exists then set error_firsttime to false
if save_error is not specified then work as on the master branch
if errors happen then insert the error info into the error-holding table
if no errors occurred and error_firsttime is true then drop the table
if no errors occurred and error_firsttime is false then raise a notice:
    All the past error holding saved at %s.%s

Error-holding table: the schema will be the same as the COPY destination table; the table name will be the COPY destination name concatenated with "_error". Error-holding table definition:

CREATE TABLE err_nsp.error_rel (LINENO BIGINT, LINE TEXT,
    FIELD TEXT, SOURCE TEXT, ERR_MESSAGE TEXT,
    ERR_DETAIL TEXT, ERRORCODE TEXT);

The following field is not implemented:

    FIELDS text[]   separated, de-escaped string fields (the data that
                    was or would be fed to input functions)

because of cases like the following:

create type test as (a int, b text);
create table copy_comp (c1 int, c2 test default '(11,test)', c3 date);
copy copy_comp from stdin with (default '\D');
1	\D	'2022-07-04'
\.
table copy_comp;

I feel it's hard to get the text[] `(11,test)` from the textual '\D' via SPI.

--------------------------------------
demo:

create table copy_default_error_save (
    id integer,
    text_value text not null default 'test',
    ts_value timestamp without time zone not null default '2022-07-05'
);

copy copy_default_error_save from stdin with (save_error, default '\D');
k	value	'2022-07-04'
z	\D	'2022-07-03ASKL'
s	\D	\D
\.
NOTICE:  3 rows were skipped because of error. skipped row saved to table public.copy_default_error_save_error

select * from copy_default_error_save_error;
 lineno |          line          |  field   |      source      |                         err_message                         | err_detail | errorcode
--------+------------------------+----------+------------------+--------------------------------------------------------------+------------+-----------
      1 | k value '2022-07-04'   | id       | k                | invalid input syntax for type integer: "k"                   |            | 22P02
      2 | z \D '2022-07-03ASKL'  | id       | z                | invalid input syntax for type integer: "z"                   |            | 22P02
      2 | z \D '2022-07-03ASKL'  | ts_value | '2022-07-03ASKL' | invalid input syntax for type timestamp: "'2022-07-03ASKL'"  |            | 22007
      3 | s \D \D                | id       | s                | invalid input syntax for type integer: "s"                   |            | 22P02
(4 rows)

The doc is not so good yet. COPY FROM (save_error) will not be as fast as COPY FROM (save_error false): with save_error we can only use InputFunctionCallSafe, which I believe is not as fast as InputFunctionCall, and if any conversion error happens we need to call the SPI interface, which adds more overhead. Also, we can currently only insert error rows one by one (maybe we can insert into error_save VALUES (error1), (error2); I will try that later).

The main code is about constructing the SPI query, plus tests and test output.
Attachment
Hi!
Thank you for your contribution to this thread.
I reviewed it and have a few questions.
1. I have seen that you delete the table before creating it, that is, the table to which you want to add errors from a failed "copy from" operation. I think this is wrong, because this table may contain data that is useful to the user.
At a minimum, we should warn the user about this, but I think we can just add some number at the end of the name, such as name_table1, name_table_2.
2. I noticed that you are forming a table name using the type of errors that prevent rows from being added during 'copy from' operation.
I think it would be better to use the name of the source file that was used while 'copy from' was running.
In addition, there may be several such files, it is also worth considering.
3. I found a spelling mistake:
/* no err_nsp.error_rel table then crete one. for holding error. */
4. Maybe rewrite this comment:

    these info need, no error will drop err_nsp.error_rel table

to:

    this information is necessary, no error will lead to the deletion of the err_nsp.error_rel table.
5. Is this part of the comment needed? I think it duplicates the information below when we form the query.
* . column list(order by attnum, begin from ctid) =
* {ctid, lineno,line,field,source,err_message,err_detail,errorcode}
* . data types (from attnum = -1) ={tid, int8,text,text,text,text,text,text}
I'm not sure if we need to order the rows by number. It might be easier to work with these lines in the order they appear.
-- Regards, Alena Rybakina Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Tue, Dec 5, 2023 at 6:07 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> 1. I have seen that you delete the table before creating it [...]
> At a minimum, we should warn the user about this, but I think we can
> just add some number at the end of the name, such as name_table1,
> name_table_2.

Sorry, I don't understand this part. Currently, if the error table name
already exists, then the copy will fail and an error will be reported. I
first try to create a table; if there is no error, the error table will
be dropped.

Can you demo the expected behavior?

> 2. I noticed that you are forming a table name using the type of errors
> that prevent rows from being added during the 'copy from' operation.
> I think it would be better to use the name of the source file that was
> used while 'copy from' was running.
> In addition, there may be several such files, which is also worth
> considering.

Another column added. Now it looks like:

SELECT * FROM save_error_csv_error;
 filename | lineno |          line            | field | source |                 err_message                 | err_detail | errorcode
----------+--------+--------------------------+-------+--------+----------------------------------------------+------------+-----------
 STDIN    |      1 | 2002 232 40 50 60 70 80  | NULL  | NULL   | extra data after last expected column       | NULL       | 22P04
 STDIN    |      1 | 2000 230 23              | d     | NULL   | missing data for column "d"                 | NULL       | 22P04
 STDIN    |      1 | z,,""                    | a     | z      | invalid input syntax for type integer: "z"  | NULL       | 22P02
 STDIN    |      2 | \0,,                     | a     | \0     | invalid input syntax for type integer: "\0" | NULL       | 22P02

> 3. I found a spelling mistake:
>
> /* no err_nsp.error_rel table then crete one. for holding error. */

Fixed.

> 4. Maybe rewrite this comment [...]

Fixed.

> 5. Is this part of the comment needed? I think it duplicates the
> information below when we form the query.
> [...]
> I'm not sure if we need to order the rows by number. It might be easier
> to work with these lines in the order they appear.

Simplified the comment. "order by attnum" is to make sure that if there
is a table already existing, and the column names are like X and the data
types like Y, then we consider that table good for holding potential
error info.

COPY FROM's main entry point is NextCopyFrom. Now, for non-binary mode,
if you specified save_error then it will not fail at NextCopyFrom; all
three of these errors will be tolerated: extra data after the last
expected column, missing data for a column, and data type conversion
errors.
Attachment
Thank you for your work. Unfortunately, your code produced errors during the make installation:

'SAVEPOINT' after 'SAVE_ERROR' in unreserved_keyword list is misplaced
'SAVEPOINT' after 'SAVE_ERROR' in bare_label_keyword list is misplaced
make[2]: *** [../../../src/Makefile.global:783: gram.c] Error 1
make[1]: *** [Makefile:131: parser/gram.h] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [src/Makefile.global:383: submake-generated-headers] Error 2
I have Ubuntu 22.04 as the operating system.

> Sorry, I don't understand this part. Currently, if the error table name
> already exists, then the copy will fail and an error will be reported. I
> first try to create a table; if there is no error, the error table will
> be dropped.

To be honest, first of all, I misunderstood this part of the code. Now I see that it works the way you mentioned.
However, I didn't see if you dealt with cases where we already had a table with the same name as the table error.
I mean, when is he trying to create for the first time, or will we never be able to face such a problem?
> Can you demo the expected behavior?

Unfortunately, I was unable to launch it due to a build issue.
Yes, I see the "filename" column, and this will solve the problem, but "STDIN" is unclear to me.2. I noticed that you are forming a table name using the type of errors that prevent rows from being added during 'copy from' operation. I think it would be better to use the name of the source file that was used while 'copy from' was running. In addition, there may be several such files, it is also worth considering.Another column added. now it looks like: SELECT * FROM save_error_csv_error; filename | lineno | line | field | source | err_message | err_detail | errorcode ----------+--------+----------------------------------------------------+-------+--------+---------------------------------------------+------------+----------- STDIN | 1 | 2002 232 40 50 60 70 80 | NULL | NULL | extra data after last expected column | NULL | 22P04 STDIN | 1 | 2000 230 23 | d | NULL | missing data for column "d" | NULL | 22P04 STDIN | 1 | z,,"" | a | z | invalid input syntax for type integer: "z" | NULL | 22P02 STDIN | 2 | \0,, | a | \0 | invalid input syntax for type integer: "\0" | NULL | 22P02
>> 3. I found a spelling mistake:
>> /* no err_nsp.error_rel table then crete one. for holding error. */
> Fixed.
>
>> 4. Maybe rewrite this comment [...]
> Fixed.

Thank you.
> Simplified the comment. "order by attnum" is to make sure that if there
> is a table already existing, and the column names are like X and the
> data types like Y, then we consider that table good for holding
> potential error info.

It looks clearer and better, thanks!
Comments written as questions are unusual for me; I read them as something to think about, for example, as here (contrib/bloom/blinsert.c:312):
/*
 * Didn't find place to insert in notFullPage array.  Allocate new page.
 * (XXX is it good to do this while holding ex-lock on the metapage??)
 */
Maybe we can rewrite it like this:

/* Check whether the err_nsp.error_rel table already exists,
 * and if it does, check its column names and data types.
 */
-- Regards, Alena Rybakina Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Fri, Dec 8, 2023 at 3:09 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> Thank you for your work. Unfortunately, your code produced errors during
> the make installation:
>
> 'SAVEPOINT' after 'SAVE_ERROR' in unreserved_keyword list is misplaced
> 'SAVEPOINT' after 'SAVE_ERROR' in bare_label_keyword list is misplaced
> [...]
> Unfortunately, I was unable to launch it due to a build issue.

Hopefully the attached will work.

> Yes, I see the "filename" column, and this will solve the problem, but
> "STDIN" is unclear to me.

Please see the comment in struct CopyFromStateData:

    char *filename;   /* filename, or NULL for STDIN */

> Maybe we can rewrite it like this:
>
> /* Check whether the err_nsp.error_rel table already exists,
>  * and if it does, check its column names and data types.
>  */

Refactored.
Attachment
Hi! Thank you for your work. Your patch looks better!
> Hopefully the attached will work.
Yes, thank you! It works fine, and I see that the regression tests have been passed. 🙂
However, when I ran the 'copy from with save_error' operation with simple csv files (copy_test.csv, copy_test1.csv) for the tables test and test1 (created as shown below):
postgres=# create table test (x int primary key, y int not null);
postgres=# create table test1 (x int, z int, CONSTRAINT fk_x
FOREIGN KEY(x)
REFERENCES test(x));
I did not find a table with saved errors after the operation, although I received a log message about it:

postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV save_error
NOTICE: 2 rows were skipped because of error. skipped row saved to table public.test_error
ERROR: duplicate key value violates unique constraint "test_pkey"
DETAIL: Key (x)=(2) already exists.
CONTEXT: COPY test, line 3
postgres=# select * from public.test_error;
ERROR: relation "public.test_error" does not exist
LINE 1: select * from public.test_error;
postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error
NOTICE: 2 rows were skipped because of error. skipped row saved to table public.test1_error
ERROR: insert or update on table "test1" violates foreign key constraint "fk_x"
DETAIL: Key (x)=(2) is not present in table "test".
postgres=# select * from public.test1_error;
ERROR: relation "public.test1_error" does not exist
LINE 1: select * from public.test1_error;
Two lines were written correctly in the csv files, therefore they should have been added to the tables, but they were not added to the tables test and test1.
If I leave only the correct rows, everything works fine and the rows are added to the tables.
in copy_test.csv:
2,0
1,1
in copy_test1.csv:
2,0
2,1
1,1
postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV
COPY 2
postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error
NOTICE: No error happened.Error holding table public.test1_error will be droped
COPY 3
Maybe I'm launching it the wrong way. If so, let me know about it.
I also noticed interesting behavior when the table was previously created by the user. When I created an error_table before the 'copy from' operation,
I received a message saying that it is impossible to create a table with the same name (it is shown below) during the 'copy from' operation.
I think you should add information about this in the documentation, since this seems to be normal behavior to me.
postgres=# CREATE TABLE test_error (LINENO BIGINT, LINE TEXT,
FIELD TEXT, SOURCE TEXT, ERR_MESSAGE TEXT,
ERR_DETAIL TEXT, ERRORCODE TEXT);
CREATE TABLE
postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV save_error
ERROR: Error save table public.test_error already exists. Cannot use it for COPY FROM error saving
> Please see the comment in struct CopyFromStateData:
>
>     char *filename;   /* filename, or NULL for STDIN */
Yes, I can see that.
I haven't figured out how to fix it yet either.
> Refactored.
Fine)
-- Regards, Alena Rybakina Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Mon, Dec 11, 2023 at 10:05 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> However, when I ran the 'copy from with save_error' operation with
> simple csv files (copy_test.csv, copy_test1.csv) for the tables test
> and test1, I did not find a table with saved errors after the
> operation, although I received a log message about it.
> [...]
> Two lines were written correctly in the csv files, therefore they should
> have been added to the tables, but they were not added to the tables
> test and test1.
> [...]
> Maybe I'm launching it the wrong way. If so, let me know about it.

It looks like the above is about constraint violations while copying;
constraint violations during COPY are not in the scope of this patch.
Since COPY FROM is very much like the INSERT command, you do want all the
valid constraints to check all the copied rows.

But the notice raised by the patch was not right, so I placed the "drop
the error-saving table or raise the notice" logic above
`ExecResetTupleTable(estate->es_tupleTable, false)` in the function
CopyFrom.

> I also noticed interesting behavior when the table was previously
> created by the user. When I created an error_table before the 'copy
> from' operation, I received a message saying that it is impossible to
> create a table with the same name during the 'copy from' operation.
> I think you should add information about this in the documentation,
> since this seems to be normal behavior to me.

The doc has been changed; you may check it.
Attachment
> Since COPY FROM is very much like the INSERT command, you do want all
> the valid constraints to check all the copied rows.

No, I think it will be too much.
> But the notice raised by the patch was not right, so I placed the "drop
> the error-saving table or raise the notice" logic above
> `ExecResetTupleTable(estate->es_tupleTable, false)` in the function
> CopyFrom.

Yes, I see it and agree with you.
> The doc has been changed; you may check it.

Yes, I saw it. Thank you.
-- Regards, Alena Rybakina Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hi,

On Tue, Dec 12, 2023 at 10:04 PM jian he <jian.universality@gmail.com> wrote:
> [...]
> The doc has been changed; you may check it.

I've read this thread and the latest patch.
IIUC, with the SAVE_ERROR option, COPY FROM creates an error table for the
target table and writes error information there.

While I agree that the final shape of this feature would be something like
that design, I'm concerned some features are missing in order to make this
feature useful in practice. For instance, error logs are inserted into
error tables without bounds, meaning that users who want to tolerate
errors during COPY FROM will have to truncate or drop the error tables
periodically, or the database will grow with error logs without limit.
Ideally such maintenance work should be done by the database. There might
be some users who want to log such conversion errors in server logs to
avoid such maintenance work. I think we should provide an option for where
to write, at least. Also, since the error tables are normal user tables
internally, error logs are also replicated to subscribers if there is a
publication FOR ALL TABLES, unlike system catalogs. I think some users
would not like such behavior.

Looking at the SAVE_ERROR feature closely, I think it consists of two
separate features. That is, it enables COPY FROM to load data while
(1) tolerating errors and (2) logging errors to somewhere (i.e., an error
table). If we implemented only (1), it would be like COPY FROM tolerating
errors infinitely and logging errors to /dev/null. The user cannot see the
error details, but I guess it could still help some cases, as Andres
mentioned[1] (it might be a good idea to send the number of rows
successfully loaded in a NOTICE message if some rows could not be loaded).
Then with (2), COPY FROM can log error information to somewhere such as
tables and server logs, and the user can select it. So I'm thinking we may
be able to implement this feature incrementally. The first step would be
something like an option to ignore all errors, or an option to specify the
maximum number of errors to tolerate before raising an ERROR. The second
step would be to support logging destinations such as server logs and
tables.

Regards,

[1] https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
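Purely as an illustration of that incremental shape, here is a hypothetical "step 1" interface; the option name and output below are invented for this sketch and do not come from any patch in this thread:

-- Step 1: tolerate errors, report only a count, log nothing.
COPY t FROM '/path/to/data.csv' WITH (FORMAT csv, IGNORE_ERRORS);
-- NOTICE:  90 of 100 rows loaded; 10 rows skipped due to errors
--          (hypothetical output)

-- Step 2 would then add a destination for the error details,
-- e.g. an error table or the server log.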
On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Hi, > > I've read this thread and the latest patch. IIUC with SAVE_ERROR > option, COPY FROM creates an error table for the target table and > writes error information there. > > While I agree that the final shape of this feature would be something > like that design, I'm concerned some features are missing in order to > make this feature useful in practice. For instance, error logs are > inserted to error tables without bounds, meaning that users who want > to tolerate errors during COPY FROM will have to truncate or drop the > error tables periodically, or the database will grow with error logs > without limit. Ideally such maintenance work should be done by the > database. There might be some users who want to log such conversion > errors in server logs to avoid such maintenance work. I think we > should provide an option for where to write, at least. Also, since the > error tables are normal user tables internally, error logs are also > replicated to subscribers if there is a publication FOR ALL TABLES, > unlike system catalogs. I think some users would not like such > behavior. save the error metadata to system catalogs would be more expensive, please see below explanation. I have no knowledge of publications. but i feel there is a feature request: publication FOR ALL TABLES exclude regex_pattern. Anyway, that would be another topic. > Looking at SAVE_ERROR feature closely, I think it consists of two > separate features. That is, it enables COPY FROM to load data while > (1) tolerating errors and (2) logging errors to somewhere (i.e., an > error table). If we implement only (1), it would be like COPY FROM > tolerate errors infinitely and log errors to /dev/null. The user > cannot see the error details but I guess it could still help some > cases as Andres mentioned[1] (it might be a good idea to send the > number of rows successfully loaded in a NOTICE message if some rows > could not be loaded). Then with (2), COPY FROM can log error > information to somewhere such as tables and server logs and the user > can select it. So I'm thinking we may be able to implement this > feature incrementally. The first step would be something like an > option to ignore all errors or an option to specify the maximum number > of errors to tolerate before raising an ERROR. The second step would > be to support logging destinations such as server logs and tables. > > Regards, > > [1] https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de > > -- > Masahiko Sawada > Amazon Web Services: https://aws.amazon.com > feature incrementally. The first step would be something like an > option to ignore all errors or an option to specify the maximum number > of errors to tolerate before raising an ERROR. The second step would I don't think "specify the maximum number of errors to tolerate before raising an ERROR." is very useful.... QUOTE from [1] MAXERROR [AS] error_count If the load returns the error_count number of errors or greater, the load fails. If the load returns fewer errors, it continues and returns an INFO message that states the number of rows that could not be loaded. Use this parameter to allow loads to continue when certain rows fail to load into the table because of formatting errors or other inconsistencies in the data. Set this value to 0 or 1 if you want the load to fail as soon as the first error occurs. The AS keyword is optional. 
The MAXERROR default value is 0 and the limit is 100000. The actual number of errors reported might be greater than the specified MAXERROR because of the parallel nature of Amazon Redshift. If any node in the Amazon Redshift cluster detects that MAXERROR has been exceeded, each node reports all of the errors it has encountered. END OF QUOTE With the MAXERROR error_count option, IIUC, the load fails while validating line error_count + 1; otherwise it raises a notice telling you how many rows had errors.
* When error_count is small and the copy fails, it only tells you that at least error_count lines have malformed data. But if the actual number of malformed rows is much larger, that failure message is not very helpful.
* When error_count is very large and the copy does not fail, the actual number of malformed rows may still be large (just under error_count), but there is no error report, so you don't know which lines have errors.
Either way, if the file has a large portion of malformed rows, the MAXERROR option does not make sense, so maybe we don't need a threshold for tolerating errors. However, we could have an option that does not actually copy into the table but only validates, similar to NOLOAD in [1]. Why we save the error:
* If only a small portion of the rows is malformed, saving the error metadata would be cheap.
* If a large portion of the rows is malformed, the copy will be slow, but we will have saved the error metadata, and you can then fix the data based on it.
I think saving errors to a regular table or a text file seems sane, but not to a catalog table:
* For a text file with M rows and N fields, a contrived corner case would be (M-2) * N errors plus the last 2 rows having duplicate keys that violate the primary key constraint. In that case we would first insert (M-2) * N rows into the catalog table and then undo them because of the error. I think that will be expensive.
* Error metadata is not as important as the other pg_catalog tables.
The log format is quite verbose, so save_error to the log seems not so good, I guess. I suppose we can specify an ERRORFILE directory, similar to the implementation in [2] (demo in [3]): it generates 2 files, one showing the malformed line content as is, the other showing the error info. Let's assume we save the error info to a table. The previous thread concluded that having each copy operation create its own error table is not a good idea; looking back, I agree. Similar to [4], I came up with the following logic/ideas:
* The save_error table name would be COPY_ERRORS, in the same schema as the COPY FROM destination table.
* One COPY_ERRORS table saves all error metadata generated by COPY FROM.
* If save_error is specified, before doing the COPY FROM, first check whether the table COPY_ERRORS exists; if not, create one? Or raise an error saying that COPY_ERRORS does not exist and we cannot save_error?
* The COPY_ERRORS table owner would be the current database owner?
* Only the table owner is allowed to INSERT/DELETE/UPDATE; others are not. When an error happens during the copy, record the userid, then switch to the COPY_ERRORS owner to execute the insert.
* The user doing the COPY FROM operation is solely allowed to view (select) the errored rows they generated.
The COPY_ERRORS table would be:
userid oid /* the user who is doing this operation */
error_time timestamptz /* when this error happened. not 100% sure this column is needed */
filename text /* the copy from source */
table_name text /* the copy from destination */
lineno bigint /* the error line number */
line text /* the whole line raw content */
colname text -- field with the error
raw_field_value text -- the value of the field that caused the error
err_message text -- same as ErrorData->message
err_detail text -- same as ErrorData->detail
errorcode text -- transformed errcode, for example "22P02"
[1] https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html [2] https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver16 [3] https://www.sqlshack.com/working-with-line-numbers-and-errors-using-bulk-insert/ [4] https://docs.aws.amazon.com/redshift/latest/dg/r_STL_LOAD_ERRORS.html
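For illustration only, a minimal SQL sketch of that proposed COPY_ERRORS layout could be (column names and types follow the list above; ownership, indexing and maintenance are deliberately left out, and none of this is meant as a final design):

CREATE TABLE copy_errors (
    userid          oid,          -- user who ran the COPY
    error_time      timestamptz,  -- when the error happened
    filename        text,         -- COPY FROM source
    table_name      text,         -- COPY FROM destination
    lineno          bigint,       -- line number of the bad input line
    line            text,         -- raw content of the whole line
    colname         text,         -- field with the error
    raw_field_value text,         -- value of the field that caused the error
    err_message     text,         -- same as ErrorData->message
    err_detail      text,         -- same as ErrorData->detail
    errorcode       text          -- SQLSTATE, e.g. "22P02"
);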
On 2023-12-15 05:48, Masahiko Sawada wrote: Thanks for joining this discussion! > I've read this thread and the latest patch. IIUC with SAVE_ERROR > option, COPY FROM creates an error table for the target table and > writes error information there. > > While I agree that the final shape of this feature would be something > like that design, I'm concerned some features are missing in order to > make this feature useful in practice. For instance, error logs are > inserted to error tables without bounds, meaning that users who want > to tolerate errors during COPY FROM will have to truncate or drop the > error tables periodically, or the database will grow with error logs > without limit. Ideally such maintenance work should be done by the > database. There might be some users who want to log such conversion > errors in server logs to avoid such maintenance work. I think we > should provide an option for where to write, at least. Also, since the > error tables are normal user tables internally, error logs are also > replicated to subscribers if there is a publication FOR ALL TABLES, > unlike system catalogs. I think some users would not like such > behavior. > > Looking at SAVE_ERROR feature closely, I think it consists of two > separate features. That is, it enables COPY FROM to load data while > (1) tolerating errors and (2) logging errors to somewhere (i.e., an > error table). If we implement only (1), it would be like COPY FROM > tolerate errors infinitely and log errors to /dev/null. The user > cannot see the error details but I guess it could still help some > cases as Andres mentioned[1] (it might be a good idea to send the > number of rows successfully loaded in a NOTICE message if some rows > could not be loaded). Then with (2), COPY FROM can log error > information to somewhere such as tables and server logs and the user > can select it. +1. I may be biased since I wrote some ~v6 patches which just output the soft errors and number of skipped rows to log, but I think just (1) would be worth implementing as you pointed out and I like if users could choose where to log output. I think there would be situations where it is preferable to save errors to server log even considering problems which were pointed out in [1], i.e. manually loading data. [1] https://www.postgresql.org/message-id/739953.1699467519%40sss.pgh.pa.us > feature incrementally. The first step would be something like an > option to ignore all errors or an option to specify the maximum number > of errors to tolerate before raising an ERROR. The second step would > be to support logging destinations such as server logs and tables. > > Regards, > > [1] > https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Hi, > save the error metadata to system catalogs would be more expensive, > please see below explanation. > I have no knowledge of publications. > but i feel there is a feature request: publication FOR ALL TABLES > exclude regex_pattern. > Anyway, that would be another topic. I think saving error metadata to a system catalog is not a good idea either. And I believe Sawada-san just pointed out missing features and did not suggest that we use a system catalog. > I don't think "specify the maximum number of errors to tolerate > before raising an ERROR." is very useful.... That may be so. I imagine it's useful in some use cases since some loading tools have such options. Anyway, I agree it's not necessary for the initial patch, as mentioned in [1]. > I suppose we can specify an ERRORFILE directory. similar > implementation [2], demo in [3] > it will generate 2 files, one file shows the malform line content as > is, another file shows the error info. That may be a good option when considering "(2) logging errors to somewhere". What do you think about the proposal to develop these features incrementally? On 2023-12-15 05:48, Masahiko Sawada wrote: > So I'm thinking we may be able to implement this > feature incrementally. The first step would be something like an > option to ignore all errors or an option to specify the maximum number > of errors to tolerate before raising an ERROR. The second step would > be to support logging destinations such as server logs and tables. [1] https://www.postgresql.org/message-id/752672.1699474336%40sss.pgh.pa.us -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On Mon, Dec 18, 2023 at 1:09 PM torikoshia <torikoshia@oss.nttdata.com> wrote: > > Hi, > > > save the error metadata to system catalogs would be more expensive, > > please see below explanation. > > I have no knowledge of publications. > > but i feel there is a feature request: publication FOR ALL TABLES > > exclude regex_pattern. > > Anyway, that would be another topic. > > I think saving error metadata to system catalog is not a good idea, too. > And I believe Sawada-san just pointed out missing features and did not > suggested that we use system catalog. > > > I don't think "specify the maximum number of errors to tolerate > > before raising an ERROR." is very useful.... > > That may be so. > I imagine it's useful in some use case since some loading tools have > such options. > Anyway I agree it's not necessary for initial patch as mentioned in [1]. > > > I suppose we can specify an ERRORFILE directory. similar > > implementation [2], demo in [3] > > it will generate 2 files, one file shows the malform line content as > > is, another file shows the error info. > > That may be a good option when considering "(2) logging errors to > somewhere". > > What do you think about the proposal to develop these features in > incrementally? > I am more with Tom's idea [1]: when errors happen (data type conversion only), do not fail, AND save the error to a table. I guess we can implement this logic together, with only a new COPY option. Imagine a case (it's not that contrived, imho) where, while converting text to the table's int column, Postgres's isspace logic differs from the source text file's isspace logic; then all the lines are malformed. If we just say "on error continue" and do not save the error metadata, the user is still confused about which field has the wrong data, and will probably resort to incrementally testing which field contains the malformed data. Since we need to save the error somewhere: everyone who has the INSERT privilege can do COPY, so I think we also need to handle access privileges. Like I mentioned above, with one copy_errors error-table hub, everyone can view/select their own copy failure records. Saving to a server text file/directory makes it hard for a user with only INSERT privilege to see those files, I think; similarly, a user with limited privileges cannot easily see these failed records in the log. If someone wants to fail after maxerror rows, they can still do it, since we will count how many rows failed (even though I don't really see the point of that). [1] https://www.postgresql.org/message-id/900123.1699488001%40sss.pgh.pa.us
On Mon, Dec 18, 2023 at 9:16 AM jian he <jian.universality@gmail.com> wrote: > > On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Hi, > > > > I've read this thread and the latest patch. IIUC with SAVE_ERROR > > option, COPY FROM creates an error table for the target table and > > writes error information there. > > > > While I agree that the final shape of this feature would be something > > like that design, I'm concerned some features are missing in order to > > make this feature useful in practice. For instance, error logs are > > inserted to error tables without bounds, meaning that users who want > > to tolerate errors during COPY FROM will have to truncate or drop the > > error tables periodically, or the database will grow with error logs > > without limit. Ideally such maintenance work should be done by the > > database. There might be some users who want to log such conversion > > errors in server logs to avoid such maintenance work. I think we > > should provide an option for where to write, at least. Also, since the > > error tables are normal user tables internally, error logs are also > > replicated to subscribers if there is a publication FOR ALL TABLES, > > unlike system catalogs. I think some users would not like such > > behavior. > > save the error metadata to system catalogs would be more expensive, > please see below explanation. > I have no knowledge of publications. > but i feel there is a feature request: publication FOR ALL TABLES > exclude regex_pattern. > Anyway, that would be another topic. I don't think the new regex idea would be a good solution for the existing users who are using FOR ALL TABLES publication. It's not desirable that they have to change the publication because of this feature. With the current patch, a logical replication using FOR ALL TABLES publication will stop immediately after an error information is inserted into a new error table unless the same error table is created on subscribers. > > > Looking at SAVE_ERROR feature closely, I think it consists of two > > separate features. That is, it enables COPY FROM to load data while > > (1) tolerating errors and (2) logging errors to somewhere (i.e., an > > error table). If we implement only (1), it would be like COPY FROM > > tolerate errors infinitely and log errors to /dev/null. The user > > cannot see the error details but I guess it could still help some > > cases as Andres mentioned[1] (it might be a good idea to send the > > number of rows successfully loaded in a NOTICE message if some rows > > could not be loaded). Then with (2), COPY FROM can log error > > information to somewhere such as tables and server logs and the user > > can select it. So I'm thinking we may be able to implement this > > feature incrementally. The first step would be something like an > > option to ignore all errors or an option to specify the maximum number > > of errors to tolerate before raising an ERROR. The second step would > > be to support logging destinations such as server logs and tables. > > > > Regards, > > > > [1] https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de > > > > -- > > Masahiko Sawada > > Amazon Web Services: https://aws.amazon.com > > > feature incrementally. The first step would be something like an > > option to ignore all errors or an option to specify the maximum number > > of errors to tolerate before raising an ERROR. 
The second step would > > I don't think "specify the maximum number of errors to tolerate > before raising an ERROR." is very useful.... > > QUOTE from [1] > MAXERROR [AS] error_count > If the load returns the error_count number of errors or greater, the > load fails. If the load returns fewer errors, it continues and returns > an INFO message that states the number of rows that could not be > loaded. Use this parameter to allow loads to continue when certain > rows fail to load into the table because of formatting errors or other > inconsistencies in the data. > Set this value to 0 or 1 if you want the load to fail as soon as the > first error occurs. The AS keyword is optional. The MAXERROR default > value is 0 and the limit is 100000. > The actual number of errors reported might be greater than the > specified MAXERROR because of the parallel nature of Amazon Redshift. > If any node in the Amazon Redshift cluster detects that MAXERROR has > been exceeded, each node reports all of the errors it has encountered. > END OF QUOTE > > option MAXERROR error_count. iiuc, it fails while validating line > error_count + 1, else it raises a notice, tells you how many rows have > errors. > > * case when error_count is small, and the copy fails, it only tells > you that at least the error_count line has malformed data. but what if > the actual malformed rows are very big. In this case, this failure > error message is not that helpful. > * case when error_count is very big, and the copy does not fail. then > the actual malformed data rows are very big (still less than > error_count). but there is no error report, you don't know which line > has an error. > > Either way, if the file has a large portion of malformed rows, then > the MAXERROR option does not make sense. > so maybe we don't need a threshold for tolerating errors. > > however, we can have an option, not actually copy to the table, but > only validate, similar to NOLOAD in [1] I'm fine even if the feature is not like MAXERROR. If we want a feature to tolerate errors during COPY FROM, I just thought it might be a good idea to have a tuning knob for better flexibility, not just like a on/off switch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Dec 18, 2023 at 4:41 PM jian he <jian.universality@gmail.com> wrote: > > On Mon, Dec 18, 2023 at 1:09 PM torikoshia <torikoshia@oss.nttdata.com> wrote: > > > > Hi, > > > > > save the error metadata to system catalogs would be more expensive, > > > please see below explanation. > > > I have no knowledge of publications. > > > but i feel there is a feature request: publication FOR ALL TABLES > > > exclude regex_pattern. > > > Anyway, that would be another topic. > > > > I think saving error metadata to system catalog is not a good idea, too. > > And I believe Sawada-san just pointed out missing features and did not > > suggested that we use system catalog. > > > > > I don't think "specify the maximum number of errors to tolerate > > > before raising an ERROR." is very useful.... > > > > That may be so. > > I imagine it's useful in some use case since some loading tools have > > such options. > > Anyway I agree it's not necessary for initial patch as mentioned in [1]. > > > > > I suppose we can specify an ERRORFILE directory. similar > > > implementation [2], demo in [3] > > > it will generate 2 files, one file shows the malform line content as > > > is, another file shows the error info. > > > > That may be a good option when considering "(2) logging errors to > > somewhere". > > > > What do you think about the proposal to develop these features in > > incrementally? > > > > I am more with tom's idea [1], that is when errors happen (data type > conversion only), do not fail, AND we save the error to a table. I > guess we can implement this logic together, only with a new COPY > option. If we want only such a feature we need to implement it together (the patch could be split, though). But if some parts of the feature are useful for users as well, I'd recommend implementing it incrementally. That way, the patches can get small and it would be easy for reviewers and committers to review/commit them. > > imagine a case (it's not that contrived, imho), while conversion from > text to table's int, postgres isspace is different from the source > text file's isspace logic. > then all the lines are malformed. If we just say on error continue and > not save error meta info, the user is still confused which field has > the wrong data, then the user will probably try to incrementally test > which field contains malformed data. > > Since we need to save the error somewhere. > Everyone has the privilege to INSERT can do COPY. > I think we also need to handle the access privilege also. > So like I mentioned above, one copy_error error table hub, then > everyone can view/select their own copy failure record. The error table hub idea is still unclear to me. I assume that there are error tables at least on each database. And an error table can have error data that happened during COPY FROM, including malformed lines. Do the error tables grow without bounds and the users have to delete rows at some point? If so, who can do that? How can we achieve that the users can see only errored rows they generated? And the issue with logical replication also needs to be resolved. Anyway, if we go this direction, we need to discuss the overall design. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Dec 19, 2023 at 9:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > The error table hub idea is still unclear to me. I assume that there > are error tables at least on each database. And an error table can > have error data that happened during COPY FROM, including malformed > lines. Do the error tables grow without bounds and the users have to > delete rows at some point? If so, who can do that? How can we achieve > that the users can see only errored rows they generated? And the issue > with logical replication also needs to be resolved. Anyway, if we go > this direction, we need to discuss the overall design. > > Regards, > > -- > Masahiko Sawada > Amazon Web Services: https://aws.amazon.com Please check my latest attached POC. The main content is building the SPI query, executing the SPI query, and the regress test and regress output. There is one copy_errors table per schema. foo.copy_errors will be owned by the owner of schema foo. If you can insert into a table in that specific schema, let's say foo, then you get the privilege to INSERT/DELETE/SELECT on foo.copy_errors. If you are not a superuser, you are only allowed to INSERT/DELETE/SELECT foo.copy_errors rows where USERID = current_user::regrole::oid. This is done via row level security. Since foo.copy_errors mostly receives INSERT operations, if copy_errors grows too much it means your source file has many errors, and it will take a very long time to finish the whole COPY; maybe we can watch how many errors have been encountered from another client. I don't know how to deal with logical replication. Looking for ideas.
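To make the row-level-security part of that POC concrete, a rough sketch of the kind of setup it describes might look as follows (table and policy names are illustrative; the POC spells the ownership check as USERID = current_user::regrole::oid, while the sketch below uses a pg_roles lookup to stay plain SQL):

-- one error table per schema, owned by the schema owner
ALTER TABLE foo.copy_errors ENABLE ROW LEVEL SECURITY;

-- non-superusers may only see and modify the rows they produced themselves
CREATE POLICY copy_errors_own_rows ON foo.copy_errors
    FOR ALL
    USING (userid = (SELECT oid FROM pg_roles WHERE rolname = current_user))
    WITH CHECK (userid = (SELECT oid FROM pg_roles WHERE rolname = current_user));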
Attachment
On Wed, Dec 20, 2023 at 1:07 PM jian he <jian.universality@gmail.com> wrote: > > On Tue, Dec 19, 2023 at 9:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > The error table hub idea is still unclear to me. I assume that there > > are error tables at least on each database. And an error table can > > have error data that happened during COPY FROM, including malformed > > lines. Do the error tables grow without bounds and the users have to > > delete rows at some point? If so, who can do that? How can we achieve > > that the users can see only errored rows they generated? And the issue > > with logical replication also needs to be resolved. Anyway, if we go > > this direction, we need to discuss the overall design. > > > > Regards, > > > > -- > > Masahiko Sawada > > Amazon Web Services: https://aws.amazon.com > > Please check my latest attached POC. > Main content is to build spi query, execute the spi query, regress > test and regress output. Why do we need to use SPI? I think we can form heap tuples and insert them to the error table. Creating the error table also doesn't need to use SPI. > > copy_errors one per schema. > foo.copy_errors will be owned by the schema: foo owner. It seems that the error table is created when the SAVE_ERROR is used for the first time. It probably blocks concurrent COPY FROM commands with SAVE_ERROR option to different tables if the error table is not created yet. > > if you can insert to a table in that specific schema let's say foo, > then you will get privilege to INSERT/DELETE/SELECT > to foo.copy_errors. > If you are not a superuser, you are only allowed to do > INSERT/DELETE/SELECT on foo.copy_errors rows where USERID = > current_user::regrole::oid. > This is done via row level security. I don't think it works. If the user is dropped, the user's oid could be reused for a different user. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Dec 20, 2023 at 8:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Why do we need to use SPI? I think we can form heap tuples and insert > them to the error table. Creating the error table also doesn't need to > use SPI. > Thanks for pointing it out. I figured out how to form heap tuples and insert them into the error table, but I don't know how to create the error table without using SPI. Please point it out. > > > > copy_errors one per schema. > > foo.copy_errors will be owned by the schema: foo owner. > > It seems that the error table is created when the SAVE_ERROR is used > for the first time. It probably blocks concurrent COPY FROM commands > with SAVE_ERROR option to different tables if the error table is not > created yet. > I don't know how to solve this problem.... Maybe we can document it, but it will block the COPY FROM immediately. > > > > if you can insert to a table in that specific schema let's say foo, > > then you will get privilege to INSERT/DELETE/SELECT > > to foo.copy_errors. > > If you are not a superuser, you are only allowed to do > > INSERT/DELETE/SELECT on foo.copy_errors rows where USERID = > > current_user::regrole::oid. > > This is done via row level security. > > I don't think it works. If the user is dropped, the user's oid could > be reused for a different user. > You are right. So I changed it: now the schema owner will be the error table owner. For every error table tuple insert, I switch to the schema owner, do the insert, then switch back to the user running the COPY FROM. Now everyone (except superusers) will need an explicit grant to access the error table.
Attachment
On Thu, 28 Dec 2023 at 09:27, jian he <jian.universality@gmail.com> wrote: > > On Wed, Dec 20, 2023 at 8:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > Why do we need to use SPI? I think we can form heap tuples and insert > > them to the error table. Creating the error table also doesn't need to > > use SPI. > > > Thanks for pointing it out. I figured out how to form heap tuples and > insert them to the error table. > but I don't know how to create the error table without using SPI. > Please pointer it out. > > > > > > > copy_errors one per schema. > > > foo.copy_errors will be owned by the schema: foo owner. > > > > It seems that the error table is created when the SAVE_ERROR is used > > for the first time. It probably blocks concurrent COPY FROM commands > > with SAVE_ERROR option to different tables if the error table is not > > created yet. > > > I don't know how to solve this problem.... Maybe we can document this. > but it will block the COPY FROM immediately. > > > > > > > if you can insert to a table in that specific schema let's say foo, > > > then you will get privilege to INSERT/DELETE/SELECT > > > to foo.copy_errors. > > > If you are not a superuser, you are only allowed to do > > > INSERT/DELETE/SELECT on foo.copy_errors rows where USERID = > > > current_user::regrole::oid. > > > This is done via row level security. > > > > I don't think it works. If the user is dropped, the user's oid could > > be reused for a different user. > > > > You are right. > so I changed, now the schema owner will be the error table owner. > every error table tuple inserts, > I switch to schema owner, do the insert, then switch back to the > COPY_FROM operation user. > now everyone (except superuser) will need explicit grant to access the > error table. There are some compilation issues reported at [1] for the patch: [04:04:26.288] copyfromparse.c: In function ‘NextCopyFrom’: [04:04:26.288] copyfromparse.c:1126:25: error: ‘copy_errors_tupDesc’ may be used uninitialized in this function [-Werror=maybe-uninitialized] [04:04:26.288] 1126 | copy_errors_tup = heap_form_tuple(copy_errors_tupDesc, [04:04:26.288] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [04:04:26.288] 1127 | t_values, [04:04:26.288] | ~~~~~~~~~ [04:04:26.288] 1128 | t_isnull); [04:04:26.288] | ~~~~~~~~~ [04:04:26.288] copyfromparse.c:1160:4: error: ‘copy_errorsrel’ may be used uninitialized in this function [-Werror=maybe-uninitialized] [04:04:26.288] 1160 | table_close(copy_errorsrel, RowExclusiveLock); [04:04:26.288] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [1] - https://cirrus-ci.com/task/4785221183209472 Regards, Vignesh
On Fri, Jan 5, 2024 at 12:05 AM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 28 Dec 2023 at 09:27, jian he <jian.universality@gmail.com> wrote: > > > > On Wed, Dec 20, 2023 at 8:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > Why do we need to use SPI? I think we can form heap tuples and insert > > > them to the error table. Creating the error table also doesn't need to > > > use SPI. > > > > > Thanks for pointing it out. I figured out how to form heap tuples and > > insert them to the error table. > > but I don't know how to create the error table without using SPI. > > Please pointer it out. > > > > > > > > > > copy_errors one per schema. > > > > foo.copy_errors will be owned by the schema: foo owner. > > > > > > It seems that the error table is created when the SAVE_ERROR is used > > > for the first time. It probably blocks concurrent COPY FROM commands > > > with SAVE_ERROR option to different tables if the error table is not > > > created yet. > > > > > I don't know how to solve this problem.... Maybe we can document this. > > but it will block the COPY FROM immediately. > > > > > > > > > > if you can insert to a table in that specific schema let's say foo, > > > > then you will get privilege to INSERT/DELETE/SELECT > > > > to foo.copy_errors. > > > > If you are not a superuser, you are only allowed to do > > > > INSERT/DELETE/SELECT on foo.copy_errors rows where USERID = > > > > current_user::regrole::oid. > > > > This is done via row level security. > > > > > > I don't think it works. If the user is dropped, the user's oid could > > > be reused for a different user. > > > > > > > You are right. > > so I changed, now the schema owner will be the error table owner. > > every error table tuple inserts, > > I switch to schema owner, do the insert, then switch back to the > > COPY_FROM operation user. > > now everyone (except superuser) will need explicit grant to access the > > error table. > > There are some compilation issues reported at [1] for the patch: > [04:04:26.288] copyfromparse.c: In function ‘NextCopyFrom’: > [04:04:26.288] copyfromparse.c:1126:25: error: ‘copy_errors_tupDesc’ > may be used uninitialized in this function > [-Werror=maybe-uninitialized] > [04:04:26.288] 1126 | copy_errors_tup = heap_form_tuple(copy_errors_tupDesc, > [04:04:26.288] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > [04:04:26.288] 1127 | t_values, > [04:04:26.288] | ~~~~~~~~~ > [04:04:26.288] 1128 | t_isnull); > [04:04:26.288] | ~~~~~~~~~ > [04:04:26.288] copyfromparse.c:1160:4: error: ‘copy_errorsrel’ may be > used uninitialized in this function [-Werror=maybe-uninitialized] > [04:04:26.288] 1160 | table_close(copy_errorsrel, RowExclusiveLock); > [04:04:26.288] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > [1] - https://cirrus-ci.com/task/4785221183209472 > I fixed this issue, and also improved the doc. Other implementations have not changed.
Attachment
On Fri, Jan 5, 2024 at 4:37 PM jian he <jian.universality@gmail.com> wrote: > > > > > be reused for a different user. > > > > > > > > > > You are right. > > > so I changed, now the schema owner will be the error table owner. > > > every error table tuple inserts, > > > I switch to schema owner, do the insert, then switch back to the > > > COPY_FROM operation user. > > > now everyone (except superuser) will need explicit grant to access the > > > error table. > > > > There are some compilation issues reported at [1] for the patch: > > [04:04:26.288] copyfromparse.c: In function ‘NextCopyFrom’: > > [04:04:26.288] copyfromparse.c:1126:25: error: ‘copy_errors_tupDesc’ > > may be used uninitialized in this function > > [-Werror=maybe-uninitialized] > > [04:04:26.288] 1126 | copy_errors_tup = heap_form_tuple(copy_errors_tupDesc, > > [04:04:26.288] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > [04:04:26.288] 1127 | t_values, > > [04:04:26.288] | ~~~~~~~~~ > > [04:04:26.288] 1128 | t_isnull); > > [04:04:26.288] | ~~~~~~~~~ > > [04:04:26.288] copyfromparse.c:1160:4: error: ‘copy_errorsrel’ may be > > used uninitialized in this function [-Werror=maybe-uninitialized] > > [04:04:26.288] 1160 | table_close(copy_errorsrel, RowExclusiveLock); > > [04:04:26.288] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > [1] - https://cirrus-ci.com/task/4785221183209472 > > > > I fixed this issue, and also improved the doc. > Other implementations have not changed. bother again. This time, I used the ci test it again. now there should be no warning.
Attachment
On Tue, Dec 19, 2023 at 10:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > If we want only such a feature we need to implement it together (the > patch could be split, though). But if some parts of the feature are > useful for users as well, I'd recommend implementing it incrementally. > That way, the patches can get small and it would be easy for reviewers > and committers to review/commit them. Jian, what do you think about this comment? Looking back at the discussion so far, it seems that not everyone thinks saving table information is the best idea[1] and some people think just skipping error data is useful.[2] Since there are design issues to be considered, such as the treatment of physical/logical replication, putting error information into a table is likely to take time for consensus building and development. Wouldn't it be better to follow the following advice and develop the functionality incrementally? On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote: > So I'm thinking we may be able to implement this > feature incrementally. The first step would be something like an > option to ignore all errors or an option to specify the maximum number > of errors to tolerate before raising an ERROR. The second step would > be to support logging destinations such as server logs and tables. Attached is a patch for this "first step", made with reference to the v7 patch, which logged errors and is simpler than the latest one. - This patch adds a new option SAVE_ERROR_TO, but currently it only supports 'none', which means error data is simply skipped. It is expected to also support 'log' and 'table'. - This patch skips just soft errors and doesn't handle other errors such as missing column data. BTW I have a question and a comment about the v15 patch: > + { > + /* > + * > + * InputFunctionCall is more faster than > InputFunctionCallSafe. > + * > + */ Have you measured this? When I tested it in an older patch, there was no big difference[3]. > - SAVEPOINT SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P SECURITY SELECT > + SAVEPOINT SAVE_ERROR SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P SECURITY SELECT There was a comment that we shouldn't add a new keyword for this[4]. I left these points as they were in the v7 patch. [1] https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de [2] https://www.postgresql.org/message-id/CAD21AoCeEOBN49fu43e6tBTynnswugA3oZ5AZvLeyDCpxpCXPg%40mail.gmail.com [3] https://www.postgresql.org/message-id/19551e8c2717c24689913083f841ddb5%40oss.nttdata.com [4] https://www.postgresql.org/message-id/20230322175000.qbdctk7bnmifh5an%40awork3.anarazel.de -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
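For reference, usage with the option as described in that patch would look roughly like the following (table and file names are just placeholders):

-- today's behaviour: COPY stops at the first malformed row
COPY mytable FROM '/tmp/data.csv' WITH (FORMAT csv);

-- with the patch: soft errors (data type conversion failures) are tolerated
-- and the offending rows are simply not loaded
COPY mytable FROM '/tmp/data.csv' WITH (FORMAT csv, SAVE_ERROR_TO none);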
Attachment
On Tue, Jan 9, 2024 at 11:36 PM torikoshia <torikoshia@oss.nttdata.com> wrote: > > On Tue, Dec 19, 2023 at 10:14 AM Masahiko Sawada <sawada.mshk@gmail.com> > wrote: > > If we want only such a feature we need to implement it together (the > > patch could be split, though). But if some parts of the feature are > > useful for users as well, I'd recommend implementing it incrementally. > > That way, the patches can get small and it would be easy for reviewers > > and committers to review/commit them. > > Jian, how do you think this comment? > > Looking back at the discussion so far, it seems that not everyone thinks > saving table information is the best idea[1] and some people think just > skipping error data is useful.[2] > > Since there are issues to be considered from the design such as > physical/logical replication treatment, putting error information to > table is likely to take time for consensus building and development. > > Wouldn't it be better to follow the following advice and develop the > functionality incrementally? Yeah, I'm still thinking it's better to implement this feature incrementally. Given we're closing to feature freeze, I think it's unlikely to get the whole feature into PG17 since there are still many design discussions we need in addition to what Torikoshi-san pointed out. The feature like "ignore errors" or "logging errors" would have higher possibilities. Even if we get only these parts of the whole "error table" feature into PG17, it will make it much easier to implement "error tables" feature. > > On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada > <sawada(dot)mshk(at)gmail(dot)com> wrote: > > So I'm thinking we may be able to implement this > > feature incrementally. The first step would be something like an > > option to ignore all errors or an option to specify the maximum number > > of errors to tolerate before raising an ERROR. The second step would > > be to support logging destinations such as server logs and tables. > > > Attached a patch for this "first step" with reference to v7 patch, which > logged errors and simpler than latest one. > - This patch adds new option SAVE_ERROR_TO, but currently only supports > 'none', which means just skips error data. It is expected to support > 'log' and 'table'. > - This patch Skips just soft errors and don't handle other errors such > as missing column data. Seems promising. I'll look at the patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Jan 9, 2024 at 10:36 PM torikoshia <torikoshia@oss.nttdata.com> wrote: > > On Tue, Dec 19, 2023 at 10:14 AM Masahiko Sawada <sawada.mshk@gmail.com> > wrote: > > If we want only such a feature we need to implement it together (the > > patch could be split, though). But if some parts of the feature are > > useful for users as well, I'd recommend implementing it incrementally. > > That way, the patches can get small and it would be easy for reviewers > > and committers to review/commit them. > > Jian, how do you think this comment? > > Looking back at the discussion so far, it seems that not everyone thinks > saving table information is the best idea[1] and some people think just > skipping error data is useful.[2] > > Since there are issues to be considered from the design such as > physical/logical replication treatment, putting error information to > table is likely to take time for consensus building and development. > > Wouldn't it be better to follow the following advice and develop the > functionality incrementally? > > On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada > <sawada(dot)mshk(at)gmail(dot)com> wrote: > > So I'm thinking we may be able to implement this > > feature incrementally. The first step would be something like an > > option to ignore all errors or an option to specify the maximum number > > of errors to tolerate before raising an ERROR. The second step would > > be to support logging destinations such as server logs and tables. > > > Attached a patch for this "first step" with reference to v7 patch, which > logged errors and simpler than latest one. > - This patch adds new option SAVE_ERROR_TO, but currently only supports > 'none', which means just skips error data. It is expected to support > 'log' and 'table'. > - This patch Skips just soft errors and don't handle other errors such > as missing column data. Hi. I made the following change based on your patch (v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch) * when specified SAVE_ERROR_TO, move the initialization of ErrorSaveContext to the function BeginCopyFrom. I think that's the right place to initialize struct CopyFromState field. * I think your patch when there are N rows have malformed data, then it will initialize N ErrorSaveContext. In the struct CopyFromStateData, I changed it to ErrorSaveContext *escontext. So if an error occurred, you can just set the escontext accordingly. * doc: mention "If this option is omitted, <command>COPY</command> stops operation at the first error." * Since we only support 'none' for now, 'none' means we don't want ErrorSaveContext metadata, so we should set cstate->escontext->details_wanted to false. > BTW I have question and comment about v15 patch: > > > + { > > + /* > > + * > > + * InputFunctionCall is more faster than > > InputFunctionCallSafe. > > + * > > + */ > > Have you measured this? > When I tested it in an older patch, there were no big difference[3]. Thanks for pointing it out, I probably was over thinking. > > - SAVEPOINT SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P SECURITY > SELECT > > + SAVEPOINT SAVE_ERROR SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P > SECURITY SELECT > > There was a comment that we shouldn't add new keyword for this[4]. > Thanks for pointing it out.
Attachment
On Wed, Jan 10, 2024 at 4:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Yeah, I'm still thinking it's better to implement this feature > incrementally. Given we're closing to feature freeze, I think it's > unlikely to get the whole feature into PG17 since there are still many > design discussions we need in addition to what Torikoshi-san pointed > out. The feature like "ignore errors" or "logging errors" would have > higher possibilities. Even if we get only these parts of the whole > "error table" feature into PG17, it will make it much easier to implement "error tables" feature. +1. I'm also going to make patch for "logging errors", since this functionality is isolated from v7 patch. > Seems promising. I'll look at the patch. Thanks a lot! Sorry to attach v2 if you already reviewed v1.. On 2024-01-11 12:13, jian he wrote: > On Tue, Jan 9, 2024 at 10:36 PM torikoshia <torikoshia@oss.nttdata.com> > wrote: >> >> On Tue, Dec 19, 2023 at 10:14 AM Masahiko Sawada >> <sawada.mshk@gmail.com> >> wrote: >> > If we want only such a feature we need to implement it together (the >> > patch could be split, though). But if some parts of the feature are >> > useful for users as well, I'd recommend implementing it incrementally. >> > That way, the patches can get small and it would be easy for reviewers >> > and committers to review/commit them. >> >> Jian, how do you think this comment? >> >> Looking back at the discussion so far, it seems that not everyone >> thinks >> saving table information is the best idea[1] and some people think >> just >> skipping error data is useful.[2] >> >> Since there are issues to be considered from the design such as >> physical/logical replication treatment, putting error information to >> table is likely to take time for consensus building and development. >> >> Wouldn't it be better to follow the following advice and develop the >> functionality incrementally? >> >> On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada >> <sawada(dot)mshk(at)gmail(dot)com> wrote: >> > So I'm thinking we may be able to implement this >> > feature incrementally. The first step would be something like an >> > option to ignore all errors or an option to specify the maximum number >> > of errors to tolerate before raising an ERROR. The second step would >> > be to support logging destinations such as server logs and tables. >> >> >> Attached a patch for this "first step" with reference to v7 patch, >> which >> logged errors and simpler than latest one. >> - This patch adds new option SAVE_ERROR_TO, but currently only >> supports >> 'none', which means just skips error data. It is expected to support >> 'log' and 'table'. >> - This patch Skips just soft errors and don't handle other errors such >> as missing column data. > > Hi. > I made the following change based on your patch > (v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch) > > * when specified SAVE_ERROR_TO, move the initialization of > ErrorSaveContext to the function BeginCopyFrom. > I think that's the right place to initialize struct CopyFromState > field. > * I think your patch when there are N rows have malformed data, then it > will initialize N ErrorSaveContext. > In the struct CopyFromStateData, I changed it to ErrorSaveContext > *escontext. > So if an error occurred, you can just set the escontext accordingly. > * doc: mention "If this option is omitted, <command>COPY</command> > stops operation at the first error." 
> * Since we only support 'none' for now, 'none' means we don't want > ErrorSaveContext metadata, > so we should set cstate->escontext->details_wanted to false. > >> BTW I have question and comment about v15 patch: >> >> > + { >> > + /* >> > + * >> > + * InputFunctionCall is more faster than >> > InputFunctionCallSafe. >> > + * >> > + */ >> >> Have you measured this? >> When I tested it in an older patch, there were no big difference[3]. > Thanks for pointing it out, I probably was over thinking. > >> > - SAVEPOINT SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P >> SECURITY >> SELECT >> > + SAVEPOINT SAVE_ERROR SCALAR SCHEMA SCHEMAS SCROLL SEARCH >> SECOND_P >> SECURITY SELECT >> >> There was a comment that we shouldn't add new keyword for this[4]. >> > Thanks for pointing it out. Thanks for reviewing! Updated the patch merging your suggestions except below points: > + cstate->num_errors = 0; Since cstate is already initialized in below lines, this may be redundant. | /* Allocate workspace and zero all fields */ | cstate = (CopyFromStateData *) palloc0(sizeof(CopyFromStateData)); > + Assert(!cstate->escontext->details_wanted); I'm not sure this is necessary, considering we're going to add other options like 'table' and 'log', which need details_wanted soon. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Attachment
On Fri, Jan 12, 2024 at 10:59 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > > Thanks for reviewing! > > Updated the patch merging your suggestions except below points: > > > + cstate->num_errors = 0; > > Since cstate is already initialized in below lines, this may be > redundant. > > | /* Allocate workspace and zero all fields */ > | cstate = (CopyFromStateData *) palloc0(sizeof(CopyFromStateData)); > > > > + Assert(!cstate->escontext->details_wanted); > > I'm not sure this is necessary, considering we're going to add other > options like 'table' and 'log', which need details_wanted soon. > > > -- > Regards, Made the save_error_to option unusable with COPY TO, and added a (redundant) test for save_error_to with COPY TO.
Attachment
Hi! I think this is a much-demanded and long-awaited feature. The thread is pretty long, but mostly it consists of disputes about how to save the errors. The present patch includes the basic infrastructure and the ability to ignore errors, so it's pretty simple. On Sat, Jan 13, 2024 at 4:20 PM jian he <jian.universality@gmail.com> wrote: > On Fri, Jan 12, 2024 at 10:59 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > > > > > Thanks for reviewing! > > > > Updated the patch merging your suggestions except below points: > > > > > + cstate->num_errors = 0; > > > > Since cstate is already initialized in below lines, this may be > > redundant. > > > > | /* Allocate workspace and zero all fields */ > > | cstate = (CopyFromStateData *) palloc0(sizeof(CopyFromStateData)); > > > > > > > + Assert(!cstate->escontext->details_wanted); > > > > I'm not sure this is necessary, considering we're going to add other > > options like 'table' and 'log', which need details_wanted soon. > > > > > > -- > > Regards, > > make save_error_to option cannot be used with COPY TO. > add redundant test, save_error_to with COPY TO test. I've incorporated these changes. Also, I've changed CopyFormatOptions.save_error_to to an enum and made some edits in the comments and the commit message. I'm going to push this if there are no objections. ------ Regards, Alexander Korotkov
Attachment
On Sun, Jan 14, 2024 at 10:30 AM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > Hi! > > I think this is a demanding and long-waited feature. The thread is > pretty long, but mostly it was disputes about how to save the errors. > The present patch includes basic infrastructure and ability to ignore > errors, thus it's pretty simple. > > On Sat, Jan 13, 2024 at 4:20 PM jian he <jian.universality@gmail.com> wrote: > > On Fri, Jan 12, 2024 at 10:59 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > > > > > > > > Thanks for reviewing! > > > > > > Updated the patch merging your suggestions except below points: > > > > > > > + cstate->num_errors = 0; > > > > > > Since cstate is already initialized in below lines, this may be > > > redundant. > > > > > > | /* Allocate workspace and zero all fields */ > > > | cstate = (CopyFromStateData *) palloc0(sizeof(CopyFromStateData)); > > > > > > > > > > + Assert(!cstate->escontext->details_wanted); > > > > > > I'm not sure this is necessary, considering we're going to add other > > > options like 'table' and 'log', which need details_wanted soon. > > > > > > > > > -- > > > Regards, > > > > make save_error_to option cannot be used with COPY TO. > > add redundant test, save_error_to with COPY TO test. > > I've incorporated these changes. Also, I've changed > CopyFormatOptions.save_error_to to enum and made some edits in > comments and the commit message. I'm going to push this if there are > no objections. Thank you for updating the patch. Here are two comments: --- + if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED && + cstate->num_errors > 0) + ereport(WARNING, + errmsg("%zd rows were skipped due to data type incompatibility", + cstate->num_errors)); + /* Done, clean up */ error_context_stack = errcallback.previous; If a malformed input is not the last data, the context message seems odd: postgres(1:1769258)=# create table test (a int); CREATE TABLE postgres(1:1769258)=# copy test from stdin (save_error_to none); Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal. >> a >> 1 >> 2024-01-15 05:05:53.980 JST [1769258] WARNING: 1 rows were skipped due to data type incompatibility 2024-01-15 05:05:53.980 JST [1769258] CONTEXT: COPY test, line 3: "" COPY 1 I think it's better to report the WARNING after resetting the error_context_stack. Or is a WARNING really appropriate here? The v15-0001-Make-COPY-FROM-more-error-tolerant.patch[1] uses NOTICE but the v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch[2] changes it to WARNING without explanation. --- +-- test missing data: should fail +COPY check_ign_err FROM STDIN WITH (save_error_to none); +1 {1} +\. We might want to cover the extra data cases too. Regards, [1] https://www.postgresql.org/message-id/CACJufxEkkqnozdnvNMGxVAA94KZaCPkYw_Cx4JKG9ueNaZma_A%40mail.gmail.com [2] https://www.postgresql.org/message-id/3d0b349ddbd4ae5f605f77b491697158%40oss.nttdata.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
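As a sketch, an extra-data counterpart of the quoted missing-data test could look like this (it assumes check_ign_err has three columns, e.g. (n int, m int[], k int), to match the existing example; like missing data, extra data is a hard error and should still fail even with the option set):

-- test extra data: should fail
COPY check_ign_err FROM STDIN WITH (save_error_to none);
1	{1}	3	extra
\.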
On Sun, Jan 14, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Thank you for updating the patch. Here are two comments: > > --- > + if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED && > + cstate->num_errors > 0) > + ereport(WARNING, > + errmsg("%zd rows were skipped due to data type incompatibility", > + cstate->num_errors)); > + > /* Done, clean up */ > error_context_stack = errcallback.previous; > > If a malformed input is not the last data, the context message seems odd: > > postgres(1:1769258)=# create table test (a int); > CREATE TABLE > postgres(1:1769258)=# copy test from stdin (save_error_to none); > Enter data to be copied followed by a newline. > End with a backslash and a period on a line by itself, or an EOF signal. > >> a > >> 1 > >> > 2024-01-15 05:05:53.980 JST [1769258] WARNING: 1 rows were skipped > due to data type incompatibility > 2024-01-15 05:05:53.980 JST [1769258] CONTEXT: COPY test, line 3: "" > COPY 1 > > I think it's better to report the WARNING after resetting the > error_context_stack. Or is a WARNING really appropriate here? The > v15-0001-Make-COPY-FROM-more-error-tolerant.patch[1] uses NOTICE but > the v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch[2] changes it to > WARNING without explanation. Thank you for noticing this. I think NOTICE is more appropriate here. There is nothing to "worry" about: the user asked to ignore the errors and we did. And yes, it doesn't make sense to use the last line as the context. Fixed. > --- > +-- test missing data: should fail > +COPY check_ign_err FROM STDIN WITH (save_error_to none); > +1 {1} > +\. > > We might want to cover the extra data cases too. Agreed, the relevant test is added. ------ Regards, Alexander Korotkov
Attachment
On Mon, Jan 15, 2024 at 8:21 AM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > On Sun, Jan 14, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Thank you for updating the patch. Here are two comments: > > > > --- > > + if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED && > > + cstate->num_errors > 0) > > + ereport(WARNING, > > + errmsg("%zd rows were skipped due to data type incompatibility", > > + cstate->num_errors)); > > + > > /* Done, clean up */ > > error_context_stack = errcallback.previous; > > > > If a malformed input is not the last data, the context message seems odd: > > > > postgres(1:1769258)=# create table test (a int); > > CREATE TABLE > > postgres(1:1769258)=# copy test from stdin (save_error_to none); > > Enter data to be copied followed by a newline. > > End with a backslash and a period on a line by itself, or an EOF signal. > > >> a > > >> 1 > > >> > > 2024-01-15 05:05:53.980 JST [1769258] WARNING: 1 rows were skipped > > due to data type incompatibility > > 2024-01-15 05:05:53.980 JST [1769258] CONTEXT: COPY test, line 3: "" > > COPY 1 > > > > I think it's better to report the WARNING after resetting the > > error_context_stack. Or is a WARNING really appropriate here? The > > v15-0001-Make-COPY-FROM-more-error-tolerant.patch[1] uses NOTICE but > > the v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch[2] changes it to > > WARNING without explanation. > > Thank you for noticing this. I think NOTICE is more appropriate here. > There is nothing to "worry" about: the user asked to ignore the errors > and we did. And yes, it doesn't make sense to use the last line as > the context. Fixed. > > > --- > > +-- test missing data: should fail > > +COPY check_ign_err FROM STDIN WITH (save_error_to none); > > +1 {1} > > +\. > > > > We might want to cover the extra data cases too. > > Agreed, the relevant test is added. Thank you for updating the patch. I have one minor point: + if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED && + cstate->num_errors > 0) + ereport(NOTICE, + errmsg("%zd rows were skipped due to data type incompatibility", + cstate->num_errors)); + We can use errmsg_plural() instead. I have a question about the option values; do you think we need to have another value of SAVE_ERROR_TO option to explicitly specify the current default behavior, i.e. not accept any error? With the v4 patch, the user needs to omit SAVE_ERROR_TO option to accept errors during COPY FROM. If we change the default behavior in the future, many users will be affected and probably end up changing their applications to keep the current default behavior. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
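To illustrate the question, if such a value existed the two behaviours could be spelled out explicitly like this (the value name 'error' is purely a placeholder here, not something the patch defines):

-- hypothetical explicit spelling of the current default: stop at the first error
COPY test FROM STDIN WITH (save_error_to 'error');

-- behaviour that currently has to be requested by naming the option: skip malformed rows
COPY test FROM STDIN WITH (save_error_to none);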
On Mon, Jan 15, 2024 at 8:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jan 15, 2024 at 8:21 AM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > > > On Sun, Jan 14, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Thank you for updating the patch. Here are two comments: > > > > > > --- > > > + if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED && > > > + cstate->num_errors > 0) > > > + ereport(WARNING, > > > + errmsg("%zd rows were skipped due to data type incompatibility", > > > + cstate->num_errors)); > > > + > > > /* Done, clean up */ > > > error_context_stack = errcallback.previous; > > > > > > If a malformed input is not the last data, the context message seems odd: > > > > > > postgres(1:1769258)=# create table test (a int); > > > CREATE TABLE > > > postgres(1:1769258)=# copy test from stdin (save_error_to none); > > > Enter data to be copied followed by a newline. > > > End with a backslash and a period on a line by itself, or an EOF signal. > > > >> a > > > >> 1 > > > >> > > > 2024-01-15 05:05:53.980 JST [1769258] WARNING: 1 rows were skipped > > > due to data type incompatibility > > > 2024-01-15 05:05:53.980 JST [1769258] CONTEXT: COPY test, line 3: "" > > > COPY 1 > > > > > > I think it's better to report the WARNING after resetting the > > > error_context_stack. Or is a WARNING really appropriate here? The > > > v15-0001-Make-COPY-FROM-more-error-tolerant.patch[1] uses NOTICE but > > > the v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch[2] changes it to > > > WARNING without explanation. > > > > Thank you for noticing this. I think NOTICE is more appropriate here. > > There is nothing to "worry" about: the user asked to ignore the errors > > and we did. And yes, it doesn't make sense to use the last line as > > the context. Fixed. > > > > > --- > > > +-- test missing data: should fail > > > +COPY check_ign_err FROM STDIN WITH (save_error_to none); > > > +1 {1} > > > +\. > > > > > > We might want to cover the extra data cases too. > > > > Agreed, the relevant test is added. > > Thank you for updating the patch. I have one minor point: > > + if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED && > + cstate->num_errors > 0) > + ereport(NOTICE, > + errmsg("%zd rows were skipped due to > data type incompatibility", > + cstate->num_errors)); > + > > We can use errmsg_plural() instead. Makes sense. Fixed. > I have a question about the option values; do you think we need to > have another value of SAVE_ERROR_TO option to explicitly specify the > current default behavior, i.e. not accept any error? With the v4 > patch, the user needs to omit SAVE_ERROR_TO option to accept errors > during COPY FROM. If we change the default behavior in the future, > many users will be affected and probably end up changing their > applications to keep the current default behavior. Valid point. I've implemented the handling of CopySaveErrorToChoice in a similar way to CopyHeaderChoice. Please, check the revised patch attached. ------ Regards, Alexander Korotkov
Attachment
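For reference, a minimal psql sketch of the option values as they stand in the revised patch above, assuming the SAVE_ERROR_TO spelling with 'error' as the newly added explicit default and 'none' to skip malformed rows (the option is renamed later in this thread):

CREATE TABLE test (a int);

-- explicit default: stop at the first malformed row
COPY test FROM stdin WITH (save_error_to 'error');

-- skip malformed rows; a NOTICE reports how many were skipped
COPY test FROM stdin WITH (save_error_to 'none');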
On 2024-01-16 00:17, Alexander Korotkov wrote: > On Mon, Jan 15, 2024 at 8:44 AM Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> On Mon, Jan 15, 2024 at 8:21 AM Alexander Korotkov >> <aekorotkov@gmail.com> wrote: >> > >> > On Sun, Jan 14, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> > > Thank you for updating the patch. Here are two comments: >> > > >> > > --- >> > > + if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED && >> > > + cstate->num_errors > 0) >> > > + ereport(WARNING, >> > > + errmsg("%zd rows were skipped due to data type incompatibility", >> > > + cstate->num_errors)); >> > > + >> > > /* Done, clean up */ >> > > error_context_stack = errcallback.previous; >> > > >> > > If a malformed input is not the last data, the context message seems odd: >> > > >> > > postgres(1:1769258)=# create table test (a int); >> > > CREATE TABLE >> > > postgres(1:1769258)=# copy test from stdin (save_error_to none); >> > > Enter data to be copied followed by a newline. >> > > End with a backslash and a period on a line by itself, or an EOF signal. >> > > >> a >> > > >> 1 >> > > >> >> > > 2024-01-15 05:05:53.980 JST [1769258] WARNING: 1 rows were skipped >> > > due to data type incompatibility >> > > 2024-01-15 05:05:53.980 JST [1769258] CONTEXT: COPY test, line 3: "" >> > > COPY 1 >> > > >> > > I think it's better to report the WARNING after resetting the >> > > error_context_stack. Or is a WARNING really appropriate here? The >> > > v15-0001-Make-COPY-FROM-more-error-tolerant.patch[1] uses NOTICE but >> > > the v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch[2] changes it to >> > > WARNING without explanation. >> > >> > Thank you for noticing this. I think NOTICE is more appropriate here. >> > There is nothing to "worry" about: the user asked to ignore the errors >> > and we did. And yes, it doesn't make sense to use the last line as >> > the context. Fixed. >> > >> > > --- >> > > +-- test missing data: should fail >> > > +COPY check_ign_err FROM STDIN WITH (save_error_to none); >> > > +1 {1} >> > > +\. >> > > >> > > We might want to cover the extra data cases too. >> > >> > Agreed, the relevant test is added. >> >> Thank you for updating the patch. I have one minor point: >> >> + if (cstate->opts.save_error_to != >> COPY_SAVE_ERROR_TO_UNSPECIFIED && >> + cstate->num_errors > 0) >> + ereport(NOTICE, >> + errmsg("%zd rows were skipped due to >> data type incompatibility", >> + cstate->num_errors)); >> + >> >> We can use errmsg_plural() instead. > > Makes sense. Fixed. > >> I have a question about the option values; do you think we need to >> have another value of SAVE_ERROR_TO option to explicitly specify the >> current default behavior, i.e. not accept any error? With the v4 >> patch, the user needs to omit SAVE_ERROR_TO option to accept errors >> during COPY FROM. If we change the default behavior in the future, >> many users will be affected and probably end up changing their >> applications to keep the current default behavior. > > Valid point. I've implemented the handling of CopySaveErrorToChoice > in a similar way to CopyHeaderChoice. > > Please, check the revised patch attached. Thanks for updating the patch! Here is a minor comment: > +/* > + * Extract a defGetCopySaveErrorToChoice value from a DefElem. > + */ Should be Extract a "CopySaveErrorToChoice"? BTW I'm thinking we should add a column to pg_stat_progress_copy that counts soft errors. I'll suggest this in another thread. 
> ------ > Regards, > Alexander Korotkov -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Hi, > Thanks for updating the patch! You're welcome! > Here is a minor comment: > > > +/* > > + * Extract a defGetCopySaveErrorToChoice value from a DefElem. > > + */ > > Should be Extract a "CopySaveErrorToChoice"? Fixed. > BTW I'm thinking we should add a column to pg_stat_progress_copy that > counts soft errors. I'll suggest this in another thread. Please do! ------ Regards, Alexander Korotkov
Attachment
Hi, Thanks for applying! > + errmsg_plural("%zd row were skipped due > to data type incompatibility", Sorry, I just noticed it, but 'were' should be 'was' here? >> BTW I'm thinking we should add a column to pg_stat_progress_copy that >> counts soft errors. I'll suggest this in another thread. > Please do! I've started it here: https://www.postgresql.org/message-id/d12fd8c99adcae2744212cb23feff6ed@oss.nttdata.com -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
At Wed, 17 Jan 2024 14:38:54 +0900, torikoshia <torikoshia@oss.nttdata.com> wrote in > Hi, > > Thanks for applying! > > > + errmsg_plural("%zd row were skipped due to data type > > incompatibility", > > Sorry, I just noticed it, but 'were' should be 'was' here? > > >> BTW I'm thinking we should add a column to pg_stat_progress_copy that > >> counts soft errors. I'll suggest this in another thread. > > Please do! > > I've started it here: > > https://www.postgresql.org/message-id/d12fd8c99adcae2744212cb23feff6ed@oss.nttdata.com Switching topics, this commit (9e2d870119) adds the following help message: > "COPY { %s [ ( %s [, ...] ) ] | ( %s ) }\n" > " TO { '%s' | PROGRAM '%s' | STDOUT }\n" > ... > " SAVE_ERROR_TO '%s'\n" > ... > _("location"), On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which indicate "immediately error out" and 'just ignore the failure' respectively, but these options hardly seem to denote a 'location', and appear more like an 'action'. I somewhat suspect that this parameter name was initially conceived with the assumption that it would take file names or similar parameters. I'm not sure if others will agree, but I think the parameter name might not be the best choice. For instance, considering the addition of the third value 'log', something like on_error_action (error, ignore, log) would be more intuitively understandable. What do you think? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Jan 17, 2024 at 7:38 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > Hi, > > Thanks for applying! > > > + errmsg_plural("%zd row were skipped due > > to data type incompatibility", > > Sorry, I just noticed it, but 'were' should be 'was' here? Sure, the fix is pushed. ------ Regards, Alexander Korotkov
On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Wed, 17 Jan 2024 14:38:54 +0900, torikoshia <torikoshia@oss.nttdata.com> wrote in > > Hi, > > > > Thanks for applying! > > > > > + errmsg_plural("%zd row were skipped due to data type > > > incompatibility", > > > > Sorry, I just noticed it, but 'were' should be 'was' here? > > > > >> BTW I'm thinking we should add a column to pg_stat_progress_copy that > > >> counts soft errors. I'll suggest this in another thread. > > > Please do! > > > > I've started it here: > > > > https://www.postgresql.org/message-id/d12fd8c99adcae2744212cb23feff6ed@oss.nttdata.com > > Switching topics, this commit (9e2d870119) adds the following help message: > > > > "COPY { %s [ ( %s [, ...] ) ] | ( %s ) }\n" > > " TO { '%s' | PROGRAM '%s' | STDOUT }\n" > > ... > > " SAVE_ERROR_TO '%s'\n" > > ... > > _("location"), > > On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which > indicate "immediately error out" and 'just ignore the failure' > respectively, but these options hardly seem to denote a 'location', > and appear more like an 'action'. I somewhat suspect that this > parameter name intially conceived with the assupmtion that it would > take file names or similar parameters. I'm not sure if others will > agree, but I think the parameter name might not be the best > choice. For instance, considering the addition of the third value > 'log', something like on_error_action (error, ignore, log) would be > more intuitively understandable. What do you think? Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes the next word will be location, not action. With some stretch we can assume 'error' to be location. I think it would be even more stretchy to think that SAVE_ERROR_TO is followed by action. Probably, we can replace SAVE_ERROR_TO with another name which could be naturally followed by action, but I don't have something appropriate in mind. However, I'm not native english speaker and certainly could miss something. ------ Regards, Alexander Korotkov
Alexander Korotkov <aekorotkov@gmail.com> writes: > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which >> indicate "immediately error out" and 'just ignore the failure' >> respectively, but these options hardly seem to denote a 'location', >> and appear more like an 'action'. I somewhat suspect that this >> parameter name intially conceived with the assupmtion that it would >> take file names or similar parameters. I'm not sure if others will >> agree, but I think the parameter name might not be the best >> choice. For instance, considering the addition of the third value >> 'log', something like on_error_action (error, ignore, log) would be >> more intuitively understandable. What do you think? > Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes > the next word will be location, not action. With some stretch we can > assume 'error' to be location. I think it would be even more stretchy > to think that SAVE_ERROR_TO is followed by action. The other problem with this terminology is that with 'none', what it is doing is the exact opposite of "saving" the errors. I agree we need a better name. Kyotaro-san's suggestion isn't bad, though I might shorten it to error_action {error|ignore|log} (or perhaps "stop" instead of "error")? You will need a separate parameter anyway to specify the destination of "log", unless "none" became an illegal table name when I wasn't looking. I don't buy that one parameter that has some special values while other values could be names will be a good design. Moreover, what if we want to support (say) log-to-file along with log-to-table? Trying to distinguish a file name from a table name without any other context seems impossible. regards, tom lane
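To make the ambiguity concrete, a purely hypothetical sketch (neither of these spellings exists in any posted patch): with a single location-style parameter the value cannot be told apart from a destination name, while a separate destination parameter keeps the action and the destination distinct:

-- single parameter: is 'copy_log' a special keyword, a table name, or a file name?
COPY t1 FROM stdin WITH (save_error_to 'copy_log');

-- hypothetical two-parameter form: action and destination are separate options
COPY t1 FROM stdin WITH (error_action log, error_destination 'copy_log');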
On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Alexander Korotkov <aekorotkov@gmail.com> writes: > > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which > >> indicate "immediately error out" and 'just ignore the failure' > >> respectively, but these options hardly seem to denote a 'location', > >> and appear more like an 'action'. I somewhat suspect that this > >> parameter name intially conceived with the assupmtion that it would > >> take file names or similar parameters. I'm not sure if others will > >> agree, but I think the parameter name might not be the best > >> choice. For instance, considering the addition of the third value > >> 'log', something like on_error_action (error, ignore, log) would be > >> more intuitively understandable. What do you think? > > > Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes > > the next word will be location, not action. With some stretch we can > > assume 'error' to be location. I think it would be even more stretchy > > to think that SAVE_ERROR_TO is followed by action. > > The other problem with this terminology is that with 'none', what it > is doing is the exact opposite of "saving" the errors. I agree we > need a better name. Agreed. > > Kyotaro-san's suggestion isn't bad, though I might shorten it to > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > You will need a separate parameter anyway to specify the destination > of "log", unless "none" became an illegal table name when I wasn't > looking. I don't buy that one parameter that has some special values > while other values could be names will be a good design. Moreover, > what if we want to support (say) log-to-file along with log-to-table? > Trying to distinguish a file name from a table name without any other > context seems impossible. I've been thinking we can add more values to this option to log errors not only to the server logs but also to the error table (not sure details but I imagined an error table is created for each table on error), without an additional option for the destination name. The values would be like error_action {error|ignore|save-logs|save-table}. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Alexander Korotkov <aekorotkov@gmail.com> writes: > > > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi > > > <horikyota.ntt@gmail.com> wrote: > > >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which > > >> indicate "immediately error out" and 'just ignore the failure' > > >> respectively, but these options hardly seem to denote a 'location', > > >> and appear more like an 'action'. I somewhat suspect that this > > >> parameter name intially conceived with the assupmtion that it would > > >> take file names or similar parameters. I'm not sure if others will > > >> agree, but I think the parameter name might not be the best > > >> choice. For instance, considering the addition of the third value > > >> 'log', something like on_error_action (error, ignore, log) would be > > >> more intuitively understandable. What do you think? > > > > > Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes > > > the next word will be location, not action. With some stretch we can > > > assume 'error' to be location. I think it would be even more stretchy > > > to think that SAVE_ERROR_TO is followed by action. > > > > The other problem with this terminology is that with 'none', what it > > is doing is the exact opposite of "saving" the errors. I agree we > > need a better name. > > Agreed. > > > > > Kyotaro-san's suggestion isn't bad, though I might shorten it to > > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > > You will need a separate parameter anyway to specify the destination > > of "log", unless "none" became an illegal table name when I wasn't > > looking. I don't buy that one parameter that has some special values > > while other values could be names will be a good design. Moreover, > > what if we want to support (say) log-to-file along with log-to-table? > > Trying to distinguish a file name from a table name without any other > > context seems impossible. > > I've been thinking we can add more values to this option to log errors > not only to the server logs but also to the error table (not sure > details but I imagined an error table is created for each table on > error), without an additional option for the destination name. The > values would be like error_action {error|ignore|save-logs|save-table}. > another idea: on_error {error|ignore|other_future_option} if not specified then by default ERROR. You can also specify ERROR or IGNORE for now. I agree, the parameter "error_action" is better than "location".
On 2024-01-18 10:10, jian he wrote: > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> > >> > Alexander Korotkov <aekorotkov@gmail.com> writes: >> > > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi >> > > <horikyota.ntt@gmail.com> wrote: >> > >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which >> > >> indicate "immediately error out" and 'just ignore the failure' >> > >> respectively, but these options hardly seem to denote a 'location', >> > >> and appear more like an 'action'. I somewhat suspect that this >> > >> parameter name intially conceived with the assupmtion that it would >> > >> take file names or similar parameters. I'm not sure if others will >> > >> agree, but I think the parameter name might not be the best >> > >> choice. For instance, considering the addition of the third value >> > >> 'log', something like on_error_action (error, ignore, log) would be >> > >> more intuitively understandable. What do you think? >> > >> > > Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes >> > > the next word will be location, not action. With some stretch we can >> > > assume 'error' to be location. I think it would be even more stretchy >> > > to think that SAVE_ERROR_TO is followed by action. >> > >> > The other problem with this terminology is that with 'none', what it >> > is doing is the exact opposite of "saving" the errors. I agree we >> > need a better name. >> >> Agreed. >> >> > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? >> > You will need a separate parameter anyway to specify the destination >> > of "log", unless "none" became an illegal table name when I wasn't >> > looking. I don't buy that one parameter that has some special values >> > while other values could be names will be a good design. Moreover, >> > what if we want to support (say) log-to-file along with log-to-table? >> > Trying to distinguish a file name from a table name without any other >> > context seems impossible. >> >> I've been thinking we can add more values to this option to log errors >> not only to the server logs but also to the error table (not sure >> details but I imagined an error table is created for each table on >> error), without an additional option for the destination name. The >> values would be like error_action {error|ignore|save-logs|save-table}. >> > > another idea: > on_error {error|ignore|other_future_option} > if not specified then by default ERROR. > You can also specify ERROR or IGNORE for now. > > I agree, the parameter "error_action" is better than "location". I'm not sure whether error_action or on_error is better, but either way "error_action error" and "on_error error" seems a bit odd to me. I feel "stop" is better for both cases as Tom suggested. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > On 2024-01-18 10:10, jian he wrote: > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> > > wrote: > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > >> > You will need a separate parameter anyway to specify the destination > >> > of "log", unless "none" became an illegal table name when I wasn't > >> > looking. I don't buy that one parameter that has some special values > >> > while other values could be names will be a good design. Moreover, > >> > what if we want to support (say) log-to-file along with log-to-table? > >> > Trying to distinguish a file name from a table name without any other > >> > context seems impossible. > >> > >> I've been thinking we can add more values to this option to log errors > >> not only to the server logs but also to the error table (not sure > >> details but I imagined an error table is created for each table on > >> error), without an additional option for the destination name. The > >> values would be like error_action {error|ignore|save-logs|save-table}. > >> > > > > another idea: > > on_error {error|ignore|other_future_option} > > if not specified then by default ERROR. > > You can also specify ERROR or IGNORE for now. > > > > I agree, the parameter "error_action" is better than "location". > > I'm not sure whether error_action or on_error is better, but either way > "error_action error" and "on_error error" seems a bit odd to me. > I feel "stop" is better for both cases as Tom suggested. OK. What about this? on_error {stop|ignore|other_future_option} where other_future_option might be compound like "file 'copy.log'" or "table 'copy_log'". ------ Regards, Alexander Korotkov
On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > On 2024-01-18 10:10, jian he wrote: > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> > > > wrote: > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > > >> > You will need a separate parameter anyway to specify the destination > > >> > of "log", unless "none" became an illegal table name when I wasn't > > >> > looking. I don't buy that one parameter that has some special values > > >> > while other values could be names will be a good design. Moreover, > > >> > what if we want to support (say) log-to-file along with log-to-table? > > >> > Trying to distinguish a file name from a table name without any other > > >> > context seems impossible. > > >> > > >> I've been thinking we can add more values to this option to log errors > > >> not only to the server logs but also to the error table (not sure > > >> details but I imagined an error table is created for each table on > > >> error), without an additional option for the destination name. The > > >> values would be like error_action {error|ignore|save-logs|save-table}. > > >> > > > > > > another idea: > > > on_error {error|ignore|other_future_option} > > > if not specified then by default ERROR. > > > You can also specify ERROR or IGNORE for now. > > > > > > I agree, the parameter "error_action" is better than "location". > > > > I'm not sure whether error_action or on_error is better, but either way > > "error_action error" and "on_error error" seems a bit odd to me. > > I feel "stop" is better for both cases as Tom suggested. > > OK. What about this? > on_error {stop|ignore|other_future_option} > where other_future_option might be compound like "file 'copy.log'" or > "table 'copy_log'". +1 Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 2024-01-18 16:59, Alexander Korotkov wrote: > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> > wrote: >> On 2024-01-18 10:10, jian he wrote: >> > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> >> > wrote: >> >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to >> >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? >> >> > You will need a separate parameter anyway to specify the destination >> >> > of "log", unless "none" became an illegal table name when I wasn't >> >> > looking. I don't buy that one parameter that has some special values >> >> > while other values could be names will be a good design. Moreover, >> >> > what if we want to support (say) log-to-file along with log-to-table? >> >> > Trying to distinguish a file name from a table name without any other >> >> > context seems impossible. >> >> >> >> I've been thinking we can add more values to this option to log errors >> >> not only to the server logs but also to the error table (not sure >> >> details but I imagined an error table is created for each table on >> >> error), without an additional option for the destination name. The >> >> values would be like error_action {error|ignore|save-logs|save-table}. >> >> >> > >> > another idea: >> > on_error {error|ignore|other_future_option} >> > if not specified then by default ERROR. >> > You can also specify ERROR or IGNORE for now. >> > >> > I agree, the parameter "error_action" is better than "location". >> >> I'm not sure whether error_action or on_error is better, but either >> way >> "error_action error" and "on_error error" seems a bit odd to me. >> I feel "stop" is better for both cases as Tom suggested. > > OK. What about this? > on_error {stop|ignore|other_future_option} > where other_future_option might be compound like "file 'copy.log'" or > "table 'copy_log'". Thanks, also +1 from me. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Hi. patch refactored based on "on_error {stop|ignore}" doc changes: --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -43,7 +43,7 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable FORCE_QUOTE { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * } FORCE_NOT_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * } FORCE_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * } - SAVE_ERROR_TO '<replaceable class="parameter">location</replaceable>' + ON_ERROR '<replaceable class="parameter">error_action</replaceable>' ENCODING '<replaceable class="parameter">encoding_name</replaceable>' </synopsis> </refsynopsisdiv> @@ -375,20 +375,20 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable </varlistentry> <varlistentry> - <term><literal>SAVE_ERROR_TO</literal></term> + <term><literal>ON_ERROR</literal></term> <listitem> <para> - Specifies to save error information to <replaceable class="parameter"> - location</replaceable> when there is malformed data in the input. - Currently, only <literal>error</literal> (default) and <literal>none</literal> + Specifies which <replaceable class="parameter"> + error_action</replaceable> to perform when there is malformed data in the input. + Currently, only <literal>stop</literal> (default) and <literal>ignore</literal> values are supported. - If the <literal>error</literal> value is specified, + If the <literal>stop</literal> value is specified, <command>COPY</command> stops operation at the first error. - If the <literal>none</literal> value is specified, + If the <literal>ignore</literal> value is specified, <command>COPY</command> skips malformed data and continues copying data. The option is allowed only in <command>COPY FROM</command>. - The <literal>none</literal> value is allowed only when - not using <literal>binary</literal> format. + Only <literal>stop</literal> value is allowed only when + using <literal>binary</literal> format. </para>
Attachment
On 2024-01-18 23:59, jian he wrote: > Hi. > patch refactored based on "on_error {stop|ignore}" > doc changes: > > --- a/doc/src/sgml/ref/copy.sgml > +++ b/doc/src/sgml/ref/copy.sgml > @@ -43,7 +43,7 @@ COPY { <replaceable > class="parameter">table_name</replaceable> [ ( <replaceable > FORCE_QUOTE { ( <replaceable > class="parameter">column_name</replaceable> [, ...] ) | * } > FORCE_NOT_NULL { ( <replaceable > class="parameter">column_name</replaceable> [, ...] ) | * } > FORCE_NULL { ( <replaceable > class="parameter">column_name</replaceable> [, ...] ) | * } > - SAVE_ERROR_TO '<replaceable > class="parameter">location</replaceable>' > + ON_ERROR '<replaceable > class="parameter">error_action</replaceable>' > ENCODING '<replaceable > class="parameter">encoding_name</replaceable>' > </synopsis> > </refsynopsisdiv> > @@ -375,20 +375,20 @@ COPY { <replaceable > class="parameter">table_name</replaceable> [ ( <replaceable > </varlistentry> > > <varlistentry> > - <term><literal>SAVE_ERROR_TO</literal></term> > + <term><literal>ON_ERROR</literal></term> > <listitem> > <para> > - Specifies to save error information to <replaceable > class="parameter"> > - location</replaceable> when there is malformed data in the > input. > - Currently, only <literal>error</literal> (default) and > <literal>none</literal> > + Specifies which <replaceable class="parameter"> > + error_action</replaceable> to perform when there is malformed > data in the input. > + Currently, only <literal>stop</literal> (default) and > <literal>ignore</literal> > values are supported. > - If the <literal>error</literal> value is specified, > + If the <literal>stop</literal> value is specified, > <command>COPY</command> stops operation at the first error. > - If the <literal>none</literal> value is specified, > + If the <literal>ignore</literal> value is specified, > <command>COPY</command> skips malformed data and continues > copying data. > The option is allowed only in <command>COPY FROM</command>. > - The <literal>none</literal> value is allowed only when > - not using <literal>binary</literal> format. > + Only <literal>stop</literal> value is allowed only when > + using <literal>binary</literal> format. > </para> Thanks for making the patch! Here are some comments: > - The <literal>none</literal> value is allowed only when > - not using <literal>binary</literal> format. > + Only <literal>stop</literal> value is allowed only when > + using <literal>binary</literal> format. The second 'only' may be unnecessary. > - /* If SAVE_ERROR_TO is specified, skip rows > with soft errors */ > + /* If ON_ERROR is specified with IGNORE, skip > rows with soft errors */ This is correct now, but considering future works which add other options like "file 'copy.log'" and "table 'copy_log'", it may be better not to limit the case to 'IGNORE'. How about something like this? If ON_ERROR is specified and the value is not STOP, skip rows with soft errors > -COPY x from stdin (format BINARY, save_error_to none); > -COPY x to stdin (save_error_to none); > +COPY x from stdin (format BINARY, ON_ERROR ignore); > +COPY x from stdin (ON_ERROR unsupported); > COPY x to stdin (format TEXT, force_quote(a)); > COPY x from stdin (format CSV, force_quote(a)); In the existing test for copy2.sql, the COPY options are written in lower case(e.g. 'format') and option value(e.g. 'BINARY') are written in upper case. It would be more consistent to align them. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Hi! On Fri, Jan 19, 2024 at 2:37 PM torikoshia <torikoshia@oss.nttdata.com> wrote: > Thanks for making the patch! The patch is pushed! The proposed changes are incorporated excluding this. > > - /* If SAVE_ERROR_TO is specified, skip rows > > with soft errors */ > > + /* If ON_ERROR is specified with IGNORE, skip > > rows with soft errors */ > > This is correct now, but considering future works which add other > options like "file 'copy.log'" and > "table 'copy_log'", it may be better not to limit the case to 'IGNORE'. > How about something like this? > > If ON_ERROR is specified and the value is not STOP, skip rows with > soft errors I think when we have more options, then we wouldn't just skip rows with soft errors but rather save them. So, I left this comment as is for now. ------ Regards, Alexander Korotkov
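With the rename committed, a short psql sketch of the resulting behavior under ON_ERROR ignore; the NOTICE wording follows the errmsg_plural message discussed earlier in the thread:

postgres=# CREATE TABLE t (a int);
CREATE TABLE
postgres=# COPY t FROM stdin WITH (ON_ERROR ignore);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> a
>> 1
>> \.
NOTICE:  1 row was skipped due to data type incompatibility
COPY 1

Omitting the option, or writing ON_ERROR stop explicitly, keeps the previous behavior of stopping at the first malformed row.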
On 2024-01-19 22:27, Alexander Korotkov wrote: > Hi! > > On Fri, Jan 19, 2024 at 2:37 PM torikoshia <torikoshia@oss.nttdata.com> > wrote: >> Thanks for making the patch! > > The patch is pushed! The proposed changes are incorporated excluding > this. > >> > - /* If SAVE_ERROR_TO is specified, skip rows >> > with soft errors */ >> > + /* If ON_ERROR is specified with IGNORE, skip >> > rows with soft errors */ >> >> This is correct now, but considering future works which add other >> options like "file 'copy.log'" and >> "table 'copy_log'", it may be better not to limit the case to >> 'IGNORE'. >> How about something like this? >> >> If ON_ERROR is specified and the value is not STOP, skip rows with >> soft errors > > I think when we have more options, then we wouldn't just skip rows > with soft errors but rather save them. So, I left this comment as is > for now. Agreed. Thanks for the notification! > > ------ > Regards, > Alexander Korotkov -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Hi, On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > > > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > > On 2024-01-18 10:10, jian he wrote: > > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> > > > > wrote: > > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to > > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > > > >> > You will need a separate parameter anyway to specify the destination > > > >> > of "log", unless "none" became an illegal table name when I wasn't > > > >> > looking. I don't buy that one parameter that has some special values > > > >> > while other values could be names will be a good design. Moreover, > > > >> > what if we want to support (say) log-to-file along with log-to-table? > > > >> > Trying to distinguish a file name from a table name without any other > > > >> > context seems impossible. > > > >> > > > >> I've been thinking we can add more values to this option to log errors > > > >> not only to the server logs but also to the error table (not sure > > > >> details but I imagined an error table is created for each table on > > > >> error), without an additional option for the destination name. The > > > >> values would be like error_action {error|ignore|save-logs|save-table}. > > > >> > > > > > > > > another idea: > > > > on_error {error|ignore|other_future_option} > > > > if not specified then by default ERROR. > > > > You can also specify ERROR or IGNORE for now. > > > > > > > > I agree, the parameter "error_action" is better than "location". > > > > > > I'm not sure whether error_action or on_error is better, but either way > > > "error_action error" and "on_error error" seems a bit odd to me. > > > I feel "stop" is better for both cases as Tom suggested. > > > > OK. What about this? > > on_error {stop|ignore|other_future_option} > > where other_future_option might be compound like "file 'copy.log'" or > > "table 'copy_log'". > > +1 > I realized that ON_ERROR syntax synoposis in the documentation is not correct. The option doesn't require the value to be quoted and the value can be omitted. The attached patch fixes it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
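Per the observation above, a sketch of the forms the option syntax accepts at this point, which is why the quoted synopsis was misleading:

COPY t FROM stdin WITH (ON_ERROR ignore);     -- unquoted value
COPY t FROM stdin WITH (ON_ERROR 'ignore');   -- quoted value is also accepted
COPY t FROM stdin WITH (ON_ERROR);            -- value omitted; the follow-up below proposes rejecting this form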
On 2024-03-28 10:20, Masahiko Sawada wrote: > Hi, > > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov >> <aekorotkov@gmail.com> wrote: >> > >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote: >> > > On 2024-01-18 10:10, jian he wrote: >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> >> > > > wrote: >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? >> > > >> > You will need a separate parameter anyway to specify the destination >> > > >> > of "log", unless "none" became an illegal table name when I wasn't >> > > >> > looking. I don't buy that one parameter that has some special values >> > > >> > while other values could be names will be a good design. Moreover, >> > > >> > what if we want to support (say) log-to-file along with log-to-table? >> > > >> > Trying to distinguish a file name from a table name without any other >> > > >> > context seems impossible. >> > > >> >> > > >> I've been thinking we can add more values to this option to log errors >> > > >> not only to the server logs but also to the error table (not sure >> > > >> details but I imagined an error table is created for each table on >> > > >> error), without an additional option for the destination name. The >> > > >> values would be like error_action {error|ignore|save-logs|save-table}. >> > > >> >> > > > >> > > > another idea: >> > > > on_error {error|ignore|other_future_option} >> > > > if not specified then by default ERROR. >> > > > You can also specify ERROR or IGNORE for now. >> > > > >> > > > I agree, the parameter "error_action" is better than "location". >> > > >> > > I'm not sure whether error_action or on_error is better, but either way >> > > "error_action error" and "on_error error" seems a bit odd to me. >> > > I feel "stop" is better for both cases as Tom suggested. >> > >> > OK. What about this? >> > on_error {stop|ignore|other_future_option} >> > where other_future_option might be compound like "file 'copy.log'" or >> > "table 'copy_log'". >> >> +1 >> > > I realized that ON_ERROR syntax synoposis in the documentation is not > correct. The option doesn't require the value to be quoted and the > value can be omitted. The attached patch fixes it. > > Regards, Thanks! Attached patch fixes the doc, but I'm wondering perhaps it might be better to modify the codes to prohibit abbreviation of the value. When seeing the query which abbreviates ON_ERROR value, I feel it's not obvious what happens compared to other options which tolerates abbreviation of the value such as FREEZE or HEADER. COPY t1 FROM stdin WITH (ON_ERROR); What do you think? -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On Thu, Mar 28, 2024 at 9:38 PM torikoshia <torikoshia@oss.nttdata.com> wrote: > > On 2024-03-28 10:20, Masahiko Sawada wrote: > > Hi, > > > > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com> > > wrote: > >> > >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov > >> <aekorotkov@gmail.com> wrote: > >> > > >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > >> > > On 2024-01-18 10:10, jian he wrote: > >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> > >> > > > wrote: > >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to > >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > >> > > >> > You will need a separate parameter anyway to specify the destination > >> > > >> > of "log", unless "none" became an illegal table name when I wasn't > >> > > >> > looking. I don't buy that one parameter that has some special values > >> > > >> > while other values could be names will be a good design. Moreover, > >> > > >> > what if we want to support (say) log-to-file along with log-to-table? > >> > > >> > Trying to distinguish a file name from a table name without any other > >> > > >> > context seems impossible. > >> > > >> > >> > > >> I've been thinking we can add more values to this option to log errors > >> > > >> not only to the server logs but also to the error table (not sure > >> > > >> details but I imagined an error table is created for each table on > >> > > >> error), without an additional option for the destination name. The > >> > > >> values would be like error_action {error|ignore|save-logs|save-table}. > >> > > >> > >> > > > > >> > > > another idea: > >> > > > on_error {error|ignore|other_future_option} > >> > > > if not specified then by default ERROR. > >> > > > You can also specify ERROR or IGNORE for now. > >> > > > > >> > > > I agree, the parameter "error_action" is better than "location". > >> > > > >> > > I'm not sure whether error_action or on_error is better, but either way > >> > > "error_action error" and "on_error error" seems a bit odd to me. > >> > > I feel "stop" is better for both cases as Tom suggested. > >> > > >> > OK. What about this? > >> > on_error {stop|ignore|other_future_option} > >> > where other_future_option might be compound like "file 'copy.log'" or > >> > "table 'copy_log'". > >> > >> +1 > >> > > > > I realized that ON_ERROR syntax synoposis in the documentation is not > > correct. The option doesn't require the value to be quoted and the > > value can be omitted. The attached patch fixes it. > > > > Regards, > > Thanks! > > Attached patch fixes the doc, but I'm wondering perhaps it might be > better to modify the codes to prohibit abbreviation of the value. > > When seeing the query which abbreviates ON_ERROR value, I feel it's not > obvious what happens compared to other options which tolerates > abbreviation of the value such as FREEZE or HEADER. > > COPY t1 FROM stdin WITH (ON_ERROR); > > What do you think? Indeed. Looking at options of other commands such as VACUUM and EXPLAIN, I can see that we can omit a boolean value, but non-boolean parameters require its value. The HEADER option is not a pure boolean parameter but we can omit the value. It seems to be for backward compatibility; it used to be a boolean parameter. I agree that the above example would confuse users. 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 2024-03-28 21:54, Masahiko Sawada wrote: > On Thu, Mar 28, 2024 at 9:38 PM torikoshia <torikoshia@oss.nttdata.com> > wrote: >> >> On 2024-03-28 10:20, Masahiko Sawada wrote: >> > Hi, >> > >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com> >> > wrote: >> >> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov >> >> <aekorotkov@gmail.com> wrote: >> >> > >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote: >> >> > > On 2024-01-18 10:10, jian he wrote: >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> >> >> > > > wrote: >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? >> >> > > >> > You will need a separate parameter anyway to specify the destination >> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't >> >> > > >> > looking. I don't buy that one parameter that has some special values >> >> > > >> > while other values could be names will be a good design. Moreover, >> >> > > >> > what if we want to support (say) log-to-file along with log-to-table? >> >> > > >> > Trying to distinguish a file name from a table name without any other >> >> > > >> > context seems impossible. >> >> > > >> >> >> > > >> I've been thinking we can add more values to this option to log errors >> >> > > >> not only to the server logs but also to the error table (not sure >> >> > > >> details but I imagined an error table is created for each table on >> >> > > >> error), without an additional option for the destination name. The >> >> > > >> values would be like error_action {error|ignore|save-logs|save-table}. >> >> > > >> >> >> > > > >> >> > > > another idea: >> >> > > > on_error {error|ignore|other_future_option} >> >> > > > if not specified then by default ERROR. >> >> > > > You can also specify ERROR or IGNORE for now. >> >> > > > >> >> > > > I agree, the parameter "error_action" is better than "location". >> >> > > >> >> > > I'm not sure whether error_action or on_error is better, but either way >> >> > > "error_action error" and "on_error error" seems a bit odd to me. >> >> > > I feel "stop" is better for both cases as Tom suggested. >> >> > >> >> > OK. What about this? >> >> > on_error {stop|ignore|other_future_option} >> >> > where other_future_option might be compound like "file 'copy.log'" or >> >> > "table 'copy_log'". >> >> >> >> +1 >> >> >> > >> > I realized that ON_ERROR syntax synoposis in the documentation is not >> > correct. The option doesn't require the value to be quoted and the >> > value can be omitted. The attached patch fixes it. >> > >> > Regards, >> >> Thanks! >> >> Attached patch fixes the doc, but I'm wondering perhaps it might be >> better to modify the codes to prohibit abbreviation of the value. >> >> When seeing the query which abbreviates ON_ERROR value, I feel it's >> not >> obvious what happens compared to other options which tolerates >> abbreviation of the value such as FREEZE or HEADER. >> >> COPY t1 FROM stdin WITH (ON_ERROR); >> >> What do you think? > > Indeed. Looking at options of other commands such as VACUUM and > EXPLAIN, I can see that we can omit a boolean value, but non-boolean > parameters require its value. The HEADER option is not a pure boolean > parameter but we can omit the value. 
It seems to be for backward > compatibility; it used to be a boolean parameter. I agree that the > above example would confuse users. > > Regards, Thanks for your comment! Attached a patch which modifies the code to prohibit omission of its value. I was a little unsure about adding a regression test for this, but I have not added it since other COPY option doesn't test the omission of its value. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Attachment
On Fri, Mar 29, 2024 at 11:54 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > On 2024-03-28 21:54, Masahiko Sawada wrote: > > On Thu, Mar 28, 2024 at 9:38 PM torikoshia <torikoshia@oss.nttdata.com> > > wrote: > >> > >> On 2024-03-28 10:20, Masahiko Sawada wrote: > >> > Hi, > >> > > >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com> > >> > wrote: > >> >> > >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov > >> >> <aekorotkov@gmail.com> wrote: > >> >> > > >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > >> >> > > On 2024-01-18 10:10, jian he wrote: > >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> > >> >> > > > wrote: > >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to > >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > >> >> > > >> > You will need a separate parameter anyway to specify the destination > >> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't > >> >> > > >> > looking. I don't buy that one parameter that has some special values > >> >> > > >> > while other values could be names will be a good design. Moreover, > >> >> > > >> > what if we want to support (say) log-to-file along with log-to-table? > >> >> > > >> > Trying to distinguish a file name from a table name without any other > >> >> > > >> > context seems impossible. > >> >> > > >> > >> >> > > >> I've been thinking we can add more values to this option to log errors > >> >> > > >> not only to the server logs but also to the error table (not sure > >> >> > > >> details but I imagined an error table is created for each table on > >> >> > > >> error), without an additional option for the destination name. The > >> >> > > >> values would be like error_action {error|ignore|save-logs|save-table}. > >> >> > > >> > >> >> > > > > >> >> > > > another idea: > >> >> > > > on_error {error|ignore|other_future_option} > >> >> > > > if not specified then by default ERROR. > >> >> > > > You can also specify ERROR or IGNORE for now. > >> >> > > > > >> >> > > > I agree, the parameter "error_action" is better than "location". > >> >> > > > >> >> > > I'm not sure whether error_action or on_error is better, but either way > >> >> > > "error_action error" and "on_error error" seems a bit odd to me. > >> >> > > I feel "stop" is better for both cases as Tom suggested. > >> >> > > >> >> > OK. What about this? > >> >> > on_error {stop|ignore|other_future_option} > >> >> > where other_future_option might be compound like "file 'copy.log'" or > >> >> > "table 'copy_log'". > >> >> > >> >> +1 > >> >> > >> > > >> > I realized that ON_ERROR syntax synoposis in the documentation is not > >> > correct. The option doesn't require the value to be quoted and the > >> > value can be omitted. The attached patch fixes it. > >> > > >> > Regards, > >> > >> Thanks! > >> > >> Attached patch fixes the doc, but I'm wondering perhaps it might be > >> better to modify the codes to prohibit abbreviation of the value. > >> > >> When seeing the query which abbreviates ON_ERROR value, I feel it's > >> not > >> obvious what happens compared to other options which tolerates > >> abbreviation of the value such as FREEZE or HEADER. > >> > >> COPY t1 FROM stdin WITH (ON_ERROR); > >> > >> What do you think? > > > > Indeed. 
Looking at options of other commands such as VACUUM and > > EXPLAIN, I can see that we can omit a boolean value, but non-boolean > > parameters require its value. The HEADER option is not a pure boolean > > parameter but we can omit the value. It seems to be for backward > > compatibility; it used to be a boolean parameter. I agree that the > > above example would confuse users. > > > > Regards, > > Thanks for your comment! > > Attached a patch which modifies the code to prohibit omission of its > value. > > I was a little unsure about adding a regression test for this, but I > have not added it since other COPY option doesn't test the omission of > its value. Probably should we change the doc as well since ON_ERROR value doesn't necessarily need to be single-quoted? The rest looks good to me. Alexander, what do you think about this change as you're the committer of this feature? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 2024-04-01 11:31, Masahiko Sawada wrote: > On Fri, Mar 29, 2024 at 11:54 AM torikoshia > <torikoshia@oss.nttdata.com> wrote: >> >> On 2024-03-28 21:54, Masahiko Sawada wrote: >> > On Thu, Mar 28, 2024 at 9:38 PM torikoshia <torikoshia@oss.nttdata.com> >> > wrote: >> >> >> >> On 2024-03-28 10:20, Masahiko Sawada wrote: >> >> > Hi, >> >> > >> >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com> >> >> > wrote: >> >> >> >> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov >> >> >> <aekorotkov@gmail.com> wrote: >> >> >> > >> >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote: >> >> >> > > On 2024-01-18 10:10, jian he wrote: >> >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> >> >> >> > > > wrote: >> >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to >> >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? >> >> >> > > >> > You will need a separate parameter anyway to specify the destination >> >> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't >> >> >> > > >> > looking. I don't buy that one parameter that has some special values >> >> >> > > >> > while other values could be names will be a good design. Moreover, >> >> >> > > >> > what if we want to support (say) log-to-file along with log-to-table? >> >> >> > > >> > Trying to distinguish a file name from a table name without any other >> >> >> > > >> > context seems impossible. >> >> >> > > >> >> >> >> > > >> I've been thinking we can add more values to this option to log errors >> >> >> > > >> not only to the server logs but also to the error table (not sure >> >> >> > > >> details but I imagined an error table is created for each table on >> >> >> > > >> error), without an additional option for the destination name. The >> >> >> > > >> values would be like error_action {error|ignore|save-logs|save-table}. >> >> >> > > >> >> >> >> > > > >> >> >> > > > another idea: >> >> >> > > > on_error {error|ignore|other_future_option} >> >> >> > > > if not specified then by default ERROR. >> >> >> > > > You can also specify ERROR or IGNORE for now. >> >> >> > > > >> >> >> > > > I agree, the parameter "error_action" is better than "location". >> >> >> > > >> >> >> > > I'm not sure whether error_action or on_error is better, but either way >> >> >> > > "error_action error" and "on_error error" seems a bit odd to me. >> >> >> > > I feel "stop" is better for both cases as Tom suggested. >> >> >> > >> >> >> > OK. What about this? >> >> >> > on_error {stop|ignore|other_future_option} >> >> >> > where other_future_option might be compound like "file 'copy.log'" or >> >> >> > "table 'copy_log'". >> >> >> >> >> >> +1 >> >> >> >> >> > >> >> > I realized that ON_ERROR syntax synoposis in the documentation is not >> >> > correct. The option doesn't require the value to be quoted and the >> >> > value can be omitted. The attached patch fixes it. >> >> > >> >> > Regards, >> >> >> >> Thanks! >> >> >> >> Attached patch fixes the doc, but I'm wondering perhaps it might be >> >> better to modify the codes to prohibit abbreviation of the value. >> >> >> >> When seeing the query which abbreviates ON_ERROR value, I feel it's >> >> not >> >> obvious what happens compared to other options which tolerates >> >> abbreviation of the value such as FREEZE or HEADER. 
>> >> >> >> COPY t1 FROM stdin WITH (ON_ERROR); >> >> >> >> What do you think? >> > >> > Indeed. Looking at options of other commands such as VACUUM and >> > EXPLAIN, I can see that we can omit a boolean value, but non-boolean >> > parameters require its value. The HEADER option is not a pure boolean >> > parameter but we can omit the value. It seems to be for backward >> > compatibility; it used to be a boolean parameter. I agree that the >> > above example would confuse users. >> > >> > Regards, >> >> Thanks for your comment! >> >> Attached a patch which modifies the code to prohibit omission of its >> value. >> >> I was a little unsure about adding a regression test for this, but I >> have not added it since other COPY option doesn't test the omission of >> its value. > > Probably should we change the doc as well since ON_ERROR value doesn't > necessarily need to be single-quoted? Agreed. Since it seems this issue is independent from the omission of ON_ERROR option value, attached a separate patch. > The rest looks good to me. > > Alexander, what do you think about this change as you're the committer > of this feature? -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Attachment
On Tue, Apr 2, 2024 at 7:34 PM torikoshia <torikoshia@oss.nttdata.com> wrote: > > On 2024-04-01 11:31, Masahiko Sawada wrote: > > On Fri, Mar 29, 2024 at 11:54 AM torikoshia > > <torikoshia@oss.nttdata.com> wrote: > >> > >> On 2024-03-28 21:54, Masahiko Sawada wrote: > >> > On Thu, Mar 28, 2024 at 9:38 PM torikoshia <torikoshia@oss.nttdata.com> > >> > wrote: > >> >> > >> >> On 2024-03-28 10:20, Masahiko Sawada wrote: > >> >> > Hi, > >> >> > > >> >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada <sawada.mshk@gmail.com> > >> >> > wrote: > >> >> >> > >> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov > >> >> >> <aekorotkov@gmail.com> wrote: > >> >> >> > > >> >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > >> >> >> > > On 2024-01-18 10:10, jian he wrote: > >> >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> > >> >> >> > > > wrote: > >> >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to > >> >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > >> >> >> > > >> > You will need a separate parameter anyway to specify the destination > >> >> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't > >> >> >> > > >> > looking. I don't buy that one parameter that has some special values > >> >> >> > > >> > while other values could be names will be a good design. Moreover, > >> >> >> > > >> > what if we want to support (say) log-to-file along with log-to-table? > >> >> >> > > >> > Trying to distinguish a file name from a table name without any other > >> >> >> > > >> > context seems impossible. > >> >> >> > > >> > >> >> >> > > >> I've been thinking we can add more values to this option to log errors > >> >> >> > > >> not only to the server logs but also to the error table (not sure > >> >> >> > > >> details but I imagined an error table is created for each table on > >> >> >> > > >> error), without an additional option for the destination name. The > >> >> >> > > >> values would be like error_action {error|ignore|save-logs|save-table}. > >> >> >> > > >> > >> >> >> > > > > >> >> >> > > > another idea: > >> >> >> > > > on_error {error|ignore|other_future_option} > >> >> >> > > > if not specified then by default ERROR. > >> >> >> > > > You can also specify ERROR or IGNORE for now. > >> >> >> > > > > >> >> >> > > > I agree, the parameter "error_action" is better than "location". > >> >> >> > > > >> >> >> > > I'm not sure whether error_action or on_error is better, but either way > >> >> >> > > "error_action error" and "on_error error" seems a bit odd to me. > >> >> >> > > I feel "stop" is better for both cases as Tom suggested. > >> >> >> > > >> >> >> > OK. What about this? > >> >> >> > on_error {stop|ignore|other_future_option} > >> >> >> > where other_future_option might be compound like "file 'copy.log'" or > >> >> >> > "table 'copy_log'". > >> >> >> > >> >> >> +1 > >> >> >> > >> >> > > >> >> > I realized that ON_ERROR syntax synoposis in the documentation is not > >> >> > correct. The option doesn't require the value to be quoted and the > >> >> > value can be omitted. The attached patch fixes it. > >> >> > > >> >> > Regards, > >> >> > >> >> Thanks! > >> >> > >> >> Attached patch fixes the doc, but I'm wondering perhaps it might be > >> >> better to modify the codes to prohibit abbreviation of the value. 
> >> >> > >> >> When seeing the query which abbreviates ON_ERROR value, I feel it's > >> >> not > >> >> obvious what happens compared to other options which tolerates > >> >> abbreviation of the value such as FREEZE or HEADER. > >> >> > >> >> COPY t1 FROM stdin WITH (ON_ERROR); > >> >> > >> >> What do you think? > >> > > >> > Indeed. Looking at options of other commands such as VACUUM and > >> > EXPLAIN, I can see that we can omit a boolean value, but non-boolean > >> > parameters require its value. The HEADER option is not a pure boolean > >> > parameter but we can omit the value. It seems to be for backward > >> > compatibility; it used to be a boolean parameter. I agree that the > >> > above example would confuse users. > >> > > >> > Regards, > >> > >> Thanks for your comment! > >> > >> Attached a patch which modifies the code to prohibit omission of its > >> value. > >> > >> I was a little unsure about adding a regression test for this, but I > >> have not added it since other COPY option doesn't test the omission of > >> its value. > > > > Probably should we change the doc as well since ON_ERROR value doesn't > > necessarily need to be single-quoted? > > Agreed. > Since it seems this issue is independent from the omission of ON_ERROR > option value, attached a separate patch. > Thank you for the patches! These patches look good to me. I'll push them, barring any objections. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 2024-04-16 13:16, Masahiko Sawada wrote: > [...] > > Thank you for the patches! These patches look good to me. I'll push > them, barring any objections. Thanks for your review and for applying them! -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On Wed, Apr 17, 2024 at 4:28 PM torikoshia <torikoshia@oss.nttdata.com> wrote: > [...] > > Thanks for your review and for applying them! Thank you for the patches! Pushed: a6d0fa5ef8 and f6f8ac8e75. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
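As a rough usage sketch of the committed behaviour (table name and input rows are invented for illustration): with ON_ERROR ignore, a malformed row no longer aborts the whole load.

    CREATE TABLE t1 (a int);

    COPY t1 FROM stdin WITH (ON_ERROR ignore);
    1
    oops
    3
    \.
    -- "oops" cannot be converted to int, so that row is skipped and the
    -- server reports how many rows were skipped instead of aborting COPY

    SELECT * FROM t1;   -- only 1 and 3 were loaded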