Re: COPY enhancements - Mailing list pgsql-hackers

From Emmanuel Cecchet
Subject Re: COPY enhancements
Date
Msg-id 4AA9705C.5020108@asterdata.com
Whole thread Raw
In response to Re: COPY enhancements  (Josh Berkus <josh@agliodbs.com>)
Responses Re: COPY enhancements
List pgsql-hackers
Josh,

See the answers inlined.
> Thank you for tackling this very long-time TODO.
>   
>> Error logging is described here:
>> http://wiki.postgresql.org/wiki/Error_logging_in_COPY
>>     
>
> Questions & Comments:
>
> A) Why would someone want to turn error_logging on, but leave
> error_logging_skip_tuples off?  The pg_log already logs errors which copy throws by default.
>   
When error_logging is on and skip_tuples is off, errors are logged in 
the error table. If skip_tuples is on, tuples are not logged in the 
error table.
> B) As I mentioned earlier, we'll want to provide the option of logging
> to a file instead of to a table.  That's not a reason to reject this
> patch, but probably a TODO for 8.5.
>   
Ok but what should be the format of that file?
> C) Are we sure we want to handle this via GUCs rather than extensions to
> COPY syntax?  It seems like fairly often users would want to log
> different COPY sources to different tables/files.
>   
I agree that new COPY options could be easier to use, the implementation 
is just more complex. However, the labels allows you to select the 
tuples related to specific COPY commands.
> D) These GUCs are userset, I hope?  (haven't dug into the code far
> enough to tell yet).
>   
Yes.
> E) What is error_logging_tuple_label for?   You don't explain/give
> examples.  And how is error_logging_tuple_partition_key used?
>   
We use the label and partition key in Aster products to easily retrieve 
which COPY command on which partition did generate the bad tuples. By 
default, the tuple_label contains the COPY command that was executed 
(see example on Wiki) and the key contains the index of the tuple in the 
source file (see example on Wiki).
> F) Rawdata for rejected tuples is presumably BYTEA?
>   
Yes. I forgot to put back the table description that can be seen in the 
unit tests. I have updated the Wiki with the table definition.
> G) We should probably have a default for error_logging_table_name, such
> as pg_copy_errors.  Does that table get automatically created if it
> doesn't exist?
>   
Yes, as indicated on the wiki the table is created automatically (see 
config variable section).
> H) Finally, one request of the TODO is some way to halt import after a
> specified number of bad tuples because it probably means you have the
> wrong file or wrong table.  Do we still want that?
>   
We can still do that. It can be another GUC variable or an option to 
COPY. If the COPY command fails, everything gets rolled back (data in 
the destination table and error table). That would be harder to 
implement with a file (the rollback part).
>> Autopartitioning is described here:
>> http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY
>>     
>
> M) tuple_routing_in_copy should take "on" or "off", not 0 or 1.
>   
Ok.
> N) Have you measured the overhead & speed of this kind of COPY as
> opposed to COPY into a single table?  Have you checked the overhead if
> tuple_routing_in_copy is on, but you are not loading into a partitioned
> table?
>   
Yes. There is no noticeable overhead if there is no routing to do (but 
routing is on).
If routing is involved, the overhead depends on how sorted your input 
data is. If it all goes to the same partition, the caching effect works 
well and there is no noticeable overhead. The cost is in the constraint 
check and it depends on the complexity of the constraint. The more 
constraints you have to check and the more complex they are, the more 
overhead on each tuple routing.
> O) Is this capable of dealing with partitioning by more than one column,
> or by an expression?
>   
Yes, we just use a brute force technique where we try all child tables 
1-by-1 and rely on the existing Postgres constraint checking mechanism 
(no new or duplicated code there).
> Finally, I'm going to suggest different names for the GUCs, as the names
> you've chosen don't group well and would likely cause confusion.  Here
> are my suggestions, which all begin with "copy_" for prefix matching:
>
> error_logging --> probaby not needed, see able
> error_logging_skip_tuples --> copy_skip_bad_rows
> error_logging_schema_name --> copy_logging_schema_name
> error_logging_relation_name --> copy_logging_table_name
> error_logging_tuple_label --> don't know what this is for, see above
> error_logging_tuple_partition_key --> don't know what this is for, see above
>
> tuple_routing_in_copy --> copy_partitioning
> tuple_routing_cache_size --> copy_partitioning_cache_size
>   
This makes sense. I'll add that on my todo list.

Emmanuel

-- 
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com



pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: RfD: more powerful "any" types
Next
From: Josh Berkus
Date:
Subject: Re: COPY enhancements