Re: COPY enhancements - Mailing list pgsql-hackers

From Josh Berkus
Subject Re: COPY enhancements
Date
Msg-id 4AA97977.4020507@agliodbs.com
Whole thread Raw
In response to Re: COPY enhancements  (Emmanuel Cecchet <manu@asterdata.com>)
Responses Re: COPY enhancements
Re: COPY enhancements
List pgsql-hackers
Emmanuel,

BTW, some of the questions were for -hackers in general to give
feedback.  Don't take just my responses as final "what you have to do";
other contributors will have opinions, some of which will be more
informed than mine.

>> A) Why would someone want to turn error_logging on, but leave
>> error_logging_skip_tuples off?  The pg_log already logs errors which
>> copy throws by default.
>>   
> When error_logging is on and skip_tuples is off, errors are logged in
> the error table. If skip_tuples is on, tuples are not logged in the
> error table.

Although if you're not skipping tuples, presumably only the first error
is logged, yes?  At that point, COPY would stop.

>> B) As I mentioned earlier, we'll want to provide the option of logging
>> to a file instead of to a table.  That's not a reason to reject this
>> patch, but probably a TODO for 8.5.
>>   
> Ok but what should be the format of that file?

I'd say a CSV version of the table you have is the simplest way to go.

>> C) Are we sure we want to handle this via GUCs rather than extensions to
>> COPY syntax?  It seems like fairly often users would want to log
>> different COPY sources to different tables/files.
>>   
> I agree that new COPY options could be easier to use, the implementation
> is just more complex. However, the labels allows you to select the
> tuples related to specific COPY commands.

Yes, and GUCs allow users to retrofit this approach onto existing
infrastructure without changing their COPY commands.  So there's
advantages and disadvantages.  My question was really for the -hackers
at large: is this the design we want?  Or, more directly, is the GUC
approach anathema to anyone?

>> E) What is error_logging_tuple_label for?   You don't explain/give
>> examples.  And how is error_logging_tuple_partition_key used?
>>   
> We use the label and partition key in Aster products to easily retrieve
> which COPY command on which partition did generate the bad tuples. By
> default, the tuple_label contains the COPY command that was executed
> (see example on Wiki) and the key contains the index of the tuple in the
> source file (see example on Wiki).

Ah, ok, let me suggest a modified table format then (btw, the table
format you give doesn't match your examples):

CREATE TABLE pg_copy_errors( session_id     TEXT    -- session id from PostgreSQL tupletimestamp TIMESTAMP WITH TIME
ZONE,targettable    TEXT, -- table being copied to errmessage     TEXT, -- full error message encountered sqlerrcode
CHAR(5), -- sql error code statement     TEXT, -- the full copy statement which failed label          TEXT, -- optional
user-suppliedlabel rawdata        BYTEA, -- the failed data constraint pg_copy_errors_pk primary key (session_id,
tupletimestamp,
targettable)
);

.. the label would be supplied as copy_logging_label.

This would require you to collect the session_id, of course.  Or the
pid, or something else.  But, the table as you laid it out has no
natural key, which is a problem for debugging.  Also, see discussion below.

I still don't understand how error_logging_tuple_partition_key is used.Please give more explicit examples.

>> G) We should probably have a default for error_logging_table_name, such
>> as pg_copy_errors.  Does that table get automatically created if it
>> doesn't exist?
>>   
> Yes, as indicated on the wiki the table is created automatically (see
> config variable section).

As long as you use CREATE IF NOT EXISTS, that should work ok.

>> H) Finally, one request of the TODO is some way to halt import after a
>> specified number of bad tuples because it probably means you have the
>> wrong file or wrong table.  Do we still want that?
>>   
> We can still do that. It can be another GUC variable or an option to
> COPY. If the COPY command fails, everything gets rolled back (data in
> the destination table and error table). That would be harder to
> implement with a file (the rollback part).

With a logging file, you wouldn't worry about rollback.  You'd just log
a statement to the file that it was rolled back.

I) Aster's current implementation has the advantage of being able to log
to any user-defined table, giving users the flexibility to log different
COPYs to different tables, or even put them on various tablespaces.  Is
that what we want, though?  Clearly it would make things simpler for
most (but not all) users to have just a canonical pg_copy_errors table
which was in pg_catalog.  It would also presumably remove some code from
the patch (usually a good thing)  What do people think?

J) One option I think we need if we're going to support logging to "any
defined logging table" is the option to make that table a temporary table.

>> O) Is this capable of dealing with partitioning by more than one column,
>> or by an expression?
>>   
> Yes, we just use a brute force technique where we try all child tables
> 1-by-1 and rely on the existing Postgres constraint checking mechanism
> (no new or duplicated code there).

Sounds disastrous performance-wise.  There's no easy way to test the
expression without attempting an actual insert?


-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


pgsql-hackers by date:

Previous
From: Emmanuel Cecchet
Date:
Subject: Re: COPY enhancements
Next
From: Tom Lane
Date:
Subject: Re: COPY enhancements