Re: COPY-able csv log outputs - Mailing list pgsql-patches

From Andrew Dunstan
Subject Re: COPY-able csv log outputs
Date
Msg-id 4651108D.9070806@dunslane.net
Whole thread Raw
In response to Re: COPY-able csv log outputs  (Greg Smith <gsmith@gregsmith.com>)
Responses Re: COPY-able csv log outputs  (Greg Smith <gsmith@gregsmith.com>)
Re: COPY-able csv log outputs  (Greg Smith <gsmith@gregsmith.com>)
List pgsql-patches

Greg Smith wrote:
> I got a chance to review this patch over the weekend.  Basic API seems
> good, met all my requirements, no surprises with how the GUC variable
> controlled the feature.
>
> The most fundamental issue I have with the interface is that using
> COPY makes it difficult to put any unique index on the resulting
> table.  I like to have a unique index on my imported log table because
> it rejects the dupe records if you accidentally import the same
> section of log file twice.  COPY tosses the whole thing if there's an
> index violation, which is a problem during a regular import because
> you will occasionally come across lines with the same timestamp that
> are similar in every way except for their statment; putting an index
> on the timestamp+statement seems impractical.

Does the format not include the per-process line number? (I know i
briefly looked at this patch previously, but I forget the details.) One
reason I originally included line numbers in log_line-prefix was to
handle this sort of problem.

>
> I've had a preference for INSERT from the beginning here that this
> reinforces.

COPY is our standard bulk insert mechanism. I think arguing against it
would be a very hard sell.

> I'm planning to just work around this issue by doing the COPY into a
> temporary table and then INSERTing from there.  I didn't want to just
> let the concern pass by without mentioning it though.  It crosses my
> mind that inserting some sort of unique log file line ID number would
> prevent the dupe issue and make for better ordering (it's possible to
> have two lines with the same timestamp show up in the wrong order
> now), not sure that's a practical idea to consider.

I guess that answers my question. We should definitely provide a unique
line key.
>
> The basic coding of the patch seemed OK to me, but someone who is much
> more familiar than myself with the mechanics of pipes should take a
> look at that part of the patch before committing; it's complicated
> code and I can't comment on it.  There are some small formatting
> issues that need to be fixed, particularly in the host+port mapping.
> I can fix those myself and submit a slightly updated patch.  There's
> some documentation improvements I want to make before this goes in as
> well.
>
> The patch is actually broken fairly hard right now because of the
> switch from INSERT to COPY FROM CSV as the output format at the last
> minute.  It outputs missing fields as NULL (fine for INSERT) that
> chokes the CSV import when the session_start timestamp is missing.
> All of those NULL values need to be just replaced with nothing for
> proper CSV syntax; there should just the comma for the next field.  I
> worked around this with
>
> copy pglog from '/opt/pgsql/testlog.csv' with CSV null as 'NULL';
>
>

I missed that before - yes it should be fixed.

cheers

andrew

pgsql-patches by date:

Previous
From: "FAST PostgreSQL"
Date:
Subject: Re: Updateable cursors patch
Next
From: Tom Lane
Date:
Subject: Re: COPY-able csv log outputs