Thread: COPY-able sql log outputs

COPY-able sql log outputs

From
"FAST PostgreSQL"
Date:
Hi,

Attached is the completed patch for the COPY-able sql log outputs. I have
modified my previous WIP patch, incorporating all the changes requested by
the community.  This patch has been tested on both Windows and Linux.

Reiterating what has been done.

The log is now output in COPY-able format as suggested. (Not INSERT-able as
was in the previous WIP patch.)

log_destination now accepts 'sqllog' as a valid output destination. The log
output file will be determined by pg_log and log_filename variables. The sql
log output filename will be 'log_filename'.sql. The file rotation rules apply
to the sql log file output as well.

The log output format is as follows.

timestamp, username, database_name, sessionid, host_port, process_id,
command_tag, session_start_time, transaction_id, error_severity,
sql_state_code, statement

The logs can be loaded into a table using the command

COPY sqltable FROM 'filename.sql' WITH CSV;
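
For reference, the sort of table definition the command above expects would
look roughly like this (the column names and types are only a suggested
mapping on my part, not something the patch enforces):

    CREATE TABLE sqltable (
        log_time           timestamp with time zone,
        username           text,
        database_name      text,
        sessionid          text,
        host_port          text,
        process_id         integer,
        command_tag        text,
        session_start_time timestamp with time zone,
        transaction_id     bigint,
        error_severity     text,
        sql_state_code     text,
        statement          text
    );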

There are only two minor issues I can think of:
1. The sql log is currently output with newline and tab characters. It loads
into the table neatly. No problems. But when read back, at least from psql on
Windows, the tabs are replaced with some special characters.

2. I think it is better to document somewhere the table structure and the
COPY statement above. But where?

[P.S. - In the wake of community's concerns regarding the legal disclaimer
getting attached to the end of mails we send to the community, we have got an
exemption from the disclaimer getting attached. As this is the first mail I
am sending after this approval, fingers crossed, it works. If for some reason
it gets attached, please ignore this mail and I will send the patch from some
other account.]

Rgds,
Arul Shaji






Attachment

Re: COPY-able sql log outputs

From
Peter Eisentraut
Date:
Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
> Attached is the completed patch for the COPY-able sql log outputs.

Could you please remove random whitespace changes from this patch?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: COPY-able sql log outputs

From
Russell Smith
Date:
FAST PostgreSQL wrote:

I have attempted to do a quick review of the code in this patch, I've
not installed or compiled it.  However I do have a couple of comments.
> Hi,
>
> Attached is the completed patch for the COPY-able sql log outputs. I have
> modified my previous WIP patch, incorporating all the changes requested by
> the community.  This patch has been tested both on windows and linux.
>
> Reiterating what has been done.
>
> The log is now output in COPY-able format as suggested. (Not INSERT-able as
> was in the previous WIP patch.)
>
> log_destination now accepts 'sqllog' as a valid output destination. The log
> output file will be determined by pg_log and log_filename variables. The sql
> log output filename will be 'log_filename'.sql. The file rotation rules apply
> to the sql log file output as well.
>
I'm not sure I'm sold on sqllog given that it's not actually SQL at all; it's
just a CSV/text file.  But I don't have any better names at this time.
copylog, justifiedlog, and csvlog come to mind, but they all seem bad.
> The log output format is as follows.
>
> timestamp, username, database_name, sessionid, host_port, process_id,
> command_tag, session_start_time, transaction_id, error_severity,
> sql_state_code, statement
>
> The logs can be loaded into a table using the command
>
> COPY sqltable FROM 'filename.sql' WITH CSV;
>
I am concerned about the fact that no escaping is attempted on anything
except the error message itself. Is it known that it's not possible to
have a ',' in the rest of the fields?
> There are only two minor issues I can think of
> 1. The sql log is currently output with newline and tab characters. It loads
> into the table neatly. No problems. But when read back, at least from psql on
> Windows, the tabs are replaced with some special characters.
>
> 2. I think it is better to document somewhere the table structure and the
> COPY statement above. But where?
>
With regard to where to document the column structure, how about a
header line when a new file is opened?  That will allow users to use
whatever table structure they like and import the relevant columns from
the log file.  You should also mention the format in the logging section
under the sqllog parameter in the docs.
> [P.S. - In the wake of community's concerns regarding the legal disclaimer
> getting attached to the end of mails we send to the community, we have got an
> exemption from the disclaimer getting attached. As this is the first mail I
> am sending after this approval, fingers crossed, it works. If for some reason
> it gets attached, please ignore this mail and I will send the patch from some
> other account.]
>
> Rgds,
> Arul Shaji
>
Other Patch comments;

Is there an overall line length you should be adhering to?  I think
there is, so you will want to shorten those lines.

+         /* Win32 timezone names are too long so don't print them */
+ #ifndef WIN32
+              "%Y-%m-%d %H:%M:%S %Z",
+ #else
+              "%Y-%m-%d %H:%M:%S",
+ #endif

Do we actually care that the WIN32 format is long for this type of log?



+ /*
+  * Calculates and returns the timestamp
+  */
+ static void
+ get_timestamp(StringInfo buf)

"the timestamp" some words about the fact it gets the "current"
timestamp might be helpful an that it appends to a string.

!         sqllogFile = fopen(strcat(filename, ".sql"), "a");

I don't believe forcing .sql for a CSV file is a good idea; let the user
decide what they want to call it.  If you are forcing anything, at least
make it ".csv".


+ #define LOG_DESTINATION_SQL    8

SQLLOG, I would think, for completeness and consistency of naming.


Hope this helps get your patch into PostgreSQL cleanly and quickly.

Regards

Russell Smith








Re: COPY-able sql log outputs

From
"FAST PostgreSQL"
Date:
> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
> > Attached is the completed patch for the COPY-able sql log outputs.
>
> Could you please remove random whitespace changes from this patch?

Done and attached.

Attachment

Re: COPY-able sql log outputs

From
Magnus Hagander
Date:
FAST PostgreSQL wrote:
>> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
>>> Attached is the completed patch for the COPY-able sql log outputs.
>> Could you please remove random whitespace changes from this patch?
>
> Done and attached.

A couple of comments.

First one that's not at all about the code - but your emails appear to
be dated several days into the future. This one, for example, is
datestamped 2007-04-04. AFAIK timezones only go 13 (or is it 14?) hours
:) You'll want to look into that. (otherwise, it'll look like you missed
feature freeze..)


Looking at the code, I was first going to ask what happens when you
enable both redirect_stderr and this one - would the data be put in
random order in the same file. But looking at the code, it seems you're
appending ".sql" to the filename for this log. Two comments about that:

1) This is not mentioned anywhere in the docs or comments. It needs to be.

2) Not sure if .sql is such a great appendix to use, given that it's not
actual SQL. (I will also echo the other comment that was posted about
the whole name being sqllog, when it's not SQL. I think a better name is
needed since the data is csv. Goes for all parameters and function names)



Isn't the data included in the log lines in need of quoting/escaping?
What if there are weird characters in the database name, for example? You
seem to only be escaping the error message itself. I think all
user-controlled data needs to be.


Then, it may be just me, but I find code like this:
!         sqllogFile = fopen(strcat(filename, ".sql"), "a");

very hard to read. Took me a while to realize that's how it dealt with
the different filenames. I think it's better to do the filename catting
as a separate step.


//Magnus

Re: COPY-able sql log outputs

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Then, it may be just me, but I find code like this:
> !         sqllogFile = fopen(strcat(filename, ".sql"), "a");
> very hard to read.

Such coding should be rejected out of hand anyway --- can you say
"buffer overrun"?

            regards, tom lane

Re: COPY-able sql log outputs

From
Magnus Hagander
Date:
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Then, it may be just me, but I find code like this:
>> !         sqllogFile = fopen(strcat(filename, ".sql"), "a");
>> very hard to read.
>
> Such coding should be rejected out of hand anyway --- can you say
> "buffer overrun"?

I was thinking that, but given that it only takes the config parameter
from postgresql.conf and no actual user input, I considered it less
important. But thinking twice, yeah, even in that case it shouldn't be
done, just to stay on the safe side.

//Magnus

Re: COPY-able sql log outputs

From
Bruce Momjian
Date:
Surely your system time is wrong with email dated:

    Date: Wed, 4 Apr 2007 18:24:54 +0000


---------------------------------------------------------------------------

FAST PostgreSQL wrote:
> > Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
> > > Attached is the completed patch for the COPY-able sql log outputs.
> >
> > Could you please remove random whitespace changes from this patch?
>
> Done and attached.
[ Attachment, skipping... ]


--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: COPY-able sql log outputs

From
"FAST PostgreSQL"
Date:
On Sat, 31 Mar 2007 14:09, you wrote:
> FAST PostgreSQL wrote:
> >> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
> >>> Attached is the completed patch for the COPY-able sql log outputs.
> >>
> >> Could you please remove random whitespace changes from this patch?
> >
> > Done and attached.
>
> A couple of comments.
>
> First one that's not at all about the code - but your emails appear to
> be dated several days into the future. This one, for example, is
> datestamped 2007-04-04. AFAIK timezones only go 13 (or is it 14?) hours
>
> :) You'll want to look into that. (otherwise, it'll look like you missed
>
> feature freeze..)

Sorry.. Fixed.

>
>
> Looking at the code, I was first going to ask what happens when you
> enable both redirect_stderr and this one - would the data be put in
> random order in the same file. But looking at the code, it seems you're
> appending ".sql" to the filename for this log. Two comments about that:
>
> 1) This is not mentioned anywhere in the docs or comments. It needs to be.

Agreed. As I mentioned in the previous mail, preferably along with the
recommended table structure to COPY the logs into. But where? Is adding a new
paragraph under the log_destination section of the 'Where to Log' chapter OK?

>
> 2) Not sure if .sql is such a great appendix to use, given that it's not
> actual SQL. (I will also echo the other comment that was posted about
> the whole name being sqllog, when it's not SQL. I think a better name is
> needed since the data is csv. Goes for all parameters and function names)

'csvlog' is another option. I can change that. Anyone have other preferences?

>
>
>
> Isn't the data included in the log lines in need of quoting/escaping?
> What if there are weird characters in the database name, for example? You
> seem to only be escaping the error message itself. I think all
> user-controlled data needs to be.
>
>
> Then, it may be just me, but I find code like this:
> !         sqllogFile = fopen(strcat(filename, ".sql"), "a");
>
> very hard to read. Took me a while to realize that's how it dealt with
> the different filenames. I think it's better to do the filename catting
> as a separate step.

I will fix that too.

Rgds,
Arul Shaji


>
>
> //Magnus


Re: COPY-able sql log outputs

From
Magnus Hagander
Date:
On Mon, Apr 02, 2007 at 04:27:33PM +0000, FAST PostgreSQL wrote:
> On Sat, 31 Mar 2007 14:09, you wrote:
> > FAST PostgreSQL wrote:
> > >> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
> > >>> Attached is the completed patch for the COPY-able sql log outputs.
> > >>
> > >> Could you please remove random whitespace changes from this patch?
> > >
> > > Done and attached.
> >
> > A couple of comments.
> >
> > First one that's not at all about the code - but your emails appear to
> > be dated several days into the future. This one, for example, is
> > datestamped 2007-04-04. AFAIK timezones only go 13 (or is it 14?) hours
> >
> > :) You'll want to look into that. (otherwise, it'll look like you missed
> >
> > feature freeze..)
>
> Sorry.. Fixed.

Not quite. Time looks correct, but it specifies a timezone that's ten hours
off:
Received: from there (unknown 137.172.75.153)
    by fast.fujitsu.com.au (Scalix SMTP Relay 10.0.1.3)
    via ESMTP; Mon, 02 Apr 2007 16:28:34 +1000 (EST)
Date: Mon, 2 Apr 2007 16:27:33 +0000


> > Looking at the code, I was first going to ask what happens when you
> > enable both redirect_stderr and this one - would the data be put in
> > random order in the same file. But looking at the code, it seems you're
> > appending ".sql" to the filename for this log. Two comments about that:
> >
> > 1) This is not mentioned anywhere in the docs or comments. It needs to be.
>
> Agreed. As I mentioned in the previous mail, preferably along with the
> recommended table structure to COPY the logs into. But where? Is adding a new
> paragraph under the log_destination section of the 'Where to Log' chapter OK?

I think that makes sense. The comment in postgresql.conf might also need
some updating to reflect it.


> > 2) Not sure if .sql is such a great appendix to use, given that it's not
> > actual SQL. (I will also echo the other comment that was posted about
> > the whole name being sqllog, when it's not SQL. I think a better name is
> > needed since the data is csv. Goes for all parameters and function names)
>
> 'csvlog' is another option. I can change that. Anyone have other preferences?

I assume csvlog for parameter names, and .csv as the file extension? Or did
you mean csvlog for extension as well? In that case, I'd suggest .csv ;-)

//Magnus


Re: COPY-able sql log outputs

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


FAST PostgreSQL wrote:
> > Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
> > > Attached is the completed patch for the COPY-able sql log outputs.
> >
> > Could you please remove random whitespace changes from this patch?
>
> Done and attached.
[ Attachment, skipping... ]


--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: COPY-able sql log outputs

From
Andrew Dunstan
Date:
FAST PostgreSQL wrote:
>> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
>>
>>> Attached is the completed patch for the COPY-able sql log outputs.
>>>
>> Could you please remove random whitespace changes from this patch?
>>
>
> Done and attached.

Brief review of the CSV aspect only:

a. username and databasename should probably be quoted and escaped in
case they contain dangerous characters (even though that's unlikely)
b. calling pg_get_client_encoding() inside the loop in
escape_string_literal seems suboptimal. At the very least it should be
moved outside the loop so it's only called once per string.

cheers

andrew


Re: COPY-able sql log outputs

From
"FAST PostgreSQL"
Date:
On Wed, 4 Apr 2007 02:26, Andrew Dunstan wrote:

I am currently doing the following modifications to the patch as per the
review comments.

1. Changing all references to 'sqllog' or 'sql' to 'csvlog' and 'csv'
respectively.
2. Escaping the username and databasename
3. Constructing the csvlog filename using log_filename instead of strcat.
4. General code optimization.

Any other changes required?

Will soon submit the updated patch.

Rgds,
Arul Shaji


> FAST PostgreSQL wrote:
> >> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
> >>> Attached is the completed patch for the COPY-able sql log outputs.
> >>
> >> Could you please remove random whitespace changes from this patch?
> >
> > Done and attached.
>
> Brief review of the CSV aspect only:
>
> a. username and databasename should probably be quoted and escaped in
> case they contain dangerous characters (even though that's unlikely)
> b. calling pg_get_client_encoding() inside the loop in
> escape_string_literal seems suboptimal. At the very least it should be
> moved outside the loop so it's only called once per string.
>
> cheers
>
> andrew


Re: COPY-able sql log outputs

From
Russell Smith
Date:
FAST PostgreSQL wrote:
> On Wed, 4 Apr 2007 02:26, Andrew Dunstan wrote:
>
> I am currently doing the following modifications to the patch as per the
> review comments.
>
> 1. Changing all references to 'sqllog' or 'sql' to 'csvlog' and 'csv'
> respectively.
> 2. Escaping the username and databasename
> 3. Constructing the csvlog filename using log_filename instead of strcat.
> 4. General code optimization.
>
> Any other changes required?

Function additions like get_timestamp need more comments, or better
comments than "returns the timestamp" -- which timestamp?

> Will soon submit the updated patch.
>
> Rgds,
> Arul Shaji
>
>> FAST PostgreSQL wrote:
>>>> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
>>>>> Attached is the completed patch for the COPY-able sql log outputs.
>>>>
>>>> Could you please remove random whitespace changes from this patch?
>>>
>>> Done and attached.
>>
>> Brief review of the CSV aspect only:
>>
>> a. username and databasename should probably be quoted and escaped in
>> case they contain dangerous characters (even though that's unlikely)
>> b. calling pg_get_client_encoding() inside the loop in
>> escape_string_literal seems suboptimal. At the very least it should be
>> moved outside the loop so it's only called once per string.
>>
>> cheers
>>
>> andrew


Re: COPY-able csv log outputs

From
"FAST PostgreSQL"
Date:
[Modified subject line from sql to csv]

Attached is an updated patch for COPY-able log outputs. It incorporates all
the comments received on my previous patch, including improved commenting,
documentation, and escaping of all user-controlled data.

Rgds,
Arul Shaji


On Wed, 4 Apr 2007 18:02, Russell Smith wrote:
> FAST PostgreSQL wrote:
> > On Wed, 4 Apr 2007 02:26, Andrew Dunstan wrote:
> >
> > I am currently doing the following modifications to the patch as per the
> > review comments.
> >
> > 1. Changing all references to 'sqllog' or 'sql' to 'csvlog' and 'csv'
> > respectively.
> > 2. Escaping the username and databasename
> > 3. Constructing the csvlog filename using log_filename instead of strcat.
> > 4. General code optimization.
> >
> > Any other changes required?
>
> Function additions like get_timestamp need more comments, or better
> comments "returns the timestamp", which timestamp?
>
> > Will soon submit the updated patch.
> >
> > Rgds,
> > Arul Shaji
> >
> >> FAST PostgreSQL wrote:
> >>>> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
> >>>>> Attached is the completed patch for the COPY-able sql log outputs.
> >>>>
> >>>> Could you please remove random whitespace changes from this patch?
> >>>
> >>> Done and attached.
> >>
> >> Brief review of the CSV aspect only:
> >>
> >> a. username and databasename should probably be quoted and escaped in
> >> case they contain dangerous characters (even though that's unlikely)
> >> b. calling pg_get_client_encoding() inside the loop in
> >> escape_string_literal seems suboptimal. At the very least it should be
> >> moved outside the loop so it's only called once per string.
> >>
> >> cheers
> >>
> >> andrew

Attachment

Re: COPY-able csv log outputs

From
Greg Smith
Date:
I got a chance to review this patch over the weekend.  Basic API seems
good, met all my requirements, no surprises with how the GUC variable
controlled the feature.

The most fundamental issue I have with the interface is that using COPY
makes it difficult to put any unique index on the resulting table.  I like
to have a unique index on my imported log table because it rejects the
dupe records if you accidentally import the same section of log file
twice.  COPY tosses the whole thing if there's an index violation, which
is a problem during a regular import because you will occasionally come
across lines with the same timestamp that are similar in every way except
for their statment; putting an index on the timestamp+statement seems
impractical.

I've had a preference for INSERT from the beginning here that this
reinforces.  I'm planning to just work around this issue by doing the COPY
into a temporary table and then INSERTing from there.  I didn't want to
just let the concern pass by without mentioning it though.  It crosses my
mind that inserting some sort of unique log file line ID number would
prevent the dupe issue and make for better ordering (it's possible to have
two lines with the same timestamp show up in the wrong order now), not
sure that's a practical idea to consider.
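
To be concrete, the workaround I have in mind looks roughly like this (the
table and column names here are only placeholders):

    -- Stage the raw COPY into a scratch table with the same structure.
    CREATE TEMP TABLE pglog_staging (LIKE pglog);
    COPY pglog_staging FROM '/path/to/logfile.csv' WITH CSV;

    -- Then move over only rows whose (time, statement) pair isn't there yet.
    INSERT INTO pglog
        SELECT s.*
          FROM pglog_staging s
         WHERE NOT EXISTS (SELECT 1 FROM pglog p
                            WHERE p.log_time = s.log_time
                              AND p.statement = s.statement);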

The basic coding of the patch seemed OK to me, but someone who is much
more familiar than myself with the mechanics of pipes should take a look
at that part of the patch before committing; it's complicated code and I
can't comment on it.  There are some small formatting issues that need to
be fixed, particularly in the host+port mapping.  I can fix those myself
and submit a slightly updated patch.  There are some documentation
improvements I want to make before this goes in as well.

The patch is actually broken fairly hard right now because of the switch
from INSERT to COPY FROM CSV as the output format at the last minute.  It
outputs missing fields as NULL (fine for INSERT), which chokes the CSV
import when the session_start timestamp is missing.  All of those NULL
values need to be replaced with nothing for proper CSV syntax; there
should just be the comma for the next field.  I worked around this with

copy pglog from '/opt/pgsql/testlog.csv' with CSV null as 'NULL';

I can fix that too when I'm revising.  I plan to have a version free of
obvious bugs to re-submit ready by next weekend.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: COPY-able csv log outputs

From
Andrew Dunstan
Date:

Greg Smith wrote:
> I got a chance to review this patch over the weekend.  Basic API seems
> good, met all my requirements, no surprises with how the GUC variable
> controlled the feature.
>
> The most fundamental issue I have with the interface is that using
> COPY makes it difficult to put any unique index on the resulting
> table.  I like to have a unique index on my imported log table because
> it rejects the dupe records if you accidentally import the same
> section of log file twice.  COPY tosses the whole thing if there's an
> index violation, which is a problem during a regular import because
> you will occasionally come across lines with the same timestamp that
> are similar in every way except for their statement; putting an index
> on the timestamp+statement seems impractical.

Does the format not include the per-process line number? (I know I
briefly looked at this patch previously, but I forget the details.) One
reason I originally included line numbers in log_line_prefix was to
handle this sort of problem.

>
> I've had a preference for INSERT from the beginning here that this
> reinforces.

COPY is our standard bulk insert mechanism. I think arguing against it
would be a very hard sell.

> I'm planning to just work around this issue by doing the COPY into a
> temporary table and then INSERTing from there.  I didn't want to just
> let the concern pass by without mentioning it though.  It crosses my
> mind that inserting some sort of unique log file line ID number would
> prevent the dupe issue and make for better ordering (it's possible to
> have two lines with the same timestamp show up in the wrong order
> now), not sure that's a practical idea to consider.

I guess that answers my question. We should definitely provide a unique
line key.
>
> The basic coding of the patch seemed OK to me, but someone who is much
> more familiar than myself with the mechanics of pipes should take a
> look at that part of the patch before committing; it's complicated
> code and I can't comment on it.  There are some small formatting
> issues that need to be fixed, particularly in the host+port mapping.
> I can fix those myself and submit a slightly updated patch.  There's
> some documentation improvements I want to make before this goes in as
> well.
>
> The patch is actually broken fairly hard right now because of the
> switch from INSERT to COPY FROM CSV as the output format at the last
> minute.  It outputs missing fields as NULL (fine for INSERT) that
> chokes the CSV import when the session_start timestamp is missing.
> All of those NULL values need to be just replaced with nothing for
> proper CSV syntax; there should just the comma for the next field.  I
> worked around this with
>
> copy pglog from '/opt/pgsql/testlog.csv' with CSV null as 'NULL';
>
>

I missed that before - yes it should be fixed.

cheers

andrew

Re: COPY-able csv log outputs

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> The most fundamental issue I have with the interface is that using COPY
> makes it difficult to put any unique index on the resulting table. I like
> to have a unique index on my imported log table because it rejects the
> dupe records if you accidentally import the same section of log file
> twice.  COPY tosses the whole thing if there's an index violation, which
> is a problem during a regular import because you will occasionally come
> across lines with the same timestamp that are similar in every way except
> for their statement; putting an index on the timestamp+statement seems
> impractical.

Essentially the above is arguing that you want a unique index but you
can't be bothered to invent an actually-unique key.  This doesn't seem
a sound argument to me.  If we need a unique key, let's find one.

            regards, tom lane

Re: COPY-able csv log outputs

From
Greg Smith
Date:
On Sun, 20 May 2007, Andrew Dunstan wrote:

> Does the format not include the per-process line number?

It does not, and I never noticed that under the prefix
possibilities---never seemed important before!  The combination of
timestamp/pid/line (%t %p %l) looks like a useful and unique key here, so
I'll add another column for the line number to the output.  Thanks for
pointing that out, I can finish cleaning up all the functional
implementation work on this patch myself now.
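
For example, once that column is in place, something along these lines should
be possible (column names are placeholders, since the CSV column list isn't
final yet):

    CREATE UNIQUE INDEX pglog_unique_idx
        ON pglog (log_time, process_id, session_line_num);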

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: COPY-able csv log outputs

From
Greg Smith
Date:
On Sun, 20 May 2007, Andrew Dunstan wrote:

>> I've had a preference for INSERT from the beginning here that this
>> reinforces.
> COPY is our standard bulk insert mechanism. I think arguing against it would
> be a very hard sell.

Let me say my final piece on this subject...if I considered this data to
be strictly bulk insert, then I'd completely agree here.  Most of the
really interesting applications I was planning to build on top of this
mechanism are more interactive than that though.  Here's a sample:

-Write a daemon that lives on the server, connects to a logging database,
and pops into an idle loop based on LISTEN.
-A client app wants to see the recent logs files.  It uses NOTIFY to ask
the daemon for them and LISTENs for a response.
-The daemon wakes up, reads all the log files since it last did something,
and appends those records to the log file table.  It sends out a NOTIFY to
say the log file table is current.

That enables remote clients to grab the log files from the server whenever
they please, so they can actually monitor themselves.  Benchmarking is the
initial app I expect to call this, and with some types of tests I expect
the daemon to be importing every 10 minutes or so.
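
In SQL terms the handshake is nothing fancier than this (the channel names
are made up for illustration):

    -- daemon, at startup:
    LISTEN want_logs;

    -- client, when it wants fresh log data:
    LISTEN logs_loaded;
    NOTIFY want_logs;

    -- daemon, after it has imported the new log files:
    NOTIFY logs_loaded;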

Assuming a unique index on the data to prevent duplication is a required
feature, I can build this using the COPY format logs as well, but that
requires I either a) am 100% perfect in making sure I never pass over the
same data twice, which is particularly annoying when the daemon gets
restarted, or b) break the COPY into single lines and insert them one at a
time, at which point I'm not bulk loading at all.  If these were INSERT
statements instead, I'd have a lot more tolerance for error, because the
worst problem I'd ever run into is spewing some unique key violation
errors into the logs if I accidentally imported too much.  With COPY, any
mistake or synchronization issue and I lose the whole import.

I don't mean to try and stir this back up again as an argument
(particularly not on this list).  There are plenty of other apps where
COPY is clearly the best approach, you can easily make a case that my app
is a fringe application rather than a mainstream one, and on the balance
this job is still far easier than my current approach of parsing the logs.
I just wanted to give a sample of how using COPY impacts the dynamics of
how downstream applications will have to work with this data, so you can
see that my contrary preference isn't completely random here.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: COPY-able csv log outputs

From
Greg Smith
Date:
The attached patch fixes all the issues I found in the original version of
this code and completes the review I wanted to do.  Someone else will need
to take this from here.  As I already mentioned, I can't comment on the
quality of the piping implementation used to add this feature other than
to say it worked for me.

Here is a sample line generated by the current code that illustrates most
of what I adjusted:

2007-05-28 13:45:14.028 EDT,"postgres","pgbench",465b1519.5f3d,laptop:32783,24381,17,idle,2007-05-28
13:44:57 EDT,60337,LOG,,"statement: select * from branches;"
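
Mapping that onto a table, the import looks roughly like this (the column
names and types here are just my shorthand; the documentation section
mentioned below gives the recommended definition):

    CREATE TABLE pglog (
        log_time           timestamp with time zone,
        user_name          text,
        database_name      text,
        session_id         text,
        connection_from    text,
        process_id         integer,
        session_line_num   bigint,
        command_tag        text,
        session_start_time timestamp with time zone,
        transaction_id     bigint,
        error_severity     text,
        sql_state_code     text,
        message            text
    );

    COPY pglog FROM '/opt/pgsql/testlog.csv' WITH CSV;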

In addition to cleanup and bug fixes, notable changes made from the
original version include:

-Changed connection information to standard host:port syntax

-Added a per process line number to allow a unique index.  The code that
supports that in the current logs is very local to log_line_prefix.
Rather than try to share that information as is done for the timestamp,
it seemed easier to just duplicate those few lines of code for the CSV
output path.  This could cause the CSV logs to have different values for
the line number of a statement than the text format ones.  I didn't feel
that was a problem serious enough to justify the code refactoring that
would be required to assure the line numbers were identical.

-Added a new documentation section to the logging chapter devoted just to
the csvlog feature.  It gives a sample table and import syntax.  I also
gave recommendations on how to configure some related log file parameters
that can interact badly with this feature.  For example, I noticed that if
log_rotation_size was set to a value, it could split the CSV lines in two;
the result was two CSV files you couldn't import because of the partial
lines in each.  Since the rotation size feature causes other issues anyway
that make importing more complicated, documenting the issue seemed
sufficient.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Attachment

Re: COPY-able csv log outputs

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Greg Smith wrote:
> The attached patch fixes all the issues I found in the original version of
> this code and completes the review I wanted to do.  Someone else will need
> to take this from here.  As I already mentioned, I can't comment on the
> quality of the piping implementation used to add this feature other than
> to say it worked for me.
>
> Here is a sample line generated by the current code that illustrates most
> of what I adjusted:
>
> 2007-05-28 13:45:14.028 EDT,"postgres","pgbench",465b1519.5f3d,laptop:32783,24381,17,idle,2007-05-28
> 13:44:57 EDT,60337,LOG,,"statement: select * from branches;"
>
> In addition to cleanup and bug fixes, notable changes made from the
> original version include:
>
> -Changed connection information to standard host:port syntax
>
> -Added a per process line number to allow a unique index.  The code that
> supports that in the current logs is very local to log_line_prefix.
> Rather than try to share that information like is done for the timestamp,
> it seemed easier to just duplicate those few lines of code for the CSV
> output path.  This could cause the CSV logs to have different values for
> the line number of a statement than the text format ones.  I didn't feel
> that was a problem serious enough to justify the code refactoring that
> would be required to assure the line numbers were identical.
>
> -Added a new documentation section to the logging chapter devoted just to
> the csvlog feature.  It gives a sample table and import syntax.  I also
> gave recommendations on how to configure some related log file parameters
> that can interact badly with this feature.  For example, I noticed that if
> log_rotation_size was set to a value, it could split the CSV lines in two;
> the result was two CSV files you couldn't import because of the partial
> lines in each.  Since the rotation size feature causes other issues anyway
> that make importing more complicated, documenting the issue seemed
> sufficient.
>
> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Content-Description:

[ Attachment, skipping... ]


--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: COPY-able csv log outputs

From
Andrew Dunstan
Date:

Greg Smith wrote:
> The attached patch fixes all the issues I found in the original
> version of this code and completes the review I wanted to do.  Someone
> else will need to take this from here.  As I already mentioned, I
> can't comment on the quality of the piping implementation used to add
> this feature other than to say it worked for me.

I'll take it from here.


>
>
> -Added a new documentation section to the logging chapter devoted just
> to the csvlog feature.  It gives a sample table and import syntax.  I
> also gave recommendations on how to configure some related log file
> parameters that can interact badly with this feature.  For example, I
> noticed that if log_rotation_size was set to a value, it could split
> the CSV lines in two; the result was two CSV files you couldn't import
> because of the partial lines in each.  Since the rotation size feature
> causes other issues anyway that make importing more complicated,
> documenting the issue seemed sufficient.
>
>

What are the other issues? I'm not happy about producing files with
split lines.

cheers

andrew

Re: COPY-able csv log outputs

From
Greg Smith
Date:
On Fri, 1 Jun 2007, Andrew Dunstan wrote:

> Greg Smith wrote:
>> Since the rotation size feature causes other issues anyway
>> that make importing more complicated, documenting the issue seemed
>> sufficient.
>
> What are the other issues? I'm not happy about producing files with split
> lines.

Just that it's fairly simple and predictable to know what your log files
are going to be named and when they'll roll over if you use something like
a date-based naming convention--but when you add size-based rotation into
the mix, figuring out what files you need to import and when you should
import them gets more complicated.

Clearly fixing this issue altogether would be better, and I gather the
problem may extend to any time there is a switch to a new log file; my
"workaround" doesn't appear good enough anyway.  I'm very glad I caught
and mentioned this now.

Because of the extra naming/import complexity, it still might be
worthwhile to suggest people not combine size-based rotation and the
csvlog though.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: COPY-able csv log outputs

From
Andrew Dunstan
Date:

Greg Smith wrote:
> The attached patch fixes all the issues I found in the original
> version of this code and completes the review I wanted to do.  Someone
> else will need to take this from here.  As I already mentioned, I
> can't comment on the quality of the piping implementation used to add
> this feature other than to say it worked for me.

The code below strikes me as being seriously broken.

The comment says that it doubles all single quotes, double quotes and
backslashes. It doesn't (backslashes are not doubled) and in any case
doing that would simply be wrong (with or without backslashes). The ONLY
things that should be doubled are double quotes. Period.

That part is easy to fix.

But here's my question: why are we worrying at all about things like the
encoding? Especially, why are we worrying about the *client* encoding
for a server log file? We surely aren't going to switch encodings in the
middle of a log file! ISTM that this routine should simply copy the
input, byte for byte, apart from doubling the dquotes. Does that make
sense? I assume that this routine has been copied from somewhere else
without sufficient regard to the context (or the logic).

The code also illustrates something else that annoyed me elsewhere in
the patch and that I have eliminated, namely use of a macro placed in
the header file and then used in exactly one place in the code. The
macro didn't make anything clearer - quite the reverse. ISTM that a
macro used only in one file should be defined in that file, generally,
and if it's used only once is probably not much use anyway.

cheers

andrew


> +
> + /*
> +  * Escapes special characters in the string to conform
> +  * with the csv type output.
> +  * Replaces " with "", ' with '' and \ with \\.
> +  */
> + static size_t
> + escape_string_literal(char *to, const char *from)
> + {
> +     const char *source = from;
> +     char       *target = to;
> +     size_t        remaining = 0;
> +     int    client_encoding = 0;
> +
> +     if (from == NULL)
> +         return remaining;
> +
> +     remaining = strlen(from);
> +     client_encoding = pg_get_client_encoding();
> +
> +     while (remaining > 0 && *source != '\0')
> +     {
> +         char        c = *source;
> +         int            len;
> +         int            i;
> +
> +         /* Fast path for plain ASCII */
> +         if (!IS_HIGHBIT_SET(c))
> +         {
> +             /* Apply quoting if needed */
> +             if (CSV_STR_DOUBLE(c, false))
> +                 *target++ = c;
> +             /* Copy the character */
> +             *target++ = c;
> +             source++;
> +             remaining--;
> +             continue;
> +         }
> +
> +         /* Slow path for possible multibyte characters */
> +         len = pg_encoding_mblen(client_encoding, source);
> +
> +         /* Copy the character */
> +         for (i = 0; i < len; i++)
> +         {
> +             if (remaining == 0 || *source == '\0')
> +                 break;
> +             *target++ = *source++;
> +             remaining--;
> +         }
> +
> +         /*
> +          * If we hit premature end of string (ie, incomplete multibyte
> +          * character), try to pad out to the correct length with spaces.
> +          * We may not be able to pad completely, but we will always be
> +          * able to insert at least one pad space (since we'd not have
> +          * quoted a multibyte character). This should be enough to make
> +          * a string that the server will error out on.
> +          */
> +         if (i < len)
> +         {
> +             for (; i < len; i++)
> +             {
> +                 if (((size_t) (target - to)) / 2 >= strlen(from))
> +                     break;
> +                 *target++ = ' ';
> +             }
> +             break;
> +         }
> +     }
> +
> +     /* Write the terminating NUL character. */
> +     *target = '\0';
> +
> +     return target - to;
> + }
>
>