Thread: COPY-able sql log outputs
Hi,

Attached is the completed patch for the COPY-able sql log outputs. I have modified my previous WIP patch, incorporating all the changes requested by the community. This patch has been tested on both Windows and Linux.

Reiterating what has been done:

The log is now output in COPY-able format as suggested. (Not INSERT-able as was in the previous WIP patch.)

log_destination now accepts 'sqllog' as a valid output destination. The log output file will be determined by the pg_log and log_filename variables. The sql log output filename will be 'log_filename'.sql. The file rotation rules apply to the sql log file output as well.

The log output format is as follows:

timestamp, username, database_name, sessionid, host_port, process_id, command_tag, session_start_time, transaction_id, error_severity, sql_state_code, statement

The logs can be loaded into a table using the command

COPY sqltable FROM 'filename.sql' WITH CSV;

There are only two minor issues I can think of:

1. The sql log is currently output with newline and tab characters. It loads into the table neatly, no problems. But when read back, at least from psql on Windows, the tabs are replaced with some special characters.

2. I think it is better to document somewhere the table structure and the COPY statement above. But where?

[P.S. - In the wake of the community's concerns regarding the legal disclaimer getting attached to the end of mails we send to the community, we have got an exemption from the disclaimer getting attached. As this is the first mail I am sending after this approval, fingers crossed, it works. If for some reason it gets attached, please ignore this mail and I will send the patch from some other account.]

Rgds,
Arul Shaji
Attachment
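For reference, a table matching the twelve fields listed above could look something like the sketch below. The table name, column names, types, and file path are illustrative assumptions only; the patch itself does not dictate them.

CREATE TABLE sqltable (
    log_time           timestamp with time zone,
    user_name          text,
    database_name      text,
    session_id         text,
    host_port          text,
    process_id         integer,
    command_tag        text,
    session_start_time timestamp with time zone,
    transaction_id     bigint,
    error_severity     text,
    sql_state_code     text,
    statement          text
);

-- Load one rotated log file (example path only).
COPY sqltable FROM '/path/to/pg_log/postgresql.log.sql' WITH CSV;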
Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL: > Attached is the completed patch for the COPY-able sql log outputs. Could you please remove random whitespace changes from this patch? -- Peter Eisentraut http://developer.postgresql.org/~petere/
FAST PostgreSQL wrote:

I have attempted to do a quick review of the code in this patch; I've not installed or compiled it. However, I do have a couple of comments.

> Hi,
>
> Attached is the completed patch for the COPY-able sql log outputs. I have modified my previous WIP patch, incorporating all the changes requested by the community. This patch has been tested on both Windows and Linux.
>
> Reiterating what has been done.
>
> The log is now output in COPY-able format as suggested. (Not INSERT-able as was in the previous WIP patch.)
>
> log_destination now accepts 'sqllog' as a valid output destination. The log output file will be determined by the pg_log and log_filename variables. The sql log output filename will be 'log_filename'.sql. The file rotation rules apply to the sql log file output as well.

I'm not sure I'm sold on sqllog given it's not actually SQL at all; it's just a CSV/text file. But I don't have any better names at this time. copylog, justifiedlog, csvlog come to mind, but they all seem bad.

> The log output format is as follows.
>
> timestamp, username, database_name, sessionid, host_port, process_id, command_tag, session_start_time, transaction_id, error_severity, sql_state_code, statement
>
> The logs can be loaded into a table using the command
>
> COPY sqltable FROM 'filename.sql' WITH CSV;

I am concerned with the fact that no escaping is attempted on anything except the error message itself. Is it known that it's not possible to have a ',' in the rest of the field types?

> There are only two minor issues I can think of
> 1. The sql log is currently output with newline and tab characters. It loads into the table neatly, no problems. But when read back, at least from psql on Windows, the tabs are replaced with some special characters.
>
> 2. I think it is better to document somewhere the table structure and the COPY statement above. But where?

With regard to where to document the column structure, how about a header line when a new file is opened? That would allow users to use whatever table structure they like, and to import the relevant columns from the log file. You should also mention the format in the logging section under the sqllog parameter in the docs.

> [P.S. - In the wake of the community's concerns regarding the legal disclaimer getting attached to the end of mails we send to the community, we have got an exemption from the disclaimer getting attached. As this is the first mail I am sending after this approval, fingers crossed, it works. If for some reason it gets attached, please ignore this mail and I will send the patch from some other account.]
>
> Rgds,
> Arul Shaji

Other patch comments:

Is there an overall line length you should be adhering to? I think there is, so you will want to shorten those lines.

+ /* Win32 timezone names are too long so don't print them */
+ #ifndef WIN32
+ "%Y-%m-%d %H:%M:%S %Z",
+ #else
+ "%Y-%m-%d %H:%M:%S",
+ #endif

Do we actually care that the WIN32 format is long for this type of log?

+ /*
+  * Calculates and returns the timestamp
+  */
+ static void
+ get_timestamp(StringInfo buf)

"The timestamp"? Some words about the fact that it gets the "current" timestamp, and that it appends to a string, would be helpful.

! sqllogFile = fopen(strcat(filename, ".sql"), "a");

I don't believe forcing .sql for a CSV file is a good idea; let the user decide what they want to call it. If you are forcing anything, at least make it ".csv".
+ #define LOG_DESTINATION_SQL 8

SQLLOG, I would think, for completeness and consistency of naming.

Hope this helps get your patch into PostgreSQL cleanly and quickly.

Regards

Russell Smith
> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL: > > Attached is the completed patch for the COPY-able sql log outputs. > > Could you please remove random whitespace changes from this patch? Done and attached.
Attachment
FAST PostgreSQL wrote:
>> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
>>> Attached is the completed patch for the COPY-able sql log outputs.
>> Could you please remove random whitespace changes from this patch?
>
> Done and attached.

A couple of comments.

First one that's not at all about the code - but your emails appear to be dated several days into the future. This one, for example, is datestamped 2007-04-04. AFAIK timezones only go 13 (or is it 14?) hours :) You'll want to look into that. (Otherwise, it'll look like you missed feature freeze..)

Looking at the code, I was first going to ask what happens when you enable both redirect_stderr and this one - would the data be put in random order in the same file? But looking at the code, it seems you're appending ".sql" to the filename for this log. Two comments about that:

1) This is not mentioned anywhere in the docs or comments. It needs to be.

2) Not sure if .sql is such a great appendix to use, given that it's not actual SQL. (I will also echo the other comment that was posted about the whole name being sqllog, when it's not SQL. I think a better name is needed since the data is csv. Goes for all parameters and function names.)

Isn't the data included in the loglines in need of quoting/escaping? What if there are weird characters in the database name, for example? You seem to only be escaping the error message itself. I think all user-controlled data needs to be.

Then, it may be just me, but I find code like this:

! sqllogFile = fopen(strcat(filename, ".sql"), "a");

very hard to read. Took me a while to realize that's how it dealt with the different filenames. I think it's better to do the filename catting as a separate step.

//Magnus
Magnus Hagander <magnus@hagander.net> writes: > Then, it may be just me, but I find code like this: > ! sqllogFile = fopen(strcat(filename, ".sql"), "a"); > very hard to read. Such coding should be rejected out of hand anyway --- can you say "buffer overrun"? regards, tom lane
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> Then, it may be just me, but I find code like this: >> ! sqllogFile = fopen(strcat(filename, ".sql"), "a"); >> very hard to read. > > Such coding should be rejected out of hand anyway --- can you say > "buffer overrun"? I was thinking that, but given that it only takes the config parameter from postgresql.conf and no actual user input, I considered it less important. But thinking twice, yeah, even in that case it shouldn't be done, just to stay on the safe side. //Magnus
Surely your system time is wrong with email dated: Date: Wed, 4 Apr 2007 18:24:54 +0000 --------------------------------------------------------------------------- FAST PostgreSQL wrote: > > Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL: > > > Attached is the completed patch for the COPY-able sql log outputs. > > > > Could you please remove random whitespace changes from this patch? > > Done and attached. [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Sat, 31 Mar 2007 14:09, you wrote:
> FAST PostgreSQL wrote:
>>> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
>>>> Attached is the completed patch for the COPY-able sql log outputs.
>>> Could you please remove random whitespace changes from this patch?
>>
>> Done and attached.
>
> A couple of comments.
>
> First one that's not at all about the code - but your emails appear to be dated several days into the future. This one, for example, is datestamped 2007-04-04. AFAIK timezones only go 13 (or is it 14?) hours :) You'll want to look into that. (otherwise, it'll look like you missed feature freeze..)

Sorry.. Fixed.

> Looking at the code, I was first going to ask what happens when you enable both redirect_stderr and this one - would the data be put in random order in the same file. But looking at the code, it seems you're appending ".sql" to the filename for this log. Two comments about that:
>
> 1) This is not mentioned anywhere in the docs or comments. It needs to be.

Agreed. As I mentioned in the previous mail, preferably along with the recommended table structure to COPY the logs into. But where? Is adding a new paragraph under the log_destination section of the 'Where to Log' chapter ok?

> 2) Not sure if .sql is such a great appendix to use, given that it's not actual SQL. (I will also echo the other comment that was posted about the whole name being sqllog, when it's not SQL. I think a better name is needed since the data is csv. Goes for all parameters and function names)

'csvlog' is another option. I can change that. Anyone have other preferences?

> Isn't the data included in the loglines in need of quoting/escaping? What if there are weird characters in the database name, for example? You seem to only be escaping the error message itself. I think all user-controlled data needs to be.
>
> Then, it may be just me, but I find code like this:
> ! sqllogFile = fopen(strcat(filename, ".sql"), "a");
>
> very hard to read. Took me a while to realize that's how it dealt with the different filenames. I think it's better to do the filename catting as a separate step.

I will fix that too.

Rgds,
Arul Shaji

> //Magnus
On Mon, Apr 02, 2007 at 04:27:33PM +0000, FAST PostgreSQL wrote: > On Sat, 31 Mar 2007 14:09, you wrote: > > FAST PostgreSQL wrote: > > >> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL: > > >>> Attached is the completed patch for the COPY-able sql log outputs. > > >> > > >> Could you please remove random whitespace changes from this patch? > > > > > > Done and attached. > > > > A couple of comments. > > > > First one that's not at all about the code - but your emails appear to > > be dated several days into the future. This one, for example, is > > datestamped 2007-04-04. AFAIK timezones only go 13 (or is it 14?) hours > > > > :) You'll want to look into that. (otherwise, it'll look like you missed > > > > feature freeze..) > > Sorry.. Fixed. Not quite. Time looks correct, but it specifies a timezone that's ten hours off: Received: from there (unknown 137.172.75.153) by fast.fujitsu.com.au (Scalix SMTP Relay 10.0.1.3) via ESMTP; Mon, 02 Apr 2007 16:28:34 +1000 (EST) Date: Mon, 2 Apr 2007 16:27:33 +0000 > > Looking at the code, I was first going to ask what happens when you > > enable both redirect_stderr and this one - would the data be put in > > random order in the same file. But looking at the code, it seems you're > > appending ".sql" to the filename for this log. Two comments about that: > > > > 1) This is not mentioned anywhere in the docs or comments. It needs to be. > > Agreed. As I mentioned in the previous mail, preferably alongwith the > recommended table structure to COPY the logs into. But where? Is adding a new > paragraph under the log_destination section of 'Where to Log' chapter ok ? I think that makes sense. The comment in postgresql.conf might also need some updating to reflect it. > > 2) Not sure if .sql is such a great appendix to use, given that it's not > > actual SQL. (I will also echo the other comment that was posted about > > the whole name being sqllog, when it's not SQL. I think a better name is > > needed since the data is csv. Goes for all parameters and function names) > > 'csvlog' is another option. I can change that. Anyone have other preferences ? I assume csvlog for parameter names, and .csv as the file extension? Or did you mean csvlog for extension as well? In that case, I'd suggest .csv ;-) //Magnus
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- FAST PostgreSQL wrote: > > Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL: > > > Attached is the completed patch for the COPY-able sql log outputs. > > > > Could you please remove random whitespace changes from this patch? > > Done and attached. [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
FAST PostgreSQL wrote: >> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL: >> >>> Attached is the completed patch for the COPY-able sql log outputs. >>> >> Could you please remove random whitespace changes from this patch? >> > > Done and attached. Brief review of the CSV aspect only: a. username and databasename should probably be quoted and escaped in case they contain dangerous characters (even though that's unlikely) b. calling pg_get_client_encoding() inside the loop in escape_string_literal seems suboptimal. At the very least it should be moved outside the loop so it's only called once per string. cheers andrew
On Wed, 4 Apr 2007 02:26, Andrew Dunstan wrote:

I am currently doing the following modifications to the patch as per the review comments.

1. Changing all references to 'sqllog' or 'sql' to 'csvlog' and 'csv' respectively.
2. Escaping the username and databasename
3. Constructing the csvlog filename using log_filename instead of strcat.
4. General code optimization.

Any other changes required?

Will soon submit the updated patch.

Rgds,
Arul Shaji

> FAST PostgreSQL wrote:
>>> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
>>>> Attached is the completed patch for the COPY-able sql log outputs.
>>> Could you please remove random whitespace changes from this patch?
>>
>> Done and attached.
>
> Brief review of the CSV aspect only:
>
> a. username and databasename should probably be quoted and escaped in case they contain dangerous characters (even though that's unlikely)
> b. calling pg_get_client_encoding() inside the loop in escape_string_literal seems suboptimal. At the very least it should be moved outside the loop so it's only called once per string.
>
> cheers
>
> andrew
FAST PostgreSQL wrote:
> On Wed, 4 Apr 2007 02:26, Andrew Dunstan wrote:
>
> I am currently doing the following modifications to the patch as per the review comments.
>
> 1. Changing all references to 'sqllog' or 'sql' to 'csvlog' and 'csv' respectively.
> 2. Escaping the username and databasename
> 3. Constructing the csvlog filename using log_filename instead of strcat.
> 4. General code optimization.
>
> Any other changes required?

Function additions like get_timestamp need more comments, or better comments "returns the timestamp", which timestamp?

> Will soon submit the updated patch.
>
> Rgds,
> Arul Shaji
>
>> FAST PostgreSQL wrote:
>>>>> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
>>>>>> Attached is the completed patch for the COPY-able sql log outputs.
>>>>> Could you please remove random whitespace changes from this patch?
>>>> Done and attached.
>>>
>>> Brief review of the CSV aspect only:
>>>
>>> a. username and databasename should probably be quoted and escaped in case they contain dangerous characters (even though that's unlikely)
>>> b. calling pg_get_client_encoding() inside the loop in escape_string_literal seems suboptimal. At the very least it should be moved outside the loop so it's only called once per string.
>>>
>>> cheers
>>>
>>> andrew
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 6: explain analyze is your friend
[Modified subject line from sql to csv]

Attached is an updated patch for COPY-able log outputs. It incorporates all the comments received on my previous patch, including improved commenting, documentation, and escaping of all user-controlled data.

Rgds,
Arul Shaji

On Wed, 4 Apr 2007 18:02, Russell Smith wrote:
> FAST PostgreSQL wrote:
>> On Wed, 4 Apr 2007 02:26, Andrew Dunstan wrote:
>>
>> I am currently doing the following modifications to the patch as per the review comments.
>>
>> 1. Changing all references to 'sqllog' or 'sql' to 'csvlog' and 'csv' respectively.
>> 2. Escaping the username and databasename
>> 3. Constructing the csvlog filename using log_filename instead of strcat.
>> 4. General code optimization.
>>
>> Any other changes required?
>
> Function additions like get_timestamp need more comments, or better comments "returns the timestamp", which timestamp?
>
>> Will soon submit the updated patch.
>>
>> Rgds,
>> Arul Shaji
>>
>>> FAST PostgreSQL wrote:
>>>>>> Am Dienstag, 3. April 2007 20:33 schrieb FAST PostgreSQL:
>>>>>>> Attached is the completed patch for the COPY-able sql log outputs.
>>>>>> Could you please remove random whitespace changes from this patch?
>>>>> Done and attached.
>>>>
>>>> Brief review of the CSV aspect only:
>>>>
>>>> a. username and databasename should probably be quoted and escaped in case they contain dangerous characters (even though that's unlikely)
>>>> b. calling pg_get_client_encoding() inside the loop in escape_string_literal seems suboptimal. At the very least it should be moved outside the loop so it's only called once per string.
>>>>
>>>> cheers
>>>>
>>>> andrew
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 6: explain analyze is your friend
Attachment
I got a chance to review this patch over the weekend. Basic API seems good, met all my requirements, no surprises with how the GUC variable controlled the feature.

The most fundamental issue I have with the interface is that using COPY makes it difficult to put any unique index on the resulting table. I like to have a unique index on my imported log table because it rejects the dupe records if you accidentally import the same section of log file twice. COPY tosses the whole thing if there's an index violation, which is a problem during a regular import because you will occasionally come across lines with the same timestamp that are similar in every way except for their statement; putting an index on the timestamp+statement seems impractical.

I've had a preference for INSERT from the beginning here that this reinforces. I'm planning to just work around this issue by doing the COPY into a temporary table and then INSERTing from there. I didn't want to just let the concern pass by without mentioning it though. It crosses my mind that inserting some sort of unique log file line ID number would prevent the dupe issue and make for better ordering (it's possible to have two lines with the same timestamp show up in the wrong order now), not sure that's a practical idea to consider.

The basic coding of the patch seemed OK to me, but someone who is much more familiar than myself with the mechanics of pipes should take a look at that part of the patch before committing; it's complicated code and I can't comment on it. There are some small formatting issues that need to be fixed, particularly in the host+port mapping. I can fix those myself and submit a slightly updated patch. There are some documentation improvements I want to make before this goes in as well.

The patch is actually broken fairly hard right now because of the switch from INSERT to COPY FROM CSV as the output format at the last minute. It outputs missing fields as NULL (fine for INSERT), which chokes the CSV import when the session_start timestamp is missing. All of those NULL values need to be just replaced with nothing for proper CSV syntax; there should be just the comma for the next field. I worked around this with

copy pglog from '/opt/pgsql/testlog.csv' with CSV null as 'NULL';

I can fix that too when I'm revising. I plan to have a version free of obvious bugs ready to re-submit by next weekend.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote:
> I got a chance to review this patch over the weekend. Basic API seems good, met all my requirements, no surprises with how the GUC variable controlled the feature.
>
> The most fundamental issue I have with the interface is that using COPY makes it difficult to put any unique index on the resulting table. I like to have a unique index on my imported log table because it rejects the dupe records if you accidentally import the same section of log file twice. COPY tosses the whole thing if there's an index violation, which is a problem during a regular import because you will occasionally come across lines with the same timestamp that are similar in every way except for their statement; putting an index on the timestamp+statement seems impractical.

Does the format not include the per-process line number? (I know I briefly looked at this patch previously, but I forget the details.) One reason I originally included line numbers in log_line_prefix was to handle this sort of problem.

> I've had a preference for INSERT from the beginning here that this reinforces.

COPY is our standard bulk insert mechanism. I think arguing against it would be a very hard sell.

> I'm planning to just work around this issue by doing the COPY into a temporary table and then INSERTing from there. I didn't want to just let the concern pass by without mentioning it though. It crosses my mind that inserting some sort of unique log file line ID number would prevent the dupe issue and make for better ordering (it's possible to have two lines with the same timestamp show up in the wrong order now), not sure that's a practical idea to consider.

I guess that answers my question. We should definitely provide a unique line key.

> The basic coding of the patch seemed OK to me, but someone who is much more familiar than myself with the mechanics of pipes should take a look at that part of the patch before committing; it's complicated code and I can't comment on it. There are some small formatting issues that need to be fixed, particularly in the host+port mapping. I can fix those myself and submit a slightly updated patch. There are some documentation improvements I want to make before this goes in as well.
>
> The patch is actually broken fairly hard right now because of the switch from INSERT to COPY FROM CSV as the output format at the last minute. It outputs missing fields as NULL (fine for INSERT), which chokes the CSV import when the session_start timestamp is missing. All of those NULL values need to be just replaced with nothing for proper CSV syntax; there should be just the comma for the next field. I worked around this with
>
> copy pglog from '/opt/pgsql/testlog.csv' with CSV null as 'NULL';

I missed that before - yes it should be fixed.

cheers

andrew
Greg Smith <gsmith@gregsmith.com> writes: > The most fundamental issue I have with the interface is that using COPY > makes it difficult to put any unique index on the resulting table. I like > to have a unique index on my imported log table because it rejects the > dupe records if you accidentally import the same section of log file > twice. COPY tosses the whole thing if there's an index violation, which > is a problem during a regular import because you will occasionally come > across lines with the same timestamp that are similar in every way except > for their statment; putting an index on the timestamp+statement seems > impractical. Essentially the above is arguing that you want a unique index but you can't be bothered to invent an actually-unique key. This doesn't seem a sound argument to me. If we need a unique key, let's find one. regards, tom lane
On Sun, 20 May 2007, Andrew Dunstan wrote:

> Does the format not include the per-process line number?

It does not, and I never noticed that under the prefix possibilities---never seemed important before! The combination of timestamp/pid/line (%t %p %l) looks like a useful and unique key here, so I'll add another column for the line number to the output.

Thanks for pointing that out, I can finish cleaning up all the functional implementation work on this patch myself now.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
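A rough illustration of that key, assuming a hypothetical import table named pglog whose log_time, process_id and session_line_num columns carry the %t, %p and %l values respectively:

-- Hypothetical table and column names; the actual layout is up to whoever imports the logs.
CREATE UNIQUE INDEX pglog_line_uniq
    ON pglog (log_time, process_id, session_line_num);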
On Sun, 20 May 2007, Andrew Dunstan wrote:
>> I've had a preference for INSERT from the beginning here that this reinforces.
> COPY is our standard bulk insert mechanism. I think arguing against it would be a very hard sell.

Let me say my final piece on this subject... if I considered this data to be strictly bulk insert, then I'd completely agree here. Most of the really interesting applications I was planning to build on top of this mechanism are more interactive than that though. Here's a sample:

-Write a daemon that lives on the server, connects to a logging database, and pops into an idle loop based on LISTEN.

-A client app wants to see the recent log files. It uses NOTIFY to ask the daemon for them and LISTENs for a response.

-The daemon wakes up, reads all the log files since it last did something, and appends those records to the log file table. It sends out a NOTIFY to say the log file table is current.

That enables remote clients to grab the log files from the server whenever they please, so they can actually monitor themselves. Benchmarking is the initial app I expect to call this, and with some types of tests I expect the daemon to be importing every 10 minutes or so.

Assuming a unique index on the data to prevent duplication is a required feature, I can build this using the COPY format logs as well, but that requires I either a) am 100% perfect in making sure I never pass over the same data twice, which is particularly annoying when the daemon gets restarted, or b) break the COPY into single lines and insert them one at a time, at which point I'm not bulk loading at all. If these were INSERT statements instead, I'd have a lot more tolerance for error, because the worst problem I'd ever run into is spewing some unique key violation errors into the logs if I accidentally imported too much. With COPY, any mistake or synchronization issue and I lose the whole import.

I don't mean to try and stir this back up again as an argument (particularly not on this list). There are plenty of other apps where COPY is clearly the best approach, you can easily make a case that my app is a fringe application rather than a mainstream one, and on the balance this job is still far easier than my current approach of parsing the logs. I just wanted to give a sample of how using COPY impacts the dynamics of how downstream applications will have to work with this data, so you can see that my contrary preference isn't completely random here.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
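One way to reconcile this workflow with COPY is the staging-table approach already mentioned upthread: bulk-load into a scratch table, then insert only the lines that are not present yet. A minimal sketch, reusing the hypothetical pglog table and (log_time, process_id, session_line_num) key from above; the file path is an example only:

-- Stage the raw CSV first so a duplicate line never aborts the bulk load itself.
CREATE TEMP TABLE pglog_staging (LIKE pglog);
COPY pglog_staging FROM '/opt/pgsql/testlog.csv' WITH CSV;

-- Move over only the lines that have not been imported before.
INSERT INTO pglog
SELECT s.*
  FROM pglog_staging s
 WHERE NOT EXISTS (SELECT 1
                     FROM pglog p
                    WHERE p.log_time = s.log_time
                      AND p.process_id = s.process_id
                      AND p.session_line_num = s.session_line_num);

DROP TABLE pglog_staging;

Re-importing an overlapping stretch of log then simply inserts fewer rows instead of failing the whole COPY.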
The attached patch fixes all the issues I found in the original version of this code and completes the review I wanted to do. Someone else will need to take this from here. As I already mentioned, I can't comment on the quality of the piping implementation used to add this feature other than to say it worked for me.

Here is a sample line generated by the current code that illustrates most of what I adjusted:

2007-05-28 13:45:14.028 EDT,"postgres","pgbench",465b1519.5f3d,laptop:32783,24381,17,idle,2007-05-28 13:44:57 EDT,60337,LOG,,"statement: select * from branches;"

In addition to cleanup and bug fixes, notable changes made from the original version include:

-Changed connection information to standard host:port syntax

-Added a per process line number to allow a unique index. The code that supports that in the current logs is very local to log_line_prefix. Rather than try to share that information like is done for the timestamp, it seemed easier to just duplicate those few lines of code for the CSV output path. This could cause the CSV logs to have different values for the line number of a statement than the text format ones. I didn't feel that was a problem serious enough to justify the code refactoring that would be required to assure the line numbers were identical.

-Added a new documentation section to the logging chapter devoted just to the csvlog feature. It gives a sample table and import syntax. I also gave recommendations on how to configure some related log file parameters that can interact badly with this feature. For example, I noticed that if log_rotation_size was set to a value, it could split the CSV lines in two; the result was two CSV files you couldn't import because of the partial lines in each. Since the rotation size feature causes other issues anyway that make importing more complicated, documenting the issue seemed sufficient.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Attachment
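To map the sample line above onto a table: a hypothetical layout matching that field order is sketched below (column names, types, and the file path are assumptions; the authoritative definition is the one in the documentation section the patch adds). Compared with the earlier twelve-field format, it carries the host:port connection value and the new per-process line number:

CREATE TABLE pglog (
    log_time           timestamp with time zone,
    user_name          text,
    database_name      text,
    session_id         text,
    connection_from    text,    -- host:port
    process_id         integer,
    session_line_num   bigint,
    command_tag        text,
    session_start_time timestamp with time zone,
    transaction_id     bigint,
    error_severity     text,
    sql_state_code     text,
    message            text
);

COPY pglog FROM '/path/to/log_file.csv' WITH CSV;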
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- Greg Smith wrote: > The attached patch fixes all the issues I found in the original version of > this code and completes the review I wanted to do. Someone else will need > to take this from here. As I already mentioned, I can't comment on the > quality of the piping implementation used to add this feature other than > to say it worked for me. > > Here is a sample line generated by the current code that illustrates most > of what I adjusted: > > 2007-05-28 13:45:14.028 EDT,"postgres","pgbench",465b1519.5f3d,laptop:32783,24381,17,idle,2007-05-28 > 13:44:57 EDT,60337,LOG,,"statement: select * from branches;" > > In addition to cleanup and bug fixes, notable changes made from the > original version include: > > -Changed connection information to standard host:port syntax > > -Added a per process line number to allow a unique index. The code that > supports that in the current logs is very local to log_line_prefix. > Rather than try to share that information like is done for the timestamp, > it seemed easier to just duplicate those few lines of code for the CSV > output path. This could cause the CVS logs to have different values for > the line number of a statement than the text format ones. I didn't feel > that was a problem serious enough to justify the code refactoring that > would be required to assure the line numbers were identical. > > -Added a new documentation section to the logging chapter devoted just to > the csvlog feature. It gives a sample table and import syntax. I also > gave recommendations on how to configure some related log file parameters > that can interact badly with this feature. For example, I noticed that if > log_rotation_size was set to a value, it could split the CSV lines in two; > the result was two CVS files you couldn't import because of the partial > lines in each. Since the rotation size feature causes other issues anyway > that make importing more complicated, documenting the issue seemed > sufficient. > > -- > * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD Content-Description: [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Greg Smith wrote: > The attached patch fixes all the issues I found in the original > version of this code and completes the review I wanted to do. Someone > else will need to take this from here. As I already mentioned, I > can't comment on the quality of the piping implementation used to add > this feature other than to say it worked for me. I'll take it from here. > > > -Added a new documentation section to the logging chapter devoted just > to the csvlog feature. It gives a sample table and import syntax. I > also gave recommendations on how to configure some related log file > parameters that can interact badly with this feature. For example, I > noticed that if log_rotation_size was set to a value, it could split > the CSV lines in two; the result was two CVS files you couldn't import > because of the partial lines in each. Since the rotation size feature > causes other issues anyway that make importing more complicated, > documenting the issue seemed sufficient. > > What are the other issues? I'm not happy about producing files with split lines. cheers andrew
On Fri, 1 Jun 2007, Andrew Dunstan wrote:
> Greg Smith wrote:
>> Since the rotation size feature causes other issues anyway that make importing more complicated, documenting the issue seemed sufficient.
>
> What are the other issues? I'm not happy about producing files with split lines.

Just that it's fairly simple and predictable to know what your log files are going to be named and when they'll roll over if you use something like a date-based naming convention--but when you add size-based rotation into the mix, figuring out what files you need to import and when you should import them gets more complicated.

Clearly fixing this issue altogether would be better, and I gather the problem may extend to any time there is a switch to a new log file; my "workaround" doesn't appear good enough anyway. I'm very glad I caught and mentioned this now. Because of the extra naming/import complexity, it still might be worthwhile to suggest people not combine size-based rotation and the csvlog though.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote: > The attached patch fixes all the issues I found in the original > version of this code and completes the review I wanted to do. Someone > else will need to take this from here. As I already mentioned, I > can't comment on the quality of the piping implementation used to add > this feature other than to say it worked for me. The code below strikes me as being seriously broken. The comment says that it doubles all single quotes, double quotes and backslashes. It doesn't (backslashes are not doubled) and in any case doing that would simply be wrong (with or without backslashes). The ONLY things that should be doubled are double quotes. Period. That part is easy to fix. But here's my question: why are we worrying at all about things like the encoding? Especially, why are we worrying about the *client* encoding for a server log file? We surely aren't going to switch encodings in the middle of a log file! ISTM that this routine should simply copy the input, byte for byte, apart from doubling the dquotes. Does that make sense? I assume that this routine has been copied from somewhere else without sufficient regard to the context (or the logic). The code also illustrates something else that annoyed me elsewhere in the patch and that I have eliminated, namely use of a macro placed in the header file and then used in exactly one place in the code. The macro didn't make anything clearer - quite the reverse. ISTM that a macro used only in one file should be defined in that file, generally, and if it's used only once is probably not much use anyway. cheers andrew > + > + /* > + * Escapes special characters in the string to conform > + * with the csv type output. > + * Replaces " with "", ' with '' and \ with \\. > + */ > + static size_t > + escape_string_literal(char *to, const char *from) > + { > + const char *source = from; > + char *target = to; > + size_t remaining = 0; > + int client_encoding = 0; > + > + if (from == NULL) > + return remaining; > + > + remaining = strlen(from); > + client_encoding = pg_get_client_encoding(); > + > + while (remaining > 0 && *source != '\0') > + { > + char c = *source; > + int len; > + int i; > + > + /* Fast path for plain ASCII */ > + if (!IS_HIGHBIT_SET(c)) > + { > + /* Apply quoting if needed */ > + if (CSV_STR_DOUBLE(c, false)) > + *target++ = c; > + /* Copy the character */ > + *target++ = c; > + source++; > + remaining--; > + continue; > + } > + > + /* Slow path for possible multibyte characters */ > + len = pg_encoding_mblen(client_encoding, source); > + > + /* Copy the character */ > + for (i = 0; i < len; i++) > + { > + if (remaining == 0 || *source == '\0') > + break; > + *target++ = *source++; > + remaining--; > + } > + > + /* > + * If we hit premature end of string (ie, incomplete multibyte > + * character), try to pad out to the correct length with spaces. > + * We may not be able to pad completely, but we will always be > + * able to insert at least one pad space (since we'd not have > + * quoted a multibyte character). This should be enough to make > + * a string that the server will error out on. > + */ > + if (i < len) > + { > + for (; i < len; i++) > + { > + if (((size_t) (target - to)) / 2 >= strlen(from)) > + break; > + *target++ = ' '; > + } > + break; > + } > + } > + > + /* Write the terminating NUL character. */ > + *target = '\0'; > + > + return target - to; > + } > >