Re: performance for high-volume log insertion - Mailing list pgsql-performance

From david@lang.hm
Subject Re: performance for high-volume log insertion
Date
Msg-id alpine.DEB.1.10.0904210834250.12662@asgard.lang.hm
Whole thread Raw
In response to Re: performance for high-volume log insertion  (Kenneth Marshall <ktm@rice.edu>)
Responses Re: performance for high-volume log insertion
List pgsql-performance
Kenneth,
   could you join the discussion on the rsyslog mailing list?
rsyslog-users <rsyslog@lists.adiscon.com>

I'm surprised to hear you say that rsyslog can already do batch inserts
and am interested in how you did that.

what sort of insert rate did you mange to get?

David Lang

On Tue, 21 Apr 2009, Kenneth Marshall wrote:

> Date: Tue, 21 Apr 2009 08:33:30 -0500
> From: Kenneth Marshall <ktm@rice.edu>
> To: Richard Huxton <dev@archonet.com>
> Cc: david@lang.hm, Stephen Frost <sfrost@snowman.net>,
>     Greg Smith <gsmith@gregsmith.com>, pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] performance for high-volume log insertion
>
> Hi,
>
> I just finished reading this thread. We are currently working on
> setting up a central log system using rsyslog and PostgreSQL. It
> works well once we patched the memory leak. We also looked at what
> could be done to improve the efficiency of the DB interface. On the
> rsyslog side, moving to prepared queries allows you to remove the
> escaping that needs to be done currently before attempting to
> insert the data into the SQL backend as well as removing the parsing
> and planning time from the insert. This is a big win for high insert
> rates, which is what we are talking about. The escaping process is
> also a big CPU user in rsyslog which then hands the escaped string
> to the backend which then has to undo everything that had been done
> and parse/plan the resulting query. This can use a surprising amount
> of additional CPU. Even if you cannot support a general prepared
> query interface, by specifying what the query should look like you
> can handle much of the low-hanging fruit query-wise.
>
> We are currently using a date based trigger to use a new partition
> each day and keep 2 months of logs currently. This can be usefully
> managed on the backend database, but if rsyslog supported changing
> the insert to the new table on a time basis, the CPU used by the
> trigger to support this on the backend could be reclaimed. This
> would be a win for any DB backend. As you move to the new partition,
> issuing a truncate to clear the table would simplify the DB interfaces.
>
> Another performance enhancement already mentioned, would be to
> allow certain extra fields in the DB to be automatically populated
> as a function of the log messages. For example, logging the mail queue
> id for messages from mail systems would make it much easier to locate
> particular mail transactions in large amounts of data.
>
> To sum up, eliminating the escaping in rsyslog through the use of
> prepared queries would reduce the CPU load on the DB backend. Batching
> the inserts will also net you a big performance increase. Some DB-based
> applications allow for the specification of several types of queries,
> one for single inserts and then a second to support multiple inserts
> (copy). Rsyslog already supports the queuing pieces to allow you to
> batch inserts. Just some ideas.
>
> Regards,
> Ken
>
>
> On Tue, Apr 21, 2009 at 09:56:23AM +0100, Richard Huxton wrote:
>> david@lang.hm wrote:
>>> On Tue, 21 Apr 2009, Stephen Frost wrote:
>>>> * david@lang.hm (david@lang.hm) wrote:
>>>>> while I fully understand the 'benchmark your situation' need, this isn't
>>>>> that simple.
>>>>
>>>> It really is.  You know your application, you know it's primary use
>>>> cases, and probably have some data to play with.  You're certainly in a
>>>> much better situation to at least *try* and benchmark it than we are.
>>> rsyslog is a syslog server. it replaces (or for debian and fedora, has
>>> replaced) your standard syslog daemon. it recieves log messages from every
>>> app on your system (and possibly others), filters, maniulates them, and
>>> then stores them somewhere. among the places that it can store the logs
>>> are database servers (native support for MySQL, PostgreSQL, and Oracle.
>>> plus libdbi for others)
>>
>> Well, from a performance standpoint the obvious things to do are:
>> 1. Keep a connection open, do NOT reconnect for each log-statement
>> 2. Batch log statements together where possible
>> 3. Use prepared statements
>> 4. Partition the tables by day/week/month/year (configurable I suppose)
>>
>> The first two are vital, the third takes you a step further. The fourth is
>> a long-term admin thing.
>>
>> And possibly
>> 5. Have two connections, one for fatal/error etc and one for info/debug
>> level log statements (configurable split?). Then you can use the
>> synchronous_commit setting on the less important ones. Might buy you some
>> performance on a busy system.
>>
>> http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS
>>
>>> other apps then search and report on the data after it is stored. what
>>> apps?, I don't know either. pick your favorite reporting tool and you'll
>>> be a step ahead of me (I don't know a really good reporting tool)
>>> as for sample data, you have syslog messages, just like I do. so you have
>>> the same access to data that I have.
>>> how would you want to query them? how would people far less experianced
>>> that you want to query them?
>>> I can speculate that some people would do two columns (time, everything
>>> else), others will do three (time, server, everything else), and others
>>> will go further (I know some who would like to extract IP addresses
>>> embedded in a message into their own column). some people will index on
>>> the time and host, others will want to do full-text searches of
>>> everything.
>>
>> Well, assuming it looks much like traditional syslog, I would do something
>> like: (timestamp, host, facility, priority, message). It's easy enough to
>> stitch back together if people want that.
>>
>> PostgreSQL's full-text indexing is quite well suited to logfiles I'd have
>> thought, since it knows about filenames, urls etc already.
>>
>> If you want to get fancy, add a msg_type column and one subsidiary table
>> for each msg_type. So - you might have smtp_connect_from (hostname,
>> ip_addr). A set of perl regexps can match and extract the fields for these
>> extra tables, or you could do it with triggers inside the database. I think
>> it makes sense to do it in the application. Easier for users to contribute
>> new patterns/extractions. Meanwhile, the core table is untouched so you
>> don't *need* to know about these extra tables.
>>
>> If you have subsidiary tables, you'll want to partition those too and
>> perhaps stick them in their own schema (logs200901, logs200902 etc).
>>
>> --
>>   Richard Huxton
>>   Archonet Ltd
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance
>>
>

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: WHERE condition not being pushed down to union parts
Next
From: Kenneth Marshall
Date:
Subject: Re: performance for high-volume log insertion