Re: performance for high-volume log insertion - Mailing list pgsql-performance

From Kenneth Marshall
Subject Re: performance for high-volume log insertion
Date
Msg-id 20090421154458.GD18845@it.is.rice.edu
In response to Re: performance for high-volume log insertion  (david@lang.hm)
List pgsql-performance
On Tue, Apr 21, 2009 at 08:37:54AM -0700, david@lang.hm wrote:
> Kenneth,
>   could you join the discussion on the rsyslog mailing list?
> rsyslog-users <rsyslog@lists.adiscon.com>
>
> I'm surprised to hear you say that rsyslog can already do batch inserts and
> am interested in how you did that.
>
> what sort of insert rate did you manage to get?
>
> David Lang
>
David,

I would be happy to join the discussion. I did not mean to say
that rsyslog currently supports batch inserts, just that the
pieces that provide "stand-by queuing" could be used to manage
batched inserts.

Cheers,
Ken

> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
>
>> Date: Tue, 21 Apr 2009 08:33:30 -0500
>> From: Kenneth Marshall <ktm@rice.edu>
>> To: Richard Huxton <dev@archonet.com>
>> Cc: david@lang.hm, Stephen Frost <sfrost@snowman.net>,
>>     Greg Smith <gsmith@gregsmith.com>, pgsql-performance@postgresql.org
>> Subject: Re: [PERFORM] performance for high-volume log insertion
>> Hi,
>>
>> I just finished reading this thread. We are currently working on
>> setting up a central log system using rsyslog and PostgreSQL. It
>> works well once we patched the memory leak. We also looked at what
>> could be done to improve the efficiency of the DB interface. On the
>> rsyslog side, moving to prepared queries removes both the escaping
>> that currently has to be done before handing the data to the SQL
>> backend and the per-insert parsing and planning time. That is a big
>> win at high insert rates, which is what we are talking about. The
>> escaping is itself a big CPU user in rsyslog, and the backend then
>> has to undo that work and parse/plan the resulting query, which can
>> use a surprising amount of additional CPU. Even if you cannot
>> support a general prepared-query interface, letting the user specify
>> what the query should look like handles much of the low-hanging
>> fruit query-wise.
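For illustration, a minimal sketch of what the prepared form of such an
insert could look like in plain SQL, assuming a simple three-column table
logs(log_time, host, message) (all names here are invented for the example):

    -- Parsed and planned once per connection; values are sent as
    -- parameters, so no string escaping is needed on the client side.
    PREPARE log_insert (timestamptz, text, text) AS
        INSERT INTO logs (log_time, host, message) VALUES ($1, $2, $3);

    -- Each log record then becomes a cheap EXECUTE:
    EXECUTE log_insert (now(), 'mail01', 'connect from 192.0.2.10');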
>>
>> We currently use a date-based trigger to route inserts to a new
>> partition each day, keeping two months of logs. This can be managed
>> entirely in the backend database, but if rsyslog supported switching
>> the insert to a new table on a time basis, the CPU the trigger burns
>> on the backend could be reclaimed. That would be a win for any DB
>> backend. When moving to the new partition, issuing a truncate to
>> clear the old table would also simplify the DB interfaces.
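To make the idea concrete, a sketch of a per-day child partition and the
routing trigger, still against the hypothetical logs(log_time, host,
message) table from the earlier sketch; this trigger is the per-row cost
that could be avoided if rsyslog switched target tables itself:

    -- One child table per day, using table inheritance.
    CREATE TABLE logs_20090421 (
        CHECK (log_time >= '2009-04-21' AND log_time < '2009-04-22')
    ) INHERITS (logs);

    -- Route inserts on the parent to the current day's child.
    CREATE OR REPLACE FUNCTION logs_insert_redirect() RETURNS trigger AS $$
    BEGIN
        IF NEW.log_time >= '2009-04-21' AND NEW.log_time < '2009-04-22' THEN
            INSERT INTO logs_20090421 VALUES (NEW.*);
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER logs_partition_trg
        BEFORE INSERT ON logs
        FOR EACH ROW EXECUTE PROCEDURE logs_insert_redirect();

    -- Retiring the oldest (roughly two-month-old) partition is then just:
    TRUNCATE logs_20090221;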
>>
>> Another performance enhancement, already mentioned, would be to
>> allow certain extra fields in the DB to be automatically populated
>> as a function of the log messages. For example, logging the mail queue
>> id for messages from mail systems would make it much easier to locate
>> particular mail transactions in large amounts of data.
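As a sketch of the database-side version of this idea (the column name,
function name, and pattern below are invented; doing the extraction in
rsyslog before the insert would move this CPU cost off the backend):

    ALTER TABLE logs ADD COLUMN queue_id text;

    -- Pull a Postfix/Sendmail-style queue id out of the message text.
    CREATE OR REPLACE FUNCTION logs_extract_queue_id() RETURNS trigger AS $$
    BEGIN
        NEW.queue_id := substring(NEW.message from '([0-9A-F]{8,12}):');
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER logs_queue_id_trg
        BEFORE INSERT ON logs
        FOR EACH ROW EXECUTE PROCEDURE logs_extract_queue_id();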
>>
>> To sum up, eliminating the escaping in rsyslog through the use of
>> prepared queries would reduce the CPU load on the DB backend, and
>> batching the inserts will also net a big performance increase. Some
>> DB-based applications let you specify two query forms, one for single
>> inserts and a second for multiple inserts (copy). Rsyslog already has
>> the queuing pieces needed to batch inserts. Just some ideas.
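For example, the two query shapes side by side, again against the
hypothetical logs table from the earlier sketches:

    -- Single-record form: one statement (and one round trip) per message.
    INSERT INTO logs (log_time, host, message)
        VALUES ('2009-04-21 08:33:30', 'mail01', 'connection accepted');

    -- Batched form: COPY streams many records in one command, paying the
    -- statement overhead only once.
    COPY logs (log_time, host, message) FROM STDIN WITH DELIMITER '|';
    2009-04-21 08:33:30|mail01|connection accepted
    2009-04-21 08:33:31|mail01|client disconnected
    \.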
>>
>> Regards,
>> Ken
>>
>>
>> On Tue, Apr 21, 2009 at 09:56:23AM +0100, Richard Huxton wrote:
>>> david@lang.hm wrote:
>>>> On Tue, 21 Apr 2009, Stephen Frost wrote:
>>>>> * david@lang.hm (david@lang.hm) wrote:
>>>>>> while I fully understand the 'benchmark your situation' need, this isn't
>>>>>> that simple.
>>>>>
>>>>> It really is.  You know your application, you know its primary use
>>>>> cases, and probably have some data to play with.  You're certainly in a
>>>>> much better situation to at least *try* and benchmark it than we are.
>>>> rsyslog is a syslog server. it replaces (or, for Debian and Fedora, has
>>>> replaced) your standard syslog daemon. it receives log messages from
>>>> every app on your system (and possibly others), filters and manipulates
>>>> them, and then stores them somewhere. among the places that it can store
>>>> the logs are database servers (native support for MySQL, PostgreSQL, and
>>>> Oracle, plus libdbi for others).
>>>
>>> Well, from a performance standpoint the obvious things to do are:
>>> 1. Keep a connection open, do NOT reconnect for each log-statement
>>> 2. Batch log statements together where possible
>>> 3. Use prepared statements
>>> 4. Partition the tables by day/week/month/year (configurable I suppose)
>>>
>>> The first two are vital, the third takes you a step further. The fourth is
>>> a long-term admin thing.
>>>
>>> And possibly
>>> 5. Have two connections, one for fatal/error etc and one for info/debug
>>> level log statements (configurable split?). Then you can use the
>>> synchronous_commit setting on the less important ones. Might buy you some
>>> performance on a busy system.
>>>
>>> http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS
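For instance, the connection carrying only info/debug traffic could simply
run (PostgreSQL 8.3 or later):

    -- Commits on this session return before the WAL is flushed to disk;
    -- a crash can lose the last fraction of a second of low-priority
    -- rows, but it avoids one fsync wait per commit.
    SET synchronous_commit = off;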
>>>
>>>> other apps then search and report on the data after it is stored. what
>>>> apps? I don't know either. pick your favorite reporting tool and you'll
>>>> be a step ahead of me (I don't know a really good reporting tool).
>>>> as for sample data, you have syslog messages, just like I do, so you have
>>>> the same access to data that I have.
>>>> how would you want to query them? how would people far less experienced
>>>> than you want to query them?
>>>> I can speculate that some people would do two columns (time, everything
>>>> else), others will do three (time, server, everything else), and others
>>>> will go further (I know some who would like to extract IP addresses
>>>> embedded in a message into their own column). some people will index on
>>>> the time and host, others will want to do full-text searches of
>>>> everything.
>>>
>>> Well, assuming it looks much like traditional syslog, I would do something
>>> like: (timestamp, host, facility, priority, message). It's easy enough to
>>> stitch back together if people want that.
>>>
>>> PostgreSQL's full-text indexing is quite well suited to logfiles I'd have
>>> thought, since it knows about filenames, urls etc already.
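A minimal sketch of that layout with a full-text index on the message
column (column names as suggested above, index name invented):

    CREATE TABLE logs (
        log_time  timestamptz NOT NULL,
        host      text        NOT NULL,
        facility  text,
        priority  text,
        message   text
    );

    -- GIN index over the message body using the built-in text search
    -- (integrated into core as of 8.3).
    CREATE INDEX logs_message_fts ON logs
        USING gin (to_tsvector('english', message));

    -- A typical search:
    SELECT log_time, host, message
      FROM logs
     WHERE to_tsvector('english', message) @@ to_tsquery('sshd & failed');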
>>>
>>> If you want to get fancy, add a msg_type column and one subsidiary table
>>> for each msg_type. So - you might have smtp_connect_from (hostname,
>>> ip_addr). A set of perl regexps can match and extract the fields for these
>>> extra tables, or you could do it with triggers inside the database. I think
>>> it makes sense to do it in the application. Easier for users to contribute
>>> new patterns/extractions. Meanwhile, the core table is untouched so you
>>> don't *need* to know about these extra tables.
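As a sketch of that shape (names purely illustrative, following the
smtp_connect_from example; the regexp-driven extraction itself would live
in the application, as suggested):

    ALTER TABLE logs ADD COLUMN msg_type text;

    -- One narrow table per message type, holding just the extracted
    -- fields plus the timestamp to correlate with the core table.
    CREATE TABLE smtp_connect_from (
        log_time  timestamptz NOT NULL,
        hostname  text,
        ip_addr   inet
    );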
>>>
>>> If you have subsidiary tables, you'll want to partition those too and
>>> perhaps stick them in their own schema (logs200901, logs200902 etc).
>>>
>>> --
>>>   Richard Huxton
>>>   Archonet Ltd
>>>
>>
>
>
