Re: performance for high-volume log insertion - Mailing list pgsql-performance

From david@lang.hm
Subject Re: performance for high-volume log insertion
Date
Msg-id alpine.DEB.1.10.0904210051080.12662@asgard.lang.hm
In response to Re: performance for high-volume log insertion  (Stephen Frost <sfrost@snowman.net>)
List pgsql-performance
On Tue, 21 Apr 2009, Stephen Frost wrote:

> David,
>
> * david@lang.hm (david@lang.hm) wrote:
>> I thought that part of the 'efficiency' and 'performance' to be gained
>> from binary modes was avoiding the need to parse commands; if it's only
>> the savings in converting column contents from text to specific types,
>> it's much less important.
>
> No, binary mode is about the column contents.  Prepared queries are about
> avoiding having to parse commands (which is why the command and the data
> elements are separately done).

thanks for the clarification.

>>> I'd avoid having the user provide actual SQL, because that becomes
>>> difficult to deal with unless you embed an SQL parser in rsyslog, and
>>> I don't really see the value in that.
>>
>> there's no need for rsyslog to parse the SQL, just to be able to escape
>> it appropriately and then pass it to the database for execution
>
> If the user is providing SQL, then you need to be able to parse that SQL
> if you're going to do prepared queries.  It might not require you to be
> able to fully parse SQL the way the back-end does, but anything you do
> that's not a full SQL parser is going to end up having limitations that
> won't be easy to find or document.

the current situation is that rsyslog never parses the SQL (other than as
text for a template, just as if you were going to write the log message
to disk).

if we stick with the string-based API, we never need to.

the user gives us one string 'prepare...' that we send to the database.
the user then gives us another string 'execute...' that we send to the
database. at no point do we ever need to parse the SQL, or even really
know that it is SQL (the one exception is an escaping routine that
replaces ' with '' in the strings coming from the outside world); it's
just strings assembled using the same string assembly logic that is used
for writing files to disk, crafting the payload of network packets to
other servers, etc.
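to make the flow above concrete, here is a minimal Python sketch of that
string-based approach (the function and template names are hypothetical,
not rsyslog's actual code): the only SQL-specific step is the
quote-doubling escape; everything else is opaque string assembly.

```python
def escape_sql_literal(value: str) -> str:
    """Double single quotes so the value is safe inside a quoted SQL literal."""
    return value.replace("'", "''")

def build_insert(template: str, fields: dict) -> str:
    """Fill a user-supplied template with escaped field values.

    The template is treated as an opaque string, exactly as a file-output
    template would be; nothing here parses or understands SQL.
    """
    return template % {k: escape_sql_literal(v) for k, v in fields.items()}

stmt = build_insert(
    "INSERT INTO syslog (host, msg) VALUES ('%(host)s', '%(msg)s')",
    {"host": "web01", "msg": "user O'Brien logged in"},
)
```

the resulting string is handed to the database as-is; the same assembly
logic could just as easily have produced a line of a flat file.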

I do agree that there is a reduction in security risk, but since rsyslog
is rather draconian about forcing the escaping, I'm not sure this is
enough to tip the scales.


>> one huge advantage of putting the sql into the configuration is the
>> ability to work around other users of the database.
>
> See, I just don't see that.

moving a bit away from the traditional syslog use case for a moment: with
the ability to accept messages from many different types of sources (some
unreliable, like UDP syslog, others very reliable, with full
application-level acknowledgements), the ability to filter messages to
different destinations, and the ability to configure it to craft arbitrary
SQL statements, rsyslog can be useful as an 'impedance match' between
different applications. you can coerce just about any app to write some
sort of message to a file/pipe, and rsyslog can take that and get it into
a database elsewhere. yes, you or I could write a quick program (in
perl/python/etc) that would reformat the message and submit it, but
extending that to handle outages, high-volume bursts of traffic, etc.
starts to be hard.

this is very much _not_ a common use case, but it's a useful side-use of
rsyslog today.

>> I guess you could give examples of the SQL in the documentation for how
>> to create the prepared statement etc in the databases, but how is that
>> much better than having it in the config file?
>>
>> for many users it's easier to do middling-fancy stuff in the SQL than
>> loading things into the database (can you pre-load prepared statements in
>> the database? or are they a per-connection thing?)
>
> Prepared statements, at least under PG, are a per-connection thing.
> Triggers aren't the same, those are attached to tables and get called
> whenever a particular action is done on those tables (defined in the
> trigger definition).  The trigger is then called with the row which is
> being inserted, etc, and can do whatever it wants with that row (put it
> in a different table, ignore it, etc).

that sounds like a lot of work at the database level to avoid some
complexity on the app side (and it seems that the need to fire a trigger
probably costs more than the prepared statement ever hoped to gain).
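for concreteness, the trigger approach Stephen describes might look
something like the following sketch (table and function names are
hypothetical; the syntax matches PostgreSQL's PL/pgSQL triggers):

```sql
-- Hypothetical sketch: route rows inserted into a generic log table
-- into a per-application table, as described above.
CREATE FUNCTION route_log() RETURNS trigger AS $$
BEGIN
    IF NEW.facility = 'mail' THEN
        INSERT INTO mail_log VALUES (NEW.*);
        RETURN NULL;  -- swallow the original row
    END IF;
    RETURN NEW;       -- otherwise keep it in the base table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER route_log_trg
    BEFORE INSERT ON syslog
    FOR EACH ROW EXECUTE PROCEDURE route_log();
```

the function runs once per inserted row, which is the per-row overhead
being weighed here against the savings from a prepared statement.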

>> so back to the main questions of the advantages
>>
>> prepared statements avoid needing to escape things, but at the cost of
>> a more complex API.
>>
>> there's still the question of the performance difference. I have been
>> thinking that the overhead of doing the work itself would overwhelm the
>> performance benefits of prepared statements.
>
> What work is it that you're referring to here?

doing the inserts themselves (putting the data in the tables, updating
indexes, issuing an fsync)

> Based on what you've
> said about your application so far, I would expect that the run-time
> cost to prepare the statement (which you do only once) to be a bit of a
> cost, but not much, and that the actual inserts would be almost free
> from the application side, and much easier for the database to
> parse/use.

the inserts are far from free ;-)

but I agree that with prepared statements, the overhead of the insert is
small. I'm trying to get a guess as to how small.

>> as I understand it, the primary performance benefit is the ability to
>> avoid the parsing and planning stages of the command. for simple commands
>> (which I assume inserts to be, even if inserting a lot of stuff), the
>> planning would seem to be cheap compared to the work of doing the inserts
>
> The planning isn't very expensive, no, but it's also not free.  The
> parsing is more expensive.

ok, that makes the multi-value insert approach (or COPY) sound more
attractive (less parsing).
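a multi-value insert fits the string-assembly model above directly: many
rows go out as one statement, so the server parses and plans once. a
minimal sketch (hypothetical names, quote-doubling escape as before):

```python
def escape(v: str) -> str:
    # double single quotes per the SQL string-literal rule
    return v.replace("'", "''")

def multi_value_insert(table: str, columns: list, rows: list) -> str:
    """Assemble one INSERT covering many rows, so the server parses and
    plans a single statement instead of one per row."""
    values = ",".join(
        "(" + ",".join("'" + escape(str(v)) + "'" for v in row) + ")"
        for row in rows
    )
    return "INSERT INTO %s (%s) VALUES %s" % (table, ",".join(columns), values)

sql = multi_value_insert(
    "syslog", ["host", "msg"],
    [("web01", "up"), ("web02", "it's down")],
)
```

the same batching idea taken further is COPY, which also streams many
rows through a single command.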

>> on a fully tuned database are we talking about 10% performance? 1%? 0.01%?
>>
>> any ideas?
>
> It depends a great deal on your application..  Do you have some example
> data that we could use to test with?  Some default templates that you
> think most people end up using which we could create dummy data for?

take the contents of /var/log/messages on your system, split it into
timestamp, server, log and you have at least a reasonable case to work
with.
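splitting a classic /var/log/messages line into those three fields is
simple, since the traditional timestamp is fixed-width; a sketch
(assuming the old 'MMM DD HH:MM:SS host message' format, not RFC 5424):

```python
def split_syslog_line(line: str):
    """Split a classic /var/log/messages line into (timestamp, server, log).

    The traditional format is 'MMM DD HH:MM:SS host rest-of-message',
    with a fixed 15-character timestamp.
    """
    timestamp = line[:15]                    # e.g. 'Apr 21 00:51:08'
    server, log = line[16:].split(" ", 1)    # hostname, then the message
    return timestamp, server, log

ts, host, msg = split_syslog_line("Apr 21 00:51:08 asgard rsyslogd: restart")
```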

> Of course, in the end, batching your inserts into fewer transactions
> than 1-per-insert should give you a huge benefit right off the bat.

agreed, that was my expectation, but then people started saying how
important it was to use prepared statements.
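the batching itself is easy to demonstrate; this runnable sketch uses the
stdlib sqlite3 module as a stand-in for PostgreSQL (an assumption for
testability; the same pattern applies through libpq), committing once per
batch instead of once per row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE syslog (host TEXT, msg TEXT)")

rows = [("web%02d" % i, "message %d" % i) for i in range(500)]

# One transaction for the whole batch: a single durability sync at
# commit time instead of one per INSERT.
with conn:
    conn.executemany("INSERT INTO syslog VALUES (?, ?)", rows)

count = conn.execute("SELECT count(*) FROM syslog").fetchone()[0]
```

with one transaction per insert, each row pays the full commit cost;
batched, that cost is amortized over hundreds of rows.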

> What we're talking about is really the steps after that, which might not
> ever be necessary for your particular application.  On the other hand,
> I'd never let production code go out that isn't using prepared queries
> wherever possible.

I do understand the sentiment, really I do. I'm just trying to understand
the why.

in this case, moving from being able to insert one record per transaction
to inserting several hundred per transaction will _probably_ swamp
everything else. but I really hate the word 'probably'; I've been
surprised too many times.

one of the more recent surprises was on my multi-TB (4-core, 32G RAM, 16
spindles of data) log query server: I discovered that there was no
noticeable difference between raid 10 and raid 6 for the bulk of my query
workload (an example is finding 14 log entries out of 12B). it turns out
that if you are read-only with a seek-heavy workload, you can keep all
spindles busy with either approach. so in spite of the fact that
'everybody knows' that raid 10 is _far_ better than raid 6, especially for
databases, I discovered that that really only applies to writes.


so I really do want to understand the 'why' if possible.

you have helped a lot.

David Lang
