Re: performance for high-volume log insertion - Mailing list pgsql-performance

From Stephen Frost
Subject Re: performance for high-volume log insertion
Date
Msg-id 20090421064138.GV8123@tamriel.snowman.net
Whole thread Raw
In response to Re: performance for high-volume log insertion  (david@lang.hm)
Responses Re: performance for high-volume log insertion  (david@lang.hm)
List pgsql-performance
David,

* david@lang.hm (david@lang.hm) wrote:
> I thought that part of the 'efficiancy' and 'performance' to be gained
> from binary modes were avoiding the need to parse commands, if it's only
> the savings in converting column contents from text to specific types,
> it's much less important.

No, binary mode is about the column contents.  Prepared queries is about
avoiding having to parse commands (which is why the command and the data
elements are seperately done).

>> table = "mytable";
>> data = "$Y, $m, $d, $H, $msg";
>
> if the user creates the data this way, you just reintroduced the escaping
> problem. they would have to do something like
>
> data = "$Y"
> data = "$m"
> data = "$d"
> data = "$H"
> data = "$msg"

Yes, there is a bit of escaping that the admins will have to deal with
in the config file.  There's no way around that though, regardless of
what you do.  If you let them put SQL in, then they may have to escape
stuff there too.  In the end, an escape problem in the config file is
something which you should really catch when you read the file in, and
even if you don't catch it there, it's less problematic, not really a
performance problem, and much less of a security risk, than having the
escaping done on data from an untrusted source.

> one key thing is that it's very probable that the user will want to
> manipulate the string, not just send a single variable as-is

You could probably let them do some manipulation, add extra
non-escape-code fields, maybe tack something on the beginning and end,
basically, anything that can be done in the application prior to it
hitting the database should be fine.

>> I'd avoid having the user provide actual SQL, because that becomes
>> difficult to deal with unless you embed an SQL parser in rsyslog, and
>> I don't really see the value in that.
>
> there's no need for rsyslog to parse the SQL, just to be able to escape
> it appropriately and then pass it to the database for execution

If the user is providing SQL, then you need to be able to parse that SQL
if you're going to do prepared queries.  It might not require you to be
able to fully parse SQL the way the back-end does, but anything you do
that's not a full SQL parser is going to end up having limitations that
won't be easy to find or document.

For example, you could ask users to provide the prepared statement the
way the database wants it, and then list the data elements seperately
from it somehow, eg:

myquery = "INSERT INTO blah (col1, col2, col3) VALUES ($1, $2, $3);"
myvals[1] = "$Y"
myvals[2] = "$M"
myvals[3] = "$msg"

The user could then do:

myquery = "INSERT INTO blah (col1, col2) SELECT substring($1), $2;"
myvals[1] = "$M"
myvals[2] = "$msg"

Both of these will work just fine as prepared queries.

You can then parse that string by just looking for the $'s, but of
course, if the user wants to put an *actual* dollar sign in, then you
have to support that somehow ($$?).  Then you have to deal with whatever
other quoting requirements you have in your config file (how do you deal
with double quotes?  What about single quotes? etc, etc).

You could possibly even put your escape codes into myquery and just try
to figure out how to do the substitutions with the $NUMs and build your
prepared query string.  It gets uglier and uglier if you ask me though.

In the end, I'd still worry about users coming up with new and different
ways to break your sql 'parser'.

> one huge advantage of putting the sql into the configuration is the
> ability to work around other users of the database.

See, I just don't see that.

> for example, what if the database has additional columns that you don't
> want to touch (say an item# sequence), if the SQL is in the config this
> is easy to work around, if it's seperate (or created by the module), this
> is very hard to do.

You can do that with a trigger trivially..  That could also be supported
through other mechanisms (for instance, let the user provide a list of
columns to fill with DEFAULT in the prepared query).

> I guess you could give examples of the SQL in the documentation for how
> to create the prepared statement etc in the databases, but how is that
> much better than having it in the config file?
>
> for many users it's easier to do middlein -fancy stuff in the SQL than
> loading things into the database (can you pre-load prepared statements in
> the database? or are they a per-connection thing?)

Prepared statements, at least under PG, are a per-connection thing.
Triggers aren't the same, those are attached to tables and get called
whenever a particular action is done on those tables (defined in the
trigger definition).  The trigger is then called with the row which is
being inserted, etc, and can do whatever it wants with that row (put it
in a different table, ignore it, etc).

> so back to the main questions of the advantages
>
> prepared statements avoid needing to escape things, but at the
> complication of a more complex API.
>
> there's still the question of the performance difference. I have been
> thinking that the overhead of doing the work itself would overwelm the
> performance benifits of prepared statements.

What work is it that you're referring to here?  Based on what you've
said about your application so far, I would expect that the run-time
cost to prepare the statement (which you do only once) to be a bit of a
cost, but not much, and that the actual inserts would be almost free
from the application side, and much easier for the database to
parse/use.

> as I understand it, the primary performance benifit is the ability to
> avoid the parsing and planning stages of the command. for simple commands
> (which I assume inserts to be, even if inserting a lot of stuff), the
> planning would seem to be cheap compared to the work of doing the inserts

The planning isn't very expensive, no, but it's also not free.  The
parsing is more expensive.

> on a fully tuned database are we talking about 10% performance? 1%? 0.01%?
>
> any ideas

It depends a great deal on your application..  Do you have some example
data that we could use to test with?  Some default templates that you
think most people end up using which we could create dummy data for?

Of course, in the end, batching your inserts into fewer transactions
than 1-per-insert should give you a huge benefit right off the bat.
What we're talking about is really the steps after that, which might not
ever be necessary for your particular application.  On the other hand,
I'd never let production code go out that isn't using prepared queries
wherever possible.

:)

    Thanks,

        Stephen

Attachment

pgsql-performance by date:

Previous
From: david@lang.hm
Date:
Subject: Re: performance for high-volume log insertion
Next
From: Stephen Frost
Date:
Subject: Re: performance for high-volume log insertion