Thread: getGeneratedKeys method
Hello,

We, the log4j developers, are working on an appender that will write log4j events to various RDBMSs. Various fields of an event object go to three different tables, namely the logging_event table, the logging_event_property table and the logging_event_exception table. Each entry in the logging_event_property and logging_event_exception tables references event_id, where event_id is a database-generated primary key in the logging_event table.

For entries in the logging_event_property and logging_event_exception tables to be meaningful, we absolutely need the generated event_id each time we insert a new logging_event row. We are able to do this by following each logging_event insertion with a query to the database asking for the event_id of the last inserted event. This works for multiple database systems.

However, we have discovered that batching multiple insertions gives a very significant boost in performance. Thus, we would like to insert say 50 logging_event rows, then multiple logging_event_property rows and then multiple logging_event_exception rows. We are using the JDBC getGeneratedKeys method to obtain the event_ids. Unfortunately, this method is not implemented in PostgreSQL.

Looking at the archives, it seems that this functionality requires support from the database back end which does not currently exist. Is that correct?

Is there another way to insert into multiple tables in batches as described above without using the JDBC getGeneratedKeys method?

Your views on the matter would be highly appreciated. Thanking you in advance,

--
Ceki Gülcü

For log4j documentation consider "The complete log4j manual"
ISBN: 2970036908  http://www.qos.ch/shop/products/clm_t.jsp
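For illustration, a minimal JDBC sketch of the per-event approach described above, using getGeneratedKeys where the driver supports it; the column names are assumptions rather than the actual log4j schema:

import java.sql.*;

public class GeneratedKeysSketch {
    // Sketch of the per-event approach; column names here are assumptions.
    static void logOneEvent(Connection con, String message) throws SQLException {
        PreparedStatement insertEvent = con.prepareStatement(
            "INSERT INTO logging_event (message) VALUES (?)",
            Statement.RETURN_GENERATED_KEYS);
        insertEvent.setString(1, message);
        insertEvent.executeUpdate();

        // Works on several databases, but the PostgreSQL JDBC driver
        // does not implement getGeneratedKeys at this point.
        ResultSet keys = insertEvent.getGeneratedKeys();
        keys.next();
        long eventId = keys.getLong(1);
        insertEvent.close();

        PreparedStatement insertProp = con.prepareStatement(
            "INSERT INTO logging_event_property (event_id, mapped_key, mapped_value) VALUES (?, ?, ?)");
        insertProp.setLong(1, eventId);
        insertProp.setString(2, "someKey");
        insertProp.setString(3, "someValue");
        insertProp.executeUpdate();
        insertProp.close();
    }
}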
Ceki,

Yes, this is correct; there is no functionality in Postgres to get the generated keys. This is because the keys are generated using a sequence, which can back any of the columns in the row, so there is no a priori knowledge of which column is being generated. There are three ways around this:

1) Get the key before the insert: select nextval('sequence_name');

2) Get the key after the insert, leaving the key value out of the insert (or inserting a null there): select currval('sequence_name');

3) Do the insert and the select together, i.e. insert into foo ...; select currval('sequence_name');

FYI, nextval and currval are scoped to your session, so you don't have to worry about other connections getting in the way. These functions do the "right" thing.

--
Dave Cramer 519 939 0336 ICQ # 14675561
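A minimal JDBC sketch of options 1 and 2; the sequence name logging_event_id_seq is an assumption and depends on how the table was actually created:

import java.sql.*;

public class SequenceKeySketch {
    // Option 1: fetch the key before the insert.
    static long nextEventId(Connection con) throws SQLException {
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT nextval('logging_event_id_seq')");
        rs.next();
        long id = rs.getLong(1);
        st.close();
        return id;
    }

    // Option 2: fetch the key just assigned by an insert in this session.
    static long lastEventId(Connection con) throws SQLException {
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT currval('logging_event_id_seq')");
        rs.next();
        long id = rs.getLong(1);
        st.close();
        return id;
    }
}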
Ceki Gulcu wrote:
> For entries in the logging_event_property and logging_event_exception
> tables to be meaningful, we absolutely need the generated event_id
> each time we insert a new logging_event row. [...]
>
> However, we have discovered that batching multiple insertions gives a
> very significant boost in performance.

Note that as it stands this isn't true for the PostgreSQL driver: executeBatch() actually just runs each query synchronously. Fixing this is at the top of my todo list.

> Thus, we would like to insert say 50 logging_event rows, then multiple
> logging_event_property rows and then multiple logging_event_exception
> rows. We are using the JDBC getGeneratedKeys method to obtain the
> event_ids. Unfortunately, this method is not implemented in PostgreSQL.
>
> Looking at the archives, it seems that this functionality requires
> support from the database back end which does not currently exist. Is
> that correct?

It'd probably be possible to implement support for returning the OID of inserted rows in a batch update. You can already get at that OID for individual inserts via a PostgreSQL-specific interface, PGStatement.getLastOID(). However, the OID doesn't make a great key anyway, and I think the general move is towards omitting OIDs from tables altogether.

Anything else is going to be harder, as the server does not provide much in the way of metadata back from a query, and the JDBC driver does little parsing of the SQL queries.

If you don't mind doing DB-specific implementations, and you have all of the data for a particular event available at once, you might want to look at using CREATE SEQUENCE for the event ID (I guess you're already doing this if you want autogenerated keys) and use something like:

INSERT INTO logging_event VALUES (nextval('id'), ...)
INSERT INTO logging_event_property VALUES (currval('id'), 'property 1', ...)
INSERT INTO logging_event_property VALUES (currval('id'), 'property 2', ...)
INSERT INTO logging_event_exception VALUES (currval('id'), 'line 1', ...)
INSERT INTO logging_event_exception VALUES (currval('id'), 'line 2', ...)

You'd be able to batch those INSERTs so long as you keep them grouped correctly.

This is safe to do even if multiple connections are using the sequence concurrently -- see http://www.postgresql.org/docs/7.4/static/functions-sequence.html.

-O
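A sketch of how those grouped inserts could be sent as one JDBC batch with a plain Statement; the sequence name 'id' is a placeholder, and the values are concatenated inline purely for brevity (real code would need to escape them):

import java.sql.*;

public class CurrvalBatchSketch {
    // Batch one event plus its properties; the event row goes first so that
    // currval() in the following rows refers to the key nextval() just produced.
    static void insertEvent(Connection con, String message, String[] props) throws SQLException {
        Statement st = con.createStatement();
        st.addBatch("INSERT INTO logging_event VALUES (nextval('id'), '" + message + "')");
        for (int i = 0; i < props.length; i++) {
            st.addBatch("INSERT INTO logging_event_property VALUES (currval('id'), '" + props[i] + "')");
        }
        st.executeBatch();
        st.close();
    }
}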
Hello,

Thanks for all the replies received so far. I really appreciate it.

One important point I failed to mention in my original mail was that we are using a prepared statement to perform the initial batch of inserts via executeBatch, followed by another prepared statement and invoking executeBatch, and a third prepared statement plus executeBatch.

Your suggestions imply the use of regular statements as opposed to prepared statements. I don't know whether the performance boost is due to the use of prepared statements or whether it's more the "batching", that is, regrouping the SQL statements in one transaction.

If you are interested you can find the actual code at http://cvs.apache.org/viewcvs.cgi/logging-log4j/src/java/org/apache/log4j/db/

DBAppender is the current implementation, which works with PostgreSQL, MySQL, Oracle, DB2 and MsSQL. With pooled connections, it takes under 10 milliseconds to persist a single event. Without pooling, it's 50 millis. With DBAppender2, which uses prepared statements, we can get the figure down to as low as 1 milli. However, although the performance is much better, the results are incorrect.

If the performance improvement is linked to the use of prepared statements, then there isn't much I can do. However, if regular statements would also yield a performance improvement, I could try the techniques suggested in previous mails in this thread.

Anyway, thank you for your assistance. It has been very helpful.

--
Ceki Gülcü

For log4j documentation consider "The complete log4j manual"
ISBN: 2970036908  http://www.qos.ch/shop/products/clm_t.jsp
Ceki Gulcu wrote:
> One important point I failed to mention in my original mail was that we
> are using a prepared statement to perform the initial batch of inserts
> via executeBatch, followed by another prepared statement and invoking
> executeBatch, and a third prepared statement plus executeBatch. [...]
> I don't know whether the performance boost is due to the use of prepared
> statements or whether it's more the "batching", that is, regrouping the
> SQL statements in one transaction.

I initially thought you could batch the inserts in the approach I suggested, but of course addBatch and executeBatch live on a particular PreparedStatement, so you can't do that.

OK, how about something like this instead:

CREATE SEQUENCE id INCREMENT BY 100;

Then limit batch size to 100, and for each batch do:

SELECT nextval('id')  -- let's assume this returns 1000

Next do the actual batch inserts, grouped by statement and computing ID values yourself based on the returned nextval + offset:

INSERT INTO logging_event VALUES (1000, ...)
INSERT INTO logging_event VALUES (1001, ...)
INSERT INTO logging_event VALUES (1002, ...)
-- ...
INSERT INTO logging_event VALUES (1099, ...)

INSERT INTO logging_event_property VALUES (1000, 'foo', ...)
INSERT INTO logging_event_property VALUES (1000, 'bar', ...)
INSERT INTO logging_event_property VALUES (1001, 'foo', ...)
INSERT INTO logging_event_property VALUES (1001, 'bar', ...)
-- etc

INSERT INTO logging_event_exception VALUES (1000, 'line 1', ...)
INSERT INTO logging_event_exception VALUES (1000, 'line 2', ...)
INSERT INTO logging_event_exception VALUES (1001, 'line 1', ...)
INSERT INTO logging_event_exception VALUES (1001, 'line 2', ...)
-- etc

Because of the INCREMENT BY clause, the generated IDs won't collide with another concurrent inserter. This should let you use both PreparedStatement and batch execution, I think.

There is a tradeoff between rate of ID consumption and maximum batch size to be made, but if you're using int8 for IDs then ID consumption is unlikely to be an issue.

-O
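A JDBC sketch of this block-allocation scheme, assuming a sequence named event_id_seq created with INCREMENT BY 100, and the same assumed column lists as in the earlier sketches:

import java.sql.*;

public class BlockAllocationSketch {
    static final int BATCH_SIZE = 100;   // must not exceed the sequence's INCREMENT BY

    static void flush(Connection con, String[] messages) throws SQLException {
        con.setAutoCommit(false);

        // Reserve a block of 100 ids; concurrent writers get non-overlapping blocks.
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT nextval('event_id_seq')");
        rs.next();
        long firstId = rs.getLong(1);
        st.close();

        PreparedStatement evt = con.prepareStatement(
            "INSERT INTO logging_event (event_id, message) VALUES (?, ?)");
        PreparedStatement prop = con.prepareStatement(
            "INSERT INTO logging_event_property (event_id, mapped_key, mapped_value) VALUES (?, ?, ?)");

        int n = Math.min(messages.length, BATCH_SIZE);
        for (int i = 0; i < n; i++) {
            long id = firstId + i;           // computed locally from nextval + offset
            evt.setLong(1, id);
            evt.setString(2, messages[i]);
            evt.addBatch();
            prop.setLong(1, id);
            prop.setString(2, "someKey");
            prop.setString(3, "someValue");
            prop.addBatch();
        }
        evt.executeBatch();     // all event rows first...
        prop.executeBatch();    // ...then the rows that reference them
        con.commit();

        evt.close();
        prop.close();
    }
}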
Hi Oliver,

The increment-by-N approach, with the batch size capped at N, is imho very original. It should work, but the DBAppender code has to be compatible with different RDBMSs. The approach you suggest assumes that we can manually set event_id, which is not necessarily always the case, or maybe it is, but I am not sure. More importantly, on some RDBMSs we are using auto-generated keys, where it is not always possible to set the increment value. (Maybe it is always possible; I'll have to check that too.)

Anyway, I now know when an increment value other than 1 could be useful.

Thanks again for your help.
Ceki,

Couple of things:

1) A lot of persistence layers use a GenerateKey-style interface to generate the key ahead of time; that way you can swap in a different implementation for whichever db you use (see the sketch below).

2) MSSQL for sure will not let you insert any value into the serial column.

3) There's yet another option for serial values in Postgres, which is caching: if you were going to increment by 10, you may wish to also cache values, which means each connection only has to write to the backend once per batch of cached values instead of on every fetch.

Dave
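A sketch of the kind of key-generator abstraction meant in point 1; the interface and class names are invented for illustration, and the sequence name is a placeholder:

import java.sql.*;

// Hypothetical interface: each supported RDBMS gets its own implementation.
interface KeyGenerator {
    long nextKey(Connection con) throws SQLException;
}

// PostgreSQL implementation: fetch the key ahead of time from a sequence.
class PostgresKeyGenerator implements KeyGenerator {
    private final String sequenceName;   // e.g. "logging_event_id_seq" (placeholder)

    PostgresKeyGenerator(String sequenceName) {
        this.sequenceName = sequenceName;
    }

    public long nextKey(Connection con) throws SQLException {
        Statement st = con.createStatement();
        try {
            ResultSet rs = st.executeQuery("SELECT nextval('" + sequenceName + "')");
            rs.next();
            return rs.getLong(1);
        } finally {
            st.close();
        }
    }
}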
A slight modification of (1) that I've used is "select nextval('sequence_name'), nextval('sequence_name'), nextval('sequence_name'), ..." to get the next n keys, then follow that with a bunch of inserts that use those keys. This lets (1) play a little more nicely with batch inserts (although (3) may be more useful).

mike
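A sketch of that variation: build one SELECT containing n nextval() calls and read one key per column of the single result row; the sequence name is again a placeholder:

import java.sql.*;

public class MultiNextvalSketch {
    // Fetch n keys in one round trip; each nextval() in the select list
    // advances the sequence, so the single result row holds n distinct keys.
    static long[] nextKeys(Connection con, int n) throws SQLException {
        StringBuffer sql = new StringBuffer("SELECT nextval('logging_event_id_seq')");
        for (int i = 1; i < n; i++) {
            sql.append(", nextval('logging_event_id_seq')");
        }
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery(sql.toString());
        rs.next();
        long[] keys = new long[n];
        for (int i = 0; i < n; i++) {
            keys[i] = rs.getLong(i + 1);   // JDBC columns are 1-based
        }
        st.close();
        return keys;
    }
}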
> Yes, this is correct; there is no functionality in Postgres to get the
> generated keys.

That's sad.

> This is because the keys are generated using a sequence, which can back
> any of the columns in the row, so there is no a priori knowledge of
> which column is being generated.

First, we have to define what a "generated key" is! If you define it as a value generated by a sequence, then this is hard to find out. Simply take the following definition: a column in the primary key is a "generated key" iff a value for that column hasn't been given in the INSERT and the column has a default value. With this definition, it should be easy for the server to find the generated keys.

> 2) Get the key after the insert, leaving the key value out of the insert
> (or inserting a null there): select currval('sequence_name');

But you have to know the sequence name for that. That's really bad!