Thread: How to determine cause of performance problem?
Hi,

I must convert an old table into a new table. The conversion goes at ~100 records per second. Given the fact that I must convert 40 million records, it takes too much time: more hours than the 48-hour weekend I have for the conversion ;-).

The tables are rather simple: both tables only have a primary key constraint (of type text) and no other indexes. I only copy 3 columns. I use Java for the conversion. For the exact code see below.

During the conversion my processor load is almost nonexistent. The harddisk throughput is ~6 megabyte/second max (measured with iostat).

My platform is Debian Sarge AMD64. My hardware is a Tyan Thunder K8W 2885 motherboard, 2 Opteron 248 processors, 2 GB RAM, a SATA bootdisk with / and swap, and a 3Ware 9500S-8 RAID-5 controller with 5 attached SATA disks with /home and /var. /var contains *all* PostgreSQL log and database files (default Debian installation).

Output of hdparm -Tt /dev/sdb (sdb is the RAID partition):

/dev/sdb:
 Timing cached reads:   1696 MB in 2.00 seconds = 846.86 MB/sec
 Timing buffered disk reads:  246 MB in 3.01 seconds = 81.79 MB/sec

I want to determine the cause of my performance problem (if it is one).

1. Is this a performance I can expect?
2. If not, how can I determine the cause?
3. Can I anyhow improve the performance without replacing my hardware, e.g. by tweaking the software?
4. Is there a Linux (Debian) tool that I can use to benchmark write performance?

--
Regards,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl

The Java code I use for the conversion:

//////////////// ....
ResultSet resultSet = selectStatement.executeQuery(
    "select ordernummer, orderdatum, klantnummer from odbc.orders order by ordernummer");

connection.setAutoCommit(false);

PreparedStatement ordersInsertStatement = connection.prepareStatement(
    "insert into prototype.orders (objectid,ordernumber,orderdate,customernumber) values (?,?,?,?)");

while (resultSet.next())
{
    if ((++record % 100) == 0) {
        System.err.println("handling record: " + record);
    }

    // the next line can do > 1.000.000 objectId/sec
    String orderObjectId = ObjectIdGenerator.newObjectId();
    ordersInsertStatement.setString(1, orderObjectId);
    ordersInsertStatement.setInt(2, resultSet.getInt("ordernummer"));
    ordersInsertStatement.setDate(3, resultSet.getDate("orderdatum"));
    ordersInsertStatement.setInt(4, resultSet.getInt("klantnummer"));

    ordersInsertStatement.execute();
}

connection.commit();
Hi Joost,

why do you convert programmatically? I would do something like

create sequence s_objectid;

insert into prototype.orders (objectid, ordernumber, orderdate, customernumber)
  select nextval('s_objectid'), ordernummer, orderdatum, klantnummer
  from odbc.orders;

Sounds a lot faster to me.

/Ulrich
On Fri, Sep 23, 2005 at 08:49:27AM +0200, Joost Kraaijeveld wrote:
>3. Can I anyhow improve the performance without replacing my hardware,
>e.g. by tweaking the software?

It's not clear what your object id generator does. If it's just a sequence, it's not clear that you need this program at all--just use a SELECT INTO and make the object id a SERIAL.

If you do need to control the object id or do some other processing before putting the data into the new table, rewrite to use a COPY instead of an INSERT.

Mike Stone
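If the per-row processing really isn't needed, a minimal sketch of pushing the whole copy to the server in a single statement (untested; the sequence name is made up, and it assumes a sequence value cast to text is acceptable in place of the GUID):

    Statement st = connection.createStatement();
    st.execute("CREATE SEQUENCE s_orders_objectid");
    // one round trip: the server reads odbc.orders and fills prototype.orders itself
    st.execute(
        "INSERT INTO prototype.orders (objectid, ordernumber, orderdate, customernumber) " +
        "SELECT nextval('s_orders_objectid')::text, ordernummer, orderdatum, klantnummer " +
        "FROM odbc.orders");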
On Fri, 2005-09-23 at 05:55 -0400, Michael Stone wrote:
> It's not clear what your object id generator does. If it's just a
> sequence, it's not clear that you need this program at all--just use a
> SELECT INTO and make the object id a SERIAL.

It generates a GUID (and no, I do not want to turn this into a discussion about GUIDs). As the comment in the Java code says: it is not the generation of the GUID that is the problem (I can generate millions of them per second).

> If you do need to control the object id or do some other processing
> before putting the data into the new table, rewrite to use a COPY
> instead of an INSERT.

It is actually the shortest piece of code that gives me poor performance. The conversion problem is much, much larger and much, much more complicated.

I suspect that either my hardware is too slow (but then again, see the specs), or my Debian is too slow, or my PostgreSQL settings are wrong. But I have no clue where to begin with determining the bottleneck (it may even be normal performance for all I know: I have no experience with converting such a (large) database).

Any suggestions?

--
Regards,
Joost Kraaijeveld
On Fri, Sep 23, 2005 at 12:21:15PM +0200, Joost Kraaijeveld wrote:
>On Fri, 2005-09-23 at 05:55 -0400, Michael Stone wrote:
>> It's not clear what your object id generator does. If it's just a
>> sequence, it's not clear that you need this program at all--just use a
>> SELECT INTO and make the object id a SERIAL.
>
>It generates a GUID (and no, I do not want to turn this into a discussion
>about GUIDs). As the comment in the Java code says: it is not the generation
>of the GUID that is the problem (I can generate millions of them per second).

I didn't say it was, did I? If you use a SELECT INTO instead of SELECTing each record and then re-INSERTing it, you avoid a round-trip latency for each row. There's a reason I said "if it's just a sequence".

>> If you do need to control the object id or do some other processing
>> before putting the data into the new table, rewrite to use a COPY
>> instead of an INSERT.
>
>It is actually the shortest piece of code that gives me poor
>performance. The conversion problem is much, much larger and much, much
>more complicated.

Ok, that's great, but you didn't respond to the suggestion of using COPY INTO instead of INSERT.

>But I have no clue where to begin with determining the bottleneck (it
>may even be normal performance for all I know: I have no experience
>with converting such a (large) database).
>
>Any suggestions?

Respond to the first suggestion?

Mike Stone
On 23-9-2005 13:05, Michael Stone wrote:
> On Fri, Sep 23, 2005 at 12:21:15PM +0200, Joost Kraaijeveld wrote:
>
> Ok, that's great, but you didn't respond to the suggestion of using COPY
> INTO instead of INSERT.
>
>> But I have no clue where to begin with determining the bottleneck (it
>> may even be normal performance for all I know: I have no experience
>> with converting such a (large) database).
>>
>> Any suggestions?
>
> Respond to the first suggestion?

Another suggestion: how many indexes and constraints are on the new table?

Drop all of them and recreate them once the table is filled. Of course that only works if you know your data will be ok (which is normal for imports of already conforming data, like database dumps of existing tables). This will give major performance improvements if you have indexes and such on the new table.

Best regards,

Arjen
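For the primary key that would look roughly like this (a sketch only; the constraint name is a guess, check the real name with \d prototype.orders, and it assumes objectid is the primary key column):

    Statement ddl = connection.createStatement();
    // drop the primary key before the bulk load (constraint name is assumed)
    ddl.execute("ALTER TABLE prototype.orders DROP CONSTRAINT orders_pkey");
    // ... run the conversion / bulk insert here ...
    // recreate the primary key in one pass over the filled table
    ddl.execute("ALTER TABLE prototype.orders ADD CONSTRAINT orders_pkey PRIMARY KEY (objectid)");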
Joost,

I presume you are using a relatively new JDBC driver. Make sure you have added prepareThreshold=1 to the URL so that it will use a named server-side prepared statement.

You could also use your mod-100 code block to implement batch processing of the inserts; see addBatch in the JDBC specs.

Dave

On 23-Sep-05, at 2:49 AM, Joost Kraaijeveld wrote:
> Hi,
>
> I must convert an old table into a new table. The conversion goes at ~
> 100 records per second. Given the fact that I must convert 40 million
> records, it takes too much time: more hours than the 48-hour weekend I
> have for the conversion ;-).
>
> [rest of the original message and Java code snipped]
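Roughly, in the original loop, those two suggestions would look like this (a sketch only; the connection URL is a placeholder, and it assumes the same resultSet and ObjectIdGenerator as in the original code):

    // prepareThreshold=1 makes the driver switch to a named server-side
    // prepared statement from the first execution onwards
    Connection connection = DriverManager.getConnection(
        "jdbc:postgresql://localhost/mydb?prepareThreshold=1", "user", "password");
    connection.setAutoCommit(false);

    PreparedStatement ins = connection.prepareStatement(
        "insert into prototype.orders (objectid,ordernumber,orderdate,customernumber) values (?,?,?,?)");

    int record = 0;
    while (resultSet.next()) {
        ins.setString(1, ObjectIdGenerator.newObjectId());
        ins.setInt(2, resultSet.getInt("ordernummer"));
        ins.setDate(3, resultSet.getDate("orderdatum"));
        ins.setInt(4, resultSet.getInt("klantnummer"));
        ins.addBatch();

        if ((++record % 100) == 0) {
            ins.executeBatch();   // 100 inserts travel to the server together
            System.err.println("handling record: " + record);
        }
    }
    ins.executeBatch();           // flush the last partial batch
    connection.commit();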
On Fri, 2005-09-23 at 13:19 +0200, Arjen van der Meijden wrote:
> Another suggestion:
> How many indexes and constraints are on the new table?

As mentioned in the first mail: these tables only have primary key constraints, no other indexes or constraints.

> Drop all of them and recreate them once the table is filled. Of course
> that only works if you know your data will be ok (which is normal for
> imports of already conforming data, like database dumps of existing tables).
> This will give major performance improvements if you have indexes and
> such on the new table.

I will test this for a performance improvement, but still, I wonder if ~100 inserts/second is a reasonable performance for my software/hardware combination.

--
Regards,
Joost Kraaijeveld
On Fri, 2005-09-23 at 07:05 -0400, Michael Stone wrote:
> On Fri, Sep 23, 2005 at 12:21:15PM +0200, Joost Kraaijeveld wrote:
> >On Fri, 2005-09-23 at 05:55 -0400, Michael Stone wrote:
> I didn't say it was, did I?

No, you did not. But only last week someone's head was (luckily for him, only virtually) almost chopped off for suggesting the use of GUIDs ;-)

> Ok, that's great, but you didn't respond to the suggestion of using COPY
> INTO instead of INSERT.

Part of the code I left out are some data conversions (e.g. from path-to-file to blob, from text to date (not castable because of the homebrew original format)). I don't believe that I can do these in a SQL statement, can I (my knowledge of SQL as a language is not that good)? However, I will investigate whether I can do the conversion in two steps and check if that is faster.

But still, I wonder if ~100 inserts/second is a reasonable performance for my software/hardware combination.

--
Regards,
Joost Kraaijeveld
On 23-9-2005 15:35, Joost Kraaijeveld wrote:
> On Fri, 2005-09-23 at 13:19 +0200, Arjen van der Meijden wrote:
>> Drop all of them and recreate them once the table is filled. Of course
>> that only works if you know your data will be ok (which is normal for
>> imports of already conforming data, like database dumps of existing tables).
>> This will give major performance improvements if you have indexes and
>> such on the new table.
>
> I will test this for a performance improvement, but still, I wonder if ~100
> inserts/second is a reasonable performance for my software/hardware
> combination.

For the hardware: no, I don't think it is, for such a simple table and small record size. I did a few batch inserts with indexes on tables and was very disappointed about the time it took. But with no indexes and constraints left, it flew, and a table of 7 million records (of 3 ints and 2 bigints) was imported in 75 seconds, on a bit simpler hardware. That was done using a simple pg_dump-built SQL file which was then fed to psql as input. And of course that used the local unix socket, not the local network interface (I don't know which one JDBC uses). But generating a single transaction (as you do) with inserts shouldn't be that much slower.

So I don't think it's your hardware, nor your PostgreSQL, although a bit of extra maintenance_work_mem may help, if you haven't touched that. That leaves the queries, the application and the driver, but I don't have that much experience with JDBC and PostgreSQL performance.

In PHP I wouldn't select all 40M records at once; the result set would be in the client's memory and that may actually cause trouble. I don't know how that is implemented in JDBC; it may of course be using cursors, in which case it would be less of a problem.

You could try writing the inserts to a file and see how long that takes, to eliminate the possibility of your application being slow on parts other than the inserting of data. If that is fast enough, a last resort may be to write a CSV file from Java and use that with a COPY statement in psql ;)

Best regards,

Arjen
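A very rough sketch of that last idea (the file name and the \copy invocation are just examples, and real data may need proper CSV escaping):

    // needs: import java.io.BufferedWriter; import java.io.FileWriter;
    BufferedWriter out = new BufferedWriter(new FileWriter("/tmp/orders.csv"));
    while (resultSet.next()) {
        out.write(ObjectIdGenerator.newObjectId() + ","
                + resultSet.getInt("ordernummer") + ","
                + resultSet.getDate("orderdatum") + ","   // prints as yyyy-mm-dd
                + resultSet.getInt("klantnummer"));
        out.newLine();
    }
    out.close();

    // then load the file in one go, e.g. from the shell:
    //   psql -d yourdb -c "\copy prototype.orders from '/tmp/orders.csv' with csv"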
Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:
> I will test this for a performance improvement, but still, I wonder if ~100
> inserts/second is a reasonable performance for my software/hardware
> combination.

Is the client code running on the same machine as the database server? If not, what's the network delay and latency between them?

The major problem you're going to have here is at least one network round trip per row inserted --- possibly more, if the jdbc driver is doing "helpful" stuff behind your back like starting/committing transactions.

			regards, tom lane
On Fri, 2005-09-23 at 10:33 -0400, Tom Lane wrote:
> Is the client code running on the same machine as the database server?
> If not, what's the network delay and latency between them?

Yes, it is running on the same machine.

> The major problem you're going to have here is at least one network
> round trip per row inserted --- possibly more, if the jdbc driver is
> doing "helpful" stuff behind your back like starting/committing
> transactions.

OK, I will look into that. But do you maybe know of a pointer to info, or of tools that can measure, what my machine is doing during all the time it is doing nothing? Something like the performance monitor in Windows, but then for Linux?

--
Regards,
Joost Kraaijeveld
Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:
> But do you maybe know of a pointer to info, or of tools that can measure,
> what my machine is doing during all the time it is doing nothing? Something
> like the performance monitor in Windows, but then for Linux?

top, vmstat, iostat, sar, strace, oprofile, ...

			regards, tom lane
On Fri, Sep 23, 2005 at 03:49:25PM +0200, Joost Kraaijeveld wrote:
>On Fri, 2005-09-23 at 07:05 -0400, Michael Stone wrote:
>> Ok, that's great, but you didn't respond to the suggestion of using COPY
>> INTO instead of INSERT.
>
>Part of the code I left out are some data conversions (e.g. from
>path-to-file to blob, from text to date (not castable because of the
>homebrew original format)). I don't believe that I can do these in a SQL
>statement, can I (my knowledge of SQL as a language is not that good)?
>However, I will investigate whether I can do the conversion in two steps
>and check if that is faster.

I'm not sure what you're trying to say. You're currently putting rows into the table by calling "INSERT INTO" for each row. The sample code you sent could be rewritten to use "COPY INTO" instead. For bulk inserts like you're doing, the copy approach will be a lot faster. Instead of inserting one row, waiting for a reply, and inserting the next row, you just cram data down a pipe to the server. See:

http://www.postgresql.org/docs/8.0/interactive/sql-copy.html
http://www.faqs.org/docs/ppbook/x5504.htm

Mike Stone