Thread: Insert performance, what should I expect?
Hi,

I'm after some opinions about insert performance.

I'm importing a file with 13,002 lines into a database that ends up with 75,703 records across 6 tables. This is a partial file; the real data is 4 files totalling 95,174 lines. I'll be loading these files each morning, and then running a number of queries on them. The select queries run fast enough (mostly; two are slow, but I'll look into those later). Importing, however, is slower than I'd like, so I'm wondering what I should expect.

I've done some manual benchmarking, running my script with 'time script.pl'. I realise my script itself uses some of the time; benchmarking shows that 50% of the time is spent in DBD::Pg's execute().

Test 1: For each import, I'm dropping all indexes and pkeys/fkeys, then importing, then adding keys and indexes. Then I've got successive runs. I figure the reindexing will get more expensive as the database grows?

Successive imports: 44, 49, 50, 57, 55, 61, 72 (seconds)
= average 1051 inserts/second (which now that I've written this seems fairly good)

Test 2: No dropping etc. of indexes, just INSERTs.

Imports: 61, 62, 73, 68, 78, 74 (seconds)
= average 1091 inserts/second

Machine is Linux 2.6.4, 1 GB RAM, 3.something GHz Xeon processor, SCSI hard disks (RAID 1). PostgreSQL 7.4.2. Lightly loaded machine, not doing much other than my script. Script and DB on the same machine.

sysctl -a | grep shm:
  kernel.shmmni = 4096
  kernel.shmall = 134217728 (pages or bytes? Anyway...)
  kernel.shmmax = 134217728

postgresql.conf:
  tcpip_socket = true
  max_connections = 32
  superuser_reserved_connections = 2
  shared_buffers = 8192
  sort_mem = 4096
  vacuum_mem = 16384
  max_fsm_relations = 300
  fsync = true
  wal_buffers = 64
  checkpoint_segments = 10
  effective_cache_size = 16000
  syslog = 1
  silent_mode = false
  log_connections = true
  log_pid = true
  log_timestamp = true
  stats_start_collector = true
  stats_row_level = true

Can I expect it to go faster than this? I'll see where I can make my script itself go faster, but I don't think I'll be able to do much. I'll do some pre-prepare type stuff, but I don't expect significant gains, maybe 5-10%. I could happily turn off fsync for this job, but not for some other databases the server is hosting.

Any comments/suggestions would be appreciated.

Thanks :)

Brock Henry
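(For reference, a minimal sketch of the Test 1 cycle described above; the table, column and index names -- obs, obs_pkey, obs_band_idx -- are made up, not the actual schema:)

  ALTER TABLE obs DROP CONSTRAINT obs_pkey;   -- drop pkeys/fkeys first
  DROP INDEX obs_band_idx;                    -- and any plain indexes
  -- ... run the bulk import here ...
  ALTER TABLE obs ADD PRIMARY KEY (id);       -- re-add keys afterwards
  CREATE INDEX obs_band_idx ON obs (band);
  ANALYZE obs;                                -- refresh planner statistics after the load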
Brock Henry wrote:
> Hi,
>
> I'm after some opinions about insert performance.

Have you looked into using the COPY command instead of INSERTs? For bulk loading of data it can be significantly faster.
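(For example -- a sketch only, with placeholder table and file names -- loading a tab-delimited file in a single COPY rather than row-by-row INSERTs:)

  -- server-side file (requires superuser in 7.4):
  COPY obs FROM '/tmp/obs.tab';

  -- or from psql, reading a file on the client side:
  \copy obs from '/tmp/obs.tab'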
> I've done some manual benchmarking, running my script with 'time script.pl'.
> I realise my script itself uses some of the time; benchmarking shows that
> 50% of the time is spent in DBD::Pg's execute().

The Perl drivers don't currently use database-level prepared statements, which would give a small boost. But your best bet is to switch to using COPY instead of INSERT. Two ways to do this:

1) Drop DBD::Pg and switch to the Pg driver for Perl instead (non-DBI compliant), which has functions similar to putline() that allow COPY to be used.

2) Have your Perl script output a .sql file with the data prepared (COPY statements), which you feed into the database via psql.

You can probably achieve a 50% increase in throughput.
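(A rough sketch of option 2, assuming a simple comma-separated input file and a made-up table name; real data would also need backslash, tab and newline escaping per the COPY rules:)

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Emit a single COPY statement plus the data, then feed it to psql.
  open my $out, '>', 'load.sql' or die "load.sql: $!";
  print {$out} "COPY obs (a, b, c) FROM stdin;\n";

  open my $in, '<', 'data.txt' or die "data.txt: $!";
  while (my $line = <$in>) {
      chomp $line;
      my @fields = split /,/, $line;              # assume simple comma-separated input
      print {$out} join("\t", @fields), "\n";     # COPY's default delimiter is tab
  }
  close $in;

  print {$out} "\\.\n";                           # end-of-data marker
  close $out;

  # then:  psql -f load.sql dbname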
On Wed, 2004-10-20 at 11:53 +1000, Brock Henry wrote:
>
> Test 1: For each import, I'm dropping all indexes and pkeys/fkeys, then
> importing, then adding keys and indexes. Then I've got successive runs.
> I figure the reindexing will get more expensive as the database grows?

Sounds like the right approach to me, if the tables are empty before the import.

> Successive imports: 44, 49, 50, 57, 55, 61, 72 (seconds)
> = average 1051 inserts/second (which now that I've written this seems
> fairly good)

(A) Are you doing the whole thing inside a transaction? This will be significantly quicker. COPY would probably be quicker still, but the biggest difference will come from using a single transaction.

(B) If you are starting with empty tables, are you ensuring that the dead records are vacuumed before you start? I would recommend a "VACUUM FULL" on the affected tables prior to the first import run (i.e. when the tables are empty). This is likely to be the reason that the timing of your successive imports increases so much.

> sort_mem = 4096

You probably want to increase this - if you have 1 GB of RAM then there is probably some spare. But if you actually expect to use 32 connections, then 32 * 4 MB = 128 MB means a careful calculation is needed. If you are really only likely to have 1-2 connections running concurrently, then increase it to (e.g.) 32768.

> max_fsm_relations = 300

If you do a "VACUUM FULL VERBOSE;", the last line will give you some clues as to what to set this (and max_fsm_pages) to.

> effective_cache_size = 16000

16000 * 8 kB = 128 MB seems low for a 1 GB machine - you could probably say 64000 without fear of being wrong. What does "free" show as "cached"? Depending on how dedicated the machine is to the database, the effective cache size may be as much as 80-90% of that.

> Can I expect it to go faster than this? I'll see where I can make my
> script itself go faster, but I don't think I'll be able to do much.
> I'll do some pre-prepare type stuff, but I don't expect significant
> gains, maybe 5-10%. I could happily turn off fsync for this job, but
> not for some other databases the server is hosting.

You can probably double the speed - maybe more.

Cheers,
Andrew.

-------------------------------------------------------------------------
Andrew @ Catalyst .Net .NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/          PHYS: Level 2, 150-154 Willis St
DDI: +64(4)803-2201    MOB: +64(272)DEBIAN    OFFICE: +64(4)499-2267
How many things I can do without! -- Socrates
-------------------------------------------------------------------------
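(A minimal DBD::Pg sketch of point (A), doing the whole load in one transaction; the connection details, table and columns are invented for illustration:)

  #!/usr/bin/perl
  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect('dbi:Pg:dbname=import_test', 'user', 'pass',
                         { AutoCommit => 0, RaiseError => 1 });

  my $sth = $dbh->prepare('INSERT INTO obs (a, b, c) VALUES (?, ?, ?)');

  open my $fh, '<', 'data.txt' or die "data.txt: $!";
  while (my $line = <$fh>) {
      chomp $line;
      $sth->execute(split /\t/, $line);   # one row per tab-delimited input line
  }
  close $fh;

  $dbh->commit;                           # a single commit for the whole load
  $dbh->disconnect;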
When grilled further on (Tue, 19 Oct 2004 22:12:28 -0400), Rod Taylor <pg@rbt.ca> confessed:

> > I've done some manual benchmarking, running my script with 'time script.pl'.
> > I realise my script itself uses some of the time; benchmarking shows that
> > 50% of the time is spent in DBD::Pg's execute().
>
> 1) Drop DBD::Pg and switch to the Pg driver for Perl instead (non-DBI
> compliant), which has functions similar to putline() that allow COPY to
> be used.

COPY can be used with DBD::Pg, per a script I use:

  $dbh->do( "COPY temp_obs_$band ( $col_list ) FROM stdin" );
  $dbh->func( join( "\t", @data ) . "\n", 'putline' );
  $dbh->func( "\\.\n", 'putline' );
  $dbh->func( 'endcopy' );

With sets of data from 1000 to 8000 records, my COPY performance is consistent at ~10000 records per second.

Cheers,
Rob

--
 10:39:31 up 2 days, 16:25, 2 users, load average: 2.15, 2.77, 3.06
Linux 2.6.5-02 #8 SMP Mon Jul 12 21:34:44 MDT 2004
On Wed, 2004-10-20 at 12:45, Robert Creager wrote:
> When grilled further on (Tue, 19 Oct 2004 22:12:28 -0400),
> Rod Taylor <pg@rbt.ca> confessed:
>
> > > I've done some manual benchmarking, running my script with 'time script.pl'.
> > > I realise my script itself uses some of the time; benchmarking shows that
> > > 50% of the time is spent in DBD::Pg's execute().
> >
> > 1) Drop DBD::Pg and switch to the Pg driver for Perl instead (non-DBI
> > compliant), which has functions similar to putline() that allow COPY to
> > be used.
>
> COPY can be used with DBD::Pg, per a script I use:
>
>   $dbh->do( "COPY temp_obs_$band ( $col_list ) FROM stdin" );
>   $dbh->func( join( "\t", @data ) . "\n", 'putline' );
>   $dbh->func( "\\.\n", 'putline' );
>   $dbh->func( 'endcopy' );

Thanks for that. All of the conversations I've seen on the subject stated that DBD::Pg only supported standard DB features -- COPY not amongst them.

> With sets of data from 1000 to 8000 records, my COPY performance is
> consistent at ~10000 records per second.

Well done.
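(For completeness, a self-contained sketch of the approach Robert describes, using the func()-based pass-through that 2004-era DBD::Pg exposes; the database name, table and rows are invented, and newer DBD::Pg releases offer pg_putline()/pg_endcopy() methods instead:)

  #!/usr/bin/perl
  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect('dbi:Pg:dbname=import_test', '', '',
                         { AutoCommit => 1, RaiseError => 1 });

  $dbh->do('COPY temp_obs (obs_time, band, mag) FROM stdin');
  for my $row ( [ '2004-10-20 03:10:00', 'V', '12.31' ],
                [ '2004-10-20 03:11:00', 'V', '12.28' ] ) {
      $dbh->func( join("\t", @$row) . "\n", 'putline' );
  }
  $dbh->func( "\\.\n", 'putline' );   # end-of-data marker
  $dbh->func( 'endcopy' );

  $dbh->disconnect;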
Brock Henry wrote:
> Any comments/suggestions would be appreciated.

Tune also the disk I/O elevator.

Look at this: http://www.varlena.com/varlena/GeneralBits/49.php

Regards
Gaetano Mendola
On Sat, Oct 23, 2004 at 12:31:32PM +0200, Gaetano Mendola wrote:
>> Any comments/suggestions would be appreciated.
> Tune also the disk I/O elevator.
>
> Look at this: http://www.varlena.com/varlena/GeneralBits/49.php

Mm, interesting. I've heard somewhere that the best option for database-like loads on Linux is to disable the anticipatory I/O scheduler (http://kerneltrap.org/node/view/567), which should probably influence the numbers for elvtune also -- anybody know whether this is true or not for PostgreSQL?

/* Steinar */
--
Homepage: http://www.sesse.net/
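(For anyone wanting to experiment: on 2.6 kernels the I/O scheduler can be chosen at boot with the elevator= parameter, and on later 2.6 kernels it can also be inspected or switched per device through sysfs; the device name below is illustrative:)

  # boot-time selection (add to the kernel command line in grub/lilo):
  #   elevator=deadline

  # runtime, on kernels that support per-device switching:
  cat /sys/block/sda/queue/scheduler             # e.g. "noop anticipatory [deadline] cfq"
  echo deadline > /sys/block/sda/queue/scheduler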