Thread: Improving speed of copy
Hi all,

While testing for large databases, I am trying to load 12.5M rows of data from a text file, and it takes a lot longer than MySQL even with COPY. MySQL takes 221 sec. vs. 1121 sec. for PostgreSQL; for PostgreSQL that is around 11.5K rows per second. Each tuple has 23 fields with a fixed length of around 100 bytes.

I wrote a program which does inserts in batches, but none of them reaches the performance of COPY. I tried 1K/5K/10K/100K rows per transaction, but it cannot cross 2.5K rows/sec.

The machine is an 800MHz P-III/512MB/IDE disk. The postmaster is started with 30K buffers, i.e. around 235MB of buffers. Kernel caching parameters are at their defaults.

Besides that, there is the issue of space. MySQL takes 1.4GB of space for 1.2GB of text data, while PostgreSQL takes 3.2GB. Even with the 40 bytes per row overhead mentioned in the FAQ, that should come to around 1.7GB, accounting for a 40% increase in size. Vacuum was run on the database.

Any further help? Especially if batch inserts could be sped up, that would be great..

Bye
Shridhar

--
Alone, adj.: In bad company. -- Ambrose Bierce, "The Devil's Dictionary"
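For concreteness, a minimal sketch of the two load paths being compared; the table and file names here are hypothetical stand-ins (the real table has 23 fields of roughly 100 bytes each):

-- Hypothetical two-column table, used only to illustrate the load paths.
CREATE TABLE load_test (id integer, payload text);

-- Path 1: COPY reads the whole file in a single command (one transaction).
COPY load_test FROM '/tmp/load_test.dat';

-- Path 2: batched INSERTs, committing once per batch instead of per row.
BEGIN;
INSERT INTO load_test VALUES (1, 'row 1');
INSERT INTO load_test VALUES (2, 'row 2');
-- ... repeat up to the chosen batch size (1K/5K/10K/100K rows) ...
COMMIT;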
Are you using copy within a transaction? I don't know how to explain the size difference tho. I have never seen an overhead difference that large. What type of MySQL tables were you using and what version?

Have you tried this with Oracle or similar commercial database?
On 20 Sep 2002 at 21:22, Shridhar Daithankar wrote:

> MySQL takes 221 sec. vs. 1121 sec. for PostgreSQL; for PostgreSQL that is
> around 11.5K rows per second. Each tuple has 23 fields with a fixed length
> of around 100 bytes.
>
> I wrote a program which does inserts in batches, but none of them reaches
> the performance of COPY. I tried 1K/5K/10K/100K rows per transaction, but
> it cannot cross 2.5K rows/sec.

1121 sec. was the time with the PostgreSQL default of 64 buffers. With 30K buffers it has degraded to 1393 sec.

One more issue is the time taken for composite index creation. It's 4341 sec. as opposed to 436 sec. for MySQL. These are three non-unique character fields where the combination itself is not unique either. Would an R-tree index be a better choice?

In a select test where approx. 15 rows were returned by a query on an indexed field, MySQL took 14 sec. and PostgreSQL took 17.5 sec. Not bad, but the other issues eclipse the result..

TIA once again..

Bye
Shridhar

--
revolutionary, adj.: Repackaged.
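(For reference: the composite index in question would ordinarily be a plain btree index over the three character columns; PostgreSQL's R-tree indexes are for geometric types, not character data. A sketch with hypothetical table and column names:)

-- Composite btree index over three non-unique character columns.
CREATE INDEX big_table_abc_idx ON big_table (col_a, col_b, col_c);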
On 20 Sep 2002 at 10:26, Jonah H. Harris wrote:

> Also, did you disable fsync?

Aarrrgggh.. If that turns out to be the culprit, I will kill him.. ;-) The problem is I can't see postgresql.conf, nor can I access his command history, and he has left for the day.. I will count that in the checklist, but this is PostgreSQL 7.1.3 on RHL 7.2.. IIRC it should have WAL, in which case -F should not matter much..

On second thought, would it be worth trying 7.2.2, compiled? Will there be any performance difference? I can see on another machine that Mandrake 8.2 has come with 7.2-12.. I think this may be a factor as well..

Bye
Shridhar

--
A hypothetical paradox: What would happen in a battle between an Enterprise security team, who always get killed soon after appearing, and a squad of Imperial Stormtroopers, who can't hit the broad side of a planet? -- Tom Galloway
Also, did you disable fsync?
On 20 Sep 2002 at 10:14, Jonah H. Harris wrote:

> Are you using copy within a transaction?

No. Will that help? I can try. With the utility I wrote, I could insert, say, 10K records in a transaction. COPY seems to be doing it all in one transaction anyway; I don't get any value from select count(*) in another psql session till the copy finishes..

> I don't know how to explain the size difference tho. I have never seen an
> overhead difference that large. What type of MySQL tables were you using
> and what version?

Dunno.. Not my machine.. I am just trying to tune postgres on a friend's machine.. I am not even postgres/root there.. So I cannot answer these questions fast, but will get back on them..

> Have you tried this with Oracle or similar commercial database?

No. This requirement is specific to open source databases..

Maybe in another test on a 4-way/4GB RAM machine, I might see another result. MySQL was creating an index on a 10GB table for the last 25 hours, and last I knew it wasn't finished.. Must be something with the parameters.. Will keep you guys posted..

Bye
Shridhar

--
brain, n: The apparatus with which we think that we think. -- Ambrose Bierce, "The Devil's Dictionary"
On Fri, 2002-09-20 at 08:52, Shridhar Daithankar wrote:

> Besides that, there is the issue of space. MySQL takes 1.4GB of space for
> 1.2GB of text data, while PostgreSQL takes 3.2GB. Even with the 40 bytes per
> row overhead mentioned in the FAQ, that should come to around 1.7GB,
> accounting for a 40% increase in size. Vacuum was run on the database.

How did you calculate the size of the database? If you used "du", make sure you do it in the data/base directory so as not to include the WAL files.

--
Best Regards,

Mike Benoit
NetNation Communication Inc.
Systems Engineer
Tel: 604-684-6892 or 888-983-6600
---------------------------------------
Disclaimer: Opinions expressed here are my own and not necessarily those of my employer
On Fri, 20 Sep 2002, Shridhar Daithankar wrote:

> In a select test where approx. 15 rows were returned by a query on an indexed
> field, MySQL took 14 sec. and PostgreSQL took 17.5 sec. Not bad, but the other
> issues eclipse the result..

I don't know about anyone else, but I find this aspect strange. That's 1 second (approx.) per row retrieved. That is pretty dire for an index scan. The data/index must be very non-unique.

--
Nigel J. Andrews
Nigel J. Andrews wrote:
> On Fri, 20 Sep 2002, Shridhar Daithankar wrote:
>
>> In a select test where approx. 15 rows were returned by a query on an indexed
>> field, MySQL took 14 sec. and PostgreSQL took 17.5 sec. Not bad, but the other
>> issues eclipse the result..
>
> I don't know about anyone else, but I find this aspect strange. That's 1 second
> (approx.) per row retrieved. That is pretty dire for an index scan. The
> data/index must be very non-unique.

Yeah, I'd agree that is strange. Can we see EXPLAIN ANALYZE output for that query?

Also, in one of your earlier posts you mentioned a slowdown after raising shared buffers from the default 64 to 30000. You might have driven the machine into swapping. Maybe try something more like 10000 - 15000.

HTH,

Joe
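(EXPLAIN ANALYZE runs the query and reports the actual plan with per-node timings. A sketch against the hypothetical load_test table from earlier; the real query and column are not shown in the thread:)

-- Show the chosen plan plus actual execution times for the slow select.
EXPLAIN ANALYZE
SELECT * FROM load_test WHERE payload = 'row 1';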
On 20 Sep 2002 at 10:27, Mike Benoit wrote:

> How did you calculate the size of the database? If you used "du", make sure
> you do it in the data/base directory so as not to include the WAL files.

OK, latest experiments: I turned the number of buffers down to 15K and fsync is disabled.. Load time is now 1250 sec. I noticed lots of notices in the log saying "XLogWrite: new log file created".. I am pushing wal_buffers to 1000 and wal_files to 40 to test again.. I hope it gives me the required boost..

And BTW, about disk space usage: it's 2.6GB with base, pg_xlog taking 65MB. Still not good..

Will keep you guys updated..

Bye
Shridhar

--
It is necessary to have purpose. -- Alice #1, "I, Mudd", stardate 4513.3
On 20 Sep 2002 at 18:41, Nigel J. Andrews wrote:

> I don't know about anyone else, but I find this aspect strange. That's 1 second
> (approx.) per row retrieved. That is pretty dire for an index scan. The
> data/index must be very non-unique.

Sorry for the late reply.. The numbers were scaled off: my friend forgot to add units to those numbers.. The actual numbers are 140 ms for MySQL and 175 ms for PostgreSQL.. Further, since the results were obtained via 'time psql', the higher overhead of postgres connection establishment is factored in.. Neck and neck, I would say..

Bye
Shridhar

--
Steele's Law: There exist tasks which cannot be done by more than ten men or fewer than one hundred.
Hi,

I am wondering about bad INSERT performance compared against the speed of COPY. (I use 7.2.2 on RedHat 7.2.)

I have a table with about 30 fields, some constraints, some indexes, and some foreign key constraints. I use COPY to import old data. Copying about 10562 rows takes about 19 seconds. For testing I have written a simple function in PL/pgSQL that inserts dummy records into the same table (just a FOR loop and an INSERT INTO ...). To insert another 10562 rows takes about 12 minutes now!!!

What is the problem with INSERT in postgresql? I usually don't compare mysql and postgresql because mysql is just toy stuff, but I think that the insert performance of mysql (even with innodb tables) is about 10 times better than the insert performance of postgresql.

What is the reason and what can be done about it?

Best Regards,
Michael

P.S: Perhaps you want to know about my postgresql.conf

#
# Shared Memory Size
#
shared_buffers = 12288          # 2*max_connections, min 16
max_fsm_relations = 100         # min 10, fsm is free space map
max_fsm_pages = 20000           # min 1000, fsm is free space map
max_locks_per_transaction = 64  # min 10
wal_buffers = 8                 # min 4

#
# Non-shared Memory Sizes
#
sort_mem = 4096                 # min 32 (in Kb)
vacuum_mem = 16384              # min 1024

#
# Write-ahead log (WAL)
#
wal_files = 8                   # range 0-64, default 0
wal_sync_method = fdatasync     # the default varies across platforms:
                                # fsync, fdatasync, open_sync, or open_datasync
fsync = true
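(For reference, a sketch of the kind of PL/pgSQL benchmark function described above, in 7.2-style quoted-body syntax. The function and table names are hypothetical; the real table has ~30 columns plus constraints and indexes:)

CREATE FUNCTION bench_insert(integer) RETURNS integer AS '
DECLARE
    n ALIAS FOR $1;
    i integer;
BEGIN
    -- One INSERT per loop iteration; the whole function call runs
    -- inside a single transaction.
    FOR i IN 1..n LOOP
        INSERT INTO load_test VALUES (i, ''dummy row'');
    END LOOP;
    RETURN n;
END;
' LANGUAGE 'plpgsql';

-- Usage: insert 10562 dummy rows in one call.
SELECT bench_insert(10562);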
"Michael Paesold" <mpaesold@gmx.at> writes: > To insert another 10562 rows takes about 12 minutes now!!! See http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/populate.html particularly the point about not committing each INSERT as a separate transaction. regards, tom lane
Tom Lane wrote:
> See
> http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/populate.html
> particularly the point about not committing each INSERT as a separate
> transaction.

As I said, I wrote a function to insert the rows (PL/pgSQL). All values were inserted inside a single function call; I always thought that a function call would be executed inside a transaction block. Experience says it does.

About the other points in the docs:

> Use COPY FROM:
Well, I am currently comparing INSERT to COPY ... ;)

> Remove Indexes:
Doesn't COPY also have to update indexes?

> ANALYZE Afterwards:
I have done a VACUUM FULL; VACUUM ANALYZE; just before running the test.

So is it just the planner/optimizer/etc. costs? Would a PREPARE in 7.3 help?

Best Regards,
Michael Paesold
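(COPY does maintain indexes row by row during the load, which is why the "Remove Indexes" tip can win. Sketched against the hypothetical load_test table, assuming an existing index named load_test_payload_idx:)

-- Drop the index, bulk-load, then rebuild the index and refresh statistics.
DROP INDEX load_test_payload_idx;
COPY load_test FROM '/tmp/load_test.dat';
CREATE INDEX load_test_payload_idx ON load_test (payload);
ANALYZE load_test;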
"Michael Paesold" <mpaesold@gmx.at> writes: > To insert another 10562 rows takes about 12 minutes now!!! > As I said I wrote a function to insert the rows (PL/pgSQL). All values were > inserted inside a single function call; I always though that a function call > would be executed inside a transaction block. Experience says it does. Well, there's something fishy about your results. Using CVS tip I see about a 4-to-1 difference between COPYing 10000 rows and INSERT'ing 10000 rows (as one transaction). That's annoyingly high, but it's still way lower than what you're reporting ... I used the contents of table tenk1 in the regression database for test data, and dumped it out with "pg_dump -a" with and without -d. I then just timed feeding the scripts to psql ... regards, tom lane
Tom Lane wrote:
> Well, there's something fishy about your results. Using CVS tip I see
> about a 4-to-1 difference between COPYing 10000 rows and INSERT'ing
> 10000 rows (as one transaction). That's annoyingly high, but it's still
> way lower than what you're reporting ...
>
> I used the contents of table tenk1 in the regression database for test
> data, and dumped it out with "pg_dump -a" with and without -d. I then
> just timed feeding the scripts to psql ...

I have played around further with the test here. I now realize that insert performance is much better right after a vacuum full; vacuum analyze;

I have this function bench_invoice(integer) that will insert $1 records into the invoice table; select bench_invoice(10000) took about 10 minutes on average. Now I executed this with psql:

vacuum full; vacuum analyze;
select bench_invoice(1000); select bench_invoice(1000); ... (10 times)

It seems performance is degrading with every insert! Here is the result (time in seconds inside bench_invoice(), with the commit between selects just under a second):

13, 24, 36, 47, 58, 70, 84, 94, 105, 117, ... (seconds per 1000 rows inserted)

Isn't that odd? I have tried again: vacuum analyze alone (without full) is enough to lower the times again. They will start again at 13 seconds. I did not delete from the table; it now has about 50000 rows. The disk is not swapping, there are no other users using postgres, and the postmaster takes about 100% CPU time during the whole operation. There are no special messages in the error log.

Can you explain? Should I enable some debug logging? Disable some optimizer? Do something else? This is a development server, I have no problem with playing around.

Best Regards,
Michael Paesold
Update:

> vacuum full; vacuum analyze;
> select bench_invoice(1000); select bench_invoice(1000); ... (10 times)
>
> It seems performance is degrading with every insert!
> Here is the result (time in seconds inside bench_invoice(), with the commit
> between selects just under a second):
>
> 13, 24, 36, 47, 58, 70, 84, 94, 105, 117, ... (seconds per 1000 rows
> inserted)
>
> Isn't that odd?
> I have tried again: vacuum analyze alone (without full) is enough to lower
> the times again. They will start again at 13 seconds.

I tested further what exactly will reset the insert times to the lowest possible:

vacuum full;  -- helps
vacuum analyze;  -- helps
analyze <tablename>; on the table I insert into  -- doesn't help!
analyze <tablename>; on any table referenced in the foreign key constraints  -- doesn't help!

Only vacuum will reset the insert times to the lowest possible! What does the vacuum code do?? :-]

Regards,
Michael Paesold
> Only vacuum will reset the insert times to the lowest possible! > What does the vacuum code do?? :-] Please see the manual and the extensive discussions on this point in the archives. This behaviour is well known -- though undesirable. It is an effect of the multi-version concurrency control system. Gavin
"Michael Paesold" <mpaesold@gmx.at> writes: > Only vacuum will reset the insert times to the lowest possible! > What does the vacuum code do?? :-] It removes dead tuples. Dead tuples can only arise from update or delete operations ... so you have not been telling us the whole truth. An insert-only test would not have this sort of behavior. regards, tom lane
Tom Lane wrote:
> It removes dead tuples. Dead tuples can only arise from update or
> delete operations ... so you have not been telling us the whole
> truth. An insert-only test would not have this sort of behavior.

Sleeping is good. When I woke up this morning I had an idea of what was causing these problems, and you are right. I had used a self-written sequence system for the invoice_ids -- I can't use a sequence because sequence values can skip. So inserting an invoice would also do an update on a single row of the cs_sequence table, which caused the problems.

Now, with a normal sequence, it works like a charm: 17 sec. for 10000 rows and 2-3 sec. for the commit.

But why is performance degrading so much? After 10000 updates on a row, the row seems to be unusable without a vacuum! I hope the currently discussed autovacuum daemon will help in such a situation.

So I think I will have to look for another solution. It would be nice if one could lock a sequence! That would solve all my troubles,...

<dreaming>
BEGIN;
LOCK SEQUENCE invoice_id_seq;
-- now only this connection can get nextval(), all others will block
INSERT INTO invoice VALUES (nextval('invoice_id_seq'), ...);
INSERT INTO invoice VALUES (nextval('invoice_id_seq'), ...);
...
COMMIT;
-- now this only helps if sequences could be rolled back -- wake up!
</dreaming>

What could you recommend? Locking the table and selecting max(invoice_id) wouldn't really be much faster, with max(invoice_id) not using an index...

Best Regards,
Michael Paesold
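(A sketch of the gapless-counter pattern being described; only the table names come from the post, the column definitions are guesses. Every invoice insert also updates one row of the counter table, and under MVCC each such update leaves a dead tuple behind, which is why the hot row degrades until it is vacuumed:)

CREATE TABLE cs_sequence (name text PRIMARY KEY, last_value integer NOT NULL);
INSERT INTO cs_sequence VALUES ('invoice_id', 0);
CREATE TABLE invoice (invoice_id integer PRIMARY KEY, amount numeric);

BEGIN;
-- The UPDATE row-locks the counter, so concurrent inserters serialize here
-- and no value is ever skipped; the price is one dead cs_sequence tuple
-- per invoice.
UPDATE cs_sequence SET last_value = last_value + 1 WHERE name = 'invoice_id';
INSERT INTO invoice
    SELECT last_value, 100.00 FROM cs_sequence WHERE name = 'invoice_id';
COMMIT;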
On 26 Sep 2002 at 12:28, Michael Paesold wrote:

> But why is performance degrading so much? After 10000 updates
> on a row, the row seems to be unusable without a vacuum! I hope
> the currently discussed autovacuum daemon will help in such a
> situation.

Let me know if it works. Use CVS, BTW.. I am eager to hear any bug reports.. I didn't have a chance to test it the way I would have liked. Maybe this weekend..

Bye
Shridhar

--
QOTD: The forest may be quiet, but that doesn't mean the snakes have gone away.
"Michael Paesold" <mpaesold@gmx.at> writes: > So inserting an invoice would also do an update on a single row > of the cs_sequence table, which cause the problems. > Now, with a normal sequence, it works like a charm. > 17 sec. for 10000 rows and 2-3 sec. for commit. > But why is performance so much degrading? After 10000 updates > on a row, the row seems to be unusable without vacuum! Probably, because the table contains 10000 dead tuples and one live one. The system is scanning all 10001 tuples looking for the one to UPDATE. In 7.3 it might help a little to create an index on the table. But really this is one of the reasons that SEQUENCEs were invented --- you have no alternative but to do frequent vacuums, if you repeatedly update the same row of a table. You might consider issuing a selective "VACUUM cs_sequence" command every so often (ideally every few hundred updates). > I hope the currently discussed autovacuum daemon will help in such a > situation. Probably, if we can teach it to recognize that such frequent vacuums are needed. In the meantime, cron is your friend ... regards, tom lane
> Have you tried this with Oracle or similar commercial database?

I have timed COPY/LOAD times for Postgresql/Mysql/Oracle/Db2. The rough comparison is:

Db2 and Mysql fastest (Db2 slightly faster)
Oracle approx twice as slow as Db2
Postgresql about 3.5-4 times slower than Db2

However, Postgresql can sometimes create indexes faster than Mysql, so that the total time of COPY + CREATE INDEX can be smaller for Postgresql than for Mysql. Oracle and Db2 seemed similarish to Postgresql with respect to CREATE INDEX.

regards

Mark
On Fri, 20 Sep 2002, Shridhar Daithankar wrote:

> On 20 Sep 2002 at 21:22, Shridhar Daithankar wrote:
>
> > MySQL takes 221 sec. vs. 1121 sec. for PostgreSQL; for PostgreSQL,
> > that is around 11.5K rows per second. Each tuple has 23 fields with
> > a fixed length of around 100 bytes.

Yes, postgres is much slower than MySQL for doing bulk loading of data. There's not much, short of hacking on the code, that can be done about this.

> One more issue is the time taken for composite index creation. It's 4341
> sec. as opposed to 436 sec. for MySQL. These are three non-unique
> character fields where the combination itself is not unique either.

Setting sort_mem appropriately makes a big difference here. I generally bump it up to 2-8 MB for everyone, and when I'm building a big index, I set it to 32 MB or so just for that session. But make sure you don't set it so high that you drive your system into swapping, or it will kill your performance.

Remember also that in 7.2.x, postgres will actually use almost three times the value you give sort_mem (i.e., a sort_mem of 32 MB will actually allocate close to 96 MB of memory for the sort).

cjs
--
Curt Sampson <cjs@cynic.net>  +81 90 7737 2974  http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
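(A concrete illustration of the per-session sort_mem bump described above; sort_mem is a 7.2-era setting given in kilobytes, and the index/table names are the hypothetical ones used earlier in this thread:)

-- Raise sort memory for this session only, build the big composite index,
-- then drop back to the postgresql.conf default.
SET sort_mem = 32768;   -- roughly 32 MB, value is in kB
CREATE INDEX big_table_abc_idx ON big_table (col_a, col_b, col_c);
RESET sort_mem;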