Thread: A speed comparison with SQLite
Just thought I would share some rough numbers here.
A bit of an unusual edge case but a big time difference...

A guy using RealBasic (a GUI development IDE), which uses SQLite as its
built-in db engine, wanted to generate a series of unique codes and
decided SQL would help (wanting 30 million codes starts to rule out
RAM-based solutions).

This is a 7-character alphanumeric code.

His program generated 30M codes in about 15 minutes into the SQLite db
file without any duplicate checks.

SELECT DISTINCT(pincode) FROM codes; returned the results (that is, to
his client, not out to a file) after 22 hours, and he was after a
faster solution.

Using a unique index to check as he went, inserts dropped from 18,000
per second to about 200 a second after 8 hours (without completing).

(The following times are taken from psql's \timing option.)

With a P4 3GHz - 80GB IDE drive - running XP Pro - pg 8.2.6:

I created the table and inserted 30M codes in 5.9 mins.

select count(distinct(pincode)) from codes; took about 7.2 minutes
(29,993,182 unique codes of the 30M created).

Setting output to a text file, select distinct(pincode) from codes;
took 18.45 minutes to send them all out (283MB file).

The same hardware running FreeBSD 7.0-RC1 (GENERIC kernel):

I created the table and inserted 30M codes in 5.6 mins.

select count(distinct(pincode)) from codes; took 5.9 minutes
(29,993,279 unique codes of the 30M created).

Setting output to a text file, select distinct(pincode) from codes;
took 9.7 minutes to send them all out to file.

It would appear that the method of locating distinct results is the
key here - PostgreSQL sorts the distinct column, allowing it to
identify a duplicate or new value more quickly - SQLite returns the
distinct columns in the order they are inserted, so it must be
scanning all previous entries in the output to ensure distinctness.

--
Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz
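(For anyone wanting to reproduce the PostgreSQL side, a psql session
along these lines matches the steps described above. The exact table
definition and code generator weren't posted, so the md5-based
generator here is only a stand-in - it yields 7 hex characters rather
than the full alphanumeric range.)

  \timing
  CREATE TABLE codes (pincode char(7));

  -- Stand-in generator: the first 7 hex characters of an md5 hash of
  -- a random number (hypothetical; the original generator wasn't posted).
  INSERT INTO codes
    SELECT substr(md5(random()::text), 1, 7)
    FROM generate_series(1, 30000000);

  SELECT count(DISTINCT pincode) FROM codes;

  -- Redirect query output to a text file, then select the distinct codes.
  \o codes.txt
  SELECT DISTINCT pincode FROM codes;
  \o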
> A guy using RealBasic (a GUI development IDE), which uses SQLite as
> its built-in db engine, wanted to generate a series of unique codes
> and decided SQL would help (wanting 30 million codes starts to rule
> out RAM-based solutions).
>
> This is a 7-character alphanumeric code.
>
> His program generated 30M codes in about 15 minutes into the SQLite
> db file without any duplicate checks.
>
> SELECT DISTINCT(pincode) FROM codes; returned the results (that is,
> to his client, not out to a file) after 22 hours, and he was after a
> faster solution.
>
> Using a unique index to check as he went, inserts dropped from 18,000
> per second to about 200 a second after 8 hours (without completing).
>
> (The following times are taken from psql's \timing option.)
>
> With a P4 3GHz - 80GB IDE drive - running XP Pro - pg 8.2.6:
> ...
> The same hardware running FreeBSD 7.0-RC1 (GENERIC kernel):
>
> I created the table and inserted 30M codes in 5.6 mins.
>
> select count(distinct(pincode)) from codes; took 5.9 minutes
> (29,993,279 unique codes of the 30M created).
>
> Setting output to a text file, select distinct(pincode) from codes;
> took 9.7 minutes to send them all out to file.
>
> It would appear that the method of locating distinct results is the
> key here - PostgreSQL sorts the distinct column, allowing it to
> identify a duplicate or new value more quickly - SQLite returns the
> distinct columns in the order they are inserted, so it must be
> scanning all previous entries in the output to ensure distinctness.

Come on, guys. SQLite does have performance issues with sorting large
tables. It is something we are working on. But it isn't *that* bad. I
reran the test with no special tricks and SQLite took 15 minutes.
Still 3x slower than PostgreSQL, I admit. But it is within a factor of
3....

Here is the output I get when I create 2^25 random 7-character
strings, then did a CREATE TABLE t2 AS SELECT DISTINCT * FROM t1:

  2008-01-18 23:29:22: i=16
  2008-01-18 23:29:22: i=32
  2008-01-18 23:29:22: i=64
  2008-01-18 23:29:22: i=128
  2008-01-18 23:29:22: i=256
  2008-01-18 23:29:22: i=512
  2008-01-18 23:29:22: i=1024
  2008-01-18 23:29:22: i=2048
  2008-01-18 23:29:22: i=4096
  2008-01-18 23:29:22: i=8192
  2008-01-18 23:29:22: i=16384
  2008-01-18 23:29:22: i=32768
  2008-01-18 23:29:23: i=65536
  2008-01-18 23:29:23: i=131072
  2008-01-18 23:29:24: i=262144
  2008-01-18 23:29:25: i=524288
  2008-01-18 23:29:27: i=1048576
  2008-01-18 23:29:31: i=2097152
  2008-01-18 23:29:40: i=4194304
  2008-01-18 23:29:59: i=8388608
  2008-01-18 23:30:37: i=16777216
  2008-01-18 23:31:54: start the distinct...
  2008-01-18 23:46:46: finish the distinct...

So you can see, the SELECT DISTINCT took about 15 minutes. The TCL
script used to run this test is:

  sqlite db test.db
  db eval {
    CREATE TABLE t1(pincode TEXT);
    INSERT INTO t1 VALUES(randstr(7,7));
  }
  proc msg {txt} {
    set now [db one {SELECT datetime('now')}]
    puts "$now: $txt"
    flush stdout
  }
  for {set i 1} {$i<30000000} {incr i $i} {
    msg "i=$i"
    db eval {
      INSERT INTO t1 SELECT randstr(7,7) FROM t1;
    }
  }
  msg "start the distinct..."
  db eval {
    CREATE TABLE t2 AS SELECT distinct(pincode) FROM t1;
  }
  msg "finish the distinct..."

--
D. Richard Hipp <drh@hwaci.com>
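(The randstr() SQL function above appears to come from SQLite's TCL
test harness rather than the core library. A rough equivalent in the
plain sqlite3 shell - assuming SQLite 3.3.13 or later for
randomblob() - would look something like this:)

  $ sqlite3 test.db
  sqlite> .timer on
  sqlite> CREATE TABLE t1(pincode TEXT);
  sqlite> -- hex(randomblob(4)) gives 8 hex characters; the first 7
  sqlite> -- stand in for randstr(7,7).
  sqlite> INSERT INTO t1 VALUES(substr(hex(randomblob(4)), 1, 7));
  sqlite> -- Repeat the next line 25 times; each run doubles the
  sqlite> -- table, ending at 2^25 rows as in the TCL loop above.
  sqlite> INSERT INTO t1 SELECT substr(hex(randomblob(4)), 1, 7) FROM t1;
  sqlite> CREATE TABLE t2 AS SELECT DISTINCT pincode FROM t1;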
Shane Ambler wrote:
> Just thought I would share some rough numbers here.
> A bit of an unusual edge case but a big time difference...

Just following up with this comparison.

I have been in contact with the SQLite creator about this issue and
have found that SQLite's design leans its default settings toward low
memory usage, in keeping with its design goal of having a small
footprint.

Even with that design goal, it doesn't lock you into that. You can
tell SQLite to use any memory it needs, as well as how much of your db
is cached in RAM. You can also build SQLite to use these options as
defaults.

One drawback with large datasets is that temp_store can then grow
larger than available RAM, invoking swapping and negating the benefits
(though I found this to still be quicker than the original tests).

Having said that - setting SQLite's temp_store to memory allows SQLite
to return SELECT DISTINCT(pincode) FROM codes; in around 15 minutes,
compared to PostgreSQL's 9.7 minutes.

It seems that SQLite's sorting algorithms are better suited to
in-memory use than on-disk use. This issue really only appears in
large datasets, as used in my original timing examples - more
specifically, when temp storage exceeds the OS disk cache size. They
are aware of this situation and are working on improvements.

--
Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz
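(For reference, the settings described can be changed per connection
with PRAGMAs; the values below are illustrative rather than the exact
ones used in these tests.)

  -- Keep temporary tables and the b-trees used for sorting in RAM
  -- instead of temp files (accepts DEFAULT/FILE/MEMORY, or 0/1/2).
  PRAGMA temp_store = MEMORY;

  -- Grow the page cache; the unit is database pages, so with the
  -- default 1KB page size 200000 pages is roughly 200MB.
  PRAGMA cache_size = 200000;

  -- The built-in defaults can also be changed at compile time, e.g.
  -- building with -DSQLITE_TEMP_STORE=3 forces in-memory temp storage.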
On Fri, 25 Jan 2008 06:26:26 +1030, Shane Ambler <pgsql@Sheeky.Biz>
wrote:

One thing to add: competing engines can manage this for you, because
SQLite does not have a backend process to manage that.

Regards,
--
Fernando Ike
http://www.midstorm.org/~fike/weblog
On Fri, Jan 25, 2008 at 06:26:26AM +1030, Shane Ambler wrote:
> Shane Ambler wrote:
> > Just thought I would share some rough numbers here.
> > A bit of an unusual edge case but a big time difference...
>
> Just following up with this comparison.
>
> I have been in contact with the SQLite creator about this issue and
> have found that SQLite's design leans its default settings toward low
> memory usage, in keeping with its design goal of having a small
> footprint.
>
> Even with that design goal, it doesn't lock you into that. You can
> tell SQLite to use any memory it needs, as well as how much of your
> db is cached in RAM. You can also build SQLite to use these options
> as defaults.

Yeah, at least SQLite doesn't try or promise to be something that it's
not. They're intended for embedded apps, and they seem to do that
quite well.

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828