Thread: Re: Unacceptable postgres performance vs. Microsoft sqlserver
"tosbalok@gmail.com" <tosbalok@gmail.com> writes: > Let me just start off by saying that I *want* to use postgresql. > That's my goal. I do not want to use SQLServer. I'm posting this > message not to slam postgres, but to ask for someone to help me figure > out what I'm doing wrong. > > I've used postgres for several years as the backend to web > applications. I'm not entirely new to it, but I've never needed to > bother with performance tuning. Netflix is running a contest in which > they will give you a set of movie ratings and you try to develop an > algorithm to predict future ratings. I've imported this data into a > postgresql database, and the performance is abysmal. Here's my setup: > > Core 2 Quad with 4GB RAM and two SATAII hard drives in RAID 0 > > I install debian linux and postgresql 8.1 (this is the latest version > that is available through the debian package manager). I import the > Netflix data into a table with the following characteristics: > > Create table rating (movieid int, userid int, rating int4, ratingdate > date) > > There are 180 million rows. I've done the import 3 times, it takes on > average 815 seconds. I'm not too worried about that. Now for a test, > I run the query, select count(*) from rating; > > This takes an average of 172 seconds. Call it three minutes. > > Now, on the same machine, I install windows XP and Microsoft SQL > server 2000. The import took 742 seconds. The count(*) query took 22 > seconds. > > 22 seconds. What's gong on? > > Another test. In postgres I added an index to the userid column and > then counted distinct userids. The average run time over three > queries was 4666 seconds, or 78 minutes. Unbelievable. > > On SQL Server, with *no* index, the same query takes on average 414 > seconds, or about 7 minutes. Ten times faster! > > I'm sure someone will flame me and say that I'm a worthless noob and > if only I was smart I would know what I'm doing wrong. So let me just > repeat: I want to use postgres. That's my goal. Please tell me what > I can do to make the performance of this acceptable. It's not a > complicated setup. One table. A couple of count(*) queries. This is a well-understood issue... Some database systems are able to do some optimization where either: a) They collect some statistics to answer such queries in O(1) time, or b) They can use index accesses and only look at an index. The MVCC strategy in PostgreSQL, which allows it to avoid the need for readers to block writers, and vice-versa, has the result that running "count(*)" without any WHERE clause requires a scan across the entire table. If the entire purpose of your application is to run COUNT(*) against the entireties of a table with 180M rows, then PostgreSQL may not be the right database for your application. If, on the other hand, this is a poor benchmark of your actual requirements, then it would be a bad idea to draw any conclusions based on the performance of "select count(*) from some_table;" -- (format nil "~S@~S" "cbbrowne" "linuxfinances.info") http://www3.sympatico.ca/cbbrowne/oses.html "If a word in the dictionary were misspelled, how would we know?" -- Steven Wright
On Mon, Apr 14, 2008 at 1:34 PM, Chris Browne <cbbrowne@acm.org> wrote:
> "tosbalok@gmail.com" <tosbalok@gmail.com> writes:
> > Another test. In postgres I added an index to the userid column and
> > then counted distinct userids. The average run time over three
> > queries was 4666 seconds, or 78 minutes. Unbelievable.
> >
> > On SQL Server, with *no* index, the same query takes on average 414
> > seconds, or about 7 minutes. Ten times faster!

First, in general: use the EXPLAIN and EXPLAIN ANALYZE features of
postgresql. They will tell you a lot about what your queries are doing
and why they're taking so long.

Second, make sure you've ANALYZE'd your table after creating it and the
index, which gives the planner the statistics necessary to make
intelligent choices. For instance, your count of distinct userids is
probably not using the index you just created. If it still isn't using
it after you ANALYZE the table, try rewriting the query using GROUP BY
(select count(*) from (select userid from mytable group by userid) tmp).
I recently had a similar performance issue on a 75m row table, and the
above helped.

VACUUM ANALYZE your tables, and then remember that EXPLAIN and EXPLAIN
ANALYZE are your best friends.

-- 
- David T. Wilson
david.t.wilson@gmail.com
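A concrete sketch of that workflow, using the table and column names
from the original post (the plan output and timings will of course
differ on the real data):

    -- Refresh planner statistics after the bulk load and index build
    VACUUM ANALYZE rating;

    -- Show the plan the optimizer picks, without running the query
    EXPLAIN
    SELECT count(DISTINCT userid) FROM rating;

    -- Run the rewritten query and report actual per-node timings
    EXPLAIN ANALYZE
    SELECT count(*)
    FROM (SELECT userid FROM rating GROUP BY userid) AS tmp;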
On Mon, 14 Apr 2008 13:34:36 -0400
Chris Browne <cbbrowne@acm.org> wrote:

> > Another test. In postgres I added an index to the userid column
> > and then counted distinct userids. The average run time over
> > three queries was 4666 seconds, or 78 minutes. Unbelievable.
> >
> > On SQL Server, with *no* index, the same query takes on average
> > 414 seconds, or about 7 minutes. Ten times faster!
>
> Some database systems are able to do some optimization where either:
> a) They collect some statistics to answer such queries in O(1)
>    time, or
> b) They can use index accesses and only look at an index.
>
> The MVCC strategy in PostgreSQL, which allows it to avoid the need
> for readers to block writers, and vice versa, has the result that
> running "count(*)" without any WHERE clause requires a scan across
> the entire table.
>
> If the entire purpose of your application is to run COUNT(*) against
> the entirety of a table with 180M rows, then PostgreSQL may not be
> the right database for your application.
>
> If, on the other hand, this is a poor benchmark of your actual
> requirements, then it would be a bad idea to draw any conclusions
> based on the performance of "select count(*) from some_table;"

But why, once you add the index and count distinct, is the performance
still so far apart? I'd say that counting in this case is not the
hardest thing to do, but rather the "distinct" part.

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it
On Mon, 14 Apr 2008 13:34:36 -0400 Chris Browne wrote:

> If the entire purpose of your application is to run COUNT(*) against
> the entirety of a table with 180M rows, then PostgreSQL may not be
> the right database for your application.

For running a COUNT(*) without a WHERE clause you could even add a
trigger and a second table which contains the statistics. With this
simple workaround you have the answer to the COUNT(*) question in
almost no time. Getting results for a COUNT(*) with a WHERE clause is
more complicated.

Kind regards

-- 
Andreas 'ads' Scherbaum
German PostgreSQL User Group
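A minimal sketch of that workaround, assuming the "rating" table from
the original post; the side table and trigger names here are made up
for illustration, and a real version would have to worry about many
concurrent writers all contending to update the single counter row:

    -- Assumes the plpgsql language is installed (createlang plpgsql <db>).
    -- Side table holding the running row count, seeded once from count(*).
    CREATE TABLE rating_rowcount (n bigint NOT NULL);
    INSERT INTO rating_rowcount SELECT count(*) FROM rating;

    CREATE OR REPLACE FUNCTION rating_count_trig() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            UPDATE rating_rowcount SET n = n + 1;
        ELSIF TG_OP = 'DELETE' THEN
            UPDATE rating_rowcount SET n = n - 1;
        END IF;
        RETURN NULL;  -- return value of an AFTER trigger is ignored
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER rating_count
        AFTER INSERT OR DELETE ON rating
        FOR EACH ROW EXECUTE PROCEDURE rating_count_trig();

    -- The unqualified COUNT(*) then becomes a one-row lookup:
    SELECT n FROM rating_rowcount;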
"Ivan Sergio Borgonovo" <mail@webthatworks.it> writes: > But why once you add the index and count distinct the performances > are still so far? > I'd say that counting in this case is not the hardest thing to do, > but rather the "distinct" part. Your tests have been a bit unfortunate in finding a few particularly soft bits in Postgres's underbelly. a) Postgres can't do "index-only" scans so the index doesn't really help in this case. There's some discussion about improving that soon but it's still early. b) Postgres doesn't know how to pick just distinct values out of an index, it has to find a start point and read all the records from that point forward so even if this is a very low cardinality value it would have to read the whole index. c) DISTINCT is one of the earlier features in Postgres and a lot planner code is a lot smarter. In particular it doesn't know about using hash tables to find distinct values. This you can work around by rewriting the query as GROUP BY goes through a more recent code path. d) The default settings if you haven't tuned postgresql.conf are quite conservative. That alone often makes it look poor when compared against other databases. In particular it makes it unlikely to pick hash tables for things or do in-memory sorts unless you raise work_mem. This is what you're doing now: postgres=# explain select count(distinct aid) from accounts; QUERY PLAN ---------------------------------------------------------------------- Aggregate (cost=2838.00..2838.01 rows=1 width=4) -> Seq Scan on accounts (cost=0.00..2588.00 rows=100000 width=4) (2 rows) This is what you're expecting it to do (on a freshly built clustered index it might perform ok but on a non-clustered index it will be terrible): postgres=# explain select count(*) from (select 1 from accounts group by aid) as a; QUERY PLAN -------------------------------------------------------------------------------------------------- Aggregate (cost=5475.26..5475.27 rows=1 width=0) -> Group (cost=0.00..4225.26 rows=100000 width=4) -> Index Scan using accounts_pkey on accounts (cost=0.00..3975.26 rows=100000 width=4) (3 rows) postgres=# set work_mem = '128M'; ERROR: invalid value for parameter "work_mem": "128M" HINT: Valid units for this parameter are "kB", "MB", and "GB". (ARGH! DAMN YOU PETER!!!!) This is probably the best you can get Postgres to do currently: postgres=# set work_mem = '128MB'; SET postgres=# explain select count(*) from (select 1 from accounts group by aid) as a; QUERY PLAN ---------------------------------------------------------------------------- Aggregate (cost=5088.00..5088.01 rows=1 width=0) -> HashAggregate (cost=2838.00..3838.00 rows=100000 width=4) -> Seq Scan on accounts (cost=0.00..2588.00 rows=100000 width=4) (3 rows) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
On Mon, 14 Apr 2008 21:30:15 +0100
Gregory Stark <stark@enterprisedb.com> wrote:

> "Ivan Sergio Borgonovo" <mail@webthatworks.it> writes:
>
> > But why, once you add the index and count distinct, is the
> > performance still so far apart?
> > I'd say that counting in this case is not the hardest thing to do,
> > but rather the "distinct" part.
>
> Your tests have been a bit unfortunate in finding a few
> particularly soft bits in Postgres's underbelly.

While I digest the interesting things you wrote... are there any
comparative benchmarks on count(*) with WHERE clauses?

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it
On Apr 14, 2:17 pm, david.t.wil...@gmail.com ("David Wilson") wrote:

> For instance, your count of distinct userids is probably not using the
> index you just created. If it still isn't using it after you ANALYZE
> the table, try rewriting the query using group by (select count(*)
> from (select userid from mytable group by userid) tmp). I recently had
> a similar performance issue on a 75m row table, and the above helped.

Thanks. There's a lot of good info in your post. I'll give it a try.
On Apr 14, 12:34 pm, Chris Browne <cbbro...@acm.org> wrote:

> If the entire purpose of your application is to run COUNT(*)

Haha, no. I see your point. I'll be doing statistical functions on
group-by's, so I'll go back and give it another try with queries like
those. And I'll use some of the functions that other people here have
suggested. I hope this works! Thanks.
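For what it's worth, "queries like those" would look less like a bare
count(*) and more like a grouped aggregate. A sketch against the
"rating" table from the first post:

    -- Per-movie summary statistics; exercises the GROUP BY / hash
    -- aggregate machinery discussed earlier in the thread
    SELECT movieid,
           count(*)       AS num_ratings,
           avg(rating)    AS mean_rating,
           stddev(rating) AS rating_stddev
    FROM rating
    GROUP BY movieid;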