Thread: Re: Unacceptable postgres performance vs. Microsoft sqlserver

Re: Unacceptable postgres performance vs. Microsoft sqlserver

From: Chris Browne
Date: Mon, 14 Apr 2008 13:34:36 -0400
"tosbalok@gmail.com" <tosbalok@gmail.com> writes:
> Let me just start off by saying that I *want* to use postgresql.
> That's my goal.  I do not want to use SQLServer.  I'm posting this
> message not to slam postgres, but to ask for someone to help me figure
> out what I'm doing wrong.
>
> I've used postgres for several years as the backend to web
> applications. I'm not entirely new to it, but I've never needed to
> bother with performance tuning.  Netflix is running a contest in which
> they will give you a set of movie ratings and you try to develop an
> algorithm to predict future ratings.  I've imported this data into a
> postgresql database, and the performance is abysmal.  Here's my setup:
>
> Core 2 Quad with 4GB RAM and two SATAII hard drives in RAID 0
>
> I install debian linux and postgresql 8.1 (this is the latest version
> that is available through the debian package manager).  I import the
> Netflix data into a table with the following characteristics:
>
> Create table rating (movieid int, userid int, rating int4, ratingdate
> date)
>
> There are 180 million rows.  I've done the import 3 times, it takes on
> average 815 seconds.  I'm not too worried about that.  Now for a test,
> I run the query, select count(*) from rating;
>
> This takes an average of 172 seconds.  Call it three minutes.
>
> Now, on the same machine, I install windows XP and Microsoft SQL
> server 2000.  The import took 742 seconds.  The count(*) query took 22
> seconds.
>
> 22 seconds.  What's going on?
>
> Another test.  In postgres I added an index to the userid column and
> then counted distinct userids.  The average run time over three
> queries was 4666 seconds, or 78 minutes.  Unbelievable.
>
> On SQL Server, with *no* index, the same query takes on average 414
> seconds, or about 7 minutes.  Ten times faster!
>
> I'm sure someone will flame me and say that I'm a worthless noob and
> if only I was smart I would know what I'm doing wrong.  So let me just
> repeat: I want to use postgres.  That's my goal.  Please tell me what
> I can do to make the performance of this acceptable.  It's not a
> complicated setup.  One table.  A couple of count(*) queries.

This is a well-understood issue...

Some database systems are able to do some optimization where either:
a) They collect some statistics to answer such queries in O(1) time, or
b) They can use index accesses and only look at an index.

The MVCC strategy in PostgreSQL, which allows it to avoid the need for
readers to block writers, and vice-versa, has the result that running
"count(*)" without any WHERE clause requires a scan across the entire
table.

If the entire purpose of your application is to run COUNT(*) against
the entireties of a table with 180M rows, then PostgreSQL may not be
the right database for your application.

If, on the other hand, this is a poor benchmark of your actual
requirements, then it would be a bad idea to draw any conclusions
based on the performance of "select count(*) from some_table;"
--
(format nil "~S@~S" "cbbrowne" "linuxfinances.info")
http://www3.sympatico.ca/cbbrowne/oses.html
"If a word in the dictionary  were misspelled, how would we know?"
-- Steven Wright

Re: Unacceptable postgres performance vs. Microsoft sqlserver

From: David Wilson
On Mon, Apr 14, 2008 at 1:34 PM, Chris Browne <cbbrowne@acm.org> wrote:
> "tosbalok@gmail.com" <tosbalok@gmail.com> writes:
>  > Another test.  In postgres I added an index to the userid column and
>  > then counted distinct userids.  The average run time over three
>  > queries was 4666 seconds, or 78 minutes.  Unbelievable.
>  >
>  > On SQL Server, with *no* index, the same query takes on average 414
>  > seconds, or about 7 minutes.  Ten times faster!

First, in general: use the EXPLAIN and EXPLAIN ANALYZE features of
postgresql. They will tell you a lot about what your queries are doing
and why they're taking so long.

Second, make sure you've ANALYZE'd your table after creating it and
the index, which gives the planner the statistics necessary to make
intelligent choices.

For instance, your count of distinct userids is probably not using the
index you just created. If it still isn't using it after you ANALYZE
the table, try rewriting the query using group by (select count(*)
from (select userid from mytable group by userid) tmp). I recently had
a similar performance issue on a 75m row table, and the above helped.
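As a sketch of that workflow, using the rating table and userid column from the original post (the exact plan you get will depend on your data and settings):

```sql
-- Refresh planner statistics first, then inspect the plan.
ANALYZE rating;

EXPLAIN ANALYZE
SELECT count(*)
FROM (SELECT userid FROM rating GROUP BY userid) AS tmp;
```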

VACUUM ANALYZE tables, and then remember that EXPLAIN and EXPLAIN
ANALYZE are your best friends.

--
- David T. Wilson
david.t.wilson@gmail.com

Re: Unacceptable postgres performance vs. Microsoft sqlserver

From: Ivan Sergio Borgonovo
On Mon, 14 Apr 2008 13:34:36 -0400
Chris Browne <cbbrowne@acm.org> wrote:


> > Another test.  In postgres I added an index to the userid column
> > and then counted distinct userids.  The average run time over
> > three queries was 4666 seconds, or 78 minutes.  Unbelievable.

> > On SQL Server, with *no* index, the same query takes on average
> > 414 seconds, or about 7 minutes.  Ten times faster!


> Some database systems are able to do some optimization where either:
> a) They collect some statistics to answer such queries in O(1)
> time, or b) They can use index accesses and only look at an index.

> The MVCC strategy in PostgreSQL, which allows it to avoid the need
> for readers to block writers, and vice-versa, has the result that
> running "count(*)" without any WHERE clause requires a scan across
> the entire table.

> If the entire purpose of your application is to run COUNT(*) against
> the entireties of a table with 180M rows, then PostgreSQL may not be
> the right database for your application.

> If, on the other hand, this is a poor benchmark of your actual
> requirements, then it would be a bad idea to draw any conclusions
> based on the performance of "select count(*) from some_table;"

But why, once you add the index and count the distinct userids, is
the performance still so far apart?
I'd say that in this case the counting is not the hardest thing to
do, but rather the "distinct" part.


--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: Unacceptable postgres performance vs. Microsoft sqlserver

From: Andreas 'ads' Scherbaum
On Mon, 14 Apr 2008 13:34:36 -0400 Chris Browne wrote:

> If the entire purpose of your application is to run COUNT(*) against
> the entireties of a table with 180M rows, then PostgreSQL may not be
> the right database for your application.

For running a COUNT(*) without WHERE you could even add a trigger and a
second table which contains the statistics. With this simple workaround
you have the answer to the COUNT(*) question in almost no time.
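A minimal sketch of such a trigger-maintained counter, using the rating table from the original post (names are illustrative; this assumes plpgsql is installed, and note that every concurrent writer will contend on the single counter row):

```sql
-- Single-row table holding the current row count.
CREATE TABLE rating_count (n bigint NOT NULL);
INSERT INTO rating_count SELECT count(*) FROM rating;

-- Keep the counter in sync on every insert and delete.
CREATE FUNCTION rating_count_update() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE rating_count SET n = n + 1;
    ELSE
        UPDATE rating_count SET n = n - 1;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rating_count_trig
    AFTER INSERT OR DELETE ON rating
    FOR EACH ROW EXECUTE PROCEDURE rating_count_update();

-- COUNT(*) without WHERE then becomes a single-row lookup:
SELECT n FROM rating_count;
```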

Getting results for a COUNT(*) with WHERE is more complicated.


Kind regards

--
                Andreas 'ads' Scherbaum
German PostgreSQL User Group

Re: Unacceptable postgres performance vs. Microsoft sqlserver

From: Gregory Stark
Date: Mon, 14 Apr 2008 21:30:15 +0100
"Ivan Sergio Borgonovo" <mail@webthatworks.it> writes:

> But why, once you add the index and count the distinct userids, is
> the performance still so far apart?
> I'd say that in this case the counting is not the hardest thing to
> do, but rather the "distinct" part.

Your tests have been a bit unfortunate in finding a few particularly soft bits
in Postgres's underbelly.

a) Postgres can't do "index-only" scans so the index doesn't really help in
   this case. There's some discussion about improving that soon but it's still
   early.

b) Postgres doesn't know how to pick just the distinct values out of an
   index; it has to find a start point and read all the records from that
   point forward, so even if this is a very low-cardinality column it
   would have to read the whole index.

c) DISTINCT is one of the older features in Postgres, and a lot of the
   planner code has since become smarter. In particular the DISTINCT path
   doesn't know about using hash tables to find distinct values. You can
   work around this by rewriting the query to use GROUP BY, which goes
   through a more recent code path.

d) The default settings, if you haven't tuned postgresql.conf, are quite
   conservative. That alone often makes Postgres look poor when compared
   against other databases. In particular it makes the planner unlikely to
   pick hash tables or do in-memory sorts unless you raise work_mem.



This is what you're doing now:

postgres=# explain select count(distinct aid) from accounts;
                              QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=2838.00..2838.01 rows=1 width=4)
   ->  Seq Scan on accounts  (cost=0.00..2588.00 rows=100000 width=4)
(2 rows)



This is what you're expecting it to do (on a freshly built clustered index it
might perform ok but on a non-clustered index it will be terrible):

postgres=# explain select count(*) from (select 1 from accounts group by aid) as a;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Aggregate  (cost=5475.26..5475.27 rows=1 width=0)
   ->  Group  (cost=0.00..4225.26 rows=100000 width=4)
         ->  Index Scan using accounts_pkey on accounts  (cost=0.00..3975.26 rows=100000 width=4)
(3 rows)

postgres=# set work_mem = '128M';
ERROR:  invalid value for parameter "work_mem": "128M"
HINT:  Valid units for this parameter are "kB", "MB", and "GB".

(ARGH! DAMN YOU PETER!!!!)



This is probably the best you can get Postgres to do currently:

postgres=# set work_mem = '128MB';
SET

postgres=# explain select count(*) from (select 1 from accounts group by aid) as a;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Aggregate  (cost=5088.00..5088.01 rows=1 width=0)
   ->  HashAggregate  (cost=2838.00..3838.00 rows=100000 width=4)
         ->  Seq Scan on accounts  (cost=0.00..2588.00 rows=100000 width=4)
(3 rows)


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

Re: Unacceptable postgres performance vs. Microsoft sqlserver

From: Ivan Sergio Borgonovo
On Mon, 14 Apr 2008 21:30:15 +0100
Gregory Stark <stark@enterprisedb.com> wrote:

> "Ivan Sergio Borgonovo" <mail@webthatworks.it> writes:
>
> > But why, once you add the index and count the distinct userids, is
> > the performance still so far apart?
> > I'd say that in this case the counting is not the hardest thing to
> > do, but rather the "distinct" part.
>
> Your tests have been a bit unfortunate in finding a few
> particularly soft bits in Postgres's underbelly.

While I digest the interesting things you wrote... are there any
comparative benchmarks on count(*) with WHERE clauses?


--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: Unacceptable postgres performance vs. Microsoft sqlserver

From: tosbalok@gmail.com
On Apr 14, 2:17 pm, david.t.wil...@gmail.com ("David Wilson") wrote:
> For instance, your count of distinct userids is probably not using the
> index you just created. If it still isn't using it after you ANALYZE
> the table, try rewriting the query using group by (select count(*)
> from (select userid from mytable group by userid) tmp). I recently had
> a similar performance issue on a 75m row table, and the above helped.
>
Thanks.  There's a lot of good info in your post.  I'll give it a try.

Re: Unacceptable postgres performance vs. Microsoft sqlserver

From: tosbalok@gmail.com
On Apr 14, 12:34 pm, Chris Browne <cbbro...@acm.org> wrote:
> If the entire purpose of your application is to run COUNT(*)

haha.  no.  I see your point.

I'll be doing statistical functions on group-by's.  So I'll go back
and give it another try with queries like those.  And I'll use some of
the techniques that other people here have suggested.  I hope this
works!
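For example, queries of that shape (hypothetical choice of aggregates, using the rating table from the original post) might look like:

```sql
-- Per-user rating statistics; GROUP BY lets the planner use a
-- hash aggregate once work_mem is raised, as discussed upthread.
SELECT userid,
       count(*)    AS num_ratings,
       avg(rating) AS mean_rating
FROM rating
GROUP BY userid;
```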

thanks.