Re: Unacceptable postgres performance vs. Microsoft sqlserver - Mailing list pgsql-general

From Gregory Stark
Subject Re: Unacceptable postgres performance vs. Microsoft sqlserver
Date
Msg-id 87r6d8faco.fsf@oxford.xeocode.com
Whole thread Raw
In response to Re: Unacceptable postgres performance vs. Microsoft sqlserver  (Ivan Sergio Borgonovo <mail@webthatworks.it>)
Responses Re: Unacceptable postgres performance vs. Microsoft sqlserver
List pgsql-general
"Ivan Sergio Borgonovo" <mail@webthatworks.it> writes:

> But why once you add the index and count distinct the performances
> are still so far?
> I'd say that counting in this case is not the hardest thing to do,
> but rather the "distinct" part.

Your tests have been a bit unfortunate in finding a few particularly soft bits
in Postgres's underbelly.

a) Postgres can't do "index-only" scans so the index doesn't really help in
   this case. There's some discussion about improving that soon but it's still
   early.

b) Postgres doesn't know how to pick just distinct values out of an index, it
   has to find a start point and read all the records from that point forward
   so even if this is a very low cardinality value it would have to read the
   whole index.

c) DISTINCT is one of the earlier features in Postgres and a lot planner code
   is a lot smarter. In particular it doesn't know about using hash tables to
   find distinct values. This you can work around by rewriting the query as
   GROUP BY goes through a more recent code path.

d) The default settings if you haven't tuned postgresql.conf are quite
   conservative. That alone often makes it look poor when compared against
   other databases. In particular it makes it unlikely to pick hash tables for
   things or do in-memory sorts unless you raise work_mem.



This is what you're doing now:

postgres=# explain select count(distinct aid) from accounts;
                              QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=2838.00..2838.01 rows=1 width=4)
   ->  Seq Scan on accounts  (cost=0.00..2588.00 rows=100000 width=4)
(2 rows)



This is what you're expecting it to do (on a freshly built clustered index it
might perform ok but on a non-clustered index it will be terrible):

postgres=# explain select count(*) from (select 1 from accounts group by aid) as a;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Aggregate  (cost=5475.26..5475.27 rows=1 width=0)
   ->  Group  (cost=0.00..4225.26 rows=100000 width=4)
         ->  Index Scan using accounts_pkey on accounts  (cost=0.00..3975.26 rows=100000 width=4)
(3 rows)

postgres=# set work_mem = '128M';
ERROR:  invalid value for parameter "work_mem": "128M"
HINT:  Valid units for this parameter are "kB", "MB", and "GB".

(ARGH! DAMN YOU PETER!!!!)



This is probably the best you can get Postgres to do currently:

postgres=# set work_mem = '128MB';
SET

postgres=# explain select count(*) from (select 1 from accounts group by aid) as a;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Aggregate  (cost=5088.00..5088.01 rows=1 width=0)
   ->  HashAggregate  (cost=2838.00..3838.00 rows=100000 width=4)
         ->  Seq Scan on accounts  (cost=0.00..2588.00 rows=100000 width=4)
(3 rows)


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

pgsql-general by date:

Previous
From: Andreas 'ads' Scherbaum
Date:
Subject: Re: Unacceptable postgres performance vs. Microsoft sqlserver
Next
From: Ivan Sergio Borgonovo
Date:
Subject: Re: Unacceptable postgres performance vs. Microsoft sqlserver