Thread: Any better plan for this query?..

From:
Dimitri
Date:

Hi,

any idea if there is a more optimal execution plan possible for this query:

select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as hbeg,
        H.END_DATE as hend, H.NOTE as hnote
         from HISTORY H, STAT S
         where S.REF = H.REF_STAT
         and H.REF_OBJECT = '0000000001'
         order by H.HORDER ;

EXPLAIN ANALYZE output on 8.4:
                                                                   QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=4549.75..4555.76 rows=2404 width=176) (actual
time=1.341..1.343 rows=20 loops=1)
   Sort Key: h.horder
   Sort Method:  quicksort  Memory: 30kB
   ->  Hash Join  (cost=33.50..4414.75 rows=2404 width=176) (actual
time=1.200..1.232 rows=20 loops=1)
         Hash Cond: (h.ref_stat = s.ref)
         ->  Index Scan using history_ref_idx on history h
(cost=0.00..4348.20 rows=2404 width=135) (actual time=0.042..0.052
rows=20 loops=1)
               Index Cond: (ref_object = '0000000001'::bpchar)
         ->  Hash  (cost=21.00..21.00 rows=1000 width=45) (actual
time=1.147..1.147 rows=1000 loops=1)
               ->  Seq Scan on stat s  (cost=0.00..21.00 rows=1000
width=45) (actual time=0.005..0.325 rows=1000 loops=1)
 Total runtime: 1.442 ms
(10 rows)

Table HISTORY contains 200M rows, only 20 needed
Table STAT contains 1000 rows, only 20 needed to be joined to HISTORY values.

Table definitions:
"""""""""""""""""""""""""""""""""""""""""""""""""""
create table STAT
(
    REF                 CHAR(3)            not null,
    NAME                CHAR(40)           not null,
    NUMB                INT                not null
);

create table HISTORY
(
    REF_OBJECT          CHAR(10)              not null,
    HORDER              INT                   not null,
    REF_STAT            CHAR(3)               not null,
    BEGIN_DATE          CHAR(12)              not null,
    END_DATE            CHAR(12)                      ,
    NOTE                CHAR(100)
);

create unique index stat_ref_idx on STAT( ref );
create index history_ref_idx on HISTORY( ref_object, horder );
"""""""""""""""""""""""""""""""""""""""""""""""""""

NOTE: The same query runs 2 times faster on MySQL.

Any idea?..

Rgds,
-Dimitri

From:
Craig Ringer
Date:

Dimitri wrote:
> Hi,
>
> any idea if there is a more optimal execution plan possible for this query:
>
> select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as hbeg,
>         H.END_DATE as hend, H.NOTE as hnote
>          from HISTORY H, STAT S
>          where S.REF = H.REF_STAT
>          and H.REF_OBJECT = '0000000001'
>          order by H.HORDER ;

OK, so you're taking a simple:

    history INNER JOIN stat ON (stat.ref = history.ref_stat)

then filtering for records with a particular value of history.ref_object
and finally performing a sort.

If I'm reading it right, the plan below does a sequential scan on the
`stat' table. The stat table only has 1000 rows, so this isn't
necessarily an unreasonable choice even if there is an appropriate index
and even if not many of the rows will be needed.

It then does an index scan of the history table looking for tuples with
ref_object = '0000000001' (text match). It hash joins the hashed results
of the initial seq scan to the results of the index scan, and sorts the
result.

To me, that looks pretty reasonable. You might be able to avoid the hash
join in favour of a nested loop scan of stat_ref_idx (looping over
records from history.ref_stat where ref_object = '0000000001') by
providing a composite index on HISTORY(ref_stat, ref_object). I'm really
not too sure, though; plan optimization isn't my thing, I'm just seeing
if I can offer a few ideas.
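
A sketch of that suggestion (the index name is made up, and whether the
planner actually switches to a nested loop would have to be verified with
EXPLAIN ANALYZE):

```sql
-- Composite index as suggested; column order follows the suggestion above
CREATE INDEX history_stat_obj_idx ON history (ref_stat, ref_object);

-- Re-check the plan afterwards
EXPLAIN ANALYZE
SELECT s.ref, s.name, h.horder, h.begin_date, h.end_date, h.note
  FROM history h
  JOIN stat s ON s.ref = h.ref_stat
 WHERE h.ref_object = '0000000001'
 ORDER BY h.horder;
```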

> Table definitions:

While not strictly necessary, it's a *REALLY* good idea to define a
suitable PRIMARY KEY.

Also, the `CHAR(n)' data type is evil. E.V.I.L. Use `varchar(n)' for
bounded-length values, or `text' for unbounded fields, unless you REALLY
want the crazy behaviour of `CHAR(n)'.

I'm a little bit puzzled about why you seem to be doing lots of things
with integer values stored in text strings, but that probably doesn't
matter too much for the issue at hand.
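
For illustration, the schema with those two suggestions applied might look
like this (the primary key choices are guesses based on the indexes shown,
not something stated in the original post):

```sql
CREATE TABLE stat
(
    ref   varchar(3)  NOT NULL PRIMARY KEY,  -- replaces unique index stat_ref_idx
    name  varchar(40) NOT NULL,
    numb  int         NOT NULL
);

CREATE TABLE history
(
    ref_object  varchar(10) NOT NULL,
    horder      int         NOT NULL,
    ref_stat    varchar(3)  NOT NULL REFERENCES stat (ref),
    begin_date  varchar(12) NOT NULL,
    end_date    varchar(12),
    note        text,
    PRIMARY KEY (ref_object, horder)         -- guessed from history_ref_idx
);
```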

> NOTE: The same query runs 2 times faster on MySQL.

With InnoDB tables and proper transactional safety? Or using scary
MyISAM tables and a "just pray" approach to data integrity? If you're
using MyISAM tables I'm not surprised; MySQL with MyISAM is stunningly
fast, but oh-my-god dangerous.

--
Craig Ringer

From:
Dimitri
Date:

Hi Craig,

yes, you described the problem very well! :-)
all those CHAR columns are there just due to historical issues :-) and
they may contain anything else, not only numbers, that's why..
Also, all the data inside are fixed-length, so VARCHAR will not save any
space. Or what kind of performance issue should we expect with CHAR vs
VARCHAR if all the data have a fixed length?..

Any way to force a nested loop without an additional index?..

It's 2 times faster on InnoDB, and as it's just a SELECT query there's
no need to go into transaction details :-)

Rgds,
-Dimitri

On 5/6/09, Craig Ringer <> wrote:
> [...]

From:
Heikki Linnakangas
Date:

Dimitri wrote:
> [...]
> Table HISTORY contains 200M rows, only 20 needed
> Table STAT contains 1000 rows, only 20 needed to be joined to HISTORY values.

The plan doesn't look too bad to me, although the planner is
over-estimating the number of matches in the history table (2404 vs 20).
That's a bit surprising given how simple the predicate is. Make sure
you've ANALYZEd the table. If that's not enough, you can try to increase
the statistics target for ref_object column, ie. ALTER TABLE history
ALTER COLUMN ref_object SET STATISTICS 500. That might give you a
different plan, maybe with a nested loop join instead of hash join,
which might be faster in this case.
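
In SQL, the sequence described above would be (500 is just the example
target; ANALYZE has to be re-run so the new target takes effect):

```sql
ANALYZE history;                  -- refresh statistics first

ALTER TABLE history
    ALTER COLUMN ref_object SET STATISTICS 500;
ANALYZE history;                  -- collect stats at the new target

-- then re-run the EXPLAIN ANALYZE from the original post
```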

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

From:
Chris
Date:

Dimitri wrote:
> Hi Craig,
>
> yes, you described the problem very well! :-)
> all those CHAR columns are there just due to historical issues :-) and
> they may contain anything else, not only numbers, that's why..
> Also, all the data inside are fixed-length, so VARCHAR will not save any
> space. Or what kind of performance issue should we expect with CHAR vs
> VARCHAR if all the data have a fixed length?..

None in postgres, but the char/varchar thing may or may not bite you at
some point later - sounds like you have it covered though.

> It's 2 times faster on InnoDB, and as it's just a SELECT query no need
> to go in transaction details :-)

  Total runtime: 1.442 ms
(10 rows)

You posted a query that takes less than 2/1000ths of a second. I don't
really see a performance problem here :)

--
Postgresql & php tutorials
http://www.designmagick.com/


From:
Dimitri
Date:

Hi Heikki,

I've already tried a target of 1000, and the only thing it changes
compared to the current 100 (the default) is that instead of 2404 rows
it says 240 rows, but the plan remains the same..

Rgds,
-Dimitri

On 5/6/09, Heikki Linnakangas <> wrote:
> [...]

From:
Dimitri
Date:

Hi Chris,

the only problem I see here is that it's 2 times slower than InnoDB, so
before I tell myself it's OK I want to be sure there is nothing else
to do.. :-)

Rgds,
-Dimitri


On 5/6/09, Chris <> wrote:
> [...]

From:
Dimitri
Date:

Hi Richard,

no, of course it's not based on explain :-)
I've run several tests before, and now I'm going in depth to make sure
there is nothing wrong. Due to this single-query time difference, InnoDB
is reaching a 2-3 times better TPS level compared to PostgreSQL..

Rgds,
-Dimitri


On 5/6/09, Richard Huxton <> wrote:
> [...]

From:
Richard Huxton
Date:

Dimitri wrote:
> Hi Chris,
>
> the only problem I see here is it's 2 times slower vs InnoDB

How do you know? This isn't just based on the explain values reported,
is it?

--
   Richard Huxton
   Archonet Ltd

From:
Richard Huxton
Date:

Dimitri wrote:
> Hi Richard,
>
> no, of course it's not based on explain :-)
> I've run several tests before, and now I'm going in depth to make sure
> there is nothing wrong. Due to this single-query time difference, InnoDB
> is reaching a 2-3 times better TPS level compared to PostgreSQL..

And you are satisfied that it is the planned query time that is the
dominant factor here, and not parsing time, connection time, data
transport, disk bandwidth etc?

--
   Richard Huxton
   Archonet Ltd

From:
Matthew Wakeling
Date:

On Wed, 6 May 2009, Heikki Linnakangas wrote:
>>  Total runtime: 1.442 ms

It's pretty clear that this query isn't even going to disc - it's all CPU
time. That can be the case if you run the exact same query more than once,
and it can cause your EXPLAIN output to be vastly different from your real
use case. Do the queries on the live system hit the disc at all?

> The plan doesn't look too bad to me, although the planner is over-estimating
> the number of matches in the history table (2404 vs 20). That's a bit
> surprising given how simple the predicate is. Make sure you've ANALYZEd the
> table. If that's not enough, you can try to increase the statistics target
> for ref_object column, ie. ALTER TABLE history ALTER COLUMN ref_object SET
> STATISTICS 500.

I would have thought this would actually make it slower, by increasing the
time taken to plan. On such small queries, the planner overhead must be
quite significant.

Matthew

--
 Q: What's the difference between ignorance and apathy?
 A: I don't know, and I don't care.

From:
Merlin Moncure
Date:

On Wed, May 6, 2009 at 3:38 AM, Dimitri <> wrote:
> Hi,
>
> any idea if there is a more optimal execution plan possible for this query:
>
> [...]
>         ->  Seq Scan on stat s  (cost=0.00..21.00 rows=1000
> width=45) (actual time=0.005..0.325 rows=1000 loops=1)
>  Total runtime: 1.442 ms
> [...]
>
> NOTE: The same query runs 2 times faster on MySQL.

couple of things to try:
*) as others have noted, get rid of char() columns. use varchar, or
int if you can.  this is a bigger deal in postgres than mysql.
*) curious if disabling sequential scan helps (set enable_seqscan =
false) or changes the plan.  .3 msec is spent on seq scan and an index
lookup is likely much faster.
*) prepare the query:

prepare history_stat(char(10) as
  select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as hbeg,
       H.END_DATE as hend, H.NOTE as hnote
        from HISTORY H, STAT S
        where S.REF = H.REF_STAT
        and H.REF_OBJECT = $1
        order by H.HORDER ;

execute history_stat('0000000001');

(prepared queries have some annoyances you  need to be prepared to
deal with. however, they are quite useful when squeezing every last
msec out of fast queries).
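
The seq-scan experiment from the second suggestion can be tried per
session; a sketch (the setting only discourages seq scans, it does not
forbid them):

```sql
SET enable_seqscan = false;   -- strongly discourage seq scans in this session

EXPLAIN ANALYZE
SELECT s.ref, s.name, h.horder, h.begin_date, h.end_date, h.note
  FROM history h, stat s
 WHERE s.ref = h.ref_stat
   AND h.ref_object = '0000000001'
 ORDER BY h.horder;

RESET enable_seqscan;         -- back to the default for this session
```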

merlin

From:
Merlin Moncure
Date:

On Wed, May 6, 2009 at 7:46 AM, Merlin Moncure <> wrote:
> prepare history_stat(char(10) as

typo:
prepare history_stat(char(10)) as

From:
"Albe Laurenz"
Date:

Dimitri wrote:
> I've run several tests before and now going in depth to understand if
> there is nothing wrong. Due such a single query time difference InnoDB
> is doing 2-3 times better TPS level comparing to PostgreSQL..

Why don't you use MySQL then?
Or tune PostgreSQL?

Yours,
Laurenz Albe

From:
Dimitri
Date:

I'll try to answer all mails at once :-))

- query is running fully in RAM, no I/O, no network, only CPU time

- looping 100 times the same query gives 132ms total time (~1.32ms per
query), while it's 44ms on InnoDB (~0.44ms per query)

- disabling seq scan forces the planner to use an index scan, but it's
actually worse, giving 1.53ms per query..

- preparing the query helps: the prepare statement takes 16ms, but execute
runs in 0.98ms, which makes me think it's not only planner
overhead... And it's still 2 times slower than 0.44ms.
Also, in general prepare cannot be used in this test case, as we assume
any query may be of any kind (even if it's not always true :-))

- char or varchar should be used here because the reference code is
supposed to accept any characters (alphanumeric)

- it also reminds me that probably there is some extra CPU time due to
locale settings - but all my "lc_*" variables are set to "C"...
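
For reference, the 100-iteration loop measurement can be scripted; here is
a minimal sketch using Python's built-in sqlite3 purely as a stand-in
engine (the schema and data are simplified inventions; with PostgreSQL you
would drive a real server through a driver such as psycopg2 the same way):

```python
import sqlite3
import time

# Stand-in database: the measurement method is the point, not the engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stat (ref TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE history (ref_object TEXT, horder INT, ref_stat TEXT)")
conn.execute("INSERT INTO stat VALUES ('001', 'some name')")
conn.executemany("INSERT INTO history VALUES ('0000000001', ?, '001')",
                 [(i,) for i in range(20)])

QUERY = """SELECT s.ref, s.name, h.horder
             FROM history h JOIN stat s ON s.ref = h.ref_stat
            WHERE h.ref_object = ?
            ORDER BY h.horder"""

n = 100
start = time.perf_counter()
for _ in range(n):
    rows = conn.execute(QUERY, ("0000000001",)).fetchall()
elapsed = time.perf_counter() - start
print(f"{n} runs, {elapsed * 1000 / n:.3f} ms per query, {len(rows)} rows each")
```

Dividing total wall-clock time by the iteration count, as above, is what
gives the per-query figures quoted in this thread.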

Rgds,
-Dimitri


On 5/6/09, Merlin Moncure <> wrote:
> [...]

From:
Dimitri
Date:

The story is simple: for the launch of MySQL 5.4 I did some testing
comparing the InnoDB engine variants available at that time, and at the
end, out of curiosity, I started the same test with PostgreSQL 8.3.7 to
see if MySQL's performance level is now closer to PostgreSQL's (PG was a
clear winner before). To my big surprise, MySQL 5.4 outperformed 8.3.7...
However, analyzing the PostgreSQL processing I got a feeling that
something goes wrong on the PG side.. So now I've installed both 8.3.7
and 8.4beta1 to see in more depth what's going on. Currently 8.4 performs
much better than 8.3.7, but there is still room for improvement if such a
small query can go faster :-)

Rgds,
-Dimitri

On 5/6/09, Albe Laurenz <> wrote:
> [...]

From:
Richard Huxton
Date:

Dimitri wrote:
> I'll try to answer all mails at once :-))
>
> - query is running fully in RAM, no I/O, no network, only CPU time
>
> - looping 100 times the same query gives 132ms total time (~1.32ms per
> query), while it's 44ms on InnoDB (~0.44ms per query)

Well, assuming you're happy that PG is tuned reasonably for your machine,
and that MySQL's query cache isn't returning the results here, it looks
like MySQL is faster for this particular query.

The only obvious place there could be a big gain is with the hashing
algorithm. If you remove the ORDER BY and the query-time doesn't fall by
much then it's the hash phase.

The other thing to try is to alter the query to be a SELECT count(*)
rather than returning rows - that will let you measure the time to
transfer the result rows.
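
Concretely, the two diagnostic variants would be:

```sql
-- Variant 1: no ORDER BY, to isolate the cost of the sort step
SELECT s.ref AS stref, s.name AS stnm, h.horder AS hord
  FROM history h, stat s
 WHERE s.ref = h.ref_stat
   AND h.ref_object = '0000000001';

-- Variant 2: count(*) only, to isolate the cost of forming and
-- transferring the result rows
SELECT count(*)
  FROM history h, stat s
 WHERE s.ref = h.ref_stat
   AND h.ref_object = '0000000001';
```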

--
   Richard Huxton
   Archonet Ltd

From:
Kenneth Marshall
Date:

On Wed, May 06, 2009 at 04:01:03PM +0800, Craig Ringer wrote:
> [...]
I just thought I would ask: are you using the query cache in MySQL?
If that is on, that could be the difference. Another thing to check:
try issuing the selects concurrently (2 at a time, 5 at a time, 10
at a time...) and see if that has an effect on timing. In many
benchmarks, MySQL will outperform PostgreSQL for very low numbers of
clients; once you are using more than a handful, PostgreSQL pulls
ahead. Also, is this a completely static table, i.e. no updates or
inserts? How is the performance when those are happening? This should
help you get a clearer picture of the performance.

My two cents.
Ken

From:
Kenneth Marshall
Date:

On Wed, May 06, 2009 at 02:49:23PM +0200, Dimitri wrote:
> [...]

Another thought: have you tuned PostgreSQL for an in-memory database?
Those tuning options may be what is needed to improve the plan chosen
by PostgreSQL.
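
For example (illustrative values only, not recommendations; the right
effective_cache_size in particular depends on the machine's RAM):

```sql
-- Planner cost settings for a dataset that is fully cached in RAM
SET seq_page_cost = 1.0;
SET random_page_cost = 1.0;        -- no seek penalty if nothing hits disk
SET effective_cache_size = '8GB';  -- tell the planner how much can be cached
```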

Cheers,
Ken

From:
Ries van Twisk
Date:

On May 6, 2009, at 7:53 AM, Richard Huxton wrote:

> [...]
>
> The other thing to try is to alter the query to be a SELECT count(*)
> rather than returning rows - that will let you measure the time to
> transfer the result rows.


Do you expect to run this query 100 times per second in your
application? Or is this just a test to see how fast the query is, for
optimization?

I'm always a bit wary of such "everything runs from memory" tests,
because that is usually not really the case in a production
environment.

Try to make a little test case where you give the query random
parameters, so different result sets are returned. That will give you a
better idea of how fast the query really is, and might give you better
comparison results.

Instead of count(*), I usually do EXPLAIN ANALYZE to see how fast
PostgreSQL handles the query.

Ries



From:
Dimitri
Date:

Folks, first of all:

  - I used a fixed reference value just to simplify analyzing the case
and isolate it as much as possible; of course during my tests all
values are random :-)

- the final goal of the test is to analyze scalability, so yes, concurrent
sessions with random keys are growing from 1 to 256 (I run it on a
32-core server, no think time, just stressing), and the result is
still not better compared to InnoDB

- I'm analyzing this query running in memory to understand what's
blocking, while all the main bottlenecks are avoided (no I/O anymore,
no network, etc.)

- initial explain analyze and table details were posted in the first message


Now, let's go further:

 - so "as is" query execution took 1.50ms

 - after removing "order by" it took 1.19ms

 - select count(*) instead of the columns, with "order by" removed, took 0.98ms

- executing the same prepared "select count(*) ..." took 0.68ms

So, where the time is going?...

Rgds,
-Dimitri


On 5/6/09, Ries van Twisk <> wrote:
> [...]

From:
Craig Ringer
Date:

Dimitri wrote:
> Hi Chris,
>
> the only problem I see here is it's 2 times slower vs InnoDB, so
> before I'll say myself it's ok I want to be sure there is nothing else
> to do.. :-)

Can the genetic query optimizer come into play on small queries?

--
Craig Ringer

From:
Kenneth Marshall
Date:

No.

Ken
On Wed, May 06, 2009 at 10:04:33PM +0800, Craig Ringer wrote:
> Dimitri wrote:
> > Hi Chris,
> >
> > the only problem I see here is it's 2 times slower vs InnoDB, so
> > before I'll say myself it's ok I want to be sure there is nothing else
> > to do.. :-)
>
> Can the genetic query optimizer come into play on small queries?
>
> --
> Craig Ringer
>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

From:
Matthew Wakeling
Date:

> On Wed, May 06, 2009 at 10:04:33PM +0800, Craig Ringer wrote:
>> Can the genetic query optimizer come into play on small queries?

On Wed, 6 May 2009, Kenneth Marshall wrote:
> No.

Yes. But you would have had to set some really weird configuration.

Matthew

--
 And the lexer will say "Oh look, there's a null string. Oooh, there's
 another. And another.", and will fall over spectacularly when it realises
 there are actually rather a lot.
         - Computer Science Lecturer (edited)

From:
Dimitri
Date:

I assumed that with prepare-then-execute the query optimizer no longer
comes into play during the "execute" phase, or did I miss something?..

I forgot to say: the query cache is disabled on the MySQL side.

Rgds,
-Dimitri

On 5/6/09, Craig Ringer <> wrote:
> Dimitri wrote:
>> Hi Chris,
>>
>> the only problem I see here is it's 2 times slower vs InnoDB, so
>> before I'll say myself it's ok I want to be sure there is nothing else
>> to do.. :-)
>
> Can the genetic query optimizer come into play on small queries?
>
> --
> Craig Ringer
>

From:
Kenneth Marshall
Date:

On Wed, May 06, 2009 at 09:38:59AM +0200, Dimitri wrote:
> Hi,
>
> any idea if there is a more optimal execution plan possible for this query:
>
> select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as hbeg,
>         H.END_DATE as hend, H.NOTE as hnote
>          from HISTORY H, STAT S
>          where S.REF = H.REF_STAT
>          and H.REF_OBJECT = '0000000001'
>          order by H.HORDER ;
>
> EXPLAIN ANALYZE output on 8.4:
>                                                                    QUERY PLAN
>
------------------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=4549.75..4555.76 rows=2404 width=176) (actual
> time=1.341..1.343 rows=20 loops=1)
>    Sort Key: h.horder
>    Sort Method:  quicksort  Memory: 30kB
>    ->  Hash Join  (cost=33.50..4414.75 rows=2404 width=176) (actual
> time=1.200..1.232 rows=20 loops=1)
>          Hash Cond: (h.ref_stat = s.ref)
>          ->  Index Scan using history_ref_idx on history h
> (cost=0.00..4348.20 rows=2404 width=135) (actual time=0.042..0.052
> rows=20 loops=1)
>                Index Cond: (ref_object = '0000000001'::bpchar)
>          ->  Hash  (cost=21.00..21.00 rows=1000 width=45) (actual
> time=1.147..1.147 rows=1000 loops=1)
>                ->  Seq Scan on stat s  (cost=0.00..21.00 rows=1000
> width=45) (actual time=0.005..0.325 rows=1000 loops=1)
>  Total runtime: 1.442 ms
> (10 rows)
>
> Table HISTORY contains 200M rows, only 20 needed
> Table STAT contains 1000 rows, only 20 needed to be joined to HISTORY values.
>
> Table definitions:
> """""""""""""""""""""""""""""""""""""""""""""""""""
> create table STAT
> (
>     REF                 CHAR(3)            not null,
>     NAME                CHAR(40)           not null,
>     NUMB                INT                not null
> );
>
> create table HISTORY
> (
>     REF_OBJECT          CHAR(10)              not null,
>     HORDER              INT                   not null,
>     REF_STAT            CHAR(3)               not null,
>     BEGIN_DATE          CHAR(12)              not null,
>     END_DATE            CHAR(12)                      ,
>     NOTE                CHAR(100)
> );
>
> create unique index stat_ref_idx on STAT( ref );
> create index history_ref_idx on HISTORY( ref_object, horder );
> """""""""""""""""""""""""""""""""""""""""""""""""""
>
> NOTE: The same query runs 2 times faster on MySQL.
>
> Any idea?..
>
> Rgds,
> -Dimitri
>
Dimitri,

Is there any chance of profiling the postgres backend to see
where the time is used?

Just an idea,
Ken

From:
Dimitri
Date:

Hi Ken,

yes, I can do that, but I did not expect to get into profiling initially :-)
I expected there was just something trivial within the plan that I just
didn't know about.. :-)

BTW, is there already a profiler integrated in the code, or do I
need external tools?..

Rgds,
-Dimitri

On 5/6/09, Kenneth Marshall <> wrote:
> On Wed, May 06, 2009 at 09:38:59AM +0200, Dimitri wrote:
>> [original query, EXPLAIN ANALYZE output and table definitions snipped]
> Dimitri,
>
> Is there any chance of profiling the postgres backend to see
> where the time is used?
>
> Just an idea,
> Ken
>

From:
Kenneth Marshall
Date:

On Wed, May 06, 2009 at 04:48:21PM +0200, Dimitri wrote:
> Hi Ken,
>
> yes, I may do it, but I did not expect to come into profiling initially :-)
> I expected there is just something trivial within a plan that I just
> don't know.. :-)
>
> BTW, is there already an integrated profiled within a code? or do I
> need external tools?..
>
> Rgds,
> -Dimitri

I only suggested it because it might have the effect of changing
the sequential scan on the stat table to an indexed scan.

Cheers,
Ken
>
> [rest of the quoted exchange snipped]

From:
Simon Riggs
Date:

On Wed, 2009-05-06 at 10:31 +0200, Dimitri wrote:

> I've already tried a target 1000 and the only thing it changes
> comparing to the current 100 (default) is instead of 2404 rows it says
> 240 rows, but the plan remaining the same..

Try both of these things
* REINDEX on the index being used in the query, then re-EXPLAIN
* enable_hashjoin = off, then re-EXPLAIN

You should first attempt to get the same plan, then confirm it really is
faster before we worry why the optimizer hadn't picked that plan.
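A sketch of the two experiments suggested above, run session-locally so nothing else is affected (index and query names are from the thread):

```sql
-- experiment 1: rebuild the index the query uses, then re-check the plan
REINDEX INDEX history_ref_idx;
EXPLAIN ANALYZE
SELECT s.ref, s.name, h.horder, h.begin_date, h.end_date, h.note
  FROM history h, stat s
 WHERE s.ref = h.ref_stat
   AND h.ref_object = '0000000001'
 ORDER BY h.horder;

-- experiment 2: disable hash joins for this session only, then re-explain
SET enable_hashjoin = off;
-- (repeat the EXPLAIN ANALYZE above)
RESET enable_hashjoin;   -- restore the default afterwards
```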

We already know that MySQL favors nested loop joins, so turning up a
plan that on this occasion is actually better that way is in no way
representative of general performance. Does MySQL support hash joins?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Dimitri
Date:

Hi Simon,

could you explain why REINDEX might help here?.. - the database was just
created, the data loaded, and then the indexes were created + analyzed..
What could change here after a REINDEX?..

Disabling hashjoin was a good try!
Running this query "as is" we moved from 1.50ms down to 0.84ms,
and the plan is here:

                                                                      QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=4562.83..4568.66 rows=2329 width=176) (actual
time=0.225..0.229 rows=20 loops=1)
   Sort Key: h.horder
   Sort Method:  quicksort  Memory: 30kB
   ->  Merge Join  (cost=4345.89..4432.58 rows=2329 width=176) (actual
time=0.056..0.205 rows=20 loops=1)
         Merge Cond: (s.ref = h.ref_stat)
         ->  Index Scan using stat_ref_idx on stat s
(cost=0.00..49.25 rows=1000 width=45) (actual time=0.012..0.079
rows=193 loops=1)
         ->  Sort  (cost=4345.89..4351.72 rows=2329 width=135) (actual
time=0.041..0.043 rows=20 loops=1)
               Sort Key: h.ref_stat
               Sort Method:  quicksort  Memory: 30kB
               ->  Index Scan using history_ref_idx on history h
(cost=0.00..4215.64 rows=2329 width=135) (actual time=0.013..0.024
rows=20 loops=1)
                     Index Cond: (ref_object = '0000000001'::bpchar)
 Total runtime: 0.261 ms
(12 rows)

Curiously, the planner expects to run it in 0.26ms.

Any idea why the planner is not choosing this plan from the beginning?..
Is there any way to keep this plan without disabling hashjoin globally or
per session?..

Rgds,
-Dimitri


On 5/6/09, Simon Riggs <> wrote:
>
> On Wed, 2009-05-06 at 10:31 +0200, Dimitri wrote:
>
>> I've already tried a target 1000 and the only thing it changes
>> comparing to the current 100 (default) is instead of 2404 rows it says
>> 240 rows, but the plan remaining the same..
>
> Try both of these things
> * REINDEX on the index being used in the query, then re-EXPLAIN
> * enable_hashjoin = off, then re-EXPLAIN
>
> You should first attempt to get the same plan, then confirm it really is
> faster before we worry why the optimizer hadn't picked that plan.
>
> We already know that MySQL favors nested loop joins, so turning up a
> plan that on this occasion is actually better that way is in no way
> representative of general performance. Does MySQL support hash joins?
>
> --
>  Simon Riggs           www.2ndQuadrant.com
>  PostgreSQL Training, Services and Support
>
>

From:
Dimitri
Date:

The problem with "gprof" is that it profiles everything from start
to finish, so a lot of function executions unrelated to this query
get profiled too...

Also, if you look at profiling technology in general, all solutions of
this kind are based on the system clock frequency and have limits on
their time resolution. On my system this limit is 0.5ms, which is too
coarse compared to the query execution time :-)

So, what I've done: I changed the reference key criteria slightly, from
= '0000000001' to < '0000000051', so instead of 20 rows I now have 1000
rows in the output. It's still slower than InnoDB (12ms vs 9ms), but at
least it can be profiled (well, we are also probably moving away from
the original problem, as the time may now be spent mostly on output
traffic, but I tried anyway). I made a loop of 100 iterations of this
query which reads but does not print the data. The total execution time
of this loop is 1200ms, and curiously it did not really change under
profiling. The profiler was able to catch 733ms of the total execution
time (if I understand correctly, all functions running faster than
0.5ms remain un-profiled). The top of the profiler output is here:

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
0.733     0.733      <Total>
0.103     0.103      memcpy
0.045     0.045      slot_deform_tuple
0.037     0.040      AllocSetAlloc
0.021     0.021      AllocSetFree
0.018     0.037      pfree
0.018     0.059      appendBinaryStringInfo
0.017     0.031      heap_fill_tuple
0.017     0.017      _ndoprnt
0.016     0.016      nocachegetattr
0.015     0.065      heap_form_minimal_tuple
0.015     0.382      ExecProcNode
0.015     0.015      strlen
0.014     0.037      ExecScanHashBucket
0.014     0.299      printtup
0.013     0.272      ExecHashJoin
0.011     0.011      enlargeStringInfo
0.011     0.086      index_getnext
0.010     0.010      hash_any
0.009     0.076      FunctionCall1
0.009     0.037      MemoryContextAlloc
0.008     0.008      LWLockAcquire
0.007     0.069      pq_sendcountedtext
0.007     0.035      ExecProject
0.007     0.127      ExecScan
...

Curiously, "memcpy" is at the top. I don't know how much impact it has
in most cases, but it probably makes sense to see if it can be optimized, etc..

Rgds,
-Dimitri



On 5/7/09, Euler Taveira de Oliveira <> wrote:
> Dimitri escreveu:
>> BTW, is there already an integrated profiled within a code? or do I
>> need external tools?..
>>
> Postgres provides support for profiling. Add --enable-profiling flag. Use
> gprof to get the profile.
>
>
> --
>   Euler Taveira de Oliveira
>   http://www.timbira.com/
>

From:
Merlin Moncure
Date:

On Thu, May 7, 2009 at 4:20 AM, Dimitri <> wrote:
> Hi Simon,
>
> may you explain why REINDEX may help here?.. - database was just
> created, data loaded, and then indexes were created + analyzed.. What
> may change here after REINDEX?..
>
> With hashjoin disabled was a good try!
> Running this query "as it" from 1.50ms we move to 0.84ms now,
> and the plan is here:
>
>                                                                      QUERY PLAN
>
------------------------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=4562.83..4568.66 rows=2329 width=176) (actual
> time=0.225..0.229 rows=20 loops=1)
>   Sort Key: h.horder
>   Sort Method:  quicksort  Memory: 30kB
>   ->  Merge Join  (cost=4345.89..4432.58 rows=2329 width=176) (actual
> time=0.056..0.205 rows=20 loops=1)
>         Merge Cond: (s.ref = h.ref_stat)
>         ->  Index Scan using stat_ref_idx on stat s
> (cost=0.00..49.25 rows=1000 width=45) (actual time=0.012..0.079
> rows=193 loops=1)
>         ->  Sort  (cost=4345.89..4351.72 rows=2329 width=135) (actual
> time=0.041..0.043 rows=20 loops=1)
>               Sort Key: h.ref_stat
>               Sort Method:  quicksort  Memory: 30kB
>               ->  Index Scan using history_ref_idx on history h
> (cost=0.00..4215.64 rows=2329 width=135) (actual time=0.013..0.024
> rows=20 loops=1)
>                     Index Cond: (ref_object = '0000000001'::bpchar)
>  Total runtime: 0.261 ms
> (12 rows)
>
> Curiously planner expect to run it in 0.26ms
>
> Any idea why planner is not choosing this plan from the beginning?..
> Any way to keep this plan without having a global or per sessions
> hashjoin disabled?..

can you work prepared statements into your app?  turn off hash join,
prepare the query, then turn it back on.

merlin
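Merlin's trick can be sketched like this (statement name and the parameterization are illustrative; it relies on the fact that, in these PostgreSQL versions, a SQL-level PREPARE plans the statement once, so the plan chosen while hash joins were disabled survives re-enabling them):

```sql
SET enable_hashjoin = off;        -- steer the planner away from the hash join
PREPARE hist_q (char(10)) AS
  SELECT s.ref AS stref, s.name AS stnm, h.horder AS hord,
         h.begin_date AS hbeg, h.end_date AS hend, h.note AS hnote
    FROM history h, stat s
   WHERE s.ref = h.ref_stat
     AND h.ref_object = $1
   ORDER BY h.horder;
SET enable_hashjoin = on;         -- restore the default for everything else

EXECUTE hist_q('0000000001');     -- runs with the plan captured at PREPARE time
```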

From:
Gregory Stark
Date:

Simon Riggs <> writes:

> We already know that MySQL favors nested loop joins

From what I read I thought that was the *only* type of join MySQL supports.

The big picture view here is that whether we run a short query in half a
millisecond versus two milliseconds is usually not really important. It could
matter if you're concerned with how many transactions/s you can run in a busy
server -- but that's not exactly the same thing and you should really measure
that in that case.

It would be nice if we were in the same ballpark as MySQL, but we would
only be interested in such optimizations if they don't come at the
expense of scalability under more complex workloads.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

From:
Simon Riggs
Date:

On Thu, 2009-05-07 at 12:58 +0100, Gregory Stark wrote:

> It would be nice if we were in the same ballpark as MySQL but we would only be
> interesting in such optimizations if they don't come at the expense of
> scalability under more complex workloads.

It doesn't appear there is a scalability issue here at all.

Postgres can clearly do the same query in about the same time.

We just have a case where MySQL happens to optimise it well and Postgres
doesn't. Since we can trivially design cases that show the opposite I'm
not worried too much.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Alvaro Herrera
Date:

Dimitri escribió:

> As well when you look on profiling technology - all such kind of
> solutions are based on the system clock frequency and have their
> limits on time resolution. On my system this limit is 0.5ms, and it's
> too big comparing to the query execution time :-)
>
> So, what I've done - I changed little bit a reference key criteria from
> = '0000000001' to < '0000000051', so instead of 20 rows I have 1000
> rows on output now,

Another thing you can try is run the query several times (like 10000 or so).

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:
Dimitri
Date:

I simply restarted a full test with hashjoin OFF. Up to 32 concurrent
users things go well. Then from 32 users on, the response time jumps to
20ms, with 64 users it's higher again, and with 256 users it reaches
700ms, so TPS drops from 5,000 to ~200..

With hashjoin ON this doesn't happen, and I'm reaching at least 11,000
TPS on 32 fully busy cores.

I should not use prepare/execute, as the test conditions should remain "generic".

About the scalability issue - there is one on 8.3.7, because on 32 cores
with this kind of load it uses only 50% CPU and doesn't exceed 6,000
TPS, while 8.4 uses 90% CPU and reaches 11,000 TPS..

At the same time, comparing 8.3 and 8.4 - the response time is 2 times
lower on 8.4, and it seems to me that the main gain in 8.4 is here.

I'll publish all the details, I just need some time :-)

Rgds,
-Dimitri

On 5/7/09, Merlin Moncure <> wrote:
> On Thu, May 7, 2009 at 4:20 AM, Dimitri <> wrote:
>> [earlier message with the merge-join plan snipped]
>
> can you work prepared statements into your app?  turn off hash join,
> prepare the query, then turn it back on.
>
> merlin
>

From:
Simon Riggs
Date:

On Thu, 2009-05-07 at 20:36 +0200, Dimitri wrote:

> I've simply restarted a full test with hashjoin OFF. Until 32
> concurrent users things are going well. Then since 32 users response
> time is jumping to 20ms, with 64 users it's higher again, and with 256
> users reaching 700ms, so TPS is dropping from 5.000 to ~200..
>
> With hashjoin ON it's not happening, and I'm reaching at least 11.000
> TPS on fully busy 32 cores.

Much better to stick to the defaults.

Sounds like a problem worth investigating further, but not pro bono.

> About scalability issue - there is one on 8.3.7, because on 32 cores
> with such kind of load it's using only 50% CPU and not outpassing
> 6.000 TPS, while 8.4 uses 90% CPU and reaching 11.000 TPS..

Yeh, small changes make a big difference. Thanks for the info.

How does MySQL perform?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Dimitri
Date:

Folks, it's completely crazy, but here is what I found:

- if the HISTORY table is analyzed with target 1000, my SELECT response
time jumps to 3ms, and the max throughput is limited to 6,000 TPS
(this is what happened to 8.3.7)

- if the HISTORY table is analyzed with target 5, my SELECT response
time decreases to 1.2ms (!) and my max TPS level is then ~12,000!
and CPU usage goes up to 95% even on 8.3.7 :-)  8.4 performed better
just because I had left its analyze target at the default value of 100.

Can anyone explain to me why the analyze target can have such a huge
negative side effect?..

Next point: SCALABILITY ISSUE

Now both 8.3.7 and 8.4 have similar performance levels, though 8.3.7 is
always slightly better than 8.4. The problem I have:
   - on 8 cores: ~5,000 TPS / 5,500 MAX
   - on 16 cores: ~10,000 TPS / 11,000 MAX
   - on 32 cores: ~10,500 TPS / 11,500 MAX

What else may limit concurrent SELECTs here?..

I forgot to say: MySQL reaches 17,500 TPS here.

Rgds,
-Dimitri

On 5/7/09, Simon Riggs <> wrote:
>
> On Thu, 2009-05-07 at 20:36 +0200, Dimitri wrote:
>
>> I've simply restarted a full test with hashjoin OFF. Until 32
>> concurrent users things are going well. Then since 32 users response
>> time is jumping to 20ms, with 64 users it's higher again, and with 256
>> users reaching 700ms, so TPS is dropping from 5.000 to ~200..
>>
>> With hashjoin ON it's not happening, and I'm reaching at least 11.000
>> TPS on fully busy 32 cores.
>
> Much better to stick to the defaults.
>
> Sounds like a problem worth investigating further, but not pro bono.
>
>> About scalability issue - there is one on 8.3.7, because on 32 cores
>> with such kind of load it's using only 50% CPU and not outpassing
>> 6.000 TPS, while 8.4 uses 90% CPU and reaching 11.000 TPS..
>
> Yeh, small changes make a big difference. Thanks for the info.
>
> How does MySQL perform?
>
> --
>  Simon Riggs           www.2ndQuadrant.com
>  PostgreSQL Training, Services and Support
>
>

From:
Tom Lane
Date:

Dimitri <> writes:
> Anyone may explain me why analyze target may have so huge negative
> secondary effect?..

If these are simple queries, maybe what you're looking at is the
increase in planning time caused by having to process 10x as much
statistical data.  Cranking statistics_target to the max just because
you can is not necessarily a good strategy.

            regards, tom lane
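Since the target can also be set per column, one option (a sketch, not something proposed in the thread) is to keep the global default and override it only for the column whose statistics dominate the planning cost:

```sql
-- per-column override instead of changing default_statistics_target globally
ALTER TABLE history ALTER COLUMN ref_object SET STATISTICS 5;
ANALYZE history;   -- re-collect stats so the new target takes effect
```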

From:
Dimitri
Date:

Hi Tom,

it was not deliberate :-)
it just stayed that way after various probes with the query plan.

Anyway, on 8.4 the default target is 100, and just by moving it to 5 I
reached 10,500 TPS on 16 cores instead of 8,000 initially. And I think
you have a good reason to keep it at 100 by default, don't you?
;-)

And what about the scalability on 32 cores?..
Any idea?

Rgds,
-Dimitri

On 5/11/09, Tom Lane <> wrote:
> Dimitri <> writes:
>> Anyone may explain me why analyze target may have so huge negative
>> secondary effect?..
>
> If these are simple queries, maybe what you're looking at is the
> increase in planning time caused by having to process 10x as much
> statistical data.  Cranking statistics_target to the max just because
> you can is not necessarily a good strategy.
>
>             regards, tom lane
>

From:
Merlin Moncure
Date:

On Mon, May 11, 2009 at 11:18 AM, Dimitri <> wrote:
> Folks, it's completely crazy, but here is what I found:
>
> - if HISTORY table is analyzed with target 1000 my SELECT response
> time is jumping to 3ms, and the max throughput is limited to 6.000 TPS
> (it's what happenned to 8.3.7)
>
> -if HISTORY table is analyzed with target 5 - my SELECT response time
> is decreasing to 1.2ms (!)  and then my max TPS level is ~12.000 !
> and CPU is used up to 95% even by 8.3.7 :-)  and 8.4 performed better
> just because I left its analyze target to default 100 value.
>
> Anyone may explain me why analyze target may have so huge negative
> secondary effect?..
>
> Next point: SCALABILITY ISSUE
>
> Now both 8.3.7 and 8.4 have similar performance levels, but 8.3.7 is
> always slightly better comparing to 8.4, but well. The problem I have:
>   - on 8 cores: ~5.000 TPS  / 5.500 MAX
>   - on 16 cores: ~10.000 TPS / 11.000 MAX
>   - on  32 cores: ~10.500 TPS  / 11.500 MAX
>
> What else may limit concurrent SELECTs here?..
>
> Yes, forget, MySQL is reaching 17.500 TPS here.

why aren't you preparing the query?  mysql uses a simple rule-based
planner while postgresql has a statistics-based planner.  Our planner
has all kinds of advantages in various scenarios, but this is
compensated by slightly longer planning time in some cases.  OTOH, you
have prepared queries to compensate for this.  (mysql also has prepared
queries, but the syntax is awkward and there is much less benefit to
using them).

merlin

From:
Simon Riggs
Date:

On Mon, 2009-05-11 at 17:18 +0200, Dimitri wrote:

> Yes, forget, MySQL is reaching 17.500 TPS here.

Please share your measurements of MySQL scalability also.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Simon Riggs
Date:

On Mon, 2009-05-11 at 11:23 -0400, Tom Lane wrote:
> Dimitri <> writes:
> > Anyone may explain me why analyze target may have so huge negative
> > secondary effect?..
>
> If these are simple queries, maybe what you're looking at is the
> increase in planning time caused by having to process 10x as much
> statistical data.  Cranking statistics_target to the max just because
> you can is not necessarily a good strategy.

statistics_target affects tables, so we have problems if you have a mix
of simple and complex queries. IMHO we need an explicit planner_effort
control, rather than the more arcane *_limit knobs, which are
effectively the same thing, just harder to use in practice.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Aidan Van Dyk
Date:

* Dimitri <> [090511 11:18]:
> Folks, it's completely crazy, but here is what I found:
>
> - if HISTORY table is analyzed with target 1000 my SELECT response
> time is jumping to 3ms, and the max throughput is limited to 6.000 TPS
> (it's what happenned to 8.3.7)
>
> -if HISTORY table is analyzed with target 5 - my SELECT response time
> is decreasing to 1.2ms (!)  and then my max TPS level is ~12.000 !
> and CPU is used up to 95% even by 8.3.7 :-)  and 8.4 performed better
> just because I left its analyze target to default 100 value.
>
> Anyone may explain me why analyze target may have so huge negative
> secondary effect?..

It's actually pretty straightforward.

The PostgreSQL query planner is a "smart planner".  It takes into
consideration all the statistics available on the columns/tables,
expected outputs based on inputs, etc, to choose what it thinks will be
the best plan.  The more data you have in statistics (the larger
statistics target you have), the more CPU time and longer it's going to
take to "plan" your queries.  The tradeoff is hopefully better plans.

But in your scenario, where you are hitting the database with the
absolute worst possible usage pattern for PostgreSQL (small, repeated,
simple queries), you're not getting the advantage of "better" plans. In
your case, you're throwing absolutely simple queries at PG as fast as
you can, and for each query PostgreSQL has to:

1) Parse the given "query string"
2) Given the statistics available, plan the query and pick the best one
3) Actually run the query.

Part 2 is going to dominate the CPU time in your tests, more so the more
statistics it has to evaluate, and unless the data has to come from the
disks (i.e. it is not in shared buffers or the OS cache) it is thus going
to dominate the time before you get your results.  More statistics means
more time needed for the planning/picking of the query.

If you were to use prepared statements, the cost of #1 and #2 is done
once, and then every time you throw a new execution of the query to
PostgreSQL, you get to just do #3, the easy quick part, especially for
small simple queries where all the data is in shared buffers or the cache.
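
For example, with the query from this thread, steps 1 and 2 can be paid
once per session (the statement name here is arbitrary):

```sql
-- PREPARE does the parse + plan work once...
PREPARE hist_q (char(10)) AS
  SELECT s.ref, s.name, h.horder, h.begin_date, h.end_date, h.note
    FROM history h
    JOIN stat s ON s.ref = h.ref_stat
   WHERE h.ref_object = $1
   ORDER BY h.horder;

-- ...and each subsequent call only pays for step 3
EXECUTE hist_q('0000000001');
```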

a.

--
Aidan Van Dyk                                             Create like a god,
                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

From:
Dimitri
Date:

Hi Simon,

it's too early yet to speak about MySQL scalability... :-)
It's only been a few months since there is *no more* regression in MySQL
performance when moving from 8 to 16 cores. But seeing how quickly
it's progressing now, things may change very quickly :-)

For the moment on my tests it gives:
  - on 8 cores: 14.000 TPS
  - on 16 cores: 17.500 TPS
  - on 32 cores: 15.000 TPS (regression)

Rgds,
-Dimitri

On 5/11/09, Simon Riggs <> wrote:
>
> On Mon, 2009-05-11 at 17:18 +0200, Dimitri wrote:
>
>> Yes, forget, MySQL is reaching 17.500 TPS here.
>
> Please share your measurements of MySQL scalability also.
>
> --
>  Simon Riggs           www.2ndQuadrant.com
>  PostgreSQL Training, Services and Support
>
>

From:
Dimitri
Date:

Hi Aidan,

thanks a lot for this detailed summary!

So, why don't I use prepare here: let's say I'm testing the worst
stress case :-)  Imagine you have thousands of these kinds of queries -
you cannot prepare all of them! :-)  Or maybe you prepare each one once,
but as I showed previously in this thread the prepare statement itself
takes 16ms, so for a single shot there is no gain! :-)  Stressing with
this kind of short and simple query (and again, they have joins; it
could be even simpler :-)) gives me a result that shows my guaranteed
worst case - I know then that if I have to deploy a bombarding
OLTP-like application, my database engine will be able to keep up with
such a workload, and if I have performance problems they are inside the
application! :-)  (well, it's very simplistic, but it's not far from
the truth :-))

Now, as you see from your explanation, Part #2 is the most
dominant - so instead of blaming this query, why not implement a QUERY
PLANNER CACHE??? - such that if any *similar* query is recognized by the
parser we simply *reuse* the same plan?..

Rgds,
-Dimitri



From:
Alvaro Herrera
Date:

Dimitri wrote:
> Hi Aidan,
>
> thanks a lot for this detailed summary!
>
> So, why I don't use prepare here: let's say I'm testing the worst
> stress case :-)  Imagine you have thousands of such kind of queries -
> you cannot prepare all of them! :-)

Thousands?  Surely there'll be a dozen or three of the most common queries,
to which you pass different parameters.  You can prepare those.

> Now, as you see from your explanation, the Part #2 is the most
> dominant - so why instead to blame this query not to implement a QUERY
> PLANNER CACHE??? - in way if any *similar* query is recognized by
> parser we simply *reuse* the same plan?..

This has been discussed in the past, but it turns out that a real
implementation is a lot harder than it seems.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:
Heikki Linnakangas
Date:

Dimitri wrote:
> Now, as you see from your explanation, the Part #2 is the most
> dominant - so why instead to blame this query not to implement a QUERY
> PLANNER CACHE??? - in way if any *similar* query is recognized by
> parser we simply *reuse* the same plan?..

At least in JDBC, there are several open source prepared statement cache
implementations out there that people use. I don't know about other
client libraries, but it certainly is possible to do in the client.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

From:
Dimitri
Date:

>> So, why I don't use prepare here: let's say I'm testing the worst
>> stress case :-)  Imagine you have thousands of such kind of queries -
>> you cannot prepare all of them! :-)
>
> Thousands?  Surely there'll be a dozen or three of most common queries,
> to which you pass different parameters.  You can prepare those.

Ok, and what if each client just connects to the database, executes each
kind of query just *once* and then disconnects?..  - the cost of prepare
will kill performance here if the statement isn't reused at least 10 times
within the same session.

Well, I know, we can always do better, and even use stored procedures,
etc. etc.


>
>> Now, as you see from your explanation, the Part #2 is the most
>> dominant - so why instead to blame this query not to implement a QUERY
>> PLANNER CACHE??? - in way if any *similar* query is recognized by
>> parser we simply *reuse* the same plan?..
>
> This has been discussed in the past, but it turns out that a real
> implementation is a lot harder than it seems.

Ok. If I remember well, Oracle has one and it helps a lot, but for
sure it's not easy to implement..

Rgds,
-Dimitri

From:
Dimitri
Date:

Nice to know. But again, if this cache is kept only on the client side
it will always be lost on disconnect. And if clients are "short-lived"
it will not help.

BTW, is there an option to say "make the execution plan as simple as
possible"? If you're sure about your data and your indexes - no
need to spend so much time planning.

Rgds,
-Dimitri


From:
Andres Freund
Date:

Hi,

On 05/12/2009 12:46 AM, Dimitri wrote:
> So, why I don't use prepare here: let's say I'm testing the worst
> stress case :-)  Imagine you have thousands of such kind of queries -
> you cannot prepare all of them! :-)  or you'll maybe prepare it once,
> but as I showed previously in this thread prepare statement itself
> takes 16ms, so for a single shot there is no gain! :-)
I have a hard time imagining a high-throughput OLTP workload with that
many different queries ;-)

Naturally it would still be nice to be good at this non-optimal workload...

Andres

From:
Dimitri Fontaine
Date:

Hi,

Dimitri <> writes:

>>> So, why I don't use prepare here: let's say I'm testing the worst
>>> stress case :-)  Imagine you have thousands of such kind of queries -
>>> you cannot prepare all of them! :-)
>>
>> Thousands?  Surely there'll be a dozen or three of most common queries,
>> to which you pass different parameters.  You can prepare those.
>
> Ok, and if each client just connect to the database, execute each kind
> of query just *once* and then disconnect?..  - cost of prepare will
> kill performance here if it's not reused at least 10 times within the
> same session.

In a scenario which looks like this one, what I'm doing is using
pgbouncer transaction pooling. Then a new connection from a client can be
served by an existing backend, which has already prepared your
statement.

So you first run SELECT name FROM pg_prepared_statements; to know whether
you have to PREPARE or can just EXECUTE, and you not only maintain far
fewer running backends and fork() calls, but also benefit fully from
preparing the statements even when you EXECUTE only once per client
connection.
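
Folded into the client logic, the check looks roughly like this (the
statement name `hist_q` and the query are illustrative, based on the
schema from this thread):

```sql
-- On (re)connect through the pooler: prepared-statement names are
-- per-backend, so check whether this backend already has ours
SELECT 1 FROM pg_prepared_statements WHERE name = 'hist_q';

-- ...only if no row came back:
PREPARE hist_q (char(10)) AS
  SELECT h.horder, h.begin_date, h.end_date
    FROM history h
   WHERE h.ref_object = $1
   ORDER BY h.horder;

-- ...then on every request:
EXECUTE hist_q('0000000001');
```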

> Well, I know, we always can do better, and even use stored procedures,
> etc. etc.

A plain SQL stored procedure will prevent PostgreSQL from preparing your
queries; only PL/pgSQL functions give you transparent plan caching. But
calling such a PL costs about 1ms per call in my tests, so it's not a
good solution.
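
For example, wrapping the thread's lookup in a PL/pgSQL function (a
sketch; names based on the schema above) gets the embedded query's plan
cached on first execution in each session:

```sql
CREATE OR REPLACE FUNCTION get_history(p_ref char(10))
RETURNS SETOF history AS $$
BEGIN
  -- The plan for this query is prepared once per session and reused
  RETURN QUERY
    SELECT * FROM history
     WHERE ref_object = p_ref
     ORDER BY horder;
END;
$$ LANGUAGE plpgsql STABLE;

-- then per call:
SELECT * FROM get_history('0000000001');
```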

It's possible to go as far as providing your own PostgreSQL C module
where you PREPARE at _PG_init() time and EXECUTE in a SQL-callable
function; coupled with pgbouncer it should max out the perfs. But maybe
you're not willing to go this far.

Anyway, is hammering the server with always the same query your real
need, or just a simplified test case? If the former, you'll see there are
good ways to theoretically obtain better perfs than what you're currently
reaching; if the latter, I urge you to consider some better benchmarking
tools, such as playr or tsung.

  https://area51.myyearbook.com/trac.cgi/wiki/Playr
  http://tsung.erlang-projects.org/
  http://pgfouine.projects.postgresql.org/tsung.html
  http://archives.postgresql.org/pgsql-admin/2008-12/msg00032.php

Regards,
--
dim

From:
Dimitri
Date:

Folks, before you start to think "what a dumb guy doing a dumb thing" :-))
I'll explain a few details:

For more than 10 years I've been using the db_STRESS kit
(http://dimitrik.free.fr/db_STRESS.html) to check database
performance and scalability. Until now I was very happy with the results
it gave me, as it stresses each database engine's internals very well and
brings to light some things I would probably miss with other workloads.
What do you expect - over time the "fast" query that used to execute in
500ms now runs within 1-2ms - not only has hardware improved, but the
database engines have increased their performance a lot too! :-))

In 2007 I published the first public results with PostgreSQL, and
at the time it was 2 times faster than MySQL
(http://dimitrik.free.fr/db_STRESS_BMK_Part1.html).

Last month, for the launch of MySQL 5.4, I did a long series of
tests, and at the end, out of curiosity, I ran the same load
against PostgreSQL 8.3.7 to see if MySQL had closed the gap. To my big
surprise, MySQL was faster! Observations of PG's processing also
raised a lot of questions for me - I suspected something was abnormal on
the PG side, but I did not have enough time to understand what it was
exactly (http://dimitrik.free.fr/db_STRESS_MySQL_540_and_others_Apr2009.html#note_5443)

What I'm trying to do now is to understand what exactly is the problem.

What I discovered so far with all your help:
  - the impact of a planner
  - the impact of the analyze target
  - the impact of prepare / execute
  - scalability limit on 32 cores

I'll also try to adapt a prepare/execute solution to see how much it
improves performance and/or scalability.

Also, with help from the other thread, I was able to greatly improve the
TPS stability on the read+write workload! :-)

Any other comments are welcome!

Rgds,
-Dimitri


From:
Simon Riggs
Date:

On Tue, 2009-05-12 at 12:19 +0200, Dimitri wrote:

> To my big surprise, MySQL was faster!

Ours too.

** I bet you $1000 that I can improve the performance of your benchmark
results with PostgreSQL. You give me $1000 up-front and if I can't
improve your high end numbers I'll give you $2000 back. Either way, you
name me and link to me from your blog. Assuming you re-run the tests as
requested and give me reasonable access to info and measurements. **

I note your blog identifies you as a Sun employee. Is that correct? If
you do not give us the opportunity to improve upon the results then
reasonable observers might be persuaded you did not wish to show
PostgreSQL in its best light. You up for it?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Dimitri
Date:

Wow, Simon! :-))

yes, I'm working in the Sun Benchmark Center :-))
(I'm not using my Sun email on public lists, only to avoid spam)

and the fact that I came here asking questions probably proves my
intention to show PostgreSQL in its best light, no?.. - I never liked
"not honest" comparisons :-))

Regarding your bet: from a very young age I learned one thing - take
any 2 people betting for any reason and you'll find in them one
idiot and one bastard :-))   idiot - because betting while missing
knowledge, and bastard - because, knowing the truth, it is not honest
to profit from idiots :-))  That's why I have never bet in my life,
but every time I tell the same story in such a situation... Did you
like it? ;-))

However, no problem to give credit to you, as well as to the whole
pg-perf list, as it provides very valuable help! :-))

Rgds,
-Dimitri


From:
Simon Riggs
Date:

On Tue, 2009-05-12 at 13:16 +0200, Dimitri wrote:

> Wow, Simon! :-))
>
> yes, I'm working in Sun Benchmark Center :-))
> (I'm not using my Sun email on public lists only to avid a spam)
>
> and as came here and asking questions it's probably proving my
> intentions to show PostgreSQL in its best light, no?.. - I never liked
> "not honest" comparisons :-))
>
> Regarding your bet: from a very young age I learned a one thing - you
> take any 2 person who betting for any reason - you'll find in them one
> idiot and one bastard :-))   idiot - because betting while missing
> knowledge, and bastard - because knowing the truth is not honset to
> get a profit from idiots :-))  That's why I never betting in my life,
> but every time telling the same story in such situation... Did you
> like it? ;-))

No, but I asked for it, so we're even. ;-)

Let's work on the benchmark.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Heikki Linnakangas
Date:

Dimitri wrote:
> What I discovered so far with all your help:
>   - the impact of a planner
>   - the impact of the analyze target
>   - the impact of prepare / execute
>   - scalability limit on 32 cores

You've received good advice on how to minimize the impact of the first
three points, and using those techniques should bring a benefit. But I'm
pretty surprised by the bad scalability you're seeing and no-one seems
to have a good idea on where that limit is coming from. At a quick
glance, I don't see any inherent bottlenecks in the schema and workload.

If you could analyze where the bottleneck is with multiple cores, that
would be great. With something like oprofile, it should be possible to
figure out where the time is spent.

My first guess would be the WALInsertLock: writing to WAL is protected
by it, and it can become a bottleneck with lots of small
UPDATE/DELETE/INSERT transactions. But a profile would be required to
verify that.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

From:
Simon Riggs
Date:

On Tue, 2009-05-12 at 12:19 +0200, Dimitri wrote:

> What I'm trying to do now is to understand what exactly is the
> problem.

You're running with 1600 users, which is above the scalability limit
uncovered (by Sun...) during earlier benchmarking. The scalability
issues are understood but currently considered above the
reasonable-setting limit and so nobody has been inclined to improve
matters.

You should use a connection concentrator to reduce the number of
sessions down to say 400.

Your wal_buffers setting is also too low, and you will be experiencing
contention on the WALWriteLock. Increase wal_buffers to about 8x what
you have now.

You can move pg_xlog to its own set of drives.

Set checkpoint_completion_target to 0.95.
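
Rolled into postgresql.conf, the suggestions would look roughly like this
(values illustrative - "8x" depends on the current wal_buffers setting,
which defaulted to 64kB in that era):

```
wal_buffers = 512kB                  # ~8x the old 64kB default; use 8x your value
checkpoint_completion_target = 0.95  # spread checkpoint I/O over the interval
# pg_xlog on its own drives: stop the server, move the directory,
# and symlink it back into $PGDATA
```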

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Dimitri
Date:

For the moment I'm not even considering any scalability issues on the
Read+Write workload - they may well be related to the storage box, and
storage latency or controller/cache efficiency may play a big role.

The problem I'm considering is a scalability issue on the Read-Only
workload - only selects, no disk access; on the move from 8 to 16 cores
we gain near 100%, but on the move from 16 to 32 cores it's only 10%...

I think I have to replay Read-Only with prepare/execute and check how
much it helps (I don't know if there is some internal locking involved
when the planner is used)..

And yes, I'll try to profile on 32 cores, it makes sense.

Rgds,
-Dimitri


From:
Stefan Kaltenbrunner
Date:

Dimitri wrote:
> Folks, before you start to think "what a dumb guy doing a dumb thing" :-))
> I'll explain you few details:
>
> it's for more than 10 years I'm using a db_STRESS kit
> (http://dimitrik.free.fr/db_STRESS.html) to check databases
> performance and scalability. Until now I was very happy with results
> it gave me as it stress very well each database engine internals an
> put on light some things I should probably skip on other workloads.
> What do you want, with a time the "fast" query executed before in
> 500ms now runs within 1-2ms  - not only hardware was improved but also
> database engines increased their performance a lot! :-))

I was attempting to look into that "benchmark" kit a bit, but I find the
information on that page a bit lacking :( A few notes:

* Is the source code for the benchmark actually available? The "kit"
seems to contain a few precompiled binaries and some source/header files,
but there are no build instructions, no Makefile, or even a README,
which makes it really hard to verify exactly what the benchmark is doing,
or whether the benchmark client might actually be the problem here.

* There is very little information on how the toolkit talks to the
database - some of the binaries seem to contain a static copy of libpq
or such?

* How many queries per session is the toolkit actually using? Some
earlier comments seem to imply you are doing a connect/disconnect cycle
for every query - is that actually true?


Stefan

From:
Dimitri
Date:

Wait wait, currently I'm playing the "stress scenario", so there are
only 256 sessions max, but think time is zero (full stress). The scenario
with 1600 users is to test how solid the database is at simply keeping a
huge number of users connected while doing only one transaction per
second (very low global TPS compared to what the database is able to do,
but it tests how well its internals manage the user sessions).

I did not plan to do the 1600-user test this time (it all depends on time :-))

So, do I need to increase WAL buffers for 256 users?

My LOG and DATA have been placed on separate storage LUNs and controllers
from the beginning.

I've changed the default 0.5 checkpoint_completion_target to 0.8 now;
should I go up to 0.95?..

Also, to avoid TPS "waves" and bring stability on Read+Write workload
I followed advices from a parallel thread:

bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 4.0
shared_buffers = 1024MB

I've also tried shared_buffers=256MB as advised, but then the
Read-Only workload loses performance, as PG's own caching helps
anyway.

Also, checkpoint_timeout is 30s now, and of course a huge difference
came from moving default_statistics_target to 5! - but this one I
found myself :-))

Probably checkpoint_timeout may be bigger now with the current
settings? - the goal here is to keep Read+Write TPS as stable as
possible and also to avoid a long recovery in case of a
system/database/other crash (in theory).

Rgds,
-Dimitri



From:
Dimitri
Date:

Hi Stefan,

sorry, I have not had time to put all the details into the toolkit -
but at least I published it rather than just telling a "nice story"
about it :-)

The client process is a binary compiled with libpq. The client
interprets a scenario script and publishes, via SHM, the time spent on
each SQL request. I did not publish the sources yet as that will also
require explaining how to compile them :-)) So for the moment it's
shipped as freeware, but with time everything will be available (BTW,
you're the first to ask for the sources (well, except the IBM guys who
asked to get it on POWER boxes, but that's another story :-))

What is good is that each client publishes its internal stats *live*, so
we're able to get live data and follow any kind of "waves" in
performance. Each session is a single process, so there is no
contention between clients as you may see with some other tools. The
current scenario script contains 2 selects (representing a Read
transaction) and a delete/insert/update (representing a Write
transaction). According to the start parameters, each client executes a
given number of Reads per Write. It connects at the beginning and
disconnects at the end of the test.

It's also possible to extend it to do other queries, or simply give
each client a different scenario script - what's important is to be able
to collect its stats live, to understand what's going wrong (if
any)..

I'm planning to extend it to give an easy way to run it against any
database schema; it's only a question of time..

Rgds,
-Dimitri


From:
Stefan Kaltenbrunner
Date:

Dimitri wrote:
> Hi Stefan,
>
> sorry, I did not have a time to bring all details into the toolkit -
> but at least I published it instead to tell a "nice story" about :-)

fair point and appreciated. But it seems important that benchmarking
results can be verified by others as well...

>
> The client process is a binary compiled with libpq. Client is
> interpreting a scenario script and publish via SHM a time spent on
> each SQL request. I did not publish sources yet as it'll also require
> to explain how to compile them :-)) So for the moment it's shipped as
> a freeware, but with time everything will be available (BTW, you're
> the first who asking for sources (well, except IBM guys who asked to
> get it on POWER boxes, but it's another story :-))

well, there is no licence tag (or a copyright notice) or anything else
associated with the download, which makes it a bit harder than it really
needs to be.
The reason I was actually looking for the source is that none of my
available benchmark platforms are among the ones you are providing
binaries for, which kinda reduces its usefulness.

>
> What is good is each client is publishing *live* its internal stats an
> we're able to get live data and follow any kind of "waves" in
> performance. Each session is a single process, so there is no
> contention between clients as you may see on some other tools. The
> current scenario script contains 2 selects (representing a Read
> transaction) and delete/insert/update (representing Write
> transaction). According a start parameters each client executing a
> given number Reads per Write. It's connecting on the beginning and
> disconnecting at the end of the test.

well, I have seen clients getting bottlenecked internally (like wasting
more time getting rid of/absorbing the actual result than it took the
server to generate the answer...).
How sure are you that your "live publishing of data" does not affect the
benchmark results (because it kinda generates an artificial think time),
for example?
But what I get from your answer is that you are basically doing one
connect/disconnect per client, and the testcase you are talking about has
256 clients?


Stefan

From:
Robert Haas
Date:

On Tue, May 12, 2009 at 8:59 AM, Dimitri <> wrote:
> Wait wait, currently I'm playing the "stress scenario", so there are
> only 256 sessions max, but think time is zero (full stress). The scenario
> with 1600 users is to test how solid the database is at just keeping a huge
> number of users, while doing only one transaction per second (a very low
> global TPS compared to what the database is able to do, but it tests
> how well its internals work to manage the user sessions).

Didn't we beat this to death in mid-March on this very same list?
Last time I think it was Jignesh Shah.  AIUI, it's a well-known fact
that PostgreSQL doesn't do very well at this kind of workload unless
you use a connection pooler.

*goes and checks the archives*  Sure enough, 116 emails under the
subject line "Proposal of tunable fix for scalability of 8.4".

So, if your goal is to find a scenario under which PostgreSQL performs
as badly as possible, congratulations - you've discovered the same
case that we already knew about.  Obviously it would be nice to
improve it, but IIRC so far no one has had any very good ideas on how
to do that.  If this example mimics a real-world workload that you
care about, and if using a connection pooler is just not a realistic
option in that scenario for whatever reason, then you'd be better off
working on how to fix it than on measuring it, because it seems to me
we already know it's got problems, per previous discussions.

...Robert

From:
Matthew Wakeling
Date:

On Tue, 12 May 2009, Stefan Kaltenbrunner wrote:
> But what I get from your answer is that you are basically doing one
> connect/disconnect per client and the testcase you are talking about has 256
> clients?

Correct me if I'm wrong, but won't connect operations be all handled by a
single thread - the parent postmaster? There's your scalability problem
right there. Also, spawning a new backend process is an awful lot of
overhead to run just one query.

As far as I can see, it's quite understandable for MySQL to perform better
than PostgreSQL in these circumstances, as it has a smaller simpler
backend to start up each time. If you really want to get decent
performance out of Postgres, then use long-lived connections (which most
real-world use cases will do) and prepare your queries in advance with
parameters.
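Matthew's advice can be sketched with the tables from the top of this thread - prepare once per long-lived session, then execute with parameters, skipping the repeated parse/plan work (the statement name here is made up):

```sql
-- Illustrative only: run once after connecting...
PREPARE hist_q (char(10)) AS
    SELECT h.horder, h.begin_date, h.end_date, h.note
      FROM history h
     WHERE h.ref_object = $1
     ORDER BY h.horder;

-- ...then reuse it for the lifetime of the connection.
EXECUTE hist_q('0000000001');
```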

Matthew

--
 import oz.wizards.Magic;
   if (Magic.guessRight())...           -- Computer Science Lecturer

From:
Simon Riggs
Date:

On Tue, 2009-05-12 at 16:00 +0100, Matthew Wakeling wrote:
> won't connect operations be all handled by a
> single thread - the parent postmaster?

No, we spawn then authenticate.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Matthew Wakeling
Date:

On Tue, 12 May 2009, Simon Riggs wrote:
>> won't connect operations be all handled by a
>> single thread - the parent postmaster?
>
> No, we spawn then authenticate.

But you still have a single thread doing the accept() and spawn. At some
point (maybe not now, but in the future) this could become a bottleneck
given very short-lived connections.

Matthew

--
 -. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-.
 ||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||
 |/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/
 '   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'

From:
Stefan Kaltenbrunner
Date:

Matthew Wakeling wrote:
> On Tue, 12 May 2009, Simon Riggs wrote:
>>> won't connect operations be all handled by a
>>> single thread - the parent postmaster?
>>
>> No, we spawn then authenticate.
>
> But you still have a single thread doing the accept() and spawn. At some
> point (maybe not now, but in the future) this could become a bottleneck
> given very short-lived connections.

well the main cost is backend startup, and that one is extremely
expensive (compared to the cost of a simple query, and also depending on
the OS). We have more overhead there than other databases (most notably
MySQL), hence my question on how the benchmark was operating.
For any kind of workload that contains frequent connection
establishment one wants to use a connection pooler like pgbouncer (as
said elsewhere in the thread already).
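For reference, a minimal pgbouncer setup along those lines might look like this - hostnames, ports, and pool sizes are purely illustrative, not taken from the thread:

```ini
[databases]
; route the benchmark database through the pooler to the real server
bench = host=127.0.0.1 port=5432 dbname=bench

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = trust
; one of: session | transaction | statement
pool_mode = transaction
; many client connections funnel into a small set of server backends
max_client_conn = 1600
default_pool_size = 32
```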


Stefan

From:
Tom Lane
Date:

Matthew Wakeling <> writes:
> On Tue, 12 May 2009, Simon Riggs wrote:
>> No, we spawn then authenticate.

> But you still have a single thread doing the accept() and spawn. At some
> point (maybe not now, but in the future) this could become a bottleneck
> given very short-lived connections.

More to the point, each backend process is a pretty heavyweight object:
it is a process, not a thread, and it's not going to be good for much
until it's built up a reasonable amount of stuff in its private caches.
I don't think the small number of cycles executed in the postmaster
process amount to anything at all compared to the other overhead
involved in getting a backend going.

In short: executing a single query per connection is going to suck,
and there is not anything we are going to do about it except to tell
you to use a connection pooler.

MySQL has a different architecture: thread per connection, and AFAIK
whatever caches it has are shared across threads.  So a connection is a
lighter-weight object for them; but there's no free lunch.  They pay for
it in having to tolerate locking/contention overhead on operations that
for us are backend-local.

            regards, tom lane

From:
Dimitri
Date:

Robert, what I'm testing now is 256 users max. The workload grows
progressively from 1, 2, 4, 8 ... to 256 users. Of course the max
throughput is reached at a number of users equal to 2 * the number of
cores, but what's important for me here is that the database should
continue to keep up with the workload! - response time regresses, but the
throughput should remain near the same.

So, do I really need a pooler to keep 256 users working??  - I don't
think so, but please, correct me.

BTW, I was not looking to put PostgreSQL in bad conditions - the test is
the test, and as I said 2 years ago PostgreSQL outperformed MySQL on
the same test case, and there was nothing done within the MySQL code to
improve it explicitly for db_STRESS.. And I'm staying pretty honest
when I'm testing something.

Rgds,
-Dimitri


On 5/12/09, Robert Haas <> wrote:
> On Tue, May 12, 2009 at 8:59 AM, Dimitri <> wrote:
>> Wait wait, currently I'm playing the "stress scenario", so there are
>> only 256 sessions max, but think time is zero (full stress). The scenario
>> with 1600 users is to test how solid the database is at just keeping a huge
>> number of users, while doing only one transaction per second (a very low
>> global TPS compared to what the database is able to do, but it tests
>> how well its internals work to manage the user sessions).
>
> Didn't we beat this to death in mid-March on this very same list?
> Last time I think it was Jignesh Shah.  AIUI, it's a well-known fact
> that PostgreSQL doesn't do very well at this kind of workload unless
> you use a connection pooler.
>
> *goes and checks the archives*  Sure enough, 116 emails under the
> subject line "Proposal of tunable fix for scalability of 8.4".
>
> So, if your goal is to find a scenario under which PostgreSQL performs
> as badly as possible, congratulations - you've discovered the same
> case that we already knew about.  Obviously it would be nice to
> improve it, but IIRC so far no one has had any very good ideas on how
> to do that.  If this example mimics a real-world workload that you
> care about, and if using a connection pooler is just not a realistic
> option in that scenario for whatever reason, then you'd be better off
> working on how to fix it than on measuring it, because it seems to me
> we already know it's got problems, per previous discussions.
>
> ...Robert
>

From:
"Kevin Grittner"
Date:

Dimitri <> wrote:

> Of course the Max throughput is reached on the number of users equal
> to 2 * number of cores

I'd expect that when disk I/O is not a significant limiting factor,
but I've seen a "sweet spot" of (2 * cores) + (effective spindle
count) for loads involving a lot of random I/O.

> So, do I really need a pooler to keep 256 users working??

I have seen throughput fall above a certain point when I don't use a
connection pooler.  With a connection pooler which queues requests
when all connections are busy, you will see no throughput degradation
as users of the pool are added.  Our connection pool is in our
framework, so I don't know whether pgbouncer queues requests.
(Perhaps someone else can comment on that, and make another suggestion
if it doesn't.)

-Kevin

From:
Dimitri
Date:

On 5/12/09, Stefan Kaltenbrunner <> wrote:
> Dimitri wrote:
>> Hi Stefan,
>>
>> sorry, I did not have time to bring all the details into the toolkit -
>> but at least I published it instead of just telling a "nice story" about it :-)
>
> fair point and appreciated. But it seems important that benchmarking
> results can be verified by others as well...

until now there were only people running Solaris or Linux :-))

>
>>
>> The client process is a binary compiled with libpq. The client
>> interprets a scenario script and publishes via SHM the time spent on
>> each SQL request. I did not publish the sources yet as it'll also require
>> explaining how to compile them :-)) So for the moment it's shipped as
>> freeware, but with time everything will be available (BTW, you're
>> the first to ask for the sources (well, except the IBM guys who asked to
>> get it on POWER boxes, but that's another story :-))
>
> well there is no licence tag (or a copyright notice) or anything else
> associated with the download, which makes it a bit harder than it really
> needs to be.
> The reason why I was actually looking for the source is that none of my
> available benchmark platforms are among the ones you are providing
> binaries for, which kinda reduces its usefulness.
>

agree, will improve this point

>>
>> What is good is that each client publishes its internal stats *live*, so
>> we're able to get live data and follow any kind of "waves" in
>> performance. Each session is a single process, so there is no
>> contention between clients as you may see with some other tools. The
>> current scenario script contains 2 selects (representing a Read
>> transaction) and a delete/insert/update (representing a Write
>> transaction). According to the start parameters, each client executes a
>> given number of Reads per Write. It connects at the beginning and
>> disconnects at the end of the test.
>
> well I have seen clients getting bottlenecked internally (like wasting
> more time absorbing/discarding the actual result than it took the
> server to generate the answer...).
> How sure are you that your "live publishing of data" does not affect the
> benchmark results (because it kinda generates an artificial think time),
> for example?

In all my test tools the clients publish their data via a shared
memory segment (ISM); all they do is just *increment* their current
stats values and continue their processing. A separate dedicated
program has to be executed to print these stats - it attaches to
the same SHM segment and prints the *difference* between the values for
the current and the next interval. Let me know if you need more
details.

> But what I get from your answer is that you are basically doing one
> connect/disconnect per client and the testcase you are talking about has
> 256 clients?

Exactly, only one connect/disconnect per test, and the number of clients
grows progressively from 1, 2, 4, 8, 16, .. to 256

Rgds,
-Dimitri

>
>
> Stefan
>

From:
"Joshua D. Drake"
Date:

On Tue, 2009-05-12 at 17:22 +0200, Dimitri wrote:
> Robert, what I'm testing now is 256 users max. The workload is growing
> progressively from 1, 2, 4, 8 ... to 256 users. Of course the Max
> throughput is reached on the number of users equal to 2 * number of
> cores, but what's important for me here - database should continue to
> keep the workload! - response time regressing, but the throughput
> should remain near the same.
>
> So, do I really need a pooler to keep 256 users working??  - I don't
> think so, but please, correct me.

If they disconnect and reconnect yes. If they keep the connections live
then no.

Joshua D. Drake

--
PostgreSQL - XMPP: 
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


From:
Dimitri
Date:

No, they keep connections till the end of the test.

Rgds,
-Dimitri

On 5/12/09, Joshua D. Drake <> wrote:
> On Tue, 2009-05-12 at 17:22 +0200, Dimitri wrote:
>> Robert, what I'm testing now is 256 users max. The workload is growing
>> progressively from 1, 2, 4, 8 ... to 256 users. Of course the Max
>> throughput is reached on the number of users equal to 2 * number of
>> cores, but what's important for me here - database should continue to
>> keep the workload! - response time regressing, but the throughput
>> should remain near the same.
>>
>> So, do I really need a pooler to keep 256 users working??  - I don't
>> think so, but please, correct me.
>
> If they disconnect and reconnect yes. If they keep the connections live
> then no.
>
> Joshua D. Drake
>
> --
> PostgreSQL - XMPP: 
>    Consulting, Development, Support, Training
>    503-667-4564 - http://www.commandprompt.com/
>    The PostgreSQL Company, serving since 1997
>
>

From:
Tom Lane
Date:

Robert Haas <> writes:
> AIUI, whenever the connection pooler switches to serving a new client,
> it tells the PG backend to DISCARD ALL.  But why couldn't we just
> implement this same logic internally?  IOW, when a client disconnects,
> instead of having the backend exit immediately, have it perform the
> equivalent of DISCARD ALL and then stick around for a minute or two
> and, if a new connection request arrives within that time, have the
> old backend handle the new connection...

See previous discussions.  IIRC, there are two killer points:

1. There is no (portable) way to pass the connection from the postmaster
to another pre-existing process.

2. You'd have to track which database, and probably which user, each
such backend had been launched for; reconnecting a backend to a new
database is probably impractical and would certainly invalidate all
the caching.

Overall it looked like way too much effort for way too little gain.

            regards, tom lane

From:
Alvaro Herrera
Date:

Andres Freund escribió:

> Naturally it would still be nice to be good in this not optimal workload...

I find it hard to justify wasting our scarce development resources on
optimizing such a contrived workload.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:
Simon Riggs
Date:

On Tue, 2009-05-12 at 11:30 -0700, Scott Carey wrote:
> the fact is there is no evidence that a connection pooler will fix the
> scalability from 16 > 32 cores.

There has been much analysis over a number of years of the effects of
the ProcArrayLock, specifically the O(N^2) effect of increasing numbers
of connections on GetSnapshotData(). Most discussion has been on
-hackers, not -perform.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Simon Riggs
Date:

On Tue, 2009-05-12 at 15:52 -0400, Robert Haas wrote:
> On Tue, May 12, 2009 at 12:49 PM, Tom Lane <> wrote:
> > 1. There is no (portable) way to pass the connection from the postmaster
> > to another pre-existing process.
>
> [Googles.]  It's not obvious to me that SCM_RIGHTS is non-portable,
> and Windows has an API call WSADuplicateSocket() specifically for this
> purpose.

Robert, Greg,

Tom's main point is it isn't worth doing. We have connection pooling
software that works well, very well. Why do we want to bring it into
core? (Think of the bugs we'd hit...) If we did, who would care?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Dimitri Fontaine
Date:

Hi,

Le 12 mai 09 à 18:32, Robert Haas a écrit :
> implement this same logic internally?  IOW, when a client disconnects,
> instead of having the backend exit immediately, have it perform the
> equivalent of DISCARD ALL and then stick around for a minute or two
> and, if a new connection request arrives within that time, have the
> old backend handle the new connection...

A much better idea to solve this, in my opinion, would be to have
pgbouncer as a postmaster child, integrated into PostgreSQL. It allows
for choosing whether you want session pooling, transaction pooling or
statement pooling, which is a more deterministic way to choose when
your client connection will benefit from a fresh backend or an
existing one. And it respects backend timeouts etc.
It's proven open-source technology, and I think I've heard of a
PostgreSQL distribution where it's already a postmaster child.

<handwaving>
And when associated with Hot Standby (and Sync Wal Shipping), having a
connection pooler in -core could allow for transparent Read-Write
access to the slave: at the moment you need an XID (and when connected
on the slave), the backend could tell the pgbouncer process to
redirect the connection to the master. With such a feature, you don't
have to build client side high availability, just connect to either
the master or the slave and be done with it, whatever the SQL you're
gonna run.
</>

>

Regards,
--
dim

From:
Scott Carey
Date:

Although nobody wants to support it, he should try the patch that Jignesh K.
Shah (from Sun) proposed that makes ProcArrayLock lighter-weight.  If it
makes 32 cores much faster, then we have a smoking gun.

Although everyone here is talking about this as an 'unoptimal' solution, the
fact is there is no evidence that a connection pooler will fix the
scalability from 16 > 32 cores.
Certainly a connection pooler will help most results, but it may not fix the
scalability problem.

A question for Dimitri:
What is the scalability from 16 > 32 cores at the 'peak' load that occurs
near 2x the CPU count?  Is it also poor?  If this is also poor, IMO the
community here should not be complaining about this suboptimal case -- a
connection pooler at that stage does little and prepared statements will
increase throughput but not likely alter scalability.

If that result scales, then the short term answer is a connection pooler.

In the tests that Jignesh ran -- making the ProcArrayLock faster helped the
case where connections = 2x the CPU core count quite a bit.

The thread about the CPU scalability is "Proposal of tunable fix for
scalability of 8.4", originally posted by "Jignesh K. Shah"
<>, March 11 2009.

It would be very useful to see results of this benchmark with:
1. A Connection Pooler
2. Jignesh's patch
3. Prepared statements

#3 is important, because prepared statements are ideal for queries that
perform well with low statistics_targets, and not ideal for those that
require high statistics targets.  Realistically, an app won't have more than
a couple dozen statement forms to prepare.  Setting the default statistics
target to 5 is just a way to make some other query perform badly.


On 5/12/09 10:53 AM, "Alvaro Herrera" <> wrote:

> Andres Freund escribió:
>
>> Naturally it would still be nice to be good in this not optimal workload...
>
> I find it hard to justify wasting our scarce development resources into
> optimizing such a contrived workload.
>
> --
> Alvaro Herrera                                http://www.CommandPrompt.com/
> The PostgreSQL Company - Command Prompt, Inc.
>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>


From:
"Joshua D. Drake"
Date:

On Tue, 2009-05-12 at 21:24 +0100, Simon Riggs wrote:
> On Tue, 2009-05-12 at 15:52 -0400, Robert Haas wrote:
> > On Tue, May 12, 2009 at 12:49 PM, Tom Lane <> wrote:
> > > 1. There is no (portable) way to pass the connection from the postmaster
> > > to another pre-existing process.
> >
> > [Googles.]  It's not obvious to me that SCM_RIGHTS is non-portable,
> > and Windows has an API call WSADuplicateSocket() specifically for this
> > purpose.
>
> Robert, Greg,
>
> Tom's main point is it isn't worth doing. We have connection pooling
> software that works well, very well. Why do we want to bring it into
> core? (Think of the bugs we'd hit...) If we did, who would care?

I would.

Not to knock poolers but they are limited and not very efficient. Heck
the best one I have used is pgbouncer and it has problems too under
heavy load (due to libevent issues). It also doesn't support all of our
auth methods.

Apache solved this problem back when it was still called NCSA HTTPD. Why
aren't we preforking again?

Joshua D. Drake



--
PostgreSQL - XMPP: 
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


From:
Robert Haas
Date:

On Tue, May 12, 2009 at 1:00 PM, Dimitri <> wrote:
> On MySQL there are no changes if I set the number of sessions in the
> config file to 400 or to 2000 - for 2000 it'll just allocate more
> memory.

I don't care whether the setting affects the speed of MySQL.  I want
to know if it affects the speed of PostgreSQL.

> After latest fix with default_statistics_target=5, version 8.3.7 is
> running as fast as 8.4, even 8.4 is little little bit slower.
>
> I understand your position with a pooler, but I also want you to think
> about the idea that 128-core systems will become commodity servers very
> soon, and to use these cores at their full power you'll need a
> database engine capable of running 256 users without a pooler, because a
> pooler will not help you here anymore..

So what?  People with 128-core systems will not be running trivial
joins that return in 1-2ms and have one second think times between
them.  And if they are, and if they have nothing better to do than
worry about whether MySQL can process those queries in 1/2000th of the
think time rather than 1/1000th of the think time, then they can use
MySQL.  If we're going to worry about performance on 128-core system,
we would be much better advised to put our efforts into parallel query
execution than how many microseconds it takes to execute very simple
queries.

Still, I have no problem with making PostgreSQL faster in the case
you're describing.  I'm just not interested in doing it on my own time
for free.  I am sure there are a number of people who read this list
regularly who would be willing to do it for money, though.  Maybe even
me.  :-)

...Robert

From:
Robert Haas
Date:

On Tue, May 12, 2009 at 4:24 PM, Simon Riggs <> wrote:
>
> On Tue, 2009-05-12 at 15:52 -0400, Robert Haas wrote:
>> On Tue, May 12, 2009 at 12:49 PM, Tom Lane <> wrote:
>> > 1. There is no (portable) way to pass the connection from the postmaster
>> > to another pre-existing process.
>>
>> [Googles.]  It's not obvious to me that SCM_RIGHTS is non-portable,
>> and Windows has an API call WSADuplicateSocket() specifically for this
>> purpose.
>
> Robert, Greg,
>
> Tom's main point is it isn't worth doing. We have connection pooling
> software that works well, very well. Why do we want to bring it into
> core? (Think of the bugs we'd hit...) If we did, who would care?

I don't know.  It seems like it would be easier to manage just
PostgreSQL than PostgreSQL + connection pooling software, but mostly I
was just curious whether it had been thought about, so I asked, and
the answer then led to a further question...  was not intending to
make a big deal about it.

...Robert

From:
Alvaro Herrera
Date:

Dimitri Fontaine escribió:

> A much better idea to solve this, in my opinion, would be to have
> pgbouncer as a postmaster child, integrated into PostgreSQL. It allows
> for choosing whether you want session pooling, transaction pooling or
> statement pooling, which is a more deterministic way to choose when your
> client connection will benefit from a fresh backend or an existing one.
> And it's respecting some backend timeouts etc.

Hmm.  Seems like the best idea if we go this route would be one of
Simon's which was to have better support for pluggable postmaster
children.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:
Greg Stark
Date:

On Tue, May 12, 2009 at 5:49 PM, Tom Lane <> wrote:
> See previous discussions.  IIRC, there are two killer points:
>
> 1. There is no (portable) way to pass the connection from the postmaster
> to another pre-existing process.

The Apache model is to have all the backends call accept. So incoming
connections don't get handled by a single master process, they get
handled by whichever process the kernel picks to receive the
connection.

--
greg

From:
"Joshua D. Drake"
Date:

On Tue, 2009-05-12 at 20:34 -0400, Aidan Van Dyk wrote:
> * Joshua D. Drake <> [090512 19:27]:
>
> > Apache solved this problem back when it was still called NCSA HTTPD. Why
> > aren't we preforking again?
>
> Of course, preforking and connection pooling are totally different
> beast...
>

Yes and no. They both solve similar problems, and preforking solves more
problems when you look at the picture in its entirety (namely authentication
integration etc..)

> But, what really does preforking give us?  A 2 or 3% improvement?

It depends on the problem we are solving. We can test it but I would bet
it is more than that especially in a high velocity environment.

>   The
> forking isn't the expensive part,

It is expensive but not as expensive as the below.

>  the per-database setup that happens is
> the expensive setup...  All pre-forking would save us is a tiny part of
> the initial setup, and in turn make our robust postmaster controller no
> longer have control.

I don't buy this. Properly coded we aren't going to lose any "control".

Joshua D. Drake

--
PostgreSQL - XMPP: 
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


From:
Aidan Van Dyk
Date:

* Joshua D. Drake <> [090512 19:27]:

> Apache solved this problem back when it was still called NCSA HTTPD. Why
> aren't we preforking again?

Of course, preforking and connection pooling are totally different
beast...

But, what really does preforking give us?  A 2 or 3% improvement?  The
forking isn't the expensive part, the per-database setup that happens is
the expensive setup...  All pre-forking would save us is a tiny part of
the initial setup, and in turn make our robust postmaster controller no
longer have control.

a.

--
Aidan Van Dyk                                             Create like a god,
                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

From:
Stephen Frost
Date:

* Aidan Van Dyk () wrote:
> But, what really does preforking give us?  A 2 or 3% improvement?  The
> forking isn't the expensive part, the per-database setup that happens is
> the expensive setup...

Obviously that begs the question- why not support pre-fork with specific
databases associated with specific backends that do the per-database
setup prior to a connection coming in?  eg- I want 5 backends ready per
user database (excludes template0, template1, postgres).

Thoughts?

    Thanks,

        Stephen

From:
Robert Haas
Date:

On Tue, May 12, 2009 at 11:22 AM, Dimitri <> wrote:
> Robert, what I'm testing now is 256 users max. The workload is growing
> progressively from 1, 2, 4, 8 ... to 256 users. Of course the Max
> throughput is reached on the number of users equal to 2 * number of
> cores, but what's important for me here - database should continue to
> keep the workload! - response time regressing, but the throughput
> should remain near the same.
>
> So, do I really need a pooler to keep 256 users working??  - I don't
> think so, but please, correct me.

Not an expert on this, but there has been a lot of discussion of the
importance of connection pooling in this space.  Is MySQL still faster
if you lower max_connections to a value that is closer to the number
of users, like 400 rather than 2000?

> BTW, I did not look to put PostgreSQL in bad conditions - the test is
> the test, and as I said 2 years ago PostgreSQL outperformed MySQL on
> the same test case, and there was nothing done within MySQL code to
> improve it explicitly for db_STRESS.. And I'm staying pretty honest
> when I'm testing something.

Yeah but it's not really clear what that something is.  I believe you
said upthread that PG 8.4 beta 1 is faster than PG 8.3.7, but PG 8.4
beta 1 is slower than MySQL 5.4 whereas PG 8.3.7 was faster than some
older version of MySQL.  So PG got faster and MySQL got faster, but
they sped things up more than we did.  If our performance were getting
WORSE, I'd be worried about that, but the fact that they were able to
make more improvement on this particular case than we were doesn't
excite me very much.  Sure, I'd love it if PG were even faster than it
is, and if you have a suggested patch please send it in...  or if you
want to profile it and send the results that would be great too.  But
I guess my point is that the case of a very large number of
simultaneous users with pauses-for-thought between queries has already
been looked at in the very recent past in a way that's very similar to
what you are doing (and by someone who works at the same company you
do, no less!) so I'm not quite sure why we're rehashing the issue.

...Robert

From:
Robert Haas
Date:

On Tue, May 12, 2009 at 12:49 PM, Tom Lane <> wrote:
> 1. There is no (portable) way to pass the connection from the postmaster
> to another pre-existing process.

[Googles.]  It's not obvious to me that SCM_RIGHTS is non-portable,
and Windows has an API call WSADuplicateSocket() specifically for this
purpose.

> 2. You'd have to track which database, and probably which user, each
> such backend had been launched for; reconnecting a backend to a new
> database is probably impractical and would certainly invalidate all
> the caching.

User doesn't seem like a major problem, but I understand your point
about databases, which would presumably preclude the Apache approach
of having every backend call accept() on the master socket.

...Robert

From:
Glenn Maynard
Date:

I'm sorry, but I'm confused.  Everyone keeps talking about connection
pooling, but Dimitri has said repeatedly that each client makes a
single connection and then keeps it open until the end of the test,
not that it makes a single connection per SQL query.  Connection
startup costs shouldn't be an issue.  Am I missing something here?
test(N) starts N clients, each client creates a single connection and
hammers the server for a while on that connection.  test(N) is run for
N=1,2,4,8...256.  This seems like a very reasonable test scenario.

--
Glenn Maynard

From:
Dimitri
Date:

On 5/12/09, Robert Haas <> wrote:
> On Tue, May 12, 2009 at 1:00 PM, Dimitri <> wrote:
>> On MySQL there are no changes if I set the number of sessions in the
>> config file to 400 or to 2000 - for 2000 it'll just allocate more
>> memory.
>
> I don't care whether the setting affects the speed of MySQL.  I want
> to know if it affects the speed of PostgreSQL.

the problem is they both have a "max_connections" parameter, so as you
asked about MySQL I answered for MySQL; I did not test it yet for
PostgreSQL, that will be in the next series..

>
>> After latest fix with default_statistics_target=5, version 8.3.7 is
>> running as fast as 8.4, even 8.4 is little little bit slower.
>>
>> I understand your position with a pooler, but I also want you think
>> about idea that 128 cores system will become a commodity server very
>> soon, and to use these cores on their full power you'll need a
>> database engine capable to run 256 users without pooler, because a
>> pooler will not help you here anymore..
>
> So what?  People with 128-core systems will not be running trivial
> joins that return in 1-2ms and have one second think times between
> them.  And if they are, and if they have nothing better to do than
> worry about whether MySQL can process those queries in 1/2000th of the
> think time rather than 1/1000th of the think time, then they can use
> MySQL.  If we're going to worry about performance on 128-core system,
> we would be much better advised to put our efforts into parallel query
> execution than how many microseconds it takes to execute very simple
> queries.

Do you really think that nowadays, for example, a web forum application
with PG as a backend will have queries running slower than 1-2 ms to
display a thread message in your browser??? Or banking transactions??

>
> Still, I have no problem with making PostgreSQL faster in the case
> you're describing.  I'm just not interested in doing it on my own time
> for free.  I am sure there are a number of people who read this list
> regularly who would be willing to do it for money, though.  Maybe even
> me.  :-)
>
> ...Robert
>

You don't need to believe me, but I'm doing it for free - I still have
my own work to finish in parallel :-))  And at the same time I don't see
any other way to learn and improve my knowledge, but nobody is perfect
:-))

Rgds,
-Dimitri

From:
Dimitri
Date:

Hi Scott,

On 5/12/09, Scott Carey <> wrote:
> Although nobody wants to support it, he should try the patch that Jignesh K.
> Shah (from Sun) proposed that makes ProcArrayLock lighter-weight.  If it
> makes 32 cores much faster, then we have a smoking gun.
>
> Although everyone here is talking about this as an 'unoptimal' solution, the
> fact is there is no evidence that a connection pooler will fix the
> scalability from 16 > 32 cores.
> Certainly a connection pooler will help most results, but it may not fix the
> scalability problem.
>
> A question for Dimitri:
> What is the scalability from 16 > 32 cores at the 'peak' load that occurs
> near 2x the CPU count?  Is it also poor?  If this is also poor, IMO the
> community here should not be complaining about this unopimal case -- a
> connection pooler at that stage does little and prepared statements will
> increase throughput but not likely alter scalability.

I'm attaching a small graph showing the TPS level on PG 8.4 depending on
the number of cores (the X-axis is the number of concurrent users, the
Y-axis is the TPS). As you can see, the TPS increase is near linear when
moving from 8 to 16 cores, while on 32 cores it grows somewhat
differently; what is unclear is why the TPS level stays limited to
11,000 on 32 cores. And it's a pure read-only workload.

>
> If that result scales, then the short term answer is a connection pooler.
>
> In the tests that Jingesh ran -- making the ProcArrayLock faster helped the
> case where connections = 2x the CPU core count quite a bit.
>
> The thread about the CPU scalability is "Proposal of tunable fix for
> scalability of 8.4", originally posted by "Jignesh K. Shah"
> <>, March 11 2009.
>
> It would be very useful to see results of this benchmark with:
> 1. A Connection Pooler

It will not help, as each client is *not* disconnecting/reconnecting
during the test, and PG is holding up well even with 256 users. Also,
the TPS limit is already reached at 64 users, so I don't think a pooler
will help here.

> 2. Jignesh's patch

I've already tested it, and it did not help in my case because the real
problem is elsewhere.. (however, I have not yet tested it with my latest
config params)

> 3. Prepared statements
>

yes, I'm preparing this test.

> #3 is important, because prepared statements are ideal for queries that
> perform well with low statistics_targets, and not ideal for those that
> require high statistics targets.  Realistically, an app won't have more than
> a couple dozen statement forms to prepare.  Setting the default statistics
> target to 5 is just a way to make some other query perform bad.

Agreed, but as you may set a different statistics target *per* table, it
should not be a problem. What is sure is that all the time spent in the
parser and planner will be removed here, and the final time should be
pure execution.

Rgds,
-Dimitri

>
>
> On 5/12/09 10:53 AM, "Alvaro Herrera" <> wrote:
>
>> Andres Freund escribió:
>>
>>> Naturally it would still be nice to be good in this not optimal
>>> workload...
>>
>> I find it hard to justify wasting our scarce development resources into
>> optimizing such a contrived workload.
>>
>> --
>> Alvaro Herrera
>> http://www.CommandPrompt.com/
>> The PostgreSQL Company - Command Prompt, Inc.
>>
>> --
>> Sent via pgsql-performance mailing list ()
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance
>>
>
>

Attachment
From:
Dimitri
Date:

I'm also confused, but the discussion seems to be producing other ideas
as well :-)
But yes, each client connects to the database server only *once*.

To be precise, the test runs like this:
 - 1 client is started  => 1 in total
 - sleep ...
 - another client is started  => 2 in total
 - sleep ..
 - 2 more clients are started => 4 in total
 - sleep ..
 ...
 ... =======> 256 in total
 - sleep ...
 - kill clients

So I'm even able to monitor how each new client impacts all the others.
The test kit is quite flexible and can prepare any kind of stress
situation.
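As a rough sketch of that ramp-up (my own simplified toy harness, not the actual db_STRESS kit - the function names and the counter standing in for server-side TPS accounting are mine):

```python
import itertools
import threading
import time

query_count = itertools.count()   # stand-in for server-side TPS accounting
stop = threading.Event()

def client():
    # each client "connects" once, then hammers the server on that single
    # connection until the test ends
    while not stop.is_set():
        next(query_count)         # stand-in for one SQL round trip
        time.sleep(0.001)         # small think time between queries

def ramp_up(max_clients=8, window=0.05):
    """Start 1 client, then double the population (1, 2, 4, ...) with a
    measurement window at each level, up to max_clients; then kill all."""
    threads, total, n = [], 0, 1
    while total < max_clients:
        for _ in range(n - total):
            t = threading.Thread(target=client, daemon=True)
            t.start()
            threads.append(t)
        total = n
        time.sleep(window)        # observe the impact at this client count
        n *= 2
    stop.set()                    # kill clients
    for t in threads:
        t.join()
    return total

assert ramp_up() == 8
```

Because each level gets its own measurement window, the harness can watch how every doubling of the population affects the clients already running.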

Rgds,
-Dimitri

On 5/12/09, Glenn Maynard <> wrote:
> I'm sorry, but I'm confused.  Everyone keeps talking about connection
> pooling, but Dimitri has said repeatedly that each client makes a
> single connection and then keeps it open until the end of the test,
> not that it makes a single connection per SQL query.  Connection
> startup costs shouldn't be an issue.  Am I missing something here?
> test(N) starts N clients, each client creates a single connection and
> hammers the server for a while on that connection.  test(N) is run for
> N=1,2,4,8...256.  This seems like a very reasonable test scenario.
>
> --
> Glenn Maynard
>
>

From:
Dimitri
Date:

On MySQL there is no change if I set the number of sessions in the
config file to 400 or to 2000 - for 2000 it'll just allocate more
memory.

After the latest fix with default_statistics_target=5, version 8.3.7 is
running as fast as 8.4; 8.4 is even a tiny bit slower.

I understand your position on a pooler, but I also want you to consider
that 128-core systems will become commodity servers very soon, and to
use those cores at their full power you'll need a database engine
capable of running 256 users without a pooler, because a pooler will not
help you there anymore..

Rgds,
-Dimitri

On 5/12/09, Robert Haas <> wrote:
> On Tue, May 12, 2009 at 11:22 AM, Dimitri <> wrote:
>> Robert, what I'm testing now is 256 users max. The workload is growing
>> progressively from 1, 2, 4, 8 ... to 256 users. Of course the Max
>> throughput is reached on the number of users equal to 2 * number of
>> cores, but what's important for me here - database should continue to
>> keep the workload! - response time regressing, but the troughput
>> should remain near the same.
>>
>> So, do I really need a pooler to keep 256 users working??  - I don't
>> think so, but please, correct me.
>
> Not an expert on this, but there has been a lot of discussion of the
> importance of connection pooling in this space.  Is MySQL still faster
> if you lower max_connections to a value that is closer to the number
> of users, like 400 rather than 2000?
>
>> BTW, I did not look to put PostgreSQL in bad conditions - the test is
>> the test, and as I said 2 years ago PostgreSQL outperformed MySQL on
>> the same test case, and there was nothing done within MySQL code to
>> improve it explicitly for db_STRESS.. And I'm staying pretty honest
>> when I'm testing something.
>
> Yeah but it's not really clear what that something is.  I believe you
> said upthread that PG 8.4 beta 1 is faster than PG 8.3.7, but PG 8.4
> beta 1 is slower than MySQL 5.4 whereas PG 8.3.7 was faster than some
> older version of MySQL.  So PG got faster and MySQL got faster, but
> they sped things up more than we did.  If our performance were getting
> WORSE, I'd be worried about that, but the fact that they were able to
> make more improvement on this particular case than we were doesn't
> excite me very much.  Sure, I'd love it if PG were even faster than it
> is, and if you have a suggested patch please send it in...  or if you
> want to profile it and send the results that would be great too.  But
> I guess my point is that the case of a very large number of
> simultaneous users with pauses-for-thought between queries has already
> been looked at in the very recent past in a way that's very similar to
> what you are doing (and by someone who works at the same company you
> do, no less!) so I'm not quite sure why we're rehashing the issue.
>
> ...Robert
>

From:
Robert Haas
Date:

On Tue, May 12, 2009 at 11:18 AM, Tom Lane <> wrote:
> Matthew Wakeling <> writes:
>> On Tue, 12 May 2009, Simon Riggs wrote:
>>> No, we spawn then authenticate.
>
>> But you still have a single thread doing the accept() and spawn. At some
>> point (maybe not now, but in the future) this could become a bottleneck
>> given very short-lived connections.
>
> More to the point, each backend process is a pretty heavyweight object:
> it is a process, not a thread, and it's not going to be good for much
> until it's built up a reasonable amount of stuff in its private caches.
> I don't think the small number of cycles executed in the postmaster
> process amount to anything at all compared to the other overhead
> involved in getting a backend going.

AIUI, whenever the connection pooler switches to serving a new client,
it tells the PG backend to DISCARD ALL.  But why couldn't we just
implement this same logic internally?  IOW, when a client disconnects,
instead of having the backend exit immediately, have it perform the
equivalent of DISCARD ALL and then stick around for a minute or two
and, if a new connection request arrives within that time, have the
old backend handle the new connection...

(There is the problem of how to get the file descriptor returned by
the accept() call in the parent process down to the child... but I
think that at least on some UNIXen there is a way to pass an fd
through a socket, or even dup it into another process by opening it
from /proc/fd)
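For illustration, here is a minimal sketch of descriptor passing over a UNIX socket via SCM_RIGHTS - a toy single-process demo, not PostgreSQL code; the helper names are mine:

```python
import array
import os
import socket

def send_fd(sock, fd):
    # SCM_RIGHTS ancillary data makes the kernel install a duplicate of
    # fd in the receiver's descriptor table
    sock.sendmsg([b"x"],
                 [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                   array.array("i", [fd]))])

def recv_fd(sock):
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(
        1, socket.CMSG_SPACE(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:fds.itemsize])
            return fds[0]
    raise RuntimeError("no descriptor received")

# toy demo: hand the read end of a pipe across a socketpair
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
r, w = os.pipe()
send_fd(parent, r)
received = recv_fd(child)

os.write(w, b"hello")
assert os.read(received, 5) == b"hello"   # the passed fd works
```

In the real postmaster case, `parent` and `child` would be separate processes, and the payload byte could carry metadata (such as the requested database) alongside the descriptor.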

...Robert

From:
"Kevin Grittner"
Date:

Glenn Maynard <> wrote:
> I'm sorry, but I'm confused.  Everyone keeps talking about
> connection pooling, but Dimitri has said repeatedly that each client
> makes a single connection and then keeps it open until the end of
> the test, not that it makes a single connection per SQL query.
> Connection startup costs shouldn't be an issue.  Am I missing
> something here?

Quite aside from the overhead of spawning new processes, if you have
more active connections than you have resources for them to go after,
you just increase context switching and resource contention, both of
which have some cost, without any offsetting gains.  That would tend
to explain why performance tapers off after a certain point.  A
connection pool which queues requests prevents this degradation.

It would be interesting, with each of the CPU counts, to profile
PostgreSQL at the peak of each curve to see where the time goes when
it is operating with an optimal poolsize.  Tapering after that point
is rather uninteresting, and profiles would be less useful beyond that
point, as the noise from the context switching and resource contention
would make it harder to spot issues that really matter..

-Kevin

From:
Dimitri
Date:

The idea is good, but pooling *alone* will not be enough. I mean, if all
the pooler does is keep no more than N backends working, it'll not be
enough. You never know what exactly your query will do - if you choose
your N value so as not to overload the CPU, and then some of your
queries start to read from disk, you waste your idle CPU time, because
it was still possible to run other queries requiring CPU time rather
than I/O, etc...

I wrote some ideas about an "ideal" solution here (just omit the word
"mysql" - as it's a theory, it's valid for any db engine):
http://dimitrik.free.fr/db_STRESS_MySQL_540_and_others_Apr2009.html#note_5442

Rgds,
-Dimitri

On 5/13/09, Kevin Grittner <> wrote:
> Glenn Maynard <> wrote:
>> I'm sorry, but I'm confused.  Everyone keeps talking about
>> connection pooling, but Dimitri has said repeatedly that each client
>> makes a single connection and then keeps it open until the end of
>> the test, not that it makes a single connection per SQL query.
>> Connection startup costs shouldn't be an issue.  Am I missing
>> something here?
>
> Quite aside from the overhead of spawning new processes, if you have
> more active connections than you have resources for them to go after,
> you just increase context switching and resource contention, both of
> which have some cost, without any offsetting gains.  That would tend
> to explain why performance tapers off after a certain point.  A
> connection pool which queues requests prevents this degradation.
>
> It would be interesting, with each of the CPU counts, to profile
> PostgreSQL at the peak of each curve to see where the time goes when
> it is operating with an optimal poolsize.  Tapering after that point
> is rather uninteresting, and profiles would be less useful beyond that
> point, as the noise from the context switching and resource contention
> would make it harder to spot issues that really matter..
>
> -Kevin
>
>

From:
Scott Carey
Date:

On 5/13/09 3:22 AM, "Dimitri" <> wrote:

> Hi Scott,
>
> On 5/12/09, Scott Carey <> wrote:
>> Although nobody wants to support it, he should try the patch that Jignesh K.
>> Shah (from Sun) proposed that makes ProcArrayLock lighter-weight.  If it
>> makes 32 cores much faster, then we have a smoking gun.
>>
>> Although everyone here is talking about this as an 'unoptimal' solution, the
>> fact is there is no evidence that a connection pooler will fix the
>> scalability from 16 > 32 cores.
>> Certainly a connection pooler will help most results, but it may not fix the
>> scalability problem.
>>
>> A question for Dimitri:
>> What is the scalability from 16 > 32 cores at the 'peak' load that occurs
>> near 2x the CPU count?  Is it also poor?  If this is also poor, IMO the
>> community here should not be complaining about this unopimal case -- a
>> connection pooler at that stage does little and prepared statements will
>> increase throughput but not likely alter scalability.
>
> I'm attaching a small graph showing a TPS level on PG 8.4 depending on
> number of cores (X-axis is a number of concurrent users, Y-axis is the
> TPS number). As you may see TPS increase is near linear while moving
> from 8 to 16 cores, while on 32cores even it's growing slightly
> differently, what is unclear is why TPS level is staying limited to
> 11.000 TPS on 32cores. And it's pure read-only workload.
>

Interesting.  What hardware is this, btw? Looks like the 32 core system
probably has 2x the CPU and a bit less interconnect efficiency versus the 16
core one (which would be typical).
Is the 16 core case the same, but with fewer cores per processor active?  Or
fewer processors total?
Understanding the scaling difference may require a better understanding of
the other differences besides core count.

>>
>> If that result scales, then the short term answer is a connection pooler.
>>
>> In the tests that Jingesh ran -- making the ProcArrayLock faster helped the
>> case where connections = 2x the CPU core count quite a bit.
>>
>> The thread about the CPU scalability is "Proposal of tunable fix for
>> scalability of 8.4", originally posted by "Jignesh K. Shah"
>> <>, March 11 2009.
>>
>> It would be very useful to see results of this benchmark with:
>> 1. A Connection Pooler
>
> will not help, as each client is *not* disconnecting/reconnecting
> during the test, as well PG is keeping well even 256 users. And TPS
> limit is reached already on 64 users, don't think pooler will help
> here.
>

Actually, it might help a little.  Postgres has a flaw that makes backends
block on a lock briefly based on the number of total backends -- active or
completely passive.  Your tool has some (very small) user-side delay and a
connection pooler would probably allow 64 of your users to efficiently 'fit'
in 48 or so connection pooler slots.

It is not about connecting and disconnecting in this case; it's about
minimizing Postgres' process count.  If this does help, it would hint at
certain bottlenecks.  If it doesn't, it would point elsewhere (and quiet
some critics).

However, it's unrealistic for any process-per-connection system to have
fewer backends than about 2x the core count -- otherwise any waiting on
I/O or network will just starve the CPU.  So this would just be done for
research, not as a real answer to making it scale better.

For those who say "but, what if it's I/O bound!   You don't need more
backends then!":   Well, you don't need more CPU either if you're I/O
bound.  By definition, CPU scaling tests imply the I/O can keep up.


>> 2. Jignesh's patch
>
> I've already tested it and it did not help in my case because the real
> problem is elsewhere.. (however, I did not test it yet with my latest
> config params)
>

Great to hear that! -- That means this case is probably not ProcArrayLock.
If it's Solaris, could we get:
1. The CPU stats when it is in the inefficient state near 64 or 128
concurrent users (vmstat, etc.   I'm interested in CPU
user/system/idle/wait time, and context switches/sec mostly).
2.  A DTrace probe on the postgres locks -- we might be able to identify
something here.

The results here would be useful -- if it's an expected condition in the
planner or parser, it would be useful confirmation.  If it's something
unexpected and easy to fix -- it might be changed relatively soon.

If it's not easy to detect, it could be many other things -- but the
process above at least rules some things out and better characterizes
the state.

>> 3. Prepared statements
>>
>
> yes, I'm preparing this test.
>
>> #3 is important, because prepared statements are ideal for queries that
>> perform well with low statistics_targets, and not ideal for those that
>> require high statistics targets.  Realistically, an app won't have more than
>> a couple dozen statement forms to prepare.  Setting the default statistics
>> target to 5 is just a way to make some other query perform bad.
>
> Agree, but as you may have a different statistic target *per* table it
> should not be a problem. What is sure - all time spent on parse and
> planner will be removed here, and the final time should be a pure
> execution.
>

I'm definitely interested here because although pure execution will
certainly be faster, it may not scale any better.


> Rgds,
> -Dimitri
>


From:
"Kevin Grittner"
Date:

Dimitri <> wrote:
> The idea is good, but *only* pooling will be not enough. I mean if
> all what pooler is doing is only keeping no more than N backends
> working - it'll be not enough. You never know what exactly your
> query will do - if you choose your N value to be sure to not
> overload CPU and then some of your queries start to read from disk -
> you waste your idle CPU time because it was still possible to run
> other queries requiring CPU time rather I/O, etc...

I never meant to imply that CPUs were the only resources which
mattered.  Network and disk I/O certainly come into play.  I would
think that various locks might count.  You have to benchmark your
actual workload to find the sweet spot for your load on your hardware.
I've usually found it to be around (2 * cpu count) + (effective
spindle count), where the effective spindle count is determined not only
by your RAID configuration but also by your access pattern.  (If
everything is fully cached, and you have no write delays because of a
BBU RAID controller with write-back, the effective spindle count is
zero.)

Since the curve generally falls off more slowly past the sweet spot
than it climbs to get there, I tend to go a little above the apparent
sweet spot to protect against bad performance in a different load mix
than my tests.
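As a sketch of that rule of thumb (the function name is mine):

```python
def suggested_pool_size(cpu_count, effective_spindles):
    """Kevin Grittner's rule of thumb for the connection-pool sweet spot.

    effective_spindles depends on the access pattern, not just the RAID
    layout: a fully cached read-only load (or writes absorbed by a
    write-back BBU controller) makes it effectively zero.
    """
    return 2 * cpu_count + effective_spindles

# the read-only, fully cached benchmark discussed in this thread:
assert suggested_pool_size(32, 0) == 64
# a disk-bound mix on the same box with ~12 effective spindles:
assert suggested_pool_size(32, 12) == 76
```

Going slightly above the computed value, as suggested above, guards against a load mix that differs from the one benchmarked.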

> I wrote some ideas about an "ideal" solution here (just omit the
> word "mysql" - as it's a theory it's valable for any db engine):
>
http://dimitrik.free.fr/db_STRESS_MySQL_540_and_others_Apr2009.html#note_5442

I've seen similar techniques used in other databases, and I'm far from
convinced that it's ideal or optimal.

-Kevin

From:
Dimitri Fontaine
Date:

Hi,

Le 13 mai 09 à 18:42, Scott Carey a écrit :
>> will not help, as each client is *not* disconnecting/reconnecting
>> during the test, as well PG is keeping well even 256 users. And TPS
>> limit is reached already on 64 users, don't think pooler will help
>> here.
>
> Actually, it might help a little.  Postgres has a flaw that makes
> backends
> block on a lock briefly based on the number of total backends --
> active or
> completely passive.  Your tool has some (very small) user-side delay
> and a
> connection pooler would probably allow 64 of your users to
> efficiently 'fit'
> in 48 or so connection pooler slots.

It seems you have think time, and I'm only reinforcing what Scott said,
but having think time means a connection pool can help. Pgbouncer is a
good choice because it won't even attempt to parse the queries, and it
has a flexible configuration.

>>> 3. Prepared statements
>> yes, I'm preparing this test.

It's possible to use prepared statements and benefit from pgbouncer at
the same time, but up until now it has required the application to test
whether its statements are already prepared at connection time, because
the application does not control when pgbouncer reuses an existing
backend or hands it a fresh one.

As I think I need this solution too, I've coded a PG module to scratch
that itch this morning, and just published it (BSD licenced) on
pgfoundry:
   http://preprepare.projects.postgresql.org/README.html
   http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/preprepare/preprepare/

With this module and the proper pgbouncer setup (connect_query='SELECT
prepare_all();') the application no longer has to special-case the
fresh-backend-nothing-prepared situation; it's all transparent, just
replace your SELECT query with its EXECUTE foo(x, y, z) counterpart.

I've taken the approach of setting up the prepared statements themselves
in a table with columns name and statement, the latter containing the
full PREPARE SQL command. There's a custom variable preprepare.relation
that has to be set to your table name (schema qualified). Each statement
that you then put in there will get prepared when you SELECT
prepare_all();

Hope this helps, regards,
--
dim

From:
Simon Riggs
Date:

On Tue, 2009-05-12 at 14:28 +0200, Dimitri wrote:

> As problem I'm considering a scalability issue on Read-Only workload -
> only selects, no disk access, and if on move from 8 to 16 cores we
> gain near 100%, on move from 16 to 32 cores it's only 10%...

Dimitri,

Will you be re-running the Read-Only tests?

Can you run the Dtrace script to assess LWlock contention during the
run?

Would you re-run the tests with a patch?

Thanks,

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Dimitri
Date:

Folks, sorry, I've been a little overtaken by events :-))

I've finished the tests with PREPARE/EXECUTE - it's much faster of
course, and the max TPS is now 15,000 on 24 cores!  I've done various
tests to see where the limiting bottleneck may be - it's more likely
something timer or interrupt based, etc. Nothing special via DTrace, or
probably it'll tell you more than it tells me, but for a 10 sec period
the wait time is quite small:

# lwlock_wait_8.4.d `pgrep -n postgres`

               Lock Id            Mode                Count
   FirstBufMappingLock       Exclusive                    1
      FirstLockMgrLock       Exclusive                    1
       BufFreelistLock       Exclusive                    3
   FirstBufMappingLock          Shared                    4
      FirstLockMgrLock          Shared                    4

               Lock Id            Mode   Combined Time (ns)
      FirstLockMgrLock       Exclusive               803700
       BufFreelistLock       Exclusive              3001600
      FirstLockMgrLock          Shared               4586600
   FirstBufMappingLock       Exclusive              6283900
   FirstBufMappingLock          Shared              21792900

At the same time, those lock waits appear only on 24 or 32 cores.
I plan to replay this case on a bigger server (64 cores or more) - it'll
be much more evident whether the problem is in the locks.

Currently I'm finishing my report with all the data you asked for
(system graphs, pgsql, and other). I'll publish it on my web site and
send you a link.

Rgds,
-Dimitri

On 5/14/09, Simon Riggs <> wrote:
>
> On Tue, 2009-05-12 at 14:28 +0200, Dimitri wrote:
>
>> As problem I'm considering a scalability issue on Read-Only workload -
>> only selects, no disk access, and if on move from 8 to 16 cores we
>> gain near 100%, on move from 16 to 32 cores it's only 10%...
>
> Dimitri,
>
> Will you be re-running the Read-Only tests?
>
> Can you run the Dtrace script to assess LWlock contention during the
> run?
>
> Would you re-run the tests with a patch?
>
> Thanks,
>
> --
>  Simon Riggs           www.2ndQuadrant.com
>  PostgreSQL Training, Services and Support
>
>

From:
Dimitri
Date:

It's absolutely great!
It'll not help here because the think time is 0,
but for any kind of solution with a pooler it's a must-try!

Rgds,
-Dimitri

On 5/13/09, Dimitri Fontaine <> wrote:
> Hi,
>
> Le 13 mai 09 à 18:42, Scott Carey a écrit :
>>> will not help, as each client is *not* disconnecting/reconnecting
>>> during the test, as well PG is keeping well even 256 users. And TPS
>>> limit is reached already on 64 users, don't think pooler will help
>>> here.
>>
>> Actually, it might help a little.  Postgres has a flaw that makes
>> backends
>> block on a lock briefly based on the number of total backends --
>> active or
>> completely passive.  Your tool has some (very small) user-side delay
>> and a
>> connection pooler would probably allow 64 of your users to
>> efficiently 'fit'
>> in 48 or so connection pooler slots.
>
> It seems you have think time, and I'm only insisting on what Scott
> said, but having thinktime means a connection pool can help. Pgbouncer
> is a good choice because it won't even attempt to parse the queries,
> and it has a flexible configuration.
>
>>>> 3. Prepared statements
>>> yes, I'm preparing this test.
>
> It's possible to use prepared statement and benefit from pgbouncer at
> the same time, but up until now it requires the application to test
> whether its statements are already prepared at connection time,
> because the application is not controlling when pgbouncer is reusing
> an existing backend or giving it a fresh one.
>
> As I think I need this solution too, I've coded a PG module to scratch
> that itch this morning, and just published it (BSD licenced) on
> pgfoundry:
>    http://preprepare.projects.postgresql.org/README.html
>    http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/preprepare/preprepare/
>
> With this module and the proper pgbouncer setup (connect_query='SELECT
> prepare_all();') the application has no more to special case the fresh-
> backend-nothing-prepared case, it's all transparent, just replace your
> SELECT query with its EXECUTE foo(x, y, z) counter part.
>
> I've took the approach to setup the prepared statements themselves
> into a table with columns name and statement, this latter one
> containing the full PREPARE SQL command. There's a custom variable
> preprepare.relation that has to be your table name (shema qualified).
> Each statement that you then put in there will get prepared when you
> SELECT prepare_all();
>
> Hope this helps, regards,
> --
> dim

From:
Dimitri
Date:

Hi Scott,

let me now finish my report and regroup all the data together, and then
we'll continue the discussion as it moves more into the debug/profile
phase.. - it would not be polite of me to send tons of attachments to
the mailing list :-)

Rgds,
-Dimitri

On 5/13/09, Scott Carey <> wrote:
>
> On 5/13/09 3:22 AM, "Dimitri" <> wrote:
>
>> Hi Scott,
>>
>> On 5/12/09, Scott Carey <> wrote:
>>> Although nobody wants to support it, he should try the patch that Jignesh
>>> K.
>>> Shah (from Sun) proposed that makes ProcArrayLock lighter-weight.  If it
>>> makes 32 cores much faster, then we have a smoking gun.
>>>
>>> Although everyone here is talking about this as an 'unoptimal' solution,
>>> the
>>> fact is there is no evidence that a connection pooler will fix the
>>> scalability from 16 > 32 cores.
>>> Certainly a connection pooler will help most results, but it may not fix
>>> the
>>> scalability problem.
>>>
>>> A question for Dimitri:
>>> What is the scalability from 16 > 32 cores at the 'peak' load that occurs
>>> near 2x the CPU count?  Is it also poor?  If this is also poor, IMO the
>>> community here should not be complaining about this unopimal case -- a
>>> connection pooler at that stage does little and prepared statements will
>>> increase throughput but not likely alter scalability.
>>
>> I'm attaching a small graph showing a TPS level on PG 8.4 depending on
>> number of cores (X-axis is a number of concurrent users, Y-axis is the
>> TPS number). As you may see TPS increase is near linear while moving
>> from 8 to 16 cores, while on 32cores even it's growing slightly
>> differently, what is unclear is why TPS level is staying limited to
>> 11.000 TPS on 32cores. And it's pure read-only workload.
>>
>
> Interesting.  What hardware is this, btw? Looks like the 32 core system
> probably has 2x the CPU and a bit less interconnect efficiency versus the 16
> core one (which would be typical).
> Is the 16 core case the same, but with fewer cores per processor active?  Or
> fewer processors total?
> Understanding the scaling difference may require a better understanding of
> the other differences besides core count.
>
>>>
>>> If that result scales, then the short term answer is a connection pooler.
>>>
>>> In the tests that Jingesh ran -- making the ProcArrayLock faster helped
>>> the
>>> case where connections = 2x the CPU core count quite a bit.
>>>
>>> The thread about the CPU scalability is "Proposal of tunable fix for
>>> scalability of 8.4", originally posted by "Jignesh K. Shah"
>>> <>, March 11 2009.
>>>
>>> It would be very useful to see results of this benchmark with:
>>> 1. A Connection Pooler
>>
>> will not help, as each client is *not* disconnecting/reconnecting
>> during the test, as well PG is keeping well even 256 users. And TPS
>> limit is reached already on 64 users, don't think pooler will help
>> here.
>>
>
> Actually, it might help a little.  Postgres has a flaw that makes backends
> block on a lock briefly based on the number of total backends -- active or
> completely passive.  Your tool has some (very small) user-side delay and a
> connection pooler would probably allow 64 of your users to efficiently 'fit'
> in 48 or so connection pooler slots.
>
> It is not about connecting and disconnecting in this case, it's about
> minimizing Postgres' process count.  If this does help, it would hint at
> certain bottlenecks.  If it doesn't, it would point elsewhere (and quiet some
> critics).
>
> However, it's unrealistic for any process-per-connection system to have fewer
> backends than about 2x the core count -- else any waiting on I/O or network
> will just starve the CPU.  So this would just be done for research, not as a
> real answer to making it scale better.
>
> For those who say "but, what if it's I/O bound!  You don't need more
> backends then!":  Well, you don't need more CPU either if you're I/O bound.
> By definition, CPU scaling tests imply the I/O can keep up.
>
>
>>> 2. Jignesh's patch
>>
>> I've already tested it, and it did not help in my case because the real
>> problem is elsewhere (however, I have not yet tested it with my latest
>> config params).
>>
>
> Great to hear that! -- That means this case is probably not ProcArrayLock.
> If it's Solaris, could we get:
> 1. The CPU stats when it is in the inefficient state near 64 or 128
> concurrent users (vmstat, etc.  I'm interested mostly in CPU
> user/system/idle/wait time, and context switches/sec).
> 2. A DTrace probe on the postgres locks -- we might be able to identify
> something here.
>
> The results here would be useful -- if it's an expected condition in the
> planner or parser, it would be useful confirmation.  If it's something
> unexpected and easy to fix -- it might be changed relatively soon.
>
> If it's not easy to detect, it could be many other things -- but the process
> above at least rules some things out and better characterizes the state.
>
>>> 3. Prepared statements
>>>
>>
>> yes, I'm preparing this test.
>>
>>> #3 is important, because prepared statements are ideal for queries that
>>> perform well with low statistics targets, and not ideal for those that
>>> require high statistics targets.  Realistically, an app won't have more
>>> than a couple dozen statement forms to prepare.  Setting the default
>>> statistics target to 5 is just a way to make some other query perform
>>> badly.
>>
>> Agreed, but since you may set a different statistics target *per* table it
>> should not be a problem. What is sure is that all time spent on parse and
>> planning will be removed here, and the final time should be pure
>> execution.
>>
>
> I'm definitely interested here because although pure execution will
> certainly be faster, it may not scale any better.
>
>
>> Rgds,
>> -Dimitri
>>
>
>

From:
Simon Riggs
Date:

On Thu, 2009-05-14 at 20:25 +0200, Dimitri wrote:

> # lwlock_wait_8.4.d `pgrep -n postgres`

>               Lock Id        Mode   Combined Time (ns)
>      FirstLockMgrLock   Exclusive               803700
>       BufFreelistLock   Exclusive              3001600
>      FirstLockMgrLock      Shared              4586600
>   FirstBufMappingLock   Exclusive              6283900
>   FirstBufMappingLock      Shared             21792900

I've published two patches to -Hackers to see if we can improve the read
only numbers on 32+ cores.

Try shared_buffer_partitions = 256

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Simon Riggs
Date:

On Wed, 2009-05-13 at 23:23 +0200, Dimitri Fontaine wrote:

> As I think I need this solution too, I've coded a PG module to
> scratch
> that itch this morning, and just published it (BSD licenced) on
> pgfoundry:
>    http://preprepare.projects.postgresql.org/README.html
>    http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/preprepare/preprepare/

Looks very cool Dimitri

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Dimitri
Date:

Folks, I've just published a full report including all results here:
http://dimitrik.free.fr/db_STRESS_PostgreSQL_837_and_84_May2009.html

From my point of view, the first thing needed is to understand where the
time is wasted on a single query (even when the statement is prepared it
still runs slower compared to MySQL).

Then, to investigate the scalability issue, I think a bigger server will
be needed (I'm looking for 64 cores at least :-))

If you have some other ideas or patches (like Simon's), don't hesitate
to send them - once I get access to the server again, the available
test time will be very limited.

Best regards!
-Dimitri


On 5/18/09, Simon Riggs <> wrote:
>
> On Thu, 2009-05-14 at 20:25 +0200, Dimitri wrote:
>
>> # lwlock_wait_8.4.d `pgrep -n postgres`
>
>>                Lock Id            Mode   Combined Time (ns)
>>       FirstLockMgrLock       Exclusive                 803700
>>        BufFreelistLock       Exclusive                 3001600
>>       FirstLockMgrLock          Shared               4586600
>>  FirstBufMappingLock       Exclusive              6283900
>>  FirstBufMappingLock          Shared             21792900
>
> I've published two patches to -Hackers to see if we can improve the read
> only numbers on 32+ cores.
>
> Try shared_buffer_partitions = 256
>
> --
>  Simon Riggs           www.2ndQuadrant.com
>  PostgreSQL Training, Services and Support
>
>

From:
Simon Riggs
Date:

On Mon, 2009-05-18 at 20:00 +0200, Dimitri wrote:

> >From my point of view it needs first to understand where the time is
> wasted on a single query (even when the statement is prepared it runs
> still slower comparing to MySQL).

There is still a significant number of things to say about these numbers
and much tuning still to do, so I'm still confident of improving those
numbers if we needed to.

In particular, running the tests repeatedly using
    H.REF_OBJECT = '0000000001'
rather than varying the value seems likely to benefit MySQL. The
distribution of values is clearly non-linear; while Postgres picks a
strange plan for that particular value, I would guess there are also
values for which the MySQL plan is sub-optimal. Depending upon the
distribution of selected data we might see the results go either way.

What I find worrying is your result of a scalability wall for hash
joins. Is that a repeatable issue?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Scott Carey
Date:

Great data Dimitri!

I see a few key trends in the poor scalability:

The throughput scales roughly with %CPU fairly well.  But CPU used doesn't
go past ~50% on the 32 core tests.  This indicates lock contention.

Other proof of lock contention are the mutex locks / sec graph which climbs
rapidly as the system gets more inefficient (along with context switches).

Another trend is the system calls/sec, which caps out at about 400,000
per sec at the peak (non-prepared statement) result.  Note that when the
buffer size is 256MB, the performance scales much worse and is slower.
Correlated with this, the system calls/sec per transaction is more than
double, at slower throughput.

Using the OS to cache pages is not as fast as pages in shared_buffers, by a
more significant amount with many cores and higher concurrency than in the
low concurrency case.

The system is largely lock limited in the poor scaling results.  This holds
true with or without the use of prepared statements -- which help some,
but not a lot, and don't affect the scalability.


4096MB shared buffers, 32 cores, 8.4, read only:
http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_70.html

256MB cache, 32 cores, 8.4, read-only:
http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_52.html

4096MB shared buffs, 32 cores, 8.4, read only, prepared statements
http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_70.html

On 5/18/09 11:00 AM, "Dimitri" <> wrote:

> Folks, I've just published a full report including all results here:
> http://dimitrik.free.fr/db_STRESS_PostgreSQL_837_and_84_May2009.html
>
> From my point of view it needs first to understand where the time is
> wasted on a single query (even when the statement is prepared it runs
> still slower comparing to MySQL).
>
> Then to investigate on scalability issue I think a bigger server will
> be needed here (I'm looking for 64cores at least :-))
>
> If  you have some other ideas or patches (like Simon) - don't hesitate
> to send them - once I'll get an access to the server again the
> available test time will be very limited..
>
> Best regards!
> -Dimitri
>
>
> On 5/18/09, Simon Riggs <> wrote:
>>
>> On Thu, 2009-05-14 at 20:25 +0200, Dimitri wrote:
>>
>>> # lwlock_wait_8.4.d `pgrep -n postgres`
>>
>>>                Lock Id            Mode   Combined Time (ns)
>>>       FirstLockMgrLock       Exclusive                 803700
>>>        BufFreelistLock       Exclusive                 3001600
>>>       FirstLockMgrLock          Shared               4586600
>>>  FirstBufMappingLock       Exclusive              6283900
>>>  FirstBufMappingLock          Shared             21792900
>>
>> I've published two patches to -Hackers to see if we can improve the read
>> only numbers on 32+ cores.
>>
>> Try shared_buffer_partitions = 256
>>
>> --
>>  Simon Riggs           www.2ndQuadrant.com
>>  PostgreSQL Training, Services and Support
>>
>>
>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>


From:
"Dave Dutcher"
Date:

What I don't understand is the part where you talk about disabling hash
joins:

>    * result: the planner's hash join is replaced by a merge join
>    * execution time: 0.84ms !
>    * NOTE: curiously, the planner expects to execute this query in 0.29ms
> - so by its own logic this plan is supposed to be faster; why is it not
> used from the beginning?...
>
>     Sort  (cost=4562.83..4568.66 rows=2329 width=176) (actual
time=0.237..0.237 rows=20 loops=1)
>         Sort Key: h.horder
>         Sort Method:  quicksort  Memory: 30kB
>         ->  Merge Join  (cost=4345.89..4432.58 rows=2329 width=176)
(actual time=0.065..0.216 rows=20 loops=1)
>               Merge Cond: (s.ref = h.ref_stat)
>               ->  Index Scan using stat_ref_idx on stat s
(cost=0.00..49.25 rows=1000 width=45) (actual time=0.018..0.089 rows=193
loops=1)
>               ->  Sort  (cost=4345.89..4351.72 rows=2329 width=135)
(actual time=0.042..0.043 rows=20 loops=1)
>                     Sort Key: h.ref_stat
>                     Sort Method:  quicksort  Memory: 30kB
>                     ->  Index Scan using history_ref_idx on history h
(cost=0.00..4215.64 rows=2329 width=135) (actual time=0.012..0.025 rows=20
loops=1)
>                           Index Cond: (ref_object = '0000000001'::bpchar)
>     Total runtime: 0.288 ms
>    (12 rows)

The explain analyze ran the query in 0.288 ms.  That is the actual time it
took to run the query on the server.  It is not an estimate of the time.
You measured 0.84 ms to run the query, which seems to imply either a problem
in one of the timing methods or that 66% of your query execution time is
spent sending the results to the client.  I'm curious how you did your
execution time measurements.
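As a quick cross-check of that 66% figure (simple arithmetic on the two
reported timings, nothing more):

```python
# Server-side time reported by EXPLAIN ANALYZE vs. time seen at the client.
server_ms = 0.288   # "Total runtime" from EXPLAIN ANALYZE
client_ms = 0.84    # time measured at the client

# Everything outside server-side execution: parse/plan, protocol,
# and result transfer.
overhead_ms = client_ms - server_ms
overhead_fraction = overhead_ms / client_ms

print(f"{overhead_ms:.3f} ms ({overhead_fraction:.0%}) spent outside execution")
```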

Dave


From:
Dimitri
Date:

Thanks Dave for the correction, but I'm also curious where the time is
wasted in this case...

The 0.84ms is displayed by "psql" once the result output is printed, and I
got a similar time within my client (using libpq), which does not print
any output.

Rgds,
-Dimitri

On 5/18/09, Dave Dutcher <> wrote:
>
> What I don't understand is the part where you talking about disabling hash
> joins:
>
>>    * result: the planner's hash join is replaced by a merge join
>>    * execution time: 0.84ms !
>>    * NOTE: curiously planner is expecting to execute this query in 0.29ms
> - so it's supposed from its logic to be faster, so why this plan is not used
> from the beginning???...
>>
>>     Sort  (cost=4562.83..4568.66 rows=2329 width=176) (actual
> time=0.237..0.237 rows=20 loops=1)
>>         Sort Key: h.horder
>>         Sort Method:  quicksort  Memory: 30kB
>>         ->  Merge Join  (cost=4345.89..4432.58 rows=2329 width=176)
> (actual time=0.065..0.216 rows=20 loops=1)
>>               Merge Cond: (s.ref = h.ref_stat)
>>               ->  Index Scan using stat_ref_idx on stat s
> (cost=0.00..49.25 rows=1000 width=45) (actual time=0.018..0.089 rows=193
> loops=1)
>>               ->  Sort  (cost=4345.89..4351.72 rows=2329 width=135)
> (actual time=0.042..0.043 rows=20 loops=1)
>>                     Sort Key: h.ref_stat
>>                     Sort Method:  quicksort  Memory: 30kB
>>                     ->  Index Scan using history_ref_idx on history h
> (cost=0.00..4215.64 rows=2329 width=135) (actual time=0.012..0.025 rows=20
> loops=1)
>>                           Index Cond: (ref_object = '0000000001'::bpchar)
>>     Total runtime: 0.288 ms
>>    (12 rows)
>
> The explain analyze ran the query in 0.288 ms.  That is the actual time it
> took to run the query on the server.  It is not an estimate of the time.
> You measured 0.84 ms to run the query, which seems to imply either a problem
> in one of the timing methods or that 66% of your query execution time is
> sending the results to the client.  I'm curious how you did you execution
> time measurements.
>
> Dave
>
>

From:
Dimitri
Date:

On 5/18/09, Scott Carey <> wrote:
> Great data Dimitri!'

Thank you! :-)

>
> I see a few key trends in the poor scalability:
>
> The throughput scales roughly with %CPU fairly well.  But CPU used doesn't
> go past ~50% on the 32 core tests.  This indicates lock contention.
>

You should not look at the #1 STATs but at #2 - those are all with the
latest "fixes" - and on all of them the CPU is used well (90% in the
picture on 32 cores).
Also, keep in mind these cores each have 2 threads, and from the Solaris
point of view they are seen as CPUs (so 64 CPUs), so %busy is accounted
over 64 CPUs.

> Other proof of lock contention are the mutex locks / sec graph which climbs

Exactly, except that no locking was seen on the processes while I tried
to trace them. What I think will be needed here is a global and
correlated tracing of all PG processes - I did not expect to do it now,
but next time.

> rapidly as the system gets more inefficient (along with context switches).
>
> Another trend is the system calls/sec which caps out with the test, at about
> 400,000 per sec on the peak (non-prepared statement) result.  Note that when
> the buffer size is 256MB, the performance scales much worse and is slower.
> And correlated with this the system calls/sec per transaction is more than
> double, at slower throughput.

Of course - even when the data are cached by the filesystem, to get
them you still need to issue a read() system call.
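A toy illustration of the difference (a sketch only - real buffer access
in PG involves far more machinery, but the kernel-boundary crossing is
the point):

```python
import os
import tempfile

PAGE = 8192  # PostgreSQL page size

# Write one "page" to a file; on re-read it is served from the OS page
# cache, but each access still crosses the kernel boundary.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * PAGE)

# "Filesystem cache" path: an lseek + read system call per page access.
os.lseek(fd, 0, os.SEEK_SET)
page_from_os = os.read(fd, PAGE)

# "shared_buffers" path: the page is already in process memory, so the
# access is a plain memory read with no syscall at all.
shared_buffers = {0: b"x" * PAGE}
page_from_sb = shared_buffers[0]

assert page_from_os == page_from_sb
os.close(fd)
os.remove(path)
```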

>
> Using the OS to cache pages is not as fast as pages in shared_buffers, by a
> more significant amount with many cores and higher concurrency than in the
> low concurrency case.


Exactly - that's what I also wanted to demonstrate, because I often hear
"PG is delegating caching to the filesystem" - and I don't think it's
optimal :-)

>
> The system is largely lock limited in the poor scaling results.  This holds
> true with or without the use of prepared statements -- which help a some,
> but not a lot and don't affect the scalability.

We agree here, but again - 20K mutex spins/sec is quite a low value;
that's why I hope that on the bigger server it will be clearer where the
bottleneck is :-)

Rgds,
-Dimitri


>
>
> 4096MB shared buffers, 32 cores, 8.4, read only:
> http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_70.html
>
> 256MB cache, 32 cores, 8.4, read-only:
> http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_52.html
>
> 4096MB shared buffs, 32 cores, 8.4, read only, prepared statements
> http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_70.html
>
> On 5/18/09 11:00 AM, "Dimitri" <> wrote:
>
>> Folks, I've just published a full report including all results here:
>> http://dimitrik.free.fr/db_STRESS_PostgreSQL_837_and_84_May2009.html
>>
>> From my point of view it needs first to understand where the time is
>> wasted on a single query (even when the statement is prepared it runs
>> still slower comparing to MySQL).
>>
>> Then to investigate on scalability issue I think a bigger server will
>> be needed here (I'm looking for 64cores at least :-))
>>
>> If  you have some other ideas or patches (like Simon) - don't hesitate
>> to send them - once I'll get an access to the server again the
>> available test time will be very limited..
>>
>> Best regards!
>> -Dimitri
>>
>>
>> On 5/18/09, Simon Riggs <> wrote:
>>>
>>> On Thu, 2009-05-14 at 20:25 +0200, Dimitri wrote:
>>>
>>>> # lwlock_wait_8.4.d `pgrep -n postgres`
>>>
>>>>                Lock Id            Mode   Combined Time (ns)
>>>>       FirstLockMgrLock       Exclusive                 803700
>>>>        BufFreelistLock       Exclusive                 3001600
>>>>       FirstLockMgrLock          Shared               4586600
>>>>  FirstBufMappingLock       Exclusive              6283900
>>>>  FirstBufMappingLock          Shared             21792900
>>>
>>> I've published two patches to -Hackers to see if we can improve the read
>>> only numbers on 32+ cores.
>>>
>>> Try shared_buffer_partitions = 256
>>>
>>> --
>>>  Simon Riggs           www.2ndQuadrant.com
>>>  PostgreSQL Training, Services and Support
>>>
>>>
>>
>>
>
>

From:
Dimitri
Date:

On 5/18/09, Simon Riggs <> wrote:
>
> On Mon, 2009-05-18 at 20:00 +0200, Dimitri wrote:
>
>> >From my point of view it needs first to understand where the time is
>> wasted on a single query (even when the statement is prepared it runs
>> still slower comparing to MySQL).
>
> There is still a significant number of things to say about these numbers
> and much tuning still to do, so I'm still confident of improving those
> numbers if we needed to.
>
> In particular, running the tests repeatedly using
>     H.REF_OBJECT = '0000000001'
> rather than varying the value seems likely to benefit MySQL. The

Let me repeat again - the reference is *random*; the '0000000001' value
was used just to show a query execution plan.

Also, what is important: the random ID is chosen in such a way that no
two users use the same one, to avoid deadlocks previously seen with
PostgreSQL (see the "Deadlock mystery" note from 2 years ago:
http://dimitrik.free.fr/db_STRESS_BMK_Part1.html#note_4355 )
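For illustration, one way such collision-free random references can be
generated (a hypothetical sketch - not necessarily how db_STRESS actually
does it) is to give each concurrent user a disjoint slice of the key space:

```python
import random

N_REFS = 10_000_000   # REF_OBJECT values '0000000001' .. '0010000000'
N_USERS = 64          # concurrent sessions in the test

def random_ref(user_id: int, rng: random.Random) -> str:
    """Pick a random reference from this user's private slice of the key
    space, so two users can never hit the same REF_OBJECT."""
    slice_size = N_REFS // N_USERS
    lo = user_id * slice_size + 1
    return f"{rng.randint(lo, lo + slice_size - 1):010d}"

ref_a = random_ref(0, random.Random(42))   # always in 0000000001..0000156250
ref_b = random_ref(1, random.Random(42))   # always in 0000156251..0000312500
```

Since the slices never overlap, no two sessions can lock the same rows,
which sidesteps the deadlock scenario entirely.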

> distribution of values is clearly non-linear; while Postgres picks a
> strange plan for that particular value, I would guess there are also
> values for which the MySQL plan is sub-optimal. Depending upon the
> distribution of selected data we might see the results go either way.
>
> What I find worrying is your result of a scalability wall for hash
> joins. Is that a repeatable issue?

I think so (but of course I did not try to replay it several times).

Rgds,
-Dimitri


>
> --
>  Simon Riggs           www.2ndQuadrant.com
>  PostgreSQL Training, Services and Support
>
>

From:
Tom Lane
Date:

Simon Riggs <> writes:
> In particular, running the tests repeatedly using
>     H.REF_OBJECT = '0000000001'
> rather than varying the value seems likely to benefit MySQL.

... mumble ... query cache?

            regards, tom lane

From:
Scott Carey
Date:

On 5/18/09 3:32 PM, "Dimitri" <> wrote:

> On 5/18/09, Scott Carey <> wrote:
>> Great data Dimitri!'
>
> Thank you! :-)
>
>>
>> I see a few key trends in the poor scalability:
>>
>> The throughput scales roughly with %CPU fairly well.  But CPU used doesn't
>> go past ~50% on the 32 core tests.  This indicates lock contention.
>>
>
> You should not look on #1 STATs, but on #2 - they are all with the
> latest "fixes"  - on all of them CPU is used well (90% in pic on
> 32cores).
> Also, keep in mind these cores are having 2 threads, and from Solaris
> point of view they are seen as CPU (so 64 CPU) and %busy is accounted
> as for 64 CPU
>

Well, if the CPU usage is actually higher, then it might not be lock waiting
-- it could be spin locks or context switches or cache coherency overhead.
Postgres may also not be very SMT friendly, at least on the hardware tested
here.

(what was the context switch rate?  I didn't see that in the data, just
mutex spins).

The scalability curve is definitely showing something.  Prepared statements
were tried, as were most of the other suggestions other than one:

What happens if the queries are more complicated (say, they take 15ms
server side, with a more complicated plan required)?  That is a harder
question to answer.


From:
Simon Riggs
Date:

On Tue, 2009-05-19 at 00:33 +0200, Dimitri wrote:
> >
> > In particular, running the tests repeatedly using
> >     H.REF_OBJECT = '0000000001'
> > rather than varying the value seems likely to benefit MySQL. The
>
> let me repeat again - the reference is *random*,
> the '0000000001' value I've used just to show a query execution
> plan.
>
> also, what is important - the random ID is chosen in way that no one
> user use the same to avoid deadlocks previously seen with PostgreSQL
> (see the "Deadlock mystery" note 2 years ago
> http://dimitrik.free.fr/db_STRESS_BMK_Part1.html#note_4355 )

OK, didn't pick up on that.

(Like Tom, I was thinking query cache)

Can you comment on the distribution of values for that column? If you
are picking randomly, this implies distribution is uniform and so I am
surprised we are mis-estimating the selectivity.

> I think yes (but of course I did not try to replay it several times)

If you could that would be appreciated. We don't want to go chasing
after something that is not repeatable.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Dimitri
Date:

No, Tom, the query cache was off.
I always explicitly turn it off on MySQL, as it has scalability issues.

Rgds,
-Dimitri

On 5/19/09, Tom Lane <> wrote:
> Simon Riggs <> writes:
>> In particular, running the tests repeatedly using
>>     H.REF_OBJECT = '0000000001'
>> rather than varying the value seems likely to benefit MySQL.
>
> ... mumble ... query cache?
>
>             regards, tom lane
>

From:
Dimitri
Date:

On 5/19/09, Scott Carey <> wrote:
>
> On 5/18/09 3:32 PM, "Dimitri" <> wrote:
>
>> On 5/18/09, Scott Carey <> wrote:
>>> Great data Dimitri!'
>>
>> Thank you! :-)
>>
>>>
>>> I see a few key trends in the poor scalability:
>>>
>>> The throughput scales roughly with %CPU fairly well.  But CPU used
>>> doesn't
>>> go past ~50% on the 32 core tests.  This indicates lock contention.
>>>
>>
>> You should not look on #1 STATs, but on #2 - they are all with the
>> latest "fixes"  - on all of them CPU is used well (90% in pic on
>> 32cores).
>> Also, keep in mind these cores are having 2 threads, and from Solaris
>> point of view they are seen as CPU (so 64 CPU) and %busy is accounted
>> as for 64 CPU
>>
>
> Well, if the CPU usage is actually higher, then it might not be lock waiting
> -- it could be spin locks or context switches or cache coherency overhead.
> Postgres may also not be very SMT friendly, at least on the hardware tested
> here.

Do you mean SMP or CMT? ;-)
However, both should work well with PostgreSQL. I'm also thinking about
CPU affinity - it may help to avoid CPU cache misses - but it makes
sense mostly if a pooler is added as a part of PG.

>
> (what was the context switch rate?  I didn't see that in the data, just
> mutex spins).

It's increasing with the load, as in this example:
http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_100.html#bmk_CPU_CtxSwitch_100


>
> The scalability curve is definitely showing something.  Prepared statements
> were tried, as were most of the other suggestions other than one:
>
> What happens if the queries are more complicated (say, they take 15ms server
> side with a more complicated plan required)?  That is a harder question to
> answer

What I observed is: if the planner takes longer (as initially with
8.3.7 and analyze target 1000), the scalability problem appears even
stranger - http://dimitrik.free.fr/Report_20090505/5521_dim_STAT_18.html
- as you see, the CPU is not even used more than 60%, and spin locks are
decreasing - the CPUs are not spinning for locks, there is something
else...
I suspect a synchronization problem of some kind - background processes
not waking up on time, or something like that...
Then, if more time is spent on the query execution itself rather than
the planner:
 - if it's I/O time, I/O will hide everything else until you increase
storage performance and/or add more RAM, but then you come back to the
initial issue :-)
 - if it's CPU time, it may be interesting! :-)

Rgds,
-Dimitri

From:
Dimitri
Date:

On 5/19/09, Simon Riggs <> wrote:
>
> On Tue, 2009-05-19 at 00:33 +0200, Dimitri wrote:
>> >
>> > In particular, running the tests repeatedly using
>> >     H.REF_OBJECT = '0000000001'
>> > rather than varying the value seems likely to benefit MySQL. The
>>
>> let me repeat again - the reference is *random*,
>> the '0000000001' value I've used just to show a query execution
>> plan.
>>
>> also, what is important - the random ID is chosen in way that no one
>> user use the same to avoid deadlocks previously seen with PostgreSQL
>> (see the "Deadlock mystery" note 2 years ago
>> http://dimitrik.free.fr/db_STRESS_BMK_Part1.html#note_4355 )
>
> OK, didn't pick up on that.
>
> (Like Tom, I was thinking query cache)
>
> Can you comment on the distribution of values for that column? If you
> are picking randomly, this implies distribution is uniform and so I am
> surprised we are mis-estimating the selectivity.

Yes, the distribution of reference values is uniform between
'0000000001' and '0010000000' (10M values), with only one OBJECT row per
reference, and only 20 rows per reference in the HISTORY table.
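The arithmetic behind those numbers, spelled out (just restating the
figures above):

```python
history_rows  = 200_000_000   # rows in HISTORY
distinct_refs = 10_000_000    # uniform REF_OBJECT values

# True selectivity per reference under a uniform distribution:
rows_per_ref = history_rows // distinct_refs   # 20, matching the actual rows

# The 8.4 plans shown earlier estimated ~2404 rows for the index
# condition - roughly 120x the true figure.
overestimate = 2404 / rows_per_ref
```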

>
>> I think yes (but of course I did not try to replay it several times)
>
> If you could that would be appreciated. We don't want to go chasing
> after something that is not repeatable.

I'll retry and let you know.

Rgds,
-Dimitri

From:
Simon Riggs
Date:

On Mon, 2009-05-18 at 19:00 -0400, Tom Lane wrote:
> Simon Riggs <> writes:
> > In particular, running the tests repeatedly using
> >     H.REF_OBJECT = '0000000001'
> > rather than varying the value seems likely to benefit MySQL.

One thing to note in terms of optimisation of this query is that we
perform a top-level sort at the end of the query.

Both plans for this query show an IndexScan on a two column-index, with
an Index Condition of equality on the leading column. The ORDER BY
specifies a sort by the second index column, so the top-level Sort is
superfluous in this case.

My understanding is that we don't currently eliminate superfluous
additional sorts of this kind. Now I know that is a hard subject, but it
seems straightforward to consider interesting sort order equivalence
when we have constant equality constraints.

My guess would be that MySQL does do this sort removal in its latest
version.

Dimitri's EXPLAIN ANALYZEs show differing costs for that additional
step, but the roughly 10% of query time it takes looks shaveable.
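For reference, a sketch of the situation (the index definition is assumed
here - it was not posted in this part of the thread, though both plans
name `history_ref_idx`):

```sql
-- Assumed two-column index matching the plans shown:
CREATE INDEX history_ref_idx ON history (ref_object, horder);

-- With the leading column pinned by an equality condition, the index
-- scan already returns rows ordered by HORDER, so in principle no
-- top-level Sort is needed for:
SELECT ...
  FROM history h
 WHERE h.ref_object = '0000000001'
 ORDER BY h.horder;
```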

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Matthew Wakeling
Date:

On Tue, 19 May 2009, Simon Riggs wrote:
> Both plans for this query show an IndexScan on a two column-index, with
> an Index Condition of equality on the leading column. The ORDER BY
> specifies a sort by the second index column, so the top-level Sort is
> superfluous in this case.
>
> My understanding is that we don't currently eliminate superfluous
> additional sorts of this kind. Now I know that is a hard subject, but it
> seems straightforward to consider interesting sort order equivalence
> when we have constant equality constraints.

Yes, Postgres has been missing the boat on this one for a while. +1 on
requesting this feature.

Speaking of avoiding large sorts, I'd like to push again for partial
sorts. This is the situation where an index provides data sorted by column
"a", and the query requests data sorted by "a, b". Currently, Postgres
sorts the entire data set, whereas it need only group each set of
identical "a" and sort each by "b".

Matthew

--
 Riker: Our memory pathways have become accustomed to your sensory input.
 Data:  I understand - I'm fond of you too, Commander. And you too Counsellor

From:
Simon Riggs
Date:

On Tue, 2009-05-19 at 12:17 +0100, Matthew Wakeling wrote:
> Yes, Postgres has been missing the boat on this one for a while. +1 on
> requesting this feature.

That's an optimizer feature.

> Speaking of avoiding large sorts, I'd like to push again for partial
> sorts. This is the situation where an index provides data sorted by
> column "a", and the query requests data sorted by "a, b". Currently,
> Postgres sorts the entire data set, whereas it need only group each
> set of identical "a" and sort each by "b".

This is an executor feature.

Partially sorted data takes much less effort to sort (OK, not zero, I
grant) so this seems like a high complexity, lower value feature. I
agree it should be on the TODO, just IMHO at a lower priority than some
other features.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Simon Riggs
Date:

On Tue, 2009-05-19 at 12:36 +0100, Simon Riggs wrote:

> Partially sorted data takes much less effort to sort (OK, not zero, I
> grant) so this seems like a high complexity, lower value feature. I
> agree it should be on the TODO, just IMHO at a lower priority than some
> other features.

Perhaps it's worth looking at a hybrid merge-join/hash-join that can cope
with data that is only mostly sorted rather than fully sorted. That way we
can probably skip the partial sort altogether.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Dimitri
Date:

I can confirm the issue with the hash join - it repeats both with
prepared and non-prepared statements. It's curious because initially
the response time drops to ~1ms (the lowest seen until now), and then
once the workload grows to 16 sessions it jumps to 2.5ms, then with 32
sessions to 18ms, etc.

I've retested on 24 isolated cores, so any external secondary effects
are avoided.

Rgds,
-Dimitri

On 5/19/09, Dimitri <> wrote:
> On 5/19/09, Simon Riggs <> wrote:
>>
>> On Tue, 2009-05-19 at 00:33 +0200, Dimitri wrote:
>>> >
>>> > In particular, running the tests repeatedly using
>>> >     H.REF_OBJECT = '0000000001'
>>> > rather than varying the value seems likely to benefit MySQL. The
>>>
>>> let me repeat again - the reference is *random*,
>>> the '0000000001' value I've used just to show a query execution
>>> plan.
>>>
>>> also, what is important - the random ID is chosen in way that no one
>>> user use the same to avoid deadlocks previously seen with PostgreSQL
>>> (see the "Deadlock mystery" note 2 years ago
>>> http://dimitrik.free.fr/db_STRESS_BMK_Part1.html#note_4355 )
>>
>> OK, didn't pick up on that.
>>
>> (Like Tom, I was thinking query cache)
>>
>> Can you comment on the distribution of values for that column? If you
>> are picking randomly, this implies distribution is uniform and so I am
>> surprised we are mis-estimating the selectivity.
>
> yes, the distribution of reference values is uniform between
> '0000000001' to '0010000000' (10M), only one OBJECT row by one
> reference, and only 20 rows with the same reference in HISTORY table.
>
>>
>>> I think yes (but of course I did not try to replay it several times)
>>
>> If you could that would be appreciated. We don't want to go chasing
>> after something that is not repeatable.
>
> I'll retry and let you know.
>
> Rgds,
> -Dimitri
>

From:
Matthew Wakeling
Date:

On Tue, 19 May 2009, Simon Riggs wrote:
>> Speaking of avoiding large sorts, I'd like to push again for partial
>> sorts. This is the situation where an index provides data sorted by
>> column "a", and the query requests data sorted by "a, b". Currently,
>> Postgres sorts the entire data set, whereas it need only group each
>> set of identical "a" and sort each by "b".
>
> Partially sorted data takes much less effort to sort (OK, not zero, I
> grant) so this seems like a high complexity, lower value feature. I
> agree it should be on the TODO, just IMHO at a lower priority than some
> other features.

Not arguing with you, however I'd like to point out that partial sorting
allows the results to be streamed, which would lower the cost to produce
the first row of results significantly, and reduce the amount of RAM used
by the query, and prevent temporary tables from being used. That has to be
a fairly major win. Queries with a LIMIT would see the most benefit.
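To illustrate (a Python toy, not executor code): since the input is already ordered by "a", a partial sort only ever needs to buffer one group of equal "a" values, sort that group by "b", and stream it out. Memory is bounded by the largest group, not the whole result set:

```python
from itertools import groupby
from operator import itemgetter

def partial_sort(rows):
    """rows: iterable of (a, b) tuples already sorted by a.

    Buffer one run of equal 'a' values at a time, sort that run
    by 'b', and stream it out -- results can be produced before
    the whole input has been consumed."""
    for _, run in groupby(rows, key=itemgetter(0)):
        yield from sorted(run, key=itemgetter(1))
```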

That leads me on to another topic. Consider the query:

SELECT * FROM table ORDER BY a, b

where the column "a" is declared UNIQUE and has an index. Does Postgres
eliminate "b" from the ORDER BY, and therefore allow fetching without
sorting from the index?

Or how about this query:

SELECT * FROM table1, table2 WHERE table1.fk = table2.id ORDER BY
     table1.id, table2.id

where both "id" columns are UNIQUE with an index. Do we eliminate
"table2.id" from the ORDER BY in this case?

Matthew

--
"Programming today is a race between software engineers striving to build
 bigger and better idiot-proof programs, and the Universe trying to produce
 bigger and better idiots. So far, the Universe is winning."  -- Rich Cook

From:
Tom Lane
Date:

Simon Riggs <> writes:
> Both plans for this query show an IndexScan on a two column-index, with
> an Index Condition of equality on the leading column. The ORDER BY
> specifies a sort by the second index column, so the top-level Sort is
> superfluous in this case.

> My understanding is that we don't currently eliminate superfluous
> additional sorts of this kind.

Nonsense.  The planner might think some other plan is cheaper, but
it definitely knows how to do this, and has since at least 8.1.

            regards, tom lane

From:
Simon Riggs
Date:

On Tue, 2009-05-19 at 14:00 +0200, Dimitri wrote:

> I may confirm the issue with hash join - it's repeating both with
> prepared and not prepared statements - it's curious because initially
> the response time is lowering near ~1ms (the lowest seen until now)
> and then once workload growing to 16 sessions it's jumping to 2.5ms,
> then with 32 sessions it's 18ms, etc..

Is it just bad all the time, or does it get worse over time?

Do you get the same behaviour as 32 sessions if you run 16 sessions for
twice as long?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Merlin Moncure
Date:

On Mon, May 18, 2009 at 6:32 PM, Dimitri <> wrote:
> Thanks Dave for correction, but I'm also curious where the time is
> wasted in this case?..
>
> 0.84ms is displayed by "psql" once the result output is printed, and I
> got similar time within my client (using libpq) which is not printing
> any output..

Using libpq?  What is the exact method you are using to execute
queries...PQexec?  If you are preparing queries against libpq, the
best way to execute queries is via PQexecPrepared.  Also, it's
interesting to see if you can get any benefit from asynchronous
queries (PQsendPrepared), but this might involve more changes to your
application than you are willing to make.

Another note: I would like to point out again that there are possible
negative side effects in using char(n) vs. varchar(n) that IIRC do not
exist in mysql.  When you repeat your test I strongly advise switching
to varchar.

Another question: how exactly are you connecting to the database?
local machine? if so, domain socket or tcp/ip?  What are you doing
with the results...immediately discarding?

One last thing: when you get access to the server, can you run a
custom format query test from pgbench and compare the results to your
test similarly configured (same number of backends, etc) in terms of
tps?

merlin

From:
Robert Haas
Date:

On May 19, 2009, at 7:36 AM, Simon Riggs <> wrote:

>
> On Tue, 2009-05-19 at 12:17 +0100, Matthew Wakeling wrote:
>> Yes, Postgres has been missing the boat on this one for a while. +1
>> on
>> requesting this feature.
>
> That's an optimizer feature.
>
>> Speaking of avoiding large sorts, I'd like to push again for partial
>> sorts. This is the situation where an index provides data sorted by
>> column "a", and the query requests data sorted by "a, b". Currently,
>> Postgres sorts the entire data set, whereas it need only group each
>> set of identical "a" and sort each by "b".
>
> This is an executor feature.
>
> Partially sorted data takes much less effort to sort (OK, not zero, I
> grant) so this seems like a high complexity, lower value feature. I
> agree it should be on the TODO, just IMHO at a lower priority than
> some
> other features.

I have no particular thoughts on priority (whose priority?), but I
will say I've run across queries that could benefit from this
optimization.  I fairly often write queries where the first key is
mostly unique and the second is just to make things deterministic in
the event of a tie.  So the partial sort would be almost no work at all.

...Robert


>
>
> --
> Simon Riggs           www.2ndQuadrant.com
> PostgreSQL Training, Services and Support
>
>
> --
> Sent via pgsql-performance mailing list (
> )
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

From:
Dimitri
Date:

The response time is not degrading progressively, it's simply jumping; it's
likely that from 16 sessions on there is some sort of serialization
happening somewhere.. Also, at 16 sessions the throughput in TPS is near
the same as at 8 (response time is only twice as big for the moment), but
at 32 it drops down dramatically..

Rgds,
-Dimitri


On 5/19/09, Simon Riggs <> wrote:
>
> On Tue, 2009-05-19 at 14:00 +0200, Dimitri wrote:
>
>> I may confirm the issue with hash join - it's repeating both with
>> prepared and not prepared statements - it's curious because initially
>> the response time is lowering near ~1ms (the lowest seen until now)
>> and then once workload growing to 16 sessions it's jumping to 2.5ms,
>> then with 32 sessions it's 18ms, etc..
>
> Is it just bad all the time, or does it get worse over time?
>
> Do you get the same behaviour as 32 sessions if you run 16 sessions for
> twice as long?
>
> --
>  Simon Riggs           www.2ndQuadrant.com
>  PostgreSQL Training, Services and Support
>
>

From:
Dimitri
Date:

On 5/19/09, Merlin Moncure <> wrote:
> On Mon, May 18, 2009 at 6:32 PM, Dimitri <> wrote:
>> Thanks Dave for correction, but I'm also curious where the time is
>> wasted in this case?..
>>
>> 0.84ms is displayed by "psql" once the result output is printed, and I
>> got similar time within my client (using libpq) which is not printing
>> any output..
>
> Using libpq?  What is the exact method you are using to execute
> queries...PQexec?

exactly

> If you are preparing queries against libpq, the
> best way to execute queries is via PQexecPrepared.

the query is prepared *once* via PQexec,
then it loops on "execute" via PQexec.
Why would PQexecPrepared be better in my case?..

> Also, it's
> interesting to see if you can get any benefit from asynchronous
> queries (PQsendPrepared), but this might involve more changes to your
> application than you are willing to make.
>
> Another note: I would like to point out again that there are possible
> negative side effects in using char(n) vs. varchar(n) that IIRC do not
> exist in mysql.  When you repeat your test I strongly advise switching
> to varchar.

if it's true in every case, why not just replace the CHAR implementation
with VARCHAR directly within the PG code?..

>
> Another question: how exactly are you connecting to the database?
> local machine? if so, domain socket or tcp/ip?

local TCP/IP, same as MySQL

>  What are you doing
> with the results...immediately discarding?

on the libpq side they are immediately discarded once all rows are fetched

>
> One last thing: when you get access to the server, can you run a
> custom format query test from pgbench and compare the results to your
> test similarly configured (same number of backends, etc) in terms of
> tps?

I'll try


Rgds,
-Dimitri

From:
Simon Riggs
Date:

On Tue, 2009-05-19 at 08:58 -0400, Tom Lane wrote:
> Simon Riggs <> writes:
> > Both plans for this query show an IndexScan on a two column-index, with
> > an Index Condition of equality on the leading column. The ORDER BY
> > specifies a sort by the second index column, so the top-level Sort is
> > superfluous in this case.
>
> > My understanding is that we don't currently eliminate superfluous
> > additional sorts of this kind.
>
> Nonsense.  The planner might think some other plan is cheaper, but
> it definitely knows how to do this, and has since at least 8.1.

Please look at Dimitri's plan. If it can remove the pointless sort, why
does it not do so?

I agree that it will remove a Sort when the data already has the
exact same interesting sort order. In this case the sort order is not
exactly the same, but it looks fully removable to me.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Simon Riggs
Date:

On Tue, 2009-05-19 at 13:01 +0100, Matthew Wakeling wrote:

> That leads me on to another topic. Consider the query:
>
> SELECT * FROM table ORDER BY a, b
>
> where the column "a" is declared UNIQUE and has an index. Does Postgres
> eliminate "b" from the ORDER BY, and therefore allow fetching without
> sorting from the index?

No, because we don't use unique constraints much at all to infer things.

> Or how about this query:
>
> SELECT * FROM table1, table2 WHERE table1.fk = table2.id ORDER BY
>      table1.id, table2.id
>
> where both "id" columns are UNIQUE with an index. Do we eliminate
> "table2.id" from the ORDER BY in this case?

Yes, that is eliminated via equivalence classes.
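For illustration, the pruning rule being asked about could be sketched like this (Python, purely illustrative -- this is not how the planner represents it): once the columns seen so far in the ORDER BY list cover some unique key, rows are already fully determined, so every later sort key is redundant.

```python
def prune_order_by(sort_keys, unique_keys):
    """sort_keys: ORDER BY column names, in order.
    unique_keys: list of sets; each set of columns is known UNIQUE.

    Keep sort keys until the columns seen so far contain a unique
    key; everything after that point cannot change the order."""
    seen = set()
    kept = []
    for col in sort_keys:
        kept.append(col)
        seen.add(col)
        if any(uk <= seen for uk in unique_keys):
            break
    return kept
```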

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Scott Carey
Date:

On 5/19/09 5:01 AM, "Matthew Wakeling" <> wrote:

> On Tue, 19 May 2009, Simon Riggs wrote:
>>> Speaking of avoiding large sorts, I'd like to push again for partial
>>> sorts. This is the situation where an index provides data sorted by
>>> column "a", and the query requests data sorted by "a, b". Currently,
>>> Postgres sorts the entire data set, whereas it need only group each
>>> set of identical "a" and sort each by "b".
>>
>> Partially sorted data takes much less effort to sort (OK, not zero, I
>> grant) so this seems like a high complexity, lower value feature. I
>> agree it should be on the TODO, just IMHO at a lower priority than some
>> other features.
>
> Not arguing with you, however I'd like to point out that partial sorting
> allows the results to be streamed, which would lower the cost to produce
> the first row of results significantly, and reduce the amount of RAM used
> by the query, and prevent temporary tables from being used. That has to be
> a fairly major win. Queries with a LIMIT would see the most benefit.
>

I will second that point --
Although for smaller sorts the partial sort doesn't help much and just adds
complication, once the sort is large it significantly reduces the amount of
work_mem needed, for a large performance gain and a large gain in concurrent
query scalability.
And those benefits occur without using LIMIT.



From:
Scott Carey
Date:

On 5/19/09 3:46 AM, "Dimitri" <> wrote:

> On 5/19/09, Scott Carey <> wrote:
>>
>> On 5/18/09 3:32 PM, "Dimitri" <> wrote:
>>
>>> On 5/18/09, Scott Carey <> wrote:
>>>> Great data Dimitri!'
>>>
>>> Thank you! :-)
>>>
>>>>
>>>> I see a few key trends in the poor scalability:
>>>>
>>>> The throughput scales roughly with %CPU fairly well.  But CPU used
>>>> doesn't
>>>> go past ~50% on the 32 core tests.  This indicates lock contention.
>>>>
>>>
>>> You should not look on #1 STATs, but on #2 - they are all with the
>>> latest "fixes"  - on all of them CPU is used well (90% in pic on
>>> 32cores).
>>> Also, keep in mind these cores are having 2 threads, and from Solaris
>>> point of view they are seen as CPU (so 64 CPU) and %busy is accounted
>>> as for 64 CPU
>>>
>>
>> Well, if the CPU usage is actually higher, then it might not be lock waiting
>> -- it could be spin locks or context switches or cache coherency overhead.
>> Postgres may also not be very SMT friendly, at least on the hardware tested
>> here.
>
> do you mean SMP or CMT? ;-)
> however both should work well with PostgreSQL. I also think about CPU
> affinity - probably it may help to avoid CPU cache misses - but makes
> sense mostly if pooler will be added as a part of PG.

Simultaneous Multi-Threading (Hyper-Threading in Intel's marketing terms;
Sun and IBM have their own marketing names for it): one CPU core that can
handle more than one concurrently executing thread.
Technically, 'SMT' allows instructions from multiple threads to be in
flight at once in a superscalar CPU core, while some implementations differ
and are technically CMT (one thread or the other at a time, but able to
switch fast, or a non-superscalar core).

In many implementations of 'multiple threads on one CPU core', many of the
processor resources are reduced per thread when both are active -- caches
get split, instruction re-order buffers are split, etc.  That is rather
hardware implementation dependent.

For Intel's SMT (and other similar), spin-locks hurt scalability if they
aren't using new special instructions for the spin to yield pipeline slots
to the other thread.

Generally, code that stresses common processor resources more than CPU
execution will scale poorly with SMT/CMT etc.

So I'm not sure about the Postgres details, but the general case of an
application that doesn't benefit from these technologies exists, and there
is a non-zero chance that Postgres has some characteristics of such an app.

>>
>> (what was the context switch rate?  I didn't see that in the data, just
>> mutex spins).
>
> increasing with a load, as this ex.:
> http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_100.html#bmk_CPU_CtxSwitch_100
>

Well, on most systems over 100K context switches/sec is a lot.  And those
reach 180000 /sec.
However, this is 'only' 10 context switches per transaction and less than
20% system CPU, so maybe those numbers aren't quite as big as they seem.

Out of curiosity, what was the context switch rate for MySql at its peak
throughput?
>
>>
>> The scalability curve is definitely showing something.  Prepared statements
>> were tried, as were most of the other suggestions other than one:
>>
>> What happens if the queries are more complicated (say, they take 15ms server
>> side with a more complicated plan required)?  That is a harder question to
>> answer
>
> What I observed is: if planner takes more long time (like initially
> with 8.3.7 and analyze target 1000) the scalability problem is
> appearing more strange -
> http://dimitrik.free.fr/Report_20090505/5521_dim_STAT_18.html - as you
> see CPU even not used more than 60% , and as you may see spin locks
> are lowering - CPUs are not spinning for locks, there is something
> else..
> I'm supposing a problem of some kind of synchronization - background
> processes are not waking up on time or something like this...
> Then, if more time spent on the query execution itself and not planner:
>  - if it'll be I/O time - I/O will hide everything else until you
> increase a storage performance and/or add more RAM, but then you come
> back to the initial issue :-)
>  - if it'll be a CPU time it may be interesting! :-)
>
> Rgds,
> -Dimitri
>

Ok, so that's good info that the planner or parser side seems to scale less
effectively than the execution (as the results show), but I'm wondering
about queries with longer execution times, not longer planner times.  I'm
wondering that because it's my opinion that most applications that will use
larger-scale hardware will have more complicated queries than your test.
It's also greedy on my part, since most queries in my applications are
significantly more complicated.
Regardless of my opinions -- this test is on one extreme (small, fast
queries) of the spectrum.  It's useful to know some data points on other
parts of the spectrum.


From:
Merlin Moncure
Date:

On Tue, May 19, 2009 at 11:53 AM, Dimitri <> wrote:
> On 5/19/09, Merlin Moncure <> wrote:
>> On Mon, May 18, 2009 at 6:32 PM, Dimitri <> wrote:
>>> Thanks Dave for correction, but I'm also curious where the time is
>>> wasted in this case?..
>>>
>>> 0.84ms is displayed by "psql" once the result output is printed, and I
>>> got similar time within my client (using libpq) which is not printing
>>> any output..
>>
>> Using libpq?  What is the exact method you are using to execute
>> queries...PQexec?
>
> exactly
>
>> If you are preparing queries against libpq, the
>> best way to execute queries is via PQexecPrepared.
>
> the query is *once* prepared via PQexec,
> then it's looping with "execute" via PQexec.
> Why PQexecPrepared will be better in my case?..

It can be better or worse (usually better).  The parameters are
separated from the query string.  Regardless of performance, the
parametrized interfaces are superior for any queries taking arguments
and should be used when possible.

>> Another note: I would like to point out again that there are possible
>> negative side effects in using char(n) vs. varchar(n) that IIRC do not
>> exist in mysql.  When you repeat your test I strongly advise switching
>> to varchar.
>
> if it's true for any case, why not just replace CHAR implementation by
> VARCHAR directly within PG code?..

First, let me explain the difference.  char(n) is padded out to 'n' on
disk and when returned.  despite this, the length is still stored so
there is no real advantage to using the char(n) type except that the
returned string is of a guaranteed length.   mysql, at least the
particular version and storage engine that I am logged into right now,
does not do this for char(n).  In other words, select cast('abc' as
char(50)) returns a string of 50 chars on pgsql and 3 chars on mysql.
I will leave it as an exercise to the reader to figure out who is
following the standard.  pg's handling of the situation is not
necessarily optimal, but we just tell everyone to quit using 'char(n)'
type.

Unless for example your 'NOTE' column is mostly full or mostly null,
your query is not fair because postgres has to both store and return a
proportionally greater amount of data.  This makes the comparison
hardly apples to apples.   This stuff counts when we are measuring at
microsecond level.
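To make the difference concrete, here is a toy sketch (Python, purely illustrative) of the two behaviours for `select cast('abc' as char(50))`:

```python
def pg_char_cast(s, n):
    """Standard-conforming CHAR(n): pad with spaces to exactly n chars."""
    return s.ljust(n)[:n]

def mysql_char_cast(s, n):
    """The observed MySQL behaviour: trailing padding is not returned."""
    return s[:n].rstrip()

# pgsql returns 50 chars, mysql returns 3:
assert len(pg_char_cast('abc', 50)) == 50
assert len(mysql_char_cast('abc', 50)) == 3
```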

>> Another question: how exactly are you connecting to the database?
>> local machine? if so, domain socket or tcp/ip?
>
> local TCP/IP, same as MySQL

would be curious to see if you get different results from domain socket.

merlin

From:
Dimitri
Date:

On 5/19/09, Scott Carey <> wrote:
>
> On 5/19/09 3:46 AM, "Dimitri" <> wrote:
>
>> On 5/19/09, Scott Carey <> wrote:
>>>
>>> On 5/18/09 3:32 PM, "Dimitri" <> wrote:
>>>
>>>> On 5/18/09, Scott Carey <> wrote:
>>>>> Great data Dimitri!'
>>>>
>>>> Thank you! :-)
>>>>
>>>>>
>>>>> I see a few key trends in the poor scalability:
>>>>>
>>>>> The throughput scales roughly with %CPU fairly well.  But CPU used
>>>>> doesn't
>>>>> go past ~50% on the 32 core tests.  This indicates lock contention.
>>>>>
>>>>
>>>> You should not look on #1 STATs, but on #2 - they are all with the
>>>> latest "fixes"  - on all of them CPU is used well (90% in pic on
>>>> 32cores).
>>>> Also, keep in mind these cores are having 2 threads, and from Solaris
>>>> point of view they are seen as CPU (so 64 CPU) and %busy is accounted
>>>> as for 64 CPU
>>>>
>>>
>>> Well, if the CPU usage is actually higher, then it might not be lock
>>> waiting
>>> -- it could be spin locks or context switches or cache coherency
>>> overhead.
>>> Postgres may also not be very SMT friendly, at least on the hardware
>>> tested
>>> here.
>>
>> do you mean SMP or CMT? ;-)
>> however both should work well with PostgreSQL. I also think about CPU
>> affinity - probably it may help to avoid CPU cache misses - but makes
>> sense mostly if pooler will be added as a part of PG.
>
> Symmetric Multi Threading (HyperThreading in Intels marketing terms, other
> marketing terms for Sun or IBM).  One CPU core that can handle more than one
> concurrently executing thread.
> Technically, 'SMT' allows instructions in flight from multiple threads at
> once in a superscalar Cpu core while some implementations differ and might
> technically CMT (one thread or the other, but can switch fast, or a
> non-superscalar core).
>
> For many implementations of  'multiple threads on one CPU core' many of the
> processor resources are reduced per thread when it is active -- caches get
> split, instruction re-order buffers are split, etc.  That is rather hardware
> implementation dependant.
>
> For Intel's SMT (and other similar), spin-locks hurt scalability if they
> aren't using new special instructions for the spin to yield pipeline slots
> to the other thread.
>
> Generally, code that stresses common processor resources more than CPU
> execution will scale poorly with SMT/CMT etc.

All applications scale well anyway, unless you have some kind of lock
contention inside the application itself or some system resource
becomes hot. But well, here we could spend days discussing this :-)


>
> So I'm not sure about the Postgres details, but the general case of an
> application that doesn't benefit from these technologies exists, and there
> is a non-zero chance that Postgres has some characteristics of such an app.
>
>>>
>>> (what was the context switch rate?  I didn't see that in the data, just
>>> mutex spins).
>>
>> increasing with a load, as this ex.:
>> http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_100.html#bmk_CPU_CtxSwit
>> ch_100
>>
>
> Well, on most systems over 100K context switches/sec is a lot.  And those
> reach 180000 /sec.
> However, this is 'only' 10 context switches per transaction and less than
> 20% system CPU, so maybe those numbers aren't quite as big as they seem.
>
> Out of curiosity, what was the context switch rate for MySql at its peak
> throughput?

the main MySQL problem is mutex locking, as here:
http://dimitrik.free.fr/Report_20090504/5465_dim_STAT_31.html#bmk_SpinMtx_31
so you have to limit the number of active threads to lower this
contention (similar to the pooler idea folks suggested here),

and the context switch rate is even higher (~200K/sec)


>>
>>>
>>> The scalability curve is definitely showing something.  Prepared
>>> statements
>>> were tried, as were most of the other suggestions other than one:
>>>
>>> What happens if the queries are more complicated (say, they take 15ms
>>> server
>>> side with a more complicated plan required)?  That is a harder question
>>> to
>>> answer
>>
>> What I observed is: if planner takes more long time (like initially
>> with 8.3.7 and analyze target 1000) the scalability problem is
>> appearing more strange -
>> http://dimitrik.free.fr/Report_20090505/5521_dim_STAT_18.html - as you
>> see CPU even not used more than 60% , and as you may see spin locks
>> are lowering - CPUs are not spinning for locks, there is something
>> else..
>> I'm supposing a problem of some kind of synchronization - background
>> processes are not waking up on time or something like this...
>> Then, if more time spent on the query execution itself and not planner:
>>  - if it'll be I/O time - I/O will hide everything else until you
>> increase a storage performance and/or add more RAM, but then you come
>> back to the initial issue :-)
>>  - if it'll be a CPU time it may be interesting! :-)
>>
>> Rgds,
>> -Dimitri
>>
>
> Ok, so that's good info that the planner or parser side seems to scale less
> effectively than the execution (as the results show), but I'm wondering
> about queries with longer execution times not longer planner times.  I'm
> wondering that, because its my opinion that most applications that will use
> larger scale hardware will have more complicated queries than your test.
> Its also greedy on my part since most queries in my applications are
> significantly more complicated.
> Regardless of my opinions -- this test is on one extreme (small fast
> queries) of the spectrum.  Its useful to know some data points on other
> parts of the spectrum.

As I've mentioned before, such fast queries are very common for some
applications (like banking transactions, stock management, internet
forums, etc.) - 5-7 years ago the goal with this test was to keep
response time under 1 sec (I'm not kidding :-)  but nowadays we're
running under a millisecond.. Crazy progress, no? :-))

However, I've started to extend db_STRESS kit to accept any kind of
query against any kind of db schema. So if you have an interesting
data model and some queries to run - I'll be happy to adapt them as a
new scenario! :-))

Rgds,
-Dimitri



From:
Dimitri
Date:

On 5/19/09, Merlin Moncure <> wrote:
> On Tue, May 19, 2009 at 11:53 AM, Dimitri <> wrote:
>> On 5/19/09, Merlin Moncure <> wrote:
>>> On Mon, May 18, 2009 at 6:32 PM, Dimitri <> wrote:
>>>> Thanks Dave for correction, but I'm also curious where the time is
>>>> wasted in this case?..
>>>>
>>>> 0.84ms is displayed by "psql" once the result output is printed, and I
>>>> got similar time within my client (using libpq) which is not printing
>>>> any output..
>>>
>>> Using libpq?  What is the exact method you are using to execute
>>> queries...PQexec?
>>
>> exactly
>>
>>> If you are preparing queries against libpq, the
>>> best way to execute queries is via PQexecPrepared.
>>
>> the query is *once* prepared via PQexec,
>> then it's looping with "execute" via PQexec.
>> Why PQexecPrepared will be better in my case?..
>
> It can be better or worse (usually better).  the parameters are
> separated from the query string.  Regardless of performance, the
> parametrized interfaces are superior for any queries taking arguments
> and should be used when possible.

you're probably right, but I don't like it when the solution becomes so
complicated - PG has such an elegant way to execute a prepared query!


>
>>> Another note: I would like to point out again that there are possible
>>> negative side effects in using char(n) vs. varchar(n) that IIRC do not
>>> exist in mysql.  When you repeat your test I strongly advise switching
>>> to varchar.
>>
>> if it's true for any case, why not just replace CHAR implementation by
>> VARCHAR directly within PG code?..
>
> First, let me explain the difference.  char(n) is padded out to 'n' on
> disk and when returned.  despite this, the length is still stored so
> there is no real advantage to using the char(n) type except that the
> returned string is of a guaranteed length.   mysql, at least the
> particular version and storage engine that I am logged into right now,
> does not do this for char(n).  In other words, select cast('abc' as
> char(50)) returns a string of 50 chars on pgsql and 3 chars on mysql.
> I will leave it as an exercise to the reader to figure out whom is
> following the standard.  pg's handling of the situation is not
> necessarily optimal, but we just tell everyone to quit using 'char(n)'
> type.
>
> Unless for example your 'NOTE' column is mostly full or mostly null,
> your query is not fair because postgres has to both store and return a
> proportionally greater amount of data.  This makes the comparison
> hardly apples to apples.   This stuff counts when we are measuring at
> microsecond level.

Good point! I can confirm that, at least at the beginning, all fields
are fully filled within the database. I will test both engines with
VARCHAR next time to be sure it's not an issue.


>
>>> Another question: how exactly are you connecting to the database?
>>> local machine? if so, domain socket or tcp/ip?
>>
>> local TCP/IP, same as MySQL
>
> would be curious to see if you get different results from domain socket.

at least for PG there was no difference, if I remember well.
However, earlier when I tested over a real network I ended up completely
changing my code to reduce network traffic (initially I had used
cursors), and in the end PG traffic was lower than or similar to
MySQL's - that was interesting stuff too :-)

Rgds,
-Dimitri


>
> merlin
>

From:
Merlin Moncure
Date:

On Tue, May 19, 2009 at 3:15 PM, Dimitri <> wrote:
> On 5/19/09, Merlin Moncure <> wrote:
>> On Tue, May 19, 2009 at 11:53 AM, Dimitri <> wrote:
>>> the query is *once* prepared via PQexec,
>>> then it's looping with "execute" via PQexec.
>>> Why PQexecPrepared will be better in my case?..
>>
>> It can be better or worse (usually better).  the parameters are
>> separated from the query string.  Regardless of performance, the
>> parametrized interfaces are superior for any queries taking arguments
>> and should be used when possible.
>
> you're probably right, but I don't like either when solution become so
> complicated - PG has a so elegant way to execute a prepared query!

It's not so bad.

PQexec:
sprintf(buf, query, char_arg1, char_arg2);
PQexec(conn, buf);
sprintf(buf, query, char_arg1, char_arg2);
PQexec(conn, buf);

PQexecPrepared:
const char *vals[2];
int formats[2] = {0, 0};
vals[0] = char_arg1; vals[1] = char_arg2;
PQexecPrepared(conn, stmt, 2, vals, NULL, formats, 0);
vals[0] = char_arg1; vals[1] = char_arg2;
PQexecPrepared(conn, stmt, 2, vals, NULL, formats, 0);

The setup is a little rough, and 'non-strings' can be a pain vs.
printf, but the queries are safer (goodbye SQL injection) and usually
faster.  Also, the door is opened to binary formats, which can be a huge
performance win on some data types...especially bytea, date/time, and
geo.  There are some good quality libraries out there to help with the
execparams family of functions :D.

merlin

From:
Tom Lane
Date:

Simon Riggs <> writes:
> On Tue, 2009-05-19 at 08:58 -0400, Tom Lane wrote:
>> Nonsense.  The planner might think some other plan is cheaper, but
>> it definitely knows how to do this, and has since at least 8.1.

> Please look at Dimitri's plan. If it can remove the pointless sort, why
> does it not do so?

I haven't followed the whole thread, but the plan in the original post
is for a hash join.  The planner does not trust a hash join to preserve
the order of its left input, because of possible batching.  See the
discussion a couple of months ago where we considered allowing the
planner to disable batching so it *could* assume order preservation, and
decided the risk of hashtable bloat was too great.

            regards, tom lane

From:
Robert Haas
Date:

On Tue, May 19, 2009 at 6:49 PM, Tom Lane <> wrote:
> Simon Riggs <> writes:
>> On Tue, 2009-05-19 at 08:58 -0400, Tom Lane wrote:
>>> Nonsense.  The planner might think some other plan is cheaper, but
>>> it definitely knows how to do this, and has since at least 8.1.
>
>> Please look at Dimitri's plan. If it can remove the pointless sort, why
>> does it not do so?
>
> I haven't followed the whole thread, but the plan in the original post
> is for a hash join.  The planner does not trust a hash join to preserve
> the order of its left input, because of possible batching.  See the
> discussion a couple of months ago where we considered allowing the
> planner to disable batching so it *could* assume order preservation, and
> decided the risk of hashtable bloat was too great.

Hmm, my recollection of that conversation was that we decided that we
should have the planner tell the executor whether or not we are
relying on it to produce sorted output.  We set this flag only when
the hash join is expected to fit comfortably within one batch (that
is, we allow a safety margin).  If this flag is set and a hash join
unexpectedly goes multi-batch, then we perform a final merge pass (a
la merge sort) before returning any results.
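To illustrate that final merge pass (a Python sketch, purely illustrative -- the real executor's batch handling is far more involved): within each batch the outer input's order is preserved, so each batch's output is an independently sorted run, and a k-way merge restores global order before any rows are returned:

```python
import heapq

def merge_batches(batches, key=lambda row: row[0]):
    """batches: iterables of rows, each independently sorted on the
    outer order key. heapq.merge performs the k-way merge, yielding
    rows in global sort order without re-sorting everything."""
    return heapq.merge(*batches, key=key)
```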

I don't think it's a good idea to write off the idea of implementing
this optimization at some point.  I see a lot of queries that join one
fairly large table against a whole bunch of little tables, and then
sort the results by a column that is indexed in the big table.  The
optimizer handles this by sequentially scanning the big table, hash
joining against all of the little tables, and then sorting the output,
which is pretty silly (given that all of the tables fit in RAM and are
in fact cached there).  If there is a LIMIT clause, then it might
instead index-scan the big table, do the hash joins, and then sort the
already-ordered results.  This is better because at least we're not
sorting the entire table unnecessarily, but it's still poor.
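[Editor's note: a minimal Python sketch of the pattern Robert describes, with invented table shapes. It shows that when the little tables fit in memory as single-batch hash tables, probing them with rows read in index order yields output that is already in that order, which is why the trailing Sort is redundant in the common case.]

```python
# "Big" table read in index order on id; "little" table held entirely
# in an in-memory hash table (a dict). Probing row by row preserves
# the order of the probe input, so the join output is already sorted.
big = [{"id": i, "ref": i % 3} for i in range(8)]          # index order on id
little = {j: {"ref": j, "name": "n%d" % j} for j in range(3)}

joined = [{**row, "name": little[row["ref"]]["name"]} for row in big]
ids = [r["id"] for r in joined]
# ids is still [0, 1, ..., 7]: the single-batch probe preserved input order.
```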

...Robert

From:
Simon Riggs
Date:

On Tue, 2009-05-19 at 23:54 -0400, Robert Haas wrote:

> I don't think it's a good idea to write off the idea of implementing
> this optimization at some point.  I see a lot of queries that join one
> fairly large table against a whole bunch of little tables, and then
> sort the results by a column that is indexed in the big table.

Agreed it's a common use case.

> The
> optimizer handles this by sequentially scanning the big table, hash
> joining against all of the little tables, and then sorting the output,
> which is pretty silly (given that all of the tables fit in RAM and are
> in fact cached there).  If there is a LIMIT clause, then it might
> instead index-scan the big table, do the hash joins, and then sort the
> already-ordered results.  This is better because at least we're not
> sorting the entire table unnecessarily, but it's still poor.

The Hash node is fully executed before we start pulling rows through the
Hash Join node. So the Hash Join node will know at execution time
whether or not it will continue to maintain sorted order. So we put the
Sort node into the plan, then the Sort node can just ask the Hash Join
at execution time whether it should perform a sort or just pass rows
through (act as a no-op).

The cost of the Sort node can either be zero, or pro-rated down from the
normal cost based upon what we think the probability is of going
multi-batch, which would vary by work_mem available.
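[Editor's note: one way to read Simon's pro-rating suggestion, as a hypothetical cost sketch. This is not PostgreSQL's actual costing code; the function name, the safety margin, and the probability estimate are all invented for illustration.]

```python
# Hypothetical sketch: charge the Sort node its full cost scaled by the
# estimated probability that the hash join goes multi-batch, which in
# turn depends on the estimated hash table size relative to work_mem.
def prorated_sort_cost(full_sort_cost, est_hash_bytes, work_mem_bytes,
                       safety_margin=0.75):
    usable = work_mem_bytes * safety_margin
    if est_hash_bytes <= usable:
        p_multibatch = 0.0          # comfortably single-batch: Sort is free
    else:
        # Crude stand-in: probability rises toward 1 as the estimate
        # exceeds the usable fraction of work_mem.
        p_multibatch = min(1.0, (est_hash_bytes - usable) / usable)
    return full_sort_cost * p_multibatch
```

A hash table well under work_mem contributes nothing to the Sort cost; one far over it pays the full sort cost.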

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


From:
Robert Haas
Date:

On Wed, May 20, 2009 at 4:11 AM, Simon Riggs <> wrote:
> The Hash node is fully executed before we start pulling rows through the
> Hash Join node. So the Hash Join node will know at execution time
> whether or not it will continue to maintain sorted order. So we put the
> Sort node into the plan, then the Sort node can just ask the Hash Join
> at execution time whether it should perform a sort or just pass rows
> through (act as a no-op).

It's not actually a full sort.  For example, if the join has two
batches, you don't need to dump all of the tuples from both batches
into a sort.  Each of the two tapes produced by the hash join is
sorted, but if you read tape one and then tape two, the combined
output won't be.  What you want to do is read the first tuple from
each tape and return whichever one is smaller, and put the other one
back; then lather, rinse, and repeat.  Because it's such a
special-case computation, I think you're going to want to implement it
within the HashJoin node rather than inserting a Sort node (or any
other kind).
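[Editor's note: a small Python sketch of the streaming merge Robert describes, illustrative only. It merges already-sorted "tapes" by keeping the head tuple of each in a heap, emitting the smallest, and refilling from that tape; no full re-sort is performed.]

```python
import heapq

def merge_tapes(tapes, key):
    """K-way merge of sorted tapes: read the head tuple of each tape,
    return whichever is smallest, and pull the replacement from the
    same tape; repeat until all tapes are drained."""
    iters = [iter(t) for t in tapes]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (key(first), i, first))
    while heap:
        _, i, row = heapq.heappop(heap)
        yield row
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (key(nxt), i, nxt))

tape1 = [(1, "a"), (4, "d"), (6, "f")]   # each tape already sorted
tape2 = [(2, "b"), (3, "c"), (5, "e")]
merged = list(merge_tapes([tape1, tape2], key=lambda r: r[0]))
# merged is fully sorted across both tapes.
```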

...Robert

From:
Simon Riggs
Date:

On Wed, 2009-05-20 at 07:17 -0400, Robert Haas wrote:
> On Wed, May 20, 2009 at 4:11 AM, Simon Riggs <> wrote:
> > The Hash node is fully executed before we start pulling rows through the
> > Hash Join node. So the Hash Join node will know at execution time
> > whether or not it will continue to maintain sorted order. So we put the
> > Sort node into the plan, then the Sort node can just ask the Hash Join
> > at execution time whether it should perform a sort or just pass rows
> > through (act as a no-op).
>
> It's not actually a full sort.  For example, if the join has two
> batches, you don't need to dump all of the tuples from both batches
> into a sort.  Each of the two tapes produced by the hash join is
> sorted, but if you read tape one and then tape two, the combined
> output won't be.  What you want to do is read the first tuple from
> each tape and return whichever one is smaller, and put the other one
> back; then lather, rinse, and repeat.  Because it's such a
> special-case computation, I think you're going to want to implement it
> within the HashJoin node rather than inserting a Sort node (or any
> other kind).

That has wider applicability and seems sound. It will also be easier to
assess a cost for that aspect in the optimizer. I like that approach.

Code-wise, you'll need to refactor things quite a lot to make the
tuplesort code accessible to the HJ node. The sorting code is going to
get pretty hectic if we add in all the ideas on the table: this one,
partial sort, and improved sorting (at least three other ideas).
Perhaps it will be easier to write a specific final-merge routine just
for HJs.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support