Thread: Any better plan for this query?..
Hi,

any idea if there is a more optimal execution plan possible for this query:

select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as hbeg,
       H.END_DATE as hend, H.NOTE as hnote
  from HISTORY H, STAT S
  where S.REF = H.REF_STAT
  and H.REF_OBJECT = '0000000001'
  order by H.HORDER ;

EXPLAIN ANALYZE output on 8.4:
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=4549.75..4555.76 rows=2404 width=176) (actual time=1.341..1.343 rows=20 loops=1)
   Sort Key: h.horder
   Sort Method:  quicksort  Memory: 30kB
   ->  Hash Join  (cost=33.50..4414.75 rows=2404 width=176) (actual time=1.200..1.232 rows=20 loops=1)
         Hash Cond: (h.ref_stat = s.ref)
         ->  Index Scan using history_ref_idx on history h  (cost=0.00..4348.20 rows=2404 width=135) (actual time=0.042..0.052 rows=20 loops=1)
               Index Cond: (ref_object = '0000000001'::bpchar)
         ->  Hash  (cost=21.00..21.00 rows=1000 width=45) (actual time=1.147..1.147 rows=1000 loops=1)
               ->  Seq Scan on stat s  (cost=0.00..21.00 rows=1000 width=45) (actual time=0.005..0.325 rows=1000 loops=1)
 Total runtime: 1.442 ms
(10 rows)

Table HISTORY contains 200M rows, only 20 needed
Table STAT contains 1000 rows, only 20 needed to be joined to HISTORY values.

Table definitions:
"""""""""""""""""""""""""""""""""""""""""""""""""""
create table STAT
(
    REF         CHAR(3)    not null,
    NAME        CHAR(40)   not null,
    NUMB        INT        not null
);

create table HISTORY
(
    REF_OBJECT  CHAR(10)   not null,
    HORDER      INT        not null,
    REF_STAT    CHAR(3)    not null,
    BEGIN_DATE  CHAR(12)   not null,
    END_DATE    CHAR(12)           ,
    NOTE        CHAR(100)
);

create unique index stat_ref_idx on STAT( ref );
create index history_ref_idx on HISTORY( ref_object, horder );
"""""""""""""""""""""""""""""""""""""""""""""""""""

NOTE: The same query runs 2 times faster on MySQL.

Any idea?..

Rgds,
-Dimitri
Dimitri wrote:
> any idea if there is a more optimal execution plan possible for this query:
>
> select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as hbeg,
>        H.END_DATE as hend, H.NOTE as hnote
> from HISTORY H, STAT S
> where S.REF = H.REF_STAT
> and H.REF_OBJECT = '0000000001'
> order by H.HORDER ;

OK, so you're taking a simple:

   history INNER JOIN stat ON (stat.ref = history.ref_stat)

then filtering for records with a particular value of history.ref_object and finally performing a sort.

If I'm reading it right, the plan you posted does a sequential scan on the `stat' table. The stat table only has 1000 rows, so this isn't necessarily an unreasonable choice even if there is an appropriate index and even if not many of the rows will be needed.

It then does an index scan of the history table looking for tuples with ref_object = '0000000001' (text match). It hash joins the hashed results of the initial seq scan to the results of the index scan, and sorts the result.

To me, that looks pretty reasonable. You might be able to avoid the hash join in favour of a nested loop over stat_ref_idx (looping over the history rows where ref_object = '0000000001') by providing a composite index on HISTORY(ref_stat, ref_object). I'm really not too sure, though; plan optimization isn't my thing, I'm just seeing if I can offer a few ideas.

> Table definitions:

While not strictly necessary, it's a *REALLY* good idea to define a suitable PRIMARY KEY.

Also, the `CHAR(n)' data type is evil. E.V.I.L. Use `varchar(n)' for bounded-length values, or `text' for unbounded fields, unless you REALLY want the crazy behaviour of `CHAR(n)'.

I'm a little bit puzzled about why you seem to be doing lots of things with integer values stored in text strings, but that probably doesn't matter too much for the issue at hand.

> NOTE: The same query runs 2 times faster on MySQL.

With InnoDB tables and proper transactional safety? Or using scary MyISAM tables and a "just pray" approach to data integrity? If you're using MyISAM tables I'm not surprised; MySQL with MyISAM is stunningly fast, but oh-my-god dangerous.

--
Craig Ringer
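For reference, turning Craig's suggestions into DDL would look roughly like the following (an untested sketch - the index name is made up, and the primary key choices are guesses based on the column names rather than anything stated in the thread):

-- the composite index Craig mentions
create index history_stat_obj_idx on HISTORY( ref_stat, ref_object );

-- possible primary keys, assuming REF and (REF_OBJECT, HORDER) are unique
alter table STAT add primary key ( ref );
alter table HISTORY add primary key ( ref_object, horder );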
Hi Craig,

yes, you described the problem very well! :-)

All those CHAR columns are there purely for historical reasons :-) and they may contain other characters, not only numbers - that's why they aren't integers. Also, all the data is fixed-length, so VARCHAR will not save any space - and what kind of performance difference should we expect between CHAR and VARCHAR when all values have a fixed length?..

Any way to force a nested loop without an additional index?..

It's 2 times faster on InnoDB, and as it's just a SELECT query there's no need to go into transaction details :-)

Rgds,
-Dimitri
Dimitri wrote:
> select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as hbeg,
>        H.END_DATE as hend, H.NOTE as hnote
> from HISTORY H, STAT S
> where S.REF = H.REF_STAT
> and H.REF_OBJECT = '0000000001'
> order by H.HORDER ;
>
> Table HISTORY contains 200M rows, only 20 needed
> Table STAT contains 1000 rows, only 20 needed to be joined to HISTORY values.

The plan doesn't look too bad to me, although the planner is over-estimating the number of matches in the history table (2404 vs 20). That's a bit surprising given how simple the predicate is. Make sure you've ANALYZEd the table. If that's not enough, you can try to increase the statistics target for the ref_object column, ie. ALTER TABLE history ALTER COLUMN ref_object SET STATISTICS 500. That might give you a different plan, maybe with a nested loop join instead of a hash join, which might be faster in this case.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
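Spelled out, Heikki's suggestion is simply the following (a sketch; 500 is just an example target value, and the output column list is abbreviated):

ANALYZE history;

-- if the row estimate is still far off, raise the statistics target for that column
ALTER TABLE history ALTER COLUMN ref_object SET STATISTICS 500;
ANALYZE history;

-- then re-check the estimated vs actual row counts
EXPLAIN ANALYZE
select S.REF, S.NAME, H.HORDER, H.BEGIN_DATE, H.END_DATE, H.NOTE
  from HISTORY H, STAT S
 where S.REF = H.REF_STAT
   and H.REF_OBJECT = '0000000001'
 order by H.HORDER;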
Dimitri wrote:
> Also, all the data is fixed-length, so VARCHAR will not save any space -
> and what kind of performance difference should we expect between CHAR
> and VARCHAR when all values have a fixed length?..

None in postgres, but the char/varchar thing may or may not bite you at some point later - sounds like you have it covered though.

> It's 2 times faster on InnoDB, and as it's just a SELECT query there's
> no need to go into transaction details :-)

Total runtime: 1.442 ms
(10 rows)

You posted a query that's taking 2/1000's of a second. I don't really see a performance problem here :)

--
Postgresql & php tutorials
http://www.designmagick.com/
Hi Heikki,

I've already tried a target of 1000, and the only thing it changes compared to the current default of 100 is that instead of 2404 rows it estimates 240 rows - but the plan remains the same..

Rgds,
-Dimitri
Hi Chris,

the only problem I see here is that it's 2 times slower than InnoDB, so before I tell myself it's OK I want to be sure there is nothing else to do.. :-)

Rgds,
-Dimitri
Hi Richard,

no, of course it's not based on explain :-)
I've run several tests before, and now I'm going into depth to make sure there is nothing wrong. Because of this per-query time difference, InnoDB is reaching a 2-3 times higher TPS level than PostgreSQL..

Rgds,
-Dimitri
Dimitri wrote:
> the only problem I see here is that it's 2 times slower than InnoDB

How do you know? This isn't just based on the explain values reported, is it?

--
Richard Huxton
Archonet Ltd
Dimitri wrote:
> no, of course it's not based on explain :-)
> I've run several tests before, and now I'm going into depth to make sure
> there is nothing wrong. Because of this per-query time difference, InnoDB
> is reaching a 2-3 times higher TPS level than PostgreSQL..

And you are satisfied that it is the planned query time that is the dominant factor here, and not parsing time, connection time, data transport, disk bandwidth etc?

--
Richard Huxton
Archonet Ltd
On Wed, 6 May 2009, Heikki Linnakangas wrote:
>> Total runtime: 1.442 ms

It's pretty clear that this query isn't even going to disc - it's all CPU time. That can be the case if you run the exact same query more than once, and it can cause your EXPLAIN output to be vastly different from your real use case. Do the queries on the live system hit the disc at all?

> The plan doesn't look too bad to me, although the planner is over-estimating
> the number of matches in the history table (2404 vs 20). That's a bit
> surprising given how simple the predicate is. Make sure you've ANALYZEd the
> table. If that's not enough, you can try to increase the statistics target
> for the ref_object column, ie. ALTER TABLE history ALTER COLUMN ref_object
> SET STATISTICS 500.

I would have thought this would actually make it slower, by increasing the time taken to plan. On such small queries, the planner overhead must be quite significant.

Matthew

--
Q: What's the difference between ignorance and apathy?
A: I don't know, and I don't care.
On Wed, May 6, 2009 at 3:38 AM, Dimitri <dimitrik.fr@gmail.com> wrote:
> any idea if there is a more optimal execution plan possible for this query:
>
> NOTE: The same query runs 2 times faster on MySQL.

couple of things to try:

*) as others have noted, get rid of char() columns. use varchar, or int if you can. this is a bigger deal in postgres than mysql.

*) curious if disabling sequential scan helps (set enable_seqscan = false) or changes the plan. .3 msec is spent on seq scan and an index lookup is likely much faster.

*) prepare the query:

prepare history_stat(char(10) as
 select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as hbeg,
        H.END_DATE as hend, H.NOTE as hnote
   from HISTORY H, STAT S
  where S.REF = H.REF_STAT
    and H.REF_OBJECT = $1
  order by H.HORDER ;

execute history_stat('0000000001');

(prepared queries have some annoyances you need to be prepared to deal with. however, they are quite useful when squeezing every last msec out of fast queries).

merlin
On Wed, May 6, 2009 at 7:46 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> prepare history_stat(char(10) as

typo:
prepare history_stat(char(10)) as
Dimitri wrote:
> I've run several tests before, and now I'm going into depth to make sure
> there is nothing wrong. Because of this per-query time difference, InnoDB
> is reaching a 2-3 times higher TPS level than PostgreSQL..

Why don't you use MySQL then?
Or tune PostgreSQL?

Yours,
Laurenz Albe
I'll try to answer all mails at once :-))

- the query is running fully in RAM: no I/O, no network, only CPU time

- looping the same query 100 times gives 132ms total time (~1.32ms per query), while it's 44ms on InnoDB (~0.44ms per query)

- disabling seq scan forces the planner to use an index scan, but it ends up worse, at 1.53ms per query..

- preparing the query helps: the prepare itself takes 16ms, but execute runs in 0.98ms - which makes me think it's not only planner overhead... and it's still more than 2 times slower than the 0.44ms on InnoDB. Also, prepare generally cannot be used in this test case, as we assume any query may be of any kind (even if that's not always true :-))

- char or varchar has to be used here because the reference code is supposed to accept any alphanumeric characters

- it also reminds me that there might be some extra CPU time due to locale settings - but all my "lc_*" variables are set to "C"...

Rgds,
-Dimitri
The story is simple: for the launch of MySQL 5.4 I ran a test comparing the InnoDB engine variations available at the time, and at the end, out of curiosity, I started the same test on PostgreSQL 8.3.7 to see whether MySQL's performance level is now closer to PostgreSQL (PG was a clear winner before). To my big surprise, MySQL 5.4 outperformed 8.3.7...

However, analyzing the PostgreSQL processing I got the feeling that something goes wrong on the PG side.. So now I've installed both 8.3.7 and 8.4beta1 to see in more depth what's going on. Currently 8.4 performs much better than 8.3.7, but there is still room for improvement if such a small query can go faster :-)

Rgds,
-Dimitri

On 5/6/09, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
> Why don't you use MySQL then?
> Or tune PostgreSQL?
Dimitri wrote:
> - the query is running fully in RAM: no I/O, no network, only CPU time
>
> - looping the same query 100 times gives 132ms total time (~1.32ms per
>   query), while it's 44ms on InnoDB (~0.44ms per query)

Well, assuming you're happy that PG is tuned reasonably for your machine and that MySQL's query cache isn't returning the results here, it looks like MySQL is faster for this particular query.

The only obvious place there could be a big gain is with the hashing algorithm. If you remove the ORDER BY and the query-time doesn't fall by much then it's the hash phase.

The other thing to try is to alter the query to be a SELECT count(*) rather than returning rows - that will let you measure the time to transfer the result rows.

--
Richard Huxton
Archonet Ltd
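The two variants Richard describes would look like this (a sketch based on the query from the original post):

-- without the ORDER BY, to separate the sort cost from the hash join cost
select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as hbeg,
       H.END_DATE as hend, H.NOTE as hnote
  from HISTORY H, STAT S
 where S.REF = H.REF_STAT
   and H.REF_OBJECT = '0000000001';

-- returning a single count instead of rows, to factor out result transfer
select count(*)
  from HISTORY H, STAT S
 where S.REF = H.REF_STAT
   and H.REF_OBJECT = '0000000001';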
On Wed, May 06, 2009 at 04:01:03PM +0800, Craig Ringer wrote:
> With InnoDB tables and proper transactional safety? Or using scary MyISAM
> tables and a "just pray" approach to data integrity?

I just thought I would ask. Are you using the query cache in MySQL? If that is on, that could be the difference.

Another thing to check, try issuing the selects concurrently: 2 at a time, 5 at a time, 10 at a time... and see if that has an effect on timing. In many of the benchmarks, MySQL will out perform PostgreSQL for very low numbers of clients. Once you are using more than a handful, PostgreSQL pulls ahead.

Also, is this a completely static table? i.e. no updates or inserts. How is the performance with those happening? This should help you get a clearer picture of the performance.

My two cents.
Ken
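For what it's worth, whether the MySQL query cache is in play can be checked and switched off on the MySQL side with something like this (a sketch for MySQL 5.x; variable names may differ between versions):

-- is the cache enabled, and how large is it?
SHOW VARIABLES LIKE 'query_cache%';

-- has it actually been serving hits during the test run?
SHOW STATUS LIKE 'Qcache%';

-- disable it for the benchmark
SET GLOBAL query_cache_size = 0;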
On Wed, May 06, 2009 at 02:49:23PM +0200, Dimitri wrote:
> Currently 8.4 performs much better than 8.3.7, but there is still room
> for improvement if such a small query can go faster :-)

Another thought, have you tuned PostgreSQL for an in memory database? Those tuning options may be what is needed to improve the plan chosen by PostgreSQL.

Cheers,
Ken
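The tuning Ken is presumably alluding to is mostly the planner cost settings; for a fully cached data set, something along these lines is a common starting point (a sketch only - the values are illustrative, and shared_buffers itself has to be changed in postgresql.conf and needs a restart):

-- tell the planner the working set is effectively all cached
SET effective_cache_size = '12GB';

-- make index (random) access look as cheap as sequential access
SET seq_page_cost = 1.0;
SET random_page_cost = 1.0;

These can be tried per session first and moved into postgresql.conf once the effect on the plan is confirmed.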
On May 6, 2009, at 7:53 AM, Richard Huxton wrote:
> The other thing to try is to alter the query to be a SELECT count(*)
> rather than returning rows - that will let you measure the time to
> transfer the result rows.

Do you expect to run this query 100 times per second in your application? Or is this just a test to see how fast the query is, for optimisation?

I always get a bit wary of a test where everything runs straight out of memory, the reason being that this is usually not really the case in a production environment.

Try to make a little test case where you give the query random parameters, so different result sets are returned. This will give you a better idea of how fast the query really is, and might give you better comparison results.

Instead of count(*) I usually do explain analyze to see how fast PostgreSQL handles the query.

Ries
Folks, first of all:

- I used a fixed reference value just to simplify the analysis and isolate the case as much as possible; of course during my tests all values are random :-)

- the final goal of the test is to analyze scalability, so yes, concurrent sessions with random keys grow from 1 to 256 (I run it on a 32-core server, no think time, just stressing), and the result is still not better than InnoDB

- I'm analyzing this query running in memory to understand what's blocking once all the main bottlenecks are avoided (no I/O anymore, no network, etc.)

- the initial explain analyze and table details were posted in the first message

Now, let's go further:

- the query "as is" took 1.50ms
- after removing the "order by" it took 1.19ms
- select count(*) instead of the columns, with the "order by" removed, took 0.98ms
- execute of the same prepared "select count(*) ..." took 0.68ms

So, where is the time going?...

Rgds,
-Dimitri
Dimitri wrote:
> the only problem I see here is that it's 2 times slower than InnoDB, so
> before I tell myself it's OK I want to be sure there is nothing else
> to do.. :-)

Can the genetic query optimizer come into play on small queries?

--
Craig Ringer
No.

Ken

On Wed, May 06, 2009 at 10:04:33PM +0800, Craig Ringer wrote:
> Can the genetic query optimizer come into play on small queries?
> On Wed, May 06, 2009 at 10:04:33PM +0800, Craig Ringer wrote:
>> Can the genetic query optimizer come into play on small queries?

On Wed, 6 May 2009, Kenneth Marshall wrote:
> No.

Yes. But you would have had to have set some really weird configuration.

Matthew

--
And the lexer will say "Oh look, there's a null string. Oooh, there's another.
And another.", and will fall over spectacularly when it realises there are
actually rather a lot.
        - Computer Science Lecturer (edited)
I assumed that with prepare and then execute, the query optimizer no longer comes into play at the "execute" phase - or did I miss something?..

Forgot to say: the query cache is disabled on the MySQL side.

Rgds,
-Dimitri

On 5/6/09, Craig Ringer <craig@postnewspapers.com.au> wrote:
> Can the genetic query optimizer come into play on small queries?
On Wed, May 06, 2009 at 09:38:59AM +0200, Dimitri wrote:
> any idea if there is a more optimal execution plan possible for this query:
>
> NOTE: The same query runs 2 times faster on MySQL.

Dimitri,

Is there any chance of profiling the postgres backend to see where the time is used?

Just an idea,
Ken
Hi Ken,

yes, I can do it, but I didn't expect to end up profiling initially :-) I expected there was just something trivial in the plan that I didn't know about.. :-)

BTW, is there a profiler already integrated in the code, or do I need external tools?..

Rgds,
-Dimitri
On Wed, May 06, 2009 at 04:48:21PM +0200, Dimitri wrote:
> yes, I can do it, but I didn't expect to end up profiling initially :-)
> I expected there was just something trivial in the plan that I didn't
> know about.. :-)

I only suggested it because it might have the effect of changing the sequential scan on the stat table to an indexed scan.

Cheers,
Ken
On Wed, 2009-05-06 at 10:31 +0200, Dimitri wrote:
> I've already tried a target of 1000, and the only thing it changes
> compared to the current default of 100 is that instead of 2404 rows it
> estimates 240 rows - but the plan remains the same..

Try both of these things
* REINDEX on the index being used in the query, then re-EXPLAIN
* enable_hashjoin = off, then re-EXPLAIN

You should first attempt to get the same plan, then confirm it really is faster before we worry why the optimizer hadn't picked that plan.

We already know that MySQL favors nested loop joins, so turning up a plan that on this occasion is actually better that way is in no way representative of general performance. Does MySQL support hash joins?

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
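Spelled out, Simon's two experiments are (a sketch; enable_hashjoin affects the whole session until it is reset):

REINDEX INDEX history_ref_idx;
EXPLAIN ANALYZE
select S.REF, S.NAME, H.HORDER, H.BEGIN_DATE, H.END_DATE, H.NOTE
  from HISTORY H, STAT S
 where S.REF = H.REF_STAT
   and H.REF_OBJECT = '0000000001'
 order by H.HORDER;

SET enable_hashjoin = off;
EXPLAIN ANALYZE
select S.REF, S.NAME, H.HORDER, H.BEGIN_DATE, H.END_DATE, H.NOTE
  from HISTORY H, STAT S
 where S.REF = H.REF_STAT
   and H.REF_OBJECT = '0000000001'
 order by H.HORDER;

RESET enable_hashjoin;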
Hi Simon,

can you explain why REINDEX might help here?.. - the database was just created, the data loaded, and then the indexes were created and analyzed.. What could change after a REINDEX?..

Disabling hashjoin was a good try!
Running this query "as is" we move from 1.50ms to 0.84ms, and the plan is here:

                                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=4562.83..4568.66 rows=2329 width=176) (actual time=0.225..0.229 rows=20 loops=1)
   Sort Key: h.horder
   Sort Method:  quicksort  Memory: 30kB
   ->  Merge Join  (cost=4345.89..4432.58 rows=2329 width=176) (actual time=0.056..0.205 rows=20 loops=1)
         Merge Cond: (s.ref = h.ref_stat)
         ->  Index Scan using stat_ref_idx on stat s  (cost=0.00..49.25 rows=1000 width=45) (actual time=0.012..0.079 rows=193 loops=1)
         ->  Sort  (cost=4345.89..4351.72 rows=2329 width=135) (actual time=0.041..0.043 rows=20 loops=1)
               Sort Key: h.ref_stat
               Sort Method:  quicksort  Memory: 30kB
               ->  Index Scan using history_ref_idx on history h  (cost=0.00..4215.64 rows=2329 width=135) (actual time=0.013..0.024 rows=20 loops=1)
                     Index Cond: (ref_object = '0000000001'::bpchar)
 Total runtime: 0.261 ms
(12 rows)

Curiously, the reported total runtime is 0.26ms.

Any idea why the planner doesn't choose this plan from the beginning?..
Any way to keep this plan without disabling hashjoin globally or per session?..

Rgds,
-Dimitri

On 5/6/09, Simon Riggs <simon@2ndquadrant.com> wrote:
> Try both of these things
> * REINDEX on the index being used in the query, then re-EXPLAIN
> * enable_hashjoin = off, then re-EXPLAIN
The problem with "gprof" is that it profiles everything from the beginning to the end, so a lot of function executions unrelated to this query get profiled as well...

Also, when you look at the profiling technology, solutions of this kind are based on the system clock frequency and have a limit on their time resolution. On my system this limit is 0.5ms, which is too coarse compared to the query execution time :-)

So, what I've done:

- I changed the reference key criterion slightly, from = '0000000001' to < '0000000051', so instead of 20 rows I now get 1000 rows of output. It's still slower than InnoDB (12ms vs 9ms), but at least it can be profiled (well, we're also probably moving away from the original problem, as the time may now be spent mostly on output traffic, but I tried anyway)

- I made a loop of 100 iterations of this query, reading but not printing the data. The total execution time of the loop is 1200ms, and curiously it did not really change under profiling. The profiler was able to attribute 733ms of the total execution time (if I understand correctly, functions running faster than 0.5ms remain un-profiled).

The top profiler output is here:

Excl.      Incl.      Name
User CPU   User CPU
 sec.       sec.
0.733      0.733      <Total>
0.103      0.103      memcpy
0.045      0.045      slot_deform_tuple
0.037      0.040      AllocSetAlloc
0.021      0.021      AllocSetFree
0.018      0.037      pfree
0.018      0.059      appendBinaryStringInfo
0.017      0.031      heap_fill_tuple
0.017      0.017      _ndoprnt
0.016      0.016      nocachegetattr
0.015      0.065      heap_form_minimal_tuple
0.015      0.382      ExecProcNode
0.015      0.015      strlen
0.014      0.037      ExecScanHashBucket
0.014      0.299      printtup
0.013      0.272      ExecHashJoin
0.011      0.011      enlargeStringInfo
0.011      0.086      index_getnext
0.010      0.010      hash_any
0.009      0.076      FunctionCall1
0.009      0.037      MemoryContextAlloc
0.008      0.008      LWLockAcquire
0.007      0.069      pq_sendcountedtext
0.007      0.035      ExecProject
0.007      0.127      ExecScan
...

Curiously, "memcpy" is at the top. I don't know how much it matters in most cases, but it probably makes sense to see whether it can be optimized, etc..

Rgds,
-Dimitri

On 5/7/09, Euler Taveira de Oliveira <euler@timbira.com> wrote:
> Dimitri wrote:
>> BTW, is there a profiler already integrated in the code, or do I need
>> external tools?..
>
> Postgres provides support for profiling. Add --enable-profiling flag. Use
> gprof to get the profile.
On Thu, May 7, 2009 at 4:20 AM, Dimitri <dimitrik.fr@gmail.com> wrote:
> Curiously, the reported total runtime is 0.26ms.
>
> Any idea why the planner doesn't choose this plan from the beginning?..
> Any way to keep this plan without disabling hashjoin globally or per
> session?..

can you work prepared statements into your app? turn off hash join, prepare the query, then turn it back on.

merlin
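In SQL, Merlin's trick looks like this (a sketch; the statement name is arbitrary). Since the plan is built at PREPARE time in 8.3/8.4, the session can re-enable hash joins right away and every later EXECUTE still uses the plan chosen while they were disabled:

SET enable_hashjoin = off;

prepare history_stat(char(10)) as
 select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as hbeg,
        H.END_DATE as hend, H.NOTE as hnote
   from HISTORY H, STAT S
  where S.REF = H.REF_STAT
    and H.REF_OBJECT = $1
  order by H.HORDER;

SET enable_hashjoin = on;

-- still runs with the merge/nested-loop plan captured above
execute history_stat('0000000001');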
Simon Riggs <simon@2ndQuadrant.com> writes:
> We already know that MySQL favors nested loop joins

From what I read I thought that was the *only* type of join MySQL supports.

The big picture view here is that whether we run a short query in half a millisecond versus two milliseconds is usually not really important. It could matter if you're concerned with how many transactions/s you can run in a busy server -- but that's not exactly the same thing and you should really measure that in that case.

It would be nice if we were in the same ballpark as MySQL but we would only be interested in such optimizations if they don't come at the expense of scalability under more complex workloads.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
On Thu, 2009-05-07 at 12:58 +0100, Gregory Stark wrote:
> It would be nice if we were in the same ballpark as MySQL but we would
> only be interested in such optimizations if they don't come at the
> expense of scalability under more complex workloads.

It doesn't appear there is a scalability issue here at all. Postgres can clearly do the same query in about the same time.

We just have a case where MySQL happens to optimise it well and Postgres doesn't. Since we can trivially design cases that show the opposite I'm not worried too much.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Dimitri wrote:
> Also, when you look at the profiling technology, solutions of this kind
> are based on the system clock frequency and have a limit on their time
> resolution. On my system this limit is 0.5ms, which is too coarse
> compared to the query execution time :-)
>
> So, what I've done:
>
> - I changed the reference key criterion slightly, from = '0000000001' to
>   < '0000000051', so instead of 20 rows I now get 1000 rows of output

Another thing you can try is run the query several times (like 10000 or so).

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
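One way to do that from SQL alone is a small plpgsql wrapper like the one below (a sketch; note that plpgsql plans the statement once and reuses it, so this measures execution time rather than per-call planning - closer to the prepared-statement case discussed earlier in the thread):

create or replace function bench_history(n int) returns void as $$
declare
    i int;
begin
    for i in 1 .. n loop
        perform S.REF, S.NAME, H.HORDER, H.BEGIN_DATE, H.END_DATE, H.NOTE
           from HISTORY H, STAT S
          where S.REF = H.REF_STAT
            and H.REF_OBJECT = '0000000001'
          order by H.HORDER;
    end loop;
end;
$$ language plpgsql;

-- run with \timing enabled in psql and divide by n for a per-query figure
select bench_history(10000);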
I've simply restarted the full test with hashjoin OFF. Up to 32 concurrent users things go well. Then from 32 users the response time jumps to 20ms, with 64 users it's higher again, and with 256 users it reaches 700ms, so TPS drops from 5.000 to ~200..

With hashjoin ON it's not happening, and I'm reaching at least 11.000 TPS on 32 fully busy cores.

I should not use prepare/execute, as the test conditions are supposed to stay "generic".

About the scalability issue - there is one on 8.3.7, because on 32 cores with this kind of load it uses only 50% CPU and does not exceed 6.000 TPS, while 8.4 uses 90% CPU and reaches 11.000 TPS.. At the same time, comparing 8.3 and 8.4, the response time is 2 times lower in 8.4, and it seems to me the main gain for 8.4 is here.

I'll publish all the details, I just need some time :-)

Rgds,
-Dimitri
On Thu, 2009-05-07 at 20:36 +0200, Dimitri wrote:
> I've simply restarted the full test with hashjoin OFF. Up to 32
> concurrent users things go well. Then from 32 users the response time
> jumps to 20ms, with 64 users it's higher again, and with 256 users it
> reaches 700ms, so TPS drops from 5.000 to ~200..
>
> With hashjoin ON it's not happening, and I'm reaching at least 11.000
> TPS on 32 fully busy cores.

Much better to stick to the defaults.

Sounds like a problem worth investigating further, but not pro bono.

> About the scalability issue - there is one on 8.3.7, because on 32 cores
> with this kind of load it uses only 50% CPU and does not exceed
> 6.000 TPS, while 8.4 uses 90% CPU and reaches 11.000 TPS..

Yeh, small changes make a big difference. Thanks for the info.

How does MySQL perform?

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Folks, it's completely crazy, but here is what I found:

- if the HISTORY table is analyzed with target 1000, my SELECT response time jumps to 3ms, and the max throughput is limited to 6.000 TPS (that's what happened to 8.3.7)

- if the HISTORY table is analyzed with target 5, my SELECT response time decreases to 1.2ms (!) and then my max TPS level is ~12.000! and CPU is used up to 95% even by 8.3.7 :-) So 8.4 performed better just because I left its analyze target at the default value of 100.

Can anyone explain why the analyze target can have such a huge negative side effect?..

Next point: SCALABILITY ISSUE

Now both 8.3.7 and 8.4 are at similar performance levels (8.3.7 is always slightly ahead of 8.4, but fine). The problem I have:
- on 8 cores: ~5.000 TPS / 5.500 MAX
- on 16 cores: ~10.000 TPS / 11.000 MAX
- on 32 cores: ~10.500 TPS / 11.500 MAX

What else may limit concurrent SELECTs here?..

Oh, I forgot to mention: MySQL is reaching 17.500 TPS here.

Rgds,
-Dimitri

On 5/7/09, Simon Riggs <simon@2ndquadrant.com> wrote:
> How does MySQL perform?
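To check whether the statistics target alone is responsible, it can also be changed for just the columns involved, instead of re-ANALYZEing with a different global target (a sketch; the values are only examples, and -1 puts a column back on the server-wide default):

ALTER TABLE history ALTER COLUMN ref_object SET STATISTICS 5;
ALTER TABLE history ALTER COLUMN ref_stat   SET STATISTICS 5;
ANALYZE history;

-- revert to the default target afterwards
ALTER TABLE history ALTER COLUMN ref_object SET STATISTICS -1;
ALTER TABLE history ALTER COLUMN ref_stat   SET STATISTICS -1;
ANALYZE history;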
Dimitri <dimitrik.fr@gmail.com> writes:
> Can anyone explain why the analyze target can have such a huge negative
> side effect?..

If these are simple queries, maybe what you're looking at is the increase in planning time caused by having to process 10x as much statistical data. Cranking statistics_target to the max just because you can is not necessarily a good strategy.

                        regards, tom lane
Hi Tom,

it was not deliberate :-) it just stayed that way after various probes with the query plan. Anyway, on 8.4 the default target is 100, and just by moving it to 5 I reached 10.500 TPS on 16 cores instead of the initial 8.000. And I suppose you have a good reason to keep the default at 100, don't you? ;-)

And what about scalability on 32 cores?.. Any idea?

Rgds,
-Dimitri

On 5/11/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If these are simple queries, maybe what you're looking at is the
> increase in planning time caused by having to process 10x as much
> statistical data. Cranking statistics_target to the max just because
> you can is not necessarily a good strategy.
On Mon, May 11, 2009 at 11:18 AM, Dimitri <dimitrik.fr@gmail.com> wrote:
> - if the HISTORY table is analyzed with target 1000, my SELECT response
>   time jumps to 3ms, and the max throughput is limited to 6.000 TPS
>
> - if the HISTORY table is analyzed with target 5, my SELECT response time
>   decreases to 1.2ms (!) and then my max TPS level is ~12.000!
>
> Oh, I forgot to mention: MySQL is reaching 17.500 TPS here.

why aren't you preparing the query? mysql uses a simple rule based planner and postgresql has a statistics based planner. Our planner has all kinds of advantages in various scenarios, but this is compensated by slightly longer planning time in some cases. OTOH, you have prepared queries to compensate for this. (mysql also has prepared queries, but the syntax is awkward and there is much less benefit to using them).

merlin
On Mon, 2009-05-11 at 17:18 +0200, Dimitri wrote: > Yes, forget, MySQL is reaching 17.500 TPS here. Please share your measurements of MySQL scalability also. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Mon, 2009-05-11 at 11:23 -0400, Tom Lane wrote: > Dimitri <dimitrik.fr@gmail.com> writes: > > Anyone may explain me why analyze target may have so huge negative > > secondary effect?.. > > If these are simple queries, maybe what you're looking at is the > increase in planning time caused by having to process 10x as much > statistical data. Cranking statistics_target to the max just because > you can is not necessarily a good strategy. statistics_target affects tables, so we have problems if you have a mix of simple and complex queries. IMHO we need an explicit planner_effort control, rather than the more arcane *_limit knobs which are effectively the same thing, just harder to use in practice. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
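A rough sketch of the per-column alternative: the statistics target can be lowered only where the simple lookups happen, while default_statistics_target stays at its default for everything else. The column names below are just the ones from this thread's schema, so adjust as needed.

-- Sample fewer values only for the columns used by the simple query:
ALTER TABLE history ALTER COLUMN ref_object SET STATISTICS 5;
ALTER TABLE history ALTER COLUMN ref_stat   SET STATISTICS 5;
ANALYZE history;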
* Dimitri <dimitrik.fr@gmail.com> [090511 11:18]: > Folks, it's completely crazy, but here is what I found: > > - if HISTORY table is analyzed with target 1000 my SELECT response > time is jumping to 3ms, and the max throughput is limited to 6.000 TPS > (it's what happenned to 8.3.7) > > -if HISTORY table is analyzed with target 5 - my SELECT response time > is decreasing to 1.2ms (!) and then my max TPS level is ~12.000 ! > and CPU is used up to 95% even by 8.3.7 :-) and 8.4 performed better > just because I left its analyze target to default 100 value. > > Anyone may explain me why analyze target may have so huge negative > secondary effect?.. It's actually pretty straightforward. The PostgreSQL query planner is a "smart planner". It takes into consideration all the statistics available on the columns/tables, expected outputs based on inputs, etc., to choose what it thinks will be the best plan. The more statistics data you have (the larger the statistics target), the more CPU time it's going to take to "plan" your queries. The tradeoff is hopefully better plans. But in your scenario, where you are hitting the database in the absolute worst possible way to use PostgreSQL - with small, repeated, simple queries - you're not getting the advantage of "better" plans. In your case, you're throwing absolutely simple queries at PG as fast as you can, and for each query PostgreSQL has to: 1) Parse the given "query string" 2) Given the statistics available, plan the query and pick the best one 3) Actually run the query. Part 2 is going to dominate the CPU time in your tests, all the more so the more statistics it has to evaluate, and, unless the data has to come from disk (i.e. it is not in shared buffers or the cache), it is thus going to dominate the time before you get your results. More statistics means more time needed to do the planning/picking of the query. If you were to use prepared statements, the cost of #1 and #2 is paid once, and then every time you throw a new execution of the query at PostgreSQL you only do #3, the quick and easy part - especially for small, simple queries where all the data is in shared buffers or the cache. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Hi Simon, it's still too early to speak about MySQL scalability... :-) it's only been a few months that there is *no more* regression in MySQL performance when moving from 8 to 16 cores. But looking at how quickly it's progressing now, things may change very quickly :-) For the moment my tests give: - on 8 cores: 14.000 TPS - on 16 cores: 17.500 TPS - on 32 cores: 15.000 TPS (regression) Rgds, -Dimitri On 5/11/09, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Mon, 2009-05-11 at 17:18 +0200, Dimitri wrote: > >> Yes, forget, MySQL is reaching 17.500 TPS here. > > Please share your measurements of MySQL scalability also. > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > >
Hi Aidan, thanks a lot for this detailed summary! So, why don't I use prepare here: let's say I'm testing the worst stress case :-) Imagine you have thousands of such queries - you cannot prepare all of them! :-) Or you may prepare each one once, but as I showed previously in this thread the prepare statement itself takes 16ms, so for a single shot there is no gain! :-) Stressing with this kind of short and simple query (and again, these have joins, it could be even simpler :-)) will give me a result that shows my guaranteed worst case - I then know that if I have to deploy a bombarding OLTP-like application, my database engine will be able to keep up with such a workload, and if I have performance problems they are inside the application! :-) (well, it's very simplistic, but it's not far from the truth :-)) Now, as you can see from your explanation, Part #2 is the most dominant - so instead of blaming this query, why not implement a QUERY PLANNER CACHE??? - in the sense that if any *similar* query is recognized by the parser, we simply *reuse* the same plan?.. Rgds, -Dimitri On 5/11/09, Aidan Van Dyk <aidan@highrise.ca> wrote: > * Dimitri <dimitrik.fr@gmail.com> [090511 11:18]: >> Folks, it's completely crazy, but here is what I found: >> >> - if HISTORY table is analyzed with target 1000 my SELECT response >> time is jumping to 3ms, and the max throughput is limited to 6.000 TPS >> (it's what happenned to 8.3.7) >> >> -if HISTORY table is analyzed with target 5 - my SELECT response time >> is decreasing to 1.2ms (!) and then my max TPS level is ~12.000 ! >> and CPU is used up to 95% even by 8.3.7 :-) and 8.4 performed better >> just because I left its analyze target to default 100 value. >> >> Anyone may explain me why analyze target may have so huge negative >> secondary effect?.. > > It's actually pretty straight forward. > > The PostgreSQL query planner is a "smart planner". It takes into > consideration all the statistics available on the columns/tables, > expected outputs based on inputs, etc, to choose what it thinks will be > the best plan. The more data you have in statistics (the larger > statistics target you have), the more CPU time and longer it's going to > take to "plan" your queries. The tradeoff is hopefully better plans. > > But, in your scenario, where you are hitting the database with the > absolute worst possible way to use PostgreSQL, with small, repeated, > simple queries, you're not getting the advantage of "better" plans. In > your case, you're throwing absolutely simple queries at PG as fast as > you can, and for each query, PostgreSQL has to: > > 1) Parse the given "query string" > 2) Given the statistics available, plan the query and pick the best one > 3) Actually run the query. > > Part 2 is going to dominate the CPU time in your tests, more so the more > statistics it has to evaluate, and unless the data has to come from the > disks (i.e. not in shared buffers or cache) is thus going to dominate the > time before you get your results. More statistics means more time > needed to do the planning/picking of the query. > > If you were to use prepared statements, the cost of #1 and #2 is done > once, and then every time you throw a new execution of the query to > PostgreSQL, you get to just do #3, the easy quick part, especially for > small simple queries where all the data is in shared buffers or the cache. > > a. > > -- > Aidan Van Dyk Create like a god, > aidan@highrise.ca command like a king, > http://www.highrise.ca/ work like a slave. >
Dimitri escribió: > Hi Aidan, > > thanks a lot for this detailed summary! > > So, why I don't use prepare here: let's say I'm testing the worst > stress case :-) Imagine you have thousands of such kind of queries - > you cannot prepare all of them! :-) Thousands? Surely there'll be a dozen or three most common queries, to which you pass different parameters. You can prepare those. > Now, as you see from your explanation, the Part #2 is the most > dominant - so why instead to blame this query not to implement a QUERY > PLANNER CACHE??? - in way if any *similar* query is recognized by > parser we simply *reuse* the same plan?.. This has been discussed in the past, but it turns out that a real implementation is a lot harder than it seems. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Dimitri wrote: > Now, as you see from your explanation, the Part #2 is the most > dominant - so why instead to blame this query not to implement a QUERY > PLANNER CACHE??? - in way if any *similar* query is recognized by > parser we simply *reuse* the same plan?.. At least in JDBC, there's several open source prepared statement cache implementations out there that people use. I don't know about other client libraries, but it certainly is possible to do in the client. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
>> So, why I don't use prepare here: let's say I'm testing the worst >> stress case :-) Imagine you have thousands of such kind of queries - >> you cannot prepare all of them! :-) > > Thousands? Surely there'll be a dozen or three of most common queries, > to which you pass different parameters. You can prepare thoseu Ok, and what if each client just connects to the database, executes each kind of query just *once* and then disconnects?.. - the cost of prepare will kill performance here if it's not reused at least 10 times within the same session. Well, I know, we can always do better, and even use stored procedures, etc. etc. > >> Now, as you see from your explanation, the Part #2 is the most >> dominant - so why instead to blame this query not to implement a QUERY >> PLANNER CACHE??? - in way if any *similar* query is recognized by >> parser we simply *reuse* the same plan?.. > > This has been discussed in the past, but it turns out that a real > implementation is a lot harder than it seems. Ok. If I remember correctly, Oracle has it and it helps a lot, but for sure it's not easy to implement.. Rgds, -Dimitri
Nice to know. But again, if this cache is kept only on the client side it'll always be lost on disconnect. And if clients are "short-lived" it won't help. BTW, is there an option to say "make the execution plan as simple as possible"? If you're sure about your data and your indexes, there's no need to spend so much time planning. Rgds, -Dimitri On 5/12/09, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Dimitri wrote: >> Now, as you see from your explanation, the Part #2 is the most >> dominant - so why instead to blame this query not to implement a QUERY >> PLANNER CACHE??? - in way if any *similar* query is recognized by >> parser we simply *reuse* the same plan?.. > > At least in JDBC, there's several open source prepared statement cache > implementations out there that people use. I don't know about other > client libraries, but it certainly is possible to do in the client. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com >
Hi, On 05/12/2009 12:46 AM, Dimitri wrote: > So, why I don't use prepare here: let's say I'm testing the worst > stress case :-) Imagine you have thousands of such kind of queries - > you cannot prepare all of them! :-) or you'll maybe prepare it once, > but as I showed previously in this thread prepare statement itself > takes 16ms, so for a single shot there is no gain! :-) I have a hard time imagining a high throughput OLTP workload with that many different queries ;-) Naturally it would still be nice to be good in this not optimal workload... Andres
Hi, Dimitri <dimitrik.fr@gmail.com> writes: >>> So, why I don't use prepare here: let's say I'm testing the worst >>> stress case :-) Imagine you have thousands of such kind of queries - >>> you cannot prepare all of them! :-) >> >> Thousands? Surely there'll be a dozen or three of most common queries, >> to which you pass different parameters. You can prepare thoseu > > Ok, and if each client just connect to the database, execute each kind > of query just *once* and then disconnect?.. - cost of prepare will > kill performance here if it's not reused at least 10 times within the > same session. In a scenario which looks like this one, what I'm doing is using pgbouncer transaction pooling. Now a new connection from a client can be served by an existing backend, which already has prepared your statement. So you first SELECT name FROM pg_prepared_statements; to know whether you have to PREPARE or can just EXECUTE, and you not only maintain far fewer running backends and fewer fork() calls, but also benefit fully from preparing the statements even when you EXECUTE once per client connection. > Well, I know, we always can do better, and even use stored procedures, > etc. etc. A plain SQL stored procedure will prevent PostgreSQL from preparing your queries; only PL/pgSQL functions will force transparent plan caching. But calling this PL will cost about 1ms per call in my tests, so it's not a good solution. It's possible to go as far as providing your own PostgreSQL C module where you PREPARE at _PG_init() time and EXECUTE in a SQL-callable function; coupled with pgbouncer it should max out the perfs. But maybe you're not willing to go this far. Anyway, is hammering the server with always the same query your real need or just a simplified test-case? If the former, you'll see there are good ways to theoretically obtain better perfs than what you're currently reaching; if the latter, I urge you to consider some better benchmarking tools, such as playr or tsung. https://area51.myyearbook.com/trac.cgi/wiki/Playr http://tsung.erlang-projects.org/ http://pgfouine.projects.postgresql.org/tsung.html http://archives.postgresql.org/pgsql-admin/2008-12/msg00032.php Regards, -- dim
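A minimal sketch of that check-then-execute flow (the statement name is a placeholder, and how the client reacts to a missing row is up to the application):

-- On being handed a pooled backend, see whether it already has the plan:
SELECT 1 FROM pg_prepared_statements WHERE name = 'hist_stat';
-- If no row comes back, PREPARE hist_stat first; otherwise go straight to:
EXECUTE hist_stat('0000000001');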
Folks, before you start to think "what a dumb guy doing a dumb thing" :-)) I'll explain a few details: for more than 10 years I've been using a db_STRESS kit (http://dimitrik.free.fr/db_STRESS.html) to check database performance and scalability. Until now I was very happy with the results it gave me, as it stresses each database engine's internals very well and brings to light some things I would probably miss on other workloads. What can I say - over time, the "fast" query that used to execute in 500ms now runs within 1-2ms; not only has hardware improved, but database engines have also increased their performance a lot! :-)) In 2007 I published the first public results with PostgreSQL, and at that time it was 2 times faster than MySQL (http://dimitrik.free.fr/db_STRESS_BMK_Part1.html) Last month, for the launch of MySQL 5.4, I did a long series of tests, and at the end, out of curiosity, I ran the same load against PostgreSQL 8.3.7 to see if MySQL is closer now. To my big surprise, MySQL was faster! Also, observations of PG processing raised a lot of questions - I supposed something was abnormal on the PG side, but I did not have much time to understand what it was exactly (http://dimitrik.free.fr/db_STRESS_MySQL_540_and_others_Apr2009.html#note_5443) What I'm trying to do now is to understand what exactly the problem is. What I've discovered so far with all your help: - the impact of the planner - the impact of the analyze target - the impact of prepare / execute - the scalability limit on 32 cores I'll also try to adapt the prepare/execute solution to see how much it improves performance and/or scalability. Also, with help from the other thread, I was able to improve the TPS stability on the read+write workload a lot! :-) Any other comments are welcome! Rgds, -Dimitri On 5/12/09, Dimitri Fontaine <dfontaine@hi-media.com> wrote: > Hi, > > Dimitri <dimitrik.fr@gmail.com> writes: > >>>> So, why I don't use prepare here: let's say I'm testing the worst >>>> stress case :-) Imagine you have thousands of such kind of queries - >>>> you cannot prepare all of them! :-) >>> >>> Thousands? Surely there'll be a dozen or three of most common queries, >>> to which you pass different parameters. You can prepare thoseu >> >> Ok, and if each client just connect to the database, execute each kind >> of query just *once* and then disconnect?.. - cost of prepare will >> kill performance here if it's not reused at least 10 times within the >> same session. > > In a scenario which looks like this one, what I'm doing is using > pgbouncer transaction pooling. Now a new connection from client can be > served by an existing backend, which already has prepared your > statement. > > So you first SELECT name FROM pg_prepared_statements; to know if you > have to PREPARE or just EXECUTE, and you not only maintain much less > running backends, lower fork() calls, but also benefit fully from > preparing the statements even when you EXECUTE once per client > connection. > >> Well, I know, we always can do better, and even use stored procedures, >> etc. etc. > > Plain SQL stored procedure will prevent PostgreSQL to prepare your > queries, only PLpgSQL functions will force transparent plan caching. But > calling this PL will cost about 1ms per call in my tests, so it's not a > good solution. > > It's possible to go as far as providing your own PostgreSQL C module > where you PREPARE at _PG_init() time and EXECUTE in a SQL callable > function, coupled with pgbouncer it should max out the perfs. But maybe > you're not willing to go this far. > > Anyway, is hammering the server with always the same query your real > need or just a simplified test-case? If the former, you'll see there are > good ways to theorically obtain better perfs than what you're currently > reaching, if the latter I urge you to consider some better benchmarking > tools, such as playr or tsung. > > https://area51.myyearbook.com/trac.cgi/wiki/Playr > http://tsung.erlang-projects.org/ > http://pgfouine.projects.postgresql.org/tsung.html > http://archives.postgresql.org/pgsql-admin/2008-12/msg00032.php > > Regards, > -- > dim >
On Tue, 2009-05-12 at 12:19 +0200, Dimitri wrote: > For my big surprise, MySQL was faster! Ours too. ** I bet you $1000 that I can improve the performance of your benchmark results with PostgreSQL. You give me $1000 up-front and if I can't improve your high end numbers I'll give you $2000 back. Either way, you name me and link to me from your blog. Assuming you re-run the tests as requested and give me reasonable access to info and measurements. ** I note your blog identifies you as a Sun employee. Is that correct? If you do not give us the opportunity to improve upon the results then reasonable observers might be persuaded you did not wish to show PostgreSQL in its best light. You up for it? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Wow, Simon! :-)) Yes, I'm working in the Sun Benchmark Center :-)) (I'm not using my Sun email on public lists only to avoid spam), and as I came here asking questions, it probably proves my intention to show PostgreSQL in its best light, no?.. - I never liked "not honest" comparisons :-)) Regarding your bet: from a very young age I learned one thing - take any 2 people who are betting for any reason and you'll find in them one idiot and one bastard :-)) idiot - because betting while missing knowledge, and bastard - because knowing the truth it's not honest to profit from idiots :-)) That's why I've never bet in my life, but every time I tell the same story in such a situation... Did you like it? ;-)) However, no problem to give credit to you as well as to all of the pg-perf list, as it provides very valuable help! :-)) Rgds, -Dimitri On 5/12/09, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Tue, 2009-05-12 at 12:19 +0200, Dimitri wrote: > >> For my big surprise, MySQL was faster! > > Ours too. > > ** I bet you $1000 that I can improve the performance of your benchmark > results with PostgreSQL. You give me $1000 up-front and if I can't > improve your high end numbers I'll give you $2000 back. Either way, you > name me and link to me from your blog. Assuming you re-run the tests as > requested and give me reasonable access to info and measurements. ** > > I note your blog identifies you as a Sun employee. Is that correct? If > you do not give us the opportunity to improve upon the results then > reasonable observers might be persuaded you did not wish to show > PostgreSQL in its best light. You up for it? > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > >
On Tue, 2009-05-12 at 13:16 +0200, Dimitri wrote: > Wow, Simon! :-)) > > yes, I'm working in Sun Benchmark Center :-)) > (I'm not using my Sun email on public lists only to avid a spam) > > and as came here and asking questions it's probably proving my > intentions to show PostgreSQL in its best light, no?.. - I never liked > "not honest" comparisons :-)) > > Regarding your bet: from a very young age I learned a one thing - you > take any 2 person who betting for any reason - you'll find in them one > idiot and one bastard :-)) idiot - because betting while missing > knowledge, and bastard - because knowing the truth is not honset to > get a profit from idiots :-)) That's why I never betting in my life, > but every time telling the same story in such situation... Did you > like it? ;-)) No, but I asked for it, so we're even. ;-) Let's work on the benchmark. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Dimitri wrote: > What I discovered so far with all your help: > - the impact of a planner > - the impact of the analyze target > - the impact of prepare / execute > - scalability limit on 32 cores You've received good advice on how to minimize the impact of the first three points, and using those techniques should bring a benefit. But I'm pretty surprised by the bad scalability you're seeing and no-one seems to have a good idea on where that limit is coming from. At a quick glance, I don't see any inherent bottlenecks in the schema and workload. If you could analyze where the bottleneck is with multiple cores, that would be great. With something like oprofile, it should be possible to figure out where the time is spent. My first guess would be the WALInsertLock: writing to WAL is protected by that and it can become a bottleneck with lots of small UPDATE/DELETE/INSERT transactions. But a profile would be required to verify that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2009-05-12 at 12:19 +0200, Dimitri wrote: > What I'm trying to do now is to understand what exactly is the > problem. You're running with 1600 users, which is above the scalability limit uncovered (by Sun...) during earlier benchmarking. The scalability issues are understood but currently considered above the reasonable-setting limit and so nobody has been inclined to improve matters. You should use a connection concentrator to reduce the number of sessions down to, say, 400. Your WAL buffers setting is also too low and you will be experiencing contention on the WALWriteLock. Increase wal_buffers to about 8x where you have it now. You can move pg_xlog to its own set of drives. Set checkpoint_completion_target to 0.95. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
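In postgresql.conf terms the suggestions above might look something like the sketch below; the concrete values are illustrative only and should be scaled from whatever the server currently uses.

wal_buffers = 8MB                      # ~8x a 1MB starting point, adjust to the current setting
checkpoint_completion_target = 0.95    # spread checkpoint writes out further
max_connections = 400                  # with a connection concentrator in front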
For the moment I'm even not considering any scalability issues on the Read+Write workload - it may always be related to the storage box, and storage latency or controller/cache efficiency may play a lot. As problem I'm considering a scalability issue on Read-Only workload - only selects, no disk access, and if on move from 8 to 16 cores we gain near 100%, on move from 16 to 32 cores it's only 10%... I think I have to replay Read-Only with prepare/execute and check how much it'll help (don't know if there are some internal locking used when a planner is involved).. And yes, I'll try to profile on 32 cores, it makes sense. Rgds, -Dimitri On 5/12/09, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Dimitri wrote: >> What I discovered so far with all your help: >> - the impact of a planner >> - the impact of the analyze target >> - the impact of prepare / execute >> - scalability limit on 32 cores > > You've received good advice on how to minimize the impact of the first > three points, and using those techniques should bring a benefit. But I'm > pretty surprised by the bad scalability you're seeing and no-one seems > to have a good idea on where that limit is coming from. At a quick > glance, I don't see any inherent bottlenecks in the schema and workload. > > If you could analyze where the bottleneck is with multiple cores, that > would be great. With something like oprofile, it should be possible to > figure out where the time is spent. > > My first guess would be the WALInsertLock: writing to WAL is protected > by that and it an become a bottleneck with lots of small > UPDATE/DELETE/INSERT transactions. But a profile would be required to > verify that. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com >
Dimitri wrote: > Folks, before you start to think "what a dumb guy doing a dumb thing" :-)) > I'll explain you few details: > > it's for more than 10 years I'm using a db_STRESS kit > (http://dimitrik.free.fr/db_STRESS.html) to check databases > performance and scalability. Until now I was very happy with results > it gave me as it stress very well each database engine internals an > put on light some things I should probably skip on other workloads. > What do you want, with a time the "fast" query executed before in > 500ms now runs within 1-2ms - not only hardware was improved but also > database engines increased their performance a lot! :-)) I was attempting to look into that "benchmark" kit a bit but I find the information on that page a bit lacking :( a few notes: * is the source code for the benchmark actually available? the "kit" seems to contain a few precompiled binaries and some source/header files but there are no building instructions, no makefile or even a README, which makes it really hard to verify exactly what the benchmark is doing or whether the benchmark client might actually be the problem here. * there is very little information on how the toolkit talks to the database - some of the binaries seem to contain a static copy of libpq or such? * how many queries per session is the toolkit actually using - some earlier comments seem to imply you are doing a connect/disconnect cycle for every query; is that actually true? Stefan
Wait wait, currently I'm playing the "stress scenario", so there are only 256 sessions max, but think time is zero (full stress). The scenario with 1600 users is to test how solid the database is at just holding a huge number of users while doing only one transaction per second (a very low global TPS compared to what the database is able to do, but it tests how well its internals work at managing the user sessions). I did not plan to do the 1600-user test this time (it all depends on time :-)) So, do I need to increase WAL buffers for 256 users? My LOG and DATA are placed on separate storage LUNs and controllers from the beginning. I've changed the default 0.5 checkpoint_completion_target to 0.8 now, should I go up to 0.95?.. Also, to avoid TPS "waves" and bring stability to the Read+Write workload I followed advice from a parallel thread: bgwriter_lru_maxpages = 1000 bgwriter_lru_multiplier = 4.0 shared_buffers = 1024MB I've also tried shared_buffers=256MB as it was advised, but then the Read-Only workload decreases in performance, as PG's own caching helps anyway. Also, checkpoint_timeout is 30s now, and of course a huge difference came with moving default_statistics_target to 5! - but this one I found myself :-)) Probably checkpoint_timeout may be bigger now with the current settings? The goal here is to keep Read+Write TPS as stable as possible and also to avoid a long recovery in case of a system/database/other crash (in theory). Rgds, -Dimitri On 5/12/09, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Tue, 2009-05-12 at 12:19 +0200, Dimitri wrote: > >> What I'm trying to do now is to understand what exactly is the >> problem. > > You're running with 1600 users, which is above the scalability limit > uncovered (by Sun...) during earlier benchmarking. The scalability > issues are understood but currently considered above the > reasonable-setting limit and so nobody has been inclined to improve > matters. > > You should use a connection concentrator to reduce the number of > sessions down to say 400. > > You're WAL buffers setting is also too low and you will be experiencing > contention on the WALWriteLock. Increase wal_buffers to about x8 where > you have it now. > > You can move pg_xlog to its own set of drives. > > Set checkpoint_completion_target to 0.95. > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > >
Hi Stefan, sorry, I did not have a time to bring all details into the toolkit - but at least I published it instead to tell a "nice story" about :-) The client process is a binary compiled with libpq. Client is interpreting a scenario script and publish via SHM a time spent on each SQL request. I did not publish sources yet as it'll also require to explain how to compile them :-)) So for the moment it's shipped as a freeware, but with time everything will be available (BTW, you're the first who asking for sources (well, except IBM guys who asked to get it on POWER boxes, but it's another story :-)) What is good is each client is publishing *live* its internal stats an we're able to get live data and follow any kind of "waves" in performance. Each session is a single process, so there is no contention between clients as you may see on some other tools. The current scenario script contains 2 selects (representing a Read transaction) and delete/insert/update (representing Write transaction). According a start parameters each client executing a given number Reads per Write. It's connecting on the beginning and disconnecting at the end of the test. It's also possible to extend it to do other queries, or simply give to each client a different scenario script - what's important is to able to collect then its stats live to understand what's going wrong (if any).. I'm planning to extend it and give an easy way to run it against any database schema, it's only question of time.. Rgds, -Dimitri On 5/12/09, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > Dimitri wrote: >> Folks, before you start to think "what a dumb guy doing a dumb thing" :-)) >> I'll explain you few details: >> >> it's for more than 10 years I'm using a db_STRESS kit >> (http://dimitrik.free.fr/db_STRESS.html) to check databases >> performance and scalability. Until now I was very happy with results >> it gave me as it stress very well each database engine internals an >> put on light some things I should probably skip on other workloads. >> What do you want, with a time the "fast" query executed before in >> 500ms now runs within 1-2ms - not only hardware was improved but also >> database engines increased their performance a lot! :-)) > > I was attempting to look into that "benchmark" kit a bit but I find the > information on that page a bit lacking :( a few notices: > > * is the sourcecode for the benchmark actually available? the "kit" > seems to contain a few precompiled binaries and some source/headfiles > but there are no building instructions, no makefile or even a README > which makes it really hard to verify exactly what the benchmark is doing > or if the benchmark client might actually be the problem here. > > * there is very little information on how the toolkit talks to the > database - some of the binaries seem to contain a static copy of libpq > or such? > > * how many queries per session is the toolkit actually using - some > earlier comments seem to imply you are doing a connect/disconnect cycle > for every query ist that actually true? > > > Stefan >
Dimitri wrote: > Hi Stefan, > > sorry, I did not have a time to bring all details into the toolkit - > but at least I published it instead to tell a "nice story" about :-) fair point and appreciated. But it seems important that benchmarking results can be verified by others as well... > > The client process is a binary compiled with libpq. Client is > interpreting a scenario script and publish via SHM a time spent on > each SQL request. I did not publish sources yet as it'll also require > to explain how to compile them :-)) So for the moment it's shipped as > a freeware, but with time everything will be available (BTW, you're > the first who asking for sources (well, except IBM guys who asked to > get it on POWER boxes, but it's another story :-)) well, there is no licence tag (or a copyright notice) or anything else associated with the download, which makes it a bit harder than it really needs to be. The reason why I was actually looking for the source is that all my available benchmark platforms are none of the ones you are providing binaries for, which kinda reduces its usefulness. > > What is good is each client is publishing *live* its internal stats an > we're able to get live data and follow any kind of "waves" in > performance. Each session is a single process, so there is no > contention between clients as you may see on some other tools. The > current scenario script contains 2 selects (representing a Read > transaction) and delete/insert/update (representing Write > transaction). According a start parameters each client executing a > given number Reads per Write. It's connecting on the beginning and > disconnecting at the end of the test. well, I have seen clients getting bottlenecked internally (like wasting more time absorbing/discarding the actual result than it took the server to generate the answer...). How sure are you that your "live publishing of data" does not affect the benchmark results (because it kinda generates an artificial think time), for example? But what I get from your answer is that you are basically doing one connect/disconnect per client and the testcase you are talking about has 256 clients? Stefan
On Tue, May 12, 2009 at 8:59 AM, Dimitri <dimitrik.fr@gmail.com> wrote: > Wait wait, currently I'm playing the "stress scenario", so there are > only 256 sessions max, but thing time is zero (full stress). Scenario > with 1600 users is to test how database is solid just to keep a huge > amount of users, but doing only one transaction per second (very low > global TPS comparing to what database is able to do, but it's testing > how well its internals working to manage the user sessions). Didn't we beat this to death in mid-March on this very same list? Last time I think it was Jignesh Shah. AIUI, it's a well-known fact that PostgreSQL doesn't do very well at this kind of workload unless you use a connection pooler. *goes and checks the archives* Sure enough, 116 emails under the subject line "Proposal of tunable fix for scalability of 8.4". So, if your goal is to find a scenario under which PostgreSQL performs as badly as possible, congratulations - you've discovered the same case that we already knew about. Obviously it would be nice to improve it, but IIRC so far no one has had any very good ideas on how to do that. If this example mimics a real-world workload that you care about, and if using a connection pooler is just not a realistic option in that scenario for whatever reason, then you'd be better off working on how to fix it than on measuring it, because it seems to me we already know it's got problems, per previous discussions. ...Robert
On Tue, 12 May 2009, Stefan Kaltenbrunner wrote: > But what I get from your answer is that you are basically doing one > connect/disconnect per client and the testcase you are talking about has 256 > clients? Correct me if I'm wrong, but won't connect operations be all handled by a single thread - the parent postmaster? There's your scalability problem right there. Also, spawning a new backend process is an awful lot of overhead to run just one query. As far as I can see, it's quite understandable for MySQL to perform better than PostgreSQL in these circumstances, as it has a smaller simpler backend to start up each time. If you really want to get a decent performance out of Postgres, then use long-lived connections (which most real-world use cases will do) and prepare your queries in advance with parameters. Matthew -- import oz.wizards.Magic; if (Magic.guessRight())... -- Computer Science Lecturer
On Tue, 2009-05-12 at 16:00 +0100, Matthew Wakeling wrote: > won't connect operations be all handled by a > single thread - the parent postmaster? No, we spawn then authenticate. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Tue, 12 May 2009, Simon Riggs wrote: >> won't connect operations be all handled by a >> single thread - the parent postmaster? > > No, we spawn then authenticate. But you still have a single thread doing the accept() and spawn. At some point (maybe not now, but in the future) this could become a bottleneck given very short-lived connections. Matthew -- -. .-. .-. .-. .-. .-. .-. .-. .-. .-. .-. .-. .-. ||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /||| |/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ ' `-' `-' `-' `-' `-' `-' `-' `-' `-' `-' `-' `-'
Matthew Wakeling wrote: > On Tue, 12 May 2009, Simon Riggs wrote: >>> won't connect operations be all handled by a >>> single thread - the parent postmaster? >> >> No, we spawn then authenticate. > > But you still have a single thread doing the accept() and spawn. At some > point (maybe not now, but in the future) this could become a bottleneck > given very short-lived connections. well the main cost is backend startup and that one is extremely expensive (compared to the cost of a simple query and also depending on the OS). We have more overhead there than other databases (most notably MySQL) hence what prompted my question on how the benchmark was operating. For any kind of workloads that contain frequent connection establishments one wants to use a connection pooler like pgbouncer(as said elsewhere in the thread already). Stefan
Matthew Wakeling <matthew@flymine.org> writes: > On Tue, 12 May 2009, Simon Riggs wrote: >> No, we spawn then authenticate. > But you still have a single thread doing the accept() and spawn. At some > point (maybe not now, but in the future) this could become a bottleneck > given very short-lived connections. More to the point, each backend process is a pretty heavyweight object: it is a process, not a thread, and it's not going to be good for much until it's built up a reasonable amount of stuff in its private caches. I don't think the small number of cycles executed in the postmaster process amount to anything at all compared to the other overhead involved in getting a backend going. In short: executing a single query per connection is going to suck, and there is not anything we are going to do about it except to tell you to use a connection pooler. MySQL has a different architecture: thread per connection, and AFAIK whatever caches it has are shared across threads. So a connection is a lighter-weight object for them; but there's no free lunch. They pay for it in having to tolerate locking/contention overhead on operations that for us are backend-local. regards, tom lane
Robert, what I'm testing now is 256 users max. The workload grows progressively from 1, 2, 4, 8 ... to 256 users. Of course the max throughput is reached with the number of users equal to 2 * the number of cores, but what's important for me here is that the database should continue to keep up with the workload! - response time regressing, but the throughput should remain near the same. So, do I really need a pooler to keep 256 users working?? - I don't think so, but please correct me. BTW, I did not seek to put PostgreSQL in bad conditions - the test is the test, and as I said, 2 years ago PostgreSQL outperformed MySQL on the same test case, and there was nothing done within the MySQL code to improve it explicitly for db_STRESS.. And I stay pretty honest when I'm testing something. Rgds, -Dimitri On 5/12/09, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, May 12, 2009 at 8:59 AM, Dimitri <dimitrik.fr@gmail.com> wrote: >> Wait wait, currently I'm playing the "stress scenario", so there are >> only 256 sessions max, but thing time is zero (full stress). Scenario >> with 1600 users is to test how database is solid just to keep a huge >> amount of users, but doing only one transaction per second (very low >> global TPS comparing to what database is able to do, but it's testing >> how well its internals working to manage the user sessions). > > Didn't we beat this to death in mid-March on this very same list? > Last time I think it was Jignesh Shah. AIUI, it's a well-known fact > that PostgreSQL doesn't do very well at this kind of workload unless > you use a connection pooler. > > *goes and checks the archives* Sure enough, 116 emails under the > subject line "Proposal of tunable fix for scalability of 8.4". > > So, if your goal is to find a scenario under which PostgreSQL performs > as badly as possible, congratulations - you've discovered the same > case that we already knew about. Obviously it would be nice to > improve it, but IIRC so far no one has had any very good ideas on how > to do that. If this example mimics a real-world workload that you > care about, and if using a connection pooler is just not a realistic > option in that scenario for whatever reason, then you'd be better off > working on how to fix it than on measuring it, because it seems to me > we already know it's got problems, per previous discussions. > > ...Robert >
Dimitri <dimitrik.fr@gmail.com> wrote: > Of course the Max throughput is reached on the number of users equal > to 2 * number of cores I'd expect that when disk I/O is not a significant limiting factor, but I've seen a "sweet spot" of (2 * cores) + (effective spindle count) for loads involving a lot of random I/O. > So, do I really need a pooler to keep 256 users working?? I have seen throughput fall above a certain point when I don't use a connection pooler. With a connection pooler which queues requests when all connections are busy, you will see no throughput degradation as users of the pool are added. Our connection pool is in our framework, so I don't know whether pgbouncer queues requests. (Perhaps someone else can comment on that, and make another suggestion if it doesn't.) -Kevin
On 5/12/09, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > Dimitri wrote: >> Hi Stefan, >> >> sorry, I did not have a time to bring all details into the toolkit - >> but at least I published it instead to tell a "nice story" about :-) > > fair point and appreciated. But it seems important that benchmarking > results can be verified by others as well... until now there were only people running Solaris or Linux :-)) > >> >> The client process is a binary compiled with libpq. Client is >> interpreting a scenario script and publish via SHM a time spent on >> each SQL request. I did not publish sources yet as it'll also require >> to explain how to compile them :-)) So for the moment it's shipped as >> a freeware, but with time everything will be available (BTW, you're >> the first who asking for sources (well, except IBM guys who asked to >> get it on POWER boxes, but it's another story :-)) > > well there is no licence tag(or a copyright notice) or anything als > associated with the download which makes it a bit harder than it really > needs to be. > The reason why I was actually looking for the source is that all my > available benchmark platforms are none of the ones you are providing > binaries for which kinda reduces its usefulness. > agree, will improve this point >> >> What is good is each client is publishing *live* its internal stats an >> we're able to get live data and follow any kind of "waves" in >> performance. Each session is a single process, so there is no >> contention between clients as you may see on some other tools. The >> current scenario script contains 2 selects (representing a Read >> transaction) and delete/insert/update (representing Write >> transaction). According a start parameters each client executing a >> given number Reads per Write. It's connecting on the beginning and >> disconnecting at the end of the test. > > well I have seen clients getting bottlenecked internally (like wasting > more time in getting rid/absorbing of the actual result than it took the > server to generate the answer...). > How sure are you that your "live publishing of data" does not affect the > benchmark results(because it kinda generates an artifical think time) > for example? On all my test tools client are publishing their data via shared memory segment (ISM), all they do is just *incrementing* their current stats values and continuing their processing. Another dedicated program should be executed to print these stats - it's connecting to the same SHM segment and printing a *difference* between values for the current and the next interval. Let me know if you need more details. > But what I get from your answer is that you are basically doing one > connect/disconnect per client and the testcase you are talking about has > 256 clients? Exactly, only one connect/disconnect per test, and number of clients is growing progressively from 1, 2, 4, 8, 16, .. to 256 Rgds, -Dimitri > > > Stefan >
On Tue, 2009-05-12 at 17:22 +0200, Dimitri wrote: > Robert, what I'm testing now is 256 users max. The workload is growing > progressively from 1, 2, 4, 8 ... to 256 users. Of course the Max > throughput is reached on the number of users equal to 2 * number of > cores, but what's important for me here - database should continue to > keep the workload! - response time regressing, but the troughput > should remain near the same. > > So, do I really need a pooler to keep 256 users working?? - I don't > think so, but please, correct me. If they disconnect and reconnect yes. If they keep the connections live then no. Joshua D. Drake -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
No, they keep connections till the end of the test. Rgds, -Dimitri On 5/12/09, Joshua D. Drake <jd@commandprompt.com> wrote: > On Tue, 2009-05-12 at 17:22 +0200, Dimitri wrote: >> Robert, what I'm testing now is 256 users max. The workload is growing >> progressively from 1, 2, 4, 8 ... to 256 users. Of course the Max >> throughput is reached on the number of users equal to 2 * number of >> cores, but what's important for me here - database should continue to >> keep the workload! - response time regressing, but the troughput >> should remain near the same. >> >> So, do I really need a pooler to keep 256 users working?? - I don't >> think so, but please, correct me. > > If they disconnect and reconnect yes. If they keep the connections live > then no. > > Joshua D. Drake > > -- > PostgreSQL - XMPP: jdrake@jabber.postgresql.org > Consulting, Development, Support, Training > 503-667-4564 - http://www.commandprompt.com/ > The PostgreSQL Company, serving since 1997 > >
Robert Haas <robertmhaas@gmail.com> writes: > AIUI, whenever the connection pooler switches to serving a new client, > it tells the PG backend to DISCARD ALL. But why couldn't we just > implement this same logic internally? IOW, when a client disconnects, > instead of having the backend exit immediately, have it perform the > equivalent of DISCARD ALL and then stick around for a minute or two > and, if a new connection request arrives within that time, have the > old backend handle the new connection... See previous discussions. IIRC, there are two killer points: 1. There is no (portable) way to pass the connection from the postmaster to another pre-existing process. 2. You'd have to track which database, and probably which user, each such backend had been launched for; reconnecting a backend to a new database is probably impractical and would certainly invalidate all the caching. Overall it looked like way too much effort for way too little gain. regards, tom lane
Andres Freund escribió: > Naturally it would still be nice to be good in this not optimal workload... I find it hard to justify wasting our scarce development resources into optimizing such a contrived workload. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Tue, 2009-05-12 at 11:30 -0700, Scott Carey wrote: > the fact is there is no evidence that a connection pooler will fix the > scalability from 16 > 32 cores. There has been much analysis over a number of years of the effects of the ProcArrayLock, specifically the O(N^2) effect of increasing numbers of connections on GetSnapshotData(). Most discussion has been on -hackers, not -perform. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Tue, 2009-05-12 at 15:52 -0400, Robert Haas wrote: > On Tue, May 12, 2009 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > 1. There is no (portable) way to pass the connection from the postmaster > > to another pre-existing process. > > [Googles.] It's not obvious to me that SCM_RIGHTS is non-portable, > and Windows has an API call WSADuplicateSocket() specifically for this > purpose. Robert, Greg, Tom's main point is it isn't worth doing. We have connection pooling software that works well, very well. Why do we want to bring it into core? (Think of the bugs we'd hit...) If we did, who would care? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, Le 12 mai 09 à 18:32, Robert Haas a écrit : > implement this same logic internally? IOW, when a client disconnects, > instead of having the backend exit immediately, have it perform the > equivalent of DISCARD ALL and then stick around for a minute or two > and, if a new connection request arrives within that time, have the > old backend handle the new connection... A much better idea to solve this, in my opinion, would be to have pgbouncer as a postmaster child, integrated into PostgreSQL. It allows for choosing whether you want session pooling, transaction pooling or statement pooling, which is a more deterministic way to choose when your client connection will benefit from a fresh backend or an existing one. And it's respecting some backend timeouts etc. It's Open-Source proven technology, and I think I've heard about some PostgreSQL distribution where it's already a postmaster's child. <handwaving> And when associated with Hot Standby (and Sync Wal Shipping), having a connection pooler in -core could allow for transparent Read-Write access to the slave: at the moment you need an XID (and when connected on the slave), the backend could tell the pgbouncer process to redirect the connection to the master. With such a feature, you don't have to build client side high availability, just connect to either the master or the slave and be done with it, whatever the SQL you're gonna run. </> > Regards, -- dim
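For reference, in stock pgbouncer the pooling mode discussed above is a single setting; a minimal configuration sketch (the database name, port and pool sizes are placeholders, not measured recommendations) could look like this:

[databases]
db_stress = host=127.0.0.1 port=5432 dbname=db_stress

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = trust
auth_file = userlist.txt
; one of: session / transaction / statement
pool_mode = transaction
max_client_conn = 1600
default_pool_size = 64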
Although nobody wants to support it, he should try the patch that Jignesh K. Shah (from Sun) proposed that makes ProcArrayLock lighter-weight. If it makes 32 cores much faster, then we have a smoking gun. Although everyone here is talking about this as an 'unoptimal' solution, the fact is there is no evidence that a connection pooler will fix the scalability from 16 > 32 cores. Certainly a connection pooler will help most results, but it may not fix the scalability problem. A question for Dimitri: What is the scalability from 16 > 32 cores at the 'peak' load that occurs near 2x the CPU count? Is it also poor? If this is also poor, IMO the community here should not be complaining about this unoptimal case -- a connection pooler at that stage does little and prepared statements will increase throughput but not likely alter scalability. If that result scales, then the short-term answer is a connection pooler. In the tests that Jignesh ran -- making the ProcArrayLock faster helped the case where connections = 2x the CPU core count quite a bit. The thread about the CPU scalability is "Proposal of tunable fix for scalability of 8.4", originally posted by "Jignesh K. Shah" <J.K.Shah@Sun.COM>, March 11 2009. It would be very useful to see results of this benchmark with: 1. A Connection Pooler 2. Jignesh's patch 3. Prepared statements #3 is important, because prepared statements are ideal for queries that perform well with low statistics targets, and not ideal for those that require high statistics targets. Realistically, an app won't have more than a couple dozen statement forms to prepare. Setting the default statistics target to 5 is just a way to make some other query perform badly. On 5/12/09 10:53 AM, "Alvaro Herrera" <alvherre@commandprompt.com> wrote: > Andres Freund escribió: > >> Naturally it would still be nice to be good in this not optimal workload... > > I find it hard to justify wasting our scarce development resources into > optimizing such a contrived workload. > > -- > Alvaro Herrera http://www.CommandPrompt.com/ > The PostgreSQL Company - Command Prompt, Inc. > >
On Tue, 2009-05-12 at 21:24 +0100, Simon Riggs wrote: > On Tue, 2009-05-12 at 15:52 -0400, Robert Haas wrote: > > On Tue, May 12, 2009 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > 1. There is no (portable) way to pass the connection from the postmaster > > > to another pre-existing process. > > > > [Googles.] It's not obvious to me that SCM_RIGHTS is non-portable, > > and Windows has an API call WSADuplicateSocket() specifically for this > > purpose. > > Robert, Greg, > > Tom's main point is it isn't worth doing. We have connection pooling > software that works well, very well. Why do we want to bring it into > core? (Think of the bugs we'd hit...) If we did, who would care? I would. Not to knock poolers, but they are limited and not very efficient. Heck, the best one I have used is pgbouncer, and it has problems too under heavy load (due to libevent issues). It also doesn't support all of our auth methods. Apache solved this problem back when it was still called NCSA HTTPd. Why aren't we preforking again? Joshua D. Drake -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Tue, May 12, 2009 at 1:00 PM, Dimitri <dimitrik.fr@gmail.com> wrote: > On MySQL there is no changes if I set the number of sessions in the > config file to 400 or to 2000 - for 2000 it'll just allocate more > memory. I don't care whether the setting affects the speed of MySQL. I want to know if it affects the speed of PostgreSQL. > After latest fix with default_statistics_target=5, version 8.3.7 is > running as fast as 8.4, even 8.4 is little little bit slower. > > I understand your position with a pooler, but I also want you think > about idea that 128 cores system will become a commodity server very > soon, and to use these cores on their full power you'll need a > database engine capable to run 256 users without pooler, because a > pooler will not help you here anymore.. So what? People with 128-core systems will not be running trivial joins that return in 1-2ms and have one second think times between them. And if they are, and if they have nothing better to do than worry about whether MySQL can process those queries in 1/2000th of the think time rather than 1/1000th of the think time, then they can use MySQL. If we're going to worry about performance on 128-core system, we would be much better advised to put our efforts into parallel query execution than how many microseconds it takes to execute very simple queries. Still, I have no problem with making PostgreSQL faster in the case you're describing. I'm just not interested in doing it on my own time for free. I am sure there are a number of people who read this list regularly who would be willing to do it for money, though. Maybe even me. :-) ...Robert
On Tue, May 12, 2009 at 4:24 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Tue, 2009-05-12 at 15:52 -0400, Robert Haas wrote: >> On Tue, May 12, 2009 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> > 1. There is no (portable) way to pass the connection from the postmaster >> > to another pre-existing process. >> >> [Googles.] It's not obvious to me that SCM_RIGHTS is non-portable, >> and Windows has an API call WSADuplicateSocket() specifically for this >> purpose. > > Robert, Greg, > > Tom's main point is it isn't worth doing. We have connection pooling > software that works well, very well. Why do we want to bring it into > core? (Think of the bugs we'd hit...) If we did, who would care? I don't know. It seems like it would be easier to manage just PostgreSQL than PostgreSQL + connection pooling software, but mostly I was just curious whether it had been thought about, so I asked, and the answer then led to a further question... was not intending to make a big deal about it. ...Robert
Dimitri Fontaine escribió: > A much better idea to solve this, in my opinion, would be to have > pgbouncer as a postmaster child, integrated into PostgreSQL. It allows > for choosing whether you want session pooling, transaction pooling or > statement pooling, which is a more deterministic way to choose when your > client connection will benefit from a fresh backend or an existing one. > And it's respecting some backend timeouts etc. Hmm. Seems like the best idea if we go this route would be one of Simon's which was to have better support for pluggable postmaster children. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Tue, May 12, 2009 at 5:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > See previous discussions. IIRC, there are two killer points: > > 1. There is no (portable) way to pass the connection from the postmaster > to another pre-existing process. The Apache model is to have all the backends call accept. So incoming connections don't get handled by a single master process, they get handled by whichever process the kernel picks to receive the connection. -- greg
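For illustration, the prefork-and-accept pattern Greg describes boils down to something like the sketch below; all names, port and child count are hypothetical, and this is not PostgreSQL or Apache code:

/* Prefork sketch: N children share one listening socket and each
 * blocks in accept(); the kernel decides which child gets a new
 * connection, so no single master process handles the handoff. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define NUM_CHILDREN 8

static void child_loop(int listen_fd)
{
    for (;;)
    {
        int conn_fd = accept(listen_fd, NULL, NULL);   /* blocks here */
        if (conn_fd < 0)
            continue;
        /* ... authenticate and serve the session on conn_fd ... */
        close(conn_fd);
    }
}

int main(void)
{
    struct sockaddr_in addr;
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5432);               /* placeholder port */

    if (bind(listen_fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(listen_fd, 128) < 0)
    {
        perror("bind/listen");
        return 1;
    }

    for (int i = 0; i < NUM_CHILDREN; i++)
        if (fork() == 0)                       /* children inherit listen_fd */
            child_loop(listen_fd);

    for (;;)                                   /* parent just supervises */
        pause();
}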
On Tue, 2009-05-12 at 20:34 -0400, Aidan Van Dyk wrote: > * Joshua D. Drake <jd@commandprompt.com> [090512 19:27]: > > > Apache solved this problem back when it was still called NSCA HTTPD. Why > > aren't we preforking again? > > Of course, preforking and connection pooling are totally different > beast... > Yes and no. They both solve similar problems and preforking solves more problems when you look at the picture in entirety (namely authentication integration etc..) > But, what really does preforking give us? A 2 or 3% improvement? It depends on the problem we are solving. We can test it but I would bet it is more than that especially in a high velocity environment. > The > forking isn't the expensive part, It is expensive but not as expensive as the below. > the per-database setup that happens is > the expensive setup... All pre-forking would save us is a tiny part of > the initial setup, and in turn make our robust postmaster controller no > longer have control. I don't buy this. Properly coded we aren't going to lose any "control". Joshua D. Drake -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
* Joshua D. Drake <jd@commandprompt.com> [090512 19:27]: > Apache solved this problem back when it was still called NSCA HTTPD. Why > aren't we preforking again? Of course, preforking and connection pooling are totally different beast... But, what really does preforking give us? A 2 or 3% improvement? The forking isn't the expensive part, the per-database setup that happens is the expensive setup... All pre-forking would save us is a tiny part of the initial setup, and in turn make our robust postmaster controller no longer have control. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.

* Aidan Van Dyk (aidan@highrise.ca) wrote: > But, what really does preforking give us? A 2 or 3% improvement? The > forking isn't the expensive part, the per-database setup that happens is > the expensive setup... Obviously that begs the question- why not support pre-fork with specific databases associated with specific backends that do the per-database setup prior to a connection coming in? eg- I want 5 backends ready per user database (excludes template0, template1, postgres). Thoughts? Thanks, Stephen

On Tue, May 12, 2009 at 11:22 AM, Dimitri <dimitrik.fr@gmail.com> wrote: > Robert, what I'm testing now is 256 users max. The workload is growing > progressively from 1, 2, 4, 8 ... to 256 users. Of course the Max > throughput is reached on the number of users equal to 2 * number of > cores, but what's important for me here - database should continue to > keep the workload! - response time regressing, but the troughput > should remain near the same. > > So, do I really need a pooler to keep 256 users working?? - I don't > think so, but please, correct me. Not an expert on this, but there has been a lot of discussion of the importance of connection pooling in this space. Is MySQL still faster if you lower max_connections to a value that is closer to the number of users, like 400 rather than 2000? > BTW, I did not look to put PostgreSQL in bad conditions - the test is > the test, and as I said 2 years ago PostgreSQL outperformed MySQL on > the same test case, and there was nothing done within MySQL code to > improve it explicitly for db_STRESS.. And I'm staying pretty honest > when I'm testing something. Yeah but it's not really clear what that something is. I believe you said upthread that PG 8.4 beta 1 is faster than PG 8.3.7, but PG 8.4 beta 1 is slower than MySQL 5.4 whereas PG 8.3.7 was faster than some older version of MySQL. So PG got faster and MySQL got faster, but they sped things up more than we did. If our performance were getting WORSE, I'd be worried about that, but the fact that they were able to make more improvement on this particular case than we were doesn't excite me very much. Sure, I'd love it if PG were even faster than it is, and if you have a suggested patch please send it in... or if you want to profile it and send the results that would be great too. But I guess my point is that the case of a very large number of simultaneous users with pauses-for-thought between queries has already been looked at in the very recent past in a way that's very similar to what you are doing (and by someone who works at the same company you do, no less!) so I'm not quite sure why we're rehashing the issue. ...Robert
On Tue, May 12, 2009 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > 1. There is no (portable) way to pass the connection from the postmaster > to another pre-existing process. [Googles.] It's not obvious to me that SCM_RIGHTS is non-portable, and Windows has an API call WSADuplicateSocket() specifically for this purpose. > 2. You'd have to track which database, and probably which user, each > such backend had been launched for; reconnecting a backend to a new > database is probably impractical and would certainly invalidate all > the caching. User doesn't seem like a major problem, but I understand your point about databases, which would presumably preclude the Apache approach of having every backend call accept() on the master socket. ...Robert
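For reference, passing an already-accepted socket to a pre-existing process with SCM_RIGHTS over a Unix-domain socket looks roughly like this; it is only a sketch (error handling and the Windows WSADuplicateSocket path are omitted):

/* Send one file descriptor down a Unix-domain socket using an
 * SCM_RIGHTS ancillary message; the receiver gets a duplicate of the
 * descriptor via recvmsg() on the other end. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_fd(int channel, int fd_to_pass)
{
    struct msghdr msg;
    struct cmsghdr *cmsg;
    struct iovec iov;
    char dummy = 'x';                           /* must send at least 1 byte */
    char cmsgbuf[CMSG_SPACE(sizeof(int))];

    iov.iov_base = &dummy;
    iov.iov_len = 1;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsgbuf;
    msg.msg_controllen = sizeof(cmsgbuf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;               /* "pass these descriptors" */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return (int) sendmsg(channel, &msg, 0);
}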
I'm sorry, but I'm confused. Everyone keeps talking about connection pooling, but Dimitri has said repeatedly that each client makes a single connection and then keeps it open until the end of the test, not that it makes a single connection per SQL query. Connection startup costs shouldn't be an issue. Am I missing something here? test(N) starts N clients, each client creates a single connection and hammers the server for a while on that connection. test(N) is run for N=1,2,4,8...256. This seems like a very reasonable test scenario. -- Glenn Maynard
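A minimal sketch of one such client with libpq is shown below; the connection string and query are placeholders, and the real db_STRESS kit certainly does more than this:

/* One "client": open a single connection, then hammer it in a loop.
 * The connection is only closed when the client is done. */
#include <libpq-fe.h>
#include <stdio.h>

int main(void)
{
    PGconn *conn = PQconnectdb("host=localhost dbname=stress");  /* placeholder */
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (int i = 0; i < 100000; i++)
    {
        PGresult *res = PQexec(conn, "SELECT 1");   /* placeholder query */
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }

    PQfinish(conn);      /* connection closed only at the end of the test */
    return 0;
}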
On 5/12/09, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, May 12, 2009 at 1:00 PM, Dimitri <dimitrik.fr@gmail.com> wrote: >> On MySQL there is no changes if I set the number of sessions in the >> config file to 400 or to 2000 - for 2000 it'll just allocate more >> memory. > > I don't care whether the setting affects the speed of MySQL. I want > to know if it affects the speed of PostgreSQL. the problem is they both have "max_connections" parameter, so as you asked for MySQL I answered for MySQL, did not test yet for PostgreSQL, will be in the next series.. > >> After latest fix with default_statistics_target=5, version 8.3.7 is >> running as fast as 8.4, even 8.4 is little little bit slower. >> >> I understand your position with a pooler, but I also want you think >> about idea that 128 cores system will become a commodity server very >> soon, and to use these cores on their full power you'll need a >> database engine capable to run 256 users without pooler, because a >> pooler will not help you here anymore.. > > So what? People with 128-core systems will not be running trivial > joins that return in 1-2ms and have one second think times between > them. And if they are, and if they have nothing better to do than > worry about whether MySQL can process those queries in 1/2000th of the > think time rather than 1/1000th of the think time, then they can use > MySQL. If we're going to worry about performance on 128-core system, > we would be much better advised to put our efforts into parallel query > execution than how many microseconds it takes to execute very simple > queries. Do you really think nowdays for example a web forum application having PG as a backend will have queries running slower than 1-2ms to print a thread message within your browser??? or banking transactions?? > > Still, I have no problem with making PostgreSQL faster in the case > you're describing. I'm just not interested in doing it on my own time > for free. I am sure there are a number of people who read this list > regularly who would be willing to do it for money, though. Maybe even > me. :-) > > ...Robert > You don't need to believe me, but I'm doing it for free - I still have my work to finish in parallel :-)) And on the same time I don't see any other way to learn and improve my knowledge, but nobody is perfect :-)) Rgds, -Dimitri
Hi Scott, On 5/12/09, Scott Carey <scott@richrelevance.com> wrote: > Although nobody wants to support it, he should try the patch that Jignesh K. > Shah (from Sun) proposed that makes ProcArrayLock lighter-weight. If it > makes 32 cores much faster, then we have a smoking gun. > > Although everyone here is talking about this as an 'unoptimal' solution, the > fact is there is no evidence that a connection pooler will fix the > scalability from 16 > 32 cores. > Certainly a connection pooler will help most results, but it may not fix the > scalability problem. > > A question for Dimitri: > What is the scalability from 16 > 32 cores at the 'peak' load that occurs > near 2x the CPU count? Is it also poor? If this is also poor, IMO the > community here should not be complaining about this unopimal case -- a > connection pooler at that stage does little and prepared statements will > increase throughput but not likely alter scalability. I'm attaching a small graph showing a TPS level on PG 8.4 depending on number of cores (X-axis is a number of concurrent users, Y-axis is the TPS number). As you may see TPS increase is near linear while moving from 8 to 16 cores, while on 32cores even it's growing slightly differently, what is unclear is why TPS level is staying limited to 11.000 TPS on 32cores. And it's pure read-only workload. > > If that result scales, then the short term answer is a connection pooler. > > In the tests that Jingesh ran -- making the ProcArrayLock faster helped the > case where connections = 2x the CPU core count quite a bit. > > The thread about the CPU scalability is "Proposal of tunable fix for > scalability of 8.4", originally posted by "Jignesh K. Shah" > <J.K.Shah@Sun.COM>, March 11 2009. > > It would be very useful to see results of this benchmark with: > 1. A Connection Pooler will not help, as each client is *not* disconnecting/reconnecting during the test, as well PG is keeping well even 256 users. And TPS limit is reached already on 64 users, don't think pooler will help here. > 2. Jignesh's patch I've already tested it and it did not help in my case because the real problem is elsewhere.. (however, I did not test it yet with my latest config params) > 3. Prepared statements > yes, I'm preparing this test. > #3 is important, because prepared statements are ideal for queries that > perform well with low statistics_targets, and not ideal for those that > require high statistics targets. Realistically, an app won't have more than > a couple dozen statement forms to prepare. Setting the default statistics > target to 5 is just a way to make some other query perform bad. Agree, but as you may have a different statistic target *per* table it should not be a problem. What is sure - all time spent on parse and planner will be removed here, and the final time should be a pure execution. Rgds, -Dimitri > > > On 5/12/09 10:53 AM, "Alvaro Herrera" <alvherre@commandprompt.com> wrote: > >> Andres Freund escribió: >> >>> Naturally it would still be nice to be good in this not optimal >>> workload... >> >> I find it hard to justify wasting our scarce development resources into >> optimizing such a contrived workload. >> >> -- >> Alvaro Herrera >> http://www.CommandPrompt.com/ >> The PostgreSQL Company - Command Prompt, Inc. >> >> -- >> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-performance >> > >
I'm also confused, but it seems the discussion is bringing up other ideas as well :-) But yes, each client connects to the database server only *once*. To be precise, the test runs like this:

- 1 client is started => 1 in total
- sleep ...
- 1 more client is started => 2 in total
- sleep ..
- 2 more clients are started => 4 in total
- sleep ..
... ... =======> 256 in total
- sleep ...
- kill clients

So I'm even able to monitor how each new client impacts all the others. The test kit is quite flexible for preparing any kind of stress situation.

Rgds,
-Dimitri

On 5/12/09, Glenn Maynard <glennfmaynard@gmail.com> wrote:
> I'm sorry, but I'm confused. Everyone keeps talking about connection
> pooling, but Dimitri has said repeatedly that each client makes a
> single connection and then keeps it open until the end of the test,
> not that it makes a single connection per SQL query. Connection
> startup costs shouldn't be an issue. Am I missing something here?
> test(N) starts N clients, each client creates a single connection and
> hammers the server for a while on that connection. test(N) is run for
> N=1,2,4,8...256. This seems like a very reasonable test scenario.
>
> --
> Glenn Maynard
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
On MySQL there is no changes if I set the number of sessions in the config file to 400 or to 2000 - for 2000 it'll just allocate more memory. After latest fix with default_statistics_target=5, version 8.3.7 is running as fast as 8.4, even 8.4 is little little bit slower. I understand your position with a pooler, but I also want you think about idea that 128 cores system will become a commodity server very soon, and to use these cores on their full power you'll need a database engine capable to run 256 users without pooler, because a pooler will not help you here anymore.. Rgds, -Dimitri On 5/12/09, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, May 12, 2009 at 11:22 AM, Dimitri <dimitrik.fr@gmail.com> wrote: >> Robert, what I'm testing now is 256 users max. The workload is growing >> progressively from 1, 2, 4, 8 ... to 256 users. Of course the Max >> throughput is reached on the number of users equal to 2 * number of >> cores, but what's important for me here - database should continue to >> keep the workload! - response time regressing, but the troughput >> should remain near the same. >> >> So, do I really need a pooler to keep 256 users working?? - I don't >> think so, but please, correct me. > > Not an expert on this, but there has been a lot of discussion of the > importance of connection pooling in this space. Is MySQL still faster > if you lower max_connections to a value that is closer to the number > of users, like 400 rather than 2000? > >> BTW, I did not look to put PostgreSQL in bad conditions - the test is >> the test, and as I said 2 years ago PostgreSQL outperformed MySQL on >> the same test case, and there was nothing done within MySQL code to >> improve it explicitly for db_STRESS.. And I'm staying pretty honest >> when I'm testing something. > > Yeah but it's not really clear what that something is. I believe you > said upthread that PG 8.4 beta 1 is faster than PG 8.3.7, but PG 8.4 > beta 1 is slower than MySQL 5.4 whereas PG 8.3.7 was faster than some > older version of MySQL. So PG got faster and MySQL got faster, but > they sped things up more than we did. If our performance were getting > WORSE, I'd be worried about that, but the fact that they were able to > make more improvement on this particular case than we were doesn't > excite me very much. Sure, I'd love it if PG were even faster than it > is, and if you have a suggested patch please send it in... or if you > want to profile it and send the results that would be great too. But > I guess my point is that the case of a very large number of > simultaneous users with pauses-for-thought between queries has already > been looked at in the very recent past in a way that's very similar to > what you are doing (and by someone who works at the same company you > do, no less!) so I'm not quite sure why we're rehashing the issue. > > ...Robert >
On Tue, May 12, 2009 at 11:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Matthew Wakeling <matthew@flymine.org> writes: >> On Tue, 12 May 2009, Simon Riggs wrote: >>> No, we spawn then authenticate. > >> But you still have a single thread doing the accept() and spawn. At some >> point (maybe not now, but in the future) this could become a bottleneck >> given very short-lived connections. > > More to the point, each backend process is a pretty heavyweight object: > it is a process, not a thread, and it's not going to be good for much > until it's built up a reasonable amount of stuff in its private caches. > I don't think the small number of cycles executed in the postmaster > process amount to anything at all compared to the other overhead > involved in getting a backend going. AIUI, whenever the connection pooler switches to serving a new client, it tells the PG backend to DISCARD ALL. But why couldn't we just implement this same logic internally? IOW, when a client disconnects, instead of having the backend exit immediately, have it perform the equivalent of DISCARD ALL and then stick around for a minute or two and, if a new connection request arrives within that time, have the old backend handle the new connection... (There is the problem of how to get the file descriptor returned by the accept() call in the parent process down to the child... but I think that at least on some UNIXen there is a way to pass an fd through a socket, or even dup it into another process by opening it from /proc/fd) ...Robert
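As a point of reference, the pooler-side reset Robert mentions amounts to something like the sketch below; this is hypothetical pooler code, not taken from pgbouncer:

/* Before handing a pooled server connection to a new client, issue
 * DISCARD ALL so no session state (prepared statements, temp tables,
 * GUC settings) leaks from the previous client. */
#include <libpq-fe.h>
#include <stdio.h>

int reset_pooled_connection(PGconn *conn)
{
    PGresult *res = PQexec(conn, "DISCARD ALL");
    int ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    if (!ok)
        fprintf(stderr, "DISCARD ALL failed: %s", PQerrorMessage(conn));
    PQclear(res);
    return ok;          /* connection now looks "fresh" to the next client */
}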
Glenn Maynard <glennfmaynard@gmail.com> wrote: > I'm sorry, but I'm confused. Everyone keeps talking about > connection pooling, but Dimitri has said repeatedly that each client > makes a single connection and then keeps it open until the end of > the test, not that it makes a single connection per SQL query. > Connection startup costs shouldn't be an issue. Am I missing > something here? Quite aside from the overhead of spawning new processes, if you have more active connections than you have resources for them to go after, you just increase context switching and resource contention, both of which have some cost, without any offsetting gains. That would tend to explain why performance tapers off after a certain point. A connection pool which queues requests prevents this degradation. It would be interesting, with each of the CPU counts, to profile PostgreSQL at the peak of each curve to see where the time goes when it is operating with an optimal poolsize. Tapering after that point is rather uninteresting, and profiles would be less useful beyond that point, as the noise from the context switching and resource contention would make it harder to spot issues that really matter.. -Kevin
The idea is good, but pooling *alone* will not be enough. I mean, if all the pooler does is keep no more than N backends working - it'll not be enough. You never know exactly what your query will do - if you choose your N value so as not to overload the CPU, and then some of your queries start to read from disk - you waste your idle CPU time, because it was still possible to run other queries needing CPU time rather than I/O, etc...

I wrote some ideas about an "ideal" solution here (just omit the word "mysql" - as it's a theory it's valid for any db engine):
http://dimitrik.free.fr/db_STRESS_MySQL_540_and_others_Apr2009.html#note_5442

Rgds,
-Dimitri

On 5/13/09, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Glenn Maynard <glennfmaynard@gmail.com> wrote:
>> I'm sorry, but I'm confused. Everyone keeps talking about
>> connection pooling, but Dimitri has said repeatedly that each client
>> makes a single connection and then keeps it open until the end of
>> the test, not that it makes a single connection per SQL query.
>> Connection startup costs shouldn't be an issue. Am I missing
>> something here?
>
> Quite aside from the overhead of spawning new processes, if you have
> more active connections than you have resources for them to go after,
> you just increase context switching and resource contention, both of
> which have some cost, without any offsetting gains. That would tend
> to explain why performance tapers off after a certain point. A
> connection pool which queues requests prevents this degradation.
>
> It would be interesting, with each of the CPU counts, to profile
> PostgreSQL at the peak of each curve to see where the time goes when
> it is operating with an optimal poolsize. Tapering after that point
> is rather uninteresting, and profiles would be less useful beyond that
> point, as the noise from the context switching and resource contention
> would make it harder to spot issues that really matter..
>
> -Kevin
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
On 5/13/09 3:22 AM, "Dimitri" <dimitrik.fr@gmail.com> wrote: > Hi Scott, > > On 5/12/09, Scott Carey <scott@richrelevance.com> wrote: >> Although nobody wants to support it, he should try the patch that Jignesh K. >> Shah (from Sun) proposed that makes ProcArrayLock lighter-weight. If it >> makes 32 cores much faster, then we have a smoking gun. >> >> Although everyone here is talking about this as an 'unoptimal' solution, the >> fact is there is no evidence that a connection pooler will fix the >> scalability from 16 > 32 cores. >> Certainly a connection pooler will help most results, but it may not fix the >> scalability problem. >> >> A question for Dimitri: >> What is the scalability from 16 > 32 cores at the 'peak' load that occurs >> near 2x the CPU count? Is it also poor? If this is also poor, IMO the >> community here should not be complaining about this unopimal case -- a >> connection pooler at that stage does little and prepared statements will >> increase throughput but not likely alter scalability. > > I'm attaching a small graph showing a TPS level on PG 8.4 depending on > number of cores (X-axis is a number of concurrent users, Y-axis is the > TPS number). As you may see TPS increase is near linear while moving > from 8 to 16 cores, while on 32cores even it's growing slightly > differently, what is unclear is why TPS level is staying limited to > 11.000 TPS on 32cores. And it's pure read-only workload. > Interesting. What hardware is this, btw? Looks like the 32 core system probably has 2x the CPU and a bit less interconnect efficiency versus the 16 core one (which would be typical). Is the 16 core case the same, but with fewer cores per processor active? Or fewer processors total? Understanding the scaling difference may require a better understanding of the other differences besides core count. >> >> If that result scales, then the short term answer is a connection pooler. >> >> In the tests that Jingesh ran -- making the ProcArrayLock faster helped the >> case where connections = 2x the CPU core count quite a bit. >> >> The thread about the CPU scalability is "Proposal of tunable fix for >> scalability of 8.4", originally posted by "Jignesh K. Shah" >> <J.K.Shah@Sun.COM>, March 11 2009. >> >> It would be very useful to see results of this benchmark with: >> 1. A Connection Pooler > > will not help, as each client is *not* disconnecting/reconnecting > during the test, as well PG is keeping well even 256 users. And TPS > limit is reached already on 64 users, don't think pooler will help > here. > Actually, it might help a little. Postgres has a flaw that makes backends block on a lock briefly based on the number of total backends -- active or completely passive. Your tool has some (very small) user-side delay and a connection pooler would probably allow 64 of your users to efficiently 'fit' in 48 or so connection pooler slots. It is not about connecting and disconnecting in this case, its about minimizing Postgres' process count. If this does help, it would hint at certain bottlenecks. If it doesn't it would point elsewhere (and quiet some critics). However, its unrealistic for any process-per-connection system to have less backends than about 2x the core count -- else any waiting on I/O or network will just starve CPU. So this would just be done for research, not a real answer to making it scale better. For those who say "but, what if its I/O bound! You don't need more backends then!": Well you don't need more CPU either if you're I/O bound. 
By definition, CPU scaling tests imply the I/O can keep up. >> 2. Jignesh's patch > > I've already tested it and it did not help in my case because the real > problem is elsewhere.. (however, I did not test it yet with my latest > config params) > Great to hear that! -- That means this case is probably not ProcArrayLock. If its Solaris, could we get: 1. What is the CPU stats when it is in the inefficient state near 64 or 128 concurrent users (vmstat, etc. I'm interested in CPU in user/system/idle/wait time, and context switches/sec mostly). 2. A Dtrace probe on the postgres locks -- we might be able to identify something here. The results here would be useful -- if its an expected condition in the planner or parser, it would be useful confirmation. If its something unexpected and easy to fix -- it might be changed relatively soon. If its not easy to detect, it could be many other things -- but the process above at least rules some things out and better characterizes the state. >> 3. Prepared statements >> > > yes, I'm preparing this test. > >> #3 is important, because prepared statements are ideal for queries that >> perform well with low statistics_targets, and not ideal for those that >> require high statistics targets. Realistically, an app won't have more than >> a couple dozen statement forms to prepare. Setting the default statistics >> target to 5 is just a way to make some other query perform bad. > > Agree, but as you may have a different statistic target *per* table it > should not be a problem. What is sure - all time spent on parse and > planner will be removed here, and the final time should be a pure > execution. > I'm definitely interested here because although pure execution will certainly be faster, it may not scale any better. > Rgds, > -Dimitri >
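For completeness, the prepared-statement variant being discussed looks roughly like this in a libpq client; the statement name, SQL text and parameter are placeholders, not the benchmark's actual query:

/* Parse and plan once with PQprepare, then repeatedly execute with
 * only parameter values -- the "pure execution" path. */
#include <libpq-fe.h>
#include <stdio.h>

static void run_prepared(PGconn *conn, const char *ref_value)
{
    /* One-time preparation: parser and planner cost is paid here only. */
    PGresult *prep = PQprepare(conn, "get_rows",
                               "SELECT * FROM some_table WHERE some_col = $1",  /* placeholder SQL */
                               1, NULL);
    if (PQresultStatus(prep) != PGRES_COMMAND_OK)
        fprintf(stderr, "prepare failed: %s", PQerrorMessage(conn));
    PQclear(prep);

    const char *params[1] = { ref_value };

    for (int i = 0; i < 1000; i++)
    {
        /* No parse, no plan -- just bind and execute. */
        PGresult *res = PQexecPrepared(conn, "get_rows",
                                       1, params, NULL, NULL, 0);
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
            fprintf(stderr, "execute failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }
}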
Dimitri <dimitrik.fr@gmail.com> wrote: > The idea is good, but *only* pooling will be not enough. I mean if > all what pooler is doing is only keeping no more than N backends > working - it'll be not enough. You never know what exactly your > query will do - if you choose your N value to be sure to not > overload CPU and then some of your queries start to read from disk - > you waste your idle CPU time because it was still possible to run > other queries requiring CPU time rather I/O, etc... I never meant to imply that CPUs were the only resources which mattered. Network and disk I/O certainly come into play. I would think that various locks might count. You have to benchmark your actual workload to find the sweet spot for your load on your hardware. I've usually found it to be around (2 * cpu count) + (effective spindle count), where effective spindle count id determined not only by your RAID also your access pattern. (If everything is fully cached, and you have no write delays because of a BBU RAID controller with write-back, effective spindle count is zero.) Since the curve generally falls off more slowly past the sweet spot than it climbs to get there, I tend to go a little above the apparent sweet spot to protect against bad performance in a different load mix than my tests. > I wrote some ideas about an "ideal" solution here (just omit the > word "mysql" - as it's a theory it's valable for any db engine): > http://dimitrik.free.fr/db_STRESS_MySQL_540_and_others_Apr2009.html#note_5442 I've seen similar techniques used in other databases, and I'm far from convinced that it's ideal or optimal. -Kevin
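As a rough illustration of Kevin's sizing heuristic (the padding above the sweet spot is an assumption for the example, not part of his rule):

/* pool size ~= (2 * cpu count) + (effective spindle count), then go
 * slightly above the apparent sweet spot to guard against a different
 * load mix than the one benchmarked. */
static int suggested_pool_size(int cpu_count, int effective_spindles)
{
    int sweet_spot = 2 * cpu_count + effective_spindles;
    return sweet_spot + sweet_spot / 10;   /* ~10% padding, arbitrary */
}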
Hi, Le 13 mai 09 à 18:42, Scott Carey a écrit : >> will not help, as each client is *not* disconnecting/reconnecting >> during the test, as well PG is keeping well even 256 users. And TPS >> limit is reached already on 64 users, don't think pooler will help >> here. > > Actually, it might help a little. Postgres has a flaw that makes > backends > block on a lock briefly based on the number of total backends -- > active or > completely passive. Your tool has some (very small) user-side delay > and a > connection pooler would probably allow 64 of your users to > efficiently 'fit' > in 48 or so connection pooler slots. It seems you have think time, and I'm only insisting on what Scott said, but having thinktime means a connection pool can help. Pgbouncer is a good choice because it won't even attempt to parse the queries, and it has a flexible configuration. >>> 3. Prepared statements >> yes, I'm preparing this test. It's possible to use prepared statement and benefit from pgbouncer at the same time, but up until now it requires the application to test whether its statements are already prepared at connection time, because the application is not controlling when pgbouncer is reusing an existing backend or giving it a fresh one. As I think I need this solution too, I've coded a PG module to scratch that itch this morning, and just published it (BSD licenced) on pgfoundry: http://preprepare.projects.postgresql.org/README.html http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/preprepare/preprepare/ With this module and the proper pgbouncer setup (connect_query='SELECT prepare_all();') the application has no more to special case the fresh- backend-nothing-prepared case, it's all transparent, just replace your SELECT query with its EXECUTE foo(x, y, z) counter part. I've took the approach to setup the prepared statements themselves into a table with columns name and statement, this latter one containing the full PREPARE SQL command. There's a custom variable preprepare.relation that has to be your table name (shema qualified). Each statement that you then put in there will get prepared when you SELECT prepare_all(); Hope this helps, regards, -- dim
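Based on the description above, client-side use of preprepare would look roughly like this; the table name and statement name are hypothetical, and preprepare.relation is assumed to point at that table:

/* Register a statement once, prepare everything (what pgbouncer's
 * connect_query would do on a fresh backend), then just EXECUTE. */
#include <libpq-fe.h>

static void preprepare_example(PGconn *conn)
{
    PGresult *res;

    /* The "statement" column holds the full PREPARE command. */
    res = PQexec(conn, "INSERT INTO public.prepared_statements (name, statement) "
                       "VALUES ('foo', 'PREPARE foo(int) AS SELECT $1')");
    PQclear(res);

    /* Equivalent of connect_query='SELECT prepare_all();' */
    res = PQexec(conn, "SELECT prepare_all()");
    PQclear(res);

    /* Application side: EXECUTE works whether or not the backend is fresh. */
    res = PQexec(conn, "EXECUTE foo(42)");
    PQclear(res);
}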
On Tue, 2009-05-12 at 14:28 +0200, Dimitri wrote: > As problem I'm considering a scalability issue on Read-Only workload - > only selects, no disk access, and if on move from 8 to 16 cores we > gain near 100%, on move from 16 to 32 cores it's only 10%... Dimitri, Will you be re-running the Read-Only tests? Can you run the Dtrace script to assess LWlock contention during the run? Would you re-run the tests with a patch? Thanks, -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Folks, sorry, I've been a little overtaken by events :-))

I've finished the tests with PREPARE/EXECUTE - it's much faster of course, and the max TPS is now 15,000 on 24 cores! - I've done various tests to see where the limiting bottleneck may be - it's more likely something timer or interrupt based, etc.

Nothing special via DTrace, or probably it'll tell you more than it tells me, but for a 10sec period the wait time is quite small:

# lwlock_wait_8.4.d `pgrep -n postgres`

             Lock Id            Mode           Count
 FirstBufMappingLock       Exclusive               1
    FirstLockMgrLock       Exclusive               1
     BufFreelistLock       Exclusive               3
 FirstBufMappingLock          Shared               4
    FirstLockMgrLock          Shared               4

             Lock Id            Mode   Combined Time (ns)
    FirstLockMgrLock       Exclusive               803700
     BufFreelistLock       Exclusive              3001600
    FirstLockMgrLock          Shared              4586600
 FirstBufMappingLock       Exclusive              6283900
 FirstBufMappingLock          Shared             21792900

At the same time, these lock waits appear only on 24 or 32 cores. I plan to replay this case on a bigger server (64 cores or more) - it'll be much more evident whether the problem is in the locks.

Currently I'm finishing my report with all the data you asked for (system graphs, pgsql, and other). I'll publish it on my web site and send you a link.

Rgds,
-Dimitri

On 5/14/09, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Tue, 2009-05-12 at 14:28 +0200, Dimitri wrote:
>
>> As problem I'm considering a scalability issue on Read-Only workload -
>> only selects, no disk access, and if on move from 8 to 16 cores we
>> gain near 100%, on move from 16 to 32 cores it's only 10%...
>
> Dimitri,
>
> Will you be re-running the Read-Only tests?
>
> Can you run the Dtrace script to assess LWlock contention during the
> run?
>
> Would you re-run the tests with a patch?
>
> Thanks,
>
> --
> Simon Riggs www.2ndQuadrant.com
> PostgreSQL Training, Services and Support
>
>
It's absolutely great! it'll not help here because a think time is 0. but for any kind of solution with a spooler it's a must to try! Rgds, -Dimitri On 5/13/09, Dimitri Fontaine <dfontaine@hi-media.com> wrote: > Hi, > > Le 13 mai 09 à 18:42, Scott Carey a écrit : >>> will not help, as each client is *not* disconnecting/reconnecting >>> during the test, as well PG is keeping well even 256 users. And TPS >>> limit is reached already on 64 users, don't think pooler will help >>> here. >> >> Actually, it might help a little. Postgres has a flaw that makes >> backends >> block on a lock briefly based on the number of total backends -- >> active or >> completely passive. Your tool has some (very small) user-side delay >> and a >> connection pooler would probably allow 64 of your users to >> efficiently 'fit' >> in 48 or so connection pooler slots. > > It seems you have think time, and I'm only insisting on what Scott > said, but having thinktime means a connection pool can help. Pgbouncer > is a good choice because it won't even attempt to parse the queries, > and it has a flexible configuration. > >>>> 3. Prepared statements >>> yes, I'm preparing this test. > > It's possible to use prepared statement and benefit from pgbouncer at > the same time, but up until now it requires the application to test > whether its statements are already prepared at connection time, > because the application is not controlling when pgbouncer is reusing > an existing backend or giving it a fresh one. > > As I think I need this solution too, I've coded a PG module to scratch > that itch this morning, and just published it (BSD licenced) on > pgfoundry: > http://preprepare.projects.postgresql.org/README.html > http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/preprepare/preprepare/ > > With this module and the proper pgbouncer setup (connect_query='SELECT > prepare_all();') the application has no more to special case the fresh- > backend-nothing-prepared case, it's all transparent, just replace your > SELECT query with its EXECUTE foo(x, y, z) counter part. > > I've took the approach to setup the prepared statements themselves > into a table with columns name and statement, this latter one > containing the full PREPARE SQL command. There's a custom variable > preprepare.relation that has to be your table name (shema qualified). > Each statement that you then put in there will get prepared when you > SELECT prepare_all(); > > Hope this helps, regards, > -- > dim
Hi Scott, let me now finish my report and regroup all data together, and then we'll continue discussion as it'll come more in debug/profile phase.. - I'll be not polite from my part to send some tons of attachments to the mail list :-) Rgds, -Dimitri On 5/13/09, Scott Carey <scott@richrelevance.com> wrote: > > On 5/13/09 3:22 AM, "Dimitri" <dimitrik.fr@gmail.com> wrote: > >> Hi Scott, >> >> On 5/12/09, Scott Carey <scott@richrelevance.com> wrote: >>> Although nobody wants to support it, he should try the patch that Jignesh >>> K. >>> Shah (from Sun) proposed that makes ProcArrayLock lighter-weight. If it >>> makes 32 cores much faster, then we have a smoking gun. >>> >>> Although everyone here is talking about this as an 'unoptimal' solution, >>> the >>> fact is there is no evidence that a connection pooler will fix the >>> scalability from 16 > 32 cores. >>> Certainly a connection pooler will help most results, but it may not fix >>> the >>> scalability problem. >>> >>> A question for Dimitri: >>> What is the scalability from 16 > 32 cores at the 'peak' load that occurs >>> near 2x the CPU count? Is it also poor? If this is also poor, IMO the >>> community here should not be complaining about this unopimal case -- a >>> connection pooler at that stage does little and prepared statements will >>> increase throughput but not likely alter scalability. >> >> I'm attaching a small graph showing a TPS level on PG 8.4 depending on >> number of cores (X-axis is a number of concurrent users, Y-axis is the >> TPS number). As you may see TPS increase is near linear while moving >> from 8 to 16 cores, while on 32cores even it's growing slightly >> differently, what is unclear is why TPS level is staying limited to >> 11.000 TPS on 32cores. And it's pure read-only workload. >> > > Interesting. What hardware is this, btw? Looks like the 32 core system > probably has 2x the CPU and a bit less interconnect efficiency versus the 16 > core one (which would be typical). > Is the 16 core case the same, but with fewer cores per processor active? Or > fewer processors total? > Understanding the scaling difference may require a better understanding of > the other differences besides core count. > >>> >>> If that result scales, then the short term answer is a connection pooler. >>> >>> In the tests that Jingesh ran -- making the ProcArrayLock faster helped >>> the >>> case where connections = 2x the CPU core count quite a bit. >>> >>> The thread about the CPU scalability is "Proposal of tunable fix for >>> scalability of 8.4", originally posted by "Jignesh K. Shah" >>> <J.K.Shah@Sun.COM>, March 11 2009. >>> >>> It would be very useful to see results of this benchmark with: >>> 1. A Connection Pooler >> >> will not help, as each client is *not* disconnecting/reconnecting >> during the test, as well PG is keeping well even 256 users. And TPS >> limit is reached already on 64 users, don't think pooler will help >> here. >> > > Actually, it might help a little. Postgres has a flaw that makes backends > block on a lock briefly based on the number of total backends -- active or > completely passive. Your tool has some (very small) user-side delay and a > connection pooler would probably allow 64 of your users to efficiently 'fit' > in 48 or so connection pooler slots. > > It is not about connecting and disconnecting in this case, its about > minimizing Postgres' process count. If this does help, it would hint at > certain bottlenecks. If it doesn't it would point elsewhere (and quiet some > critics). 
> > However, its unrealistic for any process-per-connection system to have less > backends than about 2x the core count -- else any waiting on I/O or network > will just starve CPU. So this would just be done for research, not a real > answer to making it scale better. > > For those who say "but, what if its I/O bound! You don't need more > backends then!": Well you don't need more CPU either if you're I/O bound. > By definition, CPU scaling tests imply the I/O can keep up. > > >>> 2. Jignesh's patch >> >> I've already tested it and it did not help in my case because the real >> problem is elsewhere.. (however, I did not test it yet with my latest >> config params) >> > > Great to hear that! -- That means this case is probably not ProcArrayLock. > If its Solaris, could we get: > 1. What is the CPU stats when it is in the inefficient state near 64 or 128 > concurrent users (vmstat, etc. I'm interested in CPU in > user/system/idle/wait time, and context switches/sec mostly). > 2. A Dtrace probe on the postgres locks -- we might be able to identify > something here. > > The results here would be useful -- if its an expected condition in the > planner or parser, it would be useful confirmation. If its something > unexpected and easy to fix -- it might be changed relatively soon. > > If its not easy to detect, it could be many other things -- but the process > above at least rules some things out and better characterizes the state. > >>> 3. Prepared statements >>> >> >> yes, I'm preparing this test. >> >>> #3 is important, because prepared statements are ideal for queries that >>> perform well with low statistics_targets, and not ideal for those that >>> require high statistics targets. Realistically, an app won't have more >>> than >>> a couple dozen statement forms to prepare. Setting the default >>> statistics >>> target to 5 is just a way to make some other query perform bad. >> >> Agree, but as you may have a different statistic target *per* table it >> should not be a problem. What is sure - all time spent on parse and >> planner will be removed here, and the final time should be a pure >> execution. >> > > I'm definitely interested here because although pure execution will > certainly be faster, it may not scale any better. > > >> Rgds, >> -Dimitri >> > >
On Thu, 2009-05-14 at 20:25 +0200, Dimitri wrote: > # lwlock_wait_8.4.d `pgrep -n postgres` > Lock Id Mode Combined Time (ns) > FirstLockMgrLock Exclusive 803700 > BufFreelistLock Exclusive 3001600 > FirstLockMgrLock Shared 4586600 > FirstBufMappingLock Exclusive 6283900 > FirstBufMappingLock Shared 21792900 I've published two patches to -Hackers to see if we can improve the read only numbers on 32+ cores. Try shared_buffer_partitions = 256 -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2009-05-13 at 23:23 +0200, Dimitri Fontaine wrote: > As I think I need this solution too, I've coded a PG module to > scratch > that itch this morning, and just published it (BSD licenced) on > pgfoundry: > http://preprepare.projects.postgresql.org/README.html > http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/preprepare/preprepare/ Looks very cool Dimitri -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Folks, I've just published a full report including all results here:
http://dimitrik.free.fr/db_STRESS_PostgreSQL_837_and_84_May2009.html

From my point of view, the first thing needed is to understand where the time is wasted on a single query (even when the statement is prepared it still runs slower compared to MySQL).

Then, to investigate the scalability issue, I think a bigger server will be needed here (I'm looking for 64 cores at least :-))

If you have some other ideas or patches (like Simon) - don't hesitate to send them - once I get access to the server again the available test time will be very limited..

Best regards!
-Dimitri

On 5/18/09, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Thu, 2009-05-14 at 20:25 +0200, Dimitri wrote:
>
>> # lwlock_wait_8.4.d `pgrep -n postgres`
>
>> Lock Id Mode Combined Time (ns)
>> FirstLockMgrLock Exclusive 803700
>> BufFreelistLock Exclusive 3001600
>> FirstLockMgrLock Shared 4586600
>> FirstBufMappingLock Exclusive 6283900
>> FirstBufMappingLock Shared 21792900
>
> I've published two patches to -Hackers to see if we can improve the read
> only numbers on 32+ cores.
>
> Try shared_buffer_partitions = 256
>
> --
> Simon Riggs www.2ndQuadrant.com
> PostgreSQL Training, Services and Support
>
>
On Mon, 2009-05-18 at 20:00 +0200, Dimitri wrote: > >From my point of view it needs first to understand where the time is > wasted on a single query (even when the statement is prepared it runs > still slower comparing to MySQL). There is still a significant number of things to say about these numbers and much tuning still to do, so I'm still confident of improving those numbers if we needed to. In particular, running the tests repeatedly using H.REF_OBJECT = '0000000001' rather than varying the value seems likely to benefit MySQL. The distribution of values is clearly non-linear; while Postgres picks a strange plan for that particular value, I would guess there are also values for which the MySQL plan is sub-optimal. Depending upon the distribution of selected data we might see the results go either way. What I find worrying is your result of a scalability wall for hash joins. Is that a repeatable issue? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Great data Dimitri!' I see a few key trends in the poor scalability: The throughput scales roughly with %CPU fairly well. But CPU used doesn't go past ~50% on the 32 core tests. This indicates lock contention. Other proof of lock contention are the mutex locks / sec graph which climbs rapidly as the system gets more inefficient (along with context switches). Another trend is the system calls/sec which caps out with the test, at about 400,000 per sec on the peak (non-prepared statement) result. Note that when the buffer size is 256MB, the performance scales much worse and is slower. And correlated with this the system calls/sec per transaction is more than double, at slower throughput. Using the OS to cache pages is not as fast as pages in shared_buffers, by a more significant amount with many cores and higher concurrency than in the low concurrency case. The system is largely lock limited in the poor scaling results. This holds true with or without the use of prepared statements -- which help a some, but not a lot and don't affect the scalability. 4096MB shared buffers, 32 cores, 8.4, read only: http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_70.html 256MB cache, 32 cores, 8.4, read-only: http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_52.html 4096MB shared buffs, 32 cores, 8.4, read only, prepared statements http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_70.html On 5/18/09 11:00 AM, "Dimitri" <dimitrik.fr@gmail.com> wrote: > Folks, I've just published a full report including all results here: > http://dimitrik.free.fr/db_STRESS_PostgreSQL_837_and_84_May2009.html > > From my point of view it needs first to understand where the time is > wasted on a single query (even when the statement is prepared it runs > still slower comparing to MySQL). > > Then to investigate on scalability issue I think a bigger server will > be needed here (I'm looking for 64cores at least :-)) > > If you have some other ideas or patches (like Simon) - don't hesitate > to send them - once I'll get an access to the server again the > available test time will be very limited.. > > Best regards! > -Dimitri > > > On 5/18/09, Simon Riggs <simon@2ndquadrant.com> wrote: >> >> On Thu, 2009-05-14 at 20:25 +0200, Dimitri wrote: >> >>> # lwlock_wait_8.4.d `pgrep -n postgres` >> >>> Lock Id Mode Combined Time (ns) >>> FirstLockMgrLock Exclusive 803700 >>> BufFreelistLock Exclusive 3001600 >>> FirstLockMgrLock Shared 4586600 >>> FirstBufMappingLock Exclusive 6283900 >>> FirstBufMappingLock Shared 21792900 >> >> I've published two patches to -Hackers to see if we can improve the read >> only numbers on 32+ cores. >> >> Try shared_buffer_partitions = 256 >> >> -- >> Simon Riggs www.2ndQuadrant.com >> PostgreSQL Training, Services and Support >> >> > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance >
What I don't understand is the part where you're talking about disabling hash joins:

> * result: planner replaced hash join is replaced by merge join
> * execution time: 0.84ms !
> * NOTE: curiously planner is expecting to execute this query in 0.29ms
>   - so it's supposed from its logic to be faster, so why this plan is not used from the beginning???...
>
> Sort (cost=4562.83..4568.66 rows=2329 width=176) (actual time=0.237..0.237 rows=20 loops=1)
>   Sort Key: h.horder
>   Sort Method: quicksort Memory: 30kB
>   -> Merge Join (cost=4345.89..4432.58 rows=2329 width=176) (actual time=0.065..0.216 rows=20 loops=1)
>        Merge Cond: (s.ref = h.ref_stat)
>        -> Index Scan using stat_ref_idx on stat s (cost=0.00..49.25 rows=1000 width=45) (actual time=0.018..0.089 rows=193 loops=1)
>        -> Sort (cost=4345.89..4351.72 rows=2329 width=135) (actual time=0.042..0.043 rows=20 loops=1)
>             Sort Key: h.ref_stat
>             Sort Method: quicksort Memory: 30kB
>             -> Index Scan using history_ref_idx on history h (cost=0.00..4215.64 rows=2329 width=135) (actual time=0.012..0.025 rows=20 loops=1)
>                  Index Cond: (ref_object = '0000000001'::bpchar)
> Total runtime: 0.288 ms
> (12 rows)

The explain analyze ran the query in 0.288 ms. That is the actual time it took to run the query on the server. It is not an estimate of the time. You measured 0.84 ms to run the query, which seems to imply either a problem in one of the timing methods or that 66% of your query execution time is sending the results to the client. I'm curious how you did your execution time measurements.

Dave
Thanks Dave for correction, but I'm also curious where the time is wasted in this case?.. 0.84ms is displayed by "psql" once the result output is printed, and I got similar time within my client (using libpq) which is not printing any output.. Rgds, -Dimitri On 5/18/09, Dave Dutcher <dave@tridecap.com> wrote: > > What I don't understand is the part where you talking about disabling hash > joins: > >> * result: planner replaced hash join is replaced by merge join >> * execution time: 0.84ms ! >> * NOTE: curiously planner is expecting to execute this query in 0.29ms > - so it's supposed from its logic to be faster, so why this plan is not used > from the beginning???... >> >> Sort (cost=4562.83..4568.66 rows=2329 width=176) (actual > time=0.237..0.237 rows=20 loops=1) >> Sort Key: h.horder >> Sort Method: quicksort Memory: 30kB >> -> Merge Join (cost=4345.89..4432.58 rows=2329 width=176) > (actual time=0.065..0.216 rows=20 loops=1) >> Merge Cond: (s.ref = h.ref_stat) >> -> Index Scan using stat_ref_idx on stat s > (cost=0.00..49.25 rows=1000 width=45) (actual time=0.018..0.089 rows=193 > loops=1) >> -> Sort (cost=4345.89..4351.72 rows=2329 width=135) > (actual time=0.042..0.043 rows=20 loops=1) >> Sort Key: h.ref_stat >> Sort Method: quicksort Memory: 30kB >> -> Index Scan using history_ref_idx on history h > (cost=0.00..4215.64 rows=2329 width=135) (actual time=0.012..0.025 rows=20 > loops=1) >> Index Cond: (ref_object = '0000000001'::bpchar) >> Total runtime: 0.288 ms >> (12 rows) > > The explain analyze ran the query in 0.288 ms. That is the actual time it > took to run the query on the server. It is not an estimate of the time. > You measured 0.84 ms to run the query, which seems to imply either a problem > in one of the timing methods or that 66% of your query execution time is > sending the results to the client. I'm curious how you did you execution > time measurements. > > Dave > >
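For what it's worth, one simple way to measure that client-side time around a libpq call is sketched below; the elapsed time it reports includes parse/plan, the network round trip and result transfer, none of which are counted in EXPLAIN ANALYZE's "Total runtime":

/* Wall-clock elapsed time, in milliseconds, for one query as seen
 * from the client (roughly what the 0.84ms above corresponds to). */
#include <libpq-fe.h>
#include <sys/time.h>

static double timed_query_ms(PGconn *conn, const char *sql)
{
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    PGresult *res = PQexec(conn, sql);
    gettimeofday(&t1, NULL);

    PQclear(res);
    return (t1.tv_sec - t0.tv_sec) * 1000.0 +
           (t1.tv_usec - t0.tv_usec) / 1000.0;
}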
On 5/18/09, Scott Carey <scott@richrelevance.com> wrote: > Great data Dimitri!' Thank you! :-) > > I see a few key trends in the poor scalability: > > The throughput scales roughly with %CPU fairly well. But CPU used doesn't > go past ~50% on the 32 core tests. This indicates lock contention. > You should not look on #1 STATs, but on #2 - they are all with the latest "fixes" - on all of them CPU is used well (90% in pic on 32cores). Also, keep in mind these cores are having 2 threads, and from Solaris point of view they are seen as CPU (so 64 CPU) and %busy is accounted as for 64 CPU > Other proof of lock contention are the mutex locks / sec graph which climbs exactly, except no locking was seen on processes while I tried to trace them.. What I think will be needed here is a global and corelated tracing of all PG processes - I did not expect to do it now, but next time > rapidly as the system gets more inefficient (along with context switches). > > Another trend is the system calls/sec which caps out with the test, at about > 400,000 per sec on the peak (non-prepared statement) result. Note that when > the buffer size is 256MB, the performance scales much worse and is slower. > And correlated with this the system calls/sec per transaction is more than > double, at slower throughput. of course, because even the data were cached by filesystem to get them you still need to call a read() system call.. > > Using the OS to cache pages is not as fast as pages in shared_buffers, by a > more significant amount with many cores and higher concurrency than in the > low concurrency case. exactly, it's what I also wanted to demonstrate because I often hear "PG is delegating caching to the filesystem" - and I don't think it's optimal :-) > > The system is largely lock limited in the poor scaling results. This holds > true with or without the use of prepared statements -- which help a some, > but not a lot and don't affect the scalability. we are agree here, but again - 20K mutex spins/sec is a quite low value, that's why I hope on the bigger server it'll be more clear where is a bottleneck :-) Rgds, -Dimitri > > > 4096MB shared buffers, 32 cores, 8.4, read only: > http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_70.html > > 256MB cache, 32 cores, 8.4, read-only: > http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_52.html > > 4096MB shared buffs, 32 cores, 8.4, read only, prepared statements > http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_70.html > > On 5/18/09 11:00 AM, "Dimitri" <dimitrik.fr@gmail.com> wrote: > >> Folks, I've just published a full report including all results here: >> http://dimitrik.free.fr/db_STRESS_PostgreSQL_837_and_84_May2009.html >> >> From my point of view it needs first to understand where the time is >> wasted on a single query (even when the statement is prepared it runs >> still slower comparing to MySQL). >> >> Then to investigate on scalability issue I think a bigger server will >> be needed here (I'm looking for 64cores at least :-)) >> >> If you have some other ideas or patches (like Simon) - don't hesitate >> to send them - once I'll get an access to the server again the >> available test time will be very limited.. >> >> Best regards! 
>> -Dimitri >> >> >> On 5/18/09, Simon Riggs <simon@2ndquadrant.com> wrote: >>> >>> On Thu, 2009-05-14 at 20:25 +0200, Dimitri wrote: >>> >>>> # lwlock_wait_8.4.d `pgrep -n postgres` >>> >>>> Lock Id Mode Combined Time (ns) >>>> FirstLockMgrLock Exclusive 803700 >>>> BufFreelistLock Exclusive 3001600 >>>> FirstLockMgrLock Shared 4586600 >>>> FirstBufMappingLock Exclusive 6283900 >>>> FirstBufMappingLock Shared 21792900 >>> >>> I've published two patches to -Hackers to see if we can improve the read >>> only numbers on 32+ cores. >>> >>> Try shared_buffer_partitions = 256 >>> >>> -- >>> Simon Riggs www.2ndQuadrant.com >>> PostgreSQL Training, Services and Support >>> >>> >> >> -- >> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-performance >> > >
On 5/18/09, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Mon, 2009-05-18 at 20:00 +0200, Dimitri wrote: > >> >From my point of view it needs first to understand where the time is >> wasted on a single query (even when the statement is prepared it runs >> still slower comparing to MySQL). > > There is still a significant number of things to say about these numbers > and much tuning still to do, so I'm still confident of improving those > numbers if we needed to. > > In particular, running the tests repeatedly using > H.REF_OBJECT = '0000000001' > rather than varying the value seems likely to benefit MySQL. The let me repeat again - the reference is *random*, the '0000000001' value I've used just to show a query execution plan. also, what is important - the random ID is chosen in way that no one user use the same to avoid deadlocks previously seen with PostgreSQL (see the "Deadlock mystery" note 2 years ago http://dimitrik.free.fr/db_STRESS_BMK_Part1.html#note_4355 ) > distribution of values is clearly non-linear; while Postgres picks a > strange plan for that particular value, I would guess there are also > values for which the MySQL plan is sub-optimal. Depending upon the > distribution of selected data we might see the results go either way. > > What I find worrying is your result of a scalability wall for hash > joins. Is that a repeatable issue? I think yes (but of course I did not try to replay it several times) Rgds, -Dimitri > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > >
Simon Riggs <simon@2ndQuadrant.com> writes: > In particular, running the tests repeatedly using > H.REF_OBJECT = '0000000001' > rather than varying the value seems likely to benefit MySQL. ... mumble ... query cache? regards, tom lane
On 5/18/09 3:32 PM, "Dimitri" <dimitrik.fr@gmail.com> wrote: > On 5/18/09, Scott Carey <scott@richrelevance.com> wrote: >> Great data Dimitri!' > > Thank you! :-) > >> >> I see a few key trends in the poor scalability: >> >> The throughput scales roughly with %CPU fairly well. But CPU used doesn't >> go past ~50% on the 32 core tests. This indicates lock contention. >> > > You should not look on #1 STATs, but on #2 - they are all with the > latest "fixes" - on all of them CPU is used well (90% in pic on > 32cores). > Also, keep in mind these cores are having 2 threads, and from Solaris > point of view they are seen as CPU (so 64 CPU) and %busy is accounted > as for 64 CPU > Well, if the CPU usage is actually higher, then it might not be lock waiting -- it could be spin locks or context switches or cache coherency overhead. Postgres may also not be very SMT friendly, at least on the hardware tested here. (what was the context switch rate? I didn't see that in the data, just mutex spins). The scalability curve is definitely showing something. Prepared statements were tried, as were most of the other suggestions other than one: What happens if the queries are more complicated (say, they take 15ms server side with a more complicated plan required)? That is a harder question to answer
On Tue, 2009-05-19 at 00:33 +0200, Dimitri wrote: > > > > In particular, running the tests repeatedly using > > H.REF_OBJECT = '0000000001' > > rather than varying the value seems likely to benefit MySQL. The > > let me repeat again - the reference is *random*, > the '0000000001' value I've used just to show a query execution > plan. > > also, what is important - the random ID is chosen in way that no one > user use the same to avoid deadlocks previously seen with PostgreSQL > (see the "Deadlock mystery" note 2 years ago > http://dimitrik.free.fr/db_STRESS_BMK_Part1.html#note_4355 ) OK, didn't pick up on that. (Like Tom, I was thinking query cache) Can you comment on the distribution of values for that column? If you are picking randomly, this implies distribution is uniform and so I am surprised we are mis-estimating the selectivity. > I think yes (but of course I did not try to replay it several times) If you could that would be appreciated. We don't want to go chasing after something that is not repeatable. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
No, Tom, the query cache was off. I always explicitly turn it off on MySQL, as it has scalability issues. Rgds, -Dimitri On 5/19/09, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> In particular, running the tests repeatedly using >> H.REF_OBJECT = '0000000001' >> rather than varying the value seems likely to benefit MySQL. > > ... mumble ... query cache? > > regards, tom lane >
On 5/19/09, Scott Carey <scott@richrelevance.com> wrote: > > On 5/18/09 3:32 PM, "Dimitri" <dimitrik.fr@gmail.com> wrote: > >> On 5/18/09, Scott Carey <scott@richrelevance.com> wrote: >>> Great data Dimitri!' >> >> Thank you! :-) >> >>> >>> I see a few key trends in the poor scalability: >>> >>> The throughput scales roughly with %CPU fairly well. But CPU used >>> doesn't >>> go past ~50% on the 32 core tests. This indicates lock contention. >>> >> >> You should not look on #1 STATs, but on #2 - they are all with the >> latest "fixes" - on all of them CPU is used well (90% in pic on >> 32cores). >> Also, keep in mind these cores are having 2 threads, and from Solaris >> point of view they are seen as CPU (so 64 CPU) and %busy is accounted >> as for 64 CPU >> > > Well, if the CPU usage is actually higher, then it might not be lock waiting > -- it could be spin locks or context switches or cache coherency overhead. > Postgres may also not be very SMT friendly, at least on the hardware tested > here. do you mean SMP or CMT? ;-) however both should work well with PostgreSQL. I also think about CPU affinity - probably it may help to avoid CPU cache misses - but makes sense mostly if pooler will be added as a part of PG. > > (what was the context switch rate? I didn't see that in the data, just > mutex spins). increasing with a load, as this ex.: http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_100.html#bmk_CPU_CtxSwitch_100 > > The scalability curve is definitely showing something. Prepared statements > were tried, as were most of the other suggestions other than one: > > What happens if the queries are more complicated (say, they take 15ms server > side with a more complicated plan required)? That is a harder question to > answer What I observed is: if planner takes more long time (like initially with 8.3.7 and analyze target 1000) the scalability problem is appearing more strange - http://dimitrik.free.fr/Report_20090505/5521_dim_STAT_18.html - as you see CPU even not used more than 60% , and as you may see spin locks are lowering - CPUs are not spinning for locks, there is something else.. I'm supposing a problem of some kind of synchronization - background processes are not waking up on time or something like this... Then, if more time spent on the query execution itself and not planner: - if it'll be I/O time - I/O will hide everything else until you increase a storage performance and/or add more RAM, but then you come back to the initial issue :-) - if it'll be a CPU time it may be interesting! :-) Rgds, -Dimitri
On 5/19/09, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Tue, 2009-05-19 at 00:33 +0200, Dimitri wrote: >> > >> > In particular, running the tests repeatedly using >> > H.REF_OBJECT = '0000000001' >> > rather than varying the value seems likely to benefit MySQL. The >> >> let me repeat again - the reference is *random*, >> the '0000000001' value I've used just to show a query execution >> plan. >> >> also, what is important - the random ID is chosen in way that no one >> user use the same to avoid deadlocks previously seen with PostgreSQL >> (see the "Deadlock mystery" note 2 years ago >> http://dimitrik.free.fr/db_STRESS_BMK_Part1.html#note_4355 ) > > OK, didn't pick up on that. > > (Like Tom, I was thinking query cache) > > Can you comment on the distribution of values for that column? If you > are picking randomly, this implies distribution is uniform and so I am > surprised we are mis-estimating the selectivity. yes, the distribution of reference values is uniform between '0000000001' to '0010000000' (10M), only one OBJECT row by one reference, and only 20 rows with the same reference in HISTORY table. > >> I think yes (but of course I did not try to replay it several times) > > If you could that would be appreciated. We don't want to go chasing > after something that is not repeatable. I'll retry and let you know. Rgds, -Dimitri
On Mon, 2009-05-18 at 19:00 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > In particular, running the tests repeatedly using > > H.REF_OBJECT = '0000000001' > > rather than varying the value seems likely to benefit MySQL. One thing to note in terms of optimisation of this query is that we perform a top-level sort at the end of the query. Both plans for this query show an IndexScan on a two column-index, with an Index Condition of equality on the leading column. The ORDER BY specifies a sort by the second index column, so the top-level Sort is superfluous in this case. My understanding is that we don't currently eliminate superfluous additional sorts of this kind. Now I know that is a hard subject, but it seems straightforward to consider interesting sort order equivalence when we have constant equality constraints. My guess would be that MySQL does do the sort removal, in the latest version. Dimitri's EXPLAIN ANALYZEs show differing costs for that additional step, but at around 10% of query time it looks shaveable. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
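To make the sort-order observation above concrete: with the thread's two-column index on HISTORY(ref_object, horder), the single-table case should already come back in the right order straight from the index. A minimal sketch (the expected plan shape is an assumption, not captured EXPLAIN output):

explain
select horder, note
from history
where ref_object = '0000000001'
order by horder;

-- with equality on the leading index column, an index scan on
-- (ref_object, horder) already returns rows ordered by horder, so no
-- separate Sort node should be needed here; the open question in the
-- thread is why the same ordering does not survive the join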
On Tue, 19 May 2009, Simon Riggs wrote: > Both plans for this query show an IndexScan on a two column-index, with > an Index Condition of equality on the leading column. The ORDER BY > specifies a sort by the second index column, so the top-level Sort is > superfluous in this case. > > My understanding is that we don't currently eliminate superfluous > additional sorts of this kind. Now I know that is a hard subject, but it > seems straightforward to consider interesting sort order equivalence > when we have constant equality constraints. Yes, Postgres has been missing the boat on this one for a while. +1 on requesting this feature. Speaking of avoiding large sorts, I'd like to push again for partial sorts. This is the situation where an index provides data sorted by column "a", and the query requests data sorted by "a, b". Currently, Postgres sorts the entire data set, whereas it need only group each set of identical "a" and sort each by "b". Matthew -- Riker: Our memory pathways have become accustomed to your sensory input. Data: I understand - I'm fond of you too, Commander. And you too Counsellor
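As a concrete sketch of the partial-sort situation described above (the table and index names are made up purely for illustration):

create table t (a int, b int, payload text);
create index t_a_idx on t (a);

select * from t order by a, b;

-- the index can already deliver rows ordered by a; currently the whole
-- result set is sorted, whereas in principle only each run of rows with
-- equal a would need to be sorted by b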
On Tue, 2009-05-19 at 12:17 +0100, Matthew Wakeling wrote: > Yes, Postgres has been missing the boat on this one for a while. +1 on > requesting this feature. That's an optimizer feature. > Speaking of avoiding large sorts, I'd like to push again for partial > sorts. This is the situation where an index provides data sorted by > column "a", and the query requests data sorted by "a, b". Currently, > Postgres sorts the entire data set, whereas it need only group each > set of identical "a" and sort each by "b". This is an executor feature. Partially sorted data takes much less effort to sort (OK, not zero, I grant) so this seems like a high complexity, lower value feature. I agree it should be on the TODO, just IMHO at a lower priority than some other features. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Tue, 2009-05-19 at 12:36 +0100, Simon Riggs wrote: > Partially sorted data takes much less effort to sort (OK, not zero, I > grant) so this seems like a high complexity, lower value feature. I > agree it should be on the TODO, just IMHO at a lower priority than some > other features. Perhaps its worth looking at a hybrid merge-join/hash-join that can cope with data only mostly-sorted rather than fully sorted. That way we can probably skip the partial sort altogether. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
I may confirm the issue with hash join - it's repeating both with prepared and not prepared statements - it's curious because initially the response time is lowering near ~1ms (the lowest seen until now) and then once workload growing to 16 sessions it's jumping to 2.5ms, then with 32 sessions it's 18ms, etc.. I've retested on 24 isolated cores, so any external secondary effects are avoided. Rgds, -Dimitri On 5/19/09, Dimitri <dimitrik.fr@gmail.com> wrote: > On 5/19/09, Simon Riggs <simon@2ndquadrant.com> wrote: >> >> On Tue, 2009-05-19 at 00:33 +0200, Dimitri wrote: >>> > >>> > In particular, running the tests repeatedly using >>> > H.REF_OBJECT = '0000000001' >>> > rather than varying the value seems likely to benefit MySQL. The >>> >>> let me repeat again - the reference is *random*, >>> the '0000000001' value I've used just to show a query execution >>> plan. >>> >>> also, what is important - the random ID is chosen in way that no one >>> user use the same to avoid deadlocks previously seen with PostgreSQL >>> (see the "Deadlock mystery" note 2 years ago >>> http://dimitrik.free.fr/db_STRESS_BMK_Part1.html#note_4355 ) >> >> OK, didn't pick up on that. >> >> (Like Tom, I was thinking query cache) >> >> Can you comment on the distribution of values for that column? If you >> are picking randomly, this implies distribution is uniform and so I am >> surprised we are mis-estimating the selectivity. > > yes, the distribution of reference values is uniform between > '0000000001' to '0010000000' (10M), only one OBJECT row by one > reference, and only 20 rows with the same reference in HISTORY table. > >> >>> I think yes (but of course I did not try to replay it several times) >> >> If you could that would be appreciated. We don't want to go chasing >> after something that is not repeatable. > > I'll retry and let you know. > > Rgds, > -Dimitri >
On Tue, 19 May 2009, Simon Riggs wrote: >> Speaking of avoiding large sorts, I'd like to push again for partial >> sorts. This is the situation where an index provides data sorted by >> column "a", and the query requests data sorted by "a, b". Currently, >> Postgres sorts the entire data set, whereas it need only group each >> set of identical "a" and sort each by "b". > > Partially sorted data takes much less effort to sort (OK, not zero, I > grant) so this seems like a high complexity, lower value feature. I > agree it should be on the TODO, just IMHO at a lower priority than some > other features. Not arguing with you, however I'd like to point out that partial sorting allows the results to be streamed, which would lower the cost to produce the first row of results significantly, and reduce the amount of RAM used by the query, and prevent temporary tables from being used. That has to be a fairly major win. Queries with a LIMIT would see the most benefit. That leads me on to another topic. Consider the query: SELECT * FROM table ORDER BY a, b where the column "a" is declared UNIQUE and has an index. Does Postgres eliminate "b" from the ORDER BY, and therefore allow fetching without sorting from the index? Or how about this query: SELECT * FROM table1, table2 WHERE table1.fk = table2.id ORDER BY table1.id, table2.id where both "id" columns are UNIQUE with an index. Do we eliminate "table2.id" from the ORDER BY in this case? Matthew -- "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning." -- Rich Cook
Simon Riggs <simon@2ndQuadrant.com> writes: > Both plans for this query show an IndexScan on a two column-index, with > an Index Condition of equality on the leading column. The ORDER BY > specifies a sort by the second index column, so the top-level Sort is > superfluous in this case. > My understanding is that we don't currently eliminate superfluous > additional sorts of this kind. Nonsense. The planner might think some other plan is cheaper, but it definitely knows how to do this, and has since at least 8.1. regards, tom lane
On Tue, 2009-05-19 at 14:00 +0200, Dimitri wrote: > I may confirm the issue with hash join - it's repeating both with > prepared and not prepared statements - it's curious because initially > the response time is lowering near ~1ms (the lowest seen until now) > and then once workload growing to 16 sessions it's jumping to 2.5ms, > then with 32 sessions it's 18ms, etc.. Is it just bad all the time, or does it get worse over time? Do you get the same behaviour as 32 sessions if you run 16 sessions for twice as long? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Mon, May 18, 2009 at 6:32 PM, Dimitri <dimitrik.fr@gmail.com> wrote: > Thanks Dave for correction, but I'm also curious where the time is > wasted in this case?.. > > 0.84ms is displayed by "psql" once the result output is printed, and I > got similar time within my client (using libpq) which is not printing > any output.. Using libpq? What is the exact method you are using to execute queries...PQexec? If you are preparing queries against libpq, the best way to execute queries is via PQexecPrepared. Also, it's interesting to see if you can get any benefit from asynchronous queries (PQsendQueryPrepared), but this might involve more changes to your application than you are willing to make. Another note: I would like to point out again that there are possible negative side effects in using char(n) vs. varchar(n) that IIRC do not exist in mysql. When you repeat your test I strongly advise switching to varchar. Another question: how exactly are you connecting to the database? local machine? if so, domain socket or tcp/ip? What are you doing with the results...immediately discarding? One last thing: when you get access to the server, can you run a custom format query test from pgbench and compare the results to your test similarly configured (same number of backends, etc) in terms of tps? merlin
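For reference, a custom pgbench test of this kind could look roughly like the sketch below. The file name, the key range and the simplified single-table SELECT are assumptions (the real comparison would use the full HISTORY/STAT join), and \setrandom is the pgbench variable syntax of that era:

-- hist.sql
\setrandom ref 1 10000000
select h.horder, h.note from history h where h.ref_object = lpad(:ref::text, 10, '0')::char(10) order by h.horder;

run with something like: pgbench -n -c 16 -t 10000 -f hist.sql <dbname>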
On May 19, 2009, at 7:36 AM, Simon Riggs <simon@2ndQuadrant.com> wrote: > > On Tue, 2009-05-19 at 12:17 +0100, Matthew Wakeling wrote: >> Yes, Postgres has been missing the boat on this one for a while. +1 >> on >> requesting this feature. > > That's an optimizer feature. > >> Speaking of avoiding large sorts, I'd like to push again for partial >> sorts. This is the situation where an index provides data sorted by >> column "a", and the query requests data sorted by "a, b". Currently, >> Postgres sorts the entire data set, whereas it need only group each >> set of identical "a" and sort each by "b". > > This is an executor feature. > > Partially sorted data takes much less effort to sort (OK, not zero, I > grant) so this seems like a high complexity, lower value feature. I > agree it should be on the TODO, just IMHO at a lower priority than > some > other features. I have no particular thoughts on priority (whose priority?), but I will say I've run across queries that could benefit from this optimization. I fairly often write queries where the first key is mostly unique and the second is just to make things deterministic in the event of a tie. So the partial sort would be almost no work at all. ...Robert > > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org > ) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance
The response time is not degrading progressively, it's simply jumping; it looks like from 16 sessions onward there is a sort of serialization happening somewhere.. Also, at 16 sessions the throughput in TPS is nearly the same as at 8 (response time is only twice as high for the moment), but at 32 it drops dramatically.. Rgds, -Dimitri On 5/19/09, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Tue, 2009-05-19 at 14:00 +0200, Dimitri wrote: > >> I may confirm the issue with hash join - it's repeating both with >> prepared and not prepared statements - it's curious because initially >> the response time is lowering near ~1ms (the lowest seen until now) >> and then once workload growing to 16 sessions it's jumping to 2.5ms, >> then with 32 sessions it's 18ms, etc.. > > Is it just bad all the time, or does it get worse over time? > > Do you get the same behaviour as 32 sessions if you run 16 sessions for > twice as long? > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > >
On 5/19/09, Merlin Moncure <mmoncure@gmail.com> wrote: > On Mon, May 18, 2009 at 6:32 PM, Dimitri <dimitrik.fr@gmail.com> wrote: >> Thanks Dave for correction, but I'm also curious where the time is >> wasted in this case?.. >> >> 0.84ms is displayed by "psql" once the result output is printed, and I >> got similar time within my client (using libpq) which is not printing >> any output.. > > Using libpq? What is the exact method you are using to execute > queries...PQexec? exactly > If you are preparing queries against libpq, the > best way to execute queries is via PQexecPrepared. the query is *once* prepared via PQexec, then it's looping with "execute" via PQexec. Why PQexecPrepared will be better in my case?.. > Also, it's > interesting to see if you can get any benefit from asynchronous > queries (PQsendQueryPrepared), but this might involve more changes to your > application than you are willing to make. > > Another note: I would like to point out again that there are possible > negative side effects in using char(n) vs. varchar(n) that IIRC do not > exist in mysql. When you repeat your test I strongly advise switching > to varchar. if it's true for any case, why not just replace CHAR implementation by VARCHAR directly within PG code?.. > > Another question: how exactly are you connecting to the database? > local machine? if so, domain socket or tcp/ip? local TCP/IP, same as MySQL > What are you doing > with the results...immediately discarding? from the PQ side they are immediately discarded once all rows are fetched > > One last thing: when you get access to the server, can you run a > custom format query test from pgbench and compare the results to your > test similarly configured (same number of backends, etc) in terms of > tps? I'll try Rgds, -Dimitri
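In SQL terms, the prepare-once / execute-many pattern described above amounts to something like the following sketch (the statement name and the simplified single-table query are made up, just to show the shape):

prepare get_hist (char(10)) as
  select h.horder, h.note from history h
  where h.ref_object = $1
  order by h.horder;

execute get_hist('0000000001');  -- repeated in the loop with a different reference each time

The difference with PQexecPrepared is that the client no longer sends an EXECUTE statement with the parameter spliced into the SQL text on every call; the parameter travels separately through the protocol, which is what Merlin is suggesting.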
On Tue, 2009-05-19 at 08:58 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > Both plans for this query show an IndexScan on a two column-index, with > > an Index Condition of equality on the leading column. The ORDER BY > > specifies a sort by the second index column, so the top-level Sort is > > superfluous in this case. > > > My understanding is that we don't currently eliminate superfluous > > additional sorts of this kind. > > Nonsense. The planner might think some other plan is cheaper, but > it definitely knows how to do this, and has since at least 8.1. Please look at Dimitri's plan. If it can remove the pointless sort, why does it not do so? I agree that it will remove a Sort when the data already has the exact same interesting sort order. In this case the sort order is not exactly the same, but looks fully removable to me. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Tue, 2009-05-19 at 13:01 +0100, Matthew Wakeling wrote: > That leads me on to another topic. Consider the query: > > SELECT * FROM table ORDER BY a, b > > where the column "a" is declared UNIQUE and has an index. Does Postgres > eliminate "b" from the ORDER BY, and therefore allow fetching without > sorting from the index? No, because we don't use unique constraints much at all to infer things. > Or how about this query: > > SELECT * FROM table1, table2 WHERE table1.fk = table2.id ORDER BY > table1.id, table2.id > > where both "id" columns are UNIQUE with an index. Do we eliminate > "table2.id" from the ORDER BY in this case? Yes, that is eliminated via equivalence classes. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On 5/19/09 5:01 AM, "Matthew Wakeling" <matthew@flymine.org> wrote: > On Tue, 19 May 2009, Simon Riggs wrote: >>> Speaking of avoiding large sorts, I'd like to push again for partial >>> sorts. This is the situation where an index provides data sorted by >>> column "a", and the query requests data sorted by "a, b". Currently, >>> Postgres sorts the entire data set, whereas it need only group each >>> set of identical "a" and sort each by "b". >> >> Partially sorted data takes much less effort to sort (OK, not zero, I >> grant) so this seems like a high complexity, lower value feature. I >> agree it should be on the TODO, just IMHO at a lower priority than some >> other features. > > Not arguing with you, however I'd like to point out that partial sorting > allows the results to be streamed, which would lower the cost to produce > the first row of results significantly, and reduce the amount of RAM used > by the query, and prevent temporary tables from being used. That has to be > a fairly major win. Queries with a LIMIT would see the most benefit. > I will second that point -- although for smaller sorts the partial sort doesn't help much and just adds complexity, once the sort is large it significantly reduces the amount of work_mem needed, which is a large performance gain and a big win for concurrent query scaling. And those benefits occur without using LIMIT.
On 5/19/09 3:46 AM, "Dimitri" <dimitrik.fr@gmail.com> wrote: > On 5/19/09, Scott Carey <scott@richrelevance.com> wrote: >> >> On 5/18/09 3:32 PM, "Dimitri" <dimitrik.fr@gmail.com> wrote: >> >>> On 5/18/09, Scott Carey <scott@richrelevance.com> wrote: >>>> Great data Dimitri!' >>> >>> Thank you! :-) >>> >>>> >>>> I see a few key trends in the poor scalability: >>>> >>>> The throughput scales roughly with %CPU fairly well. But CPU used >>>> doesn't >>>> go past ~50% on the 32 core tests. This indicates lock contention. >>>> >>> >>> You should not look on #1 STATs, but on #2 - they are all with the >>> latest "fixes" - on all of them CPU is used well (90% in pic on >>> 32cores). >>> Also, keep in mind these cores are having 2 threads, and from Solaris >>> point of view they are seen as CPU (so 64 CPU) and %busy is accounted >>> as for 64 CPU >>> >> >> Well, if the CPU usage is actually higher, then it might not be lock waiting >> -- it could be spin locks or context switches or cache coherency overhead. >> Postgres may also not be very SMT friendly, at least on the hardware tested >> here. > > do you mean SMP or CMT? ;-) > however both should work well with PostgreSQL. I also think about CPU > affinity - probably it may help to avoid CPU cache misses - but makes > sense mostly if pooler will be added as a part of PG. Symmetric Multi Threading (HyperThreading in Intel's marketing terms, other marketing terms for Sun or IBM). One CPU core that can handle more than one concurrently executing thread. Technically, 'SMT' allows instructions in flight from multiple threads at once in a superscalar CPU core while some implementations differ and might technically be CMT (one thread or the other, but can switch fast, or a non-superscalar core). For many implementations of 'multiple threads on one CPU core' many of the processor resources are reduced per thread when it is active -- caches get split, instruction re-order buffers are split, etc. That is rather hardware implementation dependent. For Intel's SMT (and other similar), spin-locks hurt scalability if they aren't using new special instructions for the spin to yield pipeline slots to the other thread. Generally, code that stresses common processor resources more than CPU execution will scale poorly with SMT/CMT etc. So I'm not sure about the Postgres details, but the general case of an application that doesn't benefit from these technologies exists, and there is a non-zero chance that Postgres has some characteristics of such an app. >> >> (what was the context switch rate? I didn't see that in the data, just >> mutex spins). > > increasing with a load, as this ex.: > http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_100.html#bmk_CPU_CtxSwit > ch_100 > Well, on most systems over 100K context switches/sec is a lot. And those reach 180000 /sec. However, this is 'only' 10 context switches per transaction and less than 20% system CPU, so maybe those numbers aren't quite as big as they seem. Out of curiosity, what was the context switch rate for MySql at its peak throughput? > >> >> The scalability curve is definitely showing something. Prepared statements >> were tried, as were most of the other suggestions other than one: >> >> What happens if the queries are more complicated (say, they take 15ms server >> side with a more complicated plan required)?
That is a harder question to >> answer > > What I observed is: if planner takes more long time (like initially > with 8.3.7 and analyze target 1000) the scalability problem is > appearing more strange - > http://dimitrik.free.fr/Report_20090505/5521_dim_STAT_18.html - as you > see CPU even not used more than 60% , and as you may see spin locks > are lowering - CPUs are not spinning for locks, there is something > else.. > I'm supposing a problem of some kind of synchronization - background > processes are not waking up on time or something like this... > Then, if more time spent on the query execution itself and not planner: > - if it'll be I/O time - I/O will hide everything else until you > increase a storage performance and/or add more RAM, but then you come > back to the initial issue :-) > - if it'll be a CPU time it may be interesting! :-) > > Rgds, > -Dimitri > Ok, so that's good info that the planner or parser side seems to scale less effectively than the execution (as the results show), but I'm wondering about queries with longer execution times not longer planner times. I'm wondering that, because its my opinion that most applications that will use larger scale hardware will have more complicated queries than your test. Its also greedy on my part since most queries in my applications are significantly more complicated. Regardless of my opinions -- this test is on one extreme (small fast queries) of the spectrum. Its useful to know some data points on other parts of the spectrum.
On Tue, May 19, 2009 at 11:53 AM, Dimitri <dimitrik.fr@gmail.com> wrote: > On 5/19/09, Merlin Moncure <mmoncure@gmail.com> wrote: >> On Mon, May 18, 2009 at 6:32 PM, Dimitri <dimitrik.fr@gmail.com> wrote: >>> Thanks Dave for correction, but I'm also curious where the time is >>> wasted in this case?.. >>> >>> 0.84ms is displayed by "psql" once the result output is printed, and I >>> got similar time within my client (using libpq) which is not printing >>> any output.. >> >> Using libpq? What is the exact method you are using to execute >> queries...PQexec? > > exactly > >> If you are preparing queries against libpq, the >> best way to execute queries is via PQexecPrepared. > > the query is *once* prepared via PQexec, > then it's looping with "execute" via PQexec. > Why PQexecPrepared will be better in my case?.. It can be better or worse (usually better). the parameters are separated from the query string. Regardless of performance, the parametrized interfaces are superior for any queries taking arguments and should be used when possible. >> Another note: I would like to point out again that there are possible >> negative side effects in using char(n) vs. varchar(n) that IIRC do not >> exist in mysql. When you repeat your test I strongly advise switching >> to varchar. > > if it's true for any case, why not just replace CHAR implementation by > VARCHAR directly within PG code?.. First, let me explain the difference. char(n) is padded out to 'n' on disk and when returned. despite this, the length is still stored so there is no real advantage to using the char(n) type except that the returned string is of a guaranteed length. mysql, at least the particular version and storage engine that I am logged into right now, does not do this for char(n). In other words, select cast('abc' as char(50)) returns a string of 50 chars on pgsql and 3 chars on mysql. I will leave it as an exercise to the reader to figure out who is following the standard. pg's handling of the situation is not necessarily optimal, but we just tell everyone to quit using 'char(n)' type. Unless for example your 'NOTE' column is mostly full or mostly null, your query is not fair because postgres has to both store and return a proportionally greater amount of data. This makes the comparison hardly apples to apples. This stuff counts when we are measuring at microsecond level. >> Another question: how exactly are you connecting to the database? >> local machine? if so, domain socket or tcp/ip? > > local TCP/IP, same as MySQL would be curious to see if you get different results from domain socket. merlin
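A quick way to see the padding difference described above from psql (a throwaway example; the table name is invented):

create temp table pad_demo (c char(50), v varchar(50));
insert into pad_demo values ('abc', 'abc');
select octet_length(c) as char_bytes, octet_length(v) as varchar_bytes from pad_demo;

-- this should come back as roughly 50 vs 3, since char(n) is space-padded
-- to its declared width while varchar(n) stores only what was inserted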
On 5/19/09, Scott Carey <scott@richrelevance.com> wrote: > > On 5/19/09 3:46 AM, "Dimitri" <dimitrik.fr@gmail.com> wrote: > >> On 5/19/09, Scott Carey <scott@richrelevance.com> wrote: >>> >>> On 5/18/09 3:32 PM, "Dimitri" <dimitrik.fr@gmail.com> wrote: >>> >>>> On 5/18/09, Scott Carey <scott@richrelevance.com> wrote: >>>>> Great data Dimitri!' >>>> >>>> Thank you! :-) >>>> >>>>> >>>>> I see a few key trends in the poor scalability: >>>>> >>>>> The throughput scales roughly with %CPU fairly well. But CPU used >>>>> doesn't >>>>> go past ~50% on the 32 core tests. This indicates lock contention. >>>>> >>>> >>>> You should not look on #1 STATs, but on #2 - they are all with the >>>> latest "fixes" - on all of them CPU is used well (90% in pic on >>>> 32cores). >>>> Also, keep in mind these cores are having 2 threads, and from Solaris >>>> point of view they are seen as CPU (so 64 CPU) and %busy is accounted >>>> as for 64 CPU >>>> >>> >>> Well, if the CPU usage is actually higher, then it might not be lock >>> waiting >>> -- it could be spin locks or context switches or cache coherency >>> overhead. >>> Postgres may also not be very SMT friendly, at least on the hardware >>> tested >>> here. >> >> do you mean SMP or CMT? ;-) >> however both should work well with PostgreSQL. I also think about CPU >> affinity - probably it may help to avoid CPU cache misses - but makes >> sense mostly if pooler will be added as a part of PG. > > Symmetric Multi Threading (HyperThreading in Intel's marketing terms, other > marketing terms for Sun or IBM). One CPU core that can handle more than one > concurrently executing thread. > Technically, 'SMT' allows instructions in flight from multiple threads at > once in a superscalar CPU core while some implementations differ and might > technically be CMT (one thread or the other, but can switch fast, or a > non-superscalar core). > > For many implementations of 'multiple threads on one CPU core' many of the > processor resources are reduced per thread when it is active -- caches get > split, instruction re-order buffers are split, etc. That is rather hardware > implementation dependent. > > For Intel's SMT (and other similar), spin-locks hurt scalability if they > aren't using new special instructions for the spin to yield pipeline slots > to the other thread. > > Generally, code that stresses common processor resources more than CPU > execution will scale poorly with SMT/CMT etc. All applications scale well anyway, unless you have some kind of lock contention inside the application itself or some system resource becomes hot. But well, we could spend days discussing this :-) > > So I'm not sure about the Postgres details, but the general case of an > application that doesn't benefit from these technologies exists, and there > is a non-zero chance that Postgres has some characteristics of such an app. > >>> >>> (what was the context switch rate? I didn't see that in the data, just >>> mutex spins). >> >> increasing with a load, as this ex.: >> http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_100.html#bmk_CPU_CtxSwit >> ch_100 >> > > Well, on most systems over 100K context switches/sec is a lot. And those > reach 180000 /sec. > However, this is 'only' 10 context switches per transaction and less than > 20% system CPU, so maybe those numbers aren't quite as big as they seem. > > Out of curiosity, what was the context switch rate for MySql at its peak > throughput?
the main MySQL problem is a mutex locking like here: http://dimitrik.free.fr/Report_20090504/5465_dim_STAT_31.html#bmk_SpinMtx_31 so you have to limit the number of active threads to lower this contention (similar to the pooler idea folks mentioned here), and the context switch rate is even higher (~200K/sec) >> >>> >>> The scalability curve is definitely showing something. Prepared >>> statements >>> were tried, as were most of the other suggestions other than one: >>> >>> What happens if the queries are more complicated (say, they take 15ms >>> server >>> side with a more complicated plan required)? That is a harder question >>> to >>> answer >> >> What I observed is: if planner takes more long time (like initially >> with 8.3.7 and analyze target 1000) the scalability problem is >> appearing more strange - >> http://dimitrik.free.fr/Report_20090505/5521_dim_STAT_18.html - as you >> see CPU even not used more than 60% , and as you may see spin locks >> are lowering - CPUs are not spinning for locks, there is something >> else.. >> I'm supposing a problem of some kind of synchronization - background >> processes are not waking up on time or something like this... >> Then, if more time spent on the query execution itself and not planner: >> - if it'll be I/O time - I/O will hide everything else until you >> increase a storage performance and/or add more RAM, but then you come >> back to the initial issue :-) >> - if it'll be a CPU time it may be interesting! :-) >> >> Rgds, >> -Dimitri >> > > Ok, so that's good info that the planner or parser side seems to scale less > effectively than the execution (as the results show), but I'm wondering > about queries with longer execution times not longer planner times. I'm > wondering that, because its my opinion that most applications that will use > larger scale hardware will have more complicated queries than your test. > Its also greedy on my part since most queries in my applications are > significantly more complicated. > Regardless of my opinions -- this test is on one extreme (small fast > queries) of the spectrum. Its useful to know some data points on other > parts of the spectrum. As I've mentioned before, such fast queries are very common for some applications (like banking transactions, stock management, internet forums, etc.) - 5-7 years ago the goal with this test was to keep response time under 1 sec (I'm not kidding :-) but nowadays we're running under a millisecond.. Crazy progress, no? :-)) However, I've started to extend the db_STRESS kit to accept any kind of query against any kind of db schema. So if you have an interesting data model and some queries to run - I'll be happy to adapt them as a new scenario! :-)) Rgds, -Dimitri > > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance >
On 5/19/09, Merlin Moncure <mmoncure@gmail.com> wrote: > On Tue, May 19, 2009 at 11:53 AM, Dimitri <dimitrik.fr@gmail.com> wrote: >> On 5/19/09, Merlin Moncure <mmoncure@gmail.com> wrote: >>> On Mon, May 18, 2009 at 6:32 PM, Dimitri <dimitrik.fr@gmail.com> wrote: >>>> Thanks Dave for correction, but I'm also curious where the time is >>>> wasted in this case?.. >>>> >>>> 0.84ms is displayed by "psql" once the result output is printed, and I >>>> got similar time within my client (using libpq) which is not printing >>>> any output.. >>> >>> Using libpq? What is the exact method you are using to execute >>> queries...PQexec? >> >> exactly >> >>> If you are preparing queries against libpq, the >>> best way to execute queries is via PQexecPrepared. >> >> the query is *once* prepared via PQexec, >> then it's looping with "execute" via PQexec. >> Why PQexecPrepared will be better in my case?.. > > It can be better or worse (usually better). the parameters are > separated from the query string. Regardless of performance, the > parametrized interfaces are superior for any queries taking arguments > and should be used when possible. you're probably right, but I don't like it when the solution becomes so complicated - PG has such an elegant way to execute a prepared query! > >>> Another note: I would like to point out again that there are possible >>> negative side effects in using char(n) vs. varchar(n) that IIRC do not >>> exist in mysql. When you repeat your test I strongly advise switching >>> to varchar. >> >> if it's true for any case, why not just replace CHAR implementation by >> VARCHAR directly within PG code?.. > > First, let me explain the difference. char(n) is padded out to 'n' on > disk and when returned. despite this, the length is still stored so > there is no real advantage to using the char(n) type except that the > returned string is of a guaranteed length. mysql, at least the > particular version and storage engine that I am logged into right now, > does not do this for char(n). In other words, select cast('abc' as > char(50)) returns a string of 50 chars on pgsql and 3 chars on mysql. > I will leave it as an exercise to the reader to figure out who is > following the standard. pg's handling of the situation is not > necessarily optimal, but we just tell everyone to quit using 'char(n)' > type. > > Unless for example your 'NOTE' column is mostly full or mostly null, > your query is not fair because postgres has to both store and return a > proportionally greater amount of data. This makes the comparison > hardly apples to apples. This stuff counts when we are measuring at > microsecond level. Good point! I can confirm that, at least at the beginning, all fields are fully filled within the database. I will test both engines with VARCHAR next time to be sure it's not an issue. > >>> Another question: how exactly are you connecting to the database? >>> local machine? if so, domain socket or tcp/ip? >> >> local TCP/IP, same as MySQL > > would be curious to see if you get different results from domain socket. at least for PG there was no difference, if I remember correctly. However, earlier when I tested over a real network I ended up completely changing my code to reduce network traffic (initially I had used cursors), and in the end the PG traffic was lower than or similar to MySQL's - that was interesting stuff too :-) Rgds, -Dimitri > > merlin >
On Tue, May 19, 2009 at 3:15 PM, Dimitri <dimitrik.fr@gmail.com> wrote: > On 5/19/09, Merlin Moncure <mmoncure@gmail.com> wrote: >> On Tue, May 19, 2009 at 11:53 AM, Dimitri <dimitrik.fr@gmail.com> wrote: >>> the query is *once* prepared via PQexec, >>> then it's looping with "execute" via PQexec. >>> Why PQexecPrepared will be better in my case?.. >> >> It can be better or worse (usually better). the parameters are >> separated from the query string. Regardless of performance, the >> parametrized interfaces are superior for any queries taking arguments >> and should be used when possible. > > you're probably right, but I don't like it when the solution becomes so > complicated - PG has such an elegant way to execute a prepared query! It's not so bad.

PQexec:

  sprintf(buf, query, char_arg1, char_arg2);
  PQexec(conn, buf);
  sprintf(buf, query, char_arg1, char_arg2);
  PQexec(conn, buf);

PQexecPrepared:

  const char *vals[2];
  int formats[2] = {0, 0};  /* 0 = text format */
  vals[0] = char_arg1; vals[1] = char_arg2;
  PQexecPrepared(conn, stmt, 2, vals, NULL, formats, 0);
  vals[0] = char_arg1; vals[1] = char_arg2;
  PQexecPrepared(conn, stmt, 2, vals, NULL, formats, 0);

The setup is a little rough, and 'non strings' can be a pain vs. printf, but the queries are safer (goodbye sql injection) and usually faster. Also the door is opened to binary formats which can be a huge performance win on some data types...especially bytea, date/time, and geo. There are some good quality libraries out there to help in dealing with the execparams family of functions :D. merlin
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2009-05-19 at 08:58 -0400, Tom Lane wrote: >> Nonsense. The planner might think some other plan is cheaper, but >> it definitely knows how to do this, and has since at least 8.1. > Please look at Dimitri's plan. If it can remove the pointless sort, why > does it not do so? I haven't followed the whole thread, but the plan in the original post is for a hash join. The planner does not trust a hash join to preserve the order of its left input, because of possible batching. See the discussion a couple of months ago where we considered allowing the planner to disable batching so it *could* assume order preservation, and decided the risk of hashtable bloat was too great. regards, tom lane
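One low-tech way to check this against the test schema is to take the hash join out of the picture for a single session and see whether the Sort disappears (purely a diagnostic sketch, not a tuning recommendation; the expected plan change is an assumption):

set enable_hashjoin = off;
explain analyze <the two-table test query from the start of the thread>;
reset enable_hashjoin;

-- if the planner then switches to a nested loop driven by the
-- (ref_object, horder) index, the ordering can be preserved through the
-- join and the top-level Sort should no longer be needed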
On Tue, May 19, 2009 at 6:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Tue, 2009-05-19 at 08:58 -0400, Tom Lane wrote: >>> Nonsense. The planner might think some other plan is cheaper, but >>> it definitely knows how to do this, and has since at least 8.1. > >> Please look at Dimitri's plan. If it can remove the pointless sort, why >> does it not do so? > > I haven't followed the whole thread, but the plan in the original post > is for a hash join. The planner does not trust a hash join to preserve > the order of its left input, because of possible batching. See the > discussion a couple of months ago where we considered allowing the > planner to disable batching so it *could* assume order preservation, and > decided the risk of hashtable bloat was too great. Hmm, my recollection of that conversation was that we decided that we should have the planner tell the executor whether or not we are relying on it to produce sorted output. We set this flag only when the hash join is expected to fit comfortably within one batch (that is, we allow a safety margin). If this flag is set and a hash join unexpectedly goes multi-batch, then we perform a final merge pass (a la merge sort) before returning any results. I don't think it's a good idea to write off the idea of implementing this optimization at some point. I see a lot of queries that join one fairly large table against a whole bunch of little tables, and then sorting the results by a column that is indexed in the big table. The optimizer handles this by sequentially scanning the big table, hash joining against all of the little tables, and then sorting the output, which is pretty silly (given that all of the tables fit in RAM and are in fact actually cached there). If there is a LIMIT clause, then it might instead index-scan the big table, do the hash joins, and then sort the already-ordered results. This is better because at least we're not sorting the entire table unnecessarily but it's still poor. ...Robert
On Tue, 2009-05-19 at 23:54 -0400, Robert Haas wrote: > I don't think it's a good idea to write off the idea of implementing > this optimization at some point. I see a lot of queries that join one > fairly large table against a whole bunch of little tables, and then > sorting the results by a column that is indexed in the big table. Agreed it's a common use case. > The > optimizer handles this by sequentially scanning the big table, hash > joining against all of the little tables, and then sorting the output, > which is pretty silly (given that all of the tables fit in RAM and are > in fact actually cached there). If there is a LIMIT clause, then it > might instead index-scan the big table, do the hash joins, and then > sort the already-ordered results. This is better because at least > we're not sorting the entire table unnecessarily but it's still poor. The Hash node is fully executed before we start pulling rows through the Hash Join node. So the Hash Join node will know at execution time whether or not it will continue to maintain sorted order. So we put the Sort node into the plan, then the Sort node can just ask the Hash Join at execution time whether it should perform a sort or just pass rows through (act as a no-op). The cost of the Sort node can either be zero, or pro-rated down from the normal cost based upon what we think the probability is of going multi-batch, which would vary by work_mem available. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, May 20, 2009 at 4:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > The Hash node is fully executed before we start pulling rows through the > Hash Join node. So the Hash Join node will know at execution time > whether or not it will continue to maintain sorted order. So we put the > Sort node into the plan, then the Sort node can just ask the Hash Join > at execution time whether it should perform a sort or just pass rows > through (act as a no-op). It's not actually a full sort. For example if the join has two batches, you don't need to dump all of the tuples from both batches into a sort. Each of the two tapes produced by the hash join is sorted, but if you read tape one and then tape two, of course then it won't be. What you want to do is read the first tuple from each tape and return whichever one is smaller, and put the other one back; then lather, rinse, and repeat. Because it's such a special-case computation, I think you're going to want to implement it within the HashJoin node rather than inserting a Sort node (or any other kind). ...Robert
On Wed, 2009-05-20 at 07:17 -0400, Robert Haas wrote: > On Wed, May 20, 2009 at 4:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > The Hash node is fully executed before we start pulling rows through the > > Hash Join node. So the Hash Join node will know at execution time > > whether or not it will continue to maintain sorted order. So we put the > > Sort node into the plan, then the Sort node can just ask the Hash Join > > at execution time whether it should perform a sort or just pass rows > > through (act as a no-op). > > It's not actually a full sort. For example if the join has two > batches, you don't need to dump all of the tuples from both batches > into a sort. Each of the two tapes produced by the hash join is > sorted, but if you read tape one and then tape two, of course then it > won't be. What you want to do is read the first tuple from each tape > and return whichever one is smaller, and put the other one back; then > lather, rinse, and repeat. Because it's such a special-case > computation, I think you're going to want to implement it within the > HashJoin node rather than inserting a Sort node (or any other kind). That has wider applicability and seems sound. It will also be easier to assess a cost for that aspect in the optimizer. I like that approach. Code wise, you'll need to refactor things quite a lot to make the tuplesort code accessible to the HJ node. The sorting code is going to get pretty hectic if we add in all the ideas for this, partial sort, improved sorting (at least 3 other ideas). Perhaps it will be easier to write a specific final merge routine just for HJs. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support