Thread: Problem query
Hi All; We have a table with approx 200 columns. about a dozen columns are text data types and the rest are a mix of integers , bigint's and double precision types. The table has about 25million rows. The app wants to run a query like this: select count(pri_num) from max_xtrv_st_t where pri_num in (select max(pri_num) from max_xtrv_st_t where 1=1 group by tds_cx_ind, cxs_ind_2 ) I've tried to split the query up but made little progress, pri_num and tds_cx_ind are bigint's and cxs_ind_2 is an integer The table has an index on all 3 columns (3 separate indexes) Anyone have any thoughts on tuning this query? -- --------------------------------------------- Kevin Kempter - Constent State A PostgreSQL Professional Services Company www.consistentstate.com ---------------------------------------------
On Wed, Jun 1, 2011 at 3:14 PM, CS DBA <cs_dba@consistentstate.com> wrote: > Hi All; > > We have a table with approx 200 columns. about a dozen columns are text data > types and the rest are a mix of integers , bigint's and double precision > types. > > The table has about 25million rows. > > > The app wants to run a query like this: > > select count(pri_num) from max_xtrv_st_t > where pri_num in (select max(pri_num) from max_xtrv_st_t where 1=1 > group by tds_cx_ind, cxs_ind_2 ) > > I've tried to split the query up but made little progress, pri_num and > tds_cx_ind are bigint's and cxs_ind_2 is an integer > > The table has an index on all 3 columns (3 separate indexes) > > Anyone have any thoughts on tuning this query? need postgres version# and the current explain analyze (or explain, if you can't wait for it) merlin
CS DBA <cs_dba@consistentstate.com> wrote: > The app wants to run a query like this: > > select count(pri_num) from max_xtrv_st_t > where pri_num in (select max(pri_num) from max_xtrv_st_t where 1=1 > group by tds_cx_ind, cxs_ind_2) Why not something simpler? There are a number of possibilities, and I don't claim this one is necessarily best (or even error free), but how about something like?: select count(*) from (select distinct max(pri_num) from max_xtrv_st_t group by tds_cx_ind, cxs_ind_2) x -Kevin
On 06/01/2011 03:38 PM, Kevin Grittner wrote: > CS DBA<cs_dba@consistentstate.com> wrote: > >> The app wants to run a query like this: >> >> select count(pri_num) from max_xtrv_st_t >> where pri_num in (select max(pri_num) from max_xtrv_st_t where 1=1 >> group by tds_cx_ind, cxs_ind_2) > > Why not something simpler? There are a number of possibilities, and > I don't claim this one is necessarily best (or even error free), but > how about something like?: > > select count(*) from > (select distinct max(pri_num) > from max_xtrv_st_t > group by tds_cx_ind, cxs_ind_2) x > > -Kevin I've tried a number of alternates, each one wants to do a seq scan of the table (including your suggestion above). -- --------------------------------------------- Kevin Kempter - Constent State A PostgreSQL Professional Services Company www.consistentstate.com ---------------------------------------------
On 06/01/2011 03:15 PM, Merlin Moncure wrote: > On Wed, Jun 1, 2011 at 3:14 PM, CS DBA<cs_dba@consistentstate.com> wrote: >> Hi All; >> >> We have a table with approx 200 columns. about a dozen columns are text data >> types and the rest are a mix of integers , bigint's and double precision >> types. >> >> The table has about 25million rows. >> >> >> The app wants to run a query like this: >> >> select count(pri_num) from max_xtrv_st_t >> where pri_num in (select max(pri_num) from max_xtrv_st_t where 1=1 >> group by tds_cx_ind, cxs_ind_2 ) >> >> I've tried to split the query up but made little progress, pri_num and >> tds_cx_ind are bigint's and cxs_ind_2 is an integer >> >> The table has an index on all 3 columns (3 separate indexes) >> >> Anyone have any thoughts on tuning this query? > need postgres version# and the current explain analyze (or explain, if > you can't wait for it) > > merlin Postgresql version 8.4.2 Explain: QUERY PLAN ------------------------------------------------------------------------------------------------ Aggregate (cost=6551481.85..6551481.86 rows=1 width=8) -> Nested Loop (cost=6550474.85..6551481.35 rows=200 width=8) -> HashAggregate (cost=6550474.85..6550476.85 rows=200 width=8) -> GroupAggregate (cost=5918263.18..6334840.58 rows=17250742 width=20) -> Sort (cost=5918263.18..5968498.96 rows=20094312 width=20) Sort Key: tds_cx_ind, cxs_ind_2 -> Seq Scan on max_xtrv_st_t (cost=0.00..3068701.12 rows=20094312 width=20) -> Index Scan using max_xtrv_st_t_pkey on max_xtrv_st_t (cost=0.00..5.01 rows=1 width=8) Index Cond: (max_xtrv_st_t.pri_num = (max(max_xtrv_st_t.pri_num))) (9 rows) -- --------------------------------------------- Kevin Kempter - Constent State A PostgreSQL Professional Services Company www.consistentstate.com ---------------------------------------------
On Wed, Jun 1, 2011 at 6:28 PM, CS DBA <cs_dba@consistentstate.com> wrote: > On 06/01/2011 03:38 PM, Kevin Grittner wrote: >> >> CS DBA<cs_dba@consistentstate.com> wrote: >> >>> The app wants to run a query like this: >>> >>> select count(pri_num) from max_xtrv_st_t >>> where pri_num in (select max(pri_num) from max_xtrv_st_t where 1=1 >>> group by tds_cx_ind, cxs_ind_2) >> >> Why not something simpler? There are a number of possibilities, and >> I don't claim this one is necessarily best (or even error free), but >> how about something like?: >> >> select count(*) from >> (select distinct max(pri_num) >> from max_xtrv_st_t >> group by tds_cx_ind, cxs_ind_2) x >> >> -Kevin > > I've tried a number of alternates, each one wants to do a seq scan of the > table (including your suggestion above). why wouldn't you expect a sequential scan? what is the number of unique values for tds_cx_ind, cxs_ind_2 on the table? one of the most important techniques with query optimization is to put yourself in the place of the database and try to imagine how *you* would pass over the records...then try and coerce the database into that plan. merlin
CS DBA <cs_dba@consistentstate.com> wrote: > On 06/01/2011 03:38 PM, Kevin Grittner wrote: >> select count(*) from >> (select distinct max(pri_num) >> from max_xtrv_st_t >> group by tds_cx_ind, cxs_ind_2) x > I've tried a number of alternates, each one wants to do a seq scan > of the table (including your suggestion above). Is there some reason to believe that a sequential scan isn't the fastest way to get the data? When generating summary data like this, it often is faster than lots of random access. If you can coerce it into a faster plan by turning off enable_seqscan on the connection before running the query, then we can look at how you might adjust your costing parameters to get better plans. -Kevin
Shaun Thomas <sthomas@peak6.com> wrote: > You're counting the number of maximum values in your table for > tds_cx_ind and cxs_ind_2, but there will always be at least one > for every combination. Good point. > What you really want is this: > > SELECT count(1) FROM ( > SELECT DISTINCT tds_cx_ind, cxs_ind_2 > FROM max_xtrv_st_t > ); Or maybe: SELECT count(DISTINCT (tds_cx_ind, cxs_ind_2)) FROM max_xtrv_st_t; -Kevin
On 06/02/2011 08:47 AM, Kevin Grittner wrote: > Is there some reason to believe that a sequential scan isn't the > fastest way to get the data? When generating summary data like > this, it often is faster than lots of random access. If you can > coerce it into a faster plan by turning off enable_seqscan on the > connection before running the query, then we can look at how you > might adjust your costing parameters to get better plans. This is right. There's really no way for the optimizer to get the values you want, even though your columns are indexed. But your query is a tad naive, unless you wrote up a special case for us. You're counting the number of maximum values in your table for tds_cx_ind and cxs_ind_2, but there will always be at least one for every combination. What you really want is this: SELECT count(1) FROM ( SELECT DISTINCT tds_cx_ind, cxs_ind_2 FROM max_xtrv_st_t ); If you really must have that inner query because it's generated and you won't know what it contains, you'd be better off with a CTE: WITH x AS ( SELECT max(pri_num) FROM max_xtrv_st_t GROUP BY tds_cx_ind, cxs_ind_2 ) SELECT count(1) FROM x; You'll still get a sequence scan from these, however. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604 312-676-8870 sthomas@peak6.com ______________________________________________ See http://www.peak6.com/email_disclaimer.php for terms and conditions related to this email
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > Shaun Thomas <sthomas@peak6.com> wrote: >> What you really want is this: >> >> SELECT count(1) FROM ( >> SELECT DISTINCT tds_cx_ind, cxs_ind_2 >> FROM max_xtrv_st_t >> ); > > Or maybe: > > SELECT count(DISTINCT (tds_cx_ind, cxs_ind_2)) FROM max_xtrv_st_t; Or maybe not. I tried various forms of the query against "real" tables here, and Shaun's format was ten times as fast as my last suggestion and 12% faster than my first suggestion. They all gave the same result, of course, and they all used a seq scan.. -Kevin
On 06/02/2011 11:15 AM, Kevin Grittner wrote: > They all gave the same result, of course, and they all used a seq > scan.. And they all will. I created a test table with a bunch of generate_series and emulated 200 unique matches of column1 and column2, on a table with a mere 1-million rows (5000 for each of column3). And no matter what index combination I used, it always did a sequence scan... even when I indexed every column and indexed column3 descending. But here's the thing. I turned off sequence scans to force index scans, and it got 2-3x slower. But is that really surprising? Without a proper where exclusion, it has to probe every occurrence... also known as a loose index scan, which PostgreSQL doesn't have (yet). And... this is horrifying, but: WITH RECURSIVE t1 AS ( SELECT min(f.tds_cx_ind) AS tds_cx_ind FROM max_xtrv_st_t f UNION ALL SELECT (SELECT min(tds_cx_ind) FROM max_xtrv_st_t f WHERE f.tds_cx_ind > t1.tds_cx_ind) FROM t1 WHERE t1.tds_cx_ind IS NOT NULL ), t2 AS ( SELECT min(f.cxs_ind_2) AS cxs_ind_2 FROM max_xtrv_st_t f UNION ALL SELECT (SELECT min(cxs_ind_2) FROM max_xtrv_st_t f WHERE f.cxs_ind_2 > t2.cxs_ind_2) FROM t2 WHERE t2.cxs_ind_2 IS NOT NULL ) SELECT t1.tds_cx_ind, t2.cxs_ind_2 FROM t1, t2 WHERE t1.tds_cx_ind IS NOT NULL AND t2.cxs_ind_2 IS NOT NULL; It works on my test, but might not be what OP wants. It's a cross product of the two unique column sets, and it's possible it represents combinations that don't exist. But I suppose a late EXISTS pass could solve that problem. I assume there's an easier way to do that. In either case, when is PG getting loose index scans? ;) -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604 312-676-8870 sthomas@peak6.com ______________________________________________ See http://www.peak6.com/email_disclaimer.php for terms and conditions related to this email
Shaun Thomas <sthomas@peak6.com> wrote: > On 06/02/2011 11:15 AM, Kevin Grittner wrote: > >> They all gave the same result, of course, and they all used a seq >> scan.. > > And they all will. I always eschew generalizations, since they're always wrong. ;-) I used a real table which had somewhat similar indexes to what I think the OP is using, and tried the fastest query using the sequential scan. A typical result once cached: explain analyze select count(*) from (select distinct "caseType", "statusCode" from "Case") x; Aggregate (cost=10105.01..10105.02 rows=1 width=0) (actual time=478.893..478.893 rows=1 loops=1) -> HashAggregate (cost=10101.95..10103.31 rows=136 width=6) (actual time=478.861..478.881 rows=79 loops=1) -> Seq Scan on "Case" (cost=0.00..7419.20 rows=536550 width=6) (actual time=0.010..316.481 rows=536550 loops=1) Total runtime: 478.940 ms Then I tried it with a setting designed to discourage seq scans. A typical run: set cpu_tuple_cost = 1; explain analyze select count(*) from (select distinct "caseType", "statusCode" from "Case") x; Aggregate (cost=544529.30..544530.30 rows=1 width=0) (actual time=443.972..443.972 rows=1 loops=1) -> Unique (cost=0.00..544392.95 rows=136 width=6) (actual time=0.021..443.933 rows=79 loops=1) -> Index Scan using "Case_CaseTypeStatus" on "Case" (cost=0.00..541710.20 rows=536550 width=6) (actual time=0.019..347.193 rows=536550 loops=1) Total runtime: 444.014 ms Now, on a table which didn't fit in cache, this would probably be another story.... -Kevin
On 06/02/2011 11:31 AM, Shaun Thomas wrote: > On 06/02/2011 11:15 AM, Kevin Grittner wrote: > >> They all gave the same result, of course, and they all used a seq >> scan.. > > And they all will. I created a test table with a bunch of > generate_series and emulated 200 unique matches of column1 and > column2, on a table with a mere 1-million rows (5000 for each of > column3). And no matter what index combination I used, it always did a > sequence scan... even when I indexed every column and indexed column3 > descending. > > But here's the thing. I turned off sequence scans to force index > scans, and it got 2-3x slower. But is that really surprising? Without > a proper where exclusion, it has to probe every occurrence... also > known as a loose index scan, which PostgreSQL doesn't have (yet). > > And... this is horrifying, but: > > WITH RECURSIVE t1 AS ( > SELECT min(f.tds_cx_ind) AS tds_cx_ind > FROM max_xtrv_st_t f > UNION ALL > SELECT (SELECT min(tds_cx_ind) > FROM max_xtrv_st_t f > WHERE f.tds_cx_ind > t1.tds_cx_ind) > FROM t1 > WHERE t1.tds_cx_ind IS NOT NULL > ), t2 AS ( > SELECT min(f.cxs_ind_2) AS cxs_ind_2 > FROM max_xtrv_st_t f > UNION ALL > SELECT (SELECT min(cxs_ind_2) > FROM max_xtrv_st_t f > WHERE f.cxs_ind_2 > t2.cxs_ind_2) > FROM t2 > WHERE t2.cxs_ind_2 IS NOT NULL > ) > SELECT t1.tds_cx_ind, t2.cxs_ind_2 FROM t1, t2 > WHERE t1.tds_cx_ind IS NOT NULL > AND t2.cxs_ind_2 IS NOT NULL; > > It works on my test, but might not be what OP wants. It's a cross > product of the two unique column sets, and it's possible it represents > combinations that don't exist. But I suppose a late EXISTS pass could > solve that problem. > > I assume there's an easier way to do that. In either case, when is PG > getting loose index scans? ;) > Thanks everyone for the feedback. I'll attempt the suggestions from today as soon as I can and let you know where we end up. -- --------------------------------------------- Kevin Kempter - Constent State A PostgreSQL Professional Services Company www.consistentstate.com ---------------------------------------------