Thread: Problem query

Problem query

From
CS DBA
Date:
Hi All;

We have a table with approx 200 columns. about a dozen columns are text
data types and the rest are a mix of integers , bigint's and double
precision types.

The table has about 25million rows.


The app wants to run a query like this:

select count(pri_num) from max_xtrv_st_t
where pri_num in (select max(pri_num) from max_xtrv_st_t where 1=1
                                group by tds_cx_ind, cxs_ind_2 )

I've tried to split the query up but made little progress, pri_num and
tds_cx_ind are bigint's and  cxs_ind_2 is an integer

The table has an index on all 3 columns (3 separate indexes)

Anyone have any thoughts on tuning this query?


--
---------------------------------------------
Kevin Kempter       -       Constent State
A PostgreSQL Professional Services Company
           www.consistentstate.com
---------------------------------------------


Re: Problem query

From
Merlin Moncure
Date:
On Wed, Jun 1, 2011 at 3:14 PM, CS DBA <cs_dba@consistentstate.com> wrote:
> Hi All;
>
> We have a table with approx 200 columns. about a dozen columns are text data
> types and the rest are a mix of integers , bigint's and double precision
> types.
>
> The table has about 25million rows.
>
>
> The app wants to run a query like this:
>
> select count(pri_num) from max_xtrv_st_t
> where pri_num in (select max(pri_num) from max_xtrv_st_t where 1=1
>                               group by tds_cx_ind, cxs_ind_2 )
>
> I've tried to split the query up but made little progress, pri_num and
>  tds_cx_ind are bigint's and  cxs_ind_2 is an integer
>
> The table has an index on all 3 columns (3 separate indexes)
>
> Anyone have any thoughts on tuning this query?

need postgres version# and the current explain analyze (or explain, if
you can't wait for it)

merlin

Re: Problem query

From
"Kevin Grittner"
Date:
CS DBA <cs_dba@consistentstate.com> wrote:

> The app wants to run a query like this:
>
> select count(pri_num) from max_xtrv_st_t
> where pri_num in (select max(pri_num) from max_xtrv_st_t where 1=1
>                                 group by tds_cx_ind, cxs_ind_2)

Why not something simpler?  There are a number of possibilities, and
I don't claim this one is necessarily best (or even error free), but
how about something like?:

select count(*) from
  (select distinct max(pri_num)
  from max_xtrv_st_t
  group by tds_cx_ind, cxs_ind_2) x

-Kevin

Re: Problem query

From
CS DBA
Date:
On 06/01/2011 03:38 PM, Kevin Grittner wrote:
> CS DBA<cs_dba@consistentstate.com>  wrote:
>
>> The app wants to run a query like this:
>>
>> select count(pri_num) from max_xtrv_st_t
>> where pri_num in (select max(pri_num) from max_xtrv_st_t where 1=1
>>                                  group by tds_cx_ind, cxs_ind_2)
>
> Why not something simpler?  There are a number of possibilities, and
> I don't claim this one is necessarily best (or even error free), but
> how about something like?:
>
> select count(*) from
>    (select distinct max(pri_num)
>    from max_xtrv_st_t
>    group by tds_cx_ind, cxs_ind_2) x
>
> -Kevin

I've tried a number of alternates, each one wants to do a seq scan of
the table (including your suggestion above).


--
---------------------------------------------
Kevin Kempter       -       Constent State
A PostgreSQL Professional Services Company
           www.consistentstate.com
---------------------------------------------


Re: Problem query

From
CS DBA
Date:
On 06/01/2011 03:15 PM, Merlin Moncure wrote:
> On Wed, Jun 1, 2011 at 3:14 PM, CS DBA<cs_dba@consistentstate.com>  wrote:
>> Hi All;
>>
>> We have a table with approx 200 columns. about a dozen columns are text data
>> types and the rest are a mix of integers , bigint's and double precision
>> types.
>>
>> The table has about 25million rows.
>>
>>
>> The app wants to run a query like this:
>>
>> select count(pri_num) from max_xtrv_st_t
>> where pri_num in (select max(pri_num) from max_xtrv_st_t where 1=1
>>                                group by tds_cx_ind, cxs_ind_2 )
>>
>> I've tried to split the query up but made little progress, pri_num and
>>   tds_cx_ind are bigint's and  cxs_ind_2 is an integer
>>
>> The table has an index on all 3 columns (3 separate indexes)
>>
>> Anyone have any thoughts on tuning this query?
> need postgres version# and the current explain analyze (or explain, if
> you can't wait for it)
>
> merlin


Postgresql version 8.4.2


Explain:

                                            QUERY PLAN
------------------------------------------------------------------------------------------------
  Aggregate  (cost=6551481.85..6551481.86 rows=1 width=8)
    ->  Nested Loop  (cost=6550474.85..6551481.35 rows=200 width=8)
          ->  HashAggregate  (cost=6550474.85..6550476.85 rows=200 width=8)
                ->  GroupAggregate  (cost=5918263.18..6334840.58
rows=17250742 width=20)
                      ->  Sort  (cost=5918263.18..5968498.96
rows=20094312 width=20)
                            Sort Key: tds_cx_ind, cxs_ind_2
                            ->  Seq Scan on max_xtrv_st_t
(cost=0.00..3068701.12 rows=20094312 width=20)
          ->  Index Scan using max_xtrv_st_t_pkey on max_xtrv_st_t
(cost=0.00..5.01 rows=1 width=8)
                Index Cond: (max_xtrv_st_t.pri_num =
(max(max_xtrv_st_t.pri_num)))
(9 rows)




--
---------------------------------------------
Kevin Kempter       -       Constent State
A PostgreSQL Professional Services Company
           www.consistentstate.com
---------------------------------------------


Re: Problem query

From
Merlin Moncure
Date:
On Wed, Jun 1, 2011 at 6:28 PM, CS DBA <cs_dba@consistentstate.com> wrote:
> On 06/01/2011 03:38 PM, Kevin Grittner wrote:
>>
>> CS DBA<cs_dba@consistentstate.com>  wrote:
>>
>>> The app wants to run a query like this:
>>>
>>> select count(pri_num) from max_xtrv_st_t
>>> where pri_num in (select max(pri_num) from max_xtrv_st_t where 1=1
>>>                                 group by tds_cx_ind, cxs_ind_2)
>>
>> Why not something simpler?  There are a number of possibilities, and
>> I don't claim this one is necessarily best (or even error free), but
>> how about something like?:
>>
>> select count(*) from
>>   (select distinct max(pri_num)
>>   from max_xtrv_st_t
>>   group by tds_cx_ind, cxs_ind_2) x
>>
>> -Kevin
>
> I've tried a number of alternates, each one wants to do a seq scan of the
> table (including your suggestion above).

why wouldn't you expect a sequential scan? what is the number of
unique values for tds_cx_ind, cxs_ind_2 on the table?

one of the most important techniques with query optimization is to put
yourself in the place of the database and try to imagine how *you*
would pass over the records...then try and coerce the database into
that plan.

merlin

Re: Problem query

From
"Kevin Grittner"
Date:
CS DBA <cs_dba@consistentstate.com> wrote:
> On 06/01/2011 03:38 PM, Kevin Grittner wrote:

>> select count(*) from
>>    (select distinct max(pri_num)
>>    from max_xtrv_st_t
>>    group by tds_cx_ind, cxs_ind_2) x

> I've tried a number of alternates, each one wants to do a seq scan
> of the table (including your suggestion above).

Is there some reason to believe that a sequential scan isn't the
fastest way to get the data?  When generating summary data like
this, it often is faster than lots of random access.  If you can
coerce it into a faster plan by turning off enable_seqscan on the
connection before running the query, then we can look at how you
might adjust your costing parameters to get better plans.

-Kevin

Re: Problem query

From
"Kevin Grittner"
Date:
Shaun Thomas <sthomas@peak6.com> wrote:

> You're counting the number of maximum values in your table for
> tds_cx_ind and cxs_ind_2, but there will always be at least one
> for every combination.

Good point.

> What you really want is this:
>
> SELECT count(1) FROM (
>    SELECT DISTINCT tds_cx_ind, cxs_ind_2
>      FROM max_xtrv_st_t
> );

Or maybe:

SELECT count(DISTINCT (tds_cx_ind, cxs_ind_2)) FROM max_xtrv_st_t;

-Kevin

Re: Problem query

From
Shaun Thomas
Date:
On 06/02/2011 08:47 AM, Kevin Grittner wrote:

> Is there some reason to believe that a sequential scan isn't the
> fastest way to get the data?  When generating summary data like
> this, it often is faster than lots of random access.  If you can
> coerce it into a faster plan by turning off enable_seqscan on the
> connection before running the query, then we can look at how you
> might adjust your costing parameters to get better plans.

This is right. There's really no way for the optimizer to get the values
you want, even though your columns are indexed. But your query is a tad
naive, unless you wrote up a special case for us. You're counting the
number of maximum values in your table for tds_cx_ind and cxs_ind_2, but
there will always be at least one for every combination. What you really
want is this:

SELECT count(1) FROM (
   SELECT DISTINCT tds_cx_ind, cxs_ind_2
     FROM max_xtrv_st_t
);

If you really must have that inner query because it's generated and you
won't know what it contains, you'd be better off with a CTE:

WITH x AS (
   SELECT max(pri_num)
     FROM max_xtrv_st_t
    GROUP BY tds_cx_ind, cxs_ind_2
)
SELECT count(1) FROM x;

You'll still get a sequence scan from these, however.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com

______________________________________________

See  http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email

Re: Problem query

From
"Kevin Grittner"
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> Shaun Thomas <sthomas@peak6.com> wrote:

>> What you really want is this:
>>
>> SELECT count(1) FROM (
>>    SELECT DISTINCT tds_cx_ind, cxs_ind_2
>>      FROM max_xtrv_st_t
>> );
>
> Or maybe:
>
> SELECT count(DISTINCT (tds_cx_ind, cxs_ind_2)) FROM max_xtrv_st_t;

Or maybe not.  I tried various forms of the query against "real"
tables here, and Shaun's format was ten times as fast as my last
suggestion and 12% faster than my first suggestion.

They all gave the same result, of course, and they all used a seq
scan..

-Kevin

Re: Problem query

From
Shaun Thomas
Date:
On 06/02/2011 11:15 AM, Kevin Grittner wrote:

> They all gave the same result, of course, and they all used a seq
> scan..

And they all will. I created a test table with a bunch of
generate_series and emulated 200 unique matches of column1 and column2,
on a table with a mere 1-million rows (5000 for each of column3). And no
matter what index combination I used, it always did a sequence scan...
even when I indexed every column and indexed column3 descending.

But here's the thing. I turned off sequence scans to force index scans,
and it got 2-3x slower. But is that really surprising? Without a proper
where exclusion, it has to probe every occurrence... also known as a
loose index scan, which PostgreSQL doesn't have (yet).

And... this is horrifying, but:

WITH RECURSIVE t1 AS (
   SELECT min(f.tds_cx_ind) AS tds_cx_ind
     FROM max_xtrv_st_t f
    UNION ALL
   SELECT (SELECT min(tds_cx_ind)
             FROM max_xtrv_st_t f
            WHERE f.tds_cx_ind > t1.tds_cx_ind)
     FROM t1
    WHERE t1.tds_cx_ind IS NOT NULL
), t2 AS (
   SELECT min(f.cxs_ind_2) AS cxs_ind_2
     FROM max_xtrv_st_t f
    UNION ALL
   SELECT (SELECT min(cxs_ind_2)
             FROM max_xtrv_st_t f
            WHERE f.cxs_ind_2 > t2.cxs_ind_2)
     FROM t2
    WHERE t2.cxs_ind_2 IS NOT NULL
)
SELECT t1.tds_cx_ind, t2.cxs_ind_2  FROM t1, t2
  WHERE t1.tds_cx_ind IS NOT NULL
    AND t2.cxs_ind_2 IS NOT NULL;

It works on my test, but might not be what OP wants. It's a cross
product of the two unique column sets, and it's possible it represents
combinations that don't exist. But I suppose a late EXISTS pass could
solve that problem.

I assume there's an easier way to do that. In either case, when is PG
getting loose index scans? ;)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com

______________________________________________

See  http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email

Re: Problem query

From
"Kevin Grittner"
Date:
Shaun Thomas <sthomas@peak6.com> wrote:
> On 06/02/2011 11:15 AM, Kevin Grittner wrote:
>
>> They all gave the same result, of course, and they all used a seq
>> scan..
>
> And they all will.

I always eschew generalizations, since they're always wrong.  ;-)  I
used a real table which had somewhat similar indexes to what I think
the OP is using, and tried the fastest query using the sequential
scan.  A typical result once cached:

explain analyze select count(*) from
  (select distinct "caseType", "statusCode" from "Case") x;

 Aggregate  (cost=10105.01..10105.02 rows=1 width=0)
            (actual time=478.893..478.893 rows=1 loops=1)
   ->  HashAggregate  (cost=10101.95..10103.31 rows=136 width=6)
                      (actual time=478.861..478.881 rows=79 loops=1)
         ->  Seq Scan on "Case"
                  (cost=0.00..7419.20 rows=536550 width=6)
                  (actual time=0.010..316.481 rows=536550 loops=1)
 Total runtime: 478.940 ms

Then I tried it with a setting designed to discourage seq scans.
A typical run:

set cpu_tuple_cost = 1;
explain analyze select count(*) from
  (select distinct "caseType", "statusCode" from "Case") x;

 Aggregate  (cost=544529.30..544530.30 rows=1 width=0)
            (actual time=443.972..443.972 rows=1 loops=1)
   ->  Unique  (cost=0.00..544392.95 rows=136 width=6)
               (actual time=0.021..443.933 rows=79 loops=1)
         ->  Index Scan using "Case_CaseTypeStatus" on "Case"
                  (cost=0.00..541710.20 rows=536550 width=6)
                  (actual time=0.019..347.193 rows=536550 loops=1)
 Total runtime: 444.014 ms

Now, on a table which didn't fit in cache, this would probably be
another story....

-Kevin

Re: Problem query

From
CS DBA
Date:
On 06/02/2011 11:31 AM, Shaun Thomas wrote:
> On 06/02/2011 11:15 AM, Kevin Grittner wrote:
>
>> They all gave the same result, of course, and they all used a seq
>> scan..
>
> And they all will. I created a test table with a bunch of
> generate_series and emulated 200 unique matches of column1 and
> column2, on a table with a mere 1-million rows (5000 for each of
> column3). And no matter what index combination I used, it always did a
> sequence scan... even when I indexed every column and indexed column3
> descending.
>
> But here's the thing. I turned off sequence scans to force index
> scans, and it got 2-3x slower. But is that really surprising? Without
> a proper where exclusion, it has to probe every occurrence... also
> known as a loose index scan, which PostgreSQL doesn't have (yet).
>
> And... this is horrifying, but:
>
> WITH RECURSIVE t1 AS (
>   SELECT min(f.tds_cx_ind) AS tds_cx_ind
>     FROM max_xtrv_st_t f
>    UNION ALL
>   SELECT (SELECT min(tds_cx_ind)
>             FROM max_xtrv_st_t f
>            WHERE f.tds_cx_ind > t1.tds_cx_ind)
>     FROM t1
>    WHERE t1.tds_cx_ind IS NOT NULL
> ), t2 AS (
>   SELECT min(f.cxs_ind_2) AS cxs_ind_2
>     FROM max_xtrv_st_t f
>    UNION ALL
>   SELECT (SELECT min(cxs_ind_2)
>             FROM max_xtrv_st_t f
>            WHERE f.cxs_ind_2 > t2.cxs_ind_2)
>     FROM t2
>    WHERE t2.cxs_ind_2 IS NOT NULL
> )
> SELECT t1.tds_cx_ind, t2.cxs_ind_2  FROM t1, t2
>  WHERE t1.tds_cx_ind IS NOT NULL
>    AND t2.cxs_ind_2 IS NOT NULL;
>
> It works on my test, but might not be what OP wants. It's a cross
> product of the two unique column sets, and it's possible it represents
> combinations that don't exist. But I suppose a late EXISTS pass could
> solve that problem.
>
> I assume there's an easier way to do that. In either case, when is PG
> getting loose index scans? ;)
>


Thanks everyone for the feedback.  I'll attempt the suggestions from
today as soon as I can and let you know where we end up.


--
---------------------------------------------
Kevin Kempter       -       Constent State
A PostgreSQL Professional Services Company
           www.consistentstate.com
---------------------------------------------