Thread: Query performance question on a large table

Query performance question on a large table

From

Együd Csaba

Date:

06 January 2004, 15:44:05

Hi All,
how can I improve the query performance in the following situation:
I have a big (4.5+ million rows) table. One query takes approx. 9 sec to
finish resulting ~10000 rows. But if I run simultaneously 4 similar queries
it takes nearly 5 minutes instead of 4 times 9 seconds or something near of
that.

here is a sample query:
select  mertido, fomeazon, ertektipus, mertertek from t_me30 where fomeazon
in (select distinct fomeazon from t_fome where lower(inuse) = 'igen') and
mertido like '2003-12-17%' and ertektipus in ('+MW') order by mertido,
fomeazon, ertektipus;

What kind of indexes could speed up a query like this? I tried to create one
on fields (mertido, fomeazon, ertektipus) but led me much longer execution
time.

Ohh, I nearly forgot the config:  Linux 7.1; Postgres 7.3.2;

Thank you,
bye

----------------------------------------
Csaba Együd
csegyud@vnet.hu

Re: Query performance question on a large table

From

Tom Lane

Date:

06 January 2004, 16:09:17

=?iso-8859-2?Q?Egy=FCd_Csaba?= <csegyud@vnet.hu> writes:
> here is a sample query:
> select  mertido, fomeazon, ertektipus, mertertek from t_me30 where fomeazon
> in (select distinct fomeazon from t_fome where lower(inuse) = 'igen') and
> mertido like '2003-12-17%' and ertektipus in ('+MW') order by mertido,
> fomeazon, ertektipus;

> Ohh, I nearly forgot the config:  Linux 7.1; Postgres 7.3.2;

The first thing you ought to do is move to PG 7.4.  "foo IN (SELECT ...)"
generally works a lot better under 7.4 than prior releases.  I'd suggest
dropping the "DISTINCT" when using 7.4, too.

            regards, tom lane

Re: Query performance question on a large table

From

Sean Shanny

Date:

07 January 2004, 03:11:47

Tom,

Sort of piggybacking on this thread but why the suggestion to drop the
use of DISTINCT in 7.4?  We use DISTINCT all over the place to eliminate
duplicates in sub select statements. Running 7.4.0 currently on
FreeBSD5.1 Dell 2650 4GB RAM 5 disk SCSI array hardware RAID 0

 Example:

explain analyze select t1.raw_agent_string from d_useragent t1 where
t1.id in (select distinct useragent_key from f_pageviews where date_key
between 356 and 362);

QUERY PLAN


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1020025.13..1020178.84 rows=51 width=79) (actual
time=954080.021..970268.457 rows=82207 loops=1)
   ->  HashAggregate  (cost=1020025.13..1020025.13 rows=51 width=4)
(actual time=954049.317..954450.065 rows=82208 loops=1)
         ->  Subquery Scan "IN_subquery"  (cost=983429.20..1020025.00
rows=51 width=4) (actual time=856641.244..953639.116 rows=82208 loops=1)
               ->  Unique  (cost=983429.20..1020024.49 rows=51 width=4)
(actual time=856641.230..952939.539 rows=82208 loops=1)
                     ->  Sort  (cost=983429.20..1001726.84 rows=7319058
width=4) (actual time=856641.215..906429.835 rows=11067735 loops=1)
                           Sort Key: useragent_key
                           ->  Index Scan using
idx_pageviews_date_dec_2003 on f_pageviews  (cost=0.00..136434.63
rows=7319058 width=4) (actual time=1.140..693400.464 rows=11067735 loops=1)
                                 Index Cond: ((date_key >= 356) AND
(date_key <= 362))
   ->  Index Scan using d_useragent_pkey on d_useragent t1
(cost=0.00..3.00 rows=1 width=83) (actual time=0.169..0.174 rows=1
loops=82208)
         Index Cond: (t1.id = "outer".useragent_key)
 Total runtime: 970657.888 ms
(11 rows)

t1.id is the primary key on d_useragent.  d_useragent  actually has
390751 rows.
useragent_key has an index.  f_pageviews has roughly 120 million rows.

Is there a better way of writing this sort of query that will accomplish
the same thing?

Thanks.

--sean

Tom Lane wrote:

>=?iso-8859-2?Q?Egy=FCd_Csaba?= <csegyud@vnet.hu> writes:
>
>
>>here is a sample query:
>>select  mertido, fomeazon, ertektipus, mertertek from t_me30 where fomeazon
>>in (select distinct fomeazon from t_fome where lower(inuse) = 'igen') and
>>mertido like '2003-12-17%' and ertektipus in ('+MW') order by mertido,
>>fomeazon, ertektipus;
>>
>>
>
>
>
>>Ohh, I nearly forgot the config:  Linux 7.1; Postgres 7.3.2;
>>
>>
>
>The first thing you ought to do is move to PG 7.4.  "foo IN (SELECT ...)"
>generally works a lot better under 7.4 than prior releases.  I'd suggest
>dropping the "DISTINCT" when using 7.4, too.
>
>            regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>
>
>

Re: Query performance question on a large table

From

Tom Lane

Date:

07 January 2004, 03:32:18

Sean Shanny <shannyconsulting@earthlink.net> writes:
> Sort of piggybacking on this thread but why the suggestion to drop the
> use of DISTINCT in 7.4?

Because the 7.4 planner can decide for itself whether DISTINCT'ifying
the sub-select output is the best way to proceed or not.  There is more
than one good way to do an "IN sub-SELECT" operation, and the 7.4
planner knows several.  (Pre-7.4 planners didn't know any :-( ... but
I digress.)  When you write "foo IN (SELECT DISTINCT ...)", the DISTINCT
doesn't change the semantics at all, it just adds overhead.

In fact it's worse than that: if the planner decides that the best way
to proceed is to make the subselect output unique, it will throw another
layer of sort/unique processing on top of what you did.  So writing
DISTINCT is actually a pessimization in 7.4.

>  Example:

>    ->  HashAggregate  (cost=1020025.13..1020025.13 rows=51 width=4)
> (actual time=954049.317..954450.065 rows=82208 loops=1)
>          ->  Subquery Scan "IN_subquery"  (cost=983429.20..1020025.00
> rows=51 width=4) (actual time=856641.244..953639.116 rows=82208 loops=1)
>                ->  Unique  (cost=983429.20..1020024.49 rows=51 width=4)
> (actual time=856641.230..952939.539 rows=82208 loops=1)
>                      ->  Sort  (cost=983429.20..1001726.84 rows=7319058
> width=4) (actual time=856641.215..906429.835 rows=11067735 loops=1)
>                            Sort Key: useragent_key
>                            ->  Index Scan using

The sort/unique steps are coming from the DISTINCT.  The HashAggregate
step is the planner making sure the output rows are distinct :-(

I just a couple days ago added some logic to CVS tip to notice that the
sub-select has a DISTINCT clause, and not add unnecessary unique-ifying
processing on top of it.  So in 7.5, writing a DISTINCT clause will
amount to forcing a particular query plan, which might or might not be
the best thing but hopefully won't be too terrible.  But in 7.4 it has
nothing to recommend it ...

            regards, tom lane

Re: Query performance question on a large table

From

Együd Csaba

Date:

07 January 2004, 03:44:51

Hi Tom,
thank you, I'll upgrade as soon as I can. Anyway I've already planned to do
so for a while.
I'll keep in mind your remarks concerning the DISTINCT clause too.

Bye and Best Regards,
-- Csaba

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Tom Lane
> Sent: 2004. január 6. 21:04
> To: csegyud@vnet.hu
> Cc: Pgsql-General@Postgresql.Org (E-mail)
> Subject: Re: [GENERAL] Query performance question on a large table
>
>
> =?iso-8859-2?Q?Egy=FCd_Csaba?= <csegyud@vnet.hu> writes:
> > here is a sample query:
> > select  mertido, fomeazon, ertektipus, mertertek from
> t_me30 where fomeazon
> > in (select distinct fomeazon from t_fome where lower(inuse)
> = 'igen') and
> > mertido like '2003-12-17%' and ertektipus in ('+MW') order
> by mertido,
> > fomeazon, ertektipus;
>
> > Ohh, I nearly forgot the config:  Linux 7.1; Postgres 7.3.2;
>
> The first thing you ought to do is move to PG 7.4.  "foo IN
> (SELECT ...)"
> generally works a lot better under 7.4 than prior releases.
> I'd suggest
> dropping the "DISTINCT" when using 7.4, too.
>
>             regards, tom lane
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to
> majordomo@postgresql.org
>
>
>
> -- Incoming mail is certified Virus Free.
> Checked by AVG Anti-Virus (http://www.grisoft.com).
> Version: 7.0.209 / Virus Database: 261 - Release Date: 2004. 01. 02.
>

Re: Query performance question on a large table

From

Bruno Wolff III

Date:

07 January 2004, 13:58:22

On Wed, Jan 07, 2004 at 02:31:22 -0500,
  Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I just a couple days ago added some logic to CVS tip to notice that the
> sub-select has a DISTINCT clause, and not add unnecessary unique-ifying
> processing on top of it.  So in 7.5, writing a DISTINCT clause will
> amount to forcing a particular query plan, which might or might not be
> the best thing but hopefully won't be too terrible.  But in 7.4 it has
> nothing to recommend it ...

Can't the DISTINCT be dropped if there isn't a LIMIT clause?
Similarly UNION, INTERSECTION and EXCEPT could also also be changed
to the ALL forms if there isn't a LIMIT.

Re: Query performance question on a large table

From

Együd Csaba

Date:

07 January 2004, 15:20:45

Hi Tom,
I've upgraded to 7.4.1. It seems to be working fine - haven't encountered
any problems yet.
The upgrade didn't lead to the desired outcome however. The query doesn't
run faster then under v7.3.2.


I have the following relations:
tgr=# \d t_fome -- 46 rows
              Table "public.t_fome"
  Column   |           Type           | Modifiers
-----------+--------------------------+-----------
 fomeazon  | integer                  | not null
 fomenev   | character varying(50)    |
 inuse     | character(4)             |
 mecsazon  | integer                  |
 merotipus | character(10)            |
 szbevont  | character(1)             |
 utmodido  | timestamp with time zone |
 visible   | character(1)             |
Indexes:
    "t_fome_pkey" primary key, btree (fomeazon)
    "idx_t_fome_fomeazon" btree (fomeazon)
    "idx_t_fome_inuse" btree (inuse)
    "idx_t_fome_lower_inuse" btree (lower((inuse)::text))
    "idx_t_fome_mecsazon" btree (mecsazon)

tgr=# \d t_me30 -- 4518927 rows
                Table "public.t_me30"
    Column    |           Type           | Modifiers
--------------+--------------------------+-----------
 fomeazon     | integer                  |
 mertido      | character(16)            |
 ertektipus   | character(10)            |
 hetnap       | character(1)             |
 impulzusszam | double precision         |
 mertertek    | double precision         |
 merttartam   | integer                  |
 utmodido     | timestamp with time zone |
Indexes:
    "idx_t_me30_ertektipus" btree (ertektipus)
    "idx_t_me30_fomeazon" btree (fomeazon)
    "idx_t_me30_mertido" btree (mertido)
    "idx_t_me30_mertido_fomeazon_ertektipus" btree (mertido, fomeazon,
ertektipus)
    "idx_t_me30_utmodido" btree (utmodido)



I found that:

1. explain select fomeazon from t_fome where lower(inuse) = 'igen'
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on t_fome  (cost=0.00..1.80 rows=1 width=4)
   Filter: (lower((inuse)::text) = 'igen'::text)

As the table has an index on lower((inuse)::text), I belive it should be
used for searching.


2. explain select  mertido, fomeazon, ertektipus, mertertek
   from t_me30 where fomeazon in (select fomeazon from t_fome where
lower(inuse) = 'igen') and
   mertido like '2003-12-17%' and ertektipus in ('+MW') order by mertido,
fomeazon, ertektipus;
                                        QUERY PLAN
----------------------------------------------------------------------------
---------------
 Sort  (cost=128045.87..128045.93 rows=24 width=46)
   Sort Key: t_me30.mertido, t_me30.fomeazon, t_me30.ertektipus
   ->  Hash IN Join  (cost=1.81..128045.32 rows=24 width=46)
         Hash Cond: ("outer".fomeazon = "inner".fomeazon)
         ->  Seq Scan on t_me30  (cost=0.00..128037.62 rows=1129 width=46)
               Filter: ((mertido ~~ '2003-12-17%'::text) AND (ertektipus =
'+MW'::bpchar))
         ->  Hash  (cost=1.80..1.80 rows=1 width=4)
               ->  Seq Scan on t_fome  (cost=0.00..1.80 rows=1 width=4)
                     Filter: (lower((inuse)::text) = 'igen'::text)

In the first line of query plan we have a sort operation which is the most
expensive part of the plan. Having an index on (mertido, fomeazon,
ertektipus) key, shouldn't it be used to sort the result set? Like doesn't
use the index (mertido) either.

How could I make Postgres to use these indexes. Is there any other way to
make lower the costs on sort operations and as a result the query run time?

Thank you all,

-- Csaba Együd


> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Tom Lane
> Sent: 2004. január 6. 21:04
> To: csegyud@vnet.hu
> Cc: Pgsql-General@Postgresql.Org (E-mail)
> Subject: Re: [GENERAL] Query performance question on a large table
>
>
> =?iso-8859-2?Q?Egy=FCd_Csaba?= <csegyud@vnet.hu> writes:
> > here is a sample query:
> > select  mertido, fomeazon, ertektipus, mertertek from
> t_me30 where fomeazon
> > in (select distinct fomeazon from t_fome where lower(inuse)
> = 'igen') and
> > mertido like '2003-12-17%' and ertektipus in ('+MW') order
> by mertido,
> > fomeazon, ertektipus;
>
> > Ohh, I nearly forgot the config:  Linux 7.1; Postgres 7.3.2;
>
> The first thing you ought to do is move to PG 7.4.  "foo IN
> (SELECT ...)"
> generally works a lot better under 7.4 than prior releases.
> I'd suggest
> dropping the "DISTINCT" when using 7.4, too.
>
>             regards, tom lane
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to
> majordomo@postgresql.org
>
>
>
> -- Incoming mail is certified Virus Free.
> Checked by AVG Anti-Virus (http://www.grisoft.com).
> Version: 7.0.209 / Virus Database: 261 - Release Date: 2004. 01. 02.
>

Re: Query performance question on a large table

From

Tom Lane

Date:

07 January 2004, 15:36:32

=?iso-8859-2?Q?Egy=FCd_Csaba?= <csegyud@vnet.hu> writes:
> I found that:

> 1. explain select fomeazon from t_fome where lower(inuse) = 'igen'
>                       QUERY PLAN
> ------------------------------------------------------
>  Seq Scan on t_fome  (cost=0.00..1.80 rows=1 width=4)
>    Filter: (lower((inuse)::text) = 'igen'::text)

Sure looks like you have not VACUUM ANALYZED this table yet.

> 2. explain select  mertido, fomeazon, ertektipus, mertertek
>    from t_me30 where fomeazon in (select fomeazon from t_fome where
> lower(inuse) = 'igen') and
>    mertido like '2003-12-17%' and ertektipus in ('+MW') order by mertido,
> fomeazon, ertektipus;

Could we see EXPLAIN ANALYZE not just EXPLAIN output?

            regards, tom lane

Re: Query performance question on a large table

From

Együd Csaba

Date:

07 January 2004, 15:38:25

Sorry, I just have found the thread concerning about index useage in
connection with like clause.
I created an index on mertido using
CREATE INDEX idx_t_me30_mertido2 ON t_me30 USING btree (mertido
bpchar_pattern_ops);
command and analyzed and the query plan now is:

tgr=# explain select  mertido, fomeazon, ertektipus, mertertek from t_me30
where fomeazon in (select fomeazon from t_fome where lower(inuse) = 'igen')
and mertido like '2003-12-17%' and ertektipus i
n ('+MW') order by mertido, fomeazon, ertektipus;
                                               QUERY PLAN
----------------------------------------------------------------------------
----------------------------
 Sort  (cost=7.84..7.85 rows=1 width=46)
   Sort Key: t_me30.mertido, t_me30.fomeazon, t_me30.ertektipus
   ->  Nested Loop IN Join  (cost=0.00..7.83 rows=1 width=46)
         Join Filter: ("outer".fomeazon = "inner".fomeazon)
         ->  Index Scan using idx_t_me30_mertido2 on t_me30
(cost=0.00..6.02 rows=1 width=46)
               Index Cond: ((mertido ~>=~ '2003-12-17'::bpchar) AND (mertido
~<~ '2003-12-18'::bpchar))
               Filter: ((mertido ~~ '2003-12-17%'::text) AND (ertektipus =
'+MW'::bpchar))
         ->  Seq Scan on t_fome  (cost=0.00..1.80 rows=1 width=4)
               Filter: (lower((inuse)::text) = 'igen'::text)

Thats it! Thank you very much!

-- Csaba

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Együd Csaba
> Sent: 2004. január 7. 20:17
> To: 'Tom Lane'
> Cc: 'Pgsql-General@Postgresql.Org (E-mail)'
> Subject: Re: [GENERAL] Query performance question on a large table
>
>
> Hi Tom,
> I've upgraded to 7.4.1. It seems to be working fine - haven't
> encountered
> any problems yet.
> The upgrade didn't lead to the desired outcome however. The
> query doesn't
> run faster then under v7.3.2.
>
>
> I have the following relations:
> tgr=# \d t_fome -- 46 rows
>               Table "public.t_fome"
>   Column   |           Type           | Modifiers
> -----------+--------------------------+-----------
>  fomeazon  | integer                  | not null
>  fomenev   | character varying(50)    |
>  inuse     | character(4)             |
>  mecsazon  | integer                  |
>  merotipus | character(10)            |
>  szbevont  | character(1)             |
>  utmodido  | timestamp with time zone |
>  visible   | character(1)             |
> Indexes:
>     "t_fome_pkey" primary key, btree (fomeazon)
>     "idx_t_fome_fomeazon" btree (fomeazon)
>     "idx_t_fome_inuse" btree (inuse)
>     "idx_t_fome_lower_inuse" btree (lower((inuse)::text))
>     "idx_t_fome_mecsazon" btree (mecsazon)
>
> tgr=# \d t_me30 -- 4518927 rows
>                 Table "public.t_me30"
>     Column    |           Type           | Modifiers
> --------------+--------------------------+-----------
>  fomeazon     | integer                  |
>  mertido      | character(16)            |
>  ertektipus   | character(10)            |
>  hetnap       | character(1)             |
>  impulzusszam | double precision         |
>  mertertek    | double precision         |
>  merttartam   | integer                  |
>  utmodido     | timestamp with time zone |
> Indexes:
>     "idx_t_me30_ertektipus" btree (ertektipus)
>     "idx_t_me30_fomeazon" btree (fomeazon)
>     "idx_t_me30_mertido" btree (mertido)
>     "idx_t_me30_mertido_fomeazon_ertektipus" btree (mertido, fomeazon,
> ertektipus)
>     "idx_t_me30_utmodido" btree (utmodido)
>
>
>
> I found that:
>
> 1. explain select fomeazon from t_fome where lower(inuse) = 'igen'
>                       QUERY PLAN
> ------------------------------------------------------
>  Seq Scan on t_fome  (cost=0.00..1.80 rows=1 width=4)
>    Filter: (lower((inuse)::text) = 'igen'::text)
>
> As the table has an index on lower((inuse)::text), I belive
> it should be
> used for searching.
>
>
> 2. explain select  mertido, fomeazon, ertektipus, mertertek
>    from t_me30 where fomeazon in (select fomeazon from t_fome where
> lower(inuse) = 'igen') and
>    mertido like '2003-12-17%' and ertektipus in ('+MW') order
> by mertido,
> fomeazon, ertektipus;
>                                         QUERY PLAN
> --------------------------------------------------------------
> --------------
> ---------------
>  Sort  (cost=128045.87..128045.93 rows=24 width=46)
>    Sort Key: t_me30.mertido, t_me30.fomeazon, t_me30.ertektipus
>    ->  Hash IN Join  (cost=1.81..128045.32 rows=24 width=46)
>          Hash Cond: ("outer".fomeazon = "inner".fomeazon)
>          ->  Seq Scan on t_me30  (cost=0.00..128037.62
> rows=1129 width=46)
>                Filter: ((mertido ~~ '2003-12-17%'::text) AND
> (ertektipus =
> '+MW'::bpchar))
>          ->  Hash  (cost=1.80..1.80 rows=1 width=4)
>                ->  Seq Scan on t_fome  (cost=0.00..1.80
> rows=1 width=4)
>                      Filter: (lower((inuse)::text) = 'igen'::text)
>
> In the first line of query plan we have a sort operation
> which is the most
> expensive part of the plan. Having an index on (mertido, fomeazon,
> ertektipus) key, shouldn't it be used to sort the result set?
> Like doesn't
> use the index (mertido) either.
>
> How could I make Postgres to use these indexes. Is there any
> other way to
> make lower the costs on sort operations and as a result the
> query run time?
>
> Thank you all,
>
> -- Csaba Együd
>
>
> > -----Original Message-----
> > From: pgsql-general-owner@postgresql.org
> > [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Tom Lane
> > Sent: 2004. január 6. 21:04
> > To: csegyud@vnet.hu
> > Cc: Pgsql-General@Postgresql.Org (E-mail)
> > Subject: Re: [GENERAL] Query performance question on a large table
> >
> >
> > =?iso-8859-2?Q?Egy=FCd_Csaba?= <csegyud@vnet.hu> writes:
> > > here is a sample query:
> > > select  mertido, fomeazon, ertektipus, mertertek from
> > t_me30 where fomeazon
> > > in (select distinct fomeazon from t_fome where lower(inuse)
> > = 'igen') and
> > > mertido like '2003-12-17%' and ertektipus in ('+MW') order
> > by mertido,
> > > fomeazon, ertektipus;
> >
> > > Ohh, I nearly forgot the config:  Linux 7.1; Postgres 7.3.2;
> >
> > The first thing you ought to do is move to PG 7.4.  "foo IN
> > (SELECT ...)"
> > generally works a lot better under 7.4 than prior releases.
> > I'd suggest
> > dropping the "DISTINCT" when using 7.4, too.
> >
> >             regards, tom lane
> >
> > ---------------------------(end of
> > broadcast)---------------------------
> > TIP 1: subscribe and unsubscribe commands go to
> > majordomo@postgresql.org
> >
> >
> >
> > -- Incoming mail is certified Virus Free.
> > Checked by AVG Anti-Virus (http://www.grisoft.com).
> > Version: 7.0.209 / Virus Database: 261 - Release Date: 2004. 01. 02.
> >
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 8: explain analyze is your friend
>
>
>
> -- Incoming mail is certified Virus Free.
> Checked by AVG Anti-Virus (http://www.grisoft.com).
> Version: 7.0.209 / Virus Database: 261 - Release Date: 2004. 01. 02.
>