Thread: Re: [GENERAL] Tuning queries on large database

Re: [GENERAL] Tuning queries on large database

From: Valerie Schneider DSI/DEV
Date:
Hi,
I've decreased sort_mem to 5000 instead of 50000.
I recreated my table using integer and real types instead of
numeric: the disk space usage is much improved:

 schema | relfilenode |      table       |   index    |  reltuples  |   size
--------+-------------+------------------+------------+-------------+----------
 public |   253442696 | data             |            | 1.25113e+08 | 29760016
 public |   378639579 | data             | i_data_dat | 1.25113e+08 |  2744400
 public |   378555698 | data             | pk_data    | 1.25113e+08 |  3295584

so it takes about 28 GB instead of 68 GB!
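The rebuild amounted to something like the following (a sketch only: the column list is partial and the types are assumptions -- just the columns that appear in the queries in this thread; the real table has more):

```sql
-- Sketch of the numeric -> integer/real conversion (partial, assumed
-- column list taken from the queries below).
CREATE TABLE data_new (
    num_poste integer   NOT NULL,   -- was numeric(9,0)
    dat       timestamp NOT NULL,
    td        real,
    u         real,
    rr1       real,
    ff        real
    -- ... remaining measurement columns ...
);

INSERT INTO data_new
    SELECT num_poste, dat, td, u, rr1, ff FROM data;

ALTER TABLE data_new ADD CONSTRAINT pk_data PRIMARY KEY (num_poste, dat);
CREATE INDEX i_data_dat ON data_new (dat);
ANALYZE data_new;
```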

For my different queries, performance is better but still slower than Oracle:

      oracle   PG yesterday (numeric)   PG today (integer/real)
Q1     <1s            <1s                      <1s
Q2      3s             8s                       4s
Q3      8s          1m20s                      27s
Q4     28s         17m20s                    6m47s

Result of EXPLAIN ANALYZE :

Q1 :bench=> explain analyze select 'Q1',min(td),max(u) from data where
num_poste=1000 and dat between
(date_trunc('month',to_timestamp('31012004','ddmmyyyy')-interval '2000
days'))::timestamp and
(date_trunc('month',to_timestamp('31012004','ddmmyyyy')-interval '2000 days') +
interval '1 month' - interval '1 hour')::timestamp;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=2466.47..2466.47 rows=1 width=8) (actual time=261.777..261.778
rows=1 loops=1)
   ->  Index Scan using pk_data on data  (cost=0.00..2463.41 rows=611 width=8)
(actual time=20.106..259.924 rows=744 loops=1)
         Index Cond: ((num_poste = 1000) AND (dat >= (date_trunc('month'::text,
(to_timestamp('31012004'::text, 'ddmmyyyy'::text) - '2000
days'::interval)))::timestamp without time zone) AND (dat <=
(((date_trunc('month'::text, (to_timestamp('31012004'::text, 'ddmmyyyy'::text) -
'2000 days'::interval)) + '1 mon'::interval) - '01:00:00'::interval))::timestamp
without time zone))
 Total runtime: 262.145 ms
(4 rows)


Q2 : bench=> explain analyze select 'Q2',count(*) from data where num_poste
between 100 and 100+10 and dat between
(date_trunc('month',to_timestamp('31012004','ddmmyyyy')-interval '3000
days'))::timestamp and
(date_trunc('month',to_timestamp('31012004','ddmmyyyy')-interval '3000 days') +
interval '1 month' - interval '1 hour')::timestamp;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=24777.68..24777.68 rows=1 width=0) (actual
time=4253.977..4253.978 rows=1 loops=1)
   ->  Index Scan using pk_data on data  (cost=0.00..24762.34 rows=6138 width=0)
(actual time=46.602..4244.984 rows=7920 loops=1)
         Index Cond: ((num_poste >= 100) AND (num_poste <= 110) AND (dat >=
(date_trunc('month'::text, (to_timestamp('31012004'::text, 'ddmmyyyy'::text) -
'3000 days'::interval)))::timestamp without time zone) AND (dat <=
(((date_trunc('month'::text, (to_timestamp('31012004'::text, 'ddmmyyyy'::text) -
'3000 days'::interval)) + '1 mon'::interval) - '01:00:00'::interval))::timestamp
without time zone))
 Total runtime: 4254.233 ms
(4 rows)


Q3 : bench=> explain analyze select 'Q3',sum(rr1),count(ff) from data where
num_poste in (50,50+2);
                             QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=963455.87..963455.87 rows=1 width=8) (actual
time=27668.666..27668.667 rows=1 loops=1)
   ->  Index Scan using pk_data, pk_data on data  (cost=0.00..962236.31
rows=243910 width=8) (actual time=16.251..27275.468 rows=250226 loops=1)
         Index Cond: ((num_poste = 50) OR (num_poste = 52))
 Total runtime: 27673.837 ms
(4 rows)


Q4 : bench=> explain analyze select 'Q4',count(*) from data where num_poste
between 600 and 625;
                                     QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=14086174.57..14086174.57 rows=1 width=0) (actual
time=428235.024..428235.025 rows=1 loops=1)
   ->  Index Scan using pk_data on data  (cost=0.00..14076910.99 rows=3705431
width=0) (actual time=45.283..424634.826 rows=3252938 loops=1)
         Index Cond: ((num_poste >= 600) AND (num_poste <= 625))
 Total runtime: 428235.224 ms
(4 rows)

Thanks for everything, Valerie.

>Date: Wed, 4 Aug 2004 21:21:51 +0800 (WST)
>From: Christopher Kings-Lynne <chriskl@familyhealth.com.au>
>To: Valerie Schneider DSI/DEV <Valerie.Schneider@meteo.fr>
>Cc: pgsql-performance@postgresql.org, <pgsql-general@postgresql.org>
>Subject: Re: [GENERAL] [PERFORM] Tuning queries on large database
>
>>     sort_mem = 50000
>
>That is way, way too large.  Try more like 5000 or lower.
>
>>  num_poste  | numeric(9,0)                | not null
>
>For starters, numerics are really, really slow compared to integers.  Why
>aren't you using an integer for this field, since you have 0 decimal
>places?
>
>>  schema | relfilenode |      table       |   index    |  reltuples  |   size
>> --------+-------------+------------------+------------+-------------+----------
>>  public |   125615917 | data             |            | 1.25113e+08 | 72312040
>>  public |   251139049 | data             | i_data_dat | 1.25113e+08 |  2744400
>>  public |   250870177 | data             | pk_data    | 1.25113e+08 |  4395480
>>
>> My first remark is that the table takes a lot of space on disk, about
>> 70 GB, instead of 35 GB with Oracle.
>
>Integers will take a lot less space than numerics.
>
>> The different queries of the bench are "simple" queries (no join,
>> sub-query, ...) and are using indexes (I "explained" each one to
>> be sure) :
>> Q1 select_court : access to about 700 rows  : 1 "num_poste" and 1 month
>>     (using PK : num_poste=p1  and dat between p2 and p3)
>> Q2 select_moy   : access to about 7000 rows : 10 "num_poste" and 1 month
>>     (using PK : num_poste between p1 and p1+10 and dat between p2 and p3)
>> Q3 select_long  : about 250 000 rows        : 2 "num_poste"
>>     (using PK : num_poste in (p1,p1+2))
>> Q4 select_tres_long : about 3 millions rows : 25 "num_poste"
>>     (using PK : num_poste between p1 and p1 + 25)
>>
>> The result is that for "short queries" (Q1 and Q2) it runs in a few
>> seconds on both Oracle and PG. The difference becomes important with
>> Q3 : 8 seconds with oracle
>>      80 sec with PG
>> and too much with Q4 : 28s with oracle
>>                        17m20s with PG !
>>
>> Of course when I run 100 or 1000 parallel queries such as Q3 or Q4,
>> it becomes a disaster !
>
>Please reply with the EXPLAIN ANALYZE output of these queries so we can
>have some idea of how to help you.
>
>Chris
>
>
>



********************************************************************
*   The views expressed here are strictly personal and do not      *
*        engage the responsibility of METEO-FRANCE.                *
********************************************************************
* Valerie SCHNEIDER             Tel : +33 (0)5 61 07 81 91         *
* METEO-FRANCE / DSI/DEV        Fax : +33 (0)5 61 07 81 09         *
* 42, avenue G. Coriolis        Email : Valerie.Schneider@meteo.fr *
* 31057 TOULOUSE Cedex - FRANCE         http://www.meteo.fr        *
********************************************************************


Re: [GENERAL] Tuning queries on large database

From: Mark Kirkwood
Date:
I am guessing that Oracle can satisfy Q4 entirely via index access,
whereas Pg has to visit the table as well.

Having said that, a few partial indexes on data.num_poste (say 10 or
so) may be worth trying out; they won't help the table access, but they
could lower the index cost. If you combine this with loading the data
in num_poste order (or running CLUSTER), you may get closer to Oracle's
time for this query.
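Concretely, something along these lines (a sketch: the slice boundaries below are illustrative only and should be chosen from the actual distribution of num_poste):

```sql
-- Partial indexes, each covering a slice of num_poste
-- (boundaries are illustrative; create ~10 of them).
CREATE INDEX i_data_poste_000 ON data (num_poste, dat)
    WHERE num_poste < 100;
CREATE INDEX i_data_poste_100 ON data (num_poste, dat)
    WHERE num_poste >= 100 AND num_poste < 200;
-- ... and so on ...

-- Physically reorder the heap by the primary key (7.x syntax; this
-- rewrites the whole table and holds an exclusive lock while running):
CLUSTER pk_data ON data;
ANALYZE data;
```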

regards

Mark

Valerie Schneider DSI/DEV wrote:

>For my different queries, performance is better but still slower than Oracle:
>
>    oracle    PG yesterday(numeric)    PG today(integer/real)
>
>Q4    28s    17m20s            6m47s
>
>Q4 : bench=> explain analyze select 'Q4',count(*) from data where num_poste
>between 600 and 625;
>                                     QUERY PLAN
>--------------------------------------------------------------------------------
> Aggregate  (cost=14086174.57..14086174.57 rows=1 width=0) (actual
>time=428235.024..428235.025 rows=1 loops=1)
>   ->  Index Scan using pk_data on data  (cost=0.00..14076910.99 rows=3705431
>width=0) (actual time=45.283..424634.826 rows=3252938 loops=1)
>         Index Cond: ((num_poste >= 600) AND (num_poste <= 625))
> Total runtime: 428235.224 ms
>(4 rows)
>
>Thanks for all, Valerie.

Re: [GENERAL] Tuning queries on large database

From: Rod Taylor
Date:
> so it takes about 28 GB instead of 68 GB!

Huzzah!

> For my different queries, performance is better but still slower than Oracle:

Not surprising. Oracle has a number of optimizations that we don't have
implemented at this point, particularly where aggregates are involved.

One that PG could use, particularly for Q4, is the ability to execute a
selective sequential scan based on a read of the index -- right now it
pulls in actual data from the table structure while following the index,
which creates unnecessary disk-head movement.

The only solution to that, at the moment, is to cluster the table by
pk_data.
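In 7.x syntax that would be (note that CLUSTER rewrites the entire ~28 GB table and takes an exclusive lock for the duration):

```sql
-- Reorder the heap to match the pk_data index order.
CLUSTER pk_data ON data;
-- Refresh planner statistics so the improved physical correlation
-- between index order and heap order is visible to the planner:
ANALYZE data;
```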


I am curious though: could you run the query below on both systems and
report back the times?

        select 'Q4', * from data where num_poste between 600 and 625;

I'm wondering if Oracle is using a shortcut since the count(*) doesn't
actually require the data -- just knowledge of whether a matching row
exists or not.



Re: [GENERAL] Tuning queries on large database

From: Gaetano Mendola
Date:
Valerie Schneider DSI/DEV wrote:

 > [...]
 > For my different queries, performance is better but still slower than Oracle:
 >
 >     oracle    PG yesterday(numeric)    PG today(integer/real)
 > Q1    <1s       <1s              <1s
 > Q2     3s        8s               4s
 > Q3     8s     1m20s              27s
 > Q4    28s    17m20s            6m47s


Are you using the same disks for Oracle and PG?

Could you post your actual postgresql.conf?
Also try mounting your data partition with the noatime option,
and try again.
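For example, on Linux (the device and mountpoint below are placeholders -- adjust to your layout; remounting requires root):

```shell
# Remount the partition holding the PostgreSQL data directory without
# access-time updates (placeholder path).
mount -o remount,noatime /var/lib/pgsql

# To make it permanent, add noatime to the options field in /etc/fstab:
#   /dev/sdb1  /var/lib/pgsql  ext3  defaults,noatime  1 2
```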


Regards
Gaetano Mendola