Thread: performance question

performance question

From
"Moritz Lennert"
Date:
Hello,

I have a table with some 2.2 million rows on a Pentium4, 1.8GHz with 512
MB RAM.
Some queries I launch take quite a long time, and I'm wondering whether
this is normal, or whether I can get better performance somehow.

As an example, I have a field of type char(2), with 23 different possible
values. When I launch a "select * from table where field = 'xx'", this
takes a very long time (some 230194.10 msec, i.e. almost 4 minutes,
according to "explain analyze"). I've tried indexing on this column and
vacuuming ("vacuum analyze"), but this doesn't change anything; "explain
select" always gives me a sequential scan. The correlation value in
pg_stats for this column is 0.0583268, so a seq scan is probably cheaper
than an index scan.
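
For reference, I read that statistic with a query along these lines
("table" and "field" stand in for the real names):

select n_distinct, correlation
  from pg_stats
 where tablename = 'table' and attname = 'field';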

Now maybe this is just the way it is and there is no way to improve
performance, but if someone has tips on what I might try, they are very
welcome!
One question I asked myself is whether the use of char(2) is the best
option. The column (and most others in the table) contains codes that
designate different characteristics (for ex. in a column 'sex' one would
find '1'=male, '2'=female).

Moritz

Re: performance question

From
Manfred Koizar
Date:
On Mon, 20 Jan 2003 12:40:34 +0100 (CET), "Moritz Lennert"
<mlennert@club.worldonline.be> wrote:
>I have a table with some 2.2 million rows on a Pentium4, 1.8GHz with 512
>MB RAM.
>Some queries I launch take quite a long time, and I'm wondering whether
>this is normal, or whether I can get better performance somehow.

Moritz, we need more information.  Please show us
 . your PG version
 . CREATE TABLE ...
 . indices
 . your query
 . EXPLAIN ANALYZE output
 . your settings, especially shared_buffers, sort_mem,
   random_page_cost, effective_cache_size

>One question I asked myself is whether the use of char(2) is the best
>option. The column (and most others in the table) contains codes that
>designate different characteristics (for ex. in a column 'sex' one would
>find '1'=male, '2'=female).

char(2) needs 8 bytes, smallint only 2 bytes (unless followed by a
column with 4 or 8 byte alignment).  Instead of char(1) (8 bytes) you
might want to use the Postgres specific type "char" (with the double
quotes!) needing only 1 byte.
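
For example, just a sketch with made-up names:

create table example (
    sex  "char",    -- '1'/'2' stored in a single byte
    code smallint   -- numeric code, 2 bytes
);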

Servus
 Manfred


Re: performance question

From
"Moritz Lennert"
Date:
> On Mon, 20 Jan 2003 12:40:34 +0100 (CET), "Moritz Lennert"
> <mlennert@club.worldonline.be> wrote:
>>I have a table with some 2.2 million rows on a Pentium4, 1.8GHz with 512
>>MB RAM.
>>Some queries I launch take quite a long time, and I'm wondering whether
>>this is normal, or whether I can get better performance somehow.
>
> Moritz, we need more information.  Please show us
>  . your PG version

 7.2.1-2woody2

>  . CREATE TABLE ...

CREATE TABLE "rec81" (
        "commune_residence" character(5),
        "sexe" character(1),
        "annee_naissance" smallint,
        "mois_naissance" smallint,
        "jour_naissance" smallint,
        "parent" character(2),
        "etat_civil" character(1),
        "nationalite" character(3),
        "type_menage" character(1),
        "logement_depuis_naiss" character(1),
        "domicile_mere" character(6),
        "dans_log_depuis_quand" smallint,
        "meme_log_1980" character(1),
        "commune_1980" character(6),
        "annee_entree_belgique" smallint,
        "age_fin_etude" smallint,
        "detenteur_diplome" character(1),
        "type_diplome" character(2),
        "detenteur_diplome_etranger" character(1),
        "actif" character(1),
        "actif_temporaire" character(1),
        "type_profession" character(4),
        "statut_professionnel" character(1),
        "temps_partiel" character(1),
        "nombre_heures_travail" smallint,
        "employeur" character(1),
        "nombre_personnes_ds_services" integer,
        "direction" character(1),
        "lieu_travail" character(6),
        "secteur_activite" character(3),
        "emploi_complementaire" character(1),
        "type_emploi_complementaire" character(4),
        "lieu_depart_navette" character(1),
        "commune_depart_navette" character(6),
        "distance" smallint,
        "nbre_navettes_par_jour" character(1),
        "nbre_jours_navette_par_semaine" character(1),
        "type_transport_navette" character(3),
        "duree_trajet_navette" character(1),
        "statut_non_occupe" character(2),
        "effectif_menage" smallint,
        "sec_stat_residence" character(6)
);

>  . indices

CREATE INDEX rec81_commune_residence_idx ON rec81 USING btree
(commune_residence);
CREATE INDEX rec81_annee_naissance_idx ON rec81 USING btree
(annee_naissance);
CREATE INDEX rec81_nationalite_idx ON rec81 USING btree (nationalite);
CREATE INDEX rec81_meme_log_1980_idx ON rec81 USING btree (meme_log_1980);
CREATE INDEX rec81_commune_1980_idx ON rec81 USING btree (commune_1980);
CREATE INDEX rec81_age_fin_etude_idx ON rec81 USING btree (age_fin_etude);
CREATE INDEX rec81_detenteur_diplome_idx ON rec81 USING btree
(detenteur_diplome);
CREATE INDEX rec81_type_profession_idx ON rec81 USING btree
(type_profession);
CREATE INDEX rec81_statut_professionnel_idx ON rec81 USING btree
(statut_professionnel);
CREATE INDEX rec81_lieu_travail_idx ON rec81 USING btree (lieu_travail);
CREATE INDEX rec81_secteur_activite_idx ON rec81 USING btree
(secteur_activite);
CREATE INDEX rec81_statut_non_occupe_idx ON rec81 USING btree
(statut_non_occupe);
CREATE INDEX rec81_sec_stat_residence_idx ON rec81 USING btree
(sec_stat_residence);
CREATE INDEX rec81_comres_typedipl_idx ON rec81 USING btree
(commune_residence, type_diplome);
CREATE INDEX rec81_type_diplome_idx ON rec81 USING btree (type_diplome);

>  . your query

select commune_residence, type_diplome from rec81 where type_diplome = '11';

>  . EXPLAIN ANALYZE output

explain analyze select commune_residence, type_diplome from rec81 where
type_diplome = '11';
NOTICE:  QUERY PLAN:

Seq Scan on rec81  (cost=0.00..120316.30 rows=177698 width=15) (actual
time=23.03..219164.82 rows=176621 loops=1)
Total runtime: 226149.03 msec

EXPLAIN
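
As an aside, I suppose I could double-check the planner's choice by
disabling sequential scans for a single session and comparing timings
(an experiment only, not a setting to keep):

set enable_seqscan to off;
explain analyze select commune_residence, type_diplome from rec81
 where type_diplome = '11';
set enable_seqscan to on;  -- restore the default afterwards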

>  . your settings, especially shared_buffers, sort_mem,
> random_page_cost, effective_cache_size


shared_buffers = 128

The others are not set (I suppose they should be set in the
postgresql.conf file?)


Thanks for your help!

Moritz

Re: performance question

From
"Moritz Lennert"
Date:
> On Mon, 20 Jan 2003 16:42:12 +0100 (CET), "Moritz Lennert"
> <mlennert@club.worldonline.be> wrote:
>>explain analyze select commune_residence, type_diplome from rec81 where
>>type_diplome = '11';
>>NOTICE:  QUERY PLAN:
>>
>>Seq Scan on rec81  (cost=0.00..120316.30 rows=177698 width=15) (actual
>>time=23.03..219164.82 rows=176621 loops=1)
>>Total runtime: 226149.03 msec
>
> Assuming that there are no NULLs, I guesstimate that 25 tuples should
> fit onto one page (could be more than 40 if you use smallint and
> "char").  That gives 88K pages for 2.2M tuples.  However,
> cost=0.00..120316.30  tells us that there are almost 120K pages.  Time
> to do a VACUUM FULL?
>

I'll try that, although I haven't changed any of the tuples since import
of the data (this is a static table...)
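
For what it's worth, the page and tuple counts the planner sees can be
checked directly:

select relpages, reltuples from pg_class where relname = 'rec81';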

> From what I've seen I think that the planner is right to choose a seq
> scan.  226 seconds for reading 120K pages (~ 1GB) is not very
> impressive, though.  What kind of disk do you have?

IDE, Samsung, 7200rpm

>  Is your disk heavily fragmented?

It shouldn't be.

> Did you enable DMA?

No, should I?

>  What else was running on
> your machine while you did that VACUUM ANALYZE?

Mozilla, maybe xterm with vi...

>>shared_buffers = 128
>
> It won't help much for this query we are discussing, but I would
> recommend setting shared_buffers to something in the range [1000,
> 4000].
>
> And one of my favorites: effective_cache_size = 40000
>

I will have to increase /proc/sys/kernel/shmmax for that, right?

Thanks again!

Moritz

Re: performance question

From
"Josh Berkus"
Date:
Moritz,

There is a performance mailing list at:

pgsql-performance@postgresql.org

--Josh


Re: performance question

From
Manfred Koizar
Date:
On Mon, 20 Jan 2003 18:10:24 +0100 (CET), "Moritz Lennert"
<mlennert@club.worldonline.be> wrote:
>I'll try that, although I haven't changed any of the tuples since import
>of the data (this is a static table...)

Then I must have miscalculated something :-(  What does VACUUM VERBOSE
ANALYZE <yourtable> say?

>> From what I've seen I think that the planner is right to choose a seq
>> scan.  226 seconds for reading 120K pages (~ 1GB) is not very
>> impressive, though.  What kind of disk do you have?
>
>IDE, Samsung, 7200rpm
>
>>  Is your disk heavily fragmented?
>
>It shouldn't be.
>
>> Did you enable DMA?
>
>No, should I ?

Yes.  Here is what I got on a P IV 2 GHz with a Seagate 7200rpm(?)
disk:

~ # hdparm -t -T /dev/hda

/dev/hda:
 Timing buffer-cache reads:   128 MB in  0.39 seconds = 328.21 MB/sec
 Timing buffered disk reads:   64 MB in  2.49 seconds =  25.70 MB/sec

vs.

~ # hdparm -t -T /dev/hda

/dev/hda:
 Timing buffer-cache reads:   128 MB in  0.37 seconds = 345.95 MB/sec
 Timing buffered disk reads:   64 MB in 23.38 seconds =   2.74 MB/sec


~ # l xx
-rw-r--r--    1 root     root     1332104434 2003-01-20 19:04 xx
~ # time dd if=xx of=/dev/null bs=8k
162610+1 Records in
162610+1 Records out

real    0m48.665s
user    0m0.150s
sys     0m1.690s
~ # hdparm -d 0 /dev/hda
~ # time dd if=xx of=/dev/null bs=8k
162610+1 Records in
162610+1 Records out

real    7m42.666s
user    0m0.270s
sys     1m27.160s


With DMA: More than 3000 pages / second
Without DMA:       ~ 350 pages / second

Your throughput:   ~ 530 pages / second
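
(That is 120316 pages in about 226 seconds, i.e. ~530 pages per second,
or roughly 4 MB/s at 8 KB per page, well below the DMA figure above.)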

>> recommend setting shared_buffers to something in the range [1000,
>> 4000].
>> And one of my favorites: effective_cache_size = 40000
>
>I will have to increase /proc/sys/kernel/shmmax for that, or ?

Maybe for shared_buffers.  Certainly not for effective_cache_size.
The latter does not consume resources, it's just a hint for the
planner.
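
For example, in postgresql.conf (both settings are counted in 8K pages,
so this is about 16 MB of shared buffers and a cache hint of roughly
312 MB; shared_buffers only takes effect after a restart):

shared_buffers = 2048
effective_cache_size = 40000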

Servus
 Manfred


Re: performance question

From
"Moritz Lennert"
Date:
> Moritz,
>
> There is a performance mailing list at:
>
> pgsql-performance@postgresql.org
>
> --Josh

I'm sorry, I didn't realize this; it is certainly closer to what I need.
Next time I'll look more carefully (Google brought up references to this
list, so I subscribed here).

Thanks for the hint,

Moritz

Re: performance question

From
Tom Lane
Date:
"Moritz Lennert" <mlennert@club.worldonline.be> writes:
> One question I asked myself is whether the use of char(2) is the best
> option.

It may not be, particularly if you are running in a non-English locale.
Comparisons will invoke the standard library routine strcoll(), which is
depressingly slow in some locales, at least on some platforms.

> The column (and most others in the table) contains codes that
> designate different characteristics (for ex. in a column 'sex' one would
> find '1'=male, '2'=female).

If you are using numeric codes you are surely better off storing them
as integer or perhaps smallint (integer is less likely to create type
mismatch headaches, though).  In the above example you are getting the
worst of both worlds: there's no mnemonic value in your data entries,
*and* you're paying for textual comparisons.
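
For illustration, a sketch of such a conversion (column names are from the
schema posted earlier, assuming those codes are all numeric; on older
versions the cast may need an intermediate text step, e.g.
sexe::text::smallint):

create table rec81_new as
  select commune_residence,
         sexe::smallint as sexe,                  -- '1'/'2' become 1/2
         type_diplome::smallint as type_diplome   -- '11' becomes 11
    from rec81;
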
        regards, tom lane


Re: performance question

From
Tomasz Myrta
Date:
Moritz Lennert wrote:

>Hello,
>
>I have a table with some 2.2 million rows on a Pentium4, 1.8GHz with 512
>MB RAM.
>Some queries I launch take quite a long time, and I'm wondering whether
>this is normal, or whether I can get better performance somehow.
>
>As an example, I have a field of type char(2), with 23 different
>possible values.

This field is not very selective. Are there any other fields that are
always in the WHERE condition?
If so, indexing on more than one field should help; see the sketch below.
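
Your schema already has such an index (rec81_comres_typedipl_idx); a query
restricting on both columns could use it, for example (the commune code
here is made up):

select commune_residence, type_diplome
  from rec81
 where commune_residence = '21004'   -- hypothetical code
   and type_diplome = '11';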

Tomasz Myrta



Re: performance question

From
"Josh Berkus"
Date:
Moritz,

> I'm sorry, I didn't realize this, it is certainly closer to what I
> need.
> Next time I'll look better (google brought up references to this
> list, so
> I subscribed here).

Hey, there are 18 active lists now ... we don't expect anyone to get
the right one right off!

-Josh Berkus


Re: performance question

From
Vernon Wu
Date:
20/01/2003 12:38:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>"Moritz Lennert" <mlennert@club.worldonline.be> writes:
>> One question I asked myself is whether the use of char(2) is the best
>> option.
>
>It may not be, particularly if you are running in a non-English locale.
>Comparisons will invoke the standard library routine strcoll(), which is
>depressingly slow in some locales, at least on some platforms.
>


In the case of selection operations involving multiple tables joined on a
userid that can be in various languages, is that a potential performance
trouble spot? Consider that the number of tables is anywhere from ten to
twenty.



>> The column (and most others in the table) contains codes that
>> designate different characteristics (for ex. in a column 'sex' one would
>> find '1'=male, '2'=female).
>
>If you are using numeric codes you are surely better off storing them
>as integer or perhaps smallint (integer is less likely to create type
>mismatch headaches, though).  In the above example you are getting the
>worst of both worlds: there's no mnemonic value in your data entries,
>*and* you're paying for textual comparisons.
>
>            regards, tom lane