Thread: Performance question

Performance question

From: "Tille, Andreas"
Hello,

I have ported a database from MS SQL Server to PostgreSQL.  The database has
40 tables and 117 indexes which I defined for the same fields as in MS SQL.
I converted the data using some SQL output from MS SQL server and inserted
it with psql.

Now I started some performance comparisons and did the following statement:

SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall
WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie
ORDER BY Hauptdaten_Fall.MeldeKategorie;

(sorry for the German names used here).

The MS-SQL server returns the result "immediately" - you just don't notice
any delay.  If I run it on the PostgreSQL server it takes 30s on comparable
hardware :-(((.

I really have no idea what causes this big difference in speed, which would
rule out the use of PostgreSQL for our application.  How can I check
whether the indexes I created are really used?  What other reasons could
there be for such a big difference in speed?

I'm using PostgreSQL 7.1.2 on Debian GNU/Linux (testing).

Any help appreciated

        Andreas.

Re: Performance question

From: Herbert Liechti
On Mon, 10 Sep 2001, Tille, Andreas wrote:

> Hello,
>
> I have ported a database from MS SQL Server to PostgreSQL.  The database has
> 40 tables and 117 indexes which I defined for the same fields as in MS SQL.
> I converted the data using some SQL output from MS SQL server and inserted
> it with psql.
>
> Now I started some performance comparisons and did the following statement:
>
> SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
> FROM Hauptdaten_Fall
> WHERE (((Hauptdaten_Fall.IstAktuell)=20))
> GROUP BY Hauptdaten_Fall.MeldeKategorie
> ORDER BY Hauptdaten_Fall.MeldeKategorie;
>
> (sorry for the German names used here).
>
> The MS-SQL server returns the result "immediately" - you just don't notice
> any delay.  If I run it on the PostgreSQL server it takes 30s on comparable
> hardware :-(((.
>
> I really have no idea what causes this big difference in speed, which would
> rule out the use of PostgreSQL for our application.  How can I check
> whether the indexes I created are really used?  What other reasons could
> there be for such a big difference in speed?

Use explain. Explain tells you the query plan of the optimizer.

explain SELECT .....;

Best regards
Herbie
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Herbert Liechti                                  http://www.thinx.ch
ThinX networked business services    Adlergasse 5, CH-4500 Solothurn
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



Re: Performance question

From: "Tille, Andreas"
On Mon, 10 Sep 2001 Herbert.Liechti@thinx.ch wrote:

> Use explain. Explain tells you the query plan of the optimizer.
>
> explain SELECT .....;
Thanks I just found the thread "Index usage question" and tried to make
some profit from it:

explain SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
(((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;

NOTICE:  QUERY PLAN:

Aggregate  (cost=35267.33..36154.62 rows=17746 width=16)
  ->  Group  (cost=35267.33..35710.98 rows=177458 width=16)
        ->  Sort  (cost=35267.33..35267.33 rows=177458 width=16)
              ->  Seq Scan on hauptdaten_fall  (cost=0.00..15024.12 rows=177458 width=16)


Now I tried

Vacuum analyze;

but nothing changed :-(

Aggregate  (cost=35267.33..36154.62 rows=17746 width=16)
  ->  Group  (cost=35267.33..35710.98 rows=177458 width=16)
        ->  Sort  (cost=35267.33..35267.33 rows=177458 width=16)
              ->  Seq Scan on hauptdaten_fall  (cost=0.00..15024.12 rows=177458 width=16)

I have nearly no experience with query optimizing, but the great difference
in speed tells me that something is wrong here.  There were some hints in
the "Index usage question" thread about some fields which might be interpreted
as strings.  Could this be a reason, and how can I check this?

Kind regards

       Andreas.

Re: Performance question

From: Einar Karttunen
On Mon, Sep 10, 2001 at 02:34:25PM +0200, Tille, Andreas wrote:
> On Mon, 10 Sep 2001 Herbert.Liechti@thinx.ch wrote:
>
> > Use explain. Explain tells you the query plan of the optimizer.
> >
> > explain SELECT .....;
> Thanks I just found the thread "Index usage question" and tried to make
> some profit from it:
>
> explain SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
> (((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;
>
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=35267.33..36154.62 rows=17746 width=16)
>   ->  Group  (cost=35267.33..35710.98 rows=177458 width=16)
>         ->  Sort  (cost=35267.33..35267.33 rows=177458 width=16)
>               ->  Seq Scan on hauptdaten_fall  (cost=0.00..15024.12 rows=177458 width=16)
>
>
There must be a problem with your indexes. I tried the following:
temp=# CREATE TABLE Hauptdaten_Fall (
temp(# MeldeKategorie text,
temp(# ID integer,
temp(# IstAktuell integer);
CREATE
temp=# explain SELECT MeldeKategorie,count(ID) FROM Hauptdaten_Fall
temp-#  WHERE IstAktuell=20 GROUP BY MeldeKategorie
temp-#  ORDER BY MeldeKategorie;
NOTICE:  QUERY PLAN:

Aggregate  (cost=22.67..22.72 rows=1 width=16)
  ->  Group  (cost=22.67..22.69 rows=10 width=16)
        ->  Sort  (cost=22.67..22.67 rows=10 width=16)
              ->  Seq Scan on hauptdaten_fall  (cost=0.00..22.50 rows=10 width=16)

EXPLAIN
temp=# CREATE INDEX hfia ON Hauptdaten_Fall (IstAktuell);
CREATE
temp=# explain SELECT MeldeKategorie,count(ID) FROM Hauptdaten_Fall
temp-#  WHERE IstAktuell=20 GROUP BY MeldeKategorie
temp-#  ORDER BY MeldeKategorie;
NOTICE:  QUERY PLAN:

Aggregate  (cost=8.30..8.35 rows=1 width=16)
  ->  Group  (cost=8.30..8.33 rows=10 width=16)
        ->  Sort  (cost=8.30..8.30 rows=10 width=16)
              ->  Index Scan using hfia on hauptdaten_fall  (cost=0.00..8.14 rows=10 width=16)

EXPLAIN
temp=#

This shows quite clearly that an index scan will improve the situation
drastically, even more so for you, because your table seems to have a great
many rows in it.

- Einar Karttunen

Re: Performance question

From: "Roderick A. Anderson"
On Mon, 10 Sep 2001, Tille, Andreas wrote:

> Hello,
>
> Now I started some performance comparisons and did the following statement:

> The MS-SQL server returns the result "immediately" - you just don't notice
> any delay.  If I run it on the PostgreSQL server it takes 30s on comparable
> hardware :-(((.

Did you do a VACUUM ANALYZE after loading the data?
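(If not, something along these lines from psql should do it - VACUUM ANALYZE
updates the row counts and value statistics the planner relies on; the table
name is taken from your mail:)

  VACUUM ANALYZE Hauptdaten_Fall;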



Rod
--
                      A small clue and no money
                      will get you further than
                      lots of money and no clue.


Re: Performance question

From: Stephan Szabo
On Mon, 10 Sep 2001, Tille, Andreas wrote:

> On Mon, 10 Sep 2001 Herbert.Liechti@thinx.ch wrote:
>
> > Use explain. Explain tells you the query plan of the optimizer.
> >
> > explain SELECT .....;
> Thanks I just found the thread "Index usage question" and tried to make
> some profit from it:
>
> explain SELECT Hauptdaten_Fall.MeldeKategorie,
> Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
> (((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY
> Hauptdaten_Fall.MeldeKategorie ORDER BY
> Hauptdaten_Fall.MeldeKategorie;
>
> Aggregate  (cost=35267.33..36154.62 rows=17746 width=16)
>   ->  Group  (cost=35267.33..35710.98 rows=177458 width=16)
>         ->  Sort  (cost=35267.33..35267.33 rows=177458 width=16)
>               ->  Seq Scan on hauptdaten_fall  (cost=0.00..15024.12 rows=177458 width=16)
>
> I have nearly no experience with query optimizing, but the great difference
> in speed tells me that something is wrong here.  There were some hints in
> the "Index usage question" thread about some fields which might be interpreted
> as strings.  Could this be a reason, and how can I check this?

What's the schema for the table? How many rows are in the table?  How many
rows actually have IstAktuell=20 (is 177458 a reasonable estimate?).  If
not, is there a common, non-NULL value that is much more common than other
values?
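
For example, the output of psql's \d should be enough here - it shows the
column types and the indexes defined on the table:

  \d hauptdaten_fall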



Re: Performance question

From: "Tille, Andreas"
On Mon, 10 Sep 2001, Stephan Szabo wrote:

> > explain SELECT Hauptdaten_Fall.MeldeKategorie,
> > Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
> > (((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY
> > Hauptdaten_Fall.MeldeKategorie ORDER BY
> > Hauptdaten_Fall.MeldeKategorie;
> >
> > Aggregate  (cost=35267.33..36154.62 rows=17746 width=16)
> >   ->  Group  (cost=35267.33..35710.98 rows=177458 width=16)
> >         ->  Sort  (cost=35267.33..35267.33 rows=177458 width=16)
> >               ->  Seq Scan on hauptdaten_fall  (cost=0.00..15024.12 rows=177458 width=16)
> >
> > I have nearly no experience with query optimizing, but the great difference
> > in speed tells me that something is wrong here.  There were some hints in
> > the "Index usage question" thread about some fields which might be interpreted
> > as strings.  Could this be a reason, and how can I check this?
>
> What's the schema for the table?

CREATE SEQUENCE SeqHauptdaten_Fall;
CREATE TABLE Hauptdaten_Fall (
    ID int DEFAULT nextval('SeqHauptdaten_Fall') ,
    InterneRef int NOT NULL ,
    Zeitstempel datetime NOT NULL ,
    ZeitStempelRKI datetime NOT NULL ,
    IstAktuell int NOT NULL ,
    IstOffen int NOT NULL ,
    IstFallDef int NOT NULL ,
    Wichtigkeit int NOT NULL ,
    ImportTyp int NOT NULL ,
    ImportStatus int NOT NULL ,
    ImportFile int NOT NULL ,
    FehlerZahl int NOT NULL ,
    ExportTyp int NOT NULL ,
    ExportStatus int NOT NULL ,
    ExportFile int NOT NULL ,
    QuittungsText text NULL ,
    Anmerkung text NULL ,
    SaveIstAktuell int NOT NULL ,
    SaveIstOffen int NOT NULL ,
    SaveExportStatus int NOT NULL ,
    FallkennungStelle varchar (15) NOT NULL ,
    FallkennungAktenzeichen varchar (50) NOT NULL ,
    FallVersion int NOT NULL ,
    MeldekennungStelle varchar (15) NULL ,
    MeldekennungAktenzeichen varchar (50) NULL ,
    MeldeKategorie varchar (10) NULL ,
    MeldeSoftware varchar (50) NULL ,
    MeldeZuordnung varchar (50) NULL ,
    Landkreis int NOT NULL ,
    MeldeJahr int NOT NULL ,
    MeldeWoche int NOT NULL ,
    PersonGeburtsJahr int NOT NULL ,
    PersonGeburtsMonat int NOT NULL ,
    GeburtstagTheoretisch datetime NULL ,
    AlterTheoretisch int NOT NULL ,
    PersonGeschlecht int NOT NULL ,
    PersonWohnsitz int NOT NULL ,
    PersonAufenthalt int NOT NULL ,
    Spende int NOT NULL ,
    ErkranktStatus int NOT NULL ,
    ErkranktDatumVon datetime NULL ,
    ErkranktDatumBis datetime NULL ,
    ErregerStatus int NOT NULL ,
    EpidemiologieStatus int NOT NULL ,
    KlinikAufenthaltStatus int NOT NULL ,
    KlinikAufenthaltDatumVon datetime NULL ,
    KlinikAufenthaltDatumBis datetime NULL ,
    KlinikAdresse int NOT NULL ,
    VerstorbenStatus int NOT NULL ,
    VerstorbenDatumVon datetime NULL ,
    VerstorbenDatumBis datetime NULL ,
    InfektionsOrt varchar (50) NULL ,
    InfektionsOKZ varchar (50) NULL ,
    InfektionsInfo text NULL ,
    HerdkennungStelle varchar (15) NULL ,
    HerdkennungAktenzeichen varchar (50) NULL ,
    MeldeDatum datetime NULL ,
    AbsenderTyp int NOT NULL ,
    Absender int NOT NULL ,
    Mediziner int NOT NULL ,
    Labor int NOT NULL ,
    WirdBetreut int NOT NULL ,
    Betreuungseinrichtung int NOT NULL ,
    IstBetreuer int NOT NULL ,
    BetreuerArbeitsstelle int NOT NULL ,
    Lebensmittel int NOT NULL ,
    LebensmittelBetrieb int NOT NULL ,
    ts timestamp NULL ,
    CONSTRAINT PK_Fall_Hauptdaten PRIMARY KEY (FallkennungStelle,FallkennungAktenzeichen,FallVersion)
) ;

CREATE   INDEX IX_IstAktuellKategorie          ON Hauptdaten_Fall(IstAktuell, MeldeKategorie);
CREATE UNIQUE INDEX IX_ID_Hauptdaten_Fall     ON Hauptdaten_Fall(ID);
CREATE  INDEX IX_MeldeJahr ON Hauptdaten_Fall(MeldeJahr);
CREATE  INDEX IX_MeldeWoche ON Hauptdaten_Fall(MeldeWoche);
CREATE  INDEX IX_Landkreis ON Hauptdaten_Fall(Landkreis);
CREATE  INDEX IX_ErkranktStatus ON Hauptdaten_Fall(ErkranktStatus);
CREATE  INDEX IX_ErregerStatus ON Hauptdaten_Fall(ErregerStatus);
CREATE  INDEX IX_EpidemiologieStatus ON Hauptdaten_Fall(EpidemiologieStatus);
CREATE  INDEX IX_KlinikAufenthaltStatus ON Hauptdaten_Fall(KlinikAufenthaltStatus);
CREATE  INDEX IX_VerstorbenStatus ON Hauptdaten_Fall(VerstorbenStatus);
CREATE  INDEX IX_ImportStatus ON Hauptdaten_Fall(ImportStatus);
CREATE  INDEX IX_ExportStatus ON Hauptdaten_Fall(ExportStatus);
CREATE  INDEX IX_ImportFile ON Hauptdaten_Fall(ImportFile);
CREATE  INDEX IX_ExportFile ON Hauptdaten_Fall(ExportFile);
CREATE  INDEX IX_Herd ON Hauptdaten_Fall(HerdkennungStelle, HerdkennungAktenzeichen);
CREATE  INDEX IX_ImportTyp ON Hauptdaten_Fall(ImportTyp);
CREATE  INDEX IX_ExportTyp ON Hauptdaten_Fall(ExportTyp);
CREATE  INDEX IX_MeldeKategorie_Hauptdaten_Fa ON Hauptdaten_Fall(MeldeKategorie);
CREATE  INDEX IX_IstFallDef ON Hauptdaten_Fall(IstFallDef);
CREATE  INDEX IX_SaveIstAktuell ON Hauptdaten_Fall(SaveIstAktuell);
CREATE  INDEX IX_FallVersion ON Hauptdaten_Fall(FallVersion);
CREATE  INDEX IX_InterneRef_Hauptdaten_Fall ON Hauptdaten_Fall(InterneRef);
CREATE  INDEX IX_IstAktuell_Hauptdaten_Fall ON Hauptdaten_Fall(IstAktuell);

> How many rows are in the table?
# select count(*) from Hauptdaten_Fall ;
 count
--------
 257530
(1 row)

> How many
> rows actually have IstAktuell=20 (is 177458 a reasonable estimate?).
Yes.
ifsg=# select count(*) from Hauptdaten_Fall WHERE IstAktuell=20;
 count
--------
 177458
(1 row)

Moreover I tried explain with:

set enable_seqscan = off;
explain SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
(((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;
NOTICE:  QUERY PLAN:

Aggregate  (cost=0.00..146770.97 rows=17746 width=16)
  ->  Group  (cost=0.00..146327.32 rows=177458 width=16)
        ->  Index Scan using ix_meldekategorie_hauptdaten_fa on hauptdaten_fall  (cost=0.00..145883.68 rows=177458
width=16)

I wonder why the index IX_IstAktuell_Hauptdaten_Fall on IstAktuell is not
used, and moreover why the query now takes 127s with enable_seqscan = off
against 32s with the default setting.

Kind regards

        Andreas.

Re: Performance question

From: "Tille, Andreas"
On Mon, 10 Sep 2001, Einar Karttunen wrote:

> ...
> temp=# CREATE INDEX hfia ON Hauptdaten_Fall (IstAktuell);
> CREATE
> temp=# explain SELECT MeldeKategorie,count(ID) FROM Hauptdaten_Fall
> temp-#  WHERE IstAktuell=20 GROUP BY MeldeKategorie
> temp-#  ORDER BY MeldeKategorie;
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=8.30..8.35 rows=1 width=16)
>   ->  Group  (cost=8.30..8.33 rows=10 width=16)
>         ->  Sort  (cost=8.30..8.30 rows=10 width=16)
>               ->  Index Scan using hfia on hauptdaten_fall  (cost=0.00..8.14 rows=10 width=16)
>
Thanks for the hint, but the index already exists and is not used (see my
previous mail with a detailed description).

Kind regards

       Andreas.

Re: Performance question

From: Stephan Szabo
> > How many rows are in the table?
> # select count(*) from Hauptdaten_Fall ;
>  count
> --------
>  257530
> (1 row)
>
> > How many
> > rows actually have IstAktuell=20 (is 177458 a reasonable estimate?).
> Yes.
> ifsg=# select count(*) from Hauptdaten_Fall WHERE IstAktuell=20;
>  count
> --------
>  177458
> (1 row)
>
> Moreover I tried explain with:
>
> set enable_seqscan = off;
> explain SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
> (((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=0.00..146770.97 rows=17746 width=16)
>   ->  Group  (cost=0.00..146327.32 rows=177458 width=16)
>         ->  Index Scan using ix_meldekategorie_hauptdaten_fa on hauptdaten_fall  (cost=0.00..145883.68 rows=177458
width=16)
>
> I wonder why the index IX_IstAktuell_Hauptdaten_Fall on IstAktuell is not
> used, and moreover why the query now takes 127s with enable_seqscan = off
> against 32s with the default setting.

Well, the index isn't used because it estimates (apparently
correctly) that not using it is cheaper.  Because the information about
whether a row is valid is kept in the heap, for each index hit, the
heap needs to be read to see if the row is visible.  This results in
jumping about the heap file with seeks and such plus the index
search itself.  When most of the rows are going to be returned, the
sequence scan will generally be cheaper.

A lot of the real time may be spent in the sort step.  You may want
to raise the amount of memory used for sorting and see if that helps.
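
As a quick per-session experiment, something like this should work (the value
is in kilobytes and purely illustrative):

  -- give this session more sort memory, then re-check the plan and timing
  SET sort_mem = 4096;
  EXPLAIN SELECT ...;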




Re: Performance question

From: "Tille, Andreas"
On Tue, 11 Sep 2001, Stephan Szabo wrote:

> Well, the index isn't used because it estimates (apparently
> correctly) that not using it is cheaper.  Because the information about
> whether a row is valid is kept in the heap, for each index hit, the
> heap needs to be read to see if the row is visible.  This results in
> jumping about the heap file with seeks and such plus the index
> search itself.  When most of the rows are going to be returned, the
> sequence scan will generally be cheaper.
It would drive me crazy if this "cheaper" solution took more than
30 seconds in PostgreSQL while the MS-SQL server solves the same task in less
than a second.  This would really destroy all my plans for PostgreSQL
if it were true (which I can't really imagine).

> A lot of the real time may be spent in the sort step.  You may want
> to raise the amount of memory used for sorting and see if that helps.
How do I raise this memory amount?
This is more or less a theoretical question, because sorting about
150 items can't take 30s.  The result set is:


MeldeKategorie    Anz
ADV    142
BOR    1
BRU    10
CAM    34965
CHL    45
CJK    61
CLO    6
COX    237
CRY    645
ECH    269
ECO    3030
EHC    750
FRT    1
FSV    160
GIL    3013
HAV    1495
HBV    3352
HCV    7710
HDV    18
HEV    33
HFA    24
HIN    66
HIV    3628
HTV    132
HXV    3
INV    2400
LEG    203
LEP    17
LIS    167
MSV    5164
MYL    2
MYT    4650
NEI    576
NWV    6566
PLA    692
POV    4
RBV    1
RIC    2
RTV    42862
RUB    1
SAL    47829
SHI    917
SPA    33
STY    60
TOX    26
TRI    6
TRP    674
VCH    2
YEN    4810


There is not much to sort here, though.

Kind regards

        Andreas.

Re: Performance question

From: Stephan Szabo
On Wed, 12 Sep 2001, Tille, Andreas wrote:

> On Tue, 11 Sep 2001, Stephan Szabo wrote:
>
> > Well, the index isn't used because it estimates (apparently
> > correctly) that not using it is cheaper.  Because the information about
> > whether a row is valid is kept in the heap, for each index hit, the
> > heap needs to be read to see if the row is visible.  This results in
> > jumping about the heap file with seeks and such plus the index
> > search itself.  When most of the rows are going to be returned, the
> > sequence scan will generally be cheaper.
> It would drive me crazy if this "cheaper" solution took more than
> 30 seconds in PostgreSQL while the MS-SQL server solves the same task in less
> than a second.  This would really destroy all my plans for PostgreSQL
> if it were true (which I can't really imagine).
>
> > A lot of the real time may be spent in the sort step.  You may want
> > to raise the amount of memory used for sorting and see if that helps.
> How do I raise this memory amount?
Probably the best way is to set sort_mem in postgresql.conf in your data
directory and restart the server.  It may take multiple iterations
to find a good number (and you may need to raise your system's shared memory
limits if you run into them, but that differs from system to system).
For real-world testing, you probably also want to raise the shared_buffers
value.
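
Roughly along these lines in postgresql.conf (the numbers are only
illustrative, not recommendations):

  # postgresql.conf - restart the postmaster after changing these
  sort_mem = 4096          # kilobytes available per sort step
  shared_buffers = 2048    # 8 kB pages; big values may need a larger kernel SHMMAX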

> This is more or less a theoretical question, because sorting about
> 150 items can't take 30s.  The result set is:
It's sorting *before* doing the grouping, so it's the 170000 (or
whatever) rows being sorted by the GROUP BY columns.  If I guess
correctly (without looking at the code), the GROUP BY makes one pass
across the sorted rows, breaking the groups when it sees the grouped
column values change, so it wants the rows in sorted order.


Re: Performance question

From: Herbert Liechti
On Mon, 10 Sep 2001, Tille, Andreas wrote:

> On Mon, 10 Sep 2001 Herbert.Liechti@thinx.ch wrote:
>
> > Use explain. Explain tells you the query plan of the optimizer.
> >
> > explain SELECT .....;
> Thanks I just found the thread "Index usage question" and tried to make
> some profit from it:
>
> explain SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
> (((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;
>
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=35267.33..36154.62 rows=17746 width=16)
>   ->  Group  (cost=35267.33..35710.98 rows=177458 width=16)
>         ->  Sort  (cost=35267.33..35267.33 rows=177458 width=16)
>               ->  Seq Scan on hauptdaten_fall  (cost=0.00..15024.12 rows=177458 width=16)

I assume that an index on the field IstAktuell avoids the sequential scan.
If an index is there try the following statement before you submit the
query:

set enable_seqscan = off;

Regards Herbie
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Herbert Liechti                                  http://www.thinx.ch
ThinX networked business services    Adlergasse 5, CH-4500 Solothurn
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



Performance question (stripped down the problem)

From: "Tille, Andreas"
Hello once more,

I stripped the database down to one single table with only the relevant
columns.  A pg_dump can be found under

    http://www.foodborne-net.de/~tillea/ifsg/ifsgtest.dump.bz2

I would really appreciate it if you would try the query:


#!/bin/sh
# Time the query; the result and the timing end up in the file "result".
ERG=$0.ptk
START=`date +%s`

echo "SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
(((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;" |
psql ifsgtest > result

i=$((`date +%s` - $START))
echo "Query took ${i} seconds." >> result


(which should just measure the time needed for this task).  It took my
E250 (2x400MHz, 2GB) server 20 seconds, and this is definitely much too
long for our application.

Any hints are greatly appreciated

         Andreas.

Re: Performance question (stripped down the problem)

From: GB Clark II
On Tuesday 18 September 2001 07:06, Tille, Andreas wrote:
-SNIP-
>
> echo "SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS
> Anz FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY
> Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;" |
> psql ifsgtest > result
>
> let i="`date +%s` - $START"
> echo "Query took ${i} seconds." >> result
>
>
> (which should just measure the time needed for this task).  It took my
> E250 (2x400MHz, 2GB) server 20 seconds, and this is definitely much too
> long for our application.
>
> Any hints are greatly appreciated
>
>          Andreas.

OK, I tried the query on two boxes:
#1 800 MHz Duron, IDE drives - 11 secs
#2 800 MHz P3, SCSI drives - 5.35 secs
NOTE: the P3 has a lot larger buffers and stuff....(more mem..:))

What does the function plpgsql_call_handler do?
Could this be where the slowdown is?

If it is minor, then how are your buffers and shared mem set?

I am by NO means an expert, but I'm quite willing to help if I can.

GB

--
GB Clark II             | Roaming FreeBSD Admin
gclarkii@VSServices.COM | General Geek
           CTHULU for President - Why choose the lesser of two evils?

Re: Performance question (stripped down the problem)

From: "Matt Block"
> -----Original Message-----
> From: Tille, Andreas

> echo "SELECT Hauptdaten_Fall.MeldeKategorie, Count
> (Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall
> WHERE (((Hauptdaten_Fall.IstAktuell)=20)) GROUP
> BY Hauptdaten_Fall.MeldeKategorie ORDER BY
> Hauptdaten_Fall.MeldeKategorie;" | psql ifsgtest > result

I've not yet looked at the rest of the problem, but doesn't
ORDER BY make GROUP BY redundant?  Just an extra sort?

  -- Matt


Re: Performance question (stripped down the problem)

From: "Tille, Andreas"
On Tue, 18 Sep 2001, Matt Block wrote:

> > echo "SELECT Hauptdaten_Fall.MeldeKategorie, Count
> > (Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall
> > WHERE (((Hauptdaten_Fall.IstAktuell)=20)) GROUP
> > BY Hauptdaten_Fall.MeldeKategorie ORDER BY
> > Hauptdaten_Fall.MeldeKategorie;" | psql ifsgtest > result
>
> I've not yet looked at the rest of the problem, but doesn't
> ORDER BY make GROUP BY redundant?  Just an extra sort?
No, it doesn't.  I have to run exactly this query.

Kind regards

         Andreas.

Re: Performance question (stripped down the problem)

From: Peter Eisentraut
Tille, Andreas writes:

> SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
> FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY
> Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;

> (which should just measure the time needed for this task).  It took my
> E250 (2x400MHz, 2GB) server 20 seconds, and this is definitely much too
> long for our application.

I loaded this into 7.2 development sources and it ran 35 seconds
wall-clock time on a much smaller machine.  After I ran what in 7.1 would
be VACUUM ANALYZE it took about 22 seconds.  The difference was that it
was using the index on hauptdaten_fall.istaktuell when it shouldn't.
(Try EXPLAIN to see what it does in your case.  If it's still using the
index you might want to force enable_indexscan = off.)

I also got a minuscule speed-up by replacing Count(Hauptdaten_Fall.ID)
with Count(*), which acts differently with respect to NULLs, so it depends on
whether you want to use it.
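
That is, something like this - count(*) counts every row in each group,
whereas count(ID) skips rows where ID is NULL:

  SELECT MeldeKategorie, count(*) AS Anz
    FROM Hauptdaten_Fall
   WHERE IstAktuell = 20
   GROUP BY MeldeKategorie
   ORDER BY MeldeKategorie;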

Besides that, I don't see anything blatantly obvious to speed this up.

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter


Re: Performance question (stripped down the problem)

From: "Tille, Andreas"
On Wed, 19 Sep 2001, Peter Eisentraut wrote:

> > SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
> > FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY
> > Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;
>
> > (which should just measure the time needed for this task).  It took my
> > E250 (2x400MHz, 2GB) server 20 seconds, and this is definitely much too
> > long for our application.
>
> I loaded this into 7.2 development sources and it ran 35 seconds
> wall-clock time on a much smaller machine.  After I ran what in 7.1 would
> be VACUUM ANALYZE it took about 22 seconds.  The difference was that it
> was using the index on hauptdaten_fall.istaktuell when it shouldn't.
> (Try EXPLAIN to see what it does in your case.  If it's still using the
> index you might want to force enable_indexscan = off.)
Well, I tried several things to avoid/enforce index scans.  The problem
remains every time.  We are not talking about 20 or 30 seconds here.
We are talking about "slow" (i.e. > 10 seconds) for PostgreSQL against "fast"
(i.e. it feels immediate, too fast to measure exactly) on the MS-SQL server.

> Besides that, I don't see anything blatantly obvious to speed this up.
This fact is quite hard for me.  I've got some hints about changes in
the table structure (use of an index for MeldeKategorie) and I could perhaps
follow that track, even if I'm rather strongly bound to the given
structure, because the data should be "replicated" from the MS-SQL server to
PostgreSQL.

Moreover, the problem is not this "quite simple" query (compared to other
more complicated ones).  This was just a simple test.  My colleague
sent me another (more practical) query which took him 2.5 seconds when
querying the MS-SQL server from Access (so there is an additional
bottleneck).  PostgreSQL took 92 seconds for the same task.  This is
way too slow and would stop our PostgreSQL efforts completely, which
means I would have to search for OS alternatives (bad) or even use
the MS-SQL server (even worse if I look at the current NT worms).

Perhaps this test could be an example which helps developers to
optimize this kind of query.  I'm afraid the GROUP BY / ORDER BY
combination, which causes expensive sorting, could be the reason for
this slowdown.  Perhaps they can find a more sophisticated sorting
algorithm.  There must be a reason for this order-of-magnitude speed
difference in this special case, and I would be more than happy if I
could help improve PostgreSQL by providing this example (to make
PostgreSQL better in general and fit for my own task).

Kind regards

         Andreas.

Re: Performance question (stripped down the problem)

From: Tom Lane
"Matt Block" <matt@blockdev.net> writes:
> I've not yet looked at the rest of the problem, but doesn't
> ORDER BY make GROUP BY redundant?  Just an extra sort?

No.  In the first place, there's no extra sort: the planner is well
aware that our current GROUP BY implementation produces ordered output.
In the second place, there's no guarantee that GROUP BY will always
produce ordered output in the future --- we are thinking about changing
over to a hash-table-based implementation of grouping.  If you want
ordered output, you should say ORDER BY, not try to outsmart the system.

            regards, tom lane

Re: Performance question (stripped down the problem)

From: "Tille, Andreas"
On Wed, 19 Sep 2001, Tom Lane wrote:

> No.  In the first place, there's no extra sort: the planner is well
> aware that our current GROUP BY implementation produces ordered output.
> In the second place, there's no guarantee that GROUP BY will always
> produce ordered output in the future --- we are thinking about changing
> over to a hash-table-based implementation of grouping.  If you want
> ordered output, you should say ORDER BY, not try to outsmart the system.
And even if I did so - sorting 51 items takes practically no time.  So
the problem must be elsewhere.  I did further tests and got:

  SELECT Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20)) ;

  anz
--------
 177458
(1 row)   ---------> 2 Seconds

... which means this alone takes longer than the MS-SQL server needs with the additional GROUP BY

and

  SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
(((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY Hauptdaten_Fall.MeldeKategorie;

... without ORDER BY, but this doesn't matter for the result set
(well, it is as ordered as it would be with the ORDER BY clause) or for
the time, which stays at 18 seconds (after I increased sort_mem to 2048
it was *slightly* faster - 20 seconds with the default sort_mem).

So the real bottleneck seems to be the GROUP BY.

Is there any chance that this could be changed in future PostgreSQL versions?
This slowness makes the use of PostgreSQL impossible for our application.

Kind regards

         Andreas.