Thread: Performance question
Hello,

I have ported a database from MS SQL Server to PostgreSQL. The database has
40 tables and 117 indexes, which I defined on the same fields as in MS SQL.
I converted the data using some SQL output from MS SQL Server and inserted
it with psql.

Now I started some performance comparisons and ran the following statement:

SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall
WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie
ORDER BY Hauptdaten_Fall.MeldeKategorie;

(Sorry for the German names used here.)

The MS SQL server returns the result "immediately"; you just don't notice
any delay. If I run it on the PostgreSQL server it takes 30s on comparable
hardware :-(((.

I really have no idea what causes this big difference in speed, which would
forbid the use of PostgreSQL for our application. How can I check whether
the indexes I created are really used? What could be other reasons for such
a big difference in speed?

I'm using PostgreSQL 7.1.2 on Debian GNU/Linux (testing).

Any help appreciated,
Andreas
On Mon, 10 Sep 2001, Tille, Andreas wrote:

> Hello,
>
> I have ported a database from MS SQL Server to PostgreSQL. The database has
> 40 tables and 117 indexes, which I defined on the same fields as in MS SQL.
> I converted the data using some SQL output from MS SQL Server and inserted
> it with psql.
>
> Now I started some performance comparisons and ran the following statement:
>
> SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
> FROM Hauptdaten_Fall
> WHERE (((Hauptdaten_Fall.IstAktuell)=20))
> GROUP BY Hauptdaten_Fall.MeldeKategorie
> ORDER BY Hauptdaten_Fall.MeldeKategorie;
>
> (Sorry for the German names used here.)
>
> The MS SQL server returns the result "immediately"; you just don't notice
> any delay. If I run it on the PostgreSQL server it takes 30s on comparable
> hardware :-(((.
>
> I really have no idea what causes this big difference in speed, which would
> forbid the use of PostgreSQL for our application. How can I check whether
> the indexes I created are really used? What could be other reasons for such
> a big difference in speed?

Use EXPLAIN. EXPLAIN tells you the query plan of the optimizer:

    explain SELECT .....;

Best regards
Herbie
-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Herbert Liechti                     http://www.thinx.ch
ThinX networked business services   Adlergasse 5, CH-4500 Solothurn
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On Mon, 10 Sep 2001 Herbert.Liechti@thinx.ch wrote:

> Use EXPLAIN. EXPLAIN tells you the query plan of the optimizer.
>
> explain SELECT .....;

Thanks, I just found the thread "Index usage question" and tried to profit
from it:

explain SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall
WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie
ORDER BY Hauptdaten_Fall.MeldeKategorie;

NOTICE:  QUERY PLAN:

Aggregate  (cost=35267.33..36154.62 rows=17746 width=16)
  ->  Group  (cost=35267.33..35710.98 rows=177458 width=16)
        ->  Sort  (cost=35267.33..35267.33 rows=177458 width=16)
              ->  Seq Scan on hauptdaten_fall  (cost=0.00..15024.12 rows=177458 width=16)

Now I tried VACUUM ANALYZE, but nothing changed :-(

Aggregate  (cost=35267.33..36154.62 rows=17746 width=16)
  ->  Group  (cost=35267.33..35710.98 rows=177458 width=16)
        ->  Sort  (cost=35267.33..35267.33 rows=177458 width=16)
              ->  Seq Scan on hauptdaten_fall  (cost=0.00..15024.12 rows=177458 width=16)

I have nearly no experience with query optimizing, but the great difference
in speed tells me that something is wrong here. There were some hints in the
"Index usage question" thread about some fields which might be interpreted
as strings. Could this be a reason, and how can I check this?

Kind regards
Andreas
On Mon, Sep 10, 2001 at 02:34:25PM +0200, Tille, Andreas wrote:

> On Mon, 10 Sep 2001 Herbert.Liechti@thinx.ch wrote:
>
> > Use EXPLAIN. EXPLAIN tells you the query plan of the optimizer.
> >
> > explain SELECT .....;
>
> Thanks, I just found the thread "Index usage question" and tried to profit
> from it:
>
> explain SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
> FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
> GROUP BY Hauptdaten_Fall.MeldeKategorie
> ORDER BY Hauptdaten_Fall.MeldeKategorie;
>
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=35267.33..36154.62 rows=17746 width=16)
>   ->  Group  (cost=35267.33..35710.98 rows=177458 width=16)
>         ->  Sort  (cost=35267.33..35267.33 rows=177458 width=16)
>               ->  Seq Scan on hauptdaten_fall  (cost=0.00..15024.12 rows=177458 width=16)

There must be a problem with your indexes. I tried the following:

temp=# CREATE TABLE Hauptdaten_Fall (
temp(#     MeldeKategorie text,
temp(#     ID integer,
temp(#     IstAktuell integer);
CREATE
temp=# explain SELECT MeldeKategorie, count(ID) FROM Hauptdaten_Fall
temp-# WHERE IstAktuell=20 GROUP BY MeldeKategorie
temp-# ORDER BY MeldeKategorie;
NOTICE:  QUERY PLAN:

Aggregate  (cost=22.67..22.72 rows=1 width=16)
  ->  Group  (cost=22.67..22.69 rows=10 width=16)
        ->  Sort  (cost=22.67..22.67 rows=10 width=16)
              ->  Seq Scan on hauptdaten_fall  (cost=0.00..22.50 rows=10 width=16)

EXPLAIN
temp=# CREATE INDEX hfia ON Hauptdaten_Fall (IstAktuell);
CREATE
temp=# explain SELECT MeldeKategorie, count(ID) FROM Hauptdaten_Fall
temp-# WHERE IstAktuell=20 GROUP BY MeldeKategorie
temp-# ORDER BY MeldeKategorie;
NOTICE:  QUERY PLAN:

Aggregate  (cost=8.30..8.35 rows=1 width=16)
  ->  Group  (cost=8.30..8.33 rows=10 width=16)
        ->  Sort  (cost=8.30..8.30 rows=10 width=16)
              ->  Index Scan using hfia on hauptdaten_fall  (cost=0.00..8.14 rows=10 width=16)

EXPLAIN
temp=#

This shows quite clearly that an index scan should improve the situation
drastically, even more so for you, because your table seems to have very
many rows in it.

- Einar Karttunen
On Mon, 10 Sep 2001, Tille, Andreas wrote:

> Hello,
>
> Now I started some performance comparisons and ran the following statement:
>
> The MS SQL server returns the result "immediately"; you just don't notice
> any delay. If I run it on the PostgreSQL server it takes 30s on comparable
> hardware :-(((.

Did you do a VACUUM ANALYZE after loading the data?

Rod
-- 
A small clue and no money will get you further than lots of money and no clue.
On Mon, 10 Sep 2001, Tille, Andreas wrote:

> On Mon, 10 Sep 2001 Herbert.Liechti@thinx.ch wrote:
>
> > Use EXPLAIN. EXPLAIN tells you the query plan of the optimizer.
> >
> > explain SELECT .....;
>
> Thanks, I just found the thread "Index usage question" and tried to profit
> from it:
>
> explain SELECT Hauptdaten_Fall.MeldeKategorie,
> Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
> (((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY
> Hauptdaten_Fall.MeldeKategorie ORDER BY
> Hauptdaten_Fall.MeldeKategorie;
>
> Aggregate  (cost=35267.33..36154.62 rows=17746 width=16)
>   ->  Group  (cost=35267.33..35710.98 rows=177458 width=16)
>         ->  Sort  (cost=35267.33..35267.33 rows=177458 width=16)
>               ->  Seq Scan on hauptdaten_fall  (cost=0.00..15024.12 rows=177458 width=16)
>
> I have nearly no experience with query optimizing, but the great difference
> in speed tells me that something is wrong here. There were some hints in
> the "Index usage question" thread about some fields which might be
> interpreted as strings. Could this be a reason, and how can I check this?

What's the schema for the table? How many rows are in the table? How many
rows actually have IstAktuell=20 (is 177458 a reasonable estimate?)? If
not, is there a common, non-NULL value that is much more common than other
values?
On Mon, 10 Sep 2001, Stephan Szabo wrote:

> > explain SELECT Hauptdaten_Fall.MeldeKategorie,
> > Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
> > (((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY
> > Hauptdaten_Fall.MeldeKategorie ORDER BY
> > Hauptdaten_Fall.MeldeKategorie;
> >
> > Aggregate  (cost=35267.33..36154.62 rows=17746 width=16)
> >   ->  Group  (cost=35267.33..35710.98 rows=177458 width=16)
> >         ->  Sort  (cost=35267.33..35267.33 rows=177458 width=16)
> >               ->  Seq Scan on hauptdaten_fall  (cost=0.00..15024.12 rows=177458 width=16)
> >
> > I have nearly no experience with query optimizing, but the great
> > difference in speed tells me that something is wrong here. There were
> > some hints in the "Index usage question" thread about some fields which
> > might be interpreted as strings. Could this be a reason, and how can I
> > check this?
>
> What's the schema for the table?

CREATE SEQUENCE SeqHauptdaten_Fall;

CREATE TABLE Hauptdaten_Fall (
  ID int DEFAULT nextval('SeqHauptdaten_Fall'),
  InterneRef int NOT NULL,
  Zeitstempel datetime NOT NULL,
  ZeitStempelRKI datetime NOT NULL,
  IstAktuell int NOT NULL,
  IstOffen int NOT NULL,
  IstFallDef int NOT NULL,
  Wichtigkeit int NOT NULL,
  ImportTyp int NOT NULL,
  ImportStatus int NOT NULL,
  ImportFile int NOT NULL,
  FehlerZahl int NOT NULL,
  ExportTyp int NOT NULL,
  ExportStatus int NOT NULL,
  ExportFile int NOT NULL,
  QuittungsText text NULL,
  Anmerkung text NULL,
  SaveIstAktuell int NOT NULL,
  SaveIstOffen int NOT NULL,
  SaveExportStatus int NOT NULL,
  FallkennungStelle varchar(15) NOT NULL,
  FallkennungAktenzeichen varchar(50) NOT NULL,
  FallVersion int NOT NULL,
  MeldekennungStelle varchar(15) NULL,
  MeldekennungAktenzeichen varchar(50) NULL,
  MeldeKategorie varchar(10) NULL,
  MeldeSoftware varchar(50) NULL,
  MeldeZuordnung varchar(50) NULL,
  Landkreis int NOT NULL,
  MeldeJahr int NOT NULL,
  MeldeWoche int NOT NULL,
  PersonGeburtsJahr int NOT NULL,
  PersonGeburtsMonat int NOT NULL,
  GeburtstagTheoretisch datetime NULL,
  AlterTheoretisch int NOT NULL,
  PersonGeschlecht int NOT NULL,
  PersonWohnsitz int NOT NULL,
  PersonAufenthalt int NOT NULL,
  Spende int NOT NULL,
  ErkranktStatus int NOT NULL,
  ErkranktDatumVon datetime NULL,
  ErkranktDatumBis datetime NULL,
  ErregerStatus int NOT NULL,
  EpidemiologieStatus int NOT NULL,
  KlinikAufenthaltStatus int NOT NULL,
  KlinikAufenthaltDatumVon datetime NULL,
  KlinikAufenthaltDatumBis datetime NULL,
  KlinikAdresse int NOT NULL,
  VerstorbenStatus int NOT NULL,
  VerstorbenDatumVon datetime NULL,
  VerstorbenDatumBis datetime NULL,
  InfektionsOrt varchar(50) NULL,
  InfektionsOKZ varchar(50) NULL,
  InfektionsInfo text NULL,
  HerdkennungStelle varchar(15) NULL,
  HerdkennungAktenzeichen varchar(50) NULL,
  MeldeDatum datetime NULL,
  AbsenderTyp int NOT NULL,
  Absender int NOT NULL,
  Mediziner int NOT NULL,
  Labor int NOT NULL,
  WirdBetreut int NOT NULL,
  Betreuungseinrichtung int NOT NULL,
  IstBetreuer int NOT NULL,
  BetreuerArbeitsstelle int NOT NULL,
  Lebensmittel int NOT NULL,
  LebensmittelBetrieb int NOT NULL,
  ts timestamp NULL,
  CONSTRAINT PK_Fall_Hauptdaten PRIMARY KEY
    (FallkennungStelle, FallkennungAktenzeichen, FallVersion)
);

CREATE INDEX IX_IstAktuellKategorie ON Hauptdaten_Fall(IstAktuell, MeldeKategorie);
CREATE UNIQUE INDEX IX_ID_Hauptdaten_Fall ON Hauptdaten_Fall(ID);
CREATE INDEX IX_MeldeJahr ON Hauptdaten_Fall(MeldeJahr);
CREATE INDEX IX_MeldeWoche ON Hauptdaten_Fall(MeldeWoche);
CREATE INDEX IX_Landkreis ON Hauptdaten_Fall(Landkreis);
CREATE INDEX IX_ErkranktStatus ON Hauptdaten_Fall(ErkranktStatus);
CREATE INDEX IX_ErregerStatus ON Hauptdaten_Fall(ErregerStatus);
CREATE INDEX IX_EpidemiologieStatus ON Hauptdaten_Fall(EpidemiologieStatus);
CREATE INDEX IX_KlinikAufenthaltStatus ON Hauptdaten_Fall(KlinikAufenthaltStatus);
CREATE INDEX IX_VerstorbenStatus ON Hauptdaten_Fall(VerstorbenStatus);
CREATE INDEX IX_ImportStatus ON Hauptdaten_Fall(ImportStatus);
CREATE INDEX IX_ExportStatus ON Hauptdaten_Fall(ExportStatus);
CREATE INDEX IX_ImportFile ON Hauptdaten_Fall(ImportFile);
CREATE INDEX IX_ExportFile ON Hauptdaten_Fall(ExportFile);
CREATE INDEX IX_Herd ON Hauptdaten_Fall(HerdkennungStelle, HerdkennungAktenzeichen);
CREATE INDEX IX_ImportTyp ON Hauptdaten_Fall(ImportTyp);
CREATE INDEX IX_ExportTyp ON Hauptdaten_Fall(ExportTyp);
CREATE INDEX IX_MeldeKategorie_Hauptdaten_Fa ON Hauptdaten_Fall(MeldeKategorie);
CREATE INDEX IX_IstFallDef ON Hauptdaten_Fall(IstFallDef);
CREATE INDEX IX_SaveIstAktuell ON Hauptdaten_Fall(SaveIstAktuell);
CREATE INDEX IX_FallVersion ON Hauptdaten_Fall(FallVersion);
CREATE INDEX IX_InterneRef_Hauptdaten_Fall ON Hauptdaten_Fall(InterneRef);
CREATE INDEX IX_IstAktuell_Hauptdaten_Fall ON Hauptdaten_Fall(IstAktuell);

> How many rows are in the table?

# select count(*) from Hauptdaten_Fall;
 count
--------
 257530
(1 row)

> How many rows actually have IstAktuell=20 (is 177458 a reasonable
> estimate?)?

Yes.

ifsg=# select count(*) from Hauptdaten_Fall WHERE IstAktuell=20;
 count
--------
 177458
(1 row)

Moreover, I tried explain with:

set enable_seqscan = off;
explain SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall
WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie
ORDER BY Hauptdaten_Fall.MeldeKategorie;

NOTICE:  QUERY PLAN:

Aggregate  (cost=0.00..146770.97 rows=17746 width=16)
  ->  Group  (cost=0.00..146327.32 rows=177458 width=16)
        ->  Index Scan using ix_meldekategorie_hauptdaten_fa on hauptdaten_fall  (cost=0.00..145883.68 rows=177458 width=16)

I wonder why the index IX_IstAktuell_Hauptdaten_Fall on IstAktuell is not
used, and moreover why the query now takes 127s with enable_seqscan = off
against 32s with the default setting.

Kind regards
Andreas
On Mon, 10 Sep 2001, Einar Karttunen wrote:

> ...
> temp=# CREATE INDEX hfia ON Hauptdaten_Fall (IstAktuell);
> CREATE
> temp=# explain SELECT MeldeKategorie, count(ID) FROM Hauptdaten_Fall
> temp-# WHERE IstAktuell=20 GROUP BY MeldeKategorie
> temp-# ORDER BY MeldeKategorie;
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=8.30..8.35 rows=1 width=16)
>   ->  Group  (cost=8.30..8.33 rows=10 width=16)
>         ->  Sort  (cost=8.30..8.30 rows=10 width=16)
>               ->  Index Scan using hfia on hauptdaten_fall  (cost=0.00..8.14 rows=10 width=16)

Thanks for the hint, but the index already exists and is still not used
(see my previous mail with a detailed description).

Kind regards
Andreas
> > How many rows are in the table?
>
> # select count(*) from Hauptdaten_Fall;
>  count
> --------
>  257530
> (1 row)
>
> > How many rows actually have IstAktuell=20 (is 177458 a reasonable
> > estimate?)?
>
> Yes.
>
> ifsg=# select count(*) from Hauptdaten_Fall WHERE IstAktuell=20;
>  count
> --------
>  177458
> (1 row)
>
> Moreover, I tried explain with:
>
> set enable_seqscan = off;
> explain SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
> FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
> GROUP BY Hauptdaten_Fall.MeldeKategorie
> ORDER BY Hauptdaten_Fall.MeldeKategorie;
>
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=0.00..146770.97 rows=17746 width=16)
>   ->  Group  (cost=0.00..146327.32 rows=177458 width=16)
>         ->  Index Scan using ix_meldekategorie_hauptdaten_fa on hauptdaten_fall  (cost=0.00..145883.68 rows=177458 width=16)
>
> I wonder why the index IX_IstAktuell_Hauptdaten_Fall on IstAktuell is not
> used, and moreover why the query now takes 127s with enable_seqscan = off
> against 32s with the default setting.

Well, the index isn't used because the planner estimates (apparently
correctly) that not using it is cheaper. Because the information about
whether a row is valid is kept in the heap, for each index hit the heap
needs to be read to see if the row is visible. This results in jumping
about the heap file with seeks and such, plus the index search itself.
When most of the rows are going to be returned, the sequential scan will
generally be cheaper.

A lot of the real time may be being spent in the sort step. You may want
to raise the amount of memory used for sorting and see if that helps.
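The tradeoff described above (one random heap fetch per index hit versus a
single sequential pass over the table) can be sketched with a toy cost
model. Note that all constants and functions below are invented for
illustration; they are not PostgreSQL's actual cost parameters or planner
code:

```python
# Toy model of the planner's choice, for illustration only.
# A sequential scan reads every heap page once, in order.
# An index scan pays one random heap fetch per matching row,
# because row visibility must be checked in the heap.

def seq_scan_cost(total_pages, seq_page_cost=1.0):
    """Cost of reading the whole table sequentially."""
    return total_pages * seq_page_cost

def index_scan_cost(matching_rows, random_page_cost=4.0):
    """Cost of one random heap access per index hit
    (ignoring the cost of reading the index itself)."""
    return matching_rows * random_page_cost

# Shaped roughly like the table in this thread:
# ~177k of ~257k rows match the WHERE clause.
print(seq_scan_cost(15000))      # whole-table scan
print(index_scan_cost(177458))   # far more expensive at this selectivity
```

At such low selectivity the per-row random fetches dominate, which is why
the planner (correctly, as the 127s vs. 32s timings show) prefers the
sequential scan here.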
On Tue, 11 Sep 2001, Stephan Szabo wrote:

> Well, the index isn't used because it estimates (apparently correctly)
> that not using it is cheaper. Because the information about whether a
> row is valid is kept in the heap, for each index hit the heap needs to
> be read to see if the row is visible. This results in jumping about the
> heap file with seeks and such, plus the index search itself. When most
> of the rows are going to be returned, the sequential scan will generally
> be cheaper.

It would drive me crazy if this "cheaper" solution lasted more than 30
seconds on PostgreSQL when the MS SQL server solves this task in less than
a second. This would really destroy all my plans with PostgreSQL if it
were true (which I can't really imagine).

> A lot of the real time may be being spent in the sort step. You may want
> to raise the amount of memory used for sorting and see if that helps.

How do I raise this memory amount? This is more or less a theoretical
question, because sorting about 150 items can't last 30s. The result set
is:

 MeldeKategorie |  Anz
----------------+-------
 ADV            |   142
 BOR            |     1
 BRU            |    10
 CAM            | 34965
 CHL            |    45
 CJK            |    61
 CLO            |     6
 COX            |   237
 CRY            |   645
 ECH            |   269
 ECO            |  3030
 EHC            |   750
 FRT            |     1
 FSV            |   160
 GIL            |  3013
 HAV            |  1495
 HBV            |  3352
 HCV            |  7710
 HDV            |    18
 HEV            |    33
 HFA            |    24
 HIN            |    66
 HIV            |  3628
 HTV            |   132
 HXV            |     3
 INV            |  2400
 LEG            |   203
 LEP            |    17
 LIS            |   167
 MSV            |  5164
 MYL            |     2
 MYT            |  4650
 NEI            |   576
 NWV            |  6566
 PLA            |   692
 POV            |     4
 RBV            |     1
 RIC            |     2
 RTV            | 42862
 RUB            |     1
 SAL            | 47829
 SHI            |   917
 SPA            |    33
 STY            |    60
 TOX            |    26
 TRI            |     6
 TRP            |   674
 VCH            |     2
 YEN            |  4810

There is not much to sort here, though.

Kind regards
Andreas
On Wed, 12 Sep 2001, Tille, Andreas wrote:

> On Tue, 11 Sep 2001, Stephan Szabo wrote:
>
> > Well, the index isn't used because it estimates (apparently correctly)
> > that not using it is cheaper. Because the information about whether a
> > row is valid is kept in the heap, for each index hit the heap needs to
> > be read to see if the row is visible. This results in jumping about
> > the heap file with seeks and such, plus the index search itself. When
> > most of the rows are going to be returned, the sequential scan will
> > generally be cheaper.
>
> It would drive me crazy if this "cheaper" solution lasted more than 30
> seconds on PostgreSQL when the MS SQL server solves this task in less
> than a second. This would really destroy all my plans with PostgreSQL
> if it were true (which I can't really imagine).
>
> > A lot of the real time may be being spent in the sort step. You may
> > want to raise the amount of memory used for sorting and see if that
> > helps.
>
> How do I raise this memory amount?

Probably the best way is to set sort_mem in postgresql.conf in your data
directory and restart the server. It may take multiple iterations to find
a good number (and you may need to raise your system's shared memory
limits if you run into them, but that differs from system to system). For
real-world testing, you probably also want to raise the shared_buffers
value as well.

> This is more or less a theoretical question, because sorting about 150
> items can't last 30s. The result set is:

It's sorting *before* doing the grouping, so it's the 170000 (or whatever)
rows being sorted by the GROUP BY columns. If I guess correctly (without
looking at the code), the GROUP BY does one pass across the sorted rows,
breaking the groups when it sees the grouped column values change, so it
wants the rows in sorted order.
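The sort-then-group pass guessed at above can be sketched like this (a
simplified illustration in Python of the general sort-based grouping
technique, not PostgreSQL's actual executor code):

```python
def group_count(keys):
    """Sort-based GROUP BY ... COUNT(*): sort by the grouping key,
    then make one pass, emitting a group whenever the key changes.
    The Sort step dominates the cost when there are many input rows."""
    keys = sorted(keys)          # the expensive Sort step
    result = []
    current, count = None, 0
    for key in keys:
        if key != current:
            if current is not None:
                result.append((current, count))
            current, count = key, 0
        count += 1
    if current is not None:
        result.append((current, count))
    return result

# Sorted output falls out as a side effect of the Sort step:
print(group_count(["SAL", "CAM", "SAL", "HBV", "CAM", "SAL"]))
# [('CAM', 2), ('HBV', 1), ('SAL', 3)]
```

This also shows why the output appears ordered even without ORDER BY: all
177458 matching rows get sorted just to produce the ~50 groups.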
On Mon, 10 Sep 2001, Tille, Andreas wrote:

> On Mon, 10 Sep 2001 Herbert.Liechti@thinx.ch wrote:
>
> > Use EXPLAIN. EXPLAIN tells you the query plan of the optimizer.
> >
> > explain SELECT .....;
>
> Thanks, I just found the thread "Index usage question" and tried to profit
> from it:
>
> explain SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
> FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
> GROUP BY Hauptdaten_Fall.MeldeKategorie
> ORDER BY Hauptdaten_Fall.MeldeKategorie;
>
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=35267.33..36154.62 rows=17746 width=16)
>   ->  Group  (cost=35267.33..35710.98 rows=177458 width=16)
>         ->  Sort  (cost=35267.33..35267.33 rows=177458 width=16)
>               ->  Seq Scan on hauptdaten_fall  (cost=0.00..15024.12 rows=177458 width=16)

I assume that an index on the field IstAktuell avoids the sequential scan.
If an index is there, try the following statement before you submit the
query:

    set enable_seqscan = off;

Regards
Herbie
-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Herbert Liechti                     http://www.thinx.ch
ThinX networked business services   Adlergasse 5, CH-4500 Solothurn
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Hello once more,

I stripped the database down to a single table with only the relevant
columns. A pg_dump can be found under

    http://www.foodborne-net.de/~tillea/ifsg/ifsgtest.dump.bz2

I would really appreciate it if you tried the query:

#!/bin/sh
START=`date +%s`

echo "SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall
WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie
ORDER BY Hauptdaten_Fall.MeldeKategorie;" | psql ifsgtest > result

let i="`date +%s` - $START"
echo "Query took ${i} seconds." >> result

(This should just measure the time needed for this task.) It took my E250
(2x400MHz, 2GB) server 20 seconds, and this is definitely much too long
for our application.

Any hints are greatly appreciated,
Andreas
On Tuesday 18 September 2001 07:06, Tille, Andreas wrote:
-SNIP-
> echo "SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS
> Anz FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY
> Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;" |
> psql ifsgtest > result
>
> let i="`date +%s` - $START"
> echo "Query took ${i} seconds." >> result
>
> (This should just measure the time needed for this task.) It took my
> E250 (2x400MHz, 2GB) server 20 seconds, and this is definitely much too
> long for our application.
>
> Any hints are greatly appreciated,
>
> Andreas

Ok, I tried the query on two boxes:

#1 800MHz Duron, IDE drives  - 11 secs
#2 800MHz P3, SCSI drives    - 5.35 secs

NOTE: the P3 has a lot larger buffers and stuff... (more mem :))

What does the function plpgsql_call_handler do? Could this be where the
slowdown is? If it is minor, then how are your buffers and shared memory
set?

I am by NO means an expert, but I'm quite willing to help if I can.

GB
-- 
GB Clark II | Roaming FreeBSD Admin
gclarkii@VSServices.COM | General Geek
CTHULU for President - Why choose the lesser of two evils?
> -----Original Message-----
> From: Tille, Andreas
>
> echo "SELECT Hauptdaten_Fall.MeldeKategorie, Count
> (Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall
> WHERE (((Hauptdaten_Fall.IstAktuell)=20)) GROUP
> BY Hauptdaten_Fall.MeldeKategorie ORDER BY
> Hauptdaten_Fall.MeldeKategorie;" | psql ifsgtest > result

I've not yet looked at the rest of the problem, but doesn't ORDER BY make
GROUP BY redundant? Just an extra sort?

-- Matt
On Tue, 18 Sep 2001, Matt Block wrote:

> > echo "SELECT Hauptdaten_Fall.MeldeKategorie, Count
> > (Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall
> > WHERE (((Hauptdaten_Fall.IstAktuell)=20)) GROUP
> > BY Hauptdaten_Fall.MeldeKategorie ORDER BY
> > Hauptdaten_Fall.MeldeKategorie;" | psql ifsgtest > result
>
> I've not yet looked at the rest of the problem, but doesn't ORDER BY
> make GROUP BY redundant? Just an extra sort?

No, it doesn't. I have to run exactly this query.

Kind regards
Andreas
Tille, Andreas writes:

> SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
> FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY
> Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;

> (which should just measure the time needed for this task). It took my
> E250 (2x400MHz, 2GB) server 20 seconds, and this is definitely much too
> long for our application.

I loaded this into the 7.2 development sources and it ran in 35 seconds of
wall-clock time on a much smaller machine. After I ran what in 7.1 would
be VACUUM ANALYZE, it took about 22 seconds. The difference was that it
had been using the index on hauptdaten_fall.istaktuell when it shouldn't.
(Try EXPLAIN to see what it does in your case. If it's still using the
index, you might want to force enable_indexscan = off.)

I also got a minuscule speed-up by replacing Count(Hauptdaten_Fall.ID)
with Count(*), which acts differently with respect to NULLs, so it depends
on whether you want to use it.

Besides that, I don't see anything blatantly obvious to speed this up.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
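The Count(*) vs. Count(column) distinction mentioned above comes down to
NULL handling; here it is in miniature, with Python standing in for the
SQL semantics (the sample rows are invented for illustration):

```python
# COUNT(*) counts all rows; COUNT(id) counts only rows whose id is
# not NULL (None plays the role of SQL NULL here).
rows = [("CAM", 1), ("CAM", None), ("SAL", 2), ("SAL", 3)]

count_star = len(rows)                                   # like COUNT(*)
count_id = sum(1 for _, id_ in rows if id_ is not None)  # like COUNT(ID)

print(count_star, count_id)
# 4 3
```

Since Hauptdaten_Fall.ID has a DEFAULT and is effectively never NULL here,
both counts agree for this table, which is why the swap is safe in this
particular query.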
On Wed, 19 Sep 2001, Peter Eisentraut wrote:

> > SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
> > FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20)) GROUP BY
> > Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;
> >
> > (which should just measure the time needed for this task). It took my
> > E250 (2x400MHz, 2GB) server 20 seconds, and this is definitely much too
> > long for our application.
>
> I loaded this into the 7.2 development sources and it ran in 35 seconds
> of wall-clock time on a much smaller machine. After I ran what in 7.1
> would be VACUUM ANALYZE, it took about 22 seconds. The difference was
> that it had been using the index on hauptdaten_fall.istaktuell when it
> shouldn't. (Try EXPLAIN to see what it does in your case. If it's still
> using the index, you might want to force enable_indexscan = off.)

Well, I tried several things to avoid/enforce index scans. The problem
remains every time. We are not talking about 20 or 30 seconds here; we are
talking about "slow" (i.e. > 10 seconds) for PostgreSQL against "fast"
(i.e. feeling immediate, too fast to measure exactly) on the MS SQL
server.

> Besides that, I don't see anything blatantly obvious to speed this up.

This fact is quite hard for me. I've got some hints about changes in the
table structure (use of an index for MeldeKategorie), and I could perhaps
track things down this way, even though I'm relatively strongly bound to
the given structure because the data should be "replicated" from the MS
SQL server to PostgreSQL.

Moreover, the problem is not this "quite simple" query (compared to other,
more complicated ones); this was just a simple test. My colleague posted
me another (more practical) query which took him 2.5 seconds querying the
MS SQL server from Access (so there is an additional bottleneck).
PostgreSQL took 92 seconds for the same task. This is way too slow and
would stop our PostgreSQL efforts completely, which means I would have to
search for OS alternatives (bad) or even use the MS SQL server (even worse
if I look at the current NT worms).

Perhaps this test could be an example which helps developers optimize such
kinds of queries. I'm afraid the GROUP BY / ORDER BY combination, which
causes expensive sorting, could be the reason for this slowdown. Perhaps a
more sophisticated sorting algorithm can be found. There must be a reason
for this order-of-magnitude speed difference in this special case, and I
would be more than happy if I could help improve PostgreSQL by providing
this example (to make PostgreSQL better in general and fit for my own
task).

Kind regards
Andreas
"Matt Block" <matt@blockdev.net> writes:

> I've not yet looked at the rest of the problem, but doesn't ORDER BY
> make GROUP BY redundant? Just an extra sort?

No. In the first place, there's no extra sort: the planner is well aware
that our current GROUP BY implementation produces ordered output. In the
second place, there's no guarantee that GROUP BY will always produce
ordered output in the future; we are thinking about changing over to a
hash-table-based implementation of grouping. If you want ordered output,
you should say ORDER BY, not try to outsmart the system.

			regards, tom lane
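The hash-table-based grouping mentioned here would avoid the sort
entirely. A minimal sketch of the general idea (illustrative only, not the
implementation actually planned for PostgreSQL):

```python
def hash_group_count(keys):
    """Hash aggregation: a single pass with no sort, accumulating
    per-key counts in a hash table (a dict here)."""
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    return counts

# The hash table's iteration order is arbitrary, which is exactly why
# an explicit ORDER BY would still be needed for sorted results:
print(sorted(hash_group_count(["SAL", "CAM", "SAL"]).items()))
# [('CAM', 1), ('SAL', 2)]
```

With few distinct groups (about 50 here) and many input rows, this does
O(n) work instead of the O(n log n) sort, which is the point of the
proposed change.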
On Wed, 19 Sep 2001, Tom Lane wrote:

> No. In the first place, there's no extra sort: the planner is well aware
> that our current GROUP BY implementation produces ordered output. In the
> second place, there's no guarantee that GROUP BY will always produce
> ordered output in the future; we are thinking about changing over to a
> hash-table-based implementation of grouping. If you want ordered output,
> you should say ORDER BY, not try to outsmart the system.

And even if I did so: sorting 51 items takes next to no time, so the
problem must be elsewhere. I did further tests and got:

SELECT Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall
WHERE (((Hauptdaten_Fall.IstAktuell)=20));

  anz
--------
 177458
(1 row)

---------> 2 seconds ... that is, longer than the MS SQL server takes
including the additional GROUP BY; and

SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall
WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie;

... without ORDER BY. This doesn't matter for the result set (it comes out
as ordered as it would with the ORDER BY clause) or for the time, which
stays at 18 seconds (after I increased sort_mem to 2048 it was *slightly*
faster; 20 seconds with the default sort_mem). So the real bottleneck
seems to be the GROUP BY.

Are there any chances that this could change in future PostgreSQL
versions? This slowness makes the use of PostgreSQL impossible for our
application.

Kind regards
Andreas