Thread: Serious performance problem

Serious performance problem

From
"Tille, Andreas"
Date:
Hello,

I discussed a problem concerning the speed of PostgreSQL compared to
MS SQL server heavily on postgres-general list.  The thread starts with
message
   http://fts.postgresql.org/db/mw/msg.html?mid=1035557

Now I tried a snapshot of version 7.2 and got a speed increase of about a
factor of 2.  But sorry, this is really not enough.  The very simple
test I pointed to in my mail is still much too slow, and the issue would
probably spoil the whole project, which should be a complete open
source solution and would perhaps otherwise end in M$ stuff.  I've come under
heavy pressure from my employer, who was talking about the nice world
of MS .net (while he is using MS-SQL exclusively).  To make the thing
clear: the issue is the legal database of infectious diseases in Germany
run by the Robert Koch-Institute.  So the beast could be of some
importance for increasing the acceptance of PostgreSQL and Open Source
in the field of medicine, which is generally known for the money that
is involved in it.  So I really hope that some skilled programmers would
be able to find a good way to solve the performance issue, perhaps by
just profiling the simple query
  SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
(((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;

to the data set I put on
  http://www.foodborne-net.de/~tillea/ifsg/ifsgtest.dump.bz2

If this should take less than half a second on a modern PC, I could
continue to try more realistic queries.

I really hope that I could readjust the image of PostgreSQL in the
eyes of my M$-centered colleagues.

Kind regards
      Andreas.


Re: Serious performance problem

From
Vsevolod Lobko
Date:
Seems that the problem is very simple :))
MSSQL can answer queries from indexes alone, without touching the actual table at all.
PostgreSQL can't.

So MSSQL avoids the sequential scan of the big table and simply scans the
index, which is already in the needed order and is very much smaller.
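
For the query in question, the index such an engine would walk is presumably
a composite one covering both the filter and the grouping column, something
like this (the index name is only illustrative):

    CREATE INDEX idx_istaktuell_meldekat
        ON Hauptdaten_Fall (IstAktuell, MeldeKategorie);

Such an index contains every column the query touches, so an engine with
index-only scans never needs to read the table; PostgreSQL still must, to
check tuple visibility (see the MVCC discussion later in this thread).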

On Mon, Oct 29, 2001 at 01:43:37PM +0100, Tille, Andreas wrote:
> Hello,
> 
> I discussed a problem concerning the speed of PostgreSQL compared to
> MS SQL server heavily on postgres-general list.  The thread starts with
> message
> 
>     http://fts.postgresql.org/db/mw/msg.html?mid=1035557
> 
> Now I tried a snapshot of version 7.2 and got an increase of speed of
> about factor 2.  But sorry this is really not enough.  The very simple
> test I pointed to in my mail is even much to slow and the issue would
> probably spoil down the whole project which should be a complete open
> source solution and would perhaps and in any M$ stuff.  I've got under
> heavy preasur from my employer who was talking about the nice world
> of MS .net (while he is using MS-SQL exclusively).  To make the thing
> clear the issue is the gal database of infectious diseases in Germany
> runned by the Robert Koch-Institute.  So the beast could be of some
> importance for increasing the acceptance of PostgreSQL and Open Source
> in the field of medicine which is generally known for the money which
> is involved in.  So I really hope that some skilled programmers would
> be able to find a good way to solve the performance issue perhaps by
> just profiling the simple query
> 
>    SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
(((Hauptdaten_Fall.IstAktuell)=20))
> GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;
> 
> to the data set I put on
> 
>    http://www.foodborne-net.de/~tillea/ifsg/ifsgtest.dump.bz2
> 
> If this should take less than half a second on a modern PC I could
> continue to try mo realistic queries.
> 
> I really hope that I could readjust the image of PostgreSQL in the
> eyes of my M$-centered colleagues.
> 
> Kind regards
> 
>        Andreas.
> 


Re: Serious performance problem

From
"Ross J. Reedstrom"
Date:
Andreas - 
I took a look at your problem, since I'm sort of in the field,
and would like to see free solutions spread, as well.

Here's what I see: your example touches on what can be an Achilles'
heel for pgsql's current statistical analyzer: selection on data fields
that have a few common values. Often, the indices don't get used, since
a large fraction of the table needs to be scanned in any case. In
your example, fully 68% of the table fits the WHERE condition.

Here's some timing results on my machine:

Your dataset and query, as written:

real    0m25.272s
user    0m0.090s
sys     0m0.050s

Creating an index on meldekategorie, and forcing its use with
"set enable_seqscan = off"

real    0m14.743s
user    0m0.070s
sys     0m0.050s

Same, with index on istaktuell:

real    0m26.511s
user    0m0.050s
sys     0m0.060s

Now, with an index on both  meldekategorie and istaktuell:

real    0m7.179s
user    0m0.060s
sys     0m0.030s

I think we have a winner. No, it's not sub-second, but I improved the time
by 3x just by trying some indices. Note that I _still_ had to force the
use of indices for this one. It's also the first time I've personally seen
a query/dataset that benefits this much from a two-key index.
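
(For reference, a two-key index of the kind used above can be created and
forced roughly like this; the index name and column order here are only
illustrative:

    CREATE INDEX hauptdaten_fall_ia_mk
        ON hauptdaten_fall (istaktuell, meldekategorie);
    SET enable_seqscan = off;

after which the query is rerun under time(1) as before.)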

As another poster replied to you, there is a limitation on postgresql's
use of indices that arises from MVCC: even if the only data requested is
that stored in the index itself, the backend must visit the actual tuple
in the table to ensure that it is 'visible' to the current transaction.

How realistic a representation of your real workload is this query? Realize
that more selective, complex queries are where pgsql shines compared to
other RDBMSs: the 'fast table scanner' type of query that you proposed as your
test doesn't really let pgsql stretch its legs. Do you have example timings
from MS-SQL or others?

Ross

On Mon, Oct 29, 2001 at 01:43:37PM +0100, Tille, Andreas wrote:
> Hello,
> 
> I discussed a problem concerning the speed of PostgreSQL compared to
> MS SQL server heavily on postgres-general list.  The thread starts with
> message
> 
>     http://fts.postgresql.org/db/mw/msg.html?mid=1035557
> 
> Now I tried a snapshot of version 7.2 and got an increase of speed of
> about factor 2.  But sorry this is really not enough.  The very simple
> test I pointed to in my mail is even much to slow and the issue would
> probably spoil down the whole project which should be a complete open
> source solution and would perhaps and in any M$ stuff.  I've got under
> heavy preasur from my employer who was talking about the nice world
> of MS .net (while he is using MS-SQL exclusively).  To make the thing
> clear the issue is the gal database of infectious diseases in Germany
> runned by the Robert Koch-Institute.  So the beast could be of some
> importance for increasing the acceptance of PostgreSQL and Open Source
> in the field of medicine which is generally known for the money which
> is involved in.  So I really hope that some skilled programmers would
> be able to find a good way to solve the performance issue perhaps by
> just profiling the simple query
> 
>    SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
(((Hauptdaten_Fall.IstAktuell)=20))
> GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;
> 
> to the data set I put on
> 
>    http://www.foodborne-net.de/~tillea/ifsg/ifsgtest.dump.bz2
> 
> If this should take less than half a second on a modern PC I could
> continue to try mo realistic queries.
> 
> I really hope that I could readjust the image of PostgreSQL in the
> eyes of my M$-centered colleagues.
> 
> Kind regards
> 
>        Andreas.
> 

-- 
Ross Reedstrom, Ph.D.                                 reedstrm@rice.edu
Executive Director                                  phone: 713-348-6166
Gulf Coast Consortium for Bioinformatics              fax: 713-348-6182
Rice University MS-39
Houston, TX 77005


Re: Serious performance problem

From
Jean-Michel POURE
Date:
Hello Andreas,

A possible solution would be:
CREATE TABLE foo AS
SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz 
FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY 
Hauptdaten_Fall.MeldeKategorie;

This is suitable if your data does not change often. To get automatic updates:
1) Define iPrecision, the precision that you need (an integer).
2) Create a trigger which increments a counter whenever a record is updated or
inserted. When the counter reaches iPrecision, do a DROP TABLE foo + CREATE
TABLE foo AS SELECT Hauptdaten_Fall.MeldeKategorie,
Count(Hauptdaten_Fall.ID)... This will take a few seconds, but only once.
Run a batch script within a time frame (1 hour, 4 hours, 1 day?) so a
human user has very little chance of reaching iPrecision. (A rough sketch
of this follows below.)
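
A minimal sketch of 2), with illustrative names; the rebuild itself is left
to the batch script, not to the trigger (this assumes the plpgsql language
is installed in the database):

    CREATE TABLE foo_counter (changes integer NOT NULL);
    INSERT INTO foo_counter VALUES (0);

    CREATE FUNCTION bump_foo_counter() RETURNS opaque AS '
    BEGIN
        -- remember that Hauptdaten_Fall changed since the last rebuild of foo
        UPDATE foo_counter SET changes = changes + 1;
        RETURN NEW;
    END;
    ' LANGUAGE 'plpgsql';

    CREATE TRIGGER hauptdaten_fall_changed
        AFTER INSERT OR UPDATE ON Hauptdaten_Fall
        FOR EACH ROW EXECUTE PROCEDURE bump_foo_counter();

The batch script then compares changes against iPrecision and, when it is
exceeded, drops and recreates foo and resets the counter to 0.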

On 300.000 records, you will get instant results. There are plenty of
tricks like this one. If you employ them, you will ***never*** reach the
limits of a dual Pentium III computer with U3W discs.

If you need to answer this message, please reply on 
pgsql-general@postgresql.org.

Cheers,
Jean-Michel POURE


At 13:43 29/10/01 +0100, you wrote:
>Hello,
>
>I discussed a problem concerning the speed of PostgreSQL compared to
>MS SQL server heavily on postgres-general list.  The thread starts with
>message
>
>     http://fts.postgresql.org/db/mw/msg.html?mid=1035557
>
>Now I tried a snapshot of version 7.2 and got an increase of speed of
>about factor 2.  But sorry this is really not enough.  The very simple
>test I pointed to in my mail is even much to slow and the issue would
>probably spoil down the whole project which should be a complete open
>source solution and would perhaps and in any M$ stuff.  I've got under
>heavy preasur from my employer who was talking about the nice world
>of MS .net (while he is using MS-SQL exclusively).  To make the thing
>clear the issue is the gal database of infectious diseases in Germany
>runned by the Robert Koch-Institute.  So the beast could be of some
>importance for increasing the acceptance of PostgreSQL and Open Source
>in the field of medicine which is generally known for the money which
>is involved in.  So I really hope that some skilled programmers would
>be able to find a good way to solve the performance issue perhaps by
>just profiling the simple query
>
>    SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS 
> Anz FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
>GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY 
>Hauptdaten_Fall.MeldeKategorie;
>
>to the data set I put on
>
>    http://www.foodborne-net.de/~tillea/ifsg/ifsgtest.dump.bz2
>
>If this should take less than half a second on a modern PC I could
>continue to try mo realistic queries.
>
>I really hope that I could readjust the image of PostgreSQL in the
>eyes of my M$-centered colleagues.
>
>Kind regards
>
>        Andreas.
>



Re: Serious performance problem

From
"Tille, Andreas"
Date:
On Mon, 29 Oct 2001, Jean-Michel POURE wrote:

> A possible solution would be:
> CREATE TABLE foo AS
> SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
> FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
> GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY
> Hauptdaten_Fall.MeldeKategorie;
Sorry, this is NO solution to my problem.

> On 300.000 records, you will get instant results. There are plenty of
> tricks like this one. If you employ them, you will ***never*** reach the
> limits of a double Pentium III computer with U3W discs.
It is really no help if I solve the speed issue of this *very simple,
zeroth-order try*.  I repeat: I have plenty of queries which do much
more complicated stuff than this.  This is just a crude strip-down of the
problem, fit for debugging/profiling issues of the database *server*.  Simple
tricks on a simple example do not help.

> If you need to answer this message, please reply on
> pgsql-general@postgresql.org.
No, because ...

> >I discussed a problem concerning the speed of PostgreSQL compared to
> >MS SQL server heavily on postgres-general list.  The thread starts with
> >message
> >
> >     http://fts.postgresql.org/db/mw/msg.html?mid=1035557
I did so and got the explicit advice from Tom to ask here.

Consider the problem as a benchmark.  I would love to see postgresql
as the winner.

Kind regards
       Andreas.


Re: Serious performance problem

From
"Tille, Andreas"
Date:
On Mon, 29 Oct 2001, Ross J. Reedstrom wrote:

> Here's what I see: Your example touches on what can be an achilles
> heel for pgsql's current statistical analyzer: selection on data fields
> that have a few common values. Often, the indices don't get used, since
> a large fraction of the table needs to be scanned, in any case. In
> your example, fully 68% of the table fits the where condition.
>  ...
>
> I think we have a winner. No it's not sub-second, but I improved the time
> by 3x just by trying some indices. Note that I _still_ had to force the
> use of indices for this one. It's also the first time I've personally seen
> a query/dataset that benefits this much from a two-key index.
This is true for this example, and I also played with indices as you did.  I also
forced the index scan and compared with forbidding the index scan.  The
result on my more realistic examples was that both versions performed about
the same.  There was no *real* difference. For sure, in this simple query there
is a difference, but the real examples showed only a 2% - 5% speed increase
(if not slower with forced index scans!).

> As another poster replied to you, there is limitation with postgresql's
> use of indices that arises from MVCC: even if the only data requested is
> that stored in the index itself, the backend must visit the actual tuple
> in the table to ensure that it is 'visible' to the current transaction.
Any possibility to switch this off temporarily for certain queries like this
one, if the programmer can make sure that it is not necessary?  Just a stupid
idea from a bloody uneducated man in database engineering.

> How realistic a representation of your real workload is this query? Realize
> that more selective, complex queries are where pgsql shines compared to
> other RDBMS: the 'fast table scanner' type query that you proposed as your
> test don't really let pgsql stretch it's legs. Do you have example timings
> from MS-SQL or others?
Unfortunately, the four tests we did here all seemed to suffer from the
same problem.  The situation is that there is a given database structure
which was developed over more than a year on MS-SQL and has an Access GUI.
Now parts of the UI should be made public via the web (I want to use Zope),
so I just imported the data and ran some example queries, with the
terribly slow result.

Kind regards and thanks for your ideas
         Andreas.


Re: Serious performance problem

From
Hannu Krosing
Date:
"Tille, Andreas" wrote:
> 
> On Mon, 29 Oct 2001, Ross J. Reedstrom wrote:
> 
> > Here's what I see: Your example touches on what can be an achilles
> > heel for pgsql's current statistical analyzer: selection on data fields
> > that have a few common values. Often, the indices don't get used, since
> > a large fraction of the table needs to be scanned, in any case. In
> > your example, fully 68% of the table fits the where condition.
> >  ...
> >
> > I think we have a winner. No it's not sub-second, but I improved the time
> > by 3x just by trying some indices. Note that I _still_ had to force the
> > use of indices for this one. It's also the first time I've personally seen
> > a query/dataset that benefits this much from a two-key index.
> This is true for this example and I also played with indices as you.  I also
> enforced the index scan and compared with forbidding the index scan.  The
> result was on my more realistic examples that both versions performed quite
> the same.  There was no *real* difference. For sure in this simple query there
> is a difference but the real examples showed only 2% - 5% speed increase
> (if not slower with enforcing index scans!).

I studied his dataset and found that a simple count(*) on the whole table
took 1.3 sec on my Celeron 375, so I'm sure that the more complex query,
which has to visit 2/3 of the tuples, will not be able to execute in under 1 sec.

My playing with indexes/subqueries and query rewriting got the example
query (actually a functional equivalent) to run in ~5 sec with a simple
aggregate(group(indexscan)) plan, and I suspect that this is as fast as
it will get on my hardware.

It could probably soon be possible to make it run in ~1.5 sec by using an
aggregate function that does a sequential scan and returns a rowset.

> > As another poster replied to you, there is limitation with postgresql's
> > use of indices that arises from MVCC: even if the only data requested is
> > that stored in the index itself, the backend must visit the actual tuple
> > in the table to ensure that it is 'visible' to the current transaction.
> Any possibility to switch of this temporarily for certain queries like this
> if the programmer could make sure that it is not necessary?  Just a stupid
> idea from a bloody uneducated man in database-engeniering.

There have been plans to set aside a bit in the index that would mark a
deleted tuple.  Unfortunately this helps only in cases where there are many
deleted tuples, and all live tuples have to be checked anyway ;(

--------------
Hannu


Re: Serious performance problem

From
"Tille, Andreas"
Date:
On Mon, 29 Oct 2001, Vsevolod Lobko wrote:

> Seems that problem is very simple :))
> MSSql can do queries from indexes, without using actual table at all.
> Postgresql doesn't.
>
> So mssql avoids sequental scanning of big table, and simply does scan of
> index which is already in needed order and has very much less size.
Hmmm, could anyone imagine a *solution* to the problem, simple or not?
I'm thinking of some switch the database programmer could use if he
really knows what he is doing.

Kind regards
       Andreas.


Re: Serious performance problem

From
"Tille, Andreas"
Date:
On Mon, 29 Oct 2001, Vsevolod Lobko wrote:

> Seems that problem is very simple :))
> MSSql can do queries from indexes, without using actual table at all.
> Postgresql doesn't.
>
> So mssql avoids sequental scanning of big table, and simply does scan of
> index which is already in needed order and has very much less size.
I forwarded this information to my colleague and he replied with the following
(I'm translating from German into English):

hc> I expected this problem.  But what is the purpose of an index? Not
hc> to have to look into the table itself.  Moreover, this means that the expense
hc> grows linearly with the table size - no good prospect at all (the
hc> good thing is that it is not exponential :-)).
I have to explain that we are at the *beginning* of the production process.
We expect a lot more data.

hc> With real index usage the expense grows only with log(n).
hc> Whatever the merits of the database servers' philosophies, MS-SQL-Server
hc> uses indexes consistently and so it is very fast at many queries.
hc> When querying a field without an index, I get a slow
hc> table scan.  This is like measuring the speed of the hard disk and
hc> the cleverness of the cache.

The consequence for my problem is now: if it is technically possible
to implement index scans without table lookups, please implement it.  If
not, we just have to look for another database engine which does so,
because our application really needs the speed on this type of query.
I repeat from my initial posting: the choice of the server for our
application could be important for many projects in the field of
medicine in Germany.  I really hope that there is a reasonable solution
which perhaps could strike a balance between safety and speed.  For
example, I can assure in my application that the index, once created,
will stay valid, because I just want to read in a new set of data once
a day (from the MS-SQL server which collects data over the day).  So
I could recreate all indices after the import, and the database is
read-only until the next cron job.  Is there any chance to speed up
such applications?
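
In that setup the nightly cron job would boil down to something like this
(only a rough sketch; the table, index and file names are illustrative, not
our real ones):

    -- run once per night, after exporting the day's data from MS-SQL
    DROP INDEX idx_istaktuell_meldekat;
    DELETE FROM hauptdaten_fall;
    COPY hauptdaten_fall FROM '/path/to/nightly_export.copy';
    CREATE INDEX idx_istaktuell_meldekat
        ON hauptdaten_fall (istaktuell, meldekategorie);
    VACUUM ANALYZE hauptdaten_fall;

and between two such runs the data never changes.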

Kind regards
         Andreas.



Re: Serious performance problem

From
"Zeugswetter Andreas SB SD"
Date:
> I also
> enforced the index scan and compared with forbidding the index scan.  The
> result was on my more realistic examples that both versions performed quite
> the same.  There was no *real* difference. For sure in this simple query there
> is a difference but the real examples showed only 2% - 5% speed increase
> (if not slower with enforcing index scans!).

You could speed up the query somewhat if you keep the sort from
hitting the disk.  A simple test here showed that you need somewhere
near sort_mem = 15000 in postgresql.conf.
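
(sort_mem is given in kilobytes, so that is roughly 15 MB per sort.  A quick
per-session test, without editing postgresql.conf, would be:

    SET sort_mem = 15000;

followed by rerunning the query and comparing the timings.)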

Andreas


Re: Serious performance problem

From
Brent Verner
Date:
On 30 Oct 2001 at 11:44 (+0100), Tille, Andreas wrote:
| On Mon, 29 Oct 2001, Vsevolod Lobko wrote:
| 
| > Seems that problem is very simple :))
| > MSSql can do queries from indexes, without using actual table at all.
| > Postgresql doesn't.
| >
| > So mssql avoids sequental scanning of big table, and simply does scan of
| > index which is already in needed order and has very much less size.
| I forewarded this information to my colleague and he replied the following
| (im translating from German into English):
| 
| hc> I expected this problem.  But what is the purpose of an index: Not
| hc> to look into the table itself.  Moreover this means that the expense
| hc> grows linear with the table size - no good prospect at all (the
| hc> good thing is it is not exponential :-)).
| I have to explain that we are in the *beginning* of production process.
| We expect a lot more of data.
| 
| hc> In case of real index usage the expense grows only with log(n).
| hc> No matter about the better philosophy of database servers, MS-SQL-Server
| hc> has a consequent index usage and so it is very fast at many queries.
| hc> When performing a query to a field without index, I get a slow
| hc> table scan.  This is like measuring the speed of the harddisk and
| hc> the cleverness of the cache.
| 
| The consequence for my problem is now:  If it is technically possible
| to implement index scans without table lookups please implement it.  If
| not we just have to look for another database engine which does so,
| because our applictaion really need the speed on this type of queries.
| I repeat from my initial posting: The choice of the server for our
| application could have importance for many projects in the field of
| medicine in Germany.  I really hope that there is a reasonable solution
| which perhaps could give a balance between safety and speed.  For
| example I can assure in my application that the index, once created
| will be valid, because I just want to read in a new set of data once
| a day (from the MS-SQL Server which collects data over the day).  So
| I could recreate all indices after the import and the database is
| readonly until the next cron job.  Is there any chance to speed up
| those applications?

CREATE INDEX idx_meldekategorie_hauptdaten_f ON hauptdaten_fall(meldekategorie);
CLUSTER idx_meldekategorie_hauptdaten_f ON hauptdaten_fall;

Aggregate  (cost=5006.02..5018.90 rows=258 width=16)
  ->  Group  (cost=5006.02..5012.46 rows=2575 width=16)
        ->  Sort  (cost=5006.02..5006.02 rows=2575 width=16)
              ->  Seq Scan on hauptdaten_fall  (cost=0.00..4860.12 rows=2575 width=16)

This looks much nicer, but is still quite slow.  I'm quite sure the
slowness is in the sort(), since all queries that don't sort return
quickly.  I hoped the clustered index would speed up the sort, but
that is not the case.

It _seems_ a simple optimization would be to not (re)sort the tuples 
when using a clustered index.

if( the_column_to_order_by_is_clustered ){
  if( order_by_is_DESC )
    // reverse the tuples to handle
}

I haven't looked at the code to see if this is even feasible, but I
do imagine there is enough info available to avoid an unnecessary
sort on the CLUSTERED index.  The only problem I see with this is
if the CLUSTERed index is not kept in a CLUSTERed state as more
records are added to this table.
 brent

-- 
"Develop your talent, man, and leave the world something. Records are 
really gifts from people. To think that an artist would love you enough
to share his music with anyone is a beautiful thing."  -- Duane Allman


Re: Serious performance problem

From
"Tille, Andreas"
Date:
On Tue, 30 Oct 2001, Zeugswetter Andreas SB SD wrote:

> You could somewhat speed up the query if you avoid that the sort
> hits the disk. A simple test here showed, that you need somewhere
> near sort_mem = 15000 in postgresql.conf.
Well, these are the usual hints from pgsql-general.  I did so and
increased step by step to:
  shared_buffers = 131072
  sort_mem       = 65536

This led to a doubling of speed in my tests, but these are settings beyond
which adding more memory doesn't result in any further speed increase.

When I posted my question here I was talking about this "tuned"
PostgreSQL server.  The default settings were even worse!

Kind regards
       Andreas.


Re: Serious performance problem

From
"Tille, Andreas"
Date:
On Tue, 30 Oct 2001, Jean-Michel POURE wrote:

> Is your database read-only?
Daily update from the MS-SQL server.  Between updates it is read-only.

> Good point, sorry to insist your problem is
> software optimization. In your case, the database may climb up to 200
> million rows (1000 days x 200.000 rows). What are you going to do then? Buy
> a 16 Itanium computer with 10 Gb RAM and MS SQL Server licence. Have a
> close look at your problem. How much time does it get MS SQL Server to
> query 200 million rows ? The problem is not in choosing MS SQL or
> PostgreSQL ...
The problem is exactly that.  If one server is 10 to 30 times faster for the very
same tasks, and chances are high that it scales better for the next orders of
magnitude that our data will grow into over the next years because of real
index usage (see postings on the hackers list), then the decision is easy.
My colleague made sure that MS SQL server is fit for the next years, and
I can only convince him if another server has comparable speed *for the
same task*.

> If you are adding 200.000 rows data everyday, consider using a combination
I do not add this much.

> of CREATE TABLE AS to create a result table with PL/pgSQL triggers to
> maintain data consistency. You will then get instant results, even on 2
> billion rows because you will always query the result table; not the
> original one. Large databases are always optimized this way because, even
> in case of smart indexes, there are things (like your problem) that need
> *smart* optimization.
>
> Do you need PL/pgSQL source code to perform a test on 2 billion rows? If
> so, please email me on pgsql-general and I will send you the code.
I really believe that there are many problems in the world that fall under
this category, and you are completely right.  My colleague is a database
expert (I consider myself a beginner) and he made sure that performance is
no issue for the next couple of years.  So what?  Spend hours on
optimisation of things that already work perfectly?  Why not ask the
PostgreSQL authors to optimize the server so that the very same task
performs comparably??????  If we afterwards need further database
optimization because of further constraints, I'm the first who will start
on it.   But there must be server code in the world that is able to answer
the example query that fast.  That much is proven!

Kind regards
        Andreas.



Re: Serious performance problem

From
Antonio Fiol Bonnín
Date:
AFAIK, sorting is necessary even when you have CLUSTERed a table using an index.

Somewhere in the docs I read something like "CLUSTER reorders the table on disk so that entries
closer on the index are closer on the disk" (obviously written in better English ;-)

But if you INSERT a single row later, it will NOT get inserted to the right place. So
SORT is still necessary.

MAYBE (but I am not sure at all) the sort may take less "real" time than if
the table was not CLUSTERed, as the table is "nearly" sorted.

Hackers, is the sorting algorithm capable of exiting at the very moment the table is
sorted, or are some extra passes always calculated?

Good luck!

Antonio

Brent Verner wrote:

> On 30 Oct 2001 at 11:44 (+0100), Tille, Andreas wrote:
> | On Mon, 29 Oct 2001, Vsevolod Lobko wrote:
> |
> | > Seems that problem is very simple :))
> | > MSSql can do queries from indexes, without using actual table at all.
> | > Postgresql doesn't.
> | >
> | > So mssql avoids sequental scanning of big table, and simply does scan of
> | > index which is already in needed order and has very much less size.
> | I forewarded this information to my colleague and he replied the following
> | (im translating from German into English):
> |
> | hc> I expected this problem.  But what is the purpose of an index: Not
> | hc> to look into the table itself.  Moreover this means that the expense
> | hc> grows linear with the table size - no good prospect at all (the
> | hc> good thing is it is not exponential :-)).
> | I have to explain that we are in the *beginning* of production process.
> | We expect a lot more of data.
> |
> | hc> In case of real index usage the expense grows only with log(n).
> | hc> No matter about the better philosophy of database servers, MS-SQL-Server
> | hc> has a consequent index usage and so it is very fast at many queries.
> | hc> When performing a query to a field without index, I get a slow
> | hc> table scan.  This is like measuring the speed of the harddisk and
> | hc> the cleverness of the cache.
> |
> | The consequence for my problem is now:  If it is technically possible
> | to implement index scans without table lookups please implement it.  If
> | not we just have to look for another database engine which does so,
> | because our applictaion really need the speed on this type of queries.
> | I repeat from my initial posting: The choice of the server for our
> | application could have importance for many projects in the field of
> | medicine in Germany.  I really hope that there is a reasonable solution
> | which perhaps could give a balance between safety and speed.  For
> | example I can assure in my application that the index, once created
> | will be valid, because I just want to read in a new set of data once
> | a day (from the MS-SQL Server which collects data over the day).  So
> | I could recreate all indices after the import and the database is
> | readonly until the next cron job.  Is there any chance to speed up
> | those applications?
>
> CREATE INDEX idx_meldekategorie_hauptdaten_f
>   ON hauptdaten_fall(meldekategorie);
> CLUSTER idx_meldekategorie_hauptdaten_f ON hauptdaten_fall;
>
> Aggregate  (cost=5006.02..5018.90 rows=258 width=16)
>   ->  Group  (cost=5006.02..5012.46 rows=2575 width=16)
>         ->  Sort  (cost=5006.02..5006.02 rows=2575 width=16)
>               ->  Seq Scan on hauptdaten_fall  (cost=0.00..4860.12 rows=2575 width=16)
>
> This looks much nicer, but is still quite slow.  I'm quite sure the
> slowness is in the sort(), since all queries that don't sort, return
> quickly.  I hoped the clustered index would speed up the sort, but
> that is not the case.
>
> It _seems_ a simple optimization would be to not (re)sort the tuples
> when using a clustered index.
>
> if( the_column_to_order_by_is_clustered ){
>   if( order_by_is_DESC )
>     // reverse the tuples to handle
> }
>
> I haven't looked at the code to see if this is even feasible, but I
> do imagine there is enough info available to avoid an unnecessary
> sort on the CLUSTERED index.  The only problem I see with this is
> if the CLUSTERed index is not kept in a CLUSTERed state as more
> records are added to this table.
>
>   brent
>
> --
> "Develop your talent, man, and leave the world something. Records are
> really gifts from people. To think that an artist would love you enough
> to share his music with anyone is a beautiful thing."  -- Duane Allman
>



Re: Serious performance problem

From
"Tille, Andreas"
Date:
On Tue, 30 Oct 2001, Antonio Fiol Bonnín wrote:

> AFAIK, sorting is necessary even when you have CLUSTERed a table using an index.
Sorting is not the performance constraint in my example.  Just leave out
the sorting and see what happens ...

> But if you INSERT a single row later, it will NOT get inserted to the right place. So
> SORT is still necessary.
Well, rearranging the database in a cron job after inserting new data once a day
overnight would be possible - but I doubt that it makes a big difference.

Kind regards
       Andreas.


Re: Serious performance problem

From
Alex Pilosov
Date:
On Tue, 30 Oct 2001, Antonio Fiol Bonnín wrote:

> > | > Seems that problem is very simple :))
> > | > MSSql can do queries from indexes, without using actual table at all.
> > | > Postgresql doesn't.
> > | >
> > | > So mssql avoids sequental scanning of big table, and simply does scan of
> > | > index which is already in needed order and has very much less size.
<snip>
> > | The consequence for my problem is now:  If it is technically possible
> > | to implement index scans without table lookups please implement it.  If
The feature you are looking for is called 'index coverage'. Unfortunately,
it is not easy to implement in PostgreSQL, and it is one of the few
outstanding 'nasties'. The reason you can't do it is as follows: Postgres
uses MVCC, and stores 'when' the tuple is alive inside the tuple itself. So, even
if the index contains all the information you need, you still need to access the
main table to check whether the tuple is valid.

Possible workaround: store tuple validity in the index. That way, a lot more
space is wasted (16 more bytes/tuple/index), and you will need to update
all indices when the base table is updated, even if the indexed information
has not changed.

Fundamentally, this may be necessary anyway, to make index handlers aware
of transactions and tuple validity (currently, if you have a unique index,
you may get conflicts when different transactions attempt to insert
conflicting data, _at the time of insert, not at the time of commit_).

-alex



Re: Serious performance problem

From
"Tille, Andreas"
Date:
On Tue, 30 Oct 2001, Alex Pilosov wrote:

> The feature you are looking for is called 'index coverage'. Unfortunately,
> it is not easy to implement with Postgresql, and it is one of few
> outstanding 'nasties'. The reason you can't do it is follows: Postgres
> uses MVCC, and stores 'when' the tuple is alive inside the tuple. So, even
> if index contains all the information you need, you still need to access
> main table to check if the tuple is valid.
Well, I do not fully understand that stuff, but I get a feeling for the
problem.  Thanks for the explanation.

> Possible workaround: store tuple validity in index, that way, a lot more
> space is wasted (16 more bytes/tuple/index), and you will need to update
> all indices when the base table is updated, even if indexed information
> have not changed.
This would be acceptable for *my* special application, but I'm afraid
this could be a problem for others.

> Fundamentally, this may be necessary anyway, to make index handlers aware
> of transactions and tuple validity (currently, if you have unique index,
> you may have conflicts when different transactions attempt to insert
> conflicting data, _at the time of insert, not at time of commit_).
As I said, all this wouldn't be a problem for my application.  I just
run a sequential insert of data each night.  Then the database is read-only.

Does anybody see a chance that 'index coverage' will be implemented in
7.2?  This would be a crucial feature for my application.  If it will
not happen in a reasonable time frame I would have to look for an
alternative database server.  Does anybody know something about MySQL or
Interbase?

Kind regards
         Andreas.


Re: Serious performance problem

From
Stephan Szabo
Date:
> Does anybody see chances that 'index coverage' would be implemented into
> 7.2.  This would be a cruxial feature for my application.  If it will
> not happen in a reasonable time frame I would have to look for
> alternative database server.  Anybody knows something about MySQL or
> Interbase?

Since I don't remember anyone mentioning working on it here and 7.2 just
went into beta, I don't think it's likely.  If you want to push, you may
be able to convince someone for 7.3.




Re: Serious performance problem

From
Horst Herb
Date:
On Wednesday 31 October 2001 03:13, you wrote:
> On Tue, 30 Oct 2001, Alex Pilosov wrote:

> As I said all this wouldn't be a problem for my application.  I just
> run a sequential insert of data each night.  Then the database is read
> only.
>
> Does anybody see chances that 'index coverage' would be implemented into
> 7.2.  This would be a cruxial feature for my application.  If it will

Andreas,

I have the feeling that your problem is best solved by taking a different
approach.
As A. Pilosov's posting pointed out, index coverage is a problem intrinsic to
the MVCC implementation (IMHO a small price to pay for a priceless feature).
I can't see why much effort should go into a brute-force method to implement
index coverage if your problem can be solved more elegantly in a different way.

With the example you posted, it is essentially only simple statistics you
want to run, on tables where the *majority* of records qualify in your
query.
Why not create an extra "statistics" table which is updated automatically
through triggers on your original table? That way, you will always get
up-to-date, INSTANT query results no matter how huge your original table is.
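
A minimal sketch of such a trigger-maintained table, under the assumption that
a row per MeldeKategorie has already been seeded from the initial GROUP BY
(table, column and function names, and the column types, are only illustrative):

    CREATE TABLE meldekategorie_statistik (
        meldekategorie  text    NOT NULL,
        anz             integer NOT NULL
    );

    CREATE FUNCTION stat_inc() RETURNS opaque AS '
    BEGIN
        -- count only the rows the report actually looks at
        IF NEW.IstAktuell = 20 THEN
            UPDATE meldekategorie_statistik
               SET anz = anz + 1
             WHERE meldekategorie = NEW.MeldeKategorie;
        END IF;
        RETURN NEW;
    END;
    ' LANGUAGE 'plpgsql';

    CREATE TRIGGER hauptdaten_fall_stat
        AFTER INSERT ON Hauptdaten_Fall
        FOR EACH ROW EXECUTE PROCEDURE stat_inc();

DELETEs and UPDATEs that change IstAktuell or MeldeKategorie would need
corresponding decrement/increment triggers; the original query then becomes a
plain SELECT from meldekategorie_statistik.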

And, don't forget that the only way MS SQL can achieve better performance
here is through mercilessly hogging resources. In a complex database
environment with even larger tables, the performance gain in MS SQL would be
minimal (my guess).

Horst


Re: Serious performance problem

From
Jean-Michel POURE
Date:
>Why not create an extra "statistics" table which is updated automatically
>through triggers in your original table? That way, you will always get
>up-to-date INSTANT query results no matter how huge your original table is.
>
>And, don't forget that the only way MS SQL can achieve the better performance
>here is through mercilessly hogging ressources. In a complex database
>environment with even larger tables, the performance gain in MS SQL would be
>minimal (my guess).

Definitely. This is a design optimization problem, not an index problem.


Re: Serious performance problem

From
Hannu Krosing
Date:
Alex Pilosov wrote:
> 
> On Tue, 30 Oct 2001, Antonio Fiol Bonnín wrote:
> 
> > > | > Seems that problem is very simple :))
> > > | > MSSql can do queries from indexes, without using actual table at all.
> > > | > Postgresql doesn't.
> > > | >
> > > | > So mssql avoids sequental scanning of big table, and simply does scan of
> > > | > index which is already in needed order and has very much less size.
> <snip>
> > > | The consequence for my problem is now:  If it is technically possible
> > > | to implement index scans without table lookups please implement it.  If
> The feature you are looking for is called 'index coverage'. Unfortunately,
> it is not easy to implement with Postgresql, and it is one of few
> outstanding 'nasties'. The reason you can't do it is follows: Postgres
> uses MVCC, and stores 'when' the tuple is alive inside the tuple. So, even
> if index contains all the information you need, you still need to access
> main table to check if the tuple is valid.
> 
> Possible workaround: store tuple validity in index, that way, a lot more
> space is wasted (16 more bytes/tuple/index), and you will need to update
> all indices when the base table is updated, even if indexed information
> have not changed.

AFAIK you will need to update all indexes anyway, as MVCC changes the
location of the new tuple.

-------------
Hannu


Re: Serious performance problem

From
Hannu Krosing
Date:
"Tille, Andreas" wrote:
> 
> On Tue, 30 Oct 2001, Alex Pilosov wrote:
> 
> > The feature you are looking for is called 'index coverage'. Unfortunately,
> > it is not easy to implement with Postgresql, and it is one of few
> > outstanding 'nasties'. The reason you can't do it is follows: Postgres
> > uses MVCC, and stores 'when' the tuple is alive inside the tuple. So, even
> > if index contains all the information you need, you still need to access
> > main table to check if the tuple is valid.
> Well, I do not fully understand that stuff, but I get a feeling of the
> problem.  Thanks for the explanation.
> 
> > Possible workaround: store tuple validity in index, that way, a lot more
> > space is wasted (16 more bytes/tuple/index), and you will need to update
> > all indices when the base table is updated, even if indexed information
> > have not changed.
> This would be acceptable for *my* special application but I'm afraid
> this could be a problem for others.
> 
> > Fundamentally, this may be necessary anyway, to make index handlers aware
> > of transactions and tuple validity (currently, if you have unique index,
> > you may have conflicts when different transactions attempt to insert
> > conflicting data, _at the time of insert, not at time of commit_).
> As I said all this wouldn't be a problem for my application.  I just
> run a sequential insert of data each night.  Then the database is read only.
> 
> Does anybody see chances that 'index coverage' would be implemented into
> 7.2.  This would be a cruxial feature for my application.  If it will
> not happen in a reasonable time frame I would have to look for
> alternative database server.  Anybody knows something about MySQL or
> Interbase?

If it is static data and simple queries, then there is a fairly good chance
that MySQL is a good choice.

As for the other two open source databases (Interbase and SAPDB, a modified
version of ADABAS released under the GPL by SAP - http://www.sapdb.com/), I
have no direct experience.

I occasionally read the sapdb mailing list, and I've got the impression that
it is quite a usable and stable DB once you have set it up. Setting it up seems
order(s) of magnitude harder than for PostgreSQL or MySQL.

Whether it actually runs full-table aggregates faster than PG is a thing
I can't comment on, but you could get some of their people to do the
benchmarking for you if you send them an advocacy-urging request, like "I'd
switch if you show me that your db is fast enough" ;)

-------------------
Hannu


Re: Serious performance problem

From
Hannu Krosing
Date:
"Tille, Andreas" wrote:
> 
> Hello,
> 
> I discussed a problem concerning the speed of PostgreSQL compared to
> MS SQL server heavily on postgres-general list.  The thread starts with
> message
> 
>     http://fts.postgresql.org/db/mw/msg.html?mid=1035557
> 
> Now I tried a snapshot of version 7.2 and got an increase of speed of
> about factor 2.  But sorry this is really not enough.  The very simple
> test I pointed to in my mail is even much to slow and the issue would
> probably spoil down the whole project which should be a complete open
> source solution and would perhaps and in any M$ stuff.  I've got under
> heavy preasur from my employer who was talking about the nice world
> of MS .net (while he is using MS-SQL exclusively).  To make the thing
> clear the issue is the gal database of infectious diseases in Germany
> runned by the Robert Koch-Institute.  So the beast could be of some
> importance for increasing the acceptance of PostgreSQL and Open Source
> in the field of medicine which is generally known for the money which
> is involved in.  So I really hope that some skilled programmers would
> be able to find a good way to solve the performance issue perhaps by
> just profiling the simple query
> 
>    SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
(((Hauptdaten_Fall.IstAktuell)=20))
> GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;
> 
> to the data set I put on
> 
>    http://www.foodborne-net.de/~tillea/ifsg/ifsgtest.dump.bz2
> 
> If this should take less than half a second on a modern PC I could
> continue to try mo realistic queries.


I tried some more on optimizing the query on my work computer 
(AMD ATHLON 850, 512MB, PostgreSQL 7.1.3 with default memory settings)


SELECT MeldeKategorie,
       Count(ID) AS Anz
  FROM Hauptdaten_Fall
 WHERE IstAktuell=20
GROUP BY MeldeKategorie
ORDER BY MeldeKategorie;

real    0m9.675s

create index i1 on Hauptdaten_Fall(IstAktuell,MeldeKategorie);

----------------------------
set enable_seqscan = off;
SELECT MeldeKategorie,
       Count(ID) AS Anz
  FROM Hauptdaten_Fall
 WHERE IstAktuell=20
GROUP BY MeldeKategorie
ORDER BY MeldeKategorie;

Aggregate  (cost=4497.30..4510.18 rows=258 width=16)
  ->  Group  (cost=4497.30..4503.74 rows=2575 width=16)
        ->  Sort  (cost=4497.30..4497.30 rows=2575 width=16)
              ->  Index Scan using i1 on hauptdaten_fall  (cost=0.00..4351.40 rows=2575 width=16)

real    0m7.131s

---------------------------

set enable_seqscan = off;
SELECT MeldeKategorie,
       Count(ID) AS Anz
  FROM Hauptdaten_Fall
 WHERE IstAktuell=20
GROUP BY IstAktuell,MeldeKategorie
ORDER BY IstAktuell,MeldeKategorie;

Aggregate  (cost=4497.30..4510.18 rows=258 width=16)
  ->  Group  (cost=4497.30..4503.74 rows=2575 width=16)
        ->  Index Scan using i1 on hauptdaten_fall  (cost=0.00..4351.40 rows=2575 width=16)

real    0m3.223s

-- same after doing

cluster i1 on Hauptdaten_Fall;

real    1.590 -- 1.600



select count(*) from Hauptdaten_Fall;

real    0m0.630s

---------------------------

The following query is marginally (about 0.1 sec) faster, though the 
plan looks the same down to cost estimates.

SET ENABLE_SEQSCAN = OFF;
SELECT MeldeKategorie,
       Count(*) AS Anz
  FROM (select IstAktuell,MeldeKategorie from Hauptdaten_Fall where IstAktuell=20) sub
GROUP BY IstAktuell,MeldeKategorie
ORDER BY IstAktuell,MeldeKategorie;

Aggregate  (cost=0.00..4370.72 rows=258 width=16)
  ->  Group  (cost=0.00..4364.28 rows=2575 width=16)
        ->  Index Scan using i1 on hauptdaten_fall  (cost=0.00..4351.40 rows=2575 width=16)

real    0m1.438s - 1.506s

---------------------------

now I make the dataset bigger, keeping the number of rows returned by
the query the same

insert into hauptdaten_fall (istaktuell, meldekategorie)
select istaktuell + 20, meldekategorie 
from  hauptdaten_fall ;

INSERT 0 257530

insert into hauptdaten_fall (istaktuell, meldekategorie)
select istaktuell + 40, meldekategorie 
from  hauptdaten_fall ;

INSERT 0 515060
ifsgtest=# select count(*) from hauptdaten_fall;
  count
---------
 1030120
(1 row)

cluster i1 on Hauptdaten_Fall;
vacuum analyze;


-- The query time is still the same 1.44 - 1.5 sec

SET ENABLE_SEQSCAN = OFF;
SELECT MeldeKategorie,
       Count(*) AS Anz
  FROM (select IstAktuell,MeldeKategorie from Hauptdaten_Fall where IstAktuell=20) sub
GROUP BY IstAktuell,MeldeKategorie
ORDER BY IstAktuell,MeldeKategorie;

Aggregate  (cost=0.00..4370.72 rows=258 width=16)
  ->  Group  (cost=0.00..4364.28 rows=2575 width=16)
        ->  Index Scan using i1 on hauptdaten_fall  (cost=0.00..4351.40 rows=2575 width=16)

real    0m1.438s - 1.506s

----------------------------

now back to original data distribution, just 4 times bigger

ifsgtest=# update hauptdaten_fall
ifsgtest-# set istaktuell = case when istaktuell % 20 = 0 then 20 else 10 end
ifsgtest-# ;
UPDATE 1030120
ifsgtest=# vacuum analyze;
VACUUM

SET ENABLE_SEQSCAN = OFF;
SELECT MeldeKategorie,
       Count(*) AS Anz
  FROM (select IstAktuell,MeldeKategorie from Hauptdaten_Fall where IstAktuell=20) sub
GROUP BY IstAktuell,MeldeKategorie
ORDER BY IstAktuell,MeldeKategorie;

real    0m6.077 -- 6.606s

and after clustering:
cluster i1 on Hauptdaten_Fall;

real    0m5.683 - 5.750s

so it's linear growth here

----------------------------

Hannu


Re: Serious performance problem

From
mlw
Date:
I have been thinking about this query; I downloaded all your info and read
your reply to a previous post.

At issue: you say MSSQL outperforms PGSQL.  This may be true for a finite set of
query types, it may even be true for your entire application, but for how long?
What will be the nature of your data next year?

Your query is a prime example of where application optimization needs to
happen. Regardless of whether MSSQL can currently execute that query quickly, at some
point there will be a volume of data which is too large to process quickly.
This table, you say, is created periodically with a cron job. How hard would it
be to append a couple of SQL statements to create a summary table for the
high-speed queries?

Personally, I think your approach needs to be modified a bit. The fact that
your query runs well on one SQL database and poorly on another indicates to me
that you will be tied to one database forever. If you use standard database
optimization techniques in your design, you can choose any database by the
whole range of important criteria, such as reliability, speed, support,
administration, and price, rather than just speed.

Also, if you use MSSQL you will need some version of MS-Windows on
which to run it; that alone indicates to me that you will have reliability problems.

"Tille, Andreas" wrote:
> 
> Hello,
> 
> I discussed a problem concerning the speed of PostgreSQL compared to
> MS SQL server heavily on postgres-general list.  The thread starts with
> message
> 
>     http://fts.postgresql.org/db/mw/msg.html?mid=1035557
> 
> Now I tried a snapshot of version 7.2 and got an increase of speed of
> about factor 2.  But sorry this is really not enough.  The very simple
> test I pointed to in my mail is even much to slow and the issue would
> probably spoil down the whole project which should be a complete open
> source solution and would perhaps and in any M$ stuff.  I've got under
> heavy preasur from my employer who was talking about the nice world
> of MS .net (while he is using MS-SQL exclusively).  To make the thing
> clear the issue is the gal database of infectious diseases in Germany
> runned by the Robert Koch-Institute.  So the beast could be of some
> importance for increasing the acceptance of PostgreSQL and Open Source
> in the field of medicine which is generally known for the money which
> is involved in.  So I really hope that some skilled programmers would
> be able to find a good way to solve the performance issue perhaps by
> just profiling the simple query
> 
>    SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE
(((Hauptdaten_Fall.IstAktuell)=20))
> GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;
> 
> to the data set I put on
> 
>    http://www.foodborne-net.de/~tillea/ifsg/ifsgtest.dump.bz2
> 
> If this should take less than half a second on a modern PC I could
> continue to try mo realistic queries.
> 
> I really hope that I could readjust the image of PostgreSQL in the
> eyes of my M$-centered colleagues.
> 
> Kind regards
> 
>        Andreas.
> 

-- 
5-4-3-2-1 Thunderbirds are GO!
------------------------
http://www.mohawksoft.com


Re: Serious performance problem

From
Denis Perchine
Date:
On Tuesday 30 October 2001 21:24, Alex Pilosov wrote:
> > > | The consequence for my problem is now:  If it is technically possible
> > > | to implement index scans without table lookups please implement it. 
> > > | If
>
> The feature you are looking for is called 'index coverage'. Unfortunately,
> it is not easy to implement with Postgresql, and it is one of few
> outstanding 'nasties'. The reason you can't do it is follows: Postgres
> uses MVCC, and stores 'when' the tuple is alive inside the tuple. So, even
> if index contains all the information you need, you still need to access
> main table to check if the tuple is valid.
>
> Possible workaround: store tuple validity in index, that way, a lot more
> space is wasted (16 more bytes/tuple/index), and you will need to update
> all indices when the base table is updated, even if indexed information
> have not changed.

What is the problem with implementing this as a special index type for
people who need it?  Just add a flag keyword to the index creation clause.

Actually I would like to hear Tom's opinion on this issue.  This issue is of
interest to me too.

Also, I saw some time ago on -hackers that there is a patch implementing this...
Or am I wrong here?

--
Denis



Re: Serious performance problem

From
"Tille, Andreas"
Date:
On Wed, 31 Oct 2001, Hannu Krosing wrote:

> I tried some more on optimizing the query on my work computer
> (AMD ATHLON 850, 512MB, PostgreSQL 7.1.3 with default memory settings)
>
>
> SELECT MeldeKategorie,
>        Count(ID) AS Anz
>   FROM Hauptdaten_Fall
>  WHERE IstAktuell=20
> GROUP BY MeldeKategorie
> ORDER BY MeldeKategorie;
>
> real    0m9.675s
>
> create index i1 on Hauptdaten_Fall(IstAktuell,MeldeKategorie);
>
> ----------------------------
> set enable_seqscan = off;
> SELECT MeldeKategorie,
>        Count(ID) AS Anz
>   FROM Hauptdaten_Fall
>  WHERE IstAktuell=20
> GROUP BY MeldeKategorie
> ORDER BY MeldeKategorie;
>
> Aggregate  (cost=4497.30..4510.18 rows=258 width=16)
>   ->  Group  (cost=4497.30..4503.74 rows=2575 width=16)
>         ->  Sort  (cost=4497.30..4497.30 rows=2575 width=16)
>               ->  Index Scan using i1 on hauptdaten_fall
> (cost=0.00..4351.40 rows=2575 width=16)
>
> real    0m7.131s
>
> ---------------------------
>
> set enable_seqscan = off;
> SELECT MeldeKategorie,
>        Count(ID) AS Anz
>   FROM Hauptdaten_Fall
>  WHERE IstAktuell=20
> GROUP BY IstAktuell,MeldeKategorie
> ORDER BY IstAktuell,MeldeKategorie;
>
> Aggregate  (cost=4497.30..4510.18 rows=258 width=16)
>   ->  Group  (cost=4497.30..4503.74 rows=2575 width=16)
>         ->  Index Scan using i1 on hauptdaten_fall  (cost=0.00..4351.40
> rows=2575 width=16)
>
> real    0m3.223s
Hmmm, could you please explain the theory behind that to quite a
beginner like me (perhaps on -general if you feel it appropriate)?

The change in the second select is that you included IstAktuell in the
GROUP BY/ORDER BY clause, and this gives a speed increase of a factor of 2.
It seems that the "Sort" can be left out in this case if I look at the
plan, but why is that?  The WHERE clause should select just the IstAktuell=20
rows, and so the GROUP BY/ORDER BY clauses should do the same amount of
work every time - as far as my humble understanding goes.

>
> -- same after doing
>
> cluster i1 on Hauptdaten_Fall;
>
> real    1.590 -- 1.600
That's also interesting.  In reality the table Hauptdaten_Fall has many fields
with many indices.  If I understand things right it makes no sense to have
more than one clustered index, right?  A further speed increase by a factor of
two would be welcome.  Could I expect this if I found the "sensitive"
index of my table for certain tasks?  Or is my understanding wrong and it
makes sense to cluster more than one index?  Unfortunately, clustering the
index of a huge table takes some time.  Could I speed this up with some
tricks?

> select count(*) from Hauptdaten_Fall;
>
> real    0m0.630s
>
> ---------------------------
>
> The following query is marginally (about 0.1 sec) faster, though the
> plan looks the same down to cost estimates.
>
> SET ENABLE_SEQSCAN = OFF;
> SELECT MeldeKategorie,
>        Count(*) AS Anz
>   FROM (select IstAktuell,MeldeKategorie from Hauptdaten_Fall where
> IstAktuell=20) sub
> GROUP BY IstAktuell,MeldeKategorie
> ORDER BY IstAktuell,MeldeKategorie;
>
> Aggregate  (cost=0.00..4370.72 rows=258 width=16)
>   ->  Group  (cost=0.00..4364.28 rows=2575 width=16)
>         ->  Index Scan using i1 on hauptdaten_fall  (cost=0.00..4351.40
> rows=2575 width=16)
>
> real    0m1.438s - 1.506s
Hmm, perhaps this is next to nothing, or is there some theory why a
count(*) should be faster than a count(<fieldname>)?
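
My naive reading is that count(*) counts rows while count(<fieldname>)
only counts non-NULL values of that field, so with an ID that is never
NULL both should return the same number and any difference would be
per-row overhead - is that right?  In SQL terms:

  SELECT count(*)  FROM Hauptdaten_Fall WHERE IstAktuell=20;  -- all matching rows
  SELECT count(ID) FROM Hauptdaten_Fall WHERE IstAktuell=20;  -- matching rows with ID NOT NULL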

> ...
> real    0m6.077 -- 6.606s
>
> and after clustering:
> cluster i1 on Hauptdaten_Fall;
>
> real    0m5.683 - 5.750s
>
> so it's linear growth here
This is what my colleague was afraid of: we would have linear growth,
compared to the log(n) growth that is to be expected from MS SQL Server
(for this particular type of query, and surely only up to some distant
data volume where other constraints start to matter, but we are far from
that limit).  This would not convince him :-(.

Kind regards
        Andreas.


Re: Serious performance problem

From
"Tille, Andreas"
Date:
On Wed, 31 Oct 2001, Horst Herb wrote:

> I have the feeling that your problem is solved best by taking a different
> approach.
> As A. Pilosovs posting pointed out, index coverage is a problem intrinsic to
> the MVCC implementation (IMHO a small price to pay for a priceless feature).
Could somebody explain MVCC to an uneducated man like me?  Is this a
feature which perhaps MS SQL doesn't have and which might be important
in the future?
> I can't see why much effort should go into a brute force method to implement
> index coverage, if your problem can be solved more elegant in a different way.
>
> With the example you posted, it is essentially only simple statistics you
> want to run on tables where the *majority* of records would qualify in your
> query.
> Why not create an extra "statistics" table which is updated automatically
> through triggers in your original table? That way, you will always get
> up-to-date INSTANT query results no matter how huge your original table is.
My problem is to convince my colleague.  I'm afraid that he would consider
such optimisations as "tricks" to work around constraints of the database
server.  He might argue that once MS SQL Server also needs some speed
improvement and he applies the same performance tuning, MS SQL will
outperform PostgreSQL again and we are at the end of our wisdom.  I repeat:
I for myself see the strength of Open Source (Horst, you know me ;-) ) and
I would really love to use PostgreSQL.  But how to prove those arguments
wrong? *This* is my problem.
We have to make a design decision.  My colleague is a mathematician who
preferred MS SQL Server over Oracle some years ago and had good reasons
for it based on estimates of our needs.  He has no problem with UNIX or
anything else, and in theory he is on my side that Open Source is the
better way; he would accept it if it gave the same results as his stuff.
But he has never had performance problems with his databases and knows
people who claim to fill zillions of megabytes into MS SQL Server.
So he doubts the quality of the PostgreSQL server if it has problems in
the first run.  I have to admit that his point of view is easy to
understand.  I would have to prove (!) that we wouldn't run into trouble
with bigger databases and that those things are no "dirty workarounds"
for a weak server.

> And, don't forget that the only way MS SQL can achieve the better performance
> here is through mercilessly hogging ressources. In a complex database
> environment with even larger tables, the performance gain in MS SQL would be
> minimal (my guess).
Unfortunately it is not enough to guess.  He has enough experience to know
that MS SQL Server is fit for the task he wants to solve.  If I tell him:
"*Perhaps* you could run into trouble.", he would just laugh at me, because
I'm in trouble *now* and can't prove that I won't be again.

Kind regards
        Andreas.


Re: Serious performance problem

From
Stephan Szabo
Date:
> My problem is to convince my colleague.  I'm afraid that he would consider
> those optimizing stuff as "tricks" to work around constraints of the
> database server.  He might argue that if it comes to the point that also
> MS SQL server needs some speed improvement and he has to do the same
> performance tuning things MS SQL does outperform PostgreSQL again and we
> are at the end with our wisdom.  I repeat:  I for myself see the strength
> of OpenSource (Horst, you know me ;-) ) and I would really love to use
> PostgreSQL.  But how to prove those arguing wrong? *This* is my problem.
> We have to do a design decision.  My colleague is a mathematician who
> has prefered MS SQL server some years ago over Oracle and had certain
> reasons for it based on estimations of our needs.  He had no problems
> with UNIX or something else and he theoretically is on my side that OpenSource
> is the better way and would accept it if it would give the same results
> as his stuff.
> But he had never had some performance problems with his databases and
> knows people who claim to fill Zillions of Megabytes of MS SQL server.
> So he doubt on the quality of PostgreSQL server if it has problems in
> the first run.  I have to admit that his point of view is easy to
> understand.  I would have to prove (!) that we wouldn't have trouble
> with bigger databases and that those things are no "dirty workarounds"
> of a weak server.
>
> > And, don't forget that the only way MS SQL can achieve the better performance
> > here is through mercilessly hogging ressources. In a complex database
> > environment with even larger tables, the performance gain in MS SQL would be
> > minimal (my guess).
> Unfortunately it is not enough to guess.  He has enough experiences that
> I knows that the MS SQL server is fit for the task he wants to solve.  If
> I tell him: "*Perhaps* you could run into trouble.", he would just laugh
> about me because I'm in trouble *now* and can't prove that I won't be
> again.

The only way to know for certain is to try both at various sizes and see.
Getting numbers for one type of query at one database size tells very
little.  Load a test set that's 100, 1000, whatever times the current size
and see what happens; ISTM anything short of this is fairly meaningless.
At what point does the other person expect to run into problems, how would
he solve them, and how does postgres run at that point with and without
special optimization?
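
One cheap way to build such a test set is to keep doubling the existing
data (just a sketch; it assumes the other columns of Hauptdaten_Fall
either have defaults, like a sequence for ID, or are listed as well):

  -- double the table; run repeatedly for 2x, 4x, 8x ... the original size
  INSERT INTO Hauptdaten_Fall (IstAktuell, MeldeKategorie)
       SELECT IstAktuell, MeldeKategorie
         FROM Hauptdaten_Fall;

  -- refresh planner statistics, then time the query again at each size
  VACUUM ANALYZE Hauptdaten_Fall;

  EXPLAIN
  SELECT MeldeKategorie, Count(ID) AS Anz
    FROM Hauptdaten_Fall
   WHERE IstAktuell = 20
   GROUP BY MeldeKategorie
   ORDER BY MeldeKategorie;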

It's perfectly possible that for the particular queries and load you're
running, MSSQL will be better; there's nothing wrong with that.
Conversely, it's entirely possible that one could find workloads that
postgres is better at.



Re: Serious performance problem

From
Doug McNaught
Date:
"Tille, Andreas" <TilleA@rki.de> writes:

> Could somebody explain MVCC to such an uneducated man like me.  Is this a
> certain feature (which perhaps MS SQL) doesn't have and which might be
> important in the future?

http://www.us.postgresql.org/users-lounge/docs/7.1/postgres/mvcc.html

(Or substitute your favorite mirror)

Only Oracle has anything like it AFAIK.
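
Roughly: readers never block writers and writers never block readers, and
each transaction sees a consistent snapshot of the data.  A tiny
illustration with two psql sessions (serializable mode so the first
transaction keeps its snapshot; the row and value are made up):

  -- session A
  BEGIN;
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  SELECT MeldeKategorie FROM Hauptdaten_Fall WHERE ID = 1;  -- sees the old value

  -- session B, while A is still open (neither session blocks the other)
  UPDATE Hauptdaten_Fall SET MeldeKategorie = 'Test' WHERE ID = 1;

  -- session A again
  SELECT MeldeKategorie FROM Hauptdaten_Fall WHERE ID = 1;  -- still the old value
  COMMIT;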

-Doug
-- 
Let us cross over the river, and rest under the shade of the trees.  --T. J. Jackson, 1863


Re: Serious performance problem

From
"Zeugswetter Andreas SB SD"
Date:
> > so it's linear growth here
> This is what my colleague was afraid of: We would have linear growth
> compared to the log(n) growth which is to be expected on MS SQL server

This is not true, since the index scan also needs to read the leaf pages
in MS SQL.  The number of leaf pages grows linearly with the number of
rows that qualify the WHERE restriction.

R = number of rows that qualify
--> O(R + log(R))

The pg measurements showed that PostgreSQL query performance can be
expected to stay nearly the same regardless of the number of rows in the
table, as long as the number of rows that qualify the WHERE restriction
stays constant.  The response time is linear in the number of rows that
qualify the WHERE restriction, but that linear behavior is also expected
with MS SQL.

Andreas


Re: Serious performance problem

From
"Tille, Andreas"
Date:
On Fri, 2 Nov 2001, Zeugswetter Andreas SB SD wrote:

> This is not true, since the index scan also neads to read the leaf pages
> in MS Sql. The number of leaf pages grows linear with number of rows
> that qualify the where restriction.
>
> R = number of rows that qualify
> --> O(R + log(R))
>
> The pg measurements showed, that PostgreSQL query performance can be
> expected
> to stay nearly the same regardless of number of rows in the table as
> long as
> the number of rows that qualify the where restriction stays constant.
> The response time is linear to the number of rows that qualify the where
>
> restriction, but that linear behavior is also expected with MS Sql.
Well, maybe you are right here, but I talked once more with my colleague
about the specifications.  We can say with confidence that the input of
data is about 1GB, because what has to be stored is fixed by the German
law on infectious diseases.  We have no online shop system or anything
like that. <sarcastic>If the recent anthrax problem were to grow
exponentially we could be in trouble, but chances are low.</sarcastic>
So we have a good chance of estimating the amount of data quite well.  It
is linear growth of 1GB per year.  If MS SQL Server is fast enough now, we
can grow along with the normal increase in hardware performance over the
years.  This is a fact I have to accept.

An additional constraint is that the underlying data model, with an
Access application, is used by about 170 clients, each of which has about
100 - 500 data sets which they export once a week to our central server.
The developers tried hard to keep the Access application and the MS SQL
server solution in sync, and having a third application (by rewriting some
500 queries) would be a lot of work.  (I'm not afraid of this work, but I
must be sure it would make sense before I start, so I hope for advice from
people who perhaps have done so.)

I discussed the issue of using statistics tables to speed up certain
queries.  He told me that this technique is known as OLAP cubes in
MS SQL Server and that there are tools to build such things.  Is this
a valid comparison?  He did not use it because it would break the
Access solution of our clients.  Are there any tools for PostgreSQL for
such stuff, besides creating the tables and triggers manually?

Currently I see two solutions to my problem:

1. Hoping that 'index coverage' is implemented (perhaps by a patch...
   somebody asked about it but got no response) in 7.2 or at least 7.3.
   In this case I would try to do my best with the statistics tables,
   but I could not cope with it if at some stage our data model changed
   and I had to rework all such stuff.

2. Giving MySQL a trial, because I expect it to solve my problem in the
   fashion I need.  (Well - read-only is OK, surely no such features
   like MVCC and thus perhaps faster index scans.)  I would definitely
   come back to PostgreSQL once 'index coverage' or any other method
   to speed up index searches is implemented.
 

Could somebody give any advice on what would be the best strategy?
(Perhaps I should switch back to pgsql-general for this question, but I
definitely want to hear a statement from the hackers about future
implementation plans!)

By the way, in my former postings I forgot to mention a further problem
which stayed unanswered in my questions on pgsql-general: watching "top"
while executing a query (over some 30 seconds), the memory use of
postgresql increases heavily.  I wonder if it could help if there were
some mechanism to keep part of the database resident in memory.  I surely
know that memory handling on Linux/UNIX is different from Win (and this
is a great feature ;-) ), but if I have plenty of free memory (2GB) and
my box wasn't swapping at any time, I wonder if it shouldn't be possible
to hold some information in memory instead of simply relying on the
disk cache of the OS.  Any opinions?
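
The only knobs I have found so far are in postgresql.conf (the numbers
below are just what I am experimenting with, not recommendations):

  # postgresql.conf -- needs a server restart for shared_buffers
  shared_buffers = 8192   # number of 8kB buffers, i.e. 64MB of PostgreSQL's own cache
  sort_mem = 16384        # kB of memory a sort may use before spilling to disk

Is there anything beyond these, or is relying on the OS cache simply the
intended way?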

Kind regards
        Andreas.


Re: Serious performance problem

From
"Andrea Aime"
Date:

Alex Pilosov wrote:
> 
> On Tue, 30 Oct 2001, Antonio Fiol [iso-8859-1] Bonnín wrote:
> 
> > > | > Seems that problem is very simple :))
> > > | > MSSql can do queries from indexes, without using actual table at all.
> > > | > Postgresql doesn't.
> > > | >
> > > | > So mssql avoids sequental scanning of big table, and simply does scan of
> > > | > index which is already in needed order and has very much less size.
> <snip>
> > > | The consequence for my problem is now:  If it is technically possible
> > > | to implement index scans without table lookups please implement it.  If
> The feature you are looking for is called 'index coverage'. Unfortunately,
> it is not easy to implement with Postgresql, and it is one of few
> outstanding 'nasties'. The reason you can't do it is follows: Postgres
> uses MVCC, and stores 'when' the tuple is alive inside the tuple. So, even
> if index contains all the information you need, you still need to access
> main table to check if the tuple is valid.
> 
> Possible workaround: store tuple validity in index, that way, a lot more
> space is wasted (16 more bytes/tuple/index), and you will need to update
> all indices when the base table is updated, even if indexed information
> have not changed.
> 

Maybe just a silly idea, but wouldn't it be possible (and useful)
to store tuple validity in a separate bitmap file, with every bit
reporting the validity of the corresponding tuple?  It would grow
linearly, but at least it would be very small compared to the actual
data...
Best regards
Andrea Aime


Licensing issues including another projects source code into the jdbc driver

From
"Dave Cramer"
Date:
I am contemplating including log4jme source code into the jdbc driver.
Who would be the best person to contact wrt ironing out the licensing
issues?

Dave



Re: Serious performance problem

From
Hannu Krosing
Date:
Andrea Aime wrote:
> 
> Alex Pilosov wrote:
> >
> > On Tue, 30 Oct 2001, Antonio Fiol [iso-8859-1] Bonnín wrote:
> >
> > > > | > Seems that problem is very simple :))
> > > > | > MSSql can do queries from indexes, without using actual table at all.
> > > > | > Postgresql doesn't.
> > > > | >
> > > > | > So mssql avoids sequental scanning of big table, and simply does scan of
> > > > | > index which is already in needed order and has very much less size.
> > <snip>
> > > > | The consequence for my problem is now:  If it is technically possible
> > > > | to implement index scans without table lookups please implement it.  If
> > The feature you are looking for is called 'index coverage'. Unfortunately,
> > it is not easy to implement with Postgresql, and it is one of few
> > outstanding 'nasties'. The reason you can't do it is follows: Postgres
> > uses MVCC, and stores 'when' the tuple is alive inside the tuple. So, even
> > if index contains all the information you need, you still need to access
> > main table to check if the tuple is valid.
> >
> > Possible workaround: store tuple validity in index, that way, a lot more
> > space is wasted (16 more bytes/tuple/index), and you will need to update
> > all indices when the base table is updated, even if indexed information
> > have not changed.
> >
> 
> Maybe just a silly idea, but would'nt it be possible (and useful)
> to store tuple validity in a separate bitmap file, that reports in every
> bit the validity of the corresponding tuple? It would grow linearly, but
> at least it would be very small compared to the actual data...

I see two problems with this approach:

1. Tuple validity is different for different transactions running
concurrently.

We could still cache the death-transaction ids of tuples _in memory_
quite cheaply time-wise, but I'm not sure how big a win it would be in
general.

2. There is no easy way to know which bit corresponds to which tuple, as
each database page can contain an arbitrary number of tuples (this one is
easier, as we can use a somewhat sparse bitmap that is less
space-efficient).

------------
Hannu


Re: Serious performance problem

From
Horst Herb
Date:
On Monday 05 November 2001 03:24, Tille, Andreas wrote:

> I discussed the issue of using statistics tables to speed up certain
> queries.  He told me that this technique is known as OLAP cubes in
> MS SQL server and that there are tools to build such things.  Is this
> a valid comparison?  He did not use it because it would disable the
> access solution of our clients.  Are there any tools for PostgreSQL for
> such stuff besides the manual creating tables and triggers?

I still don't understand your guy.  Knowing that the table (and with it
the performance demands) will grow, it is quite stubborn and certainly not
elegant to insist on the blunt query instead of a smart solution.  The
smart solution as outlined always returns results instantly and needs next
to no memory or other resources compared to the blunt query, regardless of
the growth of your database.  It would only impact the growth *rate* due
to the fired triggers, but then, your application does not seem to have a
heavy insert load anyway, and you could always queue the inserts with
middleware as you have no realtime demands.

Btw, what is wrong with creating a few tables and a few trigger functions
"manually"?  Writing, testing, and debugging them should not cost more
than a couple of days.  Why would I want a tool for it?  I might spend a
couple of hours writing a python script if I needed similar triggers for
many tables over and over again, but your problem does not seem to call
for this.
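
To make it concrete, a rough sketch of what I mean (plpgsql has to be
installed with createlang; table and column names are taken from Andreas'
query; only the INSERT path is shown, UPDATE/DELETE need analogous
triggers; 'RETURNS opaque' is the 7.x spelling, newer releases spell it
'RETURNS trigger'):

  -- summary table, kept in step with Hauptdaten_Fall by a trigger
  CREATE TABLE Statistik_MeldeKategorie (
      MeldeKategorie varchar PRIMARY KEY,
      Anz            integer NOT NULL
  );

  CREATE FUNCTION statistik_insert() RETURNS opaque AS '
  DECLARE
      dummy integer;
  BEGIN
      IF NEW.IstAktuell = 20 THEN
          SELECT Anz INTO dummy
            FROM Statistik_MeldeKategorie
           WHERE MeldeKategorie = NEW.MeldeKategorie;
          IF FOUND THEN
              UPDATE Statistik_MeldeKategorie
                 SET Anz = Anz + 1
               WHERE MeldeKategorie = NEW.MeldeKategorie;
          ELSE
              INSERT INTO Statistik_MeldeKategorie (MeldeKategorie, Anz)
                   VALUES (NEW.MeldeKategorie, 1);
          END IF;
      END IF;
      RETURN NEW;
  END;
  ' LANGUAGE 'plpgsql';

  CREATE TRIGGER statistik_on_insert
      AFTER INSERT ON Hauptdaten_Fall
      FOR EACH ROW EXECUTE PROCEDURE statistik_insert();

  -- the slow aggregate then becomes a trivial lookup:
  SELECT MeldeKategorie, Anz
    FROM Statistik_MeldeKategorie
   ORDER BY MeldeKategorie;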

Horst

