Thread: Serious performance problem
Hello,

I discussed a problem concerning the speed of PostgreSQL compared to MS SQL Server at length on the pgsql-general list. The thread starts with this message:

        http://fts.postgresql.org/db/mw/msg.html?mid=1035557

Now I tried a snapshot of version 7.2 and got a speed increase of about a factor of 2. But sorry, this is really not enough. Even the very simple test I pointed to in my mail is much too slow, and the issue would probably sink the whole project, which should be a complete open source solution and might otherwise end up on M$ stuff. I've come under heavy pressure from my employer, who was talking about the nice world of MS .NET (while he is using MS-SQL exclusively). To make things clear: the issue is the gal database of infectious diseases in Germany run by the Robert Koch-Institute. So this beast could be of some importance for increasing the acceptance of PostgreSQL and Open Source in the field of medicine, which is generally known for the money involved in it. So I really hope that some skilled programmers can find a good way to solve the performance issue, perhaps by just profiling the simple query

        SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
          FROM Hauptdaten_Fall
         WHERE (((Hauptdaten_Fall.IstAktuell)=20))
         GROUP BY Hauptdaten_Fall.MeldeKategorie
         ORDER BY Hauptdaten_Fall.MeldeKategorie;

against the data set I put at

        http://www.foodborne-net.de/~tillea/ifsg/ifsgtest.dump.bz2

If this took less than half a second on a modern PC, I could continue to try more realistic queries.

I really hope that I can readjust the image of PostgreSQL in the eyes of my M$-centered colleagues.

Kind regards

        Andreas.
Seems that the problem is very simple :))

MSSQL can do queries from indexes, without using the actual table at all. PostgreSQL doesn't.

So MSSQL avoids sequential scanning of the big table, and simply does a scan of the index, which is already in the needed order and is very much smaller.

On Mon, Oct 29, 2001 at 01:43:37PM +0100, Tille, Andreas wrote:
> Hello,
>
> I discussed a problem concerning the speed of PostgreSQL compared to
> MS SQL server heavily on postgres-general list.
<snip>
Andreas -

I took a look at your problem, since I'm sort of in the field and would like to see free solutions spread as well. Here's what I see: your example touches on what can be an Achilles heel for pgsql's current statistical analyzer: selection on data fields that have a few common values. Often the indices don't get used, since a large fraction of the table needs to be scanned in any case. In your example, fully 68% of the table fits the WHERE condition.

Here are some timing results on my machine.

Your dataset and query, as written:

        real    0m25.272s
        user    0m0.090s
        sys     0m0.050s

Creating an index on meldekategorie, and forcing its use with "set enable_seqscan = off":

        real    0m14.743s
        user    0m0.070s
        sys     0m0.050s

Same, with index on istaktuell:

        real    0m26.511s
        user    0m0.050s
        sys     0m0.060s

Now, with an index on both meldekategorie and istaktuell:

        real    0m7.179s
        user    0m0.060s
        sys     0m0.030s

I think we have a winner. No, it's not sub-second, but I improved the time by 3x just by trying some indices. Note that I _still_ had to force the use of indices for this one. It's also the first time I've personally seen a query/dataset that benefits this much from a two-key index.

As another poster replied to you, there is a limitation in postgresql's use of indices that arises from MVCC: even if the only data requested is that stored in the index itself, the backend must visit the actual tuple in the table to ensure that it is 'visible' to the current transaction.

How realistic a representation of your real workload is this query? Realize that more selective, complex queries are where pgsql shines compared to other RDBMSes: the 'fast table scanner' type of query that you proposed as your test doesn't really let pgsql stretch its legs. Do you have example timings from MS-SQL or others?

Ross

On Mon, Oct 29, 2001 at 01:43:37PM +0100, Tille, Andreas wrote:
> Hello,
>
> I discussed a problem concerning the speed of PostgreSQL compared to
> MS SQL server heavily on postgres-general list.
<snip>
--
Ross Reedstrom, Ph.D.                                 reedstrm@rice.edu
Executive Director                                  phone: 713-348-6166
Gulf Coast Consortium for Bioinformatics              fax: 713-348-6182
Rice University MS-39
Houston, TX 77005
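[Editor's note: Ross does not show the exact statements behind those timings, so the following is only a hedged reconstruction against the posted dump. The index names and the column order of the two-key index are guesses.]

        -- Hypothetical reconstruction of the index experiment (names invented).
        CREATE INDEX hf_meldekategorie_idx ON hauptdaten_fall (meldekategorie);
        CREATE INDEX hf_istaktuell_idx     ON hauptdaten_fall (istaktuell);
        CREATE INDEX hf_both_idx           ON hauptdaten_fall (istaktuell, meldekategorie);
        VACUUM ANALYZE hauptdaten_fall;

        SET enable_seqscan = off;   -- force the planner to consider the indexes
        EXPLAIN
        SELECT meldekategorie, count(id) AS anz
          FROM hauptdaten_fall
         WHERE istaktuell = 20
         GROUP BY meldekategorie
         ORDER BY meldekategorie;

The wall-clock numbers above would then come from timing the client at the shell, e.g. something like "time psql ifsgtest -f query.sql > /dev/null" (database name assumed).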
Hello Andreas,

A possible solution would be:

CREATE TABLE foo AS
SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall
WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie
ORDER BY Hauptdaten_Fall.MeldeKategorie;

This is suitable if your data does not change often. To get automatic updates:

1) Define iPrecision, the precision that you need (an integer).
2) Create a trigger which increments a counter when a record is updated or inserted. When the counter reaches iPrecision, do a DROP TABLE foo + CREATE TABLE foo AS SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID)... This will take a few seconds, but only once. Run a batch script on a schedule (1 hour, 4 hours, 1 day?) so a human user has very little chance of reaching iPrecision.

On 300,000 records you will get instant results. There are plenty of tricks like this one. If you employ them, you will ***never*** reach the limits of a dual Pentium III computer with U3W discs.

If you need to answer this message, please reply on pgsql-general@postgresql.org.

Cheers,
Jean-Michel POURE

At 13:43 29/10/01 +0100, you wrote:
>Hello,
>
>I discussed a problem concerning the speed of PostgreSQL compared to
>MS SQL server heavily on postgres-general list.
<snip>
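[Editor's note: a sketch of what the periodic refresh Jean-Michel describes could look like as a cron-driven script; the summary table name "foo" follows his example, everything else is an assumption, not code from the thread.]

        -- refresh_summary.sql, run from cron after each bulk load.
        DROP TABLE foo;
        CREATE TABLE foo AS
            SELECT MeldeKategorie, count(ID) AS Anz
              FROM Hauptdaten_Fall
             WHERE IstAktuell = 20
             GROUP BY MeldeKategorie;

        -- The application then reads the small summary table instead of the big one:
        SELECT MeldeKategorie, Anz FROM foo ORDER BY MeldeKategorie;

Run from cron with something like "psql ifsgtest -f refresh_summary.sql" (database name assumed).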
On Mon, 29 Oct 2001, Jean-Michel POURE wrote:

> A possible solution would be:
> CREATE TABLE foo AS
> SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
> FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
> GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY
> Hauptdaten_Fall.MeldeKategorie;
Sorry, this is NO solution to my problem.

> On 300,000 records you will get instant results. There are plenty of
> tricks like this one. If you employ them, you will ***never*** reach the
> limits of a dual Pentium III computer with U3W discs.
It is really no help if I solve the speed issue of this *very simple, zeroth-order try*. I repeat: I have plenty of queries which do much more complicated stuff than this. This one is just a crude strip-down of the problem, fit for debugging/profiling issues of the database *server*. Simple tricks on a simple example do not help.

> If you need to answer this message, please reply on
> pgsql-general@postgresql.org.
No, because ...

> >I discussed a problem concerning the speed of PostgreSQL compared to
> >MS SQL server heavily on postgres-general list. The thread starts with
> >message
> >
> >     http://fts.postgresql.org/db/mw/msg.html?mid=1035557
I did so and got the explicit advice from Tom to ask here. Consider the problem a benchmark. I would love to see PostgreSQL as the winner.

Kind regards

        Andreas.
On Mon, 29 Oct 2001, Ross J. Reedstrom wrote:

> Here's what I see: your example touches on what can be an Achilles
> heel for pgsql's current statistical analyzer: selection on data fields
> that have a few common values. Often the indices don't get used, since
> a large fraction of the table needs to be scanned in any case. In
> your example, fully 68% of the table fits the WHERE condition.
> ...
> I think we have a winner. No, it's not sub-second, but I improved the time
> by 3x just by trying some indices. Note that I _still_ had to force the
> use of indices for this one. It's also the first time I've personally seen
> a query/dataset that benefits this much from a two-key index.
This is true for this example, and I played with indices as you did. I also enforced the index scan and compared it with forbidding the index scan. The result on my more realistic examples was that both versions performed about the same. There was no *real* difference. For sure, in this simple query there is a difference, but the real examples showed only a 2% - 5% speed increase (if not slower when enforcing index scans!).

> As another poster replied to you, there is a limitation in postgresql's
> use of indices that arises from MVCC: even if the only data requested is
> that stored in the index itself, the backend must visit the actual tuple
> in the table to ensure that it is 'visible' to the current transaction.
Any possibility to switch this off temporarily for certain queries like this one, if the programmer can make sure that it is not necessary? Just a stupid idea from a bloody uneducated man in database engineering.

> How realistic a representation of your real workload is this query? Realize
> that more selective, complex queries are where pgsql shines compared to
> other RDBMSes: the 'fast table scanner' type of query that you proposed as
> your test doesn't really let pgsql stretch its legs. Do you have example
> timings from MS-SQL or others?
Unfortunately, the four tests we did here all seemed to suffer from the same problem. The situation is that there is a given database structure which was developed over more than a year on MS-SQL and has an Access GUI. Now parts of the UI should be made public via the web (I want to use Zope), and I just imported the data and ran some example queries, with the terribly slow results.

Kind regards and thanks for your ideas

        Andreas.
"Tille, Andreas" wrote: > > On Mon, 29 Oct 2001, Ross J. Reedstrom wrote: > > > Here's what I see: Your example touches on what can be an achilles > > heel for pgsql's current statistical analyzer: selection on data fields > > that have a few common values. Often, the indices don't get used, since > > a large fraction of the table needs to be scanned, in any case. In > > your example, fully 68% of the table fits the where condition. > > ... > > > > I think we have a winner. No it's not sub-second, but I improved the time > > by 3x just by trying some indices. Note that I _still_ had to force the > > use of indices for this one. It's also the first time I've personally seen > > a query/dataset that benefits this much from a two-key index. > This is true for this example and I also played with indices as you. I also > enforced the index scan and compared with forbidding the index scan. The > result was on my more realistic examples that both versions performed quite > the same. There was no *real* difference. For sure in this simple query there > is a difference but the real examples showed only 2% - 5% speed increase > (if not slower with enforcing index scans!). I studied his dataset and found that a simple count(*) on whole table took 1.3 sec on my Celeron 375 so I'm sure that the more complex query, which has to visit 2/3 of tuples will not be able to execute under 1 sec My playing with indexes / subqueries and query rewriting got the example query (actually a functional equivalent) to run in ~5 sec with simple aggregate(group(indexscan))) plan and I suspect that this is how fast it will be on my hardware It could probably be soon possible to make it run in ~ 1.5 by using an aggregate function that does a sequential scan and returns a rowset. > > As another poster replied to you, there is limitation with postgresql's > > use of indices that arises from MVCC: even if the only data requested is > > that stored in the index itself, the backend must visit the actual tuple > > in the table to ensure that it is 'visible' to the current transaction. > Any possibility to switch of this temporarily for certain queries like this > if the programmer could make sure that it is not necessary? Just a stupid > idea from a bloody uneducated man in database-engeniering. There have been plans to set aside a bit in index that would mark the deleted tuple. Unfortunately this helps only in cases when there are many deleted tuples and all live tuples have to be checked anyway ;( -------------- Hannu
On Mon, 29 Oct 2001, Vsevolod Lobko wrote:

> Seems that the problem is very simple :))
> MSSQL can do queries from indexes, without using the actual table at all.
> PostgreSQL doesn't.
>
> So MSSQL avoids sequential scanning of the big table, and simply does a scan
> of the index, which is already in the needed order and is very much smaller.
Hmmm, could anyone imagine a *solution* to the problem, simple or not? I'm thinking of some switch the database programmer could use if he really knows what he is doing.

Kind regards

        Andreas.
On Mon, 29 Oct 2001, Vsevolod Lobko wrote:

> Seems that the problem is very simple :))
> MSSQL can do queries from indexes, without using the actual table at all.
> PostgreSQL doesn't.
>
> So MSSQL avoids sequential scanning of the big table, and simply does a scan
> of the index, which is already in the needed order and is very much smaller.
I forwarded this information to my colleague and he replied the following (I'm translating from German into English):

hc> I expected this problem. But what is the purpose of an index: not
hc> to have to look into the table itself. Moreover, this means that the
hc> expense grows linearly with the table size - no good prospect at all
hc> (the good thing is it is not exponential :-)).

I have to explain that we are at the *beginning* of the production process. We expect a lot more data.

hc> In the case of real index usage the expense grows only with log(n).
hc> No matter about the better philosophy of database servers, MS-SQL-Server
hc> has consistent index usage and so it is very fast at many queries.
hc> When performing a query on a field without an index, I get a slow
hc> table scan. This is like measuring the speed of the harddisk and
hc> the cleverness of the cache.

The consequence for my problem is now: if it is technically possible to implement index scans without table lookups, please implement it. If not, we just have to look for another database engine which does so, because our application really needs the speed on this type of query. I repeat from my initial posting: the choice of the server for our application could have importance for many projects in the field of medicine in Germany. I really hope that there is a reasonable solution which could perhaps strike a balance between safety and speed. For example, I can assure in my application that the index, once created, will be valid, because I just want to read in a new set of data once a day (from the MS-SQL server, which collects data over the day). So I could recreate all indices after the import, and the database is read-only until the next cron job. Is there any chance to speed up such applications?

Kind regards

        Andreas.
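[Editor's note: a sketch of the nightly job Andreas describes; the object names and the import step are invented for illustration, only the recreate-indexes-after-import idea comes from the mail.]

        -- Hypothetical nightly refresh, run once a day from cron.
        DROP INDEX hf_ia_mk_idx;                          -- drop before the bulk load so inserts stay cheap
        COPY hauptdaten_fall FROM '/path/to/msql_export'; -- placeholder for the real import from MS-SQL
        CREATE INDEX hf_ia_mk_idx ON hauptdaten_fall (istaktuell, meldekategorie);
        VACUUM ANALYZE hauptdaten_fall;                   -- refresh planner statistics; the DB is read-only until the next run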
> I also
> enforced the index scan and compared it with forbidding the index scan. The
> result on my more realistic examples was that both versions performed about
> the same. There was no *real* difference. For sure, in this simple query
> there is a difference, but the real examples showed only a 2% - 5% speed
> increase (if not slower when enforcing index scans!).

You could somewhat speed up the query if you avoid having the sort hit the disk. A simple test here showed that you need somewhere near sort_mem = 15000 in postgresql.conf.

Andreas
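[Editor's note: for reference, sort_mem is given in kilobytes, so 15000 is roughly 15 MB per sort; it can go into postgresql.conf or be set just for the session running the report. A sketch, not a tuned recommendation.]

        # postgresql.conf
        sort_mem = 15000          # kilobytes, i.e. about 15 MB available to each sort

        -- or only for the current session, right before the query:
        SET sort_mem = 15000;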
On 30 Oct 2001 at 11:44 (+0100), Tille, Andreas wrote:
| On Mon, 29 Oct 2001, Vsevolod Lobko wrote:
<snip>
| The consequence for my problem is now: if it is technically possible
| to implement index scans without table lookups, please implement it. If
| not, we just have to look for another database engine which does so,
| because our application really needs the speed on this type of query.
| For example, I can assure in my application that the index, once created,
| will be valid, because I just want to read in a new set of data once
| a day (from the MS-SQL server, which collects data over the day). So
| I could recreate all indices after the import, and the database is
| read-only until the next cron job. Is there any chance to speed up
| such applications?

CREATE INDEX idx_meldekategorie_hauptdaten_f
    ON hauptdaten_fall(meldekategorie);
CLUSTER idx_meldekategorie_hauptdaten_f ON hauptdaten_fall;

Aggregate  (cost=5006.02..5018.90 rows=258 width=16)
  ->  Group  (cost=5006.02..5012.46 rows=2575 width=16)
        ->  Sort  (cost=5006.02..5006.02 rows=2575 width=16)
              ->  Seq Scan on hauptdaten_fall  (cost=0.00..4860.12 rows=2575 width=16)

This looks much nicer, but is still quite slow. I'm quite sure the slowness is in the sort(), since all queries that don't sort return quickly. I had hoped the clustered index would speed up the sort, but that is not the case.

It _seems_ a simple optimization would be to not (re)sort the tuples when using a clustered index:

  if( the_column_to_order_by_is_clustered ){
      if( order_by_is_DESC )
          // reverse the tuples to handle
  }

I haven't looked at the code to see if this is even feasible, but I do imagine there is enough info available to avoid an unnecessary sort on the CLUSTERed index. The only problem I see with this is if the CLUSTERed index is not kept in a CLUSTERed state as more records are added to this table.

brent

--
  "Develop your talent, man, and leave the world something. Records are
  really gifts from people. To think that an artist would love you enough
  to share his music with anyone is a beautiful thing."  -- Duane Allman
On Tue, 30 Oct 2001, Zeugswetter Andreas SB SD wrote:

> You could somewhat speed up the query if you avoid having the sort
> hit the disk. A simple test here showed that you need somewhere
> near sort_mem = 15000 in postgresql.conf.
Well, these are the usual hints from pgsql-general. I did so and increased step by step to:

        shared_buffers = 131072
        sort_mem = 65536

This led to a doubling of speed in my tests, but these are settings where adding more memory doesn't result in any further speed increase. When I posted my question here, I was already talking about this "tuned" PostgreSQL server. The default settings were even worse!

Kind regards

        Andreas.
On Tue, 30 Oct 2001, Jean-Michel POURE wrote:

> Is your database read-only?
Daily update from the MS-SQL server. Between updates it is read-only.

> Good point, sorry to insist your problem is
> software optimization. In your case, the database may climb up to 200
> million rows (1000 days x 200.000 rows). What are you going to do then? Buy
> a 16 Itanium computer with 10 Gb RAM and MS SQL Server licence. Have a
> close look at your problem. How much time does it get MS SQL Server to
> query 200 million rows ? The problem is not in choosing MS SQL or
> PostgreSQL ...
The problem is, for sure. If one server is 10 to 30 times faster for the very same tasks, and chances are high that it scales better for the next orders of magnitude our data will grow into over the next years because of real index usage (see postings on the hackers list), then the decision is easy. My colleague made sure that MS SQL server is fit for the next years, and I can only convince him if another server has comparable speed *for the same task*.

> If you are adding 200.000 rows data everyday, consider using a combination
I do not add that much.

> of CREATE TABLE AS to create a result table with PL/pgSQL triggers to
> maintain data consistency. You will then get instant results, even on 2
> billion rows because you will always query the result table; not the
> original one. Large databases are always optimized this way because, even
> in case of smart indexes, there are things (like your problem) that need
> *smart* optimization.
>
> Do you need PL/pgSQL source code to perform a test on 2 billion rows? If
> so, please email me on pgsql-general and I will send you the code.
I really believe that there are many problems in the world that fall into this category, and you are completely right. My colleague is a database expert (I consider myself a beginner) and he made sure that performance is no issue for the next couple of years. So what? Spend hours optimizing things that work perfectly? Why not ask the PostgreSQL authors to optimize the server so that this very same task performs comparably?????? If afterwards we need further database optimization because of further constraints, I'm the first who will start it. But there must be server code in the world that is able to answer the example query that fast. This is proven!

Kind regards

        Andreas.
AFAIK, sorting is necessary even when you have CLUSTERed a table using an index.

Somewhere in the docs I read something like "CLUSTER reorders the table on disk so that entries close together in the index are close together on disk" (obviously written in better English ;-)

But if you INSERT a single row later, it will NOT get inserted into the right place. So the SORT is still necessary. MAYBE, but I am not sure at all, the sort may take less "real" time than if the table were not CLUSTERed, as the table is "nearly" sorted. Hackers, is the sorting algorithm capable of exiting at the very moment the table is sorted, or are some extra passes always calculated?

Good luck!

Antonio

Brent Verner wrote:
> On 30 Oct 2001 at 11:44 (+0100), Tille, Andreas wrote:
<snip>
> This looks much nicer, but is still quite slow. I'm quite sure the
> slowness is in the sort(), since all queries that don't sort return
> quickly.
On Tue, 30 Oct 2001, Antonio Fiol Bonnín wrote:

> AFAIK, sorting is necessary even when you have CLUSTERed a table using an index.
Sorting is not the performance constraint in my example. Just leave out the sorting and see what happens ...

> But if you INSERT a single row later, it will NOT get inserted into the right
> place. So the SORT is still necessary.
Well, rearranging the database in a cron job after inserting new data once a day over night would be possible - but I doubt that it makes a big difference.

Kind regards

        Andreas.
On Tue, 30 Oct 2001, Antonio Fiol Bonnín wrote:

> > | > Seems that the problem is very simple :))
> > | > MSSQL can do queries from indexes, without using the actual table at all.
> > | > PostgreSQL doesn't.
> > | >
> > | > So MSSQL avoids sequential scanning of the big table, and simply does a
> > | > scan of the index, which is already in the needed order and is very much
> > | > smaller.
<snip>
> > | The consequence for my problem is now: if it is technically possible
> > | to implement index scans without table lookups, please implement it.

The feature you are looking for is called 'index coverage'. Unfortunately, it is not easy to implement with PostgreSQL, and it is one of the few outstanding 'nasties'. The reason you can't do it is as follows: Postgres uses MVCC, and stores 'when' the tuple is alive inside the tuple. So, even if the index contains all the information you need, you still need to access the main table to check whether the tuple is valid.

Possible workaround: store tuple validity in the index. That way, a lot more space is wasted (16 more bytes/tuple/index), and you will need to update all indices when the base table is updated, even if the indexed information has not changed.

Fundamentally, this may be necessary anyway, to make index handlers aware of transactions and tuple validity (currently, if you have a unique index, you may have conflicts when different transactions attempt to insert conflicting data, _at the time of insert, not at time of commit_).

-alex
On Tue, 30 Oct 2001, Alex Pilosov wrote:

> The feature you are looking for is called 'index coverage'. Unfortunately,
> it is not easy to implement with PostgreSQL, and it is one of the few
> outstanding 'nasties'. The reason you can't do it is as follows: Postgres
> uses MVCC, and stores 'when' the tuple is alive inside the tuple. So, even
> if the index contains all the information you need, you still need to access
> the main table to check whether the tuple is valid.
Well, I do not fully understand that stuff, but I get a feeling for the problem. Thanks for the explanation.

> Possible workaround: store tuple validity in the index. That way, a lot more
> space is wasted (16 more bytes/tuple/index), and you will need to update
> all indices when the base table is updated, even if the indexed information
> has not changed.
This would be acceptable for *my* special application, but I'm afraid this could be a problem for others.

> Fundamentally, this may be necessary anyway, to make index handlers aware
> of transactions and tuple validity (currently, if you have a unique index,
> you may have conflicts when different transactions attempt to insert
> conflicting data, _at the time of insert, not at time of commit_).
As I said, all this wouldn't be a problem for my application. I just run a sequential insert of data each night. Then the database is read-only.

Does anybody see a chance that 'index coverage' will be implemented in 7.2? This would be a crucial feature for my application. If it will not happen in a reasonable time frame, I would have to look for an alternative database server. Does anybody know something about MySQL or Interbase?

Kind regards

        Andreas.
> Does anybody see a chance that 'index coverage' will be implemented in
> 7.2? This would be a crucial feature for my application. If it will
> not happen in a reasonable time frame, I would have to look for an
> alternative database server. Does anybody know something about MySQL or
> Interbase?

Since I don't remember anyone mentioning working on it here, and 7.2 just went into beta, I don't think it's likely. If you want to push, you may be able to convince someone for 7.3.
On Wednesday 31 October 2001 03:13, you wrote:
> On Tue, 30 Oct 2001, Alex Pilosov wrote:
> As I said, all this wouldn't be a problem for my application. I just
> run a sequential insert of data each night. Then the database is read-only.
>
> Does anybody see a chance that 'index coverage' will be implemented in
> 7.2? This would be a crucial feature for my application. If it will

Andreas,

I have the feeling that your problem is best solved by taking a different approach. As A. Pilosov's posting pointed out, index coverage is a problem intrinsic to the MVCC implementation (IMHO a small price to pay for a priceless feature). I can't see why much effort should go into a brute-force method to implement index coverage if your problem can be solved more elegantly in a different way.

With the example you posted, it is essentially only simple statistics you want to run, on tables where the *majority* of records would qualify in your query. Why not create an extra "statistics" table which is updated automatically through triggers on your original table? That way, you will always get up-to-date INSTANT query results no matter how huge your original table is.

And don't forget that the only way MS SQL can achieve the better performance here is through mercilessly hogging resources. In a complex database environment with even larger tables, the performance gain in MS SQL would be minimal (my guess).

Horst
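[Editor's note: a sketch of the trigger-maintained statistics table Horst describes. The names are invented, only INSERT is handled (UPDATE/DELETE would need analogous triggers), and the PL/pgSQL uses the modern RETURNS trigger / dollar-quoting form rather than the RETURNS opaque style of the 7.x releases discussed here.]

        -- Summary table keyed by the category the report groups on.
        CREATE TABLE stat_meldekategorie (
            meldekategorie text PRIMARY KEY,
            anz            integer NOT NULL DEFAULT 0
        );

        -- Seed it once from the existing data.
        INSERT INTO stat_meldekategorie
            SELECT meldekategorie, count(id)
              FROM hauptdaten_fall
             WHERE istaktuell = 20
             GROUP BY meldekategorie;

        -- Keep it current on every insert into the big table.
        CREATE FUNCTION stat_meldekategorie_ins() RETURNS trigger AS $$
        BEGIN
            IF NEW.istaktuell = 20 THEN
                UPDATE stat_meldekategorie
                   SET anz = anz + 1
                 WHERE meldekategorie = NEW.meldekategorie;
                IF NOT FOUND THEN
                    INSERT INTO stat_meldekategorie VALUES (NEW.meldekategorie, 1);
                END IF;
            END IF;
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER stat_meldekategorie_trg
            AFTER INSERT ON hauptdaten_fall
            FOR EACH ROW EXECUTE PROCEDURE stat_meldekategorie_ins();

        -- The report then becomes a read of a tiny table:
        SELECT meldekategorie, anz FROM stat_meldekategorie ORDER BY meldekategorie;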
>Why not create an extra "statistics" table which is updated automatically
>through triggers on your original table? That way, you will always get
>up-to-date INSTANT query results no matter how huge your original table is.
>
>And don't forget that the only way MS SQL can achieve the better performance
>here is through mercilessly hogging resources. In a complex database
>environment with even larger tables, the performance gain in MS SQL would be
>minimal (my guess).

Definitely. This is a design optimization problem, not an index problem.
Alex Pilosov wrote:
<snip>
> Possible workaround: store tuple validity in the index. That way, a lot more
> space is wasted (16 more bytes/tuple/index), and you will need to update
> all indices when the base table is updated, even if the indexed information
> has not changed.

AFAIK you will need to update all indexes anyway, as MVCC changes the location of the new tuple.

-------------
Hannu
"Tille, Andreas" wrote: > > On Tue, 30 Oct 2001, Alex Pilosov wrote: > > > The feature you are looking for is called 'index coverage'. Unfortunately, > > it is not easy to implement with Postgresql, and it is one of few > > outstanding 'nasties'. The reason you can't do it is follows: Postgres > > uses MVCC, and stores 'when' the tuple is alive inside the tuple. So, even > > if index contains all the information you need, you still need to access > > main table to check if the tuple is valid. > Well, I do not fully understand that stuff, but I get a feeling of the > problem. Thanks for the explanation. > > > Possible workaround: store tuple validity in index, that way, a lot more > > space is wasted (16 more bytes/tuple/index), and you will need to update > > all indices when the base table is updated, even if indexed information > > have not changed. > This would be acceptable for *my* special application but IŽm afraid > this could be a problem for others. > > > Fundamentally, this may be necessary anyway, to make index handlers aware > > of transactions and tuple validity (currently, if you have unique index, > > you may have conflicts when different transactions attempt to insert > > conflicting data, _at the time of insert, not at time of commit_). > As I said all this woulnŽt be a problem for my application. I just > run a sequential insert of data each night. Then the database is read only. > > Does anybody see chances that 'index coverage' would be implemented into > 7.2. This would be a cruxial feature for my application. If it will > not happen in a reasonable time frame I would have to look for > alternative database server. Anybody knows something about MySQL or > Interbase? If it is static data and simple queries then there is fairly good chance that MySQL is a good choice . As fo the other two opensource databases (Interbase and SAPDB (a modyfied version of ADABAS released under GPL by SAP - http://www.sapdb.com/) I have no direct experience. I occasionally read sapdb mailing list, and I've got an impression that it is quite usable and stable DB once you have set it up. Setting up seems order(s) of magnitude harder than for PostgreSQL or MySQL. Weather it actually runs full-table aggregates faster than PG is a thing I can't comment on, but you could get some of their people to do the benchmarking for you if you send them an advocacy-urging request, like I'd switch if you show me that yur dbis fast enough ;) ------------------- Hannu
"Tille, Andreas" wrote: > > Hello, > > I discussed a problem concerning the speed of PostgreSQL compared to > MS SQL server heavily on postgres-general list. The thread starts with > message > > http://fts.postgresql.org/db/mw/msg.html?mid=1035557 > > Now I tried a snapshot of version 7.2 and got an increase of speed of > about factor 2. But sorry this is really not enough. The very simple > test I pointed to in my mail is even much to slow and the issue would > probably spoil down the whole project which should be a complete open > source solution and would perhaps and in any M$ stuff. I´ve got under > heavy preasur from my employer who was talking about the nice world > of MS .net (while he is using MS-SQL exclusively). To make the thing > clear the issue is the gal database of infectious diseases in Germany > runned by the Robert Koch-Institute. So the beast could be of some > importance for increasing the acceptance of PostgreSQL and Open Source > in the field of medicine which is generally known for the money which > is involved in. So I really hope that some skilled programmers would > be able to find a good way to solve the performance issue perhaps by > just profiling the simple query > > SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20)) > GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie; > > to the data set I put on > > http://www.foodborne-net.de/~tillea/ifsg/ifsgtest.dump.bz2 > > If this should take less than half a second on a modern PC I could > continue to try mo realistic queries. I tried some more on optimizing the query on my work computer (AMD ATHLON 850, 512MB, PostgreSQL 7.1.3 with default memory settings) SELECT MeldeKategorie, Count(ID) AS Anz FROM Hauptdaten_Fall WHERE IstAktuell=20 GROUP BY MeldeKategorie ORDER BY MeldeKategorie; real 0m9.675s create index i1 on Hauptdaten_Fall(IstAktuell,MeldeKategorie); ---------------------------- set enable_seqscan = off; SELECT MeldeKategorie, Count(ID) AS Anz FROM Hauptdaten_Fall WHERE IstAktuell=20 GROUP BY MeldeKategorie ORDER BY MeldeKategorie; Aggregate (cost=4497.30..4510.18 rows=258 width=16) -> Group (cost=4497.30..4503.74 rows=2575 width=16) -> Sort (cost=4497.30..4497.30 rows=2575 width=16) -> Index Scan using i1 on hauptdaten_fall (cost=0.00..4351.40 rows=2575 width=16) real 0m7.131s --------------------------- set enable_seqscan = off; SELECT MeldeKategorie, Count(ID) AS Anz FROM Hauptdaten_Fall WHERE IstAktuell=20 GROUP BY IstAktuell,MeldeKategorie ORDER BY IstAktuell,MeldeKategorie; Aggregate (cost=4497.30..4510.18 rows=258 width=16) -> Group (cost=4497.30..4503.74 rows=2575 width=16) -> IndexScan using i1 on hauptdaten_fall (cost=0.00..4351.40 rows=2575 width=16) real 0m3.223s -- same after doing cluster i1 on Hauptdaten_Fall; real 1.590 -- 1.600 select count(*) from Hauptdaten_Fall; real 0m0.630s --------------------------- The following query is marginally (about 0.1 sec) faster, though the plan looks the same down to cost estimates. 
SET ENABLE_SEQSCAN = OFF;
SELECT MeldeKategorie,
       Count(*) AS Anz
  FROM (select IstAktuell,MeldeKategorie from Hauptdaten_Fall where IstAktuell=20) sub
 GROUP BY IstAktuell,MeldeKategorie
 ORDER BY IstAktuell,MeldeKategorie;

Aggregate  (cost=0.00..4370.72 rows=258 width=16)
  ->  Group  (cost=0.00..4364.28 rows=2575 width=16)
        ->  Index Scan using i1 on hauptdaten_fall  (cost=0.00..4351.40 rows=2575 width=16)

real    0m1.438s - 1.506s

---------------------------

Now I make the dataset bigger, keeping the number of rows returned by the query the same:

insert into hauptdaten_fall (istaktuell, meldekategorie)
select istaktuell + 20, meldekategorie from hauptdaten_fall;
INSERT 0 257530

insert into hauptdaten_fall (istaktuell, meldekategorie)
select istaktuell + 40, meldekategorie from hauptdaten_fall;
INSERT 0 515060

ifsgtest=# select count(*) from hauptdaten_fall;
  count
---------
 1030120
(1 row)

cluster i1 on Hauptdaten_Fall;
vacuum analyze;

-- The query time is still the same, 1.44 - 1.5 sec

SET ENABLE_SEQSCAN = OFF;
SELECT MeldeKategorie,
       Count(*) AS Anz
  FROM (select IstAktuell,MeldeKategorie from Hauptdaten_Fall where IstAktuell=20) sub
 GROUP BY IstAktuell,MeldeKategorie
 ORDER BY IstAktuell,MeldeKategorie;

Aggregate  (cost=0.00..4370.72 rows=258 width=16)
  ->  Group  (cost=0.00..4364.28 rows=2575 width=16)
        ->  Index Scan using i1 on hauptdaten_fall  (cost=0.00..4351.40 rows=2575 width=16)

real    0m1.438s - 1.506s

----------------------------

Now back to the original data distribution, just 4 times bigger:

ifsgtest=# update hauptdaten_fall
ifsgtest-#    set istaktuell = case when istaktuell % 20 = 0 then 20 else 10 end
ifsgtest-# ;
UPDATE 1030120
ifsgtest=# vacuum analyze;
VACUUM

SET ENABLE_SEQSCAN = OFF;
SELECT MeldeKategorie,
       Count(*) AS Anz
  FROM (select IstAktuell,MeldeKategorie from Hauptdaten_Fall where IstAktuell=20) sub
 GROUP BY IstAktuell,MeldeKategorie
 ORDER BY IstAktuell,MeldeKategorie;

real    0m6.077 -- 6.606s

and after clustering:

cluster i1 on Hauptdaten_Fall;

real    0m5.683 - 5.750s

so it's linear growth here

----------------------------
Hannu
I have been thinking about this query. I downloaded all your info and I read your reply to a previous post.

At issue, you say MSSQL outperforms PGSQL. This may be true for a finite set of query types, it may even be true for your entire application, but for how long? What will be the nature of your data next year? Your query is a prime example of where application optimization needs to happen. Regardless of whether MSSQL can currently execute that query quickly, at some point there will be a volume of data which is too large to process quickly.

This table, you say, is created periodically with a cron job. How hard would it be to append a couple of SQL statements to create a summary table for the high-speed queries?

Personally, I think your approach needs to be modified a bit. The fact that your query runs well on one SQL database and poorly on another indicates to me that you will be tied to one database forever. If you use standard database optimization techniques in your design, you can choose any database on the whole set of important criteria, such as reliability, speed, support, administration, and price, rather than just speed. Also, if you use MSSQL you will need to have some version of MS-Windows on which to run it; that alone indicates to me you will have reliability problems.

"Tille, Andreas" wrote:
>
> Hello,
>
> I discussed a problem concerning the speed of PostgreSQL compared to
> MS SQL server heavily on postgres-general list. The thread starts with
> message
>
>         http://fts.postgresql.org/db/mw/msg.html?mid=1035557
<snip>

--
5-4-3-2-1 Thunderbirds are GO!
------------------------
http://www.mohawksoft.com
On Tuesday 30 October 2001 21:24, Alex Pilosov wrote:
> > > | The consequence for my problem is now: if it is technically possible
> > > | to implement index scans without table lookups, please implement it.
>
> The feature you are looking for is called 'index coverage'. Unfortunately,
> it is not easy to implement with PostgreSQL, and it is one of the few
> outstanding 'nasties'. The reason you can't do it is as follows: Postgres
> uses MVCC, and stores 'when' the tuple is alive inside the tuple. So, even
> if the index contains all the information you need, you still need to access
> the main table to check whether the tuple is valid.
>
> Possible workaround: store tuple validity in the index. That way, a lot more
> space is wasted (16 more bytes/tuple/index), and you will need to update
> all indices when the base table is updated, even if the indexed information
> has not changed.

What is the problem with implementing this as a special index type for people who need it? Just add a flag keyword to the index creation clause. Actually, I would like to hear Tom's opinion on this issue. This issue is of interest to me too. Also, I saw some time ago on hackers that there is a patch implementing this... or am I wrong here?

--
Denis
On Wed, 31 Oct 2001, Hannu Krosing wrote:

> I tried some more on optimizing the query on my work computer
> (AMD Athlon 850, 512MB, PostgreSQL 7.1.3 with default memory settings).
>
> SELECT MeldeKategorie,
>        Count(ID) AS Anz
>   FROM Hauptdaten_Fall
>  WHERE IstAktuell=20
>  GROUP BY MeldeKategorie
>  ORDER BY MeldeKategorie;
>
> real    0m9.675s
>
> create index i1 on Hauptdaten_Fall(IstAktuell,MeldeKategorie);
>
> ----------------------------
> set enable_seqscan = off;
> SELECT MeldeKategorie,
>        Count(ID) AS Anz
>   FROM Hauptdaten_Fall
>  WHERE IstAktuell=20
>  GROUP BY MeldeKategorie
>  ORDER BY MeldeKategorie;
>
> Aggregate  (cost=4497.30..4510.18 rows=258 width=16)
>   ->  Group  (cost=4497.30..4503.74 rows=2575 width=16)
>         ->  Sort  (cost=4497.30..4497.30 rows=2575 width=16)
>               ->  Index Scan using i1 on hauptdaten_fall
>                     (cost=0.00..4351.40 rows=2575 width=16)
>
> real    0m7.131s
>
> ---------------------------
>
> set enable_seqscan = off;
> SELECT MeldeKategorie,
>        Count(ID) AS Anz
>   FROM Hauptdaten_Fall
>  WHERE IstAktuell=20
>  GROUP BY IstAktuell,MeldeKategorie
>  ORDER BY IstAktuell,MeldeKategorie;
>
> Aggregate  (cost=4497.30..4510.18 rows=258 width=16)
>   ->  Group  (cost=4497.30..4503.74 rows=2575 width=16)
>         ->  Index Scan using i1 on hauptdaten_fall
>               (cost=0.00..4351.40 rows=2575 width=16)
>
> real    0m3.223s

Hmmm, could you please explain the theory behind that to quite a beginner like me (perhaps on -general if you feel that more appropriate)? The change in the second SELECT is that you included IstAktuell in the GROUP BY/ORDER BY clause, and this gives a speed increase by a factor of 2. It seems that the "Sort" can be left out in this case if I look at the plan, but why is that? The WHERE clause should select just all IstAktuell=20 rows, so the GROUP BY/ORDER BY clauses should have the same work to do every time - as far as my humble understanding goes.

> -- same after doing
>
> cluster i1 on Hauptdaten_Fall;
>
> real    1.590 -- 1.600

That's also interesting. In reality the table Hauptdaten_Fall has many fields with many indices. If I understand things right, it makes no sense to have more than one clustered index, right? A further speed increase of factor two would be welcome. Could I expect this if I found out the "sensitive" index of my table for certain tasks? Or is my understanding wrong and it does make sense to cluster more than one index? Unfortunately, clustering the index of a huge table takes some time. Could I speed this up by some trick?

> select count(*) from Hauptdaten_Fall;
>
> real    0m0.630s
>
> ---------------------------
>
> The following query is marginally (about 0.1 sec) faster, though the
> plan looks the same down to the cost estimates.
>
> SET ENABLE_SEQSCAN = OFF;
> SELECT MeldeKategorie,
>        Count(*) AS Anz
>   FROM (select IstAktuell,MeldeKategorie from Hauptdaten_Fall where
>         IstAktuell=20) sub
>  GROUP BY IstAktuell,MeldeKategorie
>  ORDER BY IstAktuell,MeldeKategorie;
>
> Aggregate  (cost=0.00..4370.72 rows=258 width=16)
>   ->  Group  (cost=0.00..4364.28 rows=2575 width=16)
>         ->  Index Scan using i1 on hauptdaten_fall
>               (cost=0.00..4351.40 rows=2575 width=16)
>
> real    0m1.438s - 1.506s

Hmm, perhaps this is next to nothing, or is there any theory that a count(*) is faster than a count(<fieldname>)?

> ...
> real    0m6.077 -- 6.606s
>
> and after clustering:
> cluster i1 on Hauptdaten_Fall;
>
> real    0m5.683 - 5.750s
>
> so it's linear growth here

This is what my colleague was afraid of: we would have linear growth, compared to the log(n) growth which is to be expected from MS SQL server (for this certain type of query, and for sure only up to a far-off limit of data where other constraints gain influence, but we are far from this limit). This would not convince him :-(.

Kind regards

        Andreas.
On Wed, 31 Oct 2001, Horst Herb wrote:

> I have the feeling that your problem is best solved by taking a different
> approach.
> As A. Pilosov's posting pointed out, index coverage is a problem intrinsic to
> the MVCC implementation (IMHO a small price to pay for a priceless feature).
Could somebody explain MVCC to an uneducated man like me? Is this a feature which perhaps MS SQL doesn't have and which might be important in the future?

> I can't see why much effort should go into a brute-force method to implement
> index coverage if your problem can be solved more elegantly in a different way.
>
> With the example you posted, it is essentially only simple statistics you
> want to run, on tables where the *majority* of records would qualify in your
> query.
> Why not create an extra "statistics" table which is updated automatically
> through triggers on your original table? That way, you will always get
> up-to-date INSTANT query results no matter how huge your original table is.
My problem is to convince my colleague. I'm afraid that he would consider that optimizing stuff "tricks" to work around constraints of the database server. He might argue that if it comes to the point that MS SQL server also needs some speed improvement and he does the same performance-tuning things, MS SQL will outperform PostgreSQL again and we are at the end of our wisdom. I repeat: I for myself see the strength of Open Source (Horst, you know me ;-) ) and I would really love to use PostgreSQL. But how to prove those arguments wrong? *This* is my problem. We have to make a design decision. My colleague is a mathematician who preferred MS SQL server over Oracle some years ago and had certain reasons for it based on estimations of our needs. He has no problems with UNIX or anything else, and he is theoretically on my side that Open Source is the better way; he would accept it if it gave the same results as his stuff. But he has never had performance problems with his databases and knows people who claim to fill zillions of megabytes into MS SQL server. So he doubts the quality of the PostgreSQL server if it has problems in the first run. I have to admit that his point of view is easy to understand. I would have to prove (!) that we wouldn't have trouble with bigger databases and that those things are no "dirty workarounds" for a weak server.

> And don't forget that the only way MS SQL can achieve the better performance
> here is through mercilessly hogging resources. In a complex database
> environment with even larger tables, the performance gain in MS SQL would be
> minimal (my guess).
Unfortunately it is not enough to guess. He has enough experience to know that the MS SQL server is fit for the task he wants to solve. If I tell him "*Perhaps* you could run into trouble", he would just laugh at me, because I'm in trouble *now* and can't prove that I won't be again.

Kind regards

        Andreas.
> My problem is to convince my colleague. I'm afraid he would consider
> this optimization stuff as "tricks" to work around limitations of the
> database server.
<snip>
> Unfortunately it is not enough to guess. He has enough experience to know
> that MS SQL Server is fit for the task he wants to solve. If I tell him
> "*Perhaps* you could run into trouble", he would just laugh at me, because
> I am in trouble *now* and can't prove that I won't be again.

The only way to know for certain is to try both at various sizes and see. Getting numbers for one type of query on one size of database tells very little. Load a test set that's 100, 1000, whatever times the current size and see what happens. ISTM anything short of this is fairly meaningless. At what point does the other person expect to run into problems, how would he solve them, and how does Postgres run at that point with and without special optimization? It's perfectly possible that for the particular queries and load you're running MSSQL will be better; there's nothing wrong with that. Conversely, it's entirely possible that one could find workloads that Postgres is better at.
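A minimal sketch of how such a scaled-up test set could be produced from the existing data (the scratch table name test_fall is made up for illustration; each INSERT pass doubles the row count, and duplicate ID values are harmless for read-only timing tests):

    -- build a scratch copy and inflate it for load testing (illustrative only)
    CREATE TABLE test_fall AS SELECT * FROM Hauptdaten_Fall;
    INSERT INTO test_fall SELECT * FROM test_fall;   -- 2x
    INSERT INTO test_fall SELECT * FROM test_fall;   -- 4x
    INSERT INTO test_fall SELECT * FROM test_fall;   -- 8x ... repeat as needed
    CREATE INDEX test_i1 ON test_fall (IstAktuell, MeldeKategorie);
    VACUUM ANALYZE test_fall;
    EXPLAIN SELECT MeldeKategorie, count(ID) AS Anz
      FROM test_fall WHERE IstAktuell = 20
      GROUP BY MeldeKategorie ORDER BY MeldeKategorie;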
"Tille, Andreas" <TilleA@rki.de> writes: > Could somebody explain MVCC to such an uneducated man like me. Is this a > certain feature (which perhaps MS SQL) doesn�t have and which might be > important in the future? http://www.us.postgresql.org/users-lounge/docs/7.1/postgres/mvcc.html (Or substitute your favorite mirror) Only Oracle has anything like it AFAIK. -Doug -- Let us cross over the river, and rest under the shade of the trees. --T. J. Jackson, 1863
> > so it's linear growth here
>
> This is what my colleague was afraid of: we would have linear growth
> compared to the log(n) growth that is to be expected from MS SQL Server

This is not true, since the index scan also needs to read the leaf pages in MS SQL. The number of leaf pages grows linearly with the number of rows that qualify under the WHERE restriction.

R = number of rows that qualify --> O(R + log(R))

The pg measurements showed that PostgreSQL query performance can be expected to stay nearly the same regardless of the number of rows in the table, as long as the number of rows that qualify under the WHERE restriction stays constant. The response time is linear in the number of rows that qualify, but that linear behavior is to be expected with MS SQL as well.

Andreas
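As a rough numeric illustration of that argument (assuming, purely for illustration, about 200 index entries per 8 kB leaf page): with R = 60,000 qualifying rows an index-only plan still has to read on the order of 300 leaf pages, plus a handful of pages for the descent through the upper levels of the tree. Doubling R roughly doubles the number of leaf pages read, so the dominant cost term is linear in R for MS SQL as well; only the small log-sized descent term is independent of R.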
On Fri, 2 Nov 2001, Zeugswetter Andreas SB SD wrote:

> This is not true, since the index scan also needs to read the leaf pages
> in MS SQL. The number of leaf pages grows linearly with the number of rows
> that qualify under the WHERE restriction.
>
> R = number of rows that qualify --> O(R + log(R))
>
> The pg measurements showed that PostgreSQL query performance can be expected
> to stay nearly the same regardless of the number of rows in the table, as
> long as the number of rows that qualify under the WHERE restriction stays
> constant. The response time is linear in the number of rows that qualify,
> but that linear behavior is to be expected with MS SQL as well.

Well, you may be right here, but I talked once more with my colleague about the specifications. We can be sure that the input of data is about 1 GB. We can be sure about that because what has to be stored is fixed by the German law on infectious diseases. We have no online-shop system or anything like that. <sarcastic>If the recent anthrax problem were to grow exponentially we could be in trouble, but chances are low.</sarcastic> So we have a good chance of estimating the amount of data quite well: it is a linear growth of 1 GB per year. If MS SQL Server is fast enough now, we can grow with the normal increase of hardware performance over the years. This is a fact I have to accept.

An additional constraint is that the underlying data model, with an Access application, is run by about 170 clients, each holding about 100-500 data sets, which they export once a week to our central server. The developers tried hard to keep the Access application and the MS SQL Server solution in sync, and adding a third application (by rewriting some 500 queries) would be a lot of work. (I am not afraid of this work, but I must be sure it would make sense before I start, so I hope for advice from people who have perhaps done this before.)

I discussed the issue of using statistics tables to speed up certain queries. He told me that this technique is known as OLAP cubes in MS SQL Server and that there are tools to build such things. Is this a valid comparison? He did not use it because it would break the Access solution of our clients. Are there any tools for PostgreSQL for such stuff besides manually creating the tables and triggers?

Currently I see two possible solutions to my problem:

1. Hoping that 'index coverage' is implemented (perhaps by a patch... somebody asked about it but got no response) in 7.2 or at least 7.3. In that case I would try to do my best with the statistics tables, but I could not cope with it if at some stage our data model changed and I had to rework all that stuff.

2. Giving MySQL a trial, because I expect it to solve my problem in the fashion I need. (Well, read-only is OK; surely no features like MVCC, and thus perhaps faster index scans.) I would definitely come back to PostgreSQL once 'index coverage' or any other method to speed up index searches is implemented.

Could somebody give any advice on what would be the best strategy? (Perhaps I should switch back to pgsql-general for this question, but I definitely want to hear a statement from the hackers about future implementation plans!)

By the way, in my former postings I forgot to mention a further problem that stayed unanswered in my questions on pgsql-general: watching "top" while a query runs (over some 30 seconds), the memory use of postgresql increases heavily during execution.
I wonder if it would help if there were some mechanism to keep some of the database resident in memory. I know well that memory handling on Linux/UNIX differs from Windows (and this is a great feature ;-) ), but if I have plenty of free memory (2 GB) and my box is never swapping, I wonder whether it shouldn't be possible to hold some information in memory instead of simply relying on the disk cache of the OS. Any opinions?

Kind regards, Andreas.
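For what it's worth, the knobs that control this in the 7.x postgresql.conf are shared_buffers (PostgreSQL's own shared buffer cache, counted in 8 kB pages) and sort_mem (memory per sort/group step, in kB). The values below are only an illustrative sketch for a machine with around 2 GB of RAM and would need tuning; raising shared_buffers may also require raising the kernel's shared memory limits.

    # postgresql.conf -- illustrative values only, not a recommendation
    shared_buffers = 16384        # 16384 * 8 kB = 128 MB of shared buffer cache
    sort_mem = 8192               # 8 MB per sort/group operation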
Alex Pilosov wrote:

> On Tue, 30 Oct 2001, Antonio Fiol Bonnín wrote:
>
> > > | > Seems that problem is very simple :))
> > > | > MSSql can do queries from indexes, without using actual table at all.
> > > | > Postgresql doesn't.
> > > | >
> > > | > So mssql avoids sequental scanning of big table, and simply does scan of
> > > | > index which is already in needed order and has very much less size.
> <snip>
> > > | The consequence for my problem is now: If it is technically possible
> > > | to implement index scans without table lookups please implement it. If
>
> The feature you are looking for is called 'index coverage'. Unfortunately,
> it is not easy to implement in PostgreSQL, and it is one of the few
> outstanding 'nasties'. The reason you can't do it is as follows: Postgres
> uses MVCC and stores 'when' the tuple is alive inside the tuple. So even
> if the index contains all the information you need, you still need to access
> the main table to check whether the tuple is valid.
>
> Possible workaround: store tuple validity in the index. That way a lot more
> space is wasted (16 more bytes/tuple/index), and you will need to update
> all indexes when the base table is updated, even if the indexed information
> has not changed.

Maybe just a silly idea, but wouldn't it be possible (and useful) to store tuple validity in a separate bitmap file, where every bit reports the validity of the corresponding tuple? It would grow linearly, but at least it would be very small compared to the actual data...

Best regards
Andrea Aime
Thread: Licensing issues including another project's source code into the jdbc driver
From: "Dave Cramer"

I am contemplating including log4jme source code into the jdbc driver. Who would be the best person to contact wrt ironing out the licensing issues?

Dave
Andrea Aime wrote:

> Alex Pilosov wrote:
<snip>
> > Possible workaround: store tuple validity in the index. That way a lot more
> > space is wasted (16 more bytes/tuple/index), and you will need to update
> > all indexes when the base table is updated, even if the indexed information
> > has not changed.
>
> Maybe just a silly idea, but wouldn't it be possible (and useful)
> to store tuple validity in a separate bitmap file, where every bit reports
> the validity of the corresponding tuple? It would grow linearly, but
> at least it would be very small compared to the actual data...

I see two problems with this approach:

1. Tuple validity is different for different transactions running concurrently. We could still cache death-transaction ids of tuples _in_memory_ quite cheaply time-wise, but I'm not sure how big a win it would be in general.

2. There is no easy way to know which bit corresponds to which tuple, as each database page can contain an arbitrary number of tuples (this one is easier, as we can use a somewhat sparse bitmap that is less space-efficient).

------------
Hannu
On Monday 05 November 2001 03:24, Tille, Andreas wrote:

> I discussed the issue of using statistics tables to speed up certain
> queries. He told me that this technique is known as OLAP cubes in
> MS SQL Server and that there are tools to build such things. Is this
> a valid comparison? He did not use it because it would break the Access
> solution of our clients. Are there any tools for PostgreSQL for
> such stuff besides manually creating the tables and triggers?

I still don't understand your guy. Knowing that the table (and with it the performance demands) will grow, it is quite stubborn and certainly not elegant at all to insist on the blunt query instead of a smart solution. The smart solution as outlined always returns results instantly and needs next to no memory or other resources compared to the blunt query, regardless of the growth of your database. It would only affect the growth *rate* because of the fired triggers, but then, your application does not seem to have a heavy insert load anyway, and you could always queue the inserts with middleware since you have no realtime demands.

Btw, what is wrong with creating a few tables and a few trigger functions "manually"? Writing, testing, and debugging them should not cost more than a couple of days. Why would I want a tool for that? I might spend a couple of hours writing a Python script if I needed similar triggers for many tables over and over again, but your problem does not seem to call for that.

Horst
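As a rough sketch of what such a trigger-maintained statistics table could look like (all table, column, and function names here are made up for illustration; exact PL/pgSQL syntax varies between versions, and the language has to be installed in the database, e.g. with createlang plpgsql):

    -- summary table holding one pre-aggregated count per category
    CREATE TABLE stat_meldekategorie (
        istaktuell      integer,
        meldekategorie  text,
        anz             integer,
        PRIMARY KEY (istaktuell, meldekategorie)
    );

    -- keep the summary row in step with every change to the base table
    CREATE FUNCTION upd_stat_meldekategorie() RETURNS opaque AS '
    BEGIN
        IF TG_OP = ''INSERT'' OR TG_OP = ''UPDATE'' THEN
            IF NOT EXISTS (SELECT 1 FROM stat_meldekategorie
                            WHERE istaktuell = NEW.istaktuell
                              AND meldekategorie = NEW.meldekategorie) THEN
                INSERT INTO stat_meldekategorie
                    VALUES (NEW.istaktuell, NEW.meldekategorie, 0);
            END IF;
            UPDATE stat_meldekategorie SET anz = anz + 1
             WHERE istaktuell = NEW.istaktuell
               AND meldekategorie = NEW.meldekategorie;
        END IF;
        IF TG_OP = ''UPDATE'' OR TG_OP = ''DELETE'' THEN
            UPDATE stat_meldekategorie SET anz = anz - 1
             WHERE istaktuell = OLD.istaktuell
               AND meldekategorie = OLD.meldekategorie;
        END IF;
        RETURN NULL;
    END;
    ' LANGUAGE 'plpgsql';

    CREATE TRIGGER trg_stat_meldekategorie
        AFTER INSERT OR UPDATE OR DELETE ON Hauptdaten_Fall
        FOR EACH ROW EXECUTE PROCEDURE upd_stat_meldekategorie();

The original report then becomes a lookup in the small summary table, e.g. SELECT meldekategorie, anz FROM stat_meldekategorie WHERE istaktuell = 20 ORDER BY meldekategorie. The price is that concurrent inserts touching the same category serialize on the summary row, which should be harmless with the light insert load described in this thread.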