Thread: Select performance vs. mssql
Hi, I have some experience with MSSQL and am examining PostgreSQL. I'm running under Windows. I like what I see so far, but I'm hoping for some performance advice: 1. My test database has 7 million records. 2. There are two columns - an integer and a char column called Day which has a random value of Mon or Tues, etc. in it. 3. I made an index on Day. My query is: select count(*) from mtable where day='Mon' Results: 1. P3 600 512MB RAM MSSQL. It takes about 4-5 secs to run. If I run a few queries and everything is cached, it is sometimes just 1 second. 2. Athlon 1.3 Ghz 1GB RAM. PostgreSQL takes 7 seconds. I have played with the buffers setting and currently have it at 7500. At 20000 it took over 20 seconds to run. 5 seconds vs 7 isn't that big of a deal, but 1 second vs 7 seconds is. Also, the slower performance is with much lesser hardware. Any ideas to try? Thanks much, Mark __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
> select count(*) from mtable where day='Mon' > > Results: > > 1. P3 600 512MB RAM MSSQL. It takes about 4-5 secs to > run. If I run a few queries and everything is cached, > it is sometimes just 1 second. > > 2. Athlon 1.3 Ghz 1GB RAM. PostgreSQL takes 7 seconds. > I have played with the buffers setting and currently > have it at 7500. At 20000 it took over 20 seconds to > run. > > 5 seconds vs 7 isn't that big of a deal, but 1 second > vs 7 seconds is. Also, the slower performance is with > much lesser hardware. Post the result of this for us: explain analyze select count(*) from mtable where day='Mon'; On both machines. Chris
> Post the result of this for us: > > explain analyze select count(*) from mtable where > day='Mon'; > > On both machines. Hi Chris -- PostgreSQL Machine: "Aggregate (cost=140122.56..140122.56 rows=1 width=0) (actual time=24516.000..24516.000 rows=1 loops=1)" " -> Index Scan using "day" on mtable (cost=0.00..140035.06 rows=35000 width=0) (actual time=47.000..21841.000 rows=1166025 loops=1)" " Index Cond: ("day" = 'Mon'::bpchar)" "Total runtime: 24516.000 ms" (Note this took 24 seconds after fresh reboot, next execution was 11, and execution without explain analyze was 6.7 seconds) MSSQL Machine: That "Explain Analyze" command doesn't work for MSSQL, but I did view the Query plan. 97% of it was "Scanning a particular range of rows from a nonclustered index" Thanks for your help --Mark __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
mark durrant wrote: > PostgreSQL Machine: > "Aggregate (cost=140122.56..140122.56 rows=1 width=0) > (actual time=24516.000..24516.000 rows=1 loops=1)" > " -> Index Scan using "day" on mtable > (cost=0.00..140035.06 rows=35000 width=0) (actual > time=47.000..21841.000 rows=1166025 loops=1)" > " Index Cond: ("day" = 'Mon'::bpchar)" > "Total runtime: 24516.000 ms" Have you run ANALYZE? Clustering the table on the "day" index (via the CLUSTER command) would be worth trying. -Neil
Mark, > MSSQL Machine: > That "Explain Analyze" command doesn't work for MSSQL, try this: set showplan_all on go select ... go Harald
First, thanks for all the helpful replies. I've listened to the suggestions and done some more digging and have results: I did show_plan_all in MSSQL and found that it was doing an Index Scan. I've read someplace that if the data you need is all in the index, then MSSQL has a feature/hack where it does not have to go to the table, it can do my COUNT using the index alone. I think this explains the 1 second query performance. I changed the query to also include the other column which is not indexed. The results were MSSQL now used a TableScan and was MUCH slower than PostgreSQL. I clustered the index on MSSQL and PostgreSQL and increased buffers to 15000 on PGSQL. I saw a noticeable performance increase on both. On the more complicated query, PostgreSQL is now 3.5 seconds. MSSQL is faster again doing an index scan and is at 2 seconds. Remember the MSSQL machine has a slower CPU as well. My interpretations: --Given having to do a table scan, PostgreSQL seems to be faster. The hardware on my PostrgreSQL machine is nicer than the MSSQL one, so perhaps they are just about the same speed with speed determined by the disk. --Tuning helps. Clustered index cut my query time down. More buffers helped. --As Chris pointed out, how real-world is this test? His point is valid. The database we're planning will have a lot of rows and require a lot of summarization (hence my attempt at a "test"), but we shouldn't be pulling a million rows at a time. --MSSQL's ability to hit the index only and not having to go to the table itself results in a _big_ performance/efficiency gain. If someone who's in development wants to pass this along, it would be a nice addition to PostgreSQL sometime in the future. I'd suspect that as well as making one query faster, it would make everything else faster/more scalable as the server load is so much less. Thanks again, Mark __________________________________ Do you Yahoo!? Yahoo! Small Business - Try our new Resources site http://smallbusiness.yahoo.com/resources/
On Tue, May 24, 2005 at 08:36:36 -0700, mark durrant <markd89@yahoo.com> wrote: > > --MSSQL's ability to hit the index only and not having > to go to the table itself results in a _big_ > performance/efficiency gain. If someone who's in > development wants to pass this along, it would be a > nice addition to PostgreSQL sometime in the future. > I'd suspect that as well as making one query faster, > it would make everything else faster/more scalable as > the server load is so much less. This gets brought up a lot. The problem is that the index doesn't include information about whether the current transaction can see the referenced row. Putting this information in the index will add significant overhead to every update and the opinion of the developers is that this would be a net loss overall.
I'm far from an expert, so this may be off-base... but perhaps a suggestion would be to allow a hint to be sent to the optimizer if the user doesn't care that the result is "approximate" maybe then this wouldn't require adding more overhead to the indexes. MSSQL has something like this with (nolock) i.e. select count(*) from customers (nolock) where name like 'Mark%' Regardless, I'm very impressed with PostgreSQL and I think we're moving ahead with it. Mark --- Bruno Wolff III <bruno@wolff.to> wrote: > On Tue, May 24, 2005 at 08:36:36 -0700, > mark durrant <markd89@yahoo.com> wrote: > > > > --MSSQL's ability to hit the index only and not > having > > to go to the table itself results in a _big_ > > performance/efficiency gain. If someone who's in > > development wants to pass this along, it would be > a > > nice addition to PostgreSQL sometime in the > future. > > I'd suspect that as well as making one query > faster, > > it would make everything else faster/more scalable > as > > the server load is so much less. > > This gets brought up a lot. The problem is that the > index doesn't include > information about whether the current transaction > can see the referenced > row. Putting this information in the index will add > significant overhead > to every update and the opinion of the developers is > that this would be > a net loss overall. __________________________________ Do you Yahoo!? Yahoo! Small Business - Try our new Resources site http://smallbusiness.yahoo.com/resources/
Until you start worrying about MVC - we have had problems with the MSSQL implementation of read consistency because of this 'feature'.
Alex Turner
NetEconomist
Alex Turner
NetEconomist
On 5/24/05, Bruno Wolff III <bruno@wolff.to> wrote:
On Tue, May 24, 2005 at 08:36:36 -0700,
mark durrant <markd89@yahoo.com> wrote:
>
> --MSSQL's ability to hit the index only and not having
> to go to the table itself results in a _big_
> performance/efficiency gain. If someone who's in
> development wants to pass this along, it would be a
> nice addition to PostgreSQL sometime in the future.
> I'd suspect that as well as making one query faster,
> it would make everything else faster/more scalable as
> the server load is so much less.
This gets brought up a lot. The problem is that the index doesn't include
information about whether the current transaction can see the referenced
row. Putting this information in the index will add significant overhead
to every update and the opinion of the developers is that this would be
a net loss overall.
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Folks, > > This gets brought up a lot. The problem is that the > > index doesn't include > > information about whether the current transaction > > can see the referenced > > row. Putting this information in the index will add > > significant overhead > > to every update and the opinion of the developers is > > that this would be > > a net loss overall. Pretty much. There has been discussion about allowing index-only access to "frozen" tables, i.e. archive partitions. But it all sort of hinges on someone implementing it and testing .... -- --Josh Josh Berkus Aglio Database Solutions San Francisco
> Pretty much. There has been discussion about allowing index-only access > to > "frozen" tables, i.e. archive partitions. But it all sort of hinges on > someone implementing it and testing .... Would be interesting as a parameter to set at index creation (ie. if you know this table will have a lot of reads and few writes)... like create an index on columns X,Y keeping data on columns X,Y and Z... But in this case do you still need the table ? Or even create a table type where the table and the index are one, like an auto-clustered table... I don't know if it would be used that often, though ;)
On Tue, May 24, 2005 at 04:35:14PM -0700, Josh Berkus wrote: >Pretty much. There has been discussion about allowing index-only access to >"frozen" tables, i.e. archive partitions. But it all sort of hinges on >someone implementing it and testing .... Is there any way to expose the planner estimate? For some purposes it's enough to just give a rough ballpark (e.g., a google-esque "results 1-10 of approximately 10000000") so a user knows whether its worth even starting to page through. Mike Stone
Michael Stone wrote: > On Tue, May 24, 2005 at 04:35:14PM -0700, Josh Berkus wrote: > >> Pretty much. There has been discussion about allowing index-only >> access to "frozen" tables, i.e. archive partitions. But it all sort >> of hinges on someone implementing it and testing .... > > > Is there any way to expose the planner estimate? For some purposes it's > enough to just give a rough ballpark (e.g., a google-esque "results 1-10 > of approximately 10000000") so a user knows whether its worth even > starting to page through. > > Mike Stone > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster Well, you could always do: EXPLAIN SELECT ... And then parse out the rows= in the first line. John =:->
Attachment
> --As Chris pointed out, how real-world is this test? > His point is valid. The database we're planning will > have a lot of rows and require a lot of summarization > (hence my attempt at a "test"), but we shouldn't be > pulling a million rows at a time. If you want to do lots of aggregate analysis, I suggest you create a sepearate summary table, and create triggers on the main table to maintain your summaries in the other table... > --MSSQL's ability to hit the index only and not having > to go to the table itself results in a _big_ > performance/efficiency gain. If someone who's in > development wants to pass this along, it would be a > nice addition to PostgreSQL sometime in the future. > I'd suspect that as well as making one query faster, > it would make everything else faster/more scalable as > the server load is so much less. This is well-known and many databases do it. However, due to MVCC considerations in PostgreSQL, it's not feasible for us to implement it... Chris
On Wed, May 25, 2005 at 09:29:36AM +0800, Christopher Kings-Lynne wrote: > >--MSSQL's ability to hit the index only and not having > >to go to the table itself results in a _big_ > >performance/efficiency gain. If someone who's in > >development wants to pass this along, it would be a > >nice addition to PostgreSQL sometime in the future. > >I'd suspect that as well as making one query faster, > >it would make everything else faster/more scalable as > >the server load is so much less. > > This is well-known and many databases do it. However, due to MVCC > considerations in PostgreSQL, it's not feasible for us to implement it... Wasn't there a plan to store some visibility info in indexes? IIRC the idea was that a bit would be set in the index tuple indicating that all transactions that wouldn't be able to see that index value were complete, meaning that there was no reason to hit the heap for that tuple. I looked on the TODO but didn't see this, maybe it fell through the cracks? -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
* Bruno Wolff III <bruno@wolff.to> wrote: <snip> > This gets brought up a lot. The problem is that the index doesn't include > information about whether the current transaction can see the referenced > row. Putting this information in the index will add significant overhead > to every update and the opinion of the developers is that this would be > a net loss overall. wouldn't it work well to make this feature optionally for each index ? There could be some flag on the index (ie set at create time) which tells postgres whether to store mvcc information. cu -- --------------------------------------------------------------------- Enrico Weigelt == metux IT service phone: +49 36207 519931 www: http://www.metux.de/ fax: +49 36207 519932 email: contact@metux.de --------------------------------------------------------------------- Realtime Forex/Stock Exchange trading powered by postgresSQL :)) http://www.fxignal.net/ ---------------------------------------------------------------------
Enrico Weigelt wrote: > Bruno Wolff III wrote: >> >> This gets brought up a lot. The problem is that the index doesn't include >> information about whether the current transaction can see the referenced >> row. Putting this information in the index will add significant overhead >> to every update and the opinion of the developers is that this would be >> a net loss overall. > > wouldn't it work well to make this feature optionally for each > index ? There could be some flag on the index (ie set at create > time) which tells postgres whether to store mvcc information. There is no reason to assume it can't work. There is little reason to assume that it will be the best solution in many circumstances. There is a big reason why people are sceptical: there is no patch. The issue has been debated and beaten to death. People have formed their opinions and are unlikely to change their position. If you want to convince people, your best bet is to submit a patch and have OSDL measure the performance improvement. Jochem