Thread: Performance problem vs SQLServer.

From:
Stéphane COEZ
Date:

Hi,

I have a performance issue:

I run PG (8.0.3) and SQL Server 2000 on a Windows 2000 Server (P4 1.5 GHz, 512 MB RAM).
I have a table (3,200,000 rows) and I run this single query:

select cod from mytable group by cod
I have an index on cod (char(4) - 88 different values)

PG = ~ 20 sec.
SQLServer = < 8 sec


The EXPLAIN output is:

HashAggregate  (cost=64410.09..64410.09 rows=55 width=8)
  ->  Seq Scan on mytable  (cost=0.00..56325.27 rows=3233927 width=8)


If I switch to "enable_hashagg = false" (just as a test...),
the planner chooses my index:

Group  (cost=0.00..76514.01 rows=55 width=8)
  ->  Index Scan using myindex on mytable  (cost=0.00..68429.20 rows=3233927
width=8)

but performance is comparable to the previous test.

So with or without the index I get the same result.


Thanks for help.

Stéphane COEZ




From:
John Arbash Meinel
Date:

Stéphane COEZ wrote:

>Hi,
>
>I have a perfomance issue :
>
>I run PG (8.0.3) and SQLServer2000 on a Windows2000 Server (P4 1,5Ghz 512Mo)
>I have a table (3200000 rows) and I run this single query :
>
>select cod from mytable group by cod
>I have an index on cod (char(4) - 88 different values)
>
>PG = ~ 20 sec.
>SQLServer = < 8 sec
>
>
>the explain is :
>
>HashAggregate  (cost=64410.09..64410.09 rows=55 width=8)
>  ->  Seq Scan on mytable  (cost=0.00..56325.27 rows=3233927 width=8)
>
>
>if I switch to "enable_hashagg = false" (just for a try...)
>the planner will choose my index :
>
>Group  (cost=0.00..76514.01 rows=55 width=8)
>  ->  Index Scan using myindex on mytable  (cost=0.00..68429.20 rows=3233927
>width=8)
>
>but performance will be comparable to previous test.
>
>So with or without using Index I have the same result.
>
>

My guess is that this is part of a larger query. There isn't really much
you can do. If you want all 3.2M rows, then you have to wait for them to
be pulled in.

What you generally can do for performance is restructure things so
that you *don't* have to touch all 3.2M rows.
If you are just trying to determine what the unique entries for cod are,
you are probably better off doing some normalization and keeping a
separate table of cod values.
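That normalization can be kept in sync with a trigger on insert. Here is a minimal sketch using Python's sqlite3 standard library module (the trigger, and the table/column names, are hypothetical illustrations — on 8.0-era Postgres you would use a plpgsql trigger or a rule instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (cod TEXT)")
conn.execute("CREATE TABLE cod_values (cod TEXT PRIMARY KEY)")

# Keep the small lookup table in sync whenever mytable gets a new row.
conn.execute("""
    CREATE TRIGGER mytable_cod_sync AFTER INSERT ON mytable
    BEGIN
        INSERT OR IGNORE INTO cod_values VALUES (NEW.cod);
    END
""")

conn.executemany("INSERT INTO mytable VALUES (?)",
                 [("0001",), ("0001",), ("0002",)])

# The "which codes exist?" question now touches ~88 rows instead of 3.2M.
print(conn.execute("SELECT cod FROM cod_values ORDER BY cod").fetchall())
# → [('0001',), ('0002',)]
```

The trade-off is a small cost on every insert in exchange for making the distinct-values query effectively free.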

I'm guessing the reason your query is faster on SQL Server is how
Postgres handles MVCC: it still has to fetch the heap page to determine
whether a row is visible. SQL Server doesn't do MVCC, so it can answer
straight from the index.

You might also try a different query, something like:

SELECT DISTINCT cod FROM mytable ORDER BY cod;

(You may or may not want the ORDER BY, or a GROUP BY instead; try the
different combinations.)
It might be possible to have the planner realize that all you want is
unique rows; just doing a GROUP BY doesn't tell it that.

John
=:->

>
>Thanks for help.
>
>Stéphane COEZ
>
>
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 9: In versions below 8.0, the planner will ignore your desire to
>       choose an index scan if your joining column's datatypes do not
>       match
>
>
>



From:
"Steinar H. Gunderson"
Date:

On Sun, Aug 14, 2005 at 07:27:38PM -0500, John Arbash Meinel wrote:
> My guess is that this is part of a larger query. There isn't really much
> you can do. If you want all 3.2M rows, then you have to wait for them to
> be pulled in.

To me, it looks like he'll get 88 rows, not 3.2M. Surely we must be able to
do something better than a full sequential scan in this case?

test=# create table foo ( bar char(4) );
CREATE TABLE
test=# insert into foo values ('0000');
INSERT 24773320 1
test=# insert into foo values ('0000');
INSERT 24773321 1
test=# insert into foo values ('1111');
INSERT 24773322 1
test=# select * from foo group by bar;
 bar
------
 1111
 0000
(2 rows)

I considered doing some odd magic with generate_series() and subqueries with
LIMIT 1, but it was a bit too weird in the end :-)

/* Steinar */
--
Homepage: http://www.sesse.net/

From:
"Steinar H. Gunderson"
Date:

On Sun, Aug 14, 2005 at 07:27:38PM -0500, John Arbash Meinel wrote:
> If you are just trying to determine what the unique entries are for cod,
> you probably are better off doing some normalization, and keeping a
> separate table of cod values.

Pah, I missed this part of the e-mail -- you can ignore most of my (other)
reply, then :-)

/* Steinar */
--
Homepage: http://www.sesse.net/

From:
John Arbash Meinel
Date:

Steinar H. Gunderson wrote:

>On Sun, Aug 14, 2005 at 07:27:38PM -0500, John Arbash Meinel wrote:
>
>
>>My guess is that this is part of a larger query. There isn't really much
>>you can do. If you want all 3.2M rows, then you have to wait for them to
>>be pulled in.
>>
>>
>
>To me, it looks like he'll get 88 rows, not 3.2M. Surely we must be able to
>do something better than a full sequential scan in this case?
>
>test=# create table foo ( bar char(4) );
>CREATE TABLE
>test=# insert into foo values ('0000');
>INSERT 24773320 1
>test=# insert into foo values ('0000');
>INSERT 24773321 1
>test=# insert into foo values ('1111');
>INSERT 24773322 1
>test=# select * from foo group by bar;
> bar
>------
> 1111
> 0000
>(2 rows)
>
>I considered doing some odd magic with generate_series() and subqueries with
>LIMIT 1, but it was a bit too weird in the end :-)
>
>/* Steinar */
>
>
I think a plain "GROUP BY" is not smart enough to detect that it doesn't
need all the rows (since it is generally used because you want aggregate
values of other columns).
I think you would want something like SELECT DISTINCT, possibly with an
ORDER BY rather than a GROUP BY (which was my final suggestion).

John
=:->


From:
Tom Lane
Date:

"Steinar H. Gunderson" <> writes:
> To me, it looks like he'll get 88 rows, not 3.2M. Surely we must be able to
> do something better than a full sequential scan in this case?

Not really.  There's been some speculation about implementing index
"skip search" --- once you've verified there's at least one visible
row of a given index value, tell the index to skip to the next different
value instead of handing back any of the remaining entries of the
current value.  But it'd be a lot of work and AFAICS not useful for
very many kinds of queries besides this.

            regards, tom lane

From:
"Steinar H. Gunderson"
Date:

On Sun, Aug 14, 2005 at 09:18:45PM -0400, Tom Lane wrote:
> Not really.  There's been some speculation about implementing index
> "skip search" --- once you've verified there's at least one visible
> row of a given index value, tell the index to skip to the next different
> value instead of handing back any of the remaining entries of the
> current value.  But it'd be a lot of work and AFAICS not useful for
> very many kinds of queries besides this.

This is probably a completely wrong way of handling it all, but could it be
done in a PL/PgSQL query like this? (Pseudo-code, sort of; I'm not very well
versed in the actual syntax, but I'd guess you get the idea.)

x := ( SELECT foo FROM table ORDER BY foo LIMIT 1 );
WHILE x IS NOT NULL LOOP
  RETURN NEXT x;
  x := ( SELECT foo FROM table WHERE foo > x ORDER BY foo LIMIT 1 );
END LOOP;

(Replace with max() and min() for 8.1, of course.)
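Outside the database, the same hack can be illustrated with Python's sqlite3 module: each iteration is one index probe for the next-larger value, so the loop costs roughly O(distinct values × log n) instead of a full scan. (Table and column names here are made up for the sketch.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (foo TEXT)")
conn.execute("CREATE INDEX t_foo_idx ON t (foo)")
conn.executemany("INSERT INTO t VALUES (?)", [(c,) for c in "aabbbbcc"])

# Seed with the smallest value, then repeatedly ask for the next-larger one.
out = []
x = conn.execute("SELECT min(foo) FROM t").fetchone()[0]
while x is not None:
    out.append(x)
    row = conn.execute(
        "SELECT foo FROM t WHERE foo > ? ORDER BY foo LIMIT 1", (x,)
    ).fetchone()
    x = row[0] if row else None

print(out)  # → ['a', 'b', 'c']
```

With 88 distinct codes this would be 89 cheap index probes rather than one 3.2M-row scan — which is exactly the "skip search" idea.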

/* Steinar */
- fond of horrible hacks :-)
--
Homepage: http://www.sesse.net/

From:
dario_d_s@unitech.com.ar
Date:

One little thing: did you shut down SQL 2000 while testing PostgreSQL? Remember that PostgreSQL uses the system cache.
SQL 2000 uses a large part of memory as its buffer, and that memory will not be available to the operating system. I must
say that the results will probably be the same, but it would be a better test.

> I'm guessing the reason your query is faster with SQLServer is because
> of how postgres handles MVCC. Basically, it still has to fetch the main
> page to determine if a row exists. While SQL server doesn't do MVCC, so
> it can just look things up in the index.

Another thing [almost offtopic]:
I would like to add something to help understand what MVCC means and what its consequences are.
MVCC: multiversion concurrency control. (ehhh...)

Just do this.

Open two psql sessions. Do this:
Session 1:
   begin;
   update any_table set any_column = 'value_a' where other_column = 'value_b';
   -- do not commit
Session 2:
   select * from any_table where other_column = 'value_b';
   Watch the result.
Session 1:
   commit;
Session 2:
   select * from any_table where other_column = 'value_b';
   Watch the result.

Now open two sessions in Query Analyzer. Do the same thing:
Session 1:
   begin tran
   update any_table set any_column = 'value_a' where other_column = 'value_b'
   -- do not commit
Session 2:
   select * from any_table where other_column = 'value_b'
   Wait for the result.
   Wait... wait... (Oh, a lock! Ok, when you get tired, go back to session 1.)
Session 1:
   commit
Session 2:
   Then watch the result.

Which one was faster?

["very, very offtopic"]
Ok. This comparition is just as useless as the other one, because it's comparing oranges with apples (It's funny
anyway).I was just choosing an example in which you can see the best of postgresql against 'not so nice' behavior of
mssql2000(no service pack, it's my desktop system, I'll do the same test later with SP4 and different isolation levels
andI'll check results). Furthermore, MSSQL2000 is 5 years old now. Does anybody has the same cellular phone, or
computer?(I don't want to know :-) ). The big question is 'What do you need?'. No system can give you all. That's
marketing'sarasa'. 

Sorry for my english and the noise. [End of offtopic]

Long life, little spam and prosperity.



From:
"Magnus Hagander"
Date:

> Hi,
>
> I have a perfomance issue :
>
> I run PG (8.0.3) and SQLServer2000 on a Windows2000 Server
> (P4 1,5Ghz 512Mo) I have a table (3200000 rows) and I run
> this single query :
>
> select cod from mytable group by cod
> I have an index on cod (char(4) - 88 different values)
>
> PG = ~ 20 sec.
> SQLServer = < 8 sec
>
>
> the explain is :
>
> HashAggregate  (cost=64410.09..64410.09 rows=55 width=8)
>   ->  Seq Scan on mytable  (cost=0.00..56325.27 rows=3233927 width=8)
>
>
> if I switch to "enable_hashagg = false" (just for a try...)
> the planner will choose my index :
>
> Group  (cost=0.00..76514.01 rows=55 width=8)
>   ->  Index Scan using myindex on mytable
> (cost=0.00..68429.20 rows=3233927
> width=8)
>
> but performance will be comparable to previous test.
>
> So with or without using Index I have the same result.

Out of curiosity, what plan do you get from SQLServer? I bet it's a clustered index scan...


//Magnus

From:
"Magnus Hagander"
Date:

> ["very, very offtopic"]
> Ok. This comparition is just as useless as the other one,
> because it's comparing oranges with apples (It's funny
> anyway). I was just choosing an example in which you can see
> the best of postgresql against 'not so nice' behavior of
> mssql2000 (no service pack, it's my desktop system, I'll do
> the same test later with SP4 and different isolation levels
> and I'll check results).

There will be no difference between the service packs.
SQL 2005 has "MVCC" (they call it something different, of course, but
that's basically what it is).

> Furthermore, MSSQL2000 is 5 years
> old now. Does anybody has the same cellular phone, or
> computer? (I don't want to know :-) ). The big question is

There is a big difference between your database and your cellphone.
There are a lot of systems out there running very solidly on older
products like MSSQL 7 (probably even some on 6.x), as well as Oracle 7, 8
and 9...
I'd say there is generally a huge difference in reliability between your
cellphone hw/sw and your db hw/sw. I have yet to see a cellphone that can
run for a year without a reboot (or, with a lot of brands, without
complete replacement).

//Magnus

From:
Alvaro Herrera
Date:

On Mon, Aug 15, 2005 at 10:25:47AM +0200, Magnus Hagander wrote:

> SQL 2005 has "MVCC" (they call it something different, of course, but
> that's basicallyi what it is)

Interesting; do they use an overwriting storage manager like Oracle, or
a non-overwriting one like Postgres?

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"The Postgresql hackers have what I call a "NASA space shot" mentality.
 Quite refreshing in a world of "weekend drag racer" developers."
(Scott Marlowe)

From:
"Qingqing Zhou"
Date:

"Alvaro Herrera" <-ip.org> writes
>
> Interesting; do they use an overwriting storage manager like Oracle, or
> a non-overwriting one like Postgres?
>

They call this MVCC "RLV" (row level versioning). I think they use a rollback
segment like Oracle (a.k.a. the "version store", kept in tempdb in SQL Server).
Some details are explained in their white paper, "Database Concurrency and Row
Level Versioning in SQL Server 2005".

Regards,
Qingqing



From:
John A Meinel
Date:

Qingqing Zhou wrote:
> "Alvaro Herrera" <-ip.org> writes
>
>>Interesting; do they use an overwriting storage manager like Oracle, or
>>a non-overwriting one like Postgres?
>>
>
>
> They call this MVCC "RLV(row level versioning)". I think they use rollback
> segment like Oracle (a.k.a "version store" or tempdb in SQL Server).  Some
> details are explained in their white paper:"Database concurrency and row
> level versioning in SQL Server 2005".
>
> Regards,
> Qingqing
>

I found the paper here:
http://www.microsoft.com/technet/prodtechnol/sql/2005/cncrrncy.mspx

And it does sound like they are doing it the Oracle way:

When a record in a table or index is updated, the new record is stamped
with the transaction sequence_number of the transaction that is doing
the update. The previous version of the record is stored in the version
store, and the new record contains a pointer to the old record in the
version store. Old records in the version store may contain pointers to
even older versions. All the old versions of a particular record are
chained in a linked list, and SQL Server may need to follow several
pointers in a list to reach the right version. Version records need to
be kept in the version store only as long as there are
operations that might require them.
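The version chain described in that excerpt is essentially a per-record linked list keyed by transaction sequence number. A toy sketch (field names hypothetical; real SQL Server keeps the old versions in tempdb, not in process memory):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    xid: int                           # txn sequence number that wrote this version
    value: str
    older: Optional["Version"] = None  # pointer into the "version store"

def read(head: Version, snapshot_xid: int) -> Optional[str]:
    # Follow the chain until we reach a version old enough for our snapshot.
    v = head
    while v is not None:
        if v.xid <= snapshot_xid:
            return v.value
        v = v.older
    return None

v1 = Version(xid=10, value="old")
v2 = Version(xid=20, value="new", older=v1)  # an UPDATE chains the prior version

print(read(v2, 15))  # a snapshot taken before xid 20 sees "old"
print(read(v2, 25))  # a later snapshot sees "new"
```

This is the "overwriting" style: the newest version lives in place and readers may have to chase pointers backward, whereas Postgres leaves old tuples in the heap and lets VACUUM reclaim them.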

John
=:->

From:
Stéphane COEZ
Date:

John Arbash Meinel wrote :
>
> You might also try a different query, something like:
>
> SELECT DISTINCT cod FROM mytable ORDER BY cod GROUP BY cod;
> (You may or may not want order by, or group by, try the different
> combinations.)
> It might be possible to have the planner realize that all you
> want is unique rows, just doing a group by doesn't give you that.
>
> John
> =:->
>
Thanks John, but using SELECT DISTINCT, with or without ORDER BY or GROUP BY, is
worse...
30 sec (with the index) - I stopped it at 200 sec without the index...

So the Hash Aggregate is much better than an index scan ...






From:
Stéphane COEZ
Date:

> From: Magnus Hagander [mailto:]
> Out of curiosity, what plan do you get from SQLServer? I bet
> it's a clustered index scan...
>
>
> //Magnus
>

I get a Table Scan and a Hash Aggregate...
Stephane





From:
Stéphane COEZ
Date:

>
> One little thing. Did you shutdown sql2000 while testing
> postgresql? Remember that postgresql uses system cache.
> Sql2000 uses a large part of memory as buffer and it will not
> be available to operating system. I must say that, probably,
> results will be the same, but it will be a better test.
>

Shutting down SQL 2000 has no effect on PG performance.

Stephane.