Thread: Index with all necessary columns - Postgres vs MSSQL

Index with all necessary columns - Postgres vs MSSQL

From
Gudmundur Johannesson
Date:
Hi,

I have a table in Postgres like:
CREATE TABLE test
(
  id integer,
  dtstamp timestamp without time zone,
  rating real
);
CREATE INDEX test_all
  ON test
  USING btree
  (id, dtstamp, rating);

My db has around 200M rows and I have reduced my test select statement down to:
SELECT count(1) FROM test
WHERE id in (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
AND dtstamp between cast('2011-10-19 08:00:00' as timestamp) and cast('2011-10-19 16:00:00' as timestamp);

In Postgres this takes about 23 sec.
In MSSQL this takes about 1 sec.

MSSQL only accesses the index and does not access the table itself (it uses an index-only scan).

Postgres has the following plan:
"Aggregate  (cost=130926.24..130926.25 rows=1 width=0)"
"  ->  Bitmap Heap Scan on test  (cost=1298.97..130832.92 rows=37330 width=0)"
"        Recheck Cond: ((id = ANY ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[])) AND (dtstamp >= '2011-10-19 08:00:00'::timestamp without time zone) AND (dtstamp <= '2011-10-19 16:00:00'::timestamp without time zone))"
"        ->  Bitmap Index Scan on test_all  (cost=0.00..1289.64 rows=37330 width=0)"
"              Index Cond: ((id = ANY ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[])) AND (dtstamp >= '2011-10-19 08:00:00'::timestamp without time zone) AND (dtstamp <= '2011-10-19 16:00:00'::timestamp without time zone))"

The results are disappointing, since I want to switch to Postgres, but I have not been able to force Postgres to use only the index :-(

Any hints that may lead me back on track?

Thanks,
   - Gummi

Re: Index with all necessary columns - Postgres vs MSSQL

From
Merlin Moncure
Date:
On Wed, Feb 1, 2012 at 11:10 AM, Gudmundur Johannesson
<gudmundur.johannesson@gmail.com> wrote:
> [...]
> Any hints that may lead me back on track?

*) are the times in postgres stable across calls?
*) where is the 'id list' coming from?
*) how long does this query take?

SELECT count(1) FROM test WHERE id = 202 AND dtstamp between
'2011-10-19 08:00:00'::timestamp and '2011-10-19 16:00:00'::timestamp;

The feature you're looking for in postgres is called 'index only
scans', and 9.2 will contain an implementation of that feature (see:
http://rhaas.blogspot.com/2011/10/index-only-scans-weve-got-em.html).
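In the meantime, here's a rough sketch of how to check whether the
optimization kicks in on a 9.2 build (same query as yours; the plan node
to look for is "Index Only Scan"):

EXPLAIN SELECT count(1) FROM test
WHERE id in (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
AND dtstamp between '2011-10-19 08:00:00'::timestamp
                and '2011-10-19 16:00:00'::timestamp;
-- on 9.2, with an up-to-date visibility map, this should show
-- "Index Only Scan using test_all on test" instead of the bitmap heap scan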

merlin

Re: Index with all necessary columns - Postgres vs MSSQL

From
Scott Marlowe
Date:
On Wed, Feb 1, 2012 at 10:10 AM, Gudmundur Johannesson
<gudmundur.johannesson@gmail.com> wrote:
> [...]
> Any hints that may lead me back on track?

As Merlin mentioned, postgres doesn't have "covering" indexes yet.  I
was wondering what EXPLAIN ANALYZE of your query looks like, and what
version of pgsql you're running.  It might be that we can at least get
that 23 seconds down to something closer to 1 second rather than
waiting for pg 9.2 to get here.

First try individual indexes on the two fields, and also try a two-column
index on the two fields, both with id first and with dtstamp first.  Use
EXPLAIN ANALYZE to see if this does any better.  Also look at this wiki
page and see if there's anything there that helps:
http://wiki.postgresql.org/wiki/SlowQueryQuestions  Especially this
part: http://wiki.postgresql.org/wiki/Guide_to_reporting_problems
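
For example, something along these lines (a sketch -- the index names
are placeholders):

CREATE INDEX test_id ON test (id);
CREATE INDEX test_dtstamp ON test (dtstamp);
-- or the two-column variants:
CREATE INDEX test_id_dtstamp ON test (id, dtstamp);
CREATE INDEX test_dtstamp_id ON test (dtstamp, id);
-- then compare plans and timings:
EXPLAIN ANALYZE SELECT count(1) FROM test
WHERE id in (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
AND dtstamp between '2011-10-19 08:00:00'::timestamp
                and '2011-10-19 16:00:00'::timestamp;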

Re: Index with all necessary columns - Postgres vs MSSQL

From
Merlin Moncure
Date:
On Wed, Feb 1, 2012 at 12:50 PM, Gudmundur Johannesson
<gudmundur.johannesson@gmail.com> wrote:
> Here are the answers to your questions:
> 1) I changed the select statement so I am referring to 1 day at a time.  In
> that case the response time is similar.  Basically, the data is not in cache
> when I do that and the response time is about 23 seconds.

what's the difference between the first and the second run time?
Note, if you are only interested in the date the dtStamp falls on, you
can exploit that in the index to knock 4 bytes off your index entry:

CREATE INDEX test_all
  ON test
  USING btree
  (id, (dtstamp::date), rating);

and then use a similar expression to query it back out.
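
For instance (a sketch, untested):

SELECT count(1) FROM test
WHERE id in (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
AND dtstamp::date = '2011-10-19'::date;
-- the WHERE clause must use the same dtstamp::date expression as the
-- index definition, otherwise the planner won't match the index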

> 3) The query takes 23 sec vs 1 sec or lower in mssql.

I asked you to time a different query.  Look again (and I'd like to
see cached and uncached times).

> We never update/delete and therefore the data is always correct in the index
> (never dirty).  Therefore, Postgres could have used the data in it.
>
> I started adding columns into indexes in Oracle approx 15 years ago and
> it was a brilliant discovery.  This looks like a show stopper for me but I

I doubt covering indexes is going to make that query 23x faster.
However, I bet we can get something worked out.

merlin

Re: Index with all necessary columns - Postgres vs MSSQL

From
Merlin Moncure
Date:
On Thu, Feb 2, 2012 at 10:41 AM, Gudmundur Johannesson
<gudmundur.johannesson@gmail.com> wrote:
> Do you think I should try using the latest build of the source for 9.2 since
> index-only-scan is "ready" according to
> http://www.depesz.com/index.php/2011/10/08/waiting-for-9-2-index-only-scans/
> ?

hm, interesting.

You are certainly welcome to try that, and we would definitely like to
see your results.  I looked around and didn't see any binaries of the
development snapshots for windows, which means you'd have to compile
postgres yourself in order to test 9.2 at this point in time.  Testing
and feedback on the index only scan feature would be very much
appreciated.

Generally speaking, the postgresql source tree is very high quality --
stuff should mostly work.  The biggest annoyance is that you get lots
of catalog version bumps when pulling new versions of the sources,
forcing a dump/reload.

merlin

Re: Index with all necessary columns - Postgres vs MSSQL

From
Gudmundur Johannesson
Date:
Hi,

Here are the answers to your questions:
1) I changed the select statement so I am referring to 1 day at a time.  In that case the response time is similar.  Basically, the data is not in cache when I do that and the response time is about 23 seconds.

2) The list of IDs is provided by the middle layer and represents a logical group.
btw: There are about 360 devices there.  The distribution of dtStamp is approx 200,000,000 rows / 360 devices / (4 months), which gives approx 4600 dtStamp values per device per day.

3) The query takes 23 sec vs 1 sec or lower in mssql.

We never update/delete and therefore the data is always correct in the index (never dirty).  Therefore, Postgres could have used the data in it.

I started adding columns into indexes in Oracle approx 15 years ago and it was a brilliant discovery.  This looks like a show stopper for me but I will

Thanks,
   - Gummi

On Wed, Feb 1, 2012 at 5:52 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Wed, Feb 1, 2012 at 11:10 AM, Gudmundur Johannesson
<gudmundur.johannesson@gmail.com> wrote:
> [...]
> Any hints that may lead me back on track?

*) are the times in postgres stable across calls?
*) where is the 'id list' coming from?
*) how long does this query take?

SELECT count(1) FROM test WHERE id = 202 AND dtstamp between
'2011-10-19 08:00:00'::timestamp and '2011-10-19 16:00:00'::timestamp;

The feature you're looking for in postgres is called 'index only
scans', and 9.2 will contain an implementation of that feature (see:
http://rhaas.blogspot.com/2011/10/index-only-scans-weve-got-em.html).

merlin

Re: Index with all necessary columns - Postgres vs MSSQL

From
Gudmundur Johannesson
Date:
Hi,

I want to start by thanking you guys for the quick response; I will try to provide all the information you requested.

1) What version am I running:
"PostgreSQL 9.1.2, compiled by Visual C++ build 1500, 64-bit"

2) Schema:
CREATE TABLE test (id integer, dtstamp timestamp without time zone, rating real) WITH (OIDS=FALSE);
CREATE INDEX test_all ON test USING btree (id, dtstamp, rating);
200M rows
Table size: 9833 MB
Index size: 7653 MB

3) Difference between the first and the second run time?
The statement executed is:
SELECT count(1) FROM test
WHERE id in (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
AND dtstamp between cast('2011-10-19 08:00:00' as timestamp) and cast('2011-10-19 16:00:00' as timestamp);
a) 1st run = 26 seconds
b) 2nd run = 0.234 seconds
c) 3rd-6th run = 0.06 seconds

If I perform the query above for another day then I get 26 seconds for the 1st query.

4) What was the execution plan of it?
"Aggregate  (cost=151950.75..151950.76 rows=1 width=0)"
"  ->  Bitmap Heap Scan on data_cbm_reading cbm  (cost=1503.69..151840.82 rows=43974 width=0)"
"        Recheck Cond: ((virtual_id = ANY ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[])) AND ("timestamp" >= '2011-10-19 08:00:00'::timestamp without time zone) AND ("timestamp" <= '2011-10-19 16:00:00'::timestamp without time zone))"
"        ->  Bitmap Index Scan on data_cbm_reading_all  (cost=0.00..1492.70 rows=43974 width=0)"
"              Index Cond: ((virtual_id = ANY ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[])) AND ("timestamp" >= '2011-10-19 08:00:00'::timestamp without time zone) AND ("timestamp" <= '2011-10-19 16:00:00'::timestamp without time zone))"

5) In this case, I shut down the mssql server/machine and restarted it.  To be on the safe side, I ensured the cache was empty using dbcc freeproccache and dbcc dropcleanbuffers.
Then I tried the same statement as above:
a) 1st run = 0.8 seconds
b) 2nd, 3rd, ... run = 0.04 seconds
c) changing the select statement to any other day and running it again gives a 1st run of 0.5 seconds
d) 2nd, 3rd, ... run = 0.04 seconds


6) You wrote "I doubt covering indexes is going to make that query 23x faster."
I decided to check out how mssql performs if it cannot use a covering index.  In order to do that, I dropped my current index and created it again on (id, dtstamp) only.  That forces mssql to look into the data file, since the index is no longer sufficient.
Running the following statement forces the "rating" column to be accessed:
select sum(rating)
FROM test
WHERE id in (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
AND dtstamp >= '2011-10-19 08:00:00' AND dtstamp <= '2011-10-19 16:00:00';
a) 1st run = 20 seconds
b) 2nd run = 0.6 seconds
c) 3rd, ... run = 0.3 seconds
As you can see, the response time gets just as bad as in Postgres.
Now let's recreate the mssql index with all the columns and double-check the response time:
a) 1st run = 2 seconds
b) 2nd run = 0.12 seconds
c) 3rd, ... run = 0.3 seconds


Therefore, I must conclude that in the case of mssql the "covering" index is making a huge impact.

I have spent the whole day providing this data (takes a while to shuffle 200M rows) and tomorrow I will try your suggestion regarding two indexes.

Do you think I should try using the latest build of the source for 9.2, since index-only-scan is "ready" according to http://www.depesz.com/index.php/2011/10/08/waiting-for-9-2-index-only-scans/ ?


Thanks,
   - Gummi



On Wed, Feb 1, 2012 at 7:35 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> [...]

Re: Index with all necessary columns - Postgres vs MSSQL

From
Gudmundur Johannesson
Date:
Maybe I should first try to partition the table by date and see if that helps.
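
On 9.1 that would be inheritance partitioning, roughly like this (a
sketch -- the child table name is made up):

CREATE TABLE test_20111019 (
  CHECK (dtstamp >= '2011-10-19' AND dtstamp < '2011-10-20')
) INHERITS (test);
CREATE INDEX test_20111019_all ON test_20111019 (id, dtstamp, rating);
-- rows must be routed into the right child by an insert trigger
-- (omitted here); with constraint_exclusion = partition, a dtstamp
-- range in the WHERE clause prunes the non-matching children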

Thanks,
   - Gummi

On Thu, Feb 2, 2012 at 8:30 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
[...]

Re: Index with all necessary columns - Postgres vs MSSQL

From
"Igor Neyman"
Date:
From: Gudmundur Johannesson [mailto:gudmundur.johannesson@gmail.com]
Sent: Thursday, February 02, 2012 11:42 AM
To: Merlin Moncure
Cc: pgsql-performance@postgresql.org
Subject: Re: Index with all necessary columns - Postgres vs MSSQL

[...]


Gudmundur,

Just for clarification purposes:

This schema:

CREATE TABLE test (id integer, dtstamp timestamp without time zone, rating real) WITH (OIDS=FALSE);
CREATE INDEX test_all ON test USING btree (id, dtstamp, rating);

and this query plan:

"Aggregate  (cost=151950.75..151950.76 rows=1 width=0)"
"  ->  Bitmap Heap Scan on data_cbm_reading cbm  (cost=1503.69..151840.82 rows=43974 width=0)"
"        Recheck Cond: ((virtual_id = ANY ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[])) AND
("timestamp">= '2011-10-19 08:00:00'::timestamp without time zone) AND ("timestamp" <= '2011-10-19 16:00:00'::timestamp
withouttime zone))" 
"        ->  Bitmap Index Scan on data_cbm_reading_all  (cost=0.00..1492.70 rows=43974 width=0)"
"              Index Cond: ((virtual_id = ANY ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[]))
AND("timestamp" >= '2011-10-19 08:00:00'::timestamp without time zone) AND ("timestamp" <= '2011-10-19
16:00:00'::timestampwithout time zone))" 

reference different table and index names.
Also, EXPLAIN ANALYZE would provide additional info compared to just EXPLAIN.
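For instance (a sketch):

EXPLAIN ANALYZE
SELECT count(1) FROM test
WHERE id in (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
AND dtstamp between '2011-10-19 08:00:00'::timestamp
                and '2011-10-19 16:00:00'::timestamp;
-- unlike plain EXPLAIN, this actually executes the query and reports
-- real row counts and per-node timings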

One option you could try is to cluster your table based on the "test_all" index, and see if it makes a difference.
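Roughly (a sketch -- note that CLUSTER takes an exclusive lock and
rewrites the whole table):

CLUSTER test USING test_all;
ANALYZE test;
-- afterwards the heap is physically ordered by (id, dtstamp, rating),
-- so the bitmap heap scan should touch far fewer random pages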
BTW, in SQL Server your "covering" index - is it clustered?

Regards,
Igor Neyman


Re: Index with all necessary columns - Postgres vs MSSQL

From
Merlin Moncure
Date:
On Tue, Feb 7, 2012 at 11:59 AM, Gudmundur Johannesson
<gudmundur.johannesson@gmail.com> wrote:
> On Tue, Feb 7, 2012 at 3:11 PM, Igor Neyman <ineyman@perceptron.com> wrote:
>>
>> From: Gudmundur Johannesson [mailto:gudmundur.johannesson@gmail.com]
>> Sent: Thursday, February 02, 2012 11:42 AM
>> To: Merlin Moncure
>> Cc: pgsql-performance@postgresql.org
>> Subject: Re: Index with all necessary columns - Postgres vs MSSQL
>>
>> Hi,
>>
>> I want to start by thanking you guys for a quick response and I will try
>> to provide all the information you request.
>>
>> 1) What version am I running:
>> "PostgreSQL 9.1.2, compiled by Visual C++ build 1500, 64-bit"
>>
>> 2) Schema:
>> CREATE TABLE test( id integer,  dtstamp timestamp without time zone,
>> rating real) WITH ( OIDS=FALSE);
>> CREATE INDEX test_all ON test USING btree  (id , dtstamp, rating);
>> 200M rows
>> Table size 9833MB
>> Index size 7653 MB
>>
>> 3) Difference between the first and the second run time?
>> The statement executed is:
>> SELECT count(1) FROM test
>> WHERE id in (58,83,88,98,124,141,170,195,
>> 202,252,265,293,305,331,348)
>> AND dtstamp between cast('2011-10-19 08:00:00' as timestamp)  and
>> cast('2011-10-19 16:00:00' as timestamp)
>> a) 1st run = 26 seconds
>> b) 2nd run = 0.234 seconds
>> c) 3rd-6th run = 0.06 seconds
>>
>> If I perform the query above for another day then I get 26 seconds for the
>> 1st query.
>>
>> 4) What was the execution plan of it
>> "Aggregate  (cost=151950.75..151950.76 rows=1 width=0)"
>> "  ->  Bitmap Heap Scan on data_cbm_reading cbm  (cost=1503.69..151840.82
>> rows=43974 width=0)"
>> "        Recheck Cond: ((virtual_id = ANY
>> ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[]))
>> AND ("timestamp" >= '2011-10-19 08:00:00'::timestamp without time zone) AND
>> ("timestamp" <= '2011-10-19 16:00:00'::timestamp without time zone))"
>> "        ->  Bitmap Index Scan on data_cbm_reading_all
>> (cost=0.00..1492.70 rows=43974 width=0)"
>> "              Index Cond: ((virtual_id = ANY
>> ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[]))
>> AND ("timestamp" >= '2011-10-19 08:00:00'::timestamp without time zone) AND
>> ("timestamp" <= '2011-10-19 16:00:00'::timestamp without time zone))"
>>
>> 5) In this case, I shut down the mssql server/machine and restart it.  To
>> be on the safe side, I ensured the cache is empty using dbcc freeproccache
>> and dbcc dropcleanbuffers.
>> Then I tried the same statement as above:
>> a) 1st run = 0.8 seconds
>> b) 2nd, 3rd, ... run = 0.04 seconds
>> c) change the select statement for any another other day and run it again
>> give 1st run 0.5 seconds
>> d) 2nd, 3rd, ... run = 0.04 seconds
>>
>> 6) You wrote "I doubt covering indexes is going to make that query 23x
>> faster."
>> I decided to check out how mssql performs if it cannot use a covering
>> index.  In order to do that, I drop my current index and create it again on
>> id, dtstamp.  That forces mssql to look into the data file and the index is
>> no longer sufficient.
>> Running the following statement force the "rating" columns to be accessed:
>> select sum(rating)
>> FROM test
>>                WHERE id in
>> (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
>>                AND dtstamp >= '2011-10-19 08:00:00' AND dtstamp <=
>> '2011-10-19 16:00:00'
>> a) 1st run = 20 seconds
>> b) 2nd run = 0.6
>> c) 3rd, ... run = 0.3 seconds
>> As you can see the response time gets just as bad as in Postgres.
>> Now lets recreate the mssql index with all the columns and double check
>> the response time:
>> a) 1st run = 2 seconds
>> b) 2nd run = 0.12
>> c) 3rd, ... run = 0.3 seconds
>>
>>
>> Therefore, I must conclude that in the case of mssql the "covering" index
>> is making a huge impact.
>>
>> I have spent the whole day providing this data (takes a while to shuffle
>> 200M rows) and tomorrow I will try your suggestion regarding two indexes.
>>
>> Do you think I should try using the latest build of the source for 9.2
>> since index-only-scan is "ready" according to
>> http://www.depesz.com/index.php/2011/10/08/waiting-for-9-2-index-only-scans/
>> ?
>>
>> Thanks,
>>    - Gummi
>>
>>
>> Gudmundur,
>>
>> Just for clarification purposes:
>>
>> This schema:
>>
>> CREATE TABLE test( id integer,  dtstamp timestamp without time zone,
>>  rating real) WITH ( OIDS=FALSE);
>> CREATE INDEX test_all ON test USING btree  (id , dtstamp, rating);
>>
>> and this query plan:
>>
>> "Aggregate  (cost=151950.75..151950.76 rows=1 width=0)"
>> "  ->  Bitmap Heap Scan on data_cbm_reading cbm  (cost=1503.69..151840.82
>> rows=43974 width=0)"
>> "        Recheck Cond: ((virtual_id = ANY
>> ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[]))
>> AND ("timestamp" >= '2011-10-19 08:00:00'::timestamp without time zone) AND
>> ("timestamp" <= '2011-10-19 16:00:00'::timestamp without time zone))"
>> "        ->  Bitmap Index Scan on data_cbm_reading_all
>>  (cost=0.00..1492.70 rows=43974 width=0)"
>> "              Index Cond: ((virtual_id = ANY
>> ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[]))
>> AND ("timestamp" >= '2011-10-19 08:00:00'::timestamp without time zone) AND
>> ("timestamp" <= '2011-10-19 16:00:00'::timestamp without time zone))"
>>
>> reference different table and index names.
>> Also, EXPLAIN ANALYZE would provide additional info compared to just
>> EXPLAIN.
>>
>> One option you could try, is to cluster your table based on " test_all"
>> index, and see if it makes a difference.
>> BTW., in SQL Server your "covering" index - is it clustered?
>>
>> Regards,
>> Igor Neyman
>>
>
>
> Hi Igor,
>
> 1) I "simplified" the names when posting originally and forgot to replace
> the names in the analyze output.  Sorry about the confusion.
>
> 2) The index in mssql is not clustered.
>
> 3) I am now testing to partition the 200 million table into one partition
> per day and see how it performs.
>
> 4) I compiled and installed Postgres 9.2 and proved to myself that Postgres
> does not look into the table and relies only on the index.  Therefore,
> this is looking bright at the moment.
>
> 5) I must deliver the db for production in June and it does not sound wise
> to do that in 9.2 (unless it has been released by then).

yeah -- I just started to do some performance testing on index only
scan as well and am finding the speedup to be really dramatic when the
optimization kicks in, especially when your query passes over the heap
in a random-ish fashion.  note that results in the field will vary
wildly -- the index only scan optimization does visibility checks at
the page level, so write-once or read-mostly tables will see a lot
more benefit than high traffic oltp type tables.
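
one practical consequence (a sketch, assuming a load-once table like
yours):

VACUUM ANALYZE test;
-- vacuum is what sets the visibility map bits, and index only scans
-- can skip the heap check only for pages marked all-visible; vacuum
-- once after the bulk load before benchmarking the feature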

regarding postgresql 9.2 by june, the official release schedule has
9.2 going into beta by april.  if we actually make that date, then a
production worthy build (which I define as release candidate or
better) by june is plausible, as long as you are willing to do a
binary swap in the field post release and have some tolerance for
early release type bugs.  a key thing to watch is the current commit
fest (https://commitfest.postgresql.org/action/commitfest_view?id=13)
-- it is scheduled to be wrapped up by feb 14, which is starting to
look highly optimistic at best.  on a positive note, 9.2 seems to be
somewhat lighter on big controversial patches than previous releases,
but if I were in your shoes I unfortunately wouldn't bank on 9.2 for
june.

merlin

Re: Index with all necessary columns - Postgres vs MSSQL

From
Gudmundur Johannesson
Date:
On Tue, Feb 7, 2012 at 3:11 PM, Igor Neyman <ineyman@perceptron.com> wrote:
[...]



Hi Igor,

1) I "simplified" the names when posting originally and forgot to replace the names in the analyze output.  Sorry about the confusion.

2) The index in mssql is not clustered.

3) I am now testing partitioning the 200 million row table into one partition per day to see how it performs.

4) I compiled and installed Postgres 9.2 and proved to myself that Postgres does not look into the table and relies only on the index.  Therefore, this is looking bright at the moment.

5) I must deliver the db for production in June and it does not sound wise to do that in 9.2 (unless it has been released by then).

Thanks,
   - Gummi
