Thread:
I have 2 tables with 200,000 rows of data and 3 character/string columns: ID, Question and Response. The query below compares the data between the 2 tables based on ID and Question, and if the Response does not match between the left table and the right table it identifies the IDs where there is a mismatch. Running the query in SQL Server 2008 using the ISNULL function takes a few milliseconds. Running the same query in Postgresql takes over 70 seconds. The 2 queries are below:

SQL Server 2008 R2 Query

select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and t1.question = t2.question and isnull(t1.response,'ISNULL') <> isnull(t2.response,'ISNULL')

Postgres 9.1 Query

select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and t1.question = t2.question and coalesce(t1.response,'ISNULL') <> coalesce(t2.response,'ISNULL')

What gives?
Hi,

> SQL Server 2008 R2 Query
> [...]
> Postgres 9.1 Query
> [...]
> What gives?

Do they have the same indexes/PKs etc.?

Regards,
Jayadevan
On 16/11/10 09:14, Humair Mohammed wrote:
Can we see the execution plans (EXPLAIN <the query text here> for Postgres, and however you get a text-based query plan from SQL Server), so we can see if there are any obvious differences in how things are done?
Also probably worthwhile is telling us the table definitions of the tables concerned.
For Postgres - did you run ANALYZE on the database concerned before running the queries? (optimizer stats are usually updated automatically, but if you were quick to run the queries after loading the data they might not have been).
regards
Mark
2010/11/15 Humair Mohammed <humairm@hotmail.com>:
> Running the query in SQL Server 2008 using the ISNULL function
> takes a few milliseconds. Running the same query in Postgresql takes
> over 70 seconds.

I think the problem may be the ugly predicate:

coalesce(t1.response,'ISNULL') <> coalesce(t2.response,'ISNULL')

Try the IS DISTINCT FROM operator instead:

... AND t1.response IS DISTINCT FROM t2.response

Regards,

Pavel Stehule

p.s. don't use coalesce in a WHERE clause if you can avoid it.
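Spelled out against the original Postgres query, Pavel's suggestion would look something like this (a sketch only - the table and column names are the generic ones from the opening post):

```sql
-- Null-safe comparison without a sentinel string:
-- IS DISTINCT FROM is true when exactly one side is NULL, or when both
-- sides are non-NULL and unequal; two NULLs compare as not distinct,
-- matching the intent of the COALESCE('ISNULL') trick.
SELECT t1.id
FROM table1 t1
INNER JOIN table2 t2
        ON t1.id = t2.id
       AND t1.question = t2.question
WHERE t1.response IS DISTINCT FROM t2.response;
```

Besides being simpler for the planner, this avoids a subtle bug in the sentinel approach: a row whose real response is the literal string 'ISNULL' would compare equal to a NULL response on the other side.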
2010/11/17 Humair Mohammed <humairm@hotmail.com>:
>
> There are no indexes on the tables either in SQL Server or Postgresql - I am
> comparing apples to apples here. I ran ANALYZE on the postgresql tables;
> after that, query performance times are still high: 42 seconds with COALESCE
> and 35 seconds with IS DISTINCT FROM.
> [...]

This is a little bit strange - did you ANALYZE and VACUUM?

Please send the result of EXPLAIN ANALYZE.

Pavel
Dne 17.11.2010 05:47, Pavel Stehule napsal(a):
> 2010/11/17 Humair Mohammed <humairm@hotmail.com>:
>>
>> There are no indexes on the tables either in SQL Server or Postgresql - I am
>> comparing apples to apples here. I ran ANALYZE on the postgresql tables,

Actually no, you're not comparing apples to apples. You've provided so little information that you may be comparing apples to cucumbers, or maybe to some strange animals.

1) info about the install

What OS is this running on? I guess it's Windows in both cases, right? How much memory is there? What is the size of shared_buffers? The default PostgreSQL settings are very, very, very limited; you have to bump shared_buffers to a much larger value. What are the other important settings (e.g. work_mem)?

2) info about the dataset

How large are the tables? I don't mean number of rows, I mean number of blocks / occupied disk space. Run this query:

SELECT relname, relpages, reltuples, pg_size_pretty(pg_table_size(oid))
FROM pg_class WHERE relname IN ('table1', 'table2');

3) info about the plan

Please provide EXPLAIN ANALYZE output, maybe with info about buffers, e.g. something like:

EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT ...

4) no indexes?

Why have you decided not to use any indexes? If you want decent performance, you will have to use indexes. Obviously there is some overhead associated with them, but it's premature optimization to avoid them unless you prove the opposite.

BTW I'm not a MSSQL expert, but it seems like it's building a bitmap index on the fly to support the parallelized query - PostgreSQL does not support that.

regards
Tomas
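The settings and indexes Tomas asks about can be checked and created with a few statements (a sketch - the generic table1/table2 names from the opening post are assumed, and the unnamed CREATE INDEX form lets Postgres pick the index names):

```sql
-- Report the two memory settings most relevant to this hash join.
SHOW shared_buffers;
SHOW work_mem;

-- Composite indexes covering the join keys on both tables.
CREATE INDEX ON table1 (id, question);
CREATE INDEX ON table2 (id, question);
```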
I have to concur. SQL Server is written specifically, and only, for Windows, so it is optimized for Windows. PostgreSQL is written for just about everything, using common code, so there isn't much OS-specific optimization. Check out your config and send it to us - that would include the OS and hardware configs for both machines.
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Yes, strange indeed. I did rerun ANALYZE and VACUUM; it took 70 seconds to rerun the query. Results from EXPLAIN ANALYZE below:
"Hash Join (cost=16212.30..52586.43 rows=92869 width=17) (actual time=43200.223..49502.874 rows=3163 loops=1)"
" Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
" Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
" -> Seq Scan on pivotbad pb (cost=0.00..2804.96 rows=93496 width=134) (actual time=0.009..48.200 rows=93496 loops=1)"
" -> Hash (cost=7537.12..7537.12 rows=251212 width=134) (actual time=42919.453..42919.453 rows=251212 loops=1)"
" Buckets: 1024 Batches: 64 Memory Usage: 650kB"
" -> Seq Scan on pivotgood pg (cost=0.00..7537.12 rows=251212 width=134) (actual time=0.119..173.019 rows=251212 loops=1)"
"Total runtime: 49503.450 ms"
There are no indexes on the tables either in SQL Server or Postgresql - I am comparing apples to apples here. I ran ANALYZE on the postgresql tables; after that, query performance times are still high: 42 seconds with COALESCE and 35 seconds with IS DISTINCT FROM.
Here is the execution plan from Postgresql for the query - select pb.id from pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question = pg.question and coalesce(pb.response,'MISSING') <> coalesce(pg.response,'MISSING')
Execution Time: 42 seconds
"Hash Join (cost=16212.30..48854.24 rows=93477 width=17)"
" Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
" Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
" -> Seq Scan on pivotbad pb (cost=0.00..2804.96 rows=93496 width=134)"
" -> Hash (cost=7537.12..7537.12 rows=251212 width=134)"
" -> Seq Scan on pivotgood pg (cost=0.00..7537.12 rows=251212 width=134)"
And here is the execution plan from SQL Server for query - select pb.id from pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question = pg.question and isnull(pb.response,'ISNULL')<> isnull(pg.response,'ISNULL')
Execution Time: < 1 second
Cost: 1% |--Parallelism(Gather Streams)
Cost: 31% |--Hash Match(Inner Join, HASH:([pb].[ID], [pb].[Question])=([pg].[ID], [pg].[Question]), RESIDUAL:([master].[dbo].[pivotbad].[ID] as [pb].[ID]=[master].[dbo].[pivotgood].[ID] as [pg].[ID] AND [master].[dbo].[pivotbad].[Question] as [pb].[Question]=[master].[dbo].[pivotgood].[Question] as [pg].[Question] AND [Expr1006]<>[Expr1007]))
Cost: 0% |--Bitmap(HASH:([pb].[ID], [pb].[Question]), DEFINE:([Bitmap1008]))
Cost: 0% |--Compute Scalar(DEFINE:([Expr1006]=isnull([master].[dbo].[pivotbad].[Response] as [pb].[Response],'ISNULL')))
Cost: 6% |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([pb].[ID], [pb].[Question]))
Cost: 12% |--Table Scan(OBJECT:([master].[dbo].[pivotbad] AS [pb]))
Cost: 0% |--Compute Scalar(DEFINE:([Expr1007]=isnull([master].[dbo].[pivotgood].[Response] as [pg].[Response],'ISNULL')))
Cost: 17% |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([pg].[ID], [pg].[Question]))
Cost: 33% |--Table Scan(OBJECT:([master].[dbo].[pivotgood] AS [pg]), WHERE:(PROBE([Bitmap1008],[master].[dbo].[pivotgood].[ID] as [pg].[ID],[master].[dbo].[pivotgood].[Question] as [pg].[Question])))
Humair Mohammed <humairm@hotmail.com> writes:
> Yes strange indeed, I did rerun ANALYZE and VACUUM. Took 70 seconds to
> rerun the query. Results from EXPLAIN ANALYZE below:
> [...]
> "        Buckets: 1024 Batches: 64 Memory Usage: 650kB"

I have no idea how much memory SQL Server thinks it can use, but Postgres is limiting itself to work_mem, which you've apparently left at the default 1MB. You might get a fairer comparison by bumping that up some --- try 32MB or so. You want it high enough that the Hash node's output doesn't say there are multiple batches.

regards, tom lane
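Tom's suggestion can be tried per session, without editing postgresql.conf (a sketch - 32MB is just the value suggested above, and the query is the IS DISTINCT FROM variant discussed earlier in the thread):

```sql
-- Raise work_mem for this session only, then re-check the plan.
-- With enough memory the Hash node should report "Batches: 1", meaning
-- the 251,212-row hash table fits in memory instead of spilling to disk.
SET work_mem = '32MB';

EXPLAIN ANALYZE
SELECT pb.id
FROM pivotbad pb
INNER JOIN pivotgood pg
        ON pb.id = pg.id
       AND pb.question = pg.question
WHERE pb.response IS DISTINCT FROM pg.response;
```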
Hello,

there is probably a problem with the statistics - they are out of touch with reality. Please try the IS DISTINCT FROM operator again now; maybe the statistics will be better. Next, try increasing work_mem. The hash join is untypically slow on your machine.

Regards,

Pavel Stehule

2010/11/17 Humair Mohammed <humairm@hotmail.com>:
> Yes strange indeed, I did rerun ANALYZE and VACUUM. Took 70 seconds to rerun
> the query. Results from EXPLAIN ANALYZE below:
> [...]
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I have no idea how much memory SQL Server thinks it can use

Hmmm... That triggered an old memory -- when we were running SQL Server on Windows there was some registry setting we tweaked to prevent the OS from trying to cache disk I/O. (Sorry, I don't remember its name.) That helped SQL Server perform better, but would cripple PostgreSQL -- it counts on OS caching.

Of course, once we found that PostgreSQL was 70% faster on identical hardware with identical load, and that switching the OS to Linux brought it to twice as fast, I haven't had to worry about SQL Server or Windows configurations. ;-)

Don't panic if PostgreSQL seems slower at first; it's probably a configuration or maintenance-schedule issue that can be sorted out. Besides the specific advice Tom gave you, you might want to browse this page for configuration in general:

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

And if you continue to experience performance issues, this page can help you get to a resolution quickly:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

We've been very happy with the switch to PostgreSQL. We've had better performance, better reliability, less staff time needed to babysit backups, and we've gradually been using more of the advanced features not available in other products. It's well worth the effort to get over those initial bumps resulting from product differences.

-Kevin
Hello

I don't know. I checked a similar query on a similar-sized dataset, and I get times of about 3 seconds on a much worse notebook. So the problem can be in disk I/O speed - maybe in access to TOASTed values.

2010/11/21 Humair Mohammed <humairm@hotmail.com>:
> I am running 64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67
> Ghz Intel CPU. Both the SQL 2008 R2 and Postgresql are installed on the same
> machine. The DISTINCT FROM instead of the COALESCE does not help much. I ran
> 2 further tests with work_mem modifications (please note memory usage is
> quite low 650kb, so I am not sure if the work_mem is a factor):

work_mem has a slightly different meaning: it is just a limit, so "memory usage" can never be greater than work_mem. When the limit is too small, pg increases the number of "batches" - storing data in blocks on disk. Higher work_mem ~ fewer batches. The ideal is 1 batch.

Regards

Pavel Stehule

> First, I modified the work_mem setting to 1GB (reloaded config) from the
> default 1MB and I see a response time of 33 seconds.
>
> [...]
I am running 64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67 Ghz Intel CPU. Both SQL Server 2008 R2 and Postgresql are installed on the same machine. Using IS DISTINCT FROM instead of COALESCE does not help much. I ran 2 further tests with work_mem modifications (please note memory usage is quite low, 650kB, so I am not sure if work_mem is a factor):
First, I modified the work_mem setting to 1GB (reloaded config) from the default 1MB and I see a response time of 33 seconds. Results below from EXPLAIN ANALYZE:
"Hash Join (cost=16212.30..52586.43 rows=92869 width=17) (actual time=26742.343..33274.317 rows=3163 loops=1)"
" Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
" Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
" -> Seq Scan on pivotbad pb (cost=0.00..2804.96 rows=93496 width=134) (actual time=0.055..40.710 rows=93496 loops=1)"
" -> Hash (cost=7537.12..7537.12 rows=251212 width=134) (actual time=25603.460..25603.460 rows=251212 loops=1)"
" Buckets: 1024 Batches: 64 Memory Usage: 650kB"
" -> Seq Scan on pivotgood pg (cost=0.00..7537.12 rows=251212 width=134) (actual time=0.050..120.269 rows=251212 loops=1)"
"Total runtime: 33275.028 ms"
Second, I modified the work_mem setting to 2GB (reloaded config) and I see a response time of 38 seconds. Results below from EXPLAIN ANALYZE:
"Hash Join (cost=16212.30..52586.43 rows=92869 width=17) (actual time=26574.459..38406.422 rows=3163 loops=1)"
" Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
" Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
" -> Seq Scan on pivotbad pb (cost=0.00..2804.96 rows=93496 width=134) (actual time=0.067..37.938 rows=93496 loops=1)"
" -> Hash (cost=7537.12..7537.12 rows=251212 width=134) (actual time=26426.127..26426.127 rows=251212 loops=1)"
" Buckets: 1024 Batches: 64 Memory Usage: 650kB"
" -> Seq Scan on pivotgood pg (cost=0.00..7537.12 rows=251212 width=134) (actual time=0.038..115.319 rows=251212 loops=1)"
"Total runtime: 38406.927 ms"
By no means am I trying to compare the 2 products. When I noticed the slow behavior of COALESCE I tried it on SQL Server; since both are running on the same machine, that was the basis for my comment about apples to apples. It is possible that this is not an apples-to-apples comparison beyond the fact that both run on the same machine.
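A rough back-of-the-envelope check (my own arithmetic, not from the thread) suggests why the unchanged "Buckets: 1024 Batches: 64 Memory Usage: 650kB" line matters: the hashed table is roughly 32 MB, and 64 batches of about half a megabyte each is exactly what a 1 MB work_mem would produce, so the session that ran these queries probably never saw the new setting.

```python
# Figures taken from the EXPLAIN ANALYZE output above; the per-batch
# size is an approximation (planner width estimate, no overhead).
rows = 251212      # rows hashed from pivotgood
width = 134        # estimated row width in bytes
batches = 64       # "Batches: 64" in the plan

hashed_bytes = rows * width                 # ~32 MB hashed in total
per_batch_kb = hashed_bytes / batches / 1024
print(round(per_batch_kb))                  # ~514 kB per batch: consistent
                                            # with "Memory Usage: 650kB"
                                            # under a 1MB work_mem
```

If work_mem were really 1GB or 2GB, the whole ~32 MB hash table would fit in a single batch; seeing 64 batches in both runs is a hint that the reloaded setting did not reach this session.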
> From: pavel.stehule@gmail.com
> Date: Thu, 18 Nov 2010 07:14:24 +0100
> Subject: Re: Query Performance SQL Server vs. Postgresql
> To: humairm@hotmail.com
> CC: pgsql-performance@postgresql.org
>
> Hello,
>
> there should be a problem in the statistics, they are out of reality.
> Please, try to use the IS DISTINCT FROM operator now - maybe the statistics will
> be better. Next, try to increase work_mem. Hash join is
> untypically slow on your machine.
>
> Regards
>
> Pavel Stehule
>
> 2010/11/17 Humair Mohammed <humairm@hotmail.com>:
> > Yes strange indeed, I did rerun ANALYZE and VACCUM. Took 70 seconds to rerun
> > the query. Results from EXPLAIN ANALYZE below:
> > "Hash Join (cost=16212.30..52586.43 rows=92869 width=17) (actual
> > time=43200.223..49502.874 rows=3163 loops=1)"
> > " Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text =
> > (pg.question)::text))"
> > " Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text
> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> > " -> Seq Scan on pivotbad pb (cost=0.00..2804.96 rows=93496 width=134)
> > (actual time=0.009..48.200 rows=93496 loops=1)"
> > " -> Hash (cost=7537.12..7537.12 rows=251212 width=134) (actual
> > time=42919.453..42919.453 rows=251212 loops=1)"
> > " Buckets: 1024 Batches: 64 Memory Usage: 650kB"
> > " -> Seq Scan on pivotgood pg (cost=0.00..7537.12 rows=251212
> > width=134) (actual time=0.119..173.019 rows=251212 loops=1)"
> > "Total runtime: 49503.450 ms"
> >
> >> From: pavel.stehule@gmail.com
> >> Date: Wed, 17 Nov 2010 05:47:51 +0100
> >> Subject: Re: Query Performance SQL Server vs. Postgresql
> >> To: humairm@hotmail.com
> >> CC: pgsql-performance@postgresql.org
> >>
> >> 2010/11/17 Humair Mohammed <humairm@hotmail.com>:
> >> >
> >> > There are no indexes on the tables either in SQL Server or Postgresql -
> >> > I am
> >> > comparing apples to apples here. I ran ANALYZE on the postgresql tables,
> >> > after that query performance times are still high 42 seconds with
> >> > COALESCE
> >> > and 35 seconds with IS DISTINCT FROM.
> >> > Here is the execution plan from Postgresql for query - select pb.id from
> >> > pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question =
> >> > pg.question and coalesce(pb.response,'MISSING') <>
> >> > coalesce(pg.response,'MISSING')
> >> > Execution Time: 42 seconds
> >> > "Hash Join (cost=16212.30..48854.24 rows=93477 width=17)"
> >> > " Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text
> >> > =
> >> > (pg.question)::text))"
> >> > " Join Filter: ((COALESCE(pb.response, 'MISSING'::character
> >> > varying))::text
> >> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> >> > " -> Seq Scan on pivotbad pb (cost=0.00..2804.96 rows=93496
> >> > width=134)"
> >> > " -> Hash (cost=7537.12..7537.12 rows=251212 width=134)"
> >> > " -> Seq Scan on pivotgood pg (cost=0.00..7537.12 rows=251212
> >> > width=134)"
> >>
> >> this is little bit strange - did you ANALYZE and VACUUM?
> >>
> >> please send result of EXPLAIN ANALYZE
> >>
> >> Pavel
> >>
> >> >
> >> > And here is the execution plan from SQL Server for query - select pb.id
> >> > from
> >> > pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question =
> >> > pg.question and isnull(pb.response,'ISNULL')<>
> >> > isnull(pg.response,'ISNULL')
> >> > Execution Time: < 1 second
> >> > Cost: 1% |--Parallelism(Gather Streams)
> >> > Cost: 31% |--Hash Match(Inner Join, HASH:([pb].[ID],
> >> > [pb].[Question])=([pg].[ID], [pg].[Question]),
> >> > RESIDUAL:([master].[dbo].[pivotbad].[ID] as
> >> > [pb].[ID]=[master].[dbo].[pivotgood].[ID] as [pg].[ID] AND
> >> > [master].[dbo].[pivotbad].[Question] as
> >> > [pb].[Question]=[master].[dbo].[pivotgood].[Question] as [pg].[Question]
> >> > AND
> >> > [Expr1006]<>[Expr1007]))
> >> > Cost: 0% |--Bitmap(HASH:([pb].[ID], [pb].[Question]),
> >> > DEFINE:([Bitmap1008]))
> >> > Cost: 0% |--Compute
> >> > Scalar(DEFINE:([Expr1006]=isnull([master].[dbo].[pivotbad].[Response] as
> >> > [pb].[Response],'ISNULL')))
> >> > Cost: 6% |--Parallelism(Repartition Streams, Hash
> >> > Partitioning, PARTITION COLUMNS:([pb].[ID], [pb].[Question]))
> >> > Cost: 12% |--Table Scan(OBJECT:([master].[dbo].[pivotbad]
> >> > AS
> >> > [pb]))
> >> > Cost: 0% |--Compute
> >> > Scalar(DEFINE:([Expr1007]=isnull([master].[dbo].[pivotgood].[Response]
> >> > as
> >> > [pg].[Response],'ISNULL')))
> >> > Cost: 17% |--Parallelism(Repartition Streams, Hash
> >> > Partitioning, PARTITION COLUMNS:([pg].[ID], [pg].[Question]))
> >> > Cost: 33% |--Table
> >> > Scan(OBJECT:([master].[dbo].[pivotgood] AS [pg]),
> >> > WHERE:(PROBE([Bitmap1008],[master].[dbo].[pivotgood].[ID] as
> >> > [pg].[ID],[master].[dbo].[pivotgood].[Question] as [pg].[Question])))
> >> >
> >> >
> >> >
> >> >> From: pavel.stehule@gmail.com
> >> >> Date: Tue, 16 Nov 2010 08:12:03 +0100
> >> >> Subject: Re: [PERFORM]
> >> >> To: humairm@hotmail.com
> >> >> CC: pgsql-performance@postgresql.org
> >> >>
> >> >> 2010/11/15 Humair Mohammed <humairm@hotmail.com>:
> >> >> > I have 2 tables with a 200,000 rows of data 3 character/string
> >> >> > columns
> >> >> > ID,
> >> >> > Question and Response. The query below compares the data between the
> >> >> > 2
> >> >> > tables based on ID and Question and if the Response does not match
> >> >> > between
> >> >> > the left table and the right table it identifies the ID's where there
> >> >> > is
> >> >> > a
> >> >> > mismatch. Running the query in SQL Server 2008 using the ISNULL
> >> >> > function
> >> >> > take a few milliseconds. Running the same query in Postgresql takes
> >> >> > over
> >> >> > 70
> >> >> > seconds. The 2 queries are below:
> >> >> > SQL Server 2008 R2 Query
> >> >> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
> >> >> > t1.question = t2.question and isnull(t1.response,'ISNULL') <>
> >> >> > isnull(t2.response,'ISNULL')
> >> >>
> >> >> > Postgres 9.1 Query
> >> >> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
> >> >> > t1.question = t2.question and coalesce(t1.response,'ISNULL') <>
> >> >> > coalesce(t2.response,'ISNULL')
> >> >> > What gives?
> >> >>
> >> >> I think the problem can be in the ugly predicate
> >> >> coalesce(t1.response,'ISNULL') <>
> >> >> > coalesce(t2.response,'ISNULL')
> >> >>
> >> >> try using the IS DISTINCT FROM operator
> >> >>
> >> >> ... AND t1.response IS DISTINCT FROM t2.response
> >> >>
> >> >> Regards
> >> >>
> >> >> Pavel Stehule
> >> >>
> >> >> p.s. don't use coalesce in the WHERE clause if possible.
> >> >
> >
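The COALESCE-sentinel vs. NULL-safe-operator point quoted above can be illustrated with a self-contained sketch. SQLite (via Python's stdlib sqlite3) is used here as a stand-in for PostgreSQL, with SQLite's NULL-safe `IS NOT` playing the role of `IS DISTINCT FROM`; the table names and data are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE t1(id TEXT, question TEXT, response TEXT);
CREATE TABLE t2(id TEXT, question TEXT, response TEXT);
INSERT INTO t1 VALUES ('a','q1','yes'), ('b','q2',NULL), ('c','q3','no');
INSERT INTO t2 VALUES ('a','q1','yes'), ('b','q2',NULL), ('c','q3',NULL);
""")

# Sentinel style, as in the original query: NULLs are mapped to a magic
# string before comparison.
rows_coalesce = cur.execute("""
SELECT t1.id FROM t1 JOIN t2
  ON t1.id = t2.id AND t1.question = t2.question
WHERE COALESCE(t1.response,'MISSING') <> COALESCE(t2.response,'MISSING')
""").fetchall()

# NULL-safe operator style: SQLite's "IS NOT" compares NULLs directly,
# like PostgreSQL's IS DISTINCT FROM.
rows_distinct = cur.execute("""
SELECT t1.id FROM t1 JOIN t2
  ON t1.id = t2.id AND t1.question = t2.question
WHERE t1.response IS NOT t2.response
""").fetchall()

print(rows_coalesce)   # only 'c' differs ('no' vs NULL)
print(rows_distinct)   # same result, without inventing a sentinel value
```

Both forms return the same mismatch here; the practical difference is that the sentinel form can collide with real data containing the magic string, and (the point made in the thread) it can get in the way of the planner's statistics.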
1) OS/Configuration
64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67 Ghz Intel CPU
postgresql-x64-9.0 (PostgreSQL 9.0.1, compiled by Visual C++ build 1500, 64-bit)
work_mem 2GB
shared_buffers = 2
2) Dataset
name,pages,tuples,pg_size_pretty
"pivotbad";1870;93496;"15 MB"
"pivotgood";5025;251212;"39 MB"
3) EXPLAIN (ANALYZE ON, BUFFERS ON)
"Hash Join (cost=16212.30..52586.43 rows=92869 width=17) (actual time=25814.222..32296.765 rows=3163 loops=1)"
" Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
" Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
" Buffers: shared hit=384 read=6511, temp read=6444 written=6318"
" -> Seq Scan on pivotbad pb (cost=0.00..2804.96 rows=93496 width=134) (actual time=0.069..37.143 rows=93496 loops=1)"
" Buffers: shared hit=192 read=1678"
" -> Hash (cost=7537.12..7537.12 rows=251212 width=134) (actual time=24621.752..24621.752 rows=251212 loops=1)"
" Buckets: 1024 Batches: 64 Memory Usage: 650kB"
" Buffers: shared hit=192 read=4833, temp written=4524"
" -> Seq Scan on pivotgood pg (cost=0.00..7537.12 rows=251212 width=134) (actual time=0.038..117.780 rows=251212 loops=1)"
" Buffers: shared hit=192 read=4833"
"Total runtime: 32297.305 ms"
4) INDEXES
I can certainly add an index, but given the table sizes I am not sure if that is a factor. This is by no means a large dataset: less than 350,000 rows in total and 3 columns. Also, this was just a quick dump of data for comparison purposes. When I saw the poor performance of COALESCE, I pointed the data load at SQL Server and ran the same query, except with the T-SQL-specific ISNULL function.
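One way to remove the configuration doubts raised above is to verify the settings from the same session that runs the query; a minimal psql sketch (the 256MB value is just an example, not from the thread):

```sql
-- Confirm what this session actually sees; a config-file change only
-- takes effect after a reload AND only in sessions that pick it up.
SHOW work_mem;
SHOW shared_buffers;

-- work_mem can also be set per session, which is safer for experiments:
SET work_mem = '256MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT pb.id
FROM pivotbad pb
JOIN pivotgood pg
  ON pb.id = pg.id AND pb.question = pg.question
WHERE pb.response IS DISTINCT FROM pg.response;
```

If the "Batches" count in the hash join drops after the per-session SET, the earlier runs were still using the old work_mem.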
2010/11/21 Humair Mohammed <humairm@hotmail.com>:
> 1) OS/Configuration
> 64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67 Ghz Intel CPU
> postgresql-x64-9.0 (PostgreSQL 9.0.1, compiled by Visual C++ build 1500,
> 64-bit)
> work_mem 2GB
> shared_buffers = 2

shared_buffers = 2 ???

Regards

Pavel Stehule

> [...]
> 4) INDEXES
> I can certainly add an index but given the table sizes I am not sure if
> that is a factor. This by no means is a large dataset less than 350,000
> rows in total and 3 columns. Also this was just a quick dump of data for
> comparison purpose. When I saw the poor performance on the COALESCE, I
> pointed the data load to SQL Server and ran the same query except with
> the TSQL specific ISNULL function.

350000 rows definitely is a lot of rows, although with 3 INT columns it's
just about 13MB of data (including overhead). But indexes can be quite
handy when doing joins, as in this case.

Tomas
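For example, composite indexes on the join keys might look like the following. The index names are illustrative, not from the thread, and the planner may well still prefer a hash join for a full-table comparison like this one:

```sql
-- Illustrative composite indexes covering the two join keys.
CREATE INDEX pivotbad_id_question_idx  ON pivotbad  (id, question);
CREATE INDEX pivotgood_id_question_idx ON pivotgood (id, question);

-- Refresh planner statistics after bulk-loading the data, so the
-- optimizer has accurate row counts to work with.
ANALYZE pivotbad;
ANALYZE pivotgood;
```

Comparing EXPLAIN (ANALYZE, BUFFERS) output before and after would show whether the planner actually uses them here.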
Hello

2010/11/21 Humair Mohammed <humairm@hotmail.com>:
> That was a typo:
> work_mem = 2GB
> shared_buffers = 2GB

OK, then try to decrease shared_buffers. Maybe Win7 has some problem -
large shared buffers work well only on UNIX-like systems. I am thinking
500 MB is enough.

Regards

Pavel Stehule

> [quoted earlier message - the configuration details and the EXPLAIN
> (ANALYZE ON, BUFFERS ON) output - snipped; it appears in full earlier
> in the thread]
> First, I modified the work_mem setting to 1GB (reloaded config) from the
> default 1MB and I see a response time of 33 seconds. Results below from
> EXPLAIN ANALYZE: ...
>
> Second, I modified the work_mem setting to 2GB (reloaded config) and I
> see a response time of 38 seconds. Results below from EXPLAIN ANALYZE: ...

How did you reload the config? Using 'kill -HUP pid'? That should work
fine. Have you checked 'work_mem' after the reload?

Because the explain plans are exactly the same (structure, estimated
costs). The really interesting bit is this, and it did not change at all:

   Buckets: 1024  Batches: 64  Memory Usage: 650kB

As Tom Lane already mentioned, splitting a hash join into batches (due to
small memory) adds overhead; the optimal number of batches is 1. But I
guess 1GB of work_mem is overkill - something like 64MB should be fine.
The suspicious thing is that the query plans have not changed at all
(especially the number of batches). I think you're not telling us
something important (unintentionally, of course).

> By no means am I trying to compare the 2 products. When I noticed the
> slow behavior of COALESCE I tried it on SQL Server. And since they are
> running on the same machine, hence my comment regarding apples to
> apples. It is possible that this is not an apples to apples comparison
> other than the fact that it is running on the same machine.

OK. The point of my post was that you've provided very little info about
the settings etc., so it was difficult to identify why PostgreSQL is so
slow.

Tomas
>> 4) INDEXES
>> I can certainly add an index but given the table sizes I am not sure if
>> that is a factor. This by no means is a large dataset less than 350,000
>> rows in total and 3 columns. Also this was just a quick dump of data
>> for comparison purpose. When I saw the poor performance on the
>> COALESCE, I pointed the data load to SQL Server and ran the same query
>> except with the TSQL specific ISNULL function.
>
> 350000 rows definitely is a lot of rows, although with 3 INT columns
> it's just about 13MB of data (including overhead). But indexes can be
> quite handy when doing joins, as in this case.

OK, I've just realized the tables have 3 character columns, not integers.
In that case the tables are probably much bigger (and there are things
like TOAST). In that case indexes may be even more important.

Tomas
tv@fuzzy.cz writes:
>> Second, I modified the work_mem setting to 2GB (reloaded config) and I
>> see a response time of 38 seconds. Results below from EXPLAIN ANALYZE:

> How did you reload the config? Using 'kill -HUP pid'? That should work
> fine. Have you checked 'work_mem' after the reload?

> Because the explain plans are exactly the same (structure, estimated
> costs). The really interesting bit is this and it did not change at all

> Buckets: 1024 Batches: 64 Memory Usage: 650kB

If that didn't change, I'm prepared to bet that the OP didn't actually
manage to change the active value of work_mem.

			regards, tom lane
On Nov 21, 2010, at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> tv@fuzzy.cz writes:
>>> Second, I modified the work_mem setting to 2GB (reloaded config) and I
>>> see a response time of 38 seconds. Results below from EXPLAIN ANALYZE:
>
>> How did you reload the config? Using 'kill -HUP pid'? That should work
>> fine. Have you checked 'work_mem' after the reload?
>
>> Because the explain plans are exactly the same (structure, estimated
>> costs). The really interesting bit is this and it did not change at all
>
>> Buckets: 1024 Batches: 64 Memory Usage: 650kB
>
> If that didn't change, I'm prepared to bet that the OP didn't actually
> manage to change the active value of work_mem.

Yep. All this speculation about slow disks and/or COALESCE strikes me as
likely totally off-base. I think the original poster needs to run "show
work_mem" right before the EXPLAIN ANALYZE to make sure the new value
they set actually stuck. There's no reason for the planner to have used
only 650kB if work_mem is set to anything >= 2MB.

...Robert
That was a typo:
work_mem = 2GB
shared_buffers = 2GB
> From: pavel.stehule@gmail.com
> Date: Sun, 21 Nov 2010 12:38:43 +0100
> Subject: Re: [PERFORM] Query Performance SQL Server vs. Postgresql
> To: humairm@hotmail.com
> CC: pgsql-performance@postgresql.org
>
> 2010/11/21 Humair Mohammed <humairm@hotmail.com>:
> >
> > 1) OS/Configuration
> > 64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67 Ghz Intel CPU
> > postgresql-x64-9.0 (PostgreSQL 9.0.1, compiled by Visual C++ build 1500,
> > 64-bit)
> > work_mem 2GB
> > shared_buffers = 2
>
> shared_buffers = 2 ???
>
> Regards
>
> Pavel Stehule
>
>
> > 2) Dataset
> > name,pages,tuples,pg_size_pretty
> > "pivotbad";1870;93496;"15 MB"
> > "pivotgood";5025;251212;"39 MB"
> > 3) EXPLAIN (ANALYZE ON, BUFFERS ON)
> > "Hash Join (cost=16212.30..52586.43 rows=92869 width=17) (actual
> > time=25814.222..32296.765 rows=3163 loops=1)"
> > " Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text =
> > (pg.question)::text))"
> > " Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text
> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> > " Buffers: shared hit=384 read=6511, temp read=6444 written=6318"
> > " -> Seq Scan on pivotbad pb (cost=0.00..2804.96 rows=93496 width=134)
> > (actual time=0.069..37.143 rows=93496 loops=1)"
> > " Buffers: shared hit=192 read=1678"
> > " -> Hash (cost=7537.12..7537.12 rows=251212 width=134) (actual
> > time=24621.752..24621.752 rows=251212 loops=1)"
> > " Buckets: 1024 Batches: 64 Memory Usage: 650kB"
> > " Buffers: shared hit=192 read=4833, temp written=4524"
> > " -> Seq Scan on pivotgood pg (cost=0.00..7537.12 rows=251212
> > width=134) (actual time=0.038..117.780 rows=251212 loops=1)"
> > " Buffers: shared hit=192 read=4833"
> > "Total runtime: 32297.305 ms"
> > 4) INDEXES
> > I can certainly add an index but given the table sizes I am not sure if that
> > is a factor. This by no means is a large dataset less than 350,000 rows in
> > total and 3 columns. Also this was just a quick dump of data for comparison
> > purpose. When I saw the poor performance on the COALESCE, I pointed the data
> > load to SQL Server and ran the same query except with the TSQL specific
> > ISNULL function.
> >
Correct - the settings did not take effect with the pg_ctl reload command. I did a pg_ctl restart and work_mem now displays the updated value. I had to bump it all the way up to 2047 MB to get the response below (with work_mem at 1024 MB I see a 7 second response time). 2047 MB is the maximum value that can be set for work_mem - anything more results in a FATAL error because of the limit. The batches and memory usage now reflect the right behavior with these settings. Thanks for everyone's input - the result now matches what SQL Server was producing.
"Hash Join (cost=11305.30..39118.43 rows=92869 width=17) (actual time=145.888..326.216 rows=3163 loops=1)"
" Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
" Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
" Buffers: shared hit=6895"
" -> Seq Scan on pivotbad pb (cost=0.00..2804.96 rows=93496 width=134) (actual time=0.011..11.903 rows=93496 loops=1)"
" Buffers: shared hit=1870"
" -> Hash (cost=7537.12..7537.12 rows=251212 width=134) (actual time=145.673..145.673 rows=251212 loops=1)"
" Buckets: 32768 Batches: 1 Memory Usage: 39939kB"
" Buffers: shared hit=5025"
" -> Seq Scan on pivotgood pg (cost=0.00..7537.12 rows=251212 width=134) (actual time=0.004..26.242 rows=251212 loops=1)"
" Buffers: shared hit=5025"
"Total runtime: 331.168 ms"
Humair
> CC: tv@fuzzy.cz; humairm@hotmail.com; pavel.stehule@gmail.com; pgsql-performance@postgresql.org
> From: robertmhaas@gmail.com
> Subject: Re: [PERFORM] Query Performance SQL Server vs. Postgresql
> Date: Sun, 21 Nov 2010 13:55:54 -0500
> To: tgl@sss.pgh.pa.us
>
> On Nov 21, 2010, at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > tv@fuzzy.cz writes:
> >>> Second, I modified the work_mem setting to 2GB (reloaded config) and I see
> >>> a response time of 38 seconds. Results below from EXPLAIN ANALYZE:
> >
> >> How did you reload the config? Using 'kill -HUP pid'? That should work
> >> fine. Have you cheched 'work_mem' after the reload?
> >
> >> Because the explain plans are exactly the same (structure, estimated
> >> costs). The really interesting bit is this and it did not change at all
> >
> >> Buckets: 1024 Batches: 64 Memory Usage: 650kB
> >
> > If that didn't change, I'm prepared to bet that the OP didn't actually
> > manage to change the active value of work_mem.
>
> Yep. All this speculation about slow disks and/or COALESCE strikes me as likely totally off-base. I think the original poster needs to run "show work_mem" right before the EXPLAIN ANALYZE to make sure the new value they set actually stuck. There's no reason for the planner to have used only 650kB if work_mem is set to anything >=2MB.
>
> ...Robert
> Correct, the optimizer did not take the settings with the pg_ctl reload
> command. I did a pg_ctl restart and work_mem now displays the updated
> value. I had to bump up all the way to 2047 MB to get the response below
> (with work_mem at 1024 MB I see 7 seconds response time) and with 2047
> MB (which is the max value that can be set for work_mem - anything more
> than that results in a FATAL error because of the limit) the results are
> below.

Hm, can you post the explain plan for the case work_mem=1024MB? I guess
the difference is due to caching. According to the explain analyze, there
are just cache hits, no reads.

Anyway the hash join uses only about 40MB of memory, so 1024MB should be
perfectly fine and the explain plan should be exactly the same as with
work_mem=2047MB. And the row estimates seem quite precise, so I don't
think there's some severe overestimation.

Tomas
On Sun, Nov 21, 2010 at 10:21 PM, Humair Mohammed <humairm@hotmail.com> wrote:
> Correct, the optimizer did not take the settings with the pg_ctl reload
> command. I did a pg_ctl restart and work_mem now displays the updated
> value. I had to bump up all the way to 2047 MB to get the response below
> (with work_mem at 1024 MB I see 7 seconds response time) and with 2047
> MB (which is the max value that can be set for work_mem - anything more
> than that results in a FATAL error because of the limit) the results are
> below. The batches and memory usage are reflecting the right behavior
> with these settings. Thanks for everyones input, the result is now
> matching what SQL Server was producing.
I believe you can set work_mem to a different value just for the duration of a single query, so you needn't have work_mem set so high for every query on the system. A single query may well use a multiple of work_mem, so you probably don't want it that high all the time unless all of your queries are structured similarly. Just: set work_mem='2047MB'; query; reset all;
But you should wait until someone more knowledgable than I confirm what I just wrote.
> I believe you can set work_mem to a different value just for the
> duration of a single query, so you needn't have work_mem set so high for
> every query on the system. A single query may well use a multiple of
> work_mem, so you really probably don't want it that high all the time
> unless all of your queries are structured similarly. Just set
> work_mem='2047MB'; query; reset all;

Yes, executing "set work_mem='64MB'" right before the query should be
just fine. Setting work_mem to 2GB is overkill most of the time
(99.99999%).

Tomas
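The per-session pattern being suggested would look like this (a sketch; the 64MB figure is a suggestion from the thread, not a measured optimum):

```sql
SHOW work_mem;           -- confirm the value actually in effect
SET work_mem = '64MB';   -- applies to the current session only

-- ... run the comparison query here ...

RESET work_mem;          -- return to the server default
```

Inside an explicit transaction, SET LOCAL work_mem = '64MB' achieves the same thing and reverts automatically at commit or rollback.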
I did some further analysis and here are the results:
work_mem;response_time
1MB;62 seconds
2MB;2 seconds
4MB;700 milliseconds
8MB;550 milliseconds
In all cases shared_buffers was set to the default value of 32MB. As you can see, the jump from 1 MB to 2 MB of work_mem does wonders. I probably don't need this to be any higher than 8 or 16 MB. Thanks to all for the help!
Humair
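A repeatable way to produce numbers like the table above is to vary work_mem within a single session and re-run EXPLAIN under each setting. The query text below is reconstructed from the plans earlier in the thread, so treat it as a sketch:

```sql
-- Measure the same query under two work_mem settings in one session.
SET work_mem = '1MB';
EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT pb.id
  FROM pivotbad pb
  JOIN pivotgood pg
    ON pb.id = pg.id
   AND pb.question = pg.question
   AND COALESCE(pb.response, 'MISSING') <> COALESCE(pg.response, 'MISSING');

SET work_mem = '8MB';
EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT pb.id
  FROM pivotbad pb
  JOIN pivotgood pg
    ON pb.id = pg.id
   AND pb.question = pg.question
   AND COALESCE(pb.response, 'MISSING') <> COALESCE(pg.response, 'MISSING');
```

The "Batches" figure in the Hash node of each plan shows directly whether the join fits in memory (Batches: 1) or is spilling to temp files.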
> Date: Mon, 22 Nov 2010 12:00:15 +0100
> Subject: Re: [PERFORM] Query Performance SQL Server vs. Postgresql
> From: tv@fuzzy.cz
> To: humairm@hotmail.com
> CC: pgsql-performance@postgresql.org
>
> >
> >
> > Correct, the optimizer did not take the settings with the pg_ctl reload
> > command. I did a pg_ctl restart and work_mem now displays the updated
> > value. I had to bump up all the way to 2047 MB to get the response below
> > (with work_mem at 1024 MB I see 7 seconds response time) and with 2047 MB
> > (which is the max value that can be set for work_mem - anything more than
> > that results in a FATAL error because of the limit) the results are below.
>
> Hm, can you post explain plan for the case work_mem=1024MB. I guess the
> difference is due to caching. According to the explain analyze, there are
> just cache hits, no reads.
>
> Anyway the hash join uses only about 40MB of memory, so 1024MB should be
> perfectly fine and the explain plan should be exactly the same as with
> work_mem=2047MB. And the row estimates seem quite precise, so I don't
> think there's some severe overestimation.
>
> Tomas
>
On Mon, Nov 22, 2010 at 7:12 PM, Humair Mohammed <humairm@hotmail.com> wrote:
> I did some further analysis and here are the results:
> work_mem;response_time
> 1MB;62 seconds
> 2MB;2 seconds
> 4MB;700 milliseconds
> 8MB;550 milliseconds
> In all cases shared_buffers were set to the default value of 32MB. As
> you can see the 1 to 2 MB jump on the work_mem does wonders. I probably
> don't need this to be any higher than 8 or 16 MB. Thanks to all for
> help!
> Humair

work_mem directly affects how the query is planned, because certain types
of plans (hash joins and large sorts) require memory. Raising or lowering
shared_buffers, OTOH, is very subtle and is not something you tune to
improve the execution of a single query...

merlin