Thread: PostgreSQL strange query plan for my query

PostgreSQL strange query plan for my query

From: David Popiashvili
Date: Fri, 16 Nov 2012
I have a database with a few hundred million rows. I'm running the following query:

select * from "Payments" as p
inner join "PaymentOrders" as po
on po."Id" = p."PaymentOrderId"
inner join "Users" as u
On u."Id" = po."UserId"
INNER JOIN "Roles" as r
on u."RoleId" = r."Id"
Where r."Name" = 'Moses'
LIMIT 1000

When the WHERE clause finds a match in the database, I get the result in several milliseconds, but if I modify the query and specify a non-existent r."Name" in the WHERE clause, it takes far too long to complete. I guess that PostgreSQL is doing a sequential scan on the Payments table (which contains the most rows), comparing each row one by one.
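
For example, the same query with a role name that matches nothing (here 'Moses2333', the non-existent name that also appears in the plans later in this thread) is the slow case:

select * from "Payments" as p
inner join "PaymentOrders" as po
on po."Id" = p."PaymentOrderId"
inner join "Users" as u
On u."Id" = po."UserId"
INNER JOIN "Roles" as r
on u."RoleId" = r."Id"
Where r."Name" = 'Moses2333'
LIMIT 1000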

Isn't PostgreSQL smart enough to check first whether the Roles table contains any row with Name 'Moses'?

The Roles table contains only 15 rows, while Payments contains ~350 million.

I'm running PostgreSQL 9.2.1.

BTW, this same query on the same schema/data takes 0.024 ms to complete on MS SQL Server.


Here are the EXPLAIN ANALYZE results: http://explain.depesz.com/s/7e7

And here's the server configuration:

version PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit
client_encoding UNICODE
effective_cache_size    4500MB
fsync   on
lc_collate  English_United States.1252
lc_ctype    English_United States.1252
listen_addresses    *
log_destination stderr
log_line_prefix %t 
logging_collector   on
max_connections 100
max_stack_depth 2MB
port    5432
search_path dbo, "$user", public
server_encoding UTF8
shared_buffers  1500MB
TimeZone    Asia/Tbilisi
wal_buffers 16MB
work_mem    10MB
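
A listing like this can be produced with a catalog query along the following lines (pg_settings is the standard system view; filtering on source hides settings left at their built-in defaults):

select name, setting, unit
from pg_settings
where source <> 'default'
order by name;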

I'm running PostgreSQL on an i5 CPU (4 cores, 3.3 GHz) with 8 GB of RAM and a 128 GB Crucial m4 SSD.


Original question source: http://stackoverflow.com/questions/13407555/postgresql-query-taking-too-long#comment18330095_13407555


Thank you very much.

Re: PostgreSQL strange query plan for my query

From: "Albe Laurenz"
Date: Fri, 16 Nov 2012 13:55:41 +0100
David Popiashvili wrote:
> I have a database with a few hundred million rows. I'm running the
> following query:
>
> select * from "Payments" as p
> inner join "PaymentOrders" as po
> on po."Id" = p."PaymentOrderId"
> inner join "Users" as u
> On u."Id" = po."UserId"
> INNER JOIN "Roles" as r
> on u."RoleId" = r."Id"
> Where r."Name" = 'Moses'
> LIMIT 1000
>
> When the WHERE clause finds a match in the database, I get the result
> in several milliseconds, but if I modify the query and specify a
> non-existent r."Name" in the WHERE clause, it takes far too long to
> complete. I guess that PostgreSQL is doing a sequential scan on the
> Payments table (which contains the most rows), comparing each row one
> by one.
>
> Isn't PostgreSQL smart enough to check first whether the Roles table
> contains any row with Name 'Moses'?
>
> The Roles table contains only 15 rows, while Payments contains ~350
> million.
>
> I'm running PostgreSQL 9.2.1.
>
> Here are the EXPLAIN ANALYZE results: http://explain.depesz.com/s/7e7

Can you also show the plan for the good case?

Yours,
Laurenz Albe


Re: PostgreSQL strange query plan for my query

From: David Popiashvili
Date: Fri, 16 Nov 2012 14:04
All right, after some discussion on Stack Overflow, we found out that the bad query plan is generated because of the LIMIT clause in the query. I guess PostgreSQL expects to find the matching rows quickly, so it picks a plan that can stop early (the sequential scan on the big table); when no row matches, that early exit never happens and the scan runs to the end of the table. If I remove LIMIT 1000, everything executes in a few milliseconds and the query plan looks like this:

Hash Join  (cost=2662004.85..14948213.44 rows=22661658 width=138) (actual time=0.105..0.105 rows=0 loops=1)
  Hash Cond: (p."PaymentOrderId" = po."Id")
  ->  Seq Scan on "Payments" p  (cost=0.00..5724570.00 rows=350000000 width=18) (actual time=0.018..0.018 rows=1 loops=1)
  ->  Hash  (cost=2583365.85..2583365.85 rows=2614480 width=120) (actual time=0.046..0.046 rows=0 loops=1)
        Buckets: 8192  Batches: 64  Memory Usage: 0kB
        ->  Hash Join  (cost=904687.05..2583365.85 rows=2614480 width=120) (actual time=0.046..0.046 rows=0 loops=1)
              Hash Cond: (po."UserId" = u."Id")
              ->  Seq Scan on "PaymentOrders" po  (cost=0.00..654767.00 rows=40000000 width=24) (actual time=0.003..0.003 rows=1 loops=1)
              ->  Hash  (cost=850909.04..850909.04 rows=1980881 width=96) (actual time=0.016..0.016 rows=0 loops=1)
                    Buckets: 8192  Batches: 32  Memory Usage: 0kB
                    ->  Hash Join  (cost=1.20..850909.04 rows=1980881 width=96) (actual time=0.016..0.016 rows=0 loops=1)
                          Hash Cond: (u."RoleId" = r."Id")
                          ->  Seq Scan on "Users" u  (cost=0.00..718598.20 rows=30000220 width=80) (actual time=0.002..0.002 rows=1 loops=1)
                          ->  Hash  (cost=1.19..1.19 rows=1 width=16) (actual time=0.009..0.009 rows=0 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 0kB
                                ->  Seq Scan on "Roles" r  (cost=0.00..1.19 rows=1 width=16) (actual time=0.009..0.009 rows=0 loops=1)
                                      Filter: (("Name")::text = 'Moses2333'::text)
                                      Rows Removed by Filter: 15
Total runtime: 0.209 ms
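
Note that the filter on "Roles" now removes all 15 rows ("Rows Removed by Filter: 15"), every hash table above it stays empty, and each of the large scans returns after producing a single row, for a total runtime of 0.209 ms.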

Following Erwin Brandstetter's suggestion, I also tried wrapping the query in a subquery and applying the LIMIT outside it:

SELECT *
FROM  (
   SELECT *
   FROM   "Roles"         AS r  
   JOIN   "Users"         AS u  ON u."RoleId" = r."Id"
   JOIN   "PaymentOrders" AS po ON po."UserId" = u."Id"
   JOIN   "Payments"      AS p  ON p."PaymentOrderId" = po."Id"
   WHERE  r."Name" = 'Moses'
  ) x
LIMIT  1000;

but this also generates the same bad query plan; apparently PostgreSQL flattens such a plain subquery into the outer query, so the LIMIT still influences the join planning. Any idea how to solve this without omitting the LIMIT keyword?
Thanks

> Subject: RE: [PERFORM] PostgreSQL strange query plan for my query
> Date: Fri, 16 Nov 2012 13:55:41 +0100
> From: laurenz.albe@wien.gv.at
> To: dato0011@hotmail.com; pgsql-performance@postgresql.org
>
> David Popiashvili wrote:
> > I have a database with a few hundred million rows. I'm running the
> > following query:
> >
> > select * from "Payments" as p
> > inner join "PaymentOrders" as po
> > on po."Id" = p."PaymentOrderId"
> > inner join "Users" as u
> > On u."Id" = po."UserId"
> > INNER JOIN "Roles" as r
> > on u."RoleId" = r."Id"
> > Where r."Name" = 'Moses'
> > LIMIT 1000
> >
> > When the WHERE clause finds a match in the database, I get the result
> > in several milliseconds, but if I modify the query and specify a
> > non-existent r."Name" in the WHERE clause, it takes far too long to
> > complete. I guess that PostgreSQL is doing a sequential scan on the
> > Payments table (which contains the most rows), comparing each row one
> > by one.
> >
> > Isn't PostgreSQL smart enough to check first whether the Roles table
> > contains any row with Name 'Moses'?
> >
> > The Roles table contains only 15 rows, while Payments contains ~350
> > million.
> >
> > I'm running PostgreSQL 9.2.1.
> >
> > Here are the EXPLAIN ANALYZE results: http://explain.depesz.com/s/7e7
>
> Can you also show the plan for the good case?
>
> Yours,
> Laurenz Albe

Re: PostgreSQL strange query plan for my query

From: Julien Cigar
Date:
On 11/16/2012 14:04, David Popiashvili wrote:
All right, after some discussion on Stack Overflow, we found out that the bad query plan is generated because of the LIMIT clause in the query. I guess PostgreSQL expects to find the matching rows quickly, so it picks a plan that can stop early (the sequential scan on the big table); when no row matches, that early exit never happens and the scan runs to the end of the table. If I remove LIMIT 1000, everything executes in a few milliseconds and the query plan looks like this:

Hash Join  (cost=2662004.85..14948213.44 rows=22661658 width=138) (actual time=0.105..0.105 rows=0 loops=1)
  Hash Cond: (p."PaymentOrderId" = po."Id")
  ->  Seq Scan on "Payments" p  (cost=0.00..5724570.00 rows=350000000 width=18) (actual time=0.018..0.018 rows=1 loops=1)
  ->  Hash  (cost=2583365.85..2583365.85 rows=2614480 width=120) (actual time=0.046..0.046 rows=0 loops=1)
        Buckets: 8192  Batches: 64  Memory Usage: 0kB
        ->  Hash Join  (cost=904687.05..2583365.85 rows=2614480 width=120) (actual time=0.046..0.046 rows=0 loops=1)
              Hash Cond: (po."UserId" = u."Id")
              ->  Seq Scan on "PaymentOrders" po  (cost=0.00..654767.00 rows=40000000 width=24) (actual time=0.003..0.003 rows=1 loops=1)
              ->  Hash  (cost=850909.04..850909.04 rows=1980881 width=96) (actual time=0.016..0.016 rows=0 loops=1)
                    Buckets: 8192  Batches: 32  Memory Usage: 0kB
                    ->  Hash Join  (cost=1.20..850909.04 rows=1980881 width=96) (actual time=0.016..0.016 rows=0 loops=1)
                          Hash Cond: (u."RoleId" = r."Id")
                          ->  Seq Scan on "Users" u  (cost=0.00..718598.20 rows=30000220 width=80) (actual time=0.002..0.002 rows=1 loops=1)
                          ->  Hash  (cost=1.19..1.19 rows=1 width=16) (actual time=0.009..0.009 rows=0 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 0kB
                                ->  Seq Scan on "Roles" r  (cost=0.00..1.19 rows=1 width=16) (actual time=0.009..0.009 rows=0 loops=1)
                                      Filter: (("Name")::text = 'Moses2333'::text)
                                      Rows Removed by Filter: 15
Total runtime: 0.209 ms

Following Erwin Brandstetter's suggestion, I also tried wrapping the query in a subquery and applying the LIMIT outside it:

SELECT *
FROM  (
   SELECT *
   FROM   "Roles"         AS r  
   JOIN   "Users"         AS u  ON u."RoleId" = r."Id"
   JOIN   "PaymentOrders" AS po ON po."UserId" = u."Id"
   JOIN   "Payments"      AS p  ON p."PaymentOrderId" = po."Id"
   WHERE  r."Name" = 'Moses'
  ) x
LIMIT  1000;

but this also generates the same bad query plan; apparently PostgreSQL flattens such a plain subquery into the outer query, so the LIMIT still influences the join planning. Any idea how to solve this without omitting the LIMIT keyword?
Thanks


Maybe with a CTE?

> Subject: RE: [PERFORM] PostgreSQL strange query plan for my query
> Date: Fri, 16 Nov 2012 13:55:41 +0100
> From: laurenz.albe@wien.gv.at
> To: dato0011@hotmail.com; pgsql-performance@postgresql.org
>
> David Popiashvili wrote:
> > I have a database with a few hundred million rows. I'm running the
> > following query:
> >
> > select * from "Payments" as p
> > inner join "PaymentOrders" as po
> > on po."Id" = p."PaymentOrderId"
> > inner join "Users" as u
> > On u."Id" = po."UserId"
> > INNER JOIN "Roles" as r
> > on u."RoleId" = r."Id"
> > Where r."Name" = 'Moses'
> > LIMIT 1000
> >
> > When the WHERE clause finds a match in the database, I get the result
> > in several milliseconds, but if I modify the query and specify a
> > non-existent r."Name" in the WHERE clause, it takes far too long to
> > complete. I guess that PostgreSQL is doing a sequential scan on the
> > Payments table (which contains the most rows), comparing each row one
> > by one.
> >
> > Isn't PostgreSQL smart enough to check first whether the Roles table
> > contains any row with Name 'Moses'?
> >
> > The Roles table contains only 15 rows, while Payments contains ~350
> > million.
> >
> > I'm running PostgreSQL 9.2.1.
> >
> > Here are the EXPLAIN ANALYZE results: http://explain.depesz.com/s/7e7
>
> Can you also show the plan for the good case?
>
> Yours,
> Laurenz Albe


-- 
No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.

Re: PostgreSQL strange query plan for my query

From: Craig James
Date: Fri, 16 Nov 2012 08:32:24 -0800


On Fri, Nov 16, 2012 at 3:40 AM, David Popiashvili <dato0011@hotmail.com> wrote:
I have a database with a few hundred million rows. I'm running the following query:

select * from "Payments" as p
inner join "PaymentOrders" as po
on po."Id" = p."PaymentOrderId"
inner join "Users" as u
On u."Id" = po."UserId"
INNER JOIN "Roles" as r
on u."RoleId" = r."Id"
Where r."Name" = 'Moses'
LIMIT 1000

When the WHERE clause finds a match in the database, I get the result in several milliseconds, but if I modify the query and specify a non-existent r."Name" in the WHERE clause, it takes far too long to complete. I guess that PostgreSQL is doing a sequential scan on the Payments table (which contains the most rows), comparing each row one by one.

Isn't PostgreSQL smart enough to check first whether the Roles table contains any row with Name 'Moses'?

The Roles table contains only 15 rows, while Payments contains ~350 million.

You probably checked this already, but just in case you didn't ... did you do an "analyze" on the small table?  I've been hit by this before ... it's natural to think that Postgres would always check a very small table first no matter what the statistics are.  But it's not true.  If you analyze the small table, even if it only has one or two rows in it, it will often radically change the plan that Postgres chooses.
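
For example, a minimal sketch using the tables from the query above:

ANALYZE "Roles";          -- recomputes planner statistics for the 15-row table
ANALYZE "Users";
ANALYZE "PaymentOrders";
ANALYZE "Payments";       -- samples the table; it does not read all ~350 million rows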

Craig James
 

Re: PostgreSQL strange query plan for my query

From: David Popiashvili
Date: Fri, 16 Nov 2012 17:35
Thanks Craig. Yes, I already tried it, but it didn't work. I don't see any solution other than fixing this bug. Take a look at http://www.postgresql.org/search/?m=1&q=LIMIT&l=8&d=365&s=r; there are quite a few reports about LIMIT slowing down queries. Let's hope it will be fixed someday :)


Date: Fri, 16 Nov 2012 08:32:24 -0800
Subject: Re: [PERFORM] PostgreSQL strange query plan for my query
From: cjames@emolecules.com
To: dato0011@hotmail.com
CC: pgsql-performance@postgresql.org



On Fri, Nov 16, 2012 at 3:40 AM, David Popiashvili <dato0011@hotmail.com> wrote:
I have a database with a few hundred million rows. I'm running the following query:

select * from "Payments" as p
inner join "PaymentOrders" as po
on po."Id" = p."PaymentOrderId"
inner join "Users" as u
On u."Id" = po."UserId"
INNER JOIN "Roles" as r
on u."RoleId" = r."Id"
Where r."Name" = 'Moses'
LIMIT 1000

When the WHERE clause finds a match in the database, I get the result in several milliseconds, but if I modify the query and specify a non-existent r."Name" in the WHERE clause, it takes far too long to complete. I guess that PostgreSQL is doing a sequential scan on the Payments table (which contains the most rows), comparing each row one by one.

Isn't PostgreSQL smart enough to check first whether the Roles table contains any row with Name 'Moses'?

The Roles table contains only 15 rows, while Payments contains ~350 million.

You probably checked this already, but just in case you didn't ... did you do an "analyze" on the small table?  I've been hit by this before ... it's natural to think that Postgres would always check a very small table first no matter what the statistics are.  But it's not true.  If you analyze the small table, even if it only has one or two rows in it, it will often radically change the plan that Postgres chooses.

Craig James
 

Re: PostgreSQL strange query plan for my query

From: Julien Cigar
Date:
On 11/16/2012 17:35, David Popiashvili wrote:
Thanks Craig. Yes, I already tried it, but it didn't work. I don't see any solution other than fixing this bug. Take a look at http://www.postgresql.org/search/?m=1&q=LIMIT&l=8&d=365&s=r; there are quite a few reports about LIMIT slowing down queries. Let's hope it will be fixed someday :)


Date: Fri, 16 Nov 2012 08:32:24 -0800
Subject: Re: [PERFORM] PostgreSQL strange query plan for my query
From: cjames@emolecules.com
To: dato0011@hotmail.com
CC: pgsql-performance@postgresql.org



On Fri, Nov 16, 2012 at 3:40 AM, David Popiashvili <dato0011@hotmail.com> wrote:
I have a database with a few hundred million rows. I'm running the following query:

select * from "Payments" as p
inner join "PaymentOrders" as po
on po."Id" = p."PaymentOrderId"
inner join "Users" as u
On u."Id" = po."UserId"
INNER JOIN "Roles" as r
on u."RoleId" = r."Id"
Where r."Name" = 'Moses'
LIMIT 1000

did you try:

with foo as (
select * from "Payments" as p
inner join "PaymentOrders" as po
on po."Id" = p."PaymentOrderId"
inner join "Users" as u
On u."Id" = po."UserId"
INNER JOIN "Roles" as r
on u."RoleId" = r."Id"
Where r."Name" = 'Moses'
) select * from foo LIMIT 1000

?
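
A plain subquery, as tried earlier in the thread, doesn't help because PostgreSQL pulls it up into the outer query and plans everything as one join under the LIMIT. A CTE behaves differently: in 9.2, WITH queries are optimization fences, planned and executed on their own, so the outer LIMIT 1000 can no longer push the planner toward a scan-and-stop-early plan. Another common idiom with the same fencing effect (a sketch, not something tested in this thread) is adding OFFSET 0 to the subquery, since a subquery with LIMIT or OFFSET is never flattened:

SELECT *
FROM  (
   SELECT *
   FROM   "Roles"         AS r
   JOIN   "Users"         AS u  ON u."RoleId" = r."Id"
   JOIN   "PaymentOrders" AS po ON po."UserId" = u."Id"
   JOIN   "Payments"      AS p  ON p."PaymentOrderId" = po."Id"
   WHERE  r."Name" = 'Moses'
   OFFSET 0  -- prevents the subquery from being pulled up into the outer query
  ) x
LIMIT  1000;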

When the WHERE clause finds a match in the database, I get the result in several milliseconds, but if I modify the query and specify a non-existent r."Name" in the WHERE clause, it takes far too long to complete. I guess that PostgreSQL is doing a sequential scan on the Payments table (which contains the most rows), comparing each row one by one.

Isn't PostgreSQL smart enough to check first whether the Roles table contains any row with Name 'Moses'?

The Roles table contains only 15 rows, while Payments contains ~350 million.

You probably checked this already, but just in case you didn't ... did you do an "analyze" on the small table?  I've been hit by this before ... it's natural to think that Postgres would always check a very small table first no matter what the statistics are.  But it's not true.  If you analyze the small table, even if it only has one or two rows in it, it will often radically change the plan that Postgres chooses.

Craig James