Thread: Slow SELECT

Slow SELECT

From
Frank Millman
Date:
Hi all

I have a SELECT that runs over 5 times slower on PostgreSQL compared 
with SQL Server and sqlite3. I am trying to understand why.

I have a table that looks like this (simplified) -

CREATE TABLE my_table (
     row_id SERIAL PRIMARY KEY,
     deleted_id INT DEFAULT 0,
     fld_1 INT REFERENCES table_1(row_id),
     fld_2 INT REFERENCES table_2(row_id),
     fld_3 INT REFERENCES table_3(row_id),
     fld_4 INT REFERENCES table_4(row_id),
     tran_date DATE,
     tran_total DEC(21,2)
     );

CREATE UNIQUE INDEX my_table_ndx ON my_table (fld_1, fld_2, fld_3, 
fld_4, tran_date) WHERE deleted_id = 0;

The table sizes are -
     my_table : 167 rows
     table_1 : 21 rows
     table_2 : 11 rows
     table_3 : 3 rows
     table_4 : 16 rows

Therefore for each tran_date in my_table there are potentially 
21x11x3x16 = 11088 rows. Most will be null.

I want to select the row_id for the last tran_date for each of those 
potential groups. This is my select -

     SELECT (
         SELECT a.row_id FROM my_table a
         WHERE a.fld_1 = b.row_id
         AND a.fld_2 = c.row_id
         AND a.fld_3 = d.row_id
         AND a.fld_4 = e.row_id
         AND a.deleted_id = 0
         ORDER BY a.tran_date DESC LIMIT 1
     )
     FROM table_1 b, table_2 c, table_3 d, table_4 e

Out of 11088 rows selected, 103 are not null.

On identical data, this takes 0.06 sec on SQL Server, 0.04 sec on 
sqlite3, and 0.31 sec on PostgreSQL.

I have looked at the EXPLAIN, but I don't really know what to look for. 
I can supply it if that would help.

Thanks for any advice.

Frank Millman



Re: Slow SELECT

From
Olivier Gautherot
Date:
Hi Frank,

On Tue, May 26, 2020 at 9:23 AM Frank Millman <frank@chagford.com> wrote:
> Hi all
>
> I have a SELECT that runs over 5 times slower on PostgreSQL compared
> with SQL Server and sqlite3. I am trying to understand why.
>
> [...]
>
> On identical data, this takes 0.06 sec on SQL Server, 0.04 sec on
> sqlite3, and 0.31 sec on PostgreSQL.

SQL Server does a good job at caching data in memory. PostgreSQL does too on consecutive calls to the same table. What execution time do you get if you issue the query a second time?
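
In psql you can check this with \timing, for example:

     \timing on
     -- run the SELECT twice and compare the two "Time:" lines;
     -- the second run shows the warm-cache timing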

My first guess would be to add an index on my_table.tran_date and check in EXPLAIN that you don't have a SEQUENTIAL SCAN on that table.
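
For example (the index name is just a placeholder):

     CREATE INDEX my_table_tran_date_idx ON my_table (tran_date);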
 


--
Olivier Gautherot

Re: Slow SELECT

From
Frank Millman
Date:
On 2020-05-26 9:32 AM, Olivier Gautherot wrote:
> Hi Frank,
> [...]
> SQL Server does a good job at caching data in memory. PostgreSQL does 
> too on consecutive calls to the same table. What execution time do you 
> get if you issue the query a second time?
> 
> My first guess would be to add an index on my_table.tran_date and check 
> in EXPLAIN that you don't have a SEQUENTIAL SCAN on that table.

Thanks Olivier. Unfortunately that did not help.

I was already running the query twice and only timing the second one.

I added the index on tran_date. The timing is the same, and EXPLAIN 
shows that it is using a SEQUENTIAL SCAN.

Here is the EXPLAIN -


  Nested Loop  (cost=0.00..64155.70 rows=11088 width=4)
    ->  Nested Loop  (cost=0.00..10.36 rows=528 width=12)
          ->  Nested Loop  (cost=0.00..2.56 rows=33 width=8)
                ->  Seq Scan on table_2 c  (cost=0.00..1.11 rows=11 width=4)
                ->  Materialize  (cost=0.00..1.04 rows=3 width=4)
                      ->  Seq Scan on table_3 d  (cost=0.00..1.03 rows=3 width=4)
          ->  Materialize  (cost=0.00..1.24 rows=16 width=4)
                ->  Seq Scan on table_4 e  (cost=0.00..1.16 rows=16 width=4)
    ->  Materialize  (cost=0.00..1.31 rows=21 width=4)
          ->  Seq Scan on table_1 b  (cost=0.00..1.21 rows=21 width=4)
    SubPlan 1
      ->  Limit  (cost=5.77..5.77 rows=1 width=8)
            ->  Sort  (cost=5.77..5.77 rows=1 width=8)
                  Sort Key: a.tran_date DESC
                  ->  Seq Scan on my_table a  (cost=0.00..5.76 rows=1 width=8)
                        Filter: ((fld_1 = b.row_id) AND (fld_2 = c.row_id) AND (fld_3 = d.row_id) AND (fld_4 = e.row_id) AND (deleted_id = 0))


Frank



Re: Slow SELECT

From
Charles Clavadetscher
Date:
Hello

On 2020-05-26 10:38, Frank Millman wrote:
> [...]
> 
> Thanks Olivier. Unfortunately that did not help.
> 
> I was already running the query twice and only timing the second one.
> 
> I added the index on tran_date. The timing is the same, and EXPLAIN
> shows that it is using a SEQUENTIAL SCAN.
> [...]

If I see it correct, the query runs sequential scans on all tables, i.e. 
table_1 to table_4.
Do you have an index on the referenced keys (row_id) in table_1 to 
table_4?

It happens often that referenced keys are not indexed, leading to poor 
execution plans.
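
You can check this quickly in the catalog, for example:

     SELECT tablename, indexname, indexdef
     FROM pg_indexes
     WHERE tablename IN ('table_1', 'table_2', 'table_3', 'table_4');

A primary key shows up there as its implicit unique index.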

Bye
Charles



Re: Slow SELECT

From
Charles Clavadetscher
Date:
On 2020-05-26 11:10, Charles Clavadetscher wrote:
> [...]
> If I see it correct, the query runs sequential scans on all tables,
> i.e. table_1 to table_4.
> Do you have an index on the referenced keys (row_id) in table_1 to 
> table_4?
> 
> It happens often that referenced keys are not indexed, leading to poor
> execution plans.
> 
> Bye
> Charles

I noticed later that you have very small tables. This will probably lead 
to a sequential scan although there is an index in place.

I am not sure if it makes a difference, but what about using explicit 
joins?

SELECT a.row_id FROM my_table a
JOIN table_1 b ON (b.row_id = a.fld_1)
JOIN table_2 c ON (c.row_id = a.fld_2)
JOIN table_3 d ON (d.row_id = a.fld_3)
JOIN table_4 e ON (e.row_id = a.fld_4)
WHERE a.deleted_id = 0
ORDER BY a.tran_date DESC LIMIT 1;

Regards
Charles



Re: Slow SELECT

From
Frank Millman
Date:
On 2020-05-26 11:10 AM, Charles Clavadetscher wrote:
> Hello
> [...]
> If I see it correct, the query runs sequential scans on all tables, i.e. 
> table_1 to table_4.
> Do you have an index on the referenced keys (row_id) in table_1 to table_4?
> 
> It happens often that referenced keys are not indexed, leading to poor 
> execution plans.
> 

The referenced keys are all defined as SERIAL PRIMARY KEY in their own 
tables, so I presume that they are all indexed automatically.

On the other hand, there are not many rows in those tables, so the 
planner may decide not to use the index in that case.

Frank




Re: Slow SELECT

From
Christian Ramseyer
Date:
Hi

On 26.05.20 09:22, Frank Millman wrote:
> 
> I have looked at the EXPLAIN, but I don't really know what to look for.
> I can supply it if that would help.
> 

My favorite approach to tuning Postgres queries is:

1. Run EXPLAIN ANALYZE <query>
2. Copy/Paste the output into the fantastic https://explain.depesz.com/

This will turn the somewhat hard-to-understand explain output into a
nice colored structure. If it's not obvious from the orange-reddish
boxes where the slowness comes from, please post the link here and
somebody will certainly have some advice.
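
For example, with the query from your first mail that would be
something like (BUFFERS is optional, but it also shows how much data
comes from cache):

EXPLAIN (ANALYZE, BUFFERS)
SELECT (
    SELECT a.row_id FROM my_table a
    WHERE a.fld_1 = b.row_id
    AND a.fld_2 = c.row_id
    AND a.fld_3 = d.row_id
    AND a.fld_4 = e.row_id
    AND a.deleted_id = 0
    ORDER BY a.tran_date DESC LIMIT 1
)
FROM table_1 b, table_2 c, table_3 d, table_4 e;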

Cheers
Christian



-- 
Christian Ramseyer, netnea ag
Network Management. Security. OpenSource.
https://www.netnea.com






Re: Slow SELECT

From
David Rowley
Date:
On Tue, 26 May 2020 at 19:23, Frank Millman <frank@chagford.com> wrote:
> The table sizes are -
>      my_table : 167 rows
>      table_1 : 21 rows
>      table_2 : 11 rows
>      table_3 : 3 rows
>      table_4 : 16 rows
>
> Therefore for each tran_date in my_table there are potentially
> 21x11x3x16 = 11088 rows. Most will be null.
>
> I want to select the row_id for the last tran_date for each of those
> potential groups. This is my select -
>
>      SELECT (
>          SELECT a.row_id FROM my_table a
>          WHERE a.fld_1 = b.row_id
>          AND a.fld_2 = c.row_id
>          AND a.fld_3 = d.row_id
>          AND a.fld_4 = e.row_id
>          AND a.deleted_id = 0
>          ORDER BY a.tran_date DESC LIMIT 1
>      )
>      FROM table_1 b, table_2 c, table_3 d, table_4 e
>
> Out of 11088 rows selected, 103 are not null.

Perhaps SQL Server is doing something to rewrite the subquery in the
target list to a LEFT JOIN.  PostgreSQL currently does not do that.

Since "my_table" is small, you'd likely be much better doing a manual
rewrite of the query to join a subquery containing the required
details from "my_table".  It looks like you want the row_id from the
latest tran_date for each fld_N column. So something like:

SELECT a.row_id
FROM table_1 b
CROSS JOIN table_2 c
CROSS JOIN table_3 d
CROSS JOIN table_4 e
LEFT OUTER JOIN (
    SELECT fld_1, fld_2, fld_3, fld_4, row_id, tran_date,
        ROW_NUMBER() OVER (PARTITION BY fld_1, fld_2, fld_3, fld_4
            ORDER BY tran_date DESC) row_num
    FROM my_table
    WHERE deleted_id = 0
) a ON a.fld_1 = b.row_id AND a.fld_2 = c.row_id
    AND a.fld_3 = d.row_id AND a.fld_4 = e.row_id
    AND a.row_num = 1;

Should do it. You could also perhaps do something with DISTINCT ON
instead of using ROW_NUMBER(). That might be a bit more efficient, but
it's unlikely to matter too much since there are only 167 rows in that
table.
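
Untested, but the DISTINCT ON version would look something like:

SELECT a.row_id
FROM table_1 b
CROSS JOIN table_2 c
CROSS JOIN table_3 d
CROSS JOIN table_4 e
LEFT OUTER JOIN (
    SELECT DISTINCT ON (fld_1, fld_2, fld_3, fld_4)
        fld_1, fld_2, fld_3, fld_4, row_id
    FROM my_table
    WHERE deleted_id = 0
    ORDER BY fld_1, fld_2, fld_3, fld_4, tran_date DESC
) a ON a.fld_1 = b.row_id AND a.fld_2 = c.row_id
    AND a.fld_3 = d.row_id AND a.fld_4 = e.row_id;

DISTINCT ON keeps the first row of each (fld_1, fld_2, fld_3, fld_4)
group according to the ORDER BY, so no row_num filter is needed. Keep
in mind that DISTINCT ON is a PostgreSQL extension, so it won't port
to the other engines.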

David



Re: Slow SELECT

From
Frank Millman
Date:

On 2020-05-26 11:27 AM, Charles Clavadetscher wrote:
> On 2020-05-26 11:10, Charles Clavadetscher wrote:
>> [...]
> 
> I noticed later that you have very small tables. This will probably lead 
> to a sequential scan although there is an index in place.
> 
> I am not sure if it makes a difference, but what about using explicit 
> joins?
> 
> SELECT a.row_id FROM my_table a
> JOIN table_1 b ON (b.row_id = a.fld_1)
> JOIN table_2 c ON (c.row_id = a.fld_2)
> JOIN table_3 d ON (d.row_id = a.fld_3)
> JOIN table_4 e ON (e.row_id = a.fld_4)
> WHERE a.deleted_id = 0
> ORDER BY a.tran_date DESC LIMIT 1;
> 

Thanks, Charles. I tried that, but unfortunately it produces a different 
result. I need to test for every possible combination of fld1-4, and get 
the highest date for each one. Using joins only tests existing 
combinations, and gets the highest date for all of them combined.

See my reply to David Rowley. I do not fully understand his solution 
yet, but it seems to be what I am looking for.

Thanks again

Frank




Re: Slow SELECT

From
Frank Millman
Date:

On 2020-05-26 12:02 PM, Christian Ramseyer wrote:
> Hi
> [...]

Thanks, Christian. I will definitely look into that.

Frank




Re: Slow SELECT

From
Frank Millman
Date:

On 2020-05-26 12:04 PM, David Rowley wrote:
> [...]

Thank you David. I tried that and it produced the correct result in 
53ms, which is what I am looking for.

It will take me some time to understand it fully, so I have some 
homework to do!

Much appreciated.

Frank




Re: Slow SELECT

From
Vik Fearing
Date:
On 5/26/20 12:04 PM, David Rowley wrote:
> Since "my_table" is small, you'd likely be much better doing a manual
> rewrite of the query to join a subquery containing the required
> details from "my_table".  It looks like you want the row_id from the
> latest tran_date for each fld_N column. So something like:
> [...]

I would expect a lateral query to be better here.

SELECT a.*
FROM table_1 AS b
CROSS JOIN table_2 AS c
CROSS JOIN table_3 AS d
CROSS JOIN table_4 AS e
CROSS JOIN LATERAL (
    SELECT *
    FROM my_table AS a
    WHERE (a.fld_1, a.fld_2, a.fld_3, a.fld_4)
        = (b.row_id, c.row_id, d.row_id, e.row_id)
      AND a.deleted_id = 0
    ORDER BY a.tran_date DESC
    FETCH FIRST ROW ONLY
) AS a
WHERE a.row_id IS NOT NULL;

You will likely want an index on my_table (fld_1, fld_2, fld_3, fld_4,
tran_date) if your dataset gets bigger.
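
Something like this, for example (the index name is up to you):

CREATE INDEX my_table_flds_tran_date_idx
    ON my_table (fld_1, fld_2, fld_3, fld_4, tran_date);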

This query is 100% Standard SQL, so it *should* work on other engines.
That doesn't mean it will, though.
-- 
Vik Fearing



Re: Slow SELECT

From
David Rowley
Date:
On Tue, 26 May 2020 at 22:31, Frank Millman <frank@chagford.com> wrote:

> Thank you David. I tried that and it produced the correct result in
> 53ms, which is what I am looking for.
>
> It will take me some time to understand it fully, so I have some
> homework to do!

The main problem with your previous query was that the subquery was
being executed 11088 times, but could only ever find a row on 167 of
those executions. The rest of the time it found nothing.

I just changed the subquery which would be executed once per output
row and altered it so it became a subquery that's joined and only
executed once.  The ROW_NUMBER() is a windowing function, which is
explained in [1].  I used this to get the row_id of the record with
the latest tran_date, just like you were doing with the ORDER BY
tran_date DESC LIMIT 1, but the subquery with the windowing function
gets them all at once, rather than doing it in a way that requires it
to be executed once for each row in the top-level query. In this case,
the functionality that the LIMIT 1 does in your query is achieved with
"AND a.row_num = 1;" in my version. This is pretty fast to execute
once due to there only being 167 rows.

It's also important to know that there may be cases where the method I
proposed is slower.  For example, if my_table was very large and
contained rows that were not in table_1 to table_4. Since the subquery
in my version calculates everything then it could be wasteful to do
that for values that would never be used. For you, you have foreign
keys that ensure my_table does not contain records that are not in the
other tables, but you could still see this issue if you were to add
some restrictive WHERE clause to the outer query.  Perhaps this won't
be a problem for you, but it's likely good to know.

[1] https://www.postgresql.org/docs/current/tutorial-window.html

David



Re: Slow SELECT

From
David Rowley
Date:
On Tue, 26 May 2020 at 23:41, Vik Fearing <vik@postgresfriends.org> wrote:
>
> On 5/26/20 12:04 PM, David Rowley wrote:
> > [...]
>
> I would expect a lateral query to be better here.

But that would put it back to executing the subquery 11088 times. I
wrote it in a way to avoid that.

David



Re: Slow SELECT

From
Frank Millman
Date:

On 2020-05-26 1:45 PM, David Rowley wrote:
> [...]

Thanks very much for the explanation. I will go through it carefully.

For the record, your query works without modification in both SQL Server 
and sqlite3. It is also much faster in all three cases - all around 
0.005 sec instead of 0.05 sec.

Frank



Re: Slow SELECT

From
Frank Millman
Date:

On 2020-05-26 12:04 PM, David Rowley wrote:
> 
> Since "my_table" is small, you'd likely be much better doing a manual
> rewrite of the query to join a subquery containing the required
> details from "my_table".  It looks like you want the row_id from the
> latest tran_date for each fld_N column. So something like:
> 
> SELECT a.row_id
> FROM table_1 b
> CROSS JOIN table_2 c
> CROSS JOIN table_3 d
> CROSS JOIN table_4 e
> LEFT OUTER JOIN (
>     SELECT fld_1, fld_2, fld_3, fld_4, row_id, tran_date,
>         ROW_NUMBER() OVER (PARTITION BY fld_1, fld_2, fld_3, fld_4
>             ORDER BY tran_date DESC) row_num
>     FROM my_table
>     WHERE deleted_id = 0
> ) a ON a.fld_1 = b.row_id AND a.fld_2 = c.row_id
>     AND a.fld_3 = d.row_id AND a.fld_4 = e.row_id
>     AND a.row_num = 1;
> 
> Should do it. You could also perhaps do something with DISTINCT ON
> instead of using ROW_NUMBER(). That might be a bit more efficient, but
> it's unlikely to matter too much since there are only 167 rows in that
> table.
> 

I have studied the above SELECT, and I now more or less understand it. I 
would not have come up with that unaided, so many thanks.

I tried DISTINCT ON, and it was very efficient, but unfortunately that 
is not supported by SQL Server or sqlite3.

Then I came up with this alternative, which works on all three platforms 
and seems a bit faster than the above -

SELECT a.row_id FROM (
    SELECT row_id,
        ROW_NUMBER() OVER (PARTITION BY fld_1, fld_2, fld_3, fld_4
            ORDER BY tran_date DESC) row_num
    FROM my_table
    WHERE deleted_id = 0
) AS a
WHERE a.row_num = 1

Do you see any problem with this?

Thanks

Frank