Thread: A way to optimize sql about the last temporary-related row

A way to optimize sql about the last temporary-related row

From
"agharta82@gmail.com"
Date:
Hello everyone,
Sorry to bother you but I have a query that is driving me crazy.

I need to retrieve the temporally last valid record (the most recent one) 
for each value of a specific parameter.

First some data:
Linux Rocky 8.10 environment, minimal installation (on VM KVM with 
Fedora 40).
Postgresql 16.3, installed by official Postgresql guide.
effective_cache_size = '1000 MB';
shared_buffers = '500 MB';
work_mem = '16MB';
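These can be applied, for example, with ALTER SYSTEM (note that 
shared_buffers only takes effect after a restart):

ALTER SYSTEM SET effective_cache_size = '1000MB';
ALTER SYSTEM SET shared_buffers = '500MB';
ALTER SYSTEM SET work_mem = '16MB';
SELECT pg_reload_conf();  -- shared_buffers still requires a server restart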
The changes are deliberately minimal so that anyone can reproduce the 
problem.

Table script:
CREATE TABLE test_table
(
  pk_id int NOT NULL,
  integer_field_1 int,
  integer_field_2 int,
  datetime_field_1 timestamp,
  primary key (pk_id)
);

-- insert 4M records
insert into test_table(pk_id) select generate_series(1,4000000,1);

-- now set some random data, distributed between specific ranges (as in
-- my production table)
update test_table set
  datetime_field_1 = timestamp '2000-01-01 00:00:00' + random() *
    (timestamp '2024-05-31 23:59:59' - timestamp '2000-01-01 00:00:00'),
  integer_field_1 = floor(random() * (6-1+1) + 1)::int,
  integer_field_2 = floor(random() * (200000-1+1) + 1)::int;


-- indexes
CREATE INDEX idx_test_table_integer_field_1 ON test_table(integer_field_1);
CREATE INDEX xtest_table_datetime_field_1 ON test_table(datetime_field_1 desc);
CREATE INDEX idx_test_table_integer_field_2 ON test_table(integer_field_2);


--vacuum

vacuum full test_table;



Now the query:
explain (verbose, buffers, analyze)
with last_table_ids as materialized (
   select xx from (
      select LAST_VALUE(pk_id) over (partition by integer_field_2
                 order by datetime_field_1
                 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) xx
      from test_table
      where integer_field_1 = 1
        and datetime_field_1 <= CURRENT_TIMESTAMP
   ) ww group by ww.xx
),
last_row_per_ids as (
   select tt.* from last_table_ids lt
   inner join test_table tt on (tt.pk_id = lt.xx)
)
select * /* or count(*) */ from last_row_per_ids;


This query, on my PC, takes 46 seconds!!!
I was expecting about 2-3 seconds (in line with my other queries on this 
table), but it seems that the xtest_table_datetime_field_1 index is not 
being used.

Do you think there is a way to optimize the query?

Thanks so much for the support,

Agharta




Re: A way to optimize sql about the last temporary-related row

From
Ron Johnson
Date:
On Thu, Jun 27, 2024 at 11:20 AM agharta82@gmail.com <agharta82@gmail.com> wrote:
[snip]

-- indexes
CREATE INDEX idx_test_table_integer_field_1 ON test_table(integer_field_1);
CREATE INDEX xtest_table_datetime_field_1 ON test_table(datetime_field_1 desc);
CREATE INDEX idx_test_table_integer_field_2 ON test_table(integer_field_2);


Off-topic: save some resources by vacuuming before creating indices.
 

Re: A way to optimize sql about the last temporary-related row

From
"agharta82@gmail.com"
Date:

Hi,

You are right. Too quick a copy-paste on my part :-)

I take this opportunity to add a NOT insignificant detail.

Before executing the select query I clear the cache:

systemctl stop postgresql-16 && sync && echo 3 > /proc/sys/vm/drop_caches  &&  systemctl start postgresql-16

I need the query to perform well even when the data is not in cache.


My best regards,

Agharta


On 27/06/24 5:27 PM, Ron Johnson wrote:
[snip]
Off-topic: save some resources by vacuuming before creating indices.

Re: A way to optimize sql about the last temporary-related row

From
"David G. Johnston"
Date:
On Thursday, June 27, 2024, agharta82@gmail.com <agharta82@gmail.com> wrote:

[snip]

Do you think there is a way to optimize the query?

Write a lateral subquery to pick the first row of a descending ordered query? Using GROUP BY to select ranked rows is both semantically wrong and potentially optimization-blocking.

I’m going by the general query form and the “last row” aspect of the question. I haven’t confirmed that your specific query can benefit from this approach. The window expression does give me pause.
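
An untested sketch of that shape, using the column names from your script:

select tt.*
from (select distinct integer_field_2
        from test_table
       where integer_field_1 = 1) ids
cross join lateral (
   select *
     from test_table t
    where t.integer_field_2 = ids.integer_field_2
      and t.integer_field_1 = 1
      and t.datetime_field_1 <= CURRENT_TIMESTAMP
    order by t.datetime_field_1 desc
    limit 1  -- the "last" row: newest datetime per integer_field_2
) tt;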

David J.

Re: A way to optimize sql about the last temporary-related row

From
Ron Johnson
Date:
On Thu, Jun 27, 2024 at 11:33 AM agharta82@gmail.com <agharta82@gmail.com> wrote:

[snip]

Before executing the select query I clear the cache:

systemctl stop postgresql-16 && sync && echo 3 > /proc/sys/vm/drop_caches  &&  systemctl start postgresql-16

I need the query to perform well even when the data is not in cache.


In addition to David's suggestion, consider the pg_prewarm extension: https://www.postgresql.org/docs/current/pgprewarm.html
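
For example, something like this would warm the table and an index after a restart (a sketch using the names from your test script):

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('test_table');                     -- load the heap into shared buffers
SELECT pg_prewarm('xtest_table_datetime_field_1');   -- likewise for an index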
 

Re: A way to optimize sql about the last temporary-related row

From
"agharta82@gmail.com"
Date:

Hi,

Thanks for your reply.

About the syntax you're right, but I couldn't think of anything better :((( I'm here for that too, to solve the problem in an elegant way, with your great support.

In practice, I need to get back a dataset with the last association (the most recent record by datetime) for each distinct value of integer_field_2, based on the filter integer_field_1 = 1.

As mentioned in another reply, the query needs to be performant even if the data is not in cache (systemctl stop postgresql-16 && sync && echo 3 > /proc/sys/vm/drop_caches && systemctl start postgresql-16).

Many thanks for your support.

Agharta

On 27/06/24 5:33 PM, David G. Johnston wrote:
[snip]
Write a lateral subquery to pick the first row of a descending ordered query?

Re: A way to optimize sql about the last temporary-related row

From
David Rowley
Date:


On Fri, 28 Jun 2024, 3:20 am agharta82@gmail.com, <agharta82@gmail.com> wrote:
 
[snip]


This query, on my PC, takes 46 seconds!!!

(Away from laptop and using my phone)

Something like:

select distinct on (integer_field_2) *
from test_table
where integer_field_1 = 1
  and datetime_field_1 <= CURRENT_TIMESTAMP
order by integer_field_2, datetime_field_1 desc;

Might run a bit faster. However, if it's slow due to I/O then maybe not much faster. Your version took about 5 seconds on my phone and my version ran in 1.5 seconds.

It's difficult for me to check the results match with each query from my phone. A quick scan of the first 10 or so records looked good.

If the updated query is still too slow on cold cache then faster disks might be needed.
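
A composite index matching that ordering might also be worth a try, though I haven't tested it on this data:

-- hypothetical index: lets the planner read rows for integer_field_1 = 1
-- already grouped by integer_field_2 and sorted by datetime_field_1 desc
CREATE INDEX idx_test_table_f1_f2_dt
    ON test_table (integer_field_1, integer_field_2, datetime_field_1 DESC);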

David

Re: A way to optimize sql about the last temporary-related row

From
"agharta82@gmail.com"
Date:

HOO-HA! This is HUGE!

Only 2.2 seconds on my data!!!! Amazing!

distinct on (field) followed by "*" is a hidden gem!

Thank you so much, and thanks to everyone who helped me!

Cheers,

Agharta

 


On 27/06/24 6:16 PM, David Rowley wrote:
[snip]

Re: A way to optimize sql about the last temporary-related row

From
Richard Welty
Date:
not really in direct response to this conversation, but is there any reason
on the face of the planet why read receipts need to be sent to every single
recipient of the mailing list?

just saying,
  richard



---- On Fri, 28 Jun 2024 03:20:26 -0400 <agharta82@gmail.com> wrote ---

[snip]



Re: A way to optimize sql about the last temporary-related row

From
agharta agharta
Date:
Sorry, wrong settings in my PC mail client.
Sorry again.
Agharta

On Fri, 28 Jun 2024, 19:51 Richard Welty <rwelty@salesium.com> wrote:
not really in direct response to this conversation, but is there any reason
on the face of the planet why read receipts need to be sent to every single
recipient of the mailing list?

just saying,
  richard


