Thread: Very newbie question

Very newbie question

From

Олег Самойлов

Date:

23 October 2023, 15:13:58

Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of
partition10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3
month.Here is query: 

SELECT id/10000000 as partition
   FROM delivery
   GROUP BY partition
   HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;

The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?

Re: Very newbie question

From

Toomas

Date:

23 October 2023, 15:23:33

There is no reason to use index. The query has neither WHERE nor ORDER BY clause.

Toomas

> On 23. Oct 2023, at 18:13, Олег Самойлов <splarv@ya.ru> wrote:
>
> Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size
ofpartition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3
month.Here is query: 
>
> SELECT id/10000000 as partition
>   FROM delivery
>   GROUP BY partition
>   HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;
>
> The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
> The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?
>

Re: Very newbie question

From

Francisco Olarte

Date:

23 October 2023, 15:25:16

On Mon, 23 Oct 2023 at 17:14, Олег Самойлов <splarv@ya.ru> wrote:
> Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size
ofpartition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3
month.Here is query: 
>
> SELECT id/10000000 as partition
>    FROM delivery
>    GROUP BY partition
>    HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;
>
> The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
> The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?

You should send an explain of your query, and your table and index definition.

Unless you are tied to do this in one query, and assuming you have an
index by "created_at", I normally do these kind of things by:
1.- Get list of partitions, sort oldest first.
2.- do "select created_at from $partition order by created at desc
limit 1", which normally is just an index lookup, and compare
client-side.
You can do the date math in the database too. Also, rhs of the
comparison seems to be date, if created_at is timestamp you may be
blocking the optimizer for some things.

Francisco Olarte.

Re: Very newbie question

From

Олег Самойлов

Date:

23 October 2023, 15:42:05

Well, get list of partitions and later scan one by one all 100 partitions is too simple. :) I am interesting is here
moreelegant way? Any rewriting the query, any creating an index are permitted. 

> 23 окт. 2023 г., в 18:25, Francisco Olarte <folarte@peoplecall.com> написал(а):
>
> On Mon, 23 Oct 2023 at 17:14, Олег Самойлов <splarv@ya.ru> wrote:
>> Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size
ofpartition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3
month.Here is query: 
>>
>> SELECT id/10000000 as partition
>>   FROM delivery
>>   GROUP BY partition
>>   HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;
>>
>> The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
>> The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?
>
> You should send an explain of your query, and your table and index definition.
>
> Unless you are tied to do this in one query, and assuming you have an
> index by "created_at", I normally do these kind of things by:
> 1.- Get list of partitions, sort oldest first.
> 2.- do "select created_at from $partition order by created at desc
> limit 1", which normally is just an index lookup, and compare
> client-side.
> You can do the date math in the database too. Also, rhs of the
> comparison seems to be date, if created_at is timestamp you may be
> blocking the optimizer for some things.
>
> Francisco Olarte.

Re: Very newbie question

From

Олег Самойлов

Date:

23 October 2023, 15:45:10

This is not correct. An index can accelerate, for instance, max(). Here is also not WHERE or ORDER BY, but index is
useful:
select max(created_at) from delivery;

> 23 окт. 2023 г., в 18:23, Toomas <toomas.kristin@gmail.com> написал(а):
>
>
> There is no reason to use index. The query has neither WHERE nor ORDER BY clause.
>
> Toomas
>
>> On 23. Oct 2023, at 18:13, Олег Самойлов <splarv@ya.ru> wrote:
>>
>> Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size
ofpartition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3
month.Here is query: 
>>
>> SELECT id/10000000 as partition
>>  FROM delivery
>>  GROUP BY partition
>>  HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;
>>
>> The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
>> The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?
>>
>

Re: Very newbie question

From

Ron

Date:

23 October 2023, 15:50:47

On 10/23/23 10:13, Олег Самойлов wrote:

Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query:

SELECT id/10000000 as partition   FROM delivery   GROUP BY partition   HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;

The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?

Maybe:

SELECT DISTINCT id/10000000 as partition
FROM delivery
WHERE max(created_at) < CURRENT_DATE - '3 month'::interval;

I haven't tried it, though.

--
Born in Arizona, moved to Babylonia.

Re: Very newbie question

From

Francisco Olarte

Date:

23 October 2023, 17:26:31

On Mon, 23 Oct 2023 at 17:42, Олег Самойлов <splarv@ya.ru> wrote:
> Well, get list of partitions and later scan one by one all 100 partitions is too simple. :) I am interesting is here
moreelegant way? Any rewriting the query, any creating an index are permitted. 

1.- You do not scan all partitions. Had you not top-posted it coudl
easily be noted:
> > 1.- Get list of partitions, sort oldest first.

This means you get list of partitions, which is just a query, sorting
them, even if you have to do it client side should be trivial unless
you use really weird schemes, and you could sort them by your ranges
in the query.

Also, getting them oldest first means you evaluate the age-query
before archiving, once for each archivable partition plus one extra,
an overhead which should be dwarfed by any non-trivial archival, even
a rename or drop index would probably be longer.

And last. We have different concepts for elegance. IMO by saying a
solution is "too simple" not having stated "I want a complex tricky
solution" disqualifies you a bit. For real problems, no solution is
too simple ( it may be a language problem, or you may have hidden
constraints, but that needs to be specified ).

feel free to exec &> /dev/null.

Francisco Olarte.

Re: Very newbie question

From

Олег Самойлов

Date:

25 October 2023, 14:58:15

Okey, I see no one was be able to solve this problem. But I could. May be for someone this will be useful too. There is
solution.

Original query was:

> 23 окт. 2023 г., в 18:13, Олег Самойлов <splarv@ya.ru> написал(а):
>
> SELECT id/10000000 as partition
>   FROM delivery
>   GROUP BY partition
>   HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;

And I was not able to accelerate it by any index, works 5 minutes. Now query is:

SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM delivery) as part_numbers
         WHERE (SELECT max(created_at) from delivery where n*10000000 <=id and id < (n+1)*10000000)
            < CURRENT_DATE-'3 month'::interval;

Return the same (number of partition need to archive), accelerated by two btree index: on id and created_at. Works very
quick,less then second.

Re: Very newbie question

From

Olivier Gautherot

Date:

25 October 2023, 15:48:46

Hi,

El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió:

Okey, I see no one was be able to solve this problem. But I could. May be for someone this will be useful too. There is solution.

Original query was:

> 23 окт. 2023 г., в 18:13, Олег Самойлов <splarv@ya.ru> написал(а):
>
> SELECT id/10000000 as partition
> FROM delivery
> GROUP BY partition
> HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;

And I was not able to accelerate it by any index, works 5 minutes. Now query is:

SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM delivery) as part_numbers
WHERE (SELECT max(created_at) from delivery where n*10000000 <=id and id < (n+1)*10000000)
< CURRENT_DATE-'3 month'::interval;

Return the same (number of partition need to archive), accelerated by two btree index: on id and created_at. Works very quick, less then second.

If you happen to rework your design, consider partitioning on (created_at), as it may simplify your maintenance.

The reason why you couldn't improve the performance with an index is due to the calls of min() and max() that force to evaluate every single row. You may consider using a computed index in this case.

Your fast solution will work as long as you don't have missing sequences (like deleted rows).

Regards

Olivier

Re: Very newbie question

From

Olivier Gautherot

Date:

26 October 2023, 09:56:56

Hi,

El jue, 26 oct 2023 11:15, Peter J. Holzer <hjp-pgsql@hjp.at> escribió:

On 2023-10-25 17:48:46 +0200, Olivier Gautherot wrote:
> El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió:
> Okey, I see no one was be able to solve this problem. But I could. May be
> for someone this will be useful too. There is solution.
[...]
> Now query is:
>
> SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM
> delivery) as part_numbers
> WHERE (SELECT max(created_at) from delivery where n*10000000 <=id
> and id < (n+1)*10000000)
> < CURRENT_DATE-'3 month'::interval;
>
> Return the same (number of partition need to archive), accelerated by two
> btree index: on id and created_at. Works very quick, less then second.
[...]
> Your fast solution will work as long as you don't have missing sequences (like
> deleted rows).

Why do you think this would break with missing sequence numbers?

hp

In the suggested query, the return value contains a list of sequential numbers from a min to a max - they seem to be markers of the partitions. Let's assume that a complete partition is deleted in the middle: its index will still be returned by the query, although it doesn't exist any more in the table. It can be an issue if the list of indexes is actually used and partitions are not deleted sequentially.

My cent worth to ensure data integrity.

Regards

Olivier Gautherot

Re: Very newbie question

From

"Peter J. Holzer"

Date:

26 October 2023, 11:28:44

On 2023-10-26 11:56:56 +0200, Olivier Gautherot wrote:
> El jue, 26 oct 2023 11:15, Peter J. Holzer <hjp-pgsql@hjp.at> escribió:
>     On 2023-10-25 17:48:46 +0200, Olivier Gautherot wrote:
>     > El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió:
>     >     Okey, I see no one was be able to solve this problem. But I could.
>     >     May be
>     >     for someone this will be useful too. There is solution.
>     [...]
>     >     Now query is:
>     >
>     >     SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM
>     >     delivery) as part_numbers
>     >              WHERE (SELECT max(created_at) from delivery where n*10000000
>     <=id
>     >     and id < (n+1)*10000000)
>     >                 < CURRENT_DATE-'3 month'::interval;

I just realized that this query is mangled. I'm going to assume that it
should have been something like

with part_numbers as (
    SELECT generate_series(min(id)/100, max(id)/100) as n
    from delivery
)
select * from part_numbers
WHERE (SELECT max(created_at) from delivery where n*100 <= id and id < (n+1)*100) < CURRENT_DATE-'3 month'::interval;

>     [...]
>     > Your fast solution will work as long as you don't have missing sequences
>     (like
>     > deleted rows).
>
>     Why do you think this would break with missing sequence numbers?
>
>
> In the suggested query, the return value contains a list of sequential numbers
> from a min to a max - they seem to be markers of the partitions. Let's assume
> that a complete partition is deleted in the middle: its index will still be
> returned by the query, although it doesn't exist any more in the table.

I don't think it will. While the generate_series() will produce the
partition number, the where clause will not find any matching rows, so
the query will not return it.

E.g. (this table isn't partitioned, but that shouldn't affect the
result, also I'll reduce the "partition size" to 100 to make it more
readable):

create table delivery (id int, created_at date);
insert into delivery(200, '2000-01-01');
insert into delivery values(200, '2000-01-01');
insert into delivery values(299, '2000-12-01');
insert into delivery values(412, '2002-02-01');
insert into delivery values(439, '2002-03-01');
insert into delivery values(501, '2023-01-01');
insert into delivery values(555, now());

Note that there are no records in "partition" 3, and "partition" 5
contains current data, so we should get only "partition numbers" 2 and
4:

with part_numbers as (
    SELECT generate_series(min(id)/100, max(id)/100) as n
    from delivery
)
select * from part_numbers
WHERE (SELECT max(created_at) from delivery where n*100 <= id and id < (n+1)*100) < CURRENT_DATE-'3 month'::interval;

╔═══╗
║ n ║
╟───╢
║ 2 ║
║ 4 ║
╚═══╝
(2 rows)

Looks ok to me.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

signature.asc