Thread: Very newbie question
Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month.Here is query: SELECT id/10000000 as partition FROM delivery GROUP BY partition HAVING max(created_at) < CURRENT_DATE - '3 month'::interval; The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name. The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?
There is no reason to use index. The query has neither WHERE nor ORDER BY clause. Toomas > On 23. Oct 2023, at 18:13, Олег Самойлов <splarv@ya.ru> wrote: > > Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size ofpartition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month.Here is query: > > SELECT id/10000000 as partition > FROM delivery > GROUP BY partition > HAVING max(created_at) < CURRENT_DATE - '3 month'::interval; > > The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name. > The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index? >
On Mon, 23 Oct 2023 at 17:14, Олег Самойлов <splarv@ya.ru> wrote: > Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size ofpartition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month.Here is query: > > SELECT id/10000000 as partition > FROM delivery > GROUP BY partition > HAVING max(created_at) < CURRENT_DATE - '3 month'::interval; > > The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name. > The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index? You should send an explain of your query, and your table and index definition. Unless you are tied to do this in one query, and assuming you have an index by "created_at", I normally do these kind of things by: 1.- Get list of partitions, sort oldest first. 2.- do "select created_at from $partition order by created at desc limit 1", which normally is just an index lookup, and compare client-side. You can do the date math in the database too. Also, rhs of the comparison seems to be date, if created_at is timestamp you may be blocking the optimizer for some things. Francisco Olarte.
Well, get list of partitions and later scan one by one all 100 partitions is too simple. :) I am interesting is here moreelegant way? Any rewriting the query, any creating an index are permitted. > 23 окт. 2023 г., в 18:25, Francisco Olarte <folarte@peoplecall.com> написал(а): > > On Mon, 23 Oct 2023 at 17:14, Олег Самойлов <splarv@ya.ru> wrote: >> Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size ofpartition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month.Here is query: >> >> SELECT id/10000000 as partition >> FROM delivery >> GROUP BY partition >> HAVING max(created_at) < CURRENT_DATE - '3 month'::interval; >> >> The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name. >> The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index? > > You should send an explain of your query, and your table and index definition. > > Unless you are tied to do this in one query, and assuming you have an > index by "created_at", I normally do these kind of things by: > 1.- Get list of partitions, sort oldest first. > 2.- do "select created_at from $partition order by created at desc > limit 1", which normally is just an index lookup, and compare > client-side. > You can do the date math in the database too. Also, rhs of the > comparison seems to be date, if created_at is timestamp you may be > blocking the optimizer for some things. > > Francisco Olarte.
This is not correct. An index can accelerate, for instance, max(). Here is also not WHERE or ORDER BY, but index is useful: select max(created_at) from delivery; > 23 окт. 2023 г., в 18:23, Toomas <toomas.kristin@gmail.com> написал(а): > > > There is no reason to use index. The query has neither WHERE nor ORDER BY clause. > > Toomas > >> On 23. Oct 2023, at 18:13, Олег Самойлов <splarv@ya.ru> wrote: >> >> Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size ofpartition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month.Here is query: >> >> SELECT id/10000000 as partition >> FROM delivery >> GROUP BY partition >> HAVING max(created_at) < CURRENT_DATE - '3 month'::interval; >> >> The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name. >> The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index? >> >
On 10/23/23 10:13, Олег Самойлов wrote:
Maybe:
Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query: SELECT id/10000000 as partition FROM delivery GROUP BY partition HAVING max(created_at) < CURRENT_DATE - '3 month'::interval; The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name. The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?
Maybe:
SELECT DISTINCT id/10000000 as partition FROM delivery WHERE max(created_at) < CURRENT_DATE - '3 month'::interval;I haven't tried it, though.
--
Born in Arizona, moved to Babylonia.
Born in Arizona, moved to Babylonia.
On Mon, 23 Oct 2023 at 17:42, Олег Самойлов <splarv@ya.ru> wrote: > Well, get list of partitions and later scan one by one all 100 partitions is too simple. :) I am interesting is here moreelegant way? Any rewriting the query, any creating an index are permitted. 1.- You do not scan all partitions. Had you not top-posted it coudl easily be noted: > > 1.- Get list of partitions, sort oldest first. This means you get list of partitions, which is just a query, sorting them, even if you have to do it client side should be trivial unless you use really weird schemes, and you could sort them by your ranges in the query. Also, getting them oldest first means you evaluate the age-query before archiving, once for each archivable partition plus one extra, an overhead which should be dwarfed by any non-trivial archival, even a rename or drop index would probably be longer. And last. We have different concepts for elegance. IMO by saying a solution is "too simple" not having stated "I want a complex tricky solution" disqualifies you a bit. For real problems, no solution is too simple ( it may be a language problem, or you may have hidden constraints, but that needs to be specified ). feel free to exec &> /dev/null. Francisco Olarte.
Okey, I see no one was be able to solve this problem. But I could. May be for someone this will be useful too. There is solution. Original query was: > 23 окт. 2023 г., в 18:13, Олег Самойлов <splarv@ya.ru> написал(а): > > SELECT id/10000000 as partition > FROM delivery > GROUP BY partition > HAVING max(created_at) < CURRENT_DATE - '3 month'::interval; And I was not able to accelerate it by any index, works 5 minutes. Now query is: SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM delivery) as part_numbers WHERE (SELECT max(created_at) from delivery where n*10000000 <=id and id < (n+1)*10000000) < CURRENT_DATE-'3 month'::interval; Return the same (number of partition need to archive), accelerated by two btree index: on id and created_at. Works very quick,less then second.
Hi,
El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió:
Okey, I see no one was be able to solve this problem. But I could. May be for someone this will be useful too. There is solution.
Original query was:
> 23 окт. 2023 г., в 18:13, Олег Самойлов <splarv@ya.ru> написал(а):
>
> SELECT id/10000000 as partition
> FROM delivery
> GROUP BY partition
> HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;
And I was not able to accelerate it by any index, works 5 minutes. Now query is:
SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM delivery) as part_numbers
WHERE (SELECT max(created_at) from delivery where n*10000000 <=id and id < (n+1)*10000000)
< CURRENT_DATE-'3 month'::interval;
Return the same (number of partition need to archive), accelerated by two btree index: on id and created_at. Works very quick, less then second.
If you happen to rework your design, consider partitioning on (created_at), as it may simplify your maintenance.
The reason why you couldn't improve the performance with an index is due to the calls of min() and max() that force to evaluate every single row. You may consider using a computed index in this case.
Your fast solution will work as long as you don't have missing sequences (like deleted rows).
Regards
Olivier
On 2023-10-25 17:48:46 +0200, Olivier Gautherot wrote: > El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió: > Okey, I see no one was be able to solve this problem. But I could. May be > for someone this will be useful too. There is solution. [...] > Now query is: > > SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM > delivery) as part_numbers > WHERE (SELECT max(created_at) from delivery where n*10000000 <=id > and id < (n+1)*10000000) > < CURRENT_DATE-'3 month'::interval; > > Return the same (number of partition need to archive), accelerated by two > btree index: on id and created_at. Works very quick, less then second. [...] > Your fast solution will work as long as you don't have missing sequences (like > deleted rows). Why do you think this would break with missing sequence numbers? hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
Hi,
El jue, 26 oct 2023 11:15, Peter J. Holzer <hjp-pgsql@hjp.at> escribió:
On 2023-10-25 17:48:46 +0200, Olivier Gautherot wrote:
> El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió:
> Okey, I see no one was be able to solve this problem. But I could. May be
> for someone this will be useful too. There is solution.
[...]
> Now query is:
>
> SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM
> delivery) as part_numbers
> WHERE (SELECT max(created_at) from delivery where n*10000000 <=id
> and id < (n+1)*10000000)
> < CURRENT_DATE-'3 month'::interval;
>
> Return the same (number of partition need to archive), accelerated by two
> btree index: on id and created_at. Works very quick, less then second.
[...]
> Your fast solution will work as long as you don't have missing sequences (like
> deleted rows).
Why do you think this would break with missing sequence numbers?
hp
In the suggested query, the return value contains a list of sequential numbers from a min to a max - they seem to be markers of the partitions. Let's assume that a complete partition is deleted in the middle: its index will still be returned by the query, although it doesn't exist any more in the table. It can be an issue if the list of indexes is actually used and partitions are not deleted sequentially.
My cent worth to ensure data integrity.
Regards
Olivier Gautherot
On 2023-10-26 11:56:56 +0200, Olivier Gautherot wrote: > El jue, 26 oct 2023 11:15, Peter J. Holzer <hjp-pgsql@hjp.at> escribió: > On 2023-10-25 17:48:46 +0200, Olivier Gautherot wrote: > > El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió: > > Okey, I see no one was be able to solve this problem. But I could. > > May be > > for someone this will be useful too. There is solution. > [...] > > Now query is: > > > > SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM > > delivery) as part_numbers > > WHERE (SELECT max(created_at) from delivery where n*10000000 > <=id > > and id < (n+1)*10000000) > > < CURRENT_DATE-'3 month'::interval; I just realized that this query is mangled. I'm going to assume that it should have been something like with part_numbers as ( SELECT generate_series(min(id)/100, max(id)/100) as n from delivery ) select * from part_numbers WHERE (SELECT max(created_at) from delivery where n*100 <= id and id < (n+1)*100) < CURRENT_DATE-'3 month'::interval; > [...] > > Your fast solution will work as long as you don't have missing sequences > (like > > deleted rows). > > Why do you think this would break with missing sequence numbers? > > > In the suggested query, the return value contains a list of sequential numbers > from a min to a max - they seem to be markers of the partitions. Let's assume > that a complete partition is deleted in the middle: its index will still be > returned by the query, although it doesn't exist any more in the table. I don't think it will. While the generate_series() will produce the partition number, the where clause will not find any matching rows, so the query will not return it. E.g. (this table isn't partitioned, but that shouldn't affect the result, also I'll reduce the "partition size" to 100 to make it more readable): create table delivery (id int, created_at date); insert into delivery(200, '2000-01-01'); insert into delivery values(200, '2000-01-01'); insert into delivery values(299, '2000-12-01'); insert into delivery values(412, '2002-02-01'); insert into delivery values(439, '2002-03-01'); insert into delivery values(501, '2023-01-01'); insert into delivery values(555, now()); Note that there are no records in "partition" 3, and "partition" 5 contains current data, so we should get only "partition numbers" 2 and 4: with part_numbers as ( SELECT generate_series(min(id)/100, max(id)/100) as n from delivery ) select * from part_numbers WHERE (SELECT max(created_at) from delivery where n*100 <= id and id < (n+1)*100) < CURRENT_DATE-'3 month'::interval; ╔═══╗ ║ n ║ ╟───╢ ║ 2 ║ ║ 4 ║ ╚═══╝ (2 rows) Looks ok to me. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"