Thread: [PERFORM] select with max functions
Hi,
I need to use the max function in my query. I had very bad performance when I used the max :
SELECT Ma.User_Id,
COUNT(*) COUNT
FROM Manuim Ma
WHERE Ma.Bb_Open_Date =
(SELECT max(Bb_Open_Date)
FROM Manuim Man
WHERE Man.User_Id = Ma.User_Id
)
GROUP BY Ma.User_Id
HAVING COUNT(*) > 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.56..3250554784.13 rows=115111 width=18)
Group Key: ma.user_id
Filter: (count(*) > 1)
-> Index Scan using manuim_i_user_id on manuim ma (cost=0.56..3250552295.59 rows=178324 width=10)
Filter: (bb_open_date = (SubPlan 1))
SubPlan 1
-> Aggregate (cost=90.98..90.99 rows=1 width=8)
-> Index Scan using manuim_i_user_id on manuim man (cost=0.56..90.92 rows=22 width=8)
Index Cond: ((user_id)::text = (ma.user_id)::text)
(9 rows)
So I used the limit 1 option :
SELECT Ma.User_Id,
COUNT(*) COUNT
FROM Manuim Ma
WHERE Ma.Bb_Open_Date =
(SELECT Bb_Open_Date
FROM Manuim Man
WHERE Man.User_Id = Ma.User_Id order by bb_open_date desc limit 1
)
GROUP BY Ma.User_Id
HAVING COUNT(*) > 1;
and the performance are still the same :
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.56..3252248863.46 rows=115111 width=18)
Group Key: ma.user_id
Filter: (count(*) > 1)
-> Index Scan using manuim_i_user_id on manuim ma (cost=0.56..3252246374.92 rows=178324 width=10)
Filter: (bb_open_date = (SubPlan 1))
SubPlan 1
-> Limit (cost=91.03..91.03 rows=1 width=8)
-> Sort (cost=91.03..91.09 rows=22 width=8)
Sort Key: man.bb_open_date DESC
-> Index Scan using manuim_i_user_id on manuim man (cost=0.56..90.92 rows=22 width=8)
Index Cond: ((user_id)::text = (ma.user_id)::text)
(11 rows)
the reading on the table manuim takes a lot of effort, what else can I do ? the table`s size is 8G.
select count(*) from manuim;
count
----------
35664828
(1 row)
the indexes on the table :
"manuim_bb_open_date" btree (bb_open_date)
"manuim_i_user_id" btree (user_id)
Any idea how can I continue from here ? Thanks , Mariel.
----- Mensaje original ----- > De: "Mariel Cherkassky" <mariel.cherkassky@gmail.com> > Para: pgsql-performance@postgresql.org > Enviados: Domingo, 1 de Octubre 2017 9:41:37 > Asunto: [PERFORM] select with max functions > > Hi, > I need to use the max function in my query. I had very bad performance when > I used the max : > > SELECT Ma.User_Id, > COUNT(*) COUNT > FROM Manuim Ma > WHERE Ma.Bb_Open_Date = > (SELECT max(Bb_Open_Date) > FROM Manuim Man > WHERE Man.User_Id = Ma.User_Id > ) > GROUP BY Ma.User_Id > HAVING COUNT(*) > 1; > > > QUERY PLAN > > --------------------------------------------------------------------------------------------------------- > GroupAggregate (cost=0.56..3250554784.13 rows=115111 width=18) > Group Key: ma.user_id > Filter: (count(*) > 1) > -> Index Scan using manuim_i_user_id on manuim ma > (cost=0.56..3250552295.59 rows=178324 width=10) > Filter: (bb_open_date = (SubPlan 1)) > SubPlan 1 > -> Aggregate (cost=90.98..90.99 rows=1 width=8) > -> Index Scan using manuim_i_user_id on manuim man > (cost=0.56..90.92 rows=22 width=8) > Index Cond: ((user_id)::text = (ma.user_id)::text) > (9 rows) > > > > So I used the limit 1 option : > > SELECT Ma.User_Id, > COUNT(*) COUNT > FROM Manuim Ma > WHERE Ma.Bb_Open_Date = > (SELECT Bb_Open_Date > FROM Manuim Man > WHERE Man.User_Id = Ma.User_Id order > by bb_open_date desc limit 1 > ) > GROUP BY Ma.User_Id > HAVING COUNT(*) > 1; > > and the performance are still the same : > > QUERY PLAN > > --------------------------------------------------------------------------------------------------------------- > GroupAggregate (cost=0.56..3252248863.46 rows=115111 width=18) > Group Key: ma.user_id > Filter: (count(*) > 1) > -> Index Scan using manuim_i_user_id on manuim ma > (cost=0.56..3252246374.92 rows=178324 width=10) > Filter: (bb_open_date = (SubPlan 1)) > SubPlan 1 > -> Limit (cost=91.03..91.03 rows=1 width=8) > -> Sort (cost=91.03..91.09 rows=22 width=8) > Sort Key: man.bb_open_date DESC > -> Index Scan using manuim_i_user_id on manuim man > (cost=0.56..90.92 rows=22 width=8) > Index Cond: ((user_id)::text = > (ma.user_id)::text) > (11 rows) > > > > the reading on the table manuim takes a lot of effort, what else can I do ? > the table`s size is 8G. > > select count(*) from manuim; > count > ---------- > 35664828 > (1 row) > > the indexes on the table : > "manuim_bb_open_date" btree (bb_open_date) > "manuim_i_user_id" btree (user_id) > > > Any idea how can I continue from here ? Thanks , Mariel. Start by posting the results of "explain analyze" of that queries, so we can see some timming stuff. Gerardo -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Am 01.10.2017 um 14:41 schrieb Mariel Cherkassky: > Hi, > I need to use the max function in my query. I had very bad performance > when I used the max : > > SELECT Ma.User_Id, > COUNT(*) COUNT > FROM Manuim Ma > WHERE Ma.Bb_Open_Date = > (SELECT max(Bb_Open_Date) > FROM Manuim Man > WHERE Man.User_Id = Ma.User_Id > ) > GROUP BY Ma.User_Id > HAVING COUNT(*) > 1; > > > Any idea how can I continue from here ? Thanks , Mariel. Maybe you can rewrite it, for instance to select distinct on (user_id, bb_open_date) user_id, bb_open_date, count(1) from Manuim group by 1,2 having count(1) > 1; maybe much cheaper, but untested! If not, please share more details, at least table-definition. Regards, Andreas -- 2ndQuadrant - The PostgreSQL Support Company. www.2ndQuadrant.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Andreas I tried to rewrite it with the function rank() but I failed. The query you wrote isnt the same as what I search. Moreover, I cant use explain analyze because it is taking to much time to run and I'm getting timeout..
2017-10-01 21:48 GMT+03:00 Andreas Kretschmer <andreas@a-kretschmer.de>:
Am 01.10.2017 um 14:41 schrieb Mariel Cherkassky:Hi,
I need to use the max function in my query. I had very bad performance when I used the max :
SELECT Ma.User_Id,
COUNT(*) COUNT
FROM Manuim Ma
WHERE Ma.Bb_Open_Date =
(SELECT max(Bb_Open_Date)
FROM Manuim Man
WHERE Man.User_Id = Ma.User_Id
)
GROUP BY Ma.User_Id
HAVING COUNT(*) > 1;
Any idea how can I continue from here ? Thanks , Mariel.
Maybe you can rewrite it, for instance to
select distinct on (user_id, bb_open_date) user_id, bb_open_date, count(1) from Manuim group by 1,2 having count(1) > 1;
maybe much cheaper, but untested! If not, please share more details, at least table-definition.
Regards, Andreas
--
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
----- Mensaje original ----- > De: "Mariel Cherkassky" <mariel.cherkassky@gmail.com> > Para: "Andreas Kretschmer" <andreas@a-kretschmer.de> > CC: pgsql-performance@postgresql.org > Enviados: Lunes, 2 de Octubre 2017 10:25:19 > Asunto: Re: [PERFORM] select with max functions > > Andreas I tried to rewrite it with the function rank() but I failed. The > query you wrote isnt the same as what I search. Moreover, I cant use > explain analyze because it is taking to much time to run and I'm getting > timeout.. > > 2017-10-01 21:48 GMT+03:00 Andreas Kretschmer <andreas@a-kretschmer.de>: Do a "set statement_timeout TO 0" prior to "explain analyze" -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
explain analyze SELECT Ma.User_Id,
COUNT(*) COUNT
FROM Manuim Ma
WHERE Ma.Bb_Open_Date =
(SELECT Bb_Open_Date
FROM Manuim Man
WHERE Man.User_Id = Ma.User_Id order by bb_open_date desc limit 1
)
GROUP BY Ma.User_Id
HAVING COUNT(*) > 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
----------------------------------------
GroupAggregate (cost=0.56..2430770384.80 rows=128137 width=18) (actual time=55.823..2970443.757 rows=1213 loops=1)
Group Key: ma.user_id
Filter: (count(*) > 1)
Rows Removed by Filter: 3693020
-> Index Scan using manuim_i_user_id on manuim ma (cost=0.56..2430767766.00 rows=178324 width=10) (actual time=0.249
..2966355.734 rows=3695461 loops=1)
Filter: (bb_open_date = (SubPlan 1))
Rows Removed by Filter: 31969367
SubPlan 1
-> Limit (cost=68.00..68.00 rows=1 width=8) (actual time=0.082..0.082 rows=0 loops=35664828)
-> Sort (cost=68.00..68.04 rows=16 width=8) (actual time=0.081..0.081 rows=0 loops=35664828)
Sort Key: man.bb_open_date DESC
Sort Method: quicksort Memory: 25kB
-> Index Scan using manuim_i_user_id on manuim man (cost=0.56..67.92 rows=16 width=8) (actual ti
me=0.001..0.069 rows=85 loops=35664828)
Index Cond: ((user_id)::text = (ma.user_id)::text)
Planning time: 0.414 ms
Execution time: 2970444.732 ms
(16 rows)
2017-10-02 16:45 GMT+03:00 Gerardo Herzig <gherzig@fmed.uba.ar>:
----- Mensaje original -----> De: "Mariel Cherkassky" <mariel.cherkassky@gmail.com>> Para: "Andreas Kretschmer" <andreas@a-kretschmer.de>> Enviados: Lunes, 2 de Octubre 2017 10:25:19> Asunto: Re: [PERFORM] select with max functions>> Andreas I tried to rewrite it with the function rank() but I failed. The> query you wrote isnt the same as what I search. Moreover, I cant use> explain analyze because it is taking to much time to run and I'm getting> timeout..>> 2017-10-01 21:48 GMT+03:00 Andreas Kretschmer <andreas@a-kretschmer.de>:Do a "set statement_timeout TO 0" prior to "explain analyze"
Mariel Cherkassky <mariel.cherkassky@gmail.com> writes: > explain analyze SELECT Ma.User_Id, > COUNT(*) COUNT > FROM Manuim Ma > WHERE Ma.Bb_Open_Date = > (SELECT Bb_Open_Date > FROM Manuim Man > WHERE Man.User_Id = Ma.User_Id order > by bb_open_date desc limit 1 > ) > GROUP BY Ma.User_Id > HAVING COUNT(*) > 1; The core problem with this query is that the sub-select has to be done over again for each row of the outer table, since it's a correlated sub-select (ie, it refers to Ma.User_Id from the outer table). Replacing a max() call with handmade logic doesn't do anything to help that. I'd try refactoring it so that you calculate the max Bb_Open_Date just once for each user id, perhaps along the lines of SELECT Ma.User_Id, COUNT(*) COUNT FROM Manuim Ma, (SELECT User_Id, max(Bb_Open_Date) as max FROM Manuim Man GROUP BY User_Id) ss WHERE Ma.User_Id = ss.User_Id AND Ma.Bb_Open_Date= ss.max GROUP BY Ma.User_Id HAVING COUNT(*) > 1; This is still not going to be instantaneous, but it might be better. It's possible that an index on (User_Id, Bb_Open_Date) would help, but I'm not sure. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
On 03/10/17 04:29, Tom Lane wrote: > Mariel Cherkassky <mariel.cherkassky@gmail.com> writes: >> explain analyze SELECT Ma.User_Id, >> COUNT(*) COUNT >> FROM Manuim Ma >> WHERE Ma.Bb_Open_Date = >> (SELECT Bb_Open_Date >> FROM Manuim Man >> WHERE Man.User_Id = Ma.User_Id order >> by bb_open_date desc limit 1 >> ) >> GROUP BY Ma.User_Id >> HAVING COUNT(*) > 1; > The core problem with this query is that the sub-select has to be done > over again for each row of the outer table, since it's a correlated > sub-select (ie, it refers to Ma.User_Id from the outer table). Replacing > a max() call with handmade logic doesn't do anything to help that. > I'd try refactoring it so that you calculate the max Bb_Open_Date just > once for each user id, perhaps along the lines of > > SELECT Ma.User_Id, > COUNT(*) COUNT > FROM Manuim Ma, > (SELECT User_Id, max(Bb_Open_Date) as max > FROM Manuim Man > GROUP BY User_Id) ss > WHERE Ma.User_Id = ss.User_Id AND > Ma.Bb_Open_Date = ss.max > GROUP BY Ma.User_Id > HAVING COUNT(*) > 1; > > This is still not going to be instantaneous, but it might be better. > > It's possible that an index on (User_Id, Bb_Open_Date) would help, > but I'm not sure. > > regards, tom lane > > Further ideas based on Tom's rewrite: If that MAX is still expensive it might be worth breaking SELECT User_Id, max(Bb_Open_Date) as max FROM Manuim Man GROUP BY User_Id out into a VIEW, and considering making it MATERIALIZED, or creating an equivalent trigger based summary table (there are examples in the docs of how to do this). Cheers Mark -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance