Thread: iceberg queries
Does PostgreSQL optimizer handle iceberg queries well? Thanks Wei
> > Does PostgreSQL optimizer handle iceberg queries well? > What do you mean by "iceberg query" ? I've never heard this term. Regards, Christoph
It is a query that looks like SELECT target1, target2... targetn, SUN(t.qty) FROM Table t GROUP BY target1 HAVING SUM(t.qty)>=10 You can replace SUM(t.qty)>=10 with other aggregate constraints. ----- Original Message ----- From: Christoph Haller To: pgsql-sql@postgresql.org Cc: wweng@kencast.com Sent: Tuesday, February 04, 2003 3:39 AM Subject: Re: [SQL] iceberg queries > > Does PostgreSQL optimizer handle iceberg queries well? > What do you mean by "iceberg query" ? I've never heard this term. Regards, Christoph
On Tue, Feb 04, 2003 at 09:08:56 -0500, Wei Weng <wweng@kencast.com> wrote: > It is a query that looks like > > SELECT target1, target2... targetn, SUN(t.qty) > FROM Table t > GROUP BY target1 > HAVING SUM(t.qty)>=10 > > You can replace SUM(t.qty)>=10 with other aggregate constraints. There were some recent changes to allow groups to use hashes that may help for queries like this. This can save a sort step or using an index scan. In theory you might be able to make some other speed ups by taking advantage of properties of specific aggregate functions (in particular that several common ones monotonicly increase or decrease as they are being calculated) and if doing an index scan on the fields used for group you might be able to skip a lot of rows. I expect that this situation would come up pretty rarely though.
Christoph Haller wrote: > > > > > Does PostgreSQL optimizer handle iceberg queries well? > > > What do you mean by "iceberg query" ? > I've never heard this term. Iceberg queries compute one or more aggregate functions to find aggregate values above a specified threshold. A typical iceberg query would be SELECT a, count(a) FROM tab GROUP BY a HAVING count(a) >= 100; This base form can easily be made more complicated by doing self joins and the like. This type of query is often found in market research, data warehousing and search engines. As to the original question, if an index is available that returns the rows in the sort order of the GROUP BY clause, PostgreSQL defaults to an index scan, otherwise it will do a sort of the rows matching an optional WHERE clause. This sorted set is then grouped and aggregated and filtered by the HAVING clause after aggregation. It is well known that this approach does not scale well for large data sets. But in contrast to a specialized statistical software, PostgreSQL has to answer the query precisely. So sampling or bucket methods aren't options. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes: > As to the original question, if an index is available that returns the > rows in the sort order of the GROUP BY clause, PostgreSQL defaults to an > index scan, otherwise it will do a sort of the rows matching an optional > WHERE clause. This sorted set is then grouped and aggregated and > filtered by the HAVING clause after aggregation. Note that as of 7.4, the planner will probably pick hashed aggregation rather than sort-based aggregation, if it can predict that the number of groups will not be too large for a hash table to fit in memory. This means we can do a seqscan (or, perhaps, an indexscan to match WHERE conditions) and avoid sorting. So I expect performance on this type of query to be a good deal better in 7.4. There are a few benchmark comparisons in the pghackers archives a couple months back. regards, tom lane
Tom Lane wrote: > > Jan Wieck <JanWieck@Yahoo.com> writes: > > As to the original question, if an index is available that returns the > > rows in the sort order of the GROUP BY clause, PostgreSQL defaults to an > > index scan, otherwise it will do a sort of the rows matching an optional > > WHERE clause. This sorted set is then grouped and aggregated and > > filtered by the HAVING clause after aggregation. > > Note that as of 7.4, the planner will probably pick hashed aggregation > rather than sort-based aggregation, if it can predict that the number > of groups will not be too large for a hash table to fit in memory. > This means we can do a seqscan (or, perhaps, an indexscan to match > WHERE conditions) and avoid sorting. So I expect performance on this > type of query to be a good deal better in 7.4. There are a few > benchmark comparisons in the pghackers archives a couple months back. > > regards, tom lane If it can predict. I guess the question was asked because one expects performance problems. From that I conclude that the amount of data is significant in this particular case. That does not necessarily but usually mean a large number of unique groups. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #