Thread: iceberg queries

iceberg queries

From
"Wei Weng"
Date:
Does PostgreSQL optimizer handle iceberg queries well?

Thanks

Wei



Re: iceberg queries

From
Christoph Haller
Date:
>
> Does PostgreSQL optimizer handle iceberg queries well?
>
What do you mean by "iceberg query" ?
I've never heard this term.

Regards, Christoph




Re: iceberg queries

From
"Wei Weng"
Date:
It is a query that looks like

SELECT target1, target2... targetn, SUN(t.qty)
FROM Table t
GROUP BY target1
HAVING SUM(t.qty)>=10

You can replace SUM(t.qty)>=10 with other aggregate constraints.




----- Original Message ----- 
From: Christoph Haller 
To: pgsql-sql@postgresql.org 
Cc: wweng@kencast.com 
Sent: Tuesday, February 04, 2003 3:39 AM
Subject: Re: [SQL] iceberg queries


>
> Does PostgreSQL optimizer handle iceberg queries well?
>
What do you mean by "iceberg query" ?
I've never heard this term.

Regards, Christoph



Re: iceberg queries

From
Bruno Wolff III
Date:
On Tue, Feb 04, 2003 at 09:08:56 -0500, Wei Weng <wweng@kencast.com> wrote:
> It is a query that looks like
> 
> SELECT target1, target2... targetn, SUN(t.qty)
> FROM Table t
> GROUP BY target1
> HAVING SUM(t.qty)>=10
> 
> You can replace SUM(t.qty)>=10 with other aggregate constraints.

There were some recent changes to allow groups to use hashes that may
help for queries like this. This can save a sort step or using an
index scan.

In theory you might be able to make some other speed ups by taking advantage
of properties of specific aggregate functions (in particular that several
common ones monotonicly increase or decrease as they are being calculated)
and if doing an index scan on the fields used for group you might be able to
skip a lot of rows. I expect that this situation would come up pretty rarely
though.


Re: iceberg queries

From
Jan Wieck
Date:
Christoph Haller wrote:
> 
> >
> > Does PostgreSQL optimizer handle iceberg queries well?
> >
> What do you mean by "iceberg query" ?
> I've never heard this term.

Iceberg queries compute one or more aggregate functions to find
aggregate values above a specified threshold. A typical iceberg query
would be

SELECT a, count(a)   FROM tab   GROUP BY a   HAVING count(a) >= 100;

This base form can easily be made more complicated by doing self joins
and the like. This type of query is often found in market research, data
warehousing and search engines.

As to the original question, if an index is available that returns the
rows in the sort order of the GROUP BY clause, PostgreSQL defaults to an
index scan, otherwise it will do a sort of the rows matching an optional
WHERE clause. This sorted set is then grouped and aggregated and
filtered by the HAVING clause after aggregation.

It is well known that this approach does not scale well for large data
sets. But in contrast to a specialized statistical software, PostgreSQL
has to answer the query precisely. So sampling or bucket methods aren't
options.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: iceberg queries

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
> As to the original question, if an index is available that returns the
> rows in the sort order of the GROUP BY clause, PostgreSQL defaults to an
> index scan, otherwise it will do a sort of the rows matching an optional
> WHERE clause. This sorted set is then grouped and aggregated and
> filtered by the HAVING clause after aggregation.

Note that as of 7.4, the planner will probably pick hashed aggregation
rather than sort-based aggregation, if it can predict that the number
of groups will not be too large for a hash table to fit in memory.
This means we can do a seqscan (or, perhaps, an indexscan to match
WHERE conditions) and avoid sorting.  So I expect performance on this
type of query to be a good deal better in 7.4.  There are a few
benchmark comparisons in the pghackers archives a couple months back.
        regards, tom lane


Re: iceberg queries

From
Jan Wieck
Date:
Tom Lane wrote:
> 
> Jan Wieck <JanWieck@Yahoo.com> writes:
> > As to the original question, if an index is available that returns the
> > rows in the sort order of the GROUP BY clause, PostgreSQL defaults to an
> > index scan, otherwise it will do a sort of the rows matching an optional
> > WHERE clause. This sorted set is then grouped and aggregated and
> > filtered by the HAVING clause after aggregation.
> 
> Note that as of 7.4, the planner will probably pick hashed aggregation
> rather than sort-based aggregation, if it can predict that the number
> of groups will not be too large for a hash table to fit in memory.
> This means we can do a seqscan (or, perhaps, an indexscan to match
> WHERE conditions) and avoid sorting.  So I expect performance on this
> type of query to be a good deal better in 7.4.  There are a few
> benchmark comparisons in the pghackers archives a couple months back.
> 
>                         regards, tom lane

If it can predict.

I guess the question was asked because one expects performance problems.
From that I conclude that the amount of data is significant in this
particular case. That does not necessarily but usually mean a large
number of unique groups.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #