Thread: Re: count of occurrences PLUS optimisation

Re: count of occurrences PLUS optimisation

From: "Thurstan R. McDougle"
HACKERS: see the end of this message about a possible optimisation for
ORDER BY+LIMIT cases (the normal use of LIMIT?)

Adam wrote:
>
> I help run a job database and have a table of search records.  I want
> a query that will return the top 10 jobs by search frequency.  I'm
> familiar with ORDER BY and LIMIT, so I basically need this:
>
> Given a table search_records:
> job_num
> -------
> 1
> 2
> 2
> 3
> 4
> 4
> 4
>
> I want a query that will return:
> job_num | count
> --------+------
> 1       |1
> 2       |2
> 3       |1
> 4       |3
>
> I tried
>
> select distinct job_num, (select count(*) from search_records j where
> j.job_num=k.job_num) from search_records k
>
> but it is horribly slow (it takes several minutes on a table of about
> 25k rows!).  I assume it scans the entire table for every job_num in
> order to count the number of occurrences of that job_num, taking order
> n^2 time.  Since I can easily use job_num as an index (being integers
> from 0 to roughly 400 so far) I could just do a "select * from
> search_records" and do the counting in PHP (our HTML pre-processor) in
> order n time.  However, I don't know how to do an order n*log(n) sort
> in PHP, just n^2, so there would still be an efficiency problem.
> I have Postgresql 7.0.3.
> Help is of course greatly appreciated.

I have not tried it but how about:-

select job_num from
(select job_num, count(*) as c from search_records group by job_num) as t
order by c desc limit 10;

I am not sure if count(*) would work in this context; if not, try counting
some column that is present in every record, e.g. count(job_num).  Note
that the sub-select in FROM needs an alias (the "as t" above).


If you can be sure that the top 10 will each have at least a certain
threshold of searches (perhaps >1!) then it MIGHT be faster (experiment),
because less data has to be sorted by the outer select's ORDER BY, to do:-

select job_num from
(select job_num, count(*) as c from search_records group by job_num
HAVING count(*) > 1) as t
order by c desc limit 10;

It would depend on how efficiently the ORDER BY and LIMIT work together.
(The ORDER BY could build a list of just LIMIT n items and replace items
in that list as better ones arrive... a lot more efficient in both memory
and comparisons than building the full list and then keeping the top n.)
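
Roughly what I mean, as an illustrative C sketch (made-up names, not
PostgreSQL's actual sort code): a min-heap of only n entries holds the n
best keys seen so far, so most rows cost a single comparison against the
current minimum, and a pass over T rows costs at worst about T log n
comparisons (usually far fewer) instead of the T log T of a full sort,
with only n tuples ever held in memory.

typedef struct
{
    int     key;        /* e.g. the count we are ordering by */
    int     job_num;
} Item;

static void
sift_down(Item *heap, int n, int i)
{
    for (;;)
    {
        int     smallest = i;
        int     l = 2 * i + 1;
        int     r = 2 * i + 2;
        Item    tmp;

        if (l < n && heap[l].key < heap[smallest].key)
            smallest = l;
        if (r < n && heap[r].key < heap[smallest].key)
            smallest = r;
        if (smallest == i)
            break;
        tmp = heap[i];
        heap[i] = heap[smallest];
        heap[smallest] = tmp;
        i = smallest;
    }
}

/* Keep the n items with the largest keys out of a stream of "total" items. */
static int
top_n(const Item *in, int total, Item *heap, int n)
{
    int     used = 0;
    int     i, j;

    for (i = 0; i < total; i++)
    {
        if (used < n)
        {
            heap[used++] = in[i];
            if (used == n)              /* heapify once the buffer is full */
                for (j = n / 2 - 1; j >= 0; j--)
                    sift_down(heap, n, j);
        }
        else if (in[i].key > heap[0].key)   /* beats the current minimum? */
        {
            heap[0] = in[i];
            sift_down(heap, n, 0);
        }
    }
    return used;                        /* may be < n for small inputs */
}

The surviving n entries still have to be sorted at the end to come out in
ORDER BY order, but sorting n items is trivial compared with sorting the
whole input.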

HACKERS: If it does not do this already, it might be a useful optimisation.
There would probably need to be a cutoff on whether to apply this
method or to sort and keep n.  Also, for LIMIT plus OFFSET it would need to
keep a list as long as the total of the LIMIT and OFFSET figures.

--
This is the identity that I use for NewsGroups. Email to
this will just sit there. If you wish to email me replace
the domain with knightpiesold . co . uk (no spaces).

Re: count of occurrences PLUS optimisation

From: Martijn van Oosterhout
On Thu, Sep 13, 2001 at 01:03:54PM +0100, Thurstan R. McDougle wrote:
> It would depend on how efficient the ORDER BY and LIMIT work together.
> (The ORDER BY could build a list of LIMIT n items and just replace items
> in that list...a lot more efficient both of memory and comparisons than
> building the full list and then keeping the top n)

There is some of this already. In the output of EXPLAIN you see two numbers.
The first is the estimated cost to get the first tuple, the second is the
cost to get all the tuples.

When LIMIT is applied, the estimated total cost is scaled down based on the
number of rows requested. So with a small LIMIT the planner will favour
plans that produce tuples early, even if their total cost would be larger.
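
Roughly the arithmetic involved (an illustrative sketch only, not the
planner's actual code), for a LIMIT of "count" rows out of an estimated
"plan_rows":

double
limited_cost(double startup_cost, double total_cost,
             double plan_rows, double count)
{
    /* Scale the sub-plan's cost linearly between the cost of producing
     * its first tuple and the cost of producing all of them. */
    if (plan_rows <= 0 || count >= plan_rows)
        return total_cost;
    return startup_cost + (count / plan_rows) * (total_cost - startup_cost);
}

So a plan with a cheap startup cost wins when the LIMIT is small, even if
its total cost is larger.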

> HACKERS: If it does not do this it might be a useful optimisation.
> There would probably need to be a cutoff limit on whether to apply this
> method or sort and keep n.  Also for LIMIT plus OFFSET it would need to
> build a list of the total of the LIMIT and OFFSET figures.

The problem is that it sometimes doesn't help as much as you'd expect. If
you see a Sort stage in the plan, that means that everything below that has
to be completely computed.

The only solution is to use a sorted index to avoid the sort step, if
possible.

HTH,
--
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
> Magnetism, electricity and motion are like a three-for-two special offer:
> if you have two of them, the third one comes free.

Re: count of occurrences PLUS optimisation

From: "Thurstan R. McDougle"
Martijn van Oosterhout wrote:
>
> On Thu, Sep 13, 2001 at 01:03:54PM +0100, Thurstan R. McDougle wrote:
> > It would depend on how efficient the ORDER BY and LIMIT work together.
> > (The ORDER BY could build a list of LIMIT n items and just replace items
> > in that list...a lot more efficient both of memory and comparisons than
> > building the full list and then keeping the top n)
>
> There is some of this already. In the output of EXPLAIN you see two numbers.
> The first is the estimated time to get the first tuple, the second is to get
> all the tuples.
>
> When LIMIT is applied, the estimated total cost is adjusted based on the
> number of rows. So with a small number of tuples the planner will favour
> plans that get tuples early even if the total cost would be larger.
>
> > HACKERS: If it does not do this it might be a useful optimisation.
> > There would probably need to be a cutoff limit on whether to apply this
> > method or sort and keep n.  Also for LIMIT plus OFFSET it would need to
> > build a list of the total of the LIMIT and OFFSET figures.
>
> The problem is that it sometimes doesn't help as much as you'd expect. If
> you see a Sort stage in the plan, that means that everything below that has
> to be completely calculated.
>
> The only solution is to use a sorted index to avoid the sort step, if
> possible.


What I am talking about is that WHEN the sort is required we could make
the sort more efficient, as inserting into a SHORT ordered list should be
better than building a BIG list, sorting it, and then only keeping a
small part of it.

In the example in question there would be perhaps 400 records, but only
10 are needed.  From the questions on these lists it seems quite common
for only a very low proportion of the records to be required (less than
10%, up to 100 rows, typically); in these cases it would seem to be a
useful optimisation.


>
> HTH,
> --
> Martijn van Oosterhout <kleptog@svana.org>
> http://svana.org/kleptog/
> > Magnetism, electricity and motion are like a three-for-two special offer:
> > if you have two of them, the third one comes free.

--
This is the identity that I use for NewsGroups. Email to
this will just sit there. If you wish to email me replace
the domain with knightpiesold . co . uk (no spaces).

Re: count of occurrences PLUS optimisation

From: Martijn van Oosterhout
On Thu, Sep 13, 2001 at 05:38:56PM +0100, Thurstan R. McDougle wrote:
> What I am talking about is WHEN the sort is required we could make the
> sort more efficient as inserting into a SHORT ordered list should be
> better than building a BIG list and sorting it, then only keeping a
> small part of the list.

For a plain SORT, it would be possible. Anything to avoid materialising the
entire table in memory. Unfortunately it won't help if there is a GROUP
afterwards, because the group can't really know when to stop.

But yes, if you had LIMIT<SORT<...>> you could do that. I can't imagine it
would be too hard to arrange.

> In the example in question there would be perhaps 400 records, but only
> 10 are needed.  From the questions on these lists it seems quite common
> for only a very low proportion of the records to be required (less than
> 10%, up to 100 typically), in these cases it would seem to be a useful
> optimisation.

Say you have a query:

select id, count(*) from test group by id order by count desc limit 10;

This becomes:

LIMIT < SORT < GROUP < SORT < test > > > >

The inner sort would still have to scan the whole table, unless you have an
index on id. In that case your optimisation would be cool.

Have I got it right now?

--
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
> Magnetism, electricity and motion are like a three-for-two special offer:
> if you have two of them, the third one comes free.

Re: count of occurrences PLUS optimisation

From: "Thurstan R. McDougle"
Sorry about the size of this message! It covers several optimisation
areas.

Yes, we are talking about a limited situation, an ORDER BY (that does not
match the GROUP BY order) plus a LIMIT, but one that is easy to identify.

It also has the advantage that the number to be LIMITed will, 9 times out
of 10, be known at query plan time (as LIMIT seems to mostly be used with
a constant), making it an optimisation that rarely needs to estimate.

It could even be tested for at the query run stage rather than the query
plan stage in those cases where the limit is not known in advance,
although that would make the EXPLAIN output less accurate.  Probably for
planning we should just assume that if a LIMIT is present it is likely to
be for a smallish number.  The planner currently estimates that 10% of
the tuples will be returned in these cases.

The level up to which building a shorter list is better than a sort and
keep/discard should be evaluated.  It would perhaps depend on what
proportion the LIMIT is of the estimated set returned by the GROUP BY.
One point to note is that, IIRC, an ordered list's efficiency drops
faster than that of most decent sorts once the data must be paged
to/from disk.

For larger LIMITs we should still get some benefit if we get the first
LIMIT items, sort just them, and compare each new item against the
lowest item in this list.  Maybe form a batch of new items, then merge
the batch in to produce a new sorted list LIMIT n long, and repeat.
One major advantage of this is that while the new batch is being fetched
we no longer need to keep the existing list in RAM.  I should think that
each new batch should be no larger than we can fit in RAM, or than the
amount of RAM that best allows an efficient list merge phase.

We could kick into this second mode when the LIMIT exceeds the cutoff or
the available RAM.
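
As a rough illustration (made-up C, not real PostgreSQL code), the merge
step could look something like this, with the caller collecting into
"batch" only the keys that beat top[n-1]:

#include <stdlib.h>
#include <string.h>

static int
cmp_desc(const void *a, const void *b)
{
    int     x = *(const int *) a;
    int     y = *(const int *) b;

    return (x < y) ? 1 : ((x > y) ? -1 : 0);
}

/* "top" holds the best n keys seen so far, sorted descending; "batch"
 * holds batch_len candidates that each beat top[n-1].  Sort the batch,
 * merge it with "top", and keep only the best n of the union, so the
 * full input never has to be held or sorted at once. */
static void
merge_batch(int *top, int n, int *batch, int batch_len)
{
    int    *merged = malloc((n + batch_len) * sizeof(int));
    int     i = 0, j = 0, k = 0;

    qsort(batch, batch_len, sizeof(int), cmp_desc);
    while (k < n + batch_len)
    {
        if (j >= batch_len || (i < n && top[i] >= batch[j]))
            merged[k++] = top[i++];
        else
            merged[k++] = batch[j++];
    }
    memcpy(top, merged, n * sizeof(int));       /* keep only the best n */
    free(merged);
}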

I have just noticed while looking through the planner and executor that
the 'unique' node (SELECT DISTINCT [ON]) comes between the sort and
limit nodes and is run separately.  Would it not be more efficient, in
the normal case of distinct on ORDER BY order (or the start of the ORDER
BY), for uniqueness to be handled within the sorting, as these routines
are already comparing the tuples?  Also, if the unique node is separate
then it makes the merging of sort and limit impossible if DISTINCT is
present.
However there is still the case of distinct where a sort is not
requested, not needed (index scan instead?), or not suitable for the
distinct, so a separate distinct node executor is still required.

Taking all these into account, it seems that quite a lot of code would
need changing to implement this optimisation.  Specifically the SORT,
UNIQUE and LIMIT nodes (and their planners) and the sort utils would
need a separate variant, and the current nodes would need altering.
It is a pity, as I would expect fairly large benefits in those cases of
LIMITing to a small subset of a large dataset.

Martijn van Oosterhout wrote:
>
> On Thu, Sep 13, 2001 at 05:38:56PM +0100, Thurstan R. McDougle wrote:
> > What I am talking about is WHEN the sort is required we could make the
> > sort more efficient as inserting into a SHORT ordered list should be
> > better than building a BIG list and sorting it, then only keeping a
> > small part of the list.
>
> For a plain SORT, it would be possible. Anything to avoid materialising the
> entire table in memory. Unfortunately it won't help if there is a GROUP
> afterwards because the group can't really know when to stop.
>
> But yes, if you had LIMIT<SORT<...>> you could do that. I can't imagine it
> would be too hard to arrange.
>
> > In the example in question there would be perhaps 400 records, but only
> > 10 are needed.  From the questions on these lists it seems quite common
> > for only a very low proportion of the records to be required (less than
> > 10%, up to 100 typically), in these cases it would seem to be a useful
> > optimisation.
>
> Say you have a query:
>
> select id, count(*) from test group by id order by count desc limit 10;
>
> This becomes:
>
> LIMIT < SORT < GROUP < SORT < test > > > >
>
> The inner sort would still have to scan the whole table, unless you have an
> index on id. In that case your optimisation would be cool.
>
> Have I got it right now?
>
> --
> Martijn van Oosterhout <kleptog@svana.org>
> http://svana.org/kleptog/
> > Magnetism, electricity and motion are like a three-for-two special offer:
> > if you have two of them, the third one comes free.

--
This is the identity that I use for NewsGroups. Email to
this will just sit there. If you wish to email me replace
the domain with knightpiesold . co . uk (no spaces).