Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans - Mailing list pgsql-general

From Victor Blomqvist
Subject Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans
Date
Msg-id CAL870DWRn5-i1gFfzF5y-K7z-6e+PfRgSwm5_-3tOu2+NNbv9w@mail.gmail.com
In response to Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans  (Francisco Olarte <folarte@peoplecall.com>)
Responses Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans  (Francisco Olarte <folarte@peoplecall.com>)
List pgsql-general


On Fri, Aug 19, 2016 at 6:01 PM, Francisco Olarte <folarte@peoplecall.com> wrote:
Hi Victor:

On Fri, Aug 19, 2016 at 7:06 AM, Victor Blomqvist <vb@viblo.se> wrote:
> Is it possible to break/limit a query so that it returns whatever results it
> has found after having checked X rows in an index scan?
>
> For example:
> create table a(id int primary key);
> insert into a select * from generate_series(1,100000);
>
> select * from a
> where id % 2 = 0
> order by id limit 10
>
> In this case the query will "visit" 20 rows and filter out 10 of them. We
> can see that in the query plan:
> "Rows Removed by Filter: 10"
> "Heap Fetches: 20"
>
> Is it somehow possible to limit this query so that it only fetches X rows?
> In my example, if we limited it to 10 Heap Fetches, the query could return
> the first 5 rows.
>
>
> My use case is I have a table with 35 million rows with a geo index, and I
> want to do a KNN search but also limit the query on some other parameters.
> In some cases the other parameters restrict the query so much that Heap
> Fetches climbs to several hundred thousand or more, and in those cases I
> would like to put a limit on my query.

Well, if you accept more abstract limits (i.e. you do not depend on
heap fetches, you just want up to 5 even IDs from the first 10 IDs)
you could try:

with base as (select * from a order by id limit 10)
select * from base where id % 2 = 0 order by id limit 5;

(Or do it with a subquery instead of a CTE.)
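
[Editor's note: for reference, the subquery form Francisco mentions would
look like this -- a sketch equivalent to the CTE above, using the same
table a:

select * from (
    select * from a order by id limit 10
) base
where id % 2 = 0
order by id limit 5;

At the time of this thread (PostgreSQL 9.x) a CTE was always an
optimization fence, so the two forms could plan differently; the LIMIT in
the inner query is what caps the rows considered in both.]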

In general

select * from table where common_condition and filter_condition order
by xx limit N

becomes

with base as (select * from table where common_condition order by xx
limit base_fetches)
select * from base where filter_condition order by xx limit N;

In the example the common_condition is nonexistent, so put it in as true
and optimize after transforming.
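
[Editor's note: as a concrete illustration, here is how that transform
might look applied to the KNN use case from the original question. This is
a sketch only: the table places, the geometry column geom, the age filter,
and the 100000 cap are all assumptions, not from the thread; <-> is the
PostGIS KNN distance operator.

with base as (
    select * from places
    order by geom <-> st_makepoint(:lon, :lat)  -- KNN ordering via the geo index
    limit 100000                                -- base_fetches: cap on candidate rows
)
select * from base
where age between :min_age and :max_age         -- the "other parameters"
order by geom <-> st_makepoint(:lon, :lat)
limit 10;]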

Francisco Olarte.

Sorry if that wasn't clear in my first email. This whole problem is part of a
100-line function, so I tried to simplify it as much as possible for my
question, but maybe I went a bit too far. I will try again:

The problem with this approach is that I "waste" rows in those cases where the
query doesn't need to "loop" 10 times to fetch my result. Let me show a
slightly more complicated example:

create table b(id serial primary key, age_preference int);
insert into b (age_preference)
  select (random()*100)::integer from generate_series(1,1000000);

select * from b
where age_preference % 10 = x
order by id limit 10


So here I have a table with id and age_preference, where age_preference is the
preferred age of users in my database. The distribution of age preferences is
not uniform in real life (only in my mock data here). It's also changing over
time, so I can't know it beforehand, only make some very rough estimations.

What I do know is that for most values of x it's enough to visit the first 1k
rows. However, sometimes I need more rows, maybe 100k. And for some special x
(such as 11 in my stupid example), it's not enough even with 1M fetches.

What I want to avoid is my query visiting the whole 1M rows to get a result,
because on my real table that can take 100 seconds. At the same time I want
the queries that only need to visit 1k rows to finish quickly, and the queries
that visit 100k rows to at least get some result back.
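
[Editor's note: one way to approximate this behavior -- a sketch, not
something proposed in the thread -- is a PL/pgSQL function that widens the
candidate window in stages, so cheap values of x stop after the first pass
while expensive ones are capped instead of scanning all 1M rows. The
function name and the step sizes are assumptions; table b is the mock
table from above.

create or replace function find_matches(x int, want int default 10)
returns setof b
language plpgsql stable as $$
declare
    cap int;
    got int;
begin
    foreach cap in array array[1000, 100000, 1000000] loop  -- assumed steps
        -- Probe: are there at least `want` matches within the first `cap` rows?
        select count(*) into got
        from (
            select 1
            from (select * from b order by id limit cap) base
            where base.age_preference % 10 = x
            limit want              -- stop counting once we have enough
        ) hits;
        exit when got >= want;      -- this window is wide enough
    end loop;
    -- cap keeps its last value if no window sufficed, so at worst we return
    -- partial results from the widest window instead of scanning everything.
    return query
        select *
        from (select * from b order by id limit cap) base
        where base.age_preference % 10 = x
        order by id
        limit want;
end;
$$;

Called as select * from find_matches(3); -- the probe pass does read the
matching rows twice, but each pass is bounded by cap, so a value of x that
matches early never pays for the full scan.]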


Thank you for your patience so far!
/Victor
