Re: Digesting explain analyze - Mailing list pgsql-performance

From Jesper Krogh
Subject Re: Digesting explain analyze
Msg-id 4B457AD4.8090405@krogh.cc
In response to Re: Digesting explain analyze  (Ron Mayer <rm_pg@cheapcomplexdevices.com>)
List pgsql-performance
Ron Mayer wrote:
>> ...The inner sets are on average 3.000 for
>> both id1 and id2 and a typical limit would be 100, so if I could convince
>> postgresql to not fetch all of them then I would reduce the set retrieved
>> by around 60. The dataset is quite large so the random query is not very
>> likely to be hitting the same part of the dataset again, so there is going
>> to be a fair amount of going to disk.
>
> If disk seeks are killing you a kinda crazy idea would be to
> duplicate the table - clustering one by (id1) and
> the other one by an index on (id2) and unioning the
> results of each.

That doubles the disk space needed for the table. What are the odds that
this would help when the initial table by itself already significantly
exceeds available memory?

> Since each of these duplicates of the table will be clustered
> by the column you're querying it on, it should just take one
> seek in each table.
>
> Then your query could be something like
>
>   select * from (
>     select * from t1 where id1=2067 order by evalue limit 100
>     union
>     select * from t2 where id2=2067 order by evalue limit 100
>   ) as foo order by evalue limit 100;

This is actually what I ended up with as the best-performing query, just
still on a single table, because without duplication I can add indexes on
(id1,evalue) and (id2,evalue) and optimize for both cases. It is still
getting killed quite a lot by disk I/O. So I guess I'm up to:
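For reference, the single-table variant I ended up with is essentially the
following (table and index names are just what I'm using here):

  create index t_id1_evalue_idx on t (id1, evalue);
  create index t_id2_evalue_idx on t (id2, evalue);

  select * from (
    select * from t where id1 = 2067 order by evalue limit 100
    union
    select * from t where id2 = 2067 order by evalue limit 100
  ) as foo order by evalue limit 100;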

1) Buy better disks (I need to get an estimate of how large the dataset
is actually going to get).
2) Stick with one table, but make sure there is enough activity to keep a
large part of the index in the OS cache anyway (and add more memory if
necessary).

The data is seeing a fair amount of growth (it doubles in a couple of
years), so I would suspect it is fairly hard to maintain clustering on it.

Is it possible to get PG to tell me how many rows fit in a disk page?
All columns are sitting in "plain" storage according to \d+ on the table.
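(I can get a rough estimate myself from the planner statistics, assuming
the table has been analyzed recently and relpages is non-zero:

  select relpages, reltuples::bigint,
         (reltuples / relpages)::int as rows_per_page
  from pg_class
  where relname = 'my_table';

but I don't know how accurate those statistics are for this table.)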

> Hmm..  and I wonder if putting evalue into the criteria to cluster
> the tables too (i.e. cluster on id1,evalue) if you could make it
> so the limit finds the right 100 evalues first for each table....

I didn't cluster it, since CLUSTER locks everything.
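(That is, something like

  cluster my_table using my_table_id1_evalue_idx;

-- hypothetical table and index names -- takes an ACCESS EXCLUSIVE lock
on the table for the whole run, which I can't afford here.)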

--
Jesper
