Ron Mayer wrote:
>> ...The inner sets are on average 3.000 for
>> both id1 and id2 and a typical limit would be 100, so if I could convince
>> postgresql to not fetch all of them then I would reduce the set retrieved
>> by around 60. The dataset is quite large so the random query is not very
>> likely to be hitting the same part of the dataset again, so there is going
>> to be a fair amount of going to disk.,
>
> If disk seeks are killing you a kinda crazy idea would be to
> duplicate the table - clustering one by (id1) and
> the other one by an index on (id2) and unioning the
> results of each.
That's doubling the disk space needs for the table. Is there any odds
that this would benefit when the intitial table significantly exceeds
available memory by itself?
> Since each of these duplicates of the table will be clustered
> by the column you're querying it on, it should just take one
> seek in each table.
>
> Then your query could be something like
>
> select * from (
> select * from t1 where id1=2067 order by evalue limit 100
> union
> select * from t2 where id2=2067 order by evalue limit 100
> ) as foo order by evalue limit 100;
This is actually what I ended up with as the best performing query, just
still on a single table, because without duplication I can add index and
optimize this one by (id1,evalue) and (id2,evalue). It is still getting
killed quite a lot by disk IO. So I guess I'm up to:
1) By better disk (I need to get an estimate how large it actually is
going to get).
2) Stick with one table, but make sure to have enough activity to get a
large part of the index in the OS-cache anyway. (and add more memory if
nessesary).
The data is seeing a fair amount of growth (doubles in a couple of years
) so it is fairly hard to maintain clustering on them .. I would suspect.
Is it possible to get PG to tell me, how many rows that fits in a
disk-page. All columns are sitting in "plain" storage according to \d+
on the table.
> Hmm.. and I wonder if putting evalue into the criteria to cluster
> the tables too (i.e. cluster on id1,evalue) if you could make it
> so the limit finds the right 100 evalues first for each table....
I didnt cluster it, since clustering "locks everything".
--
Jesper