Hello,
we have a simple query here that uses a table such as Foo(id*, bar, baz)
were id is a primary key and then we have an additional index on
Foo(bar, baz). Now, we have a simple query such as this:
SELECT bar, baz FROM Foo WHERE bar='HIT';
The Foo table has about 25 million rows and the above query selects 35000
rows.
It takes 10 minutes, that's the problem.
Yes, it's doing an index scan according to EXPLAIN, and yes VACUUM
ANALYZE is done, the machine is decent (dual 1 GHz i*86 processor,
2 GB RAM doing software RAID 0-1 under Linux.) The shared memory
buffer is 512 MB and there is no significant swapping activity.
The index and the table are on the same spindle, yes, but that's
nothing I can change right now.
I found that this above query runs in about 8 second if it's a rerun.
Presumably our large cache then contains most of the index if not also
the data rows. Apparently I am stuck in an I/O bottleneck.
So, what can I do? Again I can't touch the hardware for now. Besides,
we are running this same query on Oracle (on an already aging Alpha
mini) and it consistently runs in 17 seconds. May be there is also
caching involved, but there is something Oracle does better about it.
Two things come into mind:
- Oracle is content with data read from an index if that is all that's needed. Indeed, I only need bar and baz and
thoseare in the index. The benefits of not having to seek the scattered rows from the data table saves 35000 back and
forthhead movements / seeks.
- Even if we had to go to the data rows, could it be that if a good big chunk of the index would be read ahead and
bufferedin one piece and then iterated over to retrieve the data rows without making the r/w head jump back and forth
toand from the index would be good.
Is PostgreSQL doing this index read-ahead? Is there a tuning variable I failed to set to make it read-ahead more?
I presume that PostgreSQL does good reading-ahead of the index, right?
But I heard it could not do index-only reads. Now, since I have several
of these problems (each time with different indexes) and our Oracle side
of the project uses index-only reads a lot, I would be very eager to have
index-only reads in PostgreSQL.
Naively it doesn't sound like a hard problem to implement in PostgreSQL,
does it? All we would need to do is get the data off the index that we
already read, and not do the row lookup. Probably the optimizer has
to make different plans here, so the main work will probably have to be
in the optimizer code. And in addition it might happen that some lower
level constraint exists in the executor who need to go to the rows for
some reason. In the worst case the index does not contain all data but
only just enough of a portion that distinguishes rows. That would then
require to rewrite all index code to have complete data in indexes.
But I'm only naively conjecturing.
So, please advise, what are the real barriers to this kind of
optimization? Is there anything I could do to help the progress on
this issue. (I'm generally capable of hacking certain features into
existing code if there isn't some subtle logic involved that I do not
understand, because it isn't documented.)
I appreciate your help,
thank you,
-Gunther
--
Gunther Schadow, M.D., Ph.D. gschadow@regenstrief.org
Medical Information Scientist Regenstrief Institute for Health Care
Adjunct Assistant Professor Indiana University School of Medicine
tel:1(317)630-7960 http://aurora.regenstrief.org