LIANHE SHAO <lshao2@jhmi.edu> writes:
> PGA=> explain select ei.expid, er.geneid,
> er.sampleid, ei.annotation, si.samplename,
> ei.title as exp_name, aaa.chip,
> aaa.sequence_derived_from as accession_number,
> aaa.gene_symbol, aaa.title as gene_function,
> er.exprs, er.mas5exprs from expressiondata er,
> experimentinfo ei, sampleinfo si,
> affy_array_annotation aaa where exists (select
> distinct ei.expid from experimentinfo) and
> ei.annotation = aaa.chip and (lower (aaa.title)
> like '%mif%' or lower(aaa.sequence_description) like
> '%mif%') and exists (select distinct ei.annotation
> from experimentinfo) and ei.expid = er.expid and
> er.expid = si.expid and er.sampleid = si.sampleid
> and er.geneid = aaa.probeset_id order by si.sampleid
> limit 20;
What is the purpose of the EXISTS() clauses? They are almost surely not
doing what you intended, because AFAICS they are just an extremely
expensive means of producing a constant-TRUE result. In
exists (select distinct ei.expid from experimentinfo)
"ei.expid" is an outer reference, which will necessarily be the same
value over all rows of the sub-select. After computing this same value
for every row of experimentinfo, the system performs a DISTINCT
operation (sort + unique, not cheap) ... and then all it checks for is
whether at least one row was produced, which means the DISTINCT
operation was completely unnecessary. The only way the EXISTS could
return false is if experimentinfo were empty, but if it were so then the
outer FROM would've produced no rows and we'd not have got to WHERE
anyway.
I'm not sure why you get a worse plan for the simpler variant of the
query; it would help to see EXPLAIN ANALYZE rather than EXPLAIN output.
But it's not worth trying to improve the performance until you are
calculating correct answers, and I suspect the above is not doing
what you are after at all.
regards, tom lane