Re: Joining two large tables on a tiny subset of rows - Mailing list pgsql-sql

From Richard Huxton
Subject Re: Joining two large tables on a tiny subset of rows
Date
Msg-id 42E7326C.5060403@archonet.com
Whole thread Raw
In response to Joining two large tables on a tiny subset of rows  ("Dmitri Bichko" <dbichko@aveopharma.com>)
List pgsql-sql
Dmitri Bichko wrote:
> Hello,
> 
> I have two tables, one has a foreing key from the other (only showing
> the relevant columns and indices here):
> 
>              Table "expresso.probes"
>    Column    |          Type          | Modifiers
> -------------+------------------------+-----------
>  platform_id | integer                | not null
>  probe_num   | integer                | not null
>  mrna_acc    | character varying(50)  |
> Indexes:
>     "idx_probes_mrna_acc" btree (mrna_acc, platform_id) WHERE mrna_acc
> IS NOT NULL
> 
>            Table "expresso.mrna_info"
>    Column    |         Type          | Modifiers
> -------------+-----------------------+-----------
>  mrna_acc    | character varying(25) | not null
>  symbol      | character varying(50) | not null
> Indexes:
>     "idx_mrna_info_symbol" btree (upper(symbol::text)) WHERE symbol IS
> NOT NULL

1. The two types of mrna_acc don't match - one has a max length of 25, 
one 50. Why?
2. With idx_probes_mrna_acc, why WHERE mrna_acc IS NOT NULL? NULLs 
aren't indexed anyway.
3. You say there is a foreign key, but I don't even see a primary key 
anywhere. I'm guessing mrna_info.mrna_acc is the primary key for that table.

> Here are the explains for the two step process:
> 
> expression=> EXPLAIN ANALYZE SELECT mrna_acc FROM mrna_info WHERE
> upper(symbol) = upper('pparg') AND symbol IS NOT NULL;

As someone else mentions, the IS NOT NULL is redundant.

>  Index Scan using idx_mrna_info_symbol on mrna_info  (cost=0.00..2934.78
> rows=930 width=12) (actual time=0.038..0.089 rows=12 loops=1)

Note that the estimated number of rows is wrong though (930 rather than 
the actual 12).

> EXPLAIN ANALYZE SELECT platform_id, probe_num FROM probes WHERE mrna_acc
> IN
> ('U10374','U09138','U01841','U01664','NM_015869','NM_013124','NM_011146'
> ,'NM_005037','D83233','BC021798','BC006811','AB011365') AND mrna_acc IS
> NOT NULL;

Again, a redundant IS NOT NULL, which presumably you're putting in to 
use the index.

>  Index Scan using idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc on probes  (cost=0.00..14710.63 rows=4151 width=8)
> (actual time=0.040..0.719 rows=142 loops=1)

Again, it's getting the row estimate badly wrong (4151 vs 142).

> And here is the explain of the join (it's essentially the same plan as
> the subselect and all the other ways I've tried):
> 
> expression=> explain SELECT platform_id, probe_num FROM mrna_info m,
> probes p WHERE m.mrna_acc = p.mrna_acc and p.mrna_acc is not null and
> UPPER(symbol) = UPPER('pparg') AND m.symbol IS NOT NULL;

Well, firstly get rid of the redundant "IS NOT NULL"s in the query and 
the indexes, then vacuum analyse the tables and post an EXPLAIN ANALYSE.

The problem will probably turn out to be poor row estimates (you can 
increase the statistics gathered on the mrna_acc values) or poor 
configuration settings (making indexes look expensive compared to 
sequential scans).

--  Richard Huxton  Archonet Ltd


pgsql-sql by date:

Previous
From: Achilleus Mantzios
Date:
Subject: Re: REINDEX DATABASE
Next
From: Chris Browne
Date:
Subject: Re: REINDEX DATABASE