[GENERAL] query not scaling - Mailing list pgsql-general

From Rob Sargent
Subject [GENERAL] query not scaling
Date
Msg-id 6f7d4fd4-aed2-2fae-f551-d0f67196da33@gmail.com
Whole thread Raw
Responses Re: [GENERAL] query not scaling
List pgsql-general
I have a query I cannot tame and I'm wondering if there's an alternative
to  the "between"  clause I'm  using.  Perhaps  a custom  type could  do
better?  I've  tried the "<@" orperator and that changes the  query plan
significantly but the execution cost/time is not improved.

Any suggestion or pointers much appreciated.

Environment: Using  a virtual  CentOS Linux  release 7.4.1708  (Core), 4
cores (2.3GHz),  8G RAM and  postgres 10.0(beta3) shared_buffers  = 1GB,
work_mem = 2GB

Domain: (TL/DR)  A "segment" is  defined by  a particular set  of people
(probandset.id)  plus a  subset of  markers (markerset.id,  startmarker,
endmarker).   I need  the minimum  p-value for  each marker  in the  set
across all segments matching the set and a specific set of poeple.  So a
given segment says  "I cover all the markers from  startbase to endbase"
and each marker has a specific base position (relative to a chromosome).
I'm  after the  smallest  p-value  for each  marker  across  the set  of
segments which include that marker (from the 'between' clause).

Context:  I have  the query  in a  function so  the ids  of the  all the
players are available to the following sql:
   select m.id as mkrid          , min(pv(s.events_less, s.events_equal, s.events_greater, 0)) as optval                                                                from marker m join segment s on m.basepos between s.startbase and s.endbase        and m.chrom = 1        and s.chrom = 1        and s.markerset_id = suppliedMarkersetId   join probandset r on s.probandset_id = r.id        and r.people_id =  suppliedPeopleId   group by m.id

where the pv function is
   create or replace function pv(l bigint, e bigint, g bigint, o int)    returns numeric   as    $$   select 1.0*(g+e+o)/(l+e+g+o);   $$   language sql   ;


I have the identical schema in  two databases (same pg instance) and the
tables definitions  involved are  below.  In one  schema there  are 1.7M
records  in segment  and in  the other  there is  40M rows.   The marker
tables are much more similar with 600K and 900K respectively.  The third
table, probandset, has 60 and 600 respectively. On average 0.8M and 1.8M
segments per markerset_id.

The explains: (fast  (12sec), then slow(hours)).  The  part which sticks
out  to  me  is  where  the "between"  gets  used.   (I'm  betting  that
probandset is  too small to  matter.)  The  slower explain plan  is very
similar to  what I  saw originally in  the now "fast"  data set  and the
current indexing stategy comes largely from that performance work.

It looks like I'm getting a Cartesian between the number of markers in a
set and the number of segments found: ten zeros at least.                                                                 QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------HashAggregate  (cost=291472.27..292040.58 rows=56831 width=48)  Group Key: m.id  ->  Nested Loop  (cost=3752.33..167295.52 rows=4515518 width=40)        ->  Nested Loop  (cost=3751.90..17906.25 rows=715 width=32)              ->  Seq Scan on probandset r  (cost=0.00..2.77 rows=4 width=16)                    Filter: (people_id = '4e3b9829-43a8-4f84-9df6-f120dc5b1a7e'::uuid)              ->  Bitmap Heap Scan on segment s  (cost=3751.90..4473.96 rows=191 width=48)                    Recheck Cond: ((probandset_id = r.id) AND (chrom = 1) AND (markerset_id = '61a7e5cb-b81d-42e4-9e07-6bd9c2fbe6d1'::uuid))                    ->  BitmapAnd  (cost=3751.90..3751.90 rows=191 width=0)                          ->  Bitmap Index Scan on useg  (cost=0.00..72.61 rows=2418 width=0)                                Index Cond: ((probandset_id = r.id) AND (chrom = 1))                          ->  Bitmap Index Scan on segment_markerset_id_idx  (cost=0.00..3676.23 rows=140240 width=0)                                Index Cond: (markerset_id = '61a7e5cb-b81d-42e4-9e07-6bd9c2fbe6d1'::uuid)        ->  Index Scan using marker_chrom_basepos_idx on marker m  (cost=0.42..145.79 rows=6315 width=20)              Index Cond: ((chrom = 1) AND (basepos >= s.startbase) AND (basepos <= s.endbase))
(15 rows)

                                                        QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------HashAggregate  (cost=83131331.81..83132151.44 rows=81963 width=48)  Group Key: m.id  ->  Nested Loop  (cost=1907.38..70802659.35 rows=448315362 width=40)        Join Filter: ((m.basepos >= s.startbase) AND (m.basepos <= s.endbase))        ->  Bitmap Heap Scan on marker m  (cost=1883.64..11009.18 rows=81963 width=20)              Recheck Cond: (chrom = 1)              ->  Bitmap Index Scan on marker_chrom_basepos_idx  (cost=0.00..1863.15 rows=81963 width=0)                    Index Cond: (chrom = 1)        ->  Materialize  (cost=23.74..181468.38 rows=49228 width=32)              ->  Hash Join  (cost=23.74..181222.24 rows=49228 width=32)                    Hash Cond: (s.probandset_id = r.id)                    ->  Index Scan using segment_markerset_id_idx on segment s  (cost=0.56..178022.70 rows=251881 width=48)                          Index Cond: (markerset_id = 'afad387e-a8e7-4c3b-9adb-3f00e70a13b3'::uuid)                          Filter: (chrom = 1)                    ->  Hash  (cost=21.68..21.68 rows=120 width=16)                          ->  Seq Scan on probandset r  (cost=0.00..21.68 rows=120 width=16)                                Filter: (people_id = 'b124acb2-e400-4e63-8010-5a5f25a78f1c'::uuid)
(17 rows)

            Table "base.marker"    Column  |      Type      | Modifiers    ---------+----------------+-----------    id      | uuid           | not null    name    | text           | not null    chrom   | integer        | not null    basepos | integer        | not null    alleles | character(1)[] |    Indexes:       "marker_pkey" PRIMARY KEY, btree (id)       "marker_name_key" UNIQUE CONSTRAINT, btree (name)       "marker_basepos_idx" btree (basepos)       "marker_chrom_basepos_idx" btree (chrom, basepos)   Referenced by:       TABLE "markerset_member" CONSTRAINT "markerset_member_member_id_fkey" FOREIGN KEY (member_id) REFERENCES marker(id)
                 Table "aut.segment"        Column     |  Type   |     Modifiers         ----------------+---------+--------------------    id             | uuid    | not null    chrom          | integer | not null    markerset_id   | uuid    | not null    probandset_id  | uuid    | not null    startbase      | integer | not null    endbase        | integer | not null    firstmarker    | integer | not null    lastmarker     | integer | not null    events_less    | bigint  | not null default 0    events_equal   | bigint  | not null default 0    events_greater | bigint  | not null default 0   Indexes:       "segment_pkey" PRIMARY KEY, btree (id)       "useg" UNIQUE CONSTRAINT, btree (probandset_id, chrom, startbase)       "segment_markerset_id_chrom_firstmarker_idx" btree (markerset_id, chrom, firstmarker)       "segment_markerset_id_idx" btree (markerset_id)       "segment_startbase_idx" btree (startbase)   Foreign-key constraints:       "segment_markerset_id_fkey" FOREIGN KEY (markerset_id) REFERENCES markerset(id)       "segment_probandset_id_fkey" FOREIGN KEY (probandset_id) REFERENCES probandset(id)   Referenced by:       TABLE "segmentset_member" CONSTRAINT "segmentset_member_segment_id_fkey" FOREIGN KEY (segment_id) REFERENCES segment(id)
              Table "aut.probandset"      Column    |       Type       | Modifiers    -------------+------------------+-----------    id          | uuid             | not null    name        | text             |     probands    | uuid[]           | not null    meioses     | integer          |     min_kincoef | double precision |     max_kincoef | double precision |     people_id   | uuid             | not null   Indexes:       "probandset_pkey" PRIMARY KEY, btree (id)       "probandsetunique" gin (probands)   Check constraints:       "sortedset" CHECK (issorteduuids(probands))   Foreign-key constraints:       "probandset_people_id_fkey" FOREIGN KEY (people_id) REFERENCES people(id)   Referenced by:       TABLE "probandset_group_member" CONSTRAINT "probandset_group_member_member_id_fkey" FOREIGN KEY (member_id) REFERENCES probandset(id)       TABLE "segment" CONSTRAINT "segment_probandset_id_fkey" FOREIGN KEY (probandset_id) REFERENCES probandset(id)
   explain select m.id as mkrid          , min(pv(s.events_less, s.events_equal, s.events_greater, 0)) as optval                                                                from marker m join segment s on m.basepos between s.startbase and s.endbase        and m.chrom = 1        and s.chrom = 1        and s.markerset_id = 'afad387e-a8e7-4c3b-9adb-3f00e70a13b3'    join probandset r on s.probandset_id = r.id        and r.people_id =  'b124acb2-e400-4e63-8010-5a5f25a78f1c'   group by m.id


pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: [GENERAL] How do I insert and update into a table of arrays of composite types via a select command?
Next
From: Celia McInnis
Date:
Subject: Re: [GENERAL] How do I insert and update into a table of arrays ofcomposite types via a select command?