[GENERAL] query not scaling - Mailing list pgsql-general
From | Rob Sargent |
---|---|
Subject | [GENERAL] query not scaling |
Date | |
Msg-id | 6f7d4fd4-aed2-2fae-f551-d0f67196da33@gmail.com Whole thread Raw |
Responses |
Re: [GENERAL] query not scaling
|
List | pgsql-general |
I have a query I cannot tame and I'm wondering if there's an alternative to the "between" clause I'm using. Perhaps a custom type could do better? I've tried the "<@" orperator and that changes the query plan significantly but the execution cost/time is not improved. Any suggestion or pointers much appreciated. Environment: Using a virtual CentOS Linux release 7.4.1708 (Core), 4 cores (2.3GHz), 8G RAM and postgres 10.0(beta3) shared_buffers = 1GB, work_mem = 2GB Domain: (TL/DR) A "segment" is defined by a particular set of people (probandset.id) plus a subset of markers (markerset.id, startmarker, endmarker). I need the minimum p-value for each marker in the set across all segments matching the set and a specific set of poeple. So a given segment says "I cover all the markers from startbase to endbase" and each marker has a specific base position (relative to a chromosome). I'm after the smallest p-value for each marker across the set of segments which include that marker (from the 'between' clause). Context: I have the query in a function so the ids of the all the players are available to the following sql: select m.id as mkrid , min(pv(s.events_less, s.events_equal, s.events_greater, 0)) as optval from marker m join segment s on m.basepos between s.startbase and s.endbase and m.chrom = 1 and s.chrom = 1 and s.markerset_id = suppliedMarkersetId join probandset r on s.probandset_id = r.id and r.people_id = suppliedPeopleId group by m.id where the pv function is create or replace function pv(l bigint, e bigint, g bigint, o int) returns numeric as $$ select 1.0*(g+e+o)/(l+e+g+o); $$ language sql ; I have the identical schema in two databases (same pg instance) and the tables definitions involved are below. In one schema there are 1.7M records in segment and in the other there is 40M rows. The marker tables are much more similar with 600K and 900K respectively. The third table, probandset, has 60 and 600 respectively. On average 0.8M and 1.8M segments per markerset_id. The explains: (fast (12sec), then slow(hours)). The part which sticks out to me is where the "between" gets used. (I'm betting that probandset is too small to matter.) The slower explain plan is very similar to what I saw originally in the now "fast" data set and the current indexing stategy comes largely from that performance work. It looks like I'm getting a Cartesian between the number of markers in a set and the number of segments found: ten zeros at least. QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------HashAggregate (cost=291472.27..292040.58 rows=56831 width=48) Group Key: m.id -> Nested Loop (cost=3752.33..167295.52 rows=4515518 width=40) -> Nested Loop (cost=3751.90..17906.25 rows=715 width=32) -> Seq Scan on probandset r (cost=0.00..2.77 rows=4 width=16) Filter: (people_id = '4e3b9829-43a8-4f84-9df6-f120dc5b1a7e'::uuid) -> Bitmap Heap Scan on segment s (cost=3751.90..4473.96 rows=191 width=48) Recheck Cond: ((probandset_id = r.id) AND (chrom = 1) AND (markerset_id = '61a7e5cb-b81d-42e4-9e07-6bd9c2fbe6d1'::uuid)) -> BitmapAnd (cost=3751.90..3751.90 rows=191 width=0) -> Bitmap Index Scan on useg (cost=0.00..72.61 rows=2418 width=0) Index Cond: ((probandset_id = r.id) AND (chrom = 1)) -> Bitmap Index Scan on segment_markerset_id_idx (cost=0.00..3676.23 rows=140240 width=0) Index Cond: (markerset_id = '61a7e5cb-b81d-42e4-9e07-6bd9c2fbe6d1'::uuid) -> Index Scan using marker_chrom_basepos_idx on marker m (cost=0.42..145.79 rows=6315 width=20) Index Cond: ((chrom = 1) AND (basepos >= s.startbase) AND (basepos <= s.endbase)) (15 rows) QUERY PLAN -----------------------------------------------------------------------------------------------------------------------------HashAggregate (cost=83131331.81..83132151.44 rows=81963 width=48) Group Key: m.id -> Nested Loop (cost=1907.38..70802659.35 rows=448315362 width=40) Join Filter: ((m.basepos >= s.startbase) AND (m.basepos <= s.endbase)) -> Bitmap Heap Scan on marker m (cost=1883.64..11009.18 rows=81963 width=20) Recheck Cond: (chrom = 1) -> Bitmap Index Scan on marker_chrom_basepos_idx (cost=0.00..1863.15 rows=81963 width=0) Index Cond: (chrom = 1) -> Materialize (cost=23.74..181468.38 rows=49228 width=32) -> Hash Join (cost=23.74..181222.24 rows=49228 width=32) Hash Cond: (s.probandset_id = r.id) -> Index Scan using segment_markerset_id_idx on segment s (cost=0.56..178022.70 rows=251881 width=48) Index Cond: (markerset_id = 'afad387e-a8e7-4c3b-9adb-3f00e70a13b3'::uuid) Filter: (chrom = 1) -> Hash (cost=21.68..21.68 rows=120 width=16) -> Seq Scan on probandset r (cost=0.00..21.68 rows=120 width=16) Filter: (people_id = 'b124acb2-e400-4e63-8010-5a5f25a78f1c'::uuid) (17 rows) Table "base.marker" Column | Type | Modifiers ---------+----------------+----------- id | uuid | not null name | text | not null chrom | integer | not null basepos | integer | not null alleles | character(1)[] | Indexes: "marker_pkey" PRIMARY KEY, btree (id) "marker_name_key" UNIQUE CONSTRAINT, btree (name) "marker_basepos_idx" btree (basepos) "marker_chrom_basepos_idx" btree (chrom, basepos) Referenced by: TABLE "markerset_member" CONSTRAINT "markerset_member_member_id_fkey" FOREIGN KEY (member_id) REFERENCES marker(id) Table "aut.segment" Column | Type | Modifiers ----------------+---------+-------------------- id | uuid | not null chrom | integer | not null markerset_id | uuid | not null probandset_id | uuid | not null startbase | integer | not null endbase | integer | not null firstmarker | integer | not null lastmarker | integer | not null events_less | bigint | not null default 0 events_equal | bigint | not null default 0 events_greater | bigint | not null default 0 Indexes: "segment_pkey" PRIMARY KEY, btree (id) "useg" UNIQUE CONSTRAINT, btree (probandset_id, chrom, startbase) "segment_markerset_id_chrom_firstmarker_idx" btree (markerset_id, chrom, firstmarker) "segment_markerset_id_idx" btree (markerset_id) "segment_startbase_idx" btree (startbase) Foreign-key constraints: "segment_markerset_id_fkey" FOREIGN KEY (markerset_id) REFERENCES markerset(id) "segment_probandset_id_fkey" FOREIGN KEY (probandset_id) REFERENCES probandset(id) Referenced by: TABLE "segmentset_member" CONSTRAINT "segmentset_member_segment_id_fkey" FOREIGN KEY (segment_id) REFERENCES segment(id) Table "aut.probandset" Column | Type | Modifiers -------------+------------------+----------- id | uuid | not null name | text | probands | uuid[] | not null meioses | integer | min_kincoef | double precision | max_kincoef | double precision | people_id | uuid | not null Indexes: "probandset_pkey" PRIMARY KEY, btree (id) "probandsetunique" gin (probands) Check constraints: "sortedset" CHECK (issorteduuids(probands)) Foreign-key constraints: "probandset_people_id_fkey" FOREIGN KEY (people_id) REFERENCES people(id) Referenced by: TABLE "probandset_group_member" CONSTRAINT "probandset_group_member_member_id_fkey" FOREIGN KEY (member_id) REFERENCES probandset(id) TABLE "segment" CONSTRAINT "segment_probandset_id_fkey" FOREIGN KEY (probandset_id) REFERENCES probandset(id) explain select m.id as mkrid , min(pv(s.events_less, s.events_equal, s.events_greater, 0)) as optval from marker m join segment s on m.basepos between s.startbase and s.endbase and m.chrom = 1 and s.chrom = 1 and s.markerset_id = 'afad387e-a8e7-4c3b-9adb-3f00e70a13b3' join probandset r on s.probandset_id = r.id and r.people_id = 'b124acb2-e400-4e63-8010-5a5f25a78f1c' group by m.id
pgsql-general by date: