Thread: How to find greatest record before known values fast
I'm looking for a way to increase SELECT statement speed in Postgres 9.0.
The table has the required index present, and the desired result could be obtained from the (kuupaev, kellaaeg) index immediately.
However, Postgres scans all rows:
explain analyze SELECT
max( kuupaev||kellaaeg ) as res
from firma2.ALGSA
where laonr=1 and kuupaev <=current_date and
(kuupaev,kellaaeg) <= ( current_date, '23 59' )
"Aggregate (cost=6932.65..6932.67 rows=1 width=10) (actual time=1608.590..1608.592 rows=1 loops=1)"
" -> Seq Scan on algsa (cost=0.00..6571.49 rows=144464 width=10) (actual time=0.032..922.431 rows=144458 loops=1)"
" Filter: ((laonr = 1::numeric) AND (kuupaev <= ('now'::text)::date) AND (ROW(kuupaev, kellaaeg) <= ROW(('now'::text)::date, '23 59'::bpchar)))"
"Total runtime: 1608.846 ms"
In the real query, the constants 1, current_date and '23 59' are variable parameters.
The table has both indexes present, but Postgres will not use them.
The indexes can be changed and the query can be re-written if that helps.
The table structure cannot be changed: the char columns cannot be replaced with varchar columns, kuupaev must stay of type date, and kellaaeg must stay char(5).
The query contains the redundant condition `kuupaev <= current_date`, but the index is still not used.
I also tried `SELECT max( (kuupaev, kellaaeg) )` but got an error that the max() function does not exist for that type.
How can this query be made faster?
The table structure is:
CREATE TABLE firma2.algsa
(
id serial NOT NULL,
laonr numeric(2,0),
kuupaev date NOT NULL,
kellaaeg character(5) NOT NULL DEFAULT ''::bpchar,
osak character(10) NOT NULL,
toode character(20) NOT NULL,
partii character(15),
kogus numeric(12,4) NOT NULL DEFAULT 0,
hind numeric(15,5) NOT NULL DEFAULT 0,
kulum numeric(15,5) NOT NULL DEFAULT 0,
tegkogus numeric(12,4),
stkuupaev date,
klient character(12),
masin character(5),
CONSTRAINT algsa_pkey PRIMARY KEY (id)
);
CREATE INDEX algsa_kuupaev_idx
ON firma2.algsa
USING btree
(kuupaev);
CREATE INDEX algsa_kuupaev_kellaaeg_idx
ON firma2.algsa
USING btree
(kuupaev, kellaaeg);
Using "PostgreSQL 9.0.3, compiled by Visual C++ build 1500, 32-bit".
Also posted at
http://stackoverflow.com/questions/26165745/how-find-greatest-tuple-before-given-2-column-tuple-in-postgres-fast
Andrus.
So kellaaeg is a time? Your best bet here would be to create an index that is an actual timestamp comprised of both kuupaev and kellaaeg. You could do this with to_timestamp by concatenating both fields together, or it may be easier to replace the space in kellaaeg with a colon and cast it to time, then add the two:

    kuupaev + replace( kellaaeg, ' ', ':' )::time

I know you can't alter the table, but can you create a view on top of the table? If you did that, you could have a real timestamp field in the view that is calculated from kuupaev and kellaaeg, and you can create a functional index that uses the same calculation. That would be the easiest way to use this.
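A minimal sketch of how that view-plus-expression-index idea might look. All object names here are invented, it assumes kellaaeg always holds 'HH MI', and the text-to-time cast is not IMMUTABLE, so it is wrapped in a helper function before being used in an index:

CREATE FUNCTION firma2.algsa_ts(date, text) RETURNS timestamp AS $$
    SELECT $1 + replace($2, ' ', ':')::time;
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX algsa_ts_idx ON firma2.algsa (firma2.algsa_ts(kuupaev, kellaaeg));

CREATE VIEW firma2.algsa_v AS
    SELECT a.*, firma2.algsa_ts(kuupaev, kellaaeg) AS ts
    FROM firma2.algsa a;

-- A query through the view can then potentially use the index, e.g.:
-- SELECT max(ts) FROM firma2.algsa_v WHERE ts <= current_date + time '23:59';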
Andrus Moor wrote:
> Table has both indexes present but postgres will not use them.
> Indexes can changed and query can re-written if this helps.
> Table structure cannot changed. char columns cannot replaced with varchar
> columns. kuupaev must be date and kellaaeg must be char(5) type.

Don't discard the option of adding additional trigger-maintained columns to the table and then modifying the WHERE clause to use those new columns while still outputting the originals.

You also need to explain whether you can update from 9.0.3 to 9.0.18. As Craig Ringer already said, working with seldom-used types in unusual ways on ancient releases means that you may be stuck with second-rate workarounds or just living with the performance degradation.

I guess at worst you could delete data so the scan doesn't take as long...

David J.
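A rough sketch of that trigger-maintained-column idea (all names invented; it assumes a new column may be added even though the existing columns must stay as they are):

ALTER TABLE firma2.algsa ADD COLUMN ts timestamp;

CREATE FUNCTION firma2.algsa_set_ts() RETURNS trigger AS $$
BEGIN
    -- Assumes kellaaeg always holds 'HH MI'; adjust the expression if not.
    NEW.ts := NEW.kuupaev + replace(NEW.kellaaeg, ' ', ':')::time;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER algsa_set_ts
    BEFORE INSERT OR UPDATE ON firma2.algsa
    FOR EACH ROW EXECUTE PROCEDURE firma2.algsa_set_ts();

-- Backfill existing rows once, then index the new column together with laonr:
UPDATE firma2.algsa SET ts = kuupaev + replace(kellaaeg, ' ', ':')::time;
CREATE INDEX algsa_laonr_ts_idx ON firma2.algsa (laonr, ts);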
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> So kellaaeg is a time? Your best bet here would be to create an index
> that is an actual timestamp comprised of both kuupaev and kellaaeg.

The real problem with this query, or at least with the index design, is that the index design isn't accounting for the need to constrain "laonr". The best way to create the index is with laonr first, and instead of the max() write something like

regression=# create table foo (f1 int, f2 date, f3 bpchar(5));
CREATE TABLE
regression=# create index on foo (f1, f2, f3);
CREATE INDEX
regression=# explain select * from foo where f1 = 1 and (f2,f3) <= (current_date, '23 59') order by f2 desc, f3 desc limit 1;
                                           QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit  (cost=0.01..5.44 rows=1 width=17)
   ->  Index Scan Backward using foo_f1_f2_f3_idx on foo  (cost=0.01..16.32 rows=3 width=17)
         Index Cond: ((f1 = 1) AND (ROW(f2, f3) <= ROW(('now'::text)::date, '23 59'::bpchar)))
(3 rows)

(tested on 9.0.18, should work on newer versions too)

BTW, the fact that newer versions are refusing to create an index on "kuupaev||kellaaeg" should not be dismissed as mere pedantry. The reason for that is that this expression involves a cast from date to text, and the format of the text you get depends on DateStyle. And that should call your attention to the fact that *the original query gives the wrong answer*, or at least an answer that I bet is not the one you want, unless DateStyle chances to be ISO.

IMO, one of the ten deadly sins of database design is using text-string mashing to accomplish operations that are not naturally textual. This query and the table design itself are in dire need of sackcloth and ashes. Had the table designer had the wit to use a timestamp field rather than this unholy mashup, we'd not be having this conversation.

			regards, tom lane
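Translated to the original table, that pattern presumably becomes something like the following (untested sketch; the index name is made up). Note that it returns the latest (kuupaev, kellaaeg) row rather than a concatenated string:

CREATE INDEX algsa_laonr_kuupaev_kellaaeg_idx
    ON firma2.algsa (laonr, kuupaev, kellaaeg);

SELECT kuupaev, kellaaeg
FROM firma2.algsa
WHERE laonr = 1
  AND (kuupaev, kellaaeg) <= (current_date, '23 59')
ORDER BY kuupaev DESC, kellaaeg DESC
LIMIT 1;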
Hi!

> So kellaaeg is a time? Your best bet here would be to create an index that
> is an actual timestamp comprised of both kuupaev and kellaaeg. [...]
> That would be the easiest way to use this.

Thank you. I solved this by creating a composite index on the 3 columns and re-writing the query as Tom recommended. It looks like Tom's recommendation is simpler for me.

Andrus.
On Fri, Oct 3, 2014 at 1:28 AM, Andrus <kobruleht2@hot.ee> wrote:
> Thank you.
> I solved this by creating composite index on 3 columns and re-writing query
> as Tom recommended.
> It looks like Tom's recommendation is simpler for me.

Also:

*) Quit using the char() type; use varchar() -- at least in Postgres, it's better in every respect. The char type pads out the fields on disk. (This is a common newbie error in Postgres, since that may not necessarily be true in other databases.)

*) The numeric type gives fixed-point operations and clean comparisons and so is generally necessary, but in some cases floating point (float4/float8) is more compact and gives better performance without much downside.

*) 9.0 is up to 9.0.18. Time to upgrade (it's a binary-only replacement).

*) Indexes like (kuupaev) and (kuupaev, kellaaeg) are mostly redundant, because the composite index can service queries on kuupaev nearly as well as the single-field index. Either kill the single-field index to get better memory utilization, or reverse the fields in the composite index to (kellaaeg, kuupaev) if you make frequent searches on kellaaeg. Indexes match quals in left-to-right order to give the best performance, so an index on (a, b, c) gives good performance for searches on (a), (a, b), and (a, b, c). There are certain limited exceptions to this rule, but it's a good design principle to know.

merlin
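For the table in this thread, the index advice in the last point presumably comes down to something like this sketch:

-- The composite (kuupaev, kellaaeg) index can already serve kuupaev-only
-- searches, so the single-column index is likely redundant:
DROP INDEX firma2.algsa_kuupaev_idx;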
I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any time off this one. I'm hoping someone has another strategy.

I have 2 tables:
    resource
    resource_2_tag

I want to calculate the top 25 "tag_ids" in "resource_2_tag" for resources that match a given attribute on the "resource" table.

Both tables have around 1.6 million records.

If the database needs to warm up and read into cache, this can take 60 seconds to read the data off disk. If the database doesn't need to warm up, it averages 1.76 seconds.

The 1.76 s time is troubling me. Searching for the discrete elements of this is pretty lightweight.

Here's an explain -- http://explain.depesz.com/s/PndC

I tried a subquery instead of a join, and the optimizer produced the same plan.

I'm hoping someone will see something that I just don't see.

      Table "public.resource_2_tag"
    Column    |  Type   | Modifiers
 -------------+---------+-----------
  resource_id | integer |
  tag_id      | integer |
 Indexes:
     "_idx_speed_resource_2_tag__resource_id" btree (resource_id)
     "_idx_speed_resource_2_tag__tag_id" btree (tag_id)

      Table "public.resource"
         Column          |  Type   | Modifiers
 ------------------------+---------+--------------------------------------------------------
  id                     | integer | not null default nextval('resource_id_seq'::regclass)
  resource_attribute1_id | integer |
  lots of other columns  |         |
 Indexes:
     "resource_attribute1_idx" btree (resource_attribute1_id)

--------------------------------------------------------------------------------

select count(*) from resource;
-- 1669729

select count(*) from resource_2_tag;
-- 1676594

select count(*) from resource where resource_attribute1_id = 614;
-- 5184
-- 4.386ms

select id from resource where resource_attribute1_id = 614;
-- 5184
-- 87.303ms

Popping the 5k elements into an "in" clause will run the query in around 100ms.

EXPLAIN ANALYZE
SELECT
    resource_2_tag.tag_id AS resource_2_tag_tag_id,
    count(resource_2_tag.tag_id) AS counted
FROM
    resource_2_tag
JOIN resource ON resource.id = resource_2_tag.resource_id
WHERE
    resource.resource_attribute1_id = 614
GROUP BY resource_2_tag.tag_id
ORDER BY counted DESC
LIMIT 25 OFFSET 0;

--------------------------------------------------------------------------------

Limit  (cost=76659.61..76659.68 rows=25 width=4) (actual time=1502.902..1502.913 rows=25 loops=1)
  ->  Sort  (cost=76659.61..76672.47 rows=5141 width=4) (actual time=1502.900..1502.906 rows=25 loops=1)
        Sort Key: (count(resource_2_tag.tag_id))
        Sort Method: top-N heapsort  Memory: 26kB
        ->  HashAggregate  (cost=76463.13..76514.54 rows=5141 width=4) (actual time=1487.016..1495.206 rows=13887 loops=1)
              ->  Hash Join  (cost=35867.88..76437.42 rows=5141 width=4) (actual time=97.654..1453.337 rows=27068 loops=1)
                    Hash Cond: (resource_2_tag.resource_id = resource.id)
                    ->  Seq Scan on resource_2_tag  (cost=0.00..25847.94 rows=1676594 width=8) (actual time=0.032..513.046 rows=1676594 loops=1)
                    ->  Hash  (cost=35803.88..35803.88 rows=5120 width=4) (actual time=97.576..97.576 rows=5184 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 183kB
                          ->  Bitmap Heap Scan on resource  (cost=272.68..35803.88 rows=5120 width=4) (actual time=5.911..90.264 rows=5184 loops=1)
                                Recheck Cond: (resource_attribute1_id = 614)
                                ->  Bitmap Index Scan on resource_attribute1_idx  (cost=0.00..271.40 rows=5120 width=0) (actual time=3.575..3.575 rows=5184 loops=1)
                                      Index Cond: (resource_attribute1_id = 614)
Total runtime: 1503.146 ms
Hi!

Thank you for the explanations.

> the char type pads out the fields on disk.

It looks like you wrote that char takes more disk space. But from

http://www.pgcon.org/2013/schedule/attachments/269_tour-of-postgresql-data-types.pdf

page 28:

    Unlike many databases, char(n) is NOT stored as a fixed-sized field in
    Postgres. It is treated exactly the same as varchar(n) except for being
    padded.

So the char type does not take more space than varchar.

Andrus.
On Fri, Oct 3, 2014 at 3:28 PM, Andrus <kobruleht2@hot.ee> wrote:
> It looks like you wrote that char takes more disk space.
> [...]
> So char type does not take more space than varchar.

I beg to differ:

postgres=# create table t1(v char(100));
CREATE TABLE
postgres=# create table t2(v varchar(100));
CREATE TABLE
postgres=# insert into t1 select '' from generate_series(1,1000000);
INSERT 0 1000000
Time: 5951.023 ms
postgres=# insert into t2 select '' from generate_series(1,1000000);
INSERT 0 1000000
Time: 2083.323 ms
postgres=# select pg_size_pretty(pg_relation_size(oid)) from pg_class where relname = 't1';
 pg_size_pretty
────────────────
 128 MB
(1 row)

postgres=# select pg_size_pretty(pg_relation_size(oid)) from pg_class where relname = 't2';
 pg_size_pretty
────────────────
 35 MB
(1 row)

merlin
On 10/03/2014 01:28 PM, Andrus wrote:
> page 28:
>
>     Unlike many databases, char(n) is NOT stored as a fixed-sized field in
>     Postgres. It is treated exactly the same as varchar(n) except for being
>     padded.
>
> So char type does not take more space than varchar.

Which directly contradicts the information on page 27:

    Character Types (or Strings)
    Name        Description
    varchar(n)  variable-length with limit
    char(n)     fixed-length, blank padded
    text        variable unlimited length

and the docs:

http://www.postgresql.org/docs/9.3/interactive/datatype-character.html

    Values of type character are physically padded with spaces to the
    specified width n, and are stored and displayed that way. However, the
    padding spaces are treated as semantically insignificant. Trailing spaces
    are disregarded when comparing two values of type character, and they
    will be removed when converting a character value to one of the other
    string types. Note that trailing spaces are semantically significant in
    character varying and text values, and when using pattern matching, e.g.
    LIKE, regular expressions.

    Tip: There is no performance difference among these three types, apart
    from increased storage space when using the blank-padded type, and a few
    extra CPU cycles to check the length when storing into a
    length-constrained column. While character(n) has performance advantages
    in some other database systems, there is no such advantage in PostgreSQL;
    in fact character(n) is usually the slowest of the three because of its
    additional storage costs. In most situations text or character varying
    should be used instead.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
Adrian Klaver <adrian.klaver@aklaver.com> writes:
>> page 28:
>>     Unlike many databases, char(n) is NOT stored as a fixed-sized field in
>>     Postgres. It is treated exactly the same as varchar(n) except for
>>     being padded.

> Which directly contradicts the information on page 27:

This info is probably not as well worded as it could be, but it's not really wrong. The key point is that char(N) is blank-padded (thereby wasting space) to be N *characters*, but that is not necessarily N *bytes*, because of possible multi-byte characters. Therefore the engine has to treat it as a variable-length datatype.

I believe in some other DBMSes, char(N) means N *bytes* and is treated as a fixed-size datatype. Dunno what they do with multibyte encodings.

			regards, tom lane
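A quick illustration of the characters-versus-bytes point, assuming a UTF-8 database (the literals are arbitrary):

-- Both values are char(5) holding 5 characters, but the multibyte one needs
-- more bytes, so the on-disk length cannot be fixed:
SELECT octet_length('abcde'::char(5)) AS ascii_bytes,      -- 5
       octet_length('aéîöü'::char(5)) AS multibyte_bytes;  -- 9 in UTF-8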
On 10/03/2014 03:59 PM, Tom Lane wrote:
> The key point is that char(N) is blank-padded (thereby wasting space) to be
> N *characters*, but that is not necessarily N *bytes*, because of possible
> multi-byte characters. Therefore the engine has to treat it as a
> variable-length datatype.

Well, that is a distinction I had missed, thanks for the heads up. So the bottom line is that char(N) is variable length, but values that do not reach length N will require more storage space than they would in varchar(N).

-- 
Adrian Klaver
adrian.klaver@aklaver.com
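A hypothetical quick check of that bottom line (exact numbers depend on varlena headers and encoding):

-- The char(100) value is blank-padded to 100 characters before storage, so
-- it reports a much larger size than the same short value as varchar(100):
SELECT pg_column_size('abc'::char(100))    AS char_bytes,
       pg_column_size('abc'::varchar(100)) AS varchar_bytes;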
Hi!

Thank you.

People often look for sales in different periods using different filters. There are lots of sales, and every sale is an individual record in the sales table, so increasing sequential scan speed is important.

I tried

create table t1(v char(100), p numeric(12,5));
create table t2(v varchar(100), p numeric(12,5));
insert into t1 select '', generate_series from generate_series(1,1000000);
insert into t2 select '', generate_series from generate_series(1,1000000);

and after that measured the speed of

select sum(p) from t1

and

select sum(p) from t2

Both of them took approximately 800 ms. So it looks like there is no difference in sequential scan speed and thus no need to change the char types.

Andrus
Hi!

Thank you.

In my db people often look for sales in different periods using different filters and then sum them. There are lots of sales, and every sale is an individual record in the sales table, so increasing sequential scan speed is important.

I tried

create table t1(v char(100), p numeric(12,5));
create table t2(v varchar(100), p numeric(12,5));
insert into t1 select '', generate_series from generate_series(1,1000000);
insert into t2 select '', generate_series from generate_series(1,1000000);

and after that measured the speed of

select sum(p) from t1

and

select sum(p) from t2

Both of them took approximately 800 ms.

The speed of

select max(length(v)) from t1

and

select max(length(v)) from t2

is also the same.

I also tested a production database with a 1,600,000-row table containing a char(100) column, by running

select sum(decimalcolumn) from testtable

before and after running

alter table testtable alter char100column type varchar(100)

The select sum command took 1700 ms in both cases.

So there is no difference in sequential scan speed. Replacing char with varchar requires re-writing some parts of the code, and disk space is a minor issue compared to the cost of a code rewrite. It looks like it is not reasonable to replace char with varchar.

Andrus
On Sat, Oct 4, 2014 at 3:46 PM, Andrus <kobruleht2@hot.ee> wrote:
> Both of them took approximately 800 ms
>
> So there is no difference in sequential scan speed.
> Replacing char with varchar requires re-writing some parts of code.
> Disk space is minor issue compared to cost of code-rewrite.
> It looks like it is not reasonable to replace char with varchar.

Sure, in this trivial case it's not different (both tables are small, fit in cache, and the numeric calculation is where the bulk of the time is getting spent). But if your table is double the size, it's going to have impacts on many real-world workloads. I'm not in any way saying to go change up your database, but I'd definitely avoid char() for all new code.

merlin
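A hedged way to check how much space a real table actually occupies before deciding whether the difference matters (the table name below is the one from this thread; substitute your own):

-- Heap size versus total size (including indexes and TOAST):
SELECT pg_size_pretty(pg_relation_size('firma2.algsa'))       AS heap_size,
       pg_size_pretty(pg_total_relation_size('firma2.algsa')) AS total_size;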
On 10/3/14, 11:21 AM, Jonathan Vanasco wrote:
> I want to calculate the top 25 "tag_ids" in "resource_2_tag" for resources
> that match a given attribute on the "resource" table.
> [...]
> EXPLAIN ANALYZE
> SELECT
>     resource_2_tag.tag_id AS resource_2_tag_tag_id,
>     count(resource_2_tag.tag_id) AS counted
> FROM
>     resource_2_tag
> JOIN resource ON resource.id = resource_2_tag.resource_id
> WHERE
>     resource.resource_attribute1_id = 614
> GROUP BY resource_2_tag.tag_id
> ORDER BY counted DESC
> LIMIT 25 OFFSET 0;

Don't join to the resource table; there's no reason to because you're not pulling anything from it.

If for some reason you do need data out of the resource table, do the LIMIT 25 first, in a sub-select:

SELECT r.*, counted
FROM resource r
JOIN (
    SELECT tag_id, count(*) AS counted
    FROM resource_2_tag
    GROUP BY tag_id
    ORDER BY counted DESC
    LIMIT 25
) t ON ... ;

-- 
Jim Nasby, Data Architect, Blue Treble
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: faster way to calculate top "tags" for a "resource" based on a column
From: Jonathan Vanasco
On Oct 6, 2014, at 5:56 PM, Jim Nasby wrote:
> Don't join to the resource table; there's no reason to because you're not pulling anything from it.

Thanks for the reply!

I'm not pulling anything from the resource table, but the join is necessary because I'm filtering based on it (see the WHERE clause). I'm not trying to find the most-used tags overall, but the ones that are used by resources with a (variable) id in a column on the resources table.
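One way to keep that filter without pulling columns from resource is a semi-join; a hedged, untested sketch against the schema shown earlier in the thread:

-- EXISTS keeps the filter on resource but never pulls its columns into the
-- aggregate; equivalent to the join here because resource.id is unique.
SELECT r2t.tag_id, count(*) AS counted
FROM resource_2_tag r2t
WHERE EXISTS (
    SELECT 1
    FROM resource r
    WHERE r.id = r2t.resource_id
      AND r.resource_attribute1_id = 614
)
GROUP BY r2t.tag_id
ORDER BY counted DESC
LIMIT 25;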
> I want to calculate the top 25 "tag_ids" in "resource_2_tag" for resources
> that match a given attribute on the "resource" table.
> [...]
> here's an explain -- http://explain.depesz.com/s/PndC
> [...]
> popping the 5k elements into an "in" clause, will run the query in around 100ms.

Hi,

it seems to me that your subquery may deliver duplicate ids. And with the selectivity of your example, I would expect an index to be used instead of a table scan. You may check how up to date your statistics are and try to raise the statistics target on the column resource_2_tag.tag_id.

Also try a CTE form for your query:

WITH A AS (
    SELECT DISTINCT id
    FROM resource
    WHERE resource_attribute1_id = 614
)
SELECT
    resource_2_tag.tag_id AS resource_2_tag_tag_id,
    count(resource_2_tag.tag_id) AS counted
FROM resource_2_tag
JOIN A ON A.id = resource_2_tag.resource_id
GROUP BY resource_2_tag.tag_id
ORDER BY counted DESC
LIMIT 25;

regards,

Marc Mamin
Re: faster way to calculate top "tags" for a "resource" based on a column
From: Jonathan Vanasco
On Oct 7, 2014, at 10:02 AM, Marc Mamin wrote:
> it seems to me that your subquery may deliver duplicate ids.
> And with the selectivity of your example, I would expect an index usage
> instead of a table scan. You may check how up to date your statistics are
> and try to raise the statistic target on the column resource_2_tag.tag_id.
> Also try a CTE form for your query:

It shouldn't be able to deliver duplicate ids.

=> SELECT COUNT(*) FROM (SELECT DISTINCT id FROM resource WHERE resource_attribute1_id = 614) AS foo;
 count
-------
  5184

=> SELECT COUNT(*) FROM (SELECT id FROM resource WHERE resource_attribute1_id = 614) AS foo;
 count
-------
  5184

However, adding in the DISTINCT drastically changed the query plan, and did give a speedup.

Your comment made me focus on the notion of a table scan. I assumed it did the seq scan -- and that there would not be much savings otherwise -- because the table is just 2 ids. I was wrong.

I noticed that I never put a PRIMARY KEY constraint on that table. So I tried adding a PRIMARY KEY constraint, then running VACUUM ANALYZE...

And that solved all my problems. The original query ended up being the fastest, at 260ms (down from 1760):

    Join                    - 260ms
    Subquery w/DISTINCT     - 300ms
    CTE                     - 330ms
    CTE w/DISTINCT          - 345ms
    Subquery (no DISTINCT)  - 1500ms
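For reference, the fix described above presumably amounted to something along these lines (a sketch; the constraint name and column order are invented):

-- A mapping table like this typically gets a composite primary key, which
-- also creates the unique index the planner had been missing:
ALTER TABLE resource_2_tag
    ADD CONSTRAINT resource_2_tag_pkey PRIMARY KEY (resource_id, tag_id);

VACUUM ANALYZE resource_2_tag;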