Thread: Indexes not used in 7.1RC4: Bug?
Hi, I have the following table, containing about 570000 Rows, but some indexes are not used, on 7.1RC4, freshly vacuumed (analyse). It was the same at least in 7.1RC1 --- SNIP --- CREATE TABLE access_log( site_id int2 NOT NULL DEFAULT 0, access_time timestamp NOT NULL DEFAULT NOW(), time_taken interval NOT NULL, remote_ip inet NOT NULL, method_num int2 NOT NULL, url_id int4 NOT NULL REFERENCES urls(id), referer_id int4 REFERENCES referer(id), browser_id int4 REFERENCES browser(id), status int2 NOT NULL DEFAULT 0, bytes int4 NOT NULL DEFAULT 0, content_id int2 NOT NULL REFERENCES content_types(id), https_flag boolean NOT NULL DEFAULT 'f', session_id char(32), user_id int4 REFERENCES users(id), uname varchar(255), note_id int4 ); CREATE INDEX site_idx ON access_log(site_id); CREATE INDEX access_time_idx ON access_log(access_time); CREATE INDEX time_taken_idx ON access_log(time_taken); CREATE INDEX remote_ip_idx ON access_log(remote_ip); CREATE INDEX method_idx ON access_log(method_num); CREATE INDEX url_idx ON access_log(url_id); CREATE INDEX referer_idx ON access_log(referer_id); CREATE INDEX browser_idx ON access_log(browser_id); CREATE INDEX status_idx ON access_log(status); CREATE INDEX bytes_idx ON access_log(bytes); CREATE INDEX content_idx ON access_log(content_id); CREATE INDEX https_idx ON access_log(https_flag); CREATE INDEX session_idx ON access_log(session_id); CREATE INDEX user_id_idx ON access_log(user_id); CREATE INDEX user_idx ON access_log(uname); CREATE INDEX note_idx ON access_log(note_id); --- SNAP --- url_idx seems OK: logger=# EXPLAIN SELECT * FROM access_log WHERE url_id = 1000; Index Scan using url_idx on access_log (cost=0.00..3618.92 rows=1002 width=89) But the others not: logger=# EXPLAIN SELECT * FROM access_log WHERE method_num = 0; Seq Scan on access_log (cost=0.00..16443.71 rows=559371 width=89) logger=# EXPLAIN SELECT * FROM access_log WHERE browser_id = 500; Seq Scan on access_log (cost=0.00..16443.71 rows=7935 width=89) logger=# EXPLAIN SELECT * FROM access_log WHERE content_id = 20; Seq Scan on access_log (cost=0.00..16443.71 rows=20579 width=89) .... And very strange: logger=# EXPLAIN SELECT * FROM access_log WHERE access_time > '2001-04-10 10:10:10'; Index Scan using access_time_idx on access_log (cost=0.00..10605.12 rows=3251 width=89) logger=# EXPLAIN SELECT * FROM access_log WHERE access_time > '2001-04-08 10:10:10'; Seq Scan on access_log (cost=0.00..16443.71 rows=152292 width=89) Indexes are also not used in Subselects: logger=# EXPLAIN SELECT * FROM access_log WHERE url_id in (1); Index Scan using url_idx on access_log (cost=0.00..3618.92 rows=1002 width=89) logger=# EXPLAIN SELECT * FROM access_log WHERE url_id in (1,2,3); Index Scan using url_idx, url_idx, url_idx on access_log (cost=0.00..10871.79 rows=3000 width=89) But: logger=# EXPLAIN SELECT * FROM access_log WHERE url_id IN (SELECT 1); Seq Scan on access_log (cost=0.00..16443.71 rows=572537 width=89) SubPlan -> Materialize (cost=0.00..0.00 rows=0 width=0) -> Result (cost=0.00..0.00 rows=0 width=0) Indexes are also not used for remote_ip, ORDER BY access_time (timestamp), ORDER BY time_taken (interval), status, method_num etc. The only I found where indexes are used is url_id! hmmm, any hints? Bug? Postgres configuration is default for all optimizations (geqo_...); others: sort_mem = 1024 shared_buffers = 512 Tested on Linux and Win2K/Cygwin. For the hackers: explain verbose follows: logger=# EXPLAIN VERBOSE SELECT * FROM access_log WHERE url_id = 1000; NOTICE: QUERY DUMP: { INDEXSCAN :startup_cost 0.00 :total_cost 3618.92 :rows 1002 :width 89 :qptargetlist ({ TARGETENTRY :resdom { RESDOM :resno 1 :restype 21 :restypmod -1 :resname site_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 1 :vartype 21 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 1}} { TARGETENTRY :resdom { RESDOM :resno 2 :restype 1184 :restypmod -1 :resname access_time :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 2 :vartype 1184 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 2}} { TARGETENTRY :resdom { RESDOM :resno 3 :restype 1186 :restypmod -1 :resname time_taken :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 3 :vartype 1186 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 3}} { TARGETENTRY :resdom { RESDOM :resno 4 :restype 869 :restypmod -1 :resname remote_ip :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 4 :vartype 869 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 4}} { TARGETENTRY :resdom { RESDOM :resno 5 :restype 21 :restypmod -1 :resname method_num :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 5 :vartype 21 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 5}} { TARGETENTRY :resdom { RESDOM :resno 6 :restype 23 :restypmod -1 :resname url_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 6 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 6}} { TARGETENTRY :resdom { RESDOM :resno 7 :restype 23 :restypmod -1 :resname referer_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 7 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 7}} { TARGETENTRY :resdom { RESDOM :resno 8 :restype 23 :restypmod -1 :resname browser_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 8 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 8}} { TARGETENTRY :resdom { RESDOM :resno 9 :restype 21 :restypmod -1 :resname status :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 9 :vartype 21 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 9}} { TARGETENTRY :resdom { RESDOM :resno 10 :restype 23 :restypmod -1 :resname bytes :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 10 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 10}} { TARGETENTRY :resdom { RESDOM :resno 11 :restype 21 :restypmod -1 :resname content_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 11 :vartype 21 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 11}} { TARGETENTRY :resdom { RESDOM :resno 12 :restype 16 :restypmod -1 :resname https_flag :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 12 :vartype 16 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 12}} { TARGETENTRY :resdom { RESDOM :resno 13 :restype 1042 :restypmod 36 :resname session_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 13 :vartype 1042 :vartypmod 36 :varlevelsup 0 :varnoold 1 :varoattno 13}} { TARGETENTRY :resdom { RESDOM :resno 14 :restype 23 :restypmod -1 :resname user_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 14 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 14}} { TARGETENTRY :resdom { RESDOM :resno 15 :restype 1043 :restypmod 259 :resname uname :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 15 :vartype 1043 :vartypmod 259 :varlevelsup 0 :varnoold 1 :varoattno 15}} { TARGETENTRY :resdom { RESDOM :resno 16 :restype 23 :restypmod -1 :resname note_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 16 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 16}}) :qpqual <> :lefttree <> :righttree <> :extprm () :locprm () :initplan <> :nprm 0 :scanrelid 1 :indxid ( 1870492) :indxqual (({ EXPR :typeOid 16 :opType op :oper { OPER :opno 96 :opid 65 :opresulttype 16 } :args ({ VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 6} { CONST :consttype 23 :constlen 4 :constbyval true :constisnull false :constvalue 4 [ -24 3 0 0 ] })})) :indxqualorig (({ EXPR :typeOid 16 :opType op :oper { OPER :opno 96 :opid 65 :opresulttype 16 } :args ({ VAR :varno 1 :varattno 6 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 6} { CONST :consttype 23 :constlen 4 :constbyval true :constisnull false :constvalue 4 [ -24 3 0 0 ] })})) :indxorderdir 1 } NOTICE: QUERY PLAN: Index Scan using url_idx on access_log (cost=0.00..3618.92 rows=1002 width=89) EXPLAIN logger=# EXPLAIN VERBOSE SELECT * FROM access_log WHERE referer_id = 1000; NOTICE: QUERY DUMP: { SEQSCAN :startup_cost 0.00 :total_cost 16443.71 :rows 11715 :width 89 :qptargetlist ({ TARGETENTRY :resdom { RESDOM :resno 1 :restype 21 :restypmod -1 :resname site_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 1 :vartype 21 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 1}} { TARGETENTRY :resdom { RESDOM :resno 2 :restype 1184 :restypmod -1 :resname access_time :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 2 :vartype 1184 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 2}} { TARGETENTRY :resdom { RESDOM :resno 3 :restype 1186 :restypmod -1 :resname time_taken :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 3 :vartype 1186 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 3}} { TARGETENTRY :resdom { RESDOM :resno 4 :restype 869 :restypmod -1 :resname remote_ip :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 4 :vartype 869 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 4}} { TARGETENTRY :resdom { RESDOM :resno 5 :restype 21 :restypmod -1 :resname method_num :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 5 :vartype 21 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 5}} { TARGETENTRY :resdom { RESDOM :resno 6 :restype 23 :restypmod -1 :resname url_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 6 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 6}} { TARGETENTRY :resdom { RESDOM :resno 7 :restype 23 :restypmod -1 :resname referer_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 7 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 7}} { TARGETENTRY :resdom { RESDOM :resno 8 :restype 23 :restypmod -1 :resname browser_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 8 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 8}} { TARGETENTRY :resdom { RESDOM :resno 9 :restype 21 :restypmod -1 :resname status :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 9 :vartype 21 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 9}} { TARGETENTRY :resdom { RESDOM :resno 10 :restype 23 :restypmod -1 :resname bytes :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 10 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 10}} { TARGETENTRY :resdom { RESDOM :resno 11 :restype 21 :restypmod -1 :resname content_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 11 :vartype 21 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 11}} { TARGETENTRY :resdom { RESDOM :resno 12 :restype 16 :restypmod -1 :resname https_flag :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 12 :vartype 16 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 12}} { TARGETENTRY :resdom { RESDOM :resno 13 :restype 1042 :restypmod 36 :resname session_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 13 :vartype 1042 :vartypmod 36 :varlevelsup 0 :varnoold 1 :varoattno 13}} { TARGETENTRY :resdom { RESDOM :resno 14 :restype 23 :restypmod -1 :resname user_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 14 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 14}} { TARGETENTRY :resdom { RESDOM :resno 15 :restype 1043 :restypmod 259 :resname uname :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 15 :vartype 1043 :vartypmod 259 :varlevelsup 0 :varnoold 1 :varoattno 15}} { TARGETENTRY :resdom { RESDOM :resno 16 :restype 23 :restypmod -1 :resname note_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 16 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 16}}) :qpqual ({ EXPR :typeOid 16 :opType op :oper { OPER :opno 96 :opid 65 :opresulttype 16 } :args ({ VAR :varno 1 :varattno 7 :vartype 23 :vartypmod -1 :varlevelsup 0 :varnoold 1 :varoattno 7} { CONST :consttype 23 :constlen 4 :constbyval true :constisnull false :constvalue 4 [ -24 3 0 0 ] })}) :lefttree <> :righttree <> :extprm () :locprm () :initplan <> :nprm 0 :scanrelid 1 } NOTICE: QUERY PLAN: Seq Scan on access_log (cost=0.00..16443.71 rows=11715 width=89) EXPLAIN Thanks & Ciao Alvar -- AGI Magirusstrasse 21B, 70469 Stuttgart Fon +49 (0)711.228 74-50, Fax +49 (0)711.228 74-88 +++news+++news+++news+++ Beste Image-Website 2001 kommt von AGI http://www.agi.de/tagebuch http://www.agi.com/diary (english)
> I have the following table, containing about 570000 Rows, but some > indexes are not used, on 7.1RC4, freshly vacuumed (analyse). It was the > same at least in 7.1RC1 > CREATE TABLE access_log( > access_time timestamp NOT NULL DEFAULT NOW(), > method_num int2 NOT NULL, > url_id int4 NOT NULL REFERENCES urls(id), > ); > CREATE INDEX method_idx ON access_log(method_num); > CREATE INDEX url_idx ON access_log(url_id); > url_idx seems OK: > But the others not: > logger=# EXPLAIN SELECT * FROM access_log WHERE method_num = 0; > Seq Scan on access_log (cost=0.00..16443.71 rows=559371 width=89) The parser does not know that your int4 constant "0" can be represented as an int2. Try SELECT * FROM access_log WHERE method_num = int2 '0'; (note the type coersion on the constant; there are other ways of specifying the same thing). For the other cases, PostgreSQL is estimating the query cost to be lower with a sequential scan. For the "SELECT 1" subselect case, it may be that the optimizer does not cheat and determine that there will be only one row returned, or that the query can be reformulated to use a simple constant. HTH - Thomas
> url_idx seems OK: > > logger=# EXPLAIN SELECT * FROM access_log WHERE url_id = 1000; > Index Scan using url_idx on access_log > (cost=0.00..3618.92 rows=1002 width=89) > > > > But the others not: > > logger=# EXPLAIN SELECT * FROM access_log WHERE method_num = 0; > Seq Scan on access_log (cost=0.00..16443.71 rows=559371 width=89) In this case you'd need to coerce the 0 to int2 in any case, but it figures that most of your rows are returned so a sequence scan will be faster. For an index scan, it's got to seek around the heap file doing random access to determine if rows are visible to your transaction which is more expensive than sequential reads, so at some point the optimizer will guess the sequence scan to be faster. > logger=# EXPLAIN SELECT * FROM access_log WHERE browser_id = 500; > Seq Scan on access_log (cost=0.00..16443.71 rows=7935 width=89) > > logger=# EXPLAIN SELECT * FROM access_log WHERE content_id = 20; > Seq Scan on access_log (cost=0.00..16443.71 rows=20579 width=89) Not sure on these two, it probably is estimating less disk access for doing the sequence scan, my guess would be that the break point is probably somewhere between the 1000 and 8000 row point for the two queries. And I believe the second was an int2, so you'll need to cast the 20. > And very strange: > > logger=# EXPLAIN SELECT * FROM access_log WHERE access_time > > '2001-04-10 10:10:10'; > Index Scan using access_time_idx on access_log > (cost=0.00..10605.12 rows=3251 width=89) > > logger=# EXPLAIN SELECT * FROM access_log WHERE access_time > > '2001-04-08 10:10:10'; > Seq Scan on access_log (cost=0.00..16443.71 rows=152292 width=89) Same as above, for 3000 rows it thinks index scan will be faster, for 152000 rows the sequence scan. > But: > > logger=# EXPLAIN SELECT * FROM access_log > WHERE url_id IN (SELECT 1); > Seq Scan on access_log (cost=0.00..16443.71 rows=572537 width=89) > SubPlan > -> Materialize (cost=0.00..0.00 rows=0 width=0) > -> Result (cost=0.00..0.00 rows=0 width=0) My guess is it doesn't realize that SELECT 1 is a constant that it can use the index for. IN (subselect) isn't handled very well right now. > Indexes are also not used for remote_ip, ORDER BY access_time > (timestamp), ORDER BY time_taken (interval), status, method_num etc. The > only I found where indexes are used is url_id! Any of the int2s will require explicit casting of a constant in order to use the index. I'm not sure on the others.
Thomas Lockhart wrote: > > The parser does not know that your int4 constant "0" can be represented > as an int2. Try > > SELECT * FROM access_log WHERE method_num = int2 '0'; hmmm, but its still a sequentiell scan: logger=# explain SELECT * FROM access_log WHERE method_num = int2 '0'; Seq Scan on access_log (cost=0.00..16443.71 rows=559371 width=89) But: Now I realised: the number of rows! :) If I make "WHERE method_num = int2 '2', then the index is used, interesting -- so it seems that the optimizer uses the value of the WHERE clause to check what might be faster and guesses, that an index scan is more overhead and slower. Nice! > For the other cases, PostgreSQL is estimating the query cost to be lower > with a sequential scan. hm, OK, but I guess, that he is estimating wrong ;) After re-reading the using-explain chapter in the docs I guess I understand the problems of estimating the number of rows ... Do you have any hints how to optimize the ..._cost-Values? Perhaps it is possible to write a test program, which checks out some good ..._cost-Values -- I'm volunteer, but I guess it should possible for this to force some optimizer results to measure the real time some different methods cost. > For the "SELECT 1" subselect case, it may be that the optimizer does not cheat and > determine that there will be only > one row returned, or that the query can be reformulated to use a simple > constant. yes, it was only an example -- i hope nobody is really so stupid and uses a "select 1" subselect ;) It might be an optimization, that the hole subselect is performed before the outer select is called, so the result of the subselect can be used in the query planer. Ciao Alvar -- AGI Magirusstrasse 21B, 70469 Stuttgart Fon +49 (0)711.228 74-50, Fax +49 (0)711.228 74-88 +++news+++news+++news+++ Beste Image-Website 2001 kommt von AGI http://www.agi.de/tagebuch http://www.agi.com/diary (english)
Thomas Lockhart wrote: > The parser does not know that your int4 constant "0" can be represented > as an int2. Try > > SELECT * FROM access_log WHERE method_num = int2 '0'; > > (note the type coersion on the constant; there are other ways of > specifying the same thing). Surely this is something that should be fixed. An int2 column ought to behave exactly like an int4 with a CHECK() constraint forcing the value to be in range. In object oriented terms: a smallint isA integer a integer isA bigint Likewise: a integer isA smallint if it falls in -32768..32767 a bigint isA integer if it falls in -2147483648..2147483647 Similar promotion rules should apply for all other numeric types. Any floating point value without a fractional part should be treated exactly like a big integer. The issues here are closely related to the 7.1 changes in INHERITS semantics. If any operator treats a smaller precision (more highly constrained) type in a materially different way than a compatible higher precision type, it is fundamentally broken for exactly the same reason that we expect a query on a super-class would be if if did not return all matching instances of every sub class. If a function is overloaded with multiple compatible scalar data types, the database should be free to call any matching implementation after performing an arbitrary number of *lossless* compatible type conversions. i.e. if you have f(smallint), f(integer), and f(double) the actual function called by f(0) should be undefined. The distinction between smallint '0', integer '0', and double '0' is meaningless and should be explicitly ignored. This is a little extreme, but I do not think it makes a lot of sense to maintain semantic differences between different representations of the same number. (Oracle certainly doesn't) Any comments? - Mark Butler
Hmm. The problem is as you describe, but the requirements for a solution are more severe than you (or I) would hope. We would like to have an extensible mechanism for type promotion and demotion, but it is not (yet) clear how to implement it. In this case, we must demote a constant assigned as "int4" by the parser into an "int2" to be directly comparable to the indexed column. We could probably do this with some hack code as a brute-force exercise, but no one has yet bothered (patches welcome ;) But in general, we must handle the case that the specified constraint is *not* directly convertible to the indexed type (e.g. is out of range) even though this would seem to reduce to a choice between a trivial noop or a sequential scan of the entire table. If we can do this without cluttering up the code too much, we should go ahead and do it, but it has apparently been a low priority. - Thomas
Good day, I've been experimenting a bit with Full Text Indexing in PostgreSQL. I have found several conflicting sites various places on the net pertaining to whether or not PostgreSQL supports FTI, and I was hoping I could find an authoritative answer here - I tried searching the website's archives, but the search seems to be having some problems. At any rate, I am running a CVS snapshot of 7.1, and I have been trying to create a full text index on a series of resumes. Some of these exceed 8k in size, which is no longer a storage problem of course with 7.1, but I seem to have run into the wicked 8k once again. Specifically: ERROR: index_formtuple: data takes 9344 bytes, max is 8191 Furthermore, after trying to just index on a 8191-character long substring of the resume, I run into the following: ERROR: btree: index item size 3948 exceeds maximum 2713 The only way I could actually get the index created was to substring the body of the resumes down to 2k. I also later tried using HASH rather than BTREE, which worked, but none of these solutions really appreciably increased performance in the way we were hoping. Are these known and accepted limitations of the current 7.1 implementation, or am I doing something terribly wrong? ;) On Tue, 10 Apr 2001, Thomas Lockhart wrote: >> I have the following table, containing about 570000 Rows, but some >> indexes are not used, on 7.1RC4, freshly vacuumed (analyse). It was the >> same at least in 7.1RC1 >> CREATE TABLE access_log( >> access_time timestamp NOT NULL DEFAULT NOW(), >> method_num int2 NOT NULL, >> url_id int4 NOT NULL REFERENCES urls(id), >> ); >> CREATE INDEX method_idx ON access_log(method_num); >> CREATE INDEX url_idx ON access_log(url_id); >> url_idx seems OK: >> But the others not: >> logger=# EXPLAIN SELECT * FROM access_log WHERE method_num = 0; >> Seq Scan on access_log (cost=0.00..16443.71 rows=559371 width=89) > >The parser does not know that your int4 constant "0" can be represented >as an int2. Try > > SELECT * FROM access_log WHERE method_num = int2 '0'; > >(note the type coersion on the constant; there are other ways of >specifying the same thing). > >For the other cases, PostgreSQL is estimating the query cost to be lower >with a sequential scan. For the "SELECT 1" subselect case, it may be >that the optimizer does not cheat and determine that there will be only >one row returned, or that the query can be reformulated to use a simple >constant. > >HTH > > - Thomas > >---------------------------(end of broadcast)--------------------------- >TIP 4: Don't 'kill -9' the postmaster > -- -- <COMPANY>CommandPrompt - http://www.commandprompt.com </COMPANY> <PROJECT>OpenDocs, LLC. - http://www.opendocs.org </PROJECT> <PROJECT>LinuxPorts - http://www.linuxports.com </PROJECT> <WEBMASTER>LDP - http://www.linuxdoc.org </WEBMASTER> -- Instead of asking why a piece of software is using "1970s technology," start asking why software is ignoring 30 years of accumulated wisdom. --
I believe that the basis for such a mechanism should be a model of the semantic type inheritance for primitive data types. Note that type inheritance is a completely different concept than representation inheritance, as witnessed by the confusion over the now implemented proposal to correct the semantics of table inheritance. Logically, a sub-type expresses the idea that any instance of the sub-type is also an instance of the super-type. For example, semantically speaking, a smallint is an integer because the set of all small integers is a subset of the set of all integers. We could represent this fact with something like a the pg_inherits table with entries for conversion functions to convert the canonical representation of the sub-type into the canonical representation of the super-type and vice versa. In a normal implementation, the index scan boundary values should be stored internally using the representation of the lowest common super-type. That way you can get a correct result for queries like(*): select * from table where smallint_column < 100000 Alternatively, the query engine could internally down cast the value to be compared to the index column type extended with flags like the following: COMPATIBLE_VALUE_GREATER - value is comparable and always greater than any instance of columntype COMPATIBLE_VALUE_LESS - value is comparable and always less than any instance of column type INCOMPATIBLE_VALUE - value is not comparable to column type The type down-conversion function would need to clear the resulting value and set the appropriate flag if the conversion does not succeed. The flags would then be used to calculate which index scan boundary values are equivalent to the original query predicate by substituting the maximum and minimum allowed values of the column type as appropriate. I have not looked at the source code in detail yet, but I believe the basic idea is sound. - Mark Butler Note: Oracle avoids this whole problem for numeric types by using a common variable precision format for *all* numbers. The nice thing is that you can increase the precision / scale of any numeric column without touching the data in each row. Thomas Lockhart wrote: > > Hmm. The problem is as you describe, but the requirements for a solution > are more severe than you (or I) would hope. > > We would like to have an extensible mechanism for type promotion and > demotion, but it is not (yet) clear how to implement it. In this case, > we must demote a constant assigned as "int4" by the parser into an > "int2" to be directly comparable to the indexed column. We could > probably do this with some hack code as a brute-force exercise, but no > one has yet bothered (patches welcome ;) But in general, we must handle > the case that the specified constraint is *not* directly convertible to > the indexed type (e.g. is out of range) even though this would seem to > reduce to a choice between a trivial noop or a sequential scan of the > entire table. If we can do this without cluttering up the code too much, > we should go ahead and do it, but it has apparently been a low priority. > > - Thomas
Poet/Joshua Drake wrote: > > Good day, > > I've been experimenting a bit with Full Text Indexing in PostgreSQL. I > have found several conflicting sites various places on the net pertaining > to whether or not PostgreSQL supports FTI, and I was hoping I could find > an authoritative answer here - I tried searching the website's archives, > but the search seems to be having some problems. > > At any rate, I am running a CVS snapshot of 7.1, and I have been trying to > create a full text index on a series of resumes. Some of these exceed 8k > in size, which is no longer a storage problem of course with 7.1, but I > seem to have run into the wicked 8k once again. Specifically: > > ERROR: index_formtuple: data takes 9344 bytes, max is 8191 > > Furthermore, after trying to just index on a 8191-character long substring > of the resume, I run into the following: > > ERROR: btree: index item size 3948 exceeds maximum 2713 > > The only way I could actually get the index created was to substring the > body of the resumes down to 2k. I also later tried using HASH rather than > BTREE, which worked, but none of these solutions really appreciably > increased performance in the way we were hoping. > > Are these known and accepted limitations of the current 7.1 > implementation, or am I doing something terribly wrong? ;) > On Tue, 10 Apr 2001, Thomas Lockhart wrote: You need to use the 'contrib' code for full-text indexing. The indexing you are trying to do with that is just using the whole content of the string as the index value. Close to useless. The contrib code is in contrib/fulltextindex. I have a hacked version of that which changes it to keyword indexing, if you're interested. Regards, Andrew. -- _____________________________________________________________________ Andrew McMillan, e-mail: Andrew@catalyst.net.nz Catalyst IT Ltd, PO Box 10-225, Level 22, 105 The Terrace, Wellington Me: +64 (21) 635 694, Fax: +64 (4) 499 5596, Office: +64 (4) 499 2267
At 4/10/2001 02:42 PM, Thomas Lockhart wrote: >Hmm. The problem is as you describe, but the requirements for a solution >are more severe than you (or I) would hope. > >We would like to have an extensible mechanism for type promotion and >demotion, but it is not (yet) clear how to implement it. In this case, >we must demote a constant assigned as "int4" by the parser into an >"int2" to be directly comparable to the indexed column. We could >probably do this with some hack code as a brute-force exercise, but no >one has yet bothered (patches welcome ;) But in general, we must handle >the case that the specified constraint is *not* directly convertible to >the indexed type (e.g. is out of range) even though this would seem to >reduce to a choice between a trivial noop or a sequential scan of the >entire table. If we can do this without cluttering up the code too much, >we should go ahead and do it, but it has apparently been a low priority. What about going the other way around... Promote the int2 to an int4 (lossless). Actually for all int1,int2 datatypes (regardless of whether it was the constant or the column) you could promote all to a common int4 and then do comparisons. Promoting all to int8 and then doing a comparison would be excessively slow.
> What about going the other way around... Promote the int2 to an int4 > (lossless). Actually for all int1,int2 datatypes (regardless of whether it > was the constant or the column) you could promote all to a common int4 and > then do comparisons. That is why the index is not used: the backend is promoting all of the int2 column values to int4 for the comparison, and concludes that the available index is not relevant. The index traversal code would need to know how to promote individual values in the index for comparison, which is an interesting idea but I haven't thought about how efficient it would be. Clearly the cost would be different than a simple comparison. - Thomas
> Hmm. I'm pretty sure that a single index on the entire contents of a > resume *as a single field* is close to useless. And an index on an 8k > piece is also useless. Presumably you really want an index covering each > significant word of each resume, in which case you would not run into > the 4k limit (or 2k limit? it is documented somewhere) on the size of an > *index* field (which is still a limitation on PostgreSQL built with the > standard 8k block size. Of course, you can build with a larger block > size). Just an FYI.. I asked the other day and someone (Tom?) told me it was about 2k.. -Mitch
There are several ways to solve the problem: 1. Convert to common numeric format for all numbers, ala Oracle 2. Promote for comparison during the index scan 3. Promote index boundary values for comparison in query planner only Convert back to index column type for actual scan Option 1 doesn't solve the general problem, has a space / performance penalty, and would be a major change. Option 2 involves making serious changes to every index access method, and also has a performance penalty. Option 3 appears to me to be the way to go. The main general requirement is method similar to typeInheritsFrom() in backend/parser/parse_func.c to determine whether a true promotion is possible for a pair of non-complex data types. One thing I am not clear on is how much re-planning is done when a query is executed with different parameter values. If re-planning is not done, is it acceptable to make minor plan changes according to the parameter values? For example, it would be necessary to change a "<" operator to a "<=" operator to get proper index scan behavior on a smallint index if the original right hand side was greater than 32767. - Mark Thomas Lockhart wrote: > That is why the index is not used: the backend is promoting all of the > int2 column values to > int4 for the comparison, and concludes that the available index is not > relevant. > > The index traversal code would need to know how to promote individual > values in the index for comparison, which is an interesting idea but I > haven't thought about how efficient it would be. Clearly the cost would > be different than a simple comparison.