Thread: visualizing B-tree index coverage
Does anyone know of a tool that allows one to visualize the tree created by a multi-column B-tree index? A picture of a tree with branches, showing how "branchy" the tree is, would be great. I'm wondering how well I've "clustered" the data in my table using the multi-column index. In other words, do my index columns discriminate rows from each other sufficiently, but not excessively? Do too many rows share the same index value? (not enough branches) Does every row have a unique index value? (way too many branches)

Thanks,
TJ
Useful explanation of the PostgreSQL index format: http://www.faqs.org/docs/ppbook/c13329.htm

I think you are aiming for the wrong thing. The worst possible index is one with every value the same. The second worst (still basically useless) is one with only two values. The greater the differentiation of the data, the more the workload of a search is reduced.

Since it isn't a straight binary tree, I don't think that having highly dissimilar data in the index should be a problem. Do you have data or experience that shows otherwise?
Since I'm using a multi-column index, I can greatly influence the nature of the index created, depending on which columns I use and how many. I'm searching for an optimal set of columns that creates an index that certainly does not have every value the same, nor only two values. Instead, I want to see how well I've spread the index out over the data (if that phrasing makes sense).

More specifically, I have character data representing molecular structures. I've written (rather slow) search functions. I can create any number of columns that "fingerprint" each structure, e.g. # carbon atoms, # nitrogen atoms, # single bonds, etc. I expect my fingerprints will not be unique (fingerprint may be a poor analogy), but rather will classify similar structures together. I create a multi-column index on these counts and get about a 2-3x speedup using 13 columns right now. For example:

select count(smiles) from structure
 where oe_matches(smiles,'c1ccccc1CC(=O)NC')

takes about 15 seconds.

select count(smiles) from structure
 where (_c, _n, _o, _s, _p, _halo,
        _arom_c, _arom_n, _arom_o, _arom_s,
        _atoms, _single_bonds, _other_bonds)
    >= (3,1,1,0,0,0, 6,0,0,0, 11,4,7)
   and oe_matches(smiles,'c1ccccc1CC(=O)NC')

takes about 6 seconds when (_c, ...) is a multi-column index.

The data isn't inherently structured in any way that invites some particular number of columns for indexing. I don't want to use too many, nor too few, columns. I also want to optimize the nature (which atom types, bond types, etc.) of the count columns. While I could do this and use the speedup as the measure of success, I think that if my B-tree were "covering" the data well, I would get the best results. By covering I mean finding that optimal situation where there is not one index value for all rows and also not a unique index value for every row - something in between would be ideal. Or is that basically a wrong idea?

TJ
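For reference, the multi-column index described above would look something like the following sketch (the index name is invented; the column list is taken from the query above):

create index structure_fp_idx on structure
  (_c, _n, _o, _s, _p, _halo,
   _arom_c, _arom_n, _arom_o, _arom_s,
   _atoms, _single_bonds, _other_bonds);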
Normally, a unique, clustered index is the silver bullet for the best performance (but my experience with unique clustered indexes is largely in non-PostgreSQL database systems, so take it with a grain of salt). I do not see any extra expense for a unique index versus one that is mostly unique. Further, if an index is unique, that should be an excellent optimizer hint for query acceleration.

If you know what queries you run most frequently, I would tailor the index for optimal query execution via the join columns and the columns often involved in WHERE-clause filtering. If it is easily possible to make a unique index, I would definitely time the queries with a unique index as well. I suspect that the unique index will fare better unless there is something odd about your data.
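In PostgreSQL terms, the closest analogue of a unique clustered index would be a sketch like this - it assumes smiles really is unique, which the thread never confirms, and the index name is invented:

create unique index structure_smiles_uniq on structure (smiles);
-- 8.0-era syntax: CLUSTER indexname ON tablename
cluster structure_smiles_uniq on structure;

Note that CLUSTER in PostgreSQL is a one-time physical reordering of the table, not a self-maintaining clustered index.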
Some other things that are important: how much is the data in flux (updates/deletes/inserts)? If the data is mostly or entirely static, you can add many special-case indexes with little penalty. The biggest cost of indexes (besides the disk space consumed) is the slowdown of inserts, updates, and deletes. If the data hardly changes, you can throw on index after index with little cost. But if the data is in huge flux, you will have to be careful about performance targets for each index you add.

This stuff may prove to be of great value: http://www.postgresql.org/docs/8.0/interactive/monitoring-stats.html

I would also run EXPLAIN against every distinct sort of query you plan to execute (unless it is for ad-hoc reporting, in which case such a requirement cannot be met).
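As a rough sketch of what those statistics views can tell you, assuming the statistics collector is enabled (the view and column names are from the linked docs; 'structure' is the table from this thread), you can compare sequential versus index scans per table:

select relname, seq_scan, idx_scan
from pg_stat_user_tables
where relname = 'structure';

and see which individual indexes are actually being used:

select indexrelname, idx_scan
from pg_stat_user_indexes
where relname = 'structure';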
Excuse me for bothering, but what kind of search engine have you developed? Does it look like set comparison?

Regards,
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
oleg@sai.msu.su, http://www.sai.msu.su/~megera/
I think you missed an important "feature" of multicolumn indexes, namely that you had better not use OR in your expressions. You seem to want to use only '>=', so this should be OK.

Suppose you have 3 columns a, z, e containing values linearly distributed between ...

select min(a),max(a),min(z),max(z),min(e),max(e) from test;
 min | max | min | max | min | max
-----+-----+-----+-----+-----+-----
   0 |  13 |   0 |  99 |   0 |  99

For instance, the following query is indexed:

explain analyze select * from test where a>=0 and z>=90 and e>=0;
                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Index Scan using testa on test  (cost=0.00..1637.56 rows=11345 width=16) (actual time=0.085..51.441 rows=13000 loops=1)
   Index Cond: ((a >= 0) AND (z >= 90) AND (e >= 0))
 Total runtime: 56.307 ms

The following is only partially indexed:

explain analyze select * from test where (a=1 or a=2) and (z=1 or z=8) and e>=0;
                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Index Scan using testa, testa on test  (cost=0.00..3269.06 rows=346 width=16) (actual time=0.328..52.961 rows=400 loops=1)
   Index Cond: ((a = 1) OR (a = 2))
   Filter: (((z = 1) OR (z = 8)) AND (e >= 0))
 Total runtime: 53.297 ms

You see the 'Index Cond' field, which is what determines the fetched rows; those rows are then filtered with the 'Filter' expression. Having the most selective index condition is important because it diminishes the number of rows to be fetched. However, in your case the filter expression is also beneficial, because any row eliminated by the filter will not need to go through your expensive matching function. In this case:

SELECT count(*) FROM test;
=> 131072

SELECT count(*) FROM test WHERE ((a = 1) OR (a = 2));
=> 20000

SELECT count(*) FROM test WHERE (a=1 or a=2) and (z=1 or z=8) and e>=0;
=> 400

Here the index fetches 20k rows out of 131072, but only 400 are used...

If you don't use OR, index use is more likely:

explain analyze select * from test where (a,z,e) >= (0,50,80);
                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Index Scan using testa on test  (cost=0.00..1669.78 rows=12627 width=16) (actual time=0.087..58.316 rows=13000 loops=1)
   Index Cond: ((a >= 0) AND (z >= 50) AND (e >= 80))
 Total runtime: 63.049 ms

Here you have a full index scan.

To determine the efficiency of your indexes, you can thus use this method: look at the 'Index Cond' and 'Filter' expressions, and count the rows matched by each.
TJ O'Donnell wrote:
> select count(smiles) from structure
>  where (_c, _n, _o, _s, _p, _halo,
>         _arom_c, _arom_n, _arom_o, _arom_s,
>         _atoms, _single_bonds, _other_bonds)
>     >= (3,1,1,0,0,0, 6,0,0,0, 11,4,7)
>    and oe_matches(smiles,'c1ccccc1CC(=O)NC')

Is 3 really only a lower bound for the number of carbon atoms (which, I guess, is what _c represents), or will there be exactly 3 carbon atoms in the molecules found? If not, can you estimate an upper bound as well as a lower bound? If you can, then you could further optimize this with a range query (i.e. select ... from ... where (...) >= (...) and (...) <= (...) and oe_matches(...)). If you can only determine bounds (upper _and_ lower) for some of the "fingerprint" columns, then I would suggest you choose those for the index, and use them as the "first" columns (because otherwise a range scan won't work).

ANALYZE gathers some statistics about the distribution of values in a table - I guess you could extract at least an approximation of "index coverage" from there. Since the cost of index inserts/updates/deletes is AFAIK largely independent of the number of columns the index consists of, I'd suggest that you add columns until the stats show that it's nearly a unique index - but prefer columns for which you can compute strong upper and lower bounds, and put those "in front".

> While I could do this and use the speedup as the measure of success, I think
> that if my B-tree were "covering" the data well, I would get the best results.
> ... something in between would be ideal. Or is that basically a wrong idea?

I believe it depends on the cost of executing one oe_matches call - according to you, those costs are quite heavy, so every call you can omit by adding more columns to the index is worth the performance penalty.

If you need fast adding of data as well as fast searches, you could add the data to a "pending" table that is not indexed, and union the searches over both tables. At low-traffic times (say, at night), you could move the data from the "pending" table to the "structure" table. This of course only works if the number of tuples inserted between two move-runs is much smaller than the number of tuples in "structure".

greetings, Florian Pflug
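A minimal sketch of the range-query idea, restricted to the first few fingerprint columns (the column names come from the thread; the upper-bound values 9, 4, 3 are invented here for illustration - real bounds would have to be derived from the query structure):

select count(smiles) from structure
 where (_c, _n, _o) >= (3, 1, 1)
   and (_c, _n, _o) <= (9, 4, 3)
   and oe_matches(smiles,'c1ccccc1CC(=O)NC');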
I realize that using OR will not result in an index scan. I will never be interested in an OR condition for the kinds of searches I use. In my SELECT statements, I always name every column of the multi-column index, in the same order in which they were named when creating the index. I always use the >= condition, and very rarely the = condition.

However, I am concerned that I must place the most selective column first in my index. I cannot tell, a priori, which column will be most selective. That depends on the nature of the search, which can vary widely each time. Are you saying that if my first column is not selective, even though the remaining columns are, the planner may choose not to use the index after seeing that the first column is not very selective? That seems like an oversight, IMHO. Shouldn't the overall effect of using all the columns be considered before choosing not to use an index scan?

Since I'm using every column of my multi-column index for every search, and I always use >=, EXPLAIN ANALYZE always shows that every column is considered in the index scan. However, that is only when the index scan is used. Sometimes EXPLAIN ANALYZE shows it is not used. That appears to happen when my search condition is very general. This is to be expected, so I am not worried. Most of my searches will be intermediate, namely not VERY selective, but also not VERY general. So the idea of the multi-column index is to "characterize" each row sufficiently, even when it is a perfectly ordinary row with no ONE feature being distinctive, but rather several features together giving it its distinctive character. That is my interpretation of the multi-column index.

TJ
"TJ O'Donnell" <tjo@acm.org> writes: > However, I am concerned that I must place the most selective column first in > my index. I cannot tell, a priori, which column will be most selective. That > depends on the nature of search, which can vary widely each time. If you're always using > operators then perhaps you would be better off with three separate indexes on <x>, <y>, and <z> rather than one on <x,y,z>. It depends on your use case. It's also possible you would be better off with a GiST index on the array [x,y,z] and using the GiST indexable operators in contrib/intarray. However there are disadvantages to these addons as well. > Are you saying that if my first column is not selective, even though the > remaining columns are, the planner may choose not to use the index after > seeing that the first column is not very selective? That seems like an > oversight, IMHO. Shouldn't the overall effect of using all the columns be > considered before choosing not to use an index scan? If your index is on <x,y,z> and you say x=? and y=? and z=? then it's best to have x be the most selective column but it's only a question of degree. The index will be useful in basically the same situations, it'll just perform somewhat better. On the other hand if your index is on <x,y,z> and you say "x>? and y=? and z=?" then the remaining qualifications after x>? are useless. Postgres simply cannot use them in the index lookup. If your index was on <y,z,x> then Postgres can use it to go straight to the start of the records that match and read them from the index until they stop matching. Consider the analogous situation if you were using an index of a book. I tell you to look up everything in the encylopedia that starts with a letter after "p". And by the way the second letter has to be an "x" and the third letter a "y". You will more or less have to skim through everything after "p" even though you know the second and third letter because the matching entries will be scattered throughout the remaining pages. Actually that's not entirely true, you could check each letter after "p" and look up specifically "qxy...", "rxy...", "sxy...", "txy...", etc right up to "xzy...". But unfortunately that's just not something Postgres currently knows how to do. Even then you can see that it's much less efficient than just reading a contiguous section of records out in order. -- greg
> In my SELECT statements, I always name every column of the multi-column
> index, in the same order in which they were named when creating the index.

All of the leftmost index columns must be named, but the order in which they appear in the query is unimportant. You can use (a BETWEEN x AND y) instead of (a>=x AND a<=y); it is cleaner.

> However, I am concerned that I must place the most selective column first
> in my index. ... Are you saying that if my first column is not selective,
> even though the remaining columns are, the planner may choose not to use
> the index after seeing that the first column is not very selective?

I thought this was true, but I made some tests and the index scanner is smart. Try this:

CREATE TABLE test (id serial primary key,
  a INTEGER, z INTEGER, e INTEGER, r INTEGER, t INTEGER, y INTEGER) WITHOUT OIDS;

INSERT 1M rows into the table using a plpgsql function, with a,z,e,r,t,y each being floor(random()*10), for instance (see the sketch just after this message). Then you can try various selects. a,z,e,r,t,y are a linear distribution between 0 and 9 inclusive, so a>=A AND z>=Z ... y>=Y gives a result set of about (10-A)*(10-Z)*...*(10-Y) rows.

You'll see the planner will use an index scan when needed. You can try the easiest case (a>=9), which just explores one part of the tree, and the worst case (y>=9), which explores a part of all leaves. Both should yield about the same number of results, but the first should be faster. To know how much, just try ;)

> That seems like an oversight, IMHO. Shouldn't the overall effect of using
> all the columns be considered before choosing not to use an index scan?

I think it is. There are no cross-column correlation stats, though.

> So the idea of the multi-column index is to "characterize" each row
> sufficiently, even when it is a perfectly ordinary row with no ONE feature
> being distinctive, but rather several features together giving it its
> distinctive character.

If you have some features which are highly selective, you can create a single-column index on them. It won't be used often, but when it is, it will really work.
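A minimal sketch of the suggested test setup; it uses a plain INSERT ... SELECT over generate_series instead of the plpgsql loop the message mentions (either works, this is just shorter; the index name is invented):

INSERT INTO test (a, z, e, r, t, y)
SELECT floor(random()*10)::int, floor(random()*10)::int, floor(random()*10)::int,
       floor(random()*10)::int, floor(random()*10)::int, floor(random()*10)::int
FROM generate_series(1, 1000000);

CREATE INDEX test_azerty ON test (a, z, e, r, t, y);
ANALYZE test;

-- Easiest case: the leading column is the selective one.
EXPLAIN ANALYZE SELECT * FROM test
 WHERE a>=9 AND z>=0 AND e>=0 AND r>=0 AND t>=0 AND y>=0;

-- Worst case: only the trailing column is selective.
EXPLAIN ANALYZE SELECT * FROM test
 WHERE a>=0 AND z>=0 AND e>=0 AND r>=0 AND t>=0 AND y>=9;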
Rather than having the most selective item first, put the most frequently used item first. So (for instance) if your database is an organic chemistry database and the columns were elements, then carbon might be a logical first choice, hydrogen second, etc. for your index.

What you want in creating the index is to choose one where the items requested for filtering (in either the WHERE clause or the join columns) are very sure to exist somewhere in the early part of the index. A big problem arises if, for all indexes, there is no instance where the first column of the index is referenced in the WHERE clause: that means you will do a complete table scan. So create the index so that the things queries are most likely to ask for come earliest in the index. More important than selectivity is the probability of the column being included in the filter part of the request.
I have a table of about 5 million rows, 24 columns. The integer column _c is B-tree indexed (as are _n, _o, and 3 others). This I understand and like:

Explain Analyze Select count(smiles) from structure where _c = 30

 Aggregate  (cost=105595.11..105595.11 rows=1 width=32) (actual time=17.722..17.724 rows=1 loops=1)
   ->  Index Scan using "Nc" on structure  (cost=0.00..105528.89 rows=26486 width=32) (actual time=0.098..16.095 rows=734 loops=1)
         Index Cond: (_c = 30)
 Total runtime: 18.019 ms

This I don't get. Why is an index scan not used? Isn't an index supposed to help when using > < >= <= too?

Explain Analyze Select count(smiles) from structure where _c >= 30

 Aggregate  (cost=196033.74..196033.74 rows=1 width=32) (actual time=42133.432..42133.434 rows=1 loops=1)
   ->  Seq Scan on structure  (cost=0.00..191619.56 rows=1765669 width=32) (actual time=8050.437..42117.062 rows=1569 loops=1)
         Filter: (_c >= 30)
 Total runtime: 42133.746 ms

TJ
> This I don't get. Why is an index scan not used? Isn't an index
> supposed to help when using > < >= <= too?

It should! Look at these two lines:

-> Index Scan using "Nc" on structure (cost=0.00..105528.89 rows=26486 width=32) (actual time=0.098..16.095 rows=734 loops=1)
-> Seq Scan on structure (cost=0.00..191619.56 rows=1765669 width=32) (actual time=8050.437..42117.062 rows=1569 loops=1)

In the index-scan case, the planner thinks it'll get rows=26486 but in reality gets only 734 rows. In the seq-scan case, the planner thinks it'll get rows=1765669 but in reality gets only 1569 rows. Both estimates are way off the mark. 26486 still makes it choose an index scan because that's a small fraction of the table, but 1765669 is not. Analyze, use more precise statistics (ALTER TABLE ... SET STATISTICS), whatever... but you have got to get the planner estimating these rowcounts correctly.
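A minimal sketch of that suggestion (the statistics target of 100 is an arbitrary choice here; the default in this era of PostgreSQL is 10):

ALTER TABLE structure ALTER COLUMN _c SET STATISTICS 100;
ANALYZE structure;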
"TJ O'Donnell" <tjo@acm.org> writes: > This I don't get. Why is an index scan not used? Isn't an index supposed > to help when using > < >= <= too? > Explain Analyze Select count(smiles) from structure where _c >= 30 > Aggregate (cost=196033.74..196033.74 rows=1 width=32) (actual time=42133.432..42133.434 rows=1 > loops=1) > -> Seq Scan on structure (cost=0.00..191619.56 rows=1765669 width=32) (actual > time=8050.437..42117.062 rows=1569 loops=1) > Filter: (_c >= 30) Have you ANALYZEd the table lately? That rowcount estimate is off by about three orders of magnitude :-( regards, tom lane
> "TJ O'Donnell" <tjo@acm.org> writes: > > > -> Seq Scan on structure (cost=0.00..191619.56 rows=1765669 width=32) (actual > > time=8050.437..42117.062 rows=1569 loops=1) ^^^^^^^^ And do you vacuum regularly? Have you done batch updates or deletes and then never inserted enough records to reuse that dead space? You might have to vacuum full (or alternatively use the cluster command) to reclaim this dead space. I don't recall what the original motivation to rewrite the analyze sampling was, did having lots of dead tuples cause bad estimates in 7.4? -- greg
I had thought that creating the index would do something equivalent to ANALYZE. I tried ANALYZE VERBOSE and it improved the planner's ability to predict when an index would be useful.

Last week, I asked about visualizing B-tree "coverage". I think I meant: can I see the histograms that ANALYZE creates? Are they available anywhere? The docs mention them (bins), and I was hoping ANALYZE VERBOSE would show them to me.

TJ
On Wed, Feb 02, 2005 at 06:51:13AM -0800, TJ O'Donnell wrote:
> I had thought that creating the index would do something equivalent
> to ANALYZE. I tried ANALYZE VERBOSE and it improved the planner's
> ability to predict when an index would be useful.

CREATE INDEX creates an index; ANALYZE collects statistics. Neither happens without being asked for...

> Last week, I asked about visualizing B-tree "coverage". I think I meant:
> can I see the histograms that ANALYZE creates? Are they available
> anywhere? The docs mention them (bins), and I was hoping ANALYZE VERBOSE
> would show them to me.

Maybe pg_statistic? You may need the OID of the column definition to work out what goes where...

Hope this helps,
-- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
"TJ O'Donnell" <tjo@acm.org> writes: > Last week, I asked about visualizing B-tree "coverage". I think > I meant "Can I see the histograms that Analyze creates?" > Are they available anywhere? See pg_stats regards, tom lane
On 02 Feb 2005 00:58:17 -0500, Greg Stark <gsstark@mit.edu> wrote:
> I don't recall what the original motivation for rewriting the analyze
> sampling was; did having lots of dead tuples cause bad estimates in 7.4?

The original intention was to make analyzing huge tables faster. But the patch turned out to be _slower_ for mid-sized tables, so it was sold to the community as a way to produce better statistics for relations with many dead tuples near the start.

Servus, Manfred