Thread: Better index strategy for many fields with few values
Hi,
I want to optimize something like this.
- My items table:
code int -- can take one of 100 values
property varchar(250) -- can take one of 5000 values
param01 char(10) -- can take one of 10 values
param02 char(10) -- can take one of 10 values
...
[ 20 similar columns ]
...
param20 char(10) -- can take one of 10 values
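Written out as DDL, the layout above is roughly the following (param03 through param19 elided the same way):

    create table items (
        code     int,            -- one of ~100 values
        property varchar(250),   -- one of ~5000 values
        param01  char(10),       -- one of ~10 values each
        param02  char(10),
        -- ... param03 through param19 ...
        param20  char(10)
    );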
- The kind of query I want to optimize:
select * from items
where code between 5 and 22
and param01 = 'P'
and param02 = 'G'
...
[ all the 20 paramXX columns are used in the query ]
...
and param20 = 'C';
How can I optimize this kind of query?
I was thinking about using a multicolumn index, but I have read that we should limit multicolumn indexes to at most 2 or 3 columns.
If that's true, then 22 columns for a multicolumn index seems way too much. Or maybe it is workable, as every column uses only a very limited set of values?
I was also thinking about using a functional index.
What do you think would be the best solution in such a case?
Thanks.
Oscar
> - My items table:
>   code int -- can take one of 100 values
>   property varchar(250) -- can take one of 5000 values
>   param01 char(10) -- can take one of 10 values
>   param02 char(10) -- can take one of 10 values
>   ...
>   [ 20 similar columns ]
>   ...
>   param20 char(10) -- can take one of 10 values

Instead of 20 columns, you could use a single "param" field containing an array of 20 TEXT values. Then create a simple index on (code, param) and

    SELECT ... WHERE code BETWEEN ... AND param = '{P,G,....,C}'

If you don't want to modify your structure, you can create a functional index on the array {param1...param20}, but your queries will be a bit uglier.
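As a sketch of that layout, shrunk to three params so the literals are complete (the names are illustrative, and this assumes your PostgreSQL version has a default btree operator class for arrays, so check that the index is actually usable on yours):

    CREATE TABLE items (
        code     int,
        property varchar(250),
        params   text[]          -- replaces param01 .. param20
    );

    CREATE INDEX items_code_params ON items (code, params);

    -- equality on the whole array can use the (code, params) index:
    SELECT * FROM items
     WHERE code BETWEEN 5 AND 22
       AND params = '{P,G,C}';   -- one element per former paramXX column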
Hi, Oscar,

Oscar Picasso wrote:
> [ all the 20 paramXX columns are used in the query ]
> How can I optimize this kind of query?

PostgreSQL 8.1 has so-called bitmap index scans, which can combine several index scans before actually accessing the data. So I think it's best to create an index on each of the paramXX columns, and see with EXPLAIN ANALYZE what it is doing.

> I was thinking about using a multicolumn index, but I have read that
> we should limit multicolumn indexes to at most 2 or 3 columns.

Yes, that's true, the index overhead gets too high.

> If that's true, then 22 columns for a multicolumn index seems way too
> much. Or maybe it is workable, as every column uses only a very limited
> set of values?

Yes, I think that a 22-column index is way too much, especially with the new bitmap index scans available.

> I was also thinking about using a functional index.

If there's a logical relation between those values so that they can easily be combined, that may be a good alternative.

I just had another weird idea:

As your paramXX columns can have only 10 values each, it also might be feasible to use a bunch of 10 conditional indices, like:

    CREATE INDEX foo1 ON items (param1, param2) WHERE param0 = '1st value';
    CREATE INDEX foo2 ON items (param1, param2) WHERE param0 = '2nd value';
    CREATE INDEX foo3 ON items (param1, param2) WHERE param0 = '3rd value';
    [...]

This way, you don't have the index bloat of a 3-column index, but 10 2-column indices that cover 1/10th of the table each. For 22 columns, you'd need a bunch of seven such indices plus a single-column one, or you can use some 3+1 and some 2+1 column indexes.

I'd like to see the query plans from EXPLAIN ANALYZE.

Btw, I expect query planning time to get rather significant for so many columns, so GEQO tuning, tuning work_mem (for the bitmap scans) and prepared statements will pay off.

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
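To make the per-column variant concrete, it would amount to something like this (index names are just examples):

    CREATE INDEX items_param01 ON items (param01);
    CREATE INDEX items_param02 ON items (param02);
    -- ... one single-column index per paramXX column, through param20 ...

    EXPLAIN ANALYZE
    SELECT * FROM items
     WHERE code BETWEEN 5 AND 22
       AND param01 = 'P'
       AND param02 = 'G';
    -- in the plan, look for a BitmapAnd over several Bitmap Index Scans,
    -- feeding a single Bitmap Heap Scan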
On Wed, Apr 12, 2006 at 02:59:32PM +0200, Markus Schaber wrote:
> > I was thinking about using a multicolumn index, but I have read that
> > we should limit multicolumn indexes to at most 2 or 3 columns.
>
> Yes, that's true, the index overhead gets too high.
>
> > I was also thinking about using a functional index.
>
> If there's a logical relation between those values so that they can
> easily be combined, that may be a good alternative.

How would that be any better than just doing a multi-column index?

> I just had another weird idea:
>
> As your paramXX columns can have only 10 values each, it also might be
> feasible to use a bunch of 10 conditional indices, like:
>
> CREATE INDEX foo1 ON items (param1, param2) WHERE param0 = '1st value';
> CREATE INDEX foo2 ON items (param1, param2) WHERE param0 = '2nd value';
> CREATE INDEX foo3 ON items (param1, param2) WHERE param0 = '3rd value';
> [...]

Not all that weird; it's known as index partitioning.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Oscar,

On 4/10/06 9:58 AM, "Oscar Picasso" <oscgoogle@yahoo.com> wrote:

> - My items table:
>   code int -- can take one of 100 values
>   property varchar(250) -- can take one of 5000 values
>   param01 char(10) -- can take one of 10 values
>   param02 char(10) -- can take one of 10 values
>   ...
>   [ 20 similar columns ]
>   ...
>   param20 char(10) -- can take one of 10 values
>
> - The kind of query I want to optimize:
>   select * from items
>   where code between 5 and 22
>   and param01 = 'P'
>   and param02 = 'G'
>   ...
>   [ all the 20 paramXX columns are used in the query ]
>   ...
>   and param20 = 'C';

Bizgres 0.9 has an on-disk bitmap index which will likely improve this query's speed by a very large amount over normal Postgres 8.1.

- Luke
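The exact DDL depends on Bizgres; assuming it exposes the bitmap index as an ordinary access method (the syntax below is an assumption, so verify it against the Bizgres documentation), the low-cardinality paramXX columns would be the natural candidates:

    -- assumed access-method syntax; check the Bizgres docs
    CREATE INDEX items_param01_bm ON items USING bitmap (param01);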
Adding -performance back in
-----Original Message-----
From: Oscar Picasso [mailto:oscgoogle@yahoo.com]
Sent: Wednesday, April 12, 2006 5:51 PM
To: Jim Nasby
Subject: Re: [PERFORM] Better index strategy for many fields with few values

I would like to try it.
However, in another post I added that, contrary to what I stated initially, not all the paramXX columns are mandatory in the query. So it seems that requirement makes the problem more complex.
Doesn't this new requirement rule out this solution?
No, just group the columns logically.
By the way, I have tested indexing each column individually to check what happens in relation to bitmap scans. My test table is 1 million rows. The EXPLAIN ANALYZE command shows that a bitmap scan is sometimes used, but I still end up with queries that can take up to 10s, which is way too much.
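If the bitmap scans are starved for memory, raising work_mem for the session is a cheap thing to test (8.1 takes the value in kB; 32768 below is just a starting point, not a recommendation):

    SET work_mem = 32768;  -- 32 MB, for this session only
    EXPLAIN ANALYZE
    SELECT * FROM items
     WHERE param01 = 'P' AND param02 = 'G' AND param03 = 'C';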
"Jim C. Nasby" <jnasby@pervasive.com> wrote:On Wed, Apr 12, 2006 at 02:59:32PM +0200, Markus Schaber wrote:
> > I was thinking about using a multicolumn index, but I have read that
> > we should limit multicolumn indexes to at most 2 or 3 columns.
>
> Yes, that's true, the index overhead gets too high.
>
> > I was also thinking about using a functional index.
>
> If there's a logical relation between those values so that they can
> easily be combined, that may be a good alternative.
How would that be any better than just doing a multi-column index?
> I just had another weird idea:
>
> As your paramXX columns can have only 10 values each, it also might be
> feasible to use a bunch of 10 conditional indices, like:
>
> CREATE INDEX foo1 ON items (param1, param2) WHERE param0 = '1st value';
> CREATE INDEX foo2 ON items (param1, param2) WHERE param0 = '2nd value';
> CREATE INDEX foo3 ON items (param1, param2) WHERE param0 = '3rd value';
> [...]
Not all that weird; it's known as index partitioning.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Hi, Jim,

Jim C. Nasby wrote:
>>> I was also thinking about using a functional index.
>> If there's a logical relation between those values so that they can
>> easily be combined, that may be a good alternative.
> How would that be any better than just doing a multi-column index?

10 different values per column and 20 columns make 10^20 value combinations. Partitioning on the first column gives 10^19 combinations, which is smaller than 2^64 and thus fits into a long value.

And I just guess that a 10-partition functional index on a long value could perform better than a multi-column index on 20 columns of character(10), if only because it is approx. 1/25th in size.

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
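To make the packing arithmetic concrete, here is a sketch shrunk to three columns; param_digit() is a hypothetical immutable mapping of the 10 legal values to the digits 0..9, and its CASE arms below are placeholders:

    CREATE FUNCTION param_digit(char) RETURNS int AS $$
      SELECT CASE trim($1)
               WHEN 'P' THEN 0 WHEN 'G' THEN 1 WHEN 'C' THEN 2
               -- ... one arm per legal value ...
               ELSE 9
             END;
    $$ LANGUAGE sql IMMUTABLE;

    CREATE INDEX items_packed ON items
      ((param_digit(param01) * 100
        + param_digit(param02) * 10
        + param_digit(param03)));

    -- the query has to repeat the exact index expression:
    SELECT * FROM items
     WHERE param_digit(param01) * 100
           + param_digit(param02) * 10
           + param_digit(param03) = 12   -- P=0, G=1, C=2
       AND code BETWEEN 5 AND 22;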
Hi, Jim,

Jim Nasby wrote:
> Adding -performance back in
>
> I would like to try it.
>
> However, in another post I added that, contrary to what I stated
> initially, not all the paramXX columns are mandatory in the query. So
> it seems that requirement makes the problem more complex.

Okay, this rules out my functional index over 19 columns.

> Doesn't this new requirement rule out this solution?
>
> No, just group the columns logically.

Yes, that's the solution. If you have common groups of columns that appear and disappear synchronously, pack those together in a (possibly partitioned and/or functional) index. Then rely on the query planner to combine the appropriate indices via index bitmap scan.

> By the way, I have tested indexing each column individually to check
> what happens in relation to bitmap scans. My test table is 1 million
> rows. The EXPLAIN ANALYZE command shows that a bitmap scan is
> sometimes used, but I still end up with queries that can take up to
> 10s, which is way too much.

Is it on the first query, or on repeated queries? It might be that you're I/O bound, and the backend has to fetch indices and rows from disk into RAM.

I currently don't know whether the order of indices in a multi-index bitmap scan is relevant, but I could imagine that it may be useful to have the most selective index scanned first.

And keep in mind that, assuming an equal distribution of your parameters, every index bitmap hits 1/10th of the whole table on average, so the selectivity generally is low. The selectivity of a partitioned 3-column index will be much better (about 1/10000th of the whole table), and fewer index scans and bitmaps have to be generated.

A functional index may also make sense as a CLUSTER target, to optimize the locality of search results (and so reduce disk I/O). In case your table has low write activity but high read-only activity, the overhead that comes with the additional index is negligible compared to the performance improvement proper CLUSTERing can generate.

Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
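For the CLUSTER idea, with the 8.1 syntax (the index name is an example; CLUSTER rewrites the table and takes an exclusive lock, so schedule it accordingly):

    CLUSTER items_packed ON items;
    ANALYZE items;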
Hi, Oscar,

Please reply to the list and not privately, so others can learn from your replies, and possibly have better ideas than me.

Oscar Picasso wrote:
> I cannot group the columns logically. Any column may or may not appear
> in a query.

That's suboptimal.

> Summarizing what I have learned:
> - I cannot use multicolumn indexes because I cannot group the columns
>   logically.
> - I cannot use functional indexes.
> - I cannot use clustering.

You still can have a set of partitioned multi-column indices, overlapping enough that every combination of columns is covered (or risk a sequential sub-scan for the last two or three columns, which should not hurt too much if the first 17 columns were selective enough).

The main problem with indices is that they also decrease write performance.

If disk costs are not a limit, it will make sense to have WAL, table and indices on different disks / RAID arrays, to parallelize writes.

Btw, I guess you have multiple, concurrent users?

Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
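Spelled out for one partitioning column, such a set could look like the following, with one index per legal value of param01 (names and values are illustrative):

    CREATE INDEX items_p01_a ON items (param02, param03, param04)
      WHERE param01 = 'A';
    CREATE INDEX items_p01_b ON items (param02, param03, param04)
      WHERE param01 = 'B';
    -- ... and so on for the remaining eight param01 values ...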
Hi Markus,

Markus Schaber <schabi@logix-tt.com> wrote:
> Hi, Oscar,
>
> Please reply to the list and not privately, so others can learn from
> your replies, and possibly have better ideas than me.

That was my intention. I made a mistake.

> Oscar Picasso wrote:
>
>> I cannot group the columns logically. Any column may or may not appear
>> in a query.
>
> That's suboptimal.
>
>> Summarizing what I have learned:
>> - I cannot use multicolumn indexes because I cannot group the columns
>>   logically.
>> - I cannot use functional indexes.
>> - I cannot use clustering.
>
> You still can have a set of partitioned multi-column indices,
> overlapping enough that every combination of columns is covered (or
> risk a sequential sub-scan for the last two or three columns, which
> should not hurt too much if the first 17 columns were selective enough).
>
> The main problem with indices is that they also decrease write
> performance.
>
> If disk costs are not a limit, it will make sense to have WAL, table
> and indices on different disks / RAID arrays, to parallelize writes.
>
> Btw, I guess you have multiple, concurrent users?

Yes I do.

I have just made other tests with only the individual indexes, and performance is much better than previously. Obviously there was an I/O problem during my initial test.

Something interesting, though: if I use few columns in the query, the results come very quickly and pg does a sequential scan. When it reaches some threshold (4 or 5 columns), pg switches to bitmap scans. It then takes an almost constant time (~2500 ms) no matter how many more columns I add to the where clause.

Interestingly enough, queries with many columns are less common. They also return fewer results, and many times no result at all.
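To see where the planner flips from the sequential scan to the bitmap scans, you can compare both plans by disabling seqscans for a test session (a debugging aid only, not something to set in production):

    SET enable_seqscan = off;
    EXPLAIN ANALYZE
    SELECT * FROM items
     WHERE param01 = 'P' AND param02 = 'G' AND param03 = 'C';
    RESET enable_seqscan;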