Thread: Help/advice/suggestions on query optimizer for a large table
Folks, No one on general or novice have had any insight into the following problem. I have a very large table (10Gb, 20 million records each with 54 fields) with both float, integer and text values. It's an astronomical database for the 2MASS project (http://pegasus.phast.umass.edu). This database will be 10 times larger in the end . . . Anyway, if I submit a query such as: select * from mytable where x=3.14 and y=6.28; it takes about 3 minutes to return the record. Both x and y are indexed: create index xindex on mytable using btree (x);create index yindex on mytable using btree (y); And "explain" on the select query above says it's doing a sequential scan. However if I say: select * from mytable where x='3.14'::float4 and y='6.28'::float4; it takes about 3 seconds! And now "explain" says it's doing an indexed scan. My understanding is that the query optimizer should pick the index scan for this query based on the cost. My attempts at debugging have not turned up anything obvious to me. Is there a problem with my set up or is this a known problem? Is there something I can do as a work around to make this efficient? I would like PostgreSQL to succeed in this application, if possible, so that it can be adopted. If we can get this working, I would recommend that the astronomical community consider adopting this PostgreSQL to "spin" portions of this database (if you think this is reasonable). BTW, this is PostgreSQL 6.4.2 on a dual Xeon running Linux 2.2.5, over Debian 2.1. Thanks! --Martin =========================================================================== Prof. Martin Weinberg Phone: (413) 545-3821 Dept. of Physics and Astronomy FAX: (413) 545-2117/0648 530 Graduate Research Tower University of Massachusetts Amherst, MA 01003-4525
Re: [HACKERS] Help/advice/suggestions on query optimizer for a large table
From
Thomas Lockhart
Date:
> Anyway, if I submit a query such as: > select * from mytable where x=3.14 and y=6.28; > it takes about 3 minutes to return the record. Both x and y are indexed: > And "explain" on the select query above says it's doing a sequential scan. > However if I say: > select * from mytable where x='3.14'::float4 and y='6.28'::float4; > it takes about 3 seconds! And now "explain" says it's doing an indexed > scan. > My understanding is that the query optimizer should pick the index > scan for this query based on the cost. This is a known feature. The Postgres parser converts an unquoted 3.14 to a float8, which is not the same as the float4 column you have indexed. And the optimizer is not (yet) bright enough to convert constants to the column type, and then use the available indexes. In fact, the apparently more desirable strategy is not particularly easy to get right. Look at this example: create table t1 (i int4); (insert a bunch of data) create index tx on t1 using btree(i); vacuum; select * from t1 wherei < 3.5; In this case, we can't convert the 3.5 to an integer (3) without changing the comparison operator to "<=". And in your case, "downconverting" the float8 to a float4 probably would risk the same problem. So Postgres *promotes* the float4s to float8s, and has to do a sequential scan along the way. Anyway, afaik you have two options. The first is to surround the "3.14" in your example with single quotes (probably the coersion to float4 is unnecessary). The second is to create a function index on your table, to allow your queries to use float8 arguments transparently: create index mx on mytable using btree (float8(x)); create index my on mytable using btree (float8(y)); If you are hiding all of the queries inside an app, then I'd suggest the first technique. If you are going to be doing a lot of direct SQL, then you might want to use the second. - Tom -- Thomas Lockhart lockhart@alumni.caltech.edu South Pasadena, California
OK, so here is the answer to this question. Thanks Thomas. > > Anyway, if I submit a query such as: > > select * from mytable where x=3.14 and y=6.28; > > it takes about 3 minutes to return the record. Both x and y are indexed: > > And "explain" on the select query above says it's doing a sequential scan. > > However if I say: > > select * from mytable where x='3.14'::float4 and y='6.28'::float4; > > it takes about 3 seconds! And now "explain" says it's doing an indexed > > scan. > > My understanding is that the query optimizer should pick the index > > scan for this query based on the cost. > > This is a known feature. The Postgres parser converts an unquoted 3.14 > to a float8, which is not the same as the float4 column you have > indexed. And the optimizer is not (yet) bright enough to convert > constants to the column type, and then use the available indexes. > > In fact, the apparently more desirable strategy is not particularly > easy to get right. Look at this example: > > create table t1 (i int4); > (insert a bunch of data) > create index tx on t1 using btree(i); > vacuum; > select * from t1 where i < 3.5; > > In this case, we can't convert the 3.5 to an integer (3) without > changing the comparison operator to "<=". And in your case, > "downconverting" the float8 to a float4 probably would risk the same > problem. So Postgres *promotes* the float4s to float8s, and has to do > a sequential scan along the way. > > Anyway, afaik you have two options. The first is to surround the > "3.14" in your example with single quotes (probably the coersion to > float4 is unnecessary). The second is to create a function index on > your table, to allow your queries to use float8 arguments > transparently: > > create index mx on mytable using btree (float8(x)); > create index my on mytable using btree (float8(y)); > > If you are hiding all of the queries inside an app, then I'd suggest > the first technique. If you are going to be doing a lot of direct SQL, > then you might want to use the second. > > - Tom > > -- > Thomas Lockhart lockhart@alumni.caltech.edu > South Pasadena, California > > -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026