Thread: Help/advice/suggestions on query optimizer for a large table

Help/advice/suggestions on query optimizer for a large table

From
Martin Weinberg
Date:
Folks,

No one on general or novice have had any insight into the following
problem.

I have a very large table (10Gb, 20 million records each with 54 fields) 
with both float, integer and text values.  It's an astronomical
database for the 2MASS project (http://pegasus.phast.umass.edu).  This
database will be 10 times larger in the end . . . 

Anyway, if I submit a query such as:
select * from mytable where x=3.14 and y=6.28;


it takes about 3 minutes to return the record.  Both x and y are indexed:
create index xindex on mytable using btree (x);create index yindex on mytable using btree (y);

And "explain" on the select query above says it's doing a sequential scan.   

However if I say:
select * from mytable where x='3.14'::float4 and y='6.28'::float4;

it takes about 3 seconds!  And now "explain" says it's doing an indexed
scan.

My understanding is that the query optimizer should pick the index
scan for this query based on the cost.  My attempts at debugging
have not turned up anything obvious to me.

Is there a problem with my set up or is this a known problem?  Is 
there something I can do as a work around to make this efficient?  

I would like PostgreSQL to succeed in this application, if possible,
so that it can be adopted.  If we can get this working, I would
recommend that the astronomical community consider adopting this
PostgreSQL to "spin" portions of this database (if you think this
is reasonable).

BTW, this is PostgreSQL 6.4.2 on a dual Xeon running Linux 2.2.5,
over Debian 2.1.

Thanks!

--Martin

===========================================================================

Prof. Martin Weinberg                Phone: (413) 545-3821
Dept. of Physics and Astronomy       FAX:   (413) 545-2117/0648
530 Graduate Research Tower
University of Massachusetts
Amherst, MA  01003-4525


Re: [HACKERS] Help/advice/suggestions on query optimizer for a large table

From
Thomas Lockhart
Date:
> Anyway, if I submit a query such as:
>         select * from mytable where x=3.14 and y=6.28;
> it takes about 3 minutes to return the record.  Both x and y are indexed:
> And "explain" on the select query above says it's doing a sequential scan.
> However if I say:
>         select * from mytable where x='3.14'::float4 and y='6.28'::float4;
> it takes about 3 seconds!  And now "explain" says it's doing an indexed
> scan.
> My understanding is that the query optimizer should pick the index
> scan for this query based on the cost.

This is a known feature. The Postgres parser converts an unquoted 3.14
to a float8, which is not the same as the float4 column you have
indexed. And the optimizer is not (yet) bright enough to convert
constants to the column type, and then use the available indexes.

In fact, the apparently more desirable strategy is not particularly
easy to get right. Look at this example:
 create table t1 (i int4); (insert a bunch of data) create index tx on t1 using btree(i); vacuum; select * from t1
wherei < 3.5;
 

In this case, we can't convert the 3.5 to an integer (3) without
changing the comparison operator to "<=". And in your case,
"downconverting" the float8 to a float4 probably would risk the same
problem. So Postgres *promotes* the float4s to float8s, and has to do
a sequential scan along the way.

Anyway, afaik you have two options. The first is to surround the
"3.14" in your example with single quotes (probably the coersion to
float4 is unnecessary). The second is to create a function index on
your table, to allow your queries to use float8 arguments
transparently:
 create index mx on mytable using btree (float8(x)); create index my on mytable using btree (float8(y));

If you are hiding all of the queries inside an app, then I'd suggest
the first technique. If you are going to be doing a lot of direct SQL,
then you might want to use the second.
                          - Tom

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] Help/advice/suggestions on query optimizer for a large table

From
Bruce Momjian
Date:
OK, so here is the answer to this question.  Thanks Thomas.



> > Anyway, if I submit a query such as:
> >         select * from mytable where x=3.14 and y=6.28;
> > it takes about 3 minutes to return the record.  Both x and y are indexed:
> > And "explain" on the select query above says it's doing a sequential scan.
> > However if I say:
> >         select * from mytable where x='3.14'::float4 and y='6.28'::float4;
> > it takes about 3 seconds!  And now "explain" says it's doing an indexed
> > scan.
> > My understanding is that the query optimizer should pick the index
> > scan for this query based on the cost.
> 
> This is a known feature. The Postgres parser converts an unquoted 3.14
> to a float8, which is not the same as the float4 column you have
> indexed. And the optimizer is not (yet) bright enough to convert
> constants to the column type, and then use the available indexes.
> 
> In fact, the apparently more desirable strategy is not particularly
> easy to get right. Look at this example:
> 
>   create table t1 (i int4);
>   (insert a bunch of data)
>   create index tx on t1 using btree(i);
>   vacuum;
>   select * from t1 where i < 3.5;
> 
> In this case, we can't convert the 3.5 to an integer (3) without
> changing the comparison operator to "<=". And in your case,
> "downconverting" the float8 to a float4 probably would risk the same
> problem. So Postgres *promotes* the float4s to float8s, and has to do
> a sequential scan along the way.
> 
> Anyway, afaik you have two options. The first is to surround the
> "3.14" in your example with single quotes (probably the coersion to
> float4 is unnecessary). The second is to create a function index on
> your table, to allow your queries to use float8 arguments
> transparently:
> 
>   create index mx on mytable using btree (float8(x));
>   create index my on mytable using btree (float8(y));
> 
> If you are hiding all of the queries inside an app, then I'd suggest
> the first technique. If you are going to be doing a lot of direct SQL,
> then you might want to use the second.
> 
>                            - Tom
> 
> -- 
> Thomas Lockhart                lockhart@alumni.caltech.edu
> South Pasadena, California
> 
> 


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026