Thread: Forcing query to use an index

Forcing query to use an index

From
Michael Nachbaur
Date:
Hello everyone,

I have a search query that does a whole bunch of LEFT OUTER JOINs
between multiple tables, since this is a generic search and some
records may not exist for certain customers (e.g. searching for the
text "kate" should bring up people whose customer name, street address
or email addresses match that word).  This is for an ISP's customer
management database.

Unfortunately one stage in the query keeps using a sequence scan rather
than the index.  Here are the "EXPLAIN ANALYZE" results for the 115-line
SQL query.

Sort  (cost=6666.08..6666.08 rows=268 width=265) (actual time=949.00..949.00 rows=1 loops=1)
  ->  Aggregate  (cost=6487.84..6655.27 rows=268 width=265) (actual time=948.86..948.86 rows=1 loops=1)
        ->  Group  (cost=6487.84..6648.58 rows=2679 width=265) (actual time=948.70..948.70 rows=1 loops=1)
              ->  Sort  (cost=6487.84..6487.84 rows=2679 width=265) (actual time=948.66..948.66 rows=1 loops=1)
                    ->  Merge Join  (cost=6106.42..6335.30 rows=2679 width=265) (actual time=859.77..948.06 rows=1 loops=1)
                          ->  Merge Join  (cost=6101.24..6319.77 rows=2679 width=247) (actual time=554.11..674.17 rows=2679 loops=1)
                                ->  Index Scan using customer_id_key on customer c  (cost=0.00..129.63 rows=2679 width=156) (actual time=0.40..43.43 rows=2679 loops=1)
                                ->  Sort  (cost=6101.24..6101.24 rows=8117 width=91) (actual time=553.64..559.58 rows=8117 loops=1)
                                      ->  Seq Scan on customer_month_summary cms  (cost=0.00..5574.17 rows=8117 width=91) (actual time=258.03..477.11 rows=8117 loops=1)
                          ->  Sort  (cost=5.18..5.18 rows=77 width=18) (actual time=0.70..0.80 rows=77 loops=1)
                                ->  Seq Scan on emailaddress ea  (cost=0.00..2.77 rows=77 width=18) (actual time=0.08..0.35 rows=77 loops=1)
Total runtime: 951.70 msec

The table in question is "customer_month_summary"; it has 8117 rows.
Essentially, there is one record in the customer_month_summary table
for every month for every customer that has a cable modem (this doesn't
include dial-up users).  I have two columns in the summary table that
I'm matching by: CustomerID and MonthStart, which is the first day of
the month in question.  I also have an index on this table on
"Customer_Month_Summary(MonthStart, CustomerID)".  It keeps trying to
do a sequence scan, and while the query only takes 951 msec right now,
the summary table will keep growing, and I don't want performance to
suffer a few months/years down the line (also, having to wait a second
between search results is a bit too much).
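
For reference, the index on the summary table was created along these lines
(a sketch from memory of my schema):

  CREATE INDEX customer_month_summary_cm_idx
      ON customer_month_summary (monthstart, customerid);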

This is currently running on a dual-proc PIII-800 with 4GB of RAM.  I've
put a lot of effort into making this application very responsive, but it's
this one query that keeps killing me.  Any help you can provide would
therefore be much appreciated.

--man
Michael A Nachbaur <mike@nachbaur.com>

"I used to hate writing assignments, but now I enjoy them. I realized
that the purpose of writing is to inflate weak ideas, obscure poor
reasoning and inhibit clarity.  With a little practice, writing can be
an intimidating and impenetrable fog!" -- Calvin


Re: Forcing query to use an index

From
Josh Berkus
Date:
Michael,

> Unfortunately one stage in the query keeps using a sequence scan rather
> than the index.  Here are the "EXPLAIN ANALYZE" results for the 115-line
> SQL query.

If you want any help, you'll need to post the query as well as the explain
results.


--
-Josh Berkus
Aglio Database Solutions
San Francisco



Re: Forcing query to use an index

From
Stephan Szabo
Date:
On Mon, 3 Mar 2003, Michael Nachbaur wrote:

> Hello everyone,
>
> I have a search query that does a whole bunch of LEFT OUTER JOINs
> between multiple tables, since this is a generic search and some
> records may not exist for certain customers (e.g. searching for the
> text "kate" should bring up people whose customer name, street address
> or email addresses match that word).  This is for an ISP's customer
> management database.
>
> Unfortunately one stage in the query keeps using a sequence scan rather
> than the index.  Here are the "EXPLAIN ANALYZE" results for the 115-line
> SQL query.
>
> Sort  (cost=6666.08..6666.08 rows=268 width=265) (actual
> time=949.00..949.00 rows=1 loops=1)
>    ->  Aggregate  (cost=6487.84..6655.27 rows=268 width=265) (actual
> time=948.86..948.86 rows=1 loops=1)
>          ->  Group  (cost=6487.84..6648.58 rows=2679 width=265) (actual
> time=948.70..948.70 rows=1 loops=1)
>                ->  Sort  (cost=6487.84..6487.84 rows=2679 width=265)
> (actual time=948.66..948.66 rows=1 loops=1)
>                      ->  Merge Join  (cost=6106.42..6335.30 rows=2679
> width=265) (actual time=859.77..948.06 rows=1 loops=1)
>                            ->  Merge Join  (cost=6101.24..6319.77
> rows=2679 width=247) (actual time=554.11..674.17 rows=2679 loops=1)
>                                  ->  Index Scan using customer_id_key on
> customer c  (cost=0.00..129.63 rows=2679 width=156) (actual
> time=0.40..43.43 rows=2679 loops=1)
>                                  ->  Sort  (cost=6101.24..6101.24
> rows=8117 width=91) (actual time=553.64..559.58 rows=8117 loops=1)
>                                        ->  Seq Scan on
> customer_month_summary cms  (cost=0.00..5574.17 rows=8117 width=91)
> (actual time=258.03..477.11 rows=8117 loops=1)
>                            ->  Sort  (cost=5.18..5.18 rows=77 width=18)
> (actual time=0.70..0.80 rows=77 loops=1)
>                                  ->  Seq Scan on emailaddress ea
> (cost=0.00..2.77 rows=77 width=18) (actual time=0.08..0.35 rows=77
> loops=1)
> Total runtime: 951.70 msec
>
> The table in question is "customer_month_summary"; it has 8117 rows.

If you're hitting all the rows in the table, there's only a disadvantage
to using an indexscan (right now, given the way data is stored).  If you
were returning some fraction of the rows, postgresql should hopefully
switch to a different plan (depending on the estimated costs).



Re: Forcing query to use an index

From
Greg Stark
Date:
Michael Nachbaur <mike@nachbaur.com> writes:

> ->  Merge Join  (cost=6106.42..6335.30 rows=2679 width=265) (actual time=859.77..948.06 rows=1 loops=1)
>       ->  Merge Join  (cost=6101.24..6319.77 rows=2679 width=247) (actual time=554.11..674.17 rows=2679 loops=1)
>             ->  Index Scan using customer_id_key on customer c  (cost=0.00..129.63 rows=2679 width=156) (actual time=0.40..43.43 rows=2679 loops=1)
>             ->  Sort  (cost=6101.24..6101.24 rows=8117 width=91) (actual time=553.64..559.58 rows=8117 loops=1)
>                   ->  Seq Scan on customer_month_summary cms  (cost=0.00..5574.17 rows=8117 width=91) (actual time=258.03..477.11 rows=8117 loops=1)

You should send the query as well, and \d customer_month_summary so we can see
how you defined your indexes.

There doesn't seem to be a filter on the scan so it looks like postgres thinks
you're actually reading in the entire table, which is normally faster with a
sequential scan than an index scan. In fact I'm surprised it's doing an index
scan on the other table and not a sequential scan.

Some things to try:

set enable_seqscan = off

Then try your query again, see if postgres is right and it really is faster to
do the sequential scan. 

set random_page_cost = 2

Or even lower values.

I've also had some success with raising cpu_tuple_cost, though I'm unclear on
whether that's actually a good approach or not.

Also, don't forget to do a vacuum full on these tables before doing
testing for optimizations at this level. You can get some confusing results if
your tables have lots of empty holes in them.
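
For example, a quick way to test these in a psql session (just a sketch; the
SETs only affect your current session, and <your query> stands in for your
115-line statement):

  SET enable_seqscan = off;
  EXPLAIN ANALYZE <your query>;   -- is the index plan actually faster?
  RESET enable_seqscan;

  SET random_page_cost = 2;
  EXPLAIN ANALYZE <your query>;
  RESET random_page_cost;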

--
greg



Re: Forcing query to use an index

From
Michael Nachbaur
Date:
On Monday, Mar 3, 2003, at 14:09 US/Pacific, Josh Berkus wrote:
> If you want any help, you'll need to post the query as well as the
> explain
> results.

Sorry, forgot to include the statement.  Don't you just hate it when
you do that?  (or forget to include an attachment to an email)


Attachment

Re: Forcing query to use an index

From
Michael Nachbaur
Date:
On Monday, Mar 3, 2003, at 14:09 US/Pacific, Stephan Szabo wrote:
> If you're hitting all the rows in the table, there's only a disadvantage
> to using an indexscan (right now, given the way data is stored).  If you
> were returning some fraction of the rows, postgresql should hopefully
> switch to a different plan (depending on the estimated costs).

I should only ever get one match per record.  This is the structure of 
the relevant tables:

Customer  --> Customer_Month_Summary
   ^
   |
EmailAddress

There is one Customer_Month_Summary record per customer per month, so I
should only get one record out of the database.  Additionally, since
there are about 2000 customer records, but multiple months in the
summary table, I'll never use the entire contents of the summary table
in one query.



Re: Forcing query to use an index

From
Michael Nachbaur
Date:
On Monday, Mar 3, 2003, at 14:21 US/Pacific, Greg Stark wrote:
> You should send the query as well, and \d customer_month_summary so we 
> can see
> how you defined your indexes.
            Table "customer_month_summary"    Column     |           Type           | Modifiers
---------------+--------------------------+----------- customerid    | integer                  | timestart     |
timestampwith time zone | timeend       | timestamp with time zone | accountnum    | character varying(255)   |
firstname    | character varying(255)   | lastname      | character varying(255)   | organization  | character
varying(255)  | package       | character varying(255)   | up            | bigint                   | down          |
bigint                  | maxup         | bigint                   | maxdown       | bigint                   |
violatedsize | bigint                   | maxtotal      | bigint                   | total         | bigint
     | violatedup    | bigint                   | violateddown  | bigint                   | violatedtotal | bigint
             | monthstart    | date                     |
 
Indexes: customer_month_summary_cid_idx,         customer_month_summary_cm_idx,
customer_month_summary_time_idx

> There doesn't seem to be a filter on the scan so it looks like 
> postgres thinks
> you're actually reading in the entire table, which is normally faster 
> with a
> sequential scan than an index scan. In fact I'm surprised it's doing 
> an index
> scan on the other table and not a sequential scan.
>
> Some things to try:
>
> set enable_seqscan = off
>
> Then try your query again, see if postgres is right and it really is 
> faster to
> do the sequential scan.

Sort  (cost=100014872.07..100014872.07 rows=268 width=265) (actual time=382.51..382.51 rows=6 loops=1)
  ->  Aggregate  (cost=100014693.83..100014861.27 rows=268 width=265) (actual time=381.93..382.33 rows=6 loops=1)
        ->  Group  (cost=100014693.83..100014854.57 rows=2679 width=265) (actual time=381.79..381.90 rows=6 loops=1)
              ->  Sort  (cost=100014693.83..100014693.83 rows=2679 width=265) (actual time=381.75..381.75 rows=6 loops=1)
                    ->  Merge Join  (cost=100000005.18..100014541.30 rows=2679 width=265) (actual time=38.21..381.13 rows=6 loops=1)
                          ->  Nested Loop  (cost=0.00..14525.77 rows=2679 width=247) (actual time=0.14..149.21 rows=2679 loops=1)
                                ->  Index Scan using customer_id_key on customer c  (cost=0.00..129.73 rows=2679 width=156) (actual time=0.06..36.92 rows=2679 loops=1)
                                ->  Index Scan using customer_month_summary_cm_idx on customer_month_summary cms  (cost=0.00..5.36 rows=1 width=91) (actual time=0.02..0.03 rows=1 loops=2679)
                          ->  Sort  (cost=100000005.18..100000005.18 rows=77 width=18) (actual time=0.44..0.56 rows=77 loops=1)
                                ->  Seq Scan on emailaddress ea  (cost=100000000.00..100000002.77 rows=77 width=18) (actual time=0.01..0.24 rows=77 loops=1)
Total runtime: 383.25 msec

So from the looks of things the index is way faster (querying 1 row, 
rather than over 8000).

> set random_page_cost = 2
>
> Or even lower values.

Sort  (cost=6655.12..6655.12 rows=268 width=265) (actual time=902.75..902.76 rows=6 loops=1)
  ->  Aggregate  (cost=6476.88..6644.32 rows=268 width=265) (actual time=902.18..902.58 rows=6 loops=1)
        ->  Group  (cost=6476.88..6637.62 rows=2679 width=265) (actual time=902.01..902.13 rows=6 loops=1)
              ->  Sort  (cost=6476.88..6476.88 rows=2679 width=265) (actual time=901.97..901.98 rows=6 loops=1)
                    ->  Merge Join  (cost=6106.42..6324.34 rows=2679 width=265) (actual time=585.73..901.35 rows=6 loops=1)
                          ->  Merge Join  (cost=6101.24..6308.82 rows=2679 width=247) (actual time=549.53..667.35 rows=2679 loops=1)
                                ->  Index Scan using customer_id_key on customer c  (cost=0.00..118.77 rows=2679 width=156) (actual time=0.25..42.08 rows=2679 loops=1)
                                ->  Sort  (cost=6101.24..6101.24 rows=8117 width=91) (actual time=549.21..555.19 rows=8117 loops=1)
                                      ->  Seq Scan on customer_month_summary cms  (cost=0.00..5574.17 rows=8117 width=91) (actual time=252.90..472.29 rows=8117 loops=1)
                          ->  Sort  (cost=5.18..5.18 rows=77 width=18) (actual time=0.70..0.81 rows=77 loops=1)
                                ->  Seq Scan on emailaddress ea  (cost=0.00..2.77 rows=77 width=18) (actual time=0.08..0.35 rows=77 loops=1)
Total runtime: 905.47 msec

This is with enable_seqscan turned off, so it doesn't look like it's 
doing much good.

> I've also had some success with raising cpu_tuple_cost, though I'm 
> unclear on
> whether that's actually a good approach or not.
>
> Also, don't forget to do a vacuum full on these tables before doing
> testing for optimizations at this level. You can get some confusing 
> results if
> your tables have lots of empty holes in them.

IIRC this locks the tables, so I'll have to run this tonight so the 
users of this system don't race into my office to tar and feather me.

Does using "set enable_seqscan = off" impact anything else on the 
system, or is it connection-specific?  Is this the recommended way of 
doing this, or is there another way of coercing Postgres into using 
indexes?  I'm making these queries from Perl code running in an XSP 
page, so I'm not sure how flexible this option would be from a 
developer point of view.

Thanks for your help.



Re: Forcing query to use an index

From
Josh Berkus
Date:
Michael,

Don't forget to run ANALYZE as well; this does not require a table lock.

> Does using "set enable_seqscan = off" impact anything else on the
> system, or is it connection-specific?  Is this the recommended way of
> doing this, or is there another way of coercing Postgres into using
> indexes?  I'm making these queries from Perl code running in an XSP
> page, so I'm not sure how flexible this option would be from a
> developer point of view.

Setting enable_seqscan=off is just for testing purposes, to see if the planner
is making a wrong decision so that you can adjust your .conf params, query,
and/or indexes to suit.  The setting is *not* intended as a production
solution.

--
-Josh Berkus
Aglio Database Solutions
San Francisco



Re: Forcing query to use an index

From
Michael Nachbaur
Date:
On Monday, Mar 3, 2003, at 15:21 US/Pacific, Josh Berkus wrote:

> Michael,
>
> Don't forget to run ANALYZE as well; this does not require a table 
> lock.

Yes, I have a cron job that runs a "vacuum analyze" once every other 
hour.

>> Does using "set enable_seqscan = off" impact anything else on the
>> system, or is it connection-specific?  Is this the recommended way of
>> doing this, or is there another way of coercing Postgres into using
>> indexes?  I'm making these queries from Perl code running in an XSP
>> page, so I'm not sure how flexible this option would be from a
>> developer point of view.
>
> Setting enable_seqscan=off is just for testing purposes, to see if the 
> planner
> is making a wrong decision so that you can adjust your .conf params, 
> query,
> and/or indexes to suit.  The setting is *not* intended as a production
> solution.

*phew*  good.  Now that I know the planner is making a bad choice, 
where do I go from here?  Should I RTFM?

Also, regarding the values in the configuration file that the planner uses:
are those machine-specific, or are the defaults the recommended values?



Re: Forcing query to use an index

From
Josh Berkus
Date:
Michael,

Issues with your query:
  LEFT OUTER JOIN Customer_Month_Summary AS CMS
    ON ( C.ID = CMS.CustomerID
         AND CMS.MonthStart = DATE '2003-02-01' )

Check out the thread: Re: [SQL] OUTER JOIN with filter
in today's list; this relates to your problem.  Then try your query as:
  LEFT OUTER JOIN (SELECT * FROM Customer_Month_Summary
      WHERE MonthStart = DATE '2003-02-01'
        ) CMS ON C.ID = CMS.CustomerID

This may make better use of your index, because the planner will have a more
accurate estimate of the number of rows returned from the outer join.

AND:
   AND ( C.Accountnum                            ~* 'kate'
      OR C.Firstname                             ~* 'kate'
      OR C.Lastname                              ~* 'kate'
      OR C.Organization                          ~* 'kate'
      OR C.Address                               ~* 'kate'
      OR C.Postal                                ~* 'kate'
      OR C.City                                  ~* 'kate'
      OR EA.Name || '@' || JoinDomain(EA.Domain) ~* 'kate'

This set of expressions has "seq scan" written all over it.  I highly suggest
that you try to find a way to turn these into anchored text searches, perhaps
using functional indexes on lower(column).
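
For example (just a sketch; it assumes a prefix match is acceptable, and the
index and column names are illustrative):

  CREATE INDEX customer_lastname_lower_idx ON customer (lower(lastname));

  -- anchored, case-insensitive prefix search that can use the index:
  SELECT * FROM customer WHERE lower(lastname) LIKE 'kate%';

Note that LIKE can only use a plain btree index if your database was
initialized with the C locale.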

Finally:
     OR CMS.Package                             ~* 'kate'

Per the above, this is why Postgres cannot use an index on your table; that is
an unanchored text search which can *only* be indexed using FTS.

--
-Josh Berkus
Aglio Database Solutions
San Francisco



Re: Forcing query to use an index

From
Michael Nachbaur
Date:
On Monday, Mar 3, 2003, at 15:32 US/Pacific, Josh Berkus wrote:
> Check out the thread: Re: [SQL] OUTER JOIN with filter
> in today's list; this relates to your problem.  Then try your query as:

I'll read through this, thank you.

>   LEFT OUTER JOIN (SELECT * FROM Customer_Month_Summary
>     WHERE MonthStart = DATE '2003-02-01'
>        ) CMS ON C.ID = CMS.CustomerID

This works beautifully!  Thanks.

> This set of expressions has "seq scan" written all over it.  I highly
> suggest that you try to find a way to turn these into anchored text
> searches, perhaps using functional indexes on lower(column).

Hmm, I'll have to give this one some thought, since it looks like there
won't be any easy way to index those text columns; the text I'm
searching for could be anywhere within the column.

Thank you everyone for all the help!



Re: Forcing query to use an index

From
Greg Stark
Date:
One suggestion I'll make about your data model -- I'm not sure it would
actually help this query, but might help elsewhere:
 WHERE ( C.Disabled > '2003-02-28'
      OR C.Disabled IS NULL
       )

Don't use NULL values like this. Most databases don't index NULLs (Oracle) or
even if they do, don't make "IS NULL" an indexable operation (postgres).
There's been some talk of changing this in postgres but even then, it wouldn't
be able to use an index for an OR clause like this.

If you used a very large date, like 9999-01-01 as your "not deactivated" value
then the constraint would be C.disabled > '2003-02-28' and postgres could use
an index on "disabled".

Alternatively if you have a disabled_flag and disabled_date then you could
have an index on disabled_flag,disabled_date and uhm, there should be a way to
use that index though I'm not seeing it right now. 

This won't matter at first when 99% of your customers are active. And ideally
in this query you find some way to use an index to find "kate" rather than
doing a full table scan. But later when 90% of the clients are disabled, then
in a bigger batch job where you actually want to process every active record
it could prevent postgres from having to dig through a table full of old
inactive records.

> This may make better use of your index, because the planner will have a more 
> accurate estimate of the number of rows returned from the outer join.
> 
> AND:
> 
>    AND ( C.Accountnum                            ~* 'kate'
>       OR C.Firstname                             ~* 'kate'
>       OR C.Lastname                              ~* 'kate'
>       OR C.Organization                          ~* 'kate'
>       OR C.Address                               ~* 'kate'
>       OR C.Postal                                ~* 'kate'
>       OR C.City                                  ~* 'kate'
>       OR EA.Name || '@' || JoinDomain(EA.Domain) ~* 'kate'
> 
> This set of expressions has "seq scan" written all over it.  I highly suggest 
> that you try to find a way to turn these into anchored text searches, perhaps 
> using functional indexes on lower(column).

If you really need to find substring matches everywhere you might want to look
into the full text search module in contrib/tsearch. I haven't started using
it yet but I expect I will have to when I get to that part of my project. 

> Finally:
> 
>       OR CMS.Package                             ~* 'kate'

*confusion*. Oooh, Yeah, this one is a big problem. It means it's not clear
which end of the join to start with. Maybe it would be better to separate this
into two separate queries, give the user the option to search for a user
"kate" or a package "kate" but not both simultaneously.

-- 
greg



Re: Forcing query to use an index

From
Greg Stark
Date:

> ->  Merge Join  (cost=6106.42..6335.30 rows=2679 width=265) (actual time=859.77..948.06 rows=1 loops=1)

Actually, another problem: notice the big discrepancy between the estimated rows
and the actual rows. That's because you have the big OR clause, so postgres
figures there's a good chance one of the clauses will be true and estimates that
a lot of rows will match. In fact, of course, they're all very selective and
you'll usually only get a few records.

If you're stuck with the unanchored text search it will always do a full table
scan, so it will never be lightning fast. But it would probably be a bit
faster if you put a limit clause (on a subquery) on the table that's doing the
full table scan.

That will convince postgres that there won't be thousands of resulting
records, which might convince it to do a nested loop.
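
Something along these lines, purely as a sketch (the limit value is
illustrative, and if the limit ever actually kicks in it silently truncates
the matches, so treat it as a safety cap rather than a correctness feature):

  SELECT ...
  FROM (SELECT * FROM customer
        WHERE accountnum ~* 'kate'
           OR firstname  ~* 'kate'
           OR lastname   ~* 'kate'
           -- ...the remaining OR branches...
        LIMIT 100) AS c
  LEFT OUTER JOIN ...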

Also, as a beneficial side effect, it will limit the damage if one of your users
does a search for "e"...

This only really helps if you can get rid of the OR CMS.package clause...
otherwise it actually needs all the records in case they match a summary
record with a kate package.

-- 
greg



Re: Forcing query to use an index

From
Jean-Luc Lachance
Date:
I beg to differ.

A NULL field means not set.

Having to use a workaround because the database does not index null is
one thing, but making it a general rule is not.

Having NULL indexed would also speed up things when "is null" is part of
the query.

Until then...

JLL



Greg Stark wrote:
> 
> One suggestion I'll make about your data model -- I'm not sure it would
> actually help this query, but might help elsewhere:
> 
>  WHERE ( C.Disabled > '2003-02-28'
>       OR C.Disabled IS NULL
>        )
> 
> Don't use NULL values like this. Most databases don't index NULLs (Oracle) or
> even if they do, don't make "IS NULL" an indexable operation (postgres).
> There's been some talk of changing this in postgres but even then, it wouldn't
> be able to use an index for an OR clause like this.
> 
> If you used a very large date, like 9999-01-01 as your "not deactivated" value
> then the constraint would be C.disabled > '2003-02-28' and postgres could use
> an index on "disabled".
> 
> Alternatively if you have a disabled_flag and disabled_date then you could
> have an index on disabled_flag,disabled_date and uhm, there should be a way to
> use that index though I'm not seeing it right now.
> 
> This won't matter at first when 99% of your customers are active. And ideally
> in this query you find some way to use an index to find "kate" rather than
> doing a full table scan. But later when 90% of the clients are disabled, then
> in a bigger batch job where you actually want to process every active record
> it could prevent postgres from having to dig through a table full of old
> inactive records.
>


Re: Forcing query to use an index

From
Greg Stark
Date:
Jean-Luc Lachance <jllachan@nsd.ca> writes:

> I beg to differ.
> 
> A NULL field means not set.

The best description for what NULL means is "unknown".

> Having to use a workaround because the database does not index null is
> one thing, but making it a general rule is not.

My natural inclination is to use exactly the representation he used. 
I've done so numerous times in the past. But using NULL sort of means "we
don't know when this account might have been deactivated" which is why it
leads to all these awkward OR clauses in his queries.

I admit using 9999-01-01 as a date gives me the willies. But it does match
with the way the field is used and it produces nice clean index range lookups.

> Having NULL indexed would also speed up things when "is null" is part of
> the query.

No, it wouldn't. Not in his query. His query had "disabled IS NULL OR disabled < ?"
Even if "IS NULL" was indexable this still wouldn't be an indexable clause.

Another option would be to use a functional index. 

create function disabled_as_of(timestamp with time zone) as
 'select coalesce($1,''9999-01-01'')' language sql immutable

Then index disabled_as_of(disabled) and access it with
 "disabled_as_of(disabled) > current_time"

(or perhaps it would be clearer as "NOT disabled_as_of(disabled) < current_time")

Of course all this is just extra busywork to keep the ugly data representation
out of the data model and hidden in the functional index. And it's not really
very well hidden either.
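
For completeness, spelled out it would look something like this (an untested
sketch; note that CREATE FUNCTION also needs a RETURNS clause, and the
comparison should be against current_timestamp rather than current_time):

  create function disabled_as_of(timestamp with time zone)
    returns timestamp with time zone as
    'select coalesce($1, ''9999-01-01''::timestamp with time zone)'
    language sql immutable;

  create index customer_disabled_as_of_idx
    on customer (disabled_as_of(disabled));

  -- a query form that can use the index:
  select ... from customer where disabled_as_of(disabled) > current_timestamp;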

-- 
greg



Re: Forcing query to use an index

From
Jean-Luc Lachance
Date:
Greg Stark wrote:
> 
> Jean-Luc Lachance <jllachan@nsd.ca> writes:
> 
> > I beg to differ.
> >
> > A NULL field means not set.
> 
> The best description for what NULL means is "unknown".

I agree.

> 
> > Having to use a workaround because the database does not index null is
> > one thing, but making it a general rule is not.
> 
> My natural inclination is to use exactly the representation he used.
> I've done so numerous times in the past. But using NULL sort of means "we
> don't know when this account might have been deactivated" which is why it
> leads to all these awkward OR clauses in his queries.
> 
> I admit using 9999-01-01 as a date gives me the willies. But it does match
> with the way the field is used and it produces nice clean index range lookups.

I know it is quite far in the future, but it reminds me too much of the Y2K
problem.
One of my customers buried 99-12-31 in their code as an undefined
date...

> 
> > Having NULL indexed would also speed up things when "is null" is part of
> > the query.
> 
> No, it wouldn't. Not in his query. His query had
>  "disabled IS NULL OR disabled < ?"
> Even if "IS NULL" was indexable this still wouldn't be an indexable clause.

What? 

Select f from t where f is null or f < '2003-03-04';
is the same as
Select f from t where f is null
union
select f from t where f < '2003-03-04';

Surely the optimizer will know to use the index. Will it not???

JLL

> 
> Another option would be to use a functional index.
> 
> create function disabled_as_of(timestamp with time zone) as
>  'select coalesce($1,''9999-01-01'')' language sql immutable
> 
> Then index disabled_as_of(disabled) and access it with
>  "disabled_as_of(disabled) > current_time"
> 
> (or perhaps it would be clearer as "NOT disabled_as_of(disabled) < current_time")
> 
> Of course all this is just extra busywork to keep the ugly data representation
> out of the data model and hidden in the functional index. And it's not really
> very well hidden either.
> 
> --
> greg


Re: Forcing query to use an index

From
Tom Lane
Date:
Greg Stark <gsstark@MIT.EDU> writes:
> I admit using 9999-01-01 as a date gives me the willies. But it does match
> with the way the field is used and it produces nice clean index range
> lookups.

If you're willing to switch to type 'timestamp' then you could use the
special value 'infinity' (or '-infinity' when you need it to sort before
instead of after normal times).
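
For illustration (a sketch, assuming the Disabled column is timestamp with
time zone):

  -- 'infinity' sorts after every finite timestamp:
  UPDATE customer SET disabled = 'infinity' WHERE disabled IS NULL;

  -- the "OR disabled IS NULL" branch then collapses to a plain range test:
  SELECT ... FROM customer WHERE disabled > '2003-02-28';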

There was some talk awhile back of making type 'date' support infinity
as well.  This'd cost one day at each end of the allowable range, which
doesn't seem like a big loss.  But no one's done the legwork.
        regards, tom lane