Thread: Putting an INDEX on a boolean field?

Putting an INDEX on a boolean field?

From
"Marc G. Fournier"
Date:
Does that make sense?  Would it ever get used?  I can't see it, but 
figured I'd ask ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664


Re: Putting an INDEX on a boolean field?

From
David Dick
Date:
how about an very large table with a "processed" type flag?
uru
-Dave

Marc G. Fournier wrote:
> 
> Does that make sense?  Would it ever get used?  I can't see it, but 
> figured I'd ask ...
> 
> ----
> Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
>               http://archives.postgresql.org
> 


Re: Putting an INDEX on a boolean field?

From
Tom Lane
Date:
"Marc G. Fournier" <scrappy@postgresql.org> writes:
> Does that make sense?  Would it ever get used?

It could get used if one of the two values is far less frequent than the
other.  Personally I'd think about a partial index instead ...
        regards, tom lane


Re: Putting an INDEX on a boolean field?

From
"Marc G. Fournier"
Date:
On Fri, 17 Jun 2005, Tom Lane wrote:

> "Marc G. Fournier" <scrappy@postgresql.org> writes:
>> Does that make sense?  Would it ever get used?
>
> It could get used if one of the two values is far less frequent than the
> other.  Personally I'd think about a partial index instead ...

Hrmmmm, hadn't thought of that ... wouldn't you have to build two indexes 
(one for true, one for false) for this to be completely effective?  unless 
you know all your queries are going to search for one, but not the other?

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664


Re: Putting an INDEX on a boolean field?

From
Jaime Casanova
Date:
On 6/17/05, Marc G. Fournier <scrappy@postgresql.org> wrote:
> On Fri, 17 Jun 2005, Tom Lane wrote:
>
> > "Marc G. Fournier" <scrappy@postgresql.org> writes:
> >> Does that make sense?  Would it ever get used?
> >
> > It could get used if one of the two values is far less frequent than the
> > other.  Personally I'd think about a partial index instead ...
>
> Hrmmmm, hadn't thought of that ... wouldn't you have to build two indexes
> (one for true, one for false) for this to be completely effective?  unless
> you know all your queries are going to search for one, but not the other?
>

I guess it will be effective only if you know wich value will be less
frequent... on the other value a sequential scan will be a win, isn't
it?

--
regards,
Jaime Casanova
(DBA: DataBase Aniquilator ;)


Re: Putting an INDEX on a boolean field?

From
"Erik Aronesty"
Date:
In my database, the "sites" table is large, and the "usersites" table
has only a few sites per userid - so it should be looked in first.  I'm
surprised that I had to juggle my query around (below), rather than
trusting the optimizer to figure this out for me.

Should I start looking to figure out why the optimizer didn't figure out
that it should be doing this sort of thing?  Or should I just keep
juggling with subselects, since it's not that hard to do.

This query results in a sequential scan:
select sites.*, blocks from sites left join quota on
sites.host_u =     quota.host and quota.date = ('2005-06-16 23:55:05-04')
where sites.id     in (select siteid from usersites where userid = 1);

versus this one which does not:
explain analyze select sites.*, blocks from (select * from sites
where     id in (select siteid from usersites where userid = 1)) as sites
left     join quota on sites.host_u = quota.host and quota.date =
('2005-06-16     23:55:05-04');

The tables have been vacuumed/analyzed, etc.

---slow---

explain analyze select sites.*, blocks from sites left join quota on
sites.host_u = quota.host and quota.date = ('2005-06-16 23:55:05-04')
where sites.id in (select siteid from usersites where userid = 1);

---

Hash IN Join  (cost=3183.30..3295.50 rows=13 width=158) (actual
time=4865.895..4942.097 rows=10 loops=1)  Hash Cond: ("outer".id = "inner".siteid)  ->  Merge Left Join
(cost=3173.52..3263.12rows=4493 width=158)
 
(actual time=4856.212..4939.329 rows=4443 loops=1)        Merge Cond: ("outer"."?column24?" = "inner".host)        ->
Sort (cost=2786.62..2797.72 rows=4443 width=154) (actual
 
time=4811.499..4816.164 rows=4443 loops=1)              Sort Key: (sites.host_u)::text              ->  Seq Scan on
sites (cost=0.00..2517.43 rows=4443
 
width=154) (actual time=11.611..4598.849 rows=4443 loops=1)        ->  Sort  (cost=386.91..398.13 rows=4489 width=26)
(actual
time=44.638..46.307 rows=4454 loops=1)              Sort Key: quota.host              ->  Index Scan using
quota_date_idxon quota
 
(cost=0.00..114.60 rows=4489 width=26) (actual time=0.069..10.780
rows=4453 loops=1)                    Index Cond: (date = '2005-06-16
23:55:05-04'::timestamp with time zone)  ->  Hash  (cost=9.75..9.75 rows=12 width=4) (actual time=0.086..0.086
rows=0 loops=1)        ->  Index Scan using usersites_userid_idx on usersites
(cost=0.00..9.75 rows=12 width=4) (actual time=0.047..0.076 rows=10
loops=1)              Index Cond: (userid = 1)Total runtime: 4944.575 ms
(15 rows)

---fast---

explain analyze select sites.*, blocks from (select * from sites where
id in (select siteid from usersites where userid = 1)) as sites left
join quota on sites.host_u = quota.host and quota.date = ('2005-06-16
23:55:05-04');

---

Merge Left Join  (cost=468.77..491.41 rows=13 width=158) (actual
time=46.547..53.669 rows=10 loops=1)  Merge Cond: ("outer"."?column24?" = "inner".host)  ->  Sort  (cost=81.86..81.89
rows=12width=154) (actual
 
time=0.450..0.454 rows=10 loops=1)        Sort Key: (public.sites.host_u)::text        ->  Nested Loop
(cost=9.78..81.65rows=12 width=154) (actual
 
time=0.129..0.392 rows=10 loops=1)              ->  HashAggregate  (cost=9.78..9.78 rows=12 width=4)
(actual time=0.084..0.095 rows=10 loops=1)                    ->  Index Scan using usersites_userid_idx on
usersites  (cost=0.00..9.75 rows=12 width=4) (actual time=0.040..0.067
rows=10 loops=1)                          Index Cond: (userid = 1)              ->  Index Scan using sites_pkey on
sites
(cost=0.00..5.98 rows=1 width=154) (actual time=0.017..0.018 rows=1
loops=10)                    Index Cond: (sites.id = "outer".siteid)  ->  Sort  (cost=386.91..398.13 rows=4489
width=26)(actual
 
time=44.971..46.042 rows=3741 loops=1)        Sort Key: quota.host        ->  Index Scan using quota_date_idx on quota
(cost=0.00..114.60 rows=4489 width=26) (actual time=0.025..10.643
rows=4453 loops=1)              Index Cond: (date = '2005-06-16 23:55:05-04'::timestamp
with time zone)Total runtime: 54.988 ms
(15 rows)



Re: Putting an INDEX on a boolean field?

From
Tom Lane
Date:
"Erik Aronesty" <erik@memebot.com> writes:
> Should I start looking to figure out why the optimizer didn't figure out
> that it should be doing this sort of thing?

It looks to me that the problem is that convert_IN_to_join() is not
being smart about where to attach the IN's subselect to the join tree.
It's just adding it to the top FROM-expression, so that the join tree
is effectively((sites left join quota) IN-join usersites)
and since we don't currently allow any rearrangement of outer joins,
this cannot be rearranged into((sites IN-join usersites) left join quota)
as you'd like.

The really clean solution to this would be to implement logic about when
it is safe to rearrange the join order of outer joins.  But I think
that's a fairly hard problem in general.  A shorter-term solution might
be possible by teaching convert_IN_to_join() to attach the IN subselect
further down in the join tree, using logic similar to what we use to
decide where ordinary WHERE quals can bubble down to.
        regards, tom lane