Thread: About b-tree usage

About b-tree usage

From
Ioannis Theoharis
Date:

Please let me know if there is any option in PostgreSQL to achieve the
following usage of a b-tree index:

For a relation R(att0, att1) and a b-tree index on attribute att0,

on each insertion of a tuple into the table:
- look in the index to see whether the value of att0 of the new entry
  already exists in the index, and
    - if no, allow the appropriate entry on the b-tree
    - if yes, do not allow an entry.
 

In my application my relation is always clustered according to att0.
And the only information needed for a query with a range condition over
att0 in the WHERE clause is the place on disk where the first tuple with a
given value of att0 is stored.

The point is that, because the table has so many rows, the index is
too big. But the number of distinct values of att0 is orders of
magnitude smaller than the number of tuples.

I am trying to find out whether there is an alternative to the b-tree
index that would decrease the number of index blocks fetched into memory.
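
For illustration, the insert-time behaviour I describe could be emulated by
hand with a small side table that keeps one row per distinct att0 value (the
table and column names below are only an example):

-- One row per distinct att0 value; its primary-key index is the small
-- "one entry per distinct value" b-tree I have in mind.
CREATE TABLE r_att0_summary (att0 varchar(1000) PRIMARY KEY);

-- On every insertion into R(att0, att1), record the key only if it is new.
INSERT INTO r_att0_summary (att0)
SELECT 'some value'
WHERE NOT EXISTS (SELECT 1 FROM r_att0_summary WHERE att0 = 'some value');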

Thanks.




Re: About b-tree usage

From
Jeff Davis
Date:
If I understand your question, you want to reduce the index size by only
pointing to the first tuple in a table with a given key in att0, since
the rest of the tuples will be right afterward (because you keep the
table clustered on that key).

However, from the docs:
<http://www.postgresql.org/docs/8.0/static/sql-cluster.html>

"When a table is clustered, it is physically reordered based on the
index information. Clustering is a one-time operation: when the table is
subsequently updated, the changes are not clustered. That is, no attempt
is made to store new or updated rows according to their index order. If
one wishes, one can periodically recluster by issuing the command
again."

So, there is no guarantee that there won't be tuples with the same key
in a different physical location between your CLUSTER commands. That
means your index idea won't really work.
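
For instance, keeping the table physically ordered would mean re-issuing
something like the following from time to time (the table and index names
here are just placeholders):

-- Re-cluster R on its att0 index; this has to be repeated after the table
-- has been modified, and ANALYZE keeps the planner statistics current.
CLUSTER r_att0_idx ON r;
ANALYZE r;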

Perhaps you can alter your table layout/design to better suit your
needs.

For example:
* If you have a very small number of distinct values in att0, you could make
a different table for each one and a view that's the union of those tables
(a rough sketch follows after this list).
* You could make att1 an array. Yes, it is a horrible design from a
relational standpoint, but if you need performance, there may not be
many other options. You can then use set-returning functions and other
functions to make it behave more like a relation. Ideally, postgresql
would have relation-valued attributes, but currently arrays seem like
the best option available for simulating a relation-valued attribute.
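
Something like this, for instance, for the first idea (the value names are
invented for the example):

-- One physical table per att0 value, plus a view that presents them again
-- as the original R(att0, att1).
CREATE TABLE r_red  (att1 int4);
CREATE TABLE r_blue (att1 int4);

CREATE VIEW r AS
    SELECT 'red'::varchar(1000)  AS att0, att1 FROM r_red
    UNION ALL
    SELECT 'blue'::varchar(1000) AS att0, att1 FROM r_blue;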

If there are many identical values in att0, are you sure a sequential
scan isn't more efficient? Also, are you sure the index isn't working
well? It seems to me since you have the table clustered, it might be
fairly efficient as-is (it would get a huge benefit from the spatial
locality of the tuples in the table). Index size alone shouldn't destroy
your performance, since the idea of an index lookup is that it only has
to read O(log n) pages from the disk per lookup.

Regards,
Jeff Davis


On Sun, 2005-03-06 at 23:33 +0200, Ioannis Theoharis wrote:
> 
> Please let me know if there is any option in PostgreSQL to achieve the
> following usage of a b-tree index:
> 
> For a relation R(att0, att1) and a b-tree index on attribute att0,
> 
> on each insertion of a tuple into the table:
> - look in the index to see whether the value of att0 of the new entry
>   already exists in the index, and
>     - if no, allow the appropriate entry on the b-tree
>     - if yes, do not allow an entry.
> 
> In my application my relation is always clustered according to att0.
> And the only information needed for a query with a range condition over
> att0 in the WHERE clause is the place on disk where the first tuple with a
> given value of att0 is stored.
> 
> The point is that, because the table has so many rows, the index is
> too big. But the number of distinct values of att0 is orders of
> magnitude smaller than the number of tuples.
> 
> I am trying to find out whether there is an alternative to the b-tree
> index that would decrease the number of index blocks fetched into memory.
> 
> Thanks.
> 
> 
> 



Re: About b-tree usage

From
Ioannis Theoharis
Date:
> If there are many identical values in att0, are you sure a sequential
> scan isn't more efficient? Also, are you sure the index isn't working
> well? It seems to me since you have the table clustered, it might be
> fairly efficient as-is (it would get a huge benefit from the spatial
> locality of the tuples in the table). Index size alone shouldn't destroy
> your performance, since the idea of an index lookup is that it only has
> to read O(log n) pages from the disk per lookup.

In the next example, keep in mind that:

select relname, relpages, reltuples from pg_class;

            relname             | relpages |  reltuples
--------------------------------+----------+-------------
...
tc2000000000                    |   142858 | 1.00001e+06
inst_id_idx                     |     2745 |       1e+06
...

and that I run PostgreSQL on an UltraSPARC[tm] III 600 MHz with 512 MB RAM,
OS: Solaris 9.

att0: varchar(1000)
att1: int4
and that 0 <= att1 <= 900000000 for every tuple of the table and index.

query:
select att0 from tc2000000000 where att1<=900000000 AND att1>=0

plan:
 Index Scan using inst_id_idx on tc2000000000  (cost=0.00..161603.06
rows=1000006 width=1004) (actual time=41.21..101917.36 rows=1000000 loops=1)
   Index Cond: ((att1 <= 900000000) AND (att1 >= 0))
 Total runtime: 103135.03 msec
 


query:
select att0 from tc2000000000

plan:
 Seq Scan on tc2000000000  (cost=100000000.00..100152858.06 rows=1000006
width=1004) (actual time=0.21..42584.87 rows=1000000 loops=1)
 Total runtime: 43770.73 msec

Can you explain this big difference to me? Perhaps PostgreSQL caches a
big part (or the whole) of the index in memory?

And by the way, why doesn't PostgreSQL select a sequential scan? (I have
run VACUUM ANALYZE.)


Re: About b-tree usage

From
Jeff Davis
Date:
In that case, sequential scan is faster, but perhaps the planner doesn't
know that ahead of time. Try turning on more statistics if you haven't
already, and then run ANALYZE again. If the planner sees a range,
perhaps it assumes that it is a highly selective range, when in fact, it
consists of all of the tuples. Also, make sure enable_seqscan is true
(in case you turned it off for testing or something and forgot). 
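
For example, something along these lines would raise the statistics for the
range column and make sure the comparison is fair (table and column names
taken from your earlier post):

-- Check that seq scans are allowed in this session.
SHOW enable_seqscan;
SET enable_seqscan = on;

-- Gather more detailed statistics for att1, then look at the plan again.
ALTER TABLE tc2000000000 ALTER COLUMN att1 SET STATISTICS 100;
ANALYZE tc2000000000;

EXPLAIN ANALYZE
SELECT att0 FROM tc2000000000 WHERE att1 <= 900000000 AND att1 >= 0;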

A seqscan is usually faster when a large proportion of the tuples are
returned because:
(1) It uses sequential I/O; whereas an index might access tuples in a
random order.
(2) It doesn't have to read the index's disk pages at all.

I suspect you don't need to return all the tuples in the table. If you
include the details of a real scenario perhaps the people on the list
could be more helpful.

Regards,
Jeff Davis


On Mon, 2005-03-07 at 20:43 +0200, Ioannis Theoharis wrote:
> > If there are many identical values in att0, are you sure a sequential
> > scan isn't more efficient? Also, are you sure the index isn't working
> > well? It seems to me since you have the table clustered, it might be
> > fairly efficient as-is (it would get a huge benefit from the spatial
> > locality of the tuples in the table). Index size alone shouldn't destroy
> > your performance, since the idea of an index lookup is that it only has
> > to read O(log n) pages from the disk per lookup.
> 
> In the next example, keep in mind that:
> select relname, relpages, reltuples from pg_class;
> 
>             relname             | relpages |  reltuples
> --------------------------------+----------+-------------
> ...
> tc2000000000                    |   142858 | 1.00001e+06
> inst_id_idx                     |     2745 |       1e+06
> ...
> 
> and that I run PostgreSQL on an UltraSPARC[tm] III 600 MHz with 512 MB RAM,
> OS: Solaris 9.
> 
> att0: varchar(1000)
> att1: int4
> and that 0 <= att1 <= 900000000 for every tuple of the table and index.
> 
> query:
> select att0 from tc2000000000 where att1<=900000000 AND att1>=0
> 
> plan:
>  Index Scan using inst_id_idx on tc2000000000  (cost=0.00..161603.06
> rows=1000006 width=1004) (actual time=41.21..101917.36 rows=1000000 loops=1)
>    Index Cond: ((att1 <= 900000000) AND (att1 >= 0))
>  Total runtime: 103135.03 msec
> 
> 
> query:
> select att0 from tc2000000000
> 
> plan:
>  Seq Scan on tc2000000000  (cost=100000000.00..100152858.06 rows=1000006
> width=1004) (actual time=0.21..42584.87 rows=1000000 loops=1)
>  Total runtime: 43770.73 msec
> 
> Can you explain this big difference to me? Perhaps PostgreSQL caches a
> big part (or the whole) of the index in memory?
> 
> And by the way, why doesn't PostgreSQL select a sequential scan? (I have
> run VACUUM ANALYZE.)
> 



Re: About b-tree usage

From
Tom Lane
Date:
Ioannis Theoharis <theohari@ics.forth.gr> writes:
> select att0 from tc2000000000 where att1<=900000000 AND att1>=0

> plan:
>  Index Scan using inst_id_idx on tc2000000000  (cost=0.00..161603.06
> rows=1000006 width=1004) (actual time=41.21..101917.36 rows=1000000 loops=1)
>    Index Cond: ((att1 <= 900000000) AND (att1 >= 0))
>  Total runtime: 103135.03 msec

> query:
> select att0 from tc2000000000

> plan:
>  Seq Scan on tc2000000000  (cost=100000000.00..100152858.06 rows=1000006
> width=1004) (actual time=0.21..42584.87 rows=1000000 loops=1)
>  Total runtime: 43770.73 msec

> Can you explain this big difference to me? Perhaps PostgreSQL caches a
> big part (or the whole) of the index in memory?

seqscan is naturally faster when you have to visit the whole table anyway.

> And by the way, why doesn't PostgreSQL select a sequential scan?

Because you've set enable_seqscan = off.
        regards, tom lane


Re: About b-tree usage

From
Ioannis Theoharis
Date:

Let me explain: I have turned enable_seqscan off in order to discourage
the optimizer from choosing a seq_scan whenever an index_scan can be used.

But in this case, why doesn't the optimizer choose a seq_scan (discouraging
is different from preventing)?

In many cases I need only a small fraction of rows to be retrieved. But
this extreme case is a real scenario (not the most frequent, but real).

I am trying to find a way to achieve good performance even in the extreme
case. Is there any way?

PS. In the literature there is a different alternative for indices: besides
the simple approach of <attr_val, rid> there is the alternative <attr_val, set
of rids>, which attaches to each distinct attr_val the set of rids of all
rows with the same attr_val. Is this alternative taken into account in
Postgres?


On Mon, 7 Mar 2005, Jeff Davis wrote:

>
> In that case, sequential scan is faster, but perhaps the planner doesn't
> know that ahead of time. Try turning on more statistics if you haven't
> already, and then run ANALYZE again. If the planner sees a range,
> perhaps it assumes that it is a highly selective range, when in fact, it
> consists of all of the tuples. Also, make sure enable_seqscan is true
> (in case you turned it off for testing or something and forgot).
>
> A seqscan is usually faster when a large proportion of the tuples are
> returned because:
> (1) It uses sequential I/O; whereas an index might access tuples in a
> random order.
> (2) It doesn't have to read the index's disk pages at all.
>
> I suspect you don't need to return all the tuples in the table. If you
> include the details of a real scenario perhaps the people on the list
> could be more helpful.
>


Re: About b-tree usage

From
"Michael Paesold"
Date:
Ioannis Theoharis wrote:

> Let me explain: I have turned enable_seqscan off in order to discourage
> the optimizer from choosing a seq_scan whenever an index_scan can be used.
>
> But in this case, why doesn't the optimizer choose a seq_scan (discouraging
> is different from preventing)?

You probably know that PostgreSQL uses a cost-based optimizer. The optimizer 
chooses different plans based on the cost it calculates for them.

enable_seqscan = OFF is very primitive but effective: it tells the optimizer 
to raise the cost of a sequential scan to a value going towards infinity.

When it comes to the choice between seq scan and index scan, the optimizer 
will now always choose the index scan. It does not "know" anymore whether a 
sequential scan would be cheaper -- *you* have told the optimizer that it is 
not.

Only when there is no way other than a seq scan to execute your query at 
all must the optimizer choose this very costly path. An example is an 
unqualified SELECT * FROM table; -- there is no index path here.

I hope that answers your first question. As you see, enable_seqscan = OFF 
should not be used for production systems, but only for debugging. Perhaps 
it's useful to set at query level, but not in postgresql.conf.
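
For example, a debugging session might look like this (the table name is
taken from your earlier posts), rather than changing postgresql.conf:

-- Discourage seq scans only for the statement you are investigating,
-- then restore the default for the rest of the session.
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT att0 FROM tc2000000000 WHERE att1 >= 0 AND att1 <= 900000000;
RESET enable_seqscan;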

Best Regards,
Michael Paesold 



Re: About b-tree usage

From
Klaus Naumann
Date:
Hi,

if you're using a pg version prior to 8.0, your pitfall might also be
a conversion between the int and bigint datatypes.
So if you're doing something like

SELECT a.x, b.y FROM a, b WHERE a.x = b.x;

and a.x is INT4 and b.x is INT8 (or BIGINT), the planner counts this as
a data conversion and uses a full table scan.
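
A possible workaround on those older versions is an explicit cast (or a
quoted literal) so that both sides have the same type, for example:

-- Cast the INT4 side up to INT8 so no implicit conversion gets in the way
-- of the index on b.x.
SELECT a.x, b.y FROM a, b WHERE a.x::int8 = b.x;

-- For a plain literal compared against an INT8 column, quoting the value
-- also works on pre-8.0 releases.
SELECT y FROM b WHERE x = '900000000';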

Greetings, Klaus

Ioannis Theoharis wrote:
> 
> Let me explain: I have turned enable_seqscan off in order to discourage
> the optimizer from choosing a seq_scan whenever an index_scan can be used.
> 
> But in this case, why doesn't the optimizer choose a seq_scan (discouraging
> is different from preventing)?
> 
> In many cases I need only a small fraction of rows to be retrieved. But
> this extreme case is a real scenario (not the most frequent, but real).
> 
> I am trying to find a way to achieve good performance even in the extreme
> case. Is there any way?
> 
> PS. In the literature there is a different alternative for indices: besides
> the simple approach of <attr_val, rid> there is the alternative <attr_val, set
> of rids>, which attaches to each distinct attr_val the set of rids of all
> rows with the same attr_val. Is this alternative taken into account in
> Postgres?
> 
> 
> On Mon, 7 Mar 2005, Jeff Davis wrote:
> 
> 
>>In that case, sequential scan is faster, but perhaps the planner doesn't
>>know that ahead of time. Try turning on more statistics if you haven't
>>already, and then run ANALYZE again. If the planner sees a range,
>>perhaps it assumes that it is a highly selective range, when in fact, it
>>consists of all of the tuples. Also, make sure enable_seqscan is true
>>(in case you turned it off for testing or something and forgot).
>>
>>A seqscan is usually faster when a large proportion of the tuples are
>>returned because:
>>(1) It uses sequential I/O; whereas an index might access tuples in a
>>random order.
>>(2) It doesn't have to read the index's disk pages at all.
>>
>>I suspect you don't need to return all the tuples in the table. If you
>>include the details of a real scenario perhaps the people on the list
>>could be more helpful.
>>
> 
> 


Re: About b-tree usage

From
Jeff Davis
Date:
On Tue, 2005-03-08 at 15:30 +0200, Ioannis Theoharis wrote:
> 
> Let me explain: I have turned enable_seqscan off in order to discourage
> the optimizer from choosing a seq_scan whenever an index_scan can be used.
> 
> But in this case, why doesn't the optimizer choose a seq_scan (discouraging
> is different from preventing)?
> 

As Michael Paesold pointed out, enable_seqscan=off effectively means
that it will never choose a seqscan when there exists an alternative.

> In many cases I need only a small fraction of rows to be retrieved. But
> this extreme case is a real scenario (not the most frequent, but real).
> 

Returning all the rows in a large table is a real scenario? I would
expect you to at least use a cursor. Keep in mind that libpq has to
store all those rows in local client memory, so a cursor is a good idea
if that is the case.
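
For example (using the table from your earlier mails):

BEGIN;

-- Fetch the result incrementally instead of materializing a million rows
-- in the client at once.
DECLARE att0_cur CURSOR FOR SELECT att0 FROM tc2000000000;
FETCH 1000 FROM att0_cur;   -- repeat until it returns no rows
CLOSE att0_cur;

COMMIT;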

And regardless, if you are returning all the rows in a table, the
absolute fastest way possible is a seq scan. An index is much slower
because it does way more work, and the disk accesses are random rather
than sequential.

> I am trying to find a way to achieve good performance even in the extreme
> case. Is there any way?
> 

The extreme case you provided is a SELECT without a WHERE. We already
know that PostgreSQL is executing that as efficiently as possible when
enable_seqscan=on. Why not send me a simple script that generates some
data similar to what you want to access, and then the query you'd like
to perform on that data? We might find that PostgreSQL is already
executing the query as efficiently as possible; in fact I suspect that's
the case (although perhaps some tuning would help).

In short, we shouldn't try to use indexes for all situations; they are
inherently worse for some situations. That's why PostgreSQL has both
seqscans and indexes, and an intelligent planner.

> PS. In the literature there is a different alternative for indices: besides
> the simple approach of <attr_val, rid> there is the alternative <attr_val, set
> of rids>, which attaches to each distinct attr_val the set of rids of all
> rows with the same attr_val. Is this alternative taken into account in
> Postgres?
> 

I don't entirely understand what you're asking. It seems like you're
talking about a relatively minor implementation issue, and if it does
make a difference, it would probably not be very much. Perhaps rephrase
your question?

> 
> On Mon, 7 Mar 2005, Jeff Davis wrote:
> 
> >
> > In that case, sequential scan is faster, but perhaps the planner doesn't
> > know that ahead of time. Try turning on more statistics if you haven't
> > already, and then run ANALYZE again. If the planner sees a range,
> > perhaps it assumes that it is a highly selective range, when in fact, it
> > consists of all of the tuples. Also, make sure enable_seqscan is true
> > (in case you turned it off for testing or something and forgot).
> >
> > A seqscan is usually faster when a large proportion of the tuples are
> > returned because:
> > (1) It uses sequential I/O; whereas an index might access tuples in a
> > random order.
> > (2) It doesn't have to read the index's disk pages at all.
> >
> > I suspect you don't need to return all the tuples in the table. If you
> > include the details of a real scenario perhaps the people on the list
> > could be more helpful.
> >
> 