Thread: Index size

Index size

From
Ioannis Theoharis
Date:

Hi,

I have created a btree index on a 'int4' attribute of a table.

After i have inserted 1,000,000 raws in my table, i can see that my index
size is 2745 Blocks (8KB each) from pg_class. That means about 21,960 KB
size.

I try to understand hows is this number generated, because thought that
for each new entry in table, there is a new entry in index and that each
entry of the index is:

4 Bytes for the int4 attribute
and
40 Bytes for oid

So 44 * 1,000,000 ~ 42,969 KB

Can anybody inform me where I do the mistake?


Re: Index size

From
Tatsuo Ishii
Date:
> I have created a btree index on a 'int4' attribute of a table.
>
> After i have inserted 1,000,000 raws in my table, i can see that my index
> size is 2745 Blocks (8KB each) from pg_class. That means about 21,960 KB
> size.
>
> I try to understand hows is this number generated, because thought that
> for each new entry in table, there is a new entry in index and that each
> entry of the index is:
>
> 4 Bytes for the int4 attribute
> and
> 40 Bytes for oid
>
> So 44 * 1,000,000 ~ 42,969 KB
>
> Can anybody inform me where I do the mistake?

There's no oid in index tuples. There is an 8-byte long header for
each index tuple. Since you are inserting 4-byte long user data, you
index tuples are 12-byte each. Each index tuple needs a "pointer" in a
block, which is called "item pointer" and that is 4-byte long. Each
block can hold up to floor((8192-24(page header)-16(special
data))/(12+4)) = 509 tuples. ceil(1,000,000/509) = 1965 is the blocks
you need for your index. In addition to this, you need a "meta page"
and a "root page". So it becomes 1965+1+1 = 1967. Also you need
"internal pages", whose numer is hard to guess since it depends on the
actual index tree structure(for example, tree height). From my limited
experience, for 1,000,000 tuples, you will need at least 7 internal
pages. Now the number becomes 1967+7 = 1974. Still it's different from
2745. If you don't have deleted tuples, the difference probably comes
from the fact that a btree index can never be 100% occupied. IMO
1974/2745 = 0.71 seems not so bad.
--
Tatsuo Ishii

Re: Index size

From
Ioannis Theoharis
Date:

Thanks a lot.

An other question:

Is there any way to prevent duplicates on btree index attribute,
PERMITTING them on table?




On Tue, 1 Mar 2005, Tatsuo Ishii wrote:

> > I have created a btree index on a 'int4' attribute of a table.
> >
> > After i have inserted 1,000,000 raws in my table, i can see that my index
> > size is 2745 Blocks (8KB each) from pg_class. That means about 21,960 KB
> > size.
> >
> > I try to understand hows is this number generated, because thought that
> > for each new entry in table, there is a new entry in index and that each
> > entry of the index is:
> >
> > 4 Bytes for the int4 attribute
> > and
> > 40 Bytes for oid
> >
> > So 44 * 1,000,000 ~ 42,969 KB
> >
> > Can anybody inform me where I do the mistake?
>
> There's no oid in index tuples. There is an 8-byte long header for
> each index tuple. Since you are inserting 4-byte long user data, you
> index tuples are 12-byte each. Each index tuple needs a "pointer" in a
> block, which is called "item pointer" and that is 4-byte long. Each
> block can hold up to floor((8192-24(page header)-16(special
> data))/(12+4)) = 509 tuples. ceil(1,000,000/509) = 1965 is the blocks
> you need for your index. In addition to this, you need a "meta page"
> and a "root page". So it becomes 1965+1+1 = 1967. Also you need
> "internal pages", whose numer is hard to guess since it depends on the
> actual index tree structure(for example, tree height). From my limited
> experience, for 1,000,000 tuples, you will need at least 7 internal
> pages. Now the number becomes 1967+7 = 1974. Still it's different from
> 2745. If you don't have deleted tuples, the difference probably comes
> from the fact that a btree index can never be 100% occupied. IMO
> 1974/2745 = 0.71 seems not so bad.
> --
> Tatsuo Ishii
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

Re: Index size

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> ... Now the number becomes 1967+7 = 1974. Still it's different from
> 2745. If you don't have deleted tuples, the difference probably comes
> from the fact that a btree index can never be 100% occupied. IMO
> 1974/2745 = 0.71 seems not so bad.

In fact the traditional figure for the steady-state load factor of a
btree index is 2/3rds; that is, after a long sequence of inserts and
deletes you can expect about one-third of each page to be empty space.

If Ioannis' number was taken immediately after a CREATE INDEX operation,
then his index size isn't reflective of any settling to a steady-state
load factor; rather it happens because the CREATE INDEX command
deliberately loads the index leaf pages only 2/3rds full, to avoid a
disproportionate amount of page splitting when normal inserts commence.

            regards, tom lane

Re: Index size

From
Tatsuo Ishii
Date:
> An other question:
>
> Is there any way to prevent duplicates on btree index attribute,
> PERMITTING them on table?

I can't think of any usefull usage for such an index. Can you explain
why you need it?
--
Tatsuo Ishii

Re: Index size

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > ... Now the number becomes 1967+7 = 1974. Still it's different from
> > 2745. If you don't have deleted tuples, the difference probably comes
> > from the fact that a btree index can never be 100% occupied. IMO
> > 1974/2745 = 0.71 seems not so bad.
>
> In fact the traditional figure for the steady-state load factor of a
> btree index is 2/3rds; that is, after a long sequence of inserts and
> deletes you can expect about one-third of each page to be empty space.
>
> If Ioannis' number was taken immediately after a CREATE INDEX operation,
> then his index size isn't reflective of any settling to a steady-state
> load factor; rather it happens because the CREATE INDEX command
> deliberately loads the index leaf pages only 2/3rds full, to avoid a
> disproportionate amount of page splitting when normal inserts commence.

Interesting. Right after CREATE INDEX for a int4 column using pgbench
-s 10(1,000,000 tuples), I got 2184 leaf pages. From my caliculation
the number of leaf pages is expected to 1965, which is 100% full case
assumption of course. So 1965/2184 = 0.8997 = 90% is actually used?
--
Tatsuo Ishii

Re: Index size

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> ... rather it happens because the CREATE INDEX command
>> deliberately loads the index leaf pages only 2/3rds full, to avoid a
>> disproportionate amount of page splitting when normal inserts commence.

> Interesting. Right after CREATE INDEX for a int4 column using pgbench
> -s 10(1,000,000 tuples), I got 2184 leaf pages. From my caliculation
> the number of leaf pages is expected to 1965, which is 100% full case
> assumption of course. So 1965/2184 = 0.8997 = 90% is actually used?

Shoulda read the code rather than going by memory ;-).  What nbtsort.c
actually says is

 * It is not wise to pack the pages entirely full, since then *any*
 * insertion would cause a split (and not only of the leaf page; the need
 * for a split would cascade right up the tree).  The steady-state load
 * factor for btrees is usually estimated at 70%.  We choose to pack leaf
 * pages to 90% and upper pages to 70%.  This gives us reasonable density
 * (there aren't many upper pages if the keys are reasonable-size) without
 * incurring a lot of cascading splits during early insertions.

and indeed the code seems to do that:

    /* set "full" threshold based on level.  See notes at head of file. */
    if (level > 0)
        state->btps_full = (PageGetPageSize(state->btps_page) * 3) / 10;
    else
        state->btps_full = PageGetPageSize(state->btps_page) / 10;

            regards, tom lane

Re: Index size

From
Tatsuo Ishii
Date:
> > Interesting. Right after CREATE INDEX for a int4 column using pgbench
> > -s 10(1,000,000 tuples), I got 2184 leaf pages. From my caliculation
> > the number of leaf pages is expected to 1965, which is 100% full case
> > assumption of course. So 1965/2184 = 0.8997 = 90% is actually used?
>
> Shoulda read the code rather than going by memory ;-).  What nbtsort.c
> actually says is
>
>  * It is not wise to pack the pages entirely full, since then *any*
>  * insertion would cause a split (and not only of the leaf page; the need
>  * for a split would cascade right up the tree).  The steady-state load
>  * factor for btrees is usually estimated at 70%.  We choose to pack leaf
>  * pages to 90% and upper pages to 70%.  This gives us reasonable density
>  * (there aren't many upper pages if the keys are reasonable-size) without
>  * incurring a lot of cascading splits during early insertions.
>
> and indeed the code seems to do that:
>
>     /* set "full" threshold based on level.  See notes at head of file. */
>     if (level > 0)
>         state->btps_full = (PageGetPageSize(state->btps_page) * 3) / 10;
>     else
>         state->btps_full = PageGetPageSize(state->btps_page) / 10;
>

Thanks for the explanation.

So it seems Ioannis' number was not taken immediately after a CREATE
INDEX operation?
--
Tatsuo Ishii

Re: Index size

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> So it seems Ioannis' number was not taken immediately after a CREATE
> INDEX operation?

I would guess not, but it's up to him to say.  If it is a number derived
after some period of normal operation, then his result agrees with the
theory that says 70% is the steady-state figure ...

            regards, tom lane

Re: Index size

From
Ioannis Theoharis
Date:

On Tue, 1 Mar 2005, Tom Lane wrote:

> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > So it seems Ioannis' number was not taken immediately after a CREATE
> > INDEX operation?
>
> I would guess not, but it's up to him to say.  If it is a number derived
> after some period of normal operation, then his result agrees with the
> theory that says 70% is the steady-state figure ...

yes, my number was taken after a large amount of inserts.

Your comments about the block usage in case of b-tree indexes are
absolutely interesting.
Where can i find a documentation with technical analysis for all (if
possible) of components of postgres?

All documentations that i have found are very general and refer to simple
users.


>
>             regards, tom lane
>

Re: Index size

From
Tom Lane
Date:
Ioannis Theoharis <theohari@ics.forth.gr> writes:
> Where can i find a documentation with technical analysis for all (if
> possible) of components of postgres?

Read the source code.

            regards, tom lane

Re: Index size

From
Ioannis Theoharis
Date:

On Wed, 2 Mar 2005, Tatsuo Ishii wrote:

> > An other question:
> >
> > Is there any way to prevent duplicates on btree index attribute,
> > PERMITTING them on table?
>
> I can't think of any usefull usage for such an index. Can you explain
> why you need it?

I have a relation like this: (att0 varchar(1000), att1 int4)

i create a b-tree index on att1 ()
i cluster my raltion according to index

now i have a query
select     *
form     tc2000000000
where     att1<=900000000 and att1>=0 ;

As far as i can see from explain analyze an index scan is used:
Index Scan using inst_id_idx on tc2000000000
  Index Cond: ((att1 <= 900000000) AND (att1 >= 0))

If for each entry in table, an entry in index is beeing held, then the
index size is populated too fast.

I guess, that postgres uses index to find the first entry satisfying the
index conition, after find the last one and then do a sequential scan on
the appropriate fraction of the table (to take advantage of physical
clustering).

In my case, discrete values on att1 are orders of magnitude less than
number of table raws.

Thus, the big index size is useless for me. I want to avoid the overhead
of scanning such a big index, just permitting ONLY the discrete values to
entry in index. In such a way the whole scenario i presented before for
how i guess, that postgres evaluates my query, is still in use.

I think there must be a way to change the way of index_usage to alter it
to what i 'm looking for.


> --
> Tatsuo Ishii
>

Re: Index size

From
Martijn van Oosterhout
Date:
On Wed, Mar 02, 2005 at 10:08:58PM +0200, Ioannis Theoharis wrote:
> I have a relation like this: (att0 varchar(1000), att1 int4)
>
> i create a b-tree index on att1 ()
> i cluster my raltion according to index
>
> now i have a query
> select     *
> form     tc2000000000
> where     att1<=900000000 and att1>=0 ;
>
> As far as i can see from explain analyze an index scan is used:
> Index Scan using inst_id_idx on tc2000000000
>   Index Cond: ((att1 <= 900000000) AND (att1 >= 0))
>
> If for each entry in table, an entry in index is beeing held, then the
> index size is populated too fast.
>
> I guess, that postgres uses index to find the first entry satisfying the
> index conition, after find the last one and then do a sequential scan on
> the appropriate fraction of the table (to take advantage of physical
> clustering).

What makes you think that? Clustering is nice, but postgresql needs to
get the right answer and that the table in clustered is not something
postgresql can rely on. It uses the index to find *every* row you're
looking for, there's no shortcut here.

> In my case, discrete values on att1 are orders of magnitude less than
> number of table raws.
>
> Thus, the big index size is useless for me. I want to avoid the overhead
> of scanning such a big index, just permitting ONLY the discrete values to
> entry in index. In such a way the whole scenario i presented before for
> how i guess, that postgres evaluates my query, is still in use.

There's no special relationship between two rows with the same att1.
Either you find the rows by using an index for each row, or scanning
the whole table. There's no inbetween. The only thing clustering
acheives is that due to values being together, the chance that
succeeding indexes entries will already have been loaded is higher,
thus reducing the overall cost.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: Index size

From
Martijn van Oosterhout
Date:
On Wed, Mar 02, 2005 at 11:30:58PM +0200, Ioannis Theoharis wrote:
> On Wed, 2 Mar 2005, Martijn van Oosterhout wrote:
> > What makes you think that? Clustering is nice, but postgresql needs to
> > get the right answer and that the table in clustered is not something
> > postgresql can rely on.
>
> If postgresql doesn't rely on it, it' s postgresql's technical decision
> (and i don't know the reason) and not a default decision between rdbms's.
>
> But if you know exactly the reason, it would be a great help for me to
> know it.

Easy, if you CLUSTER a table, it's CLUSTERed then. But it doesn't stay
that way. As soon as you insert a new row, or update an old one, it
gets added to the end (the only place with space) and now it's not
clustered anymore. It's almost clustered and from a caching point of
view it's fine. But postgresql can't assume at any point a table will
stay clustered, an insert could happen in the middle of your
processing.

Logically you can't magically add space in the middle of a file, you
have to move everything else up. If you know an efficient way to keep a
table clustered while handling arbitrary inserts and updates, I'd be
curious to know...
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: Index size

From
Ioannis Theoharis
Date:

All you said are wright. But it 's not so difficult for postgresql to hold
on a bit attribute attached to each table the information, whether there
is done an insertion/deletion/update to a clustered table or not.

And i guess, postgresql would already implement this simply alternative.


> Easy, if you CLUSTER a table, it's CLUSTERed then. But it doesn't stay
> that way. As soon as you insert a new row, or update an old one, it
> gets added to the end (the only place with space) and now it's not
> clustered anymore. It's almost clustered and from a caching point of
> view it's fine. But postgresql can't assume at any point a table will
> stay clustered, an insert could happen in the middle of your
> processing.
>
> Logically you can't magically add space in the middle of a file, you
> have to move everything else up. If you know an efficient way to keep a
> table clustered while handling arbitrary inserts and updates, I'd be
> curious to know...

SRF, JDBC and result info

From
ntinos@aueb.gr
Date:
Hi everybody!

I have an SRF which is called from a JAVA app with JDBC. Everything works
fine
and I want now to be able to pass some result-related info to my app. It is
not about the format of the results (ResultSetMetaData) or something like
that.

Is it possible to return  some string (or other type of)info together with
the result tuples (even if it requiers some hacking i.e. there is no
provision for something like that)? Any ideas?

Regards,
Ntinos Katsaros


Re: SRF, JDBC and result info

From
Kris Jurka
Date:

On Tue, 8 Mar 2005 ntinos@aueb.gr wrote:

> Hi everybody!
>
> I have an SRF which is called from a JAVA app with JDBC. Everything
> works fine and I want now to be able to pass some result-related info to
> my app. It is not about the format of the results (ResultSetMetaData) or
> something like that.
>
> Is it possible to return  some string (or other type of)info together with
> the result tuples (even if it requiers some hacking i.e. there is no
> provision for something like that)? Any ideas?
>

The only idea that comes to mind is using RAISE NOTICE in your plpgsql
function and Statement or ResultSet .getWarnings() on the Java side to
retrieve that info.  There really isn't any other out of band data path.

Kris Jurka


Re: SRF, JDBC and result info

From
ntinos@aueb.gr
Date:
Thank you very much for your reply. The thing is that my SRF is written in
C, not plpgsql, but I'll look into RAISE NOTICE anyway.(I think there is
something equevalent in libpq)

Thanks again,
Ntinos Katsaros

Kris Jurka writes:

>
>
> On Tue, 8 Mar 2005 ntinos@aueb.gr wrote:
>
>> Hi everybody!
>>
>> I have an SRF which is called from a JAVA app with JDBC. Everything
>> works fine and I want now to be able to pass some result-related info to
>> my app. It is not about the format of the results (ResultSetMetaData) or
>> something like that.
>>
>> Is it possible to return  some string (or other type of)info together with
>> the result tuples (even if it requiers some hacking i.e. there is no
>> provision for something like that)? Any ideas?
>>
>
> The only idea that comes to mind is using RAISE NOTICE in your plpgsql
> function and Statement or ResultSet .getWarnings() on the Java side to
> retrieve that info.  There really isn't any other out of band data path.
>
> Kris Jurka
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster



Re: SRF, JDBC and result info

From
ntinos@aueb.gr
Date:
Solution found! Thanks to Kris Jurka's advise I managed to pass this info
using: elog(INFO,... or elog(NOTICE,... . These messages together with
.getWarnings() do the job. :

e.g. message returned by the SQLWarning:

SNOTICEC00000M#SUCCESSFUL EXECUTION.  NO TUPLES FROM PEER(S):
mobileb#Ftestmybuild.cL2558Ranswer

Getting the plain message is then trivial (e.g. using flag chars like '#'
above)

Of cource the appropriate logging must be set in postgresql.conf.

Just in case somenone wants to do the same thing.
I dont know if this is the best solution (or if any other exists) but it
surely works.

Regards,
Ntinos Katsaros

PS: libpq has nothing to do with the above :-)!

ntinos@aueb.gr writes:

> Thank you very much for your reply. The thing is that my SRF is written in
> C, not plpgsql, but I'll look into RAISE NOTICE anyway.(I think there is
> something equevalent in libpq)
>
> Thanks again,
> Ntinos Katsaros
>
> Kris Jurka writes:
>
>>
>>
>> On Tue, 8 Mar 2005 ntinos@aueb.gr wrote:
>>
>>> Hi everybody!
>>>
>>> I have an SRF which is called from a JAVA app with JDBC. Everything
>>> works fine and I want now to be able to pass some result-related info to
>>> my app. It is not about the format of the results (ResultSetMetaData) or
>>> something like that.
>>>
>>> Is it possible to return  some string (or other type of)info together
>>> with the result tuples (even if it requiers some hacking i.e. there is
>>> no provision for something like that)? Any ideas?
>>>
>>
>> The only idea that comes to mind is using RAISE NOTICE in your plpgsql
>> function and Statement or ResultSet .getWarnings() on the Java side to
>> retrieve that info.  There really isn't any other out of band data path.
>>
>> Kris Jurka
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 4: Don't 'kill -9' the postmaster
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq