Thread: Estimating space required for indexes

Estimating space required for indexes

From: Shridhar Daithankar

Hi all,

Yesterday I was loading a medium-size terrain data dump into a PostgreSQL
database (350 files of 8-13MB each, 3.3GB in all). The load was done in 45
minutes (my feeling is that 7.3.2 is really fast at data loading, anyway).
It took 4GB of disk space after the upload.

A tuple consists of 3 float values: x, y, z. I had to create a compound
index on x and y. I started the index build and killed it 1.5 hours later,
as it had filled the remaining 5GB of free disk to the point of choking.

Since not much time had been lost, I recreated the database, created the
index first, and then started the bulk load. It took 3 hours and 10 minutes,
which is not bad at all.

How can I reasonably predict how much disk space I am going to need for this
kind of indexing operation? This data is just a small sample and more is
coming. Is there any solution other than inserting into an already-indexed
table? BTW, Oracle 8i choked equally badly when we attempted to create an
index on such a large table.

Some observations which I feel are worth mentioning:

Machine: P-III/800MHz, 256MB RAM, 40GB IDE disk.
PostgreSQL settings: shared_buffers=800, fsync off.
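
In postgresql.conf terms that is roughly the following (the comments are
mine; 800 buffers at 8KB each is only ~6.4MB of cache):

    shared_buffers = 800        # 800 * 8KB = ~6.4MB of buffer cache
    fsync = false               # trades crash safety for load speed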

1) With 5 concurrent processes loading data, it took 45 minutes to insert
the 3.3GB dump, with each file (worth ~200K rows) done in a single
transaction. For the indexed insertion I didn't use more than one
connection, as the disk was already choking. The text dumps were read from
the same disk the database was on, just from a different partition.
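
So each file went in as one transaction, i.e. something like the following
sketch per file (the path is illustrative, and plain batched INSERTs would
look much the same):

    BEGIN;
    COPY height (x, y, z) FROM '/path/to/terrain_file.txt';
    COMMIT;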

2) The floats are real, i.e. float4, and the table is without OIDs, so a
tuple is 12 bytes of data. pg_class shows 395320 pages, i.e. more than 3
gigs of data. The table has roughly 8.1M rows. Shouldn't the disk usage have
been a bit less? 8.1M * 12 gives around 100MB of raw data. Even accounting
for all overhead, the disk usage seems a bit too much.
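
The page arithmetic behind those numbers (8KB pages):

    395320 pages * 8192 bytes ≈ 3.0GB on disk
    8.1M rows * 12 bytes      ≈ 93MB of raw float data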

3) The index is taking 376840 pages, i.e. another 3 gigs. I understand it is
a compound index, but could it be made a bit more space-efficient?

4) I got bitten by the planner: x=-81.45 results in a sequential scan, while
x=-81.45::real results in an index scan. Oops!
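
To illustrate item 4 (plans abbreviated, not verbatim output):

    EXPLAIN SELECT * FROM height WHERE x = -81.45;
    -- Seq Scan on height ...   (the bare literal is taken as float8,
    --                           so the index on the real column is not used)

    EXPLAIN SELECT * FROM height WHERE x = -81.45::real;
    -- Index Scan using height_index ...   (types match; the index is used)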

I am really impressed with the way 7.3.2 is handling data on such a low-spec
machine. I'll come back if I have any more problems/comments.

 Shridhar


Re: Estimating space required for indexes

From: Tom Lane

Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes:
> Yesterday I was loading a medium-size terrain data dump into a PostgreSQL
> database (350 files of 8-13MB each, 3.3GB in all). The load was done in 45
> minutes (my feeling is that 7.3.2 is really fast at data loading, anyway).
> It took 4GB of disk space after the upload.

> A tuple consists of 3 float values: x, y, z. I had to create a compound
> index on x and y. I started the index build and killed it 1.5 hours later,
> as it had filled the remaining 5GB of free disk to the point of choking.

AFAIK, a CREATE INDEX should require no more than twice the finished
index's size on disk.  I'm surprised that you were able to build the
index one way and not the other.

> How can I reasonably predict how much disk space I am going to need for
> this kind of indexing operation?

Assuming your "float"s were float4s, the heap tuple size is

    28 bytes overhead + 3 * 4 bytes data = 40 bytes/row

(assuming WITHOUT OIDS, no nulls, Intel-like alignment rules) while the
index tuple size is

    12 bytes overhead + 2 * 4 bytes data = 20 bytes/row

But this is not the whole story because heap pages are normally crammed
full while btree index pages are normally only filled 2/3rds full during
initial creation.  (Plus you have to allow for upper b-tree levels, but
with such small index entries that won't be much.)  So I'd have expected
the finished index to be about 3/4ths the size of the table proper.
I'm surprised you could fit it at all.
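
For the record, that 3/4ths falls straight out of the numbers above:

    20 bytes / (2/3 fill) = 30 effective bytes per index row
    30 / 40 bytes per heap row = 0.75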

> This data is just a small sample and more is coming.

Better buy more disk ...

            regards, tom lane


Re: Estimating space required for indexes

From: Shridhar Daithankar

On Monday 28 April 2003 19:40, Tom Lane wrote:
> Assuming your "float"s were float4s, the heap tuple size is

Yes, they are.

>     28 bytes overhead + 3 * 4 bytes data = 40 bytes/row
>
> (assuming WITHOUT OIDS, no nulls, Intel-like alignment rules) while the
> index tuple size is
>
>     12 bytes overhead + 2 * 4 bytes data = 20 bytes/row

With these overheads it turns out that:

disk space for table = 8.1M * 40 = 324MB
disk space for index = 8.1M * 20 * 1.5 = 243MB

But on disk, 5.9GB is actually gone, as I mentioned earlier. Even if we
count other overheads, the above total should not blow up beyond 600-650MB,
should it? But that is not the case, and there are absolutely no other
objects in the database.

The table was vacuum-analyzed after the insertions. No deletes at all so
far. I really wonder where this 10x bloat came from.

> > This data is just a small sample and more is coming.
>
> Better buy more disk ...

Disk is not a problem. But that does not mean I would like to fill it up
without knowing what is happening.

 Shridhar

--
I know it all.  I just can't remember it all at once.


Re: Estimating space required for indexes

From: Tom Lane

Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes:
> With these overheads it turns out that:
> disk space for table = 8.1M * 40 = 324MB
> disk space for index = 8.1M * 20 * 1.5 = 243MB
> But on disk, 5.9GB is actually gone, as I mentioned earlier.

Into what?  Look in the database directory and show us the file sizes.
contrib/pgstattuple might be useful as well.
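
For instance, just as a sketch (relpages * 8KB gives the approximate
on-disk size):

    SELECT relname, relpages, relpages * 8 / 1024 AS approx_mb
    FROM pg_class
    WHERE relname IN ('height', 'height_index');

plus an ls -l on the matching relfilenode files under base/.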

            regards, tom lane


Re: Estimating space required for indexes

From: Oleg Bartunov

On Mon, 28 Apr 2003, Tom Lane wrote:

> Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes:
> > Yesterday I was loading a medium-size terrain data dump into a PostgreSQL
> > database (350 files of 8-13MB each, 3.3GB in all). The load was done in 45
> > minutes (my feeling is that 7.3.2 is really fast at data loading, anyway).
> > It took 4GB of disk space after the upload.
>
> > A tuple consists of 3 float values: x, y, z. I had to create a compound
> > index on x and y. I started the index build and killed it 1.5 hours later,
> > as it had filled the remaining 5GB of free disk to the point of choking.
>
> AFAIK, a CREATE INDEX should require no more than twice the finished
> index's size on disk.  I'm surprised that you were able to build the
> index one way and not the other.
>
> > How can I reasonably predict how much disk space I am going to need for
> > this kind of indexing operation?
>
> Assuming your "float"s were float4s, the heap tuple size is
>
>     28 bytes overhead + 3 * 4 bytes data = 40 bytes/row
>
> (assuming WITHOUT OIDS, no nulls, Intel-like alignment rules) while the
> index tuple size is
>
>     12 bytes overhead + 2 * 4 bytes data = 20 bytes/row
>
> But this is not the whole story because heap pages are normally crammed
> full while btree index pages are normally only filled 2/3rds full during
> initial creation.  (Plus you have to allow for upper b-tree levels, but

Are there any benefits to be had from making btree index pages more
space-efficient? I've read some paper claiming 98% space usage for btrees.

> with such small index entries that won't be much.)  So I'd have expected
> the finished index to be about 3/4ths the size of the table proper.
> I'm surprised you could fit it at all.
>
> > This data is just a small sample and more is coming.
>
> Better buy more disk ...
>
>             regards, tom lane
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: Estimating space required for indexes

From: Shridhar Daithankar

On Monday 28 April 2003 20:11, Tom Lane wrote:
> Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes:
> > With these overheads it turns out that:
> > disk space for table = 8.1M * 40 = 324MB
> > disk space for index = 8.1M * 20 * 1.5 = 243MB
> > But on disk, 5.9GB is actually gone, as I mentioned earlier.
>
> Into what?  Look in the database directory and show us the file sizes.
> contrib/pgstattuple might be useful as well.

OK, here is what I picked up from the admin guide regarding disk usage. I
hope this is useful.

nav=# \d height
   Table "public.height"
 Column | Type | Modifiers
--------+------+-----------
 x      | real |
 y      | real |
 z      | real |
Indexes: height_index btree (x, y)
nav=# \di
               List of relations
 Schema |     Name     | Type  | Owner | Table
--------+--------------+-------+-------+--------
 public | height_index | index | ajit  | height
(1 row)

nav=# select relfilenode,relpages from pg_class where relname='height';
 relfilenode | relpages
-------------+----------
       16977 |   395320
(1 row)

nav=# select relfilenode,relpages from pg_class where relname='height_index';
 relfilenode | relpages
-------------+----------
       16988 |   376840
(1 row)
nav=# SELECT c2.relname, c2.relpages
nav-# FROM pg_class c, pg_class c2, pg_index i
nav-# WHERE c.relname = 'height' AND
nav-# c.oid = i.indrelid AND
nav-# c2.oid = i.indexrelid
nav-# ORDER BY c2.relname;
   relname    | relpages
--------------+----------
 height_index |   376840
(1 row)

[ajit@jejuri dbs]$ du -h
3.6M    ./base/1
3.6M    ./base/16975
5.9G    ./base/16976
6.0G    ./base
132K    ./global
8.0K    ./pg_clog
128M    ./pg_xlog
6.1G    .
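
For what it's worth, the table and the index together do account for that
figure:

    (395320 + 376840) pages * 8KB ≈ 5.9GB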

Shridhar

--
"The algorithm to do that is extremely nasty.  You might want to mug
someone with it."
        -- M. Devine, Computer Science 340


Re: Estimating space required for indexes

From: Shridhar Daithankar

On Monday 28 April 2003 20:19, Shridhar Daithankar wrote:
> [ajit@jejuri dbs]$ du -h
> 3.6M    ./base/1
> 3.6M    ./base/16975
> 5.9G    ./base/16976
> 6.0G    ./base
> 132K    ./global
> 8.0K    ./pg_clog
> 128M    ./pg_xlog
> 6.1G    .

Sorry, I forgot this last time:

nav=# select relfilenode from pg_class where relname='height';
 relfilenode
-------------
       16977
(1 row)

nav=# \q
[ajit@jejuri dbs]$ find . -name 16977
./base/16976/16977

 Shridhar

--
If A equals success, then the formula is _A = _X + _Y + _Z.  _X is work.
_Y is play.  _Z is keep your mouth shut.
        -- Albert Einstein


Re: Estimating space required for indexes

From: Tom Lane

Oleg Bartunov <oleg@sai.msu.su> writes:
>> But this is not the whole story because heap pages are normally crammed
>> full while btree index pages are normally only filled 2/3rds full during
>> initial creation.  (Plus you have to allow for upper b-tree levels, but

> Are there any benefits to be had from making btree index pages more
> space-efficient? I've read some paper claiming 98% space usage for btrees.

Standard theory says that optimal load for a b-tree is 65-70%.  We used
to make CREATE INDEX cram the leaf pages full, but that just resulted in
a lot of page splits as soon as you did any inserts or updates.  And the
page splits destroy the physical ordering of the index, which negates
any I/O savings you might have had from fewer pages.

I suppose if you know that the table will be static there might be some
value in telling CREATE INDEX to pack the index pages full, but I'm not
sure it's worth the trouble ...
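
Back-of-envelope, the space difference is not huge anyway:

    leaf pages needed ≈ entries * entry_size / (8192 * fill)
    so 67% fill versus 98% fill costs about 0.98/0.67 ≈ 1.46x the pages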

            regards, tom lane


Re: Estimating space required for indexes

From: Oleg Bartunov

On Mon, 28 Apr 2003, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
> >> But this is not the whole story because heap pages are normally crammed
> >> full while btree index pages are normally only filled 2/3rds full during
> >> initial creation.  (Plus you have to allow for upper b-tree levels, but
>
> > Are there any benefits to be had from making btree index pages more
> > space-efficient? I've read some paper claiming 98% space usage for btrees.
>
> Standard theory says that optimal load for a b-tree is 65-70%.  We used
> to make CREATE INDEX cram the leaf pages full, but that just resulted in
> a lot of page splits as soon as you did any inserts or updates.  And the
> page splits destroy the physical ordering of the index, which negates
> any I/O savings you might have had from fewer pages.

Thanks for the comment, I hadn't thought about that.

>
> I suppose if you know that the table will be static there might be some
> value in telling CREATE INDEX to pack the index pages full, but I'm not
> sure it's worth the trouble ...
>

Sure.

>             regards, tom lane
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: Estimating space required for indexes

From: Shridhar Daithankar

On Monday 28 April 2003 20:11, Tom Lane wrote:
> Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes:
> > With these overheads it turns out that:
> > disk space for table = 8.1M * 40 = 324MB
> > disk space for index = 8.1M * 20 * 1.5 = 243MB
> > But on disk, 5.9GB is actually gone, as I mentioned earlier.
>
> Into what?  Look in the database directory and show us the file sizes.
> contrib/pgstattuple might be useful as well.
>
>             regards, tom lane

OK, I solved the mystery, or whatever it was. Mostly my own stupidity.

I started from the assumption that a table row would be 40 bytes and an
index row 20 bytes.

But I found this *little* discrepancy between the actual database size and
the number of rows. Upon further research, I found where I was wrong.

nav=# explain select * from height;
                             QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on height  (cost=0.00..1201902.72 rows=80658272 width=12)
(1 row)

nav=# select reltuples from pg_class where relname='height';
  reltuples
-------------
 8.06583e+07
(1 row)

The number of tuples is 80.6M, not the 8.1M I said earlier. That e+07 there
told me what I was doing wrong: e+07 is tens of millions, not millions.

So the estimated table space is 3076MB and the estimated index space is
1538MB at 100% packing. The index is actually eating close to 2950MB, which
amounts to a page usage ratio of about 52%. That is perfectly OK.
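
Spelled out:

    80658272 rows * 40 bytes ≈ 3076MB   (heap, fully packed)
    80658272 rows * 20 bytes ≈ 1538MB   (index, fully packed)
    376840 pages * 8KB       ≈ 2944MB   (actual index; 1538/2944 ≈ 52% fill)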

I simply misread the order of magnitude. No wonder I thought it was 10x
bloated.

 Shridhar
--
"One thing they don't tell you about doing experimental physics is that
sometimes you must work under adverse conditions ... like a state of
sheer terror."
        -- W. K. Hartmann