Thread: sizing storage for 100 M files

sizing storage for 100 M files

From: Alinga Yeung
Date: Thu, 17 Sep 2009

Hi,

We plan to acquire some storage for our storage system to be based on
iRODS + PostgreSQL. iRODS uses PostgreSQL to store the metadata of a
file. For each file that iRODS stores, a row is added to each of the
following tables in PostgreSQL.

create table R_DATA_MAIN
(
  data_id bigint not null,
  coll_id bigint not null,
  data_name varchar(1000) not null,
  data_repl_num INTEGER not null,
  data_version varchar(250) DEFAULT '0',
  data_type_name varchar(250) not null,
  data_size bigint not null,
  resc_group_name varchar(250),
  resc_name varchar(250) not null,
  data_path varchar(2700) not null,
  data_owner_name varchar(250) not null,
  data_owner_zone varchar(250) not null,
  data_is_dirty INTEGER DEFAULT 0,
  data_status varchar(250),
  data_checksum varchar(1000),
  data_expiry_ts varchar(32),
  data_map_id bigint DEFAULT 0,
  data_mode varchar(32),
  r_comment varchar(1000),
  create_ts varchar(32),
  modify_ts varchar(32)
);

And:

create table R_OBJT_ACCESS
(
  object_id bigint not null,
  user_id bigint not null,
  access_type_id bigint not null,
  create_ts varchar(32),
  modify_ts varchar(32)
);

Size of R_DATA_MAIN ~= 7618 bytes
Size of R_OBJT_ACCESS ~= 88 bytes

According to the PostgreSQL FAQ, a PostgreSQL database may require up to
five times the disk space to store data from a text file.

Is it valid for me to use this multiplier to estimate the storage
requirement for 100 M files?
i.e. estimated storage = (7618 bytes + 88 bytes) * 100 M * 5
                      = 3.853 TB

Thank you in advance for your help.

Alinga



Re: sizing storage for 100 M files

From: Stephan Szabo
Date:

On Thu, 17 Sep 2009, Alinga Yeung wrote:

> Hi,
>
> We plan to acquire some storage for our storage system to be based on
> iRODS + PostgreSQL. iRODS uses PostgreSQL to store the metadata of a
> file. For each file that iRODS stores, a row is added to each of the
> following tables in PostgreSQL.
>
> [table definitions snipped]
>
> Size of R_DATA_MAIN ~= 7618 bytes
> Size of R_OBJT_ACCESS ~= 88 bytes

I don't think those sizes are right. You're missing the per-row header, and
if I'm counting right you're also missing the length bytes stored with each
varchar. Also, the maximum size in bytes of those varchars depends on the
encoding, since the (N) is a number of characters, not bytes. And if you're
looking for an estimate of actual usage rather than worst-case usage, the
varchar fields won't take up their maximum; they'll take the size of the
actual data (plus the length bytes).
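
If you want real numbers rather than a hand count, one rough way is to
insert a row with representative values into a scratch copy of the table
and ask PostgreSQL how big it actually is. Sketch only; the values below
are made up, so substitute something that looks like your data:

insert into R_DATA_MAIN (data_id, coll_id, data_name, data_repl_num,
                         data_type_name, data_size, resc_name, data_path,
                         data_owner_name, data_owner_zone)
values (1, 1, 'sample_file.dat', 0,
        'generic', 1048576, 'demoResc', '/vault/home/rods/sample_file.dat',
        'rods', 'tempZone');

-- Approximate stored size of that row's data; per-row header overhead and
-- index entries come on top of this.
select pg_column_size(d.*) from R_DATA_MAIN d where data_id = 1;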

> According to the PostgreSQL FAQ, a PostgreSQL database may require up to
> five times the disk space to store data from a text file.
>
> Is it valid for me to use this multiplier to estimate the storage
> requirement for 100 M files?

Not really. That factor is for comparing a plain text file against the size
you'd get from counts like the ones above (once corrected). You wouldn't
apply the multiplier on top of those numbers again.
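
A more reliable approach is usually to load a decent-sized sample of
representative rows into a test database and let PostgreSQL report the
total size, indexes and all, then scale up. This is only a sketch; the
sample size is arbitrary and the table names assume the definitions above:

-- Total on-disk size (heap + indexes + TOAST) for each table, in bytes,
-- after loading the sample.
select pg_total_relation_size('r_data_main')   as data_main_bytes,
       pg_total_relation_size('r_objt_access') as objt_access_bytes;

-- If the sample was, say, 1 M rows, scale linearly to 100 M:
--   estimated_bytes ~= (data_main_bytes + objt_access_bytes) * 100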

> i.e. estimated storage = (7618 bytes + 88 bytes) * 100 M * 5

Also, can there be multiple access rows per file? If so, you'd need to take
that into account.
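
For example, on an existing install something like this (assuming
R_OBJT_ACCESS.object_id refers to R_DATA_MAIN.data_id, which you'd want to
confirm against the iRODS schema) would tell you the average number of
access rows per object; you'd then multiply the R_OBJT_ACCESS part of the
estimate by that factor:

select count(*)::numeric
       / nullif(count(distinct object_id), 0) as avg_access_rows_per_object
from R_OBJT_ACCESS;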