Thread: BIG files

BIG files

From
rabt@dim.uchile.cl
Date:
Hi all PostgreSQL users,

I've been using MySQL for years and have now decided to switch to PostgreSQL,
because I needed more robust "enterprise" features like views and triggers. I
work with VERY large datasets: 60 monthly tables with 700,000 rows and 99
columns each, with mostly large numeric values (15 digits; NUMERIC(15,0)
datatypes, not all filled). So far, I've migrated two of my tables to a
dedicated directory and tablespace in Postgres. I've created some views and
triggers, run queries, and let me tell you that performance is quite good after
tweaking some parameters in postgresql.conf. Now I'm very excited about all the
PostgreSQL capabilities, and I want to switch right away.

The main problem is disk space. The database files stored in Postgres take 4 or
5 times more space than in MySQL. Just to be sure, after each bulk load I
performed a VACUUM FULL to reclaim any possible lost space, but nothing gets
reclaimed. My plain-text dump files with INSERTs are just 150 MB in size, while
the files in the Postgres directory are more than 1 GB each! I've tested other
free DBMSs like Firebird and Ingres, but PostgreSQL consumes far more disk
space than the others.

Disk space is a very relevant issue for me because the hard disks are shared
with other people, and it is not possible to buy more disks right now.

I don't know much about this, but is there some configuration parameter that
I'm missing, like page size or something, or is it an OS problem? What should I
look for? I really like Postgres, but if I can't solve the space issue I will
have to stick to MySQL (which I don't want).

By the way, I'm working on a P4 with 640 MB RAM, Win2K, and PostgreSQL v8.0.3.

Thanks in advance,
Rodrigo,
Chile

Re: BIG files

From
Bruno Wolff III
Date:
On Sat, Jun 18, 2005 at 13:45:42 -0400,
  rabt@dim.uchile.cl wrote:
> Hi all PostgreSQL users,
>
> I've been using MySQL for years and have now decided to switch to PostgreSQL,
> because I needed more robust "enterprise" features like views and triggers. I
> work with VERY large datasets: 60 monthly tables with 700,000 rows and 99
> columns each, with mostly large numeric values (15 digits; NUMERIC(15,0)
> datatypes, not all filled). So far, I've migrated two of my tables to a dedicated
>
> The main problem is disk space. The database files stored in Postgres take 4 or
> 5 times more space than in MySQL. Just to be sure, after each bulk load I
> performed a VACUUM FULL to reclaim any possible lost space, but nothing gets
> reclaimed. My plain-text dump files with INSERTs are just 150 MB in size, while
> the files in the Postgres directory are more than 1 GB each! I've tested other
> free DBMSs like Firebird and Ingres, but PostgreSQL consumes far more disk
> space than the others.

From discussions I have seen here, MySQL implements NUMERIC using a
floating-point type.  Postgres stores it using base-10000 digits, at 2 bytes
of storage per base-10000 digit (i.e. per 4 decimal digits), plus some
overhead for storing the precision and scale.  You might be better off using
bigint to store your data.  That will take 8 bytes per datum and is probably
the same size as was used in MySQL.
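
As a rough sketch of that suggestion (the table and column names below are
made up for illustration; ALTER TABLE ... ALTER COLUMN ... TYPE is available
from 8.0 on):

-- Hypothetical column originally declared as NUMERIC(15,0):
CREATE TABLE monthly_sample (amount numeric(15,0));

-- Convert it in place to bigint, which holds up to 9223372036854775807
-- (19 digits), so 15-digit values fit at a fixed 8 bytes apiece:
ALTER TABLE monthly_sample ALTER COLUMN amount TYPE bigint;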

Re: BIG files

From
Michael Fuhr
Date:
On Sun, Jun 19, 2005 at 07:48:08AM -0500, Bruno Wolff III wrote:
>
> From discussions I have seen here, MySQL implements NUMERIC using a
> floating-point type.  Postgres stores it using base-10000 digits, at 2 bytes
> of storage per base-10000 digit.

Hmmm...that got me wondering about something....

CREATE TABLE foo (n1 decimal(17, 0), n2 decimal(17, 0));
INSERT INTO foo VALUES (10000000000000000, 10000000000000001);
SELECT n1, n2, n1 + n2 FROM foo;

PostgreSQL 8.0.3:

        n1         |        n2         |     ?column?
-------------------+-------------------+-------------------
 10000000000000000 | 10000000000000001 | 20000000000000001

MySQL 4.1.12:

+-------------------+-------------------+-------------------+
| n1                | n2                | n1 + n2           |
+-------------------+-------------------+-------------------+
| 10000000000000000 | 10000000000000001 | 20000000000000000 |
+-------------------+-------------------+-------------------+

Wrong sum in MySQL.  Nice.  Oh wait, even better....

CREATE TABLE foo (n1 decimal(20, 0), n2 decimal(20, 0));
INSERT INTO foo VALUES (10000000000000000000, 10000000000000000001);
SELECT n1, n2, n1 + n2 FROM foo;

PostgreSQL 8.0.3:

          n1          |          n2          |       ?column?
----------------------+----------------------+----------------------
 10000000000000000000 | 10000000000000000001 | 20000000000000000001

MySQL 4.1.12:

+----------------------+----------------------+-----------------------+
| n1                   | n2                   | n1 + n2               |
+----------------------+----------------------+-----------------------+
| -8446744073709551616 | -8446744073709551615 | -16893488147419099136 |
+----------------------+----------------------+-----------------------+

Beauty.  Nice of MySQL not to raise an error if I've exceeded some
implementation-defined limit.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

Re: BIG files

From
Stephan Szabo
Date:
On Sun, 19 Jun 2005, Michael Fuhr wrote:

> PostgreSQL 8.0.3:
>
>           n1          |          n2          |       ?column?
> ----------------------+----------------------+----------------------
>  10000000000000000000 | 10000000000000000001 | 20000000000000000001
>
> MySQL 4.1.12:
>
> +----------------------+----------------------+-----------------------+
> | n1                   | n2                   | n1 + n2               |
> +----------------------+----------------------+-----------------------+
> | -8446744073709551616 | -8446744073709551615 | -16893488147419099136 |
> +----------------------+----------------------+-----------------------+
>
> Beauty.  Nice of MySQL not to raise an error if I've exceeded some
> implementation-defined limit.

To be fair, we didn't raise proper errors for integer math until fairly
recently (as recently as 7.4, we had overflow issues on operators where
the arguments were in range but the result was not).

Re: BIG files

From
Tom Lane
Date:
rabt@dim.uchile.cl writes:
> I work with VERY large datasets: 60 monthly tables with 700,000 rows and 99
> columns each, with mostly large numeric values (15 digits; NUMERIC(15,0)
> datatypes, not all filled).

> The main problem is disk space. The database files stored in Postgres
> take 4 or 5 times more space than in MySQL. Just to be sure, after
> each bulk load I performed a VACUUM FULL to reclaim any possible lost
> space, but nothing gets reclaimed. My plain-text dump files with
> INSERTs are just 150 MB in size, while the files in the Postgres
> directory are more than 1 GB each!

Hmm ... it's conventional wisdom that a PG data file will be larger than
an equivalent text file, but you seem to have a much worse case than I'd
expect.  As Bruno noted, BIGINT sounds like a better choice than NUMERIC
for your data, but even with NUMERIC I don't see where the bloat is
coming from.  A 15-digit NUMERIC value ought to occupy 16 bytes on disk
(8 bytes of overhead + 4 base-10000 digits at 2 bytes each), so that
ought to be just about one-for-one with the text representation.  And
with 99 of those per row, the 32-bytes-per-row row header overhead isn't
what's killing you either.
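
A quick way to sanity-check that arithmetic (a hedged sketch: pg_column_size()
only exists in releases newer than the 8.0.x discussed here, and exact byte
counts shift a little with version and header format):

SELECT pg_column_size(123456789012345::numeric(15,0)) AS numeric_bytes,
       pg_column_size(123456789012345::bigint)        AS bigint_bytes;
-- Expect the numeric to land in the neighborhood of 16 bytes and the
-- bigint at a fixed 8.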

Could we see the exact table declaration (including indexes) and a few
rows of sample data?

            regards, tom lane

Re: BIG files

From
Tom Lane
Date:
rabt@dim.uchile.cl writes:
> Tom, this is my table definition including domains:

The problem appears to be that you have defined all the columns as NOT
NULL with defaults that are obviously NULL substitutes, eg

> CREATE DOMAIN f29.tipo_idcomuna
>   AS int4
>   DEFAULT 99999
>   NOT NULL
>    CONSTRAINT tipo_idcomuna_check CHECK ((VALUE <= 99999) AND (VALUE >= 0));

That means that every one of the 99 columns is actually present, with a
value, in every row.  According to VACUUM FULL VERBOSE the actual
on-disk widths of your three sample rows range from 816 to 828 bytes,
which works out to an average column width of 8 or so bytes, which seems
reasonable.

Had you allowed the missing columns to go to NULL there would only be a
dozen or so actual values stored in each of these sample rows, so
(allowing for an extra 12 bytes for the nulls bitmap) the stored width
ought to be in the vicinity of 140-150 bytes.  Which more than accounts
for the bloat you are seeing.

In short: use NULL the way it was intended to be used, that is, to
indicate missing values.
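
A minimal sketch of what that could look like for the quoted domain, dropping
the NOT NULL and the 99999 placeholder default (a NULL value automatically
satisfies a domain CHECK constraint):

CREATE DOMAIN f29.tipo_idcomuna
  AS int4
  CONSTRAINT tipo_idcomuna_check CHECK (VALUE >= 0 AND VALUE <= 99999);
-- Columns using this domain can then simply be left NULL when the value is
-- missing, costing nothing beyond the row's null bitmap.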

            regards, tom lane