> serial_half is a 1-column list of 10-digit
> numbers. I'm doing a select distinct because I
> believe there may be duplicates in that column.
>
> The misunderstanding on my end came because
> serial_half was a 60MB text file, but when it was
> inserted into postgres, it became 345MB (6.8
> million rows has a lot of bloat apparently).
>
> So the temp-sort space for 345MB could easily
> surpass the 1GB I had on my hard disk. Although
> how anyone can take a 60MB text file and turn it
> into > 1GB is beyond me.
Sigh. Y'all like the sweeping statement, which got you in a bit of
trouble the first time too :)
Without knowing your schema, I can't say why you have *exactly* the
storage requirement you see. But, you have chosen the absolute worst
case for *any* relational database: a schema with only a single, very
small column.
For Postgres (and other DBs, but the details will vary) there is a 36
byte overhead per row to manage the tuple and the transaction
behavior. So if you stored your data as int8 (int4 is too small for 10
digits, right?) I see an average usage of slightly over 44 bytes per
row (36+8). So, for 6.8 million rows, you will require 300MB. I'm
guessing that you are using char(10) fields, which gives 50 bytes/row
or a total of 340MB, which matches your number to two digits.
Note that the tuple header size will stay the same (with possibly some
modest occasional bumps) for rows with more columns, so the overhead
decreases as you increase the number of columns in your tables.
By the way, I was going to say to RTFM, but I see a big blank spot on
this topic (I could have sworn that some of the info posted to the
mailing lists on this topic had made it into the manual, but maybe
not).
Does anyone see where this is in the docs, or have an interest in
writing a bit? The place is doc/src/sgml/storage.sgml and page.sgml
...
Good luck.
- Thomas
--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California