Thread: Realistic upper bounds on table size

Realistic upper bounds on table size

From: "A.M."
Date: Fri, 2003-03-28 12:07
Sorry for the cross-posting but I wasn't able to elicit a response on
-general.

I'm trying to figure out what the realistic upper bounds on a
PostgreSQL table are, given the required use of indexes and integer
columns in a single table.

An astronomy institution I'm considering working for receives a
monster amount of telescope data from a government observatory. Each
day they download millions of rows of data (including position in the
sky, infrared readings, etc.) in CSV format. Most of the fields are
floats and integers. I would like to offer them an improvement over
their old system.

I would like to know how PostgreSQL does under such extreme
circumstances. For example, I may load the entire multi-million-row
CSV file into a table and then eliminate some odd million rows they
are not interested in. Would a vacuum at that point be prohibitively
expensive? If I add some odd millions of rows to a table every day,
can I expect the necessary indexes to keep up? In other words, will
PostgreSQL be able to keep up with their simple and infrequent selects
on monster amounts of data (potentially 15 GB/day moving in and out
daily, with db growth at ~+5 GB/day [millions of rows] in big blocks
all at once), assuming they have top-of-the-line equipment for this
sort of thing (storage, memory, processors, etc.)? Anyone else using
PostgreSQL on heavy-duty astronomy data? Thanks for any info.
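
P.S. For concreteness, the daily cycle I have in mind would look
roughly like the following (the table, column, and file names are just
placeholders, not their real schema):

    -- bulk-load the day's dump; COPY reads the file on the server
    -- (psql's \copy would be used if the file lives on the client)
    COPY telescope_readings FROM '/data/daily_dump.csv' WITH DELIMITER ',';

    -- throw away the odd million rows they are not interested in
    -- (the condition here is an invented example)
    DELETE FROM telescope_readings WHERE quality_flag = 0;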

 ><><><><><><><><><
AgentM
agentm@cmu.edu


Re: Realistic upper bounds on table size

From: Robert Treat
Date:
On Fri, 2003-03-28 at 12:07, A.M. wrote:
> Sorry for the cross-posting but I wasn't able to elicit a response on
> -general.
>
> I'm trying to figure out what the realistic upper bounds on a
> PostgreSQL table are, given the required use of indexes and integer
> columns in a single table.
>
> An astronomy institution I'm considering working for receives a
> monster amount of telescope data from a government observatory. Each
> day they download millions of rows of data (including position in the
> sky, infrared readings, etc.) in CSV format. Most of the fields are
> floats and integers. I would like to offer them an improvement over
> their old system.
>
> I would like to know how PostgreSQL does under such extreme
> circumstances. For example, I may load the entire multi-million-row
> CSV file into a table and then eliminate some odd million rows they
> are not interested in. Would a vacuum at that point be prohibitively
> expensive?

I think a VACUUM FULL might be on the expensive side, since you'd have
to lock the table long enough to reclaim the pages holding those
million or so deleted rows. Given that it is only one table, and with
proper hardware, this might not be as bad as I'm thinking. If it is a
problem, you can use a plain ("lazy") VACUUM to mark the dead rows as
reusable space and keep excess table size to a minimum. You'll
definitely need to bump up max_fsm_pages in postgresql.conf.
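
Something along these lines, say (the table name is a placeholder, and
the max_fsm_pages value below is only a guess; you'd size it to how
many pages your daily deletes actually free up):

    -- after the daily load-and-prune: a plain (lazy) VACUUM reclaims
    -- the dead rows without an exclusive lock, and ANALYZE keeps the
    -- planner's statistics current for those infrequent selects
    VACUUM ANALYZE telescope_readings;

    # in postgresql.conf: the free space map must be big enough to
    # track all the pages freed by those daily deletes
    max_fsm_pages = 1000000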

> If I add some odd millions of rows to a table every day, can I
> expect the necessary indexes to keep up?

I'm a little concerned about this, as index bloat is more or less of a
problem depending on which version you're using. 7.4 has some good
fixes for this if you're willing to run code from CVS; in the meantime
you might need to devise an index maintenance scheme. If you can
schedule downtime, I might suggest you drop the indexes before loading
and deleting data, then recreate them when you're done. You'll have to
time it on your hardware to see if it creates too much of a problem
for you. If you need to keep the indexes available around the clock,
you can rebuild them within a transaction to help keep them available;
it will just take more intensive processing by your server.
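
As a rough sketch of the maintenance-window version (the index,
column, and file names are made up; substitute their real schema):

    -- drop the index before the bulk load so it isn't updated row by row
    DROP INDEX idx_readings_position;

    -- load the day's dump and prune the rows they don't want
    COPY telescope_readings FROM '/data/daily_dump.csv' WITH DELIMITER ',';
    DELETE FROM telescope_readings WHERE quality_flag = 0;

    -- rebuild the index in one pass over the surviving rows
    CREATE INDEX idx_readings_position ON telescope_readings (ra, dec);

    -- reclaim the deleted rows and refresh planner statistics
    VACUUM ANALYZE telescope_readings;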

> In other words, will PostgreSQL be able to keep up with their simple
> and infrequent selects on monster amounts of data (potentially 15
> GB/day moving in and out daily, with db growth at ~+5 GB/day
> [millions of rows] in big blocks all at once), assuming they have
> top-of-the-line equipment for this sort of thing (storage, memory,
> processors, etc.)? Anyone else using PostgreSQL on heavy-duty
> astronomy data? Thanks for any info.

While I think it will be able to keep up, it will take some time to get
your server maintenance figured out. Still, if you have quality
hardware, it's worth a try.

Robert Treat