Thread: Rearchitecting for storage

Rearchitecting for storage

From
Matthew Pounsett
Date:

I've recently inherited a database that is dangerously close to outgrowing the available storage on its existing hardware.  I'm looking for (pointers to) advice on scaling the storage in a financially constrained not-for-profit.

The current size of the DB's data directory is just shy of 23TB.  When I received the machine it's on, it was configured with 18x3TB drives in RAID10 (9x 2-drive mirrors striped together) for about 28TB of available storage.  As a short term measure I've reconfigured them into RAID50 (3x 6-drive RAID5 arrays).  This is obviously a poor choice for performance, but it'll get us through until we figure out what to do about upgrading/replacing the hardware.  The host is constrained to 24x3TB drives, so we can't get much of an upgrade by just adding/replacing disks.

One of my anticipated requirements for any replacement we design is that I should be able to do upgrades of Postgres for up to five years without needing major upgrades to the hardware.  My understanding of the standard upgrade process is that this requires that the data directory be smaller than the free storage (so that there is room to hold two copies of the data directory simultaneously).  I haven't got detailed growth statistics yet, but given that the DB has grown to 23TB in 5 years, I should assume that it could double in the next five years, requiring 100TB of available storage to be able to do upgrades.

This seems to be right on the cusp of what is possible to fit in a single chassis with a RAID10 configuration (at least, with commodity hardware), which means we're looking at a pretty high cost:performance ratio.  I'd like to see if we can find designs that get that ratio down a bit, or a lot, but I'm a general sysadmin, and the detailed effects of those choices are outside of my limited DBA experience.

Are there good documents out there on sizing hardware for this sort of mid-range storage requirement, that is neither big data nor "small data" able to fit on a single host?  I'm hoping for an overview of the tradeoffs between single-head and dual-head setups with a JBOD array, or whatever else is advisable to consider these days.  Corrections of any poor assumptions exposed above are also quite welcome. :)

Thanks in advance for any assistance!




Re: Rearchitecting for storage

From
Kenneth Marshall
Date:
Hi Matt,

On Thu, Jul 18, 2019 at 09:44:04AM -0400, Matthew Pounsett wrote:
> I've recently inherited a database that is dangerously close to outgrowing
> the available storage on its existing hardware.  I'm looking for (pointers
> to) advice on scaling the storage in a financially constrained
> not-for-profit.

Have you considered using VDO compression for tables that are less
update-intensive? Using just compression you can get almost a 4X size
reduction. For a database, I would forgo the deduplication function.
You can then use a non-compressed tablespace for the heavier-I/O
tables and indexes.
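
Roughly, the idea in SQL would be something like this (the mount point
and the table/index names here are only placeholders, and note that
moving a relation to another tablespace rewrites it and holds a lock
while it does so):

    -- Assumes /srv/pg_compressed is a filesystem sitting on a VDO volume.
    CREATE TABLESPACE compressed LOCATION '/srv/pg_compressed';

    -- Move colder, mostly-read relations onto the compressed tablespace.
    ALTER TABLE archived_results SET TABLESPACE compressed;
    ALTER INDEX archived_results_pkey SET TABLESPACE compressed;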

> 
> One of my anticipated requirements for any replacement we design is that I
> should be able to do upgrades of Postgres for up to five years without
> needing major upgrades to the hardware.  My understanding of the standard
> upgrade process is that this requires that the data directory be smaller
> than the free storage (so that there is room to hold two copies of the data
> directory simultaneously).  I haven't got detailed growth statistics yet,
> but given that the DB has grown to 23TB in 5 years, I should assume that it
> could double in the next five years, requiring 100TB of available storage
> to be able to do updates.
> 

The link option with pg_upgrade does not require 2X the space, since it
uses hard links instead of copying the files to the new cluster.

Regards,
Ken



Re: Rearchitecting for storage

From
Matthew Pounsett
Date:


On Thu, 18 Jul 2019 at 13:34, Kenneth Marshall <ktm@rice.edu> wrote:
Hi Matt,

Hi!  Thanks for your reply.
 
Have you considered using VDO compression for tables that are less
update-intensive? Using just compression you can get almost a 4X size
reduction. For a database, I would forgo the deduplication function.
You can then use a non-compressed tablespace for the heavier-I/O
tables and indexes.

VDO is a RedHat-only thing, isn't it?  We're not running RHEL... Debian.  Anyway, the bulk of the data (nearly 80%) is in a single table and its indexes: ~6TB for the table, and ~12TB for its indices.  Even if we switched over to RedHat, there's little value in compressing only the lesser-used tables.
 

> My understanding of the standard
> upgrade process is that this requires that the data directory be smaller
> than the free storage (so that there is room to hold two copies of the data
> directory simultaneously). 

The link option with pg_upgrade does not require 2X the space, since it
uses hard links instead of copying the files to the new cluster.

That would likely keep the extra storage requirements small, but still non-zero.  Presumably the upgrade would be unnecessary if it could be done without rewriting files.  Is there any rule of thumb for making sure one has enough space available for the upgrade?  I suppose that would come down to what exactly needs to get rewritten, in what order, etc., but the pg_upgrade docs don't seem to have that detail.  For example, since we've got an ~18TB table (including its indices), if that needs to be rewritten then we're still looking at requiring significant extra storage.  Recent experience suggests postgres won't necessarily do things in the most storage-efficient way... we just had a reindex on that database fail (in --single-user) because 17TB was insufficient free storage for the db to grow into.
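
For anyone following along, the relevant sizes can be pulled with something like the query below (the table name here is made up; pg_total_relation_size() counts the indexes and TOAST data as well):

    SELECT pg_size_pretty(pg_relation_size('test_results'))       AS table_size,
           pg_size_pretty(pg_indexes_size('test_results'))        AS index_size,
           pg_size_pretty(pg_total_relation_size('test_results')) AS total_size;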

Re: Rearchitecting for storage

From
Rob Sargent
Date:
>
> That would likely keep the extra storage requirements small, but still non-zero.  Presumably the upgrade would be
> unnecessary if it could be done without rewriting files.  Is there any rule of thumb for making sure one has enough
> space available for the upgrade?  I suppose that would come down to what exactly needs to get rewritten, in what
> order, etc., but the pg_upgrade docs don't seem to have that detail.  For example, since we've got an ~18TB table
> (including its indices), if that needs to be rewritten then we're still looking at requiring significant extra
> storage.  Recent experience suggests postgres won't necessarily do things in the most storage-efficient way... we
> just had a reindex on that database fail (in --single-user) because 17TB was insufficient free storage for the db
> to grow into.
>
Can you afford to drop and re-create those 6 indices?


Re: Rearchitecting for storage

From
Andy Colson
Date:
On 7/18/19 8:44 AM, Matthew Pounsett wrote:
>
> I've recently inherited a database that is dangerously close to outgrowing the available storage on its existing
> hardware.  I'm looking for (pointers to) advice on scaling the storage in a financially constrained not-for-profit.
>
> The current size of the DB's data directory is just shy of 23TB.  When I received the machine it's on, it was
> configured with 18x3TB drives in RAID10 (9x 2-drive mirrors striped together) for about 28TB of available storage.
> As a short term measure I've reconfigured them into RAID50 (3x 6-drive RAID5 arrays).  This is obviously a poor
> choice for performance, but it'll get us through until we figure out what to do about upgrading/replacing the
> hardware.  The host is constrained to 24x3TB drives, so we can't get much of an upgrade by just adding/replacing
> disks.
>
> One of my anticipated requirements for any replacement we design is that I should be able to do upgrades of
> Postgres for up to five years without needing major upgrades to the hardware.  My understanding of the standard
> upgrade process is that this requires that the data directory be smaller than the free storage (so that there is
> room to hold two copies of the data directory simultaneously).  I haven't got detailed growth statistics yet, but
> given that the DB has grown to 23TB in 5 years, I should assume that it could double in the next five years,
> requiring 100TB of available storage to be able to do upgrades.
>
> This seems to be right on the cusp of what is possible to fit in a single chassis with a RAID10 configuration (at
> least, with commodity hardware), which means we're looking at a pretty high cost:performance ratio.  I'd like to
> see if we can find designs that get that ratio down a bit, or a lot, but I'm a general sysadmin, and the detailed
> effects of those choices are outside of my limited DBA experience.
>
> Are there good documents out there on sizing hardware for this sort of mid-range storage requirement, that is
> neither big data nor "small data" able to fit on a single host?  I'm hoping for an overview of the tradeoffs
> between single-head and dual-head setups with a JBOD array, or whatever else is advisable to consider these days.
> Corrections of any poor assumptions exposed above are also quite welcome. :)
>
> Thanks in advance for any assistance!
>

Now might be a good time to consider splitting the database onto multiple computers.  Might be simpler with a
mid-range database, then your plan for the future is "add more computers".
 

-Andy



Re: Rearchitecting for storage

From
Luca Ferrari
Date:
On Thu, Jul 18, 2019 at 10:09 PM Matthew Pounsett <matt@conundrum.com> wrote:
> That would likely keep the extra storage requirements small, but still non-zero.  Presumably the upgrade would be
> unnecessary if it could be done without rewriting files.  Is there any rule of thumb for making sure one has enough
> space available for the upgrade?  I suppose that would come down to what exactly needs to get rewritten, in what
> order, etc., but the pg_upgrade docs don't seem to have that detail.  For example, since we've got an ~18TB table
> (including its indices), if that needs to be rewritten then we're still looking at requiring significant extra
> storage.  Recent experience suggests postgres won't necessarily do things in the most storage-efficient way... we
> just had a reindex on that database fail (in --single-user) because 17TB was insufficient free storage for the db
> to grow into.

This could be trivial, but is there any chance you can partition the table
and/or archive unused records (at least temporarily)? An 18 TB table
quite frankly sounds like a good candidate to contain records no one will
be interested in for the near future.
In any case, if you can partition the table, chances are you can at
least do a per-table backup that could simplify maintenance of the
database.
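
Just to sketch the idea (the table and column names are invented, and
declarative partitioning needs PostgreSQL 10 or later):

    -- Range partitioning of the big table by year.
    CREATE TABLE test_results_part (
        result_id  bigint       NOT NULL,
        run_at     timestamptz  NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (run_at);

    CREATE TABLE test_results_2018 PARTITION OF test_results_part
        FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
    CREATE TABLE test_results_2019 PARTITION OF test_results_part
        FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

The existing data would of course still have to be moved into the
partitioned table, which itself needs time and space.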

As a more desperate measure, I would also check the log files (I mean the
textual logs, not WALs), because occasionally I have found them taking up
a few GB on my disk, and they can easily be archived to gain some extra
space.
Then I would go for some commodity NAS to attach as extra storage, at
least for the upgrade process.

If all of the above fails, I would probably drop all the indexes
to gain extra space, perform the upgrade, and then reindex (after
removing the old cluster, in case it was not upgraded with the link
option).

Luca



Re: Rearchitecting for storage

From
Matthew Pounsett
Date:


On Thu, 18 Jul 2019 at 19:53, Rob Sargent <robjsargent@gmail.com> wrote:

>
> That would likely keep the extra storage requirements small, but still non-zero.  Presumably the upgrade would be unnecessary if it could be done without rewriting files.  Is there any rule of thumb for making sure one has enough space available for the upgrade?   I suppose that would come down to what exactly needs to get rewritten, in what order, etc., but the pg_upgrade docs don't seem to have that detail.  For example, since we've got an ~18TB table (including its indices), if that needs to be rewritten then we're still looking at requiring significant extra storage.  Recent experience suggests postgres won't necessarily do things in the most storage-efficient way.. we just had a reindex on that database fail (in --single-user) because 17TB was insufficient free storage for the db to grow into.
>
Can you afford to drop and re-create those 6 indices?

Technically, yes.  I don't see any reason we'd be prevented from doing that.  But, rebuilding them will take a long time.  That's a lot of downtime to incur any time we update the DB.  I'd prefer to avoid it if I can.  For scale, the recent 'reindex database' that failed ran for nine days before it ran out of room, and that was in single-user.  Trying to do that concurrently would take a lot longer, I imagine.
 

Re: Rearchitecting for storage

From
Matthew Pounsett
Date:


On Thu, 18 Jul 2019 at 21:59, Andy Colson <andy@squeakycode.net> wrote:
>

Now might be a good time to consider splitting the database onto multiple computers.  Might be simpler with a mid-range database, then your plan for the future is "add more computers".

Hmm... yes.  Range partitioning seems like a possible way forward.  I hadn't considered that yet.  We might hold that back for when a dual-head approach (server + jbod) can't scale anymore, but I think that's a long way in the future.

Re: Rearchitecting for storage

From
Matthew Pounsett
Date:


On Fri, 19 Jul 2019 at 04:21, Luca Ferrari <fluca1978@gmail.com> wrote:

This could be trivial, but is there any chance you can partition the table
and/or archive unused records (at least temporarily)? An 18 TB table
quite frankly sounds like a good candidate to contain records no one will
be interested in for the near future.

Partitioning is a possibility.  The whole database is historical test results, stored specifically for doing comparisons over time, so I'm not sure we can actually archive anything.  Expiring old test data is a discussion we might have to have, eventually.
 
In any case, if you can partition the table, chances are you can at
least do a per-table backup that could simplify maintenance of the
database.

My current backup plan for this database is on-site replication, and a monthly pg_dump from the standby to be copied off-site.  Doing per-table backups sounds like a great way to end up with an inconsistent backup, but perhaps I misunderstand what you mean.  

Another possibility is putting the server into backup mode and taking a snapshot of the filesystem, but coordinating that across chassis (in the case where partitioning is used) in such a way that the db is consistent sounds like a hard problem... unless issuing pg_start_backup on the chassis holding the master table coordinates backup mode on all the chassis holding child tables at the same time?  I haven't read enough on that yet.
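
For a single chassis I assume it would look roughly like this (this is
the older exclusive-mode backup API, and I haven't verified any of it
yet):

    -- Before taking the filesystem snapshot of the data directory:
    SELECT pg_start_backup('fs-snapshot', true);
    -- ... take the LVM/ZFS snapshot here ...
    SELECT pg_stop_backup();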
 

As a more desperate measure, I would also check the log files (I mean the
textual logs, not WALs), because occasionally I have found them taking up
a few GB on my disk, and they can easily be archived to gain some extra
space.
Then I would go for some commodity NAS to attach as extra storage, at
least for the upgrade process.

Okay.  So I guess the short answer is no, nobody really knows how to judge how much space is required for an upgrade?  :)

Any logs we have are going to be a rounding error when compared to the database itself.  And buying storage last-minute because an upgrade failed is exactly the sort of thing that a resource constrained not-for-profit can't do.  We really need to be able to plan this out long term so that we get as much as possible out of every dollar.
 
If all of the above fails, I would probably drop all the indexes
to gain extra space, perform the upgrade, and then reindex (after
removing the old cluster, in case it was not upgraded with the link
option).
 
Yeah, this sort of trial-and-error approach to getting upgrades done bothers me, but it seems like it'll be necessary once we start growing into whatever new home we get for the db.

Thanks very much for your time on this.

Re: Rearchitecting for storage

From
Kenneth Marshall
Date:
Hi Matt,

On Fri, Jul 19, 2019 at 10:41:31AM -0400, Matthew Pounsett wrote:
> On Fri, 19 Jul 2019 at 04:21, Luca Ferrari <fluca1978@gmail.com> wrote:
> 
> >
> > This could be trivial, but any chance you can partition the table
> > and/or archive unused records (at least temporarly)? A 18 TB table
> > quite frankly sounds a good candidate to contain records no one is
> > interested in the near future.
> >
> 
> Partitioning is a possibility.  The whole database is historical test
> results, stored specifically for doing comparisons over time, so I'm not
> sure we can actually archive anything.  Expiring old test data is a
> discussion we might have to have, eventually.
> 

This is a case where using a compressed filesystem can give you space
savings as well as faster read performance due to the compression
factor. In my case a sequential scan of something in the compressed
tablespace runs almost 3X faster than on the non-compressed one.
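
That is easy enough to measure yourself once you have a copy of a table
in each tablespace, e.g. (table names are just examples):

    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM results_compressed;
    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM results_plain;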

Regards,
Ken




Re: Rearchitecting for storage

From
Matthew Pounsett
Date:


On Thu, 18 Jul 2019 at 09:44, Matthew Pounsett <matt@conundrum.com> wrote:

I've recently inherited a database that is dangerously close to outgrowing the available storage on its existing hardware.  I'm looking for (pointers to) advice on scaling the storage in a financially constrained not-for-profit.

Thanks for your replies, everyone.  Here's a quick summary of what I've got out of this.

Although nobody really addressed the core question of the performance tradeoffs in different storage architectures, perhaps the fact that nobody mentioned them means there really aren't any.  We'll proceed on the assumption that externally attached storage really doesn't make a difference.  NAS storage seems like a poor choice to me, for performance reasons, and nobody's really said anything to disabuse me of that notion.

We're going to have a look at the relative costs of single-head solutions as well as dual head (server + jbod) setups, and see what gets us the most growth for the least cost. We'll plan for enough storage to get us five years of growth, and just accept that maybe in the 5th year we won't be able to do in-place upgrades without dropping indexes.   

Horizontal scalability through range partitioning sounds interesting, but I don't think it's a cost-effective solution for us right now.  As long as it's possible for the db to fit in a single server (or jbod) using commodity hardware, the incremental cost of adding more chassis (and therefore more motherboards, more CPUs, more memory) isn't offset by a reduced cost anywhere else (e.g. using cheaper drives).  And that's not even accounting for the increased operational cost of coordinating the DB across multiple servers.  It could be a useful approach if DB growth outpaces historical averages and we need to add hardware before a normal replacement cycle.  It could also be useful at the end of that replacement cycle if DB growth has outpaced commodity hardware improvements, and single-server solutions are no longer viable.

The DB server we inherited is currently a single instance, but once we expand and have replication in place I'll have to do some testing to see if LVM compression gives us any performance boost (or incurs an acceptable performance cost). The big question there is whether the processing required to do the compression is faster than the difference in read times on the disk... I think that might be dependent on the data and how it's accessed.  It certainly seems like it could give us some benefits, but I don't think it's an experiment I want to attempt with only a single production copy of the DB; the downtime required to rebuild the DB server for A+B comparisons would be unacceptable.

Thanks again everyone.  This has been educational.

Re: Rearchitecting for storage

From
"Peter J. Holzer"
Date:
On 2019-07-19 10:41:31 -0400, Matthew Pounsett wrote:
> Okay.  So I guess the short answer is no, nobody really knows how to
> judge how much space is required for an upgrade?  :)

As I understand it, a pg_upgrade --link uses only negligible extra
space. It duplicates a bit of housekeeping information, but not your
data tables or indexes. Your 18 TB table will definitely not be duplicated
during the upgrade if you can use --link.

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>


Re: Rearchitecting for storage

From
Matthew Pounsett
Date:


On Fri, 19 Jul 2019 at 11:25, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
On 2019-07-19 10:41:31 -0400, Matthew Pounsett wrote:
> Okay.  So I guess the short answer is no, nobody really knows how to
> judge how much space is required for an upgrade?  :)

As I understand it, a pg_upgrade --link uses only negligible extra
space. It duplicates a bit of housekeeping information, but not your
data tables or indexes. Your 18 TB table will definitely not be duplicated
during the upgrade if you can use --link.

The documentation for pg_upgrade --link says that the old copy is no longer usable, which means it's modifying files that are linked.  If it were only modifying small housekeeping files, then it would be most efficient not to link those, which would keep both copies of the db usable.  That seems incompatible with your suggestion that it doesn't need to modify the data files.  Depending on how it goes about doing that, it could mean a significant short-term increase in storage requirements while the data is being converted.  

Going back to our recent 'reindex database' attempt, pgsql does not necessarily do these things in the most storage-efficient manner; it seems entirely likely that it would choose to use links to duplicate the data directory, then create copies of each data file as it converts them over, then link that back to the original for an atomic replacement.  That could eat up a HUGE amount of storage during the conversion process without the start and end sizes being very different at all.  

Sorry, but I can't reconcile your use of "as I understand it" with your use of "definitely".  It sounds like you're guessing, rather than speaking from direct knowledge of how the internals of pg_upgrade work.

Re: Rearchitecting for storage

From
Jacob Bunk Nielsen
Date:
Matthew Pounsett <matt@conundrum.com> writes:
> On Thu, 18 Jul 2019 at 19:53, Rob Sargent <robjsargent@gmail.com> wrote:
>
>  Can you afford to drop and re-create those 6 indices?
>
> Technically, yes. I don't see any reason we'd be prevented from doing that. But, rebuilding them will take a long
> time.  That's a lot of downtime to incur any time we update the DB. I'd prefer to avoid it if I can. For scale, the
> recent 'reindex database' that failed ran for nine days before it ran out of room, and that was in single-user.
> Trying to do that concurrently would take a lot longer, I imagine.

This may be a stupid question, but are you certain they are all used? It
wouldn't be the first time that I've seen someone create indexes and
then never use them. This script can tell you if there are any indexes
that seem largely unused.

https://github.com/pgexperts/pgx_scripts/blob/master/indexes/unused_indexes.sql
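
A much simpler check along the same lines, just to get a first
impression (this is not the full logic of that script):

    -- Indexes that have never been scanned since the statistics were
    -- last reset.  Unique/constraint indexes may still be needed even
    -- if they are never scanned, so treat this only as a starting point.
    SELECT schemaname, relname, indexrelname,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
           idx_scan
      FROM pg_stat_user_indexes
     WHERE idx_scan = 0
     ORDER BY pg_relation_size(indexrelid) DESC;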

If you can run your application without access to the indexes for a
while you can create them concurrently in the background using "CREATE
INDEX CONCURRENTLY ...".

Best regards,
Jacob




Re: Rearchitecting for storage

From
Jacob Bunk Nielsen
Date:
Matthew Pounsett <matt@conundrum.com> writes:

> [...] Is there any rule of thumb for making sure one has enough space
> available for the upgrade?

No, because it depends greatly on which version you are upgrading from
and which version you are upgrading to etc.

Perhaps you could carve out a slice of data, e.g. 1 GB and load it into
a test database and try to upgrade that. That would probably give you an
idea.
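
For example, something like this could pull a sample of the big table
to dump and load into a test cluster running your current version
(TABLESAMPLE needs 9.5 or later, and the names are invented):

    -- Roughly 0.01% of the table's pages, picked at random.
    CREATE TABLE test_results_sample AS
        SELECT * FROM test_results TABLESAMPLE SYSTEM (0.01);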

Also, you mentioned that your database contains historical test data¹,
so I would guess that one of the indexes is related to timestamps? But
maybe you could live with a smaller BRIN index for the timestamps:
https://www.postgresql.org/docs/11/brin-intro.html - that could
potentially save some space, and may not have been something on the
radar when the database was first developed.
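
For example (column and index names are invented here):

    -- A BRIN index on the timestamp column is usually orders of
    -- magnitude smaller than a btree on the same column.
    CREATE INDEX CONCURRENTLY test_results_run_at_brin
        ON test_results USING brin (run_at);
    -- If it turns out to be good enough, the old btree could then go:
    -- DROP INDEX CONCURRENTLY test_results_run_at_btree;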

Best regards,
Jacob

¹) I think I know which kind of data based on your progress reports on
   a DNS related list I'm subscribed to.




Re: Rearchitecting for storage

From
"Peter J. Holzer"
Date:
On 2019-07-19 11:37:52 -0400, Matthew Pounsett wrote:
> On Fri, 19 Jul 2019 at 11:25, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
>
>     On 2019-07-19 10:41:31 -0400, Matthew Pounsett wrote:
>     > Okay.  So I guess the short answer is no, nobody really knows how to
>     > judge how much space is required for an upgrade?  :)
>
>     As I understand it, a pg_upgrade --link uses only negligible extra
>     space. It duplicates a bit of housekeeping information, but not your
>     data tables or indexes. Your 18 TB table will definitely not be duplicated
>     during the upgrade if you can use --link.
>
>
> The documentation for pg_upgrade --link says that the old copy is no longer
> usable, which means it's modifying files that are linked.  If it were only
> modifying small housekeeping files, then it would be most efficient not to link
> those, which would keep both copies of the db usable.

This was discussed recently: The old database is made intentionally
unusable to prevent accidentally starting both (which would result in
data corruption).

> That seems incompatible with your suggestion that it doesn't need to
> modify the data files.  Depending on how it goes about doing that, it
> could mean a significant short-term increase in storage requirements
> while the data is being converted.  
>
> Going back to our recent 'reindex database' attempt, pgsql does not
> necessarily do these things in the most storage-efficient manner; it
> seems entirely likely that it would choose to use links to duplicate
> the data directory, then create copies of each data file as it
> converts them over, then link that back to the original for an atomic
> replacement.  That could eat up a HUGE amount of storage during the
> conversion process without the start and end sizes being very
> different at all.  

I can't really think of a scenario in which this would be the best
(or even a good) strategy to convert the database. I am quite confident
that pg_upgrade doesn't do that at present and reasonably confident that
it won't do it in the future.


> Sorry, but I can't reconcile your use of "as I understand it" with
> your use of "definitely".  It sounds like you're guessing, rather than
> speaking from direct knowledge of how the internals of pg_upgrade work.

I don't have direct knowledge of the internals of pg_upgrade, but I
have upgraded a database of about 1 TB at least twice with --link. Since
I had much less than 1 TB of free space and the upgrade completed very
quickly, I am very confident that no user defined tables are copied. I
have also been on this mailing list for a few years and read quite a few
discussions about the usage of pg_upgrade in that time (though I may not
always have paid much attention to them).

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>


Re: Rearchitecting for storage

From
Stephen Frost
Date:
Greetings,

* Matthew Pounsett (matt@conundrum.com) wrote:
> On Fri, 19 Jul 2019 at 11:25, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> > On 2019-07-19 10:41:31 -0400, Matthew Pounsett wrote:
> > > Okay.  So I guess the short answer is no, nobody really knows how to
> > > judge how much space is required for an upgrade?  :)
> >
> > As I understand it, a pg_upgrade --link uses only negligible extra
> > space. It duplicates a bit of housekeeping information, but not your
> > data tables or indexes. Your 18 TB table will definitely not be duplicated
> > during the upgrade if you can use --link.
>
> The documentation for pg_upgrade --link says that the old copy is no longer
> usable,

That's not entirely true- the old copy is only no longer usable *after*
you've started the new version of the DB against those data files.  If
you haven't started the new major version of PG yet, then you can go
back to using the old version against those files.

> which means it's modifying files that are linked.

No, it doesn't.

> If it were only
> modifying small housekeeping files, then it would be most efficient not to
> link those, which would keep both copies of the db usable.

The catalog tables *aren't* linked.  Both copies of the DB are usable-
but only until you start the DB against one of the versions.  Once
you've started either the old version or the new version, you can't
switch.  If you started the old version, then you could do another
pg_upgrade, of course, but you can't use the new version as there will
have been changes made to the catalog tables and control file (which
aren't linked) that would have to be accounted for in the new version's
catalog by pg_upgrade.

> That seems
> incompatible with your suggestion that it doesn't need to modify the data
> files.  Depending on how it goes about doing that, it could mean a
> significant short-term increase in storage requirements while the data is
> being converted.

No, that's not the case- link mode doesn't copy the data files, it just
rebuilds the catalog tables and fixes up things in the new database
cluster (clog, wal, et al, not the user data tables/indexes).

> Going back to our recent 'reindex database' attempt, pgsql does not
> necessarily do these things in the most storage-efficient manner; it seems
> entirely likely that it would choose to use links to duplicate the data
> directory, then create copies of each data file as it converts them over,
> then link that back to the original for an atomic replacement.  That could
> eat up a HUGE amount of storage during the conversion process without the
> start and end sizes being very different at all.

No, that isn't how pg_upgrade works.

Thanks,

Stephen


Re: Rearchitecting for storage

From
Luca Ferrari
Date:
On Fri, Jul 19, 2019 at 4:41 PM Matthew Pounsett <matt@conundrum.com> wrote:
> My current backup plan for this database is on-site replication, and a monthly pg_dump from the standby to be copied
> off-site.  Doing per-table backups sounds like a great way to end up with an inconsistent backup, but perhaps I
> misunderstand what you mean.
>
Well, my idea was that, as I supposed and you confirmed, the database is
also full of historical data that will not be updated in the future.
Therefore you could at least do the partitioning, back up the historical
data and, after having verified the backup, drop that historical data
(see the sketch at the end of this message). This will leave you with a
"partially" online system (partial in the sense that it will not have,
let's say, the last 20 years, but only the last 10 years), and you will
save space for upgrading. After the upgrade is completed, you can restore
the oldest data and you will come back online with the full dataset. In
this scenario the backup is not inconsistent, since the old data is
supposed to stay frozen. If that is not true, my idea is completely wrong.
I know this is the desperate-poor-man approach, but I quite frankly have
no other ideas if you are constrained on space, money and time (because
the longer you wait, the harder the upgrade becomes, in my opinion).
I would also look _now_ at a possible refactoring of the database in
order to gain, if possible, some extra space. I mean, sorry to be harsh,
but a database with one huge table has not been designed efficiently, so
chances are some columns can be shrunk (to a more appropriate data type,
perhaps?) and this could give you some extra space.
But without a better understanding of the scenario and the context, I
don't think I can help very much.
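
Concretely, with partitions the archive-then-drop step would be
something like this (the names are invented; the actual backup of the
detached partition would be done with pg_dump -t):

    -- Detach the oldest partition, back it up, verify it, then drop it.
    ALTER TABLE test_results_part DETACH PARTITION test_results_2015;
    -- ... pg_dump -t test_results_2015, verify the dump ...
    DROP TABLE test_results_2015;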

Sorry,
Luca



Re: Rearchitecting for storage

From
Luca Ferrari
Date:
On Thu, Jul 18, 2019 at 10:09 PM Matthew Pounsett <matt@conundrum.com> wrote:
> That would likely keep the extra storage requirements small, but still non-zero.  Presumably the upgrade would be
> unnecessary if it could be done without rewriting files.  Is there any rule of thumb for making sure one has enough
> space available for the upgrade?  I suppose that would come down to what exactly needs to get rewritten, in what
> order, etc., but the pg_upgrade docs don't seem to have that detail.  For example, since we've got an ~18TB table
> (including its indices), if that needs to be rewritten then we're still looking at requiring significant extra
> storage.  Recent experience suggests postgres won't necessarily do things in the most storage-efficient way... we
> just had a reindex on that database fail (in --single-user) because 17TB was insufficient free storage for the db
> to grow into.
>

I've done a test on a virtual machine of mine, with the following
three databases: one of 4.9 GB, one of 500 MB, one of 50 MB. I know
this is not even close to your environment; however, upgrading with
pg_upgrade from 10.9 to 11.4 _without_ the link option required about
85% extra space.

On a machine with a single database of 8.9 GB and a space occupation,
as reported by df, of 64% (meaning 46% available), I was able to upgrade
from 10.9 to 11.4 without the link option. Space occupation increased
by 90%.
Using the link option on the same cluster required about 1.1% extra
space (around 100 MB).
Of course, these are poor-man results, but they give you an idea of the
space required by pg_upgrade (which seems to be less than 100%, i.e. 2x).

Hope this helps.
Luca