Thread: database split

database split

From: Dave Owens

Greetings,

We are migrating a subset of our customers to a new set of servers.  This requires that we migrate their data stored in postgresql (v9.1.15, Linux) to a new database.  The new database happens to reside in the same cluster, but in the future this may not be the case.  We are using a separate tablespace for the new database.

Our schema has several tiers of foreign key relationships, and most FKs eventually lead back to a few core tables.

Here are the options I am considering:

A. Text-processing a dump of the original database, filtering only the rows in which I am interested, while loading the dump into the new database.

B. Copying the original database (CREATE DATABASE smalldb WITH TEMPLATE bigdb), then modifying the schema so that each FK constraint now includes ON DELETE CASCADE, then deleting rows from the few core tables while letting postgresql cascade the deletes down the hierarchy.  Then, of course, restoring the schema to its original state.  (A rough sketch of this follows the list.)

C. Generating the myriad SELECTs necessary to export only the required data from every table, then importing those results (either using dblink or COPY FROMs) into the new database that has been populated with schema from the original database.  Carefully ordering the imports to avoid missing FKs.
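
To make B concrete, here is a minimal sketch using hypothetical table and constraint names ("accounts" as a core table, "orders" as a child of it); the real schema would need one such pair of ALTERs per FK constraint:

  -- hypothetical table/constraint names; one such pair of ALTERs per FK
  ALTER TABLE orders DROP CONSTRAINT orders_account_id_fkey;
  ALTER TABLE orders
      ADD CONSTRAINT orders_account_id_fkey
      FOREIGN KEY (account_id) REFERENCES accounts (account_id)
      ON DELETE CASCADE;

  -- deleting from the core table now cascades down the hierarchy
  DELETE FROM accounts WHERE account_id NOT IN (1, 2, 3);

  -- afterwards, put the original constraint definition back
  ALTER TABLE orders DROP CONSTRAINT orders_account_id_fkey;
  ALTER TABLE orders
      ADD CONSTRAINT orders_account_id_fkey
      FOREIGN KEY (account_id) REFERENCES accounts (account_id);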

There are tradeoffs to each approach, of course.

A) does not require me to modify/restore our schema, but it will take some effort to build the intermediate processing code (for example, having to account for gotchas where our FK columns are not predictably named).

B) would probably require the least amount of effort to code up, but I am not certain it will perform well (or even work!).

Ordering the imports for C) correctly will be a pain, but generating the SELECTs and INSERTs programmatically is straightforward.
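
For the ordering part of C, a rough sketch of pulling the FK dependency edges out of the catalogs (the topological sort itself would be done outside the database):

  -- each row is a "child table references parent table" edge; a
  -- topological sort of these edges gives a safe import order
  -- (parents before children)
  SELECT con.conrelid::regclass  AS child_table,
         con.confrelid::regclass AS parent_table
  FROM   pg_constraint con
  WHERE  con.contype = 'f';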

Of course, I would love to hear about options D) - Z) as well!  Thanks in advance for your input.

Dave Owens

Re: database split

From: Jim Nasby

On 4/29/15 6:04 PM, Dave Owens wrote:
>
> A. Text-processing a dump of the original database, filtering only the
> rows in which I am interested, while loading the dump into the new database.
>
> B. Copying the original database (CREATE DATABASE smalldb WITH TEMPLATE
> bigdb), then modifying the schema so that each FK constraint now
> includes ON DELETE CASCADE, then deleting rows from the few core tables
> while letting postgresql cascade the deletes down the hierarchy.  Then,
> of course, restoring the schema to its original state.

This would be quite slow, but it would work.

> C. Generating the myriad SELECTs necessary to export only the required
> data from every table, then importing those results (either using dblink
> or COPY FROMs) into the new database that has been populated with schema
> from the original database.  Carefully ordering the imports to avoid
> missing FKs.

D. Combine A and B. FKs are added AFTER data is loaded in a SQL dump. So
after the COPY commands are done you could run relevant DELETES. Brute
force, but it'd do the job and be faster than cascade deletes.
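
Roughly, assuming hypothetical "accounts"/"orders" tables and a staging table listing the accounts being kept, the post-COPY step would look something like:

  -- hypothetical names; run after the dump's COPY statements finish,
  -- but before its ALTER TABLE ... ADD CONSTRAINT section is applied
  DELETE FROM orders
   WHERE account_id NOT IN (SELECT account_id FROM kept_accounts);
  DELETE FROM accounts
   WHERE account_id NOT IN (SELECT account_id FROM kept_accounts);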

E. A spin on C would be to put triggers on the tables to silently drop
data that you don't want.
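
Something along these lines, again with hypothetical names; the trigger would live in the new database only while the data is being loaded:

  -- returning NULL from a BEFORE ROW trigger silently discards the row;
  -- COPY fires row-level INSERT triggers, so this filters dump loads too
  CREATE FUNCTION drop_unwanted() RETURNS trigger AS $$
  BEGIN
      IF NEW.account_id NOT IN (SELECT account_id FROM kept_accounts) THEN
          RETURN NULL;
      END IF;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER orders_filter
      BEFORE INSERT ON orders
      FOR EACH ROW EXECUTE PROCEDURE drop_unwanted();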

All that said, my guess is you're doing this to support horizontal
scale-out, which means you'll probably need to do this more than once,
and it'd presumably be nice for you and your customers if this didn't
require downtime. I would look at having a way to create a partial
replica using londiste (or BDR if it'd support it). The trick there is
having a way to identify whether you want each individual row on a
replica. If you add some kind of cluster_id field to every table that
makes doing that filtering pretty easy; IIRC londiste supports that out
of the box.

The other option is having functions that would let you determine
whether you want a row. I know you could get londiste to do this, but
you'd have to implement some of the filtering in python (Skype actually
does, or at least did this).

One more thing to consider: You can scale PG to some serious database
sizes if you're willing to spend money on hardware. Given what kind of a
server you can get for $20k nowadays, I'd seriously investigate that
if you haven't already.

BTW, there's also https://github.com/citusdata/pg_shard. I've never
really looked at it, but it might make all of this a lot easier.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: database split

From: Dave Owens

> All that said, my guess is you're doing this to support horizontal
> scale-out, which means you'll probably need to do this more than once,
> and it'd presumably be nice for you and your customers if this didn't
> require downtime. I would look at having a way to create a partial
> replica using londiste (or BDR if it'd support it). The trick there is
> having a way to identify whether you want each individual row on a
> replica. If you add some kind of cluster_id field to every table that
> makes doing that filtering pretty easy; IIRC londiste supports that out
> of the box.
 
Jim, thank you for your input.  The reason for the split is partly horizontal scale-out (the current system is well-provisioned for the near future) and partly localization concerns arising from legacy code.  We are under a bit of a deadline to finish this split, so selective replication is not feasible at this time.  The tools you mention do look very promising for our needs, however.

I decided to do a combination of A and C (sketched below)...
- load a schema-only dump from the original db to the new db (filtering out triggers and constraints)
- do COPY TO/FROMs (I would have to generate all the WHEREs for Jim's option D anyway, and the data I am after is only a tiny fraction of the original db...)
- restore the constraints/triggers that were filtered out
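
For the record, a rough sketch of the per-table COPY step with hypothetical "accounts"/"orders" tables and a made-up id list (server-side COPY TO a file needs superuser; psql's \copy is the client-side alternative):

  -- in the original db: export only the wanted rows
  COPY (SELECT * FROM accounts WHERE account_id IN (1, 2, 3))
      TO '/tmp/accounts.copy';
  COPY (SELECT * FROM orders WHERE account_id IN (1, 2, 3))
      TO '/tmp/orders.copy';

  -- in the new db (schema loaded without triggers/constraints)
  COPY accounts FROM '/tmp/accounts.copy';
  COPY orders FROM '/tmp/orders.copy';

  -- then re-apply the constraints/triggers that were filtered out, e.g.
  ALTER TABLE orders
      ADD CONSTRAINT orders_account_id_fkey
      FOREIGN KEY (account_id) REFERENCES accounts (account_id);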