3.2. Data Migration #

When migrating data, the order of fields in the source and target schema is important. The order and types of fields in the non-distributed and distributed databases must be the same.

The migration utility does exactly what is requested by the user, who does not interfere with data migration processes except, maybe, distributing the data directly to the shard where it must be stored.

Shardman provides convenient migration tools. Once the distributed schema is created and the sharding key chosen, it is now needed to define the data migration rules. The data source can be either export CSV data files or a single DBMS server.

It is not always convenient to use CSV files as they can reach a pretty large size and require additional resources for storage and transfer.

Migrating data directly from DB to DB without an intermediate storage phase is much more convenient.

The order of loading data during migration must be taken into account. Tables can be linked with a foreign key, so the data in tables that other tables will reference must be loaded first. To follow such an order, in the migration file, you should establish the priority that defines tables whose data must be loaded first. The higher the value of the priority parameter, the higher the priority. For example, if the priorities 1, 2 and 3 are defined, tables with the priority 3 will be loaded first, then those with the priority 2, and last with the priority 1.

The shardmanctl load command lets you define the order of migrating tables, which can be specified in the configuration YML file.

3.2.1. Naive Approach #

The following is an example of the migrate.yml file:

version: "1.0"
migrate:
 connstr: "dbname=demo host=single-pg-instance port=5432 user=postgres password=******"
 jobs: 8
 batch: 2000
 options:
 schemas:
   - name: bookings
     # the all parameter set to false turns off automatic creation of pages
     # tables are already created, at the Schema Migration phase
     all: false
     tables:
       - name: airports
         # defining a global table
         type: global
         # as tables are linked, data migration priority must be defined
         # setting highest priority to tables whose data
         # must be copied first
         priority: 3
       - name: aircrafts
         type: global
         priority: 3
       - name: seats
         type: global
         priority: 3
       - name: bookings
         type: global
         priority: 3
       - name: flights
         type: global
         priority: 3
       - name: tickets
         type: sharded
         # defining a sharded table
         # specifying the sharding key
         distributedby: ticket_no
         partitions: 4
         priority: 2
       - name: ticket_flights
         type: sharded
         distributedby: ticket_no
         # defining a sharded and colocated table
         # specifying the name of the table that ticket_flights table will be colocated with
         colocatewith: tickets
         partitions: 4
         priority: 2
       - name: boarding_passes
         type: sharded
         distributedby: ticket_no
         colocatewith: tickets
         partitions: 4
         priority: 1

This file defines the data source, that is, the single-pg-instance node, its connection port, user name and password, and data source DB name. Some parameters of the migration utility operation are also defined (there can be quite a few of them, as explained in the section called “Loading Data with a Schema from PostgreSQL). The file also defines the number of threads — 8, batch size, that is, the number of rows organized into batches for processing during migration, as well as table processing priorities. The data for the global tables is migrated first, then the data for the sharded tables tickets and ticket_flights, and migration of the boarding_passes table completes the migration. The value of priority defines the priority of data loading, data for tables with higher value will be loaded earlier than with the lower value. The following command performs the migration:

shardmanctl load --schema migrate.yml

If the utility completes with the message data loading completed successfully, it means that the migration was a success.

3.2.2. Complex Approach #

With this approach, the launch and operation of the shardmanctl utility in the load mode is the same as with the naive approach. However, the file that defines the order of loading tables will slightly differ as the sharding key has changed:

 ---
version: "1.0"
migrate:
 connstr: "dbname=demo host=single-pg-instance port=5432 user=postgres password=postgres"
 jobs: 8
 batch: 2000
 options:
 schemas:
   - name: bookings
     all: false
     tables:
       - name: airports
         type: global
         priority: 5
       - name: aircrafts
         type: global
         priority: 5
       - name: seats
         type: global
         priority: 5
       - name: flights
         type: global
         priority: 5
       - name: bookings
         type: sharded
         priority: 4
         partitions: 4
         distributedby: book_ref
       - name: tickets
         type: sharded
         distributedby: book_ref
         colocatewith: bookings
         partitions: 4
         priority: 3
       - name: ticket_flights
         type: sharded
         distributedby: book_ref
         colocatewith: bookings
         partitions: 4
         priority: 2
       - name: boarding_passes
         type: sharded
         distributedby: book_ref
         colocatewith: bookings
         partitions: 4
         priority: 1

pdf