3.2. Data Migration #
When migrating data, the order of fields in the source and target schema is important. The order and types of fields in the non-distributed and distributed databases must be the same.
The migration utility does exactly what is requested by the user, who does not interfere with data migration processes except, maybe, distributing the data directly to the shard where it must be stored.
Shardman provides convenient migration tools. Once the distributed schema is created and the sharding key chosen, it is now needed to define the data migration rules. The data source can be either export CSV data files or a single DBMS server.
It is not always convenient to use CSV files as they can reach a pretty large size and require additional resources for storage and transfer.
Migrating data directly from DB to DB without an intermediate storage phase is much more convenient.
The order of loading data during migration must be taken into account. Tables can be linked with a foreign key, so the data in tables that other tables will reference must be loaded first. To follow such an order, in the migration file, you should establish the priority that defines tables whose data must be loaded first. The higher the value of the priority parameter, the higher the priority. For example, if the priorities 1, 2 and 3 are defined, tables with the priority 3 will be loaded first, then those with the priority 2, and last with the priority 1.
The shardmanctl load command lets you define the order of migrating tables, which can be specified in the configuration YML file.
3.2.1. Naive Approach #
The following is an example of the migrate.yml file:
version: "1.0"
migrate:
connstr: "dbname=demo host=single-pg-instance port=5432 user=postgres password=******"
jobs: 8
batch: 2000
options:
schemas:
- name: bookings
# the all parameter set to false turns off automatic creation of pages
# tables are already created, at the Schema Migration phase
all: false
tables:
- name: airports
# defining a global table
type: global
# as tables are linked, data migration priority must be defined
# setting highest priority to tables whose data
# must be copied first
priority: 3
- name: aircrafts
type: global
priority: 3
- name: seats
type: global
priority: 3
- name: bookings
type: global
priority: 3
- name: flights
type: global
priority: 3
- name: tickets
type: sharded
# defining a sharded table
# specifying the sharding key
distributedby: ticket_no
partitions: 4
priority: 2
- name: ticket_flights
type: sharded
distributedby: ticket_no
# defining a sharded and colocated table
# specifying the name of the table that ticket_flights table will be colocated with
colocatewith: tickets
partitions: 4
priority: 2
- name: boarding_passes
type: sharded
distributedby: ticket_no
colocatewith: tickets
partitions: 4
priority: 1
This file defines the data source, that is, the single-pg-instance node, its connection port, user name and password, and data source DB name. Some parameters of the migration utility operation are also defined (there can be quite a few of them, as explained in the section called “Loading Data with a Schema from PostgreSQL”). The file also defines the number of threads — 8, batch size, that is, the number of rows organized into batches for processing during migration, as well as table processing priorities. The data for the global tables is migrated first, then the data for the sharded tables tickets and ticket_flights, and migration of the boarding_passes table completes the migration. The value of priority defines the priority of data loading, data for tables with higher value will be loaded earlier than with the lower value. The following command performs the migration:
shardmanctl load --schema migrate.yml
If the utility completes with the message “data loading completed successfully”, it means that the migration was a success.
3.2.2. Complex Approach #
With this approach, the launch and operation of the shardmanctl utility in the load mode is the same as with the naive approach. However, the file that defines the order of loading tables will slightly differ as the sharding key has changed:
---
version: "1.0"
migrate:
connstr: "dbname=demo host=single-pg-instance port=5432 user=postgres password=postgres"
jobs: 8
batch: 2000
options:
schemas:
- name: bookings
all: false
tables:
- name: airports
type: global
priority: 5
- name: aircrafts
type: global
priority: 5
- name: seats
type: global
priority: 5
- name: flights
type: global
priority: 5
- name: bookings
type: sharded
priority: 4
partitions: 4
distributedby: book_ref
- name: tickets
type: sharded
distributedby: book_ref
colocatewith: bookings
partitions: 4
priority: 3
- name: ticket_flights
type: sharded
distributedby: book_ref
colocatewith: bookings
partitions: 4
priority: 2
- name: boarding_passes
type: sharded
distributedby: book_ref
colocatewith: bookings
partitions: 4
priority: 1