3.2. Data Migration #
When migrating data, the order of fields in the source and target schema is important. The order and types of fields in the non-distributed and distributed databases must be the same.
The migration utility does exactly what is requested by the user, who does not interfere with data migration processes except, maybe, distributing the data directly to the shard where it must be stored.
Shardman provides convenient migration tools. Once the distributed schema is created and the sharding key chosen, it is now needed to define the data migration rules. The data source can be either export CSV data files or a single DBMS server.
It is not always convenient to use CSV files as they can reach a pretty large size and require additional resources for storage and transfer.
Migrating data directly from DB to DB without an intermediate storage phase is much more convenient.
The order of loading data during migration must be taken into account. Tables can be linked with a foreign key, so the data in tables that other tables will reference must be loaded first. To follow such an order, in the migration file, you should establish the priority that defines tables whose data must be loaded first. The higher the value of the priority
parameter, the higher the priority. For example, if the priorities 1, 2 and 3 are defined, tables with the priority 3 will be loaded first, then those with the priority 2, and last with the priority 1.
The shardmanctl load
command lets you define the order of migrating tables, which can be specified in the configuration YML file.
3.2.1. Naive Approach #
The following is an example of the migrate.yml
file:
version: "1.0" migrate: connstr: "dbname=demo host=single-pg-instance port=5432 user=postgres password=******" jobs: 8 batch: 2000 options: schemas: - name: bookings # the all parameter set to false turns off automatic creation of pages # tables are already created, at the Schema Migration phase all: false tables: - name: airports # defining a global table type: global # as tables are linked, data migration priority must be defined # setting highest priority to tables whose data # must be copied first priority: 3 - name: aircrafts type: global priority: 3 - name: seats type: global priority: 3 - name: bookings type: global priority: 3 - name: flights type: global priority: 3 - name: tickets type: sharded # defining a sharded table # specifying the sharding key distributedby: ticket_no partitions: 4 priority: 2 - name: ticket_flights type: sharded distributedby: ticket_no # defining a sharded and colocated table # specifying the name of the table that ticket_flights table will be colocated with colocatewith: tickets partitions: 4 priority: 2 - name: boarding_passes type: sharded distributedby: ticket_no colocatewith: tickets partitions: 4 priority: 1
This file defines the data source, that is, the single-pg-instance
node, its connection port, user name and password, and data source DB name. Some parameters of the migration utility operation are also defined (there can be quite a few of them, as explained in the section called “Loading Data with a Schema from PostgreSQL”). The file also defines the number of threads — 8, batch size, that is, the number of rows organized into batches for processing during migration, as well as table processing priorities. The data for the global tables is migrated first, then the data for the sharded tables tickets
and ticket_flights
, and migration of the boarding_passes
table completes the migration. The value of priority
defines the priority of data loading, data for tables with higher value will be loaded earlier than with the lower value. The following command performs the migration:
shardmanctl load --schema migrate.yml
If the utility completes with the message “data loading completed successfully”, it means that the migration was a success.
3.2.2. Complex Approach #
With this approach, the launch and operation of the shardmanctl utility in the load
mode is the same as with the naive approach. However, the file that defines the order of loading tables will slightly differ as the sharding key has changed:
--- version: "1.0" migrate: connstr: "dbname=demo host=single-pg-instance port=5432 user=postgres password=postgres" jobs: 8 batch: 2000 options: schemas: - name: bookings all: false tables: - name: airports type: global priority: 5 - name: aircrafts type: global priority: 5 - name: seats type: global priority: 5 - name: flights type: global priority: 5 - name: bookings type: sharded priority: 4 partitions: 4 distributedby: book_ref - name: tickets type: sharded distributedby: book_ref colocatewith: bookings partitions: 4 priority: 3 - name: ticket_flights type: sharded distributedby: book_ref colocatewith: bookings partitions: 4 priority: 2 - name: boarding_passes type: sharded distributedby: book_ref colocatewith: bookings partitions: 4 priority: 1