procopy

procopy — a utility for fast data loading from a source database to a destination database

Synopsis

procopy config generate [option...]

procopy load [option...]

Description #

procopy is a utility for fast loading of data from a source database, such as Oracle, PostgreSQL, MySQL, etc., to a destination database, such as PostgreSQL, Postgres Pro Enterprise, etc. For the sources and destinations that are currently supported, see Table 4.1.

procopy is not responsible for:

  • Database schema transfer.

  • Replaying the changes that occurred during data loading.

  • Retaining data consistency. Therefore, it is recommended to remove existing constraints where possible and restore them after procopy finishes.

  • Data verification after loading.

procopy is developed in Go for efficient resource utilization when migrating databases of 10 TB or larger. At a high level, procopy consists of a pool of Readers, a pool of Loaders, and a built-in bus used to exchange data between Readers and Loaders:

Recommended Drivers #

It is recommended to choose drivers written in pure Go to minimize portability issues with the utility. Specifically, the following drivers are recommended:

Special Considerations #

  • Because of the binary load format for COPY, adding new types requires support for the corresponding encoders.

  • Loading of BLOB and BFILE objects is slower than loading of other objects because at least two SQL commands are needed for each row. This is much slower than just using COPY.

Limitations #

  • procopy does not support views and materialized views as sources.

Usage #

During procopy operation, tasks of different types are performed. A task can be of the Query, Schema, or Table type. The tasks section in the configuration file enables you to configure procopy tasks.

Types of Tasks #

  • Table.

    Tasks of this type are used to load a specific table with a possibility to filter and map columns.

  • Schema.

    Tasks of this type are used to load whole schemas with the possibility to filter and map tables.

  • Query.

    In tasks of this type, you can specify a statement that defines the selection from the source database.

    Important

    In procopy, tasks of the Query type have a limitation: they cannot be resumed after a restart.

Generation of an Initial Configuration File #

To generate a configuration file, execute the config generate command. You can update the configuration values there later. The configuration file is described in Section 5.1 and Section 5.2.

Launching the Data Load from the Source to the Destination DB #

To launch the data load from the source to the destination DB, execute the load command with the configuration file specified.

On launching the data load, the configuration file is validated as follows:

  • In case of incorrect names of configuration parameters, procopy completes with an error.

  • Values of configuration parameters are checked (for example, correctness of IP addresses and host names).

  • If not all the required configuration parameters are specified, procopy completes with an error.

  • If incompatible configuration parameters are specified, procopy completes with an error.

Stopping procopy #

procopy terminates automatically when all tasks are completed. A task is considered complete when procopy has reached the end of the table.

To stop procopy in operation, press Ctrl+C. Pressing Ctrl+C once launches a graceful shutdown. It takes several minutes for the application to shut down correctly and save all the information required for a restart. If Ctrl+C is pressed twice, the application terminates quickly. In this case, multiple unique constraint violation errors are likely to occur after a restart. Despite these errors, procopy continues its operation as usual, but this can take considerable time.

Error Handling #

If for some reason procopy cannot apply a row, it saves the key and the error in a CSV file <task_id>_index.csv in the problem_rows.local.save_dir/<timestamp> directory. If the problem_rows.local.save_dir configuration parameter is not specified, the /tmp/<timestamp>/*.csv directory is used.

The default behavior is to skip problematic rows, but in practice this is often not suitable. In such cases, it is possible to set up procopy_options to specify where to save batches of problematic rows using the following construct:

{
  "version": 1,
  "procopy_options": {
    "problem_rows": {
      "local": {
        "save_dir": "path_to_save_dir"
      }
    }
  }
}

If an error occurs that causes procopy to crash, the application saves the diagnostic information to the Procopy_panic_data_<current_time>.tar.gz file.

Repeated Application of Problematic Rows #

If issues arose during a previous load and you have resolved them, you can try to apply these rows without a full data reload. To do this, execute the command:

procopy load --file filename --reload-from TIMESTAMP

Where TIMESTAMP is the name of the directory that stores indexes with problematic rows.

Note

Rows from *_index.csv files will be reloaded. New rows that appeared in tables will not be applied.

If issues occur again, new *_index.csv files will be created under the new TIMESTAMP.

Resuming the Load #

procopy can resume data loading from the point where it stopped last time for tasks of the Schema type and of the Table type for tables with keys.

If a task of the Query type was not completed, it cannot be resumed. Truncate the destination table before reloading.

If a task of the Table type was not completed for a table without keys (a heap), truncate the table before reloading; otherwise, rows will be duplicated.

If procopy_options.truncate is true or truncate is true for a task, the table will be cleared and data loading will start from scratch.
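For illustration, a Table task that clears its destination table and reloads from scratch might be configured as follows (a sketch; the task ID and table names are placeholders):

```yaml
tasks:
    - id: tab1
      table:
        source_table: source_table
        destination_table: destination_table
      # Clear the destination table and load from scratch
      # instead of resuming from the last saved position.
      truncate: true
```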

Configuring the Load Parallelism #

Parallelism of the load process is achieved by:

  • Setting up the read and write processes

  • Configuring splitting tasks into subtasks

  • Processing partitioned tables

Set the global limits on the number of threads for reading data from the source and loading it to the destination in the procopy_options.readers and procopy_options.loaders configuration parameters, respectively. By default, both equal NumCPU/2.

You can split large tables into parts by the ranges of unique keys, so each part will be processed in an individual task. To do this, set the global procopy_options.sub_task_rows parameter or the task.sub_task_rows parameter at the task level.

For each partition of a partitioned table, an individual task is created.
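For illustration, the parallelism settings described above can be combined in a single configuration fragment (a sketch; the values are examples, not recommendations):

```json
{
  "version": 1,
  "procopy_options": {
    "readers": 8,
    "loaders": 8,
    "sub_task_rows": 1000000
  }
}
```

Here, up to 8 reader and 8 loader threads are used, and large tables are split by unique key ranges into subtasks of 1,000,000 rows each.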

Configuring Tasks of the Table Type #

Configuration settings related to tasks of this type are specified in the tasks.table section of the configuration file. For example:

tasks:
    - id: tab1
      table:
        source_table: source_table
        destination_table: destination_table
        exclude_columns:
            - field4
        include_columns:
            - field1
            - field2
            - field3
            - field4
        column_mapping:
            field1: field2
            field2: field1
        where: (COL1 IS NULL OR COL1 > 10) AND COL2 <> 'value'
      batch_bytes: 976.6KiB
      truncate: false
      transform:
        field1:
            null_to_value: ""
        field2:
            values_to_null:
                - "NULL"
                - NULL_VAL
                - INVALID
        field3:
            null_char_replace: "\n"
      snapshot_id: null

To configure tasks of the Table type, choose the columns to be loaded:

  • Choose only specific columns to load using the include_columns setting. By default, all columns are loaded.

  • Exclude columns from loading using the exclude_columns setting.

  • Rename columns to be loaded using the column_mapping setting.

Use the transform setting to list rules to transform null values of the specified columns.

Use the where setting to limit the number of rows to load.

Configuring Tasks of the Schema Type #

Configuration settings related to tasks of this type are specified in the tasks.schema section of the configuration file. For example:

tasks:
    - id: schema_task
      schema:
        source_schema: schema1
        destination_schema: schema2
        exclude_tables:
            - TABLE1
        include_tables:
            - TABLE1
            - TABLE2
        table_mapping:
            TABLE2: main_table
      batch_bytes: 976.6KiB
      truncate: false
      snapshot_id: null

To configure tasks of the Schema type, choose the tables to be loaded:

  • Choose only specific tables to load using the include_tables setting. By default, all tables are loaded.

  • Exclude tables from loading using the exclude_tables setting.

  • Rename tables to be loaded using the table_mapping setting.

Configuring Tasks of the Query Type #

Tasks of the Query type load data to the destination DB from the source DB that is obtained by running a query. Configuration settings related to tasks of this type are specified in the tasks.query section of the configuration file. For example:

tasks:
    - id: tab2
      query:
        sql: SELECT id, name FROM users ORDER BY id
        destination_table: users_table
        destination_columns:
            - id
            - login
      batch_bytes: 976.6KiB
      truncate: false
      snapshot_id: null

When configuring tasks of the Query type, list columns in the tasks.query.destination_columns setting in the same order as in the query.

Configuring Tasks to Specify a Snapshot #

In tasks that load data from PostgreSQL/Postgres Pro, you can specify the snapshot to load data for. To do this, follow the steps below:

  1. In the source DB, start a transaction with the REPEATABLE READ isolation level.

  2. Get the snapshot identifier with the query

    SELECT pg_export_snapshot()
    

  3. In the section of the configuration file related to this task, specify the returned value for snapshot_id.

For example, the fragment of the configuration file for this task can look as follows:

tasks:
- id: query_task
  query:
    sql: SELECT id, name FROM users ORDER BY id
    destination_table: users_table
    destination_columns:
        - id
        - login
  batch_bytes: 1000000
  truncate: false
  snapshot_id: 00000006-0000000000002007-1

As a result of this task, the data rows will be loaded as of the start of the transaction even if concurrent transactions change these rows.

For automatic creation of a snapshot on launching the data load, set the procopy_options.enable_auto_snapshot configuration parameter to true.
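For example, assuming the JSON configuration format, automatic snapshot creation can be enabled as follows (a sketch):

```json
{
  "version": 1,
  "procopy_options": {
    "enable_auto_snapshot": true
  }
}
```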

Migrating BFILE Objects #

When migrating BFILE objects, only aliases are transferred. The file data itself must remain available from the source or be transferred by users.

Command-Line Reference #

This section describes procopy commands. Optional parameters are enclosed in square brackets.

config generate #

procopy config generate [-f|--format json|yaml] [-o|--output filename]

Generates a procopy configuration file.

-f json|yaml
--format json|yaml

Specifies the format of the configuration file: JSON or YAML. The default is YAML.

-o filename
--output filename

Specifies the filename of the configuration file. By default, the configuration file will be output to stdout.

load #

procopy load -f|--file filename [--log-level error|warn|info|debug]
[-s|--queue-size count_of_batches] [-q|--quiet]
[-r|--reload-from TIMESTAMP] [--dry-run|--read-only]

Launches data load between the databases.

-f filename
--file filename

Name of the configuration file. Required.

--log-level error|warn|info|debug

Logging level. The default is info.

Important

Detailed logging can affect performance, so choose debug only when you are investigating errors.

-s count_of_batches
--queue-size count_of_batches

Size of the queue between Readers and Loaders to be used by procopy. The default is 128.

Important

A large value can excessively increase memory consumption. It is usually enough to set it to NumLoaders * 2. If procopy starts to slow down when loading data to the destination, memory consumption can grow by more than count_of_batches * batch_bytes bytes.

-q
--quiet

Turns off output of the procopy progress.

-r TIMESTAMP
--reload-from TIMESTAMP

Launches a repeated load of problematic rows. TIMESTAMP is the timestamp of one of the previous launches of loading. If this option is specified, procopy tries to only load rows that had issues during loading with the TIMESTAMP. You can learn this timestamp from logs or find it in the directory specified in the problem_rows.local.save_dir configuration parameter.

--dry-run

Turns off writing to the destination. procopy will perform all the operations except writing. This flag is useful for checking the performance of all stages except writing.

--read-only

Turns off all the procopy logic except reading from the source. This flag is useful for checking the performance of reading.