Chapter 3. Develop

A Shardman cluster uses two main ways to store data: sharded tables is the main way, designed for big data, and global tables, designed for small dictionaries. A sharded table contains different parts of the data in each shard, while a global table contains the same data in all shards. Efficient query execution on a Shardman cluster requires that the data is properly distributed across cluster shards and primarily, a sharding key is properly selected.

First of all, when transitioning from a regular database schema to the distributed one, it makes sense to start the design with deciding how the data will be distributed in the Shardman cluster. Shardman distributes table rows across shards according to the hash value of the column to use for the table partitioning. In other words, the desired distribution must be even, and it aims to distribute equal parts of the data across cluster nodes and evenly distribute the workload.

When a database architect chooses the column to use for the table partitioning, the majority of typical queries executed must be taken into account to ensure the maximum performance.

In general, for most queries, especially, for those that use joins, the sharding key must be included in the query text. Otherwise, Shardman will not push down queries to cluster nodes for execution, which will cause essential performance degradation as compared to usage of a single instance.

Secondly, when choosing a sharding key, it is important that it does not change. A resharding operation, that is, a change of the sharding key, is pretty time-consuming and resource-intensive. At present, Shardman lacks techniques that automate this procedure. In general, if resharding is required, the data in all the sharded tables should be either moved to local tables or to sharded tables with another sharding key. Then you will have to create new sharded tables with a new sharding key and move the data back. This operation is very expensive and resource-intensive. Such operations often cannot be performed without the system outage during the migration.

Another point is that distributed transactions, that is, those that update data on several cluster shards at the same point in time, cannot be performed for free. So the better data is located and computations are performed inside one shard, the faster queries are executed. In general, the proportion of distributed and non-distributed transactions must be shifted towards non-distributed ones. Only apply distributed transactions if you have a compelling need to do it.

And finally, Shardman is a distributed system, which has both advantages and disadvantages inherent to such systems. Besides, Shardman is primarily designed for OLTP load. OLAP queries to Shardman are also possible, but only pretty simple of them (for details, see limitations). If you want to load an OLTP system with OLAP functionality, bear in mind that the lists of analytic and aggregate SQL functions to be sent to other shards for execution are highly limited.

Also special attention should be paid to type casts in queries because inclusion of a type casting function in a query condition can make it impossible to be pushed down to a remote server.

Taking into account the above features and limitations of the RDBMS, we will provide two simple examples of the transition from a regular to a distributed database schema.

pdf