citus

citus — distributed database and columnar storage functionality

Description

citus is an extension that is made compatible with Postgres Pro and provides such major functionalities as columnar data storage and distributed OLAP database, which can be used either together or separately.

citus offers the following benefits:

  • Columnar storage with data compression.

  • The ability to scale your Postgres Pro installation to a distributed database cluster.

  • Row-based or schema-based sharding.

  • Parallelized DML operations across cluster nodes.

  • Reference tables, which can be accessed locally on each node.

  • The ability to execute DML queries on any node, which allows utilizing the full capacity of your cluster for distributed queries.

Key Concepts

The key concepts of citus are as follows:

  • Cluster is a group of database servers, which are also called cluster nodes, managed by Postgres Pro usually on different computers. It comprises the coordinator node and several worker nodes. As a standard scenario, queries are sent to the coordinator node that ensures their parallel execution by relaying each query to one or more worker nodes.

  • Columnar storage is a storage where data is stored as arrays each of which contains data of a column for many rows at once rather than as a sequence of rows each of which contains values of all its columns. This type of storage is append-only in citus and does not support the UPDATE/DELETE commands.

  • Reference table is a table, whose content is replicated to all nodes.

  • Row-based sharding is sharding of tables where different rows of the same table may be stored on different nodes. Rows are assigned to nodes based on the value of an additional column called distribution column or sharding key. The Tenant ID column is usually used as a sharding key.

  • Schema-based sharding is sharding of tables where a separate schema is created for each tenant. Tables of a schema are located on one node while different nodes may have different schemas.

  • Sharding is a division of a table into shards and storage of these shards on different nodes.

  • Tenant is an end user of a database. Tenants are not allowed to see the data of each other, but their data may be physically stored in a single database or even table.

For details about the citus concepts, see the official documentation of the extension.

Here are some of the most useful getting started resources:

Limitations

citus is incompatible with some Postgres Pro Enterprise features, take note of these limitations while arranging your work with the extension:

Installation

Installing citus on a Single Node

To enable citus on a single node, complete the following steps:

  1. Add citus to the shared_preload_libraries variable in the postgresql.conf file:

    shared_preload_libraries = 'citus'
    
  2. Reload the database server for the changes to take effect. To verify that the citus library was installed correctly, you can run the following command:

    SHOW shared_preload_libraries;
    
  3. Create the citus extension using the following query:

    CREATE EXTENSION citus;
    

The CREATE EXTENSION command in the procedure above also installs the citus_columnar extension. If you want to enable only citus_columnar, complete the same steps but specify citus_columnar instead.

Installing citus on Multiple Nodes

To enable citus on multiple nodes, complete the following steps on all nodes:

  1. Add citus to the shared_preload_libraries variable in the postgresql.conf file:

    shared_preload_libraries = 'citus'
    
  2. Set up access permissions to the database server. By default, the database server listens only to clients on localhost. Set the listen_addresses configuration parameter to * to specify all available IP interfaces.

  3. Configure client authentication by editing the pg_hba.conf file.

  4. Reload the database server for the changes to take effect. To verify that the citus library was installed correctly, you can run the following command:

    SHOW shared_preload_libraries;
    
  5. Create the citus extension using the following query:

    CREATE EXTENSION citus;
    

When the above steps have been taken on all nodes, perform the actions below on the coordinator node for worker nodes to be able to connect to it:

  1. Register the hostname that worker nodes use to connect to the coordinator node:

    SELECT citus_set_coordinator_host('coordinator_name', coordinator_port);
    

  2. Add each worker node:

    SELECT * from citus_add_node('worker_name', worker_port);
    

  3. Verify that worker nodes are set successfully:

    SELECT * FROM citus_get_active_worker_nodes();
    

When to Use citus

Multi-Tenant SaaS Database

Most B2B applications already have the notion of a tenant, customer, or account built into their data model. In this model, the database serves many tenants, each of whose data is separate from other tenants.

citus provides full SQL functionality for this workload and enables scaling out your relational database. citus also adds new features for multi-tenancy. For example, citus supports tenant isolation to provide performance guarantees for large tenants, and has the concept of reference tables to reduce data duplication across tenants.

These capabilities allow you to scale out data of your tenants across many machines and add more CPU, memory, and disk resources. Further, sharing the same database schema across multiple tenants makes efficient use of hardware resources and simplifies database management.

citus offers the following advantages for multi-tenant applications:

  • Fast queries for all tenants.

  • Sharding logic in the database rather than the application.

  • Hold more data in single-node Postgres Pro.

  • Scale out maintaining the SQL functionality.

  • Maintain performance under high concurrency.

  • Fast metrics analysis across customer base.

  • Scale to handle new customer sign-ups.

  • Isolate resource usage of large and small customers.

Real-Time Analytics

citus supports real-time queries over large datasets. Commonly these queries occur in rapidly growing event systems or systems with time series data. Example use cases include:

  • Analytic dashboards with sub-second response times.

  • Exploratory queries on unfolding events.

  • Large dataset archival and reporting.

  • Analyzing sessions with funnel, segmentation, and cohort queries.

citus parallelizes query execution and scales linearly with the number of worker databases in a cluster. Some advantages of citus for real-time applications are as follows:

  • Maintain sub-second responses as the dataset grows.

  • Analyze new events and new data in real time.

  • Parallelize SQL queries.

  • Scale out maintaining the SQL functionality.

  • Maintain performance under high concurrency.

  • Fast responses to dashboard queries.

  • Use one database rather than many on several nodes.

  • Rich Postgres Pro data types and extensions.

Microservices

citus supports schema-based sharding, which allows distributing regular database schemas across many machines. This sharding methodology aligns well with typical microservices architecture, where storage is fully owned by the service hence cannot share the same schema definition with other tenants.

Schema-based sharding is an easier model to adopt, create a new schema, and set the search_path in your service.

Advantages of using citus for microservices:

  • Allows distributing horizontally scalable state across services.

  • Transfer strategic business data from microservices into common distributed tables for analytics.

  • Efficiently use hardware by balancing services on multiple machines.

  • Isolate noisy services to their own nodes.

  • Easy to understand sharding model.

  • Quick adoption.

Considerations for Use

citus extends Postgres Pro with distributed functionality, but it is not a drop-in replacement that scales out all workloads. A performant citus cluster involves thinking about the data model, tooling, and choice of SQL features used.

A good way to think about tools and SQL features is the following: if your workload aligns with use cases described here and you happen to run into an unsupported tool or query, then there is usually a good workaround.

When citus is Inappropriate

Some workloads do not need a powerful distributed database, while others require a large flow of information between worker nodes. In the first case citus is unnecessary and in the second not generally performant. Below are a few examples when you do not need to use citus:

  • You do not expect your workload to ever grow beyond a single Postgres Pro Enterprise node.

  • Offline analytics, without the need for real-time data transfer nor real-time queries.

  • Analytics apps that do not need to support a large number of concurrent users.

  • Queries that return data-heavy ETL results rather than summaries.

To learn more about citus, see its official documentation.