PostgreSQL on S3-backed Block Storage with Near-Local Performance - Mailing list pgsql-general

From Pierre Barre
Subject PostgreSQL on S3-backed Block Storage with Near-Local Performance
Msg-id a9fe5ddb-9685-4139-bc1f-88161a7a4da3@app.fastmail.com
List pgsql-general

Hi everyone,

I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining
performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem
operations to S3 objects.

ZeroFS: https://github.com/Barre/ZeroFS

# The Architecture

ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs
unmodified on ZFS pools built on these block devices:

PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3

By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies
despite the underlying storage being in S3.
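
As a rough sketch, the stack on a single host might be assembled along these lines (the ZeroFS invocation below is a placeholder rather than the project's actual CLI; the nbd-client and ZFS commands are standard):

```
# Start a ZeroFS NBD server backed by an S3 bucket.
# (Placeholder invocation; see the ZeroFS README for the real flags and the
#  S3 credential / encryption environment variables it expects.)
# zerofs nbd --bucket my-pg-bucket --listen 127.0.0.1:10809 &

# Attach the exported block device with the standard NBD client
# (exact nbd-client syntax varies slightly between versions).
sudo nbd-client 127.0.0.1 10809 /dev/nbd0

# Build a ZFS pool on the NBD device and run PostgreSQL on it, unmodified.
sudo zpool create pgpool /dev/nbd0
sudo zfs create -o mountpoint=/var/lib/postgresql/data pgpool/pgdata
sudo chown postgres:postgres /var/lib/postgresql/data
sudo -u postgres /usr/lib/postgresql/16/bin/initdb -D /var/lib/postgresql/data
```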

## Performance Results

Here are pgbench results from PostgreSQL running on this setup:

### Read/Write Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.943 ms
initial connection time = 48.043 ms
tps = 53041.006947 (without initial connection time)
```
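
For reference, a dataset at this scaling factor is produced by pgbench's standard initialization step (database name as in the run above):

```
# Initialize the pgbench schema at scaling factor 50.
pgbench -i -s 50 example
```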

### Read-Only Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.121 ms
initial connection time = 53.358 ms
tps = 413436.248089 (without initial connection time)
```

These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and
ZeroFS's memory caches, while cold data comes from S3.
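
To see how much of that traffic the caches absorb, the standard ZFS tools can be watched during a run (the pool name "pgpool" is just a placeholder):

```
# ARC / L2ARC hit statistics, sampled every 5 seconds (ships with OpenZFS).
arcstat 5

# Per-vdev I/O, including the L2ARC cache device.
zpool iostat -v pgpool 5
```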

## How It Works

1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
2. Multiple cache layers hide S3 latency:
   a. ZFS ARC/L2ARC for frequently accessed blocks
   b. ZeroFS memory cache for metadata and hot data (see the setup sketch after this list)
   c. Optional local disk cache
3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
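
As a rough illustration of the L2ARC layer in step 2, adding a local cache device to an existing pool and applying common PostgreSQL-on-ZFS settings might look like this (device path and property values are assumptions, not project recommendations):

```
# Add a local NVMe partition to the pool as an L2ARC cache device.
sudo zpool add pgpool cache /dev/nvme0n1p4

# Common PostgreSQL-on-ZFS dataset settings: 8K recordsize to match the
# database page size, lightweight compression, no atime updates.
sudo zfs set recordsize=8K compression=lz4 atime=off pgpool/pgdata
```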

## Geo-Distributed PostgreSQL

Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.

Example architectures:

Architecture 1:


                         PostgreSQL Client
                                   |
                                   | SQL queries
                                   |
                            +--------------+
                            |  PG Proxy    |
                            | (HAProxy/    |
                            |  PgBouncer)  |
                            +--------------+
                               /        \
                              /          \
                   Synchronous            Synchronous
                   Replication            Replication
                            /              \
                           /                \
              +---------------+        +---------------+
              | PostgreSQL 1  |        | PostgreSQL 2  |
              | (Primary)     |◄------►| (Standby)     |
              +---------------+        +---------------+
                      |                        |
                      |  POSIX filesystem ops  |
                      |                        |
              +---------------+        +---------------+
              |   ZFS Pool 1  |        |   ZFS Pool 2  |
              | (3-way mirror)|        | (3-way mirror)|
              +---------------+        +---------------+
               /      |      \          /      |      \
              /       |       \        /       |       \
        NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
             |        |        |           |        |        |
        +--------++--------++--------++--------++--------++--------+
        |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
        +--------++--------++--------++--------++--------++--------+
             |         |         |         |         |         |
             |         |         |         |         |         |
        S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
        (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
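
The replication piece of this layout is plain PostgreSQL synchronous replication; a minimal sketch with placeholder host and standby names:

```
# On the primary (PostgreSQL 1): require the standby to confirm each commit.
sudo -u postgres psql -c "ALTER SYSTEM SET synchronous_standby_names = 'pg2'"
sudo -u postgres psql -c "SELECT pg_reload_conf()"

# On the standby (PostgreSQL 2): clone from the primary and stream WAL.
# Its data directory lives on ZFS Pool 2, i.e. its own ZeroFS-backed devices.
# The primary_conninfo written by -R should include application_name=pg2 so
# it matches the synchronous_standby_names entry above.
sudo -u postgres pg_basebackup -h pg1.internal -D /var/lib/postgresql/data -R -X stream
```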

Architecture 2:

PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
                \                    /
                 \                  /
                  Same ZFS Pool (NBD)
                         |
                  6 Global ZeroFS
                         |
                      S3 Regions


The main advantages I see are:
1. Dramatic cost reduction for large datasets
2. Simplified geo-distribution
3. Practically unlimited storage capacity
4. Built-in encryption and compression

Looking forward to your feedback and questions!

Best,
Pierre

P.S. The full project includes a custom NFS filesystem too.


