Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance - Mailing list pgsql-general
From: Pierre Barre
Subject: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
Msg-id: 55a8a36a-eb10-44e7-adca-12c30ae1254d@app.fastmail.com
In response to: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance (Marco Torres <mtors25@gmail.com>)
List: pgsql-general
Hi Marco,
Thanks for the kind words!
> and potentially elaborate on other projects for an active/active cluster! I applaud you.
I wrote up an argument for that here: https://github.com/Barre/ZeroFS?tab=readme-ov-file#cap-theorem
I definitely want to write a proof of concept when I get some time.
Best,
Pierre
On Fri, Jul 25, 2025, at 00:21, Marco Torres wrote:
My humble take on this project: well done! You are opening the doors to work on a much-needed endeavor, decouple compute from storage, and potentially elaborate on other projects for an active/active cluster! I applaud you.

On Thu, Jul 17, 2025, 4:59 PM Pierre Barre <pierre@barre.sh> wrote:

Hi everyone,

I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.

ZeroFS: https://github.com/Barre/ZeroFS

# The Architecture

ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:

PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3

By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
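As a rough illustration, a stack like this can be assembled with stock Linux tools. The following is a minimal sketch only: the port, device names, single-device pool, and ZFS options are illustrative assumptions, not the exact configuration behind the numbers below (the geo-distributed setup further down uses 3-way mirrored pools instead of a single device).

```
# Illustrative sketch -- values are assumptions, not the benchmarked setup.

# 1. Start a ZeroFS NBD server backed by an S3 bucket (see the ZeroFS README
#    for the exact invocation), listening on e.g. 127.0.0.1:10809.

# 2. Attach the exported block device on the database host
#    (old-style nbd-client syntax: host, port, device).
sudo nbd-client 127.0.0.1 10809 /dev/nbd0

# 3. Create a ZFS pool on the NBD device and add a spare local NVMe
#    partition as L2ARC, so hot blocks are served from local flash
#    instead of S3.
sudo zpool create tank /dev/nbd0
sudo zpool add tank cache /dev/nvme0n1p2

# 4. A dataset for PostgreSQL (8k recordsize matches PostgreSQL's page size).
sudo zfs create -o recordsize=8k -o compression=lz4 tank/pgdata
sudo chown postgres:postgres /tank/pgdata

# 5. Run an unmodified PostgreSQL on top of it.
sudo -u postgres initdb -D /tank/pgdata
sudo -u postgres pg_ctl -D /tank/pgdata -l /tmp/pg.log start
```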
## Performance Results

Here are pgbench results from PostgreSQL running on this setup:

### Read/Write Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.943 ms
initial connection time = 48.043 ms
tps = 53041.006947 (without initial connection time)
```

### Read-Only Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.121 ms
initial connection time = 53.358 ms
tps = 413436.248089 (without initial connection time)
```

These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.

## How It Works

1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
2. Multiple cache layers hide S3 latency:
   a. ZFS ARC/L2ARC for frequently accessed blocks
   b. ZeroFS memory cache for metadata and hot data
   c. Optional local disk cache
3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
4. Files are split into 128KB chunks for insertion into ZeroFS's LSM-tree

## Geo-Distributed PostgreSQL

Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.

Example architectures:

Architecture 1:

                        PostgreSQL Client
                               |
                               | SQL queries
                               |
                        +--------------+
                        |   PG Proxy   |
                        |  (HAProxy/   |
                        |  PgBouncer)  |
                        +--------------+
                          /          \
                         /            \
                Synchronous          Synchronous
                Replication          Replication
                       /                  \
                      /                    \
          +---------------+        +---------------+
          | PostgreSQL 1  |        | PostgreSQL 2  |
          |  (Primary)    |◄------►|  (Standby)    |
          +---------------+        +---------------+
                  |                        |
                  |  POSIX filesystem ops  |
                  |                        |
          +---------------+        +---------------+
          |  ZFS Pool 1   |        |  ZFS Pool 2   |
          | (3-way mirror)|        | (3-way mirror)|
          +---------------+        +---------------+
            /     |     \            /     |     \
           /      |      \          /      |      \
   NBD:10809  NBD:10810  NBD:10811  NBD:10812  NBD:10813  NBD:10814
       |          |          |          |          |          |
  +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
  |ZeroFS 1| |ZeroFS 2| |ZeroFS 3| |ZeroFS 4| |ZeroFS 5| |ZeroFS 6|
  +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
       |          |          |          |          |          |
       |          |          |          |          |          |
  S3-Region1  S3-Region2  S3-Region3  S3-Region4  S3-Region5  S3-Region6
  (us-east)   (eu-west)   (ap-south)  (us-west)   (eu-north)  (ap-east)

Architecture 2:

PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
             \                            /
              \                          /
                  Same ZFS Pool (NBD)
                          |
                   6 Global ZeroFS
                          |
                      S3 Regions

The main advantages I see are:

1. Dramatic cost reduction for large datasets
2. Simplified geo-distribution
3. Effectively unlimited storage capacity
4. Built-in encryption and compression

Looking forward to your feedback and questions!

Best,
Pierre

P.S. The full project includes a custom NFS filesystem too.
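As an illustrative footnote to Architecture 1 above: the synchronous-replication pairing maps onto stock PostgreSQL streaming-replication settings. A minimal, hedged sketch follows; the host name, replication user, application name, and the remote_apply choice are assumptions for illustration, not values from the post.

```
# Primary (Region 1) -- wait for the named standby to apply each commit.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
synchronous_standby_names = 'FIRST 1 (pg2_standby)'
synchronous_commit = remote_apply
EOF

# Standby (Region 2) -- point it at the primary; the application_name must
# match the entry in synchronous_standby_names on the primary.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
primary_conninfo = 'host=pg1.region1.example port=5432 user=replicator application_name=pg2_standby'
EOF
touch "$PGDATA/standby.signal"
```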