Home > mailing lists

Advice on best way to store a large amount of data in postgresql - Mailing list pgsql-performance

From	spiral
Subject	Advice on best way to store a large amount of data in postgresql
Date	January 8, 2023 12:02:01
Msg-id	20230108070201.39b19d30@xps Whole thread Raw
Responses	Re: Advice on best way to store a large amount of data in postgresql Re: Advice on best way to store a large amount of data in postgresql Re: Advice on best way to store a large amount of data in postgresql
List	pgsql-performance

Tree view

Hello,

We have a table containing ~1.75 billion rows, using 170GB storage.
The table schema is the following:

messages=# \d messages
                 Table "public.messages"
    Column    |  Type   | Collation | Nullable | Default 
--------------+---------+-----------+----------+---------
 mid          | bigint  |           | not null | 
 channel      | bigint  |           | not null | 
 member       | integer |           |          | 
 sender       | bigint  |           | not null | 
 original_mid | bigint  |           |          | 
 guild        | bigint  |           |          | 
Indexes:
    "messages_pkey" PRIMARY KEY, btree (mid)


This table is used essentially as a key-value store; rows are accessed
only with `mid` primary key. Additionally, inserted rows may only be
deleted, but never updated.

We only run the following queries:
- INSERT INTO messages VALUES (...data...);
- SELECT * FROM messages WHERE mid = $1;
- DELETE FROM messages WHERE mid = $1;
- DELETE FROM messages WHERE mid IN ($1...$n);
- SELECT count(*) FROM messages;

For the "IN" query, it is possible for there to be up to 100
parameters, and it is possible that none of them will match an existing
row.

So, the problem: I don't know how to best store this data in
postgres, or what system requirements would be needed.
Originally, this table did not contain a substantial amount of data,
and so I stored it in the same database as our CRUD user data. However,
as the table became larger, cache was being allocated to (mostly
unused) historical data from the `messages` table, and I decided to
move the large table to its own postgres instance.

At the same time, I partitioned the table, with TimescaleDB's automatic
time-series partitioning, because our data is essentially time-series
(`mid` values are Twitter-style snowflakes) and it was said that
partitioning would improve performance.
This ended up being a mistake... shared_buffers memory usage went way
up, from the 20GB of the previous combined database to 28GB for just
the messages database, and trying to lower shared_buffers at all made
the database start throwing "out of shared memory" errors when running
DELETE queries. A TimescaleDB update did improve this, but 28GB is way
more memory than I can afford to allocate to this database - instead of
"out of shared memory", it gets OOM killed by the system.

What is the best course of action here?
- Ideally, I would like to host this database on a machine with 4
  (Ryzen) cores, 8GB RAM, and tiered storage (our cloud provider doesn't
  support adding additional local storage to a VPS plan). Of course,
  this seems very unrealistic, so it's not a requirement, but the
  closer we can get to this, the better.
- Is it a good idea to use table partitioning? I heard advice that one
  should partition tables with above a couple million rows, but I don't
  know how true this is. We have a table with ~6mil rows in our main
  database that has somewhat slow lookups, but we also have a table
  with ~13mil rows that has fast lookups, so I'm not sure.

Thanks
spiral

pgsql-performance by date:

From: Joe Wildish
Date: 06 January 2023, 21:09:54
Subject: Max write throughput for single COPY

From: "Michaeldba@sqlexec.com"
Date: 09 January 2023, 10:39:45
Subject: Re: Advice on best way to store a large amount of data in postgresql

Advice on best way to store a large amount of data in postgresql - Mailing list pgsql-performance

Previous

Next