Home > mailing lists

Re: When to use PARTITION BY HASH? - Mailing list pgsql-general

From	Oleksandr Shulgin
Subject	Re: When to use PARTITION BY HASH?
Date	June 8, 2020 11:40:15
Msg-id	CACACo5SSz1YVDQSw1nVZ5AJmi23NXVaK4rammnzKMdN=j2rQZQ@mail.gmail.com Whole thread Raw
In response to	Re: When to use PARTITION BY HASH? (Michel Pelletier <pelletier.michel@gmail.com>)
Responses	Re: When to use PARTITION BY HASH? (Ron <ronljohnsonjr@gmail.com>)
List	pgsql-general

Tree view

On Sat, Jun 6, 2020 at 6:14 PM Michel Pelletier <pelletier.michel@gmail.com> wrote:

Well lets take a step back here and look at the question, hash partitioning exists in Postgres, is it useful? While I appreciate the need to see a fact demonstrated, and generally avoiding argument by authority, it is true that many of the very smartest database people in the world conceived of, discussed, implemented and documented this feature for us. It stands to reason that it is useful, or it wouldn't exist. So maybe this is more about finding or needing better partitioning documentation.

Fair point.

I've found the original commit adding this feature in version 11: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1aba8e651ac3e37e1d2d875842de1e0ed22a651e

It says:

"Hash partitioning is useful when you want to partition a growing data
set evenly. This can be useful to keep table sizes reasonable, which
makes maintenance operations such as VACUUM faster, or to enable
partition-wise join."

It also includes a link to discussion, though that starts in the middle of a long thread.

The original thread is here: https://www.postgresql.org/message-id/flat/20170228233313.fc14d8b6.nagata%40sraoss.co.jp

However, these threads only argue about implementation details and it's not easy to find a discussion of motivation for this particular partitioning scheme support.

I guess it was quite obvious to the participants at that point already.

With hash partitioning you are not expected, in general, to end up with a small number of partitions being accessed more heavily than the rest. So your indexes will also not fit into memory.

Indexes are not (usually) constant time structures, they take more time the bigger they get. So partitioned indexes will be smaller, quicker to insert into, and quicker to vacuum, and also gain possible pruning advantages on query when you split them up. If the planner can, knowing the key, exclude all but one partition, it won't even look at the other tables, so if you hash partition by primary key, you reduce the search space to 1/N immediately.

Indexes with high update activity also suffer from a problem called "index bloat" where spares "holes" get punched in the buckets of btree indexes from updates and delete deletes. These holes are minimized by vacuuming but the bigger the index gets, the harder that process is to maintain. Smaller indexes suffer less from index bloat, and remedying the situation is easier because you can reindex partitions independently of each other. Your not just reducing the query load to an Nth, you're reducing the maintenance load.

Thanks for taking your time to explain it in detail. Though I do not tend to believe the insert/scan performance benefit is measurable without trying it, I do see the benefits for maintenance.

I have the feeling that using a hash function to distribute rows simply contradicts the basic assumption of when you would think of partitioning your table at all: that is to make sure the most active part of the table and indexes is small enough to be cached in memory.

I think you might be framing this with a specific data pattern in mind, not all data distributions have a "most active" or power law distribution of data.

I'm just referring to the first bullet-point in the docs:

"Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory."

I think it does not apply to hash partitioning in the general case.

Alex

pgsql-general by date:

From: Thomas Kellerer
Date: 08 June 2020, 11:18:41
Subject: Re: checking existence of a table before updating its SERIAL

From: "Daniel Verite"
Date: 08 June 2020, 12:05:02
Subject: Re: Concurrenctly running CREATE TEMP TABLE IF NOT EXISTS [...] AS[...]

Re: When to use PARTITION BY HASH? - Mailing list pgsql-general

Previous

Next