Home > mailing lists

Re: When to use PARTITION BY HASH? - Mailing list pgsql-general

From	Michel Pelletier
Subject	Re: When to use PARTITION BY HASH?
Date	June 6, 2020 16:13:25
Msg-id	CACxu=v+-xS8bpo_4H38Ck=b3rjeyM20JzTWLRco1G_0XFDmDrQ@mail.gmail.com Whole thread Raw
In response to	Re: When to use PARTITION BY HASH? (Oleksandr Shulgin <oleksandr.shulgin@zalando.de>)
Responses	Re: When to use PARTITION BY HASH?
List	pgsql-general

Tree view

On Wed, Jun 3, 2020 at 4:55 AM Oleksandr Shulgin <oleksandr.shulgin@zalando.de> wrote:

Do you also assign the partitions to different tablespaces as you've hinted below or do you see performance improvement from partitioning alone? How does that work? Does it give better results than using a RAID to spread the disk IO, for example?

In general you could find write throughput improvements from all three, partitioning, tablespacing, and disk striping. It depends on your problem. Hash partitioning is a common feature in other databases as well. The hash strategy works for many distributed access patterns.

Yes its' true you end up with more smaller tables than one big large one, but remember the indexes are (often) tree data structures. Smaller trees are faster than bigger trees. By making the indexes smaller they are faster. Since the planner can knows to only examine the specific index it needs, this ends up being a lot faster.

That sounds logical, but can it be demonstrated? If the index(es) fit in memory fully, it doesn't make a measurable difference, I guess?

Well lets take a step back here and look at the question, hash partitioning exists in Postgres, is it useful? While I appreciate the need to see a fact demonstrated, and generally avoiding argument by authority, it is true that many of the very smartest database people in the world conceived of, discussed, implemented and documented this feature for us. It stands to reason that it is useful, or it wouldn't exist. So maybe this is more about finding or needing better partitioning documentation.

With hash partitioning you are not expected, in general, to end up with a small number of partitions being accessed more heavily than the rest. So your indexes will also not fit into memory.

Indexes are not (usually) constant time structures, they take more time the bigger they get. So partitioned indexes will be smaller, quicker to insert into, and quicker to vacuum, and also gain possible pruning advantages on query when you split them up. If the planner can, knowing the key, exclude all but one partition, it won't even look at the other tables, so if you hash partition by primary key, you reduce the search space to 1/N immediately.

Indexes with high update activity also suffer from a problem called "index bloat" where spares "holes" get punched in the buckets of btree indexes from updates and delete deletes. These holes are minimized by vacuuming but the bigger the index gets, the harder that process is to maintain. Smaller indexes suffer less from index bloat, and remedying the situation is easier because you can reindex partitions independently of each other. Your not just reducing the query load to an Nth, you're reducing the maintenance load.

I have the feeling that using a hash function to distribute rows simply contradicts the basic assumption of when you would think of partitioning your table at all: that is to make sure the most active part of the table and indexes is small enough to be cached in memory.

I think you might be framing this with a specific data pattern in mind, not all data distributions have a "most active" or power law distribution of data. For example i work with a lot of commercial airline position data that services both real-time queries and ad-hoc analytical queries over arbitrary airframe identifiers. There is no advantage trying to have a "most active" data strategy because all airframes in the air at any given time are by definition most active. A medium sized drone may send out as many pings as a jumbo jet in a given interval of time.

-Michel

Regards,
--
Alex

pgsql-general by date:

From: Laura Smith
Date: 06 June 2020, 08:58:42
Subject: Re: Postgres12 - Confusion with pg_restore

From: Michel Pelletier
Date: 06 June 2020, 17:21:08
Subject: Re: Multitenent architecture

Re: When to use PARTITION BY HASH? - Mailing list pgsql-general

Previous

Next