Clustered index to preserve data locality in a multitenant application?

From
Nicolas Grilly
Date:
Hello,

We are developing a multitenant application which is currently based on MySQL, but we're thinking of migrating to PostgreSQL.

We rely on clustered indexes to preserve data locality for each tenant. Primary keys start with the tenant ID. This way, rows belonging to the same tenant are stored next to each other. Because all requests hit only one tenant, this is a great performance improvement.

PostgreSQL doesn't have clustered indexes — I'm aware of the CLUSTER command but it's a one-time operation — and I'm wondering if this can be a problem or not.

Let's say we have a table containing data for 10,000 tenants and 10,000 rows per tenant, for a total of 100,000,000 rows. Let's say each 8 KB block contains ~10 rows. Let's say we want to compute the sum of an integer column for all rows belonging to a given tenant ID.
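
For concreteness, here is the kind of schema and query I have in mind (all names are just for illustration):

    CREATE TABLE tenant_data (
        tenant_id integer NOT NULL,   -- every query filters on this
        row_id    integer NOT NULL,
        value     integer NOT NULL,
        PRIMARY KEY (tenant_id, row_id)
    );

    -- The aggregate in question, restricted to a single tenant:
    SELECT sum(value) FROM tenant_data WHERE tenant_id = 1234;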

In MySQL/InnoDB, rows are stored in the leaf nodes of a B-tree. To compute the sum, MySQL has to read at least 1,000 blocks (each block containing ~10 rows). I deliberately neglect the cost of walking the B-tree intermediate nodes.

By comparison, PostgreSQL has to read at least 10,000 blocks (each block containing ~10 rows, but most of the time, only one row will match the tenant ID, other rows belonging to other tenants).

A few questions:

- Am I missing something?
- Am I overestimating the benefit of a clustered index in our case, and the cost of not having one in PostgreSQL?
- Is there another technical solution to this problem?

Thanks,
Nicolas Grilly
Managing Partner
+33 6 03 00 25 34
www.vocationcity.com  •  Recruitment & application tracking software
www.gardentechno.com  •  Consulting & software engineering

Re: Clustered index to preserve data locality in a multitenant application?

From
Thomas Kellerer
Date:
Nicolas Grilly schrieb am 30.08.2016 um 13:12:
> We rely on clustered indexes to preserve data locality for each
> tenant. Primary keys start with the tenant ID. This way, rows
> belonging to the same tenant are stored next to each other. Because
> all requests hit only one tenant, this is a great performance
> improvement.
>

What about partitioning by tenant? With a local index on each partition.

Partitioning is currently a bit limited in Postgres (e.g. you can't have incoming foreign keys) but this would fit your requirements pretty much as I understand them.
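
A minimal sketch with the inheritance-based partitioning we have today (all names made up):

    -- Parent table; holds no rows itself
    CREATE TABLE tenant_data (
        tenant_id integer NOT NULL,
        row_id    integer NOT NULL,
        value     integer NOT NULL
    );

    -- One child per tenant (or per group of tenants), with a CHECK
    -- constraint so the planner can skip it for other tenants
    CREATE TABLE tenant_data_p42 (
        CHECK (tenant_id = 42)
    ) INHERITS (tenant_data);

    CREATE INDEX ON tenant_data_p42 (tenant_id, row_id);

With constraint_exclusion = partition (the default), a query like SELECT sum(value) FROM tenant_data WHERE tenant_id = 42 only scans tenant_data_p42. You do have to route inserts to the right child yourself (trigger or application logic).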


Re: Clustered index to preserve data locality in a multitenant application?

From
Mike Sofen
Date:
From: Thomas Kellerer Sent: Tuesday, August 30, 2016 4:39 AM
Nicolas Grilly schrieb am 30.08.2016 um 13:12:

> We rely on clustered indexes to preserve data locality for each
> tenant. Primary keys start with the tenant ID. This way, rows
> belonging to the same tenant are stored next to each other. Because
> all requests hit only one tenant, this is a great performance
> improvement.

What about partitioning by tenant? With a local index on each partition.

Partitioning is currently a bit limited in Postgres (e.g. you can't have incoming foreign keys) but this would fit your requirements pretty much as I understand them.

For Nicolas’s situation, that would require 10,000 partitions – not very useful, and each partition would be very small.

The premise of clustered indexes is that the row data is “in the index”, so no row lookups are needed – that’s very effective/performant and clearly works well for many scenarios. 

In Postgres, as you mentioned, clustering is a “one time” operation but only in the sense that after you add more rows, you’ll need to re-cluster the table.  Depending on the activity model for that table, that may be feasible/ok.  For example, if you load it via regular batch scripts, then the clustering could be done after those loads.  If you add rows only rarely but then do lots of updates, then the clustering would work great.  If this is an active real time data table, then clustering would not be viable.
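
For example, in the batch-load scenario it would just be something like this after each load (assuming the primary key index is named tenant_data_pkey):

    -- Rewrites the table in index order; takes an ACCESS EXCLUSIVE
    -- lock, so schedule it right after the batch load
    CLUSTER tenant_data USING tenant_data_pkey;
    ANALYZE tenant_data;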

But you may still be fine without them. Here's why: Postgres is a very fast database engine. I am constantly surprised at how performant it is. I came from the SQL Server world where I always leveraged the automatic clustered indexes but in Postgres I am not using them and am running some big data with no table partitioning (yet) and performance is still very good. I carefully optimize my data models, so that is part of it. I also carefully optimize my stored functions/indexes. I am migrating data from 500+ mysql databases/~8,000 tables/~20 billion rows into a single Postgres db. As my data grows, I may shard it. The new parallel query support in PG v9.6 may also support your table model very nicely, depending on how you query into it.
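
As a quick sketch (settings and names illustrative, reusing the hypothetical tenant_data table from upthread), in 9.6 a big per-tenant aggregate can go parallel with one setting:

    -- Allow up to 4 parallel workers for this session
    SET max_parallel_workers_per_gather = 4;

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT sum(value) FROM tenant_data WHERE tenant_id = 1234;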

So I’d say, just build a prototype PG db, build that one table, load your existing data into it (very easy via the mysql_fdw foreign data wrapper – it’s exactly what I am doing…ping me off list if you need some ideas), put a regular index on it and run some queries. If you have a decent size dev server to work on, you should see adequate performance. And, what you’ll be incredibly pleased with is the remarkably rich and flexible plpgsql coding environment. It’s heaven for sql devs.
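
The basic mysql_fdw recipe looks roughly like this (host, credentials, and names are all placeholders):

    CREATE EXTENSION mysql_fdw;

    CREATE SERVER mysql_src
        FOREIGN DATA WRAPPER mysql_fdw
        OPTIONS (host '10.0.0.5', port '3306');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER mysql_src
        OPTIONS (username 'migration', password 'secret');

    -- Mirror the MySQL table, then pull everything across
    CREATE FOREIGN TABLE tenant_data_mysql (
        tenant_id integer,
        row_id    integer,
        value     integer
    ) SERVER mysql_src OPTIONS (dbname 'app', table_name 'tenant_data');

    INSERT INTO tenant_data SELECT * FROM tenant_data_mysql;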

BTW, I am putting together a rather detailed examination of Postgres ETL/bulk loading performance, on 2 different platforms: 4 cores/16gb ram/3tb SSD on AWS (Amazon cloud), and 48 cores/256gb ram/ 10tb SSD on a monster loaner Cisco UCS server.  Should have that posted to the Perform list later this week.

Mike Sofen (USA)

Re: Clustered index to preserve data locality in a multitenant application?

From
Eduardo Morras
Date:
On Tue, 30 Aug 2016 13:12:33 +0200
Nicolas Grilly <nicolas@gardentechno.com> wrote:

> Hello,
>
> We are developing a multitenant application which is currently based
> on MySQL, but we're thinking of migrating to PostgreSQL.
>
> We rely on clustered indexes to preserve data locality for each
> tenant. Primary keys start with the tenant ID. This way, rows
> belonging to the same tenant are stored next to each other. Because
> all requests hit only one tenant, this is a great performance
> improvement.
>
> PostgreSQL doesn't have clustered indexes — I'm aware of the CLUSTER
> command but it's a one-time operation — and I'm wondering if this can
> be a problem or not.
>
> Let's say we have a table containing data for 10,000 tenants and
> 10,000 rows per tenant, for a total of 100,000,000 rows. Let's say
> each 8 KB block contains ~10 rows. Let's say we want to compute the
> sum of an integer column for all rows belonging to a given tenant ID.
>
> In MySQL/InnoDB, rows are stored in the leaf nodes of a B-tree. To
> compute the sum, MySQL has to read at least 1,000 blocks (each block
> containing ~10 rows). I deliberately neglect the cost of walking the
> B-tree intermediate nodes.
>
> By comparison, PostgreSQL has to read at least 10,000 blocks (each
> block containing ~10 rows, but most of the time, only one row will
> match the tenant ID, other rows belonging to other tenants).
>
> A few questions:
>
> - Am I missing something?
> - Am I overestimating the benefit of a clustered index in our case,
> and the cost of not having one in PostgreSQL?
> - Is there another technical solution to this problem?

Check BRIN indexes; they are "designed for handling very large tables in
which certain columns have some natural correlation with their physical
location within the table". I think they fit your needs.

(see https://www.postgresql.org/docs/current/static/brin-intro.html)
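
For example (illustrative names; note a BRIN index only helps while the rows stay physically ordered by tenant, e.g. after a CLUSTER or an ordered bulk load):

    -- Tiny index: one summary entry per 128 heap pages (the default)
    CREATE INDEX tenant_data_tenant_brin
        ON tenant_data USING brin (tenant_id);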


> Thanks,
> Nicolas Grilly
> Managing Partner
> +33 6 03 00 25 34
> www.vocationcity.com  •  Recruitment & application tracking software
> www.gardentechno.com  •  Consulting & software engineering


---   ---
Eduardo Morras <emorrasg@yahoo.es>