Re: Re: Clustered index to preserve data locality in a multitenant application? - Mailing list pgsql-general

From Mike Sofen
Subject Re: Re: Clustered index to preserve data locality in a multitenant application?
Date
Msg-id 029c01d202ba$5ba1aae0$12e500a0$@runbox.com
In response to Re: Clustered index to preserve data locality in a multitenant application?  (Thomas Kellerer <spam_eater@gmx.net>)
List pgsql-general

From: Thomas Kellerer Sent: Tuesday, August 30, 2016 4:39 AM
Nicolas Grilly wrote on 30.08.2016 at 13:12:

> We rely on clustered indexes to preserve data locality for each
> tenant. Primary keys start with the tenant ID. This way, rows
> belonging to the same tenant are stored next to each other. Because
> all requests hit only one tenant, this is a great performance
> improvement.

What about partitioning by tenant? With a local index on each partition.

Partitioning is currently a bit limited in Postgres (e.g. you can't have incoming foreign keys), but this would fit your requirements pretty well, as I understand them.

For Nicolas’s situation, that would require 10,000 partitions, which is not very practical, and each partition would be very small.
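
For reference, here is roughly what per-tenant partitioning looks like with the inheritance mechanism 9.6 offers (table and column names are hypothetical), and imagine repeating the child-table DDL 10,000 times:

  -- Parent table; rows are routed to children by CHECK constraints.
  CREATE TABLE events (
      tenant_id  integer NOT NULL,
      event_id   bigint  NOT NULL,
      payload    text,
      PRIMARY KEY (tenant_id, event_id)
  );

  -- One child table per tenant...
  CREATE TABLE events_tenant_42 (
      CHECK (tenant_id = 42)
  ) INHERITS (events);

  -- ...each with its own local index.
  CREATE INDEX ON events_tenant_42 (tenant_id, event_id);

With constraint_exclusion = partition (the default), a query with WHERE tenant_id = 42 scans only that child table.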

The premise of clustered indexes is that the row data lives “in the index”, so no separate row lookups are needed; that’s very fast and clearly works well for many scenarios.
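
The closest Postgres analogue, as I understand it, is an index-only scan: when a composite index leads with the tenant ID and a query touches only indexed columns, the heap never needs to be visited. With the hypothetical events table above, its primary key index on (tenant_id, event_id) already qualifies:

  -- EXPLAIN should report an Index Only Scan once VACUUM has
  -- refreshed the visibility map:
  EXPLAIN SELECT event_id FROM events WHERE tenant_id = 42;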

In Postgres, as you mentioned, clustering is a “one time” operation, but only in the sense that after you add more rows, you’ll need to re-cluster the table.  Depending on the activity model for that table, that may be acceptable.  For example, if you load it via regular batch scripts, the clustering could be done after those loads.  If you add rows only rarely but then do lots of updates, clustering would work great.  If this is an active real-time data table, clustering would not be viable.
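
Here’s a sketch of that batch pattern, reusing the hypothetical events table (events_pkey is the default name Postgres gives its primary key index):

  -- Physically rewrite the table in index order; note that CLUSTER
  -- takes an ACCESS EXCLUSIVE lock for the duration:
  CLUSTER events USING events_pkey;

  -- New rows won't stay ordered; after the next batch load, a bare
  -- CLUSTER reuses the index recorded by the previous run:
  CLUSTER events;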

But you may still be fine without them.  Here’s why: Postgres is a very fast database engine, and I am constantly surprised at how performant it is.  I came from the SQL Server world, where I always leveraged the automatic clustered indexes, but in Postgres I am not using them, and I am running some big data with no table partitioning (yet) while performance is still very good.  I carefully optimize my data models, so that is part of it, and I also carefully optimize my stored functions and indexes.  I am migrating data from 500+ MySQL databases (~8,000 tables, ~20 billion rows) into a single Postgres db.  As my data grows, I may shard it.  The new parallel query support in PG v9.6 may also suit your table model very nicely, depending on how you query into it.
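
One thing to note if you try it: 9.6 ships with parallel query disabled.  A minimal, illustrative setup:

  -- Allow a few parallel workers per query (9.6 defaults to 0):
  SET max_parallel_workers_per_gather = 4;

  -- Large scans can then use a Parallel Seq Scan:
  EXPLAIN SELECT count(*) FROM events WHERE tenant_id = 42;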

So I’d say: just build a prototype PG db, build that one table, load your existing data into it (very easy via the mysql_fdw foreign data wrapper – it’s exactly what I am doing…ping me off list if you need some ideas), put a regular index on it, and run some queries.  If you have a decent-sized dev server to work on, you should see adequate performance.  And what you’ll be incredibly pleased with is the remarkably rich and flexible plpgsql coding environment.  It’s heaven for sql devs.
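
In case it saves you a search, the mysql_fdw setup is roughly this (server name, credentials, and columns are placeholders):

  CREATE EXTENSION mysql_fdw;

  CREATE SERVER mysql_src FOREIGN DATA WRAPPER mysql_fdw
      OPTIONS (host '10.0.0.5', port '3306');

  CREATE USER MAPPING FOR CURRENT_USER SERVER mysql_src
      OPTIONS (username 'etl_user', password 'secret');

  CREATE FOREIGN TABLE mysql_events (
      tenant_id integer,
      event_id  bigint,
      payload   text
  ) SERVER mysql_src OPTIONS (dbname 'legacy', table_name 'events');

  -- A plain INSERT ... SELECT then pulls the rows across:
  INSERT INTO events SELECT * FROM mysql_events;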

BTW, I am putting together a rather detailed examination of Postgres ETL/bulk loading performance on two different platforms: 4 cores/16 GB RAM/3 TB SSD on AWS (Amazon cloud), and 48 cores/256 GB RAM/10 TB SSD on a monster loaner Cisco UCS server.  I should have that posted to the pgsql-performance list later this week.

Mike Sofen (USA)
