Re: [GENERAL] EAV Designs for Multi-Tenant Applications - Mailing list pgsql-general

From: Alban Hertroys
Subject: Re: [GENERAL] EAV Designs for Multi-Tenant Applications
Msg-id: 69816B23-8EF9-4C5E-B877-2BC6ADE092A6@gmail.com
In response to: [GENERAL] EAV Designs for Multi-Tenant Applications (Nic Pottier <nicpottier@gmail.com>)
List: pgsql-general
> On 12 Oct 2017, at 16:48, Nic Pottier <nicpottier@gmail.com> wrote:

…

> We have thousands of organizations, some organizations have hundreds of custom fields. Some organizations have millions of contacts. We have tens of millions of values across organizations, so relatively sparse with a large number of tiny / unused organizations. All these numbers will continue to grow, though a single organization with more than 10 million contacts is unlikely.
>
> What is the recommended way of modeling this in order to allow performant queries on the custom fields?
>
> Right now we basically use an EAV model (a single contact_fields table) with compound indexes that join the field key and field value so as to force locality in the index. That has worked ok, but occasionally falls on its face when the query planner makes a wrong guess due to the (obviously skewed) statistics it has to work with. Multi-field queries can also get painful, especially on the largest organizations with millions of contacts.
>
> What other approaches should we be looking at? We've brainstormed different approaches but would love some wisdom to help us narrow down what are reasonable things to try testing out. Most of our designs hone in on creating a table per organization to hold field values and dynamically creating indexes on that. The idea being that we won't kill our insert performance as much by only having one index per field to check on inserts, and table statistics should still be ok. (how are statistics managed for JSONB fields?) The main question we have is what is going to happen if we have thousands (or tens of thousands) of tables on a single database? The good news is the vast majority of our organizations are idle at any point in time.

The table per customer approach is probably on the right track. It means you don't have to bother the query planner with fields that are specific to only a few organisations when they don't apply to the currently queried organisation.

I would go one step further and put all the fields common across all organisations into one master table and inherit that in each organisation's specific table. If you add a check constraint on the inheriting table's organisation id, I expect that you can even make use of constraint exclusion. That basically turns the master table into a partitioned table, where each partition has its own fields. I'm not 100% certain that constraint exclusion can work when partitions have different layouts, but I don't see why not - as long as the partitioning (check) constraint is on a field that is also in the master table.
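
A minimal sketch of what I mean (untested, table and field names made up):

    -- Fields common to all organisations go into one master table.
    CREATE TABLE contact (
        id     bigserial,
        org_id integer NOT NULL,
        name   text
    );

    -- Each organisation gets its own child table with its specific fields.
    -- The CHECK constraint on org_id is what constraint exclusion needs.
    CREATE TABLE contact_org42 (
        favourite_colour text,
        shoe_size        integer,
        CHECK (org_id = 42)
    ) INHERITS (contact);

    -- Children don't inherit indexes or primary keys, so add them per child.
    ALTER TABLE contact_org42 ADD PRIMARY KEY (id);

    -- With the default constraint_exclusion = partition, a query through the
    -- master table should skip every child whose CHECK contradicts the WHERE:
    SELECT * FROM contact WHERE org_id = 42;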

> Approaches we've thought about and questions / pros / cons:
>
> 1) Add a JSONB field on our (shared across organizations) contact table, store field values there. Create JSONB indexes per unique field.
>    pros: nice having the locality of data on contact, multi field queries are likely way better, query performance should be good
>    cons: we have to create thousands of indexes? we have to use uuids as keys to keep our indexes org-specific? insert performance suffers from having thousands of partial indexes (how badly)?
>
> 2) Create a table per organization `contact_fields_[org_id]` containing a column per field. Create columns and indexes per unique field.
>    pros: locality is nice again, multi field queries are better, query and insert performance should be good.
>    cons: thousands of tables with up to 100 indexes per, is that going to blow up?

Perhaps not so much as you think. If an organisation has 100s of fields, they are probably not using a normalised view of their data. Since you're using separate columns now, each column can have its own data type (another pro), and that adds the possibility to add references to/from other tables.

For example, if an organisation created fields address1 to address10, you can instead have an address table that references the organisation's specific contact partition, allowing them to add as many addresses as they need. If more organisations use addresses (very likely), you can use the same partitioning approach for each organisation's address table and put a foreign key constraint on each to the organisation's contact partition.
That puts multiple of your original fields under a single index.
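
For instance (again just a sketch, building on the contact_org42 table above; names made up):

    -- Instead of columns address1 .. address10, a per-organisation address
    -- table; the FK ties addresses to that organisation's own contacts.
    CREATE TABLE address_org42 (
        id         bigserial PRIMARY KEY,
        contact_id bigint NOT NULL REFERENCES contact_org42 (id),
        address    text NOT NULL
    );

    -- One index now covers what used to be ten separate address fields.
    CREATE INDEX ON address_org42 (contact_id);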

I seem to recall that's pretty much how ToroDB goes about organising unstructured data on top of PG, although they probably use a couple more tricks than just that.

In fact, you could already apply this to your current design, although you probably wouldn't gain as much from it.

> What other approaches should we be considering? I know EAV is a pain, especially in multi-tenant situations, but we'd love to hear success (and failure) stories from the community on how they've dealt with these.

I'm not that familiar with JSONB, but wouldn't it be possible to create an index over an organisation-specific JSON 'blob' such that all fields in it are part of the index? I expect that index types aimed at text searches (word searches) would be useful in that respect.
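
Something along these lines, perhaps (a sketch; the fields column and its contents are made up):

    -- A single GIN index covers all keys and values in the JSONB blob,
    -- so you don't need one index per field.
    CREATE TABLE contact_fields (
        contact_id bigint PRIMARY KEY,
        org_id     integer NOT NULL,
        fields     jsonb
    );

    CREATE INDEX ON contact_fields USING gin (fields);

    -- Containment queries can then use that one index, even for multiple
    -- fields at once (values must match as stored, text vs number):
    SELECT contact_id
      FROM contact_fields
     WHERE org_id = 42
       AND fields @> '{"favourite_colour": "green", "shoe_size": 42}';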

Regards,

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
