Re: [GENERAL] EAV Designs for Multi-Tenant Applications - Mailing list pgsql-general
From: Alban Hertroys
Subject: Re: [GENERAL] EAV Designs for Multi-Tenant Applications
Date:
Msg-id: 69816B23-8EF9-4C5E-B877-2BC6ADE092A6@gmail.com
In response to: [GENERAL] EAV Designs for Multi-Tenant Applications (Nic Pottier <nicpottier@gmail.com>)
List: pgsql-general
> On 12 Oct 2017, at 16:48, Nic Pottier <nicpottier@gmail.com> wrote:
>
> …
>
> We have thousands of organizations, some organizations have hundreds of custom fields. Some organizations have millions of contacts. We have tens of millions of values across organizations, so relatively sparse with a large number of tiny / unused organizations. All these numbers will continue to grow, though single organizations greater than 10 million contacts is unlikely.
>
> What is the recommended way of modeling this in order to allow performant queries on the custom fields?
>
> Right now we basically use an EAV model (a single contact_fields table) with compound indexes that join the field key and field value so as to force locality in the index. That has worked ok, but occasionally falls on its face when the query planner makes a wrong guess due to the (obviously skewed) statistics it has to work with. Multi-field queries can also get painful, especially on the largest organizations with millions of contacts.
>
> What other approaches should we be looking at? We've brainstormed different approaches but would love some wisdom to help us narrow down what are reasonable things to try testing out. Most of our designs hone in on creating a table per organization to hold field values and dynamically creating indexes on that. The idea being that we won't kill our insert performance as much by only having one index per field to check on inserts, and table statistics should still be ok. (How are statistics managed for JSONB fields?) The main question we have is what is going to happen if we have thousands (or tens of thousands) of tables on a single database? The good news is the vast majority of our organizations are idle at any point in time.

The table-per-customer approach is probably on the right track. It means you don't have to bother the query planner with fields that are specific to only a few organisations when they don't apply to the currently queried organisation.
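For reference, the single-table EAV layout with a locality-forcing compound index described in the quoted post might look roughly like this (table and column names are illustrative assumptions, not taken from the original schema):

```sql
-- A minimal sketch of the described EAV design: one row per
-- (contact, field) pair, all organisations in one table.
CREATE TABLE contact_fields (
    org_id      int    NOT NULL,
    contact_id  bigint NOT NULL,
    field_key   text   NOT NULL,
    field_value text,
    PRIMARY KEY (contact_id, field_key)
);

-- Compound index joining field key and field value, so rows for the
-- same field sort together and value lookups stay local in the index.
CREATE INDEX contact_fields_key_value_idx
    ON contact_fields (org_id, field_key, field_value);
```

The planner difficulty described above follows from this shape: statistics on field_value are pooled across every field of every organisation, so per-field selectivity estimates are easily skewed.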
I would go one step further and put all the fields common across all organisations into one master table and inherit that in each organisation's specific table. If you add a check constraint on the inheriting table's organisation id, I expect that you can even make use of constraint exclusion. That basically turns the master table into a partitioned table, where each partition has its own fields. I'm not 100% certain that constraint exclusion can work when partitions have different layouts, but I don't see why not - as long as the partitioning (check) constraint is on a field that is also in the master table.

> Approaches we've thought about and questions / pros / cons:
>
> 1) Add a JSONB field on our (shared across organizations) contact table, store field values there. Create JSONB indexes per unique field.
> pros: nice having the locality of data on contact, multi field queries are likely way better, query performance should be good
> cons: we have to create thousands of indexes? we have to use uuids as keys to keep our indexes org-specific? insert performance suffers from having thousands of partial indexes (how badly)?
>
> 2) Create a table per organization `contact_fields_[org_id]` containing a column per field. Create columns and indexes per unique field.
> pros: locality is nice again, multi field queries are better, query and insert performance should be good.
> cons: thousands of tables with up to 100 indexes per, is that going to blow up?

Perhaps not so much as you think. If an organisation has 100s of fields, they are probably not using a normalised view of their data. Since you're using separate columns now, each column can have its own data type (another con), and that adds the possibility to add references to/from other tables.

For example, if an organisation created fields address1 to address10, you can instead have an address table that references the organisation's specific contact partition, allowing them to add as many addresses as they need.
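A minimal sketch of the inheritance idea above, assuming an integer organisation id and made-up field names (everything here is hypothetical, not from the original post):

```sql
-- Master table: only the columns common to all organisations.
CREATE TABLE contact (
    org_id     int    NOT NULL,
    contact_id bigint NOT NULL,
    name       text
);

-- Per-organisation child table: inherits the common columns and adds
-- that organisation's custom fields as ordinary typed columns.
-- The CHECK constraint on org_id is what enables constraint exclusion.
CREATE TABLE contact_org_42 (
    favourite_colour text,
    shoe_size        int,
    CHECK (org_id = 42)
) INHERITS (contact);

-- One ordinary index per custom field, local to this organisation.
CREATE INDEX ON contact_org_42 (shoe_size);

-- With the default constraint_exclusion = partition setting, a query
-- against the master table can skip every child whose CHECK constraint
-- contradicts the WHERE clause, so only contact_org_42 is scanned:
SELECT contact_id FROM contact WHERE org_id = 42;
```

Queries on an organisation's custom fields would target its child table directly (or go through the parent when only common columns are involved), which keeps both the indexes and the per-column statistics specific to that organisation.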
If more organisations use addresses (very likely), you can use the same partitioning approach for each organisation's address table and put a foreign key constraint on each to the organisation's contact partition. That puts multiple of your original fields under a single index.

I seem to recall that's pretty much how ToroDB goes about organising unstructured data on top of PG, although they probably use a couple more tricks than just that. In fact, you could already apply this to your current design, although you probably wouldn't gain as much from it.

> What other approaches should we be considering? I know EAV is a pain, especially in multi-tenant situations, but we'd love to hear success (and failure) stories from the community on how they've dealt with these.

I'm not that familiar with JSONB, but wouldn't it be possible to create an index over an organisation-specific JSON 'blob' such that all fields in it are part of the index? I expect that index types aimed at text searches (word searches) would be useful in that respect.

Regards,

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
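[Editorial note: the "one index over the whole JSON blob" idea raised above corresponds to a GIN index on a jsonb column, which indexes every key and value in the document at once. A hedged sketch, with assumed table and field names:]

```sql
-- One JSONB column holding an organisation's custom fields.
ALTER TABLE contact ADD COLUMN fields jsonb;

-- A single GIN index covers all keys and values in the document,
-- so one index serves containment queries on any custom field.
CREATE INDEX contact_fields_gin_idx ON contact USING gin (fields);

-- Can use the index: contacts whose custom field "city" is "Oslo".
SELECT contact_id
  FROM contact
 WHERE org_id = 42
   AND fields @> '{"city": "Oslo"}';
```

This avoids creating one index per field, at the cost of coarser statistics and a larger index than a targeted per-field expression index.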