Thread: [GENERAL] EAV Designs for Multi-Tenant Applications

[GENERAL] EAV Designs for Multi-Tenant Applications

From
Nic Pottier
Date:
Howdy all,

We maintain a hosted multi-tenant system for a large number of users. Each user has what we call an "organization" or "org" which is their sandbox. In that organization they can manage contacts and define custom fields on those contacts, which then have values per contact.

We have thousands of organizations, some organizations have hundreds of custom fields. Some organizations have millions of contacts. We have tens of millions of values across organizations, so relatively sparse with a large number of tiny / unused organizations. All these numbers will continue to grow, though single organizations greater than 10 million contacts is unlikely.

What is the recommended way of modeling this in order to allow performant queries on the custom fields?

Right now we basically use an EAV model (a single contact_fields table) with compound indexes that join the field key and field value so as to force locality in the index. That has worked ok, but occasionally falls on its face when the query planner makes a wrong guess due to the (obviously skewed) statistics it has to work with. Multi-field queries can also get painful, especially on the largest organizations with millions of contacts.

What other approaches should we be looking at? We've brainstormed different approaches but would love some wisdom to help us narrow down what are reasonable things to try testing out. Most of our designs hone in on creating a table per organizations to hold field values and dynamically creating indexes on that. The idea being that we won't kill our insert performance as much by only having one index per field to check on inserts and table statistics should still be ok. (how are statistics managed for JSONB fields?) The main question we have is what is going to happen if we have thousands (or tens of thousands) of tables on a single database? The good news is the vast majority of our organizations are idle at any point in time.

Approaches we've thought about and questions / pros / cons:

1) Add a JSONB field on our (shared across organizations) contact table, store field values there. Create JSONB indexes per unique field. 
   pros: nice having the locality of data on contact, multi field queries are likely way better, query performance should be good
   cons: we have to create thousands of indexes? we have to use uuids as keys to keep our indexes org-specific? insert performance suffers from having thousands of partial indexes (how badly)?

2) Create a table per organization `contact_fields_[org_id]` containing a column per field. Create columns and indexes per unique field.
   pros: locality is nice again, multi field queries are better, query and insert performance should be good.
   cons: thousands of tables with up to 100 indexes per, is that going to blow up?

3) Create a table per organization `contacts_fields_[org_id]` which contains a `fields` JSONB column, Create JSONB indexes per unique field.
  pros: locality is good, multi field queries good, query performance should be good. Adding and removing fields is a bit simpler than 2) case above and naming can be a bit clearer than 1) as we don't have to worry about multi-org key name collisions
  cons: same as 2) but with a JSONB flavor, hundreds of JSONB indexes on thousands of tables, thousands of tables

4) Create a database per organization? Use 1) above
  pros: all data is localized, might see performance improvements elsewhere, query and insert performance should be good
  cons: mother of a refactor :)  what happens with thousands of databases on a single box? is this actually better from the perspective of getting to offload currently inactive orgs?

What other approaches should we be considering? I know EAV is a pain, especially in multi-tenant situations, but we'd love to hear success (and failure) stories from the community on how they've dealt with these.

Cheers,

-Nic





Re: [GENERAL] EAV Designs for Multi-Tenant Applications

From
Alban Hertroys
Date:
> On 12 Oct 2017, at 16:48, Nic Pottier <nicpottier@gmail.com> wrote:

…

> We have thousands of organizations, some organizations have hundreds of custom fields. Some organizations have
millionsof contacts. We have tens of millions of values across organizations, so relatively sparse with a large number
oftiny / unused organizations. All these numbers will continue to grow, though single organizations greater than 10
millioncontacts is unlikely. 
>
> What is the recommended way of modeling this in order to allow performant queries on the custom fields?
>
> Right now we basically use an EAV model (a single contact_fields table) with compound indexes that join the field key
andfield value so as to force locality in the index. That has worked ok, but occasionally falls on its face when the
queryplanner makes a wrong guess due to the (obviously skewed) statistics it has to work with. Multi-field queries can
alsoget painful, especially on the largest organizations with millions of contacts. 
>
> What other approaches should we be looking at? We've brainstormed different approaches but would love some wisdom to
helpus narrow down what are reasonable things to try testing out. Most of our designs hone in on creating a table per
organizationsto hold field values and dynamically creating indexes on that. The idea being that we won't kill our
insertperformance as much by only having one index per field to check on inserts and table statistics should still be
ok.(how are statistics managed for JSONB fields?) The main question we have is what is going to happen if we have
thousands(or tens of thousands) of tables on a single database? The good news is the vast majority of our organizations
areidle at any point in time. 

The table per customer approach is probably on the right track. It means you don't have to bother the query planner
withfields that are specific to only a few organisations when they don't apply to the currently queried organisation. 

I would go one step further and put all the fields common across all organisations into one master table and inherit
thatin each organisations specific table. If you add a check constraint on the inheriting table's organisation id, I
expectthat you can even make use of constraint exclusion. That basically turns the master table into a partitioned
table,where each partition has it's own fields. I'm not 100% certain that constraint exclusion can work when partitions
havedifferent layouts, but I don't see why not - as long as the partitioning (check) constraint is on a field that is
alsoin the master table. 

> Approaches we've thought about and questions / pros / cons:
>
> 1) Add a JSONB field on our (shared across organizations) contact table, store field values there. Create JSONB
indexesper unique field.  
>    pros: nice having the locality of data on contact, multi field queries are likely way better, query performance
shouldbe good 
>    cons: we have to create thousands of indexes? we have to use uuids as keys to keep our indexes org-specific?
insertperformance suffers from having thousands of partial indexes (how badly)? 
>
> 2) Create a table per organization `contact_fields_[org_id]` containing a column per field. Create columns and
indexesper unique field. 
>    pros: locality is nice again, multi field queries are better, query and insert performance should be good.
>    cons: thousands of tables with up to 100 indexes per, is that going to blow up?

Perhaps not so much as you think. If an organisation has 100s of fields, they are probably not using a normalised view
oftheir data. Since you're using separate columns now, each column can have its own data type (another con), and that
addsthe possibility to add references to/from other tables. 

For example, if an organisation created fields address1 to address10, you can instead have an address table that
referencesthe organisation's specific contact partition, allowing them to add as many addresses as they need. If more
organisationsuse addresses (very likely), you can use the same partitioning approach for each organisation's address
tableand put a foreign key constraint on each to the organisation's contact partition. 
That puts multiple of your original fields under a single index.

I seem to recall that's pretty much how ToroDB goes about organising unstructured data on top of PG, although they
probablyuse a couple more tricks than just that. 

In fact, you could already apply this to your current design, although you probably wouldn't gain as much from it.

> What other approaches should we be considering? I know EAV is a pain, especially in multi-tenant situations, but we'd
loveto hear success (and failure) stories from the community on how they've dealt with these. 

I'm not that familiar with JSONB, but wouldn't it be possible to create an index over an organisation-specific JSON
'blob'such that all fields in it are part of the index? I expect that index types aimed at text searches (word
searches)would be useful in that respect. 

Regards,

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general