Thread: Storing large documents - one table or partition by doc?

Storing large documents - one table or partition by doc?

From
Dev Nop
Date:

I'm storing thousands of independent documents, each containing around 20k rows. The larger the document, the more likely it is to be active with inserts and updates (1000s/day). The most common read query is to get all the rows for a single document (100s/day). It will be supporting real-time collaboration, but with strong consistency for a simple schema, so it is not well suited to dedicated "document databases" that assume schema-less data and eventual consistency. I won't have great hardware/budget, so I need to squeeze the most out of the least.

My question is whether to put all documents into a single huge table or partition by document?

The documents are independent, so it's purely a performance question. It's too many tables for PostgreSQL's partitioning support, but I don't get any benefit from a master table and constraints anyway. Handling partitioning in application logic is effectively zero cost.
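To be concrete, by "handling partitioning in application logic" I just mean the application picking a per-document table name itself, roughly like this (the table names are made up for illustration):

-- what I mean by app-level partitioning (illustrative names only):
SELECT * FROM doc_12345_rows;                      -- one table per document

-- versus the single-table alternative:
SELECT * FROM all_rows WHERE document_id = 12345;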

I know that 1000s of tables is regarded as an anti-pattern, but I can only see performance and maintenance benefits in one table per independent document, e.g. fast per-table vacuum, incremental schema updates, easy future sharding. A monster table will require additional key columns and indexes that don't have any value beyond allowing the documents to sit in the same table.

The only downsides seem to be the system-level per-table overhead, but I only see that as a problem if I have a very long tail of tiny documents. I'd rather solve that problem if it occurs than manage an all-eggs-in-one-basket monster table.

Is there anything significant I am missing in my reasoning? Is it mostly a "relational purist" perspective that argues against multiple tables? Should I be looking at alternative tech for this problem?


The one factor I haven't fully resolved is how much a caching layer in front of the database changes things.

Thanks for your help.

Re: Storing large documents - one table or partition by doc?

From
"Mike Sofen"
Date:

From: Dev Nop  Sent: Friday, September 23, 2016 3:12 AM
> I'm storing thousands of independent documents, each containing around 20k rows. The larger the document, the more likely it is to be active with inserts and updates (1000s/day). The most common read query is to get all the rows for a single document (100s/day). It will be supporting real-time collaboration, but with strong consistency for a simple schema, so it is not well suited to dedicated "document databases" that assume schema-less data and eventual consistency. I won't have great hardware/budget, so I need to squeeze the most out of the least.
>
> My question is whether to put all documents into a single huge table or partition by document?
>
> The documents are independent, so it's purely a performance question. It's too many tables for PostgreSQL's partitioning support, but I don't get any benefit from a master table and constraints anyway. Handling partitioning in application logic is effectively zero cost.
>
> I know that 1000s of tables is regarded as an anti-pattern, but I can only see performance and maintenance benefits in one table per independent document, e.g. fast per-table vacuum, incremental schema updates, easy future sharding. A monster table will require additional key columns and indexes that don't have any value beyond allowing the documents to sit in the same table.
>
> The only downsides seem to be the system-level per-table overhead, but I only see that as a problem if I have a very long tail of tiny documents. I'd rather solve that problem if it occurs than manage an all-eggs-in-one-basket monster table.
>
> Is there anything significant I am missing in my reasoning? Is it mostly a "relational purist" perspective that argues against multiple tables? Should I be looking at alternative tech for this problem?
>
> The one factor I haven't fully resolved is how much a caching layer in front of the database changes things.
>
> Thanks for your help.

---------------------------------

This is, to me, a very standard, almost classic, relational pattern, and one that a relational engine handles extremely well, especially the consistency and locking needed to support lots of updates. Inserts are irrelevant unless the parent record must be locked to do so… that would be a bad design.

Imagine a normal parent-child table pair, 1:M, with the 20k rows per parent document in the child table. Unless there's something very bizarre about the access patterns against that child table, those 20k rows per document would not normally all be in play for every user on every access throughout that access (it's too much data to show on a web page, for instance). Even so, at "100s" of large queries per day, it's a trivial load unless each child row contains a large json blob… which doesn't jibe with your table description.

So with proper indexing, I can't see where there will be a performance issue. Worst case, you create a few partitions based on some category, but the row counts you're describing don't yet warrant it. I'm running a few hundred million rows in a new "child" table on a dev server (4 cores/16 GB RAM) with large json documents in each row, and it is still web-page performant on normal queries, using a paging model (say, 20 full rows per web page request). The critical pieces, hardware-wise, are memory (buy as much as you can afford) and SSDs (required, IMO). It's much harder to create measurable load on the CPUs. Amazon has memory-optimized EC2 instances that support that pattern (with SSD storage).
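For illustration only (all of the names below are hypothetical, not your actual schema), the parent-child shape and paging query I have in mind is roughly:

-- hypothetical parent-child pair; the composite PK also indexes per-document reads
CREATE TABLE parent_doc (
    doc_id serial PRIMARY KEY,
    title  text NOT NULL
);

CREATE TABLE child_row (
    doc_id  int  NOT NULL REFERENCES parent_doc(doc_id),
    row_num int  NOT NULL,
    payload text,
    PRIMARY KEY (doc_id, row_num)
);

-- keyset-paged read of one document, 20 rows per web page request
SELECT row_num, payload
FROM child_row
WHERE doc_id = 42
  AND row_num > 0        -- the last row_num seen on the previous page
ORDER BY row_num
LIMIT 20;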

 

Are there other issues/requirements that are creating other performance concerns that aren’t obvious in your initial post?

 

Mike Sofen     (Synthetic Genomics)

Re: Storing large documents - one table or partition by doc?

From
Jim Nasby
Date:
On 9/23/16 7:14 AM, Mike Sofen wrote:
> So with proper indexing, I can’t see where there will be a performance
> issue.

Table bloat could become problematic. If there is a pattern where you
can predict which documents are likely to be active (say, documents that
have been modified in the last 10 days), then you can keep all of those
in a set of tables that is fairly small, and keep the remaining
documents in a set of "archive" tables. That will help reduce bloat in
the large archive tables. Before putting in that extra work though, I'd
just try the simple solution and see how well it works.
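A rough sketch of that split, with hypothetical names (the doc_rows/documents tables, the last_modified_at column and the 10-day cutoff are all made up for illustration):

-- hypothetical active/archive pair cloned from an existing doc_rows table
CREATE TABLE doc_rows_active  (LIKE doc_rows INCLUDING ALL);
CREATE TABLE doc_rows_archive (LIKE doc_rows INCLUDING ALL);

-- periodically move rows for documents untouched in 10 days into the archive
WITH stale AS (
    SELECT doc_id FROM documents
    WHERE last_modified_at < now() - interval '10 days'
), moved AS (
    DELETE FROM doc_rows_active r
    USING stale s
    WHERE r.doc_id = s.doc_id
    RETURNING r.*
)
INSERT INTO doc_rows_archive SELECT * FROM moved;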
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


Re: Storing large documents - one table or partition by doc?

From
Dev Nop
Date:
Thank you Mike & Jim for your responses.

> Are there other issues/requirements that are creating other performance concerns
> that aren’t obvious in your initial post

Yes, there are a few things:

1. Full document queries really are necessary

> it’s too much data to show on a web page, for instance

The documents are like spreadsheets so the whole document must be in memory to view a part of it or make a change. Client apps don't have to render the whole document but the data must be there.

This means that the applications are sensitive to the size of ids. A previous incarnation used GUIDs which was a brutal overhead for large documents.

2. The documents are graph structures

The data structure is technically a property graph. The default text format is hierarchical with cross links.

There are currently no requirements to query the graph structure of the document, so PostgreSQL can be ignorant of the internal relationships. It is possible for the API to return an unordered flat list of rows and require the client to build the graph.

However, the most desirable data returned by the API would be an ordered hierarchical format. I'm not sure whether PostgreSQL has a role in building the tree or whether to do it in code externally. I imagine using CTEs to process a column containing an array of child ids is brutal, but maybe it's more efficient than loading all rows into memory and processing them in code. I guess doing it in code means the CPU load is on the API server, which can be scaled horizontally more easily than a database server.
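To make that concrete, the kind of recursive CTE I imagine (untested, using the pseudo schema from point 3 below, with a made-up document id) would be roughly:

-- walk the document tree from the root by following the children arrays (untested sketch)
WITH RECURSIVE tree AS (
    SELECT e.element_id, e.data, e.children, 1 AS depth
    FROM element e
    JOIN document d ON d.id = e.document_id
                   AND d.root_element_id = e.element_id
    WHERE e.document_id = 42
  UNION ALL
    SELECT c.element_id, c.data, c.children, t.depth + 1
    FROM tree t
    JOIN element c ON c.document_id = 42
                  AND c.element_id = ANY (t.children)
)
SELECT * FROM tree;

This loses the ordering of siblings within each children array, so some assembly would still happen client-side either way.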

There is no interest in integrity constraints on element relationships. If there were an application bug creating an invalid document structure, it would be better to store it than to reject it. The application can fix a broken document, but it cannot recover lost data.

3. Element ids are only unique within a document

The element ids the application uses are not globally unique across documents. Modelling what we currently have in a single table would mean the primary key is a composite of (document_id, element_id), which I don't believe is considered good practice?

Using current ids, a pseudo schema might look like:

CREATE TABLE document (
    id serial primary key,
    revision int,
    name text,
    root_element_id int);

CREATE TABLE element (
    document_id int REFERENCES document(id),
    element_id int,
    data text,
    children int[],
    primary key (document_id, element_id));

The two main queries are intentionally trivial but any advice on indexes would be helpful e.g.

a) Fetch a document without creating the tree:

  select * from element where document_id = DOCID

b) Update an element:

  update element
    set data = 'new data'
    where document_id = DOC_ID and element_id = EL_ID

  + update history & increment revision

4. Storing history of changes

The application needs to view change history, wind back changes, restore old revisions and provide a change stream to clients. It's not an initial performance concern because it's not part of the main workflow, but the history tables will be much larger than the documents themselves, although they are append-only and rarely queried.

Updating them would be part of the update transaction so maybe they could become a bottleneck? A single edit from a client is actually a batch of smaller changes so a pseudo schema supporting change-sets might look something like:

CREATE TABLE changeset (
    id bigserial primary key,
    document_id int REFERENCES document(id),
    user_id int REFERENCES users(id),
    modified_at timestamp);

CREATE TABLE changes (
    id bigserial primary key,
    changeset_id bigint REFERENCES changeset(id),
    change_type int,
    element_id int,
    old_value text,
    new_value text);

CREATE TABLE history (
    id bigserial primary key,
    element_id int,
    data text,
    children int[],
    valid_from bigint REFERENCES changeset(id),
    valid_to bigint REFERENCES changeset(id));

Where [history] is used for fetching complete revisions, and [changes] stores the change stream to support winding recent changes back or enabling offline clients with old revisions to catch up.
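So a single client edit would be one transaction along these lines (a rough sketch only; the ids and values are made up, and I've glossed over the fact that element ids are only unique per document, which the history lookup below ignores):

-- one client edit as a single transaction (sketch; ids and values are made up)
DO $$
DECLARE
    cs_id bigint;
BEGIN
    INSERT INTO changeset (document_id, user_id, modified_at)
    VALUES (42, 7, now())
    RETURNING id INTO cs_id;

    INSERT INTO changes (changeset_id, change_type, element_id, old_value, new_value)
    VALUES (cs_id, 1, 555, 'old data', 'new data');

    UPDATE element
    SET data = 'new data'
    WHERE document_id = 42 AND element_id = 555;

    -- close the current history row and open a new one
    UPDATE history
    SET valid_to = cs_id
    WHERE element_id = 555 AND valid_to IS NULL;

    INSERT INTO history (element_id, data, children, valid_from)
    VALUES (555, 'new data', '{}', cs_id);

    UPDATE document
    SET revision = revision + 1
    WHERE id = 42;
END
$$;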

5. My nightmares (fighting the last war)

In a previous life, I had bad experiences with un-partitioned gargantuan tables in SQL Server Standard. Table-locking operations could take 24 hours to complete, and DBAs had to spend weekends defragging, rebuilding indexes, performing schema migrations and data migrations, or handling backups. It always felt on the edge of disaster, e.g. a misbehaving query plan that one day decides to ignore indexes and do a full table scan. Every fix you wanted to make couldn't be done because it would take too long and cause too much downtime.

My nightmares are of a future filled with hours of downtime: struggling to restore a gargantuan table from a backup because of a problem with just one tiny document, or schema changes that require disconnecting all clients for hours. Instead, I could ignore best practice, create 10k tables, process them iteratively, and live in a utopia where I never have 100% downtime, only per-document unavailability.

Thanks for your help.

Re: Storing large documents - one table or partition by doc?

From
Jim Nasby
Date:
On 9/24/16 6:33 AM, Dev Nop wrote:
> This means that the applications are sensitive to the size of ids. A
> previous incarnation used GUIDs which was a brutal overhead for large
> documents.

If GUIDs *stored in a binary format* were too large, then you won't be
terribly happy with the 24 byte per-row overhead in Postgres.

What I would look into at this point is using int ranges and arrays to
greatly reduce your overhead:

CREATE TABLE ...(
   document_version_id int NOT NULL REFERENCES document_version
   , document_line_range int4range NOT NULL
   , document_lines text[] NOT NULL
   -- the integer equality part of the exclusion needs CREATE EXTENSION btree_gist
   , EXCLUDE USING gist (document_version_id WITH =, document_line_range WITH &&)
);

That allows you to store the lines of a document as an array of values, i.e.:

INSERT INTO ... VALUES(
   1
   , '[11,15]'
   , '[11:15]={line11,line12,line13,line14,line15}'
);

Note that I'm using explicit array bounds syntax to make the array
bounds match the line numbers. I'm not sure that's a great idea, but it
is possible.
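Reading a version back in line order is then something like this (the table name below is just a placeholder, since I elided it above):

-- hypothetical read path; document_version_lines is a placeholder table name
SELECT lower(document_line_range) + u.ord - 1 AS line_no,
       u.line
FROM document_version_lines,
     unnest(document_lines) WITH ORDINALITY AS u(line, ord)
WHERE document_version_id = 1
ORDER BY line_no;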


> My nightmares are of a future filled with hours of downtime: struggling
> to restore a gargantuan table from a backup because of a problem with
> just one tiny document, or schema changes that require disconnecting all
> clients for hours. Instead, I could ignore best practice, create 10k
> tables, process them iteratively, and live in a utopia where I never
> have 100% downtime, only per-document unavailability.

At some size you'd certainly want partitioning. The good news is that
you can mostly hide partitioning from the application and other database
logic, so there's not a lot of incentive to set it up immediately. You
can always do that after the fact.
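For example, a sketch of hiding it behind the existing element table via inheritance (the partition bounds and names are made up, and each child would still need its own indexes/primary key):

-- sketch: inheritance partitions kept behind the existing element table
CREATE TABLE element_p0 (
    CHECK (document_id >= 0 AND document_id < 10000)
) INHERITS (element);

CREATE TABLE element_p1 (
    CHECK (document_id >= 10000 AND document_id < 20000)
) INHERITS (element);
-- ...more partitions as document ids grow...

-- route inserts against the parent to the right child
CREATE OR REPLACE FUNCTION element_insert_router() RETURNS trigger AS $$
BEGIN
    IF NEW.document_id < 10000 THEN
        INSERT INTO element_p0 VALUES (NEW.*);
    ELSE
        INSERT INTO element_p1 VALUES (NEW.*);
    END IF;
    RETURN NULL;   -- the row lives in the child, not the parent
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER element_partition_insert
BEFORE INSERT ON element FOR EACH ROW
EXECUTE PROCEDURE element_insert_router();

With constraint exclusion on, a query like WHERE document_id = 42 only touches the matching child, so the application keeps querying element as before.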
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


Re: Storing large documents - one table or partition by doc?

From
Jeff Janes
Date:
On Fri, Sep 23, 2016 at 3:12 AM, Dev Nop <devnop0@gmail.com> wrote:

> I'm storing thousands of independent documents, each containing around 20k rows. The larger the document, the more likely it is to be active with inserts and updates (1000s/day). The most common read query is to get all the rows for a single document (100s/day).


How can the reads be an order of magnitude less frequent than the writes? Wouldn't anything doing an insert or update want to see the results of other people's inserts/updates about as frequently as they happen?


 

> It will be supporting real-time collaboration, but with strong consistency for a simple schema, so it is not well suited to dedicated "document databases" that assume schema-less data and eventual consistency. I won't have great hardware/budget, so I need to squeeze the most out of the least.
>
> My question is whether to put all documents into a single huge table or partition by document?
>
> The documents are independent, so it's purely a performance question. It's too many tables for PostgreSQL's partitioning support, but I don't get any benefit from a master table and constraints anyway. Handling partitioning in application logic is effectively zero cost.
>
> I know that 1000s of tables is regarded as an anti-pattern, but I can only see performance and maintenance benefits in one table per independent document, e.g. fast per-table vacuum, incremental schema updates, easy future sharding. A monster table will require additional key columns and indexes that don't have any value beyond allowing the documents to sit in the same table.


If you go the partitioned route, I would add the extra column anyway (but not an index on it), so that it is there if/when you need it.
 

> The only downsides seem to be the system-level per-table overhead, but I only see that as a problem if I have a very long tail of tiny documents. I'd rather solve that problem if it occurs than manage an all-eggs-in-one-basket monster table.
>
> Is there anything significant I am missing in my reasoning?


If you use a reasonably modern version of PostgreSQL (say, >= 9.4), the overhead of having 1000s of tables should not be too large of a problem. When you get into the 100,000 range, it is likely to start being a problem. If you get to 1,000,000, you almost definitely have a problem.

Cheers,

Jeff

Re: Storing large documents - one table or partition by doc?

From
Dev Nop
Date:
> If GUIDs *stored in a binary format* were too large, then you won't be terribly happy with the 24 byte per-row overhead in Postgres.

Heh. In this case the ids have a life outside the database in various text formats.

> What I would look into at this point is using int ranges and arrays to greatly reduce your overhead:
>
> CREATE TABLE ...(
>   document_version_id int NOT NULL REFERENCES document_version
>   , document_line_range int4range NOT NULL
>   , document_lines text[] NOT NULL
>   , EXCLUDE USING gist (document_version_id WITH =, document_line_range WITH &&)
> );

Thanks! Some new things for me to learn about there. Had to read "Range Types: Your Life Will Never Be The Same" - lol. https://wiki.postgresql.org/images/7/73/Range-types-pgopen-2012.pdf

To check I understand what you are proposing: the current version and history is stored in the same table. Each line is referred to by a sequential line number and then lines are stored in sequential chunks with range + array. The gist index is preventing any insert with the same version & line range. This sounds very compact for a static doc but doesn't it mean lines must be renumbered on inserts/moves?


Re: Storing large documents - one table or partition by doc?

From
Jim Nasby
Date:
Please CC the mailing list so others can chime in or learn...

On 9/26/16 3:26 AM, Dev Nop wrote:
>     What I would look into at this point is using int ranges and arrays
>     to greatly reduce your overhead:
>     CREATE TABLE ...(
>       document_version_id int NOT NULL REFERENCES document_version
>       , document_line_range int4range NOT NULL
>       , document_lines text[] NOT NULL
>       , EXCLUDE USING gist (document_version_id WITH =, document_line_range WITH &&)
>     );
>
>
> Thanks! Some new things for me to learn about there. Had to read "Range
> Types: Your Life Will Never Be The Same" - lol.
> https://wiki.postgresql.org/images/7/73/Range-types-pgopen-2012.pdf
>
> To check I understand what you are proposing: the current version and
> history is stored in the same table. Each line is referred to by a
> sequential line number and then lines are stored in sequential chunks
> with range + array. The gist index is preventing any insert with the
> same version & line range. This sounds very compact for a static doc but

You've got it correct.

> doesn't it mean lines must be renumbered on inserts/moves?

Yes, but based on your prior descriptions I was assuming that was what
you wanted... weren't you basically suggesting storing one line per row?

There's certainly other options if you want full tracking of every
change... for example, you could store every change as some form of a
diff, and only store the full document every X number of changes.
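Purely as an illustration of that idea (the names and the snapshot interval are made up):

-- illustrative only: full snapshot every Nth revision, diffs in between
CREATE TABLE document_revision (
    document_id int     NOT NULL,
    revision    int     NOT NULL,
    is_snapshot boolean NOT NULL,   -- true for every Nth revision
    payload     text    NOT NULL,   -- full document if is_snapshot, else a diff
    PRIMARY KEY (document_id, revision)
);

-- to restore revision 180: fetch the latest snapshot at or before it,
-- then apply the following diffs in revision order
SELECT revision, is_snapshot, payload
FROM document_revision
WHERE document_id = 42
  AND revision <= 180
  AND revision >= (SELECT max(revision)
                   FROM document_revision
                   WHERE document_id = 42
                     AND revision <= 180
                     AND is_snapshot)
ORDER BY revision;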
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461