Thread: Design strategy for table with many attributes
Hello,
In one of the applications we are getting transactions in messages/events format and also in files, and then they are getting parsed and stored into the relational database. The number of attributes/columns each transaction has is ~900+. Logically they are part of one single transaction and should be stored in one table as one single row. There will be ~500 million such transactions coming into the system each day, and there will be approx 10K peak write TPS and 5K read TPS in the target state. This system has a Postgres database as the "source of truth" or OLTP store, and then the data moves to Snowflake for the OLAP store.
Initially, when the system was designed, the number of attributes per transaction was <100, but slowly the business wants to keep/persist other attributes too in the current system, and the number of columns keeps growing.
However, having worked with some database systems, we have gotten a few suggestions from DBAs not to have many columns in a single table. For example, in Oracle they say not to go beyond ~255 columns, as then row chaining and row migration type of things are going to haunt us. We are also afraid that concurrent DMLs on the table may make it a contention point. So I wanted to understand, in such a situation, what would be the best design approach we should use irrespective of the database? Or say, what is the maximum number of columns per table we should restrict ourselves to? Should we break the single transaction into multiple tables like one main table and other addenda tables with the same primary key to join and fetch the results wherever necessary?
Regards
Lok
On Thu, Jul 4, 2024 at 12:38 PM Lok P <loknath.73@gmail.com> wrote:
Should we break the single transaction into multiple tables like one main table and other addenda tables with the same primary key to join and fetch the results wherever necessary?
I would say yes. Find a way to logically group sets of columns together and place those groups into separate tables. I'd also be looking for cases where multiple columns really should be multiple rows. This is not uncommon.
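For illustration only, a minimal sketch of that kind of split, with hypothetical table and column names (the real groupings have to come from your domain):

CREATE TABLE transaction_main (
    transaction_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    created_at     timestamptz NOT NULL,
    amount         numeric(18,2)
    -- ... the core, frequently queried columns
);

-- One addenda table per logical group of columns, sharing the same key.
CREATE TABLE transaction_settlement (
    transaction_id  bigint PRIMARY KEY REFERENCES transaction_main (transaction_id),
    settlement_date date,
    settlement_code text
    -- ... the rest of this group's columns
);

-- "Multiple columns that should really be multiple rows" become a child table.
CREATE TABLE transaction_attribute (
    transaction_id  bigint NOT NULL REFERENCES transaction_main (transaction_id),
    attribute_name  text   NOT NULL,
    attribute_value text,
    PRIMARY KEY (transaction_id, attribute_name)
);

Queries that only need the core columns never touch the addenda tables; the ones that do can join on transaction_id.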
David J.
On 7/4/24 15:37, Lok P wrote:
Or say, what is the maximum number of columns per table we should restrict ourselves to? Should we break the single transaction into multiple tables like one main table and other addenda tables with the same primary key to join and fetch the results wherever necessary?
900 columns makes my head hurt badly...
The neophyte will design a table with a separate field for each type of phone number that may be encountered. The experienced designer will move all phone numbers to their own table, where each entry/row contains a reference link, a "number_type" field, and a field with the actual number in it... Three fields in a table that is manageable and can be queried/modified without stressing the database server.
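As a rough sketch of that pattern (hypothetical names; the person table is only there so the example is self-contained):

CREATE TABLE person (
    person_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name      text NOT NULL
);

CREATE TABLE phone_number (
    person_id    bigint NOT NULL REFERENCES person (person_id),  -- the reference link
    number_type  text   NOT NULL,                                -- e.g. 'home', 'work', 'mobile'
    phone_number text   NOT NULL,
    PRIMARY KEY (person_id, number_type)
);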
On Thu, Jul 4, 2024 at 3:38 PM Lok P <loknath.73@gmail.com> wrote:
Hello,
In one of the applications we are getting transactions in messages/events format and also in files, and then they are getting parsed and stored into the relational database. The number of attributes/columns each transaction has is ~900+. Logically they are part of one single transaction
Nothing out of the ordinary.
and should be stored in one table as one single row.
Says who?
There will be ~500 million such transactions coming into the system each day, and there will be approx 10K peak write TPS and 5K read TPS in the target state. This system has a Postgres database as the "source of truth" or OLTP store, and then the data moves to Snowflake for the OLAP store.
Initially, when the system was designed, the number of attributes per transaction was <100, but slowly the business wants to keep/persist other attributes too in the current system, and the number of columns keeps growing.
However, having worked with some database systems, we have gotten a few suggestions from DBAs not to have many columns in a single table. For example, in Oracle they say not to go beyond ~255 columns, as then row chaining and row migration type of things are going to haunt us. We are also afraid that concurrent DMLs on the table may make it a contention point. So I wanted to understand, in such a situation, what would be the best design approach we should use irrespective of the database? Or say, what is the maximum number of columns per table we should restrict ourselves to? Should we break the single transaction into multiple tables like one main table and other addenda tables with the same primary key to join and fetch the results wherever necessary?
You need database normalization. It's a big topic. Here's a good simple explanation:
Ultimately, the way you should store the data depends on how you will use it. When you retrieve these values, what are the different ways you’ll be using them?

Normalised representations are more flexible, and the pragmatic, performance-based consideration is that all the values in a row are always effectively retrieved together. So if you reasonably often have a query that only accesses the creation date and transaction id, then it will be pretty slow if you are also always retrieving 500 other columns you don’t need.

So: you might often pull all the values *other* than the attributes (creation date, creator, job information, whatever) together. This argues that those values should be in one table, and the attributes in another.

Will you usually be using *all* of the attributes for a particular transaction together in the same operation? It might make sense to store them in eg an array in that case. But this significantly reduces the speed of accessing particular attributes separately. It is likely that you will want to slice things by particular named attribute across many transactions. This argues for the more normalised form, as does the general principle of doing things in the way that is most general and flexible.

When considering how the data will be used, please consider not only the developers of your current application(s), but also eg data analysts, managers, future applications etc. The less specific you can be about how you want to use the data, the more you should lean toward fully normalising.

Note also that you can store your data in a normalised and maximally flexible form, but also use triggers or batch jobs to gather various permutations of the data for specific purposes. If you really do, say, both have some major part of your project that uses all the attributes on a given transaction together, but you also have other uses, you may want to store both the normalised/attribute table and the “all the values together” version.

Even if you want to store “all the values together”, it may well be better to use an array, JSON or HStore, rather than having a super-wide table. JSON would eg let you enumerate all the column names (for example) and employ Postgres’s really nice JSON query features.
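To make the “all the values together” option concrete, a rough sketch with hypothetical names (the normalised attribute-table shape is the same idea as sketched earlier in the thread):

-- One jsonb document per transaction instead of hundreds of columns.
CREATE TABLE txn_doc (
    txn_id     bigint PRIMARY KEY,
    created_at timestamptz NOT NULL,
    attributes jsonb NOT NULL
);

-- Enumerate the attribute names present in each document:
SELECT txn_id, jsonb_object_keys(attributes) AS attribute_name
FROM txn_doc;

-- Slice one named attribute across many transactions:
SELECT txn_id, attributes ->> 'settlement_code' AS settlement_code
FROM txn_doc;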
On Fri, Jul 5, 2024 at 1:26 AM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Thu, Jul 4, 2024 at 12:38 PM Lok P <loknath.73@gmail.com> wrote:
Should we break the single transaction into multiple tables like one main table and other addenda tables with the same primary key to join and fetch the results wherever necessary?

I would say yes. Find a way to logically group sets of columns together and place those groups into separate tables. I'd also be looking for cases where multiple columns really should be multiple rows. This is not uncommon.

David J.
Thank you David.
As you said, to logically break this into multiple tables, I believe it means it should be such that there will be no need to query multiple tables and join them most of the time to fetch the results. It should just fetch the results from one table at any point in time.
But do you also suggest keeping those table pieces related to each other through the same primary key? Won't there be a problem when we load the data? Say, for example, in the normal scenario the data load will be to one table, but when we break it into multiple tables it will happen to all the individual pieces; won't that cause an additional burden on the data load?
Also, I understand the technical limitation of the max number of columns per table is ~1600. But would you advise restricting us to some lower number long before reaching that limit, so that we will not face any anomalies when we grow in the future? And should we maintain any specific order of the columns from the start to the end of the specific table?
On Fri, 5 Jul 2024 at 17:07, Lok P <loknath.73@gmail.com> wrote:
> Also, I understand the technical limitation of the max number of columns per table is ~1600. But would you advise restricting us to some lower number long before reaching that limit, so that we will not face any anomalies when we grow in the future? And should we maintain any specific order of the columns from the start to the end of the specific table?

Something else you may wish to consider, depending on the column types of your 900+ columns, is the possibility that some INSERTs may fail due to row length while others with shorter variable-length values may be ok. Here's a quick example with psql:

select 'create table a (' || string_agg('a'||x||' text not null default $$$$',',') || ')' from generate_series(1,1000)x;
\gexec
insert into a default values;
INSERT 0 1

Again, but with a larger DEFAULT to make the tuple larger:

select 'create table b (' || string_agg('a'||x||' text not null default $$hello world$$',',') || ')' from generate_series(1,1000)x;
\gexec
insert into b default values;
ERROR: row is too big: size 12024, maximum size 8160

There is a paragraph at the bottom of [1] with some warnings about things relating to this.

The tuple length would be fixed for fixed-length types defined as NOT NULL. So, if you have that, there should be no such surprises.

David

[1] https://www.postgresql.org/docs/current/limits.html
On Thursday, July 4, 2024, Lok P <loknath.73@gmail.com> wrote:
But do you also suggest keeping those table pieces related to each other through the same primary key?
Yes, every row produced from the input data “row” should get the same ID associated with it, either as the entire PK or as a component of a multi-column PK/unique index.
Won't there be a problem when we load the data? Say, for example, in the normal scenario the data load will be to one table, but when we break it into multiple tables it will happen to all the individual pieces; won't that cause an additional burden on the data load?
Yes, doing this requires additional CPU time to perform the work. I’d say IO is hopefully a wash.
Also, I understand the technical limitation of the max number of columns per table is ~1600. But would you advise restricting us to some lower number long before reaching that limit, so that we will not face any anomalies when we grow in the future?
In a row-oriented system, wider is worse. I start evaluating table design with that in mind at the fourth column (including the surrogate key that is usually present, and the natural key - ignoring auditing columns).
And should we maintain any specific order of the columns from the start to the end of the specific table?
There is material out there on micro-optimizing column ordering to match alignment boundaries. I'd say the benefit is meaningful, but there is a non-trivial cost to actually set up the testing to verify that what you've figured out is working. I've never actually done it myself. Though it actually seems like something someone could/may have written an algorithm for (though I do not recall ever seeing mention of one).
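For what it's worth, a quick way to see the alignment padding effect for yourself (illustrative only; exact sizes depend on version and platform):

-- Same four values, two orderings: putting the widest fixed-width columns first
-- usually avoids the padding that alignment would otherwise insert before each bigint.
SELECT pg_column_size(ROW(true, 1::bigint, true, 1::bigint)) AS interleaved,
       pg_column_size(ROW(1::bigint, 1::bigint, true, true)) AS widest_first;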
David J.
On Fri, Jul 5, 2024 at 10:45 AM Guyren Howe <guyren@gmail.com> wrote:
On Jul 4, 2024, at 22:07, Lok P <loknath.73@gmail.com> wrote:

If you stick to the principle of grouping columns in a table when you use those columns together, you should be good.

Note that you might want to split up the “parent” table if that naturally groups its columns together for certain uses. In that case, you could have the same pk on all the 1:1 tables you then have. In that case, the pk for each of those tables is also the fk.
When you said below,
"Note that you might want to split up the “parent” table if that naturally groups its columns together for certain uses. In that case, you could have the same pk on all the 1:1 tables you then have. In that case, the pk for each of those tables is also the fk."
Do you mean having a real FK created through DDL and maintaining it, or just assuming it with no need to define it for all the pieces/tables, and only keeping the same PK across all the pieces, as we know these are related to the same transaction and are logically related?
Some folks on the team suggested having key business attributes, or say frequently queried attributes, as individual columns, and the others clubbed together into a JSON column in the same table. Is that advisable, or can issues occur with this approach? Also, I am not sure how effectively Postgres processes JSON (from both a read and a write perspective) as compared to normal columns in an OLTP environment. Please advise.
As David suggested, it breaks if a row exceeds the 8k limit, i.e. a single page size; will that still hold true if we have a column with JSON in it?
On Fri, 5 Jul, 2024, 12:04 pm Guyren Howe, <guyren@gmail.com> wrote:
On Jul 4, 2024, at 23:28, Lok P <loknath.73@gmail.com> wrote:
> "Note that you might want to split up the “parent” table if that naturally groups its columns together for certain uses. In that case, you could have the same pk on all the 1:1 tables you then have. In that case, the pk for each of those tables is also the fk."
> Do you mean having a real FK created through DDL and maintaining it, or just assuming it with no need to define it for all the pieces/tables, and only keeping the same PK across all the pieces, as we know these are related to the same transaction and are logically related?

A primary key *names something*. Often it’s a kind of platonic representation of a real thing — say, a person.

I might use a person’s login columns in some functions, and the person’s name, birth date, etc in other functions. Rather than have one table, I should split this into two, but use the same primary key (I would either name both id or both, say, person_id, irrespective of the name of the table, so it’s clear you’re doing this).

You can just do a join on the mutual primary keys as you’d expect. In fact, if you name them the same, you can just use NATURAL JOIN.

So you’d have person_details and person_login tables, and have a person_id pk for both.
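Sketched out with hypothetical columns, that looks something like:

CREATE TABLE person_login (
    person_id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    username      text NOT NULL UNIQUE,
    password_hash text NOT NULL
);

CREATE TABLE person_details (
    person_id  bigint PRIMARY KEY REFERENCES person_login (person_id),  -- the pk doubles as the fk
    full_name  text NOT NULL,
    birth_date date
);

-- Because the key column has the same name in both tables, NATURAL JOIN just works:
SELECT username, full_name, birth_date
FROM person_login
NATURAL JOIN person_details;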
On Fri, 5 Jul 2024 at 19:53, Lok P <loknath.73@gmail.com> wrote:
> As David suggested, it breaks if a row exceeds the 8k limit, i.e. a single page size; will that still hold true if we have a column with JSON in it?

You wouldn't be at risk of the same tuple length problem if you reduced the column count and stored the additional details in JSON. Each varlena column is either stored in the tuple inline, or toasted and stored out of line. Out of line values need an 18-byte pointer to the toasted data. That pointer contributes to the tuple length.

This isn't me advocating for JSON. I'm just explaining the limitations. I think I'd only advocate for JSON if the properties you need to store vary wildly between each tuple. There's a large overhead to storing JSON labels, which you'd pay the price of for each tuple. That sounds like it would scale terribly with the data volumes you've suggested you'll be processing.

David
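If the hybrid idea from the earlier question is still on the table (key business attributes as real columns, the rarely queried long tail in one JSON column), a minimal sketch with hypothetical names would look something like the following; note the caveat above that the JSON labels are stored again in every row:

CREATE TABLE txn_hybrid (
    txn_id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    created_at  timestamptz NOT NULL,
    account_id  bigint NOT NULL,
    amount      numeric(18,2) NOT NULL,
    extra_attrs jsonb            -- long-tail attributes; every row repeats the labels
);

-- Pulling one attribute out of the jsonb column:
SELECT txn_id, extra_attrs ->> 'merchant_category' AS merchant_category
FROM txn_hybrid
WHERE account_id = 42;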
On Fri, 5 Jul, 2024, 1:44 pm David Rowley, <dgrowleyml@gmail.com> wrote:
On Fri, 5 Jul 2024 at 19:53, Lok P <loknath.73@gmail.com> wrote:
> As David suggested, it breaks if a row exceeds the 8k limit, i.e. a single page size; will that still hold true if we have a column with JSON in it?
You wouldn't be at risk of the same tuple length problem if you
reduced the column count and stored the additional details in JSON.
Each varlena column is either stored in the tuple inline, or toasted
and stored out of line. Out of line values need an 18-byte pointer to
the toasted data. That pointer contributes to the tuple length.
David
Got it. Thank you very much.
So there would be a performance overhead with JSON, and we need to validate that carefully if we go in that direction at all.
However, out of curiosity, if the toasted/compressed component or column, which is JSON itself, goes beyond 8k post compression, will it break then?
On Fri, 5 Jul 2024 at 20:53, Lok P <loknath.73@gmail.com> wrote:
> However, out of curiosity, if the toasted/compressed component or column, which is JSON itself, goes beyond 8k post compression, will it break then?

No. The size limit of a varlena field such as TEXT, JSON or JSONB is 1GB. See "field size" in [1]. Please also read [2].

David

[1] https://www.postgresql.org/docs/current/limits.html
[2] https://www.postgresql.org/docs/current/storage-toast.html
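Tangentially, if you want to see TOAST compression in action, a small illustration (hypothetical table, rough check only):

CREATE TABLE j (id int PRIMARY KEY, doc jsonb);
INSERT INTO j VALUES (1, jsonb_build_object('big', repeat('x', 100000)));

-- Stored (possibly compressed/out-of-line) size vs. the logical text length of the value.
SELECT pg_column_size(doc) AS stored_bytes, octet_length(doc::text) AS logical_bytes FROM j;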