Thread: Storage for multiple variable-length attributes in a single row
Dear all
When ingesting mobility (IoT) data into MobilityDB, we transform very wide (2K attributes), high-frequency (one observation every tenth of a second) car mobility data from a flat format (e.g., CSV) into the MobilityDB format, in which there is a single record per trip and each signal is transformed into a temporal attribute (tbool, tint, tfloat, ttext, tgeompoint, tgeogpoint). These are temporal extensions of the corresponding PostgreSQL/PostGIS base types (bool, int, float, text, geometry, geography). All temporal types are stored using extended storage, e.g.,
CREATE TYPE tfloat (
internallength = variable,
[...]
storage = extended,
alignment = double,
[...]
);
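For concreteness, the resulting per-trip table looks roughly as follows (a sketch only; the table and signal names are hypothetical):

CREATE TABLE trips (
  trip_id    bigint PRIMARY KEY,
  -- one temporal attribute per signal, ~2K in total
  gear       tint,        -- timestamped integers
  speed      tfloat,      -- timestamped floats
  route      tgeompoint   -- timestamped points
  -- ... and so on for the remaining signals
);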
Given that each temporal value can be very wide (on average 30K timestamped points/floats/texts/... per trip), our first question is
* Is extended the right storage for this?
Our second question is how all 2K temporal attributes should be stored, which may be
* in a single table space
* in one table space per attribute
which, in other words, relates to the question of row vs column storage.
Many thanks for your insight
Esteban
Dear all
May I kindly ask for your insight about a question I posted a month ago, for which I never received any answer?
Many thanks
On Mon, Feb 7, 2022 at 8:44 AM Esteban Zimanyi <esteban.zimanyi@ulb.be> wrote:
May I kindly ask for your insight about a question I posted a month ago, for which I never received any answer?
-hackers really isn't the correct place for usage questions like this - even if you are creating a custom type (why you are doing this is left unstated, I have no idea what it means to be a temporally extended version of boolean, etc...)
You should read up on how TOAST works in PostgreSQL since that is what handles large length data values.
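For reference, a minimal sketch of how to inspect and override a column's storage mode in core PostgreSQL (the table and column names are hypothetical):

-- p = plain, m = main, e = external, x = extended
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'trips'::regclass AND attnum > 0;

-- override the default storage mode inherited from the type
ALTER TABLE trips ALTER COLUMN route SET STORAGE EXTERNAL;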
IIUC your setup, you are claiming to have 2K or more columns. This well exceeds the limit for PostgreSQL (at most 1600 columns per table).
A "tablespace" is a particular functionality provided by the server. You are using "table space" in a different sense and I'm unsure exactly what you mean. I presume "cell". PostgreSQL has row-oriented storage (our internals documentation goes over this).
I think your mention of MobilityDB also complicates receiving a useful response, as this list is for the core project. That you can exceed the column count limit suggests that your environment is different enough from core that you should be asking there.
You will need to wait for someone else to specifically answer the extended-storage question, though I suspect you've provided insufficient details in that regard.
David J.
Many thanks for your prompt reply, David. Allow me then to restate the questions, hoping that this better fits this mailing list.
MobilityDB is a time-series extension to PostgreSQL/PostGIS in which time-varying attributes (e.g., gear, GPS location of a car) are semantically grouped into "units" (e.g., a trip of a car) and are stored as temporal functions, e.g., a set of couples (integer, timestamptz) for gear (a temporal integer) or a set of triples (lon, lat, timestamptz) for the GPS location (a temporal point). All temporal types are stored using extended format, e.g.,
CREATE TYPE tint (
internallength = variable,
[...]
storage = extended,
alignment = double,
[...]
);
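As an illustration, temporal values are written as sequences of value@timestamp pairs (a sketch of MobilityDB's literal syntax; the values are made up):

SELECT tint '[2@2022-01-06 08:00:00, 3@2022-01-06 08:00:05]';  -- gear
SELECT tgeompoint '[Point(4.35 50.85)@2022-01-06 08:00:00,
                    Point(4.36 50.86)@2022-01-06 08:00:05]';   -- GPS location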
When ingesting mobility (IoT) data into MobilityDB, we receive very wide (2K attributes), high-frequency (one observation every tenth of a second) data in a flat format (e.g., CSV) that we need to store in PostgreSQL tables using MobilityDB temporal types. In this scenario, the values of these temporal types can be very wide (on average 30K timestamped couples/triples per trip).
As suggested by David, this goes beyond the "traditional" usage of PostgreSQL. Therefore my questions are
* What is the suggested strategy for splitting these 2K attributes into vertically partitioned tables linked by the primary key (e.g., the trip number in the example above)? Are there any limitations/best practices on the number/size of TOASTed attributes that a table should contain? (A sketch of what I have in mind follows below.)
* In each partitioned table containing N TOASTed attributes, given the above requirements, are there any limitations/best practices for storing them using extended storage or an alternative one such as external?
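A minimal sketch of the vertical partitioning I have in mind (all table and signal names are hypothetical):

-- one narrower table per group of related signals,
-- all linked by the trip number
CREATE TABLE trip_engine (
  trip_id bigint PRIMARY KEY,
  gear    tint,
  rpm     tfloat
);
CREATE TABLE trip_location (
  trip_id bigint PRIMARY KEY,
  route   tgeompoint
);
-- a full trip is reassembled by joining on trip_id
SELECT e.gear, l.route
FROM trip_engine e JOIN trip_location l USING (trip_id);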
Many thanks for your insight
Esteban
On Mon, Feb 7, 2022 at 9:58 AM Esteban Zimanyi <estebanzimanyi@gmail.com> wrote:
As suggested by David, this goes beyond the "traditional" usage of PostgreSQL. Therefore my questions are
* What is the suggested strategy for splitting these 2K attributes into vertically partitioned tables linked by the primary key (e.g., the trip number in the example above)? Are there any limitations/best practices on the number/size of TOASTed attributes that a table should contain?
* In each partitioned table containing N TOASTed attributes, given the above requirements, are there any limitations/best practices for storing them using extended storage or an alternative one such as external?
Frankly, the best practice is "don't have that many columns". Since you do, I posit that you are just going to have to make decisions (possibly with experimentation) on your own. Or maybe ask around on a MobilityDB forum what people using that tool and having these kinds of data structures do. From a core PostgreSQL perspective you've already deviated from the model structures that it was designed with in mind.
I'm really confused that you'd want the data value itself to contain a timestamp that, on a per-row basis, should be the same timestamp that every other value on the row has. Extracting the timestamp to its own column and using simpler, atomic data types is how core PostgreSQL and relational normalization theory recommend dealing with this situation. Then you just break up the attributes of a similar nature into their own tables based upon their shared nature, in almost all cases relying on "main" storage.
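A minimal sketch of the tuple-timestamped design described above (hypothetical names; one row per observation, atomic base types in "main" storage):

CREATE TABLE trip_observation (
  trip_id bigint,
  ts      timestamptz,
  gear    int,
  speed   float8,
  PRIMARY KEY (trip_id, ts)
);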
David J.
On Mon, Feb 07, 2022 at 10:10:53AM -0700, David G. Johnston wrote:
> Frankly, the best practice is "don't have that many columns". [...]
Actually, looking at the original example:
> CREATE TYPE tint (
> internallength = variable,
> [...]
> storage = extended,
> alignment = double,
> [...]
> );
I'm wondering if it's just some miscommunication here. If the tint data type only needs to hold a timestamp and an int, I don't see why it would be varlena at all. So if a single tint can internally hold thousands of (int, timestamptz) couples, a bit like pgpointcloud, then having it by default extended (so both possibly out-of-line and compressed) seems like a good idea: you can definitely hit the 8k page boundary, it should compress nicely, and you also avoid some quite high per-tuple header overhead.
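To gauge that in practice, pg_column_size() reports the size of a stored (possibly compressed) value, so something like the following shows how large the values end up and how well they compress (table and column names are hypothetical):

SELECT trip_id,
       pg_column_size(route) AS stored_bytes
FROM trips
ORDER BY stored_bytes DESC LIMIT 10;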
Dear David
There are two approaches for storing temporal information in a relational database, explored since the 1980s following the work of Richard Snodgrass: tuple timestamping vs attribute timestamping. The SQL standard adopted the tuple-timestamping approach, but in MobilityDB we decided to use attribute timestamping. As you rightly pointed out, tuple timestamping follows the traditional relational normalization theory.
The main advantage of attribute timestamping for mobility data is that we only need to store the changes of value of a temporal attribute. In the example of the gear of a car, even if we receive high-frequency observations, there will be very few gear changes during a trip, while there will be many more position changes. Therefore, MobilityDB only stores the changes of value (e.g., no new position is stored during a red light or a traffic jam), which constitutes a huge lossless compression with respect to the raw format, where every observation is stored in a single row. We have measured 450% lossless compression on real IoT data.
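As an illustration, a run of identical high-frequency gear observations collapses into a single stored couple (a sketch using MobilityDB's value@timestamp literal syntax; the timestamps are made up):

-- raw feed: gear = 2 at 08:00:00.0, 08:00:00.1, ..., 08:00:59.9, then 3
-- stored temporal value: only the changes are kept
SELECT tint '[2@2022-01-06 08:00:00, 3@2022-01-06 08:01:00]';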
In addition, MobilityDB implements all temporal operations and makes it possible to determine the value of any temporal attribute at any timestamp (e.g., using linear interpolation between observations for speed or GPS position), independently of the actually stored observations.
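For example, a speed between two stored observations is recovered by linear interpolation (a sketch; valueAtTimestamp is the MobilityDB function I have in mind, and the values are made up):

SELECT valueAtTimestamp(
  tfloat '[10.0@2022-01-06 08:00:00, 20.0@2022-01-06 08:00:10]',
  timestamptz '2022-01-06 08:00:05');
-- 15.0, interpolated between the two surrounding observations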
I hope this clarifies things a little.