Thread: DB design: How to store object properties?
Greetings everyone, I'm fairly new to PostgreSQL, but I'm currently doing some research on tools to be used for an upcoming project and could really use your help with a possible database design. So the scenario is this. We have two projects starting that will deal heavily with mapping spatial regions. One of the reasons I'm looking at PostgreSQL is the PostGIS extension that may help us in dealing with all the geometry. The first project will be mapping the outside environment. Think along the lines of Google Maps, but intended for use by pedestrians rather than cars. The other project, one that I'll be managing, will be doing the same thing but with interiors of buildings. For now, the two projects will be separate, but eventually the plan is to merge them and actually allow the system to plan paths that incorporate outside and interior maps. That's the project description in a nutshell. What I'm doing now is trying to figure out the best way to store all the spatial information. I want to have a foundational layer that is as simple as possible, and that can later be extended for any special needs. The current plan is to do all the mapping via a graph structure using nodes/vertices and edges. A node will be defined simply in terms of a GUID (UUID in PostgreSQL). Each node will have a unique id, but no other "mandatory" information. An edge is defined by the ids of its two end points. This setup will be common to both projects, and the use of UUIDs will ensure that any database merges in the future will not result in conflicts. Now here's the tricky bit. We have to be able to associate an arbitrary number of properties with each node and edge. So an example of a node property would be that node's position defined by a latitude/longitude pair. We will define all the possible properties between our two projects, but just because a property is valid, doesn't mean that it will be present in or be relevant to every node in the system. For instance, nodes located inside of a building will likely have their position defined by something other than latitude and longitude, since capturing that data with a GPS will not be possible. The simplest design would be to create two tables, one for nodes another for edges, and create a column for every possible property. This, however, is huge waste of space, since there will not be a single node or edge that will make use of all the defined properties. There may be hundreds of properties, but each node may use ten on average. That's the question - how do you represent this information in the database in a space-efficient manner. We still have to be able to search for specific nodes given their properties. For example – find all the nodes located within radius z of coordinate x,y. PostGIS will handle the special bit, but there has to be an efficient way of accessing this information/property for each node that has it. One other possibility I thought of was defining a master node table that would contain just the node ids. Then for each property I would create a separate table with the id column referencing the master. Here we're still wasting space for the extra storage of UUIDs, and if I want to run a query that selects nodes with several different properties, it would require a long list of JOINs. I looked at table inheritance that PostgreSQL offers, but it turned out to be something different from what I was hoping for. What I need is some sort of data inheritance based on a common unique key, but I don't think that's possible. Same thing with partitioning. If there was a way to partition the columns (with data) of a single large table, then that would be an ideal solution. But in the absence of that feature, what design would you recommend? Thanks, Max
On 2/17/08, Maxim Khitrov <mkhitrov@gmail.com> wrote: > The simplest design would be to create two tables, one for nodes > another for edges, and create a column for every possible property. > This, however, is huge waste of space, since there will not be a > single node or edge that will make use of all the defined properties. > There may be hundreds of properties, but each node may use ten on > average. That's the question - how do you represent this information > in the database in a space-efficient manner. We still have to be able > to search for specific nodes given their properties. For example – > find all the nodes located within radius z of coordinate x,y. PostGIS > will handle the special bit, but there has to be an efficient way of > accessing this information/property for each node that has it. If you have a lot of columns but most of them are NULL for a given record, the tuple size won't actually be that big. PG stores a bitmap in each tuple telling which columns are NULL, and then only stores the non-NULL column vales. So this approach may end up being reasonably efficient for you. -Doug
On 2/17/08, Maxim Khitrov <mkhitrov@gmail.com> wrote: > So the scenario is this. We have two projects starting that will deal > heavily with mapping spatial regions. One of the reasons I'm looking > at PostgreSQL is the PostGIS extension that may help us in dealing > with all the geometry. From your description, PostGIS sounds like it would work great. > Now here's the tricky bit. We have to be able to associate an > arbitrary number of properties with each node and edge. [snip] This is actually a fairly common scenario. As you point out, the problem case is impractical to implement in PostgreSQL (though the relational model is not to blame), and you have enumerated some of the possible solutions. There are a few other ways to tackle the problem, though. While dividing the properties into subtables either by property or by type (eg., string_properties, int_properties -- which may result in fever tables) results in many outer joins, you could alleviate the need to express these joins so often by creating a view that selects from all the tables. You can then create rules which turn insert statements on this view into insert statements on the various subtables. There's the question of how efficient this will be, but it simplifies the data model conceptually in a way that does not go against PostgreSQL's grain. I would also look at the different types of nodes required by your application. You say that defining a single table containing columns for all possible properties will result in too many columns. Could you instead isolate the types of nodes you need (outdoor nodes, indoor nodes, and so on) and create tables for those, each of which have only the properties that apply to those nodes? Again you would end up with a bunch of joins, but perhaps fewer than in the one-table-per-property scenario. Alexander.
Max, we use contrib/hstore specially designed for such kind of problem. It's a sort of perl's hash, where you can store all specific properties. In that way, your table will looks like create table objects ( id integer, x real, y real, ...., props hstore) Here '...' designates other mandatory fields you want to be searched. Oleg On Sun, 17 Feb 2008, Maxim Khitrov wrote: > Greetings everyone, > > I'm fairly new to PostgreSQL, but I'm currently doing some research on > tools to be used for an upcoming project and could really use your > help with a possible database design. > > So the scenario is this. We have two projects starting that will deal > heavily with mapping spatial regions. One of the reasons I'm looking > at PostgreSQL is the PostGIS extension that may help us in dealing > with all the geometry. The first project will be mapping the outside > environment. Think along the lines of Google Maps, but intended for > use by pedestrians rather than cars. The other project, one that I'll > be managing, will be doing the same thing but with interiors of > buildings. For now, the two projects will be separate, but eventually > the plan is to merge them and actually allow the system to plan paths > that incorporate outside and interior maps. > > That's the project description in a nutshell. What I'm doing now is > trying to figure out the best way to store all the spatial > information. I want to have a foundational layer that is as simple as > possible, and that can later be extended for any special needs. > > The current plan is to do all the mapping via a graph structure using > nodes/vertices and edges. A node will be defined simply in terms of a > GUID (UUID in PostgreSQL). Each node will have a unique id, but no > other "mandatory" information. An edge is defined by the ids of its > two end points. This setup will be common to both projects, and the > use of UUIDs will ensure that any database merges in the future will > not result in conflicts. > > Now here's the tricky bit. We have to be able to associate an > arbitrary number of properties with each node and edge. So an example > of a node property would be that node's position defined by a > latitude/longitude pair. We will define all the possible properties > between our two projects, but just because a property is valid, > doesn't mean that it will be present in or be relevant to every node > in the system. For instance, nodes located inside of a building will > likely have their position defined by something other than latitude > and longitude, since capturing that data with a GPS will not be > possible. > > The simplest design would be to create two tables, one for nodes > another for edges, and create a column for every possible property. > This, however, is huge waste of space, since there will not be a > single node or edge that will make use of all the defined properties. > There may be hundreds of properties, but each node may use ten on > average. That's the question - how do you represent this information > in the database in a space-efficient manner. We still have to be able > to search for specific nodes given their properties. For example ЪЪ > find all the nodes located within radius z of coordinate x,y. PostGIS > will handle the special bit, but there has to be an efficient way of > accessing this information/property for each node that has it. > > One other possibility I thought of was defining a master node table > that would contain just the node ids. Then for each property I would > create a separate table with the id column referencing the master. > Here we're still wasting space for the extra storage of UUIDs, and if > I want to run a query that selects nodes with several different > properties, it would require a long list of JOINs. > > I looked at table inheritance that PostgreSQL offers, but it turned > out to be something different from what I was hoping for. What I need > is some sort of data inheritance based on a common unique key, but I > don't think that's possible. Same thing with partitioning. If there > was a way to partition the columns (with data) of a single large > table, then that would be an ideal solution. But in the absence of > that feature, what design would you recommend? > > Thanks, > Max > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83