Thread: Whole-row comparison ?
Hi there,
I'm trying to implement a "versionned" storage for a datawarehouse system, meaning I have tables which store daily information about objects and I would like to have a row inserted (i.e. a new version) only if it differs from the most recent existing row. For instance instead of storing
version attribute1 attribute2
1 x y
2 x y
3 x y
4 z y
5 z y
6 z t
we would only keep the diffs :
version attribute1 attribute2
1 x y
4 z y
6 z t
This would save lots of space and the full table could be easily accessed through a view.
However as the tables have a large number of rows (several hundreds) I would like to avoid having to write conditions like WHERE (old.att1 != new.attr1 or old.attr2 != new.attr2 or ... or old.attr245 != new.attr245 )
So my question is: is there a way to do some kind of whole-row comparison ? For instance a hash of the full row or something similar ? Is there a standard way of solving this problem ?
Thanks a lot for any hind !
Christian
> So my question is: is there a way to do some kind of whole-row > comparison ? For instance a hash of the full row or something similar ? > Is there a standard way of solving this problem ? Sure it is called row wise comparison: http://www.postgresql.org/docs/8.2/interactive/functions-comparisons.html#ROW-WISE-COMPARISON you can write something like: SELECT * FROM Your_tableWHERE (c1,c2,c3,c4) = ('a',2,'xyz',123); Regards, Richard Broersma Jr.
--- christian.roche.ext@nsn.com wrote: > select * from temp."BSC_Table" t, public.bsc_view p where t.id = p.id > and row(t) <> row(p); > > ERROR: operator does not exist: "temp"."BSC_Table" <> bsc_view > SQL state: 42883 > Hint: No operator matches the given name and argument type(s). You may > need to add explicit type casts. > Don't forget to reply all so that everyone on the list can participate. Also, do this is good since it doesn't limit you to my limited knowledge. ;) row() doesn't do what you think it does. you have to specify every column that you want to compare, so: row(t.col1, t.col2, t.col3, t.col4) <> row(p.col1, p.col2, p.col3, p.col4) where the datatypes of each respective column match. i.e. t.col1 maps to p.col1, ... Regards, Richard Broersma Jr.
Ok I understand now that this row(t.*) syntax is new to postgres 8.2 As explained in the documentation, §4.2.11. Row Constructors: A row constructor can include the syntax rowvalue.*, which will be expanded to a list of the elements of the row value, justas occurs when the .* syntax is used at the top level of a SELECT list. For example, if table t has columns f1 and f2,these are the same: SELECT ROW(t.*, 42) FROM t; SELECT ROW(t.f1, t.f2, 42) FROM t; This would obviously simplify the syntax of my queries a lot since I have so many columns. I'm going to try and installthe newest version and check it. I'll keep you informed. Thanks to all! Christian --- > select * from temp."BSC_Table" t, public.bsc_view p where t.id = p.id > and row(t) <> row(p); > > ERROR: operator does not exist: "temp"."BSC_Table" <> bsc_view SQL > state: 42883 > Hint: No operator matches the given name and argument type(s). You may > need to add explicit type casts. > Don't forget to reply all so that everyone on the list can participate. Also, do this is good since it doesn't limit youto my limited knowledge. ;) row() doesn't do what you think it does. you have to specify every column that you want to compare, so: row(t.col1, t.col2, t.col3, t.col4) <> row(p.col1, p.col2, p.col3, p.col4) where the datatypes of each respective column match. i.e. t.col1 maps to p.col1, ... Regards, Richard Broersma Jr.
Hi all, first let me tell you that this nice "whole-row comparison" feature of postgres 8.2 allowed me to create a versionned database model very neatly. The SQL statement that inserts in the destination table only the one rows that are new or that have changed since last time is very simply written: INSERT INTO bsc_table SELECT nextval('version_seq'), <fields> FROM load.bsc_table AS ld LEFT JOIN bsc_view AS nt USING (obj_id) WHERE nt.obj_id IS NULL OR row(nt.*) <> row(ld.*); bsc_view is a view that returns the latest version of each object in the bsc table: CREATE VIEW bsc_view AS SELECT <fields> FROM bsc_table WHERE (obj_id, ver_id) IN (SELECT obj_id, max(ver_id) FROM bsc_table GROUP BY obj_id); This is all nice as long as I only want to access the very last version of the table. However what should be done if I now need to access an earlier version ? The most elegant way would be to pass a ver_id parameter to bsc_view, something like : CREATE VIEW bsc_view(int) AS SELECT <fields> FROM bsc_table WHERE (obj_id, ver_id) IN (SELECT obj_id, max(ver_id) FROM bsc_table WHERE ver_id <= $1 GROUP BY obj_id)); However postgres doesn't allow parameters in views as far as I know. I guess I could create a function returning a set of rows, but then I would lose most advantages of rewritten views, especially optimization, right ? I've contemplated reusing an awful hack from my Access era, namely using a single-rowed table to store the parameter and joining the view on it. The parameter would be updated before the view is called; this would work but would definitely be ugly. Can someone think of a better way to do that ? Thanks a lot, Christian -----Original Message----- I'm trying to implement a "versionned" storage for a datawarehouse system, meaning I have tables which store daily information about objects and I would like to have a row inserted (i.e. a new version) only if it differs from the most recent existing row. For instance instead of storing version attribute1 attribute2 1 x y 2 x y 3 x y 4 z y 5 z y 6 z t we would only keep the diffs : version attribute1 attribute2 1 x y 4 z y 6 z t
On Fri, Jun 01, 2007 at 08:07:46PM +0300, christian.roche.ext@nsn.com wrote: > I've contemplated reusing an awful hack from my Access era, namely using > a single-rowed table to store the parameter and joining the view on it. > The parameter would be updated before the view is called; this would > work but would definitely be ugly. Can someone think of a better way to > do that ? I sort of don't see how that hack would be any different from a SRF. You'd lose the planner benefits anyway, I think, because you'd have to plan for the generic case where the data could be anything, no? A -- Andrew Sullivan | ajs@crankycanuck.ca Users never remark, "Wow, this software may be buggy and hard to use, but at least there is a lot of code underneath." --Damien Katz
Hi Andrew, I must admit I don't really understand what you say. I don't know what SRF stand for, and what you say about generic case is not clear to me, sorry. My idea is that using a parameter table allows me to keep using a view, which is optimized for instance when used against a WHERE condition. For example, I could write : CREATE VIEW bsc_view AS SELECT <fields> FROM bsc_table WHERE (obj_id, ver_id) IN (SELECT obj_id, max(ver_id) FROM bsc_table, param_table WHERE ver_id <= param_table.ver_id GROUP BY obj_id)); and the following statement would be optimized: UPDATE param_table SET ver_id = xxx; SELECT * FROM bsc_view WHERE obj_id = yyy; which would not be the case would I have used a multi-row function. Does this make sense ? Thanks a lot, Christian -----Original Message----- From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org] On Behalf Of ext Andrew Sullivan Sent: Friday, June 01, 2007 17:47 To: pgsql-sql@postgresql.org Subject: Re: [SQL] Versionning (was: Whole-row comparison) On Fri, Jun 01, 2007 at 08:07:46PM +0300, christian.roche.ext@nsn.com wrote: > I've contemplated reusing an awful hack from my Access era, namely > using a single-rowed table to store the parameter and joining the view on it. > The parameter would be updated before the view is called; this would > work but would definitely be ugly. Can someone think of a better way > to do that ? I sort of don't see how that hack would be any different from a SRF. You'd lose the planner benefits anyway, I think, because you'd have to plan for the generic case where the data could be anything, no? A
On Fri, Jun 01, 2007 at 09:07:10PM +0300, christian.roche.ext@nsn.com wrote: > > Hi Andrew, > > I must admit I don't really understand what you say. I don't know what > SRF stand for, and what you say about generic case is not clear to me, > sorry. Sorry, it stands for set returning function. I thought someone upthread suggested that instead of a view. > and the following statement would be optimized: > > UPDATE param_table SET ver_id = xxx; > SELECT * FROM bsc_view WHERE obj_id = yyy; > > which would not be the case would I have used a multi-row function. > > Does this make sense ? Yes, but I don't think it's true. Because you change the value of ver_id all the time, the actual result can't be collapsed to a constant, so you end up having to execute the query with the additional value, and you still have to plan that. The same thing is true of a function, which will have its plan prepared the first time you execute it. (I could be wrong about this; I suppose the only way would be to try it.) A -- Andrew Sullivan | ajs@crankycanuck.ca However important originality may be in some fields, restraint and adherence to procedure emerge as the more significant virtues in a great many others. --Alain de Botton
Hi Andrew, what is worrying me is that if I use a SRF, any additional WHERE condition would not be taken into account before executing the underlying query, e.g., in this request using a view, the WHERE condition would be considered in the final query : UPDATE params SET version = ver_id; SELECT * FROM bsc_list_view WHERE obj_id = 'xxx'; because the bsc_list_view would be expanded to the underlying request, while using a SRF, the whole table would be scaned before the WHERE condition is applied: SELECT * FROM bsc_list_srf(ver_id) WHERE obj_id = 'xxx'; This is what I mean when I say that the optimization would be lost when using a SRF. Now what is the "Right Thing To Do" in this particular case ? The nicest thing would really to have parametrized view. Is there any fundamental reason why such a beast does not exist, or is it only postgres (compared to higher-level RDBMS) ? Thanks a lot ! Christian -----Original Message----- From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org] On Behalf Of ext Andrew Sullivan Sent: Friday, June 01, 2007 18:51 To: pgsql-sql@postgresql.org Subject: Re: [SQL] Versionning (was: Whole-row comparison) Yes, but I don't think it's true. Because you change the value of ver_id all the time, the actual result can't be collapsed to a constant, so you end up having to execute the query with the additional value, and you still have to plan that. The same thing is true of a function, which will have its plan prepared the first time you execute it. (I could be wrong about this; I suppose the only way would be to try it.)
On Mon, Jun 04, 2007 at 01:40:18PM +0300, christian.roche.ext@nsn.com wrote: > case ? The nicest thing would really to have parametrized view. Is > there any fundamental reason why such a beast does not exist, or is it > only postgres (compared to higher-level RDBMS) ? I don't think there's a fundamental reason, no. But why couldn't you change your query to issue the SRF directly, with the parameter: SELECT * FROM some_srf(param1, param2)? A -- Andrew Sullivan | ajs@crankycanuck.ca "The year's penultimate month" is not in truth a good way of saying November. --H.W. Fowler