Thread: Dependency graph of all tuples relied upon in a query answer

Dependency graph of all tuples relied upon in a query answer

From
Randall Lucas
Date:
I'm storing facts about an entity (e.g., "company") in a normalized
form with various repeating groups, link tables, etc.  My workflow
requires that after (or as part of) collecting these facts, I be able
to "sign off" as having verified all of the facts that pertain to a
given company.  I understand this as meaning I need to sign off on each
row that was used in answering the query "select * from company left
join ..."

An inverted way of thinking about the problem is the notion of getting a
source document (say, a "company registration form") and parsing and
storing it in a normalized format.  How do you "tag" the origin of each
and every row in every table that resulted from that source document?
It is possible to do so by associating an extra column with each
inserted or modified value (yuck).

It seems to me that the elegant way to do this would be to get the
entire graph of dependencies for not only all tables that reference the
entity, but only those rows within those tables that refer specifically
to my entity.

The query engine should have a pretty good idea about which tables and
which rows would actually be used in forming the responses to a given
query.  How can I get this information?  Or am I off the deep end (-ency
graph) with this one?

Best,

Randall

--
Randall Lucas       Tercent, Inc.       DF93EAD1

Re: Dependency graph of all tuples relied upon in a query answer

From
"Merlin Moncure"
Date:
On 8/30/06, Randall Lucas <rlucas@tercent.com> wrote:
> I'm storing facts about an entity (e.g., "company") in a normalized
> form with various repeating groups, link tables, etc.  My workflow
> requires that after (or as part of) collecting these facts, I be able
> to "sign off" as having verified all of the facts that pertain to a
> given company.  I understand this as meaning I need to sign off on each
> row that was used in answering the query "select * from company left
> join ..."

> An inverted way of thinking about the problem is the notion of getting a
> source document (say, a "company registration form") and parsing and
> storing it in a normalized format.  How do you "tag" the origin of each
> and every row in every table that resulted from that source document?

your form should have a code, either entered by the customer or by the
preparer who enters it into the database, which becomes the key that
identifies the registration document.  Put that key into other tables.

> It is possible to do so by associating an extra column with each
> inserted or modified value (yuck).

be careful, you are flirting with EAV thinking.  I think EAV designs
are terrible.

 > It seems to me that the elegant way to do this would be to get the
> entire graph of dependencies for not only all tables that reference the
> entity, but only those rows within those tables that refer specifically
> to my entity.
>
> The query engine should have a pretty good idea about which tables and
> which rows would actually be used in forming the responses to a given
> query.  How can I get this information?  Or am I off the deep end (-ency
> graph) with this one?

I am not sure where you are going with this.  Maybe you should mock up
some simple tables and repost your question.

merlin

Re: Dependency graph of all tuples relied upon in a query answer

From
Randall Lucas
Date:
On Thu, Aug 31, 2006 at 09:17:54AM -0400, Merlin Moncure wrote:
> On 8/30/06, Randall Lucas <rlucas@tercent.com> wrote:
>
> >An inverted way of thinking about the problem is the notion of getting a
> >source document (say, a "company registration form") and parsing and
> >storing it in a normalized format.  How do you "tag" the origin of each
> >and every row in every table that resulted from that source document?
>
> your form should have a code, either entered by the customer or by the
> preparer who enters it into the database, which becomes the key that
> identifies the registration document.  Put that key into other tables.

Yes -- I have flirted with this idea for the case of importing a set of
facts from a single source.  But where it breaks down is in being able
to do the reverse -- ex post facto, to certify the results of a query
as being accurate, and thereby attesting to the underlying facts.

> be careful, you are flirting with EAV thinking.  I think EAV designs
> are terrible.

(Though religion is not at issue here, I am pretty devoutly relational.
I just want a good way to perform audits and other functions on my
relations.)

> > It seems to me that the elegant way to do this would be to get the
> >entire graph of dependencies for not only all tables that reference the
> >entity, but only those rows within those tables that refer specifically
> >to my entity.
> >
> >The query engine should have a pretty good idea about which tables and
> >which rows would actually be used in forming the responses to a given
> >query.  How can I get this information?  Or am I off the deep end (-ency
> >graph) with this one?
>
> I am not sure where you are going with this.  Maybe you should mock up
> some simple tables and repost your question.

Imagine that I am the SEC.  I look for cases of insider trading, and
when I find a case, I have to go back and double-check all the pieces
of evidence I am using against them.

    create table company (
        id      serial primary key,
        name    text not null
    );

    create table officer (
        id      serial primary key,
        company_id  int not null references company(id) on delete cascade,
        name    text not null,
        title   text not null
    );

    create table insider_trade (
        id      serial primary key,
        officer_id int not null references officer(id) on delete cascade,
        shares_traded   numeric,
        share_price     numeric,
        trade_date      timestamptz
    );

    insert into company (name) values ('goodco');
    insert into company (name) values ('badco');
    insert into officer (company_id, name, title) values (1, 'johnny b. good', 'ceo');
    insert into officer (company_id, name, title) values (1, 'mother teresa', 'saint');
    insert into officer (company_id, name, title) values (2, 'leroy brown', 'ceo');
    insert into insider_trade (officer_id, shares_traded, share_price, trade_date) values
        (3, '50000', '6.66', '2006-07-04 1:23 PM PST');

Now, my database looks like this:

    select * from company left join officer on company.id=officer.company_id left join insider_trade on
officer.id=insider_trade.officer_id;
     id |  name  | id | company_id |      name      | title | id | officer_id | shares_traded | share_price |
trade_date       

----+--------+----+------------+----------------+-------+----+------------+---------------+-------------+------------------------
      1 | goodco |  1 |          1 | johnny b. good | ceo   |    |            |               |             |
      1 | goodco |  2 |          1 | mother teresa  | saint |    |            |               |             |
      2 | badco  |  3 |          2 | leroy brown    | ceo   |  1 |          3 |         50000 |        6.66 |
2006-07-0414:23:00-07 
    (3 rows)

If I want to know whom to investigate, I might do a query like this:

    select * from company left join officer on company.id=officer.company_id left join insider_trade on
officer.id=insider_trade.officer_idwhere insider_trade.id is not null; 
     id | name  | id | company_id |    name     | title | id | officer_id | shares_traded | share_price |
trade_date       

----+-------+----+------------+-------------+-------+----+------------+---------------+-------------+------------------------
      2 | badco |  3 |          2 | leroy brown | ceo   |  1 |          3 |         50000 |        6.66 | 2006-07-04
14:23:00-07
    (1 row)

Now that I have this query, in order to make my case, I need to "sign
off" on all of the individual data that went into it.  I would like to
do something like:

    select last_query_shown_tuples();
     schema |  table_name   | pk_columns | pk_values
    --------+---------------+------------+-----------
     public |   company     |    [id]    |   [2]
     public |   officer     |    [id]    |   [3]
     public | insider_trade |    [id]    |   [1]
     (3 rows)

(I am aware that, at least theoretically, my query is just as reliant
upon the tuples that it *didn't* show, since they were acted upon by
the where clause.)

(I am also aware that this example is sub-optimal in that the primary
keys are all single ints called 'id' and that they are present
individually in the query result -- imagine that I used an aggregate
function in a GROUP BY, for example, and you can see how the ids relied
upon would become opaque, or imagine a multicolumn pk and you
understand my odd notation.)

Finally, I'd want to end up with something like:

    select last_query_shown_tuples();
     schema |  table_name   | pk_columns | pk_values | audited_ts
    --------+---------------+------------+-----------+------------
     public |   company     |    [id]    |   [2]     | 2006-08-31 18:52
     public |   officer     |    [id]    |   [3]     | 2006-08-31 18:55
     public | insider_trade |    [id]    |   [1]     | 2006-08-31 18:57
     (3 rows)

Is something like this even possible, much less doable at present?  I
have an inkling that with all of the capabilities of information_schema,
rules, and triggers, it should be pretty close to doable.

Best,

Randall

--
Randall Lucas       Tercent, Inc.       DF93EAD1

Re: Dependency graph of all tuples relied upon in a query answer

From
"Merlin Moncure"
Date:
On 8/31/06, Randall Lucas <rlucas@tercent.com> wrote:
> Now that I have this query, in order to make my case, I need to "sign
> off" on all of the individual data that went into it.  I would like to
> do something like:
>
>     select last_query_shown_tuples();
>      schema |  table_name   | pk_columns | pk_values
>     --------+---------------+------------+-----------
>      public |   company     |    [id]    |   [2]
>      public |   officer     |    [id]    |   [3]
>      public | insider_trade |    [id]    |   [1]
>      (3 rows)

right.  in sql, except for a few miscellaneous things that are session
based, information state is kept in the tables and if you want to keep
things relational all information should be have a primary key.  Your
last_query_shown_tuples function should be a simple select statement
returning information written with an id based on some criteria.  You
could call this table 'investigation' and put in it the information
necesasry to backtrack to the other tables.

A key tenet of relational thinking is to reduce all information to its
functional dependencies, and to try and avoid as much as possible
keeping information state in the data in a declarative sense.
last_query_shown_tuples() is imo a violation in that sense.  so is
currval() in fact, because given the same database defined by its data
it could give different answers depending on things leading up to how
the data was put there.  therefore, it is a hack, albeit a very useful
one.

One more thing I would like to suggest is to not automatically use a
serial column to define a primary key.  While this is a good idea in
many cases for various reasons, it can lead to lazy thinking and in
extreme cases bad performance.  a good way to approach sequences is to
define the natural key and use the generated ID as an alternative
candidate key if you want to keep things tight.

merlin

Re: Dependency graph of all tuples relied upon in a query answer

From
Randall Lucas
Date:
On Fri, Sep 01, 2006 at 09:30:57AM -0400, Merlin Moncure wrote:
> On 8/31/06, Randall Lucas <rlucas@tercent.com> wrote:
> >Now that I have this query, in order to make my case, I need to "sign
> >off" on all of the individual data that went into it.  I would like to
> >do something like:
> >
> >    select last_query_shown_tuples();
> >     schema |  table_name   | pk_columns | pk_values
> >    --------+---------------+------------+-----------
> >     public |   company     |    [id]    |   [2]
> >     public |   officer     |    [id]    |   [3]
> >     public | insider_trade |    [id]    |   [1]
> >     (3 rows)
>
> right.  in sql, except for a few miscellaneous things that are session
> based, information state is kept in the tables and if you want to keep
> things relational all information should be have a primary key.

Agreed.  As noted in my example above, all of my tables have primary
keys (serials for simplicity, although I represented the pk columns and
values as arrays because of the possibility of other sorts of keys,
like multicolumn).


> A key tenet of relational thinking is to reduce all information to its
> functional dependencies, and to try and avoid as much as possible
> keeping information state in the data in a declarative sense.
> last_query_shown_tuples() is imo a violation in that sense.

I used last_query_shown_tuples() as a quick example.  It seems more
likely that it would be implemented in fact as something like EXPLAIN,
where it acts upon a given query.  Like,

    EXPLAIN DEPENDENCIES SELECT * FROM ...

It seems that the query planner *must* know which rows, from which
tables, actually get used in producing the output for a given query.
Given that this info is present somewhere in the depths, is there a way
to get this information out to the app level?

Best,

Randall

--
Randall Lucas       Tercent, Inc.       DF93EAD1

Re: Dependency graph of all tuples relied upon in a query answer

From
"Merlin Moncure"
Date:
On 9/1/06, Randall Lucas <rlucas@tercent.com> wrote:
> On Fri, Sep 01, 2006 at 09:30:57AM -0400, Merlin Moncure wrote:
> > A key tenet of relational thinking is to reduce all information to its
> > functional dependencies, and to try and avoid as much as possible
> > keeping information state in the data in a declarative sense.
> > last_query_shown_tuples() is imo a violation in that sense.
>
> I used last_query_shown_tuples() as a quick example.  It seems more
> likely that it would be implemented in fact as something like EXPLAIN,
> where it acts upon a given query.  Like,
>
>     EXPLAIN DEPENDENCIES SELECT * FROM ...
>
> It seems that the query planner *must* know which rows, from which
> tables, actually get used in producing the output for a given query.
> Given that this info is present somewhere in the depths, is there a way
> to get this information out to the app level?

not exactly in the sense you are describing, but you can make the
query such that it gives you the knowledge to get back to the original
data and look it up, yes?

merlin