Thread: Compare rows

Compare rows

From
Greg Spiegelberg
Date:
All,

Anyone have any suggestions on how to efficiently compare
rows in the same table?  This table has 637 columns to be
compared and 642 total columns.

TIA,
Greg

--
Greg Spiegelberg
  Sr. Product Development Engineer
  Cranel, Incorporated.
  Phone: 614.318.4314
  Fax:   614.431.8388
  Email: gspiegelberg@Cranel.com
Cranel. Technology. Integrity. Focus.



Re: Compare rows

From
Josh Berkus
Date:
Greg,

> Anyone have any suggestions on how to efficiently compare
> rows in the same table?  This table has 637 columns to be
> compared and 642 total columns.

637 columns?   Are you sure that's normalized?   It's hard for me to conceive
of a circumstance where that many columns would be necessary.

If this isn't a catastrophic normalization problem (which it sounds like it is),
then you will probably still need to work through procedural normalization
code, as SQL simply doesn't offer any way around naming all the columns by
hand.   Perhaps you could describe the problem in more detail?

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Compare rows

From
Greg Spiegelberg
Date:
Josh Berkus wrote:
> Greg,
>
>
>>Anyone have any suggestions on how to efficiently compare
>>rows in the same table?  This table has 637 columns to be
>>compared and 642 total columns.
>
>
> 637 columns?   Are you sure that's normalized?   It's hard for me to conceive
> of a circumstance where that many columns would be necessary.
>
> If this isn't a catastrophic normalization problem (which it sounds like),
> then you will probably still need to work through procedureal normalization
> code, as SQL simply doesn't offer any way around naming all the columns by
> hand.   Perhaps you could describe the problem in more detail?
>

The data represents metrics at a point in time on a system for
network, disk, memory, bus, controller, and so-on.  Rx, Tx, errors,
speed, and whatever else can be gathered.

We arrived at this one 642 column table after testing the whole
process from data gathering, methods of temporarily storing then
loading to the database.  Initially, 37+ tables were in use but
the one big-un has saved us over 3.4 minutes.

The reason for my initial question was this.  We save changes only.
In other words, if system S has row T1 for day D1 and if on day D2
we have another row T1 (excluding our time column) we don't want
to save it.

That said, if the 3.4 minutes gets burned during our comparison which
saves changes only we may look at reverting to separate tables.  There
are only 1,700 to 3,000 rows on average per load.

Oh, PostgreSQL 7.3.3, PHP 4.3.1, RedHat 7.3, kernel 2.4.20-18.7smp,
2x1.4GHz PIII, 2GB memory, and 1Gbs SAN w/ Hitachi 9910 LUN's.

Greg

--
Greg Spiegelberg
  Sr. Product Development Engineer
  Cranel, Incorporated.
  Phone: 614.318.4314
  Fax:   614.431.8388
  Email: gspiegelberg@Cranel.com
Cranel. Technology. Integrity. Focus.



Re: Compare rows

From
Shridhar Daithankar
Date:
Greg Spiegelberg wrote:

> The data represents metrics at a point in time on a system for
> network, disk, memory, bus, controller, and so-on.  Rx, Tx, errors,
> speed, and whatever else can be gathered.
>
> We arrived at this one 642 column table after testing the whole
> process from data gathering, methods of temporarily storing then
> loading to the database.  Initially, 37+ tables were in use but
> the one big-un has saved us over 3.4 minutes.

I am sure you changed the design because those 3.4 minutes were significant to you.


But I suggest you go back to the 37-table design and see where the bottleneck is.
You can probably tune a join across 37 tables much better than you can optimize
a comparison between two 637-column rows.

Besides, such a large number of columns will cost heavily in terms of
fragmentation across pages. The wasted space and I/O thereof could be a
significant issue for a large number of rows.

A 642-column table is a bad design, both theoretically and from the point of view
of PostgreSQL's implementation. You did it because of a speed problem. Now if we
can resolve those speed problems, perhaps you could go back to the other design.

Is that feasible for you right now, or are you too committed to the big table?

And of course, then it is a routine PostgreSQL tuning exercise.. :-)

  Shridhar



Re: Compare rows

From
Joe Conway
Date:
Greg Spiegelberg wrote:
> The reason for my initial question was this.  We save changes only.
> In other words, if system S has row T1 for day D1 and if on day D2
> we have another row T1 (excluding our time column) we don't want
> to save it.

It still isn't entirely clear to me what you are trying to do, but
perhaps some sort of calculated checksum or hash would work to determine
if the data has changed?

Joe



Re: Compare rows

From
Josh Berkus
Date:
Greg,

> The data represents metrics at a point in time on a system for
> network, disk, memory, bus, controller, and so-on.  Rx, Tx, errors,
> speed, and whatever else can be gathered.
>
> We arrived at this one 642 column table after testing the whole
> process from data gathering, methods of temporarily storing then
> loading to the database.  Initially, 37+ tables were in use but
> the one big-un has saved us over 3.4 minutes.

Hmmm ... if few of those columns are NULL, then you are probably right ...
this is probably the most normalized design.   If, however, many of the columns
are NULL the majority of the time, then the design you should be using is a
vertical child table, of the form  ( value_type  | value ).

Such a vertical child table would also make your comparison between instances
*much* easier, as it could be executed via a simple 4-table-outer-join and 3
where clauses.  So even if you don't have a lot of NULLs, you probably want
to consider this.
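A sketch of that kind of comparison over a vertical child table (all table,
column, and key names here are hypothetical, as is the system id 42):

```sql
-- Hypothetical layout: child(system_id, day, value_type, value).
-- Compare one system's snapshots on two days; rows come back only for
-- attributes that appeared, disappeared, or changed value.
SELECT coalesce(d1.value_type, d2.value_type) AS value_type,
       d1.value AS old_value,
       d2.value AS new_value
  FROM (SELECT value_type, value FROM child
         WHERE system_id = 42 AND day = '2003-10-01') d1
  FULL OUTER JOIN
       (SELECT value_type, value FROM child
         WHERE system_id = 42 AND day = '2003-10-02') d2
    ON d1.value_type = d2.value_type
 WHERE d1.value IS DISTINCT FROM d2.value;
```

IS DISTINCT FROM keeps the comparison NULL-safe, which matters when an
attribute exists on one day but not the other.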

> The reason for my initial question was this.  We save changes only.
> In other words, if system S has row T1 for day D1 and if on day D2
> we have another row T1 (excluding our time column) we don't want
> to save it.

If re-designing the table per the above is not a possibility, then I'd suggest
that you locate 3-5 columns that:
1) are not NULL for any row;
2) combined, serve to identify a tiny subset of rows, i.e. 3% or less of the
table.

Then put a multi-column index on those columns, and do your comparison.
Hopefully the planner should pick up on the availability of the index and scan
only the rows retrieved by the index.   However, there is the distinct
possibility that the presence of 637 WHERE criteria will confuse the planner,
causing it to resort to a full table seq scan; in that case, you will want to
use a subselect to force the issue.
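A sketch of that approach (the table name, key columns k1..k3, data columns
c1..c3, and literal values are all made up for illustration):

```sql
-- Multi-column index on the always-populated identifying columns.
CREATE INDEX metrics_key_idx ON metrics (system_id, k1, k2, k3);

-- Restrict by the indexed columns in a subselect first, then compare
-- the remaining columns only against that tiny candidate set.
SELECT count(*)
  FROM (SELECT * FROM metrics
         WHERE system_id = 42
           AND k1 = 'eth0' AND k2 = 'rx' AND k3 = 'daily') AS candidate
 WHERE c1 IS DISTINCT FROM 101
    OR c2 IS DISTINCT FROM 202
    OR c3 IS DISTINCT FROM 303;
```

The per-column IS DISTINCT FROM tests are NULL-safe, unlike plain <>.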

Or, as Joe Conway suggested, you could figure out some kind of value hash that
uniquely identifies your rows.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Compare rows

From
Jason Hihn
Date:
Comment interjected below.

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Greg
> Spiegelberg
> Sent: Wednesday, October 08, 2003 12:28 PM
> To: PgSQL Performance ML
> Subject: Re: [PERFORM] Compare rows
>
>
> Josh Berkus wrote:
> > Greg,
> >
> >
> >>Anyone have any suggestions on how to efficiently compare
> >>rows in the same table?  This table has 637 columns to be
> >>compared and 642 total columns.
> >
> >
> > 637 columns?   Are you sure that's normalized?   It's hard for
> me to conceive
> > of a circumstance where that many columns would be necessary.
> >
> > If this isn't a catastrophic normalization problem (which it
> sounds like),
> > then you will probably still need to work through procedureal
> normalization
> > code, as SQL simply doesn't offer any way around naming all the
> columns by
> > hand.   Perhaps you could describe the problem in more detail?
> >
>
> The data represents metrics at a point in time on a system for
> network, disk, memory, bus, controller, and so-on.  Rx, Tx, errors,
> speed, and whatever else can be gathered.
>
> We arrived at this one 642 column table after testing the whole
> process from data gathering, methods of temporarily storing then
> loading to the database.  Initially, 37+ tables were in use but
> the one big-un has saved us over 3.4 minutes.
>
> The reason for my initial question was this.  We save changes only.
> In other words, if system S has row T1 for day D1 and if on day D2
> we have another row T1 (excluding our time column) we don't want
> to save it.

Um, isn't this the purpose of a key? And I am confused: do you want to UPDATE
the changed columns, or skip the row altogether?
You have: (System, Day, T1 | T2 | ... Tn)
But should use:
Master: (System, Day, Table={T1, T2, ... Tn})  [keys: system, day, table]
T1: (System, Day, {other fields})  [foreign key: (system, day)]

This should allow you to find your dupes very fast (indexes!) and save a lot
of space (few/no NULL columns), and now you don't have to worry about
comparing fields or about moving huge result sets around.
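A rough DDL sketch of that layout (table and column names are hypothetical;
one child table per subsystem, keyed the same way as the master):

```sql
-- One master row per (system, day), with per-subsystem child tables
-- that hold only the columns that subsystem actually uses.
CREATE TABLE master (
    system  varchar(32) NOT NULL,
    day     date        NOT NULL,
    PRIMARY KEY (system, day)
);

CREATE TABLE t1_network (
    system  varchar(32) NOT NULL,
    day     date        NOT NULL,
    rx      int8,
    tx      int8,
    errors  int8,
    PRIMARY KEY (system, day),
    FOREIGN KEY (system, day) REFERENCES master (system, day)
);
```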


> That said, if the 3.4 minutes gets burned during our comparison which
> saves changes only we may look at reverting to separate tables.  There
> are only 1,700 to 3,000 rows on average per load.
>
> Oh, PostgreSQL 7.3.3, PHP 4.3.1, RedHat 7.3, kernel 2.4.20-18.7smp,
> 2x1.4GHz PIII, 2GB memory, and 1Gbs SAN w/ Hitachi 9910 LUN's.
>
> Greg
>
> --
> Greg Spiegelberg
>   Sr. Product Development Engineer
>   Cranel, Incorporated.
>   Phone: 614.318.4314
>   Fax:   614.431.8388
>   Email: gspiegelberg@Cranel.com
> Cranel. Technology. Integrity. Focus.
>
>
>


Re: Compare rows

From
Greg Spiegelberg
Date:
See below.


Shridhar Daithankar wrote:
> Greg Spiegelberg wrote:
>
>> The data represents metrics at a point in time on a system for
>> network, disk, memory, bus, controller, and so-on.  Rx, Tx, errors,
>> speed, and whatever else can be gathered.
>>
>> We arrived at this one 642 column table after testing the whole
>> process from data gathering, methods of temporarily storing then
>> loading to the database.  Initially, 37+ tables were in use but
>> the one big-un has saved us over 3.4 minutes.
>
>
> I am sure you changed the design because those 3.4 minutes were
> significant to you.
>
>
> But I suggest you go back to 37 table design and see where bottleneck
> is. Probably you can tune a join across 37 tables much better than
> optimizing a difference between two 637 column rows.

The bottleneck is across the board.

On the data collection side I'd have to manage 37 different methods
and output formats whereas now I have 1 standard associative array
that gets reset in memory for each "row" stored.

On the data validation side, I have one routine to check the incoming
data for errors, missing columns, data types and so on.  Quick & easy.

On the data import side, it's easier and more efficient to do one COPY of
a standard format from one program instead of multiple programs or
COPYs.  We were using 37 PHP scripts to handle the import, and the
time it took to load, execute, exit, and reload each script was killing
us.  Now, 1 PHP and 1 COPY.


> Besides such a large number of columns will cost heavily in terms of
> defragmentation across pages. The wasted space and IO therof could be
> significant issue for large number of rows.

No argument here.


> 642 column is a bad design. Theoretically and from implementation of
> postgresql point of view. You did it because of speed problem. Now if we
> can resolve those speed problems, perhaps you could go back to other
> design.
>
> Is it feasible for you right now or you are too much committed to the
> big table?

Pretty committed, though I do try to be open.

Greg

--
Greg Spiegelberg
  Sr. Product Development Engineer
  Cranel, Incorporated.
  Phone: 614.318.4314
  Fax:   614.431.8388
  Email: gspiegelberg@Cranel.com
Cranel. Technology. Integrity. Focus.



Re: Compare rows

From
Greg Spiegelberg
Date:
Joe Conway wrote:
> Greg Spiegelberg wrote:
>
>> The reason for my initial question was this.  We save changes only.
>> In other words, if system S has row T1 for day D1 and if on day D2
>> we have another row T1 (excluding our time column) we don't want
>> to save it.
>
>
> It still isn't entirely clear to me what you are trying to do, but
> perhaps some sort of calculated checksum or hash would work to determine
> if the data has changed?

Best example I have is this.

You're running Solaris 5.8 with patch 108528-X and you're collecting
that data daily.  Would you want option 1 or 2 below?

Option 1 - Store it all
 Day  |      OS     |   Patch
------+-------------+-----------
Oct 1 | Solaris 5.8 | 108528-12
Oct 2 | Solaris 5.8 | 108528-12
Oct 3 | Solaris 5.8 | 108528-13
Oct 4 | Solaris 5.8 | 108528-13
Oct 5 | Solaris 5.8 | 108528-13
and so on...

To find what you're running:
select * from table order by day desc limit 1;

To find when it last changed, however, takes a join.


Option 2 - Store only changes
 Day  |      OS     |   Patch
------+-------------+-----------
Oct 1 | Solaris 5.8 | 108528-12
Oct 3 | Solaris 5.8 | 108528-13

To find what you're running:
select * from table order by day desc limit 1;

To find when it last changed:
select * from table order by day desc limit 1 offset 1;

I selected Option 2 because I'm dealing with mounds of complicated and
varying data formats and didn't want to have to write complex queries
for everything.

Greg

--
Greg Spiegelberg
  Sr. Product Development Engineer
  Cranel, Incorporated.
  Phone: 614.318.4314
  Fax:   614.431.8388
  Email: gspiegelberg@Cranel.com
Cranel. Technology. Integrity. Focus.



Re: Compare rows

From
Dror Matalon
Date:
It's still not quite clear what you're trying to do. Many people's gut
reaction is that you're doing something strange with so many columns in
a table.

Using your example, a different approach might be to do this instead:

 Day  |     Name    |   Value
 ------+-------------+-----------
 Oct 1 | OS          | Solaris 5.8
 Oct 1 | Patch       | 108528-12
 Oct 3 | Patch       | 108528-13


You end up with lots more rows, fewer columns, but it might be
harder to query the table. On the other hand, queries should run quite
fast, since it's a much more "normal" table.

But without knowing more, and seeing what the other columns look like,
it's hard to tell.

Dror

On Wed, Oct 08, 2003 at 02:39:54PM -0400, Greg Spiegelberg wrote:
> Joe Conway wrote:
> >Greg Spiegelberg wrote:
> >
> >>The reason for my initial question was this.  We save changes only.
> >>In other words, if system S has row T1 for day D1 and if on day D2
> >>we have another row T1 (excluding our time column) we don't want
> >>to save it.
> >
> >
> >It still isn't entirely clear to me what you are trying to do, but
> >perhaps some sort of calculated checksum or hash would work to determine
> >if the data has changed?
>
> Best example I have is this.
>
> You're running Solaris 5.8 with patch 108528-X and you're collecting
> that data daily.  Would you want option 1 or 2 below?
>
> Option 1 - Store it all
>  Day  |      OS     |   Patch
> ------+-------------+-----------
> Oct 1 | Solaris 5.8 | 108528-12
> Oct 2 | Solaris 5.8 | 108528-12
> Oct 3 | Solaris 5.8 | 108528-13
> Oct 4 | Solaris 5.8 | 108528-13
> Oct 5 | Solaris 5.8 | 108528-13
> and so on...
>
> To find what you're running:
> select * from table order by day desc limit 1;
>
> To find when it last changed however takes a join.
>
>
> Option 2 - Store only changes
>  Day  |      OS     |   Patch
> ------+-------------+-----------
> Oct 1 | Solaris 5.8 | 108528-12
> Oct 3 | Solaris 5.8 | 108528-13
>
> To find what you're running:
> select * from table order by day desc limit 1;
>
> To find when it last changed:
> select * from table order by day desc limit 1 offset 1;
>
> I selected Option 2 because I'm dealing with mounds of complicated and
> varying data formats and didn't want to have to write complex queries
> for everything.
>
> Greg
>
> --
> Greg Spiegelberg
>  Sr. Product Development Engineer
>  Cranel, Incorporated.
>  Phone: 614.318.4314
>  Fax:   614.431.8388
>  Email: gspiegelberg@Cranel.com
> Cranel. Technology. Integrity. Focus.
>
>
>

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.zapatec.com

Re: Compare rows

From
Greg Spiegelberg
Date:
Dror,

I gave this some serious thought at first.  I only deal with
int8, numeric(24,12) and varchar(32) columns, which I could
reduce to 3 different tables.  The problem was going from 1,700-3,000
rows to around 300,000-1,000,000 rows per system, per day, sent to
our database.

BTW, the int8 and numeric(24,12) are for future expansion.
I hate limits.

Greg


Dror Matalon wrote:
> It's still not quite clear what you're trying to do. Many people's gut
> reaction is that you're doing something strange with so many columns in
> a table.
>
> Using your example, a different approach might be to do this instead:
>
>  Day  |      Name     |   Value
>  ------+-------------+-----------
>  Oct 1 | OS          | Solaris 5.8
>  Oct 1 | Patch       | 108528-12
>  Oct 3 | Patch       | 108528-13
>
>
> You end up with lots more rows, fewer columns, but it might be
> harder to query the table. On the other hand, queries should run quite
> fast, since it's a much more "normal" table.
>
> But without knowing more, and seeing what the other columns look like,
> it's hard to tell.
>
> Dror


--
Greg Spiegelberg
  Sr. Product Development Engineer
  Cranel, Incorporated.
  Phone: 614.318.4314
  Fax:   614.431.8388
  Email: gspiegelberg@Cranel.com
Cranel. Technology. Integrity. Focus.



Re: Compare rows

From
Greg Spiegelberg
Date:
Josh Berkus wrote:
> Greg,
>
>
>>The data represents metrics at a point in time on a system for
>>network, disk, memory, bus, controller, and so-on.  Rx, Tx, errors,
>>speed, and whatever else can be gathered.
>>
>>We arrived at this one 642 column table after testing the whole
>>process from data gathering, methods of temporarily storing then
>>loading to the database.  Initially, 37+ tables were in use but
>>the one big-un has saved us over 3.4 minutes.
>
>
> Hmmm ... if few of those columns are NULL, then you are probably right ...
> this is probably the most normalized design.   If, however, many of columns
> are NULL the majority of the time, then the design you should be using is a
> vertial child table, of the form  ( value_type  | value ).
>
> Such a vertical child table would also make your comparison between instances
> *much* easier, as it could be executed via a simple 4-table-outer-join and 3
> where clauses.  So even if you don't have a lot of NULLs, you probably want
> to consider this.

You lost me on that one.  What's a "vertical child table"?

Statistically, about 6% of the rows use more than 200 of the columns,
27% of the rows use 80-199 columns, 45% of the rows use 40-79
columns, and the remaining 22% of the rows use 39 or fewer of the columns.
That is a lot of NULLs.  Never gave that much thought.

To ensure query efficiency, hide the NULLs, and simulate the multiple
tables, I have a boatload of indexes, ensure that every query makes use
of an index, and have created 37 views.  It's worked pretty well so
far.


>>The reason for my initial question was this.  We save changes only.
>>In other words, if system S has row T1 for day D1 and if on day D2
>>we have another row T1 (excluding our time column) we don't want
>>to save it.
>
>
> If re-designing the table per the above is not a possibility, then I'd suggest
> that you locate 3-5 columns that:
> 1) are not NULL for any row;
> 2) combined, serve to identify a tiny subset of rows, i.e. 3% or less of the
> table.

There are always, always, always 7 columns that contain data.


> Then put a multi-column index on those columns, and do your comparison.
> Hopefully the planner should pick up on the availablity of the index and scan
> only the rows retrieved by the index.   However, there is the distinct
> possibility that the presence of 637 WHERE criteria will confuse the planner,
> causing it to resort to a full table seq scan; in that case, you will want to
> use a subselect to force the issue.

A big WHERE (c1,c2,...,c637) <> (d1,d2,...,d637) clause is exactly what
I'm trying to avoid.  Ugly.


> Or, as Joe Conway suggested, you could figure out some kind of value hash that
> uniquely identifies your rows.

I've given that some thought and, though appealing, I don't think I'd care
to spend the CPU cycles to do it.  The best way I can figure to accomplish
it would be to generate an MD5 on each row without the timestamp and
store it in another column, create an index on the MD5 column, and generate
an MD5 on each line I want to insert.  Makes for a simple WHERE...
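A sketch of that scheme (table and column names are made up; note that the
built-in md5() SQL function only appeared in PostgreSQL 7.4, so on 7.3 the
hash would be computed in the loader, e.g. with PHP's md5()):

```sql
-- One extra column holding a hash of the 637 compared columns,
-- computed by the loader before INSERT.
ALTER TABLE metrics ADD COLUMN row_md5 char(32);
CREATE INDEX metrics_md5_idx ON metrics (system_id, row_md5);

-- At load time: fetch the most recent hash for this system and
-- insert the new row only if the hashes differ.
SELECT row_md5
  FROM metrics
 WHERE system_id = 42
 ORDER BY day DESC
 LIMIT 1;
```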

Okay.  I'll give it a whirl.  What's one more column, right?

Greg

--
Greg Spiegelberg
  Sr. Product Development Engineer
  Cranel, Incorporated.
  Phone: 614.318.4314
  Fax:   614.431.8388
  Email: gspiegelberg@Cranel.com
Cranel. Technology. Integrity. Focus.



Re: Compare rows

From
Jason Hihn
Date:

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Greg
> Spiegelberg
> Sent: Wednesday, October 08, 2003 3:11 PM
> To: PgSQL Performance ML
> Subject: Re: [PERFORM] Compare rows
>
>
> Josh Berkus wrote:
> > Greg,
> >
> >
> >>The data represents metrics at a point in time on a system for
> >>network, disk, memory, bus, controller, and so-on.  Rx, Tx, errors,
> >>speed, and whatever else can be gathered.
> >>
> >>We arrived at this one 642 column table after testing the whole
> >>process from data gathering, methods of temporarily storing then
> >>loading to the database.  Initially, 37+ tables were in use but
> >>the one big-un has saved us over 3.4 minutes.
> >
> >
> > Hmmm ... if few of those columns are NULL, then you are
> probably right ...
> > this is probably the most normalized design.   If, however,
> many of columns
> > are NULL the majority of the time, then the design you should
> be using is a
> > vertial child table, of the form  ( value_type  | value ).
> >
> > Such a vertical child table would also make your comparison
> between instances
> > *much* easier, as it could be executed via a simple
> 4-table-outer-join and 3
> > where clauses.  So even if you don't have a lot of NULLs, you
> probably want
> > to consider this.
>
> You lost me on that one.  What's a "vertical child table"?

Parent table FKey | Option | Value
------------------+--------+-------
                  | OS     | Solaris
                  | DISK1  | 30g

What would otherwise be field names ('OS', 'DISK1') become values in the
Option column, with their data in the Value column.


> Statistically, about 6% of the rows use more than 200 of the columns,
> 27% of the rows use 80-199 or more columns, 45% of the rows use 40-79
> columns and the remaining 22% of the rows use 39 or less of the columns.
> That is a lot of NULLS.  Never gave that much thought.
>
> To ensure query efficiency, hide the NULLs and simulate the multiple
> tables I have a boatload of indexes, ensure that every query makees use
> of an index, and have created 37 views.  It's worked pretty well so
> far
>
>
> >>The reason for my initial question was this.  We save changes only.
> >>In other words, if system S has row T1 for day D1 and if on day D2
> >>we have another row T1 (excluding our time column) we don't want
> >>to save it.
> >
> >
> > If re-designing the table per the above is not a possibility,
> then I'd suggest
> > that you locate 3-5 columns that:
> > 1) are not NULL for any row;
> > 2) combined, serve to identify a tiny subset of rows, i.e. 3%
> or less of the
> > table.
>
> There are always, always, always 7 columns that contain data.
>
>
> > Then put a multi-column index on those columns, and do your
> comparison.
> > Hopefully the planner should pick up on the availablity of the
> index and scan
> > only the rows retrieved by the index.   However, there is the distinct
> > possibility that the presence of 637 WHERE criteria will
> confuse the planner,
> > causing it to resort to a full table seq scan; in that case,
> you will want to
> > use a subselect to force the issue.
>
> That's what I'm trying to avoid is a big WHERE (c1,c2,...,c637) <>
> (d1,d2,...,d637) clause.  Ugly.
>
>
> > Or, as Joe Conway suggested, you could figure out some kind of
> value hash that
> > uniquely identifies your rows.
>
> I've given that some though and though appealing I don't think I'd care
> to spend the CPU cycles to do it.  Best way I can figure to accomplish
> it would be to generate an MD5 on each row without the timestamp and
> store it in another column, create an index on the MD5 column, generate
> MD5 on each line I want to insert.  Makes for a simple WHERE...
>
> Okay.  I'll give it a whirl.  What's one more column, right?
>
> Greg
>
> --
> Greg Spiegelberg
>   Sr. Product Development Engineer
>   Cranel, Incorporated.
>   Phone: 614.318.4314
>   Fax:   614.431.8388
>   Email: gspiegelberg@Cranel.com
> Cranel. Technology. Integrity. Focus.
>
>
>

Re: Compare rows

From
Jean-Luc Lachance
Date:
Here is what I think you can use:

One master table without duplicates, and one ancillary table with the
duplicates for the day.
Insert the result of the select from the ancillary table into the master
table, then truncate the ancillary table.


select distinct on ( {all the fields except day}) * from table order by
{all the fields except day}, day;

As in:

select distinct on ( OS, Patch) * from table order by OS, Patch, Day;
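The full load cycle described above might look like this (the staging and
master table names are hypothetical):

```sql
-- Load the day's raw rows into the ancillary (staging) table, then keep
-- only the first row per distinct attribute combination and move it over.
INSERT INTO master
SELECT DISTINCT ON (os, patch) *
  FROM staging
 ORDER BY os, patch, day;

TRUNCATE TABLE staging;
```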

JLL

BTW, PG developers: since the DISTINCT ON list MUST be included in the
ORDER BY clause, why not make it implicitly part of the ORDER BY clause?



Greg Spiegelberg wrote:
>
> Joe Conway wrote:
> > Greg Spiegelberg wrote:
> >
> >> The reason for my initial question was this.  We save changes only.
> >> In other words, if system S has row T1 for day D1 and if on day D2
> >> we have another row T1 (excluding our time column) we don't want
> >> to save it.
> >
> >
> > It still isn't entirely clear to me what you are trying to do, but
> > perhaps some sort of calculated checksum or hash would work to determine
> > if the data has changed?
>
> Best example I have is this.
>
> You're running Solaris 5.8 with patch 108528-X and you're collecting
> that data daily.  Would you want option 1 or 2 below?
>
> Option 1 - Store it all
>   Day  |      OS     |   Patch
> ------+-------------+-----------
> Oct 1 | Solaris 5.8 | 108528-12
> Oct 2 | Solaris 5.8 | 108528-12
> Oct 3 | Solaris 5.8 | 108528-13
> Oct 4 | Solaris 5.8 | 108528-13
> Oct 5 | Solaris 5.8 | 108528-13
> and so on...
>
> To find what you're running:
> select * from table order by day desc limit 1;
>
> To find when it last changed however takes a join.
>
> Option 2 - Store only changes
>   Day  |      OS     |   Patch
> ------+-------------+-----------
> Oct 1 | Solaris 5.8 | 108528-12
> Oct 3 | Solaris 5.8 | 108528-13
>
> To find what you're running:
> select * from table order by day desc limit 1;
>
> To find when it last changed:
> select * from table order by day desc limit 1 offset 1;
>
> I selected Option 2 because I'm dealing with mounds of complicated and
> varying data formats and didn't want to have to write complex queries
> for everything.
>
> Greg
>
> --
> Greg Spiegelberg
>   Sr. Product Development Engineer
>   Cranel, Incorporated.
>   Phone: 614.318.4314
>   Fax:   614.431.8388
>   Email: gspiegelberg@Cranel.com
> Cranel. Technology. Integrity. Focus.
>

Re: Compare rows

From
Dror Matalon
Date:
Greg,

On Wed, Oct 08, 2003 at 03:07:30PM -0400, Greg Spiegelberg wrote:
> Dror,
>
> I gave this some serious thought at first.  I only deal with
> int8, numeric(24,12) and varchar(32) columns which I could
> reduce to 3 different tables.  Problem was going from 1700-3000

I'm not sure how the data types come into play here. I was for the most
part following your examples.

> rows to around 300,000-1,000,000 rows per system per day that
> is sending data to our database.
>

Depending on the distribution of your data you can end up with more,
less or roughly the same amount of data in the end. It all depends on
how many of the 600+ columns change every time you insert a row. If only
a few of them do, then you'll clearly end up with less total data, since
you'll be writing several rows that are very short instead of one
huge row that contains all the information. In other words, you're
tracking changes better.

It also sounds like you feel that having a few thousand rows in a very
"wide" table is better than having 300,000 - 1,000,000 rows in a "narrow"
table.  My gut feeling is that it's the other way around, but there are
plenty of people on this list who can provide a more informed answer.

Using the above example, assuming that both tables have roughly the same
number of pages in them, would postgres deal better with a table with
3-4 columns and 300,000 - 1,000,000 rows, or with a table with several
hundred columns and only 3,000 or so rows?

Regards,

Dror


> BTW, the int8 and numeric(24,12) are for future expansion.
> I hate limits.
>
> Greg
>
>
> Dror Matalon wrote:
> >It's still not quite clear what you're trying to do. Many people's gut
> >reaction is that you're doing something strange with so many columns in
> >a table.
> >
> >Using your example, a different approach might be to do this instead:
> >
> > Day  |      Name     |   Value
> > ------+-------------+-----------
> > Oct 1 | OS          | Solaris 5.8
> > Oct 1 | Patch       | 108528-12
> > Oct 3 | Patch       | 108528-13
> >
> >
> >You end up with lots more rows, fewer columns, but it might be
> >harder to query the table. On the other hand, queries should run quite
> >fast, since it's a much more "normal" table.
> >
> >But without knowing more, and seeing what the other columns look like,
> >it's hard to tell.
> >
> >Dror
>
>
> --
> Greg Spiegelberg
>  Sr. Product Development Engineer
>  Cranel, Incorporated.
>  Phone: 614.318.4314
>  Fax:   614.431.8388
>  Email: gspiegelberg@Cranel.com
> Cranel. Technology. Integrity. Focus.
>
>
>

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.zapatec.com

Re: Compare rows

From
Josh Berkus
Date:
Greg,

> You lost me on that one.  What's a "vertical child table"?

Currently, you store data like this:

id  address  uptime  speed  memory  tty
3   67.92    0.3     11.2   37      6
7   69.5     1.1     NULL   15      NULL
9   65.5     0.1     NULL   94      2

The most efficient way for you to store data would be like this:

main table
id  address
3   67.92
7   69.5
9   65.5

child table
id  value_type  value
3   uptime      0.3
3   speed       11.2
3   memory      37
3   tty         6
7   uptime      1.1
7   memory      15
9   uptime      0.1
9   memory      94
9   tty         2

As you can see, the NULLs are not stored, making this system much more
efficient on storage space.

Tomorrow I'll (hopefully) write up how to query this for comparisons.   It
would help if you gave a little more detail about what specific comparison
you're doing, e.g. between tables or table to value, comparing just the last
value or all rows, etc.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Compare rows

From
Christopher Browne
Date:
In an attempt to throw the authorities off his trail, josh@agliodbs.com (Josh Berkus) transmitted:
> child table
> id   value_type   value
> 3    uptime       0.3
> 3    speed        11.2
> 3    memory       37
> 3    tty          6
> 7    uptime       1.1
> 7    memory       15
> 9    uptime       0.1
> 9    memory       94
> 9    tty          2
>
> As you can see, the NULLs are not stored, making this system much more
> efficient on storage space.

Wow, that takes me back to a paper I have been looking for for
_years_.

Some time in the late '80s, probably '88 or '89, there was a paper
presented in Communications of the ACM that proposed using this sort
of "hypernormalized" schema as a way of having _really_ narrow schemas
that would be exceedingly expressive.  They illustrated an example of
an address table that could hold full addresses with a schema with
only about half a dozen columns, the idea being that you'd have
several rows linked together.

The methodology was _heavy_ on metadata, though not so much so that
there were no columns left over for "real" data.

The entertaining claim was that they felt they could model the
complexities of the operations of any sort of company using not more
than 50 tables.  It seemed somewhat interesting, at the time; it truly
resonated as Really Interesting when I saw SAP R/3, with its bloat of
1500-odd tables.

(I seem to remember the authors being Boston-based, and they indicated
that they had implemented this "on VMS," which would more than likely
imply RDB; somehow I doubt that'll be the set of detail that makes
someone remember it...)

The need to do a lot of joins would likely hurt performance somewhat,
as well as the way that it greatly increases the number of rows.
Although you could always split it into several tables, one for each
"value_type", and UNION them into a view...
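That split-and-UNION idea could be sketched like this (table names are
invented for illustration):

```sql
-- One narrow table per value_type, recombined through a view so
-- queries can still see a single child relation.
CREATE TABLE uptime (id integer, value numeric);
CREATE TABLE memory (id integer, value numeric);

CREATE VIEW child AS
    SELECT id, 'uptime'::varchar AS value_type, value FROM uptime
    UNION ALL
    SELECT id, 'memory'::varchar AS value_type, value FROM memory;
```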
--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','acm.org').
http://cbbrowne.com/info/unix.html
You shouldn't anthropomorphize computers; they don't like it.

Re: Compare rows

From
Josh Berkus
Date:
Chris,

> Some time in the late '80s, probably '88 or '89, there was a paper
> presented in Communications of the ACM that proposed using this sort
> of "hypernormalized" schema as a way of having _really_ narrow schemas
> that would be exceedingly expressive.  They illustrated an example of
<snip>
> The entertaining claim was that they felt they could model the
> complexities of the operations of any sort of company using not more
> than 50 tables.  It seemed somewhat interesting, at the time; it truly
> resonated as Really Interesting when I saw SAP R/3, with its bloat of
> 1500-odd tables.

One can always take things too far.   Trying to make everything 100% dynamic
so that you can cram your whole database into 4 tables is going too far; so is
the kind of bloat that produces systems like SAP, which is based more on
legacy than on design.   (I analyzed a large commercial billing system once
and was startled to discover that 1/4 of its 400 tables, and almost half of
its 40,000 collective columns, were unused and present only for backward
compatibility.)

The usefulness of the "vertical values child table" I suggest is largely
dependent on the number of values not represented.   In Greg's case, fully
75% of the fields in his huge table are NULL; this is incredibly inefficient,
the more so when you consider his task of calling each field by name in each
query.

The "vertical values child table" is also ideal for User Defined Fields or any
other form of user-configurable add-on data which will be NULL more often
than not.

This is an old SQL concept, though; I'm sure it has an official name
somewhere.

> The need to do a lot of joins would likely hurt performance somewhat,
> as well as the way that it greatly increases the number of rows.
> Although you could always split it into several tables, one for each
> "value_type", and UNION them into a view...

It increases the number of rows, yes, but *decreases* the storage size of data
by eliminating thousands ... or millions ... of NULL fields.   How would
splitting the vertical values into dozens of separate tables help things?

Personally, I'd rather have a table with 3 columns and 8 million rows than a
table with 642 columns and 100,000 rows.  Much easier to deal with.

And we are also assuming that Greg seldom needs to see all of the fields at
once.   I'm pretty sure of this; if he did, he'd have run into the "wide row"
bug in 7.3 and would be complaining about it.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Compare rows

From
Christopher Browne
Date:
The world rejoiced as josh@agliodbs.com (Josh Berkus) wrote:
> Chris,
>> Some time in the late '80s, probably '88 or '89, there was a paper
>> presented in Communications of the ACM that proposed using this sort
>> of "hypernormalized" schema as a way of having _really_ narrow schemas
>> that would be exceedingly expressive.  They illustrated an example of
> <snip>
>> The entertaining claim was that they felt they could model the
>> complexities of the operations of any sort of company using not
>> more than 50 tables.  It seemed somewhat interesting, at the time;
>> it truly resonated as Really Interesting when I saw SAP R/3, with
>> its bloat of 1500-odd tables.
>
> One can always take things too far.  Trying to make everything 100%
> dynamic so that you can cram your whole database into 4 tables is
> going too far; so is the kind of bloat that produces systems like
> SAP, which is more based on legacy than design (I analyzed a large
> commercial billing system once and was startled to discover that 1/4
> of its 400 tables and almost half of the 40,000 collective columns
> were not used and present only for backward compatibility).

With R/3, the problem is that there are hundreds (now thousands) of
developers trying to coexist on the same code base, with the result that
tables containing nearly-the-same fields are strewn all over.

It's _possible_ that the design I saw amounted to nothing more than a
clever hack for implementing LDAP atop a relational database, but they
seemed to have something slightly more to say than that.
--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','ntlug.org').
http://www3.sympatico.ca/cbbrowne/emacs.html
Why does the word "lisp" have an "s" in it?

Re: Compare rows

From
Greg Spiegelberg
Date:
Christopher Browne wrote:
>
> Wow, that takes me back to a paper I have been looking for for
> _years_.
>
> Some time in the late '80s, probably '88 or '89, there was a paper
> presented in Communications of the ACM that proposed using this sort
> of "hypernormalized" schema as a way of having _really_ narrow schemas
> that would be exceedingly expressive.  They illustrated an example of
> an address table that could hold full addresses with a schema with
> only about half a dozen columns, the idea being that you'd have
> several rows linked together.

I'd be interested in the title / author when you remember them.
I'm kinda sick that way.  I like reading about most computer theory,
designs, algorithms, database implementations, etc.  That's usually
how I get into trouble with 642-column tables, though. :)

--
Greg Spiegelberg
  Sr. Product Development Engineer
  Cranel, Incorporated.
  Phone: 614.318.4314
  Fax:   614.431.8388
  Email: gspiegelberg@Cranel.com
Cranel. Technology. Integrity. Focus.



Re: Compare rows

From
Greg Spiegelberg
Date:
Josh Berkus wrote:
> Greg,
>
>
>>You lost me on that one.  What's a "vertical child table"?
>
>
> Currently, you store data like this:
>
> id   address   uptime   speed   memory   tty
> 3    67.92     0.3      11.2    37       6
> 7    69.5      1.1      NULL    15       NULL
> 9    65.5      0.1      NULL    94       2
>
> The most efficient way for you to store data would be like this:
>
> main table
> id   address
> 3    67.92
> 7    69.5
> 9    65.5
>
> child table
> id   value_type   value
> 3    uptime       0.3
> 3    speed        11.2
> 3    memory       37
> 3    tty          6
> 7    uptime       1.1
> 7    memory       15
> 9    uptime       0.1
> 9    memory       94
> 9    tty          2
>
> As you can see, the NULLs are not stored, making this system much more
> efficient on storage space.
>
> Tomorrow I'll (hopefully) write up how to query this for comparisons.   It
> would help if you gave a little more details about what specific comparison
> you're doing, e.g. between tables or table to value, comparing just the last
> value or all rows, etc.
>

Got it.  I can see how it would be more efficient for storage.  At this
point it would require a lot of query and code rewrites to handle it.
Fortunately, we're looking for alternatives for the next revision, and
we're leaving ourselves open to a rewrite, much to the boss's chagrin.

I will be spinning up a test server soon and may attempt a quick
implementation.  I may make value_type a foreign key to a table that
includes a full and/or brief description of the key.  The problem I'll have
then will be categorizing all those keys into disk, cpu, memory, user,
and all the other data categories, since it's one big table rather
than specialized tables.
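That foreign-key idea might be sketched like this, with a category
column to handle the disk/cpu/memory grouping (all names here are
hypothetical):

```sql
CREATE TABLE value_types (
    value_type  varchar(32) PRIMARY KEY,
    category    varchar(16) NOT NULL,  -- 'disk', 'cpu', 'memory', ...
    description text
);

CREATE TABLE child (
    id         integer,
    value_type varchar(32) REFERENCES value_types (value_type),
    value      numeric
);

-- All disk-related metrics for one system:
SELECT c.value_type, c.value
  FROM child c
  JOIN value_types vt ON vt.value_type = c.value_type
 WHERE c.id = 3 AND vt.category = 'disk';
```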

Greg

--
Greg Spiegelberg
  Sr. Product Development Engineer
  Cranel, Incorporated.
  Phone: 614.318.4314
  Fax:   614.431.8388
  Email: gspiegelberg@Cranel.com
Cranel. Technology. Integrity. Focus.



Re: Compare rows

From
Hannu Krosing
Date:
Josh Berkus kirjutas N, 09.10.2003 kell 08:36:
> Chris,

> > The need to do a lot of joins would likely hurt performance somewhat,
> > as well as the way that it greatly increases the number of rows.
> > Although you could always split it into several tables, one for each
> > "value_type", and UNION them into a view...
>
> It increases the number of rows, yes, but *decreases* the storage size of data
> by eliminating thousands ... or millions ... of NULL fields.

I'm not sure I buy that.

NULL fields take exactly 1 *bit* to store (more exactly, if you have
any NULL fields in a tuple, then one 32-bit int is used as a NULL bitmap
for each 32 fields), whereas the same field in a "vertical" table takes 4
bytes for the primary key and 1-4 bytes for the category key, plus a tuple
header per value, plus the necessary indexes. So if you have more than one
non-NULL field per tuple you will certainly lose on storage.

> How would splitting the vertical values into dozens of separate tables help things?

If you put each category in a separate table you save the 1-4 bytes for
the category per value, but you still store the primary key and a tuple
header *per value*.

You may still get better performance for single-column comparisons,
though, as fewer pages must be touched.
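As a back-of-envelope illustration of the trade-off (the tuple header
size is approximate and version-dependent):

```
Wide table, 642 columns:
  NULL bitmap = ceil(642 / 32) * 4 bytes = 84 bytes per row, total,
  no matter how many of the columns are NULL.

Vertical table, per non-NULL value:
  4 bytes (id) + 1-4 bytes (category) + a few dozen bytes of tuple
  header, plus index entries -- i.e. tens of bytes of overhead for
  *each* value stored.
```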

> Personally, I'd rather have a table with 3 columns and 8 million rows than a
> table with 642 columns and 100,000 rows.  Much easier to deal with.

Same here ;)

------------------
Hannu


Re: Compare rows, SEMI-SUMMARY

From
Greg Spiegelberg
Date:
Per Josh's recommendation to implement a Vertical Child Table, I came
up with 3 possible tables to handle the 3 possible value types: varchar,
numeric and bigint.  Each table has 7 columns: 1 to denote the time the
data was collected, 4 which identify where the data came from, 1 for the
value name, and 1 for the value itself.

                 OLD          NEW
tables           1            3
columns          642          7 each
indexes          ~1200        39
views            37           ?
rows             1700-3000    30,000
query on table   0.01 sec     0.06 sec
query on view    0.02 sec     ?

Not too bad.  Guess there were a few 0's and NULL's out there, eh?

642 * 1,700    = 1,091,400 cells
3 * 7 * 30,000 =   630,000 cells
                   461,400 NULL's and 0's using the big 'ol table

I can get around in this setup; however, I would appreciate some help
in recreating my views.  The views used to be there simply as an initial
filter and to hide all the 0's and NULL's.  If I can't do this I will
be revisiting and testing possibly hundreds of programs and scripts.
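One way to rebuild a wide, filtered view on top of the vertical tables
is a CASE-based pivot; the table, column, and metric names below are
invented for illustration:

```sql
-- Pivot selected metrics from the numeric child table back into
-- columns, one row per system per collection time.
CREATE VIEW disk_view AS
SELECT collected_at, system_id,
       max(CASE WHEN value_name = 'disk_rx' THEN value END) AS disk_rx,
       max(CASE WHEN value_name = 'disk_tx' THEN value END) AS disk_tx
  FROM numeric_values
 WHERE value_name IN ('disk_rx', 'disk_tx')
 GROUP BY collected_at, system_id;
```

Rows that carried only 0's or NULL's simply never appear in the child
table, so the view filters them for free.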

Any takers?

Greg

--
Greg Spiegelberg
  Sr. Product Development Engineer
  Cranel, Incorporated.
  Phone: 614.318.4314
  Fax:   614.431.8388
  Email: gspiegelberg@Cranel.com
Cranel. Technology. Integrity. Focus.



Re: Compare rows

From
Gaetano Mendola
Date:
Greg Spiegelberg wrote:

> Josh Berkus wrote:
>
>>
>> As you can see, the NULLs are not stored, making this system much more
>> efficient on storage space.
>>
>> Tommorrow I'll (hopefully) write up how to query this for
>> comparisons.   It would help if you gave a little more details about
>> what specific comparison you're doing, e.g. between tables or table to
>> value, comparing just the last value or all rows, etc.
>>
>
> Got it.  I can see how it would be more efficient in storing.  At this
> point it would require a lot of query and code rewrites to handle it.
> Fortunately, we're looking for alternatives for the next revision and
> we're leaving ourselves open for a rewrite much to the boss's chagrin.

I'm not sure about the savings in storage.  See Hannu Krosing's
arguments.


Regards
Gaetano Mendola


Re: Compare rows

From
Thomas Swan
Date:
I took this approach with a former company in designing a dynamic
e-commerce system.   This kept the addition of new products from
requiring an alteration of the schema.   With an ORB manager and cache
control the performance was not significantly affected, but the automatic
extensibility and the ease of maintainability were greatly enhanced.

Thomas


Jason Hihn wrote:

>
>
>>-----Original Message-----
>>From: pgsql-performance-owner@postgresql.org
>>[mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Greg
>>Spiegelberg
>>Sent: Wednesday, October 08, 2003 3:11 PM
>>To: PgSQL Performance ML
>>Subject: Re: [PERFORM] Compare rows
>>
>>
>>Josh Berkus wrote:
>>
>>
>>>Greg,
>>>
>>>
>>>
>>>
>>>>The data represents metrics at a point in time on a system for
>>>>network, disk, memory, bus, controller, and so-on.  Rx, Tx, errors,
>>>>speed, and whatever else can be gathered.
>>>>
>>>>We arrived at this one 642 column table after testing the whole
>>>>process from data gathering, methods of temporarily storing then
>>>>loading to the database.  Initially, 37+ tables were in use but
>>>>the one big-un has saved us over 3.4 minutes.
>>>>
>>>>
>>>Hmmm ... if few of those columns are NULL, then you are
>>>
>>>
>>probably right ...
>>
>>
>>>this is probably the most normalized design.   If, however,
>>>
>>>
>>many of columns
>>
>>
>>>are NULL the majority of the time, then the design you should
>>>
>>>
>>be using is a
>>
>>
>>>vertical child table, of the form  ( value_type  | value ).
>>>
>>>Such a vertical child table would also make your comparison
>>>
>>>
>>between instances
>>
>>
>>>*much* easier, as it could be executed via a simple
>>>
>>>
>>4-table-outer-join and 3
>>
>>
>>>where clauses.  So even if you don't have a lot of NULLs, you
>>>
>>>
>>probably want
>>
>>
>>>to consider this.
>>>
>>>
>>You lost me on that one.  What's a "vertical child table"?
>>
>>
>
>Parent table Fkey | Option | Value
>------------------+--------+-------
>                  | OS     | Solaris
>                  | DISK1  | 30g
>                   ^^^^^^^^   ^^^-- values
>                      fields are values in a column rather than 'fields'
>
>
>
>
>>Statistically, about 6% of the rows use more than 200 of the columns,
>>27% of the rows use 80-199 or more columns, 45% of the rows use 40-79
>>columns and the remaining 22% of the rows use 39 or less of the columns.
>>That is a lot of NULLS.  Never gave that much thought.
>>
>>To ensure query efficiency, hide the NULLs, and simulate the multiple
>>tables, I have a boatload of indexes, ensure that every query makes use
>>of an index, and have created 37 views.  It's worked pretty well so
>>far.
>>
>>
>>
>>
>>>>The reason for my initial question was this.  We save changes only.
>>>>In other words, if system S has row T1 for day D1 and if on day D2
>>>>we have another row T1 (excluding our time column) we don't want
>>>>to save it.
>>>>
>>>>
>>>If re-designing the table per the above is not a possibility,
>>>
>>>
>>then I'd suggest
>>
>>
>>>that you locate 3-5 columns that:
>>>1) are not NULL for any row;
>>>2) combined, serve to identify a tiny subset of rows, i.e. 3%
>>>
>>>
>>or less of the
>>
>>
>>>table.
>>>
>>>
>>There are always, always, always 7 columns that contain data.
>>
>>
>>
>>
>>>Then put a multi-column index on those columns, and do your
>>>
>>>
>>comparison.
>>
>>
>>>Hopefully the planner should pick up on the availablity of the
>>>
>>>
>>index and scan
>>
>>
>>>only the rows retrieved by the index.   However, there is the distinct
>>>possibility that the presence of 637 WHERE criteria will
>>>
>>>
>>confuse the planner,
>>
>>
>>>causing it to resort to a full table seq scan; in that case,
>>>
>>>
>>you will want to
>>
>>
>>>use a subselect to force the issue.
>>>
>>>
>>That's what I'm trying to avoid is a big WHERE (c1,c2,...,c637) <>
>>(d1,d2,...,d637) clause.  Ugly.
>>
>>
>>
>>
>>>Or, as Joe Conway suggested, you could figure out some kind of
>>>
>>>
>>value hash that
>>
>>
>>>uniquely identifies your rows.
>>>
>>>
>>I've given that some thought and, though appealing, I don't think I'd care
>>to spend the CPU cycles to do it.  The best way I can figure to accomplish
>>it would be to generate an MD5 on each row without the timestamp and
>>store it in another column, create an index on the MD5 column, and generate
>>an MD5 on each line I want to insert.  Makes for a simple WHERE...
>>
>>Okay.  I'll give it a whirl.  What's one more column, right?
>>
>>Greg
>>
>>--
>>Greg Spiegelberg
>>  Sr. Product Development Engineer
>>  Cranel, Incorporated.
>>  Phone: 614.318.4314
>>  Fax:   614.431.8388
>>  Email: gspiegelberg@Cranel.com
>>Cranel. Technology. Integrity. Focus.
>>
>>
>>
>>---------------------------(end of broadcast)---------------------------
>>TIP 2: you can get off all lists at once with the unregister command
>>    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>>
>>
>>
>
>
>
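The MD5 change-detection idea quoted above could be sketched roughly
like this; the table and column names are invented, and md5() is built
in only from PostgreSQL 7.4 (on 7.3, contrib/pgcrypto's digest() can
serve instead):

```sql
-- Hypothetical metrics table gains a hash column plus an index on it.
ALTER TABLE metrics ADD COLUMN row_md5 char(32);
CREATE INDEX metrics_row_md5_idx ON metrics (row_md5);

-- Before inserting, check whether an identical row (timestamp
-- excluded from the hash) already exists:
SELECT 1
  FROM metrics
 WHERE system_id = 42
   AND row_md5 = md5('...concatenated non-timestamp columns...');
```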



Re: Compare rows

From
greg@turnstep.com
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> The most efficient way for you to store data would be like this:
> main table
> id  address
> 3   67.92
> 7   69.5
>
> child table
> id  value_type  value
> 3   uptime      0.3
> 3   memory      37
> 7   uptime      1.1
> 7   memory      15

Actually, a more efficient* way is this:

value table
vid value_name
1  uptime
2  memory

child table
id   vid   value
3     1     0.3
3     2     37
7     1     1.1
7     2     15


* Still not necessarily the *most* efficient, depending on how the
values are distributed, but it sure beats storing "uptime" over
and over again. :)
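Resolving the names back out is then a simple join, assuming the two
tables above are named value_table and child (a sketch):

```sql
SELECT c.id, v.value_name, c.value
  FROM child c
  JOIN value_table v ON v.vid = c.vid
 WHERE v.value_name = 'uptime';
```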


- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200310101243
-----BEGIN PGP SIGNATURE-----
Comment: http://www.turnstep.com/pgp.html

iD8DBQE/huHxvJuQZxSWSsgRAiMNAKD4kQCwdv3fXyEFUu64mymtf567dwCcCKd5
ZzJaV7wjfs00DBT62bVpHhs=
=32b8
-----END PGP SIGNATURE-----