Re: Questions about indexes? - Mailing list pgsql-performance

From Josh Berkus
Subject Re: Questions about indexes?
Date
Msg-id 200302190950.42446.josh@agliodbs.com
In response to Re: Questions about indexes?  (Ryan Bradetich <rbradetich@uswest.net>)
List pgsql-performance
Ryan,

> Posting to the performance list as requested :)  The reason I originally
> posted to the hackers list was that I was curious about the contents of the
> index and how they worked ... but now this thread is more about
> performance, so this list is more appropriate.

*shrug* I just figured that you didn't know about the performance list ...
also, I'm doing my bit to reduce traffic on -hackers.

> Duplicate anomalies are not invalid, they are only invalid if they are
> for the same system, same category, from the same compliancy report.
>
> ie.  This is bad:
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
>
> This is ok:
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.

OK.  Given the necessity of verifying anomaly uniqueness, my suggestions
below change somewhat.

> Please do point out these design errors!  I am always interested in
> learning more about normalization since I do not have any formal DBA
> training, and most of my knowledge is from reading books, mailing lists,
> and experimenting :)

"Practical Issues in Database Management" and "Database Design for Mere
Mortals" are useful.  Me, I learned through 5 years of doing the wrong thing
and finding out why it was wrong ...

> I agree the operations table would be smaller, but you have also added
> some data by breaking it up into 2 tables.  You have an oid (8 bytes) +
> operations.id (4 bytes) + anomalies.id (4 bytes) + operation_id (4
> bytes) + tuple overhead[2] (36 bytes).

Yes, and given your level of traffic, you might have to use 8-byte id fields.
But if disk space is your main issue, then I'd suggest moving the category
code to a "categories" table, allowing you to use an int4 or even an int2 as
the category_id in the Operations table.   This would save you 6-8 bytes per
row in Operations.
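A minimal sketch of that split (table and column names here are illustrative,
not taken from Ryan's actual schema):

```sql
-- Hypothetical schema: pull the repeated category code out into its own
-- table so Operations carries only a small integer key.
CREATE TABLE categories (
    category_id  int2 PRIMARY KEY,             -- 2-byte surrogate key
    category     varchar(16) NOT NULL UNIQUE   -- e.g. 'p101'
);

CREATE TABLE operations (
    id           int4 PRIMARY KEY,
    run_date     timestamp NOT NULL,
    category_id  int2 NOT NULL REFERENCES categories (category_id)
);
```

Each Operations row then stores a 2-byte key in place of the full category
string.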

> > 2) In neither table would you be including the anominaly text in an index
> > ... thus reducing index size tremendously.
>
> Unless I implement Daniel's method of verifying the uniqueness at the data
> insertion point, I will still need to guarantee the anomaly is unique
> for the given operation_id.  If I mis-read the table schema, would you
> please point it out to me .. I'm probably being dense :)
>
> Also, I do not understand why the anomalies table need the id key for
> the primary key.  Wouldn't the operation_id and the anomaly form the
> primary key?  We could save 8 bytes (4 from table + 4 from index) * ~85
> Million rows by removing this column.

As I said above, I didn't understand why you needed to check anomaly
uniqueness.  Given that you do, I'd suggest that you do the above.
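That is, roughly (again with hypothetical names):

```sql
-- Hypothetical: drop the surrogate anomalies.id and let the natural key
-- (operation_id, anomaly) enforce uniqueness directly, saving the extra
-- 4-byte column and its index entry on every row.
CREATE TABLE anomalies (
    operation_id  int4 NOT NULL REFERENCES operations (id),
    anomaly       varchar(1024) NOT NULL,
    PRIMARY KEY (operation_id, anomaly)
);
```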

Of course, you also have another option.  You could check uniqueness on the
operation_id and the md5 of the anomaly field.  This would be somewhat
awkward, and would still require a separate primary key for the
anomalies table.   But the difference between an index on an up-to-1024
character field and one on an md5 string might make it worth it, particularly
when it comes to inserting new rows.

In other words, test:
1) dropping the anomaly_id as suggested above;
2) adding an anomaly_md5 column to the anomalies table;
3) making (operation_id, anomaly_md5) your primary key;
4) writing a BEFORE trigger that calculates the md5 of any incoming anomaly
and adds it to that column.

It's worth testing, since a unique index on a 1024-character field for 85
million records could be very slow.
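A sketch of steps 2-4 might look like this, assuming an md5() function is
available to the backend (via a recent release or the pgcrypto contrib
module); the column, function, and trigger names are my own:

```sql
-- Hypothetical: store an md5 digest of the anomaly text and key on the
-- short digest instead of the full up-to-1024-character string.
ALTER TABLE anomalies ADD COLUMN anomaly_md5 char(32);

CREATE OR REPLACE FUNCTION set_anomaly_md5() RETURNS trigger AS '
BEGIN
    NEW.anomaly_md5 := md5(NEW.anomaly);
    RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER anomalies_md5_trg
    BEFORE INSERT OR UPDATE ON anomalies
    FOR EACH ROW EXECUTE PROCEDURE set_anomaly_md5();

-- Uniqueness is then enforced on the 32-character digest:
ALTER TABLE anomalies
    ADD CONSTRAINT anomalies_pk PRIMARY KEY (operation_id, anomaly_md5);
```

A duplicate anomaly for the same operation then collides on the digest index
at insert time, just as it would on the full-text index, but the index
entries are a fixed 32 characters.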

> Thanks, I will keep this in mind.  Although the inserts are usually done
> in a batch job ... so interactive speed is generally not an issue ...
> but faster is always better :)

In a transaction, I hope.
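i.e., wrapping the whole batch so that each row isn't its own transaction,
along these lines (table and column names hypothetical):

```sql
-- A batch load wrapped in one transaction pays one commit, not one per row.
BEGIN;
INSERT INTO anomalies (operation_id, anomaly)
    VALUES (1, 'user x has an invalid shell.');
INSERT INTO anomalies (operation_id, anomaly)
    VALUES (1, 'user y has an invalid shell.');
-- ... thousands more rows ...
COMMIT;
```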

> Also I am curious ... With the table more nomalized, I now need to
> perform a join when selecting data.... I realize there will be fewer
> pages to read from disk (which is good!) when doing this join, but I
> will interested to see how the join affects the interactive performance
> of the queries.... something to test :)

I'll look forward to seeing the results of your test.
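For what it's worth, a typical query against the normalized schema would
look something like this (hypothetical names again):

```sql
-- The join the normalized schema requires when pulling report rows back out.
SELECT o.run_date, c.category, a.anomaly
FROM anomalies  a
JOIN operations o ON o.id = a.operation_id
JOIN categories c ON c.category_id = o.category_id
WHERE c.category = 'p101';
```

The extra join costs something, but both lookups go through small integer
keys, and the narrower tables mean fewer pages read overall.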

>  If anyone is interested I am willing to
> post the results to this list ... and if anyone has some other benchmark
> suggestions they would like to see, feel free to let me know.

We're always interested in benchmarks.


--
Josh Berkus
Aglio Database Solutions
San Francisco
