Thread: Questions about indexes?

Questions about indexes?

From
Ryan Bradetich
Date:
Hello postgres hackers,

Been a while since I have participated on this list ... but I have a new
itch to scratch....

Although the table schema is immaterial, I will provide it so we have a
common framework for this discussion:

    host_id      integer      (not null)
    timestamp    datetime     (not null)
    category     text         (not null)    [<= 5 chars]
    anomaly      text         (not null)    [<= 1024 chars]

This table is used to store archived data, so each row in the table must
be unique.  Currently I am using a primary key across each column to
enforce this uniqueness.  This table currently has ~86 million rows and
is 16+ GB in size.  The primary key index is also 16+ GB in size, because
it appears all the data is duplicated in the index.  (I have only done
some preliminary looking at the database files with strings, etc., so
this assumption is purely based on those observations.)

I am not sure why all the data is duplicated in the index ... but I bet
it has to do with performance, since it would save a lookup in the main
table.  Are there any benchmarks or papers related to this topic I should
locate and read?  I am curious because it seems the only advantage gained
is searching the index for the specified values....  Once the entry is
found, the full row needs to be pulled from the main table anyhow, since
the index does not contain all the data.  Also, with the increased size,
it seems additional pressure would be put on the shared memory caches
(no idea how this really works, just guessing! :))


Since my only requirement is that the rows be unique, I have developed a
custom MD5 function in C, and created an index on the MD5 hash of the
concatenation of all the fields.  This has reduced the disk space usage
considerably, as shown below against my test database of ~6 million rows
at 1+ GB.

All this data is based off the test database running 7.3.2:
    Type                  Size
    -----------------------------------
    Database table        1188642816
    All columns pkey      1510252544
    MD5 columns pkey       370999296

Just using the MD5 hash data instead of all the columns is a considerable
disk space win, going from 1.5 GB to 370 MB.
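A rough sketch of one way to get that effect on a reasonably current
PostgreSQL, using the built-in md5() function rather than a custom C function
(the table name anomaly_archive is assumed, it is not given in the thread;
here the hash is kept in its own column and only that column carries the
unique index):

    -- Sketch only: add a hash column, populate it, and enforce uniqueness on it.
    ALTER TABLE anomaly_archive ADD COLUMN row_md5 text;

    UPDATE anomaly_archive
       SET row_md5 = md5(host_id::text || '|' || "timestamp"::text || '|'
                         || category || '|' || anomaly);

    CREATE UNIQUE INDEX anomaly_archive_md5_key
        ON anomaly_archive (row_md5);

New rows would need to supply row_md5 the same way (or have a trigger fill it
in); the '|' separator is only there so adjacent fields cannot run together
and produce accidental hash matches.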

Has anyone else solved this problem?  Has anyone else looked into
something like this and would you mind sharing, so I do not have to
re-invent the wheel? :)  Also (assuming there are no papers / benchmarks
proving that data in the index is a good idea), how difficult would it be
to implement an index type that extracts the data from the main table?


Thanks for reading.  I will be happy to field any questions that I can,
or read any papers, research, etc. that relate to this topic.

- Ryan

P.S. the production database is running 7.2.4 if that makes a
difference.

-- 
Ryan Bradetich <rbradetich@uswest.net>



Re: Questions about indexes?

From
Tom Lane
Date:
Ryan Bradetich <rbradetich@uswest.net> writes:
> Although the table schema is immaterial, I will provide it so we have a
> common framework for this discussion:

>     host_id        integer     (not null)
>     timestamp    datetime    (not null)
>     category    text        (not null)    [<=    5 chars]
>     anomaly        text        (not null)    [<= 1024 chars]

> This table is used to store archived data, so each row in the table must
> be unique.  Currently I am using a primary key across each column to
> enforce this uniqueness.

It's not real clear to me why you bother enforcing a constraint that the
complete row be unique.  Wouldn't a useful constraint be that the first
three columns be unique?  Even if that's not correct, what's wrong with
tolerating a few duplicates?  You can't tell me it's to save on storage
;-)
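For concreteness, a minimal sketch of that weaker constraint (the table name
anomaly_archive is assumed; the thread does not give one):

    -- Sketch only: uniqueness on the first three columns, leaving the
    -- anomaly text out of the index entirely.
    ALTER TABLE anomaly_archive
        ADD CONSTRAINT anomaly_archive_key UNIQUE (host_id, "timestamp", category);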

> I am not sure why all the data is duplicated in the index ... but I bet
> it has to do with performance since it would save a lookup in the main
> table.

An index that can't prevent looking into the main table wouldn't be
worth anything AFAICS ...
        regards, tom lane


Re: Questions about indexes?

From
Ryan Bradetich
Date:
On Sun, 2003-02-16 at 23:34, Tom Lane wrote:
> Ryan Bradetich <rbradetich@uswest.net> writes:
> > Although the table schema is immaterial, I will provide it so we have a
> > common framework for this discussion:
> 
> >     host_id        integer     (not null)
> >     timestamp    datetime    (not null)
> >     category    text        (not null)    [<=    5 chars]
> >     anomaly        text        (not null)    [<= 1024 chars]
> 
> > This table is used to store archived data, so each row in the table must
> > be unique.  Currently I am using a primary key across each column to
> > enforce this uniqueness.
> 
> It's not real clear to me why you bother enforcing a constraint that the
> complete row be unique.  Wouldn't a useful constraint be that the first
> three columns be unique?  Even if that's not correct, what's wrong with
> tolerating a few duplicates?  You can't tell me it's to save on storage
> ;-)

The table holds system policy compliance data.  The category is
basically the policy, and the anomaly is the detailed text explaining
why the system is out of compliance.  So the anomaly data is important
(and often the reason why the key is unique).  The reason we are
archiving the data is to generate reports and graphs showing policy
compliance over time.  Duplicated rows will artificially inflate the
numbers in the reports and graphs.  The other option we had was to
perform a DISTINCT select at report / graph time; we chose not to go
that route because of the sort it adds to the query.  (Also it just
seemed tidier to only store good data :))
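For reference, a sketch of the sort of DISTINCT report query being avoided
here (table and column names are assumed, not given in the thread):

    -- Sketch only: the DISTINCT forces a sort (or equivalent) over the
    -- whole result set.
    SELECT DISTINCT host_id, "timestamp", category, anomaly
      FROM anomaly_archive
     WHERE "timestamp" >= '2003-01-01'
       AND "timestamp" <  '2003-02-01';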

The disk storage is a minor concern :), but I was actually looking at this
as a possible performance enhancement.  I am curious how it affects the
shared buffer cache, and there should also be fewer pages to read on
average, since the index is smaller.

Does this make sense? Or am I out in left field again? :)

> > I am not sure why all the data is duplicated in the index ... but I bet
> > it has to do with performance since it would save a lookup in the main
> > table.
> 
> An index that can't prevent looking into the main table wouldn't be
> worth anything AFAICS ...

Ok, scratch that idea then :)  I will continue looking at other ideas
like the MD5 data hashing etc.  

Thanks for your input Tom!

- Ryan
-- 
Ryan Bradetich <rbradetich@uswest.net>



Re: Questions about indexes?

From
Tom Lane
Date:
Ryan Bradetich <rbradetich@uswest.net> writes:
> On Sun, 2003-02-16 at 23:34, Tom Lane wrote:
>> It's not real clear to me why you bother enforcing a constraint that the
>> complete row be unique.  Wouldn't a useful constraint be that the first
>> three columns be unique?

> The table holds system policy compliance data.  The category is
> basically the policy, and the anomaly is the detailed text explaining
> why the system is out of compliance.  So the anomaly data is important
> (and often the reason why the key is unique).

Well, sure the anomaly is important: it's the payload, the reason why
you bother to have the table in the first place.  But that doesn't mean
it's part of the key.  Generally the key would be the info you use to
look up a particular anomaly text.  In this example, it's not clear to
me why you'd need/want two different anomaly texts entered for the same
host_id and the same category at the same instant of time.  ISTM there's
something inadequate about your category column if you need that.
        regards, tom lane


Re: Questions about indexes?

From
Ryan Bradetich
Date:
On Mon, 2003-02-17 at 00:15, Tom Lane wrote:
> Ryan Bradetich <rbradetich@uswest.net> writes:
> > On Sun, 2003-02-16 at 23:34, Tom Lane wrote:
> >> It's not real clear to me why you bother enforcing a constraint that the
> >> complete row be unique.  Wouldn't a useful constraint be that the first
> >> three columns be unique?
> 
> > The table holds system policy compliance data.  The category is
> > basically the policy, and the anomaly is the detailed text explaining
> > why the system is out of compliance.  So the anomaly data is important
> > (and often the reason why the key is unique).
> 
> Well, sure the anomaly is important: it's the payload, the reason why
> you bother to have the table in the first place.  But that doesn't mean
> it's part of the key.  Generally the key would be the info you use to
> look up a particular anomaly text.  In this example, it's not clear to
> me why you'd need/want two different anomaly texts entered for the same
> host_id and the same category at the same instant of time.  ISTM there's
> something inadequate about your category column if you need that.

Ok, I understand what you are asking now :)

Let me make up a contrived example to show how the table is used.
host_id 1 = hosta.somewhere.com
host_id 2 = hostb.somewhere.com

The categories are coded like so (made-up examples):
cat p101 = /etc/passwd check
cat f101 = filesystem check

the table would look like:
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.
etc...

So I do not need the anomaly to be part of the index; I only need it to
guarantee that the rows are unique.

I agree with you that I would not normally add the anomaly to the
index, except for the unique row requirement.  Thinking about it now,
maybe I should guarantee unique rows via a check constraint...

Thanks for making me think about this in a different way!

- Ryan

>             regards, tom lane
-- 
Ryan Bradetich <rbradetich@uswest.net>



Re: Questions about indexes?

From
Daniel Kalchev
Date:
>>> Ryan Bradetich said:
> the table would look like:
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
> 2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.
> etc...
>
> So I do not need the anomaly to be part of the index; I only need it to
> guarantee that the rows are unique.
>
> I agree with you that I would not normally add the anomaly to the
> index, except for the unique row requirement.  Thinking about it now,
> maybe I should guarantee unique rows via a check constraint...
>
> Thanks for making me think about this in a different way!

(sorry this is a bit long)

Ryan,

I use somewhat similarly structured data (archived records of various events),
and when the database was set up (back when this baby was called postgres95),
I too used indexes on all possible fields.

My database consists of an 'operations' table, which holds data for the last
x days (for example), and several tables with archived records (per month or
per year -- see below).  The operations table gets frequent updates, which add
new data.  Data is never modified, but lookups are frequent.  The archived
tables are generated once and forever from the operations table (possibly
merging them in the future, but I haven't yet made up my mind on this) --
after that, access is read-only, although still fairly frequent.

What I have found over the many years of operating this database on different
PostgreSQL versions and hardware is that indexes have considerable cost. :)
So does the need to not miss anything from the operations table (that is, to
collect data from many places and have it all in there).

I ended up with only a few indexes on the operations table, because the
processes that fill it do minimal lookups to see if the data is already in the
table, and insert it if not.  Then, at regular intervals, the table is cleaned
up -- that is, a process to remove the duplicates is run.  This unfortunately
costs OIDs, but I found no other reasonable way to do the fast inserts.
Perhaps the best way is to create the table WITHOUT OIDS (but wouldn't this
still waste OIDs?), use COPY, and then clean up afterwards?

The archived tables are generated, then cleaned up.  Then, as Tom suggested,
indexes are put on the archived tables, only for the fields that are used in
queries.  Once such a table is created, there is no way duplicate data can
appear, since nothing is inserted into it afterwards.  Therefore there is no
need for UNIQUE index enforcement.

If you need to have one large 'history' table, then perhaps you will just have 
to do (slow :) selects for each record before each insert, or just insert the 
data and then run the cleanup process.
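A sketch of what such a cleanup pass might look like (the column names are
borrowed from Ryan's schema, since Daniel's actual table layout is not shown
in the thread, and it assumes the table still has OIDs):

    -- Sketch only: keep one copy of each duplicated row, delete the rest.
    -- Not fast on a large table, but it only has to run occasionally.
    DELETE FROM operations
     WHERE oid NOT IN (SELECT min(oid)
                         FROM operations
                        GROUP BY host_id, "timestamp", category, anomaly);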

Daniel



Re: Questions about indexes?

From
Christopher Kings-Lynne
Date:
> I ended up with only a few indexes on the operations table, because the
> processes that fill it do minimal lookups to see if the data is already in the
> table, and insert it if not.  Then, at regular intervals, the table is cleaned
> up -- that is, a process to remove the duplicates is run.  This unfortunately
> costs OIDs, but I found no other reasonable way to do the fast inserts.
> Perhaps the best way is to create the table WITHOUT OIDS (but wouldn't this
> still waste OIDs?), use COPY, and then clean up afterwards?

No, WITHOUT OIDS is implemented specifically to not waste OIDs.

Chris



Re: Questions about indexes?

From
Tom Lane
Date:
Ryan Bradetich <rbradetich@uswest.net> writes:
> the table would look like:
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.

Ah, I see your point now.  (Thinks: what about separating the "anomaly"
column into an "identifier" and a "complaint" column:

1 | Mon Feb 17 00:34:24 MST 2003 | p101 | x    | user has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | y    | user has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | y    | user has expired password.
2 | Mon Feb 17 00:34:24 MST 2003 | f101 | /foo | file has improper owner.

No, that doesn't quite work either, unless you are willing to make the
categories more specific.  At which point the category and the anomaly
text become equivalent.  Actually I'm wondering why you bother with the
category at all; isn't it implied by the anomaly text?)

> I agree with you that I would not normally add the anomaly to the
> index, except for the unique row requirement.  Thinking about it now,
> maybe I should guarantee unique rows via a check constraint...

A check constraint won't be efficient either, at least not without a
supporting index.  Possibly you could index just the host and timestamp
columns, which would not be unique but it would cut the number of rows
the constraint would need to examine to something manageable.
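One possible shape for that idea, as a sketch (an insert-time check rather
than a true CHECK constraint, with the table name assumed; note it is not
safe against two sessions inserting the same row at the same instant):

    -- Non-unique supporting index on the first two columns.
    CREATE INDEX anomaly_archive_host_ts_idx
        ON anomaly_archive (host_id, "timestamp");

    -- Insert a row only if an identical one is not already there.
    INSERT INTO anomaly_archive (host_id, "timestamp", category, anomaly)
    SELECT 1, '2003-02-17 00:34:24', 'p101', 'user x has an invalid shell.'
     WHERE NOT EXISTS (SELECT 1
                         FROM anomaly_archive
                        WHERE host_id     = 1
                          AND "timestamp" = '2003-02-17 00:34:24'
                          AND category    = 'p101'
                          AND anomaly     = 'user x has an invalid shell.');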

But I'm still thinking that enforcing uniqueness is a waste of time.
What exactly is so harmful about it if
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
appears twice?  How likely is that anyway (especially if you don't
truncate the timestamp precision)?
        regards, tom lane


Re: Questions about indexes?

From
Curt Sampson
Date:
On Mon, 16 Feb 2003, Ryan Bradetich wrote:

> I am not sure why all the data is duplicated in the index ...

Well, you have to have the full key in the index, or how would you know,
when you look at a particular index item, if it actually matches what
you're searching for?

MS SQL server does have an interesting option that would help you a lot
in this case: clustered indexes. A table may have a single clustered
index, and each leaf node of the index stores not just the key but
actually the entire row. Thus, in a case like yours, you'd store the row
only once, not twice.

Without thinking too hard about it (my usual mode of operation on this
list :-)) this could probably be implemented in postgresql. But I don't
think it would be entirely trivial, and your case is unusual enough
that I very much doubt whether it would be worth implementing to fix
that alone. It would also offer the advantage that any lookup using the
clustered index would save fetching the heap page after that as well,
but it's hard to say if the savings would be worth the work.

> Since my only requirement is that the rows be unique, I have developed a
> custom MD5 function in C, and created an index on the MD5 hash of the
> concatenation of all the fields.

Well, that won't guarantee uniqueness, since it's perfectly possible
to have two different rows hash to the same value. (If that weren't
possible, your hash would have to contain as much information as the row
itself, and your space savings wouldn't be nearly so dramatic.)

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Questions about indexes?

From
Kevin Brown
Date:
Curt Sampson wrote:
> On Mon, 16 Feb 2003, Ryan Bradetich wrote:
> > Since my only requirement is that the rows be unique, I have developed a
> > custom MD5 function in C, and created an index on the MD5 hash of the
> > concatenation of all the fields.
> 
> Well, that won't guarantee uniqueness, since it's perfectly possible
> to have two different rows hash to the same value. (If that weren't
> possible, your hash would have to contain as much information as the row
> itself, and your space savings wouldn't be nearly so dramatic.)

That's true, but even if he has 4 billion rows, the probability of any two of
them hashing to the same 128-bit value is vanishingly small, so it's probably
a safe enough bet.  His application doesn't require absolute uniqueness,
fortunately, so md5 works well enough in this case.

Otherwise md5 wouldn't be a terribly good hash...
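For a rough sense of scale (a back-of-the-envelope estimate, not a figure
from the thread): by the usual birthday approximation, for an n-row table and
a 128-bit hash,

    P(at least one collision)  ~=  n * (n - 1) / 2^129
                               ~=  (2^32)^2 / 2^129  =  2^-65    for n ~= 4 billion

which is astronomically small.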


-- 
Kevin Brown                          kevin@sysexperts.com


Re: Questions about indexes?

From
Josh Berkus
Date:
Ryan,

I am crossing this discussion to the PGSQL-PERFORMANCE list, which is the
proper place for it.   Anyone interested, please follow us there!

>>>Ryan Bradetich said:
 > the table would look like:
 > 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
 > 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
 > 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
 > 2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.
 > etc...
 >
 > So I do not need the anomaly to be part of the index; I only need it to
 > guarantee that the rows are unique.
 >
 > I agree with you that I would not normally add the anomaly to the
 > index, except for the unique row requirement.  Thinking about it now,
 > maybe I should guarantee unique rows via a check constraint...
 >
 > Thanks for making me think about this in a different way!

First off, I'm not clear on why a duplicate anomaly would necessarily be
invalid, given the above.  Not useful, certainly, but legitimate real data.

I realize that you have performance considerations to work with.  However, I
must point out that your database design is not *at all* normalized ... and
that normalizing it might solve some of your indexing problems.

A normalized structure would look something like:

CREATE TABLE operations (
    id            serial      not null primary key,
    host_id       integer     not null,
    timeoccurred  timestamp   not null default now(),
    category      varchar(5)  not null,
    constraint operations_unq unique (host_id, timeoccurred, category)
);

CREATE TABLE anomalies (
    id            serial      not null primary key,
    operation_id  integer     not null references operations(id) on delete cascade,
    anomaly       text
);

This structure would have two advantages for you:
1) The operations table would be *much* smaller than what you have now, as
you would not be repeating rows for each anomaly.
2) In neither table would you be including the anomaly text in an index ...
thus reducing index size tremendously.
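A hypothetical sketch of how inserts against that layout might look (values
made up; currval() on the implicit serial sequence ties the anomaly rows to
the operation just inserted):

    INSERT INTO operations (host_id, timeoccurred, category)
         VALUES (1, '2003-02-17 00:34:24', 'p101');

    INSERT INTO anomalies (operation_id, anomaly)
         VALUES (currval('operations_id_seq'), 'user x has an invalid shell.');

    INSERT INTO anomalies (operation_id, anomaly)
         VALUES (currval('operations_id_seq'), 'user y has an invalid shell.');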

As a warning, though:  you may find that for insert speed the referential
integrity key is better maintained at the middleware layer.   We've had some
complaints about slow FK enforcement in Postgres.

--
Josh Berkus
Aglio Database Solutions
San Francisco