Thread: Threaded Records in SQL: Advice Needed

Threaded Records in SQL: Advice Needed

From
"Ingram, Bryan"
Date:
Hi Everyone,
I'm planning to implement a project using a php/postgres 6.5.3 combination.
Part of what I need to do is develop a model in SQL to keep track of
"threaded records."  What I mean by this, is that I need to keep track of
parent-child relationships in exactly the same way that a
threaded-discussion group does.
The design should be able to support approximately 100 users, mostly
readers, and be able to handle approximately 1 insert per minute.
So far, I've found two methods/algorithms for implenting this in SQL, but
there's a problem with each.  
The first one is extremely efficient and inexpensive for selecting and
ordering all of the records in a thread (taking only one select to get all
of the records and in order), but requires that ALL records be updated on
insert, which is very expensive.  Because I'm expecting about 120000 records
a year, or ca. 450 new records every business day, the prospect of each
insert updating each record in the database is not appealing.  At first it
would be ok, but as the record count increased, the inserts would become
extremely slow.  This method works by modeling the records as "nested sets"
and was explained by Joe Celko in one of his articles.
The second method is simply adding fields to keep track of an id, parent_id,
and root_id.  Its benefit is that it is inexpensive on inserts, requiring
only one insert and no updates, but it is costly to select and order all the
records in a thread.  Essentially, the only way I have found to do this is
to implement a recursive function that selects one record at a time, checks
for children, ad infinitum.  Not very efficient.
My concern with method 1 is that the great expense of inserting new records
will slow the database down to unacceptable levels. This is based on my past
experience with postgres when updating a database of 1.5 million rows (which
took an hour and a half with syncing turned off.)  
Method 2 is problematic because it uses recursion first of all (which I
understand to be heavy in resource usage) and because it makes a select for
each and every record in a thread to be displayed.
What I'd like to find is either a better way that I'm not aware of, or some
kind of middle ground.
Also, what do you make of these methods.  Are they as inefficient as I want
to believe?
Thanks!
Bryan Ingram
Database Programmer
S!XTYFOOTSP!DER
972-492-5676
bingram@sixtyfootspider.com <mailto:bingram@sixtyfootspider.com> <http://www.sixtyfootspider.com/>
www.sixtyfootspider.com


Re: Threaded Records in SQL: Advice Needed

From
"Patrick Giagnocavo"
Date:
> Hi Everyone,
>  
> I'm planning to implement a project using a php/postgres 6.5.3 combination.
>
> Part of what I need to do is develop a model in SQL to keep track of
> "threaded records."  What I mean by this, is that I need to keep track of
> parent-child relationships in exactly the same way that a
> threaded-discussion group does.


Hi Bryan

check out the PDF files at

http://ourworld.compuserve.com/homepages/Graeme_Birchall/HTM_COOK.HTM

The "SQL Cookbook" that this guy writes is oriented towards IBM's DB2
database, however, he does talk about a number of different ways of handling
parent-child relationships in an SQL table; his versions are able to be
nested as well.

Actually, this guy seems to know his stuff when it comes to SQL, so it is
probably worth having a look at anyways...thus I have also posted this
response to the list.

Cordially

Patrick Giagnocavo
a222@redrose.net



Re: Threaded Records in SQL: Advice Needed

From
mig@utdt.edu
Date:
In order to simplify the regular expressions, I propose to change the
"." as "field separator" in the ids: "." has a special meaning in
regular expressions, let us avoid escaping all over the place. So,
take for instance "/" as separator, so that the message in my previous
example is now "25/7/19/2".

In order to compute the correct id at insert time, I would suggest
keeping a sequence root_seq for the root messages (see CREATE SEQUENCE
in the postgres manual), and just   "select nextval(root_seq)"
each time you insert a new root message. Alternatively, if you write
root message ids as e.g. "/25" and keep the previous structure, you
can use the method described below also for root messages. In this
case, you just would be interpreting the root messages as "replies to
the (fictitious) message with an empty index".

Now assume you want to insert a new reply to message with id X (which
could be at any level, e.g. X = 25/7/19). You can get the number of
the next response to X = 25/7/19 using the regular expression
capabilities of postgres  "select count(id)+1 as Y from your_table where id ~ '25/7/19/[^/]*$' "
and then compute the index to be inserted as "X/Y"

In the regexp you are requesting something that matches 25/7/19/(any number of symbols different from "/")
so that all direct replies are selected, but NOT the replies to them -
as they would have a "/" somewhere before the end.

You can automatize this with the sql functions  create function next_reply_num(text) returns int4 as   'select
count(*)+1from ids where id ~ ($1|| ''/[^/]*$'') '  language 'sql';
 
  create function next_reply_id(text) returns text as         'select ($1 || ''/'' || next_reply_num($1)::text)'
language'sql';
 

You could then insert the next reply to message "25/19/2" using

insert into messages(id, ...) values(next_reply_id('25/19/2'), ...);

I do not know how to do this within a single call to an sql function;
it would be easy to do if you use either PL/tcl or PL/pgsql procedural
languages. 

Thanks for the "challenge": this IS fun.

Miguel




RE: Threaded Records in SQL: Advice Needed

From
"Ingram, Bryan"
Date:
Thanks for the ideas on the functions, that'll work nicely.

The only other problem I see in actually implementing this, is that the id
column i.e. /25/10/2/ will not be ordered correctly because it relies on
ascii values.  You get alphabetic orderings, rather than numerical.

Such as:

1
10
11
12
14
2
20
25
3
4
5

instead of

1
2
3
4
5
10
11
12
14
20
25

Any ideas how to get around this?  I'm working on the problem right now, but
haven't found anything yet.

Bryan




RE: Threaded Records in SQL: Advice Needed

From
"Ingram, Bryan"
Date:
In reference to the ascii/numeric ordering problem ..

But first let me say ..if someone knows of a way to get ascii values to
order as if they were numerics ..please let me in on the secret ..

Instead of using numbers, if letters were used the ordering would be
correct.

The id string would become something like: A/A/A for the 1st reply to the
1st reply of the 1st root message.

This will work fine for threads with relatively few replies on the same
level.  For instance, the number 100 would take only 4 characters to encode,
however, the number 1000 would need 39 characters to encode.

For my purpose 100-200 replies on any given level will suffice nicely.

Even though text fields are limited to 4096 characters, even if each level
needed 40 characters for encoding, I would still have room for approximately
100 levels (plenty) ..

Adding a step to the procedure you developed, we could convert the returned
"next reply number" into its alphabetic equivalent.

e.g.  1 =  A2 =  B26 = Z27 = ZA28 = ZB52 = ZZ53 = ZZA

The procedure is this:

next_id/26 = number of Z's
mod of next_id/26 = numeric position within alphabet

So .. if you wanted to add the 100th reply to the 2nd root topic ...

100/26         = 3 (ZZZ) (drop everything but the integer)
mod of 100/26     = 12 (L)

So we'd have 3 Z's and an L for the second root topic.

The fully assembled ID would be /B/ZZZL

Then a reply to this would become /B/ZZZL/A

So on, and so forth ..

This is verging on getting kludgey, but it still looks like it meets all of
the criteria:

1) Fast inserts
2) Fast selects of a thread or part of a thread  Actually this criterion hinges on the quality of the indexing
3) Rows are returned in order, with only one select.

So ..I really wish there was a better way ..but this isn't too big of a
price to pay.

Thanks,
Bryan






-----Original Message-----
From: Ingram, Bryan 
Sent: Tuesday, April 11, 2000 3:05 PM
To: 'mig@utdt.edu'
Cc: pgsql-sql@postgresql.org
Subject: RE: [SQL] Threaded Records in SQL: Advice Needed


Thanks for the ideas on the functions, that'll work nicely.

The only other problem I see in actually implementing this, is that the id
column i.e. /25/10/2/ will not be ordered correctly because it relies on
ascii values.  You get alphabetic orderings, rather than numerical.

Such as:

1
10
11
12
14
2
20
25
3
4
5

instead of

1
2
3
4
5
10
11
12
14
20
25

Any ideas how to get around this?  I'm working on the problem right now, but
haven't found anything yet.

Bryan




Re: Threaded Records in SQL: Advice Needed

From
mig@utdt.edu
Date:
lexicographic ordering is what you want ...

So, new proposal (described from the start) which solves all this and
should (I think) be rather fast. On the other hand, the id fields get
larger and larger ...

(1) Ids of messages are of the form /(field_0)/(field_1)/.../(field_n)   where each field counts the number of message
atthat level:   field_0 counts the root messages, field_1 the replies to the root   message at field_0, etc ...
 
(2) The field counters are numbers where   (a) the first digit gives the count of digits in the index (if youthink that
youwill have more than nine digits in any field,make it a two-digit counter: 01, 02, ... 99   (b) the rest is the index
 Same example as before: the second reply to the nineteenth reply   to the seventh reply to root message 25 has id
/225/17/219/12  (field_0=225 means: two digits in the counter, the counter is 25)   This does solve the issue of
correctordering: lexicographic   ordering of THESE indexes is what you want ... shorter numbers do   come before the
longerones
 
(3) The issuing of new indexes can be implemented in three sql functions   as follows:      create function
next_reply_num(text)returns text as   'select (count(*)+1)::text from ids where id ~ ($1|| ''/[^/]*$'') '  language
'sql';     create function add_ct(text) returns text as         'select (char_length($1)::text || $1)'  language 'sql';
    create function next_reply_id(text) returns text as         'select ($1 || ''/'' || add_ct(next_reply_num($1)))'
language'sql';    The function next_reply_num gets the next counter; add_ct    prepends the digit count; next_reply_id
assemblesthe lot.
 

Again, maybe this can be implemented with fewer sql function calls; it
can definitely be implemented in a single tcl or pgSQL function ...

Another remark on these functions: if you EVER delete a message from
the list, the procedure next_reply_num will wreak havoc in your
numbering! If that is a possibility, you should probably select
the largest reply number to that message, parse it, add one to the
counter, prepend the byte count and then compute the index for the new
reply. 

PS: I just received your e-mail which solves these issues; here is an   alternative ... It is of course compatible with
usingletters   instead of numbers, as you propose - the new idea is adding a   "byte count" to each field, which can of
coursealso be encoded in   a letter
 
      

--text follows this line--
>From: "Ingram, Bryan" <BIngram@sixtyfootspider.com>
>Cc: pgsql-sql@postgresql.org
>Date: Tue, 11 Apr 2000 14:57:00 -0500
>Content-Type: text/plain;
>    charset="iso-8859-1"
>
>Thanks for the ideas on the functions, that'll work nicely.
>
>The only other problem I see in actually implementing this, is that the id
>column i.e. /25/10/2/ will not be ordered correctly because it relies on
>ascii values.  You get alphabetic orderings, rather than numerical.
>
>Such as:
>
>1
>10
>11
>12
>14
>2
>20
>25
>3
>4
>5
>
>instead of
>
>1
>2
>3
>4
>5
>10
>11
>12
>14
>20
>25
>
>Any ideas how to get around this?  I'm working on the problem right now, but
>haven't found anything yet.
>
>Bryan
>


RE: Threaded Records in SQL: Advice Needed

From
Michael McCarthy
Date:
Would it be too obvious to mention that leading zeroes would give you what
you need?

//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\
Michael McCarthy                             TCSI Corporation
michael@tcsi.com                             1080 Marina Village Parkway
(510) 749-8739                               Alameda, CA 94501

> The only other problem I see in actually implementing this, is that the id
> column i.e. /25/10/2/ will not be ordered correctly because it relies on
> ascii values.  You get alphabetic orderings, rather than numerical.



Re: Threaded Records in SQL: Advice Needed

From
"tjk@tksoft.com"
Date:
Bryan,

You can numerically, even if the field is a text field,
if all fields have a number.

Just cst to an int4.

E.g.

create table testtable (num text, num2 text);

select num from testtable order by num::int4;



Troy


>
> Thanks for the ideas on the functions, that'll work nicely.
>
> The only other problem I see in actually implementing this, is that the id
> column i.e. /25/10/2/ will not be ordered correctly because it relies on
> ascii values.  You get alphabetic orderings, rather than numerical.
>
> Such as:
>
> 1
> 10
> 11
> 12
> 14
> 2
> 20
> 25
> 3
> 4
> 5
>
> instead of
>
> 1
> 2
> 3
> 4
> 5
> 10
> 11
> 12
> 14
> 20
> 25
>
> Any ideas how to get around this?  I'm working on the problem right now, but
> haven't found anything yet.
>
> Bryan
>
>
>

Re: Threaded Records in SQL: Advice Needed

From
"Moray McConnachie"
Date:
----- Original Message -----
From: "Ingram, Bryan" <BIngram@sixtyfootspider.com>
To: "Ingram, Bryan" <BIngram@sixtyfootspider.com>; <mig@utdt.edu>
Cc: <pgsql-sql@postgresql.org>
Sent: Tuesday, April 11, 2000 9:57 PM
Subject: RE: [SQL] Threaded Records in SQL: Advice Needed


> In reference to the ascii/numeric ordering problem ..
>
> But first let me say ..if someone knows of a way to get ascii values to
> order as if they were numerics ..please let me in on the secret ..
>
> Instead of using numbers, if letters were used the ordering would be
> correct.
>
> The id string would become something like: A/A/A for the 1st reply to the
> 1st reply of the 1st root message.
>
> This will work fine for threads with relatively few replies on the same
> level.  For instance, the number 100 would take only 4 characters to
encode,
> however, the number 1000 would need 39 characters to encode.

More efficient to use letters as numbers in base 26? Then for up to 676
replies per level you need only 2  letters. If you allowed 3, you'd have
some headroom.
----------------------------------------------------------------
Moray.McConnachie@computing-services.oxford.ac.uk



RE: Threaded Records in SQL: Advice Needed

From
"Sondaar, Roelof"
Date:
Hello,

I don't know if it has been answered yet (bit late I know).
I solved the ascii numeric ordering as followed:
- concatenate a lot of zeroes in front (at least the length of the result
value)
- cut the the result length.


snlsor=> select * from asciinum order by number;
number
------    1   10  111 1211    2   20  211   22
(8 rows)
snlsor=> select * from asciinum order by lpad(number, 4, '0');
number
------    1    2   10   20   22  111  211 1211
(8 rows)

Best regards,
Roelof
> -----Original Message-----
> From:    Ingram, Bryan [SMTP:BIngram@sixtyfootspider.com]
> Sent:    dinsdag 11 april 2000 22:58
> To:    Ingram, Bryan; mig@utdt.edu
> Cc:    pgsql-sql@postgresql.org
> Subject:    RE: [SQL] Threaded Records in SQL: Advice Needed
> 
> In reference to the ascii/numeric ordering problem ..
> 
> But first let me say ..if someone knows of a way to get ascii values to
> order as if they were numerics ..please let me in on the secret ..
> 
> Instead of using numbers, if letters were used the ordering would be
> correct.
> 
> The id string would become something like: A/A/A for the 1st reply to the
> 1st reply of the 1st root message.
> 
> This will work fine for threads with relatively few replies on the same
> level.  For instance, the number 100 would take only 4 characters to
> encode,
> however, the number 1000 would need 39 characters to encode.
> 
> For my purpose 100-200 replies on any given level will suffice nicely.
> 
> Even though text fields are limited to 4096 characters, even if each level
> needed 40 characters for encoding, I would still have room for
> approximately
> 100 levels (plenty) ..
> 
> Adding a step to the procedure you developed, we could convert the
> returned
> "next reply number" into its alphabetic equivalent.
> 
> e.g.  1 =  A
>     2 =  B
>     26 = Z
>     27 = ZA
>     28 = ZB
>     52 = ZZ
>     53 = ZZA
> 
> The procedure is this:
> 
> next_id/26 = number of Z's
> mod of next_id/26 = numeric position within alphabet
> 
> So .. if you wanted to add the 100th reply to the 2nd root topic ...
> 
> 100/26         = 3 (ZZZ) (drop everything but the integer)
> mod of 100/26     = 12 (L)
> 
> So we'd have 3 Z's and an L for the second root topic.
> 
> The fully assembled ID would be /B/ZZZL
> 
> Then a reply to this would become /B/ZZZL/A
> 
> So on, and so forth ..
> 
> This is verging on getting kludgey, but it still looks like it meets all
> of
> the criteria:
> 
> 1) Fast inserts
> 2) Fast selects of a thread or part of a thread
>    Actually this criterion hinges on the quality of the indexing
> 3) Rows are returned in order, with only one select.
> 
> So ..I really wish there was a better way ..but this isn't too big of a
> price to pay.
> 
> Thanks,
> Bryan
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Ingram, Bryan 
> Sent: Tuesday, April 11, 2000 3:05 PM
> To: 'mig@utdt.edu'
> Cc: pgsql-sql@postgresql.org
> Subject: RE: [SQL] Threaded Records in SQL: Advice Needed
> 
> 
> Thanks for the ideas on the functions, that'll work nicely.
> 
> The only other problem I see in actually implementing this, is that the id
> column i.e. /25/10/2/ will not be ordered correctly because it relies on
> ascii values.  You get alphabetic orderings, rather than numerical.
> 
> Such as:
> 
> 1
> 10
> 11
> 12
> 14
> 2
> 20
> 25
> 3
> 4
> 5
> 
> instead of
> 
> 1
> 2
> 3
> 4
> 5
> 10
> 11
> 12
> 14
> 20
> 25
> 
> Any ideas how to get around this?  I'm working on the problem right now,
> but
> haven't found anything yet.
> 
> Bryan
>