Thread: Searching BLOB
Hi,
I am not 100% sure what the best solution would be, so I was hoping
someone could point me in the right direction.
I usually develop in MS tools, such as .NET, ASP, SQL Server, etc.,
but I really want to expand my skillset and learn as much about PostgreSQL as possible.
What I need to do is design a DB that will index and store
approximately 300 Word docs, each with a size of no more than 1MB. Users
need to be able to search the Word documents for keywords/phrases to be
able to identify which one to use.
So, I need to write 2 web interfaces. A front end and a back end. Front
end for the users who will search for their documents, and a backend
for an admin person to upload new/amended documents to the DB to be
searchable.
NOW..... I could do this in the usual MS tools that I work with using
BLOBs and the built-in full-text searching that comes with SQL Server,
but I don't have these to work with at the moment. I am working with
PostgreSQL & JSP pages.
What I was hoping someone could help me out with was identifying the
best possible solution to use.
1. How can I store the Word docs in the DB? Would it be best to use a
BLOB data type?
2. Does Postgres support full text searching of a word document once it
is loaded into the BLOB column & how would this work? Would I have to
unload each BLOB object, convert it back to text to search, or does
Postgres have the ability to complete the full-text search of a BLOB,
like MSSQL Server & Oracle do?
3. Is there a way to export the Word doc from the BLOB column and dump
it into a PDF format (I guess I am asking if someone has seen or
written a PDF generator script/storedProc for Postgres)?
If someone could help me out, it would be greatly appreciated.
cheers,
James

James Watson wrote:
> What I was hoping someone could help me out with was identifying the
> best possible solution to use.
>
> 1. How can I store the Word docs in the DB? Would it be best to use a
> BLOB data type?

You can use the column type "bytea", which can store (nearly) arbitrary
amounts of binary data.

> 2. Does Postgres support full text searching of a word document once it
> is loaded into the BLOB column & how would this work? Would I have to
> unload each BLOB object, convert it back to text to search, or does
> Postgres have the ability to complete the full-text search of a BLOB,
> like MSSQL Server & Oracle do?

There is full-text indexing support for Postgres; look for tsearch2 in the
contrib modules of Postgres. A bytea column is basically used like a
string, so there is no need to load/unload the blob.

There is also the concept of a LOB as a distinct entity in PostgreSQL.
Accessing those LOBs needs special support from your client library
(standard libpq provides that support, of course). They have the advantage
that you can open/seek/close them like a regular file. The disadvantage
is that you can't store them in columns: they are referenced via OIDs,
and you need to store those OIDs. You also can't put triggers on those
LOBs, and I'm not sure how transaction-safe they are.

> 3. Is there a way to export the Word doc from the BLOB column and dump
> it into a PDF format (I guess I am asking if someone has seen or
> written a PDF generator script/storedProc for Postgres)?

You can use Java as a backend language with PostgreSQL (google for
pljava). So you can pretty much do whatever you can do with Java.

greetings, Florian Pflug
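
A rough sketch of how the bytea suggestion above might look from the Java side, using plain JDBC. Everything named here is an assumption for illustration: the "documents" table, its columns, and the DocumentStore class are hypothetical. It also assumes PostgreSQL 8.3 or later, where to_tsvector, plainto_tsquery and the @@ operator are built in rather than installed from contrib. As far as I know, the text search functions index plain text and do not parse the Word binary format, so the sketch expects the application to extract the text from the Word file before storing it.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical schema (not from the thread):
//   CREATE TABLE documents (
//     id serial PRIMARY KEY, name text, content bytea, body_tsv tsvector);
final class DocumentStore {

    // Stores the raw file bytes and a tsvector built from the extracted text.
    static final String INSERT_SQL =
        "INSERT INTO documents (name, content, body_tsv) "
      + "VALUES (?, ?, to_tsvector('english', ?))";

    // Finds documents whose indexed text matches all of the given keywords.
    static final String SEARCH_SQL =
        "SELECT name FROM documents "
      + "WHERE body_tsv @@ plainto_tsquery('english', ?)";

    // Saves the Word file in the bytea column and indexes the plain text
    // that the caller has already extracted from it.
    static void store(Connection conn, String name, byte[] wordFile,
                      String extractedText) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, name);
            ps.setBytes(2, wordFile);        // bytea is written as a byte[]
            ps.setString(3, extractedText);
            ps.executeUpdate();
        }
    }

    // Returns the names of all documents matching the keywords.
    static List<String> search(Connection conn, String keywords)
            throws SQLException {
        List<String> names = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(SEARCH_SQL)) {
            ps.setString(1, keywords);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString(1));
                }
            }
        }
        return names;
    }
}
```

Running this needs a reachable database and the PostgreSQL JDBC driver on the classpath. Keeping the tsvector in its own column alongside the bytea means the original file can be handed back to the user unchanged, while searches only touch the indexed text.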

Save yourself some effort and use Lucene to index a directory of your 300
Word documents. I'm pretty sure that Lucene includes an extension to read
Word documents, and you can use PDFBox to read/write PDF files. Marrying
the searching and displaying of results to your web application should be
trivial since you're wanting to use Java anyway. Lucene has full character
set support and is blindingly fast.

If you're looking for a solution to this problem using Postgres, then
you'll be creating a ton of extra work for yourself. If you're wanting to
learn more about Postgres, then maybe it'll be worthwhile.

John

Hi John,

I have had a read through the Lucene website
(http://lucene.apache.org/java/docs/index.html) and it sounds pretty
good to me. I should be able to use this in conjunction with my JSP
pages.

This may sound quite dumb to anyone who develops in Java, but I need a
little help setting up the demo on my Windows XP machine. I have
installed JDK 1.5.0_07 and Tomcat, and can confirm that it is all up
and running correctly, as I have already written a few simple
JSP pages.

I have downloaded the Lucene package, extracted it to my C:\
and followed the steps on the demo page:
http://lucene.apache.org/java/docs/demo.html

But when I try to run "java org.apache.lucene.demo.IndexFiles
c:\lucene-2.0.0\src" from the cmd prompt, I get the following error:

"Exception in thread 'main' java.lang.NoClassDefFoundError:
org/apache/lucene/analysis/Analyzer"

I am not sure why this is coming up. I have followed the instructions
on the demo page on the web.

The only thing I can think of is that I may have my "CLASSPATH" set
incorrectly. Can someone help me out with a basic description of what the
classpath is and where I should point the classpath environment variable?

Once I have that correct, I think that I may be able to run the demo.

Thanks for any help you can provide.

James

This is a bit off topic for the Postgres list... ;)

Make sure you explicitly include the name of the Lucene jar file in your
command line invocation, and any other directories that are required
(normally your current working directory), so for Windows you'd use
something like

    java -cp .;{pathto}\lucene-1.4.3.jar YourJavaApp

When you use Lucene in your webapp, include the Lucene jar file in
{tomcat_home}\common\lib or the WEB-INF\lib directory under your webapp.

Hope that helps.

John
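
To make the classpath advice above a little more concrete, here is a minimal sketch. The class name ClasspathCheck is made up for illustration and is not part of Lucene or Tomcat. It prints the entries the JVM was started with and reports whether a given class can be located. A NoClassDefFoundError like the one reported usually means the jar containing that class is not among those entries.

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, not part of Lucene or Tomcat.
final class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its
            // static initializers.
            Class.forName(className, false,
                    ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Splits the java.class.path system property into its entries
    // (separated by ';' on Windows and ':' on Unix-like systems).
    static String[] classpathEntries() {
        String cp = System.getProperty("java.class.path", "");
        if (cp.isEmpty()) {
            return new String[0];
        }
        return cp.split(Pattern.quote(File.pathSeparator));
    }

    public static void main(String[] args) {
        for (String entry : classpathEntries()) {
            System.out.println("classpath entry: " + entry);
        }
        // Class to look for; defaults to the one from the error message.
        String name = args.length > 0
                ? args[0]
                : "org.apache.lucene.analysis.Analyzer";
        System.out.println(name
                + (isOnClasspath(name) ? " found" : " NOT found"));
    }
}
```

Compiled in the current directory, it could be run the same way as the command above, for example "java -cp .;{pathto}\lucene-1.4.3.jar ClasspathCheck", once with and once without the jar on -cp, to see the difference.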