Re: Indexing MS/Open Office and PDF documents - Mailing list pgsql-general

From	Samba
Subject	Re: Indexing MS/Open Office and PDF documents
Date	March 15, 2012 21:46:06
Msg-id	CAKgWO9JGu004KMJ1RD6HtSWX_tQXAZm2wNVV4fKBtNUx0Ko+3A@mail.gmail.com
In response to	Re: Indexing MS/Open Office and PDF documents (dennis jenkins <dennis.jenkins.75@gmail.com>)
List	pgsql-general

Word documents can be converted by AbiWord: it turns any MS Word document into HTML, LaTeX, PostScript, or plain text with very simple commands. I guess it also exposes some API that can be integrated into document parsers/indexers.
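As a rough illustration of the "very simple commands" part, here is a minimal Python sketch that shells out to AbiWord. It assumes an `abiword` binary on the PATH that accepts the `--to` and `--to-name` options described in its man page; please check this against your installed version. The function names and file names are made up for the example.

```python
import shutil
import subprocess
from pathlib import Path


def abiword_command(source: Path, target: Path, target_format: str = "txt") -> list[str]:
    """Build the AbiWord command line that converts `source` into `target`."""
    # Assumption: --to picks the output format and --to-name the output file,
    # as described in the abiword man page.
    return ["abiword", f"--to={target_format}", f"--to-name={target}", str(source)]


def word_to_text(source: Path) -> str:
    """Convert a Word document to plain text and return the text."""
    if shutil.which("abiword") is None:
        raise RuntimeError("abiword was not found on PATH")
    target = source.with_suffix(".txt")
    # check=True raises CalledProcessError if AbiWord exits with an error.
    subprocess.run(abiword_command(source, target), check=True)
    return target.read_text(encoding="utf-8", errors="replace")
```

Calling `word_to_text(Path("report.doc"))` would leave `report.txt` next to the original and return its contents, which can then be handed to whatever does the indexing.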

Spreadsheets can be processed with the ExcelFormat library:
http://www.codeproject.com/Articles/42504/ExcelFormat-Library

or the BasicExcel library:
http://www.codeproject.com/Articles/13852/BasicExcel-A-Class-to-Read-and-Write-to-Microsoft

The GNU Gnumeric project also has some API for processing spreadsheets, which can be used to extract text for indexing.

Code to extract text from PDF
http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file

Overall, I guess there are bits and pieces available on the internet, and some dedicated effort is needed to assemble them and develop them into a finished product, namely a document indexer.
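To give an idea of the last step of such an indexer, here is a minimal Python sketch of feeding already extracted text into PostgreSQL full text search. The table name `documents`, its columns, the `'english'` configuration, and the helper names are all assumptions made up for this example; the extraction itself is left to whichever library above fits the file type.

```python
import re

# Hypothetical schema: one row per stored document plus its tsvector.
CREATE_SQL = """
CREATE TABLE documents (
    id       serial PRIMARY KEY,
    filename text NOT NULL,
    body     text NOT NULL,
    body_tsv tsvector
);
CREATE INDEX documents_body_tsv_idx ON documents USING gin (body_tsv);
"""

INSERT_SQL = (
    "INSERT INTO documents (filename, body, body_tsv) "
    "VALUES (%s, %s, to_tsvector('english', %s))"
)

SEARCH_SQL = (
    "SELECT filename FROM documents "
    "WHERE body_tsv @@ plainto_tsquery('english', %s)"
)


def clean_extracted_text(raw: str) -> str:
    """Tidy extractor output before it is stored."""
    # PostgreSQL text values cannot contain NUL bytes, so drop them.
    without_nul = raw.replace("\x00", "")
    # Collapse the runs of whitespace that converters tend to leave behind.
    return re.sub(r"\s+", " ", without_nul).strip()


def insert_statement(filename: str, raw_text: str) -> tuple[str, tuple[str, str, str]]:
    """Return the SQL and parameters that store and index one document."""
    body = clean_extracted_text(raw_text)
    # The body is passed twice: once for storage, once for to_tsvector().
    return INSERT_SQL, (filename, body, body)
```

With a driver that uses `%s` placeholders, such as psycopg2, this would be run as `cur.execute(*insert_statement("report.doc", text))`, and a search as `cur.execute(SEARCH_SQL, ("full text search",))`.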

Wish you success!

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

On Fri, Mar 16, 2012 at 2:51 AM, dennis jenkins <dennis.jenkins.75@gmail.com> wrote:

On Thu, Mar 15, 2012 at 4:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2012-03-16 at 01:57 +0530, Alexander.Bagerman@cognizant.com
> wrote:
>> Hi,
>>
>> We are looking to use Postgres 9 for the document storing and would
>> like to take advantage of the full text search capabilities. We have
>> hard time identifying MS/Open Office and PDF parsers to index stored
>> documents and make them available for text searching. Any advice would
>> be appreciated.
>
> The first step is to find a library that can parse such documents, or
> convert them to a format that can be parsed.

I don't know about MS-Office document parsing, but the "PoDoFo" (pdf
parsing library) can strip text from PDFs. Every now and then someone
posts to the podofo mailing list with questions related to extracting
text for the purposes of indexing it in an FTS-capable database. PoDoFo
has excellent developer support. The maintainer is quick to accept
patches, verify bugs, add features, etc... Disclaimer: I'm not a pdf
nor podofo expert. I can't help you accomplish what you want.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
