Contao Open Source CMS > Contao forum

Feature request: Indexing of PDFs & Word docs.

jamsig
User
Hi there,

This is my first post, so please don't shoot me down if I get this wrong or it's already included (I'm a newbie, and yes, I searched the site first but couldn't find any topics about this feature).

Pretty self-explanatory really, but it would be nice if the search engine could index PDFs & Word docs. If it already does, great: can someone PLEASE show me how, because my search engine won't do it? And if it doesn't, could you please add it?

Thank you,

JamSig
2008-05-06 01:00
Ben
Partner
Posts: 2126
Atlanta, Georgia, United States
Jamsig,

First off, welcome to the TYPOlight community. Secondly, I don't recall any talk of the internal search engine indexing documents. This actually sounds like an extraordinary feature (as in, extra ordinary), so I'm not sure if it is feasible. Google has this capability, so you might want to consider adding a local site search via Google.
2008-05-06 03:32
jamsig
User
Hi Ben,

Thanks for the reply :). WOW people do read these things haha.

At the moment I'm building a site for our local intranet, so I'm not sure how I can integrate Google, as most users won't have access to the outside world, including Google.

Anyway, in my searching I stumbled across a small open-source search engine called Sphider (http://www.sphider.eu/index.php). I'm doing a bit of coding at the moment to integrate it into TYPOlight. If I'm successful, I'll post a tutorial for those who need it.

Thank you for your comments.

James
2008-05-06 03:39
Ben
Partner
Posts: 2126
Atlanta, Georgia, United States
James,

Sphider looks interesting - let us know how it goes. I don't think Google would be of much help to you in a local intranet.
2008-05-06 05:47
thyon
User
Posts: 1756
Cape Town, South Africa
This was a feature originally used in Microsoft SharePoint. They used index extenders: components that could understand each document type (e.g. PDF, DOC, XLS, etc.) and create an index for it. Of course MS could index all their own documents, but they needed a parser to read PDFs and had to write that.

It's therefore feasible, but it's quite an undertaking, as the parser has to understand the file format. Perhaps we could start to create parser routines that would return a list of keywords and page numbers, e.g.:
'pdf,xls' => array('ModuleParser'=>'MyCustomParserFunction');

Then the search engine could call the parser routine via a HOOK so that the file gets indexed as well. I've not used the search engine for anything other than passing pages to it for indexing, so this would most likely be a completely new set of features.
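
To make the idea concrete, here is a minimal, language-neutral sketch of such a parser registry, written in Python rather than PHP purely for illustration. Everything in it is hypothetical: `PARSERS`, `register_parser`, `index_file` and the stand-in `plain_text_parser` are invented names, not part of TYPOlight, and a real PDF or DOC parser would have to replace the plain-text one.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical registry: file extension -> parser callable.
# A parser takes the raw file bytes and returns a list of page texts.
PARSERS = {}


def register_parser(extensions, parser):
    """Register one parser for a comma-separated extension list, e.g. 'pdf,xls'."""
    for ext in extensions.split(","):
        PARSERS[ext.strip().lower()] = parser


def plain_text_parser(data):
    """Stand-in parser (assumption): treats form feeds as page breaks."""
    return data.decode("utf-8", errors="replace").split("\f")


def index_file(filename, data):
    """Return {keyword: sorted page numbers}, or None if no parser is registered."""
    ext = Path(filename).suffix.lstrip(".").lower()
    parser = PARSERS.get(ext)
    if parser is None:
        return None
    index = defaultdict(set)
    for page_no, text in enumerate(parser(data), start=1):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_no)
    return {word: sorted(pages) for word, pages in index.items()}


# Only plain-text formats are handled in this sketch.
register_parser("txt,log", plain_text_parser)
```

The search engine would only need to call something like `index_file()` for each file it encounters; file types without a registered parser are simply skipped.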

Now for step 2:
Once indexed, it's a bit pointless to report that a keyword, e.g. "mathematical", appears in this 10MB document without a teaser line that shows the word in context. This is what Google does, so the parser needs to be able to return the section of the document where the match occurs.
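
A teaser line could be produced roughly like this, assuming the parser has already extracted the plain text. This is only a sketch in Python for illustration; the function name `teaser` and the `width` parameter are made up here and are not part of any existing search module.

```python
import re


def teaser(text, keyword, width=40):
    """Return a short excerpt around the first case-insensitive match, or None."""
    match = re.search(re.escape(keyword), text, re.IGNORECASE)
    if match is None:
        return None
    # Keep up to `width` characters of context on each side of the match.
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end].strip() + suffix
```

For example, `teaser("The Mathematical model is simple.", "mathematical", width=4)` gives `"The Mathematical mod..."`. A real implementation would probably cut at word boundaries rather than at a fixed character count.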

Maybe there are already such parser routines in the public domain, as many PDF creators already exist.
2008-05-06 08:18
Pauline
User
We're currently in a similar situation: an extranet with PDF (DOC, PPT) documents that need to be parsed.
Jamsig, did you find a solution with yours?
2008-07-04 16:01
stefanjohannsen
Partner
Posts: 46
Sønderborg, Denmark
I would appreciate that feature as well. It would be so nice to have PDFs indexed for searching.
2009-04-30 21:13