Are Office 2007 formats like docx supported?

Is there any way to get Xapian to index Office 2007 formats?

Is there any option/procedure to add a new MIME plugin? For example, if you rename a .docx file to .zip, you can retrieve the text from document.xml.

Thanks

Frank
On Thu, Jul 24, 2008 at 04:08:26AM +0930, Frank Bruzzaniti wrote:
> Is office 2007 formats like docx supported?

Out of the box, not unless antiword supports it. The last update to the Debian packaged version was August 2006, so I suspect the answer is "no".

> Is there anyway to get xapian to index office 2007 formats?
>
> Is there any option/procedure to add a new mime plugin?
> For example if you rename a docx .zip you can retrieve text from
> document.xml

I assume you mean for Omega's omindex indexer?

There isn't currently a way to configure additional filters without modifying the source code in omindex.cc (ideally there should be a configuration file to allow this, but it's not implemented yet), but it's quite easy to wire in additional external filters if you aren't scared of dabbling in C++.

Rather than writing a full guide here, I'm going to write this up as a wiki page, since that will be easier for others to find in the future. I'll reply again when I'm done.

Cheers,
Olly
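The renaming trick Frank mentions works because a .docx file is a zip archive, so an external filter can be as simple as a command that writes one archive member to stdout. Below is a minimal sketch of building such a command. The names `shell_quote` and `docx_member_command` are hypothetical and are not part of omindex; the only real tool assumed is Info-ZIP's `unzip` with its `-p` (extract to pipe) option.

```cpp
#include <cassert>
#include <string>

// Hypothetical quoting helper (an assumption, not omindex code): wrap the
// argument in single quotes and escape any embedded single quote as '\''.
std::string shell_quote(const std::string& arg) {
    std::string out = "'";
    for (char c : arg) {
        if (c == '\'') {
            out += "'\\''";
        } else {
            out += c;
        }
    }
    out += "'";
    return out;
}

// Build a command that writes one member of a .docx (a zip archive) to
// stdout. For a Word document the body text lives in word/document.xml.
std::string docx_member_command(const std::string& path,
                                const std::string& member) {
    return "unzip -p " + shell_quote(path) + " " + shell_quote(member);
}
```

An indexer would run the resulting command, capture its output, and pass that XML on to whatever text extraction it uses.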
This is how I do it using the tinyxml parser. My XML parsing may be a bit convoluted, but it works. The same approach can be applied to PowerPoint and Excel files too.

    ...
    mime_map["docx"] = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    mime_map["pptx"] = "application/vnd.openxmlformats-officedocument.presentationml.presentation";
    mime_map["xlsx"] = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
    ...

    // HANDLE DOCX WORD DOCUMENTS
    if (mimetype == "application/vnd.openxmlformats-officedocument.wordprocessingml.document") {
        // Metadata lives in the docProps/ members of the archive.
        string cmd = "unzip -p " + shell_protect(filepath) + " docProps/core.xml";
        fileData += parseWordXMetaData(mstdout_to_string(cmd));
        cmd = "unzip -p " + shell_protect(filepath) + " docProps/app.xml";
        fileData += parseWordXMetaData(mstdout_to_string(cmd));
        cmd = "unzip -p " + shell_protect(filepath) + " docProps/custom.xml";
        fileData += parseWordXCustomMetaData(mstdout_to_string(cmd));
        // The body text lives in word/document.xml.
        cmd = "unzip -p " + shell_protect(filepath) + " word/document.xml";
        try {
            XmlParser xmlparser;
            xmlparser.parse_html(mstdout_to_string(cmd));
            dump = xmlparser.dump;
        } catch (ReadError) {
            cout << "\"" << cmd << "\" failed - skipping\n";
            return 0;
        }
    }

    string parseWordXCustomMetaData(string xml) {
        string fileData = "";
        TiXmlDocument doc;
        doc.Parse(xml.c_str());
        TiXmlElement* root = doc.RootElement();
        if (root) {
            // Each child element of the root is one custom property.
            for (TiXmlNode* pChild = root->FirstChild(); pChild != 0;
                 pChild = pChild->NextSibling()) {
                TiXmlElement* aElem = pChild->ToElement();
                if (!aElem) continue;
                // Attribute() returns NULL if the attribute is missing.
                const char* attr = aElem->Attribute("name");
                string name = attr ? attr : "";
                // The property's value is the text of its child element(s).
                for (TiXmlNode* pProp = aElem->FirstChild(); pProp != 0;
                     pProp = pProp->NextSibling()) {
                    TiXmlElement* bElem = pProp->ToElement();
                    if (bElem && bElem->GetText()) {
                        fileData += "name:" + name + "=\"" + bElem->GetText() + "\"\n";
                    }
                }
            }
        }
        return fileData;
    }

Easy peasy ;-)

On 23 Jul 2008, at 19:38, Frank Bruzzaniti wrote:
> Is office 2007 formats like docx supported?
>
> Is there anyway to get xapian to index office 2007 formats?
>
> Is there any option/procedure to add a new mime plugin?
> For example if you rename a docx .zip you can retrieve text from
> document.xml
>
> Thanks
>
> Frank
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
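The body-text step above relies on an XmlParser class that isn't shown in this thread. As a rough illustration of what that step has to achieve, here is a minimal sketch that reduces the XML from word/document.xml to plain text without any XML library. It is not Colin's method and not Omega code: `docx_xml_to_text` is a hypothetical name, and the sketch assumes only that WordprocessingML closes each paragraph with `</w:p>` and that the text uses the five predefined XML entities.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption for illustration): drop every tag,
// end each paragraph (</w:p>) with a newline, and decode the five
// predefined XML entities. Anything else is copied through unchanged.
std::string docx_xml_to_text(const std::string& xml) {
    static const char* const names[] = {"&amp;", "&lt;", "&gt;", "&quot;", "&apos;"};
    static const char chars[] = {'&', '<', '>', '"', '\''};

    std::string text;
    std::string::size_type i = 0;
    while (i < xml.size()) {
        if (xml[i] == '<') {
            std::string::size_type end = xml.find('>', i);
            if (end == std::string::npos) break;  // truncated tag: stop here
            if (xml.compare(i, end - i + 1, "</w:p>") == 0) text += '\n';
            i = end + 1;  // skip the whole tag
        } else if (xml[i] == '&') {
            bool matched = false;
            for (int k = 0; k < 5; ++k) {
                const std::string name = names[k];
                if (xml.compare(i, name.size(), name) == 0) {
                    text += chars[k];
                    i += name.size();
                    matched = true;
                    break;
                }
            }
            if (!matched) text += xml[i++];  // unknown entity: keep the '&'
        } else {
            text += xml[i++];
        }
    }
    return text;
}
```

This deliberately ignores tabs, line breaks within a paragraph, and numeric character references, so a real XML parser such as the tinyxml one used above is the safer choice for anything beyond a quick test.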
Hi Frank

You will have to get your hands dirty, I'm afraid.

I use my own indexer (which is heavily customised) and not Omega. Essentially you would have to integrate the example code I gave you into the Omega source and compile it. Otherwise you could use the code in your own indexer.

I'm not sure whether the Xapian mega coders responsible for Omega might find it worthy of official inclusion?

On 24 Jul 2008, at 12:19, Frank Bruzzaniti wrote:
> I have just set up my first test using omega + xapian, how would I
> integrate what you have provided below?
Hi Frank

Xapian is an excellent tool and will do what you want very well, but it is a tool and not a "shrink wrapped" product. It requires a lot of technical knowledge to implement and to use. Often developers will take Xapian, customise it, and create a user-friendly front end for it. omindex / Omega will do the job you're after, but needs customisation to suit your requirements.

There are people on this list who are available as paid consultants to help you if you don't have the technical background to implement Xapian. I'm sure they will make themselves available to you if you ask.

If you do want to get your hands dirty, then I'm sure everyone on this list will chip in to help you reach your goal.

Personally I use it to index everything from photos (using EXIF data) to PDF, Word, HTML etc. As long as you're able to extract raw text from something, you can put it in Xapian.

Regards
Colin

On 24 Jul 2008, at 12:37, Frank Bruzzaniti wrote:
> I think you should, then numbnuts like me could use it.
>
> You write your own indexer, wow.
>
> I was looking for an indexer that could index all my documents and
> then give a simple "google" like webpage that I could customize.
>
> I wanted to be able to process searchable PDFs and office
> documents, do you think xapian is the right project for me?