thr3ads.net - Xapian discuss - [Xapian-discuss] docx support [Jul 2008]


Frank Bruzzaniti

2008-Jul-23 18:38 UTC

[Xapian-discuss] docx support

Are Office 2007 formats like docx supported?

Is there any way to get Xapian to index Office 2007 formats?

Is there any option/procedure to add a new MIME plugin?
For example, if you rename a docx to .zip you can retrieve the text from
document.xml

Thanks

Frank

Olly Betts

2008-Jul-24 01:51 UTC


[Xapian-discuss] docx support

On Thu, Jul 24, 2008 at 04:08:26AM +0930, Frank Bruzzaniti wrote:
> Is office 2007 formats like docx supported?
Out of the box, not unless antiword supports it.  The last update to the
debian packaged version was August 2006, so I suspect the answer is
"no".
> Is there anyway to get xapian to index office 2007 formats?
> 
> Is there any option/procedure to add a new mime plugin?
> For example if you rename a docx .zip you can retrieve text from 
> document.xml
I assume you mean for Omega's omindex indexer?

There isn't currently a way to configure additional filters without
modifying the source code in omindex.cc (ideally there should be a
configuration file to allow this, but it's not implemented yet), but
it's quite easy to wire in additional external filters if you aren't
scared of dabbling in C++.
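
To give a feel for what an external filter involves, here is a minimal sketch. It is not Omega's actual code: the helper name run_filter is hypothetical, and it assumes a POSIX system where popen() is available. It simply runs a command and collects whatever the command prints on standard output, which is the text an indexer would then feed to Xapian.

```cpp
#include <stdio.h>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Omega): run a shell command and return
// everything it writes to stdout. Throws std::runtime_error if the command
// cannot be started or finishes with a non-zero status.
std::string run_filter(const std::string& cmd) {
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) throw std::runtime_error("popen failed: " + cmd);

    std::string out;
    char buf[4096];
    size_t n;
    // Read the child's output in chunks until end of file.
    while ((n = fread(buf, 1, sizeof buf, pipe)) > 0) {
        out.append(buf, n);
    }

    // pclose() waits for the command and reports how it ended.
    int status = pclose(pipe);
    if (status != 0) throw std::runtime_error("filter failed: " + cmd);
    return out;
}
```

A caller would wrap run_filter() in a try/catch and skip the file when the filter fails, rather than aborting the whole indexing run.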

Rather than writing a full guide here, I'm going to write this up as a
wiki page, since that will be easier for others to find in the future.
I'll reply again when I'm done.

Cheers,
    Olly

Colin Bell

2008-Jul-24 08:08 UTC


[Xapian-discuss] docx support

This is how I do it using the TinyXML parser. My XML parsing may be a bit
convoluted but it works. The same approach can be applied to PowerPoint
and Excel files too.

...
	mime_map["docx"] = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
	mime_map["pptx"] = "application/vnd.openxmlformats-officedocument.presentationml.presentation";
	mime_map["xlsx"] = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

...

//HANDLE DOCX WORD DOCUMENTS
	if (mimetype == "application/vnd.openxmlformats-officedocument.wordprocessingml.document"){
		// A .docx file is a zip archive; "unzip -p" writes one member to stdout.
		string cmd = "unzip -p " + shell_protect(filepath) + " docProps/core.xml";
		fileData+=parseWordXMetaData(mstdout_to_string(cmd));
		cmd = "unzip -p " + shell_protect(filepath) + " docProps/app.xml";
		fileData+=parseWordXMetaData(mstdout_to_string(cmd));
		cmd = "unzip -p " + shell_protect(filepath) + " docProps/custom.xml";
		fileData+=parseWordXCustomMetaData(mstdout_to_string(cmd));
		// The document body lives in word/document.xml.
		cmd = "unzip -p " + shell_protect(filepath) + " word/document.xml";
		try{
			XmlParser xmlparser;
			xmlparser.parse_html(mstdout_to_string(cmd));
			dump = xmlparser.dump;
		} catch (ReadError) {
			cout << "\"" << cmd << "\" failed - skipping\n";
			return 0;
		}
	}
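
The snippet above relies on shell_protect() to make the file path safe to paste into a shell command. As a rough illustration of that idea, here is a minimal sketch. The names shell_quote and unzip_member_cmd are hypothetical, and this is not necessarily how shell_protect() itself is implemented. It assumes a POSIX shell, where text inside single quotes is taken literally.

```cpp
#include <string>

// Hypothetical stand-in for a shell_protect()-style helper: wrap the
// argument in single quotes and escape embedded single quotes, so that a
// POSIX shell treats the result as one literal word.
std::string shell_quote(const std::string& arg) {
    std::string out = "'";
    for (char c : arg) {
        if (c == '\'') {
            out += "'\\''";  // close the quote, add an escaped ', reopen
        } else {
            out += c;
        }
    }
    out += "'";
    return out;
}

// Build the command that prints one member of a .docx (a zip archive)
// to standard output using "unzip -p".
std::string unzip_member_cmd(const std::string& docx,
                             const std::string& member) {
    return "unzip -p " + shell_quote(docx) + " " + shell_quote(member);
}
```

For a file called "my report.docx", unzip_member_cmd("my report.docx", "word/document.xml") yields a command in which the space in the file name can no longer split the argument.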

string parseWordXCustomMetaData(string xml){
	string fileData = "";
	TiXmlDocument doc;
	doc.Parse(xml.c_str());
	TiXmlElement* root = doc.RootElement();
	if(!root) return fileData;
	// Each child element of the root is one custom property; its "name"
	// attribute holds the property name.
	for (TiXmlElement* aElem = root->FirstChildElement(); aElem != 0; aElem = aElem->NextSiblingElement()){
		const char* name = aElem->Attribute("name");
		if(!name) continue;	// no name attribute: skip rather than build a string from NULL
		// The property value is the text of the child element(s).
		for (TiXmlElement* bElem = aElem->FirstChildElement(); bElem != 0; bElem = bElem->NextSiblingElement()){
			const char* value = bElem->GetText();
			if(value){
				fileData += string("name:") + name + "=\"" + value + "\"\n";
			}
		}
	}
	return fileData;
}

Easy peasy ;-)
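
For anyone without TinyXML to hand, the body text in word/document.xml can be roughly approximated with plain string handling. The following is a sketch under stated assumptions, not the method used above: the function name docx_xml_to_text is hypothetical, it treats every closing </w:p> as a paragraph break, and it only decodes the five predefined XML entities. A real indexer is better served by a proper XML parser.

```cpp
#include <string>

// Sketch only: pull visible text out of WordprocessingML by dropping tags,
// ending a line at each closing </w:p>, and decoding the five predefined
// XML entities.
std::string docx_xml_to_text(const std::string& xml) {
    static const char* const ents[][2] = {
        {"&amp;", "&"}, {"&lt;", "<"}, {"&gt;", ">"},
        {"&quot;", "\""}, {"&apos;", "'"}};

    std::string text;
    std::size_t i = 0;
    while (i < xml.size()) {
        if (xml[i] == '<') {
            // Skip the whole tag; stop if it is never closed.
            std::size_t end = xml.find('>', i);
            if (end == std::string::npos) break;
            if (xml.compare(i, end - i + 1, "</w:p>") == 0) text += '\n';
            i = end + 1;
        } else if (xml[i] == '&') {
            bool matched = false;
            for (const auto& e : ents) {
                const std::string name = e[0];
                if (xml.compare(i, name.size(), name) == 0) {
                    text += e[1];
                    i += name.size();
                    matched = true;
                    break;
                }
            }
            if (!matched) text += xml[i++];  // unknown entity: keep as-is
        } else {
            text += xml[i++];
        }
    }
    return text;
}
```

Feeding it the output of "unzip -p file.docx word/document.xml" gives one line of text per paragraph, which is enough to hand to an indexer.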


Colin Bell

2008-Jul-24 11:29 UTC


[Xapian-discuss] docx support

Hi Frank

You will have to get your hands dirty I'm afraid.

I use my own indexer (which is very customised) and not Omega.  
Essentially you would have to integrate the example code I gave you  
into the Omega source and compile it. Otherwise you could use the code  
in your own indexer.

I'm not sure if the Xapian mega coders responsible for Omega might  
find it worthy of official inclusion?

On 24 Jul 2008, at 12:19, Frank Bruzzaniti wrote:
> I have just set up my first test using omega + xapian. How would I
> integrate what you have provided below?

Colin Bell

2008-Jul-24 11:51 UTC


[Xapian-discuss] docx support

Hi Frank

Xapian is an excellent tool and will do what you want very well, but
it is a tool and not a "shrink-wrapped" product. It requires a lot of
technical knowledge to implement and to use. Often developers will
take Xapian, customise it, and create a user-friendly front end for
it. Omindex / Omega will do the job you're after but needs
customisation to suit your requirements. There are people on this list
who are available as paid consultants to help you if you don't have the
technical background to implement Xapian. I'm sure they will make
themselves available to you if you ask.

If you do want to get your hands dirty, then I'm sure everyone on this
list will chip in to help you reach your goal.

Personally I use it to index everything from photos (using exiv data)
to pdf, word, html etc. As long as you're able to extract raw text from
something, you can put it in Xapian.

Regards

Colin

On 24 Jul 2008, at 12:37, Frank Bruzzaniti wrote:
> I think you should; then numbnuts like me could use it.
>
> You wrote your own indexer, wow.
>
> I was looking for an indexer that could index all my documents and
> then give a simple "Google"-like web page that I could customize.
>
> I wanted to be able to process searchable PDFs and Office
> documents. Do you think Xapian is the right project for me?
