Hello, I am working on an indexing/search engine for speech and I would like to try to use Xapian for that. I have an idea how to do it in Xapian, but I am not sure, if it is correct since I have just quickly looked at the Xapian code. Tokens I need to index: Each speech audio record, processed by a speech recognizer is converted to an oriented graph of hypotheses. Each hypothesis contains the recognized word, start time, end time and confidence score. These hypotheses are overlapped in time, so there is generally a bunch of hypotheses in each point of time. A simple graph of hypotheses (output of speech recognizer): http://www.research.ibm.com/journal/sj/404/brown1.gif ?So I suppose that the main thing I need to change in Xapian code is the termpos type (in types.h), which is just an unsigned integer. For speech indexing I need to change it to a struct containing start time, end time and score of recognized words. Then to be able to search for phrases correctly, I have to change the code in ./matcher/phrasepostlist.cc to take start and end time into account. Please, correct me if I am wrong or if I missed something. I am really new to Xapian, so I will be grateful for any hint on this problem (tutorial, code snippet, doxygen page, ...). Thank you, Miso
Richard Boulton
2008-Jun-05 23:12 UTC
[Xapian-devel] indexing and searching of timed events
Michal Fapso wrote:> I am working on an indexing/search engine for speech and I would like to > try to use Xapian for that. I have an idea how to do it in Xapian, but I > am not sure, if it is correct since I have just quickly looked at the > Xapian code.Are you aiming for the textual queries and spoken documents, or spoken queries and spoken documents? (Spoken queries and textual documents is also probably an interesting case, but from the next paragraph it sounds like your documents are definitely spoken.)> Tokens I need to index: > Each speech audio record, processed by a speech recognizer is converted > to an oriented graph of hypotheses. Each hypothesis contains the > recognized word, start time, end time and confidence score. These > hypotheses are overlapped in time, so there is generally a bunch of > hypotheses in each point of time. > > A simple graph of hypotheses (output of speech recognizer): > http://www.research.ibm.com/journal/sj/404/brown1.gifOut of interest, what speech recogniser are you using? If it's an open source one, others here might be interested in collaborating. Sphinx-4 is the only open source recogniser I know of, but there may be others (I've not looked for them).> ?So I suppose that the main thing I need to change in Xapian code is the > termpos type (in types.h), which is just an unsigned integer. For speech > indexing I need to change it to a struct containing start time, end time > and score of recognized words.If you can possibly work around it, I recommend avoiding changing the termpos type - not only will you have to change the type in the header, you'll need to modify the code which stores the list of positions to store this header. Currently, this uses quite a bit of compression, which you'd probably have to rip out and replace (possibly replacing with no compression, but that would make databases quite a bit bigger, and correspondingly slower). It's going to be messy, and there is a better way. The current phrase checking support in xapian works by performing an "AND" search for all the terms in the phrase. Any document which doesn't match the AND can't possibly match the phrase, so it is immediately discarded. Any document which does match the AND needs to be checked further - at this point, the position lists for the terms in the phrase are opened, and searched for any occurrences of the words in the correct order. Xapian's phrase searching, as you've noticed, depends on each term having a position specifiable by a single integer. It then looks for occurrences of the words, in increasing order of position, such that the difference in the position of the first and last word is less than a given window size. This means that (unless the window size is equal to the number of terms), phrases can contain gaps. This is no good for you, if I understand correctly, because you don't want phrases to be able to contain gaps. However the lattices coming out of the recogniser don't allow the positions to be numbered such that all possible paths involve a difference of 1 between successive words. Therefore, you do indeed need to store different information. However, I believe that all you need to store is the links in the network coming out of the recogniser. One approach to the problem is to implement your own phrase matching. Instead of using Xapian's built in support for storing the position list, you can store an arbitrary piece of data associated with each document. I would recommend storing this in a value slot (see Document::add_value()). If you store data describing the lattice from the recogniser, you should then be able to attempt to find a path through the lattice which corresponds to the query. You can use a MatchDecider (or, if you use SVN HEAD on the trunk branch, you can use a PostingSource) to perform arbitrary calculations using this data at search time, so you can implement whatever check algorithm you need. It'll be fiddly, mainly in serialising and unserialising the data, but it should be doable without needing to modify any of the Xapian core code.> Then to be able to search for phrases correctly, I have to change the > code in ./matcher/phrasepostlist.cc to take start and end time into > account. > > Please, correct me if I am wrong or if I missed something. I am really > new to Xapian, so I will be grateful for any hint on this problem > (tutorial, code snippet, doxygen page, ...).-- Richard