yarong@xennexinc.com
2008-Jan-24 12:41 UTC
[Xapian-discuss] Full Text Searching a Relational Model
Hi,

(Warning: not for the weak-hearted.)

I'm currently working on a project with a large and complex data model related to genomics. We are trying to build a search engine that provides "full text" and "field-based text" searches for our customer base (mostly academic research), and we are evaluating different tools for this purpose.

As a starting point, here is an example set of objects, stored in tables as a relational model:

  Gene [ID, Symbol, Description]
  Article - M:M with Gene [ID, Title]
  Disease - M:M with Gene [ID, Name]
  Author - M:M with Article [ID, Name]

(Note: the M:M tables exist and just link IDs.)

An example model would be (hierarchical, with relations dealt with as duplications):

  Gene [ID=1, Symbol=EGFR, Description=epidermal growth factor receptor]
    Article [ID=1, Title=EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy]
      Author [ID=1, Name=H. Michaelson]
      Author [ID=2, Name=J. Watson]
    Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry]
      Author [ID=1, Name=H. Michaelson]
      Author [ID=3, Name=M. Roberts]
    Disease [ID=1, Name=Epidermal sluffing]
  Gene [ID=2, Symbol=AHCY, Description=S-adenosylhomocysteine hydrolase]
    Article [ID=3, Title=Limited proteolysis of S-adenosylhomocysteine hydrolase: implications for the three-dimensional structure]
      Author [ID=4, Name=B. Cohen]
      Author [ID=5, Name=L. Alexander]
    Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry]
      Author [ID=1, Name=H. Michaelson]
      Author [ID=3, Name=M. Roberts]

Note the IDs in the objects above, as they convey the relations in the hierarchical model.

In our full-text search, we would like to allow users to search ANY textual field for any string, for instance the term "epidermal", and to display the list of genes that have any associated data containing that term (ranked, of course).
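To make the model concrete, here is a minimal Python sketch of the example data and of the gene-centric lookup described above. Everything about the layout is an assumption for illustration: the dictionaries and link lists stand in for the real tables, `gene_fields` and `search` are hypothetical helper names, and the case-insensitive substring test is only a placeholder for whatever a real full-text engine would do (stemming, tokenizing, indexing).

```python
from collections import defaultdict

# Rows copied from the example model above (plain dicts stand in for the tables).
genes = {
    1: {"Symbol": "EGFR", "Description": "epidermal growth factor receptor"},
    2: {"Symbol": "AHCY", "Description": "S-adenosylhomocysteine hydrolase"},
}
articles = {
    1: {"Title": "EGFR mutations in lung cancer: correlation with clinical "
                 "response to gefitinib therapy"},
    2: {"Title": "Proteomics analysis of epidermal protein kinases by target "
                 "class-selective prefractionation and tandem mass spectrometry"},
    3: {"Title": "Limited proteolysis of S-adenosylhomocysteine hydrolase: "
                 "implications for the three-dimensional structure"},
}
diseases = {1: {"Name": "Epidermal sluffing"}}
authors = {
    1: {"Name": "H. Michaelson"},
    2: {"Name": "J. Watson"},
    3: {"Name": "M. Roberts"},
    4: {"Name": "B. Cohen"},
    5: {"Name": "L. Alexander"},
}

# M:M link tables as (left ID, right ID) pairs.
gene_article = [(1, 1), (1, 2), (2, 3), (2, 2)]
gene_disease = [(1, 1)]
article_author = [(1, 1), (1, 2), (2, 1), (2, 3), (3, 4), (3, 5)]


def gene_fields(gene_id):
    """Yield (object type, object ID, field, text) for all data reachable from a gene."""
    for field, text in genes[gene_id].items():
        yield ("Gene", gene_id, field, text)
    for g, a in gene_article:
        if g != gene_id:
            continue
        yield ("Article", a, "Title", articles[a]["Title"])
        for art, au in article_author:
            if art == a:
                yield ("Author", au, "Name", authors[au]["Name"])
    for g, d in gene_disease:
        if g == gene_id:
            yield ("Disease", d, "Name", diseases[d]["Name"])


def search(term):
    """Map gene symbol -> list of (object type, object ID, field) containing the term."""
    needle = term.lower()
    hits = defaultdict(list)
    for gene_id, gene in genes.items():
        for obj, obj_id, field, text in gene_fields(gene_id):
            if needle in text.lower():
                hits[gene["Symbol"]].append((obj, obj_id, field))
    return dict(hits)


for symbol, found in search("epidermal").items():
    print(symbol, found)
```

Because Article 2 is linked to both genes, its title is visited once per gene, which is the "relations dealt with as duplications" idea in code form.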
Our list of results would be something like:

  EGFR
    Found in Description (epidermal growth factor receptor)
    Found in Article ID#2, in Title (proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry)
    Found in Disease ID#1, in Name (Epidermal sluffing)
  AHCY
    Found in Article ID#2, in Title (proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry)

Note that the results retain a hierarchical view of our genes (being gene-centric, we're pretty much framing the question as "find this term in information related to those genes"). Also note that Article ID#2 has an M:M relation with both Gene ID 2 (AHCY) and Gene ID 1 (EGFR), and only because of that is AHCY considered a gene that has "epidermal" in its annotations.

Obviously, we'd like to rank fields by location in the hierarchy (a term in a gene name is scored higher than the name of the author of an article related to a gene) and by number of hits (the number of times a term is found related to that gene, 3 in the case of EGFR above).

Ideas for how to take on this challenge? Implementation? Tools?

Thanks!
Yaron Golan
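
As a rough illustration of the ranking idea in this message, the following Python sketch scores each gene by summing a per-field weight over its hits, so that both the position in the hierarchy and the number of hits affect the order. The weight values, the `FIELD_WEIGHTS` table, and the `rank` function are all hypothetical and chosen only for the example; the hit list is the one given for "epidermal".

```python
# Hypothetical weights: fields closer to the gene itself score higher.
FIELD_WEIGHTS = {
    ("Gene", "Symbol"): 10.0,
    ("Gene", "Description"): 8.0,
    ("Disease", "Name"): 4.0,
    ("Article", "Title"): 3.0,
    ("Author", "Name"): 1.0,
}

# Hits for "epidermal", as (object type, field) pairs per gene symbol.
hits = {
    "EGFR": [("Gene", "Description"), ("Article", "Title"), ("Disease", "Name")],
    "AHCY": [("Article", "Title")],
}


def rank(hits, weights=FIELD_WEIGHTS):
    """Return (symbol, score, hit count) tuples, best match first."""
    scored = [
        (sum(weights[h] for h in gene_hits), len(gene_hits), symbol)
        for symbol, gene_hits in hits.items()
    ]
    # Higher score first, then more hits, then symbol for a stable order.
    scored.sort(key=lambda row: (-row[0], -row[1], row[2]))
    return [(symbol, score, count) for score, count, symbol in scored]


for symbol, score, count in rank(hits):
    print(f"{symbol}: score={score}, hits={count}")
```

A plain sum is only one possible way to combine the two criteria; a real engine would more likely fold field weights into its own relevance formula.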