Hi all,

We use Xapian as the backend of our system. The amount of data to index keeps growing and the local mode is hard to maintain, so we plan to move the index builder to Hadoop.

We are trying to get Xapian to run on Hadoop and have hit a problem: Xapian performs many seek operations when it writes the index files, but seek() in the Hadoop C API only supports reading, so we are blocked. Rewriting the Xapian database backend to adapt it to the Hadoop C API looks like a lot of work.

Could you please give us some suggestions?

Aimee Cheng

On Thu, Nov 21, 2019 at 10:20:19AM +0800, ??? wrote:
> We use xapian as the backend of our system. Now the data need be
> indexed ever-increasing, and the local mode is hard to maintain, so we
> plan to move the index builder to hadoop. We try to make xapian can be
> run in hadoop, and now met a problem that there are many seek
> operations when xapian writes the index files, but the method seek()
> in hadoop c api only support read, and we blocked by that now

Updating a glass backend database pretty fundamentally requires a
way to "write block N". We don't actually require the ability to
seek arbitrarily, but if hadoop writes are limited to appending to
a file your approach is just not going to work for updating.

It might be possible to buffer up everything in RAM and then write out a
glass database in one go with such a limitation, but if you're having
scaling problems then forcing a situation where the whole database needs
to be created in RAM before it can be written is not going to help.

> It looks a big work to rewrite the xapian database backend to
> adapter the hadoop c api. Could you please give us some suggestions?

The in-development backend (honey) would probably be easier to get
to work here once finished, but currently it doesn't support
writing directly so that's no help if you want a solution now.

Perhaps you could elaborate on the problem you're actually trying
to solve here.

What does "the local mode is hard to maintain" actually mean?

Cheers,
    Olly
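
To illustrate the "write block N" requirement mentioned in the message above, here is a minimal Python sketch. It is not Xapian code: `BLOCK_SIZE`, `write_block` and `demo` are hypothetical names chosen for illustration, and the file is an ordinary local POSIX file. It only shows the kind of positional overwrite that an append-only file API cannot express.

```python
import os
import tempfile

# Illustrative block size only; an assumption for this sketch.
BLOCK_SIZE = 8192


def write_block(fd, n, data):
    """Overwrite block number n in place using a positional write."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("data must be exactly one block long")
    # os.pwrite writes at an explicit byte offset without a separate seek.
    os.pwrite(fd, data, n * BLOCK_SIZE)


def demo():
    """Write three blocks, update the middle one, return (file size, first byte of block 1)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "table.demo")
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            for n in range(3):
                write_block(fd, n, bytes([n]) * BLOCK_SIZE)
            # Update block 1 in place: the file must not grow.
            write_block(fd, 1, b"\xff" * BLOCK_SIZE)
            size = os.fstat(fd).st_size
            block1 = os.pread(fd, BLOCK_SIZE, 1 * BLOCK_SIZE)
        finally:
            os.close(fd)
        return size, block1[:1]
```

With an append-only file, the second write to block 1 could only land at the end of the file, which is why updating in place does not map onto such an API.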

At 2019-11-22 13:45:23, "Olly Betts" <olly at survex.com> wrote:
> What does "the local mode is hard to maintain" actually mean?

OK, some of our databases are very large, so we partition them into 16 or more shards. Running on local hosts rather than on a distributed framework means we have to do a lot of work to maintain the databases and the builder hosts: for example, taking care of storage, fault tolerance, and some other things. It also does not scale well.