On Tue, Mar 19, 2024 at 09:10:37PM +0530, Sagar Acharya wrote:
> I am using omindex to prepare a database.
>
> While omindex has a way to index a local website, what is the right
> way to index every subpage of stackoverflow?
We don't provide a crawler.
The simple approach is just to mirror the site locally (e.g. wget
--mirror can do this but there may well be better options) and index
with omindex from that local mirror. If the mirroring tool you use
supports incremental updates and only touches the timestamps of the
new/changed files then omindex should be able to incrementally update.
It'll have to scan the directory tree to find the new/changed files but
that's not usually the slow part.
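
For example, something along these lines (untested - the paths and the
exact wget options here are just placeholders to adapt to your setup):

    # Mirror the site to a local directory.
    wget --mirror --no-parent --adjust-extension --convert-links \
        --directory-prefix=/srv/mirror https://stackoverflow.com/

    # Index the mirrored tree into a Xapian database, mapping the
    # on-disk paths back to the original URLs.
    omindex --db /srv/xapian-db --url https://stackoverflow.com/ \
        /srv/mirror/stackoverflow.com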
Or find an existing web crawler and write a bit of code to feed the
pages it crawls into the Xapian API.
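
A rough sketch of that second approach using the Python bindings (the
crawl_pages() function here is a stand-in for whatever crawler you pick,
and the "S"/"U" term prefixes are just the usual conventions):

    import xapian

    db = xapian.WritableDatabase("./crawl-db", xapian.DB_CREATE_OR_OPEN)
    indexer = xapian.TermGenerator()
    indexer.set_stemmer(xapian.Stem("en"))

    # Assume the crawler yields (url, title, text) for each page fetched.
    for url, title, text in crawl_pages("https://stackoverflow.com/"):
        doc = xapian.Document()
        doc.set_data(url + "\n" + title)
        indexer.set_document(doc)
        indexer.index_text(title, 1, "S")  # "S" is the usual title prefix
        indexer.index_text(text)
        # Use the URL as a unique ID term so re-crawling updates in place.
        idterm = "U" + url
        doc.add_boolean_term(idterm)
        db.replace_document(idterm, doc)

    db.commit()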
> Which markups does xapian support, namely, html, javascript, reactjs,
> nodejs, etc.?
Of those, only HTML is actually a markup language (and it is supported).
We don't attempt to execute JavaScript in pages, but Node.js is
server-side, so it would effectively be supported when crawling a
website.
There's a full list of supported formats in the Omega docs (search in
the page for `formats`):
https://xapian.org/docs/omega/overview.html
The code on git master supports a few additional formats, so it's worth
checking there if there's a format you want which isn't in that list.
If there's an existing extractor for a format (it can be a command-line
tool, or git master also supports C/C++ libraries) then it shouldn't be
hard to hook it up. So if you really want client-side JavaScript
support, see if you can find a tool or library which renders a webpage
with its client-side JavaScript executed.
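
For the command-line route, omindex has a --filter option to run an
external command for a given MIME type. Something along these lines
(foo2html is a made-up extractor name, and check omindex --help for the
exact --filter syntax your version accepts):

    # Run "foo2html" on files of this MIME type and index its HTML output.
    omindex --db /srv/xapian-db --url / \
        --filter=application/x-foo,html:foo2html \
        /srv/mirror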
Cheers,
Olly