thr3ads.net - Xapian discuss - Amount of writes during index creation [Feb 2019]

Jean-Francois Dockes

2019-Feb-03 09:07 UTC

Amount of writes during index creation

Bron Gondwana writes:
 > This is quite possibly part of the underlying write explosion that we ran
 > into when we wrote:
 >
 > https://fastmail.blog/2014/12/01/email-search-system/
 >
 > Which now almost 5 years on, has been running like a champion! We're
 > really pleased with how well it works. Xapian reads from multiple databases
 > are really easy, and the immediate writes onto tmpfs and daily compacts work
 > really well. We also have a cron job which runs hourly and will do immediate
 > compacts to disk from memory if the tmpfs hits more than 50% of its nominal
 > size, and it keeps us from almost ever needing to do any manual management
 > as this thing indexed millions of new emails per day across our cluster.
 >
 > And then when we do the compact down to disk, it's a single thread
 > compacting indexes while new emails still index to tmpfs, so there's always
 > tons of IO available for searches.
 >
 > I think even with more efficient IO patterns, I'd still stick with the
 > design we have. It's really nice :)
 >
 > Bron.


Thank you for this information.

I re-ran the 20 GB index creation with the latest xapian git code but a
much smaller commit threshold (20 MB instead of 200 MB). More than 800 GB
of data were written (instead of 125 GB).
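
To put these numbers in perspective, here is a rough sketch of the write amplification they imply. It assumes that the 20 GB figure is the final index size; the function name is only illustrative.

```python
# Write amplification = data written to disk / final index size.
# Assumption of this sketch: 20 GB is the size of the finished index.

def write_amplification(written_gb: float, index_gb: float) -> float:
    """Return how many times the final index size was written to disk."""
    if index_gb <= 0:
        raise ValueError("index size must be positive")
    return written_gb / index_gb


INDEX_GB = 20
runs = {
    "200 MB commit threshold": 125,  # GB written
    "20 MB commit threshold": 800,   # "more than 800 GB", so a lower bound
}

for label, written_gb in runs.items():
    factor = write_amplification(written_gb, INDEX_GB)
    print(f"{label}: index written roughly {factor:.2f} times over")
```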

So it would seem that the right approach for creating big indexes is to:

- Always set the commit threshold as high as the available RAM allows.

- Use the upcoming Xapian 1.4.10: the patch brings a significant improvement.

- Segment the index, then use xapian-compact to merge if needed. It would
  be interesting to see how the Fastmail approach works for an initial bulk
  index creation compared to just segmenting. That is, what is the optimal
  number of merges?
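
The segment-then-merge step could look something like the sketch below. It only builds the xapian-compact command line (source databases first, destination last); the segment paths are hypothetical, and nothing is executed unless the final line is uncommented.

```python
import shlex


def compact_command(segments, destination, multipass=False):
    """Build an argv list for xapian-compact: sources first, destination last."""
    if not segments:
        raise ValueError("need at least one source database")
    argv = ["xapian-compact"]
    if multipass:
        # Merges the postlists in several passes when there are many sources.
        argv.append("--multipass")
    argv.extend(str(s) for s in segments)
    argv.append(str(destination))
    return argv


# Hypothetical segment directories, each one a separately built index.
segments = [f"/data/index/seg{i:02d}" for i in range(4)]
cmd = compact_command(segments, "/data/index/merged")
print(shlex.join(cmd))

# To actually run the merge:
# import subprocess; subprocess.run(cmd, check=True)
```

Whether one final merge or several intermediate ones writes less data is exactly the open question above; this sketch does not answer it.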

JF

Bron Gondwana

2019-Feb-03 09:22 UTC

Amount of writes during index creation

Indexing to tmpfs is nice because there is no disk I/O! So I would guess: as
much as you have memory for at once. We have one index per user, and we have
never had a user so big that we can't fit their index in memory, so initial
index creation always builds the entire index in memory and then compacts it
to the archive.
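
For illustration, the hourly check described in the quoted message (compact to disk once the tmpfs is more than 50% full) might be sketched as follows. This is not Fastmail's actual script; the function names are made up, and the system temp directory stands in for the tmpfs mount point.

```python
import shutil
import tempfile


def usage_fraction(used: int, total: int) -> float:
    """Fraction of the filesystem in use (0.0 if the total is unknown)."""
    return used / total if total > 0 else 0.0


def needs_compact(path: str, threshold: float = 0.5) -> bool:
    """True if the filesystem holding `path` is fuller than `threshold`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage_fraction(usage.used, usage.total) > threshold


# Stand-in for the tmpfs mount point that holds the in-memory indexes.
index_dir = tempfile.gettempdir()

if needs_compact(index_dir):
    print(f"{index_dir} is over 50% full: compact to disk now")
else:
    print(f"{index_dir} is below the threshold: nothing to do")
```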

-- 
 Bron Gondwana
 brong at fastmail.fm

Jean-Francois Dockes

2019-Feb-03 11:04 UTC

Amount of writes during index creation

Bron Gondwana writes:
 > Indexing to tmpfs is nice because there is no disk Io! So I would guess as
 > much as you have memory for at once. We have one index per user, and we have
 > never had a user so big that we can't fit their index in memory, so initial
 > index creation is always build entire index in memory then compact it to
 > archive.

The user who presented the issue seems to be creating a huge index (one of
the tests was stopped with an index size of almost 250 GB). Depending on
local conditions, using tmpfs may force many merges. I don't know how
efficient this would be compared to creating several dbs of similar size on
disk and then either merging them or querying them in parallel.
Experimentation needed...

Also, depending on how the source data is organized, it may not be simple
to segment it into small enough pieces, and Recoll has nothing to help with
this.
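
As a sketch of what such segmenting could look like when the source is a plain list of files, the following groups consecutive items into batches under a size budget. It is not a Recoll feature, and it assumes that source size is a usable proxy for index size, which may well not hold.

```python
def partition_by_size(items, budget):
    """Group (name, size) pairs into consecutive batches within `budget`.

    An item larger than the budget ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in items:
        # Close the current batch if adding this item would exceed the budget.
        if current and current_size + size > budget:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical source folders with sizes in GB; each batch would be
# indexed into its own database.
sources = [("mail/2016", 4), ("mail/2017", 4), ("mail/2018", 4), ("docs", 10)]
for number, batch in enumerate(partition_by_size(sources, budget=8)):
    print(f"segment {number}: {batch}")
```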



