PGNet Dev
2020-Nov-15 19:48 UTC
[patch] enhancement for tika server protected by user/password basic auth
On 11/15/20 11:13 AM, John Fawcett wrote:
> Just a couple of updates about Tika and Solr together.
>
> 1. On mass reindexing I'm seeing panics - see below. These are present
> with Dovecot 2.3.10 and 2.3.11.3. Seem to go away with the fix which was
> previously posted on this list by Josef 'Jeff' Sipek, which I repeat
> below for ease of reference.
>
> 2. On mass reindexing my Tika server seems to get a bit overwhelmed. I
> think I'll need to look into how resources are allocated and do some
> tuning. This produces 502 Proxy Error responses back to Dovecot.

Which tika instance are you running on the backend? The tika-app.jar, with --server? or the JAXRS tika-server.jar?

> As far as Dovecot integration with Tika, I believe that some resource
> limits would be helpful. I think it would make sense to have a limit in
> Dovecot about the maximum file size it will try to send to Tika.
> Potentially, it could be useful also to allow configuration of the types
> of file to send to Tika. For example I see lots of image files going
> across, but I'd probably be happy not to have them indexed. It won't be
> perfect, since those file types could exist inside zip files, but maybe
> would cut out a bit of the load.

Solr itself apparently has 'tika integration' out of the box. Since the solr server instance bundles jetty _anyway_, and it _is_ already up/running ... wondering if the indexing load can be better managed there.

iiuc, limits and types can be specified in solr/tika config directly.

perhaps Dovecot can be configured to send all messages+attachments, and let solr/tika config 'choose' to index just the message, or the attachment as well.

that said, config in Dovecot is certainly convenient.
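The size and type limits proposed in the quoted text could be sketched as a pre-filter that runs before anything is sent to Tika. This is illustrative only: nothing in this thread shows that Dovecot has such settings, and the names `MAX_TIKA_BYTES`, `SKIPPED_TYPE_PREFIXES` and `should_send_to_tika` are hypothetical, as is the 5 MB cap.

```python
# Hypothetical pre-filter; these limits are assumptions, not Dovecot settings.
MAX_TIKA_BYTES = 5 * 1024 * 1024      # assumed cap on attachment size
SKIPPED_TYPE_PREFIXES = ("image/",)   # assumed: do not index images


def should_send_to_tika(content_type: str, size_bytes: int) -> bool:
    """Return True if an attachment is worth sending to Tika."""
    # Drop parameters such as "; name=..." and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if size_bytes > MAX_TIKA_BYTES:
        return False
    return not media_type.startswith(SKIPPED_TYPE_PREFIXES)
```

As the quoted text points out, a check like this only sees the top-level type, so an image inside a zip file would still go across.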
John Fawcett
2020-Nov-15 20:21 UTC
[patch] enhancement for tika server protected by user/password basic auth
On 15/11/2020 20:48, PGNet Dev wrote:
> On 11/15/20 11:13 AM, John Fawcett wrote:
>> Just a couple of updates about Tika and Solr together.
>>
>> 1. On mass reindexing I'm seeing panics - see below. These are present
>> with Dovecot 2.3.10 and 2.3.11.3. Seem to go away with the fix which was
>> previously posted on this list by Josef 'Jeff' Sipek, which I repeat
>> below for ease of reference.
>>
>> 2. On mass reindexing my Tika server seems to get a bit overwhelmed. I
>> think I'll need to look into how resources are allocated and do some
>> tuning. This produces 502 Proxy Error responses back to Dovecot.
>
> Which tika instance are you running on the backend?
>
> The tika-app.jar, with --server? or the JAXRS tika-server.jar?

I'm using tika-server.jar installed as a service

>
>> As far as Dovecot integration with Tika, I believe that some resource
>> limits would be helpful. I think it would make sense to have a limit in
>> Dovecot about the maximum file size it will try to send to Tika.
>> Potentially, it could be useful also to allow configuration of the types
>> of file to send to Tika. For example I see lots of image files going
>> across, but I'd probably be happy not to have them indexed. It won't be
>> perfect, since those file types could exist inside zip files, but maybe
>> would cut out a bit of the load.
>
> Solr itself apparently has 'tika integration' out of the box.
> Since the solr server instance bundles jetty _anyway_, and it _is_
> already up/running ...
> wondering if the indexing load can be better managed there.

Dovecot currently implements separate integrations, first the attachments are sent to tika, then the results are sent to solr. The two could even be running on separate servers.

>
> iiuc, limits and types can be specified in solr/tika config directly.
>
> perhaps Dovecot can be configured to send all messages+attachments,
> and let solr/tika config 'choose' to index just the message, or the
> attachment as well.

Yes that could be an alternative way, so instead of sending the attachments to tika, send the attachments to solr and let it send them to tika. It would be more than configuration in Dovecot though.

>
> that said, config in Dovecot is certainly convenient.
>

Yes, I think limits on Dovecot are useful in any case, otherwise you end up sending arbitrary sized files across the network to have them thrown away on the server.

John
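For readers following the "tika first, then solr" flow, the first hop can be sketched as a plain HTTP request to tika-server's `/tika` endpoint with a Basic auth header, which is the case this thread's subject is about. This is a rough sketch and not Dovecot's code: the function name `build_tika_request` is hypothetical, the URL and credentials are placeholders, and the request is only built, not sent.

```python
import base64
import urllib.request


def build_tika_request(base_url: str, body: bytes, content_type: str,
                       user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a text-extraction request for tika-server."""
    # HTTP Basic auth: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=body,
        method="PUT",
        headers={
            "Content-Type": content_type,   # type of the attachment being sent
            "Accept": "text/plain",         # ask for extracted plain text
            "Authorization": "Basic " + token,
        },
    )
```

Passing the result to `urllib.request.urlopen()` would return the extracted text, which is what then gets forwarded to Solr for indexing in the flow described above.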
PGNet Dev
2020-Nov-15 20:54 UTC
[patch] enhancement for tika server protected by user/password basic auth
On 11/15/20 12:21 PM, John Fawcett wrote:
> I'm using tika-server.jar installed as a service

yup. same here. atm, listening on localhost, with Dovecot -> Tika direct, no proxy. similarly fragile under load. throwing ~10 messages with .5-5MB attachments at it at once causes all sorts of complaints. one at a time seems OK ...

> Dovecot currently implements separate integrations, first the
> attachments are sent to tika, then the results are sent to solr.

ah, so tika first ...

> The two could even be running on separate servers.

Not sure when that's a useful use case. I can certainly see a separate, integrated solr+tika server. Extremely heavy loads, I guess.

> Yes that could be an alternative way, so instead of sending the
> attachments to tika, send the attachments to solr and let it send them
> to tika. It would be more than configuration in Dovecot though.

yup. taking a look at solr cell + tika integration to see where the config makes most sense. this is a useful 1st read

https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html

> Yes, I think limits on Dovecot are useful in any case, otherwise you end
> up sending arbitrary sized files across the network to have them thrown
> away on the server.

point taken. afaict, fts_solr has only a batch_size limit -- but neither a total message size nor an attachment size limit.
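Since one attachment at a time seems OK while ~10 at once does not, one client-side workaround is to cap how many extractions are in flight. A minimal sketch, assuming the caller supplies its own `extract` function (for example, something that PUTs to tika-server); `extract_all` and `max_in_flight` are hypothetical names and this is not how Dovecot schedules its requests.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def extract_all(attachments: Iterable[bytes],
                extract: Callable[[bytes], str],
                max_in_flight: int = 1) -> List[str]:
    """Run extract() over attachments with at most max_in_flight at once."""
    # The pool size is the concurrency cap; extra work waits in the queue.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(extract, attachments))
```

With the default of 1 this reproduces the "one at a time" behaviour; raising it trades throughput against the risk of overwhelming the Tika server again.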