PGNet Dev
2020-Nov-15 20:54 UTC
[patch] enhancement for tika server protected by user/password basic auth
On 11/15/20 12:21 PM, John Fawcett wrote:
> I'm using tika-server.jar installed as a service

yup. same here.

atm, listening on localhost, with Dovecot -> Tika direct, no proxy.

similarly fragile under load. throwing ~10 messages with .5-5MB attachments at it at once causes all sorts of complaints.

one at a time seems OK ...

> Dovecot currently implements separate integrations, first the
> attachments are sent to tika, then the results are sent to solr.

ah, so tika first ...

> The two could even be running on separate servers.

Not sure when that's a useful usecase. I can certainly see a separate, integrated solr+tika server.

Extremely heavy loads, I guess.

> Yes that could be an alternative way, so instead of sending the
> attachments to tika, send the attachments to solr and let it send them
> to tika. It would be more than configuration in Dovecot though.

yup. taking a look at solr cell + tika integration to see where the config makes most sense.

this is a useful 1st read

https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html

> Yes, I think limits on Dovecot are useful in any case, otherwise you end
> up sending arbitrary sized files across the network to have them thrown
> away on the server.

point taken.

afaict, fts_solr has only a batch_size limit -- but neither a total message size nor an attachment size limit.
John Fawcett
2020-Nov-15 21:29 UTC
[patch] enhancement for tika server protected by user/password basic auth
On 15/11/2020 21:54, PGNet Dev wrote:
> On 11/15/20 12:21 PM, John Fawcett wrote:
>> I'm using tika-server.jar installed as a service
>
> yup. same here.
>
> atm, listening on localhost, with Dovecot -> Tika direct, no proxy.
>
> similarly fragile under load. throwing ~10 messages with .5-5MB
> attachments at it at once causes all sorts of complaints.
>
> one at a time seems OK ...
>
>> Dovecot currently implements separate integrations, first the
>> attachments are sent to tika, then the results are sent to solr.
>
> ah, so tika first ...
>
>> The two could even be running on separate servers.
>
> Not sure when that's a useful usecase. I can certainly see a
> separate, integrated solr+tika server.
>
> Extremely heavy loads, I guess.

Not sure when it would be useful, but that was just to underline the current integration model for Dovecot.

>> Yes that could be an alternative way, so instead of sending the
>> attachments to tika, send the attachments to solr and let it send them
>> to tika. It would be more than configuration in Dovecot though.
>
> yup. taking a look at solr cell + tika integration to see where the
> config makes most sense.
>
> this is a useful 1st read
>
> https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html

It's an approach that could be worth looking into, though not using solr cell, given the following statements at that link:

"If any exceptions cause the ExtractingRequestHandler and/or Tika to crash, Solr as a whole will also crash because the request handler is running in the same JVM that Solr uses for other operations. Indexing can also consume all available Solr resources, particularly with large PDFs, presentations, or other files that have a lot of rich media embedded in them. For these reasons, Solr Cell is not recommended for use in a production system."

>> Yes, I think limits on Dovecot are useful in any case, otherwise you end
>> up sending arbitrary sized files across the network to have them thrown
>> away on the server.
>
> point taken.
>
> afaict, fts_solr has only a batch_size limit -- but neither a total
> message size, or an attachment size limit.

Yes, batch_size was an attempt to introduce some configurable limit. If attachments are being sent across, it may not be sufficient.

John
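To make the limits idea concrete, here is a rough sketch of a client-side size guard of the kind discussed above: decide what to send before anything crosses the network. This is only an illustration in Python, not Dovecot code. The function name and the two parameters (max_attachment_bytes, max_total_bytes) are hypothetical and do not correspond to any existing fts_solr or fts_tika setting.

```python
from typing import Iterable, List, Tuple


def select_attachments(
    attachments: Iterable[Tuple[str, int]],
    max_attachment_bytes: int,
    max_total_bytes: int,
) -> Tuple[List[str], List[str]]:
    """Split (name, size) pairs into those to send and those to skip.

    Hypothetical limits: a per-attachment cap and a per-message total cap.
    Empty attachments are skipped as well, since there is nothing to extract.
    """
    send: List[str] = []
    skip: List[str] = []
    total = 0
    for name, size in attachments:
        too_big = size > max_attachment_bytes
        over_budget = total + size > max_total_bytes
        if size <= 0 or too_big or over_budget:
            skip.append(name)
            continue
        total += size
        send.append(name)
    return send, skip
```

With a 1 MB cap on both limits, a 0.5 MB attachment would be sent, while a 5 MB attachment, an empty one, and a further 0.6 MB attachment that would push the total past the cap would all be skipped locally instead of being thrown away on the server.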
PGNet Dev
2020-Nov-16 00:14 UTC
[patch] enhancement for tika server protected by user/password basic auth
On 11/15/20 1:29 PM, John Fawcett wrote:
>> atm, listening on localhost, with Dovecot -> Tika direct, no proxy.
>>
>> similarly fragile under load. throwing ~10 messages with .5-5MB attachments at it at once causes all sorts of complaints.

frequently, like this

Nov 15 15:59:40 test.loc tika[35696]: INFO tika/ (message/rfc822)
Nov 15 15:59:41 test.loc tika[35696]: WARN tika/: Text extraction failed (null)
Nov 15 15:59:41 test.loc tika[35696]: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:409)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:521)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1472)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1300)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1215)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.Server.handle(Server.java:500)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
Nov 15 15:59:41 test.loc tika[35696]: at java.base/java.lang.Thread.run(Thread.java:832)
Nov 15 15:59:41 test.loc tika[35696]: ERROR Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Nov 15 15:59:41 test.loc tika[35696]: INFO tika/ (message/rfc822)
Nov 15 15:59:41 test.loc tika[35696]: WARN tika/: Text extraction failed (Tried to contact you | Quote #Q4889744.eml)
Nov 15 15:59:41 test.loc tika[35696]: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:409)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:521)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1472)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1300)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1215)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.Server.handle(Server.java:500)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
Nov 15 15:59:41 test.loc tika[35696]: at java.base/java.lang.Thread.run(Thread.java:832)
Nov 15 15:59:41 test.loc tika[35696]: ERROR Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Nov 15 15:59:41 test.loc tika[35696]: INFO tika/ (image/jpeg)
Nov 15 15:59:41 test.loc tika[35696]: INFO tika/ (image/png)

seems fts_tika isn't going to be a well-behaved black box.

pulling it out of dovecot usage for now, to setup a standalone instance and throw test attachments at it directly ...
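For anyone wanting to try the same standalone test, a minimal Python sketch of throwing attachments at a Tika server directly might look like the following. Assumptions, none of which come from the thread: the server is reachable at http://localhost:9998/tika (the tika-server default port and its PUT extraction endpoint), the worker count of 10 simply mirrors the "~10 messages at once" case, and the helper names are made up for illustration. Optional basic auth is included since that is what the patch in the subject line is about. Empty files are skipped locally because a zero-length body is what the ZeroByteFileException above says it rejects.

```python
import base64
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumption: tika-server listening on its default port, no proxy in front.
TIKA_URL = "http://localhost:9998/tika"


def build_request(data, url=TIKA_URL, user=None, password=None):
    """Build (but do not send) a PUT request carrying one attachment."""
    req = urllib.request.Request(url, data=data, method="PUT")
    req.add_header("Accept", "text/plain")
    if user is not None:
        # HTTP basic auth: base64 of "user:password".
        raw = f"{user}:{password or ''}".encode()
        req.add_header("Authorization", "Basic " + base64.b64encode(raw).decode())
    return req


def extract(path, **kwargs):
    """Send one file to Tika; return (path, outcome string)."""
    data = Path(path).read_bytes()
    if not data:
        # Nothing to extract, and the server would reject an empty stream.
        return (str(path), "skipped: zero bytes")
    try:
        with urllib.request.urlopen(build_request(data, **kwargs), timeout=60) as resp:
            resp.read()
            return (str(path), f"HTTP {resp.status}")
    except (urllib.error.URLError, OSError) as exc:
        return (str(path), f"error: {exc}")


def run(paths, workers=10, **kwargs):
    """Send all files concurrently to mimic several messages arriving at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: extract(p, **kwargs), paths))
```

Calling run() with a list of saved attachment paths, first with workers=1 and then with workers=10, would be one way to compare the "one at a time" and "all at once" behaviour described earlier in the thread.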