Hi ceph-ers,

The email below was posted on the ceph-users mailing list yesterday by Wido den Hollander. I guess this could be interesting for users here as well.

MJ

-------- Forwarded Message --------
Subject: [ceph-users] librmb: Mail storage on RADOS with Dovecot
Date: Thu, 21 Sep 2017 10:40:03 +0200 (CEST)
From: Wido den Hollander <wido at 42on.com>
To: ceph-users at ceph.com

Hi,

A tracker issue has been out there for a while: http://tracker.ceph.com/issues/12430

Storing e-mail in RADOS with Dovecot, the IMAP/POP3/LDA server with a huge market share.

It took a while, but last year Deutsche Telekom took on the heavy work and started a project to develop librmb: LibRadosMailBox. Together with Deutsche Telekom and Tallence GmbH (DE) this project came to life.

First, the Github link: https://github.com/ceph-dovecot/dovecot-ceph-plugin

I am not going to repeat everything which is on Github, but here is a short summary:

- CephFS is used for storing Mailbox Indexes
- E-Mails are stored directly as RADOS objects
- It's a Dovecot plugin

We would like everybody to test librmb and report back issues on Github so that further development can be done.

It's not finalized yet, but all the help is welcome to make librmb the best solution for storing your e-mails on Ceph with Dovecot.

Danny Al-Gaaf has written a small blog post about it and a presentation:

- https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/
- http://blog.bisect.de/2017/09/ceph-meetup-berlin-followup-librmb.html

To get an idea of the scale: 4.7 PB of raw storage over 1,200 OSDs is the final goal (last slide in the presentation). That will provide roughly 1.2 PB of usable capacity for storing e-mail. A lot of e-mail.

To see this project finally go into the Open Source world excites me a lot :-)

A very, very big thanks to Deutsche Telekom for funding this awesome project! A big thanks as well to Tallence, as they did an awesome job developing librmb in such a short time.

Wido
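To make the "E-Mails are stored directly as RADOS objects" bullet concrete, here is a minimal librados C++ sketch of that idea. It is not librmb's actual API or object layout; the pool name, namespace and object ID are invented for illustration, and librmb's real naming and metadata scheme is described in the GitHub repository above. Building it needs the librados development headers and linking against librados (-lrados).

// Minimal librados sketch: store one RFC 822 message as a RADOS object.
// Pool name, namespace and object ID are illustrative only, not librmb's scheme.
#include <rados/librados.hpp>
#include <iostream>

int main() {
  librados::Rados cluster;
  cluster.init("admin");                     // connect as client.admin
  cluster.conf_read_file(nullptr);           // read the default ceph.conf
  if (cluster.connect() < 0) {
    std::cerr << "could not connect to the cluster" << std::endl;
    return 1;
  }

  librados::IoCtx io;
  cluster.ioctx_create("mail_storage", io);  // hypothetical mail pool
  io.set_namespace("user-12345");            // one RADOS namespace per user

  librados::bufferlist mail;
  mail.append("From: a@example.com\r\nSubject: hello\r\n\r\nbody\r\n");
  io.write_full("mail-oid-0001", mail);      // the e-mail becomes one object

  cluster.shutdown();
  return 0;
}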
On 22 Sep 2017, at 14.18, mj <lists at merit.unu.edu> wrote:
> First, the Github link: https://github.com/ceph-dovecot/dovecot-ceph-plugin
>
> I am not going to repeat everything which is on Github, but here is a
> short summary:
>
> - CephFS is used for storing Mailbox Indexes
> - E-Mails are stored directly as RADOS objects
> - It's a Dovecot plugin
>
> We would like everybody to test librmb and report back issues on Github
> so that further development can be done.
>
> It's not finalized yet, but all the help is welcome to make librmb the
> best solution for storing your e-mails on Ceph with Dovecot.

It would have been nicer if RADOS support was implemented as a lib-fs driver and the fs API had been used everywhere else. Then 1) LibRadosMailBox wouldn't be relying so much on RADOS specifically and 2) fs-rados could have been used for other purposes. There are already fs-dict and dict-fs drivers, so the RADOS dict driver may not have been necessary to implement if fs-rados had been implemented instead (although I didn't check it closely enough to verify). (We've had fs-rados on our TODO list for a while as well.)

BTW, we've also been planning on open sourcing some of the obox pieces, mainly fs drivers (e.g. fs-s3). Maybe the obox format too, but without the "metacache" piece. The current obox code is a bit too tightly married to the metacache to make open sourcing it easy. (The metacache is about storing the Dovecot index files in object storage and efficiently caching them on the local filesystem; it isn't planned to be open sourced in the near future. That's pretty much the only difficult piece of the obox plugin, with the Cassandra integration coming a good second. I wish there had been a better/easier geo-distributed key-value database to use - tombstones are annoyingly troublesome.)

And using the rmb mailbox format, my main worries would be:

* It doesn't store index files (= message flags) - not necessarily a problem, as long as you don't want geo-replication.
* Index corruption means rebuilding the indexes, which means rescanning the list of mail files, which means rescanning the whole RADOS namespace, which practically means rescanning the RADOS pool. That is most likely a very, very slow operation, which you want to avoid unless it's absolutely necessary. You need to be very careful to avoid that happening, and in general to avoid losing mails in case of crashes or other bugs. (A sketch of what such a rescan involves follows below.)
* I think copying/moving mails physically copies the full data on disk.
* Each IMAP/POP3/LMTP/etc. process connects to RADOS separately from the others - some connection pooling would likely help here.
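To illustrate why the rescan Timo describes is so costly, here is a rough librados sketch of the worst case: enumerating every mail object in the pool so the indexes can be rebuilt. It is not code from the plugin; the pool name is invented, and a real rebuild would additionally have to read each object's metadata, which makes it even slower than the bare listing shown here.

// Sketch of the "rescan everything" worst case: list every mail object so
// the indexes can be rebuilt from scratch. On a pool with millions of
// objects this enumeration alone is expensive.
#include <rados/librados.hpp>
#include <cstdint>
#include <iostream>

int main() {
  librados::Rados cluster;
  cluster.init("admin");
  cluster.conf_read_file(nullptr);
  cluster.connect();

  librados::IoCtx io;
  cluster.ioctx_create("mail_storage", io);   // hypothetical mail pool
  io.set_namespace(librados::all_nspaces);    // iterate across all user namespaces

  uint64_t count = 0;
  for (auto it = io.nobjects_begin(); it != io.nobjects_end(); ++it) {
    // Each hit would still need an xattr/omap read to recover flags etc.
    std::cout << it->get_nspace() << "/" << it->get_oid() << "\n";
    ++count;
  }
  std::cout << "objects scanned: " << count << std::endl;

  cluster.shutdown();
  return 0;
}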
On 24.09.2017 at 02:43, Timo Sirainen wrote:
> On 22 Sep 2017, at 14.18, mj <lists at merit.unu.edu> wrote:
>> First, the Github link: https://github.com/ceph-dovecot/dovecot-ceph-plugin
>>
>> I am not going to repeat everything which is on Github, but here is a
>> short summary:
>>
>> - CephFS is used for storing Mailbox Indexes
>> - E-Mails are stored directly as RADOS objects
>> - It's a Dovecot plugin
>>
>> We would like everybody to test librmb and report back issues on
>> Github so that further development can be done.
>>
>> It's not finalized yet, but all the help is welcome to make librmb the
>> best solution for storing your e-mails on Ceph with Dovecot.
>
> It would have been nicer if RADOS support was implemented as a lib-fs
> driver and the fs API had been used everywhere else. Then 1)
> LibRadosMailBox wouldn't be relying so much on RADOS specifically and
> 2) fs-rados could have been used for other purposes. There are already
> fs-dict and dict-fs drivers, so the RADOS dict driver may not have been
> necessary to implement if fs-rados had been implemented instead
> (although I didn't check it closely enough to verify). (We've had
> fs-rados on our TODO list for a while as well.)

Please note: librmb is not Dovecot-specific. The goal of this library is to abstract email storage on Ceph independently of Dovecot, so that other mail systems can also store emails in RADOS via one library. This is also the reason why it relies on RADOS.

[...]

> And using the rmb mailbox format, my main worries would be:
> * It doesn't store index files (= message flags) - not necessarily a
> problem, as long as you don't want geo-replication.

The index files are stored via Dovecot's lib-index on CephFS. This is only an intermediate step: the goal is to also store index data directly in the RADOS/Ceph omap key-value store. Currently geo-replication isn't an important topic for our PoC setup at Deutsche Telekom.

> * Index corruption means rebuilding the indexes, which means rescanning
> the list of mail files, which means rescanning the whole RADOS
> namespace, which practically means rescanning the RADOS pool. That is
> most likely a very, very slow operation, which you want to avoid unless
> it's absolutely necessary. You need to be very careful to avoid that
> happening, and in general to avoid losing mails in case of crashes or
> other bugs.

This could maybe be avoided with CephFS snapshots for now, at least partially. We will take a look at it during the PoC phase.

> * I think copying/moving mails physically copies the full data on disk.
> * Each IMAP/POP3/LMTP/etc. process connects to RADOS separately from
> the others - some connection pooling would likely help here.

I'm not so deep into what Dovecot is currently doing here. librmb is still under heavy development, and any comment and feedback is really welcome, as Wido already pointed out.

Danny
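For readers unfamiliar with omap, here is a small sketch of what Danny's stated goal (index data directly in the RADOS omap key-value store) could look like at the librados level. The object name and key layout are invented for illustration and are not librmb's actual format.

// Sketch of the stated goal: keep index-like data (message flags, UIDs) in
// the omap key-value store of a per-mailbox RADOS object instead of in
// CephFS index files. Object name and key layout are invented here.
#include <rados/librados.hpp>
#include <map>
#include <string>
#include <iostream>

// 'io' is assumed to be an IoCtx already opened on the mail pool with the
// user's namespace set (see the earlier connection sketch).
void store_flags(librados::IoCtx &io) {
  std::map<std::string, librados::bufferlist> kv;
  kv["uid/1001"].append("\\Seen");
  kv["uid/1002"].append("\\Seen \\Answered");
  io.omap_set("INBOX.index", kv);   // flags become omap entries, no file I/O
}

void dump_flags(librados::IoCtx &io) {
  std::map<std::string, librados::bufferlist> vals;
  // Read back up to 1024 entries, e.g. for a status check or index rebuild.
  io.omap_get_vals("INBOX.index", "", 1024, &vals);
  for (const auto &entry : vals)
    std::cout << entry.first << " => " << entry.second.to_str() << "\n";
}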
Hi Timo,

I am one of the authors of the software Wido announced in his mail. First, I'd like to say that Dovecot is a wonderful piece of software - thank you for it. I would like to give some explanations regarding the design we chose.

From: Timo Sirainen <tss at iki.fi>
Reply-To: Dovecot Mailing List <dovecot at dovecot.org>
Date: 24 September 2017 at 02:43:44
To: Dovecot Mailing List <dovecot at dovecot.org>
Subject: Re: librmb: Mail storage on RADOS with Dovecot

> It would have been nicer if RADOS support was implemented as a lib-fs
> driver and the fs API had been used everywhere else. Then 1)
> LibRadosMailBox wouldn't be relying so much on RADOS specifically and
> 2) fs-rados could have been used for other purposes. There are already
> fs-dict and dict-fs drivers, so the RADOS dict driver may not have been
> necessary to implement if fs-rados had been implemented instead
> (although I didn't check it closely enough to verify). (We've had
> fs-rados on our TODO list for a while as well.)

Actually, I considered using the fs-api to build a RADOS driver, but I did not follow that path. The dict-fs mapping is quite simplistic: for example, I would not be able to use RADOS read/write operations to batch requests or to model the dictionary transactions. There is also no async support if you hide the RADOS dictionary behind an fs-api module, which would make the use of dict-rados in the dict-proxy harder. Using dict-rados in the dict-proxy would help to lower the price you have to pay for the process model Dovecot uses so heavily.

Using an fs-rados module behind a storage module, let's say sdbox, would IMO not fit our goals. We planned to store mails as RADOS objects and their (immutable) metadata in the RADOS omap K/V store, and we want to be able to access the objects without Dovecot. This is not possible if RADOS is hidden behind an fs-rados module: the format of the stored objects would be different and would depend on the storage module sitting in front of fs-rados.

Another reason is that at the fs level the operations are too decomposed. We would not have any transactional contexts etc., as we do with the dictionaries. This context information allows us to use the RADOS operations in an optimized way. The storage API is IMO the right level of abstraction, especially if we follow our long-term goal of eliminating the filesystem requirement for index data too.

I like the internal abstraction of sdbox/mdbox a lot, but for our purpose it would have to be at the mail level and not the file level. Building an fs-rados should not be very hard, though.

> BTW, we've also been planning on open sourcing some of the obox pieces,
> mainly fs drivers (e.g. fs-s3). Maybe the obox format too, but without
> the "metacache" piece. The current obox code is a bit too tightly
> married to the metacache to make open sourcing it easy. (The metacache
> is about storing the Dovecot index files in object storage and
> efficiently caching them on the local filesystem; it isn't planned to
> be open sourced in the near future. That's pretty much the only
> difficult piece of the obox plugin, with the Cassandra integration
> coming a good second. I wish there had been a better/easier
> geo-distributed key-value database to use - tombstones are annoyingly
> troublesome.)

That would be great.

> And using the rmb mailbox format, my main worries would be:
> * It doesn't store index files (= message flags) - not necessarily a
> problem, as long as you don't want geo-replication.

Your index management is awesome, highly optimized and not easily reimplemented.
Very nice work. Unfortunately it does not use the fs-api and is therefore not capable of being located on non-fs storage. We believe that CephFS will be a good and stable solution for the time being. Of course it would be nicer to have a lib-index that allows us to plug in different backends.

> * Index corruption means rebuilding the indexes, which means rescanning
> the list of mail files, which means rescanning the whole RADOS
> namespace, which practically means rescanning the RADOS pool. That is
> most likely a very, very slow operation, which you want to avoid unless
> it's absolutely necessary. You need to be very careful to avoid that
> happening, and in general to avoid losing mails in case of crashes or
> other bugs.

Yes, such a disaster is a problem. We are trying to build as many rescue tools as possible, but in the end scanning mails is involved. All mails are stored within separate RADOS namespaces, each representing a different user, which helps us avoid scanning the whole pool. But this should not be a regular operation - you are right.

> * I think copying/moving mails physically copies the full data on disk.

We tried to optimize this. Moves within a user's mailboxes are done without copying the mails, by just changing the index data. Copies, when really necessary, are done with native RADOS commands (OSD to OSD) without transferring the data to the client and back. There is potential for even more optimization: we could build a mechanism similar to the mdbox reference counters to reduce copying. I am sure we will give it a try in a later version.

> * Each IMAP/POP3/LMTP/etc. process connects to RADOS separately from
> the others - some connection pooling would likely help here.

Dovecot uses separate processes a lot. You are right that this is a problem for protocols/libraries that have a high setup cost; you built mechanisms such as login process reuse and the dict-proxy to overcome that problem. Ceph is a low-latency object store, and one reason for its speed is that the cluster structure is known to the clients: a client has a direct connection to the OSD that hosts the object it is looking for. If we place any intermediaries between the client process and the OSD (as with the dict-proxy), performance will suffer. IMO the processes you mentioned should be reused to reduce the setup cost per session (or be implemented multithreaded or async). I am aware that this might be a potential security risk. Right now we do not know the price of the connection setup in a real cluster in a Dovecot context. We are curious about the results of the tests on Danny's cluster and will change the design of the software, if necessary, to get the best out of it.

Best regards
Peter
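Peter's points - batching RADOS operations, keeping the immutable per-mail metadata next to the mail object, and avoiding per-mail setup cost - can be illustrated with a single compound librados operation submitted asynchronously over one shared cluster connection. This is a sketch only, not librmb code: the attribute and omap key names are invented, and error handling is minimal.

// Sketch (not librmb itself): one compound, asynchronous RADOS write that
// stores a mail body together with its immutable metadata. A single
// librados::Rados connection is meant to be shared by the whole process.
#include <rados/librados.hpp>
#include <map>
#include <string>

int store_mail(librados::Rados &cluster, librados::IoCtx &io,
               const std::string &oid, const std::string &rfc822) {
  librados::bufferlist body;
  body.append(rfc822);

  librados::bufferlist recv_time;
  recv_time.append("2017-09-24T02:43:44Z");        // illustrative value

  std::map<std::string, librados::bufferlist> meta;
  meta["from"].append("a@example.com");            // invented key names
  meta["subject"].append("hello");

  // All three writes are applied atomically as one object operation.
  librados::ObjectWriteOperation op;
  op.write_full(body);                             // the mail itself
  op.setxattr("mail.received", recv_time);         // small immutable attribute
  op.omap_set(meta);                               // remaining metadata as omap

  // Submit asynchronously so the calling process isn't blocked per mail.
  librados::AioCompletion *c = cluster.aio_create_completion();
  int r = io.aio_operate(oid, c, &op);
  if (r < 0) { c->release(); return r; }
  c->wait_for_complete();                          // or queue/poll completions
  r = c->get_return_value();
  c->release();
  return r;
}

In a real deployment the completion would typically be queued rather than waited on immediately, so one long-lived process can keep many writes in flight over the same cluster connection instead of paying the connection setup cost for every mail.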