thr3ads.net - dovecot - [Dovecot] GPFS for mail-storage (Was: Re: Compressing existing maildirs) [Jan 2012]

list@airstreamcomm.net

2012-Jan-04 06:09 UTC

[Dovecot] GPFS for mail-storage (Was: Re: Compressing existing maildirs)

Great information, thank you.  Could you remark on GPFS services hosting mail
storage over a WAN between two geographically separated data centers?

----- Reply message -----
From: "Jan-Frode Myklebust" <janfrode at tanso.net>
To: "Stan Hoeppner" <stan at hardwarefreak.com>
Cc: "Timo Sirainen" <tss at iki.fi>, <dovecot at dovecot.org>
Subject: [Dovecot] GPFS for mail-storage (Was: Re: Compressing existing maildirs)
Date: Tue, Jan 3, 2012 2:14 am


On Sat, Dec 31, 2011 at 01:54:32AM -0600, Stan Hoeppner wrote:
> Nice setup.  I've mentioned GPFS for cluster use on this list before,
> but I think you're the only operator to confirm using it.  I'm sure
> others would be interested in hearing of your first hand experience:
> pros, cons, performance, etc.  And a ball park figure on the licensing
> costs, whether one can only use GPFS on IBM storage or if storage from
> other vendors is allowed in the GPFS pool.
I used to work for IBM, so I've been a bit uneasy about pushing GPFS too
hard publicly, for risk of being accused of being biased. But I changed jobs in
November, so now I'm only a satisfied customer :-)

Pros:
	Extremely simple to configure and manage. Assuming root on all
	nodes can ssh freely, and port 1191/tcp is open between the
	nodes, these are the commands to create the cluster, create an
	NSD (network shared disk), and create a filesystem:

		# echo hostname1:manager-quorum > NodeFile	# "manager" means this node can be selected as filesystem manager
		# echo hostname2:manager-quorum >> NodeFile	# "quorum" means this node has a vote in the quorum selection
		# echo hostname3:manager-quorum >> NodeFile	# all my nodes are usually the same, so they all have the same roles
		# mmcrcluster  -n  NodeFile  -p $(hostname) -A

		### sdb1 is either a local disk on hostname1 (in which case the other nodes
		### will access it over tcp to hostname1), or a SAN-disk that they can access
		### directly over FC/iSCSI.
		# echo sdb1:hostname1::dataAndMetadata:: > DescFile	# This disk can be used for both data and metadata
		# mmcrnsd -F DescFile

		# mmstartup -a	# starts GPFS services on all nodes
		# mmcrfs /gpfs1 gpfs1 -F DescFile
		# mount /gpfs1
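
	For more than a handful of nodes, the NodeFile can be generated rather
	than typed. A minimal sketch, assuming every node gets the same roles as
	in the commands above; make_nodefile is a hypothetical helper name, not a
	GPFS command, and hostname1..3 are the placeholder names used above:

```shell
#!/bin/sh
# Print one node descriptor per host in the "hostname:designation" form
# used in the NodeFile above. The first argument is the designation
# (for example manager-quorum); the remaining arguments are host names.
make_nodefile() {
    designation=$1
    shift
    for host in "$@"; do
        printf '%s:%s\n' "$host" "$designation"
    done
}

# Prints the same three lines as the echo commands above.
make_nodefile manager-quorum hostname1 hostname2 hostname3
```

	Redirecting the output to NodeFile gives the file that is passed to
	mmcrcluster above.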

	You can add and remove disks from the filesystem, and change most
	settings without downtime. You can scale out your workload by adding
	more nodes (SAN attached or not), and scale out your disk performance
	by adding more disks on the fly. (IBM uses GPFS to create
	scale-out NAS solutions http://www-03.ibm.com/systems/storage/network/sonas/ ,
	which highlights a few of the features available with GPFS)

	There's no problem running GPFS on other vendors' disk systems. I've
	used Nexsan SATAboy earlier, for an HPC cluster. One can easily move
	from one disk system to another without downtime.

Cons:
	It has its own page cache, statically configured. So you don't get the
	"all available memory used for page caching" behaviour as you normally
	do on Linux.

	There is a kernel module that needs to be rebuilt on every
	upgrade. It's a simple process, but it needs to be done and means we
	can't just run "yum update ; reboot" to upgrade.

		% export SHARKCLONEROOT=/usr/lpp/mmfs/src
		% cp /usr/lpp/mmfs/src/config/site.mcr.proto /usr/lpp/mmfs/src/config/site.mcr
		% vi /usr/lpp/mmfs/src/config/site.mcr     # correct GPFS_ARCH, LINUX_DISTRIBUTION and LINUX_KERNEL_VERSION
		% cd /usr/lpp/mmfs/src/ ; make clean ; make World
		% su - root
		# export SHARKCLONEROOT=/usr/lpp/mmfs/src
		# cd /usr/lpp/mmfs/src/ ; make InstallImages

> 
> To this point IIRC everyone here doing clusters is using NFS, GFS, or
> OCFS.  Each has its downsides, mostly because everyone is using maildir.
>  NFS has locking issues with shared dovecot index files.  GFS and OCFS
> have filesystem metadata performance issues.  How does GPFS perform with
> your maildir workload?
Maildir is likely a worst case type workload for filesystems. Millions
of tiny-tiny files, making all IO random, and getting minimal controller
read cache utilized (unless you can cache all active files). So I've
concluded that our performance issues are mostly design errors (and the
fact that there were no better mail storage formats than maildir at the
time these servers were implemented). I expect moving to mdbox will 
fix all our performance issues.

I *think* GPFS is as good as it gets for maildir storage on clusterfs,
but have no numbers to back that up ... It would be very interesting if we
could somehow compare numbers for a few cluster filesystems.

I believe our main limitation in this setup is the iops we can get from
the backend storage system. It's hard to balance the IO over enough
RAID arrays (the fs is spread over 11 RAID5 arrays of 5 disks each),
and we're always having hotspots. Right now two arrays are doing <100 iops,
while others are doing 400-500 iops. Would very much like to replace
it by something smarter where we can utilize SSDs for active data and
something slower for stale data. GPFS can manage this by itself through
its ILM interface, but we don't have the very fast storage to put in as
tier-1.
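
The kind of imbalance described above can be spotted from per-array iops
figures. A rough sketch, assuming the figures are available as "name iops"
pairs; find_cold_arrays is a hypothetical helper, the sample numbers are made
up for illustration, and the "below half the mean" threshold is an arbitrary
choice:

```shell
#!/bin/sh
# Read "array_name iops" pairs on stdin and print the arrays whose iops
# fall below half of the mean across all arrays.
find_cold_arrays() {
    awk '
        { name[NR] = $1; iops[NR] = $2; total += $2 }
        END {
            if (NR == 0) exit
            mean = total / NR
            for (i = 1; i <= NR; i++)
                if (iops[i] < mean / 2)
                    printf "%s %d\n", name[i], iops[i]
        }
    '
}

# Made-up sample figures, loosely shaped like the imbalance described above.
printf '%s\n' 'array01 90' 'array02 450' 'array03 480' 'array04 80' 'array05 500' |
    find_cold_arrays
```

With these sample figures the mean is 320 iops, so array01 and array04 are
reported as carrying far less than their share.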


  -jf

Jan-Frode Myklebust

2012-Jan-04 07:33 UTC

On Wed, Jan 04, 2012 at 12:09:39AM -0600, list at airstreamcomm.net wrote:
> Could you remark on GPFS services hosting mail storage over a WAN between
> two geographically separated data centers?
I haven't tried that, but know the theory quite well. There are 2 or 3
options:

1 - shared SAN between the data centers. Should work the same as
a single data center, but you'd want to use disk quorum or
a quorum node on a third site to avoid split brain.

2 - different SANs on the two sites. Disks on SAN1 would belong
to failure group 1 and disks on SAN2 would belong to failure
group 2. GPFS will write every block to disks in different
failure groups. Nodes on location 1 will use SAN1 directly,
and write to SAN2 via tcp/ip to nodes on location 2 (and vice
versa). It's configurable if you want to return success when
the first block is written (asynchronous replication), or if you
need both replicas to be written. Ref: mmcrfs -K:

http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.gpfs.v3r4.gpfs300.doc%2Fbl1adm_mmcrfs.html

With asynchronous replication it will try to allocate both
replicas, but if it fails you can re-establish the
replication level later using "mmrestripefs".

Reading will happen from a direct attached disk if possible,
and over tcp/ip if there is no local replica of the needed
block.

Again you'll need a quorum node on a third site to avoid split brain.

3 - GPFS multi-cluster. Separate GPFS clusters on the two
locations. Let them mount each other's filesystems over IP,
and access disks over either SAN or IP network. Each cluster is
managed locally, so if one site goes down the other site also
loses access to the fs.

I don't have any experience with this kind of config, but believe
it's quite popular for sharing a fs between HPC sites.

http://www.ibm.com/developerworks/systems/library/es-multiclustergpfs/index.html
http://www.cisl.ucar.edu/hss/ssg/presentations/storage/NCAR-GPFS_Elahi.pdf
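
The need for a quorum node on a third site in options 1 and 2 comes down to
simple arithmetic. A small sketch, assuming plain node-majority quorum (more
than half of the quorum nodes must be reachable) and no tiebreaker disks;
quorum_needed and has_quorum are hypothetical helper names, and the node
counts are illustrative:

```shell
#!/bin/sh
# Node-majority quorum: more than half of the defined quorum nodes
# must be reachable. $1 = total number of quorum nodes.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Succeed if the surviving nodes still hold quorum.
# $1 = total quorum nodes, $2 = quorum nodes still reachable
has_quorum() {
    [ "$2" -ge "$(quorum_needed "$1")" ]
}

# Two sites with 2 quorum nodes each: losing a site leaves 2 of 4.
if has_quorum 4 2; then
    echo "2+2 layout: survives site loss"
else
    echo "2+2 layout: loses quorum"
fi

# Same two sites plus 1 quorum node on a third site: losing one of the
# main sites leaves 3 of 5.
if has_quorum 5 3; then
    echo "2+2+1 layout: survives site loss"
else
    echo "2+2+1 layout: loses quorum"
fi
```

With an even split across two sites, neither half can reach a majority on its
own after a link or site failure; the extra vote on the third site is what
lets exactly one side keep running.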

-jf
