Hi,

I've been searching around on the Internet to find some help with this, but have been unsuccessful so far.

I have some performance issues with my file server. It is an OpenSolaris server with a Pentium D 3GHz CPU, 4GB of memory, and a RAIDZ1 over 4 x Seagate (ST31500341AS) 1.5TB SATA drives.

If I compile, or even just unpack a tar.gz archive with source code (or any archive with lots of small files), from my Linux client onto an NFS-mounted disk on the OpenSolaris server, it's extremely slow compared to unpacking the same archive locally on the server. A 22MB .tar.gz file containing 7360 files takes 9 minutes and 12 seconds to unpack over NFS. Unpacking the same file locally on the server takes just under 2 seconds.

Between the server and client I have a gigabit network, which at the time of testing had no other significant load. My NFS mount options are: "rw,hard,intr,nfsvers=3,tcp,sec=sys".

Any suggestions as to why this is?

Regards,
Sigbjorn
That's because NFS adds synchronous writes to the mix (e.g. the client needs to know that certain transactions made it to nonvolatile storage in case the server restarts, etc.). The simplest safe solution, although not cheap, is to add an SSD log device to the pool.

On 23 Jul 2010, at 08:11, "Sigbjorn Lie" <sigbjorn at nixtra.com> wrote:
> [original message quoted in full; snipped]
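A minimal sketch of what adding a log device looks like (assuming a pool named "tank" and a spare SSD at c1t5d0 - your pool and device names will differ):

    zpool add tank log c1t5d0
    zpool status tank

The second command should then show the SSD under a separate "logs" section of the pool.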
On Fri, Jul 23, 2010 at 3:11 AM, Sigbjorn Lie <sigbjorn at nixtra.com> wrote:
> [original message quoted in full; snipped]

As someone else said, adding an SSD log device can help hugely. I saw about a 500% NFS write increase by doing this. I've heard of people getting even more.
Thomas Burgess wrote:
> [original problem report snipped]
>
> as someone else said, adding an ssd log device can help hugely. I saw
> about a 500% nfs write increase by doing this.
> I've heard of people getting even more.

Another option, if you don't care quite so much about data security in the event of an unexpected system outage, would be to use Robert Milkowski and Neil Perrin's zil synchronicity [PSARC/2010/108] changes with sync=disabled, once the changes work their way into an available build.

The risk is that if the file server goes down unexpectedly, it might come back up having lost some seconds' worth of changes which it had told the client (lied) were committed to disk when they weren't, and this violates the NFS protocol. That might be OK if you are using it to hold source that's being built, where you can kick off the build again if the server did go down in the middle of it. It wouldn't be a good idea for some other applications, though (although Linux ran this way for many years, seemingly without many complaints).

Note that there's no increased risk of the zpool going bad - it's just that after the reboot, filesystems with sync=disabled will look like they were rewound by some seconds (possibly up to 30 seconds).

--
Andrew Gabriel
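For reference, once a build containing PSARC/2010/108 is available, this is a per-dataset property (a sketch, assuming the NFS-exported dataset is called tank/export - adjust for your layout):

    zfs set sync=disabled tank/export
    zfs get sync tank/export

Unlike the older system-wide ZIL tunable, this can be limited to just the dataset holding disposable data such as build trees.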
I agree, I get appalling NFS speeds compared to CIFS/Samba - i.e. CIFS/Samba of 95-105MB/s and NFS of 5-20MB/s.

Not to hijack the thread, but I assume an SSD ZIL will similarly improve an iSCSI target, as I am getting 2-5MB/s on that too.
I see I have already received several replies - thanks to all!

I would not like to risk losing any data, so I believe a ZIL device would be the way to go for me. I see these exist at different prices. Any reason why I should not buy a cheap one, like the Intel X25-V SSD 40GB 2.5"?

What size of ZIL device would be recommended for my pool consisting of 4 x 1.5TB drives? Any brands I should stay away from?

Regards,
Sigbjorn

On Fri, July 23, 2010 09:48, Phil Harman wrote:
> That's because NFS adds synchronous writes to the mix (e.g. the client needs to know that certain
> transactions made it to nonvolatile storage in case the server restarts, etc.). The simplest safe
> solution, although not cheap, is to add an SSD log device to the pool.
>
> [rest of quoted thread snipped]
On 23 Jul 2010, at 09:18, Andrew Gabriel <Andrew.Gabriel at oracle.com> wrote:
> [earlier thread snipped]
>
> Another option, if you don't care quite so much about data security in the event of an unexpected
> system outage, would be to use Robert Milkowski and Neil Perrin's zil synchronicity
> [PSARC/2010/108] changes with sync=disabled [...] The risk is that if the file server goes down
> unexpectedly, it might come back up having lost some seconds' worth of changes which it had told
> the client (lied) were committed to disk when they weren't, and this violates the NFS protocol.
> That might be OK if you are using it to hold source that's being built, where you can kick off
> the build again if the server did go down in the middle of it.

That's assuming you know it happened and that you need to restart the build (ideally with a make clean). All the NFS client knows is that the NFS server went away for some time; it still assumes nothing was lost. I can imagine cases where the build might continue to completion but with partially corrupted files. It's unlikely, but conceivable. Of course, databases like dbm, MySQL or Oracle would go blithely on up the swanee with silent data corruption.

The fact that people run unsafe systems seemingly without complaint for years assumes that they know silent data corruption when they see^H^H^Hhear it ... which, of course, they didn't ... because it is silent ... or, having encountered corrupted data, that they have the faintest idea where it came from. In my day-to-day work I still find many people who have been (apparently) very lucky.

Feel free to play fast and loose with your own data, but I won't with mine, thanks! ;)
On Fri, July 23, 2010 10:42, tomwaters wrote:
> I agree, I get appalling NFS speeds compared to CIFS/Samba - i.e. CIFS/Samba of 95-105MB/s and NFS
> of 5-20MB/s.
>
> Not to hijack the thread, but I assume an SSD ZIL will similarly improve an iSCSI target, as I am
> getting 2-5MB/s on that too.

These are exactly the numbers I'm getting as well.

What's the reason for such a low rate when using iSCSI?
Sent from my iPhone

On 23 Jul 2010, at 09:42, tomwaters <tomwaters at chadmail.com> wrote:
> I agree, I get appalling NFS speeds compared to CIFS/Samba - i.e. CIFS/Samba of 95-105MB/s and NFS of 5-20MB/s.
>
> Not to hijack the thread, but I assume an SSD ZIL will similarly improve an iSCSI target, as I am getting 2-5MB/s on that too.

Yes, it generally will. I've seen some huge improvements with iSCSI, but YMMV depending on your config, application and workload.
On 23/07/2010 10:02, Sigbjorn Lie wrote:
> On Fri, July 23, 2010 10:42, tomwaters wrote:
>> Not to hijack the thread, but I assume an SSD ZIL will similarly improve an iSCSI target, as I am
>> getting 2-5MB/s on that too.
>
> These are exactly the numbers I'm getting as well.
>
> What's the reason for such a low rate when using iSCSI?

The filesystem or application using the iSCSI target may be requesting regular cache flushes. These will require synchronous writes to disk. An SSD doesn't remove the sync writes; it just makes them a lot faster. Other sensible storage servers typically use NVRAM caches to solve this problem. Others just play fast and loose with your data.
On Fri, Jul 23, 2010 at 5:00 AM, Sigbjorn Lie <sigbjorn at nixtra.com> wrote:
> I would not like to risk losing any data, so I believe a ZIL device would be the way to go for me.
> I see these exist at different prices. Any reason why I should not buy a cheap one, like the
> Intel X25-V SSD 40GB 2.5"?
>
> What size of ZIL device would be recommended for my pool consisting of 4 x 1.5TB drives? Any
> brands I should stay away from?

Like I said, I bought a 50 GB OCZ Vertex Limited Edition. It's like 200 dollars, up to 15,000 random IOPS (IOPS is what you want for a fast ZIL).

I've gotten excellent performance out of it.
On Fri, July 23, 2010 11:21, Thomas Burgess wrote:
> Like I said, I bought a 50 GB OCZ Vertex Limited Edition. It's like 200 dollars, up to 15,000
> random IOPS (IOPS is what you want for a fast ZIL).
>
> I've gotten excellent performance out of it.

The X25-V has up to 25k random read IOPS and up to 2.5k random write IOPS, so that would seem okay for approx $80. :)

What about mirroring? Do I need mirrored ZIL devices in case of a power outage?
On 23/07/2010 10:53, Sigbjorn Lie wrote:
> The X25-V has up to 25k random read IOPS and up to 2.5k random write IOPS, so that would seem
> okay for approx $80. :)
>
> What about mirroring? Do I need mirrored ZIL devices in case of a power outage?

Note there is no such thing as a "ZIL device"; there is a slog device. Every pool has one or more ZILs; it may or may not have a slog device used to hold ZIL contents. Whether a ZIL is on the slog or not depends on a lot of factors, including the logbias property.

You don't need to mirror the slog device to protect against a power outage. You need to mirror the slog if you want to protect against losing synchronous writes (but not pool consistency on disk) on a power outage *and* failure of your slog device at the same time (i.e. a double fault).

--
Darren J Moffat
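To make the terminology concrete, a hedged sketch (pool, dataset and device names are made up): a mirrored slog is added with

    zpool add tank log mirror c1t5d0 c1t6d0

and the logbias property is per dataset, e.g.

    zfs set logbias=throughput tank/db

which tells ZFS to bypass the slog for that dataset's ZIL traffic and write its log blocks to the main pool instead; the default, logbias=latency, uses the slog if one is present.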
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Phil Harman
>
> The fact that people run unsafe systems seemingly without complaint for
> years assumes that they know silent data corruption when they
> see^H^H^Hhear it ... which, of course, they didn't ... because it is
> silent ... In my day-to-day work I still find many people who have been
> (apparently) very lucky.

Running with sync disabled, or the ZIL disabled, you could call "unsafe" if you want to use a generalization and a stereotype.

Just like people say "writeback" is unsafe. If you apply a little more intelligence, you'll know it's safe in some conditions and not in others. For example, if you have a BBU, you can use your writeback cache safely. And if you're not sharing stuff across the network, you're guaranteed the disabled ZIL is safe. Even when you are sharing stuff across the network, the disabled ZIL can still be safe under the following condition: if you are only doing file sharing (NFS, CIFS) and you are willing to reboot/remount all of your clients after an ungraceful shutdown of your server, then it's safe to run with the ZIL disabled.

If you're unsure, then adding an SSD nonvolatile log device, as people have said, is the way to go.
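For completeness, on builds that pre-date the per-dataset sync property, what people usually mean by "running with the ZIL disabled" is the system-wide zil_disable tunable - a sketch, not a recommendation:

    # in /etc/system, followed by a reboot
    set zfs:zil_disable = 1

This applies to every pool and dataset on the host, so the caveats above apply everywhere at once.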
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Sigbjorn Lie
>
> What size of ZIL device would be recommended for my pool consisting of

Get the smallest one. Even an unrealistically high-performance scenario cannot come close to using 32G. I am sure you'll never reach even 4G of usage.
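A rough back-of-the-envelope to show why (assumptions: the slog only ever needs to hold synchronous writes that have not yet been committed to the main pool, and transaction groups flush at most every ~30 seconds on builds of this era): a fully saturated gigabit link is about 125 MB/s, and 125 MB/s x 30 s is under 4 GB, which is the absolute worst case for this setup. A realistic small-file NFS workload will use far less.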
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Sigbjorn Lie
>
> What about mirroring? Do I need mirrored ZIL devices in case of a power
> outage?

You don't need mirroring for the sake of a *power outage*, but you *do* need mirroring for the sake of preventing data loss when one of the SSD devices fails. There is some gray area here:

If you have zpool < 19, then you do not have "log device removal", which means you lose your whole zpool in the event of a failed unmirrored log device. (Techniques exist to recover, but it's not always easy.)

If you have zpool >= 19, then the danger is much smaller. If you have a failed unmirrored log device, and the failure is detected, then the log device is simply marked "failed", the system slows down, and everything is fine. But if you have an undetected failure *and* an ungraceful reboot (which is more likely than it seems), then you risk up to 30 seconds of data that was intended to be written immediately before the crash.

None of that is a concern if you have a mirrored log device.
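A quick sketch of how to check where you stand (again assuming a pool named "tank"; the device path is made up):

    zpool get version tank      # 19 or later means log devices can be removed
    zpool remove tank c1t5d0    # removes a (possibly failed) log device on v19+

On pools older than version 19, zpool remove will refuse to remove a log device, which is exactly the gray area described above.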
Edward Ned Harvey wrote:
> If you are only doing file sharing (NFS, CIFS) and you are willing to
> reboot/remount all of your clients after an ungraceful shutdown of your
> server, then it's safe to run with the ZIL disabled.

No, that's not safe. The client can still lose up to 30 seconds of data, which could be, for example, an email message which is received and foldered on the server, and is then lost. It's probably *safe enough* for most home users, but you should be fully aware of the potential implications before embarking on this route.

(As I said before, the zpool itself is not at any additional risk of corruption; it's just that you might find the zfs filesystems with sync=disabled appear to have been rewound by up to 30 seconds.)

> If you're unsure, then adding an SSD nonvolatile log device, as people have
> said, is the way to go.

--
Andrew Gabriel
Phil Harman wrote:
>> Not to hijack the thread, but I assume an SSD ZIL will similarly improve
>> an iSCSI target, as I am getting 2-5MB/s on that too.
>
> Yes, it generally will. I've seen some huge improvements with iSCSI,
> but YMMV depending on your config, application and workload.

Sorry this isn't completely ZFS-related, but with all this expert storage knowledge here...

On a related note - all other things being equal, is there any reason to choose NFS over iSCSI, or vice versa? I'm currently looking at this decision. We have a NetApp (I wish it were a ZFS-based appliance!) and need to remotely mount a filesystem from it. It will share the filesystem either as NFS or iSCSI.

Some of my colleagues say it would be better to use NFS. Their reasoning is basically: "That's the way it's always been done." I'm leaning towards iSCSI. My reasoning is that it removes a whole extra layer of complexity - as I understand it, the remote client just treats the remote mount like any other physical device. And I've had MAJOR headaches over the years fixing/tweaking NFS. Even though version 4 seems better, I'd still rather bypass it completely. I believe in the "keep it simple, stupid" philosophy.

I do realize that NFS is probably better for remote filesystems that have multiple simultaneous users, but we won't be doing that in this case.

Any major arguments for/against one over the other? Thanks for any suggestions.

Doug Linder
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Linder, Doug
>
> On a related note - all other things being equal, is there any reason
> to choose NFS over iSCSI, or vice versa? I'm currently looking at this

iSCSI and NFS are completely different technologies. If you use iSCSI, then the initiators (clients) are the things which format and control the filesystem, so the limitations of the filesystem are determined by whichever clustering filesystem you've chosen to implement. It probably won't do snapshots and so forth. Although the ZFS filesystem could make a snapshot, it wouldn't be automatically mounted or made available without the clients doing explicit mounts.

With NFS, the filesystem is formatted and controlled by the server. Both WAFL and ZFS do some pretty good things with snapshotting, and with making snapshots available to users without any effort.
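As a concrete (hypothetical) illustration of the NFS side on a ZFS server - dataset names invented:

    zfs set sharenfs=on tank/home
    zfs snapshot tank/home@before-upgrade

Clients with tank/home mounted can then typically browse the snapshot read-only under .zfs/snapshot/before-upgrade inside the mount, with no extra administration on the client side. With iSCSI, the equivalent zvol snapshot exists on the server but is just another block device that a client would have to be pointed at and mount explicitly.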
Fundamentally, my recommendation is to choose NFS if your clients can use it. You'll get a lot of potential advantages from the NFS/ZFS integration, and so better performance. Plus you can serve multiple clients, etc.

The only reason to use iSCSI is when you don't have a choice, IMO. You should only use iSCSI with a single initiator at any point in time unless you have some higher-level contention management in place.

- Garrett

On Fri, 2010-07-23 at 22:20 -0400, Edward Ned Harvey wrote:
> [previous message snipped]
> From: Garrett D'Amore [mailto:garrett at nexenta.com]
>
> The only reason to use iSCSI is when you don't have a choice, IMO. You
> should only use iSCSI with a single initiator at any point in time
> unless you have some higher level contention management in place.

So ... you don't think filesystems like GFS etc. should ever be used?
On Sat, 2010-07-24 at 19:54 -0400, Edward Ned Harvey wrote:
> So ... you don't think filesystems like GFS etc. should ever be used?

GFS provides such higher-level contention management. I can't speak for it myself, but my gut reaction is that unless you have a need for the features of GFS, you are probably better served by NFS. Running a more traditional filesystem (one that does not allow concurrent block device access) with multiple initiators is almost certainly a bad idea unless you have special needs.

- Garrett
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Sigbjorn Lie
>>
>> What about mirroring? Do I need mirrored ZIL devices in case of a power
>> outage?
>
> You don't need mirroring for the sake of a *power outage*, but you *do* need
> mirroring for the sake of preventing data loss when one of the SSD devices
> fails.
>
> [rest snipped]

Ah, I see! Thanks.
> -----Original Message-----
> From: zfs-discuss-bounces at opensolaris.org
> [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Garrett D'Amore
> Sent: Friday, July 23, 2010 11:46 PM
>
> Fundamentally, my recommendation is to choose NFS if your clients can
> use it. You'll get a lot of potential advantages from the NFS/ZFS
> integration, and so better performance. Plus you can serve multiple
> clients, etc.
>
> The only reason to use iSCSI is when you don't have a choice, IMO. You
> should only use iSCSI with a single initiator at any point in time
> unless you have some higher level contention management in place.
>
> - Garrett

I think there may be a very good reason to use iSCSI: if you're limited to gigabit but need to be able to handle higher throughput for a single client. I may be wrong, but I believe iSCSI to/from a single initiator can take advantage of multiple links in an active-active multipath scenario, whereas NFS is only going to be able to take advantage of one link (at least until pNFS).

-Will
On Sun, 2010-07-25 at 17:53 -0400, Saxon, Will wrote:
> I think there may be a very good reason to use iSCSI: if you're limited
> to gigabit but need to be able to handle higher throughput for a
> single client. I may be wrong, but I believe iSCSI to/from a single
> initiator can take advantage of multiple links in an active-active
> multipath scenario, whereas NFS is only going to be able to take
> advantage of one link (at least until pNFS).

There are other ways to get multiple paths. First off, there is IP multipathing (IPMP), which offers some of this at the IP layer. There is also 802.3ad link aggregation (trunking). So you can still get high performance beyond a single link with NFS. (It works with iSCSI too, btw.)

-- Garrett
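For reference, a hedged sketch of the link-aggregation side on an OpenSolaris-era box (interface names are examples; Solaris 10 uses -d and a numeric key instead of -l and a name):

    dladm create-aggr -P L4 -l e1000g0 -l e1000g1 aggr0

The -P L4 policy hashes outbound traffic on the layer-4 headers, which matters for the point made in the next message: each TCP connection still rides a single physical link, so load only spreads when there are several connections to hash.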
On Sun, Jul 25, 2010 at 8:50 PM, Garrett D'Amore <garrett at nexenta.com> wrote:
> There are other ways to get multiple paths. First off, there is IP
> multipathing (IPMP), which offers some of this at the IP layer. There is
> also 802.3ad link aggregation (trunking). So you can still get high
> performance beyond a single link with NFS. (It works with iSCSI too, btw.)

With both IPMP and link aggregation, each TCP session will go over the same wire. There is no guarantee that load will be evenly balanced between links when there are multiple TCP sessions. As such, any scalability you get from these configurations will depend on having a complex enough workload, wise configuration choices, and a bit of luck.

Note that with Sun Trunking there was an option to load balance using a round-robin hashing algorithm. When pushing high network loads this may cause performance problems with reassembly.

--
Mike Gerdts
http://mgerdts.blogspot.com/
On Sun, 2010-07-25 at 21:39 -0500, Mike Gerdts wrote:
> With both IPMP and link aggregation, each TCP session will go over the
> same wire. There is no guarantee that load will be evenly balanced
> between links when there are multiple TCP sessions. As such, any
> scalability you get from these configurations will depend on having a
> complex enough workload, wise configuration choices, and a bit of luck.

If you're really that concerned, you could use UDP instead of TCP. But that may have other detrimental performance impacts; I'm not sure how bad they would be in a data center with generally lossless ethernet links.

Btw, I am not certain that the multiple initiator support (mpxio) is necessarily any better as far as guaranteed performance/balancing. (It may be; I've not looked closely enough at it.) I should look more closely at NFS as well - if multiple applications on the same client access the same filesystem, do they use a single common TCP session, or can they each have separate instances open? Again, I'm not sure.

> Note that with Sun Trunking there was an option to load balance using
> a round-robin hashing algorithm. When pushing high network loads this
> may cause performance problems with reassembly.

Yes. Reassembly is Evil for TCP performance.

Btw, the iSCSI balancing act that was described does seem a bit contrived - a single initiator and a COMSTAR server, both client *and server* with multiple ethernet links instead of a single 10GbE link. I'm not saying it doesn't happen, but I think it happens infrequently enough that it's reasonable this scenario wasn't one that popped immediately into my head. :-)

- Garrett
On Mon, Jul 26, 2010 at 1:27 AM, Garrett D'Amore <garrett at nexenta.com> wrote:
> If you're really that concerned, you could use UDP instead of TCP. But
> that may have other detrimental performance impacts; I'm not sure how
> bad they would be in a data center with generally lossless ethernet
> links.

Heh. My horror story with reassembly was actually with connectionless transports (LLT, then UDP). Oracle RAC's cache fusion sends 8 KB blocks via UDP by default, or LLT when used in the Veritas + Oracle RAC certified configuration from 5+ years ago. The use of Sun Trunking with round-robin hashing, and the lack of jumbo frames, made every cache fusion block turn into 6 LLT or UDP packets that had to be reassembled on the other end. This was on a 15K domain with the NICs spread across I/O boards. I assume that interrupts for a NIC are handled by a CPU on the closest system board (Solaris 8, FWIW); if that assumption is true, then there would also be a flurry of inter-system-board chatter to put the block back together. In any case, performance was horrible until we got rid of round robin and enabled jumbo frames.

> Btw, I am not certain that the multiple initiator support (mpxio) is
> necessarily any better as far as guaranteed performance/balancing. (It
> may be; I've not looked closely enough at it.)

I haven't paid close attention to how mpxio works. The Veritas analog, vxdmp, does a very good job of balancing traffic down multiple paths, even when only a single LUN is accessed. The exact mode that dmp will use depends on the capabilities of the array it is talking to - many arrays work in an active/passive mode. As such, I would expect that with vxdmp or mpxio the balancing with iSCSI would be at least partially dependent on what the array said to do.

> I should look more closely at NFS as well - if multiple applications on
> the same client access the same filesystem, do they use a single
> common TCP session, or can they each have separate instances open?
> Again, I'm not sure.

It's worse than that. A quick experiment with two different automounted home directories from the same NFS server suggests that both home directories share one TCP session to the NFS server.
The latest version of Oracle's RDBMS supports a userland NFS client option. It would be very interesting to see if this does a separate session per data file, possibly allowing for better load spreading.

> Btw, the iSCSI balancing act that was described does seem a bit
> contrived - a single initiator and a COMSTAR server, both client *and
> server* with multiple ethernet links instead of a single 10GbE link.
>
> I'm not saying it doesn't happen, but I think it happens infrequently
> enough that it's reasonable this scenario wasn't one that popped
> immediately into my head. :-)

It depends on whether the people who control the network gear are the same ones who control the servers. My experience suggests that if there is a disconnect, it seems rather likely that each group's standardization efforts, procurement cycles, and capacity plans will work against any attempt to have an optimal configuration.

Also, it is rather common to have multiple 1 Gb links to servers going to disparate switches so as to provide resilience in the face of switch failures. This is not unlike (at a block diagram level) the architecture that you see in pretty much every SAN. In such a configuration, it is reasonable for people to expect that load balancing will occur.

--
Mike Gerdts
http://mgerdts.blogspot.com/
> -----Original Message-----
> From: Garrett D'Amore [mailto:garrett at nexenta.com]
> Sent: Monday, July 26, 2010 2:27 AM
>
> If you're really that concerned, you could use UDP instead of TCP. But
> that may have other detrimental performance impacts; I'm not sure how
> bad they would be in a data center with generally lossless ethernet
> links.

UDP is an advantage for NFS in this regard.

> Btw, I am not certain that the multiple initiator support (mpxio) is
> necessarily any better as far as guaranteed performance/balancing. (It
> may be; I've not looked closely enough at it.)

I'm not sure I'm referring to multi-initiator. iSCSI can have multiple sessions between an initiator and a target, or multiple sessions by virtue of connections to different targets presenting the same LUN (this is the multipathing I am talking about). I'm not sure about the multiple-sessions-between-a-single-initiator-and-target scenario, but the single initiator/multiple target config can work in an IPMP scenario to get you more usable capacity between your initiator and target(s) over multiple links, using a variety of algorithms to balance load amongst the sessions.

> I should look more closely at NFS as well - if multiple applications on
> the same client access the same filesystem, do they use a single
> common TCP session, or can they each have separate instances open?
> Again, I'm not sure.

This is probably going to depend on the software, but in the scenario I am personally interested in (VMware) it doesn't really matter: it's a single application. VMware says they create two sessions - one for control and one for data - so I assume the maximum speed available for data transfer to/from a particular mount is going to be the speed of one link.

I guess this is getting way off topic for the list, but VMware also computes a unique ID for each NFS mount. The ID is computed somehow from the mount configuration. If NFS datastore IDs are not identical, then VMware thinks they are different datastores regardless of their contents.
I have had a situation where some clients thought a particular datastore was different from the same datastore on some other clients, which prevented VMs hosted on that datastore from migrating between those clients. I traced the problem to inconsistent NFS mount configs; I'd used the FQDN for the configuration on most of the hosts but an IP address on the others. Reconfiguration resolved the issue.

This would suggest that, at least for this client/server combo, it would not be possible to do manual load balancing by pointing some clients at one IP and some at another IP for the same export. It would have to be balanced per export instead, which is a lot less convenient. VMware could also be more intelligent about generating their ID.

> Btw, the iSCSI balancing act that was described does seem a bit
> contrived - a single initiator and a COMSTAR server, both client *and
> server* with multiple ethernet links instead of a single 10GbE link.
>
> I'm not saying it doesn't happen, but I think it happens infrequently
> enough that it's reasonable this scenario wasn't one that popped
> immediately into my head. :-)

I don't agree that it's contrived, but I do agree that it's reasonable you didn't think of it :). I don't want to have to create a bunch of custom initiator/target configurations to spread load. I want to have a target with some particular configuration and a bunch of initiators configured identically to each other, and I want load to be spread across the available gigabit links.

My understanding is that the way to do this is to have multiple targets configured per LUN on the storage server, with each target set up to be available only on a specific network. Each initiator/client is set up with interfaces in these networks and pointed at these targets, and an appropriate load-balancing algorithm is chosen to spread load between the sessions. The configuration can use individual links and/or an 802.3ad configuration if the aggregates are also 802.1q trunks. This is obviously dependent on client/initiator support, but I think initiators that claim to support MPIO or multipathing implement something like this to get it done.

I'm pretty sure Solaris/COMSTAR permits this also, with multiple targets configured per LUN and each target able to be pinned to specific IP addresses. I haven't actually done this, though, so I guess I'm not 100% certain.

-Will
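On the VMware datastore-ID point above, a hedged illustration (hostnames and share paths are invented, and the exact flags may vary by ESX release): the fix is simply to add the datastore with an identical server string and share path on every host, e.g.

    esxcfg-nas -a -o filer.example.com -s /export/vmfs01 vmfs01

run the same way on each ESX host, rather than mixing the FQDN on some hosts and the raw IP on others.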
>>>>> "mg" == Mike Gerdts <mgerdts at gmail.com> writes: >>>>> "sw" == Saxon, Will <Will.Saxon at sage.com> writes:sw> I think there may be very good reason to use iSCSI, if you''re sw> limited to gigabit but need to be able to handle higher sw> throughput for a single client. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6817942 look at it now before it gets pulled back inside the wall. :( I think this bug was posted on zfs-discuss earlier. Please see the comments because he is not using lagg''s: even with a single 10Gbit/s NIC, you cannot use the link well unless you take advantage of the multiple MSI''s and L4 preclass built into the NIC. You need multiple TCP circuits between client and server so that each will fire a different MSI. He got about 3x performance using 8 connections. It sounds like NFS is already fixed for this, but requires manual tuning of clnt_max_conns and the number of reader and writer threads. mg> it is rather common to have multiple 1 Gb links to mg> servers going to disparate switches so as to provide mg> resilience in the face of switch failures. This is not unlike mg> (at a block diagram level) the architecture that you see in mg> pretty much every SAN. In such a configuation, it is mg> reasonable for people to expect that load balancing will mg> occur. nope. spanning tree removes all loops, which means between any two points there will be only one enabled path. An L2-switched network will look into L4 headers for splitting traffic across an aggregated link (as long as it''s been deliberately configured to do that---by default probably only looks to L2), but it won''t do any multipath within the mesh. Even with an L3 routing protocol it usually won''t do multipath unless the costs of the paths match exactly, so you''d want to build the topology to achieve this and then do all switching at layer 3 by making sure no VLAN is larger than a switch. There''s actually a cisco feature to make no VLAN larger than a *port*, which I use a little bit. It''s meant for CATV networks I think, or DSL networks aggregated by IP instead of ATM like maybe some European ones? but the idea is not to put edge ports into vlans any more but instead say ''ip unnumbered loopbackN'', and then some black magic they have built into their DHCP forwarder adds /32 routes by watching the DHCP replies. If you don''t use DHCP you can add static /32 routes yourself, and it will work. It does not help with IPv6, and also you can only use it on vlan-tagged edge ports (whaaaaat? arbitrary!) but neat that it''s there at all. http://www.cisco.com/en/US/docs/ios/12_3t/12_3t4/feature/guide/gtunvlan.html The best thing IMHO would be to use this feature on the edge ports, just as I said, but you will have to teach the servers to VLAN-tag their packets. not such a bad idea, but weird. You could also use it one hop up from the edge switches, but I think it might have problems in general removing the routes when you unplug a server, and using it one hop up could make them worse. I only use it with static routes so far, so no mobility for me: I have to keep each server plugged into its assigned port, and reconfigure switches if I move it. Once you have ``no vlan larger than 1 switch,'''' if you actually need a vlan-like thing that spans multiple switches, the new word for it is ''vrf''. so, yeah, it means the server people will have to take over the job of the networking people. 
Once you have "no VLAN larger than one switch," if you actually need a VLAN-like thing that spans multiple switches, the new word for it is 'vrf'.

So, yeah, it means the server people will have to take over the job of the networking people. The good news is that networking people don't like spanning tree very much because it's always going wrong, so AFAICT most of them who are paying attention are already moving in this direction.
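A hedged sketch of the /etc/system side of the clnt_max_conns tuning mentioned above (the value 8 matches the 8-connection experiment in the bug report; treat it as an example, not a recommendation):

    set rpcmod:clnt_max_conns = 8

followed by a reboot. The companion reader/writer thread counts are the ones discussed in the bug report's comments, which is where I'd look for values that matched the 10GbE testing.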
On Mon, Jul 26, 2010 at 2:56 PM, Miles Nordin <carton at ivy.net> wrote:
>     mg> it is rather common to have multiple 1 Gb links to
>     mg> servers going to disparate switches so as to provide
>     mg> resilience in the face of switch failures. [...] In such a
>     mg> configuration, it is reasonable for people to expect that load
>     mg> balancing will occur.
>
> Nope. Spanning tree removes all loops, which means between any two
> points there will be only one enabled path. An L2-switched network
> will look into L4 headers for splitting traffic across an aggregated
> link (as long as it's been deliberately configured to do that - by
> default it probably only looks at L2), but it won't do any multipath
> within the mesh.

I was speaking more of IPMP, which is at layer 3.

> Even with an L3 routing protocol it usually won't do multipath unless
> the costs of the paths match exactly, so you'd want to build the
> topology to achieve this and then do all switching at layer 3 by
> making sure no VLAN is larger than a switch.

By default, IPMP does outbound load spreading. Inbound load spreading is not practical with a single (non-test) IP address. If you have multiple virtual IPs, you can spread them across all of the NICs in the IPMP group and get some degree of inbound spreading as well. This is the default behavior of the OpenSolaris IPMP implementation, last I looked. I've not seen any examples (although I can't say I've looked real hard either) of the Solaris 10 IPMP configuration set up with multiple IPs to encourage inbound load spreading as well.

> There's actually a Cisco feature to make no VLAN larger than a *port*,
> which I use a little bit. [...] If you don't use DHCP you can add static
> /32 routes yourself, and it will work. It does not help with IPv6, and
> you can only use it on VLAN-tagged edge ports (whaaaaat? arbitrary!), but
> it's neat that it's there at all.
>
> http://www.cisco.com/en/US/docs/ios/12_3t/12_3t4/feature/guide/gtunvlan.html

Interesting... however this seems to limit you to < 4096 edge ports per VTP domain, as the VID field in the 802.1q header is only 12 bits. It is also unclear how this works when you have one physical host with many guests. And then there is the whole thing that I don't really see how this helps with resilience in the face of a switch failure. Cool technology, but I'm not certain that it addresses what I was talking about.

> The best thing IMHO would be to use this feature on the edge ports,
> just as I said, but you will have to teach the servers to VLAN-tag
> their packets - not such a bad idea, but weird.
>
> You could also use it one hop up from the edge switches, but I think
> it might have problems in general removing the routes when you unplug
> a server, and using it one hop up could make them worse. I only use
> it with static routes so far, so no mobility for me: I have to keep
> each server plugged into its assigned port, and reconfigure switches
> if I move it.
> Once you have "no VLAN larger than one switch," if you
> actually need a VLAN-like thing that spans multiple switches, the new
> word for it is 'vrf'.

There was some other Cisco dark magic that our network guys were touting a while ago that would make each edge switch look like a blade in a 6500 series. This would then allow them to do link aggregation across edge switches. At least two of "organizational changes", "personnel changes", and "roadmap changes" happened, so I've not seen this in action.

> So, yeah, it means the server people will have to take over the job of
> the networking people. The good news is that networking people don't
> like spanning tree very much because it's always going wrong, so
> AFAICT most of them who are paying attention are already moving in
> this direction.

--
Mike Gerdts
http://mgerdts.blogspot.com/