On Feb 16, 2006 14:02 -0500, John R. Dunning wrote:
> When I run 4 copies of bonnie++ on my cluster, one per compute node,
> each chewing on its own subdirectory, I can fairly reliably get it to
> fail when it gets to the portion of the test where bonnie is creating
> thousands of empty/small files. I typically get an error like:
>
> Can't create file 00013/001372LwXOJ
>
> When I run only a single instance of bonnie, I get no failures.

Does it report why this failed? Are you running out of space/inodes?
Do "grep '[0-9]' /proc/fs/lustre/*c/*/{kbytes,files}{free,total,avail}"
to see per-target statfs information.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From: Andreas Dilger <adilger@clusterfs.com>
Date: Thu, 16 Feb 2006 13:41:05 -0700

   On Feb 16, 2006 14:02 -0500, John R. Dunning wrote:
   > When I run 4 copies of bonnie++ on my cluster, one per compute node,
   > each chewing on its own subdirectory, I can fairly reliably get it to
   > fail when it gets to the portion of the test where bonnie is creating
   > thousands of empty/small files. I typically get an error like:
   >
   > Can't create file 00013/001372LwXOJ
   >
   > When I run only a single instance of bonnie, I get no failures.

   Does it report why this failed? Are you running out of space/inodes?

D'oh! How embarrassing. Yes, bonnie was doing a better job of creating
files than I thought it was, using up all the inodes, then carefully not
reporting why it failed, and cleaning up after itself. When I rebuilt
the fs with more space, lustre worked great, which was what I should
have expected all along :-}

Thanks!
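
For anyone who hits the same symptom: the diagnosis above came down to checking free inodes and getting the real reason for the failed create. Below is a minimal sketch of both steps in Python, using only the standard `os.statvfs` and `errno` interfaces. The function names (`inode_report`, `explain_create_failure`, `try_create`) are made up for this illustration and come from neither bonnie++ nor Lustre. A client-side statfs most likely reports totals for the whole filesystem, so the per-target /proc files mentioned earlier in the thread are still the place to look for a single full target.

```python
import errno
import os


def inode_report(path):
    """Return space and inode counters for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "kbytes_total": st.f_blocks * st.f_frsize // 1024,
        "kbytes_avail": st.f_bavail * st.f_frsize // 1024,
        "files_total": st.f_files,   # total inodes
        "files_free": st.f_ffree,    # free inodes
    }


def explain_create_failure(exc):
    """Turn the OSError from a failed create into a readable reason."""
    if exc.errno == errno.ENOSPC:
        # ENOSPC covers both "no blocks left" and "no inodes left".
        return "out of space or inodes (ENOSPC)"
    return os.strerror(exc.errno)


def try_create(path):
    """Create an empty file; return None on success, else the reason."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except OSError as exc:
        return explain_create_failure(exc)
    os.close(fd)
    return None
```

Calling `inode_report("/mnt/lustre")` (the mount point is an assumption, substitute your own) before and during the small-file phase shows whether `files_free` is heading toward zero, and `try_create` reports the errno that the "Can't create file" message leaves out.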

Hi wizards.

I've put together a cluster of 4 compute nodes and two file server
nodes, for torture-testing. I'm running lustre 1.4.5.1, against a
vanilla kernel.org 2.6.9 kernel. Before you jump all over me, I know
that's not one of the blessed kernels :-} But for various reasons, it
doesn't work for me to use the RHEL or SUSE ones. So I did some surgery
to make it work on a vanilla kernel. Said surgery is a likely cause of
what I'm now seeing...

When I run 4 copies of bonnie++ on my cluster, one per compute node,
each chewing on its own subdirectory, I can fairly reliably get it to
fail when it gets to the portion of the test where bonnie is creating
thousands of empty/small files. I typically get an error like:

Can't create file 00013/001372LwXOJ

When I run only a single instance of bonnie, I get no failures.

So now I need to dig deeper and debug this thing. I'm currently running
with debug_daemon enabled, in hopes of being able to see something in
the log which correlates to the error reported by the client.

Does anybody have other suggestions about places I should look at in
the code, or other ways of getting relevant debug output out of it?

Thanks in advance...