We have moved our 3rd-party multiuser billing system database files from Novell NetWare to Samba (first 2.2.8a, and now, just last Friday evening, upgraded to 3.0.2a) on Mandrake Linux 9.2 (kernel version 2.4). Now, about once every week or so, we get a file corruption, and last week (even after upgrading some NICs) seemed to be even worse, with 5 or 6 problems.

Because we had no problems before changing servers, I think hardware errors are probably not to blame, even though I've seen them implicated in Samba discussions. And because it only occurs when multiple users are in a file (never otherwise, even after many, many index rebuilds and other file repair operations done by a single user), my guess is that it stems from some sort of locking or other synchronization problem. Also, so far there does not seem to be a pattern as to which workstations have errors, except generally the most-used ones.

We have a mix of Win 98 and Win 2K clients (mostly the former). We used to have two Win 95 workstations, but upgraded them to 98 to try to solve the problems. No Unix programs access these files (except for nightly backups), only the billing software using Samba. The workstations still log in to NetWare as the primary network login, then use Windows networking to map the drive to Samba.

Our Samba configuration file is very simple, with only one share. I've tried various combinations of these three settings:

1. I turned off all oplocks, and that didn't fix it.

2. I set sync always = yes and strict sync = yes, and that didn't fix it either. (I have turned these off & on several times to see if there's any effect.)

3. Most recently I have set strict locking = yes.

Week before last we had 3 corruptions in 2 days. After the first two, that's when I finally turned on #3 above, and then within a few hours had the third corruption.
The boss is really getting upset that I have to kick everyone off the system to rebuild the problem file--some of these files are > 300MB and take 2 hours or more to rebuild. He is saying another problem, and Samba goes into the trash and we revert to the Novell server.

I know it's hard to track down things like this, but here are some specific questions:

1. Are there any other options anyone can suggest trying? Also, apart from a server crash, would you expect #2 to be actually relevant to the problem or not?

2. I know Samba is supposed to re-read the config file periodically, and I'm counting on that when I change the various options. But how can I really tell whether or not Samba has changed the option--and more to the point, changed its behavior? Do any of the above options have inherent delays before Samba can change? The way some of the corruptions have come shortly after I changed a setting which would be expected to make the files MORE safe, not less, has me wondering whether Samba is really changing the settings. I can use smbstatus to confirm there are no oplocks, but what about the other settings? In other words, must I stop & restart Samba after changes such as these (thereby temporarily kicking everyone off the system, a real hassle)?

3. What debugging level would be required for a developer to investigate this? Would a combined log be preferable, or would separate logs for each workstation be usable? Is there a way to get Samba logs to contain only the most recent stuff leading up to a non-reproducible-on-demand incident like this, without filling them up with hours or days of clutter?

4. Does anyone know of some software I could run to actually test Samba for problems? Something that would really exercise multi-user access?

Any help would be MUCH appreciated. I'm running out of time.

Thanks

-- Warren

On Sun, Jun 13, 2004 at 08:05:02PM -0500, Warren Odom wrote:
>
> Week before last we had 3 corruptions in 2 days. After the first two,
> that's when I finally turned on #3 above, and then within a few hours had
> the third corruption. The boss is really getting upset that I have to kick
> everyone off the system to rebuild the problem file--some of these files are
> > 300MB and take 2 hours or more to rebuild. He is saying another problem,
> and Samba goes into the trash and we revert to the Novell server.

Ok, the interesting thing here is that you've moved from using Novell client software on your Windows clients to depending on the Microsoft code shipped with Windows. The Microsoft code is noticeably poorer quality at doing things like multi-user database access than the Novell code.

Note I'm not ruling out a bug in Samba, but I don't know of any reproducible data corruption bugs in the current versions of the code (naturally if I did I'd be working full time on fixing them immediately :-).

Have you updated to the latest redirector hotfixes on all your clients? Turning off oplocks and setting strict locking is the *minimum* you should do to attempt multi-user database access using the Windows client code.

> 4. Does anyone know of some software I could run to actually test Samba for
> problems? Something that would really exercise multi-user access.?

Does your vendor have test code to ensure server/client correctness against a particular server? What recommendations does the vendor make when using their software in an all-Windows (no Novell) environment? Do they recommend any changes in server or client settings?

Jeremy.
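
For the settings discussed in this thread, here is a minimal sketch of checking what the configuration file actually says for a share. It is not a Samba tool, and it only reads the file: it cannot tell you what a running smbd has loaded. The share name `billing`, the helper names, and the set of accepted boolean spellings are assumptions made for illustration; the `WANTED` table simply mirrors the values tried above and is not a recommendation.

```python
import re

# Values tried in this thread (illustrative, not a recommendation).
WANTED = {
    "oplocks": False,
    "strict locking": True,
    "sync always": True,
    "strict sync": True,
}

# Assumed boolean spellings; adjust if your smb.conf uses others.
TRUE_WORDS = {"yes", "true", "1"}
FALSE_WORDS = {"no", "false", "0"}


def parse_smb_conf(text):
    """Parse smb.conf-style text into {section: {parameter: value}}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip().lower()
            sections.setdefault(current, {})
            continue
        if "=" in line and current is not None:
            key, value = line.split("=", 1)
            # Normalise case and inner whitespace in the parameter name.
            key = re.sub(r"\s+", " ", key.strip().lower())
            sections[current][key] = value.strip()
    return sections


def as_bool(value):
    """Map a yes/no style value to True/False, or None if unrecognised."""
    word = value.strip().lower()
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    return None


def check_share(sections, share):
    """Return {parameter: (wanted, found)} for settings that differ or are unset."""
    merged = dict(sections.get("global", {}))
    merged.update(sections.get(share.lower(), {}))  # share overrides [global]
    problems = {}
    for name, wanted in WANTED.items():
        found = as_bool(merged[name]) if name in merged else None
        if found != wanted:
            problems[name] = (wanted, found)
    return problems
```

Usage would be along the lines of `check_share(parse_smb_conf(open(path).read()), "billing")`, where `path` is wherever your smb.conf lives; an empty result means the file matches the table.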

Postscript to my last message:

I was hoping you could tell me if there were a "Samba exerciser" program that would simulate this sort of file activity, and would report any errors along with the valuable information of which workstation caused the problem. One or two of the archived threads I saw referred to some such software used by the Samba team--might this be available for others to test with? Or would it not be applicable to a database environment?

-- Warren
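
As a rough illustration of what such an exerciser could look like (this is not the Samba team's test software, and it knows nothing about the billing database format): several processes repeatedly lock a shared counter, read it, increment it, and write it back. If any update is lost, the final count comes up short. The sketch assumes a POSIX client with the share mounted and uses Python's `fcntl.lockf`; the names `bump_counter` and `run_exerciser` and the counter file are hypothetical. Running it from Windows clients would need the locking calls replaced with a Windows equivalent, which is not attempted here.

```python
import fcntl
import os
import struct
import sys
import tempfile

RECORD = struct.Struct("<Q")  # one 8-byte little-endian counter


def bump_counter(path, iterations):
    """Increment the shared counter `iterations` times under a byte-range lock."""
    fd = os.open(path, os.O_RDWR)
    try:
        for _ in range(iterations):
            # Exclusive lock on the 8 bytes at offset 0; blocks until granted.
            fcntl.lockf(fd, fcntl.LOCK_EX, RECORD.size, 0, os.SEEK_SET)
            try:
                (value,) = RECORD.unpack(os.pread(fd, RECORD.size, 0))
                os.pwrite(fd, RECORD.pack(value + 1), 0)
            finally:
                fcntl.lockf(fd, fcntl.LOCK_UN, RECORD.size, 0, os.SEEK_SET)
    finally:
        os.close(fd)


def run_exerciser(path, workers=4, iterations=500):
    """Fork `workers` processes against `path`; return (expected, actual)."""
    with open(path, "wb") as f:
        f.write(RECORD.pack(0))
    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:  # child process
            status = 1
            try:
                bump_counter(path, iterations)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    with open(path, "rb") as f:
        (actual,) = RECORD.unpack(f.read(RECORD.size))
    return workers * iterations, actual


if __name__ == "__main__":
    # Pass a path on the mounted share; defaults to a local temp file.
    default = os.path.join(tempfile.gettempdir(), "lock_exerciser.dat")
    target = sys.argv[1] if len(sys.argv) > 1 else default
    expected, actual = run_exerciser(target)
    print("expected", expected, "actual", actual,
          "OK" if expected == actual else "LOST UPDATES")
```

A test like this only says whether updates were lost, not which workstation caused the loss; to narrow that down you would have to run it from one pair of clients at a time.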

> Now, about once every week or so, we get a file corruption, and last week
> (even after upgrading some NICs) seemed to be even worse, with 5 or 6 problems.
> Because we had no problems before changing servers, I think hardware errors
> are probably not to blame,

Hi Warren!

Unfortunately, I don't know that I have a good answer for you, but I thought I'd share this. We too have been experiencing EXACTLY what you have described these past two weeks...seemingly random file corruption. We're running Samba 3 on a RedHat Linux 9.0 box. We temporarily switched to NFS but that turned out to be more of a nightmare than Samba (for us anyway). Finally, after googling for hours and posting to this list, we decided to try to troubleshoot the problem on a lower level the best we knew how.

There are three main things you can try that might reveal some clues as to what's going on. Maybe you've tried them already.

1) Use strace to start smbd, or attach it to an already running child process. If you have a general idea of when or under what circumstances these corruptions occur, that would be a good time to fire it off, because it spits out an insane amount of data.

2) Turn Samba's log level to 3. Again, do that around the time you think corruptions may occur. Logging level 3 is VERY intense on your server and will definitely affect performance.

3) Use ethereal to capture and examine the network traffic. Look through the SMB packets and see what you can see.

Of course, all those things really only help if you can reproduce the problem to some degree. We had our hopes set high on strace, but after having experienced a known kernel bug, we could not use it. Since we had spent so much time on the problem, we upgraded to RedHat Enterprise Edition...all the problems vanished immediately.

Hope this helps a little!

Matthew Connor
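
Since a level 3 log grows quickly and the corruption cannot be triggered on demand, one way to keep only the lines leading up to an incident is to copy the tail of the log as soon as a corruption is noticed. The following is a minimal sketch using a bounded `collections.deque`; the function name, the file paths, and the default of 2000 lines are assumptions for illustration, not anything Samba provides.

```python
from collections import deque


def snapshot_log(log_path, out_path, keep=2000):
    """Copy the last `keep` lines of `log_path` to `out_path`.

    Returns the number of lines written.
    """
    # A deque with maxlen discards the oldest line whenever it is full,
    # so memory use stays bounded no matter how large the log is.
    with open(log_path, "r", errors="replace") as src:
        recent = deque(src, maxlen=keep)
    with open(out_path, "w") as dst:
        dst.writelines(recent)
    return len(recent)
```

For example, `snapshot_log("/path/to/smbd.log", "/tmp/incident.log")` right after a corruption report would preserve the most recent 2000 lines for later inspection, where both paths are placeholders for your own.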

On Tue, Jun 15, 2004 at 08:59:37AM -0400, mbc@reisonline.com wrote:
>
> Of course, all those things really only help if you can reproduce
> the problem to some degree. We had our hopes set high on strace, but after
> having experienced a known kernel bug, we could not use it. Since we had
> spent so much time on the problem, we upgraded to RedHat Enterprise
> Edition...all the problems vanished immediately.

Very interesting. What filesystem were you using on RH9.x and RHAS?

Jeremy.