We've been running the distribution version of Samba 3.6.9 as a DC on CentOS 6 for a few weeks now with a number of Samba 4 clients. We have ~20,000 machine and user accounts in our password database.

We see a SIGSEGV every 24-48 hours or so in tcopy_passwd.

I built Samba 3.6.24 with debugging symbols and waited for a core file, which we got this morning.

Looking through the code, we can see that there was a machine lookup for one of our Samba 4 servers which was handed a duff memory pointer from a memcache lookup. When it tried to dereference that pointer, the SIGSEGV resulted.

I'm working on the theory that the heap has been corrupted rather than that memcache itself is at fault, but I thought I'd check in here to make sure no-one else was seeing the same problem and had already made progress in fixing it.

--
Jonathan Knight
IT Services
Keele University

On Thu, Sep 18, 2014 at 12:12:13PM +0100, Jonathan Knight wrote:
> We've been running the distribution version of Samba 3.6.9 as a DC on
> CentOS 6 for a few weeks now with a number of Samba 4 clients. We have
> ~20,000 machine and user accounts in our password database.
>
> We see a SIGSEGV every 24-48 hours or so in tcopy_passwd.
>
> I built a Samba 3.6.24 with debugging symbols and waited for a core file
> which we got this morning.
>
> Looking through the code we can see that there was a machine lookup for one
> of our Samba 4 servers which was handed a duff memory pointer from a
> memcache lookup. When it tried to dereference the memory pointer the
> SIGSEGV resulted.
>
> I'm working on the theory that the heap has been corrupted rather than a
> fault with memcache, but I thought I'd just check in here to make sure
> no-one else was seeing the same problems and had already made progress in
> fixing it.

Not a known issue. Can you reproduce under valgrind?

Hi All,

To follow up on the SIGSEGVs we were experiencing under 3.6.9 and 3.6.24: I spent some time with the core files and identified the problem as being in the in-memory caching feature. The failure occurs when a machine trust account is looked up after 2-7 days and the unix->pw pointer is returned with a very low (non-zero) value that cannot be a valid memory address. The tcopy_passwd function tries to dereference the pointer and gets a SIGSEGV.

To restore the service I've regressed to 3.5.22, which pre-dates the in-memory caching feature, and we've not seen any issues since then. As semester starts on Saturday, we're likely to stick with the 3.5 version rather than trying to fix the 3.6 one, just so we can run a reliable service.

However, as I have a collection of core files to explore, I am tempted to poke about and see if I can spot what happened to that pointer. The values in the core files are strangely similar, which hints that there is a straightforward answer to what happened.

Jon.

--
Jonathan Knight
IT Services
Keele University
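
For anyone following along, here is a minimal sketch of the failure mode described above and of a defensive check that would turn the crash into a lookup miss. It is not Samba's code: `checked_copy_passwd`, `ptr_is_plausible`, `free_passwd_copy` and the 4096-byte threshold are hypothetical names and values chosen for illustration, and it uses plain `calloc`/`strdup` rather than talloc. The threshold assumes the first page of the address space is unmapped, which is typical on Linux but not guaranteed. A check like this only hides the symptom; whatever overwrote the cached pointer is still the real bug.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup() under strict -std= modes */
#include <assert.h>
#include <pwd.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical threshold: assume nothing is mapped in the first page,
 * so any "pointer" below this value cannot be dereferenced safely. */
#define MIN_PLAUSIBLE_ADDR ((uintptr_t)4096)

/* Heuristic only: rejects NULL and very low values such as 0x30. */
static int ptr_is_plausible(const void *p)
{
    return (uintptr_t)p >= MIN_PLAUSIBLE_ADDR;
}

static char *dup_or_null(const char *s)
{
    return s != NULL ? strdup(s) : NULL;
}

/* Release a copy made by checked_copy_passwd(); NULL is accepted. */
static void free_passwd_copy(struct passwd *pw)
{
    if (pw == NULL) {
        return;
    }
    free(pw->pw_name);
    free(pw->pw_passwd);
    free(pw->pw_dir);
    free(pw->pw_shell);
    free(pw);
}

/* Deep-copy the POSIX-mandated fields of a struct passwd.
 * Returns NULL if 'from' is NULL or implausibly low, or if an
 * allocation fails. The caller should treat NULL as a cache miss. */
static struct passwd *checked_copy_passwd(const struct passwd *from)
{
    struct passwd *to;

    if (!ptr_is_plausible(from)) {
        return NULL;  /* an unchecked copy would SIGSEGV on from->pw_name */
    }
    to = calloc(1, sizeof(*to));
    if (to == NULL) {
        return NULL;
    }
    to->pw_name = dup_or_null(from->pw_name);
    to->pw_passwd = dup_or_null(from->pw_passwd);
    to->pw_uid = from->pw_uid;
    to->pw_gid = from->pw_gid;
    to->pw_dir = dup_or_null(from->pw_dir);
    to->pw_shell = dup_or_null(from->pw_shell);

    /* Treat a failed string duplication as an overall failure. */
    if ((from->pw_name != NULL && to->pw_name == NULL) ||
        (from->pw_passwd != NULL && to->pw_passwd == NULL) ||
        (from->pw_dir != NULL && to->pw_dir == NULL) ||
        (from->pw_shell != NULL && to->pw_shell == NULL)) {
        free_passwd_copy(to);
        return NULL;
    }
    return to;
}
```

A caller would treat a NULL return as "not in the cache" and fall back to a fresh lookup, logging the bad pointer value so that the recurring values seen in the core files can be compared.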