thr3ads.net - Xen devel - [PATCH] tools/hotplug: fix locking [Jun 2012]

If this information is useful, please help other people find it:
Share via:

Zhigang Wang

2012-Jun-11 13:59 UTC

[PATCH] tools/hotplug: fix locking

# HG changeset patch
# User Zhigang Wang <zhigang.x.wang@oracle.com>
# Date 1339421383 14400
# Node ID eb72d7090b5956490b2f26c858cd9d896feca309
# Parent  32034d1914a607d7b6f1f060352b4cac973600f8
tools/hotplug: fix locking

The current locking implementation would allow two processes get the lock
simultaneously:

++ echo 16741: /etc/xen/scripts/block
++ cut -d : -f 1
+ local pid=16741
+ ''['' -n 16741 -a 16741 ''!='' unknown -a
''!'' -f /proc/16741/status '']''
+ ''['' 5 -gt 5 '']''
+ sleep 0
+ retries=6
+ ''['' 6 -lt 100 '']''
+ mkdir /var/run/xen-hotplug/block
++ _lock_owner /var/run/xen-hotplug/block
++ cat /var/run/xen-hotplug/block/owner
+ local ''new_owner=16741: /etc/xen/scripts/block''
+ ''['' ''16741: /etc/xen/scripts/block''
''!='' ''16741: /etc/xen/scripts/block''
'']''
++ echo 16741: /etc/xen/scripts/block
++ cut -d : -f 1
+ local pid=16741
+ ''['' -n 16741 -a 16741 ''!='' unknown -a
''!'' -f /proc/16741/status '']''
+ ''['' 6 -gt 5 '']''
+ sleep 1
+ do_or_die losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb00001200002b0ef651033c8381.img
+ losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb00001200002b0ef651033c8381.img
+ do_or_die losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb0000120000143173990458f2a7.img
+ losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb0000120000143173990458f2a7.img
+ xenstore_write backend/vbd/33/51920/node /dev/loop27
+ _xenstore_write backend/vbd/33/51920/node /dev/loop27
+ log debug ''Writing backend/vbd/33/51920/node''
''/dev/loop27 to xenstore.''
+ local level=debug
+ shift
+ logger -p daemon.debug -- /etc/xen/scripts/block: ''Writing
backend/vbd/33/51920/node'' ''/dev/loop27 to xenstore.''
ioctl: LOOP_SET_FD: Device or resource busy
+ xenstore-write backend/vbd/33/51920/node /dev/loop27
+ fatal losetup -r /dev/loop27
''/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb0000120000143173990458f2a7.img
failed''

This patch removes the ownner support and fixed this issue per my test.

Kouya: would you please help to confirm this fix is correct?

Anyone has encountered this issue please help to test.

Signed-off-by: Zhigang Wang <zhigang.x.wang@oracle.com>
Cc: Kouya Shimura <kouya@jp.fujitsu.com>

diff -r 32034d1914a6 -r eb72d7090b59 tools/hotplug/Linux/locking.sh
--- a/tools/hotplug/Linux/locking.sh	Thu Jun 07 19:46:57 2012 +0100
+++ b/tools/hotplug/Linux/locking.sh	Mon Jun 11 09:29:43 2012 -0400
@@ -48,32 +48,14 @@ sigerr() {
 _claim_lock()
 {
   local lockdir="$1"
-  local owner=$(_lock_owner "$lockdir")
   local retries=0
 
   while [ $retries -lt $LOCK_RETRIES ]
   do
-    mkdir "$lockdir" 2>/dev/null && trap
"_release_lock $lockdir; sigerr" ERR &&
-      _update_lock_info "$lockdir" && return
-
-    local new_owner=$(_lock_owner "$lockdir")
-    if [ "$new_owner" != "$owner" ]
-    then
-      owner="$new_owner"
-      retries=0
-    else
-      local pid=$(echo $owner | cut -d : -f 1)
-      if [ -n "$pid" -a "$pid" != "unknown" -a !
-f "/proc/$pid/status" ]
-      then
-        _release_lock $lockdir
-      fi
-    fi
-
+    mkdir "$lockdir" 2>/dev/null && trap
"_release_lock $lockdir; sigerr" ERR && return
     if [ $retries -gt $LOCK_SPINNING_RETRIES ]
     then
       sleep $LOCK_SLEEPTIME
-    else
-      sleep 0
     fi
     retries=$(($retries + 1))
   done
@@ -91,20 +73,7 @@ _release_lock()
 _steal_lock()
 {
   local lockdir="$1"
-  local owner=$(cat "$lockdir/owner" 2>/dev/null || echo
"unknown")
-  log err "Forced to steal lock on $lockdir from $owner!"
+  log err "Forced to steal lock on $lockdir!"
   _release_lock "$lockdir"
   _claim_lock "$lockdir"
 }
-
-
-_lock_owner()
-{
-  cat "$1/owner" 2>/dev/null || echo "unknown"
-}
-
-
-_update_lock_info()
-{
-  echo "$$: $0" >"$1/owner"
-}

Kouya Shimura

2012-Jun-12 06:53 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

Hi,

The owner support is introduced in c/s 8175, not by me.
Anyway, something is wrong with your execution trace.
> + ''['' 6 -gt 5 '']''
> + sleep 1
<<< Why claim_lock() returns here???> + do_or_die losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb00001200002b0ef651033c8381.img
> + losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/Vi
I don''t know what is a problem and whether your patch resolves the
issue or not.

It would be better to replace locking.sh with the RHEL5 implementation
which uses ''flock'' rather than to fix it.

Thanks,
Kouya


Zhigang Wang writes:> # HG changeset patch
> # User Zhigang Wang <zhigang.x.wang@oracle.com>
> # Date 1339421383 14400
> # Node ID eb72d7090b5956490b2f26c858cd9d896feca309
> # Parent  32034d1914a607d7b6f1f060352b4cac973600f8
> tools/hotplug: fix locking
> 
> The current locking implementation would allow two processes get the lock
> simultaneously:
> 
> ++ echo 16741: /etc/xen/scripts/block
> ++ cut -d : -f 1
> + local pid=16741
> + ''['' -n 16741 -a 16741 ''!='' unknown -a
''!'' -f /proc/16741/status '']''
> + ''['' 5 -gt 5 '']''
> + sleep 0
> + retries=6
> + ''['' 6 -lt 100 '']''
> + mkdir /var/run/xen-hotplug/block
> ++ _lock_owner /var/run/xen-hotplug/block
> ++ cat /var/run/xen-hotplug/block/owner
> + local ''new_owner=16741: /etc/xen/scripts/block''
> + ''['' ''16741: /etc/xen/scripts/block''
''!='' ''16741: /etc/xen/scripts/block''
'']''
> ++ echo 16741: /etc/xen/scripts/block
> ++ cut -d : -f 1
> + local pid=16741
> + ''['' -n 16741 -a 16741 ''!='' unknown -a
''!'' -f /proc/16741/status '']''
> + ''['' 6 -gt 5 '']''
> + sleep 1
> + do_or_die losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb00001200002b0ef651033c8381.img
> + losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb00001200002b0ef651033c8381.img
> + do_or_die losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb0000120000143173990458f2a7.img
> + losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb0000120000143173990458f2a7.img
> + xenstore_write backend/vbd/33/51920/node /dev/loop27
> + _xenstore_write backend/vbd/33/51920/node /dev/loop27
> + log debug ''Writing backend/vbd/33/51920/node''
''/dev/loop27 to xenstore.''
> + local level=debug
> + shift
> + logger -p daemon.debug -- /etc/xen/scripts/block: ''Writing
backend/vbd/33/51920/node'' ''/dev/loop27 to xenstore.''
> ioctl: LOOP_SET_FD: Device or resource busy
> + xenstore-write backend/vbd/33/51920/node /dev/loop27
> + fatal losetup -r /dev/loop27
''/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb0000120000143173990458f2a7.img
failed''
> 
> This patch removes the ownner support and fixed this issue per my test.
> 
> Kouya: would you please help to confirm this fix is correct?
> 
> Anyone has encountered this issue please help to test.
> 
> Signed-off-by: Zhigang Wang <zhigang.x.wang@oracle.com>
> Cc: Kouya Shimura <kouya@jp.fujitsu.com>
> 
> diff -r 32034d1914a6 -r eb72d7090b59 tools/hotplug/Linux/locking.sh
> --- a/tools/hotplug/Linux/locking.sh	Thu Jun 07 19:46:57 2012 +0100
> +++ b/tools/hotplug/Linux/locking.sh	Mon Jun 11 09:29:43 2012 -0400
> @@ -48,32 +48,14 @@ sigerr() {
>  _claim_lock()
>  {
>    local lockdir="$1"
> -  local owner=$(_lock_owner "$lockdir")
>    local retries=0
>  
>    while [ $retries -lt $LOCK_RETRIES ]
>    do
> -    mkdir "$lockdir" 2>/dev/null && trap
"_release_lock $lockdir; sigerr" ERR &&
> -      _update_lock_info "$lockdir" && return
> -
> -    local new_owner=$(_lock_owner "$lockdir")
> -    if [ "$new_owner" != "$owner" ]
> -    then
> -      owner="$new_owner"
> -      retries=0
> -    else
> -      local pid=$(echo $owner | cut -d : -f 1)
> -      if [ -n "$pid" -a "$pid" != "unknown"
-a ! -f "/proc/$pid/status" ]
> -      then
> -        _release_lock $lockdir
> -      fi
> -    fi
> -
> +    mkdir "$lockdir" 2>/dev/null && trap
"_release_lock $lockdir; sigerr" ERR && return
>      if [ $retries -gt $LOCK_SPINNING_RETRIES ]
>      then
>        sleep $LOCK_SLEEPTIME
> -    else
> -      sleep 0
>      fi
>      retries=$(($retries + 1))
>    done
> @@ -91,20 +73,7 @@ _release_lock()
>  _steal_lock()
>  {
>    local lockdir="$1"
> -  local owner=$(cat "$lockdir/owner" 2>/dev/null || echo
"unknown")
> -  log err "Forced to steal lock on $lockdir from $owner!"
> +  log err "Forced to steal lock on $lockdir!"
>    _release_lock "$lockdir"
>    _claim_lock "$lockdir"
>  }
> -
> -
> -_lock_owner()
> -{
> -  cat "$1/owner" 2>/dev/null || echo "unknown"
> -}
> -
> -
> -_update_lock_info()
> -{
> -  echo "$$: $0" >"$1/owner"
> -}

Marek Marczykowski

2012-Jun-13 02:25 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

On 12.06.2012 08:53, Kouya Shimura wrote:> Hi,
> 
> The owner support is introduced in c/s 8175, not by me.
> Anyway, something is wrong with your execution trace.
> 
>> + ''['' 6 -gt 5 '']''
>> + sleep 1
> <<< Why claim_lock() returns here???
>> + do_or_die losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb00001200002b0ef651033c8381.img
>> + losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/Vi
> 
> I don''t know what is a problem and whether your patch resolves the
issue or not.
I can confirm this problem, I''ve faced it some time ago:
http://lists.xen.org/archives/html/xen-devel/2011-07/msg00182.html
> It would be better to replace locking.sh with the RHEL5 implementation
> which uses ''flock'' rather than to fix it.
I agree. Currently I''ve workarounded it by adding flock in udev rule
which run
this script, but working locking.sh will be much nicer.

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Ian Campbell

2012-Jun-13 08:11 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

On Mon, 2012-06-11 at 14:59 +0100, Zhigang Wang wrote:> This patch removes the ownner support and fixed this issue per my test.
I think the changelog here needs to go into a bit more detail about why
it is correct to remove the owner support. e.g. why is it no longer
needed or why is it bogus. Also why does removing it fix the issue?
Please describe the logic behind the change not just its impact on your
test case.
> Kouya: would you please help to confirm this fix is correct?
> 
> Anyone has encountered this issue please help to test.
> 
> Signed-off-by: Zhigang Wang <zhigang.x.wang@oracle.com>
> Cc: Kouya Shimura <kouya@jp.fujitsu.com>
> 
> diff -r 32034d1914a6 -r eb72d7090b59 tools/hotplug/Linux/locking.sh
> --- a/tools/hotplug/Linux/locking.sh	Thu Jun 07 19:46:57 2012 +0100
> +++ b/tools/hotplug/Linux/locking.sh	Mon Jun 11 09:29:43 2012 -0400
> @@ -48,32 +48,14 @@ sigerr() {
>  _claim_lock()
>  {
>    local lockdir="$1"
> -  local owner=$(_lock_owner "$lockdir")
>    local retries=0
>  
>    while [ $retries -lt $LOCK_RETRIES ]
>    do
> -    mkdir "$lockdir" 2>/dev/null && trap
"_release_lock $lockdir; sigerr" ERR &&
> -      _update_lock_info "$lockdir" && return
> -
> -    local new_owner=$(_lock_owner "$lockdir")
> -    if [ "$new_owner" != "$owner" ]
> -    then
> -      owner="$new_owner"
> -      retries=0
> -    else
> -      local pid=$(echo $owner | cut -d : -f 1)
> -      if [ -n "$pid" -a "$pid" != "unknown"
-a ! -f "/proc/$pid/status" ]
> -      then
> -        _release_lock $lockdir
> -      fi
> -    fi
> -
> +    mkdir "$lockdir" 2>/dev/null && trap
"_release_lock $lockdir; sigerr" ERR && return
>      if [ $retries -gt $LOCK_SPINNING_RETRIES ]
>      then
>        sleep $LOCK_SLEEPTIME
> -    else
> -      sleep 0
>      fi
>      retries=$(($retries + 1))
>    done
> @@ -91,20 +73,7 @@ _release_lock()
>  _steal_lock()
>  {
>    local lockdir="$1"
> -  local owner=$(cat "$lockdir/owner" 2>/dev/null || echo
"unknown")
> -  log err "Forced to steal lock on $lockdir from $owner!"
> +  log err "Forced to steal lock on $lockdir!"
>    _release_lock "$lockdir"
>    _claim_lock "$lockdir"
>  }
> -
> -
> -_lock_owner()
> -{
> -  cat "$1/owner" 2>/dev/null || echo "unknown"
> -}
> -
> -
> -_update_lock_info()
> -{
> -  echo "$$: $0" >"$1/owner"
> -}
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

Zhigang Wang

2012-Jun-13 08:16 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

On 06/13/2012 04:11 AM, Ian Campbell wrote:> On Mon, 2012-06-11 at 14:59 +0100, Zhigang Wang wrote:
>> This patch removes the ownner support and fixed this issue per my test.
> I think the changelog here needs to go into a bit more detail about why
> it is correct to remove the owner support. e.g. why is it no longer
> needed or why is it bogus. Also why does removing it fix the issue?
> Please describe the logic behind the change not just its impact on your
> test case.Thanks Kouya for the info about Red Hat implement. I also like the flock
implement better.

My plan is to withdraw this patch and test Red Hat implement and if it works
fine, then I will submit the Red Hat patch (as far as it''s GPL).

Thanks,

Zhigang>
>> Kouya: would you please help to confirm this fix is correct?
>>
>> Anyone has encountered this issue please help to test.
>>
>> Signed-off-by: Zhigang Wang <zhigang.x.wang@oracle.com>
>> Cc: Kouya Shimura <kouya@jp.fujitsu.com>
>>
>> diff -r 32034d1914a6 -r eb72d7090b59 tools/hotplug/Linux/locking.sh
>> --- a/tools/hotplug/Linux/locking.sh	Thu Jun 07 19:46:57 2012 +0100
>> +++ b/tools/hotplug/Linux/locking.sh	Mon Jun 11 09:29:43 2012 -0400
>> @@ -48,32 +48,14 @@ sigerr() {
>>  _claim_lock()
>>  {
>>    local lockdir="$1"
>> -  local owner=$(_lock_owner "$lockdir")
>>    local retries=0
>>  
>>    while [ $retries -lt $LOCK_RETRIES ]
>>    do
>> -    mkdir "$lockdir" 2>/dev/null && trap
"_release_lock $lockdir; sigerr" ERR &&
>> -      _update_lock_info "$lockdir" && return
>> -
>> -    local new_owner=$(_lock_owner "$lockdir")
>> -    if [ "$new_owner" != "$owner" ]
>> -    then
>> -      owner="$new_owner"
>> -      retries=0
>> -    else
>> -      local pid=$(echo $owner | cut -d : -f 1)
>> -      if [ -n "$pid" -a "$pid" !=
"unknown" -a ! -f "/proc/$pid/status" ]
>> -      then
>> -        _release_lock $lockdir
>> -      fi
>> -    fi
>> -
>> +    mkdir "$lockdir" 2>/dev/null && trap
"_release_lock $lockdir; sigerr" ERR && return
>>      if [ $retries -gt $LOCK_SPINNING_RETRIES ]
>>      then
>>        sleep $LOCK_SLEEPTIME
>> -    else
>> -      sleep 0
>>      fi
>>      retries=$(($retries + 1))
>>    done
>> @@ -91,20 +73,7 @@ _release_lock()
>>  _steal_lock()
>>  {
>>    local lockdir="$1"
>> -  local owner=$(cat "$lockdir/owner" 2>/dev/null || echo
"unknown")
>> -  log err "Forced to steal lock on $lockdir from $owner!"
>> +  log err "Forced to steal lock on $lockdir!"
>>    _release_lock "$lockdir"
>>    _claim_lock "$lockdir"
>>  }
>> -
>> -
>> -_lock_owner()
>> -{
>> -  cat "$1/owner" 2>/dev/null || echo "unknown"
>> -}
>> -
>> -
>> -_update_lock_info()
>> -{
>> -  echo "$$: $0" >"$1/owner"
>> -}
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel
>

Zhigang Wang

2012-Jun-13 17:29 UTC

head link

[PATCH] tools/hotplug: fix locking

# HG changeset patch
# User Zhigang Wang <zhigang.x.wang@oracle.com>
# Date 1339608534 14400
# Node ID 650b03f412143c3057abbd0ae0e256e6b3fd2ba8
# Parent  32034d1914a607d7b6f1f060352b4cac973600f8
tools/hotplug: fix locking

The current locking implementation would allow two processes get the lock
simultaneously:

++ echo 16741: /etc/xen/scripts/block
++ cut -d : -f 1
+ local pid=16741
+ ''['' -n 16741 -a 16741 ''!='' unknown -a
''!'' -f /proc/16741/status '']''
+ ''['' 5 -gt 5 '']''
+ sleep 0
+ retries=6
+ ''['' 6 -lt 100 '']''
+ mkdir /var/run/xen-hotplug/block
++ _lock_owner /var/run/xen-hotplug/block
++ cat /var/run/xen-hotplug/block/owner
+ local ''new_owner=16741: /etc/xen/scripts/block''
+ ''['' ''16741: /etc/xen/scripts/block''
''!='' ''16741: /etc/xen/scripts/block''
'']''
++ echo 16741: /etc/xen/scripts/block
++ cut -d : -f 1
+ local pid=16741
+ ''['' -n 16741 -a 16741 ''!='' unknown -a
''!'' -f /proc/16741/status '']''
+ ''['' 6 -gt 5 '']''
+ sleep 1
+ do_or_die losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb00001200002b0ef651033c8381.img
+ losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb00001200002b0ef651033c8381.img
+ do_or_die losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb0000120000143173990458f2a7.img
+ losetup -r /dev/loop27
/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb0000120000143173990458f2a7.img
+ xenstore_write backend/vbd/33/51920/node /dev/loop27
+ _xenstore_write backend/vbd/33/51920/node /dev/loop27
+ log debug ''Writing backend/vbd/33/51920/node''
''/dev/loop27 to xenstore.''
+ local level=debug
+ shift
+ logger -p daemon.debug -- /etc/xen/scripts/block: ''Writing
backend/vbd/33/51920/node'' ''/dev/loop27 to xenstore.''
ioctl: LOOP_SET_FD: Device or resource busy
+ xenstore-write backend/vbd/33/51920/node /dev/loop27
+ fatal losetup -r /dev/loop27
''/OVS/Repositories/0004fb00000300000aac184a9bbab7a9/VirtualDisks/0004fb0000120000143173990458f2a7.img
failed''

This patch is ported from Red Hat Enterprise Linux 5.8.

Signed-off-by: Zhigang Wang <zhigang.x.wang@oracle.com>
Cc: Kouya Shimura <kouya@jp.fujitsu.com>

diff -r 32034d1914a6 -r 650b03f41214 tools/hotplug/Linux/locking.sh
--- a/tools/hotplug/Linux/locking.sh	Thu Jun 07 19:46:57 2012 +0100
+++ b/tools/hotplug/Linux/locking.sh	Wed Jun 13 13:28:54 2012 -0400
@@ -1,5 +1,6 @@
 #
 # Copyright (c) 2005 XenSource Ltd.
+# Copyright (c) 2007 Red Hat
 #
 # This library is free software; you can redistribute it and/or
 # modify it under the terms of version 2.1 of the GNU Lesser General Public
@@ -19,92 +20,30 @@
 # Serialisation
 #
 
-LOCK_SLEEPTIME=1
-LOCK_SPINNING_RETRIES=5
-LOCK_RETRIES=100
 LOCK_BASEDIR=/var/run/xen-hotplug
 
+_setlockfd()
+{
+    local i
+    for ((i = 0; i < ${#_lockdict}; i++))
+    do [ -z "${_lockdict[$i]}" -o "${_lockdict[$i]}" =
"$1" ] && break
+    done
+    _lockdict[$i]="$1"
+    let _lockfd=200+i
+}
+
 
 claim_lock()
 {
-  local lockdir="$LOCK_BASEDIR/$1"
-  mkdir -p "$LOCK_BASEDIR"
-  _claim_lock "$lockdir"
+    mkdir -p "$LOCK_BASEDIR"
+    _setlockfd $1
+    eval "exec $_lockfd>>$LOCK_BASEDIR/$1"
+    flock -x $_lockfd
 }
 
 
 release_lock()
 {
-  _release_lock "$LOCK_BASEDIR/$1"
+    _setlockfd $1
+    flock -u $_lockfd
 }
-
-
-# This function will be redefined in xen-hotplug-common.sh.
-sigerr() {
-  exit 1
-}
-
-
-_claim_lock()
-{
-  local lockdir="$1"
-  local owner=$(_lock_owner "$lockdir")
-  local retries=0
-
-  while [ $retries -lt $LOCK_RETRIES ]
-  do
-    mkdir "$lockdir" 2>/dev/null && trap
"_release_lock $lockdir; sigerr" ERR &&
-      _update_lock_info "$lockdir" && return
-
-    local new_owner=$(_lock_owner "$lockdir")
-    if [ "$new_owner" != "$owner" ]
-    then
-      owner="$new_owner"
-      retries=0
-    else
-      local pid=$(echo $owner | cut -d : -f 1)
-      if [ -n "$pid" -a "$pid" != "unknown" -a !
-f "/proc/$pid/status" ]
-      then
-        _release_lock $lockdir
-      fi
-    fi
-
-    if [ $retries -gt $LOCK_SPINNING_RETRIES ]
-    then
-      sleep $LOCK_SLEEPTIME
-    else
-      sleep 0
-    fi
-    retries=$(($retries + 1))
-  done
-  _steal_lock "$lockdir"
-}
-
-
-_release_lock()
-{
-  trap sigerr ERR
-  rm -rf "$1" 2>/dev/null || true
-}
-
-
-_steal_lock()
-{
-  local lockdir="$1"
-  local owner=$(cat "$lockdir/owner" 2>/dev/null || echo
"unknown")
-  log err "Forced to steal lock on $lockdir from $owner!"
-  _release_lock "$lockdir"
-  _claim_lock "$lockdir"
-}
-
-
-_lock_owner()
-{
-  cat "$1/owner" 2>/dev/null || echo "unknown"
-}
-
-
-_update_lock_info()
-{
-  echo "$$: $0" >"$1/owner"
-}

Ian Campbell

2012-Jun-20 15:34 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

On Wed, 2012-06-13 at 18:29 +0100, Zhigang Wang wrote:> # HG changeset patch
> # User Zhigang Wang <zhigang.x.wang@oracle.com>
> # Date 1339608534 14400
> # Node ID 650b03f412143c3057abbd0ae0e256e6b3fd2ba8
> # Parent  32034d1914a607d7b6f1f060352b4cac973600f8
> tools/hotplug: fix locking
A better title would be "tools/hotplug: Switch to locking with flock"
since "fix" is not very descriptive.

The commit message should also state why changing this scheme is
preferable to fixing the current one.

Is this flock tool widely available? Does it need to be added to the
dependencies in the README?
> The current locking implementation would allow two processes get the lock
> simultaneously:
[...snip shell trace...]

Can you add a line or two of analysis to explain where in that shell
spew things are actually going wrong and why?
> This patch is ported from Red Hat Enterprise Linux 5.8.
I think we need a Signed-off-by from the original patch author. (Unless
DCO clause b somehow applies?)


Thanks,
Ian.
> Signed-off-by: Zhigang Wang <zhigang.x.wang@oracle.com>
> Cc: Kouya Shimura <kouya@jp.fujitsu.com>
> 
> diff -r 32034d1914a6 -r 650b03f41214 tools/hotplug/Linux/locking.sh
> --- a/tools/hotplug/Linux/locking.sh	Thu Jun 07 19:46:57 2012 +0100
> +++ b/tools/hotplug/Linux/locking.sh	Wed Jun 13 13:28:54 2012 -0400
> @@ -1,5 +1,6 @@
>  #
>  # Copyright (c) 2005 XenSource Ltd.
> +# Copyright (c) 2007 Red Hat
>  #
>  # This library is free software; you can redistribute it and/or
>  # modify it under the terms of version 2.1 of the GNU Lesser General
Public
> @@ -19,92 +20,30 @@
>  # Serialisation
>  #
>  
> -LOCK_SLEEPTIME=1
> -LOCK_SPINNING_RETRIES=5
> -LOCK_RETRIES=100
>  LOCK_BASEDIR=/var/run/xen-hotplug
>  
> +_setlockfd()
> +{
> +    local i
> +    for ((i = 0; i < ${#_lockdict}; i++))
> +    do [ -z "${_lockdict[$i]}" -o "${_lockdict[$i]}" =
"$1" ] && break
> +    done
> +    _lockdict[$i]="$1"
> +    let _lockfd=200+i
> +}
> +
>  
>  claim_lock()
>  {
> -  local lockdir="$LOCK_BASEDIR/$1"
> -  mkdir -p "$LOCK_BASEDIR"
> -  _claim_lock "$lockdir"
> +    mkdir -p "$LOCK_BASEDIR"
> +    _setlockfd $1
> +    eval "exec $_lockfd>>$LOCK_BASEDIR/$1"
> +    flock -x $_lockfd
>  }
>  
> 
>  release_lock()
>  {
> -  _release_lock "$LOCK_BASEDIR/$1"
> +    _setlockfd $1
> +    flock -u $_lockfd
>  }
> -
> -
> -# This function will be redefined in xen-hotplug-common.sh.
> -sigerr() {
> -  exit 1
> -}
> -
> -
> -_claim_lock()
> -{
> -  local lockdir="$1"
> -  local owner=$(_lock_owner "$lockdir")
> -  local retries=0
> -
> -  while [ $retries -lt $LOCK_RETRIES ]
> -  do
> -    mkdir "$lockdir" 2>/dev/null && trap
"_release_lock $lockdir; sigerr" ERR &&
> -      _update_lock_info "$lockdir" && return
> -
> -    local new_owner=$(_lock_owner "$lockdir")
> -    if [ "$new_owner" != "$owner" ]
> -    then
> -      owner="$new_owner"
> -      retries=0
> -    else
> -      local pid=$(echo $owner | cut -d : -f 1)
> -      if [ -n "$pid" -a "$pid" != "unknown"
-a ! -f "/proc/$pid/status" ]
> -      then
> -        _release_lock $lockdir
> -      fi
> -    fi
> -
> -    if [ $retries -gt $LOCK_SPINNING_RETRIES ]
> -    then
> -      sleep $LOCK_SLEEPTIME
> -    else
> -      sleep 0
> -    fi
> -    retries=$(($retries + 1))
> -  done
> -  _steal_lock "$lockdir"
> -}
> -
> -
> -_release_lock()
> -{
> -  trap sigerr ERR
> -  rm -rf "$1" 2>/dev/null || true
> -}
> -
> -
> -_steal_lock()
> -{
> -  local lockdir="$1"
> -  local owner=$(cat "$lockdir/owner" 2>/dev/null || echo
"unknown")
> -  log err "Forced to steal lock on $lockdir from $owner!"
> -  _release_lock "$lockdir"
> -  _claim_lock "$lockdir"
> -}
> -
> -
> -_lock_owner()
> -{
> -  cat "$1/owner" 2>/dev/null || echo "unknown"
> -}
> -
> -
> -_update_lock_info()
> -{
> -  echo "$$: $0" >"$1/owner"
> -}
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

Zhigang Wang

2012-Jun-20 17:53 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

On 06/20/2012 11:34 AM, Ian Campbell wrote:> On Wed, 2012-06-13 at 18:29 +0100, Zhigang Wang wrote:
>> # HG changeset patch
>> # User Zhigang Wang <zhigang.x.wang@oracle.com>
>> # Date 1339608534 14400
>> # Node ID 650b03f412143c3057abbd0ae0e256e6b3fd2ba8
>> # Parent  32034d1914a607d7b6f1f060352b4cac973600f8
>> tools/hotplug: fix locking
> A better title would be "tools/hotplug: Switch to locking with
flock"
> since "fix" is not very descriptive.
Agree.>
> The commit message should also state why changing this scheme is
> preferable to fixing the current one.I have two points:

1. No timeout: in the old one, if one process holding the lock more than 100
   seconds, other processes could steal it.
2. No leftovers: if a process holding this lock is dead, it will close the lock
   file, so it will not affect other processes.
>
> Is this flock tool widely available? Does it need to be added to the
> dependencies in the README?It is widely distributed: it''s part of the util-linux package. So I
think no
need to document it.>
>> The current locking implementation would allow two processes get the
lock
>> simultaneously:
> [...snip shell trace...]
>
> Can you add a line or two of analysis to explain where in that shell
> spew things are actually going wrong and why?I didn''t spent much time on this complicated implement. But it fails
for me and
also for others (from response).>
>> This patch is ported from Red Hat Enterprise Linux 5.8.
> I think we need a Signed-off-by from the original patch author. (Unless
> DCO clause b somehow applies?)I''m not sure how to handle this. There''s no signed-off-by on
the original Red
Hat patch. Could anyone in Red Hat help to get it signed off?

Thanks,

Zhigang>
>
> Thanks,
> Ian.
>
>> Signed-off-by: Zhigang Wang <zhigang.x.wang@oracle.com>
>> Cc: Kouya Shimura <kouya@jp.fujitsu.com>
>>
>> diff -r 32034d1914a6 -r 650b03f41214 tools/hotplug/Linux/locking.sh
>> --- a/tools/hotplug/Linux/locking.sh	Thu Jun 07 19:46:57 2012 +0100
>> +++ b/tools/hotplug/Linux/locking.sh	Wed Jun 13 13:28:54 2012 -0400
>> @@ -1,5 +1,6 @@
>>  #
>>  # Copyright (c) 2005 XenSource Ltd.
>> +# Copyright (c) 2007 Red Hat
>>  #
>>  # This library is free software; you can redistribute it and/or
>>  # modify it under the terms of version 2.1 of the GNU Lesser General
Public
>> @@ -19,92 +20,30 @@
>>  # Serialisation
>>  #
>>  
>> -LOCK_SLEEPTIME=1
>> -LOCK_SPINNING_RETRIES=5
>> -LOCK_RETRIES=100
>>  LOCK_BASEDIR=/var/run/xen-hotplug
>>  
>> +_setlockfd()
>> +{
>> +    local i
>> +    for ((i = 0; i < ${#_lockdict}; i++))
>> +    do [ -z "${_lockdict[$i]}" -o
"${_lockdict[$i]}" = "$1" ] && break
>> +    done
>> +    _lockdict[$i]="$1"
>> +    let _lockfd=200+i
>> +}
>> +
>>  
>>  claim_lock()
>>  {
>> -  local lockdir="$LOCK_BASEDIR/$1"
>> -  mkdir -p "$LOCK_BASEDIR"
>> -  _claim_lock "$lockdir"
>> +    mkdir -p "$LOCK_BASEDIR"
>> +    _setlockfd $1
>> +    eval "exec $_lockfd>>$LOCK_BASEDIR/$1"
>> +    flock -x $_lockfd
>>  }
>>  
>>
>>  release_lock()
>>  {
>> -  _release_lock "$LOCK_BASEDIR/$1"
>> +    _setlockfd $1
>> +    flock -u $_lockfd
>>  }
>> -
>> -
>> -# This function will be redefined in xen-hotplug-common.sh.
>> -sigerr() {
>> -  exit 1
>> -}
>> -
>> -
>> -_claim_lock()
>> -{
>> -  local lockdir="$1"
>> -  local owner=$(_lock_owner "$lockdir")
>> -  local retries=0
>> -
>> -  while [ $retries -lt $LOCK_RETRIES ]
>> -  do
>> -    mkdir "$lockdir" 2>/dev/null && trap
"_release_lock $lockdir; sigerr" ERR &&
>> -      _update_lock_info "$lockdir" && return
>> -
>> -    local new_owner=$(_lock_owner "$lockdir")
>> -    if [ "$new_owner" != "$owner" ]
>> -    then
>> -      owner="$new_owner"
>> -      retries=0
>> -    else
>> -      local pid=$(echo $owner | cut -d : -f 1)
>> -      if [ -n "$pid" -a "$pid" !=
"unknown" -a ! -f "/proc/$pid/status" ]
>> -      then
>> -        _release_lock $lockdir
>> -      fi
>> -    fi
>> -
>> -    if [ $retries -gt $LOCK_SPINNING_RETRIES ]
>> -    then
>> -      sleep $LOCK_SLEEPTIME
>> -    else
>> -      sleep 0
>> -    fi
>> -    retries=$(($retries + 1))
>> -  done
>> -  _steal_lock "$lockdir"
>> -}
>> -
>> -
>> -_release_lock()
>> -{
>> -  trap sigerr ERR
>> -  rm -rf "$1" 2>/dev/null || true
>> -}
>> -
>> -
>> -_steal_lock()
>> -{
>> -  local lockdir="$1"
>> -  local owner=$(cat "$lockdir/owner" 2>/dev/null || echo
"unknown")
>> -  log err "Forced to steal lock on $lockdir from $owner!"
>> -  _release_lock "$lockdir"
>> -  _claim_lock "$lockdir"
>> -}
>> -
>> -
>> -_lock_owner()
>> -{
>> -  cat "$1/owner" 2>/dev/null || echo "unknown"
>> -}
>> -
>> -
>> -_update_lock_info()
>> -{
>> -  echo "$$: $0" >"$1/owner"
>> -}
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

Ian Campbell

2012-Jun-21 11:42 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

On Wed, 2012-06-20 at 18:53 +0100, Zhigang Wang wrote:> On 06/20/2012 11:34 AM, Ian Campbell wrote:
> > On Wed, 2012-06-13 at 18:29 +0100, Zhigang Wang wrote:
> >> # HG changeset patch
> >> # User Zhigang Wang <zhigang.x.wang@oracle.com>
> >> # Date 1339608534 14400
> >> # Node ID 650b03f412143c3057abbd0ae0e256e6b3fd2ba8
> >> # Parent  32034d1914a607d7b6f1f060352b4cac973600f8
> >> tools/hotplug: fix locking
> > A better title would be "tools/hotplug: Switch to locking with
flock"
> > since "fix" is not very descriptive.
> Agree.
> >
> > The commit message should also state why changing this scheme is
> > preferable to fixing the current one.
> I have two points:
> 
> 1. No timeout: in the old one, if one process holding the lock more than
100
>    seconds, other processes could steal it.
> 2. No leftovers: if a process holding this lock is dead, it will close the
lock
>    file, so it will not affect other processes.
> 
> >
> > Is this flock tool widely available? Does it need to be added to the
> > dependencies in the README?
> It is widely distributed: it''s part of the util-linux package. So
I think no
> need to document it.
> >
> >> The current locking implementation would allow two processes get
the lock
> >> simultaneously:
> > [...snip shell trace...]
> >
> > Can you add a line or two of analysis to explain where in that shell
> > spew things are actually going wrong and why?
> I didn''t spent much time on this complicated implement. But it
fails for me and
> also for others (from response).
> >
> >> This patch is ported from Red Hat Enterprise Linux 5.8.
> > I think we need a Signed-off-by from the original patch author.
(Unless
> > DCO clause b somehow applies?)
> I''m not sure how to handle this. There''s no signed-off-by
on the original Red
> Hat patch. Could anyone in Red Hat help to get it signed off?
Perhaps Andrew Jones or Don Dutile can point us in the right direction
(both CCd) if you identify the precise source of the patch (i.e. the
name of the .src.rpm and the name of the patch within it)?

I looked in kernel-2.6.18-308.8.2.el5.src.rpm (which seems fairly
recent) but maybe I''m looking in the wrong place.

Ian.> 
> Thanks,
> 
> Zhigang
> >
> >
> > Thanks,
> > Ian.
> >
> >> Signed-off-by: Zhigang Wang <zhigang.x.wang@oracle.com>
> >> Cc: Kouya Shimura <kouya@jp.fujitsu.com>
> >>
> >> diff -r 32034d1914a6 -r 650b03f41214
tools/hotplug/Linux/locking.sh
> >> --- a/tools/hotplug/Linux/locking.sh	Thu Jun 07 19:46:57 2012
+0100
> >> +++ b/tools/hotplug/Linux/locking.sh	Wed Jun 13 13:28:54 2012
-0400
> >> @@ -1,5 +1,6 @@
> >>  #
> >>  # Copyright (c) 2005 XenSource Ltd.
> >> +# Copyright (c) 2007 Red Hat
> >>  #
> >>  # This library is free software; you can redistribute it and/or
> >>  # modify it under the terms of version 2.1 of the GNU Lesser
General Public
> >> @@ -19,92 +20,30 @@
> >>  # Serialisation
> >>  #
> >>  
> >> -LOCK_SLEEPTIME=1
> >> -LOCK_SPINNING_RETRIES=5
> >> -LOCK_RETRIES=100
> >>  LOCK_BASEDIR=/var/run/xen-hotplug
> >>  
> >> +_setlockfd()
> >> +{
> >> +    local i
> >> +    for ((i = 0; i < ${#_lockdict}; i++))
> >> +    do [ -z "${_lockdict[$i]}" -o
"${_lockdict[$i]}" = "$1" ] && break
> >> +    done
> >> +    _lockdict[$i]="$1"
> >> +    let _lockfd=200+i
> >> +}
> >> +
> >>  
> >>  claim_lock()
> >>  {
> >> -  local lockdir="$LOCK_BASEDIR/$1"
> >> -  mkdir -p "$LOCK_BASEDIR"
> >> -  _claim_lock "$lockdir"
> >> +    mkdir -p "$LOCK_BASEDIR"
> >> +    _setlockfd $1
> >> +    eval "exec $_lockfd>>$LOCK_BASEDIR/$1"
> >> +    flock -x $_lockfd
> >>  }
> >>  
> >>
> >>  release_lock()
> >>  {
> >> -  _release_lock "$LOCK_BASEDIR/$1"
> >> +    _setlockfd $1
> >> +    flock -u $_lockfd
> >>  }
> >> -
> >> -
> >> -# This function will be redefined in xen-hotplug-common.sh.
> >> -sigerr() {
> >> -  exit 1
> >> -}
> >> -
> >> -
> >> -_claim_lock()
> >> -{
> >> -  local lockdir="$1"
> >> -  local owner=$(_lock_owner "$lockdir")
> >> -  local retries=0
> >> -
> >> -  while [ $retries -lt $LOCK_RETRIES ]
> >> -  do
> >> -    mkdir "$lockdir" 2>/dev/null && trap
"_release_lock $lockdir; sigerr" ERR &&
> >> -      _update_lock_info "$lockdir" && return
> >> -
> >> -    local new_owner=$(_lock_owner "$lockdir")
> >> -    if [ "$new_owner" != "$owner" ]
> >> -    then
> >> -      owner="$new_owner"
> >> -      retries=0
> >> -    else
> >> -      local pid=$(echo $owner | cut -d : -f 1)
> >> -      if [ -n "$pid" -a "$pid" !=
"unknown" -a ! -f "/proc/$pid/status" ]
> >> -      then
> >> -        _release_lock $lockdir
> >> -      fi
> >> -    fi
> >> -
> >> -    if [ $retries -gt $LOCK_SPINNING_RETRIES ]
> >> -    then
> >> -      sleep $LOCK_SLEEPTIME
> >> -    else
> >> -      sleep 0
> >> -    fi
> >> -    retries=$(($retries + 1))
> >> -  done
> >> -  _steal_lock "$lockdir"
> >> -}
> >> -
> >> -
> >> -_release_lock()
> >> -{
> >> -  trap sigerr ERR
> >> -  rm -rf "$1" 2>/dev/null || true
> >> -}
> >> -
> >> -
> >> -_steal_lock()
> >> -{
> >> -  local lockdir="$1"
> >> -  local owner=$(cat "$lockdir/owner" 2>/dev/null ||
echo "unknown")
> >> -  log err "Forced to steal lock on $lockdir from
$owner!"
> >> -  _release_lock "$lockdir"
> >> -  _claim_lock "$lockdir"
> >> -}
> >> -
> >> -
> >> -_lock_owner()
> >> -{
> >> -  cat "$1/owner" 2>/dev/null || echo
"unknown"
> >> -}
> >> -
> >> -
> >> -_update_lock_info()
> >> -{
> >> -  echo "$$: $0" >"$1/owner"
> >> -}
> >>
> >> _______________________________________________
> >> Xen-devel mailing list
> >> Xen-devel@lists.xen.org
> >> http://lists.xen.org/xen-devel
> >
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
>

Ian Campbell

2012-Jun-21 11:49 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

On Thu, 2012-06-21 at 12:42 +0100, Ian Campbell wrote:> On Wed, 2012-06-20 at 18:53 +0100, Zhigang Wang wrote:
> > On 06/20/2012 11:34 AM, Ian Campbell wrote:
> > > On Wed, 2012-06-13 at 18:29 +0100, Zhigang Wang wrote:
> > >> # HG changeset patch
> > >> # User Zhigang Wang <zhigang.x.wang@oracle.com>
> > >> # Date 1339608534 14400
> > >> # Node ID 650b03f412143c3057abbd0ae0e256e6b3fd2ba8
> > >> # Parent  32034d1914a607d7b6f1f060352b4cac973600f8
> > >> tools/hotplug: fix locking
> > > A better title would be "tools/hotplug: Switch to locking
with flock"
> > > since "fix" is not very descriptive.
> > Agree.
> > >
> > > The commit message should also state why changing this scheme is
> > > preferable to fixing the current one.
> > I have two points:
> > 
> > 1. No timeout: in the old one, if one process holding the lock more
than 100
> >    seconds, other processes could steal it.
> > 2. No leftovers: if a process holding this lock is dead, it will close
the lock
> >    file, so it will not affect other processes.
> > 
> > >
> > > Is this flock tool widely available? Does it need to be added to
the
> > > dependencies in the README?
> > It is widely distributed: it''s part of the util-linux
package. So I think no
> > need to document it.
> > >
> > >> The current locking implementation would allow two processes
get the lock
> > >> simultaneously:
> > > [...snip shell trace...]
> > >
> > > Can you add a line or two of analysis to explain where in that
shell
> > > spew things are actually going wrong and why?
> > I didn''t spent much time on this complicated implement. But
it fails for me and
> > also for others (from response).
> > >
> > >> This patch is ported from Red Hat Enterprise Linux 5.8.
> > > I think we need a Signed-off-by from the original patch author.
(Unless
> > > DCO clause b somehow applies?)
> > I''m not sure how to handle this. There''s no
signed-off-by on the original Red
> > Hat patch. Could anyone in Red Hat help to get it signed off?
> 
> Perhaps Andrew Jones or Don Dutile can point us in the right direction
> (both CCd) if you identify the precise source of the patch (i.e. the
> name of the .src.rpm and the name of the patch within it)?
> 
> I looked in kernel-2.6.18-308.8.2.el5.src.rpm (which seems fairly
> recent) but maybe I''m looking in the wrong place.
Nevermind, I found it in xen-3.0.3-135.el5_8.2.src.rpm (I''d have sworn
that RH shipped kernel+xen in a single source package, oh well).

%changelog says:
        * Fri Sep 14 2007 Daniel P. Berrange <berrange@redhat.com> -
3.0.3-40.el5
        - Rewrite locking in hotplug scripts to fix timeouts (rhbz #267861)

Daniel CCd.

I wonder if there were any followup fixes or changes, especially wrt to
your point 1 above (the lack of a timeout now) requiring xend (or now
libxl) changes?
> 
> Ian.
> > 
> > Thanks,
> > 
> > Zhigang
> > >
> > >
> > > Thanks,
> > > Ian.
> > >
> > >> Signed-off-by: Zhigang Wang <zhigang.x.wang@oracle.com>
> > >> Cc: Kouya Shimura <kouya@jp.fujitsu.com>
> > >>
> > >> diff -r 32034d1914a6 -r 650b03f41214
tools/hotplug/Linux/locking.sh
> > >> --- a/tools/hotplug/Linux/locking.sh	Thu Jun 07 19:46:57 2012
+0100
> > >> +++ b/tools/hotplug/Linux/locking.sh	Wed Jun 13 13:28:54 2012
-0400
> > >> @@ -1,5 +1,6 @@
> > >>  #
> > >>  # Copyright (c) 2005 XenSource Ltd.
> > >> +# Copyright (c) 2007 Red Hat
> > >>  #
> > >>  # This library is free software; you can redistribute it
and/or
> > >>  # modify it under the terms of version 2.1 of the GNU Lesser
General Public
> > >> @@ -19,92 +20,30 @@
> > >>  # Serialisation
> > >>  #
> > >>  
> > >> -LOCK_SLEEPTIME=1
> > >> -LOCK_SPINNING_RETRIES=5
> > >> -LOCK_RETRIES=100
> > >>  LOCK_BASEDIR=/var/run/xen-hotplug
> > >>  
> > >> +_setlockfd()
> > >> +{
> > >> +    local i
> > >> +    for ((i = 0; i < ${#_lockdict}; i++))
> > >> +    do [ -z "${_lockdict[$i]}" -o
"${_lockdict[$i]}" = "$1" ] && break
> > >> +    done
> > >> +    _lockdict[$i]="$1"
> > >> +    let _lockfd=200+i
> > >> +}
> > >> +
> > >>  
> > >>  claim_lock()
> > >>  {
> > >> -  local lockdir="$LOCK_BASEDIR/$1"
> > >> -  mkdir -p "$LOCK_BASEDIR"
> > >> -  _claim_lock "$lockdir"
> > >> +    mkdir -p "$LOCK_BASEDIR"
> > >> +    _setlockfd $1
> > >> +    eval "exec $_lockfd>>$LOCK_BASEDIR/$1"
> > >> +    flock -x $_lockfd
> > >>  }
> > >>  
> > >>
> > >>  release_lock()
> > >>  {
> > >> -  _release_lock "$LOCK_BASEDIR/$1"
> > >> +    _setlockfd $1
> > >> +    flock -u $_lockfd
> > >>  }
> > >> -
> > >> -
> > >> -# This function will be redefined in xen-hotplug-common.sh.
> > >> -sigerr() {
> > >> -  exit 1
> > >> -}
> > >> -
> > >> -
> > >> -_claim_lock()
> > >> -{
> > >> -  local lockdir="$1"
> > >> -  local owner=$(_lock_owner "$lockdir")
> > >> -  local retries=0
> > >> -
> > >> -  while [ $retries -lt $LOCK_RETRIES ]
> > >> -  do
> > >> -    mkdir "$lockdir" 2>/dev/null &&
trap "_release_lock $lockdir; sigerr" ERR &&
> > >> -      _update_lock_info "$lockdir" &&
return
> > >> -
> > >> -    local new_owner=$(_lock_owner "$lockdir")
> > >> -    if [ "$new_owner" != "$owner" ]
> > >> -    then
> > >> -      owner="$new_owner"
> > >> -      retries=0
> > >> -    else
> > >> -      local pid=$(echo $owner | cut -d : -f 1)
> > >> -      if [ -n "$pid" -a "$pid" !=
"unknown" -a ! -f "/proc/$pid/status" ]
> > >> -      then
> > >> -        _release_lock $lockdir
> > >> -      fi
> > >> -    fi
> > >> -
> > >> -    if [ $retries -gt $LOCK_SPINNING_RETRIES ]
> > >> -    then
> > >> -      sleep $LOCK_SLEEPTIME
> > >> -    else
> > >> -      sleep 0
> > >> -    fi
> > >> -    retries=$(($retries + 1))
> > >> -  done
> > >> -  _steal_lock "$lockdir"
> > >> -}
> > >> -
> > >> -
> > >> -_release_lock()
> > >> -{
> > >> -  trap sigerr ERR
> > >> -  rm -rf "$1" 2>/dev/null || true
> > >> -}
> > >> -
> > >> -
> > >> -_steal_lock()
> > >> -{
> > >> -  local lockdir="$1"
> > >> -  local owner=$(cat "$lockdir/owner"
2>/dev/null || echo "unknown")
> > >> -  log err "Forced to steal lock on $lockdir from
$owner!"
> > >> -  _release_lock "$lockdir"
> > >> -  _claim_lock "$lockdir"
> > >> -}
> > >> -
> > >> -
> > >> -_lock_owner()
> > >> -{
> > >> -  cat "$1/owner" 2>/dev/null || echo
"unknown"
> > >> -}
> > >> -
> > >> -
> > >> -_update_lock_info()
> > >> -{
> > >> -  echo "$$: $0" >"$1/owner"
> > >> -}
> > >>
> > >> _______________________________________________
> > >> Xen-devel mailing list
> > >> Xen-devel@lists.xen.org
> > >> http://lists.xen.org/xen-devel
> > >
> > >
> > > _______________________________________________
> > > Xen-devel mailing list
> > > Xen-devel@lists.xen.org
> > > http://lists.xen.org/xen-devel
> > 
>

Zhigang Wang

2012-Jun-21 12:08 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

On 06/21/2012 07:49 AM, Ian Campbell wrote:> On Thu, 2012-06-21 at 12:42 +0100, Ian Campbell wrote:
>> On Wed, 2012-06-20 at 18:53 +0100, Zhigang Wang wrote:
>>> On 06/20/2012 11:34 AM, Ian Campbell wrote:
>>>> On Wed, 2012-06-13 at 18:29 +0100, Zhigang Wang wrote:
>>>>> # HG changeset patch
>>>>> # User Zhigang Wang <zhigang.x.wang@oracle.com>
>>>>> # Date 1339608534 14400
>>>>> # Node ID 650b03f412143c3057abbd0ae0e256e6b3fd2ba8
>>>>> # Parent  32034d1914a607d7b6f1f060352b4cac973600f8
>>>>> tools/hotplug: fix locking
>>>> A better title would be "tools/hotplug: Switch to locking
with flock"
>>>> since "fix" is not very descriptive.
>>> Agree.
>>>> The commit message should also state why changing this scheme
is
>>>> preferable to fixing the current one.
>>> I have two points:
>>>
>>> 1. No timeout: in the old one, if one process holding the lock more
than 100
>>>    seconds, other processes could steal it.
>>> 2. No leftovers: if a process holding this lock is dead, it will
close the lock
>>>    file, so it will not affect other processes.
>>>
>>>> Is this flock tool widely available? Does it need to be added
to the
>>>> dependencies in the README?
>>> It is widely distributed: it''s part of the util-linux
package. So I think no
>>> need to document it.
>>>>> The current locking implementation would allow two
processes get the lock
>>>>> simultaneously:
>>>> [...snip shell trace...]
>>>>
>>>> Can you add a line or two of analysis to explain where in that
shell
>>>> spew things are actually going wrong and why?
>>> I didn''t spent much time on this complicated implement.
But it fails for me and
>>> also for others (from response).
>>>>> This patch is ported from Red Hat Enterprise Linux 5.8.
>>>> I think we need a Signed-off-by from the original patch author.
(Unless
>>>> DCO clause b somehow applies?)
>>> I''m not sure how to handle this. There''s no
signed-off-by on the original Red
>>> Hat patch. Could anyone in Red Hat help to get it signed off?
>> Perhaps Andrew Jones or Don Dutile can point us in the right direction
>> (both CCd) if you identify the precise source of the patch (i.e. the
>> name of the .src.rpm and the name of the patch within it)?
>>
>> I looked in kernel-2.6.18-308.8.2.el5.src.rpm (which seems fairly
>> recent) but maybe I''m looking in the wrong place.
> Nevermind, I found it in xen-3.0.3-135.el5_8.2.src.rpm (I''d have
sworn
> that RH shipped kernel+xen in a single source package, oh well).
>
> %changelog says:
>         * Fri Sep 14 2007 Daniel P. Berrange <berrange@redhat.com> -
3.0.3-40.el5
>         - Rewrite locking in hotplug scripts to fix timeouts (rhbz #267861)
>
> Daniel CCd.
>
> I wonder if there were any followup fixes or changes, especially wrt to
> your point 1 above (the lack of a timeout now) requiring xend (or now
> libxl) changes?
We have another timeout in xend/xl: if device backend is not setup properly, the
device creation will fail.

Without any timeout for this locking, all backend uevent scripts will hang if
one goes wrong. Just stealing the lock upon timeoutis not a good idea: it may
cause error or damage. What about adding a long (10minutes) timeout? (by adding
-w/--wait/--timeout 600)

Thanks,

Zhigang

Daniel P. Berrange

2012-Jun-21 12:20 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

On Thu, Jun 21, 2012 at 12:49:24PM +0100, Ian Campbell
wrote:> On Thu, 2012-06-21 at 12:42 +0100, Ian Campbell wrote:
> > On Wed, 2012-06-20 at 18:53 +0100, Zhigang Wang wrote:
> > > On 06/20/2012 11:34 AM, Ian Campbell wrote:
> > > > On Wed, 2012-06-13 at 18:29 +0100, Zhigang Wang wrote:
> > > >> # HG changeset patch
> > > >> # User Zhigang Wang <zhigang.x.wang@oracle.com>
> > > >> # Date 1339608534 14400
> > > >> # Node ID 650b03f412143c3057abbd0ae0e256e6b3fd2ba8
> > > >> # Parent  32034d1914a607d7b6f1f060352b4cac973600f8
> > > >> tools/hotplug: fix locking
> > > > A better title would be "tools/hotplug: Switch to
locking with flock"
> > > > since "fix" is not very descriptive.
> > > Agree.
> > > >
> > > > The commit message should also state why changing this
scheme is
> > > > preferable to fixing the current one.
> > > I have two points:
> > > 
> > > 1. No timeout: in the old one, if one process holding the lock
more than 100
> > >    seconds, other processes could steal it.
> > > 2. No leftovers: if a process holding this lock is dead, it will
close the lock
> > >    file, so it will not affect other processes.
> > > 
> > > >
> > > > Is this flock tool widely available? Does it need to be
added to the
> > > > dependencies in the README?
> > > It is widely distributed: it''s part of the util-linux
package. So I think no
> > > need to document it.
> > > >
> > > >> The current locking implementation would allow two
processes get the lock
> > > >> simultaneously:
> > > > [...snip shell trace...]
> > > >
> > > > Can you add a line or two of analysis to explain where in
that shell
> > > > spew things are actually going wrong and why?
> > > I didn''t spent much time on this complicated implement.
But it fails for me and
> > > also for others (from response).
> > > >
> > > >> This patch is ported from Red Hat Enterprise Linux 5.8.
> > > > I think we need a Signed-off-by from the original patch
author. (Unless
> > > > DCO clause b somehow applies?)
> > > I''m not sure how to handle this. There''s no
signed-off-by on the original Red
> > > Hat patch. Could anyone in Red Hat help to get it signed off?
> > 
> > Perhaps Andrew Jones or Don Dutile can point us in the right direction
> > (both CCd) if you identify the precise source of the patch (i.e. the
> > name of the .src.rpm and the name of the patch within it)?
> > 
> > I looked in kernel-2.6.18-308.8.2.el5.src.rpm (which seems fairly
> > recent) but maybe I''m looking in the wrong place.
> 
> Nevermind, I found it in xen-3.0.3-135.el5_8.2.src.rpm (I''d have
sworn
> that RH shipped kernel+xen in a single source package, oh well).
> 
> %changelog says:
>         * Fri Sep 14 2007 Daniel P. Berrange <berrange@redhat.com> -
3.0.3-40.el5
>         - Rewrite locking in hotplug scripts to fix timeouts (rhbz #267861)
> 
> Daniel CCd.
> 
> I wonder if there were any followup fixes or changes, especially wrt to
> your point 1 above (the lack of a timeout now) requiring xend (or now
> libxl) changes?
Removing any kind of timeout was in fact the point of this change.
We do not want hotplug scripts stealing each others locks, or
timing out, because the timeouts cause alot of problems when
launching guests on a highly loaded machine where execution time
can be alot longer than a timeout setting may expect.

I expect you can''t see the original BZ ticket quoted, since it
has customer information in it.  Here is my description of
what the problem we weere solving was:

[quote]
In the normal case of a single domain booting with one disk, the disk hotplug
script will fire once. In this case taking out the lock will never cause a sleep
because there''s no other concurrent invocations. If a domain has 4
disks
configured, then the hotplug script will fire 4 times, all 4 invocations at
pretty much the same time. If there is even a little load on the system, the
locking function in the shell script will sleep for a few seconds - as many as 5
seconds, or potentially even time out & fail completely.

If say 4 or even more domains each with 4 disks start up at once,
that''s 16
invocations of the hotplug script running at once. There will be alot of
sleep''s
done & because of the very coarse 1 second granularity the delay can add up
significantly. 

The change to using flock() removes the arbitrary 1 second sleep, so the very
instant once hotplug script finishes, another can start & there is no
repeated
attempts & failures to lock which would add more delay.
[/quote]

Usually it was our policy to send all these kind of patches upstream,
so I''m not really sure why this was not already merged. Possibly I
forgot to submit it, or maybe it was rejected - I honestly can''t
remember.

Below is the original patch I wrote, to which I apply:

  Signed-off-by: Daniel P. Berrange.

diff -rup xen-3.1.0-src.orig/tools/examples/locking.sh
xen-3.1.0-src.new/tools/examples/locking.sh
--- xen-3.1.0-src.orig/tools/examples/locking.sh	2006-10-15 08:22:03.000000000
-0400
+++ xen-3.1.0-src.new/tools/examples/locking.sh	2007-09-12 11:25:20.000000000
-0400
@@ -1,5 +1,6 @@
 #
 # Copyright (c) 2005 XenSource Ltd.
+# Copyright (c) 2007 Red Hat
 #
 # This library is free software; you can redistribute it and/or
 # modify it under the terms of version 2.1 of the GNU Lesser General Public
@@ -19,80 +20,30 @@
 # Serialisation
 #
 
-LOCK_SLEEPTIME=1
-LOCK_SPINNING_RETRIES=5
-LOCK_RETRIES=100
 LOCK_BASEDIR=/var/run/xen-hotplug
 
-
-claim_lock()
-{
-  local lockdir="$LOCK_BASEDIR/$1"
-  mkdir -p "$LOCK_BASEDIR"
-  _claim_lock "$lockdir"
-}
-
-
-release_lock()
-{
-  _release_lock "$LOCK_BASEDIR/$1"
-}
-
-
-_claim_lock()
+_setlockfd()
 {
-  local lockdir="$1"
-  local owner=$(_lock_owner "$lockdir")
-  local retries=0
-
-  while [ $retries -lt $LOCK_RETRIES ]
-  do
-    mkdir "$lockdir" 2>/dev/null && trap
"release_lock $1; sigerr" ERR &&
-      _update_lock_info "$lockdir" && return
-
-    local new_owner=$(_lock_owner "$lockdir")
-    if [ "$new_owner" != "$owner" ]
-    then
-      owner="$new_owner"
-      retries=0
-    fi
-
-    if [ $retries -gt $LOCK_SPINNING_RETRIES ]
-    then
-      sleep $LOCK_SLEEPTIME
-    else
-      sleep 0
-    fi
-    retries=$(($retries + 1))
-  done
-  _steal_lock "$lockdir"
+    local i
+    for ((i = 0; i < ${#_lockdict}; i++))
+    do [ -z "${_lockdict[$i]}" -o "${_lockdict[$i]}" =
"$1" ] && break
+    done
+    _lockdict[$i]="$1"
+    let _lockfd=200+i
 }
 
 
-_release_lock()
-{
-  trap sigerr ERR
-  rm -rf "$1" 2>/dev/null || true
-}
-
-
-_steal_lock()
-{
-  local lockdir="$1"
-  local owner=$(cat "$lockdir/owner" 2>/dev/null || echo
"unknown")
-  log err "Forced to steal lock on $lockdir from $owner!"
-  _release_lock "$lockdir"
-  _claim_lock "$lockdir"
-}
-
-
-_lock_owner()
+claim_lock()
 {
-  cat "$1/owner" 2>/dev/null || echo "unknown"
+    mkdir -p "$LOCK_BASEDIR"
+    _setlockfd $1
+    eval "exec $_lockfd>>$LOCK_BASEDIR/$1"
+    flock -x $_lockfd
 }
 
 
-_update_lock_info()
+release_lock()
 {
-  echo "$$: $0" >"$1/owner"
+    _setlockfd $1
+    flock -u $_lockfd
 }



Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

Daniel P. Berrange

2012-Jun-21 12:23 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

On Thu, Jun 21, 2012 at 08:08:51AM -0400, Zhigang Wang
wrote:> On 06/21/2012 07:49 AM, Ian Campbell wrote:
> > On Thu, 2012-06-21 at 12:42 +0100, Ian Campbell wrote:
> >> On Wed, 2012-06-20 at 18:53 +0100, Zhigang Wang wrote:
> >>> On 06/20/2012 11:34 AM, Ian Campbell wrote:
> >>>> On Wed, 2012-06-13 at 18:29 +0100, Zhigang Wang wrote:
> >>>>> # HG changeset patch
> >>>>> # User Zhigang Wang <zhigang.x.wang@oracle.com>
> >>>>> # Date 1339608534 14400
> >>>>> # Node ID 650b03f412143c3057abbd0ae0e256e6b3fd2ba8
> >>>>> # Parent  32034d1914a607d7b6f1f060352b4cac973600f8
> >>>>> tools/hotplug: fix locking
> >>>> A better title would be "tools/hotplug: Switch to
locking with flock"
> >>>> since "fix" is not very descriptive.
> >>> Agree.
> >>>> The commit message should also state why changing this
scheme is
> >>>> preferable to fixing the current one.
> >>> I have two points:
> >>>
> >>> 1. No timeout: in the old one, if one process holding the lock
more than 100
> >>>    seconds, other processes could steal it.
> >>> 2. No leftovers: if a process holding this lock is dead, it
will close the lock
> >>>    file, so it will not affect other processes.
> >>>
> >>>> Is this flock tool widely available? Does it need to be
added to the
> >>>> dependencies in the README?
> >>> It is widely distributed: it''s part of the util-linux
package. So I think no
> >>> need to document it.
> >>>>> The current locking implementation would allow two
processes get the lock
> >>>>> simultaneously:
> >>>> [...snip shell trace...]
> >>>>
> >>>> Can you add a line or two of analysis to explain where in
that shell
> >>>> spew things are actually going wrong and why?
> >>> I didn''t spent much time on this complicated
implement. But it fails for me and
> >>> also for others (from response).
> >>>>> This patch is ported from Red Hat Enterprise Linux
5.8.
> >>>> I think we need a Signed-off-by from the original patch
author. (Unless
> >>>> DCO clause b somehow applies?)
> >>> I''m not sure how to handle this. There''s no
signed-off-by on the original Red
> >>> Hat patch. Could anyone in Red Hat help to get it signed off?
> >> Perhaps Andrew Jones or Don Dutile can point us in the right
direction
> >> (both CCd) if you identify the precise source of the patch (i.e.
the
> >> name of the .src.rpm and the name of the patch within it)?
> >>
> >> I looked in kernel-2.6.18-308.8.2.el5.src.rpm (which seems fairly
> >> recent) but maybe I''m looking in the wrong place.
> > Nevermind, I found it in xen-3.0.3-135.el5_8.2.src.rpm (I''d
have sworn
> > that RH shipped kernel+xen in a single source package, oh well).
> >
> > %changelog says:
> >         * Fri Sep 14 2007 Daniel P. Berrange
<berrange@redhat.com> - 3.0.3-40.el5
> >         - Rewrite locking in hotplug scripts to fix timeouts (rhbz
#267861)
> >
> > Daniel CCd.
> >
> > I wonder if there were any followup fixes or changes, especially wrt
to
> > your point 1 above (the lack of a timeout now) requiring xend (or now
> > libxl) changes?
> 
> We have another timeout in xend/xl: if device backend is not setup
properly, the
> device creation will fail.
> 
> Without any timeout for this locking, all backend uevent scripts will hang
if
> one goes wrong. Just stealing the lock upon timeoutis not a good idea: it
may
> cause error or damage. What about adding a long (10minutes) timeout? (by
adding
> -w/--wait/--timeout 600)
Timeouts are a really bad idea for this area of code. If you make the
timeout really long (10 mins as you suggest), then no other guest can
be started during this entire 10 minute window if something goes wrong.
IMHO this is unacceptable.  If you make the timeout short enough that
you don''t unduly delay other VM startup operations, then you will often
hit bogus timeouts under high load.

In practice the scripts should not hang unless something is seriously
wrong, in which case timing out & pretending every thing is still fine
for the future is bogus.  If you have real world examples of hangs
then you should focus efforts on fixing the hangs, rather than papering
over the problem with a timeout.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

Zhigang Wang

2012-Jun-21 13:07 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

On 06/21/2012 08:23 AM, Daniel P. Berrange wrote:> On Thu, Jun 21, 2012 at 08:08:51AM -0400, Zhigang Wang wrote:
>> On 06/21/2012 07:49 AM, Ian Campbell wrote:
>>> On Thu, 2012-06-21 at 12:42 +0100, Ian Campbell wrote:
>>>> On Wed, 2012-06-20 at 18:53 +0100, Zhigang Wang wrote:
>>>>> On 06/20/2012 11:34 AM, Ian Campbell wrote:
>>>>>> On Wed, 2012-06-13 at 18:29 +0100, Zhigang Wang wrote:
>>>>>>> # HG changeset patch
>>>>>>> # User Zhigang Wang
<zhigang.x.wang@oracle.com>
>>>>>>> # Date 1339608534 14400
>>>>>>> # Node ID 650b03f412143c3057abbd0ae0e256e6b3fd2ba8
>>>>>>> # Parent  32034d1914a607d7b6f1f060352b4cac973600f8
>>>>>>> tools/hotplug: fix locking
>>>>>> A better title would be "tools/hotplug: Switch to
locking with flock"
>>>>>> since "fix" is not very descriptive.
>>>>> Agree.
>>>>>> The commit message should also state why changing this
scheme is
>>>>>> preferable to fixing the current one.
>>>>> I have two points:
>>>>>
>>>>> 1. No timeout: in the old one, if one process holding the
lock more than 100
>>>>>    seconds, other processes could steal it.
>>>>> 2. No leftovers: if a process holding this lock is dead, it
will close the lock
>>>>>    file, so it will not affect other processes.
>>>>>
>>>>>> Is this flock tool widely available? Does it need to be
added to the
>>>>>> dependencies in the README?
>>>>> It is widely distributed: it''s part of the
util-linux package. So I think no
>>>>> need to document it.
>>>>>>> The current locking implementation would allow two
processes get the lock
>>>>>>> simultaneously:
>>>>>> [...snip shell trace...]
>>>>>>
>>>>>> Can you add a line or two of analysis to explain where
in that shell
>>>>>> spew things are actually going wrong and why?
>>>>> I didn''t spent much time on this complicated
implement. But it fails for me and
>>>>> also for others (from response).
>>>>>>> This patch is ported from Red Hat Enterprise Linux
5.8.
>>>>>> I think we need a Signed-off-by from the original patch
author. (Unless
>>>>>> DCO clause b somehow applies?)
>>>>> I''m not sure how to handle this. There''s
no signed-off-by on the original Red
>>>>> Hat patch. Could anyone in Red Hat help to get it signed
off?
>>>> Perhaps Andrew Jones or Don Dutile can point us in the right
direction
>>>> (both CCd) if you identify the precise source of the patch
(i.e. the
>>>> name of the .src.rpm and the name of the patch within it)?
>>>>
>>>> I looked in kernel-2.6.18-308.8.2.el5.src.rpm (which seems
fairly
>>>> recent) but maybe I''m looking in the wrong place.
>>> Nevermind, I found it in xen-3.0.3-135.el5_8.2.src.rpm
(I''d have sworn
>>> that RH shipped kernel+xen in a single source package, oh well).
>>>
>>> %changelog says:
>>>         * Fri Sep 14 2007 Daniel P. Berrange
<berrange@redhat.com> - 3.0.3-40.el5
>>>         - Rewrite locking in hotplug scripts to fix timeouts (rhbz
#267861)
>>>
>>> Daniel CCd.
>>>
>>> I wonder if there were any followup fixes or changes, especially
wrt to
>>> your point 1 above (the lack of a timeout now) requiring xend (or
now
>>> libxl) changes?
>> We have another timeout in xend/xl: if device backend is not setup
properly, the
>> device creation will fail.
>>
>> Without any timeout for this locking, all backend uevent scripts will
hang if
>> one goes wrong. Just stealing the lock upon timeoutis not a good idea:
it may
>> cause error or damage. What about adding a long (10minutes) timeout?
(by adding
>> -w/--wait/--timeout 600)
> Timeouts are a really bad idea for this area of code. If you make the
> timeout really long (10 mins as you suggest), then no other guest can
> be started during this entire 10 minute window if something goes wrong.
> IMHO this is unacceptable.  If you make the timeout short enough that
> you don''t unduly delay other VM startup operations, then you will
often
> hit bogus timeouts under high load.
>
> In practice the scripts should not hang unless something is seriously
> wrong, in which case timing out & pretending every thing is still fine
> for the future is bogus.  If you have real world examples of hangs
> then you should focus efforts on fixing the hangs, rather than papering
> over the problem with a timeout.
>
> DanielThanks Daniel. I agree with your points on timeout.

Zhigang

Ian Campbell

2012-Jun-26 15:53 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

On Thu, 2012-06-21 at 13:20 +0100, Daniel P. Berrange
wrote:> On Thu, Jun 21, 2012 at 12:49:24PM +0100, Ian Campbell wrote:
> > I wonder if there were any followup fixes or changes, especially wrt
to
> > your point 1 above (the lack of a timeout now) requiring xend (or now
> > libxl) changes?
> 
> Removing any kind of timeout was in fact the point of this change.
> We do not want hotplug scripts stealing each others locks, or
> timing out, because the timeouts cause alot of problems when
> launching guests on a highly loaded machine where execution time
> can be alot longer than a timeout setting may expect.
Agreed.

What I meant was if there was any code in xend which had inadvertently
come to rely on the existence of the timeout which needed fixing. I''d
guess not -- it''d be a pretty strange pattern...
> I expect you can''t see the original BZ ticket quoted, since it
> has customer information in it.  Here is my description of
> what the problem we weere solving was:
> 
> [quote]
> In the normal case of a single domain booting with one disk, the disk
hotplug
> script will fire once. In this case taking out the lock will never cause a
sleep
> because there''s no other concurrent invocations. If a domain has 4
disks
> configured, then the hotplug script will fire 4 times, all 4 invocations at
> pretty much the same time. If there is even a little load on the system,
the
> locking function in the shell script will sleep for a few seconds - as many
as 5
> seconds, or potentially even time out & fail completely.
> 
> If say 4 or even more domains each with 4 disks start up at once,
that''s 16
> invocations of the hotplug script running at once. There will be alot of
sleep''s
> done & because of the very coarse 1 second granularity the delay can
add up
> significantly. 
> 
> The change to using flock() removes the arbitrary 1 second sleep, so the
very
> instant once hotplug script finishes, another can start & there is no
repeated
> attempts & failures to lock which would add more delay.
> [/quote]
Thanks, this description makes sense. I''d be inclined to use it
verbatim
as the commit message.
> Usually it was our policy to send all these kind of patches upstream,
> so I''m not really sure why this was not already merged. Possibly I
> forgot to submit it, or maybe it was rejected - I honestly can''t
> remember.
> 
> Below is the original patch I wrote, to which I apply:
> 
>   Signed-off-by: Daniel P. Berrange.
Cheers. By my analysis the result is that the claim_lock(),
release_lock() and _setlockfd() functions (i.e. all the actual code) are
identical to those in the patch which Zhigang sent (the differences are
in the removed code, which I presume has changed since you wrote the
original patch).

Therefore I think it is appropriate to include your S-o-b on the new
patch -- assuming you are OK with that?

Also the patch is:

Acked-by: Ian Campbell <ian.campbell@citrix.com>

Ian.

Daniel P. Berrange

2012-Jun-26 16:14 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

On Tue, Jun 26, 2012 at 04:53:12PM +0100, Ian Campbell
wrote:> On Thu, 2012-06-21 at 13:20 +0100, Daniel P. Berrange wrote:
> > On Thu, Jun 21, 2012 at 12:49:24PM +0100, Ian Campbell wrote:
> > > I wonder if there were any followup fixes or changes, especially
wrt to
> > > your point 1 above (the lack of a timeout now) requiring xend (or
now
> > > libxl) changes?
> > 
> > Removing any kind of timeout was in fact the point of this change.
> > We do not want hotplug scripts stealing each others locks, or
> > timing out, because the timeouts cause alot of problems when
> > launching guests on a highly loaded machine where execution time
> > can be alot longer than a timeout setting may expect.
> 
> Agreed.
> 
> What I meant was if there was any code in xend which had inadvertently
> come to rely on the existence of the timeout which needed fixing.
I''d
> guess not -- it''d be a pretty strange pattern...
It has been a while, but I don''t recall any issues in the Xend
layer in this respect.
> > I expect you can''t see the original BZ ticket quoted, since
it
> > has customer information in it.  Here is my description of
> > what the problem we weere solving was:
> > 
> > [quote]
> > In the normal case of a single domain booting with one disk, the disk
hotplug
> > script will fire once. In this case taking out the lock will never
cause a sleep
> > because there''s no other concurrent invocations. If a domain
has 4 disks
> > configured, then the hotplug script will fire 4 times, all 4
invocations at
> > pretty much the same time. If there is even a little load on the
system, the
> > locking function in the shell script will sleep for a few seconds - as
many as 5
> > seconds, or potentially even time out & fail completely.
> > 
> > If say 4 or even more domains each with 4 disks start up at once,
that''s 16
> > invocations of the hotplug script running at once. There will be alot
of sleep''s
> > done & because of the very coarse 1 second granularity the delay
can add up
> > significantly. 
> > 
> > The change to using flock() removes the arbitrary 1 second sleep, so
the very
> > instant once hotplug script finishes, another can start & there is
no repeated
> > attempts & failures to lock which would add more delay.
> > [/quote]
> 
> Thanks, this description makes sense. I''d be inclined to use it
verbatim
> as the commit message.
> 
> > Usually it was our policy to send all these kind of patches upstream,
> > so I''m not really sure why this was not already merged.
Possibly I
> > forgot to submit it, or maybe it was rejected - I honestly
can''t
> > remember.
> > 
> > Below is the original patch I wrote, to which I apply:
> > 
> >   Signed-off-by: Daniel P. Berrange.
> 
> Cheers. By my analysis the result is that the claim_lock(),
> release_lock() and _setlockfd() functions (i.e. all the actual code) are
> identical to those in the patch which Zhigang sent (the differences are
> in the removed code, which I presume has changed since you wrote the
> original patch).
> 
> Therefore I think it is appropriate to include your S-o-b on the new
> patch -- assuming you are OK with that?
Yes, that''s fine with me
> 
> Also the patch is:
> 
> Acked-by: Ian Campbell <ian.campbell@citrix.com>
> 
> Ian.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

Ian Campbell

2012-Jul-04 14:48 UTC

head link

Re: [PATCH] tools/hotplug: fix locking

[...]> > Thanks, this description makes sense. I''d be inclined to use
it verbatim
> > as the commit message.
[...]
> > Therefore I think it is appropriate to include your S-o-b on the new
> > patch -- assuming you are OK with that?
> 
> Yes, that''s fine with me
Thanks. I have applied with the following revised commit message:

hotplug/Linux: use flock based locking

In the normal case of a single domain booting with one disk, the disk hotplug
script will fire once. In this case taking out the lock will never cause a 
sleep because there''s no other concurrent invocations. If a domain has
4 disks
configured, then the hotplug script will fire 4 times, all 4 invocations at
pretty much the same time. If there is even a little load on the system, the
locking function in the shell script will sleep for a few seconds - as many as 5
seconds, or potentially even time out & fail completely.

If say 4 or even more domains each with 4 disks start up at once,
that''s 16
invocations of the hotplug script running at once. There will be alot of 
sleep''s done & because of the very coarse 1 second granularity the
delay can
add up significantly.

The change to using flock() removes the arbitrary 1 second sleep, so the very
instant once hotplug script finishes, another can start & there is no
repeated
attempts & failures to lock which would add more delay.

In addition the current locking implementation would allow two processes get
the lock simultaneously if one decided the other had timed out.

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: Zhigang Wang <zhigang.x.wang@oracle.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

Xen devel - Jun 2012 - [PATCH] tools/hotplug: fix locking

[PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

[PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking

Re: [PATCH] tools/hotplug: fix locking