Hi Larry,

sorry for not responding any earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:

Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
Jul 12 15:29:08 drs1p001 kernel: [1300619.423870] soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G C 4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS: 0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
Jul 12 15:29:08 drs1p001 kernel: [1300619.423958] ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423969] ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423981] ocfs2_lookup+0x199/0x2e0 [ocfs2]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423986] ? _cond_resched+0x16/0x40
Jul 12 15:29:08 drs1p001 kernel: [1300619.423989] lookup_slow+0xa9/0x170
Jul 12 15:29:08 drs1p001 kernel: [1300619.423991] walk_component+0x1c6/0x350
Jul 12 15:29:08 drs1p001 kernel: [1300619.423993] ? path_init+0x1bd/0x300
Jul 12 15:29:08 drs1p001 kernel: [1300619.423995] path_lookupat+0x73/0x220
Jul 12 15:29:08 drs1p001 kernel: [1300619.423998] ? ___bpf_prog_run+0xba7/0x1260
Jul 12 15:29:08 drs1p001 kernel: [1300619.424000] filename_lookup+0xb8/0x1a0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424003] ? seccomp_run_filters+0x58/0xb0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424005] ? __check_object_size+0x98/0x1a0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424008] ? strncpy_from_user+0x48/0x160
Jul 12 15:29:08 drs1p001 kernel: [1300619.424010] ? vfs_statx+0x73/0xe0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424012] vfs_statx+0x73/0xe0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424015] C_SYSC_x86_stat64+0x39/0x70
Jul 12 15:29:08 drs1p001 kernel: [1300619.424018] ? syscall_trace_enter+0x117/0x2c0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424020] do_fast_syscall_32+0xab/0x1f0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424022] entry_SYSENTER_compat+0x7f/0x8e
Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------

As this occurred while compiling C code with "-j", I think we were on the wrong track: it is not about mount sharing, but rather a multi-core issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), whose author claimed the issue went away after he restricted the machine to 1 active CPU core.

Unfortunately I could not do much with the machine afterwards. The OCFS2 mechanism that reboots the node when the local heartbeat is no longer updated probably kicked in, so there was no way I could have SSHed in and run some debugging.

I have now updated to the Debian kernel package of 4.16.16, backported for Debian 9. I guess I will hit the bug again and will let you know.

Regards,

Daniel
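For reference, the parallel metadata load of a "-j" build can be approximated without a compiler at all. The sketch below is only an illustration with made-up values (the mount point, worker count and file count are placeholders, not taken from the report above): it spawns several workers that concurrently create, stat and remove entries under one shared OCFS2 directory, which produces a lookup pattern roughly similar to the ocfs2_lookup()/ocfs2_dentry_attach_lock() path seen in the trace.

----

#!/bin/bash
# Hypothetical parallel-lookup stress test for an OCFS2 mount.
# All paths and numbers below are placeholders -- adjust to the actual setup.
MNT=/storage/ocfs2/sw/stress      # directory on the shared OCFS2 volume
WORKERS=8                         # roughly comparable to "make -j8"
FILES=200
ITERATIONS=50

mkdir -p "$MNT"

worker() {
    local id=$1 i f
    for ((i = 0; i < ITERATIONS; i++)); do
        for ((f = 0; f < FILES; f++)); do
            # create, stat and remove entries so every worker keeps
            # doing lookups against the same shared parent directory
            touch "$MNT/w${id}_f${f}"
            stat "$MNT/w${id}_f${f}" > /dev/null
            rm -f "$MNT/w${id}_f${f}"
        done
    done
}

for ((w = 0; w < WORKERS; w++)); do
    worker "$w" &
done
wait

----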
-----Original Message-----
From: Larry Chen [mailto:lchen at suse.com]
Sent: Freitag, 11. Mai 2018 09:01
To: Daniel Sobe <daniel.sobe at nxp.com>; ocfs2-devel at oss.oracle.com
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi Daniel,

On 04/12/2018 08:20 PM, Daniel Sobe wrote:
> Hi Larry,
>
> this is, in a nutshell, what I do to create an LXC container as "ordinary user":
>
> * Install the LXC packages from the distribution
> * run the command "lxc-create -n test1 -t download"
> ** the first run might prompt you to generate a ~/.config/lxc/default.conf to define UID mappings
> ** in a corporate environment it might be tricky to set the http_proxy (and maybe even https_proxy) environment variables correctly
> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
> * the container downloads to ~/.local/share/lxc/
> * adapt the "config" file in that directory to add the shared ocfs2 mount like in my example below
> * if you're lucky, "lxc-start -d -n test1" already works, which you can confirm with "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
> ** if you finally want to enable networking, most distributions arrange a dedicated bridge (lxcbr0) which you can configure similar to my example below
> ** in my case I had to install cgroup-related tools and reboot to have all cgroups available, and to allow use of the lxcbr0 bridge in /etc/lxc/lxc-usernet
>
> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. I don't know the conditions under which this will occur, unfortunately.
>
> Regards,
>
> Daniel
>
>
> -----Original Message-----
> From: Larry Chen [mailto:lchen at suse.com]
> Sent: Donnerstag, 12. April 2018 11:20
> To: Daniel Sobe <daniel.sobe at nxp.com>
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>
> Hi Daniel,
>
> Quite an interesting issue.
>
> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>
> Do you have a script to build up your lxc environment?
> Because I want to make sure that my environment is quite the same as yours.
>
> Thanks,
> Larry
>
>
> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> not sure if it helps: the issue wasn't there with Debian 8 and kernel 3.16 - but that's a long history. Unfortunately, the only machine where I could try to bisect does not run any kernel < 4.16 without other issues.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Donnerstag, 12. April 2018 05:17
>> To: Daniel Sobe <daniel.sobe at nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Thanks for your report.
>> I'll try to reproduce this bug as you did.
>>
>> I'm afraid there may be some bugs in the collaboration of cgroups and ocfs2.
>>
>> Thanks
>> Larry
>>
>>
>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> below is an example config file like I use for LXC containers.
>>> I followed the instructions (https://wiki.debian.org/LXC) and downloaded a Debian 8 container as user (unprivileged) and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see at the "lxc.mount.entry" line.
>>>
>>> Meanwhile I'm trying whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success; I will report once anything happens.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>> ----
>>>
>>> # Distribution configuration
>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>> lxc.arch = x86_64
>>>
>>> # Container specific configuration
>>> lxc.id_map = u 0 624288 65536
>>> lxc.id_map = g 0 624288 65536
>>>
>>> lxc.utsname = container1
>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>
>>> lxc.network.type = veth
>>> lxc.network.flags = up
>>> lxc.network.link = bridge1
>>> lxc.network.name = eth0
>>> lxc.network.veth.pair = aabbccddeeff
>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>
>>> lxc.cgroup.cpuset.cpus = 63-86
>>>
>>> lxc.mount.entry = /storage/ocfs2/sw sw none bind 0 0
>>>
>>> lxc.cgroup.memory.limit_in_bytes = 240G
>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>
>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>
>>> ----
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Mittwoch, 11. April 2018 13:31
>>> To: Daniel Sobe <daniel.sobe at nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>>
>>>
>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>
>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or somehow unshares the mount from the original namespace afterwards).
>>> I thought there is a way to share a directory between host and docker container, like
>>>    docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>> That's different from yours.
>>>
>>> How did you set up your lxc or container?
>>>
>>> If you could show me the procedure, I'll try to reproduce it.
>>>
>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points of the local host, will the problem recur?
>>>
>>> Regards,
>>> Larry
>>>> Regards,
>>>>
>>>> Daniel
>>>>

Sorry for this delayed reply.

I tried with lxc + ocfs2 in your mount-shared way, but I cannot reproduce your bugs.

What I use is openSUSE Tumbleweed.

The procedure I used to try to reproduce your bugs:

0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
   mount /dev/xxx /mnt
   It then shows
   207 65 254:16 / /mnt rw,relatime shared:94
   I think this *shared* is what you want. And this mount point will be shared within multiple namespaces.

1. Start Virtual Machine Manager.
2. Add a local LXC connection by clicking File > Add Connection.
   Select LXC (Linux Containers) as the hypervisor and click Connect.
3. Select the localhost (LXC) connection and click the File > New Virtual Machine menu.
4. Activate Application container and click Forward.
   Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container. Click Forward.
5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
6. Type in a name for the container. This name will be used for all virsh commands on the container.
   Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.

If possible, could you please provide a shell script to show what you did with your mount point.

Thanks
Larry
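For reference, the unprivileged-LXC setup Daniel describes earlier in the thread can be collapsed into a single script roughly like the sketch below. The container name, OCFS2 path and mount entry mirror the examples above; the non-interactive download-template options (-d/-r/-a) and the config edit via a simple append are assumptions and may need adjusting to the real environment.

----

#!/bin/bash
# Rough consolidation of the unprivileged-LXC steps described above.
# Assumes ~/.config/lxc/default.conf with UID/GID mappings already exists
# and that /storage/ocfs2/sw is the shared OCFS2 mount from this thread.
set -e

NAME=test1

# 1. create an unprivileged container from the download template
lxc-create -n "$NAME" -t download -- -d debian -r jessie -a amd64

# 2. add the shared OCFS2 bind mount to the container config
CONF=~/.local/share/lxc/$NAME/config
echo 'lxc.mount.entry = /storage/ocfs2/sw sw none bind 0 0' >> "$CONF"

# 3. verify the host mount really is an ocfs2 mount with shared propagation
findmnt -o TARGET,FSTYPE,PROPAGATION /storage/ocfs2/sw

# 4. start the container and attach
lxc-start -d -n "$NAME"
lxc-ls --fancy
lxc-attach -n "$NAME"

----

The findmnt line is only there to confirm the "shared" propagation flag, matching the mountinfo line ("shared:94") quoted in the procedure above.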
This is a stack trace from 4.16.16. All I was doing this time was a "git checkout", which probably caused a lot of file system activity.

Jul 13 11:31:00 drs1p001 kernel: [ 849.213765] ------------[ cut here ]------------
Jul 13 11:31:00 drs1p001 kernel: [ 849.213766] kernel BUG at /build/linux-Sci2oS/linux-4.16.16/fs/ocfs2/dlmglue.c:848!
Jul 13 11:31:00 drs1p001 kernel: [ 849.213774] invalid opcode: 0000 [#1] SMP PTI
Jul 13 11:31:00 drs1p001 kernel: [ 849.213776] Modules linked in: tcp_diag inet_diag unix_diag veth ocfs2 quota_tree bridge stp llc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul dell_wmi crc32_pclmul sparse_keymap wmi_bmof dell_smbios dell_wmi_descriptor ghash_clmulni_intel snd_hda_intel evdev snd_hda_codec intel_cstate dcdbas drm_kms_helper snd_hda_core snd_hwdep intel_uncore intel_rapl_perf snd_pcm snd_timer drm mei_me iTCO_wdt snd pcspkr mei soundcore iTCO_vendor_support i2c_algo_bit sg shpchp intel_pch_thermal wmi serio_raw button video acpi_pad drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4
Jul 13 11:31:00 drs1p001 kernel: [ 849.213808] crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci e1000e libata xhci_pci e1000 xhci_hcd i2c_i801 scsi_mod usbcore usb_common fan thermal
Jul 13 11:31:00 drs1p001 kernel: [ 849.213823] CPU: 1 PID: 4266 Comm: git Not tainted 4.16.0-0.bpo.2-amd64 #1 Debian 4.16.16-2~bpo9+1
Jul 13 11:31:00 drs1p001 kernel: [ 849.213825] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
Jul 13 11:31:00 drs1p001 kernel: [ 849.213851] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
Jul 13 11:31:00 drs1p001 kernel: [ 849.213865] RSP: 0000:ffffab4243c73b20 EFLAGS: 00010046
Jul 13 11:31:00 drs1p001 kernel: [ 849.213867] RAX: 0000000000000282 RBX: ffff9b5fb19d1818 RCX: 0000000000000000
Jul 13 11:31:00 drs1p001 kernel: [ 849.213869] RDX: 0000000000000000 RSI: ffff9b5fb19d1818 RDI: ffff9b5fb19d1894
Jul 13 11:31:00 drs1p001 kernel: [ 849.213870] RBP: 0000000000000003 R08: ffff9b5fd9ca22e0 R09: ffff9b5fcf1ac400
Jul 13 11:31:00 drs1p001 kernel: [ 849.213872] R10: ffffab4243c73b08 R11: 0000000000000000 R12: ffff9b5fb19d1894
Jul 13 11:31:00 drs1p001 kernel: [ 849.213874] R13: ffff9b5fcd2cd000 R14: 0000000000000000 R15: ffffffffc0b2d940
Jul 13 11:31:00 drs1p001 kernel: [ 849.213876] FS: 00007f62f1fa4700(0000) GS:ffff9b5fd9c80000(0000) knlGS:0000000000000000
Jul 13 11:31:00 drs1p001 kernel: [ 849.213878] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 13 11:31:00 drs1p001 kernel: [ 849.213879] CR2: 00007f62cc000010 CR3: 000000022abd2003 CR4: 00000000003606e0
Jul 13 11:31:00 drs1p001 kernel: [ 849.213881] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 13 11:31:00 drs1p001 kernel: [ 849.213883] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 13 11:31:00 drs1p001 kernel: [ 849.213884] Call Trace:
Jul 13 11:31:00 drs1p001 kernel: [ 849.213897] ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
Jul 13 11:31:00 drs1p001 kernel: [ 849.213908] ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
Jul 13 11:31:00 drs1p001 kernel: [ 849.213921] ocfs2_lookup+0x199/0x2e0 [ocfs2]
Jul 13 11:31:00 drs1p001 kernel: [ 849.213925] ? _cond_resched+0x16/0x40
Jul 13 11:31:00 drs1p001 kernel: [ 849.213928] lookup_slow+0xa9/0x170
Jul 13 11:31:00 drs1p001 kernel: [ 849.213930] walk_component+0x1c6/0x350
Jul 13 11:31:00 drs1p001 kernel: [ 849.213932] path_lookupat+0x73/0x220
Jul 13 11:31:00 drs1p001 kernel: [ 849.213935] ? ___bpf_prog_run+0xba7/0x1260
Jul 13 11:31:00 drs1p001 kernel: [ 849.213937] filename_lookup+0xb8/0x1a0
Jul 13 11:31:00 drs1p001 kernel: [ 849.213940] ? seccomp_run_filters+0x58/0xb0
Jul 13 11:31:00 drs1p001 kernel: [ 849.213942] ? __check_object_size+0x98/0x1a0
Jul 13 11:31:00 drs1p001 kernel: [ 849.213945] ? strncpy_from_user+0x48/0x160
Jul 13 11:31:00 drs1p001 kernel: [ 849.213947] ? getname_flags+0x6a/0x1e0
Jul 13 11:31:00 drs1p001 kernel: [ 849.213950] ? vfs_statx+0x73/0xe0
Jul 13 11:31:00 drs1p001 kernel: [ 849.213952] vfs_statx+0x73/0xe0
Jul 13 11:31:00 drs1p001 kernel: [ 849.213954] SYSC_newlstat+0x39/0x70
Jul 13 11:31:00 drs1p001 kernel: [ 849.213957] ? syscall_trace_enter+0x117/0x2c0
Jul 13 11:31:00 drs1p001 kernel: [ 849.213959] do_syscall_64+0x6c/0x130
Jul 13 11:31:00 drs1p001 kernel: [ 849.213961] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Jul 13 11:31:00 drs1p001 kernel: [ 849.213964] RIP: 0033:0x7f62f20800f5
Jul 13 11:31:00 drs1p001 kernel: [ 849.213965] RSP: 002b:00007f62f1fa3d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Jul 13 11:31:00 drs1p001 kernel: [ 849.213967] RAX: ffffffffffffffda RBX: 00007f62f1fa3e50 RCX: 00007f62f20800f5
Jul 13 11:31:00 drs1p001 kernel: [ 849.213969] RDX: 00007f62f1fa3d40 RSI: 00007f62f1fa3d40 RDI: 00007f62e80008c0
Jul 13 11:31:00 drs1p001 kernel: [ 849.213971] RBP: 0000000000000033 R08: 0000000000000003 R09: 0000000000000000
Jul 13 11:31:00 drs1p001 kernel: [ 849.213972] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
Jul 13 11:31:00 drs1p001 kernel: [ 849.213974] R13: 0000000000000000 R14: 0000000000000003 R15: 0000564ea29f2878
Jul 13 11:31:00 drs1p001 kernel: [ 849.213976] Code: 89 c6 5b 5d 41 5c 41 5d e9 01 b8 e4 ea 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
Jul 13 11:31:00 drs1p001 kernel: [ 849.214007] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffab4243c73b20
Jul 13 11:31:00 drs1p001 kernel: [ 849.214010] ---[ end trace 99c07b7b69ee7717 ]---

I'll try to get a backported 4.17 installed soon to verify whether it happens with newer kernels at all.

Regards,

Daniel
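One possible way around the problem mentioned earlier, that the node fences itself before any debugging can happen over SSH, is to stream the kernel log to a second machine while the oops is being printed. The sketch below uses the in-kernel netconsole module; the interface name, IP addresses, ports and MAC address are invented placeholders and must be replaced with real values.

----

#!/bin/bash
# Sketch: ship kernel messages to another host via netconsole, so the
# BUG output survives the o2cb self-fence/reboot.
SRC_IP=192.168.1.10      # the node that crashes
SRC_IF=eth0
DST_IP=192.168.1.20      # machine that collects the log
DST_MAC=aa:bb:cc:dd:ee:ff

modprobe netconsole \
    "netconsole=6665@${SRC_IP}/${SRC_IF},6666@${DST_IP}/${DST_MAC}"

# make sure the full oops is emitted to the console
dmesg -n 8

# on the collecting machine, run something like:
#   nc -u -l 6666        (or: nc -u -l -p 6666, depending on the netcat flavor)

----

With that in place, the complete BUG output should arrive on the collecting host even if o2cb reboots the node immediately afterwards.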
Hi Daniel,

Thanks for your effort to reproduce the bug.
I can confirm that there is more than one bug.
I'll focus on this interesting issue.

On 07/12/2018 10:24 PM, Daniel Sobe wrote:
> Hi Larry,
>
> sorry for not responding any earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958] ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969] ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]

This is caused by ocfs2_dentry_lock failing. I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock on the failure of ocfs2_dentry_lock. But why it failed still confuses me.