John A Hawkinson
2017-Apr-10 23:37 UTC
error allocating core memory buffers (code 22) at util2.c(106) [sender=3.1.2]
Hi: I'm in the middle of recovering from a tactical error copying around a Mac OS X 10.10.5 Time Machine backup (turns out Apple's instructions aren't great...), and I had rsync running for the past 6 hours repairing permissions/ACLs on 1.5 TB of data (not copying the data), and then it just died in the middle with:

.L....og.... 2015-03-11-094807/platinum-bar2/usr/local/mysql -> mysql-5.6.21-osx 10.8-x86_64
ERROR: out of memory in expand_item_list [sender]
rsync error: error allocating core memory buffers (code 22) at util2.c(106) [sender=3.1.2]
rsync: [sender] write error: Broken pipe (32)

It was invoked as:

rsync -iaHJAX platinum-barzoom/Backups.backupdb/pb3/ platinum-barratry/x/Backups.backupdb/pb3/

I suspect the situation will be different the next time around, and I'm also not really inclined to try to wait another 7 hours for it to fail. Process limits were:

bash-3.2# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 709
virtual memory          (kbytes, -v) unlimited

Based on "df -i", there are about 200 million inodes in the scope of the copy:

bash-3.2# df -i
Filesystem    512-blocks       Used  Available Capacity      iused     ifree %iused  Mounted on
/dev/disk2s2  3906353072 3879969656   26383416     100%  484996205   3297927    99%  /Volumes/platinum-barzoom
/dev/disk1s2  9766869344 3327435312 6439434032      35%  207964705 402464627    34%  /Volumes/platinum-barratry

Because this is a Time Machine backup, and there were 66 snapshots of a 1 TB disk consuming about 1.5 TB, there were a *lot* of hard links. Many of them are of directories rather than individual files, so it's a little challenging to estimate the ratio of files to links.

Are there any useful tips here? Is it worth filing a bug report on this thin record?
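To get at least a rough handle on how many hard-linked paths rsync -H has to track, something like this nftw(3) sketch could count them. The names here (count_hardlinked, tally) are mine, not rsync's, and a real census of ~200 million inodes would also want to dedupe on (st_dev, st_ino):

```c
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

/* Rough census of hard-linked regular files under a root.
 * count_hardlinked() is a hypothetical helper, not part of rsync:
 * it counts paths whose link count exceeds 1, i.e. the entries
 * rsync -H must remember in its hard-link bookkeeping. */
static long hardlinked_paths;

static int tally(const char *path, const struct stat *st, int flag,
                 struct FTW *ftw)
{
    (void)path; (void)ftw;
    if (flag == FTW_F && st->st_nlink > 1)
        hardlinked_paths++;
    return 0;               /* keep walking even past unreadable entries */
}

long count_hardlinked(const char *root)
{
    hardlinked_paths = 0;
    /* FTW_PHYS: don't follow symlinks, matching rsync -a semantics */
    if (nftw(root, tally, 64, FTW_PHYS) != 0)
        return -1;
    return hardlinked_paths;
}
```

A true files-to-links ratio would additionally need a set of distinct (st_dev, st_ino) pairs, and at this scale that set itself is a lot of memory, which is presumably right where rsync -H falls over.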
I guess I can turn on core dumps and increase (unlimit completely) the stack size... Although it doesn't seem to have segfaulted, so I'm not sure having core dumps enabled would have helped?

This was rsync 3.1.2 as installed via Homebrew, which appears to have applied three local patches:
https://github.com/Homebrew/homebrew-core/blob/68fe052398ea0939315ad87b0a08313c9d9a95af/Formula/rsync.rb

  apply "patches/fileflags.diff",
        "patches/crtimes.diff",
        "patches/hfs-compression.diff"

p.s.: If I had to start over, I would have spent less time by just deleting the data and recopying it, rather than trying to fix up the metadata and dealing with magic Apple stuff like the inability to modify symlinks inside a top-level Backups.backupdb directory of a Time Machine HFS volume. (But you can move the top-level directory into another directory, then modify symlinks inside, and then move it back.) This has been an "interesting" experience.

Thanks.

--jhawk at mit.edu
  John Hawkinson
John Hawkinson
2017-Apr-11 12:33 UTC
error allocating core memory buffers (code 22) at util2.c(106) [sender=3.1.2]
> I guess I can turn on core dumps and increase (unlimit completely) the
> stack size...
>
> Although it doesn't seem to have segfaulted, so I'm not sure having
> core dumps enabled would have helped?

Indeed. I reran it on the 50 remaining individual directories; it seems to have made it through 10 before failing again overnight:

66596 20:38:03 rsync -vP -iaHJAX `cat /tmp/unback` /Volumes/platinum-barratry/x/Backups.backupdb/pb3/
...
rsync: unpack_smb_acl: sys_acl_get_info(): Undefined error: 0 (0)
rsync: unpack_smb_acl: sys_acl_get_info(): Undefined error: 0 (0)
...
.f....og.a.. 2015-06-19-072520/platinum-bar2/Users/jhawk/.emacs.d/auto-save-list/.saves-1541-platinum-bar2.local~
ERROR: out of memory in expand_item_list [sender]
rsync error: error allocating core memory buffers (code 22) at util2.c(106) [sender=3.1.2]
bash-3.2#

No segfault, no core dump, no syslogs. Do I need to run this under lldb and set a breakpoint in expand_item_list()? Quick inspection suggests running with -vvvv might give some useful output:

1680         if (DEBUG_GTE(FLIST, 3)) {
1681                 rprintf(FINFO, "[%s] expand %s to %s bytes, did%s move\n",
1682                         who_am_i(), desc, big_num(new_size * item_size),
1683                         new_ptr == lp->items ? " not" : "");
1684         }
1685         if (!new_ptr)
1686                 out_of_memory("expand_item_list");

Although I imagine that output might be voluminous [but maybe not]? Again, I don't have time to build test cases and reproduce this carefully; every run is painful and long and slow. But I'd like to do the responsible thing if someone can tell me what that is.

> p.s.: If I had to start over, I would have spent less time just deleting
> the data and recopying it, rather than trying to fixup the metadata and

Indeed, it's looking like fixing the metadata with rsync is an order of magnitude slower, even as far as I've gotten. So maybe it's time to find another method. I don't think fts(3) is optimized any better for large hard-link farms, so I think maybe I need a homegrown solution? Ughhhh.
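For context on what's failing: expand_item_list() grows rsync's dynamic lists via realloc and bails through out_of_memory() when realloc returns NULL. This is a simplified sketch of the pattern (my names and growth policy, not rsync's exact code; the real thing is in util2.c):

```c
#include <stdlib.h>

/* Simplified sketch of rsync's item_list growth. The sender's file
 * list lives in lists like this, and one failed realloc is what
 * triggers out_of_memory("expand_item_list"). */
typedef struct {
    void *items;
    size_t count;      /* items in use */
    size_t malloced;   /* items allocated */
} item_list;

/* Ensure room for incr more items of item_size bytes each;
 * returns a pointer to the first free slot, or NULL on exhaustion. */
void *expand_item_list_sketch(item_list *lp, size_t item_size, size_t incr)
{
    if (lp->count + incr > lp->malloced) {
        size_t new_size = lp->malloced ? lp->malloced * 2 : 16;
        while (new_size < lp->count + incr)
            new_size *= 2;
        void *new_ptr = realloc(lp->items, new_size * item_size);
        if (!new_ptr)
            return NULL;   /* rsync would call out_of_memory() here */
        lp->items = new_ptr;
        lp->malloced = new_size;
    }
    return (char *)lp->items + lp->count * item_size;
}
```

With ~200 million inodes in scope plus -H's hard-link bookkeeping, these lists get enormous on the sender, and geometric growth also means a transient need for both the old and new allocations at each growth step.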
--jhawk at mit.edu
  John Hawkinson
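If it comes to a homegrown solution, something like this fts(3) sketch is the shape I have in mind: visit each inode once, so metadata fixups aren't repeated for every extra hard link. Everything here (walk_once, the naive linear seen-set) is illustrative only; at ~200 million inodes the seen-set would need to be a real hash table keyed on (st_dev, st_ino):

```c
#include <fts.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Illustrative seen-set: a linear scan is O(n^2) overall and only
 * tolerable for a demo; a real tool needs a hash table. */
struct inode_key { dev_t dev; ino_t ino; };
static struct inode_key *seen;
static size_t nseen, seen_cap;

static int seen_before(dev_t dev, ino_t ino)
{
    for (size_t i = 0; i < nseen; i++)
        if (seen[i].dev == dev && seen[i].ino == ino)
            return 1;
    if (nseen == seen_cap) {
        seen_cap = seen_cap ? seen_cap * 2 : 1024;
        seen = realloc(seen, seen_cap * sizeof *seen);
        if (!seen) { perror("realloc"); exit(1); }
    }
    seen[nseen].dev = dev;
    seen[nseen].ino = ino;
    nseen++;
    return 0;
}

/* walk_once (hypothetical): visit every regular file under root, but
 * process each hard-linked inode only the first time it is seen.
 * Returns the number of files that would get metadata fixups. */
long walk_once(const char *root)
{
    char *paths[] = { (char *)root, NULL };
    /* FTS_PHYSICAL: don't follow symlinks; FTS_XDEV: stay on one volume */
    FTS *fts = fts_open(paths, FTS_PHYSICAL | FTS_XDEV, NULL);
    if (!fts)
        return -1;
    long fixed = 0;
    FTSENT *e;
    while ((e = fts_read(fts)) != NULL) {
        if (e->fts_info != FTS_F)
            continue;
        if (e->fts_statp->st_nlink > 1 &&
            seen_before(e->fts_statp->st_dev, e->fts_statp->st_ino))
            continue;       /* already handled via another link */
        /* ...chmod()/chown()/ACL fixup on e->fts_accpath would go here... */
        fixed++;
    }
    fts_close(fts);
    return fixed;
}
```

Unlike rsync, which wants the whole file list in memory, this only has to remember the hard-linked inodes it has already touched.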
John Hawkinson
2017-Apr-11 15:29 UTC
error allocating core memory buffers (code 22) at util2.c(106) [sender=3.1.2]
Sorry to keep replying to myself:

> Because this is a Time Machine backup, and there were 66 snapshots of a
> 1 TB disk consuming about 1.5 TB, there were a *lot* of hard links. Many
> of directories rather than individual files, so it's a little

Err, whoops? No, I was tired and confused. They are not hard links to directories; that would screw up the universe. Still, lots of hard-linked files.

> Do I need to run this under lldb and set a breakpoint in expand_item_list()?
> Quick inspection suggests running with -vvvv might give some useful output:

Err... the result of this was that it processed a few files for a minute or so and then hung in select(), consuming no CPU, with no disk activity. Unfortunately my clang/lldb workflow was apparently broken and I didn't have functional debugging symbols (...), and I also lost the stack trace I thought I had (inadequate scrollback), so I'm not sure what was going on. But at first blush, it appeared that adding -vvvv made things hang forever. Removing it and rerunning, it's now happily trucking along and has been for the past hour, actually doing work.

--jhawk at mit.edu
  John Hawkinson
John Hawkinson
2017-Apr-11 17:09 UTC
error allocating core memory buffers (code 22) at util2.c(106) [sender=3.1.2]
> But at first blush, it appeared that adding -vvvv made things hang
> forever.

Yes. Confirmed against the git HEAD (9e7b8ab7cf66ecd152002926a7da61d8ad862522). Running:

rsync -nvvvv -iaHJAX $d $b

does some initial work and then gets to:

bash-3.2# lldb -p 6458
(lldb) process attach --pid 6458
Process 6458 stopped
* thread #1: tid = 0x1d5ff0, 0x00007fff8c2d83fa libsystem_kernel.dylib`__select + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff8c2d83fa libsystem_kernel.dylib`__select + 10
libsystem_kernel.dylib`__select:
->  0x7fff8c2d83fa <+10>: jae    0x7fff8c2d8404            ; <+20>
    0x7fff8c2d83fc <+12>: movq   %rax, %rdi
    0x7fff8c2d83ff <+15>: jmp    0x7fff8c2d3c78            ; cerror
    0x7fff8c2d8404 <+20>: retq
Executable module set to "/usr/local/bin/rsync".
Architecture set to: x86_64h-apple-macosx.
(lldb) bt
* thread #1: tid = 0x1d5ff0, 0x00007fff8c2d83fa libsystem_kernel.dylib`__select + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff8c2d83fa libsystem_kernel.dylib`__select + 10
    frame #1: 0x000000010dd6aedc rsync`perform_io(needed=47, flags=4) + 2924 at io.c:742
    frame #2: 0x000000010dd6a0bd rsync`send_msg(code=MSG_INFO, buf="[receiver] receiving flist for dir 1319\n", len=40, convert=0) + 349 at io.c:958
    frame #3: 0x000000010dd5c765 rsync`rwrite(code=FINFO, buf="[receiver] receiving flist for dir 1319\n", len=40, is_utf8=0) + 469 at log.c:279
    frame #4: 0x000000010dd5d3f3 rsync`rprintf(code=FINFO, format="[%s] receiving flist for dir %d\n") + 627 at log.c:435
    frame #5: 0x000000010dd394a6 rsync`read_ndx_and_attrs(f_in=0, f_out=4, iflag_ptr=0x00007fff51ed0f64, type_ptr="\x80", buf="", len_ptr=0x00007fff51ed0f60) + 646 at rsync.c:362
    frame #6: 0x000000010dd4483b rsync`recv_files(f_in=0, f_out=4, local_name=0x0000000000000000) + 379 at receiver.c:542
    frame #7: 0x000000010dd566c3 rsync`do_recv(f_in=0, f_out=4, local_name=0x0000000000000000) + 899 at main.c:909
    frame #8: 0x000000010dd54ac7 rsync`do_server_recv(f_in=0, f_out=1, argc=1, argv=0x00007fff51ed2328) + 1207 at main.c:1078
    frame #9: 0x000000010dd5431e rsync`start_server(f_in=0, f_out=1, argc=2, argv=0x00007fff51ed2320) + 286 at main.c:1112
    frame #10: 0x000000010dd541ee rsync`child_main(argc=2, argv=0x00007fff51ed2320) + 46 at main.c:1085
    frame #11: 0x000000010dd852ad rsync`local_child(argc=2, argv=0x00007fff51ed2320, f_in=0x00007fff51ed4328, f_out=0x00007fff51ed4324, child_main=(rsync`child_main at main.c:1084)) + 701 at pipe.c:167
    frame #12: 0x000000010dd58e42 rsync`do_cmd(cmd=0x0000000000000000, machine=0x0000000000000000, user=0x0000000000000000, remote_argv=0x00007fee40c04a50, remote_argc=0, f_in_p=0x00007fff51ed4328, f_out_p=0x00007fff51ed4324) + 2210 at main.c:543
    frame #13: 0x000000010dd57fd2 rsync`start_client(argc=1, argv=0x00007fee40c04a40) + 2402 at main.c:1414
    frame #14: 0x000000010dd5747b rsync`main(argc=2, argv=0x00007fee40c04a40) + 2555 at main.c:1652
    frame #15: 0x00007fff96cad5c9 libdyld.dylib`start + 1
    frame #16: 0x00007fff96cad5c9 libdyld.dylib`start + 1
(lldb) up
frame #1: 0x000000010dd6aedc rsync`perform_io(needed=47, flags=4) + 2924 at io.c:742
   739          tv.tv_sec = select_timeout;
   740          tv.tv_usec = 0;
   741
-> 742          cnt = select(max_fd + 1, &r_fds, &w_fds, &e_fds, &tv);
   743
   744          if (cnt <= 0) {
   745                  if (cnt < 0 && errno == EBADF) {
(lldb) fin

And then it never returns from that stack frame; it seems to hang out in select() forever (unless interrupted by a breakpoint or whatnot).

Bugzilla account request submitted.

--jhawk at mit.edu
  John Hawkinson
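My guess at the mechanism, which is an assumption from reading the backtrace and not confirmed: with -vvvv both sides generate enough message traffic that each process ends up blocked waiting to write to a pipe the other side isn't draining. The write-side stall is easy to demonstrate in isolation; this toy (entirely my code, not rsync's) fills a pipe that has no reader, using O_NONBLOCK so it returns instead of wedging the way rsync does:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Keep writing to a pipe nobody reads until the kernel buffer fills.
 * A blocking writer would hang exactly where the failed write happens;
 * O_NONBLOCK makes the stall visible as EAGAIN instead.
 * Returns 0 once the pipe is confirmed full. */
int fill_pipe(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    if (fcntl(fds[1], F_SETFL, O_NONBLOCK) != 0)
        return -1;
    char chunk[4096] = {0};
    long written = 0;
    for (;;) {
        ssize_t n = write(fds[1], chunk, sizeof chunk);
        if (n < 0)
            break;          /* pipe full: a blocking rsync would stall here */
        written += n;
    }
    int full = (errno == EAGAIN || errno == EWOULDBLOCK);
    printf("pipe filled after %ld bytes\n", written);
    close(fds[0]);
    close(fds[1]);
    return full ? 0 : -1;
}
```

If each side is blocked writing debug messages while neither reads, that would fit the symmetric select() wait with no CPU and no disk activity.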