Chris Lalancette
2008-May-06 17:36 UTC
[Xen-devel] Greater than 16 xvd devices for blkfront
All,

We've had a number of requests to increase the number of xvd devices that a
PV guest can have. Currently, if you try to connect > 16 disks, you get an
error from xend. The problem ends up being that both xend and blkfront assume
that for dev_t, major/minor is 8 bits each, where in fact there are actually
10 bits for major and 22 bits for minor.

Therefore, it shouldn't really be a problem giving lots of disks to guests.
The problem is in backwards compatibility, and the details. What I am
initially proposing to do is to leave things where they are for /dev/xvd[a-p];
that is, still put the xenstore entries in the same place, and use 8 bits for
the major and 8 bits for the minor. For anything above that, we would end up
putting the xenstore entry in a different place, and pushing the major into
the top 10 bits (leaving the bottom 22 bits for the minor); that way old
guests won't fire when the entry is added, and we will add code to newer
guests' blkfront so that they will fire when they see that entry. Does anyone
see any problems with this setup, or have any ideas how to do it better?

Thanks,
Chris Lalancette
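For concreteness, the two layouts being compared look roughly like this (an
illustrative Python sketch only; the function names are made up and the
second layout is the proposal as stated in the mail, not existing code):

    def pack_8_8(major, minor):
        # What xend and blkfront assume today: 8 bits each, so the minor
        # tops out at 255 (16 disks x 16 partitions for the xvd major).
        assert 0 <= major < (1 << 8) and 0 <= minor < (1 << 8)
        return (major << 8) | minor

    def pack_10_22(major, minor):
        # The layout proposed above for disks beyond xvdp: major in the top
        # 10 bits, minor in the bottom 22 bits. (Note: later in the thread
        # the real dev_t split is pointed out to be 12 major / 20 minor bits.)
        assert 0 <= major < (1 << 10) and 0 <= minor < (1 << 22)
        return (major << 22) | minor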
Daniel P. Berrange
2008-May-06 17:45 UTC
Re: [Xen-devel] Greater than 16 xvd devices for blkfront
On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> All,
> We've had a number of requests to increase the number of xvd devices that a
> PV guest can have. Currently, if you try to connect > 16 disks, you get an
> error from xend. The problem ends up being that both xend and blkfront
> assume that for dev_t, major/minor is 8 bits each, where in fact there are
> actually 10 bits for major and 22 bits for minor.
>
> Therefore, it shouldn't really be a problem giving lots of disks to guests.
> The problem is in backwards compatibility, and the details. What I am
> initially proposing to do is to leave things where they are for
> /dev/xvd[a-p]; that is, still put the xenstore entries in the same place,
> and use 8 bits for the major and 8 bits for the minor. For anything above
> that, we would end up putting the xenstore entry in a different place, and
> pushing the major into the top 10 bits (leaving the bottom 22 bits for the
> minor); that way old guests won't fire when the entry is added, and we will
> add code to newer guests' blkfront so that they will fire when they see
> that entry. Does anyone see any problems with this setup, or have any ideas
> how to do it better?

Putting the xenstore entries in a different place is a non-starter. Too many
things look at that location already. When blktap was added and it put
xenstore entries in a different place it took months to track down all the
bugs this caused.

Dan.
Daniel P. Berrange
2008-May-07 01:55 UTC
Re: [Xen-devel] Greater than 16 xvd devices for blkfront
On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> All,
> We've had a number of requests to increase the number of xvd devices that a
> PV guest can have. Currently, if you try to connect > 16 disks, you get an
> error from xend.
> [...snip...]
> Does anyone see any problems with this setup, or have any ideas how to do
> it better?

Looking at the blkfront code I think we can increase the minor numbers
available for xvdX devices without requiring changes to where stuff is
stored. The key is that in blkfront we can reliably detect the overflow
triggered by the 17th disk, because the next major number, 203, doesn't
clash with any of the other major numbers blkfront is looking for.

Consider the 17th disk, which has the name 'xvdq'; this gives a device number
in xenstore of 51968. Upon seeing this, the current blkfront code will use

    #define BLKIF_MAJOR(dev) ((dev)>>8)
    #define BLKIF_MINOR(dev) ((dev) & 0xff)

and so get back a major number of 203 and a minor number of 0. The
xlbd_get_major_info(int vdevice) function has a switch on major numbers, and
the xvdX case is handled as the default:

    major = BLKIF_MAJOR(vdevice);
    minor = BLKIF_MINOR(vdevice);

    switch (major) {
    case IDE0_MAJOR: index = 0; break;
    ....snipped...
    case IDE9_MAJOR: index = 9; break;
    case SCSI_DISK0_MAJOR: index = 10; break;
    case SCSI_DISK1_MAJOR ... SCSI_DISK7_MAJOR:
            index = 11 + major - SCSI_DISK1_MAJOR;
            break;
    case SCSI_CDROM_MAJOR: index = 18; break;
    default: index = 19; break;
    }

So the 17th disk in fact gets treated as the 1st disk, the frontend assigns
it the name 'xvda', and then it promptly kernel panics because xvda already
exists in sysfs:

    kobject_add failed for xvda with -EEXIST, don't try to register things
    with the same name in the same directory.

    Call Trace:
     [<ffffffff80336951>] kobject_add+0x16e/0x199
     [<ffffffff8025ce3c>] exact_lock+0x0/0x14
     [<ffffffff8029b271>] keventd_create_kthread+0x0/0xc4
     [<ffffffff802f393e>] register_disk+0x43/0x198
     [<ffffffff8029b271>] keventd_create_kthread+0x0/0xc4
     [<ffffffff8032e453>] add_disk+0x34/0x3d
     [<ffffffff88074eb8>] :xenblk:backend_changed+0x110/0x193
     [<ffffffff803a4029>] xenbus_read_driver_state+0x26/0x3b

Now, this kernel panic isn't a huge problem (though blkfront ought to handle
the kobject_add failure gracefully), because we can never do anything to make
existing frontends deal with > 16 disks. If an admin tries to add more than
16 disks to an existing guest they should already expect doom.
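For reference, the arithmetic can be checked with a few lines of Python (an
illustrative sketch only, mirroring the encoding and macros quoted above; it
is not code from the tree):

    XVD_MAJOR = 202

    def xvd_whole_disk_number(disk):
        # xend's encoding of a whole-disk xvd device: 202 * 256 + 16 * disk
        return XVD_MAJOR * 256 + 16 * disk

    vdevice = xvd_whole_disk_number(16)   # 'xvdq', the 17th disk
    major = vdevice >> 8                  # BLKIF_MAJOR
    minor = vdevice & 0xff                # BLKIF_MINOR
    print(vdevice, major, minor)          # 51968 203 0 -> looks like a "first" disk again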
For future frontends though, it looks like we can adapt the switch (major)
in xlbd_get_major_info() so that it detects the overflow of minor numbers
and re-adjusts the major/minor numbers to their intended values. E.g. change

    case SCSI_CDROM_MAJOR: index = 18; break;
    default: index = 19; break;
    }

to

    case SCSI_CDROM_MAJOR: index = 18; break;
    default:
            index = 19;
            if (major > XLBD_MAJOR_VBD_START) {
                    minor += 16 * 16 * (major - XLBD_MAJOR_VBD_START);
                    major = XLBD_MAJOR_VBD_START;
            }
            break;
    }

Now, I've not actually tested this, and there are a few other places in
blkfront needing similar tweaks, but I don't see anything in the code which
fundamentally stops this overflow detection & fixup.

As far as the XenD backend is concerned, all we need to do is edit the XenD
blkdev_name_to_number() function in tools/python/xen/util/blkif.py to relax
the regex to allow > xvdp, and adapt the math so it overflows onto the major
numbers following xvd's 202. For example, changing

    if re.match( '/dev/xvd[a-p]([1-9]|1[0-5])?', n):
        return 202 * 256 + 16 * (ord(n[8:9]) - ord('a')) + int(n[9:] or 0)

to

    if re.match( '/dev/xvd[a-z]([1-9]|1[0-5])?', n):
        return 202 * 256 + 16 * (ord(n[8:9]) - ord('a')) + int(n[9:] or 0)

gets you to 26 disks. This is how I got the guest to boot and the frontend
to crash on the 17th disk, 'xvdq'. It is a little more complex to cope with
2-letter drives, but no show stopper there.

So, unless I'm missing something obvious, we can keep compatibility with
existing guests for the first 16 disks and still (indirectly) make use of a
12/20 dev_t split for the 17th+ disk, without needing to change how or where
stuff is stored in XenStore.

Regards,
Daniel.
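To sanity-check the proposed fixup, here is a rough Python model of the logic
above (illustrative only; it simply mirrors the C fixup and xend's
blkdev_name_to_number() arithmetic):

    XLBD_MAJOR_VBD_START = 202

    def fixup(vdevice):
        major, minor = vdevice >> 8, vdevice & 0xff
        if major > XLBD_MAJOR_VBD_START:
            # Each overflowed major accounts for 16 disks * 16 partitions = 256 minors.
            minor += 16 * 16 * (major - XLBD_MAJOR_VBD_START)
            major = XLBD_MAJOR_VBD_START
        return major, minor

    for disk in range(26):                      # xvda .. xvdz, whole disks
        name = '/dev/xvd' + chr(ord('a') + disk)
        vdev = 202 * 256 + 16 * disk            # what blkdev_name_to_number() returns
        print(name, fixup(vdev))
    # xvdp -> (202, 240), xvdq -> (202, 256), ..., xvdz -> (202, 400)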
Daniel P. Berrange
2008-May-07 03:47 UTC
Re: [Xen-devel] Greater than 16 xvd devices for blkfront
On Wed, May 07, 2008 at 02:55:02AM +0100, Daniel P. Berrange wrote:
> On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> > [...snip...]
>
> Looking at the blkfront code I think we can increase the minor numbers
> available for xvdX devices without requiring changes to where stuff is
> stored.

Have a go with this proof of concept patch to blkfront. I built pv-on-hvm
drivers with this and successfully booted my guest with 25 disks
(xvdb -> xvdz) and saw them registered in /dev, as can be seen from
/proc/partitions:

    major minor  #blocks  name

       3     0    5242880 hda
       3     1     104391 hda1
       3     2    5132767 hda2
     253     0    4096000 dm-0
     253     1    1015808 dm-1
     202    16     102400 xvdb
     202    32     102400 xvdc
     202    48     102400 xvdd
     202    49      48163 xvdd1
     202    50      48195 xvdd2
     202    64     102400 xvde
     202    80     102400 xvdf
     202    96     102400 xvdg
     202   112     102400 xvdh
     202   128     102400 xvdi
     202   144     102400 xvdj
     202   160     102400 xvdk
     202   176     102400 xvdl
     202   192     102400 xvdm
     202   208     102400 xvdn
     202   224     102400 xvdo
     202   240     102400 xvdp
     202   256     102400 xvdq
     202   272     102400 xvdr
     202   288     102400 xvds
     202   304     102400 xvdt
     202   320     102400 xvdu
     202   336     102400 xvdv
     202   352     102400 xvdw
     202   368     102400 xvdx
     202   384     102400 xvdy
     202   400     102400 xvdz
     202   401      96358 xvdz1

NB, this requires the regex tweak to blkif.py in XenD to allow xvd[a-z]
naming.

Regards,
Daniel.

diff -r 57ab8dd47580 drivers/xen/blkfront/vbd.c
--- a/drivers/xen/blkfront/vbd.c    Sun Jul 01 22:07:32 2007 +0100
+++ b/drivers/xen/blkfront/vbd.c    Tue May 06 23:38:20 2008 -0400
@@ -166,7 +166,14 @@ xlbd_get_major_info(int vdevice)
         index = 18 + major - SCSI_DISK8_MAJOR;
         break;
     case SCSI_CDROM_MAJOR: index = 26; break;
-    default: index = 27; break;
+    default:
+        index = 27;
+        if (major > XLBD_MAJOR_VBD_START) {
+            printk("xen-vbd: fixup major/minor %d -> %d,%d\n", vdevice, major, minor);
+            minor += (16 * 16 * (major - 202));
+            major = 202;
+        }
+        printk("xen-vbd: process major/minor %d -> %d,%d\n", vdevice, major, minor);
     }
 
     mi = ((major_info[index] != NULL) ? major_info[index] :
@@ -315,14 +322,42 @@ xlvbd_add(blkif_sector_t capacity, int v
 {
     struct block_device *bd;
     int err = 0;
+    int major, minor;
 
-    info->dev = MKDEV(BLKIF_MAJOR(vdevice), BLKIF_MINOR(vdevice));
+    major = BLKIF_MAJOR(vdevice);
+    minor = BLKIF_MINOR(vdevice);
+
+    switch (major) {
+    case IDE0_MAJOR:
+    case IDE1_MAJOR:
+    case IDE2_MAJOR:
+    case IDE3_MAJOR:
+    case IDE4_MAJOR:
+    case IDE5_MAJOR:
+    case IDE6_MAJOR:
+    case IDE7_MAJOR:
+    case IDE8_MAJOR:
+    case IDE9_MAJOR:
+    case SCSI_DISK0_MAJOR:
+    case SCSI_DISK1_MAJOR ... SCSI_DISK7_MAJOR:
+    case SCSI_DISK8_MAJOR ... SCSI_DISK15_MAJOR:
+    case SCSI_CDROM_MAJOR:
+        break;
+
+    default:
+        if (major > 202) {
+            minor += (16 * 16 * (major - 202));
+            major = 202;
+        }
+    }
+
+    info->dev = MKDEV(major, minor);
 
     bd = bdget(info->dev);
     if (bd == NULL)
         return -ENODEV;
 
-    err = xlvbd_alloc_gendisk(BLKIF_MINOR(vdevice), capacity, vdevice,
+    err = xlvbd_alloc_gendisk(minor, capacity, vdevice,
                   vdisk_info, sector_size, info);
 
     bdput(bd);
Chris Wright
2008-May-07 16:04 UTC
Re: [Xen-devel] Greater than 16 xvd devices for blkfront
* Daniel P. Berrange (berrange@redhat.com) wrote:
> On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> > All,
> > We've had a number of requests to increase the number of xvd devices that
> > a PV guest can have. Currently, if you try to connect > 16 disks, you get
> > an error from xend. The problem ends up being that both xend and blkfront
> > assume that for dev_t, major/minor is 8 bits each, where in fact there
> > are actually 10 bits for major and 22 bits for minor.

Just a nit: it's actually 12:20.

> > [...snip...]
> > Does anyone see any problems with this setup, or have any ideas how to do
> > it better?
>
> Putting the xenstore entries in a different place is a non-starter. Too
> many things look at that location already. When blktap was added and it
> put xenstore entries in a different place it took months to track down
> all the bugs this caused.

I'm not sure what you mean? Since this is blkfront it'd be more like adding
a virtual-device2 to extend the protocol. The current code is:

    /* FIXME: Use dynamic device id if this is not set. */
    err = xenbus_scanf(XBT_NIL, dev->nodename,
                       "virtual-device", "%i", &vdevice);
    if (err != 1) {
            xenbus_dev_fatal(dev, err, "reading virtual-device");
            return err;
    }

IOW, something simple like:

    err = xenbus_scanf(XBT_NIL, dev->nodename,
                       "virtual-device", "%i", &vdevice);
    if (err == -ENOENT)
            err = xenbus_scanf(XBT_NIL, dev->nodename,
                               "virtual-device2", "%i", &vdevice);

Then we can stop propagating the myth that dev_t is 8:8.

thanks,
-chris
Chris Wright
2008-May-07 16:40 UTC
Re: [Xen-devel] Greater than 16 xvd devices for blkfront
* Daniel P. Berrange (berrange@redhat.com) wrote:
> +    default:
> +        if (major > 202) {
> +            minor += (16 * 16 * (major - 202));
> +            major = 202;
> +        }
> +    }

I didn't think of handling the overflow (since the majors for scsi/ide/etc.
were involved, I expected that to fail). But, aside from crashing an older
guest with > 16 disks (not ideal, but I think that's possible already with
the 0x format), this seems good.

thanks,
-chris
Ian Jackson
2008-May-08 09:30 UTC
Re: [Xen-devel] Greater than 16 xvd devices for blkfront
Chris Wright writes ("Re: [Xen-devel] Greater than 16 xvd devices for blkfront"):
> * Daniel P. Berrange (berrange@redhat.com) wrote:
> > +    default:
> > +        if (major > 202) {
> > +            minor += (16 * 16 * (major - 202));
> > +            major = 202;
> > +        }
> > +    }

The root cause of the problem is the incorporation of the Linux device
numbering scheme into the xenstore protocol, which I think is wrong. What
Daniel's excellent if rather unpleasant suggestion does is to regard the
xenstore number not as a `Linux device number' but rather as a crazy
encoding of the disk number.

I think this is fine, but it would be good if we could think about what the
new crazy encoding is, and document it. I infer that in Daniel's suggestion
it's:

    xenstore number = (202 << 8) + (actual disk number << 4) | partition number

where the actual disk number starts at 0 for xvda and partition numbers are
0 for the whole disk or 1..15.

Daniel's solution still doesn't work for partitions >15. Perhaps, given that
old guests are going to break anyway, we should consider a different scheme?
Since disks and partitions not supported by the old encoding won't work on
old guests anyway, we can use a completely new encoding for that case,
provided only that it doesn't use numbers of the form

    (202 << 8) | something

Presumably we can safely use at least 31 bits. If we reserve one to indicate
that this is the new encoding, that leaves us with 30, which should be
enough for a reasonable number of disks with many partitions each.

> I didn't think of handling the overflow (since the majors for scsi/ide/etc.
> were involved, I expected that to fail). But, aside from crashing an older
> guest with > 16 disks (not ideal, but I think that's possible already with
> the 0x format), this seems good.

If a guest takes the xenstore number to be the concatenation of its own
major and minor numbers then obviously it is leaving itself open to breaking
in the future. dom0 admins will just have to Not Do That Then. (It's a
shame, if true, that the guests don't have actual error checking.)

Ian.
Chris Wright
2008-May-08 15:33 UTC
Re: [Xen-devel] Greater than 16 xvd devices for blkfront
* Ian Jackson (Ian.Jackson@eu.citrix.com) wrote:
> Chris Wright writes ("Re: [Xen-devel] Greater than 16 xvd devices for blkfront"):
> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > [...patch snippet snipped...]
>
> The root cause of the problem is the incorporation of the Linux device
> numbering scheme into the xenstore protocol, which I think is wrong. What
> Daniel's excellent if rather unpleasant suggestion does is to regard the
> xenstore number not as a `Linux device number' but rather as a crazy
> encoding of the disk number.
>
> I think this is fine, but it would be good if we could think about what the
> new crazy encoding is, and document it. I infer that in Daniel's suggestion
> it's:
>
>     xenstore number = (202 << 8) + (actual disk number << 4) | partition number
>
> where the actual disk number starts at 0 for xvda and partition numbers are
> 0 for the whole disk or 1..15.
>
> Daniel's solution still doesn't work for partitions >15. Perhaps,

I think that's OK, and effectively a hard limitation w.r.t. lanana:

    202 block    Xen Virtual Block Device
                   0 = /dev/xvda    First Xen VBD whole disk
                  16 = /dev/xvdb    Second Xen VBD whole disk
                  32 = /dev/xvdc    Third Xen VBD whole disk
                     ...
                 240 = /dev/xvdp    Sixteenth Xen VBD whole disk

                 Partitions are handled in the same way as for IDE
                 disks (see major number 3) except that the limit on
                 partitions is 15.

> given that old guests are going to break anyway, we should consider a
> different scheme? Since disks and partitions not supported by the
> old encoding won't work on old guests anyway, we can use a completely
> new encoding for that case, provided only that it doesn't use numbers
> of the form
>
>     (202 << 8) | something

Well, we don't actually need 202, or any minor numbers at all. The major is
only needed for the case where xvd masquerades as IDE or SCSI. We ripped
this wart out for upstream Linux. And the guest can happily dynamically
allocate minor numbers on its own behalf. A disk discovery event can be
completely dynamic; the admin just wouldn't be able to guarantee which minor
slot gets allocated for a particular disk in a guest. We do have mount by
label or UUID.

> Presumably we can safely use at least 31 bits. If we reserve one to
> indicate that this is the new encoding, that leaves us with 30, which
> should be enough for a reasonable number of disks with many partitions
> each.
>
> > I didn't think of handling the overflow (since the majors for
> > scsi/ide/etc. were involved, I expected that to fail). But, aside from
> > crashing an older guest with > 16 disks (not ideal, but I think that's
> > possible already with the 0x format), this seems good.
>
> If a guest takes the xenstore number to be the concatenation of its own
> major and minor numbers then obviously it is leaving itself open to
> breaking in the future. dom0 admins will just have to Not Do That Then.
> (It's a shame, if true, that the guests don't have actual error checking.)

Agreed.

thanks,
-chris
Ian Jackson
2008-May-08 17:03 UTC
Re: [Xen-devel] Greater than 16 xvd devices for blkfront
Chris Wright writes ("Re: [Xen-devel] Greater than 16 xvd devices for blkfront"):
> Ian Jackson (Ian.Jackson@eu.citrix.com) wrote:
> > Daniel's solution still doesn't work for partitions >15. Perhaps,
>
> I think that's OK, and effectively a hard limitation w.r.t. lanana:

No, because not all guests are Linux, and anyway that limitation in Linux
may be improved in the future. If we're going to invent a new scheme then we
may as well solve the problem properly.

> > given that old guests are going to break anyway, we should consider a
> > different scheme? Since disks and partitions not supported by the
> > old encoding won't work on old guests anyway, we can use a completely
> > new encoding for that case, provided only that it doesn't use numbers
> > of the form
> >
> >     (202 << 8) | something
>
> Well, we don't actually need 202, or any minor numbers at all. The major is
> only needed for the case where xvd masquerades as IDE or SCSI.

I think you're really missing the point. At the moment the Xen domain config
specifies whether the device is supposed to show up in the guest as a native
xvd, or masquerading as scsi or ide. This information is encoded, along with
the disk number and partition number, into the xenstore path. The xenstore
path element is currently a decimal integer, and that integer supplies this
information in an encoding derived from that used internally by
pre-32-bit-dev_t Linux guests.

That's completely mad. However, we can't really change it now, at least for
disks which fit into the old encoding scheme, because any new scheme won't
be supported by old guests.

For disks and partitions which are out of the range which fits into the
current encodings, we need a new encoding anyway. Old guests definitely
can't cope with those, so we don't need to be compatible. Daniel Berrange's
suggestion amounts to this: rather than invent a wholly new location in
xenstore for these disks, we simply make use of more of the available values
of this integer.

I'm pointing out that when we do that we ought to take into account our
future requirements in general, which may include >15 partitions. Something
like this:

  Old format:

    202 << 8 | disk << 4 | partition    xvd, disks and partitions up to 15
      8 << 8 | disk << 4 | partition    sd, disks and partitions up to 15
      3 << 8 | disk << 6 | partition    hd, disks 0..3, partitions 1..63

  New format:

    1 << 28 | disk << 8 | partition     xvd, disks or partitions 16 onwards

  Reserved for future use:

    2 << 28 onwards

Ian.
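As a rough illustration of the xvd rows just listed (a Python sketch only;
the function name is made up and is not part of the proposal):

    def xvd_xenstore_number(disk, partition):
        # Old format while everything fits in 4 bits, new format otherwise.
        if disk <= 15 and partition <= 15:
            return (202 << 8) | (disk << 4) | partition   # old format
        return (1 << 28) | (disk << 8) | partition        # proposed new format

    print(xvd_xenstore_number(0, 0))     # xvda   -> 51712
    print(xvd_xenstore_number(15, 15))   # xvdp15 -> 51967
    print(xvd_xenstore_number(16, 0))    # 17th disk -> 268439552 (1 << 28 | 16 << 8)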
Jeremy Fitzhardinge
2008-May-08 22:14 UTC
Re: [Xen-devel] Greater than 16 xvd devices for blkfront
Chris Wright wrote:
> Well, we don't actually need 202, or any minor numbers at all. The major
> is only needed for the case where xvd masquerades as IDE or SCSI.
> We ripped this wart out for upstream Linux.

I'm considering putting it back in if it makes anyone's life easier. In
general, using labels/uuids is the best way to make an installation
device-agnostic, but installers might have an easier time with a forged
scsi device or something. I mentioned it in passing to Al Viro, and he was
surprisingly non-insulting about the notion.

> And the guest can happily dynamically allocate minor numbers on its own
> behalf. A disk discovery event can be completely dynamic; the admin just
> wouldn't be able to guarantee which minor slot gets allocated for a
> particular disk in a guest. We do have mount by label or UUID.

That's true for filesystems which have already been initialized. But if
you're attaching 4 new devices to a guest and they appear at random device
nodes, how do you know which is which? Smell?

    J
Daniel P. Berrange
2008-May-08 23:34 UTC
Re: [Xen-devel] Greater than 16 xvd devices for blkfront
On Thu, May 08, 2008 at 11:14:34PM +0100, Jeremy Fitzhardinge wrote:
> Chris Wright wrote:
> > [...snip...]
>
> That's true for filesystems which have already been initialized. But if
> you're attaching 4 new devices to a guest and they appear at random device
> nodes, how do you know which is which? Smell?

Well, there's /dev/disk/by-{path,id}. Now, there are no udev rules to set up
these links for Xen VBDs (afaik), but we could arrange to have some suitable
info used to provide a persistent path under either of those locations.

Dan.
In May 2008 I wrote:
> Old format:
>
>   202 << 8 | disk << 4 | partition    xvd, disks and partitions up to 15
>     8 << 8 | disk << 4 | partition    sd, disks and partitions up to 15
>     3 << 8 | disk << 6 | partition    hd, disks 0..3, partitions 1..63
>
> New format:
>
>   1 << 28 | disk << 8 | partition     xvd, disks or partitions 16 onwards
>
> Reserved for future use:
>
>   2 << 28 onwards

But now that I get down and dirty with some code I discover that actually
what we have is not quite this. Much Linux-specific stuff has crept in and
the result is a mess. After consultation, what we intend to implement in
libxl is as follows:

 * The abstract interface specifies, for each VBD:

    * Nominal disk type: Xen virtual disk (aka xvd, the default);
      SCSI (sd); IDE (hd). This is for use as a hint by the guest's
      device naming scheme.

    * Disk number, which is a nonnegative integer, conventionally
      starting at 0 for the first disk.

    * Partition number, which is a nonnegative integer where by
      convention partition 0 indicates the "whole disk".

      Normally for any disk _either_ partition 0 should be supplied, in
      which case the guest is expected to treat it as it would a native
      whole disk (for example by putting or expecting a partition table
      or disk label on it);

      _Or_ only non-0 partitions should be supplied, in which case the
      guest should expect storage management to be done by the host and
      treat each vbd as it would a partition or slice or LVM volume (for
      example by putting or expecting a filesystem on it).

 * The syntaxes are, for example:

      d0 d0p0  xvda     Xen virtual disk 0 partition 0 (whole disk)
      d1p2     xvdb2    Xen virtual disk 1 partition 2
      d536p37  xvdtq37  Xen virtual disk 536 partition 37
      sdb3              SCSI disk 1 partition 3
      hdc2              IDE disk 2 partition 2

   The d*p* syntax is not supported by xm/xend.

 * This is encoded in the concrete interface as an integer (in a
   canonical decimal format in xenstore), whose value encodes the
   information above as follows:

      1 << 28 | disk << 8 | partition       xvd, disks or partitions 16 onwards
     202 << 8 | disk << 4 | partition       xvd, disks and partitions up to 15
       8 << 8 | disk << 4 | partition       sd, disks and partitions up to 15
       3 << 8 | disk << 6 | partition       hd, disks 0..1, partitions 0..63
      22 << 8 | (disk-2) << 6 | partition   hd, disks 2..3, partitions 0..63
      2 << 28 onwards                       reserved for future use
      other values less than 1 << 28        deprecated / reserved

   The 1 << 28 format handles disks up to (1<<20)-1 and partitions up to
   255. It will be used only where the 202 << 8 format does not have
   enough bits.

   Guests MAY support any subset of the formats above, except that if
   they support 1 << 28 they MUST also support 202 << 8. Some software
   has provided essentially Linux-specific encodings for SCSI disks
   beyond disk 15 partition 15, and IDE disks beyond disk 3 partition 63.
   These vbds, and the corresponding encoded integers, are deprecated.

 * Guests SHOULD ignore numbers that they do not understand or
   recognise. They SHOULD check supplied numbers for validity.

 * We know that not all guests conform to this interface. For example,
   old Linux systems interpret the integer as

      major << 8 | minor

   where major and minor are the Linux-specific device numbers, and some
   old configurations may depend on deprecated high-numbered SCSI and
   IDE disks. We will therefore preserve the existing facility to
   specify the xenstore numerical value directly, by putting a single
   number (hex, decimal or octal) in the domain config file instead of
   the disk identifier.

Ian.
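A rough Python rendering of the parsing and encoding rules above (an
illustrative sketch only, not libxl code; the function name is made up and
sd/hd names are not handled):

    import re

    def vbd_xenstore_number(name):
        # Parse "d<disk>p<partition>" or "xvd..." names and encode them
        # per the table above.
        m = re.match(r'^d(\d+)(?:p(\d+))?$', name)
        if m:
            disk, part = int(m.group(1)), int(m.group(2) or 0)
        else:
            m = re.match(r'^xvd([a-z]{1,2})(\d*)$', name)
            if not m:
                raise ValueError("unsupported name: " + name)
            letters, part = m.group(1), int(m.group(2) or 0)
            if len(letters) == 1:
                disk = ord(letters) - ord('a')            # xvda..xvdz -> 0..25
            else:                                         # xvdaa -> 26, xvdtq -> 536
                disk = ((ord(letters[0]) - ord('a') + 1) * 26
                        + (ord(letters[1]) - ord('a')))
        if disk <= 15 and part <= 15:
            return (202 << 8) | (disk << 4) | part        # 202 << 8 format
        return (1 << 28) | (disk << 8) | part             # 1 << 28 format

    print(vbd_xenstore_number('xvda'))      # 51712
    print(vbd_xenstore_number('d1p2'))      # same as 'xvdb2' -> 51730
    print(vbd_xenstore_number('d536p37'))   # same as 'xvdtq37' -> (1 << 28) | (536 << 8) | 37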