On Mon, May 1, 2017, at 02:34 PM, Gandalf Corvotempesta wrote:
> I'm still thinking that saving (I don't know where, I don't know how)
> a mapping between files and bricks would solve many issues and add
> much more flexibility.

Every system we've discussed has a map. The differences are only in the
granularity, and how the map is stored. Per-file maps inevitably become
a scaling problem, so a deterministic function is used to map individual
files into a much smaller number of buckets, placement groups, hash
ranges, or whatever. Then information about those buckets and their
locations is stored somehow:
* Centrally - Lustre, HDFS, Moose/Lizard
* Distributed among a few servers - Ceph, possibly Gluster with DHT2
* Distributed among all servers - Gluster today
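
To make the "deterministic function" idea above concrete, here's a
minimal sketch of hash-range placement. It's illustrative only: the
hash, layout helpers, and brick names are made up for the example, and
this is not the real DHT code (which uses its own hash and stores
layout ranges on directories).

    # Minimal sketch: map file names deterministically into a fixed
    # hash space, then carve that space into contiguous ranges, one
    # per brick. (Hypothetical names; not Gluster's actual DHT code.)
    import hashlib

    HASH_SPACE = 2**32

    def file_hash(name):
        # Any stable hash works for the sketch.
        return int(hashlib.md5(name.encode()).hexdigest(), 16) % HASH_SPACE

    def make_layout(bricks):
        # Equal, contiguous ranges: [lo, hi) assigned to each brick.
        step = HASH_SPACE // len(bricks)
        return [(i * step,
                 HASH_SPACE if i == len(bricks) - 1 else (i + 1) * step,
                 brick)
                for i, brick in enumerate(bricks)]

    def locate(name, layout):
        # The per-file "map" is just this lookup; only the small
        # layout table needs to be stored anywhere.
        h = file_hash(name)
        for lo, hi, brick in layout:
            if lo <= h < hi:
                return brick

    layout = make_layout(["brick-0", "brick-1", "brick-2"])
    print(locate("foo.txt", layout))
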
No matter which approach you use, you can manipulate the maps. Without
changing the fundamental structure of Gluster, you could take a brick's
hash range and split it in two to create two bricks. Then you could
quietly migrate the files in one brick to anywhere else in the
background. That doesn't quite work today because the two bricks would
be trying to operate on the same directories, seeing each other's files,
etc. Making it more transparent won't be easy, but the changes would be
pretty well localized to DHT.

Brick multiplexing can help too, because it allows a volume to be
created with many more bricks initially, so they'd already be in
separate directories and ready to move. Having multiple bricks live in
one process also makes coordination during such transitions much
easier. This has been part of my plan for years, not
only to support adding a single server but also to support more
sophisticated forms of tiering, quality of service, etc.
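
As a rough illustration of the "split a brick's hash range in two"
step, the fragment below (reusing locate() and layout from the sketch
earlier; again hypothetical, not DHT code) splits one range and shows
that only files whose hash now lands on the new brick would need to
migrate:

    def split_range(layout, old_brick, new_brick):
        # Halve old_brick's range; the new brick takes the upper half.
        new_layout = []
        for lo, hi, brick in layout:
            if brick == old_brick:
                mid = (lo + hi) // 2
                new_layout.append((lo, mid, old_brick))
                new_layout.append((mid, hi, new_brick))
            else:
                new_layout.append((lo, hi, brick))
        return new_layout

    def files_to_migrate(files, old_layout, new_layout):
        # Only files whose placement changed have to move; everything
        # else stays put, so the copy can happen quietly in the background.
        return [f for f in files
                if locate(f, old_layout) != locate(f, new_layout)]

    new_layout = split_range(layout, "brick-1", "brick-new")
    print(files_to_migrate(["a.txt", "b.txt", "c.txt"], layout, new_layout))

The hard part isn't the arithmetic, of course; it's keeping the two
bricks out of each other's way while that background migration runs,
which is exactly the transparency problem described above.
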
The big question as I see it is what we can do *in the near term* to
make N+1 addition easier on *existing* clusters. That probably deserves
a separate answer, so I'll leave it for another time.