Roman Shaposhnik
2013-Feb-19 07:26 UTC
[Puppet Users] Request for architectural advice for Hadoop ecosystem deployment
Hi! A few email exchanges on this ML, coupled with John's remark that he'd be open to giving architectural advice, made me realize that a discussion focused on a particular use case I'm trying to address might be much more fruitful than random questions here and there. It is a long email, but I hope it will be useful for the majority of folks subscribed to puppet-users@.

This use case originates from the Apache Bigtop project. Bigtop is to Hadoop what Debian is to Linux -- we're a project aiming at building a 100% community-driven BigData management distribution based on Apache Hadoop and its ecosystem projects. We are concerned with integration, packaging, deployment and system testing of the resulting distro, and we also happen to be the basis for a few commercial distributions -- most notably Cloudera's CDH. Now, it must be mentioned that when I say 'a distribution' I really mean it. Here's the list of components that we have to manage (it is definitely not just Hadoop):

https://issues.apache.org/jira/browse/BIGTOP-816?focusedCommentId=13560059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13560059

Our current Puppet code base is pretty old (it originated in pre-2.X Puppet) and currently serves as the main engine for us to dynamically deploy Bigtop clusters on EC2 for testing purposes. However, given that the Bigtop distro is the foundation for the commercial distros, we would like our Puppet code to be the go-to place for all puppet-driven Hadoop deployment needs.

Thus, at the highest level, our Puppet code needs to be:

#0 useful for as many versions of Puppet as possible. Obviously, we
   shouldn't obsess too much over something like Puppet 0.24, but we
   should keep the goal in mind.

#1 useful in a classical puppet-master driven setup where one has access
   to modules, hiera/extlookup, etc., all nicely set up and maintained
   under /etc/puppet.

#2 useful in a masterless mode so that things like Puppet/Whirr
   integration can be utilized: https://issues.apache.org/jira/browse/WHIRR-385
   This is the case where the Puppet classes are guaranteed to be
   delivered to each node out of band and --modulepath will be given to
   puppet apply. Everything else (hiera/extlookup files, etc.) is likely
   to require additional out-of-band communication that we would like to
   minimize.

#3 useful in orchestration scenarios (like Apache Ambari), although this
   could be viewed as a subset of the previous one.

Now, a typical Hadoop cluster is a collection of nodes, each of which runs a certain collection of services that belong to a particular subsystem (such as HDFS, YARN, HBase, etc.). My first instinct at modeling was to introduce a series of classes capturing the configuration of these subsystems, plus a top-level class corresponding to settings common to the entire cluster. IOW, I would like to be able to express things like "in this cluster, for every subsystem that cares about authentication the setting should be 'kerberos', all of the JVMs should be given at minimum 1G of RAM, and node X should host HDFS's namenode".
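To make that concrete, here is a rough sketch of the layering I have in mind (all class names and values below are made up purely for illustration -- this is not our current code):

    # cluster-wide settings shared by every subsystem that cares about them
    class bigtop::cluster {
      $authentication = 'kerberos'   # 'simple' or 'kerberos'
      $jvm_heap_min   = '1g'         # minimum heap for every JVM we start
    }

    # a subsystem class picks up the cluster-wide settings...
    class bigtop::hdfs {
      include bigtop::cluster
      $authentication = $bigtop::cluster::authentication
      $namenode_uri   = 'hdfs://nn.cluster.mycompany.com'
    }

    # ...and the per-service classes build on the subsystem class
    class bigtop::hdfs::namenode {
      include bigtop::hdfs
      # package/config/service resources for the namenode would go here
    }

    node 'nn.cluster.mycompany.com' {
      include bigtop::hdfs::namenode
    }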
All of this brings us to question #1.

Q1: what would be the most natural way to instantiate such classes on every node that would satisfy the #1-#3 styles of use?

Initially, I was thinking of an ENC-style approach where a complete manifest of classes, their arguments and top-level parameters can be expected on/for every node. This style has a nice property of making classes completely independent of a particular way of instantiating them. IOW, I do not care whether a user of Bigtop's puppet code will explicitly put something like:

    class { "bigtop::hdfs::datanode":
      namenode_uri => "hdfs://nn.cluster.mycompany.com"
      ....
    }

or whether somehow an ENC will generate:

    classes:
      bigtop::hdfs::datanode:
        namenode_uri: hdfs://nn.cluster.mycompany.com

The classes do NOT care how they are being instantiated. Well, almost. They don't if I'm willing to make their use super-verbose, essentially requiring that every single setting is given explicitly. IOW, even though something like namenode_uri will be exactly the same for all the services comprising the HDFS subsystem, I will still require it to be set explicitly for every single class that gets instantiated (even on the same node). E.g.:

    class { "bigtop::hdfs::datanode":
      namenode_uri => "hdfs://nn.cluster.mycompany.com"
    }
    class { "bigtop::hdfs::secondarynamenode":
      namenode_uri => "hdfs://nn.cluster.mycompany.com"
    }

Now, this brings us to the second question.

Q2: In a situation like this, what would be an ideal way of making class instantiations less verbose? Also, as long as we are making them less verbose via some external means like hiera or extlookup, is there any point in keeping the instantiations along the lines of:

    class { "bigtop::hdfs::secondarynamenode": }

instead of:

    include bigtop::hdfs::secondarynamenode

? After all, if we end up requiring *some* class parameters to be loaded from hiera/extlookup, we might as well expect *all* of them to be loaded from there. This, by the way, will give us the extra benefit of being able to do things like:

    include bigtop::hdfs::datanode

from multiple sites without fear of an already-declared resource error.

Finally, the question that has become really obvious to me after pondering this design is: which ways of capturing a datum (e.g. facter, top-scope variables, node-scope variables, class-scope variables, parent-class-scope variables) are the most appropriate for the different types of information that we need to express about our clusters?

Q3: IOW, what are the best practices for managing the following classes of class parameters (categorization stolen from Rich):

1) variables that are defaults that can be rationally set based on
   properties of the node itself, such as using the OS family to choose
   the package manager or package name to use,

2) variables that are set as part of a group and the role it is playing,
   such as the set of common variables that all "slave" nodes in a
   cluster should have,

3) variables that are set as a function of other components that relate
   or connect to them, such as a client needing the port and host
   address of a server -- that is, variables that depend on topology,

4) variables that are set based on external context but can be
   categorized on a node-by-node basis, such as the NTP server address
   based on the location of the data center, or which users should have
   logon access to which machines.

Thanks,
Roman.
jcbollinger
2013-Feb-19 22:30 UTC
[Puppet Users] Re: Request for architectural advice for Hadoop ecosystem deployment
On Tuesday, February 19, 2013 1:26:34 AM UTC-6, Roman Shaposhnik wrote:

> Hi! A few email exchanges on this ML, coupled with John's remark that
> he'd be open to giving architectural advice, made me realize that a
> discussion focused on a particular use case I'm trying to address might
> be much more fruitful than random questions here and there. It is a
> long email, but I hope it will be useful for the majority of folks
> subscribed to puppet-users@.

I hope so, too. Comments in-line below.

> This use case originates from the Apache Bigtop project. Bigtop is to
> Hadoop what Debian is to Linux -- we're a project aiming at building a
> 100% community-driven BigData management distribution based on Apache
> Hadoop and its ecosystem projects. We are concerned with integration,
> packaging, deployment and system testing of the resulting distro, and
> we also happen to be the basis for a few commercial distributions --
> most notably Cloudera's CDH. Now, it must be mentioned that when I say
> 'a distribution' I really mean it. Here's the list of components that
> we have to manage (it is definitely not just Hadoop):
>
> https://issues.apache.org/jira/browse/BIGTOP-816?focusedCommentId=13560059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13560059
>
> Our current Puppet code base is pretty old (it originated in pre-2.X
> Puppet) and currently serves as the main engine for us to dynamically
> deploy Bigtop clusters on EC2 for testing purposes. However, given that
> the Bigtop distro is the foundation for the commercial distros, we
> would like our Puppet code to be the go-to place for all puppet-driven
> Hadoop deployment needs.
>
> Thus, at the highest level, our Puppet code needs to be:
>
> #0 useful for as many versions of Puppet as possible. Obviously, we
>    shouldn't obsess too much over something like Puppet 0.24, but we
>    should keep the goal in mind.

This could actually be a substantial issue. Much depends on whether you are looking for code (manifest) compatibility or functional compatibility between various versions of the master and agent. The former is more tractable. The latter is subject to these constraints: the master must not be older than the agents it serves, and the agents must not be too many minor revisions behind the master. For example, v3.0.x agents should work with a v3.1 master, and I think even a v3.2 master when that eventually comes, but they are likely not to work with masters in the 3.3 series when that arrives in a couple of years. It is easier to maintain manifest compatibility, though maintaining compatibility for a wide selection of versions will probably require careful coding and not relying on third-party modules (unless you fork them for yourselves).

> #1 useful in a classical puppet-master driven setup where one has
>    access to modules, hiera/extlookup, etc., all nicely set up and
>    maintained under /etc/puppet.
>
> #2 useful in a masterless mode so that things like Puppet/Whirr
>    integration can be utilized: https://issues.apache.org/jira/browse/WHIRR-385
>    This is the case where the Puppet classes are guaranteed to be
>    delivered to each node out of band and --modulepath will be given to
>    puppet apply. Everything else (hiera/extlookup files, etc.) is
>    likely to require additional out-of-band communication that we would
>    like to minimize.

You have to communicate the needed data somehow. This can't fundamentally be about *additional* communication; it can only be about how the communication is organized.

> #3 useful in orchestration scenarios (like Apache Ambari), although
>    this could be viewed as a subset of the previous one.

It is important to understand that Puppet is not an orchestration tool, though it can be used *by* one. In an orchestrated cluster (re)configuration scenario, it is also important to avoid overloading a Puppet master. One way to do so is to push out manifests and data to the nodes so that they can "puppet apply" them instead of requiring catalog compilation by the master. That is the sense in which #3 might be viewed as a subset of #2. There are a couple of Puppet features that you cannot then use (exported resources spring to mind), but that may be tolerable, or even necessary.

> Now, a typical Hadoop cluster is a collection of nodes, each of which
> runs a certain collection of services that belong to a particular
> subsystem (such as HDFS, YARN, HBase, etc.). My first instinct at
> modeling was to introduce a series of classes capturing the
> configuration of these subsystems, plus a top-level class corresponding
> to settings common to the entire cluster.

Perhaps this is just a terminology or mindset problem, or maybe I'm hearing you wrong, but it sounds like you're focusing on data, whereas I think you should be focusing first on modeling system components and their relationships in the large. Very much as Rich described, in fact (apologies to those following along on puppet-users). Puppet DSL rewards the top-down approach more than most languages.

> IOW, I would like to be able to express things like "in this cluster,
> for every subsystem that cares about authentication the setting should
> be 'kerberos', all of the JVMs should be given at minimum 1G of RAM,
> and node X should host HDFS's namenode".

You're starting off with some architectural assumptions already, though they don't seem unreasonable (just abstract). You might want to have a look at this Puppet architectural pattern: http://www.craigdunn.org/2012/05/239/. It has been getting a fair amount of attention around here lately. It looks well thought out, and all the commentary I've read about it has been positive. Even if you don't find it suitable for your own needs, evaluating the approach might nevertheless prove a useful exercise.
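To give a flavor of it: the pattern separates business-level "roles" from technology-level "profiles", with each node receiving exactly one role. A skeletal rendition (all names here are invented for illustration, including the stub standing in for your real Bigtop class) might look like:

    # stand-in for the real Bigtop class, just to make the sketch complete
    class bigtop::hdfs::datanode($namenode_uri) {
      notify { "datanode pointed at ${namenode_uri}": }
    }

    # a profile wraps one technology component and feeds it its data
    class profile::hdfs::datanode {
      class { 'bigtop::hdfs::datanode':
        namenode_uri => hiera('namenode_uri', 'hdfs://nn.cluster.mycompany.com'),
      }
    }

    # a role bundles the profiles that make up one kind of node
    class role::hadoop_worker {
      include profile::hdfs::datanode
      # include profile::yarn::nodemanager, etc.
    }

    node 'worker01.cluster.mycompany.com' {
      include role::hadoop_worker
    }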
I include additional responses below to some of your questions, but you might be better off setting the rest of this aside for now and coming back to it when you're dealing with these issues in the context of implementing the reference examples Rich suggested.

> All of this brings us to question #1.
>
> Q1: what would be the most natural way to instantiate such classes on
> every node that would satisfy the #1-#3 styles of use?
>
> Initially, I was thinking of an ENC-style approach where a complete
> manifest of classes, their arguments and top-level parameters can be
> expected on/for every node.

In other words, a program will compute (or a static data file will provide) a complete description of the configuration to be applied to the target node, all in one big chunk, and Puppet will, one way or another, interpret that data to apply the desired configuration. Surely there are some subtleties or implications that I do not capture with that characterization, though, because otherwise that's not very meaningful. Every Puppet configuration is more or less like that, so what are the distinguishing features of this particular idea?

> This style has a nice property of making classes completely independent
> of a particular way of instantiating them. IOW, I do not care whether a
> user of Bigtop's puppet code will explicitly put something like:
>
>     class { "bigtop::hdfs::datanode":
>       namenode_uri => "hdfs://nn.cluster.mycompany.com"
>       ....
>     }
>
> or whether somehow an ENC will generate:
>
>     classes:
>       bigtop::hdfs::datanode:
>         namenode_uri: hdfs://nn.cluster.mycompany.com
>
> The classes do NOT care how they are being instantiated.

Terminology nitpick: puppet classes are not "instantiated" in any case. I raise this because it is important to keep in mind that although Puppet terminology shares a fair number of terms with OO-speak, Puppet DSL is not an OO language. The borrowed terms often have different meanings or implications in Puppet. Classes (there's one of those terms) are "declared", "included", or maybe "assigned". With that said, I don't understand the point. In what way do Puppet classes ever care how they are instantiated? Especially, what about this approach produces that result?

> Well, almost. They don't if I'm willing to make their use
> super-verbose, essentially requiring that every single setting is given
> explicitly. IOW, even though something like namenode_uri will be
> exactly the same for all the services comprising the HDFS subsystem, I
> will still require it to be set explicitly for every single class that
> gets instantiated (even on the same node).

Yuck.

[...]

> Q2: In a situation like this, what would be an ideal way of making
> class instantiations less verbose? Also, as long as we are making them
> less verbose via some external means like hiera or extlookup, is there
> any point in keeping the instantiations along the lines of:
>
>     class { "bigtop::hdfs::secondarynamenode": }
>
> instead of:
>
>     include bigtop::hdfs::secondarynamenode
>
> ?

If you are going to rely on hiera or another external source for all class data, then you absolutely should use the 'include' form, not the parametrized form. The latter carries the constraint that it can be used only once for any given class, and that use must be the first one parsed. There are very good reasons, however, why sometimes you would like to declare a given class in more than one place. You can work around any need to do so with enough effort and code, but that generally makes your manifest set more brittle and/or puts substantially greater demands on your ENC. I see from your subsequent comment (elided) that you recognize that, at least to some degree.
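Concretely, the 'include' style might look like this (a sketch only; the key name is whatever you choose to make it, and I am assuming Puppet 3's bundled hiera -- on 2.7 the hiera-puppet add-on, or extlookup, can play the same part):

    class bigtop::hdfs::secondarynamenode {
      # the class fetches its own data, so it needs no parameters
      $namenode_uri = hiera('bigtop::hdfs::namenode_uri')
      # ... resources that use $namenode_uri go here ...
    }

    # now this is safe to repeat from as many places as you like
    include bigtop::hdfs::secondarynamenode

with a corresponding entry in your hiera data:

    # e.g. /etc/puppet/hieradata/common.yaml
    bigtop::hdfs::namenode_uri: hdfs://nn.cluster.mycompany.com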
> Finally, the question that has become really obvious to me after
> pondering this design is: which ways of capturing a datum (e.g. facter,
> top-scope variables, node-scope variables, class-scope variables,
> parent-class-scope variables) are the most appropriate for the
> different types of information that we need to express about our
> clusters?

Puppet best practices generally hold that reliance on top-scope variables is best avoided. There are a few notable exceptions; principally, these are node facts and the globals provided by Puppet itself. Node facts of course include any custom facts that you may create. Puppet best practices also hold that class inheritance (there's another of those terms) is inappropriate for most classes, so, usually, few classes have a parent-class scope to consider. There is an important and relevant exception, however: a parametrized class may inherit from a data-only parent class (conventionally named <module>::params) to set the default values of its own parameters from the class variables of that class. That usage is reasonably well regarded. Otherwise, the only reason to use class inheritance is for the subclass to override properties of resources declared by its parent, and even that can often be readily replaced by more data-centric approaches these days.
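For reference, the <module>::params pattern looks roughly like this (a sketch; the package names are invented):

    # data-only class: no resources, just values keyed off node facts
    class bigtop::params {
      case $::osfamily {
        'RedHat': { $jdk_package = 'java-1.7.0-openjdk' }
        'Debian': { $jdk_package = 'openjdk-7-jre-headless' }
        default:  { fail("Unsupported osfamily: ${::osfamily}") }
      }
      $namenode_uri = 'hdfs://nn.cluster.mycompany.com'
    }

    # the inheriting class may default its parameters from the data class
    class bigtop::hdfs::datanode(
      $namenode_uri = $bigtop::params::namenode_uri,
      $jdk          = $bigtop::params::jdk_package
    ) inherits bigtop::params {
      package { $jdk: ensure => installed }
      # ... datanode resources that use $namenode_uri go here ...
    }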
Not mentioned in that list, however, are other-class-scope variables. Puppet classes do not provide any data encapsulation, so the variables of all declared classes are accessible, and using them is not inherently unreasonable. (The lack of encapsulation is not a functional risk because Puppet "variables" cannot be modified once set, and they are set when (and only when) the class to which they belong is parsed.)

> Q3: IOW, what are the best practices for managing the following classes
> of class parameters (categorization stolen from Rich):

You are supposing that these will be modeled as class parameters in the first place. Certainly they are data that characterize the configuration to be deployed, and it is possible to model them as class parameters, but that is not the only -- and maybe not the best -- alternative available to you. Class parametrization is a protocol for declaring and documenting the data on which a class relies, and it enables mechanisms for obtaining that data that non-parametrized classes cannot use, but the same configuration objectives can be accomplished without them.

> 1) variables that are defaults that can be rationally set based on
> properties of the node itself, such as using the OS family to choose
> the package manager or package name to use,

Some of these (the package manager, for instance) Puppet determines automatically. You can override its choices if necessary, but it is pretty reliable about doing the right thing in the cases it covers. Other cases in this category don't have such clear-cut answers, but often the variables and the logic to choose their values reside in the classes that use them, or in classes close to those, following a modular approach to manifest set design. Some prefer to externalize the data and thus replace the logic with an external lookup, and although that wouldn't be appropriate for every case, the prevailing opinion in the community is that separating data from code is a principle of good manifest design. My personal opinion is that if you anticipate ever wanting to change or extend the data, then it's a good idea to pull it out.

> 2) variables that are set as part of a group and the role it is
> playing, such as the set of common variables that all "slave" nodes in
> a cluster should have,

Do I understand correctly that we're talking about data that should be the same for all members of the given group (or all nodes having the given role)? Or are we talking about data that make sense only in the context of a given group / role, but may differ among nodes? Are there data among these that apply to multiple groups / roles, but whose values may differ on a per-group or per-role basis? The best ways to handle such data may differ depending on the details. Supposing that we are talking about group-specific data that will be common to all members of the group, it seems best to build some kind of abstraction representing the group, and to associate the data with that. That could mean a class and class variables (whether or not they are parameters), it could be a single array- or hash-valued variable in some global class, it could be a group-specific data file at some level of an external data hierarchy, etc.
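The first of those options might be as simple as this (again a sketch, with invented names):

    # one class per group; members include it and read its variables
    class bigtop::group::slave {
      $jvm_heap     = '1g'
      $mount_points = ['/data/1', '/data/2']
    }

    class bigtop::hbase::regionserver {
      include bigtop::group::slave
      $heap = $bigtop::group::slave::jvm_heap
      # ... regionserver resources sized by $heap and laid out over
      # $bigtop::group::slave::mount_points go here ...
    }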
> 3) variables that are set as a function of other components that relate
> or connect to them, such as a client needing the port and host address
> of a server -- that is, variables that depend on topology,

Exported resources are often used for such purposes, but to the best of my knowledge they rely on a master. For a masterless scenario, this really isn't any different from (2), where the group in question is "consumers of service A" or even "all nodes in the cluster".

> 4) variables that are set based on external context but can be
> categorized on a node-by-node basis, such as the NTP server address
> based on the location of the data center, or which users should have
> logon access to which machines.

If the point is that the data will be chosen based on facts presented by the client node, then I would be inclined to put such data into classes, and to assign those classes the responsibility to look up the appropriate value for the facts presented. That implies some kind of lookup table, which could reside in the class itself or in an external data store, or which could be provided as a class parameter.

John