Roman Shaposhnik
2013-Feb-19 07:26 UTC
[Puppet Users] Request for architectural advice for Hadoop ecosystem deployment
Hi! A few email exchanges on this ML, coupled with John's remark that he'd be open to giving architectural advice, made me realize that a discussion focused on a particular use case I'm trying to address might be much more fruitful than random questions here and there. It is a long email, but I hope it will be useful for the majority of folks subscribed to puppet-users@.

This use case originates from the Apache Bigtop project. Bigtop is to Hadoop what Debian is to Linux -- we're a project aiming at building a 100% community-driven BigData management distribution based on Apache Hadoop and its ecosystem projects. We are concerned with integration, packaging, deployment and system testing of the resulting distro, and we also happen to be the basis for a few commercial distributions -- most notably Cloudera's CDH. Now, it must be mentioned that when I say 'a distribution' I really mean it. Here's the list of components that we have to manage (it is definitely not just Hadoop):

https://issues.apache.org/jira/browse/BIGTOP-816?focusedCommentId=13560059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13560059

Our current Puppet code base is pretty old (it originated in pre-2.X Puppet) and currently serves as the main engine for us to dynamically deploy Bigtop clusters on EC2 for testing purposes. However, given that the Bigtop distro is the foundation for the commercial distros, we would like our Puppet code to be the go-to place for all puppet-driven Hadoop deployment needs.

Thus, at the highest level, our Puppet code needs to be:

#0 useful for as many versions of Puppet as possible. Obviously, we
   shouldn't obsess too much over something like Puppet 0.24, but we
   should keep the goal in mind.

#1 useful in a classical puppet-master driven setup where one has access
   to modules, hiera/extlookup, etc., all nicely set up and maintained
   under /etc/puppet.

#2 useful in a masterless mode so that things like Puppet/Whirr
   integration can be utilized: https://issues.apache.org/jira/browse/WHIRR-385
   This is the case where the Puppet classes are guaranteed to be
   delivered to each node out of band and --modulepath will be given to
   puppet apply. Everything else (hiera/extlookup files, etc.) is likely
   to require additional out-of-band communication that we would like to
   minimize.

#3 useful in orchestration scenarios (like Apache Ambari), although this
   could be viewed as a subset of the previous one.

Now, a typical Hadoop cluster is a collection of nodes, each of which runs a certain collection of services that belong to a particular subsystem (such as HDFS, YARN, HBase, etc.). My first instinct at modeling was to introduce a series of classes capturing the configuration of these subsystems, plus a top-level class corresponding to settings common to the entire cluster. IOW, I would like to be able to express things like "in this cluster, for every subsystem that cares about authentication the setting should be 'kerberos', all of the JVMs should be given at minimum 1G of RAM, and node X should host HDFS's namenode".
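To make that concrete, here is a rough sketch of the layering I have in mind (all class names and values below are made up purely for illustration -- this is not our current code):

    # cluster-wide settings shared by every subsystem that cares about them
    class bigtop::cluster {
      $authentication = 'kerberos'   # 'simple' or 'kerberos'
      $jvm_heap_min   = '1g'         # minimum heap for every JVM we start
    }

    # a subsystem class picks up the cluster-wide settings...
    class bigtop::hdfs {
      include bigtop::cluster
      $authentication = $bigtop::cluster::authentication
      $namenode_uri   = 'hdfs://nn.cluster.mycompany.com'
    }

    # ...and the per-service classes build on the subsystem class
    class bigtop::hdfs::namenode {
      include bigtop::hdfs
      # package/config/service resources for the namenode would go here
    }

    node 'nn.cluster.mycompany.com' {
      include bigtop::hdfs::namenode
    }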
All of this brings us to question #1.

Q1: what would be the most natural way to instantiate such classes on every node that would satisfy the #1-#3 styles of use?

Initially, I was thinking of an ENC-style approach where a complete manifest of classes, their arguments and top-level parameters can be expected on/for every node. This style has a nice property of making classes completely independent of a particular way of instantiating them. IOW, I do not care whether a user of Bigtop's puppet code will explicitly put something like:

    class { "bigtop::hdfs::datanode":
      namenode_uri => "hdfs://nn.cluster.mycompany.com"
      ....
    }

or whether somehow an ENC will generate:

    classes:
      bigtop::hdfs::datanode:
        namenode_uri: hdfs://nn.cluster.mycompany.com

The classes do NOT care how they are being instantiated. Well, almost. They don't if I'm willing to make their use super-verbose, essentially requiring that every single setting is given explicitly. IOW, even though something like namenode_uri will be exactly the same for all the services comprising the HDFS subsystem, I will still require it to be set explicitly for every single class that gets instantiated (even on the same node). E.g.:

    class { "bigtop::hdfs::datanode":
      namenode_uri => "hdfs://nn.cluster.mycompany.com"
    }
    class { "bigtop::hdfs::secondarynamenode":
      namenode_uri => "hdfs://nn.cluster.mycompany.com"
    }

Now, this brings us to the second question.

Q2: In a situation like this, what would be an ideal way of making class instantiations less verbose? Also, as long as we are making them less verbose via some external means like hiera or extlookup, is there any point in keeping the instantiations along the lines of:

    class { "bigtop::hdfs::secondarynamenode": }

instead of:

    include bigtop::hdfs::secondarynamenode

? After all, if we end up requiring *some* class parameters to be loaded from hiera/extlookup, we might as well expect *all* of them to be loaded from there. This, by the way, will give us the extra benefit of being able to do things like:

    include bigtop::hdfs::datanode

from multiple sites without fear of an already-declared resource error.

Finally, the question that has become really obvious to me after pondering this design is: which ways of capturing a datum (e.g. facter, top-scope variables, node-scope variables, class-scope variables, parent-class-scope variables) are the most appropriate for the different types of information that we need to express about our clusters?

Q3: IOW, what are the best practices for managing the following classes of class parameters (categorization stolen from Rich):

1) variables that are defaults that can be rationally set based on
   properties of the node itself, such as using the OS family to choose
   the package manager or package name to use,

2) variables that are set as part of a group and the role it is playing,
   such as the set of common variables that all "slave" nodes in a
   cluster should have,

3) variables that are set as a function of other components that relate
   or connect to them, such as a client needing the port and host
   address of a server -- that is, variables that depend on topology,

4) variables that are set based on external context but can be
   categorized on a node-by-node basis, such as the NTP server address
   based on the location of the data center, or which users should have
   logon access to which machines.

Thanks,
Roman.
jcbollinger
2013-Feb-19 22:30 UTC
[Puppet Users] Re: Request for architectural advice for Hadoop ecosystem deployment
On Tuesday, February 19, 2013 1:26:34 AM UTC-6, Roman Shaposhnik wrote:

> Hi! A few email exchanges on this ML, coupled with John's remark that
> he'd be open to giving architectural advice, made me realize that a
> discussion focused on a particular use case I'm trying to address might
> be much more fruitful than random questions here and there. It is a
> long email, but I hope it will be useful for the majority of folks
> subscribed to puppet-users@.

I hope so, too. Comments in-line below.

> This use case originates from the Apache Bigtop project. Bigtop is to
> Hadoop what Debian is to Linux -- we're a project aiming at building a
> 100% community-driven BigData management distribution based on Apache
> Hadoop and its ecosystem projects. We are concerned with integration,
> packaging, deployment and system testing of the resulting distro, and
> we also happen to be the basis for a few commercial distributions --
> most notably Cloudera's CDH. Now, it must be mentioned that when I say
> 'a distribution' I really mean it. Here's the list of components that
> we have to manage (it is definitely not just Hadoop):
>
> https://issues.apache.org/jira/browse/BIGTOP-816?focusedCommentId=13560059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13560059
>
> Our current Puppet code base is pretty old (it originated in pre-2.X
> Puppet) and currently serves as the main engine for us to dynamically
> deploy Bigtop clusters on EC2 for testing purposes. However, given that
> the Bigtop distro is the foundation for the commercial distros, we
> would like our Puppet code to be the go-to place for all puppet-driven
> Hadoop deployment needs.
>
> Thus, at the highest level, our Puppet code needs to be:
>
> #0 useful for as many versions of Puppet as possible. Obviously, we
>    shouldn't obsess too much over something like Puppet 0.24, but we
>    should keep the goal in mind.

This could actually be a substantial issue. Much depends on whether you are looking for code (manifest) compatibility or functional compatibility between various versions of the master and agent. The former is more tractable. The latter is subject to these constraints: the master must not be older than the agents it serves, and the agents must not be too many minor revisions behind the master. For example, v3.0.x agents should work with a v3.1 master, and I think even a v3.2 master when that eventually comes, but they are likely not to work with masters in the 3.3 series when that arrives in a couple of years. It is easier to maintain manifest compatibility, though maintaining compatibility for a wide selection of versions will probably require careful coding and not relying on third-party modules (unless you fork them for yourselves).

> #1 useful in a classical puppet-master driven setup where one has
>    access to modules, hiera/extlookup, etc., all nicely set up and
>    maintained under /etc/puppet.
>
> #2 useful in a masterless mode so that things like Puppet/Whirr
>    integration can be utilized: https://issues.apache.org/jira/browse/WHIRR-385
>    This is the case where the Puppet classes are guaranteed to be
>    delivered to each node out of band and --modulepath will be given to
>    puppet apply. Everything else (hiera/extlookup files, etc.) is
>    likely to require additional out-of-band communication that we would
>    like to minimize.

You have to communicate the needed data somehow. This can't fundamentally be about *additional* communication; it can only be about how the communication is organized.

> #3 useful in orchestration scenarios (like Apache Ambari), although
>    this could be viewed as a subset of the previous one.

It is important to understand that Puppet is not an orchestration tool, though it can be used *by* one. In an orchestrated cluster (re)configuration scenario, it is also important to avoid overloading a Puppet master. One way to do so is to push out manifests and data to the nodes so that they can "puppet apply" them instead of requiring catalog compilation by the master. That is the sense in which #3 might be viewed as a subset of #2. There are a couple of Puppet features that you cannot then use (exported resources spring to mind), but that may be tolerable, or even necessary.

> Now, a typical Hadoop cluster is a collection of nodes, each of which
> runs a certain collection of services that belong to a particular
> subsystem (such as HDFS, YARN, HBase, etc.). My first instinct at
> modeling was to introduce a series of classes capturing the
> configuration of these subsystems, plus a top-level class corresponding
> to settings common to the entire cluster.

Perhaps this is just a terminology or mindset problem, or maybe I'm hearing you wrong, but it sounds like you're focusing on data, whereas I think you should be focusing first on modeling system components and their relationships in the large. Very much as Rich described, in fact (apologies to those following along on puppet-users). Puppet DSL rewards the top-down approach more than most languages.

> IOW, I would like to be able to express things like "in this cluster,
> for every subsystem that cares about authentication the setting should
> be 'kerberos', all of the JVMs should be given at minimum 1G of RAM,
> and node X should host HDFS's namenode".

You're starting off with some architectural assumptions already, though they don't seem unreasonable (just abstract). You might want to have a look at this Puppet architectural pattern: http://www.craigdunn.org/2012/05/239/. It has been getting a fair amount of attention around here lately. It looks well thought out, and all the commentary I've read about it has been positive. Even if you don't find it suitable for your own needs, evaluating the approach might nevertheless prove a useful exercise.
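To give a flavor of it: the pattern separates business-level "roles" from technology-level "profiles", with each node receiving exactly one role. A skeletal rendition (all names here are invented for illustration, including the stub standing in for your real Bigtop class) might look like:

    # stand-in for the real Bigtop class, just to make the sketch complete
    class bigtop::hdfs::datanode($namenode_uri) {
      notify { "datanode pointed at ${namenode_uri}": }
    }

    # a profile wraps one technology component and feeds it its data
    class profile::hdfs::datanode {
      class { 'bigtop::hdfs::datanode':
        namenode_uri => hiera('namenode_uri', 'hdfs://nn.cluster.mycompany.com'),
      }
    }

    # a role bundles the profiles that make up one kind of node
    class role::hadoop_worker {
      include profile::hdfs::datanode
      # include profile::yarn::nodemanager, etc.
    }

    node 'worker01.cluster.mycompany.com' {
      include role::hadoop_worker
    }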
I include additional responses below to some of your questions, but you might be better off setting the rest of this aside for now and coming back to it when you're dealing with these issues in the context of implementing the reference examples Rich suggested.

> All of this brings us to question #1.
>
> Q1: what would be the most natural way to instantiate such classes on
> every node that would satisfy the #1-#3 styles of use?
>
> Initially, I was thinking of an ENC-style approach where a complete
> manifest of classes, their arguments and top-level parameters can be
> expected on/for every node.

In other words, a program will compute (or a static data file will provide) a complete description of the configuration to be applied to the target node, all in one big chunk, and Puppet will, one way or another, interpret that data to apply the desired configuration. Surely there are some subtleties or implications that I do not capture with that characterization, though, because otherwise that's not very meaningful. Every Puppet configuration is more or less like that, so what are the distinguishing features of this particular idea?

> This style has a nice property of making classes completely independent
> of a particular way of instantiating them. IOW, I do not care whether a
> user of Bigtop's puppet code will explicitly put something like:
>
>     class { "bigtop::hdfs::datanode":
>       namenode_uri => "hdfs://nn.cluster.mycompany.com"
>       ....
>     }
>
> or whether somehow an ENC will generate:
>
>     classes:
>       bigtop::hdfs::datanode:
>         namenode_uri: hdfs://nn.cluster.mycompany.com
>
> The classes do NOT care how they are being instantiated.

Terminology nitpick: puppet classes are not "instantiated" in any case. I raise this because it is important to keep in mind that although Puppet terminology shares a fair number of terms with OO-speak, Puppet DSL is not an OO language. The borrowed terms often have different meanings or implications in Puppet. Classes (there's one of those terms) are "declared", "included", or maybe "assigned". With that said, I don't understand the point. In what way do Puppet classes ever care how they are instantiated? Especially, what about this approach produces that result?

> Well, almost. They don't if I'm willing to make their use
> super-verbose, essentially requiring that every single setting is given
> explicitly. IOW, even though something like namenode_uri will be
> exactly the same for all the services comprising the HDFS subsystem, I
> will still require it to be set explicitly for every single class that
> gets instantiated (even on the same node).

Yuck.

[...]

> Q2: In a situation like this, what would be an ideal way of making
> class instantiations less verbose? Also, as long as we are making them
> less verbose via some external means like hiera or extlookup, is there
> any point in keeping the instantiations along the lines of:
>
>     class { "bigtop::hdfs::secondarynamenode": }
>
> instead of:
>
>     include bigtop::hdfs::secondarynamenode
>
> ?

If you are going to rely on hiera or another external source for all class data, then you absolutely should use the 'include' form, not the parametrized form. The latter carries the constraint that it can be used only once for any given class, and that use must be the first one parsed. There are very good reasons, however, why sometimes you would like to declare a given class in more than one place. You can work around any need to do so with enough effort and code, but that generally makes your manifest set more brittle and/or puts substantially greater demands on your ENC. I see from your subsequent comment (elided) that you recognize that, at least to some degree.
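Concretely, the 'include' style might look like this (a sketch only; the key name is whatever you choose to make it, and I am assuming Puppet 3's bundled hiera -- on 2.7 the hiera-puppet add-on, or extlookup, can play the same part):

    class bigtop::hdfs::secondarynamenode {
      # the class fetches its own data, so it needs no parameters
      $namenode_uri = hiera('bigtop::hdfs::namenode_uri')
      # ... resources that use $namenode_uri go here ...
    }

    # now this is safe to repeat from as many places as you like
    include bigtop::hdfs::secondarynamenode

with a corresponding entry in your hiera data:

    # e.g. /etc/puppet/hieradata/common.yaml
    bigtop::hdfs::namenode_uri: hdfs://nn.cluster.mycompany.com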
> Finally, the question that has become really obvious to me after
> pondering this design is: which ways of capturing a datum (e.g. facter,
> top-scope variables, node-scope variables, class-scope variables,
> parent-class-scope variables) are the most appropriate for the
> different types of information that we need to express about our
> clusters?

Puppet best practices generally hold that reliance on top-scope variables is best avoided. There are a few notable exceptions; principally, these are node facts and the globals provided by Puppet itself. Node facts of course include any custom facts that you may create. Puppet best practices also hold that class inheritance (there's another of those terms) is inappropriate for most classes, so, usually, few classes have a parent-class scope to consider. There is an important and relevant exception, however: a parametrized class may inherit from a data-only parent class (conventionally named <module>::params) to set the default values of its own parameters from the class variables of that class. That usage is reasonably well regarded. Otherwise, the only reason to use class inheritance is for the subclass to override properties of resources declared by its parent, and even that can often be readily replaced by more data-centric approaches these days.
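For reference, the <module>::params pattern looks roughly like this (a sketch; the package names are invented):

    # data-only class: no resources, just values keyed off node facts
    class bigtop::params {
      case $::osfamily {
        'RedHat': { $jdk_package = 'java-1.7.0-openjdk' }
        'Debian': { $jdk_package = 'openjdk-7-jre-headless' }
        default:  { fail("Unsupported osfamily: ${::osfamily}") }
      }
      $namenode_uri = 'hdfs://nn.cluster.mycompany.com'
    }

    # the inheriting class may default its parameters from the data class
    class bigtop::hdfs::datanode(
      $namenode_uri = $bigtop::params::namenode_uri,
      $jdk          = $bigtop::params::jdk_package
    ) inherits bigtop::params {
      package { $jdk: ensure => installed }
      # ... datanode resources that use $namenode_uri go here ...
    }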
Not mentioned in that list, however, are other-class-scope variables. Puppet classes do not provide any data encapsulation, so the variables of all declared classes are accessible, and using them is not inherently unreasonable. (The lack of encapsulation is not a functional risk because Puppet "variables" cannot be modified once set, and they are set when (and only when) the class to which they belong is parsed.)

> Q3: IOW, what are the best practices for managing the following classes
> of class parameters (categorization stolen from Rich):

You are supposing that these will be modeled as class parameters in the first place. Certainly they are data that characterize the configuration to be deployed, and it is possible to model them as class parameters, but that is not the only -- and maybe not the best -- alternative available to you. Class parametrization is a protocol for declaring and documenting the data on which a class relies, and it enables mechanisms for obtaining that data that non-parametrized classes cannot use, but the same configuration objectives can be accomplished without them.

> 1) variables that are defaults that can be rationally set based on
> properties of the node itself, such as using the OS family to choose
> the package manager or package name to use,

Some of these (the package manager, for instance) Puppet determines automatically. You can override its choices if necessary, but it is pretty reliable about doing the right thing in the cases it covers. Other cases in this category don't have such clear-cut answers, but often the variables and the logic to choose their values reside in the classes that use them, or in classes close to those, following a modular approach to manifest set design. Some prefer to externalize the data and thus replace the logic with an external lookup, and although that wouldn't be appropriate for every case, the prevailing opinion in the community is that separating data from code is a principle of good manifest design. My personal opinion is that if you anticipate ever wanting to change or extend the data, then it's a good idea to pull it out.

> 2) variables that are set as part of a group and the role it is
> playing, such as the set of common variables that all "slave" nodes in
> a cluster should have,

Do I understand correctly that we're talking about data that should be the same for all members of the given group (or all nodes having the given role)? Or are we talking about data that make sense only in the context of a given group / role, but may differ among nodes? Are there data among these that apply to multiple groups / roles, but whose values may differ on a per-group or per-role basis? The best ways to handle such data may differ depending on the details. Supposing that we are talking about group-specific data that will be common to all members of the group, it seems best to build some kind of abstraction representing the group, and to associate the data with that. That could mean a class and class variables (whether or not they are parameters), it could be a single array- or hash-valued variable in some global class, it could be a group-specific data file at some level of an external data hierarchy, etc.
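The first of those options might be as simple as this (again a sketch, with invented names):

    # one class per group; members include it and read its variables
    class bigtop::group::slave {
      $jvm_heap     = '1g'
      $mount_points = ['/data/1', '/data/2']
    }

    class bigtop::hbase::regionserver {
      include bigtop::group::slave
      $heap = $bigtop::group::slave::jvm_heap
      # ... regionserver resources sized by $heap and laid out over
      # $bigtop::group::slave::mount_points go here ...
    }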
> 3) variables that are set as a function of other components that relate
> or connect to them, such as a client needing the port and host address
> of a server -- that is, variables that depend on topology,

Exported resources are often used for such purposes, but to the best of my knowledge they rely on a master. For a masterless scenario, this really isn't any different from (2), where the group in question is "consumers of service A" or even "all nodes in the cluster".

> 4) variables that are set based on external context but can be
> categorized on a node-by-node basis, such as the NTP server address
> based on the location of the data center, or which users should have
> logon access to which machines.

If the point is that the data will be chosen based on facts presented by the client node, then I would be inclined to put such data into classes, and to assign those classes the responsibility to look up the appropriate value for the facts presented. That implies some kind of lookup table, which could reside in the class itself or in an external data store, or which could be provided as a class parameter.

John