thr3ads.net - Gluster users - [Gluster-users] Can Hadoop run on gluster in 1 JT, N TT setup or only works for 1 JT+TT? [Jan 2012]

Fermín Galán Márquez

2012-Jan-30 17:00 UTC

[Gluster-users] Can Hadoop run on gluster in 1 JT, N TT setup or only works for 1 JT+TT?

Hi,

Recently I've set up a Gluster cluster to run Hadoop M/R jobs, following
the document at
http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/Gluster_Hadoop_Compatible_Storage.pdf.

As far as I can tell from my tests, the gluster_hadoop.jar plugin
automatically mounts the gluster volume on the JT node, and the TT (on
the same node) then uses that mountpoint to do its work. That's fine
if JT and TT are running on the same node (i.e. a one-node setup (*)).
However, when I test a 2-node (*) setup in which the JT runs on one
node and the TT on another, it doesn't work (e.g. hadoop jar
stalls at "INFO mapred.JobClient: map 0% reduce 0%" with no
progress after that). In the end that makes sense, given that the
gluster volume is not mounted on the TT node (it's only mounted on the
JT node).
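
One way to confirm on each node whether the volume is actually mounted is to look for a GlusterFS FUSE entry in the mount table. The following is only a minimal sketch: it assumes a Linux node where GlusterFS FUSE mounts show up with filesystem type `fuse.glusterfs`, and the function name `glusterfs_mounts` is made up for illustration.

```python
def glusterfs_mounts(mounts_text):
    """Return the mount points whose filesystem type is fuse.glusterfs.

    mounts_text is expected to be in /proc/mounts format:
    device, mount point, fs type, options, dump, pass.
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip blank or malformed lines; the fs type is the third field.
        if len(fields) >= 3 and fields[2] == "fuse.glusterfs":
            points.append(fields[1])
    return points
```

On a TT node, passing the contents of /proc/mounts to this function and getting an empty list back would be consistent with the stall described above.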

This is a bit annoying to me, given that I was expecting the gluster
volume to get mounted on the TT nodes, which are the ones that actually
need to access the data in the filesystem.

So, is it not possible to run a 1 JT, N TT Hadoop cluster with gluster?
Does it only work with 1 JT+TT?

Or maybe I'm doing something wrong, or maybe I'm not correctly
understanding the document at
http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/Gluster_Hadoop_Compatible_Storage.pdf
(any piece of information about Hadoop running on gluster is highly
welcome, please).

I'm using Hadoop 0.20.2 and Gluster 3.3beta2. If you need to know any
other information about my setup, don't hesitate to ask for it!

Thanks in advance!

Best regards,

------
Fermín

(*) I refer to nodes in the Hadoop cluster, no matter how many nodes
implement the gluster cluster (the latter are "abstracted" by the
mountpoint, as far as I understand)

Este mensaje se dirige exclusivamente a su destinatario. Puede consultar nuestra
política de envío y recepción de correo electrónico en el enlace situado más
abajo.
This message is intended exclusively for its addressee. We only send and receive
email on the basis of the terms set out at
http://www.tid.es/ES/PAGINAS/disclaimer.aspx

Venky Shankar

2012-Jan-30 17:14 UTC

[Gluster-users] Can Hadoop run on gluster in 1 JT, N TT setup or only works for 1 JT+TT?

Hi,

Can you please dump the contents of conf/core-site.xml from the JT and TT (or
attach them)?

We have tested the plugin with 1 hadoop master (JT) and 8 Hadoop Task Trackers
(TT), so it should work with your setup too.

Additionally, it would help if you could send us the JobTracker and
TaskTracker logs. (If they are huge, paste the last 50-odd lines.)

Thanks,
-Venky

Venky Shankar

2012-Jan-31 05:45 UTC

[Gluster-users] (Fixed) Re: Can Hadoop run on gluster in 1 JT, N TT setup or only works for 1 JT+TT?

[snip]
>     * Are you collocating gluster cluster peers with TT nodes (I mean,
>       each one of the 8 TT nodes is also a gluster peer in the
>       cluster) or is the gluster cluster running on separate nodes?
>
Yes, you are right. Each TaskTracker node is a gluster peer in the cluster.
>     * In the case the answer to the question above is that they are
>       collocated, which fs.glusterfs.server are you using in each TT?
>
For the TaskTracker, fs.glusterfs.server would be _any_ one of the 
gluster peers (i.e. any one of the 8 machines considering you have a 1JT 
+ 8TT setup). For simplicity, stick to one hostname/ip for this, since 
that would make deployment easier (no need to edit core-site.xml on 
every machine)
> I'm asking because I have in mind a configuration like
> this:
>
> TT1 -> fs.glusterfs.server @ core-site.xml in TT1 = IP_TT1
> TT2 -> fs.glusterfs.server @ core-site.xml in TT2 = IP_TT2
> ...
> TTn -> fs.glusterfs.server @ core-site.xml in TTn = IP_TTn
This will definitely work for you, but as I said, stick to one
hostname/ip. So for each (TT1, TT2 .. TTn) use IP_TT1.
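
To illustrate the "one hostname/ip everywhere" advice, here is a minimal sketch that renders the same fs.glusterfs.server property for every node, so a single core-site.xml can be copied around unchanged. Only the property name comes from this thread; the helper names and the node labels are made up for illustration, and the other properties the plugin needs are deliberately left out.

```python
from xml.sax.saxutils import escape


def property_xml(name, value):
    """Render one Hadoop <property> element as text."""
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(value)}</value>\n"
        "</property>"
    )


def server_property_per_node(nodes, server):
    """Map every Hadoop node to the same fs.glusterfs.server property."""
    fragment = property_xml("fs.glusterfs.server", server)
    return {node: fragment for node in nodes}
```

Calling `server_property_per_node(["JT", "TT1", "TT2"], "IP_TT1")` yields an identical fragment for each node, which is the point: no per-machine editing is needed.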
>
> so each TT mounts "itself", which I suppose achieves data locality
> similar to that achieved with HDFS (assuming the gluster driver
> is clever enough to use the local disk when the data is located on the
> same node). Does this configuration make sense?
Exactly! Each TT node (and the JT too) does a GlusterFS FUSE mount to
get a _view_ of the entire namespace of the FS. The JobTracker schedules
jobs on TaskTracker nodes. When a job runs on a TT node, all I/O is
done through the GlusterFS mount. Data locality is a bit of a catch
here: since all I/O calls go through the mount, each call has to take
the route of client translator(s) -> server translator(s) before it hits
the posix layer (even if the client and the server are on the same node,
the TT in this case).

To optimize this we introduced a configurable option, "quick.slave.io".
This is essentially a "short circuit" for the case I just mentioned
above. When the job wants to read from a particular offset in the file,
the GlusterFS Hadoop plugin checks whether the (offset, length) in
question is present in the backend file system. If so, it
satisfies the read directly from the backend FS instead of going through
the FUSE mount, thereby saving context switches, translator overhead, etc.
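
The short-circuit idea can be sketched roughly as follows. This is not the plugin's code: the function name, the `local_extents` argument (a list of byte ranges assumed to be stored on this node's backend), and the way locality is determined are all assumptions made for illustration.

```python
def read_range(mount_path, backend_path, offset, length, local_extents):
    """Read `length` bytes at `offset`, preferring the local backend file.

    local_extents is a hypothetical list of (start, stop) byte ranges that
    are present in the backend file on this node. If the requested range
    fits inside one of them, bypass the mount; otherwise go through it.
    """
    end = offset + length
    is_local = any(start <= offset and end <= stop
                   for start, stop in local_extents)
    path = backend_path if is_local else mount_path
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    return data, ("backend" if is_local else "mount")
```

The second element of the returned tuple only records which path served the read, which makes the fallback behaviour easy to observe when experimenting.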

A bit more info: this option is not well tested, so we default it to "Off"
in core-site.xml. If you do try it out, please let us know if you hit any
bugs (and please file them too!).

HTH

Thanks,
-Venky
>
> Thanks!
>
> Best regards,
>
> ------
> Fermín
