A simple walk-through of the standard commands, along with some best practices, for setting up Cloudera’s Hadoop infrastructure: Cloudera Manager 5.4.1 & CDH 5.4.1

Today I’ll be loading Cloudera Hadoop on a six-node virtual cluster on top of VMware: one dedicated Cloudera Manager VM plus five worker nodes. All of the nodes will be running Ubuntu 14.04.2 and, of course, updated to the latest packages and dist-upgrades via apt-get, like below:

$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get dist-upgrade
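
If you plan to run these updates unattended on every node, the same three commands can be chained with -y to auto-accept the prompts:

$ sudo apt-get update && sudo apt-get -y upgrade && sudo apt-get -y dist-upgrade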

You may need to manually edit the IP address and hostname if this was not done at install time:

$ sudo nano /etc/network/interfaces

Then edit the info to be specific to your network:

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet static
        address 10.200.2.2
        netmask 255.255.0.0
        network 10.200.0.0
        broadcast 10.200.255.255
        gateway 10.200.0.1
        # dns-* options are implemented by the resolvconf package, if installed
        dns-nameservers 208.67.222.222 208.67.220.220

Then restart networking the easy way 😉:

$ sudo ifdown eth0 && sudo ifup eth0
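
To confirm the new address took effect, check the interface and make sure the gateway answers (the gateway below is the one from my interfaces file; substitute your own):

$ ifconfig eth0 | grep "inet addr"
$ ping -c 3 10.200.0.1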

Hostname attributes need to be changed in two places to stick after reboot.

1. Run the aptly named hostname command:

$ sudo hostname NEW_NAME_HERE

2. To change the name permanently, run the following command to edit the two relevant files:

$ sudo nano /etc/hostname /etc/hosts

Note: The above command opens the two files one after the other for individual editing.

3. The /etc/hostname file may already contain the hostname you chose with the hostname command; if it doesn’t, just write out the name you would like the server to have and save the file. (A scripted version of these steps appears after this list.)

4. The /etc/hosts file should now look like the below, where masbdevcm is the hostname of my VM and the 127 address is of course the loopback interface:

127.0.0.1       localhost
10.200.2.2      masbdevcm
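
If you would rather script steps 1-3 than open an editor, something like this works (a sketch; NEW_NAME is whatever hostname you chose, and /etc/hosts still gets edited by hand as in step 4):

$ NEW_NAME=masbdevcm   # substitute your chosen hostname
$ echo "$NEW_NAME" | sudo tee /etc/hostname
$ sudo hostname "$NEW_NAME"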

Next, Cloudera recommends turning off SELinux (CentOS) and AppArmor (Ubuntu). We will also need to disable iptables if it is running. This is controversial, but I am going ahead with it to make configuration easier, and because this cluster will never see anything but local LAN traffic after the Cloudera software is downloaded and upgrades are complete.

1. Check the AppArmor status:

$ sudo apparmor_status
[sudo] password for dave:
apparmor module is loaded.
4 profiles are loaded.
4 profiles are in enforce mode.
   /sbin/dhclient
   /usr/lib/NetworkManager/nm-dhcp-client.action
   /usr/lib/connman/scripts/dhclient-script
   /usr/sbin/tcpdump
0 profiles are in complain mode.
0 processes have profiles defined.
0 processes are in enforce mode.
0 processes are in complain mode.
0 processes are unconfined but have a profile defined.

2. Disable AppArmor and unload the kernel module by entering the following:

$ sudo /etc/init.d/apparmor stop
$ sudo update-rc.d -f apparmor remove

or

$ sudo service apparmor stop
$ sudo update-rc.d -f apparmor remove

3. Remove the AppArmor software:

$ sudo apt-get remove apparmor apparmor-utils -y

4. Check the status of iptables (using the built-in ufw, included since Ubuntu 8.04):

$ sudo ufw status verbose; sudo iptables -L
Status: inactive
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

5. Disable iptables:

$ sudo ufw disable
Firewall stopped and disabled on system startup
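
Before moving on, a quick sanity check that both are really out of the picture (dpkg may still list “rc” entries for removed-but-not-purged packages, which is fine):

$ dpkg -l | grep apparmor   # config-only "rc" entries are harmless
$ sudo ufw status
Status: inactive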

Next I want to populate the /etc/hosts file with the names of all of the servers in the cluster, since I won’t have a DNS server set up:

$ sudo nano /etc/hosts

and add entries for my worker nodes:

127.0.0.1       localhost
10.200.2.2      masbdevcm
10.200.2.3      masbdevn1
10.200.2.4      masbdevn2
10.200.2.5      masbdevn3
10.200.2.6      masbdevn4
10.200.2.7      masbdevn5
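
You can confirm an entry resolves without DNS by using getent, which consults /etc/hosts:

$ getent hosts masbdevn3
10.200.2.5      masbdevn3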

Now, disable IPv6.

1. Open /etc/sysctl.conf:

$ sudo nano /etc/sysctl.conf

2. Insert the following lines at the end:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

3. Reload sysctl:

$ sudo sysctl -p
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

After that, if you run:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

It will report:

1

If you see 1, IPv6 has been successfully disabled.
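
Another quick check: ip should no longer list any inet6 addresses on any interface (no output means IPv6 is off):

$ ip a | grep inet6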

Cloudera requires either an SSH-enabled root account with a password, or certificates distributed to each node. Again, for the sake of brevity, I will enable root SSH access with a password.

1. Set up a password for the root account.

$ sudo passwd
[sudo] password for dave:
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

2. Edit /etc/ssh/sshd_config, and comment out the following line (a sed one-liner covering steps 2-4 follows this list):

PermitRootLogin without-password

3. Just below it, add the following line:

PermitRootLogin yes

4. Then reload SSH:

$ sudo service ssh reload
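
If you would rather not open an editor, the sshd_config change from steps 2-3 can be done with a sed one-liner (a sketch; it writes a .bak backup and assumes the stock “PermitRootLogin without-password” line is present):

$ sudo sed -i.bak 's/^PermitRootLogin without-password$/PermitRootLogin yes/' /etc/ssh/sshd_config
$ sudo service ssh reload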

Since all of my nodes are virtual, I’ll also install VMware Tools now, so that when I clone all of the other nodes off of this one I only have to do it once.
1. In Virtual Center, right-click the guest and select Guest > Install VMware Tools.
2. Run this command to create a directory to mount the CD-ROM:

$ sudo mkdir /mnt/cdrom

3. Run this command to mount the CD-ROM:

$ sudo mount /dev/cdrom /mnt/cdrom
mount: block device /dev/sr0 is write-protected, mounting read-only

4. The file name of the VMware Tools bundle varies depending on your version of the VMware product. Run this command to find the exact name:

$ ll /mnt/cdrom/
total 62824
dr-xr-xr-x 2 root root     2048 Mar 30 19:49 ./
drwxr-xr-x 3 root root     4096 May 22 11:23 ../
-r-xr-xr-x 1 root root     1969 Mar 30 19:16 manifest.txt*
-r-xr-xr-x 1 root root     1850 Mar 30 19:14 run_upgrader.sh*
-r--r--r-- 1 root root 62924912 Mar 30 19:16 VMwareTools-9.4.12-2627939.tar.gz
-r-xr-xr-x 1 root root   693484 Mar 30 19:15 vmware-tools-upgrader-32*
-r-xr-xr-x 1 root root   702400 Mar 30 19:15 vmware-tools-upgrader-64*

5. Run this command to extract the contents of the VMware Tools bundle:

$ tar xzvf /mnt/cdrom/VMwareTools-x.x.x-xxxx.tar.gz -C /tmp/

6. Run this command to change directories into the VMware Tools distribution:

$ cd /tmp/vmware-tools-distrib/

7. Run this command to install VMware Tools:

$ sudo ./vmware-install.pl -d

Note: The -d switch accepts all of the defaults. If you do not use -d, press Return to accept each default or supply your own answers. There is also an Ubuntu-maintained package (open-vm-tools) that some versions of VMware Tools will prompt you to install instead. In my case I am just going to use the VMware-built tools, but running with -d will in some cases exit the install in favor of open-vm-tools:

$ sudo ./vmware-install.pl
open-vm-tools are available from the OS vendor and VMware recommends using
open-vm-tools. See http://kb.vmware.com/kb/2073803 for more information.
Do you still want to proceed with this legacy installer? [no] y

8. Run this command to reboot the virtual machine after the installation completes (note: you may need to Ctrl+Z the install script):

$ sudo reboot
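
After the reboot, you should be able to confirm the tools are running with vmware-toolbox-cmd, which the installer places on the PATH:

$ vmware-toolbox-cmd -v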

Ok, now we are ready to clone this VM and create however many other servers we would like from it. In this case we will need to clone it five times, as this original node will be the Cloudera Manager node. I’ll also be extending each of the five cloned VMs’ hard drives to 500 GB, as they will be the nodes I create my HDFS data space on. There is a great tutorial on how to use gparted’s live CD to do the partition expansion here.

$ sudo shutdown -P 0

<< Clone VMs 😉 >>

Next, go ahead and start up the five cloned VMs. I don’t need SSH access at this point since I can use the VMware console, but if you do, start the clones one at a time and re-address each before bringing up the next, to avoid conflicts with the IP address we set on the first VM. Once all the VMs are up, set the IP address and hostname of each node just like we did much earlier in the post.
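
With every node re-addressed, a quick loop from the Cloudera Manager VM confirms each one answers on its new IP and reports the right hostname (you will be prompted for the root password on each, and to accept each host key the first time):

$ for i in 3 4 5 6 7; do ssh root@10.200.2.$i hostname; done   # node IPs from /etc/hosts above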

Cloudera Manager (Cloudera Install Instructions)

Finally, we are ready to install Cloudera Manager. SSH to the Cloudera Manager VM and log in as root:

$ ssh root@10.200.2.2

1. Download the Cloudera Manager installer binary from Cloudera Manager Downloads to the cluster host where you want to install the Cloudera Manager Server.

Download the installer:

$ wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin

2. Make cloudera-manager-installer.bin executable.

$ sudo chmod u+x cloudera-manager-installer.bin

3. Run the Cloudera Manager Server installer.

$ sudo ./cloudera-manager-installer.bin

This will start the TUI installer and you should be presented with the below screen:
[Screenshot]

Go ahead and “Next” through the license agreements for Cloudera and Oracle, and the install will begin:
[Screenshot]

Once complete, and assuming everything installed successfully, you will be presented with the below:
[Screenshot]

Follow the onscreen instructions to log in to the web UI (in our case, replace localhost with the IP address we set earlier for the VM: http://10.200.2.2:7180):
[Screenshot]

You will now be asked to select a license; make sure you have Cloudera Express highlighted in green before clicking Continue:
[Screenshot]

There will be another screen telling you about the Cloudera modules that Express allows you to install, and then you will be taken to the next screen, where we can specify our worker nodes (make sure you include the Cloudera Manager VM for health reporting):
[Screenshot]

Once we specify all nodes in the cluster, click the search button to verify connectivity and then continue:
[Screenshot]

The next screen will present us with options for using Packages vs. Parcels. I’m not going to get into the difference here, but we will leave all defaults on this screen and click continue:
[Screenshot]

Next we will simply check the box for “Install Oracle Java SE Development Kit (JDK)” but not for the encryption policies:
[Screenshot]

Step 3 in the wizard (highlighted by the little squares at the bottom) asks if we would like to “Enable Single User Mode”. You may want this in production as it is a security enhancement. Here I will not be using it:
[Screenshot]

Step 4 asks us about login credentials on our VMs. We have already enabled root SSH access with a password above, so all we have to do is type our password in the two text boxes and continue:
[Screenshot]

Almost done! Step 5 will begin the installation of the Java JDK and the Cloudera agent on all nodes:
[Screenshot]

Once complete, the wizard skips Step 6 and jumps straight to Step 7, which downloads, distributes, unpacks, and activates the required parcels on all of the VMs in the cluster (you can see a progress bar for each VM):
[Screenshot]

The last and final step runs a health check on all nodes in the cluster. In our case all checks passed except for the swappiness setting in Ubuntu Server. The error message suggests solving the problem by setting vm.swappiness to 10 in sysctl.conf, but I prefer this setting to be 1 for VMs. Let’s set that by logging into all the VMs as root and making the change:

$ nano /etc/sysctl.conf

Add this to the bottom of the file:

# Decrease swappiness
vm.swappiness = 1
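
The sysctl.conf entry only takes effect at boot; to apply the new value immediately on each VM as well:

$ sysctl vm.swappiness=1
vm.swappiness = 1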

The Host Inspector, now at 100%:
[Screenshot]

More wizardry! Now comes the fun part: we get to choose which services we would like to install. I’m going to choose the first option, as I would like to have what Cloudera considers “Core Hadoop”:
[Screenshot]

One advantage of Cloudera Manager’s really streamlined wizard is the next screen, where I can easily pick and choose the services to run on each node. Cloudera Manager will attempt to lay out the services in the most logical pattern for the nodes you have provisioned, and for the most part it does a pretty good job. Again, I’ll leave everything at the defaults. It is really important here to check which nodes get selected as DataNodes: in my case, the Cloudera Manager node does not have the storage that masbdevn[1-5] do, so I made sure the system did not select it, and in fact it didn’t. Like I said, awesome logic in the wizard:
[Screenshot]

Next up is database setup. No, this isn’t HDFS; this is just the PostgreSQL database the system uses for metadata (and the wizard will set it up for you 😉):
[Screenshot]

What do you know, a review of the changes BEFORE they are made! Again, all defaults:
[Screenshot]

Away we go provisioning services!
[Screenshot]

All services provisioned in a little over six minutes… try to do that by hand faster 😯:
[Screenshot]

And last but not least, a sincere congrats!
[Screenshot]

When the wizard exits, it will take you to the Cloudera Manager console, which auto-updates with the latest status from all of your services:
[Screenshot]

So at this point, the cluster is online and ready for work. I’ll post more articles on submitting jobs to the cluster from Python soon!