A simple walk-through with the standard commands used, along with some best practices, for setting up Cloudera’s Hadoop Infrastructure: Cloudera Manager 5.4.1 & CDH 5.4.1
Today I’ll be loading Cloudera Hadoop on a virtual six node cluster, with a separate Cloudera Manager vm, on top of vmware. All of the nodes will be running Ubuntu 14.04.2 and of course updated to the latest apt-get packages and dist-upgrades like below:
$ sudo apt-get update $ sudo apt-get upgrade $ sudo apt-get dist-upgrade
You may need to manually edit the IP address and hostname if this was not done on install:
$ sudo nano /etc/network/interfaces
Then edit the info to be specific for your network:
# The loopback network interface auto lo iface lo inet loopback # The primary network interface auto eth0 iface eth0 inet static address 10.200.2.2 netmask 255.255.0.0 network 10.200.0.0 broadcast 10.200.255.255 gateway 10.200.0.1 # dns-* options are implemented by the resolvconf package, if installed dns-nameservers 18.104.22.168 22.214.171.124
Then restart networking the easy way 😉 :
$ sudo ifdown eth0 && sudo ifup eth0
Hostname attributes need to be changed in two places to stick after reboot.
1. Run the aptly named hostname command:
$ sudo hostname NEW_NAME_HERE
2. To change the name permanently, run command to edit the host files:
$ sudo nano /etc/hostname /etc/hosts
Note: The above command will allow editing the two files individually one-after-another.
3. The edit to the /etc/hostname file may already be the hostname you have chosen by running the hostname command, if it isn’t just write out the name you would like the server to have and save the file.
4. The /etc/hosts file should now look like below, where masbdevcm is the hostname of my vm and the 127 address is of course the loopback interface:
127.0.0.1 localhost 10.200.2.2 masbdevcm
Next, Cloudera recommends turning off SELinux (CentOS) and AppArmor (Ubuntu). We will also need to disable iptables if those are running. This is highly controversial but I am going ahead and doing this to make configuration easier and the fact that this cluster will never see anything but local LAN traffic after the Cloudera software is downloaded and upgrades are complete.
1. How to check AppArmor status :
$ sudo apparmor_status [sudo] password for dave: apparmor module is loaded. 4 profiles are loaded. 4 profiles are in enforce mode. /sbin/dhclient /usr/lib/NetworkManager/nm-dhcp-client.action /usr/lib/connman/scripts/dhclient-script /usr/sbin/tcpdump 0 profiles are in complain mode. 0 processes have profiles defined. 0 processes are in enforce mode. 0 processes are in complain mode. 0 processes are unconfined but have a profile defined.
2. Disable AppArmor and unload the kernel module by entering the following:
$ sudo /etc/init.d/apparmor stop $ sudo update-rc.d -f apparmor remove
$ sudo service apparmor stop $ sudo update-rc.d -f apparmor remove
3. Remove AppArmor software :
$ sudo apt-get remove apparmor apparmor-utils -y
4. Check status of iptables (using built in ufw ubuntu 8.04+):
$ sudo ufw status verbose; sudo iptables -L Status: inactive Chain INPUT (policy ACCEPT) target prot opt source destination Chain FORWARD (policy ACCEPT) target prot opt source destination Chain OUTPUT (policy ACCEPT) target prot opt source destination
5. Disable iptables:
$ sudo ufw disable Firewall stopped and disabled on system startup
Next I want to set the /etc/hosts file with all of the names of all of the servers in the cluster since I won’t have a DNS server setup:
$ sudo nano /etc/hosts
And Add entries for my worker nodes:
127.0.0.1 localhost 10.200.2.2 masbdevcm 10.200.2.3 masbdevn1 10.200.2.4 masbdevn2 10.200.2.5 masbdevn3 10.200.2.6 masbdevn4 10.200.2.7 masbdevn5
Now, disable IPv6.
1. Open /etc/sysctl.conf:
$ sudo nano /etc/sysctl.conf
2. Insert the following lines at the end:
net.ipv6.conf.all.disable_ipv6 = 1 net.ipv6.conf.default.disable_ipv6 = 1 net.ipv6.conf.lo.disable_ipv6 = 1
3. Reload sysctl:
$ sudo sysctl -p net.ipv6.conf.all.disable_ipv6 = 1 net.ipv6.conf.default.disable_ipv6 = 1 net.ipv6.conf.lo.disable_ipv6 = 1
After that, if you run:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
It will report:
If you see 1, ipv6 has been successfully disabled.
Cloudera requires an ssh root account enabled with a password, else certificates to be distributed to each node. Again, for the sake of brevity, I will enable root ssh access with a password.
1. Setup a password for the root account.
$ sudo passwd [sudo] password for dave: Enter new UNIX password: Retype new UNIX password: passwd: password updated successfully
2. Edit /etc/ssh/sshd_config, and comment out the following line:
3. Just below it, add the following line:
4. Then restart SSH:
$ sudo service ssh reload
Since all of my nodes are virtual, I’ll also install vmware tools so when I clone all of the other nodes off of this one I only have to do it once.
1. In Virtual Center, right-click guest, select guest, install vmware tools
2. Run this command to create a directory to mount the CD-ROM:
$ sudo mkdir /mnt/cdrom
3. Run this command to mount the CD-ROM:
$ sudo mount /dev/cdrom /mnt/cdrom mount: block device /dev/sr0 is write-protected, mounting read-only
4. The file name of the VMware Tools bundle varies depending on your version of the VMware product. Run this command to find the exact name:
$ ll /mnt/cdrom/ total 62824 dr-xr-xr-x 2 root root 2048 Mar 30 19:49 ./ drwxr-xr-x 3 root root 4096 May 22 11:23 ../ -r-xr-xr-x 1 root root 1969 Mar 30 19:16 manifest.txt* -r-xr-xr-x 1 root root 1850 Mar 30 19:14 run_upgrader.sh* -r--r--r-- 1 root root 62924912 Mar 30 19:16 VMwareTools-9.4.12-2627939.tar.gz -r-xr-xr-x 1 root root 693484 Mar 30 19:15 vmware-tools-upgrader-32* -r-xr-xr-x 1 root root 702400 Mar 30 19:15 vmware-tools-upgrader-64*
5. Run this command to extract the contents of the VMware Tools bundle:
$ tar xzvf /mnt/cdrom/VMwareTools-x.x.x-xxxx.tar.gz -C /tmp/
6. Run this command to change directories into the VMware Tools distribution:
$ cd /tmp/vmware-tools-distrib/
7. Run this command to install VMware Tools:
$ sudo ./vmware-install.pl -d
Note: The -d switch assumes that you want to accept the defaults. If you do not use -d, press Return to accept each default or supply your own answers. There is an Ubuntu maintained apt-get package that some versions of vmware tools will prompt you to install. In my case I am just going to use the vmware built tools, but doing a -d above will in some cases exit the install in favor of the open-vm-tools:
$ sudo ./vmware-install.pl open-vm-tools are available from the OS vendor and VMware recommends using open-vm-tools. See http://kb.vmware.com/kb/2073803 for more information. Do you still want to proceed with this legacy installer? [no] y
8. Run this command to reboot the virtual machine after the installation completes (Note you may need to Ctrl+Z the install script):
$ sudo reboot
Ok, now we are ready to clone this vm and create however many other servers we would like from it. In this case we will need to clone it five times as this original node will be the Cloudera Manager node. I’ll also be extending the five cloned vm’s hard drive to 500 GB as they will be the nodes I create my hdfs data space on. There is a great tutorial on how to use gparted’s live CD to do the partition expansion here.
$ sudo shutdown -P 0
<< Clone VMs 😉 >>
Next, go ahead and start up all five cloned VMs. I don’t need SSH access at this point as I can use the console on VMware, but you may need to start them individually so you can have SSH access over the IP address we set on the vm that we did first, to avoid any IP address conflicts. Once all VMs are up, set the IP address and hostnames of all the nodes like we did much earlier in the post.
Cloudera Manager (Cloudera Install Instructions)
Finally, we are ready to install Cloudera Manager. SSH to the Cloudera Manager VM and login as root:
$ sudo ssh firstname.lastname@example.org
1. Download the Cloudera Manager installer binary from Cloudera Manager Downloads to the cluster host where you want to install the Cloudera Manager Server.
Download the installer:
$ wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
2. Change cloudera-manager-installer.bin to have executable permission.
$ sudo chmod u+x cloudera-manager-installer.bin
3. Run the Cloudera Manager Server installer.
$ sudo ./cloudera-manager-installer.bin
There will be another screen telling you about the Cloudera modules that express allows you to install, and then you will be taken to the next screen where we can specify our worker nodes (make sure you include the Cloudera Manager vm for health reporting):
Step 3 in the wizard (highlighted by the little squares at the bottom) asks if we would like to “Enable Single User Mode”. You may want this in production as it is a security enhancement. Here I will not be using it:
The last and final step does a health check on all nodes in the cluster. In our case all checks passed except the swappiness setting in Ubuntu Server. The error message gives the solution for the problem by setting the sysctl.conf to 10, but I prefer this setting to be set to 1 for VMs. Lets set that by logging into all the VMs as root and making the change:
$ nano /etc/sysctl.conf
Add this to the bottom of the file:
# Decrease swappiness vm.swappiness = 1
One advantage of using the really streamlined wizard in Cloudera Manager is the next screen where I can easily pick and choose the services I want to run on each node. Cloudera Manager will attempt to lay out the services in the most logical pattern for the nodes you have provisioned, and for the most part it does a pretty good job. Again, I’ll leave everything at the defaults. It’s really important here to remember to check which nodes you are selecting for data nodes. In my case, the Cloudera Manager node does not have sufficient storage like the nodes masbdevn[1-5]. It’s important to make sure that it does not get selected by the system, and in my case it didn’t. Like I said, awesome logic in the wizard:
So at this point, the cluster is online and ready for work. I’ll post more articles on submitting jobs to the cluster from python soon!