Installation Quick Start

These instructions cover determining your deployment architecture, installing CDAP on a Hadoop cluster, and running a verification application in CDAP.

Deployment

Software Prerequisites

Install:

  • Java runtime (JDK or JRE version 1.6.xx or 1.7.xx) on CDAP and Hadoop nodes. Set the JAVA_HOME environment variable. (details)
  • Node.js on CDAP nodes. (details)
  • Hadoop and HBase (and possibly Hive) environment to run against. (details)
  • CDAP nodes require Hadoop and HBase client installation and configuration. Note: No Hadoop services need to be running.
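The prerequisites above can be spot-checked from a shell on each CDAP node. This is a minimal sketch, assuming the java, node, hadoop, and hbase commands are expected on the PATH; it only reports what is missing and changes nothing:

```shell
#!/bin/sh
# Minimal prerequisite check for a CDAP node (sketch; adjust to your cluster).

# have_cmd: true if the given command is on the PATH
have_cmd() { command -v "$1" >/dev/null 2>&1; }

MISSING=""
for cmd in java node hadoop hbase; do
  have_cmd "$cmd" || MISSING="$MISSING $cmd"
done

if [ -n "$MISSING" ]; then
  echo "Missing commands:$MISSING"
else
  echo "All prerequisite commands found"
fi

# JAVA_HOME must be set for the CDAP services to find the Java runtime.
if [ -z "$JAVA_HOME" ]; then
  echo "JAVA_HOME is not set"
fi
```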

Preparing the Cluster

To prepare your cluster so that CDAP can write to its default namespace, create a top-level /cdap directory in HDFS, owned by the HDFS user yarn:

sudo -u hdfs hadoop fs -mkdir /cdap && sudo -u hdfs hadoop fs -chown yarn /cdap

In the CDAP packages, the default HDFS namespace is /cdap and the default HDFS user is yarn. If you set up your cluster as above, no further changes are required.

If your cluster is not set up with these defaults, you’ll need to edit your CDAP configuration after you have downloaded and installed the packages, and prior to starting the services.
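One way to confirm the preparation step is to list the HDFS root and check the owner of /cdap. A sketch, assuming the Hadoop client is configured on the node where you run it (it only reports if the client is absent):

```shell
#!/bin/sh
# Check that /cdap exists in HDFS and note its owner (should be yarn).
if command -v hadoop >/dev/null 2>&1; then
  # In `hadoop fs -ls /` output, the owner is the third field and the
  # path is the last field of each entry.
  OWNER=$(sudo -u hdfs hadoop fs -ls / 2>/dev/null | awk '$NF == "/cdap" {print $3}')
  echo "Owner of /cdap: ${OWNER:-<not found>}"
else
  OWNER="(hadoop client not on this node)"
  echo "hadoop command not found; run this on a node with the Hadoop client"
fi
```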

Configuring Package Managers

RPM using Yum

Download the Cask Yum repo definition file:

sudo curl -o /etc/yum.repos.d/cask.repo http://repository.cask.co/downloads/centos/6/x86_64/cask.repo

This will create the file /etc/yum.repos.d/cask.repo with:

[cask]
name=Cask Packages
baseurl=http://repository.cask.co/centos/6/x86_64/releases
enabled=1
gpgcheck=1

Add the Cask Public GPG Key to your repository:

sudo rpm --import http://repository.cask.co/centos/6/x86_64/releases/pubkey.gpg

Debian using APT

Download the Cask Apt repo definition file:

sudo curl -o /etc/apt/sources.list.d/cask.list http://repository.cask.co/downloads/ubuntu/precise/amd64/cask.list

This will create the file /etc/apt/sources.list.d/cask.list with:

deb [ arch=amd64 ] http://repository.cask.co/ubuntu/precise/amd64/releases precise releases

Add the Cask Public GPG Key to your repository:

curl -s http://repository.cask.co/ubuntu/precise/amd64/releases/pubkey.gpg | sudo apt-key add -

Installation

Install the CDAP packages by using one of these methods:

Using Chef:

If you are using Chef to install CDAP, an official cookbook is available.

Using Yum:

sudo yum install cdap-gateway cdap-kafka cdap-master cdap-security cdap-web-app

Using APT:

sudo apt-get update
sudo apt-get install cdap-gateway cdap-kafka cdap-master cdap-security cdap-web-app

Do this on each of the boxes being used for the CDAP components; we recommend installing on a minimum of two boxes.

This will download and install the latest version of CDAP with all of its dependencies.
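After installation, you can confirm which CDAP packages landed on each box. A sketch that works under either package manager and simply reports if neither is available:

```shell
#!/bin/sh
# List installed CDAP packages on this box.
if command -v rpm >/dev/null 2>&1; then
  # RPM-based systems (Yum)
  PKGS=$(rpm -qa 'cdap*')
elif command -v dpkg-query >/dev/null 2>&1; then
  # Debian-based systems (APT); the pattern exits non-zero when nothing matches
  PKGS=$(dpkg-query -W -f '${Package} ${Version}\n' 'cdap*' 2>/dev/null)
else
  PKGS=""
fi
echo "Installed CDAP packages:"
echo "${PKGS:-  (none found, or no supported package manager)}"
```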

Configuration

CDAP packages utilize a central configuration, stored by default in /etc/cdap.

When you install the CDAP base package, a default configuration is placed in /etc/cdap/conf.dist. The cdap-site.xml file is a placeholder where you can define your specific configuration for all CDAP components. The cdap-site.xml.example file shows the properties that usually require customization for all installations.

To configure your particular installation, follow one of these two approaches:

  1. Modify cdap-site.xml, using cdap-site.xml.example as a model to follow.

    To make alterations to your configuration, create (or edit, if it already exists) the file conf/cdap-site.xml (see the Appendix: cdap-site.xml) and set the appropriate properties.

  2. Add these properties to cdap-site.xml; they are the minimal required configuration:

<configuration>

  <!--
    Cluster configurations
  -->

  <property>
    <name>root.namespace</name>
    <value>cdap</value>
    <description>Specifies the root namespace</description>
  </property>

  <!-- Substitute the zookeeper quorum for components here -->
  <property>
    <name>zookeeper.quorum</name>
    <value>FQDN1:2181,FQDN2:2181/${root.namespace}</value>
    <description>Specifies the zookeeper host:port</description>
  </property>

  <property>
    <name>hdfs.namespace</name>
    <value>/${root.namespace}</value>
    <description>Namespace for HDFS files</description>
  </property>

  <property>
    <name>hdfs.user</name>
    <value>yarn</value>
    <description>User name for accessing HDFS</description>
  </property>

  <!--
    Router configuration
  -->
  <!-- Substitute the IP address on which the Router service should bind and listen -->
  <property>
    <name>router.bind.address</name>
    <value>LOCAL-ROUTER-IP</value>
    <description>Specifies the inet address on which the Router service will listen</description>
  </property>

  <!--
    App Fabric configuration
  -->
  <!-- Substitute the IP address on which the App-Fabric service should bind and listen -->
  <property>
    <name>app.bind.address</name>
    <value>LOCAL-APP-FABRIC-IP</value>
    <description>Specifies the inet address on which the app fabric service will listen</description>
  </property>

  <!--
    Data Fabric configuration
  -->
  <!-- Substitute the IP address on which the Data-Fabric tx service should bind and listen -->
  <property>
    <name>data.tx.bind.address</name>
    <value>LOCAL-DATA-FABRIC-IP</value>
    <description>Specifies the inet address on which the transaction service will listen</description>
  </property>

  <!-- 
    Kafka Configuration
  -->
  <property>
    <name>kafka.log.dir</name>
    <value>/data/cdap/kafka-logs</value>
    <description>Directory to store Kafka logs</description>
  </property>

  <!-- Substitute with a list of all machines which will run the Kafka component -->
  <property>
    <name>kafka.seed.brokers</name>
    <value>FQDN1:9092,FQDN2:9092</value>
    <description>List of Kafka brokers (comma separated)</description>
  </property>

  <!-- Must be <= the number of kafka.seed.brokers configured above.  For HA this should be at least 2. -->
  <property>
    <name>kafka.default.replication.factor</name>
    <value>1</value>
    <description>Kafka replication factor</description>
  </property>

  <!--
    Watchdog Configuration
  -->
  <!-- Substitute the IP address on which the metrics-query service should bind and listen -->
  <property>
    <name>metrics.query.bind.address</name>
    <value>LOCAL-WATCHDOG-IP</value>
    <description>Specifies the inet address on which the metrics-query service will listen</description>
  </property>

  <!--
    Web-App Configuration
  -->
  <property>
    <name>dashboard.bind.port</name>
    <value>9999</value>
    <description>Specifies the port on which dashboard listens</description>
  </property>

  <!-- Substitute the IP of the Router service to which the UI should connect -->
  <property>
    <name>router.server.address</name>
    <value>ROUTER-HOST-IP</value>
    <description>Specifies the destination IP where Router service is running</description>
  </property>

  <property>
    <name>router.server.port</name>
    <value>10000</value>
    <description>Specifies the destination Port where Router service is listening</description>
  </property>

</configuration>
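A malformed cdap-site.xml is a common reason for services failing to start, so it can be worth checking that the file parses before moving on. A sketch using python3's standard-library XML parser; the conf path is an assumption (substitute your own), and the check skips quietly when the file or interpreter is absent:

```shell
#!/bin/sh
# Well-formedness check for cdap-site.xml (does not validate property values).
CDAP_SITE=${CDAP_SITE:-/etc/cdap/conf/cdap-site.xml}
if command -v python3 >/dev/null 2>&1 && [ -f "$CDAP_SITE" ]; then
  if python3 -c 'import sys, xml.etree.ElementTree as ET; ET.parse(sys.argv[1])' "$CDAP_SITE"; then
    RESULT=well-formed
  else
    RESULT=malformed
  fi
else
  RESULT=skipped
fi
echo "cdap-site.xml check: $RESULT"
```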

Depending on your installation, you may want to set these properties:

  • If you want to use an HDFS directory with a name other than /cdap:

    1. Create the HDFS directory you want to use, such as /myhadoop/myspace.

    2. Create an hdfs.namespace property for the HDFS directory in conf/cdap-site.xml:

      <property>
        <name>hdfs.namespace</name>
        <value>/myhadoop/myspace</value>
        <description>Default HDFS namespace</description>
      </property>
      
    3. Ensure that the default HDFS user yarn owns that HDFS directory.

  • If you want to use a different HDFS user than yarn:

    1. Check that there is—and create if necessary—a corresponding user on all machines in the cluster on which YARN is running (typically, all of the machines).

    2. Create an hdfs.user property for that user in conf/cdap-site.xml:

      <property>
        <name>hdfs.user</name>
        <value>my_username</value>
        <description>User for accessing HDFS</description>
      </property>
      
    3. Check that the HDFS user owns the HDFS directory described by hdfs.namespace on all machines.

  • Set the router.server.address property in conf/cdap-site.xml to the hostname of the CDAP Router. The CDAP Console uses this property to connect to the Router:

    <property>
      <name>router.server.address</name>
      <value>{router-host-name}</value>
    </property>
    

Starting Services

When all the packages and dependencies have been installed, and the configuration parameters set, you can start the services on each of the CDAP boxes by running the command:

for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i restart ; done
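To see whether the services came up, the same pattern can query each init script. A sketch, assuming the init scripts support the conventional status action:

```shell
#!/bin/sh
# Report the status of every CDAP init script on this box.
COUNT=0
if [ -d /etc/init.d ]; then
  for i in $(ls /etc/init.d/ | grep cdap); do
    COUNT=$((COUNT + 1))
    sudo service "$i" status || echo "$i is not running"
  done
else
  echo "/etc/init.d not present on this system"
fi
echo "Checked $COUNT CDAP init scripts"
```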

When all the services have completed starting, the CDAP Console should then be accessible through a browser at port 9999.

The URL will be http://<host>:9999 where <host> is the IP address of one of the machines where you installed the packages and started the services.
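A quick reachability check against that URL can confirm the Console is answering. A sketch in which HOST is a placeholder you must substitute for one of your CDAP boxes:

```shell
#!/bin/sh
# Probe the CDAP Console port (9999, the default dashboard.bind.port).
HOST=${HOST:-127.0.0.1}
if command -v curl >/dev/null 2>&1; then
  # -s silences progress, -f treats HTTP errors as failures
  if curl -sf -o /dev/null "http://${HOST}:9999/"; then
    STATUS=up
  else
    STATUS=down
  fi
else
  STATUS=unknown
fi
echo "CDAP Console at ${HOST}:9999 is ${STATUS}"
```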

Verification

To verify that the CDAP software is successfully installed and that you are able to use your Hadoop cluster, run an example application. For convenience, pre-built JAR files are included in our SDK.

  1. Download and install the latest CDAP Software Development Kit (SDK).
  2. Extract to a folder (CDAP_HOME).
  3. Open a command prompt and navigate to CDAP_HOME/examples.
  4. Each example folder has a .jar file in its target directory. For verification, we will use the WordCount example.
  5. Open a web browser to the CDAP Console. It is located on port 9999 of the box where you installed CDAP.
  6. On the Console, click the button Load an App.
  7. Find the pre-built WordCount-<cdap-version>.jar using the dialog box to navigate to CDAP_HOME/examples/WordCount/target/, substituting your version for <cdap-version>.
  8. Once the application is deployed, instructions on running the example can be found at the WordCount example.
  9. You should be able to start the application, inject sentences, and retrieve results.
  10. When finished, you can stop and remove the application as described in the section on Building and Running CDAP Applications.
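As an alternative to the Console, deployment can also be confirmed over HTTP through the Router. This sketch assumes the v2 REST endpoint /v2/apps and the default Router port 10000; check the REST reference for your CDAP version, and substitute ROUTER_HOST with the box running the Router:

```shell
#!/bin/sh
# List deployed applications through the Router (default port 10000).
ROUTER_HOST=${ROUTER_HOST:-127.0.0.1}
if command -v curl >/dev/null 2>&1; then
  APPS=$(curl -sf "http://${ROUTER_HOST}:10000/v2/apps" || echo '(router not reachable)')
else
  APPS='(curl not available)'
fi
echo "Deployed apps: $APPS"
```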