Manual Installation using Packages



This section describes installing CDAP on Hadoop clusters that are:

  • Generic Apache Hadoop distributions;
  • CDH (Cloudera's Distribution including Apache Hadoop) clusters not managed with Cloudera Manager; or
  • HDP (Hortonworks Data Platform) clusters not managed with Apache Ambari.

Cloudera Manager (CDH), Apache Ambari (HDP), and MapR distributions should instead be installed following our distribution-specific instructions.

  • As CDAP depends on HDFS, YARN, HBase, ZooKeeper, and (optionally) Hive and Spark, it must be installed on cluster host(s) with full client configurations for these dependent services.
  • The CDAP Master Service must be co-located on a cluster host with an HDFS client, a YARN client, an HBase client, and—optionally—Hive or Spark clients.
  • Note that these clients are redundant if you are co-locating the CDAP Master on a cluster host (or hosts, in the case of a deployment with high availability) with actual services, such as the HDFS Namenode, the YARN resource manager, or the HBase Master.
  • You can download the Hadoop client and HBase client libraries, and then install them on the hosts running CDAP services. No Hadoop or HBase services need be running.
  • All services run as the 'cdap' user installed by the package manager.

Preparing the Cluster

Please review the Software Prerequisites, as configured Hadoop, HBase, and Hive clients (plus an optional Spark client) are needed on the node(s) where CDAP will run.

Hadoop Configuration

  1. ZooKeeper’s maxClientCnxns must be raised from its default. We suggest setting it to zero (0: unlimited connections). As each YARN container launched by CDAP makes a connection to ZooKeeper, the number of connections required is a function of usage.

  2. Ensure that YARN has sufficient memory capacity by lowering the default minimum container size (controlled by the property yarn.scheduler.minimum-allocation-mb). Lack of YARN memory capacity is the leading cause of apparent failures that we see reported. We recommend starting with these settings:

    • yarn.nodemanager.delete.debug-delay-sec: 43200
    • yarn.scheduler.minimum-allocation-mb: 512

    Please ensure your yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb settings are set sufficiently to run CDAP, as described in the CDAP Memory and Core Requirements.
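For reference, the settings above take the following form in the underlying configuration files (a sketch; exact file locations vary by distribution). In ZooKeeper's zoo.cfg:

```
maxClientCnxns=0
```

And in yarn-site.xml:

```xml
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>43200</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
```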

HDFS Permissions

Ensure YARN is configured properly to run MapReduce programs. Often, this includes ensuring that the HDFS /user/yarn directory exists with proper permissions:

# su hdfs
$ hdfs dfs -mkdir -p /user/yarn && hadoop fs -chown yarn /user/yarn && hadoop fs -chgrp yarn /user/yarn

Downloading and Distributing Packages

Preparing Package Managers

CDAP components are available as either Yum .rpm or APT .deb packages. There is one package for each CDAP component, and each component may have multiple services. Additionally, there is a base CDAP package, plus three utility packages (for HBase compatibility); installing the base package creates the base configuration and the cdap user. We provide packages for Ubuntu 12 and CentOS 6.

Available packaging types:

  • RPM: Yum repo
  • Debian: APT repo
  • Tar: For specialized installations only

Note: If you are using Chef to install CDAP, an official cookbook is available.

On RPM using Yum

Download the Cask Yum repo definition file:

$ sudo curl -o /etc/yum.repos.d/cask.repo

This will create the file /etc/yum.repos.d/cask.repo with:

name=Cask Packages

Add the Cask Public GPG Key to your repository:

$ sudo rpm --import

Update your Yum cache:

$ sudo yum makecache

On Debian using APT

Download the Cask APT repo definition file:

$ sudo curl -o /etc/apt/sources.list.d/cask.list

This will create the file /etc/apt/sources.list.d/cask.list with:

deb [ arch=amd64 ] precise cdap

Add the Cask Public GPG Key to your repository:

$ curl -s | sudo apt-key add -

Update your APT-cache:

$ sudo apt-get update

Installing CDAP Services

Package Installation

Install the CDAP packages by using one of the following methods. Do this on each of the boxes that are being used for the CDAP components; our recommended installation is a minimum of two boxes.

This will download and install the latest version of CDAP with all of its dependencies.

To install the optional CDAP CLI on a node, add the cdap-cli package to the list of packages in the commands below.

Using Chef

If you are using Chef to install CDAP, an official cookbook is available.

To install the optional CDAP CLI on a node, use the fullstack recipe.

On RPM using Yum

$ sudo yum install cdap-gateway cdap-kafka cdap-master cdap-security cdap-ui

On Debian using APT

$ sudo apt-get install cdap-gateway cdap-kafka cdap-master cdap-security cdap-ui

Create Required Directories

To prepare your cluster so that CDAP can write to its default namespace, create a top-level /cdap directory in HDFS, owned by the HDFS user yarn:

$ su hdfs
$ hadoop fs -mkdir -p /cdap && hadoop fs -chown yarn /cdap

In the CDAP packages, the default property hdfs.namespace is /cdap and the default property hdfs.user is yarn.

Also, create a tx.snapshot subdirectory:

$ su hdfs
$ hadoop fs -mkdir -p /cdap/tx.snapshot && hadoop fs -chown yarn /cdap/tx.snapshot

Note: If you have customized (or will be customizing) the property data.tx.snapshot.dir in your CDAP configuration, use that value instead for /cdap/tx.snapshot.

If your cluster is not set up with these defaults, you’ll need to edit your CDAP configuration prior to starting services.

CDAP Configuration

This section describes how to configure the CDAP components so they work with your existing Hadoop cluster. Certain Hadoop components may need changes, as described below, for CDAP to run successfully.

  1. CDAP packages utilize a central configuration, stored by default in /etc/cdap.

    When you install the CDAP base package, a default configuration is placed in /etc/cdap/conf.dist. The cdap-site.xml file is a placeholder where you can define your specific configuration for all CDAP components. The cdap-site.xml.example file shows the properties that usually require customization for all installations.

    Similar to Hadoop, CDAP utilizes the alternatives framework to allow you to easily switch between multiple configurations. The alternatives system is used for ease of management and allows you to choose between different directories to fulfill the same purpose.

    Simply copy the contents of /etc/cdap/conf.dist into a directory of your choice (such as /etc/cdap/conf.mycdap) and make all of your customizations there. Then run the alternatives command to point the /etc/cdap/conf symlink to your custom directory /etc/cdap/conf.mycdap:

    $ sudo cp -r /etc/cdap/conf.dist /etc/cdap/conf.mycdap
    $ sudo update-alternatives --install /etc/cdap/conf cdap-conf /etc/cdap/conf.mycdap 10
  2. Configure the cdap-site.xml after you have installed the CDAP packages.

    To configure your particular installation, modify cdap-site.xml, using cdap-site.xml.example as a model. (See the appendix for a listing of cdap-site.xml.example, the minimal cdap-site.xml file required.)

    Customize your configuration by creating (or editing, if it already exists) the file conf/cdap-site.xml and setting appropriate properties:

    $ sudo cp -f /etc/cdap/conf.mycdap/cdap-site.xml.example /etc/cdap/conf.mycdap/cdap-site.xml
    $ sudo vi /etc/cdap/conf.mycdap/cdap-site.xml
  3. Depending on your installation, you may need to set these properties:

    1. Check that the zookeeper.quorum property in conf/cdap-site.xml is set to the ZooKeeper quorum string, a comma-delimited list of fully-qualified domain names for the ZooKeeper quorum:

          <property>
            <name>zookeeper.quorum</name>
            <value>FQDN1:2181,FQDN2:2181/${root.namespace}</value>
            <description>
              ZooKeeper quorum string; specifies the ZooKeeper host:port;
              substitute the quorum for the components shown here (FQDN1:2181,FQDN2:2181)
            </description>
          </property>
    2. Check that the router.server.address property in conf/cdap-site.xml is set to the hostname of the CDAP Router. The CDAP UI uses this property to connect to the Router:

        <property>
          <name>router.server.address</name>
          <value>{router-host-name}</value>
          <description>CDAP Router address to which CDAP UI connects</description>
        </property>
    3. Check that there exists in HDFS a user directory for the hdfs.user property of conf/cdap-site.xml. By default, the HDFS user is yarn. If necessary, create the directory:

      $ su hdfs
      $ hdfs dfs -mkdir -p /user/yarn && hadoop fs -chown yarn /user/yarn && hadoop fs -chgrp yarn /user/yarn
    4. If you want to use an HDFS directory with a name other than /cdap:

      1. Create the HDFS directory you want to use, such as /myhadoop/myspace.

      2. Create an hdfs.namespace property for the HDFS directory in conf/cdap-site.xml:

          <property>
            <name>hdfs.namespace</name>
            <value>/myhadoop/myspace</value>
            <description>Default HDFS namespace</description>
          </property>
      3. Check that the default HDFS user yarn owns that HDFS directory.

    5. If you want to use an HDFS user other than yarn:

      1. Check that there is—and create if necessary—a corresponding user on all machines in the cluster on which YARN is running (typically, all of the machines).

      2. Create an hdfs.user property for that user in conf/cdap-site.xml:

          <property>
            <name>hdfs.user</name>
            <value>{hdfs-username}</value>
            <description>User for accessing HDFS</description>
          </property>
      3. Check that the HDFS user owns the HDFS directory described by hdfs.namespace on all machines.

      4. Check that there exists in HDFS a /user/ directory for that HDFS user, as described above.

    6. To use the ad-hoc querying capabilities of CDAP, ensure the cluster has a compatible version of Hive installed. See the section on Hadoop Compatibility. To use this feature on secure Hadoop clusters, please see these instructions on configuring secure Hadoop.

      Note: Some versions of Hive contain a bug that may prevent the CDAP Explore Service from starting up. See CDAP-1865 for more information about the issue. If the CDAP Explore Service fails to start and you see a javax.jdo.JDODataStoreException: Communications link failure in the log, try adding this property to the Hive hive-site.xml file:

    7. If Hive is not going to be installed, disable the CDAP Explore Service in conf/cdap-site.xml (by default, it is enabled):

        <property>
          <name>explore.enabled</name>
          <value>false</value>
          <description>Enable Explore functionality</description>
        </property>
    8. If you’d like to publish metadata updates to an external Apache Kafka instance, CDAP has the capability of publishing notifications upon metadata updates. Details on the configuration settings and an example output are shown in the Metadata and Lineage section of the Developers’ Manual.

ULIMIT Configuration

When you install the CDAP packages, the ulimit settings for the CDAP user are specified in the /etc/security/limits.d/cdap.conf file. On Ubuntu, they won’t take effect unless you make changes to the /etc/pam.d/common-session file. You can check this setting with the command ulimit -n when logged in as the CDAP user. For more information, refer to the ulimit discussion in the Apache HBase Reference Guide.
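For illustration, a limits.d entry raising the open-file and process limits for the cdap user takes this form (the values shown follow the HBase Reference Guide's recommendations and are an example, not necessarily the packaged defaults):

```
cdap  -  nofile  32768
cdap  -  nproc   65536
```

On Ubuntu, also confirm that /etc/pam.d/common-session loads the limits module, or the entries above are ignored:

```
session required pam_limits.so
```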

Writing to Temp Files

Temp directories, depending on the distribution, are utilized by CDAP (the first two of this set of directories are properties specified in the cdap-site.xml file, as described in the Appendix: cdap-site.xml and cdap-default.xml):

  • app.temp.dir (default: /tmp)
  • kafka.log.dir (default: /tmp/kafka-logs)
  • /var/cdap/run
  • /var/log/cdap
  • /var/run/cdap
  • /var/tmp/cdap

The CDAP user (the cdap system user) must be able to write to these directories, as they are used for deploying applications and for operating CDAP.

As in all installations, the kafka.log.dir may need to be created locally. If you configure kafka.log.dir (or any of the other settable parameters) to a particular directory, you need to make sure that the directory exists and that it is writable by the CDAP user.
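As a quick sanity check, the directory requirements above can be verified with a small script. This is a sketch: run it as the cdap user, and adjust the directory list to match any customized properties.

```shell
#!/bin/sh
# Print each directory from the argument list that is missing or
# not writable by the current user.
check_dirs() {
  for d in "$@"; do
    if [ ! -d "$d" ] || [ ! -w "$d" ]; then
      echo "NOT WRITABLE: $d"
    fi
  done
}

# Directories used by CDAP (defaults; adjust if customized):
check_dirs /tmp /tmp/kafka-logs /var/cdap/run /var/log/cdap /var/run/cdap /var/tmp/cdap
```

Any directory the script reports should be created and chowned to the cdap user before starting services.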

Configuring Hortonworks Data Platform

Beginning with Hortonworks Data Platform (HDP) 2.2, the MapReduce libraries are in HDFS. This requires that an addition be made to the file to indicate the version of HDP:

export OPTS="${OPTS} -Dhdp.version=<version>"

where <version> matches the HDP version of the cluster. The full version string, including the build iteration (of the form a.b.c.d-nnnn), must be used; the release number alone is not sufficient.

The file is located in the central configuration directory, as described above under CDAP Configuration.

In addition, the property app.program.jvm.opts must be set in the cdap-site.xml:

  <property>
    <name>app.program.jvm.opts</name>
    <value>-XX:MaxPermSize=128M ${twill.jvm.gc.opts} -Dhdp.version=<version></value>
    <description>Java options for all program containers</description>
  </property>

substituting the cluster's full HDP version for <version>, as described above.

Starting CDAP Services

When all the packages and dependencies have been installed, and the configuration parameters set, you can start the services on each of the CDAP boxes by running the command:

$ for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i start ; done
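A helper of this form (hypothetical; not part of the CDAP packages) makes the same loop reusable, for example to check status after starting:

```shell
#!/bin/sh
# List the CDAP init scripts in the given init.d directory.
cdap_services() {
  ls "$1" | grep '^cdap' || true
}

# Check the status of every CDAP service after starting them:
#   for i in $(cdap_services /etc/init.d); do sudo service "$i" status; done
```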

When all the services have completed starting, the CDAP UI should then be accessible through a browser at port 9999.

The URL will be http://<host>:9999 where <host> is the IP address of one of the machines where you installed the packages and started the services.


CDAP Smoke Test

The CDAP UI may initially show errors while the CDAP YARN containers are starting up; allow up to a few minutes for this. The Services link in the upper right of the CDAP UI shows the status of the CDAP services.


[Figure: CDAP UI, showing the started-up state before any data or applications are deployed.]

Further instructions for verifying your installation are contained in Verification.

Advanced Topics

Enabling Perimeter Security

Cask Data Application Platform (CDAP) supports securing clusters using perimeter security. Network (or cluster) perimeter security limits outside access, providing a first level of security. However, perimeter security itself does not provide the safeguards of authentication, authorization and service request management that a secure Hadoop cluster provides.

We recommend that CDAP security always be used in conjunction with a secure Hadoop cluster. Where secure Hadoop is not or cannot be used, the cluster is inherently insecure, and any applications running on it are effectively “trusted”. Though there is still value in authenticating perimeter access in that situation, a secure Hadoop cluster should be employed with CDAP security whenever possible.

For instructions on enabling CDAP Security, see CDAP Security; and in particular, see the instructions for configuring the properties of cdap-site.xml.

Enabling Kerberos

When running CDAP on top of a secure Hadoop cluster (using Kerberos authentication), the CDAP processes will need to obtain Kerberos credentials in order to authenticate with Hadoop, HBase, ZooKeeper, and (optionally) Hive. In this case, the setting for hdfs.user in cdap-site.xml will be ignored and the CDAP processes will be identified by the default authenticated Kerberos principal.

Note: CDAP support for secure Hadoop clusters is limited to the latest versions of CDH, HDP, and Apache BigTop; currently, MapR is not supported on secure Hadoop clusters.

  1. In order to configure CDAP for Kerberos authentication:

    1. Create a Kerberos principal for the user running CDAP. The principal name should be in the form username/hostname@REALM, creating a separate principal for each host where a CDAP service will run. This prevents simultaneous login attempts from multiple hosts from being mistaken for a replay attack by the Kerberos KDC.

    2. Generate a keytab file for each CDAP Master Kerberos principal, and place the file as /etc/security/keytabs/cdap.keytab on the corresponding CDAP Master host. The file should be readable only by the user running the CDAP Master service.

    3. Edit /etc/cdap/conf/cdap-site.xml on each host running a CDAP service, substituting the Kerberos primary (user) for <cdap-principal>, and your Kerberos authentication realm for EXAMPLE.COM, when adding these two properties:

    4. The <cdap-principal> is shown in the commands that follow as cdap; however, you are free to use a different appropriate name.

    5. The /cdap directory needs to be owned by the <cdap-principal>; you can set that by running the following command as the hdfs user (change the ownership in the command from cdap to whatever is the <cdap-principal>):

      $ su hdfs -c "hadoop fs -mkdir -p /cdap && hadoop fs -chown cdap /cdap"
    6. When running on a secure HBase cluster, as the hbase user, issue the command:

      $ echo "grant 'cdap', 'RWCA'" | hbase shell
    7. When CDAP Master is started, it will log in using the configured keytab file and principal.

  2. In order to configure CDAP Explore Service for secure Hadoop:

    1. To allow CDAP to act as a Hive client, it must be given proxyuser permissions and allowed from all hosts. For example: set the following properties in the configuration file core-site.xml, where cdap is a system group of which the cdap user is a member:
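      The core-site.xml properties referred to above are the standard Hadoop proxyuser settings; a sketch with cdap as the proxying user (restrict the values further if your security policy requires):

```xml
<property>
  <name>hadoop.proxyuser.cdap.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.cdap.groups</name>
  <value>*</value>
</property>
```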

    2. To execute Hive queries on a secure cluster, the cluster must be running the MapReduce JobHistoryServer service. Consult your distribution documentation on the proper configuration of this service.

    3. To execute Hive queries on a secure cluster using the CDAP Explore Service, the Hive MetaStore service must be configured for Kerberos authentication. Consult your distribution documentation on the proper configuration of the Hive MetaStore service.

    With all these properties set, the CDAP Explore Service will run on secure Hadoop clusters.

CDAP HA setup

Repeat the installation steps on additional boxes. The configuration settings (in cdap-site.xml, property:value) needed to support high-availability are:

  • kafka.seed.brokers: FQDN1:9092,FQDN2:9092,.../${root.namespace}
    • Kafka brokers list (comma-separated), followed by /${root.namespace}
  • kafka.default.replication.factor: 2
    • Used to replicate Kafka messages across multiple machines to prevent data loss in the event of a hardware failure.
    • Set this to the number of Kafka servers; we recommend running at least two.
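Expressed as cdap-site.xml entries, the high-availability settings above might look like this (broker hostnames and the port are placeholders for your environment):

```xml
<property>
  <name>kafka.seed.brokers</name>
  <value>FQDN1:9092,FQDN2:9092/${root.namespace}</value>
  <description>Comma-separated list of Kafka brokers, followed by /${root.namespace}</description>
</property>
<property>
  <name>kafka.default.replication.factor</name>
  <value>2</value>
  <description>Replication factor for Kafka messages</description>
</property>
```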