Installation using Cloudera Manager



This section describes installing CDAP on Hadoop clusters managed by Cloudera Manager.

  • The CDAP integration with Cloudera Manager is provided in the form of a Custom Service Descriptor (CSD), which must be installed into Cloudera Manager prior to installing CDAP. The CSD contains service definitions and configurations to make Cloudera Manager "CDAP-aware."

    After the CDAP CSD has been downloaded and installed, the CDAP service can then be installed via the usual Cloudera Manager methods. CDAP parcels will be available from the preconfigured CDAP parcel repository, and the CDAP service can be added to a cluster using the "Add Service" wizard.

    A new CDAP CSD is released with each CDAP minor version (for example: 4.0, 4.1, etc.) with patch releases as needed. The installed CSD version should always match the major.minor version of the CDAP Parcel. For example, the 5.1 CSD can be used with CDAP 5.1.x.

  • If you are installing CDAP with the intention of using replication, see these instructions on CDAP Replication before installing or starting CDAP.
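The major.minor compatibility rule above can be sketched as a simple version check; the version strings and variable names here are purely illustrative:

```shell
# Check that a CDAP parcel version matches the installed CSD's major.minor.
# Both version strings below are illustrative examples.
CSD_VERSION="5.1"
PARCEL_VERSION="5.1.2"
case "$PARCEL_VERSION" in
  "$CSD_VERSION".*) echo "compatible" ;;
  *)                echo "mismatch" ;;
esac
```

A 5.1 CSD paired with a 5.1.2 parcel is compatible; pairing it with, say, a 5.2.0 parcel would report a mismatch.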

Preparing the Cluster

Roles and Dependencies

The CDAP CSD (Custom Service Descriptor) consists of four mandatory roles and two optional roles:

  • CDAP Master Service: Manages the runtime, lifecycle, and resources of CDAP applications
  • CDAP Gateway/Router Service: Supports the REST endpoints for CDAP
  • CDAP Kafka Service: Metrics and logging transport service, using an embedded version of Kafka
  • CDAP UI Service: User interface for managing CDAP applications
  • CDAP Security Auth Service (optional): Performs client authentication for CDAP when security is enabled
  • Gateway (optional): Cloudera Manager Gateway Role that installs the CDAP client tools (such as the CDAP CLI) and configuration

These roles map to the CDAP components of the same name.

  • As CDAP depends on HDFS, YARN, HBase, ZooKeeper, and (optionally) Hive and Spark, it must be installed on cluster host(s) with full client configurations for these dependent services.
  • The CDAP Master Service role (or CDAP Master) must be co-located on a cluster host with an HDFS Gateway, a YARN Gateway, an HBase Gateway, and—optionally—Hive or Spark Gateways.
  • Note that these Gateways are redundant if you are co-locating the CDAP Master role on a cluster host (or hosts, in the case of a deployment with high availability) with actual services, such as the HDFS Namenode, the YARN resource manager, or the HBase Master.
  • Note that the CDAP Gateway/Router Service is not a Cloudera Manager Gateway Role but is instead another name for the CDAP Router Service.
  • CDAP also provides its own Gateway role that can be used to install CDAP client configurations on other hosts of the cluster.
  • All services run as the 'cdap' user installed by the parcel.

Hadoop Configuration

  1. ZooKeeper's maxClientCnxns must be raised from its default. We suggest setting it to zero (0: unlimited connections). As each YARN container launched by CDAP makes a connection to ZooKeeper, the number of connections required is a function of usage.

  2. Ensure that YARN has sufficient memory capacity by lowering the default minimum container size (controlled by the property yarn.scheduler.minimum-allocation-mb). Lack of YARN memory capacity is the leading cause of apparent failures that we see reported. We recommend starting with these settings:

    • yarn.nodemanager.delete.debug-delay-sec: 43200 (see note below)
    • yarn.scheduler.minimum-allocation-mb: 512

    The value we recommend for yarn.nodemanager.delete.debug-delay-sec (43200 or 12 hours) is what we use internally at Cask for testing as that provides adequate time to capture the logs of any failures. However, you should use an appropriate non-zero value specific to your environment. A large value can be expensive from a storage perspective.

    Please ensure that your yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb settings are sufficient to run CDAP, as described in the CDAP Memory and Core Requirements.

  3. Add additional entries to the YARN Application Classpath for Spark jobs.

    If you plan on running Spark programs from CDAP, CDAP requires that additional entries be added to the YARN application classpath, as the Spark installed on Cloudera Manager clusters is a "Hadoop-less" build and does not include Hadoop jars required by Spark.

    To resolve this, go to the CM page for your cluster, click on the YARN service, click on the configuration tab, and then enter mapreduce.application.classpath in the search box. You will see entries similar to these:


    Copy all the entries to the yarn.application.classpath configuration for YARN on your Cluster. The yarn.application.classpath setting can be found by searching as mentioned above.

    To add the required entries, scroll to the last entry in the classpath form and click the "+" button to add a new text box at the end. Once you have copied all the entries from mapreduce.application.classpath to yarn.application.classpath, click Save.

You can make these changes using Cloudera Manager. Please restart the stale services upon seeing a prompt to do so after making the above changes.
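As a rough sanity check of the memory settings above, you can estimate how many minimum-size containers a single NodeManager can host; the NodeManager memory figure below is purely illustrative:

```shell
# Estimate how many minimum-size YARN containers fit on one NodeManager.
NM_MEMORY_MB=8192    # yarn.nodemanager.resource.memory-mb (illustrative value)
MIN_ALLOC_MB=512     # yarn.scheduler.minimum-allocation-mb (recommended above)
echo $(( NM_MEMORY_MB / MIN_ALLOC_MB ))
```

With these example numbers, each NodeManager can host 16 minimum-size containers; if the result on your cluster is only a handful, CDAP's containers may fail to get scheduled.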

Create the "cdap" User

The CDAP system user: As Hadoop resolves users at the NameNode, the cdap user must be added there, or name resolution for the user will fail. With Cloudera Manager, the CDAP installation will create the cdap user on all nodes when it is distributed or activated on the cluster.

Note that Cloudera Manager can be configured not to add users specified by an installation. This can be the case for installations whose IT policies or infrastructure do not allow local user creation. In that case, manual creation of the cdap user on the cluster nodes may be required.

HDFS Permissions

Ensure YARN is configured properly to run MapReduce programs. Often, this includes ensuring that the HDFS /user/yarn and /user/cdap directories exist with proper permissions:

$ su hdfs
$ hadoop fs -mkdir -p /user/yarn && hadoop fs -chown yarn:yarn /user/yarn
$ hadoop fs -mkdir -p /user/cdap && hadoop fs -chown cdap:cdap /user/cdap

Downloading and Distributing Packages

Note: Both the Custom Service Descriptor (CSD) and the CDAP Parcel must be downloaded and installed in order to successfully install CDAP.

Downloading and Installing CSD

To install CDAP on a cluster managed by Cloudera, a Custom Service Descriptor (CSD) is available that you can install on your CM server. This adds CDAP to the list of services which CM can install.

Supported Cloudera Manager (CM) and Cloudera Distribution of Apache Hadoop (CDH) Distributions
CM Version CDH Version CDAP Parcel / CSD Version
5.10 5.9.x through 5.10.x 5.1.x
5.10 5.8.x 3.5.x through 5.1.x
5.10 5.7.x 3.4.x through 5.1.x
5.10 5.5.x through 5.6.x 3.3.x through 5.1.x
5.10 5.4.x 3.1.x through 5.1.x
5.10 no greater than 5.3.x 3.0.x through 5.1.x
5.9 5.9.x 5.1.x
5.9 5.8.x 3.5.x through 5.1.x
5.9 5.7.x 3.4.x through 5.1.x
5.9 5.5.x through 5.6.x 3.3.x through 5.1.x
5.9 5.4.x 3.1.x through 5.1.x
5.9 no greater than 5.3.x 3.0.x through 5.1.x
5.8 5.8.x 3.5.x through 5.1.x
5.8 5.7.x 3.4.x through 5.1.x
5.8 5.5.x through 5.6.x 3.3.x through 5.1.x
5.8 5.4.x 3.1.x through 5.1.x
5.8 no greater than 5.3.x 3.0.x through 5.1.x
5.7 5.7.x 3.4.x through 5.1.x
5.7 5.5.x through 5.6.x 3.3.x through 5.1.x
5.7 5.4.x 3.1.x through 5.1.x
5.7 no greater than 5.3.x 3.0.x through 5.1.x
5.6 5.5.x through 5.6.x 3.3.x through 3.6.x
5.6 5.4.x 3.1.x through 3.6.x
5.6 no greater than 5.3.x 3.0.x through 3.6.x
5.5 5.5.x 3.3.x through 3.6.x
5.5 5.4.x 3.1.x through 3.6.x
5.5 no greater than 5.3.x 3.0.x through 3.6.x
5.4 5.4.x 3.1.x through 3.6.x
5.4 no greater than 5.3.x 3.0.x through 3.6.x
5.3 no greater than 5.3.x 3.0.x through 3.1.x
5.2 no greater than 5.2.x 3.0.x through 3.1.x
5.1 no greater than 5.1.x Not supported


  • Cloudera Manager supports a version of CDH no greater than its own (for example, CM version 5.1 supports CDH versions less than or equal to 5.1).
  • The version of the CDAP Parcel that is used should match the CSD major.minor version.


  1. Download the CDAP CSD JAR file. Details on CSDs and Cloudera Manager Extensions are available online.
  2. Install the CSD following the instructions at Cloudera's website on Add-on Services, using the instructions given for the case of installing software in the form of a parcel. In this case, you install the CSD first and then install the parcel second.
  3. The first time the CDAP CSD is installed, the Cloudera Management Service may prompt to be restarted. This is necessary for the CDAP services to be properly monitored.

Downloading and Installing Parcels

Download and distribute the CDAP-5.1.2 parcel. Complete instructions on parcels are available at Cloudera's website, but in summary these are the steps:

  1. Installing the CSD adds the corresponding Cask parcel repository for you; however, you can customize the list of repositories searched by Cloudera Manager if you need to;
  2. Download the parcel to your Cloudera Manager server;
  3. Distribute the parcel to all the servers in your cluster; and
  4. Activate the parcel.

Cloudera Manager: CDAP Parcels Distributed, Activated on a cluster.


  • If the Cask parcel repository is inaccessible to your cluster, please see these suggestions.

  • The CDAP parcels are hosted at a repository determined by the CDAP version. For instance, the CDAP 5.1 parcel metadata is accessed by Cloudera Manager at this URL:

Installing CDAP Services

These instructions show how to use the Cloudera Manager Admin Console Add Service Wizard to install and start CDAP. Note that the screens of the wizard will vary depending on which version of Cloudera Manager and CDAP you are using.

Add CDAP Service

Start from the Cloudera Manager Admin Console's Home page, selecting Add Service from the menu for your cluster:


Cloudera Manager: Starting the Add Service Wizard.

Add Service Wizard: Selecting CDAP

Use the Add Service Wizard and select CDAP.


Add Service Wizard: Selecting CDAP as the service to be added.

Add Service Wizard: Specifying Dependencies

The Hive dependency is for the CDAP "Explore" component, which is enabled by default. Note that if you do not select Hive, you will need to disable CDAP Explore in a later page when you review these changes.


Add Service Wizard, Page 1: Setting the dependencies (in this case, including Hive).

Add Service Wizard: Customize Role Assignments

Customize Role Assignments: Ensure the CDAP Master role is assigned to hosts colocated with service or gateway roles for HBase, HDFS, YARN, and (optionally) Hive and Spark.


Add Service Wizard, Page 2: When customizing Role Assignments, the CDAP Security Auth Service can be added later, if required.

Add Service Wizard: Customize Role Assignments


Add Service Wizard, Page 2 (dialog): Assigning the CDAP Master Role to a host with the HBase, HDFS, YARN, Hive, and Spark Gateway roles. It could also be on a host with running services instead.

Add Service Wizard: Customize Role Assignments


Add Service Wizard, Page 2 (dialog): Completing assignments with the CDAP Gateway client added to other nodes of the cluster; it can be added to nodes with CDAP roles.

Add Service Wizard: Customize Role Assignments


Add Service Wizard, Page 2: Completed role assignments.

Add Service Wizard: Reviewing Configuration

App Artifact Dir: This should initially point to the bundled system artifacts included in the CDAP parcel directory. If you have modified ${PARCELS_ROOT} for your instance of Cloudera Manager, please update this setting (App Artifact Dir) to match. You may want to customize this directory to a location outside of the CDAP Parcel.

Explore Enabled: This needs to be disabled if you didn't select Hive earlier.

Kerberos Auth Enabled: This is needed if running on a secure Hadoop cluster.

Router Bind Port, Router Server Port: These two ports should match; Router Server Port is used by the CDAP UI to connect to the CDAP Router service.


Add Service Wizard, Page 4: Reviewing changes and (initial) configuration.

Additional CDAP configuration properties can be added using Cloudera Manager's Safety Valve Advanced Configuration Snippets. Documentation of the available CDAP properties is in the Appendix: cdap-site.xml, cdap-default.xml. Note that for certain CDAP properties, the default values for Cloudera may vary from those shown in the appendix:

  • For kafka.server.log.dirs, the default value is ${LOCAL_DIR}/kafka-logs or /var/tmp/cdap/kafka-logs, instead of /tmp/kafka-logs as shown in the Appendix: Kafka Server.

Additional environment variables can be set, as required, using Cloudera Manager's CDAP Service Environment Advanced Configuration Snippet (Safety Valve). See the example below for configuring Spark.

Note: Service-specific Java heap memory settings (that override the default values) can be created by setting these environment variables:


At this point, the CDAP installation is configured and is ready to be installed. Review your settings before continuing to the next step, which will install and start CDAP.

Starting CDAP Services

Add Service Wizard: First Run of Commands

Executing commands to install and automatically start CDAP services.


Add Service Wizard, Page 5: Finishing first run of commands to install and start CDAP.

Add Service Wizard: Completion Page


Add Service Wizard, Page 6: Congratulations screen, though there is still work to be done.

Cluster Home Page: Status Tab


Cluster Home Page, Status Tab: Showing all CDAP services running. Gateway is not an actual service.


Cloudera Manager Home Page: Showing CDAP installed on the cluster as a service.

Cluster Home Page: Configuring for Spark

Including Spark: If your cluster contains both Spark1 and Spark2, and you would like to use Spark2, set SPARK_MAJOR_VERSION=2 in the Environment Advanced Configuration. If you only have one version of Spark installed, CDAP will use that version.

Additional environment variables are set using the Cloudera Manager's "CDAP Service Environment Advanced Configuration Snippet (Safety Valve)".

Cluster Home Page: Configuring for Spark

You will then have a stale configuration and need to restart the CDAP services.


Cluster Home Page, Status Tab: Stale configuration that requires restarting.

Cluster Home Page: Restarting CDAP


Cluster Stale Configurations: Restarting CDAP services.


Cluster Stale Configurations: Restarting CDAP services.

Cluster Home Page: CDAP Services Restarted


Cluster Stale Configurations: CDAP services after restart.


Service Checks in Cloudera Manager

After the Cloudera Manager Admin Console's Add Service Wizard completes, CDAP will show in your cluster's list of services.


Cloudera Manager: CDAP added to the cluster.

You can select it, and go to the CDAP page, with Quick Links and Status Summary. The lights of the Status Summary should all turn green, showing completion of startup. (Note: Gateway is not an actual service, and does not show a green status indicator.)

The Quick Links includes a link to the CDAP UI, which by default is running on port 11011 of the host where the UI role instance is running.


Cloudera Manager: CDAP page showing available services and their status.

CDAP Smoke Test

The CDAP UI may initially show errors while the CDAP YARN containers are starting up. Allow a few minutes for this.

The Administration page of the CDAP UI shows the status of the CDAP services. It can be reached at http://<cdap-host>:11011/cdap/administration, substituting for <cdap-host> the host name or IP address of the CDAP server:


CDAP UI: Showing started-up, Administration page.

Further instructions for verifying your installation are contained in Verification.

Advanced Topics

Enabling Security

Cask Data Application Platform (CDAP) supports securing clusters using perimeter security, authorization, impersonation and secure storage.

Network (or cluster) perimeter security limits outside access, providing a first level of security. However, perimeter security itself does not provide the safeguards of authentication, authorization and service request management that a secure Hadoop cluster provides.

Authorization provides a way of enforcing access control on CDAP entities.

Impersonation ensures that programs inside CDAP are run as configured users at the namespace level. When enabled, it guarantees that all actions on datasets, streams and other resources happen as the configured user.

For CDAP to be secure, CDAP security should always be used in conjunction with a secure Hadoop cluster. Where secure Hadoop is not or cannot be used, the cluster is inherently insecure and any applications running on it are effectively "trusted". Although perimeter security, authorization enforcement, and secure storage still have value in that situation, a secure Hadoop cluster should be employed with CDAP security whenever possible.

For instructions on enabling CDAP Security, see CDAP Security.

Enabling Kerberos

For Kerberos-enabled Hadoop clusters:

  • The cdap user needs to be granted HBase permissions to create tables. As the hbase user, issue the command:

    $ echo "grant 'cdap', 'RWCA'" | hbase shell
  • The cdap user must be able to launch YARN containers; add it to the YARN allowed.system.users whitelist. (Search for the allowed.system.users configuration in Cloudera Manager, and then add the cdap user to the whitelist.)

  • If you are converting an existing CDAP cluster to being Kerberos-enabled, then you may run into YARN usercache directory permission problems. A non-Kerberos cluster with default settings will run CDAP containers as the user yarn. A Kerberos cluster will run them as the user cdap. When converting, the usercache directory that YARN creates will already exist and be owned by a different user. On all datanodes, run this command, substituting in the correct value of the YARN parameter yarn.nodemanager.local-dirs:

    $ rm -rf <YARN.NODEMANAGER.LOCAL-DIRS>/usercache/cdap

    (As yarn.nodemanager.local-dirs can be a comma-separated list of directories, you may need to run this command multiple times, once for each entry.)

    If, for example, the setting for yarn.nodemanager.local-dirs is /yarn/nm, you would use:

    $ rm -rf /yarn/nm/usercache/cdap

    Restart CDAP after removing the usercache(s).
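Since yarn.nodemanager.local-dirs can contain several comma-separated entries, the removal can be scripted as a loop; the directory list below is a hypothetical example:

```shell
# Remove the stale 'cdap' usercache from each yarn.nodemanager.local-dirs entry.
LOCAL_DIRS="/yarn/nm,/data1/yarn/nm"   # substitute your actual setting
IFS=',' read -ra DIRS <<< "$LOCAL_DIRS"
for d in "${DIRS[@]}"; do
  rm -rf "${d}/usercache/cdap"
  echo "removed ${d}/usercache/cdap"
done
```

Run this on every datanode, then restart CDAP as noted above.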

Enabling Sentry

To use CDAP with Cloudera clusters using Sentry authorization, refer to the steps at Apache Sentry Configuration

The properties described there can be set from within Cloudera Manager by searching for them in the configuration for each component; particularly, Sentry and Hive.

Enabling CDAP HA

In addition to having a cluster architecture that supports HA (high availability), the following configuration steps must be completed:

CDAP Components

For each of the CDAP components listed below (Master, Router, Kafka, UI, Authentication Server), these comments apply:

  • Sync the configuration files (such as cdap-site.xml and cdap-security.xml) on all the nodes.
  • While the default bind.address settings (used for app.bind.address, data.tx.bind.address, router.bind.address, and so on) can be synced across hosts, customizing them to particular IP addresses will make them differ between hosts. This can be controlled by the settings for an individual Role Instance.

CDAP Master

The CDAP Master service primarily performs coordination tasks and can be scaled for redundancy. The instances coordinate amongst themselves, electing one as a leader at all times.

  • Using the Cloudera Manager UI, add additional Role Instances of the role type CDAP Master Service to additional machines.
  • Ensure each machine has all required Gateway roles.
  • Start each CDAP Master Service role.

CDAP Router

The CDAP Router service is a stateless API endpoint for CDAP, and simply routes requests to the appropriate service. It can be scaled horizontally for performance. A load balancer, if desired, can be placed in front of the nodes running the service.

  • Using the Cloudera Manager UI, add Role Instances of the role type CDAP Gateway/Router Service to additional machines.
  • Start each CDAP Gateway/Router Service role.

CDAP Kafka

  • Using the Cloudera Manager UI, add Role Instances of the role type CDAP Kafka Service to additional machines.
  • Two properties govern the Kafka setting in the cluster:
    • The list of Kafka seed brokers is generated automatically, but the replication factor (kafka.server.default.replication.factor) is not set automatically. Instead, it needs to be set manually.
    • The replication factor is used to replicate Kafka messages across multiple machines to prevent data loss in the event of a hardware failure.
  • The recommended setting is to run at least two Kafka brokers with a minimum replication factor of two; set this property to the maximum number of tolerated machine failures plus one (assuming you have that number of machines). For example, if you were running five Kafka brokers, and would tolerate two of those failing, you would set the replication factor to three. The number of Kafka brokers listed should always be equal to or greater than the replication factor.
  • Start each CDAP Kafka Service role.
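The sizing rule above (maximum tolerated broker failures plus one) can be expressed directly; the failure count here is illustrative:

```shell
# Kafka replication factor = maximum tolerated broker failures + 1.
TOLERATED_FAILURES=2        # illustrative: two of five brokers may fail
REPLICATION_FACTOR=$(( TOLERATED_FAILURES + 1 ))
echo "$REPLICATION_FACTOR"
```

Set kafka.server.default.replication.factor to this value, and run at least that many Kafka brokers.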


CDAP UI

  • Using the Cloudera Manager UI, add Role Instances of the role type CDAP UI Service to additional machines.
  • For Cloudera Manager, the CDAP UI and the CDAP Router currently need to be colocated on the same node.
  • Start each CDAP UI Service role.

CDAP Authentication Server

  • Using the Cloudera Manager UI, add Role Instances of the role type CDAP Security Auth Service (the CDAP Authentication Server) to additional machines.
  • Start each CDAP Security Auth Service role.
  • Note that when an unauthenticated request is made in a secure HA setup, a list of all running authentication endpoints is returned in the body of the response.

Hive Execution Engines

CDAP Explore has support for additional execution engines such as Apache Spark and Apache Tez. Details on specifying these engines and configuring CDAP are in the Developer Manual section on Data Exploration, Hive Execution Engines.

Enabling Spark2

In order to use Spark2, you must first install Spark2 on your cluster. If both Spark1 and Spark2 are installed, you must set SPARK_MAJOR_VERSION to 2 in cdap-env. In addition, you must set Spark2 as a service dependency of CDAP. This can be done in the Configuration section of CDAP, by searching for 'dependency'.

You can verify that Spark2 is being used by CDAP by looking at stdout of the CDAP master. As the master is starting up, you should see a line with 'SPARK_COMPAT=spark2_2.11'.

When Spark2 is in use, Spark1 programs cannot be run. Similarly, when Spark1 is in use, Spark2 programs cannot be run.
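One way to confirm which Spark version CDAP picked up is to scan the master's stdout for the SPARK_COMPAT marker mentioned above; the log line in this sketch is fabricated for demonstration:

```shell
# Extract the SPARK_COMPAT marker from a (fabricated) master stdout line.
sample_line="INFO  main - Environment: SPARK_COMPAT=spark2_2.11"
echo "$sample_line" | grep -o 'SPARK_COMPAT=[A-Za-z0-9_.]*'
```

In a real deployment you would grep the CDAP Master role's stdout log in Cloudera Manager rather than a sample string.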

When CDAP starts up, it detects the Spark version and uploads the corresponding pipeline system artifacts. If you have already started CDAP with Spark1, you will also need to delete the pipeline system artifacts and then reload them in order to use the Spark2 versions. After CDAP has been restarted with Spark2, use the RESTful API (shown here with curl against the CDAP Router, which by default listens on port 11015):

$ curl -X DELETE "http://<router-host>:11015/v3/namespaces/system/artifacts/cdap-data-pipeline/versions/5.1.2"
$ curl -X DELETE "http://<router-host>:11015/v3/namespaces/system/artifacts/cdap-data-streams/versions/5.1.2"
$ curl -X POST "http://<router-host>:11015/v3/namespaces/system/artifacts"