Installation using Cloudera Manager


Preparing the Cluster

Roles and Dependencies

The CDAP CSD (Custom Service Descriptor) consists of four mandatory roles and two optional roles:

CSD Role                      Description
CDAP Master Service           Service for managing runtime, lifecycle and resources of CDAP applications
CDAP Gateway/Router Service   Service supporting REST endpoints for CDAP
CDAP Kafka Service            Metrics and logging transport service, using an embedded version of Kafka
CDAP UI Service               User interface for managing CDAP applications
CDAP Security Auth Service    Performs client authentication for CDAP when security is enabled (optional)
Gateway                       Cloudera Manager Gateway Role that installs the CDAP client tools (such as the CDAP CLI) and configuration (optional)

These roles map to the CDAP components of the same name.

  • As CDAP depends on HDFS, YARN, HBase, ZooKeeper, and (optionally) Hive and Spark, it must be installed on cluster host(s) with full client configurations for these dependent services.
  • The CDAP Master Service role (or CDAP Master) must be co-located on a cluster host with an HDFS Gateway, a YARN Gateway, an HBase Gateway, and—optionally—Hive or Spark Gateways.
  • Note that these Gateways are redundant if you are co-locating the CDAP Master role on a cluster host (or hosts, in the case of a deployment with high availability) with actual services, such as the HDFS Namenode, the YARN resource manager, or the HBase Master.
  • Note that the CDAP Gateway/Router Service is not a Cloudera Manager Gateway Role but is instead another name for the CDAP Router Service.
  • CDAP also provides its own Gateway role that can be used to install CDAP client configurations on other hosts of the cluster.
  • All services run as the 'cdap' user installed by the parcel.

Node.js Installation

Node.js must be installed on the node(s) where the CDAP UI service will run. We recommend a version of Node.js later than v0.10.36, up to and including the v0.12.* series; v0.12.* is preferred. You can download an appropriate version of Node.js from nodejs.org. Detailed instructions on installing Node.js are available.
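
As a quick check before assigning the CDAP UI role, the recommended version range can be tested on each candidate host. This is a minimal sketch and not part of CDAP: the function name node_version_ok is hypothetical, and it assumes that `node --version` prints a string of the form v0.12.7.

```shell
# Succeed if a Node.js version string (e.g. "v0.12.7") is later than
# v0.10.36 and no later than the v0.12 series.
# node_version_ok is an illustrative helper, not a CDAP or Node.js tool.
node_version_ok() {
  nv=${1#v}                      # drop the leading "v"
  nv_major=${nv%%.*}
  nv_rest=${nv#*.}
  nv_minor=${nv_rest%%.*}
  case $nv_rest in
    *.*) nv_patch=${nv_rest#*.} ;;
    *)   nv_patch=0 ;;
  esac
  nv_patch=${nv_patch%%[!0-9]*}  # ignore suffixes such as "-pre"
  [ "$nv_major" = 0 ] || return 1
  case $nv_minor in
    10)    [ "${nv_patch:-0}" -gt 36 ] ;;
    11|12) return 0 ;;
    *)     return 1 ;;
  esac
}

# Usage on a candidate host (runs only if node is on the PATH).
if command -v node >/dev/null 2>&1; then
  if node_version_ok "$(node --version)"; then
    echo "Node.js version is in the recommended range"
  else
    echo "Node.js version is outside the recommended range"
  fi
fi
```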

Hadoop Configuration

  1. ZooKeeper’s maxClientCnxns must be raised from its default. We suggest setting it to zero (0: unlimited connections). As each YARN container launched by CDAP makes a connection to ZooKeeper, the number of connections required is a function of usage.

  2. Ensure that YARN has sufficient memory capacity by lowering the default minimum container size (controlled by the property yarn.scheduler.minimum-allocation-mb). Lack of YARN memory capacity is the leading cause of apparent failures that we see reported. We recommend starting with these settings:

    • yarn.nodemanager.delete.debug-delay-sec: 43200
    • yarn.scheduler.minimum-allocation-mb: 512 (the value is in MB)

    Please ensure that your yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb settings are high enough to run CDAP, as described in the CDAP Memory and Core Requirements.

  3. Add additional entries to the YARN Application Classpath for Spark jobs.

    If you plan on running Spark programs from CDAP, CDAP requires that additional entries be added to the YARN application classpath, as the Spark installed on Cloudera Manager clusters is a “Hadoop-less” build and does not include Hadoop jars required by Spark.

    To resolve this, go to the CM page for your cluster, click on the YARN service, click on the configuration tab, and then enter mapreduce.application.classpath in the search box. You will see entries similar to these:

    $HADOOP_MAPRED_HOME/*
    $HADOOP_MAPRED_HOME/lib/*
    $MR2_CLASSPATH

    Copy all of these entries to the yarn.application.classpath configuration for YARN on your cluster. The yarn.application.classpath setting can be found using the same search box.

    To add the entries, scroll to the last entry in the classpath form and click the “+” button to add a new text field at the end. Once you have added all the entries from mapreduce.application.classpath to yarn.application.classpath, click Save.

You can make all of these changes using Cloudera Manager. After making them, restart the stale services when Cloudera Manager prompts you to do so.
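
After the restart, the ZooKeeper connection limit from step 1 can be confirmed from any host that can reach a ZooKeeper server. This is a minimal sketch under several assumptions: ZooKeeper's `conf` four-letter command is enabled, the server listens on the default client port 2181, and `nc` is installed. The function names and the host name are placeholders.

```shell
# Extract the maxClientCnxns value from the output of ZooKeeper's "conf"
# command, which prints one key=value pair per line.
parse_max_client_cnxns() {
  sed -n 's/^maxClientCnxns=//p'
}

# Query a ZooKeeper server (host, optional port) and print its maxClientCnxns.
zk_max_client_cnxns() {
  printf 'conf' | nc "$1" "${2:-2181}" | parse_max_client_cnxns
}

# Example (zk1.example.com is a placeholder host):
#   zk_max_client_cnxns zk1.example.com     # 0 means unlimited
```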

HDFS Permissions

Ensure YARN is configured properly to run MapReduce programs. Often, this includes ensuring that the HDFS /user/yarn directory exists with proper permissions:

# su hdfs
$ hdfs dfs -mkdir -p /user/yarn && hdfs dfs -chown yarn:yarn /user/yarn
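
To confirm the result, the directory listing can be inspected. The sketch below assumes the usual `hdfs dfs -ls -d` output, in which the owner and group are the third and fourth columns; ls_owner_group is a hypothetical helper.

```shell
# Print "owner:group" from a single line of `hdfs dfs -ls` output read on stdin.
ls_owner_group() {
  awk 'NF >= 8 { print $3 ":" $4; exit }'
}

# Usage, as a user with access to HDFS (runs only if the hdfs client is installed).
if command -v hdfs >/dev/null 2>&1; then
  if [ "$(hdfs dfs -ls -d /user/yarn | ls_owner_group)" = "yarn:yarn" ]; then
    echo "/user/yarn is owned by yarn:yarn"
  else
    echo "/user/yarn is missing or has unexpected ownership"
  fi
fi
```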

Downloading and Distributing Packages

Downloading and Installing CSD

To install CDAP on a cluster managed by Cloudera, use the CDAP Custom Service Descriptor (CSD), which you install on your Cloudera Manager (CM) server. This adds CDAP to the list of available services that CM can install.

Supported Cloudera Manager (CM) and Cloudera Data Hub (CDH) Distributions
CM Version   CDH Version             CSD Version           CDAP Parcel Version
5.6          5.6.x                   3.3.x                 Matching CSD major.minor
5.5, 5.6     5.5.x                   3.3.x                 Matching CSD major.minor
5.5, 5.6     5.4.x                   3.1.x through 3.3.x   Matching CSD major.minor
5.5, 5.6     no greater than 5.3.x   3.0.x through 3.3.x   Matching CSD major.minor
5.4          5.4.x                   3.1.x through 3.3.x   Matching CSD major.minor
5.4          no greater than 5.3.x   3.0.x through 3.3.x   Matching CSD major.minor
5.3          no greater than 5.3.x   3.0.x through 3.1.x   Matching CSD major.minor
5.2          no greater than 5.2.x   3.0.x through 3.1.x   Matching CSD major.minor
5.1          no greater than 5.1.x   Not supported

Notes:

  • Cloudera Manager supports a version of CDH no greater than its own (for example, CM version 5.1 supports CDH versions less than or equal to 5.1).
  • The version of the CDAP Parcel that is used should match the CSD major.minor version.
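
The matching rule in the second note compares only the first two version components. A minimal sketch of that comparison follows; the helper names and the version numbers are illustrative, not taken from any CDAP tool.

```shell
# Print the major.minor prefix of a version string such as "3.3.7".
major_minor() {
  echo "$1" | cut -d. -f1,2
}

# Succeed if the parcel and CSD versions share the same major.minor prefix.
# Arguments: parcel version, CSD version.
parcel_matches_csd() {
  [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Example with illustrative version numbers:
if parcel_matches_csd 3.3.7 3.3.1; then
  echo "parcel and CSD versions are compatible"
fi
```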

Steps:

  1. Download the CDAP CSD, which is distributed as a JAR file. Details on CSDs and Cloudera Manager Extensions are available online.
  2. Install the CSD following the instructions at Cloudera’s website on Add-on Services, using the instructions given for the case of installing software in the form of a parcel. In this case, you install the CSD first and then install the parcel second.
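
On the Cloudera Manager server, installing a CSD typically amounts to placing the JAR in the CSD directory and restarting the server. The sketch below assumes Cloudera Manager's default CSD directory, /opt/cloudera/csd, and the cloudera-scm user and group. install_csd is a hypothetical helper, and the JAR path is whichever file you downloaded in step 1. Cloudera's own instructions take precedence if your installation differs.

```shell
# Copy a CSD JAR into the CSD directory with mode 644.
# Arguments: path to the downloaded JAR, optional target directory
# (defaults to /opt/cloudera/csd, an assumed Cloudera Manager default).
install_csd() {
  csd_dir=${2:-/opt/cloudera/csd}
  csd_target=$csd_dir/$(basename "$1")
  mkdir -p "$csd_dir" &&
    cp "$1" "$csd_target" &&
    chmod 644 "$csd_target"
}

# Typical follow-up, run as root on the Cloudera Manager server:
#   chown cloudera-scm:cloudera-scm /opt/cloudera/csd/<downloaded CSD JAR>
#   service cloudera-scm-server restart
```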

Downloading and Installing Parcels

Download and distribute the CDAP-3.3.7 parcel. Complete instructions on parcels are available at Cloudera’s website, but in summary these are the steps:

  1. Installing the CSD adds the corresponding Cask parcel repository for you; however, you can customize the list of repositories searched by Cloudera Manager if you need to;
  2. Download the parcel to your Cloudera Manager server;
  3. Distribute the parcel to all the servers in your cluster; and
  4. Activate the parcel.

Notes:

  • If the Cask parcel repository is inaccessible to your cluster, please see these suggestions.

  • The CDAP parcels are hosted at a repository determined by the CDAP version. For instance, the CDAP 3.3 parcel metadata is accessed by Cloudera Manager at this URL:

    http://repository.cask.co/parcels/cdap/3.3/manifest.json
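
    Whether the repository is reachable can be checked from the Cloudera Manager server before starting the download. The sketch below assumes that other CDAP versions follow the same one-directory-per-major.minor layout as the 3.3 URL above and that curl is installed; both function names are hypothetical.

```shell
# Build the parcel manifest URL for a CDAP version such as "3.3.7",
# assuming the repository keeps one directory per major.minor version.
parcel_manifest_url() {
  echo "http://repository.cask.co/parcels/cdap/$(echo "$1" | cut -d. -f1,2)/manifest.json"
}

# Succeed if the manifest can be fetched from this host (requires curl).
parcel_repo_reachable() {
  curl -fsS -o /dev/null "$(parcel_manifest_url "$1")"
}

# Example:
#   parcel_repo_reachable 3.3.7 || echo "Cask parcel repository is not reachable from this host"
```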
    

Installing CDAP Services

These instructions show how to use the Cloudera Manager Admin Console Add Service Wizard to install and start CDAP. Note that the screens of the wizard will vary depending on which version of Cloudera Manager and CDAP you are using.

Add CDAP Service

Start from the Cloudera Manager Admin Console’s Home page, selecting Add a Service from the menu for your cluster:

../_images/cloudera-csd-01.png

Cloudera Manager: Starting the Add Service Wizard.

Add Service Wizard: Selecting CDAP

Use the Add Service Wizard and select Cask DAP.

../_images/cloudera-csd-02.png

Add Service Wizard: Selecting CDAP (Cask DAP) as the service to be added.

Add Service Wizard: Specifying Dependencies

The Hive dependency is for the CDAP “Explore” component, which is enabled by default. Note that if you do not select Hive, you will need to disable CDAP Explore on a later page, when you review the configuration.

../_images/cloudera-csd-03.png

Add Service Wizard, Page 1: Setting the dependencies (in this case, including Hive).

Add Service Wizard: Customize Role Assignments

Customize Role Assignments: Ensure the CDAP Master role is assigned to hosts co-located with service or gateway roles for HBase, HDFS, YARN, and (optionally) Hive and Spark.

../_images/cloudera-csd-04.png

Add Service Wizard, Page 3: When customizing Role Assignments, the CDAP Security Auth Service can be added later, if required.

../_images/cloudera-csd-04b.png

Add Service Wizard, Page 3: Assigning the CDAP Master Role to a host with the HBase, HDFS, YARN, Hive, and Spark Gateway roles. It could also be on a host with running services instead.

../_images/cloudera-csd-04c.png

Add Service Wizard, Page 3: Completing assignments with CDAP Gateway client added to all nodes of the cluster including those with CDAP roles.

../_images/cloudera-csd-05.png

Add Service Wizard, Page 3: Completed role assignments.

Add Service Wizard: Reviewing Configuration

App Artifact Dir: This should initially point to the bundled system artifacts included in the CDAP parcel directory. If you have modified ${PARCELS_ROOT} for your instance of Cloudera Manager, please update this setting (App Artifact Dir) to match. You may want to customize this directory to a location outside of the CDAP Parcel.

Explore Enabled: This needs to be disabled if you didn’t select Hive earlier.

Kerberos Auth Enabled: This is needed if running on a secure Hadoop cluster.

Router Bind Port, Router Server Port: These two ports should match; Router Server Port is used by the CDAP UI to connect to the CDAP Router service.

../_images/cloudera-csd-06.png

Add Service Wizard, Page 4: Reviewing changes and (initial) configuration.

Additional CDAP configuration properties can be added afterwards using Cloudera Manager’s Safety Valve (Advanced Configuration Snippets). Documentation of the available CDAP properties is in the Appendix: cdap-site.xml and cdap-default.xml.

Additional environment variables can be set afterwards, as required, using Cloudera Manager’s “Cask DAP Service Environment Advanced Configuration Snippet (Safety Valve)”.

At this point, CDAP is configured and ready to be installed. Review your settings before continuing to the next step, which will install and start CDAP.

Starting CDAP Services

Add Service Wizard: First Run of Commands

The wizard executes the commands that install and automatically start the CDAP services.

../_images/cloudera-csd-07.png

Add Service Wizard, Page 5: Finishing first run of commands to install and start CDAP.

Add Service Wizard: Completion Page

../_images/cloudera-csd-08.png

Add Service Wizard, Page 7: Congratulations screen, though there is still work to be done.

Cluster Home Page: Status Tab

../_images/cloudera-csd-09.png

Cluster Home Page, Status Tab: Showing all CDAP services running. Gateway is not an actual service.

Cluster Home Page: Configuring for Spark

Including Spark: If you are including Spark, the Environment Advanced Configuration needs to contain the location of the Spark libraries, typically as SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark.

Additional environment variables are set using the Cloudera Manager’s “Cask DAP Service Environment Advanced Configuration Snippet (Safety Valve)”.

../_images/cloudera-csd-10.png

Cluster Home Page, Configuration Tab: Adding the SPARK_HOME environment variable using the Service Environment Advanced Configuration Snippet (Safety Valve).
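
Before entering the value, it can help to confirm on the CDAP Master host that the path really contains a Spark installation. The sketch below assumes that a usable SPARK_HOME has an executable bin/spark-submit; spark_home_ok is a hypothetical helper, and the path is the typical location given above.

```shell
# Succeed if the given directory looks like a usable SPARK_HOME,
# i.e. it exists and contains an executable bin/spark-submit launcher.
spark_home_ok() {
  [ -d "$1" ] && [ -x "$1/bin/spark-submit" ]
}

# Typical CDH parcel location; adjust if your parcels live elsewhere.
SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
if spark_home_ok "$SPARK_HOME"; then
  echo "SPARK_HOME=$SPARK_HOME"
else
  echo "No Spark installation found at $SPARK_HOME"
fi
```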

You will then have a stale configuration and need to restart the CDAP services.

../_images/cloudera-csd-11.png

Cluster Home Page, Status Tab: Stale configuration that requires restarting.

Cluster Home Page: Restarting CDAP

../_images/cloudera-csd-12.png

Cluster Stale Configurations: Restarting CDAP services.

Cluster Home Page: CDAP Services Restarted

../_images/cloudera-csd-13.png

Cluster Stale Configurations: CDAP services after restart.

Verification

Service Checks in Cloudera Manager

After the Cloudera Manager Admin Console’s Add Service Wizard completes, Cask DAP will appear in the list of services for the cluster where you installed it. Select it to open the Cask DAP page, which shows Quick Links and a Status Summary. All of the Status Summary indicators should turn green once startup is complete.

The Quick Links include a link to the CDAP UI, which by default runs on port 9999 of the host where the UI role instance is running.

../_images/cloudera-csd-09.png

Cloudera Manager: CDAP (Cask DAP) added to the cluster.

CDAP Smoke Test

The CDAP UI may initially show errors while the CDAP YARN containers are starting up; allow up to a few minutes for this. The Services link in the upper right of the CDAP UI shows the status of the CDAP services.
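
Because startup can take a few minutes, a scripted smoke test is better off polling than failing on the first error. The following is a generic retry sketch: wait_until is a hypothetical helper, the host name is a placeholder, and the port is the default UI port mentioned earlier.

```shell
# Run a command repeatedly until it succeeds or the timeout expires.
# Arguments: timeout in seconds, interval in seconds, then the command to run.
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_elapsed=0
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
      return 1
    fi
    sleep "$wu_interval"
    wu_elapsed=$((wu_elapsed + wu_interval))
  done
}

# Example: wait up to five minutes for the CDAP UI to answer (requires curl;
# cdap-ui.example.com is a placeholder host):
#   wait_until 300 10 curl -fsS -o /dev/null http://cdap-ui.example.com:9999/
```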

../_images/console_01_overview.png

CDAP UI: After startup, before any data or applications are deployed.

Further instructions for verifying your installation are contained in Verification.

Advanced Topics

Enabling Perimeter Security

Cask Data Application Platform (CDAP) supports securing clusters using perimeter security. Network (or cluster) perimeter security limits outside access, providing a first level of security. However, perimeter security itself does not provide the safeguards of authentication, authorization and service request management that a secure Hadoop cluster provides.

For CDAP to be secure, we recommend that CDAP security always be used in conjunction with a secure Hadoop cluster. Where secure Hadoop is not or cannot be used, the cluster is inherently insecure and any applications running on it are effectively “trusted”. Though there is still value in authenticating perimeter access in that situation, a secure Hadoop cluster should be employed with CDAP security whenever possible.

For instructions on enabling CDAP Security, see CDAP Security and, in particular, the instructions for configuring the properties of cdap-site.xml.

Enabling Kerberos

For Kerberos-enabled Hadoop clusters:

  • The 'cdap' user needs to be granted HBase permissions to create tables. As the hbase user, issue the command:

    $ echo "grant 'cdap', 'RWCA'" | hbase shell
    
  • The 'cdap' user must be able to launch YARN containers, either by adding it to the YARN allowed.system.users or by adjusting the YARN min.user.id to include the cdap user. (Search for the YARN configuration allowed.system.users in Cloudera Manager, and then add the cdap user to the whitelist.)
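
The two alternatives for YARN in the last item can be read as a single rule: a user may launch containers if its UID is at least min.user.id or if it is listed in allowed.system.users. The sketch below encodes that rule for illustration only. yarn_user_allowed is a hypothetical helper, the UID and list values are made up, and other YARN checks, such as a list of banned users, are ignored.

```shell
# Succeed if a user may launch containers under these rules:
# the user's UID is at least min.user.id, or the user is listed in
# the comma-separated allowed.system.users value.
# Arguments: user name, UID, min.user.id, allowed.system.users
yarn_user_allowed() {
  if [ "$2" -ge "$3" ]; then
    return 0
  fi
  case ",$4," in
    *",$1,"*) return 0 ;;
  esac
  return 1
}

# Example with illustrative values: UID 480 is below a min.user.id of 1000
# and cdap is not in the list, so one of the two settings must change.
if yarn_user_allowed cdap 480 1000 "nobody,impala,hive,llama"; then
  echo "cdap can launch containers"
else
  echo "add cdap to allowed.system.users or lower min.user.id"
fi
```

On a cluster host, `id -u cdap` prints the UID to compare against min.user.id.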