Installation using Cloudera Manager


Preparing the Cluster

Roles and Dependencies

The CDAP CSD (Custom Service Descriptor) consists of four mandatory roles and two optional roles:

CSD Role                       Description
CDAP Master Service            Service for managing runtime, lifecycle and resources of CDAP applications
CDAP Gateway/Router Service    Service supporting REST endpoints for CDAP
CDAP Kafka Service             Metrics and logging transport service, using an embedded version of Kafka
CDAP UI Service                User interface for managing CDAP applications
CDAP Security Auth Service     Performs client authentication for CDAP when security is enabled (optional)
Gateway                        Cloudera Manager Gateway Role that installs the CDAP client tools (such as the CDAP CLI) and configuration (optional)

These roles map to the CDAP components of the same name.

  • As CDAP depends on HDFS, YARN, HBase, ZooKeeper, and (optionally) Hive and Spark, it must be installed on cluster host(s) with full client configurations for these dependent services.
  • The CDAP Master Service role (or CDAP Master) must be co-located on a cluster host with an HDFS Gateway, a YARN Gateway, an HBase Gateway, and (optionally) Hive or Spark Gateways.
  • Note that these Gateways are redundant if you are co-locating the CDAP Master role on a cluster host (or hosts, in the case of a deployment with high availability) with actual services, such as the HDFS Namenode, the YARN resource manager, or the HBase Master.
  • Note that the CDAP Gateway/Router Service is not a Cloudera Manager Gateway Role but is instead another name for the CDAP Router Service.
  • CDAP also provides its own Gateway role that can be used to install CDAP client configurations on other hosts of the cluster.
  • All services run as the 'cdap' user installed by the parcel.

Hadoop Configuration

  1. ZooKeeper's maxClientCnxns must be raised from its default. We suggest setting it to zero (0: unlimited connections). As each YARN container launched by CDAP makes a connection to ZooKeeper, the number of connections required is a function of usage.

  2. Ensure that YARN has sufficient memory capacity by lowering the default minimum container size (controlled by the property yarn.scheduler.minimum-allocation-mb). Lack of YARN memory capacity is the leading cause of apparent failures that we see reported. We recommend starting with these settings:

    • yarn.nodemanager.delete.debug-delay-sec: 43200
    • yarn.scheduler.minimum-allocation-mb: 512 (the value is in MB)

    Please ensure your yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb settings are set sufficiently to run CDAP, as described in the CDAP Memory and Core Requirements.

  3. Add additional entries to the YARN Application Classpath for Spark jobs.

    If you plan on running Spark programs from CDAP, CDAP requires that additional entries be added to the YARN application classpath, as the Spark installed on Cloudera Manager clusters is a "Hadoop-less" build and does not include Hadoop jars required by Spark.

    To resolve this, go to the CM page for your cluster, click on the YARN service, click on the configuration tab, and then enter mapreduce.application.classpath in the search box. You will see entries similar to these:

    $HADOOP_MAPRED_HOME/*
    $HADOOP_MAPRED_HOME/lib/*
    $MR2_CLASSPATH

    Copy all of these entries to the yarn.application.classpath configuration for YARN on your cluster. You can find the yarn.application.classpath setting with the same search box.

    Add the entries required by scrolling to the last entry in the classpath form, clicking the "+" button to add a new text box entry field at the end. Once you have added all the entries from the mapreduce.application.classpath to the yarn.application.classpath, click on Save.
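The copy step in item 3 amounts to taking the union of the two classpath lists. As a minimal sketch, the following shell function merges two comma-separated lists, keeping their order and dropping duplicates, so that you can compare the result against what Cloudera Manager shows after saving. The function name merge_classpath is hypothetical, and the yarn.application.classpath entries in the usage line are placeholders rather than your cluster's actual values.

```shell
# Hypothetical helper: merge two comma-separated classpath lists,
# preserving order and removing duplicate entries.
merge_classpath() {
  printf '%s\n%s\n' "$1" "$2" \
    | tr ',' '\n' \
    | awk 'NF && !seen[$0]++' \
    | paste -sd, -
}

# Usage (single quotes keep the $VARIABLES literal; the first list is a placeholder):
# merge_classpath '$HADOOP_CLIENT_CONF_DIR,$HADOOP_COMMON_HOME/*' \
#   '$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH'
```

Entries already present in yarn.application.classpath are kept once, so re-running the merge after a partial edit does not introduce duplicates.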

You can make all of these changes in Cloudera Manager. Afterwards, restart any stale services when Cloudera Manager prompts you to do so.
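Once the services have restarted, it can be useful to confirm that a host actually picked up the new YARN values. The sketch below reads a single property from a Hadoop-style XML configuration file. The function name get_hadoop_prop is hypothetical, the file path in the usage line is an assumption about where the deployed client configuration lives, and the parser assumes each <name> and <value> element sits on one line.

```shell
# Hypothetical helper: print the <value> of a named property from a
# Hadoop-style XML configuration file.
# Assumes each <name> and <value> element sits on a single line.
get_hadoop_prop() {
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { found = 1 }
    found && match($0, /<value>[^<]*<\/value>/) {
      # Strip the 7-character "<value>" and 8-character "</value>" tags.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}

# Usage (the path is an assumption; adjust it for your cluster):
# get_hadoop_prop /etc/hadoop/conf/yarn-site.xml yarn.scheduler.minimum-allocation-mb
```

If the property is not present in the file, the function prints nothing, which usually means the default value is in effect.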

HDFS Permissions

Ensure YARN is configured properly to run MapReduce programs. Often, this includes ensuring that the HDFS /user/yarn directory exists with proper permissions:

# su hdfs
$ hdfs dfs -mkdir -p /user/yarn && hdfs dfs -chown yarn:yarn /user/yarn
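To confirm the result, you can inspect the directory entry itself. The sketch below checks the owner and group columns of a single line of `hdfs dfs -ls -d` output read from standard input. The function name hdfs_owner_is is hypothetical, and the sketch assumes the usual eight-column listing format (permissions, replication, owner, group, size, date, time, path).

```shell
# Hypothetical helper: succeed if the listing line on stdin shows the
# expected owner ($1) and group ($2).
# Assumes the eight-column format printed by `hdfs dfs -ls -d`.
hdfs_owner_is() {
  awk -v u="$1" -v g="$2" '
    NF >= 8 { ok = ($3 == u && $4 == g) }
    END { exit (ok ? 0 : 1) }
  '
}

# Usage (run as a user that can read /user):
# hdfs dfs -ls -d /user/yarn | hdfs_owner_is yarn yarn && echo "ownership OK"
```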

Downloading and Distributing Packages

Downloading and Installing CSD

To install CDAP on a cluster managed by Cloudera Manager (CM), install the CDAP Custom Service Descriptor (CSD) on your CM server. This adds CDAP to the list of available services that CM can install.

Supported Versions of Cloudera Manager (CM) and Cloudera's Distribution Including Apache Hadoop (CDH)
CM Version    CDH Version                 CDAP Parcel / CSD Version
5.8           5.8.x                       3.6.x

5.7           5.7.x                       3.4.x through 3.6.x
5.7           5.5.x through 5.6.x         3.3.x through 3.6.x
5.7           5.4.x                       3.1.x through 3.6.x
5.7           no greater than 5.3.x       3.0.x through 3.6.x

5.6           5.5.x through 5.6.x         3.3.x through 3.6.x
5.6           5.4.x                       3.1.x through 3.6.x
5.6           no greater than 5.3.x       3.0.x through 3.6.x

5.5           5.5.x                       3.3.x through 3.6.x
5.5           5.4.x                       3.1.x through 3.6.x
5.5           no greater than 5.3.x       3.0.x through 3.6.x

5.4           5.4.x                       3.1.x through 3.6.x
5.4           no greater than 5.3.x       3.0.x through 3.6.x

5.3           no greater than 5.3.x       3.0.x through 3.1.x
5.2           no greater than 5.2.x       3.0.x through 3.1.x
5.1           no greater than 5.1.x       Not supported

Notes:

  • Cloudera Manager supports a version of CDH no greater than its own (for example, CM version 5.1 supports CDH versions less than or equal to 5.1).
  • The version of the CDAP Parcel that is used should match the CSD major.minor version.
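The major.minor rule in the second note can be checked mechanically before you distribute a parcel. This is a minimal sketch; the function names major_minor and versions_match are hypothetical, and the version strings are assumed to be dot-separated, as in 3.6.0.

```shell
# Hypothetical helpers: compare the major.minor part of two version strings.
major_minor() {
  printf '%s\n' "$1" | cut -d. -f1,2
}

versions_match() {
  [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Usage: first argument is the CSD version, second is the parcel version.
# versions_match 3.6.0 3.6.2 && echo "CSD and parcel are compatible"
```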

Steps:

  1. Download the CDAP CSD JAR file. Details on CSDs and Cloudera Manager Extensions are available online.
  2. Install the CSD following the instructions on Add-on Services at Cloudera's website, using the variant for software distributed as a parcel: install the CSD first, and then install the parcel.

Downloading and Installing Parcels

Download and distribute the CDAP-3.6.0 parcel. Complete instructions on parcels are available at Cloudera's website, but in summary these are the steps:

  1. Installing the CSD adds the corresponding Cask parcel repository for you; however, you can customize the list of repositories searched by Cloudera Manager if you need to;
  2. Download the parcel to your Cloudera Manager server;
  3. Distribute the parcel to all the servers in your cluster; and
  4. Activate the parcel.
../_images/cloudera-parcels.png

Cloudera Manager: CDAP Parcels Distributed, Activated on a cluster.

Notes:

  • If the Cask parcel repository is inaccessible to your cluster, please see these suggestions.

  • The CDAP parcels are hosted at a repository determined by the CDAP version. For instance, the CDAP 3.6 parcel metadata is accessed by Cloudera Manager at this URL:

    http://repository.cask.co/parcels/cdap/3.6/manifest.json
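If you need to add the repository to Cloudera Manager by hand, the manifest URL can be derived from the CDAP version. The sketch below assumes that other versions follow the same URL pattern as the 3.6 example above, which this page does not confirm; the function name parcel_manifest_url is hypothetical.

```shell
# Hypothetical helper: build the parcel manifest URL for a CDAP version.
# Assumes every version follows the pattern of the 3.6 example.
parcel_manifest_url() {
  mm=$(printf '%s\n' "$1" | cut -d. -f1,2)
  printf 'http://repository.cask.co/parcels/cdap/%s/manifest.json\n' "$mm"
}

# Usage: check from the Cloudera Manager server that the manifest is reachable.
# curl -sfI "$(parcel_manifest_url 3.6.0)" >/dev/null && echo "repository reachable"
```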
    

Installing CDAP Services

These instructions show how to use the Cloudera Manager Admin Console Add Service Wizard to install and start CDAP. Note that the screens of the wizard will vary depending on which version of Cloudera Manager and CDAP you are using.

Add CDAP Service

Start from the Cloudera Manager Admin Console's Home page, selecting Add Service from the menu for your cluster:

../_images/cloudera-csd-01.png

Cloudera Manager: Starting the Add Service Wizard.

Add Service Wizard: Selecting CDAP

Use the Add Service Wizard and select CDAP.

../_images/cloudera-csd-02.png

Add Service Wizard: Selecting CDAP as the service to be added.

Add Service Wizard: Specifying Dependencies

The Hive dependency is for the CDAP "Explore" component, which is enabled by default. If you do not select Hive, you will need to disable CDAP Explore on a later page, when you review the configuration.

../_images/cloudera-csd-03.png

Add Service Wizard, Page 1: Setting the dependencies (in this case, including Hive).

Add Service Wizard: Customize Role Assignments

Customize Role Assignments: Ensure the CDAP Master role is assigned to hosts colocated with service or gateway roles for HBase, HDFS, YARN, and (optionally) Hive and Spark.

../_images/cloudera-csd-04.png

Add Service Wizard, Page 2: When customizing Role Assignments, the CDAP Security Auth Service can be added later, if required.

Add Service Wizard: Customize Role Assignments

../_images/cloudera-csd-04b.png

Add Service Wizard, Page 2 (dialog): Assigning the CDAP Master Role to a host with the HBase, HDFS, YARN, Hive, and Spark Gateway roles. It could also be on a host with running services instead.

Add Service Wizard: Customize Role Assignments

../_images/cloudera-csd-04c.png

Add Service Wizard, Page 2 (dialog): Completing assignments with the CDAP Gateway client added to other nodes of the cluster; it can be added to nodes with CDAP roles.

Add Service Wizard: Customize Role Assignments

../_images/cloudera-csd-05.png

Add Service Wizard, Page 2: Completed role assignments.

Add Service Wizard: Reviewing Configuration

App Artifact Dir: This should initially point to the bundled system artifacts included in the CDAP parcel directory. If you have modified ${PARCELS_ROOT} for your instance of Cloudera Manager, please update this setting (App Artifact Dir) to match. You may want to customize this directory to a location outside of the CDAP Parcel.

Explore Enabled: This needs to be disabled if you didn't select Hive earlier.

Kerberos Auth Enabled: This is needed if running on a secure Hadoop cluster.

Router Bind Port, Router Server Port: These two ports should match; Router Server Port is used by the CDAP UI to connect to the CDAP Router service.

../_images/cloudera-csd-06.png

Add Service Wizard, Page 4: Reviewing changes and (initial) configuration.

Additional CDAP configuration properties can be added using Cloudera Manager's Safety Valve Advanced Configuration Snippets. Documentation of the available CDAP properties is in the Appendix: cdap-site.xml, cdap-default.xml.

Additional environment variables can be set, as required, using Cloudera Manager's CDAP Service Environment Advanced Configuration Snippet (Safety Valve). See the example below for configuring Spark.

Note: Service-specific Java heap memory settings (that override the default values) can be created by setting these environment variables:

AUTH_JAVA_HEAPMAX
KAFKA_JAVA_HEAPMAX
MASTER_JAVA_HEAPMAX
ROUTER_JAVA_HEAPMAX

At this point, the CDAP installation is configured and is ready to be installed. Review your settings before continuing to the next step, which will install and start CDAP.

Starting CDAP Services

Add Service Wizard: First Run of Commands

Executing commands to install and automatically start CDAP services.

../_images/cloudera-csd-07.png

Add Service Wizard, Page 5: Finishing first run of commands to install and start CDAP.

Add Service Wizard: Completion Page

../_images/cloudera-csd-08.png

Add Service Wizard, Page 6: Congratulations screen, though there is still work to be done.

Cluster Home Page: Status Tab

../_images/cloudera-csd-09a.png

Cluster Home Page, Status Tab: Showing all CDAP services running. Gateway is not an actual service.

../_images/cloudera-csd-09b.png

Cloudera Manager Home Page: Showing CDAP installed on the cluster as a service.

Cluster Home Page: Configuring for Spark

Including Spark: If you are including Spark, the Environment Advanced Configuration needs to contain the location of the Spark libraries, typically as SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark.
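Before saving the value, you may want to confirm that the path is a usable Spark installation on the CDAP Master host. This is a minimal sketch: the function name check_spark_home is hypothetical, and it relies on the assumption that a Spark installation keeps an executable spark-submit under its bin directory.

```shell
# Hypothetical helper: succeed if the directory looks like a Spark home.
# Uses the first argument, or falls back to $SPARK_HOME when none is given.
check_spark_home() {
  dir=${1:-${SPARK_HOME:-}}
  [ -n "$dir" ] && [ -x "$dir/bin/spark-submit" ]
}

# Usage, with the typical location mentioned above:
# check_spark_home /opt/cloudera/parcels/CDH/lib/spark || echo "Spark not found there"
```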

Additional environment variables are set using the Cloudera Manager's "CDAP Service Environment Advanced Configuration Snippet (Safety Valve)".

../_images/cloudera-csd-10.png

Cluster Home Page, Configuration Tab: Adding the SPARK_HOME environment variable using the Service Environment Advanced Configuration Snippet (Safety Valve).

Cluster Home Page: Configuring for Spark

You will then have a stale configuration and need to restart the CDAP services.

../_images/cloudera-csd-11.png

Cluster Home Page, Status Tab: Stale configuration that requires restarting.

Cluster Home Page: Restarting CDAP

../_images/cloudera-csd-12.1.png

Cluster Stale Configurations: Restarting CDAP services.

../_images/cloudera-csd-12.2.png

Cluster Stale Configurations: Restarting CDAP services.

Cluster Home Page: CDAP Services Restarted

../_images/cloudera-csd-09a.png

Cluster Stale Configurations: CDAP services after restart.

Verification

Service Checks in Cloudera Manager

After the Cloudera Manager Admin Console's Add Service Wizard completes, CDAP will show in your cluster's list of services.

../_images/cloudera-csd-09b.png

Cloudera Manager: CDAP added to the cluster.

You can select it, and go to the CDAP page, with Quick Links and Status Summary. The lights of the Status Summary should all turn green, showing completion of startup. (Note: Gateway is not an actual service, and does not show a green status indicator.)

The Quick Links include a link to the CDAP UI, which by default runs on port 9999 of the host where the UI role instance is running.

../_images/cloudera-csd-09a.png

Cloudera Manager: CDAP page showing available services and their status.

CDAP Smoke Test

The CDAP UI may initially show errors while the CDAP YARN containers are starting up; allow up to a few minutes for this. The System Health link on the left side of the CDAP UI shows the status of the CDAP services.
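Because startup can take a few minutes, a scripted smoke test should retry rather than fail on the first error. The sketch below is a generic retry loop; the function name wait_until is hypothetical, and the host name in the usage line is a placeholder for the host running your UI role instance.

```shell
# Hypothetical helper: run a command until it succeeds.
# $1 = maximum attempts, $2 = seconds between attempts, rest = the command.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage: poll the CDAP UI (default port 9999) for up to five minutes.
# wait_until 30 10 curl -sf "http://ui-host.example.com:9999/" && echo "CDAP UI is up"
```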

../_images/console-distributed.png

CDAP UI: Shown after startup, before any data or applications are deployed.

Further instructions for verifying your installation are contained in Verification.

Advanced Topics

Enabling Security

Cask Data Application Platform (CDAP) supports securing clusters using perimeter security, authorization, impersonation and secure storage.

Network (or cluster) perimeter security limits outside access, providing a first level of security. However, perimeter security itself does not provide the safeguards of authentication, authorization and service request management that a secure Hadoop cluster provides.

Authorization provides a way of enforcing access control on CDAP entities.

Impersonation ensures that programs inside CDAP are run as configured users at the namespace level. When enabled, it guarantees that all actions on datasets, streams and other resources happen as the configured user.

We recommend that CDAP security always be used in conjunction with a secure Hadoop cluster. Where secure Hadoop is not or cannot be used, the cluster is inherently insecure and any applications running on it are effectively "trusted". Although there is still value in having perimeter security, authorization enforcement and secure storage in that situation, a secure Hadoop cluster should be employed with CDAP security whenever possible.

For instructions on enabling CDAP Security, see CDAP Security.

Enabling Kerberos

For Kerberos-enabled Hadoop clusters:

  • The cdap user needs to be granted HBase permissions to create tables. As the hbase user, issue the command:

    $ echo "grant 'cdap', 'RWCA'" | hbase shell
    
  • The cdap user must be able to launch YARN containers, either by adding it to the YARN allowed.system.users or by adjusting the YARN min.user.id to include the cdap user. (Search for the YARN configuration allowed.system.users in Cloudera Manager, and then add the cdap user to the whitelist.)

  • If you are converting an existing CDAP cluster to being Kerberos-enabled, you may run into YARN usercache directory permission problems. A non-Kerberos cluster with default settings runs CDAP containers as the user yarn; a Kerberos cluster runs them as the user cdap. When converting, the usercache directory that YARN creates will already exist and be owned by a different user. On all datanodes, run this command, substituting in the correct value of the YARN parameter yarn.nodemanager.local-dirs:

    rm -rf <YARN.NODEMANAGER.LOCAL-DIRS>/usercache/cdap
    

    (As yarn.nodemanager.local-dirs can be a comma-separated list of directories, you may need to run this command multiple times, once for each entry.)

    If, for example, the setting for yarn.nodemanager.local-dirs is /yarn/nm, you would use:

    rm -rf /yarn/nm/usercache/cdap
    

    Restart CDAP after removing the usercache(s).
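When yarn.nodemanager.local-dirs lists several directories, the cleanup can be scripted instead of repeated by hand. The sketch below only prints the usercache paths, so you can review them before deleting anything. The function name usercache_paths is hypothetical, and the second directory in the usage line is an invented example value.

```shell
# Hypothetical helper: print the cdap usercache path for each entry of a
# comma-separated yarn.nodemanager.local-dirs value (dry run, deletes nothing).
usercache_paths() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r dir; do
    if [ -n "$dir" ]; then
      # Drop one trailing slash so the printed path has no "//".
      printf '%s/usercache/cdap\n' "${dir%/}"
    fi
  done
}

# Usage: review the list first, then remove each path on every host.
# usercache_paths "/yarn/nm,/data1/yarn/nm"
# usercache_paths "/yarn/nm,/data1/yarn/nm" | while IFS= read -r p; do rm -rf "$p"; done
```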

Enabling Sentry

To use CDAP with Cloudera clusters using Sentry authorization, refer to the steps at Apache Sentry Configuration.

The properties described there can be set from within Cloudera Manager by searching for them in the configuration for each component; particularly, Sentry and Hive.

Enabling CDAP HA

In addition to having a cluster architecture that supports HA (high availability), complete these additional configuration steps:

CDAP Components

For each of the CDAP components listed below (Master, Router, Kafka, UI, Authentication Server), these comments apply:

  • Sync the configuration files (such as cdap-site.xml and cdap-security.xml) on all the nodes.
  • While the default bind.address settings (0.0.0.0, used for app.bind.address, data.tx.bind.address, router.bind.address, and so on) can be synced across hosts, if you customize them to a particular IP address, they will, as a result, be different on different hosts. This can be controlled by the settings for an individual Role Instance.

CDAP Master

The CDAP Master service primarily performs coordination tasks and can be scaled for redundancy. The instances coordinate amongst themselves, electing one as a leader at all times.

  • Using the Cloudera Manager UI, add additional Role Instances of the role type CDAP Master Service to additional machines.
  • Ensure each machine has all required Gateway roles.
  • Start each CDAP Master Service role.

CDAP Router

The CDAP Router service is a stateless API endpoint for CDAP, and simply routes requests to the appropriate service. It can be scaled horizontally for performance. A load balancer, if desired, can be placed in front of the nodes running the service.

  • Using the Cloudera Manager UI, add Role Instances of the role type CDAP Gateway/Router Service to additional machines.
  • Start each CDAP Gateway/Router Service role.

CDAP Kafka

  • Using the Cloudera Manager UI, add Role Instances of the role type CDAP Kafka Service to additional machines.
  • Two properties govern the Kafka setting in the cluster:
    • The list of Kafka seed brokers is generated automatically, but the replication factor (kafka.default.replication.factor) is not set automatically. Instead, it needs to be set manually.
    • The replication factor is used to replicate Kafka messages across multiple machines to prevent data loss in the event of a hardware failure.
  • The recommended setting is to run at least two Kafka brokers with a minimum replication factor of two; set this property to the maximum number of tolerated machine failures plus one (assuming you have that number of machines). For example, if you were running five Kafka brokers, and would tolerate two of those failing, you would set the replication factor to three. The number of Kafka brokers listed should always be equal to or greater than the replication factor.
  • Start each CDAP Kafka Service role.
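The replication factor rule above is simple arithmetic: tolerated failures plus one, never below the recommended minimum of two, and never above the number of brokers. A minimal sketch follows; the function name kafka_replication_factor is hypothetical.

```shell
# Hypothetical helper: suggest a value for kafka.default.replication.factor.
# $1 = number of Kafka brokers, $2 = machine failures to tolerate.
kafka_replication_factor() {
  brokers=$1
  tolerated=$2
  rf=$((tolerated + 1))
  # Apply the recommended minimum replication factor of two.
  if [ "$rf" -lt 2 ]; then
    rf=2
  fi
  # The broker count must be at least the replication factor.
  if [ "$rf" -gt "$brokers" ]; then
    echo "need at least $rf brokers, have $brokers" >&2
    return 1
  fi
  echo "$rf"
}

# Usage, matching the example above (five brokers, two tolerated failures):
# kafka_replication_factor 5 2
```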

CDAP UI

  • Using the Cloudera Manager UI, add Role Instances of the role type CDAP UI Service to additional machines.
  • For Cloudera Manager, the CDAP UI and the CDAP Router currently need to be colocated on the same node.
  • Start each CDAP UI Service role.

CDAP Authentication Server

  • Using the Cloudera Manager UI, add Role Instances of the role type CDAP Security Auth Service (the CDAP Authentication Server) to additional machines.
  • Start each CDAP Security Auth Service role.
  • Note that when an unauthenticated request is made in a secure HA setup, a list of all running authentication endpoints will be returned in the body of the response.