Appendix: cdap-site.xml and cdap-default.xml

The cdap-site.xml file is the configuration file for a CDAP installation. Its properties and values determine the settings used by CDAP when starting and operating.

Any property not set in an installation’s cdap-site.xml uses the default value defined in cdap-default.xml. That file is packaged inside the CDAP JARs and should not be altered.

Any default value (except those marked [Final]) can be overridden by setting the property in the cdap-site.xml file, located by default at either <CDAP-SDK-HOME>/conf/cdap-site.xml (Standalone mode) or /etc/cdap/conf/cdap-site.xml (Distributed mode).
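
As a minimal sketch of such an override, assuming cdap-site.xml uses the Hadoop-style configuration/property layout, the fragment below replaces the default log.retention.duration.days of 7 with an illustrative value of 14. The description element is optional and is included only for readability.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the default of 7 days from cdap-default.xml; 14 is an illustrative value -->
  <property>
    <name>log.retention.duration.days</name>
    <value>14</value>
    <description>Log file HDFS retention duration in days</description>
  </property>
</configuration>
```

Properties that are not listed in the file keep their defaults, so a site file typically contains only the handful of values that differ for that installation.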

Below are the parameters that can be defined in the cdap-site.xml file, their default values (obtained from cdap-default.xml), descriptions, and notes.

For information on configuring the cdap-site.xml file and CDAP for security, see the CDAP Security section.

General

Parameter Name Default Value Description
hdfs.lib.dir ${hdfs.namespace}/lib Common directory in HDFS for, among others, JAR files for coprocessors
hdfs.namespace /${root.namespace} Root directory for HDFS files written by CDAP
hdfs.user yarn User name for accessing HDFS
instance.name ${root.namespace} Determines a unique identifier for a CDAP instance. It is used for providing authorization to a particular CDAP instance. Must be alphanumeric, and should not be changed after CDAP has been started. If it is changed, there is a risk of losing data (for example, authorization policies).
local.data.dir data Data directory for standalone mode
mapreduce.include.custom.format.classes true Indicates whether to include custom input/output format classes in the job.jar or not; if set to true, custom format classes will be added to the job.jar and available as part of the MapReduce system classpath
mapreduce.jobclient.connect.max.retries 2 Indicates the maximum number of retries JobClient will make to establish a service connection when retrieving job status and history
namespaces.dir namespaces The sub-directory of ${hdfs.namespace} in which namespaces are stored
root.namespace cdap Root for this CDAP instance; used as the parent (or root) node for ZooKeeper, as the directory under which all CDAP data and metadata is stored in HDFS, and as the prefix for all HBase tables created by CDAP; must be composed of alphanumeric characters
master.startup.checks.enabled true Whether checks should be run before startup to determine if the CDAP Master can be run correctly. Which checks are run is determined by the master.startup.checks.packages and master.startup.checks.classes settings. If any checks fail, the CDAP Master will fail to start instead of waiting for the problem to be fixed. This setting only affects Distributed CDAP. It does not apply to Standalone CDAP.
master.startup.checks.packages co.cask.cdap.master.startup Comma-separated list of packages containing checks that will be run before the CDAP Master starts up. If any of the checks fails, the CDAP Master will not start up. Checks will only be run if master.startup.checks.enabled is set to true.
master.startup.checks.classes   Comma-separated list of classnames for checks that will be run before the CDAP Master starts up. If any of the checks fails, the CDAP Master will not start up. Checks will only be run if master.startup.checks.enabled is set to true.
thrift.max.read.buffer 16777216 Specifies the maximum read buffer size in bytes used by the Thrift service; value should be set to greater than the maximum frame sent on the RPC channel
twill.java.reserved.memory.mb 250 Reserved non-heap memory in megabytes for Apache Twill container
twill.jvm.gc.opts -verbose:gc -Xloggc:&lt;LOG_DIR&gt;/gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M Java garbage collection options for all Apache Twill containers; “&lt;LOG_DIR&gt;” is the location of the log directory in the container; note that the special characters are replaced with entity equivalents so they can be included in the XML
twill.no.container.timeout 120000 Amount of time in milliseconds to wait for at least one container for Apache Twill runnable
twill.zookeeper.namespace /twill ZooKeeper namespace prefix for Apache Twill
zookeeper.quorum 127.0.0.1:2181/${root.namespace} ZooKeeper quorum string; specifies the ZooKeeper host:port; substitute the quorum (FQDN1:2181,FQDN2:2181,...) for the components shown here
zookeeper.session.timeout.millis 40000 ZooKeeper session timeout in milliseconds
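
The following sketch illustrates two of the General settings above for a distributed installation. The host names zk1.example.com through zk3.example.com are hypothetical placeholders, and the shortened list of GC options is only an illustration of how the angle brackets around LOG_DIR are written as entities inside an XML value.

```xml
<configuration>
  <!-- Hypothetical three-node ZooKeeper quorum; the default path suffix is kept -->
  <property>
    <name>zookeeper.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/${root.namespace}</value>
  </property>

  <!-- Illustrative subset of GC options; &lt; and &gt; stand for the literal < and > -->
  <property>
    <name>twill.jvm.gc.opts</name>
    <value>-verbose:gc -Xloggc:&lt;LOG_DIR&gt;/gc.log -XX:+PrintGCDetails</value>
  </property>
</configuration>
```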

Global

Parameter Name Default Value Description
dataset.unchecked.upgrade false By default, any changes made to existing datasets are not deployed when an app is redeployed; setting this value to true allows the dataset changes to be deployed upon app redeployment.
enable.unrecoverable.reset false Determines if resetting CDAP should be enabled. WARNING: Enabling this option makes it possible to delete all applications and data; NO RECOVERY IS POSSIBLE!

Applications

Parameter Name Default Value Description
app.artifact.dir /opt/cdap/master/artifacts Semicolon-separated list of local directories scanned for system artifacts to add to the artifact repository
app.bind.address 0.0.0.0 App Fabric service bind address
app.output.dir /programs Directory where all archives are stored
app.program.extra.classpath   Extra Java classpath for CDAP programs
app.program.jvm.opts -XX:MaxPermSize=128M ${twill.jvm.gc.opts} Java options for all program containers
app.program.runid.corrector.interval 180 Interval, in seconds, at which the run id corrector thread runs; this value should be greater than 0
app.program.runtime.extensions.dir /opt/cdap/master/ext/runtimes Semicolon-separated list of local directories that are scanned for program runtime extensions
app.program.spark.yarn.client.rewrite.enabled true Specify whether to rewrite the yarn.Client class in Spark to work around the issue SPARK-13441 in CDM clusters.
app.temp.dir /tmp Temp directory
scheduler.max.thread.pool.size 100 Size of the scheduler thread pool
workflow.token.max.size.mb 30 Maximum allowed size of a Workflow Token, in megabytes; if the workflow token exceeds this size, no further updates are allowed

Audit

Parameter Name Default Value Description
audit.enabled true Determines whether to publish audit messages to Apache Kafka
audit.kafka.topic audit Apache Kafka topic name to which audit messages are published

Datasets

Parameter Name Default Value Description
data.local.storage ${local.data.dir}/ldb Database directory for LevelDB, used for data fabric in standalone mode
data.local.storage.blocksize 1024 Block size in bytes for data fabric when in standalone mode
data.local.storage.cachesize 104857600 Cache size in bytes for data fabric when in standalone mode
data.tx.bind.address 0.0.0.0 Transaction service bind address
data.tx.bind.port 15165 Transaction service bind port
data.tx.client.count 50 The number of pooled instances of the transaction client; increase this to increase transaction concurrency
data.tx.client.provider pool Provider strategy for transaction clients; valid values are “pool” and “thread-local”
data.tx.discovery.service.name transaction Name in discovery service for the transaction service
data.tx.hdfs.user ${hdfs.user} User name for accessing HDFS (if not running in secure HDFS)
data.tx.janitor.enable true Determines if the TransactionDataJanitor coprocessor is enabled on tables; normally should be true
data.tx.max.instances ${master.service.max.instances} Maximum number of transaction client instances
data.tx.memory.mb ${master.service.memory.mb} Memory in megabytes of the transaction clients
data.tx.num.cores ${master.service.num.cores} Maximum number of transaction client cores
data.tx.num.instances 1 Requested number of transaction client instances
data.tx.server.io.threads 2 Number of IO threads for the transaction service
data.tx.server.threads 25 Number of threads for the transaction service
data.tx.snapshot.codecs co.cask.cdap.data2.transaction.snapshot.SnapshotCodecV1,co.cask.cdap.data2.transaction.snapshot.SnapshotCodecV2,co.cask.tephra.snapshot.SnapshotCodecV3,co.cask.tephra.snapshot.SnapshotCodecV4 Specifies the class names of all supported transaction state codecs
data.tx.snapshot.dir ${hdfs.namespace}/tx.snapshot Directory in HDFS used to store snapshots and logs of transaction state
data.tx.snapshot.interval 60 Frequency of transaction snapshots in seconds
data.tx.snapshot.local.dir ${local.data.dir}/tx.snapshot Storage directory on the local filesystem of snapshot and logs of transaction state when in standalone mode
data.tx.snapshot.retain 10 Number of transaction snapshot files to retain as backups
data.tx.thrift.max.read.buffer ${thrift.max.read.buffer} Maximum read buffer size (in bytes) used by the transaction service; the value should be set to something greater than the maximum frame sent on the RPC channel
data.tx.timeout 30 Timeout value in seconds for a transaction; if the transaction is not finished in that time, it is marked invalid
dataset.data.dir data Base directory for user data on the filesystem
dataset.executor.container.instances 1 Number of dataset executor instances
dataset.executor.container.memory.mb 512 Amount of memory in megabytes for each dataset executor instance
dataset.executor.container.num.cores 1 Number of virtual cores for each dataset executor instance
dataset.executor.max.instances ${master.service.max.instances} Maximum number of dataset executor instances
dataset.extensions.dir /opt/cdap/ext/lib Directory where all dataset extensions are stored
dataset.service.bind.address 0.0.0.0 Dataset service bind address
dataset.service.output.dir /datasets Directory where all dataset modules archives are stored
dataset.table.prefix ${root.namespace} Prefix for dataset table name

Explore Service

Parameter Name Default Value Description
explore.active.operation.timeout.secs 86400 Timeout value in seconds for an SQL operation whose result was not fetched completely
explore.cleanup.job.schedule.secs 60 Time in seconds to schedule clean-up job to timeout operations
explore.enabled true Determines if the CDAP Explore Service (ad-hoc SQL queries) is enabled
explore.executor.container.instances 1 Number of explore executor instances
explore.executor.container.num.cores 1 Number of virtual cores for each explore executor instance
explore.executor.container.memory.mb 1024 Amount of memory in megabytes for each explore executor instance
explore.executor.max.instances 1 Maximum number of explore executor instances
explore.inactive.operation.timeout.secs 3600 Timeout value in seconds for an SQL operation which does not have any more results to be fetched
explore.local.data.dir ${local.data.dir}/explore Data directory for CDAP Explore Service when in Standalone mode
explore.start.on.demand false Determines the start-up of the CDAP Explore Service (ad-hoc SQL queries); if false, the explore service starts up when CDAP is started; if true, the CDAP Explore Service will start upon the first query it receives
explore.writes.enabled true Determines if writing to a table through the CDAP Explore Service (ad-hoc SQL queries) is enabled

Gateway

Parameter Name Default Value Description
app.boss.threads 1 Number of Netty service boss threads
app.connection.backlog 20000 Max connection backlog of CDAP Master
app.exec.threads 20 Number of Netty service executor threads
app.worker.threads 10 Number of Netty service worker threads

Kafka Server

Parameter Name Default Value Description
kafka.bind.address 0.0.0.0 CDAP Kafka service bind address
kafka.bind.port 9092 CDAP Kafka service bind port
kafka.default.replication.factor 1 CDAP Kafka service replication factor; used to replicate Kafka messages across multiple machines to prevent data loss in the event of a hardware failure. The recommended setting is to run at least two CDAP Kafka servers. If you are running two CDAP Kafka servers, set this value to 2; otherwise, set it to the number of CDAP Kafka servers.
kafka.log.dir /tmp/kafka-logs CDAP Kafka service log storage directory
kafka.num.partitions 10 Default number of partitions for a topic
kafka.seed.brokers 127.0.0.1:9092 Comma-separated list of CDAP Kafka service brokers; for distributed CDAP, replace with list of FQDN:port brokers
kafka.zookeeper.namespace kafka CDAP Kafka service ZooKeeper namespace
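
As a sketch of the replication guidance above, the fragment below assumes a distributed installation with two CDAP Kafka servers. The host names kafka1.example.com and kafka2.example.com are hypothetical; the replication factor is set to 2 to match the number of servers.

```xml
<configuration>
  <!-- Hypothetical FQDN:port list replacing the default 127.0.0.1:9092 -->
  <property>
    <name>kafka.seed.brokers</name>
    <value>kafka1.example.com:9092,kafka2.example.com:9092</value>
  </property>

  <!-- Two CDAP Kafka servers, so each message is replicated to both -->
  <property>
    <name>kafka.default.replication.factor</name>
    <value>2</value>
  </property>
</configuration>
```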

Logging

Parameter Name Default Value Description
log.base.dir /logs/avro Base log directory
log.cleanup.run.interval.mins 1440 Log cleanup interval in minutes
log.collection.root ${local.data.dir}/logs Root location for collecting logs when in standalone mode
log.kafka.topic logs.user-v2 Kafka topic name used to publish logs
log.publish.num.partitions 10 Number of CDAP Kafka service partitions to publish the logs to
log.retention.duration.days 7 Log file HDFS retention duration in days
log.saver.max.instances ${master.service.max.instances} Maximum number of log saver instances to run in YARN
log.saver.num.instances 1 Number of log saver instances to run in YARN
log.saver.run.memory.megs 1024 Memory in megabytes allocated for log saver instances to run in YARN
log.saver.run.num.cores 2 Number of cores for each log saver instance in YARN
log.saver.status.bind.address 0.0.0.0 Log Saver HTTP service bind address

Master

Parameter Name Default Value Description
http.service.boss.threads 1 Number of Netty service boss threads for master HTTP services
http.service.connection.backlog 20000 Max connection backlog of master HTTP service
http.service.exec.threads 20 Number of Netty service executor threads for master HTTP services
http.service.worker.threads 10 Number of Netty service worker threads for master HTTP services
master.collect.app.containers.log.level ERROR The log level of application container logs that are streamed back to the CDAP Master process log. The levels supported are [ ALL, TRACE, DEBUG, INFO, WARN, ERROR, OFF ].
master.collect.containers.log true Determines if master service container logs are streamed back to the CDAP Master process log
master.service.max.instances 5 Maximum number of Master Service instances
master.service.memory.mb 512 Size of memory in megabytes for Master Service instance
master.service.num.cores 2 Number of cores for Master Service instance
master.startup.service.timeout.seconds 600 Timeout in seconds for master services to wait for their dependent services to be available. For example, the dataset executor master service requires the transaction service, and will wait for the transaction service to become available while it is starting up. If the timeout is hit, the service will fail to start and the master service will shut itself down. If set to 0 or below, master services will not wait for their dependent services to start before starting themselves.
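
Several defaults in this appendix are written as references, such as ${master.service.memory.mb}. The sketch below assumes Hadoop-style variable expansion, in which a reference is resolved against the merged configuration when the property is read. Under that assumption, raising master.service.memory.mb also raises every property that defaults to the reference (for example data.tx.memory.mb), while a property set explicitly keeps its own value. The numbers are illustrative only.

```xml
<configuration>
  <!-- Properties defaulting to ${master.service.memory.mb} would resolve to 1024 -->
  <property>
    <name>master.service.memory.mb</name>
    <value>1024</value>
  </property>

  <!-- An explicit value replaces the reference for this one property -->
  <property>
    <name>metrics.memory.mb</name>
    <value>2048</value>
  </property>
</configuration>
```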

Metadata

Parameter Name Default Value Description
metadata.max.allowed.chars 50 Maximum number of characters for metadata keys, values, and tags
metadata.service.bind.address 0.0.0.0 Metadata HTTP service bind address
metadata.service.exec.threads ${http.service.exec.threads} Number of Netty service executor threads for metadata HTTP service
metadata.service.worker.threads ${http.service.worker.threads} Number of Netty service IO worker threads for metadata HTTP service
metadata.updates.kafka.broker.list 127.0.0.1:${kafka.bind.port} Apache Kafka broker list to which metadata update notifications are published (deprecated)
metadata.updates.kafka.topic cdap-metadata-updates Apache Kafka topic name to which metadata update notifications are published (deprecated)
metadata.updates.publish.enabled false Determines if metadata updates will be published to Apache Kafka. External systems can subscribe to the Kafka topic determined by ${metadata.updates.kafka.topic} to receive notifications of metadata updates (deprecated).

Metrics

Parameter Name Default Value Description
metrics.boss.threads ${http.service.boss.threads} Number of Netty service boss threads for metrics HTTP services
metrics.connection.backlog ${http.service.connection.backlog} Max connection backlog of metrics HTTP service
metrics.data.table.retention.resolution.1.seconds 7200 Retention period, in seconds, of the 1-second resolution table; default retention period is 2 hours
metrics.data.table.retention.resolution.3600.seconds 2592000 Retention period, in seconds, of the 1-hour resolution table; default retention period is 30 days
metrics.data.table.retention.resolution.60.seconds 2592000 Retention period, in seconds, of the 1-minute resolution table; default retention period is 30 days
metrics.data.table.ts.rollTime.3600 24 Number of columns in a 1-hour resolution timeseries table
metrics.data.table.ts.rollTime.60 60 Number of columns in a 1-minute resolution timeseries table
metrics.dataset.hbase.stats.report.interval 60 Report interval for HBase stats, in seconds
metrics.dataset.leveldb.stats.report.interval 60 Report interval for LevelDB stats, in seconds
metrics.exec.threads ${http.service.exec.threads} Number of Netty service executor threads for metrics HTTP services
metrics.kafka.partition.size 10 Number of partitions for metrics topic
metrics.max.instances ${master.service.max.instances} Maximum number of instances for the metrics service
metrics.memory.mb ${master.service.memory.mb} Memory assigned to the metrics service in megabytes
metrics.num.cores ${master.service.num.cores} Number of virtual cores for the metrics service
metrics.num.instances 1 Number of instances for the metrics service
metrics.processor.max.instances ${master.service.max.instances} Maximum number of instances for metrics processor service Apache Twill runnable
metrics.processor.memory.mb 512 Size of memory in megabytes for metrics processor service Apache Twill runnable
metrics.processor.num.cores 1 Number of cores for metrics processor service Apache Twill runnable
metrics.processor.num.instances 1 Number of instances for metrics processor service Apache Twill runnable
metrics.processor.status.bind.address 0.0.0.0 Metrics Processor HTTP service bind address
metrics.query.bind.port 45005 Metrics Query service bind port
metrics.worker.threads ${http.service.worker.threads} Number of Netty service worker threads for metrics HTTP services
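
The retention values above are plain second counts: 7200 is 2 x 3600 (2 hours) and 2592000 is 30 x 24 x 3600 (30 days). As an illustrative sketch, not a recommendation, the fragment below shortens the 1-second table to 1 hour (3600) and the 1-minute table to 7 days (7 x 24 x 3600 = 604800).

```xml
<configuration>
  <!-- 1 hour = 3600 seconds -->
  <property>
    <name>metrics.data.table.retention.resolution.1.seconds</name>
    <value>3600</value>
  </property>

  <!-- 7 days = 7 x 24 x 3600 = 604800 seconds -->
  <property>
    <name>metrics.data.table.retention.resolution.60.seconds</name>
    <value>604800</value>
  </property>
</configuration>
```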

Monitor Handler

Parameter Name Default Value Description
monitor.handler.service.discovery.timeout.seconds 1 Timeout in seconds for service discovery used in monitor handler service status check

Notification System

Parameter Name Default Value Description
notification.kafka.topic notifications Kafka topic name used to publish notifications
notification.transport.system kafka Transport system used by the notification system; can be either ‘kafka’ or ‘stream’

Queue

Parameter Name Default Value Description
data.queue.config.update.interval 5 Frequency, in seconds, of updates to the queue consumer configuration used in evicting queue entries on flush and compaction
data.queue.dequeue.tx.percent 30 Percentage of transaction time allowed to spend in dequeue; it should be an integer between 1-100
data.queue.table.presplits 16 Number of splits in the queue table

Router

Parameter Name Default Value Description
router.bind.address 0.0.0.0 CDAP Router service bind address
router.bind.port 10000 CDAP Router service bind port
router.client.boss.threads 1 The number of boss threads in the CDAP Router service client
router.client.worker.threads 10 The number of worker threads in the CDAP Router service client
router.connection.backlog 20000 The connection backlog in the CDAP Router service
router.connection.idle.timeout.secs 15 The number of seconds after an HTTP request completes that idle router connections are closed
router.server.address 127.0.0.1 CDAP Router service address to which CDAP UI connects
router.server.boss.threads 1 The number of boss threads in the CDAP Router service
router.server.port ${router.bind.port} CDAP Router service port to which CDAP UI connects
router.server.worker.threads 10 The number of worker threads in the CDAP Router service
router.ssl.bind.port 10443 CDAP Router service bind port for HTTPS
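
When the CDAP UI and the CDAP Router run on different hosts, the default router.server.address of 127.0.0.1 no longer points at the router. The fragment below is a sketch for that situation; router.example.com is a hypothetical host name, and router.server.port is left at its default so that it continues to follow router.bind.port.

```xml
<configuration>
  <!-- Hypothetical host on which the CDAP Router service runs -->
  <property>
    <name>router.server.address</name>
    <value>router.example.com</value>
  </property>
</configuration>
```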

Security

Parameter Name Default Value Description
cdap.master.kerberos.keytab   The full path to the Kerberos keytab file containing the CDAP Master service’s credentials
cdap.master.kerberos.principal   Example: “CDAP_PRINCIPAL/_HOST@EXAMPLE.COM”. The Kerberos primary user that should be used to log in to the CDAP Master service. Substitute the Kerberos primary (user) for CDAP_PRINCIPAL, and your domain for EXAMPLE.COM. The string “_HOST” will be substituted with the local hostname.
kerberos.auth.enabled ${security.enabled} Determines if Kerberos authentication is enabled
kerberos.auth.relogin.interval.seconds 300 Relogin interval for Kerberos keytab
security.auth.server.bind.address 0.0.0.0 CDAP Authentication service bind address
security.auth.server.bind.port 10009 CDAP Authentication service bind port
security.auth.server.ssl.bind.port 10010 CDAP Authentication service bind port for HTTPS
security.authentication.basic.realmfile   Username / password file to use when basic authentication is configured
security.authentication.handlerClassName   Name of the authentication implementation to use to validate user credentials
security.authentication.loginmodule.className   JAAS LoginModule implementation to use when co.cask.security.server.JAASAuthenticationHandler is configured for security.authentication.handlerClassName
security.authorization.enabled false When set to true, all operations in CDAP are authorized using the authorizer implementation found at the property security.authorization.extension.jar.path.
security.authorization.extension.jar.path   If an external authorization system is used for authorizing operations on CDAP entities, this property sets the path to the bundled JAR file containing the extension code. This JAR is only used when authorization is enabled by setting security.authorization.enabled to true.
security.data.keyfile.path ${local.data.dir}/security/keyfile Path to the secret key file (only used in standalone mode)
security.enabled false Determines if authentication is enabled for CDAP; if true, all requests to CDAP must provide a valid access token
security.realm cdap Authentication realm used for scoping security; this value should be unique for each installation of CDAP
security.server.extended.token.expiration.ms 604800000 Admin tool access token expiration time in milliseconds; defaults to 1 week (internal)
security.server.maxthreads 100 Maximum number of threads that the CDAP Authentication service should use for handling HTTP requests
security.server.token.expiration.ms 86400000 AccessToken expiration time in milliseconds (defaults to 24 hours)
security.token.digest.algorithm HmacSHA256 Algorithm used for generating MAC of access tokens
security.token.digest.key.expiration.ms 3600000 Time duration (in milliseconds) after which an active secret key used for signing tokens should be retired
security.token.digest.keylength 128 Key length used in generating the secret keys for generating MAC of access tokens
security.token.distributed.parent.znode /${root.namespace}/security/auth Parent node in ZooKeeper used for secret key distribution in distributed mode
ssl.enabled false Determines if SSL is enabled
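
The fragment below sketches how the Kerberos-related properties above fit together. It is not a complete security configuration (see the CDAP Security section for that). The primary "cdap", the realm EXAMPLE.COM, and the keytab path are hypothetical placeholders; kerberos.auth.enabled is not set because it defaults to ${security.enabled}.

```xml
<configuration>
  <!-- Requires a valid access token on all requests to CDAP -->
  <property>
    <name>security.enabled</name>
    <value>true</value>
  </property>

  <!-- Hypothetical principal; _HOST is replaced with the local hostname -->
  <property>
    <name>cdap.master.kerberos.principal</name>
    <value>cdap/_HOST@EXAMPLE.COM</value>
  </property>

  <!-- Hypothetical full path to the keytab holding the CDAP Master credentials -->
  <property>
    <name>cdap.master.kerberos.keytab</name>
    <value>/etc/security/keytabs/cdap.service.keytab</value>
  </property>
</configuration>
```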

Stream

Parameter Name Default Value Description
stream.async.queue.size 100 Queue size per async worker thread for queuing up async write requests
stream.async.worker.threads ${stream.worker.threads} Number of async worker threads for handling async write requests
stream.base.dir /streams The directory root for all stream files, relative to the HDFS namespace
stream.batch.buffer.threshold 1048576 Bytes retained in-memory before writing to a new stream file
stream.bind.address 0.0.0.0 Stream HTTP service bind address
stream.consumer.table.presplits 16 Number of splits for the stream consumer table
stream.container.instance.id 0 Instance ID for the stream service container; the actual value will be set at runtime by the system automatically
stream.container.instances 1 Number of YARN container instances for the stream handler; in standalone mode, it’s always one
stream.container.memory.mb 512 Amount of memory in megabytes for the YARN container that runs the stream handler
stream.container.num.cores 2 Number of virtual cores for the YARN container that runs the stream handler
stream.event.ttl 9223372036854775807 Default time to live in milliseconds (Long.MAX_VALUE) for stream events
stream.file.cleanup.period 300000 Default time interval in milliseconds for running the stream file cleanup process
stream.file.prefix file Prefix of file name for stream file
stream.index.interval 10000 Default time interval in milliseconds for emitting new index entry in stream file
stream.instance.file.prefix [Final] ${stream.file.prefix}.${stream.container.instance.id} Prefix of file name for stream file per writer instance
stream.notification.threshold 1024 Size of data, in megabytes, to be ingested by a stream before a notification is published
stream.partition.duration 3600000 The default duration of a stream partition in milliseconds
stream.size.schedule.polling.delay 600 Delay, in seconds, to poll a stream in a StreamSizeSchedule if no notification is received
stream.worker.threads ${http.service.worker.threads} Default number of IO worker threads for the stream HTTP service

UI

Parameter Name Default Value Description
dashboard.bind.address 0.0.0.0 CDAP UI bind address
dashboard.bind.port 9999 CDAP UI bind port
dashboard.router.check.timeout.secs 0 Amount of time, in seconds, CDAP UI waits before exiting when unable to connect to CDAP Router service on startup; use a timeout of 0 to wait indefinitely
dashboard.ssl.bind.port 9443 CDAP UI bind port for HTTPS
dashboard.ssl.disable.cert.check false True to disable SSL certificate check from the CDAP UI
http.client.connection.timeout.ms 60000 Timeout in milliseconds for internal HTTP requests

Notes

[Final]: Properties marked [Final] have values that cannot be changed, even by a setting in cdap-site.xml.
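
For orientation, the sketch below shows how a final property might appear inside cdap-default.xml. It assumes the Hadoop-style final element is what marks a property as [Final]; the source does not show the file itself, so treat the exact markup as an assumption. A property of the same name placed in cdap-site.xml would have no effect.

```xml
<configuration>
  <!-- Assumed markup: the final element prevents later files from overriding the value -->
  <property>
    <name>stream.instance.file.prefix</name>
    <value>${stream.file.prefix}.${stream.container.instance.id}</value>
    <final>true</final>
  </property>
</configuration>
```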