Metrics HTTP RESTful API

As Applications process data, CDAP collects metrics about the Application’s behavior and performance. Some of these metrics are the same for every Application—how many events are processed, how many data operations are performed, etc.—and are thus called system or CDAP metrics.

Other metrics are user-defined and differ from Application to Application. For details on how to add metrics to your Application, see the section on User-Defined Metrics in the CDAP Administration Manual.

Metrics Data

Metrics data is identified as a combination of two entities:

  • Metrics context
  • Metrics name

Metrics contexts are hierarchical, rooted in the CDAP instance, and extend through namespaces, applications, and down to the individual elements.

For example: the metrics context namespace.default.app.PurchaseHistory.flow.PurchaseFlow is a context that identifies a Flow. It has a parent context, namespace.default.app.PurchaseHistory, which identifies the parent application.

Each level of the context is described by a pair, composed of a tag name and a value, such as:

  • flow.PurchaseFlow (tag name: flow, value: PurchaseFlow)
  • app.PurchaseHistory (tag name: app, value: PurchaseHistory)
  • namespace.default (tag name: namespace, value: default)
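
The tag-and-value structure described above can be illustrated with a minimal Python sketch. It assumes that tag values never contain a dot, which the source does not state; the helper names `parse_context` and `parent_context` are hypothetical and not part of CDAP.

```python
from typing import List, Optional, Tuple


def parse_context(context: str) -> List[Tuple[str, str]]:
    """Split a dotted metrics context into (tag name, value) pairs."""
    parts = context.split(".")
    if len(parts) % 2 != 0:
        raise ValueError(f"context must alternate tag names and values: {context!r}")
    # Even positions hold tag names, odd positions hold their values.
    return list(zip(parts[0::2], parts[1::2]))


def parent_context(context: str) -> Optional[str]:
    """Return the enclosing context, or None for a first-level context."""
    pairs = parse_context(context)
    if len(pairs) <= 1:
        return None
    # Dropping the last pair yields the parent context.
    return ".".join(f"{tag}.{value}" for tag, value in pairs[:-1])


flow = "namespace.default.app.PurchaseHistory.flow.PurchaseFlow"
print(parse_context(flow))
print(parent_context(flow))
```

Applied to the Flow context from the example above, the second call yields the context of the parent application.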

Metrics names are either generated by CDAP and prefixed with system, or set by a developer when writing an application and prefixed with user.

System metrics vary depending on the context; a list is available of common system metrics for different contexts. User metrics are defined by the application developer and thus are completely dependent on what the developer sets.

In either case, searches using this API will show, for a given context, all available metrics.

Available Contexts

The context of a metric is typically enclosed in a hierarchy of contexts. For example, the Flowlet context is enclosed in the Flow context, which in turn is enclosed in the Application context. A metric can always be queried (and aggregated) relative to any enclosing context. These are the available Application contexts of CDAP:

System Metric                                Context
One Flowlet of a Flow                        namespace.<namespace>.app.<app-id>.flow.<flow-id>.flowlet.<flowlet-id>
All Flowlets of a Flow                       namespace.<namespace>.app.<app-id>.flow.<flow-id>
All Flowlets of all Flows of an Application  namespace.<namespace>.app.<app-id>.flow.*
One Worker                                   namespace.<namespace>.app.<app-id>.worker.<worker-id>
All Workers of an Application                namespace.<namespace>.app.<app-id>.worker.*
All Mappers of a MapReduce                   namespace.<namespace>.app.<app-id>.mapreduce.<mapreduce-id>.tasktype.m
All Reducers of a MapReduce                  namespace.<namespace>.app.<app-id>.mapreduce.<mapreduce-id>.tasktype.r
One MapReduce                                namespace.<namespace>.app.<app-id>.mapreduce.<mapreduce-id>
All MapReduce of an Application              namespace.<namespace>.app.<app-id>.mapreduce.*
One Spark Program                            namespace.<namespace>.app.<app-id>.spark.<spark-id>
One Service                                  namespace.<namespace>.app.<app-id>.service.<service-id>
All Services of an Application               namespace.<namespace>.app.<app-id>.service.*
All components of an Application             namespace.<namespace>.app.<app-id>
All components of all Applications           namespace.<namespace>

Stream metrics are only available at the Stream level and the only available context is:

Stream Metric    Context
A single Stream  namespace.<namespace>.stream.<stream-id>

Dataset metrics are available at the Dataset level, but they can also be queried down to the Flowlet, Worker, Service, Mapper, or Reducer level:

Dataset Metric                                             Context
A single Dataset in the context of a single Flowlet        namespace.<namespace>.dataset.<dataset-id>.app.<app-id>.flow.<flow-id>.flowlet.<flowlet-id>
A single Dataset in the context of a single Flow           namespace.<namespace>.dataset.<dataset-id>.app.<app-id>.flow.<flow-id>
A single Dataset in the context of a specific Application  namespace.<namespace>.dataset.<dataset-id>.app.<app-id>
A single Dataset across all Applications                   namespace.<namespace>.dataset.<dataset-id>
All Datasets across all Applications                       namespace.<namespace>.dataset.*

Available System Metrics

Note that a user metric may have the same name as a system metric; they are distinguished by prepending the respective prefix when querying: user or system.

These metrics are available in a Datasets context:

Datasets Metric      Description
system.store.bytes   Number of bytes written
system.store.ops     Operations (reads and writes) performed
system.store.reads   Read operations performed
system.store.writes  Write operations performed

These metrics are available in a Flowlet context:

Flowlet Metric                   Description
system.process.errors            Number of errors while processing
system.process.events.processed  Number of events/data objects processed
system.process.events.in         Number of events read in by the Flowlet
system.process.events.out        Number of events emitted by the Flowlet
system.process.tuples.read       Number of tuples read by the Flowlet
system.store.bytes               Number of bytes written to Datasets
system.store.ops                 Operations (reads and writes) performed on Datasets
system.store.reads               Read operations performed on Datasets
system.store.writes              Write operations performed on Datasets

These metrics are available in a Mappers and Reducers context:

Mappers and Reducers Metric  Description
system.process.completion    A number from 0 to 100 indicating the progress of the Map or Reduce phase
system.process.entries.in    Number of entries read in by the Map or Reduce phase
system.process.entries.out   Number of entries written out by the Map or Reduce phase

These metrics are available in a Services context:

Services Metric                     Description
system.requests.count               Number of requests made to the Service
system.response.successful.count    Number of successful requests completed by the Service
system.response.server.error.count  Number of failures seen by the Service

These metrics are available in a Spark context, where <spark-id> depends on the Spark program being queried:

Spark Metric                                           Description
system.<spark-id>.BlockManager.disk.diskSpaceUsed_MB   Disk space used by the Block Manager
system.<spark-id>.BlockManager.memory.maxMem_MB        Maximum memory given to the Block Manager
system.<spark-id>.BlockManager.memory.memUsed_MB       Memory used by the Block Manager
system.<spark-id>.BlockManager.memory.remainingMem_MB  Memory remaining to the Block Manager
system.<spark-id>.DAGScheduler.job.activeJobs          Number of active jobs
system.<spark-id>.DAGScheduler.job.allJobs             Total number of jobs
system.<spark-id>.DAGScheduler.stage.failedStages      Number of failed stages
system.<spark-id>.DAGScheduler.stage.runningStages     Number of running stages
system.<spark-id>.DAGScheduler.stage.waitingStages     Number of waiting stages

These metrics are available in a Streams context:

Streams Metric         Description
system.collect.events  Number of events collected by the Stream
system.collect.bytes   Number of bytes collected by the Stream

Searches and Queries

The process of retrieving a metric involves these steps:

  1. Obtain (usually through a search) the correct context for a metric;
  2. Obtain (usually through a search within the context) the available metrics;
  3. Query for a specific metric, supplying the context and any parameters.
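
The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: `post` is a hypothetical callable (not a CDAP API) that issues an HTTP POST to the given URL and returns the parsed JSON body, and the function name `discover_and_query` is likewise made up for this example.

```python
from typing import Callable, Dict
from urllib.parse import urlencode

# Hypothetical transport: takes a full URL, issues the POST, returns parsed JSON.
Post = Callable[[str], object]


def discover_and_query(post: Post, base_url: str, parent: str,
                       wanted_metric: str) -> Dict[str, object]:
    """Walk the three steps: find child contexts, list metrics, query one."""
    results = {}
    # Step 1: list the child contexts of the parent context.
    contexts = post(f"{base_url}/metrics/search?" + urlencode(
        {"target": "childContext", "context": parent}))
    for context in contexts:
        # Step 2: list the metrics available in each child context.
        metrics = post(f"{base_url}/metrics/search?" + urlencode(
            {"target": "metric", "context": context}))
        if wanted_metric in metrics:
            # Step 3: query the metric, here as an aggregate value.
            results[context] = post(f"{base_url}/metrics/query?" + urlencode(
                {"context": context, "metric": wanted_metric, "aggregate": "true"}))
    return results
```

Passing the transport in as a parameter keeps the sketch independent of any particular HTTP client; in practice `post` could wrap whichever client library is already in use.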

Search for Contexts

To search for the available contexts, perform an HTTP request:

POST '<base-url>/metrics/search?target=childContext[&context=<context>]'

The optional <context> defines a metrics context to search within. If it is not provided, the search is performed across all data. The available contexts that are returned can be used to query for lower-level contexts.

You can also define the query to search in a given context across all values of one or more tags provided in the context by specifying * as a value for a tag. See the examples below for its use.

Parameter  Description
<context>  [Optional] Metrics context to search within. If not provided, the search is performed across all contexts.

Examples

HTTP Method POST '<base-url>/metrics/search?target=childContext'
Returns ["namespace.default", "namespace.system"]
Description Returns all first-level contexts; in this case, two namespaces.
   
HTTP Method POST '<base-url>/metrics/search?target=childContext&context=namespace.default'
Returns ["namespace.default.app.HelloWorld", "namespace.default.app.PurchaseHistory", "namespace.default.dataset.purchases", "namespace.default.dataset.whom", "namespace.default.stream.purchaseStream", ..., "namespace.default.stream.who"]
Description Returns all child contexts of the given parent context; in this case, all entities in the default namespace.
   
HTTP Method POST '<base-url>/metrics/search?target=childContext&context=namespace.default.app.PurchaseHistory.flow.PurchaseFlow.run.*'
Returns ["namespace.default.app.PurchaseHistory.flow.PurchaseFlow.run.*.flowlet.collector", "namespace.default.app.PurchaseHistory.flow.PurchaseFlow.run.*.flowlet.reader"]
Description Queries all available contexts within the PurchaseHistory's PurchaseFlow for any run; in this case, it returns all available Flowlets.

Search for Metrics

To search for the available metrics within a given context, perform an HTTP POST request:

POST '<base-url>/metrics/search?target=metric&context=<context>'

Parameter  Description
<context>  Metrics context to search within.

Example

HTTP Method POST '<base-url>/metrics/search?target=metric&context=namespace.default.app.PurchaseHistory'
Returns ["system.dataset.store.bytes","system.dataset.store.ops","system.dataset.store.reads", "system.dataset.store.writes","system.process.bytes",...,"user.customers.count"]
Description Returns all metrics in the context of the application PurchaseHistory of the default namespace; in this case, returns a list of system and user-defined metrics.

Querying A Metric

Once you know the context and the metric to query, you can formulate a request for the metrics data.

To query a metric within a given context, perform an HTTP POST request:

POST '<base-url>/metrics/query?context=<context>[&groupBy=<tags>]&metric=<metric>&<time-range>'

Parameter     Description
<context>     Metrics context to search within
<tags>        (Optional) Comma-separated tag list by which to group results
<metric>      Metric being queried
<time-range>  A time range or aggregate=true for all since the Application was deployed
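
As a rough illustration of how these parameters combine, the following Python sketch assembles a query URL. The function name `build_query_url` and the base URL in the usage line are placeholders, not part of CDAP. Leaving `*` and `,` unescaped is a readability choice that matches the examples in this document; whether the server also accepts the percent-encoded forms is not stated in the source.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode


def build_query_url(base_url: str, context: str, metric: str,
                    group_by: Optional[Sequence[str]] = None,
                    **time_range: str) -> str:
    """Assemble a metrics query URL from the documented parameters."""
    params = {"context": context}
    if group_by:
        # groupBy takes a comma-separated list of tag names.
        params["groupBy"] = ",".join(group_by)
    params["metric"] = metric
    # Without an explicit time range, ask for the aggregate value.
    params.update(time_range or {"aggregate": "true"})
    # Keep '*' and ',' literal so wildcard contexts and tag lists stay readable.
    return f"{base_url}/metrics/query?" + urlencode(params, safe="*,")


# "http://cdap.example.com" stands in for <base-url>.
print(build_query_url("http://cdap.example.com",
                      "namespace.default.app.CountRandom.flow.*",
                      "system.process.events.processed",
                      start="now-5s", count="5"))
```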

Examples

HTTP Method POST '<base-url>/metrics/query?context=namespace.default.app.HelloWorld.flow.WhoFlow.flowlet.saver&metric=system.process.events.processed&aggregate=true'
Description Using a System metric, system.process.events.processed

HTTP Method POST '<base-url>/metrics/query?context=namespace.default.app.HelloWorld.flow.WhoFlow.run.13ac3a50-a435-49c8-a752-83b3c1e1b9a8.flowlet.saver&metric=user.names.bytes&aggregate=true'
Description Querying the User-defined metric names.bytes, of the Flowlet saver, by its run-ID

HTTP Method POST '<base-url>/metrics/query?context=namespace.default.app.HelloWorld.services.WhoService.runnables.WhoRun&metric=user.names.bytes'
Description Using a User-defined metric, names.bytes, in a Service’s Handler

Query Tips

  • To retrieve the number of input data objects (“events”) processed by the Flowlet named splitter, in the Flow CountRandom of the example application CountRandom, over the last 5 seconds, you can issue an HTTP POST request:

    POST '<base-url>/metrics/query?context=namespace.default.app.CountRandom.flow.CountRandom.
      flowlet.splitter&metric=system.process.events.processed&start=now-5s&count=5'
    

    This returns a JSON response that has one entry for every second in the requested time interval. It will have values only for the times where the metric was actually emitted (shown here “pretty-printed”):

    {
      "startTime": 1427225350,
      "endTime": 1427225354,
      "series": [
        {
          "metricName": "system.process.events.processed",
          "grouping": {
    
          },
          "data": [
            {
              "time": 1427225350,
              "value": 760
            },
            {
              "time": 1427225351,
              "value": 774
            },
            {
              "time": 1427225352,
              "value": 792
            },
            {
              "time": 1427225353,
              "value": 756
            },
            {
              "time": 1427225354,
              "value": 766
            }
          ]
        }
      ]
    }
    
  • You can retrieve results based on a run-id.

  • If a run-ID is not specified, we aggregate the events processed for all the runs of this flow.

    The resulting timeseries will represent aggregated values for the context specified. Currently, summation is used as the aggregation function. So, if you query for the system.process.events.processed metric for a Flow—thus across all Flowlets—since this metric was actually emitted at the Flowlet level, the resulting values retrieved will be a sum across all Flowlets of the Flow.

  • If you want the number of input objects processed across all Flowlets of a Flow, you address the metrics API at the Flow context:

    POST '<base-url>/metrics/query?context=namespace.default.app.CountRandom.flow.CountRandom.flowlet.*
      &metric=system.process.events.processed&start=now-5s&count=5'
    
  • Similarly, you can address the context of all Flows of an Application, an entire Application, or the entire namespace of a CDAP instance:

    POST '<base-url>/metrics/query?context=namespace.default.app.CountRandom.flow.*
      &metric=system.process.events.processed&start=now-5s&count=5'
    
    POST '<base-url>/metrics/query?context=namespace.default.app.CountRandom
      &metric=system.process.events.processed&start=now-5s&count=5'
    
    POST '<base-url>/metrics/query?context=namespace.default
      &metric=system.process.events.processed&start=now-5s&count=5'
    
  • To request user-defined metrics instead of system metrics, prefix the metric name with user instead of system in the metric parameter of the request.

    For example, to request a user-defined metric for the HelloWorld Application’s WhoFlow Flow:

    POST '<base-url>/metrics/query?context=namespace.default.app.HelloWorld.flow.WhoFlow.flowlet.saver
      &metric=user.names.bytes&aggregate=true'
    
  • Retrieving multiple metrics at once, by issuing an HTTP POST request with a JSON list as the request body that enumerates the name and attributes for each metric, is currently not supported in this API. Instead, use the v2 API until it is supported in a future release.

Querying for Multiple Time-series

In a query, the optional groupBy parameter defines a list of tags whose values are used to build multiple timeseries. All data points that have the same values in tags specified in the groupBy parameter will form a single timeseries. You can define multiple tags for grouping by providing a comma-separated list.

Tag List          Description
groupBy=app       Retrieves a time series for each application.
groupBy=app,flow  Retrieves a time series for each app and flow combination.

Example

The method:

POST '<base-url>/metrics/query?context=namespace.default.app.PurchaseHistory&
  groupBy=flow&metric=user.customers.count&start=now-2s&end=now'

returns the user.customers.count metric in the context of the application PurchaseHistory of the default namespace, for the specified time range, and grouped by flow (results reformatted to fit):

{
  "startTime": 1421188775,
  "endTime": 1421188777,
  "series": [
    {
      "metricName": "user.customers.count",
      "grouping": { "flow": "PurchaseHistoryFlow" },
      "data": [
        { "time": 1421188776, "value": 3 },
        { "time": 1421188777, "value": 2 }
      ]
    },
    {
      "metricName": "user.customers.count",
      "grouping": { "flow": "PurchaseAnalysisFlow" },
      "data": [
        { "time": 1421188775, "value": 1 },
        { "time": 1421188777, "value": 2 }
      ]
    }
  ]
}
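
A client will usually want to reduce such a grouped response to one number per group. The sketch below sums the data points of each series and keys the totals by the value of the grouping tag; the function name `totals_by_group` is made up for this example, and the sample is the response shown above.

```python
import json
from typing import Dict


def totals_by_group(response: dict, tag: str) -> Dict[str, int]:
    """Sum the data points of every series, keyed by the value of one grouping tag."""
    totals: Dict[str, int] = {}
    for series in response["series"]:
        # Series without the tag (for example, ungrouped queries) share the key "".
        key = series["grouping"].get(tag, "")
        totals[key] = totals.get(key, 0) + sum(p["value"] for p in series["data"])
    return totals


sample = json.loads("""
{
  "startTime": 1421188775,
  "endTime": 1421188777,
  "series": [
    {"metricName": "user.customers.count",
     "grouping": {"flow": "PurchaseHistoryFlow"},
     "data": [{"time": 1421188776, "value": 3}, {"time": 1421188777, "value": 2}]},
    {"metricName": "user.customers.count",
     "grouping": {"flow": "PurchaseAnalysisFlow"},
     "data": [{"time": 1421188775, "value": 1}, {"time": 1421188777, "value": 2}]}
  ]
}
""")
print(totals_by_group(sample, "flow"))
```

For the sample this gives 5 for PurchaseHistoryFlow and 3 for PurchaseAnalysisFlow.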

Querying by a Time Range

The time range of a metric query can be specified in various ways: either aggregate=true to retrieve the total aggregated since the Application was deployed or—in the case of Dataset metrics—since a Dataset was created; or as a start and end to define a specific range and return a series of data points.

By default, queries without a time range retrieve a value based on aggregate=true.

Parameter                   Description
aggregate=true              Total aggregated value for the metric since the Application was deployed. If the metric is a gauge type, the aggregate will return the latest value set for the metric.
start=<time>&end=<time>     Time range defined by start and end times, where the times are either in seconds since the start of the Epoch, or a relative time, using now and times added to it.
count=<count>               Number of time intervals since start with length of time interval defined by resolution. If count=60 and resolution=1s, the time range would be 60 seconds in length.
resolution=[1s|1m|1h|auto]  Time resolution in seconds, minutes or hours; or if “auto”, one of {1s, 1m, 1h} is used based on the time difference.
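
The relationship between start, count, and resolution is simple arithmetic, sketched here in Python. This is a client-side illustration of the description in the table, not the server's implementation; the names `RESOLUTION_SECONDS` and `range_from_count` are hypothetical.

```python
from typing import Tuple

# Length in seconds of each fixed resolution listed in the table.
RESOLUTION_SECONDS = {"1s": 1, "1m": 60, "1h": 3600}


def range_from_count(start: int, count: int, resolution: str = "1s") -> Tuple[int, int]:
    """Return the (start, end) epoch seconds spanned by count intervals."""
    return start, start + count * RESOLUTION_SECONDS[resolution]


# count=60 at 1s resolution spans 60 seconds from the start time.
print(range_from_count(1385625600, 60, "1s"))
```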

With a specific time range, a resolution can be included to retrieve a series of data points for a metric. By default, 1 second resolution is used. Acceptable values are noted above. If resolution=auto, the resolution will be determined based on a time difference calculated between the start and end times:

  • (endTime - startTime) >= 3610 seconds, resolution will be 1 hour;
  • (endTime - startTime) >= 610 seconds, resolution will be 1 minute;
  • otherwise, resolution will be 1 second.
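
The rules for resolution=auto listed above translate directly into a small function. This Python sketch mirrors the documented thresholds on the client side; the function name `auto_resolution` is made up for this example.

```python
def auto_resolution(start_time: int, end_time: int) -> str:
    """Pick the resolution that resolution=auto would use, per the listed rules."""
    diff = end_time - start_time
    if diff >= 3610:
        return "1h"
    if diff >= 610:
        return "1m"
    return "1s"


# A range of exactly one hour (3600 seconds) falls below the 3610-second threshold.
print(auto_resolution(1385625600, 1385629200))
```

Note that, by these thresholds, a range of exactly one hour is reported at 1 minute resolution, not 1 hour.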

Time Range                                     Description
start=now-30s&end=now                          The last 30 seconds. The start time is given in seconds relative to the current time. You can apply simple math, using now for the current time, s for seconds, m for minutes, h for hours and d for days. For example: now-5d-12h is 5 days and 12 hours ago.
start=1385625600&end=1385629200                From Thu, 28 Nov 2013 08:00:00 GMT to Thu, 28 Nov 2013 09:00:00 GMT, both given as seconds since the start of the Epoch.
start=1385625600&count=3600&resolution=1s      The same as before, the count given as a number of time intervals, each 1 second.
start=1385625600&end=1385629200&resolution=1m  From Thu, 28 Nov 2013 08:00:00 GMT to Thu, 28 Nov 2013 09:00:00 GMT, with 1 minute resolution, will return 61 data points with metrics aggregated for each minute.
start=1385625600&end=1385632800&resolution=1h  From Thu, 28 Nov 2013 08:00:00 GMT to Thu, 28 Nov 2013 10:00:00 GMT, with 1 hour resolution, will return 3 data points with metrics aggregated for each hour.
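
To see what a relative time such as now-5d-12h evaluates to, the expression can be resolved on the client. The Python sketch below is an approximation of the syntax described in the table, not the server's parser: the function name `resolve_time` is hypothetical, and accepting `+` offsets in addition to `-` is an assumption based on the phrase "times added to it".

```python
import re
import time
from typing import Optional

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
_TERM = re.compile(r"([+-])(\d+)([smhd])")


def resolve_time(expr: str, now: Optional[int] = None) -> int:
    """Resolve an epoch-seconds string or a 'now' expression to epoch seconds."""
    if now is None:
        now = int(time.time())
    if not expr.startswith("now"):
        # Absolute times are plain seconds since the start of the Epoch.
        return int(expr)
    offsets = expr[3:]
    if not re.fullmatch(r"(?:[+-]\d+[smhd])*", offsets):
        raise ValueError(f"unsupported relative time: {expr!r}")
    result = now
    for sign, amount, unit in _TERM.findall(offsets):
        delta = int(amount) * _UNIT_SECONDS[unit]
        result += delta if sign == "+" else -delta
    return result


# 5 days and 12 hours before a fixed reference time.
print(resolve_time("now-5d-12h", now=1385625600))
```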

Example:

POST '<base-url>/metrics/query?context=namespace.default.app.CountRandom&
  metric=system.process.events.processed&start=now-1h&end=now&resolution=1m'

This will return the value of the metric system.process.events.processed for the last hour at one-minute intervals.

For aggregates, you cannot specify a time range. As an example, to return the total number of input objects processed since the Application CountRandom was deployed, assuming that CDAP has not been stopped or restarted:

POST '<base-url>/metrics/query?context=namespace.default.app.CountRandom
  &metric=system.process.events.processed&aggregate=true'

If a metric is a gauge type, the aggregate will return the latest value set for the metric. For example, this request will retrieve the completion percentage for the map-stage of the MapReduce PurchaseHistoryWorkflow_PurchaseHistoryBuilder (reformatted to fit):

POST '<base-url>/metrics/query?context=namespace.default.app.PurchaseHistory.mapreduce.
    PurchaseHistoryWorkflow_PurchaseHistoryBuilder&metric=system.process.completion&aggregate=true'

Querying by Run-ID

Each execution of a program (Flow, MapReduce, Spark, Service, Worker) has an associated run-ID that uniquely identifies that program’s run. We can query metrics for a program by its run-ID to retrieve the metrics for a particular run. Please see Run Records and Schedule for information on retrieving active and historical program runs.

When querying by run-ID, it is specified in the context after the program-id with the tag run:

...app.<app-id>.<program-type>.<program-id>.run.<run-id>
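
Placing the run tag correctly matters when the context continues below the program level, for example down to a Flowlet. The following Python sketch inserts the run-ID directly after the program pair; it assumes that tag values contain no dots and that the program tag names are those used in the context table (flow, mapreduce, spark, service, worker). The function name `with_run_id` is hypothetical.

```python
# Program tag names as they appear in the context table of this document.
PROGRAM_TAGS = {"flow", "mapreduce", "spark", "service", "worker"}


def with_run_id(context: str, run_id: str) -> str:
    """Insert run.<run-id> directly after the program's tag/value pair."""
    parts = context.split(".")
    if len(parts) % 2 != 0:
        raise ValueError(f"context must alternate tag names and values: {context!r}")
    out = []
    inserted = False
    for tag, value in zip(parts[0::2], parts[1::2]):
        out.extend([tag, value])
        if tag in PROGRAM_TAGS and not inserted:
            # The run tag follows the program-id, before any lower-level tags.
            out.extend(["run", run_id])
            inserted = True
    if not inserted:
        raise ValueError(f"context does not name a program: {context!r}")
    return ".".join(out)


print(with_run_id(
    "namespace.default.app.CountRandom.flow.CountRandom.flowlet.splitter",
    "bca50436-9650-448e-9ab1-f1d186eb2285"))
```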

Examples of using a run-ID (reformatted to fit):

POST '<base-url>/metrics/query?context=namespace.default.app.PurchaseHistory.flow.
    MyFlow.run.364-789-1636765&metric=system.process.completion'

POST '<base-url>/metrics/query?context=namespace.default.app.PurchaseHistory.mapreduce.
    PurchaseHistoryBuilder.run.453-454-447683&metric=system.process.completion'

POST '<base-url>/metrics/query?context=namespace.default.app.CountRandom.flow.CountRandom.run.
  bca50436-9650-448e-9ab1-f1d186eb2285.flowlet.splitter&metric=system.process.events.processed&aggregate=true'

The last example will return something similar to (where "time"=0 means aggregated total number):

{"startTime":0,"endTime":0,"series":[{"metricName":"system.process.events.processed",
 "grouping":{},"data":[{"time":0,"value":11188}]}]}