Monitors

OneOps lets you monitor numerous metrics about your components. Monitoring includes support for:

  • thresholds and notification via alerts
  • tracking of metrics over long time ranges
  • use of metrics as a compute heartbeat signal
  • extensive charting for visual and interactive inspection

Metrics can be collected for many aspects of your assembly's behavior in operation, such as:

  • memory usage
  • CPU utilization
  • network
  • processes
  • I/O metrics such as open files
  • process-specific aspects, e.g. JVM or database-specific metrics

Any component can be monitored, and most components include a number of monitors by default. Monitoring in OneOps scales to track thousands of metrics over long periods of time.

Under the covers, OneOps uses the industry-standard open source solution Nagios and the numerous checks supplied with it.

Configuration

Default Monitors

Default monitors are automatically created from the component definition and can be configured in the transition view of a component:

  1. Navigate to the desired assembly.
  2. Press Transition in the left hand navigation.
  3. Select the environment by clicking on the name in the list.
  4. Select the platform in the list on the right by clicking on the name - e.g. tomcat.
  5. Select the component in the list on the right by clicking on the name - e.g. compute.
  6. Go to the monitors tab.
  7. The monitors are listed on the left and can be edited individually with Edit.

Check out the demo video showing how to navigate to monitors.

Custom Monitors

Custom monitors allow you to define additional metric monitoring for any component. They can be configured in the design view of a component:

  1. Navigate to the desired assembly.
  2. Press Design in the left hand navigation.
  3. Select the platform in the list on the right by clicking on the name - e.g. tomcat.
  4. Select the component in the list on the right by clicking on the name - e.g. compute.
  5. Go to the monitors tab.
  6. Press Add to start creating a custom monitor.
  7. Alternatively, select an existing custom monitor and Edit it as desired.

Attributes

The following Global and Collection-related attributes can be configured for existing and new monitors:

Name: simple name for the monitor as visible in the list.
Description: descriptive text about the behavior of the monitor.
Command: Nagios command that defines the metric gathering.
Command Line Options Map: map of key value pairs that are passed to the command line invocation.
Command Line: specific command line invocation for the metrics gathering.
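
As an illustration of what a Nagios-style check behind a Command might produce, the following is a minimal sketch in Python. It follows the general Nagios plugin convention (exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, and performance data after the `|` character). The function name `format_result`, the label `open_files` and the warning and critical values are hypothetical; how OneOps maps the output to the metrics of a monitor is not covered here.

```python
from typing import Tuple

# Exit codes used by the Nagios plugin convention.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def format_result(label: str, value: float, warn: float, crit: float) -> Tuple[int, str]:
    """Return (exit_code, output_line) for a single measured value.

    The text after the pipe is Nagios performance data in the form
    label=value;warn;crit. All names here are illustrative assumptions.
    """
    if value >= crit:
        code = 2
    elif value >= warn:
        code = 1
    else:
        code = 0
    line = f"{STATUS_NAMES[code]} - {label} is {value} | {label}={value};{warn};{crit}"
    return code, line
```

A real check would print the output line and exit with the returned code, for example `format_result("open_files", 120, 100, 200)` yields exit code 1 and a line starting with `WARNING`.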

Each monitor can include one or more Metrics defined by:

Name: simple name for the metric.
Unit: the unit used for data points, used in charts and notifications.
Description: descriptive text about the metric.
DS Type: the data source type. GAUGE signals that the metric is recorded as measured each time. DERIVE, on the other hand, signals that the rate of change from the prior measurement is tracked.
Display: flag to signal if the metric should be displayed.
Display Group: string to allow grouping of metrics.
Sample Interval (in sec): Number of seconds between each metric measurement event.
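
The difference between the GAUGE and DERIVE data source types can be pictured with a short Python sketch. It assumes samples arrive as (timestamp in seconds, raw value) pairs and that DERIVE is computed as the change per second between consecutive samples. The function names `as_gauge` and `as_derive` are hypothetical and not part of OneOps.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, raw value)


def as_gauge(samples: List[Sample]) -> List[float]:
    """GAUGE: every measurement is kept as it was taken."""
    return [value for _, value in samples]


def as_derive(samples: List[Sample]) -> List[float]:
    """DERIVE: rate of change per second relative to the prior measurement."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append((v1 - v0) / (t1 - t0))
    return rates
```

A memory usage reading is a typical GAUGE candidate, while an ever-growing counter such as total bytes sent is usually more useful as DERIVE.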

In addition, aspects for alerting can be configured as documented in the following section.

Additional options are available in the Advanced Configuration:

Receive email notifications only on state change: enable this flag to send notifications only when the monitor state changes.

URL to a page having resolution or escalation details: This allows you to add a URL to an external website or other resource that provides further information for the user receiving notifications from this monitor. The URL is added to all notifications.

Alerting with Thresholds and Heartbeats

Heartbeats

The Heartbeat flag and Heartbeat Duration, configured in the Alerting section, allow a metric to be used as a critical metric that signals the health of the component itself.

If the data collection fails for a metric with the heartbeat flag enabled and the heartbeat duration has passed, an unhealthy event is generated. Ideally at least one metric per component is flagged as a heartbeat metric. Heartbeat metrics are automatically collected every minute from all components.

Heartbeat Duration: defines the wait time (in minutes) before marking a component instance as unhealthy due to a missing heartbeat.

The unhealthy event caused by a missing heartbeat leads to the execution of a repair action on the instances marked as unhealthy. The automatic healing of instances using Auto Repair enables the recovery of component instances back to a healthy state.
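
The heartbeat rule described in this section can be sketched as a simple time comparison. This is an illustration only, assuming timestamps in seconds and a Heartbeat Duration in minutes; the function name `is_heartbeat_missing` is hypothetical.

```python
def is_heartbeat_missing(last_sample_ts: float, now_ts: float,
                         heartbeat_duration_min: int) -> bool:
    """Return True once no heartbeat sample has arrived for longer than the duration.

    last_sample_ts and now_ts are unix timestamps in seconds;
    heartbeat_duration_min is the configured Heartbeat Duration in minutes.
    """
    return (now_ts - last_sample_ts) > heartbeat_duration_min * 60
```

With a Heartbeat Duration of three minutes, an instance whose last sample is 181 seconds old would be marked unhealthy in this sketch, while one whose last sample is 180 seconds old would not.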

Threshold

A Threshold uses a metric and a set of conditions to change the state of a component. These changes can trigger events that result in notifications, automatic scaling or automatic repair.

Components include a predefined set of default thresholds that are used implicitly with any environment deployment. Users can add new threshold definitions that are suitable for their operation or edit existing thresholds.

Thresholds are visible as part of the monitor configuration.

The following attributes characterize a threshold:

Name: Name the threshold so that it is easy to understand what happened. For example: HighThreadUse implies thread count going too high. This name is seen as part of the alert message and should be intuitive enough to understand what happened when the threshold was crossed.

State: Defines the state of the instance when the threshold is crossed. Depending on the state of instance, certain actions are performed implicitly to recover the component back to good health. The user can select a value to define the expected state of the threshold.

The following states are available:

Notify-Only: Use this state when no automated action is expected. When the trigger condition is met, the state of the instances is flipped to notify and an event is triggered. The event can be seen on the environment operation view.

Unhealthy: When a threshold is defined with an unhealthy state, the instances meeting the trigger condition require a repair action to fix their state, and the repair action associated with the component is executed. The automatic healing of instances using Auto Repair helps to recover instances back to a good state.

Over-utilized: Use this state to define a threshold where the load is not sustainable and the component requires additional capacity. Auto Scale is used to add more capacity until the maximum limit of the scaling configuration is reached.

Under-utilized: This state signifies that the component instance is not being used to its capacity and can be removed. Auto Scale is used to remove capacity until the minimum limit of the scaling configuration is reached.
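
How the Over-utilized and Under-utilized states interact with the scaling limits can be sketched as follows. The state strings, the function name `next_instance_count` and the step size of one instance are assumptions made for the illustration; the actual scaling behavior is determined by the scaling configuration of the platform.

```python
def next_instance_count(state: str, current: int, minimum: int, maximum: int,
                        step: int = 1) -> int:
    """Return the instance count after reacting to a threshold state.

    Capacity is added for over-utilized and removed for under-utilized,
    but never beyond the configured minimum and maximum limits.
    """
    if state == "over-utilized":
        return min(current + step, maximum)
    if state == "under-utilized":
        return max(current - step, minimum)
    # notify-only and unhealthy do not change capacity in this sketch
    return current
```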

Further threshold configuration attributes are:

Bucket: Time interval used for each metric collection.

Stat: Determines how the values in a bucket are aggregated into a single value. Available values include average, min, max and count, among others.

Metric: The metric to use for the threshold.

Trigger and Reset determine when an event is raised and subsequently removed. Each is configured with an Operator and a Value that together form an expression. The Duration defines the time window during which the collected metric value is evaluated. Occurrences defines the number of repetitions needed to trigger the event.

Cool-off: The time after which a repeated threshold crossing raises another event. Before that time repeated violations do not raise additional events.
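
The interplay of Bucket, Stat, Trigger, Occurrences and Cool-off can be illustrated with a small Python sketch. It assumes that the values of each bucket are aggregated with the chosen stat, that the trigger fires when the expression holds for at least Occurrences buckets within the Duration window, and that Cool-off is given in minutes. The exact evaluation semantics in OneOps may differ, and all function names are hypothetical.

```python
import operator
from statistics import mean
from typing import Callable, Dict, List, Optional

OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le,
}
STATS: Dict[str, Callable[[List[float]], float]] = {
    "average": mean, "min": min, "max": max, "count": len,
}


def bucket_value(samples: List[float], stat: str) -> float:
    """Aggregate the samples of one bucket with the selected stat."""
    return STATS[stat](samples)


def is_triggered(bucket_values: List[float], op: str, value: float,
                 occurrences: int) -> bool:
    """bucket_values holds the aggregated buckets inside the Duration window."""
    hits = sum(1 for v in bucket_values if OPERATORS[op](v, value))
    return hits >= occurrences


def may_raise_event(last_event_ts: Optional[float], now_ts: float,
                    cool_off_min: int) -> bool:
    """Suppress repeated events until the cool-off time has passed."""
    return last_event_ts is None or now_ts - last_event_ts >= cool_off_min * 60
```

For example, a hypothetical HighThreadUse threshold with the trigger `> 90` and two occurrences would fire for the bucket values 95, 70 and 96, but not for 95, 70 and 80.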

An alert is generated for any state trigger. If you are watching the assembly, you can expect a notification about the event. The events can be viewed in the operation view.

Usage in Operation

The actual usage of monitors occurs in the operation view for each individual component:

  1. Navigate to the desired assembly.
  2. Press Operation in the left hand navigation.
  3. Select the environment by clicking on the name in the list.
  4. Select the platform in the list on the right by clicking on the name - e.g. tomcat.
  5. Select the component in the list on the right by clicking on the name - e.g. compute.
  6. Go to the monitors tab.
  7. The monitors are listed on the left.
  8. Click on an individual monitor name to view a chart visualizing the monitor data.

Check out the demo video showing how to navigate to monitors.

The list of monitors shows the names of the monitors and additional icons that highlight heartbeat monitors and defined thresholds. You can also mark them as a favorite.

The header includes a filter for the monitors, select/deselect all buttons, a sort feature, as well as the Actions button. If you select a few monitors in the list with the checkboxes beside the names, you can use the Compound charts action to merge all metrics from the selected monitors into one chart. The Stack charts action displays all selected charts above each other.

Threshold and heartbeat configuration for the monitor is displayed below the chart.

Charts

Chart inspections can be used to visually analyze your component behavior over time.

Enjoy our demo video showcasing usage of charts.

A number of features are available in the chart display:

Time range control: The top left corner contains a control with buttons to select the time range displayed by the whole chart: one hour, six hours, one day, one week, one month or one year.

Time navigation: The top right corner contains a control to navigate the chart time data by the size of the range. << navigates a full period back, < half a period back, > half a period forward, >> a full period forward. Now jumps to the current date and time.

Read value: Moving the mouse pointer over the chart triggers a marker that displays the metric value at the current location in the chart.

Legend: The legend beneath the chart shows the different metric names for the monitor. Clicking on a metric causes the chart to display only that metric instead of all metrics.

Threshold display: Threshold levels are displayed as horizontal lines in the chart using a dotted line of the same color as the metric with the threshold. The legend includes a dot beside the metric name. The color of the dot reflects the state (blue for notify, red for unhealthy…). Hovering over the dot shows the threshold definition.

Zoom: You can select a rectangle on the chart to enlarge a specific x/y region of the chart. This can be repeated multiple times until you see the region of interest. Double-clicking zooms back out.

Standalone view: The button in the top right corner of the chart title displays the current chart in a new browser window without the rest of the user interface.
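
The behavior of the time navigation buttons can be expressed as shifts of the chart window by a full or half period. The following Python sketch is only an illustration of that arithmetic; the function name `navigate` and the representation of the window by its end time are assumptions.

```python
from datetime import datetime, timedelta


def navigate(window_end: datetime, period: timedelta, button: str,
             now: datetime) -> datetime:
    """Return the new end of the chart window after pressing a navigation button.

    period is the selected time range, for example timedelta(hours=6).
    """
    if button == "Now":
        return now
    shifts = {
        "<<": -period,      # a full period back
        "<": -period / 2,   # half a period back
        ">": period / 2,    # half a period forward
        ">>": period,       # a full period forward
    }
    return window_end + shifts[button]
```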

The data available in the chart depends on a few aspects:

  • Metrics actually collected and the operational time of the component, e.g. there won't be any old data for a compute that was just started today.
  • TTL policies for storing the data. One-minute buckets are used for one-hour and six-hour charts up to two days into the past. After that, metrics switch to five-minute buckets.
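
The bucket policy above can be summarized in a few lines of Python. The sketch assumes that the bucket size depends only on the age of the data, with the switch happening at exactly two days; the function name `bucket_size` is hypothetical.

```python
from datetime import timedelta


def bucket_size(sample_age: timedelta) -> timedelta:
    """Return the bucket size assumed to be available for data of the given age."""
    if sample_age <= timedelta(days=2):
        return timedelta(minutes=1)
    return timedelta(minutes=5)
```

In practice this means that zooming into a one-hour range from last week shows coarser data points than the same range from this morning.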

Examples

Open Files Monitor

The Open Files Monitor tracks the files opened by a process. It is included in a number of components and disabled by default. To monitor the files opened by a specific process, e.g. your application deployed as the artifact component, activate the monitor and enter the process name in the configuration.

App Version Monitor

The App Version monitor is a monitor of the tomcat component used to validate that the server is restarted after all artifacts are deployed. By default, the monitor is disabled.

You can enable it in the transition view of the component. The ValidateAppVersion action performs the same check as the monitor as an on-demand action.